knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(BiocStyle)
library(knitr)
The package r Rpackage("genePrioritization")
employs a
"guilt-by-association" approach to find potential disease-related
genes based on their coexpression with reference "seed genes"
known to be (directly or indirectly) associated to that disease
or phenotype. This is very useful to identify genes of which
little was previously known, which can then be further studied in a
laboratory.
The analysis starts from a gene expression matrix. Given a list of known disease-related genes, or "seed genes", and a list of potential disease-related genes, or "candidates", the steps are the following:
Higher ranked candidates have a higher coexpression with the set of disease-related genes and could be functionally linked to them, thus sharing a role in the disease.
r Rpackage("genePrioritization")
can use any gene expression matrix
as the input, as long as it has genes on the rows, samples on the
columns, and has numerical values. Rows and columns also need to be
named accordingly. The matrix can be imported from a file, from a RangedSummarizedExperiment
object, or any source. As an example,
we will perform a simple analysis on the known Bioconductor
package r Biocexptpkg("airway")
and use package r Biocannopkg("EnsDb.Hsapiens.v75")
to convert between gene IDs and
symbols.
First, we attach the packages and load the expression data.
library(genePrioritization) library(airway) library(EnsDb.Hsapiens.v75) data("airway") airway
We extract the count matrix. To make the analysis faster, we will use only part of the data: remove genes with expression level 0 across all samples and subset the first 12000 genes.
counts <- assay(airway) mask <- apply(counts, MARGIN=1, sum)>0 counts <- counts[mask,] counts <- counts[1:12000,] head(counts)
Now we select some seed genes for our analysis. We can start from the reference paper from the airway package:
Himes BE, Jiang X, et al., “RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells.” PLoS One. 2014 Jun 13;9(6):e99625. PMID: 24926665. GEO: GSE52778.
and select some of those those "previously related to steroid responsiveness and inflammation": DUSP1, FKB5, KLF15.
seedNames<- c("DUSP1","FKBP5","KLF15") seedconv <- select(EnsDb.Hsapiens.v75, keys=seedNames, keytype="SYMBOL", columns=c("GENEID")) seedconv seedIDs<-seedconv$GENEID
RankedGeneList
objectsThe analysis is performed by calling the function and requires only
one line of code. By not specifying candidates, we are implicitly
asking to use all genes in the matrix that are not part of the seed
genes as candidates. We can then visualize the results, which are
stored in a RankedGeneList
object.
gp <- prioritizeCandidates(counts, seedIDs) gp
The package also provides a summary
method to visualize more
information about the result.
summary(gp)
There are two ways to interact with this object. You can access the
slot directly or you can use the accessor functions getResults
,
getSeed
, getCandidates
, getRanks
and getCoexpression.
The results
slot contains the scores for the candidates, sorted
according to increasing scores, such that the candidates that come
first have higher coexpression with the seed genes set. seed
and
candidates
contain gene names; ranks
and coexpression
are
matrices of the respective values for candidates with respect to
seed genes, with rows sorted according to the best candidates.
Let's extract the top 20 candidates and have a look at them.
top20 <- head(getResults(gp),20) top20
To have some more information, we build a simple dataframe containing gene symbols, gene IDs and scores.
top20conv <- select(EnsDb.Hsapiens.v75, keys= names(top20), keytype = "GENEID", columns = c("SYMBOL")) data.frame(cbind(top20conv,SCORE=as.numeric(top20)))
Some of these genes also appear as upregulated in the quoted study, such as KLF9, NEXN, SPARCL1, CACNB2, CRISPLD2. We choose CRISPLD2 and examine its ranks in the coexpression list for each seed gene.
crispld2_score <- getResults(gp)["ENSG00000103196"] crispld2_score allranks <- getRanks(gp) allranks["ENSG00000103196",]
We can do the same for coexpression.
allcoexpr <- getCoexpression(gp) allcoexpr["ENSG00000103196",]
Seed and candidates used in the analysis can also be accessed.
getSeed(gp) head(getCandidates(gp))
From the summary we see that some genes are strongly anti-correlated
with the seed genes. We might want to consider downregulation as a
relevant relation between a seed and a candidate. To do so, we can
repeat the analysis setting the antiCorrelation
paramater to TRUE.
gp_anti <- prioritizeCandidates(counts, seedIDs, antiCorrelation=TRUE) gp_anti
summary(gp_anti)
top20.2 <- head(getResults(gp),20) top20.2conv <- select(EnsDb.Hsapiens.v75, keys= names(top20.2), keytype = "GENEID", columns = c("SYMBOL")) data.frame(cbind(top20.2conv,SCORE=as.numeric(top20.2)))
The analysis can also be performed step by step with the functions findCoexpression
(step 1), rankGenes
(step 2), candidateScoring
(steps 3-5). The former two will return the full coexpression / rank
matrix while the latter returns the same named vector one would get
in the results
slot of a RankedGeneList
object. A constructor for
the class is available, so such an object can be built if so desired.
coexpr <- findCoexpression(counts, seedIDs) head(coexpr) ranks <- rankGenes(coexpr) head(ranks) scores <- candidateScoring(ranks) head(scores)
myrgl <- RankedGeneList(results = scores, seed = seedIDs, candidates = names(scores), coexpr = coexpr[names(scores),, drop=FALSE], ranks = ranks[names(scores),, drop=FALSE]) summary(myrgl)
RangedSummarizedExperiment
As seen above, you can extract the counts assay from a RSE object to
run the analysis. You can also store the RankedGeneList
object in
the metadata
slot.
metadata(airway) <- append(metadata(airway), gp) metadata(airway)
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.