knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
The goal of GeneNeighborhood is to extract and analyze the orientation and proximity of upstream/downstream genes.
``` {r, eval=FALSE, echo=FALSE}
install.packages("GeneNeighborhood")
The development version of the GeneNeighborhood package can be installed from [GitHub](https://github.com/) with: ``` r # install.packages("devtools") devtools::install_github("pgpmartin/GeneNeighborhood")
Load the library:
library(GeneNeighborhood)
library(knitr) library(dplyr) library(reshape2) library(GenomicRanges)
To run examples, the package includes a GRanges named Genegr that contains 676 genes with random coordinates on a single chromosome Chr1:
Genegr
For each feature/gene, we extract information (orientation and distance, potential overlaps) about their upstream/downstream neighbors with:
GeneNeighbors <- getGeneNeighborhood(Genegr)
We define a random set of 100 genes:
set.seed(1234) randGenes <- sample(names(Genegr), 100)
We extract statistics about the orientation of their neighbors using:
## Neighbors Orientation Statistics: NOS <- analyzeNeighborsOrientation(randGenes, GeneNeighborhood = GeneNeighbors)
By default all genes are used as a universe and an enrichment test is performed.
By default, the function also analyzes the "other" orientation which may be hard to interpret. We can remove this orientation using:
NOS <- analyzeNeighborsOrientation(randGenes, GeneNeighborhood = GeneNeighbors, keepOther = FALSE)
We obtain the following table:
NOS %>% dplyr::mutate(p.value = signif(p.value, 2)) %>% knitr::kable(digits = c(1,1,1,2,1,2,300), format = "markdown", caption="100 random genes")
We can plot the corresponding percentages using:
plotNeighborsOrientation(NOS)
For a set of genes (here all genes), we can extract the distances to the upstream/downstream gene neighbors by:
alldist <- dist2Neighbors(GeneNeighborhood = GeneNeighbors, geneset = GeneNeighbors$GeneName, genesetName = "AllGenes")
We can use these distances to select genes with short upstream distances.
First we create a vector of upstream distances:
updist <- alldist$Distance[alldist$Side == "Upstream"] names(updist) <- alldist$GeneName[alldist$Side == "Upstream"]
Then we define a probability of selecting the gene that is inversely proportional to its upstream distance:
probs <- (max(updist) - updist) / sum(max(updist) - updist)
Then select 100 genes using these probabilities:
set.seed(1234) lessRandGenes <- sample(names(updist), 100, prob = probs)
We now have:
randGenes
) lessRandGenes
) We assemble these in a list:
MyGeneSets <- list("RandomGenes" = randGenes, "LessRandomGenes" = lessRandGenes, "AllGenes" = GeneNeighbors$GeneName)
For each gene in each gene set, we extract the distances to their neighbors:
Dist2GeneSets <- dist2Neighbors(GeneNeighbors, MyGeneSets)
Now we can obtain descriptive statistics on these distances:
MyDistStats <- distStats(Dist2GeneSets, confLevel = 0.95, nboot = 100, CItype="perc", ncores = 2)
The function also calculates bootstrap confidence intervals for the mean and median using the boot
package.
Note: To reduce computation time, we use only 100 boostrap samples to compute the confidence intervals of the mean and median but a value at least equal to the default 1e4 should be preferred. The function also supports parallel computing (not available on Windows) via the ncores
argument
We obtain the following table of statistics:
MyDistStats %>% knitr::kable(digits = 0, format = "markdown", caption="Descriptive statistics on intergenic distances")
We can perform statistical tests to evaluate if the distances observed in a gene set are significantly different from the distances observed in the universe. Because there are different ways to formulate this question, the distTests
function provides the p-values for different tests (see the vignette for details):
MyDistTests <- distTests(Dist2GeneSets, Universe = "AllGenes", MedianResample = TRUE, R = 1e3)
Table:
MyDistTests %>% dplyr::mutate(KS.pvalue = signif(KS.pvalue, 2), Wilcox.pvalue = signif(Wilcox.pvalue, 2), Indep.pvalue = signif(Indep.pvalue, 2), Median_resample.pvalue = signif(Median_resample.pvalue, 2)) %>% knitr::kable(digits = c(0,0,0,300,300,300,300), format = "markdown", caption="Test statistics on intergenic distances")
Now, we can plot the distribution of intergenic distances for these sets of genes:
plotDistanceDistrib(Dist2GeneSets, type = "jitterbox")
Another way to study the neighborhood of a set of genes is to produce an average profiles representing the coverage of annotations in regions surrounding this set of genes. Such representation provides information on both the orientation and the distance of the neighbors. It also allows to compare different groups of genes (e.g. upregulated genes, genes with a ChIP-seq peak or a specific transcription factor motif in their promoter, etc...).
First we extract the profiles of annotations around (+/-50bp) all our "mock" genes.
We use 3 bins to summarize the coverage on the gene body so genes of size <3bp are removed.
We also use the argument usePercent=TRUE so that the profiles only indicate the presence/absence (0/1) of an annotation at a position rather than the number of genes that cover this position.
usePercent = TRUE Prof <- annotationCoverageAroundFeatures(Genegr, sidedist = 50, usePercent = usePercent, nbins=3)
The object that is created by this function contains the following elements:
cat(paste(names(Prof), collapse="\n"))
They represent the strand-specific coverage of annotations on the gene body (Feature), in the region (-50bp to the start of the gene) located upstream of the genes (UpstreamBorder) and in the region (end of the gene to +50bp) located downstream of the genes (DownstreamBorder).
The Sense strand is the strand on which the focus gene (Feature) is annotated and the Antisense strand is the opposite strand.
Because we used usePercent=TRUE, the Feature_Sense profiles will only contain 1s, indicating the presence of annotation(s) all along the body of each focus gene (i.e. annotation of the focus gene itself). When usePercent=FALSE, values >1 can occur in these profile, when the focus gene overlaps with other genes annotated on the same strand.
Then, we assemble these different elements in order to produce a vector for each focus gene containing the upstream region (-50bp to the TSS), the gene body itself and the downstream region (TES to +50bp).
Prof <- assembleProfiles(Prof)
Now we define a group of genes that have neighbors, on the same strand, at a short distance.
#Get distances to the closest gene on the same strand: Dist2Nearest <- mcols(distanceToNearest(Genegr))$distance CloseTandemNeighbors <- names(Genegr)[Dist2Nearest<=8]
We assemble the different groups of genes that we have defined so far in a list:
GeneGroups <- list(All = names(Genegr), Random = randGenes, CloseNeighbors = CloseTandemNeighbors)
Then, for each group of genes, we calculate the average profile and its 95% confidence interval:
avgProf <- list() for (i in 1:length(GeneGroups)) { avgProf[[i]] <- list() avgProf[[i]]$sense <- getAvgProfileWithCI(Prof$Profiles_Sense, selFeatures = GeneGroups[[i]], pos = c(-50:0, 1:3, 0:50)) avgProf[[i]]$antisense <- getAvgProfileWithCI(Prof$Profiles_Antisense, selFeatures = GeneGroups[[i]], pos = c(-50:0, 1:3, 0:50)) } names(avgProf) <- names(GeneGroups)
Before plotting, we need to assemble these profiles in a single table and to provide the x-coordinates for these "metagene" profiles (in the interval [0,5] with the gene body occupying coordinates ]2,3[).
#Define the x-coordinates (the 3 sequences correspond to "Upstream", "GeneBody" and "Downstream") xcoord = c(seq(0, 2, length.out = 51), seq(2, 3, length.out = 5)[2:4], seq(3, 5, length.out = 51)) #Assemble the metagene profiles avgProf_df <- reshape2::melt(avgProf, measure.vars = "Profile", value.name = "Profile") %>% dplyr::rename("Strand" = "L2", "GeneSet" = "L1") %>% dplyr::mutate(Xcoord=rep(xcoord, 2*length(GeneGroups)))
Now we can plot these profiles using:
plotMetageneAnnotProfile(avgProf_df)
One limitation of these average profiles is that all the length of the genes/features are taken into account to produce the profiles. Sometimes, we are interested only in one part of this information. For example, we may want to know how many genes (or what fraction of genes) have a TSS, on the same strand, at less than 20bp from their end? The metagene profiles above can provide an approximation but not the true value because some features actually start and end within this interval. The TSSs are single-point features. We can produce a coverage of the TSS using:
#First obtain the coordinates of the TSS: tss <- GenomicRanges::promoters(Genegr, upstream = 0, downstream = 1) #Then the coverage of TSSs around (+/- 50bp) the genes tsscov <- annotationCoverageAroundFeatures(annot = tss, features = Genegr, sidedist = 50, usePercent = TRUE, nbins = 3)
But metagene plots as above will not work well because the data is too sparse.
Instead, once a TSS is found at, say 5bp downstream of a gene border, we would like to extend its "presence" value (1) to any distance larger than 5bp to indicate that this specific gene has a TSS at less than any distance larger than 5. We do this using:
extTSScov <- extendPointPresence(tsscov, sidedist = 50)
Now for different groups of genes we can obtain the cumulative percentage of genes that have a TSS at less than a given distance with:
CP <- getCumulPercentProfiles(extTSScov, genesets = GeneGroups)
And plot the results with:
plotCumulPercentProfile(CP)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.