geneAttribution
is an R package for identifying the most likely gene or genes
through which variation at a given genomic locus in the human genome acts. A
typical use case is the annotation of results from genome-wide association
studies (GWAS). The majority of variants identified by GWAS are located in
noncoding regions and likely act by affecting gene expression
(Maurano et al. 2012). Variants
typically contain multiple genes in the region of linkage disequilibrium and
identifying the causative one is challenging.
The most basic functionality of geneAttribution
assumes that the closer gene
is to the input locus, the more likely the gene is to be causative.
Additionally, any empirical data that links genomic regions to genes (e.g.
expression quantitative trait loci (eQTL) or genome conformation data) can be
used if it is supplied in the
UCSC .BED file format.
# Basic workflow, default parameters library(geneAttribution) geneLocs <- geneModels() # Define gene models geneAttribution("chr2", 127156000, geneLocs) # Get candidate gene probabilities
The most basic functionality assumes that the closer a gene is to the input locus, the more likely it is to be a candidate gene. The relationship between distance to locus and candidate gene likelihood is modeled as an exponential distribution. The likelihoods for each gene are then normalized by dividing by the sum of likelihoods. As a result, the presence of multiple genes in the vicinity of a locus decreases the posterior probabilities of the individual genes.
To calculate the distance between the input locus and genes, gene models are
required. For this, geneAttribution
provides the geneModels()
function.
geneModels()
takes a TxDb object containing genomic coordinates of genes as
input and returns gene models in GenomicRanges format, with gene names in the
symbol
column. Loading the gene models may take a few minutes. The default
for the TxDb input is TxDb.Hsapiens.UCSC.hg38.knownGene
, which contains gene
models for genome build GRCh38. An alternative input would be
TxDb.Hsapiens.UCSC.hg19.knownGene
, which contains gene models for build hg19.
The geneModels function has additional optional input:
maxGeneLength
. Gene models that are longer than this are excludedgenesToInclude
and genesToExclude
. A character vector of gene symbols of
genes to include (e.g. only protein coding genes) or excludeThe minimum required input for geneAttribution()
is a chromosome identifier
in the same format than in the gene models and a chromosomal position in the
same build than the gene models. The geneAttribution()
function has
additional optional input:
lambda
, the $\lambda$ parameter for the exponential distribution modeling
the candidate gene likelihood based on an exponential distribution. The
default, based on empirical eQTL data from the Genome-Tissue Expression
Project (GTEx Consortium 2015), is
7.61e-06. Decreasing lambda
gives genes that are close to the input locus
a higher probability and decreasing it gives genes that are further away a
higher probabilitymaxDist
, the maximum distance from the input locus at which a gene will be
considered. Genes that are more than maxDist
bases away from the input
locus will be ignored. The default is 1,000,000 basesminPP
, the minimum posterior probability at which a gene will be reported.
Genes whose posterior probability is less than minPP
will be summarized as
"Other". Set this to 0 to report all genes# Typical workflow, using sample data supplied with geneAttribution geneLocs <- geneModels() fileName1 <- system.file("extdata", "hiCRegions.b38.bed", package="geneAttribution") fileName2 <- system.file("extdata", "eqtlHaplotypeBlocks.b38.bed", package="geneAttribution") empirical <- loadBed(c(fileName1, fileName2), c(2, 5)) geneAttribution("chr2", 127156000, geneLocs, empirical) # As above, but with user-supplied empirical data file in UCSC .BED format geneLocs <- geneModels() empirical <- loadBed("INPUT_FILE.bed", weights=1.5) # INPUT_FILE.bed is a correctly formatted .BED file geneAttribution("chr2", 127156000, geneLocs, empirical)
In addition, geneAttribution
can make use of empirical data that links
genomic loci to genes. eQTLs, which link genetic variants to the expression of
specific genes, are an example of this. If the input locus is located inside
the regions defined in the empirical data, the likelihood of the associated
gene is multiplied by the associated weighting. User-provided empirical data
can be loaded by using the loadBed()
function.
The loadBed funtion reads user-provided files in UCSC .BED format. .BED files must be tab-separated and the columns must be in the following order: chromosome, start, end, gene symbol. An optional fifth score column may also be provided. Gene symbols used in the empirical data must match the symbols used for the gene models and the genome build (e.g. GRCh38) must also match the genome build of the gene models.
Together with the .BED files, weightings for the data may be provided. The default for this is 2, which doubles the likelihoods for genes if the input locus is located in a region defined by the empirical data. A weighting of 1 would not change the likelihood, and a weighting of less than 1 would decrease the likelihood. Alternatively to reading .BED files, users may construct empirical data themselves as a list of GenomicRanges objects with a score column containing the weights.
The extdata
directory provides two .BED files in genome build GRCh38:
hiCRegions.b38.bed
, which contains Capture Hi-C genome conformation data
linking promoters to other genomic regions in the GM12878 CD34 cell line
(Mifsud et al. 2015)eqtlHaplotypeBlocks.b38.bed
, which defines haplotype blocks with eQTLs in
at least two different tissues from the Genome-Tissue Expression Project
(GTEx Consortium 2015)Because both files are supplied as examples, they are limited to a 10 MB region (120,000,000-130,000,000) on chromosome 2.
The output of the geneAttribution
function is a named numerical vector of
candidate gene probabilities. In some instances, it may be useful to also know
the coordinates of the candidate genes, as this annotation can help with
further work with package results. It can easily be obtained by subsetting the
gene models object.
geneLocs <- geneModels() geneLocs <- geneModels() # Define gene models pp <- geneAttribution("chr2", 127156000, geneLocs, minPP=0) # Posterior prob. geneLocs[match(names(pp), geneLocs$symbol)] # Subset gene models
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.