knitr::opts_chunk$set(echo = TRUE, eval=TRUE)
Despite our fine-mapping efforts, there remained considerable uncertainty of causal variants in most loci. Even if the causal variants are known, assigning target genes can be difficult due to long-range regulation of enhancers.
Our gene mapping procedure prioritizes target genes:
(1) For every putative causal SNP, we assign a weight to each nearby gene, considering multiple ways the SNP may affect a gene. The weight of a gene can be viewed as the probability that the SNP affects that gene.
(2) For each gene, we then aggregate the causal evidence of all SNPs likely targeting this gene, expressed as the weighted sum of the PIPs of all these SNPs. To ensure that the causal evidence of a variant is not counted multiple times when it targets multiple genes, we normalize the SNP-to-gene weights in this calculation. The resulting “gene PIP” approximates the probability of a gene being causal. Similar to variant-level fine-mapping, we also define a “credible gene set”, the set of genes that capture the causal signal at a locus with high probability.
The weights of SNP-gene pairs reflect the strength of biological evidence linking SNPs to genes. For a SNP in an exon/promoter or in a regulatory region linked to a particular gene ("enhancer loops"), we assign a weight of 1 to that gene.
When a SNP cannot be linked to any gene in these ways, its target genes are assigned using a distance weighted function so that nearby genes receive higher weights.
See the Methods section in our paper for details.
knitr::include_graphics("../man/figures/gene.mapping.diagram.png")
In this tutorial, we show the gene mapping procedure using the fine-mapping result from our AFib study.
Our gene mapping procedure takes the following input data:
Load R packages
library(mapgen)
Here we use fine-mapping summary statistics from our AFib study.
Keep SNPs with PIP > 1e-5, rename columns, and save as a GRanges object.
finemapstats <- readRDS(system.file("extdata", "AF.finemapping.sumstats.rds", package = "mapgen")) finemapstats <- process_finemapping_sumstats(finemapstats, snp = 'snp', chr = 'chr', pos = 'pos', pip = 'susie_pip', pval = 'pval', zscore = 'zscore', cs = 'cs', locus = 'locus', pip.thresh = 1e-5) # For simplicity, only keep the following columns in the steps below cols.to.keep <- c('snp','chr','pos', 'pip', 'pval', 'zscore', 'cs', 'locus') finemapstats <- finemapstats[, cols.to.keep] head(finemapstats, 3)
To link SNPs to genes, we use a hierarchical approach to assign the following genomic annotation categories:
You can download gene annotations (GTF file) from GENCODE, then make gene annotations using the GTF file.
gtf_file <- '/project2/xinhe/shared_data/gencode/gencode.v19.annotation.gtf.gz' genomic.annots <- make_genomic_annots(gtf_file)
We included gene annotations (hg19) in the package, downloaded from GENCODE release 19.
genomic.annots <- readRDS(system.file("extdata", "genomic.annots.hg19.rds", package = "mapgen")) gene.annots <- genomic.annots$genes
OCRs <- readRDS(system.file("extdata", "OCRs.hg19.gr.rds", package = "mapgen"))
In this example, we include exons and active promoters in this category for gene mapping. We defined active promoters as promoters overlapping with OCRs (overlap at least 100 bp).
active_promoters <- IRanges::subsetByOverlaps(genomic.annots$promoters, OCRs, minoverlap = 100)
genomic.annots$exons_promoters <- list(exons = genomic.annots$exons, active_promoters = active_promoters)
If you don't have OCR data, you can simply use exons and promoters.
genomic.annots$exons_promoters <- list(exons = genomic.annots$exons, promoters = genomic.annots$promoters)
We need chromatin loop data, such as PC-HiC or ABC scores, with the following columns, and save as a GRanges object.
Promoter-capture HiC (PC-HiC)
Here we use promoter-capture HiC (PC-HiC) data from cardiomyocytes (CMs). You may skip this if you do not have relevant PC-HiC data.
pcHiC <- readRDS(system.file("extdata", "pcHiC.CM.gr.rds", package = "mapgen")) pcHiC <- pcHiC[pcHiC$gene_name %in% gene.annots$gene_name, ] # restrict to protein coding genes head(pcHiC, 3)
ABC scores
The process_ABC()
function processes a data frame of ABC scores
from Nasser et al. Nature 2021, and converts it to a GRanges object.
Here we use ABC scores from heart ventricle (from Nasser et al. Nature 2021). You may skip this if you do not have relevant ABC scores.
ABC <- data.table::fread(system.file("extdata", "heart_ventricle-ENCODE_ABC.tsv.gz", package = "mapgen")) ABC <- process_ABC(ABC, full.element = TRUE) ABC <- ABC[ABC$gene_name %in% gene.annots$gene_name, ] # restrict to protein coding genes head(ABC, 3)
Here we define enhancer regions by OCRs. You can also use histone mark data to define enhancer regions.
genomic.annots$enhancer_regions <- OCRs[OCRs$peakType!="Promoter",]
Considering the fact that PC-HiC and HiC loops may miss contacts between close regions due to technical reasons, here we also consider enhancer regions within 20 kb of active promoters as "nearby interactions".
nearby20kb <- nearby_interactions(genomic.annots$enhancer_regions, active_promoters, max.dist = 20000)
We include ABC, PC-HiC and nearby interactions in the "enhancer loop" category.
genomic.annots$enhancer_loops <- list(ABC = ABC, pcHiC = pcHiC, nearby20kb = nearby20kb)
summary(genomic.annots)
Parameters:
TRUE
, assign intronic SNPs to genes containing the introns.gene.mapping.res <- compute_gene_pip(finemapstats, genomic.annots, intron.mode = FALSE, d0 = 50000, exon.weight = 1, loop.weight = 1) head(gene.mapping.res)
table(gene.mapping.res$category)
Gene-level result table
gene.pip.res <- extract_gene_level_result(gene.mapping.res, gene.annots) head(gene.pip.res) cat(sprintf("%d genes with PIP >= 0.8", length(gene.pip.res$gene_name[gene.pip.res$gene_pip >= 0.8])))
80% credible gene sets
gene.cs <- gene_cs(gene.mapping.res, by.locus = TRUE, gene.cs.coverage = 0.8) head(gene.cs)
label genes with gene PIP > 0.8.
gene_manhattan_plot(gene.pip.res, gene.pip.thresh = 0.8)
Note: in some cases where a gene spans two nearby LD blocks and/or is linked to SNPs in multiple credible sets (L > 1), the gene PIP may exceed 1, which can be interpreted as the expected number of causal variants targeting the gene (see Methods of our paper for details).
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.