suppressPackageStartupMessages({
  library(TxRegInfra)
  library(GenomicFiles)
  library(TFutils)
})
TxRegQuery addresses exploration of transcriptional regulatory networks by integrating data on eQTL, digital genomic footprinting (DGF), DNase I hypersensitivity (DHS), and transcription factor binding sites (TFBS). Owing to the volume of emerging tissue-specific data, special data modalities are used.
txregnet database
We have a long-running server that will respond to queries.
We focus on the CRAN package mongolite as the interface.
suppressPackageStartupMessages({
  library(TxRegInfra)
  library(mongolite)
  library(Gviz)
  library(EnsDb.Hsapiens.v75)
  library(BiocParallel)
  register(SerialParam())
})
con1 = mongo(url=URL_txregInAWS(), db="txregnet")
con1
We will write methods that work with the 'fields' of this object.
The mongolite API offers little explicit reflection (introspection of a connection object). The following is improvised and may be fragile:
parent.env(con1)$orig
If the mongo utility is available as a system command, we can get a list of collections in the database as follows.
if (verifyHasMongoCmd()) { head(c1 <- listAllCollections(url=URL_txregInAWS(), db="txregnet")) }
Otherwise, as long as the CRAN package mongolite is installed and we know the collection names of interest, we can use them as noted throughout this vignette.
We can get a record from a given collection:
mongo(url=URL_txregInAWS(), db="txregnet", collection="Adipose_Subcutaneous_allpairs_v7_eQTL")$find(limit=1)
Queries can be composed using JSON. We have a tool to generate queries that employ the mongodb aggregation method. Here we demonstrate this by computing, for each chromosome, the count and minimum values of the footprint statistic on CD14 cells.
m1 = mongo(url = URL_txregInAWS(), db = "txregnet", collection="CD14_DS17215_hg19_FP")
newagg = makeAggregator( by="chr", vbl="stat", op="$min", opname="min")
The JSON layout of this aggregating query is
[ { "$group": { "_id": ["$chr"], "count": { "$sum": [1] }, "min": { "$min": ["$stat"] } } } ]
Invocation returns a data frame:
head(m1$aggregate(newagg))
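To make the shape of such an aggregation query concrete, here is a minimal sketch that assembles a one-stage `$group` pipeline as a JSON string using only base R. The helper `build_group_pipeline` is hypothetical and is not part of TxRegInfra; it writes scalar values where the layout shown above wraps them in one-element arrays, so it is not guaranteed to return identical results to `makeAggregator`.

```r
# Hypothetical helper (not part of TxRegInfra): build a single-stage
# $group pipeline as a JSON string with base R only.
build_group_pipeline = function(by, vbl, op = "$min", opname = "min") {
  sprintf(
    '[{"$group": {"_id": "$%s", "count": {"$sum": 1}, "%s": {"%s": "$%s"}}}]',
    by, opname, op, vbl)
}

# Group by chromosome; report the record count and the minimum of 'stat'
pipe = build_group_pipeline(by = "chr", vbl = "stat")
cat(pipe, "\n")

# With a live connection such as m1 above, the string could be passed
# to mongolite's aggregate method (not run here):
# head(m1$aggregate(pipe))
```

Writing the pipeline by hand like this is mainly useful for checking what a generated query asks the server to do.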
We need to bind together the sample metadata and the information about the MongoDB connection.
The following turns a very ad hoc filtering of the collection names into a DataFrame.
# cd = makeColData() # works when mongo does
cd = TxRegInfra::basicColData
head(cd,2)
rme0 = RaggedMongoExpt(con1, colData=cd)
rme1 = rme0[, which(cd$type=="FP")]
A key method in development is subsetting the archive by genomic coordinates.
si = GenomeInfoDb::Seqinfo(genome="hg19")["chr17"] # to fix query genome
myg = GRanges("chr17", IRanges(38.07e6,38.09e6), seqinfo=si)
s1 = sbov(rme1, myg, simplify=FALSE)
s1
dim(sa <- sparseAssay(s1, 3)) # compact gives segfault
sa[953:956,c("fLung_DS14724_hg19_FP", "fMuscle_arm_DS17765_hg19_FP")]
ormm = txmodels("ORMDL3", plot=FALSE, name="ORMDL3")
sar = strsplit(rownames(sa), ":|-")
an = as.numeric
gr = GRanges(seqnames(ormm)[1], IRanges(an(sapply(sar,"[", 2)), an(sapply(sar,"[", 3))))
gr1 = gr
gr1$score = 1-sa[,1]
gr2 = gr
gr2$score = 1-sa[,2]
sc1 = DataTrack(gr1, name="Lung FP")
sc2 = DataTrack(gr2, name="Musc/Arm FP")
plotTracks(list(GenomeAxisTrack(), sc1, sc2, ormm), showId=TRUE)
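The plotting code above recovers coordinates from the row names of the sparse assay by splitting on `:` and `-`. The sketch below isolates that step on toy strings. It assumes the row names follow a "chrom:start-end" pattern, which is inferred from the split pattern rather than stated in the package documentation, and the toy values are invented for illustration.

```r
# Toy row names in the assumed "chrom:start-end" form (invented values)
rn = c("chr17:38070001-38070025", "chr17:38071200-38071240")

# Split each string into chromosome, start and end
parts = strsplit(rn, ":|-")
coords = data.frame(
  chrom = vapply(parts, `[`, character(1), 1),
  start = as.numeric(vapply(parts, `[`, character(1), 2)),
  end   = as.numeric(vapply(parts, `[`, character(1), 3)),
  stringsAsFactors = FALSE)
coords
```

A sequence name that itself contains a hyphen or colon would defeat this simple split, so the approach suits names such as `chr17` but should be checked before use on other assemblies.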
sbov
We begin with three 'single-concept' assays with relevance to lung genomics: the v7 GTEx lung eQTL data, an ENCODE DNase I narrowPeak report on lung fibroblasts, and a digital genomic footprint report for fetal lung.
lname_eqtl = "Lung_allpairs_v7_eQTL"
lname_dhs = "ENCFF001SSA_hg19_HS" # see dnmeta, fibroblast of lung
lname_fp = "fLung_DS14724_hg19_FP"
si17 = GenomeInfoDb::Seqinfo(genome="hg19")["chr17"]
si17n = si17
GenomeInfoDb::seqlevelsStyle(si17n) = "NCBI"
s1 = sbov(rme0[,lname_eqtl], GRanges("17", IRanges(38.06e6, 38.15e6), seqinfo=si17n))
s2 = sbov(rme0[,lname_dhs], GRanges("chr17", IRanges(38.06e6, 38.15e6), seqinfo=si17))
s3 = sbov(rme0[,lname_fp], GRanges("chr17", IRanges(38.06e6, 38.15e6), seqinfo=si17))
Now we have annotated GRanges for each assay. The eQTL data in part are:
names(mcols(s1))
head(s1[, c("gene_id", "variant_id", "maf", "pval_nominal")])
The names of genes and variants used here are cumbersome -- symbols and rsids are preferable.
addsyms = function(x, EnsDb=EnsDb.Hsapiens.v75::EnsDb.Hsapiens.v75) {
  ensids = gsub("\\..*", "", x$gene_id) # remove post period
  gns = genes(EnsDb)
  x$symbol = gns[ensids]$symbol
  x
}
s1 = addsyms(s1)
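The two steps inside `addsyms` (strip the version suffix, then look up a symbol) can be tried on toy data without loading an annotation package. In this sketch the identifiers, the symbols and the `lookup` table are all invented placeholders standing in for `genes(EnsDb)`. It uses `match()` so that an identifier missing from the table yields `NA`, whereas indexing by name, as in `addsyms`, can fail when an identifier is absent from the annotation.

```r
# Invented, versioned gene identifiers (placeholders, not real Ensembl IDs)
gene_id = c("ENSGTOY0001.5", "ENSGTOY0002.12", "ENSGTOY0009.1")

# Remove everything from the first period onward
ensids = gsub("\\..*", "", gene_id)

# Toy lookup table standing in for the annotation resource
lookup = c(ENSGTOY0001 = "GENE_A", ENSGTOY0002 = "GENE_B")

# match() returns NA for identifiers that are not in the table
symbol = unname(lookup[match(ensids, names(lookup))])
symbol
```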
Note that it is possible to retrieve rsids for the SNPs by address. But this is a slow operation involving a huge SNPlocs package that we do not want to work with directly for this vignette.
> snpsByOverlaps(SNPlocs.Hsapiens.dbSNP144.GRCh37, s1b)
UnstitchedGPos object with 265 positions and 2 metadata columns:
        seqnames       pos strand |   RefSNP_id alleles_as_ambig
           <Rle> <integer>  <Rle> | <character>      <character>
    [1]       17  38061054      * |  rs36049276                R
    [2]       17  38061439      * |   rs4795399                Y
    [3]       17  38062196      * |   rs2305480                R
    [4]       17  38062217      * |   rs2305479                Y
    [5]       17  38062503      * |  rs35104165                Y
    ...      ...       ...    ... .         ...              ...
  [261]       17  38149258      * |  rs58212353                K
  [262]       17  38149350      * |   rs8073254                V
  [263]       17  38149411      * |  rs34648856                R
  [264]       17  38149724      * |   rs3785549                Y
  [265]       17  38149727      * |   rs3785550                H
  -------
  seqinfo: 25 sequences (1 circular) from GRCh37.p13 genome
The object s1 computed above is available as demo_eQTL_granges. We convert it to a graph via
library(graph)
g1 = sbov_to_graphNEL(demo_eQTL_granges)
g1
Nodes are SNPs and genes; an edge is present when the resource (in this case the GTEx lung study) declares an association (in this case, an FDR for SNP-gene association not exceeding 0.10). The Bioconductor graph package includes functions for creating incidence matrices from graphs, and vice versa.
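As a rough illustration of the SNP-by-gene structure just described, the following base R sketch filters a toy association table at the FDR threshold and tabulates the surviving pairs into an incidence matrix. The table, its column names (`snp`, `gene`, `fdr`) and its values are invented; this is not how `sbov_to_graphNEL` is implemented, only a picture of the kind of object involved.

```r
# Invented SNP-gene association table with an FDR per pair
assoc = data.frame(
  snp  = c("snp1", "snp1", "snp2", "snp3"),
  gene = c("geneA", "geneB", "geneB", "geneA"),
  fdr  = c(0.01, 0.20, 0.05, 0.10))

# Keep pairs whose FDR does not exceed 0.10
kept = assoc[assoc$fdr <= 0.10, ]

# Rows are SNPs, columns are genes; a 1 marks a retained association
inc = unclass(table(kept$snp, kept$gene))
inc
```

Each nonzero cell corresponds to one edge between a SNP node and a gene node in the bipartite graph.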
Given the GRanges representations for sbov results, we can use overlap computations to conveniently identify relationships between eQTL SNPs, genes, and hypersensitivity or footprint regions. We use sbov_output_HS as a persistent instance of s2 computed above.
seqlevelsStyle(demo_eQTL_granges) = "UCSC"
fo1 = findOverlaps(demo_eQTL_granges, sbov_output_HS)
fo1
eq_by_hs = split(demo_eQTL_granges[queryHits(fo1)], subjectHits(fo1))
eq_by_hs
This shows that there are two DHS sites that overlap with SNPs showing eQTL associations with various genes.
For the footprint data, we have:
fo2 = findOverlaps(demo_eQTL_granges, sbov_output_FP)
fo2
eq_by_fp = split(demo_eQTL_granges[queryHits(fo2)], subjectHits(fo2))
eq_by_fp
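The overlap-then-split pattern used above for the DHS and footprint regions can be tried on a few toy ranges without contacting the server. This sketch assumes the GenomicRanges package is installed; the positions and the `gene` labels are invented for illustration.

```r
library(GenomicRanges)

# Invented single-base "SNP" positions, each tagged with a toy gene label
snps = GRanges("chr17", IRanges(start = c(100, 250, 260, 900), width = 1),
               gene = c("A", "B", "C", "D"))

# Two invented regulatory regions
sites = GRanges("chr17", IRanges(start = c(90, 240), end = c(110, 300)))

# Find which SNPs fall in which region, then group the SNPs by region index
fo = findOverlaps(snps, sites)
by_site = split(snps[queryHits(fo)], subjectHits(fo))
lengths(by_site)
```

The result is a list with one element per region that has at least one overlapping SNP; the SNP at position 900 overlaps nothing and so does not appear.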
We have a small number of cloud-resident FIMO search results, available through the Bioconductor TFutils package.
library(TFutils)
data(demo_fimo_granges)
seqlevelsStyle(demo_eQTL_granges) = "UCSC"
lapply(demo_fimo_granges, lapply, function(x) subsetByOverlaps(demo_eQTL_granges, x))