In vjcitn/edx_adv_bioc: course development, edx advanced bioconductor

suppressMessages({
suppressPackageStartupMessages({
library(AnnotationHub)
library(rtracklayer)
library(DT)
library(tibble)
library(BiocStyle)
library(lubridate)
library(ensembldb)
library(EnsDb.Hsapiens.v79)
library(GenomicFiles)
library(ontoProc)
})
})

Introduction

The Encyclopedia of DNA Elements (ENCODE) is a collection of experimental results and computing resources devoted to elaboration of mechanisms of gene regulation. The project main page is a good overview of its scope, involving four main organisms (human, mouse, worm and fly) and dozens of biotechnological innovations that are used to analyze gene structure and regulation. We will illustrate ways of using Bioconductor to acquire and interpret ENCODE resources.

Main aims for learners using this document:

You will develop an understanding of the scope of the ENCODE project, particularly in terms of the variety of assays and experimental resources.
You will be able to explore Bioconductor metadata about the project, surveying organism and cell identities, assay types, specific design objectives and assay outputs.
You will become acquainted with the use of formal ontology to organize information about relationships between cell types and anatomic regions, as employed in ENCODE metadata.
You will learn how to use the r Biocpkg("GenomicFiles") package to manage information about terabytes of data resident in public cloud storage. Specifically the reduceByFile method operates on GenomicFiles instances to make targeted selections from indexed cloud-resident files, avoiding tedious downloading of large files.
You will see how to "drill through" complex metadata to expose key aspects of informally annotated CRISPR interference experiments.
You will be provided with code that permits comparison of different cell types with respect to TF binding patterns.
With the programming framework provided in this narrative and exercises, you will be able to shift focus between different experimental aims and assay types to pose and answer questions about gene regulation of interest to you.

This document is part 1 of three parts -- the second part addresses ATAC-seq, and the third addresses RNA-seq preceded by CRISPR-based interference.

Built-in surveys through ENCODExplorer metadata

We can use AnnotationHub to get information on ENCODE results that have been curated for direct use with Bioconductor. The r Biocpkg("ENCODExplorer") package provides a metadata table, that we retrieve using r Biocpkg("AnnotationHub"). We use the "full" table because it provides links to raw and processed data files curated in the cloud by the ENCODE project.

Using a local metadata cache

We'll begin with a query to AnnotationHub to learn about resources whose metadata mentions "ENCODExplorerData".

library(AnnotationHub)
ah = AnnotationHub()
AnnotationHub::query(ah, "ENCODExplorerData")

The recent 'Full' metadata table is over 90MB in size, so it takes a few moments to download from cloud (or load from cache, in case you have retrieved it previously).

fm = ah[["AH75132"]] # implicitly retrieves if needed, or loads
dim(fm)
table(fm$organism)

We are working with r nrow(fm) records on r length(table(fm$organism)) different organisms.

Filtering the metadata table

We focus on human samples, and limit to ChIP-seq experiments for now.

# filter
hfm = fm[which(fm$organism == "Homo sapiens"),]
tail(sort(table(hfm$assay)))
hfmt = hfm[which(hfm$assay == "TF ChIP-seq"),]

The proteins for which binding patterns can be assessed are in the antibody_target field:

tail(sort(table(hfmt$antibody_target)))

Managing the information with a GenomicFiles instance

We'll manage the ChIP-seq data references with a r Biocpkg("GenomicFiles") object. For convenience, we'll focus on bigWig files.

hfmtbw = hfmt[hfmt$file_type == "bigWig",]
dim(hfmtbw)
table(hfmtbw$output_type)

library(GenomicFiles)
gf1 = GenomicFiles(files=hfmtbw$cloud_metadata.url)
colData(gf1) = as(as.data.frame(hfmtbw), "DataFrame")
gf1

The gf1 object reports only the "basename" component of the names of files managed. The actual filename (URL) for the first file is

GenomicFiles::files(gf1)[1]

A virtues of the GenomicFiles discipline is that we keep together the file identifiers and metadata about them. The following information is crucial to downstream use of the files.

table(gf1$assembly)

This table is very important, indicating that we are managing information for which two genomic coordinate systems are in play. In the introduction to Bioconductor, it is shown how liftOver can be used to translate between coordinate systems if needed.

We'll focus on data collected with GRCh38 coordinates.

gf1 = gf1[, which(gf1$assembly == "GRCh38")]
gf1
gf = GenomicFiles(files=hfmtbw$cloud_metadata.url)
colData(gf) = as(as.data.frame(hfmtbw), "DataFrame")

Anatomic sources of samples

We'll define a helper function to report on discrete properties of samples. This gives us clues on the contents of listings of biosample type and associated ontology labels.

tls = function(x) t(t(head(sort(table(x), decreasing=TRUE))))
tls(gf1$biosample_name)
tls(gf1$biosample_ontology)

The use of ontology labeling helps us to organize information on samples. Bioconductor's r Biocpkg("ontoProc") package can be used to develop a simple hierarchical display. We focus on the Experimental Factor Ontology (EFO) tags, and take a very small subset of tags in use to obtain a tractable display in this document.

library(ontoProc)
ee = getEFOOnto()
tail(sort(table(hfmt$biosample_ontology)),12) -> ioi
nn = names(ioi[grep("EFO", names(ioi))])
tails = gsub(".*_", "", nn)
tailss = gsub("/", "", tails)
tt = paste0("EFO:", tailss)
onto_plot2(ee, tt, cex=.6)

It is noteworthy that in this display HEK293 is not linked to "Homo sapiens cell line". This can be confirmed by visiting the ontology lookup service. It is not uncommon to find gaps in curated data resources. Whether there is a justification for this particular omission is unclear.

Targeted sketching of peak scores

We'll focus on the transcription factor CREB1. There are a modest number of samples in different cell lines. Some are treated with ethanol.

gf1_creb1 = gf1[, which(gf1$target == "CREB1")]
table(gf1_creb1$treatment, gf1_creb1$biosample_name, exclude=NULL)

We'll focus on samples with type signal p-value.

table(gf1_creb1$output_type, gf1_creb1$biosample_name, exclude=NULL)
gf1_creb1_use = gf1_creb1[, 
   which(gf1_creb1$output_type == "signal p-value")]

We will import a megabase of bigWig content from the files selected in this way, using reduceByFile:

myr = GRanges("chr17", IRanges(66e6,67e6))
genome(myr) = "GRCh38"
rowRanges(gf1_creb1_use) = myr
if (.Platform$OS.type == "windows") {
  sels = edxAdvBioc::creb1_sels
  } else sels = reduceByFile(gf1_creb1_use, MAP=function(range, file, ...) {
    sel = BigWigSelection(range)
    import.bw(file, selection=sel, genome=genome(myr)[1])
})

Here is our targeted sketch:

lk1 = sels[[1]][[1]]
lk2 = sels[[2]][[1]]
plot(start(lk1)+.5*(width(lk1)), lk1$score, pch=19,
 xlab="midpoint of scored interval, chr17", ylab="-log10 CREB1 signal p-value",
 cex=.5, col=lava::Col("black", alpha=.3))
points(start(lk2)+.5*width(lk2), lk2$score, pch=19, 
 col=lava::Col("blue", alpha=.3), cex=.5)

Summary

Review the following concepts before proceeding to exercises.

We created the data.table instance fm, which is a creation of the developers of the ENCODExplorer package. We retrieved this table using r Biocpkg("AnnotationHub"). The number of rows of fm is the number of metadata records related to experiments performed in ENCODE. The number of columns of fm is the number of metadata fields descriptive of the experiments.
We filtered the metadata to hfmtbw to limit attention to bigWig files collected on human samples. The metadata field cloud_metadata.url gives the URL of each such file.
The metadata fields biosample_name and biosample_ontology give information (less and more formal, respectively) about anatomic or cell-culture origins of samples. Ontological relationships among sample types can be useful for organizing investigations and the r Biocpkg("ontoProc") onto_plot2 function was used to sketch relationships among terms used to annotate samples.
We extracted information from small numbers of bigWig files using BigWigSelection and import.bw from r Biocpkg("rtracklayer"), applied to cloud-resident files through the reduceByFile method of r Biocpkg("GenomicFiles"). The resulting scored GRanges instances were visualized using standard R plotting.

Exercises

Basic descriptive statistics

gf1 represents data on r ncol(gf1) ChIP-seq experiments for organism 'Homo sapiens' in bigWig format. Use the cloud_metadata.file_size colData component to state a) the median file size, and b) the total number of gigabytes of cloud-resident data to which gf1 mediates access.

median(as.numeric(gf1$cloud_metadata.file_size))
sum(as.numeric(gf1$cloud_metadata.file_size))/1e9

What is the most common ChIP-seq target protein studied in the r ncol(gf1) experiments?

tls(gf1$target)

What is the most commonly studied cell type in the r ncol(gf1) experiments?

tls(gf1$biosample_name)

Use the fm metadata table. For the application of the ATAC-seq assay to human samples, what is the most common biosample_name?

 as_tibble(fm) %>% dplyr::filter(assay == "ATAC-seq" & organism=="Homo sapiens") %>% 
    dplyr::select(biosample_name) %>% dplyr::group_by(biosample_name) %>%
    dplyr::summarise(n=dplyr::n()) %>% dplyr::arrange(desc(n))

Chronology and scope of experimental agenda. We can use the date_created field to document submission of assays of diverse types. We'll use the hfm table for this.

newt = as_tibble(hfm) %>% 
  dplyr::select(assay, date_created) %>% 
  dplyr::mutate(ldate=lubridate::as_date(date_created)) 
par(las=2, mar=c(12,4,2,2))
ameds = sapply(split(newt$ldate, newt$assay), median)
with(newt, boxplot(split(ldate, assay)[order(ameds)]))

How many assays of type long read RNA-seq are present?

sum(hfm$assay == "long read RNA-seq", na.rm=TRUE)

RNA-seq preceded by CRISPR interference. We use the dataset_description to find out about genes knocked down in CRISPRi RNA-seq experiments. The language in that field is repetitious, and we massage its values a bit, using gsub and mutate, to get the name of the gene that was putatively knocked down.

ii = (hfm %>% dplyr::filter(assay == "CRISPRi RNA-seq" & file_format=="tsv") %>% 
  dplyr::mutate(targ = gsub(
    "RNA-seq on K562 cells treated by CRISPR interference targeting (..*).", 
     "\\1", dataset_description)) %>%
  dplyr::select(investigated_as, targ) %>% 
  dplyr::group_by(investigated_as, targ) %>% dplyr::summarise(n=dplyr::n()))
ii

For how many different transcription factors, interfered with by CRISPR prior to RNA-seq, have tsv files been prepared?

unique((ii %>% dplyr::filter(investigated_as == "transcription factor"))$targ)

Retention of metadata; plotting

The following function, annotated in roxygen format, gathers the task of filtering bigWig data of a specific output type and target in a given genomic interval.

#' import bigwig in an interval, given GenomicFiles for ENCODE bigWig files
#' @param gf GenomicFiles instance
#' @param target character(1) ChIP-seq target
#' @param biosample_names character() vector of cell types to retain
#' @param output_type character(1) defaults to "signal p-value"
#' @param selection GRanges instance, one range allowed
#' @export
 import_enc_bw = function(gf, target="CREB1", 
   biosample_names = c("HepG2", "MCF-7", "A549"), output_type="signal p-value",
   selection= GRanges("chr17", IRanges(66e6,67e6), genome="GRCh38")) {
  stopifnot(length(selection)==1)
  gf_use = gf[, which(gf$target == target)]
  gf_use = gf_use[, 
     which(gf_use$output_type == "signal p-value")]
  gf_use = gf_use[, 
     which(gf_use$biosample_name %in% biosample_names)]
  rowRanges(gf_use) = selection
  ans = reduceByFile(gf_use, MAP=function(range, file, ...) {
      sel = BigWigSelection(range)
      import.bw(file, selection=sel, genome=genome(myr)[1])
  })
  ans = GRangesList(unlist(ans, recursive=FALSE))
  mcols(ans) = colData(gf_use)
  names(ans) = make.unique(mcols(ans)$biosample_name)
  ans
}

We can generate our sketch data with a pair of simple calls, assuming the construction of gf1 is accomplished as above.

First, we generate the requested data on the available biosamples.

if (.Platform$OS.type == "windows") {
  skd = edxAdvBioc::skd
  } else skd = import_enc_bw(gf1)

A virtue of this approach is that the cell types (and all metadata) are bound to the scores:

table(mcols(skd)$biosample_name)

To complete the task, we define a plotting function. We'll call it to compare HepG2 and MCF-7 patterns in the selected interval.

#' plot the result of a pair of import_enc_bw runs
#' @param impeb output of import_enc_bw
#' @param ylim limits of y axis
#' @param xlim limits of x axis
#' @param logy logical(1) whether or not to use log="y" in plot call
#' @param leg_frac_x numeric(1) fudge factor to move legend to the right of natural
#' position at xlim[1]
#' @export
plot_pair = function(impeb, ylim=c(1,500), xlim=c(66.1e6,66.3e6),
    logy=TRUE, leg_frac_x=.015) {
  stopifnot(length(impeb)==2)
  lk1 = impeb[[1]]
  lk2 = impeb[[2]]
  if (logy) {
   lk1 = lk1[which(lk1$score>0)]
   lk2 = lk2[which(lk2$score>0)]
   }
  seqn = as.character(seqnames(impeb[[1]]))[1]
  g = grep("chr", seqn)
  if (length(g)==0) seqn=paste0("chr", seqn)
  targ = mcols(impeb)$target[1]
  oty = mcols(impeb)$output_type[1]
  pspec = list(x=start(lk1)+.5*(width(lk1)), y=lk1$score, pch=19,
   xlab=sprintf("midpoint of scored interval, %s", seqn), ylab=sprintf("bigWig '%s' for %s", oty, targ),
   cex=.5, col=lava::Col("orange", alpha=.1), ylim=ylim, xlim=xlim, log=ifelse(logy, "y", ""))
  do.call(plot, pspec)
  points(start(lk2)+.5*width(lk2), lk2$score, pch=19, 
   col=lava::Col("blue", alpha=.1), cex=.5)
  legend(xlim[1]+leg_frac_x*diff(xlim), ylim[2], pch=19, col=lava::Col(c("orange", "blue"), alpha=.4),
    legend=mcols(impeb)$biosample_name)
}

In the following calls, we select distinct pairs of samples from different cell types for overlaid visualization within pairs.

plot_pair(skd[c("HepG2.2", "MCF-7")], ylim=c(1,1500), xlim=c(66e6, 67e6), leg_frac_x=.04)
plot_pair(skd[c("HepG2.1", "MCF-7.1")], ylim=c(1,1500), xlim=c(66e6, 67e6), leg_frac_x=.04)

What is the number of positions at which binding of CREB1 to HepG2 exhibits signal p-values consistently greater than those seen for MCF-7 in the two displays here? (0, 1, 4, 10)?
Use the following code to simplify the data on one HepG2 sample:

s5 = skd[["HepG2.2"]]
GenomeInfoDb::seqlevelsStyle(s5) = "NCBI"
seqlevels(s5) = "17"

Using genes(EnsDb.Hsapiens.v79), find the gene nearest the strongest peak for CREB1 in HepG2.

g79 = ensembldb::genes(EnsDb.Hsapiens.v79::EnsDb.Hsapiens.v79)
g79[ nearest(s5[which.max(s5$score)], g79) ]

A different TF target, assayed in a breast cancer cell line.

Run the following code.

if (.Platform$OS.type == "windows") {
  uu = edxAdvBioc::uu
} else uu = import_enc_bw(gf1, target="FOXA1", 
       selection=GRanges("chr6", IRanges(151.5e6,152.5e6)), 
       biosample_names=c("A549", "MCF-7"))
plot_pair(uu[c("A549", "MCF-7")], xlim=c(151.5e6,152.5e6))

What gene is nearest to the strongest FOXA1 binding location seen in the sample named MCF-7?

top = which.max(uu[["MCF-7"]]$score)
cand = uu[[7]][top]
seqlevelsStyle(cand) = "NCBI"
seqlevels(cand) = "6"
g79[nearest(cand, g79)]

vjcitn/edx_adv_bioc documentation built on March 17, 2020, 3:58 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

vjcitn/edx_adv_bioc
course development, edx advanced bioconductor

In vjcitn/edx_adv_bioc: course development, edx advanced bioconductor

Introduction

Built-in surveys through ENCODExplorer metadata

Using a local metadata cache

Filtering the metadata table

Managing the information with a GenomicFiles instance

Anatomic sources of samples

Targeted sketching of peak scores

Summary

Exercises

Basic descriptive statistics

Retention of metadata; plotting

A different TF target, assayed in a breast cancer cell line.

R Package Documentation

Browse R Packages

We want your feedback!

vjcitn/edx_adv_bioc course development, edx advanced bioconductor

In vjcitn/edx_adv_bioc: course development, edx advanced bioconductor

Introduction

Built-in surveys through ENCODExplorer metadata

Using a local metadata cache

Filtering the metadata table

Managing the information with a GenomicFiles instance

Anatomic sources of samples

Targeted sketching of peak scores

Summary

Exercises

Basic descriptive statistics

Retention of metadata; plotting

A different TF target, assayed in a breast cancer cell line.

R Package Documentation

Browse R Packages

We want your feedback!

vjcitn/edx_adv_bioc
course development, edx advanced bioconductor