The DeepBlue Epigenomic Data Server is an online application that allows researchers to access data from various epigenomic mapping consortia such as DEEP, BLUEPRINT, ENCODE, or ROADMAP. DeepBlue can be accessed through a web interface or programmatically via its API. The usage of the API is documented with examples, use cases, and a user manual. While the description of the API is language agnostic, the examples and use cases shown online focus on the Python language. The R package presented here enables access to the DeepBlue API directly within the R statistical environment and provides convenient functionality for triggering operations on the DeepBlue server as well as for data retrieval using R functions. In the following, we give a brief introduction to the package and subsequently show how Python examples from the online documentation can be reproduced with it.
A wealth of epigenomic data has been collected over the past decade by large epigenomic mapping consortia. Even though most of these data are publicly available, the task of identifying, downloading and processing data from various experiments is challenging. Recognizing that these tedious steps need to be tackled programmatically, we developed the DeepBlue epigenomic data server. Epigenome data from the different epigenome mapping consortia are accessible with standardized metadata. An experiment is the most important entity in DeepBlue and typically encompasses a single file (usually a bed or wig file) together with a set of mandatory metadata: name, genome assembly, epigenetic mark, biosource, sample, technique, and project. For the sake of organization, all metadata fields are part of controlled vocabularies, some of which are imported from ontologies (CL, EFO, and UBERON, to name a few). DeepBlue also contains annotations, i.e. auxiliary data that are helpful in epigenomic analysis, such as CpG Islands, promoter regions, and genes. DeepBlue provides different types of commands, such as listing and searching commands as well as commands for data retrieval. A typical work-flow for the latter is to select, filter, transform, and finally download the selected data. For a more thorough description of DeepBlue, we refer to the DeepBlue publication in the 2016 NAR web server issue. If you find DeepBlue useful and use it in your project, please consider citing this paper.
Important note: With the exception of data aggregation tasks, DeepBlue does not alter the imported data, i.e. it remains exactly as provided by the epigenome mapping consortia.
Installation of DeepBlueR and its companion packages can be performed using the Bioconductor install method in the BiocManager package:
install.packages("BiocManager") BiocManager::install("DeepBlueR")
The package name is ```DeepBlueR``` and it can be loaded via:
library(DeepBlueR)
You can test your installation and connectivity by saying hello to the DeepBlue server:
deepblue_info("me")
DeepBlue provides a comprehensive programmatic interface for finding, selecting, filtering, summarizing and downloading annotated genomic region sets. Downloaded region sets are stored using the GenomicRanges R package, which allows them to be further processed, visualized and analyzed with existing R packages such as LOLA or Gviz.
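As a small illustration of this interplay, the sketch below applies standard GenomicRanges operations to a downloaded result; it assumes that ```regions``` is a GRanges object obtained with ```deepblue_download_request_data```, as shown in the examples later in this vignette.

```r
library(GenomicRanges)

# 'regions' is assumed to be a GRanges object returned by
# deepblue_download_request_data() in one of the examples below
summary(width(regions))        # distribution of region lengths
table(seqnames(regions))       # number of regions per chromosome
regions[width(regions) > 1000] # standard GRanges subsetting
```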
A list of all commands available in DeepBlue is provided on its API page. The vast majority of these commands are also available through this R package and can be listed as follows:
help(package="DeepBlueR")
In the following, we list the most frequently used DeepBlue commands. The full list of commands is available here. Note that each command in the following two tables has the prefix 'deepblue_', e.g. deepblue_select_genes.
| Category        | Command               | Description                                                |
|-----------------|-----------------------|------------------------------------------------------------|
| Information     | info                  | Information about an entity                                |
| List and search | list_genomes          | List registered genomes                                    |
|                 | list_biosources       | List registered biosources                                 |
|                 | list_samples          | List registered samples                                    |
|                 | list_epigenetic_marks | List registered epigenetic marks                           |
|                 | list_experiments      | List available experiments                                 |
|                 | list_annotations      | List available annotations                                 |
|                 | search                | Perform a full-text search                                 |
| Selection       | select_regions        | Select regions from experiments                            |
|                 | select_experiments    | Select regions from experiments                            |
|                 | select_annotations    | Select regions from annotations                            |
|                 | select_genes          | Select genes as regions                                    |
|                 | select_expressions    | Select expression data                                     |
|                 | tiling_regions        | Generate tiling regions                                    |
|                 | input_regions         | Upload and use a small region set                          |
| Operation       | aggregate             | Aggregate and summarize regions                            |
|                 | filter_regions        | Filter regions by their attributes                         |
|                 | flank                 | Generate flanking regions                                  |
|                 | intersection          | Filter for intersecting regions                            |
|                 | overlap               | Filter for regions overlapping by at least a specific size |
|                 | merge_queries         | Merge two region sets                                      |
| Result          | count_regions         | Count selected regions                                     |
|                 | score_matrix          | Request a score matrix                                     |
|                 | get_regions           | Request the selected regions                               |
|                 | binning               | Bin results according to counts                            |
| Request         | get_request_data      | Obtain the requested data                                  |
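As a brief, hedged illustration of how the information, listing, and search commands fit together (the experiment ID 'e30000' is simply the ENCODE example used again further below):

```r
# List all genomes registered in DeepBlue
deepblue_list_genomes()

# Full-text search for experiments mentioning H3K27ac
hits = deepblue_search(keyword="H3K27ac", type="experiments")
head(hits)

# Inspect the metadata of a single entity by its ID
deepblue_info("e30000")
```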
In addition, this package provides a set of convenience functions not part of the DeepBlue API, such as:
| Category | Command                | Description                                         |
|----------|------------------------|-----------------------------------------------------|
| Request  | batch_export_results   | Download the results for a list of requests         |
|          | download_request_data  | Download and convert the requested data (blocking)  |
|          | export_meta_data       | Export metadata to a tab-delimited file             |
|          | export_tab             | Export any result as a tab-delimited file           |
|          | export_bed             | Export GenomicRanges results as a BED file          |
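Two further helpers, ```deepblue_extract_ids``` and ```deepblue_extract_names```, are used throughout this vignette to pull the ID and name columns out of listing results; a minimal sketch:

```r
# List BLUEPRINT experiments and extract their IDs and names
experiments = deepblue_list_experiments(project="BLUEPRINT Epigenome")

experiment_ids = deepblue_extract_ids(experiments)
experiment_names = deepblue_extract_names(experiments)

head(experiment_ids)
head(experiment_names)
```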
In the following we give a number of increasingly complex examples illustrating what DeepBlue can achieve in your epigenomic data analysis work-flow. We go beyond the online description of these examples by showing how the retrieved information can be further used in R.
One of the first tasks in DeepBlue is finding the data of interest. This can be achieved in two ways:

* the ```deepblue_search``` command
* the ```deepblue_list_{experiments, annotations, ...}``` commands

In this example, we use the command ```deepblue_search``` to find experiments that contain the keywords 'H3k27AC', 'blood', and 'peak' in their metadata. We put the keywords in single quotes to indicate that these terms must occur in the metadata.
```r
# We are selecting the experiments with terms 'H3k27AC', 'blood', and
# 'peak' in the metadata.
experiments_found = deepblue_search(
    keyword="'H3k27AC' 'blood' 'peak'", type="experiments")

custom_table = do.call("rbind", apply(experiments_found, 1, function(experiment){
    experiment_id = experiment[1]

    # Obtain the information about the experiment_id
    info = deepblue_info(experiment_id)

    # Print the experiment name, project, biosource, and epigenetic mark.
    with(info, {
        data.frame(name = name, project = project,
                   biosource = sample_info$biosource_name,
                   epigenetic_mark = epigenetic_mark)
    })
}))
head(custom_table)
```
We use the ```deepblue_list_experiments``` command to list all experiments that match the given metadata values, here H3K4me3 peak calls in (inflammatory) macrophages from the BLUEPRINT Epigenome project.
```r
experiments = deepblue_list_experiments(type="peaks",
    epigenetic_mark="H3K4me3",
    biosource=c("inflammatory macrophage", "macrophage"),
    project="BLUEPRINT Epigenome")
```
The extra-metadata is important because it contains information that is not stored in the mandatory metadata fields. We use the ```deepblue_info``` command to access an experiment's metadata and extra-metadata fields. The following example prints the ```file_url``` attribute that is contained in the data imported from the ENCODE project.
info = deepblue_info("e30000") print(info$extra_metadata$file_url)
We use the ```deepblue_select_experiments``` command to select all genomic regions from the two specified experiments. We then use the ```deepblue_count_regions``` command with the ```query_id``` value returned by the ```deepblue_select_experiments``` command.

The ```deepblue_count_regions``` command is executed asynchronously. This means that the user receives a ```request_id``` and should check the status of this request. In contrast to the command ```deepblue_get_request_data```, the DeepBlueR package-specific command ```deepblue_download_request_data``` waits for the processing to finish before downloading the data. Moreover, this command converts any regions to a GRanges object.
```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
query_id = deepblue_select_experiments(
    experiment_name=c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
                      "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"))

request_id = deepblue_count_regions(query_id=query_id)

requested_data = deepblue_download_request_data(request_id=request_id)
print(paste("The selected experiments have", requested_data, "regions."))
```
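Alternatively, the state of an asynchronous request can be inspected manually with ```deepblue_info```; the sketch below shows this kind of polling, which ```deepblue_download_request_data``` otherwise performs for you.

```r
# Check the current state of the request (e.g. "running" or "done")
deepblue_info(request_id)$state

# Once the state is "done", the result could also be fetched directly with
# deepblue_get_request_data(request_id=request_id)
```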
### Output with selected columns

We use the ```deepblue_select_experiments``` command to select genomic regions from the experiments that are in chromosome 1, position 0 to 50,000,000. We then use the ```deepblue_get_regions``` command with the ```query_id``` value returned by the ```deepblue_select_experiments``` command to request the regions with the selected columns. The selected meta-columns ```@NAME``` and ```@BIOSOURCE``` contain the experiment name and the experiment biosource, respectively.

The ```deepblue_get_regions``` command is executed asynchronously. This means that the user receives a ```request_id``` to be able to check the status of this request. In contrast to the command ```deepblue_get_request_data```, the DeepBlueR package-specific command ```deepblue_download_request_data``` waits for the processing to finish before downloading the data. Moreover, this command converts any regions to a GRanges object.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
query_id = deepblue_select_experiments(
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
                        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)

# Retrieve the experiments data (the @NAME meta-column is used to include the
# experiment name and @BIOSOURCE for the experiment's biosource)
request_id = deepblue_get_regions(query_id=query_id,
    output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
```
We use the ```deepblue_list_samples``` command to obtain all samples with the biosource 'myeloid cell' from the BLUEPRINT project. The ```deepblue_list_samples``` command returns a list of samples with their IDs and content. We extract the sample IDs from this list and use them in the ```deepblue_select_regions``` command to select genomic regions that are in chromosome 1, position 0 to 50,000.

Then, we use the ```deepblue_get_regions``` command with the parameter ```query_id``` returned by the ```deepblue_select_regions``` command and the columns ```@NAME```, ```@SAMPLE_ID```, and ```@BIOSOURCE``` representing the experiment name, the sample ID, and the experiment biosource.

The ```deepblue_get_regions``` command is executed asynchronously. This means that the user receives a ```request_id``` to be able to check the status of this request. In contrast to the command ```deepblue_get_request_data```, the DeepBlueR package-specific command ```deepblue_download_request_data``` waits for the processing to finish before downloading the data. Moreover, this command converts any regions to a GRanges object.
```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
samples = deepblue_list_samples(
    biosource="myeloid cell",
    extra_metadata = list("source" = "BLUEPRINT Epigenome"))
samples_ids = deepblue_extract_ids(samples)

query_id = deepblue_select_regions(genome="GRCh38", sample=samples_ids,
    chromosome="chr1", start=0, end=50000)

request_id = deepblue_get_regions(query_id=query_id,
    output_format="CHROMOSOME,START,END,@NAME,@SAMPLE_ID,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
head(regions, 1)
```
### Filter epigenomic data by region attributes

We use the ```deepblue_select_experiments``` command to select genomic regions from two specific experiments that are in chromosome 1, position 0 to 50,000,000. Then, we filter these for regions with ```SIGNAL_VALUE``` > 10 and ```PEAK``` > 1000. Next, we use the ```deepblue_get_regions``` command with the ```query_id``` returned by the ```deepblue_filter_regions``` command and the columns ```@NAME``` and ```@BIOSOURCE``` representing the experiment name and the experiment biosource.

The ```deepblue_get_regions``` command is executed asynchronously. This means that the user receives a ```request_id``` to be able to check the status of this request. In contrast to the command ```deepblue_get_request_data```, the DeepBlueR package-specific command ```deepblue_download_request_data``` waits for the processing to finish before downloading the data. Moreover, this command converts any regions to a GRanges object.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
query_id = deepblue_select_experiments(
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
                        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)

query_id_filter_signal = deepblue_filter_regions(
    query_id=query_id,
    field="SIGNAL_VALUE",
    operation=">",
    value="10",
    type="number")

query_id_filters = deepblue_filter_regions(
    query_id=query_id_filter_signal,
    field="PEAK",
    operation=">",
    value="1000",
    type="number")

request_id = deepblue_get_regions(query_id=query_id_filters,
    output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
```
We use the ```deepblue_select_experiments``` command to select genomic regions from two specific experiments that are in chromosome 1, position 0 to 50,000,000, and the ```deepblue_select_annotations``` command to select the promoter regions on chromosome 1. The command ```deepblue_intersection``` filters for all regions of the ```query_id``` that intersect with at least one region in ```promoters_id```. Then, we use the ```deepblue_get_regions``` command with the ```query_id``` returned by the ```deepblue_intersection``` command and the columns ```@NAME``` and ```@BIOSOURCE``` representing the experiment name and the experiment biosource.

The ```deepblue_get_regions``` command is executed asynchronously. This means that the user receives a ```request_id``` to be able to check the status of this request. In contrast to the command ```deepblue_get_request_data```, the DeepBlueR package-specific command ```deepblue_download_request_data``` waits for the processing to finish before downloading the data. Moreover, this command converts any regions to a GRanges object.
```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
query_id = deepblue_select_experiments(
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
                        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)

promoters_id = deepblue_select_annotations(annotation_name="promoters",
    genome="GRCh38", chromosome="chr1")

intersect_id = deepblue_intersection(
    query_data_id=query_id, query_filter_id=promoters_id)

request_id = deepblue_get_regions(
    query_id=intersect_id,
    output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
```
Since the regions are returned as a GRanges object, they can, for example, be visualized with the Gviz package:

```r
library(Gviz)

atrack <- AnnotationTrack(regions, name = "Intersecting regions",
                          group = regions$`@BIOSOURCE`,
                          genome="hg38")
gtrack <- GenomeAxisTrack()
itrack <- IdeogramTrack(genome = "hg38", chromosome = "chr1")

plotTracks(list(itrack, atrack, gtrack), groupAnnotation="group",
           fontsize=18, background.panel = "#FFFEDB",
           background.title = "darkblue")
```
The meta-column ```@LENGTH``` contains the length of each genomic region, and we filter for genomic regions where this value is smaller than 2,000. The meta-column ```@SEQUENCE``` includes the DNA sequence of each genomic region in the output.

The ```deepblue_get_regions``` command is executed asynchronously. This means that the user receives a ```request_id``` to be able to check the status of this request. In contrast to the command ```deepblue_get_request_data```, the DeepBlueR package-specific command ```deepblue_download_request_data``` waits for the processing to finish before downloading the data. Moreover, this command converts any regions to a GRanges object.
```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
query_id = deepblue_select_experiments(
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
                        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)

query_id_filter_signal = deepblue_filter_regions(query_id=query_id,
    field="SIGNAL_VALUE", operation=">", value="10", type="number")

query_id_filters = deepblue_filter_regions(query_id=query_id_filter_signal,
    field="PEAK", operation=">", value="1000", type="number")

query_id_filter_length = deepblue_filter_regions(query_id=query_id_filters,
    field="@LENGTH", operation="<", value="2000", type="number")

request_id = deepblue_get_regions(query_id=query_id_filter_length,
    output_format="CHROMOSOME,START,END,@NAME,@BIOSOURCE,@LENGTH,@SEQUENCE")
regions = deepblue_download_request_data(request_id=request_id)
head(regions, 1)
```
### DNA pattern matching operations

We use the ```deepblue_find_motif``` command to find all positions of a given pattern in the genome, for example all locations of 'TATAAA' in genome assembly GRCh38. We use the ```deepblue_select_experiments``` command to select genomic regions that are in chromosome 1, position 0 to 50,000,000 from the selected experiments. The command ```deepblue_intersection``` then selects all regions of the ```query_id``` that intersect with at least one ```tataa_regions``` region.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
tataa_regions = deepblue_find_motif(motif="TATAAA",
    genome="GRCh38", chromosomes="chr1")

query_id = deepblue_select_experiments(
    experiment_name= c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
                       "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)

overlapped = deepblue_intersection(query_data_id=query_id,
    query_filter_id=tataa_regions)

request_id = deepblue_get_regions(overlapped,
    "CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE,@LENGTH,@SEQUENCE,@PROJECT")
regions = deepblue_download_request_data(request_id=request_id)
head(regions, 3)
```
The meta-column ```@COUNT.MOTIF()``` allows for counting how many times a motif appears in each selected genomic region. For example, the following code returns the experiment regions with the DNA sequence length, the counts of ```C```, ```G```, ```CG```, and ```GC```, and the DNA sequence itself.
```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
experiment_data = deepblue_select_experiments(
    "DG-75_c01.ERX297417.H3K27ac.bwa.GRCh38.20150527.bed")

fmt = "CHROMOSOME,START,END,@LENGTH,@COUNT.MOTIF(C),@COUNT.MOTIF(G),@COUNT.MOTIF(CG),@COUNT.MOTIF(GC),@SEQUENCE"

request_id = deepblue_get_regions(experiment_data, fmt)
regions = deepblue_download_request_data(request_id=request_id)
head(regions, 3)
```
### Genes

We use the ```deepblue_select_genes``` command to select the gene `RP11-34P13` from GENCODE v23. The selected genes behave like regular genomic regions and can, for example, be filtered by their attributes. We use the ```@GENE_ATTRIBUTE``` meta-column to select the genomic regions that are annotated as lincRNAs.

```{R, echo=TRUE, eval=FALSE, warning=FALSE, message=FALSE}
q_genes = deepblue_select_genes(genes="RP11-34P13", gene_model="gencode v23")

q_filter = deepblue_filter_regions(query_id=q_genes,
    field="@GENE_ATTRIBUTE(gene_type)",
    operation="==", value="lincRNA", type="string")

request_id = deepblue_get_regions(q_filter,
    "CHROMOSOME,START,END,GTF_ATTRIBUTES")
regions = deepblue_download_request_data(request_id=request_id)
regions
```
The command ```deepblue_aggregate``` summarizes the ```query_id``` regions using the ```cpg_islands``` regions, defined by the corresponding annotation, as boundaries. The aggregated values can be accessed through the ```@AGG.*``` meta-columns.
```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
query_id = deepblue_select_experiments(
    experiment=c("GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig"),
    chromosome="chr1", start=0, end=50000000)

cpg_islands = deepblue_select_annotations(annotation_name="CpG Islands",
    genome="GRCh38", chromosome="chr1", start=0, end=50000000)

overlapped = deepblue_aggregate(data_id=query_id, ranges_id=cpg_islands,
    column="VALUE")

request_id = deepblue_get_regions(query_id=overlapped,
    output_format="CHROMOSOME,START,END,@AGG.MIN,@AGG.MAX,@AGG.MEAN,@AGG.VAR")
regions = deepblue_download_request_data(request_id=request_id)
```
### Gene expression

In the following example, we obtain the gene expression levels of three genes, namely NOX3, NOXA1, and NOX4, from all biosources related to the ```hematopoietic stem cell``` biosource in the BLUEPRINT project. With related we refer to the children of this biosource term in the ontologies used by DeepBlue.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
hsc_children = deepblue_get_biosource_children("hematopoietic stem cell")
hsc_children_name = deepblue_extract_names(hsc_children)

hsc_children_samples = deepblue_list_samples(
    biosource = hsc_children_name,
    extra_metadata = list(source="BLUEPRINT Epigenome"))
hsc_samples_ids = deepblue_extract_ids(hsc_children_samples)

# Note that BLUEPRINT uses Ensembl gene IDs
gene_exprs_query = deepblue_select_expressions(
    expression_type = "gene",
    sample_ids = hsc_samples_ids,
    identifiers = c("ENSG00000074771.3", "ENSG00000188747.7",
                    "ENSG00000086991.11"),
    gene_model = "gencode v22")

request_id = deepblue_get_regions(
    query_id = gene_exprs_query,
    output_format = "@GENE_NAME(gencode v22),CHROMOSOME,START,END,FPKM,@BIOSOURCE")
regions = deepblue_download_request_data(request_id = request_id)
regions
```
We use the ```deepblue_tiling_regions``` command to generate a set of consecutive genomic regions of size 100,000 bp covering chromosome 1 of the genome assembly GRCh38. The command ```deepblue_aggregate``` summarizes the ```query_id``` regions using the column ```VALUE``` and the tiling regions as boundaries.
```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
query_id = deepblue_select_experiments(
    experiment=c("GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig"),
    chromosome="chr1")

tiling_id = deepblue_tiling_regions(size=100000,
    genome="GRCh38", chromosome="chr1")

overlapped = deepblue_aggregate(data_id=query_id, ranges_id=tiling_id,
    column="VALUE")

request_id = deepblue_get_regions(query_id=overlapped,
    output_format="CHROMOSOME,START,END,@AGG.MEAN,@AGG.SD")
regions = deepblue_download_request_data(request_id=request_id)
regions
```
Such data can now be plotted using any of the common R plotting mechanisms and packages. An example is shown here:

```r
library(ggplot2)

plot_data <- as.data.frame(regions)
plot_data[,grepl("X.", colnames(plot_data))] <-
    apply(plot_data[,grepl("X.", colnames(plot_data))], 2, as.numeric)

AGG.plot <- ggplot(plot_data, aes(start)) +
    geom_ribbon(aes(ymin = X.AGG.MEAN - (X.AGG.SD / 2),
                    ymax = X.AGG.MEAN + (X.AGG.SD / 2)), fill = "grey70") +
    geom_line(aes(y = X.AGG.MEAN))

print(AGG.plot)
```
We use the ```deepblue_select_genes``` command to select a set of genes from the gene model GENCODE v19. The ```deepblue_flank``` command derives flanking regions from existing regions. First, we derive regions that start 2,500 bp before the initially selected regions and have a length of 2,000 bp. Next, we derive regions that start 1,500 bp after the initially selected regions and have a length of 500 bp. For each region, the DNA strand is taken into account.

The ```deepblue_merge_queries``` command merges the region sets defined by two query IDs. Here, we first merge the two flanking region sets we created and then merge the result with the initially selected genes.
```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
q_genes = deepblue_select_genes(
    genes= c("RNU6-1100P", "CICP7", "MRPL20", "ANKRD65",
             "HES2", "ACOT7", "HES3", "ICMT"),
    gene_model="gencode v19")

before_flank_id = deepblue_flank(query_id=q_genes,
    start=-2500, length=2000, use_strand=TRUE)

after_flank_id = deepblue_flank(query_id=q_genes,
    start=1500, length=500, use_strand=TRUE)

flank_merge_id = deepblue_merge_queries(
    query_a_id=before_flank_id, query_b_id=after_flank_id)

all_merge_id = deepblue_merge_queries(
    query_a_id=q_genes, query_b_id=flank_merge_id)

request_id = deepblue_get_regions(query_id=all_merge_id,
    output_format="CHROMOSOME,START,END,STRAND,@LENGTH")
regions = deepblue_download_request_data(request_id=request_id)
regions
```
### Calculated columns

Here, we summarize the DNA methylation levels for the CpG islands of a specific experiment. Next, we remove those CpG islands for which no values were found using ```@AGG.COUNT``` > 0. We use the ```@CALCULATED``` meta-column to transform the ```@AGG.MEAN``` value to log scale.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
# Selecting the data of one experiment:
# GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig
# As we already know the experiment name, we keep all other fields empty.
# We are selecting all regions of chromosome 1.
query_id = deepblue_select_experiments(
    experiment="GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
    chromosome="chr1")

# Select the CpG Islands annotation from GRCh38
cpg_islands = deepblue_select_annotations(
    annotation="CpG Islands",
    genome="GRCh38",
    chromosome="chr1")

# Aggregate the methylation values over the CpG islands
overlapped = deepblue_aggregate(
    data_id=query_id,
    ranges_id=cpg_islands,
    column="VALUE")

# Select the aggregated regions that aggregated at least one region from the
# selected experiment (@AGG.COUNT > 0)
filtered = deepblue_filter_regions(query_id=overlapped,
    field="@AGG.COUNT", operation=">", value="0", type="number")

# We remove all regions where the aggregation mean is zero
filtered_zeros = deepblue_filter_regions(query_id=filtered,
    field="@AGG.MEAN", operation="!=", value="0.0", type="number")

# Retrieve the aggregated regions, including the log-transformed mean
request_id = deepblue_get_regions(query_id=filtered_zeros,
    output_format="CHROMOSOME,START,END,@CALCULATED(return math.log(value_of('@AGG.MEAN'))),@AGG.MEAN,@AGG.COUNT")
regions = deepblue_download_request_data(request_id=request_id)

# We have to perform a manual conversion because the
# package cannot know the type of calculated columns
regions$`@CALCULATED(return math.log(value_of('@AGG.MEAN')))` =
    as.numeric(regions$`@CALCULATED(return math.log(value_of('@AGG.MEAN')))`)

head(regions, 5)
```
Any numerical values returned by DeepBlue can also be conveniently displayed using, for example, the DataTrack feature of the Gviz Bioconductor package as shown here:
```r
library(Gviz)

atrack <- AnnotationTrack(regions, name = "CpGs",
                          group = regions$`@BIOSOURCE`,
                          genome="hg38")
gtrack <- GenomeAxisTrack()
itrack <- IdeogramTrack(genome = "hg38", chromosome = "chr1")
dtrack <- DataTrack(regions, data="@AGG.MEAN",
                    name = "Log of average methylation value")

plotTracks(list(itrack, atrack, dtrack, gtrack), type="histogram",
           fontsize=18, background.panel = "#FFFEDB",
           background.title = "darkblue")
```
Here, we select a small number of experiments for which we want to build a score matrix based on the column ```VALUE```. We use CpG islands as the aggregation boundaries. The ```deepblue_score_matrix``` command expects a named list with the experiment names and the columns that will be used for aggregation, the regions used as boundaries, and the operation that will be performed (min, max, mean, var, sd, median, count).

The ```deepblue_score_matrix``` command is executed asynchronously. The command ```deepblue_download_request_data``` will return a matrix in which the first three columns correspond to the chromosome, start position, and end position. The remaining columns carry the names of the experiments and hold the corresponding aggregated values.
```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
experiments = c("GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
                "C003N351.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
                "C005VG51.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
                "S002R551.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
                "NBC_NC11_41.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
                "bmPCs-V156.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
                "S00BS451.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
                "S00D1DA1.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
                "S00D39A1.CPG_methylation_calls.bs_call.GRCh38.20160531.wig")

# Use the VALUE column of every experiment
experiments_columns = list()
for (experiment_name in experiments) {
    experiments_columns[[experiment_name]] = "VALUE"
}

cpgs = deepblue_select_annotations(
    annotation_name="Cpg Islands",
    chromosome="chr22", start=0, end=18000000,
    genome="GRCh38")

request_id = deepblue_score_matrix(
    experiments_columns=experiments_columns,
    aggregation_function="mean",
    aggregation_regions_id=cpgs)

score_matrix = deepblue_download_request_data(request_id=request_id)
head(score_matrix, 5)
```
The resulting score matrix can be visualized, for example, as a heatmap with ggplot2:

```r
library(ggplot2)

score_matrix_plot = tidyr::gather(score_matrix, "experiment", "methylation",
                                  -CHROMOSOME, -START, -END)
score_matrix_plot$START <- as.factor(score_matrix_plot$START)

ggplot(score_matrix_plot, aes(x=START, y=experiment, fill=methylation)) +
    geom_tile() +
    theme(axis.text.x=element_text(angle=-90))
```
DeepBlueR allows you to conveniently save results to disk. Any result can be saved as a tab-delimited file using ```deepblue_export_tab```. For example, we can save the score matrix generated in the above example:
deepblue_export_tab(score_matrix, file.name = "my_score_matrix")
Results obtained with ```deepblue_get_regions``` are of type GenomicRanges and can be exported as tab-delimited files preserving all columns, or as BED files, where a specific column can optionally be selected to populate the 'score' column of the BED file. To demonstrate this, we use the result from the tiling regions example further above:
```r
request_id = deepblue_get_regions(query_id=overlapped,
    output_format="CHROMOSOME,START,END,@AGG.MEAN,@AGG.SD")
regions = deepblue_download_request_data(request_id=request_id)

deepblue_export_bed(regions, file.name = "my_tiling_regions",
                    score.field = "@AGG.MEAN")
```
Furthermore, metadata associated with any ID can be stored locally using the ```deepblue_export_meta_data``` command. To this end, we first obtain the experiment ID of the file we used in the tiling regions example.
```r
exp_id <- deepblue_name_to_id(
    "GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
    collection = "experiments")$id

deepblue_export_meta_data(exp_id, file.name = "GC_T14")
```
This command can also handle lists of ids, for instance:
deepblue_export_meta_data(list("e30035", "e30036"), file.name = "test_export")
In some cases, users will perform a series of requests. We provide the command ```deepblue_batch_export_results``` to save these results and their associated metadata to disk in one go. This method will save each file as it becomes available, i.e. it will be saved once the request has been successfully processed by DeepBlue:
experiments = deepblue_list_experiments(type="peaks", epigenetic_mark="H3K4me3", biosource=c("inflammatory macrophage", "macrophage"), project="BLUEPRINT Epigenome") experiment_names = deepblue_extract_names(experiments) request_ids = foreach(experiment = experiment_names) %do%{ query_id = deepblue_select_experiments(experiment_name = experiment, chromosome = "chr21") request_id = deepblue_get_regions(query_id =query_id, output_format = "CHROMOSOME,START,END") } request_data = deepblue_batch_export_results(request_ids, target.directory = "BLUEPRINT macrophages chr21")
DeepBlueR comes with default options that can be changed by the user. To list the current options use the following command:
deepblue_options()
* ```url```: This is the URL of the DeepBlue application server and should not be changed.
* ```user_key```: This option can be replaced by the personal key of the user after successful registration at http://deepblue.mpi-inf.mpg.de. The key can be found by logging into the web application and clicking on the user name in the top left corner. Registered users have access to advanced features of DeepBlue, e.g. they can review previous requests.
* ```do_not_cache```: Allows users to switch off the caching functionality of DeepBlueR. See [Caching].
* ```force_download```: If the user wishes to overwrite cached results for the following requests, this option can be switched on. See [Caching].
* ```debug```: Switching on this option enables verbose output that is only useful for debugging.

Changing an option works as follows:
deepblue_options(do_not_cache = TRUE)
Another example (replace 'my_user_key' with the actual key):
deepblue_options(user_key = "my_user_key")
In case you wish to restore the default options, simply call
deepblue_reset_options()
DeepBlueR by default creates a file 'DeepBlueR.cache' in the current working directory. Downloaded results / regions are stored there and can be instantly retrieved, which is particularly useful for users with limited network bandwidth. However, in case caching is not desired, it can be switched off (see [Options]).
To check the status of the cache you can use the following command:
deepblue_cache_status()
This will report the cache size and the number of requests currently stored. Alternatively, users can list the request ids for which results are available:
deepblue_list_cached_requests()
Over time, the cache can quickly grow in size. It is possible to remove individual requests from the cache if the request id is known:
deepblue_delete_request_from_cache("r123")
In most cases, it will be simpler to delete the entire cache:
deepblue_clear_cache()
DeepBlue has a memory limit on individual requests. As a consequence, some operations may not be executed successfully by the DeepBlue web server. To avoid this, large requests should be split by chromosome. This has another advantage: if each chromosome is a separate request, processing is parallelized on DeepBlue and finishes in a fraction of the time (depending on the queuing status) compared to the same operation without splitting. Here is an example for obtaining a score matrix for each chromosome individually:
```r
library(foreach)

chromosomes_mm10 <- deepblue_extract_ids(deepblue_chromosomes(genome = "mm10"))

request_ids <- foreach(chromosome = chromosomes_mm10, .combine = c) %do% {
    tiling_regions = deepblue_tiling_regions(
        size=100000,
        genome="mm10",
        chromosome=chromosome)

    deepblue_score_matrix(
        experiments_columns = list(ENCFF721EKA="VALUE", ENCFF781VVH="VALUE"),
        aggregation_function = "mean",
        aggregation_regions_id = tiling_regions)
}
```
Now each chromosome is processed individually and efficiently by DeepBlue. We can use the ```deepblue_batch_export_results``` function to download the individual matrices:
list_of_score_matrices <- deepblue_batch_export_results(request_ids)
Next, we can simply merge them to obtain the final matrix:
```r
library(data.table)

genome_wide_score_matrix <- data.table::rbindlist(list_of_score_matrices,
                                                  use.names = TRUE)
genome_wide_score_matrix
```
Here, we will show how DeepBlueR can be used to generate an overview heatmap of variable positions in more than 200 BLUEPRINT DNA methylation experiments. The amount of data considered here would normally be too large to process in a local R installation. However, using DeepBlue and server-side processing of the data, we can carry out this large-scale analysis with ease.
In the first step, we load the DeepBlueR package, as well as packages for data retrieval, matrix operations and plotting.
```r
library(DeepBlueR)
library(foreach)  # needed for the per-chromosome requests below
library(gplots)
library(RColorBrewer)
library(matrixStats)
library(stringr)
```
Next, we list all available BLUEPRINT DNA methylation experiments (412 files matching the required metadata were available at the time this vignette was written).
```r
blueprint_DNA_meth <- deepblue_list_experiments(genome = "GRCh38",
    epigenetic_mark = "DNA Methylation",
    technique = "Bisulfite-Seq",
    project = "BLUEPRINT EPIGENOME")
blueprint_DNA_meth
```
We are only interested in a subset of those files and filter for methylation call files (as opposed to coverage files).
```r
blueprint_DNA_meth <- blueprint_DNA_meth[grep("CPG_methylation_calls.bs_call",
    deepblue_extract_names(blueprint_DNA_meth)),]
blueprint_DNA_meth
```
Each of these files has a column named ```VALUE``` that holds the DNA methylation beta values. There are two possibilities for selecting this column across several files.
First, we assume that the column in question has a different name in each file. We thus have to create a list that holds the column name for each of them. Such a list can be generated using standard R commands:
```r
exp_columns <- list(nrow(blueprint_DNA_meth))

for(i in 1:nrow(blueprint_DNA_meth)){
    exp_columns[[i]] <- "VALUE"
}

names(exp_columns) <- deepblue_extract_names(blueprint_DNA_meth)
```
In most cases, the same column name will apply to each file. We therefore implemented a shorthand function for generating the above list with a single column name for all files.
exp_columns <- deepblue_select_column(blueprint_DNA_meth, "VALUE")
In the next operation, we consider that not all methylation sites will be informative for clustering the data. We thus filter for those regions that are part of the BLUEPRINT regulatory build, a modified version of the ENSEMBL regulatory build that contains promoters, promoter flanking regions, enhancers, CTCF binding sites, transcription factor binding sites and open chromatin regions. As we can see in this example, DeepBlueR returns a vector of query ids (one for each chromosome) which we store for later use.
```r
# List all available chromosomes in GRCh38
chromosomes_GRCh38 <- deepblue_extract_ids(
    deepblue_chromosomes(genome = "GRCh38"))

# Keep only the essential ones
chromosomes_GRCh38 <- grep(pattern = "chr([0-9]{1,2}|X)$",
                           chromosomes_GRCh38, value = TRUE)

# We split the request by chromosome to avoid hitting the memory limit of
# DeepBlue
blueprint_regulatory_regions <- foreach(chr = chromosomes_GRCh38,
                                        .combine = c) %do%
    deepblue_select_annotations(
        annotation_name = "Blueprint Ensembl Regulatory Build",
        chromosome = chr,
        genome = "GRCh38"
    )
blueprint_regulatory_regions
```
DeepBlue has several annotations that can be used to filter for informative sites. We could, for example, also filter for CpG islands.
deepblue_select_annotations(annotation_name = "Cpg Islands", genome = "GRCh38")
A list of all annotations currently available for a genome is given by the following command.
deepblue_list_annotations(genome = "GRCh38")
New annotations may be included upon user request.
In case we want to include the entire genome in an aggregated form, DeepBlue supports the concept of tiling regions. In this process, the genomic range of interest is binned into tiles of a given size (here 5 kb).
tiling_regions <- deepblue_tiling_regions(size=5000, genome="GRCh38")
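These tiling regions could be used in place of the regulatory regions as aggregation boundaries; the sketch below reuses the ```exp_columns``` list defined above. Note that, as discussed in the section on splitting requests, such a genome-wide request may need to be split by chromosome.

```r
# Hypothetical alternative: aggregate the methylation values over 5kb tiles
# instead of the BLUEPRINT regulatory build (exp_columns is defined above)
tiling_request = deepblue_score_matrix(
    experiments_columns = exp_columns,
    aggregation_function = "mean",
    aggregation_regions_id = tiling_regions)
```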
In the above step we have defined a set of regions of interest that we want to interrogate in R to, for example, cluster samples. To this end, DeepBlue can build a score matrix, in which the selected genomic regions are aggregated on the server to reduce the complexity and size of the data. We request such a score matrix in which regulatory regions are aggregated by the mean as follows. Note that we use the variables 'exp_columns' and 'blueprint_regulatory_regions' that we have defined above. Since we had to split our request by chromosome, we need to make multiple requests.
```r
request_ids <- foreach(query_id = blueprint_regulatory_regions,
                       .combine = c) %do%
    deepblue_score_matrix(
        experiments_columns = exp_columns,
        aggregation_function = "mean",
        aggregation_regions_id = query_id)
request_ids
```
After triggering this function, DeepBlue queues our task and will execute it when resources become available. We also observe that DeepBlue returns a request id, which we can use to query the status of the operation.
```r
foreach(request = request_ids, .combine = c) %do% {
    deepblue_info(request)$state
}
```
When the operation is finished, we can download the score matrix and store it in a local variable. For DeepBlueR, we implemented several strategies to improve the performance of data retrieval. For instance, we modified the existing XML-RPC package to be more efficient in the context of DeepBlue when it comes to parsing nested XML data. Moreover, we retrieve tabular data directly in a tab separated file format, which can be processed much faster in R. Finally, we also compress data on the server side to reduce download time. Here, we only show the first five columns out of 215.
```r
score_matrix <- data.table::rbindlist(
    deepblue_batch_export_results(request_ids),
    use.names = TRUE)
score_matrix[,1:5, with=FALSE]
```
The download is 212.8 MB in size. The size of the data we handled on DeepBlue to extract this information is roughly 212 x ~450 MB ~= 95 GB and thus more than can be handled in R on most desktop computers. We next show how this score matrix can be used to plot a heatmap where samples are clustered by the Pearson correlation coefficient, revealing that samples originating from the same cell type are more similar in DNA methylation.
In preparation of the heatmap plot, we need to generate an RColorBrewer palette. This allows us to create a color palette for more than 9 colors.
getPalette <- colorRampPalette(brewer.pal(9, "Set1"))
For each experiment, we collect metadata.
experiments_info <- deepblue_info(deepblue_extract_ids(blueprint_DNA_meth))
All metadata is parsed to a nested R list. We refer to the DeepBlue paper for a description of available metadata. Here, we show the metadata associated with just one of the samples.
head(experiments_info[[1]], 10)
For this analysis, we are only interested in the biosource name, i.e. the cell type. We can retrieve this information using standard R syntax. Note that we show only the first 6 entries here.
```r
biosource <- unlist(lapply(experiments_info, function(x){
    x$sample_info$biosource_name}))
head(biosource)
```
To save some space on the plot, we replace positive with + and negative with -.
```r
biosource <- str_replace_all(biosource, "-positive", "+")
biosource <- str_replace_all(biosource, "-negative", "-")
```
For the same reason, we remove the words 'terminally differentiated' from one of the cell types.
biosource <- str_replace(biosource, ", terminally differentiated", "")
Using the above color palette, we can now assign a unique color to each cell type.
```r
color_map <- data.frame(biosource = unique(biosource),
                        color = getPalette(length(unique(biosource))))
head(color_map)
```
Using the above table, we can now assign the colors to each experiment according to its cell type / biosource.
```r
exp_names <- unlist(lapply(experiments_info, function(x){ x$name}))

biosource_colors <- data.frame(name = exp_names, biosource = biosource)
biosource_colors <- dplyr::left_join(biosource_colors, color_map,
                                     by = "biosource")
head(biosource_colors)
```
Finally, we transform this data frame into a vector that is compatible with the heatmap function.
```r
color_vector <- as.character(biosource_colors$color)
names(color_vector) <- biosource_colors$biosource
head(color_vector)
```
We remove the first three columns (CHROMOSOME, START, END) and convert the data frame to a numeric matrix.
```r
filtered_score_matrix <- as.matrix(score_matrix[,-c(1:3), with=FALSE])
head(filtered_score_matrix[,1:3])
```
Next, we compute the variance of each row and retain only genomic regions with variance > 0.05 for plotting. Plotting all regions would consume too much memory and more importantly, regions that do not show variance also do not allow us to spot differences between cell types in the heatmap.
message("regions before: ", nrow(filtered_score_matrix)) filtered_score_matrix_rowVars <- rowVars(filtered_score_matrix, na.rm = TRUE) filtered_score_matrix <- filtered_score_matrix[which(filtered_score_matrix_rowVars > 0.05),] message("regions after: ", nrow(filtered_score_matrix))
To be able to cluster samples, we remove regions that have missing values in at least one of the experiments.
message("regions before: ", nrow(filtered_score_matrix)) filtered_score_matrix <- filtered_score_matrix[which(complete.cases(filtered_score_matrix)),] message("regions after: ", nrow(filtered_score_matrix))
IMPORTANT: The order of columns in the score matrix is not the same as in the exp_columns list used in the request. We thus have to order the matrix by the experiment names in the color map. This is crucial to make sure we assign the correct cell type to each sample!
filtered_score_matrix <- filtered_score_matrix[,exp_names]
We plot a heatmap in which the variable regions are shown across all samples. On top of the columns, we create a dendrogram based on Pearson correlation. More precisely, we convert the Pearson correlation, a similarity measure, into a distance, such that it can be used with hierarchical clustering.
```r
heatmap.2(filtered_score_matrix,
          labRow = NA, labCol = NA,
          trace = "none",
          ColSideColors = color_vector,
          hclust = function(x) hclust(x, method="complete"),
          distfun = function(x) as.dist(1-cor(t(x), method = "pearson")),
          Rowv = TRUE,
          dendrogram = "column",
          key.xlab = "beta value",
          denscol = "black",
          keysize = 1.5,
          key.title = NA)
```
Finally, we add a legend that maps the colors to the cell types:

```r
plot.new()
legend(x = 0, y = 1,
       legend = color_map$biosource,
       col = as.character(color_map$color),
       text.width = 0.6,
       lty = 1, lwd = 6, cex = 0.7,
       y.intersp = 0.7, x.intersp = 0.7,
       inset = c(-0.21, -0.11))
```
To obtain a general overview of DeepBlue, we recommend starting with the DeepBlue publication. A list of all DeepBlue commands is available on its API page.
You can have a look at the other use cases included in the R package and list them with
demo(package = "DeepBlueR")
Individual use cases can be triggered with
demo("use_case1", package = "DeepBlueR")
Note that the example presented here corresponds to use case 4 in the R package.
Finally, we encourage you to try to reproduce Python examples in R and to read the DeepBlue manual.
We also want to highlight the possibility of conveniently browsing and accessing the existing data in DeepBlue through its web interface, which additionally allows you to select experiments in a grid-like view.
Should you encounter any problems with DeepBlueR, we kindly ask you to create an issue on the Bioconductor DeepBlueR support page.
The R code in the DeepBlueR package is under the GPLv3 license and we welcome contributions of other developers. Finally, we would like to thank the Bioconductor team for its support in making DeepBlueR available to a wide audience of users.