BiocStyle::markdown()
knitr::opts_chunk$set( collapse = TRUE, comment = "#>", cache=TRUE, warning=FALSE, message=FALSE, eval = FALSE )
BioDataome package is on GitHub. To install an R package from GitHub you first need to install the devtools package.
if (!require(devtools)) { install.packages("devtools") }
and then load devtools package.
library(devtools)
To install BioDataome package, type:
install_github("mensxmachina/BioDataome")
To load BioDataome package, type:
library(BioDataome)
BioDataome website and BioDataome package currently hold datasets from five different microarray technologies: four gene expression (GPL570, GPL96, GPL6244 and GPL1261) and the GPL13534 Human Methylation BeadChip from GEO and the GPL11154 high-throughput sequencing technology from recount. Therefore, all functions that take as an input argument a technology id, work only with one of the above technology ids.
In BioDataome users may select certain criteria from the drop-down lists and form a complex query. For example, users may be interested in downloading all parkinson's disease datasets of the GPL570 platform. When users select parkinson's disease from the disease drop-down menu and GPL570 from the technology drop-down menu, BioDataome datasets are filtered out to match these criteria and a list with all the metadata related to the results of their query is formed. Subsequently, users may click on the download button to download the list, that includes a download link for each dataset.
#Download the filtered list matching specific criteria #If you downloaded the metadata csv file in your currect directory you can load it with the command: #metadata<-read.csv(file=file.path(getwd(),"metadata.csv"), sep=",", header=TRUE, stringsAsFactors = F) #If you have installed and loaded BioDataome package, an example metadata file can be loaded as: metadata<-BioDataome::metadata #Let's filter out further the results by selecting datasets with sample size greater or equal to 25 and #those with no common samples with any other dataset. subsetDsets<-metadata[which( (metadata$samples>=25) & (metadata$duplicates==" ") ),] dataList<-lapply(subsetDsets$dataset,read.csv) #first five rows and columns of the data dataList[[1]][1:5,1:5]
In BioDataome we have downloaded, preprocessed and annotated several thousands of omics data. All raw data have been retreived from Gene Expression Omnibus and recount. Here we provide a workflow of all the steps we followed to download, preprocess and annotate data for BioDataome either from GEO or from recount.
In BioDataome package curateGSE
function runs all necessary steps to download and preprocess a dataset from GEO. It returns a list, the first element of which is the sample phenotype metadata and the second the preprocessed data. curateGSE
function is build upon several other BioDataome functions. First, it calls GSEmetadata
for creating the sample phenotype metadata, including the output of the controlSamples
, which discovers which samples are possibly used as controls, and then calls downloadRaw
to download all raw data. Subsequently, depending on the user specified techology (whether the user designated one of the gene expression technologies curated by BioDataome or the methylation technology) curateGSE
calls preprocessGEO
or preprocessGEOMethylation
respectively, to create a numeric matrix of the preprocessed data. Keep in mind that preprocessing CEL files takes a long time, even for small datasets.
#curate dataset GSE10026 measured with technology GPL570. GSE10026<-curateGSE("GSE10026","GPL570",getwd()) #first five rows and columns of the metadata GSE10026[[1]][1:5,1:5] #first five rows and columns of the preprocessed data GSE10026[[2]][1:5,1:5]
In BioDataome package curateRecountRNASeq
function runs all necessary steps to download and preprocess a dataset from recount. It returns a list, the first element of which is the sample phenotype metadata and the second the preprocessed data.
#curate dataset SRP032775. SRP032775<-curateRecountRNASeq("SRP032775",getwd()) #first five rows and columns of the metadata SRP032775[[1]][1:5,1:5] #first five rows and columns of the preprocessed data SRP032775[[2]][1:5,1:5]
In RNASeq data we often end up with variables of zero variance across samples, as a result of zero counts in all samples. In most types of analysis, especially differential gene expression, one approach is to exclude genes with zero counts in all the replicates and conditions from the analysis, considering that they are obviously not expressed, (http://homer.salk.edu/homer/basicTutorial/rnaseqR.html). Also, for downstream analysis, it is typical to filter genes with a total read count smaller than a given threshold in any of the experimental conditions. In either way, non-zero variance of gene counts across samples is guaranteed. In BioDataome however, we are preparing datasets for further downstream and meta-analysis, thus a common variable set must be ensured. We followed the same preprocessing pipeline on all downloaded datasets and we show here all the steps that we followed, as well as their effect on the data. To visualize this effect, we followed the steps proposed in: (http://www-huber.embl.de/users/klaus/Teaching/DESeq2Predoc2014.html)
Recount provides the coverage counts instead of typical read count matrices. The coverage counts for each disjoint exon are the sum of the base-pair coverage. The gene coverage count is the sum of the disjoint exons coverage counts. The recount function scale_counts()
computes the scaled read counts for a target library size of 40 million reads and we use this in our preprocessing pipeline, as proposed in recount quick start guide. We also estimate size factors to account for differences in sequencing depth and use the varianceStabilizingTransformation
to account for heteroscedasticity, as indicated in RNA-seq gene-level analysis.
#download dataset SRP050971 downloadRecount("SRP050971") load(file.path("SRP050971", 'rse_gene.Rdata')) # Scale counts by taking into account the total coverage per sample rse <- recount::scale_counts(rse_gene) #access count matrix data<-SummarizedExperiment::assay(rse) #access phenotype information pheno<-SummarizedExperiment::colData(rse) #construct a DESeqDataSet object for further analysis dds<-DESeq2::DESeqDataSetFromMatrix(data,pheno, ~ 1) #Estimate the size factors for the count data dds <- DESeq2::estimateSizeFactors(dds)
To assess the scaling effect, we plot the densities of counts for the different samples. We expect overlapping densities since most of the genes are affected by the experimental conditions.
source("https://bioconductor.org/biocLite.R") if (!require("geneplotter")) biocLite("geneplotter") library('geneplotter') multidensity(SummarizedExperiment::assay(dds), xlab="mean counts", xlim=c(0, 1000), ylab="Density", legend=F, main="") #account for heteroscedasticity vsd <- DESeq2::varianceStabilizingTransformation(dds, blind=FALSE) dataNorm<-SummarizedExperiment::assay(vsd) #show the effect of the variance stabilizing transformation, by plotting the first sample against the second lims <- c(-2, 20) plot(SummarizedExperiment::assay(dds)[,1:2], pch=16, main="before VST") lims <- c(-2, 20) plot(dataNorm[,1:2], pch=16, main="after VST")
To assess the overall similarity between samples we plot a heatmap of sample-to-sample distances using the transformed values.
#prerequisites for plotting colouring a heatmap if (!require("RColorBrewer")) install.packages("RColorBrewer") library(RColorBrewer) if (!require("pheatmap")) install.packages("pheatmap") library(pheatmap) distsRL <- dist(t(dataNorm)) mat <- as.matrix(distsRL) condition<-SummarizedExperiment::colData(vsd)$title condition<-strsplit(condition,"_") condition<-sapply(condition,"[[",2) rownames(mat) <- condition colnames(mat) <- SummarizedExperiment::colData(vsd)$sample hmcol <- colorRampPalette(brewer.pal(9, "Blues"))(255) pheatmap(mat,fontsize_row=8)
To build BioDataome website we have also identified duplicate samples by comparing all pairwise combinations of the preprocessed datasets. For each dataset, we report all other datasets that share at least one common sample. In BioDataome package we included two functions, compareDsets
and compareDsetList
for a user to either compare two datasets to find if they share samples, or to compare one datasets of interest to a list of datasets.
BioDataome website provides all data and metadata files in csv format for better data exchange. However, in BioDatome package a user can also load data in .rda format. For example, compareDsets
and compareDsetList
both accept either .csv or .rda files. In the following examples we demonstrate compareDsets
with loading .rda files and compareDsetList
with .csv file.
#download two preprocessed datasets from BioDataome in .rda d1<-get(load(url("http://dataome.mensxmachina.org/data/Homo%20sapiens/GPL570/GSE86013.Rda"))) d2<-get(load(url("http://dataome.mensxmachina.org/data/Homo%20sapiens/GPL570/GSE86015.Rda"))) # compare two preprocessed datasets commons<-compareDsets(d1,d2) #Number of samples that d1 and d2 share commons
Since BioDataome datasets are large we use fread
from package data.table
to read .csv datasets faster. In contrast to compareDsets
, compareDsetList
accepts paths to data files instead of data matrices.
#the path to GSE8601 dataset in BioDataome x<-"http://dataome.mensxmachina.org/data/Homo%20sapiens/GPL570/GSE86013.csv"
Let us suppose we would like to compare GSE8601 to datasets GSE86015, GSE9008 and GSE9119. All datasets should have been measured with the same technology, i.e. GPL570.
#create a character vector of the paths in BioDataome y<-c("GSE86015.csv","GSE9008.csv","GSE9119.csv") y<-paste0("http://dataome.mensxmachina.org/data/Homo%20sapiens/GPL570/",y) #find with which of the three datasets our dataset of interest shares samples commonGSEs<-compareDsetList(x,y) commonGSEs
To annotate datasets hosted in BioDataome, we programmatically retrieve text-mined results from PubTator through RESTful API. PubTator supports PubMed ID queries. When a datasets is not accompanied by its PubMed ID, we use GSEtoDiseaseGEO
, function that is based on GEO queries consisting of dataset accession ID and all disease terms from the disease ontology (D-O).
In most cases, either PubTator or GSEtoDiseaseGEO
return a collection of disease terms for each dataset. In this case, we first map all disease terms to the disease ontology first children nodes, i.e. bacterial infectious disease, immune system disease, cancer, etc, and keep disease terms that belong to the most common node(s).
To assign Disease-Ontology terms to a GEO study we call GSEtoDisease
which implements the rationale described above.
#Assign Disease-Ontology terms to study "GSE10245" diseases<-GSEtoDisease("GSE10006") diseases
For the RNASeq datasets from the recount, we first map the accession ids found in recount to GEO accession ids.
#Assign Disease-Ontology terms to study "SRP032775" gse<-recountIDtoGSE("SRP032775") diseases<-GSEtoDisease(gse) diseases
One of the most common analysis of microarray data is differential gene expression analysis. Therefore, we show here an example of how an R user can exploit BioDataome to accelerate the analysis of a single dataset. Since BioDataome hosts already processed data, analysts can avoid all the time-consuming steps of downloading and preprocessing data and focus on their downlstream analysis. In this example we follow microarray analysis steps from the Biomedical Data Science course of Rafael Irizarry and Michael Love.
#download GSE8671 preprocessed dataset from BioDataome in .rda d<-get(load(url("http://dataome.mensxmachina.org/data/Homo%20sapiens/GPL570/GSE8671.Rda"))) #download GSE8671 preprocessed dataset from BioDataome in .rda pheno<-get(load(url("http://dataome.mensxmachina.org/data/Homo%20sapiens/GPL570/GSE8671_Annot.Rda"))) #dimensions of dataset (rows: probes, columns:samples) dim(d) #install and load limma package source("https://bioconductor.org/biocLite.R") if (!require("limma")) biocLite("limma") library('limma') #Fit a linear model for each gene in the expression data given the design matrix specified in column class #of the phenotype data fit <- lmFit(d, design=model.matrix(~ pheno$class)) #calculate the differential expression by empirical Bayes shrinkage fit <- eBayes(fit) #A table with the top 10 most statistically significant differentially expressed genes between the groups #sorted by adjusted p-value: tt <- topTable(fit, coef=2) #create a heatmap of those top 10 highly significant genes. heatmap(d[match(rownames(tt),row.names(d)),], labCol = FALSE)
#install and load necessary packages if (!require("hgu133plus2.db")) biocLite("hgu133plus2.db") if (!require("annotate")) biocLite("annotate") library("annotate") library('hgu133plus2.db') Symbol <- getSYMBOL(row.names(d), "hgu133plus2.db") #Find gene symbols for the top 10 differentially expressed genes top10genes<-Symbol[match(rownames(tt),row.names(d))] top10genes
To infer knowledge about the differentially expressed genes we perform enrichment analysis, namely we compare them to annotated gene sets representing prior biological knowledge. We use Enrichr and the respective R package to check whether a differentially expressed input set of genes significantly overlaps with annotated gene sets. We choose Gene Ontology terms and Reactome and KEGG biological pathways as annotated gene set libraries for this example.
#Perform enrichment analysis for the top 10 differentially expressed genes #install and download enrichR package if (!require("enrichR")) install.packages("enrichR") library('enrichR') #select annotated gene set libraries dbs <- c("GO_Molecular_Function_2017", "GO_Cellular_Component_2017", "GO_Biological_Process_2017", "Reactome_2016","KEGG_2016") #Perform enrichment analysis enriched <- enrichr(top10genes, dbs) #sort Gene Ontology molecular function terms by adjusted p-values GOMF<-enriched[[1]]$Term[order(enriched[[1]]$Adjusted.P.value)] #show top-5 head(GOMF, n=5) #sort genes by the number of Gene Ontology molecular function terms GeneMembershipGOMF<-sort(table(enriched[[1]]$Genes),decreasing = T) GeneMembershipGOMF #sort Gene Ontology cellular component terms by adjusted p-values GOCC<-enriched[[2]]$Term[order(enriched[[2]]$Adjusted.P.value)] #show top-5 head(GOCC, n=5) #sort genes by the number of Gene Ontology cellular component terms GeneMembershipCC<-sort(table(enriched[[2]]$Genes),decreasing = T) GeneMembershipCC #sort Gene Ontology biological process terms by adjusted p-values GOBC<-enriched[[3]]$Term[order(enriched[[3]]$Adjusted.P.value)] #show top-5 head(GOBC, n=5) #sort genes by the number of Gene Ontology biological process terms GeneMembershipBC<-sort(table(enriched[[2]]$Genes),decreasing = T) GeneMembershipBC #sort Reactome pathway terms by adjusted p-values Reactome<-enriched[[4]]$Term[order(enriched[[4]]$Adjusted.P.value)] #show top-5 head(Reactome, n=5) #sort genes by the number of Reactome pathway terms GeneMembershipReactome<-sort(table(enriched[[4]]$Genes),decreasing = T) GeneMembershipReactome #sort KEGG pathway terms by adjusted p-values KEGG<-enriched[[5]]$Term[order(enriched[[5]]$Adjusted.P.value)] #show top-5 head(KEGG,n=5) #sort genes by the number of GKEGG pathway terms GeneMembershipKEGG<-sort(table(enriched[[5]]$Genes),decreasing = T) GeneMembershipKEGG
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.