options(width = 400) knitr::opts_chunk$set(warnings = FALSE, message = FALSE)
preciseTAD
is an R package designed to transform domain boundary calling into a supervised machine learning framework. preciseTAD
offers full functionality in building the predictor-response data and selecting the best model for precise prediction of boundary regions. This functionality can be broken into 2 primary usages:
The training/testing data used for modeling is represented as a matrix with rows being genomic regions, columns being genomic annotations, and cells containing measures of association between them. Users have the option to concatenate genomic regions from multiple chromosomes.
To create the row-wise dimension of the data, preciseTAD
uses shifted binning, a strategy for making the dimensions of the data matrix used for modeling by segmenting the linear genome into nonoverlapping regions. This step is transparent for the user. To create shifted bins, chromosome-specific bins start at half of the resolution r, and continue in congruent intervals of length r until the end of the chromosome (mod r + r/2), using hg19 genomic coordinates. The shifted genomic bins, are then defined as boundary regions (Y = 1) if they contain a called boundary, and non-boundary regions (Y = 0) otherwise, thus establishing the binary response vector (Y) used for classification. Intuitively, shifted bins are centered on borders between the original bins, thus capturing potential boundaries. We found the shifted binning strategy improves model performance.
The column-wise dimension is formed by genomic annotations, such as transcription factor binding sites (TFBSs), histone modification marks, chromatin states. The ($log_{2}$) distances, which enumerate the genomic distance from the center of each genomic bin to the center of the nearest ChIP-seq peak region of interest, form the feature space. Other feature type options include binary overlaps - an indicator for whether a ChIP-seq region overlaps with genomic bin, count overlaps - the number of overlaps in each bin, and percent overlaps - the percentage of overlap between any bin and the total width of all ChIP-seq regions overlapping it. The customized training/testing data formation offered by preciseTAD
allows users to implement any binary classification machine learning algorithm.
preciseTAD
implements a random forest (RF) model, allowing tuning hyperparameters and applying feature reduction. The primary inputs are the training and testing data, a list of hyperparameter values, the number of cross-validation folds to use (if a grid of hyperparameter values is provided), and the metric used for optimization. The output includes the model object (necessary for downstream prediction of boundaries), the variable importance values, and a list of performance metrics when validating the model on the testing data (see Table 1). This model is then used to predict base-level precise boundary locations.
To predict the base-level location of domain boundaries, the preciseTAD
model is applied for each base annotated with the aforementioned genomic annotations. The base-level probabilities are clustered using density-based clustering and scalable data partitioning techniques, to narrow boundary regions and points. First, the probability vector, $p_{n_{i}}$, is extracted ($n_{i}$ representing the length of chromosome $i$). Next, DBSCAN (Density-based Spatial Clustering of Applications with Noise) [@ester1996density; @hahsler2019dbscan] is applied to the matrix of pairwise genomic distances between bases with $p_{n_{i}} \ge t$, where $t$ is a threshold determined by the user. The resulting clusters of highly predictive bases identified by DBSCAN are termed preciseTAD boundary regions (PTBR). To precisely identify a single base among each PTBR, preciseTAD implements partitioning around medoids (PAM) within each cluster. The corresponding cluster medoid is defined as a preciseTAD boundary point (PTBP), making it the most representative and biologically meaningful base within each clustered PTBR. The output includes a list of genomic coordinates of PTBPs and PTBSs.
preciseTAD
allows us to use the pre-trained model to predict the precise location of domain boundaries in cell types without Hi-C data but with genome annotation data. Specifically, only cell-type-specific ChIP-seq data (BED format) for CTCF, RAD21, SMC3, and ZNF143 transcription factor binding sites is required. We provide models pre-trained using GM12878 and K562 genome annotation data, and Arrowhead- and Peakachu-predicted boundaries for chromosome-specific prediction of precise domain boundaries in other cell types.
# if (!requireNamespace("BiocManager", quietly=TRUE)) # install.packages("BiocManager") BiocManager::install("preciseTAD") # For the latest version install from GitHub # BiocManager::install("dozmorovlab/preciseTAD")
library(knitr) library(e1071) library(preciseTAD)
preciseTAD
requires users to supply genomic coordinates of the "ground-truth" domain boundaries to establish the response vector (Y). As an example, we consider TAD boundaries derived from the popular Arrowhead TAD caller, a part of the Juicebox suite of tools developed by Aiden Lab [@durand2016juicer]. To get boundaries, Arrowhead was applied on the autosomal chromosomes using 5 kb GM12878 Hi-C data [@rao20143d] (GSE63525). An example of the command line script used to called TADs with Arrowhead is provided below, as well as the first three columns of the resulting TAD coordinates. A more detailed tutorial for implementing Arrowhead can be found here.
arrowhead -c 1 \ #chromosome to call TADs on -r 5000 \ #HiC data resolution ~/GSE63525_GM12878_insitu_primary.hic \ #location of the .HIC file ~/arrowhead_output #location to store the output
data("arrowhead_gm12878_5kb") head(arrowhead_gm12878_5kb)
Users will then need to transform this TAD coordinate data into a GRanges object of unique boundary coordinates using the preciseTAD::extractBoundaries()
function. Note that the input for this function is a BED-like data frame where the second and third columns are genomic coordinates of domain anchor's midpoints. Here we only extract boundaries for CHR1 and CHR22 (CHR = c("CHR1", "CHR22")
parameter). We specify preprocess=FALSE
because we are only interested in all boundaries and not filtering TADs by length. Lastly, we specify resolution=5000
to match the resolution used by Arrowhead (although this argument is ignored given that preprocess=FALSE
). As shown below, there were a total of 1901 unique TAD boundaries reported by Arrowhead for chromosomes 1 and 22.
bounds <- extractBoundaries(domains.mat = arrowhead_gm12878_5kb, filter = FALSE, CHR = c("CHR1", "CHR22"), resolution = 5000) # View unique boundaries bounds
To identify genomic features best predictive of TAD boundaries, preciseTAD
requires functional genomic annotation data. They are used to establish the feature space ($\textbf{X}={X_{1}, X_{2}, \cdots, X_{p} }$). Cell type-specific genomic annotation data can be downloaded from the ENCODE data portal as BED files. Once you have downloaded your preferred list of functional genomic annotations, store them in a specific file location (i.e., "./path/to/BEDfiles"). These files can then be converted into a GRangesList object and used to create the feature space using the preciseTAD::bedToGRangesList()
function. The signal
argument refers to the column in the BED files containing peak signal strength values and is used to assign metadata to the corresponding GRanges (only necessary for downstream plotting). We have already provided a GRangesList object with 26 transcription factor binding sites (TFBS) specific to the GM12878 cell type. Once you load it in, you can see the list of transcription factors using the following commands.
# path <- "./path/to/BEDfiles" # tfbsList <- bedToGRangesList(filepath=path, bedList=NULL, bedNames=NULL, pattern = "*.bed", signal=4) data("tfbsList") names(tfbsList) tfbsList
Using the "ground-truth" boundaries and the following TFBS, we can build the data matrix that will be used for predictive modeling. The preciseTAD::createTADdata()
function can create the training and testing data. Here, we specify to train on chromosome 1 and test on chromosome 22. Additionally, we specify resolution = 5000
to construct 5kb shifted genomic bins (to match the Hi-C data resolution), featureType = "distance"
for a $log_2(distance + 1)$-type feature space, and resampling = "rus"
to apply random under-sampling (RUS) on the training data to balance classes of boundary vs. non-boundary regions. We also specify a seed to ensure reproducibility when performing the resampling. The result is a list containing two data frames: (1) the resampled (if specified) training data, and (2) the testing data.
set.seed(123) tadData <- createTADdata(bounds.GR = bounds, resolution = 5000, genomicElements.GR = tfbsList, featureType = "distance", resampling = "rus", trainCHR = "CHR1", predictCHR = "CHR22") # View subset of training data tadData[[1]][1:5,1:4] # Check it is balanced table(tadData[[1]]$y) # View subset of testing data tadData[[2]][1:5,1:4]
We can now implement our machine learning algorithm of choice to predict boundary regions. Here, we opt for the random forest algorithm for binary classification. preciseTAD
offers functionality for performing recursive feature elimination (RFE) as a form of feature reduction through the use of the preciseTAD::TADrfe()
function, which is a wrapper for the rfe
function in the caret
package [@kuhn2012caret]. preciseTAD::TADrfe()
implements a random forest model on the best subset of features from 2 to the maximum number of features in the data by powers of 2, using 5-fold cross-validation. We specify accuracy as the performance metric. An example is shown below.
set.seed(123) rfe_res <- TADrfe(trainData = tadData[[1]], tuneParams = list(ntree = 500, nodesize = 1), cvFolds = 5, cvMetric = "Accuracy", verbose = TRUE)
# View RFE performances rfe_res[[1]] # View the variable importance among top n features across each CV fold head(rfe_res[[2]])
Recursive feature elimination results indicate that model accuracy begins to stabilize when only considering the top four transcription factors for building random forest predictive models (Figure 1A). After aggregating the variable importance values of the top four TFBS across each cross-fold, we see that the top four most important TFBS in each of the five folds are the SMC3, RAD21, CTCF, and ZNF143 (Figure 1B). These are known components of the loop-extrusion model that has been proposed as a mechanism for the 3D architecture of the human genome [@sanborn2015chromatin; @fudenberg2016formation; @hansen2018recent].
Now that we have suitably reduced our feature space, we can implement a random forest algorithm built simply on the TFBS mentioned above (SMC3, RAD21, CTCF, and ZNF143). We can take advantage of the preciseTAD::TADrandomForest()
function, which is a wrapper of the randomForest
package [@breiman2001random; @liaw2002classification]. We specify the training and testing data, the hyperparameter values, the number of cross-validation folds, the performance metric to consider (here, accuracy), the seed to initialize for reproducibility, an indicator for retaining the model object, an indicator for retaining variable importances, the variable importance measure to consider (here, mean decrease in accuracy (MDA)), and an indicator for retaining model performances based on the test data. The function returns a list containing: 1) a trained model from caret with model information (tadModel[[1]]
), 2) a data.frame
of variable importance for each feature included in the model (tadModel[[2]]
), and 3) a data.frame
of various model performance metrics.
# Restrict the data matrix to include only SMC3, RAD21, CTCF, and ZNF143 tfbsList_filt <- tfbsList[names(tfbsList) %in% c("Gm12878-Ctcf-Broad", "Gm12878-Rad21-Haib", "Gm12878-Smc3-Sydh", "Gm12878-Znf143-Sydh")] set.seed(123) tadData <- createTADdata(bounds.GR = bounds, resolution = 5000, genomicElements.GR = tfbsList_filt, featureType = "distance", resampling = "rus", trainCHR = "CHR1", predictCHR = "CHR22") # Run RF set.seed(123) tadModel <- TADrandomForest(trainData = tadData[[1]], testData = tadData[[2]], tuneParams = list(mtry = 2, ntree = 500, nodesize = 1), cvFolds = 3, cvMetric = "Accuracy", verbose = FALSE, model = TRUE, importances = TRUE, impMeasure = "MDA", performances = TRUE)
# View model performances performances <- tadModel[[3]] performances$Performance <- round(performances$Performance, digits = 2) rownames(performances) <- performances$Metric kable(t(performances), caption = "List of model performances when validating an RF built on CHR1 on CHR22 test data.")
As you may know, there exist other machine learning binary classifiers that can be used in this setting. For example, suppose we opt to implement a support vector machine (SVM). This is easy enough to accomplish given that preciseTAD::createTADdata()
conveniently sets up the training and testing data sets. We use the e1071
package to run the SVM with a radial kernel, cost = 1, and gamma = 0.5 using the example command below. We see that the SVM model's accuracy is only 0.67, whereas the accuracy from our random forest was 0.69 (Table 1).
svmModel <- svm(y ~ ., data = tadData[[1]], kernel = "radial", cost = 1, gamma = 0.5) svmPreds <- predict(svmModel, tadData[[2]][, -1], positive = "Yes") # View confusion matrix table(svmPreds, tadData[[2]][, 1])
Recall that our model classifies boundary regions, in that each prediction refers to a genomic bin of width 5000 bases. To predict boundary coordinates at the base resolution more precisely, we can leverage our model through the use of the preciseTAD::preciseTAD()
function. Conceptually, instead of genomic bins, we annotate each base with the selected genomic annotations and featureType. We then apply our model on this annotation matrix to predict the probability of each base is a boundary given the associated genomic annotations. To minimize computational costs, the base-level prediction should be performed in selected regions.
Suppose we want to precisely predict the domain boundary coordinates for the 2Mb section of CHR22:25,500,000-27,500,000. To do so, we specify chromCoords = list(25500000, 27500000)
. Additionally, we set a range of probability threshold values including 0.975, 0.99, and 1.0 used for constructing PTBRs. For DBSCAN, we assign a range of $\epsilon$-neighborhood values including 5000, 10000, 15000, 20000, and 30000 and maintain 3 as the MinPts value. We specify verbose=TRUE
so that the function will print out some minor results for each combination of $(t, \epsilon)$ as well as the overall best combination
# Run preciseTAD set.seed(123) pt <- preciseTAD(genomicElements.GR = tfbsList_filt, featureType = "distance", CHR = "CHR22", chromCoords = list(17000000, 18000000), tadModel = tadModel[[1]], threshold = 1, verbose = TRUE, parallel = NULL, DBSCAN_params = list(eps = c(30000), # c(10000, 30000, 50000) MinPts = c(100)), # c(3, 10, 100) slope = 5000, genome = "hg19", BaseProbs = TRUE) # What is getting returned? PTBRs - precide TAD boundary regions, and PTBPs - boundary points names(pt) # View the results # pt
When specifying verbose=TRUE
the function provides brief output for all $t \times \epsilon$ combinations including how many PTBRs/PTBPs result from each combination. NOTE: it is advised to provide verbose=FALSE
when implementing preciseTAD using multiple $t$ and $\epsilon$ on the entire chromosome. Additionally, we see from the verbose statements that the optimal combination of $(t, \epsilon) = (1, 30000)$.
The full output of the preciseTAD::preciseTAD()
function is a list with 4 elements: 1) data frame with average (and standard deviation) normalized enrichment (NE) values for each combination of t and eps (only if multiple values are provided for at least paramenter; all subsequent summaries are applied to optimal combination of (t, eps)), 2) the genomic coordinates spanning each preciseTAD predicted region (PTBR), 3) the genomic coordinates of preciseTAD predicted boundaries points (PTBP). 4) a named list including summary statistics of the following: PTBRWidth - PTBR width, PTBRCoverage - the ratio of base level coordinates with probabilities that exceed the threshold to PTBRWidth, DistanceBetweenPTBR - the genomic distance between the end of the previous PTBR and the start of the subsequent PTBR, NumSubRegions - the number of elements in each PTBR cluster, SubRegionWidth - the genomic coordinates spanning the subregion associated with each PTBR, DistBetweenSubRegions - the genomic distance between the end of the previous PTBR-specific region and the start of the subsequent PTBR-specific region, NormilizedEnrichment - the normalized enrichment of the genomic annotations used in the model around flanked PTBPs, and BaseProbs - a numeric vector of probabilities for each corresponding base coordinate. Normalized enrichment is calculated as the total number of peak regions that overlap with flanked predicted boundaries divided by the number of predicted boundaries. A schematic of how each of the summaries are calculated is presented in Figure 2.
The vector of base-specific probabilities of being a boundary (BaseProbs
) may be large, depending on the chromCoords
settings. In our example, it has 2 million entries. Be careful when setting BaseProbs = TRUE
to avoid memory issues. To see how the probabilities for the first five hundred bases look like, we do:
plot(pt$Summaries$BaseProbs[1:100000])
We can see that the probability of being a boundary seems to peak around 200 and 400 bases. It is well below 1, but may be worth further investigation.
Juicebox is an interface provided by Aiden Lab that allows for superimposing boundary coordinates onto Hi-C contact maps. To visualize domains flanked by the predicted boundaries, you must first select a Hi-C map. As an example, you can import the contact matrix for GM12878 derived by Rao et al. 2014 by choosing File -> Open... -> Rao and Huntley et al. -> GM12878 -> in situ Mbol -> primary
. To format preciseTAD
results to use in Juicebox, users can take advantage of preciseTAD::juicer_func()
, which is a function available in the preciseTAD
R package that transforms a GRanges object into a data frame as shown below.
# Transform pt_juice <- juicer_func(pt$PTBP)
You will then need to save the PTBPs as a BED file using write.table
as shown below. Once saved, import them into Juicebox using Show Annotation Panel -> 2D Annotations -> Add Local -> pt_juice.bed
.
filepath = "~/path/to/store/ptbps" write.table(pt_juice, file.path(filepath, "pt_juice.bed"), quote = FALSE, col.names = FALSE, row.names = FALSE, sep = "\t")
preciseTAD
allows us to use the pre-trained model to predict the precise location of domain boundaries in cell types without Hi-C data but with genome annotation data. Specifically, only the cell type-specific ChIP-seq data (BED format) for CTCF, RAD21, SMC3, and ZNF143 transcription factor binding sites is required. These data need to be assembled in a GRanges object using the aforementioned bedToGRangesList()
function (i.e., creating another cell-type-specific tfbsList_filt
object).
The pre-trained model tadModel
can be obtained by using bedToGRangesList()
, extractBoundaries()
, createTADdata()
, and TADrandomForest()
functions, as described. Alternatively, chromosome-specific pre-trained models can be downloaded from the OneDrive storage. We recommend models pre-trained using GM12878 genome annotation data and Peakachu peaks. Note that each model is annotated with "CHRi_holdout", representing the chromosome held out of training, and thus to be used to predict boundaries. It should be noted that the pretrained preciseTAD models were trained on cancer cell lines (GM12878 and K562). Although we expect that the fundamental rules of domain boundary formation (CTCF etc. binding) will be preserved in cancer cells, use caution when applying the pretrained models to predict boindaries in normal cell lines.
As an example, we consider the Hela-S3 cell line. We will download and use the CHR22_GM12878_5kb_Peakachu model to predict boundaries for chr22 in the Hela-S3 cell line.
# Read in RF model built on chr1-chr21 tadModel <- readRDS("~/Downloads/CHR22_GM12878_5kb_Peakachu.rds")
The Hela-S3-specific BED-formatted TFBS data can be downloaded at ENCODE. The selected TFBS data can be downloaded from the supplementary repository. They can be converted into a GRangesList.
# Establishing Hela-specific genomic annotations # Reading in CTCF, RAD21, SMC3, and ZNF143 from repository and storing as list ctcf <- as.data.frame(read.delim("http://raw.githubusercontent.com/stilianoudakis/preciseTAD_supplementary/master/data/bed/hela/ctcf.bed", sep = "\t", header = FALSE)) rad21 <- as.data.frame(read.delim("http://raw.githubusercontent.com/stilianoudakis/preciseTAD_supplementary/master/data/bed/hela/rad21.bed", sep = "\t", header = FALSE)) smc3 <- as.data.frame(read.delim("http://raw.githubusercontent.com/stilianoudakis/preciseTAD_supplementary/master/data/bed/hela/smc3.bed", sep = "\t", header = FALSE)) znf143 <- as.data.frame(read.delim("http://raw.githubusercontent.com/stilianoudakis/preciseTAD_supplementary/master/data/bed/hela/znf143.bed", sep = "\t", header = FALSE)) helaList <- list(ctcf, rad21, smc3, znf143) hela.GR <- bedToGRangesList(bedList=helaList, bedNames=c("Ctcf", "Rad21", "Smc3", "Znf143"), signal=5)
Now we are ready to predict boundaries on chr22 for Hela-S3.
# Run preciseTAD set.seed(123) pt <- preciseTAD(genomicElements.GR = hela.GR, featureType = "distance", CHR = "CHR22", chromCoords = list(25500000, 27500000), tadModel = tadModel[[1]], threshold = 1.0, verbose = FALSE, parallel = NULL, DBSCAN_params = list(30000, 100), slope = 5000, BaseProbs = FALSE, savetobed = FALSE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.