knitr::opts_chunk$set( collapse = TRUE, comment = '#>' )
library(TOP)
To make predictions using pretrained TOP models, we first need to prepare the input data as a data frame, with six columns.
Columns from the left to right are PWM scores and five DNase (or ATAC) bins.
You can follow this page to prepare the input data.
Let's load an example ATAC-seq data to predict CTCF occupancy in K562 cell type.
This data contains binned ATAC-seq counts around CTCF motif matches in chr1.
data <- readRDS(system.file("extdata/example_data", "CTCF.K562.ATAC.chip.example.data.rds", package = "TOP")) cols <- c('chr','start','end','name','pwm.score','strand','p.value', paste0('bin', 1:5)) data <- data[, cols]
head(data,3)
We can predict TF occupancy for a TF in a cell type using
the predict_TOP()
function.
We have trained TOP models for ATAC-seq and DNase-seq (Duke and UW) data using ENCODE data, and included pre-trained TOP model coefficients (posterior mean) within the package, also on our companion resources website. We will also train new models when new data become available, and update the model parameters on our companion site.
If you want to use the pre-trained TOP model coefficients included in the package,
you simply specify the model to use, by setting use_model
as "ATAC",
"DukeDNase", or "UwDNase".
For example, here we chose the pretrained "ATAC" model, as we have ATAC-seq data.
We could use the "bottom" level regression coefficients as we know that have CTCF ChIP data from K562 cell type in our training set.
TOP_result <- predict_TOP(data, tf_name = 'CTCF', cell_type = 'K562', use_model = 'ATAC', level = 'bottom', logistic_model = FALSE, transform = 'asinh')
Or, we can let the software to choose the "best" available level of coefficients for the TF and cell type of interest. In this case, it indeed selected the "bottom" level.
TOP_result <- predict_TOP(data, tf_name = 'CTCF', cell_type = 'K562', use_model = 'ATAC', level = 'best', logistic_model = FALSE, transform = 'asinh')
If you want to use your own model, you can specify TOP_coef
to
your TOP model coefficients.
Check this and this
tutorials if you want to train your own models.
For example, we can load TOP quantitative occupancy model for ATAC-seq data included in the packge:
TOP_coef <- readRDS(system.file("extdata/trained_model_coef/ATAC", "TOP_M5_posterior_mean_coef.rds", package = "TOP"))
We can then make predictions by setting TOP_coef
to the specified model.
TOP_result <- predict_TOP(data, TOP_coef = TOP_coef, tf_name = 'CTCF', cell_type = 'K562', level = 'best', logistic_model = FALSE, transform = 'asinh')
This result contains a list with the level of model, regression coefficients (posterior mean) used to make predictions, and predicted TF occupancy.
The prediction result contains the training data together with predicted TF occupancy for each candidate site:
TOP_predictions <- TOP_result$predictions head(TOP_predictions, 5)
We can make a scatterplot of the predicted occupancy vs. measured occupancy.
data_chip <- readRDS(system.file("extdata/example_data", "CTCF.K562.ATAC.chip.example.data.rds", package = "TOP")) scatterplot_predictions(x = asinh(data_chip$chip), y = asinh(TOP_predictions$predicted), title = 'Predicting CTCF occupancy in K562 cell', xlab = 'asinh(measured occupancy)', ylab = 'asinh(predicted occupancy using bottom level coefficients)', xlim = c(0,8), ylim = c(0,8))
We can also try the "middle" level regression coefficients for CTCF, which are cell-type generic, thus could be used to predict CTCF in other cell types.
TOP_middle_result <- predict_TOP(data, TOP_coef = TOP_coef, tf_name = 'CTCF', level = 'middle', logistic_model = FALSE, transform = 'asinh') TOP_middle_predictions <- TOP_middle_result$predictions
We can see the performance of the "middle" level model is very close to the "bottom" level model.
scatterplot_predictions(x = asinh(data_chip$chip), y = asinh(TOP_middle_predictions$predicted), title = 'Predicting CTCF occupancy in K562 cell', xlab = 'asinh(measured occupancy)', ylab = 'asinh(predicted occupancy using middle level coefficients)', xlim = c(0,8), ylim = c(0,8))
We can also predict TF binding probability for these candidate sites using
pre-trained TOP logistic model, by setting logistic_model = TRUE
.
TOP_result <- predict_TOP(data, tf_name = 'CTCF', cell_type = 'K562', use_model = 'ATAC', level = 'best', logistic_model = TRUE) logistic_predicted <- TOP_result$predictions head(logistic_predicted, 5)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.