In HarteminkLab/TOP: Predict transcription factor occupancy using DNase- or ATAC-seq data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = '#>'
)

Load TOP R package

library(TOP)

Load data

To make predictions using pretrained TOP models, we first need to prepare the input data as a data frame, with six columns.

Columns from the left to right are PWM scores and five DNase (or ATAC) bins.

You can follow this page to prepare the input data.

Let's load an example ATAC-seq data to predict CTCF occupancy in K562 cell type.

This data contains binned ATAC-seq counts around CTCF motif matches in chr1.

data <- readRDS(system.file("extdata/example_data", "CTCF.K562.ATAC.chip.example.data.rds", package = "TOP"))
cols <- c('chr','start','end','name','pwm.score','strand','p.value', paste0('bin', 1:5))
data <- data[, cols]

head(data,3)

Predict TF occupancy using pretrained TOP model

We can predict TF occupancy for a TF in a cell type using the predict_TOP() function.

We have trained TOP models for ATAC-seq and DNase-seq (Duke and UW) data using ENCODE data, and included pre-trained TOP model coefficients (posterior mean) within the package, also on our companion resources website. We will also train new models when new data become available, and update the model parameters on our companion site.

If you want to use the pre-trained TOP model coefficients included in the package, you simply specify the model to use, by setting use_model as "ATAC", "DukeDNase", or "UwDNase".

For example, here we chose the pretrained "ATAC" model, as we have ATAC-seq data.

We could use the "bottom" level regression coefficients as we know that have CTCF ChIP data from K562 cell type in our training set.

TOP_result <- predict_TOP(data, 
                          tf_name = 'CTCF', 
                          cell_type = 'K562', 
                          use_model = 'ATAC', 
                          level = 'bottom', 
                          logistic_model = FALSE, 
                          transform = 'asinh')

Or, we can let the software to choose the "best" available level of coefficients for the TF and cell type of interest. In this case, it indeed selected the "bottom" level.

TOP_result <- predict_TOP(data, 
                          tf_name = 'CTCF', 
                          cell_type = 'K562', 
                          use_model = 'ATAC', 
                          level = 'best', 
                          logistic_model = FALSE, 
                          transform = 'asinh')

If you want to use your own model, you can specify TOP_coef to your TOP model coefficients. Check this and this tutorials if you want to train your own models.

For example, we can load TOP quantitative occupancy model for ATAC-seq data included in the packge:

TOP_coef <- readRDS(system.file("extdata/trained_model_coef/ATAC", "TOP_M5_posterior_mean_coef.rds", package = "TOP"))

We can then make predictions by setting TOP_coef to the specified model.

TOP_result <- predict_TOP(data, 
                          TOP_coef = TOP_coef, 
                          tf_name = 'CTCF', 
                          cell_type = 'K562', 
                          level = 'best', 
                          logistic_model = FALSE, 
                          transform = 'asinh')

This result contains a list with the level of model, regression coefficients (posterior mean) used to make predictions, and predicted TF occupancy.

The prediction result contains the training data together with predicted TF occupancy for each candidate site:

TOP_predictions <- TOP_result$predictions
head(TOP_predictions, 5)

We can make a scatterplot of the predicted occupancy vs. measured occupancy.

data_chip <- readRDS(system.file("extdata/example_data", "CTCF.K562.ATAC.chip.example.data.rds", package = "TOP"))
scatterplot_predictions(x = asinh(data_chip$chip),
                        y = asinh(TOP_predictions$predicted),
                        title = 'Predicting CTCF occupancy in K562 cell',
                        xlab = 'asinh(measured occupancy)',
                        ylab = 'asinh(predicted occupancy using bottom level coefficients)',
                        xlim = c(0,8),
                        ylim = c(0,8))

We can also try the "middle" level regression coefficients for CTCF, which are cell-type generic, thus could be used to predict CTCF in other cell types.

TOP_middle_result <- predict_TOP(data, 
                                 TOP_coef = TOP_coef, 
                                 tf_name = 'CTCF', 
                                 level = 'middle', 
                                 logistic_model = FALSE, 
                                 transform = 'asinh') 
TOP_middle_predictions <- TOP_middle_result$predictions

We can see the performance of the "middle" level model is very close to the "bottom" level model.

scatterplot_predictions(x = asinh(data_chip$chip),
                        y = asinh(TOP_middle_predictions$predicted),
                        title = 'Predicting CTCF occupancy in K562 cell',
                        xlab = 'asinh(measured occupancy)',
                        ylab = 'asinh(predicted occupancy using middle level coefficients)',
                        xlim = c(0,8),
                        ylim = c(0,8))

Predict TF binding probability using trained TOP logistic model

We can also predict TF binding probability for these candidate sites using pre-trained TOP logistic model, by setting logistic_model = TRUE.

TOP_result <- predict_TOP(data, 
                          tf_name = 'CTCF',
                          cell_type = 'K562',
                          use_model = 'ATAC',
                          level = 'best',
                          logistic_model = TRUE)
logistic_predicted <- TOP_result$predictions
head(logistic_predicted, 5)

HarteminkLab/TOP documentation built on July 27, 2023, 6:14 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

HarteminkLab/TOP
Predict transcription factor occupancy using DNase- or ATAC-seq data

In HarteminkLab/TOP: Predict transcription factor occupancy using DNase- or ATAC-seq data

Load TOP R package

Load data

Predict TF occupancy using pretrained TOP model

Predict TF binding probability using trained TOP logistic model

R Package Documentation

Browse R Packages

We want your feedback!

HarteminkLab/TOP Predict transcription factor occupancy using DNase- or ATAC-seq data

In HarteminkLab/TOP: Predict transcription factor occupancy using DNase- or ATAC-seq data

Load TOP R package

Load data

Predict TF occupancy using pretrained TOP model

Predict TF binding probability using trained TOP logistic model

R Package Documentation

Browse R Packages

We want your feedback!

HarteminkLab/TOP
Predict transcription factor occupancy using DNase- or ATAC-seq data