In dozmorovlab/HiCdiff: HiCdiff: Joint normalization and comparative analysis of multiple Hi-C datasets

options(width = 400)

Introduction

HiCdiff provides functions for the joint normalization and detection of differential chromatin interactions between two or multiple Hi-C datasets. HiCdiff operates on processed Hi-C data in the form of chromosome-specific chromatin interaction matrices. It accepts three-column tab-separated text files storing chromatin interaction matrices in a sparse matrix format (see Creating the hic.table object). Functions to convert popular Hi-C data formats (.hic, .cool) to sparse format are available (see ?cooler2sparse). HiCdiff differs from other packages that attempt to compare Hi-C data in that it works on processed data in chromatin interaction matrix format instead of pre-processed sequencing data. In addition, HiCdiff provides a non-parametric method for the joint normalization and removal of biases between two Hi-C datasets for the purpose of comparative analysis. HiCdiff also provides a simple yet robust permutation method for detecting differences between Hi-C datasets.

The hic_loess function outputs normalized chromatin interactions for both matrices ready for the comparative analysis. The hic_diff function performs the comparative analysis and outputs genomic coordinates of pairs of regions detected as differentially interacting, interaction frequencies, the difference and the corresponding permutation p-value.

How to use `HiCdiff`

Installation

Install HiCdiff from Bioconductor.

source("https://bioconductor.org/biocLite.R")
biocLite("HiCdiff")
library(HiCdiff)

Getting Hi-C Data

You will need processed Hi-C data in the form of sparse upper triangular matrices or BEDPE files in order to use HiCdiff. Data is available from several sources and two examples for downloading and extracting data are listed below. If you have full Hi-C contact matrices you can convert them to sparse upper triangular format using the full full2sparse function as shown in additional functions

Extracting data from `.hic` files

Hi-C data is available from several sources and in many formats. HiCdiff is built to work with the sparse upper triangular matrix format popularized by the lab of Erez Lieberman-Aiden http://aidenlab.org/data.html. If you already have Hi-C data either in the form of a sparse upper triangular matrix or a full contact matrix you can skip to the Creating the hic.table object section. If you obtain data from the Aiden Lab in the .hic format you will need to first extract the matrices that you wish to compare.

Download the straw software from https://github.com/theaidenlab/straw/wiki and install it.
Use straw to extract a Hi-C sparse upper triangular matrix. An example is below:

Say we downloaded the GSE63525_K562_combined_30.hic file from GEO https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63525

To extract the raw matrix corresponding to chromosome 22 at 500kb resolution we would use the following command within the terminal

./straw NONE GSE63525_K562_combined_30.hic 22 22 BP 500000 > K562.chr22.500kb.txt

This will extract the matrix from the .hic file and save it to the K562.chr22.500kb.txt text file, in the sparse upper triangular matrix format. See more examples on on how to use straw at https://github.com/theaidenlab/straw/wiki/CPP#running. Straw requires several inputs for the extraction of data from a .hic file.

<NONE/VC/VC_SQRT/KR> <hicFile(s)> <chr1>[:x1:x2] <chr2>[:y1:y2] <BP/FRAG> <binsize>

The first argument is the normalization method. For use in HiCdiff you want the raw data so you should selected NONE. The second argument is the .hic file name. Next is the chromosome numbers of the matrix you want. For an intrachromosomal contact map both should be the same as in the above example. If you want a matrix of interchromosomal interactions you can use different chromosomes i.e. interactions between chromosome 1 and chromosome 2 (Note that HiCdiff is only meant to be used on intrachromosomal interactions at this point in development). The next argument is whether you want basepair or fragment files. For HiCdiff use BP. The final argument is the binsize of the matrix (the resolution). To extract a matrix at a resolution of 1MB enter 10000000. Typical bin sizes are 1MB, 500KB, 100KB, 50KB, 5KB, 1KB. Note that most matrices with resolutions higher than 100KB (i.e. matrices with resolutions of 1KB - 50KB) are typically too sparse (due to insufficient sequencing coverage) for analysis in HiCdiff.

From here we can just import the matrix into R as you would normally for any tab-delimited file.

Import the data into R K562.chr22 <- read.table('K562.chr22.500kb.txt', header=FALSE)
Repeat these steps for any other Hi-C dataset that you wish to compare to the first dataset using HiCdiff, then proceed to Creating the hic.table object.

Extracting data from `.cool` files

The cooler software, http://cooler.readthedocs.io/en/latest/index.html, allows access to a large collection of Hi-C data. The cooler index ftp://cooler.csail.mit.edu/coolers contains Hi-C data for hg19 and mm9 from many different sources. To use data in the .cool format in HiCdiff follow these steps:

Download and install cooler from http://cooler.readthedocs.io/en/latest/index.html
Download a .cool file from the cooler index ftp://cooler.csail.mit.edu/coolers.
Say we downloaded the Dixon2012-H1hESC-HindIII-allreps-filtered.1000kb.cool file. See cooler dump --help for data extraction options. To extract the contact matrix we use the following commands in the terminal:
cooler dump --join Dixon2012-H1hESC-HindIII-allreps-filtered.1000kb.cool > dixon.hESC.1000kb.txt
Read in the text file as you would any tab-delimited file in R
hesc1000kb <- read.table("dixon.hESC.1000kb.txt", header = FALSE)
Convert to a sparse upper triangular matrix using the HiCdiff::cooler2sparse function.
sparse <- cooler2sparse(hesc1000kb)
Repeat the steps for another Hi-C dataset that you wish to compare to the first dataset then proceed to Creating the hic.table object.

Creating the `hic.table` object {#hic_table}

Sparse upper triangular format

A sparse matrix format represents a relatively compact and human-readable way to store pair-wise interactions. It is a tab-delimited text format containing three columns: "region1" - a start coordinate (in bp) of the first region, "region2" a start coordinate of the second region, and "IF" - the interaction frequency between them (IFs). Zero IFs are dropped (hence, the sparse format). Since the full matrix of chromatin interactions is symmetric, only the upper triangular portion, including the diagonal, is stored.

If you have two full Hi-C contact matrices you can convert them to sparse upper triangular matrices using the HiCdiff::full2sparse function. Once you have two sparse matrices you are ready to create a hic.table object. This will be illustrated using the included sparse matrices at 500kb resolution for chromosome 22 from the HMEC and NHEK cell lines.

library(`HiCdiff`)
# load the data
data("HMEC.chr22")
data("NHEK.chr22")
head(HMEC.chr22)

Now that we have 2 sparse upper triangular matrices we can create the hic.table object. create.hic.table requires the input of 2 sparse matrices and the chromosome name.

# create the `hic.table` object
chr22.table <- create.hic.table(HMEC.chr22, NHEK.chr22, chr = 'chr22')
head(chr22.table)

We can also list multiple hic.tables in order to utilize parallel computing. First we create another hic.table as was done above. Then we combine these two hic.tables into a list.

# create list of `hic.table` objects
data("HMEC.chr10")
data("NHEK.chr10")

# create the `hic.table` object
chr10.table <- create.hic.table(HMEC.chr10, NHEK.chr10, chr = 'chr10')
hic.list <- list(chr10.table, chr22.table)
head(hic.list)

The hic.table object contains a summary of the differences between the two matrices. "IF1" and "IF2" correspond to interaction frequencies in the first and second matrices, "D" is the unit distance (length of each unit is equivalent to the resolution of the data, e.g., 500kb), "M" is the $log_2(IF2)-log_2(IF1)$ difference.

BEDPE format

A hic.table object can also be created using data in the 7 column BEDPE format. An example of BEDPE data for the HMEC dataset used above is shown below.

HMEC.chr22_BEDPE <- chr22.table[, 1:7, with=FALSE]
NHEK.chr22_BEDPE <- chr22.table[, c(1:6, 8), with=FALSE]
head(HMEC.chr22_BEDPE)

To create a hic.table object using BEDPE data is very similar to using data in the sparse upper triangular format.

bed.hic.tab <- create.hic.table(HMEC.chr22_BEDPE, NHEK.chr22_BEDPE)
head(bed.hic.tab)

InteractionSet/GRanges format

A hic.table can also be created using an InteractionSet object. Simply enter two InteractionSets representing two Hi-C matrices into the create.hic.table function and they will be converted to the proper format. See the InteractionSet vignette for creating InteractionSet objects here

data("hmec.IS")
data("nhek.IS")
head(hmec.IS)
IS.hic.tab <- create.hic.table(hmec.IS, nhek.IS)

To shorten computational time, or if one is only interested in a subsection of a Hi-C matrix, one may wish to use a subset of the hic.table object. Use the subset.dist or subset.index options, see help for the create.hic.table function.

Joint Normalization

Now that you have created a hic.table object you can jointly normalize your two Hi-C matrices. The hic_loess function has many options and can accept a single hic.table or a list of hic.tables. If for example you wish to perform joint normalization for every chromosome on two cell lines while utilizing parallel computing you can create a list containing the hic.tables for each chromosome.

To change the degree of the polynomial for the loess joint normalization you can utilize the degree option, default is 1 (linear regression). A user-defined span, or the amount of data used to build the loess model, can also be set with the span option. However if span = NA (the default) the automatic smoothing parameter selection process will run and determine the optimal span for the data. The type of selection process can be changed using the loess.criterion option. Available settings are gcv (the default) for generalized cross-validation or aicc for Akaike information criterion. The loess automatic smoothing parameter selection uses a customized version of the fANCOVA::loess.as function. For more information on parameter selection please see the fANCOVA reference manual. It is recommended to use the default settings.

hic_loess can utilize the BiocParallel package to perform parallel computing and lower computation time. The parallel option (FALSE by default) will only speed up computation if a list of hic.tables is entered into the function, i.e., it parallelizes processing of several chromosome-specific matrices. This is useful for performing joint normalization for every chromosome between two Hi-C datasets. For more information on BiocParallel see the reference manual here.

The basis of HiCdiff rests on the novel concept termed the MD plot. The MD plot is similar to the MA plot or the Bland-Altman plot. $M$ is the log2 difference between the interaction frequencies from the two datasets. $D$ is the unit distance between the two interacting regions. Loess is performed on the data after it is represented in the MD coordinate system. To visualize the MD plot the Plot option can be set to TRUE.

If you wish to detect differences between the two Hi-C datasets immediately following joint normalization you can set the check.differences option to TRUE. However, if you only want to jointly normalize the data for now keep this option set to FALSE. The difference detection process will be covered in the next section.

# Jointly normalize data for a single chromosome
hic.table <- hic_loess(chr22.table, Plot = TRUE)
head(hic.table)

# Multiple hic.tables can be processed in parallel by entering a list of hic.tables
hic.list <- hic_loess(hic.list, parallel = TRUE)

The hic_loess joint normalization function extends the hic.table with the adjusted interaction frequencies, adjusted "M", and the "mc" correction factor.

Difference Detection

Difference detection also makes use of the MD plot. The jointly normalized datasets are once again plotted on the MD plot. Difference detection takes place on a per-unit-distance basis using a permutation test. Permutations are broken into blocks for each unit distance. See methods section of INSERT PAPER HERE for more details. Difference detection can be performed at the same time as joint normalization if the check.differences option in hic_loess is set to TRUE. Otherwise a hic.table object output from the hic_loess function can be input into the hic_diff function for difference detection.

hic_diff accepts a hic.table object or a list of hic.table objects output from hic_loess. hic_diff can also utilize the BiocParallel package the same way as the hic_loess loess function. The same limits and uses apply as stated above. If the Plot option is set to TRUE, an MD plot will be displayed showing the normalized MD plot along with coloring based on the significance of each difference. The dotted lines on this MD plot represent the diff.thresh threshold. This threshold is automatically calculated as 2 standard deviations of M from 0 on the MD plot by default but can be set by the user or turned off. This option helps to deal with false positives by restricting the regions that can be labelled as significant when they fall inside the threshold. The threshold can be thought of as a range in which differences do not exceed the level of the expected noise in the Hi-C data. The function will return a hic.table or a list of hic.tables with an additional column containing the p-value for the significance of the difference between the IFs in each cell of the two Hi-C matrices.

If you wish to visualize the p-values in the form of a Hi-C contact map you can use the sparse2full function to transform the results to a contact matrix with the p-values inserted instead of IFs as illustrated in the example below.

# input hic.table object into hic_diff
hic.table <- hic_diff(hic.table, Plot = TRUE)

# input a list of hic.tables
hic.list <- hic_diff(hic.list)

# transform results into contact map of p-values
library(pheatmap)
m <- sparse2full(hic.table, hic.table = TRUE, column.name = 'p.value') # convert to full matrix
fc <- sparse2full(hic.table, hic.table = TRUE, column.name = 'M')
pheatmap(-log10(m) * sign(fc), cluster_rows = FALSE, cluster_cols = FALSE)

Converting `HiCdiff` results to `InteractionSet` objects

If after running hic_loess or hic_diff on your data you wish to perform additional analyses which require the GRanges object class you can convert the results using the make_InteractionSet function. This function will produce an InteractionSet object with the genomic ranges contained in the hic.table along with several metadata files containing the additional information produced by hic_loess or hic_diff.

IntSet <- make_InteractionSet(hic.table)

Simulating Hi-C Data

HiCdiff includes functions for simulating Hi-C data. The hic_simulate function allows you to simulate two Hi-C matrices with added bias and true differences at a specified fold change. As an example we will simulate two matrices with 250 true differences added at a fold change of 4.

number_of_unitdistances <- 100 # The dimensions of the square matrix to be simualted
number_of_changes       <- 250 # How many cells in the matrix will have changes
i.range <- sample(1:number_of_unitdistances, number_of_changes, replace = TRUE) # Indexes of cells to have controlled changes
j.range <- sample(1:number_of_unitdistances, number_of_changes, replace = TRUE) # Indexes of cells to have controlled changes

sim_results <- hic_simulate(nrow = number_of_unitdistances, medianIF = 50000, sdIF = 14000, powerlaw.alpha = 1.8, fold.change = 4, i.range = i.range, j.range = j.range, Plot = TRUE, alpha = 0.1)

The results of the simulation are saved in a list.

names(sim_results)

TPR is the true positive rate, SPC is the specificity, pvals is a vector of the p-values for every cell in the matrices, hic.table is the resulting hic.table object after hic_loess and hic_diff have been applied to the simulated data, true.diff is a table for the cells that had the specified fold change applied to them, truth is a vector of 0's and 1's indicating if a cell had a true difference applied - this is useful for input into ROC packages, sim.table is the simulated data in a hic.table object before being scaled, normalized, and analyzed for differences.

Additional Functions {#addfunc}

HiCdiff contains some additional functions that may be useful.

If you do not choose to show the MD plots when you initially run hic_loess or hic_diff you can use the MD.plot1 or MD.plot2 functions. MD.plot1 will create a side by side MD plot showing before and after loess normalization. Enter your original M and D vectors along with the M correction factor, mc, calculated by hic_loess.

MD.plot1(M = hic.table$M, D = hic.table$D, mc = hic.table$mc)

MD.plot2 will create a standard MD plot with optional colouring based on p-value. Just enter an M and D vector and a p-value vector if desired.

# no p-value coloring
MD.plot2(M = hic.table$adj.M, D = hic.table$D)

# p-value coloring
MD.plot2(M = hic.table$adj.M, D = hic.table$D, hic.table$p.value)

There are two matrix transformation functions included. sparse2full will transform a sparse upper triangular matrix to a full Hi-C matrix. full2sparse will transform a full Hi-C matrix to sparse upper triangular format.

full.NHEK <- sparse2full(NHEK.chr22)
full.NHEK[1:5, 1:5]

sparse.NHEK <- full2sparse(full.NHEK)
head(sparse.NHEK)

KRnorm will perform Knight-Ruiz normalization on a Hi-C matrix. Just enter the full Hi-C matrix to be normalized.

KR.NHEK <- KRnorm(full.NHEK)

SCN will perform Sequential Component Normalization on a Hi-C matrix. Just enter the full Hi-C matrix to be normalized.

SCN.NHEK <- SCN(full.NHEK)

MA_norm will perform MA normalization on a hic.table object.

result <- MA_norm(hic.table, Plot = TRUE)

dozmorovlab/HiCdiff documentation built on May 20, 2019, 11:13 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

dozmorovlab/HiCdiff
HiCdiff: Joint normalization and comparative analysis of multiple Hi-C datasets

In dozmorovlab/HiCdiff: HiCdiff: Joint normalization and comparative analysis of multiple Hi-C datasets

Introduction

How to use `HiCdiff`

Installation

Getting Hi-C Data

Extracting data from `.hic` files

Extracting data from `.cool` files

Creating the `hic.table` object {#hic_table}

Sparse upper triangular format

BEDPE format

InteractionSet/GRanges format

Joint Normalization

Difference Detection

Converting `HiCdiff` results to `InteractionSet` objects

Simulating Hi-C Data

Additional Functions {#addfunc}

R Package Documentation

Browse R Packages

We want your feedback!

dozmorovlab/HiCdiff HiCdiff: Joint normalization and comparative analysis of multiple Hi-C datasets

In dozmorovlab/HiCdiff: HiCdiff: Joint normalization and comparative analysis of multiple Hi-C datasets

Introduction

How to use HiCdiff

Installation

Getting Hi-C Data

Extracting data from .hic files

Extracting data from .cool files

Creating the hic.table object {#hic_table}

Sparse upper triangular format

BEDPE format

InteractionSet/GRanges format

Joint Normalization

Difference Detection

Converting HiCdiff results to InteractionSet objects

Simulating Hi-C Data

Additional Functions {#addfunc}

R Package Documentation

Browse R Packages

We want your feedback!

dozmorovlab/HiCdiff
HiCdiff: Joint normalization and comparative analysis of multiple Hi-C datasets

How to use `HiCdiff`

Extracting data from `.hic` files

Extracting data from `.cool` files

Creating the `hic.table` object {#hic_table}

Converting `HiCdiff` results to `InteractionSet` objects