knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.height = 5,
    fig.width = 6,
    fig.align = "center"
)

Introduction

In this vignette, we demonstrate the workflow for finding and visualizing differential loops between two biological conditions using the hictoolsr, DESeq2, and plotgardener packages. First, data pre-processing is required to obtain .hic files for each biological replicate and condition along with looping interactions. Loops are then merged between conditions and interaction frequency counts are extracted between loop anchors for each condition and replicate. Differential loops can then be called with DESeq2 and the results can be visualized as aggregate peak analysis (APA) plots.

As an example, we will use data from the paper, "Phase separation drives aberrant chromatin looping and cancer development", by Ahn et al. 2021 using GEO links GSE143465 and GSE143465. This paper explores the oncogenic mechanism of a rare fusion protein in acute myeloid leukemia (AML). The fusion protein, NUP98-HOXA9 (NHA9), contains a DNA-binding domain fused to an intrisically disordered region (IDR) which forms phase-separated condensates and leads to changes in 3D chromatin structure and deregulated gene expression. To explore how phase separation leads to changes in chromatin structure the IDR of NHA9 was made incapable of phase separation by mutating phenylalanine (F) amino acid residues to serines (S), then expressed in HEK293T cells (FS mutant) along with the wildtype (WT). Hi-C, ChIP-seq, and RNA-seq was performed in both WT and FS cell lines to compare chromatin structure, NHA9 binding, and gene expression in response to phase separation.

Here, we demonstrate how to find differential loops between the WT and FS biological conditions and visualize the results with aggregate peak analysis plots.

Pre-processing data

Before we can find differential loops, we must first process raw .fastq files into .hic files and identify significant looping interactions. These are pre-processing steps that are conducted outside of hictoolsr. The following sections outline how to process and call loops from Hi-C data using available tools.

Processing raw Hi-C data with dietJuicer

Raw Hi-C fastqs can be converted into .hic format using the dietJuicer pipeline. The data should be processed at two levels:

  1. Each replicate should be processed individually such that there are .hic files for each. This is necessary for count extraction used as input to DESeq. Note the exception that sequencing replicates should always be merged, even for count extraction. Follow the directions under the dietJuicerCore workflow to process these samples.
  2. Replicates can then be merged into a "mega" .hic file. These .hic files are used as input for loop calling. Follow the directions under the dietJuicerMerge workflow to create merged .hic maps.

Calling loops with SIP

After creating merged .hic files with dietJucierMerge, SIP (Significant Interaction Peak caller) can be used to identify looping interactions.

Usage:

java -jar SIP_HiC.jar hic <hicFile> <chrSizeFile> <Output> <juicerToolsPath> [options]

Example (submitting job to UNC's longleaf cluster):

sbatch -p general -t 4320 --mem=8G --wrap="java -jar /proj/phanstiel_lab/software/SIP/SIP_HiC_v1.6.1.jar hic /path/to/file.hic /proj/phanstiel_lab/software/resources/hg19_chromSizes_filt.txt /path/to/output/directory /proj/phanstiel_lab/software/juicer/scripts/juicer_tools.jar -g 2.0 -t 2000 -fdr 0.05"

Another loop caller, HiCCUPS, was used to call loops in the data at GSE143465 from Ahn et al. 2021. Since merging functions in hictoolsr require columns that are only generated through SIP, loops called with SIP are included as example data in hictoolsr.

Merging loops

We called loops in both conditions to identify loops that are unique to each dataset. However, there are often duplicate loops that are present in both datasets. It is important to merge these together to avoid testing duplicate loops and to catch all unique loops. The mergeBedpe() function does this, using DBSCAN to combine duplicate loops that are shifted slightly between conditions.

## Load packages
library(hictoolsr)
library(dbscan)

## Define WT and FS loop file paths
wt_loops <- system.file("extdata/WT_5kbLoops.txt", package = "hictoolsr")
fs_loops <- system.file("extdata/FS_5kbLoops.txt", package = "hictoolsr")

## Merge loops and convert to GInteractions
loops <- 
  mergeBedpe(bedpeFiles = c(wt_loops, fs_loops), res = 10e3) |>
  as_ginteractions()

head(loops)

Extracting Hi-C counts

DESeq2 requires a counts table from replicate .hic files to call differential loops. The code below shows how to extract these counts remotely using the GEO links to each replicate .hic file. Alternatively, these files can be downloaded and the paths to each file can be supplied (recommended due to internet instability).

## Hi-C file paths from GEO
hicFiles <- 
  c("https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4259nnn/GSM4259896/suppl/GSM4259896_HEK_HiC_NUP_IDR_WT_A9_1_1_inter_30.hic",
    "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4259nnn/GSM4259897/suppl/GSM4259897_HEK_HiC_NUP_IDR_WT_A9_1_2_inter_30.hic",
    "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4259nnn/GSM4259898/suppl/GSM4259898_HEK_HiC_NUP_IDR_WT_A9_2_1_inter_30.hic",
    "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4259nnn/GSM4259899/suppl/GSM4259899_HEK_HiC_NUP_IDR_WT_A9_2_2_inter_30.hic",
    "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4259nnn/GSM4259900/suppl/GSM4259900_HEK_HiC_NUP_IDR_FS_A9_1_1_inter_30.hic",
    "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4259nnn/GSM4259901/suppl/GSM4259901_HEK_HiC_NUP_IDR_FS_A9_1_2_inter_30.hic",
    "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4259nnn/GSM4259902/suppl/GSM4259902_HEK_HiC_NUP_IDR_FS_A9_2_1_inter_30.hic",
    "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4259nnn/GSM4259903/suppl/GSM4259903_HEK_HiC_NUP_IDR_FS_A9_2_2_inter_30.hic")

## Extract Hi-C counts between loop pixels
loopCounts <- extractCounts(bedpe = loops,
              hic = hicFiles,
              chroms = c(1:22, "X"),
              res = 10e3,
              norm = 'NONE',
              matrix = 'observed')

Since extracting counts can take some time, an example dataset has been packaged with pre-extracted counts for the code shown above.

data("loopCounts")

Extract counts takes the basename of the filepath that is supplied as the column. Lets simplify the column names:

## Load package
library(InteractionSet)

## Simplify column names
colnames(mcols(loopCounts)) <- 
  gsub(pattern = "GSM.*_IDR_(WT|FS)_A9_(1|2)_(1|2)_.*", 
       replacement = "\\1_\\2_\\3",
       x = colnames(mcols(loopCounts)))

head(loopCounts)

Differential loop calling

The following code demonstrates how to call differential loops with DESeq2 using the loopCounts object created above. First we isolate the count matrix from loop counts:

## Load package
library(DESeq2)

## Isolate count matrix
cnts <- 
  mcols(loopCounts)[grep("WT|FS", colnames(mcols(loopCounts)))] |>
  as.matrix()

head(cnts)

Then we create column data using the column names from the counts matrix:

## Create colData from column names
colData <- 
  do.call(rbind, strsplit(x = colnames(cnts), split = "_")) |>
  as.data.frame(stringsAsFactors = TRUE) |>
  `colnames<-`(value = c("condition", "biorep", "techrep"))

colData

Next we can build a DESeq data set and compare differential loops between the "WT" and "FS" conditions:

## Build DESeq data set
dds <- 
  DESeqDataSetFromMatrix(countData = cnts,
                         colData = colData,
                         design = ~techrep + biorep + condition)

## Run DEseq analysis
res <-
  DESeq(dds) |>
  lfcShrink(coef = "condition_WT_vs_FS", type="apeglm")

summary(res)  

We can explore the differential results with an MA-plot:

plotMA(res)

or a PCA plot:

plotPCA(vst(dds), intgroup = "condition") + ggplot2::theme(aspect.ratio = 1)

Let's add the differential output from DESeq2 back to our loopCounts object, then separate WT-specific and FS-specific loops use a BH-adjusted pvalue of 0.01 and log2FoldChange above or below 0:

## Attach DESeq2 results
mcols(loopCounts) <- cbind(mcols(loopCounts), res)

## Separate WT/FS-specific loops
wtLoops <- loopCounts[loopCounts$padj <= 0.01 &
                         loopCounts$log2FoldChange > 0]

fsLoops <- loopCounts[loopCounts$padj <= 0.01 &
                         loopCounts$log2FoldChange < 0]

summary(wtLoops)
summary(fsLoops)

Visualizing with Aggregate Peak Analysis

Now that we have identified differential loops between "WT" and "FS" conditions we can perform an aggregate (or average) peak analysis (APA) to visualize the quality of loop calls. APAs are pileup plots of contact frequency matrices from Hi-C data surrounding the central loop pixels.

Calculating APAs

APAs extract and aggregate matrices around a Hi-C pixel at a given resolution (res) and number of pixels in either direction (buffer). For example, if you want to extract a 21x21 matrix at 10-kb resolution you should set res = 10e3 and buffer = 10. "Short" interactions that are too close to the diagonal must be filtered out to avoid aggregation errors. The filterBedpe() function calculates which interactions would intersect the diagonal and filters them out. In the code below, we assemble our loop categories into a list and apply this filtering for a resolution of 10-kb and a buffer of 10:

## Assemble all, WT, and FS loops into a list
loopList <- 
  list(allLoops = loopCounts,
       wtLoops = wtLoops,
       fsLoops = fsLoops)

## Define resolution and buffer (pixels from center)
res <- 10e3
buffer <- 10

## Filter out short loop interactions
filteredLoops <- 
  lapply(X = loopList,
         FUN = filterBedpe,
         res = res,
         buffer = buffer) |>
  `names<-`(value = names(loopList))

lapply(filteredLoops, summary)

You will notice that many of these interactions are close to the diagonal and are filtered out. The next block of code demonstrates how to apply the calcApa() function to the list of filtered loops to extract and aggregate counts from the replicate-merged Hi-C files of the "WT" and "FS" conditions:

## Hi-C file paths from GEO
wtHicPath <- "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE143nnn/GSE143465/suppl/GSE143465_HEK_HiC_NUP_IDR_WT_A9_megaMap_inter_30.hic"
fsHicPath <- "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE143nnn/GSE143465/suppl/GSE143465_HEK_HiC_NUP_IDR_FS_A9_megaMap_inter_30.hic"

## Calculate APA matrices for loops from WT Hi-C data
loopApaWtHic <-
  lapply(X = filteredLoops,
         FUN = calcApa,
         hic = wtHicPath,
         norm = "KR",
         buffer = buffer)

## Calculate APA matrices for loops from FS Hi-C data
loopApaFsHic <-
  lapply(X = filteredLoops,
         FUN = calcApa,
         hic = fsHicPath,
         norm = "KR",
         buffer = buffer)

Since calculating the APA matrices can take some time, an example dataset has been packaged with pre-computed APA matrices for the code shown above.

data("loopApaWtHic")
data("loopApaFsHic")

lapply(loopApaWtHic, dim)
lapply(loopApaFsHic, dim)

The last step before visualizing these matrices is to normalize the summed values to the number of loops in each category so that the interpretation becomes the average signal per loop. This will also put the plots on a similar scale for visualizing together.

## Get the number of loops for each condition
nLoops <- lapply(filteredLoops, length)

## Divide each matrix by nLoops
loopApaWtHic <- Map("/", loopApaWtHic, nLoops)
loopApaFsHic <- Map("/", loopApaFsHic, nLoops)

Visualizing with ggplot2

To visualize the results in ggplot2 we must first convert the matrix to long format.

## Convert matrix to long-format
long <- 
  loopApaWtHic$allLoops |>
  as.table() |>
  as.data.frame() |>
  setNames(c('rows', 'cols', 'counts'))

## Visualize with ggplot2
library(ggplot2)
ggplot(data = long,
       mapping = aes(x = rows, y = cols, fill = counts)) +
  scale_fill_distiller(palette = 'YlGnBu', direction = 1) + 
  geom_tile() +
  theme(aspect.ratio=1, axis.text.x = element_text(angle = 45, hjust=1))

Notice that we can flip the matrix by using rev() on either the rows or columns to change the orientation of the Hi-C diagonal:

## Flip the matrix
library(ggplot2)
ggplot(data = long,
       mapping = aes(x = rev(rows), y = cols, fill = counts)) +
  scale_fill_distiller(palette = 'YlGnBu', direction = 1) + 
  geom_tile() +
  theme(aspect.ratio=1, , axis.text.x = element_text(angle = 45, hjust=1))

We can apply this to all matrices in our list and combine the datasets into a "tidy" data form for visualizing with ggplot2:

## Define function to convert a matrix to long format
toLong <- \(x) {
  x |>
    as.table() |>
    as.data.frame() |>
    setNames(c('rows', 'cols', 'counts'))
}

## Apply function to convert all matrices to long format
apas <- list(WT = lapply(loopApaWtHic, toLong),
             FS = lapply(loopApaFsHic, toLong))

## Add loopType to each data.frame and combine
apas <- lapply(apas, \(x) do.call(rbind, Map(cbind, x, loopType = names(x))))

## Add hicMap to each data.frame and combine
apas <- do.call(rbind, Map(cbind, apas, hicMap = names(apas)))

## Reorder factors
apas$loopType <- factor(x = apas$loopType,
                        levels = c("allLoops", "wtLoops", "fsLoops"))
apas$hicMap <- factor(x = apas$hicMap,
                      levels = c("WT", "FS"))

## Visualize with ggplot2
library(ggplot2)
ggplot(data = apas,
       mapping = aes(x = rows, y = cols, fill = counts)) +
  scale_fill_distiller(palette = 'YlGnBu', direction = 1) + 
  facet_grid(hicMap~loopType, scales = "free") +
  geom_tile() +
  theme(aspect.ratio=1, axis.text.x = element_text(angle = 45, hjust=1))

Visualizing with plotgardener

plotgardener is a genomics plotting package that allows for more flexibility than ggplot2. As part of the plotgardener ecosystem, hictoolsr provides a plotApa() function that is compatible with other plotgardener functions. Additionally, plotApa() can works on matrices and doesn't first require converting to long-format.

Here is an example of quickly visualizing with plotApa() using a palette from RColorBrewer:

library(RColorBrewer)

plotApa(apa = loopApaWtHic$allLoops,
        palette = colorRampPalette(brewer.pal(9, 'YlGnBu')))

By supplying positional information (e.g. x, y, width, height, etc...) plotgardener will switch to multi-figure mode and allow multiple plot arrangements on a pgPage. Let's visualize all APA results using functions from hictoolsr and plotgardener:

library(plotgardener)
library(purrr)

## Initiate plotgardener page
pageCreate(width = 4.25, height = 3)

## Define shared parameters
p <- pgParams(x = 0.5,
              y = 0.5,
              width = 1,
              height = 1,
              space = 0.075,
              zrange = c(0, max(unlist(c(loopApaWtHic, loopApaFsHic)))),
              palette = colorRampPalette(brewer.pal(9, 'YlGnBu')))

## Define grid of coordinate positions
xpos <- c(p$x, p$x + p$width + p$space, p$x + (p$width + p$space)*2)
ypos <- c(p$y, p$y + p$height + p$space, p$y + (p$height + p$space)*2)

## Plot row of WT APAs
wt_plots <- 
  pmap(list(loopApaWtHic, xpos, ypos[1]), \(a, x, y) {
    plotApa(params = p, apa = a, x = x, y = y)
  })

## Plot row of FS APAs
fs_plots <- 
  pmap(list(loopApaFsHic, xpos, ypos[2]), \(a, x, y) {
    plotApa(params = p, apa = a, x = x, y = y)
  })

## Add legend
annoHeatmapLegend(plot = wt_plots[[1]],
                  x = p$x + (p$width + p$space)*3,
                  y = ypos[1],
                  width = p$space,
                  height = p$height*0.75,
                  fontcolor = 'black')

## Add text labels
plotText(label = c("All loops", "WT loops", "FS loops"),
         x = xpos + p$width / 2,
         y = ypos[1] - p$space,
         just = c('center', 'bottom'))

plotText(label = c("WT", "FS"),
         x = xpos[1] - p$space,
         y = ypos[1:2] + p$height / 2,
         rot = 90,
         just = c('center', 'bottom'))

## Remove Guides
pageGuideHide()

As you can see, in some ways plotgardener can be more verbose, but this comes with the added flexibility to specificy exactly where and how you would like to visualize your genomic data. Just remember, with freedom comes great responsibility!

Session information

sessionInfo()


EricSDavis/hictoolsr documentation built on Sept. 4, 2022, 12:36 a.m.