In LTLA/fugi: Further Utilities for Genomic Interactions

knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
options("scipen"=10, "digits"=5)

Introduction

Chromatin interaction analysis with paired-end tag sequencing (ChIA-PET) is a technique to study protein-mediated interactions at a genome-wide scale. Like most techniques for studying chromatin interaction, it is based on chromosome conformation capture. Unlike 3C, 4C and 5C, however, it can detect interactions genome-wide, and includes a chromatin immmunoprecipitation step to enrich for interactions involving a protein of interest.

The raw data from ChIA-PET is in the form of paired-end reads attached to one of two linker sequences. Reads with chimeric linkers correspond to non-specific ligation artefacts and are removed, while the remaining reads are aligned to a reference genome. The ChIA-PET Tool [@li2010chiapet] can then be used to find pairs of regions ("anchors") that have a significant number of reads mapping between them. These are likely to represent biologically meaningful chromatin interactions in the sample.

Loading in the interactions

To demonstrate the functionality of r Biocpkg("fugi"), we will use data from @li2012extensive. Here, ChIA-PET was performed to identify interactions mediated by the initiation form of RNA polymerase II (RNPII) in the K562 myelogenous leukemia cell line. One would expect to see RNPII activity at active promoters, so the data should give us an insight into the processes which regulate genes under active transcription.

We read in our data directly from the output of the ChIA-PET tool using r Biocpkg("fugi"). At this stage, we can also provide information about the cell type and a description tag for the experiment.

library(fugi)
chiapet.data <- system.file("extdata/k562.rep1.cluster.pet3+.txt", package="fugi")
k562.rep1 <- makeGenomicInteractionsFromFile(chiapet.data, 
    type="chiapet.tool", 
    experiment_name="k562", 
    description="k562 pol2 8wg16")
k562.rep1

This loads the data into a GenomicInteractions object that represents the paired anchors for each pairwise interaction. Each anchor region is represented a GenomicRanges object.

anchorOne(k562.rep1)
anchorTwo(k562.rep1)

We also obtain the p-value, FDR and the number of reads supporting each interaction.

head(interactionCounts(k562.rep1))
head(k562.rep1$fdr)
hist(-log10(k562.rep1$p.value))

The metadata we have added can easily be accesed, and edited:

name(k562.rep1)
description(k562.rep1) <- "PolII-8wg16 Chia-PET for K562"

Investigating the interactions

The anchor regions in each interaction in a GenomicInteractions object can be at any point along the genome. This allows us to easily represent interactions detected between chromosomes, known as trans-chromosomal interactions. The is.trans function returns a logical vector specifying which interactions are trans; likewise is.cis will identify those interactions that are cis.

sprintf("Percentage of trans-chromosomal interactions %.2f", 
    100*sum(is.trans(k562.rep1))/length(k562.rep1))

The distance between anchor regions for each interaction is computed using the calculateDistances() function. This can be done using the inner edge, outer edge or midpoints of the anchors. The distance is undefined for inter-chromosomal interactions where NA is returned, so it is important to exclude these interactions from some analyses.

head(calculateDistances(k562.rep1, method="midpoint"))

GenomicInteractions objects can be subsetted by either integer or logical vectors like most R objects, making it easy to filter out or select for interactions of interest (e.g., removing all trans interactions).

# First interactions in the dataset
k562.rep1[1:10] 

# Subsample to 100 interactions 
k562.rep1[sample(length(k562.rep1), 100)] 

# Keep only cis interactions
k562.cis <- k562.rep1[is.cis(k562.rep1)]

# Subset for more local interactions
k562.short <- k562.cis[calculateDistances(k562.cis) < 1e6] 
k562.short

We can also subset based on the properties of the linked GRanges objects.

chrom <- c("chr17", "chr18")
sub <- seqnames(anchorOne(k562.rep1)) %in% chrom & 
    seqnames(anchorTwo(k562.rep1)) %in% chrom
k562.rep1 <- k562.rep1[sub]

Annotating interactions

Interactions between different elements in the genome are believed to have different functional roles. For example, interactions between promoters and their transcription termination sites are thought to be a by-product of the transcription process, whereas long-range interactions with enhancers are believed to play a role in gene regulation.

Since GenomicInteractions is based on GenomicRanges, it is very easy to interrogate GenomicInteractions objects using GenomicRanges data. In the example, we want to annotate interactions that overlap the promoters, transcription termination sites or the body of any gene. Since this can be a time-consuming and data-heavy process, this example runs the analysis for only chromosomes 17 and 18.

First we need the list of RefSeq transcripts:

library(GenomicFeatures)
hg19.refseq.db <- makeTxDbFromUCSC(genome="hg19", table="refGene")
refseq.genes <- genes(hg19.refseq.db)
refseq.transcripts <- transcriptsBy(hg19.refseq.db, by="gene")
non_pseudogene <- names(refseq.transcripts) %in% unlist(refseq.genes$gene_id) 
refseq.transcripts = refseq.transcripts[non_pseudogene]

Rather than downloading the whole Refseq database, these are provided for chromosomes 17 and 18:

data("hg19.refseq.transcripts")
refseq.transcripts <- hg19.refseq.transcripts

We can then use functions from GenomicRanges to call promoters and terminators for these transcripts. We have taken promoter regions to be within 2.5 kbp of an annotated TSS and terminators to be within 1 kbp of the end of an annotated transcript. Since genes can have multiple transcripts, they can also have multiple promoters/terminators, so these are GRangesList objects, which makes handling these objects slightly more complicated.

refseq.promoters <- promoters(refseq.transcripts, upstream=2500, downstream=2500)

# unlist object so "strand" is one vector
refseq.transcripts.ul <- unlist(refseq.transcripts) 

# terminators can be called as promoters with the strand reversed
strand(refseq.transcripts.ul) <- ifelse(strand(refseq.transcripts.ul) == "+", "-", "+") 
refseq.terminators.ul <- promoters(refseq.transcripts.ul, upstream=1000, downstream=1000) 

# change back to original strand
strand(refseq.terminators.ul) <- ifelse(strand(refseq.terminators.ul) == "+", "-", "+") 

# `relist' maintains the original names and structure of the list
refseq.terminators <- relist(refseq.terminators.ul, refseq.transcripts)

These can be used to subset a GenomicInteractions object directly from GRanges using the GenomicRanges overlaps methods. findOverlaps called on a GenomicInteractions object will return a list containing Hits objects for both anchors. We can finds any interaction involving a RefSeq promoter:

subsetByFeatures(k562.rep1, unlist(refseq.promoters))

A powerful feature of the r Biocpkg("fugi") package is the ability to annotate each anchor with a list of genomic regions and then summarise interactions according to these features. This annotation is implemented as metadata columns for the anchors in the GenomicInteractions object and so is fast, and facilitates more complex analyses.

The order in which we annotate the anchors is important, since each anchor can only have one node.class. The first listed take precedence. Any regions not overlapping ranges in annotation.features will be labelled as distal.

annotation.features <- list(promoter=refseq.promoters, 
    terminator=refseq.terminators, 
    gene.body=refseq.transcripts)
annotateInteractions(k562.rep1, annotation.features)
table(annotationFeatures(k562.rep1))

We can now find interactions involving promoters using the annotated node.class for each anchor:

p.one <- anchorOne(k562.rep1)$node.class == "promoter"
p.two <- anchorTwo(k562.rep1)$node.class == "promoter"
k562.rep1[p.one|p.two]

This information can be used to categorise interactions into promoter-distal, promoter-terminator etc. A table of interaction types can be generated with categoriseInteractions:

categoriseInteractions(k562.rep1)

Alternatively, we can subset the object based on interaction type:

k562.rep1[isInteractionType(k562.rep1, "terminator", "gene.body")]

The 3 most common node.class values have short functions defined for convenience (see ?is.pp for a complete list):

k562.rep1[is.pp(k562.rep1)] # promoter-promoter interactions
k562.rep1[is.dd(k562.rep1)] # distal-distal interactions
k562.rep1[is.pt(k562.rep1)] # promoter-terminator interactions

Summary plots of interactions classes can easily be produced to get an overall feel for the data:

plotInteractionAnnotations(k562.rep1, other=5)

viewpoints will only take those interactions with a certain node.class:

plotInteractionAnnotations(k562.rep1, other=5, viewpoints="promoter")

These are also combined in the function plotSummaryStats.

Feature summaries

The summariseByFeatures allows us to look in more detail at interactions involving a specific set of loci. In this example we use all RefSeq promoters, which we already have loaded in a GRangesList object.

It is however possible to use any dataset which can be represented as a named GRanges object, such as binding sites from ChIP-seq experiments, predicted cis-regulatory sites or certain categories of genes.

The categories are generated automatically from the annotated node.class values in the object.

k562.rep1.promoter.annotation <- summariseByFeatures(k562.rep1, refseq.promoters, 
    "promoter", distance.method="midpoint", annotate.self=TRUE)
colnames(k562.rep1.promoter.annotation)

This allows us to very quickly generate summaries of the data and provides a quick method to isolate genes of interest. In this case we produce lists of RefSeq IDs, which can easily be converted to Entrez IDs or gene symbols through existing BioConductor packages (in this case r Biocpkg("org.Hs.eg.db") provides bimaps between common human genome annotations).

Which promoters have the strongest Promoter-Promoter interactions based on PET-counts?

i <- order(k562.rep1.promoter.annotation$numberOfPromoterPromoterInteractions, 
    decreasing=TRUE)[1:10]
k562.rep1.promoter.annotation[i,"Promoter.id"]

Which promoters are contacting the largest number of distal elements?

i <- order(k562.rep1.promoter.annotation$numberOfUniquePromoterDistalInteractions, 
    decreasing=TRUE)[1:10]
k562.rep1.promoter.annotation[i,"Promoter.id"]

What percentage of promoters are in contact with transcription termination sites?

total <- sum(k562.rep1.promoter.annotation$numberOfPromoterTerminatorInteractions > 0)
sprintf("%.2f%% of promoters have P-T interactions", 100*total/nrow(k562.rep1.promoter.annotation))