knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE) options("scipen"=10, "digits"=5)
Chromatin interaction analysis with paired-end tag sequencing (ChIA-PET) is a technique to study protein-mediated interactions at a genome-wide scale. Like most techniques for studying chromatin interaction, it is based on chromosome conformation capture. Unlike 3C, 4C and 5C, however, it can detect interactions genome-wide, and includes a chromatin immmunoprecipitation step to enrich for interactions involving a protein of interest.
The raw data from ChIA-PET is in the form of paired-end reads attached to one of two linker sequences. Reads with chimeric linkers correspond to non-specific ligation artefacts and are removed, while the remaining reads are aligned to a reference genome. The ChIA-PET Tool [@li2010chiapet] can then be used to find pairs of regions ("anchors") that have a significant number of reads mapping between them. These are likely to represent biologically meaningful chromatin interactions in the sample.
To demonstrate the functionality of r Biocpkg("fugi")
, we will use data from @li2012extensive.
Here, ChIA-PET was performed to identify interactions mediated by the initiation form of RNA polymerase II (RNPII) in the K562 myelogenous leukemia cell line.
One would expect to see RNPII activity at active promoters,
so the data should give us an insight into the processes which regulate genes under active transcription.
We read in our data directly from the output of the ChIA-PET tool using r Biocpkg("fugi")
.
At this stage, we can also provide information about the cell type and a description tag for the experiment.
library(fugi) chiapet.data <- system.file("extdata/k562.rep1.cluster.pet3+.txt", package="fugi") k562.rep1 <- makeGenomicInteractionsFromFile(chiapet.data, type="chiapet.tool", experiment_name="k562", description="k562 pol2 8wg16") k562.rep1
This loads the data into a GenomicInteractions
object that represents the paired anchors for each pairwise interaction.
Each anchor region is represented a GenomicRanges
object.
anchorOne(k562.rep1) anchorTwo(k562.rep1)
We also obtain the p-value, FDR and the number of reads supporting each interaction.
head(interactionCounts(k562.rep1)) head(k562.rep1$fdr) hist(-log10(k562.rep1$p.value))
The metadata we have added can easily be accesed, and edited:
name(k562.rep1) description(k562.rep1) <- "PolII-8wg16 Chia-PET for K562"
The anchor regions in each interaction in a GenomicInteractions
object can be at any point along the genome.
This allows us to easily represent interactions detected between chromosomes, known as trans-chromosomal interactions.
The is.trans
function returns a logical vector specifying which interactions are trans;
likewise is.cis
will identify those interactions that are cis.
sprintf("Percentage of trans-chromosomal interactions %.2f", 100*sum(is.trans(k562.rep1))/length(k562.rep1))
The distance between anchor regions for each interaction is computed using the calculateDistances()
function.
This can be done using the inner edge, outer edge or midpoints of the anchors.
The distance is undefined for inter-chromosomal interactions where NA
is returned,
so it is important to exclude these interactions from some analyses.
head(calculateDistances(k562.rep1, method="midpoint"))
GenomicInteractions
objects can be subsetted by either integer or logical vectors like most R objects,
making it easy to filter out or select for interactions of interest (e.g., removing all trans interactions).
# First interactions in the dataset k562.rep1[1:10] # Subsample to 100 interactions k562.rep1[sample(length(k562.rep1), 100)] # Keep only cis interactions k562.cis <- k562.rep1[is.cis(k562.rep1)] # Subset for more local interactions k562.short <- k562.cis[calculateDistances(k562.cis) < 1e6] k562.short
We can also subset based on the properties of the linked GRanges
objects.
chrom <- c("chr17", "chr18") sub <- seqnames(anchorOne(k562.rep1)) %in% chrom & seqnames(anchorTwo(k562.rep1)) %in% chrom k562.rep1 <- k562.rep1[sub]
Interactions between different elements in the genome are believed to have different functional roles. For example, interactions between promoters and their transcription termination sites are thought to be a by-product of the transcription process, whereas long-range interactions with enhancers are believed to play a role in gene regulation.
Since GenomicInteractions
is based on GenomicRanges
, it is very easy to
interrogate GenomicInteractions
objects using GenomicRanges
data. In the
example, we want to annotate interactions that overlap the promoters,
transcription termination sites or the body of any gene. Since this can
be a time-consuming and data-heavy process, this example runs the analysis
for only chromosomes 17 and 18.
First we need the list of RefSeq transcripts:
library(GenomicFeatures) hg19.refseq.db <- makeTxDbFromUCSC(genome="hg19", table="refGene") refseq.genes <- genes(hg19.refseq.db) refseq.transcripts <- transcriptsBy(hg19.refseq.db, by="gene") non_pseudogene <- names(refseq.transcripts) %in% unlist(refseq.genes$gene_id) refseq.transcripts = refseq.transcripts[non_pseudogene]
Rather than downloading the whole Refseq database, these are provided for chromosomes 17 and 18:
data("hg19.refseq.transcripts") refseq.transcripts <- hg19.refseq.transcripts
We can then use functions from GenomicRanges
to call promoters and
terminators for these transcripts. We have taken promoter regions to be within
2.5 kbp of an annotated TSS and terminators to be within 1 kbp of the end of an
annotated transcript. Since genes can have multiple transcripts, they can also
have multiple promoters/terminators, so these are GRangesList
objects, which makes
handling these objects slightly more complicated.
refseq.promoters <- promoters(refseq.transcripts, upstream=2500, downstream=2500) # unlist object so "strand" is one vector refseq.transcripts.ul <- unlist(refseq.transcripts) # terminators can be called as promoters with the strand reversed strand(refseq.transcripts.ul) <- ifelse(strand(refseq.transcripts.ul) == "+", "-", "+") refseq.terminators.ul <- promoters(refseq.transcripts.ul, upstream=1000, downstream=1000) # change back to original strand strand(refseq.terminators.ul) <- ifelse(strand(refseq.terminators.ul) == "+", "-", "+") # `relist' maintains the original names and structure of the list refseq.terminators <- relist(refseq.terminators.ul, refseq.transcripts)
These can be used to subset a GenomicInteractions
object directly from
GRanges
using the GenomicRanges
overlaps methods. findOverlaps
called on
a GenomicInteractions
object will return a list containing Hits
objects for
both anchors.
We can finds any interaction involving a RefSeq promoter:
subsetByFeatures(k562.rep1, unlist(refseq.promoters))
A powerful feature of the r Biocpkg("fugi")
package
is the ability to annotate each anchor with a list of genomic regions and then
summarise interactions according to these features. This annotation is
implemented as metadata columns for the anchors in the GenomicInteractions
object and so is fast, and facilitates more complex analyses.
The order in which we annotate the anchors is important, since each anchor can
only have one node.class
. The first listed take precedence. Any regions not
overlapping ranges in annotation.features
will be labelled as distal
.
annotation.features <- list(promoter=refseq.promoters, terminator=refseq.terminators, gene.body=refseq.transcripts) annotateInteractions(k562.rep1, annotation.features) table(annotationFeatures(k562.rep1))
We can now find interactions involving promoters using the annotated
node.class
for each anchor:
p.one <- anchorOne(k562.rep1)$node.class == "promoter" p.two <- anchorTwo(k562.rep1)$node.class == "promoter" k562.rep1[p.one|p.two]
This information can be used to categorise interactions into promoter-distal,
promoter-terminator etc. A table of interaction types can be generated with
categoriseInteractions
:
categoriseInteractions(k562.rep1)
Alternatively, we can subset the object based on interaction type:
k562.rep1[isInteractionType(k562.rep1, "terminator", "gene.body")]
The 3 most common node.class
values have short functions defined for convenience
(see ?is.pp
for a complete list):
k562.rep1[is.pp(k562.rep1)] # promoter-promoter interactions k562.rep1[is.dd(k562.rep1)] # distal-distal interactions k562.rep1[is.pt(k562.rep1)] # promoter-terminator interactions
Summary plots of interactions classes can easily be produced to get an overall feel for the data:
plotInteractionAnnotations(k562.rep1, other=5)
viewpoints
will only take those interactions with a certain node.class
:
plotInteractionAnnotations(k562.rep1, other=5, viewpoints="promoter")
These are also combined in the function plotSummaryStats
.
The summariseByFeatures
allows us to look in more detail at interactions
involving a specific set of loci. In this example we use all RefSeq promoters,
which we already have loaded in a GRangesList
object.
It is however possible to use any dataset which can be represented as a named
GRanges
object, such as binding sites from ChIP-seq experiments, predicted
cis-regulatory sites or certain categories of genes.
The categories are generated automatically from the annotated node.class
values in the object.
k562.rep1.promoter.annotation <- summariseByFeatures(k562.rep1, refseq.promoters, "promoter", distance.method="midpoint", annotate.self=TRUE) colnames(k562.rep1.promoter.annotation)
This allows us to very quickly generate summaries of the data and provides a
quick method to isolate genes of interest. In this case we produce lists of
RefSeq IDs, which can easily be converted to Entrez IDs or gene symbols through
existing BioConductor packages (in this case r Biocpkg("org.Hs.eg.db")
provides bimaps between
common human genome annotations).
Which promoters have the strongest Promoter-Promoter interactions based on PET-counts?
i <- order(k562.rep1.promoter.annotation$numberOfPromoterPromoterInteractions, decreasing=TRUE)[1:10] k562.rep1.promoter.annotation[i,"Promoter.id"]
Which promoters are contacting the largest number of distal elements?
i <- order(k562.rep1.promoter.annotation$numberOfUniquePromoterDistalInteractions, decreasing=TRUE)[1:10] k562.rep1.promoter.annotation[i,"Promoter.id"]
What percentage of promoters are in contact with transcription termination sites?
total <- sum(k562.rep1.promoter.annotation$numberOfPromoterTerminatorInteractions > 0) sprintf("%.2f%% of promoters have P-T interactions", 100*total/nrow(k562.rep1.promoter.annotation))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.