wavClusteR: a workflow for PAR-CLIP data analysis

Abstract

A number of recently developed next-generation sequencing based methods (e.g. PAR-CLIP, Bisulphite sequencing, RRBS) specifically induce nucleotide substitutions within the short reads with respect to the reference genome. wavClusteR provides functions for the analysis of the data obtained by such methods with a major focus on PAR-CLIP. The package leverages on experimentally induced substitutions to identify high confidence signals (e.g. RNA-binding sites) in the data. A wavClusteR workflow consists of two steps:

  1. The estimation of a non-parametric two-component mixture model to identify substitution frequencies that are likely to be affected by the experimental procedure
  2. The identification of binding sites (clusters).

The package supports multicore computing. For a detailed description of the method please refer to Sievers et al., 2012; Comoglio et al., 2015.

Preparing the input

wavClusteR expects a sorted and indexed BAM file as input. A streamlined workflow to generate the required input from a fastq file is outlined below (line 1 is pseudo code, replace it with the aligner specific syntax).

    #ALIGN: 
        sample.fastq -> sample.sam
    #CONVERT: 
        samtools view -b -S sample.sam -o sample.bam
    #SORT: 
        samtools sort sample.bam sample_sorted
    #INDEXING: 
        samtools index sample_sorted.bam
  1. Short read alignment to the reference genome using relaxed alignment parameters. Experimentally-induced transitions will otherwise impede alignment of informative reads. Currently, wavClusteR has been tested with bowtie/Bowtie2 Langmead et al., 2009; Langmead and Salzberg, 2012
  2. Conversion of the alignment file from SAM to BAM format, e.g. using samtools view from SAMtools
  3. Sorting of the BAM file, e.g. using samtools sort
  4. Indexing of the sorted BAM file, e.g. with samtools index.

Example dataset

wavClusteR provides a chunk of a human Argonaute 2 (AGO2) PAR-CLIP data set from Kishore et al., 2011. Reads in the chunk map to the genomic interval chrX:23996166-24023263. This data set is used below to illustrate a workflow for PAR-CLIP data analysis.

A workflow for PAR-CLIP data analysis

Reading sorted BAM files

A sorted and indexed BAM file can be loaded into the R session with readSortedBam. This function extracts aligned reads, sequences and the mismatch MD field, and returns a GRanges object.

library(wavClusteR)
filename <- system.file( "extdata", "example.bam", package = "wavClusteR" )
Bam <- readSortedBam(filename = filename)
Bam

Extracting informative genomic positions

Prior to model fitting, genome-wide substitutions need to be identified and filtered based on a coverage threshold. The getAllSub function identifies all genomic positions exhibiting at least one substitution and coverage above this threshold.

countTable <- getAllSub( Bam, minCov = 10 )
head( countTable )

The function returns a GRanges object specifying genomic position, strand, substitution type (e.g. "TC": T in the reference genome; C in the read), strand-specific coverage and number of observed substitutions at the position.

How to choose the coverage threshold?

The coverage threshold minCov defines the genomic positions to be used for parameter estimation, thus providing a means to tune the stringency of the analysis. Currently, wavClusteR does not allow to learn this threshold from the data. However, since the model is based on relative substitution frequencies, the value of minCov will influence the variance of the estimated parameters. Therefore, values smaller than default (minCov=10) are not recommended.

Inspecting the substitutions profile (diagnostic I)

Once all substitutions are computed, a diagnostic substitution profile can be plotted with plotSubstitutions.

plotSubstitutions( countTable, highlight = "TC" )

The function returns a barplot showing the total number of genomic positions that exhibit a given type of substitution and highlights the substitution type that is expected to be generated by the experimental procedure. The percentage of substitution of this type is also shown. This plot can be readily used to assess the quality of the sequenced libraries and can be used to compare different data sets generated under the same experimental condition.

Estimating the model

wavClusteR uses the identified genomic positions to learn a non-parametric mixture model discriminating PAR-CLIP-specific from extrinsic transitions. Model parameters are estimated by fitMixtureModel.

model <- fitMixtureModel(countTable, substitution = "TC")

The function returns a list of:

As the small size of our example data set does not to estimate the model reliably, the mixture model for the entire AGO2 dataset has been precomputed and is provided by the package.

data(model)
str(model)

Inspecting the model fit (diagnostic II)

The model fit can be inspected using getExpInterval.

(support <- getExpInterval( model, bayes = TRUE ) )

The function returns two diagnostic plots. The first plot illustrates the estimated densities \(p\), \(p_1\) and \(p_2\), and log odds

\[ o=\rm{log}\frac{p(k=2|x)}{p(k=1|x)}. \]

The second plot shows the resulting posterior class probability, i.e. the probability that a given relative substitution frequency (RSF, horizontal axis) has been induced by PAR-CLIP. The area under the curve corresponding to the returned RSF interval is colored, and the RSF interval indicated. By default, getExpInterval returns the RSF interval according to the Bayes classifier, i.e. a posterior probability cutoff of 0.5. However, the user can specify a custom RSF interval:

  1. By means of the rightProb and leftProb parameters, e.g.

    r (support <- getExpInterval( model, bayes = FALSE, leftProb = 0.9, rightProb = 0.9 ) )

  2. By inspecting the posterior class probability density and passing the RSF interval boundaries when calling high-confidence substitutions (see point 6).

Finally, the model can be used to produce further diagnostic plots. Particularly, the total number of reads exhibitng a given substitution and an RSF-based partitioning of genomic positions with substitutions can be obtained by

plotSubstitutions( countTable, highlight = "TC", model )

Selecting high-confidence PAR-CLIP induced transitions

The RSF support is used to filter all observed transitions to define a set of high-confidence, PAR-CLIP induced transitions. These are identified by getHighConfSub. The function returns a GRanges object with genomic position, strand, strand-specific coverage, occurence (count), and relative substitution frequency (rsf) for each high-confidence substitution. In a call to getHighConfSub, the RSF interval returned by getExpInterval can be supplied as support argument directly

highConfSub <- getHighConfSub( countTable, 
                               support = support, 
                               substitution = "TC" )
head( highConfSub )                               

or, alternatively, by specifying supportStart and supportEnd, which define the range of RSF of interest.

highConfSub <- getHighConfSub( countTable, 
                               supportStart = 0.2, 
                               supportEnd = 0.7, 
                               substitution = "TC" )
head( highConfSub )

Identifying protein binding sites (clusters)

Binding sites (clusters) are identified by getClusters. The function takes high-confidence substitution sites and the coverage function

coverage <- coverage( Bam )
coverage$chrX

as an input. From package version 2.0, cluster boundaries are resolved using the Mini-Rank Norm (MRN) Comoglio et al., 2015, which is up to 10x faster and more sensitive than the previously adopted algorithm based on continuous wavelet transform of the coverage function Sievers et al., 2012. Briefly, the MRN algorithm finds an optimal cluster boundary for each high-confidence substitution by solving an optimization problem that integrates prior knowledge on the geometry of PAR-CLIP clusters. Two options are available:

  1. Hard thresholding, i.e. the coverage function is denoised using a global threshold. Empirically, minCov=1 worked well on all tested datasets for which minCov = 10. Alternatively, 10% of the mode of the coverage distribution at high-confidence substitutions represents a possible choice.

    r clusters <- getClusters( highConfSub = highConfSub, coverage = coverage, sortedBam = Bam, threshold = 1, cores = 1 ) clusters

  2. Local thresholding, based on a global estimation of background levels via a Gaussian mixture model. Omitting the threshold parameter in the call to getClusters enables local thresholding

    r clusters <- getClusters( highConfSub = highConfSub, coverage = coverage, sortedBam = Bam, cores = 1 ) clusters

Merging clusters

Once the clusters are identified, the corresponding genomic regions can be merged in a strand-specific manner. Statistics for each merged cluster (a wavCluster), can be computed using filterClusters.

require(BSgenome.Hsapiens.UCSC.hg19)

wavclusters <- filterClusters( clusters = clusters, 
                   highConfSub = highConfSub,
                   coverage = coverage, 
                   model = model, 
                   genome = Hsapiens, 
                   refBase = "T", 
                   minWidth = 12)

wavclusters

The function takes as input the following elements:

and it returns a GRanges object with the following metadata:

The relative log odds can be used to rank clusters according to statistical confidence.

Post-processing of the binding sites

wavClusteR provides a number of functions (summarized in the table below) for post-processing of the identified binding sites.

Task | Function | Output format | ------------- | ------------- | ------------- | Export all identified substitutions or high-confidence substitutions | exportHighConfSub | BED Export clusters | exportClusters | BED Export coverage function | exportCoverage | BigWig Visualize the size distribution of wavClusters | plotSizeDistribution | histogram Annotate clusters with respect to genomic features (e.g. CDS, introns, 3'-UTRs, 5'-UTRs) in a strand-specific manner | annotateClusters | dot chart, vector Compute metagene profiles of wavClusters, where the density of wavClusters is represented as a function of a reference genomic coordinates | getMetaGene | line plot, vector Compute metaTSS profiles based on all aligned reads in the input BAM file | \Rfunction{getMetaTSS} | line plot, vector Visualize wavClusteR statistics and meta data to learn pairwise relationships between variables | plotStatistics | pairs plot

Exporting substitutions, wavClusters and coverage function

High-confidence substitutions can be exported with

exportHighConfSub( highConfSub = highConfSub, 
                   filename = "hcTC.bed", 
                   trackname = "hcTC", 
                   description = "hcTC" )

where trackname and description set the corresponding attributes in the UCSC BED file. By replacing highConfSub with another set of substitutions (e.g. all identified substitutions of a given type), those can be exported likewise.

wavClusters can be exported with

exportClusters( clusters = wavclusters, 
                filename = "wavClusters.bed", 
                trackname = "wavClusters", 
                description = "wavClusters" )

and the coverage function can be exported with

exportCoverage( coverage = coverage, filename = "coverage.bigWig" )

Annotating binding sites

wavClusters can be annotated with respect to genomic features using annotateClusters. This function generates a strand-specific dot chart representing wavClusters annotation. annotateClusters takes as an input the wavClusters and a transcriptDB object containing transcript annotations. The latter can be generated using makeTxDbFromUCSC (GenomicFeatures package)

txDB <- makeTxDbFromUCSC(genome = "hg19", tablename = "ensGene")

and is automatically downloaded by annotateClusters if missing.

Then, the annotateClusters can be called as follows

annotateClusters( clusters = wavclusters, 
              txDB = txDB, 
              plot = TRUE, 
              verbose = TRUE)

Four dot charts are returned by the function showing the percentage of clusters mapping to different transcript features localized on the same (sense) or on the opposite (antisense) strand, the relative sequence length of different compartments relative to the total transcriptome length and the normalized enrichment of clusters across functional compartments.

Note: multiple hits, i.e. wavClusters that overlap with more than one genomic feature, are reported as "multiple", whereas wavClusters that map outside of the considered features are labeled as "other". The latter are then annotated with respect to features on the antisense strand.

Computing cluster metagene profiles

A graphical representation of the density of wavClusters as a function of a binning of genomic coordinates across all annotated genes can be obtained using the getMetaGene function.

getMetaGene( clusters = wavclusters, 
             txDB = txDB, 
             upstream = 1e3, 
             downstream = 1e3, 
             nBins = 40, 
             nBinsUD = 10, 
             minLength = 1, 
             plot = TRUE,
             verbose = TRUE ) 

In this example, genes were divided in 40 bins (nBins) and an upstream/downstream region spanning 1kb was considered. This region was subdivided in 10 bins (nBinsUD). No restriction on gene length was applied (minLength). getMetaGene returns a numeric vector of length nBins + 2*nBinsUD with normalized counts, which can be used, for instance, to compare the distribution of wavClusters across several PAR-CLIP samples.

In addition to metagene profiles, meta transcription start site (TSS) profiles based on all mapped reads can be generated using getMetaTSS.

getMetaTSS( sortedBam = Bam, 
            txDB = txDB, 
            upstream = 1e3, 
            downstream = 1e3, 
            nBins = 40, 
            unique = FALSE, 
            plot = TRUE, 
            verbose = TRUE ) 

Here, the upstream and downstream parameters control the width of the window to be considered, and nBins controls the resolution of the profile. If unique=TRUE, then overlapping windows are discarded. getMetaTSS returns a numeric vector of length nBins with normalized read counts.

Computing the size distribution of wavClusters

The size distribution of wavClusters can be visualized by

plotSizeDistribution( clusters = wavclusters, showCov = TRUE, col = "skyblue2" )

Visualizing wavCluster statistics and meta data

Cluster statistics can be plotted as pairs plot using

plotStatistics( clusters = wavclusters, 
                corMethod = "spearman", 
                lower = panel.smooth )

Session Info

sessionInfo()

References

Comoglio, F., Sievers, C. & Paro, R. (2015) Sensitive and highly resolved inidentification of RNA-protein interaction sites in PAR-CLIP data. BMC Bioinformatics, 16, 32

Langmead,B., Trapnell,C., Pop,M. \& Salzberg,S.L. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10, R25

Kishore, S. et al. (2011) A quantitative analysis of CLIP methods for identifying binding sites of RNA-binding proteins. Nature Methods 8(7), 559-564

Sievers, C., Schlumpf, T., Sawarkar, R., Comoglio, F. & Paro, R. (2012) Mixture models and wavelet transforms reveal high confidence RNA-protein interaction sites in MOV10 PAR-CLIP data. Nucleic Acids Res 40(2), e160



Try the wavClusteR package in your browser

Any scripts or data that you put into this service are public.

wavClusteR documentation built on Nov. 8, 2020, 6:54 p.m.