featuretypeCounts: Plot read distribution across genomic features

View source: R/riboseq.R

featuretypeCountsR Documentation

Plot read distribution across genomic features

Description

Counts how many reads in short read alignment files (BAM format) overlap with entire annotation categories. This utility is useful for analyzing the distribution of the read mappings across feature types, e.g. coding versus non-coding genes. By default the read counts are reported for the sense and antisense strand of each feature type separately. To minimize memory consumption, the BAM files are processed in a stream using utilities from the Rsamtools and GenomicAlignment packages. The counts can be reported for each read length separately or as a single value for reads of any length. Subsequently, the counting results can be plotted with the associated plotfeaturetypeCounts function.

Usage

featuretypeCounts(bfl, grl, singleEnd = TRUE, readlength = NULL, type = "data.frame")

Arguments

bfl

BamFileList object containing the paths to the target BAM files stored on disk. Note, memory-efficient processing is achieved by streaming through BAM files rather than reading entire files into memory at once. The maximum number of alignments to process in each iteration is determined by the yieldSize value passed on to the BamFileList function. For details see ?BamFileList.

grl

GRangesList object containing in each list component the range set of a feature type. Typically, this object is generated with the genFeatures function. For details see the example section below and ?genFeatures.

singleEnd

Specifies whether the targets BAM files contain alignments for single-end (SE) or paired-end read data. TRUE is for SE and FALSE for PE data.

readlength

Integer vector specifying the read length values for which to report counts separately. If readlength=NULL the length of the reads will be ignored resulting in a single value for each feature type and strand. Note, for PE data the two reads in a pair may differ in length. In those cases the length of the two reads is averaged and then assigned to the corresponding length category after rounding the mean length to the closest integer. This is not an ideal solution but a reasonable compromise for the purpose of the summary statistics generated by featuretypeCounts.

type

Determines whether the results are returned as data.frame (type="data.frame") or as list (type="list"). Each list component contains the counting results for one BAM file and is named after the corresponding sample. The data.frame result contains this sample assignment information in a separate column.

Value

The results are returned as data.frame or list of data.frames. For details see above under types argument. The result data.frames contain the following columns in the given order:

SampleName

Sample names obtained from BamFileList object.

Strand

Sense or antisense strand of read mappings.

Featuretype

Name of feature type provided by GRangesList object. Note, the total number of aligned reads is reported under the special feature type 'N_total_aligned'. This value is useful for scaling/normalization purposes in plots, e.g. counts per million reads.

Featuretypelength

Total genomic length of each reduced feature type in bases. This value is useful to normalize the read counts by genomic length units, e.g. in plots.

Subsequent columns

Counts for reads of any length or for individual read lengths.

Author(s)

Thomas Girke

See Also

plotfeaturetypeCounts, genFeatures

Examples

## Construct SYSargs2 object from param and targets files 
targets <- system.file("extdata", "targets.txt", package="systemPipeR")
dir_path <- system.file("extdata/cwl", package="systemPipeR")
args <- loadWorkflow(targets=targets, wf_file="hisat2/hisat2-mapping-se.cwl", 
                  input_file="hisat2/hisat2-mapping-se.yml", dir_path=dir_path)
args <- renderWF(args, inputvars=c(FileName="_FASTQ_PATH1_", SampleName="_SampleName_"))
args

## Not run: 
## Run alignments
args <- runCommandline(args, dir = FALSE, make_bam = TRUE)
outpaths <- subsetWF(args, slot = "output", subset = 1, index = 1)

## Features from sample data of systemPipeRdata package
library(txdbmaker)
file <- system.file("extdata/annotation", "tair10.gff", package="systemPipeRdata")
txdb <- makeTxDbFromGFF(file=file, format="gff3", organism="Arabidopsis")
feat <- genFeatures(txdb, featuretype="all", reduce_ranges=TRUE, upstream=1000, downstream=0, verbose=TRUE)

## Generate and plot feature counts for specific read lengths
fc <- featuretypeCounts(bfl=BamFileList(outpaths, yieldSize=50000), grl=feat, singleEnd=TRUE, readlength=c(74:76,99:102), type="data.frame")
p <- plotfeaturetypeCounts(x=fc, graphicsfile="featureCounts.pdf", graphicsformat="pdf", scales="fixed", anyreadlength=FALSE)

## Generate and plot feature counts for any read length  
fc2 <- featuretypeCounts(bfl=BamFileList(outpaths, yieldSize=50000), grl=feat, singleEnd=TRUE, readlength=NULL, type="data.frame")
p2 <- plotfeaturetypeCounts(x=fc2, graphicsfile="featureCounts2.pdf", graphicsformat="pdf", scales="fixed", anyreadlength=TRUE)

## End(Not run)

tgirke/systemPipeR documentation built on Sept. 24, 2024, 9:48 a.m.