Description Usage Arguments Details Value Author(s) References See Also Examples
View source: R/featureCounts.R
Assign mapped sequencing reads to specified genomic features.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 | featureCounts(files,
# annotation
annot.inbuilt = "mm10",
annot.ext = NULL,
isGTFAnnotationFile = FALSE,
GTF.featureType = "exon",
GTF.attrType = "gene_id",
GTF.attrType.extra = NULL,
chrAliases = NULL,
# level of summarization
useMetaFeatures = TRUE,
# overlap between reads and features
allowMultiOverlap = FALSE,
minOverlap = 1,
fracOverlap = 0,
fracOverlapFeature = 0,
largestOverlap = FALSE,
nonOverlap = NULL,
nonOverlapFeature = NULL,
# Read shift, extension and reduction
readShiftType = "upstream",
readShiftSize = 0,
readExtension5 = 0,
readExtension3 = 0,
read2pos = NULL,
# multi-mapping reads
countMultiMappingReads = TRUE,
# fractional counting
fraction = FALSE,
# long reads
isLongRead = FALSE,
# read filtering
minMQS = 0,
splitOnly = FALSE,
nonSplitOnly = FALSE,
primaryOnly = FALSE,
ignoreDup = FALSE,
# strandness
strandSpecific = 0,
# exon-exon junctions
juncCounts = FALSE,
genome = NULL,
# parameters specific to paired end reads
isPairedEnd = FALSE,
countReadPairs = TRUE,
requireBothEndsMapped = FALSE,
checkFragLength = FALSE,
minFragLength = 50,
maxFragLength = 600,
countChimericFragments = TRUE,
autosort = TRUE,
# number of CPU threads
nthreads = 1,
# read group
byReadGroup = FALSE,
# report assignment result for each read
reportReads = NULL,
reportReadsPath = NULL,
# miscellaneous
maxMOp = 10,
tmpDir = ".",
verbose = FALSE)
|
files |
a character vector giving names of input files containing read mapping results. The files can be in either SAM format or BAM format. The file format is automatically detected by the function. |
annot.inbuilt |
a character string specifying an in-built annotation used for read summarization. It has four possible values including |
annot.ext |
A character string giving name of a user-provided annotation file or a data frame including user-provided annotation data. If the annotation is in GTF format, it can only be provided as a file. If it is in SAF format, it can be provided as a file or a data frame. See below for more details about SAF format annotation. If an annotation file is provided, it can be uncompressed or gzip compressed. Note that |
isGTFAnnotationFile |
logical indicating whether the annotation provided via the |
GTF.featureType |
a character string or a vector of character strings giving the feature type or types used to select rows in the GTF annotation which will be used for read summarization. |
GTF.attrType |
a character string giving the attribute type in the GTF annotation which will be used to group features (eg. exons) into meta-features (eg. genes). |
GTF.attrType.extra |
a character vector specifying extra GTF attribute types that will also be included in the counting output. These attribute types will not be used to group features. |
chrAliases |
a character string giving the name of a chromosome name alias file. This should be a two-column comma-delimited text file. Chromosome name aliases included in this file are used to match chr names in annotation with those in the reads. First column in the file should include chr names in the annotation and second column should include chr names in the reads. Chr names are case sensitive. No column header should be included in the file. |
useMetaFeatures |
logical indicating whether the read summarization should be performed at the feature level (eg. exons) or meta-feature level (eg. genes). If |
allowMultiOverlap |
logical indicating if a read is allowed to be assigned to more than one feature (or meta-feature) if it is found to overlap with more than one feature (or meta-feature). |
minOverlap |
|
fracOverlap |
|
fracOverlapFeature |
|
largestOverlap |
If |
nonOverlap |
|
nonOverlapFeature |
|
readShiftType |
a character string indicating how reads should be shifted. It has four possible values including |
readShiftSize |
|
readExtension5 |
|
readExtension3 |
|
read2pos |
Specifying whether each read should be reduced to its 5' most base or 3' most base. It has three possible values: |
countMultiMappingReads |
logical indicating if multi-mapping reads/fragments should be counted, |
fraction |
logical indicating if fractional counts are produced for multi-mapping reads and/or multi-overlapping reads. |
isLongRead |
logical indicating if input data contain long reads. This option should be set to |
minMQS |
|
splitOnly |
logical indicating whether only split alignments (their CIGAR strings contain letter 'N') should be included for summarization. |
nonSplitOnly |
logical indicating whether only non-split alignments (their CIGAR strings do not contain letter 'N') should be included for summarization. |
primaryOnly |
logical indicating if only primary alignments should be counted. Primary and secondary alignments are identified using bit 0x100 in the Flag field of SAM/BAM files. If |
ignoreDup |
logical indicating whether reads marked as duplicates should be ignored. |
strandSpecific |
an integer vector indicating if strand-specific read counting should be performed. Length of the vector should be either |
juncCounts |
logical indicating if number of reads supporting each exon-exon junction will be reported. Junctions are identified from those exon-spanning reads in input data. |
genome |
a character string giving the name of a FASTA-format file that includes the reference sequences used in read mapping that produced the provided SAM/BAM files. |
isPairedEnd |
A logical scalar or a logical vector, indicating whether libraries contain paired-end reads or not. |
countReadPairs |
A logical scalar or a logical vector, indicating if read pairs, instead of reads, should be counted for paired-end read data. |
requireBothEndsMapped |
logical indicating if both ends from the same fragment are required to be successfully aligned before the fragment can be assigned to a feature or meta-feature. This parameter is only appliable when |
checkFragLength |
logical indicating if the two ends from the same fragment are required to satisify the fragment length criteria before the fragment can be assigned to a feature or meta-feature. This parameter is only appliable when |
minFragLength |
|
maxFragLength |
|
countChimericFragments |
logical indicating whether a chimeric fragment, which has its two reads mapped to different chromosomes, should be counted or not. |
autosort |
logical specifying if the automatic read sorting is enabled. This option is only applicable for paired-end reads. If |
nthreads |
|
byReadGroup |
logical indicating if read counting will be performed for each individual read group. |
reportReads |
output detailed read assignment results for each read (or fragment if paired end). The detailed assignment results can be saved in three different formats including |
reportReadsPath |
a character string specifying the directory where files including detailed assignment results are saved. If |
maxMOp |
|
tmpDir |
a character string specifying the directory under which intermediate files are saved (later removed). By default, current working directory is used. |
verbose |
logical indicating if verbose information for debugging will be generated. This may include information such as unmatched chromosomes/contigs between reads and annotation. |
featureCounts
is a general-purpose read summarization function that can assign mapped reads from genomic DNA and RNA sequencing to genomic features or meta-features.
The function takes as input a set of SAM or BAM files containing read mapping results.
The files might be generated by align
or subjunc
or any suitable aligner.
featureCounts
accepts two annotation formats to specify the genomic features: SAF (Simplified Annotation Format) or GTF/GFF.
The GTF/GFF format is specified at https://genome.ucsc.edu/FAQ/FAQformat.html.
SAF is a Simplified Annotation Format with five columns.
It can be either a data.frame or a tab-delimited text file in the following style:
1 2 3 4 5 6 7 8 | GeneID Chr Start End Strand
497097 chr1 3204563 3207049 -
497097 chr1 3411783 3411982 -
497097 chr1 3660633 3661579 -
100503874 chr1 3637390 3640590 -
100503874 chr1 3648928 3648985 -
100038431 chr1 3670236 3671869 -
...
|
SAF annotation includes the following five required columns: GeneID
, Chr
, Start
, End
and Strand
.
The GeneID
column includes identifiers of features.
These identifiers can be integer or character string.
The Chr
column includes chromosome names of features and these names should match the chromosome names includes in the provided SAM/BAM files.
The Start
and End
columns include start and end coordinates of features, respectively.
Both start and end coordinates are inclusive.
The Strand
column indicates the strand of features ("+"
or "-"
).
Column names in a SAF annotation should be the same as those shown in the above example (case insensitive).
Columns can be in any order.
Extra columns are allowed to be added into the annotation.
In-built annotations, which were generated based on NCBI RefSeq gene annotations, are provided to faciliate convenient read summarization.
We provide in-built annotations for the following genomes: "hg38"
, "hg19"
, "mm10"
and "mm9"
.
The content of in-built annotations can be accessed via the getInBuiltAnnotation
function.
These annotations have a SAF format.
The in-built annotations are a modified version of NCBI RefSeq gene annotations.
We downloaded the RefSeq gene annotations from NCBI ftp server (eg. RefSeq annotation for mm10
was downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/M_musculus/ARCHIVE/BUILD.38.1/mapview/seq_gene.md.gz).
We then used these annotations to create our in-built annotations.
For each gene, we used its CDS (coding DNA sequence) and UTR (untranslated) regions provided in the original annotation to construct a list of exons, by merging and concatenating all CDSs and UTRs belonging to the same gene.
Exons within each gene include all chromosomal bases included in the original CDS and UTR regions, but they include each base only once (they might be included multiple times in the original CDSs and UTRs).
Also, exons within the same gene do not overlap with each other.
Users may provide an external annotation for read summarization via the annot.ext
argument.
If the external annotation is in SAF format, it can be provdied as either a data.frame
or a tab-delimited text file with proper column names included.
If it is in GTF/GFF format, it should be provided as a file only (and isGTFAnnotationFile
should be set to TRUE
).
featureCounts
function uses the GTF.attrType
attribute in a GTF/GFF annotation to group features to form meta-features when performing read summarization at meta-feature level.
The argument useMetaFeatures
specifies whether read summarization should be performed at feature level or at meta-feature level.
A feature represents a continuous genomic region and a meta-feature is a group of features.
For instance, an exon is a feature and a gene comprising one or more exons is a meta-feature.
To assign reads to meta-features, featureCounts
firstly groups into meta-features the features that have the same gene identifiers.
featureCounts
looks for gene identifiers in GeneID
column of a SAF annotation or by using GTF.attrType
attribute in a GTF/GFF annotation.
Then for each read featureCounts
searches for meta-features that have at least one feature that overlaps with the read.
A read might be found to overlap with more than one feature within the same meta-feature (eg. an exon-spanning read overlaps with more than one exon from the same gene), however this read will still be counted only once for the meta-feature.
RNA-seq reads are often summarized to meta-features to produce read counts for genes. Further downstream analysis can then be carried out to discover differentially expressed genes. Feature-level summarization of RNA-seq data often yields exon-level read counts, which is useful for investigating alternative splicing of genes.
featureCounts
provides multiple options to count multi-mapping reads (reads mapping to more than one location in the reference genome).
Users can choose to ignore such reads by setting countMultiMappingReads
to FALSE
, or fully count every alignment reported for a multi-mapping read by setting countMultiMappingReads
to TRUE
(each alignment carries 1
count), or count each alignment fractionally by setting both countMultiMappingReads
and fraction
to TRUE
(each alignment carries 1/x
count where x
is the total number of alignments reported for the read).
featureCounts
also provides multiple options to count multi-overlapping reads (reads overlapping with more than one meta-feature when conducting meta-feature-level summarization or overlapping with more than one feature when conducting feature-level summarization).
Users can choose to ignore such reads by setting allowMultiOverlap
to FALSE
, or fully count them for each overlapping meta-feature/feature by setting allowMultiOverlap
to TRUE
(each overlapping meta-feature/feature receives a count of 1 from a read), or assign a fractional count to each overlapping meta-feature/feature by setting both allowMultiOverlap
and fraction
to TRUE
(each overlapping meta-feature/feature receives a count of 1/y
from a read where y
is the total number of meta-features/features overlapping with the read).
If a read is both multi-mapping and multi-overlapping, then each overlapping meta-feature/feature will receive a fractional count of 1/(x*y)
if countMultiMappingReads
, allowMultiOverlap
and fraction
are all set to TRUE
.
When isPairedEnd
is TRUE
, fragments (pairs of reads) instead of reads will be counted.
featureCounts
function checks if reads from the same pair are adjacent to each other (this could happen when reads were for example sorted by their mapping locations), and it automatically reorders those reads that belong to the same pair but are not adjacent to each other in the input read file.
A list with the following components:
counts |
a data matrix containing read counts for each feature or meta-feature for each library. |
counts_junction (optional) |
a data frame including primary and secondary genes overlapping an exon junction, position information for the left splice site (‘Site1’) and the right splice site (‘Site2’) of a junction (including chromosome name, coordinate and strand) and number of supporting reads for each junction in each library. Both primary and secondary genes overlap at least one of the two splice sites of a junction. Secondary genes do not overlap more splice sites than the primary gene. When the primary and secondary genes overlap same number of splice sites, the gene with the smallest leftmost base position is selected as the primary gene. For each junction, no more than one primary gene is reported but there might be more than one secondary genes reported. If a junction does not overlap any genes, it has a missing value in the fields of primary gene and secondary gene. Note that this data frame is only generated when |
annotation |
a data frame with six columns including |
targets |
a character vector giving sample information. |
stat |
a data frame giving number of successfully assigned reads in each library and also number of unassigned reads falling in each filtering category (unmapped, read type, singleton, mapping quality, chimera, fragment length, duplicate, multi-mapping, secondary alignment, split alignment, no features, overlapping length and assignment ambiguity). See Users Guide for more details. |
Wei Shi and Yang Liao
Yang Liao, Gordon K Smyth and Wei Shi (2019). The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Research, 47(8):e47. http://www.ncbi.nlm.nih.gov/pubmed/30783653
Yang Liao, Gordon K Smyth and Wei Shi (2014). featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7):923-30. http://www.ncbi.nlm.nih.gov/pubmed/24227677
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | ## Not run:
# Summarize SAM format single-end reads using built-in RefSeq annotation for mouse genome mm10:
featureCounts(files="mapping_results_SE.sam",annot.inbuilt="mm10")
# Summarize single-end reads using a user-provided GTF annotation file:
featureCounts(files="mapping_results_SE.sam",annot.ext="annotation.gtf",
isGTFAnnotationFile=TRUE,GTF.featureType="exon",GTF.attrType="gene_id")
# Summarize single-end reads using 5 threads:
featureCounts(files="mapping_results_SE.sam",nthreads=5)
# Summarize BAM format single-end read data:
featureCounts(files="mapping_results_SE.bam")
# Perform strand-specific read counting (strandSpecific=2 if reversely stranded):
featureCounts(files="mapping_results_SE.bam",strandSpecific=1)
# Summarize paired-end reads and counting fragments (instead of reads):
featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE)
# Count fragments satisfying the fragment length criteria, eg. [50bp, 600bp]:
featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE,
checkFragLength=TRUE,minFragLength=50,maxFragLength=600)
# Count fragments that have both ends successfully aligned without checking the fragment length:
featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE,requireBothEndsMapped=TRUE)
# Exclude chimeric fragments from fragment counting:
featureCounts(files="mapping_results_PE.bam",isPairedEnd=TRUE,countChimericFragments=FALSE)
## End(Not run)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.