Description Usage Arguments Details Value Author(s) References See Also Examples
The align
function can align both DNA and RNA sequencing reads. Subjunc
is an RNA-seq aligner and it reports full alignment of each read (align
reports partial alignment for exon spanning reads).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 | align(
# index for reference sequences
index,
# input reads and output
readfile1,
readfile2 = NULL,
type = "rna",
input_format = "gzFASTQ",
output_format = "BAM",
output_file = paste(readfile1,"subread",output_format,sep="."),
# offset value added to Phred quality scores of read bases
phredOffset = 33,
# thresholds for mapping
nsubreads = 10,
TH1 = 3,
TH2 = 1,
maxMismatches = 3,
# unique mapping and multi-mapping
unique = FALSE,
nBestLocations = 1,
# indel detection
indels = 5,
complexIndels = FALSE,
# read trimming
nTrim5 = 0,
nTrim3 = 0,
# distance and orientation of paired end reads
minFragLength = 50,
maxFragLength = 600,
PE_orientation = "fr",
# number of CPU threads
nthreads = 1,
# read group
readGroupID = NULL,
readGroup = NULL,
# read order
keepReadOrder = FALSE,
sortReadsByCoordinates = FALSE,
# color space reads
color2base = FALSE,
# dynamic programming
DP_GapOpenPenalty = -1,
DP_GapExtPenalty = 0,
DP_MismatchPenalty = 0,
DP_MatchScore = 2,
# detect structural variants
detectSV = FALSE,
# gene annotation
useAnnotation = FALSE,
annot.inbuilt = "mm10",
annot.ext = NULL,
isGTF = FALSE,
GTF.featureType = "exon",
GTF.attrType = "gene_id",
chrAliases = NULL)
subjunc(
# index for reference sequences
index,
# input reads and output
readfile1,
readfile2 = NULL,
input_format = "gzFASTQ",
output_format = "BAM",
output_file = paste(readfile1,"subjunc",output_format,sep = "."),
# offset value added to Phred quality scores of read bases
phredOffset = 33,
# thresholds for mapping
nsubreads = 14,
TH1 = 1,
TH2 = 1,
maxMismatches = 3,
# unique mapping and multi-mapping
unique = FALSE,
nBestLocations = 1,
# indel detection
indels = 5,
complexIndels = FALSE,
# read trimming
nTrim5 = 0,
nTrim3 = 0,
# distance and orientation of paired end reads
minFragLength = 50,
maxFragLength = 600,
PE_orientation = "fr",
# number of CPU threads
nthreads = 1,
# read group
readGroupID = NULL,
readGroup = NULL,
# read order
keepReadOrder = FALSE,
sortReadsByCoordinates = FALSE,
# color space reads
color2base = FALSE,
# dynamic programming
DP_GapOpenPenalty = -1,
DP_GapExtPenalty = 0,
DP_MismatchPenalty = 0,
DP_MatchScore = 2,
# detect all junctions including gene fusions
reportAllJunctions = FALSE,
# gene annotation
useAnnotation = FALSE,
annot.inbuilt = "mm10",
annot.ext = NULL,
isGTF = FALSE,
GTF.featureType = "exon",
GTF.attrType = "gene_id",
chrAliases = NULL)
|
index |
character string giving the basename of index file. Index files should be located in the current directory. |
readfile1 |
a character vector including names of files that include sequence reads to be aligned. For paired-end reads, this gives the list of files including first reads in each library. File format is FASTQ/FASTA by default. See |
readfile2 |
a character vector giving names of files that include second reads in paired-end read data. Files included in |
type |
a character string or an integer giving the type of sequencing data. Possible values include |
input_format |
character string specifying format of read input files. |
output_format |
character string specifying format of output file. |
output_file |
a character vector specifying names of output files. By default, names of output files are set as the file names provided in |
phredOffset |
numeric value added to base-calling Phred scores to make quality scores (represented as ASCII letters). Possible values include |
nsubreads |
numeric value giving the number of subreads extracted from each read. |
TH1 |
numeric value giving the consensus threshold for reporting a hit. This is the threshold for the first reads if paired-end read data are provided. |
TH2 |
numeric value giving the consensus threhold for the second reads in paired-end data. |
maxMismatches |
numeric value giving the maximum number of mis-matched bases allowed in the alignment. |
unique |
logical indicating if only uniquely mapped reads should be reported. A uniquely mapped read has one single mapping location that has less mis-matched bases than any other candidate locations. By default, multi-mapping reads will be reported in addition to uniquely mapped reads. Number of alignments reported for each multi-mapping read is determined by the |
nBestLocations |
numeric value specifying the maximal number of equally-best mapping locations that will be reported for a multi-mapping read. 1 by default. The allowed value is between 1 to 16 (inclusive). In the mapping output, ‘NH’ tag is used to indicate how many alignments are reported for the read and ‘HI’ tag is used for numbering the alignments reported for the same read. This argument is only applicable when |
indels |
numeric value giving the maximum number of insertions/deletions allowed during the mapping. |
complexIndels |
logical indicating if complex indels will be detected. If |
nTrim5 |
numeric value giving the number of bases trimmed off from 5' end of each read. |
nTrim3 |
numeric value giving the number of bases trimmed off from 3' end of each read. |
minFragLength |
numeric value giving the minimum fragment length. 50 by default. |
maxFragLength |
numeric value giving the maximum fragment length. 600 by default. |
PE_orientation |
character string giving the orientation of the two reads from the same pair. It has three possible values including |
nthreads |
numeric value giving the number of threads used for mapping. |
readGroupID |
a character string giving the read group ID. The specified string is added to the read group header field and also be added to each read in the mapping output. |
readGroup |
a character string in the format of |
keepReadOrder |
logical indicating if the order of reads in the BAM output is kept the same as that in the input file. (Reads from the same pair are always placed next to each other regardless of this option). |
sortReadsByCoordinates |
logical indicating if reads will be sorted by their mapping locations in the mapping output. This option is applicable for BAM output only. A BAI index file will also be generated for each BAM file so the generated BAM files can be directly loaded into a genome browser such as IGB or IGV. |
color2base |
logical. If |
DP_GapOpenPenalty |
a numeric value giving the penalty for opening a gap when using the Smith-Waterman dynamic programming algorithm to detect insertions and deletions. The Smith-Waterman algorithm is only applied for those reads which are found to contain insertions or deletions. |
DP_GapExtPenalty |
a numeric value giving the penalty for extending the gap, used by the Smith-Waterman algorithm. |
DP_MismatchPenalty |
a numeric value giving the penalty for mismatches, used by the Smith-Waterman algorithm. |
DP_MatchScore |
a numeric value giving the score for matches used by the Smith-Waterman algorithm. |
detectSV |
logical indicating if structural variants (SVs) will be detected during read mapping. See below for more details. |
reportAllJunctions |
logical indicating if all discovered junctions will be reported. This includes exon-exon junctions and also gene fusions. Presence of donor/receptor sites is not required when |
useAnnotation |
logical indicating if gene annotation information will be used in the mapping. |
annot.inbuilt |
a character string specifying an in-built annotation used for read summarization. It has four possible values including |
annot.ext |
A character string giving name of a user-provided annotation file or a data frame including user-provided annotation data. If the annotation is in GTF format, it can only be provided as a file. If it is in SAF format, it can be provided as a file or a data frame. The provided annotation file can be a uncompressed or gzip compressed file. If an annotation is provided via |
isGTF |
logical indicating if the annotation provided via the |
GTF.featureType |
a character string denoting the type of features that will be extracted from a GTF annotation. |
GTF.attrType |
a character string denoting the type of attributes in a GTF annotation that will be used to group features. |
chrAliases |
a character string providing the name of a comma-delimited text file that includes aliases of chromosome names. This file should contain two columns. First column contains names of chromosomes included in the SAF or GTF annotation and second column contains corresponding names of chromosomes in the reference genome. No column headers should be provided. Also note that chromosome names are case sensitive. This file can be used to match chromosome names between the annotation and the reference genome. |
align
and subjunc
are R versions of programs in the Subread suite of programs (http://subread.sourceforge.net) that implement fast, accurate sequence read alignment based on the "seed-and-vote" mapping paradigm (Liao et al. 2013; Liao et al. 2019).
align
is the R version of the command-line program subread-align
and subjunc
is the R version of the command-line program subjunc
.
align
is a general-purpose aligner that can be used to align both genomic DNA-seq reads and RNA-seq reads.
subjunc
is designed for mapping RNA-seq reads only.
The main difference between subjunc
and align
for RNA-seq alignment is that subjunc
reports discovered exon-exon junctions and it also performs full alignments for every read including exon-spanning reads.
align
instead reports the largest mappable regions for each read and soft-clips the remainder of the read.
subjunc
requires the presence of donor and receptor sites when calling exon-exon junctions.
It can detect up to four junction locations in a junction read.
align
and subjunc
can be run with either a full index or a gapped index (see buildindex
for more details).
Maximum mapping speed is achieved when the full index is used.
With a full index built for human/mouse genome, align
maps ~14 million reads per minute with 10 threads.
subjunc
is slightly slower than align
.
Both align
and subjunc
require 17.8GB of memory for mapping.
With a gapped index built for human/mouse genome, align
maps ~9 million reads per minute with 10 threads.
align
requires 8.2GB of memory for mapping and subjunc
requires 8.8GB of memory.
The detectSV
argument of align
can be used to detect structural variants (SVs) in genomic DNA sequencing data.
The reportAllJunctions
argument of subjunc
function can be used for the detection of SVs in RNA-seq data, as well as the detection of non-canonical junctions that do not have the canonical donor/receptor sites.
For each library, breakpoints detected from SV events are saved to a file with suffix name ‘.breakpoints.txt’, which includes chromosomal coordinates of SV breakpoints and numbers of supporting reads.
The BAM/SAM output includes extra fields to describe the complete alignments of breakpoint-containing reads.
For a breakpoint-containing read, mapping of its major sequence segment is described in the main fields of BAM/SAM output whereas mapping of its minor sequence segment, which does not map along with the major segment due to the presence of a breakpoint, is described in the extra fields including 'CC'(Chr), 'CP'(Position),'CG'(CIGAR) and 'CT'(strand).
Note that each breakpoint-containing read occupies only one row in the BAM/SAM output.
For each library, mapping results including both aligned and unaligned reads are saved to a file.
Indels discovered during mapping are saved to a VCF format file (.indel.vcf
).
subjunc
also outputs a BED format file that contains list of discovered exon-exon junctions (.junction.bed
).
If detectSV
or reportAllJunctions
options are set to TRUE
, structural variants discovered during read mapping will be reported and saved to a file (.breakpoints.txt
), which includes chromosomal coordinates of breakpoints and number of supporting reads for each breakpoint.
Other than outputting data to files, align
and subjunc
also return a data.frame
that includes mapping statistics for each library, such as total number of reads, number of mapped reads, number of uniquely mapped reads, number of indels and number of junctions (subjunc
only).
Wei Shi and Yang Liao
Yang Liao, Gordon K Smyth and Wei Shi (2019). The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Research, 47(8):e47. http://www.ncbi.nlm.nih.gov/pubmed/30783653
Yang Liao, Gordon K Smyth and Wei Shi (2013). The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research, 41(10):e108. http://www.ncbi.nlm.nih.gov/pubmed/23558742
1 2 3 4 5 6 7 8 9 10 11 12 | # Build an index for the artificial sequence included in file 'reference.fa'.
ref <- system.file("extdata","reference.fa",package="Rsubread")
buildindex(basename="./reference_index",reference=ref)
# align a sample read dataset ('reads.txt') to the sample reference
reads <- system.file("extdata","reads.txt.gz",package="Rsubread")
align.stat <- align(index = "./reference_index", readfile1 = reads,
output_file = "./Rsubread_alignment.BAM", phredOffset = 64)
align.stat
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.