View source: R/preprocessReads.R
preprocessReads | R Documentation |
Truncate sequences, remove parts matching to adapters and filter out low quality or low complexity sequences from (compressed) 'fasta' or 'fastq' files.
preprocessReads(
filename,
outputFilename = NULL,
filenameMate = NULL,
outputFilenameMate = NULL,
truncateStartBases = NULL,
truncateEndBases = NULL,
Lpattern = "",
Rpattern = "",
max.Lmismatch = rep(0:2, c(6, 3, 100)),
max.Rmismatch = rep(0:2, c(6, 3, 100)),
with.Lindels = FALSE,
with.Rindels = FALSE,
minLength = 14L,
nBases = 2L,
complexity = NULL,
nrec = 1000000L,
clObj = NULL
)
filename |
the name(s) of the input sequence file(s). |
outputFilename |
the name(s) of the output sequence file(s). |
filenameMate |
for paired-end experiments, the name(s) of the input sequence file(s) containing the second read (mate) of each pair. |
outputFilenameMate |
for paired-end experiments, the name(s) of the output sequence file(s) containing the second read (mate) of each pair. |
truncateStartBases |
integer(1): the number of bases to be truncated (removed) from the beginning of each sequence. |
truncateEndBases |
integer(1): the number of bases to be truncated (removed) from the end of each sequence. |
Lpattern |
character(1): the left (5'-end) adapter sequence. |
Rpattern |
character(1): the right (3'-end) adapter sequence. |
max.Lmismatch |
mismatch tolerance when searching for matches of
|
max.Rmismatch |
mismatch tolerance when searching for matches of
|
with.Lindels |
if |
with.Rindels |
same as |
minLength |
integer(1): the minimal allowed sequence length. |
nBases |
integer(1): the maximal number of Ns allowed per sequence. |
complexity |
|
nrec |
integer(1): the number of sequence records to read at a time. |
clObj |
a cluster object to be used for parallel processing of multiple files (see ‘Details’). |
Sequence files can be in fasta or fastq format, and can be compressed by
either gzip, bzip2 or xz (extensions .gz, .bz2 or .xz). Multiple files
can be processed by a single call to preprocessReads
; in that
case all sequence file vectors must have identical lengths.
nrec
can be used to limit the memory usage when processing
large input files. preprocessReads
iteratively loads chunks of
nrec
sequences from the input until all data been processed.
Sequence pairs from paired-end experiments can be processed by
specifying pairs of input and output files (filenameMate
and
outputFilenameMate
arguments). In that case, it is assumed that
pairs appear in the same order in the two input files, and only pairs
in which both reads pass all filtering criteria are written to the
output files, maintaining the consistent ordering.
If output files are compressed, the processed sequences are first written to temporary files (created in the same directory as the final output file), and the output files are generated at the end by compressing the temporary files.
For the trimming of left and/or right flanking sequences (adapters) from
sequence reads, the trimLRPatterns
function
from package Biostrings is used, and the arguments Lpattern
,
Rpattern
, max.Lmismatch
, max.Rmismatch
,
with.Lindels
and with.Rindels
are used in the call to
trimLRPatterns
. Lfixed
and Rfixed
arguments
of trimLRPatterns
are set to TRUE
, thus only fixed
patterns (without IUPAC codes for ambigous bases) can be
used. Currently, trimming of adapters is only supported for single read
experiments.
Sequence complexity (H
) is calculated based on the dinucleotide
composition using the formula (Shannon entropy):
H = -\sum_i {f_i \log_2 f_i},
where f_i
is the fraction of dinucleotide i
from all
dinucleotides in the sequence. Sequence reads that fulfill the condition
H/H_r \ge c
are retained (not filtered out), where H_r =
3.908
is the reference complexity in bits obtained from the human
genome, and c
is the value given to the argument complexity
.
If an object that inherits from class cluster
is provided to
the clObj
argument, for example an object returned by
makeCluster
from package parallel,
multiple files will be processed in parallel using
clusterMap
from package parallel.
A matrix with summary statistics on the processed sequences, containing:
One column per input file (or pair of input files for paired-end experiments).
The number of sequences or sequence pairs in rows:
totalSequences
- the total number in the input
matchTo5pAdaptor
- matching to Lpattern
matchTo3pAdaptor
- matching to Rpattern
tooShort
- shorter than minLength
tooManyN
- more than nBases
Ns
lowComplexity
- relative complexity below complexity
totalPassed
- the number of sequences/sequence pairs that pass all filtering criteria and were written to the output file(s).
Anita Lerch, Dimos Gaidatzis and Michael Stadler
trimLRPatterns
from package Biostrings,
makeCluster
from package parallel
# sample files
infiles <- system.file(package="QuasR", "extdata",
c("rna_1_1.fq.bz2","rna_1_2.fq.bz2"))
outfiles <- paste(tempfile(pattern=c("output_1_","output_2_")),".fastq",sep="")
# single read example
preprocessReads(infiles, outfiles, nBases=0, complexity=0.6)
unlink(outfiles)
# paired-end example
preprocessReads(filename=infiles[1],
outputFilename=outfiles[1],
filenameMate=infiles[2],
outputFilenameMate=outfiles[2],
nBases=0, complexity=0.6)
unlink(outfiles)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.