preprocessReads: Preprocess Short Reads
In QuasR: Quantify and Annotate Short Reads in R

Description Usage Arguments Details Value Author(s) See Also Examples

View source: R/preprocessReads.R

Truncate sequences, remove parts matching to adapters and filter out low quality or low complexity sequences from (compressed) 'fasta' or 'fastq' files.

preprocessReads(filename, outputFilename=NULL,
             filenameMate=NULL, outputFilenameMate=NULL,
             truncateStartBases=NULL, truncateEndBases=NULL, 
             Lpattern="", Rpattern="",
             max.Lmismatch=rep(0:2, c(6,3,100)), max.Rmismatch=rep(0:2, c(6,3,100)),
             with.Lindels=FALSE, with.Rindels=FALSE,
             minLength=14L, nBases=2L, complexity=NULL,
             nrec=1000000L, clObj=NULL)

`filename`	the name(s) of the input sequence file(s).
`outputFilename`	the name(s) of the output sequence file(s).
`filenameMate`	for paired-end experiments, the name(s) of the input sequence file(s) containing the second read (mate) of each pair.
`outputFilenameMate`	for paired-end experiments, the name(s) of the output sequence file(s) containing the second read (mate) of each pair.
`truncateStartBases`	integer(1): the number of bases to be truncated (removed) from the begining of each sequence.
`truncateEndBases`	integer(1): the number of bases to be truncated (removed) from the end of each sequence.
`Lpattern`	character(1): the left (5'-end) adapter sequence.
`Rpattern`	character(1): the right (3'-end) adapter sequence.
`max.Lmismatch`	mismatch tolerance when searching for matches of `Lpattern` (see ‘Details’).
`max.Rmismatch`	mismatch tolerance when searching for matches of `Rpattern` (see ‘Details’).
`with.Lindels`	if `TRUE`, indels are allowed in the alignments of the suffixes of `Lpattern` with the subject, at its beginning (see ‘Details’).
`with.Rindels`	same as `with.Lindels` but for alignments of the prefixes of `Rpattern` with the subject, at its end (see ‘Details’).
`minLength`	integer(1): the minimal allowed sequence length.
`nBases`	integer(1): the maximal number of Ns allowed per sequence.
`complexity`	`NULL` (default) or numeric(1): If not `NULL`, the minimal sequence complexity, as a fraction of the average complexity in the human genome (~3.9bits). For example, `complexity = 0.5` will filter out sequences that do not have at least half the complexity of the human genome. See ‘Details’ on how the complexity is calculated.
`nrec`	integer(1): the number of sequence records to read at a time.
`clObj`	a cluster object to be used for parallel processing of multiple files (see ‘Details’).

Sequence files can be in fasta or fastq format, and can be compressed by either gzip, bzip2 or xz (extensions .gz, .bz2 or .xz). Multiple files can be processed by a single call to preprocessReads; in that case all sequence file vectors must have identical lengths.

nrec can be used to limit the memory usage when processing large input files. preprocessReads iteratively loads chunks of nrec sequences from the input until all data been processed.

Sequence pairs from paired-end experiments can be processed by specifying pairs of input and output files (filenameMate and outputFilenameMate arguments). In that case, it is assumed that pairs appear in the same order in the two input files, and only pairs in which both reads pass all filtering criteria are written to the output files, maintaining the consistent ordering.

If output files are compressed, the processed sequences are first written to temporary files (created in the same directory as the final output file), and the output files are generated at the end by compressing the temporary files.

For the trimming of left and/or right flanking sequences (adapters) from sequence reads, the trimLRPatterns function from package Biostrings is used, and the arguments Lpattern, Rpattern, max.Lmismatch, max.Rmismatch, with.Lindels and with.Rindels are used in the call to trimLRPatterns. Lfixed and Rfixed arguments of trimLRPatterns are set to TRUE, thus only fixed patterns (without IUPAC codes for ambigous bases) can be used. Currently, trimming of adapters is only supported for single read experiments.

Sequence complexity (H) is calculated based on the dinucleotide composition using the formula (Shannon entropy):

H = -∑_i {f_i \log_2 f_i},

where f_i is the fraction of dinucleotide i from all dinucleotides in the sequence. Sequence reads that fulfill the condition H/H_r ≥ c are retained (not filtered out), where H_r = 3.908 is the reference complexity in bits obtained from the human genome, and c is the value given to the argument complexity.

If an object that inherits from class cluster is provided to the clObj argument, for example an object returned by makeCluster from package parallel, multiple files will be processed in parallel using clusterMap from package parallel.

A matrix with summary statistics on the processed sequences, containing:

One column per input file (or pair of input files for paired-end experiments).
The number of sequences or sequence pairs in rows:
- totalSequences - the total number in the input
- matchTo5pAdaptor - matching to Lpattern
- matchTo3pAdaptor - matching to Rpattern
- tooShort - shorter than minLength
- tooManyN - more than nBases Ns
- lowComplexity - relative complexity below complexity
- totalPassed - the number of sequences/sequence pairs that pass all filtering criteria and were written to the output file(s).

Anita Lerch, Dimos Gaidatzis and Michael Stadler

trimLRPatterns from package Biostrings, makeCluster from package parallel

# sample files
infiles <- system.file(package="QuasR", "extdata",
                       c("rna_1_1.fq.bz2","rna_1_2.fq.bz2"))
outfiles <- paste(tempfile(pattern=c("output_1_","output_2_")),".fastq",sep="")

# single read example
preprocessReads(infiles, outfiles, nBases=0, complexity=0.6)
unlink(outfiles)

# paired-end example
preprocessReads(filename=infiles[1],
                outputFilename=outfiles[1],
                filenameMate=infiles[2],
                outputFilenameMate=outfiles[2],
                nBases=0, complexity=0.6)
unlink(outfiles)