readFastq: Read, write, and count records in FASTQ-formatted files
In Bioconductor/ShortRead: FASTQ input and manipulation

readFastq

R Documentation

Description

readFastq reads all FASTQ-formatted files in a directory dirPath whose file names match pattern, returning a compact internal representation of the sequences and quality scores in the files. Methods read all files into a single R object; a typical use is to restrict input to a single FASTQ file.

writeFastq writes an object to a single file, using mode="w" (the default) to create a new file or mode="a" to append to an existing file. Attempting to write to an existing file with mode="w" results in an error.

countFastq counts the number of records, nucleotides, and base-level quality scores in one or several fastq files.

Usage

readFastq(dirPath, pattern=character(0), ...)
## S4 method for signature 'character'
readFastq(dirPath, pattern=character(0), ..., withIds=TRUE)

writeFastq(object, file, mode="w", full=FALSE, compress=TRUE, ...)

countFastq(dirPath, pattern=character(0), ...)
## S4 method for signature 'character'
countFastq(dirPath, pattern=character(0), ...)

Arguments

`dirPath`	A character vector (or other object; see methods defined on this generic) giving the directory path (relative or absolute) or single file name of FASTQ files to be read.
`pattern`	The (`grep`-style) pattern describing file names to be read. The default (`character(0)`) results in (attempted) input of all files in the directory.
`object`	An object to be output in `fastq` format. For methods, use `showMethods(object, where=getNamespace("ShortRead"))`.
`file`	A length 1 character vector giving the path to the file to which the object is to be written.
`mode`	A length 1 character vector equal to either ‘w’ or ‘a’ to write to a new file or append to an existing file, respectively.
`full`	A logical(1) indicating whether the identifier line should be repeated (`full=TRUE`) or omitted (`full=FALSE`) on the third line of the fastq record.
`compress`	A logical(1) indicating whether the file should be gz-compressed. The default is `TRUE`.
`...`	Additional arguments, in particular `qualityType` and `filter`. `qualityType`: representation to be used for quality scores; must be one of `Auto` (choose the Illumina base 64 encoding `SFastqQuality` if all characters have ASCII codes greater than 58 (`:`) and some have codes greater than 74 (`J`)), `FastqQuality` (Phred-like base 33 encoding), or `SFastqQuality` (Illumina base 64 encoding). `filter`: an object of class `srFilter`, used to filter objects of class `ShortReadQ` at input.
`withIds`	`logical(1)` indicating whether identifiers should be read from the fastq file.
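
The `Auto` rule described for `qualityType` can be illustrated in base R. This is only a sketch of the rule as worded above, not ShortRead's implementation; the function name `guess_quality_type` is hypothetical, and returning `FastqQuality` when the rule does not match is an assumption.

```r
# Hypothetical helper: apply the `Auto` rule as worded in the argument table.
# Assumption: when the rule does not select SFastqQuality, fall back to
# FastqQuality (the text above does not state the fallback explicitly).
guess_quality_type <- function(qualities) {
    # ASCII codes of every quality character across all supplied strings
    codes <- utf8ToInt(paste(qualities, collapse = ""))
    if (all(codes > 58L) && any(codes > 74L)) {
        "SFastqQuality"   # Illumina base 64 encoding
    } else {
        "FastqQuality"    # Phred-like base 33 encoding
    }
}

guess_quality_type("]]]]Y]VCH")   # all codes > 58, some > 74
guess_quality_type("IIII!!II")    # contains codes <= 58
```

A string made only of characters between `;` and `J` satisfies the first condition but not the second, so under this sketch it is treated as base 33.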

Details

The fastq format is not precisely defined. The basic definition used here parses the following four lines as a single record:

    @HWI-EAS88_1_1_1_1001_499
    GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT
    +HWI-EAS88_1_1_1_1001_499
    ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS

The first and third lines are identifiers, each preceded by a specific character, @ and + respectively (the identifiers are identical, in the case of Solexa). The second line is an upper-case sequence of nucleotides. The parser recognizes the IUPAC-standard alphabet (hence ambiguous nucleotides), coercing . to - to represent missing values. The final line is an ASCII-encoded representation of quality scores, with one ASCII character per nucleotide.
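
The record layout described here can be made concrete with a small base-R sketch. It is illustrative only and is not the parser ShortRead uses; `parse_fastq_record` and the `valid` flag are hypothetical names introduced for this example.

```r
# Hypothetical helper: split one four-line fastq record into its parts.
parse_fastq_record <- function(lines) {
    stopifnot(is.character(lines), length(lines) == 4L)
    list(
        id      = sub("^@", "", lines[1]),       # identifier without leading '@'
        sread   = chartr(".", "-", lines[2]),    # '.' is coerced to '-'
        quality = lines[4],                      # one ASCII character per base
        # structural checks: marker characters and matching lengths
        valid   = startsWith(lines[1], "@") &&
                  startsWith(lines[3], "+") &&
                  nchar(lines[2]) == nchar(lines[4])
    )
}

record <- c(
    "@HWI-EAS88_1_1_1_1001_499",
    "GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT",
    "+HWI-EAS88_1_1_1_1001_499",
    "]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS"
)
parsed <- parse_fastq_record(record)
parsed$id
parsed$valid
```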

The encoding implicit in Solexa-derived fastq files is that each character code corresponds to a score equal to the ASCII character value minus 64 (e.g., ASCII @ is decimal 64, and corresponds to a Solexa quality score of 0). This is different from BioPerl, for instance, which recovers quality scores by subtracting 33 from the ASCII character value (so that, for instance, !, with decimal value 33, encodes value 0).
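
The two offsets can be compared directly in base R. This is a minimal sketch of the arithmetic only; `decode_quality` is a hypothetical helper, not a ShortRead function.

```r
# Hypothetical helper: convert a quality string to integer scores by
# subtracting an offset from each character's ASCII code.
decode_quality <- function(qual, offset) {
    utf8ToInt(qual) - offset
}

# Solexa-style convention: subtract 64, so '@' (ASCII 64) is score 0
decode_quality("@", offset = 64L)

# BioPerl-style convention: subtract 33, so '!' (ASCII 33) is score 0
decode_quality("!", offset = 33L)

# Quality line of the example record, decoded with the Solexa offset
solexa_scores <- decode_quality("]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS", offset = 64L)
range(solexa_scores)
```

The same character therefore decodes to scores that differ by 31 depending on which convention is assumed, which is why the quality representation matters at input.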

The BioPerl description of fastq asserts that the first character of line 4 is a !, but the current parser does not support this convention.

writeFastq creates files following the specification outlined above, using the IUPAC-standard alphabet (hence, sequences containing ‘.’ when read will be represented by ‘-’ when written).

Value

readFastq returns a single R object (e.g., ShortReadQ) containing the sequences and qualities from all files in dirPath matching pattern. There is no guarantee of the order in which files are read.

writeFastq is invoked primarily for its side effect, creating or appending to the file named by file. The function returns, invisibly, the length of object, and hence the number of records written.
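
The following sketch exercises the write modes and return value described on this page. It assumes the ShortRead package is installed; the two reads and their identifiers are made up for illustration, and the file is written uncompressed so that it can be inspected with readLines.

```r
library(ShortRead)

# Two made-up reads, built in memory so that no input file is needed
reads <- ShortReadQ(
    sread   = DNAStringSet(c("ACGT", "GGCC")),
    quality = FastqQuality(c("IIII", "IIII")),
    id      = BStringSet(c("read1", "read2"))
)

out <- tempfile(fileext = ".fastq")

# mode="w" (the default) creates the file; the invisible return value
# is the number of records written
n_written <- writeFastq(reads, out, compress = FALSE, full = TRUE)

# mode="a" appends the same records to the existing file
n_appended <- writeFastq(reads, out, mode = "a", compress = FALSE, full = TRUE)

# Writing to an existing file with mode="w" is documented to fail
second_try <- tryCatch(
    writeFastq(reads, out, mode = "w", compress = FALSE),
    error = function(e) e
)
inherits(second_try, "error")

lines <- readLines(out)   # full=TRUE repeats the identifier on line 3
counts <- countFastq(out)
counts
```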

countFastq returns a data.frame with row names equal to the base (file) name of the fastq file, and columns records, nucleotides, and scores, corresponding to a tally of each entity in each file. Parsing mistakes from poorly formatted files result in an error.
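
To show what the three columns tally, here is a base-R sketch that produces a data.frame of the same shape for an uncompressed file with strictly four lines per record. It is not how countFastq is implemented; `tally_fastq` and the demo records are hypothetical.

```r
# Hypothetical helper: tally records, nucleotides, and quality scores in a
# plain-text fastq file, assuming exactly four lines per record.
tally_fastq <- function(path) {
    lines <- readLines(path)
    if (length(lines) == 0L || length(lines) %% 4L != 0L)
        stop("expected a non-empty file with four lines per record: ", path)
    sequences <- lines[seq(2L, length(lines), by = 4L)]   # line 2 of each record
    qualities <- lines[seq(4L, length(lines), by = 4L)]   # line 4 of each record
    data.frame(
        records     = length(sequences),
        nucleotides = sum(nchar(sequences)),
        scores      = sum(nchar(qualities)),
        row.names   = basename(path)
    )
}

# Two made-up records written to a temporary file
demo <- tempfile(fileext = ".fastq")
writeLines(c("@r1", "ACGT", "+", "IIII",
             "@r2", "GG",   "+", "II"), demo)
tally <- tally_fastq(demo)
tally
```

In a well-formed file the nucleotides and scores columns agree, since each nucleotide carries one quality character.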

Author(s)

Martin Morgan

Examples

library(ShortRead)

## list the methods defined on each generic
methods(readFastq)
methods(writeFastq)
methods(countFastq)

sp <- SolexaPath(system.file('extdata', package='ShortRead'))
rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt")
sread(rfq)
id(rfq)
quality(rfq)

## SolexaPath method 'knows' where FASTQ files are placed
rfq1 <- readFastq(sp, pattern="s_1_sequence.txt")
rfq1

file <- tempfile()
writeFastq(rfq, file)
readLines(file, 8)
countFastq(file)

Bioconductor/ShortRead documentation built on Nov. 2, 2024, 4:38 p.m.