readAligned    R Documentation

(Legacy) Read aligned reads and their quality scores into R representations

Description

Import files containing aligned reads into an internal representation of the alignments, sequences, and quality scores. Most methods (see ‘details’ for exceptions) read all files into a single R object.

Usage


readAligned(dirPath, pattern=character(0), ...)

Arguments

dirPath

A character vector (or other object; see methods defined on this generic) giving the directory path, relative or absolute, of the aligned read files to be input. Some methods also accept a character vector of file names.

pattern

The (grep-style) pattern describing file names to be read. The default (character(0)) results in (attempted) input of all files in the directory.

...

Additional arguments, used by methods. When dirPath is a character vector, the argument type must be provided. Possible values for type and their meanings are described below. Most methods implement filter=srFilter(), allowing objects of class SRFilter to selectively return aligned reads.
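As a rough illustration of how a grep-style pattern narrows a set of file names, the following base R sketch applies a regular expression to some made-up names. It assumes the selection behaves like base R regular-expression matching; the file names are hypothetical and the sketch is not ShortRead's own implementation.

```r
# Hypothetical file names, for illustration only
files <- c("s_1_export.txt", "s_2_export.txt", "s_5_0001_realign.txt", "README")

# A grep-style (regular expression) pattern selects a subset of the names
pattern <- "s_[0-9]+_export\\.txt$"
selected <- grep(pattern, files, value = TRUE)
print(selected)

# character(0) is the documented default; here "no pattern" is modelled as
# "attempt every file" (an assumption about the behaviour, not ShortRead code)
pattern <- character(0)
all_files <- if (length(pattern) == 0L) files else grep(pattern, files, value = TRUE)
print(all_files)
```

A pattern that is too loose would also pick up files of a different type, so anchoring it (for example with `$`) is usually worthwhile.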

Details

There is no standard aligned read file format; methods parse particular file types.

The readAligned,character-method interprets file types based on an additional type argument. Supported types are:

type="SolexaExport"

This type parses .*_export.txt files following the documentation in the Solexa Genome Alignment software manual, version 0.3.0. These files consist of the following columns; consult Solexa documentation for precise descriptions. If parsed, values can be retrieved from AlignedRead as follows:

Machine

see below

Run number

stored in alignData

Lane

stored in alignData

Tile

stored in alignData

X

stored in alignData

Y

stored in alignData

Multiplex index

see below

Paired read number

see below

Read

sread

Quality

quality

Match chromosome

chromosome

Match contig

alignData

Match position

position

Match strand

strand

Match description

Ignored

Single-read alignment score

alignQuality

Paired-read alignment score

Ignored

Partner chromosome

Ignored

Partner contig

Ignored

Partner offset

Ignored

Partner strand

Ignored

Filtering

alignData

The following optional arguments, set to FALSE by default, influence data input:

withMultiplexIndex

When TRUE, include the multiplex index as a column multiplexIndex in alignData.

withPairedReadNumber

When TRUE, include the paired read number as a column pairedReadNumber in alignData.

withId

When TRUE, construct an identifier string as ‘Machine_Run:Lane:Tile:X:Y#multiplexIndex/pairedReadNumber’. The substrings ‘#multiplexIndex’ and ‘/pairedReadNumber’ are not present if withMultiplexIndex=FALSE or withPairedReadNumber=FALSE.

withAll

A convenience argument which, when TRUE, sets all with* values to TRUE.

Note that not all paired read columns are interpreted. Different interfaces to reading alignment files are described in SolexaPath and SolexaSet.
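The identifier layout described for withId can be sketched in base R as follows. The helper make_read_id and all field values are hypothetical and serve only to show how the optional ‘#multiplexIndex’ and ‘/pairedReadNumber’ substrings appear or drop out; this is not the code ShortRead uses.

```r
# Hypothetical helper, not part of ShortRead: assemble the identifier layout
# 'Machine_Run:Lane:Tile:X:Y#multiplexIndex/pairedReadNumber'.
make_read_id <- function(machine, run, lane, tile, x, y,
                         multiplexIndex = NULL, pairedReadNumber = NULL) {
    id <- sprintf("%s_%s:%s:%s:%s:%s", machine, run, lane, tile, x, y)
    # The optional substrings are appended only when a value is supplied
    if (!is.null(multiplexIndex))
        id <- paste0(id, "#", multiplexIndex)
    if (!is.null(pairedReadNumber))
        id <- paste0(id, "/", pairedReadNumber)
    id
}

# Made-up field values
id_full <- make_read_id("HWI-EAS88", 3, 2, 1, 451, 945,
                        multiplexIndex = 0, pairedReadNumber = 1)
id_short <- make_read_id("HWI-EAS88", 3, 2, 1, 451, 945)
print(c(id_full, id_short))
```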

type="SolexaPrealign"

See SolexaRealign

type="SolexaAlign"

See SolexaRealign

type="SolexaRealign"

These types parse s_L_TTTT_prealign.txt, s_L_TTTT_align.txt or s_L_TTTT_realign.txt files produced by default and eland analyses. From the Solexa documentation: align corresponds to unfiltered first-pass alignments; prealign adjusts alignments for error rates (when available); and realign filters alignments to exclude clusters failing to pass quality criteria.

Because base quality scores are not stored with alignments, the object returned by readAligned scores all base qualities as -32.

If parsed, values can be retrieved from AlignedRead as follows:

Sequence

stored in sread

Best score

stored in alignQuality

Number of hits

stored in alignData

Target position

stored in position

Strand

stored in strand

Target sequence

Ignored; parse using readXStringColumns

Next best score

stored in alignData

type="SolexaResult"

This parses s_L_eland_results.txt files, an intermediate format that does not contain read or alignment quality scores.

Because base quality scores are not stored with alignments, the object returned by readAligned scores all base qualities as -32.

Columns of this file type can be retrieved from AlignedRead as follows (description of columns is from Table 19, Genome Analyzer Pipeline Software User Guide, Revision A, January 2008):

Id

Not parsed

Sequence

stored in sread

Type of match code

Stored in alignData as matchCode. Codes are (from the Eland manual): NM (no match); QC (no match due to quality control failure); RM (no match due to repeat masking); U0 (best match was unique and exact); U1 (best match was unique, with 1 mismatch); U2 (best match was unique, with 2 mismatches); R0 (multiple exact matches found); R1 (multiple 1 mismatch matches found, no exact matches); R2 (multiple 2 mismatch matches found, no exact or 1-mismatch matches).

Number of exact matches

stored in alignData as nExactMatch

Number of 1-error mismatches

stored in alignData as nOneMismatch

Number of 2-error mismatches

stored in alignData as nTwoMismatch

Genome file of match

stored in chromosome

Position

stored in position

Strand

(direction of match) stored in strand

‘N’ treatment

stored in alignData, as NCharacterTreatment. ‘.’ indicates treatment of ‘N’ was not applicable; ‘D’ indicates treatment as deletion; ‘|’ indicates treatment as insertion

Substitution error

stored in alignData as mismatchDetailOne and mismatchDetailTwo. Present only for unique inexact matches at one or two positions. Position and type of first substitution error, e.g., 11A represents 11 matches with 12th base an A in reference but not read. The reference manual cited below lists only one field (mismatchDetailOne), but two are present in files seen in the wild.
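The match codes and substitution descriptors listed above can be decoded with a few lines of base R. This is a minimal sketch: the code descriptions are copied from the list above, parse_mismatch is a hypothetical helper, and the assumption that a descriptor is a run of digits followed by a single base letter is inferred only from the ‘11A’ example.

```r
# Descriptions copied from the matchCode entry above
match_codes <- c(
    NM = "no match",
    QC = "no match due to quality control failure",
    RM = "no match due to repeat masking",
    U0 = "best match was unique and exact",
    U1 = "best match was unique, with 1 mismatch",
    U2 = "best match was unique, with 2 mismatches",
    R0 = "multiple exact matches found",
    R1 = "multiple 1 mismatch matches found, no exact matches",
    R2 = "multiple 2 mismatch matches found, no exact or 1-mismatch matches"
)

codes <- c("U0", "R1", "NM")           # made-up values
decoded <- unname(match_codes[codes])
print(decoded)

# Hypothetical helper: split a descriptor such as "11A" into the number of
# matching bases before the mismatch and the reference base.
# Assumes the form <digits><base>; other forms are not handled.
parse_mismatch <- function(x) {
    m <- regmatches(x, regexec("^([0-9]+)([ACGTN])$", x))[[1]]
    n <- as.integer(m[2])
    list(nMatch = n, position = n + 1L, refBase = m[3])
}

d <- parse_mismatch("11A")
str(d)
```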

type="MAQMap", records=-1L

Parse binary map files produced by MAQ. See details in the next section. The records option determines how many records are read; -1L (the default) means that all records are input. For type="MAQMap", dirPath and pattern must match a single file.

type="MAQMapShort", records=-1L

The same as type="MAQMap" but for map files made with Maq prior to version 0.7.0. (These files use a different maximum read length [64 instead of 128], and are hence incompatible with newer Maq map files.) For type="MAQMapShort", dirPath and pattern must match a single file.

type="MAQMapview"

Parse alignment files created by MAQ's ‘mapview’ command. Interpretation of columns is based on the description in the MAQ manual, specifically

        ...each line consists of read name, chromosome, position,
        strand, insert size from the outer coordinates of a pair,
        paired flag, mapping quality, single-end mapping quality,
        alternative mapping quality, number of mismatches of the
        best hit, sum of qualities of mismatched bases of the best
        hit, number of 0-mismatch hits of the first 24bp, number
        of 1-mismatch hits of the first 24bp on the reference,
        length of the read, read sequence and its quality.
      

The read name, read sequence, and quality are read as XStringSet objects. Chromosome and strand are read as factors. Position and mapping quality are numeric. These fields are mapped to their corresponding representation in AlignedRead objects.

Number of mismatches of the best hit, sum of qualities of mismatched bases of the best hit, number of 0-mismatch hits of the first 24bp, number of 1-mismatch hits of the first 24bp are represented in the AlignedRead object as components of alignData.

Remaining fields are currently ignored.
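To make the column order quoted from the MAQ manual concrete, the sketch below reads one made-up record into a data frame with base R. The column names are hypothetical labels chosen for this illustration, the record is invented, and tab delimitation is an assumption; this is not how ShortRead parses these files.

```r
# Hypothetical column labels, in the order given by the MAQ manual excerpt
fields <- c("readName", "chromosome", "position", "strand", "insertSize",
            "pairedFlag", "mappingQuality", "singleEndMappingQuality",
            "alternativeMappingQuality", "nMismatchBestHit",
            "mismatchQuality", "nExactMatch24", "nOneMismatch24",
            "readLength", "sequence", "quality")

# One made-up record; tab delimitation is assumed
line <- paste(c("read_1", "chr1", "1042", "+", "0", "0", "60", "60", "0",
                "1", "30", "1", "0", "8", "ACGTACGT", "IIIIIIII"),
              collapse = "\t")

rec <- read.table(text = line, sep = "\t", col.names = fields,
                  colClasses = c("character", "character", "integer",
                                 "character", rep("integer", 10),
                                 "character", "character"),
                  quote = "", comment.char = "", stringsAsFactors = FALSE)

# Fields that this help page maps onto AlignedRead slots
print(rec[, c("readName", "chromosome", "position", "strand",
              "mappingQuality", "sequence", "quality")])
```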

type="Bowtie"

Parse alignment files created with the Bowtie alignment algorithm. Parsed columns can be retrieved from AlignedRead as follows:

Identifier

id

Strand

strand

Chromosome

chromosome

Position

position; see comment below

Read

sread; see comment below

Read quality

quality; see comments below

Similar alignments

alignData, ‘similar’ column; Bowtie v. 0.9.9.3 (12 May, 2009) documents this as the number of other instances where the same read aligns against the same reference characters as were aligned against in this alignment. Previous versions marked this as ‘Reserved’

Alignment mismatch locations

alignData, ‘mismatch’ column

NOTE: the default quality encoding changed to FastqQuality in ShortRead version 1.3.24.

This method includes the argument qualityType to specify how quality scores are encoded. Bowtie quality scores are ‘Phred’-like by default, with qualityType='FastqQuality', but can be specified as ‘Solexa’-like, with qualityType='SFastqQuality'.
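The practical difference between the two encodings is the ASCII offset used to turn characters into scores. The base R sketch below assumes the conventional offsets (33 for ‘Phred’-like strings, 64 for ‘Solexa’-like strings); these offsets are stated as an assumption rather than taken from this page, and decode_quality is a hypothetical helper.

```r
# Hypothetical helper: decode an ASCII-encoded quality string to integer
# scores by subtracting an offset from each character code.
decode_quality <- function(qual, offset) utf8ToInt(qual) - offset

# Made-up quality strings
phred  <- decode_quality("II5+", offset = 33L)   # 'FastqQuality'-style, assumed offset 33
solexa <- decode_quality("hhTJ", offset = 64L)   # 'SFastqQuality'-style, assumed offset 64

print(phred)
print(solexa)
```

Reading a file with the wrong qualityType therefore shifts every score by a constant amount, which is worth checking when scores look implausibly high or low.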

Bowtie outputs positions that are 0-offset from the left-most end of the + strand. ShortRead parses position information to be 1-offset from the left-most end of the + strand.

Bowtie outputs reads aligned to the - strand as their reverse complement, and reverses the quality score string of these reads. ShortRead parses these to their original sequence and orientation.
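The two adjustments described above (shifting 0-offset positions to 1-offset, and restoring the original orientation of reads aligned to the - strand) can be sketched in base R. The helpers and the record are hypothetical; this illustrates the transformation and is not ShortRead's implementation.

```r
# Hypothetical helpers in base R, not ShortRead functions
reverse_string <- function(s) {
    paste(rev(strsplit(s, "", fixed = TRUE)[[1]]), collapse = "")
}
reverse_complement <- function(seq) {
    # Complement each base, then reverse the string
    reverse_string(chartr("ACGTN", "TGCAN", seq))
}

# Made-up Bowtie-style record aligned to the - strand
bowtie_pos  <- 99L       # 0-offset from the left-most end of the + strand
bowtie_seq  <- "AACGT"   # reported as the reverse complement of the read
bowtie_qual <- "IIH5+"   # reported reversed

position <- bowtie_pos + 1L                 # 1-offset
sread    <- reverse_complement(bowtie_seq)  # original read sequence
quality  <- reverse_string(bowtie_qual)     # original quality orientation

print(list(position = position, sread = sread, quality = quality))
```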

type="SOAP"

Parse alignment files created with the SOAP alignment algorithm. Parsed columns can be retrieved from AlignedRead as follows:

id

id

seq

sread; see comment below

qual

quality; see comment below

number of hits

alignData

a/b

alignData (pairedEnd)

length

alignData (alignedLength)

+/-

strand

chr

chromosome

location

position; see comment below

types

alignData (typeOfHit: integer portion; hitDetail: text portion)

This method includes the argument qualityType to specify how quality scores are encoded. It is unclear from SOAP documentation what the quality score is; the default is ‘Solexa’-like, with qualityType='SFastqQuality', but can be specified as ‘Phred’-like, with qualityType='FastqQuality'.

SOAP outputs positions that are 1-offset from the left-most end of the + strand. ShortRead preserves this representation.

SOAP reads aligned to the - strand are reported by SOAP as their reverse complement, with the quality string of these reads reversed. ShortRead parses these to their original sequence and orientation.

Value

A single R object (e.g., AlignedRead) containing alignments, sequences and qualities of all files in dirPath matching pattern. The order in which files are read is not guaranteed.

Author(s)

Martin Morgan <mtmorgan@fhcrc.org>, Simon Anders <anders@ebi.ac.uk> (MAQ map)

See Also

The AlignedRead class.

Genome Analyzer Pipeline Software User Guide, Revision A, January 2008.

The MAQ reference manual, http://maq.sourceforge.net/maq-manpage.shtml#5, 3 May, 2008.

The Bowtie reference manual, http://bowtie-bio.sourceforge.net, 28 October, 2008.

The SOAP reference manual, http://soap.genomics.org.cn/soap1, 16 December, 2008.

Examples

sp <- SolexaPath(system.file("extdata", package="ShortRead"))
ap <- analysisPath(sp)
## ELAND_EXTENDED
(aln0 <- readAligned(ap, "s_2_export.txt", type="SolexaExport"))
## PhageAlign
(aln1 <- readAligned(ap, "s_5_.*_realign.txt", type="SolexaRealign"))

## MAQ
dirPath <- system.file('extdata', 'maq', package='ShortRead')
list.files(dirPath)
## First line
readLines(list.files(dirPath, full.names=TRUE)[[1]], 1)
countLines(dirPath)
## two files collapse into one
(aln2 <- readAligned(dirPath, type="MAQMapview"))

## select only chr1-5.fa, '+' strand
filt <- compose(chromosomeFilter("chr[1-5].fa"),
                strandFilter("+"))
(aln3 <- readAligned(sp, "s_2_export.txt", filter=filt))

Bioconductor/ShortRead documentation built on Nov. 2, 2024, 4:38 p.m.