Description Usage Arguments Value Note Author(s) See Also Examples
readSeqFile reads a FASTQ or FASTA file, summarizing the nucleotide distribution across position (cycles) and the sequence length distributions. If type is ‘fastq’, the distribution of qualities across position will also be recorded. If hash is TRUE, the unique sequences will be hashed with counts of their frequency. By default, only 10% of the reads will be hashed; this proportion can be controlled with hash.prop. If kmer=TRUE, k-mers of length k will be hashed by position, also with the sampling proportion controlled by hash.prop.
filename |
the name of the file which the sequences are to be read from. |
type |
either ‘fastq’ or ‘fasta’, representing the type of the file. FASTQ files will have the quality distribution by position summarized. |
max.length |
the largest sequence length likely to be encountered. For efficiency, a matrix of this size (which should be larger than the longest sequence) is allocated in C, populated, and then trimmed in R. Specifying a value that is too small will cause an error, and the function will need to be re-run. |
quality |
one of ‘illumina’, ‘sanger’, or ‘solexa’; this determines the quality offsets and range. See the values of QUALITY.CONSTANTS for more information. |
hash |
a logical value indicating whether to hash sequences. |
hash.prop |
a numeric value in (0, 1] giving the proportion of reads to hash. |
kmer |
a logical value indicating whether to hash k-mers by position. |
k |
an integer value indicating the k-mer size. |
verbose |
a logical value indicating whether to be verbose (in the C backend). |
An S4 object of class FASTQSummary or FASTASummary containing the summary statistics.
Identifying the correct quality type can be difficult. readSeqFile will error out if it encounters a base quality outside of the range of a known quality type, but it is possible to have reads of a different quality type whose values never fall outside the range of another type.
Here is a bit more about quality:
PHRED quality scores (e.g. from Roche 454). ASCII with no offset, range: [4, 60]. This has been removed as an option since sequence reads with this type are very, very uncommon.
Sanger: PHRED ASCII qualities with an offset of 33, range: [0, 93]. From NCBI SRA, or Illumina pipeline 1.8+.
Solexa (also very early Illumina - pipeline < 1.3). ASCII offset of 64, range: [-5, 62]. Uses a different quality-to-probabilities conversion than other schemes.
Illumina output from pipeline versions between 1.3 and 1.7. ASCII offset of 64, range: [0, 62].
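The offsets and ranges above can be illustrated with a minimal base R sketch. This is not how qrqc itself decodes qualities (the package uses QUALITY.CONSTANTS and a C backend); the names quality_schemes, decode_quality, and error_prob are hypothetical helpers introduced only for illustration. The Solexa conversion shown is the commonly documented odds-based formula, which is what distinguishes it from the PHRED-based schemes.

```r
# Offsets and valid score ranges, copied from the list above.
# (Hypothetical lookup table; qrqc keeps its own in QUALITY.CONSTANTS.)
quality_schemes <- list(
  sanger   = list(offset = 33L, min = 0L,  max = 93L),
  solexa   = list(offset = 64L, min = -5L, max = 62L),
  illumina = list(offset = 64L, min = 0L,  max = 62L)
)

# Decode one quality string into integer scores, failing if any score
# lies outside the range of the chosen scheme.
decode_quality <- function(qual, scheme = c("sanger", "solexa", "illumina")) {
  scheme <- match.arg(scheme)
  s <- quality_schemes[[scheme]]
  scores <- utf8ToInt(qual) - s$offset
  if (any(scores < s$min | scores > s$max)) {
    stop("quality outside the range of type '", scheme, "'")
  }
  scores
}

# Convert scores to error probabilities.
# PHRED (sanger, illumina): Q = -10 * log10(p)
# Solexa:                   Q = -10 * log10(p / (1 - p))
error_prob <- function(scores, scheme = c("sanger", "solexa", "illumina")) {
  scheme <- match.arg(scheme)
  if (scheme == "solexa") {
    1 / (1 + 10^(scores / 10))
  } else {
    10^(-scores / 10)
  }
}

decode_quality("I!", "sanger")    # 'I' is ASCII 73, '!' is ASCII 33
decode_quality("h;", "solexa")    # ';' decodes to a negative Solexa score
decode_quality("I", "illumina")   # also valid here, but with a different score
```

The last call shows the ambiguity described in the note: the character 'I' is in range for both ‘sanger’ and ‘illumina’, so a range check alone cannot always tell the two apart. A character such as ';' is only accepted as ‘solexa’, since it decodes to a negative score under the offset of 64.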
Vince Buffalo <vsbuffalo@ucdavis.edu>
FASTQSummary and FASTASummary are the classes of the objects returned by readSeqFile.
basePlot is a function that plots the distribution of bases over sequence length for a particular FASTASummary or FASTQSummary object. gcPlot combines and plots the GC proportion.
qualPlot is a function that plots the distribution of qualities over sequence length for a particular FASTASummary or FASTQSummary object.
seqlenPlot is a function that plots a histogram of sequence lengths for a particular FASTASummary or FASTQSummary object.
kmerKLPlot is a function that plots K-L divergence of k-mers to look for possible bias in reads.
library(qrqc)
## Load a FASTQ file, with sequence hashing.
s.fastq <- readSeqFile(system.file('extdata', 'test.fastq', package='qrqc'))
## Load a FASTA file, without sequence hashing.
s.fasta <- readSeqFile(system.file('extdata', 'test.fasta', package='qrqc'),
                       type='fasta', hash=FALSE)