digestFastqs | R Documentation |
Read sequences for one or a pair of fastq files and digest them (extract umis, constant and variable parts, filter, extract mismatch information from constant and count the observed unique variable parts). Alternatively, primer sequences could be specified, in which case the sequence immediately following the primer will be considered the variable sequence.
digestFastqs(
fastqForward,
fastqReverse = NULL,
mergeForwardReverse = FALSE,
minOverlap = 0,
maxOverlap = 0,
minMergedLength = 0,
maxMergedLength = 0,
maxFracMismatchOverlap = 1,
greedyOverlap = TRUE,
revComplForward = FALSE,
revComplReverse = FALSE,
adapterForward = "",
adapterReverse = "",
elementsForward = "",
elementLengthsForward = numeric(0),
elementsReverse = "",
elementLengthsReverse = numeric(0),
primerForward = c(""),
primerReverse = c(""),
wildTypeForward = "",
wildTypeReverse = "",
constantForward = c(""),
constantReverse = c(""),
avePhredMinForward = 20,
avePhredMinReverse = 20,
variableNMaxForward = 0,
variableNMaxReverse = 0,
umiNMax = 0,
nbrMutatedCodonsMaxForward = 1,
nbrMutatedCodonsMaxReverse = 1,
nbrMutatedBasesMaxForward = -1,
nbrMutatedBasesMaxReverse = -1,
forbiddenMutatedCodonsForward = "",
forbiddenMutatedCodonsReverse = "",
useTreeWTmatch = FALSE,
collapseToWTForward = FALSE,
collapseToWTReverse = FALSE,
mutatedPhredMinForward = 0,
mutatedPhredMinReverse = 0,
mutNameDelimiter = ".",
constantMaxDistForward = -1,
constantMaxDistReverse = -1,
variableCollapseMaxDist = deprecated(),
variableCollapseMinReads = deprecated(),
variableCollapseMinRatio = deprecated(),
umiCollapseMaxDist = 0,
filteredReadsFastqForward = "",
filteredReadsFastqReverse = "",
maxNReads = -1,
verbose = FALSE,
nThreads = 1,
chunkSize = 1e+05,
maxReadLength = 1024
)
fastqForward , fastqReverse |
Character vectors, paths to gzipped FASTQ files corresponding to forward and reverse reads, respectively. If more than one forward/reverse sequence file is given, they need to be provided in the same order. Note that if multiple fastq files are provided, they are all assumed to correspond to the same sample, and will effectively be concatenated. |
mergeForwardReverse |
Logical scalar, whether to fuse the forward and reverse variable sequences. |
minOverlap , maxOverlap |
Numeric scalar, the minimal and maximal allowed
overlap between the forward and reverse reads when merging. Only used if
|
minMergedLength , maxMergedLength |
Numeric scalar, the minimal and
maximal allowed total length of the merged product (if
|
maxFracMismatchOverlap |
Numeric scalar, maximal mismatch rate in the
overlap. Only used if |
greedyOverlap |
Logical scalar. If |
revComplForward , revComplReverse |
Logical scalar, whether to reverse complement the forward/reverse variable and constant sequences, respectively. |
adapterForward , adapterReverse |
Character scalars, the adapter sequence
for forward/reverse reads, respectively. If a forward/reverse read
contains the corresponding adapter sequence, the sequence pair will be
filtered out. If set to |
elementsForward , elementsReverse |
Character scalars representing the composition of the forward and reverse reads, respectively. The strings should consist only of the letters S (skip), C (constant), U (umi), P (primer), V (variable), and cover the full extent of the read. Most combinations are allowed (and a given letter can appear multiple times), but there can be at most one occurrence of P. If a given letter is included multiple times, the corresponding sequences will be concatenated in the output. |
elementLengthsForward , elementLengthsReverse |
Numeric vectors containing
the lengths of each read component from
|
primerForward , primerReverse |
Character vectors, representing the primer sequence(s) for forward/reverse reads, respectively. Only read pairs that contain perfect matches to both the forward and reverse primers (if given) will be retained. Multiple primers can be specified - they will be considered in order and the first match will be used. |
wildTypeForward , wildTypeReverse |
Character scalars or named character vectors, the wild type sequence for the forward and reverse variable region. If given as a single string, the reference sequence will be named 'f' (for forward) or 'r' (for reverse). |
constantForward , constantReverse |
Character vectors giving, the expected constant forward and reverse sequences. If more than one sequence is provided, they must all have the same length. |
avePhredMinForward , avePhredMinReverse |
Numeric scalar, the minimum average Phred score in the variable region for a read to be retained. If a read pair contains both forward and reverse variable regions, the minimum average Phred score has to be achieved in both for a read pair to be retained. |
variableNMaxForward , variableNMaxReverse |
Numeric scalar, the maximum number of Ns allowed in the variable region for a read to be retained. |
umiNMax |
Numeric scalar, the maximum number of Ns allowed in the UMI for a read to be retained. |
nbrMutatedCodonsMaxForward , nbrMutatedCodonsMaxReverse |
Numeric
scalar, the maximum number of mutated codons that are allowed. Note that
for the forward and reverse sequence, respectively, exactly one of
|
nbrMutatedBasesMaxForward , nbrMutatedBasesMaxReverse |
Numeric scalar,
the maximum number of mutated bases that are allowed. Note that for the
forward and reverse sequence, respectively, exactly one of
|
forbiddenMutatedCodonsForward , forbiddenMutatedCodonsReverse |
Character
vector of codons (can contain ambiguous IUPAC characters, see
|
useTreeWTmatch |
Logical scalar. Should a tree-based matching to wild type sequences be used if possible? If the number of allowed mismatches is small, and the number of wild type sequences is large, this is typically faster. |
collapseToWTForward , collapseToWTReverse |
Logical scalar, indicating whether to just represent the observed variable sequence by the closest wildtype sequence rather than retaining the information about the mutations. |
mutatedPhredMinForward , mutatedPhredMinReverse |
Numeric scalar, the
minimum Phred score of a mutated base for the read to be retained. If
any mutated base has a Phred score lower than |
mutNameDelimiter |
Character scalar, the delimiter used in the naming
of mutants. Generally, mutants will be named as XX.YY.NNN, where XX
is the closest provided reference sequence, YY is the mutated base or
codon number (depending on whether |
constantMaxDistForward , constantMaxDistReverse |
Numeric scalars, the maximum allowed Hamming distance between the extracted and expected constant sequence. If multiple constant sequences are provided, the most similar one is used. Reads with a larger distance to the expected constant sequence are discarded. If set to -1, no filtering is done. |
variableCollapseMaxDist , variableCollapseMinReads , variableCollapseMinRatio |
Deprecated. Collapsing of variable sequences is no longer performed in
|
umiCollapseMaxDist |
Numeric scalar defining
the tolerances for collapsing similar UMI sequences. If the value is
in [0, 1), it defines the maximal Hamming distance in terms of a
fraction of sequence length:
( |
filteredReadsFastqForward , filteredReadsFastqReverse |
Character scalars, the names of a (pair of) FASTQ file(s) where filtered-out reads will be written. The name(s) should end in .gz (the output will always be compressed). If empty, filtered reads will not be written to a file. |
maxNReads |
Integer scalar, the maximum number of reads to process.
The first |
verbose |
Logical scalar, whether to print out progress messages. |
nThreads |
Numeric scalar, the number of threads to use for parallel processing. |
chunkSize |
Numeric scalar, the number of read (pairs) to keep in memory for parallel processing. Reduce from the default value if you run out of memory. |
maxReadLength |
Numeric scalar, the maximum allowed read length. Longer
read lengths lead to higher memory allocation, and may require
the |
The processing of a read pair goes as follows:
Search for perfect matches to forward/reverse adapter sequences, filter out the read pair if a match is found in either the forward or reverse read.
If primer sequences are provided, search for perfect matches, and filter out the read pair if not all provided primer sequences can be found.
Extract the UMI, constant and variable sequence from forward and reverse reads, based on the definition of the respective read composition.
If requested, collapse forward and reverse variable regions by retaining, for each position, the base with the highest reported base quality.
Filter out the read (pair) if the average quality in the variable
region is below avePhredMinForward
/avePhredMinReverse
, in
either the forward or reverse read (or the merged read).
Filter out the read (pair) if the number of Ns in the variable region
exceeds variableNMaxForward
/variableNMaxReverse
.
Filter out the read (pair) if the number of Ns in the combined forward
and reverse UMI sequence exceeds umiNMax
If one or more wild type sequences (for the variable region) are provided, find the mismatches between the (forward/reverse) variable region and the provided wild type sequence (if more than one wild type sequence is provided, first find the one that is closest to the read).
Filter out the read (pair) if any mutated base has a quality below
mutatedPhredMinForward
/mutatedPhredMinReverse
.
Filter out the read (pair) if the number of mutated codons exceeds
nbrMutatedCodonsMaxForward
/nbrMutatedCodonsMaxReverse
.
Filter out the read (pair) if any of the mutated codons match any of
the codons encoded by
forbiddenMutatedCodonsForward
/forbiddenMutatedCodonsReverse
.
Assign a 'mutation name' to the read (pair). This name is a
combination of parts of the form XX.YY.NNN, where XX is the name of the
most similar reference sequence, YY is the mutated codon number, and NNN is
the mutated codon. . is a delimiter, specified via
mutNameDelimiter
. If no wildtype sequences are provided, the
variable sequence will be used as the mutation name'.
Based on the retained reads following this filtering process, count the number of reads, and the number of unique UMIs, for each variable sequence (or pair of variable sequences).
A list with four entries:
A data.frame
that contains, for each observed
mutation combination, the corresponding variable region sequences (or pair of
sequences), the number of observed such sequences, and the number of unique
UMIs observed for the sequence. It also has additional columns: 'maxNbrReads'
contains the number of reads for the most frequent observed sequence
represented by the feature (only relevant if similar variable regions are
collapsed). 'nbrMutBases', 'nbrMutCodons' and 'nbrMutAAs' give the number of
mutated bases, codons or amino acids in each variant. Alternative variant
names based on base, codon or amino acid sequence are provided in columns
'mutantNameBase', 'mutantNameCodon', 'mutantNameAA'. In addition,
'mutantNameBaseHGVS' and 'mutantNameAAHGVS' give base- and amino acid-based
names following the HGVS nomenclature (https://varnomen.hgvs.org/). Please
note that the provided reference sequence names are used for the HGVS
sequence identifiers. It is up to the user to use appropriately named
reference sequences in order to obtain valid HGVS variant names.
A data.frame
that contains the number of input
reads, the number of reads filtered out in the processing, and the number of
retained reads. The filters are named according to the convention
"fxx_filter", where "xx" indicates the order in which the filters were
applied, and "filter" indicates the type of filter. Note that filters are
applied successively, and the reads filtered out in one step are not
considered for successive filtering steps.
A data.frame
that contains, for each Phred
quality score between 0 and 99, the number of bases in the extracted constant
sequences with that quality score that match/mismatch with the provided
reference constant sequence.
A list
with all parameter settings that were used in
the processing. Also contains the version of the package and the time of
processing.
## See the vignette for complete worked-out examples for different types of
## data sets
## ----------------------------------------------------------------------- ##
## Process a single-end data set, assume that the full read represents
## the variable region
out <- digestFastqs(
fastqForward = system.file("extdata", "cisInput_1.fastq.gz",
package = "mutscan"),
elementsForward = "V", elementLengthsForward = -1
)
## Table with read counts and mutant information
head(out$summaryTable)
## Filter summary
out$filterSummary
## ----------------------------------------------------------------------- ##
## Process a single-end data set, specify the read as a combination of
## UMI, constant region and variable region (skip the first base)
out <- digestFastqs(
fastqForward = system.file("extdata", "cisInput_1.fastq.gz",
package = "mutscan"),
elementsForward = "SUCV", elementLengthsForward = c(1, 10, 18, 96),
constantForward = "AACCGGAGGAGGGAGCTG"
)
## Table with read counts and mutant information
head(out$summaryTable)
## Filter summary
out$filterSummary
## Error statistics
out$errorStatistics
## ----------------------------------------------------------------------- ##
## Process a single-end data set, specify the read as a combination of
## UMI, constant region and variable region (skip the first base), provide
## the wild type sequence to compare the variable region to and limit the
## number of allowed mutated codons to 1
out <- digestFastqs(
fastqForward = system.file("extdata", "cisInput_1.fastq.gz",
package = "mutscan"),
elementsForward = "SUCV", elementLengthsForward = c(1, 10, 18, 96),
constantForward = "AACCGGAGGAGGGAGCTG",
wildTypeForward = c(FOS = paste0("ACTGATACACTCCAAGCGGAGACAGACCAACTAGAAGATGAGAAGTC",
"TGCTTTGCAGACCGAGATTGCCAACCTGCTGAAGGAGAAGGAAAAACTA")),
nbrMutatedCodonsMaxForward = 1
)
## Table with read counts and mutant information
head(out$summaryTable)
## Filter summary
out$filterSummary
## Error statistics
out$errorStatistics
## ----------------------------------------------------------------------- ##
## Process a paired-end data set where both the forward and reverse reads
## contain the same variable region and thus should be merged to generate
## the final variable sequence, specify the reads as a combination of
## UMI, constant region and variable region (skip the first and/or last
## base), provide the wild type sequence to compare the variable region to
## and limit the number of allowed mutated codons to 1
out <- digestFastqs(
fastqForward = system.file("extdata", "cisInput_1.fastq.gz",
package = "mutscan"),
fastqReverse = system.file("extdata", "cisInput_2.fastq.gz",
package = "mutscan"),
mergeForwardReverse = TRUE, revComplForward = FALSE, revComplReverse = TRUE,
elementsForward = "SUCV", elementLengthsForward = c(1, 10, 18, 96),
elementsReverse = "SUCVS", elementLengthsReverse = c(1, 7, 17, 96, -1),
constantForward = "AACCGGAGGAGGGAGCTG",
constantReverse = "GAGTTCATCCTGGCAGC",
wildTypeForward = c(FOS = paste0("ACTGATACACTCCAAGCGGAGACAGACCAACTAGAAGATGAGAAGTC",
"TGCTTTGCAGACCGAGATTGCCAACCTGCTGAAGGAGAAGGAAAAACTA")),
nbrMutatedCodonsMaxForward = 1
)
## Table with read counts and mutant information
head(out$summaryTable)
## Filter summary
out$filterSummary
## Error statistics
out$errorStatistics
## ----------------------------------------------------------------------- ##
## Process a paired-end data set where the forward and reverse reads
## contain variable regions corresponding to different proteins, and thus
## should not be merged, specify the reads as a combination of
## UMI, constant region and variable region (skip the first base), provide
## the wild type sequence to compare the variable region to and limit the
## number of allowed mutated codons to 1
out <- digestFastqs(
fastqForward = system.file("extdata", "transInput_1.fastq.gz",
package = "mutscan"),
fastqReverse = system.file("extdata", "transInput_2.fastq.gz",
package = "mutscan"),
mergeForwardReverse = FALSE,
elementsForward = "SUCV", elementLengthsForward = c(1, 10, 18, 96),
elementsReverse = "SUCV", elementLengthsReverse = c(1, 8, 20, 96),
constantForward = "AACCGGAGGAGGGAGCTG",
constantReverse = "GAAAAAGGAAGCTGGAGAGA",
wildTypeForward = c(FOS = paste0("ACTGATACACTCCAAGCGGAGACAGACCAACTAGAAGATGAGAAGTC",
"TGCTTTGCAGACCGAGATTGCCAACCTGCTGAAGGAGAAGGAAAAACTA")),
wildTypeReverse = c(JUN = paste0("ATCGCCCGGCTGGAGGAAAAAGTGAAAACCTTGAAAGCTCAGAACTC",
"GGAGCTGGCGTCCACGGCCAACATGCTCAGGGAACAGGTGGCACAGCTT")),
nbrMutatedCodonsMaxForward = 1,
nbrMutatedCodonsMaxReverse = 1
)
## Table with read counts and mutant information
head(out$summaryTable)
## Filter summary
out$filterSummary
## Error statistics
out$errorStatistics
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.