spoaAlign: Specify an algorithm for SPOA
In brendanf/spoar: SIMD Partial Order Alignment in R

Description Usage Arguments Details Value Examples

View source: R/spoar.R

Specify an algorithm for SPOA

Align sequences using SPOA (and optionally get the consensus)

SpoaAlgo(
  algorithm = c("local", "global", "semi.global"),
  m = 5L,
  n = -4L,
  g = -8L,
  e = g,
  q = g,
  c = q
)

spoaAlign(seq, algo = SpoaAlgo(), ...)

## S3 method for class 'character'
spoaAlign(seq, algo = SpoaAlgo(), w = 1, qual = NULL, ...)

## S3 method for class 'XStringSet'
spoaAlign(seq, algo = SpoaAlgo(), w = 1, qual = NULL, ...)

## S3 method for class 'QualityScaledXStringSet'
spoaAlign(seq, algo = SpoaAlgo(), w = 1, ...)

## S3 method for class 'ShortReadQ'
spoaAlign(seq, algo = SpoaAlgo(), w = 1, ...)

## S3 method for class 'derep'
spoaAlign(seq, algo = SpoaAlgo(), ...)

spoaConsensus(seq, algo = SpoaAlgo(), ...)

## S3 method for class 'character'
spoaConsensus(seq, algo = SpoaAlgo(), w = 1, qual = NULL, ...)

## S3 method for class 'XStringSet'
spoaConsensus(seq, algo = SpoaAlgo(), w = 1, qual = NULL, ...)

## S3 method for class 'ShortRead'
spoaConsensus(seq, algo = SpoaAlgo(), w = 1, qual = NULL, ...)

## S3 method for class 'QualityScaledXStringSet'
spoaConsensus(seq, algo = SpoaAlgo(), w = 1, ...)

## S3 method for class 'ShortReadQ'
spoaConsensus(seq, algo = SpoaAlgo(), w = 1, ...)

## S3 method for class 'derep'
spoaConsensus(seq, algo = SpoaAlgo(), ...)

`algorithm`	(`character` string) alignment mode; one of `"local"` (Smith-Watterman), `"global"` (Needleman-Wunsch), or `"semi.global"` (Overlap). Default: `"local"`
`m`	(non-negative `integer`) score for a match. Default: `5L`
`n`	(non-positive `integer`) score for a mismatch. Default: `-4L`
`g`	(non-positive `integer`) gap opening penalty. Default: `-8L`
`e`	(non-positive `integer`) gap extension penalty. Default: same value as `g`
`q`	(non-positive `integer`) second gap opening penalty. Default: same value as `g`
`c`	(non-positive `integer`) second gap extension penalty. Default: same value as `e`
`seq`	(`character` vector, `Biostrings::XStringSet`, `ShortRead::ShortRead`, or `dada2::derep`) sequences to align.
`algo`	(`SpoaAlgo` object) parameters for SPOA, generated by `SpoaAlgo()`.
`...`	additional parameters passed to methods
`w`	(`integer` vector) weights for the sequences. Default: `1L`
`qual`	(`list` of `numeric` vectors) per-base quality scores for the sequences. Default: none

Gap penalties

The general (convex) gap penalty formula is:

min(g + (i - i) * e, q + (i - 1) * c)

For q == g and c == e (as is the case unless q or c is explicitly set), this simplifies to the affine gap penalty:

g + (i - 1) * e

If, additionally, e == g (the default), then this is the linear gap penalty:

g * i

Weighting

Assigning a weight w to a sequence is equivalent to including that sequence w times; this is primarily useful for dereplicated sequences, where all sequences are unique, and the weight represents their multiplicity.

Note that POA is not order-independent, so results may differ when a set of non-unique sequences is dereplicated. For the most predictable results, it is recommended to sort dereplicated sequences in decreasing order of abundance (as in dada2::derepFastq()).

Quality scores

If seq is an object which includes quality scores (Biostrings::QualityScaledXStringSet, ShortRead::ShortReadQ, or dada2::derep) then the alignment consensus is weighted using the quality scores. The method used by SPOA uses the integer quality scores as per-position weights analogous to sequence weights; i.e., if two different bases align in the same position, one in a single sequence with quality score 40 and the other in a single sequence with quality score 20, then this is equivalent to the first base occurring at that position in 40 sequences without quality scores, and the second base occurring at that position in 20 sequences without quality scores.

It is possible to use a combination of sequence weights and quality scores. In this case, in order for a dereplicated sequence set to give the same results as the non-dereplicated sequences, the quality score for each of the dereplicated sequences should be the mean of the quality scores for the corresponding non-dereplicated sequences. This is the method used to generate quality scores for dereplicated sequences in dada2::derepFastq().

an object of class "SpoaAlgo", suitable for passing to spoaAlign() or spoaConsensus()

For spoaConsensus(), either a character string or the appropriate Biostrings::XString, depending on the class of seq.

For spoaAlign(), either a character vector or a Biostrings::MultipleAlignment, depending on the class of seq. If seq is a BStringSet (i.e., an XStringSet which is not specifically DNA, RNA, or AA) then the result is also a BStringSet, since there is no corresponding "BMultipleAlignment" class. If seq is a ShortRead::ShortRead object, the output is a Biostrings::DNAMultipleAlignment.

seq	spoaConsensus	spoaAlign
`character(n)`	`character(1)`	`character(n)`
`BStringSet`	`BString`	`BStringSet`
`DNAStringSet`	`DNAString`	`DNAMultipleAlignment`
`RNAStringSet`	`RNAString`	`RNAMultipleAlignment`
`AAStringSet`	`AAString`	`AAMultipleAlignment`
`QualityScaledBStringSet`	`BString`	`BStringSet`
`QualityScaledDNAStringSet`	`DNAString`	`DNAMultipleAlignment`
`QualityScaledRNAStringSet`	`RNAString`	`RNAMultipleAlignment`
`QualityScaledAAStringSet`	`AAString`	`AAMultipleAlignment`
`ShortReadQ`	`DNAString`	`DNAMultipleAlignment`
`derep`	`character(1)`	`character(n)`

sequences <- c(
    "CATAAAAGAACGTAGGTCGCCCGTCCGTAACCTGTCGGATCACCGGAAAGGACCCGTAAAGTGATAATGAT",
    "ATAAAGGCAGTCGCTCTGTAAGCTGTCGATTCACCGGAAAGATGGCGTTACCACGTAAAGTGATAATGATTAT",
    "ATCAAAGAACGTGTAGCCTGTCCGTAATCTAGCGCATTTCACACGAGACCCGCGTAATGGG",
    "CGTAAATAGGTAATGATTATCATTACATATCACAACTAGGGCCGTATTAATCATGATATCATCA",
    "GTCGCTAGAGGCATCGTGAGTCGCTTCCGTACCGCAAGGATGACGAGTCACTTAAAGTGATAAT",
    "CCGTAACCTTCATCGGATCACCGGAAAGGACCCGTAAATAGACCTGATTATCATCTACAT"
)
spoaAlign(sequences)
spoaConsensus(sequences)