dada: High resolution sample inference from amplicon data.

View source: R/dada.R

dadaR Documentation

High resolution sample inference from amplicon data.

Description

The dada function takes as input dereplicated amplicon sequencing reads and returns the inferred composition of the sample (or samples). Put another way, dada removes all sequencing errors to reveal the members of the sequenced community.

If dada is run in selfConsist=TRUE mode, the algorithm will infer both the sample composition and the parameters of its error model from the data.

Usage

dada(
  derep,
  err,
  errorEstimationFunction = loessErrfun,
  selfConsist = FALSE,
  pool = FALSE,
  priors = character(0),
  multithread = FALSE,
  verbose = TRUE,
  ...
)

Arguments

derep

(Required). character or derep-class. The file path(s) to the fastq file(s), or a directory containing fastq file(s) corresponding to the the samples to be denoised. Compressed file formats such as .fastq.gz and .fastq.bz2 are supported. A derep-class object (or list thereof) returned by link{derepFastq} can also be provided. If multiple samples are provided, each will be denoised with a shared error model.

err

(Required). 16xN numeric matrix, or an object coercible by getErrors such as the output of the learnErrors function.

The matrix of estimated rates for each possible nucleotide transition (from sample nucleotide to read nucleotide). Rows correspond to the 16 possible transitions (t_ij) indexed such that 1:A->A, 2:A->C, ..., 16:T->T Columns correspond to quality scores. Each entry must be between 0 and 1.

Typically there are 41 columns for the quality scores 0-40. However, if USE_QUALS=FALSE, the matrix must have only one column.

If selfConsist = TRUE, err can be set to NULL and an initial error matrix will be estimated from the data by assuming that all reads are errors away from one true sequence.

errorEstimationFunction

(Optional). Function. Default loessErrfun.

If USE_QUALS = TRUE, errorEstimationFunction(dada()$trans_out) is computed after sample inference, and the return value is used as the new estimate of the err matrix in $err_out.

If USE_QUALS = FALSE, this argument is ignored, and transition rates are estimated by maximum likelihood (t_ij = n_ij/n_i).

selfConsist

(Optional). logical(1). Default FALSE.

If selfConsist = TRUE, the algorithm will alternate between sample inference and error rate estimation until convergence. Error rate estimation is performed by errorEstimationFunction.

If selfConsist=FALSE the algorithm performs one round of sample inference based on the provided err matrix.

pool

(Optional). logical(1). Default is FALSE.

If pool = TRUE, the algorithm will pool together all samples prior to sample inference. If pool = FALSE, sample inference is performed on each sample individually. If pool = "pseudo", the algorithm will perform pseudo-pooling between individually processed samples.

This argument has no effect if only 1 sample is provided, and pool does not affect error rates, which are always estimated from pooled observations across samples.

priors

(Optional). character. Default is character(0), i.e. no prior sequences.

The priors argument provides a set of sequences for which there is prior information suggesting they may truly exist, i.e. are not errors. The abundance p-value of dereplicated sequences that exactly match one of the priors are calculated without conditioning on presence, allowing singletons to be detected, and are compared to a reduced threshold 'OMEGA_P' when forming new partitions.

multithread

(Optional). Default is FALSE. If TRUE, multithreading is enabled and the number of available threads is automatically determined. If an integer is provided, the number of threads to use is set by passing the argument on to setThreadOptions.

verbose

(Optional). Default TRUE. Print verbose text output. More fine-grained control is available by providing an integer argument.

  • 0: Silence. No text output (same as FALSE).

  • 1: Basic text output (same as TRUE).

  • 2: Detailed text output, mostly intended for debugging.

...

(Optional). All dada_opts can be passed in as arguments to the dada() function. See setDadaOpt for a full list and description of these options.

Details

Briefly, dada implements a statistical test for the notion that a specific sequence was seen too many times to have been caused by amplicon errors from currently inferred sample sequences. Overly-abundant sequences are used as the seeds of new partitions of sequencing reads, and the final set of partitions is taken to represent the denoised composition of the sample. A more detailed explanation of the algorithm is found in two publications:

  • Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP (2016). DADA2: High resolution sample inference from Illumina amplicon data. Nature Methods, 13(7), 581-3.

  • Rosen MJ, Callahan BJ, Fisher DS, Holmes SP (2012). Denoising PCR-amplified metagenome data. BMC bioinformatics, 13(1), 283.

dada depends on a parametric error model of substitutions. Thus the quality of its sample inference is affected by the accuracy of the estimated error rates. selfConsist mode allows these error rates to be inferred from the data.

All comparisons between sequences performed by dada depend on pairwise alignments. This step is the most computationally intensive part of the algorithm, and two alignment heuristics have been implemented for speed: A kmer-distance screen and banded Needleman-Wunsch alignmemt. See setDadaOpt.

Value

A dada-class object or list of such objects if a list of dereps was provided.

See Also

derepFastq, setDadaOpt

Examples

fn1 <- system.file("extdata", "sam1F.fastq.gz", package="dada2")
fn2 <- system.file("extdata", "sam2F.fastq.gz", package="dada2")
derep1 = derepFastq(fn1)
derep2 = derepFastq(fn2)
dada(fn1, err=tperr1)
dada(list(sam1=derep1, sam2=derep2), err=tperr1, selfConsist=TRUE)
dada(derep1, err=inflateErr(tperr1,3), BAND_SIZE=32, OMEGA_A=1e-20)


benjjneb/dada2 documentation built on Jan. 12, 2025, 10:03 a.m.