```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
The MultiAmplicon package uses dedicated classes to ensure that input
to its `sortAmplicons` function is provided correctly. For input
files, this means that the file names (including paths to the
directory they are stored in) of forward and reverse sequencing read
files must be provided as a `PairedReadFileSet`. These files can
contain reads in fastq or gzipped fastq format. As the default use
case we focus on (non-interleaved) paired-end reads, as this is the
most common format in which Illumina sequencing reads are currently
produced.
Similarly, primer pairs must be provided as a `PrimerPairsSet`.
Forward and reverse primers are then matched at the start of forward
and reverse reads, respectively. The maximal number of mismatches and
the maximal distance from the start of the read can be supplied as
arguments to the function. The primer sequences are trimmed from the
reads, and each read pair is associated with the amplicon defined by
its primer pair.
We first have a look at how a `PairedReadFileSet` and a
`PrimerPairsSet` are generated. Please use files containing
quality-filtered sequencing reads for the steps below. See the
dada2 tutorial or our real-world example vignette for how files are
trimmed and quality filtered. Here, however, we use a very small data
set included in the MultiAmplicon package.
```r
library(MultiAmplicon)
## we'll also use some dada2 functions directly
library(dada2)

fastq.dir <- system.file("extdata", "fastq", package = "MultiAmplicon")
fastq.files <- list.files(fastq.dir, full.names = TRUE)
Ffastq.file <- fastq.files[grepl("F_filt", fastq.files)]
Rfastq.file <- fastq.files[grepl("R_filt", fastq.files)]
names(Rfastq.file) <- names(Ffastq.file) <-
    gsub("_F_filt\\.fastq\\.gz", "", basename(Ffastq.file))

PRF <- PairedReadFileSet(Ffastq.file, Rfastq.file)

primerF <- c(Amp1F = "AGAGTTTGATCCTGGCTCAG", Amp2F = "ACTCCTACGGGAGGCAGC",
             Amp3F = "GAATTGACGGAAGGGCACC", Amp4F = "YGGTGRTGCATGGCCGYT",
             Amp5F = "AAAAACCCCGGGGGGTTTTT", Amp6F = "AGAGTTTGATCCTGCCTCAG")
primerR <- c(Amp1R = "CTGCWGCCNCCCGTAGG", Amp2R = "GACTACHVGGGTATCTAATCC",
             Amp3R = "AAGGGCATCACAGACCTGTTAT", Amp4R = "TCCTTCTGCAGGTTCACCTAC",
             Amp5R = "AAAAACCCCGGGGGGTTTTT", Amp6R = "CCTACGGGTGGCAGATGCAG")

PPS <- PrimerPairsSet(primerF, primerR)

MA <- MultiAmplicon(PPS, PRF)
```
Note that the primer pairs in the `PrimerPairsSet` can be named, and
the same is true for the sequencing read file pairs (as done in our
example above). This results in names for primer pairs and for
samples.
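These names can be checked directly on the resulting object. A quick
sketch: `colnames` is used on MultiAmplicon objects later in this
vignette; we assume here that `rownames` works analogously for the
primer pairs.

```r
## primer-pair names become row names, sample names column names
rownames(MA)  ## assumed analogue of colnames() for primer pairs
colnames(MA)
```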
Based on the primer sequences we can now sort the sequences into
amplicons. This is done by the function `sortAmplicons`, which updates
the MultiAmplicon object with the sorted amplicons.
For this small example data set we can look at the resulting raw
counts (function `getRawCounts`), i.e. the number of sequences in each
cell of the fully stratified amplicon x sample matrix.
```r
MA1 <- sortAmplicons(MA, filedir = tempfile("dir"))
knitr::kable(getRawCounts(MA1))
```
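As mentioned above, the stringency of the primer matching can be
adjusted via additional arguments. The sketch below illustrates the
idea only; the argument names `max.mismatch` and `starting.at` are
assumptions on our part (following Biostrings conventions), so please
consult `?sortAmplicons` for the exact interface.

```r
## a sketch, not run: allow one mismatch in the primer match and let
## the primer start up to two bases into the read; argument names are
## assumptions, see ?sortAmplicons
MA1.lenient <- sortAmplicons(MA, filedir = tempfile("dir"),
                             max.mismatch = 1, starting.at = 1:3)
```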
Now we can plot the result. The function `plotAmpliconNumbers` also
returns the clustering information (see `pheatmap`), which can
afterwards be used to filter the data.
```r
clusters <- plotAmpliconNumbers(MA1)
```
We can now use this clustering information to subset our
MultiAmplicon object.
```r
two.clusters.row <- cutree(clusters$tree_row, k = 2)
two.clusters.col <- cutree(clusters$tree_col, k = 2)
knitr::kable(two.clusters.row)
knitr::kable(two.clusters.col)
MA.sub <- MA1[which(two.clusters.row == 1), which(two.clusters.col == 2)]
knitr::kable(getRawCounts(MA.sub))
```
In real-world applications (see our other vignette) the samples to be
excluded from further analysis can be selected based on their
clustering with negative controls (similar to the example data here,
where S00 is an empty file and S01 contains a single sequence).
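A sketch of such a selection, under the assumption (hypothetical for
your own data; adjust the control name) that the negative control is
the sample named S00:

```r
## drop all samples falling into the same column cluster as the
## (assumed) negative control "S00"; keep all primer pairs (rows)
keep <- which(two.clusters.col != two.clusters.col["S00"])
MA.clean <- MA1[seq_along(two.clusters.row), keep]
```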
We first estimate error rates for dada's probabilistic sequence
variant inference. We do this here on the stratified files combined
across all amplicons. This has to be done separately for forward and
reverse reads, though!
```r
errF <- learnErrors(unlist(getStratifiedFilesF(MA1)), verbose = 0)
errR <- learnErrors(unlist(getStratifiedFilesR(MA1)), verbose = 0)
```
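As in the standard dada2 workflow, the fitted error models can be
inspected visually with dada2's `plotErrors` (a quick sanity check,
not part of the MultiAmplicon API):

```r
## plot observed error frequencies against the fitted error model;
## nominalQ adds the expectation under nominal quality scores
plotErrors(errF, nominalQ = TRUE)
```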
We can now run the pipeline of the dada2 package in parallel over the
different amplicons. For illustration, we update our object after each
step and save it under a new name. In your own workflow you can also
overwrite the old object.
```r
MA3 <- dadaMulti(MA1, Ferr = errF, Rerr = errR, pool = FALSE)
MA4 <- mergeMulti(MA3, justConcatenate = TRUE)
MA5 <- makeSequenceTableMulti(MA4)
MA6 <- removeChimeraMulti(MA5, mc.cores = 1)
```
The steps from `sortAmplicons` to `removeChimeraMulti` recapitulate
the different processing steps of the dada2 pipeline.
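For orientation, each wrapper corresponds to one dada2 function
applied per amplicon. This is our reading of the wrapper names; in
particular, we assume `removeChimeraMulti` wraps dada2's chimera
removal function `removeBimeraDenovo`.

```r
## correspondence between MultiAmplicon wrappers and dada2 functions
## dadaMulti()              -> dada2::dada()
## mergeMulti()             -> dada2::mergePairs()
## makeSequenceTableMulti() -> dada2::makeSequenceTable()
## removeChimeraMulti()     -> dada2::removeBimeraDenovo()
```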
The above workflow assumes that errors can be estimated across all
amplicons together. This is beneficial if errors are largely similar
between different amplicons (i.e. not influenced by the primer
sequence comprising the first bases of the original sequence). Below,
an alternative approach is demonstrated, with error probabilities
estimated separately for each amplicon (here we also overwrite the
object from each previous processing step).
```r
MA.alt <- dadaMulti(MA3, selfConsist = TRUE, pool = FALSE)
MA.alt <- mergeMulti(MA.alt, justConcatenate = TRUE)
MA.alt <- makeSequenceTableMulti(MA.alt)
MA.alt <- removeChimeraMulti(MA.alt, mc.cores = 1)
```
Now we can make a comparison between the two different versions:
```r
together <- getSequencesFromTable(MA6)
alt <- getSequencesFromTable(MA.alt)
lapply(seq_along(together), function (i) {
    union <- union(alt[[i]], together[[i]])
    table("combined inference" = union %in% together[[i]],
          "separate amplicon" = union %in% alt[[i]])
})
```
For all amplicons we got very similar output: the overwhelming
majority of ASVs was found by both methods. This suggests that error
rates do not differ much between amplicons (and speaks for the
robustness of the ASV inference in dada2). Which method you should use
depends on your data: if each amplicon has enough reads, you can
estimate errors separately and potentially obtain better estimates for
amplicons with differing error profiles. If, on the other hand, some
amplicons have very few reads, it can make sense to pool them to
improve error estimates with a larger training set.
In some real-world applications it might be desirable to have more
control over the parameters used in the pipeline for different
amplicons. We use the ellipsis (aka "...", dots or dot-dot-dot) to
pass parameters on to the underlying dada2 functions, and the
MultiAmplicon package allows all of these arguments to be vectors with
values varying between amplicons. Let's suppose, for example, that we
want to merge a subset of our amplicons while concatenating others
(with Ns between forward and reverse reads).
```r
MA.mixed <- mergeMulti(MA3, justConcatenate = c(TRUE, FALSE, FALSE,
                                                TRUE, FALSE, FALSE))
```
The above example would make sense if (only) amplicons Amp1F.Amp1R and
Amp4F.Amp4R (the 1st and 4th in our object) had too little overlap
between forward and reverse reads to be merged and were therefore
better concatenated. Note that here this does not really make sense,
as too many sequences of the 2nd and 3rd amplicons fail to merge for
this data set.
Coming back to our example of dada sequence inference, we could also
mix error estimates, using errors estimated either separately for
single amplicons or across amplicons. Always set both pre-computed
error parameters to `NULL` when using self-consistent estimation
within an amplicon.
```r
## note: use list() rather than c() for the error parameters, as c()
## would silently drop the NULL entries
MA.mixed <- dadaMulti(MA3,
                      Ferr = list(NULL, errF, NULL, errF, NULL, NULL),
                      Rerr = list(NULL, errR, NULL, errR, NULL, NULL),
                      selfConsist = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE))
```
Taxonomical annotation is difficult, as reference databases for 18S
markers are not as mature as for 16S. As we use MultiAmplicon in our
own research, it currently includes only 18S annotation via BLAST+.
This works only on Unix systems with the NCBI BLAST+ toolkit
installed. In addition, we have to build a taxonomizr database.
```r
inputFasta <- system.file("extdata", "in.fasta", package = "MultiAmplicon")
outputBlast <- system.file("extdata", "out.blt", package = "MultiAmplicon")
MA7 <- blastTaxAnnot(MA6,
                     db = "/SAN/db/blastdb/nt/nt",
                     negative_gilist = "/SAN/db/blastdb/uncultured_gilist.txt",
                     taxonSQL = "/SAN/db/taxonomy/taxonomizr.sql",
                     num_threads = 20,
                     infasta = inputFasta,
                     outblast = outputBlast)
```
In the above code you have to adjust the paths for the BLAST and
taxonomizr databases.
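The taxonomizr SQL database can be built with the taxonomizr package
itself. A minimal sketch follows; note that this downloads several
gigabytes from NCBI, and the path is just an example.

```r
library(taxonomizr)
## downloads the NCBI taxonomy dumps and accession2taxid files, then
## builds the SQLite database passed as taxonSQL above
prepareDatabase("/SAN/db/taxonomy/taxonomizr.sql")
```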
The phyloseq package provides impressive tools for microbiome data
handling and analysis. We can now convert our MultiAmplicon object
into a phyloseq object.
```r
library(phyloseq)
## sample names are the column names of the MultiAmplicon object
smpls <- colnames(MA7)
PHY <- toPhyloseq(MA7, samples = smpls, multi2Single = TRUE)
PHY.list <- toPhyloseq(MA7, samples = smpls, multi2Single = FALSE)
PHYgenus <- tax_glom(PHY, "genus")
knitr::kable(table(tax_table(PHYgenus)[, "superkingdom"]))
knitr::kable(table(tax_table(PHYgenus)[, "phylum"]))
```
The `multi2Single` option controls whether the data for all amplicons
are combined into one single phyloseq object (`multi2Single=TRUE`, as
for `PHY` above) or returned as a list of separate phyloseq objects,
one per amplicon (`multi2Single=FALSE`, as for `PHY.list`). The
conversion also works on objects without taxonomical annotation:
```r
PHYWO <- toPhyloseq(MA6, samples = colnames(MA6))
PHYWO.list <- toPhyloseq(MA6, samples = colnames(MA6), multi2Single = FALSE)
```