```r
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```
We first download the data set, consisting of 192 fastq file-pairs from 18S and 16S amplicon sequencing of feces and intestinal contents of carnivores (hyenas and wolves). Use the button on the upper right of the page to download a zipped folder, and unzip it after downloading. The help file for `carnivoreSeqRuns` (the sample data included in the package) shows how this data was downloaded from NCBI-SRA.
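If you want to look up those download details, the corresponding help page can be opened directly from R; a minimal sketch, assuming the package is installed and the topic name matches the data set referenced above:

```r
## Open the help page that documents the NCBI-SRA download of the
## sample data (topic name as referenced in the text above)
help("carnivoreSeqRuns", package = "MultiAmplicon")
```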
Now we can run our standard workflow. But, as usual with real data, we have to filter the input files first. This particular sequencing run was of rather low quality, so we have to trim and screen harshly.
```r
library(MultiAmplicon)
## And we'll also use some dada2 functions directly
library(dada2)

path <- "download_sra" ## change according to where you downloaded
fastqFiles <- list.files(path, pattern=".fastq.gz$", full.names=TRUE)
fastqF <- grep("_1.fastq.gz", fastqFiles, value = TRUE)
fastqR <- grep("_2.fastq.gz", fastqFiles, value = TRUE)
samples <- gsub("_1\\.fastq\\.gz", "", basename(fastqF))

filt_path <- "filtered_sra"
if(!file_test("-d", filt_path)) dir.create(filt_path)
filtFs <- file.path(filt_path, paste0(samples, "_F_filt.fastq.gz"))
names(filtFs) <- samples
filtRs <- file.path(filt_path, paste0(samples, "_R_filt.fastq.gz"))
names(filtRs) <- samples

## some files will be filtered out completely; we therefore tolerate
## up to 50 missing filtered files before redoing the filtering
redo <- sum(file.exists(fastqF)) - sum(file.exists(filtFs)) > 50

if(redo){
    lapply(seq_along(fastqF), function (i) {
        filterAndTrim(fastqF[i], filtFs[i], fastqR[i], filtRs[i],
                      truncLen=c(170,170), minLen=c(170,170),
                      maxN=0, maxEE=2, truncQ=2,
                      compress=TRUE, verbose=TRUE)
    })
}

names(filtFs) <- names(filtRs) <- samples
files <- PairedReadFileSet(filtFs, filtRs)
```
Now we have our sequencing read input data; only the primers are still missing. The primer set used above is included in the MultiAmplicon package. The data was generated using a microfluidics PCR system; if you want to know more about this, consult Heitlinger et al. (2017) or Lesniak et al. (2017).
```r
primer.file <- system.file("extdata", "real_world_primers.csv",
                           package = "MultiAmplicon")
ptable <- read.csv(primer.file, sep=",", header=TRUE,
                   stringsAsFactors=FALSE)
primerF <- ptable[, "TS.SequenceF"]
primerR <- ptable[, "TS.SequenceR"]
names(primerF) <- as.character(ptable[, "corrected.NameF"])
names(primerR) <- as.character(ptable[, "corrected.NameR"])
primers <- PrimerPairsSet(primerF, primerR)

MA <- MultiAmplicon(primers, files)
```
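Before proceeding, it can be worth printing the newly created objects; their show methods summarize how many primer pairs and file pairs went into the `MultiAmplicon` object (plain inspection, nothing beyond printing):

```r
## Sanity check: print summaries of the input objects
primers
files
MA
```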
We start by sorting our reads into amplicons based on the primer sequences, cutting the latter off the sequencing reads. The directory for the sorted amplicon files must be empty before this step.
```r
filedir <- "stratified_files"
if(dir.exists(filedir)) unlink(filedir, recursive=TRUE)
MA <- sortAmplicons(MA, n=1e+05, filedir=filedir)
```
The function streams the fastq files, so that potentially large files can be handled on machines with modest memory. The parameter `n` controls how big the streamed chunks are: increasing this value leads to faster processing but higher memory usage.
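As an illustration, on a machine with little memory you might lower `n` and accept slower processing. This is a hypothetical alternative call, not meant to be run on top of the previous one:

```r
## Alternative for low-memory machines: stream smaller chunks of
## 10,000 reads (slower, but a smaller memory footprint). Note that
## `filedir` would again have to be emptied before re-running.
MA <- sortAmplicons(MA, n=1e+04, filedir=filedir)
```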
```r
## estimate dada2 error models on the pooled forward and reverse
## stratified files, then denoise each amplicon
errF <- learnErrors(unlist(getStratifiedFilesF(MA)), nbases=1e8,
                    verbose=0)
errR <- learnErrors(unlist(getStratifiedFilesR(MA)), nbases=1e8,
                    verbose=0)

MA <- dadaMulti(MA, Ferr=errF, Rerr=errR, pool=FALSE, verbose=0)
```
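Since `errF` and `errR` are ordinary dada2 error models, the usual dada2 diagnostics apply; a minimal sketch (standard dada2, not specific to MultiAmplicon):

```r
## Visual check of the fitted error model: the estimated error rates
## (lines) should track the observed rates (points) reasonably well
plotErrors(errF, nominalQ=TRUE)
```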
The sequences were of quite low quality and we truncated them drastically. We therefore have to assume that for some amplicons most read pairs won't have enough overlap to merge. Let's figure out for which amplicons this is the case.
```r
MA <- mergeMulti(MA)

propMerged <- MultiAmplicon::calcPropMerged(MA)

summary(propMerged)
table(propMerged < 0.8)
```
So we just found out that for this dataset only one amplicon could be merged while retaining over 80% of its sequences. Let's run the merge again, concatenating reads for all amplicons with less than 80% merged by setting `justConcatenate` to `TRUE` for those.
```r
MA <- mergeMulti(MA, justConcatenate=propMerged<0.8)
MA <- makeSequenceTableMulti(MA)
MA <- removeChimeraMulti(MA)
```
```r
tracking <- getPipelineSummary(MA)
plotPipelineSummary(tracking)
```
The pipeline summary tells us how many samples, how many unique sequences, and how many sequence reads (or read pairs) in total were recovered through the pipeline for the different amplicons.
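The raw numbers behind the plot can also be inspected directly; a minimal sketch using base R (the exact columns depend on what `getPipelineSummary` returns):

```r
## Peek at the raw tracking numbers underlying the plot
head(tracking)
str(tracking)
```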
```r
library(ggplot2)
plotPipelineSummary(tracking) + scale_y_log10()
```
The plot produced by the pipeline summary can be modified with standard ggplot2 syntax.
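Any further ggplot2 layers work the same way; for example (a sketch with arbitrary styling choices):

```r
## Log-scaled y axis combined with a different theme and a title
plotPipelineSummary(tracking) +
    scale_y_log10() +
    theme_bw() +
    ggtitle("Pipeline summary per amplicon")
```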
```r
library(phyloseq)

PH <- toPhyloseq(MA, samples=colnames(MA))
PHlist <- toPhyloseq(MA, samples=colnames(MA), multi2Single=FALSE)
```
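Judging by the `multi2Single` argument and the object names, `PH` pools all amplicons into a single phyloseq object while `PHlist` keeps one phyloseq object per amplicon. From here on, standard phyloseq functions apply; a minimal sketch (plain phyloseq accessors, nothing MultiAmplicon-specific):

```r
## Standard phyloseq accessors on the pooled object ...
ntaxa(PH)     ## number of sequence variants across all amplicons
nsamples(PH)  ## number of samples
## ... and per amplicon on the list version
sapply(PHlist, ntaxa)
```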