16S rRNA amplicon sequencing is commonly used for microbial community characterization, including differential abundance analysis. A limitation to 16S rRNA amplicon sequencing is a lack of taxonomic resolution, where organisms are only identifiable to the genus or family level. We define taxonomic resolution as the ability to differentiate between groups within a taxonomic level, for example differentiating between species within a genus. While similar to determining whether a sequence represents a novel species, here we are only interested in determining whether the 16S rRNA region of interest contains sufficient information for species-level taxonomic assignment. Taxonomic resolution varies by clade and amplicon regions. Though the extent to which taxonomic resolition varies is not well characterized.
Here we demonstrate how metagenomeFeatures and the MgDb
annotation packages can be used to characterize taxonomic resolution for a specific clade and amplicon region, specifically for the Paenibacillus genus and V12 and V4 regions. Originally classified under the Bacillus genus, a novel genus was formed based on the 16S rRNA gene similarity in the 1990s.
Paenibacillus spp. are facultative anaerobic bacteria present in a variety of environments including the soil, water, and can act as opportunistic pathogens in humans [@ouyang2008].
Paenibacillus spp. will play an important role in sustainable agricultural industries [@grady2016]. As such, appropriate speciation is of interest.
The V12 and V4 region were used as they represent two commonly used amplicons for 16S rRNA marker-gene surveys.
We will use the Greengenes 13.5 database, accessed using the greengenes13.5MgDb annotation package for our analysis of the Paenibacillus genus.
The Greengenes 13.5 database is used for demonstration purposes but the other MgDb
annotation packages can also be used; RDP 11.5 - ribosomaldatabaseproject11.5MgDb or SILVA 128.1 -
silva128.1MgDb`.
In addition to metagenomeFeatures and greengenes13.5MgDb the DECIPHER, tidyverse, and ggpubr packages are also used in the following analysis.
Our analysis uses the DECIPHER
package to extract the amplicon regions, perform multiple sequence alignment, and generate a pairwise sequence distance matrix [@Wright2016-mo].
The tidyverse
and ggpubr
packages will be used to reformat the taxonomic and distance matrix data and generate summary figures [@tidyverse;@ggpubr].
library(tidyverse); packageVersion("tidyverse") library(ggpubr); packageVersion("ggpubr") library(DECIPHER); packageVersion("DECIPHER") library(metagenomeFeatures); packageVersion("metagenomeFeatures") library(greengenes13.5MgDb); packageVersion("greengenes13.5MgDb")
We first subset the Greeengenes 13.5 database using the mgDb_select
function.
Then summarize the taxonomy data using functions from tidyverse
package, specifically dplyr
, stringr
and forcats
functions for manipulating data.frames
, strings
, and factor
vectors respectively.
paeni_16S <- metagenomeFeatures::mgDb_select(gg13.5MgDb, type = c("taxa","seq"), keys = "Paenibacillus", keytype = "Genus")
## Per genus count data taxa_df <- paeni_16S$taxa %>% ## cleaning up species names mutate(Species = if_else(Species == "s__", "Unassigned", Species), Species = str_replace(Species, "s__","")) %>% group_by(Species) %>% summarise(Count = n()) %>% ungroup() %>% mutate(Species = fct_reorder(Species, Count)) ## Count info for text total_otus <- sum(taxa_df$Count) unassigned_idx <- taxa_df$Species == "Unassigned" no_species_assignment <- taxa_df$Count[unassigned_idx]
For the Greengenes 13.5 database, there are a total of r total_otus
sequences classified as r nlevels(taxa_df$Species)
species in the Genus Paenibacillus.
The number of sequences assigned to specific Paenibacillus species range from 199 Paenibacillus amylolyticus to 2 Paenibacillus illinoisensis (Fig. \@ref(fig:speciesCount)).
Sequences only classified to the genus level, "Unassigned", is the most abundant group, r no_species_assignment
.
taxa_df %>% ggplot() + geom_bar(aes(x = Species, y = Count), stat = "identity") + geom_text(aes(x = Species, y = Count, label = Count), nudge_y = 75) + labs(y = "Number of OTUs") + coord_flip() + theme_bw() + theme(axis.text.y = element_text(face = "italic"))
Next, we evaluate the 16S rRNA amplicon sequencing taxonomic resolution for Paenibacillus species by comparing within and between species amplicon pairwise distance for the V12 and V4 regions. To differentiate between species the pairwise distances for within-species amplicon sequences must be less than the between species distances. Additionally ,the difference in amplicon sequence pairwise distances between and within species must be greater than the sequencing error rate to detect the difference. For our taxonomic resolution analysis, we used pattern matching to extract the V12 and V4 regions of the 16S rRNA sequences. We then generate a pairwise distance matrix for the two regions and compare the within and between species pairwise distances.
For our in-silico PCR we will use the following PCR primers:
|Region |Direction |Primer | |:------|:-------- |:----------------------------------| |V12 |Forward | 27F - AGAGTTTGATCATGGCTCAG | | |Reverse | 336R - CACTGCTGCSYCCCGTAGGAGTCT | |V4 |Forward | 515F - GTGCCAGCMGCCGCGGTAA | | |Reverse | 806R - GGACTACHVGGGTWTCTAAT |
Extracting the V12 region from the database sequences, only sequences with containing both forward and reverse primers are included in the analysis.
forward_primer <- "AGAGTTTGATCATGGCTCAG" ## reverse complementing reverse primer reverse_primer <- DNAString("CACTGCTGCSYCCCGTAGGAGTCT") %>% reverseComplement() %>% as.character() ## Finding sequeces with forward primer forward_match <- Biostrings::vmatchPattern(forward_primer, subject = paeni_16S$seq, max.mismatch = 2) %>% as.list() %>% map_dfr(as.data.frame,.id = "seq_id") ## Finding sequences with reverse primer reverse_match <- Biostrings::vmatchPattern(reverse_primer, subject = paeni_16S$seq, max.mismatch = 2, fixed = FALSE) %>% as.list() %>% map_dfr(as.data.frame,.id = "seq_id") ## sequences with both forward and reverse primers seqs_to_use_ids <- intersect(forward_match$seq_id, reverse_match$seq_id) seqs_to_use <- names(paeni_16S$seq) %in% seqs_to_use_ids ## Trimming sequences with both primers paeni_V12 <- TrimDNA(paeni_16S$seq[seqs_to_use], leftPatterns = forward_primer, rightPatterns = reverse_primer, type = "both") ## Excluding seqs with lenght 0 paeni_V12_seqs <- paeni_V12[[2]][width(paeni_V12[[2]]) != 0]
Generating a multiple sequence alignment using the AlignSeqs
function in the DECIPHER
package.
v12_align <- AlignSeqs(paeni_V12[[2]], verbose = FALSE)
The resulting alignment can be viewed using the BrowseSeqs
function in the DECIPHER
package.
BrowseSeqs(v12_align)
Generating pairwise distance matrix using the DistanceMatrix
function in the DECIPHER
package for taxonomic resolution analysis and converting distance matrix to a data frame for analysis.
v12_dist <- DistanceMatrix(v12_align, correction = "none", verbose = FALSE, includeTerminalGaps = FALSE) v12_dist_df <- v12_dist %>% as.data.frame() %>% rownames_to_column(var = "Keys") %>% gather("Keys2","distance", -Keys) %>% mutate(Keys = as.numeric(Keys), Keys2 = as.numeric(Keys2)) %>% filter(Keys < Keys2) %>% mutate(Keys = as.character(Keys), Keys2 = as.character(Keys2)) tax_df <- dplyr::select(paeni_16S$taxa, "Keys", "Species") v12_dist_anno_df <- v12_dist_df %>% left_join(tax_df) %>% left_join(tax_df,by = c("Keys2" = "Keys")) %>% dplyr::rename(Keys_Species = Species.x, Keys2_Species = Species.y) %>% mutate(group_comp = if_else(Keys_Species == Keys2_Species, "within","between")) %>% filter(Keys_Species != "s__", Keys2_Species != "s__")
For the V4 region, we will use the same approach, extract amplicon region, filter extracted sequences based on amplicon length, generate pairwise distance matrix using a multiple sequence alignment, and then evaluate pairwise distances.
## Finding sequeces with forward primer forward_match <- Biostrings::vmatchPattern("GTGCCAGCMGCCGCGGTAA", subject = paeni_16S$seq, fixed = FALSE) %>% as.list() %>% map_dfr(as.data.frame,.id = "seq_id") ## Finding sequences with reverse primer reverse_match <- Biostrings::vmatchPattern("ATTAGAWACCCBDGTAGTCC", subject = paeni_16S$seq, fixed = FALSE) %>% as.list() %>% map_dfr(as.data.frame,.id = "seq_id") ## sequences with both forward and reverse primers seqs_to_use_ids <- intersect(forward_match$seq_id, reverse_match$seq_id) seqs_to_use <- names(paeni_16S$seq) %in% seqs_to_use_ids ## Extract amplicon region paeni_V4 <- TrimDNA(paeni_16S$seq[seqs_to_use], leftPatterns = "GTGCCAGCMGCCGCGGTAA", rightPatterns = "ATTAGAWACCCBDGTAGTCC", type = "both") ## Excluding seqs with lenght 0 paeni_V4_seqs <- paeni_V4[[2]][width(paeni_V4[[2]]) != 0] ### Calculate distance matrix from multiple sequence alignment v4_align <- AlignSeqs(paeni_V4_seqs, verbose = FALSE) v4_dist <- DistanceMatrix(v4_align, correction = "none", verbose = FALSE, includeTerminalGaps = FALSE) ## Creating a data frame for exploratory analysis v4_dist_df <- v4_dist %>% as.data.frame() %>% rownames_to_column(var = "Keys") %>% gather("Keys2","distance", -Keys) %>% mutate(Keys = as.numeric(Keys), Keys2 = as.numeric(Keys2)) %>% filter(Keys < Keys2) %>% mutate(Keys = as.character(Keys), Keys2 = as.character(Keys2)) tax_df <- dplyr::select(paeni_16S$taxa, "Keys", "Species") v4_dist_anno_df <- v4_dist_df %>% left_join(tax_df) %>% left_join(tax_df,by = c("Keys2" = "Keys")) %>% dplyr::rename(Keys_Species = Species.x, Keys2_Species = Species.y) %>% mutate(group_comp = if_else(Keys_Species == Keys2_Species, "within","between")) %>% filter(Keys_Species != "s__", Keys2_Species != "s__") ## Excluding outlier sequence "329842" - mean pairwise distance to all other ## sequences is 0.2 v4_dist_anno_filt <- filter(v4_dist_anno_df, Keys != "329842", Keys2 != "329842")
The trimmed sequence length varies for forward and reverse primers resulting in varying amplicon sizes for both the V12 and V4 amplicons (Fig. \@ref(fig:trimInfo)).
list(V12 = paeni_V12[[1]], V4 = paeni_V4[[1]]) %>% map_dfr(as.data.frame, .id = "amplicon") %>% filter(width != 0) %>% gather("key","value", -names, -amplicon) %>% mutate(key = if_else(key == "width", "length", key), key = factor(key, levels = c("length","start","end"))) %>% mutate(position_type = if_else(key == "length", "length", "position")) %>% ggplot() + geom_histogram(aes(x = value, fill = key), bins = 100) + facet_grid(amplicon~position_type, scales = "free") + theme_bw()
Pairwise distance is significantly different for within and between species comparisons indicating that the V12 and V4 regions are potentially suitable forclassifying members of the Paenibacillus genus to the species level (Fig. \@ref(fig:pairDist)). Overall the V12 region had greater pairwise distances than V4 for both within and between species. It is important also to consider that the majority of sequences in the database were only classified to the genus level. Species-level information for these sequences might yield results that are inconsistent with our analysis. Additionally, our analysis does not identify the pairwise sequence distance required to classify a sequence as a novel Paenibacillus species.
list("V4" = v4_dist_anno_filt, "V12" = v12_dist_anno_df) %>% bind_rows(.id = "amplicon") %>% ggplot(aes(x = group_comp, y = distance, fill = group_comp)) + geom_boxplot(aes(x = group_comp, y = distance, fill = group_comp)) + facet_wrap(~amplicon) + labs(x = "Amplicon", fill = "Species Comparison") + theme_bw() + theme(legend.position = "bottom")
While the overall pairwise distance is greater between species than within species for the Paenibacillus genus, it is important to understand how the within and between species pairwise distances compare for individual species. The heatmap below shows pairwise distance information for within and between different Paenibacillus species for the V12 and V4 regions (Fig. \@ref(fig:distHeatMap)). Whether the sequences are assigned to more than one OTU depends on the pairwise sequence distance metric and linkage method employed by the clustering algorithm. In general though for species levels classification the maximum within species distance should be less than the minimum between species distance. For example as the maximum within species pairwise distance for P. lentimorbus is 0.13 and the minimum between species pairwise distance for P. lentimorbus and P. alvei is 0.08 (Fig. \@ref(fig:distHeatMap)A), correctly assigning a V12 amplicon sequences to one of these two species is not possible.
```r sequences with the V12 primer in the database.", fig.height = 8} v12_heat <- v12_dist_anno_df %>% group_by(Keys_Species, Keys2_Species, group_comp) %>% summarise(mean_dist = mean(distance, na.rm = TRUE), max_dist = max(distance), min_dist = min(distance)) %>% mutate(lab_dist = if_else(group_comp == "within", max_dist, min_dist)) %>% ungroup() %>% mutate(Keys_Species = str_remove(Keys_Species, "s")) %>% mutate(Keys2_Species = str_remove(Keys2_Species, "s")) %>% ggplot() + geom_raster(aes(x = Keys_Species, y = Keys2_Species, fill = mean_dist)) + geom_text(aes(x = Keys_Species, y = Keys2_Species, label = round(lab_dist,2)), size = 2, color = "white") + labs(x = "Species",y = "Species", fill = "Mean Dist") + theme_bw() + theme(axis.text.x = element_text(angle = -45, hjust = -0.1))
v4_heat <- v4_dist_anno_df %>% filter(Keys != "329842", Keys2 != "329842") %>% group_by(Keys_Species, Keys2_Species, group_comp) %>% summarise(mean_dist = mean(distance), max_dist = max(distance), min_dist = min(distance)) %>% mutate(lab_dist = if_else(group_comp == "within", max_dist, min_dist)) %>% ungroup() %>% mutate(Keys_Species = str_remove(Keys_Species, "s")) %>% mutate(Keys2_Species = str_remove(Keys2_Species, "s")) %>% ggplot() + geom_raster(aes(x = Keys_Species, y = Keys2_Species, fill = mean_dist)) + geom_text(aes(x = Keys_Species, y = Keys2_Species, label = round(lab_dist,2)), size = 2, color = "white") + labs(x = "Species",y = "Species", fill = "Mean Dist") + theme_bw() + theme(axis.text.x = element_text(angle = -45, hjust = -0.1))
ggpubr::ggarrange(v12_heat, v4_heat,labels = "AUTO", ncol = 1, nrow = 2,align = "v")
### Conclusion Here we demonstrate how the _metagenomeFeatures_ package in conjunction with one of the associated 16S rRNA database packages, _greengenes13.5MgDb_, and other R packages, can be used to evaluate whether species-level taxonomic classification is possible for a specific amplicon region. The approach used here can easily be extended to use different 16S rRNA databases (starting with a different `MgDb`class object), taxonomic groups (changing filtering parameters), or amplicon regions (changing primer sequences). ******************************************************************************** ## Session Information ### System Information ```r s_info <- devtools::session_info(include_base = FALSE) pander::pander(s_info$platform)
s_info$packages %>% filter(`*` == "*") %>% select(-`*`) %>% knitr::kable(booktabs = TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.