16S rRNA amplicon sequencing is commonly used for microbial community characterization, including differential abundance and diversity analysis. A limitation to 16S rRNA amplicon sequencing is a lack of taxonomic resolution, where organisms are only identifiable to the genus or family level. We define taxonomic resolution as the ability to differentiate between groups within a taxonomic level, for example, differentiating between species within a genus. We are interested in determining whether the 16S rRNA region of interest contains sufficient information for species-level taxonomic assignment. Taxonomic resolution varies by clade and amplicon regions. However, the extent to which taxonomic resolution varies is not well characterized.
There are multiple 16S rRNA databases each using their own file format and whose maintainers use different curation approaches resulting in different taxonomic and sequence compositions.
One can perform taxonomic or other 16S rRNA sequence analysis on an individual database but with metagenomeFeatures
and MgDb
annotation packages the same analysis can easily be performed on multiple databases, increasing the power of the analysis.
Here we demonstrate how metagenomeFeatures
and the MgDb
annotation packages can be used to characterize taxonomic resolution for the Paenibacillus genus and V12 and V4 amplicon regions.
Originally, Paenibacillus was classified under the Bacillus genus, a novel genus was formed based on the 16S rRNA gene similarity in the 1990s.
Paenibacillus spp. are facultative anaerobic bacteria present in a variety of environments including the soil, water, and can act as opportunistic pathogens in humans [@ouyang2008].
It has been shown that Paenibacillus spp. will play an important role in sustainable agricultural industries [@grady2016].
Thus, an appropriate speciation of this genus is of an interest to the community.
We used V12 and V4 region as they represent two commonly used amplicons for 16S rRNA marker-gene surveys.
We will use the Greengenes 13.5 database, accessed using the greengenes13.5MgDb
, Silva 128 database, accessed using the silva128.1MgDb
, and RDP 11.5 database, accessed using the ribosomaldatabaseproject11.5MgDb
, as annotation packages for our analysis of the Paenibacillus genus.
First, we show the distribution of sequences belonging to Paenibacillus in these three databases, and then a way to combine different subsets of database sequences to build a custom database object.
This helps investigators to pool sequences from different databases, to create an accurate and most representative database for their sample.
This is especially helpful when a certain clade of interest is not very well represented in the databases.
We then evaluate 16S rRNA amplicon sequencing taxonomic resolution for Paenibacillus species for V12 and V4 regions.
In many analyses, the Greengenes 13.5 database is used for demonstration purposes but the other MgDb
annotation packages such as RDP 11.5, SILVA 128, can also be used.
While the databases provide a consistent interface to the different 16S rRNA databases, the databases use different approaches to formatting the taxonomic names. As a result, the majority of the code used to produce this document is tidying the data. This code is not shown in the document but is available in the source Rmarkdown file which is provided as supplemental material and available in the metagenomeFeatures GitHub repository.^1
In addition to metagenomeFeatures
, greengenes13.5MgDb
, ribosomaldatabaseproject11.5MgDb
, and silva128.1MgDb
, the DECIPHER
, tidyverse
, and ggpubr
packages are also used in the following analysis.
Our analysis uses the DECIPHER
package to extract the amplicon regions, perform multiple sequence alignment, and generate a pairwise sequence distance matrix [@Wright2015-me].
The tidyverse
and ggpubr
packages will be used to reformat the taxonomic and distance matrix data and generate summary figures [@tidyverse @ggpubr].
See Session Information Section for package version information.
library(tidyverse) library(ggpubr) library(UpSetR) library(DECIPHER) library(metagenomeFeatures) library(greengenes13.5MgDb) library(silva128.1MgDb) library(ribosomaldatabaseproject11.5MgDb)
We developed 16S rRNA MgDb annotation packages for three 16S rRNA databases, Greengenes, SILVA, and RDP. RNAcentral, a meta-database for non-coding RNA sequences provides cross database unique identifiers for sequences in the databases. The number of sequences in the databases and fraction of sequences with RNAcentralIDs varies for the three databases (Table \@ref(tab:dbSize)). Using the RNAcentral IDs we can evaluate the overlap between the three databases (Fig. \@ref(fig:dbVenn)).
get_rnacentral_entries <- function(mgdb){ if("rnacentral_ids" %in% taxa_columns(mgdb)){ rna_filt <- mgDb_taxa(mgdb) %>% filter(!is.na(rnacentral_ids)) } else { rna_filt <- mgDb_taxa(mgdb) %>% filter(!is.na(RNAcentralID)) } rna_filt } get_size_and_rnacentral <- function(mgdb){ db_size <- mgDb_taxa(mgdb) %>% tally() %>% collect() %>% as.numeric() db_wrna_id <- get_rnacentral_entries(mgdb) %>% tally() %>% collect() %>% as.numeric() ## return size and size with rna central ids data_frame(Size = db_size, n_entries_rnaid = db_wrna_id) } db_list <- list(Greengenes = gg13.5MgDb, SILVA = slv128.1MgDb, RDP = rdp11.5MgDb) db_info_df <- db_list %>% map_dfr(get_size_and_rnacentral, .id = "Database") db_info_df %>% mutate(`Frac Ids` = (n_entries_rnaid)/Size) %>% dplyr::select(-n_entries_rnaid) %>% knitr::kable(digits = 2, format.args = list(big.mark = ","), booktabs = TRUE, caption = "Number of sequences (size) and fraction of sequences with RNAcentral ids (frac ids) for the three 16S rRNA databases with MgDb annotation packages.")
get_rnacentral_ids <- function(mgdb){ if("rnacentral_ids" %in% taxa_columns(mgdb)){ rna_filt <- mgDb_taxa(mgdb) %>% filter(!is.na(rnacentral_ids)) } else { rna_filt <- mgDb_taxa(mgdb) %>% filter(!is.na(RNAcentralID)) } rna_filt %>% select(contains("central")) %>% collect() %>% distinct() %>% .[[1,]] } ## This is really slow ... rna_central_id_list <- db_list %>% map(get_rnacentral_ids)
upset(fromList(rna_central_id_list), order.by = "freq")
For the taxonomic resolution analysis, we will: 1. characterize the taxonomic composition of the Paenibacillus genus in the three databases, 2. calculate pairwise distances between sequences for 30 species with the most sequences, 3. evaluate pairwise distances within and between species.
First obtain the taxonomic information for Paenibacillus genus from the three databases using the mgDb_select()
function.
paeni_16S_greengenes <- metagenomeFeatures::mgDb_select(gg13.5MgDb, type = c("taxa"), keys = "Paenibacillus", keytype = "Genus") paeni_16S_silva <- metagenomeFeatures::mgDb_select(slv128.1MgDb, type = c("taxa"), keys = "Paenibacillus", keytype = "Genus") paeni_16S_rdp <- metagenomeFeatures::mgDb_select(rdp11.5MgDb, type = c("taxa"), keys = "Paenibacillus", keytype = "Genus")
## Per genus count data - greengenes ########################################### tidy_gg_taxa <- paeni_16S_greengenes %>% ## cleaning up species names mutate(Species = if_else(Species == "s__", "Unassigned", Species), Species = str_replace(Species, "s__","")) taxa_df_greengenes <- tidy_gg_taxa %>% group_by(Species) %>% summarise(Count = n()) %>% ungroup() %>% mutate(Species = fct_reorder(Species, Count)) ## Count info for text total_otus_greengenes <- sum(taxa_df_greengenes$Count) unassigned_idx_greengenes <- taxa_df_greengenes$Species == "Unassigned" no_species_assignment_greengenes <- taxa_df_greengenes$Count[unassigned_idx_greengenes] ## Per genus count data - SILVA ################################################ tidy_slv_taxa <- paeni_16S_silva %>% rowwise() %>% mutate(new = str_split(Species, boundary("word")),Species = paste(new[1],new[2])) %>% mutate(new = NULL) %>% mutate(Species = if_else(!str_detect(Species, "Paenibacillus"), "Unassigned", Species)) %>% mutate(Species = if_else(str_detect(Species, paste(c("Paenibacillus sp", "uncultured", "unidentified"), collapse = '|')), "Unassigned", Species)) %>% mutate(Species = str_remove(Species, "Paenibacillus ")) taxa_df_silva <- tidy_slv_taxa %>% group_by(Species) %>% summarise(Count = n()) %>% ungroup() %>% mutate(Species = fct_reorder(Species, Count)) ## Count info for text total_otus_silva <- sum(taxa_df_silva$Count) unassigned_idx_silva <- taxa_df_silva$Species == "Unassigned" no_Species_assignment_silva <- taxa_df_silva$Count[unassigned_idx_silva] ## Per genus count data - RDP ################################################## tidy_rdp_taxa <- paeni_16S_rdp %>% rowwise() %>% dplyr::rename(Species = species, Genus = genus) %>% mutate(new = str_split(Species, boundary("word")), Species = paste(new[1],new[2])) %>% mutate(new = NULL) %>% mutate(Species = if_else(!str_detect(Species, "Paenibacillus"), "Unassigned", Species)) %>% mutate(Species = if_else(str_detect(Species, paste(c("Paenibacillus sp", "uncultured", "unidentified"), collapse = '|')), "Unassigned", Species)) %>% mutate(Species = str_remove(Species, "Paenibacillus ")) taxa_df_rdp <- tidy_rdp_taxa %>% group_by(Species) %>% summarise(Count = n()) %>% ungroup() %>% mutate(Species = fct_reorder(Species, Count)) ## Count info for text total_otus_rdp <- sum(taxa_df_rdp$Count) unassigned_idx_rdp <- taxa_df_rdp$Species == "Unassigned" no_Species_assignment_rdp <- taxa_df_rdp$Count[unassigned_idx_rdp]
For Paenibacillus genus the databases vary in the number of sequences for each species with some databases not having any sequences for some species (Fig. \@ref(fig:speciesCount)).
With a relatively small overlap between the databases, multiple database analysis provides additional power (Fig. \@ref(fig:dbVenn)).
The three database maintainers differ in their curation approaches, taxonomic name formatting, and update frequency.
There are benefits to using all three databases in a cross database analysis.
The Greengenes database taxonomic name is more rigidly formatted, only including the major taxonomic levels, e.g. does not include intermediate level such as suborder, and only goes to the species level.
As a result, species name parsing is unlikely to negatively impact the results.
SILVA and RDP are consistently updated while Greengenes has not been updated since 2013, and therefore it does not contain sequences that were not available in 2013. As a result, the SILVA and RDP databases are significantly more comprehensive than Greengenes.
Both, the SILVA and RDP taxonomic name formatting include intermediate taxonomic levels allowing for more fine grained taxonomic analysis.
Additionally, the SILVA database is the only one of the three to include eukaryotic 16S sequences as well as prokaryotic sequences.
For our analysis of the Paenibacillus, the and issue with all databases is that the sequences only classified to the genus level ("Unassigned" at species level) is the most abundant group, Greeengenes - r no_species_assignment_greengenes
, SILVA - r no_Species_assignment_silva
, and RDP - r no_Species_assignment_rdp
(Fig. \@ref(fig:speciesCount)).
combined_taxa_df <- bind_rows(list(Greengenes = taxa_df_greengenes, SILVA = taxa_df_silva, RDP = taxa_df_rdp), .id = "Database") %>% group_by(Species) %>% mutate(max_count = max(Count, na.rm = TRUE)) %>% ungroup() %>% top_n(30, max_count) %>% mutate(Species = fct_reorder(Species, max_count)) combined_taxa_df %>% group_by(Species) %>% mutate(ymin = min(Count), ymax = max(Count)) %>% ggplot() + geom_linerange(aes(x = Species, ymin = ymin + 1, ymax = ymax + 1)) + geom_point(aes(x = Species, y = Count + 1, color = Database)) + labs(y = "Number of sequences") + scale_y_log10() + coord_flip() + theme_bw() + theme(axis.text.y = element_text(face = "italic"), legend.position = "bottom")
Next, we evaluate the 16S rRNA amplicon sequencing taxonomic resolution for Paenibacillus Species by comparing within and between Species amplicon pairwise distance for the V12 and V4 regions. To differentiate between Species, the pairwise distances for within-Species amplicon sequences must be less than the between Species distances. Additionally, the difference in amplicon sequence pairwise distances between and within Species must be greater than the sequencing error rate to detect the difference.
Here, we will demonstrate the process for extracting the target amplicon region and calculating pairwise distances using the Greengenes 13.5 database for simplicity. See the source Rmarkdown file for code used to calculate pairwise distances for other databases and the V4 region as well as tidying the data for downstream analysis.
The V12 and V4 regions were extracted from the database sequences using pattern matching. Some sequences do not contain both forward and reverse primers as not all the 16S rRNA sequences are full length. Only sequences with both forward and reverse primers are included in the analysis.
The following PCR primers were used for our in-silico PCR:
|Region |Direction |Primer | |:------|:-------- |:----------------------------------| |V12 |Forward | 27F - AGAGTTTGATCATGGCTCAG | | |Reverse | 336R - CACTGCTGCSYCCCGTAGGAGTCT | |V4 |Forward | 515F - GTGCCAGCMGCCGCGGTAA | | |Reverse | 806R - GGACTACHVGGGTWTCTAAT |
forward_primer <- "AGAGTTTGATCATGGCTCAG" ## reverse complementing reverse primer reverse_primer <- DNAString("CACTGCTGCSYCCCGTAGGAGTCT") %>% reverseComplement() %>% as.character() ## Selecting Paenibacillus taxa and seq paeni_16S_greengenes <- metagenomeFeatures::mgDb_select(gg13.5MgDb, type = c("taxa", "seq"), keys = "Paenibacillus", keytype = "Genus") ## Finding sequences with forward primer forward_match <- Biostrings::vmatchPattern(forward_primer, subject = paeni_16S_greengenes$seq, max.mismatch = 2) %>% as.list() %>% map_dfr(as.data.frame,.id = "seq_id") ## Finding sequences with reverse primer reverse_match <- Biostrings::vmatchPattern(reverse_primer, subject = paeni_16S_greengenes$seq, max.mismatch = 2, fixed = FALSE) %>% as.list() %>% map_dfr(as.data.frame,.id = "seq_id") ## sequences with both forward and reverse primers seqs_to_use_ids <- intersect(forward_match$seq_id, reverse_match$seq_id) seqs_to_use <- names(paeni_16S_greengenes$seq) %in% seqs_to_use_ids ## Trimming sequences with both primers paeni_V12 <- TrimDNA(paeni_16S_greengenes$seq[seqs_to_use], leftPatterns = forward_primer, rightPatterns = reverse_primer, type = "both") ## Excluding seqs with length 0 paeni_V12_seqs <- paeni_V12[[2]][width(paeni_V12[[2]]) != 0]
For more accurate distance estimates, a multiple sequence alignment is used to calculate pairwise distances.
We will use the AlignSeqs
function in the DECIPHER
package to generate the multiple sequence alignment.
v12_align <- AlignSeqs(paeni_V12[[2]], verbose = FALSE)
The resulting alignment can be viewed using the BrowseSeqs
function in the DECIPHER
package.
BrowseSeqs(v12_align)
Next a pairwise distance matrix is generated using the DistanceMatrix
function in the DECIPHER
package for taxonomic resolution analysis and converting distance matrix to a data frame for analysis.
v12_dist <- DistanceMatrix(v12_align, correction = "none", verbose = FALSE, includeTerminalGaps = FALSE)
Now that we have our pairwise distance matrix, we can evaluate the taxonomic resolution of the genus Paenibacillus by comparing the within and between species distances. To reduce the complexity of our analysis, we are only going to look at the pairwise distances between sequences assigned to the 10 species with the most sequences (Fig. \@ref(fig:speciesCount)). With more sequences we will have better estimates for within species pairwise distances but is likely to result in the exclusion of closely related species. We first compare the distribution of within and between species pairwise distances for the genus. Then look at the pairwise distances within species P. amylolyticus compared distances between P. amylolyticus and the other nine most represented Paenibacillus species.
get_amp <- function(paeni_16S, forward_primer, reverse_primer){ ## Finding sequeces with forward primer forward_match <- Biostrings::vmatchPattern(forward_primer, subject = paeni_16S, max.mismatch = 2) %>% as.list() %>% map_dfr(as.data.frame,.id = "seq_id") ## Finding sequences with reverse primer reverse_match <- Biostrings::vmatchPattern(reverse_primer, subject = paeni_16S, max.mismatch = 2, fixed = FALSE) %>% as.list() %>% map_dfr(as.data.frame,.id = "seq_id") ## sequences with both forward and reverse primers seqs_to_use_ids <- intersect(forward_match$seq_id, reverse_match$seq_id) seqs_to_use <- names(paeni_16S) %in% seqs_to_use_ids ## Trimming sequences with both primers paeni_amp <- TrimDNA(paeni_16S[seqs_to_use], leftPatterns = forward_primer, rightPatterns = reverse_primer, type = "both") ## Excluding seqs with lenght 0 paeni_amp <- paeni_amp[[2]][width(paeni_amp[[2]]) != 0] ## Returning target seqs paeni_amp } get_dist <- function(amp_seqs){ aligned_seqs <- AlignSeqs(amp_seqs, verbose = FALSE) amp_dist <- DistanceMatrix(aligned_seqs, correction = "none", verbose = FALSE, includeTerminalGaps = FALSE) Key_ord <- data.frame(Keys = rownames(amp_dist), Keys_ord = 1:nrow(amp_dist)) Key2_ord <- data.frame(Keys2 = rownames(amp_dist), Keys2_ord = 1:nrow(amp_dist)) ## Creating a data frame for exploratory analysis amp_dist_df <- amp_dist %>% as.data.frame() %>% rownames_to_column(var = "Keys") %>% gather("Keys2","distance", -Keys) %>% left_join(Key_ord) %>% left_join(Key2_ord) %>% filter(Keys_ord < Keys2_ord) %>% select(-Keys_ord, -Keys2_ord) } get_tidy_dist_df <- function(mgdb, tax_df, forward_primer, reverse_primer){ ## Get seqs and taxa paeni_16S <- metagenomeFeatures::mgDb_select(mgdb, type = "seq", keys = tax_df$Keys, keytype = "Keys") ## Subset seqs amp_seqs <- get_amp(paeni_16S, forward_primer, reverse_primer) ## Get distance DF amp_dist_df <- get_dist(amp_seqs) ## get distance df amp_dist_anno_df <- amp_dist_df %>% left_join(tax_df) %>% left_join(tax_df,by = c("Keys2" = "Keys")) %>% dplyr::rename(Keys_Species = Species.x, Keys2_Species = Species.y) %>% mutate(group_comp = if_else(Keys_Species == Keys2_Species, "within","between")) ## Return tidy dataframe amp_dist_anno_df }
### GG gg_for_seq_comp <- tidy_gg_taxa %>% filter(!is.na(rnacentral_ids), Species != "Unassigned", Species %in% combined_taxa_df$Species) ### SILVA slv_for_seq_comp <- tidy_slv_taxa %>% filter(!is.na(RNAcentralID), Species != "Unassigned", Species %in% combined_taxa_df$Species) ### RDP rdp_for_seq_comp <- tidy_rdp_taxa %>% filter(!is.na(RNAcentralID), Species != "Unassigned", Species %in% combined_taxa_df$Species) ## V12 Amplicon forward_primer <- "AGAGTTTGATCATGGCTCAG" ## reverse complementing reverse primer reverse_primer <- DNAString("CACTGCTGCSYCCCGTAGGAGTCT") %>% reverseComplement() %>% as.character() taxa_list <- list(Greengenes = gg_for_seq_comp, SILVA = slv_for_seq_comp, RDP = rdp_for_seq_comp) multidb_dist_V12_df <- map2_dfr(db_list, taxa_list, get_tidy_dist_df, forward_primer, reverse_primer, .id = "Database") ## V4 Amplicon forward_primer <- "GTGCCAGCMGCCGCGGTAA" ## reverse complementing reverse primer reverse_primer <- "ATTAGAWACCCBDGTAGTCC" multidb_dist_V4_df <- map2_dfr(db_list, taxa_list, get_tidy_dist_df, forward_primer, reverse_primer, .id = "Database") ## Combine into a single dataframe multidb_dist_df <- bind_rows( list(V12 = multidb_dist_V12_df, V4 = multidb_dist_V4_df), .id = "amplicon") ## Excluding outlier sequence "329842" - mean pairwise distance to all other ## sequences is 0.2 multidb_dist_df <- filter(multidb_dist_df, Keys != "329842",Keys2 != "329842")
Pairwise distance is smaller within than between species, indicating that the V12 and V4 regions are potentially suitable for classifying members of the Paenibacillus genus to the Species level (Fig. \@ref(fig:pairDist)). The median and interquartile pairwise distance range for within and between species comparisons in consistent across the three databases further supporting our observation. A large number of outliers, pairwise distances past the boxplot whiskers, were observed for all three databases. Outliers below the between species boxplots and above the within species are problematic for species level classification. The sequences and taxonomic assignment of these outliers should be evaluated to ensure they are not errors in the database, a result of incorrect database parsing, or are examples where phylogeny and taxonomy are inconsistent.
Overall, the V12 region had greater pairwise distances than V4 for both within and between Species. It is important also to consider that we only included sequences with species assignments and the species with the most sequences in the database. Species-level information for the unclassified sequences and inclusion of less well-represented species might yield results that are inconsistent with our analysis. Additionally, our analysis does not identify the pairwise sequence distance required to classify a sequence as a novel Paenibacillus Species.
multidb_dist_df %>% ggplot(aes(x = Database, y = distance, fill = group_comp)) + geom_boxplot() + facet_grid(.~amplicon) + labs(x = "Species Comparison", fill = "Species Comparison") + theme_bw() + theme(legend.position = "bottom")
While the overall pairwise distance is greater between than within species for the Paenibacillus genus, it is important to understand how within and between species pairwise distances compare for individual Species. The within species pairwise for P. amylolyticus distances are similar to between species distances for P. pabuli and P. illnoisensis (Fig. \@ref(fig:speciesComp)). When including outliers the within pairwise distances for P. amylolyticus is in the range of between pairwise distances for other P. spp.. Similar to the genus level comparison investigation of the outliers sequences and strains is required to evaluate the taxonomic resolution for Paenibacillus.
multidb_dist_df %>% mutate(Species_Comp = paste0(Keys_Species, Keys2_Species)) %>% filter(str_detect(Species_Comp, "amylolyticus")) %>% mutate(Species_Comp = str_remove(Species_Comp, "amylolyticus")) %>% mutate(Species_Comp = fct_reorder(Species_Comp, distance), Species_Comp = fct_relevel(Species_Comp, "amylolyticus", after = 0)) %>% ggplot() + geom_boxplot(aes(x = Species_Comp, y = distance, fill = Database)) + facet_wrap(~amplicon, ncol = 1) + theme_bw() + theme(axis.text.x = element_text(angle = -45, hjust = 0), legend.position = "bottom") + labs(x = "Species compared to amylolyticus", y = "Pairwise Distance")
The metagenomeFeatures
package and associated 16S rRNA database packages greengenes13.5MgDb
, silva128.1MgDb
, ribosomaldatabaseproject11.5MgDb
, provide a consistent interface for working with the different databases.
Furthermore, through the use of the RNAcentralIDs we are able to evaluate database overlap and perform cross database analysis.
We demonstrate how the metagenomeFeatures package in conjunction with one of the associated 16S rRNA database packages, and other R packages, can be used to evaluate whether species-level taxonomic classification is possible for a specific amplicon region.
The approach used here can easily be extended to evaluate taxonomic groups (changing filtering parameters), or amplicon regions (changing primer sequences).
sessioninfo::platform_info()
sessioninfo::package_info() %>% filter(attached == TRUE) %>% select(package, loadedversion, source) %>% knitr::kable(booktabs = TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.