Background: bugsigdb.org provides signatures of
differentially abundant microbial taxa that have been reported in the published
literature, and is planned for official public release in the near future.
One planned feature is to provide users with automated identification of similar
signatures, and this can already be done offline with the calcPairwiseOverlaps
function in BugSigDBStats.
A useful comparison to have would be with the "typical" microbes that are present
in healthy people at each body site, as these might be expected to be depleted
in disease states but not enriched. This is not a typical use case of
bugsigdb.org but would be a useful addition nonetheless.
Proposed methods:
Using the curatedMetagenomicData Bioconductor package (>=3.0.0): create a
TreeSummarizedExperiment object per body site for all profiles from healthy
individuals. See the body_site and disease columns in the sampleMetadata
object, and the returnSamples function.
Use the mia::splitByRanks function to add genus-level altExps to the
SummarizedExperiment objects.
Separately for each body site and for species/genus levels, identify taxa that
have relative abundance > 0 in at least 50% of samples.
Write these to file as comma-separated NCBI IDs. Also write to file the number
of samples in each body site. You can use something simple like writeLines to
write out anything that is needed for bugsigdb.org data entry, then the NCBI
IDs will be copy-pasted into the signature entry form.
Also write the species and genus matrices to file. Upload code and result files
to zenodo.org to get a DOI.
Add these signatures as a Study in bugsigdb.org, including the Zenodo DOI. It can be one Study, an Experiment for each body site, and a Signature for species plus a signature for genus in each body site. Note that comma-separated NCBI IDs can be bulk copy-pasted into the Signature entry form.
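The prevalence filter and the comma-separated output described above can be sketched as follows. This is a minimal, language-agnostic illustration in Python rather than the R workflow itself; the function names, the toy abundance table, and the placeholder taxon IDs are all hypothetical and are not taken from curatedMetagenomicData.

```python
from typing import Dict, List


def prevalent_taxa(abundance: Dict[str, List[float]],
                   min_prevalence: float = 0.5) -> List[str]:
    """Return taxon IDs with relative abundance > 0 in at least
    `min_prevalence` of the samples (hypothetical helper)."""
    keep = []
    for taxon_id, values in abundance.items():
        if not values:
            continue  # no samples for this taxon, nothing to evaluate
        prevalence = sum(v > 0 for v in values) / len(values)
        if prevalence >= min_prevalence:
            keep.append(taxon_id)
    return sorted(keep)


def format_signature(taxon_ids: List[str]) -> str:
    """Join IDs with commas, the form that can be pasted into an entry form."""
    return ",".join(taxon_ids)


# Toy table for one body site: placeholder taxon IDs mapped to relative
# abundances in four samples (made-up values, not real data).
toy_abundance = {
    "1001": [0.30, 0.25, 0.00, 0.40],  # present in 3 of 4 samples
    "1002": [0.00, 0.00, 0.05, 0.00],  # present in 1 of 4 samples
    "1003": [0.10, 0.00, 0.00, 0.20],  # present in 2 of 4 samples
}

n_samples = len(next(iter(toy_abundance.values())))
signature = format_signature(prevalent_taxa(toy_abundance))
```

In this sketch a taxon at exactly 50% prevalence is kept, matching the "at least 50%" wording; the same function would be applied once per body site and once per taxonomic level.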
What a successful result would look like: One new study in bugsigdb.org referencing the Zenodo DOI, containing one Experiment per body site in curatedMetagenomicData, and two signatures (species and genus) per Experiment. Some creativity will be required for figuring out how to fill in Experiment fields in bugsigdb.org, for example "group 0 name" might be "blank control" with sample size 0, and we might have to allow some new statistical methods to be entered, because this isn’t standard.
Potential follow-up work: This study would be used in analyses of other BugSigDB signatures, and mentioned in the manuscript in preparation.
Background: There are currently more than 2,000 published signatures in BugSigDB, and we would like to do graph-based and clustering analysis based on all pairwise similarities of these signatures, such as by Jaccard index. Performant methods are needed to calculate such a large distance matrix, visualize similar signatures, and add a small number of new signatures to the comparison without re-computing the entire distance matrix.
Proposed methods: The calcPairwiseOverlaps function from BugSigDBStats
currently does a reasonably fast job of calculating pairwise overlaps between
BugSigDB signatures (see the vignette for full usage including data import
using the bugsigdbr package), returning a long-format data.frame providing a
network edge list.
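To make the Jaccard-based comparison and the incremental update concrete, here is a minimal sketch in Python. It is not the calcPairwiseOverlaps implementation; the function names and toy signatures are hypothetical, and signatures are assumed to be plain sets of taxon identifiers.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(signatures: Dict[str, Set[str]]) -> Dict[Pair, float]:
    """Similarity for every unordered pair, keyed by the sorted name pair."""
    return {
        (i, j): jaccard(signatures[i], signatures[j])
        for i, j in combinations(sorted(signatures), 2)
    }


def add_signature(sims: Dict[Pair, float], signatures: Dict[str, Set[str]],
                  name: str, taxa: Set[str]) -> None:
    """Add one signature by computing only its row of similarities,
    leaving all previously computed pairs untouched."""
    for other in sorted(signatures):
        key = tuple(sorted((other, name)))
        sims[key] = jaccard(signatures[other], taxa)
    signatures[name] = taxa


# Toy signatures with made-up taxon labels.
sigs = {"s1": {"a", "b", "c"}, "s2": {"b", "c", "d"}, "s3": {"x"}}
sims = pairwise_jaccard(sigs)
add_signature(sims, sigs, "s4", {"a", "b"})
```

The Jaccard distance is one minus the similarity. For a few thousand signatures, a likely faster route would be to encode signatures as a sparse binary signature-by-taxon matrix and obtain all intersection sizes from a single matrix product, but that is a design suggestion rather than something the source prescribes.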
What a successful result would look like: Pull requests to BugSigDBStats for 1) calculation of a full distance matrix for ~2,000 signatures in a few seconds, 2) cluster analysis, e.g. k-means clustering and hierarchical clustering.
Potential follow-up work: Provision of a function using the results of above to immediately compare results of a new differential abundance analysis to the existing signatures in BugSigDB. A shiny app to do the same.
Background: The ultimate goal of BugSigDB would be to capture the entire literature of human microbiome studies that have reported signatures of differentially abundant microbial taxa in some comparison of different study subjects. One challenge to meeting this goal is finding published studies that meet the basic criteria for inclusion: 1) the study is indexed in PubMed, 2) the study reports one or more microbial signatures of host-associated microbiota, i.e., lists of microbial taxa found to be differentially abundant between study conditions or groups of study subjects.
Proposed methods: A recently published tool called ASReview (van de Schoot et al., 2021) applies machine learning to assist with the prioritization of papers for systematic review. ASReview takes as input text (i.e., titles and abstracts) from search results, along with "true positive" results that have been manually confirmed to be relevant to the review, then predicts which other publications are most likely to be relevant.
ASReview is a tool written in Python that must be installed and run on the command line. BugSigDB provides bulk export of its contents, including study PMIDs, titles, and abstracts, that could be used as the "true positive" input data (for example from the Studies export page or from the bugsigdbr R package).
Steps to apply ASReview to this problem would be something like:
Find a PubMed search term that captures most of the studies already in BugSigDB.
This will probably return tens or hundreds of thousands of studies.
Doing this precisely would require comparison to the >500 studies already in
BugSigDB, but this isn’t critical at least at first, and a general PubMed search
like (microbiome OR microbiota) AND (sequencing OR 16S OR shotgun OR amplicon)
(currently returning >33,000 results) could be used as a start.
One would then need to figure out how to use this search result as input to ASReview,
and how to use the current studies in BugSigDB as positive results input in ASReview.
Subsequently, one would use ASReview to produce a ranking of likely relevant
studies not yet included in BugSigDB.
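The data preparation around these steps can be sketched as follows. This is a hedged illustration only: it does not call ASReview, the helper names are hypothetical, and the column layout (pmid, title, abstract, plus a label column named `included`) is an assumption that should be checked against the ASReview documentation before use.

```python
import csv
import io
from typing import Dict, Iterable, List, Set

FIELDS = ["pmid", "title", "abstract", "included"]


def build_screening_table(search_results: Iterable[Dict[str, str]],
                          known_pmids: Set[str]) -> List[Dict[str, str]]:
    """Label search results already in BugSigDB as relevant ("1") and
    leave all other records unlabeled (empty string)."""
    rows = []
    for rec in search_results:
        rows.append({
            "pmid": rec["pmid"],
            "title": rec["title"],
            "abstract": rec["abstract"],
            "included": "1" if rec["pmid"] in known_pmids else "",
        })
    return rows


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize the table as CSV text (assumed input layout, see lead-in)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def new_candidates(ranked_pmids: Iterable[str],
                   known_pmids: Set[str]) -> List[str]:
    """Drop studies already in BugSigDB from a ranked list, keeping order."""
    return [p for p in ranked_pmids if p not in known_pmids]


# Made-up records standing in for a PubMed search result.
search = [
    {"pmid": "111", "title": "Gut microbiota in condition A", "abstract": "..."},
    {"pmid": "222", "title": "Oral microbiome and exposure B", "abstract": "..."},
]
table = build_screening_table(search, known_pmids={"111"})
csv_text = to_csv(table)
```

Known BugSigDB studies that the search does not return would have to be appended to the table separately; `new_candidates` covers the requirement that, on updates, only studies not yet in BugSigDB are shown.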
What a successful result would look like: All studies near the top of the ASReview ranked list should be appropriate for entry into BugSigDB. The process should be easy to automate and update. On updates, studies already in BugSigDB should not be shown, leaving only candidates for new entry.
Potential follow-up work: This search process could be incorporated into a GitHub Action and the results kept up to date in a public location, and provide a go-to list for studies that can be entered into BugSigDB. People wanting to do a systematic review of studies reporting differential microbial abundance for one particular health condition or exposure could make minor modifications to the process to assist in identifying studies for their review.
Background:
The Semantic MediaWiki curation interface at bugsigdb.org enforces metadata
annotation of signatures to follow established ontologies such as the
Experimental Factor Ontology (EFO) for condition, and
the Uber-Anatomy Ontology (UBERON) for body site.
The bugsigdbr package implements
access to BugSigDB from within R/Bioconductor. This includes import of
BugSigDB data via the importBugSigDB function into an ordinary data.frame
from which subsets of interest can be obtained.
Such subsets can, e.g., be obtained for signatures associated with
certain experimental factors or specific body sites of interest.
Objective: Support ontology-based subsetting of BugSigDB signatures.
Proposed methods:
The ontologyIndex package implements functions for reading and querying
ontologies in R. This includes the get_ontology function for reading
ontologies from files in OBO format.
The OBO file for EFO is available here and the OBO file for UBERON is
available here.
Subsetting BugSigDB signatures by an EFO term then involves restricting the
Condition column to those values that are descendants of that term in the EFO
ontology. Analogously, subsetting by an UBERON term involves restricting the
Body site column to those values that are descendants of that term in the
UBERON ontology.
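The descendant-based subsetting described above can be sketched in a few lines. This Python illustration is not the ontologyIndex API: the child-to-parents mapping, the term labels, and the helper names are hypothetical stand-ins for what would be read from an OBO file, and including the query term itself alongside its descendants is an assumption.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set


def descendants(parents: Dict[str, List[str]], term: str) -> Set[str]:
    """All terms reachable from `term` by following child links
    (breadth-first), excluding `term` itself."""
    children = defaultdict(set)
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children[parent].add(child)
    seen: Set[str] = set()
    queue = deque([term])
    while queue:
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def subset_by_term(rows: List[Dict[str, str]], column: str,
                   parents: Dict[str, List[str]], term: str) -> List[Dict[str, str]]:
    """Keep rows whose `column` value is the query term or a descendant."""
    wanted = descendants(parents, term) | {term}
    return [row for row in rows if row[column] in wanted]


# Hypothetical toy ontology: each term maps to its direct parents.
toy_parents = {
    "cancer": [],
    "colorectal cancer": ["cancer"],
    "colon adenocarcinoma": ["colorectal cancer"],
    "obesity": [],
}
toy_rows = [
    {"Condition": "colon adenocarcinoma"},
    {"Condition": "obesity"},
    {"Condition": "cancer"},
]
cancer_rows = subset_by_term(toy_rows, "Condition", toy_parents, "cancer")
```

Because only values actually present in the column can match, the "and present in the column" condition is satisfied implicitly by the row filter.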
What a successful result would look like: Pull request to the bugsigdbr github repository on a new branch (named ontoquery). Pull requests will be reviewed and discussed. Contributions will be acknowledged.
Potential follow-up work:
Discussion of how to implement high-level queries also for other columns of
interest such as Location of subjects, Host species, and Statistical test.
Background: Differential abundance studies typically report signatures resolved to the genus level (16S rRNA sequencing) or species level (whole-metagenome sequencing). In some cases, authors also report differential abundance at higher taxonomic levels based on analysis of relative abundance of that taxonomic level as a whole, resulting from summing relative abundances across branches at a lower taxonomic rank. For example, if the summed relative abundance of both known species of the genus Gabonibacter, i.e., Gabonibacter massiliensis and Gabonibacter timonensis, is found to be increased in a certain condition, authors would conclude that the genus Gabonibacter is overall found with increased abundance. This results in mixed signatures of different taxonomic ranks.
Ancestral state reconstruction (ASR) is a phylogenetic approach for inferring ancestor states from characteristics measured for their descendants. For example, given differential abundance (or any other microbial trait) on the species level, this could thus be used to infer differential abundance (or any other microbial trait) on the genus level, and further up the taxonomy.
Objective: Can we apply ASR for harmonization of BugSigDB signatures to a given taxonomic rank?
Proposed method:
Microbial signatures in BugSigDB follow the nomenclature of the
NCBI Taxonomy Database, restricted to microbial clades profiled by MetaPhlAn3.
The phylogenetic tree for MetaPhlAn3 species in Newick format is available
here.
The ace function from the ape package provides a standard implementation of
ancestral state reconstruction based on maximum likelihood estimation for
discrete characters (here: "UP" or "DOWN" to indicate increased or decreased
abundance).
What a successful result would look like:
Pull request to the BugSigDBStats github repository outlining the approach in
a separate ancestral state reconstruction vignette (.Rmd file in the vignettes
folder).
Pull requests will be reviewed and discussed.
Accepted contributions will be acknowledged.
Potential follow-up work: If shown to be a feasible approach, incorporation
into bugsigdbr::getSignatures should be considered. The logical argument
exact.tax.level currently harmonizes signatures either by including only taxa
given at the indicated tax.level (exact.tax.level = TRUE), or by extracting a
more general tax.level for microbes given at a more specific taxonomic level by
simply cutting the tree at the desired tax.level (exact.tax.level = FALSE).
Alternatively, simple majority votes or estimation via ASR could be provided as
additional options for the exact.tax.level argument.
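As a point of comparison for ASR, the simple majority vote mentioned above can be sketched as follows. This Python illustration is not part of bugsigdbr; the function name, the species-to-genus mapping, and the rule of leaving tied genera unresolved are assumptions, and the "GenusX" entries are made up.

```python
from collections import Counter
from typing import Dict


def majority_vote(species_direction: Dict[str, str],
                  species_to_genus: Dict[str, str]) -> Dict[str, str]:
    """Assign each genus the direction ("UP" or "DOWN") reported by the
    majority of its species; genera with a tie are left out."""
    votes: Dict[str, Counter] = {}
    for species, direction in species_direction.items():
        genus = species_to_genus.get(species)
        if genus is None:
            continue  # species without a known genus cannot vote
        votes.setdefault(genus, Counter())[direction] += 1

    result = {}
    for genus, counts in votes.items():
        (top, n), *rest = counts.most_common()
        if rest and rest[0][1] == n:
            continue  # tie between directions: leave the genus unresolved
        result[genus] = top
    return result


# Species-level calls; the "GenusX" entries are hypothetical.
directions = {
    "Gabonibacter massiliensis": "UP",
    "Gabonibacter timonensis": "UP",
    "GenusX sp1": "UP",
    "GenusX sp2": "DOWN",
}
genus_of = {
    "Gabonibacter massiliensis": "Gabonibacter",
    "Gabonibacter timonensis": "Gabonibacter",
    "GenusX sp1": "GenusX",
    "GenusX sp2": "GenusX",
}
genus_calls = majority_vote(directions, genus_of)
```

Unlike maximum likelihood ASR, this vote ignores branch lengths and tree topology below the genus, which is why it is only a baseline against which an ape::ace-based approach could be compared.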