In kevinrue/BiocChallenges: Challenges for the Bioconductor community

Project 1: Addition of body site-typical signatures

Background: bugsigdb.org provides signatures of differentially abundant microbial taxa that have been reported in the published literature, and is planned for official public release in the near future. One planned feature is to provide users with automated identification of similar signatures, and this can already be done offline with the calcPairwiseOverlaps function in BugSigDBStats. A useful comparison to have would be with the "typical" microbes that are present in healthy people at each body site, as these might be expected to be depleted in disease states but not enriched. This is not a typical use case of bugsigdb.org but would be a useful addition nonetheless.

Proposed methods: Using the curatedMetagenomicData Bioconductor package (>= 3.0.0), create one TreeSummarizedExperiment object per body site containing all profiles from healthy individuals. See the body_site and disease columns in the sampleMetadata object, and the returnSamples function. Use the mia::splitByRanks function to add genus-level altExps to these objects.

Separately for each body site and for species/genus levels, identify taxa that have relative abundance > 0 in at least 50% of samples. Write these to file as comma-separated NCBI IDs. Also write to file the number of samples in each body site. You can use something simple like writeLines to write out anything that is needed for bugsigdb.org data entry, then the NCBI IDs will be copy-pasted into the signature entry form. Also write the species and genus matrices to file. Upload code and result files to zenodo.org to get a DOI.
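
The prevalence filter described above can be sketched as follows. This is a minimal, language-agnostic illustration written in Python, not the intended R implementation; the function name prevalent_taxa, the input layout (a mapping from NCBI taxon ID to per-sample relative abundances) and the toy IDs and values are all assumptions made for the example.

```python
def prevalent_taxa(abundance, min_prevalence=0.5):
    """Return IDs of taxa with relative abundance > 0 in at least
    `min_prevalence` of the samples, sorted as strings.

    abundance: dict mapping NCBI taxon ID (str) to a list of relative
               abundances, one value per sample (hypothetical layout).
    """
    keep = []
    for taxon_id, values in abundance.items():
        if not values:
            continue  # no samples for this taxon, nothing to count
        n_present = sum(1 for v in values if v > 0)
        if n_present / len(values) >= min_prevalence:
            keep.append(taxon_id)
    return sorted(keep)


# Toy genus-level profile for one body site; IDs and values are placeholders.
toy_genus = {
    "816": [0.2, 0.1, 0.0, 0.3],     # present in 3 of 4 samples
    "1301": [0.0, 0.0, 0.0, 0.1],    # present in 1 of 4 samples
    "239934": [0.0, 0.5, 0.0, 0.4],  # present in 2 of 4 samples (exactly 50%)
}
n_samples = 4

# Comma-separated line, ready to paste into a signature entry form.
signature_line = ",".join(prevalent_taxa(toy_genus))
print(signature_line)
print(n_samples)
```

A taxon present in exactly half of the samples is kept, matching the "at least 50%" wording.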

Add these signatures as a Study in bugsigdb.org, including the Zenodo DOI. It can be one Study, an Experiment for each body site, and a Signature for species plus a signature for genus in each body site. Note that comma-separated NCBI IDs can be bulk copy-pasted into the Signature entry form.

What a successful result would look like: One new Study in bugsigdb.org referencing the Zenodo DOI, containing one Experiment per body site in curatedMetagenomicData, and two Signatures (species and genus) per Experiment. Some creativity will be required for figuring out how to fill in Experiment fields in bugsigdb.org. For example, "group 0 name" might be "blank control" with sample size 0, and we might have to allow some new statistical methods to be entered, because this isn’t standard.

Potential follow-up work: This study would be used in analyses of other BugSigDB signatures, and mentioned in the manuscript in preparation.

Project 2: Fast conversion and analysis of signature similarity

Background: There are currently more than 2,000 published signatures in BugSigDB, and we would like to do graph-based and clustering analysis based on all pairwise similarities of these signatures, such as by Jaccard index. Performant methods are needed to calculate such a large distance matrix, visualize similar signatures, and add a small number of new signatures to the comparison without re-computing the entire distance matrix.

Proposed methods: The calcPairwiseOverlaps function from BugSigDBStats currently does a reasonably fast job of calculating pairwise overlaps between BugSigDB signatures (see the vignette for full usage including data import using the bugsigdbr package), returning a long-format data.frame providing a network edge list.

What a successful result would look like: Pull requests to BugSigDBStats for 1) calculation of a full distance matrix for ~2,000 signatures in a few seconds, 2) cluster analysis, e.g. k-means clustering and hierarchical clustering.
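
To make the two requirements concrete (all pairwise similarities, plus adding a signature without recomputing everything), here is a small pure-Python sketch of the logic. It is an illustration only, not the calcPairwiseOverlaps implementation: the function names and toy signatures are hypothetical, and a loop over pairs like this one would not by itself reach the "few seconds" target for ~2,000 signatures (about two million pairs), which would more likely need a vectorised approach.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets; defined here as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(signatures):
    """Compute similarities for all unordered pairs of signatures.

    signatures: dict mapping signature name to a set of taxon IDs.
    Returns a dict keyed by (name1, name2) with name1 < name2.
    """
    names = sorted(signatures)
    return {
        (x, y): jaccard(signatures[x], signatures[y])
        for x, y in combinations(names, 2)
    }


def add_signature(similarities, signatures, name, taxa):
    """Add one new signature, computing only its row of similarities."""
    for other, other_taxa in signatures.items():
        key = tuple(sorted((name, other)))
        similarities[key] = jaccard(taxa, other_taxa)
    signatures[name] = taxa


# Toy signatures with placeholder taxon IDs.
sigs = {"A": {"1", "2", "3"}, "B": {"2", "3", "4"}, "C": {"5"}}
sims = pairwise_jaccard(sigs)

# Adding "D" touches only the three pairs that involve "D".
add_signature(sims, sigs, "D", {"1", "2"})
```

The resulting dictionary is effectively a network edge list, so it maps directly onto the long-format output mentioned above; a distance can be taken as one minus the similarity.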

Potential follow-up work: Provision of a function that uses the results above to immediately compare the results of a new differential abundance analysis to the existing signatures in BugSigDB. A shiny app to do the same.

Project 3: Automatic identification of candidate papers

Background: The ultimate goal of BugSigDB would be to capture the entire literature of human microbiome studies that have reported signatures of differentially abundant microbial taxa in some comparison of different study subjects. One challenge to meeting this goal is finding published studies that meet the basic criteria for inclusion: 1) the study is indexed in PubMed, 2) the study reports one or more microbial signatures of host-associated microbiota, ie lists of microbial taxa found to be differentially abundant between study conditions or groups of study subjects.

Proposed methods: A recently published tool called ASReview (van de Schoot et al., 2021) applies machine learning to assist with the prioritization of papers for systematic review. ASReview takes as input text (ie titles and abstracts) from search results, along with "true positive" results that have been manually confirmed to be relevant to the review, then predicts which other publications are most likely to be relevant.

ASReview is a tool written in Python that must be installed and run from the command line. BugSigDB provides bulk export of its contents, including study PMIDs, titles, and abstracts, that could be used as the "true positive" input data (for example from the Studies export page or from the bugsigdbr R package).

Steps to apply ASReview to this problem would be something like the following. First, find a PubMed search term that captures most of the studies already in BugSigDB. This will probably return tens or hundreds of thousands of studies. Doing this precisely would require comparison to the >500 studies already in BugSigDB, but this isn’t critical at least at first, and a general PubMed search like (microbiome OR microbiota) AND (sequencing OR 16S OR shotgun OR amplicon) (currently returning >33,000 results) could be used as a start. One would then need to figure out how to use this search result as input to ASReview, and how to use the current studies in BugSigDB as positive results input in ASReview. Subsequently, one would use ASReview to produce a ranking of likely relevant studies not yet included in BugSigDB.
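
One possible way to combine the two inputs is to build a single table of search results in which studies already in BugSigDB are marked as relevant. The sketch below shows only that data-preparation step. The column names (in particular the label column "included") and the use of an empty value for unlabeled records are assumptions about what ASReview accepts and should be checked against its documentation; the records and identifiers are made-up placeholders.

```python
import csv
import io

FIELDS = ["pmid", "title", "abstract", "included"]


def build_labeled_rows(search_results, known_pmids):
    """Mark search results that are already in BugSigDB as relevant.

    search_results: list of dicts with keys "pmid", "title", "abstract".
    known_pmids:    set of PMIDs already curated in BugSigDB.
    The "included" column is "1" for known studies and "" otherwise
    (assumed labeling convention, see the note above).
    """
    rows = []
    for record in search_results:
        rows.append({
            "pmid": record["pmid"],
            "title": record["title"],
            "abstract": record["abstract"],
            "included": "1" if record["pmid"] in known_pmids else "",
        })
    return rows


def write_rows(rows, handle):
    """Write the labeled rows as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder records; real input would come from a PubMed export.
results = [
    {"pmid": "PMID-A", "title": "Gut microbiota in condition X", "abstract": "..."},
    {"pmid": "PMID-B", "title": "Unrelated sequencing study", "abstract": "..."},
]
known = {"PMID-A"}  # placeholder for PMIDs exported from BugSigDB

rows = build_labeled_rows(results, known)
buffer = io.StringIO()
write_rows(rows, buffer)
```

The same set of known PMIDs can later be used to drop already-curated studies from the ranked output, so that only new candidates remain.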

What a successful result would look like: All studies near the top of the ASReview ranked list should be appropriate for entry into BugSigDB. The process should be easy to automate and update. On updates, studies already in BugSigDB should not be shown, leaving only candidates for new entry.

Potential follow-up work: This search process could be incorporated into a GitHub Action and the results kept up to date in a public location, and provide a go-to list for studies that can be entered into BugSigDB. People wanting to do a systematic review of studies reporting differential microbial abundance for one particular health condition or exposure could make minor modifications to the process to assist in identifying studies for their review.

Project 4: Ontology-based queries for experimental factors and body sites

Background: The Semantic MediaWiki curation interface at bugsigdb.org enforces metadata annotation of signatures to follow established ontologies such as the Experimental Factor Ontology (EFO) for condition, and the Uber-Anatomy Ontology (UBERON) for body site. The bugsigdbr package implements access to BugSigDB from within R/Bioconductor. This includes import of BugSigDB data via the importBugSigDB function into an ordinary data.frame from which subsets of interest can be obtained, eg signatures associated with certain experimental factors or specific body sites.

Objective: Support ontology-based subsetting of BugSigDB signatures.

Proposed methods: The ontologyIndex package implements functions for reading and querying ontologies in R. This includes the get_ontology function for reading ontologies from files in OBO format. The OBO file for EFO is available here and the OBO file for UBERON is available here. Subsetting BugSigDB signatures by an EFO term then means keeping all rows whose Condition value is that term or one of its descendants in EFO. Analogously, subsetting by an UBERON term means keeping all rows whose Body site value is that term or one of its descendants in UBERON.
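
The subsetting logic can be illustrated independently of any ontology package. The following Python sketch assumes the ontology has already been reduced to a parent-to-children mapping and that the table column holds term names; both are simplifications for the example, and the function names, the toy hierarchy and the toy rows are hypothetical. An actual contribution would be written in R on top of ontologyIndex.

```python
from collections import deque


def descendants(children, term):
    """Return `term` together with all of its descendants.

    children: dict mapping a term to a list of its direct child terms.
    Uses breadth-first traversal and tolerates terms with several parents.
    """
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def subset_by_term(rows, column, children, term):
    """Keep rows whose value in `column` is `term` or one of its descendants."""
    wanted = descendants(children, term)
    return [row for row in rows if row[column] in wanted]


# Toy hierarchy and table; not taken from UBERON or BugSigDB.
toy_children = {
    "digestive system": ["intestine", "stomach"],
    "intestine": ["colon"],
}
toy_rows = [
    {"Signature": "sig1", "Body site": "colon"},
    {"Signature": "sig2", "Body site": "stomach"},
    {"Signature": "sig3", "Body site": "skin"},
]

gut = subset_by_term(toy_rows, "Body site", toy_children, "digestive system")
```

Querying for "digestive system" returns the colon and stomach signatures even though neither row is annotated with that exact term, which is the behaviour the objective asks for.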

What a successful result would look like: Pull request to the bugsigdbr GitHub repository on a new branch (named ontoquery). Pull requests will be reviewed and discussed. Contributions will be acknowledged.

Potential follow-up work: Discussion of how to implement high-level queries also for other columns of interest such as Location of subjects, Host species, and Statistical test.

Project 5: Inference of abundance changes via ancestral state reconstruction

Background: Differential abundance studies typically report signatures resolved to the genus level (16S rRNA sequencing) or species level (whole-metagenome sequencing). In some cases, authors also report differential abundance at higher taxonomic levels based on analysis of the relative abundance of that taxonomic level as a whole, obtained by summing relative abundances across branches at a lower taxonomic rank. For example, if the summed relative abundance of both known species of the genus Gabonibacter, ie Gabonibacter massiliensis and Gabonibacter timonensis, is increased in a certain condition, authors would conclude that the genus Gabonibacter as a whole is increased in abundance. This results in mixed signatures containing taxa of different taxonomic ranks.

Ancestral state reconstruction (ASR) is a phylogenetic approach for inferring ancestor states from characteristics measured for their descendants. For example, given differential abundance (or any other microbial trait) on the species level, this could thus be used to infer differential abundance (or any other microbial trait) on the genus level, and further up the taxonomy.

Objective: Can we apply ASR for harmonization of BugSigDB signatures to a given taxonomic rank?

Proposed method: Microbial signatures in BugSigDB follow the nomenclature of the NCBI Taxonomy Database, restricted to microbial clades profiled by MetaPhlAn3. The phylogenetic tree for MetaPhlAn3 species in Newick format is available here. The ace function from the ape package provides a standard implementation of ancestral state reconstruction based on maximum likelihood estimation for discrete characters (here: "UP" or "DOWN" to indicate increased or decreased abundance).
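
To show what inferring an ancestor state from tip states looks like in the simplest case, the sketch below applies Fitch parsimony to a tiny binary tree. This is deliberately a different and much simpler technique than the maximum likelihood estimation performed by ape::ace, and it is given here only to illustrate the idea. The tree encoding (nested pairs), the function name and the tip names are hypothetical.

```python
def fitch_states(tree, tip_states):
    """Bottom-up Fitch pass on a binary tree.

    tree:       either a tip name (str) or a (left, right) tuple of subtrees.
    tip_states: dict mapping tip name to its observed state ("UP" or "DOWN").
    Returns the set of most-parsimonious states at the root of `tree`.
    """
    if isinstance(tree, str):
        return {tip_states[tree]}
    left, right = tree
    left_states = fitch_states(left, tip_states)
    right_states = fitch_states(right, tip_states)
    shared = left_states & right_states
    # Fitch rule: intersection if non-empty, otherwise union.
    return shared if shared else left_states | right_states


# Toy tree: species a and b form one clade, species c is its sister.
toy_tree = (("species_a", "species_b"), "species_c")
toy_tips = {"species_a": "UP", "species_b": "UP", "species_c": "DOWN"}

clade_ab = fitch_states(toy_tree[0], toy_tips)  # clade of a and b only
root = fitch_states(toy_tree, toy_tips)         # whole tree
```

A single returned state can be read as an unambiguous call for that ancestor, while a set with both states signals that the descendants disagree and no direction should be assigned without further evidence.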

What a successful result would look like: Pull request to the BugSigDBStats GitHub repository outlining the approach in a separate ancestral state reconstruction vignette (.Rmd file in the vignettes folder). Pull requests will be reviewed and discussed. Accepted contributions will be acknowledged.

Potential follow-up work: If shown to be a feasible approach, incorporation into bugsigdbr::getSignatures should be considered. The logical argument exact.tax.level currently harmonizes signatures by only including taxa given at the indicated tax.level (exact.tax.level = TRUE), or extracts a more general tax.level for microbes given at a more specific taxonomic level by simply cutting the tree at the desired tax.level (exact.tax.level = FALSE). Alternatively, simple majority votes or estimation via ASR could be provided as additional options for the exact.tax.level argument.
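
The simple majority vote mentioned above could look roughly like the following Python sketch. It is not part of bugsigdbr::getSignatures; the function names, the species-to-genus mapping and the choice to return no call on a tie are assumptions for illustration, and the second genus is a made-up placeholder.

```python
from collections import Counter


def majority_vote(directions):
    """Return the most frequent direction, or None if empty or tied."""
    ranked = Counter(directions).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no clear majority
    return ranked[0][0]


def harmonize_to_genus(species_directions, species_to_genus):
    """Collapse species-level directions to one direction per genus.

    species_directions: dict mapping species name to "UP" or "DOWN".
    species_to_genus:   dict mapping species name to genus name.
    """
    by_genus = {}
    for species, direction in species_directions.items():
        by_genus.setdefault(species_to_genus[species], []).append(direction)
    return {genus: majority_vote(d) for genus, d in by_genus.items()}


# Gabonibacter follows the example in the background; "Genus X" is invented.
toy_directions = {
    "Gabonibacter massiliensis": "UP",
    "Gabonibacter timonensis": "UP",
    "Genus X species 1": "UP",
    "Genus X species 2": "DOWN",
}
toy_mapping = {
    "Gabonibacter massiliensis": "Gabonibacter",
    "Gabonibacter timonensis": "Gabonibacter",
    "Genus X species 1": "Genus X",
    "Genus X species 2": "Genus X",
}

genus_calls = harmonize_to_genus(toy_directions, toy_mapping)
```

Unlike ASR, a majority vote ignores branch lengths and tree topology below the target rank, so comparing the two on the same signatures would be one way to judge whether the added complexity of ASR is worthwhile.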