knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html )
Transcription factors (TFs) are proteins that bind to cis-regulatory elements
in promoter regions of genes and regulate their expression.
Identifying them in a genome is useful for a variety of reasons, such as
exploring their evolutionary history across clades and inferring
gene regulatory networks. r BiocStyle::Githubpkg("almeidasilvaf/planttfhunter")
allows users to identify plant TFs from whole-genome protein sequences and
classify them into families and subfamilies (when applicable) using the
classification scheme implemented in PlantTFDB.
As r BiocStyle::Githubpkg("almeidasilvaf/planttfhunter")
interoperates with
core Bioconductor packages (i.e., AAStringSet
objects as input,
SummarizedExperiment
objects as output), it can be easily incorporated in
pipelines for TF identification and classification in large-scale genomic
data sets.
You can install r BiocStyle::Githubpkg("almeidasilvaf/planttfhunter")
with the following code:
if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("planttfhunter")
Loading package after installation:
library(planttfhunter)
In this vignette, we will use protein sequences of TFs from the algae
species Galdieria sulphuraria as an example, as its proteome is
very small. The proteome file was
downloaded from the PLAZA Diatoms database [@osuna2020seminavis], and
it was filtered to keep only TFs for demonstration purposes.
The object gsu
stores the protein sequences in an AAStringSet
object.
data(gsu)
gsu
TF identification and classification is based on the presence of signature protein domains, which are identified using profile hidden Markov models (HMMs). The family classification scheme is the same as the one used by PlantTFDB [@jin2016planttfdb], and it is summarized below: [^1]
data(classification_scheme) knitr::kable(classification_scheme)
[^1]: Tip: You can access this classification scheme in your R session by
loading the data frame data(classification_scheme)
.
To identify TFs from protein sequence data, you will use the
function annotate_pfam()
. This function takes as input an AAStringSet
object [^2] and returns a data frame of protein domains associated with
each sequence. The HMMER program [@finn2011hmmer] is used to scan protein
sequences for the presence of DNA-binding protein domains, as well as
auxiliary and forbidden domains. Pre-built HMM profiles can be found in the
extdata/ directory of this package.
[^2]: Tip: If you have protein sequences in a FASTA file, you can read
them into an AAStringSet
object with the function readAAStringSet()
from
the r BiocStyle::Biocpkg("Biostrings")
package.
This is how you can run annotate_pfam()
[^3]:
[^3]: Note: in the code chunk below, the if statement is not required.
We just added it to make sure that the function annotate_pfam()
is only
executed if HMMER is installed, to avoid problems when building this
vignette in machines that do not have HMMER installed.
data(gsu_annotation) # Annotate TF-related domains using a local installation of HMMER if(hmmer_is_installed()) { gsu_annotation <- annotate_pfam(gsu) } # Take a look at the first few lines of the output head(gsu_annotation)
Now that we have our TF-related domains, we can classify TFs in families
with the function classify_tfs()
.
# Classify TFs into families gsu_families <- classify_tfs(gsu_annotation) # Take a look at the output head(gsu_families) # Count number of TFs per family table(gsu_families$Family)
If you want to get TF counts per family for multiple species, you can use
the function get_tf_counts()
. This function takes a list of AAStringSet
objects containing proteomes as input [^4], and it returns a SummarizedExperiment
object containing TF counts per family in each species, as well as species
metadata (optional). If you are not familiar with the SummarizedExperiment
class, you should consider checking the vignettes of the
r BiocStyle::Biocpkg("SummarizedExperiment")
Bioconductor package.
[^4]: Tip: If you have whole-genome protein sequences for multiple
species as FASTA files in a given directory, you can read them all as a list
of AAStringSet
objects with the function fasta2AAStringSetlist()
from
the Bioconductor package r BiocStyle::Biocpkg("syntenet")
.
To demonstrate how get_tf_counts()
works, we will simulate a
list of 4 AAStringSet
objects by sampling 50 random genes from the example
data set gsu
4 times.
set.seed(123) # for reproducibility # Simulate 4 different species by sampling 100 random genes from `gsu` proteomes <- list( Gsu1 = gsu[sample(names(gsu), 50, replace = FALSE)], Gsu2 = gsu[sample(names(gsu), 50, replace = FALSE)], Gsu3 = gsu[sample(names(gsu), 50, replace = FALSE)], Gsu4 = gsu[sample(names(gsu), 50, replace = FALSE)] ) proteomes
Great, we have a list of 4 AAStringSet
objects. Now, let's also create a
simulated species metadata data frame for each "species" (simulated).
# Create simulated species metadata species_metadata <- data.frame( row.names = names(proteomes), Division = "Rhodophyta", Origin = c("US", "Belgium", "China", "Brazil") ) species_metadata
You can add as many columns as you want to the species metadata data frame,
but make sure that species names are in row names, and
that names(proteomes)
match rownames(species)
,
otherwise get_tf_counts()
will return an error.
Now that we have a list of AAStringSet
objects and species metadata, we
can execute get_tf_counts()
. This function uses annotate_pfam()
under
the hood, so you also need to have HMMER installed and in your PATH to run it.
Here is how you can run it:
data(tf_counts) # Get TF counts per family in each species as a SummarizedExperiment object if(hmmer_is_installed()) { tf_counts <- get_tf_counts(proteomes, species_metadata) } # Take a look at the SummarizedExperiment object tf_counts # Look at the matrix of counts: assay() function from SummarizedExperiment SummarizedExperiment::assay(tf_counts) # Look at the species metadata: colData() function from SummarizedExperiment SummarizedExperiment::colData(tf_counts)
Cool, huh? In real-world analyses, once you have TF counts per family in
multiple species obtained with get_tf_counts()
,
you can try to find associations between TF counts and
eco-evolutionary aspects or traits of each species (e.g., higher frequencies of
a stress-related TF family in a species that inhabits a stressful environment).
This document was created under the following conditions:
sessioninfo::session_info()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.