knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL,
    dpi = 200
    ## https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
## Track time spent on making the vignette
startTime <- Sys.time()

## Bib setup
library("RefManageR")
library(BiocStyle)

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    knitr = citation("knitr")[1],
    RefManageR = citation("RefManageR")[1],
    rmarkdown = citation("rmarkdown")[1],
    sessioninfo = citation("sessioninfo")[1],
    testthat = citation("testthat")[1],
    Biostrings = citation("Biostrings")[1],
    dplyr = citation("dplyr")[1],
    taxize = citation("taxize")[1],
    tibble = citation("tibble")[1],
    tidyr = citation("tidyr")[1],
    httptest = citation("httptest")[1],
    taxizedb = citation("taxizedb")[1],
    tidyselect = citation("tidyselect")[1],
    BiocCheck = citation("BiocCheck")[1],
    jsonlite = citation("jsonlite")[1],
    jsonlite = citation("jsonlite")[1],
    Peptides = citation("Peptides")[1]
)

Basics

Install HMMERutils

R is an open-source statistical environment that can be easily modified to enhance its functionality via packages. r Biocpkg("HMMERutils") is an R package available via the Bioconductor repository for packages. R can be installed on any operating system from CRAN after which you can install r Biocpkg("HMMERutils") by using the following commands in your R session:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("HMMERutils")

## Check that you have a valid Bioconductor installation
BiocManager::valid()

Required knowledge

r Biocpkg("HMMERutils") is based on many other packages and in particular those that have implemented the infrastructure needed for dealing with protein sequences data. That is, packages like r Biocpkg("Biostrings") or r Biocpkg("Peptides").

If you are asking yourself the question "Where do I start using Bioconductor?" you might be interested in this post.

Asking for help

As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R and Bioconductor have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the HMMERutils tag and check the older posts. Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive the help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.

Citing HMMERutils

We hope that r Biocpkg("HMMERutils") will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!

## Citation info
# citation("HMMERutils")

Quick start using HMMERutils

The objective of this article is to show the main functionalities of the utilsHMMER library using a case study. We will search for homologous sequences for the whole SARS-CoV-2 reference proteome.

library("HMMERutils")
library("Biostrings")
library("dplyr")
library("ggplot2")

Download the SARS-CoV-2 reference proteome

First, let's read the reference proteome for the Sars_CoV-2. We will read the compressed file directly from UniProt with the readAAStringSet function from the r Biocpkg("Biostrings") library.

uniprot.url <- "https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/pan_proteomes/UP000464024.fasta.gz"

sars.cov.fasta <- readAAStringSet(uniprot.url)

Preprocess sequences and select case studies

Now, we have an AAStringSet stored in the variable sars.cov.fasta and we can access the sequences it contains using the $ operator and the sequence header. However, headers usually contain, as in this case, a lot of information that is not useful right now such as descriptions.

names(sars.cov.fasta)

HMMERutils has a very convenient function for this purpose which is parse_fasta_headers(). Note that we use the name function to access the sequence names and the %>% operator to pass the result of one function/argument to the other in the sequence

(
  names(sars.cov.fasta) <- names(sars.cov.fasta) %>%
  parse_fasta_headers()
)

We can take a look at the contents of the fasta file by "showing" it (calling the show is done implicitly):

sars.cov.fasta

Now, we are going to select a series of proteins that are of interest to us for this case study:

VEMP_SARS <- as.character(sars.cov.fasta$VEMP_SARS2) #Envelope small membrane protein
NS8A_SARS <- as.character(sars.cov.fasta$NS8_SARS2) #ORF8 protein
NS7B_SARS <- as.character(sars.cov.fasta$NS7B_SARS) #Protein non-structural 7b
R1AB_SARS2 <- as.character(sars.cov.fasta$R1AB_SARS2) #Replicase polyprotein 1ab OS=Severe acute respirator

All these proteins have in common being small in size (it will speed up some computational times) and represent different possibilities: we have from membrane proteins (which we would expect to show homology with other membrane proteins), proteins similar to each other (NS8A_SARS and NS8B_SARS), non-structural proteins, and uncharacterized proteins.

Proteins homologous to VEMP_SARS using PHMMER

The most frequently used algorithm of HMMER is phmmer. This algorithm allows us to search for homologous sequences using a protein sequence as a target against protein sequence databases.

We can post a query to the HMMER web server and read the data into a user-friendly DataFrame by using the "search_" family of functions that HMMERutils provides.

Let's start by searching for proteins homologous to VEMP_SARS (a envelope small membrane protein) against the Swissprot database

hmmer.VEMP_SARS <- search_phmmer(
  seq = VEMP_SARS, seqdb = "swissprot", verbose = FALSE
  )

Let's inspect the content of the new variable we have created. We have two types of variables: the first 21 columns give us information about the query itself (such as the algorithm or the unique identifier), while the rest give us information about the content of the query.

glimpse(hmmer.VEMP_SARS[,1:21])

As you can see, all rows have the same values for all the previous variables. That is a deliberate decision: each row represents one sequence hit that we get by doing just one query.

Most of the parameters contained in stats are for internal use and we are only interested in a small part of them. For example, by studying the table above, we can know that HMMER found 43 hits (nhits) out of a total of 565928 sequences available (Z) in the SwissProt and that a single HMM, Hidden Markov Model, (nmodels) has been used in the search (built from our input sequence since we are using PHMMER algorithm). In case you are interested in learning more about how the results data are structured, we invite you to read the appendix of the HMMER web server.

Let's explore the rest of the Data Frame:

glimpse(hmmer.VEMP_SARS[,22:47])

To get a detailed description of each variable, please read the documentation of ?phmmer_2abl (our example data set).

?phmmer_2abl

Data preprocessing

E-values

First, we'll begin by studying the e-values obtained. There are several types of e-values, some referring to the sequence and others referring to the domains. But what do we mean by domains? According to the HMMER manual:

A target sequence that reaches this point is very likely to contain one or more significant matches to the profile. These matches are referred to as "domains", since the main use of HMMER has historically been to match profile HMMs from protein domain databases like Pfam, and one of HMMER's strengths is to be able to cleanly parse a multidomain target sequence into its multiple nonoverlapping hits to the same domain model.

It is of special interest to study carefully the domain E-value when it happens that, although the full sequence E-value is good, none of the domain E-value reaches the significance threshold. In that case, what is happening is that those weak hits, none of which is good enough on its own, add up to raise the sequence to a high score. This may happen because the sequence contains several weak homologous domains, or it may contain a repetitive sequence that is hitting by chance.

HMMERutils has a function to display all the E-values of the sequences. This function hmmer_evalues_cleveland_dot_plot. In this plot, the $−log(\text{E−value})$ is represented on the $x$ axis, and on the $y$ axis each of the identified sequences is represented one at a time. The green dots represent the E-values of the full sequence and the red dots the E-values of each of the domains.

hmmer_evalues_cleveland_dot_plot(
  hmmer.VEMP_SARS, threshold = 10^-5
  )

By looking at it, you can see at a glance, first, how many domains have been identified for each sequence, and then how these contribute to the E-value of the full sequence and how significant they are relative to the full sequence. In this case, as can be seen, only one domain per sequence has been found (probably because they are very small).

As we didn't identify any "red-flag", let's filter our data taking into account only the hits sequence E-value. We can use the HMMERutils function filter_hmmer() to achieve this.

(
 hmmer.VEMP_SARS <- filter_hmmer(hmmer.VEMP_SARS,threshold = 10^-5)
)

Adding protein sequences

To download the sequences from HMMER and add them to our DataFrame, you can use the add_sequences_to_hmmer_tbl(). Notice that sequences associated with the UUID will be available in HMMER only temporarily.

(
  hmmer.VEMP_SARS <- add_sequences_to_hmmer_tbl(hmmer.VEMP_SARS)
)

By having the sequences as a column in our Data Frane, now we can preprocess it using Tidyverse "verbs". This is how we could filter the sequences to keep only non-redundant sequences.

(
hmmer.VEMP_SARS <- hmmer.VEMP_SARS %>%
  distinct(hits.fullfasta, .keep_all = TRUE)  
)

Data Exploration

Taxonomic distribution

Now we can explore the nature of these sequences. One aspect of great interest is the taxonomic distribution of these sequences. We can consult it because we have the hits.taxid parameter, which identifies taxonomic species. We proceed to taxonomically annotate the results using the add_taxa_to_HMMER_tbl function

hmmer.VEMP_SARS <- add_taxa_to_hmmer_tbl(hmmer.VEMP_SARS)

Now, we can use all the meta library of tidyverse and ggplot2 to explore the data and make graphs. Let's see an example of what could be done:

ggplot(
  data =  hmmer.VEMP_SARS,
  mapping = aes(x = hits.evalue, fill = taxa.subgenus)
  ) + 
  geom_histogram(binwidth = 10^-8)

Calculating the identity of pairwise alignments

To examine the diversity of the sequences we can align all the sequences in pairs and then calculate the percentages of identity for each one using the pairwise_alignment_sequence_identity function.

When calling this function, we need to specify which sequences we want to compare (as a vector of characters), the type of alignment, and the type of sequence identity we are going to use. In this case, we will use the default options: aln_type = "global" (align whole strings with end gap penalties by using the Needleman-Wunsch algorithm) and pid_type = "PID1" ( PID1=100⋅identical position aligned positions + internal gap positions).

percentage_identities <- pairwise_alignment_sequence_identity(
  seqs = hmmer.VEMP_SARS$hits.fullfasta,
  aln_type = "global",
  pid_type = "PID1"

)

Now, we can inspect the distribution of those values in a histogram:

pairwise_sequence_identity_histogram(percentage_identities)

And, to compare this information with other parameters, a heatmap is. Note that we use the taxa.subgenus column to annotate the heatmap, for which we use the annotation argument and the $ operator.

pairwise_sequence_identity_heatmap(
  percentage_identities, annotation = hmmer.VEMP_SARS$taxa.subgenus
  )

Physicochemical properties

We can study what the sequences we have obtained are like in more detail by calculating physicochemical properties based on their primary sequence. To do this, we can use the add_physicochemical_properties_to_HMMER_tbl function

hmmer.VEMP_SARS <- hmmer.VEMP_SARS %>% 
  add_physicochemical_properties_to_HMMER_tbl()

Let's see how are these new properties distributed along our sequences:

hmmer.VEMP_SARS %>%
  ggplot(
    aes(x = taxa.subgenus, y = properties.molecular.weight, fill = taxa.subgenus)
    )+
  geom_boxplot()

Proteins homologous to NS8A_SARSin SwissProt and PDB using PHMMER

So far, we have been working with the results of a single search. This case study has allowed us to describe in depth most of the utilities of this library. However, the most common scenario is that we want to work with the results of several searches. Let's try performing a search with PHMMER for the sequence of NS8A_SARS against SwissProt and PDB.

For this example, moreover, we will not go step by step but will show how HMMERutils can perform all of the above operations in a few lines of code. You can refer to the information regarding the search_phmmer function using ?search_phmmer for a complete listing of options.

(
  hmmer.NS8A_SARS <- search_phmmer(
    seq = NS8A_SARS, 
    seqdb = c("pdb", "swissprot"), 
    verbose = FALSE
  ) 
)

Note that, now, the number of searches performed is 2:

unique(hmmer.NS8A_SARS$uuid)

Now, using the pipe operator, we are going to filter the sequences, read the sequences, annotate them taxonomically, and calculate the physicochemical properties of the sequences.

hmmer.NS8A_SARS <- hmmer.NS8A_SARS %>%
  filter_hmmer() %>%
  add_sequences_to_hmmer_tbl() %>%
  add_taxa_to_hmmer_tbl(mode = "local") %>%
  add_physicochemical_properties_to_HMMER_tbl()

Extracting nested hashes

As we have searched against PDB, now we have a new column, which is called hits.pdbs. To explore the PDB files available for homologous sequences we can "extract" those nested rows.

pdbs <- extract_from_hmmer(hmmer.NS8A_SARS, column = "hits.pdbs")
unique(pdbs$pdbs)

We can do the same to study the domains (default behavior):

domains <- extract_from_hmmer(hmmer.NS8A_SARS)
unique(domains$domains.alisqdesc)

Using hmmsearch

So far, we have only used PHMMER, since it is the most used and default algorithm. However, HMMERutils allows us to work with all HMMER algorithms. Let's see how it can be used to perform a search using hmmsearch. This algorithm performs a sequence search in the same databases as PHMMER with the difference that, instead of building a hidden Markov model with a single sequence, it does it with an alignment.

For this example, we will use a fake alignment (that's why we are creating it from a vector, instead of reading an actual file, using the readAAMultipleAlignment function.

alignment <- AAMultipleAlignment(c(
  "FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF",
  "KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEK",
  "EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVE",
  "EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL",
  "SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTT",
  "FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKL",
  "SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNN"
  )
)

Now, let's perform a query using hmmsearch:

fake_data <- search_hmmsearch(alignment)

Using HMMSCAN

Another HMMER algorithm that can be used through HMMERutils is HMMSCAN. This algorithm performs a search of a protein sequence against an HMM database and can be used to identify domains in our sequence. Let's see which domains we can find in VEMP_SARS:

(
  scan.VEMP_SARS <- search_hmmscan(VEMP_SARS)
)
scan.VEMP_SARS$hits.acc

Reproducibility

The r Biocpkg("HMMERutils") package r #Citep(bib[["HMMERutils"]]) was made possible thanks to:

This package was developed using r BiocStyle::Biocpkg("biocthis").

Code for creating the vignette

## Create the vignette
library("rmarkdown")
system.time(render("HMMERutils.Rmd", "BiocStyle::html_document"))

## Extract the R code
library("knitr")
knit("HMMERutils.Rmd", tangle = TRUE)

Date the vignette was generated.

## Date the vignette was generated
Sys.time()

Wallclock time spent generating the vignette.

## Processing time in seconds
totalTime <- diff(c(startTime, Sys.time()))
round(totalTime, digits = 3)

R session information.

## Session info
library("sessioninfo")
options(width = 120)
session_info()

Bibliography

This vignette was generated using r Biocpkg("BiocStyle") r Citep(bib[["BiocStyle"]]) with r CRANpkg("knitr") r Citep(bib[["knitr"]]) and r CRANpkg("rmarkdown") r Citep(bib[["rmarkdown"]]) running behind the scenes.

Citations made with r CRANpkg("RefManageR") r Citep(bib[["RefManageR"]]).

## Print bibliography
PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))


currocam/HMMERutils documentation built on Feb. 15, 2023, 8:41 p.m.