In mirnavazquez/RbiMs: R Tools for Reconstructing Bin Metabolisms

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width=7, fig.height=7
)

First, load the rbims package.

library(rbims)

Example with PFAM database

First, I will read the InterProScan output in a long format and extract the PFAM abundance information.

If you want to follow this example, you can download the rbims test file.

interpro_pfam_long <- read_interpro(data_interpro = "../inst/extdata/Interpro_test.tsv", database = "Pfam", profile = FALSE)
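
To make the difference between a long table and a profile concrete, here is a minimal base R sketch on a made-up table. The column names (Bin_name, Pfam, Abundance) and the values are assumptions for illustration and may not match the exact columns that read_interpro returns.

```r
# Toy long-format table: one row per (bin, domain) pair.
# Column names and values are illustrative assumptions.
long_tbl <- data.frame(
  Bin_name  = c("Bin_1", "Bin_1", "Bin_2", "Bin_3"),
  Pfam      = c("PF00001", "PF00002", "PF00001", "PF00002"),
  Abundance = c(3, 1, 2, 5)
)

# Wide "profile" layout: one row per domain, one column per bin.
# Pairs that never occur in the long table are filled with 0.
profile_tbl <- xtabs(Abundance ~ Pfam + Bin_name, data = long_tbl)

print(profile_tbl)
```

The long layout is convenient for filtering and plotting, while the wide layout is the usual input for matrix-based methods such as PCA or clustering.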

You can use the subsetting functions to create subsets of the InterPro table. Here, we will extract the most important PFAMs. Note that the input is the long-format table, not the profile output from read_interpro.

The function get_subset_pca runs a PCA on the data to find the PFAMs that explain most of the variation within the data.

important_PFAMs <- get_subset_pca(tibble_rbims = interpro_pfam_long,
                                  cos2_val = 0.95,
                                  analysis = "PFAM")

head(important_PFAMs)
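
As a rough illustration of selecting features by a cos2 threshold, the base R sketch below runs a PCA on a toy matrix and keeps the rows that are well represented on the first two components. This is not the internal code of get_subset_pca: the toy IDs, the counts, the choice of domains as observations, and the use of two components are all assumptions of the sketch.

```r
# Toy abundance matrix: rows are domains, columns are bins.
# IDs and counts are placeholders, not rbims test data.
abund <- matrix(
  c(5, 0, 1, 0,
    4, 1, 0, 0,
    0, 3, 0, 2,
    0, 4, 1, 3,
    1, 1, 1, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("PF0000", 1:5), paste0("Bin_", 1:4))
)

# PCA with domains as observations (an assumption of this sketch).
pca <- prcomp(abund, center = TRUE, scale. = FALSE)

# cos2 = share of each domain's squared distance to the origin
# that is captured by the first two principal components.
scores <- pca$x
cos2 <- rowSums(scores[, 1:2]^2) / rowSums(scores^2)

# Keep the domains whose cos2 reaches the chosen threshold.
cos2_val <- 0.95
selected <- names(cos2)[cos2 >= cos2_val]
subset_tbl <- abund[selected, , drop = FALSE]

print(round(cos2, 3))
```

A higher threshold keeps fewer rows, namely those whose variation is captured almost entirely by the leading components.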

The distance argument

Let's plot the results.

plot_heatmap can help explore the results. It supports two types of analysis. If we set the distance option to TRUE, the heatmap shows how the samples cluster based on their protein domains.

plot_heatmap(important_PFAMs, y_axis = PFAM, analysis = "INTERPRO", distance = TRUE)

If we set it to FALSE, the heatmap shows the presence and absence of the domains across the genome samples.

plot_heatmap(important_PFAMs, y_axis = PFAM, analysis = "INTERPRO", distance = FALSE)
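
The two views can be sketched in base R on a toy presence/absence matrix. The IDs and values are made up, and the "binary" (Jaccard-style) distance and average linkage are assumptions of this sketch, not necessarily what plot_heatmap uses internally.

```r
# Toy presence/absence matrix: rows are domains, columns are bins.
presence <- matrix(
  c(1, 0, 1,
    1, 1, 0,
    0, 1, 0,
    1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("PF0000", 1:4), paste0("Bin_", 1:3))
)

# Presence/absence view: the matrix itself, domain by bin.
print(presence)

# Distance view: pairwise dissimilarity between bins.
# The "binary" method is an assumption of this sketch.
bin_dist <- dist(t(presence), method = "binary")
bin_clust <- hclust(bin_dist, method = "average")

print(round(as.matrix(bin_dist), 2))
```

The first view answers which domains occur in which bin. The second summarizes each pair of bins with a single number, which is what drives the clustering.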


We can also visualize using a bubble plot.

plot_bubble(important_PFAMs, 
            y_axis=PFAM, 
            x_axis=Bin_name, 
            calc = "Binary",
            analysis = "INTERPRO", 
            data_experiment = metadata, 
            color_character = Clades)
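
A bubble plot needs one row per (domain, bin) pair plus a grouping variable for the color. The base R sketch below shows that data preparation on toy inputs. The table contents are made up, metadata_toy is a hypothetical stand-in for the metadata object, and reading "Binary" as presence or absence is an assumption.

```r
# Toy wide table (domains x bins) and toy metadata; all values are made up.
abund <- matrix(
  c(2, 0,
    0, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(Pfam = c("PF00001", "PF00002"),
                  Bin_name = c("Bin_1", "Bin_2"))
)
metadata_toy <- data.frame(
  Bin_name = c("Bin_1", "Bin_2"),
  Clades   = c("Clade_A", "Clade_B")
)

# One row per (domain, bin) pair, as a bubble plot needs.
long_tbl <- as.data.frame(as.table(abund),
                          responseName = "Abundance",
                          stringsAsFactors = FALSE)

# "Binary" is read here as presence (1) or absence (0).
long_tbl$Present <- as.integer(long_tbl$Abundance > 0)

# Attach the grouping variable used to color the bubbles.
plot_tbl <- merge(long_tbl, metadata_toy, by = "Bin_name")

print(plot_tbl)
```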

Example with INTERPRO database

First, I will read the InterProScan output in a wide format and extract the INTERPRO abundance information.

interpro_INTERPRO_profile <- read_interpro(data_interpro = "Interpro_test.tsv", database = "INTERPRO", profile = FALSE)

head(interpro_INTERPRO_profile)

We are going to look for the InterPro IDs that make up DNA topoisomerase 1. To do this, we will create a vector of the IDs associated with that enzyme.

DNA_topoisomerase_1<-c("IPR013497", "IPR023406", "IPR013824")

With the function get_subset_pathway we can create a subset of the INTERPRO table.

DNA_tipo_INTERPRO<-get_subset_pathway(interpro_INTERPRO_profile, type_of_interest_feature=INTERPRO,
                   interest_feature=DNA_topoisomerase_1)

head(DNA_tipo_INTERPRO)
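
Conceptually, this kind of subsetting keeps the rows whose ID appears in a vector of interest. A minimal base R sketch on a toy table follows. The table, the counts, and the extra ID "IPR000000" are placeholders for illustration, and the sketch does not reproduce the internals of get_subset_pathway.

```r
# IDs of interest, matching the vector defined in the text.
ids_of_interest <- c("IPR013497", "IPR023406", "IPR013824")

# Toy table; the fourth ID and all counts are placeholders.
toy_tbl <- data.frame(
  INTERPRO = c("IPR013497", "IPR023406", "IPR013824", "IPR000000"),
  Bin_1    = c(1, 0, 2, 4),
  Bin_2    = c(0, 1, 1, 3)
)

# Keep only the rows whose ID is in the vector of interest.
subset_tbl <- toy_tbl[toy_tbl$INTERPRO %in% ids_of_interest, ]

print(subset_tbl)
```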

We can create a bubble plot to visualize the distribution of these enzymes across the bins.

plot_bubble(DNA_tipo_INTERPRO, 
            y_axis=INTERPRO,
            x_axis=Bin_name,
            calc = "Binary",
            analysis = "INTERPRO", 
            data_experiment = metadata, 
            color_character = Sample_site)

Example with KEGG database

First, I will read the InterProScan output in a long format and extract the KEGG information. When you use the KEGG option, the profile option is disabled.

interpro_KEGG_long<-read_interpro(data_interpro = "Interpro_test.tsv", database="KEGG")

head(interpro_KEGG_long)

Mapping INTERPRO to KEGG database

We can use the mapping_ko function here to get the extended KEGG table.

interpro_map<-mapping_ko(tibble_interpro = interpro_KEGG_long)

head(interpro_map)
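
The idea of extending a KO table with annotations can be sketched as a left join against a lookup table. Everything below is a toy: the KO counts, the lookup table ko_lookup, and the module labels are hypothetical, and mapping_ko may attach different or additional columns.

```r
# Toy KO hits per bin; IDs and counts are placeholders.
ko_hits <- data.frame(
  KO    = c("K00001", "K00002", "K00003"),
  Bin_1 = c(1, 0, 2),
  Bin_2 = c(0, 3, 1)
)

# Hypothetical lookup table linking each KO to a module label.
ko_lookup <- data.frame(
  KO     = c("K00001", "K00002"),
  Module = c("Module_A", "Module_B")
)

# Left join: keep every KO, with NA where no annotation is known.
ko_extended <- merge(ko_hits, ko_lookup, by = "KO", all.x = TRUE)

print(ko_extended)
```

Keeping unannotated rows (all.x = TRUE) makes it easy to see which KOs had no match in the lookup table.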

We can plot all the KOs and the Modules to which they belong. Note that we set analysis = "KEGG" even though this workflow started from InterProScan output.

plot_heatmap(tibble_ko=interpro_map,
             data_experiment = metadata,
             y_axis=KO,
             order_y = Module,
             order_x = Sample_site,
             split_y = TRUE,
             analysis = "KEGG",
             calc="Percentage")