Biokit is a unified wrapper package of functions and utilities for the analysis of transcriptomic and proteomic data. It contains several functions for basic exploratory analysis of omics data tables, together with tools to normalize the data and perform differential analysis between the sample groups of interest. In addition, biokit can run a functional analysis of the results to reduce their dimensionality and increase their interpretability, using over-representation analysis (ORA) or functional class scoring (FCS) approaches.
The package can be installed through the install_github() function from devtools.
devtools::install_github(repo = "https://github.com/martingarridorc/biokit")
library(biokit)
data("sarsCovData")
data("humanHallmarks")
# set output directory for images
knitr::opts_chunk$set(
  fig.path = "man/figures/"
)
To illustrate the capabilities and features of biokit, we will apply it to a transcriptomic dataset obtained from the following GEO entry. This dataset contains a subset of the samples analyzed in this project, where the response of different human cell lines to SARS-CoV-2 infection is evaluated using RNA-Seq. The starting materials for the analysis are:
The RNA-Seq counts table
head(sarsCovMat[, 1:3])
A data frame containing the sample group information
head(sarsCovSampInfo)
And a list containing the MSigDB functional categories, which we will use for the functional analysis
lapply(humanHallmarks[1:10], head)
As a first step, we can explore the per-sample value distribution of the raw counts table. We can then filter the count matrix, keeping only the genes whose total count across samples reaches a minimum cutoff (15 in this example), and normalize the filtered matrix with the edgeR TMM approach, using the countsToTmm() function from biokit.
biokit::violinPlot(sarsCovMat)
sarsCovMat <- sarsCovMat[rowSums(sarsCovMat) >= 15, ]
tmmMat <- countsToTmm(sarsCovMat)
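To make the filtering step concrete, here is a minimal base R sketch on a made-up count matrix. The toy values and the names `counts`, `minCounts`, `filtered` and `logCpm` are illustrative assumptions, not part of biokit. The scaling at the end is plain library-size normalization to log2 counts per million, shown only as a simpler stand-in; it is not the TMM method that countsToTmm() applies.

```r
# Toy count matrix: 4 genes (rows) x 3 samples (columns); values are made up.
counts <- matrix(
  c(  0,   1,  2,
     10,   3,  4,
    100, 120, 90,
      5,   5,  5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Keep genes whose total count across all samples is at least the cutoff.
minCounts <- 15
keep <- rowSums(counts) >= minCounts
filtered <- counts[keep, , drop = FALSE]

# Library-size scaling to log2 counts per million.
# NOTE: a simpler stand-in for illustration, not TMM normalization.
libSizes <- colSums(filtered)
logCpm <- log2(t(t(filtered) / libSizes) * 1e6 + 1)

print(rownames(filtered))
print(round(logCpm, 2))
```

Only the row-sum filter mirrors the tutorial code directly; gene1 is dropped because its counts add up to 3.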
Next, we can explore the new per-sample value distribution, again using the violin plot function.
biokit::violinPlot(tmmMat)
As a second exploratory step, we can apply Principal Component Analysis (PCA) to reduce the dimensionality of the dataset and explore how the groups are distributed in the two-dimensional space formed by the first two principal components.
biokit::pcaPlot(mat = tmmMat, sampInfo = sarsCovSampInfo, groupCol = "group")
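For readers who want to see the underlying computation, the following is a minimal sketch of a sample-level PCA with base R's prcomp() on simulated data. How pcaPlot() computes its components internally is not described here, so this should be read as an illustration of the general technique; the objects `expr`, `sampInfo` and `scores` are hypothetical.

```r
set.seed(1)

# Simulated expression matrix: 50 genes (rows) x 6 samples (columns).
expr <- matrix(
  rnorm(50 * 6), nrow = 50,
  dimnames = list(paste0("gene", 1:50), paste0("sample", 1:6))
)

# Hypothetical sample annotation with a "group" column.
sampInfo <- data.frame(
  id = colnames(expr),
  group = rep(c("control", "infected"), each = 3)
)

# prcomp() expects observations in rows, so transpose to samples x genes.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Percentage of variance explained by each principal component.
varExplained <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# Coordinates of each sample on the first two components, plus its group.
scores <- data.frame(sampInfo, PC1 = pca$x[, 1], PC2 = pca$x[, 2])

print(scores)
print(round(varExplained[1:2], 1))
```

The `scores` data frame can then be drawn as a scatter plot of PC1 against PC2, colored by group.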
Next, we can explore the 25 genes with the highest standard deviation across the entire dataset, displaying and clustering them in a heatmap.
heatmapPlot(mat = tmmMat, sampInfo = sarsCovSampInfo, groupCol = "group", scaleBy = "row", nTop = 25)
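The selection and scaling behind such a heatmap can be sketched in base R as follows. This is a hedged illustration on simulated data, assuming that "top genes" means the rows with the largest standard deviation and that row scaling means per-gene z-scores; the variable names are hypothetical and the exact behavior of heatmapPlot() may differ.

```r
set.seed(42)

# Simulated expression matrix: 100 genes (rows) x 6 samples (columns).
expr <- matrix(
  rnorm(100 * 6), nrow = 100,
  dimnames = list(paste0("gene", 1:100), paste0("sample", 1:6))
)

# Rank genes by their standard deviation across samples and keep the top 25.
nTop <- 25
geneSd <- apply(expr, 1, sd)
topGenes <- names(sort(geneSd, decreasing = TRUE))[seq_len(nTop)]
topMat <- expr[topGenes, ]

# Row-wise z-scores: each gene is centered to mean 0 and scaled to sd 1.
scaled <- t(scale(t(topMat)))

# Hierarchical clustering gives the row order a clustered heatmap would use.
rowOrder <- hclust(dist(scaled))$order
print(head(rownames(scaled)[rowOrder]))
```

Passing `scaled` to a heatmap function such as base R's heatmap() with `scale = "none"` would then draw the clustered matrix.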
Once we have evaluated the distribution of the sample groups and of the most variable genes with this basic exploratory analysis, we can perform a differential expression analysis between the sample groups of interest using the linear models included in the limma package. The volcanoPlot() function can then be used to obtain a broad overview of the results for each of the comparisons carried out.
diffRes <- biokit::autoLimmaComparison(mat = tmmMat, sampInfo = sarsCovSampInfo, groupCol = "group")
biokit::volcanoPlot(diffRes)
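To give a feel for what a two-group comparison produces, here is a small base R sketch on simulated data. It uses a plain per-gene Welch t-test, which is deliberately swapped in for limma's moderated statistics so that the example runs without extra packages; it is not what autoLimmaComparison() does. Apart from `logFc` and `comparison`, which appear later in this tutorial, the column names (`pValue`, `pAdj`) are assumptions made for illustration.

```r
set.seed(7)

# Simulated log-expression matrix: 200 genes x 6 samples (values are made up).
expr <- matrix(
  rnorm(200 * 6, mean = 8), nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("sample", 1:6))
)
group <- factor(rep(c("control", "infected"), each = 3))

# Shift the first 10 genes upwards in the infected group to create signal.
expr[1:10, group == "infected"] <- expr[1:10, group == "infected"] + 3

# Per-gene Welch t-test: a stand-in for limma's moderated t-statistic.
testGene <- function(x) {
  tt <- t.test(x[group == "infected"], x[group == "control"])
  c(logFc = unname(tt$estimate[1] - tt$estimate[2]), pValue = tt$p.value)
}
res <- as.data.frame(t(apply(expr, 1, testGene)))

# Benjamini-Hochberg correction for multiple testing.
res$pAdj <- p.adjust(res$pValue, method = "BH")
res$comparison <- "infected-control"

# Most significant genes first.
res <- res[order(res$pValue), ]
print(head(res))
```

A volcano plot of such a table places `logFc` on the x axis and `-log10(pValue)` on the y axis. With only three samples per group, an unmoderated t-test has little power, which is one reason limma's variance moderation is generally preferred for this kind of data.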
As a final step, we can perform the functional analysis for each comparison using the GSEA approach and visualize the significant results using the gseaPlot() function.
gseaResults <- gseaFromStats(df = diffRes, funCatList = humanHallmarks, rankCol = "logFc", splitCol = "comparison")
gseaPlot(gseaResults)
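The rankCol and splitCol arguments suggest that the results table is split by comparison and ranked by log fold change before enrichment. A minimal base R sketch of that preparation step is shown below, under stated assumptions: the table `diffTable`, its values, and the `feature` column holding gene identifiers are all hypothetical, and the internals of gseaFromStats() may differ.

```r
# Hypothetical differential-expression table with two comparisons.
# Values and the "feature" column name are made up for illustration.
diffTable <- data.frame(
  feature = rep(c("geneA", "geneB", "geneC"), times = 2),
  logFc = c(2.1, -0.4, 0.9, -1.5, 0.3, 1.2),
  comparison = rep(c("infected-control", "treated-control"), each = 3)
)

# Build one named vector of logFc values per comparison,
# sorted from most up-regulated to most down-regulated.
rankedLists <- lapply(split(diffTable, diffTable$comparison), function(d) {
  sort(setNames(d$logFc, d$feature), decreasing = TRUE)
})

print(rankedLists)
```

A named, sorted numeric vector per comparison is the typical input shape for preranked GSEA tools, which test each gene set against it.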
Session information
sessionInfo()