knitr::opts_chunk$set(echo = TRUE)
Pathway Integrated Regression-based Kernel Association Test (PaIRKAT) is a model framework for assessing statistical relationships between networks and some outcome of interest while adjusting for potential confounders and covariates.
The PaIRKAT method is motivated by the analysis of networks of metabolites from a metabolomics assay and the relationship of those networks with a phenotype or clinical outcome of interest, though the method can be generalized to other domains.
PaIRKAT queries the KEGG database to determine interactions between metabolites from which network connectivity is constructed. This model framework improves testing power on high dimensional data by including graph topography in the kernel machine regression setting. Studies on high dimensional data can struggle to include the complex relationships between variables. The semi-parametric kernel machine regression model is a powerful tool for capturing these types of relationships. They provide a framework for testing for relationships between outcomes of interest and high dimensional data such as metabolomic, genomic, or proteomic pathways. PaIRKAT uses known biological connections between high dimensional variables by representing them as edges of ‘graphs’ or ‘networks.’ It is common for nodes (e.g. metabolites) to be disconnected from all others within the graph, which leads to meaningful decreases in testing power whether or not the graph information is included. We include a graph regularization or ‘smoothing’ approach for managing this issue.
# install from bioconductor if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("pairkat")
PaIRKAT comes with an example data set to use as a reference when formatting data inputs to its functions. This synthetic dataset called "smokers" contains a set of human subjects with phenotype variables related to lung health among smokers. Subjects have associated metabolomics assay data that are linked to KEGG pathway database IDs.
# load pairkat library library(pairkat) data("smokers")
Data need to be added to a SummarizedExperiment class object before they can be analyzed with PaIRKAT. There are three components of the SummarizedExperiment that need to be added:
Phenotype Data: Contains outcomes of interest and any meaningful covariates to be adjusted for. Rows should be subjects and columns should be variables.
Pathway Data: Maps measured metabolites names to KEGG IDs in order to query the KEGG database. Rows are metabolite names and one column should contain KEGG IDs.
Metabolite Assay: Contains measurements of metabolites for all subjects. Rows should be subjects and columns should be metabolite names. One column should have subject IDs matching subject IDs in clinical data.
We will walk through each of these data and discuss their formats.
Phenotype data should be formatted such that subject ID's are row names and variables of interest are column names. Categorical variables should be converted to dummy variables (i.e. one-hot-encoded).
In our example, we will be working with a data set containing measures related to lung health among smokers. Variables in this data set include:
log_FEV1_FVC_ratio
: [log(FEV1/FVC)] A measure of lung capacity.
The FEV1/FVC ratio is the ratio of the forced expiratory volume in the first
one second to the forced vital capacity of the lungs. This measure has been
log transformed.
high_log_FEV1_FVC_ratio_binary
: A binarized transformation of
log_FEV1_FVC_ratio
which indicates whether log_FEV1_FVC_ratio
is "high"
with a 1, otherwise 0.
race_binary
: A binary indicator of race.
age
: Age in years
bmi
: Body Mass Index
smoking_status
: Binary indicator of whether a subject is a current smoker
(1) or not (0)
pack_years
: A way to measure the amount a person has smoked over a long
period of time. It is calculated by multiplying the number of packs of
cigarettes smoked per day by the number of years the person has smoked.
The example data are already packaged as a SummarizedExperiment. We will extract each of the components to look at their structures. Phenotype data are saved in the colData partition of the SummarizedExperiment can can be extracted as follows:
phenotype <- colData(smokers) head(phenotype)
Pathway data primarily functions to link metabolite names with KEGG ID's. Row names should be metabolite names and one column should contain KEGG IDs. Other columns may be present if desired.
Pathway data are saved in the rowData partition of the SummarizedExperiment can can be extracted as follows:
pathways <- rowData(smokers) head(pathways)[, 1:2]
Metabolite assay data should have metabolite names as row names (matching pathway data names) and subject IDs as column names.
Metabolite Assay data are saved in the assays partition of the SummarizedExperiment. Multiple assays can be stored in the SummarizedExperiment object. PaIRKAT relies on the assay to be named "metabolomics". The assay data can can be extracted as follows:
metabolome <- assays(smokers)$metabolomics head(metabolome)[, 1:3]
Once the three components are formatted correctly, it is straightforward to add these dataframes to the SummarizedExperiment object with the SummarizedExperiment function.
In this example, we re-save our SummarizedExperiment data to an object
called smokers
smokers <- SummarizedExperiment( assays = list(metabolomics = metabolome), rowData = pathways, colData = phenotype )
GatherNetworks queries the KEGG API to discover molecular interactions and build a network graph between measured metabolites.
GatherNetworks takes 4 arguments
SE
- a SummarizedExperiment object
keggID
- a string of the column name in pathway data containing KEGG IDs
species
- a three letter species IDs can be found by
running keggList("organism")
minPathwaySize
- this argument filters KEGG pathways that contain fewer
metabolites than the number specified
Our data were gathered from humans, so we will use the three letter species code "hsa".
head(keggList("organism"))[, 2:3]
networks <- GatherNetworks( SE = smokers, keggID = "kegg_id", species = "hsa", minPathwaySize = 5 )
PaIRKAT can be used to get a kernel score for all pathways found in GatherNetworks. PaIRKAT takes 3 arguments:
formula.H0
- The null model in the "formula" format used in lm and glm
functions.
networks
- networks object obtained with GatherNetworks
tau
- A parameter to control the amount of smoothing, analogous to a
bandwidth parameter in kernel smoothing. We found 1 often gave reasonable
results, as over-smoothing can lead to inflated Type I errors.
# run PaIRKAT output <- PaIRKAT(log_FEV1_FVC_ratio ~ age, networks = networks)
The formula call and results of the kernel test can be viewed by calling items from the PaIRKAT function output. In our example, the first result can be interpreted as follows:
Histidine metabolism has a significant relationship with log_FEV1_FVC_ratio when controlling for age (p = 0.0038). Similarly, Arginine biosynthesis has a significant relationship with log_FEV1_FVC_ratio when controlling for age (p = 0.0070). etc.
library(dplyr) # view formula call output$call # view results output$results %>% arrange(p.value) %>% head()
To look further into significant metabolic pathways, it is possible to plot pathway networks using the plotNetworks function and passing the networks object along with one of the pathway names as a string. To visualize all networks, pass "all" to the pathways argument.
plotNetworks(networks = networks, pathway = "Glycerophospholipid metabolism")
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.