This vignette gives an introduction to network-regularized generalized linear models (GLMs). Network regularization uses graphs to model interactions between covariables and between responses, and penalizes the regression coefficients of a GLM using this information.
`edgenet` uses a $(p \times p)$-dimensional affinity matrix $\mathbf{G}_X \in \mathbb{R}_+^{p \times p}$ to model the interactions of $p$ covariables in $\mathbf{X} \in \mathbb{R}^{n \times p}$ and, analogously, a matrix $\mathbf{G}_Y \in \mathbb{R}_+^{q \times q}$ to model the interactions of $q$ response variables in $\mathbf{Y} \in \mathbb{R}^{n \times q}$. The affinity matrices are used to regularize the regression coefficients as follows:
\begin{equation}
\small
\begin{split}
\hat{\mathbf{B}} = \arg \min_{\mathbf{B}} \, & -\ell(\mathbf{B}) + \lambda ||\mathbf{B}||_1 \\
& + \frac{\psi_1}{2} \sum_{i=1}^p \sum_{j=1}^p || \boldsymbol \beta_{i,} - \boldsymbol \beta_{j,} ||_2^2 \, (\mathbf{G}_X)_{i,j} \\
& + \frac{\psi_2}{2} \sum_{i=1}^q \sum_{j=1}^q || \boldsymbol \beta_{,i} - \boldsymbol \beta_{,j} ||_2^2 \, (\mathbf{G}_Y)_{i,j} \\
= \arg \min_{\mathbf{B}} \, & -\ell(\mathbf{B}) + \lambda ||\mathbf{B}||_1 \\
& + \psi_1 \, \text{tr}\left( \mathbf{B}^T \left( \mathbf{D}_{\mathbf{G}_X} - \mathbf{G}_X \right) \mathbf{B} \right) \\
& + \psi_2 \, \text{tr}\left( \mathbf{B} \left( \mathbf{D}_{\mathbf{G}_Y} - \mathbf{G}_Y \right) \mathbf{B}^T \right)
\end{split}
\label{equ:graphreg}
\end{equation}
Matrix $\mathbf{B} \in \mathbb{R}^{(p + 1) \times q}$ is the matrix of regression coefficients to be estimated (including an intercept). Vectors $\boldsymbol \beta_{i, }$ and $\boldsymbol \beta_{, i}$ are the $i$-th row or column of $\mathbf{B}$, respectively. Shrinkage parameters $\lambda$, $\psi_1$ and $\psi_2$ are fixed or need to be estimated (e.g., using cross-validation). The matrices $\mathbf{D}_{\mathbf{G}} - \mathbf{G}$ are the combinatorial (or normalized) graph Laplacians of an affinity matrix $\mathbf{G}$ [@chung1997spectral].
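As an illustration, here is a minimal sketch in plain R (not part of the netReg API) of how the combinatorial Laplacian is computed from an affinity matrix:

```r
# combinatorial graph Laplacian L = D_G - G of an affinity matrix G,
# where D_G is the diagonal degree matrix of G
# (assumes G is symmetric with non-negative entries)
graph.laplacian <- function(G) {
  diag(rowSums(G)) - G
}
```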
Related group-penalty approaches are described in [@yuan2006model] and [@meier2008group].
So far `netReg` supports the following exponential family random variables: Gaussian, Binomial and Poisson.
The log-likelihood function $\ell(\mathbf{B})$ over $q$ response variables is defined as:
\begin{equation}
\small
\ell(\mathbf{B}) = \sum_{j=1}^q \sum_{i=1}^n \log \, \mathbb{P}\left(y_{i,j} \mid h\left(\mathbf{x}_{i,} \cdot \boldsymbol \beta_{,j}\right), \phi \right)
\end{equation}
where $h$ is the inverse of a link function (for instance, the exponential function for a log link) and $\phi$ is a dispersion parameter.
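For the three families used below, the inverse link functions $h$ corresponding to the canonical links are the identity, the logistic function, and the exponential. As illustration only (these helpers are hypothetical and not part of the netReg API):

```r
# inverse link functions h for the canonical links of the three families
h.gaussian <- function(eta) eta                  # identity link
h.binomial <- function(eta) 1 / (1 + exp(-eta))  # inverse logit
h.poisson  <- function(eta) exp(eta)             # inverse log
```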
The following section shows how to use network regularization models. We shall use `edgenet`, but the calls for the other methods are analogous and examples can be found in the method documentation. We first load some libraries:
```r
library(tensorflow)
library(tfprobability)
library(netReg)
```
Set some parameters and create affinity matrices:
```r
# parameters
n <- 100
p <- 10
q <- 10

# affinity matrices
G.X <- abs(rWishart(1, 10, diag(p))[,,1])
G.Y <- abs(rWishart(1, 10, diag(q))[,,1])
```
We created the affinity matrices completely at random here. In practice, an observed (e.g., biological) affinity matrix should be used, because the affinity matrix ultimately determines how the coefficients are shrunk.
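If interactions are available as an edge list, for instance from a protein-protein interaction database, the affinity matrix can be assembled from it. A minimal sketch (the `edges` data frame is a made-up example):

```r
# build a binary (p x p) affinity matrix from a hypothetical edge list
# with integer columns 'from' and 'to' indexing the p covariables
edges <- data.frame(from = c(1, 2, 5), to = c(2, 3, 7))
G.X.real <- matrix(0, p, p)
G.X.real[cbind(edges$from, edges$to)] <- 1
G.X.real <- pmax(G.X.real, t(G.X.real))  # symmetrize
```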
The actual fit is straightforward. We create Gaussian data first and then fit the model:
```r
# data
X <- matrix(rnorm(n * p), n)
B <- matrix(rnorm(p * q), p)
Y <- X %*% B + matrix(rnorm(n * q, 0, 0.1), n)

fit <- edgenet(X=X, Y=Y, G.X=G.X, G.Y=G.Y, family=gaussian, maxit=10)
summary(fit)
```
The `fit` object contains information about coefficients, intercepts, etc.:
```r
coef(fit)[,1:5]
```
Having estimated the coefficients, we can predict new data sets:
```r
pred <- predict(fit, X)
pred[1:5, 1:5]
```
The binomial case is the same, only with a different `family` argument. Note that the responses need to be binary, so fitting the continuous $\mathbf{Y}$ from above fails:
```r
# fails, since Y is continuous and not binary
try({
  edgenet(X=X, Y=Y, G.X=G.X, G.Y=G.Y, family=binomial, maxit=11)
})
```
With binary responses, the fit works:

```r
# data
X <- matrix(rnorm(n * p), n)
B <- matrix(rnorm(p * q), p)
prob <- 1 / (1 + exp(-X %*% B))  # success probabilities (inverse logit)
Y.binom <- do.call("cbind", lapply(seq(10), function(.) rbinom(n, 1, prob[,.])))

fit <- edgenet(X=X, Y=Y.binom, G.X=G.X, G.Y=G.Y, family=binomial, maxit=10)
summary(fit)
```
The Poisson case of course works analogously:
```r
# data
X <- matrix(rnorm(n * p), n)
B <- matrix(rnorm(p * q), p)
mu <- exp(-X %*% B)  # Poisson means (inverse log link)
Y.pois <- do.call("cbind", lapply(seq(10), function(.) rpois(n, mu[,.])))

fit <- edgenet(X=X, Y=Y.pois, G.X=G.X, G.Y=G.Y, family=poisson, maxit=10)
summary(fit)
```
In most cases we do not know the optimal shrinkage parameters $\lambda$, $\psi_1$ and $\psi_2$. In these settings you can use `netReg`'s included model selection. We use Powell's BOBYQA algorithm [@powell2009bobyqa] for gradient-free optimization in a cross-validation framework. Doing the model selection only requires calling `cv.edgenet`:
```r
cv <- cv.edgenet(X=X, Y=Y, G.X=G.X, G.Y=G.Y, family=gaussian,
                 optim.maxit=10, maxit=10)
summary(cv)
```
The cross-validation and BOBYQA are quite computationally intensive, so we set `optim.maxit` to $10$.
I recommend using TensorFlow on a GPU for this, since `edgenet` needs to compute many matrix multiplications.
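To verify that TensorFlow actually sees a GPU, you can list the visible devices:

```r
# list GPUs visible to TensorFlow; an empty list means computation runs on the CPU
tf$config$list_physical_devices("GPU")
```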
The `cv.edgenet` object also contains a fit with the optimal parameters:
```r
summary(cv$fit)
```
You can obtain the coefficients like this:
```r
coef(cv)[,1:5]
```
Furthermore, you can also predict directly with a `cv.edgenet` object:
```r
pred <- predict(cv, X)
pred[1:5, 1:5]
```
This section explains how to fit a linear model and estimate its parameters on a real data set. First, we load the library and some data:
```r
library(netReg)

data("yeast")
ls(yeast)

X <- yeast$X
Y <- yeast$Y
G.Y <- yeast$GY
```
The yeast data set ($\mathbf{X}$ and $\mathbf{Y}$) is taken and adapted from [@brem2005genetic], [@storey2005multiple], and [@cheng2014graph]. $\mathbf{G}_Y$ is taken from BioGRID.
$\mathbf{X}$ is an $(n \times p)$ matrix of genetic markers, where $n$ is the number of samples (112) and $p$ is the number of markers. $\mathbf{Y}$ is an $(n \times q)$ matrix of expression values for $q$ yeast genes, over the same $n = 112$ samples. $\mathbf{G}_Y$ is a $(q \times q)$ adjacency matrix representing protein-protein interactions between the $q$ response variables.
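As a quick sanity check, the dimensions should match this description:

```r
# X is (n x p), Y is (n x q), G.Y is (q x q)
dim(X)
dim(Y)
dim(G.Y)
```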
We use only a small network here so that the results can be printed conveniently.
```r
fit <- edgenet(X=X, Y=Y, G.Y=G.Y, lambda=5, family=gaussian,
               maxit=10, thresh=1e-3)
summary(fit)
```
For the response variables we use the affinity matrix $\mathbf{G}_Y$, which represents biological relationships. We promote sparsity by setting $\lambda = 5$ and put a weak (default) penalty of $\psi_2 = 1$ on similar coefficients. Otherwise we use default parameters.
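To make these choices explicit, the shrinkage parameters can also be passed directly. A sketch, assuming the arguments are named `psigx` and `psigy` as in recent netReg versions (check `?edgenet` for the exact names):

```r
# equivalent fit with the penalty weight for G.Y spelled out
# (the argument name psigy is an assumption; see ?edgenet)
fit <- edgenet(X=X, Y=Y, G.Y=G.Y, lambda=5, psigy=1,
               family=gaussian, maxit=10, thresh=1e-3)
```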
The `fit` object contains information about coefficients and intercepts. Having estimated the coefficients, we can predict new data sets:
```r
X.new <- matrix(rnorm(10 * ncol(X)), 10)
pred <- predict(fit, X.new)
pred[1:10, 1:5]
```