bllim | R Documentation |
EM Algorithm for Block diagonal Gaussian Locally Linear Mapping
bllim(tapp,yapp,in_K,in_r=NULL,ninit=20,maxiter=100,verb=0,in_theta=NULL,plot=TRUE)
tapp |
An |
yapp |
An |
in_K |
Initial number of components or number of clusters |
in_r |
Initial assignments (default NULL). If NULL, the model is initialized with the best initialisation among 20, computed by a joint Gaussian mixture model on both response and covariates. |
ninit |
Number of random initializations. Not used of |
maxiter |
Maximum number of iterations (default 100). The algorithm stops if the number of iterations exceeds |
verb |
Verbosity: print out the progression of the algorithm. If |
in_theta |
Initial parameters (default NULL), same structure as the output of this function. The EM algorithm can be initialized either with initial assignments or initial parameters values. |
plot |
Displays plots to allow user to check that the slope heuristics can be applied confidently to select the conditional block structure of predictors, as in the |
The BLLiM model implemented in this function addresses the following non-linear mapping problem:
E(Y | X=x) = g(x),
where Y is an L-vector of multivariate responses and X is a large D-vector of covariate profiles such that D \gg L. Like gllim and sllim, the bllim function aims at estimating the non-linear regression function g.
First, the methods of this package are based on an inverse regression strategy. The inverse conditional relation p(X | Y) is specified in such a way that the forward relation of interest p(Y | X) can be deduced in closed form. Under some assumptions on the covariance structures, the large number D of covariates is handled by this inverse regression trick, which acts as a dimension reduction technique. The number of parameters to estimate is therefore drastically reduced. Second, we propose to approximate the non-linear regression function g by a piecewise affine function. To this end, a hidden discrete variable Z is introduced in order to divide the space into K regions such that an affine model holds between responses Y and variables X in each region k:
X = \sum_{k=1}^K I_{Z=k} (A_k Y + b_k + E_k)
where A_k is a D \times L matrix of coefficients for regression k, b_k is a D-vector of intercepts and E_k is a Gaussian noise with covariance matrix \Sigma_k.
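As an illustration of the piecewise affine inverse model above, the following base R sketch simulates covariates X from responses Y and hidden labels Z. It is not code from the package: the dimensions, the random parameter values and the isotropic noise (a simplification of \Sigma_k) are assumptions chosen only for the example.

```r
# Minimal simulation of the piecewise affine inverse model (base R only).
# All sizes and parameter values below are illustrative assumptions.
set.seed(1)
N <- 200; D <- 6; L <- 2; K <- 3

# Hidden component labels Z and responses Y (L x N, observations in columns)
Z <- sample(seq_len(K), N, replace = TRUE)
Y <- matrix(rnorm(L * N), nrow = L, ncol = N)

# One affine map (A_k, b_k) per component
A <- array(rnorm(D * L * K), dim = c(D, L, K))
b <- matrix(rnorm(D * K), nrow = D, ncol = K)
sigma <- 0.1  # isotropic noise, a simplification of Sigma_k

# X = A_k Y + b_k + E_k, with k = Z for each observation
X <- matrix(NA_real_, nrow = D, ncol = N)
for (n in seq_len(N)) {
  k <- Z[n]
  X[, n] <- A[, , k] %*% Y[, n] + b[, k] + rnorm(D, sd = sigma)
}
dim(X)  # D rows (covariates), N columns (observations)
```

The layout (variables in rows, observations in columns) matches the one used in the example at the end of this page.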
BLLiM is defined as the following hierarchical Gaussian mixture model for the inverse conditional density p(X | Y):
p(X | Y=y,Z=k;\theta) = N(X; A_ky+b_k,\Sigma_k)
p(Y | Z=k; \theta) = N(Y; c_k,\Gamma_k)
p(Z=k)=\pi_k
where \Sigma_k is a D \times D block diagonal covariance matrix whose structure is automatically learnt from data, and \theta is the set of parameters \theta=(\pi_k,c_k,\Gamma_k,A_k,b_k,\Sigma_k)_{k=1}^K.
The forward conditional density of interest p(Y | X) is deduced from these equations and is also a Gaussian mixture of regressions model.
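For a single component k, the forward parameters follow from standard Gaussian conditioning applied to the inverse model. The sketch below uses base R only; the function name forward_params and the argument names are hypothetical, and it is not the package's internal code. It returns the slope, intercept and covariance of p(Y | X, Z=k), together with the marginal mean and covariance of X within the component.

```r
# Forward parameters for one component k, by Gaussian conditioning.
# The function and argument names are hypothetical, not part of the package.
forward_params <- function(A_k, b_k, Sigma_k, c_k, Gamma_k) {
  Sigma_inv <- solve(Sigma_k)
  Gamma_inv <- solve(Gamma_k)
  # Covariance of Y given X in component k
  S_star <- solve(Gamma_inv + t(A_k) %*% Sigma_inv %*% A_k)
  # Slope and intercept of E(Y | X = x, Z = k) = A_star x + b_star
  A_star <- S_star %*% t(A_k) %*% Sigma_inv
  b_star <- S_star %*% (Gamma_inv %*% c_k - t(A_k) %*% Sigma_inv %*% b_k)
  list(A = A_star, b = drop(b_star), Sigma = S_star,
       c = drop(A_k %*% c_k + b_k),                   # marginal mean of X
       Gamma = Sigma_k + A_k %*% Gamma_k %*% t(A_k))  # marginal covariance of X
}

# Illustrative values (assumptions for the example only)
set.seed(2)
D <- 4; L <- 2
A_k <- matrix(rnorm(D * L), D, L)
b_k <- rnorm(D)
Sigma_k <- diag(c(0.5, 0.5, 1, 1))
c_k <- c(1, -1)
Gamma_k <- diag(L)

fw <- forward_params(A_k, b_k, Sigma_k, c_k, Gamma_k)
x <- rnorm(D)
y_hat <- drop(fw$A %*% x + fw$b)  # E(Y | X = x, Z = k)
```

In the full mixture, the prediction E(Y | X=x) would combine these component-wise means with weights proportional to \pi_k N(x; fw$c, fw$Gamma), normalised over k.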
For a given number of affine components (or clusters) K and a given block structure, the number of parameters to estimate is:
(K-1)+ K(DL+D+L+ nbpar_{\Sigma}+L(L+1)/2)
where L is the dimension of the response, D is the dimension of the covariates and nbpar_{\Sigma} is the total number of parameters in the large covariance matrix \Sigma_k in each cluster. This number of parameters depends on the number and size of blocks in each matrix.
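The count above can be evaluated with a short helper. This is a sketch under two assumptions that are not stated on this page: every cluster shares the same block structure, and a symmetric block of size s contributes s(s+1)/2 free parameters. The function name count_params is hypothetical.

```r
# Number of parameters for K clusters, D covariates, L responses and a
# block diagonal Sigma_k described by its block sizes.
# Assumes the same block structure in every cluster (hypothetical helper).
count_params <- function(K, D, L, block_sizes) {
  stopifnot(sum(block_sizes) == D)
  # a symmetric s x s block has s (s + 1) / 2 free parameters
  nbpar_sigma <- sum(block_sizes * (block_sizes + 1) / 2)
  (K - 1) + K * (D * L + D + L + nbpar_sigma + L * (L + 1) / 2)
}

count_params(K = 5, D = 50, L = 2, block_sizes = rep(1, 50))   # diagonal: 1029
count_params(K = 5, D = 50, L = 2, block_sizes = rep(10, 5))   # 5 blocks of 10: 2154
count_params(K = 5, D = 50, L = 2, block_sizes = 50)           # full: 7154
```

The three calls show how the block structure drives model complexity between the diagonal and the full covariance cases.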
Two hyperparameters must be estimated to run BLLiM:
Number of mixture components (or clusters) K: we propose to use the BIC criterion or the slope heuristics as implemented in capushe-package.
For a given number of clusters K, the block structure of the large covariance matrices specific to each cluster: the size and number of blocks of each \Sigma_k matrix are automatically learnt from data, using an extension of the shock procedure (see shock-package). This procedure is based on successive thresholding of the sample conditional covariance matrix within clusters, building a collection of block structure candidates. The final block structure is selected using slope heuristics.
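The thresholding idea can be sketched in base R as follows. This is only an illustration of the principle, not the shock implementation: the function name blocks_from_threshold, the toy matrix S (a stand-in for a within-cluster sample covariance matrix) and the thresholds are assumptions. Blocks are taken as the connected components of the graph whose edges are the entries of |S| above a threshold, and each threshold yields one candidate block structure. The final selection by slope heuristics is not shown.

```r
# Candidate block structure from a thresholded covariance matrix:
# variables i and j are linked when |S[i, j]| > lambda, and blocks are the
# connected components of the resulting graph (hypothetical helper).
blocks_from_threshold <- function(S, lambda) {
  D <- nrow(S)
  adj <- abs(S) > lambda
  diag(adj) <- TRUE
  labels <- rep(NA_integer_, D)
  current <- 0L
  for (j in seq_len(D)) {
    if (!is.na(labels[j])) next
    current <- current + 1L
    frontier <- j
    while (length(frontier) > 0) {
      labels[frontier] <- current
      # unlabelled neighbours of the current frontier
      frontier <- which(colSums(adj[frontier, , drop = FALSE]) > 0 &
                          is.na(labels))
    }
  }
  labels
}

# Toy covariance matrix with two groups of correlated variables
S <- diag(5)
S[1, 2] <- S[2, 1] <- 0.8
S[3, 4] <- S[4, 3] <- 0.6
S[4, 5] <- S[5, 4] <- 0.3

# One candidate block structure per threshold
thresholds <- c(0.1, 0.5, 0.9)
candidates <- lapply(thresholds, function(l) blocks_from_threshold(S, l))
candidates
```

Increasing the threshold removes edges, so the candidates go from a few large blocks to many small ones.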
BLLiM is not only a prediction model but also an interpretable tool. For example, it is useful for the analysis of transcriptomic data: if the covariates are genes and the response is a phenotype, the model provides clusters of individuals based on the relation between gene expression data and the phenotype, and also allows a gene regulatory network specific to each cluster of individuals to be inferred.
Returns a list with the following elements:
LLf |
Final log-likelihood |
LL |
Log-likelihood value at each iteration of the EM algorithm |
pi |
A vector of length |
c |
An |
Gamma |
An |
A |
An |
b |
An |
Sigma |
An |
r |
An |
nbpar |
The number of parameters estimated in the model |
Emeline Perthame (emeline.perthame@pasteur.fr), Emilie Devijver (emilie.devijver@kuleuven.be), Melina Gallopin (melina.gallopin@u-psud.fr)
[1] E. Devijver, M. Gallopin, E. Perthame. Nonlinear network-based quantitative trait prediction from transcriptomic data. Submitted, 2017, available at https://arxiv.org/abs/1701.07899.
xLLiM-package, emgm, gllim_inverse_map, capushe-package, shock-package
data(data.xllim)
## Setting 5 components in the model
K = 5
## the model can be initialized by running an EM algorithm for Gaussian Mixtures (EMGM)
r = emgm(data.xllim, init=K);
## and then the bllim model is estimated
responses = data.xllim[1:2,] # 2 responses in rows and 100 observations in columns
covariates = data.xllim[3:52,] # 50 covariates in rows and 100 observations in columns
## if initialization is not specified, the model is automatically initialized by EMGM
# mod = bllim(responses,covariates,in_K=K)
## Prediction can be performed using prediction function gllim_inverse_map
# pred = gllim_inverse_map(covariates,mod)$x_exp