Overview

This vignette explores basic aspects of PCA in bivariate and 5-dimensional data. It concludes with some remarks about "eigengenes", which can be verified using the computations shown in the simpler cases.

Make a bivariate dataset with positive correlation and heterogeneous variance

First we create a covariance matrix with greater variance for second variable of our pair.

library(MASS)
set.seed(1234)
options(digits=3)
cm = matrix(c(1,1,1,4), nr=2)
cm

Then we generate a 20000 x 2 matrix of bivariate normal deviates.

sim1 = mvrnorm(20000, c(0,0), cm)
cov(sim1)
cor(sim1)

The data in the original units is easy to visualize:

plot(sim1,xlim=c(-10,10), ylim=c(-10,10))

Now we will perform a PCA. We don't have to reduce dimensions, but we can get a handle on how the components are formed and interpreted.

prc = prcomp(sim1, center=FALSE)
plot(prc$x, xlim=c(-10,10), ylim=c(-10,10))

PC1 is produced by taking linear combinations of the rows of sim1.

We'll illustrate the linear combination concept. The data vector for the first row may be written $(x_1, x_2)$, and a linear combination has the form $ax_1 + bx_2$ for some coefficients $a$ and $b$.

The coefficients are derived from the rotation component of the PCA.

prc$rotation
c11 = prc$rotation[1,1]
c21 = prc$rotation[2,1]
sim1[1,1]*c11 + sim1[1,2]*c21
prc$x[1,1]

This can be done wholesale using matrix multiplication %*%:

(sim1 %*% prc$rotation)[1:5,]
prc$x[1:5,]
all.equal(prc$x, sim1%*% prc$rotation)

Exercises.

par(mfrow=c(2,2), mar=c(4,4,3,1))
plot(sim1, xlim=c(-10,10), ylim=c(-10,10), main="raw data",
  xlab="data column 1", ylab="data col. 2")
plot(sim1 %*% prc$rotation, xlim=c(-10,10), ylim=c(-10,10),
  main="data %*% prc$rotation", xlab="PC1 via rotation",
   ylab="PC2 via rotation")
plot(prc$x, xlim=c(-10,10), ylim=c(-10,10), main="x from prcomp",
    )
plot(sim1 %*% 
 prc$rotation %*% t(prc$rotation), xlim=c(-10,10), ylim=c(-10,10),
  main="data %*% rot %*% t(rot)", xlab="data %*% VVt (col 1)",
  ylab = "data %*% VVt (col 2)")

The rotation has been "undone". Letting $V$ denote the 'rotation' component of the PCA, this shows that the matrix product $VV^t = I$, where $I$ is a diagonal matrix with 1 on the diagonal. More background on the underlying computations can be gleaned from the Wikipedia entry on singular value decomposition.

A larger covariance matrix

Here we have a 5-dimensional dataset. We set up the covariance matrix so that columns 1 and 2 have negative correlation, columns 3 and 4 have positive correlation, column 2 has greatest overall variance, and columns 1 and 3 have elevated variance.

cm = diag(5)
cm[3,4] = cm[4,3] = .8
cm[1,2] = cm[2,1] = -.6
A = diag(5)
A[1,1] = 2
A[2,2] = 3
A[3,3] = 2
covm = A%*%cm%*%A
myd = mvrnorm(2000, rep(0,5), covm)

The pairs plot shows the data in original units.

pairs(myd, xlim=c(-10,10), ylim=c(-10,10))

We verify the multivariate structure:

cor(myd)
cov(myd)
dim(myd)

Compute PCA

pp = prcomp(myd)
pairs(pp$x)

The biplot shows the projection to PC1-PC2 and shows how the different variables are related, and how they drive the projection.

par(lwd=2)
biplot(pp, xlabs=rep(".", 2000), expand=.8)

Exercise: Explain the configuration of arrows in the biplot.

"Eigengenes" derived from PCA

When the rows are samples and columns are genes, the x components of prcomp's output are linear combinations of all genes. The coefficients of the combination are derived from the PCA rotation matrix, which is constructed so as to

Note that the simple reconstructions above have required that prcomp be used with center=FALSE.



vjcitn/CSHstats documentation built on Jan. 9, 2025, 8:37 p.m.