Scientific computing in python is well established. This package takes advantage of recent work at RStudio (the reticulate package) that fosters python-R interoperability. Identifying good practices of interface design will require extensive discussion and experimentation, and this package takes an initial step in this direction.
A key motivation is experimenting with an incremental PCA implementation for very large out-of-memory data.
The package includes a list of references to python modules.
library(BiocSklearn)
SklearnEls()
We can acquire python documentation of included modules with reticulate's py_help:
Help on package sklearn.decomposition in sklearn:

NAME
    sklearn.decomposition

FILE
    /Users/stvjc/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/__init__.py

DESCRIPTION
    The :mod:`sklearn.decomposition` module includes matrix decomposition
    algorithms, including among others PCA, NMF or ICA. Most of the
    algorithms of this module can be regarded as dimensionality reduction
    techniques.

PACKAGE CONTENTS
    _online_lda
    base
    cdnmf_fast
    dict_learning
    factor_analysis
    fastica_
    incremental_pca
    ...
The reticulate package is designed to limit the amount of effort required to convert data from R to python for natural use in each language.
irloc = system.file("csv/iris.csv", package="BiocSklearn")
irismat = SklearnEls()$np$genfromtxt(irloc, delimiter=',')
To examine a submatrix, we use the take method from numpy. The bracket format notifies us that we are not looking at data native to R.
SklearnEls()$np$take(irismat, 0:2, 0L)
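For readers unfamiliar with numpy, the call above is ordinary numpy row selection; the third argument is the axis. A minimal sketch in plain python, using a small stand-in matrix rather than the iris data:

```python
import numpy as np

mat = np.arange(12.0).reshape(4, 3)   # stand-in for the iris matrix

# take rows 0..2 along axis 0; equivalent to mat[[0, 1, 2], :]
sub = np.take(mat, [0, 1, 2], axis=0)
print(sub.shape)  # (3, 3)
```

The R call passes `0L` for the axis because reticulate requires an explicit integer where python expects one.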
We'll use R's prcomp as a first test to demonstrate performance of the sklearn modules with the iris data.
fullpc = prcomp(data.matrix(iris[,1:4]))$x
We have a python representation of the iris data. We compute the PCA as follows:
ppca = skPCA(irismat)
ppca
This returns an object that can be reused through python methods.
The numerical transformation is accessed via getTransformed.
tx = getTransformed(ppca)
dim(tx)
head(tx)
The native methods can be applied to the pyobj output.
pyobj(ppca)$fit_transform(irismat)[1:3,]
Concordance with the R computation can be checked:
round(cor(tx, fullpc),3)
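A correlation check is the natural comparison because both sklearn's PCA and R's prcomp derive scores from the SVD of the centered data matrix, and matching components can differ by sign. A numpy-only sketch of this equivalence, on hypothetical random data rather than iris:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))          # stand-in for the iris matrix

Xc = X - X.mean(axis=0)                # center columns, as prcomp does
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                     # PCA scores (the "transformed" data)

# An algebraically equivalent route to the scores: U * diag(s).
alt = U * s

# Matching components are perfectly correlated (up to sign).
corr = np.abs(np.corrcoef(scores[:, 0], alt[:, 0])[0, 1])
print(round(corr, 3))  # 1.0
```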
A computation supporting a priori bounding of memory consumption is available. In this procedure one can also select the number of principal components to compute.
ippca = skIncrPCA(irismat)
ippcab = skIncrPCA(irismat, batch_size=25L)
round(cor(getTransformed(ippcab), fullpc),3)
This procedure can be used when data are provided in chunks, perhaps from a stream. We iteratively update the object, for which there is no container at present. Again the number of components computed can be specified.
ta = SklearnEls()$np$take # provide slicer utility
ipc = skPartialPCA_step(ta(irismat,0:49,0L))
ipc = skPartialPCA_step(ta(irismat,50:99,0L), obj=ipc)
ipc = skPartialPCA_step(ta(irismat,100:149,0L), obj=ipc)
ipc$transform(ta(irismat,0:5,0L))
fullpc[1:5,]
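The chunked updates above bound memory by never holding the full matrix at once. As a rough numpy-only sketch of the idea, one can accumulate a running mean and scatter matrix over chunks and recover the principal axes from the resulting covariance. (This is a simplification: sklearn's IncrementalPCA maintains an incremental SVD rather than an explicit covariance matrix, but the memory-bounding principle is the same. Data and variable names here are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))        # stand-in for the full data matrix

# Stream the rows in chunks of 50, keeping only O(p) + O(p^2) state.
n = 0
row_sum = np.zeros(4)                # running sum of rows
scatter = np.zeros((4, 4))           # running sum of outer products
for start in range(0, 150, 50):
    chunk = X[start:start + 50]      # in practice, read from disk or a stream
    n += chunk.shape[0]
    row_sum += chunk.sum(axis=0)
    scatter += chunk.T @ chunk

mean = row_sum / n
cov = (scatter - n * np.outer(mean, mean)) / (n - 1)

# The accumulated covariance matches the full-data covariance, so its
# eigenvectors give the same principal axes (up to sign).
print(np.allclose(cov, np.cov(X, rowvar=False)))  # True
vals, vecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
```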
We need more applications and profiling.