Scientific computing in Python is well established. This package takes advantage of new work at RStudio that fosters Python-R interoperability. Identifying good practices of interface design will require extensive discussion and experimentation; this package takes an initial step in that direction.
A key motivation is experimenting with an incremental PCA implementation for very large out-of-memory data.
The package includes a list of references to Python modules.
library(BiocSklearn)
SklearnEls()
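As a minimal sketch of working with these module references, the list returned by `SklearnEls()` can be inspected with ordinary R tools; only the `np` element (used later in this vignette) is confirmed here, and any other element names are assumptions.

```r
library(BiocSklearn)

els = SklearnEls()  # a list of references to Python modules
names(els)          # see which module handles are available
els$np              # the numpy module; functions are reached with $, e.g. els$np$take
```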
We can acquire Python documentation for the included modules with reticulate's py_help:
Help on package sklearn.decomposition in sklearn:

NAME
    sklearn.decomposition

FILE
    /Users/stvjc/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/__init__.py

DESCRIPTION
    The :mod:`sklearn.decomposition` module includes matrix decomposition
    algorithms, including among others PCA, NMF or ICA. Most of the
    algorithms of this module can be regarded as dimensionality reduction
    techniques.

PACKAGE CONTENTS
    _online_lda
    base
    cdnmf_fast
    dict_learning
    factor_analysis
    fastica_
    incremental_pca
    ...
The reticulate package is designed to minimize the effort required to move data between R and Python so that it can be used naturally in each language.
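As a small illustration of this interoperability, reticulate's `r_to_py` and `py_to_r` perform explicit conversions between the two representations (conversion also happens automatically when R values are passed to Python functions); this sketch assumes a working Python installation is available to reticulate.

```r
library(reticulate)

m = matrix(1:6, nrow = 2)  # an ordinary R matrix
pm = r_to_py(m)            # a Python (numpy) representation of the matrix
pm                         # printed with Python's bracketed array format
py_to_r(pm)                # converted back to an R matrix
```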
irloc = system.file("csv/iris.csv", package="BiocSklearn")
irismat = SklearnEls()$np$genfromtxt(irloc, delimiter=',')
To examine a submatrix, we use the take method from numpy. The bracket format notifies us that we are not looking at data native to R.
SklearnEls()$np$take(irismat, 0:2, 0L )
We'll use R's prcomp as a first test to demonstrate performance of the sklearn modules with the iris data.
fullpc = prcomp(data.matrix(iris[,1:4]))$x
We have a python representation of the iris data. We compute the PCA as follows:
ppca = skPCA(irismat)
ppca
This returns an object that can be reused through python methods.
The numerical transformation is accessed via getTransformed.
tx = getTransformed(ppca)
dim(tx)
head(tx)
The native methods can be applied to the pyobj output.
pyobj(ppca)$fit_transform(irismat)[1:3,]
Concordance with the R computation can be checked:
round(cor(tx, fullpc),3)
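When checking concordance, note that principal components are determined only up to sign, so a correct implementation may return columns correlating at -1 rather than +1 with their R counterparts. A self-contained sketch using only base R:

```r
# Principal components are defined up to sign: flipping a column's sign
# yields an equally valid decomposition.
p1 = prcomp(data.matrix(iris[,1:4]))$x
p2 = -p1                           # same subspace, all signs flipped
round(diag(cor(p1, p2)), 3)        # -1 on every component, yet both are valid
round(abs(diag(cor(p1, p2))), 3)   # use abs() when checking concordance
```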
A computation supporting a priori bounding of memory consumption is available. In this procedure one can also select the number of principal components to compute.
ippca = skIncrPCA(irismat)
ippcab = skIncrPCA(irismat, batch_size=25L)
round(cor(getTransformed(ippcab), fullpc),3)
This procedure can be used when data are provided in chunks, perhaps from a stream. We iteratively update the object, for which there is no container at present. Again the number of components computed can be specified.
ta = SklearnEls()$np$take # provide slicer utility
ipc = skPartialPCA_step(ta(irismat,0:49,0L))
ipc = skPartialPCA_step(ta(irismat,50:99,0L), obj=ipc)
ipc = skPartialPCA_step(ta(irismat,100:149,0L), obj=ipc)
ipc$transform(ta(irismat,0:5,0L))
fullpc[1:5,]
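The three explicit update steps above can be generalized to a loop over arbitrarily many chunks, which is closer to how a stream would be consumed; the chunking scheme below is illustrative, not part of the package API.

```r
library(BiocSklearn)

ta = SklearnEls()$np$take                  # slicer utility from numpy
chunks = split(0:149, rep(1:3, each = 50)) # three chunks of 50 row indices

# First chunk initializes the partial-PCA state; later chunks update it.
ipc = skPartialPCA_step(ta(irismat, chunks[[1]], 0L))
for (ix in chunks[-1])
    ipc = skPartialPCA_step(ta(irismat, ix, 0L), obj = ipc)

ipc$transform(ta(irismat, 0:5, 0L))        # project the first six rows
```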
We need more applications and profiling.