This function fits classification algorithms to sequencing data and measures model performance using various statistics.
data |
a |
method |
a character string indicating the name of classification method. Methods are implemented from the |
B |
an integer. It is the number of bootstrap samples for bagging classifiers, for example "bagFDA" and "treebag". Default is 25. |
ref |
a character string indicating the user defined reference class. Default is |
class.labels |
a character string indicating the column name of colData(...). Should be given as "character". The column from colData() which matches with given column name is used as class labels of samples. If NULL, first column is used as class labels. Default is NULL. |
preProcessing |
a character string indicating the name of the preprocessing method. This option covers both the normalization and the transformation of the raw sequencing data. Available options are:
IMPORTANT: See Details for further information. |
normalize |
a character string indicating the type of normalization. Should be one of 'deseq', 'tmm' and 'none'. Default is 'deseq'. This option should be used with discrete and voom-based classifiers since no transformation is applied on raw counts. For caret-based classifiers, the argument 'preProcessing' should be used. |
control |
a list including all the control parameters passed to the model training process. This argument should be defined using wrapper functions
|
... |
optional arguments passed to selected classifiers. |
MLSeq includes both microarray-based and discrete classifiers, along with preprocessing approaches. The preprocessing covers normalization techniques, i.e. the DESeq median ratio (Anders and Huber, 2010) and trimmed mean of M values (Robinson and Oshlack, 2010) methods, and transformation techniques, i.e. the variance-stabilizing transformation (vst) (Anders and Huber, 2010), the regularized logarithmic transformation (rlog) (Love et al., 2014), the logarithm of counts per million reads (log-cpm) (Robinson et al., 2010) and variance modeling at the observational level (voom) (Law et al., 2014). Users can supply raw RNA-Seq count data directly, preprocess it, build one of the many available classification models, optimize the model parameters and evaluate model performance.
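To make the two preprocessing steps concrete, here is a minimal base R sketch of median-ratio size factors and a log-cpm transformation on a made-up count matrix. It is an illustration of the general techniques, not MLSeq's or DESeq2's internal code; the toy values and all variable names are assumptions, and the log-cpm line uses the offsets (0.5 and 1) described for voom by Law et al. (2014).

```r
# Toy count matrix: 4 genes (rows) x 3 samples (columns). Values are made up.
counts <- matrix(c( 10,  20,  40,
                   100, 210, 390,
                     5,  10,  20,
                    50,  95, 205),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# Median-of-ratios size factors: compare each sample with a pseudo-reference
# sample built from the per-gene geometric mean across samples.
log_geo_mean <- rowMeans(log(counts))
size_factors <- apply(counts, 2, function(col) {
  keep <- is.finite(log_geo_mean) & col > 0   # skip genes with zero counts
  exp(median(log(col[keep]) - log_geo_mean[keep]))
})

# Normalized counts: divide each sample (column) by its size factor.
normalized <- sweep(counts, 2, size_factors, "/")

# log-cpm: log2 counts per million, with small offsets to avoid log(0).
lib_size <- colSums(counts)
log_cpm <- log2(sweep(counts + 0.5, 2, lib_size + 1, "/") * 1e6)

round(size_factors, 3)
```

In this toy matrix the three samples differ mainly by sequencing depth (roughly 1:2:4), so after dividing by the size factors the columns of `normalized` are on a comparable scale.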
The MLSeq package provides a variety of algorithms for the classification of RNA-Seq data. These classifiers fall into two classes: i) microarray-based classifiers applied after a proper transformation, and ii) discrete classifiers.
The first option is to transform the RNA-Seq data to bring it closer to microarray data and then apply microarray-based algorithms. These methods are implemented from the caret package. Run availableMethods() for a list of available methods. Note that the voom transformation returns both the transformed gene-expression matrix and a precision weight matrix of the same dimension, so a classifier should take both matrices into account. Zararsiz (2015) presented voom-based diagonal discriminant classifiers and the sparse voom-based nearest shrunken centroids classifier.
The second option is to build discrete classifiers that model the RNA-Seq counts directly. Two methods are currently available in the literature. Witten (2011) modeled the counts with a Poisson distribution and proposed the sparse Poisson linear discriminant analysis (PLDA) classifier, suggesting a power transformation to deal with the overdispersion problem. Dong et al. (2016) extended this approach to a negative binomial linear discriminant analysis (NBLDA) classifier. More detailed information can be found in the referenced papers.
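As a rough illustration of the discrete approach, the base R sketch below implements a simplified Poisson discriminant rule in the spirit of Witten (2011) on simulated counts. It is not MLSeq's implementation: it assumes total-count size factors, omits both the soft-thresholding that makes PLDA sparse and the power transformation, and the function names `plda_fit` and `plda_predict`, the smoothing constant `beta = 1` and the simulation settings are all hypothetical choices made for this example.

```r
# Fit: estimate per-class, per-gene factors d[j, k] under X_ij ~ Poisson(s_i * g_j * d[j, k]).
plda_fit <- function(x, y, beta = 1) {
  total <- sum(x)
  s <- colSums(x) / total              # total-count size factor per sample
  g <- rowSums(x)                      # total count per gene
  classes <- levels(y)
  d <- sapply(classes, function(k) {
    in_k <- y == k
    (rowSums(x[, in_k, drop = FALSE]) + beta) / (g * sum(s[in_k]) + beta)
  })
  list(d = d, g = g, total = total,
       prior = as.numeric(table(y)) / length(y), classes = classes)
}

# Predict: assign each new sample to the class with the largest discriminant score
#   sum_j x_j * log(d_jk) - s_new * sum_j g_j * d_jk + log(prior_k)
plda_predict <- function(fit, newx) {
  s_new <- colSums(newx) / fit$total
  scores <- t(newx) %*% log(fit$d) -
    outer(s_new, colSums(fit$g * fit$d)) +
    matrix(log(fit$prior), nrow = ncol(newx),
           ncol = length(fit$classes), byrow = TRUE)
  factor(fit$classes[max.col(scores, ties.method = "first")],
         levels = fit$classes)
}

# Simulated data: p genes (rows) x n samples (columns), classes "N" and "T".
set.seed(42)                                # arbitrary seed
p <- 30
n_k <- 15
y <- factor(rep(c("N", "T"), each = n_k))
base <- rgamma(p, shape = 2, rate = 0.05)   # per-gene baseline means
fold <- rep(1, p)
fold[1:5] <- 4                              # genes 1-5 are up-regulated in class "T"
depth <- runif(2 * n_k, 0.5, 2)             # per-sample sequencing depth
mu <- outer(base, depth)
mu[, y == "T"] <- mu[, y == "T"] * fold     # fold is recycled along the gene dimension
counts <- matrix(rpois(length(mu), mu), nrow = p)

# Hold out three samples per class, fit on the rest, predict the held-out ones.
test_idx <- c(1:3, 16:18)
fit <- plda_fit(counts[, -test_idx], y[-test_idx])
pred <- plda_predict(fit, counts[, test_idx, drop = FALSE])
table(Predicted = pred, Actual = y[test_idx])
```

Genes whose estimated `d` is close to 1 in every class contribute almost nothing to the score differences; the sparse version of the method shrinks such estimates to exactly 1, which is what performs feature selection.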
an MLSeq object for the trained model.
Dincer Goksuluk, Gokmen Zararsiz, Selcuk Korkmaz, Vahap Eldem, Ahmet Ozturk and Ahmet Ergun Karaagaoglu
Kuhn M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5). (http://www.jstatsoft.org/v28/i05/)
Anders S, Huber W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11:R106.
Witten DM. (2011). Classification and clustering of sequencing data using a Poisson model. The Annals of Applied Statistics, 5(4), 2493-2518.
Law et al. (2014). Voom: precision weights unlock linear model analysis tools for RNA-Seq read counts. Genome Biology, 15:R29. doi:10.1186/gb-2014-15-2-r29
Witten D. et al. (2010). Ultra-high throughput sequencing-based small RNA discovery and discrete statistical biomarker analysis in a collection of cervical tumours and matched controls. BMC Biology, 8:58.
Robinson MD, Oshlack A. (2010). A scaling normalization method for differential expression analysis of RNA-Seq data. Genome Biology, 11:R25. doi:10.1186/gb-2010-11-3-r25
Love MI, Huber W, Anders S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12):550. doi:10.1186/s13059-014-0550-8
Dong et al. (2016). NBLDA: negative binomial linear discriminant analysis for RNA-Seq data. BMC Bioinformatics, 17(1):369. doi:10.1186/s12859-016-1208-1
Zararsiz G (2015). Development and Application of Novel Machine Learning Approaches for RNA-Seq Data Classification. PhD thesis, Hacettepe University, Institute of Health Sciences, June 2015.
predictClassify, train, trainControl, voomControl, discreteControl
## Not run:
library(MLSeq)
library(DESeq2)
data(cervical)
# a subset of cervical data with first 150 features.
data <- cervical[c(1:150), ]
# defining sample classes.
class <- data.frame(condition = factor(rep(c("N","T"), c(29, 29))))
n <- ncol(data) # number of samples
p <- nrow(data) # number of features
# number of samples for test set (30% test, 70% train).
nTest <- ceiling(n*0.3)
ind <- sample(n, nTest, FALSE)
# train set
data.train <- data[ ,-ind]
data.train <- as.matrix(data.train + 1)
classtr <- data.frame(condition = class[-ind, ])
# train set in S4 class
data.trainS4 <- DESeqDataSetFromMatrix(countData = data.train,
                                       colData = classtr,
                                       design = formula(~ 1))
## The number of repeats (repeats) may affect the estimated model accuracies
## 1. caret-based classifiers:
# Random Forest (RF) Classification
rf <- classify(data = data.trainS4, method = "rf",
               preProcessing = "deseq-vst", ref = "T",
               control = trainControl(method = "repeatedcv", number = 5,
                                      repeats = 2, classProbs = TRUE))
rf
## 2. Discrete classifiers:
# Poisson Linear Discriminant Analysis
pmodel <- classify(data = data.trainS4, method = "PLDA", ref = "T",
                   class.labels = "condition", normalize = "deseq",
                   control = discreteControl(number = 5, repeats = 2,
                                             tuneLength = 10, parallel = TRUE))
pmodel
## 3. voom-based classifiers:
# voom-based Nearest Shrunken Centroids
vmodel <- classify(data = data.trainS4, normalize = "deseq", method = "voomNSC",
                   class.labels = "condition", ref = "T",
                   control = voomControl(number = 5, repeats = 2, tuneLength = 10))
vmodel
## End(Not run)
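The example sets aside 30% of the samples (the indices in ind) but only trains on the remainder. The sketch below shows one way the held-out samples might be scored. It assumes that predictClassify(), listed under See Also, takes the trained model followed by a DESeqDataSet of test samples and returns the predicted class labels; the argument order and return type may differ between MLSeq versions, so check ?predictClassify before relying on it. The seed and the names data.test, classts, data.testS4 and pred are choices made for this illustration.

```r
library(MLSeq)
library(DESeq2)
data(cervical)

set.seed(2128)  # arbitrary seed so that the split is reproducible

# Same subset and class labels as in the example: first 150 features, 29 "N" and 29 "T".
data <- cervical[1:150, ]
class <- data.frame(condition = factor(rep(c("N", "T"), c(29, 29))))

# 30% test, 70% train.
n <- ncol(data)
nTest <- ceiling(n * 0.3)
ind <- sample(n, nTest, FALSE)

data.train <- as.matrix(data[, -ind] + 1)
data.test  <- as.matrix(data[, ind] + 1)
classtr <- data.frame(condition = class[-ind, ])
classts <- data.frame(condition = class[ind, ])

# Both sets are wrapped in the same S4 class.
data.trainS4 <- DESeqDataSetFromMatrix(countData = data.train,
                                       colData = classtr,
                                       design = formula(~ 1))
data.testS4 <- DESeqDataSetFromMatrix(countData = data.test,
                                      colData = classts,
                                      design = formula(~ 1))

# Train one model on the training samples only.
vmodel <- classify(data = data.trainS4, method = "voomNSC",
                   normalize = "deseq", class.labels = "condition", ref = "T",
                   control = voomControl(number = 5, repeats = 2, tuneLength = 10))

# Assumed call: trained model first, test set second.
pred <- predictClassify(vmodel, data.testS4)

# Cross-tabulate predicted against true labels of the held-out samples.
table(Predicted = pred, Actual = classts$condition)
```

Keeping the test samples out of classify() entirely means the cross-tabulation reflects performance on data the model has not seen, unlike the resampling accuracies reported for the training set.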