library('randomForest')
library('data.table')
library('stats')
library('binomialRF')
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
The $\textit{binomialRF}$ package is a $\textit{randomForest}$ feature selection wrapper (Zaim 2019) that treats the random forest as a binomial process, where each tree represents an i.i.d. Bernoulli random variable for the event of selecting $X_j$ as the main splitting variable at the root of that tree. Algorithm 1 below describes the technical details.
\begin{algorithm}
\caption{binomialRF Feature Selection Algorithm}
\label{alg1}
\begin{algorithmic}
\For{$i = 1:N$}
    \State Grow $T_{i}$
    \State $S_{ij} = \left\{ \begin{array}{ll} 1 & X_j \text{ is the splitting variable at the root node of } T_i \\ 0 & \text{otherwise} \end{array} \right.$
    \State $S \gets S_{ij}$
\EndFor
\State $S_j = \sum_{i=1}^N S_{ij} = \sum_{i=1}^N I[X_j \in root(T_i)]$
\State $S_j \sim \text{binomial}(N, p_0), \hspace{6mm} \text{where } p_0 = \bigg(1 - \prod_{i=1}^m \frac{L-i}{L-(i-1)} \bigg)\frac{1}{m}$
\State Test for significance
\State Adjust for multiple comparisons
\end{algorithmic}
\end{algorithm}
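To make the null model concrete, the following minimal sketch (not part of the package; the values of m and L are illustrative assumptions) computes $p_0$ for a forest in which each tree samples $m$ of $L$ candidate features at the root split:

### Illustrative values (assumptions, not package defaults):
### L = total number of features, m = mtry, the number sampled at each split
L <- 10
m <- 3

### p0 = P(X_j is among the m sampled variables) * 1/m,
### following the formula in Algorithm 1
i  <- seq_len(m)
p0 <- (1 - prod((L - i) / (L - (i - 1)))) * (1 / m)
p0
#> [1] 0.1

Note that the product telescopes to $(L-m)/L$, so $p_0$ simplifies to $1/L$; this is why successprob = 1/ncol(X) is used when constructing the null distribution below.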
$\textit{binomialRF}$ is a wrapper algorithm that internally calls and grows a randomForest object based on the supplied parameters. First, we generate simple simulated logistic data as follows:
$X \sim MVN(0, I_{10})$,
$p(x) = \frac{1}{1+e^{-X\beta}}$, and
$y \sim \text{Bernoulli}(p)$,
where $\beta$ is a vector of coefficients whose first 2 entries are set to 3 and whose remaining entries are 0:
$$\beta = \begin{bmatrix} 3 & 3 & 0 & \cdots & 0 \end{bmatrix}^T$$
set.seed(324)

### Generate multivariate normal data in R^10 (100 observations)
X = matrix(rnorm(1000), ncol = 10)

### let the first 2 coefficients be 3 and the remaining 8 be 0
trueBeta = c(rep(3, 2), rep(0, 8))

### do the logistic transform and generate the labels
z = 1 + X %*% trueBeta
pr = 1/(1 + exp(-z))
y = rbinom(100, 1, pr)
This generates data that look like the following:
knitr::kable(head(cbind(round(X,2),y), 10))
Since binomialRF requires a correlation adjustment to account for the tree-to-tree sampling correlation, we first generate the appropriately parameterized stable correlated binomial distribution. Note that the correlbinom function call can take a while to execute for a large number of trials (i.e., trials > 1000).
require(correlbinom)

rho = 0.33
ntrees = 250
cbinom = correlbinom(rho, successprob = 1/ncol(X), trials = ntrees,
                     precision = 1024, model = 'kuk')
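As a quick sanity check, and assuming (per the correlbinom documentation) that the returned object is a vector of probabilities for observing $0, 1, \ldots,$ ntrees root-split successes, the result should form a valid probability mass function:

### Assumption: cbinom is a probability mass function over 0..ntrees successes
length(cbinom)   # expected to be ntrees + 1
sum(cbinom)      # should be numerically close to 1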
Then we can run the binomialRF function as below:
binom.rf <- binomialRF::binomialRF(X, factor(y), fdr.threshold = .05,
                                   ntrees = ntrees, percent_features = .6,
                                   fdr.method = 'BY',
                                   user_cbinom_dist = cbinom,
                                   sampsize = round(nrow(X)*.33))
print(binom.rf)
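To work with the results programmatically, the returned object can be treated as a data.frame. The sketch below is an illustration under assumptions: the adjusted p-value column name adjSignificance is a guess, so verify it against str(binom.rf) for your installed version of the package.

### Keep only features whose BY-adjusted p-value clears the FDR threshold.
### 'adjSignificance' is an assumed column name; verify with str(binom.rf).
sig.features <- binom.rf[binom.rf$adjSignificance < 0.05, ]
print(sig.features)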
Note that since the binomial exact test is based on a test statistic measuring how often a feature is selected, a dominant feature will mask all remaining 'important' features, as it will almost always be selected as the root splitting variable. It is therefore important to set the percent_features parameter to a value less than 1. The results below show how setting this parameter to a fraction between 0.6 and 1 allows other features to stand out as important.
# set.seed(324)
binom.rf <- binomialRF::binomialRF(X, factor(y), fdr.threshold = .05,
                                   ntrees = ntrees, percent_features = 1,
                                   fdr.method = 'BY',
                                   user_cbinom_dist = cbinom,
                                   sampsize = round(nrow(X)*.33))
cat('\n\nbinomialRF 100%\n\n')
print(binom.rf)

binom.rf <- binomialRF::binomialRF(X, factor(y), fdr.threshold = .05,
                                   ntrees = ntrees, percent_features = .8,
                                   fdr.method = 'BY',
                                   user_cbinom_dist = cbinom,
                                   sampsize = round(nrow(X)*.33))
cat('\n\nbinomialRF 80%\n\n')
print(binom.rf)

binom.rf <- binomialRF::binomialRF(X, factor(y), fdr.threshold = .05,
                                   ntrees = ntrees, percent_features = .6,
                                   fdr.method = 'BY',
                                   user_cbinom_dist = cbinom,
                                   sampsize = round(nrow(X)*.33))
cat('\n\nbinomialRF 60%\n\n')
print(binom.rf)
We recommend growing at least 500 to 1,000 trees so that the algorithm has a chance to stabilize, and choosing ntrees as a function of the number of features in your dataset. The ntrees tuning parameter must be set in conjunction with percent_features, as these two are interconnected, and with the number of true features in the model in mind. Since the correlbinom function call is slow to execute for ntrees > 1000, we recommend growing random forests with 500 to 1,000 trees.
set.seed(324)

## 250 trees (using the cbinom distribution generated above with trials = 250)
binom.rf250 <- binomialRF::binomialRF(X, factor(y), fdr.threshold = .05,
                                      ntrees = ntrees, percent_features = .5,
                                      fdr.method = 'BY',
                                      user_cbinom_dist = cbinom,
                                      sampsize = round(nrow(X)*.33))

## 500 trees: regenerate the null distribution to match the new forest size
rho = 0.33
ntrees = 500
cbinom = correlbinom(rho, successprob = 1/ncol(X), trials = ntrees,
                     precision = 1024, model = 'kuk')

binom.rf500 <- binomialRF::binomialRF(X, factor(y), fdr.threshold = .05,
                                      ntrees = ntrees, percent_features = .5,
                                      fdr.method = 'BY',
                                      user_cbinom_dist = cbinom,
                                      sampsize = round(nrow(X)*.33))

cat('\n\nbinomialRF 250 trees\n\n')
print(binom.rf250)

cat('\n\nbinomialRF 500 trees\n\n')
print(binom.rf500)