Title: | Random Subspace Ensemble Classification and Variable Screening |
---|---|
Description: | We propose a general ensemble classification framework, the RaSE algorithm, for the sparse classification problem. In the RaSE algorithm, for each weak learner, some random subspaces are generated and the optimal one is chosen to train the model on the basis of some criterion. To adapt to the problem, a novel criterion, the ratio information criterion (RIC), is proposed based on the Kullback-Leibler divergence. Besides minimizing RIC, multiple criteria can be applied, for instance, minimizing the extended Bayesian information criterion (eBIC), the training error, the validation error, the cross-validation error, or the leave-one-out error. There are various choices of base classifier, for instance, linear discriminant analysis, quadratic discriminant analysis, k-nearest neighbour, logistic regression, decision trees, random forest, and support vector machines. The RaSE algorithm can also be applied to feature ranking, providing the importance of each feature based on its selected percentage in multiple subspaces. The RaSE framework can be extended to a general prediction framework, including both classification and regression. The selected percentages of variables can be used for variable screening. The latest version added the variable screening function for both regression and classification problems. |
Authors: | Ye Tian [aut, cre] and Yang Feng [aut] |
Maintainer: | Ye Tian <[email protected]> |
License: | GPL-2 |
Version: | 3.0.0 |
Built: | 2025-02-27 04:18:06 UTC |
Source: | https://github.com/cran/RaSEn |
Alon et al.'s Colon cancer dataset containing information on 62 samples for 2000 genes. The samples belong to tumor and normal colon tissues.
colon
A list with the predictor matrix x and binary 0/1 response vector y.
The link to this data set: http://genomics-pubs.princeton.edu/oncology/
Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D. and Levine, A.J., 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12), pp.6745-6750.
Tian, Y. and Feng, Y., 2021. RaSE: A Variable Screening Framework via Random Subspace Ensembles. arXiv preprint arXiv:2102.03892.
Predict the outcome of new observations based on the estimated RaSE classifier (Tian, Y. and Feng, Y., 2021).
## S3 method for class 'RaSE'
predict(object, newx, type = c("vote", "prob", "raw-vote", "raw-prob"), ...)
object | fitted |
newx | a set of new observations. Each row of |
type | the type of prediction output. Can be 'vote', 'prob', 'raw-vote' or 'raw-prob'. Default = 'vote'. |
... | additional arguments. |
Depends on the parameter type. See the list above.
Tian, Y. and Feng, Y., 2021. RaSE: Random subspace ensemble classification. Journal of Machine Learning Research, 22(45), pp.1-93.
Rase.
## Not run:
set.seed(0, kind = "L'Ecuyer-CMRG")
train.data <- RaModel("classification", 1, n = 100, p = 50)
test.data <- RaModel("classification", 1, n = 100, p = 50)
xtrain <- train.data$x
ytrain <- train.data$y
xtest <- test.data$x
ytest <- test.data$y

model.fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 100, iteration = 0, base = 'lda',
                  cores = 2, criterion = 'ric', ranking = TRUE)
ypred <- predict(model.fit, xtest)
mean(ypred != ytest)
## End(Not run)
Predict the outcome of new observations based on the estimated super RaSE classifier (Zhu, J. and Feng, Y., 2021).
## S3 method for class 'super_RaSE'
predict(object, newx, type = c("vote", "prob", "raw-vote", "raw-prob"), ...)
object | fitted |
newx | a set of new observations. Each row of |
type | the type of prediction output. Can be 'vote', 'prob', 'raw-vote' or 'raw-prob'. Default = 'vote'. |
... | additional arguments. |
Depends on the parameter type. See the list above.
Zhu, J. and Feng, Y., 2021. Super RaSE: Super Random Subspace Ensemble Classification. https://www.preprints.org/manuscript/202110.0042
Rase.
## Not run:
set.seed(0, kind = "L'Ecuyer-CMRG")
train.data <- RaModel("classification", 1, n = 100, p = 50)
test.data <- RaModel("classification", 1, n = 100, p = 50)
xtrain <- train.data$x
ytrain <- train.data$y
xtest <- test.data$x
ytest <- test.data$y

# fit a super RaSE classifier by sampling base learner from kNN, LDA and
# logistic regression in equal probability
fit <- Rase(xtrain = xtrain, ytrain = ytrain, B1 = 100, B2 = 100,
            base = c("knn", "lda", "logistic"),
            super = list(type = "separate", base.update = TRUE),
            criterion = "cv", cv = 5, iteration = 1, cores = 2)
ypred <- predict(fit, xtest)
mean(ypred != ytest)
## End(Not run)
Similar to the usual print methods, this function summarizes results from a fitted 'RaSE' object.
## S3 method for class 'RaSE'
print(x, ...)
x | fitted |
... | additional arguments. |
No value is returned.
Rase.
set.seed(0, kind = "L'Ecuyer-CMRG")
train.data <- RaModel("classification", 1, n = 100, p = 50)
xtrain <- train.data$x
ytrain <- train.data$y

# test RaSE classifier with LDA base classifier
fit <- Rase(xtrain, ytrain, B1 = 50, B2 = 50, iteration = 0, cutoff = TRUE,
            base = 'lda', cores = 2, criterion = 'ric', ranking = TRUE)

# print the summarized results
print(fit)
Similar to the usual print methods, this function summarizes results from a fitted 'super_RaSE' object.
## S3 method for class 'super_RaSE'
print(x, ...)
x | fitted |
... | additional arguments. |
No value is returned.
Rase.
set.seed(0, kind = "L'Ecuyer-CMRG")
train.data <- RaModel("classification", 1, n = 100, p = 50)
xtrain <- train.data$x
ytrain <- train.data$y

# test RaSE classifier with LDA base classifier
fit <- Rase(xtrain, ytrain, B1 = 50, B2 = 50, iteration = 0, cutoff = TRUE,
            base = 'lda', cores = 2, criterion = 'ric', ranking = TRUE)

# print the summarized results
print(fit)
Generate data from various models in two papers.
RaModel generates data from 4 models described in Tian, Y. and Feng, Y., 2021(b) and 8 models described in Tian, Y. and Feng, Y., 2021(a).
RaModel(model.type, model.no, n, p, p0 = 1/2, sparse = TRUE)
model.type | indicator of the paper covering the model, which can be 'classification' (Tian, Y. and Feng, Y., 2021(b)) or 'screening' (Tian, Y. and Feng, Y., 2021(a)). |
model.no | model number. It can be 1-4 when |
n | sample size |
p | data dimension |
p0 | marginal probability of class 0. Default = 0.5. Only used when |
sparse | a logical object indicating model sparsity. Default = TRUE. Only used when |
x | n * p matrix. n observations and p features. |
y | n responses. |
When model.type = 'classification' and sparse = TRUE, models 1, 2, 4 require and model 3 requires . When model.type = 'classification' and sparse = FALSE, models 1 and 4 require and , respectively. When model.type = 'screening', models 1, 4, 5 and 7 require . Models 2 and 8 require . Model 3 requires . Model 5 requires .
Tian, Y. and Feng, Y., 2021(a). RaSE: A variable screening framework via random subspace ensembles. Journal of the American Statistical Association, (just-accepted), pp.1-30.
Tian, Y. and Feng, Y., 2021(b). RaSE: Random subspace ensemble classification. Journal of Machine Learning Research, 22(45), pp.1-93.
train.data <- RaModel("classification", 1, n = 100, p = 50)
xtrain <- train.data$x
ytrain <- train.data$y

## Not run:
train.data <- RaModel("screening", 2, n = 100, p = 50)
xtrain <- train.data$x
ytrain <- train.data$y
## End(Not run)
This function plots the feature ranking results from a fitted 'RaSE' object via ggplot2. In the figure, the x-axis represents the feature number and the y-axis represents the selected percentage of each feature in B1 subspaces.
RaPlot(object, main = NULL, xlab = "feature", ylab = "selected percentage", ...)
object | fitted |
main | title of the plot. Default = |
xlab | the label of x-axis. Default = 'feature'. |
ylab | the label of y-axis. Default = 'selected percentage'. |
... | additional arguments. |
A 'ggplot' object.
Tian, Y. and Feng, Y., 2021. RaSE: Random subspace ensemble classification. Journal of Machine Learning Research, 22(45), pp.1-93.
Rase.
set.seed(0, kind = "L'Ecuyer-CMRG")
train.data <- RaModel("classification", 1, n = 100, p = 50)
xtrain <- train.data$x
ytrain <- train.data$y

# fit RaSE classifier with QDA base classifier
fit <- Rase(xtrain, ytrain, B1 = 50, B2 = 50, iteration = 1, base = 'qda',
            cores = 2, criterion = 'ric')

# plot the selected percentage of each feature appearing in B1 subspaces
RaPlot(fit)
Rank the features by selected percentages provided by the output from RaScreen.
RaRank(object, selected.num = "all positive", iteration = object$iteration)
object | output from |
selected.num | the number of selected variables. The user can either choose from the following popular options or input a positive integer no larger than the dimension. |
iteration | indicates which iteration's results to use. It should be a positive integer. Default = the maximal iteration round used by the output from |
Selected variables (indexes).
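As a rough illustration of ranking by selected percentage, the base R sketch below works on a made-up vector of selected proportions; it does not call the package. The names `selected_perc`, `all_positive` and `top_k` are hypothetical, and the reading of "all positive" as "every feature with a positive selected percentage" is an assumption inferred from the option name, not confirmed behaviour of RaRank.

```r
# Made-up selected proportions for p = 8 features (illustration only).
selected_perc <- c(0.02, 0.91, 0.00, 0.40, 0.75, 0.00, 0.13, 0.05)
n <- 20  # hypothetical sample size

# Feature indexes ordered from most to least frequently selected.
ranking <- order(selected_perc, decreasing = TRUE)

# Assumed meaning of "all positive": keep features selected at least once.
all_positive <- ranking[selected_perc[ranking] > 0]

# "n/logn": keep the top floor(n / log(n)) features, capped at p.
k <- min(floor(n / log(n)), length(selected_perc))
top_k <- ranking[seq_len(k)]

all_positive
top_k
```

With these numbers both rules happen to keep the same six features, because exactly six proportions are positive and floor(20 / log(20)) is 6.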
Tian, Y. and Feng, Y., 2021(a). RaSE: A variable screening framework via random subspace ensembles. Journal of the American Statistical Association, (just-accepted), pp.1-30.
## Not run:
set.seed(0, kind = "L'Ecuyer-CMRG")
train.data <- RaModel("screening", 1, n = 100, p = 100)
xtrain <- train.data$x
ytrain <- train.data$y

# test RaSE screening with linear regression model and BIC
fit <- RaScreen(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, model = 'lm',
                cores = 2, criterion = 'bic')

# Select floor(n/logn) variables
RaRank(fit, selected.num = "n/logn")
## End(Not run)
RaSE is a general framework for variable screening. In RaSE screening, to select each of the B1 subspaces, B2 random subspaces are generated and the optimal one is chosen according to some criterion. Then the selected proportions (equivalently, percentages) of variables in the B1 subspaces are used as an importance measure to rank these variables.
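The selection scheme described above can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not the package's implementation: `toy_screen` is a hypothetical helper, subspace sizes are drawn uniformly from 1 to D, and the criterion is the BIC of an ordinary linear model.

```r
# Hypothetical helper (not part of the package): for each of B1 rounds,
# draw B2 random subspaces, keep the one with the smallest BIC, and tally
# how often each feature appears in the kept subspaces.
toy_screen <- function(x, y, B1 = 50, B2 = 20, D = 3) {
  p <- ncol(x)
  counts <- numeric(p)
  for (i in seq_len(B1)) {
    best_bic <- Inf
    best_set <- integer(0)
    for (j in seq_len(B2)) {
      d <- sample.int(D, 1)               # subspace size, assumed uniform on 1..D
      s <- sort(sample.int(p, d))         # random subspace of that size
      xs <- x[, s, drop = FALSE]
      b <- BIC(lm(y ~ xs))                # criterion: BIC of a linear model
      if (b < best_bic) {
        best_bic <- b
        best_set <- s
      }
    }
    counts[best_set] <- counts[best_set] + 1
  }
  counts / B1                             # selected proportion of each feature
}

set.seed(1)
n <- 100
p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 2 * x[, 1] - 3 * x[, 2] + rnorm(n)   # only features 1 and 2 matter
sel <- toy_screen(x, y)
round(sel, 2)
```

Features that carry signal tend to appear in the winning subspace more often, which is why the selected proportion can serve as an importance measure.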
RaScreen(xtrain, ytrain, xval = NULL, yval = NULL, B1 = 200, B2 = NULL, D = NULL,
         dist = NULL, model = NULL, criterion = NULL, k = 5, cores = 1,
         seed = NULL, iteration = 0, cv = 5, scale = FALSE, C0 = 0.1,
         kl.k = NULL, classification = NULL, ...)
xtrain | n * p observation matrix. n observations, p features. |
ytrain | n 0/1 observations. |
xval | observation matrix for validation. Default = |
yval | 0/1 observation for validation. Default = |
B1 | the number of weak learners. Default = 200. |
B2 | the number of subspace candidates generated for each weak learner. Default = |
D | the maximal subspace size when generating random subspaces. Default = |
dist | the distribution for features when generating random subspaces. Default = |
model | the model to use. Default = 'lda' when |
criterion | the criterion to choose the best subspace. Default = 'ric' when |
k | the number of nearest neighbors considered when |
cores | the number of cores used for parallel computing. Default = 1. |
seed | the random seed assigned at the start of the algorithm, which can be a real number or |
iteration | the number of iterations. Default = 0. |
cv | the number of cross-validations used. Default = 5. Only useful when |
scale | whether to normalize the data. Logical, default = FALSE. |
C0 | a positive constant used when |
kl.k | the number of nearest neighbors used to estimate RIC in a non-parametric way. Default = |
classification | the indicator of the problem type, which can be TRUE, FALSE or |
... | additional arguments. |
A list including the following items.
model | the model used in RaSE screening. |
criterion | the criterion to choose the best subspace for each weak learner. |
B1 | the number of selected subspaces. |
B2 | the number of subspace candidates generated for each of B1 subspaces. |
n | the sample size. |
p | the dimension of data. |
D | the maximal subspace size when generating random subspaces. |
iteration | the number of iterations. |
selected.perc | A list of length ( |
scale | a list of scaling parameters, including the scaling center and the scale parameter for each feature. Equals to |
Tian, Y. and Feng, Y., 2021(a). RaSE: A variable screening framework via random subspace ensembles. Journal of the American Statistical Association, (just-accepted), pp.1-30.
Tian, Y. and Feng, Y., 2021(b). RaSE: Random subspace ensemble classification. Journal of Machine Learning Research, 22(45), pp.1-93.
Chen, J. and Chen, Z., 2008. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3), pp.759-771.
Chen, J. and Chen, Z., 2012. Extended BIC for small-n-large-P sparse GLM. Statistica Sinica, pp.555-574.
Schwarz, G., 1978. Estimating the dimension of a model. The annals of statistics, 6(2), pp.461-464.
set.seed(0, kind = "L'Ecuyer-CMRG")
train.data <- RaModel("screening", 1, n = 100, p = 100)
xtrain <- train.data$x
ytrain <- train.data$y

# test RaSE screening with linear regression model and BIC
fit <- RaScreen(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, model = 'lm',
                cores = 2, criterion = 'bic')

# Select D variables
RaRank(fit, selected.num = "D")

## Not run:
# test RaSE screening with knn model and 5-fold cross-validation MSE
fit <- RaScreen(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, model = 'knn',
                cores = 2, criterion = 'cv', cv = 5)

# Select n/logn variables
RaRank(fit, selected.num = "n/logn")

# test RaSE screening with SVM and 5-fold cross-validation MSE
fit <- RaScreen(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, model = 'svm',
                cores = 2, criterion = 'cv', cv = 5)

# Select n/logn variables
RaRank(fit, selected.num = "n/logn")

# test RaSE screening with logistic regression model and eBIC (gam = 0.5).
# Set iteration number = 1
train.data <- RaModel("screening", 6, n = 100, p = 100)
xtrain <- train.data$x
ytrain <- train.data$y

fit <- RaScreen(xtrain, ytrain, B1 = 100, B2 = 100, iteration = 1,
                model = 'logistic', cores = 2, criterion = 'ebic', gam = 0.5)

# Select n/logn variables from the selected percentage after one iteration round
RaRank(fit, selected.num = "n/logn", iteration = 1)
## End(Not run)
RaSE is a general ensemble classification framework to solve the sparse classification problem. In the RaSE algorithm, for each of the B1 weak learners, B2 random subspaces are generated and the optimal one is chosen to train the model on the basis of some criterion.
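The following base R sketch illustrates the idea in the paragraph above under simplifying assumptions; it is not the package's code. The helpers `centroid_fit`, `centroid_predict`, `toy_rase` and `toy_predict` are hypothetical, a nearest-centroid rule stands in for the real base classifiers, the criterion is the training error, and the vote threshold is fixed at 0.5 rather than chosen empirically.

```r
# Stand-in base classifier (assumption): nearest class centroid on subspace s.
centroid_fit <- function(x, y, s) {
  list(s = s,
       m0 = colMeans(x[y == 0, s, drop = FALSE]),
       m1 = colMeans(x[y == 1, s, drop = FALSE]))
}
centroid_predict <- function(fit, newx) {
  z <- newx[, fit$s, drop = FALSE]
  d0 <- rowSums(sweep(z, 2, fit$m0)^2)    # squared distance to class-0 centroid
  d1 <- rowSums(sweep(z, 2, fit$m1)^2)    # squared distance to class-1 centroid
  as.integer(d1 < d0)
}

# For each of B1 weak learners, try B2 random subspaces of size at most D
# and keep the fit with the smallest training error.
toy_rase <- function(x, y, B1 = 30, B2 = 20, D = 3) {
  p <- ncol(x)
  lapply(seq_len(B1), function(i) {
    best <- NULL
    best_err <- Inf
    for (j in seq_len(B2)) {
      s <- sort(sample.int(p, sample.int(D, 1)))
      fit <- centroid_fit(x, y, s)
      err <- mean(centroid_predict(fit, x) != y)
      if (err < best_err) {
        best_err <- err
        best <- fit
      }
    }
    best
  })
}

# Aggregate the B1 weak learners by averaging their 0/1 votes.
toy_predict <- function(fits, newx, cutoff = 0.5) {
  votes <- do.call(cbind, lapply(fits, centroid_predict, newx = newx))
  as.integer(rowMeans(votes) > cutoff)
}

set.seed(1)
n <- 200
p <- 20
y <- rbinom(n, 1, 0.5)
x <- matrix(rnorm(n * p), n, p)
x[, 1] <- x[, 1] + 2 * y                  # only features 1 and 2 are informative
x[, 2] <- x[, 2] - 2 * y
train <- 1:100
test <- 101:200
fits <- toy_rase(x[train, ], y[train])
pred <- toy_predict(fits, x[test, ])
err <- mean(pred != y[test])
err
```

Restricting each weak learner to a small, well-chosen subspace is what lets the ensemble cope with many irrelevant features in this toy setting.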
Rase(xtrain, ytrain, xval = NULL, yval = NULL, B1 = 200, B2 = 500, D = NULL,
     dist = NULL, base = NULL,
     super = list(type = c("separate"), base.update = TRUE),
     criterion = NULL, ranking = TRUE, k = c(3, 5, 7, 9, 11), cores = 1,
     seed = NULL, iteration = 0, cutoff = TRUE, cv = 5, scale = FALSE, C0 = 0.1,
     kl.k = NULL, lower.limits = NULL, upper.limits = NULL, weights = NULL, ...)
xtrain | n * p observation matrix. n observations, p features. |
ytrain | n 0/1 observations. |
xval | observation matrix for validation. Default = |
yval | 0/1 observation for validation. Default = |
B1 | the number of weak learners. Default = 200. |
B2 | the number of subspace candidates generated for each weak learner. Default = 500. |
D | the maximal subspace size when generating random subspaces. Default = |
dist | the distribution for features when generating random subspaces. Default = |
base | the type of base classifier. Default = 'lda'. Can be either a single string chosen from the following options or a string/probability vector. When it indicates a single type of base classifier, the classical RaSE model (Tian, Y. and Feng, Y., 2021(b)) will be fitted. When it is a string vector that includes multiple base classifier types, a super RaSE model (Zhu, J. and Feng, Y., 2021) will be fitted by sampling base classifiers with equal probability. It can also be a probability vector with row names corresponding to the specific classifier types, in which case a super RaSE model will be trained by sampling base classifiers with the given sampling probabilities. |
super | a list of control parameters for super RaSE (Zhu, J. and Feng, Y., 2021). Not used when base is a single string. Should be a list object with the following components: |
criterion | the criterion to choose the best subspace for each weak learner. For the classical RaSE (when |
ranking | whether the function outputs the selected percentage of each feature in B1 subspaces. Logical, default = TRUE. |
k | the number of nearest neighbors considered when |
cores | the number of cores used for parallel computing. Default = 1. |
seed | the random seed assigned at the start of the algorithm, which can be a real number or |
iteration | the number of iterations. Default = 0. |
cutoff | whether to use the empirically optimal threshold. Logical, default = TRUE. If it is FALSE, the threshold will be set as 0.5. |
cv | the number of cross-validations used. Default = 5. Only useful when |
scale | whether to normalize the data. Logical, default = FALSE. |
C0 | a positive constant used when |
kl.k | the number of nearest neighbors used to estimate RIC in a non-parametric way. Default = |
lower.limits | the vector of lower limits for each coefficient in logistic regression. Should be a vector of length equal to the number of variables (the column number of |
upper.limits | the vector of upper limits for each coefficient in logistic regression. Should be a vector of length equal to the number of variables (the column number of |
weights | observation weights. Should be a vector of length equal to training sample size (the length of |
... | additional arguments. |
An object with S3 class 'RaSE' if base indicates a single base classifier.
marginal | the marginal probability for each class. |
base | the type of base classifier. |
criterion | the criterion to choose the best subspace for each weak learner. |
B1 | the number of weak learners. |
B2 | the number of subspace candidates generated for each weak learner. |
D | the maximal subspace size when generating random subspaces. |
iteration | the number of iterations. |
fit.list | sequence of B1 fitted base classifiers. |
cutoff | the empirically optimal threshold. |
subspace | sequence of subspaces corresponding to B1 weak learners. |
ranking | the selected percentage of each feature in B1 subspaces. |
scale | a list of scaling parameters, including the scaling center and the scale parameter for each feature. Equals to |
An object with S3 class 'super_RaSE' if base includes multiple base classifiers or the sampling probability of multiple classifiers.
marginal | the marginal probability for each class. |
base | the list of B1 base classifier types. |
criterion | the criterion to choose the best subspace for each weak learner. |
B1 | the number of weak learners. |
B2 | the number of subspace candidates generated for each weak learner. |
D | the maximal subspace size when generating random subspaces. |
iteration | the number of iterations. |
fit.list | sequence of B1 fitted base classifiers. |
cutoff | the empirically optimal threshold. |
subspace | sequence of subspaces corresponding to B1 weak learners. |
ranking.feature | the selected percentage of each feature corresponding to each type of classifier. |
ranking.base | the selected percentage of each classifier type in the selected B1 learners. |
scale | a list of scaling parameters, including the scaling center and the scale parameter for each feature. Equals to |
Ye Tian (maintainer, [email protected]) and Yang Feng. The authors thank Yu Cao (Exeter Finance) and his team for many helpful suggestions and discussions.
Tian, Y. and Feng, Y., 2021(a). RaSE: A variable screening framework via random subspace ensembles. Journal of the American Statistical Association, (just-accepted), pp.1-30.
Tian, Y. and Feng, Y., 2021(b). RaSE: Random subspace ensemble classification. Journal of Machine Learning Research, 22(45), pp.1-93.
Zhu, J. and Feng, Y., 2021. Super RaSE: Super Random Subspace Ensemble Classification. https://www.preprints.org/manuscript/202110.0042
Chen, J. and Chen, Z., 2008. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3), pp.759-771.
Chen, J. and Chen, Z., 2012. Extended BIC for small-n-large-P sparse GLM. Statistica Sinica, pp.555-574.
Akaike, H., 1973. Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory, 1973 (pp. 267-281). Akademiai Kaido.
Schwarz, G., 1978. Estimating the dimension of a model. The annals of statistics, 6(2), pp.461-464.
predict.RaSE, RaModel, print.RaSE, print.super_RaSE, RaPlot, RaScreen.
set.seed(0, kind = "L'Ecuyer-CMRG")
train.data <- RaModel("classification", 1, n = 100, p = 50)
test.data <- RaModel("classification", 1, n = 100, p = 50)
xtrain <- train.data$x
ytrain <- train.data$y
xtest <- test.data$x
ytest <- test.data$y

# test RaSE classifier with LDA base classifier
fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, base = 'lda',
            cores = 2, criterion = 'ric')
mean(predict(fit, xtest) != ytest)

## Not run:
# test RaSE classifier with LDA base classifier and 1 iteration round
fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 1, base = 'lda',
            cores = 2, criterion = 'ric')
mean(predict(fit, xtest) != ytest)

# test RaSE classifier with QDA base classifier and 1 iteration round
fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 1, base = 'qda',
            cores = 2, criterion = 'ric')
mean(predict(fit, xtest) != ytest)

# test RaSE classifier with kNN base classifier
fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, base = 'knn',
            cores = 2, criterion = 'loo')
mean(predict(fit, xtest) != ytest)

# test RaSE classifier with logistic regression base classifier
fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, base = 'logistic',
            cores = 2, criterion = 'bic')
mean(predict(fit, xtest) != ytest)

# test RaSE classifier with SVM base classifier
fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, base = 'svm',
            cores = 2, criterion = 'training')
mean(predict(fit, xtest) != ytest)

# test RaSE classifier with random forest base classifier
fit <- Rase(xtrain, ytrain, B1 = 20, B2 = 10, iteration = 0, base = 'randomforest',
            cores = 2, criterion = 'cv', cv = 3)
mean(predict(fit, xtest) != ytest)

# fit a super RaSE classifier by sampling base learner from kNN, LDA and logistic
# regression in equal probability
fit <- Rase(xtrain = xtrain, ytrain = ytrain, B1 = 100, B2 = 100,
            base = c("knn", "lda", "logistic"),
            super = list(type = "separate", base.update = TRUE),
            criterion = "cv", cv = 5, iteration = 1, cores = 2)
mean(predict(fit, xtest) != ytest)

# fit a super RaSE classifier by sampling base learner from random forest, LDA and
# SVM with probability 0.2, 0.5 and 0.3
fit <- Rase(xtrain = xtrain, ytrain = ytrain, B1 = 100, B2 = 100,
            base = c(randomforest = 0.2, lda = 0.5, svm = 0.3),
            super = list(type = "separate", base.update = FALSE),
            criterion = "cv", cv = 5, iteration = 0, cores = 2)
mean(predict(fit, xtest) != ytest)
## End(Not run)
Affymetrix rat genome 230 2.0 array annotation data (chip rat2302). For this data set, 120 twelve-week old male rats were selected for tissue harvesting from the eyes and for microarray analysis. The expression of gene TRIM32 is set as the response and the 18975 probes that are expressed in the eye tissue are considered as the predictors.
rat
A list with the predictor matrix x and the response vector y.
The link to this data set: https://bioconductor.org/packages/release/data/annotation/html/rat2302.db.html
Scheetz, T.E., Kim, K.Y.A., Swiderski, R.E., Philp, A.R., Braun, T.A., Knudtson, K.L., Dorrance, A.M., DiBona, G.F., Huang, J., Casavant, T.L. and Sheffield, V.C., 2006. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences, 103(39), pp.14429-14434.
Tian, Y. and Feng, Y., 2021. RaSE: A Variable Screening Framework via Random Subspace Ensembles. arXiv preprint arXiv:2102.03892.