Random Forest Regressor implementation for multi-output in scikit-learn

I would like to get more information regarding the implementation of the random forest regressor with more than one output, y, in sklearn. According to the documentation, the implementation supports targets of shape (n_samples, n_outputs). If this is the case, does this mean that the algorithm creates n models for n outputs (e.g. 100 variables to predict -> 100 models), or is the implementation more complicated?
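For reference, a minimal sketch of what the API accepts: a single RandomForestRegressor can be fitted directly on a target of shape (n_samples, n_outputs), while the MultiOutputRegressor wrapper is the explicit one-model-per-output strategy. The data below is made up purely for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
# Two toy targets derived from the features
Y = np.column_stack([X[:, 0] + 0.1 * rng.rand(200),
                     X[:, 1] - X[:, 2] + 0.1 * rng.rand(200)])

# One forest fitted on both outputs at once
rf_multi = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, Y)
print(rf_multi.predict(X[:3]).shape)   # (3, 2)

# Explicit one-model-per-output strategy, for comparison
rf_per_output = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=100, random_state=0)).fit(X, Y)
print(rf_per_output.predict(X[:3]).shape)   # (3, 2)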

Related

What is the classifier used in scikit-learn's VotingClassifier?

I looked at the documentation of scikit-learn, but it is not clear to me what sort of classification method is used under the hood of the VotingClassifier. Is it logistic regression, an SVM, or some sort of tree method?
I'm interested in ways to vary the classifier method used under the hood. If scikit-learn does not offer such an option, is there a Python package that integrates easily with scikit-learn and offers such functionality?
EDIT:
I meant the classifier method used for the second level model. I'm perfectly aware that the first level classifiers can be any type of classifier supported by scikit-learn.
The second level classifier uses the predictions of the first level classifiers as inputs. So my question is - what method does this second level classifier use? Is it logistic regression? Or something else? Can I change it?
General
The VotingClassifier is not limited to one specific method/algorithm. You can choose multiple different algorithms and combine them into one VotingClassifier. See the example below:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
clf1 = LogisticRegression()
clf2 = RandomForestClassifier()
clf3 = SVC()
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('svm', clf3)], voting='hard')
Read more about the usage here: VotingClassifier-Usage.
When it comes down to how the VotingClassifier "votes" you can either specify voting='hard' or voting='soft'. See the paragraph below for more detail.
Voting
Majority Class Labels (Majority/Hard Voting)
In majority voting, the predicted class label for a particular sample
is the class label that represents the majority (mode) of the class
labels predicted by each individual classifier.
E.g., if the prediction for a given sample is
classifier 1 -> class 1
classifier 2 -> class 1
classifier 3 -> class 2
the VotingClassifier (with voting='hard') would classify the sample as “class 1” based on the majority class label.
Source: scikit-learn-majority-class-labels-majority-hard-voting
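To make this concrete, here is a tiny sketch of the hard-voting rule itself; the three predicted labels are just the ones from the example above.
from collections import Counter

votes = [1, 1, 2]   # predictions from classifiers 1, 2 and 3
majority_label = Counter(votes).most_common(1)[0][0]
print(majority_label)   # -> 1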
Weighted Average Probabilities (Soft Voting)
In contrast to majority voting (hard voting), soft voting returns the
class label as argmax of the sum of predicted probabilities.
Specific weights can be assigned to each classifier via the weights
parameter. When weights are provided, the predicted class
probabilities for each classifier are collected, multiplied by the
classifier weight, and averaged. The final class label is then derived
from the class label with the highest average probability.
Source/Read more here: scikit-learn-weighted-average-probabilities-soft-voting
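To make the weighted averaging concrete, here is a small numeric sketch; the per-classifier probabilities and weights below are made up for illustration.
import numpy as np

# Predicted probabilities for one sample over classes [0, 1]
probas = np.array([[0.9, 0.1],    # classifier 1
                   [0.4, 0.6],    # classifier 2
                   [0.3, 0.7]])   # classifier 3
weights = np.array([2, 1, 1])     # per-classifier weights

avg = np.average(probas, axis=0, weights=weights)
print(avg)           # [0.625 0.375] -> weighted average class probabilities
print(avg.argmax())  # 0 -> class label chosen by soft voting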
The VotingClassifier does not fit any meta-model on the output of the first-level classifiers.
It just aggregates the output of each first-level classifier by taking the mode (if voting is hard) or by averaging the probabilities (if voting is soft).
In simple terms, VotingClassifier does not learn anything from the first level of classifiers. It only consolidates the output of the individual classifiers.
If you want your meta-model to be more intelligent, try using AdaBoost or gradient boosting models.
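If you do want a second-level model that is actually learned from the first-level predictions, one option (separate from VotingClassifier itself) is a stacking ensemble. A minimal sketch using scikit-learn's StackingClassifier, available in recent releases, where final_estimator is the swappable meta-model:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=0)),
                ('svm', SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000))  # the second-level model
stack.fit(X, y)
print(stack.score(X, y))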

Balanced Random Forest in scikit-learn (python)

I'm wondering if there is an implementation of the Balanced Random Forest (BRF) in recent versions of the scikit-learn package. BRF is used in the case of imbalanced data. It works like a normal RF, but for each bootstrap iteration it balances the classes by undersampling the prevalent class. For example, given two classes with N0 = 100 and N1 = 30 instances, at each random sampling it draws (with replacement) 30 instances from the first class and the same number of instances from the second class, i.e. it trains each tree on a balanced data set. For more information, please refer to this paper.
RandomForestClassifier() does have the 'class_weight=' parameter, which might be set to 'balanced', but I'm not sure that it is related to downsampling of the bootstrapped training samples.
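To illustrate the balanced bootstrap described in the question, here is a small NumPy sketch of one such draw (class sizes N0 = 100, N1 = 30 as above):
import numpy as np

rng = np.random.RandomState(0)
y = np.array([0] * 100 + [1] * 30)
n_min = np.bincount(y).min()   # size of the minority class (30)

# Draw n_min indices with replacement from each class -> balanced sample
idx = np.concatenate([rng.choice(np.where(y == c)[0], size=n_min, replace=True)
                      for c in np.unique(y)])
print(np.bincount(y[idx]))     # [30 30]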
What you're looking for is the BalancedBaggingClassifier from imblearn.
imblearn.ensemble.BalancedBaggingClassifier(base_estimator=None,
    n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True,
    bootstrap_features=False, oob_score=False, warm_start=False, ratio='auto',
    replacement=False, n_jobs=1, random_state=None, verbose=0)
Effectively, what it allows you to do is successively undersample your majority class while fitting an estimator on top. You can use a random forest or any base estimator from scikit-learn. Here is an example.
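A minimal sketch of that usage, assuming the imbalanced-learn package is installed (the toy data is made up for illustration):
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly a 90% / 10% class split
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each bagging iteration undersamples the majority class before fitting
# the base estimator (a decision tree by default, mimicking a balanced forest)
bbc = BalancedBaggingClassifier(n_estimators=10, random_state=0)
bbc.fit(X_train, y_train)
print(bbc.score(X_test, y_test))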
There is now a class in imblearn called BalancedRandomForestClassifier. It works similarly to the previously mentioned BalancedBaggingClassifier, but is specifically for random forests.
from imblearn.ensemble import BalancedRandomForestClassifier
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
brf.fit(X_train, y_train)
y_pred = brf.predict(X_test)

Is there a direct implementation of multiclass SVM in R(e1071)

I have five classes and I want to use SVM (e1071 package) for the classification. I can see some good examples of binary classification using SVM; however, for multiclass support, some members have suggested using either One-vs-Rest or One-vs-One binary classifiers and then combining them to get the final prediction. Is there a direct implementation of multiclass SVM (either approach is fine for me) available?
Yes, I found a solution. I used the basic R help file and implemented One-vs-One multiclass classification using e1071; the code is short and to the point, with clear comments.
library(xlsx)
library(gdata)
data(iris)
library(e1071)
library(caTools)
##---------- Split the overall dataset into two parts:70% for training and 30% for testing-----------
index_iris<-sample.split(iris$Species,SplitRatio=.7)
trainset_iris<-iris[index_iris==TRUE,]
testset_iris<-iris[index_iris==FALSE,]
y <- testset_iris$Species
##---------- Now Create an SVM Model with the training dataset--------------------
model <- svm(Species ~ ., data = trainset_iris)
# print(model)
# summary (model)
##-------------Use the model to predict the test dataset so that we can find the accuracy of the model-----
pred <- predict(model,testset_iris)
table(pred, y)
##-------------- Compute decision values and probabilities--------------
pred <- predict(model, testset_iris, decision.values = TRUE)
attr(pred, "decision.values")

scikit-learn RandomForestClassifier probabilistic prediction vs majority vote

In the documentation of scikit-learn, in section 1.9.2.1 (excerpt posted below), why does the implementation of random forest differ from the original paper by Breiman? As far as I'm aware, Breiman opted for a majority vote (mode) for classification and an average for regression (paper written by Liaw and Wiener, the maintainers of the original R code, with the citation below) when aggregating the ensemble of classifiers.
Why does scikit-learn use probabilistic prediction instead of a majority vote?
Is there any advantage in using probabilistic prediction?
The section in question:
In contrast to the original publication [B2001], the scikit-learn
implementation combines classifiers by averaging their probabilistic
prediction, instead of letting each classifier vote for a single
class.
Source: Liaw, A., & Wiener, M. (2002). Classification and Regression by randomForest. R news, 2(3), 18-22.
This question has now been answered on Cross Validated. Included here for reference:
Such questions are always best answered by looking at the code, if
you're fluent in Python.
RandomForestClassifier.predict, at least in the current version
0.16.1, predicts the class with highest probability estimate, as given by predict_proba. (this
line)
The documentation for predict_proba says:
The predicted class probabilities of an input sample is computed as the mean predicted class probabilities of the trees in the forest. The
class probability of a single tree is the fraction of samples of the
same class in a leaf.
The difference from the original method is probably just so that
predict gives predictions consistent with predict_proba. The
result is sometimes called "soft voting", rather than the "hard"
majority vote used in the original Breiman paper. I couldn't in quick
searching find an appropriate comparison of the performance of the two
methods, but they both seem fairly reasonable in this situation.
The predict documentation is at best quite misleading; I've
submitted a pull
request to
fix it.
If you want to do majority vote prediction instead, here's a function
to do it. Call it like predict_majvote(clf, X) rather than
clf.predict(X). (Based on predict_proba; only lightly tested, but
I think it should work.)
import numpy as np
from scipy.stats import mode
from sklearn.ensemble.forest import _partition_estimators, _parallel_helper
from sklearn.tree._tree import DTYPE
from sklearn.externals.joblib import Parallel, delayed
from sklearn.utils import check_array
from sklearn.utils.validation import check_is_fitted


def predict_majvote(forest, X):
    """Predict class for X.

    Uses majority voting, rather than the soft voting scheme
    used by RandomForestClassifier.predict.

    Parameters
    ----------
    X : array-like or sparse matrix of shape = [n_samples, n_features]
        The input samples. Internally, it will be converted to
        ``dtype=np.float32`` and if a sparse matrix is provided
        to a sparse ``csr_matrix``.

    Returns
    -------
    y : array of shape = [n_samples] or [n_samples, n_outputs]
        The predicted classes.
    """
    check_is_fitted(forest, 'n_outputs_')

    # Check data
    X = check_array(X, dtype=DTYPE, accept_sparse="csr")

    # Assign chunk of trees to jobs
    n_jobs, n_trees, starts = _partition_estimators(forest.n_estimators,
                                                    forest.n_jobs)

    # Parallel loop: collect the hard prediction of every tree
    all_preds = Parallel(n_jobs=n_jobs, verbose=forest.verbose,
                         backend="threading")(
        delayed(_parallel_helper)(e, 'predict', X, check_input=False)
        for e in forest.estimators_)

    # Reduce: majority vote across trees
    modes, counts = mode(all_preds, axis=0)

    if forest.n_outputs_ == 1:
        return forest.classes_.take(modes[0], axis=0)
    else:
        n_samples = all_preds[0].shape[0]
        preds = np.zeros((n_samples, forest.n_outputs_),
                         dtype=forest.classes_.dtype)
        for k in range(forest.n_outputs_):
            preds[:, k] = forest.classes_[k].take(modes[:, k], axis=0)
        return preds
On the dumb synthetic case I tried, predictions agreed with the
predict method every time.
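For a quick sanity check of the soft-voting behaviour described above, the following sketch (toy data, default settings) compares predict with the argmax of predict_proba:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

proba = clf.predict_proba(X)                          # mean per-tree class probabilities
from_proba = clf.classes_.take(np.argmax(proba, axis=1))
print(np.array_equal(clf.predict(X), from_proba))     # True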
This was studied by Breiman in Bagging Predictors (http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf).
The two approaches give nearly identical results, but soft voting yields smoother probability estimates. Note that if you are using fully grown trees, you won't see any difference.

Extracting the trees (predictor) from random forest classifier

I have a specific technical question about sklearn's random forest classifier.
After fitting the data with the .fit(X, y) method, is there a way to extract the actual trees from the estimator object, in some common format, so the .predict(X) method can be implemented outside Python?
Yes, the trees of a forest are stored in the estimators_ attribute of
the forest object.
You can have a look at the implementation of the export_graphviz
function to learn how to write your own custom exporter:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/export.py
Here is the usage doc for this function:
http://scikit-learn.org/stable/modules/tree.html#classification
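As a small sketch of that approach, each fitted tree in estimators_ can be passed to export_graphviz, which returns DOT text that external tools can render or parse (toy data below):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Export the first tree of the forest as DOT text
dot_text = export_graphviz(forest.estimators_[0], out_file=None)
print(dot_text[:200])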
Yes, there is; @ogrisel's answer enabled me to implement the following snippet, which makes it possible to use a (partially trained) random forest for prediction. It saves a lot of time if you want to cross-validate a random forest model over the number of trees:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor()
rf_model.fit(x, y)
estimators = rf_model.estimators_

def predict(X, n_trees):
    # Predict using only the first n_trees trees of the fitted forest
    rf_model.estimators_ = estimators[0:n_trees]
    return rf_model.predict(X)
I explained this in more detail here: extract trees from a Random Forest
