Balanced Random Forest in scikit-learn (python)

I'm wondering if there is an implementation of the Balanced Random Forest (BRF) in recent versions of the scikit-learn package. BRF is used in the case of imbalanced data. It works like a normal RF, but for each bootstrap iteration it balances the class distribution by undersampling the majority class. For example, given two classes with N0 = 100 and N1 = 30 instances, at each random sampling it draws (with replacement) 30 instances from the first class and the same number of instances from the second class, i.e. it trains a tree on a balanced data set. For more information please refer to this paper.
RandomForestClassifier() does have the class_weight= parameter, which can be set to 'balanced', but I'm not sure whether it is related to downsampling of the bootstrapped training samples.

What you're looking for is the BalancedBaggingClassifier from imblearn.
imblearn.ensemble.BalancedBaggingClassifier(base_estimator=None,
n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True,
bootstrap_features=False, oob_score=False, warm_start=False, ratio='auto',
replacement=False, n_jobs=1, random_state=None, verbose=0)
Effectively, what it allows you to do is successively undersample your majority class while fitting an estimator on top. You can use random forest or any base estimator from scikit-learn. Here is an example.
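A minimal sketch of that usage (the synthetic data and the decision-tree base estimator here are only for illustration; note that newer imblearn releases rename base_estimator to estimator and ratio to sampling_strategy):
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data, roughly 100 vs 30 instances as in the question
X, y = make_classification(n_samples=130, weights=[0.77], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Each bagging iteration undersamples the majority class before fitting the base estimator
bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
                                n_estimators=10,
                                ratio='auto',
                                replacement=False,
                                random_state=0)
bbc.fit(X_train, y_train)
print(bbc.score(X_test, y_test))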

There is now a class in imblearn called BalancedRandomForestClassifier. It works similarly to the previously mentioned BalancedBaggingClassifier, but is tailored specifically to random forests.
from imblearn.ensemble import BalancedRandomForestClassifier

# X_train, y_train: your (imbalanced) training data
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
brf.fit(X_train, y_train)
y_pred = brf.predict(X_test)

Related

What is the classifier used in scikit-learn's VotingClassifier?

I looked at the documentation of scikit-learn but it is not clear to me what sort of classification method is used under the hood of the VotingClassifier? Is it logistic regression, SVM or some sort of a tree method?
I'm interested in ways to vary the classifier method used under the hood. If scikit-learn does not offer such an option, is there a Python package that integrates easily with scikit-learn and offers such functionality?
EDIT:
I meant the classifier method used for the second level model. I'm perfectly aware that the first level classifiers can be any type of classifier supported by scikit-learn.
The second level classifier uses the predictions of the first level classifiers as inputs. So my question is - what method does this second level classifier use? Is it logistic regression? Or something else? Can I change it?
General
The VotingClassifier is not limited to one specific method/algorithm. You can choose multiple different algorithms and combine them into one VotingClassifier. See the example below:
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
clf1 = LogisticRegression(...)
clf2 = RandomForestClassifier(...)
clf3 = SVC(...)
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('svm', clf3)], voting='hard')
Read more about the usage here: VotingClassifier-Usage.
When it comes down to how the VotingClassifier "votes" you can either specify voting='hard' or voting='soft'. See the paragraph below for more detail.
Voting
Majority Class Labels (Majority/Hard Voting)
In majority voting, the predicted class label for a particular sample is the class label that represents the majority (mode) of the class labels predicted by each individual classifier.
E.g., if the prediction for a given sample is
classifier 1 -> class 1
classifier 2 -> class 1
classifier 3 -> class 2
the VotingClassifier (with voting='hard') would classify the sample as "class 1" based on the majority class label.
Source: scikit-learn-majority-class-labels-majority-hard-voting
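As a tiny illustration of that hard vote (using the three predictions from the example above):
from collections import Counter

preds = [1, 1, 2]  # classifier 1 -> class 1, classifier 2 -> class 1, classifier 3 -> class 2
print(Counter(preds).most_common(1)[0][0])  # -> 1, the majority class label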
Weighted Average Probabilities (Soft Voting)
In contrast to majority voting (hard voting), soft voting returns the
class label as argmax of the sum of predicted probabilities.
Specific weights can be assigned to each classifier via the weights
parameter. When weights are provided, the predicted class
probabilities for each classifier are collected, multiplied by the
classifier weight, and averaged. The final class label is then derived
from the class label with the highest average probability.
Source/Read more here: scikit-learn-weighted-average-probabilities-soft-voting
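To make the soft-voting arithmetic concrete, here is a small hand-rolled sketch; the probabilities and weights are made up for illustration:
import numpy as np

# Predicted probabilities for one sample from three classifiers
# (columns: class 0, class 1) and per-classifier weights
probas = np.array([[0.4, 0.6],   # classifier 1
                   [0.7, 0.3],   # classifier 2
                   [0.2, 0.8]])  # classifier 3
weights = np.array([1, 1, 2])

# Weighted average of the probabilities, then argmax picks the class
avg = np.average(probas, axis=0, weights=weights)
print(avg)           # [0.375 0.625]
print(avg.argmax())  # -> class 1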
The VotingClassifier does not fit any meta model on the first level of classifiers output.
It just aggregates the output of each classifier in the first level by the mode (if voting is hard) or averaging the probabilities (if the voting is soft).
In simple terms, VotingClassifier does not learn anything from the first level of classifiers. It only consolidates the output of individual classifiers.
If you want your meta model to be more intelligent, try using boosting models such as AdaBoost or gradient boosting.

In scikit-learn Stochastic Gradient Descent classifier, how to find the most influential independent variables?

I do this:
from sklearn.linear_model import SGDClassifier
sgclass = SGDClassifier(random_state=10)
sgclass.fit(X_train,y_train)
pred = sgclass.predict(X_test)
from sklearn.metrics import classification_report,accuracy_score
print(classification_report(y_test, pred))
print(accuracy_score(y_test, pred))
These are useful reports on the recall and precision of the model.
However, how can I obtain the most influential independent variables that predict the dependent variable? I started with about 12 candidate variables and want to see their rank order in terms of influence in the model.
As the documentation specifies, you can use the coef_ attribute to get the feature weights. The greater the absolute value of a feature's weight, the more important that feature is.
You can see this in action in scikit-learn's feature selection class SelectFromModel: the best features are selected from any estimator that exposes a feature_importances_ or coef_ attribute.
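For instance, a sketch of ranking the features of the fitted sgclass model from the question by the absolute value of their weights (for a binary problem coef_ has a single row; only feature indices are shown, since no feature names are given):
import numpy as np
from sklearn.feature_selection import SelectFromModel

weights = np.abs(sgclass.coef_[0])   # one weight per feature
ranking = np.argsort(weights)[::-1]  # indices, most influential first
for idx in ranking:
    print(idx, weights[idx])

# Or let SelectFromModel keep the features whose weights exceed its default threshold
selector = SelectFromModel(sgclass, prefit=True)
print(selector.get_support(indices=True))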

Recursive feature elimination with CV doesn't reduce feature count

I have this protein dataset that I need to perform an RFE on. There are 100 examples with binary class labels (sick = 1, healthy = 0) and 9847 features for each example. To reduce the dimensionality I am performing an RFECV with a LogisticRegression estimator and 5-fold CV. This is the code:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
model = LogisticRegression()
rfecv = RFECV(estimator=model, step=1, cv=StratifiedKFold(5), n_jobs=-1)
rfecv.fit(X_train, y_train)
print("Number of features selected: %d" % rfecv.n_features_)
Number of features selected: 9874
I then plot the number of features vs the CV scores:
import matplotlib.pyplot as plt
plt.figure()
plt.xlabel("feature count")
plt.ylabel("CV accuracy")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
What I think is happening (and this is what I need an expert for) is that the first peak shows the optimal number of features. After that the curve drops and only starts to climb again because of overfitting, not really separating classes but individual examples. Could this be the case? And if so, how can I obtain these features (i.e. the ones at that first peak), because rfecv.support_ only gives me the ones where the highest accuracy was reached (meaning: all of them)?
And while I am at it: How would I choose the best estimator for the RFE? Is it just by trial and error, going through all possible classifiers or is there any logic why I would use a Logit over a linear SVC for example?
One approach I use for feature relevance is RandomForest or ExtraTrees (extremely randomized trees).
With RFECV you can use:
rfecv.n_features_
to see how many features were selected, and:
rfecv.ranking_
to see the ranking of each feature (1 corresponds to the selected features). Another algorithm you can use to reduce the dimensionality of your dataset is PCA.
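Regarding the first question: if you want the feature subset at that first peak rather than the one RFECV keeps, one option (just a sketch, with a hypothetical peak position read off your CV plot) is to rerun a plain RFE with n_features_to_select fixed at that count:
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

n_at_peak = 50  # hypothetical value: read it off the first peak of your CV curve

rfe = RFE(estimator=LogisticRegression(), n_features_to_select=n_at_peak, step=1)
rfe.fit(X_train, y_train)
selected_idx = np.where(rfe.support_)[0]  # column indices of the selected features
With nearly 10,000 features, a larger step than 1 will make this much faster.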

What is the difference between LinearSVC and SVC(kernel="linear")?

I found sklearn.svm.LinearSVC and sklearn.svm.SVC(kernel='linear') and they seem very similar to me, but I get very different results on Reuters.
sklearn.svm.LinearSVC: 81.05% in 28.87s train / 9.71s test
sklearn.svm.SVC : 33.55% in 6536.53s train / 2418.62s test
Both have a linear kernel. The tolerance of LinearSVC is lower than that of SVC:
LinearSVC(C=1.0, tol=0.0001, max_iter=1000, penalty='l2', loss='squared_hinge', dual=True, multi_class='ovr', fit_intercept=True, intercept_scaling=1)
SVC (C=1.0, tol=0.001, max_iter=-1, shrinking=True, probability=False, cache_size=200, decision_function_shape=None)
How do both functions differ otherwise? Even if I set kernel='linear', tol=0.0001, max_iter=1000 and decision_function_shape='ovr', the SVC takes much longer than LinearSVC. Why?
I use sklearn 0.18 and both are wrapped in the OneVsRestClassifier. I'm not sure if this makes the same as multi_class='ovr' / decision_function_shape='ovr'.
Truly, LinearSVC and SVC(kernel='linear') yield different results, i.e. metric scores and decision boundaries, because they use different approaches. The toy example below demonstrates it:
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC, SVC
X, y = load_iris(return_X_y=True)
clf_1 = LinearSVC().fit(X, y) # possible to state loss='hinge'
clf_2 = SVC(kernel='linear').fit(X, y)
score_1 = clf_1.score(X, y)
score_2 = clf_2.score(X, y)
print('LinearSVC score %s' % score_1)
print('SVC score %s' % score_2)
--------------------------
>>> 0.96666666666666667
>>> 0.98666666666666669
The key principles of that difference are the following:
By default, LinearSVC minimizes the squared hinge loss while SVC minimizes the regular hinge loss. It is possible to manually pass loss='hinge' to LinearSVC.
LinearSVC uses the One-vs-All (also known as One-vs-Rest) multiclass reduction while SVC uses the One-vs-One multiclass reduction; this is also noted here. For a multi-class classification problem, SVC fits N * (N - 1) / 2 models, where N is the number of classes, whereas LinearSVC simply fits N models (both points are illustrated in the sketch after this list). If the classification problem is binary, then only one model is fit in both scenarios. The multi_class and decision_function_shape parameters have nothing in common: the latter is an aggregator that transforms the pairwise decision-function results into an array of shape (n_samples, n_classes), while multi_class is the algorithmic approach used to establish a solution.
LinearSVC is based on liblinear, which does in fact penalize the intercept; SVC uses libsvm, which does not. liblinear is optimized for the linear (special) case and thus converges faster on large amounts of data than libsvm. That is why LinearSVC takes less time to solve the problem.
In fact, as was stated in the comments section, LinearSVC is not actually linear after the intercept scaling.
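A small sketch of those two points on a toy 4-class problem (the dataset and parameters here are only illustrative; loss='hinge' shows the optional switch away from LinearSVC's squared-hinge default, and the shapes reflect OvR vs OvO):
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC, SVC

# 4 classes, so the model counts differ: OvR fits 4 models, OvO fits 4 * 3 / 2 = 6
X, y = make_classification(n_samples=200, n_features=20, n_informative=6,
                           n_classes=4, random_state=0)

ovr = LinearSVC(loss='hinge').fit(X, y)
print(ovr.coef_.shape)                 # (4, 20): one weight vector per class

ovo = SVC(kernel='linear', decision_function_shape='ovo').fit(X, y)
print(ovo.decision_function(X).shape)  # (200, 6): one column per pair of classes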
The main difference between them is that LinearSVC only gives you a linear classifier, whereas SVC lets you choose among a variety of non-linear kernels. However, SVC is not recommended for large non-linear problems because it is very slow; try other libraries for non-linear classification in that case.
Now, the reason you don't get the same output even after setting kernel='linear' is that LinearSVC and SVC take different approaches in the underlying mathematics. Also, LinearSVC works on the one-vs-rest principle, while SVC works one-vs-one.
I hope this answers your question.

scikit-learn RandomForestClassifier probabilistic prediction vs majority vote

In the documentation of scikit-learn in section 1.9.2.1 (excerpt is posted below), why does the implementation of random forest differ from the original paper by Breiman? As far as I'm aware, Breiman opted for a majority vote (mode) for classification and an average for regression (paper written by Liaw and Wiener, the maintainers of the original R code with citation below) when aggregating the ensembles of classifiers.
Why does scikit-learn use probabilistic prediction instead of a majority vote?
Is there any advantage in using probabilistic prediction?
The section in question:
In contrast to the original publication [B2001], the scikit-learn
implementation combines classifiers by averaging their probabilistic
prediction, instead of letting each classifier vote for a single
class.
Source: Liaw, A., & Wiener, M. (2002). Classification and Regression by randomForest. R news, 2(3), 18-22.
This question has now been answered on Cross Validated. Included here for reference:
Such questions are always best answered by looking at the code, if
you're fluent in Python.
RandomForestClassifier.predict, at least in the current version
0.16.1, predicts the class with highest probability estimate, as given by predict_proba. (this
line)
The documentation for predict_proba says:
The predicted class probabilities of an input sample is computed as the mean predicted class probabilities of the trees in the forest. The
class probability of a single tree is the fraction of samples of the
same class in a leaf.
The difference from the original method is probably just so that
predict gives predictions consistent with predict_proba. The
result is sometimes called "soft voting", rather than the "hard"
majority vote used in the original Breiman paper. I couldn't in quick
searching find an appropriate comparison of the performance of the two
methods, but they both seem fairly reasonable in this situation.
The predict documentation is at best quite misleading; I've
submitted a pull
request to
fix it.
If you want to do majority vote prediction instead, here's a function
to do it. Call it like predict_majvote(clf, X) rather than
clf.predict(X). (Based on predict_proba; only lightly tested, but
I think it should work.)
import numpy as np
from scipy.stats import mode
from sklearn.ensemble.forest import _partition_estimators, _parallel_helper
from sklearn.tree._tree import DTYPE
from sklearn.externals.joblib import Parallel, delayed
from sklearn.utils import check_array
from sklearn.utils.validation import check_is_fitted

def predict_majvote(forest, X):
    """Predict class for X.

    Uses majority voting, rather than the soft voting scheme
    used by RandomForestClassifier.predict.

    Parameters
    ----------
    X : array-like or sparse matrix of shape = [n_samples, n_features]
        The input samples. Internally, it will be converted to
        ``dtype=np.float32`` and if a sparse matrix is provided
        to a sparse ``csr_matrix``.

    Returns
    -------
    y : array of shape = [n_samples] or [n_samples, n_outputs]
        The predicted classes.
    """
    check_is_fitted(forest, 'n_outputs_')

    # Check data
    X = check_array(X, dtype=DTYPE, accept_sparse="csr")

    # Assign chunk of trees to jobs
    n_jobs, n_trees, starts = _partition_estimators(forest.n_estimators,
                                                    forest.n_jobs)

    # Parallel loop: collect each tree's hard predictions
    all_preds = Parallel(n_jobs=n_jobs, verbose=forest.verbose,
                         backend="threading")(
        delayed(_parallel_helper)(e, 'predict', X, check_input=False)
        for e in forest.estimators_)

    # Reduce: majority vote (mode) across trees
    modes, counts = mode(all_preds, axis=0)

    if forest.n_outputs_ == 1:
        return forest.classes_.take(modes[0], axis=0)
    else:
        n_samples = all_preds[0].shape[0]
        preds = np.zeros((n_samples, forest.n_outputs_),
                         dtype=forest.classes_.dtype)
        for k in range(forest.n_outputs_):
            preds[:, k] = forest.classes_[k].take(modes[:, k], axis=0)
        return preds
On the dumb synthetic case I tried, predictions agreed with the
predict method every time.
This was studied by Breiman in Bagging Predictors (http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf).
This gives nearly identical results, but using soft voting gives smoother probability estimates. Note that if you are using fully grown trees, you won't see any difference.
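For reference, the helper functions imported in the answer above are private and have been moved or removed in recent scikit-learn versions. A much simpler (single-output, unparallelized) majority-vote sketch that works with the public API, assuming the individual trees predict encoded class indices as the forest stores them internally:
import numpy as np
from scipy.stats import mode

def predict_majority(forest, X):
    # Stack every tree's predictions: shape (n_trees, n_samples) of class indices
    all_preds = np.stack([tree.predict(X) for tree in forest.estimators_]).astype(int)
    # Most frequent index per sample, mapped back to the original class labels
    majority_idx = np.asarray(mode(all_preds, axis=0).mode).ravel()
    return forest.classes_.take(majority_idx, axis=0)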
