In the scikit-learn documentation, section 1.9.2.1 (excerpt posted below), why does the implementation of random forest differ from the original paper by Breiman? As far as I'm aware, Breiman opted for a majority vote (mode) for classification and an average for regression when aggregating the ensemble of trees, as described by Liaw and Wiener, the maintainers of the original R code (citation below).
Why does scikit-learn use probabilistic prediction instead of a majority vote?
Is there any advantage in using probabilistic prediction?
The section in question:
In contrast to the original publication [B2001], the scikit-learn
implementation combines classifiers by averaging their probabilistic
prediction, instead of letting each classifier vote for a single
class.
Source: Liaw, A., & Wiener, M. (2002). Classification and Regression by randomForest. R news, 2(3), 18-22.
This question has now been answered on Cross Validated. Included here for reference:
Such questions are always best answered by looking at the code, if
you're fluent in Python.
RandomForestClassifier.predict, at least in the current version (0.16.1), predicts the class with the highest probability estimate, as given by predict_proba (see this line of the source).
The documentation for predict_proba says:
The predicted class probabilities of an input sample are computed as the mean predicted class probabilities of the trees in the forest. The class probability of a single tree is the fraction of samples of the same class in a leaf.
The difference from the original method is probably just so that predict gives predictions consistent with predict_proba. The result is sometimes called "soft voting", rather than the "hard" majority vote used in the original Breiman paper. In a quick search I couldn't find an appropriate comparison of the performance of the two methods, but they both seem fairly reasonable in this situation.
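You can verify the soft-voting behaviour yourself; a quick sketch on arbitrary toy data (the dataset and parameter values are just illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(random_state=0)
clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# predict should match the argmax of the averaged per-tree probabilities
proba_argmax = clf.classes_[np.argmax(clf.predict_proba(X), axis=1)]
print(np.array_equal(proba_argmax, clf.predict(X)))  # expected: True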
The predict documentation is at best quite misleading; I've submitted a pull request to fix it.
If you want to do majority vote prediction instead, here's a function
to do it. Call it like predict_majvote(clf, X) rather than
clf.predict(X). (Based on the implementation of predict; only lightly tested, but I think it should work.)
import numpy as np
from scipy.stats import mode
# NOTE: these are private scikit-learn internals as of ~0.16; they have
# moved or been removed in later versions.
from sklearn.ensemble.forest import _partition_estimators, _parallel_helper
from sklearn.tree._tree import DTYPE
from sklearn.externals.joblib import Parallel, delayed
from sklearn.utils import check_array
from sklearn.utils.validation import check_is_fitted


def predict_majvote(forest, X):
    """Predict class for X.

    Uses majority voting, rather than the soft voting scheme
    used by RandomForestClassifier.predict.

    Parameters
    ----------
    X : array-like or sparse matrix of shape = [n_samples, n_features]
        The input samples. Internally, it will be converted to
        ``dtype=np.float32`` and if a sparse matrix is provided
        to a sparse ``csr_matrix``.

    Returns
    -------
    y : array of shape = [n_samples] or [n_samples, n_outputs]
        The predicted classes.
    """
    check_is_fitted(forest, 'n_outputs_')

    # Check data
    X = check_array(X, dtype=DTYPE, accept_sparse="csr")

    # Assign chunk of trees to jobs
    n_jobs, n_trees, starts = _partition_estimators(forest.n_estimators,
                                                    forest.n_jobs)

    # Parallel loop: collect each tree's hard prediction
    all_preds = Parallel(n_jobs=n_jobs, verbose=forest.verbose,
                         backend="threading")(
        delayed(_parallel_helper)(e, 'predict', X, check_input=False)
        for e in forest.estimators_)

    # Reduce: take the most common prediction across trees
    # NOTE: assumes class labels are 0..n_classes-1, since the trees
    # return labels rather than class indices.
    modes, counts = mode(all_preds, axis=0)

    if forest.n_outputs_ == 1:
        return forest.classes_.take(modes[0], axis=0)
    else:
        n_samples = all_preds[0].shape[0]
        preds = np.zeros((n_samples, forest.n_outputs_),
                         dtype=forest.classes_[0].dtype)
        for k in range(forest.n_outputs_):
            preds[:, k] = forest.classes_[k].take(modes[0, :, k], axis=0)
        return preds
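For instance, a quick sanity check along those lines (make_classification and all parameter values are arbitrary choices; this assumes the 0.16-era scikit-learn the function above targets):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

hard = predict_majvote(clf, X)   # majority (hard) vote across trees
soft = clf.predict(X)            # built-in soft vote
print("agreement:", np.mean(hard == soft))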
On the dumb synthetic case I tried, predictions agreed with the
predict method every time.
This was studied by Breiman in Bagging Predictors (http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf).
This gives nearly identical results, but soft voting produces smoother probability estimates. Note that if you are using fully grown trees you won't see any difference: each leaf is then pure, so every tree's probabilities are 0 or 1, and averaging them reduces to counting votes.
Related
How does the RandomForestClassifier of sklearn handle a multilabel problem (under the hood)?
For example, does it break the problem into distinct one-label problems?
Just to be clear, I have not really tested it yet, but I see y : array-like, shape = [n_samples] or [n_samples, n_outputs] in the .fit() function of the RandomForestClassifier.
Let me quote scikit-learn. From the user guide on random forests:
Like decision trees, forests of trees also extend to multi-output problems (if Y is an array of size [n_samples, n_outputs]).
And from the multi-output problems section of the decision tree user guide:
… to support multi-output problems. This requires the following changes:
Store n output values in leaves, instead of 1;
Use splitting criteria that compute the average reduction across all n outputs.
And I hope this will answer your question. If not, you can look at the section's reference:
M. Dumont et al., Fast multi-class image annotation with random subwindows and multiple output randomized trees, International Conference on Computer Vision Theory and Applications, 2009.
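To make the multi-output behaviour concrete, here is a minimal sketch (the data and shapes are purely illustrative): a single forest fits an [n_samples, n_outputs] target directly, rather than breaking it into separate one-label problems.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 5)
# Two output columns: the forest handles both jointly,
# storing n output values in each leaf.
Y = np.column_stack([(X[:, 0] > 0.5).astype(int),
                     (X[:, 1] > 0.5).astype(int)])

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, Y)
print(clf.predict(X[:3]).shape)  # (3, 2): one column per output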
I was a bit confused when I started using trees. If you refer to the sklearn doc:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
If you go down on the methods to predict_proba, you can see:
"The predicted class probability is the fraction of samples of the same class in a leaf."
So in predict, the class is the mode (majority class) of the training samples in the leaf the sample falls into. This can change if you use weighted classes:
"class_weight : dict, list of dicts, “balanced” or None, default=None
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one."
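As a tiny illustration of the leaf-fraction behaviour (toy data, purely for demonstration):

from sklearn.tree import DecisionTreeClassifier

# A depth-1 tree is forced to keep an impure leaf, so predict_proba
# returns the fraction of training samples of each class in the leaf.
X = [[0], [1], [2], [3]]
y = [0, 1, 0, 1]
tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

print(tree.predict_proba([[3]]))  # class fractions in the reached leaf
print(tree.predict([[3]]))        # the majority (mode) class of that leaf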
Hope this helps! :)
I have a business problem: I ran a regression model in Python to predict my target value. When validating it with my test set, I found that the predicted values are very far from the actual values. What I want to extract from this model is which features played a role in pushing the predicted value away from the actual value (say, by more than some threshold).
I want to rank the features by impact so that I can report this to my client.
Thanks
It depends on the estimator you chose. Linear models have a coef_ attribute you can inspect to get the coefficient used for each feature; provided the features are normalized, this tells you what you want to know.
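A minimal sketch of that idea (the dataset and scaling choices are just illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_features=4, n_informative=2, random_state=0)
X = StandardScaler().fit_transform(X)  # normalize so coefficients are comparable
lm = LinearRegression().fit(X, y)

# Rank features by absolute coefficient size, largest impact first
ranked = sorted(enumerate(lm.coef_), key=lambda t: abs(t[1]), reverse=True)
print(ranked)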
As noted above, for tree models you have the feature importances. You can also use libraries like treeinterpreter, described here: Interpreting Random Forest examples.
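If you go the treeinterpreter route, the usual pattern looks roughly like the sketch below (this assumes the treeinterpreter package is installed via pip install treeinterpreter; untested here):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from treeinterpreter import treeinterpreter as ti

X, y = make_regression(n_features=4, n_informative=2, random_state=0)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

prediction, bias, contributions = ti.predict(rf, X[:3])
# For each sample, prediction = bias + sum(contributions), so
# contributions[i, j] is feature j's push away from the baseline.
for i in range(3):
    ranked = np.argsort(np.abs(contributions[i]))[::-1]
    print("sample", i, "most influential features:", ranked)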
You can have a look at this: Feature selection.
Check the RandomForestRegressor for performing the regression:
# Example
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_features=4, n_informative=2,
                       random_state=0, shuffle=False)
regr = RandomForestRegressor(max_depth=2, random_state=0,
                             n_estimators=100)
regr.fit(X, y)
print(regr.feature_importances_)
print(regr.predict([[0, 0, 0, 0]]))
Check regr.feature_importances_ to see which features matter more (a higher score means a more important feature). Further information: FeatureImportance.
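Building on the example above, you can turn the scores into an explicit ranking (most important first):

import numpy as np

# Rank features by importance, most important first
ranking = np.argsort(regr.feature_importances_)[::-1]
for idx in ranking:
    print("feature %d: importance %.4f" % (idx, regr.feature_importances_[idx]))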
Edit-1:
As pointed out in a comment by user #blacksite, feature_importances_ alone does not provide a complete interpretation of a random forest. For further analysis of the results and the responsible features, please refer to the following blogs:
https://medium.com/usf-msds/intuitive-interpretation-of-random-forest-2238687cae45 (preferred as it provides multiple techniques )
https://blog.datadive.net/interpreting-random-forests/ (focuses on 1 technique but also provides python library - treeinterpreter)
More on feature_importances_:
If you simply want to select the features with the highest importance score, use the feature_importances_ attribute; see Feature selection using feature importances.
Feature importance also depends on the criterion used for splitting and calculating importance; see Interpreting Decision Tree in context of feature importances.
I looked at the documentation of scikit-learn, but it is not clear to me what sort of classification method is used under the hood of the VotingClassifier. Is it logistic regression, SVM, or some sort of tree method?
I'm interested in ways to vary the classifier method used under the hood. If scikit-learn does not offer such an option, is there a Python package that can be integrated easily with scikit-learn and offers such functionality?
EDIT:
I meant the classifier method used for the second level model. I'm perfectly aware that the first level classifiers can be any type of classifier supported by scikit-learn.
The second level classifier uses the predictions of the first level classifiers as inputs. So my question is - what method does this second level classifier use? Is it logistic regression? Or something else? Can I change it?
General
The VotingClassifier is not limited to one specific method/algorithm. You can choose multiple different algorithms and combine them into one VotingClassifier. See the example below:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

clf1 = LogisticRegression(...)
clf2 = RandomForestClassifier(...)
clf3 = SVC(...)
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('svm', clf3)],
                        voting='hard')
Read more about the usage here: VotingClassifier-Usage.
When it comes down to how the VotingClassifier "votes" you can either specify voting='hard' or voting='soft'. See the paragraph below for more detail.
Voting
Majority Class Labels (Majority/Hard Voting)
In majority voting, the predicted class label for a particular sample
is the class label that represents the majority (mode) of the class
labels predicted by each individual classifier.
E.g., if the prediction for a given sample is

classifier 1 -> class 1
classifier 2 -> class 1
classifier 3 -> class 2

then the VotingClassifier (with voting='hard') would classify the sample as "class 1" based on the majority class label.
Source: scikit-learn-majority-class-labels-majority-hard-voting
Weighted Average Probabilities (Soft Voting)
In contrast to majority voting (hard voting), soft voting returns the
class label as argmax of the sum of predicted probabilities.
Specific weights can be assigned to each classifier via the weights
parameter. When weights are provided, the predicted class
probabilities for each classifier are collected, multiplied by the
classifier weight, and averaged. The final class label is then derived
from the class label with the highest average probability.
Source/Read more here: scikit-learn-weighted-average-probabilities-soft-voting
The VotingClassifier does not fit any meta model on the first-level classifiers' output.
It just aggregates the output of each first-level classifier by taking the mode (if voting is hard) or by averaging the probabilities (if voting is soft).
In simple terms, VotingClassifier does not learn anything from the first level of classifiers. It only consolidates the output of individual classifiers.
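You can convince yourself of this by reproducing the soft vote by hand; a minimal sketch (toy data, arbitrary estimators, equal weights):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf1 = LogisticRegression(max_iter=1000).fit(X, y)
clf2 = RandomForestClassifier(random_state=0).fit(X, y)

eclf = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(random_state=0))],
    voting='soft').fit(X, y)

# Manual soft vote: average the probabilities, then take the argmax;
# this should agree with the VotingClassifier (ties aside).
avg_proba = (clf1.predict_proba(X) + clf2.predict_proba(X)) / 2
manual = clf1.classes_[np.argmax(avg_proba, axis=1)]
print("agreement:", np.mean(manual == eclf.predict(X)))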
If you want your meta model to be more intelligent, try the AdaBoost or gradient boosting models.
I want to use the latent Dirichlet allocation from sklearn for anomaly detection. I need to obtain the likelihood of a new sample, as formally described in the equation here.
How can I get that?
Solution to your problem
You should use the score() method of the model, which returns the (approximate) log likelihood of the passed-in documents.
Assuming you have created your documents as per the paper and trained an LDA model for each host, you should then take the lowest likelihood over all the training documents and use it as a threshold. Untested example code follows:
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Assuming X contains a host's training documents
# and X_unknown contains the test documents
lda = LatentDirichletAllocation(... parameters here ...)
lda.fit(X)

# Use the lowest training-document likelihood as the threshold
threshold = min([lda.score([x]) for x in X])
attacks = [
    i for i, x in enumerate(X_unknown)
    if lda.score([x]) < threshold
]
# attacks now contains the indexes of the anomalies
Exactly what you asked
If you want to use the exact equation from the paper you linked, I would advise against trying to do it in scikit-learn, because the expectation-step interface is not clearly exposed.
The parameters θ and φ can be found at lines 112-130 as doc_topic_d and norm_phi. The function _update_doc_distribution() returns the doc_topic_distribution and the sufficient statistics, from which you could try to infer θ and φ with the following, again untested, code:
theta = doc_topic_d / doc_topic_d.sum()
# see the variables exp_doc_topic_d in the source code
# in the function _update_doc_distribution()
phi = np.dot(exp_doc_topic_d, exp_topic_word_d) + EPS
Suggestion for another library
If you want to have more control over the expectation and maximization steps and the variational parameters I would suggest you look at LDA++ and specifically the EStepInterface (disclaimer I am one of the authors of LDA++).
How can I know the probability that a sample belongs to the class predicted by the predict() function of scikit-learn's support vector machine?

>>> print clf.predict([fv])
[5]

Is there any function for this?
Definitely read this section of the docs, as there are some subtleties involved. See also Scikit-learn predict_proba gives wrong answers.
Basically, if you have a multi-class problem with plenty of data, predict_proba, as suggested earlier, works well. Otherwise, you may have to make do with decision_function, which yields an ordering but not probability scores.
Here's a nice motif for using predict_proba to get a dictionary or list of class vs probability:
from sklearn import svm

# Assumes X, Y are the training data and test_data is a 2-D array
model = svm.SVC(probability=True)
model.fit(X, Y)
results = model.predict_proba(test_data)[0]

# gets a dictionary of {'class_name': probability}
prob_per_class_dictionary = dict(zip(model.classes_, results))

# gets a list of ['most_probable_class', 'second_most_probable_class', ..., 'least_class']
results_ordered_by_probability = [
    cls for cls, _ in sorted(zip(model.classes_, results),
                             key=lambda x: x[1], reverse=True)
]
Use clf.predict_proba([fv]) to obtain a list with predicted probabilities per class. However, this function is not available for all classifiers.
Regarding your comment, consider the following:
>> prob = [ 0.01357713, 0.00662571, 0.00782155, 0.3841413, 0.07487401, 0.09861277, 0.00644468, 0.40790285]
>> sum(prob)
1.0
The probabilities sum to 1.0, so multiply by 100 to get percentage.
Create the SVC with probability=True so that it computes probability estimates:
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
Then call fit as usual, followed by predict_proba([fv]).
For a clearer answer, I repost the information from the scikit-learn documentation for SVM:
Needless to say, the cross-validation involved in Platt scaling is an expensive operation for large datasets. In addition, the probability estimates may be inconsistent with the scores, in the sense that the “argmax” of the scores may not be the argmax of the probabilities. (E.g., in binary classification, a sample may be labeled by predict as belonging to a class that has probability <½ according to predict_proba.) Platt’s method is also known to have theoretical issues. If confidence scores are required, but these do not have to be probabilities, then it is advisable to set probability=False and use decision_function instead of predict_proba.
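For example, a minimal sketch of using decision_function for confidence scores instead of probabilities (toy data):

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(random_state=0)
clf = SVC(probability=False).fit(X, y)  # the default; avoids the Platt-scaling cost

# Signed distances to the separating hyperplane: fine for ranking
# confidence, but these are not probabilities.
print(clf.decision_function(X[:5]))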
For other classifiers such as random forest, AdaBoost, or gradient boosting, it should be okay to use the predict_proba function in scikit-learn, since no Platt scaling is involved.
This is one way of obtaining probability-like scores (note that min-max scaling the decision function yields normalized confidence scores, not calibrated probabilities):

from sklearn.svm import SVC

# Assumes X_train, y_train, X_test are defined
svc = SVC(probability=True)
preds_svc = svc.fit(X_train, y_train).predict(X_test)

# The decision function tells us on which side of the hyperplane generated
# by the classifier we are (and how far we are away from it).
probs_svc = svc.decision_function(X_test)

# Min-max scale the scores into [0, 1]
probs_svc = (probs_svc - probs_svc.min()) / (probs_svc.max() - probs_svc.min())