RandomForestClassifier in Multi-label problem - how it works? - scikit-learn

How does the RandomForestClassifier of sklearn handle a multilabel problem (under the hood)?
For example, does it brake the problem in distinct one-label problems?
Just to be clear, I have not really tested it yet but I see y : array-like, shape = [n_samples] or [n_samples, n_outputs] at the .fit() function of the RandomForestClassifier.

Let me cite scikit-learn. The user guide of random forest:
Like decision trees, forests of trees also extend to multi-output problems (if Y is an array of size [n_samples, n_outputs]).
The section multi-output problems of the user guide of decision trees:
… to support multi-output problems. This requires the following changes:
Store n output values in leaves, instead of 1;
Use splitting criteria that compute the average reduction across all n outputs.
And I hope this will answer your question. If not, you can look at the section's reference:
M. Dumont et al., Fast multi-class image annotation with random subwindows and multiple output randomized trees, International Conference on Computer Vision Theory and Applications, 2009.

I was a bit confused when I started using trees. If you refer to the sklearn doc:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
If you go down on the methods to predict_proba, you can see:
"The predicted class probability is the fraction of samples of the same class in a leaf."
So in predict, the class is the mode of the classes on that node. This can change if you use weighted classes
"class_weight : dict, list of dicts, “balanced” or None, default=None
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one."
Hope this helps! :)

Related

In the scikit learn implementation of LDA what is the difference between transform and decision_function?

I am currently working on a project that uses Linear Discriminant Analysis to transform some high-dimensional feature set into a scalar value according to some binary labels.
So I train LDA on the data and the labels and then use either transform(X) or decision_function(X) to project the data into a one-dimensional space.
I would like to understand the difference between these two functions. My intuition would be that the decision_function(X) would be transform(X) + bias, but this is not the case.
Also, I found that those two functions give a different AUC score, and thus indicate that it is not a monotonic transformation as I would have thought.
In the documentation, it states that the transform(X) projects the data to maximize class separation, but I would have expected decision_function(X) to do this.
I hope someone could help me understand the difference between these two.
LDA projects your multivariate data onto a 1D space. The projection is based on a linear combination of all your attributes (columns in X). The weights of each attribute are determined by maximizing the class separation. Subsequently, a threshold value in 1D space is determined which gives the best classification results. transform(X) gives you the value of each observation in this 1D space x' = transform(X). decision_function(X) gives you the log-likelihood of an attribute being a positive class log(P(y=1|x')).

How to know which features have more impact in predicting the target class?

I have a business problem, I have run the regression model in python to predict my target value. When validating it with my test set I came to know that my predicted variable is very far from my actual value. Now the thing I want to extract from this model is that, which feature played the role to deviate my predicted value from actual value (let say difference is in some threshold value)?
I want to rank the features impact wise so that I could address to my client.
Thanks
It depends on the estimator you chose, linear models often have a coef_ method you can call to get the coef used for each feature, given they are normalized this tells you what you want to know.
As told above for tree model you have the feature importance. You can also use libraries like treeinterpreter described here:
Interpreting Random Forest
examples
You can have a look at this -
Feature selection
Check the Random Forest Regressor - for performing Regression.
# Example
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_features=4, n_informative=2,
random_state=0, shuffle=False)
regr = RandomForestRegressor(max_depth=2, random_state=0,
n_estimators=100)
regr.fit(X, y)
print(regr.feature_importances_)
print(regr.predict([[0, 0, 0, 0]]))
Check regr.feature_importances_ for getting the higher, more important features. Further information on FeatureImportance
Edit-1:
As pointed out in user (#blacksite) comment, only feature_importance does not provide complete interpretation of Random forest. For further analysis of results and responsible Features. Please refer to following blogs
https://medium.com/usf-msds/intuitive-interpretation-of-random-forest-2238687cae45 (preferred as it provides multiple techniques )
https://blog.datadive.net/interpreting-random-forests/ (focuses on 1 technique but also provides python library - treeinterpreter)
More on feature_importance:
If you simply use the feature_importances_ attribute to select the
features with the highest importance score. Feature selection using
feature
importances
Feature importance also depends on the criteria used for splitting
and calculating importance Interpreting Decision Tree in context of
feature
importances

Is there class weight (or alternative way) for GradientBoostingClassifier in Sklearn when dealing with VotingClassifier or Grid search?

I'm using GradientBoostingClassifier for my unbalanced labeled datasets. It seems like class weight doesn't exist as a parameter for this classifier in Sklearn. I see I can use sample_weight when fit but I cannot use it when I deal with VotingClassifier or GridSearch. Could someone help?
Currently there isn't a way to use class_weights for GB in sklearn.
Don't confuse this with sample_weight
Sample Weights change the loss function and your score that you're trying to optimize. This is often used in case of survey data where sampling approaches have gaps.
Class Weights are used to correct class imbalances as a proxy for over \ undersampling. There is no direct way to do that for GB in sklearn (you can do that in Random Forests though)
Very late, but I hope it can be useful for other members.
In the article of Zichen Wang in towardsdatascience.com, the point 5 Gradient Boosting it is told:
For instance, Gradient Boosting Machines (GBM) deals with class imbalance by constructing successive training sets based on incorrectly classified examples. It usually outperforms Random Forest on imbalanced dataset For instance, Gradient Boosting Machines (GBM) deals with class imbalance by constructing successive training sets based on incorrectly classified examples. It usually outperforms Random Forest on imbalanced dataset.
And a chart shows that the half of the grandient boosting model have an AUROC over 80%. So considering GB models performances and the way they are done, it seems not to be necessary to introduce a kind of class_weight parameter as it is the case for RandomForestClassifier in sklearn package.
In the book Introduction To Machine Learning with Pyhton written by Andreas C. Müller and Sarah Guido, edition 2017, page 89, Chapter 2 *Supervised Learning, section Ensembles of Decision Trees, sub-section Gradient boosted regression trees (gradient boosting machines):
They are generally a bit more sensitive to
parameter settings than random forests, but can provide better accuracy if the parameters are set correctly.
Now if you still have scoring problems due to imbalance proportions of categories in the target variable, it is possible you should see if your data should be splited to apply different models on it, because they are not as homogeneous as it seems to be. I mean it may have a variable you have not in your dataset train (an hidden variable clearly) that influences a lot the model results, then it is difficult even for the greater GB to give correct scoring because it misses a huge information that you cannot make appear in the matrix to compute sometimes for many reasons.
Some updates:
I found, by random, there are libraries that implement it as parameters of their gradient boosting instance objects. It is the case of H2O where for the parameter balance_classes it is told:
Balance training data class counts via over/under-sampling (for
imbalanced data).
Type: bool (default: False).
If you want to keep with sklearn you should do as HakunaMaData told: over/under-sampling because that's what other libraries finally do when the parameter exist.

scikit-learn RandomForestClassifier probabilistic prediction vs majority vote

In the documentation of scikit-learn in section 1.9.2.1 (excerpt is posted below), why does the implementation of random forest differ from the original paper by Breiman? As far as I'm aware, Breiman opted for a majority vote (mode) for classification and an average for regression (paper written by Liaw and Wiener, the maintainers of the original R code with citation below) when aggregating the ensembles of classifiers.
Why does scikit-learn use probabilistic prediction instead of a majority vote?
Is there any advantage in using probabilistic prediction?
The section in question:
In contrast to the original publication [B2001], the scikit-learn
implementation combines classifiers by averaging their probabilistic
prediction, instead of letting each classifier vote for a single
class.
Source: Liaw, A., & Wiener, M. (2002). Classification and Regression by randomForest. R news, 2(3), 18-22.
This question has now been answered on Cross Validated. Included here for reference:
Such questions are always best answered by looking at the code, if
you're fluent in Python.
RandomForestClassifier.predict, at least in the current version
0.16.1, predicts the class with highest probability estimate, as given by predict_proba. (this
line)
The documentation for predict_proba says:
The predicted class probabilities of an input sample is computed as the mean predicted class probabilities of the trees in the forest. The
class probability of a single tree is the fraction of samples of the
same class in a leaf.
The difference from the original method is probably just so that
predict gives predictions consistent with predict_proba. The
result is sometimes called "soft voting", rather than the "hard"
majority vote used in the original Breiman paper. I couldn't in quick
searching find an appropriate comparison of the performance of the two
methods, but they both seem fairly reasonable in this situation.
The predict documentation is at best quite misleading; I've
submitted a pull
request to
fix it.
If you want to do majority vote prediction instead, here's a function
to do it. Call it like predict_majvote(clf, X) rather than
clf.predict(X). (Based on predict_proba; only lightly tested, but
I think it should work.)
from scipy.stats import mode
from sklearn.ensemble.forest import _partition_estimators, _parallel_helper
from sklearn.tree._tree import DTYPE
from sklearn.externals.joblib import Parallel, delayed
from sklearn.utils import check_array
from sklearn.utils.validation import check_is_fitted
def predict_majvote(forest, X):
"""Predict class for X.
Uses majority voting, rather than the soft voting scheme
used by RandomForestClassifier.predict.
Parameters
----------
X : array-like or sparse matrix of shape = [n_samples, n_features]
The input samples. Internally, it will be converted to
``dtype=np.float32`` and if a sparse matrix is provided
to a sparse ``csr_matrix``.
Returns
-------
y : array of shape = [n_samples] or [n_samples, n_outputs]
The predicted classes.
"""
check_is_fitted(forest, 'n_outputs_')
# Check data
X = check_array(X, dtype=DTYPE, accept_sparse="csr")
# Assign chunk of trees to jobs
n_jobs, n_trees, starts = _partition_estimators(forest.n_estimators,
forest.n_jobs)
# Parallel loop
all_preds = Parallel(n_jobs=n_jobs, verbose=forest.verbose,
backend="threading")(
delayed(_parallel_helper)(e, 'predict', X, check_input=False)
for e in forest.estimators_)
# Reduce
modes, counts = mode(all_preds, axis=0)
if forest.n_outputs_ == 1:
return forest.classes_.take(modes[0], axis=0)
else:
n_samples = all_preds[0].shape[0]
preds = np.zeros((n_samples, forest.n_outputs_),
dtype=forest.classes_.dtype)
for k in range(forest.n_outputs_):
preds[:, k] = forest.classes_[k].take(modes[:, k], axis=0)
return preds
On the dumb synthetic case I tried, predictions agreed with the
predict method every time.
This was studied by Breiman in Bagging predictor (http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf).
This gives nearly identical results, but using soft voting gives smoother probabilities. Note that if you are using completely developed tree, you won't have any difference.

Setting feature weights for KNN

I am working with sklearn's implementation of KNN. While my input data has about 20 features, I believe some of the features are more important than others. Is there a way to:
set the feature weights for each feature when "training" the KNN learner.
learn what the optimal weight values are with or without pre-processing the data.
On a related note, I understand generally KNN does not require training but since sklearn implements it using KDTrees, the tree must be generated from the training data. However, this sounds like its turning KNN into a binary tree problem. Is that the case?
Thanks.
kNN is simply based on a distance function. When you say "feature two is more important than others" it usually means difference in feature two is worth, say, 10x difference in other coords. Simple way to achive this is by multiplying coord #2 by its weight. So you put into the tree not the original coords but coords multiplied by their respective weights.
In case your features are combinations of the coords, you might need to apply appropriate matrix transform on your coords before applying weights, see PCA (principal component analysis). PCA is likely to help you with question 2.
The answer to question to is called "metric learning" and currently not implemented in Scikit-learn. Using the popular Mahalanobis distance amounts to rescaling the data using StandardScaler. Ideally you would want your metric to take into account the labels.

Resources