How to apply learnt rules from a scikit-learn decision tree - scikit-learn

I am building a decision tree in scikit-learn. Searching Stack Overflow, one can find a way to extract the rules associated with each leaf. My goal now is to apply these rules to a new observation and see which leaf the new observation ends up in.
Here is an abstract example. Suppose the rule for leaf #1 is: if a < 5 and b > 7, then the observation belongs to leaf #1. Now I would like to take a new observation, apply these rules to it, and check which leaf it ends up in.
I am trying to use the decision tree for segmentation.

You can use the apply method of DecisionTreeClassifier to get the index of the leaf that each sample ends up in.
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit([[1,2,3],[10,19,20],[6,7,7]],[1,1,0])
clf.apply([[6,7,7]])
# array([3])
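To see not only the leaf index but also the rules a new observation passes on its way there, decision_path can be combined with the tree's internal arrays. A minimal sketch, reusing the toy data above (node numbering and thresholds depend on the fitted tree):
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X_train = [[1, 2, 3], [10, 19, 20], [6, 7, 7]]
y_train = [1, 1, 0]
X_new = np.array([[6, 7, 7]])

clf = DecisionTreeClassifier().fit(X_train, y_train)

leaf = clf.apply(X_new)[0]        # leaf index the new observation lands in
path = clf.decision_path(X_new)   # sparse indicator of all nodes on its path

# Print the rule checked at every internal node along the path.
for node in path.indices:
    if node == leaf:
        print(f"leaf {node} reached")
    else:
        feat = clf.tree_.feature[node]
        thr = clf.tree_.threshold[node]
        sign = "<=" if X_new[0, feat] <= thr else ">"
        print(f"node {node}: feature[{feat}] = {X_new[0, feat]} {sign} {thr:.2f}")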

An example of using a decision tree classifier with scikit-learn can be found here. This example includes training the classifier and validating the results on a second data set.
The predict method can be used to return the result for a new data sample when applying the trained decision tree to it:
predict(X, check_input=True)
where X is the feature vector of the new data sample under examination.
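For instance, on the toy data from the previous answer (a minimal sketch; check_input can be left at its default):
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit([[1, 2, 3], [10, 19, 20], [6, 7, 7]], [1, 1, 0])

# predict returns the class label for each new sample,
# whereas apply (above) returns the index of the leaf it lands in.
clf.predict([[6, 7, 7]])
# array([0]) on this toy data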
This link might help you to understand how to output the rules of your decision tree classifier.

Related

Adjust the tree parameters for a specific tree node

I use DecisionTreeClassifier from sklearn.
I need to change the splitting feature and min_samples_leaf used at a particular tree node.
How can I do it?
You cannot define min_samples_leaf for a single node: to satisfy the rule at that individual node, the model would probably end up assigning fewer samples to other nodes than the min_samples_leaf of the whole model.
If you are dealing with an imbalanced data set, I suggest you oversample or undersample your data before feeding it to the model, or manually set the class weights.
According to scikit-learn's user guide:
Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant. Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value.
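A minimal sketch of the class-weight option (the arrays below are placeholders for an imbalanced data set; "balanced" reweights classes inversely to their frequencies, and an explicit dict such as {0: 1, 1: 4} works as well):
from sklearn.tree import DecisionTreeClassifier

# Placeholder imbalanced data: class 0 is much more frequent than class 1.
X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

clf = DecisionTreeClassifier(class_weight="balanced", random_state=0)
clf.fit(X, y)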

RandomForestClassifier in Multi-label problem - how it works?

How does the RandomForestClassifier of sklearn handle a multilabel problem (under the hood)?
For example, does it break the problem into distinct one-label problems?
Just to be clear, I have not really tested it yet, but I see y : array-like, shape = [n_samples] or [n_samples, n_outputs] in the .fit() signature of RandomForestClassifier.
Let me cite scikit-learn. From the random forest section of the user guide:
Like decision trees, forests of trees also extend to multi-output problems (if Y is an array of size [n_samples, n_outputs]).
And from the multi-output problems section of the decision trees user guide:
… to support multi-output problems. This requires the following changes:
Store n output values in leaves, instead of 1;
Use splitting criteria that compute the average reduction across all n outputs.
And I hope this will answer your question. If not, you can look at the section's reference:
M. Dumont et al., Fast multi-class image annotation with random subwindows and multiple output randomized trees, International Conference on Computer Vision Theory and Applications, 2009.
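For concreteness, a minimal sketch of the multi-output usage described above (toy arrays; the only point of interest is that Y has shape [n_samples, n_outputs]):
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[0, 1], [1, 1], [2, 0], [3, 1]])
# Two output labels per sample, so Y has shape [n_samples, n_outputs].
Y = np.array([[0, 1],
              [0, 1],
              [1, 0],
              [1, 0]])

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, Y)

# Predictions come back with the same [n_samples, n_outputs] shape.
print(clf.predict([[2, 1]]))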
I was a bit confused when I started using trees. If you refer to the sklearn doc:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
If you go down the list of methods to predict_proba, you can see:
"The predicted class probability is the fraction of samples of the same class in a leaf."
So in predict, the class is the mode of the classes in that leaf. This can change if you use weighted classes:
"class_weight : dict, list of dicts, “balanced” or None, default=None
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one."
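A small illustration on toy data (the exact values are specific to this example):
from sklearn.tree import DecisionTreeClassifier

# Toy data where one leaf ends up impure, so the fractions are visible.
X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 1, 0, 1]

clf = DecisionTreeClassifier(max_depth=1).fit(X, y)

# Fraction of training samples of each class in the leaf that x=5 falls into.
print(clf.predict_proba([[5]]))  # roughly [[0.33, 0.67]] here
# predict returns the majority class of that leaf.
print(clf.predict([[5]]))        # [1]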
Hope this helps! :)

How to control feature subsetting in random forest in scikit-learn?

I am trying to change the way the random forest algorithm subsets features at every node. The original algorithm, as implemented in scikit-learn, subsets randomly. I want to define which subset to use for every new node, from several choices of subsets. Is there a direct way in scikit-learn to control this? If not, is there any way to modify the scikit-learn source code? If yes, which function in the source code do you think should be updated?
Short version: This is all you.
I assume by "subsetting features for every node" you are referring to the random selection of a subset of samples and possibly features used to train individual trees in the forest. If that's what you mean, then you aren't building a random forest; you want to make a nonrandom forest of particular trees.
One way to do that is to build each DecisionTreeClassifier individually using your carefully specified subset of features, then use the VotingClassifier to combine the trees into a forest. (That feature is only available in 0.17/dev, so you may have to build your own, but it is super simple to build a voting classifier estimator class.)
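A hedged sketch of that idea with the current API (ColumnTransformer is only used here as one convenient way to restrict each tree to a hand-picked column subset; the data and column groups are arbitrary):
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Placeholder data with four features.
X = np.random.RandomState(0).rand(100, 4)
y = (X[:, 0] + X[:, 3] > 1).astype(int)

def tree_on(columns):
    # One decision tree restricted to a chosen subset of feature columns.
    selector = ColumnTransformer([("subset", "passthrough", columns)])
    return make_pipeline(selector, DecisionTreeClassifier(random_state=0))

forest = VotingClassifier(
    estimators=[("t1", tree_on([0, 1])),
                ("t2", tree_on([2, 3])),
                ("t3", tree_on([0, 3]))],
    voting="hard",
)
forest.fit(X, y)
print(forest.predict(X[:5]))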

Extracting the trees (predictor) from random forest classifier

I have a specific technical question about sklearn's random forest classifier. After fitting the data with the .fit(X, y) method, is there a way to extract the actual trees from the estimator object, in some common format, so that the .predict(X) method can be implemented outside Python?
Yes, the trees of a forest are stored in the estimators_ attribute of the forest object.
You can have a look at the implementation of the export_graphviz function to learn how to write your own custom exporter:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/export.py
Here is the usage doc for this function:
http://scikit-learn.org/stable/modules/tree.html#classification
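A minimal sketch of both points, using the iris data purely as a placeholder:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# estimators_ is a plain Python list of fitted DecisionTreeClassifier objects.
first_tree = forest.estimators_[0]

# Export one tree to Graphviz .dot format; a custom exporter would walk
# first_tree.tree_ (children_left, children_right, feature, threshold, value).
export_graphviz(first_tree, out_file="tree_0.dot")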
Yes, there is, and @ogrisel's answer enabled me to implement the following snippet, which makes it possible to use a (partially trained) random forest to predict values. It saves a lot of time if you want to cross-validate a random forest model over the number of trees:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor()
rf_model.fit(x, y)  # x, y: your training data
estimators = rf_model.estimators_

def predict(X, i):
    # Predict with only the first i trees of the already fitted forest.
    rf_model.estimators_ = estimators[0:i]
    return rf_model.predict(X)
I explained this in more detail here: extract trees from a Random Forest

Setting feature weights for KNN

I am working with sklearn's implementation of KNN. My input data has about 20 features, but I believe some of the features are more important than others. Is there a way to:
1. set the feature weights for each feature when "training" the KNN learner.
2. learn what the optimal weight values are, with or without pre-processing of the data.
On a related note, I understand that KNN generally does not require training, but since sklearn implements it using KD-trees, the tree must be generated from the training data. However, this sounds like it's turning KNN into a binary tree problem. Is that the case?
Thanks.
kNN is simply based on a distance function. When you say "feature two is more important than others", it usually means that a difference in feature two is worth, say, 10x a difference in the other coordinates. A simple way to achieve this is to multiply coordinate #2 by its weight, so you put into the tree not the original coordinates but the coordinates multiplied by their respective weights.
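A minimal sketch of that weighting trick (the data and the weight values are placeholders chosen by hand):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data with three features.
X_train = np.random.RandomState(0).rand(50, 3)
y_train = (X_train[:, 1] > 0.5).astype(int)
X_test = np.random.RandomState(1).rand(5, 3)

# Making the second feature count 10x more in the Euclidean distance amounts to
# multiplying that column by 10 before fitting and before querying.
weights = np.array([1.0, 10.0, 1.0])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train * weights, y_train)
print(knn.predict(X_test * weights))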
In case your features are combinations of the coordinates, you might need to apply an appropriate matrix transform to your coordinates before applying the weights; see PCA (principal component analysis). PCA is likely to help you with question 2.
The answer to question 2 is called "metric learning" and is currently not implemented in scikit-learn. Using the popular Mahalanobis distance amounts to rescaling the data using StandardScaler; ideally, you would want your metric to take the labels into account.

Resources