decision tree and maximum depth - decision-tree

I am new to decision tree model and have a seemingly stupid question. Say I have a decision of which max_depth is set to be 20. Does it mean that the tree has to be 20 layers deep or any integer smaller than 20 is possible?

If you precise max_depth = 20, then the tree can have leaves anywhere between 1 and 20 layers deep.
That's why they put max_ next to depth ;) or else it would've been just depth

It can have any number between 1 to 20, it depends on at which level the hypothetical model performs best.For more details on decision tree refer this.
https://infocenter.informationbuilders.com/wf80/topic/pubdocs/RStat16/source/topic47.htm

Related

Improving accuracy of nearest neighbours algorithm - unsupervised learning problem

I have a situation where I am trying to find out 3 nearest neighbours for a given ID in my dataframe. I am using NN alogrithm (not KNN) to achieve this. The below code is giving me the three nearest neighbours, for the top node the results are fine but for the middle ones and the bottom ones the accuracy is only 1/3 neighbours are correct whereas I am eyeing to have atleast 2/3 neighours correct at every ID. My dataset has 47 features and 5000 points.
from sklearn.neighbors import KDTree
def findsuccess(sso_id):
neighbors_f_sso_id = np.where(nbrs.kneighbors_graph([X[i]]))[0]
print('Neighbors of id', neighbors_f_sso_id)
kdt = KDTree(X, leaf_size=40, metric='euclidean')
kdt.query(X, k=4, return_distance=False)
The above code will return the ID itself and the 3 nearest neighbours ,hence k=4
I have read that due to curse of dimensionality, this NN algorithm might not work well as there are about 47 features in my dataset but this is the only option I think I have when it comes to a data frame without a target variable. There is one article available here that says the KD Tree is not best of the algorithms that can be used.
What would be the best way to achieve the maximum accuracy, meaning achieving minimum distance?
Do I need to perform scaling before passing into KD Tree algorithm? Any other things that I need to take care off?

how to explain this decision tree interpretability question?

enter image description here
enter image description here
The 2 pictures above shown the 2 decision tree....
Question is: It is often claimed that a strength of decision trees is their interpretability.
Is this always justified? Refer to Figures 5 and 6 to help with your answer.
I think the point of the question is saying that a decision tree is interpretable if its depth is relatively small. The second tree is very deep i.e for one single prediction, you will get a high number of different splitting decisions to process. Therefore, you lose interpretability because the explanation for any prediction is an intersection of too many conditions for a human-user to process.

Rapidminer Decision Tree results

Sorry I am new to rapidminer and am a bit confused about the results of my decision tree. I have increased the minimal leaf size which has resulted in a smaller more readable decision tree, however I lost 2% in my accuracy results. I have been asked to explain why, but I'm at a loss as to what a better tree but less accuracy says about my data.
Any help us much appreciated.
Neil.
Increasing the minimal leaf size prevents very specialized leaves, which only "match" very few examples. This is usually done to prevent overfitting, which in its extreme form is simply reproducing the training data with 100% accuracy. Still, you expect to make the model more robust on real, new data. So depending on your way to estimate the accuracy, you expect some drop in accuracy when forcing the algorithm to create a more general model.
Also, when talking about a "more readable" tree, don't try to interpret the nodes individually. Don't assume the attribute in the root node is the most important one. It is always the combination of the attributes on your way to the leaf.

Binary decision tree in Scikit Learn

i have a simple question that i didn't understand:
Why decision tree in Scikit Learn is Binary tree instead of n-ary tree?
Anyone knows the answer? Please tell me, thank you so much.
This is better suited for the cross-validated site, but the answer is simplicity. Any decision tree simply partitions space with leaf nodes being data and assignment being a function of that data, typically majority in case of classification and empirical average in case of regression.
However, every decision tree can be converted into a binary decision tree. Intuitively, if you have a rule at level 1 like ( X1 < 1 AND X2 > 10) then this can be converted into a two-level run by shifting one part of the predicate downward.
It is much simpler to train binary decision tree than n-ary because of the combinatorial explosion that takes place. Instead of randomly picking a splitting variable and then optimizing over that field (1-d optimization) N-ary trees must select a subset of variables and optimize over that set.

How are feature_importances in RandomForestClassifier determined?

I have a classification task with a time-series as the data input, where each attribute (n=23) represents a specific point in time. Besides the absolute classification result I would like to find out, which attributes/dates contribute to the result to what extent. Therefore I am just using the feature_importances_, which works well for me.
However, I would like to know how they are getting calculated and which measure/algorithm is used. Unfortunately I could not find any documentation on this topic.
There are indeed several ways to get feature "importances". As often, there is no strict consensus about what this word means.
In scikit-learn, we implement the importance as described in [1] (often cited, but unfortunately rarely read...). It is sometimes called "gini importance" or "mean decrease impurity" and is defined as the total decrease in node impurity (weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node)) averaged over all trees of the ensemble.
In the literature or in some other packages, you can also find feature importances implemented as the "mean decrease accuracy". Basically, the idea is to measure the decrease in accuracy on OOB data when you randomly permute the values for that feature. If the decrease is low, then the feature is not important, and vice-versa.
(Note that both algorithms are available in the randomForest R package.)
[1]: Breiman, Friedman, "Classification and regression trees", 1984.
The usual way to compute the feature importance values of a single tree is as follows:
you initialize an array feature_importances of all zeros with size n_features.
you traverse the tree: for each internal node that splits on feature i you compute the error reduction of that node multiplied by the number of samples that were routed to the node and add this quantity to feature_importances[i].
The error reduction depends on the impurity criterion that you use (e.g. Gini, Entropy, MSE, ...). Its the impurity of the set of examples that gets routed to the internal node minus the sum of the impurities of the two partitions created by the split.
Its important that these values are relative to a specific dataset (both error reduction and the number of samples are dataset specific) thus these values cannot be compared between different datasets.
As far as I know there are alternative ways to compute feature importance values in decision trees. A brief description of the above method can be found in "Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
It's the ratio between the number of samples routed to a decision node involving that feature in any of the trees of the ensemble over the total number of samples in the training set.
Features that are involved in the top level nodes of the decision trees tend to see more samples hence are likely to have more importance.
Edit: this description is only partially correct: Gilles and Peter's answers are the correct answer.
As #GillesLouppe pointed out above, scikit-learn currently implements the "mean decrease impurity" metric for feature importances. I personally find the second metric a bit more interesting, where you randomly permute the values for each of your features one-by-one and see how much worse your out-of-bag performance is.
Since what you're after with feature importance is how much each feature contributes to your overall model's predictive performance, the second metric actually gives you a direct measure of this, whereas the "mean decrease impurity" is just a good proxy.
If you're interested, I wrote a small package that implements the Permutation Importance metric and can be used to calculate the values from an instance of a scikit-learn random forest class:
https://github.com/pjh2011/rf_perm_feat_import
Edit: This works for Python 2.7, not 3
code:
iris = datasets.load_iris()
X = iris.data
y = iris.target
clf = DecisionTreeClassifier()
clf.fit(X, y)
decision_tree plot:
enter image description here
We get
compute_feature_importance:[0. ,0.01333333,0.06405596,0.92261071]
Check source code:
cpdef compute_feature_importances(self, normalize=True):
"""Computes the importance of each feature (aka variable)."""
cdef Node* left
cdef Node* right
cdef Node* nodes = self.nodes
cdef Node* node = nodes
cdef Node* end_node = node + self.node_count
cdef double normalizer = 0.
cdef np.ndarray[np.float64_t, ndim=1] importances
importances = np.zeros((self.n_features,))
cdef DOUBLE_t* importance_data = <DOUBLE_t*>importances.data
with nogil:
while node != end_node:
if node.left_child != _TREE_LEAF:
# ... and node.right_child != _TREE_LEAF:
left = &nodes[node.left_child]
right = &nodes[node.right_child]
importance_data[node.feature] += (
node.weighted_n_node_samples * node.impurity -
left.weighted_n_node_samples * left.impurity -
right.weighted_n_node_samples * right.impurity)
node += 1
importances /= nodes[0].weighted_n_node_samples
if normalize:
normalizer = np.sum(importances)
if normalizer > 0.0:
# Avoid dividing by zero (e.g., when root is pure)
importances /= normalizer
return importances
Try calculate the feature importance:
print("sepal length (cm)",0)
print("sepal width (cm)",(3*0.444-(0+0)))
print("petal length (cm)",(54* 0.168 - (48*0.041+6*0.444)) +(46*0.043 -(0+3*0.444)) + (3*0.444-(0+0)))
print("petal width (cm)",(150* 0.667 - (0+100*0.5)) +(100*0.5-(54*0.168+46*0.043))+(6*0.444 -(0+3*0.444)) + (48*0.041-(0+0)))
We get feature_importance: np.array([0,1.332,6.418,92.30]).
After normalized, we get array ([0., 0.01331334, 0.06414793, 0.92253873]),this is same as clf.feature_importances_.
Be careful all classes are supposed to have weight one.
For those looking for a reference to the scikit-learn's documentation on this topic or a reference to the answer by #GillesLouppe:
In RandomForestClassifier, estimators_ attribute is a list of DecisionTreeClassifier (as mentioned in the documentation). In order to compute the feature_importances_ for the RandomForestClassifier, in scikit-learn's source code, it averages over all estimator's (all DecisionTreeClassifer's) feature_importances_ attributes in the ensemble.
In DecisionTreeClassifer's documentation, it is mentioned that "The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance [1]."
Here is a direct link for more info on variable and Gini importance, as provided by scikit-learn's reference below.
[1] L. Breiman, and A. Cutler, “Random Forests”, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Feature Importance in Random Forest
Random forest uses many trees, and thus, the variance is reduced
Random forest allows far more exploration of feature combinations as well
Decision trees gives Variable Importance and it is more if there is reduction in impurity (reduction in Gini impurity)
Each tree has a different Order of Importance
Here is what happens in the background!
We take an attribute and check in all the trees where it is present and take the average values of the change in the homogeneity on this attribute split. This average value of change in the homogeneity gives us the feature importance of the attribute

Resources