Binary decision tree in Scikit Learn - decision-tree

i have a simple question that i didn't understand:
Why decision tree in Scikit Learn is Binary tree instead of n-ary tree?
Anyone knows the answer? Please tell me, thank you so much.

This is better suited for the cross-validated site, but the answer is simplicity. Any decision tree simply partitions space with leaf nodes being data and assignment being a function of that data, typically majority in case of classification and empirical average in case of regression.
However, every decision tree can be converted into a binary decision tree. Intuitively, if you have a rule at level 1 like ( X1 < 1 AND X2 > 10) then this can be converted into a two-level run by shifting one part of the predicate downward.
It is much simpler to train binary decision tree than n-ary because of the combinatorial explosion that takes place. Instead of randomly picking a splitting variable and then optimizing over that field (1-d optimization) N-ary trees must select a subset of variables and optimize over that set.

Related

What is the difference between growing a tree based learning algorithm vertically as compared to horizontally?

I came across the tree based algorithm Light GBM and I have read that it grows the trees vertically meaning that the Light GBM grows tree leaf-wise (while some other algorithms grow level-wise). I was just wondering and thinking: What is the advantage of growing a tree vertically? Are there any?
A difference (not necessarily an advantage) which I can see is the way you need to define early-stopping criteria while growing the tree. Any thoughts on this?
As described in this section of LightGBM's documentation
LightGBM uses leaf-wise (or what XGBoost calls lossguide) tree growth because it can achieve lower loss (i.e. better fit to the training data) than depth-wise tree growth, holding the number of leaves constant.
In leaf-wise tree growth, the split with the largest gain is chosen, regardless of its level of depth.
A difference ... I can see is the way you need to define early-stopping criteria while growing the tree
It's true that in this type of tree growth, you now have to consider two closely-related ways to prevent overfitting:
maximum depth (max_depth in LightGBM)
total allowed number of leaves (num_leaves in LightGBM)
I'm assuming this is what you meant by "early-stopping criteria", but wanted to also note that the phrase "early stopping" has a special meaning in GBMs that isn't related to how individual trees are grown. Early stopping, as XGBoost, LightGBM, and other GBM libraries refer to it, means "if performance on held-out data fails to improve for n iterations, stop training".

How to compute the iteration matrix for nth NLBGS iteration

I was wondering if there was a direct way of computing the iteration matrix for nth Linear Block Gauss Seidel iteration within OpenMDAO?
thank you
If I understand you correctly, you are referring to the matrix-form of the Gauss Seidel algorithm where you take Ax=b, and break A up into the Diagonal (D), Lower (L) and Upper (U) parts, then use those parts to compute the next iterate.
Specifically you compute [D-L]^-1. This, I believe is what you are referring to as the "iteration matrix" (I am not familiar with this terminology, but based on the algorithm I'm comfortable making an educated guess).
This formulation of the algorithm is useful to think about and a simple way to implement it, but OpenMDAO takes a different approach. The LBGS algorithm implemented in OpenMDAO is set up to work in a matrix-free manner. That means it only interacts with the linear operator methods solve_linear and apply_linear and never explicitly assembles the A matrix at all. Hence there isn't an opportunity to split A up into D, L, U.
Depending on the way you constructed the model, the A matrix you would need might or might not be there at all because OpenMDAO is capable of working in a completely matrix free context. However, if all of your components use the compute_partials or linearize methods to provide partial derivatives then the data you would need for the A matrix does exist in memory.
You'll have to dig for it a bit, and ironically the best place to see how to do that is in the direct solver which does actually require the matrix be formed to compute a factorization.
Also, in that code you'll see a function can iteratively call the linear operator to construct a dense matrix even if the underlying components don't provide their partials directly. Please note that this approach for assembling the matrix is extremely slow and is not recommended for normal operations.

Rapidminer Decision Tree results

Sorry I am new to rapidminer and am a bit confused about the results of my decision tree. I have increased the minimal leaf size which has resulted in a smaller more readable decision tree, however I lost 2% in my accuracy results. I have been asked to explain why, but I'm at a loss as to what a better tree but less accuracy says about my data.
Any help us much appreciated.
Neil.
Increasing the minimal leaf size prevents very specialized leaves, which only "match" very few examples. This is usually done to prevent overfitting, which in its extreme form is simply reproducing the training data with 100% accuracy. Still, you expect to make the model more robust on real, new data. So depending on your way to estimate the accuracy, you expect some drop in accuracy when forcing the algorithm to create a more general model.
Also, when talking about a "more readable" tree, don't try to interpret the nodes individually. Don't assume the attribute in the root node is the most important one. It is always the combination of the attributes on your way to the leaf.

Feature scaling and its affect on various algorithm

Despite going through lots of similar question related to this I still could not understand why some algorithm is susceptible to it while others are not.
Till now I found that SVM and K-means are susceptible to feature scaling while Linear Regression and Decision Tree are not.Can somebody please elaborate me why? in general or relating to this 4 algorithm.
As I am a beginner, please explain this in layman terms.
One reason I can think of off-hand is that SVM and K-means, at least with a basic configuration, uses an L2 distance metric. An L1 or L2 distance metric between two points will give different results if you double delta-x or delta-y, for example.
With Linear Regression, you fit a linear transform to best describe the data by effectively transforming the coordinate system before taking a measurement. Since the optimal model is the same no matter the coordinate system of the data, pretty much by definition, your result will be invariant to any linear transform including feature scaling.
With Decision Trees, you typically look for rules of the form x < N, where the only detail that matters is how many items pass or fail the given threshold test - you pass this into your entropy function. Because this rule format does not depend on dimension scale, since there is no continuous distance metric, we again have in-variance.
Somewhat different reasons for each, but I hope that helps.

How to collapse a RandomForest into an equivalent decision tree?

The way I understand it, in creating a random forest, the algorithm bundles a bunch of randomly generated decision trees together, weighting them such that they fit the training data.
Is it reasonable to say that this average of forests could be simplified into a simple decision tree? And, if so - how can I access and present this tree?
What I'm looking to do here is extract the information in the tree to help identify both the leading attributes, their boundary values and placement in the tree. I'm assuming that such a tree would provide insight to a human (or computer heuristic) as to which attributes within a dataset provide the most insight into determining the target outcome.
This probably seems a naive question - and if so, please be patient, I'm new to this and want to get to a stage where I understand it sufficiently.
RandomForest uses bootstrap to create many training sets by sampling the data with replacement (bagging). Each bootstrapped set is very close to the original data, but slightly different, since it may have multiples of the some points and some other points in the original data will be missing. (This helps create a whole bunch of similar but different sets that as a whole represent the population your data came from, and allow better generalization)
Then it fits a DecisionTree to each set. However, what a regular DecisionTree does at each step, is to loop over each feature, find the best split for each feature, and in the end choose to do the split in the feature that produced the best one among all. In RandomForest, instead of looping over every feature to find the best split, you only try a random subsample at each step (default is sqrt(n_features)).
So, every tree in RandomForest is fit to a bootstrapped random training set. And at each branching step, it only looks at a subsample of features, so some of the branching will be good but not necessarily the ideal split. This means that each tree is a less than ideal fit to the original data. When you average the result of all these (sub-ideal) trees, though, you get a robust prediction. Regular DecisionTrees overfit the data, this two-way randomization (bagging and feature subsampling) allow them to generalize and a forest usually does a good job.
Here is the catch: While you can average out the output of each tree, you cannot really "average the trees" to get an "average tree". Since trees are a bunch of if-then statements that are chained, there is no way of taking these chains and coming up with a single chain that produces the result that's the same as averaged result from each chain. Each tree in the forest is different, even if same features show up, they show up in different places of the trees, which makes it impossible to combine. You cannot represent a RandomForest as a single tree.
There are two things you can do.
1) As RPresle mentioned, you can look at the .feature_importances_ attribute, which for each feature averages the splitting score from different trees. The idea is, while you can't get an average tree, you can quantify how much and how effectively each feature is used in the forest by averaging their score in each tree.
2) When I fit a RandomForest model and need to get some insight into what's happening, how the features are affecting the result, I also fit a single DecisionTree. Now, this model is usually not good at all by itself, it will easily be outperformed by the RandomForest and I wouldn't use it to predict anything, but by drawing and looking at the splits in this tree, combined with the .feature_importances_ of the forest, I usually get a pretty good idea of the big picture.

Resources