What is the difference between growing a tree based learning algorithm vertically as compared to horizontally? - decision-tree

I came across the tree based algorithm Light GBM and I have read that it grows the trees vertically meaning that the Light GBM grows tree leaf-wise (while some other algorithms grow level-wise). I was just wondering and thinking: What is the advantage of growing a tree vertically? Are there any?
A difference (not necessarily an advantage) which I can see is the way you need to define early-stopping criteria while growing the tree. Any thoughts on this?

As described in this section of LightGBM's documentation
LightGBM uses leaf-wise (or what XGBoost calls lossguide) tree growth because it can achieve lower loss (i.e. better fit to the training data) than depth-wise tree growth, holding the number of leaves constant.
In leaf-wise tree growth, the split with the largest gain is chosen, regardless of its level of depth.
A difference ... I can see is the way you need to define early-stopping criteria while growing the tree
It's true that in this type of tree growth, you now have to consider two closely-related ways to prevent overfitting:
maximum depth (max_depth in LightGBM)
total allowed number of leaves (num_leaves in LightGBM)
I'm assuming this is what you meant by "early-stopping criteria", but wanted to also note that the phrase "early stopping" has a special meaning in GBMs that isn't related to how individual trees are grown. Early stopping, as XGBoost, LightGBM, and other GBM libraries refer to it, means "if performance on held-out data fails to improve for n iterations, stop training".

Related

Testing statistics behind decision tree

I found myself a problem about statistics. I am looking for solution...
Like if you have two decision trees, and one tree has dropped 970 leaves and the other has dropped 1027, you assume they should drop the same number of leaves.
How do you test if this number of counts is significantly different?
We use Poisson distribution for trees. And for over 1000 leaves in tree it's almost normal distribution plot. From this point it's easy to use statistics for it.

Rapidminer Decision Tree results

Sorry I am new to rapidminer and am a bit confused about the results of my decision tree. I have increased the minimal leaf size which has resulted in a smaller more readable decision tree, however I lost 2% in my accuracy results. I have been asked to explain why, but I'm at a loss as to what a better tree but less accuracy says about my data.
Any help us much appreciated.
Neil.
Increasing the minimal leaf size prevents very specialized leaves, which only "match" very few examples. This is usually done to prevent overfitting, which in its extreme form is simply reproducing the training data with 100% accuracy. Still, you expect to make the model more robust on real, new data. So depending on your way to estimate the accuracy, you expect some drop in accuracy when forcing the algorithm to create a more general model.
Also, when talking about a "more readable" tree, don't try to interpret the nodes individually. Don't assume the attribute in the root node is the most important one. It is always the combination of the attributes on your way to the leaf.

How to collapse a RandomForest into an equivalent decision tree?

The way I understand it, in creating a random forest, the algorithm bundles a bunch of randomly generated decision trees together, weighting them such that they fit the training data.
Is it reasonable to say that this average of forests could be simplified into a simple decision tree? And, if so - how can I access and present this tree?
What I'm looking to do here is extract the information in the tree to help identify both the leading attributes, their boundary values and placement in the tree. I'm assuming that such a tree would provide insight to a human (or computer heuristic) as to which attributes within a dataset provide the most insight into determining the target outcome.
This probably seems a naive question - and if so, please be patient, I'm new to this and want to get to a stage where I understand it sufficiently.
RandomForest uses bootstrap to create many training sets by sampling the data with replacement (bagging). Each bootstrapped set is very close to the original data, but slightly different, since it may have multiples of the some points and some other points in the original data will be missing. (This helps create a whole bunch of similar but different sets that as a whole represent the population your data came from, and allow better generalization)
Then it fits a DecisionTree to each set. However, what a regular DecisionTree does at each step, is to loop over each feature, find the best split for each feature, and in the end choose to do the split in the feature that produced the best one among all. In RandomForest, instead of looping over every feature to find the best split, you only try a random subsample at each step (default is sqrt(n_features)).
So, every tree in RandomForest is fit to a bootstrapped random training set. And at each branching step, it only looks at a subsample of features, so some of the branching will be good but not necessarily the ideal split. This means that each tree is a less than ideal fit to the original data. When you average the result of all these (sub-ideal) trees, though, you get a robust prediction. Regular DecisionTrees overfit the data, this two-way randomization (bagging and feature subsampling) allow them to generalize and a forest usually does a good job.
Here is the catch: While you can average out the output of each tree, you cannot really "average the trees" to get an "average tree". Since trees are a bunch of if-then statements that are chained, there is no way of taking these chains and coming up with a single chain that produces the result that's the same as averaged result from each chain. Each tree in the forest is different, even if same features show up, they show up in different places of the trees, which makes it impossible to combine. You cannot represent a RandomForest as a single tree.
There are two things you can do.
1) As RPresle mentioned, you can look at the .feature_importances_ attribute, which for each feature averages the splitting score from different trees. The idea is, while you can't get an average tree, you can quantify how much and how effectively each feature is used in the forest by averaging their score in each tree.
2) When I fit a RandomForest model and need to get some insight into what's happening, how the features are affecting the result, I also fit a single DecisionTree. Now, this model is usually not good at all by itself, it will easily be outperformed by the RandomForest and I wouldn't use it to predict anything, but by drawing and looking at the splits in this tree, combined with the .feature_importances_ of the forest, I usually get a pretty good idea of the big picture.

kd-tree BBF algorithm time complexity

I hava 2000 points with 5000 dimensions , and I want to get the nearest neighbour.
Now I have some problems , could anybody give a answer.
People say , it works good with high dimensions. What's the time complexity ?
#param max_nn_chks search is cut off after examining this many tree entries
After I read the algorithm, I wonder if I would get the wrong answer when I set the max_nn_chks too low. If yes, then just tell me how to set this parameter, else give a reason, thanks.
Is the kdtree the best Data Structures for my data to get nearest neighbour?
The time complexity is basically the same as in restricted KD-Tree search plus some little time to maintain the priority queue. The restricted KD-Tree search algorithm needs to traverse the tree in its full depth (log2 of the point count) times the limit (maximum number of leaf nodes/points allowed to be visited).
Yes, you will get a wrong answer if the limit is too low. You can only measure fraction of true NN found versus number of leaf nodes searched. From this, you can determine your optimal value.
Usually a randomized kd-tree forest and hierarchical k-means tree perform best. FLANN provides a method to determine which algorithm to use (k-means vs randomized kd-tree forest) and sets the optimal parameters for you.
The structure of data also has a big impact. If you know there are clusters of points being close together, for example, you can group them in a single node of a tree (represent them by their centroid, for example) and speed up the search.
Another techniques such as visual words, PCA or random projections can be employed on the data. It's a quite active field of research.

Data mining for significant variables (numerical): Where to start?

I have a trading strategy on the foreign exchange market that I am attempting to improve upon.
I have a huge table (100k+ rows) that represent every possible trade in the market, the type of trade (buy or sell), the profit/loss after that trade closed, and 10 or so additional variables that represent various market measurements at the time of trade-opening.
I am trying to find out if any of these 10 variables are significantly related to the profits/losses.
For example, imagine that variable X ranges from 50 to -50.
The average value of X for a buy order is 25, and for a sell order is -25.
If most profitable buy orders have a value of X > 25, and most profitable sell orders have a value of X < -25 then I would consider the relationship of X-to-profit as significant.
I would like a good starting point for this. I have installed RapidMiner 5 in case someone can give me a specific recommendation for that.
A Decision Tree is perhaps the best place to begin.
The tree itself is a visual summary of feature importance ranking (or significant variables as phrased in the OP).
gives you a visual representation of the entire
classification/regression analysis (in the form of a binary tree),
which distinguishes it from any other analytical/statistical
technique that i am aware of;
decision tree algorithms require very little pre-processing on your data, no normalization, no rescaling, no conversion of discrete variables into integers (eg, Male/Female => 0/1); they can accept both categorical (discrete) and continuous variables, and many implementations can handle incomplete data (values missing from some of the rows in your data matrix); and
again, the tree itself is a visual summary of feature importance ranking
(ie, significant variables)--the most significant variable is the
root node, and is more significant than the two child nodes, which in
turn are more significant than their four combined children. "significance" here means the percent of variance explained (with respect to some response variable, aka 'target variable' or the thing
you are trying to predict). One proviso: from a visual inspection of
a decision tree you cannot distinguish variable significance from
among nodes of the same rank.
If you haven't used them before, here's how Decision Trees work: the algorithm will go through every variable (column) in your data and every value for each variable and split your data into two sub-sets based on each of those values. Which of these splits is actually chosen by the algorithm--i.e., what is the splitting criterion? The particular variable/value combination that "purifies" the data the most (i.e., maximizes the information gain) is chosen to split the data (that variable/value combination is usually indicated as the node's label). This simple heuristic is just performed recursively until the remaining data sub-sets are pure or further splitting doesn't increase the information gain.
What does this tell you about the "importance" of the variables in your data set? Well importance is indicated by proximity to the root node--i.e., hierarchical level or rank.
One suggestion: decision trees handle both categorical and discrete data usually without problem; however, in my experience, decision tree algorithms always perform better if the response variable (the variable you are trying to predict using all other variables) is discrete/categorical rather than continuous. It looks like yours is probably continuous, in which case in would consider discretizing it (unless doing so just causes the entire analysis to be meaningless). To do this, just bin your response variable values using parameters (bin size, bin number, and bin edges) meaningful w/r/t your problem domain--e.g., if your r/v is comprised of 'continuous values' from 1 to 100, you might sensibly bin them into 5 bins, 0-20, 21-40, 41-60, and so on.
For instance, from your Question, suppose one variable in your data is X and it has 5 values (10, 20, 25, 50, 100); suppose also that splitting your data on this variable with the third value (25) results in two nearly pure subsets--one low-value and one high-value. As long as this purity were higher than for the sub-sets obtained from splitting on the other values, the data would be split on that variable/value pair.
RapidMiner does indeed have a decision tree implementation, and it seems there are quite a few tutorials available on the Web (e.g., from YouTube, here and here). (Note, I have not used the decision tree module in R/M, nor have i used RapidMiner at all.)
The other set of techniques i would consider is usually grouped under the rubric Dimension Reduction. Feature Extraction and Feature Selection are two perhaps the most common terms after D/R. The most widely used is PCA, or principal-component analysis, which is based on an eigen-vector decomposition of the covariance matrix (derived from to your data matrix).
One direct result from this eigen-vector decomp is the fraction of variability in the data accounted for by each eigenvector. Just from this result, you can determine how many dimensions are required to explain, e.g., 95% of the variability in your data
If RapidMiner has PCA or another functionally similar dimension reduction technique, it's not obvious where to find it. I do know that RapidMiner has an R Extension, which of course let's you access R inside RapidMiner.R has plenty of PCA libraries (Packages). The ones i mention below are all available on CRAN, which means any of the PCA Packages there satisfy the minimum Package requirements for documentation and vignettes (code examples). I can recommend pcaPP (Robust PCA by Projection Pursuit).
In addition, i can recommend two excellent step-by-step tutorials on PCA. The first is from the NIST Engineering Statistics Handbook. The second is a tutorial for Independent Component Analysis (ICA) rather than PCA, but i mentioned it here because it's an excellent tutorial and the two techniques are used for the similar purposes.

Resources