Testing statistics behind decision tree - statistics

I have run into a statistics problem and am looking for a solution.
Suppose you have two decision trees: one has dropped 970 leaves and the other has dropped 1027, but you would expect them to drop roughly the same number of leaves.
How do you test whether these two counts are significantly different?

Leaf counts like these can be modeled with a Poisson distribution. For counts above roughly 1000, the Poisson distribution is very well approximated by a normal distribution, and from that point on it is easy to apply standard statistical tests.
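As a minimal sketch of what such a test could look like (assuming the two leaf counts are independent Poisson counts with a common mean under the null hypothesis), the usual normal-approximation z-test for comparing two Poisson counts is:

from math import sqrt
from scipy.stats import norm

n1, n2 = 970, 1027  # leaf counts dropped by the two trees

# Under H0 both counts share the same Poisson mean, so n1 - n2 is
# approximately normal with variance n1 + n2 for counts this large.
z = (n1 - n2) / sqrt(n1 + n2)
p_value = 2 * norm.sf(abs(z))  # two-sided p-value

print(f"z = {z:.2f}, p = {p_value:.2f}")  # roughly z = -1.28, p = 0.20

With these two counts the difference is not significant at conventional levels.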

Related

What is the difference between growing a tree based learning algorithm vertically as compared to horizontally?

I came across the tree-based algorithm LightGBM and have read that it grows trees vertically, meaning that LightGBM grows trees leaf-wise (while some other algorithms grow level-wise). I was just wondering: what is the advantage of growing a tree vertically? Are there any?
A difference (not necessarily an advantage) which I can see is the way you need to define early-stopping criteria while growing the tree. Any thoughts on this?
As described in this section of LightGBM's documentation:
LightGBM uses leaf-wise (or what XGBoost calls lossguide) tree growth because it can achieve lower loss (i.e. better fit to the training data) than depth-wise tree growth, holding the number of leaves constant.
In leaf-wise tree growth, the split with the largest gain is chosen, regardless of its level of depth.
A difference ... I can see is the way you need to define early-stopping criteria while growing the tree
It's true that in this type of tree growth, you now have to consider two closely-related ways to prevent overfitting:
maximum depth (max_depth in LightGBM)
total allowed number of leaves (num_leaves in LightGBM)
I'm assuming this is what you meant by "early-stopping criteria", but wanted to also note that the phrase "early stopping" has a special meaning in GBMs that isn't related to how individual trees are grown. Early stopping, as XGBoost, LightGBM, and other GBM libraries refer to it, means "if performance on held-out data fails to improve for n iterations, stop training".
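For concreteness, here is a minimal sketch of how these controls can be set with the Python lightgbm package (num_leaves, max_depth and the early-stopping callback are part of LightGBM's documented interface, but the toy data and exact values are placeholders, and callback usage varies somewhat by version):

import lightgbm as lgb
import numpy as np

# Toy data, only to make the sketch self-contained.
X = np.random.rand(1000, 10)
y = (X[:, 0] + 0.3 * np.random.rand(1000) > 0.6).astype(int)
train = lgb.Dataset(X[:800], label=y[:800])
valid = lgb.Dataset(X[800:], label=y[800:], reference=train)

params = {
    "objective": "binary",
    "num_leaves": 31,  # caps the total number of leaves of each leaf-wise grown tree
    "max_depth": 6,    # caps the depth, the closely related second control
}

# "Early stopping" in the GBM sense: stop boosting when the validation
# metric has not improved for 20 consecutive iterations.
booster = lgb.train(
    params,
    train,
    num_boost_round=500,
    valid_sets=[valid],
    callbacks=[lgb.early_stopping(stopping_rounds=20)],
)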

How to collapse a RandomForest into an equivalent decision tree?

The way I understand it, in creating a random forest, the algorithm bundles a bunch of randomly generated decision trees together, weighting them such that they fit the training data.
Is it reasonable to say that this averaged forest could be simplified into a single decision tree? And, if so, how can I access and present this tree?
What I'm looking to do here is extract the information in the tree to help identify the leading attributes, their boundary values, and their placement in the tree. I'm assuming that such a tree would provide insight to a human (or computer heuristic) as to which attributes within a dataset provide the most insight into determining the target outcome.
This probably seems a naive question - and if so, please be patient, I'm new to this and want to get to a stage where I understand it sufficiently.
RandomForest uses bootstrapping to create many training sets by sampling the data with replacement (bagging). Each bootstrapped set is very close to the original data but slightly different, since it may contain multiple copies of some points while other points from the original data are missing. (This creates a whole bunch of similar but different sets that, taken together, represent the population your data came from, and it allows better generalization.)
Then it fits a DecisionTree to each set. However, what a regular DecisionTree does at each step is loop over every feature, find the best split for each feature, and in the end choose the split on the feature that produced the best one of all. In RandomForest, instead of looping over every feature to find the best split, you only try a random subsample of the features at each step (the default is sqrt(n_features)).
So, every tree in RandomForest is fit to a bootstrapped random training set. And at each branching step, it only looks at a subsample of features, so some of the splits will be good but not necessarily ideal. This means that each tree is a less-than-ideal fit to the original data. When you average the results of all these (sub-ideal) trees, though, you get a robust prediction. Regular DecisionTrees overfit the data; this two-way randomization (bagging and feature subsampling) lets the trees generalize, and a forest usually does a good job.
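To make the two sources of randomness concrete, here is a minimal NumPy sketch (an illustration, not the actual scikit-learn internals) of bootstrapping the rows and subsampling the features at a split:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))           # 100 samples, 16 features

# 1) Bagging: a bootstrap sample of the rows, drawn with replacement.
# Some rows appear several times, others not at all.
boot_rows = rng.choice(len(X), size=len(X), replace=True)
X_boot = X[boot_rows]

# 2) Feature subsampling: at each split, only sqrt(n_features) randomly
# chosen features are considered instead of all of them.
n_candidates = int(np.sqrt(X.shape[1]))  # here 4 out of 16
candidate_features = rng.choice(X.shape[1], size=n_candidates, replace=False)
print(sorted(candidate_features))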
Here is the catch: while you can average the output of each tree, you cannot really "average the trees" to get an "average tree". Since trees are chains of if-then statements, there is no way of taking these chains and coming up with a single chain that produces the same result as the average over all chains. Each tree in the forest is different; even when the same features show up, they show up in different places of the trees, which makes them impossible to combine. You cannot represent a RandomForest as a single tree.
There are two things you can do.
1) As RPresle mentioned, you can look at the .feature_importances_ attribute, which for each feature averages the splitting score from different trees. The idea is, while you can't get an average tree, you can quantify how much and how effectively each feature is used in the forest by averaging their score in each tree.
2) When I fit a RandomForest model and need some insight into what's happening and how the features affect the result, I also fit a single DecisionTree. This model is usually not very good by itself; it will easily be outperformed by the RandomForest and I wouldn't use it to predict anything. But by drawing and inspecting the splits of this tree, combined with the .feature_importances_ of the forest, I usually get a pretty good idea of the big picture.
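A minimal scikit-learn sketch of both suggestions (the dataset and parameter values are placeholders):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X, y, names = data.data, data.target, list(data.feature_names)

# 1) Importance of each feature, averaged over all trees in the forest.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = sorted(zip(names, forest.feature_importances_), key=lambda t: -t[1])[:5]
for name, importance in top:
    print(f"{name:25s} {importance:.3f}")

# 2) A single shallow tree, fit only so a human can read its splits.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=names))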

kd-tree BBF algorithm time complexity

I have 2000 points with 5000 dimensions, and I want to find the nearest neighbour.
Now I have some problems; could anybody give an answer?
People say it works well in high dimensions. What is its time complexity?
#param max_nn_chks search is cut off after examining this many tree entries
After reading the algorithm, I wonder whether I would get a wrong answer if I set max_nn_chks too low. If so, how should I set this parameter? If not, why not? Thanks.
Is a kd-tree the best data structure for finding the nearest neighbour in my data?
The time complexity is basically the same as for restricted kd-tree search, plus a little extra for maintaining the priority queue. Restricted kd-tree search has to traverse the tree to its full depth (log2 of the point count) times the limit (the maximum number of leaf nodes/points allowed to be visited).
Yes, you will get wrong answers if the limit is too low. You can only measure the fraction of true nearest neighbours found as a function of the number of leaf nodes searched, and from that determine your optimal value.
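As a sketch of that measurement, here is how the recall-versus-checks curve could be estimated with FLANN's Python bindings (the algorithm='kdtree' and checks arguments are taken from the pyflann interface as I recall it and may differ by version; the data is random placeholder data):

import numpy as np
from pyflann import FLANN  # FLANN's Python bindings; check the API of your version

rng = np.random.default_rng(0)
data = rng.random((2000, 5000)).astype(np.float32)
queries = rng.random((20, 5000)).astype(np.float32)

# Exact nearest neighbours by brute force, as ground truth.
truth = np.array([np.argmin(((data - q) ** 2).sum(axis=1)) for q in queries])

flann = FLANN()
flann.build_index(data, algorithm="kdtree", trees=4)

for checks in (8, 32, 128, 512):  # the max_nn_chks-style cutoff
    idx, _ = flann.nn_index(queries, num_neighbors=1, checks=checks)
    recall = float(np.mean(idx.ravel() == truth))
    print(f"checks={checks:4d}  fraction of true NN found = {recall:.2f}")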
Usually a randomized kd-tree forest and hierarchical k-means tree perform best. FLANN provides a method to determine which algorithm to use (k-means vs randomized kd-tree forest) and sets the optimal parameters for you.
The structure of data also has a big impact. If you know there are clusters of points being close together, for example, you can group them in a single node of a tree (represent them by their centroid, for example) and speed up the search.
Other techniques, such as visual words, PCA or random projections, can also be applied to the data. It's quite an active field of research.

Data mining for significant variables (numerical): Where to start?

I have a trading strategy on the foreign exchange market that I am attempting to improve upon.
I have a huge table (100k+ rows) that represents every possible trade in the market: the type of trade (buy or sell), the profit/loss after the trade closed, and 10 or so additional variables representing various market measurements at the time of trade-opening.
I am trying to find out if any of these 10 variables are significantly related to the profits/losses.
For example, imagine that variable X ranges from 50 to -50.
The average value of X for a buy order is 25, and for a sell order is -25.
If most profitable buy orders have a value of X > 25, and most profitable sell orders have a value of X < -25 then I would consider the relationship of X-to-profit as significant.
I would like a good starting point for this. I have installed RapidMiner 5 in case someone can give me a specific recommendation for that.
A Decision Tree is perhaps the best place to begin.
The tree itself is a visual summary of feature importance ranking (or significant variables as phrased in the OP).
In particular, a decision tree:
1) gives you a visual representation of the entire classification/regression analysis (in the form of a binary tree), which distinguishes it from any other analytical/statistical technique I am aware of;
2) requires very little pre-processing of your data: no normalization, no rescaling, no conversion of discrete variables into integers (e.g., Male/Female => 0/1); decision trees accept both categorical (discrete) and continuous variables, and many implementations can handle incomplete data (values missing from some of the rows in your data matrix); and
3) again, is itself a visual summary of feature importance ranking (i.e., of the significant variables): the most significant variable is the root node, which is more significant than its two child nodes, which in turn are more significant than their four combined children. "Significance" here means the percentage of variance explained (with respect to some response variable, aka 'target variable', the thing you are trying to predict). One proviso: from a visual inspection of a decision tree you cannot distinguish variable significance among nodes of the same rank.
If you haven't used them before, here's how Decision Trees work: the algorithm will go through every variable (column) in your data and every value for each variable and split your data into two sub-sets based on each of those values. Which of these splits is actually chosen by the algorithm--i.e., what is the splitting criterion? The particular variable/value combination that "purifies" the data the most (i.e., maximizes the information gain) is chosen to split the data (that variable/value combination is usually indicated as the node's label). This simple heuristic is just performed recursively until the remaining data sub-sets are pure or further splitting doesn't increase the information gain.
What does this tell you about the "importance" of the variables in your data set? Well importance is indicated by proximity to the root node--i.e., hierarchical level or rank.
One suggestion: decision trees usually handle both categorical and continuous predictors without problem; however, in my experience decision tree algorithms always perform better if the response variable (the variable you are trying to predict using all the other variables) is discrete/categorical rather than continuous. It looks like yours is probably continuous, in which case I would consider discretizing it (unless doing so just makes the entire analysis meaningless). To do this, just bin your response-variable values using parameters (bin size, bin number, and bin edges) meaningful w/r/t your problem domain; e.g., if your response variable consists of continuous values from 1 to 100, you might sensibly bin them into 5 bins: 1-20, 21-40, 41-60, and so on.
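A minimal sketch of that binning step with pandas (the values and bin edges below are just the ones from the example above, not anything from the OP's data):

import pandas as pd

profit = pd.Series([3, 18, 27, 55, 71, 96])  # hypothetical response values in 1-100

# Five equal-width bins over 1-100, with readable labels.
bins = [0, 20, 40, 60, 80, 100]
labels = ["1-20", "21-40", "41-60", "61-80", "81-100"]
profit_binned = pd.cut(profit, bins=bins, labels=labels)
print(profit_binned)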
For instance, from your Question, suppose one variable in your data is X and it has 5 values (10, 20, 25, 50, 100); suppose also that splitting your data on this variable at the third value (25) results in two nearly pure subsets, one low-value and one high-value. As long as this purity is higher than for the sub-sets obtained by splitting on the other candidate values, the data would be split on that variable/value pair.
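To illustrate the splitting criterion itself, here is a small self-contained sketch (entirely hypothetical data) that scores each candidate threshold of one variable by the information gain of the resulting two subsets:

import numpy as np

def entropy(labels):
    # Shannon entropy of a label vector.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

# Hypothetical variable X and a binary outcome (profitable / not profitable).
X = np.array([10, 20, 25, 50, 100, 10, 25, 50])
y = np.array([0, 0, 0, 1, 1, 0, 1, 1])

base = entropy(y)
for threshold in np.unique(X)[:-1]:  # candidate split values
    left, right = y[X <= threshold], y[X > threshold]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    print(f"split at X <= {threshold:3d}: information gain = {base - weighted:.3f}")

The threshold with the largest gain is the split a decision tree would pick for this variable; the same scan is repeated over every variable, and then recursively within each resulting subset.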
RapidMiner does indeed have a decision tree implementation, and it seems there are quite a few tutorials available on the Web (e.g., on YouTube). (Note: I have not used RapidMiner's decision tree module, nor indeed RapidMiner at all.)
The other set of techniques I would consider is usually grouped under the rubric of Dimension Reduction; Feature Extraction and Feature Selection are perhaps the two most common terms used for it. The most widely used technique is PCA, or principal component analysis, which is based on an eigenvector decomposition of the covariance matrix (derived from your data matrix).
One direct result of this eigenvector decomposition is the fraction of variability in the data accounted for by each eigenvector. Just from this result, you can determine how many dimensions are required to explain, e.g., 95% of the variability in your data.
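A minimal scikit-learn sketch of reading off that fraction-of-variance result (the data here is a random placeholder for the 10 market variables):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))  # placeholder for the 10 market measurements

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = int(np.searchsorted(cumulative, 0.95)) + 1
print(f"components needed to explain 95% of the variance: {n_components_95}")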
If RapidMiner has PCA or another functionally similar dimension reduction technique, it's not obvious where to find it. I do know that RapidMiner has an R Extension, which of course lets you access R from inside RapidMiner. R has plenty of PCA libraries (packages). The ones available on CRAN satisfy the minimum package requirements for documentation and vignettes (code examples). I can recommend pcaPP (Robust PCA by Projection Pursuit).
In addition, I can recommend two excellent step-by-step tutorials on PCA. The first is from the NIST Engineering Statistics Handbook. The second is a tutorial for Independent Component Analysis (ICA) rather than PCA, but I mention it here because it's an excellent tutorial and the two techniques are used for similar purposes.

Create CDF for Anderson Darling test for Octave forge Statistics package function

I am using Octave and I would like to use anderson_darling_test from the Octave Forge Statistics package to test whether two vectors of data are drawn from the same statistical distribution. Furthermore, the reference distribution is unlikely to be normal. This reference distribution will be the known distribution, and the help for the above function says: 'If you are selecting from a known distribution, convert your values into CDF values for the distribution and use "uniform".'
My question therefore is: how would I convert my data values into CDF values for the reference distribution?
Some background information for the problem: I have a vector of raw data values from which I extract the cyclic component (this will be the reference distribution); I then wish to compare this cyclic component with the raw data itself to see if the raw data is essentially cyclic in nature. If the null hypothesis that the two are the same can be rejected, I will then know that most of the movement in the raw data is not due to cyclic influences but is due to either trend or just noise.
If your data has a specific distribution, for instance beta(3,3), then
p = betacdf(x, 3, 3)
will be uniform by the definition of a CDF. If you want to transform it to a normal, you can just call the inverse CDF function
x=norminv(p,0,1)
on the uniform p. Once transformed, use your favorite test. I'm not sure I understand your data, but you might consider using a Kolmogorov-Smirnov test instead, which is a nonparametric test of distributional equality.
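For readers working in Python rather than Octave, a minimal SciPy sketch of the same transform plus the suggested two-sample Kolmogorov-Smirnov test (the beta(3,3) reference and the generated samples are only illustrative):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.beta(3, 3, size=500)  # data assumed to follow the beta(3,3) reference

p = stats.beta.cdf(x, 3, 3)   # probability-integral transform -> uniform(0, 1)
z = stats.norm.ppf(p)         # optional: map the uniform values to a standard normal

# Nonparametric two-sample check of distributional equality.
other = rng.beta(3, 3, size=500)
print(stats.ks_2samp(x, other))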
Your approach is misguided in multiple ways. Several points:
The Anderson-Darling test implemented in Octave forge is a one-sample test: it requires one vector of data and a reference distribution. The distribution should be known - not come from data. While you quote the help-file correctly about using a CDF and the "uniform" option for a distribution that is not built in, you are ignoring the next sentence of the same help file:
Do not use "uniform" if the distribution parameters are estimated from the data itself, as this sharply biases the A^2 statistic toward smaller values.
So, don't do it.
Even if you found or wrote a function implementing a proper two-sample Anderson-Darling or Kolmogorov-Smirnov test, you would still be left with a couple of problems:
Your samples (the data and the cyclic part estimated from the data) are not independent, and these tests assume independence.
Given your description, I assume there is some sort of time predictor involved. So even if the distributions coincided, that would not mean they coincide at the same time-points, because comparing distributions collapses over time.
The distribution of cyclic trend + error would not be expected to be the same as the distribution of the cyclic trend alone. Suppose the trend is sin(t). Then it never goes above 1. Now add a normally distributed random error term with standard deviation 0.1 (small, so that the trend is dominant). Obviously you could get values well above 1.
We do not have enough information to figure out the proper thing to do, and it is not really a programming question anyway. Look up time series theory - separating cyclic components is a major topic there. But many reasonable analyses will probably be based on the residuals: (observed value - predicted from cyclic component). You will still have to be careful about auto-correlation and other complexities, but at least it will be a move in the right direction.
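As a rough sketch of that residual-based direction (a placeholder series; statsmodels' acf is one way to inspect the auto-correlation mentioned above):

import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(0)
t = np.arange(500)
observed = np.sin(2 * np.pi * t / 50) + rng.normal(scale=0.3, size=t.size)
cyclic = np.sin(2 * np.pi * t / 50)  # placeholder for the estimated cyclic component

residuals = observed - cyclic        # (observed value - predicted from cyclic component)
print(acf(residuals, nlags=10))      # close to zero at all lags if only noise remains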
