Decision trees: information gain - bias against attributes - how and why is it so?

I am confused about the bias described in the following line (marked in bold):
Information gain ratio biases the decision tree against considering attributes with a large number of distinct values which might lead to overfitting.

Did you mean information gain? Information gain is biased towards variables with a large number of distinct values, while information gain ratio tries to solve this by taking into account the number of branches that would result before making the split; it corrects information gain by taking the intrinsic information of a split into account.
Answer: why information gain is biased towards variables with many distinct values
Please note that information gain (IG) is biased toward variables with a large number of distinct values, not variables whose observations have large values. Before describing the reason for this, let's review the definition of IG.
Information gain is the amount of information gained by knowing the value of the attribute: the entropy of the distribution before the split minus the entropy of the distribution after it. The largest information gain is therefore equivalent to the smallest entropy after the split.
In other words, the variable with the highest number of distinct values can probably divide the data into smaller chunks, and the fewer observations a chunk contains, the lower the chance of variation occurring within it.
Using an ID variable to split the data is a common example of this issue. Since each individual sample has its own distinct value, selecting an ID feature leads to many nodes with one sample each and an entropy of zero. Therefore, a decision tree that works with IG selects the ID as the first splitting attribute; indeed, the entropy approaches zero once the ID feature is selected. However, we are not interested in such a feature. We are more interested in features that explain much of the variation in the dependent variable.
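As a hedged illustration of this point (the tiny data set and the helper functions below are invented for the example, not taken from the linked discussion), here is an R sketch showing that an ID-like attribute attains the maximum possible information gain even though it carries no predictive value:

entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}

info_gain <- function(x, y) {
  parts   <- split(y, x)
  weights <- lengths(parts) / length(y)
  entropy(y) - sum(weights * vapply(parts, entropy, numeric(1)))
}

y  <- rep(c("yes", "no"), each = 10)   # class label
id <- seq_along(y)                     # a distinct value per row
x  <- ifelse(y == "yes", "A", "B")     # a genuinely useful attribute...
x[1:2] <- "B"                          # ...made slightly imperfect

entropy(y)        # entropy before any split
info_gain(id, y)  # equals entropy(y): every child node is pure
info_gain(x, y)   # smaller, although x is the attribute we actually want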
Please refer to this discussion where this point was initially written.

Related

Testing significance of Strauss parameters in mppm model

I have a follow up question from my previous post.
Upon creating mppm models like these:
Str <- hyperframe(str=with(simba, Strauss(mean(nndist(Points)))))
fit0 <- mppm(Points ~ group, simba)
fit1 <- mppm(Points ~ group, simba, interaction=Str,
iformula = ~str + str:id)
Using anova.mppm to run a likelihood ratio test shows that the interaction is highly significant as a whole, but I would also like to test:
whether each individual id shows significant regularity.
whether some groups of ids show significantly stronger inhibition than other groups, for example, whether ids 1-7 are significantly more regular than ids 8-10.
perform pairwise comparisons of regularity between different ids.
I am aware I could build separate ppm models for each id to test for significant regularity in each id, but I am not sure this is the best approach. Also, I do not think the "summary output" with the p-values for each Strauss interaction parameter can be used for pairwise comparisons other than to the reference level.
Any advice is greatly appreciated.
Thank you!
First let me explain that, for Gibbs models, the likelihood is intractable, so anova.mppm performs the adjusted composite likelihood ratio test, not the likelihood ratio test. However, you can essentially treat this as if it were the likelihood ratio test based on deviance differences.
whether each individual id shows significant regularity
I am aware I could build separate ppm models for each id to test for significant regularity in each id, but I am not sure this is the best approach.
This is appropriate. Use ppm to fit a Strauss model to an individual point pattern, and use anova.ppm to test whether the Strauss interaction is statistically significant.
whether some groups of ids show significantly stronger inhibition than other groups, for example, whether ids 1-7 are significantly more regular than ids 8-10.
Introduce a new categorical variable (factor) f, say, that separates the two groups that you want to compare. In your model, add the term f:str to the interaction formula; this gives you the alternative hypothesis. The null and alternative models are identical except that the alternative includes the term f:str in the interaction formula. Now apply anova.mppm. Like all analyses of variance, this performs a two-sided test. For the one-sided test, inspect the sign of the coefficient of f:str in the fitted alternative model. If it has the sign that you wanted, report it as significant at the same p-value. Otherwise, report it as non-significant.
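As a hedged sketch, the setup might look like the following in code; the factor name f, its coding, and the test = "Chi" argument are my illustration rather than part of the original answer (simba has 10 rows, as implied by the question):

# Compare rows (ids) 1-7 of the hyperframe against rows 8-10; illustrative coding only.
simba$f <- factor(rep(c("low", "high"), times = c(7, 3)))

fit.null <- mppm(Points ~ group, simba, interaction = Str,
                 iformula = ~str)
fit.alt  <- mppm(Points ~ group, simba, interaction = Str,
                 iformula = ~str + str:f)

anova(fit.null, fit.alt, test = "Chi")  # two-sided adjusted composite LR test
coef(fit.alt)                           # inspect the sign of the str:f coefficient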
perform pairwise comparisons of regularity between different ids.
This is not yet supported (in theory or in software).
[Congratulations, you have reached the boundary of existing methodology!]

approximate histogram for streaming string values (card catalog algorithm?)

I have a large list (or stream) of UTF-8 strings sorted lexicographically. I would like to create a histogram with approximately equal values for the counts, varying the bin width as necessary to keep the counts even. In the literature, these are sometimes called equi-height, or equi-depth histograms.
I'm not looking to do the usual word-count bar chart; I'm looking for something more like an old-fashioned library card catalog where you have a set of drawers (bins), and one might hold SAM - SOLD, and the next bin SOLE - STE, while all of Y - ZZZ fits in a single bin. I want to calculate where to put the cutoffs for each bin.
Is there (A) a known algorithm for this, similar to approximate histograms for numeric values, or (B) a suggestion on how to encode the strings in a way that a standard numeric histogram algorithm would work? The algorithm should not require prior knowledge of the string population.
The best way I can think to do it so far is to simply wait until I have some reasonable amount of data, then form logical bins by:
number_of_strings / bin_count = number_of_strings_in_each_bin
Then, starting at 0, step forward by number_of_strings_in_each_bin to get the bin endpoints.
This has two weaknesses for my use-case. First, it requires two iterations over a potentially very large number of strings, one for the count, one to find the endpoints. More importantly, a good histogram implementation can give an estimate of where in a bin a value falls, and this would be really useful.
Thanks.
If we can't make any assumptions about the data, you are going to have to make a pass to determine bin size.
This means that you have to either start with a bin size rather than bin number or live with a two-pass model. I'd just use linear interpolation to estimate positions between bins, then do a binary search from there.
Of course, if you can make some assumptions about the data, here are some that might help:
For example, you might not know the exact number of strings, but you might know that it will fall in some interval [a, b]. If you want at most n bins, make the bin size b/n.
Alternatively, if you're not particular about exactly equal-sized bins, you could do it in one pass by sampling every m-th element during your pass and dumping it into an array, where m is something reasonable based on context.
Then, to find the bin endpoints, you'd take the elements at multiples of size/n/m in your array.
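A minimal R sketch of this one-pass idea (the synthetic data and all names are invented purely for illustration):

# Synthetic stand-in for the already-sorted stream of strings.
strings <- sort(replicate(10000, paste(sample(letters, 4, replace = TRUE), collapse = "")))

m      <- 50                                         # sampling stride, chosen by context
n_bins <- 26                                         # desired number of bins
kept   <- strings[seq(1, length(strings), by = m)]   # every m-th element, one pass
stride <- max(1, length(kept) %/% n_bins)
endpoints <- kept[seq(stride, length(kept), by = stride)]
head(endpoints)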
The solution I came up with addresses the lack of up-front information about the population by using reservoir sampling. Reservoir sampling lets you efficiently take a random sample of a given size from a population of unknown size; see Wikipedia for more details. It provides a random sample regardless of whether the stream is ordered or not.
We make one pass through the data, gathering a sample. For the sample we have explicit information about the number of elements as well as their distribution.
For the histogram, I used a Guava RangeMap. I picked the endpoints of the ranges to provide an even number of results in each range (sample_size / number_of_bins). The Integer in the map merely stores the order of the ranges, from 1 to n. This allows me to estimate the proportion of records that fall within two values: If there are 100 equal sized bins, and the values fall in bin 25 and bin 75, then I can estimate that approximately 50% of the population falls between those values.
This approach has the advantage of working for any Comparable data type.
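Here is a hedged R sketch of the same idea; the original solution used Java and a Guava RangeMap, so the names, the synthetic stream, and the bin count below are all invented just to illustrate the approach:

set.seed(1)
# Synthetic stand-in for the real stream of UTF-8 strings.
stream <- sort(replicate(50000, paste(sample(LETTERS, 5, replace = TRUE), collapse = "")))

# Algorithm R: one pass, uniform random sample of fixed size k from a stream
# whose total length is unknown in advance.
reservoir_sample <- function(stream, k) {
  res <- character(k)
  n <- 0L
  for (x in stream) {
    n <- n + 1L
    if (n <= k) {
      res[n] <- x
    } else {
      j <- sample.int(n, 1L)
      if (j <= k) res[j] <- x
    }
  }
  res
}

smp    <- sort(reservoir_sample(stream, k = 5000))
n_bins <- 100

# Equi-depth cut points: the (k / n_bins)-th order statistics of the sample.
cuts <- smp[ceiling(seq_len(n_bins - 1) / n_bins * length(smp))]

# Bin index of a value = 1 + number of cut points lexicographically <= it.
bin_of <- function(x) vapply(x, function(v) 1L + sum(cuts <= v), integer(1))

# With 100 equal-count bins, the estimated fraction of the population lying
# between two values is the difference of their bin indices divided by 100.
b <- bin_of(c("HAAAA", "TZZZZ"))
diff(b) / n_bins

Here bin_of plays the role of the RangeMap lookup in the Java solution; any data type with a total order would work the same way.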

Obtaining the Standard Error of Weighted Data in SPSS

I'm trying to find confidence intervals for the means of various variables in a database using SPSS, and I've run into a spot of trouble.
The data is weighted, because each of the people who was surveyed represents a different portion of the overall population. For example, one young man in our sample might represent 28000 young men in the general population. The problem is that SPSS seems to think that the young man's database entries each represent 28000 measurements when they actually just represent one, and this makes SPSS think we have much more data than we actually do. As a result, SPSS gives very low standard error estimates and very narrow confidence intervals.
I've tried fixing this by dividing every weight value by the mean weight. This gives plausible figures and an average weight of 1, but I'm not sure the resulting numbers are actually correct.
Is my approach sound? If not, what should I try?
I've been using the Explore command to find mean and standard error (among other things), in case it matters.
You do need to scale weights to the actual sample size, but only the procedures in the Complex Samples option are designed to account for sampling weights properly. The regular weight variable in Statistics is treated as a frequency weight.
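The answer concerns SPSS; purely to illustrate the frequency-weight versus sampling-weight distinction, here is a hedged sketch in R using the survey package (neither the package nor the data are part of the original answer):

library(survey)
set.seed(1)
d <- data.frame(x = rnorm(200, mean = 50, sd = 10),  # the measured variable
                w = runif(200, 5000, 30000))         # made-up sampling weights

# Treated as frequency weights, w pretends there are sum(w) observations,
# which is what makes the standard errors implausibly small.
sum(d$w)

# Treated as sampling weights: the point estimate uses w, but the standard
# error reflects the 200 actual observations.
des <- svydesign(ids = ~1, weights = ~w, data = d)
svymean(~x, des)
confint(svymean(~x, des))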

kd-tree BBF algorithm time complexity

I have 2000 points with 5000 dimensions, and I want to get the nearest neighbour.
Now I have some questions; could anybody give an answer?
People say it works well with high dimensions. What's the time complexity?
#param max_nn_chks search is cut off after examining this many tree entries
After reading the algorithm, I wonder whether I would get the wrong answer if I set max_nn_chks too low. If yes, please tell me how to set this parameter; if not, please give a reason. Thanks.
Is the kd-tree the best data structure for finding the nearest neighbour in my data?
The time complexity is basically the same as for restricted kd-tree search, plus a little time to maintain the priority queue. The restricted kd-tree search algorithm needs to traverse the tree to its full depth (log2 of the point count) times the limit (the maximum number of leaf nodes/points allowed to be visited).
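A rough back-of-the-envelope illustration with the numbers from the question (the limit value of 200 is chosen arbitrarily for the example, not as a recommendation):

N     <- 2000                 # points
d     <- 5000                 # dimensions
limit <- 200                  # max_nn_chks: cut-off on examined tree entries
depth <- ceiling(log2(N))     # ~11 levels

depth * limit                 # rough number of tree-node visits
limit * d                     # rough cost of the full distance computations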
Yes, you will get a wrong answer if the limit is too low. You can only measure the fraction of true nearest neighbours found versus the number of leaf nodes searched. From this, you can determine your optimal value.
Usually a randomized kd-tree forest and hierarchical k-means tree perform best. FLANN provides a method to determine which algorithm to use (k-means vs randomized kd-tree forest) and sets the optimal parameters for you.
The structure of data also has a big impact. If you know there are clusters of points being close together, for example, you can group them in a single node of a tree (represent them by their centroid, for example) and speed up the search.
Other techniques such as visual words, PCA or random projections can also be employed on the data. It's quite an active field of research.

Data mining for significant variables (numerical): Where to start?

I have a trading strategy on the foreign exchange market that I am attempting to improve upon.
I have a huge table (100k+ rows) in which each row represents a possible trade in the market and records the type of trade (buy or sell), the profit/loss after that trade closed, and 10 or so additional variables that represent various market measurements at the time of trade-opening.
I am trying to find out if any of these 10 variables are significantly related to the profits/losses.
For example, imagine that variable X ranges from 50 to -50.
The average value of X for a buy order is 25, and for a sell order is -25.
If most profitable buy orders have a value of X > 25, and most profitable sell orders have a value of X < -25 then I would consider the relationship of X-to-profit as significant.
I would like a good starting point for this. I have installed RapidMiner 5 in case someone can give me a specific recommendation for that.
A Decision Tree is perhaps the best place to begin.
The tree itself is a visual summary of feature importance ranking (or significant variables as phrased in the OP).
It gives you a visual representation of the entire classification/regression analysis (in the form of a binary tree), which distinguishes it from any other analytical/statistical technique that I am aware of.
Decision tree algorithms require very little pre-processing of your data: no normalization, no rescaling, no conversion of discrete variables into integers (e.g., Male/Female => 0/1); they can accept both categorical (discrete) and continuous variables, and many implementations can handle incomplete data (values missing from some of the rows in your data matrix).
Again, the tree itself is a visual summary of feature importance ranking (i.e., significant variables): the most significant variable is the root node, and it is more significant than the two child nodes, which in turn are more significant than their four combined children. "Significance" here means the percent of variance explained (with respect to some response variable, aka 'target variable', i.e., the thing you are trying to predict). One proviso: from a visual inspection of a decision tree you cannot distinguish variable significance among nodes of the same rank.
If you haven't used them before, here's how Decision Trees work: the algorithm will go through every variable (column) in your data and every value for each variable and split your data into two sub-sets based on each of those values. Which of these splits is actually chosen by the algorithm--i.e., what is the splitting criterion? The particular variable/value combination that "purifies" the data the most (i.e., maximizes the information gain) is chosen to split the data (that variable/value combination is usually indicated as the node's label). This simple heuristic is just performed recursively until the remaining data sub-sets are pure or further splitting doesn't increase the information gain.
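A minimal hedged sketch of this in R with the rpart package (rpart is my choice for illustration, not something named in the answer; the built-in iris data stands in for the trading table):

library(rpart)

# Grow a classification tree using the information-gain splitting criterion.
fit <- rpart(Species ~ ., data = iris, method = "class",
             parms = list(split = "information"))

print(fit)                  # the root split is the most informative variable/value pair
fit$variable.importance     # numeric importance ranking of the predictors
plot(fit); text(fit)        # the visual summary discussed above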
What does this tell you about the "importance" of the variables in your data set? Well importance is indicated by proximity to the root node--i.e., hierarchical level or rank.
One suggestion: decision trees usually handle both categorical and continuous data without problem; however, in my experience, decision tree algorithms always perform better if the response variable (the variable you are trying to predict using all other variables) is discrete/categorical rather than continuous. It looks like yours is probably continuous, in which case I would consider discretizing it (unless doing so just causes the entire analysis to be meaningless). To do this, just bin your response variable values using parameters (bin size, bin number, and bin edges) meaningful w/r/t your problem domain--e.g., if your r/v is comprised of 'continuous values' from 1 to 100, you might sensibly bin them into 5 bins: 0-20, 21-40, 41-60, and so on.
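For instance, the binning could be done with R's cut(), mirroring the 1-to-100 example just given (the response vector here is simulated purely for illustration):

rv <- runif(500, min = 1, max = 100)                  # stand-in for a continuous response variable
rv_binned <- cut(rv, breaks = seq(0, 100, by = 20))   # five bins: (0,20], (20,40], ...
table(rv_binned)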
For instance, from your Question, suppose one variable in your data is X and it has 5 values (10, 20, 25, 50, 100); suppose also that splitting your data on this variable with the third value (25) results in two nearly pure subsets--one low-value and one high-value. As long as this purity is higher than for the sub-sets obtained from splitting on the other values, the data will be split on that variable/value pair.
RapidMiner does indeed have a decision tree implementation, and it seems there are quite a few tutorials available on the Web (e.g., from YouTube, here and here). (Note, I have not used the decision tree module in R/M, nor have I used RapidMiner at all.)
The other set of techniques I would consider is usually grouped under the rubric Dimension Reduction; Feature Extraction and Feature Selection are perhaps the two most common terms after D/R. The most widely used technique is PCA, or principal component analysis, which is based on an eigenvector decomposition of the covariance matrix (derived from your data matrix).
One direct result of this eigenvector decomposition is the fraction of variability in the data accounted for by each eigenvector. Just from this result, you can determine how many dimensions are required to explain, e.g., 95% of the variability in your data.
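For example, a quick hedged illustration with base R's prcomp (the iris measurements stand in for the real data, and prcomp is just one of many PCA routines you could use):

pc <- prcomp(iris[, 1:4], scale. = TRUE)   # centre and scale, then eigendecompose
summary(pc)                                # see the 'Cumulative Proportion' row

cum_var <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
which(cum_var >= 0.95)[1]                  # number of components needed for 95%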
If RapidMiner has PCA or another functionally similar dimension reduction technique, it's not obvious where to find it. I do know that RapidMiner has an R Extension, which of course lets you access R inside RapidMiner. R has plenty of PCA libraries (packages), all available on CRAN, which means they satisfy the minimum package requirements for documentation and vignettes (code examples). I can recommend pcaPP (Robust PCA by Projection Pursuit).
In addition, I can recommend two excellent step-by-step tutorials on PCA. The first is from the NIST Engineering Statistics Handbook. The second is a tutorial for Independent Component Analysis (ICA) rather than PCA, but I mention it here because it's an excellent tutorial and the two techniques are used for similar purposes.
