Intuition on KL-divergence and feature selection - decision-tree

I'm having a bit of a hard time understanding KL-divergence and how I can use it for feature selection. Let's say I have a set of observations (e.g. zeroes and ones) and 2 features generated for each observation. My question now is: which feature is the 'best'?
I know I can use KL-divergence (given by $$D_{KL} = \sum_i p(i) \log \frac{p(i)}{q(i)}$$), but what is P and what is Q? My intuition says that P is e.g. feature 1 and Q is the true distribution (so the set of zeroes and ones), but it is also my understanding that a good feature maximizes the KL-divergence. But if Q is the actual distribution of classes, then you want to minimize it, right? So that the feature distribution does not err badly on the actual distribution?

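Mutual information is a special case of KL divergence: it is the KL divergence between the joint distribution of two variables and the product of their marginals. It measures how strongly the two variables depend on each other.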

For feature selection, mutual information is the entropy of the marginal distribution of the target minus the entropy of the target given the feature:
I(t ; f) = H(t) - H(t|f)
To put it another way, it is the KL divergence of the joint probability of the target and the feature from the product of their marginals:
I(t ; f) = KL(p(t,f) || p(t)*p(f))
So P is the joint distribution p(t,f), Q is the product p(t)*p(f), and a good feature is one that maximizes this divergence.
Find more here.
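As a minimal sketch of the two identities above, assuming a discrete feature and a discrete target given as plain Python lists (the function names and toy data are illustrative, not taken from any library), the mutual information can be computed both ways and compared:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in nats) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def mi_entropy_form(t, f):
    """I(t; f) = H(t) - H(t|f)."""
    n = len(t)
    h_t_given_f = 0.0
    for value, count in Counter(f).items():
        subset = [ti for ti, fi in zip(t, f) if fi == value]
        h_t_given_f += (count / n) * entropy(subset)
    return entropy(t) - h_t_given_f

def mi_kl_form(t, f):
    """I(t; f) = KL(p(t, f) || p(t) * p(f))."""
    n = len(t)
    p_t, p_f, p_tf = Counter(t), Counter(f), Counter(zip(t, f))
    return sum((c / n) * math.log((c / n) / ((p_t[ti] / n) * (p_f[fi] / n)))
               for (ti, fi), c in p_tf.items())

target    = [0, 0, 0, 0, 1, 1, 1, 1]
feature_1 = [0, 0, 0, 0, 1, 1, 1, 1]   # determines the target completely
feature_2 = [0, 1, 0, 1, 0, 1, 0, 1]   # independent of the target

for name, feat in [("feature_1", feature_1), ("feature_2", feature_2)]:
    print(name, mi_entropy_form(target, feat), mi_kl_form(target, feat))
```

The feature with the larger value tells you more about the target, which is why the quantity is maximized rather than minimized when ranking features.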

Related

Why is Standard Deviation the square of difference of an observation from the mean?

I am learning statistics, and have some basic yet core questions on SD:
s = the sample (the collection of observations)
n = total number of observations
xi = ith observation
μ = arithmetic mean of all observations
σ = the usual definition of SD, i.e. ((1/(n-1))*sum([(xi-μ)**2 for xi in s]))**(1/2) in Python lingo
f = frequency of an observation value
1. I do understand that (1/n)*sum([xi-μ for xi in s]) would be useless (= 0), but would not (1/n)*sum([abs(xi-μ) for xi in s]) have been a measure of variation?
2. Why stop at power of 1 or 2? Would ((1/(n-1))*sum([abs((xi-μ)**3) for xi in s]))**(1/3) or ((1/(n-1))*sum([(xi-μ)**4 for xi in s]))**(1/4) and so on have made any sense?
My notion of squaring is that it 'amplifies' the measure of variation from the arithmetic mean, while the simple absolute difference is notionally a linear scale. Would it not amplify it even more if I cubed it (and took the absolute value, of course) or raised it to the fourth power?
I do agree that computationally cubes and fourth powers would have been more expensive. But by the same argument, the absolute values would have been less expensive... So why squares?
3. Why is the Normal Distribution like it is, i.e. f = (1/(σ*math.sqrt(2*pi)))*e**((-1/2)*((xi-μ)/σ)**2)?
4. What impact would it have on the normal distribution formula above if I calculated SD as described in (1) and (2) above?
5. Is it only a matter of our 'getting used to the squares'? Could it just as well have been linear, cubed or to the fourth power, and we would have trained our minds likewise?
(I may not have been 100% accurate in my number of opening and closing brackets above, but you will get the idea.)
So, if you are looking for an index of dispersion, you actually don't have to use the standard deviation. You can indeed report the mean absolute deviation, the summary statistic you suggested. You merely need to be aware of how each summary statistic behaves; for example, the SD assigns more weight to outlying observations. You should also consider how each one can be interpreted. For example, with a normal distribution, we know how much of the distribution lies within ±2 SD of the mean (about 95%). For some discussion of mean absolute deviation (and other measures of average absolute deviation, such as the median absolute deviation) and their uses see here.
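To make the point about outliers concrete, here is a small sketch comparing the two statistics on made-up data (the numbers and the helper name mean_abs_deviation are purely illustrative):

```python
import statistics

def mean_abs_deviation(xs):
    """Average absolute distance of each observation from the mean."""
    mu = sum(xs) / len(xs)
    return sum(abs(x - mu) for x in xs) / len(xs)

data = [2, 4, 4, 4, 5, 5, 7, 9]
with_outlier = data + [50]          # one extreme observation added

for label, xs in [("original", data), ("with outlier", with_outlier)]:
    sd = statistics.pstdev(xs)      # population SD (divides by n)
    mad = mean_abs_deviation(xs)
    print(f"{label}: SD={sd:.2f}  MAD={mad:.2f}  SD/MAD={sd / mad:.2f}")
```

Both statistics grow when the outlier is added, but the ratio SD/MAD grows as well, which is the "more weight to outlying observations" effect described above.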
Beyond its use as a measure of spread though, SD is related to variance and this is related to some of the other reasons it's popular, because the variance has some nice mathematical properties. A mathematician or statistician would be able to provide a more informed answer here, but squared difference is a smooth function and is differentiable everywhere, allowing one to analytically identify a minimum, which helps when fitting functions to data using least squares estimation. For more detail and for a comparison with least absolute deviations see here. Another major area where variance shines is that it can be easily decomposed and summed, which is useful for example in ANOVA and regression models generally. See here for a discussion.
As to your questions about raising to higher powers, they actually do have uses in statistics! In general, the mean, the variance (the square of the standard deviation), skewness (related to the third power) and kurtosis (related to the fourth power) are all related to the moments of a distribution. Taking differences from the mean raised to those powers and standardizing them provides useful information about the shape of a distribution. The video I linked provides some easy intuition.
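A rough sketch of what "raising to those powers and standardizing" means in practice, using a hypothetical helper standardized_moment and toy samples (this uses the population form, dividing by n):

```python
def standardized_moment(xs, k):
    """k-th standardized moment: mean of ((x - mean) / sd) ** k."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = (sum((x - mu) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - mu) / sigma) ** k for x in xs) / n

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

for label, xs in [("symmetric", symmetric), ("right skewed", right_skewed)]:
    print(label,
          "skewness:", round(standardized_moment(xs, 3), 3),
          "kurtosis:", round(standardized_moment(xs, 4), 3))
```

The third power keeps the sign of each deviation, so it picks up asymmetry; the fourth power is dominated by the observations far from the mean, so it describes how heavy the tails are.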
For some other answers and a larger discussion of why SD is so popular, see here.
Regarding the relationship of sigma and the normal distribution, sigma is simply a parameter that stretches the standard normal distribution, just like the mean changes its location. This is simply a result of the way the standard normal distribution (a normal distribution with mean=0 and SD=variance=1) is mathematically defined, and note that all normal distributions can be derived from the standard normal distribution. This answer illustrates this. Now, you can parameterize a normal distribution in other ways as well, but you do need to provide a scale parameter, whether as the SD, the variance or the precision. In principle the mean absolute deviation could play that role too, since for a normal distribution it equals σ*sqrt(2/π), but σ is the conventional choice. Now, a deeper question is why normal distributions are so incredibly useful in representing widely different phenomena and crop up everywhere. I think this is related to the Central Limit Theorem, but I do not understand the proofs of the theorem well enough to comment further.

What is the difference between Friedman mse and mse?

I'm looking into GradientBoostingClassifier in sklearn, and I found there are 3 kinds of criterion: friedman_mse, mse and mae.
The descriptions provided by sklearn are:
The function to measure the quality of a split. Supported criteria are “friedman_mse” for the mean squared error with improvement score by Friedman, “mse” for mean squared error, and “mae” for the mean absolute error. The default value of “friedman_mse” is generally the best as it can provide a better approximation in some cases.
I can't understand what is different?
Who's gonna let me know?
thanks!
I've provided a full answer in this link due to the convenience of writing TeX. However, it boils down to the fact that this splitting criterion allows us to make the decision based not only on how close we are to the desired outcome (which is what MSE does), but also on the probabilities of the desired k-class that we're going to find in the region l or in the region r (by considering a global weight w1*w2 / (w1 + w2)). I strongly recommend checking the above link for a full explanation.
According to the scikit-learn source code, the main difference between these two criteria is the impurity-improvement method. Both the MSE and FriedmanMSE criteria calculate the impurity of the current node and try to reduce (improve) it: the smaller the impurity, the better.
Mean squared error impurity criterion.
MSE = sum_square_of_left / w_l + sum_square_of_right / w_r
source
On the other hand, the FriedmanMSE impurity criterion uses the following to measure the improvement:
diff = w_r * total_left_sum - w_l * total_right_sum
improvement = diff**2 / (w_r * w_l)
Note: w_r (right) multiplies the total left sum and vice versa.
You can simplify the preceding equations with the better notation provided in Friedman's published paper itself (eq. 35),
which says
improvement = (w_l * w_r) / (w_l + w_r) * (mean_left - mean_right) ^ 2
where w_l and w_r are the sums of weights for the left and right parts, respectively.
source
For assigning meaning to left and right keywords, imagine the whole system in an array (e.g samples[start: end]), so for example left means the left elements of the current node.
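As a quick sanity check on the two FriedmanMSE expressions above, here is a small sketch (not scikit-learn's code; unit sample weights and the toy target values are assumptions) that evaluates both forms for one candidate split:

```python
def friedman_improvement_sums(y_left, y_right):
    """Sum-based form from the snippet above, assuming unit sample weights."""
    w_l, w_r = len(y_left), len(y_right)
    diff = w_r * sum(y_left) - w_l * sum(y_right)
    return diff ** 2 / (w_l * w_r)

def friedman_improvement_means(y_left, y_right):
    """Mean-based form: (w_l * w_r) / (w_l + w_r) * (mean_left - mean_right) ** 2."""
    w_l, w_r = len(y_left), len(y_right)
    mean_l, mean_r = sum(y_left) / w_l, sum(y_right) / w_r
    return (w_l * w_r) / (w_l + w_r) * (mean_l - mean_r) ** 2

y_left, y_right = [1.0, 2.0, 3.0], [10.0, 11.0]
by_sums = friedman_improvement_sums(y_left, y_right)
by_means = friedman_improvement_means(y_left, y_right)
print(by_sums, by_means, by_sums / (len(y_left) + len(y_right)))
```

Written out like this, the two forms differ only by the factor (w_l + w_r), the total weight of the node being split. That factor is the same for every candidate split of a given node, so both forms rank the splits identically.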

How are feature_importances in RandomForestClassifier determined?

I have a classification task with a time-series as the data input, where each attribute (n=23) represents a specific point in time. Besides the absolute classification result I would like to find out, which attributes/dates contribute to the result to what extent. Therefore I am just using the feature_importances_, which works well for me.
However, I would like to know how they are getting calculated and which measure/algorithm is used. Unfortunately I could not find any documentation on this topic.
There are indeed several ways to get feature "importances". As often, there is no strict consensus about what this word means.
In scikit-learn, we implement the importance as described in [1] (often cited, but unfortunately rarely read...). It is sometimes called "gini importance" or "mean decrease impurity" and is defined as the total decrease in node impurity (weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node)) averaged over all trees of the ensemble.
In the literature or in some other packages, you can also find feature importances implemented as the "mean decrease accuracy". Basically, the idea is to measure the decrease in accuracy on OOB data when you randomly permute the values for that feature. If the decrease is low, then the feature is not important, and vice-versa.
(Note that both algorithms are available in the randomForest R package.)
[1]: Breiman, Friedman, "Classification and regression trees", 1984.
The usual way to compute the feature importance values of a single tree is as follows:
you initialize an array feature_importances of all zeros with size n_features.
you traverse the tree: for each internal node that splits on feature i you compute the error reduction of that node multiplied by the number of samples that were routed to the node and add this quantity to feature_importances[i].
The error reduction depends on the impurity criterion that you use (e.g. Gini, Entropy, MSE, ...). It's the impurity of the set of examples that gets routed to the internal node minus the sum of the impurities of the two partitions created by the split.
It's important to note that these values are relative to a specific dataset (both the error reduction and the number of samples are dataset specific), so they cannot be compared between different datasets.
As far as I know there are alternative ways to compute feature importance values in decision trees. A brief description of the above method can be found in "Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
It's the ratio between the number of samples routed to a decision node involving that feature in any of the trees of the ensemble over the total number of samples in the training set.
Features that are involved in the top level nodes of the decision trees tend to see more samples hence are likely to have more importance.
Edit: this description is only partially correct: Gilles and Peter's answers are the correct answer.
As @GillesLouppe pointed out above, scikit-learn currently implements the "mean decrease impurity" metric for feature importances. I personally find the second metric a bit more interesting, where you randomly permute the values for each of your features one-by-one and see how much worse your out-of-bag performance is.
Since what you're after with feature importance is how much each feature contributes to your overall model's predictive performance, the second metric actually gives you a direct measure of this, whereas the "mean decrease impurity" is just a good proxy.
If you're interested, I wrote a small package that implements the Permutation Importance metric and can be used to calculate the values from an instance of a scikit-learn random forest class:
https://github.com/pjh2011/rf_perm_feat_import
Edit: This works for Python 2.7, not 3
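For readers on Python 3, a comparable measurement can be sketched with scikit-learn's own sklearn.inspection.permutation_importance. Note that this is a substitute for the package linked above, and it scores the permuted features on a held-out split rather than on the out-of-bag samples; the dataset, split and parameter values below are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0, stratify=iris.target)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Shuffle one feature at a time on the held-out data and record
# how much the accuracy drops, averaged over n_repeats shuffles.
result = permutation_importance(forest, X_test, y_test,
                                n_repeats=10, random_state=0)

for name, mean, std in zip(iris.feature_names,
                           result.importances_mean,
                           result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

A feature whose shuffling barely changes the score gets an importance near zero, regardless of how often the trees happened to split on it.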
code:
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
iris = datasets.load_iris()
X = iris.data
y = iris.target
clf = DecisionTreeClassifier()
clf.fit(X, y)
decision_tree plot: (figure not shown)
We get
compute_feature_importance:[0. ,0.01333333,0.06405596,0.92261071]
Check source code:
cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    cdef Node* left
    cdef Node* right
    cdef Node* nodes = self.nodes
    cdef Node* node = nodes
    cdef Node* end_node = node + self.node_count
    cdef double normalizer = 0.
    cdef np.ndarray[np.float64_t, ndim=1] importances
    importances = np.zeros((self.n_features,))
    cdef DOUBLE_t* importance_data = <DOUBLE_t*>importances.data
    with nogil:
        while node != end_node:
            if node.left_child != _TREE_LEAF:
                # ... and node.right_child != _TREE_LEAF:
                left = &nodes[node.left_child]
                right = &nodes[node.right_child]
                importance_data[node.feature] += (
                    node.weighted_n_node_samples * node.impurity -
                    left.weighted_n_node_samples * left.impurity -
                    right.weighted_n_node_samples * right.impurity)
            node += 1
    importances /= nodes[0].weighted_n_node_samples
    if normalize:
        normalizer = np.sum(importances)
        if normalizer > 0.0:
            # Avoid dividing by zero (e.g., when root is pure)
            importances /= normalizer
    return importances
Try calculating the feature importance by hand, using the sample counts and impurities shown in the plot:
print("sepal length (cm)",0)
print("sepal width (cm)",(3*0.444-(0+0)))
print("petal length (cm)",(54* 0.168 - (48*0.041+6*0.444)) +(46*0.043 -(0+3*0.444)) + (3*0.444-(0+0)))
print("petal width (cm)",(150* 0.667 - (0+100*0.5)) +(100*0.5-(54*0.168+46*0.043))+(6*0.444 -(0+3*0.444)) + (48*0.041-(0+0)))
We get the unnormalized feature importances: np.array([0,1.332,6.418,92.30]).
After normalizing, we get array([0., 0.01331334, 0.06414793, 0.92253873]), which matches clf.feature_importances_ up to the rounding of the impurity values read from the plot.
Be careful: this calculation assumes that all classes have weight one.
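The same calculation can be done without reading numbers off the plot. The sketch below walks the fitted tree's public tree_ arrays in Python, mirroring the Cython loop quoted above; the fixed random_state is an assumption added only so the run is repeatable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
tree = clf.tree_

importances = np.zeros(X.shape[1])
for node in range(tree.node_count):
    left = tree.children_left[node]
    right = tree.children_right[node]
    if left == -1:          # -1 marks a leaf: no split, nothing to add
        continue
    # Weighted impurity of the node minus that of its two children.
    importances[tree.feature[node]] += (
        tree.weighted_n_node_samples[node] * tree.impurity[node]
        - tree.weighted_n_node_samples[left] * tree.impurity[left]
        - tree.weighted_n_node_samples[right] * tree.impurity[right]
    )

importances /= importances.sum()    # normalize so the values sum to 1
print(importances)
print(clf.feature_importances_)
```

Because the impurities are taken at full precision here, the two printed arrays should agree to floating-point accuracy rather than only to a few decimals.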
For those looking for a reference to scikit-learn's documentation on this topic or a reference to the answer by @GillesLouppe:
In RandomForestClassifier, the estimators_ attribute is a list of DecisionTreeClassifier objects (as mentioned in the documentation). To compute the feature_importances_ for the RandomForestClassifier, scikit-learn's source code averages the feature_importances_ attributes of all estimators (all DecisionTreeClassifier objects) in the ensemble.
In DecisionTreeClassifier's documentation, it is mentioned that "The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance [1]."
Here is a direct link for more info on variable and Gini importance, as provided by scikit-learn's reference below.
[1] L. Breiman, and A. Cutler, “Random Forests”, http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
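The averaging described above is easy to check empirically. This is only a sketch: the dataset, n_estimators and random_state are arbitrary, and it assumes every tree in the forest contains at least one split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# One row of importances per tree, then the mean over the ensemble.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

print(averaged)
print(forest.feature_importances_)
```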
Feature Importance in Random Forest
Random forest uses many trees, and thus, the variance is reduced
Random forest allows far more exploration of feature combinations as well
Decision trees give a Variable Importance, which is larger when the reduction in impurity (e.g. reduction in Gini impurity) is larger
Each tree has a different Order of Importance
Here is what happens in the background!
We take an attribute and check in all the trees where it is present and take the average values of the change in the homogeneity on this attribute split. This average value of change in the homogeneity gives us the feature importance of the attribute

How do I efficiently estimate a probability based on a small amount of evidence?

I've been trying to find an answer to this for months (to be used in a machine learning application), it doesn't seem like it should be a terribly hard problem, but I'm a software engineer, and math was never one of my strengths.
Here is the scenario:
I have a (possibly) unevenly weighted coin and I want to figure out the probability of it coming up heads. I know that coins from the same box that this one came from have an average probability of p, and I also know the standard deviation of these probabilities (call it s).
(If other summary properties of the probabilities of other coins aside from their mean and stddev would be useful, I can probably get them too.)
I toss the coin n times, and it comes up heads h times.
The naive approach is that the probability is just h/n - but if n is small this is unlikely to be accurate.
Is there a computationally efficient way (ie. doesn't involve very very large or very very small numbers) to take p and s into consideration to come up with a more accurate probability estimate, even when n is small?
I'd appreciate it if any answers could use pseudocode rather than mathematical notation since I find most mathematical notation to be impenetrable ;-)
Other answers:
There are some other answers on SO that are similar, but the answers provided are unsatisfactory. For example this is not computationally efficient because it quickly involves numbers way smaller than can be represented even in double-precision floats. And this one turned out to be incorrect.
Unfortunately you can't do machine learning without knowing some basic math---it's like asking somebody for help in programming but not wanting to know about "variables" , "subroutines" and all that if-then stuff.
The better way to do this is called Bayesian integration, but there is a simpler approximation called "maximum a posteriori" (MAP). It's pretty much like the usual thinking except you can put in the prior distribution.
Fancy words, but you may ask, well where did the h/(h+t) formula come from? Of course it's obvious, but it turns out that it is the answer that you get when you have "no prior". And the method below is the next level of sophistication up when you add a prior. Going to Bayesian integration would be the next one but that's harder and perhaps unnecessary.
As I understand it the problem is two fold: first you draw a coin from the bag of coins. This coin has a "headsiness" called theta, so that it gives a head theta fraction of the flips. But the theta for this coin comes from the master distribution which I guess I assume is Gaussian with mean P and standard deviation S.
What you do next is to write down the total unnormalized probability (called likelihood) of seeing the whole shebang, all the data: (h heads, t tails)
L = (theta)^h * (1-theta)^t * Gaussian(theta; P, S).
Gaussian(theta; P, S) = exp( -(theta-P)^2/(2*S^2) ) / sqrt(2*Pi*S^2)
This is the meaning of "first draw 1 value of theta from the Gaussian" and then draw h heads and t tails from a coin using that theta.
The MAP principle says, if you don't know theta, find the value which maximizes L given the data that you do know. You do that with calculus. The trick to make it easy is that you take logarithms first. Define LL = log(L). Wherever L is maximized, then LL will be too.
so
LL = h*log(theta) + t*log(1-theta) - (theta-P)^2 / (2*S^2) - 1/2 * log(2*pi*S^2)
By calculus, to look for extrema you find the value of theta such that dLL/dtheta = 0.
Since the last term with the log has no theta in it you can ignore it.
dLL/dtheta = (h/theta) + (P-theta)/S^2 - (t/(1-theta)) = 0.
If you can solve this equation for theta you will get an answer, the MAP estimate for theta given the number of heads h and the number of tails t.
If you want a fast approximation, try doing one step of Newton's method, where you start with your proposed theta at the obvious (called maximum likelihood) estimate of theta = h/(h+t).
And where does that 'obvious' estimate come from? If you do the stuff above but don't put in the Gaussian prior: h/theta - t/(1-theta) = 0 you'll come up with theta = h/(h+t).
If your prior probabilities are really small, as is often the case, instead of near 0.5, then a Gaussian prior on theta is probably inappropriate, as it predicts some weight with negative probabilities, clearly wrong. More appropriate is a Gaussian prior on log theta ('lognormal distribution'). Plug it in the same way and work through the calculus.
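Since the question asked for code rather than notation, here is one way the equation dLL/dtheta = 0 above could be solved. This sketch swaps in plain bisection for the single Newton step suggested earlier, because bisection cannot wander outside (0, 1); the function name, tolerance and boundary handling are my own assumptions. It only ever works with moderate-sized numbers, so there is no underflow problem.

```python
def map_estimate(h, t, P, S, tol=1e-12):
    """MAP estimate of theta for h heads, t tails and a Gaussian(P, S) prior."""
    def dLL(theta):
        # Derivative of the log posterior from the answer above.
        return h / theta - t / (1 - theta) + (P - theta) / S ** 2

    lo, hi = 1e-9, 1 - 1e-9
    # dLL is strictly decreasing on (0, 1), so it has at most one root there.
    if dLL(lo) <= 0:
        return 0.0      # the maximum sits at the lower boundary
    if dLL(hi) >= 0:
        return 1.0      # the maximum sits at the upper boundary
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if dLL(mid) > 0:
            lo = mid    # still climbing: the root is to the right
        else:
            hi = mid
    return (lo + hi) / 2

# Three heads in three tosses, coins in the box average 0.5 with stddev 0.1.
print(map_estimate(h=3, t=0, P=0.5, S=0.1))
print(3 / 3)   # the naive h/n estimate for comparison
```

With only three tosses the estimate stays close to the prior mean instead of jumping to 1.0; as h and t grow, the data terms dominate and the result approaches h/(h+t).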
You can use p as a prior on your estimated probability. This is basically the same as doing pseudocount smoothing. I.e., use
(h + c * p) / (n + c)
as your estimate. When h and n are large, then this just becomes h / n. When h and n are small, this is just c * p / c = p. The choice of c is up to you. You can base it on s but in the end you have to decide how small is too small.
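One possible way to base c on s, offered as an assumption rather than the only choice: if the coin probabilities in the box are modelled with a Beta distribution that has mean p and standard deviation s, the matching pseudocount is c = p*(1-p)/s^2 - 1, and the formula above is then exactly the posterior mean. A sketch with hypothetical function names:

```python
def pseudocount_from_prior(p, s):
    """Pseudocount c for a Beta prior with mean p and standard deviation s."""
    c = p * (1 - p) / s ** 2 - 1
    if c <= 0:
        raise ValueError("s is too large for a Beta prior with mean p")
    return c

def smoothed_estimate(h, n, p, s):
    """(h + c * p) / (n + c) with c derived from the spread of the prior."""
    c = pseudocount_from_prior(p, s)
    return (h + c * p) / (n + c)

# p = 0.5 and s = 0.1 give c = 24, i.e. the prior counts like 24 earlier tosses.
print(pseudocount_from_prior(0.5, 0.1))
print(smoothed_estimate(h=3, n=3, p=0.5, s=0.1))      # few tosses: close to p
print(smoothed_estimate(h=300, n=300, p=0.5, s=0.1))  # many tosses: close to h/n
```

A small s (coins in the box are all alike) gives a large c, so it takes many tosses to move the estimate away from p; a large s gives a small c and the data take over quickly.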
You don't have nearly enough info in this question.
How many coins are in the box? If it's two, then in some scenarios (for example one coin is always heads, the other always tails) knowing p and s would be useful. If it's more than a few, and especially if only some of the coins are only slightly weighted then it is not useful.
What is a small n? 2? 5? 10? 100? What is the probability of a weighted coin coming up heads/tail? 100/0, 60/40, 50.00001/49.99999? How is the weighting distributed? Is every coin one of 2 possible weightings? Do they follow a bell curve? etc.
It boils down to this: the differences between a weighted/unweighted coin, the distribution of weighted coins, and the number coins in your box will all decide what n has to be for you to solve this with a high confidence.
The name for what you're trying to do is a Bernoulli trial. Knowing the name should be helpful in finding better resources.
Response to comment:
If you have differences in p that small, you are going to have to do a lot of trials and there's no getting around it.
Assuming a uniform distribution of bias, p will still be 0.5 and all standard deviation will tell you is that at least some of the coins have a minor bias.
How many tosses, again, will be determined under these circumstances by the weighting of the coins. Even with 500 tosses, you won't get a strong confidence (about 2/3) detecting a .51/.49 split.
In general, what you are looking for is Maximum Likelihood Estimation. Wolfram Demonstration Project has an illustration of estimating the probability of a coin landing head, given a sample of tosses.
Well I'm no math man, but I think the simple Bayesian approach is intuitive and broadly applicable enough to put a little thought into it. Others above have already suggested this, but perhaps if you're like me you would prefer more verbosity.
In this lingo, you have a set of mutually-exclusive hypotheses, H, and some data D, and you want to find the (posterior) probabilities that each hypothesis Hi is correct given the data. Presumably you would choose the hypothesis that had the largest posterior probability (the MAP as noted above), if you had to choose one. As Matt notes above, what distinguishes the Bayesian approach from only maximum likelihood (finding the H that maximizes Pr(D|H)) is that you also have some PRIOR info regarding which hypotheses are most likely, and you want to incorporate these priors.
So you have from basic probability Pr(H|D) = Pr(D|H)*Pr(H)/Pr(D). You can estimate these Pr(H|D) numerically by creating a series of discrete probabilities Hi for each hypothesis you wish to test, e.g. [0.0, 0.05, 0.1 ... 0.95, 1.0], and then determining your prior Pr(Hi) for each Hi -- above it is assumed you have a normal distribution of priors, and if that is acceptable you could use the mean and stdev to get each Pr(Hi) -- or use another distribution if you prefer. With coin tosses the Pr(D|Hi) is of course determined by the binomial using the observed number of successes with n trials and the particular Hi being tested. The denominator Pr(D) may seem daunting but we assume that we have covered all the bases with our hypotheses, so that Pr(D) is the summation of Pr(D|Hi)*Pr(Hi) over all Hi.
Very simple if you think about it a bit, and maybe not so if you think about it a bit more.
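The grid recipe just described might look like this in code. It is a sketch under stated assumptions: a Gaussian-shaped prior with mean p and stddev s, an evenly spaced grid whose size (steps) is arbitrary, and everything kept in log space so that no very small numbers are ever multiplied together.

```python
import math

def grid_posterior(h, n, p, s, steps=101):
    """Posterior Pr(Hi | D) over an evenly spaced grid of heads-probabilities."""
    thetas = [i / (steps - 1) for i in range(steps)]
    log_post = []
    for theta in thetas:
        # Log of the (unnormalized) Gaussian prior Pr(Hi).
        log_prior = -((theta - p) ** 2) / (2 * s ** 2)
        # Binomial log-likelihood Pr(D | Hi); the binomial coefficient is
        # the same for every Hi and cancels when normalizing.
        if (theta == 0.0 and h > 0) or (theta == 1.0 and n - h > 0):
            log_like = -math.inf        # data impossible under this Hi
        else:
            log_like = 0.0
            if h > 0:
                log_like += h * math.log(theta)
            if n - h > 0:
                log_like += (n - h) * math.log(1 - theta)
        log_post.append(log_prior + log_like)
    # Subtract the maximum before exponentiating, then divide by the sum,
    # which plays the role of Pr(D).
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return thetas, [w / total for w in weights]

thetas, post = grid_posterior(h=3, n=3, p=0.5, s=0.1)
best = thetas[post.index(max(post))]
mean = sum(t * w for t, w in zip(thetas, post))
print("MAP on the grid:", best, " posterior mean:", round(mean, 4))
```

The posterior mean is often a better single-number summary than the grid's maximum, and the full list post also tells you how uncertain the estimate still is.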

Probability of selecting an element from a set

The expected probability of randomly selecting an element from a set of n elements is P=1.0/n .
Suppose I check P using an unbiased method sufficiently many times. What is the distribution type of P? It is clear that P is not normally distributed, since it cannot be negative. Thus, may I correctly assume that P is gamma distributed? And if yes, what are the parameters of this distribution?
Histogram of probabilities of selecting an element from 100-element set for 1000 times is shown here.
Is there any way to convert this to a standard distribution
Now supposed that the observed probability of selecting the given element was P* (P* != P). How can I estimate whether the bias is statistically significant?
EDIT: This is not a homework. I'm doing a hobby project and I need this piece of statistics for it. I've done my last homework ~10 years ago:-)
With repetitions, your distribution will be binomial. So let X be the number of times you select some fixed object, with M total selections
P{ X = x } = ( M choose x ) * (1/N)^x * ((N-1)/N)^(M-x)
You may find this difficult to compute for large M. It turns out that for a sufficiently large number of selections M, this is well approximated by a normal distribution (Central Limit Theorem).
In that case P{X=x} will be given by a normal distribution. The mean will be M/N and the variance will be M * (1/N) * (N-1) / N.
This is a clear binomial distribution with p=1/(number of elements) and n=(number of trials).
To test whether the observed result differs significantly from the expected result, you can do the binomial test.
The dice examples on the two Wikipedia pages should give you some good guidance on how to formulate your problem. In your 100-element, 1000 trial example, that would be like rolling a 100-sided die 1000 times.
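For the 100-element, 1000-trial example, an exact two-sided binomial test can be sketched with only the standard library. The function names are illustrative, and the "two-sided" rule used here (add up the probability of every outcome that is no more likely than the observed one) is one common convention, not the only one.

```python
import math

def binom_log_pmf(k, m, p):
    """Log of P{X = k} for X ~ Binomial(m, p), computed with log-gamma."""
    return (math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
            + k * math.log(p) + (m - k) * math.log(1 - p))

def binomial_test(x, m, p):
    """Two-sided exact p-value for observing x successes in m trials."""
    observed = binom_log_pmf(x, m, p)
    total = 0.0
    for k in range(m + 1):
        lp = binom_log_pmf(k, m, p)
        if lp <= observed + 1e-9:   # outcome no more likely than the observed one
            total += math.exp(lp)
    return min(1.0, total)

M, N = 1000, 100                    # 1000 selections from a 100-element set
for x in (10, 17, 30):              # times the chosen element was selected
    print(x, binomial_test(x, M, 1 / N))
```

A small p-value means the observed count would be surprising if every element really had probability 1/N, which is the sense in which the bias is "statistically significant".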
As others have noted, you want the Binomial distribution. Your question seems to imply an interest in a continuous approximation to it, though. It can actually be approximated by the normal distribution, and also by the Poisson distribution.
Is your distribution a discrete uniform distribution?

Resources