Mean absolute error of each tree in Random Forest - statistics

I am using the evaluation class of Weka for the mean absolute error of each generated tree in a random forest. The explanation says: "Refers to the error of the predicted values for numeric classes, and the error of the predicted probability distribution for nominal classes."
Can someone explain it in simple words, or perhaps with an example?

The mean absolute error is an indication of how close your predictions are, on average, to the actual values of the test data.
For numerical classes this is easy to think about.
Example:
True values: {0, 1, 4}
Predicted values: {1, 3, 1}
Differences: {-1, -2, 3} (subtract predicted from true)
Absolute differences: {1, 2, 3}
Mean Absolute Difference: (1+2+3)/3 = 2
For nominal classes a prediction is no longer a single value, but rather the probability distribution of the instance belonging to the different possible classes. The provided example will have two classes.
Example:
Notation: [0.5, 0.5] indicates an instance with 50% chance of belonging to class Y, 50% chance of belonging to class X.
True distributions: { [0,1] , [1,0] }
Predicted distributions: { [0.25, 0.75], [1, 0] }
Differences: { [-0.25, 0.25], [0, 0] }
Absolute differences: { (0.25 + 0.25)/2, (0 + 0)/2 } = {0.25, 0}
Mean absolute difference: (0.25 + 0)/2 = 0.125
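Both worked examples can be reproduced in a few lines. This is a minimal sketch of the arithmetic described above, not Weka's code; the helper names `mae_numeric` and `mae_nominal` are made up for illustration.

```python
def mae_numeric(true_values, predicted_values):
    """Mean of |true - predicted| over all instances."""
    diffs = [abs(t - p) for t, p in zip(true_values, predicted_values)]
    return sum(diffs) / len(diffs)


def mae_nominal(true_dists, predicted_dists):
    """Average the per-class absolute differences within each instance,
    then average those per-instance values over all instances."""
    per_instance = [
        sum(abs(t - p) for t, p in zip(t_dist, p_dist)) / len(t_dist)
        for t_dist, p_dist in zip(true_dists, predicted_dists)
    ]
    return sum(per_instance) / len(per_instance)


print(mae_numeric([0, 1, 4], [1, 3, 1]))                       # 2.0
print(mae_nominal([[0, 1], [1, 0]], [[0.25, 0.75], [1, 0]]))   # 0.125
```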
You can double check my explanation by visiting the source code for Weka's evaluation class.
Also as a side note, I believe the mean absolute difference reported by Weka for random forest is for the forest as a whole, not the individual trees.

Related

Why I am not getting the identity matrix?

Hello, I am trying to understand why after this operation:
import numpy as np
from numpy.linalg import inv
a = np.array([[1, 2], [3, 4]])
ainv = inv(a)
print(np.dot(a, ainv))
I am getting:
[[1.0000000e+00 0.0000000e+00]
[8.8817842e-16 1.0000000e+00]]
Since I am multiplying by a's inverse matrix, I think that I should get:
[[1,0],[0,1]]
So I would like help understanding the result.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
ainv = np.linalg.inv(a) #[[-2.0, 1.0],[1.5, -0.5]]
print(np.dot(a,ainv))
Yields as you discovered:
[[1.0000000e+00 0.0000000e+00]
[8.8817842e-16 1.0000000e+00]]
Let's look at the type of the array elements
type(ainv[1][1])
Shows us that the type of the array is
numpy.float64
Let's look at the NumPy precision for this type
numpy.finfo(numpy.float64).precision
NumPy says the approximate number of decimal digits to which this kind of float is precise is 15.
15
Out of curiosity, we can also look at the machine epsilon for the type:
np.finfo(np.float64).eps
Which yields the difference between 1 and the next larger representable number, i.e. the smallest spacing at which 1 + n is still distinguishable from 1
2.220446049250313e-16
So the number you get is technically distinguishable from 0 for this datatype, but it lies below the roughly 15 decimal digits of precision you can rely on. Calculations on large matrices can compound floating-point imprecision even further.
That is the identity matrix, almost. You are getting numbers very close to zero instead of zero, which is a common issue with floating point numbers since they are only a finite approximation of real numbers. For all practical purposes 8.8e-16 or 0.00000000000000088 is ~ zero.
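In practice, a result like this is usually compared against the identity with a tolerance instead of exact equality. A minimal sketch using `np.allclose` with its default tolerances (whether the defaults suit a given problem is an assumption you should check):

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
product = a @ np.linalg.inv(a)

# Tolerance-based comparison: tiny round-off errors are treated as zero.
print(np.allclose(product, np.eye(2)))

# Rounding is a convenient way to display the result without the noise.
print(np.round(product, 10))
```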

Getting probability as 0 or 1 in KNN (predict_proba)

I was using KNN from sklearn and predicted the labels using predict_proba. I was expecting the values in the range of 0 to 1 since it tells the probability for a particular class. But I am only getting 0 & 1.
I have tried large k values too, but to no avail. I have only 1000 samples with around 200 features, and the matrix is largely sparse.
Can anybody tell me what could be the solution here?
sklearn.neighbors.KNeighborsClassifier(n_neighbors=k)
You are getting only 0 and 1 because of the n_neighbors=k parameter. With k set to 1, the only possible outputs are 0 and 1. With k set to 2 they are 0, 0.5, or 1, and with k set to 3 they are 0, 0.333, 0.667, or 1.
Also note that probability values are essentially meaningless in KNN. The algorithm is based on similarity and distance.
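To make the point about k concrete, here is a hand-rolled sketch of uniform-weight voting. It is not scikit-learn's implementation; `knn_proba` and the toy data are made up for illustration. Each class probability is simply the share of the k nearest neighbours carrying that label, so it can only be a multiple of 1/k.

```python
from collections import Counter


def knn_proba(train_X, train_y, query, k, classes):
    """Share of the k nearest training labels per class (uniform weights)."""
    # Rank training points by squared Euclidean distance to the query.
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], query)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return [votes[c] / k for c in classes]


train_X = [[0.0], [1.0], [2.0], [3.0]]
train_y = [0, 0, 1, 1]
for k in (1, 2, 3):
    print(k, knn_proba(train_X, train_y, [1.4], k, classes=[0, 1]))
```

If all k neighbours of every query share one label, the output is always 0 or 1 regardless of k. In scikit-learn, passing `weights="distance"` to `KNeighborsClassifier` weights the votes by inverse distance, so the outputs are no longer restricted to multiples of 1/k.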
The reason might be a lack of variety in the training and test sets.
If the features of a sample occur only in one class, and no sample of any other class in the training set shares them, then that sample will be predicted to belong to that class with probability 100% (1), and 0% (0) for the other classes.
Otherwise, say you have 2 classes and test a sample with knn.predict_proba(sample), expecting a result like [[0.47, 0.53]]. The probabilities sum to 1 either way.
If that's the case, try generating your own test sample that has features from objects of more than one class in the training set.

spark ml 2.0 - Naive Bayes - how to determine threshold values for each class

I am using NB for document classification and trying to understand the thresholds parameter to see how it can help optimize the algorithm.
Spark ML 2.0 thresholds doc says:
Param for Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold.
0) Can someone explain this better? What goal can it achieve? My general idea is that with a threshold of 0.7, at least one class must have a predicted probability above 0.7; otherwise the prediction should come back empty, i.e. the document is classified as 'uncertain' or the prediction column is left empty. How can the p/t function achieve that when it still picks the category with the maximum value?
1) Which probability does it adjust? The default 'probability' column is actually a conditional probability, and 'rawPrediction' is a confidence according to the documentation. I believe the threshold adjusts 'rawPrediction', not the 'probability' column. Am I right?
2) Here is what some of my probability and rawPrediction vectors look like. How do I set threshold values based on these so that I can filter out uncertain classifications? probability is between 0 and 1, but rawPrediction seems to be on a log scale here.
Probability:
[2.233368649314982E-15,1.6429456680945863E-9,1.4377313514127723E-15,7.858651849363202E-15]
rawPrediction:
[-496.9606736723107,-483.452183395287,-497.40111830218746]
Basically I want the classifier to leave the prediction column empty if no class has a probability higher than 0.7.
Also, how do I classify something as uncertain when more than one category has very close scores, e.g. 0.812, 0.800, 0.799? Picking the max is something I may not want here; instead I would classify it as "uncertain" or leave it empty, so that I can do further analysis and treatment for those documents or train another model for them.
I haven't played with it, but the intent is to supply different threshold values for each class. I've extracted this example from the docstring:
>>> model = nb.fit(df)
>>> result.prediction
1.0
>>> result.probability
DenseVector([0.42..., 0.57...])
>>> result.rawPrediction
DenseVector([-1.60..., -1.32...])
>>> nb = nb.setThresholds([0.01, 10.00])
>>> model3 = nb.fit(df)
>>> result = model3.transform(test0).head()
>>> result.prediction
0.0
If I understand correctly, the effect was to transform [0.42, 0.58] into [0.42/0.01, 0.58/10] = [42, 0.058], switching the prediction ("largest p/t") from class 1 (the first result.prediction above) to class 0 (the last result.prediction above). However, I couldn't find the logic in the source. Anyone?
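The "largest p/t" rule quoted from the docs is easy to check by hand. The sketch below is plain Python, not Spark code; `predict_with_thresholds` is a hypothetical helper, and the probabilities are the rounded values from the docstring output.

```python
def predict_with_thresholds(probabilities, thresholds):
    """Return the index of the class with the largest p/t, plus the scaled values."""
    scaled = [p / t for p, t in zip(probabilities, thresholds)]
    best = max(range(len(scaled)), key=lambda k: scaled[k])
    return best, scaled


probs = [0.42, 0.58]

# Equal thresholds leave the ranking unchanged: class 1 wins.
print(predict_with_thresholds(probs, [1.0, 1.0]))

# A tiny threshold for class 0 and a large one for class 1 flips the decision.
print(predict_with_thresholds(probs, [0.01, 10.0]))
```

Note that the rule only rescales the scores before taking the maximum. Some class always wins, so thresholds alone cannot produce an empty or "uncertain" prediction.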
Stepping back: I do not see a built-in way to do what you want: be agnostic if no class dominates. You will have to add that with something like:
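import numpy as np
def weak(probs, threshold=.7, epsilon=.01):
    # "Weak" if no class reaches the threshold, or the top two classes are nearly tied.
    top = np.sort(np.asarray(probs, dtype=float))[::-1]
    return bool(np.all(top < threshold) or (top[0] - top[1]) < epsilon)
>>> cases = [[.5,.5],[.5,.7],[.7,.705],[.6,.1]]
>>> for case in cases:
...     print('{!s:15} - {}'.format(case, weak(case)))
[0.5, 0.5]      - True
[0.5, 0.7]      - False
[0.7, 0.705]    - True
[0.6, 0.1]      - True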
(Notice I haven't checked whether probs is a legal probability distribution.)
Alternatively, if you are not actually making a hard decision, use the predicted probabilities and a metric like Brier score, log loss, or info gain that accounts for the calibration as well as the accuracy.
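As a rough illustration of scoring the probabilities themselves, here is a sketch of two such metrics for a single prediction. It assumes the multi-class form of the Brier score (sum of squared differences to the one-hot truth) and the natural-log form of log loss; the function names are made up for illustration.

```python
import math


def brier_score(probs, true_class):
    """Sum of squared differences between the prediction and the one-hot truth."""
    return sum(
        (p - (1.0 if k == true_class else 0.0)) ** 2
        for k, p in enumerate(probs)
    )


def log_loss(probs, true_class):
    """Negative log of the probability assigned to the true class."""
    return -math.log(probs[true_class])


probs = [0.42, 0.58]
print(brier_score(probs, 1), log_loss(probs, 1))  # truth is class 1
print(brier_score(probs, 0), log_loss(probs, 0))  # truth is class 0: both get worse
```

Lower is better for both, and a confident wrong prediction is penalized more heavily than a hesitant one.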

What does `sample_weight` do to the way a `DecisionTreeClassifier` works in sklearn?

I've read in the relevant documentation that:
Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value.
But, it is still unclear to me how this works. If I set sample_weight with an array of only two possible values, 1's and 2's, does this mean that the samples with 2's will get sampled twice as often as the samples with 1's when doing the bagging? I cannot think of a practical example for this.
Some quick preliminaries:
Let's say we have a classification problem with K classes. In a region of feature space represented by the node of a decision tree, recall that the "impurity" of the region is measured by quantifying the inhomogeneity, using the probability of the class in that region. Normally, we estimate:
Pr(Class=k) = #(examples of class k in region) / #(total examples in region)
The impurity measure takes as input, the array of class probabilities:
[Pr(Class=1), Pr(Class=2), ..., Pr(Class=K)]
and spits out a number, which tells you how "impure" or how inhomogeneous-by-class the region of feature space is. For example, the gini measure for a two class problem is 2*p*(1-p), where p = Pr(Class=1) and 1-p=Pr(Class=2).
Now, basically the short answer to your question is:
sample_weight changes the probability estimates in the probability array ... which changes the impurity measure ... which changes how nodes are split ... which changes how the tree is built ... which changes how feature space is diced up for classification.
I believe this is best illustrated through example.
First consider the following 2-class problem where the inputs are 1 dimensional:
from sklearn.tree import DecisionTreeClassifier as DTC
X = [[0],[1],[2]] # 3 simple training examples
Y = [ 1, 2, 1 ] # class labels
dtc = DTC(max_depth=1)
So, we'll look at trees with just a root node and two children. Note that the default impurity measure is the gini measure.
Case 1: no sample_weight
dtc.fit(X, Y)
print(dtc.tree_.threshold)
# [0.5, -2, -2]
print(dtc.tree_.impurity)
# [0.44444444, 0, 0.5]
The first value in the threshold array tells us that the 1st training example is sent to the left child node, and the 2nd and 3rd training examples are sent to the right child node. The last two values in threshold are placeholders and are to be ignored. The impurity array tells us the computed impurity values in the parent, left, and right nodes respectively.
In the parent node, p = Pr(Class=1) = 2. / 3., so that gini = 2*(2.0/3.0)*(1.0/3.0) = 0.444... You can confirm the child node impurities as well.
Case 2: with sample_weight
Now, let's try:
dtc.fit(X, Y, sample_weight=[1, 2, 3])
print(dtc.tree_.threshold)
# [1.5, -2, -2]
print(dtc.tree_.impurity)
# [0.44444444, 0.44444444, 0.]
You can see the feature threshold is different. sample_weight also affects the impurity measure in each node. Specifically, in the probability estimates, the first training example is counted the same, the second is counted double, and the third is counted triple, due to the sample weights we've provided.
The impurity in the parent node region is the same. This is just a coincidence. We can compute it directly:
p = Pr(Class=1) = (1+3) / (1+2+3) = 2.0/3.0
The gini measure of 4/9 follows.
Now, you can see from the chosen threshold that the first and second training examples are sent to the left child node, while the third is sent to the right. We see that impurity is calculated to be 4/9 also in the left child node because:
p = Pr(Class=1) = 1 / (1+2) = 1/3.
The impurity of zero in the right child is due to only one training example lying in that region.
You can extend this to non-integer sample weights similarly. I recommend trying something like sample_weight = [1,2,2.5], and confirming the computed impurities.
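The impurities quoted in both cases can be reproduced by hand. The sketch below is not scikit-learn's implementation; `weighted_gini` is a hypothetical helper that assumes the two-class formula 2*p*(1-p) from above, with each example counted according to its weight.

```python
def weighted_gini(labels, weights, positive=1):
    """Two-class gini 2*p*(1-p), where p is the weighted share of `positive`."""
    total = sum(weights)
    p = sum(w for y, w in zip(labels, weights) if y == positive) / total
    return 2 * p * (1 - p)


Y = [1, 2, 1]

# Case 1: unit weights, split at 0.5 -> left = {x=0}, right = {x=1, x=2}
w = [1, 1, 1]
print(weighted_gini(Y, w), weighted_gini(Y[:1], w[:1]), weighted_gini(Y[1:], w[1:]))

# Case 2: weights [1, 2, 3], split at 1.5 -> left = {x=0, x=1}, right = {x=2}
w = [1, 2, 3]
print(weighted_gini(Y, w), weighted_gini(Y[:2], w[:2]), weighted_gini(Y[2:], w[2:]))
```

The same helper accepts non-integer weights such as [1, 2, 2.5], which makes it easy to compare against whatever the fitted tree reports in dtc.tree_.impurity.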

Scikit Learn - Random Forest: How continuous feature is handled?

Random Forest accepts numerical data. Usually features with text data are converted to numerical categories, and continuous numerical data is fed as is, without discretization. How does the RF treat continuous data when creating nodes? Will it bin the continuous numerical data internally, or treat each value as a discrete level?
For example:
I want to feed a data set (after categorizing the text features, of course) to RF. How is the continuous data handled by the RF?
Is it advisable to discretize the continuous data (longitudes and latitudes, in this case) before feeding it? Or would information be lost by doing so?
As far as I understand, you are asking how the threshold is chosen for continuous features. The binning occurs at the values where the class changes. For example, consider the following 1D dataset with x as the feature and y as the class variable:
x = [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [ 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
Two candidate cuts will be considered: (i) between 2 and 3 (in practice, x < 2.5) and (ii) between 7 and 8 (x < 7.5).
Of these two candidates, the second will be chosen since it provides a better separation. Then the algorithm goes on to the next step.
Therefore it is not advisable to discretize the data yourself. Think about this with the data above. If, for example, you discretize the data in 5 bins [1, 2 | 3, 4 | 5, 6 | 7, 8 | 9, 10], you miss the best split (since 7 and 8 will be in one bin).
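The choice between the candidate cuts can be checked numerically. The sketch below scores every midpoint between neighbouring x values by the size-weighted gini of the two sides. It assumes binary 0/1 labels and is only an illustration of the idea, not scikit-learn's splitter; the function names are made up.

```python
def gini(labels):
    """Two-class gini impurity for 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def split_scores(x, y):
    """Map each midpoint cut to the size-weighted gini of its two sides."""
    pairs = sorted(zip(x, y))
    xs = [v for v, _ in pairs]
    ys = [c for _, c in pairs]
    scores = {}
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no cut between identical feature values
        cut = (xs[i - 1] + xs[i]) / 2
        left, right = ys[:i], ys[i:]
        scores[cut] = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
    return scores


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
scores = split_scores(x, y)
best_cut = min(scores, key=scores.get)
print(best_cut, round(scores[best_cut], 4))  # 7.5 0.2857
print(round(scores[2.5], 4))                 # 0.375
```

Lower is better, so the cut at 7.5 beats the one at 2.5. A pre-binned feature with 7 and 8 in the same bin could never offer that cut.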
You are asking about decision trees. RandomForest is an ensemble model: by itself it knows nothing about the data. It relies fully on the decisions of its base estimators (in this case decision trees) and aggregates them.
So, how does a DecisionTree treat continuous features? Look at this official documentation page. A DecisionTreeClassifier was fitted on a continuous dataset (Fisher's irises); if you look at the picture of the tree, each node has a threshold value over the feature chosen at that node.
