AdaBoost Implementation with a Decision Stump

I have been trying to implement AdaBoost with a decision stump as the weak classifier, but I do not know how to give preference to the misclassified instances via the sample weights.

A decision stump is basically a rule that specifies a feature, a threshold, and a polarity. So given the samples, you have to find the one feature-threshold-polarity combination that has the lowest error. Usually you count the misclassifications and divide by the number of samples to get the error. In AdaBoost a weighted error is used: instead of counting the misclassifications, you sum up the weights assigned to the misclassified samples. I hope this is all clear so far.
Now, to give a higher preference to the misclassified samples in the next round, you adjust the weights assigned to the samples by either increasing the weights of the misclassified samples or decreasing the weights of the correctly classified ones. Assume that E is your weighted error; you multiply the misclassified sample weights by (1-E)/E. Since the decision stump is better than random guessing, E will be < 0.5, which means that (1-E)/E will be > 1, so the weights are increased (e.g. E = 0.4 => (1-E)/E = 1.5). If, on the other hand, you want to decrease the correctly classified sample weights, use E/(1-E) instead. However, do not forget to normalize the weights afterwards so that they sum up to 1. This is important for the computation of the weighted error.
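To make this concrete, here is a minimal NumPy sketch of the stump search and the weight update described above (the function names and the brute-force search are illustrative, not from any particular library):

import numpy as np

def best_stump(X, y, w):
    # Find the feature/threshold/polarity combination with the lowest weighted error.
    # X: (n_samples, n_features), y: labels in {-1, +1}, w: sample weights summing to 1.
    best, best_err = None, np.inf
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, f] - thr) >= 0, 1, -1)
                err = w[pred != y].sum()   # weighted error: sum of misclassified weights
                if err < best_err:
                    best_err, best = err, (f, thr, polarity)
    return best, best_err

def reweight(w, misclassified, err):
    # Multiply the weights of the misclassified samples by (1 - E) / E,
    # then renormalize so the weights sum to 1 again (assumes 0 < err < 0.5).
    w = w.copy()
    w[misclassified] *= (1 - err) / err
    return w / w.sum()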

Related

Does it make sense to use sample_weights for balanced datasets?

I have limited knowledge about sample_weight in the sklearn library, but from what I gather it is generally used to help balance imbalanced datasets during training. What I am wondering is: if I already have a perfectly balanced binary classification dataset (i.e. equal amounts of 1's and 0's in the label/Y/class column), could one add a sample weight to the 0's in order to put more importance on predicting the 1's correctly?
For example, let's say I really want my model to predict 1's well, and it is OK if some samples predicted as 0 actually turn out to be 1's. Would setting a sample_weight of 2 for the 0's and 1 for the 1's be the correct thing to do here in order to put more importance on correctly predicting the 1's? Or does that even matter? And then during training, is the F1 score generally accepted as the best metric to use?
Thanks for the input!
ANSWER
After a couple of rounds of testing and more research, I have discovered that yes, it does make sense to add more weight to the 0's with a balanced binary classification dataset if your goal is to decrease the chance of over-predicting the 1's. I ran two separate training sessions, one with a weight of 2 for the 0's and 1 for the 1's, and then again vice versa, and found that my model predicted fewer 1's when the extra weight was applied to the 0's, which was my ultimate goal.
In case that helps anyone.
Also, I'm using SKLearn's Balanced Accuracy scoring function for those tests, which takes an average of each separate class's accuracy.
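As a concrete illustration, here is a minimal sketch of how such weights could be passed to a scikit-learn estimator (the classifier choice and the arrays X, y are placeholders, not from the question):

import numpy as np
from sklearn.linear_model import LogisticRegression

# X, y: the balanced training data; y contains 0's and 1's.
# Weight of 2 for the 0's and 1 for the 1's, so mistakes on the 0's cost more
# and the model becomes less eager to predict 1.
weights = np.where(y == 0, 2.0, 1.0)

clf = LogisticRegression()
clf.fit(X, y, sample_weight=weights)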

Sensitivity vs. Positive Predictive Value - which is best?

I am trying to build a model on a class-imbalanced dataset (binary: 25% 1's and 75% 0's). I have tried classification algorithms and ensemble techniques. I am a bit confused about the concepts below, as I am more interested in predicting more 1's.
1. Should I give preference to Sensitivity or Positive Predictive Value?
Some ensemble techniques give at most 45% Sensitivity with a low Positive Predictive Value, and some give 62% Positive Predictive Value with low Sensitivity.
2. My dataset has around 450K observations and 250 features. After a power test I took 10K observations by simple random sampling. When selecting variable importances with ensemble techniques, the features are different from the ones I got with 150K observations. From my intuition and domain knowledge, I feel the features that came up as important in the 150K-observation sample are more relevant. What is the best practice?
3. Lastly, can I use the variable importance generated by RF in other ensemble techniques to predict the accuracy?
Can you please help me out, as I am a bit confused about which one to choose.
The preference between Sensitivity and Positive Predictive Value depends on the ultimate goal of your analysis. The difference between these two values is nicely explained here: https://onlinecourses.science.psu.edu/stat507/node/71/
Altogether, these are two measures that look at the results from two different perspectives. Sensitivity gives you the probability that the test will identify the "condition" among those who actually have it. Positive Predictive Value gives you the probability of actually having the "condition" among those who test positive.
Accuracy depends on the outcome of your classification: it is defined as (true positives + true negatives) / (total), not on the variable importances generated by RF.
Also, it is possible to compensate for the imbalances in the dataset, see https://stats.stackexchange.com/questions/264798/random-forest-unbalanced-dataset-for-training-test
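For reference, both quantities can be computed directly with scikit-learn (a small sketch; y_true and y_pred stand for your labels and predictions):

from sklearn.metrics import recall_score, precision_score

# Sensitivity (recall): of all actual 1's, how many did the model find?
sensitivity = recall_score(y_true, y_pred, pos_label=1)

# Positive Predictive Value (precision): of all predicted 1's, how many are really 1?
ppv = precision_score(y_true, y_pred, pos_label=1)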

Improving linear regression model by taking absolute value of predicted output?

I have a particular regression problem that I was able to improve using Python's abs() function. I am still somewhat new to machine learning, and I wanted to know if what I am doing is actually "allowed", so to speak, for improving a regression problem. The following lines describe my method:
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict
lr = linear_model.LinearRegression()
predicted = abs(cross_val_predict(lr, features, labels_postop_IS, cv=10))
I attempted this solution because linear regression can sometimes produce negative predicted values, even though in my particular case these predictions should never be negative, as they represent a physical quantity.
Using the abs() function, my predictions produce a better fit to the data.
Is this allowed?
Why would it not be "allowed"? If you want to make certain statistical statements (like a 95% CI, for example) you need to be careful. However, most ML practitioners do not care too much about the underlying statistical assumptions and just want a black-box model that can be evaluated based on accuracy or some other performance metric. So basically everything is allowed in ML; you just have to be careful not to overfit. A more sensible solution to your problem might be to use a function that truncates at 0, like f(x) = x if x > 0 else 0, as sketched below. This way large negative values don't suddenly become large positive ones.
On a side note, you should probably also try some other models with more parameters, like an SVR with a non-linear kernel. The point is that an LR fits a line, and unless this line is parallel to your x-axis (thinking of the single-variable case), it will inevitably produce negative values at some point. That is one reason why it is often advised not to use LR for predictions outside the range of the fitted data.
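A small sketch contrasting the original abs() approach with the truncation suggested above (following the question's code; raw holds the untransformed cross-validated predictions):

import numpy as np
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict

lr = linear_model.LinearRegression()
raw = cross_val_predict(lr, features, labels_postop_IS, cv=10)   # may contain negatives

clipped_abs = np.abs(raw)            # the question's approach: large negatives become large positives
clipped_zero = np.maximum(raw, 0)    # truncation at 0: f(x) = x if x > 0 else 0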
A straight line y = a + bx will predict negative y for some x unless a > 0 and b = 0. Using a logarithmic scale seems like a natural way to fix this.
In the case of linear regression, there is no restriction on your outputs.
If your data is non-negative (as in your case the values are physical quantities and cannot be negative), you could model using a generalized linear model (GLM) with a log link function. This is known as Poisson regression and is helpful for modeling discrete non-negative counts such as the problem you described. The Poisson distribution is parameterized by a single value λ, which describes both the expected value and the variance of the distribution.
I cannot say your approach is wrong, but a better way would be to use the method above.
This effectively amounts to fitting a linear model to the logarithm of the expected value of your observations.
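A minimal sketch of that idea with scikit-learn's PoissonRegressor (available in recent scikit-learn versions); since the model uses a log link, its predictions are exp(linear term) and therefore never negative. The variable names follow the question's code:

from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import cross_val_predict

glm = PoissonRegressor()
# Predictions are exp(X @ coef + intercept), so they cannot be negative.
predicted = cross_val_predict(glm, features, labels_postop_IS, cv=10)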

How to debug scikit classifier that chooses wrong class with high confidence

I am using the LogisticRegression classifier to classify documents. The results are good (macro-averaged F1 = 0.94). I apply an extra step to the prediction results (predict_proba) to check whether a classification is "confident" enough (e.g. > 0.5 confidence for the top class, > 0.2 distance in confidence to the second class, etc.). Otherwise, the sample is discarded as "unknown".
The score that matters most to me is the number of samples that, despite this additional step, are still assigned to the wrong class. This score is unfortunately too high (~0.03). In many of these cases, the classifier is very confident (0.8 - 0.9999!) that it chose the correct class.
Changing parameters (C, class_weight, min_df, tokenizer) has so far only led to a small decrease in this score, but a significant decrease in correct classifications. However, looking at several samples and the most discriminative features of the respective classes, I cannot understand where this high confidence comes from. I would assume it is possible to discard most of these samples without discarding significantly more correct ones.
Is there a way to debug/analyze such cases? What could be the reason for these high confidence values?
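For reference, the filtering step described above could look roughly like this (a sketch; clf and X_test are placeholders, and the thresholds are the ones mentioned in the question):

import numpy as np

proba = clf.predict_proba(X_test)            # shape (n_samples, n_classes)
top2 = np.sort(proba, axis=1)[:, -2:]        # two highest probabilities per sample
confident = (top2[:, 1] > 0.5) & (top2[:, 1] - top2[:, 0] > 0.2)

predictions = clf.predict(X_test)
accepted = predictions[confident]            # everything else is treated as "unknown"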

Spark Naive Bayes Result accuracy (Spark ML 1.6.0) [duplicate]

I am using Spark ML to optimise a Naive Bayes multi-class classifier.
I have about 300 categories and I am classifying text documents.
The training set is balanced enough, and there are about 300 training examples for each category.
All looks good and the classifier works with acceptable precision on unseen documents. But what I am noticing is that, when classifying a new document, the classifier very often assigns a high probability to one of the categories (the prediction probability is almost equal to 1), while the other categories receive very low probabilities (close to zero).
What are the possible reasons for this?
I would like to add that in Spark ML there is something called the "raw prediction"; when I look at it, I can see negative numbers, but they have more or less comparable magnitudes, so even the category with the high probability has a comparable raw prediction score. I am finding it difficult to interpret these scores.
Let's start with a very informal description of the Naive Bayes classifier. If C is the set of all classes, d is a document and the x_i are its features, Naive Bayes returns:
argmax_{c in C} P(c|d) = argmax_{c in C} P(d|c) P(c) / P(d)
Since P(d) is the same for all classes we can simplify this to
argmax_{c in C} P(d|c) P(c)
where
P(d|c) = P(x_1, ..., x_n | c)
Since we assume that the features are conditionally independent (that is why it is naive), we can further simplify this (with Laplace correction to avoid zeros) to:
argmax_{c in C} P(c) * prod_i P(x_i|c)
The problem with this expression is that in any non-trivial case it is numerically equal to zero. To avoid this we use the following property:
log(a * b) = log(a) + log(b)
and replace the initial condition with:
argmax_{c in C} [ log P(c) + sum_i log P(x_i|c) ]
These are the values you get as the raw prediction. Since each element is negative (the logarithm of a value in (0, 1]), the whole expression is negative as well. As you discovered yourself, these values are further normalized so that the maximum value is equal to 1 and then divided by the sum of the normalized values.
It is important to note that, while the values you get are not strictly P(c|d), they preserve all the important properties: the order and the ratios are exactly the same (ignoring possible numerical issues). If no other class gets a prediction close to one, it means that, given the evidence, it is a very strong prediction, so it is actually something you want to see.
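To make the relationship between raw scores and probabilities concrete, here is a small sketch of the usual normalization of log scores (the exact Spark internals may differ; the raw values are made up for illustration):

import numpy as np

raw = np.array([-245.3, -251.7, -260.1])     # example raw predictions (log scores)

# Exponentiate after subtracting the maximum (for numerical stability), then normalize.
shifted = raw - raw.max()
proba = np.exp(shifted) / np.exp(shifted).sum()
# A gap of a few units in log space already pushes one probability very close to 1.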

Resources