How to debug scikit classifier that chooses wrong class with high confidence - scikit-learn

I am using the LogisticRegression classifier to classify documents. The results are good (macro-avg. f1 = 0.94). I apply an extra step to the prediction results (predict_proba) to check if a classification is "confident" enough (e.g. >0.5 confidence for the first class, >0.2 distance in confidence to the 2. class etc.). Otherwise, the sample is discarded as "unknown".
The score that is most significant for me is the number of samples that, despite this additional step, are assigned to the wrong class. This score is unfortunately too high (~ 0.03). In many of these cases, the classifier is very confident (0.8 - 0.9999!) that he chose the correct class.
Changing parameters (C, class_weight, min_df, tokenizer) so far only lead to a small decrease in this score, but a significant decrease in correct classifications. However, looking at several samples and the most discriminative features of the respective classes, I cannot understand where this high confidence comes from. I would assume it is possible to discard most of these samples without discarding significantly more correct samples.
Is there a way to debug/analyze such cases? What could be the reason for these high confidence values?

Related

Loss functions in LightFM

I recently came across LightFM while learning to train a recommender system. And so far what I know is that it utilizes loss functions which are logistic, BPR, WARP and k-OS WARP. I did not go through the math behind all these functions. Now what I am confused about is that how will I know that which loss function to use where?
From lightfm model documentation page:
logistic: useful when both positive (1) and negative (-1) interactions are present.
BPR: Bayesian Personalised Ranking 1 pairwise loss. Maximises the prediction difference between a positive example and a randomly chosen negative example. Useful when only positive interactions are present and optimising ROC AUC is desired.
WARP: Weighted Approximate-Rank Pairwise [2] loss. Maximises the rank of positive examples by repeatedly sampling negative examples until rank violating one is found. Useful when only positive interactions are present and optimising the top of the recommendation list (precision#k) is desired.
k-OS WARP: k-th order statistic loss [3]. A modification of WARP that uses the k-the positive example for any given user as a basis for pairwise updates.
Everything boils down to how your dataset is structured and what kind of user interacions you're looking at. Obviously one approach would be to include the loss function in your parameter grid when going through hyperparameter tuning (at least that's what I did) and check model accuracy. I find investingating why a given loss function performed better/worse on a dataset as a good learning exercise.

Sensitivity Vs Positive Predicted Value - which is best?

I am trying to build a model on a class imbalanced dataset (binary - 1's:25% and 0's 75%). Tried with Classification algorithms and ensemble techniques. I am bit confused on below two concepts as i am more interested in predicting more 1's.
1. Should i give preference to Sensitivity or Positive Predicted Value.
Some ensemble techniques give maximum 45% of sensitivity and low Positive Predicted Value.
And some give 62% of Positive Predicted Value and low Sensitivity.
2. My dataset has around 450K observations and 250 features.
After power test i took 10K observations by Simple random sampling. While selecting
variable importance using ensemble technique's the features
are different compared to the features when i tried with 150K observations.
Now with my intuition and domain knowledge i felt features that came up as important in
150K observation sample are more relevant. what is the best practice?
3. Last, can i use the variable importance generated by RF in other ensemple
techniques to predict the accuracy?
Can you please help me out as am bit confused on which w
The preference between Sensitivity and Positive Predictive value depends on your ultimate goal of the analysis. The difference between these two values is nicely explained here: https://onlinecourses.science.psu.edu/stat507/node/71/
Altogether, these are two measures that look at the results from two different perspectives. Sensitivity gives you a probability that a test will find a "condition" among those you have it. Positive Predictive value looks at the prevalence of the "condition" among those who is being tested.
Accuracy is depends on the outcome of your classification: it is defined as (true positive + true negative)/(total), not variable importance's generated by RF.
Also, it is possible to compensate for the imbalances in the dataset, see https://stats.stackexchange.com/questions/264798/random-forest-unbalanced-dataset-for-training-test

Should sentiment analysis training data be evenly distributed?

If I am training a sentiment classifier off of a tagged dataset where most documents are negative, say ~95%, should the classifier be trained with the same distribution of negative comments? If not, what would be other options to "normalize" the data set?
You don't say what type of classifier you have but in general you don't have to normalize the distribution of the training set. However, usually the more data the better but you should always do blind tests to prevent over-fitting.
In your case you will have a strong classifier for negative comments and unless you have a very large sample size, a weaker positive classifier. If your sample size is large enough it won't really matter since you hit a point where you might start over-fitting your negative data anyway.
In short, it's impossible to say for sure without knowing the actual algorithm and the size of the data sets and the diversity within the dataset.
Your best bet is to carve off something like 10% of your training data (randomly) and just see how the classifier performs after being trained on the 90% subset.

How to combine LIBSVM probability estimates from two (or three) two class SVM classifiers.

I have training data that falls into two classes, let's say Yes and No. The data represents three tasks, easy, medium and difficult. A person performs these tasks and is classified into one of the two classes as a result. Each task is classified independently and then the results are combined. I am using 3 independently trained SVM classifiers and then voting on the final result.
I am looking to provide a measure of confidence or probability associated with each classification. LIBSVM can provide a probability estimate along with the classification for each task (easy, medium and difficult, say Pe, Pm and Pd) but I am unsure of how best to combine these into an overall estimate for the final classification of the person (let's call it Pp).
My attempts so far have been along the lines of a simple average:
Pp = (Pe + Pm + Pd) / 3
An Inverse-variance weighted average (since each task is repeated a few times and sample variance (VARe, VARm and VARd) can be calculated - in which case Pe would be a simple average of all the easy samples):
Pp = (Pe/VARe + Pm/VARm + Pd/VARd) / (( 1/VARe ) + ( 1/VARm ) + ( 1/VARd ))
Or a multiplication (under the assumption that these events are independent, which I am unsure of since the underlying tasks are related):
Pp = Pe * Pm * Pd
The multiplication would provide a very low number, so it's unclear how to interpret that as an overall probability when the results of the voting are very clear.
Would any of these three options be the best or is there some other method / detail I'm overlooking?
Based on your comment, I will make the following suggestion. If you need to do this as an SVM (and because, as you say, you get better performance when you do it this way), take the output from your intermediate classifiers and feed them as features to your final classifier. Even better, switch to a multi-layer Neural Net where your inputs represent inputs to the intermediates, the (first) hidden layer represents outputs to the intermediate problem, and subsequent layer(s) represent the final decision you want. This way you get the benefit of an intermediate layer, but its output is optimised to help with the final prediction rather than for accuracy in its own right (which I assume you don't really care about).
The correct generative model for these tests likely looks something like the following:
Generate an intelligence/competence score i
For each test t: generate pass/fail according to p_t(pass | i)
This is simplified, but I think it should illustrate tht you have a latent variable i on which these tests depend (and there's also structure between them, since presumably p_easy(pass|i) > p_medium(pass|i) > p_hard(pass|i); you could potentially model this as a logistic regression with a continuous 'hardness' feature). I suspect what you're asking about is a way to do inference on some thresholding function of i, but you want to do it in a classification way rather than as a probabilistic model. That's fine, but without explicitly encoding the latent variable and the structure between the tests it's going to be hard (and no average of the probabilities will account for the missing structure).
I hope that helps---if I've made assumptions that aren't justified, please feel free to correct.

Training set - proportion of pos / neg / neutral sentences

I am hand tagging twitter messages as Positive, Negative, Neutral. I am try to appreciate is there some logic one can use to identify of the training set what proportion of message should be positive / negative and neutral ?
So for e.g. if I am training a Naive Bayes classifier with 1000 twitter messages should the proportion of pos : neg : neutral be 33 % : 33% : 33% or should it be 25 % : 25 % : 50 %
Logically in my head it seems that I i train (i.e. give more samples for neutral) that the system would be better at identifying neutral sentences then whether they are positive or negative - is that true ? or I am missing some theory here ?
Thanks
Rahul
The problem you're referring to is known as the imbalance problem. Many machine learning algorithms perform badly when confronted with imbalanced training data, i.e. when the instances of one class heavily outnumber those of the other class. Read this article to get a good overview of the problem and how to approach it. For techniques like naive bayes or decision trees it is always a good idea to balance your data somehow, e.g. by random oversampling (explained in the references paper). I disagree with mjv's suggestion to have a training set match the proportions in the real world. This may be appropriate in some cases but I'm quite confident it's not in your setting. For a classification problem like the one you describe, the more the sizes of the class sets differ, the more most ML algorithms will have problems discriminating the classes properly. However, you can always use the information about which class is the largest in reality by taking it as a fallback such that when the classifier's confidence for a particular instance is low or this instance couldn't be classified at all, you would assign it the largest class.
One further remark: finding the positivity/negativity/neutrality in Twitter messages seems to me to be a question of degree. As such, it may be viewes as a regression rather than a classification problem, i.e. instead of a three class scheme you perhaps may want calculate a score which tells you how positive/negative the message is.
There are many other factors... but an important one (in determining a suitable ratio and volume of training data) is the expected distribution of each message category (Positive, Neutral, Negative) in the real world. Effectively, a good baseline for the training set (and the control set) is
[qualitatively] as representative as possible of the whole "population"
[quantitatively] big enough that measurements made from such sets is statistically significant.
The effect of the [relative] abundance of a certain category of messages in the training set is hard to determine; it is in any case a lesser factor -or rather one that is highly sensitive to- other factors. Improvements in the accuracy of the classifier, as a whole, or with regards to a particular category, is typically tied more to the specific implementation of the classifier (eg. is it Bayesian, what are the tokens, are noise token eliminated, is proximity a factor, are we using bi-grams etc...) than to purely quantitative characteristics of the training set.
While the above is generally factual but moderately helpful for the selection of the training set's size and composition, there are ways of determining, post facto, when an adequate size and composition of training data has been supplied.
One way to achieve this is to introduce a control set, i.e. one manually labeled but that is not part of the training set and to measure for different test runs with various subsets of the training set, the recall and precision obtained for each category (or some similar accuracy measurements), for this the classification of the control set. When these measurements do not improve or degrade, beyond what's statistically representative, the size and composition of the training [sub-]set is probably the right one (unless it is an over-fitting set :-(, but that's another issue altogether... )
This approach, implies that one uses a training set that could be 3 to 5 times the size of the training subset effectively needed, so that one can build, randomly (within each category), many different subsets for the various tests.

Resources