Sensitivity Vs Positive Predicted Value - which is best? - statistics

I am trying to build a model on a class imbalanced dataset (binary - 1's:25% and 0's 75%). Tried with Classification algorithms and ensemble techniques. I am bit confused on below two concepts as i am more interested in predicting more 1's.
1. Should i give preference to Sensitivity or Positive Predicted Value.
Some ensemble techniques give maximum 45% of sensitivity and low Positive Predicted Value.
And some give 62% of Positive Predicted Value and low Sensitivity.
2. My dataset has around 450K observations and 250 features.
After power test i took 10K observations by Simple random sampling. While selecting
variable importance using ensemble technique's the features
are different compared to the features when i tried with 150K observations.
Now with my intuition and domain knowledge i felt features that came up as important in
150K observation sample are more relevant. what is the best practice?
3. Last, can i use the variable importance generated by RF in other ensemple
techniques to predict the accuracy?
Can you please help me out as am bit confused on which w

The preference between Sensitivity and Positive Predictive value depends on your ultimate goal of the analysis. The difference between these two values is nicely explained here: https://onlinecourses.science.psu.edu/stat507/node/71/
Altogether, these are two measures that look at the results from two different perspectives. Sensitivity gives you a probability that a test will find a "condition" among those you have it. Positive Predictive value looks at the prevalence of the "condition" among those who is being tested.
Accuracy is depends on the outcome of your classification: it is defined as (true positive + true negative)/(total), not variable importance's generated by RF.
Also, it is possible to compensate for the imbalances in the dataset, see https://stats.stackexchange.com/questions/264798/random-forest-unbalanced-dataset-for-training-test

Related

What to do if a significant factor in univariate linear regression become insignificant in multivariate analysis?

I am studying the impact of several factors "sleeping time, studying time, anxiety degree, depression degree ..." on students final exam mark.
when I did the univaraite linear regression analysis all models where significant (as final exam mark is a dependent variable) despite some have a small R^2.
then I tried to put all predictor factors in one multiple linear regression model, the result was most of the predictors are insignificant with exception to study time which was significant and has a big R^2 in uni and multi variate analysis.
How should I explain this in my paper? is it okay to have this result? or should I search for another model?
I sound like you have highly correlated predictors. This gives you a very unstable model, one where small changes in a few observations could produce large changes in regression coefficients.
You should try various models that use subsets of your predictors, and select a final model that has a significant overall F statistic, and significant t stats for your included predictors.
In your paper, you could explain that anxiety score and depression score were too highly correlated to allow them both into the model and you’ve selected the best model that doesn’t contain both of these scores.

Confusion matrix for LDA

I’m trying to check the performance of my LDA model using a confusion matrix but I have no clue what to do. I’m hoping someone can maybe just point my in the right direction.
So I ran an LDA model on a corpus filled with short documents. I then calculated the average vector of each document and then proceeded with calculating cosine similarities.
How would I now get a confusion matrix? Please note that I am very new to the world of NLP. If there is some other/better way of checking the performance of this model please let me know.
What is your model supposed to be doing? And how is it testable?
In your question you haven't described your testable assessment of the model the results of which would be represented in a confusion matrix.
A confusion matrix helps you represent and explore the different types of "accuracy" of a predictive system such as a classifier. It requires your system to make a choice (e.g. yes/no, or multi-label classifier) and you must use known test data to be able to score it against how the system should have chosen. Then you count these results in the matrix as one of the combination of possibilities, e.g. for binary choices there's two wrong and two correct.
For example, if your cosine similarities are trying to predict if a document is in the same "category" as another, and you do know the real answers, then you can score them all as to whether they were predicted correctly or wrongly.
The four possibilities for a binary choice are:
Positive prediction vs. positive actual = True Positive (correct)
Negative prediction vs. negative actual = True Negative (correct)
Positive prediction vs. negative actual = False Positive (wrong)
Negative prediction vs. positive actual = False Negative (wrong)
It's more complicated in a multi-label system as there are more combinations, but the correct/wrong outcome is similar.
About "accuracy".
There are many kinds of ways to measure how well the system performs, so it's worth reading up on this before choosing the way to score the system. The term "accuracy" means something specific in this field, and is sometimes confused with the general usage of the word.
How you would use a confusion matrix.
The confusion matrix sums (of total TP, FP, TN, FN) can fed into some simple equations which give you, these performance ratings (which are referred to by different names in different fields):
sensitivity, d' (dee-prime), recall, hit rate, or true positive rate (TPR)
specificity, selectivity or true negative rate (TNR)
precision or positive predictive value (PPV)
negative predictive value (NPV)
miss rate or false negative rate (FNR)
fall-out or false positive rate (FPR)
false discovery rate (FDR)
false omission rate (FOR)
Accuracy
F Score
So you can see that Accuracy is a specific thing, but it may not be what you think of when you say "accuracy"! The last two are more complex combinations of measure. The F Score is perhaps the most robust of these, as it's tuneable to represent your requirements by combining a mix of other metrics.
I found this wikipedia article most useful and helped understand why sometimes is best to choose one metric over the other for your application (e.g. whether missing trues is worse than missing falses). There are a group of linked articles on the same topic, from different perspectives e.g. this one about search.
This is a simpler reference I found myself returning to: http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/confusion_matrix.html
This is about sensitivity, more from a science statistical view with links to ROC charts which are related to confusion matrices, and also useful for visualising and assessing performance: https://en.wikipedia.org/wiki/Sensitivity_index
This article is more specific to using these in machine learning, and goes into more detail: https://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf
So in summary confusion matrices are one of many tools to assess the performance of a system, but you need to define the right measure first.
Real world example
I worked through this process recently in a project I worked on where the point was to find all of few relevant documents from a large set (using cosine distances like yours). This was like a recommendation engine driven by manual labelling rather than an initial search query.
I drew up a list of goals with a stakeholder in their own terms from the project domain perspective, then tried to translate or map these goals into performance metrics and statistical terms. You can see it's not just a simple choice! The hugely imbalanced nature of our data set skewed the choice of metric as some assume balanced data or else they will give you misleading results.
Hopefully this example will help you move forward.

Loss functions in LightFM

I recently came across LightFM while learning to train a recommender system. And so far what I know is that it utilizes loss functions which are logistic, BPR, WARP and k-OS WARP. I did not go through the math behind all these functions. Now what I am confused about is that how will I know that which loss function to use where?
From lightfm model documentation page:
logistic: useful when both positive (1) and negative (-1) interactions are present.
BPR: Bayesian Personalised Ranking 1 pairwise loss. Maximises the prediction difference between a positive example and a randomly chosen negative example. Useful when only positive interactions are present and optimising ROC AUC is desired.
WARP: Weighted Approximate-Rank Pairwise [2] loss. Maximises the rank of positive examples by repeatedly sampling negative examples until rank violating one is found. Useful when only positive interactions are present and optimising the top of the recommendation list (precision#k) is desired.
k-OS WARP: k-th order statistic loss [3]. A modification of WARP that uses the k-the positive example for any given user as a basis for pairwise updates.
Everything boils down to how your dataset is structured and what kind of user interacions you're looking at. Obviously one approach would be to include the loss function in your parameter grid when going through hyperparameter tuning (at least that's what I did) and check model accuracy. I find investingating why a given loss function performed better/worse on a dataset as a good learning exercise.

Spark Naive Bayes Result accuracy (Spark ML 1.6.0) [duplicate]

I am using Spark ML to optimise a Naive Bayes multi-class classifier.
I have about 300 categories and I am classifying text documents.
The training set is balanced enough and there is about 300 training examples for each category.
All looks good and the classifier is working with acceptable precision on unseen documents. But what I am noticing that when classifying a new document, very often, the classifier assigns a high probability to one of the categories (the prediction probability is almost equal to 1), while the other categories receive very low probabilities (close to zero).
What are the possible reasons for this?
I would like to add that in SPARK ML there is something called "raw prediction" and when I look at it, I can see negative numbers but they have more or less comparable magnitude, so even the category with the high probability has comparable raw prediction score, but I am finding difficulties in interpreting this scores.
Lets start with a very informal description of Naive Bayes classifier. If C is a set of all classes and d is a document and xi are the features, Naive Bayes returns:
Since P(d) is the same for all classes we can simplify this to
where
Since we assume that features are conditionally independent (that is why it is naive) we can further simplify this (with Laplace correction to avoid zeros) to:
Problem with this expression is that in any non-trivial case it is numerically equal to zero. To avoid we use following property:
and replace initial condition with:
These are the values you get as the raw probabilities. Since each element is negative (logarithm of the value in (0, 1]) a whole expression has negative value as well. As you discovered by yourself these values are further normalized so the maximum value is equal to 1 and divided by the sum of the normalized values
It is important to note that while values you get are not strictly P(c|d) they preserve all important properties. The order and ratios are exactly (ignoring possible numerical issues) the same. If none other class gets prediction close to one it means that, given the evidence, it is a very strong prediction. So it is actually something you want to see.

Training set - proportion of pos / neg / neutral sentences

I am hand tagging twitter messages as Positive, Negative, Neutral. I am try to appreciate is there some logic one can use to identify of the training set what proportion of message should be positive / negative and neutral ?
So for e.g. if I am training a Naive Bayes classifier with 1000 twitter messages should the proportion of pos : neg : neutral be 33 % : 33% : 33% or should it be 25 % : 25 % : 50 %
Logically in my head it seems that I i train (i.e. give more samples for neutral) that the system would be better at identifying neutral sentences then whether they are positive or negative - is that true ? or I am missing some theory here ?
Thanks
Rahul
The problem you're referring to is known as the imbalance problem. Many machine learning algorithms perform badly when confronted with imbalanced training data, i.e. when the instances of one class heavily outnumber those of the other class. Read this article to get a good overview of the problem and how to approach it. For techniques like naive bayes or decision trees it is always a good idea to balance your data somehow, e.g. by random oversampling (explained in the references paper). I disagree with mjv's suggestion to have a training set match the proportions in the real world. This may be appropriate in some cases but I'm quite confident it's not in your setting. For a classification problem like the one you describe, the more the sizes of the class sets differ, the more most ML algorithms will have problems discriminating the classes properly. However, you can always use the information about which class is the largest in reality by taking it as a fallback such that when the classifier's confidence for a particular instance is low or this instance couldn't be classified at all, you would assign it the largest class.
One further remark: finding the positivity/negativity/neutrality in Twitter messages seems to me to be a question of degree. As such, it may be viewes as a regression rather than a classification problem, i.e. instead of a three class scheme you perhaps may want calculate a score which tells you how positive/negative the message is.
There are many other factors... but an important one (in determining a suitable ratio and volume of training data) is the expected distribution of each message category (Positive, Neutral, Negative) in the real world. Effectively, a good baseline for the training set (and the control set) is
[qualitatively] as representative as possible of the whole "population"
[quantitatively] big enough that measurements made from such sets is statistically significant.
The effect of the [relative] abundance of a certain category of messages in the training set is hard to determine; it is in any case a lesser factor -or rather one that is highly sensitive to- other factors. Improvements in the accuracy of the classifier, as a whole, or with regards to a particular category, is typically tied more to the specific implementation of the classifier (eg. is it Bayesian, what are the tokens, are noise token eliminated, is proximity a factor, are we using bi-grams etc...) than to purely quantitative characteristics of the training set.
While the above is generally factual but moderately helpful for the selection of the training set's size and composition, there are ways of determining, post facto, when an adequate size and composition of training data has been supplied.
One way to achieve this is to introduce a control set, i.e. one manually labeled but that is not part of the training set and to measure for different test runs with various subsets of the training set, the recall and precision obtained for each category (or some similar accuracy measurements), for this the classification of the control set. When these measurements do not improve or degrade, beyond what's statistically representative, the size and composition of the training [sub-]set is probably the right one (unless it is an over-fitting set :-(, but that's another issue altogether... )
This approach, implies that one uses a training set that could be 3 to 5 times the size of the training subset effectively needed, so that one can build, randomly (within each category), many different subsets for the various tests.

Resources