How to compute False Accept and False Reject rates using one-class SVM (Python, one-class-classification)

How to compute False Accept and False Reject rates using one-class SVM? I have user data with around 70000 samples. I am trying to apply a one-class SVM here. The number of -1 values obtained is 12765 and the rest are labelled as 1. From these values, how do I compute the False Accept Rate?

You can compute it with the help of the confusion matrix:
FAR = FPR = FP / (FP + TN)
FRR = FNR = FN / (FN + TP)
where
FP: False Positive
FN: False Negative
TN: True Negative
TP: True Positive
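Since the one-class SVM's +1/-1 output alone cannot tell you which of the 12765 rejections were wrong, you also need ground-truth labels for a held-out test set. A minimal sketch with scikit-learn, assuming X_train holds only genuine-user samples and X_test / y_test are labelled test arrays with +1 for genuine and -1 for impostor samples:

import numpy as np
from sklearn.svm import OneClassSVM

# Assumed names: X_train (genuine-user samples only), X_test / y_test (labelled NumPy arrays).
clf = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X_train)
y_pred = clf.predict(X_test)                  # +1 = accepted, -1 = rejected

fp = np.sum((y_pred == 1) & (y_test == -1))   # impostor accepted
tn = np.sum((y_pred == -1) & (y_test == -1))  # impostor rejected
fn = np.sum((y_pred == -1) & (y_test == 1))   # genuine user rejected
tp = np.sum((y_pred == 1) & (y_test == 1))    # genuine user accepted

far = fp / (fp + tn)   # False Accept Rate
frr = fn / (fn + tp)   # False Reject Rate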

Related

How to customize the threshold in PyTorch

I have trained ResNet50 for binary image classification.
I want to decrease false negatives by reducing the threshold value.
How can I do that?
To decrease the number of false negatives (FN), i.e. increase the recall (since recall = TP / (TP + FN)), you should increase the weight of the positive class above 1. For example, nn.BCEWithLogitsLoss allows you to provide the pos_weight option:
pos_weight > 1 increases the recall, pos_weight < 1 increases the precision.
For example, if a dataset contains 100 positive and 300 negative examples of a single class, then pos_weight for the class should be equal to 300/100 = 3. The loss would act as if the dataset contains 3*100 = 300 positive examples.
As a side note, the explicit expression for binary cross entropy with logits (where "with logits" should rather be understood as "from logits") is:
>>> z = torch.sigmoid(q)
>>> loss = -(w_p*p*torch.log(z) + (1-p)*torch.log(1-z))
Here q are the raw logit values, p the binary targets, and w_p is the weight of the positive instances.
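For reference, a minimal sketch of passing pos_weight in practice; the value 3.0 matches the 100-positive / 300-negative example above, and the logits and targets are placeholders:

import torch
import torch.nn as nn

# pos_weight = (number of negatives) / (number of positives) = 300 / 100 = 3.0 for the example above
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0]))

logits = torch.randn(8, 1, requires_grad=True)   # stand-in for raw model outputs
targets = torch.randint(0, 2, (8, 1)).float()    # stand-in for binary labels

loss = criterion(logits, targets)
loss.backward()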

how to increase precision as well as recall in svm for highly imbalanced data set

I have a loan data set which is highly imbalanced, with shape:
(116058, 29)
How do I improve the precision and recall scores? The target column is m13:
Counter({1: 636, 0: 115422})
I used the following to split the data into train and test sets:
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size = 0.8,random_state = 100,stratify = y)
and then used svm for classification:
svc = SVC(class_weight = {1:0.95,0:0.05},kernel='rbf')
svc.fit(X_train,y_train)
y_pred = svc.predict(X_test)
I got a precision of .54 and a recall of .55.
I also tried grid search with different values of C and gamma; the code above gave the best result:
svc = SVC(class_weight = {1:0.95,0:0.05},kernel='rbf')
svc.fit(X_train,y_train)
y_pred = svc.predict(X_test)
Is there any way to improve both the precision and the recall score?
First of all, let me comment on the baseline of your prediction. If I understand you correctly, you have 636 samples of class 1 and 115422 samples of class 0.
Imagine you built a prediction model that always predicts class 0. If class 0 is your positive class, your precision would be:
115422/(115422+636) = 0.9945
and your recall would be:
1
If class 1 is your positive class, precision would be 0 (and so would recall, since no class-1 sample is ever predicted).
As you can see, it is quite a task to tune this; there are whole books about the topic and it will be very hard. But your target should be to predict class 1 correctly: the goal should be to identify every class-1 sample. For example, you could target your sensitivity; see https://en.wikipedia.org/wiki/Precision_and_recall for the metrics you can aim for.
What you definitely should do is make sure that your train and test sets both contain samples of class 1 (your stratify=y already takes care of that).
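Beyond that, since SVC already exposes class_weight, one concrete thing to try (a sketch, not a guaranteed improvement, assuming X and y are the features and the m13 target from the question) is to tune class_weight together with C and gamma while scoring on the F1 of the minority class:

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Assumed: X, y are the features and the imbalanced m13 target from the question.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=100, stratify=y)

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", 0.01, 0.001],
    "class_weight": ["balanced", {1: 0.95, 0: 0.05}],
}

# scoring="f1" treats class 1 as the positive label, so the search optimises the
# precision/recall trade-off on the rare class rather than overall accuracy.
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="f1", cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(classification_report(y_test, grid.best_estimator_.predict(X_test)))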

spark ml 2.0 - Naive Bayes - how to determine threshold values for each class

I am using NB for document classification and trying to understand the threshold parameter to see how it can help optimize the algorithm.
Spark ML 2.0 thresholds doc says:
Param for Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold.
0) Can someone explain this better? What goal can it achieve? My general idea is that if you have a threshold of 0.7, then at least one class's prediction probability should be more than 0.7; if not, the prediction should return empty, i.e. classify it as 'uncertain' or just leave the prediction column empty. How is the p/t rule going to achieve that when you still pick the category with the maximum probability?
1) Which probability does it adjust? The default column 'probability' is actually the conditional probability and 'rawPrediction' is the confidence, according to the documentation. I believe the threshold will adjust 'rawPrediction', not the 'probability' column. Am I right?
2) Here's how some of my probability and rawPrediction vectors look. How do I set threshold values based on this so I can drop certain uncertain classifications? probability is between 0 and 1, but rawPrediction seems to be on a log scale here.
Probability:
[2.233368649314982E-15,1.6429456680945863E-9,1.4377313514127723E-15,7.858651849363202E-15]
rawPrediction:
[-496.9606736723107,-483.452183395287,-497.40111830218746]
Basically I want the classifier to leave the prediction column empty if it doesn't have any class probability that is more than 0.7.
Also, how do I classify something as uncertain when more than one category has very close scores, e.g. 0.812, 0.800, 0.799? Picking the max is something I may not want here; instead I'd like to classify it as "uncertain" or leave it empty so I can do further analysis and treatment of those documents, or train another model for them.
I haven't played with it, but the intent is to supply different threshold values for each class. I've extracted this example from the docstring:
>>> model = nb.fit(df)
>>> result = model.transform(test0).head()
>>> result.prediction
1.0
>>> result.probability
DenseVector([0.42..., 0.57...])
>>> result.rawPrediction
DenseVector([-1.60..., -1.32...])
>>> nb = nb.setThresholds([0.01, 10.00])
>>> model3 = nb.fit(df)
>>> result = model3.transform(test0).head()
>>> result.prediction
0.0
If I understand correctly, the effect was to transform [0.42, 0.58] into [0.42/0.01, 0.58/10] = [42, 0.058], switching the prediction ("largest p/t") from class 1 to class 0. However, I couldn't find the logic in the source. Anyone?
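A quick back-of-the-envelope check of the "largest p/t" rule with NumPy, using the numbers from the docstring example above:

import numpy as np

p = np.array([0.42, 0.58])     # class probabilities
t = np.array([0.01, 10.00])    # per-class thresholds

print(p / t)             # -> [42.    0.058]
print(np.argmax(p / t))  # -> 0, class 0 wins once each probability is divided by its threshold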
Stepping back: I do not see a built-in way to do what you want: be agnostic if no class dominates. You will have to add that with something like:
import numpy as np

def weak(probs, threshold=.7, epsilon=.01):
    # "weak" if no class reaches the threshold, or if no adjacent probability differs by more than epsilon
    probs = np.asarray(probs)
    return np.all(probs < threshold) or np.max(np.diff(probs)) < epsilon

>>> cases = [[.5, .5], [.5, .7], [.7, .705], [.6, .1]]
>>> for case in cases:
...     print('{} - {}'.format(case, weak(case)))
[0.5, 0.5] - True
[0.5, 0.7] - False
[0.7, 0.705] - True
[0.6, 0.1] - True
(Notice I haven't checked whether probs is a legal probability distribution.)
Alternatively, if you are not actually making a hard decision, use the predicted probabilities and a metric like Brier score, log loss, or info gain that accounts for the calibration as well as the accuracy.
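Both of those metrics are easy to compute outside Spark, e.g. with scikit-learn, once you have collected the predicted probabilities; a tiny illustration with made-up numbers:

from sklearn.metrics import brier_score_loss, log_loss

# made-up binary ground truth and predicted probabilities for the positive class
y_true = [0, 1, 1, 0, 1]
y_prob = [0.1, 0.8, 0.65, 0.3, 0.9]

print(brier_score_loss(y_true, y_prob))  # mean squared error of the probabilities
print(log_loss(y_true, y_prob))          # penalises confident wrong predictions heavily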

Computation of ROC curve data points (receiver operating characteristic)

Given a particular threshold e, I am able to generate two sets of the following format:
Set<String> observedDocs;
Set<String> actualDocs;
Now I have to come up with the True Positive Rate and False Positive Rate. The TPR is easy to calculate; it's really the intuitive definition of recall, which I compute in the following manner:
private double recall(final Set<String> observedDocs, final Set<String> actualDocs) {
    // relevant AND retrieved = intersection of the two sets
    Set<String> relevantAndRetrieved = new HashSet<>(observedDocs);
    relevantAndRetrieved.retainAll(actualDocs);
    return (double) relevantAndRetrieved.size() / actualDocs.size();
}
I need some equivalent set-manipulation-based way to compute the False Positive Rate. I don't want to compute the false positive and false negative counts, etc.
Well, the FPR is the proportion of negative examples which are marked positive by the classifier, but I don't see how to express that in terms of the variables you have. How is your recall function supposed to work anyway? observedDocs and actualDocs are going to have at most 2 elements, right? Did you mean to make those List instead of Set?
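For what it's worth, the FPR only becomes expressible with set operations if you also track the full universe of candidate documents, which the question does not mention having. A sketch (in Python for brevity, with all_docs as that assumed universe):

def false_positive_rate(observed_docs, actual_docs, all_docs):
    # false positives: retrieved documents that are not actually relevant
    false_positives = observed_docs - actual_docs
    # negatives: every document in the universe that is not relevant
    negatives = all_docs - actual_docs
    return len(false_positives) / len(negatives)

def true_positive_rate(observed_docs, actual_docs):
    # recall: the fraction of relevant documents that were retrieved
    return len(observed_docs & actual_docs) / len(actual_docs)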

Computing precision and recall in Named Entity Recognition

Now I am about to report the results from Named Entity Recognition. One thing that I find a bit confusing is that my understanding of precision and recall was that one simply sums up true positives, true negatives, false positives and false negatives over all classes.
But this seems implausible now that I think about it, as each misclassification would simultaneously give rise to one false positive and one false negative (e.g. a token that should have been labelled "A" but was labelled "B" is a false negative for "A" and a false positive for "B"). Thus the number of false positives and false negatives over all classes would be the same, which would mean that precision is always equal to recall. This simply can't be true, so there is an error in my reasoning and I wonder where it is. It is certainly something quite obvious and straightforward, but it escapes me right now.
The way precision and recall are typically computed (this is what I use in my papers) is to measure entities against each other. Suppose the ground truth has the following (without any differentiation as to what type of entities they are):
[Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today
This has 3 entities.
Suppose your actual extraction has the following:
[Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]
You have an exact match for Microsoft Corp., false positives for CEO and today, a false negative for Windows 7, and a substring match for Steve.
We compute precision and recall by first defining matching criteria. For example, do they have to be an exact match? Is it a match if they overlap at all? Do entity types matter? Typically we want to provide precision and recall for several of these criteria.
Exact match: True Positives = 1 (Microsoft Corp., the only exact match), False Positives = 3 (CEO, today, and Steve, which isn't an exact match), False Negatives = 2 (Steve Ballmer and Windows 7)
Precision = True Positives / (True Positives + False Positives) = 1/(1+3) = 0.25
Recall = True Positives / (True Positives + False Negatives) = 1/(1+2) = 0.33
Any overlap OK: True Positives = 2 (Microsoft Corp., and Steve, which overlaps Steve Ballmer), False Positives = 2 (CEO and today), False Negatives = 1 (Windows 7)
Precision = True Positives / (True Positives + False Positives) = 2/(2+2) = 0.50
Recall = True Positives / (True Positives + False Negatives) = 2/(2+1) = 0.66
The reader is then left to infer that the "real performance" (the precision and recall that an unbiased human checker would give when allowed to use human judgement to decide which overlap discrepancies are significant, and which are not) is somewhere between the two.
It's also often useful to report the F1 measure, which is the harmonic mean of precision and recall, and which gives some idea of "performance" when you have to trade off precision against recall.
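If it helps, the exact-match numbers above can be reproduced in a few lines; this is a toy sketch in which entities are written as plain strings purely for illustration:

# Toy sketch: exact-match precision/recall/F1 from gold and predicted entity sets.
gold = {"Microsoft Corp.", "Steve Ballmer", "Windows 7"}
pred = {"Microsoft Corp.", "CEO", "Steve", "today"}

tp = len(gold & pred)   # 1  (Microsoft Corp.)
fp = len(pred - gold)   # 3  (CEO, Steve, today)
fn = len(gold - pred)   # 2  (Steve Ballmer, Windows 7)

precision = tp / (tp + fp)                           # 0.25
recall = tp / (tp + fn)                              # 0.33
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean, about 0.29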
In the CoNLL-2003 NER task, the evaluation was based on correctly marked entities, not tokens, as described in the paper 'Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition'. An entity is correctly marked if the system identifies an entity of the correct type with the correct start and end point in the document. I prefer this approach in evaluation because it's closer to a measure of performance on the actual task; a user of the NER system cares about entities, not individual tokens.
However, the problem you described still exists. If you mark an entity of type ORG with type LOC you incur a false positive for LOC and a false negative for ORG. There is an interesting discussion on the problem in this blog post.
As mentioned before, there are different ways of measuring NER performance. It is possible to evaluate separately how precisely entities are detected in terms of position in the text, and in terms of their class (person, location, organization, etc.). Or to combine both aspects in a single measure.
You'll find a nice review in the following thesis: D. Nadeau, Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision (2007). Have a look at section 2.6. Evaluation of NER.
There is no simple right answer to this question. There are a variety of different ways to count errors. The MUC competitions used one, other people have used others.
However, to help you with your immediate confusion:
You have a set of tags, no? Something like NONE, PERSON, ANIMAL, VEGETABLE?
If a token should be PERSON and you tag it NONE, then that's a false positive for NONE and a false negative for PERSON. If a token should be NONE and you tag it PERSON, it's the other way around.
So you get a score for each entity type.
You can also aggregate those scores.
Just to be clear, these are the definitions:
Precision = TP/(TP+FP) = What portion of what you found was ground truth?
Recall = TP/(TP+FN) = What portion of the ground truth did you recover?
They won't necessarily always be equal, since the number of false negatives will not necessarily equal the number of false positives.
If I understand your problem right, you're assigning each token to one of more than two possible labels. In order for precision and recall to make sense, you need to have a binary classifier. So you could use precision and recall if you phrased the classifier as whether a token is in Group "A" or not, and then repeat for each group. In this case a missed classification would count twice as a false negative for one group and a false positive for another.
If you're doing a classification like this where it isn't binary (assigning each token to a group), it might be useful instead to look at pairs of tokens. Phrase your problem as "Are tokens X and Y in the same classification group?". This allows you to compute precision and recall over all pairs of tokens, as in the sketch below. This isn't as appropriate if your classification groups are labeled or have associated meanings. For example, if your classification groups are "Fruits" and "Vegetables" and you classify both "Apples" and "Oranges" as "Vegetables", then this algorithm would score it as a true positive even though the wrong group was assigned. But if your groups are unlabeled, for example "A" and "B", then if apples and oranges were both classified as "A", you could afterwards say that "A" corresponds to "Fruits".
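A rough sketch of that pair-counting idea (my own illustration, not a standard library routine): treat every pair of tokens as the binary question "same group or not" and count agreements.

from itertools import combinations

def pairwise_precision_recall(true_groups, pred_groups):
    # true_groups / pred_groups: dicts mapping each token to its group label
    tp = fp = fn = 0
    for a, b in combinations(true_groups, 2):
        same_true = true_groups[a] == true_groups[b]
        same_pred = pred_groups[a] == pred_groups[b]
        if same_pred and same_true:
            tp += 1
        elif same_pred and not same_true:
            fp += 1
        elif same_true and not same_pred:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# "Apples" and "Oranges" ending up in the same predicted group is a pair-level true positive,
# regardless of what that group is called.
true_groups = {"Apples": "Fruits", "Oranges": "Fruits", "Carrots": "Vegetables"}
pred_groups = {"Apples": "A", "Oranges": "A", "Carrots": "A"}
print(pairwise_precision_recall(true_groups, pred_groups))   # (0.33..., 1.0)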
If you are training a spaCy NER model, you can use its scorer.py API, which gives you the precision, recall and F1-score of your NER.
The code and output would be in the following format (adapted from spaCy/scorer.py):
import spacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer

def evaluate(ner_model, examples):
    scorer = Scorer()
    for input_, annot in examples:
        doc_gold_text = ner_model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot)
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores

# example run
examples = [
    ('Who is Shaka Khan?',
     [(7, 17, 'PERSON')]),
    ('I like London and Berlin.',
     [(7, 13, 'LOC'), (18, 24, 'LOC')])
]

ner_model = spacy.load(ner_model_path)  # for spaCy's pretrained model use 'en_core_web_sm'
results = evaluate(ner_model, examples)
The output will be in a format like:
{'uas': 0.0, 'las': 0.0, 'ents_p': 43.75, 'ents_r': 35.59322033898305, 'ents_f': 39.252336448598136, 'tags_acc': 0.0, 'token_acc': 100.0}
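Note that the snippet above targets spaCy v2; in spaCy v3 GoldParse and spacy.gold were removed. A rough v3 equivalent, assuming the same (text, entity-offsets) example format, might look like the following (check the current spaCy docs before relying on it):

import spacy
from spacy.training import Example
from spacy.scorer import Scorer

def evaluate_v3(ner_model, examples):
    scorer = Scorer()
    example_objs = []
    for text, annot in examples:
        pred_doc = ner_model(text)                     # predicted annotations
        ref = {"entities": annot}                      # gold entity offsets
        example_objs.append(Example.from_dict(pred_doc, ref))
    return scorer.score(example_objs)                  # includes 'ents_p', 'ents_r', 'ents_f'

If I remember correctly, ner_model.evaluate(example_objs) returns a similar score dictionary and may be the simpler option.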
