Computation of ROC curve data points (receiver operating characteristic) - statistics

Given a particular threshold e, I am able to generate two sets of the following form:
Set<String> observedDocs;
Set<String> actualDocs;
Now I have to come up with the True Positive Rate and the False Positive Rate. The TPR is easy to calculate: it is just recall, which has a very intuitive definition and which I compute in the following manner:
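private double recall(final Set<String> observedDocs, final Set<String> actualDocs) {
    // Intersection: documents that are both retrieved and relevant (true positives).
    Set<String> relevantAndRetrieved = new HashSet<>(observedDocs);
    relevantAndRetrieved.retainAll(actualDocs);
    // Cast to double to avoid integer division.
    return (double) relevantAndRetrieved.size() / actualDocs.size();
}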
I need some equivalent set-manipulation-based way to compute the False Positive Rate. I don't want to compute the False Positive and False Negative counts etc. explicitly.

Well, the FPR is the proportion of negative examples that the classifier marks as positive. But I don't see how to express that in terms of the variables you have. How is your recall function working anyway? observedLabels and actualLabels are going to have at most 2 elements, right? Did you mean to make those List instead of Set?
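The definition above can be written with set operations if one more set is available: the full collection of documents, so that the negatives can be obtained as a set difference. The following is only a sketch under that assumption; the `universe` argument and the function name `tpr_fpr` are hypothetical and do not appear in the question.

```python
def tpr_fpr(observed, actual, universe):
    """Return (TPR, FPR) for one threshold using only set operations.

    observed: documents the classifier marked positive
    actual:   documents that are truly positive
    universe: every document that was scored (assumed to contain both sets)
    """
    negatives = universe - actual          # truly negative documents
    true_positives = observed & actual     # marked positive and truly positive
    false_positives = observed - actual    # marked positive but truly negative
    tpr = len(true_positives) / len(actual) if actual else 0.0
    fpr = len(false_positives) / len(negatives) if negatives else 0.0
    return tpr, fpr


all_docs = {"d1", "d2", "d3", "d4", "d5"}
actual_docs = {"d1", "d2"}
observed_docs = {"d1", "d3"}
tpr, fpr = tpr_fpr(observed_docs, actual_docs, all_docs)
```

Repeating this for each threshold gives one (FPR, TPR) point of the ROC curve per threshold.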

Related

What to pass as threshold for Naive Bayes Classifier in Pyspark?

I'm trying to make a ROC curve for my model while using a Naive Bayes Classifier. To do this, I need to change the value of the threshold for my classifier. The way I interpreted it, a list must be passed with the value for the threshold of each category. So if I had two categories, and t is the threshold I want to set (0 <= t <= 1), then I would have to pass a list like this: [1-t, t].
Anyway, when I tried plotting the ROC curve, I got this:
Given the result, I suspected that my idea about the threshold might be wrong, so I went to check the documentation for the Naive Bayes Classifier. But when I finally found an example, I didn't understand what the criterion for the parameter is:
nb = nb.setThresholds([0.01, 10.00])
Does anyone know what must be passed as the threshold? Suppose I want the threshold to be set at 0.7 (if the probability is over 0.7 I want the prediction to be 1); what should I pass to the threshold parameter?
As it says in pyspark.ml's documentation for NaiveBayes under the thresholds parameter:
The class with largest value p/t is predicted, where p is the original
probability of that class and t is the class's threshold.
Therefore, thresholds can be thought of as handicaps on the probabilities. To keep it simple, in the case of binary classification, you can set the two thresholds to values in the range [0, 1] that sum to 1, i.e. [1 - T, T]. This will get you the desired rule of "Classify as True if the probability is over threshold T, otherwise classify as False".
For your specific ask of a 0.7 probability threshold, this would look like:
nb = nb.setThresholds([0.3, 0.7])
assuming that the first entry is the threshold for False and the second entry is the threshold for True. Using these thresholds, the model would classify an example with False and True probabilities p_false and p_true by taking the greater value out of [p_false/0.3, p_true/0.7].
You can technically set the thresholds to any value. Just remember that the probability for class X will be divided by its respective threshold and compared against the other adjusted probabilities for the other classes.
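To see why thresholds of [1 - T, T] reproduce the "probability over T" rule, the p/t comparison quoted from the documentation can be re-implemented in a few lines of plain Python. This is only an illustrative sketch of that rule, not PySpark code; the function name `predict` is hypothetical, and zero thresholds are not handled.

```python
def predict(probs, thresholds):
    """Return the index of the class with the largest p/t ratio."""
    scaled = [p / t for p, t in zip(probs, thresholds)]
    # max() returns the first index on ties, so ties go to the lower class.
    return max(range(len(scaled)), key=scaled.__getitem__)


T = 0.7
thresholds = [1 - T, T]  # [threshold for False, threshold for True]

predictions = {}
for p_true in (0.5, 0.69, 0.71, 0.9):
    p_false = 1 - p_true
    predictions[p_true] = predict([p_false, p_true], thresholds)
```

With these thresholds, class 1 wins exactly when p_true / T > (1 - p_true) / (1 - T), which rearranges to p_true > T.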

Multiclass semantic segmentation model evaluation

I am doing a project on multiclass semantic segmentation. I have built a model that outputs pretty decent segmented images by decreasing the loss value. However, I cannot evaluate the model's performance with metrics such as mean IoU or the Dice coefficient.
In case of binary semantic segmentation it was easy just to set the threshold of 0.5, to classify the outputs as an object or background, but it does not work in the case of multiclass semantic segmentation. Could you please tell me how to obtain model performance on the aforementioned metrics? Any help will be highly appreciated!
By the way, I am using PyTorch framework and CamVid dataset.
If anyone is interested in this answer, please also look at this issue. The author of the issue points out that mIoU can be computed in a different way (and that method is more accepted in literature). So, consider that before using the implementation for any formal publication.
Basically, the other method suggested by the issue-poster is to separately accumulate the intersections and unions over the entire dataset and divide them at the final step. The method in the below original answer computes intersection and union for a batch of images, then divides them to get IoU for the current batch, and then takes a mean of the IoUs over the entire dataset.
However, the original method given below is problematic because the final mean IoU varies with the batch size. With the method mentioned in the issue, the mIoU does not vary with the batch size, because the separate accumulation makes the batch size irrelevant (though a larger batch size can definitely help speed up the evaluation).
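A minimal sketch of the accumulation idea described above, written with NumPy arrays of class indices rather than PyTorch tensors. The function name `dataset_miou` is hypothetical, and two choices are assumptions of this sketch: pixels whose label is outside `0..num_classes-1` are ignored, and a class counts as present if it appears in either the labels or the predictions anywhere in the dataset.

```python
import numpy as np


def dataset_miou(batches, num_classes):
    """Mean IoU with intersections and unions accumulated over all batches.

    batches: iterable of (label, pred) integer arrays of equal shape, where
             pred already holds argmax class indices.
    """
    intersections = np.zeros(num_classes, dtype=np.int64)
    unions = np.zeros(num_classes, dtype=np.int64)
    for label, pred in batches:
        label = np.asarray(label).ravel()
        pred = np.asarray(pred).ravel()
        valid = (label >= 0) & (label < num_classes)  # drop ignore_index pixels
        for c in range(num_classes):
            pred_c = (pred == c) & valid
            label_c = (label == c) & valid
            intersections[c] += np.logical_and(pred_c, label_c).sum()
            unions[c] += np.logical_or(pred_c, label_c).sum()
    present = unions > 0  # classes that occur somewhere in the dataset
    # The division happens only once, after the whole dataset has been seen.
    return float(np.mean(intersections[present] / unions[present]))


example_batches = [
    (np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1])),
    (np.array([0, 0]), np.array([0, 0])),
]
miou = dataset_miou(example_batches, num_classes=2)
```

Because only the per-class counts are carried between batches, splitting the same images into different batch sizes gives the same result.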
Original answer:
Given below is an implementation of mean IoU (Intersection over Union) in PyTorch.
import numpy as np
import torch
import torch.nn.functional as F

def mIOU(label, pred, num_classes=19):
    pred = F.softmax(pred, dim=1)
    pred = torch.argmax(pred, dim=1).squeeze(1)
    iou_list = list()
    present_iou_list = list()

    pred = pred.view(-1)
    label = label.view(-1)
    # Note: Following for loop goes from 0 to (num_classes-1)
    # and ignore_index is num_classes, thus ignore_index is
    # not considered in computation of IoU.
    for sem_class in range(num_classes):
        pred_inds = (pred == sem_class)
        target_inds = (label == sem_class)
        if target_inds.long().sum().item() == 0:
            # Class absent from the target: mark as NaN and skip it in the mean.
            iou_now = float('nan')
        else:
            intersection_now = (pred_inds[target_inds]).long().sum().item()
            union_now = pred_inds.long().sum().item() + target_inds.long().sum().item() - intersection_now
            iou_now = float(intersection_now) / float(union_now)
            present_iou_list.append(iou_now)
        iou_list.append(iou_now)
    return np.mean(present_iou_list)
The prediction of your model will contain one score per class at each pixel, so first take softmax (if your model doesn't already) followed by argmax to get the index of the class with the highest probability at each pixel. Then, we calculate the IoU for each class (and take the mean over the classes at the end).
We can reshape both the prediction and the label as 1-D vectors (I read that it makes the computation faster). For each class, we first identify the indices of that class using pred_inds = (pred == sem_class) and target_inds = (label == sem_class). The resulting pred_inds and target_inds will have 1 at pixels labelled as that particular class while 0 for any other class.
Then, there is a possibility that the target does not contain that particular class at all. This makes that class's IoU calculation invalid, as it is not present in the target. So, you assign such classes a NaN IoU (so you can identify them later) and do not include them in the calculation of the mean.
If the particular class is present in the target, then pred_inds[target_inds] will give a vector of 1s and 0s where indices with 1 are those where prediction and target are equal and zero otherwise. Taking the sum of all elements of this will give us the intersection.
If we add all the elements of pred_inds and target_inds, we'll get the union + intersection of pixels of that particular class. So, we subtract the already calculated intersection to get the union. Then, we can divide the intersection and union to get the IoU of that particular class and add it to a list of valid IoUs.
At the end, you take the mean of the entire list to get the mIoU. If you want the Dice Coefficient, you can calculate it in a similar fashion.
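As a rough sketch of the "similar fashion" mentioned above, the per-class Dice coefficient can be computed as 2 * intersection / (predicted pixels + target pixels). The example below uses NumPy arrays of class indices instead of PyTorch tensors to stay self-contained; the function name `mean_dice` is hypothetical, and classes absent from the target are skipped, mirroring the IoU code.

```python
import numpy as np


def mean_dice(label, pred, num_classes):
    """Mean Dice coefficient over the classes present in the target."""
    label = np.asarray(label).ravel()
    pred = np.asarray(pred).ravel()
    scores = []
    for c in range(num_classes):
        pred_c = pred == c
        label_c = label == c
        if label_c.sum() == 0:
            continue  # class not present in the target: skip it
        intersection = np.logical_and(pred_c, label_c).sum()
        scores.append(2.0 * intersection / (pred_c.sum() + label_c.sum()))
    return float(np.mean(scores)) if scores else float("nan")


dice = mean_dice([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

The same caveat about batch-wise averaging applies here: accumulating the counts over the whole dataset before dividing avoids the dependence on batch size.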

Modelica Time Dependent Equations

I am new to Modelica, and I am wondering if it is possible to write a kind of dynamic programming equation. Assume time is discretized by an integer i, and in my specific application x is boolean and f is a boolean function of x.
x(t_i) = f(x(t_{i+d}))
Where d can be a positive or negative integer. Of course, I would initialize x accordingly, either true or false.
Any help or references would be greatly appreciated!
It is possible. In Modelica the discretization in time is usually carried out by the compiler; you only have to take care of the equations (continuous dynamics). If you want to generate events at discrete time points, you can do so using when statements.
I suggest you take a look at Introduction to Object-Oriented Modeling and Simulation with OpenModelica (PDF format, 6.6 MB), a more recent tutorial (2012) by Peter Fritzson. It has a section on Discrete Events and Hybrid Systems that should clarify how to implement your equations in Modelica.
Below is an example from that tutorial, the model of a bouncing ball. As you can see, discretization in time is not considered when you write the dynamic equations: the continuous model of the ball is v=der(s), a=der(v), and the discrete part, inside the when clause, handles the contact with the ground:
model BouncingBall "the bouncing ball model"
  parameter Real g=9.81; //gravitational acc.
  parameter Real c=0.90; //elasticity constant
  Real height(start=10),velocity(start=0);
equation
  der(height) = velocity;
  der(velocity)=-g;
  when height<0 then
    reinit(velocity, -c*velocity);
  end when;
end BouncingBall;
Hope this helps,
Marco
If I understand your question, you want to use the last n evaluations of x to determine the next value of x. If so, this code shows how to do this:
model BooleanHistory
  parameter Integer n=10 "How many points to keep";
  parameter Modelica.SIunits.Time dt=1e-3;
protected
  Boolean x[n];
  function f
    input Integer n;
    input Boolean past[n-1];
    output Boolean next;
  algorithm
    next :=not past[1]; // Example
  end f;
initial equation
  x = {false for i in 1:n};
equation
  when sample(0,dt) then
    x[2:n] = pre(x[1:(n-1)]);
    x[1] = f(n, x[2:n]);
  end when;
end BooleanHistory;
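For readers who want to check what the when clause does at each sample instant, here is a small Python sketch of the same update: the history is shifted by one position and the new first element is computed from the shifted values. It only mimics the update rule and is not a Modelica simulation; the names `f` and `step` are chosen for illustration.

```python
def f(past):
    """Example rule from the model: negate the most recent value."""
    return not past[0]


def step(x):
    """One sample instant: shift the history and compute the new first value."""
    shifted = x[:-1]               # corresponds to x[2:n] = pre(x[1:(n-1)])
    return [f(shifted)] + shifted  # corresponds to x[1] = f(n, x[2:n])


x = [False] * 4  # initial equation: all entries false (n = 4 here)
history = []
for _ in range(3):
    x = step(x)
    history.append(x)
```

With this example rule the first element simply toggles at every sample, while older values move one position down the array.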

Computing precision and recall in Named Entity Recognition

Now I am about to report the results from Named Entity Recognition. One thing that I find a bit confusing is that my understanding of precision and recall was that one simply sums up true positives, true negatives, false positives and false negatives over all classes.
But this seems implausible now that I think of it as each misclassification would give simultaneously rise to one false positive and one false negative (e.g. a token that should have been labelled as "A" but was labelled as "B" is a false negative for "A" and false positive for "B"). Thus the number of the false positives and the false negatives over all classes would be the same which means that precision is (always!) equal to recall. This simply can't be true so there is an error in my reasoning and I wonder where it is. It is certainly something quite obvious and straight-forward but it escapes me right now.
The way precision and recall are typically computed (this is what I use in my papers) is to measure entities against each other. Suppose the ground truth has the following (without any differentiation as to what type of entities they are):
[Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today
This has 3 entities.
Supposing your actual extraction has the following
[Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]
You have an exact match for Microsoft Corp., false positives for CEO and today, a false negative for Windows 7, and a substring match for Steve.
We compute precision and recall by first defining matching criteria. For example, do they have to be an exact match? Is it a match if they overlap at all? Do entity types matter? Typically we want to provide precision and recall for several of these criteria.
Exact match: True Positives = 1 (Microsoft Corp., the only exact match), False Positives = 3 (CEO, today, and Steve, which isn't an exact match), False Negatives = 2 (Steve Ballmer and Windows 7)
Precision = True Positives / (True Positives + False Positives) = 1/(1+3) = 0.25
Recall = True Positives / (True Positives + False Negatives) = 1/(1+2) = 0.33
Any Overlap OK: True Positives = 2 (Microsoft Corp., and Steve, which overlaps Steve Ballmer), False Positives = 2 (CEO and today), False Negatives = 1 (Windows 7)
Precision = True Positives / (True Positives + False Positives) = 2/(2+2) = 0.50
Recall = True Positives / (True Positives + False Negatives) = 2/(2+1) = 0.67
The reader is then left to infer that the "real performance" (the precision and recall that an unbiased human checker would give when allowed to use human judgement to decide which overlap discrepancies are significant, and which are not) is somewhere between the two.
It's also often useful to report the F1 measure, which is the harmonic mean of precision and recall, and which gives some idea of "performance" when you have to trade off precision against recall.
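The two matching criteria above can be sketched in Python as follows. Entities are represented as (start, end) token spans with an exclusive end; this representation, the token positions, and the function names are assumptions made for the illustration, not part of any particular NER toolkit.

```python
def exact(a, b):
    """Spans match only if they are identical."""
    return a == b


def overlaps(a, b):
    """Spans match if they share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def evaluate(gold, pred, match):
    """Return (precision, recall, F1) for the given matching criterion."""
    tp = sum(any(match(p, g) for g in gold) for p in pred)
    fp = len(pred) - tp
    fn = sum(not any(match(p, g) for p in pred) for g in gold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Tokens: Microsoft(0) Corp.(1) CEO(2) Steve(3) Ballmer(4) announced(5)
#         the(6) release(7) of(8) Windows(9) 7(10) today(11)
gold_spans = [(0, 2), (3, 5), (9, 11)]          # ground-truth entities
pred_spans = [(0, 2), (2, 3), (3, 4), (11, 12)]  # extracted entities

exact_scores = evaluate(gold_spans, pred_spans, exact)
overlap_scores = evaluate(gold_spans, pred_spans, overlaps)
```

Here `exact_scores` reproduces the 0.25 / 0.33 figures and `overlap_scores` the 0.50 / 0.67 figures, with F1 computed as the harmonic mean of the two.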
In the CoNLL-2003 NER task, the evaluation was based on correctly marked entities, not tokens, as described in the paper 'Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition'. An entity is correctly marked if the system identifies an entity of the correct type with the correct start and end point in the document. I prefer this approach in evaluation because it's closer to a measure of performance on the actual task; a user of the NER system cares about entities, not individual tokens.
However, the problem you described still exists. If you mark an entity of type ORG with type LOC you incur a false positive for LOC and a false negative for ORG. There is an interesting discussion on the problem in this blog post.
As mentioned before, there are different ways of measuring NER performance. It is possible to evaluate separately how precisely entities are detected in terms of position in the text, and in terms of their class (person, location, organization, etc.). Or to combine both aspects in a single measure.
You'll find a nice review in the following thesis: D. Nadeau, Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision (2007). Have a look at section 2.6. Evaluation of NER.
There is no simple right answer to this question. There are a variety of different ways to count errors. The MUC competitions used one, other people have used others.
However, to help you with your immediate confusion:
You have a set of tags, no? Something like NONE, PERSON, ANIMAL, VEGETABLE?
If a token should be person, and you tag it NONE, then that's a false positive for NONE and a false negative for PERSON. If a token should be NONE and you tag it PERSON, it's the other way around.
So you get a score for each entity type.
You can also aggregate those scores.
Just to be clear, these are the definitions:
Precision = TP/(TP+FP) = What portion of what you found was ground truth?
Recall = TP/(TP+FN) = What portion of the ground truth did you recover?
They won't necessarily always be equal, since the number of false negatives will not necessarily equal the number of false positives.
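A small token-level illustration of the definitions above: when the counts are kept per tag, the false positives and false negatives of a single tag need not balance, so its precision and recall can differ. The tag sequences and the function name `per_class_scores` are made up for this sketch.

```python
def per_class_scores(gold, pred):
    """Return {tag: (precision, recall)} computed from aligned token tags."""
    scores = {}
    for tag in sorted(set(gold) | set(pred)):
        tp = sum(g == tag and p == tag for g, p in zip(gold, pred))
        fp = sum(g != tag and p == tag for g, p in zip(gold, pred))
        fn = sum(g == tag and p != tag for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        scores[tag] = (precision, recall)
    return scores


gold_tags = ["PERSON", "PERSON", "PERSON", "NONE"]
pred_tags = ["PERSON", "NONE", "NONE", "NONE"]
scores = per_class_scores(gold_tags, pred_tags)
```

In this example PERSON has precision 1.0 and recall 1/3, while NONE has precision 1/3 and recall 1.0: each mistake is a false negative for one tag and a false positive for the other, but never both for the same tag.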
If I understand your problem right, you're assigning each token to one of more than two possible labels. In order for precision and recall to make sense, you need to have a binary classifier. So you could use precision and recall if you phrased the classifier as whether a token is in Group "A" or not, and then repeat for each group. In this case a missed classification would count twice as a false negative for one group and a false positive for another.
If you're doing a classification like this where it isn't binary (assigning each token to a group), it might be useful instead to look at pairs of tokens. Phrase your problem as "Are tokens X and Y in the same classification group?". This allows you to compute precision and recall over all pairs of tokens. This isn't as appropriate if your classification groups are labeled or have associated meanings. For example, if your classification groups are "Fruits" and "Vegetables", and you classify both "Apples" and "Oranges" as "Vegetables", then this algorithm would score it as a true positive even though the wrong group was assigned. But if your groups are unlabeled, for example "A" and "B", then if apples and oranges were both classified as "A", afterward you could say that "A" corresponds to "Fruits".
If you are training a spaCy NER model, its scorer.py API gives you the precision, recall and F1 score of your NER model.
The code and output look like this:
For those having the same question, see the following link:
spaCy/scorer.py
```python
import spacy
from spacy.gold import GoldParse  # spaCy v2 API
from spacy.scorer import Scorer


def evaluate(ner_model, examples):
    scorer = Scorer()
    for input_, annot in examples:
        doc_gold_text = ner_model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot)
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores


# example run
examples = [
    ('Who is Shaka Khan?',
     [(7, 17, 'PERSON')]),
    ('I like London and Berlin.',
     [(7, 13, 'LOC'), (18, 24, 'LOC')])
]

ner_model = spacy.load(ner_model_path)  # for spaCy's pretrained use 'en_core_web_sm'
results = evaluate(ner_model, examples)
```
Output will be in format like:-
{'uas': 0.0, 'las': 0.0, 'ents_p': 43.75, 'ents_r': 35.59322033898305, 'ents_f': 39.252336448598136, 'tags_acc': 0.0, 'token_acc': 100.0}

Is it possible to do an algebraic curve fit with just a single pass of the sample data?

I would like to do an algebraic curve fit of 2D data points, but for various reasons it isn't really possible to hold much of the sample data in memory at once, and iterating through all of it is an expensive process.
(The reason for this is that actually I need to fit thousands of curves simultaneously based on gigabytes of data which I'm reading off disk, and which is therefore sloooooow).
Note that the number of polynomial coefficients will be limited (perhaps 5-10), so an exact fit will be extremely unlikely, but this is ok as I'm trying to find an underlying pattern in data with a lot of random noise.
I understand how one can use a genetic algorithm to fit a curve to a dataset, but this requires many passes through the sample data, and thus isn't practical for my application.
Is there a way to fit a curve with a single pass of the data, where the state that must be maintained from sample to sample is minimal?
I should add that the nature of the data is that the points may lie anywhere on the X axis between 0.0 and 1.0, but the Y values will always be either 1.0 or 0.0.
So, in Java, I'm looking for a class with the following interface:
public interface CurveFit {
public void addData(double x, double y);
public List<Double> getBestFit(); // Returns the polynomial coefficients
}
The class that implements this must not need to keep much data in its instance fields, no more than a kilobyte even for millions of data points. This means that you can't just store the data as you get it to do multiple passes through it later.
edit: Some have suggested that finding an optimal curve in a single pass may be impossible, however an optimal fit is not required, just as close as we can get it in a single pass.
The bare bones of an approach might be if we have a way to start with a curve, and then a way to modify it to get it slightly closer to new data points as they come in - effectively a form of gradient descent. It is hoped that with sufficient data (and the data will be plentiful), we get a pretty good curve. Perhaps this inspires someone to a solution.
Yes, it is a projection. For
y = X beta + error
where lowercased terms are vectors, and X is a matrix, you have the solution vector
\hat{beta} = inverse(X'X) X' y
as per the OLS page. You almost never want to compute this directly but rather use LR, QR or SVD decompositions. References are plentiful in the statistics literature.
If your problem has only one parameter (and x is hence a vector as well) then this reduces to just summation of cross-products between y and x.
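A minimal sketch of how this can be done in a single pass for a polynomial fit: only the small matrix X'X and the vector X'y are accumulated as points arrive, and the system is solved at the end. The class name `StreamingPolyFit` and its methods are hypothetical and merely mirror the interface in the question; the example assumes NumPy and solves the accumulated system with `numpy.linalg.lstsq` rather than an explicit inverse.

```python
import numpy as np


class StreamingPolyFit:
    """Single-pass least-squares polynomial fit via accumulated X'X and X'y."""

    def __init__(self, degree):
        self.n_coef = degree + 1
        self.xtx = np.zeros((self.n_coef, self.n_coef))  # running X'X
        self.xty = np.zeros(self.n_coef)                 # running X'y

    def add_data(self, x, y):
        row = float(x) ** np.arange(self.n_coef)  # [1, x, x^2, ...]
        self.xtx += np.outer(row, row)
        self.xty += row * y

    def best_fit(self):
        """Return coefficients ordered from the constant term upwards."""
        coef, *_ = np.linalg.lstsq(self.xtx, self.xty, rcond=None)
        return coef.tolist()


fit = StreamingPolyFit(degree=2)
for x in np.linspace(0.0, 1.0, 11):
    fit.add_data(x, 1.0 + 2.0 * x - 3.0 * x * x)  # noiseless test data
coefficients = fit.best_fit()
```

The state is (degree + 1)^2 + (degree + 1) numbers regardless of how many points are added, for example 110 doubles for 10 coefficients. The usual caveat applies: forming X'X squares the condition number, so this can become inaccurate for higher polynomial degrees.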
If you don't mind that you'll get a straight line "curve", then you only need six variables for any amount of data. Here's the source code that's going into my upcoming book; I'm sure that you can figure out how the DataPoint class works:
Interpolation.h:
#ifndef __INTERPOLATION_H
#define __INTERPOLATION_H

#include "DataPoint.h"

class Interpolation
{
private:
    int m_count;
    double m_sumX;
    double m_sumXX; /* sum of X*X */
    double m_sumXY; /* sum of X*Y */
    double m_sumY;
    double m_sumYY; /* sum of Y*Y */

public:
    Interpolation();
    void addData(const DataPoint& dp);
    double slope() const;
    double intercept() const;
    double interpolate(double x) const;
    double correlate() const;
};

#endif // __INTERPOLATION_H
Interpolation.cpp:
#include <cmath>
#include "Interpolation.h"

Interpolation::Interpolation()
{
    m_count = 0;
    m_sumX = 0.0;
    m_sumXX = 0.0;
    m_sumXY = 0.0;
    m_sumY = 0.0;
    m_sumYY = 0.0;
}

void Interpolation::addData(const DataPoint& dp)
{
    m_count++;
    m_sumX += dp.getX();
    m_sumXX += dp.getX() * dp.getX();
    m_sumXY += dp.getX() * dp.getY();
    m_sumY += dp.getY();
    m_sumYY += dp.getY() * dp.getY();
}

double Interpolation::slope() const
{
    return (m_sumXY - (m_sumX * m_sumY / m_count)) /
           (m_sumXX - (m_sumX * m_sumX / m_count));
}

double Interpolation::intercept() const
{
    return (m_sumY / m_count) - slope() * (m_sumX / m_count);
}

double Interpolation::interpolate(double X) const
{
    return intercept() + slope() * X;
}

double Interpolation::correlate() const
{
    // Pearson correlation coefficient, using the centered sums.
    const double covXY = m_sumXY - (m_sumX * m_sumY / m_count);
    const double varX = m_sumXX - (m_sumX * m_sumX / m_count);
    const double varY = m_sumYY - (m_sumY * m_sumY / m_count);
    return covXY / std::sqrt(varX * varY);
}
Why not use a ring buffer of some fixed size (say, the last 1000 points) and do a standard QR decomposition-based least squares fit to the buffered data? Once the buffer fills, each time you get a new point you replace the oldest and re-fit. That way you have a bounded working set that still has some data locality, without all the challenges of live stream (memoryless) processing.
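A rough sketch of this ring-buffer idea, assuming NumPy: a `collections.deque` with `maxlen` drops the oldest point automatically, and the fit is recomputed from the buffered points with a QR factorization. The class name `RingBufferFit` and the default capacity are illustrative only, and the fit needs at least degree + 1 distinct x values in the buffer.

```python
from collections import deque

import numpy as np


class RingBufferFit:
    """Least-squares polynomial fit over the most recent `capacity` points."""

    def __init__(self, degree, capacity=1000):
        self.degree = degree
        self.points = deque(maxlen=capacity)  # oldest point is dropped when full

    def add_data(self, x, y):
        self.points.append((x, y))

    def best_fit(self):
        """Return coefficients ordered from the constant term upwards."""
        xs, ys = zip(*self.points)
        a = np.vander(xs, self.degree + 1, increasing=True)  # columns 1, x, x^2, ...
        q, r = np.linalg.qr(a)                               # reduced QR of A
        return np.linalg.solve(r, q.T @ np.asarray(ys)).tolist()


buffered = RingBufferFit(degree=1, capacity=5)
for x in (0.0, 0.1, 0.2, 0.3, 0.4):
    buffered.add_data(x, 0.0)            # old points, pushed out below
for x in (0.5, 0.6, 0.7, 0.8, 0.9):
    buffered.add_data(x, 2.0 + 3.0 * x)  # the five most recent points
line = buffered.best_fit()
```

Only the buffered points influence the result, so the fit tracks recent data and forgets older samples; whether that is acceptable depends on how stationary the data are.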
Are you limiting the number of polynomial coefficients (i.e. fitting to a max power of x in your polynomial)?
If not, then you don't need a "best fit" algorithm - you can always fit N data points EXACTLY to a polynomial of N coefficients (provided the x values are distinct).
Just use matrices to solve N simultaneous equations for N unknowns (the N coefficients of the polynomial).
If you are limiting to a max number of coefficients, what is your max?
Following your comments and edit:
What you want is a low-pass filter to filter out noise, not fit a polynomial to the noise.
Given the nature of your data:
the points may lie anywhere on the X axis between 0.0 and 1.0, but the Y values will always be either 1.0 or 0.0.
Then you don't need even a single pass, as these two lines will pass exactly through every point:
X = [0.0 ... 1.0], Y = 0.0
X = [0.0 ... 1.0], Y = 1.0
Two short line segments, unit length, and every point falls on one line or the other.
Admittedly, an algorithm to find a good curve fit for arbitrary points in a single pass is interesting, but (based on your question), that's not what you need.
Assuming that you don't know which point should belong to which curve, something like a Hough Transform might provide what you need.
The Hough Transform is a technique that allows you to identify structure within a data set. One use is for computer vision, where it allows easy identification of lines and borders within the field of sight.
Advantages for this situation:
Each point need be considered only once
You don't need to keep a data structure for each candidate line, just one (complex, multi-dimensional) structure
Processing of each line is simple
You can stop at any point and output a set of good matches
You never discard any data, so it's not reliant on any accidental locality of references
You can trade off between accuracy and memory requirements
Isn't limited to exact matches, but will highlight partial matches too.
An approach
To find cubic fits, you'd construct a 4-dimensional Hough space, into which you'd project each of your data-points. Hotspots within Hough space would give you the parameters for the cubic through those points.
You need the solution to an overdetermined linear system. The popular methods are Normal Equations (not usually recommended), QR factorization, and singular value decomposition (SVD). Wikipedia has decent explanations, Trefethen and Bau is very good. Your options:
1. Out-of-core implementation via the normal equations. This requires the product A'A where A has many more rows than columns (so the result is very small). The matrix A is completely defined by the sample locations so you don't have to store it, thus computing A'A is reasonably cheap (very cheap if you don't need to hit memory for the node locations). Once A'A is computed, you get the solution in one pass through your input data, but the method can be unstable.
2. Implement an out-of-core QR factorization. Classical Gram-Schmidt will be fastest, but you have to be careful about stability.
3. Do it in-core with distributed memory (if you have the hardware available). Libraries like PLAPACK and SCALAPACK can do this; the performance should be much better than option 1. The parallel scalability is not fantastic, but will be fine if it's a problem size that you would even think about doing in serial.
4. Use iterative methods to compute an SVD. Depending on the spectral properties of your system (maybe after preconditioning) this could converge very fast, and it does not require storage for the matrix (which in your case has 5-10 columns, each of which is the size of your input data). A good library for this is SLEPc; you only have to compute the product of the Vandermonde matrix with a vector (so you only need to store the sample locations). This is very scalable in parallel.
I believe I found the answer to my own question based on a modified version of this code. For those interested, my Java code is here.
