Computing precision and recall in Named Entity Recognition - nlp

Now I am about to report the results from Named Entity Recognition. One thing that I find a bit confusing is that my understanding of precision and recall was that one simply sums up true positives, true negatives, false positives and false negatives over all classes.
But this seems implausible now that I think of it as each misclassification would give simultaneously rise to one false positive and one false negative (e.g. a token that should have been labelled as "A" but was labelled as "B" is a false negative for "A" and false positive for "B"). Thus the number of the false positives and the false negatives over all classes would be the same which means that precision is (always!) equal to recall. This simply can't be true so there is an error in my reasoning and I wonder where it is. It is certainly something quite obvious and straight-forward but it escapes me right now.

The way precision and recall is typically computed (this is what I use in my papers) is to measure entities against each other. Supposing the ground truth has the following (without any differentiaton as to what type of entities they are)
[Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today
This has 3 entities.
Supposing your actual extraction has the following
[Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]
You have an exact match for Microsoft Corp, false positives for CEO and today, a false negative for Windows 7 and a substring match for Steve
We compute precision and recall by first defining matching criteria. For example, do they have to be an exact match? Is it a match if they overlap at all? Do entity types matter? Typically we want to provide precision and recall for several of these criteria.
Exact match: True Positives = 1 (Microsoft Corp., the only exact match), False Positives =3 (CEO, today, and Steve, which isn't an exact match), False Negatives = 2 (Steve Ballmer and Windows 7)
Precision = True Positives / (True Positives + False Positives) = 1/(1+3) = 0.25
Recall = True Positives / (True Positives + False Negatives) = 1/(1+2) = 0.33
Any Overlap OK: True Positives = 2 (Microsoft Corp., and Steve which overlaps Steve Ballmer), False Positives =2 (CEO, and today), False Negatives = 1 (Windows 7)
Precision = True Positives / (True Positives + False Positives) = 2/(2+2) = 0.55
Recall = True Positives / (True Positives + False Negatives) = 2/(2+1) = 0.66
The reader is then left to infer that the "real performance" (the precision and recall that an unbiased human checker would give when allowed to use human judgement to decide which overlap discrepancies are significant, and which are not) is somewhere between the two.
It's also often useful to report the F1 measure, which is the harmonic mean of precision and recall, and which gives some idea of "performance" when you have to trade off precision against recall.

In the CoNLL-2003 NER task, the evaluation was based on correctly marked entities, not tokens, as described in the paper 'Introduction to the CoNLL-2003 Shared Task:
Language-Independent Named Entity Recognition'. An entity is correctly marked if the system identifies an entity of the correct type with the correct start and end point in the document. I prefer this approach in evaluation because it's closer to a measure of performance on the actual task; a user of the NER system cares about entities, not individual tokens.
However, the problem you described still exists. If you mark an entity of type ORG with type LOC you incur a false positive for LOC and a false negative for ORG. There is an interesting discussion on the problem in this blog post.

As mentioned before, there are different ways of measuring NER performance. It is possible to evaluate separately how precisely entities are detected in terms of position in the text, and in terms of their class (person, location, organization, etc.). Or to combine both aspects in a single measure.
You'll find a nice review in the following thesis: D. Nadeau, Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision (2007). Have a look at section 2.6. Evaluation of NER.

There is no simple right answer to this question. There are a variety of different ways to count errors. The MUC competitions used one, other people have used others.
However, to help you with your immediate confusion:
You have a set of tags, no? Something like NONE, PERSON, ANIMAL, VEGETABLE?
If a token should be person, and you tag it NONE, then that's a false positive for NONE and a false negative for PERSON. If a token should be NONE and you tag it PERSON, it's the other way around.
So you get a score for each entity type.
You can also aggregate those scores.

Just to be clear, these are the definitions:
Precision = TP/(TP+FP) = What portion of what you found was ground truth?
Recall = TP/(TP+FN) = What portion of the ground truth did you recover?
The won't necessarily always be equal, since the number of false negatives will not necessarily equal the number of false positives.
If I understand your problem right, you're assigning each token to one of more than two possible labels. In order for precision and recall to make sense, you need to have a binary classifier. So you could use precision and recall if you phrased the classifier as whether a token is in Group "A" or not, and then repeat for each group. In this case a missed classification would count twice as a false negative for one group and a false positive for another.
If you're doing a classification like this where it isn't binary (assigning each token to a group) it might be useful instead to look at pairs of tokens. Phrase your problem as "Are tokens X and Y in the same classification group?". This allows you to compute precision and recall over all pairs of nodes. This isn't as appropriate if your classification groups are labeled or have associated meanings. For example if your classification groups are "Fruits" and "Vegetables", and you classify both "Apples" and "Oranges" as "Vegetables" then this algorithm would score it as a true positive even though the wrong group was assigned. But if your groups are unlabled, for example "A" and "B", then if apples and oranges were both classified as "A", afterward you could say that "A" corresponds to "Fruits".

If you are training an spacy ner model then their scorer.py API which gives you precision, recall and recall of your ner.
The Code and output would be in this format:-
17
For those one having the same question in the following link:
spaCy/scorer.py
'''python
import spacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer
def evaluate(ner_model, examples):
scorer = Scorer()
for input_, annot in examples:
doc_gold_text = ner_model.make_doc(input_)
gold = GoldParse(doc_gold_text, entities=annot)
pred_value = ner_model(input_)
scorer.score(pred_value, gold)
return scorer.scores
example run
examples = [
('Who is Shaka Khan?',
[(7, 17, 'PERSON')]),
('I like London and Berlin.',
[(7, 13, 'LOC'), (18, 24, 'LOC')])
]
ner_model = spacy.load(ner_model_path) # for spaCy's pretrained use 'en_core_web_sm'
results = evaluate(ner_model, examples)
'''
Output will be in format like:-
{'uas': 0.0, 'las': 0.0, **'ents_p'**: 43.75, **'ents_r'**: 35.59322033898305, **'ents_f'**: 39.252336448598136, 'tags_acc': 0.0, 'token_acc': 100.0}**strong text**

Related

How do I produce the correct Cohen's d estimate?

I am trying to replicate the result of an academic paper below to practice how to apply statistical methods in R.
This is what the paper states:
Pre- and postoutbreak differences in voter intentions.
Across the 32 elections included in primary analyses, the mean voter-intention difference score was greater than zero (M=1.02%), d=0.84, t(31)=2.34, p=.026.
This result is consistent with the pre- and postelection difference in nationwide polling results for the House of Representatives elections, which indicates a general postoutbreak shift toward favoring Republican rather than Democratic candidates. (If the two outliers were
included in the analysis, the mean voter-intention difference score was not meaningfully different from zero, p=.937.) (Source screenshot)
The t-test result matched the values in the paper, however, I cannot figure out how to produce the right cohen's d estimate. I looked into the documentation for cohen.d function over and over again, googled how to do this, even watched some boring Youtube videos, but to no avail. The code itself runs, but it gives me a wrong value. Am I missing something? Can someone help me with how I should format the arguments?
# excluding outliers from the dataset
no_outliers <- study2 %>%
filter(StateSenateRace != "Rhode Island", StateSenateRace != "Hawaii")
# paired t-test
t.test(no_outliers$OctMeanVoterIntentionIndex, no_outliers$SeptMeanVoterIntentionIndex, paired = TRUE, var.equal = TRUE, na.rm = TRUE)
cohen.d(no_outliers$SeptMeanVoterIntentionIndex, no_outliers$OctMeanVoterIntentionIndex, na.rm = TRUE)
Here's the result I got.
Cohen's d
d estimate: -0.02130406 (negligible)
95 percent confidence interval:
lower upper
-0.5171041 0.4744960
Thank you in advance--I wish I could contribute to Stack Overflow as a R expert someday!

Word2Vec Subsampling -- Implementation

I am implementing the Skipgram model, both in Pytorch and Tensorflow2. I am having doubts about the implementation of subsampling of frequent words. Verbatim from the paper, the probability of subsampling word wi is computed as
where t is a custom threshold (usually, a small value such as 0.0001) and f is the frequency of the word in the document. Although the authors implemented it in a different, but almost equivalent way, let's stick with this definition.
When computing the P(wi), we can end up with negative values. For example, assume we have 100 words, and one of them appears extremely more often than others (as it is the case for my dataset).
import numpy as np
import seaborn as sns
np.random.seed(12345)
# generate counts in [1, 20]
counts = np.random.randint(low=1, high=20, size=99)
# add an extremely bigger count
counts = np.insert(counts, 0, 100000)
# compute frequencies
f = counts/counts.sum()
# define threshold as in paper
t = 0.0001
# compute probabilities as in paper
probs = 1 - np.sqrt(t/f)
sns.distplot(probs);
Q: What is the correct way to implement subsampling using this "probability"?
As an additional info, I have seen that in keras the function keras.preprocessing.sequence.make_sampling_table takes a different approach:
def make_sampling_table(size, sampling_factor=1e-5):
"""Generates a word rank-based probabilistic sampling table.
Used for generating the `sampling_table` argument for `skipgrams`.
`sampling_table[i]` is the probability of sampling
the i-th most common word in a dataset
(more common words should be sampled less frequently, for balance).
The sampling probabilities are generated according
to the sampling distribution used in word2vec:
```
p(word) = (min(1, sqrt(word_frequency / sampling_factor) /
(word_frequency / sampling_factor)))
```
We assume that the word frequencies follow Zipf's law (s=1) to derive
a numerical approximation of frequency(rank):
`frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))`
where `gamma` is the Euler-Mascheroni constant.
# Arguments
size: Int, number of possible words to sample.
sampling_factor: The sampling factor in the word2vec formula.
# Returns
A 1D Numpy array of length `size` where the ith entry
is the probability that a word of rank i should be sampled.
"""
gamma = 0.577
rank = np.arange(size)
rank[0] = 1
inv_fq = rank * (np.log(rank) + gamma) + 0.5 - 1. / (12. * rank)
f = sampling_factor * inv_fq
return np.minimum(1., f / np.sqrt(f))
I tend to trust deployed code more than paper write-ups, especially in a case like word2vec, where the original authors' word2vec.c code released by the paper's authors has been widely used & served as the template for other implementations. If we look at its subsampling mechanism...
if (sample > 0) {
real ran = (sqrt(vocab[word].cn / (sample * train_words)) + 1) * (sample * train_words) / vocab[word].cn;
next_random = next_random * (unsigned long long)25214903917 + 11;
if (ran < (next_random & 0xFFFF) / (real)65536) continue;
}
...we see that those words with tiny counts (.cn) that could give negative values in the original formula instead here give values greater-than 1.0, and thus can never be less than the long-random-masked-and-scaled to never be more than 1.0 ((next_random & 0xFFFF) / (real)65536). So, it seems the authors' intent was for all negative-values of the original formula to mean "never discard".
As per the keras make_sampling_table() comment & implementation, they're not consulting the actual word-frequencies at all. Instead, they're assuming a Zipf-like distribution based on word-rank order to synthesize a simulated word-frequency.
If their assumptions were to hold – the related words are from a natural-language corpus with a Zipf-like frequency-distribution – then I'd expect their sampling probabilities to be close to down-sampling probabilities that would have been calculated from true frequency information. And that's probably "close enough" for most purposes.
I'm not sure why they chose this approximation. Perhaps other aspects of their usual processes have not maintained true frequencies through to this step, and they're expecting to always be working with natural-language texts, where the assumed frequencies will be generally true.
(As luck would have it, and because people often want to impute frequencies to public sets of word-vectors which have dropped the true counts but are still sorted from most- to least-frequent, just a few days ago I wrote an answer about simulating a fake-but-plausible distribution using Zipf's law – similar to what this keras code is doing.)
But, if you're working with data that doesn't match their assumptions (as with your synthetic or described datasets), their sampling-probabilities will be quite different than what you would calculate yourself, with any form of the original formula that uses true word frequencies.
In particular, imagine a distribution with one token a million times, then a hundred tokens all appearing just 10 times each. Those hundred tokens' order in the "rank" list is arbitrary – truly, they're all tied in frequency. But the simulation-based approach, by fitting a Zipfian distribution on that ordering, will in fact be sampling each of them very differently. The one 10-occurrence word lucky enough to be in the 2nd rank position will be far more downsampled, as if it were far more frequent. And the 1st-rank "tall head" value, by having its true frequency *under-*approximated, will be less down-sampled than otherwise. Neither of those effects seem beneficial, or in the spirit of the frequent-word-downsampling option - which should only "thin out" very-frequent words, and in all cases leave words of the same frequency as each other in the original corpus roughly equivalently present to each other in the down-sampled corpus.
So for your case, I would go with the original formula (probability-of-discarding-that-requires-special-handling-of-negative-values), or the word2vec.c practical/inverted implementation (probability-of-keeping-that-saturates-at-1.0), rather than the keras-style approximation.
(As a totally-separate note that nonetheless may be relevant for your dataset/purposes, if you're using negative-sampling: there's another parameter controlling the relative sampling of negative examples, often fixed at 0.75 in early implementations, that one paper has suggested can usefully vary for non-natural-language token distributions & recommendation-related end-uses. This parameter is named ns_exponent in the Python gensim implementation, but simply a fixed power value internal to a sampling-table pre-calculation in the original word2vec.c code.)

Understanding Data Leakage and getting perfect score by exploiting test data

I have read an article on data leakage. In a hackathon there are two sets of data, train data on which participants train their algorithm and test set on which performance is measured.
Data leakage helps in getting a perfect score in test data, with out viewing train data by exploiting the leak.
I have read the article, but I am missing the crux how the leakage is exploited.
Steps as shown in article are following:
Let's load the test data.
Note, that we don't have any training data here, just test data. Moreover, we will not even use any features of test objects. All we need to solve this task is the file with the indices for the pairs, that we need to compare.
Let's load the data with test indices.
test = pd.read_csv('../test_pairs.csv')
test.head(10)
pairId FirstId SecondId
0 0 1427 8053
1 1 17044 7681
2 2 19237 20966
3 3 8005 20765
4 4 16837 599
5 5 3657 12504
6 6 2836 7582
7 7 6136 6111
8 8 23295 9817
9 9 6621 7672
test.shape[0]
368550
For example, we can think that there is a test dataset of images, and each image is assigned a unique Id from 0 to N−1 (N -- is the number of images). In the dataframe from above FirstId and SecondId point to these Id's and define pairs, that we should compare: e.g. do both images in the pair belong to the same class or not. So, for example for the first row: if images with Id=1427 and Id=8053 belong to the same class, we should predict 1, and 0 otherwise.
But in our case we don't really care about the images, and how exactly we compare the images (as long as comparator is binary).
print(test['FirstId'].nunique())
print(test['SecondId'].nunique())
26325
26310
So the number of pairs we are given to classify is very very small compared to the total number of pairs.
To exploit the leak we need to assume (or prove), that the total number of positive pairs is small, compared to the total number of pairs. For example: think about an image dataset with 1000 classes, N images per class. Then if the task was to tell whether a pair of images belongs to the same class or not, we would have 1000*N*(N−1)/2 positive pairs, while total number of pairs was 1000*N(1000N−1)/2.
Another example: in Quora competitition the task was to classify whether a pair of qustions are duplicates of each other or not. Of course, total number of question pairs is very huge, while number of duplicates (positive pairs) is much much smaller.
Finally, let's get a fraction of pairs of class 1. We just need to submit a constant prediction "all ones" and check the returned accuracy. Create a dataframe with columns pairId and Prediction, fill it and export it to .csv file. Then submit
test['Prediction'] = np.ones(test.shape[0])
sub=pd.DataFrame(test[['pairId','Prediction']])
sub.to_csv('sub.csv',index=False)
All ones have accuracy score is 0.500000.
So, we assumed the total number of pairs is much higher than the number of positive pairs, but it is not the case for the test set. It means that the test set is constructed not by sampling random pairs, but with a specific sampling algorithm. Pairs of class 1 are oversampled.
Now think, how we can exploit this fact? What is the leak here? If you get it now, you may try to get to the final answer yourself, othewise you can follow the instructions below.
Building a magic feature
In this section we will build a magic feature, that will solve the problem almost perfectly. The instructions will lead you to the correct solution, but please, try to explain the purpose of the steps we do to yourself -- it is very important.
Incidence matrix
First, we need to build an incidence matrix. You can think of pairs (FirstId, SecondId) as of edges in an undirected graph.
The incidence matrix is a matrix of size (maxId + 1, maxId + 1), where each row (column) i corresponds i-th Id. In this matrix we put the value 1to the position [i, j], if and only if a pair (i, j) or (j, i) is present in a given set of pais (FirstId, SecondId). All the other elements in the incidence matrix are zeros.
Important! The incidence matrices are typically very very sparse (small number of non-zero values). At the same time incidence matrices are usually huge in terms of total number of elements, and it is impossible to store them in memory in dense format. But due to their sparsity incidence matrices can be easily represented as sparse matrices. If you are not familiar with sparse matrices, please see wiki and scipy.sparse reference. Please, use any of scipy.sparseconstructors to build incidence matrix.
For example, you can use this constructor: scipy.sparse.coo_matrix((data, (i, j))). We highly recommend to learn to use different scipy.sparseconstuctors, and matrices types, but if you feel you don't want to use them, you can always build this matrix with a simple for loop. You will need first to create a matrix using scipy.sparse.coo_matrix((M, N), [dtype]) with an appropriate shape (M, N) and then iterate through (FirstId, SecondId) pairs and fill corresponding elements in matrix with ones.
Note, that the matrix should be symmetric and consist only of zeros and ones. It is a way to check yourself.
import networkx as nx
import numpy as np
import pandas as pd
import scipy.sparse
import matplotlib.pyplot as plt
test = pd.read_csv('../test_pairs.csv')
x = test[['FirstId','SecondId']].rename(columns={'FirstId':'col1', 'SecondId':'col2'})
y = test[['SecondId','FirstId']].rename(columns={'SecondId':'col1', 'FirstId':'col2'})
comb = pd.concat([x,y],ignore_index=True).drop_duplicates(keep='first')
comb.head()
col1 col2
0 1427 8053
1 17044 7681
2 19237 20966
3 8005 20765
4 16837 599
data = np.ones(comb.col1.shape, dtype=int)
inc_mat = scipy.sparse.coo_matrix((data,(comb.col1,comb.col2)), shape=(comb.col1.max() + 1, comb.col1.max() + 1))
rows_FirstId = inc_mat[test.FirstId.values,:]
rows_SecondId = inc_mat[test.SecondId.values,:]
f = rows_FirstId.multiply(rows_SecondId)
f = np.asarray(f.sum(axis=1))
f.shape
(368550, 1)
f = f.sum(axis=1)
f = np.squeeze(np.asarray(f))
print (f.shape)
Now build the magic feature
Why did we build the incidence matrix? We can think of the rows in this matix as of representations for the objects. i-th row is a representation for an object with Id = i. Then, to measure similarity between two objects we can measure similarity between their representations. And we will see, that such representations are very good.
Now select the rows from the incidence matrix, that correspond to test.FirstId's, and test.SecondId's.
So do not forget to convert pd.series to np.array
These lines should normally run very quickly
rows_FirstId = inc_mat[test.FirstId.values,:]
rows_SecondId = inc_mat[test.SecondId.values,:]
Our magic feature will be the dot product between representations of a pair of objects. Dot product can be regarded as similarity measure -- for our non-negative representations the dot product is close to 0 when the representations are different, and is huge, when representations are similar.
Now compute dot product between corresponding rows in rows_FirstId and rows_SecondId matrices.
From magic feature to binary predictions
But how do we convert this feature into binary predictions? We do not have a train set to learn a model, but we have a piece of information about test set: the baseline accuracy score that you got, when submitting constant. And we also have a very strong considerations about the data generative process, so probably we will be fine even without a training set.
We may try to choose a thresold, and set the predictions to 1, if the feature value f is higer than the threshold, and 0 otherwise. What threshold would you choose?
How do we find a right threshold? Let's first examine this feature: print frequencies (or counts) of each value in the feature f.
For example use np.unique function, check for flags
Function to count frequency of each element
from scipy.stats import itemfreq
itemfreq(f)
array([[ 14, 183279],
[ 15, 852],
[ 19, 546],
[ 20, 183799],
[ 21, 6],
[ 28, 54],
[ 35, 14]])
Do you see how this feature clusters the pairs? Maybe you can guess a good threshold by looking at the values?
In fact, in other situations it can be not that obvious, but in general to pick a threshold you only need to remember the score of your baseline submission and use this information.
Choose a threshold below:
pred = f > 14 # SET THRESHOLD HERE
pred
array([ True, False, True, ..., False, False, False], dtype=bool)
submission = test.loc[:,['pairId']]
submission['Prediction'] = pred.astype(int)
submission.to_csv('submission.csv', index=False)
I want to understand the idea behind this. How we are exploiting the leak from the test data only.
There's a hint in the article. The number of positive pairs should be 1000*N*(N−1)/2, while the number of all pairs is 1000*N(1000N−1)/2. Of course, the number of all pairs is much, much larger if the test set was sampled at random.
As the author mentions, after you evaluate your constant prediction of 1s on the test set, you can tell that the sampling was not done at random. The accuracy you obtain is 50%. Had the sampling been done correctly, this value should've been much lower.
Thus, they construct the incidence matrix and calculate the dot product (the measure of similarity) between the representations of our ID features. They then reuse the information about the accuracy obtained with constant predictions (at 50%) to obtain the corresponding threshold (f > 14). It's set to be greater than 14 because that constitutes roughly half of our test set, which in turn maps back to the 50% accuracy.
The "magic" value didn't have to be greater than 14. It could have been equal to 14. You could have adjusted this value after some leader board probing (as long as you're capturing half of the test set).
It was observed that the test data was not sampled properly; same-class pairs were oversampled. Thus there is a much higher probability of each pair in the training set to have target=1 than any random pair. This led to the belief that one could construct a similarity measure based only on the pairs that are present in the test, i.e., whether a pair made it to the test is itself a strong indicator of similarity.
Using this insight one can calculate an incidence matrix and represent each id j as a binary array (the i-th element representing the presence of i-j pair in test, and thus representing the strong probability of similarity between them). This is a pretty accurate measure, allowing one to find the "similarity" between two rows just by taking their dot product.
The cutoff arrived at is purely by the knowledge of target-distribution found by leaderboard probing.

spark ml 2.0 - Naive Bayes - how to determine threshold values for each class

I am using NB for document classification and trying to understand threshold parameter to see how it can help to optimize algorithm.
Spark ML 2.0 thresholds doc says:
Param for Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class' threshold.
0) Can someone explain this better? What goal it can achieve? My general idea is if you have threshold 0.7 then at least one class prediction probability should be more then 0.7 if not then prediction should return empty. Means classify it as 'uncertain' or just leave empty for prediction column. How can p/t function going to achieve that when you still pick the category with max probability?
1) What probability it adjust? default column 'probability' is actually conditional probability and 'rawPrediction' is
confidence according to document. I believe threshold will adjust 'rawPrediction' not 'probability' column. Am I right?
2) Here's how some of my probability and rawPrediction vector look like. How do I set threshold values based on this so I can remove certain uncertain classification? probability is between 0 and 1 but rawPrediction seems to be on log scale here.
Probability:
[2.233368649314982E-15,1.6429456680945863E-9,1.4377313514127723E-15,7.858651849363202E-15]
rawPrediction:
[-496.9606736723107,-483.452183395287,-497.40111830218746]
Basically I want classifier to leave Prediction column empty if it doesn't have any probability that is more then 0.7 percent.
Also, how to classify something as uncertain when more then one category has very close scores e.g. 0.812, 0.800, 0.799 . Picking max is something I may not want here but instead classify as "uncertain" or leave empty and I can do further analysis and treatment for those documents or train another model for those docs.
I haven't played with it, but the intent is to supply different threshold values for each class. I've extracted this example from the docstring:
model = nb.fit(df)
>>> result.prediction
1.0
>>> result.probability
DenseVector([0.42..., 0.57...])
>>> result.rawPrediction
DenseVector([-1.60..., -1.32...])
>>> nb = nb.setThresholds([0.01, 10.00])
>>> model3 = nb.fit(df)
>>> result = model3.transform(test0).head()
>>> result.prediction
0.0
If I understand correctly, the effect was to transform [0.42, 0.58] into [.42/.01, .58/10] = [42, 5.8], switching the prediction ("largest p/t") from column 1 (third row above) to column 0 (last row above). However, I couldn't find the logic in the source. Anyone?
Stepping back: I do not see a built-in way to do what you want: be agnostic if no class dominates. You will have to add that with something like:
def weak(probs, threshold=.7, epsilon=.01):
return np.all(probs < threshold) or np.max(np.diff(probs)) < epsilon
>>> cases = [[.5,.5],[.5,.7],[.7,.705],[.6,.1]]
>>> for case in cases:
... print '{:15s} - {}'.format(case, weak(case))
[0.5, 0.5] - True
[0.5, 0.7] - False
[0.7, 0.705] - True
[0.6, 0.1] - True
(Notice I haven't checked whether probs is a legal probability distribution.)
Alternatively, if you are not actually making a hard decision, use the predicted probabilities and a metric like Brier score, log loss, or info gain that accounts for the calibration as well as the accuracy.

Ways to calculate similarity

I am doing a community website that requires me to calculate the similarity between any two users. Each user is described with the following attributes:
age, skin type (oily, dry), hair type (long, short, medium), lifestyle (active outdoor lover, TV junky) and others.
Can anyone tell me how to go about this problem or point me to some resources?
Another way of computing (in R) all the pairwise dissimilarities (distances) between observations in the data set. The original variables may be of mixed types. The handling of nominal, ordinal, and (a)symmetric binary data is achieved by using the general dissimilarity coefficient of Gower (Gower, J. C. (1971) A general coefficient of similarity and some of its properties, Biometrics 27, 857–874). For more check out this on page 47. If x contains any columns of these data-types, Gower's coefficient will be used as the metric.
For example
x1 <- factor(c(10, 12, 25, 14, 29))
x2 <- factor(c("oily", "dry", "dry", "dry", "oily"))
x3 <- factor(c("medium", "short", "medium", "medium", "long"))
x4 <- factor(c("active outdoor lover", "TV junky", "TV junky", "active outdoor lover", "TV junky"))
x <- cbind(x1,x2,x3,x4)
library(cluster)
daisy(x, metric = "euclidean")
you'll get :
Dissimilarities :
1 2 3 4
2 2.000000
3 3.316625 2.236068
4 2.236068 1.732051 1.414214
5 4.242641 3.741657 1.732051 2.645751
If you are interested on a method for dimensionality reduction for categorical data (also a way to arrange variables into homogeneous clusters) check this
Give each attribute an appropriate weight, and add the differences between values.
enum SkinType
Dry, Medium, Oily
enum HairLength
Bald, Short, Medium, Long
UserDifference(user1, user2)
total := 0
total += abs(user1.Age - user2.Age) * 0.1
total += abs((int)user1.Skin - (int)user2.Skin) * 0.5
total += abs((int)user1.Hair - (int)user2.Hair) * 0.8
# etc...
return total
If you really need similarity instead of difference, use 1 / UserDifference(a, b)
You probably should take a look for
Data Mining and Data Warehousing (Essential)
Machine Learning (Extra)
Artificial Neural Networks (Especially SOM)
Pattern Recognition (Related)
These topics will let you your program recognize similarities and clusters in your users collection and try to adapt to them...
You can then know different hidden common groups of related users... (i.e users with green hair usually do not like watching TV..)
As an advice, try to use ready implemented tools for this feature instead of implementing it yourself...
Take a look at Open Directory Data Mining Projects
Three steps to achieve a simple subjective metric for difference between two datapoints that might work fine in your case:
Capture all your variables in a representative numeric variable, for example: skin type (oily=-1, dry=1), hair type (long=2, short=0, medium=1),lifestyle (active outdoor lover=1, TV junky=-1), age is a number.
Scale all numeric ranges so that they fit the relative importance you give them for indicating difference. For example: An age difference of 10 years is about as different as the difference between long and medium hair, and the difference between oily and dry skin. So 10 on the age scale is as different as 1 on the hair scale is as different as 2 on the skin scale, so scale the difference in age by 0.1, that in hair by 1 and and that in skin by 0.5
Use an appropriate distance metric to combine the differences between two people on the various scales in one overal difference. The smaller this number, the more similar they are. I'd suggest simple quadratic difference as a first attempt at your distance function.
Then the difference between two people could be calculated with (I assume Person.age, .skin, .hair, etc. have already gone through step 1 and are numeric):
double Difference(Person p1, Person p2) {
double agescale=0.1;
double skinscale=0.5;
double hairscale=1;
double lifestylescale=1;
double agediff = (p1.age-p2.age)*agescale;
double skindiff = (p1.skin-p2.skin)*skinscale;
double hairdiff = (p1.hair-p2.hair)*hairscale;
double lifestylediff = (p1.lifestyle-p2.lifestyle)*lifestylescale;
double diff = sqrt(agediff^2 + skindiff^2 + hairdiff^2 + lifestylediff^2);
return diff;
}
Note that diff in this example is not on a nice scale like (0..1). It's value can range from 0 (no difference) to something large (high difference). Also, this method is almost completely unscientific, it is just designed to quickly give you a working difference metric.
Look at algorithms for computing srting difference. Its very similar to what you need. Store your attributes as a bit string and compute the distance between the strings
You should read these two topics.
Most popular clustering algorithm k - means
And similarity matrix are essential in clustering

Resources