Record Linkage Using String Similarity Techniques

We are working on a record linkage project.
We are observing strange behavior from all of the standard techniques, such as Jaro-Winkler, Levenshtein, N-Gram, Damerau-Levenshtein, Jaccard index, and Sorensen-Dice.
Say,
String 1= MINI GRINDER KIT
String 2= Weiler 13001 Mini Grinder Accessory Kit, For Use With Small Right Angle Grinders
String 3= Milwaukee Video Borescope, Rotating Inspection Scope, Series: M-SPECTOR 360, 2.7 in 640 x 480 pixels High-Resolution LCD, Plastic, Black/Red
In the above case, string 1 and string 2 are related; the scores from each method are shown below.
Jaro Winkler -> 0.391666651
Levenshtein -> 75
N-Gram -> 0.9375
Damerau -> 75
Jaccard index -> 0
Sorensen-Dice -> 0
Cosine -> 0
But string 1 and string 3 are not at all related, yet the distance methods give very high scores.
Jaro Winkler -> 0.435714275
Levenshtein -> 133
N-Gram -> 0.953571439
Damerau -> 133
Jaccard index -> 1
Sorensen-Dice -> 0
Cosine -> 0
Any thoughts?

All of these distance/similarity calculations are case sensitive, so convert both strings to the same case first. Then the scores will be calculated appropriately.
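As a quick illustration, here is a minimal sketch using Python's standard-library difflib (the same effect applies to Jaro-Winkler, Levenshtein, and the other metrics):

from difflib import SequenceMatcher

s1 = "MINI GRINDER KIT"
s2 = "Weiler 13001 Mini Grinder Accessory Kit, For Use With Small Right Angle Grinders"

# Mixed case: the shared words barely line up character-for-character
print(SequenceMatcher(None, s1, s2).ratio())
# Same case: the overlap becomes visible to the matcher, so the score is typically noticeably higher
print(SequenceMatcher(None, s1.lower(), s2.lower()).ratio())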

I believe your goal here is to check whether the two products are the same or not. The data are from different sources, I guess. For data like this you will need to work out which mentions are worth comparing: the brand name, the specs, etc.
These metrics follow a very crude notion of similarity, so don't just feed the raw strings in.
First clean the text (remove punctuation and unimportant words), then tokenize it (break it into single words); then you can use something like fuzzywuzzy to help find a better match, as sketched below.
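For example, a minimal sketch of that clean-then-compare pipeline (assuming the fuzzywuzzy package is installed; the stop-word list is illustrative only and should be tuned for your catalogue):

import re
from fuzzywuzzy import fuzz

STOPWORDS = {"for", "use", "with"}   # illustrative; choose words that carry no product signal

def clean(text):
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())            # lower-case and strip punctuation
    return " ".join(w for w in text.split() if w not in STOPWORDS)

s1 = "MINI GRINDER KIT"
s2 = "Weiler 13001 Mini Grinder Accessory Kit, For Use With Small Right Angle Grinders"
s3 = "Milwaukee Video Borescope, Rotating Inspection Scope, Series: M-SPECTOR 360, 2.7 in 640 x 480 pixels High-Resolution LCD, Plastic, Black/Red"

# token_set_ratio compares token sets, so extra words in the longer title hurt far less
print(fuzz.token_set_ratio(clean(s1), clean(s2)))   # high: the related pair
print(fuzz.token_set_ratio(clean(s1), clean(s3)))   # low: the unrelated pair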

Welford's online variance algorithm, but for Interquartile Range?

Short Version
Welford's Online Algorithm lets you keep a running value for the variance, meaning you don't have to keep all the values (e.g. in a memory-constrained system).
Is there something similar for Interquartile Range (IQR)? An online algorithm that lets me know the middle 50% range without having to keep all historical values?
Long Version
Keeping a running average of data, where you are memory constrained, is pretty easy:
Double sum
Int64 count
And from this you can compute the mean:
mean = sum / count
This allows hours, or years, of observations to be quietly collected while only taking up 16 bytes.
Welford's Algorithm for Variance
Normally when you want the variance (or standard deviation), you have to have all your readings, because you have to compute reading - mean for every reading:
Double sumOfSquaredError = 0;
foreach (Double reading in Readings)
    sumOfSquaredError += (reading - mean) * (reading - mean);
Double variance = sumOfSquaredError / count;
Which is why it was nice when Welford came up with an online algorithm for computing variance of a stream of readings:
It is often useful to be able to compute the variance in a single pass, inspecting each value xi only once; for example, when the data is being collected without enough storage to keep all the values, or when costs of memory access dominate those of computation.
The algorithm for adding a new value to the running variance is:
void addValue(Double newValue) {
    // running state: Double sum, Int64 count, Double variance
    Double oldMean = (count > 0) ? (sum / count) : 0;
    sum += newValue;
    count += 1;
    Double newMean = sum / count;

    if (count > 1)
        variance = ((count - 2) * variance + (newValue - oldMean) * (newValue - newMean)) / (count - 1);
    else
        variance = 0;
}
How about an online algorithm for Interquartile Range (IQR)?
The Interquartile Range (IQR) is another measure of the spread of data: it tells you how wide the middle 50% of the data is.
From that, people generally draw an IQR box plot, or at the very least report the values Q1 and Q3.
Is there a way to calculate the Interquartile Range without having to keep all the recorded values?
In other words:
Is there something like Welford's online variance algorithm, but for Interquartile Range?
Knuth, Seminumerical Algorithms
You can find Welford's algorithm explained in Knuth's 2nd volume Seminumerical Algorithms:
(just in case anyone thought this isn't computer science or programming related)
Research Effort
Stackoverflow: Simple algorithm for online outlier detection of a generic time series
Stats: Simple algorithm for online outlier detection of a generic time series
Online outlier detection for data streams (IDEAS '11: Proceedings of the 15th Symposium on International Database Engineering & Applications, September 2011, Pages 88–96)
Stats: Robust outlier detection in financial timeseries
Stats: Online outlier detection
Distance-based outlier detection in data streams (Proceedings of the VLDB Endowment, Volume 9, Issue 12, August 2016, pp 1089–1100) pdf
Online Outlier Detection Over Data Streams (Hongyin Cui, Masters Thesis, 2005)
There's a useful paper by Ben-Haim and Tom-Tov, published in 2010 in the Journal of Machine Learning Research:
A Streaming Parallel Decision Tree Algorithm
Short PDF: https://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=0667E2F91B9E0E5387F85655AE9BC560?doi=10.1.1.186.7913&rep=rep1&type=pdf
Full paper: https://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
It describes an algorithm for automatically building a histogram from online (streaming) data that does not require unlimited memory:
you add a value to the histogram
the algorithm dynamically creates buckets
including the bucket positions and sizes
The paper is kind of dense (as all math papers are), but the algorithm is fairly simple.
Let's start with some sample data. For this answer I'll use the digits of pi as the source of incoming floating-point numbers:
Value
3.14159
2.65358
9.79323
8.46264
3.38327
9.50288
4.19716
9.39937
5.10582
0.97494
4.59230
7.81640
6.28620
8.99862
8.03482
5.34211
...
I will define that I want 5 bins in my histogram.
We add the first value (3.14159), which causes the first bin to be created:
Bin        Count
3.14159    1
The bins in this histogram don't have any width; they are purely a point:
And then we add the 2nd value (2.65358) to the histogram:
And we continue adding points, until we reach our arbitrary limit of 5 "buckets":
That is all 5 buckets filled.
We add our 6th value (9.50288) to the histogram, except that means we now have 6 buckets; but we decided we only want five:
Now is where the magic starts
In order to do the streaming part, and limit memory usage to less-than-infinity, we need to merge some of the "bins".
Look at each pair of bins - left-to-right - and see which two are closest together. Which in our case is these two buckets:
These two buckets are merged and given a new x value that depends on their relative heights (i.e. counts):
y_new = y_left + y_right = 1 + 1 = 2
x_new = x_left×(y_left/y_new) + x_right×(y_right/y_new) = 3.14159×(1/2) + 3.38327×(1/2) = 3.26243
And now we repeat:
add a new value
merge the two neighboring buckets that are closest to each other
deciding on the new x position based on their relative heights
Eventually giving you (although I screwed it up as I was doing it manually in Excel for this answer):
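For reference, here is a minimal Python sketch of that add-then-merge loop (a simplification of the Ben-Haim & Tom-Tov procedure, with names of my own choosing):

import bisect

class StreamingHistogram:
    """Fixed-size histogram: add points, merge the two closest bins when over capacity."""

    def __init__(self, max_bins=5):
        self.max_bins = max_bins
        self.bins = []            # sorted list of [x, count]

    def add(self, value):
        # insert the new point as its own bin, keeping bins sorted by x
        i = bisect.bisect(self.bins, [value, 0])
        self.bins.insert(i, [value, 1])
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # find the adjacent pair of bins whose x values are closest together
        gaps = [self.bins[i + 1][0] - self.bins[i][0] for i in range(len(self.bins) - 1)]
        i = gaps.index(min(gaps))
        (x1, y1), (x2, y2) = self.bins[i], self.bins[i + 1]
        y_new = y1 + y2
        x_new = x1 * (y1 / y_new) + x2 * (y2 / y_new)   # new centre weighted by counts
        self.bins[i:i + 2] = [[x_new, y_new]]

hist = StreamingHistogram(max_bins=5)
for v in [3.14159, 2.65358, 9.79323, 8.46264, 3.38327, 9.50288, 4.19716]:
    hist.add(v)
print(hist.bins)   # the first merge combines 3.14159 and 3.38327 into 3.26243, as above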
Practical Example
I wanted a histogram of 20 buckets, which allows me to extract some useful statistics. (For a histogram of 11 buckets containing 38,000 data points, it only required 40 bytes of memory.)
With these 20 buckets, I can now compute the Probability Density Function (PDF):
Bin            Count    PDF
2.113262834     3085    5.27630%
6.091181608     3738    6.39313%
10.13907062     4441    7.59548%
14.38268188     5506    9.41696%
18.92107481     6260   10.70653%
23.52148965     6422   10.98360%
28.07685659     5972   10.21396%
32.55801082     5400    9.23566%
36.93292359     4604    7.87426%
41.23715698     3685    6.30249%
45.62006198     3136    5.36353%
50.38765223     2501    4.27748%
55.34957161     1618    2.76728%
60.37095192      989    1.69149%
65.99939004      613    1.04842%
71.73292736      305    0.52164%
78.18427775      140    0.23944%
85.22261376       38    0.06499%
90.13115876       12    0.02052%
96.1987941         4    0.00684%
And with the PDF, you can now calculate the Expected Value (i.e. mean):
Bin            Count    PDF         EV
2.113262834     3085    5.27630%    0.111502092
6.091181608     3738    6.39313%    0.389417244
10.13907062     4441    7.59548%    0.770110873
14.38268188     5506    9.41696%    1.354410824
18.92107481     6260   10.70653%    2.025790219
23.52148965     6422   10.98360%    2.583505901
28.07685659     5972   10.21396%    2.86775877
32.55801082     5400    9.23566%    3.00694827
36.93292359     4604    7.87426%    2.908193747
41.23715698     3685    6.30249%    2.598965665
45.62006198     3136    5.36353%    2.446843872
50.38765223     2501    4.27748%    2.155321935
55.34957161     1618    2.76728%    1.531676732
60.37095192      989    1.69149%    1.021171415
65.99939004      613    1.04842%    0.691950026
71.73292736      305    0.52164%    0.374190474
78.18427775      140    0.23944%    0.187206877
85.22261376       38    0.06499%    0.05538763
90.13115876       12    0.02052%    0.018498245
96.1987941         4    0.00684%    0.006581183
Which gives:
Expected Value: 27.10543
Cumulative Distribution Function (CDF)
We can now also get the Cumulative Distribution Function (CDF):
Bin            Count    PDF         EV        CDF
2.113262834     3085    5.27630%    0.11150    5.27630%
6.091181608     3738    6.39313%    0.38942   11.66943%
10.13907062     4441    7.59548%    0.77011   19.26491%
14.38268188     5506    9.41696%    1.35441   28.68187%
18.92107481     6260   10.70653%    2.02579   39.38839%
23.52148965     6422   10.98360%    2.58351   50.37199%
28.07685659     5972   10.21396%    2.86776   60.58595%
32.55801082     5400    9.23566%    3.00695   69.82161%
36.93292359     4604    7.87426%    2.90819   77.69587%
41.23715698     3685    6.30249%    2.59897   83.99836%
45.62006198     3136    5.36353%    2.44684   89.36188%
50.38765223     2501    4.27748%    2.15532   93.63936%
55.34957161     1618    2.76728%    1.53168   96.40664%
60.37095192      989    1.69149%    1.02117   98.09814%
65.99939004      613    1.04842%    0.69195   99.14656%
71.73292736      305    0.52164%    0.37419   99.66820%
78.18427775      140    0.23944%    0.18721   99.90764%
85.22261376       38    0.06499%    0.05539   99.97264%
90.13115876       12    0.02052%    0.01850   99.99316%
96.1987941         4    0.00684%    0.00658  100.00000%
And the CDF is where we can start to get the values I want.
The median (50th percentile) is where the CDF reaches 50%.
From interpolation of the data, we can find the x value where the CDF is 50%:
Bin            Count    PDF         EV        CDF
18.92107481     6260   10.70653%    2.02579   39.38839%
23.52148965     6422   10.98360%    2.58351   50.37199%
t = (50 - 39.38839)/(50.37199 - 39.38839) = 10.61161/10.98360 = 0.96613
x_median = (1-t)*18.92107481 + t*23.52148965 = 23.366
So now we know:
Expected Value (mean): 27.10543
Median: 23.366
My original ask was the IQR: the x values that account for the middle 25%-75% of the values. Once again we can interpolate the CDF:
Bin            Count    PDF         EV        CDF
10.13907062     4441    7.59548%    0.77011   19.26491%
12.7235                                       25.00000%
14.38268188     5506    9.41696%    1.35441   28.68187%
23.366                                        50.00000% (median)
23.52148965     6422   10.98360%    2.58351   50.37199% (mode)
27.10543                                      (mean)
28.07685659     5972   10.21396%    2.86776   60.58595%
32.55801082     5400    9.23566%    3.00695   69.82161%
35.4351                                       75.00000%
36.93292359     4604    7.87426%    2.90819   77.69587%
This can be continued to get other useful stats:
Mean (aka average, expected value)
Median (50%)
Middle quintile (middle 20%)
IQR (middle 50% range)
middle 3 quintiles (middle 60%)
1 standard deviation range (middle 68.26%)
middle 80%
middle 90%
middle 95%
2 standard deviations range (middle 95.45%)
middle 99%
3 standard deviations range (middle 99.74%)
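As a concrete sketch of that interpolation step, here is how the quartiles can be pulled out of the 20-bucket histogram above (bin centres and counts copied from the table; the function name is mine):

import numpy as np

# Bin centres and counts taken from the 20-bucket histogram above.
bins   = np.array([2.113262834, 6.091181608, 10.13907062, 14.38268188, 18.92107481,
                   23.52148965, 28.07685659, 32.55801082, 36.93292359, 41.23715698,
                   45.62006198, 50.38765223, 55.34957161, 60.37095192, 65.99939004,
                   71.73292736, 78.18427775, 85.22261376, 90.13115876, 96.1987941])
counts = np.array([3085, 3738, 4441, 5506, 6260, 6422, 5972, 5400, 4604, 3685,
                   3136, 2501, 1618,  989,  613,  305,  140,   38,   12,    4])

cdf = np.cumsum(counts) / counts.sum() * 100        # percent

def percentile(p):
    """Linearly interpolate the bin value where the CDF crosses p percent."""
    return np.interp(p, cdf, bins)

q1, median, q3 = percentile(25), percentile(50), percentile(75)
print(q1, median, q3, "IQR:", q3 - q1)   # roughly 12.7, 23.4 and 35.4, as in the tables above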
Short Version
https://github.com/apache/spark/blob/4c7888dd9159dc203628b0d84f0ee2f90ab4bf13/sql/catalyst/src/main/java/org/apache/spark/sql/util/NumericHistogram.java

Word2Vec Subsampling -- Implementation

I am implementing the Skipgram model, both in PyTorch and TensorFlow 2. I am having doubts about the implementation of subsampling of frequent words. Verbatim from the paper, the probability of discarding (subsampling) word w_i is computed as
P(w_i) = 1 - sqrt(t / f(w_i))
where t is a custom threshold (usually a small value such as 0.0001) and f is the frequency of the word in the document. Although the authors implemented it in a different, but almost equivalent, way, let's stick with this definition.
When computing P(w_i), we can end up with negative values. For example, assume we have 100 words, and one of them appears extremely more often than the others (as is the case for my dataset).
import numpy as np
import seaborn as sns
np.random.seed(12345)
# generate counts in [1, 20]
counts = np.random.randint(low=1, high=20, size=99)
# add an extremely bigger count
counts = np.insert(counts, 0, 100000)
# compute frequencies
f = counts/counts.sum()
# define threshold as in paper
t = 0.0001
# compute probabilities as in paper
probs = 1 - np.sqrt(t/f)
sns.distplot(probs);
Q: What is the correct way to implement subsampling using this "probability"?
As additional info, I have seen that in keras the function keras.preprocessing.sequence.make_sampling_table takes a different approach:
def make_sampling_table(size, sampling_factor=1e-5):
    """Generates a word rank-based probabilistic sampling table.

    Used for generating the `sampling_table` argument for `skipgrams`.
    `sampling_table[i]` is the probability of sampling
    the i-th most common word in a dataset
    (more common words should be sampled less frequently, for balance).

    The sampling probabilities are generated according
    to the sampling distribution used in word2vec:

    ```
    p(word) = (min(1, sqrt(word_frequency / sampling_factor) /
        (word_frequency / sampling_factor)))
    ```

    We assume that the word frequencies follow Zipf's law (s=1) to derive
    a numerical approximation of frequency(rank):

    `frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))`
    where `gamma` is the Euler-Mascheroni constant.

    # Arguments
        size: Int, number of possible words to sample.
        sampling_factor: The sampling factor in the word2vec formula.

    # Returns
        A 1D Numpy array of length `size` where the ith entry
        is the probability that a word of rank i should be sampled.
    """
    gamma = 0.577
    rank = np.arange(size)
    rank[0] = 1
    inv_fq = rank * (np.log(rank) + gamma) + 0.5 - 1. / (12. * rank)
    f = sampling_factor * inv_fq
    return np.minimum(1., f / np.sqrt(f))
I tend to trust deployed code more than paper write-ups, especially in a case like word2vec, where the word2vec.c code released by the paper's authors has been widely used and served as the template for other implementations. If we look at its subsampling mechanism...
if (sample > 0) {
    real ran = (sqrt(vocab[word].cn / (sample * train_words)) + 1) * (sample * train_words) / vocab[word].cn;
    next_random = next_random * (unsigned long long)25214903917 + 11;
    if (ran < (next_random & 0xFFFF) / (real)65536) continue;
}
...we see that words with tiny counts (.cn), which could give negative values in the original formula, here instead give values greater than 1.0, and thus can never be less than the random comparison value, which is masked and scaled to never exceed 1.0 ((next_random & 0xFFFF) / (real)65536). So it seems the authors' intent was for all negative values of the original formula to mean "never discard".
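A minimal NumPy sketch of that keep-probability form, reusing the counts and threshold from the question's snippet (the helper name is mine):

import numpy as np

def keep_probability(f, t=1e-4):
    # word2vec.c form: p_keep = (sqrt(f/t) + 1) * (t/f), capped at 1.0.
    # Where the paper's 1 - sqrt(t/f) would go negative (rare words),
    # this expression exceeds 1.0, i.e. the word is never discarded.
    return np.minimum((np.sqrt(f / t) + 1) * (t / f), 1.0)

counts = np.insert(np.random.randint(1, 20, size=99), 0, 100000)
f = counts / counts.sum()
p_keep = keep_probability(f)

# subsample a token stream: keep token w with probability p_keep[w]
rng = np.random.default_rng(0)
kept = rng.random(len(f)) < p_keep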
As per the keras make_sampling_table() comment & implementation, they're not consulting the actual word-frequencies at all. Instead, they're assuming a Zipf-like distribution based on word-rank order to synthesize a simulated word-frequency.
If their assumptions were to hold – the related words are from a natural-language corpus with a Zipf-like frequency-distribution – then I'd expect their sampling probabilities to be close to down-sampling probabilities that would have been calculated from true frequency information. And that's probably "close enough" for most purposes.
I'm not sure why they chose this approximation. Perhaps other aspects of their usual processes have not maintained true frequencies through to this step, and they're expecting to always be working with natural-language texts, where the assumed frequencies will be generally true.
(As luck would have it, and because people often want to impute frequencies to public sets of word-vectors which have dropped the true counts but are still sorted from most- to least-frequent, just a few days ago I wrote an answer about simulating a fake-but-plausible distribution using Zipf's law – similar to what this keras code is doing.)
But, if you're working with data that doesn't match their assumptions (as with your synthetic or described datasets), their sampling-probabilities will be quite different than what you would calculate yourself, with any form of the original formula that uses true word frequencies.
In particular, imagine a distribution with one token appearing a million times, then a hundred tokens all appearing just 10 times each. Those hundred tokens' order in the "rank" list is arbitrary; truly, they're all tied in frequency. But the simulation-based approach, by fitting a Zipfian distribution to that ordering, will in fact sample each of them very differently. The 10-occurrence word lucky enough to be in the 2nd rank position will be far more downsampled, as if it were far more frequent. And the 1st-rank "tall head" value, by having its true frequency under-approximated, will be less down-sampled than otherwise. Neither of those effects seems beneficial, or in the spirit of the frequent-word-downsampling option, which should only "thin out" very frequent words, and in all cases leave words with the same frequency as each other in the original corpus roughly equivalently present to each other in the down-sampled corpus.
So for your case, I would go with the original formula (probability-of-discarding-that-requires-special-handling-of-negative-values), or the word2vec.c practical/inverted implementation (probability-of-keeping-that-saturates-at-1.0), rather than the keras-style approximation.
(As a totally-separate note that nonetheless may be relevant for your dataset/purposes, if you're using negative-sampling: there's another parameter controlling the relative sampling of negative examples, often fixed at 0.75 in early implementations, that one paper has suggested can usefully vary for non-natural-language token distributions & recommendation-related end-uses. This parameter is named ns_exponent in the Python gensim implementation, but simply a fixed power value internal to a sampling-table pre-calculation in the original word2vec.c code.)

Understanding Data Leakage and getting perfect score by exploiting test data

I have read an article on data leakage. In a hackathon there are two sets of data: train data, on which participants train their algorithm, and a test set, on which performance is measured.
Data leakage makes it possible to get a perfect score on the test data without even looking at the train data, by exploiting the leak.
I have read the article, but I am missing the crux of how the leakage is exploited.
The steps shown in the article are as follows:
Let's load the test data.
Note that we don't have any training data here, just test data. Moreover, we will not even use any features of the test objects. All we need to solve this task is the file with the indices of the pairs that we need to compare.
Let's load the data with test indices.
test = pd.read_csv('../test_pairs.csv')
test.head(10)
pairId FirstId SecondId
0 0 1427 8053
1 1 17044 7681
2 2 19237 20966
3 3 8005 20765
4 4 16837 599
5 5 3657 12504
6 6 2836 7582
7 7 6136 6111
8 8 23295 9817
9 9 6621 7672
test.shape[0]
368550
For example, we can think that there is a test dataset of images, and each image is assigned a unique Id from 0 to N−1 (N is the number of images). In the dataframe above, FirstId and SecondId point to these Ids and define the pairs that we should compare: e.g. do both images in the pair belong to the same class or not? So, for the first row: if the images with Id=1427 and Id=8053 belong to the same class, we should predict 1, and 0 otherwise.
But in our case we don't really care about the images, or how exactly we compare them (as long as the comparator is binary).
print(test['FirstId'].nunique())
print(test['SecondId'].nunique())
26325
26310
So the number of pairs we are given to classify is very very small compared to the total number of pairs.
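A quick back-of-the-envelope check of that claim, using the numbers printed above:

n_ids = 26325                                     # distinct FirstId values
total_possible_pairs = n_ids * (n_ids - 1) // 2   # roughly 346.5 million unordered pairs
print(368550 / total_possible_pairs)              # ~0.001, i.e. we only see about 0.1% of all possible pairs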
To exploit the leak we need to assume (or prove) that the total number of positive pairs is small compared to the total number of pairs. For example, think about an image dataset with 1000 classes and N images per class. If the task was to tell whether a pair of images belongs to the same class or not, we would have 1000*N*(N−1)/2 positive pairs, while the total number of pairs would be 1000*N*(1000*N−1)/2.
Another example: in the Quora competition the task was to classify whether a pair of questions are duplicates of each other or not. Of course, the total number of question pairs is huge, while the number of duplicates (positive pairs) is much, much smaller.
Finally, let's get the fraction of pairs of class 1. We just need to submit a constant prediction of "all ones" and check the returned accuracy. Create a dataframe with columns pairId and Prediction, fill it, export it to a .csv file, and then submit it:
test['Prediction'] = np.ones(test.shape[0])
sub=pd.DataFrame(test[['pairId','Prediction']])
sub.to_csv('sub.csv',index=False)
The all-ones submission gets an accuracy score of 0.500000.
So we assumed the total number of pairs is much higher than the number of positive pairs, but that is not the case for this test set. It means that the test set was constructed not by sampling random pairs, but with a specific sampling algorithm: pairs of class 1 are oversampled.
Now think: how can we exploit this fact? What is the leak here? If you get it now, you may try to get to the final answer yourself; otherwise you can follow the instructions below.
Building a magic feature
In this section we will build a magic feature that will solve the problem almost perfectly. The instructions will lead you to the correct solution, but please try to explain to yourself the purpose of the steps we take -- it is very important.
Incidence matrix
First, we need to build an incidence matrix. You can think of the pairs (FirstId, SecondId) as edges in an undirected graph.
The incidence matrix is a matrix of size (maxId + 1, maxId + 1), where each row (column) i corresponds to the i-th Id. In this matrix we put the value 1 at position [i, j] if and only if a pair (i, j) or (j, i) is present in the given set of pairs (FirstId, SecondId). All other elements in the incidence matrix are zeros.
Important! Incidence matrices are typically very, very sparse (a small number of non-zero values). At the same time they are usually huge in terms of total number of elements, so it is impossible to store them in memory in dense format. But due to their sparsity they can easily be represented as sparse matrices. If you are not familiar with sparse matrices, please see the wiki and the scipy.sparse reference. Please use any of the scipy.sparse constructors to build the incidence matrix.
For example, you can use this constructor: scipy.sparse.coo_matrix((data, (i, j))). We highly recommend learning to use the different scipy.sparse constructors and matrix types, but if you feel you don't want to use them, you can always build this matrix with a simple for loop. You will first need to create a matrix using scipy.sparse.coo_matrix((M, N), [dtype]) with an appropriate shape (M, N), and then iterate through the (FirstId, SecondId) pairs and fill the corresponding elements of the matrix with ones.
Note that the matrix should be symmetric and consist only of zeros and ones. This is a way to check yourself.
import networkx as nx
import numpy as np
import pandas as pd
import scipy.sparse
import matplotlib.pyplot as plt
test = pd.read_csv('../test_pairs.csv')
x = test[['FirstId','SecondId']].rename(columns={'FirstId':'col1', 'SecondId':'col2'})
y = test[['SecondId','FirstId']].rename(columns={'SecondId':'col1', 'FirstId':'col2'})
comb = pd.concat([x,y],ignore_index=True).drop_duplicates(keep='first')
comb.head()
col1 col2
0 1427 8053
1 17044 7681
2 19237 20966
3 8005 20765
4 16837 599
data = np.ones(comb.col1.shape, dtype=int)
inc_mat = scipy.sparse.coo_matrix((data, (comb.col1, comb.col2)), shape=(comb.col1.max() + 1, comb.col1.max() + 1))
inc_mat = inc_mat.tocsr()   # COO matrices do not support row indexing; convert to CSR first
rows_FirstId = inc_mat[test.FirstId.values, :]
rows_SecondId = inc_mat[test.SecondId.values, :]
f = rows_FirstId.multiply(rows_SecondId)
f = np.asarray(f.sum(axis=1))
f.shape
(368550, 1)
f = f.sum(axis=1)
f = np.squeeze(np.asarray(f))
print (f.shape)
Now build the magic feature
Why did we build the incidence matrix? We can think of the rows in this matrix as representations of the objects: the i-th row is the representation of the object with Id = i. Then, to measure the similarity between two objects, we can measure the similarity between their representations. And we will see that such representations are very good.
Now select the rows from the incidence matrix that correspond to test.FirstId's and test.SecondId's.
So do not forget to convert the pd.Series to np.array.
These lines should normally run very quickly
rows_FirstId = inc_mat[test.FirstId.values,:]
rows_SecondId = inc_mat[test.SecondId.values,:]
Our magic feature will be the dot product between the representations of a pair of objects. The dot product can be regarded as a similarity measure: for our non-negative representations the dot product is close to 0 when the representations are different, and is large when the representations are similar.
Now compute the dot product between corresponding rows of the rows_FirstId and rows_SecondId matrices.
From magic feature to binary predictions
But how do we convert this feature into binary predictions? We do not have a train set to learn a model from, but we have a piece of information about the test set: the baseline accuracy score that you got when submitting the constant prediction. And we also have very strong considerations about the data-generating process, so probably we will be fine even without a training set.
We may try to choose a threshold, and set the prediction to 1 if the feature value f is higher than the threshold, and 0 otherwise. What threshold would you choose?
How do we find the right threshold? Let's first examine this feature: print the frequencies (or counts) of each value in the feature f.
For example, use the np.unique function and check its flags.
Function to count the frequency of each element:
from scipy.stats import itemfreq
itemfreq(f)
array([[ 14, 183279],
[ 15, 852],
[ 19, 546],
[ 20, 183799],
[ 21, 6],
[ 28, 54],
[ 35, 14]])
Do you see how this feature clusters the pairs? Maybe you can guess a good threshold by looking at the values?
In fact, in other situations it can be not that obvious, but in general to pick a threshold you only need to remember the score of your baseline submission and use this information.
Choose a threshold below:
pred = f > 14 # SET THRESHOLD HERE
pred
array([ True, False, True, ..., False, False, False], dtype=bool)
submission = test.loc[:,['pairId']]
submission['Prediction'] = pred.astype(int)
submission.to_csv('submission.csv', index=False)
I want to understand the idea behind this. How are we exploiting the leak using the test data only?
There's a hint in the article. The number of positive pairs should be 1000*N*(N−1)/2, while the number of all pairs is 1000*N*(1000*N−1)/2. Of course, the number of all pairs would be much, much larger if the test set had been sampled at random.
As the author mentions, after you evaluate your constant prediction of 1s on the test set, you can tell that the sampling was not done at random. The accuracy you obtain is 50%. Had the sampling been done correctly (i.e. at random), this value would have been much lower.
Thus, they construct the incidence matrix and calculate the dot product (the measure of similarity) between the representations of the ID features. They then reuse the information about the accuracy obtained with the constant prediction (50%) to pick the corresponding threshold (f > 14). It is set to be greater than 14 because that captures roughly half of the test set, which in turn maps back to the 50% accuracy.
The "magic" value didn't have to be greater than 14; it could have been equal to 14. You could have adjusted this value after some leaderboard probing (as long as you're capturing half of the test set).
It was observed that the test data was not sampled properly; same-class pairs were oversampled. Thus each pair that is present in the test set has a much higher probability of having target=1 than any random pair. This led to the belief that one could construct a similarity measure based only on the pairs that are present in the test set, i.e. whether a pair made it into the test set is itself a strong indicator of similarity.
Using this insight one can calculate an incidence matrix and represent each id j as a binary array (the i-th element representing the presence of the i-j pair in the test set, and thus the strong probability of similarity between them). This is a pretty accurate measure, allowing one to find the "similarity" between two ids just by taking the dot product of their rows.
The cutoff arrived at is purely from the knowledge of the target distribution found by leaderboard probing.

How to find the nearest neighbors of 1 Billion records with Spark?

Given 1 billion records containing the following information:
ID x1 x2 x3 ... x100
1 0.1 0.12 1.3 ... -2.00
2 -1 1.2 2 ... 3
...
For each ID above, I want to find the top 10 closest IDs, based on Euclidean distance of their vectors (x1, x2, ..., x100).
What's the best way to compute this?
As it happens, I have a solution to this, involving combining sklearn with Spark: https://adventuresindatascience.wordpress.com/2016/04/02/integrating-spark-with-scikit-learn-visualizing-eigenvectors-and-fun/
The gist of it is:
Use sklearn's k-NN fit() method centrally
But then use sklearn's k-NN kneighbors() method in a distributed way; a rough sketch follows
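A rough, untested sketch of that split, assuming the fitted model comfortably fits in driver memory and that a SparkSession named spark already exists (all variable names here are mine):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Stand-in data; in practice the ids and vectors come from your own source
ids = np.arange(10_000)
vectors = np.random.rand(10_000, 100)

# Fit centrally on the driver; ask for 11 neighbours so the self-match can be dropped
nn = NearestNeighbors(n_neighbors=11).fit(vectors)
bc_nn = spark.sparkContext.broadcast(nn)

def query_partition(rows):
    rows = list(rows)
    if not rows:
        return
    vecs = np.array([vec for _, vec in rows])
    dists, idxs = bc_nn.value.kneighbors(vecs)                 # runs on the executor
    for (rid, _), d, i in zip(rows, dists, idxs):
        yield rid, list(zip(i[1:].tolist(), d[1:].tolist()))   # drop the self-match, keep the next 10

rdd = spark.sparkContext.parallelize(list(zip(ids, vectors)), numSlices=16)
top10 = rdd.mapPartitions(query_partition).collect()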
Performing a brute-force comparison of all records against all records is a losing battle. My suggestion would be to go for a ready-made implementation of the k-Nearest Neighbors algorithm, such as the one provided by scikit-learn, then broadcast the resulting arrays of indices and distances and go from there.
Steps in this case would be:
1- vectorize the features as Bryce suggested, and have your vectorizing method return a list (or numpy array) of floats with as many elements as your features
2- fit your scikit-learn NN to your data:
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=10, algorithm='auto').fit(vectorized_data)
3- run the trained algorithm on your vectorized data (training and query data are the same in your case)
distances, indices = nbrs.kneighbors(qpa)
Steps 2 and 3 will run on your pyspark node and are not parallelizable in this case. You will need to have enough memory on this node. In my case with 1.5 Million records and 4 features, it took a second or two.
Until we get a good implementation of NN for Spark, I guess we will have to stick to these workarounds. If you'd rather try something new, then go for http://spark-packages.org/package/saurfang/spark-knn
You haven't provided a lot of detail, but the general approach I would take to this problem would be to:
Convert the records to a data structure like a LabeledPoint with (ID, x1..x100) as label and features
Map over each record and compare that record to all the other records (lots of room for optimization here)
Create some cutoff logic so that once you start comparing ID = 5 with ID = 1 you interrupt the computation because you have already compared ID = 1 with ID = 5
Some reduce step to get a data structure like {id_pair: [1,5], distance: 123}
Another map step to find the 10 closest neighbors of each record
You've identified pyspark and I generally do this type of work using scala, but some pseudo code for each step might look like:
# 1. vectorize the features
def vectorize_raw_data(record):
    arr_of_features = record[1:101]
    return LabeledPoint(record[0], arr_of_features)

# 2, 3 + 4: map over each record for comparison
broadcast_var = []

def calc_distance(record, comparison):
    # here you want to keep a broadcast variable with a list or dictionary of
    # already-compared IDs, and break if the key pair already exists;
    # then calc the Euclidean distance by mapping over the features of
    # the record, subtracting the values, squaring the result, keeping
    # a running sum of those squares, and square-rooting that sum
    return {"id_pair": [1, 5], "distance": 123}

for record in allRecords:
    for comparison in allRecords:
        broadcast_var.append(calc_distance(record, comparison))

# 5. map for the 10 closest neighbours of each record
def closest_neighbors(record, n=10):
    return broadcast_var.filter(lambda x: record.id in x["id_pair"]).takeOrdered(n, key=lambda x: x["distance"])
The pseudocode is terrible, but I think it communicates the intent. There will be a lot of shuffling and sorting here, as you are comparing all records with all other records. IMHO, you want to store the key-pair/distance in a central place (like a broadcast variable that gets updated, though this is dangerous) to reduce the total number of Euclidean distance calculations you perform.

Ways to calculate similarity

I am building a community website that requires me to calculate the similarity between any two users. Each user is described with the following attributes:
age, skin type (oily, dry), hair type (long, short, medium), lifestyle (active outdoor lover, TV junky) and others.
Can anyone tell me how to go about this problem or point me to some resources?
Another option is to compute (in R) all the pairwise dissimilarities (distances) between observations in the data set. The original variables may be of mixed types; the handling of nominal, ordinal, and (a)symmetric binary data is achieved by using the general dissimilarity coefficient of Gower (Gower, J. C. (1971), "A general coefficient of similarity and some of its properties", Biometrics 27, 857–874). For more, check out this on page 47. If x contains any columns of these data types, Gower's coefficient will be used as the metric.
For example
x1 <- factor(c(10, 12, 25, 14, 29))
x2 <- factor(c("oily", "dry", "dry", "dry", "oily"))
x3 <- factor(c("medium", "short", "medium", "medium", "long"))
x4 <- factor(c("active outdoor lover", "TV junky", "TV junky", "active outdoor lover", "TV junky"))
x <- cbind(x1,x2,x3,x4)
library(cluster)
daisy(x, metric = "euclidean")
you'll get:
Dissimilarities :
1 2 3 4
2 2.000000
3 3.316625 2.236068
4 2.236068 1.732051 1.414214
5 4.242641 3.741657 1.732051 2.645751
If you are interested in a method for dimensionality reduction for categorical data (which is also a way to arrange variables into homogeneous clusters), check this.
Give each attribute an appropriate weight, and add the differences between values.
enum SkinType
    Dry, Medium, Oily

enum HairLength
    Bald, Short, Medium, Long

UserDifference(user1, user2)
    total := 0
    total += abs(user1.Age - user2.Age) * 0.1
    total += abs((int)user1.Skin - (int)user2.Skin) * 0.5
    total += abs((int)user1.Hair - (int)user2.Hair) * 0.8
    # etc...
    return total
If you really need similarity instead of difference, use 1 / UserDifference(a, b)
You should probably take a look at:
Data Mining and Data Warehousing (Essential)
Machine Learning (Extra)
Artificial Neural Networks (Especially SOM)
Pattern Recognition (Related)
These topics will let your program recognize similarities and clusters in your user collection and try to adapt to them...
You can then discover different hidden groups of related users (e.g. users with green hair usually do not like watching TV).
As advice, try to use ready-made tools for this feature instead of implementing it yourself...
Take a look at Open Directory Data Mining Projects
Three steps to achieve a simple subjective metric for difference between two datapoints that might work fine in your case:
Capture all your variables in a representative numeric variable, for example: skin type (oily=-1, dry=1), hair type (long=2, short=0, medium=1),lifestyle (active outdoor lover=1, TV junky=-1), age is a number.
Scale all numeric ranges so that they fit the relative importance you give them for indicating difference. For example: an age difference of 10 years is about as significant as the difference between long and medium hair, or the difference between oily and dry skin. So 10 on the age scale is as different as 1 on the hair scale is as different as 2 on the skin scale; scale the difference in age by 0.1, that in hair by 1, and that in skin by 0.5.
Use an appropriate distance metric to combine the differences between two people on the various scales into one overall difference. The smaller this number, the more similar they are. I'd suggest the simple quadratic difference as a first attempt at your distance function.
Then the difference between two people could be calculated with (I assume Person.age, .skin, .hair, etc. have already gone through step 1 and are numeric):
double Difference(Person p1, Person p2) {
    double agescale = 0.1;
    double skinscale = 0.5;
    double hairscale = 1;
    double lifestylescale = 1;
    double agediff = (p1.age - p2.age) * agescale;
    double skindiff = (p1.skin - p2.skin) * skinscale;
    double hairdiff = (p1.hair - p2.hair) * hairscale;
    double lifestylediff = (p1.lifestyle - p2.lifestyle) * lifestylescale;
    // note: square by multiplication (or pow), not ^, which is bitwise XOR in C
    double diff = sqrt(agediff*agediff + skindiff*skindiff + hairdiff*hairdiff + lifestylediff*lifestylediff);
    return diff;
}
Note that diff in this example is not on a nice scale like (0..1). Its value can range from 0 (no difference) to something large (high difference). Also, this method is almost completely unscientific; it is just designed to quickly give you a working difference metric.
Look at algorithms for computing string difference. It's very similar to what you need. Store your attributes as a bit string and compute the distance between the strings.
You should read about these two topics:
The most popular clustering algorithm, k-means
And similarity matrices, which are essential in clustering

Resources