Python multiprocessing same function with different arguments - python-3.x

I have a program that is designed to calculate an order parameter from a coarse-grained molecular system. In the system I have different beads, which represent different parts of a molecule. Each of these beads has xyz-coordinates that represent its place in the system. The program works, but it is very slow, since I have to calculate the number of beads of type i around beads of type j within a certain cutoff distance.
Function to calculate Euclidean distance between bead a and b:
def distance_ab(a, b):
    n_beads = 0
    for i in range(len(a)):
        for j in range(len(b)):
            # Euclidean distance
            dist = np.sqrt(np.sum((a[i] - b[j]) ** 2, axis=0))
            if dist <= 1.0 and dist > 0.0:  # cut-off distance
                n_beads += 1
    return n_beads
So I decided to speed up the distance calculation between different beads by using the Python multiprocessing library. But for some reason I cannot get multiprocessing to work for repeating the same distance-calculation function with different parameters (xyz-data of the beads). Multiprocessing returns a list of numbers, when the idea is to return only one number (the number of beads within a certain cut-off distance). What am I doing wrong, and could someone help me understand where the problem is?
The part where I am trying to use multiprocessing:
with multiprocessing.Pool(os.cpu_count()) as pool:
    # go through a certain number of molecular simulation frames (e.g. 100 frames)
    for i in range(frames):
        # Calculating Euclidean distances between different types of beads
        # for each frame
        a_b = pool.starmap(calculate_distances, zip(bead_a_array, bead_b_array))
        a_c = pool.starmap(calculate_distances, zip(bead_a_array, bead_c_array))

When you zip together your bead arrays, you are creating an iterable of tuples that overall has the same length as the shorter of the two arrays.
>>> A=[1,2,3]
>>> B=[4,5,6,7]
>>> res=zip(A,B)
>>> list(res)
[(1, 4), (2, 5), (3, 6)]
Looking at the documentation for starmap:
Like map() except that the elements of the iterable are expected to be iterables that are unpacked as arguments.
Hence an iterable of [(1,2), (3, 4)] results in [func(1,2), func(3,4)].
So your starmaps are actually just passing a pair of elements (one from each array) to your function and returning the result for each of these pairs. If you want to use multiprocessing to determine, say, the number of b and c beads within the cutoff distance of a at the same time, you would need to do something like this:
import itertools as it

all_arr = [bead_a_array, bead_b_array, bead_c_array]
with multiprocessing.Pool(os.cpu_count()) as pool:
    a_counts = pool.starmap(distance_ab, it.combinations(all_arr, 2))
Here, instead of passing individual elements of each array to the function, it now passes each array into your function, and it will compute the counts of b and c within the cutoff of a (and of c within the cutoff of b) simultaneously. The it.combinations(all_arr, 2) selects unique pairs of arrays to pass to your function.
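If the goal is instead to speed up a single a-vs-b count, one option is to split one of the arrays into chunks, count per chunk in separate processes, and sum the partial counts. This is only a sketch: count_within_cutoff and the random data are hypothetical, and it assumes bead_a_array and bead_b_array are (N, 3) NumPy arrays of xyz coordinates.
import multiprocessing
import os
import numpy as np

def count_within_cutoff(a_chunk, b, cutoff=1.0):
    # Count pairs (i, j) with 0 < dist <= cutoff for one chunk of a
    n_beads = 0
    for i in range(len(a_chunk)):
        # Vectorised distances from bead a_chunk[i] to all beads in b
        dist = np.sqrt(np.sum((b - a_chunk[i]) ** 2, axis=1))
        n_beads += int(np.count_nonzero((dist > 0.0) & (dist <= cutoff)))
    return n_beads

if __name__ == "__main__":
    # Hypothetical data: 1000 beads of each type with random xyz coordinates
    bead_a_array = np.random.rand(1000, 3) * 10
    bead_b_array = np.random.rand(1000, 3) * 10

    chunks = np.array_split(bead_a_array, os.cpu_count())
    with multiprocessing.Pool(os.cpu_count()) as pool:
        partial = pool.starmap(count_within_cutoff,
                               [(chunk, bead_b_array) for chunk in chunks])
    a_b = sum(partial)  # one number: beads of type b near beads of type a
    print(a_b)
Each worker returns one partial count, so summing them gives the single number the question asks for.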

Related

multithreaded iteration over numpy array indices

I have a piece of code which iterates over a three-dimensional array and writes into each cell a value based on the indices and the current value itself:
import numpy as np

nx = ny = nz = 100
array = np.zeros((nx, ny, nz))

def fun(val, k):
    # Do something with the indices
    return val + (k[0] * k[1] * k[2])

with np.nditer(array, flags=['multi_index'], op_flags=['readwrite']) as it:
    for x in it:
        x[...] = fun(x, it.multi_index)
Note, that fun might do something more sophisticated, which takes most of the total runtime, and that the input arrays might have different lengths per axis.
However, this code could run in multiple threads, as fun can be assumed to be threadsafe (Only the value and index of the current cell are required). But finding a method to iterate over all cells and have the current index available seems to be hard.
A possible solution might be https://stackoverflow.com/a/58012407/446140, where the array is split by the x-axis into chunks and passed to a Pool.
However, the solution is not universally applicable and I wonder if there is a more general solution for this problem (which could also work with nD arrays)?
The first issue is to split up the 3D array into equally sized chunks. np.array_split can be used, but the offset of each of the splits has to be stored to get the correct indices again.
An interesting question, with a few possible solutions. As you indicated, it is possible to use np.array_split, but since we are only interested in the indices, we can also use np.unravel_index, which would mean that we only have to loop over all the indices (the size) of the array to get the index.
Now there are two great ideas for multiprocessing:
Create a (thread-safe) shared memory of the array and split the indices across the different processes.
Only update the array in a main thread, but provide a copy of the required data to the processes and let them return the value that has to be updated.
Both solutions will work for any np.ndarray, but they have different advantages. Creating shared memory avoids copies, but can incur a large insertion penalty if a process has to wait on other processes (when the computational time is small compared to the write time).
There are probably many more solutions, but I will work out the first solution, where a Shared Memory object is created and a range of indices is provided to every process.
Required imports:
import itertools
import numpy as np
import multiprocessing as mp
from multiprocessing import shared_memory
Shared Numpy arrays
The main problem with applying multiprocessing on np.ndarray's is that memory sharing between processes can be difficult. For this the following class can be used:
class SharedNumpy:
    __slots__ = ('arr', 'shm', 'name', 'shared',)

    def __init__(self, arr: np.ndarray = None):
        if arr is not None:
            self.shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
            self.arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=self.shm.buf)
            self.name = self.shm.name
            np.copyto(self.arr, arr)

    def __getattr__(self, item):
        if hasattr(self.arr, item):
            return getattr(self.arr, item)
        raise AttributeError(f"{self.__class__.__name__}, doesn't have attribute {item!r}")

    def __str__(self):
        return str(self.arr)

    @classmethod
    def from_name(cls, name, shape, dtype):
        memory = cls(arr=None)
        memory.shm = shared_memory.SharedMemory(name)
        memory.arr = np.ndarray(shape, dtype=dtype, buffer=memory.shm.buf)
        memory.name = name
        return memory

    @property
    def dtype(self):
        return self.arr.dtype

    @property
    def shape(self):
        return self.arr.shape
This makes it possible to create a shared memory object in the main process and then use SharedNumpy.from_name to get it in other processes.
Simple test
A quick (non threaded) test would be:
def simple_test():
    data = np.array(np.zeros((5,) * 2))
    mem_primary = SharedNumpy(arr=data)
    mem_second = SharedNumpy.from_name(name=mem_primary.name, shape=data.shape, dtype=data.dtype)

    assert mem_primary.name == mem_second.name, "Different memory names"
    assert np.array_equal(mem_primary.arr, mem_second.arr), "Different array values."

    mem_primary.arr[2] = 5
    assert np.array_equal(mem_primary.arr, mem_second.arr), "Different array values."
    print("Completed 3/3 tests...")
A threaded test will follow later!
Distribution
The next part is focused on providing the processes with the necessary data. In this case we will provide every process with a range of indices that it has to calculate and all the data that is required to load the shared memory.
The inputs of this function are dim, the number of numpy axes, and size, the number of elements per axis.
def distributed(size, dim):
    memory = SharedNumpy(arr=np.zeros((size,) * dim))
    split_size = np.int64(np.ceil(memory.arr.size / mp.cpu_count()))

    settings = dict(
        memory=itertools.repeat(memory.name),
        shape=itertools.repeat(memory.arr.shape),
        dtype=itertools.repeat(memory.arr.dtype),
        start=np.arange(mp.cpu_count()),
        num=itertools.repeat(split_size)
    )

    with mp.Pool(mp.cpu_count()) as pool:
        pool.starmap(fun, zip(*settings.values()))

    print(f"\n\nDone {dim}D, size: {size}, elements: {size ** dim}")
    return memory
Notes:
By using starmap instead of map, it is possible to provide multiple input arguments (a list of arguments for every process).
(also see docs starmap)
itertools.repeat is used to add constants to the starmap
(also see: zip() in python, how to use static values)
By using np.unravel_index, we only need a start index and the chunk size per process.
The start and num values define which chunk of flat indices each process converts, via range(start * num, (start + 1) * num) (see the short example below).
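As a small illustration of that index mapping (not part of the original answer), a flat index and a shape are enough to recover the nD index:
import numpy as np

# Flat index 7 in a (3, 4) array corresponds to row 1, column 3 (7 = 1*4 + 3)
print(np.unravel_index(7, (3, 4)))   # (1, 3)

# A process given start=1 and num=4 would handle flat indices 4..7
print(list(range(1 * 4, 2 * 4)))     # [4, 5, 6, 7]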
Testing
For the testing I am using different input sizes and dimensions. Since the data grows as size ** dim, I limited the test to a size of 128 and 3 dimensions (that is 2,097,152 points, which already starts taking quite a bit of time).
Code
fun
def fun(name, shape, dtype, start, num):
    memory = SharedNumpy.from_name(name, shape=shape, dtype=dtype)
    for idx in range(start * num, min((start + 1) * num, memory.arr.size)):
        # Do something with the indices
        indices = np.unravel_index([idx], shape)
        memory.arr[indices] += np.prod(indices)
    memory.shm.close()  # Closes the shared memory for this process.
Running the example
if __name__ == '__main__':
    for size in [5, 10, 15]:
        for dim in [1, 2, 3]:
            memory = distributed(size, dim)
            print(memory)
            memory.shm.unlink()
For the OP's code, I used their code with a small addition that allows the array to have different sizes and dimensions; in any case I use:
def sequential(size, dim):
    array = np.zeros((size,) * dim)
    ...
Looking at the output arrays of both versions shows that they produce the same results.
Plots
The code for the graphs has been taken from the reply in:
https://codereview.stackexchange.com/questions/165245/plot-timings-for-a-range-of-inputs
With the minor alteration that labels was changed to codes in
empty_multi_index = pd.MultiIndex(levels=[[], []], codes=[[], []], names=['func', 'result'])
Here 1d, 2d and 3d refer to the dimensions, and the input is the size.
Sequential (OP code): [timing plot]
Distributed (this code): [timing plot]
Results
This method works on an arbitrarily sized numpy array and can perform an operation on the indices of the array. It gives you full access to the whole numpy array, so it can also be used to perform different kinds of statistical analysis that do not change the array.
From the timings it can be seen that for small data shapes the distributed version has little to no advantage, because of the extra complexity of creating the processes. However, for larger amounts of data it starts to become more effective.
I only timed it with short computational delays (a simple fun), but with more complex calculations it should outperform the sequential version much sooner.
Extra
If you are only interested in operations that are performed over or along axes, these numpy functions might help you vectorize your solution instead of using multiprocessing (see the sketch after this list):
np.apply_over_axes
np.apply_along_axis
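For the specific toy fun above, which depends only on the indices, a fully vectorised version (a sketch, not from the original answer) avoids both the Python loop and multiprocessing entirely:
import numpy as np

nx = ny = nz = 100
array = np.zeros((nx, ny, nz))

# np.fromfunction builds an array whose element (i, j, k) is i * j * k,
# which is exactly the value fun adds to each cell in the loop version.
array += np.fromfunction(lambda i, j, k: i * j * k, array.shape)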

What is the meaning of "value" in a node in sklearn decisiontree plot_tree

I plotted my sklearn decision tree using the plot_tree function. The nodes have a structure like the one in my screenshot, but I don't understand what value = [2417, 1059] means. In other nodes there are other values. Thanks for explaining.
DecisionTreeClassifier:
value in a DecisionTreeClassifier is the class split in each node's samples.
Keep in mind it might also be weighted if you weighted your classes on the call to fit().
For example:
cw={0: 0.6495288248337029, 1: 2.1719184430027805}
Taking the true node, your true class split is calculated as:
>>> [3819.229 / cw[0], 1216.274 / cw[1]]
[5880, 560]
And if it's not clear, your criterion is calculated on the weighted split:
>>> a, b = 3819.229, 1216.274
>>> ab = a + b
>>> (-(a / ab)*math.log2(a / ab)) - ((b / ab)*math.log2(b / ab))
0.7975914228753467
DecisionTreeRegressor:
value in a DecisionTreeRegressor is the value that the tree would predict for a new example falling in that node. If your criterion is MSE, you'll find that value is an average measure of the samples in that node.
For example:
(Data: Seaborn's "dots" example dataset.)
A depth-1 regressor tree fitted on coherence to predict firing_rate. It's not a very useful tree, but it illustrates the idea.
Taking the true node, value is calculated as:
>>> value = data[data.coherence <= 19.2].firing_rate.mean()
>>> value
40.48326118418657
squared_error for that node is:
>>> ((data[data.coherence <= 19.2].firing_rate - value)**2).mean()
134.6504380931471
They indicate the number of samples per class that you have at that step.
For example, your picture shows that before the split on "hops <= 5" you have 2417 samples of class 0 and 1059 samples of class 1.
Note that if you sum these two values, you obtain the same number (3476) as the "samples" parameter.
If the tree works, you will observe how the data splits better at every step. In a final leaf you will see clear values like [300, 2]; then you can say that those samples are (almost all) class 0.
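As a quick way to see this yourself, here is a sketch with made-up data (the exact numbers will differ from the screenshot): the per-node class splits live in the fitted tree's tree_.value attribute. Depending on your scikit-learn version, these may be stored as raw (weighted) counts or as fractions normalised per node.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical binary-classification data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Root node: value holds the class split, n_node_samples the samples reaching the node
print(clf.tree_.value[0])
print(clf.tree_.n_node_samples[0])   # 500
print(np.bincount(y))                # class counts in the full data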

Hyperopt list of values per hyperparameter

I'm trying to use Hyperopt on a regression model such that one of its hyperparameters is defined per variable and needs to be passed as a list. For example, if I have a regression with 3 independent variables (excluding constant), I would pass hyperparameter = [x, y, z] (where x, y, z are floats).
The values of this hyperparameter have the same bounds regardless of which variable they are applied to. If this hyperparameter was applied to all variables, I could simply use hp.uniform('hyperparameter', a, b). What I want the search space to be instead is a cartesian product of hp.uniform('hyperparameter', a, b) of length n, where n is the number of variables in a regression (so, basically, itertools.product(hp.uniform('hyperparameter', a, b), repeat = n))
I'd like to know whether this is possible within Hyperopt. If not, any suggestions for an optimizer where this is possible are welcome.
As noted in my comment, I am not 100% sure what you are looking for, but here is an example of using hyperopt to optimize a combination of 3 variables:
import random

# define an objective function
def objective(args):
    v1 = args['v1']
    v2 = args['v2']
    v3 = args['v3']
    result = random.uniform(v2, v3) / v1
    return result

# define a search space
from hyperopt import hp
space = {
    'v1': hp.uniform('v1', 0.5, 1.5),
    'v2': hp.uniform('v2', 0.5, 1.5),
    'v3': hp.uniform('v3', 0.5, 1.5),
}

# minimize the objective over the space
from hyperopt import fmin, tpe, space_eval
best = fmin(objective, space, algo=tpe.suggest, max_evals=100)
print(best)
They all have the same search space in this case (as I understand it, that was your problem definition). Hyperopt aims to minimize the objective function, so running this will end up with v2 and v3 near their minimum values and v1 near its maximum value, since that most directly minimizes the result of the objective function.
You could use this function to create the space:
def get_spaces(a, b, num_spaces=9):
    return_set = {}
    for set_num in range(num_spaces):
        name = str(set_num)
        return_set = {
            **return_set,
            **{name: hp.uniform(name, a, b)}
        }
    return return_set
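A hedged usage sketch of the function above (the objective is a made-up placeholder): hyperopt accepts a dict space directly, and the objective then receives a dict whose values can be turned back into the per-variable list.
from hyperopt import fmin, hp, tpe

def get_spaces(a, b, num_spaces=9):
    # condensed version of get_spaces above: one hp.uniform per variable
    return {str(i): hp.uniform(str(i), a, b) for i in range(num_spaces)}

def objective(params):
    hyperparameter = [params[str(i)] for i in range(len(params))]  # list of floats
    return sum(v ** 2 for v in hyperparameter)  # placeholder loss

space = get_spaces(0.0, 1.0, num_spaces=3)
best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
print(best)   # e.g. {'0': ..., '1': ..., '2': ...}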
I would first define my pre-combinatorial space as a dict. The keys are names. The values are a tuple.
from hyperopt import hp
space = {'foo': (hp.choice, (False, True)), 'bar': (hp.quniform, 1, 10, 1)}
Next, produce the required combinatorial variants using loops or itertools. Each name is kept unique using a suffix or prefix.
types = (1, 2)
space = {f'{name}_{type_}': args for type_ in types for name, args in space.items()}
>>> space
{'foo_1': (<function hyperopt.pyll_utils.hp_choice(label, options)>,
(False, True)),
'bar_1': (<function hyperopt.pyll_utils.hp_quniform(label, *args, **kwargs)>,
1, 10, 1),
'foo_2': (<function hyperopt.pyll_utils.hp_choice(label, options)>,
(False, True)),
'bar_2': (<function hyperopt.pyll_utils.hp_quniform(label, *args, **kwargs)>,
1, 10, 1)}
Finally, initialize and create the actual hyperopt space:
space = {name: fn(name, *args) for name, (fn, *args) in space.items()}
values = tuple(space.values())
>>> space
{'foo_1': <hyperopt.pyll.base.Apply at 0x7f291f45d4b0>,
'bar_1': <hyperopt.pyll.base.Apply at 0x7f291f45d150>,
'foo_2': <hyperopt.pyll.base.Apply at 0x7f291f45d420>,
'bar_2': <hyperopt.pyll.base.Apply at 0x7f291f45d660>}
This was done with hyperopt 0.2.7. As a disclaimer, I strongly advise against using hyperopt because in my experience it has significantly poorer performance relative to other optimizers.
I implemented this solution with Optuna. The advantage of Optuna is that it creates a hyperparameter space for all the individual values, but optimizes these values in a more intelligent way, using a single hyperparameter optimization. For example, I optimized a neural network over the batch size, learning rate and dropout rate:
The search space is much larger than the actual values being used. This saves a lot of time compared to a grid search.
The pseudo-code of the implementation is:
def objective(trial):  # trial is Optuna's parameter object, which suggests the next hyperparameter values
    a = trial.suggest_float("a", 0, 1)  # a uniform distribution over [0, 1]
    b = trial.suggest_float("b", 0, 1)
    return (a * b) - b
# The function above is what Optuna tries to optimize/minimize
For more detailed source code, visit the Optuna documentation. It saved a lot of time for me and gave a really good result.
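For reference, here is a runnable sketch of the same idea with Optuna's standard API (the objective is a toy placeholder and assumes Optuna is installed):
import optuna

def objective(trial):
    # One suggested float per variable; Optuna samples them jointly
    a = trial.suggest_float("a", 0.0, 1.0)
    b = trial.suggest_float("b", 0.0, 1.0)
    return (a * b) - b

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
print(study.best_params, study.best_value)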

Why does choosing different datatype yield different results?

While calculating the entries of the matrix, different results are obtained when the dtype is defined as float and as numpy.complex64, respectively. As some of the entries are large numbers (around 10^15), the difference turns out to be large.
import numpy as np
import sympy as sy

npc = int(2)
k_pnc = [[E1*A1*kl_1*sy.cot(kl_1*l1), -E1*A1*kl_1*sy.csc(kl_1*l1), 0],
         [-E1*A1*kl_1*sy.csc(kl_1*l1), (E1*A1*kl_1*sy.cot(kl_1*l1) + E2*A2*kl_2*sy.cot(kl_2*l2)), -E2*A2*kl_2*sy.csc(kl_2*l2)],
         [0, -E2*A2*kl_2*sy.csc(kl_2*l2), E2*A2*kl_2*sy.cot(kl_2*l2)]]

KG = np.zeros((2*npc + 2, 2*npc + 2), dtype=np.complex64)
for i in range(0, 2*npc, 2):
    KG[i:i + 3, i:i + 3] = k_pnc + KG[i:i + 3, i:i + 3]

KG[2*npc:2*npc+2, 2*npc:2*npc+2] = KG[2*npc:2*npc+2, 2*npc:2*npc+2] + E1*A1*kl_1*np.array([[sy.cot(kl_1*l1), -sy.csc(kl_1*l1)], [-sy.csc(kl_1*l1), sy.cot(kl_1*l1)]])
KG[-1, -1] = KG[-1, -1] + sy.I*E1*A1*kl_1
Here the E's, A's, kl's and l's are parameters.
The matrix needed a complex number in its last element. As I started to get unnatural results, I removed the complex number added to the last element, KG[-1,-1], and compared against dtype=float, which gave me different results.
When checked, the KG matrix holds different values in each case, even though the calculations are the same.
Why does choosing a different datatype yield different results?
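For intuition, here is a minimal sketch of the precision difference at this magnitude: np.complex64 stores 32-bit real and imaginary parts (roughly 7 significant digits), so entries around 10^15 lose everything below about 10^8, while float/complex128 keep roughly 15-16 digits.
import numpy as np

# Adding 1.0 to 1e15 survives in double precision...
print(np.float64(1e15) + np.float64(1.0) == np.float64(1e15))            # False
# ...but is completely lost in single precision (the components of complex64)
print(np.float32(1e15) + np.float32(1.0) == np.float32(1e15))            # True
# complex64 vs complex128 behave the same way
print(np.complex64(1e15) + np.complex64(1.0) == np.complex64(1e15))      # True
print(np.complex128(1e15) + np.complex128(1.0) == np.complex128(1e15))   # False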

Understanding Data Leakage and getting perfect score by exploiting test data

I have read an article on data leakage. In a hackathon there are two sets of data: train data, on which participants train their algorithm, and a test set, on which performance is measured.
Data leakage helps in getting a perfect score on the test data, without viewing the train data, by exploiting the leak.
I have read the article, but I am missing the crux of how the leakage is exploited.
Steps as shown in article are following:
Let's load the test data.
Note, that we don't have any training data here, just test data. Moreover, we will not even use any features of test objects. All we need to solve this task is the file with the indices for the pairs, that we need to compare.
Let's load the data with test indices.
test = pd.read_csv('../test_pairs.csv')
test.head(10)
pairId FirstId SecondId
0 0 1427 8053
1 1 17044 7681
2 2 19237 20966
3 3 8005 20765
4 4 16837 599
5 5 3657 12504
6 6 2836 7582
7 7 6136 6111
8 8 23295 9817
9 9 6621 7672
test.shape[0]
368550
For example, we can think that there is a test dataset of images, and each image is assigned a unique Id from 0 to N−1 (N -- is the number of images). In the dataframe from above FirstId and SecondId point to these Id's and define pairs, that we should compare: e.g. do both images in the pair belong to the same class or not. So, for example for the first row: if images with Id=1427 and Id=8053 belong to the same class, we should predict 1, and 0 otherwise.
But in our case we don't really care about the images, and how exactly we compare the images (as long as comparator is binary).
print(test['FirstId'].nunique())
print(test['SecondId'].nunique())
26325
26310
So the number of pairs we are given to classify is very very small compared to the total number of pairs.
To exploit the leak we need to assume (or prove) that the total number of positive pairs is small compared to the total number of pairs. For example: think about an image dataset with 1000 classes and N images per class. Then, if the task was to tell whether a pair of images belongs to the same class or not, we would have 1000*N*(N−1)/2 positive pairs, while the total number of pairs is 1000*N*(1000*N−1)/2 (see the quick arithmetic below).
Another example: in the Quora competition the task was to classify whether a pair of questions are duplicates of each other or not. Of course, the total number of question pairs is huge, while the number of duplicates (positive pairs) is much, much smaller.
Finally, let's get a fraction of pairs of class 1. We just need to submit a constant prediction "all ones" and check the returned accuracy. Create a dataframe with columns pairId and Prediction, fill it and export it to .csv file. Then submit
test['Prediction'] = np.ones(test.shape[0])
sub=pd.DataFrame(test[['pairId','Prediction']])
sub.to_csv('sub.csv',index=False)
The all-ones submission gets an accuracy score of 0.500000.
So, we assumed the total number of pairs is much higher than the number of positive pairs, but that is not the case for this test set. It means that the test set was constructed not by sampling random pairs, but with a specific sampling algorithm: pairs of class 1 are oversampled.
Now think: how can we exploit this fact? What is the leak here? If you get it now, you may try to get to the final answer yourself; otherwise you can follow the instructions below.
Building a magic feature
In this section we will build a magic feature, that will solve the problem almost perfectly. The instructions will lead you to the correct solution, but please, try to explain the purpose of the steps we do to yourself -- it is very important.
Incidence matrix
First, we need to build an incidence matrix. You can think of pairs (FirstId, SecondId) as of edges in an undirected graph.
The incidence matrix is a matrix of size (maxId + 1, maxId + 1), where each row (column) i corresponds to the i-th Id. In this matrix we put the value 1 at position [i, j] if and only if a pair (i, j) or (j, i) is present in the given set of pairs (FirstId, SecondId). All the other elements in the incidence matrix are zeros.
Important! Incidence matrices are typically very, very sparse (a small number of non-zero values). At the same time, incidence matrices are usually huge in terms of the total number of elements, and it is impossible to store them in memory in dense format. But due to their sparsity, incidence matrices can easily be represented as sparse matrices. If you are not familiar with sparse matrices, please see the wiki and the scipy.sparse reference. Please use any of the scipy.sparse constructors to build the incidence matrix.
For example, you can use this constructor: scipy.sparse.coo_matrix((data, (i, j))). We highly recommend learning to use the different scipy.sparse constructors and matrix types, but if you feel you don't want to use them, you can always build this matrix with a simple for loop. You will first need to create a matrix using scipy.sparse.coo_matrix((M, N), [dtype]) with an appropriate shape (M, N), and then iterate through the (FirstId, SecondId) pairs and fill the corresponding elements of the matrix with ones.
Note, that the matrix should be symmetric and consist only of zeros and ones. It is a way to check yourself.
import networkx as nx
import numpy as np
import pandas as pd
import scipy.sparse
import matplotlib.pyplot as plt
test = pd.read_csv('../test_pairs.csv')
x = test[['FirstId','SecondId']].rename(columns={'FirstId':'col1', 'SecondId':'col2'})
y = test[['SecondId','FirstId']].rename(columns={'SecondId':'col1', 'FirstId':'col2'})
comb = pd.concat([x,y],ignore_index=True).drop_duplicates(keep='first')
comb.head()
col1 col2
0 1427 8053
1 17044 7681
2 19237 20966
3 8005 20765
4 16837 599
data = np.ones(comb.col1.shape, dtype=int)
inc_mat = scipy.sparse.coo_matrix((data,(comb.col1,comb.col2)), shape=(comb.col1.max() + 1, comb.col1.max() + 1))
rows_FirstId = inc_mat[test.FirstId.values,:]
rows_SecondId = inc_mat[test.SecondId.values,:]
f = rows_FirstId.multiply(rows_SecondId)
f = np.asarray(f.sum(axis=1))
f.shape
(368550, 1)
f = f.sum(axis=1)
f = np.squeeze(np.asarray(f))
print (f.shape)
Now build the magic feature
Why did we build the incidence matrix? We can think of the rows in this matrix as representations of the objects. The i-th row is a representation of the object with Id = i. Then, to measure similarity between two objects, we can measure similarity between their representations. And we will see that such representations are very good.
Now select the rows from the incidence matrix, that correspond to test.FirstId's, and test.SecondId's.
So do not forget to convert pd.series to np.array
These lines should normally run very quickly
rows_FirstId = inc_mat[test.FirstId.values,:]
rows_SecondId = inc_mat[test.SecondId.values,:]
Our magic feature will be the dot product between representations of a pair of objects. Dot product can be regarded as similarity measure -- for our non-negative representations the dot product is close to 0 when the representations are different, and is huge, when representations are similar.
Now compute dot product between corresponding rows in rows_FirstId and rows_SecondId matrices.
From magic feature to binary predictions
But how do we convert this feature into binary predictions? We do not have a train set to learn a model from, but we have a piece of information about the test set: the baseline accuracy score that you got when submitting the constant. And we also have very strong considerations about the data-generating process, so probably we will be fine even without a training set.
We may try to choose a threshold, and set the predictions to 1 if the feature value f is higher than the threshold, and 0 otherwise. What threshold would you choose?
How do we find a right threshold? Let's first examine this feature: print frequencies (or counts) of each value in the feature f.
For example use np.unique function, check for flags
Function to count frequency of each element
from scipy.stats import itemfreq
itemfreq(f)
array([[ 14, 183279],
[ 15, 852],
[ 19, 546],
[ 20, 183799],
[ 21, 6],
[ 28, 54],
[ 35, 14]])
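Note that scipy.stats.itemfreq has since been deprecated (and removed in recent SciPy releases); np.unique with return_counts=True gives the same table, assuming f is the feature computed above:
import numpy as np

values, counts = np.unique(f, return_counts=True)
print(np.column_stack((values, counts)))   # same counts as itemfreq(f)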
Do you see how this feature clusters the pairs? Maybe you can guess a good threshold by looking at the values?
In fact, in other situations it may not be that obvious, but in general, to pick a threshold you only need to remember the score of your baseline submission and use that information.
Choose a threshold below:
pred = f > 14 # SET THRESHOLD HERE
pred
array([ True, False, True, ..., False, False, False], dtype=bool)
submission = test.loc[:,['pairId']]
submission['Prediction'] = pred.astype(int)
submission.to_csv('submission.csv', index=False)
I want to understand the idea behind this. How are we exploiting the leak using only the test data?
There's a hint in the article. The number of positive pairs should be 1000*N*(N−1)/2, while the number of all pairs is 1000*N*(1000*N−1)/2. Of course, the number of all pairs is much, much larger if the test set is sampled at random.
As the author mentions, after you evaluate your constant prediction of 1s on the test set, you can tell that the sampling was not done at random. The accuracy you obtain is 50%. Had the sampling been done correctly, this value should've been much lower.
Thus, they construct the incidence matrix and calculate the dot product (the measure of similarity) between the representations of our ID features. They then reuse the information about the accuracy obtained with constant predictions (at 50%) to obtain the corresponding threshold (f > 14). It's set to be greater than 14 because that constitutes roughly half of our test set, which in turn maps back to the 50% accuracy.
The "magic" value didn't have to be greater than 14. It could have been equal to 14. You could have adjusted this value after some leader board probing (as long as you're capturing half of the test set).
It was observed that the test data was not sampled properly; same-class pairs were oversampled. Thus there is a much higher probability of each pair in the training set to have target=1 than any random pair. This led to the belief that one could construct a similarity measure based only on the pairs that are present in the test, i.e., whether a pair made it to the test is itself a strong indicator of similarity.
Using this insight one can calculate an incidence matrix and represent each id j as a binary array (the i-th element representing the presence of i-j pair in test, and thus representing the strong probability of similarity between them). This is a pretty accurate measure, allowing one to find the "similarity" between two rows just by taking their dot product.
The cutoff arrived at comes purely from knowledge of the target distribution found by leaderboard probing.
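A tiny illustration of that similarity measure (made-up rows, not taken from the actual incidence matrix): the dot product of two binary rows simply counts the Ids that both of them are paired with.
import numpy as np

row_a = np.array([1, 0, 1, 1, 0, 1])   # Ids that a appears with in the test pairs
row_b = np.array([1, 0, 1, 0, 0, 1])   # Ids that b appears with
print(row_a @ row_b)                   # 3 common neighbours -> high similarity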
