Retrieving documents based on matrix multiplication - pytorch

I have a model that represents a collection of documents in a multidimensional vector space. For example, for 100k documents, the model represents each document as a 300-dimensional vector, so I end up with a matrix of size [100K, 300]. To retrieve documents by relevance to a given query, I use matrix multiplication: I represent the query as a [300, 1] vector and compute the cosine similarity scores as follows:
[100K, 300]*[300, 1] = [100K, 1].
Now how can I retrieve the top 1000 documents from this collection with the highest cosine similarity? The trivial way would be to sort by cosine similarity and grab the first 1000 docs. Is there a function in PyTorch to retrieve the documents this way?
I mean, how can I get the indices of the highest 1000 values from a 1D torch tensor?

Once you have the similarity scores after the dot product, you can get the top 1000 indices as follows (note that torch.argsort sorts in ascending order by default, so sort in descending order):
sims = sims.view(-1)  # flatten the [100K, 1] scores to a 1D tensor
top_indices = torch.argsort(sims, descending=True)[:1000]
similar_docs = sims[top_indices]

I think you are looking for torch.topk. It returns both the values and the indices of the k largest elements.
For example:
x = torch.arange(100).view(-1,1)
x.shape
torch.Size([100, 1])
value, indices = x.topk(k=10, dim=0)
value
tensor([[99],
[98],
[97],
[96],
[95],
[94],
[93],
[92],
[91],
[90]])
indices
tensor([[99],
[98],
[97],
[96],
[95],
[94],
[93],
[92],
[91],
[90]])
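Applied to the retrieval setting from the question, a minimal sketch (the sims tensor below is random, standing in for the [100K, 1] similarity scores):
import torch

sims = torch.rand(100_000, 1)                # stand-in for the [100K, 300] x [300, 1] scores
values, indices = sims.topk(k=1000, dim=0)   # top 1000 scores and their document indices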

Related

Combining vectors in Gensim Word2Vec vocabulary

Gensim's Word2Vec model has a great method which allows you to find the top n most similar words in the model's vocabulary given a list of positive words and negative words.
wv.most_similar(positive=['word1', 'word2', 'word3'],
negative=['word4','word5'], topn=10)
What I am looking to do is create a word vector that represents an averaged or summed vector of the input positive and negative words. I am hoping to use this new vector to compare to other vectors.
Something like this:
newVector = 'word1' + 'word2' + 'word3' - 'word4' - 'word5'
I know that vectors can be summed, but I am not sure if that is the best option. I am hoping to find out exactly how the above function (most_similar) combines the positive vectors and negative vectors, and if Gensim has a function to do so. Thank you in advance.
Gensim does not expose a separate function to add/subtract the (unit-normed) vectors in the same way that most_similar() does.
Perhaps it should, as that could be generally useful, including in sharing code between other existing methods.
But as an open-source project, you can look at its exact Python code for that operation, and use it as a model for your own calculations.
For the current code defining that function, see:
https://github.com/RaRe-Technologies/gensim/blob/ee3d6fd1e33fe39fc7aa31ebd56bd63b1a2a2ed6/gensim/models/keyedvectors.py#L687
Following the advice above, I looked at the Gensim source code and copied their method for averaging the vectors. Here is the code in case it helps anyone else.
Note: this code is copied from gensim and is just reformatted to return the averaged vector.
from gensim import matutils
import numpy as np
from numpy import ndarray, array, float32 as REAL
KEY_TYPES = (str, int, np.integer)
'''
FUNCTION : meanVector(...)
INPUT :
keyedVectors : word vectors or keyed vectors from gensim model, (model.wv)
positive : list of words or vectors to be applied positively [default = list()]
negative : list of words or vectors to be applied negatively [default = list()]
OUTPUT :
averaged word vector, [type = numpy.ndarray]
DESCRIPTION :
allows for simple averaging of positive and negative words and vectors given a gensim model's word vector library.
'''
def meanVector(keyedVectors, positive=list(), negative=list()):
    positive = [
        (item, 1.0) if isinstance(item, KEY_TYPES + (ndarray,))
        else item for item in positive
    ]
    negative = [
        (item, -1.0) if isinstance(item, KEY_TYPES + (ndarray,))
        else item for item in negative
    ]
    # compute the weighted average of all keys
    all_keys, mean = set(), []
    for key, weight in positive + negative:
        if isinstance(key, ndarray):
            mean.append(weight * key)
        else:
            mean.append(weight * keyedVectors.get_vector(key, norm=True))
            if keyedVectors.has_index_for(key):
                all_keys.add(keyedVectors.get_index(key))
    if not mean:
        raise ValueError("cannot compute similarity with no input")
    mean = matutils.unitvec(array(mean).mean(axis=0)).astype(REAL)
    return mean
Note: this has not been thoroughly tested.
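For completeness, a hypothetical usage sketch; the toy Word2Vec model below exists only to illustrate the call, substitute your own trained model's model.wv:
from gensim.models import Word2Vec

sentences = [["word1", "word2", "word3", "word4", "word5"]]   # toy corpus, illustration only
model = Word2Vec(sentences, vector_size=50, min_count=1)

newVector = meanVector(model.wv,
                       positive=["word1", "word2", "word3"],
                       negative=["word4", "word5"])
print(newVector.shape)   # (50,), the unit-normed averaged vector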

A vector and matrix rows cosine similarity in pytorch

In pytorch, I have multiple (on the scale of hundreds of thousands) 300-dim vectors (which I think I should load into a matrix), and I want to sort them by their cosine similarity to another vector and extract the top 1000. I want to avoid a for loop as it is time-consuming. I was looking for an efficient solution.
You can use the torch.nn.functional.cosine_similarity function for computing cosine similarity, and torch.argsort to extract the top 1000.
Here is an example:
import torch
import torch.nn.functional as F

x = torch.rand(10000, 300)
y = torch.rand(1, 300)
dist = F.cosine_similarity(x, y)
index_sorted = torch.argsort(dist, descending=True)  # sort from most to least similar
top_1000 = index_sorted[:1000]
Please note the shape of y; don't forget to reshape before calling the similarity function. Also note that argsort simply returns the indices of the closest vectors. To access those vectors themselves, just write x[top_1000], which will return a matrix shaped (1000, 300).

Understanding Data Leakage and getting perfect score by exploiting test data

I have read an article on data leakage. In a hackathon there are two sets of data: train data, on which participants train their algorithm, and a test set, on which performance is measured.
Data leakage helps in getting a perfect score on the test data, without even viewing the train data, by exploiting the leak.
I have read the article, but I am missing the crux of how the leakage is exploited.
The steps shown in the article are the following:
Let's load the test data.
Note, that we don't have any training data here, just test data. Moreover, we will not even use any features of test objects. All we need to solve this task is the file with the indices for the pairs, that we need to compare.
Let's load the data with test indices.
test = pd.read_csv('../test_pairs.csv')
test.head(10)
pairId FirstId SecondId
0 0 1427 8053
1 1 17044 7681
2 2 19237 20966
3 3 8005 20765
4 4 16837 599
5 5 3657 12504
6 6 2836 7582
7 7 6136 6111
8 8 23295 9817
9 9 6621 7672
test.shape[0]
368550
For example, we can think that there is a test dataset of images, and each image is assigned a unique Id from 0 to N−1 (N -- is the number of images). In the dataframe from above FirstId and SecondId point to these Id's and define pairs, that we should compare: e.g. do both images in the pair belong to the same class or not. So, for example for the first row: if images with Id=1427 and Id=8053 belong to the same class, we should predict 1, and 0 otherwise.
But in our case we don't really care about the images, and how exactly we compare the images (as long as comparator is binary).
print(test['FirstId'].nunique())
print(test['SecondId'].nunique())
26325
26310
So the number of pairs we are given to classify is very very small compared to the total number of pairs.
To exploit the leak we need to assume (or prove) that the total number of positive pairs is small compared to the total number of pairs. For example: think about an image dataset with 1000 classes and N images per class. If the task was to tell whether a pair of images belongs to the same class or not, we would have 1000*N*(N−1)/2 positive pairs, while the total number of pairs would be 1000*N*(1000*N−1)/2.
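To make that ratio concrete, a quick back-of-the-envelope computation (the numbers here are purely illustrative, e.g. N = 100 images per class):
classes, N = 1000, 100                               # hypothetical dataset
positive_pairs = classes * N * (N - 1) // 2          # same-class pairs
total_pairs = classes * N * (classes * N - 1) // 2   # all possible pairs
print(positive_pairs / total_pairs)                  # ~0.001, so only ~0.1% of pairs are positive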
Another example: in the Quora competition the task was to classify whether a pair of questions are duplicates of each other or not. Of course, the total number of question pairs is huge, while the number of duplicates (positive pairs) is much, much smaller.
Finally, let's get the fraction of pairs of class 1. We just need to submit a constant prediction of "all ones" and check the returned accuracy. Create a dataframe with columns pairId and Prediction, fill it, export it to a .csv file, and then submit it:
test['Prediction'] = np.ones(test.shape[0])
sub=pd.DataFrame(test[['pairId','Prediction']])
sub.to_csv('sub.csv',index=False)
The all-ones submission gets an accuracy score of 0.500000.
So, we assumed the total number of pairs is much higher than the number of positive pairs, but that is not the case for the test set. It means that the test set is constructed not by sampling random pairs, but with a specific sampling algorithm: pairs of class 1 are oversampled.
Now think: how can we exploit this fact? What is the leak here? If you get it now, you may try to get to the final answer yourself; otherwise you can follow the instructions below.
Building a magic feature
In this section we will build a magic feature, that will solve the problem almost perfectly. The instructions will lead you to the correct solution, but please, try to explain the purpose of the steps we do to yourself -- it is very important.
Incidence matrix
First, we need to build an incidence matrix. You can think of the pairs (FirstId, SecondId) as edges in an undirected graph.
The incidence matrix is a matrix of size (maxId + 1, maxId + 1), where each row (column) i corresponds to the i-th Id. In this matrix we put the value 1 at position [i, j] if and only if a pair (i, j) or (j, i) is present in the given set of pairs (FirstId, SecondId). All the other elements in the incidence matrix are zeros.
Important! Incidence matrices are typically very, very sparse (a small number of non-zero values). At the same time, incidence matrices are usually huge in terms of the total number of elements, and it is impossible to store them in memory in dense format. But due to their sparsity, incidence matrices can easily be represented as sparse matrices. If you are not familiar with sparse matrices, please see the wiki and the scipy.sparse reference. Please use any of the scipy.sparse constructors to build the incidence matrix.
For example, you can use this constructor: scipy.sparse.coo_matrix((data, (i, j))). We highly recommend learning to use the different scipy.sparse constructors and matrix types, but if you feel you don't want to use them, you can always build this matrix with a simple for loop. You will first need to create a matrix using scipy.sparse.coo_matrix((M, N), [dtype]) with an appropriate shape (M, N), and then iterate through the (FirstId, SecondId) pairs and fill the corresponding elements in the matrix with ones.
Note that the matrix should be symmetric and consist only of zeros and ones; it is a way to check yourself (a quick sanity check is sketched right after the code below).
import networkx as nx
import numpy as np
import pandas as pd
import scipy.sparse
import matplotlib.pyplot as plt
test = pd.read_csv('../test_pairs.csv')
x = test[['FirstId','SecondId']].rename(columns={'FirstId':'col1', 'SecondId':'col2'})
y = test[['SecondId','FirstId']].rename(columns={'SecondId':'col1', 'FirstId':'col2'})
comb = pd.concat([x,y],ignore_index=True).drop_duplicates(keep='first')
comb.head()
col1 col2
0 1427 8053
1 17044 7681
2 19237 20966
3 8005 20765
4 16837 599
data = np.ones(comb.col1.shape, dtype=int)
inc_mat = scipy.sparse.coo_matrix((data, (comb.col1, comb.col2)), shape=(comb.col1.max() + 1, comb.col1.max() + 1))
inc_mat = inc_mat.tocsr()  # COO matrices do not support row indexing, so convert to CSR first
rows_FirstId = inc_mat[test.FirstId.values, :]
rows_SecondId = inc_mat[test.SecondId.values, :]
f = rows_FirstId.multiply(rows_SecondId)
f = np.asarray(f.sum(axis=1))
f.shape
(368550, 1)
f = f.sum(axis=1)
f = np.squeeze(np.asarray(f))
print (f.shape)
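As the quick sanity check mentioned above (the matrix should be symmetric and contain only zeros and ones), something like this can be run right after building inc_mat:
assert abs(inc_mat - inc_mat.T).nnz == 0   # symmetric
assert inc_mat.max() == 1                  # non-zero entries are all ones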
Now build the magic feature
Why did we build the incidence matrix? We can think of the rows of this matrix as representations of the objects: the i-th row is a representation of the object with Id = i. Then, to measure the similarity between two objects, we can measure the similarity between their representations. And we will see that such representations are very good.
Now select the rows from the incidence matrix that correspond to test.FirstId's and test.SecondId's.
Do not forget to convert the pd.Series to an np.array.
These lines should normally run very quickly:
rows_FirstId = inc_mat[test.FirstId.values,:]
rows_SecondId = inc_mat[test.SecondId.values,:]
Our magic feature will be the dot product between the representations of a pair of objects. The dot product can be regarded as a similarity measure: for our non-negative representations the dot product is close to 0 when the representations are different, and is large when the representations are similar.
Now compute the dot product between corresponding rows of the rows_FirstId and rows_SecondId matrices.
From magic feature to binary predictions
But how do we convert this feature into binary predictions? We do not have a train set to learn a model from, but we do have a piece of information about the test set: the baseline accuracy score that you got when submitting the constant. And we also have very strong considerations about the data-generative process, so probably we will be fine even without a training set.
We may try to choose a threshold, and set the predictions to 1 if the feature value f is higher than the threshold, and 0 otherwise. What threshold would you choose?
How do we find the right threshold? Let's first examine this feature: print the frequencies (or counts) of each value in the feature f.
For example, use the np.unique function and check its flags (return_counts in particular).
Function to count the frequency of each element:
from scipy.stats import itemfreq
itemfreq(f)
array([[ 14, 183279],
[ 15, 852],
[ 19, 546],
[ 20, 183799],
[ 21, 6],
[ 28, 54],
[ 35, 14]])
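Note that scipy.stats.itemfreq has since been deprecated and removed from newer SciPy releases; np.unique gives the same counts:
values, counts = np.unique(f, return_counts=True)
print(np.column_stack([values, counts]))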
Do you see how this feature clusters the pairs? Maybe you can guess a good threshold by looking at the values?
In fact, in other situations it may not be that obvious, but in general, to pick a threshold you only need to remember the score of your baseline submission and use this information.
Choose a threshold below:
pred = f > 14 # SET THRESHOLD HERE
pred
array([ True, False, True, ..., False, False, False], dtype=bool)
submission = test.loc[:,['pairId']]
submission['Prediction'] = pred.astype(int)
submission.to_csv('submission.csv', index=False)
I want to understand the idea behind this. How are we exploiting the leak using the test data only?
There's a hint in the article. The number of positive pairs should be 1000*N*(N−1)/2, while the number of all pairs is 1000*N*(1000*N−1)/2. Of course, the number of all pairs is much, much larger if the test set was sampled at random.
As the author mentions, after you evaluate your constant prediction of 1s on the test set, you can tell that the sampling was not done at random. The accuracy you obtain is 50%. Had the sampling been done correctly, this value should've been much lower.
Thus, they construct the incidence matrix and calculate the dot product (the measure of similarity) between the representations of our ID features. They then reuse the information about the accuracy obtained with constant predictions (at 50%) to obtain the corresponding threshold (f > 14). It's set to be greater than 14 because that constitutes roughly half of our test set, which in turn maps back to the 50% accuracy.
The "magic" value didn't have to be greater than 14. It could have been equal to 14. You could have adjusted this value after some leader board probing (as long as you're capturing half of the test set).
It was observed that the test data was not sampled properly; same-class pairs were oversampled. Thus each pair in the test set has a much higher probability of having target=1 than any random pair. This led to the belief that one could construct a similarity measure based only on the pairs that are present in the test set, i.e., whether a pair made it into the test set is itself a strong indicator of similarity.
Using this insight one can calculate an incidence matrix and represent each id j as a binary array (the i-th element representing the presence of i-j pair in test, and thus representing the strong probability of similarity between them). This is a pretty accurate measure, allowing one to find the "similarity" between two rows just by taking their dot product.
The cutoff is arrived at purely through knowledge of the target distribution found by leaderboard probing.

Reducing Memory requirements for distance/adjacency matrix

Based on a subsample of around 5,000 to 100,000 word embeddings (GloVe, 300-dimensional), I need to construct an adjacency matrix, i.e. a matrix of 1's and 0's indicating whether the euclidean (or cosine) distance between two words is smaller than x.
Currently, I'm using scipy.spatial.distance.pdist:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import pdist, squareform
distances = pdist(common_model, 'euclidean')
adjacency = (distances <= 0.4)
adjacency = csr_matrix(squareform(adjacency), dtype=np.uint8)
With increasing vocabulary size, my memory fills up rather quickly and pdist fails with a MemoryError (when common_model has the shape (91938, 300) and contains float64).
Iterating the model manually and creating the adjacency directly without the distance matrix in between would be a way, but that was extremely slow.
Is there another way to construct the adjacency matrix in a time- and memory-optimal way?
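One way to keep memory bounded (a sketch of the chunked approach mentioned above, under the assumption that common_model is the (vocab_size, 300) embedding array and the threshold is 0.4) is to compute the distance matrix one block of rows at a time and keep only the sparse thresholded result:
import numpy as np
from scipy.sparse import csr_matrix, vstack
from scipy.spatial.distance import cdist

def chunked_adjacency(common_model, threshold=0.4, chunk_size=1000):
    # only a (chunk_size, vocab_size) dense distance slice is held in memory at any time
    blocks = []
    for start in range(0, common_model.shape[0], chunk_size):
        chunk = common_model[start:start + chunk_size]
        d = cdist(chunk, common_model, 'euclidean')
        blocks.append(csr_matrix(d <= threshold, dtype=np.uint8))
    return vstack(blocks, format='csr')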

Creating a list of (kxn) matrix

I am trying to run a K-means algorithm to obtain the lowest cost, which results in a KxN matrix. The value of K is determined by the number of clusters the algorithm creates with optimal cost; for example, K=2 would imply 2 clusters (or 2 centroids), while N is the number of features. K-means is run in a loop for K=1 to 10, and the loop stops when the best optimal cost is obtained for a particular value of K. For example, if the optimal cost is obtained for K=2, the centroid returned would be a 2xN matrix. I want to store all the centroids returned by the loop in a list. Please note that with every iteration of the loop the value of K increments by K=K+1, so the centroids returned would be of size 1xN, 2xN, 3xN, and so on.
How do I store these in a list such that I can get something like this:
List= [[10,12,13], [[10,20,30],[1,2,3]], [[5,6,9],[4,12,20],[40,50,60]],...
With every loop iteration I return a KxN matrix which I want to store in a list. I want to access the list later by an index, say List[i], to retrieve the KxN matrix.
I am mostly working with numpy.
Any suggestions would be a big help.
import numpy as np

N = 5
lst = []
for K in range(1, 11):
    lst.append(np.empty((K, N)))  # replace np.empty((K, N)) with the KxN centroids from each run
