Cosine similarity between a vector and matrix rows in PyTorch

In PyTorch, I have on the order of a hundred thousand 300-dimensional vectors (which I think I should load into a single matrix). I want to sort them by their cosine similarity with another vector and extract the top 1000. I want to avoid a for loop because it is time-consuming, so I am looking for an efficient solution.

You can use the torch.nn.functional.cosine_similarity function to compute the cosine similarities and torch.argsort to extract the top 1000.
Here is an example:
import torch
import torch.nn.functional as F

x = torch.rand(10000, 300)
y = torch.rand(1, 300)
dist = F.cosine_similarity(x, y)                      # shape (10000,)
index_sorted = torch.argsort(dist, descending=True)  # most similar first
top_1000 = index_sorted[:1000]
Please note the shape of y; don't forget to reshape it to (1, 300) before calling the similarity function. Also note that argsort (with descending=True) simply returns the indices of the closest vectors. To access those vectors themselves, just write x[top_1000], which will return a matrix of shape (1000, 300).
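As an aside, if you only need the top 1000 entries, torch.topk avoids sorting the full tensor; a minimal sketch using the same x and y as above:

top_scores, top_1000 = torch.topk(F.cosine_similarity(x, y), k=1000)
closest = x[top_1000]   # shape (1000, 300), most similar rows first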

Related

Element-wise multiplication with specified channel in pytorch

I have three tensors with shapes a = (B,12,512,512), b = (B,12,512,512), c = (B,2,512,512).
I want to compute a * c[:,0,:,:] + b * c[:,1,:,:].
As far as I know, gradient calculation does not support indexing operations, so I need to implement this calculation without using indexing. How can I implement it in a vectorized way with PyTorch?
Thanks.
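One possible vectorized approach (a sketch, not from the original thread) is to stack a and b along a new dimension and let broadcasting apply the two channels of c:

import torch

B = 2
a = torch.rand(B, 12, 512, 512)
b = torch.rand(B, 12, 512, 512)
c = torch.rand(B, 2, 512, 512)

ab = torch.stack([a, b], dim=1)            # (B, 2, 12, 512, 512)
result = (ab * c.unsqueeze(2)).sum(dim=1)  # (B, 12, 512, 512)
# equivalent to a * c[:, 0] + b * c[:, 1], but without explicit channel indexing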

Combining vectors in Gensim Word2Vec vocabulary

Gensim's Word2Vec model has a great method which allows you to find the top n most similar words in the model's vocabulary given a list of positive words and negative words.
wv.most_similar(positive=['word1', 'word2', 'word3'],
negative=['word4','word5'], topn=10)
What I am looking to do is create a word vector that represents an averaged or summed vector of the input positive and negative words. I am hoping to use this new vector to compare against other vectors.
Something like this:
newVector = 'word1' + 'word2' + 'word3' - 'word4' - 'word5'
I know that vectors can be summed, but I am not sure if that is the best option. I am hoping to find out exactly how the above function (most_similar) combines the positive vectors and negative vectors, and if Gensim has a function to do so. Thank you in advance.
Gensim does not expose a separate function to add/subtract the (unit-normed) vectors in the same way that most_similar() does.
Perhaps it should, as that could be generally useful, including in sharing code between other existing methods.
But as an open-source project, you can look at its exact Python code for that operation, and use it as a model for your own calculations.
For the current code defining that function, see:
https://github.com/RaRe-Technologies/gensim/blob/ee3d6fd1e33fe39fc7aa31ebd56bd63b1a2a2ed6/gensim/models/keyedvectors.py#L687
From the advice above, I chose to look at the Gensim source code and copy its method for averaging the vectors. Here is the code in case it helps anyone else.
Note: this code is copied from Gensim and is just reformatted to return the averaged vector.
from gensim import matutils
import numpy as np
from numpy import ndarray, array, float32 as REAL
KEY_TYPES = (str, int, np.integer)
'''
FUNCTION : meanVector(...)
INPUT :
keyedVectors : word vectors or keyed vectors from gensim model, (model.wv)
positive : list of words or vectors to be applied positively [default = list()]
negative : list of words or vectors to be applied negatively [default = list()]
OUTPUT :
averaged word vector, [type = numpy.ndarray]
DESCRIPTION :
allows for simple averaging of positive and negative words and vectors given a gensim model's word vector library.
'''
def meanVector(keyedVectors, positive=list(), negative=list()):
    positive = [
        (item, 1.0) if isinstance(item, KEY_TYPES + (ndarray,))
        else item for item in positive
    ]
    negative = [
        (item, -1.0) if isinstance(item, KEY_TYPES + (ndarray,))
        else item for item in negative
    ]
    # compute the weighted average of all keys
    all_keys, mean = set(), []
    for key, weight in positive + negative:
        if isinstance(key, ndarray):
            mean.append(weight * key)
        else:
            mean.append(weight * keyedVectors.get_vector(key, norm=True))
            if keyedVectors.has_index_for(key):
                all_keys.add(keyedVectors.get_index(key))
    if not mean:
        raise ValueError("cannot compute similarity with no input")
    mean = matutils.unitvec(array(mean).mean(axis=0)).astype(REAL)
    return mean
Note: this has not been thoroughly tested.
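A brief usage sketch (the model name and word lists here are placeholders, not from the original post); the returned vector is unit-normalized, so it can be compared to other vectors directly:

newVector = meanVector(model.wv,
                       positive=['word1', 'word2', 'word3'],
                       negative=['word4', 'word5'])
# compare the combined vector against the rest of the vocabulary
print(model.wv.similar_by_vector(newVector, topn=10))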

Minimize the cosine similarity of two tensors and output one scalar - PyTorch

I use the PyTorch cosine similarity function as follows. I have two feature vectors and my goal is to make them dissimilar to each other, so I thought I could minimize their cosine similarity. I have some doubts about the way I have coded it and would appreciate your suggestions on the following questions.
I don't know why there are some negative values in val1.
I have done three steps to convert val1 to a scalar. Am I doing it the right way? Is there any other way?
To minimize the similarity, I have used 1/val1. Is this a standard way to do it? Would it be correct to use 1 - val1 instead?
def loss_func(feat1, feat2):
    cosine_loss = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
    val1 = cosine_loss(feat1, feat2).tolist()
    # 1. calculate the absolute values of each element,
    # 2. sum all values together,
    # 3. divide it by the number of values
    val1 = 1 / (sum(list(map(abs, val1))) / int(len(val1)))
    val1 = torch.tensor(val1, device='cuda', requires_grad=True)
    return val1
Do not convert your loss to a list. This breaks autograd, so you won't be able to optimize your model parameters using PyTorch.
Regarding the negative values: cosine similarity ranges from -1 to 1, so negative entries in val1 simply mean that the corresponding feature vectors point in opposing directions; there is nothing wrong with them.
A loss function is already something to be minimized. If you want to minimize the similarity, then you probably just want to return the average cosine similarity. If instead you want to minimize the magnitude of the similarity (i.e. encourage the features to be orthogonal), then you can return the average absolute value of the cosine similarity.
It seems like what you've implemented will attempt to maximize the similarity, which doesn't appear to be in line with what you've stated. Also, to turn a minimization problem into an equivalent maximization problem you would usually just negate the measure; there's nothing wrong with a negative loss value. Taking the reciprocal of a strictly positive measure does convert it from a minimization into a maximization problem, but it also changes the behavior of the measure and probably isn't what you want.
Depending on what you actually want, one of these is likely to meet your needs:
import torch.nn.functional as F

def loss_func(feat1, feat2):
    # minimize average magnitude of cosine similarity
    return F.cosine_similarity(feat1, feat2).abs().mean()

def loss_func(feat1, feat2):
    # minimize average cosine similarity
    return F.cosine_similarity(feat1, feat2).mean()

def loss_func(feat1, feat2):
    # maximize average magnitude of cosine similarity
    return -F.cosine_similarity(feat1, feat2).abs().mean()

def loss_func(feat1, feat2):
    # maximize average cosine similarity
    return -F.cosine_similarity(feat1, feat2).mean()
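A brief usage sketch showing why keeping the result as a tensor matters (the feature shapes here are assumptions for illustration):

import torch
import torch.nn.functional as F

feat1 = torch.rand(8, 128, requires_grad=True)
feat2 = torch.rand(8, 128, requires_grad=True)
loss = F.cosine_similarity(feat1, feat2).abs().mean()
loss.backward()           # gradients flow because the loss is still a tensor
print(feat1.grad.shape)   # torch.Size([8, 128])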

Retrieving documents based on matrix multiplication

I have a model that represents a collection of documents in a multidimensional vector space. For example, for 100k documents, my model represents each of them as a 300-dimensional vector, so I end up with a matrix of size [100K, 300]. To retrieve those documents according to relevance to a given query, I do matrix multiplication. For example, I represent the query as a [300, 1] vector. Then I get the cosine similarity scores using matrix multiplication as follows:
[100K, 300]*[300, 1] = [100K, 1].
Now, how can I retrieve the top 1000 documents from this collection with the highest cosine similarity? The trivial way would be to sort based on cosine similarity and grab the first 1000 docs. Is there a way to retrieve the documents this way using some function in PyTorch?
I mean, how can I get the indices of the highest 1000 values from a 1D torch tensor?
Once you have the similarity scores from the dot product, you can get the top 1000 indices as follows:
top_indices = torch.argsort(sims, descending=True)[:1000]
top_scores = sims[top_indices]   # similarity scores of the 1000 closest documents
I think you are looking for torch.topk; it will return both the values and the indices of the top k largest elements.
For example:
>>> x = torch.arange(100).view(-1, 1)
>>> x.shape
torch.Size([100, 1])
>>> value, indices = x.topk(k=10, dim=0)
>>> value
tensor([[99],
        [98],
        [97],
        [96],
        [95],
        [94],
        [93],
        [92],
        [91],
        [90]])
>>> indices
tensor([[99],
        [98],
        [97],
        [96],
        [95],
        [94],
        [93],
        [92],
        [91],
        [90]])
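Tying this back to the retrieval setup from the question, a minimal sketch (the shapes and the normalization step are assumptions; normalize both sides if you want true cosine similarity rather than raw dot products):

import torch
import torch.nn.functional as F

doc_matrix = F.normalize(torch.rand(100_000, 300), dim=1)  # unit-length document vectors
query = F.normalize(torch.rand(300), dim=0)                # unit-length query vector
sims = doc_matrix @ query                                   # shape (100000,)
top_scores, top_indices = torch.topk(sims, k=1000)          # 1000 highest similarities
top_docs = doc_matrix[top_indices]                          # shape (1000, 300)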

Reducing Memory requirements for distance/adjacency matrix

Based on a subsample of around 5,000 to 100,000 word embeddings (GloVe, 300-dimensional), I need to construct an adjacency matrix, i.e. a matrix of 1's and 0's indicating whether the euclidean (or cosine) distance between two words is smaller than x.
Currently, I'm using scipy.spatial.distance.pdist:
distances = pdist(common_model, 'euclidean')
adjacency = (distances <= 0.4)
adjacency = csr_matrix(squareform(adjacency), dtype=np.uint8)
With increasing vocabulary size, my memory fills up rather quickly and pdist fails with a MemoryError (when common_model has the shape (91938, 300) and contains float64).
Iterating over the model manually and creating the adjacency matrix directly, without the full distance matrix in between, would be one way, but that was extremely slow.
Is there another way to construct the adjacency matrix in a time- and memory-optimal way?
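One memory-friendlier approach (a sketch, not from the original thread) is to compute the distance matrix block by block with scipy.spatial.distance.cdist, thresholding each block immediately so only the sparse boolean result is kept; casting the embeddings to float32 also roughly halves the memory of each block:

import numpy as np
from scipy.sparse import csr_matrix, vstack
from scipy.spatial.distance import cdist

def adjacency_by_blocks(model, threshold=0.4, block_size=1000):
    # model: (n_words, 300) array of embeddings; returns a sparse 0/1 adjacency matrix
    model = model.astype(np.float32)
    blocks = []
    for start in range(0, model.shape[0], block_size):
        d = cdist(model[start:start + block_size], model, metric='euclidean')
        blocks.append(csr_matrix(d <= threshold, dtype=np.uint8))  # keep only the sparse result
    return vstack(blocks)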
