Cosine similarity between query and document in a search engine - nlp

I am going through the Manning book on information retrieval. Currently I am at the chapter about cosine similarity, and one thing is not clear to me.
Let's say I have the tf-idf vectors for the query and a document, and I want to compute the cosine similarity between the two. When I compute the magnitude of the document vector, do I sum the squares of all the terms in the vector or just the terms that appear in the query?
Here is an example: we have the user query "cat food beef".
Let's say its vector is (0, 1, 0, 1, 1) (assume there are only 5 dimensions in the vector, one for each unique word in the query and the document).
We have a document "Beef is delicious".
Its vector is (1, 1, 1, 0, 0). We want to find the cosine similarity between the query and document vectors.

Cosine similarity is simply a fraction where
the numerator is the dot product between 2 vectors
the denominator is the product of the magnitudes of the 2 vectors,
i.e. the Euclidean lengths: the square root of the dot product of each vector with itself.
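Written out for the example vectors x (the query) and y (the document) used below: cos(x, y) = (x · y) / (‖x‖ ‖y‖) = 1 / (√3 · √3) = 1/3.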
For the numerator, e.g. in numpy:
>>> import numpy as np
>>> y = [1.0, 1.0, 1.0, 0.0, 0.0]
>>> x = [0.0, 1.0, 0.0, 1.0, 1.0]
>>> np.dot(x,y)
1.0
Similarly, if we compute the dot product by multiplying each x_i and y_i and summing the individual products:
>>> x_dot_y = sum(x_i * y_i for x_i, y_i in zip(x, y))
>>> x_dot_y
1.0
For the denominator, we can compute the product of the two magnitudes in numpy:
>>> from numpy.linalg import norm
>>> y = [1.0, 1.0, 1.0, 0.0, 0.0]
>>> x = [0.0, 1.0, 0.0, 1.0, 1.0]
>>> norm(x) * norm(y)
2.9999999999999996
Similarly, we can compute the product of the Euclidean lengths using math.sqrt and np.dot:
>>> import math
>>> math.sqrt(np.dot(x, x)) * math.sqrt(np.dot(y, y))
2.9999999999999996
So the cosine similarity is:
>>> cos_x_y = np.dot(x,y) / (norm(x) * norm(y))
>>> cos_x_y
0.33333333333333337
You can also compute the cosine similarity from scipy's cosine distance function (cosine similarity = 1 − cosine distance):
>>> from scipy import spatial
>>> 1 - spatial.distance.cosine(x,y)
0.33333333333333337
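If scikit-learn happens to be available (not required by the steps above, just another option), its pairwise cosine_similarity gives the same value and works directly on 2-D arrays of tf-idf vectors:
>>> from sklearn.metrics.pairwise import cosine_similarity
>>> cosine_similarity([x], [y])
array([[0.33333333]])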
See also
How to calculate cosine similarity given 2 sentence strings? - Python
Cosine Similarity between 2 Number Lists

Related

What does 'computeU' mean in the computeSVD() function in Spark?

I found some code that uses the computeSVD() function; here is the code:
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([
    Vectors.sparse(5, {1: 1.0, 3: 7.0}),
    Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
    Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
])
mat = RowMatrix(rows)
# Compute the top 5 singular values and corresponding singular vectors.
svd = mat.computeSVD(5, computeU=True)
U = svd.U # The U factor is a RowMatrix.
s = svd.s # The singular values are stored in a local dense vector.
V = svd.V # The V factor is a local dense matrix.
What does computeU=True mean in this code?

Find Euclidean / cosine distance between a tensor and all tensors stored in a column of a dataframe efficiently

I have a tensor 'input_sentence_embed' with shape torch.Size([1, 768])
There is a dataframe 'matched_df' which looks like
INCIDENT_NUMBER enc_rep
0 INC000030884498 [[tensor(-0.2556), tensor(0.0188), tensor(0.02...
1 INC000029956111 [[tensor(-0.3115), tensor(0.2535), tensor(0.20..
2 INC000029555353 [[tensor(-0.3082), tensor(0.2814), tensor(0.24...
3 INC000029555338 [[tensor(-0.2759), tensor(0.2604), tensor(0.21...
The shape of each tensor element in the dataframe looks like
matched_df['enc_rep'].iloc[0].size()
torch.Size([1, 768])
I want to find the Euclidean / cosine similarity between 'input_sentence_embed' and each row of 'matched_df' efficiently.
If they were scalar values, I could easily have broadcast 'input_sentence_embed' as a new column in 'matched_df' and then found the cosine similarity between the two columns.
I am struggling with two problems:
how to broadcast 'input_sentence_embed' as a new column to 'matched_df'
how to find the cosine similarity between tensors stored in two columns
Maybe someone can also suggest other, easier methods to achieve the end goal of finding the similarity between a tensor value and all tensors stored in a column of a dataframe efficiently.
Input data:
import pandas as pd
import numpy as np
from torch import tensor
match_df = pd.DataFrame({
    'INCIDENT_NUMBER': ['INC000030884498',
                        'INC000029956111',
                        'INC000029555353',
                        'INC000029555338'],
    'enc_rep': [[[tensor(0.2971), tensor(0.4831), tensor(0.8239), tensor(0.2048)]],
                [[tensor(0.3481), tensor(0.8104), tensor(0.2879), tensor(0.9747)]],
                [[tensor(0.2210), tensor(0.3478), tensor(0.2619), tensor(0.2429)]],
                [[tensor(0.2951), tensor(0.6698), tensor(0.9654), tensor(0.5733)]]]
})
input_sentence_embed = [[tensor(0.0590), tensor(0.3919), tensor(0.7821), tensor(0.1967)]]
How to broadcast 'input_sentence_embed' as a new column to the 'matched_df'
match_df["input_sentence_embed"] = [input_sentence_embed] * len(match_df)
How to find cosine similarity between tensors stored in two columns
a = np.vstack(match_df["enc_rep"])   # stacks the stored embeddings into one matrix, one row per incident
b = np.hstack(input_sentence_embed)  # the query embedding as a flat array
# Note: np.linalg.norm(a) is the Frobenius norm of the whole matrix;
# for a per-row cosine similarity, use np.linalg.norm(a, axis=1) instead.
match_df["cosine_similarity"] = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
Output result:
INCIDENT_NUMBER enc_rep input_sentence_embed cosine_similarity
0 INC000030884498 [[tensor(0.2971), tensor(0.4831), tensor(0.823... [[tensor(0.0590), tensor(0.3919), tensor(0.782... 0.446067
1 INC000029956111 [[tensor(0.3481), tensor(0.8104), tensor(0.287... [[tensor(0.0590), tensor(0.3919), tensor(0.782... 0.377775
2 INC000029555353 [[tensor(0.2210), tensor(0.3478), tensor(0.261... [[tensor(0.0590), tensor(0.3919), tensor(0.782... 0.201116
3 INC000029555338 [[tensor(0.2951), tensor(0.6698), tensor(0.965... [[tensor(0.0590), tensor(0.3919), tensor(0.782... 0.574257
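If the embeddings are kept as torch tensors, a per-row cosine similarity (normalising each stored embedding separately) can also be computed with torch directly. A minimal sketch, assuming the match_df and input_sentence_embed defined above; the column name cosine_similarity_rowwise is just for illustration:
import torch
import torch.nn.functional as F

# Stack the stored embeddings into one (n_rows, dim) tensor
doc_embeds = torch.stack([torch.stack(row[0]) for row in match_df["enc_rep"]])
# The query embedding as a (1, dim) tensor
query_embed = torch.stack(input_sentence_embed[0]).unsqueeze(0)

# One cosine value per incident, each row normalised by its own norm
match_df["cosine_similarity_rowwise"] = F.cosine_similarity(doc_embeds, query_embed, dim=1).numpy()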
Basics
I suppose you are trying to calculate the similarity or closeness of two vectors via:
euclidean distance between vectors or
cosine between vectors
Cosine similarity
For cosine similarity, you need:
Norm of each vector -> You can use linalg.norm
Cosine of vectors -> You can use dot product (inner or dot)
https://en.wikipedia.org/wiki/Cosine_similarity
For instance A = [0.8, 0.9] and B = [1.0, 0.0], then the cosine similarity of A and B is:
A = np.array([0.8, 0.9])
B = np.array([1.0, 0.0])
EA = np.linalg.norm(A)
EB = np.linalg.norm(B)
NA = A / EA
NB = B / EB
COS_A_B = np.dot(NA, NB)
COS_A_B
---
0.6643638388299198
So if we can get two vectors (rows) A and B from the enc_rep column, then we can calculate the cosine between them.
Pandas
We need to figure out how to run those cosine calculations on the same column.
C = np.array([0.5, 0.3])
df = pd.DataFrame(columns=['ID','enc_rep'])
df.loc[0] = [1, A]
df.loc[1] = [2, B]
df.loc[2] = [3, C]
df
---
ID enc_rep
0 1 [0.8, 0.9]
1 2 [1.0, 0.0]
2 3 [0.5, 0.3]
One naive way is to create a Cartesian product of the enc_rep column with itself.
cartesian_df = df['enc_rep'].to_frame().merge(df['enc_rep'], how='cross')
cartesian_df
---
enc_rep_x enc_rep_y
0 [0.8, 0.9] [0.8, 0.9]
1 [0.8, 0.9] [1.0, 0.0]
2 [0.8, 0.9] [0.5, 0.3]
3 [1.0, 0.0] [0.8, 0.9]
4 [1.0, 0.0] [1.0, 0.0]
5 [1.0, 0.0] [0.5, 0.3]
6 [0.5, 0.3] [0.8, 0.9]
7 [0.5, 0.3] [1.0, 0.0]
8 [0.5, 0.3] [0.5, 0.3]
Take the cosine between enc_rep_x and enc_rep_y.
def f(x, y):
    nx = x / np.linalg.norm(x)
    ny = y / np.linalg.norm(y)
    return np.dot(nx, ny)
cartesian_df['cosine'] = cartesian_df.apply(lambda row: f(row.enc_rep_x, row.enc_rep_y), axis=1)
cartesian_df
---
enc_rep_x enc_rep_y cosine
0 [0.8, 0.9] [0.8, 0.9] 1.000000
1 [0.8, 0.9] [1.0, 0.0] 0.664364
2 [0.8, 0.9] [0.5, 0.3] 0.954226
3 [1.0, 0.0] [0.8, 0.9] 0.664364
4 [1.0, 0.0] [1.0, 0.0] 1.000000
5 [1.0, 0.0] [0.5, 0.3] 0.857493
6 [0.5, 0.3] [0.8, 0.9] 0.954226
7 [0.5, 0.3] [1.0, 0.0] 0.857493
8 [0.5, 0.3] [0.5, 0.3] 1.000000
However, if the number of rows is large, this creates a huge dataframe with duplicate pairs. If size is not an issue, you can drop one column and keep only the unique rows.
Hope this gives an idea of how to approach it. The details, such as whether the shapes are 2-dimensional or 1-dimensional, are left for you to work out.
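If the number of rows grows, a more scalable sketch (assuming each enc_rep entry is a plain 1-D numpy array, as in the small df above) avoids the Cartesian product entirely: normalise all rows once and take a single matrix product.
# Stack the embeddings into an (n, d) matrix and normalise each row to unit length
M = np.vstack(df['enc_rep'].to_numpy())
M = M / np.linalg.norm(M, axis=1, keepdims=True)
# pairwise_cosine[i, j] is the cosine between row i and row j
pairwise_cosine = M @ M.T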

Does the derivative for categorical cross entropy only add values to weights?

I understand the derivation of partial derivatives for binary cross-entropy.
binary cross entropy
Loss = -(y * log(yHat) + (1 - y) * log(1 - yHat))
dLoss/dyHat = -y / yHat + (1 - y) / (1 - yHat)
categorical cross entropy
Loss = -y * log(yHat)
dLoss/dyHat = -y / yHat
Though, I do not see the latter derivative used in backpropagation. The problem I am having with it is that only the positive category is updated.
For example;
If, after the first forward propagation, the one-hot targets are category 4 for the first sample and category 3 for the second:
y = [[0, 0, 0, 1], [0, 0, 1, 0]]
yHat = [[0.1, 0.2, 0.1, 0.4], [0.1, 0.3, 0.1, 0.5]]
then the backpropagation step, using categorical cross entropy, would be
[[0., 0., 0., -0.1e-3], [0., 0., -0.4e-3, 0.]]
Updating the weights with these values only results in higher numbers at each update, without any penalty for the probabilities of the wrong classifications. Eventually, all of the yHat values approach 1.0.
The same backpropagation step using binary cross entropy gives values =
[[1.1, 1.3, 1.1, -2.5],[1.1, 1.4, -10.0, 2.0]]
Allowing both a reward for the correct category and a penalty for the incorrect.
So, is the practice when using categorical cross entropy to actually use the binary cross entropy derivative? It doesn't seem like such a liberty should be taken, but I am not sure what else to do.

Dimension reduction using PCA while preserving a percentage of the variance

I am trying to reduce the dimensions of the MNIST dataset using PCA. The trick is, I have to preserve a certain percentage of the variance (say 80%) while reducing the dimension. I am using scikit-learn. I am looking at pca.explained_variance_ratio_, but it gives me the same values with the dot in a different location, like 9.7 or .97 or .097. I also tried pca.explained_variance_, but I assume that is not the answer. My question is: how do I ensure that I have reduced the dimension while preserving a certain percentage of the variance?
If you fit PCA without passing the n_components argument (so that all components are kept), the explained_variance_ratio_ attribute of the fitted PCA object will give you the information you need. This attribute indicates the fraction of the total variance explained by each principal component. Here is an example copied directly from the scikit-learn PCA documentation:
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = PCA(n_components=2)
>>> pca.fit(X)
PCA(copy=True, n_components=2, whiten=False)
>>> print(pca.explained_variance_ratio_)
[ 0.99244... 0.00755...]
In your case, if you apply np.cumsum to the explained_variance_ratio_ attribute, then the number of principal components you need to keep corresponds to the position of the first element in np.cumsum(pca.explained_variance_ratio_) that is greater than or equal to 0.8.
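A minimal sketch of that selection (assuming X is your flattened MNIST data matrix); note that recent scikit-learn versions also let you pass a float n_components so PCA does this selection for you:
import numpy as np
from sklearn.decomposition import PCA

# Fit with all components, then count how many are needed to reach 80% of the variance
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.8)) + 1

# Alternatively, a float n_components keeps just enough components
# to explain at least that fraction of the variance
pca_80 = PCA(n_components=0.8).fit(X)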

Sparse Vector vs Dense Vector

How to create SparseVector and dense Vector representations
If the DenseVector is:
denseV = np.array([0., 3., 0., 4.])
what will be the SparseVector representation?
Unless I have thoroughly misunderstood your doubt, the MLlib data type documentation illustrates this quite clearly:
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
// Create a dense vector (1.0, 0.0, 3.0).
Vector dv = Vectors.dense(1.0, 0.0, 3.0);
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});
Where the second argument of Vectors.sparse is an array of the indices, and the third argument is the array of the actual values in those indices.
Sparse vectors are used when most of the values in the vector are zero, while dense vectors are used when most of the values are non-zero.
If you have to create a sparse vector from the dense vector you specified, use the following syntax:
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
Vector sparseVector = Vectors.sparse(4, new int[] {1, 3}, new double[] {3.0, 4.0});
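Since the denseV in the question is a NumPy array, here is the same conversion sketched with the PySpark MLlib API (assuming you are working from Python rather than Scala/Java):
from pyspark.mllib.linalg import Vectors

# Dense vector (0.0, 3.0, 0.0, 4.0)
dense_v = Vectors.dense([0.0, 3.0, 0.0, 4.0])
# Sparse equivalent: size 4, with values 3.0 and 4.0 at indices 1 and 3
sparse_v = Vectors.sparse(4, [1, 3], [3.0, 4.0])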
Dense: use it when most of the positions in the vector hold non-zero data.
Sparse: use it when only a few positions are filled (i.e. the vector contains many zeroes).
E.g. for the values {0.0, 3.0, 0.0, 4.0} the two representations would be
val posVector = Vectors.dense(0.0, 3.0, 0.0, 4.0) // all values are stored
val sparseVector = Vectors.sparse(4, Array(1, 3), Array(3.0, 4.0)) // only non-zeros are mentioned
Syntax: Vectors.sparse(size of vector, non-zero indices, values)
