How to get one number after calculating the distance between 2D matrices - python-3.x

I'd like to calculate document similarity using word embedding models (word2vec, GloVe),
so one document can be represented as a 257 x 300 matrix
(257 = maximum document length in words, 300 = dimension of the pretrained embedding model).
Now I want to calculate the distance between all pairs of documents.
When I use cosine similarity, Euclidean distance, or other pairwise methods in scikit-learn,
these methods return a similarity matrix, not a single value.
Is there a method that yields one number from such a matrix distance calculation?
Or should I average all the values in the similarity matrix? (I don't think that is the proper way to solve this problem.)
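If each document must become a single vector before comparison, one common approach is mean pooling: average the word vectors and then compare the pooled vectors with cosine similarity. A minimal sketch under that assumption (the function and variable names are mine, not from the question):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def doc_similarity(doc_a, doc_b):
    # doc_a, doc_b: (257, 300) word-embedding matrices (e.g., zero rows as padding)
    # mean-pool each document's word vectors into one 300-dim vector
    pooled = np.vstack([doc_a.mean(axis=0), doc_b.mean(axis=0)])
    # cosine_similarity returns a 2x2 matrix; the off-diagonal entry is the single number
    return cosine_similarity(pooled)[0, 1]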

Related

diagonal (or divergence) of Jacobian matrix

How can I efficiently calculate the diagonal of the Jacobian matrix using PyTorch?
This operator, [dz_1/dx_1, dz_2/dx_2, ..., dz_n/dx_n], is widely used in diffusion models.
Some non-ideal alternatives are:
1. calculate the whole Jacobian matrix first, then take out the diagonal;
2. loop over each entry to calculate the derivative individually.
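For the divergence specifically, a standard trick in diffusion-model code is the Hutchinson trace estimator, which approximates tr(J) = sum_i dz_i/dx_i with a few vector-Jacobian products instead of the full Jacobian. A sketch of the idea, assuming f maps x to an output z of the same shape (names are mine):

import torch

def divergence_hutchinson(f, x, n_samples=8):
    # Monte Carlo estimate of tr(J) via E[v^T J v] with Rademacher probes v
    x = x.detach().requires_grad_(True)
    z = f(x)
    est = x.new_zeros(())
    for _ in range(n_samples):
        v = (torch.rand_like(x) < 0.5).to(x.dtype) * 2 - 1  # entries in {-1, +1}
        # one vector-Jacobian product per probe; the full Jacobian is never built
        (vjp,) = torch.autograd.grad(z, x, grad_outputs=v, retain_graph=True)
        est = est + (vjp * v).sum()
    return est / n_samples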

When creating a new similarity feature in a ham-vs-spam case, should I include a spam text's similarity with itself in the average spam similarity?

I want to improve my model by adding a new feature column to my data, which consists of ham and spam texts.
I have already created the square cosine-similarity matrix between all the texts; the diagonal entries of the matrix are 1 = cos(0).
I extracted the indices of all spam texts in the training data and created the similarity column: for each cell in the column, I take the individual similarities between that text and all the spam texts and average them.
My question: for a text that is ham, the above makes sense. But for a text that is spam, should I exclude its similarity with itself when calculating the average? Will including it cause data leakage?
With n spam texts, I represent the similarity value of ham_1 as
average(ham_1~spam_1, ham_1~spam_2, ..., ham_1~spam_n)
My question is: for spam text spam_5, should the similarity value be
average(spam_5~spam_1, spam_5~spam_2, ..., spam_5~spam_5, ..., spam_5~spam_n)
or
average(spam_5~spam_1, ..., spam_5~spam_4, spam_5~spam_6, ..., spam_5~spam_n), i.e., with spam_5~spam_5 excluded?
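Whichever definition is chosen, excluding the self-similarity is easy with a mask over the precomputed matrix. A sketch, assuming sim is the n x n cosine-similarity matrix and spam_idx holds the spam row indices (both names are mine):

import numpy as np

def avg_spam_similarity(sim, spam_idx, exclude_self=True):
    sub = sim[:, spam_idx].astype(float)              # each text's similarity to every spam text
    counts = np.full(sim.shape[0], len(spam_idx), dtype=float)
    if exclude_self:
        # zero out each spam text's similarity with itself (the diagonal 1s)
        sub[spam_idx, np.arange(len(spam_idx))] = 0.0
        counts[spam_idx] -= 1
    return sub.sum(axis=1) / counts                   # the new feature column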

Fast algorithms to approximate distance between two strings

I am working on a project that requires calculating the minimum distance between two strings. The maximum length of each string is 10,000 (m) and we have around 50,000 (n) strings. I need to find the distance between each pair of strings. I also have a weight matrix that contains the weight for each character pair; for example, the weights for (a,a) and (a,b) are both 0.
Just iterating over all pairs of strings takes O(n^2) time. I have seen algorithms that take O(m) time to find the distance between one pair, so the overall time complexity becomes O(n^2 * m). Are there any algorithms that can do better than this using some pre-processing? It's essentially the same problem as autocorrect.
Are there algorithms that store all the strings in a data structure and then answer approximate distance queries between two strings from that structure? Constructing the data structure may take O(n^2), but query processing should be done in less than O(m).
s1 = abcca, s2 = bdbbe
If we follow the above weight matrix and calculate the Euclidean distance between the two:
sqrt(0^2 + 9^2 + 9^2 + 9^2 + 342^2) ≈ 342.36
Context: I need to cluster time series, and I have converted each time series to a SAX representation with around 10,000 points. In order to cluster, I need to define a distance matrix, so I need to calculate the distance between two strings in an efficient way.
Note: All strings are of same length and the alphabet size is 5.
https://web.stanford.edu/class/cs124/lec/med.pdf
http://stevehanov.ca/blog/index.php?id=114
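Since all strings have the same length and the per-position costs come from a lookup table, the pairwise distance itself can at least be vectorized. A sketch with NumPy (the alphabet and the weight matrix W stand in for the poster's own):

import numpy as np

ALPHABET = "abcde"                      # alphabet size 5, as stated in the question
IDX = {c: i for i, c in enumerate(ALPHABET)}

def weighted_distance(s1, s2, W):
    # W[i, j] is the cost of aligning symbol i with symbol j
    i1 = np.fromiter((IDX[c] for c in s1), dtype=np.intp)
    i2 = np.fromiter((IDX[c] for c in s2), dtype=np.intp)
    w = W[i1, i2]                       # per-position alignment cost
    return float(np.sqrt((w ** 2).sum()))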

Normalizing Vectors with Negative values

I want to represent each text-based item in my system as a vector in a vector space model. The term values can be negative or positive, reflecting the frequency of a term in the positive or negative class; a value of zero means neutral.
for example:
Item1 (-1,0,-5,4.5,2)
Item2 (2,6,0,-4,0.5)
My questions are:
1- How can I normalize my vectors to the range [0, 1] such that a value of 0 before normalization maps to 0.5, positive values map above 0.5, and negative values map below 0.5? I want to know if there is a mathematical formula to do such a thing.
2- Will the choice of similarity measure be different after the normalization? For example, can I still use cosine similarity?
3- Will it be difficult if I perform dimensionality reduction after the normalization?
Thanks in advance
One solution could be to use MinMaxScaler, which scales each feature to the [0, 1] range, and then divide each row by its sum. In Python, using sklearn, you can do something like this:
from sklearn.preprocessing import MinMaxScaler, normalize
import numpy as np

X = np.array([[-1, 0, -5, 4.5, 2],    # Item1 from the question
              [ 2, 6,  0, -4, 0.5]])  # Item2 from the question

scaler = MinMaxScaler()                # scale each column to [0, 1]
scaled_X = scaler.fit_transform(X)
normalized_X = normalize(scaled_X, norm='l1', axis=1, copy=True)  # rows sum to 1
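Note that MinMaxScaler scales each feature independently and will not, in general, send 0 to exactly 0.5. If that property is required, a simple symmetric rescaling does it; this formula is my suggestion, not part of the original answer: x' = 0.5 + x / (2 * max|x|) per feature.

import numpy as np

X = np.array([[-1, 0, -5, 4.5, 2],    # Item1
              [ 2, 6,  0, -4, 0.5]])  # Item2

scale = np.abs(X).max(axis=0)          # per-feature maximum magnitude
scale[scale == 0] = 1.0                # guard features that are all zero
X_sym = 0.5 + X / (2 * scale)          # 0 -> 0.5, positives above, negatives below, all in [0, 1]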

Can the cosine similarity when using Locality Sensitive Hashing be -1?

I was reading this question:
How to understand Locality Sensitive Hashing?
But then I found that the equation relating cosine similarity to the LSH signature is as follows:
theta = (hamming distance / signature length) * pi = (h/b) * pi, so Cos(v1, v2) = cos(theta) = cos((h/b) * pi)
This means that if the vectors are fully similar, the Hamming distance will be zero and the cosine value will be 1. But when the vectors are totally dissimilar, the Hamming distance will equal the signature length, and we get cos(pi), which is -1. Shouldn't the similarity always be between 0 and 1?
Cosine similarity is the dot product of the vectors divided by the product of their magnitudes, so it is entirely possible to get a negative value for the angle's cosine. For example, if you have unit vectors pointing in opposite directions, you want the value to be -1. I think what's confusing you is the nature of the representation: the other post talks about angles between vectors in 2-D space, whereas it's more common to create vectors in a multidimensional space where the number of dimensions is much greater than 2 and the value of each dimension is non-negative (e.g., whether a word occurs in a document or not), which results in a 0 to 1 range.
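A quick way to convince yourself is to simulate random-hyperplane (SimHash) signatures: for two opposite vectors the estimated angle approaches pi, so the recovered cosine approaches -1. A small sketch (names are mine):

import numpy as np

rng = np.random.default_rng(0)

def signature(v, planes):
    # one bit per hyperplane: which side of the plane the vector falls on
    return (planes @ v) >= 0

def estimated_cosine(v1, v2, n_bits=4096):
    planes = rng.standard_normal((n_bits, len(v1)))  # random hyperplane normals
    h = np.count_nonzero(signature(v1, planes) != signature(v2, planes))
    theta = (h / n_bits) * np.pi                     # estimated angle between v1 and v2
    return np.cos(theta)

v = np.array([1.0, 0.0, 0.0])
print(estimated_cosine(v, -v))   # close to -1 for opposite vectors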
