I was reading this question:
How to understand Locality Sensitive Hashing?
But then I found that the equation to calculate the cosine similarity is as follows:
Cos(v1, v2) = Cos(theta) = Cos((hamming distance / signature length) * pi) = Cos((h / b) * pi)
Which means that if the vectors are fully similar, the hamming distance will be zero and the cosine value will be 1. But when the vectors are totally dissimilar, the hamming distance will equal the signature length, so we get cos(pi), which is -1. Shouldn't the similarity always be between 0 and 1?
Cosine similarity is the dot product of the vectors divided by the product of their magnitudes, so it's entirely possible to have a negative value for the angle's cosine. For example, if you have unit vectors pointing in opposite directions, then you want the value to be -1. I think what's confusing you is the nature of the representation: the other post is talking about angles between vectors in 2-D space, whereas it's more common to create vectors in a multidimensional space where the number of dimensions is customarily much greater than 2 and the value for each dimension is non-negative (e.g., a word occurs in a document or not), resulting in a 0 to 1 range.
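To make the range concrete, here is a minimal sketch (the example vectors are mine, not from the post) computing cosine similarity with NumPy:

import numpy as np

def cosine_similarity(v1, v2):
    # dot product divided by the product of the magnitudes
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # opposite unit vectors -> -1.0
print(cosine_similarity(np.array([2.0, 1.0]), np.array([4.0, 2.0])))   # same direction -> 1.0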
Related
I'd like to calculate document similarity using word embedding models (word2vec, GloVe),
so one document can be represented as a 257*300 matrix
(257 = maximum number of words in a document, 300 = dimension of the pretrained embedding model).
Now I'm trying to calculate the distance between all documents.
I can use cosine similarity, Euclidean distance, or other vector calculation methods in scikit-learn,
but these methods return a similarity matrix.
Is there any method to get a single number from the matrix distance calculation?
Or should I calculate the average of all values in the similarity matrix? (I don't think this is the proper way to solve this problem.)
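One hedged sketch (my own illustration, not necessarily the proper solution): mean-pool each document's word-vector matrix into a single 300-dimensional vector, then compare documents with cosine similarity, which yields one number per document pair.

import numpy as np

def doc_vector(word_matrix):
    # word_matrix: (num_words, 300) array of word embeddings for one document
    return word_matrix.mean(axis=0)                # mean-pool to a (300,) vector

def doc_similarity(doc_a, doc_b):
    va, vb = doc_vector(doc_a), doc_vector(doc_b)
    return np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))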
I have a distance value between two objects. I need an algorithm to check whether the measured distance can occur between any two objects in the grid pattern shown in the image.
Grid for verification
This is a grid with square cells. All distances in such a grid (expressed in units of the cell size) must satisfy the condition
d^2 = a^2 + b^2
If the squared distance is an integer and you can represent it as a sum of two integer squares, then the objects can be placed at grid nodes.
There is a mathematical criterion: a number P is not representable as a sum of two squares if its factorization into primes contains any (4n+3) factor raised to an odd power.
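A minimal sketch of that criterion (the function name and structure are mine), for non-negative integers:

def is_sum_of_two_squares(p):
    # True iff p can be written as a^2 + b^2 with integers a, b:
    # every prime factor of the form 4n+3 must appear to an even power.
    d = 2
    while d * d <= p:
        count = 0
        while p % d == 0:
            p //= d
            count += 1
        if d % 4 == 3 and count % 2 == 1:
            return False
        d += 1
    return p % 4 != 3   # any leftover factor is prime; reject a lone 4n+3 prime

# A measured distance d fits the grid iff d^2 is (numerically) an integer and
# is_sum_of_two_squares(round(d * d)) returns True.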
I am working on a project that requires calculating the minimum distance between two strings. The maximum length of each string can be 10,000 (m) and we have around 50,000 (n) strings. I need to find the distance between each pair of strings. I also have a weight matrix that contains the weight for each character pair. Example: weight between (a,a) = (a,b) = 0.
Just iterating over all pairs of strings takes O(n^2) time. I have seen algorithms that take O(m) time for finding the distance between one pair. Then the overall time complexity becomes O(n^2*m). Are there any algorithms which can do better than this using some pre-processing? It's actually the same problem as auto-correct.
Are there algorithms that store all the strings in a data structure and then let us query the approximate distance between two strings from the data structure? Constructing the data structure can take O(n^2), and query processing should be done in less than O(m).
s1 = abcca, s2 = bdbbe
If we follow the above weighted matrix and calculate Euclidean distance between the two:
sqrt(0^2 + 9^2 + 9^2 + 9^2 + 342^2)
Context: I need to cluster time series and I have converted the time series to a SAX representation with around 10,000 points. In order to cluster, I need to define a distance matrix, so I need to calculate the distance between two strings in an efficient way.
Note: All strings are of same length and the alphabet size is 5.
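Since all strings are the same length, a position-wise weighted distance like the Euclidean example above is a single pass over the characters. Here is a hedged sketch; the weight values below are placeholders, not the project's actual matrix:

from math import sqrt

ALPHABET = "abcde"   # alphabet size 5, as in the question
# Hypothetical symmetric weight lookup (placeholder values only).
W = {(x, y): abs(ALPHABET.index(x) - ALPHABET.index(y))
     for x in ALPHABET for y in ALPHABET}

def weighted_distance(s1, s2, w=W):
    # Position-wise weighted Euclidean distance for equal-length strings.
    return sqrt(sum(w[(c1, c2)] ** 2 for c1, c2 in zip(s1, s2)))

print(weighted_distance("abcca", "bdbbe"))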
https://web.stanford.edu/class/cs124/lec/med.pdf
http://stevehanov.ca/blog/index.php?id=114
I want to ask about the formula for amplitude below. I am using the Fast Fourier Transform, so it returns complex numbers (real and imaginary parts).
After that I must find the amplitude for each frequency.
My formula is
amplitude = 10 * log (real*real + imagined*imagined)
I want to ask about this formula. What is its source? I have been searching, but I can't find any source. Can anybody tell me about the source?
This is a combination of two equations:
1: Finding the magnitude of a complex number (the result of an FFT at a particular bin) - the equation for which is
m = sqrt(r^2 + i ^2)
2: Calculating relative power in decibels from an amplitude value - the equation for which is p = 10 * log10(A^2 / Aref^2) == 20 * log10(A / Aref), where Aref is some reference value.
By inserting m from equation 1 as A in equation 2, with Aref = 1, we get:
p = 10 * log10(r^2 + i^2)
Note that this gives you a measure of relative signal power rather than amplitude.
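As a quick illustration (the test signal and sample rate below are assumed, not from the question), this is how that power-in-dB calculation typically looks with NumPy:

import numpy as np

fs = 1000                                   # sample rate in Hz (assumed)
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 50 * t)              # 50 Hz test tone (assumed)

spectrum = np.fft.rfft(x)
r, i = spectrum.real, spectrum.imag
power_db = 10 * np.log10(r**2 + i**2 + 1e-12)   # p = 10*log10(r^2 + i^2); epsilon avoids log(0)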
The first part of the formula likely comes from the definition of the decibel, with the reference P0 set to 1, assuming that by log you mean a base-10 logarithm.
The second part, i.e. the P1=real^2 + imagined^2 in the link above, is the square of the modulus of the Fourier coefficient cn at the n-th frequency you are considering.
A Fourier coefficient is in general a complex number (See its definition in the case of a DFT here), and P1 is by definition the square of its modulus. The FFT that you mention is just one way of calculating the DFT. In your case, likely the real and complex numbers you refer to are actually the real and imaginary parts of this coefficient cn.
sqrt(P1) is the modulus of the Fourier coefficient cn of the signal at the n-th frequency.
sqrt(P1)/N is the amplitude of the Fourier component of the signal at the n-th frequency (i.e., the amplitude of the harmonic component of the signal at that frequency), with N being the number of samples in your signal. To convince yourself that you need to divide by N, see this equation. However, the division factor depends on the definition/convention of Fourier transform that you use; see the note just above here, and the discussion here.
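A small sketch of that normalization (the signal below is assumed; the extra factor of 2 for a one-sided spectrum depends on the convention, as noted above):

import numpy as np

N = 1024
t = np.arange(N) / 1024.0
x = 3.0 * np.sin(2 * np.pi * 50 * t)     # 50 Hz tone with amplitude 3 (assumed)

c = np.fft.fft(x)
modulus = np.abs(c)                      # sqrt(P1): modulus of each Fourier coefficient
amplitude = modulus / N                  # sqrt(P1)/N, as described above
print(2 * amplitude[50])                 # one-sided amplitude at the 50 Hz bin -> ~3.0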
I want to represent each text-based item in my system as a vector in a vector space model. The values for the terms can be negative or positive, reflecting the frequency of a term in the positive or negative class. A zero value means neutral.
for example:
Item1 (-1,0,-5,4.5,2)
Item2 (2,6,0,-4,0.5)
My questions are:
1- How can I normalize my vectors to a range of [0 to 1] where:
0.5 corresponds to zero before normalization,
> 0.5 if the original value was positive, and
< 0.5 if the original value was negative?
I want to know if there is a mathematical formula to do such a thing.
2- Will the choice of similarity measure be different after the normalization? For example, can I still use cosine similarity?
3- Will it be difficult to perform dimensionality reduction after the normalization?
Thanks in advance
One solution could be to use the MinMaxScaler, which scales each feature to the (0, 1) range, and then divide each row by the sum of the row. In Python, using sklearn, you can do something like this:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

X = np.array([[-1, 0, -5, 4.5, 2],   # Item1 from the question
              [2, 6, 0, -4, 0.5]])   # Item2 from the question
scaler = MinMaxScaler()              # scales each feature (column) to [0, 1]
scaled_X = scaler.fit_transform(X)
normalized_X = normalize(scaled_X, norm='l1', axis=1, copy=True)  # make each row sum to 1
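With only the two example rows, the min-max scaling of each column maps the smaller value to 0 and the larger to 1, so normalized_X comes out as [[0, 0, 0, 0.5, 0.5], [1/3, 1/3, 1/3, 0, 0]]; with more items you would get intermediate values.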