Fast algorithms to approximate distance between two strings - string

I am working on a project that requires calculating the minimum distance between two strings. The maximum length of each string is 10,000 (m) and we have around 50,000 (n) strings. I need to find the distance between each pair of strings. I also have a weight matrix that contains the weight for each character pair. For example, the weight between (a,a) and (a,b) is 0.
Just iterating over all pairs of strings takes O(n^2) time. I have seen algorithms that take O(m) time to find the distance between one pair, so the overall time complexity becomes O(n^2 * m). Are there any algorithms that can do better than this using some pre-processing? It's essentially the same problem as autocorrect.
Are there algorithms that store all the strings in a data structure so that we can then query the approximate distance between two strings from it? Constructing the data structure can take O(n^2), but query processing should take less than O(m).
s1 = abcca, s2 = bdbbe
If we follow the above weight matrix and calculate the Euclidean distance between the two:
sqrt(0^2 + 9^2 + 9^2 + 9^2 + 342^2)
Context: I need to cluster time series, and I have converted each time series to a SAX representation with around 10,000 points. In order to cluster, I need to define a distance matrix, so I need to calculate the distance between two strings efficiently.
Note: All strings are of the same length and the alphabet size is 5.
https://web.stanford.edu/class/cs124/lec/med.pdf
http://stevehanov.ca/blog/index.php?id=114
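For reference, the O(m) pairwise distance described above can be sketched in a few lines of Python. The weight table `w` below is a hypothetical stand-in for the question's weight matrix (a tiny 2-letter alphabet, just to make the sketch self-contained):

```python
from math import sqrt

def weighted_distance(s1, s2, w):
    """Euclidean-style distance between two equal-length strings,
    where w[x][y] is the weight for aligning character x with y."""
    assert len(s1) == len(s2), "SAX strings are all the same length"
    return sqrt(sum(w[a][b] ** 2 for a, b in zip(s1, s2)))

# Hypothetical symmetric weight table for a 2-letter alphabet:
w = {"a": {"a": 0, "b": 1}, "b": {"a": 1, "b": 0}}
```

Anything sub-O(m) per query would have to approximate this sum, e.g. by pre-computing sketches of the strings.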

Related

get n smallest values of multidimensional xarray.DataArray

I'm currently working with some weather data that I have as netCDF files, which I can easily read with Python's xarray library.
I would now like to get the n smallest values of my DataArray, which has 3 dimensions (longitude, latitude and time).
When I have a DataArray dr, I can just do dr.min(), maybe specify an axis, and I get the minimum. But when I also want the second smallest, or a variable number of smallest values, it doesn't seem to be as simple.
What I currently do is:
with xr.open_dataset(path) as ds:
    dr = ds[selection]
    dr = dr.values.reshape(dr.values.size)
    dr.sort()
    n_smallest = dr[0:n]
which seems a bit complicated to me compared to the simple .min() I have to type for the smallest value.
I actually want to get the times of the respective smallest values, which I do for the smallest one with:
dr.where(dr[selection] == dr[selection].min(), drop=True)[time].values
So is there a better way of getting the n smallest values? Or maybe even a simple way to get the times for the n smallest values?
Maybe there is a way to reduce the 3D DataArray along the longitude and latitude axes to the respective smallest values?
I just figured out there really is a reduce function for DataArray that allows me to reduce along longitude and latitude. Since I don't reduce the time dimension, I can then use the sortby function and get the DataArray with the min value for each day along with its time:
with xr.open_dataset(path) as ds:
    dr = ds[selection]
    dr = dr.reduce(np.min, dim=[longitude, latitude])
    dr.sortby(dr)
which is obviously not shorter than my original code, but it perfectly satisfies my needs.
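If only the n smallest values (and the labels paired with them) are needed, `heapq.nsmallest` from the standard library avoids a full sort when n is much smaller than the number of points. The `(time, value)` pairs below are hypothetical placeholders for the flattened DataArray contents:

```python
import heapq

# Hypothetical flattened (time, value) pairs standing in for the
# DataArray's contents; heapq.nsmallest runs in O(size * log n)
# instead of the O(size * log size) of a full sort.
data = [("t3", 0.2), ("t1", 1.5), ("t4", -0.7), ("t2", 0.9)]

# The two smallest values, each still paired with its time label:
n_smallest = heapq.nsmallest(2, data, key=lambda pair: pair[1])
```

The same idea applies to `dr.values` zipped with the corresponding time coordinates.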

Algorithm to check whether distance between any two objects is within the Specified grid

I have a distance value between two objects. I need an algorithm to check whether the measured distance matches the distance between any two nodes of the grid pattern shown in the image.
Grid for verification
This is a grid with square cells. Any distance on such a grid (expressed in units of the cell size) must satisfy the condition
d^2 = a^2 + b^2
If the squared distance is an integer and you can represent it as a sum of two integer squares, then the objects can be placed at grid nodes.
There is a mathematical criterion: a number P is not representable as a sum of two squares if its factorization into primes contains any (4n+3) factor raised to an odd power.
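That criterion (Fermat's two-square theorem) can be checked directly with trial-division factorization; a minimal sketch:

```python
def is_sum_of_two_squares(n):
    # Fermat's criterion: n >= 0 is a sum of two integer squares iff
    # every prime factor p with p % 4 == 3 occurs to an even power.
    if n < 0:
        return False
    p = 2
    while p * p <= n:
        if n % p == 0:
            exp = 0
            while n % p == 0:
                n //= p
                exp += 1
            if p % 4 == 3 and exp % 2 == 1:
                return False
        p += 1
    # Any leftover factor n > 1 is prime and occurs to the first power.
    return n % 4 != 3
```

So for a measured distance d, round d^2 to the nearest integer (within measurement tolerance) and test it with this function.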

How to find two strings that are close to each other

I have n strings and I want to find the closest pair.
What's the fastest practical algorithm to find this pair?
The paper "The Closest Pair Problem under the Hamming Metric" by Min, Kao, and Zhu seems to be what you are looking for, and it applies to finding a single closest pair.
For your case, where n^0.294 < D < n, with D the dimensionality of your data (1000) and n the size of your dataset, the algorithm runs in O(n^1.843 * D^0.533).
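Before reaching for that algorithm, the O(n^2 * D) brute force is worth keeping as a correctness baseline; a sketch, assuming equal-length strings under the Hamming metric:

```python
from itertools import combinations

def hamming(s1, s2):
    # Number of positions at which two equal-length strings differ.
    return sum(a != b for a, b in zip(s1, s2))

def closest_pair(strings):
    # O(n^2 * D) baseline: compare every pair, keep the closest.
    return min(combinations(strings, 2), key=lambda pair: hamming(*pair))
```

For small n this baseline may well be the fastest practical option, since it has no preprocessing overhead.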

formula Amplitude using FFT

I want to ask about the formula for amplitude below. I am using the Fast Fourier Transform, which returns real and imaginary parts.
After that I must find the amplitude for each frequency.
My formula is
amplitude = 10 * log (real*real + imagined*imagined)
I want to ask about this formula. What is its source? I have searched but haven't found one. Can anybody tell me where it comes from?
This is a combination of two equations:
1: Finding the magnitude of a complex number (the result of an FFT at a particular bin), the equation for which is
m = sqrt(r^2 + i^2)
2: Calculating relative power in decibels from an amplitude value, the equation for which is p = 10 * log10(A^2 / Aref^2) == 20 * log10(A / Aref), where Aref is some reference value.
By inserting m from equation 1 as A into equation 2 with Aref = 1, we get:
p = 10 * log10(r^2 + i^2)
Note that this gives you a measure of relative signal power rather than amplitude.
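A minimal sketch of that combined formula, assuming the real and imaginary parts of a single FFT bin are already available:

```python
import math

def bin_power_db(re, im, ref=1.0):
    # Relative power in dB of one FFT bin from its real and imaginary
    # parts: 10 * log10(m^2 / ref^2), where m = sqrt(re^2 + im^2).
    return 10.0 * math.log10((re * re + im * im) / (ref * ref))
```

With `ref = 1.0` this reduces to exactly the questioner's `10 * log(real*real + imagined*imagined)`.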
The first part of the formula likely comes from the definition of the decibel, with the reference P0 set to 1, assuming that by log you meant a base-10 logarithm.
The second part, i.e. the P1=real^2 + imagined^2 in the link above, is the square of the modulus of the Fourier coefficient cn at the n-th frequency you are considering.
A Fourier coefficient is in general a complex number (See its definition in the case of a DFT here), and P1 is by definition the square of its modulus. The FFT that you mention is just one way of calculating the DFT. In your case, likely the real and complex numbers you refer to are actually the real and imaginary parts of this coefficient cn.
sqrt(P1) is the modulus of the Fourier coefficient cn of the signal at the n-th frequency.
sqrt(P1)/N, is the amplitude of the Fourier component of the signal at the n-th frequency (i.e. the amplitude of the harmonic component of the signal at that frequency), with N being the number of samples in your signal. To convince yourself you need to divide by N, see this equation. However, the division factor depends on the definition/convention of Fourier transform that you use, see the note just above here, and the discussion here.

Can the cosine similarity when using Locality Sensitive Hashing be -1?

I was reading this question:
How to understand Locality Sensitive Hashing?
But then I found that the equation to calculate the cosine similarity is as follows:
Cos(v1, v2) = Cos(theta), where theta = (hamming distance / signature length) * pi = (h/b) * pi
Which means that if the vectors are fully similar, the hamming distance will be zero and the cosine will be 1. But when the vectors are totally dissimilar, the hamming distance equals the signature length, so we get cos(pi), which is -1. Shouldn't the similarity always be between 0 and 1?
Cosine similarity is the dot product of the vectors divided by the product of their magnitudes, so it's entirely possible to have a negative value for the angle's cosine. For example, if you have unit vectors pointing in opposite directions, then you want the value to be -1. I think what's confusing you is the nature of the representation: the other post is talking about angles between vectors in 2-D space, whereas it's more common to create vectors in a multidimensional space where the number of dimensions is customarily much greater than 2 and the value for each dimension is non-negative (e.g., a word occurs in a document or not), resulting in a 0 to 1 range.
