Given a set of points D and some number K, how do I find all numbers in D whose distance from K is less than or equal to an integer N?
Example:
Suppose we have D = {5, 9, 0, 6, 7}, K = 8 and N = 1; then the result should be {9, 7}.
I was thinking of using a k-d tree or a VP tree, but as I understand it (correct me if I am wrong), both find nearest neighbors and do not take the N in my example into account.
To summarize all the comments:
Solving this problem by brute force takes O(n) time: iterate over each element in D and check whether its distance from K is at most N.
If you have a big data set but a lot of queries, it is better to pre-process D: sort it once (O(n log n), a simple array sort), and then each query can be answered in O(log n + |output|).
Now, for a given query, binary search for K. Note that the binary search fails if the number is missing, but it still stops at the closest value. From that index, spread out to both sides of D and, for each element, check whether it is still within range N. The spreading is fine because its cost is covered by the O(|output|) term.
In your example, sorting D yields [0, 5, 6, 7, 9]. Searching for K = 8 fails but gives index 3 or 4 (depending on the implementation). Say it returns index 3: from index 3 up to the last index, check whether arr[i] - K <= N; if so, print it, and once it is bigger, stop. For the other side, check whether K - arr[i] <= N; if so, print it, and once it is bigger, stop. This gives you 7 and 9.
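For what it's worth, here is a minimal Java sketch of this approach; the class and method names are just placeholders I picked.

import java.util.*;

public class RangeAroundK {

    // Returns all values in d whose distance from k is at most n.
    // Assumes d is already sorted (the O(n log n) pre-processing step).
    static List<Integer> withinRange(int[] d, int k, int n) {
        List<Integer> result = new ArrayList<>();
        int idx = Arrays.binarySearch(d, k);
        if (idx < 0) idx = -idx - 1;                                               // first index whose value is >= k
        for (int i = idx; i < d.length && d[i] - k <= n; i++) result.add(d[i]);    // spread right
        for (int i = idx - 1; i >= 0 && k - d[i] <= n; i--) result.add(d[i]);      // spread left
        return result;
    }

    public static void main(String[] args) {
        int[] d = {5, 9, 0, 6, 7};
        Arrays.sort(d);                              // pre-processing: [0, 5, 6, 7, 9]
        System.out.println(withinRange(d, 8, 1));    // prints [9, 7]
    }
}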
Hope that helps!
I am working on a project that requires calculating the minimum distance between two strings. The maximum length of each string can be 10,000 (m) and we have around 50,000 (n) strings. I need to find the distance between each pair of strings. I also have a weight matrix that contains the weight for each pair of characters. For example, the weights for (a,a) and (a,b) are both 0.
Just iterating over all pairs of strings takes O(n^2) time. I have seen algorithms that take O(m) time to find the distance, so the overall time complexity becomes O(n^2 * m). Are there any algorithms that can do better than this with some pre-processing? It's essentially the same problem as autocorrect.
Is there an algorithm that stores all the strings in a data structure so that we can then query the approximate distance between two strings from it? Constructing the data structure can take O(n^2), but query processing should take less than O(m).
s1 = abcca, s2 = bdbbe
If we follow the above weight matrix and calculate the Euclidean distance between the two:
sqrt(0^2 + 9^2 + 9^2 + 9^2 + 342^2)
Context: I need to cluster time series, and I have converted the time series to a SAX representation with around 10,000 points. In order to cluster, I need to define a distance matrix, so I need to calculate the distance between two strings efficiently.
Note: all strings are of the same length and the alphabet size is 5.
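Just to make the brute-force baseline concrete, here is a rough Java sketch of the O(m) weighted distance for one pair of equal-length strings. The alphabet is assumed to be a..e (size 5, per the note), and the weight values filled in below are placeholders, not the asker's real matrix.

public class WeightedDistance {

    // Euclidean-style weighted distance between two equal-length strings, in O(m).
    // 'weight' is indexed by the two characters; alphabet a..e is assumed.
    static double distance(String s1, String s2, double[][] weight) {
        double sum = 0;
        for (int i = 0; i < s1.length(); i++) {
            double w = weight[s1.charAt(i) - 'a'][s2.charAt(i) - 'a'];
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[][] weight = new double[5][5];   // fill with the real pairwise weights
        weight[0][1] = weight[1][0] = 0;        // e.g. weight(a,b) = 0, as in the question
        System.out.println(distance("abcca", "bdbbe", weight));
    }
}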
https://web.stanford.edu/class/cs124/lec/med.pdf
http://stevehanov.ca/blog/index.php?id=114
If I have three strings, the first one string1 = Laptop, the second one string2 = Latpop and the third one string3 = Lavmop, then the Levenshtein distance algorithm returns the same distance for the similarity of string1 and string2 as for the similarity of string1 and string3. That is because the Levenshtein algorithm only counts the operations insert, delete and substitute; it does not include the transposition operation. For example, we can swap the third and fourth characters of Latpop, which yields Laptop.
It's obvious that Latpop is more similar to Laptop than Lavmop is, and it's not correct to classify them at the same similarity level.
Is there an algorithm that takes the transposition operation into account?
I found the answer in Damerau–Levenshtein distance and Jaro–Winkler distance
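For illustration, here is a small Java sketch of the optimal string alignment variant of Damerau-Levenshtein (Levenshtein plus transposition of adjacent characters). It gives distance 1 for Laptop/Latpop and 2 for Laptop/Lavmop.

public class DamerauLevenshtein {

    // Optimal string alignment (restricted Damerau-Levenshtein) distance:
    // insert, delete, substitute, plus transposition of two adjacent characters.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                            d[i][j - 1] + 1),     // insertion
                                   d[i - 1][j - 1] + cost);       // substitution
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // transposition
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("Laptop", "Latpop")); // 1 (one transposition)
        System.out.println(distance("Laptop", "Lavmop")); // 2 (two substitutions)
    }
}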
A string is given to you, and it consists of only 3 distinct characters, say x, y, z.
There will be a million queries given to you.
Query format: x z i j
For each query we need to find all possible different substrings which begin with x and end with z. i and j denote the lower and upper bounds of the range in which the substring must lie; it must not cross this range.
My logic:
Read the string. Keep 3 arrays that store the running counts of x, y and z respectively, for i = 0 up to strlen.
Store the indexes of each character separately in 3 more arrays: xlocation[], ylocation[], zlocation[].
Now, according to the query (a b i j), find all the indices of b within the range i to j.
Calculate the answer for each index of b and sum them to get the result.
Is it possible to pre-process this string before the queries, so that it takes O(1) time to answer each query?
As the others suggested, you can do this with a divide and conquer algorithm.
Optimal substructure:
If we are given a left half of the string and a right half, and we know how many such substrings there are in the left half and how many there are in the right half, then we can add the two numbers together. We will be undercounting by all the substrings that begin in the left half and end in the right half. This is simply the number of x's in the left half multiplied by the number of z's in the right half.
Therefore we can use a recursive algorithm.
This would be a problem, however, if we tried to solve for every single i and j combination, as the bottom-level subproblems would be solved many, many times.
You should look into implementing this with a dynamic programming algorithm keeping track of substrings in range i,j, x's in range i,j, and z's in range i,j.
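One concrete way to do the "x's and z's in a range" bookkeeping is with prefix sums rather than the recursive formulation. The sketch below (my own naming) counts occurrences, i.e. pairs of positions, matching the multiplication argument above: for each z at position q in [i, j], the valid start positions are the x's in [i, q-1], and the prefix arrays let us sum that over all q in O(1) per query after O(n) pre-processing.

public class SubstringQueries {
    private final long[] cntX, cntZ, pairSum;

    // Pre-processing: O(n). Supports queries for substrings that start with
    // startCh and end with endCh (pairs of positions p < q).
    SubstringQueries(String s, char startCh, char endCh) {
        int n = s.length();
        cntX = new long[n + 1];     // cntX[t] = number of startCh in s[0..t-1]
        cntZ = new long[n + 1];     // cntZ[t] = number of endCh in s[0..t-1]
        pairSum = new long[n + 1];  // pairSum[t] = sum of cntX[q] over q < t with s[q] == endCh
        for (int q = 0; q < n; q++) {
            cntX[q + 1] = cntX[q] + (s.charAt(q) == startCh ? 1 : 0);
            cntZ[q + 1] = cntZ[q] + (s.charAt(q) == endCh ? 1 : 0);
            pairSum[q + 1] = pairSum[q] + (s.charAt(q) == endCh ? cntX[q] : 0);
        }
    }

    // O(1) per query: number of (p, q) with i <= p < q <= j, s[p] == startCh, s[q] == endCh.
    long query(int i, int j) {
        return (pairSum[j + 1] - pairSum[i]) - cntX[i] * (cntZ[j + 1] - cntZ[i]);
    }

    public static void main(String[] args) {
        SubstringQueries sq = new SubstringQueries("xyzxzz", 'x', 'z');
        System.out.println(sq.query(0, 5)); // prints 5
    }
}

Since the alphabet here has only 3 characters, you could build these prefix arrays for each character pair you expect in the queries and still stay at O(n) pre-processing overall.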
I have n strings, each of length n. I wish to sort them in ascending order.
The best algorithm I can think of takes O(n^2 log n) time, using quicksort (comparing two strings takes O(n) time). The challenge is to do it in O(n^2) time. How can I do it?
Also, radix sort methods are not permitted, as you do not know the number of letters in the alphabet beforehand.
Assume every letter is in the range a to z.
Since there is no requirement for in-place sorting, create an array of linked lists of length 26:
List[] sorted = new List[26]; // each element is a list that you can append to
For a letter x in a string, its bucket position is the ASCII difference x - 'a'.
For example, the position for 'c' is 2, so it is appended as
sorted[2].add('c')
That way, sorting one string only takes O(n).
So sorting all the strings takes O(n^2).
For example, if you have "zdcbacdca".
z goes to sorted['z'-'a'].add('z'),
d goes to sorted['d'-'a'].add('d'),
....
After sorting, the result looks like:
0 1 2 3 ... 25
a b c d ...  z
a   c d
    c
Note: the assumed letter collection determines the length of the sorted array.
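Here is a minimal Java version of the bucket idea described above, sorting the characters of a single string in O(n):

import java.util.*;

public class CharBucketSort {

    // Sorts the characters of one string in O(n) time, assuming letters a..z,
    // using 26 buckets as described above.
    static String sortChars(String s) {
        List<Character>[] sorted = new List[26];
        for (int i = 0; i < 26; i++) sorted[i] = new ArrayList<>();
        for (char c : s.toCharArray()) sorted[c - 'a'].add(c);   // bucket position is c - 'a'
        StringBuilder sb = new StringBuilder();
        for (List<Character> bucket : sorted)
            for (char c : bucket) sb.append(c);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(sortChars("zdcbacdca")); // aabcccddz
    }
}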
For small numbers of strings a regular comparison sort will probably be faster than a radix sort here, since radix sort takes time proportional to the number of bits required to store each character. For a 2-byte Unicode encoding, and making some (admittedly dubious) assumptions about equal constant factors, radix sort will only be faster if log2(n) > 16, i.e. when sorting more than about 65,000 strings.
One thing I haven't seen mentioned yet is the fact that a comparison sort of strings can be enhanced by exploiting known common prefixes.
Suppose our strings are S[0], S[1], ..., S[n-1]. Let's consider augmenting mergesort with a Longest Common Prefix (LCP) table. First, instead of moving entire strings around in memory, we will just manipulate lists of indices into a fixed table of strings.
Whenever we merge two sorted lists of string indices X[0], ..., X[k-1] and Y[0], ..., Y[k-1] to produce Z[0], ..., Z[2k-1], we will also be given 2 LCP tables (LCPX[0], ..., LCPX[k-1] for X and LCPY[0], ..., LCPY[k-1] for Y), and we need to produce LCPZ[0], ..., LCPZ[2k-1] too. LCPX[i] gives the length of the longest prefix of X[i] that is also a prefix of X[i-1], and similarly for LCPY and LCPZ.
The first comparison, between S[X[0]] and S[Y[0]], cannot use LCP information and we need a full O(n) character comparisons to determine the outcome. But after that, things speed up.
During this first comparison, between S[X[0]] and S[Y[0]], we can also compute the length of their LCP -- call that L. Set Z[0] to whichever of S[X[0]] and S[Y[0]] compared smaller, and set LCPZ[0] = 0. We will maintain in L the length of the LCP of the most recent comparison. We will also record in M the length of the LCP that the last "comparison loser" shares with the next string from its block: that is, if the most recent comparison, between two strings S[X[i]] and S[Y[j]], determined that S[X[i]] was smaller, then M = LCPX[i+1], otherwise M = LCPY[j+1].
The basic idea is: After the first string comparison in any merge step, every remaining string comparison between S[X[i]] and S[Y[j]] can start at the minimum of L and M, instead of at 0. That's because we know that S[X[i]] and S[Y[j]] must agree on at least this many characters at the start, so we don't need to bother comparing them. As larger and larger blocks of sorted strings are formed, adjacent strings in a block will tend to begin with longer common prefixes, and so these LCP values will become larger, eliminating more and more pointless character comparisons.
After each comparison between S[X[i]] and S[Y[j]], the string index of the "loser" is appended to Z as usual. Calculating the corresponding LCPZ value is easy: if the last 2 losers both came from X, take LCPX[i]; if they both came from Y, take LCPY[j]; and if they came from different blocks, take the previous value of L.
In fact, we can do even better. Suppose the last comparison found that S[X[i]] < S[Y[j]], so that X[i] was the string index most recently appended to Z. If M ( = LCPX[i+1]) > L, then we already know that S[X[i+1]] < S[Y[j]] without even doing any comparisons! That's because to get to our current state, we know that S[X[i]] and S[Y[j]] must have first differed at character position L, and it must have been that the character x in this position in S[X[i]] was less than the character y in this position in S[Y[j]], since we concluded that S[X[i]] < S[Y[j]] -- so if S[X[i+1]] shares at least the first L+1 characters with S[X[i]], it must also contain x at position L, and so it must also compare less than S[Y[j]]. (And of course the situation is symmetrical: if the last comparison found that S[Y[j]] < S[X[i]], just swap the names around.)
I don't know whether this will improve the complexity from O(n^2 log n) to something better, but it ought to help.
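Here is one possible Java sketch of this idea; it is not the exact L/M bookkeeping described above, but the same principle: it keeps, for each block's current head, its LCP with the last string written to the output, so a comparison is either decided with no character comparisons at all (when one head shares a strictly longer prefix with the last output) or starts past the known common prefix. Names and structure are mine, a sketch rather than a reference implementation.

import java.util.*;

public class LcpMergesort {
    static String[] S;   // fixed table of strings; we sort indices into it

    // Compare S[a] and S[b], skipping the first 'skip' characters, which are
    // already known to be equal. Returns {sign of comparison, LCP of the two strings}.
    static int[] compareFrom(int a, int b, int skip) {
        String u = S[a], v = S[b];
        int i = skip, m = Math.min(u.length(), v.length());
        while (i < m && u.charAt(i) == v.charAt(i)) i++;
        int sign = (i < m) ? Character.compare(u.charAt(i), v.charAt(i))
                           : Integer.compare(u.length(), v.length());
        return new int[]{sign, i};
    }

    // Merge sorted index blocks X and Y (with their LCP tables) into Z / lcpZ.
    // lx and ly hold the LCP of each block's current head with the last output string.
    static void merge(int[] X, int[] lcpX, int[] Y, int[] lcpY, int[] Z, int[] lcpZ) {
        int i = 0, j = 0, k = 0, lx = 0, ly = 0;
        while (i < X.length && j < Y.length) {
            int sign, lcp;
            if (lx > ly) { sign = -1; lcp = ly; }        // X head must be smaller: no characters compared
            else if (ly > lx) { sign = 1; lcp = lx; }    // Y head must be smaller: no characters compared
            else { int[] r = compareFrom(X[i], Y[j], lx); sign = r[0]; lcp = r[1]; }
            if (sign <= 0) {
                Z[k] = X[i]; lcpZ[k++] = lx;
                lx = (++i < X.length) ? lcpX[i] : 0;     // new X head vs. the string just output
                ly = lcp;                                // Y head vs. the string just output
            } else {
                Z[k] = Y[j]; lcpZ[k++] = ly;
                ly = (++j < Y.length) ? lcpY[j] : 0;
                lx = lcp;
            }
        }
        while (i < X.length) { Z[k] = X[i]; lcpZ[k++] = lx; lx = (++i < X.length) ? lcpX[i] : 0; }
        while (j < Y.length) { Z[k] = Y[j]; lcpZ[k++] = ly; ly = (++j < Y.length) ? lcpY[j] : 0; }
    }

    // Top-down mergesort over string indices, carrying the LCP tables along.
    static void sort(int[] idx, int[] lcp) {
        if (idx.length <= 1) return;
        int mid = idx.length / 2;
        int[] X = Arrays.copyOfRange(idx, 0, mid),          lcpX = new int[mid];
        int[] Y = Arrays.copyOfRange(idx, mid, idx.length), lcpY = new int[idx.length - mid];
        sort(X, lcpX);
        sort(Y, lcpY);
        merge(X, lcpX, Y, lcpY, idx, lcp);
    }

    public static void main(String[] args) {
        S = new String[]{"banana", "bandana", "apple", "band", "applesauce", "banal"};
        int[] idx = new int[S.length];
        for (int t = 0; t < idx.length; t++) idx[t] = t;
        int[] lcp = new int[S.length];
        sort(idx, lcp);
        for (int t : idx) System.out.println(S[t]);   // apple, applesauce, banal, banana, band, bandana
    }
}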
You can build a Trie, which will cost O(s*n).
Details:
https://stackoverflow.com/a/13109908
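A rough Java sketch of that idea, assuming nothing about the alphabet: a trie whose children are kept in a TreeMap, so an in-order walk emits the strings in sorted order. The names are mine.

import java.util.*;

public class TrieSort {

    // Trie node with ordered children so an in-order walk yields lexicographic order.
    static class Node {
        TreeMap<Character, Node> children = new TreeMap<>();
        int endCount = 0;   // how many inserted strings end at this node
    }

    static void insert(Node root, String s) {
        Node cur = root;
        for (char c : s.toCharArray())
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        cur.endCount++;
    }

    // Depth-first walk; total work is proportional to the sum of string lengths.
    static void collect(Node node, StringBuilder prefix, List<String> out) {
        for (int k = 0; k < node.endCount; k++) out.add(prefix.toString());
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            prefix.append(e.getKey());
            collect(e.getValue(), prefix, out);
            prefix.deleteCharAt(prefix.length() - 1);
        }
    }

    public static void main(String[] args) {
        Node root = new Node();
        for (String s : new String[]{"banana", "band", "apple", "banana"}) insert(root, s);
        List<String> sorted = new ArrayList<>();
        collect(root, new StringBuilder(), sorted);
        System.out.println(sorted); // [apple, banana, banana, band]
    }
}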
Solving it for all cases should not be possible in better than O(N^2 log N).
However, if there are constraints that relax the string comparison, it can be optimised.
- If the strings have a high repetition rate and are from a finite ordered set, you can use ideas from counting sort: use a map to store their counts, and then sorting just the map keys should suffice, giving O(NMlogM) where M is the number of unique strings. You can even directly use a TreeMap for this purpose (see the sketch after this list).
- If the strings are not random but are the suffixes of some superstring, this can be done in O(N log^2 N): http://discuss.codechef.com/questions/21385/a-tutorial-on-suffix-arrays
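Regarding the counting-sort idea above, here is a small Java sketch using a TreeMap; the input strings are made up for the example.

import java.util.*;

public class TreeMapCountSort {

    // Counting-sort flavoured approach for highly repetitive inputs:
    // a TreeMap keeps the distinct strings in sorted order with their counts.
    public static void main(String[] args) {
        String[] input = {"abc", "aba", "abc", "abc", "abz", "abz", "aba"};

        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String s : input) counts.merge(s, 1, Integer::sum);

        // Walk the map in key order and expand the counts back into a sorted list.
        List<String> sorted = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            for (int k = 0; k < e.getValue(); k++) sorted.add(e.getKey());

        System.out.println(sorted); // [aba, aba, abc, abc, abc, abz, abz]
    }
}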