Given a source string s and n strings of equal length, I need to find the strings that differ from s in at most k positions (comparing characters position by position).
What is a fast algorithm to do this?
PS: I should mention that this is an academic question. I want to find the most efficient algorithm if possible.
Also, I missed one very important piece of information. The n equal-length strings form a dictionary against which many source strings s will be queried, so some sort of preprocessing step could make the queries more efficient.
My gut instinct is just to iterate over each of the n strings, maintaining a counter of how many characters differ from s, but I'm not claiming it is the most efficient solution. However it would be O(n), so unless this is a known performance problem, or an academic question, I'd go with that.
Sedgewick, in his book "Algorithms", writes that a ternary search tree makes it possible "to locate all words within a given Hamming distance of a query word". There is also an article on ternary search trees in Dr. Dobb's.
Given that the strings are of fixed length, you can compute the Hamming distance between two strings to determine their similarity; this is O(L) in the string length L. So, in the worst case, comparing your string against m dictionary words is O(mL).
As an alternative, a fast solution that's also a memory hog is to preprocess your dictionary into a map: keys are tuples (p, c), where p is a position in the string and c is the character at that position, and values are the lists of strings that have character c at position p (so "the" will be in the map under {(0, 't'), "the"}, {(1, 'h'), "the"}, and {(2, 'e'), "the"}). To query the map, iterate through the query string's characters and construct a result map from the retrieved strings; keys are strings, values are the number of times each string has been retrieved from the primary map (so with the query string "the", the key "thx" will have a value of 2, and the key "tee" will also have a value of 2, matching at positions 0 and 2). Finally, iterate through the result map and discard strings whose values are less than K.
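Here is a minimal Java sketch of that map, under a couple of assumptions: the class and method names (PositionIndex, query) are mine, the (p, c) tuple is encoded as a small string key, and the cutoff is expressed as "at most k mismatches" (i.e. at least length - k retrievals), matching the original question rather than the K-required-matches phrasing used above.

import java.util.*;

class PositionIndex {
    private final Map<String, List<String>> index = new HashMap<>();
    private final int length;

    // Build the (position, character) -> words map for an equal-length dictionary.
    PositionIndex(List<String> dictionary, int length) {
        this.length = length;
        for (String word : dictionary) {
            for (int p = 0; p < length; p++) {
                String key = p + ":" + word.charAt(p);   // tuple (p, c) encoded as a string
                index.computeIfAbsent(key, x -> new ArrayList<>()).add(word);
            }
        }
    }

    // Return all dictionary words that differ from s in at most k positions,
    // i.e. that were retrieved at least (length - k) times.
    List<String> query(String s, int k) {
        Map<String, Integer> matches = new HashMap<>();
        for (int p = 0; p < length; p++) {
            List<String> hits = index.getOrDefault(p + ":" + s.charAt(p), Collections.emptyList());
            for (String word : hits) {
                matches.merge(word, 1, Integer::sum);
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : matches.entrySet()) {
            if (e.getValue() >= length - k) result.add(e.getKey());
        }
        return result;
    }
}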
You can save memory by discarding keys that can't possibly reach K as the result map is being built. For example, if K is 5 and N is 8, then once you are processing the 5th through 8th characters of the query string you can discard any retrieved strings that aren't already in the result map, since they can no longer reach 5 matching characters. Or, when you've finished with the 6th character of the query string, you can iterate through the result map and remove all keys whose values are less than 3.
If need be you can offload the primary precomputed map to a NoSql key-value database or something along those lines in order to save on main memory (and also so that you don't have to precompute the dictionary every time the program restarts).
Rather than storing a tuple (p, c) as the key in the primary map, you can instead concatenate the position and character into a string (so (5, 't') becomes "5t", and (12, 'x') becomes "12x").
Without knowing where in each input string the mismatching characters will be, for a particular string you might need to check every character no matter what order you check them in. Therefore it makes sense to just iterate over each string character by character and keep a running count of mismatches. If i is the number of mismatches so far, return false as soon as i exceeds k, and return true once the number of unchecked characters remaining is at most k - i (even if they all mismatch, the total stays within k).
Note that depending on how long the strings are and how many mismatches you'll allow, it might be faster to iterate over the whole string rather than performing these checks, or perhaps to perform them only after every couple of characters. Play around with it to see how you get the fastest performance.
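As a rough Java sketch of that loop (the method name hammingAtMost is mine), assuming equal-length strings:

// Returns true iff s and t (equal length) differ in at most k positions.
static boolean hammingAtMost(String s, String t, int k) {
    int mismatches = 0;
    for (int i = 0; i < s.length(); i++) {
        if (s.charAt(i) != t.charAt(i) && ++mismatches > k) {
            return false;                        // more than k differences: reject early
        }
        if (s.length() - i - 1 <= k - mismatches) {
            return true;                         // even an all-mismatching tail stays within k
        }
    }
    return true;                                 // the loop guarantees mismatches <= k here
}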
My method, if we're thinking out loud :P I can't see a way to do this without going through each of the n strings, but I'm happy to be corrected. With that in mind, it would begin with a preprocessing step that saves a second copy of each of your n strings with its characters sorted in ascending order.
The first part of the comparison would then be to check each string character by character, say n', against each character of s, say s'.
If s' is less than n' then they are not equal, so move on to the next s'. If n' is less than s' then move on to the next n'. Otherwise record a matching character. Repeat this until k mismatches are found, or until enough matches are found, and mark the string accordingly.
For further consideration, additional preprocessing could record, for each adjacent pair of strings in n, the total number of characters by which they differ. This could then be used when comparing a string n to s: if the difference between that string and s is large enough, and the adjacent string differs from it only slightly, there may be no need to compare the adjacent string at all.
N rounds of compression are run on a string, where each round replaces some character pattern with one special character (using a dictionary).
Given this compressed string and the dictionary used for compression, we need to find the original string.
For ex:
Dictionary used for compression:
b12k -> ?
a?l -> #
#mn -> !
So, the string ab12klmn is compressed as !
What data structure suits best to store this dictionary such that the decompression is O(n) operation with least possible extra space used?
What I've tried:
This was an interview question. I stored the dictionary in a map, with the target character (of the compression dictionary) as the key and the fully decompressed string as the value.
Then I traversed the given string, replacing each special character with its expansion.
For ex:
! -> ab12klmn
# -> ab12k
? -> b12k
Then, to reduce the duplication of string patterns, I restructured this dictionary as a tree, but the interviewer wasn't satisfied.
Where can I improve this solution?
I understand that we need to get back the original string from the given compressed string.
The best data structure that you can use here is a 2-dimensional vector (dynamic array). I will try to explain why this can be the best data structure for this problem.
When we use a map we introduce a log n factor when looking up a particular key. With vectors, if you know the location of your search query, the lookup is O(1).
When we use a vector we are not wasting any extra memory blocks. This is also the case with maps, but if you used plain 2-d arrays, unnecessary memory would be wasted.
Since there are only 256 characters, we will store the dictionary as follows. Let's have a 2-d vector of strings with at most 256 rows. For this example
b12k -> ?
a?l -> #
#mn -> !
So we will store "b12k" at v[63], as the ASCII value of '?' is 63. Similarly, we will store "a?l" at v[35], as the ASCII value of '#' is 35, and so on.
NOW HOW TO FIND THE ORIGINAL STRING:
We start from the compressed string.
Initialize the string which will store the final answer. Let's call it origString = "".
Start traversing the string. If the current character is a non-special character, add it to origString.
If we find a special character, go to that character's ASCII value and its corresponding location in the 2d-vector.
Go to step 2.
The pseudo-code for this is
origString = "";
func getOriginalFromCompressed(string s)
for i = [0:s.length()-1]
if(v[s[i]].length()) getOriginalFromCompressed(v[s[i]]);
else origString = stringConcat(origString, s[i]); // add the character to the final answer
end for
end func
origString has the original string.
So the time and space complexity of this solution is O(n), where n is the sum of the lengths of all the strings in the given dictionary.
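For concreteness, here is a rough Java version of the same idea (the class and method names are mine); the 256-entry array plays the role of the 2-d vector above:

import java.util.Arrays;

class Decompressor {
    private final String[] expansions = new String[256];   // indexed by the special character

    Decompressor() {
        Arrays.fill(expansions, "");                        // empty = ordinary character
    }

    void addRule(String pattern, char special) {            // e.g. addRule("b12k", '?')
        expansions[special] = pattern;
    }

    // Recursively expand every special character found in s.
    String decompress(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!expansions[c].isEmpty()) {
                out.append(decompress(expansions[c]));      // nested expansion
            } else {
                out.append(c);                               // plain character, keep as is
            }
        }
        return out.toString();
    }
}

With the rules addRule("b12k", '?'), addRule("a?l", '#') and addRule("#mn", '!'), decompress("!") expands back to ab12klmn.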
I have a collection S, typically containing 10-50 long strings. For illustrative purposes, suppose the length of each string ranges between 1000 and 10000 characters.
I would like to find strings of a specified length k (typically in the range of 5 to 20) that are substrings of every string in S. This can obviously be done using a naive approach: enumerate every k-length substring of S[0] and check whether it occurs in every other element of S.
Are there more efficient ways of approaching the problem? As far as I can tell, there are some similarities between this and the longest common subsequence problem, but my understanding of LCS is limited and I'm not sure how it could be adapted to the situation where we bound the desired common substring length to k, or if subsequence techniques can be applied to finding substrings.
Here's one fairly simple algorithm, which should be reasonably fast.
Using a rolling hash as in the Rabin-Karp string search algorithm, construct a hash table H0 of all the |S0|-k+1 length k substrings of S0. That's roughly O(|S0|) since each hash is computed in O(1) from the previous hash, but it will take longer if there are collisions or duplicate substrings. Using a better hash will help you with collisions but if there are a lot of k-length duplicate substrings in S0 then you could end up using O(k|S0|).
Now use the same rolling hash on S1. This time, look each substring up in H0 and if you find it, remove it from H0 and insert it into a new table H1. Again, this should be around O(|S1|) unless you have some pathological case, like both S0 and S1 being just long repetitions of the same character. (It's also going to be suboptimal if S0 and S1 are the same string, or have lots of overlapping pieces.)
Repeat step 2 for each Si, each time creating a new hash table. (At the end of each iteration of step 2, you can delete the hash table from the previous step.)
At the end, the last hash table will contain all the common k-length substrings.
The total run time should be about O(Σ|Si|) but in the worst case it could be O(kΣ|Si|). Even so, with the problem size as described, it should run in acceptable time.
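Here is a rough Java sketch of that intersection loop. For brevity it stores the k-length substrings themselves in a HashSet rather than maintaining a rolling hash, so it costs O(k·Σ|Si|) instead of the O(Σ|Si|) the rolling-hash version aims for; the function name is mine.

import java.util.*;

// Returns the k-length substrings common to every string in the list.
static Set<String> commonSubstrings(List<String> strings, int k) {
    Set<String> survivors = new HashSet<>();
    String first = strings.get(0);
    for (int i = 0; i + k <= first.length(); i++) {
        survivors.add(first.substring(i, i + k));            // all k-substrings of S0
    }
    for (int s = 1; s < strings.size(); s++) {
        Set<String> next = new HashSet<>();
        String cur = strings.get(s);
        for (int i = 0; i + k <= cur.length(); i++) {
            String sub = cur.substring(i, i + k);
            if (survivors.contains(sub)) next.add(sub);      // keep only substrings seen in all previous strings
        }
        survivors = next;                                    // the old table can be discarded
    }
    return survivors;                                        // common to every string
}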
Some thoughts (N is number of strings, M is average length, K is needed substring size):
Approach 1:
Walk through all strings, computing a rolling hash for each k-length substring and storing these hashes in a map (store tuples {key: hash; string_num; position})
time O(N×M), space O(N×M)
Extract the groups with equal hashes, then check step by step:
1) that the size of the group is >= the number of strings
2) that all strings are represented in the group
3) that the actual substrings really are equal (hashes of distinct substrings can occasionally coincide)
Approach 2:
Build suffix array for every string
time O(N × M log M), space O(N × M)
Find the intersection of the suffix arrays for the first pair of strings using a merge-like approach (the suffixes are sorted), considering only the first k characters of each suffix, then continue with the next string, and so on
I would treat each long string as a collection of overlapped short strings, so ABCDEFGHI becomes ABCDE, BCDEF, CDEFG, DEFGH, EFGHI. You can represent each short string as a pair of indexes, one specifying the long string and one the starting offset in that string (if this strikes you as naive, skip to the end).
I would then sort each collection into ascending order.
Now you can find the short strings common to the first two collections by merging the sorted lists of indexes, keeping only those from the first collection which are also present in the second collection. Check the survivors of this against the third collection, and so on; the survivors at the end correspond to those short strings which are present in all long strings.
(Alternatively you could maintain a set of pointers into each sorted list and repeatedly look to see if every pointer points at short strings with the same text, then advancing the pointer which points at the smallest short string).
Time is O(n log n) for the initial sort, which dominates. In the worst case - e.g. when every string is AAAAAAAA..AA - there is a factor of k on top of this, because all string compares check all characters and take time k. Hopefully, there is a clever way round this with a suffix array (https://en.wikipedia.org/wiki/Suffix_array), which allows you to sort in time O(n) rather than O(nk log n), and the LCP array (https://en.wikipedia.org/wiki/LCP_array), which should allow you to skip some characters when comparing substrings from different suffix arrays.
Thinking about this again, I think the usual suffix array trick of concatenating all of the strings in question, separated by a character not found in any of them, works here. If you look at the LCP of the resulting suffix array you can split it into sections, splitting at points where the difference between suffixes occurs less than k characters in. Now every offset in any particular section starts with the same k characters. Look at the offsets in each section and check whether there is at least one offset from every possible starting string. If so, this k-character sequence occurs in all starting strings, but not otherwise. (There are suffix array constructions which work with arbitrarily large alphabets, so you can always expand your alphabet to produce a character not in any string, if necessary.)
I would try a simple method using HashSets:
Build a HashSet for each long string in S with all its k-strings.
Sort the sets by number of elements.
Scan the first (smallest) set.
Look up each of its terms in the other sets.
The first step takes care of repetitions in each long string.
The second ensures the minimum number of comparisons.
let getHashSet k (lstr:string) =
let strs = System.Collections.Generic.HashSet<string>()
for i in 0..lstr.Length - k do
strs.Add lstr.[i..i + k - 1] |> ignore
strs
let getCommons k lstrs =
let strss = lstrs |> Seq.map (getHashSet k) |> Seq.sortBy (fun strs -> strs.Count)
match strss |> Seq.tryHead with
| None -> [||]
| Some h ->
let rest = Seq.tail strss |> Seq.toArray
[| for s in h do
if rest |> Array.forall (fun strs -> strs.Contains s) then yield s
|]
Test:
let random = System.Random System.DateTime.Now.Millisecond
let generateString n =
[| for i in 1..n do
yield random.Next 20 |> (+) 65 |> System.Convert.ToByte
|] |> System.Text.Encoding.ASCII.GetString
[ for i in 1..3 do yield generateString 10000 ]
|> getCommons 4
|> fun l -> printfn "found %d\n %A" l.Length l
result:
found 40
[|"PPTD"; "KLNN"; "FTSR"; "CNBM"; "SSHG"; "SHGO"; "LEHS"; "BBPD"; "LKQP"; "PFPH";
"AMMS"; "BEPC"; "HIPL"; "PGBJ"; "DDMJ"; "MQNO"; "SOBJ"; "GLAG"; "GBOC"; "NSDI";
"JDDL"; "OOJO"; "NETT"; "TAQN"; "DHME"; "AHDR"; "QHTS"; "TRQO"; "DHPM"; "HIMD";
"NHGH"; "EARK"; "ELNF"; "ADKE"; "DQCC"; "GKJA"; "ASME"; "KFGM"; "AMKE"; "JJLJ"|]
Here it is in fiddle: https://dotnetfiddle.net/ZK8DCT
I am given n strings. I have to find a string S such that each of the given n strings is a subsequence of S.
For example, suppose I am given the following 5 strings:
AATT
CGTT
CAGT
ACGT
ATGC
Then the answer is "ACAGTGCT", because ACAGTGCT contains all of the given strings as subsequences.
To solve this problem I need an algorithm, but I have no idea how to approach it. Can you help me by suggesting a technique for solving this problem?
This is an NP-hard problem: finding a shortest common supersequence, which is closely related to multiple sequence alignment.
The wiki page describes solution methods such as dynamic programming which will work for small n, but becomes prohibitively expensive for larger n.
The basic idea is to construct an array f[a,b,c,...] representing the length of the shortest string S that generates "a" characters of the first string, "b" characters of the second, and "c" characters of the third.
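For illustration, here is the two-sequence case of that DP in Java (the function name is mine); with n sequences the table becomes n-dimensional, which is why the approach becomes prohibitively expensive for larger n:

// f[a][b] = length of the shortest string containing the first a characters of x
// and the first b characters of y as subsequences.
static int scsLength(String x, String y) {
    int n = x.length(), m = y.length();
    int[][] f = new int[n + 1][m + 1];
    for (int a = 0; a <= n; a++) f[a][0] = a;        // only x remains: copy it verbatim
    for (int b = 0; b <= m; b++) f[0][b] = b;        // only y remains
    for (int a = 1; a <= n; a++) {
        for (int b = 1; b <= m; b++) {
            if (x.charAt(a - 1) == y.charAt(b - 1)) {
                f[a][b] = f[a - 1][b - 1] + 1;       // one character serves both sequences
            } else {
                f[a][b] = 1 + Math.min(f[a - 1][b], f[a][b - 1]);
            }
        }
    }
    return f[n][m];
}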
My Approach: using Trie
Building a Trie from the given words.
create empty string (S)
create empty string (prev)
for each layer in the trie
create empty string (curr)
for each character used in the current layer
if the character was not used in the previous layer (not in prev)
add the character to S
add the character to curr
prev = curr
Hope this helps :)
1 Definitions
A sequence of length n is a concatenation of n symbols taken from an alphabet Σ.
If S is a sequence of length n and T is a sequence of length m, with n ≤ m, then S is a subsequence of T if S can be obtained by deleting m-n symbols from T. The deleted symbols need not be contiguous.
A sequence T of length m is a supersequence of S of length n if T can be obtained from S by inserting m-n symbols. That is, T is a supersequence of S if and only if S is a subsequence of T.
A sequence T is a common supersequence of the sequences S1 and S2 if T is a supersequence of both S1 and S2.
2 The problem
The problem is to find a shortest common supersequence (SCS), which is a common supersequence of minimal length. There could be more than one SCS for a given problem.
2.1 Example
Σ = {a, b, c}
S1 = bcb
S2 = baab
S3 = babc
One shortest common supersequence is babcab (babacb, baabcb, bcaabc, bacabc, baacbc).
3 Techniques
Dynamic programming: requires too much memory unless the number of input sequences is very small.
Branch and bound: requires too much time unless the alphabet is very small.
Majority merge: the best known heuristic when the number of sequences is large compared to the alphabet size. [1]
Greedy (take two sequences and replace them by their optimal shortest common supersequence until a single string is left): worse than majority merge. [1]
Genetic algorithms: indications that they might be better than majority merge. [1]
4 Implemented heuristics
4.1 The trivial solution
The trivial solution is at most |Σ| times the optimal solution length and is obtained by concatenating all of the characters in Σ as many times as the length of the longest sequence. That is, if Σ = {a, b, c} and the longest input sequence is of length 4, we get abcabcabcabc.
4.2 Majority merge heuristic
The Majority merge heuristic builds up a supersequence from the empty sequence (S) in the following way:
WHILE there are non-empty input sequences
s <- The most frequent symbol at the start of non-empty input-sequences.
Add s to the end of S.
Remove s from the beginning of each input sequence that starts with s.
END WHILE
Majority merge performs very well when the number of sequences is large compared to the alphabet size.
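A small Java sketch of the loop above (the names are mine; the input sequences are consumed destructively):

import java.util.*;

static String majorityMerge(List<String> input) {
    List<StringBuilder> seqs = new ArrayList<>();
    for (String s : input) seqs.add(new StringBuilder(s));
    StringBuilder result = new StringBuilder();
    while (seqs.stream().anyMatch(sb -> sb.length() > 0)) {
        // Count how often each symbol appears at the start of a non-empty sequence.
        Map<Character, Integer> front = new HashMap<>();
        for (StringBuilder sb : seqs) {
            if (sb.length() > 0) front.merge(sb.charAt(0), 1, Integer::sum);
        }
        char best = Collections.max(front.entrySet(), Map.Entry.comparingByValue()).getKey();
        result.append(best);
        // Remove the chosen symbol from every sequence that starts with it.
        for (StringBuilder sb : seqs) {
            if (sb.length() > 0 && sb.charAt(0) == best) sb.deleteCharAt(0);
        }
    }
    return result.toString();
}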
5 My approach - Local search
My approach was to apply a local search heuristic to the SCS problem and compare it to the Majority merge heuristic to see if it might do better in the case when the alphabet size is larger than the number of sequences.
Since the length of a valid supersequence may vary, and any change to the supersequence may give an invalid string, a direct representation of a supersequence as a feasible solution is not an option.
I chose to view a feasible solution (S) as a sequence of mappings x1...xSl, where Sl is the sum of the lengths of all sequences and each xi is a mapping to a sequence number and an index.
That means, if L = {{s1,1...s1,m1}, {s2,1...s2,m2}, ..., {sn,1...sn,mn}} is the set of input sequences and L(i) is the ith sequence, the mappings are represented like this:
xi → {k, l}, where k ∈ L and l ∈ L(k)
To be sure that any solution is valid we need to introduce the following constraints:
Every symbol in every sequence may only have one xi mapped to it.
If xi → ss,k and xj → ss,l and k < l then i < j.
If xi → ss,k and xj → ss,l and k > l then i > j.
The second constraint enforces that the order of each sequence is preserved but not its position in S. If we have two mappings xi and xj then we may only exchange mappings between them if they map to different sequences.
5.1 The initial solution
There are many ways to choose an initial solution. As long as the order within each sequence is preserved, it is valid. I chose not to randomize the initial solution but to try two very different solution types and compare them.
The first one is to create an initial solution by simply concatenating all the sequences.
The second one is to interleave the sequences one symbol at a time. That is to start with the first symbol of every sequence then, in the same order, take the second symbol of every sequence and so on.
5.2 Local change and the neighbourhood
A local change is done by exchanging two mappings in the solution.
One way of doing the iteration is to go from x1 to xSl and do the best exchange for each mapping.
Another way is to try to exchange the mappings in the order they are defined by the sequences. That is, first exchange s1,1, then s2,1. That is what we do.
There are two variants I have tried.
In the first one, if a single mapping exchange does not yield a better value I return; otherwise I go on.
In the second one, separately for each sequence, I do as many exchanges as there are sequences, so a symbol in each sequence has a possibility of moving. I keep the exchange that gives the best value, and if that value is worse than the value of the last step in the algorithm I return; otherwise I go on.
A symbol may move any number of position to the left or to the right as long as the exchange does not change the order of the original sequences.
The neighbourhood in the first variant is the number of valid exchanges that can be made for the symbol. In the second variant it is the sum of valid exchanges of each symbol after the previous symbol has been exchanged.
5.3 Evaluation
Since the length of the solution is always constant it has to be compressed before the real length of the solution may be obtained.
The solution S, which consists of mappings, is converted to a string by using the symbols each mapping points to. A new, initially empty, solution T is created. Then this algorithm is performed:
T = {}
FOR i = 0 TO Sl
found = FALSE
FOR j = 0 TO |L|
IF first symbol in L(j) = the symbol xi maps to THEN
Remove first symbol from L(j)
found = TRUE
END IF
END FOR
IF found = TRUE THEN
Add the symbol xi maps to to the end of T
END IF
END FOR
Sl is as before the sum of the lengths of all sequences. L is the set of all sequences and L(j) is sequence number j.
The value of the solution S is obtained as |T|.
With Many Many Thanks to : Andreas Westling
Given n strings S1, S2, ..., Sn and an alphabet A = {a_1, a_2, ..., a_m}, assume that the characters within each string are all distinct. Now I want to create an inverted index for each a_i (i = 1, 2, ..., m). My inverted index also has something special: the symbols in A are in some fixed sequential order, and if the inverted list for a_i already includes a string (say S_2), then a_j (j = i+1, i+2, ..., m) must not include S_2 again. In short, every string appears in only one inverted list. My question is how to build such lists in a fast and efficient way. Is there a bound on the time complexity?
For example, A={a,b,e,g}, S1={abg}, S2={bg}, S3={gae}, S4={g}. Then my inverted-list should be:
a: S1,S3
b: S2 (since S1 has already appeared under a, we don't need to include it here)
e:
g: S4
If I understand your question correctly, a straightforward solution is:
for each string in n strings
find the "smallest" character in the string
put the string in the list for the character
The complexity is proportional to the total length of the strings, multiplied by a constant for the order testing.
If there is a simple way to compare characters (e.g. the symbols are lower-case letters in alphabetical order, so < is enough), simply compare them; otherwise, I suggest using a hash table that maps each character to its rank in A, and then comparing the ranks.
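A rough Java sketch of this, assuming every character of every string belongs to A (the names are mine):

import java.util.*;

static Map<Character, List<String>> buildInvertedIndex(List<Character> alphabet,
                                                       List<String> strings) {
    Map<Character, Integer> order = new HashMap<>();
    for (int i = 0; i < alphabet.size(); i++) order.put(alphabet.get(i), i);   // rank of each symbol

    Map<Character, List<String>> index = new LinkedHashMap<>();
    for (char a : alphabet) index.put(a, new ArrayList<>());

    for (String s : strings) {
        char smallest = s.charAt(0);
        for (char c : s.toCharArray()) {
            if (order.get(c) < order.get(smallest)) smallest = c;   // earliest symbol of A occurring in s
        }
        index.get(smallest).add(s);                                 // each string goes in exactly one list
    }
    return index;
}

For A = {a, b, e, g} and the strings above, this produces a: S1, S3; b: S2; e: (empty); g: S4.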
This is a coding exercise. Suppose I have to decide if one string is created by a cyclic shift of another. For example: cab is a cyclic shift of abc but cba is not.
Given two strings s1 and s2 we can do that as follows:
if (s1.length() != s2.length())
    return false;
for (int i = 0; i < s1.length(); i++)
    if ((s1.substring(i) + s1.substring(0, i)).equals(s2))
        return true;
return false;
Now what if I have an array of strings and want to find all strings that are cyclic shift of one another? For example: ["abc", "xyz", "yzx", "cab", "xxx"] -> ["abc", "cab"], ["xyz", "yzx"], ["xxx"]
It looks like I have to check all pairs of the strings. Is there a "better" (more efficient) way to do that?
As a start, you can know if a string s1 is a rotation of a string s2 with a single call to contains(), like this:
public boolean isRotation(String s1, String s2){
String s2twice = s2+s2;
return s2twice.contains(s1);
}
Namely, if s1 is "rotation" and s2 is "otationr", the concat gives you "otationrotationr", which contains s1 indeed.
Now, even if we assume this is linear, or close to it (which is not impossible using Rabin-Karp, for instance), you are still left with O(n^2) pair comparisons, which may be too much.
What you could do is build a hashtable where the sorted word is the key, and the posting list contains all the words from your list that, if sorted, give that key (i.e. key("bca") and key("cab") should both return "abc"):
private Map<String, List<String>> index = new HashMap<>();
/* ... */
public void buildIndex(String[] words){
for(String word : words){
String sortedWord = sortWord(word);
if(!index.containsKey(sortedWord)){
index.put(sortedWord, new ArrayList<String>());
}
index.get(sortedWord).add(word);
}
}
CAVEAT: The hashtable will contain, for each key, all the words that have exactly the same letters occurring the same number of times (not just the rotations; i.e. "abba" and "baba" will have the same key, but isRotation("abba", "baba") will return false).
But once you have built this index, you can significantly reduce the number of pairs you need to consider: if you want all the rotations for "bca" you just need to sort("bca"), look it up in the hashtable, and check (using the isRotation method above, if you want) if the words in the posting list are the result of a rotation or not.
If strings are short compared to the number of strings in the list, you can do significantly better by rotating all strings to some normal form (lexicographic minimum, for example). Then sort lexicographically and find runs of the same string. That's O(n log n), I think... neglecting string lengths. Something to try, maybe.
Concerning the way to find the pairs in the table, there could be many better ways, but what I came up with as a first thought is to sort the table and apply the check to each adjacent pair.
This is much better and simpler than checking every string against every other string in the table.
Consider building an automaton for each string against which you wish to test.
Each automaton should have one entry point for each possible character in the string, and transitions for each character, plus an extra transition from the end to the start.
You could improve performance even further if you amalgamated the automata.
I think a combination of the answers by Patrick87 and savinos would make a fair amount of sense. Specifically, in a Java-esque pseudo-code:
List<String> inputs = Arrays.asList("abc", "xyz", "yzx", "cab", "xxx");
Map<String, List<String>> uniques = new HashMap<>();
for (String value : inputs) {
    String normalized = normalize(value);
    if (!uniques.containsKey(normalized)) {
        uniques.put(normalized, new ArrayList<>());
    }
    uniques.get(normalized).add(value);
}
// you now have a Map of normalized strings to every string in the input
// that is "equal to" that normalized version
Normalizing the string, as stated by Patrick87, might be best done by picking the rotation of the string that results in the lowest lexicographic ordering.
It's worth noting, however, that the "best" algorithm probably relies heavily on the inputs... the number of strings, the length of those string, how many duplicates there are, etc.
You can rotate all the strings to a normalized form using Booth's algorithm (https://en.wikipedia.org/wiki/Lexicographically_minimal_string_rotation) in O(s) time, where s is the length of the string.
You can then use the normalized form as a key in a HashMap (where the value is the set of rotations seen in the input). You can populate this HashMap in a single pass over the data. i.e., for each string
calculate the normalized form
check if the HashMap contains the normalized form as a key - if not insert the empty Set at this key
add the string to the Set in the HashMap
You then just need to output the values of the HashMap. This makes the total runtime of the algorithm O(n * s) - where n is the number of words and s is the average word length. The total space usage is also O(n * s).
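A rough Java sketch of the grouping, with one simplification: normalize() below finds the smallest rotation by trying all rotations (O(s^2) per string) instead of Booth's algorithm, which would give the O(s) bound quoted above; the names are mine.

import java.util.*;

static Collection<List<String>> groupRotations(List<String> words) {
    Map<String, List<String>> groups = new HashMap<>();
    for (String w : words) {
        groups.computeIfAbsent(normalize(w), k -> new ArrayList<>()).add(w);
    }
    return groups.values();                    // each value is one rotation class
}

// Lexicographically smallest rotation, computed naively; Booth's algorithm does this in O(s).
static String normalize(String s) {
    String doubled = s + s;
    String best = s;
    for (int i = 1; i < s.length(); i++) {
        String rot = doubled.substring(i, i + s.length());
        if (rot.compareTo(best) < 0) best = rot;
    }
    return best;
}

On the example input ["abc", "xyz", "yzx", "cab", "xxx"] this yields the groups [abc, cab], [xyz, yzx] and [xxx].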