Given a string s, what is the most efficient way of identifying the shortest supersequence of s from a bag of strings? Also, the last character of s should match the last character of the chosen supersequence.
Unless I misunderstood it, this problem is most certainly in P.
A naive approach would be:
1. Take all strings in B ending with the same character as s. Call this new bag B'. This can be done in O(|B|).
2. Select all strings in B' that are supersequences of s. This can be done in O(|B'| * max(|z|)) for z in B; testing whether a given string s is a subsequence of another string z takes O(|z|).
3. Select the shortest of the strings found in the previous step (in O(|B'|)).
Here |x| means the size of x.
You can combine those steps, but it's O(|B| * max(|z|)) anyway.
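For what it's worth, here is a minimal sketch of that naive approach in Python (the function names are mine, and I assume the bag is just a list of strings):

def is_subsequence(s, z):
    # Single left-to-right scan of z: O(|z|).
    it = iter(z)
    return all(c in it for c in s)

def shortest_supersequence(s, bag):
    # Keep only strings whose last character matches that of s,
    # then take the shortest one that contains s as a subsequence.
    best = None
    for z in bag:
        if z and s and z[-1] == s[-1] and is_subsequence(s, z):
            if best is None or len(z) < len(best):
                best = z
    return best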
Assuming the bag doesn't change very often, I would construct a DAWG and search it with A*.
Run through every string in the bag, checking if s is a substring using a fast string search like KMP. Check which of the superstrings is shortest. This is O(Σ length of strings in bag).
If you need to do the search multiple times, you can construct a suffix trie for each string in the bag, and merge these. Then you can do lookups in O(|s|).
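A rough Python sketch of that idea, assuming (as this answer does) that the check is substring containment, and ignoring the last-character constraint from the question. Building one generalized trie directly rather than merging per-string tries, and annotating each node with the shortest source string, are my own simplifications; a suffix tree or automaton would avoid the quadratic build cost:

def build_suffix_trie(bag):
    # Each node is [children, best], where best is the shortest bag string
    # having a suffix that passes through this node, i.e. containing the
    # node's path string as a substring.  A plain trie of all suffixes
    # costs O(sum |z|^2) to build.
    root = [{}, None]
    for z in bag:
        for start in range(len(z)):
            node = root
            for ch in z[start:]:
                node = node[0].setdefault(ch, [{}, None])
                if node[1] is None or len(z) < len(node[1]):
                    node[1] = z
    return root

def shortest_containing(root, s):
    # Walk s from the root in O(|s|); falling off means no string contains s.
    node = root
    for ch in s:
        if ch not in node[0]:
            return None
        node = node[0][ch]
    return node[1]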
In this problem we have to split a string into meaningful words. We're given a dictionary to check whether a word exists or not.
I've seen some other approaches here at How to split a string into words. Ex: "stringintowords" -> "String Into Words"?.
I thought of a different approach and was wondering if it would work or not.
Example- itlookslikeasentence
Algorithm
Each letter of the string corresponds to a node in a DAG.
Initialize a bool array to False.
At each node we have a choice: if adding the present letter to the previous subarray still produces a valid word, add it; if it does not, begin a new word from that letter and set bool[previous_node] = True, indicating that a word ended there. In the above example, bool[1] would be set to true.
This is something similar to the maximum subarray sum problem.
Would this algorithm work?
No, it wouldn't. Your solution takes the longest possible word at every step, which doesn't always work.
Here is a counterexample:
Let's assume that the given string is aturtle. Your algorithm will take a. Then it will take t, as at is a valid word. atu is not a word, so it'll split the input: at + urtle. However, there is no way to split urtle into a sequence of valid English words. The right answer would be a + turtle.
One of the possible correct solutions uses dynamic programming. We can define a function f such that f(i) = true iff it's possible to split the first i characters of the input into a valid sequence of words. Initially, f(0) = true and the rest of the values are false. For all valid l and r, there is a transition from f(l) to f(r) if f(l) is true and s[l + 1, r] is a valid word.
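A minimal sketch of this DP in Python, assuming the dictionary is given as a set of words (function and variable names are mine):

def can_split(s, dictionary):
    # f[i] is True iff the first i characters can be split into dictionary words.
    n = len(s)
    f = [False] * (n + 1)
    f[0] = True
    for r in range(1, n + 1):
        for l in range(r):
            if f[l] and s[l:r] in dictionary:
                f[r] = True
                break
    return f[n]

For example, can_split("aturtle", {"a", "at", "turtle"}) returns True, while can_split("aturtle", {"at", "urt"}) returns False.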
P.S. Other types of greedy algorithms would not work here either. For instance, if you take the shortest word instead of the longest one, it fails on the input atnight: there is no way to split tnight after the a is stripped off, but at + night is clearly a valid answer.
Question: You have a smartphone and you open the contacts app. You want to search for a contact, let's say manmohan, but you don't remember his full name; you only remember mohan, so you start typing. The moment you type 'm', the contacts app starts searching for contacts that contain the letter 'm'. Suppose you have stored the names ("manmohan", "manoj", "raghav", "dinesh", "aman") in your contact list; the app will now show manmohan, manoj and aman as results. The next character you type is 'o' (so far you have typed "mo"), and now the result should be "manmohan". How would you implement such a data structure?
My approach was to apply KMP, looking for the pattern "m", then "mo", in all available contacts, and then display the strings that match. But the interviewer said it's not efficient (I couldn't think of any better approach). Before leaving he said there is an algorithm that would help; if you know it you can solve the problem. I couldn't. (Before leaving I asked about that standard algorithm; the interviewer said: suffix tree.) Can anyone explain how that is better, or which algorithm is best for implementing this data structure?
The problem you're trying to solve essentially boils down to the following: given a fixed collection of strings and a string that only changes via appends, how do you efficiently find all strings that contain that pattern as a substring?
There's a neat little result on strings that's often useful for taking on problems that involve substring searching: a string P is a substring of a string T if and only if P is a prefix of at least one suffix of T. (Do you see why?)
So imagine that you take every name in your word bank and construct a trie of all the suffixes of all the words in that bank. Now, given the pattern string P to search for, walk down the trie, reading characters of P. If you fall off the trie, then the string P must not be a substring of any of the names in the bank (otherwise, it would have been a prefix of at least one suffix of one of the strings in T). Otherwise, you're at some trie node. Then all of the suffixes in the subtree rooted at the node you're currently visiting correspond to all of the matches of your substring in all of the names in T, which you can find by DFS-ing the subtrie and recording all the suffixes you find.
A suffix tree is essentially a time- and space-efficient data structure for representing a trie of all the suffixes of a collection of strings. It can be built in time proportional to the number of total characters in T (though the algorithms for doing so are famously hard to intuit and code up) and is designed so that you can find all matches of the text string in question rooted at a given node in time O(k), where k is the number of matches.
To recap, the core idea here is to make a trie of all the suffixes of the strings in T and then to walk down it using the pattern P. For time and space efficiency, you'd do this with a suffix tree rather than a suffix trie.
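Here's a small Python sketch of the suffix-trie version of this idea (names are mine). Instead of DFS-ing the subtrie at query time, it simply caches the set of name indices at every node, which trades space for simplicity; a real suffix tree avoids that overhead:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.names = set()   # indices of names with a suffix passing through here

def build_contact_index(names):
    # Insert every suffix of every name, tagging nodes with the name's index.
    root = TrieNode()
    for idx, name in enumerate(names):
        for start in range(len(name)):
            node = root
            for ch in name[start:]:
                node = node.children.setdefault(ch, TrieNode())
                node.names.add(idx)
    return root

def search(root, names, pattern):
    # Walk the typed pattern down the trie; the node reached knows every match.
    node = root
    for ch in pattern:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return [names[i] for i in sorted(node.names)]

With names = ["manmohan", "manoj", "raghav", "dinesh", "aman"], search(build_contact_index(names), names, "mo") returns ["manmohan"], while searching for "m" returns the three names containing an m.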
Given a (modified/broken) suffix tree, which stores in each edge the beginning and ending of the current substring, but not the substring itself, i.e a suffix tree that looks like this:
this tree represents the string "banana" over the alphabet: {a, b, n}.
The algorithm I'm looking for is to find the string that a tree of that sort represents, for the example above, I would like the algorithm to find "banana".
I would like to do that with a complexity of O(|string|), where |string| is the length of the string being recovered.
It can be assumed that:
The size of the alphabet is constant and that every string starts from index 1.
Let's start with some polynomial time solution:
Let's divide all characters in the string into classes of equivalence.
We already know the last character: it is the special $ symbol.
Induction hypothesis: let's assume that we have properly divided all characters of the suffix of length k into classes of equivalence. We can do it properly for the suffix of length k + 1, too.
Proof: let's iterate over all suffixes of length i <- 1...k and check whether the longest common prefix of the suffix of length k + 1 and the suffix of length i is non-empty. It is non-empty iff the lowest common ancestor of the corresponding leaves is not the root of the tree. If we have found such a suffix, we know that its first letter is equal to the first letter of the current suffix, so we can add the first letter of the suffix of length k + 1 to the appropriate equivalence class. Otherwise, it belongs to its own equivalence class.
When all characters are divided into equivalence classes, we just need to assign a unique symbol to each class (if we need to maintain the correct lexicographical order, we can check which one of them goes earlier; to do this, we look at the order of edges that go from the root).
The time complexity is O(n^3): there are n suffixes, we iterate over O(n) other suffixes for each of them, and we compute their LCA in O(n) (assuming a naive algorithm here). So far, so good.
Now let's use several observations to get a linear solution:
We don't really need the LCA itself. We just need to check that it is not the root. Thus, we can divide all leaves into equivalence classes based on their ancestor that is an immediate child of the root. This can be done in linear time using a depth-first search. The longest common prefix of two suffixes is non-empty iff they are in the same class.
We don't actually need to check all shorter suffixes. We only need to check the closest one to the left and to the right in depth-first search order. Finding the closest smaller number to the left and to the right of a given one is a standard problem with a linear stack-based solution (sketched just below).
That's it: we check at most two other suffixes for the given one, and each check is O(1). We have a linear solution now.
This solution assumes that such a string exists. If that assumption may not hold, we can construct some string using this algorithm, then build a suffix tree for it in linear time using Ukkonen's algorithm and check that it is exactly the same as the given one.
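For completeness, here is a short Python sketch of the standard stack subroutine mentioned above: the index of the nearest smaller element to the left of each position (the right side is symmetric, scanning from the end):

def nearest_smaller_to_left(values):
    # result[i] = index of the closest j < i with values[j] < values[i], or -1.
    # O(n): every index is pushed and popped at most once.
    result = []
    stack = []   # indices whose values are strictly increasing
    for i, v in enumerate(values):
        while stack and values[stack[-1]] >= v:
            stack.pop()
        result.append(stack[-1] if stack else -1)
        stack.append(i)
    return result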
I take an input from the user, and it's a string with a certain substring which repeats itself all through the string. I need to output the substring or its length, a.k.a. the period.
Say
S1 = AAAA // substring is A
S2 = ABAB // Substring is AB
S3 = ABCAB // Substring is ABC
S4 = EFIEFI // Substring is EFI
I could start with a single char and check if it is the same as the next character; if it is not, I could do it with two characters, then with three, and so on. This would be an O(N^2) algorithm. I was wondering if there is a more elegant solution to this.
You can do this in linear time and constant additional space by inductively computing the period of each prefix of the string. I can't recall the details (there are several things to get right), but you can find them in Section 13.6 of "Text algorithms" by Crochemore and Rytter under function Per(x).
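For reference, here is a short Python sketch that computes the same kind of period with the KMP prefix (failure) function. Note this is not the constant-space Per(x) routine from the book, since it uses O(n) extra space for the prefix-function table:

def period(s):
    # Smallest p such that s[i] == s[i - p] for all i >= p,
    # i.e. s is a prefix of an infinite repetition of s[:p].
    n = len(s)
    pi = [0] * n                     # KMP prefix function
    for i in range(1, n):
        k = pi[i - 1]
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return n - pi[n - 1] if n else 0

On the examples above: period("AAAA") == 1, period("ABAB") == 2, period("ABCAB") == 3, period("EFIEFI") == 3.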
Let me assume that the length of the string n is at least twice the period p.
Algorithm
1. Let m = 1, and S the whole string.
2. Take m = m*2.
3. Find the next occurrence of the substring S[:m].
4. Let k be the start of that next occurrence.
5. Check if S[:k] is the period.
6. If not, go to 2.
Example
Suppose we have a string
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
For each power of two m we find repetitions of the first m characters. Then we extend this prefix to its second occurrence. Let's start with m = 2^1 = 2, so CD.
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD CDCD CDCD CDCD CD
We don't extend CD since the next occurrence comes right after it. However, CD is not the substring we are looking for, so let's take the next power: 2^2 = 4 and the substring CDCD.
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD CDCD
Now let's extend our string to the first repetition. We get
CDCDFBF
we check if this is periodic. It is not, so we go further. We try 2^3 = 8, so CDCDFBFC
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCDFBFC CDCDFBFC
we try to extend and we get
CDCDFBFCDCDFDF
and this indeed is our period.
I expect this to work in O(n log n) with some KMP-like algorithm for checking where a given string appears. Note that some edge cases still should be worked out here.
Intuitively this should work, but my intuition failed once on this problem already so please correct me if I'm wrong. I will try to figure out a proof.
A very nice problem though.
You can build a suffix tree for the entire string in linear time (suffix trees are easy to look up online), and then recursively compute and store the number of suffix-tree leaves (occurrences of the suffix prefix) N(v) below each internal node v of the suffix tree. Also recursively compute and store the length of each suffix prefix L(v) at each node of the tree. Then, at an internal node v in the tree, the suffix prefix encoded at v is a repeating substring that generates your string if N(v) equals the total length of the string divided by L(v).
We can actually optimise the time complexity by creating a Z array. We can create a Z array in O(n) time and O(n) space. Now, let's say there is a string
S1 = abababab
For this, the Z array would look like
z[]={8,0,6,0,4,0,2,0};
In order to calculate the period we can iterate over the Z array and use the condition i + z[i] = S1.length; the first i satisfying it is the period.
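A small Python sketch of this (the standard Z-function construction plus the condition above; function names are mine):

def z_array(s):
    # z[i] = length of the longest common prefix of s and s[i:].
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    l = r = 0
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z

def period_via_z(s):
    z = z_array(s)
    for i in range(1, len(s)):
        if i + z[i] == len(s):   # suffix starting at i matches a prefix of s
            return i             # the first such i is the period
    return len(s)

For S1 = "abababab" this gives z = [8, 0, 6, 0, 4, 0, 2, 0] and a period of 2.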
Well, if every character in the input string is part of the repeating substring, then all you have to do is store the first character and compare it with the rest of the string's characters one by one. If you find a match, the string up to the matched position is your repeating substring.
I too have been looking for the time-space-optimal solution to this problem. The accepted answer by tmyklebu essentially seems to be it, but I would like to offer some explanation of what it's actually about and some further findings.
First, this question by me proposes a seemingly promising but incorrect solution, with notes on why it's incorrect: Is this algorithm correct for finding period of a string?
In general, the problem "find the period" is equivalent to "find the pattern within itself" (in some sense, "strstr(x+1,x)"), but with no constraint on matching past the end. This means that you can find the period by taking any left-to-right string matching algorithm and applying it to itself, treating a partial match that hits the end of the haystack/text as a match; the time and space requirements are the same as those of whatever string matching algorithm you use.
The approach cited in tmyklebu's answer is essentially applying this principle to String Matching on Ordered Alphabets, also explained here. Another time-space-optimal solution should be possible using the GS algorithm.
The fairly well-known and simple Two Way algorithm (also explained here) unfortunately is not a solution because it's not left-to-right. In particular, the advancement after a mismatch in the left factor depends on the right factor having been a match, and the impossibility of another match misaligned with the right factor modulo the right factor's period. When searching for the pattern within itself and disregarding anything past the end, we can't conclude anything about how soon the next right-factor match could occur (part or all of the right factor may have shifted past the end of the pattern), and therefore a shift that preserves linear time cannot be made.
Of course, if working space is available, a number of other algorithms may be used. KMP is linear-time with O(n) space, and it may be possible to adapt it to something still reasonably efficient with only logarithmic space.
A string "abab" could be thought of as a pattern of indexed symbols "0101". And a string "bcbc" would also be represented by "0101". That's pretty nifty and makes for powerful comparisons, but it quickly falls apart out of perfect cases.
"babcbc" would be "010202". If I wanted to note that it contains a pattern equal to "0101" (the bcbc part), I can only think of doing some sort of normalization process at each index to "re-represent" the substring from n to length symbolically for comparison. And that gets complicated if I'm trying to see if "babcbc" and "dababd" (010202 vs 012120) have anything in common. So inefficient!
How could this be done efficiently, taking care of all possible nested cases? Note that I'm looking for similar patterns, not similar sub-strings in the actual text.
Try replacing each character with min(K, distance back to previous occurrence of that character), where K is a tunable constant so babcbc and dababd become something like KK2K22 and KKK225. You could use a suffix tree or suffix array to find repeats in the transformed text.
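A quick Python sketch of that transform (I use 'K' both as the cap and as the output symbol for "no recent previous occurrence", matching the example; the cap value is a tunable assumption):

def distance_transform(s, k=9):
    # Replace each character with the distance back to its previous
    # occurrence, writing 'K' if there is none or the distance is >= k.
    last = {}
    out = []
    for i, ch in enumerate(s):
        d = i - last[ch] if ch in last else None
        out.append('K' if d is None or d >= k else str(d))
        last[ch] = i
    return ''.join(out)

distance_transform("babcbc") gives "KK2K22" and distance_transform("dababd") gives "KKK225", as in the example.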
Your algorithm loses information by compressing the string's original data set, so I'm not sure you can recover the full information set without doing far more work than comparing the original strings. Also, while your data set appears easier for human readability, it currently takes up as much space as the original string, and a difference map of the string (where the values are the distance between the prior occurrence of a character and its current occurrence) may have a more comparable information set.
However, as to how you can detect all common subsets, you should look at Longest Common Subsequence (LCS) algorithms to find the largest matching pattern. It is a well-defined algorithm and is efficient -- O(n * m), where n and m are the lengths of the strings. See LCS on SO and Wikipedia. If you also want to see patterns which wrap around a string (as a circular string -- where abeab and eabab should match) then you'll need a circular LCS, which is described in a paper by Andy Nguyen.
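For reference, a minimal sketch of the standard LCS-length table that the suggestions below build on (plain Python, not the circular variant):

def lcs_length(a, b):
    # dp[i][j] = LCS length of a[:i] and b[:j]; O(n * m) time and space.
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]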
You'll need to change the algorithm slightly to account for the number of variations so far. My advice would be to add two additional dimensions to the LCS table representing the number of unique numbers encountered in the past k characters of both original strings, along with your compressed version of each string. Then you could do an LCS solve where you are always moving in the direction which matches on your compressed strings AND matches the same number of unique characters in both strings for the past k characters. This should encode all possible unique substring matches.
The tricky part will be always choosing the direction which maximizes the k which contains the same number of unique characters. Thus at each element of the LCS table you'll have an additional string search for the best step of k value. Since a longer sequence always contains all possible smaller sequences, if you maximize your k choice during each step you know that the best k on the next iteration is at most 1 step away, so once the 4D table is filled out it should be solvable in a similar fashion to the original LCS table. Note that because you have a 4D table the logic does get more complicated, but if you read how LCS works you'll be able to see how you can define consistent rules for moving towards the upper left corner at each step. Thus the LCS algorithm stays the same, just scaled to more dimensions.
This solution is quite complicated once it's complete, so you may want to rethink what you're trying to achieve/if this pattern encodes the information you actually want before you start writing such an algorithm.
Here goes a solution that uses Prolog's unification capabilities and attributed variables to match templates:
:- dynamic pattern_i/3.

test :-
    retractall(pattern_i(_,_,_)),
    add_pattern(abab),
    add_pattern(bcbc),
    add_pattern(babcbc),
    add_pattern(dababd),
    show_similarities.

show_similarities :-
    call(pattern_i(Word, Pattern, Maps)),
    match_pattern(Word, Pattern, Maps),
    fail.
show_similarities.

match_pattern(Word, Pattern, Maps) :-
    all_dif(Maps),                          % all variables should be unique
    call(pattern_i(MWord, MPattern, MMaps)),
    Word \= MWord,
    all_dif(MMaps),
    append([_, Pattern, _], MPattern),      % Matches patterns
    writeln(words(Word, MWord)),
    write('mapping: '),
    match_pattern1(Maps, MMaps).            % Prints mappings

match_pattern1([], _) :-
    nl, nl.
match_pattern1([Char-Char|Maps], MMaps) :-
    select(MChar-Char, MMaps, NMMaps),
    write(Char), write('='), write(MChar), write(' '),
    !,
    match_pattern1(Maps, NMMaps).

add_pattern(Word) :-
    word_to_pattern(Word, Pattern, Maps),
    assertz(pattern_i(Word, Pattern, Maps)).

word_to_pattern(Word, Pattern, Maps) :-
    atom_chars(Word, Chars),
    chars_to_pattern(Chars, [], Pattern, Maps).

chars_to_pattern([], Maps, [], RMaps) :-
    reverse(Maps, RMaps).
chars_to_pattern([Char|Tail], Maps, [PChar|Pattern], NMaps) :-
    member(Char-PChar, Maps),
    !,
    chars_to_pattern(Tail, Maps, Pattern, NMaps).
chars_to_pattern([Char|Tail], Maps, [PChar|Pattern], NMaps) :-
    chars_to_pattern(Tail, [Char-PChar|Maps], Pattern, NMaps).

all_dif([]).
all_dif([_-Var|Maps]) :-
    all_dif(Var, Maps),
    all_dif(Maps).

all_dif(_, []).
all_dif(Var, [_-MVar|Maps]) :-
    dif(Var, MVar),
    all_dif(Var, Maps).
The idea of the algorithm is:
For each word, generate a list of unbound variables, where we use the same variable for the same char in the word. E.g., for the word abcbc the list would look something like [X,Y,Z,Y,Z]. This defines the template for this word.
Once we have the list of templates, we take each one and try to unify it with a subtemplate of every other word. So, for example, if we have the words abcbc and zxzx, the templates would be [X,Y,Z,Y,Z] and [H,G,H,G]. Then there is a subtemplate of the first template which unifies with the template of the second word (H=Y, G=Z).
For each template match we show the substitutions needed (variable renamings) to yield that match. So in our example the substitutions would be z=b, x=c
Output for test (words abab, bcbc, babcbc, dababd):
?- test.
words(abab,bcbc)
mapping: a=b b=c
words(abab,babcbc)
mapping: a=b b=c
words(abab,dababd)
mapping: a=a b=b
words(bcbc,abab)
mapping: b=a c=b
words(bcbc,babcbc)
mapping: b=b c=c
words(bcbc,dababd)
mapping: b=a c=b