Converting a trie into a reverse trie?

Suppose that I have a trie data structure holding a number of strings. To look up a string in the trie, I would start at the root and follow the pointers labeled with the appropriate characters of the string, in order, until I arrived at a given node.
Now suppose that I want to build a "reverse trie" for the same set of strings, where instead of looking up strings by starting with the first character, I would look up strings by starting with the last character.
Is there an efficient algorithm for turning tries into reverse tries? In the worst case I could always list off all the strings in the trie and then insert them one-at-a-time into a reverse trie, but it seems like there might be a better, more clever solution.

But the nodes in your trie are in essentially random order if your ordering criterion is based on the reversed strings. That is, given the sequence [aardvark, bat, cat, dog, elephant, fox], which is in alphabetical order, reversing the words gives [kravdraa, tab, tac, god, tnahpele, xof], which is not in alphabetical order.
In other words, there's no "more clever solution" than to build your reverse trie.
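If it helps to see that brute-force route spelled out, here is a minimal sketch in Python (the dict-of-dicts trie and the END marker are just one convenient representation, not anything from the question):

END = object()  # marks "a string terminates at this node"

def insert(trie, word):
    """Insert a word into a dict-of-dicts trie."""
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True

def words(trie, prefix=""):
    """Enumerate every string stored in the trie via DFS."""
    if END in trie:
        yield prefix
    for ch, child in trie.items():
        if ch is not END:
            yield from words(child, prefix + ch)

def reverse_trie(trie):
    """List off every string in the trie and reinsert it reversed."""
    rev = {}
    for w in words(trie):
        insert(rev, w[::-1])
    return rev

Looking up a string by its last character is then just an ordinary walk down rev.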

Related

Longest common substring via suffix array: do we really need unique sentinels?

I am reading about LCP arrays and their use, in conjunction with suffix arrays, in solving the "Longest common substring" problem. This video states that the sentinels used to separate individual strings must be unique, and not be contained in any of the strings themselves.
Unless I am mistaken, the reason for this is so that when we construct the LCP array (by comparing how many characters adjacent suffixes have in common) we don't count the sentinel value in the case where two sentinels happen to land at the same index in both of the suffixes we are comparing.
This means we can write code like this:
for each index c in the shorter suffix
    if suffix_1[c] == suffix_2[c]
        increment count of common characters
However, in order to facilitate this, we need to jump through some hoops to ensure we use unique sentinels, which I asked about here.
However, would a simpler (to implement) solution not be to simply count the number of characters in common, stopping when we reach the (single, unique) sentinel character, like this:
set sentinel = '#'
for each index c in the shorter suffix
    if suffix_1[c] == suffix_2[c]
        if suffix_1[c] != sentinel
            increment count of common characters
        else
            return
Or, am I missing something fundamental here?
Actually I just devised an algorithm that doesn't use sentinels at all: https://github.com/BurntSushi/suffix/issues/14
When concatenating the strings, also record the boundary indexes (e.g. for 3 strings of lengths 4, 2, and 5, the boundaries 4, 6, and 11 will be recorded, so we know that concatenated_string[5] belongs to the second original string because 4 <= 5 < 6).
Then, to identify which original string every suffix belongs to, just do a binary search.
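As a small illustration of that bookkeeping (the sample strings, the owner helper, and the use of the standard-library bisect module are my own choices):

from bisect import bisect_right
from itertools import accumulate

strings = ["abcd", "ef", "ghijk"]                        # lengths 4, 2, 5
text = "".join(strings)
boundaries = list(accumulate(len(s) for s in strings))   # [4, 6, 11]

def owner(index):
    """Which original string the character at `index` of the concatenation belongs to."""
    return bisect_right(boundaries, index)

assert owner(5) == 1   # concatenated_string[5] is in the second string, since 4 <= 5 < 6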
The short version is "this is mostly an artifact of how suffix array construction algorithms work and has nothing to do with LCP calculations, so provided your suffix array building algorithm doesn't need those sentinels, you can safely skip them."
The longer answer:
At a high level, the basic algorithm described in the video goes like this:
1. Construct a generalized suffix array for the strings T1 and T2.
2. Construct an LCP array for that resulting suffix array.
3. Iterate across the LCP array, looking for adjacent pairs of suffixes that come from different strings.
4. Find the largest LCP between any two such suffixes; call it k.
5. Extract the first k characters from either of the two suffixes.
So, where do sentinels appear in here? They mostly come up in steps (1) and (2). The video alludes to using a linear-time suffix array construction algorithm (SACA). Most fast SACAs for generating suffix arrays for two or more strings assume, as part of their operation, that there are distinct endmarkers at the ends of those strings, and often the internal correctness of the algorithm relies on this. So in that sense, the endmarkers might need to get added in purely to use a fast SACA, completely independent of any later use you might have.
(Why do SACAs need this? Some of the fastest SACAs, such as the SA-IS algorithm, assume the last character of the string is unique, lexicographically precedes everything, and doesn't appear anywhere else. In order to use that algorithm with multiple strings, you need some sort of internal delimiter to mark where one string ends and another starts. That character needs to act as a strong "and we're now done with the first string" character, which is why it needs to lexicographically precede all the other characters.)
Assuming you're using a SACA as a black box this way, from this point forward, those sentinels are completely unnecessary. They aren't used to tell which suffix comes from which string (this should be provided by the SACA), and they can't be a part of the overlap between adjacent suffixes.
So in that sense, you can think of these sentinels as an implementation detail needed to use a fast SACA, which you'd need to do in order to get the fast runtime.
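To make the pipeline concrete, here is a deliberately naive sketch of the steps above in Python: it builds the generalized suffix array by sorting (rather than with a fast SACA), tracks string ownership directly instead of using sentinels, and computes LCPs by direct comparison, so it only illustrates the logic, not the fast runtime:

def longest_common_substring(t1, t2):
    # Step 1: generalized "suffix array": every suffix of both strings,
    # tagged with which string it came from, sorted lexicographically.
    suffixes = [(t1[i:], 0) for i in range(len(t1))] + \
               [(t2[i:], 1) for i in range(len(t2))]
    suffixes.sort()

    def lcp(a, b):
        """Length of the longest common prefix of two suffixes."""
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    # Steps 2-4: scan adjacent pairs, keep the best LCP between suffixes
    # that come from different strings.
    best, best_suffix = 0, ""
    for (s1, owner1), (s2, owner2) in zip(suffixes, suffixes[1:]):
        if owner1 != owner2:
            k = lcp(s1, s2)
            if k > best:
                best, best_suffix = k, s1

    # Step 5: the answer is the first `best` characters of either suffix.
    return best_suffix[:best]

print(longest_common_substring("xabcdy", "zabcq"))  # "abc"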

Finding all names matching a query: how to use a suffix tree?

Question: You have a smartphone and you open the contacts app. You want to search for a contact, say manmohan, but you don't remember his full name; you only remember mohan, so you start typing. The moment you type 'm', the contacts app starts searching for contacts that contain the letter 'm'. Suppose the names stored in your contact list are ("manmohan", "manoj", "raghav", "dinesh", "aman"); the app will now show manmohan, manoj and aman as results. The next character you type is 'o' (so far you have typed "mo"), and now the result should be "manmohan". How would you implement such a data structure?
My approach was to apply KMP, looking for the pattern "m", then "mo", in all available contacts and displaying the strings that match. But the interviewer said it's not efficient (and I couldn't think of a better approach). Before leaving he said there is an algorithm that would help, and that if I knew it I could solve it. I couldn't. (Before leaving I asked about that standard algorithm; the interviewer said: suffix tree.) Can anyone explain how it is better, or which algorithm is best for implementing this data structure?
The problem you're trying to solve essentially boils down to the following: given a fixed collection of strings and a string that only changes via appends, how do you efficiently find all strings that contain that pattern as a substring?
There's a neat little result on strings that's often useful for taking on problems that involve substring searching: a string P is a substring of a string T if and only if P is a prefix of at least one suffix of T. (Do you see why?)
So imagine that you take every name in your word bank and construct a trie of all the suffixes of all the words in that bank. Now, given the pattern string P to search for, walk down the trie, reading characters of P. If you fall off the trie, then the string P must not be a substring of any of the names in the bank (otherwise, it would have been a prefix of at least one suffix of one of the strings in T). Otherwise, you're at some trie node, and all of the suffixes in the subtree rooted at that node correspond to all of the matches of your substring in all of the names in T, which you can find by DFS-ing the subtrie and recording all the suffixes you find.
A suffix tree is essentially a time- and space-efficient data structure for representing a trie of all the suffixes of a collection of strings. It can be built in time proportional to the total number of characters in T (though the algorithms for doing so are famously hard to intuit and code up) and is designed so that, once you've walked down to a given node, you can find all the matches below it in time O(k), where k is the number of matches.
To recap, the core idea here is to make a trie of all the suffixes of the strings in T and then to walk down it using the pattern P. For time and space efficiency, you'd do this with a suffix tree rather than a suffix trie.
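Here is a rough Python sketch of that idea using a plain suffix trie (quadratic space, so only illustrative); rather than DFS-ing the subtree at query time, it records the owning names at every node while building, which is one of several ways to read the matches off:

def build(names):
    """Trie containing every suffix of every name, with the owning names
    recorded at each node along the way."""
    root = {"names": set(), "children": {}}
    for name in names:
        for start in range(len(name)):
            node = root
            for ch in name[start:]:
                node = node["children"].setdefault(
                    ch, {"names": set(), "children": {}})
                node["names"].add(name)
    return root

def search(root, pattern):
    """All names containing `pattern` as a substring: walk down the trie."""
    node = root
    for ch in pattern:
        if ch not in node["children"]:
            return set()          # fell off the trie: no name contains pattern
        node = node["children"][ch]
    return node["names"]

contacts = build(["manmohan", "manoj", "raghav", "dinesh", "aman"])
print(sorted(search(contacts, "m")))   # ['aman', 'manmohan', 'manoj']
print(sorted(search(contacts, "mo")))  # ['manmohan']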

Time complexity for inserting all suffixes of a string into a ternary search tree?

I have a ternary search tree that contains all the suffixes of a word. What is the time complexity for constructing this structure and for searching a word in it?
Example:
The word banana$ has the suffixes banana$, anana$, nana$, ana$, na$, a$, $,
which in lexicographical order are $, a$, ana$, anana$, banana$, na$, nana$.
Inserting all suffixes into the ternary search tree in balanced form gives the insertion order:
anana$, a$, $, ana$, na$, banana$, nana$.
Generally speaking, the time required to insert something into a TST is O(L log |Σ|), where L is the length of the string and Σ is the set of allowed characters in your string. The reason for this is that adding each individual character takes time O(log |Σ|) because you're adding each character into a BST of at most |Σ| elements. For the example you're describing, you're adding in strings of length 1, 2, 3, ..., n, so the runtime is O(n² log |Σ|).
That said, I think you can speed this up by going through a more indirect route. A ternary search tree can be thought of as a trie where the child pointers of each node are stored in a binary search tree. If you just want a trie of all the suffixes, you might want to look at suffix trees, which are specifically designed to represent that information. They can be built in time O(n) for a length-n string.
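To make the cost model concrete, here is a minimal TST insertion sketch (the node layout and names are mine); each character comparison descends a small BST of sibling characters, which is where the log |Σ| factor comes from, and inserting all n suffixes of banana$ gives the O(n² log |Σ|) total from above:

class Node:
    __slots__ = ("ch", "lo", "eq", "hi", "end")
    def __init__(self, ch):
        self.ch, self.lo, self.eq, self.hi, self.end = ch, None, None, None, False

def insert(root, s):
    """Insert string s into the TST rooted at `root`; returns the (new) root."""
    if not s:
        return root
    ch, rest = s[0], s[1:]
    if root is None:
        root = Node(ch)
    if ch < root.ch:
        root.lo = insert(root.lo, s)        # keep searching the sibling BST
    elif ch > root.ch:
        root.hi = insert(root.hi, s)
    elif rest:
        root.eq = insert(root.eq, rest)     # character matched: move to the next one
    else:
        root.end = True                     # whole string stored
    return root

word = "banana$"
root = None
for i in range(len(word)):                  # n suffixes of lengths n, n-1, ..., 1
    root = insert(root, word[i:])           # => O(n^2 log |alphabet|) overall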

Algorithm to un-concatenate words from string without spaces and punctuation

I've been given a problem in my data structures class and need to find a solution. It's similar to an interview question. Could someone explain the thinking process or a solution to the problem? Pseudocode can be used. So far I've been thinking of using a trie to hold the dictionary and look up words that way for efficiency.
This is the problem:
Oh, no! You have just completed a lengthy document when you have an unfortunate Find/Replace mishap. You have accidentally removed all spaces, punctuation, and capitalization in the document. A sentence like "I reset the computer. It still didn't boot!" would become "iresetthecomputeritstilldidntboot". You figure that you can add back in the punctuation and capitalization later, once you get the individual words properly separated. Most of the words will be in a dictionary, but some strings, like proper names, will not.
Given a dictionary (a list of words), design an algorithm to find the optimal way of "unconcatenating" a sequence of words. In this case, "optimal" is defined to be the parsing which minimizes the number of unrecognized sequences of characters.
For example, the string "jesslookedjustliketimherbrother" would be optimally parsed as "JESS looked just like TIM her brother". This parsing has seven unrecognized characters, which we have capitalized for clarity.
For each index, n, into the string, compute the cost C(n) of the optimal solution (i.e. the number of unrecognized characters in the optimal parsing) starting at that index.
Then, the solution to your problem is C(0).
There's a recurrence relation for C. At each n, either you match a word of i characters, or you skip over character n, incurring a cost of 1, and then parse the rest optimally. You just need to find which of those choices incurs the lowest cost.
Let N be the length of the string, and let W(n) be a set containing the lengths of all words starting at index n in your string. Then:
C(N) = 0
C(n) = min({C(n+1) + 1} union {C(n+i) for i in W(n)})
This can be implemented using dynamic programming by constructing a table of C(n) starting from the end backwards.
If the length of the longest word in your dictionary is L, then the algorithm runs in O(NL) time in the worst case and can be implemented to use O(L) memory if you're careful.
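A compact version of that recurrence in Python (the set-based dictionary and the way W(n) is probed by trying every dictionary word length at each index are my own choices; this fills the whole table, so it doesn't attempt the O(L) memory optimization):

def min_unrecognized(text, dictionary):
    """C(0) from the recurrence: fewest unrecognized characters in an optimal parse."""
    words = set(dictionary)
    lengths = sorted({len(w) for w in words})
    N = len(text)
    C = [0] * (N + 1)                        # C(N) = 0
    for n in range(N - 1, -1, -1):
        best = C[n + 1] + 1                  # skip character n, incurring a cost of 1
        for L in lengths:                    # every word length i in W(n)
            if n + L <= N and text[n:n + L] in words:
                best = min(best, C[n + L])   # match a dictionary word of length L
        C[n] = best
    return C[0]

dictionary = ["looked", "just", "like", "her", "brother"]
print(min_unrecognized("jesslookedjustliketimherbrother", dictionary))  # 7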
You could use rolling hashes of different lengths to speed up the search.
You can try a multi-pattern matcher, for example the Aho-Corasick algorithm. Basically it's a keyword trie augmented with failure links, which lets you find every dictionary word occurring in the text in a single pass.

Symbolic representation of patterns in strings, and finding "similar" sub-patterns

A string "abab" could be thought of as a pattern of indexed symbols "0101". And a string "bcbc" would also be represented by "0101". That's pretty nifty and makes for powerful comparisons, but it quickly falls apart out of perfect cases.
"babcbc" would be "010202". If I wanted to note that it contains a pattern equal to "0101" (the bcbc part), I can only think of doing some sort of normalization process at each index to "re-represent" the substring from n to length symbolically for comparison. And that gets complicated if I'm trying to see if "babcbc" and "dababd" (010202 vs 012120) have anything in common. So inefficient!
How could this be done efficiently, taking care of all possible nested cases? Note that I'm looking for similar patterns, not similar sub-strings in the actual text.
Try replacing each character with min(K, distance back to previous occurrence of that character), where K is a tunable constant so babcbc and dababd become something like KK2K22 and KKK225. You could use a suffix tree or suffix array to find repeats in the transformed text.
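A quick sketch of that transformation (the value of K and the list output are arbitrary choices here):

def transform(s, K=9):
    """Replace each character with min(K, distance back to its previous occurrence);
    first occurrences get K."""
    last = {}
    out = []
    for i, ch in enumerate(s):
        dist = i - last[ch] if ch in last else K
        out.append(min(K, dist))
        last[ch] = i
    return out

print(transform("babcbc"))  # [9, 9, 2, 9, 2, 2]  -> "KK2K22" in the answer's notation
print(transform("dababd"))  # [9, 9, 9, 2, 2, 5]  -> "KKK225"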
Your algorithm loses information by compressing the string's original data, so I'm not sure you can recover the full information without doing far more work than comparing the original strings. Also, while your representation appears easier to read, it currently takes up as much space as the original string, and a difference map of the string (where the values are the distances between the prior and current occurrences of each character) may carry more comparable information.
However, as to how you can detect all common subsets, you should look at Longest Common Subsequence algorithms to find the largest matching pattern. It is a well-defined algorithm and is efficient -- O(n * m), where n and m are the lengths of the strings. See LCS on SO and Wikipedia. If you also want to see patterns which wrap around a string (as a circular string -- where abeab and eabab should match) then you'll need a circular LCS, which is described in a paper by Andy Nguyen.
You'll need to change the algorithm slightly to account for the number of variations so far. My advice would be to add two additional dimensions to the LCS table representing the number of unique numbers encountered in the past k characters of both original strings, along with your compressed version of each string. Then you could do an LCS solve where you always move in the direction which matches on your compressed strings AND matches the same number of unique characters in both strings for the past k characters. This should encode all possible unique substring matches.
The tricky part will be always choosing the direction which maximizes the k that contains the same number of unique characters. Thus at each element of the LCS table you'll have an additional search for the best step of the k value. Since a longer sequence always contains all possible smaller sequences, if you maximize your choice of k during each step, you know that the best k on the next iteration is at most 1 step away, so once the 4D table is filled out it should be solvable in a similar fashion to the original LCS table. Note that because you have a 4D table the logic does get more complicated, but if you read how LCS works you'll be able to see how you can define consistent rules for moving towards the upper-left corner at each step. Thus the LCS algorithm stays the same, just scaled to more dimensions.
This solution is quite complicated once it's complete, so you may want to rethink what you're trying to achieve, and whether this pattern encodes the information you actually want, before you start writing such an algorithm.
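For reference, the standard LCS recurrence that this answer builds on (before adding the extra dimensions it proposes) looks like this in Python:

def lcs_length(a, b):
    """Standard longest-common-subsequence DP, O(n * m) time."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1   # characters match: extend the LCS
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

print(lcs_length("010202", "012120"))  # 4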
Here goes a solution that uses Prolog's unification capabilities and attributed variables to match templates:
:- dynamic pattern_i/3.

test :-
    retractall(pattern_i(_, _, _)),
    add_pattern(abab),
    add_pattern(bcbc),
    add_pattern(babcbc),
    add_pattern(dababd),
    show_similarities.

show_similarities :-
    call(pattern_i(Word, Pattern, Maps)),
    match_pattern(Word, Pattern, Maps),
    fail.
show_similarities.

match_pattern(Word, Pattern, Maps) :-
    all_dif(Maps),                        % all variables should be unique
    call(pattern_i(MWord, MPattern, MMaps)),
    Word \= MWord,
    all_dif(MMaps),
    append([_, Pattern, _], MPattern),    % matches patterns
    writeln(words(Word, MWord)),
    write('mapping: '),
    match_pattern1(Maps, MMaps).          % prints mappings

match_pattern1([], _) :-
    nl, nl.
match_pattern1([Char-Char|Maps], MMaps) :-
    select(MChar-Char, MMaps, NMMaps),
    write(Char), write('='), write(MChar), write(' '),
    !,
    match_pattern1(Maps, NMMaps).

% build and store the template for a word
add_pattern(Word) :-
    word_to_pattern(Word, Pattern, Maps),
    assertz(pattern_i(Word, Pattern, Maps)).

word_to_pattern(Word, Pattern, Maps) :-
    atom_chars(Word, Chars),
    chars_to_pattern(Chars, [], Pattern, Maps).

% one fresh variable per distinct character of the word
chars_to_pattern([], Maps, [], RMaps) :-
    reverse(Maps, RMaps).
chars_to_pattern([Char|Tail], Maps, [PChar|Pattern], NMaps) :-
    member(Char-PChar, Maps),
    !,
    chars_to_pattern(Tail, Maps, Pattern, NMaps).
chars_to_pattern([Char|Tail], Maps, [PChar|Pattern], NMaps) :-
    chars_to_pattern(Tail, [Char-PChar|Maps], Pattern, NMaps).

% constrain all template variables to be pairwise different
all_dif([]).
all_dif([_-Var|Maps]) :-
    all_dif(Var, Maps),
    all_dif(Maps).

all_dif(_, []).
all_dif(Var, [_-MVar|Maps]) :-
    dif(Var, MVar),
    all_dif(Var, Maps).
The idea of the algorithm is:
For each word, generate a list of unbound variables, using the same variable for the same char in the word. E.g. for the word abcbc the list would look something like [X,Y,Z,Y,Z]. This defines the template for this word.
Once we have the list of templates, we take each one and try to unify the template with a subtemplate of every other word. So for example if we have the words abcbc and zxzx, the templates would be [X,Y,Z,Y,Z] and [H,G,H,G]. Then there is a subtemplate of the first template which unifies with the template of the second word (H=Y, G=Z).
For each template match we show the substitutions (variable renamings) needed to yield that match. So in our example the substitutions would be z=b, x=c.
Output for test (words abab, bcbc, babcbc, dababd):
?- test.
words(abab,bcbc)
mapping: a=b b=c
words(abab,babcbc)
mapping: a=b b=c
words(abab,dababd)
mapping: a=a b=b
words(bcbc,abab)
mapping: b=a c=b
words(bcbc,babcbc)
mapping: b=b c=c
words(bcbc,dababd)
mapping: b=a c=b
