Finding all names matching a query: how to use a suffix tree? - string

Question : You have a smartphone and you opened the contact app. You want to search a contact. let's say manmohan. but you don't remember his full name. you only remember mohan so you started typing. the moment you type 'm' contact app will start searching for contact which has letter 'm' available. suppose you have stored names in your contact list ("manmohan", "manoj", "raghav","dinesh", "aman") now contact will show manmohan,manoj and aman as a result. Now the next character you type is 'o' (till now you have typed "mo" ) now the result should be "manmohan". How would you implement such data structure?
My approach was applying KMP as you look for pattern "m" then "mo" in all available contact. then display the string which has the match. But interviewer said it's not efficient. ( I couldn't think of any better approach. ) Before leaving he said there is an algorithm which will help. if you know it you can solve it. I couldn't do it. (before leaving I asked about that standard algorithm. Interviewer said : suffix tree). can anyone explain please how is it better ? or which is the best algorithm to implement this data structure.

The problem you're trying to solve essentially boils down to the following: given a fixed collection of strings and a string that only changes via appends, how do you efficiently find all strings that contain that pattern as a substring?
There's a neat little result on strings that's often useful for taking on problems that involve substring searching: a string P is a substring of a string T if and only if P is a prefix of at least one suffix of T. (Do you see why?)
So imagine that you take every name in your word bank and construct a trie of all the suffixes of all the words in that bank. Now, given the pattern string P to search for, walk down the trie, reading characters of P. If you fall off the trie, then the string P must not be a substring of any of the name bank (otherwise, it would have been a prefix of at least one suffix of one of the strings in T). Otherwise, you're at some trie node. Then all of the suffixes in the subtree rooted at the node you're currently visiting correspond to all of the matches of your substring in all of the names in T, which you can find by DFS-ing the subtrie and recording all the suffixes you find.
A suffix tree is essentially a time- and space-efficient data structure for representing a trie of all the suffixes of a collection of strings. It can be built in time proportional to the number of total characters in T (though the algorithms for doing so are famously hard to intuit and code up) and is designed so that you can find all matches of the text string in question rooted at a given node in time O(k), where k is the number of matches.
To recap, the core idea here is to make a trie of all the suffixes of the strings in T and then to walk down it using the pattern P. For time and space efficiency, you'd do this with a suffix tree rather than a suffix trie.

Related

Time complexity for inserting all suffixes of a string into a ternary search tree?

I have a ternary search tree that contain all the suffixes of a word. What is the time complexity for construction and searching a word in this structure?
Example:
a word banana$, have the suffix banana$,anana$,nana$,ana$,na$,a$,$
and in lexicografical order $,a$,ana$,anana$,banana$,na$,nana$.
inserting all suffix in the ternary search tree in balanced form is:
anana$,a$,$,ana$,na$,banana$,nana$.
Generally speaking, the time required to insert something into a TST is O(L log |Σ|), where L is the length of the string and Σ is the set of allowed characters in your string. The reason for this is that adding each individual character takes time O(log |Σ|) because you're adding each character into a BST of at most |Σ| elements. For the example you're describing, you're adding in strings of length 1, 2, 3, ..., n, so the runtime is O(n2 log |Σ|).
That said, I think you can speed this up by going through a more indirect route. A ternary search tree can be thought of as a trie where the child pointers of each node are stored in a binary search tree. If you just want a trie of all the suffixes, you might want to look at suffix trees, which are specifically designed to represent that information. They can be built in time O(n) for a length-n string.

Algorithm to un-concatenate words from string without spaces and punctuation

I've been given a problem in my data structures class to find the solution to this problem. It's similar to an interview question. If someone could explain the thinking process or solution to the problem. Pseudocode can be used. So far i've been thinking to use tries to hold the dictionary and look up words that way for efficiency.
This is the problem:
Oh, no! You have just completed a lengthy document when you have an unfortunate Find/Replace mishap. You have accidentally removed all spaces, punctuation, and capitalization in the document. A sentence like "I reset the computer. It still didn't boot!" would become "iresetthecomputeritstilldidntboot". You figure that you can add back in the punctation and capitalization later, once you get the individual words properly separated. Most of the words will be in a dictionary, but some strings, like proper names, will not.
Given a dictionary (a list of words), design an algorithm to find the optimal way of "unconcatenating" a sequence of words. In this case, "optimal" is defined to be the parsing which minimizes the number of unrecognized sequences of characters.
For example, the string "jesslookedjustliketimherbrother" would be optimally parsed as "JESS looked just like TIM her brother". This parsing has seven unrecognized characters, which we have capitalized for clarity.
For each index, n, into the string, compute the cost C(n) of the optimal solution (ie: the number of unrecognised characters in the optimal parsing) starting at that index.
Then, the solution to your problem is C(0).
There's a recurrence relation for C. At each n, either you match a word of i characters, or you skip over character n, incurring a cost of 1, and then parse the rest optimally. You just need to find which of those choices incurs the lowest cost.
Let N be the length of the string, and let W(n) be a set containing the lengths of all words starting at index n in your string. Then:
C(N) = 0
C(n) = min({C(n+1) + 1} union {C(n+i) for i in W(n)})
This can be implemented using dynamic programming by constructing a table of C(n) starting from the end backwards.
If the length of the longest word in your dictionary is L, then the algorithm runs in O(NL) time in the worst case and can be implemented to use O(L) memory if you're careful.
You could use rolling hashes of different lengths to speed up the search.
You can try a partial pattern matcher for example aho-corasick algorithm. Basically it's a special space optimized version of a suffix tree.

Algorithm for string processing

I am looking for a algorithm for string processing, I have searched for it but couldn't find a algorithm that meets my requirements. I will explain what the algorithm should do with an example.
There are two sets of word sets defined as shown below:
**Main_Words**: swimming, driving, playing
**Words_in_front**: I am, I enjoy, I love, I am going to go
The program will search through a huge set of words as soon it finds a word that is defined in Main_Words it will check the words in front of that Word to see if it has any matching words defined in Words_in_front.
i.e If the program encounters the word "Swimming" it has to check if the words in front of the word "Swimming" are one of these: I am, I enjoy, I love, I am going to go.
Are there any algorithms that can do this?
A straightforward way to do this would be to just do a linear scan through the text, always keeping track of the last N+1 words (or characters) you see, where N is the number of words (or characters) in the longest phrase contained in your words_in_front collection. When you have a "main word", you can just check whether the sequence of N words/characters before it ends with any of the prefixes you have.
This would be a bit faster if you transformed your words_in_front set into a nicer data structure, such as a hashmap (perhaps keyed by last letter in the phrase..) or a prefix/suffix tree of some sort, so you wouldn't have to do an .endsWith over every single member of the set of prefixes each time you have a matching "main word." As was stated in another answer, there is much room for optimization and a few other possible implementations, but there's a start.
Create a map/dictionary/hash/associative array (whatever is defined in your language) with key in Main_Words and Words_in_front are the linked list attached to the entry pointed by the key. Whenever you encounter a word matching a key, go to the table and see if in the attached list there are words that match what you have in front.
That's the basic idea, it can be optimized for both speed and space.
You should be able to build a regular expression along these lines:
I (am|enjoy|love|am going to go) (swimming|driving|playing)

Converting a trie into a reverse trie?

Suppose that I have a trie data structure holding a number of strings. To look up a string in the trie, I would start at the root and follow the pointer labeled with the appropriate characters of the string, in order, until I arrived at a given node.
Now suppose that I want to build a "reverse trie" for the same set of strings, where instead of looking up strings by starting with the first character, I would look up strings by starting with the last character.
Is there an efficient algorithm for turning tries into reverse tries? In the worst case I could always list off all the strings in the trie and then insert them one-at-a-time into a reverse trie, but it seems like there might be a better, more clever solution.
But the nodes in your trie are in essentially random order, if your ordering criteria is based on the strings' reverse. That is, given the sequence [aardvark, bat, cat, dog, elephant, fox], which are in alphabetical order, reversing the words would not give you the sequence [xof, tnahpele, god, tac, tab, kravdraa].
In other words, there's no "more clever solution" than to build your reverse trie.

Finding which word is occurring in given sentence

I've list of words. Number of words is around 1 million.
I've strings coming at runtime, I've to check which word from the list is present in string and return that word (need not to return all words occurring in sentence, returning first one also suffice the requirement).
One solution is checking all words one by one in string but it's inefficient.
Can someone please point out any efficient method of doing it?
Use the Knuth-Morris-Pratt algorithm. Although a million words is not all that much. You can also convert your text body into a Trie structure and then use that to check your search list against. There is a special kind of Trie called a Suffix Tree used especially for full text searching.
Put your word list in a tree or hash table.
Unless your word's list is ordered (or inserted in a efficient data structure like an ordered binary tree) to perform a binary search, the solution you are proposing is the most efficient one.

Resources