Tree searching algorithm - string

I'm looking for suggestions on strategies for searching a tree-like data structure.
The structure is a tree where each element is a string, each branch is a period, and a path is the concatenation of several strings and periods starting at the root. The root and edges from the root are a special case where there is no string behind them.
So given the tree,
{root}
/ \
A X
/ \ /
B C Y
Valid paths are the strings "A", "A.B", "A.C", "X", and "X.Y".
What we have is a set of strings that we need to search for in this tree and find the element that terminates each string. Not all strings in the set appear in the tree. We stop searching when we find all strings. We need to run this search several times but the trees may differ each time. The set of strings to search is the same each run though.
Currently we're using depth-first search, but this isn't very efficient if all strings fall under say the last branch under the root. I feel like there should be a better way of doing this.
What would be a good algorithm for doing this repeated search? Would it be possible to leverage multithreading here as well?

It's an interesting problem; usually one would imagine a single tree being searched for a variable set of strings. Here the situation is reversed: the set of strings is fixed and the tree is highly variable.
I think that the best you can do is build a trie representing the set of strings. That way, you only have to search a tree once for any given prefix. (So, for the example strings you mentioned, you would only need to find the "A" prefix once and the "X" prefix once.) There are lots of trie data structures and algorithms for building them from a set of strings, but since that's a one-time operation for this problem, I wouldn't worry too much about the cost of this preprocessing.

Related

Finding all names matching a query: how to use a suffix tree?

Question : You have a smartphone and you opened the contact app. You want to search a contact. let's say manmohan. but you don't remember his full name. you only remember mohan so you started typing. the moment you type 'm' contact app will start searching for contact which has letter 'm' available. suppose you have stored names in your contact list ("manmohan", "manoj", "raghav","dinesh", "aman") now contact will show manmohan,manoj and aman as a result. Now the next character you type is 'o' (till now you have typed "mo" ) now the result should be "manmohan". How would you implement such data structure?
My approach was applying KMP as you look for pattern "m" then "mo" in all available contact. then display the string which has the match. But interviewer said it's not efficient. ( I couldn't think of any better approach. ) Before leaving he said there is an algorithm which will help. if you know it you can solve it. I couldn't do it. (before leaving I asked about that standard algorithm. Interviewer said : suffix tree). can anyone explain please how is it better ? or which is the best algorithm to implement this data structure.
The problem you're trying to solve essentially boils down to the following: given a fixed collection of strings and a string that only changes via appends, how do you efficiently find all strings that contain that pattern as a substring?
There's a neat little result on strings that's often useful for taking on problems that involve substring searching: a string P is a substring of a string T if and only if P is a prefix of at least one suffix of T. (Do you see why?)
So imagine that you take every name in your word bank and construct a trie of all the suffixes of all the words in that bank. Now, given the pattern string P to search for, walk down the trie, reading characters of P. If you fall off the trie, then the string P must not be a substring of any of the name bank (otherwise, it would have been a prefix of at least one suffix of one of the strings in T). Otherwise, you're at some trie node. Then all of the suffixes in the subtree rooted at the node you're currently visiting correspond to all of the matches of your substring in all of the names in T, which you can find by DFS-ing the subtrie and recording all the suffixes you find.
A suffix tree is essentially a time- and space-efficient data structure for representing a trie of all the suffixes of a collection of strings. It can be built in time proportional to the number of total characters in T (though the algorithms for doing so are famously hard to intuit and code up) and is designed so that you can find all matches of the text string in question rooted at a given node in time O(k), where k is the number of matches.
To recap, the core idea here is to make a trie of all the suffixes of the strings in T and then to walk down it using the pattern P. For time and space efficiency, you'd do this with a suffix tree rather than a suffix trie.

clustering strings - what algorithm is suitable?

I have some strings and characters will not be repeated in a single string.
for example: "AABC" is not possible.
I want to cluster them into sets by their common sub-strings.
for example: "ABC, CDF, GHP" will be cluster into two sets
{ABC,CDF},{GHP}.
several strings with one or more common sub-strings will be in one set.
a string which has no common sub-string with any other strings will be a set itself.
so keep the number of sets smallest.
for example:
1. "ABC, AHD,AKJ,LAN,WER" will be two sets {ABC, AHD,AKJ,LAN},{WER}.
2. "ABC,BDF, HLK, YHT,PX" will be 3 sets {ABC,BDF}.{HLK, YHT},{PX}.
Finding a string which has nothing common with others is easy I think;
for(i=0; i< strings.num; i++)
{ str1 = strings[i];
bool m_com=false;
for(j=0;j < strings.num; j++ )
{
str2=strings[j];
if(hascommon(str1,str2))
m_com=true;
}
if(!m_com)
{
str1 has no common substring with any string,
}
}
now I am thinking about others, how to classify them, is there any algorithm suitable for this?
Input:
strings (characters are not be repeated)
output:
sets (keep number of sets as small as possible)
I know this involves with finding common sub-string problem and clustering.
but I am not familiar with clustering techniques, so I am hoping some one
could recommend me such algorithm.
while I am looking for good ways to do this, I also appreciate suggestions from others.
Tip: actually these strings are simple paths between two points in a graph. I want to find the edge whose removal cuts all these paths. the number of such edges should be minimum. so, for AB,BC,CD, it means a single path ABCD exist.
and I write down a algorithm to find common substrings in my case(my case much simpler). I think I might use this algorithm during the clustering to measure similarities.
I might have two paths, {ABC, ADC}, both removing A or removing B could split the paths.
or I could have {ABC, ADC,HG}, so removing {A,H}, or {CH}, or {CG},or {AG} all works.
I thought I could solve this by finding common subs-strings, then I decide where to remove edges.
One thing should be pointed out first:
For any two strings, "having common substring" is really equivalent to "having common letter". Thus we can replace the condition by "having common letter".
Consider the graph G whose vertices are the strings, and two strings are connected by an edge if and only if they have a common letter. Then you are really asking for separate the graph G into connected components. This can be done easily, using standard graph operation algorithms, c.f. the wiki page here.
What remains is the task of establishing the graph. This is also easy: first, create 26 boxes, labelled A to Z, and read each string once. If the string contains letter A, then put it (or its index) into box A, etc. Finally, those strings inside one box have edges connecting to each other.
There can be further optimizations, but I guess it will depend on the nature of your input data.
You have to use Heap's algorithm for your job to create permutations https://en.wikipedia.org/wiki/Heap's_algorithm
As opposed to WhatsUp, I assume you want any two strings in a subset to have a common substring. This means that for AB, BC, CD, {AB, BC, CD} is not a valid solution, because AB and CD do not have a common substring.
As Whatsup already pointed out, you can represent your strings as a graph, where vertices are the strings and and edge goes from one to the other if they have a common character.
If we are not accepting chains (as described at the beginning), the problem becomes finding a minimum clique cover, which is unfortunately NP-complete.

Time complexity for inserting all suffixes of a string into a ternary search tree?

I have a ternary search tree that contain all the suffixes of a word. What is the time complexity for construction and searching a word in this structure?
Example:
a word banana$, have the suffix banana$,anana$,nana$,ana$,na$,a$,$
and in lexicografical order $,a$,ana$,anana$,banana$,na$,nana$.
inserting all suffix in the ternary search tree in balanced form is:
anana$,a$,$,ana$,na$,banana$,nana$.
Generally speaking, the time required to insert something into a TST is O(L log |Σ|), where L is the length of the string and Σ is the set of allowed characters in your string. The reason for this is that adding each individual character takes time O(log |Σ|) because you're adding each character into a BST of at most |Σ| elements. For the example you're describing, you're adding in strings of length 1, 2, 3, ..., n, so the runtime is O(n2 log |Σ|).
That said, I think you can speed this up by going through a more indirect route. A ternary search tree can be thought of as a trie where the child pointers of each node are stored in a binary search tree. If you just want a trie of all the suffixes, you might want to look at suffix trees, which are specifically designed to represent that information. They can be built in time O(n) for a length-n string.

String search using suffix trees

A suffix tree can be used to efficiently search a word in a set of words. Is suffix trees still the best method if:
1. the set of words is made from an infinite set of characters
2. the set of words is ordered alphabetically (or in a way that makes sense)?
A suffix tree is an overkill if you just want search for a word in a set of words(and you do not need search for their substrings). A trie is a better choice(the time complexity is the same, but it is much simpler). If the words are ordered, you can use a binary search to find the word(yes, it does have an additional log n factor, but it is not that bad). Even if they are not ordered, you can sort them before searching for other words. This approach is good because it does not require any custom data structures and it usually has smaller constant and smaller memory usage(the space complexity is the same, but the constant is better).

Meta-information in DAWG/DAFSA

I would like to implement a string look-up data structure, for dynamic strings, that will support efficient search and insertion. Currently, I am using a trie but I would like to reduce the memory footprint if possible. This Wikipedia article describes a DAWG/DAFSA, which will obviously save a lot of space over a trie by compressing suffixes. However, while it will clearly test whether a string is legal, it is not obvious to me if there is any way to exclude illegal strings. For example, using the words "cite" and "cat" where the "t" and "e" are terminal states, a DAWG/DAFSA would look like this:
c
/ \
a i
\ /
t
|
e
and "cit" and "cate" will be incorrectly recognized as legal strings without some meta-information.
Questions:
1) Is there a preferred way to store meta-information about strings/paths (such as legality) in a DAWG/DAFSA?
2) If a DAWG/DAFSA is incompatible with the requirements (efficient search/insertion and storing meta-information) what's the best data structure to use? A minimal memory footprint would be nice, but perhaps not absolutely necessary.
In a DAWG, you only compress states together if they're completely indistinguishable from one another. This means that you actually wouldn't combine the T nodes for CAT and CITE together for precisely the reason you've noted - that gives you either a false positive on CIT or a false negative on CAT.
DAWGs are typically most effective for static dictionaries when you have a huge number of words with common suffixes. A DAWG for all of English, for example, could save a lot of space by combining all the suffix "s"'s at the end of plural words and most of the "ING" suffixes from gerunds. If you're going to be doing a lot of insertions or deletions, DAWGs are almost certainly the wrong data structure for the job because adding or removing a single word from a DAWG can cause ripple effects that require lots of branches that were previously combined to be split or vice-versa.
Quite honestly, for reasonably-sized data sets, a trie isn't a bad call. A trie for all of English would only use up something like 26MB, which isn't very much. I would only go with the DAWG if space usage really is at a premium and you aren't doing many insertions or deletions.
Hope this helps!

Resources