Finding all strings by prefix with use of binary tree - nlp

This book on page 50 states that we can use binarу tree to find strings starting with certain prefix. but i dont undertand how to actually implement that. from figurе 3.1 (p. 51) i only understood how to find string starting with certain letter.

Related

String Matching with dot

I need to design a data structure to which I can efficiently add new words(Strings) and search for an existing word. Also, the search word can contain "." in it which can match any character. For eg. if I add strings "abcd" and "abed" then search for "ab.d" should return me both of them.
I tried to tackle this using prefix Tries. It works fine for normal string search(without dot) but for dot I have to search for every child of a node. Is there a more efficient way of solving this problem?
I believe this is an interview question (as I faced this question on one of my past interview). There are some other approaches with different tradeoff, but anything beyond prefix tree (Trie) will be over-engineering for an interview problem. Just build a trie with all the given words and search on the trie for any query words. At any point, when encountered a dot, search for all possible child nodes from that node and continue searching further by matching the later characters. That's it.

Finding all names matching a query: how to use a suffix tree?

Question : You have a smartphone and you opened the contact app. You want to search a contact. let's say manmohan. but you don't remember his full name. you only remember mohan so you started typing. the moment you type 'm' contact app will start searching for contact which has letter 'm' available. suppose you have stored names in your contact list ("manmohan", "manoj", "raghav","dinesh", "aman") now contact will show manmohan,manoj and aman as a result. Now the next character you type is 'o' (till now you have typed "mo" ) now the result should be "manmohan". How would you implement such data structure?
My approach was applying KMP as you look for pattern "m" then "mo" in all available contact. then display the string which has the match. But interviewer said it's not efficient. ( I couldn't think of any better approach. ) Before leaving he said there is an algorithm which will help. if you know it you can solve it. I couldn't do it. (before leaving I asked about that standard algorithm. Interviewer said : suffix tree). can anyone explain please how is it better ? or which is the best algorithm to implement this data structure.
The problem you're trying to solve essentially boils down to the following: given a fixed collection of strings and a string that only changes via appends, how do you efficiently find all strings that contain that pattern as a substring?
There's a neat little result on strings that's often useful for taking on problems that involve substring searching: a string P is a substring of a string T if and only if P is a prefix of at least one suffix of T. (Do you see why?)
So imagine that you take every name in your word bank and construct a trie of all the suffixes of all the words in that bank. Now, given the pattern string P to search for, walk down the trie, reading characters of P. If you fall off the trie, then the string P must not be a substring of any of the name bank (otherwise, it would have been a prefix of at least one suffix of one of the strings in T). Otherwise, you're at some trie node. Then all of the suffixes in the subtree rooted at the node you're currently visiting correspond to all of the matches of your substring in all of the names in T, which you can find by DFS-ing the subtrie and recording all the suffixes you find.
A suffix tree is essentially a time- and space-efficient data structure for representing a trie of all the suffixes of a collection of strings. It can be built in time proportional to the number of total characters in T (though the algorithms for doing so are famously hard to intuit and code up) and is designed so that you can find all matches of the text string in question rooted at a given node in time O(k), where k is the number of matches.
To recap, the core idea here is to make a trie of all the suffixes of the strings in T and then to walk down it using the pattern P. For time and space efficiency, you'd do this with a suffix tree rather than a suffix trie.

Tree searching algorithm

I'm looking for suggestions on strategies for searching a tree-like data structure.
The structure is a tree where each element is a string, each branch is a period, and a path is the concatenation of several strings and periods starting at the root. The root and edges from the root are a special case where there is no string behind them.
So given the tree,
{root}
/ \
A X
/ \ /
B C Y
Valid paths are the strings "A", "A.B", "A.C", "X", and "X.Y".
What we have is a set of strings that we need to search for in this tree and find the element that terminates each string. Not all strings in the set appear in the tree. We stop searching when we find all strings. We need to run this search several times but the trees may differ each time. The set of strings to search is the same each run though.
Currently we're using depth-first search, but this isn't very efficient if all strings fall under say the last branch under the root. I feel like there should be a better way of doing this.
What would be a good algorithm for doing this repeated search? Would it be possible to leverage multithreading here as well?
It's an interesting problem; usually one would imagine a single tree being searched for a variable set of strings. Here the situation is reversed: the set of strings is fixed and the tree is highly variable.
I think that the best you can do is build a trie representing the set of strings. That way, you only have to search a tree once for any given prefix. (So, for the example strings you mentioned, you would only need to find the "A" prefix once and the "X" prefix once.) There are lots of trie data structures and algorithms for building them from a set of strings, but since that's a one-time operation for this problem, I wouldn't worry too much about the cost of this preprocessing.

Pattern searching in array of words

I need to search in big array of words using pattern. Pattern can contain sequences of letters and wildcard * which can represents every letter(or some of them). Pattern represents the whole word or words. I found that I an use Suffix tree. But I need effective way to store this tree on disk because it's need lots of RAM. Is there any effective ways to search through the list of words which is stored on the drive? It also should be an online algorithm (I mean that I can append new words to tree)
Thanks!
You can try aho-corasick algorithm. It's the fastest multi pattern search algorithm. You can also use a wildcard. You can try my implementation in PHP # https://phpahocorasick.codeplex.com.

Extracting information in a string

I would like to parse strings with an arbitrary number of parameters, such as P1+05 or P2-01 all put together like P1+05P2-02. I can get that data from strings with a rather large (too much to post around...) IF tree and a variable keeping track of the position within the string. When reaching a key letter (like P) it knows how many characters to read and proceeds accordingly, nothing special. In this example say I got two players in a game and I want to give +05 and -01 health to players 1 and 2, respectively. (hence the +-, I want them to be somewhat readable).
It works, but I feel this could be done better. I am using Lua to parse the strings, so maybe there is some built-in function, within Lua, to ease that process? Or maybe some general hints , or references for better approaches?
Here is some code:
for w in string.gmatch("P1+05P2-02","%u[^%u]+") do
print(w)
end
It assumes that each "word" begins with an uppercase letter and its parameters contain no uppercase letters.

Resources