Pattern searching in array of words - search

I need to search a large array of words using a pattern. The pattern can contain sequences of letters and the wildcard *, which can stand for any letter (or any run of letters). The pattern describes a whole word, and it may match several words. I found that I can use a suffix tree, but I need an efficient way to store the tree on disk because it needs a lot of RAM. Are there any efficient ways to search through a list of words that is stored on disk? It should also be an online algorithm (I mean that I can append new words to the tree).
Thanks!

You can try the Aho-Corasick algorithm. It is a multi-pattern search algorithm that finds all occurrences of a set of patterns in a single pass over the text, and it can be extended to handle wildcards. You can try my PHP implementation at https://phpahocorasick.codeplex.com.
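To make the wildcard semantics from the question concrete, here is a minimal in-memory sketch. It is not Aho-Corasick and not the linked PHP implementation; it swaps in a plain trie with a recursive walk, assuming * matches zero or more letters and that the pattern must cover the whole word. The names WordTrie, add, and match are hypothetical, and storing the structure on disk is not addressed.

```python
END = None  # hypothetical marker key: present in a node when a word ends there


class WordTrie:
    """Plain trie over whole words; new words can be appended at any time."""

    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True

    def match(self, pattern):
        """Return the set of stored words matched entirely by the pattern."""
        results = set()
        self._walk(self.root, pattern, 0, "", results)
        return results

    def _walk(self, node, pattern, i, prefix, results):
        if i == len(pattern):
            if END in node:
                results.add(prefix)
            return
        if pattern[i] == "*":
            # Option 1: '*' matches nothing, move on to the next pattern char.
            self._walk(node, pattern, i + 1, prefix, results)
            # Option 2: '*' consumes one letter and stays active.
            for ch, child in node.items():
                if ch is not END:
                    self._walk(child, pattern, i, prefix + ch, results)
        elif pattern[i] in node:
            self._walk(node[pattern[i]], pattern, i + 1,
                       prefix + pattern[i], results)


trie = WordTrie()
for w in ["manmohan", "manoj", "aman"]:
    trie.add(w)
print(trie.match("man*"))
```

A leading * forces the walk to visit the whole trie, so this only helps when patterns start with literal letters.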

Related

Finding all names matching a query: how to use a suffix tree?

Question: You have a smartphone and you open the contacts app to search for a contact, say manmohan, but you don't remember his full name. You only remember mohan, so you start typing. The moment you type 'm', the app starts searching for contacts that contain the letter 'm'. Suppose your contact list holds the names "manmohan", "manoj", "raghav", "dinesh", and "aman"; the app now shows manmohan, manoj, and aman. The next character you type is 'o' (so far you have typed "mo"), and the result should now be just "manmohan". How would you implement such a data structure?
My approach was to apply KMP: look for the pattern "m", then "mo", in every contact, and display the strings that match. The interviewer said that is not efficient, and I couldn't think of a better approach. He said there is a standard algorithm that solves this if you know it; when I asked before leaving, he said: suffix tree. Can anyone please explain how it is better, or which algorithm is best for implementing this data structure?
The problem you're trying to solve essentially boils down to the following: given a fixed collection of strings and a string that only changes via appends, how do you efficiently find all strings that contain that pattern as a substring?
There's a neat little result on strings that's often useful for taking on problems that involve substring searching: a string P is a substring of a string T if and only if P is a prefix of at least one suffix of T. (Do you see why?)
So imagine that you take every name in your word bank (call the collection T) and construct a trie of all the suffixes of all the words in that bank. Now, given the pattern string P to search for, walk down the trie, reading characters of P. If you fall off the trie, then P must not be a substring of any name in the bank (otherwise, it would have been a prefix of at least one suffix of one of the strings in T). Otherwise, you're at some trie node. Then the suffixes in the subtree rooted at that node correspond exactly to the matches of P in the names in T, which you can find by DFS-ing the subtrie and recording all the suffixes you find.
A suffix tree is essentially a time- and space-efficient data structure for representing a trie of all the suffixes of a collection of strings. It can be built in time proportional to the total number of characters in T (though the algorithms for doing so are famously hard to intuit and code up) and is designed so that, once you have walked down to a node, you can list all the matches below it in time O(k), where k is the number of matches.
To recap, the core idea here is to make a trie of all the suffixes of the strings in T and then to walk down it using the pattern P. For time and space efficiency, you'd do this with a suffix tree rather than a suffix trie.
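The recap can be sketched with a plain (uncompressed) suffix trie. This is only an illustration of the idea, not a real suffix tree: it uses quadratic space per name, which is acceptable for short contact names but not for long texts. The function names and the END marker are hypothetical.

```python
END = None  # hypothetical marker key: names having a suffix that ends at this node


def build_suffix_trie(names):
    """Insert every suffix of every name into a nested-dict trie."""
    root = {}
    for name in names:
        for start in range(len(name)):
            node = root
            for ch in name[start:]:
                node = node.setdefault(ch, {})
            node.setdefault(END, set()).add(name)
    return root


def find_containing(root, pattern):
    """Return all names that contain the pattern as a substring."""
    node = root
    for ch in pattern:
        if ch not in node:
            return set()  # fell off the trie: no name contains the pattern
        node = node[ch]
    # Every suffix below this node starts with the pattern; collect their names.
    found = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is END:
                found |= child
            else:
                stack.append(child)
    return found


contacts = ["manmohan", "manoj", "raghav", "dinesh", "aman"]
trie = build_suffix_trie(contacts)
print(find_containing(trie, "m"))
print(find_containing(trie, "mo"))
```

Because the pattern only grows by appends, an interactive search could also keep the current node between keystrokes and take one step per typed character instead of restarting from the root.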

Data structure to index entire document and algorithm for quick search of any size substring

I'm trying to find a data structure (and algorithm) that would allow me to index an entire text document and search for substring of it, no matter the size of the substring. The data structure should be stored in disk, during or at the end of the indexing procedure.
For instance, given the following sentence:
The book is on the table
The algorithm should quickly (in O(log(n)) time) find the occurrences of any substring of the text.
For instance, if the input is book it should find all occurrences of it, but this should also be true for book is and The book is.
Unfortunately, most solutions work by tokenizing the text and searching on individual tokens. Ordinary databases also index text without supporting substring search (is that why a query like LIKE '%foo%' falls back to a linear scan and takes so long?).
I could try to develop something from scratch (maybe a variation of an inverted index?), but I'd love to discover that somebody has already done it.
The most similar thing I found is SQLite3 Full-text search.
Thanks!
One approach is to index your document in a suffix tree. Every prefix of some suffix is a substring of the document.
With this approach, all you have to do is build the suffix tree and, when querying a substring s, follow the nodes in the tree character by character. If you can follow the entire query string, then there is a suffix that has the query string as a prefix, and so the query is a substring of the document.
If you are querying only complete words, an inverted index could be enough. An inverted index usually maps a term (word) to the list of documents it appears in. For you, it would instead map each term to its positions in the document.
For a query, you find the positions of each occurrence of word i of the query (call one such position p) and check whether term i+1 of the query also appears at position p+1.
This can be done quite efficiently, similar to how an inverted index traditionally handles AND queries, but instead of looking for all terms in the same document, you look for terms at consecutive positions.
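The positional variant of the inverted index can be sketched roughly as follows. This is an in-memory illustration only; it assumes whitespace tokenization and case-insensitive matching, and the function names are hypothetical. Persisting the index to disk is left out.

```python
from collections import defaultdict


def build_positional_index(text):
    """Map each lowercased word to the list of word positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(text.lower().split()):
        index[word].append(position)
    return index


def find_phrase(index, phrase):
    """Return the start positions (in words) where the whole phrase occurs."""
    words = phrase.lower().split()
    if not words:
        return []
    # Start from every position of the first word, then keep only the starts
    # for which word number `offset` appears exactly `offset` positions later.
    candidates = set(index.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        positions = set(index.get(word, []))
        candidates = {p for p in candidates if p + offset in positions}
    return sorted(candidates)


index = build_positional_index("The book is on the table")
print(find_phrase(index, "book is"))
print(find_phrase(index, "The book is"))
```

This only answers queries made of complete words; a query such as "ook is" would still need the suffix-based approach.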

Fastest way to find dictionary strings in a text

I have a text file and a dictionary. The dictionary consists of a list of words that are each exactly 8 characters long. I go through the text file and look up every 8-character "sliding window" in the dictionary.
Currently, I use a Python dictionary as the lookup table. It has an average lookup time of O(1), but I wonder whether there are faster algorithms or data structures that exploit the specific structure of the problem.
You can try the Aho-Corasick multiple-pattern matcher. It builds a finite-state machine from a trie of the dictionary strings and uses a breadth-first traversal to compute, for each node, a failure link to the longest proper suffix of that node's string that is also a prefix of some dictionary string. You can try my PHP implementation at https://phpahocorasick.codeplex.com. It also extends the algorithm to search with wildcards.
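For reference, a compact sketch of the textbook Aho-Corasick construction is shown below. It is not the linked PHP implementation and leaves out the wildcard extension; the list-based state layout and the function names are assumptions made for illustration.

```python
from collections import deque


def build_automaton(words):
    """Build the trie, failure links, and output lists for the given words."""
    goto = [{}]   # goto[state][char] -> next state; state 0 is the root
    fail = [0]    # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]    # out[state] -> dictionary words that end at this state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first traversal: depth-1 states keep fail = 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the matches that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, word) for every dictionary word found in text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((i - len(word) + 1, word))
    return matches


automaton = build_automaton(["abcdefgh", "bcdefghi"])
print(find_all("xabcdefghi", automaton))
```

The scan reads each text character once instead of hashing an 8-character window at every offset. Whether that beats a hash lookup in practice depends on the implementation and is worth measuring.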
I think you can use a full-text search engine for this, such as Apache Solr or Elasticsearch.
On the client side, you can use http://lunrjs.com/.

String search using suffix trees

A suffix tree can be used to efficiently search for a word in a set of words. Are suffix trees still the best method if:
1. the set of words is made from an infinite set of characters
2. the set of words is ordered alphabetically (or in a way that makes sense)?
A suffix tree is overkill if you just want to search for a word in a set of words (and you do not need to search for substrings). A trie is a better choice (the time complexity is the same, but it is much simpler). If the words are ordered, you can use binary search to find the word (yes, it adds a log n factor, but that is not so bad). Even if they are not ordered, you can sort them once before searching. This approach is good because it does not require any custom data structures, and it usually has a smaller constant factor and lower memory usage (the space complexity is the same, but the constant is better).
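The binary-search option needs nothing beyond a sorted list. A minimal sketch using Python's standard bisect module follows; the helper name contains is hypothetical.

```python
from bisect import bisect_left


def contains(sorted_words, word):
    """Binary search for an exact word in an already sorted list."""
    i = bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word


words = sorted(["pear", "apple", "fig"])  # sort once, then query many times
print(contains(words, "fig"))
print(contains(words, "grape"))
```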

Finding which word is occurring in given sentence

I have a list of words. The number of words is around 1 million.
I have strings arriving at runtime, and I have to check which word from the list is present in each string and return that word (it need not return every word occurring in the sentence; returning the first one is enough).
One solution is to check all the words one by one against the string, but that is inefficient.
Can someone please point out an efficient way of doing this?
Use the Knuth-Morris-Pratt algorithm, although a million words is not all that many. You can also convert your text body into a trie and then check your search list against it. There is a special kind of trie called a suffix tree that is used especially for full-text searching.
Put your word list in a tree or hash table.
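As a rough sketch of the hash-table option: the version below assumes the words of interest appear in the sentence as whitespace-separated tokens, which the question does not state, and the function name is hypothetical.

```python
def first_listed_word(sentence, word_set):
    """Return the first token of the sentence found in word_set, else None."""
    for token in sentence.split():
        if token in word_set:  # average O(1) hash lookup per token
            return token
    return None


word_set = {"apple", "table"}  # built once from the full word list
print(first_listed_word("the book is on the table", word_set))
```

The cost per sentence then depends on the number of tokens in the sentence rather than on the size of the word list. If the words can occur inside longer tokens, a multi-pattern matcher would be needed instead.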
Unless your word list is ordered (or stored in an efficient data structure such as an ordered binary tree) so that you can perform a binary search, the solution you are proposing is the most efficient one.

Resources