Fastest way to find dictionary strings in a text - string

I have a text file and a dictionary. The dictionary is a list of words, each exactly 8 characters long. I go through the text file and look up each 8-character window in the dictionary (a "sliding window").
Currently, I use a Python dictionary as the lookup table. It has an amortized lookup time of O(1), but I wonder whether there are faster algorithms or data structures that exploit the specific structure of the problem.

You can try the Aho-Corasick multiple-pattern matcher. It builds a finite state machine from a trie of the dictionary strings, then uses a breadth-first search to compute, for each node, a failure link to the longest proper suffix of that node's string that is also a prefix of some dictionary string. The text is then scanned once, following failure links on a mismatch. You can try my implementation in PHP at https://phpahocorasick.codeplex.com. It also extends the algorithm to search for wildcards.
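For readers who want to see the idea in Python rather than PHP, here is a minimal sketch of Aho-Corasick. It is not the linked implementation; the function names `build_automaton` and `find_all` and the list-based node layout are my own assumptions, and it does not handle wildcards.

```python
from collections import deque

def build_automaton(words):
    # Node i is described by goto[i] (child edges), fail[i] (failure link)
    # and out[i] (dictionary words that end at this node).
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        node = 0
        for ch in word:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(word)

    # Breadth-first search: children of the root keep fail = 0,
    # deeper nodes derive their failure link from their parent's.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_all(text, automaton):
    # Scan the text once and report (start_index, word) for every match.
    goto, fail, out = automaton
    node = 0
    hits = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for word in out[node]:
            hits.append((i - len(word) + 1, word))
    return hits
```

For example, `find_all("xabcdefghiy", build_automaton(["abcdefgh", "bcdefghi"]))` reports both overlapping words. Whether this beats a plain hash lookup per window for fixed 8-character words is something you would have to measure.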

I think you can use a full-text search engine for this, such as Apache Solr or Elasticsearch.
For the client side, you can use http://lunrjs.com/.

Related

Data structure to index entire document and algorithm for quick search of any size substring

I'm trying to find a data structure (and algorithm) that would allow me to index an entire text document and search for any substring of it, no matter the size of the substring. The data structure should be stored on disk, during or at the end of the indexing procedure.
For instance, given the following sentence:
The book is on the table
The algorithm should quickly (O(log(n))) find the occurrences of any substring of the text.
For instance, if the input is book it should find all occurrences of it, but this should also be true for book is and The book is.
Unfortunately, the majority of solutions work by tokenizing the text and searching on individual tokens. Ordinary databases also index text without supporting substring search (is that why a query with LIKE '%foo%' is done with a linear scan and takes so long?).
I could try to develop something from scratch (maybe a variation of an inverted index?) but I'd love to discover that somebody has already done that.
The most similar thing I found is SQLite3 Full-text search.
Thanks!
One approach is to index your document in a suffix tree. Every prefix of some suffix is a substring of the document.
With this approach, all you have to do is build the suffix tree and, when querying a substring s, follow the nodes of the tree from the root. If you can follow the entire query string, there is a suffix whose prefix is the query string, so the query is a substring of the document.
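A real suffix tree takes some effort to build, so as a rough illustration of the same "every substring is a prefix of some suffix" idea, here is a sketch that uses a sorted suffix array instead (a swapped-in, simpler structure, not a suffix tree). The names `build_suffix_array` and `find_occurrences` are hypothetical, and the construction is deliberately naive; it is only meant for small inputs.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix itself.
    # Fine for a demo, far too slow and memory-hungry for large documents.
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, query):
    # All suffixes that start with `query` form one contiguous block in the
    # sorted order, so two binary searches find the block boundaries.
    m = len(query)

    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-char prefix is >= query
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    hi = len(suffix_array)
    while lo < hi:  # first suffix whose m-char prefix is > query
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= query:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])
```

With the example sentence, `find_occurrences(text, sa, "book is")` returns the start offset of that phrase. Each lookup does O(log n) string comparisons, each of which costs up to the length of the query.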
If you are querying only complete words, an inverted index could be enough. An inverted index usually maps a term (word) to the list of documents it appears in. In your case it would instead map each term to its positions in the document.
At query time, for each occurrence of term i of the query, take its position p and check whether term i+1 of the query appears at position p+1.
This can be done quite efficiently, similarly to how an inverted index traditionally handles AND queries, except that instead of looking for all terms in the same document you look for the terms at consecutive positions.
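The positional inverted index can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: words are split on whitespace, matching is case-sensitive, and the names `build_positional_index` and `find_phrase` are made up for the example.

```python
from collections import defaultdict

def build_positional_index(text):
    # Map each word to the list of word positions where it occurs.
    index = defaultdict(list)
    for position, word in enumerate(text.split()):
        index[word].append(position)
    return index

def find_phrase(index, phrase):
    # Return the word positions at which the whole phrase starts.
    words = phrase.split()
    if not words:
        return []
    # Start from the positions of the first word, then keep only those
    # where the k-th following query word appears at position p + k.
    candidates = set(index.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        positions = set(index.get(word, []))
        candidates = {p for p in candidates if p + offset in positions}
    return sorted(candidates)
```

For instance, on "The book is on the table" the query "book is" yields word position 1, while "book on" yields nothing because the two words are not adjacent.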

Pattern searching in array of words

I need to search a big array of words using a pattern. The pattern can contain sequences of letters and the wildcard *, which can stand for any letter (or several letters). The pattern describes a whole word or several words. I found that I can use a suffix tree, but I need an efficient way to store the tree on disk because it needs a lot of RAM. Are there any efficient ways to search a list of words that is stored on disk? It should also be an online algorithm (I mean that I can append new words to the tree).
Thanks!
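You can try the Aho-Corasick algorithm. It is a fast multi-pattern search algorithm: once the automaton is built, it scans the text a single time regardless of how many patterns there are. It can also be extended to handle wildcards. You can try my implementation in PHP at https://phpahocorasick.codeplex.com.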

Algorithm for string processing

I am looking for an algorithm for string processing. I have searched for one but couldn't find an algorithm that meets my requirements. I will explain what the algorithm should do with an example.
There are two sets of words, defined as shown below:
**Main_Words**: swimming, driving, playing
**Words_in_front**: I am, I enjoy, I love, I am going to go
The program will search through a huge set of words. As soon as it finds a word that is defined in Main_Words, it will check the words in front of that word to see whether they match any of the phrases defined in Words_in_front.
For example, if the program encounters the word "swimming", it has to check whether the words in front of "swimming" are one of these: I am, I enjoy, I love, I am going to go.
Are there any algorithms that can do this?
A straightforward way to do this would be to just do a linear scan through the text, always keeping track of the last N+1 words (or characters) you see, where N is the number of words (or characters) in the longest phrase contained in your words_in_front collection. When you have a "main word", you can just check whether the sequence of N words/characters before it ends with any of the prefixes you have.
This would be a bit faster if you transformed your words_in_front set into a nicer data structure, such as a hashmap (perhaps keyed by last letter in the phrase..) or a prefix/suffix tree of some sort, so you wouldn't have to do an .endsWith over every single member of the set of prefixes each time you have a matching "main word." As was stated in another answer, there is much room for optimization and a few other possible implementations, but there's a start.
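Here is a rough Python sketch of that linear scan, keeping only the last N words in a bounded window and storing the phrases in a hash set so no `.endsWith` loop over all phrases is needed. The function name `find_matches`, the lowercase normalisation and the assumption that the phrase list is non-empty are mine, not part of the question.

```python
from collections import deque

def find_matches(words, main_words, phrases_in_front):
    # Store each phrase as a tuple of lowercase words for O(1) lookups.
    phrase_set = {tuple(p.lower().split()) for p in phrases_in_front}
    max_len = max(len(p) for p in phrase_set)  # N: longest phrase, in words
    main = {w.lower() for w in main_words}

    window = deque(maxlen=max_len)  # the last N words seen so far
    matches = []
    for word in words:
        w = word.lower()
        if w in main:
            recent = tuple(window)
            # Check every suffix of the window against the phrase set.
            for k in range(1, len(recent) + 1):
                if recent[-k:] in phrase_set:
                    matches.append((" ".join(recent[-k:]), w))
        window.append(w)
    return matches
```

Each main word costs at most N set lookups, independent of how many phrases there are.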
Create a map/dictionary/hash/associative array (whatever it is called in your language) keyed by the entries of Main_Words, where the value attached to each key is a linked list of the Words_in_front phrases. Whenever you encounter a word matching a key, look it up in the table and check whether any phrase in the attached list matches the words you have in front of it.
That's the basic idea; it can be optimized for both speed and space.
You should be able to build a regular expression along these lines:
I (am|enjoy|love|am going to go) (swimming|driving|playing)
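In Python, a pattern like this could be assembled from the two word lists instead of being typed by hand. This is just a sketch: the helper name `build_pattern`, the case-insensitive flag and the `\b` word boundaries are my additions.

```python
import re

def build_pattern(words_in_front, main_words):
    # Escape each entry so any punctuation in the lists is matched literally.
    front_alt = "|".join(re.escape(p) for p in words_in_front)
    main_alt = "|".join(re.escape(w) for w in main_words)
    # Group 1 is the phrase in front, group 2 is the main word.
    return re.compile(rf"\b({front_alt})\s+({main_alt})\b", re.IGNORECASE)

pattern = build_pattern(
    ["I am", "I enjoy", "I love", "I am going to go"],
    ["swimming", "driving", "playing"],
)
matches = pattern.findall("I enjoy Swimming, and I am going to go driving.")
```

`findall` returns one `(phrase, main_word)` tuple per match. For a very large text or very long word lists, it is worth measuring this against a plain scan.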

Finding which word is occurring in given sentence

I have a list of words. The number of words is around 1 million.
I have strings coming in at runtime, and I have to check which word from the list is present in each string and return that word (there is no need to return all words occurring in the sentence; returning the first one is enough).
One solution is to check every word against the string one by one, but that is inefficient.
Can someone please point out any efficient method of doing it?
Use the Knuth-Morris-Pratt algorithm. Although a million words is not all that much. You can also convert your text body into a Trie structure and then use that to check your search list against. There is a special kind of Trie called a Suffix Tree used especially for full text searching.
Put your word list in a tree or hash table.
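As a small illustration of the hash table variant in Python: build a `set` from the word list once, then test each token of the incoming string against it. This sketch assumes the words are matched as whole tokens and that the list is stored in lowercase; `first_known_word` is a made-up name.

```python
import re

def first_known_word(sentence, word_set):
    # Split the sentence into word tokens and return the first one
    # that is in the set, or None if there is no match.
    for token in re.findall(r"\w+", sentence.lower()):
        if token in word_set:
            return token
    return None
```

The cost per string is proportional to the number of tokens in it, not to the million words in the list.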
Unless your word list is ordered (or stored in an efficient data structure such as an ordered binary tree) so that you can perform a binary search, the solution you are proposing is the most efficient one.

Best way to sort a long list of strings

I would like to know the best way to sort a long list of strings wrt the time and space efficiency. I prefer time efficiency over space efficiency.
The strings can be numeric, alpha, alphanumeric etc. I am not interested in the sort behavior like alphanumeric sort v/s alphabetic sort just the sort itself.
Some ways below that I can think of.
Using code, e.g. the .NET Framework's Array.Sort() function. I think the way this works is that the hashcodes for the strings are calculated and the string is inserted at the proper position using a binary search.
Using the database (e.g. MS SQL Server). I have not done this. I do not know how efficient this would be though.
Using a prefix tree data structure like a trie. Sorting requires traversing all the trieNodes of the trie tree using DFS (depth first search) - O(|V| + |E|) time. (Searching takes O(l) time where l is the length of the string to compare).
Any other ways or data structures?
You say that you have a database, and presumably the strings are stored in the database. Then you should get the database to do the work for you. It may be able to take advantage of an index and therefore not need to actually sort the list, but just read it from the index in sorted order.
If there is no index, the database might still be able to help you if you only fetch the first k rows for some small constant k, for example 100. When you use ORDER BY together with a row limit (TOP in SQL Server, LIMIT in some other databases), SQL Server can use a special optimization called TOP N SORT, which runs in linear time instead of O(n log(n)) time.
If your strings are not in the database already then you should use the features provided by .NET instead. I think it is unlikely you will be able to write custom code that will be much faster than the default sort.
I found this paper that uses trie data structure to efficiently sort large sets of strings. I have not looked into it in detail though.
Radix sort could also be a good option if the strings are not very long, e.g. a list of names.
Suppose you have a large list of strings, and let n be the length of the list.
Using a comparison-based sorting algorithm such as MergeSort, HeapSort or Quicksort will give you a running time of O(n d log(n)), where n is the size of the list and d is the maximum length of all strings in the list, since each of the O(n log(n)) comparisons can take up to O(d) time.
We can try to use radix sort in this case. Let b be the base and let d be the length of the longest string; then we can show that the running time using radix sort is O(d (n + b)).
Furthermore, if the strings consist of, say, lowercase English letters (b = 26), the running time is O(d (n + 26)) = O(d n).
Source: MIT OpenCourseWare algorithms lecture by Prof. Erik Demaine.
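To make the radix sort idea concrete, here is a small least-significant-digit sketch in Python. It assumes the strings contain only the lowercase letters a to z (so the base is 26 plus one extra bucket for "string is too short"), and the function name `radix_sort_strings` is hypothetical. It makes d passes over the list, each doing O(n + b) work.

```python
def radix_sort_strings(strings):
    # LSD radix sort for strings made of lowercase 'a'..'z' only.
    if not strings:
        return []
    max_len = max(len(s) for s in strings)  # d: length of the longest string
    result = list(strings)
    # Process character positions from last to first; each pass is a
    # stable bucket sort, so earlier (less significant) passes are kept.
    for pos in range(max_len - 1, -1, -1):
        # Bucket 0 holds strings shorter than pos + 1, so that a string
        # sorts before any longer string it is a prefix of.
        buckets = [[] for _ in range(27)]
        for s in result:
            key = ord(s[pos]) - ord("a") + 1 if pos < len(s) else 0
            buckets[key].append(s)
        result = [s for bucket in buckets for s in bucket]
    return result
```

In pure Python the built-in `sorted` will usually still be faster in practice, so treat this as an illustration of the running-time argument rather than a drop-in replacement.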

Resources