I have a short piece of text (more specifically a Tweet, so a maximum length of 140 characters) that I would like to search against approximately 100,000 terms.
This turns the classical search problem (small search term, large document) on its head. The naive approach of iterating through each of the search terms and attempting a match cannot be the most efficient way of tackling this problem.
Does anyone have any resources or insights on how to tackle this type of search problem?
I stumbled upon the Aho-Corasick algorithm, which is working very well for this situation.
http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
Using a JavaScript implementation, I am able to get the following performance:
Words to match: Abridged English Dictionary (~250,000 words)
Sentences per second performance: ~80,000
Some extra filtering is necessary to check for word boundaries, if that matters for your use case. The algorithm reports the match location in the text, so it is trivial to check word boundaries efficiently.
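For reference, here is a minimal sketch of this approach in Python using the pyahocorasick package (an assumption on my part; any Aho-Corasick implementation that reports match offsets will do), including the word-boundary filtering mentioned above:

```python
import ahocorasick  # pip install pyahocorasick (assumed; any A-C library with offsets works)

def build_automaton(terms):
    """Preprocess the large term list once."""
    automaton = ahocorasick.Automaton()
    for term in terms:
        automaton.add_word(term, term)
    automaton.make_automaton()
    return automaton

def find_terms(automaton, text):
    """Return (start, term) pairs that occur in the text on word boundaries."""
    hits = []
    for end, term in automaton.iter(text):
        start = end - len(term) + 1
        before_ok = start == 0 or not text[start - 1].isalnum()
        after_ok = end == len(text) - 1 or not text[end + 1].isalnum()
        if before_ok and after_ok:
            hits.append((start, term))
    return hits

# Example: one short tweet checked against a (tiny stand-in for a large) term list.
automaton = build_automaton(["cat", "caterpillar", "pill"])
print(find_terms(automaton, "my caterpillar ate a pill"))  # caterpillar and pill, not cat
```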
Hope this helps someone searching for a similar problem :)
Forgive me, this will be my first ever post to SO, so do let me know how I can improve.
I am currently looking for advice on a problem I am facing. I have a list of one billion unique strings of text. These text strings also have a list of tags associated with them to indicate the content of the string.
Example:
StringText: The cat ate on Sunday
AnimalCode: c001
ActionCode: a001
TimeCode: d001
where
c001 = The cat
a001 = ate
d001 = on Sunday
I have loaded all of the strings and their codes as individual documents in an instance of MongoDB.
At present, I am trying to devise a method by which I can enter a string and search against the database to return the match. My problem is that the search is taking far too long to return results.
I have created an index on the StringText field, but I am guessing that it is too large to hold in memory.
Each string has an equal probability of being searched for so I can't reliably predict which strings have a higher probability of being searched for and pull them out into another collection.
Currently, I am running the DB off a single box with 16GB of RAM and a 4TB HDD.
Does anybody have any advice on how I might accomplish my task more efficiently? Is Mongo the right technology or are there others more adept at doing this kind of search and return?
My goal (forgive me if foolish) would be to try and return a result within 2 seconds or less.
I am very new to this whole arena so any and all advice would be welcome.
Thanks much to all in advance for the help and time.
Sincerely,
Zinga
As discussed in the comments, you could preprocess the input string to find the associated Animal and Action codes and search for StringText based on the indexed codes, which is much faster than text search.
You can't totally avoid text search, so reduce it to the Animal and/or Action collections by tokenizing the input string. It is also worth looking at whether map/reduce techniques can be applied to queries of this sort.
In your case, if you know that the first word or two will always contain the name of the animal, just use those one or two words to search for the relevant animal. Searching the Animal/Action collections shouldn't take long. If it does, you can keep a periodically updated list of the most common animals/actions (based on their frequency) and search against that first to make it faster. This is also discussed in the articles on the linked page.
If your search against StringText is still slow after that, you could shard the StringText collection by Animal/Action codes. The official docs should suffice for this, and there isn't much involved in the setup, so you might try it anyway. The basic idea everywhere is to restrict your target space as much as possible: searching through a billion records for every query is plain overkill. Cache where you can, preprocess where you can, and show guesses while you run a slow query.
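Here is a rough sketch of the preprocess-then-query idea in Python with pymongo; the collection layout and field names (animals/actions collections mapping keywords to codes) are assumptions based on the example above, not your actual schema:

```python
from pymongo import MongoClient

client = MongoClient()            # assumes a local mongod instance
db = client["strings_db"]         # hypothetical database name

# One-time setup: index the code fields so the final lookup never scans text.
db.strings.create_index([("AnimalCode", 1), ("ActionCode", 1)])

def lookup(input_string):
    tokens = input_string.lower().split()
    # Resolve codes from the (much smaller) animal/action collections first.
    animal = db.animals.find_one({"keyword": {"$in": tokens}})
    action = db.actions.find_one({"keyword": {"$in": tokens}})
    query = {}
    if animal:
        query["AnimalCode"] = animal["code"]
    if action:
        query["ActionCode"] = action["code"]
    # Only the narrowed candidate set is compared against the full string.
    return [doc for doc in db.strings.find(query)
            if doc["StringText"].lower() == input_string.lower()]

print(lookup("The cat ate on Sunday"))
```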
Good luck!
Let’s say I have a list of 250 words, which may consist of unique entries throughout, or a bunch of words in all their grammatical forms, or all sorts of words in a particular grammatical form (e.g. all in the past tense). I also have a corpus of text that has conveniently been split up into a database of sections, perhaps 150 words each (maybe I would like to determine these sections dynamically in the future, but I shall leave it for now).
My question is this: What is a useful way to get those sections out of the corpus that contain most of my 250 words?
I have looked at a few full text search engines like Lucene, but am not sure they are built to handle long query lists. Bloom filters seem interesting as well. I feel most comfortable in Perl, but if there is something fancy in Ruby or Python, I am happy to learn. Performance is not an issue at this point.
The use case of such a program is in language teaching, where it would be nice to have a variety of word lists that mirror the different extents of learner knowledge, and to quickly find fitting bits of text or examples from original sources. Also, I am just curious to know how to do this.
Effectively what I am looking for is document comparison. I have found a way to rank texts by similarity to a given document, in PostgreSQL.
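For what it's worth, here is a minimal plain-Python baseline for the original question; it is not the PostgreSQL similarity ranking mentioned above, just a sketch that ranks sections by how many of the word-list entries they contain:

```python
def rank_sections(word_list, sections, top_n=10):
    """Rank corpus sections by how many of the target words they contain."""
    targets = {w.lower() for w in word_list}
    scored = []
    for section_id, text in sections.items():
        found = targets & {w.strip(".,;:!?").lower() for w in text.split()}
        scored.append((len(found), section_id, sorted(found)))
    scored.sort(reverse=True)
    return scored[:top_n]

# Hypothetical corpus: section id -> chunk of text (~150 words in practice).
sections = {
    "s1": "The cat sat on the mat and ate a fish.",
    "s2": "Dogs ran quickly across the field yesterday.",
}
print(rank_sections(["cat", "fish", "field"], sections))
```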
I have a scenario where a user can post a number of responses or phrases via a form field. I would like to take the response and determine what they are asking for. For instance, if the user types in car, train, bike, jet .... I can assume they are talking about a vehicle and respond accordingly. I understand that I could use a switch statement or perhaps a regexp, but the larger the number of possible responses, the less efficient that computation becomes. I'm wondering if there is an efficient algorithm for comparing a string with a group of strings. Any info would be great.
You may want to look into the Aho-Corasick algorithm. If you have a collection of strings that you want to search for, you can spend linear time doing preprocessing on those strings and from that point forward can, in O(n) time, check for all possible matches of those strings in a text corpus of length n. In other words, with a small preprocessing time to set up the algorithm once, you can extremely efficiently scan over numerous inputs again and again searching for those keywords.
Interestingly enough, the algorithm was specifically invented to build a fast index (that is, to look for a lot of different keywords in a huge body of text), and allegedly outperformed other methods by a factor of ten. I think it would work great in your application.
Hope this helps!
If you have a large number of "magic" words, I would suggest splitting the query into words, and using a hash-based lookup to check whether the words are recognized.
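A minimal sketch of that idea in Python (the vehicle words are just placeholders):

```python
# Map each "magic" word to its category; a dict gives O(1) average lookups.
VEHICLE_WORDS = {"car", "train", "bike", "jet", "plane", "boat"}
CATEGORIES = {word: "vehicle" for word in VEHICLE_WORDS}

def categorize(response):
    """Return the category of the first recognized word, or None."""
    for word in response.lower().replace(",", " ").split():
        category = CATEGORIES.get(word)
        if category:
            return category
    return None

print(categorize("car, train, bike, jet"))  # -> "vehicle"
```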
You can also check out the trie data structure. I think it is one of the best solutions for your problem.
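For completeness, a bare-bones trie sketch in Python using nested dicts (real implementations add prefix traversal, deletion, and so on):

```python
class Trie:
    """Bare-bones trie: insert words, then test membership."""

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker

    def contains(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

trie = Trie()
for w in ("car", "train", "bike", "jet"):
    trie.insert(w)
print(trie.contains("train"), trie.contains("tra"))  # True False
```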
I am having a lot of trouble finding a string matching algorithm that fits my requirements.
I have a very large database of strings in unabbreviated form that need to be matched against an arbitrary abbreviation. An abbreviation that is a contiguous substring of the full string (no skipped letters between its characters) should also match, and with a higher score.
Example: if the word to be matched within was "download" and I searched "down", "ownl", and then "dl", I would get the highest matching score for "down", followed by "ownl" and then "dl".
The algorithm would have to be optimized for speed and for a large number of strings to search through, and it should let me pull back a list of all matching strings (if I had added both "download" and "upload" to the database, searching "load" should return both). Memory is still important, but not as important as speed.
Any ideas? I've done a bunch of research on some of these algorithms but I haven't found any that even touch abbreviations, let alone with all these conditions!
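Not an answer, but here is a toy scoring function that reproduces the ordering in the example above (it guesses that a prefix match outranks an interior substring, which outranks an in-order subsequence such as "dl" in "download"); it says nothing about making this fast over a large database, which is the harder part of the question:

```python
def match_score(candidate, abbreviation):
    """Toy scoring: prefix > substring > in-order subsequence > no match."""
    if candidate.startswith(abbreviation):
        return 3.0 * len(abbreviation) / len(candidate)
    if abbreviation in candidate:
        return 2.0 * len(abbreviation) / len(candidate)
    it = iter(candidate)
    if all(ch in it for ch in abbreviation):  # subsequence check
        return 1.0 * len(abbreviation) / len(candidate)
    return 0.0

for query in ("down", "ownl", "dl"):
    print(query, round(match_score("download", query), 2))
# down 1.5, ownl 1.0, dl 0.25 -- the ordering from the example
```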
I'd wonder if Peter Norvig's spell checker could be adapted in some way for this problem.
It's a stretch that I haven't begun to work out, but it's such an elegant solution that it's worth knowing about.
I've seen a few sites that list related searches when you perform a search, namely they suggest other search queries you may be interested in.
I'm wondering about the best way to model this on a medium-sized site (not enough traffic to rely on visitor stats to infer relationships). My initial thought is to store the top 10 results for each unique query; then, when a new search is performed, find all the historical searches that match some of those top 10 results but ideally not all of them (matching all of them might suggest an equivalent search and hence not a useful suggestion).
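A sketch of that overlap idea with in-memory data (in practice the query history and result sets would live in your database):

```python
def suggest_related(new_top10, history, min_overlap=3, max_overlap=9):
    """Suggest past queries whose top-10 results partially overlap the new ones.

    history maps a past query string to the set of its top-10 result ids.
    Full overlap is excluded as likely being an equivalent query.
    """
    suggestions = []
    for past_query, past_results in history.items():
        overlap = len(new_top10 & past_results)
        if min_overlap <= overlap <= max_overlap:
            suggestions.append((overlap, past_query))
    return [q for _, q in sorted(suggestions, reverse=True)]

history = {
    "python web framework": {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
    "django tutorial":      {2, 3, 5, 11, 12, 13, 14, 15, 16, 17},
    "knitting patterns":    {20, 21, 22, 23, 24, 25, 26, 27, 28, 29},
}
print(suggest_related({1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, history))
# -> ['django tutorial'] (partial overlap; identical and disjoint queries excluded)
```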
I imagine that some people have done this functionality before and may be able to provide some ideas of different ways to do this. I'm not necessarily looking for one winning idea since the solution will no doubt vary substantially depending on the size and nature of the site.
Have you considered a matrix with keywords on one axis vs. documents on the other axis? Once you find the set of vectors representing the keywords, find the sets of keywords found in your initial result set, and then find a way to rank the other keywords by how many documents they reference or how many times they intersect the initial result set.
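A small sketch of that idea in Python, with the keyword-vs-document matrix stored as an inverted index (keyword -> set of document ids); all data here is made up:

```python
from collections import Counter

# Inverted index standing in for the keyword-vs-document matrix.
index = {
    "python": {1, 2, 3},
    "search": {2, 3, 4},
    "lucene": {3, 4},
    "knitting": {5},
}

def related_keywords(query_terms, index):
    """Rank other keywords by how often they co-occur with the initial result set."""
    result_docs = set()
    for term in query_terms:
        result_docs |= index.get(term, set())
    scores = Counter()
    for keyword, docs in index.items():
        if keyword not in query_terms:
            scores[keyword] = len(docs & result_docs)
    return [kw for kw, score in scores.most_common() if score > 0]

print(related_keywords({"python"}, index))  # -> ['search', 'lucene']
```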
I've tried a number of different approaches to this, with various degrees of success. In the end, I think the best approach is highly dependent on the domain/topics being searched, and how the users form queries.
Your thought about storing previous searches seems reasonable to me. I'd be curious to see how it works in practice (I mean that in the most sincere way -- there are many nuances that can cause these techniques to fail in the "real world", particularly when data is sparse).
Here are some techniques I've used in the past, and seen in the literature:
Thesaurus based approaches: Index into a thesaurus for each term that the user has used, and then use some heuristic to filter the synonyms to show the user as possible search terms.
Stem and search on that: Stem the search terms (e.g. with the Porter Stemming Algorithm) and then use the stemmed terms instead of the initially provided queries, giving the user the option of searching for exactly the terms they specified. (Or do the opposite: search the exact terms first, and use stemming to find the terms that stem to the same root. This second approach obviously takes some pre-processing of a known dictionary, or you can collect terms as your indexer finds them.) A small sketch of this appears after this list.
Chaining: Parse the results found by the user's query and extract key terms from the top N results (KEA is one library/algorithm that you can look at for keyword extraction techniques.)
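Here is the small sketch of the stemming idea from the second point above, assuming NLTK and its Porter stemmer are available:

```python
from nltk.stem import PorterStemmer  # assumes nltk is installed

stemmer = PorterStemmer()

def expand_query(terms, vocabulary):
    """Map each query term to every indexed term that shares its stem."""
    stems = {term: stemmer.stem(term) for term in vocabulary}
    expanded = set()
    for term in terms:
        target = stemmer.stem(term)
        expanded |= {word for word, stem in stems.items() if stem == target}
    return expanded

# Hypothetical indexed vocabulary collected at indexing time.
vocabulary = ["running", "runs", "runner", "walked", "walking"]
print(expand_query(["run"], vocabulary))  # e.g. {'running', 'runs'}
```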