I am studying fuzzy search and how to retrieve information from a database using an inverted index. I have studied inverted indexes and I think they only work for EXACT matches. Imagine I have the string East Lamar Street in my database. Someone searches for East Lmar Street and I want to find East Lamar Street.
Will it use Edit Distance?
How will the algorithm operate?
Is the database going to use the inverted index?
Or will it do a full scan?
I saw that it uses a hash to make the lookup O(1).
I have written a small library that indexes by Soundex per word and scores using Levenshtein distance on the entire phrase. There are Scala and C# versions. You could use this if you can afford to load all of your street names into memory. Otherwise you could take some of the source and use it differently.
https://github.com/rstokes/fuzzysearch
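For illustration, here is a minimal sketch of that idea in Python (not the library's actual API; it assumes the third-party jellyfish package for its soundex() and levenshtein_distance() functions):

from collections import defaultdict
import jellyfish

index = defaultdict(set)   # soundex code of a word -> phrases containing that word

def add_phrase(phrase):
    for word in phrase.lower().split():
        index[jellyfish.soundex(word)].add(phrase)

def search(query, max_results=5):
    # collect every phrase sharing a soundex code with any query word...
    candidates = set()
    for word in query.lower().split():
        candidates |= index.get(jellyfish.soundex(word), set())
    # ...then rank the candidates by edit distance on the entire phrase
    return sorted(candidates,
                  key=lambda p: jellyfish.levenshtein_distance(p.lower(), query.lower()))[:max_results]

add_phrase("East Lamar Street")
print(search("East Lmar Street"))   # ['East Lamar Street'] - 'Lmar' and 'Lamar' share soundex code L560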
Related
I have a short piece of text (more specifically a Tweet, so maximum length of 140 characters) that I would like to perform a search against approximately 100,000 terms.
It turns the classical search problem (large document, small search term) on its head. The naive approach of iterating through each of the search terms and attempting to match it cannot be the most efficient way of tackling this problem.
Does anyone have any resources or insights on how to tackle this type of search problem?
Stumbled upon the Aho-Corasick algorithm which is working very well for this situation.
http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
Using a Javascript implementation I am able to get the following performance:
Words to match: Abridged English Dictionary (~250,000 words)
Sentences per second performance: ~80,000
Some extra filtering is necessary to check for word boundaries if that is important to your use. The algorithm spits out the match location in the text, so it is trivial to efficiently check for word boundaries.
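For anyone doing this in Python, here is a rough equivalent using the third-party pyahocorasick package (an assumption on my part - the numbers above came from a Javascript implementation), including the word-boundary filtering mentioned above:

import ahocorasick

terms = ["apple", "app", "store"]   # in practice, your ~100,000 terms

automaton = ahocorasick.Automaton()
for term in terms:
    automaton.add_word(term, term)
automaton.make_automaton()          # build the Aho-Corasick failure links

tweet = "new apps in the apple app store"
for end_index, term in automaton.iter(tweet):
    start_index = end_index - len(term) + 1
    # the algorithm reports the match position, so checking word boundaries is cheap
    before_ok = start_index == 0 or not tweet[start_index - 1].isalnum()
    after_ok = end_index == len(tweet) - 1 or not tweet[end_index + 1].isalnum()
    if before_ok and after_ok:
        print(term, start_index, end_index)   # whole-word matches only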
Hope this helps someone searching for a similar problem :)
I'm working on a project which searches through a database, then sorts the search results by relevance, according to a string the user inputs. I think my current search is fairly decent, but the comparator I wrote to sort the results by relevance is giving me funny results. I don’t know what to consider relevant. I know this is a big branch of information retrieval, but I have no idea where to start finding examples of searches which sort objects by relevance and would appreciate any feedback.
To give a little more background about my specific issue, the user will input a string in a website database, which stores objects (items in the store) with various fields, such as a minor and major classification (for example, an XBox 360 game might be stored with major=video_games and minor=xbox360 fields along with its specific name). The four main fields that I think should be considered in the search are the specific name, major, minor, and genre of the type of object, if that helps.
In case you don't want to use Lucene/Solr, you can always use distance metrics to find the similarity between the query and the rows retrieved from the database. Once you have the scores you can sort by them, which gives you the results ordered by relevance.
This is exactly what happens behind the scenes in Lucene. You can use simple similarity metrics like Manhattan distance, the distance between points in n-dimensional space, etc. Look at the Lucene scoring formula for more insight.
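A minimal sketch of that idea (the field names and weights are made up for illustration, and the standard library's difflib.SequenceMatcher ratio stands in for a fancier distance metric):

import difflib

def relevance(query, row):
    # weight the specific name highest, then minor/major/genre
    fields = [(row["name"], 4), (row["minor"], 2), (row["major"], 1), (row["genre"], 1)]
    return sum(weight * difflib.SequenceMatcher(None, query.lower(), value.lower()).ratio()
               for value, weight in fields)

rows = [{"name": "Halo 3", "major": "video_games", "minor": "xbox360", "genre": "shooter"},
        {"name": "Halo novel", "major": "books", "minor": "scifi", "genre": "fiction"}]

# sort descending by score; the XBox 360 game should come out ahead of the book here
print(sorted(rows, key=lambda r: relevance("halo xbox", r), reverse=True))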
I have a python app with a database of businesses and I want to be able to search for businesses by name (for autocomplete purposes).
For example, consider the names "best buy", "mcdonalds", "sony" and "apple".
I would like "app" to return "apple", as well as "appel" and "ple".
"Mc'donalds" should return "mcdonalds".
"bst b" and "best-buy" should both return "best buy".
Which algorithm am I looking for, and does it have a python implementation?
Thanks!
The Levenshtein distance should do.
Look around - there are implementations in many languages.
Levenshtein distance will do this.
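For reference, a plain dynamic-programming implementation of the distance looks like this (the textbook version, nothing library-specific):

def levenshtein(a, b):
    # previous[j] holds the edit distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("appel", "apple"))     # 2
print(levenshtein("bst b", "best buy"))  # 3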
Note: this is a distance; you have to calculate it against every string in your database, which can be a big problem if you have a lot of entries.
If you have this problem, then record all the typos users make (typo = no direct match) and offline build a correction database which contains all the typo->fix mappings. Some companies do this even more cleverly, e.g. Google watches how users correct their own typos and learns the mappings from that.
Soundex or Metaphone might work.
I think what you are looking for falls into the huge field of Data Quality and Data Cleansing. I doubt you will find a ready-made Python implementation for this, as it has to cleanse a considerable amount of data in the database, which could be of business value.
Levenshtein distance goes in the right direction, but only half the way. There are several tricks to get it to use half matches as well.
One would be to use subsequence dynamic time warping (DTW is actually a generalization of Levenshtein distance). For this you relax the start and end cases when calculating the cost matrix. If you only relax one of the conditions, you can get autocompletion with spell checking. I am not sure if there is a Python implementation available, but if you want to implement it yourself it should not be more than 10-20 LOC.
The other idea would be to use a trie for speedup, which can run DTW/Levenshtein on multiple results simultaneously (a huge speedup if your database is large). There is a paper on Levenshtein on tries at IEEE, so you can find the algorithm there. Again, for this you would need to relax the final boundary condition so you get partial matches. However, since you step down the trie, you just need to check when you have fully consumed the input and then return all leaves.
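A rough sketch of the trie idea (my own simplification, not the IEEE paper's exact algorithm): keep one Levenshtein DP row per trie node and, as soon as the cell for the fully consumed query is within the threshold, return every leaf below that node.

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w              # marks a complete word at this node
    return root

def collect(node, out):
    for key, child in node.items():
        if key == "$":
            out.append(child)
        else:
            collect(child, out)

def fuzzy_complete(root, query, max_dist):
    results = []
    first_row = list(range(len(query) + 1))

    def walk(node, ch, prev_row):
        # extend the Levenshtein table by one trie character
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(prev_row[j] + 1, row[j - 1] + 1, prev_row[j - 1] + cost))
        if row[-1] <= max_dist:        # relaxed end condition: query fully consumed
            collect(node, results)     # every completion below this node is a match
            return
        if min(row) <= max_dist:       # otherwise keep walking while still feasible
            for key, child in node.items():
                if key != "$":
                    walk(child, key, row)

    for key, child in root.items():
        if key != "$":
            walk(child, key, first_row)
    return results

trie = build_trie(["best buy", "mcdonalds", "sony", "apple"])
print(fuzzy_complete(trie, "app", 1))    # ['apple']
print(fuzzy_complete(trie, "appel", 2))  # ['apple']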
Check out difflib in the standard library: http://docs.python.org/library/difflib.html
It should help you.
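For example, difflib.get_close_matches does fuzzy matching against a list of candidates out of the box (the names below are just the ones from the question):

import difflib

names = ["best buy", "mcdonalds", "sony", "apple"]
print(difflib.get_close_matches("appel", names, n=3, cutoff=0.6))       # ['apple']
print(difflib.get_close_matches("mc'donalds", names, n=3, cutoff=0.6))  # ['mcdonalds']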
Consider the following search results:
Google for 'David' - 591 million hits in 0.28 sec
Google for 'John' - 785 million hits in 0.18 sec
OK. Pages are indexed, it only needs to look up the count and the first few items in the index table, so speed is understandable.
Now consider the following search with AND operation:
Google for 'David John' ('David' AND 'John') - 173 million hits in 0.25 sec
This makes me ticked ;) How on earth can search engines get the result of AND operations on gigantic datasets so fast? I see the following two ways to conduct the task and both are terrible:
You conduct the search for 'David'. Take the gigantic temp table and conduct a search for 'John' on it. HOWEVER, the temp table is not indexed by 'John', so a brute force search is needed. That just won't compute within 0.25 sec no matter what HW you have.
Indexing by all possible word combinations like 'David John'. Then we face a combinatorial explosion in the number of keys, and not even Google has the storage capacity to handle that.
And you can AND together as many search phrases as you want and you still get answers in under 0.5 sec! How?
What Markus wrote about Google processing the query on many machines in parallel is correct.
In addition, there are information retrieval algorithms that make this job a little bit easier. The classic way to do it is to build an inverted index which consists of postings lists - a list for each term of all the documents that contain that term, in order.
When a query with two terms is searched, conceptually, you would take the postings lists for each of the two terms ('david' and 'john'), and walk along them, looking for documents that are in both lists. If both lists are ordered the same way, this can be done in O(N). Granted, N is still huge, which is why this will be done on hundreds of machines in parallel.
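A toy sketch of that lockstep walk over two sorted postings lists (the document ids here are made up):

def intersect(postings_a, postings_b):
    # both inputs are sorted lists of document ids; runs in O(len(a) + len(b))
    result, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

david = [2, 5, 9, 14, 21]
john = [1, 2, 9, 17, 21, 30]
print(intersect(david, john))   # [2, 9, 21]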
Also, there may be additional tricks. For example, if the highest-ranked documents were placed higher on the lists, then maybe the algorithm could decide that it found the 10 best results without walking the entire lists. It would then guess at the remaining number of results (based on the size of the two lists).
I think you're approaching the problem from the wrong angle.
Google doesn't have its tables/indices on a single machine. Instead it partitions its dataset heavily across its servers. Reports indicate that as many as 1000 physical machines are involved in every single query!
With that amount of computing power it's "simply" (used highly ironically) a matter of ensuring that every machine completes their work in fractions of a second.
Reading about Google technology and infrastructure is very inspiring and highly educational. I'd recommend reading up on BigTable, MapReduce and the Google File System.
Google has an archive of their publications available with lots of juicy information about their technologies. This thread on MetaFilter also provides some insight into the enormous amount of hardware needed to run a search engine.
I don't know how google does it, but I can tell you how I did it when a client needed something similar:
It starts with an inverted index, as described by Avi. That's just a table listing, for every word in every document, the document id, the word, and a score for the word's relevance in that document. (Another approach is to index each appearance of the word individually along with its position, but that wasn't required in this case.)
From there, it's even simpler than Avi's description - there's no need to do a separate search for each term. Standard database summary operations can easily do that in a single pass:
SELECT document_id, SUM(score) AS total_score, COUNT(score) AS matches
FROM rev_index
WHERE word IN ('david', 'john')
GROUP BY document_id
HAVING matches = 2
ORDER BY total_score DESC
This will return the IDs of all documents which have scores for both 'David' and 'John' (i.e., both words appear), ordered by some approximation of relevance. It will take about the same time to execute regardless of how many or how few terms you're looking for, since IN performance is not affected much by the size of the target set and a simple count determines whether all terms were matched or not.
Note that this simplistic method just adds the 'David' score and the 'John' score together to determine overall relevance; it doesn't take the order/proximity/etc. of the names into account. Once again, I'm sure that google does factor that into their scores, but my client didn't need it.
I did something similar to this years ago on a 16 bit machine. The dataset had an upper limit of around 110,000 records (it was a cemetery, so a finite limit on burials), so I set up a series of bitmaps each containing 128K bits.
Searching for 'david' resulted in me setting the relevant bit in one of the bitmaps to signify that the record had the word 'david' in it. I did the same for 'john' in a second bitmap.
Then all you need to do is a binary 'and' of the two bitmaps, and the resulting bitmap tells you which record numbers had both 'david' and 'john' in them. Quick scan of the resulting bitmap gives you back the list of records that match both terms.
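In Python terms the same trick is just a bitwise AND over large integers used as bitmaps (a toy reconstruction, not the original 16-bit code):

NUM_RECORDS = 110_000
bitmaps = {}   # word -> integer used as a bitmap, one bit per record

def index_word(word, record_id):
    bitmaps[word] = bitmaps.get(word, 0) | (1 << record_id)

def search_and(*words):
    result = ~0                          # start with all bits set
    for w in words:
        result &= bitmaps.get(w, 0)      # binary 'and' of the per-word bitmaps
    # quick scan of the resulting bitmap for set bits
    return [i for i in range(NUM_RECORDS) if (result >> i) & 1]

index_word("david", 12)
index_word("john", 12)
index_word("david", 40)
print(search_and("david", "john"))   # [12]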
This technique wouldn't work for google though, so consider this my $0.02 worth.
Possible Duplicate:
How does the Google “Did you mean?” Algorithm work?
Suppose you already have a search system in your website. How can you implement the "Did you mean: <spell_checked_word>" feature like Google does for some search queries?
Actually what Google does is very much non-trivial and, at first, counter-intuitive. They don't do anything like check against a dictionary; rather, they make use of statistics to identify "similar" queries that returned more results than your query. The exact algorithm is of course not known.
There are different sub-problems to solve here. As a fundamental basis for anything related to statistical Natural Language Processing, there is one must-have book: Foundations of Statistical Natural Language Processing.
Concretely, to solve the problem of word/query similarity I have had good results using edit distance, a mathematical measure of string similarity that works surprisingly well. I used to use Levenshtein, but the others may be worth looking into.
Soundex - in my experience - is crap.
Efficiently storing and searching a large dictionary of misspelled words with sub-second retrieval is again non-trivial. Your best bet is to make use of existing full-text indexing and retrieval engines (i.e. not your database's one), of which Lucene is currently one of the best and, coincidentally, has been ported to many, many platforms.
Google's Dr Norvig has outlined how it works; he even gives a 20ish line Python implementation:
http://googlesystem.blogspot.com/2007/04/simplified-version-of-googles-spell.html
http://www.norvig.com/spell-correct.html
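For flavour, here is a compressed paraphrase of the approach described in the article (not Norvig's exact code; it assumes a large plain-text corpus saved as big.txt, like the one linked from the article): generate every string one edit away from the input and pick the candidate that appears most often in the corpus.

import re
from collections import Counter

WORDS = Counter(re.findall(r"[a-z]+", open("big.txt").read().lower()))

def edits1(word):
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    # prefer the word itself if known, else any known word one edit away, else give up
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=WORDS.get)

print(correct("speling"))   # most likely 'spelling', given a reasonable corpus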
Dr Norvig also discusses the "did you mean" in this excellent talk. Dr Norvig is head of research at Google; when asked how "did you mean" is implemented, his answer is authoritative.
So it's spell-checking, presumably with a dynamic dictionary built from other searches or even actual internet phrases and such. But that's still spell checking.
SOUNDEX and other guesses don't get a look in, people!
Check this article on wikipedia about the Levenshtein distance. Make sure you take a good look at Possible improvements.
I was pleasantly surprised that someone has asked how to create a state-of-the-art spelling suggestion system for search engines. I have been working on this subject for more than a year for a search engine company and I can point to information on the public domain on the subject.
As was mentioned in a previous post, Google (and Microsoft and Yahoo!) do not use any predefined dictionary nor do they employ hordes of linguists that ponder over the possible misspellings of queries. That would be impossible due to the scale of the problem but also because it is not clear that people could actually correctly identify when and if a query is misspelled.
Instead there is a simple and rather effective principle that also works for all European languages: take all the unique queries in your search logs, calculate the edit distance between pairs of queries, and assume that the reference query is the one with the highest count.
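A toy sketch of that principle (heavily simplified; real systems also weight by sessions, click-throughs and so on, and here difflib's similarity ratio stands in for a proper edit-distance cutoff):

from collections import Counter
import difflib

query_log = ["britney spears", "brittany spears", "britney spears",
             "britny spears", "britney spears"]
counts = Counter(query_log)

def suggest(query, max_checked=10000):
    # propose the most frequent logged query that is close enough and more popular
    for candidate, count in counts.most_common(max_checked):
        close_enough = difflib.SequenceMatcher(None, query, candidate).ratio() > 0.8
        if close_enough and count > counts[query]:
            return candidate
    return query

print(suggest("brittany spears"))   # 'britney spears'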
This simple algorithm will work great for many types of queries. If you want to take it to the next level then I suggest you read the paper by Microsoft Research on that subject. You can find it here
The paper has a great introduction, but after that you will need to be knowledgeable about concepts such as the Hidden Markov Model.
I would suggest looking at SOUNDEX to find similar words in your database.
You can also access Google's own dictionary by using the Google API spelling suggestion request.
You may want to look at Peter Norvig's "How to Write a Spelling Corrector" article.
I believe Google logs all queries and identifies when someone makes a spelling correction. This correction may then be suggested when others supply the same first query. This will work for any language, in fact any string of any characters.
http://en.wikipedia.org/wiki/N-gram#Google_use_of_N-gram
I think this depends on how big your website is. On our local intranet, which is used by about 500 members of staff, I simply look at the search phrases that returned zero results and enter that search phrase together with the new suggested search phrase into a SQL table.
I then call on that table if no search results have been returned. However, this only works if the site is relatively small, and I only do it for the most common search phrases.
You might also want to look at my answer to a similar question:
"Similar Posts" like functionality using MS SQL Server?
If you have industry-specific translations, you will likely need a thesaurus. For example, I worked in the jewelry industry and there were abbreviations in our descriptions such as kt - karat, rd - round, cwt - carat weight... Endeca (the search engine at that job) has a thesaurus that will translate from common misspellings, but it does require manual intervention.
I do it with Lucene's Spell Checker.
Soundex is good for phonetic matches, but works best with people's names (it was originally developed for census data).
Also check out full-text indexing; the syntax is different from Google-style query logic, but it's very quick and can deal with similar language elements.
Soundex and "Porter stemming" (soundex is trivial, not sure about porter stemming).
There's something called aspell that might help:
http://blog.evanweaver.com/files/doc/fauna/raspell/classes/Aspell.html
There's a Ruby gem for it, but I don't know how to talk to it from Python.
http://blog.evanweaver.com/files/doc/fauna/raspell/files/README.html
Here's a quote from the Ruby implementation:
Usage
Aspell lets you check words and suggest corrections. For example:
speller = Aspell.new("en_US")   # create a speller for the desired language first
string = "my haert wil go on"
string.gsub(/[\w\']+/) do |word|
  if !speller.check(word)
    # word is wrong
    puts "Possible correction for #{word}:"
    puts speller.suggest(word).first
  end
end
This outputs:
Possible correction for haert:
heart
Possible correction for wil:
Will
Implementing spelling correction for search engines in an effective way is not trivial (you can't just compute the edit/levenshtein distance to every possible word). A solution based on k-gram indexes is described in Introduction to Information Retrieval (full text available online).
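To make the idea concrete, here is a bare-bones k-gram index sketch (k=2, i.e. bigrams; the vocabulary and the overlap threshold are arbitrary). Each k-gram maps to the terms containing it, so edit distance only has to be computed against terms that share enough k-grams with the misspelled word:

from collections import defaultdict

def kgrams(word, k=2):
    padded = "$" + word + "$"              # boundary markers
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

vocabulary = ["border", "boarder", "broader", "lord", "morbid"]
index = defaultdict(set)
for term in vocabulary:
    for gram in kgrams(term):
        index[gram].add(term)

def candidates(misspelled, min_overlap=3):
    hits = defaultdict(int)
    for gram in kgrams(misspelled):
        for term in index[gram]:
            hits[term] += 1
    # only these few candidates need an exact edit-distance computation
    return [term for term, n in hits.items() if n >= min_overlap]

print(candidates("bord"))   # a small candidate set such as ['border', 'boarder', 'lord']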
You could use n-grams for the comparison: http://en.wikipedia.org/wiki/N-gram
Using the Python ngram module: http://packages.python.org/ngram/index.html
import ngram
G2 = ngram.NGram([ "iis7 configure ftp 7.5",
"ubunto configre 8.5",
"mac configure ftp"])
print "Similarity", "\t", "String"
for i in G2.search("iis7 configurftp 7.5", threshold=0.1):
    print i[1], "\t", i[0]
You get:
>>>
Similarity  String
0.76 "iis7 configure ftp 7.5"
0.24 "mac configure ftp"
0.19 "ubunto configre 8.5"
Why not use Google's "did you mean" in your own code? For how, see here:
http://narenonit.blogspot.com/2012/08/trick-for-using-googles-did-you-mean.html