what algorithm does freebase use to match by name? - search

I'm trying to build a local version of the freebase search api using their quad dumps. I'm wondering what algorithm they use to match names? As an example, if you go to freebase.com and type in "Hiking" you get
"Apo Hiking Society"
"Hiking"
"Hiking Georgia"
"Hiking Virginia's national forests"
"Hiking trail"

Wow, a lot of guesses! I hope I don't muddy the waters too much by not guessing too.
The auto-complete box is basically powered by Freebase Suggest which is powered, in turn, by the Freebase Search service. Strings which are indexed by the search service for matching include: 1) the name, 2) all aliases in the given language, 3) link anchor text from the associated Wikipedia articles and 4) identifiers (called keys by Freebase), which includes things like Wikipedia article titles (and redirects).
How the various things are weighted/boosted hasn't been disclosed, but you can get a feel for things by playing with it for a while. As you can see from the API, there's also the ability to do filtering/weighting by types and other criteria, and this can come into play depending on the context. For example, if you're adding a record label to an album, topics which are typed as record labels will get a boost relative to things which aren't (but you can still get to things of other types, to allow for the use case where your target topic hasn't had the appropriate type applied yet).
So that gives you a little insight into how their service works, but why not build a search service that does what you need since you're starting from scratch anyway?
BTW, pre-Google, the Metaweb search implementation was built on top of Lucene, so you could definitely do worse than using that as your starting point. You can read some of the details in the mailing list archive.

Probably they use an inverted index over selected fields, such as the English name, aliases, and the Wikipedia snippet displayed. In your application you can achieve that using something like Lucene.
For the algorithm side, I find the following paper a good overview:
Zobel and Moffat (2006): "Inverted Files for Text Search Engines".
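To make the inverted index idea concrete, here is a minimal sketch in Python; the topic ids and field names are invented, and a real system would add analysis, ranking and so on (which is what Lucene gives you):

    from collections import defaultdict

    # Toy documents: each topic has a name and some aliases.
    docs = {
        "topic/apo": {"name": "Apo Hiking Society", "aliases": ["APO"]},
        "topic/hiking": {"name": "Hiking", "aliases": ["hill walking", "tramping"]},
        "topic/trail": {"name": "Hiking trail", "aliases": ["footpath"]},
    }

    def tokenize(text):
        return text.lower().split()

    # Inverted index: token -> set of ids of documents containing it.
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for text in [fields["name"], *fields["aliases"]]:
            for token in tokenize(text):
                index[token].add(doc_id)

    def search(query):
        """Ids of documents containing every token of the query."""
        hits = [index.get(tok, set()) for tok in tokenize(query)]
        return set.intersection(*hits) if hits else set()

    print(search("hiking"))        # all three topics
    print(search("hiking trail"))  # only 'topic/trail'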

Most likely it's a trie with lexicographical order.
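If you want to experiment with that guess, a bare-bones prefix trie for autocomplete could look like the sketch below (the names are just illustrative; this is not what Freebase actually does):

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.names = []  # full names whose prefix passes through this node

    def insert(root, name):
        node = root
        for ch in name.lower():
            node = node.children.setdefault(ch, TrieNode())
            node.names.append(name)

    def complete(root, prefix, limit=5):
        """Up to `limit` names starting with `prefix` (case-insensitive)."""
        node = root
        for ch in prefix.lower():
            if ch not in node.children:
                return []
            node = node.children[ch]
        return node.names[:limit]

    root = TrieNode()
    for name in ["Hiking", "Hiking trail", "Hiking Georgia", "Apo Hiking Society"]:
        insert(root, name)

    print(complete(root, "hik"))  # ['Hiking', 'Hiking trail', 'Hiking Georgia']

Note that a pure prefix lookup would never surface "Apo Hiking Society" for the query "Hiking", which suggests the real matching is more than a trie walk.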

There are a number of string-matching algorithms available: Boyer-Moore, Smith-Waterman-Gotoh, Knuth-Morris-Pratt, etc. You might also want to read up on edit distance algorithms such as Levenshtein. You will need to play around to see which best suits your purpose.
The Simmetrics library from the University of Sheffield implements many of these algorithms.
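For the edit distance part, a plain Levenshtein implementation is only a few lines; libraries like Simmetrics provide tuned versions of the same dynamic-programming idea. A rough sketch:

    def levenshtein(a, b):
        """Minimum number of single-character insertions, deletions and
        substitutions needed to turn string a into string b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("hiking", "hikking"))  # 1
    print(levenshtein("kitten", "sitting"))  # 3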

Related

Natural language processing keywords for building search engine

I'm recently interested in NLP and would like to build a search engine for product recommendation. (Actually, I'm always wondering how search engines like Google's or Amazon's are built.)
Take Amazon product as example, where I could access all "word" information about one product:
    Product_Name    Description     ReviewText
    "XXX brand"     "Pain relief"   "This is super effective"
By applying nltk and gensim packages I could easily compare similarity of different products and make recommendations.
But here's another question I feel very vague about:
How to build a search engine for such products?
For example, if I feel pain and would like to search for medicine online, I'd like to type-in "pain relief" or "pain", whose searching results should include "XXX brand".
So this sounds more like a keyword extraction/tagging question? How should this be done in NLP? I assume the corpus contains nothing but single words, so it's something like:
    {"XXX brand": [("pain", 1), ("relief", 1)]}
So if I typed in either "pain" or "relief" I could get "XXX brand"; but what if I searched for "pain relief"?
One idea is to call Python directly from my JavaScript to calculate the similarity between the input words "pain relief" and the products on the server and make recommendations; but is that even doable?
I would still prefer to build up very big lists of keywords on the backend, stored in datasets/a database and queried directly from the search engine's web page.
Thanks!
Even though this does not provide a full how-to answer, there are two things that might be helpful.
First, it's important to note that Google does not only index single words but also n-grams.
More or less every NLP problem, and therefore also information retrieval from text, needs to tackle n-grams, because phrases carry far more expressiveness and information than single tokens.
That's also why so-called NGramAnalyzers are popular in search engines, be it Solr or Elasticsearch. Since both are based on Lucene, you should take a look here.
Relying on either framework, you can use a synonym analyzer that adds, for each word, the synonyms you provide.
For example, you could add relief = remedy (and vice versa if you wish) to your synonym mapping. Then both engines would retrieve relevant documents regardless of whether you search for "pain relief" or "pain remedy". However, you should probably also read this post about the issues you might encounter, especially when aiming for phrase synonyms.
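As a toy illustration of the synonym idea outside of Solr/Elasticsearch, you can expand a query into all its synonym variants before matching; the mapping below is made up for this example:

    from itertools import product

    # Hypothetical synonym mapping, analogous to a synonym filter's config.
    synonyms = {
        "relief": ["relief", "remedy"],
        "remedy": ["remedy", "relief"],
    }

    def expand_query(query):
        """Expand each token into its synonym set and return every variant."""
        options = [synonyms.get(tok, [tok]) for tok in query.lower().split()]
        return [" ".join(combo) for combo in product(*options)]

    print(expand_query("pain relief"))  # ['pain relief', 'pain remedy']

In Solr or Elasticsearch this expansion happens inside the analyzer rather than in your application code.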

smart search by first/last name

I have to build a search facility capable of searching members by their first name/last name and may be some other search parameters (i.e. address).
The search should provide a list of candidate matches so that the user can select whichever he/she deems the "correct" match.
The search should be smart enough so that the "correct" result would be among the first few items on the list. The search should also be tolerant to typos and misspellings and, may be, even be aware of name shortcuts i.e. Bob vs. Robert or Bill vs. William.
I started investigating Lucene and its family (like Elasticsearch) as tools for the job. While it has an impressive array of features addressing similar problems for full-text search, I am not so sure how to use them for my task, up to the point that maybe Lucene is not the right tool here at all.
What do you guys think - how can I harness Elastic Search to solve my problem? Or should I look elsewhere?
Lucene supports fuzzy (edit distance) queries, so your search will tolerate some typos; you specify either a minimum similarity (older syntax) or, in Lucene 4+, a maximum edit distance for a term.
For instance:
name:johnni~0.8
would return "johnny".
Also Solr provides a wide array of ready made search filters and analyzers you can use for search.
In your case I would probably chain several filter factories together:
TrimFilterFactory - trim the query
LowerCaseFilterFactory - to get rid of case differences
ISOLatin1AccentFilterFactory - to remove accents from letters (most people don't search with the accent anyway)
PhoneticFilterFactory - for matching sounds-like queries, e.g. kris -> chris
Look at the documentation under the link; it is pretty straightforward to set up a new Solr instance with an analyzer that uses all the above filters. I used something similar for searching city names and it worked fairly well.
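If it helps to see what that analyzer chain does to a string, here is a rough pure-Python analogue of the trim / lowercase / accent-stripping steps (the phonetic step is left out, and this is of course not Solr's actual implementation):

    import unicodedata

    def normalize(text):
        """Roughly what TrimFilter + LowerCaseFilter + an accent-stripping
        filter do to a query or field value."""
        text = text.strip().lower()
        decomposed = unicodedata.normalize("NFKD", text)  # split accents off
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(normalize("  Renée  "))   # renee
    print(normalize("São Paulo"))   # sao paulo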
Lucene can be made tolerant of typos and misspellings, and can use synonyms. As for
The search should be smart enough so that the "correct" result would be among the first few items on the list
Are there any search engines which don't try to do this?
As far as Bob/Robert goes, that can be done with synonyms, but you need to get the synonym data from some reliable source.
In addition to what Asaf mentioned, you might try to use n-gram indexing to deal with spelling variants. See the CJKAnalyzer for an example of how to do that.
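A small sketch of what character n-gram indexing buys you for spelling variants (plain Python, not the CJKAnalyzer itself; the names are invented):

    def char_ngrams(text, n=3):
        """Set of character trigrams, padded so word boundaries count too."""
        padded = f" {text.lower()} "
        return {padded[i:i + n] for i in range(len(padded) - n + 1)}

    def ngram_similarity(a, b, n=3):
        """Jaccard overlap of the two trigram sets."""
        ga, gb = char_ngrams(a, n), char_ngrams(b, n)
        return len(ga & gb) / len(ga | gb)

    names = ["Johnson", "Jonsson", "Jackson", "Smith"]
    for name in sorted(names, key=lambda x: ngram_similarity("Jonson", x), reverse=True):
        print(name, round(ngram_similarity("Jonson", name), 2))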

Synonym style text lookup and parsing

We have a client who is looking for a means to import and categorize a large amount of textual data. This data has to be categorized, and it's been suggested that the easiest way to do this would be to look at the description field and try to match the words held there to see if a category can be derived for that particular record.
It was thought the best way to do this would be to match the words against keywords held against each category and, if that was unsuccessful, to use some kind of synonym lookup to see if a match could be found that way. So, for example, if a particular record had the word "automobile" in it, then a synonym lookup could match that word to the word "car", which would be held against the category "vehicle".
Does anyone know of a web service or other means of looking up a dictionary to find synonyms for a particular word? The project manager has suggested buying a Google Enterprise Search license for this but from what I can make out that doesn't offer what these guys are looking for.
Any suggestions for other ways of getting the client what they are looking for would be gratefully accepted.
Thanks! I'll look into Wordnet.
Do you know of any other types of text classification software products out there? I see there's some discussion of using Bayesian algorithms for this, but I can't see any real-world examples of it.
The first thing that comes to mind is Wordnet. Wordnet is a human-generated database of words and related words, including synonyms. The Wikipedia Wordnet entry lists several interfaces to Wordnet. I believe some of them are web services.
You can also roll your own. Manning and Schutze's chapter 5 (free PDF) shows ways to do this.
Having said that, are you solving the right problem? How do you build the category list?
Is it a hierarchy? a tag cloud? See Clay Shirky's Ontology is Overrated for a critique of hierarchical categories. I believe that synonyms are less important if you base your classification on sets of words (Naive Bayes, for example) rather than on single words.
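If you end up calling WordNet from Python, the NLTK interface is one of the easier routes (this assumes you have installed nltk and downloaded its WordNet corpus; a sketch):

    # pip install nltk, then once: python -m nltk.downloader wordnet
    from nltk.corpus import wordnet as wn

    def synonyms(word):
        """Collect lemma names from every synset the word belongs to."""
        found = set()
        for synset in wn.synsets(word):
            for lemma in synset.lemmas():
                found.add(lemma.name().replace("_", " "))
        return found

    print(synonyms("automobile"))  # includes 'car', 'auto', 'machine', ...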
You should look at using WordNet. You can visit their website http://wordnet.princeton.edu/ to get more information, but there are libraries available for integrating against them in lots of languages.
Go to their online tool to see the use of it in action here: http://wordnetweb.princeton.edu/perl/webwn. If you look up a word, then click on "S" next to each definition, you'll get a list of semantically related words to that definition.
I also think you should check out software that will allow you to perform "document clustering." Here is an example: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview. That should help you bootstrap the category creation process.
I think this will help get you a long way toward what you want!
For text classification you can take a look at Apache Mahout.

Cross Referencing Databases on Fuzzy Data

I am currently working on a project where I have to match up a large quantity of user-generated names with a separate list of the same names in a canonical format. The problem is that the user-generated names contain numerous misspellings, abbreviations, as well as simply invalid data, making it hard to do a cross-reference with the canonical data. Any suggestions on methods to do this?
This does not have to be done in real-time and in this case accuracy is more important than speed.
Current ideas for this are:
Do a fuzzy search for the user entered name in the canonical database using an existing search implementation like Lucene or Sphinx, which I presume use something like the Levenshtein distance for this.
Cross-reference on the SOUNDEX hash (which is supposedly computed on the sound of the name rather than spelling) instead of using the actual name.
Some combination of the above
Anyone have any feedback on any of these or ideas of their own?
One of my concerns is that none of the above methods will handle abbreviations very well. Can anyone point me in a direction for some machine learning methods to actually search on expanded abbreviations (or tell me I'm crazy)? Thanks in advance.
First, I'd add to your list the techniques discussed at Peter Norvig's post on spelling correction.
Second, I'd ask what kind of "user-generated names" you're talking about. Having dealt with both, I believe that the heuristics you'd use for street names are somewhat different from the heuristics for person names. (As a simple example, does "Dr" expand to "Drive" or "Doctor"?)
Third, I'd look at a combination using testing to establish the set of coefficients for combining the results of the various techniques.
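For the third point, here is one hedged way to picture "combining with coefficients", using only the standard library: difflib's character-level ratio as one signal, token overlap as another, with placeholder weights you would tune against a labelled sample of name pairs:

    from difflib import SequenceMatcher

    def char_similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def token_overlap(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(len(ta | tb), 1)

    def combined_score(a, b, w_char=0.6, w_token=0.4):
        """Weighted combination of two similarity signals; the weights here
        are arbitrary and should be fitted on test data."""
        return w_char * char_similarity(a, b) + w_token * token_overlap(a, b)

    canonical = ["Robert Smith", "Roberta Smythe", "Bob Jones"]
    query = "Rob. Smith"
    for name in sorted(canonical, key=lambda n: combined_score(query, n), reverse=True):
        print(round(combined_score(query, name), 3), name)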

How do you implement a "Did you mean"? [duplicate]

Possible Duplicate:
How does the Google “Did you mean?” Algorithm work?
Suppose you have a search system already in your website. How can you implement the "Did you mean:<spell_checked_word>" like Google does in some search queries?
Actually what Google does is very much non-trivial and also, at first, counter-intuitive. They don't do anything like check against a dictionary; rather, they make use of statistics to identify "similar" queries that returned more results than your query. The exact algorithm is, of course, not known.
There are different sub-problems to solve here; as a fundamental basis for all statistics-related Natural Language Processing there is one must-have book: Foundations of Statistical Natural Language Processing.
Concretely, to solve the problem of word/query similarity, I have had good results using edit distance, a mathematical measure of string similarity that works surprisingly well. I used Levenshtein, but the others may be worth looking into.
Soundex - in my experience - is crap.
Efficiently storing and searching a large dictionary of misspelled words with sub-second retrieval is again non-trivial; your best bet is to make use of an existing full-text indexing and retrieval engine (i.e. not your database's), of which Lucene is currently one of the best and, conveniently, ported to many platforms.
Google's Dr Norvig has outlined how it works; he even gives a 20ish line Python implementation:
http://googlesystem.blogspot.com/2007/04/simplified-version-of-googles-spell.html
http://www.norvig.com/spell-correct.html
Dr Norvig also discusses the "did you mean" in this excellent talk. Dr Norvig is head of research at Google; when asked how "did you mean" is implemented, his answer is authoritative.
So it's spell-checking, presumably with a dynamic dictionary built from other searches or even actual internet phrases and such. But that's still spell checking.
SOUNDEX and other guesses don't get a look-in, people!
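For reference, a condensed sketch of the approach from Norvig's article (a paraphrase, not his exact code; big.txt stands in for whatever large corpus you build the word counts from):

    import re
    from collections import Counter

    # Word frequencies from a large corpus file you provide.
    WORDS = Counter(re.findall(r"[a-z]+", open("big.txt").read().lower()))

    def edits1(word):
        """All strings one edit (delete, transpose, replace, insert) away."""
        letters = "abcdefghijklmnopqrstuvwxyz"
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def known(words):
        return {w for w in words if w in WORDS}

    def correct(word):
        """Most frequent known candidate, preferring fewer edits."""
        candidates = (known([word]) or known(edits1(word))
                      or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                      or [word])
        return max(candidates, key=WORDS.get)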
Check this article on wikipedia about the Levenshtein distance. Make sure you take a good look at Possible improvements.
I was pleasantly surprised that someone has asked how to create a state-of-the-art spelling suggestion system for search engines. I have been working on this subject for more than a year for a search engine company and I can point to information on the public domain on the subject.
As was mentioned in a previous post, Google (and Microsoft and Yahoo!) do not use any predefined dictionary nor do they employ hordes of linguists that ponder over the possible misspellings of queries. That would be impossible due to the scale of the problem but also because it is not clear that people could actually correctly identify when and if a query is misspelled.
Instead, there is a simple and rather effective principle that is also valid for all European languages: get all the unique queries from your search logs, calculate the edit distance between pairs of queries, and take the reference (correct) query to be the one with the highest count.
This simple algorithm will work great for many types of queries. If you want to take it to the next level, then I suggest you read the paper by Microsoft Research on that subject. You can find it here.
The paper has a great introduction but after that you will need to be knowledgeable with concepts such as the Hidden Markov Model.
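A toy version of that principle; the query counts here are invented, and a real system would work over millions of logged queries with a much smarter candidate search than this all-pairs scan:

    from difflib import SequenceMatcher

    # Hypothetical query log: query -> how often it was issued.
    query_counts = {
        "britney spears": 9500,
        "brittany spears": 310,
        "britny spears": 120,
        "britain": 4000,
    }

    def did_you_mean(query, min_similarity=0.85):
        """Suggest a more popular logged query that is very similar to this one."""
        count = query_counts.get(query, 0)
        best = None
        for candidate, candidate_count in query_counts.items():
            if candidate == query or candidate_count <= count:
                continue
            if SequenceMatcher(None, query, candidate).ratio() >= min_similarity:
                if best is None or candidate_count > query_counts[best]:
                    best = candidate
        return best

    print(did_you_mean("brittany spears"))  # 'britney spears'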
I would suggest looking at SOUNDEX to find similar words in your database.
You can also access Google's own dictionary by using the Google API spelling suggestion request.
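And if you want to try the SOUNDEX suggestion without relying on your database, the classic American Soundex code is only a few lines (a sketch of the standard rules):

    def soundex(name):
        """Classic American Soundex: first letter plus three digits."""
        codes = {
            "bfpv": "1", "cgjkqsxz": "2", "dt": "3",
            "l": "4", "mn": "5", "r": "6",
        }
        def code(ch):
            for letters, digit in codes.items():
                if ch in letters:
                    return digit
            return ""  # vowels, h, w, y get no code

        name = name.lower()
        result = name[0].upper()
        prev = code(name[0])
        for ch in name[1:]:
            digit = code(ch)
            if digit and digit != prev:
                result += digit
            if ch not in "hw":  # h and w do not reset the previous code
                prev = digit
        return (result + "000")[:4]

    print(soundex("Robert"))    # R163
    print(soundex("Rupert"))    # R163
    print(soundex("Ashcraft"))  # A261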
You may want to look at Peter Norvig's "How to Write a Spelling Corrector" article.
I believe Google logs all queries and identifies when someone makes a spelling correction. This correction may then be suggested when others supply the same first query. This will work for any language, in fact any string of any characters.
http://en.wikipedia.org/wiki/N-gram#Google_use_of_N-gram
I think this depends on how big your website is. On our local intranet, which is used by about 500 members of staff, I simply look at the search phrases that returned zero results and enter that search phrase along with the new suggested search phrase into a SQL table.
I then call on that table if no search results have been returned. However, this only works if the site is relatively small, and I only do it for the most common search phrases.
You might also want to look at my answer to a similar question:
"Similar Posts" like functionality using MS SQL Server?
If you have industry-specific terms, you will likely need a thesaurus. For example, I worked in the jewelry industry and there were abbreviations in our descriptions such as kt - karat, rd - round, cwt - carat weight... Endeca (the search engine at that job) has a thesaurus that will translate from common misspellings, but it does require manual intervention.
I do it with Lucene's Spell Checker.
Soundex is good for phonetic matches, but works best with people's names (it was originally developed for census data).
Also check out Full-Text-Indexing, the syntax is different from Google logic, but it's very quick and can deal with similar language elements.
Soundex and "Porter stemming" (soundex is trivial, not sure about porter stemming).
There's something called aspell that might help:
http://blog.evanweaver.com/files/doc/fauna/raspell/classes/Aspell.html
There's a ruby gem for it, but I don't know how to talk to it from python
http://blog.evanweaver.com/files/doc/fauna/raspell/files/README.html
Here's a quote from the Ruby implementation:
Usage
Aspell lets you check words and suggest corrections. For example:

    # speller is an Aspell instance, e.g. speller = Aspell.new("en_US")
    string = "my haert wil go on"
    string.gsub(/[\w\']+/) do |word|
      if !speller.check(word)
        # word is wrong
        puts "Possible correction for #{word}:"
        puts speller.suggest(word).first
      end
    end
This outputs:
Possible correction for haert:
heart
Possible correction for wil:
Will
Implementing spelling correction for search engines in an effective way is not trivial (you can't just compute the edit/levenshtein distance to every possible word). A solution based on k-gram indexes is described in Introduction to Information Retrieval (full text available online).
You could use n-grams for the comparison: http://en.wikipedia.org/wiki/N-gram
Using the Python ngram module: http://packages.python.org/ngram/index.html

    import ngram

    G2 = ngram.NGram(["iis7 configure ftp 7.5",
                      "ubunto configre 8.5",
                      "mac configure ftp"])

    print("Similarity", "\t", "String")
    for i in G2.search("iis7 configurftp 7.5", threshold=0.1):
        print(i[1], "\t", i[0])

You get:

    Similarity      String
    0.76            iis7 configure ftp 7.5
    0.24            mac configure ftp
    0.19            ubunto configre 8.5
Why not use Google's "did you mean" in your code? For how, see here:
http://narenonit.blogspot.com/2012/08/trick-for-using-googles-did-you-mean.html

Resources