Open-source full-text article recommendation engines - search

I'm wondering if there are any good .NET recommendation algorithms available in open source projects, whether attached to a search engine or not. By recommendation I mean something that accepts a full-text article and recommends other articles from its index based on keyword similarity.
At the high end there are document classification engines like Autonomy; at the low end, spam filters and blog "related posts" widgets. Possibly advertisement-to-article matching, too. I'd like to incorporate one into a project but can't afford the high end, and the low end seems to all be LAMP-based.
[Sorry, one answer asked for clarification: What I'm looking for is ideally a standalone library, but I'm willing to adapt good source code as necessary. The end result is that I need to be able to create a C# service that accepts an arbitrary amount of text and returns a list of similar previously-indexed articles. Basically, the exact thing that StackOverflow itself does as you are submitting a question!]
Thanks!
Steve

I think that on StackOverflow they strip out the common English words from the text and then compare the remaining words with those of other posts to get the "Related" posts.
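To make that concrete, here is a toy sketch of that kind of keyword-overlap matching, in Python for brevity (purely illustrative; StackOverflow's actual implementation is not public, and the stop-word list, tokenizer, and scoring here are all made up):

# Toy sketch: rank previously indexed articles by keyword overlap.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "this", "that"}

def keywords(text):
    # Lowercase, strip punctuation, and drop stop words and empty tokens.
    tokens = (w.strip('.,!?;:"()').lower() for w in text.split())
    return {t for t in tokens if t and t not in STOP_WORDS}

def related(new_text, indexed_articles, top_n=5):
    # indexed_articles: dict mapping article id -> article text.
    query = keywords(new_text)
    scores = []
    for article_id, text in indexed_articles.items():
        words = keywords(text)
        shared = query & words
        if shared:
            # Jaccard similarity: shared keywords over total distinct keywords.
            scores.append((len(shared) / float(len(query | words)), article_id))
    return sorted(scores, reverse=True)[:top_n]

In a real service you would precompute the keyword sets (or use a proper inverted index with TF-IDF weighting) rather than rescanning every article per request.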

The question is not very clear (algorithm or library?), but the only thing that comes to mind is Lucene.NET, the port of the popular Lucene library to the .NET framework. HTH.

Related

Web Crawling and PageRank

I'm a computer science student and I am a bit inexperienced when it comes to web crawling and building search engines. At this time, I am using the latest version of Open Search Server and am crawling several thousand domains. When using the built-in search engine creation tool, I get search results that are related to my query, but they are ranked using a vector space model of the documents as opposed to the PageRank algorithm or something similar. As a result, the top results are only marginally helpful, whereas higher-quality results from sites such as Wikipedia are buried on the second page.
Is there some way to run a crude PageRank algorithm in Open Search Server? If not, is there a similarly easy-to-use open source package that does this?
Thanks for the help! This is my first time doing anything like this so any feedback is greatly appreciated.
I am not familiar with OpenSearchServer, but I know that most of the students working on search engines use Lucene or Indri. Reading papers on novel approaches to document search, you will find that the majority of them use one of these two APIs. Lucene is more flexible than Indri in terms of defining different ranking algorithms. I suggest taking a look at these two and seeing whether they are convenient for your purpose.
As you mention, the web crawl template of OpenSearchServer uses a search query whose relevancy is based on the vector space model. But if you use the latest version (v1.5.11), it also mixes in the number of backlinks.
You may change the weight of the backlink-based score; by default it is set to 1.
We are currently working on providing more control over relevance. This will be visible in future versions of OpenSearchServer.
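For context, a crude PageRank can be computed offline from the crawled link graph and then mixed into the relevancy score, much like the backlink count mentioned above. A minimal power-iteration sketch in Python over a made-up link graph (this is not an OpenSearchServer API):

# Minimal PageRank by power iteration over a made-up link graph (illustrative only).
def pagerank(links, damping=0.85, iterations=50):
    # links: dict mapping each page to the list of pages it links to.
    pages = set(links) | {dst for dsts in links.values() for dst in dsts}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for src in pages:
            outlinks = links.get(src, [])
            if outlinks:
                share = damping * rank[src] / len(outlinks)
                for dst in outlinks:
                    new_rank[dst] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[src] / len(pages)
        rank = new_rank
    return rank

print(pagerank({"wikipedia.org": ["blog-a.com"],
                "blog-a.com": ["wikipedia.org", "blog-b.com"],
                "blog-b.com": ["wikipedia.org"]}))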

what algorithm does freebase use to match by name?

I'm trying to build a local version of the freebase search api using their quad dumps. I'm wondering what algorithm they use to match names? As an example, if you go to freebase.com and type in "Hiking" you get
"Apo Hiking Society"
"Hiking"
"Hiking Georgia"
"Hiking Virginia's national forests"
"Hiking trail"
Wow, a lot of guesses! I hope I don't muddy the waters too much by not guessing too.
The auto-complete box is basically powered by Freebase Suggest which is powered, in turn, by the Freebase Search service. Strings which are indexed by the search service for matching include: 1) the name, 2) all aliases in the given language, 3) link anchor text from the associated Wikipedia articles and 4) identifiers (called keys by Freebase), which includes things like Wikipedia article titles (and redirects).
How the various things are weighted/boosted hasn't been disclosed, but you can get a feel for things by playing with it for a while. As you can see from the API, there's also the ability to do filtering/weighting by types and other criteria, and this can come into play depending on the context. For example, if you're adding a record label to an album, topics which are typed as record labels will get a boost relative to things which aren't (but you can still get to things of other types, to allow for the use case where your target topic hasn't had the appropriate type applied yet).
So that gives you a little insight into how their service works, but why not build a search service that does what you need since you're starting from scratch anyway?
BTW, pre-Google, the Metaweb search implementation was built on top of Lucene, so you could definitely do worse than using that as your starting point. You can read some of the details in the mailing list archive.
They probably use an inverted index over selected fields, such as the English name, aliases, and the Wikipedia snippet displayed. In your application you can achieve that using something like Lucene.
For the algorithm side, I find the following paper a good overview
Zobel and Moffat (2006): "Inverted Files for Text Search Engines".
Most likely it's a trie with lexicographical order.
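If it is indeed prefix matching, a minimal trie sketch (purely illustrative, not Freebase's code; the class and function names are made up) could look like the following. In practice you would index every word of each name, so that "Apo Hiking Society" also matches the prefix "hik", and keep each node's entries in lexicographic or popularity order.

# Illustrative prefix trie for auto-complete style name matching.
class TrieNode(object):
    def __init__(self):
        self.children = {}
        self.names = []          # full names reachable from this prefix

def insert(root, name):
    node = root
    for ch in name.lower():
        node = node.children.setdefault(ch, TrieNode())
        node.names.append(name)  # memory-hungry, but keeps lookups trivial

def complete(root, prefix, limit=5):
    node = root
    for ch in prefix.lower():
        if ch not in node.children:
            return []
        node = node.children[ch]
    return node.names[:limit]

root = TrieNode()
for n in ["Hiking", "Hiking trail", "Hiking Georgia", "Apo Hiking Society"]:
    insert(root, n)
print(complete(root, "hik"))     # ['Hiking', 'Hiking trail', 'Hiking Georgia']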
There are a number of algorithms available: Boyer-Moore, Smith-Waterman-Gotoh, Knuth-Morris-Pratt, etc. You might also want to check out edit-distance algorithms such as Levenshtein. You will need to play around to see which best suits your purpose.
One implementation of such algorithms is the SimMetrics library from the University of Sheffield.

Recommendations for a simple search engine for bag of words?

Any recommendations for a small, lightweight, bag-of-words search engine?
I have a set of 'documents' that are each basically a small bag of arbitrary words.
Given a new document, I need to get a list of 'similar' documents along with some weight for how similar they might be. Documents are likely to be small... a couple of paragraphs at most.
Stemming would be great but not highly required.
Word expansion with WordNets is not required.
Open source or freeware preferred, as this is a prototype, not a full-blown project.
unix/linux platform preferred.
I'd be using it as a subcomponent and expect only to feed it documents with an ID and would later do searches for 'similar' documents to one I currently have.
Whoosh is a pure Python (no C, no external database) indexer / search engine. Check out the documentation for more information. It does support stemming.
I tried it out on an XML dump of a mediawiki instance and it seemed to work pretty well!
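To give a rough idea of how Whoosh could be bent to the "similar documents" use case (just one possible approach; the document IDs and contents below are made up): treat the words of the new document as one big OR query, and use the hit scores as crude similarity weights.

import os
from whoosh.index import create_in
from whoosh.fields import Schema, ID, TEXT
from whoosh.analysis import StemmingAnalyzer
from whoosh.qparser import QueryParser, OrGroup

# Two made-up documents; doc_id and body are the only fields this sketch needs.
schema = Schema(doc_id=ID(stored=True, unique=True),
                body=TEXT(analyzer=StemmingAnalyzer()))
os.makedirs("bow_index", exist_ok=True)
ix = create_in("bow_index", schema)

writer = ix.writer()
writer.add_document(doc_id="doc1", body="hiking trails in the national forest")
writer.add_document(doc_id="doc2", body="sourdough bread baking at home")
writer.commit()

# Treat the words of a new document as one big OR query; the scores act as
# rough similarity weights for ranking.
new_doc = "weekend hiking in a state forest"
with ix.searcher() as searcher:
    parser = QueryParser("body", ix.schema, group=OrGroup)
    for hit in searcher.search(parser.parse(new_doc), limit=5):
        print(hit["doc_id"], hit.score)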
Solr or Sphinx. They aren't exactly lightweight, but I wouldn't recommend anything smaller: if the project turns out to be successful and needs to grow, switching the search engine later might be painful.
I think that Lucene is an option. It should allow you to build a custom bag of words search engine.
I wonder about MongoDB http://www.mongodb.org/display/DOCS/Home
It seems like 'full-text-search' may be what I'm after...
and having additional fields to search with may be handy.

How would I go about creating a custom search index much like Lucene?

I implemented a Lucene search solution a while back, and it got me interested in compressed file indexes that are searchable. At the time I could not find any good information on how exactly you would go about creating a custom search index, so I'm wondering if anyone can point me in the right direction?
My primary interest is in file formatting, compression, and something similar to the concept of Lucene's documents and fields. It should not necessarily be language specific, but if you can point me to online resources that have language specific implementations with full descriptions of the process then that is okay, too.
Managing Gigabytes by Ian H. Witten, Alistair Moffat, and Timothy C. Bell
You may also try looking at the source code of the excellent Sphinx search engine.
It is a modern, open-source full-text search engine, and it uses cleverly optimized indexes.
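For a flavour of what those resources describe, here is a toy sketch of the core structure behind such indexes: an inverted index whose posting lists store document-ID gaps, which is what makes them compress well with variable-byte or gamma codes. This is purely illustrative, not Lucene's or Sphinx's actual on-disk format.

# Toy inverted index with gap-encoded posting lists (illustrative only).
from collections import defaultdict

def build_index(docs):
    # docs: dict of doc_id (int) -> text. Returns term -> sorted list of doc ids.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def encode_gaps(doc_ids):
    # Store the difference from the previous id; small gaps compress well
    # with variable-byte or gamma codes.
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def decode_gaps(gaps):
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids

index = build_index({1: "hiking trail maps", 2: "trail running", 7: "hiking boots"})
gaps = encode_gaps(index["hiking"])   # [1, 6] instead of [1, 7]
assert decode_gaps(gaps) == index["hiking"]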

How do you implement a "Did you mean"? [duplicate]

Possible Duplicate:
How does the Google “Did you mean?” Algorithm work?
Suppose you have a search system already in your website. How can you implement the "Did you mean:<spell_checked_word>" like Google does in some search queries?
Actually, what Google does is very much non-trivial and also at first counter-intuitive. They don't do anything like checking against a dictionary; rather, they make use of statistics to identify "similar" queries that returned more results than your query. The exact algorithm is, of course, not known.
There are different sub-problems to solve here. As a fundamental basis for all statistics-related Natural Language Processing there is one must-have book: Foundations of Statistical Natural Language Processing.
Concretely, to solve the problem of word/query similarity, I have had good results using edit distance, a mathematical measure of string similarity that works surprisingly well. I used Levenshtein, but the other variants may be worth looking into.
Soundex - in my experience - is crap.
Actually, efficiently storing and searching a large dictionary of misspelled words with sub-second retrieval is again non-trivial; your best bet is to make use of an existing full-text indexing and retrieval engine (i.e. not your database's built-in one), of which Lucene is currently one of the best, and it has conveniently been ported to many platforms.
Google's Dr Norvig has outlined how it works; he even gives a 20ish line Python implementation:
http://googlesystem.blogspot.com/2007/04/simplified-version-of-googles-spell.html
http://www.norvig.com/spell-correct.html
Dr Norvig also discusses the "did you mean" in this excellent talk. Dr Norvig is head of research at Google - when asked how "did you mean" is implemented, his answer is authoritative.
So it's spell-checking, presumably with a dynamic dictionary built from other searches or even actual internet phrases and such. But that's still spell checking.
SOUNDEX and other guesses don't get a look in, people!
Check the Wikipedia article about the Levenshtein distance. Make sure you take a good look at the "Possible improvements" section.
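For reference, the basic distance itself is only a few lines of dynamic programming; a minimal sketch:

# Levenshtein distance: minimum number of single-character edits between two strings.
def levenshtein(a, b):
    # previous_row[j] holds the distance between the processed prefix of a and b[:j].
    previous_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current_row = [i]
        for j, cb in enumerate(b, start=1):
            current_row.append(min(
                previous_row[j] + 1,               # deletion
                current_row[j - 1] + 1,            # insertion
                previous_row[j - 1] + (ca != cb),  # substitution (or match)
            ))
        previous_row = current_row
    return previous_row[-1]

print(levenshtein("haert", "heart"))   # 2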
I was pleasantly surprised that someone has asked how to create a state-of-the-art spelling suggestion system for search engines. I have been working on this subject for more than a year for a search engine company and I can point to information in the public domain on the subject.
As was mentioned in a previous post, Google (and Microsoft and Yahoo!) do not use any predefined dictionary nor do they employ hordes of linguists that ponder over the possible misspellings of queries. That would be impossible due to the scale of the problem but also because it is not clear that people could actually correctly identify when and if a query is misspelled.
Instead there is a simple and rather effective principle that is also valid for all European languages: get all the unique queries from your search logs, calculate the edit distance between all pairs of queries, and assume that the reference query is the one with the highest count.
This simple algorithm will work great for many types of queries. If you want to take it to the next level then I suggest you read the paper by Microsoft Research on that subject. You can find it here
The paper has a great introduction, but after that you will need to be familiar with concepts such as the Hidden Markov Model.
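To make the principle concrete, here is a toy brute-force sketch with made-up log counts (the paper describes far smarter candidate generation than scanning every logged query); it assumes a levenshtein(a, b) helper such as the one sketched earlier on this page.

# Toy "did you mean" from query logs: suggest the most frequent nearby query.
# Assumes a levenshtein(a, b) function like the dynamic-programming sketch above.
query_counts = {                 # hypothetical search-log counts
    "britney spears": 9200,
    "brittany spears": 110,
    "briteny spears": 40,
}

def did_you_mean(query, query_counts, max_distance=2):
    best = None
    for candidate, count in query_counts.items():
        if candidate == query:
            continue
        if levenshtein(query, candidate) <= max_distance and count > query_counts.get(query, 0):
            if best is None or count > query_counts[best]:
                best = candidate
    return best

print(did_you_mean("briteny spears", query_counts))   # britney spears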
I would suggest looking at SOUNDEX to find similar words in your database.
You can also access Google's own dictionary by using the Google API spelling suggestion request.
You may want to look at Peter Norvig's "How to Write a Spelling Corrector" article.
I believe Google logs all queries and identifies when someone makes a spelling correction. This correction may then be suggested when others supply the same first query. This will work for any language, in fact any string of any characters.
http://en.wikipedia.org/wiki/N-gram#Google_use_of_N-gram
I think this depends on how big your website is. On our local intranet, which is used by about 500 members of staff, I simply look at the search phrases that returned zero results and enter each such phrase with a new suggested search phrase into a SQL table.
I then query that table when no search results have been returned; however, this only works if the site is relatively small, and I only do it for the most common search phrases.
You might also want to look at my answer to a similar question:
"Similar Posts" like functionality using MS SQL Server?
If you have industry-specific translations, you will likely need a thesaurus. For example, I worked in the jewelry industry and there were abbreviations in our descriptions such as kt = karat, rd = round, cwt = carat weight... Endeca (the search engine at that job) has a thesaurus that will translate from common misspellings, but it does require manual intervention.
I do it with Lucene's Spell Checker.
Soundex is good for phonetic matches, but works best with people's names (it was originally developed for census data).
Also check out full-text indexing; the syntax is different from Google-style logic, but it's very quick and can deal with similar language elements.
Soundex and "Porter stemming" (soundex is trivial, not sure about porter stemming).
There's something called aspell that might help:
http://blog.evanweaver.com/files/doc/fauna/raspell/classes/Aspell.html
There's a Ruby gem for it, but I don't know how to talk to it from Python.
http://blog.evanweaver.com/files/doc/fauna/raspell/files/README.html
Here's a quote from the Ruby implementation:
Usage
Aspell lets you check words and suggest corrections. For example:
string = "my haert wil go on"
string.gsub(/[\w\']+/) do |word|
if !speller.check(word)
# word is wrong
puts "Possible correction for #{word}:"
puts speller.suggest(word).first
end
end
This outputs:
Possible correction for haert:
heart
Possible correction for wil:
Will
Implementing spelling correction for search engines in an effective way is not trivial (you can't just compute the edit/levenshtein distance to every possible word). A solution based on k-gram indexes is described in Introduction to Information Retrieval (full text available online).
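A minimal sketch of that idea, assuming nothing more than a plain word list: index each dictionary word under its character bigrams, pull candidates that share enough bigrams with the misspelled word, and only then run the expensive edit-distance check on that short list.

# k-gram (here bigram) index: cheap candidate filtering before edit distance.
from collections import defaultdict

def bigrams(word):
    padded = "$" + word + "$"            # pad so word boundaries count too
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def build_kgram_index(dictionary):
    index = defaultdict(set)
    for word in dictionary:
        for gram in bigrams(word):
            index[gram].add(word)
    return index

def candidates(query, index, min_shared=2):
    counts = defaultdict(int)
    for gram in bigrams(query):
        for word in index[gram]:
            counts[word] += 1
    # Only words sharing at least min_shared bigrams go on to the expensive check.
    return [w for w, c in counts.items() if c >= min_shared]

index = build_kgram_index(["heart", "hearth", "earth", "hurt", "art"])
print(candidates("haert", index))   # e.g. ['heart', 'hearth', 'hurt', 'art']; rank these by edit distance next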
You could use n-grams for the comparison: http://en.wikipedia.org/wiki/N-gram
Using the Python ngram module: http://packages.python.org/ngram/index.html
import ngram

G2 = ngram.NGram(["iis7 configure ftp 7.5",
                  "ubunto configre 8.5",
                  "mac configure ftp"])

print "Similarity", "\t", "String"
for i in G2.search("iis7 configurftp 7.5", threshold=0.1):
    print i[1], "\t", i[0]
You get:
>>>
Similarity      String
0.76 "iis7 configure ftp 7.5"
0.24 "mac configure ftp"
0.19 "ubunto configre 8.5"
Why not use Google's "did you mean" in your code? For how, see here:
http://narenonit.blogspot.com/2012/08/trick-for-using-googles-did-you-mean.html

Resources