Google-like search mechanism with NLP

I want to build a search mechanism similar to Google's using NLP (Natural Language Processing) in Java. The algorithm should be able to give auto-suggestions, spell-check, get the meaning out of a sentence, and display the top relevant results.
For example, if I typed "laptop", relevant results should be shown: ["laptop bags", "laptop deals", "laptop prices", "laptop services", "laptop tablet"].
Is it possible to achieve this with NLP and semantics? I would appreciate any reference links or ideas on how to achieve it.

"Get the meaning out of a sentence" - that's really difficult task. I don't believe even google does that in their search engine;) When talking about searching getting the meaning of query is not that important...but it really depends on what do you mean by "get the meaning", anyway you always can buy yourself something like "Google Search Appliance" - its a private google search box.
All the other requirements are quite straightforward. I'm from Java land, so I'd suggest you look at the following (a small auto-suggest sketch follows the list):
Apache Lucene - if you are a developer, it's an indexing library built around full-text search.
Elasticsearch - a full-blown, fast, scalable server built around Lucene that can do most of what you are asking.
Solr - another one, roughly equal to Elastic in terms of functionality, IMHO.
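
To make the auto-suggest piece concrete, here is a minimal sketch using Lucene's suggest module. Treat it as an illustration under assumptions: it assumes a recent Lucene version (5+), and the phrase list is made up - in a real system the candidates would come from query logs or indexed titles.

    import java.io.StringReader;
    import java.nio.file.Paths;
    import java.util.List;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.search.spell.PlainTextDictionary;
    import org.apache.lucene.search.suggest.Lookup;
    import org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester;
    import org.apache.lucene.store.FSDirectory;

    public class SuggestDemo {
        public static void main(String[] args) throws Exception {
            // One candidate phrase per line (made up for the example).
            String phrases = "laptop bags\nlaptop deals\nlaptop prices\n"
                    + "laptop services\nlaptop tablet\n";

            AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(
                    FSDirectory.open(Paths.get("suggest-index")),
                    new StandardAnalyzer());
            suggester.build(new PlainTextDictionary(new StringReader(phrases)));

            // Top 5 completions for what the user has typed so far.
            List<Lookup.LookupResult> results = suggester.lookup("laptop", false, 5);
            for (Lookup.LookupResult r : results) {
                System.out.println(r.key);
            }
            suggester.close();
        }
    }

Elasticsearch and Solr expose the same idea as ready-made completion suggesters, so you get this without writing the plumbing yourself.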

Related

Search feature on website

I am interested in implementing a search feature on a website. It is a location search, so address/state/zip should all work. It should then show results in that area and allow them to be filtered.
My question is:
What's the best approach for something like this?
There are literally dozens of ways of doing this (if not more). The exact implementation will depend on the technology stack you use, but as a very top-level overview:
you'd need to store the things you are searching for somewhere, and tag them with a lat/long location. Often, this would be in a database of some kind.
using a programming language, you would need to write a search that accepts a postcode, translates it to a lat/long, and then searches the things in your database by the distance between each thing's location and the location entered in the search.
if you want to support filtering, your search would need to support that too. This is often called "faceting" the search.
Working out the lat/long locations needs to be done using a geolocation service; some, such as PostCode Anywhere, do this as a paid service, and others are free (within reason), such as the Google Maps APIs.
There are probably some hosted services that will do what you want, you'd have to shop around.
Examples of search software that supports geolocation searching out of the box are things like Solr, Azure Search, Lucene and Elastic.
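
As one concrete illustration, Lucene supports distance queries out of the box via LatLonPoint. A minimal sketch, assuming a recent Lucene version; the field names, coordinates and radius are made up for the example:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.LatLonPoint;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class GeoSearchDemo {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();
            try (IndexWriter writer =
                    new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new StoredField("name", "Central Cafe"));
                // Indexed lat/long point, used by the distance query below.
                doc.add(new LatLonPoint("location", 51.5074, -0.1278));
                writer.addDocument(doc);
            }

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // Everything within 10 km of the point the search resolved to.
                Query q = LatLonPoint.newDistanceQuery("location", 51.5, -0.12, 10_000);
                for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("name"));
                }
            }
        }
    }

The postcode-to-lat/long step would happen before this, via one of the geocoding services mentioned above.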

Web Crawling and Pagerank

I'm a computer science student and I am a bit inexperienced when it comes to web crawling and building search engines. At this time, I am using the latest version of Open Search Server and am crawling several thousand domains. When using the built-in search engine creation tool, I get search results that are related to my query, but they are ranked using a vector space model over the documents as opposed to the PageRank algorithm or something similar. As a result, the top results are only marginally helpful, whereas higher-quality results from sites such as Wikipedia are buried on the second page.
Is there some way to run a crude PageRank algorithm in Open Search Server? If not, is there a similarly easy-to-use open source package that does this?
Thanks for the help! This is my first time doing anything like this so any feedback is greatly appreciated.
I am not familiar with OpenSearchServer, but I know that most students working on search engines use Lucene or Indri. Reading papers on novel approaches to document search, you will find that the majority use one of these two APIs. Lucene is more flexible than Indri in terms of defining different ranking algorithms. I suggest taking a look at these two to see if they suit your purpose.
As you mention, the web crawl template of OpenSearchServer uses a search query with relevancy based on the vector space model. But if you use the latest version (v1.5.11), it also mixes in the number of backlinks.
You may change the weight of the backlink-based score; by default it is set to 1.
We are currently working on providing more control over relevance. This will be visible in future versions of OpenSearchServer.
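
If you do end up experimenting with your own link-based ranking, the core PageRank computation is simple to prototype with power iteration. A toy sketch; the link graph here is made up, and a real crawl would also need to handle dangling pages (no out-links):

    import java.util.Arrays;

    public class PageRankDemo {
        public static void main(String[] args) {
            // Toy link graph: outLinks[i] lists the pages that page i links to.
            int[][] outLinks = { {1, 2}, {2}, {0}, {0, 2} };
            int n = outLinks.length;
            double damping = 0.85;
            double[] rank = new double[n];
            Arrays.fill(rank, 1.0 / n);

            for (int iter = 0; iter < 50; iter++) {
                double[] next = new double[n];
                // Teleportation term: (1 - d) / n for every page.
                Arrays.fill(next, (1 - damping) / n);
                // Each page shares its current rank across its out-links.
                for (int i = 0; i < n; i++) {
                    double share = damping * rank[i] / outLinks[i].length;
                    for (int target : outLinks[i]) {
                        next[target] += share;
                    }
                }
                rank = next;
            }
            System.out.println(Arrays.toString(rank));
        }
    }

The final scores would then be combined with the text relevancy score, which is roughly what the backlink weighting described above is doing.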

What algorithm does Freebase use to match by name?

I'm trying to build a local version of the Freebase search API using their quad dumps. I'm wondering what algorithm they use to match names. As an example, if you go to freebase.com and type in "Hiking", you get:
"Apo Hiking Society"
"Hiking"
"Hiking Georgia"
"Hiking Virginia's national forests"
"Hiking trail"
Wow, a lot of guesses! I hope I don't muddy the waters too much by not guessing too.
The auto-complete box is basically powered by Freebase Suggest which is powered, in turn, by the Freebase Search service. Strings which are indexed by the search service for matching include: 1) the name, 2) all aliases in the given language, 3) link anchor text from the associated Wikipedia articles and 4) identifiers (called keys by Freebase), which includes things like Wikipedia article titles (and redirects).
How the various things are weighted/boosted hasn't been disclosed, but you can get a feel for things by playing with it for a while. As you can see from the API, there's also the ability to do filtering/weighting by types and other criteria, and this can come into play depending on the context. For example, if you're adding a record label to an album, topics which are typed as record labels will get a boost relative to things which aren't (but you can still get to things of other types, to allow for the use case where your target topic hasn't had the appropriate type applied yet).
So that gives you a little insight into how their service works, but why not build a search service that does what you need since you're starting from scratch anyway?
BTW, pre-Google, the Metaweb search implementation was built on top of Lucene, so you could definitely do worse than using that as your starting point. You can read some of the details in the mailing list archive.
Probably they use an inverted index over selected fields, such as the English name, aliases, and the displayed Wikipedia snippet. In your application you can achieve that using something like Lucene.
For the algorithm side, I find the following paper a good overview:
Zobel and Moffat (2006): "Inverted Files for Text Search Engines".
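
To make the inverted-index idea concrete, here is a toy version; real engines like Lucene add compression, ranking and much more, but the core mapping from term to posting list looks like this (the documents are the example results from the question):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class InvertedIndexDemo {
        public static void main(String[] args) {
            String[] docs = { "Apo Hiking Society", "Hiking",
                    "Hiking Georgia", "Hiking trail" };

            // term -> ids of the documents containing that term
            Map<String, List<Integer>> index = new HashMap<>();
            for (int id = 0; id < docs.length; id++) {
                for (String term : docs[id].toLowerCase().split("\\s+")) {
                    index.computeIfAbsent(term, k -> new ArrayList<>()).add(id);
                }
            }

            // Every document containing the query term "hiking".
            for (int id : index.getOrDefault("hiking", List.of())) {
                System.out.println(docs[id]);
            }
        }
    }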
Most likely it's a trie with lexicographical order.
There are a number of algorithms available: Boyer-Moore, Smith-Waterman-Gotoh, Knuth-Morris-Pratt, etc. You might also want to read up on edit distance algorithms such as Levenshtein. You will need to play around to see which best suits your purpose.
An implementation of such algorithms is the Simmetrics library by the University of Sheffield.
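
For the edit-distance route, the classic Levenshtein dynamic program is only a few lines; a minimal sketch (libraries like Simmetrics wrap this and many variants):

    public class EditDistance {
        // dist[i][j] = edits needed to turn a[0..i) into b[0..j).
        static int levenshtein(String a, String b) {
            int[][] dist = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) dist[i][0] = i;
            for (int j = 0; j <= b.length(); j++) dist[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    dist[i][j] = Math.min(
                            dist[i - 1][j - 1] + sub,          // substitute
                            Math.min(dist[i - 1][j] + 1,       // delete
                                     dist[i][j - 1] + 1));     // insert
                }
            }
            return dist[a.length()][b.length()];
        }

        public static void main(String[] args) {
            System.out.println(levenshtein("shoo", "shoe")); // prints 1
        }
    }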

Search engine string matching

What is the typical algorithm used by online search engines to make suggestions for misspelled words? I'm not necessarily talking about Google, but any site with a search feature, such as Amazon.com for instance. Say I search for the word "shoo"; the site will come back and say "did you mean: shoe".
Is this some variation of the Levenshtein distance algorithm? Perhaps, if they are using some full-text search framework (like Lucene, for instance), this is built in? Maybe it's fully custom?
I know the answer varies a lot, I'm just looking for an indication on how to get started with this (in an enterprise environment).
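
On the Lucene point: yes, a spell-checking module ships with Lucene, and its default string distance is Levenshtein-based. A minimal sketch, assuming a recent Lucene version; the dictionary contents are made up, and in practice you would build the dictionary from your own index or query logs:

    import java.io.StringReader;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.spell.PlainTextDictionary;
    import org.apache.lucene.search.spell.SpellChecker;
    import org.apache.lucene.store.ByteBuffersDirectory;

    public class DidYouMean {
        public static void main(String[] args) throws Exception {
            // Known-good terms, one per line (made up for the example).
            String dictionary = "shoe\nshirt\nshorts\nsocks\n";

            SpellChecker spell = new SpellChecker(new ByteBuffersDirectory());
            spell.indexDictionary(
                    new PlainTextDictionary(new StringReader(dictionary)),
                    new IndexWriterConfig(new StandardAnalyzer()),
                    true);

            // Closest dictionary terms to the misspelled query.
            String[] suggestions = spell.suggestSimilar("shoo", 3);
            System.out.println(String.join(", ", suggestions)); // shoe, ...
            spell.close();
        }
    }

Solr and Elasticsearch expose the same functionality as spellcheck/suggest components, which is the usual route in an enterprise setting.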

Open-source full-text article recommendation engines

I'm wondering if there are any good .NET recommendation algorithms available in open source projects, whether attached to a search engine or not. By recommendation I mean something that accepts a full-text article and recommends other articles from its index based on keyword similarity.
At the high end there are document classification engines like Autonomy; at the low-end spam filters and blog "related posts" widgets. Possibly advertisement-to-article matching, too. I'd like to incorporate one into a project but can't afford the high end and the low end seems to all be LAMP-based.
[Sorry, one answer asked for clarification: What I'm looking for is ideally a standalone library, but I'm willing to adapt good source code as necessary. The end result is that I need to be able to create a C# service that accepts an arbitrary amount of text and returns a list of similar previously-indexed articles. Basically, the exact thing that StackOverflow itself does as you are submitting a question!]
Thanks!
Steve
I think that StackOverflow strips all common English words from the text and then compares the remaining words with those of other posts to get the "Related" posts.
The question is not very clear (algorithm or library???), but the only thing that comes to mind is Lucene.NET, the port of the popular Lucene library to the .NET framework. HTH.
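
Lucene (and therefore Lucene.NET, which mirrors the Java API closely) ships a MoreLikeThis component that does almost exactly this: it extracts the characteristic terms from a piece of text and turns them into a query against the index. A minimal Java sketch; the index path and the "title"/"body" field names are assumptions for illustration:

    import java.io.StringReader;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queries.mlt.MoreLikeThis;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.FSDirectory;

    public class RelatedArticles {
        public static void main(String[] args) throws Exception {
            // Assumes an existing index with articles in a "body" field.
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("article-index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);

                MoreLikeThis mlt = new MoreLikeThis(reader);
                mlt.setAnalyzer(new StandardAnalyzer());
                mlt.setFieldNames(new String[] { "body" });
                mlt.setMinTermFreq(1); // loosen defaults for small corpora
                mlt.setMinDocFreq(1);

                String incoming = "Full text of the article to match...";
                Query query = mlt.like("body", new StringReader(incoming));
                for (ScoreDoc hit : searcher.search(query, 5).scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("title"));
                }
            }
        }
    }

That is essentially the "related posts" behaviour described in the question, driven by keyword similarity.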
