Text recommendation with Lucene/Solr/Mahout

I'm working on a project where I need to implement an article/news recommendation engine.
I'm thinking of combining different methods (item-based, user-based, and model-based CF) and have a question about which tool to use.
From my research, Lucene is definitely the tool for text processing, but for the recommendation part it's not so clear.
If I want to implement item-based CF on articles using text similarity:
- I've seen case studies using Mahout but also Solr (http://fr.slideshare.net/lucenerevolution/building-a-realtime-solrpowered-recommendation-engine). As it's really close to a search problem, I would think that Solr may be the better fit. Am I right?
- What are the differences in processing time between the two tools (I think Mahout is more batch-oriented and Solr more real-time)?
- Can I get a text distance directly from Lucene (it's not really clear to me what Solr adds on top of Lucene)?
- For more advanced methods (model-based, using matrix factorization), I would use Mahout, but is there any SVD-like function in Solr for concept/tag discovery?
Thanks for your help.

It depends on your requirements. If you only need offline recommendations, Mahout is good. For online recommendations, I am still testing it myself. I have tested Lucene and Mahout together, and they work fine in combination. As for Solr, I'm not so sure; all I know is that it uses Lucene as its core, so all the heavy lifting is still done by Lucene. In my case, I combined Mahout and Lucene in my Java program: Lucene does the preprocessing and primitive similarity calculations, and the result is then sent to Mahout for further analysis.
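To make the "primitive similarity calculation" step more concrete, here is a minimal sketch of item-to-item text similarity using TF-IDF weights and cosine similarity. It is written in plain Python rather than against Lucene's API, the smoothed IDF formula is an assumption and not Lucene's actual scoring, and the names `tokenize`, `tfidf_vectors`, `cosine`, and `most_similar` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-alphanumeric characters.
    # This is a crude stand-in for a real analyzer (no stemming, no stop words).
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    # docs: dict of doc_id -> raw text. Returns doc_id -> {term: weight}.
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    df = Counter()
    for terms in tokens.values():
        df.update(set(terms))  # document frequency: count each term once per doc
    n = len(docs)
    vectors = {}
    for doc_id, terms in tokens.items():
        tf = Counter(terms)
        # Smoothed IDF (an assumption): terms in every document keep a small weight.
        vectors[doc_id] = {
            t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()
        }
    return vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(doc_id, vectors, k=3):
    # Rank every other document by similarity to doc_id, best first.
    scores = [
        (other, cosine(vectors[doc_id], v))
        for other, v in vectors.items()
        if other != doc_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]
```

The resulting similarity scores are the kind of output that could then be handed to a separate recommender stage for further analysis.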

Related

How is spatial search implemented at the code level in Solr or Elasticsearch

Since Lucene is developed purely in Java, can I find out how spatial search is implemented in Solr or Elasticsearch, given that they both use Lucene?
While this is a very broad question, a good place to start is the GitHub repo for Lucene and Solr. Searching for Solr Spatial will give you the code for Solr's interface to the Lucene Spatial functionality, and gives you the names of the relevant classes (and, by looking at the imports, what the important parts of the Lucene code base are).
After digging through a bit of code, AbstractSpatialFieldType, which backs fields defined as spatial field types in Solr, seems to be a good place to dig further into the Solr implementation.
In addition, I can recommend looking up the spatial talks from previous years of Lucene/Solr Revolution, where you should be able to find down-to-the-metal talks about the implementation (and how it has evolved over the last few years). David Smiley has been heading the implementation on the Solr side (which also includes a lot of the Lucene side, as far as I understand).

Hybrid recommender in Spark

I am trying to build a hybrid recommender using prediction.io, which functions as a layer on top of Spark/MLlib.
I'm looking for a way to incorporate a boost based on tags in the ALS algorithm when doing a recommendation request.
Using content information to improve collaborative filtering seems like such a common path, yet I cannot find any documentation on combining a collaborative algorithm (e.g. ALS) with a content-based measure.
Any examples or documentation on incorporating content similarity with collaborative filtering for either MLlib (Spark) or Mahout (Hadoop) would be greatly appreciated.
This PredictionIO template uses Mahout's Spark version of Correlators, so it can make use of multiple actions to recommend to users or find similar items. It allows you to include multiple categorical, tag-like content fields to boost or filter recs.
http://templates.prediction.io/PredictionIO/template-scala-parallel-universal-recommendation
The v0.2.0 branch also has date range filtering, and popular item backfill is in development.
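One simple way to picture a tag-based boost is as a re-ranking step applied after the collaborative model has produced its scores. The sketch below is only an illustration under that assumption: it is not how the template above or MLlib's ALS works internally, and `jaccard`, `rerank_with_tag_boost`, and the `boost` parameter are hypothetical names. The multiplicative boost also assumes non-negative collaborative scores.

```python
def jaccard(a, b):
    # Jaccard overlap between two tag collections, in [0, 1].
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rerank_with_tag_boost(cf_scores, item_tags, preferred_tags, boost=0.5, k=10):
    # cf_scores: item_id -> score from a collaborative model (e.g. ALS predictions).
    # item_tags: item_id -> iterable of tags describing the item.
    # preferred_tags: tags to favor for this request.
    boosted = {}
    for item, score in cf_scores.items():
        overlap = jaccard(item_tags.get(item, ()), preferred_tags)
        # Assumption: scores are non-negative, so multiplying never flips the order
        # of two items with equal tag overlap.
        boosted[item] = score * (1.0 + boost * overlap)
    return sorted(boosted.items(), key=lambda pair: pair[1], reverse=True)[:k]
```

With `boost=0.0` this returns the plain collaborative ranking, so the parameter can be tuned to control how strongly content overrides the collaborative signal.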

MongoDB approximate string matching

I am trying to implement a search engine for my recipes website using MongoDB.
I want to display search suggestions to users in a type-ahead widget.
I also want to support misspelled queries (Levenshtein distance).
For example, whenever a user types 'pza', the type-ahead should display 'pizza' as one of the suggestions.
How can I implement such functionality using MongoDB?
Please note that the search should be instantaneous, since the results will be fetched by the type-ahead widget. The collections I would query have at most 1 million entries.
I thought of implementing the Levenshtein distance algorithm myself, but this would slow down performance, as the collection is huge.
I read that FTS (full-text search) in MongoDB 2.6 is quite stable now, but my requirement is approximate matching, not FTS. FTS won't return 'pizza' for 'pza'.
Please recommend an efficient approach.
I am using the Node.js MongoDB native driver.
The text search feature in MongoDB (as of 2.6) does not have any built-in features for fuzzy/partial string matching. As you've noted, the use case currently focuses on language & stemming support with basic boolean operators and word/phrase matching.
There are several possible approaches to consider for fuzzy matching depending on your requirements and how you want to qualify "efficient" (speed, storage, developer time, infrastructure required, etc):
- Implement support for fuzzy/partial matching in your application logic using some of the readily available soundalike and similarity algorithms. Benefits of this approach include not having to add any extra infrastructure and being able to closely tune matching to your requirements.
  For some more detailed examples, see: Efficient Techniques for Fuzzy and Partial matching in MongoDB.
- Integrate with an external search tool that provides more advanced search features. This adds some complexity to your deployment and is likely overkill just for typeahead, but you may find other search features you would like to incorporate elsewhere in your application (e.g. "like this", word proximity, faceted search, ..).
  For example see: How to Perform Fuzzy-Matching with Mongo Connector and Elastic Search. Note: Elasticsearch's fuzzy query is based on Levenshtein distance.
- Use an autocomplete library like Twitter's open source typeahead.js, which includes a suggestion engine and query/caching API. Typeahead is actually complementary to any of the other backend approaches, and its (optional) suggestion engine Bloodhound supports prefetching as well as caching data in local storage.
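For the application-logic approach, the core of a Levenshtein-based suggester can be sketched as follows. This is only an illustration in Python (the question uses Node.js), the names `levenshtein` and `suggest` are hypothetical, and it assumes the candidate list has already been narrowed down (for example by a prefix or n-gram lookup), since scanning a million entries on every keystroke would likely be too slow.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, keeping only two rows in memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def suggest(query, candidates, max_distance=2, limit=5):
    # Return up to `limit` candidates within `max_distance` edits, closest first.
    query = query.lower()
    scored = []
    for word in candidates:
        # Cheap pre-filter: the distance is at least the difference in length.
        if abs(len(word) - len(query)) > max_distance:
            continue
        distance = levenshtein(query, word.lower())
        if distance <= max_distance:
            scored.append((distance, word))
    scored.sort()
    return [word for _, word in scored[:limit]]
```

Here 'pza' is two insertions away from 'pizza', so it is returned with the default `max_distance=2`; a tighter threshold gives fewer but more precise suggestions.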
The best option here would be the Elasticsearch fuzzy query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html
It supports the Levenshtein distance algorithm out of the box and has additional features that can be useful for your requirements, e.g.:
- more like this
- powerful facets / aggregations
- autocomplete

Are there any technologies that help develop website search?

PROBLEM:
I need to write an advanced search functionality for a website. All the data is stored in MySQL and I'm using Zend Framework on top. I know that I can write a script that takes the search page and builds an SQL query out of it, but this becomes extremely slow if there's a lot of hits. Then I would have to get down to the gritty details of optimizing the database tables/fields/etc. which I'm trying to avoid if possible.
Lucene: I gave Lucene a try, but since it's a full-text search engine, it does not allow any mathematical operators!! So if I wanted to get all the records where field_x > 5, there is no way to do it (correct?)
General Practice? I would like to know how large sites deal with this dilemma. Is there a standard way of doing this that I don't know about, or does everyone have to deal with the nasty details of optimizing the database at some point? I was hoping that some fast indexing/searching technology existed (e.g. Lucene) that would address this problem.
ANY OTHER COMMENTS OR SUGGESTIONS ARE MOST WELCOME!!
Thanks a lot guys!
Ali
You can use Zend Lucene for textual search, and combine it with MySQL for joins.
Please see Mark Krellenstein's Search Engine vs DBMS paper about the choice. Basically, search engines are better for ranked text search, while databases are better for more complex data manipulations, such as joins, using different record structures.
For a simple x>5 type query, you can use a range query inside Lucene.
Use Lucene for your text-based searches, and use SQL for field_x > 5 searches. I say this because text-based search is hard to get right, and you're probably better off leaving that to an expert.
If you need your users to have the capability of building mathematical expression searches, consider writing an expression builder dialog like this example to collect the search phrase. Then use a parameterized SQL query to execute the search.
SqlWhereBuilder ASP.NET Server Control
http://www.codeproject.com/KB/custom-controls/SqlWhereBuilder.aspx
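The parameterized-query step could look roughly like the sketch below. It uses Python's built-in `sqlite3` module as a stand-in for MySQL, and the `records` table, the whitelists, and the `build_where`/`search` helpers are all hypothetical. Field names and operators are checked against a whitelist because they cannot be bound as parameters; only the values are. MySQL drivers for Python typically use `%s` as the placeholder instead of `?`.

```python
import sqlite3

# Hypothetical whitelists: only these names may appear in the generated SQL text.
ALLOWED_FIELDS = {"field_x"}
ALLOWED_OPERATORS = {"=", "<", "<=", ">", ">="}


def build_where(conditions):
    # conditions: list of (field, operator, value) tuples collected from the user.
    # Returns a WHERE clause with placeholders plus the list of bound values.
    clauses, params = [], []
    for field, op, value in conditions:
        if field not in ALLOWED_FIELDS or op not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported condition: {field!r} {op!r}")
        clauses.append(f"{field} {op} ?")
        params.append(value)
    # With no conditions, fall back to a clause that matches every row.
    return " AND ".join(clauses) or "1=1", params


def search(conn, conditions):
    # The clause contains only whitelisted text; user values are passed separately.
    where, params = build_where(conditions)
    sql = f"SELECT id, field_x FROM records WHERE {where} ORDER BY id"
    return conn.execute(sql, params).fetchall()
```

For example, `search(conn, [("field_x", ">", 5)])` runs `SELECT id, field_x FROM records WHERE field_x > ? ORDER BY id` with `5` bound as the parameter.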
You can use filters in Lucene to run a text search over a reduced set of records. Query the database first to get all records where field_x > 5, build a filter from them (a list of Lucene document IDs), and pass it into the Lucene search method along with the text query. I'm just learning about this myself; here's a link to a question I asked (it uses Lucene.Net and C#, but it may help). Ignore my question and just check out the accepted answer:
How do you implement a custom filter with Lucene.net?

What is the best search approach?

I'm using Lucene in my project.
Here is my question:
Should I use Lucene to replace the whole search module, which is currently implemented in SQL using a large number of LIKE statements plus exact lookups by ID and the like,
or should I use Lucene only for fuzzy search (I mean full-text search)?
You should probably use Lucene, unless the SQL search already performs very well.
We are moving to Solr (based on Lucene) right now because our search queries are inherently slow and cannot be sped up within our database. If you have reasonably large tables, your search queries will start to get really slow unless the DB has some kind of highly optimized free-text search mechanism.
So let Lucene do what it does best.
I don't think overusing LIKE statements is a good idea.
And I believe Lucene will perform better than the database.
I'm actually very impressed by Solr. At work we were looking for a replacement for our Google Mini (it's woefully inadequate for any serious site search) and were expecting something that would take a while to implement. Within 30 minutes of installing Solr, we had done what we had expected to take at least a few days, and it gave us a far more powerful search interface than we had before.
You could probably use Solr to do quite a lot of clever things beyond a simple site search.

Resources