SOLR/Lucene weighting by user-centric criteria - search

We are switching from SQL Fulltext Search to Lucene (SOLR stack) search in the next few months. One last wrinkle in figuring out our strategy here has to with replicating one current part of our search platform.
First, some nomenclature to describe the problem: Our site has a bunch of documents. People might "add" those documents, they might "favorite" those documents, they might "read" those documents, etc. Let's call that union of such documents for a given user their "personal documents". Some documents are public, and some are private so that only the logged-in-user can see them.
Currently, we have a weighting function that will always show a given user's "personal" documents FIRST in the search list, for any search. This outranks the normal order (but a document must be valid in the result set -- it just ranks above any other less important document). In SQL, we are able to achieve this by having a user-defined-function that returns a score, and it varies by user.
An analogy is Facebook -- where, when you type "Joe", it will first find all the Joes that you know, followed by any other Joe that meets the criteria. My search for "Joe" will return a different ordered set than your search for Joe.
In the world of Lucene/SOLR, as I understand it, I cannot figure out how to have such user-centric weighting of documents without two separate queries that are then effectively UNIONed together (I know, it's not relational, but you get the idea). We have millions of users, and hundreds of thousands of documents. If a user is logged in, we want "their documents" to show up first in any search, then the rest of all documents. And in each case, we want the search results to show only those documents that match the original search -- we're just talking about rank-order.
Can you think of any strategies here to reproduce this user-defined-function feature?

Can you afford to have a field in each document telling this particular document belongs to Jim (e.g. user123Doc:1)? If yes, you could solve it by sorting the result set by {user123Doc, score, ...}.
Or, if you don't want to store this information in Lucene, you can store this elsewhere (e.g. in the database) and implement FieldComparator so it works with these values. More on this is available here.

Related

What indexer do I use to find the list in the collection that is most similar to my list?

Lets say I have my list of ingredients:
{'potato','rice','carrot','corn'}
and I want to return lists from a database that are most similar to mine:
{'beans','potato','oranges','lettuce'},
{'carrot','rice','corn','apple'}
{'onion','garlic','radish','eggs'}
My query would return this first:
{'carrot','rice','corn','apple'}
I've used Solr, and have looked at CloudSearch, ElasticSearch, Algolia, Searchify and Swiftype. These engines only seem to let me put in one query string and then filter by other facets.
In a real scenario my search list will be about 200 items long and will be matching against about a million lists in my database.
What technology should I use to accomplish what I want to do?
Should I look away from search indexers and more towards database-esque things like mongo, map reduce, hadoop... All I know are the names of other technologies and I just need someone to point me in the right direction on what technology path I should be exploring for this.
With so much data I can't really loop through it, I need to query everything at once.
I wonder what keeps you from trying it with Solr, as Solr provides much of what you need. You can declare the field as type="string" multiValued="true and save each list item as a value. Then, when querying, you specify each of the items in the list to look for as a search term for that field, and Solr will – by default – return the closest match.
If you need exact control over what will be regarded as a match (e.g. at least 40% of the terms from the search list have to be in a matching list) you can use the mm EDisMax parameter, cf. Solr Wiki
Having said that, I must add that I’ve never searched for 200 query terms (do I unerstand correctly that the list whose contents should be searched will contain about 200 items?) and do not know how well that performs. But I guess that setting up a test core and filling it with random lists using a script should not take more than a few hours, so it should be possible to evaluate the performance of this approach without investing too much time.

Personalized Search Results for Elasticsearch

How would one go about setting up Elasticsearch so that it returns personalized results?
For example, I would want results returned to a particular user to rank higher if they clicked on a result previously, or if they "starred" that result in the past. You could also have a "hide" option that pushes results further down the ranking. From what I've seen with elasticsearch so far, it seems difficult to return different rankings to users based on that user's own dynamic data.
The solution would have to scale to thousands of users doing a dozen or so searches per day. Ideally, I would like the ranking to change in real-time, but it's not critical.
Elasticsearch provides a wide variety of scoring options , but then to achieve what you have told you will need to do some additional tasks.
Function score query and document terms lookup terms filter would be our tools of our choice
First create a document per user , telling the links or link ID he visited and the links he has liked. This should be housed separately as separate index. And this should be maintained by the user , as he should update and maintain this record from client side.
Now when a user hits the data index, do a function score query with filter function pointing to this fields.
In this approach , as the filter is cached , you should get decent performance too.

showing search results more efficiently?

I want to implement the auto-complete feature provided by various e-commerce stores. Functionality is pretty simple, when you type some characters, it start showing relevant suggestions.
I implemented it using solr (django-haystack), using the autocomplete method provided in haystack.query.SearchQuerySet . Basically, i get a list of results sorted by the score. Showing top n results as suggestions.
Solr document contains $product_name, $category_name and other fields. So the results which i generated looks like list of " in ".
Problem arise when i change the category name. If i change the category name, i have to update all the product belong to that particular category to reflect these changes in the auto-complete (update all documents in solr for products of this category).
Another way to do this is by putting just the id of the categories with product in the solr document. In that case, I have do look-up for category name each time, and this is not efficient.
Is there any other efficient way to do this?
Since you are changing the underlying data, the same has to be propogated to SOLR.
There are different approaches to do this:
Update the database, and reindex - Pros: Simple enough, Cons: Indexing time can be large.
Update database and Solr in tandem - Pros: Quick updates, almost instantaneous, Cons: Can lead to data inconsistency (if one update fails)
Update database, and schedule a delta-import in Solr. This is like a middle ground between the two above.
I would recommend the 3rd approach, but this would require some upfront schema design. Read more about delta import here, in context of DataImportHandler.

How to find all documents in a database created by many different users?

I need to find all documents in a database which have been created by x number of users and the result must be a combined (sorted by date) list/collection of documents from all these users.
so I have a multivalue field that contain e.g 100 users, I now need to return a collection programmatically with all the documents in the database that has been created with those users.
worth mentioning here is that the 100 users is dynamic, so in another document there might be 100 different users that is the be used for the search.
I have experimented with the following kind of query, but I believe I run into some kind of limit in the search query. (looks like if the query is to long I get query is not understandable)
FIELD CreatedBy Contains "Thomas" OR FIELD CreatedBy Contains "Peter".... up to 100 more like this
also, finding these documents is triggered by webusers so it must be relative fast.
is there another way to find these documents?
Thanks
Thomas
Your best bet might be a view sorted by the creator and a loop with 100 getAllDocumentsByKey. You could run a comparison if you sort the entry field and just walk through the viewnavigator after jumping to the first document.
The result could be added to a folder (you need to manage folders for different query sets and have a cleanup routine) that is sorted how you need it or you sort in memory (think JavaBean). The folder option makes paging to the result easier - you probably don't want to show 10000 results in one go - that would take very long to transmit. You also could use a "result document" where you store the result as JSON and use that as paging source.
Easily done with Ytria's ScanEZ. They have trials, I think

Why should (or shouldn't) a Search Query return back only document IDs?

So for a new project, I'm building a system for an ecommerce site. The idea is to import products from suppliers and instead of inserting them directly into our catalog, we would store all the information in a staging area. Each supplier has their own stage (i.e. table in the database), and then I will flatten the multiple staging areas into a single entity (currently a single table but later on perhaps into Sphinx or Solr). Then our merchandisers would be able to search the staging products' relevant fields (name and description) and be shown a list of products that match and then choose to have those products pushed into the live catalog. The search will query on the single table (the flattened staging areas).
My design calls to only store searchable and filterable fields in the single flattened table - e.g. name, description, supplier_id, supplier_prod_id etc. And the search queries will return only the ID's of the items matching and a class (supplier_id) that would be used to identify which staging area the product is from.
Another senior engineer feels the flattened search table should include other meta fields (which would not be searched on), but could be used when 'pushing' the products from stage to live catalog. He also feels that the query should return all this other information.
I feel pretty strongly about only having searchable fields in the flattened table and having the search return only class/id pairs which could be used to fetch all the other necessary metadata about the product (simple select * from class_table where id in (1,2,3)).
Part of my reasoning is that this will make it easier later on to switch the flattened table from database to a search server like sphinx or solr and the rest of the code wouldn't have to be changed just because implementation of the search changed.
Am I on the right path? How can I convince the other engineer why it is important to keep only searchable fields and return only ID's? Or more specifically, why should a search application return only IDs of objects?
I think that you're on the right path. If those other fields provide no value to either uniquely identify a staged item or to allow the user to filter staged items, then the data is fundamentally useless until the item is pushed to the live environment. If the other engineer feels that the extra metadata will help the users make a more informed decision, then you might as well make those extra fields searchable (thereby meeting your stated purpose for the table(s).)
The only reason I could think of to pre-fetch that other, non-searchable data would be for a performance improvement on the push to the live environment.
You should use each tool for what it does best. A full text search engine, such as Solr or Sphinx, excels at searching textual fields and ranking the hits quickly. It has no special advantage in retrieving stored data in a select-like fashion. A database is optimized for that. So, yes, you are on the right path. Please see Search Engine versus DBMS for other issues involved in deciding what to store inside the search engine.
In the case of sphinx, it only returns document ids and named attributes back to you anyway (attributes being numerical data, for the most part). I'd say you've got the right idea as the other metadata is just a simple JOIN away from the flattened table if you need it.
You can regard Solr as a powerfull index, so as an index gives IDs back, it would be logical that solr does the same.
You can use the solr query parameter fl to ask for identifier only results, for instance fl=id.
However, there's a feature that needs solr to give you back some data too: the highlighting of search terms in the matched documents. If you don't need it, then using solr to retrieve the identifiers only is fine (I assume you need only the documents list, and no other features, like facets, related docs or spell checking).
That said, it should matter how you build your objects in your search function, either from the DB using uniquely solr to retrieve IDs or from solr returned fields (providing they're stored) or even a mix of both. Think solr to get the 'highlighted' content fields and DB for the other ones. Again if you don't need highlighting, this is not an issue.
I'm using Solr with thousands of documents but only return the ids for the following reasons :
For Solr :
- if some sync mistake append, it's not a big deal (especially in your case, displaying a different price can be a big issue... it's like the item will not be in the right place, but the data are right)
- you will save a lot of time because when you don't ask Solr to return the 'description' of documents (I mean many lines of text)
For your DB :
- you can cache your results, so it's even faster with an ID (you don't need all the data from Solr everytime !!!)
- you build you results in the same way (you don't need a specific method when you want to build html from Solr, and an other method from your DB)
I think there is a lot more...

Resources