Solr query results with mixed types

Solr query results with mixed types - search

We are using Solr 6.6 and have documents in our index with 2 different types: book and movie. I'd like to run a search that would return 10 results per page. The first 3 results should be matching book docs and the last 7 results should be matching movie docs. Then on page 2 it would have 3 more book docs with 7 movie docs.
Does anyone know if something like this is possible with a Solr query? I'm trying to avoid 2 separate queries, one for 3 books per page and 1 for 7 movies per page. So doing this in a single query would be ideal.

We assume you have some 'type' field that says if a doc is a movie or book. Given that:
what you really want is diversity in results. This is for Lucene, unfortunately there is no Solr layer on top to take advantage yet.
so your next best thing is Grouping, by asking for more docs than you need, then grouping by type, and hoping you have at least 3 books and 7 movies in each group (you might not). You can always fall back to a second query if you are missing some books or movies. It probably will work fine.

Related

Solr: how to manage irrelevant results when not sorting by relevance?

Case in point: say we have a search query that returns 2000 results ranging from very relevant to hardly relevant at all. When this is sorted by relevance this is fine, as the most relevant results are listed on the first page.
However, when sorting by another field (e.g. user rating) the results on the first page are full of hardly-relevant results, which is a problem for our client. Somehow we need to only show the 'relevant' results with highest ratings.
I can only think of a few solutions, all of which have problems:
1 - Filter out listings on Solr side if relevancy score is under a threshold. I'm not sure how to do this, and from what I've read this isn't a good idea anyway. e.g. If a result returns only 10 listings I would want to display them all instead of filter any out. It seems impossible to determine a threshold that would work across the board. If anyone can show me otherwise please show me how!
2 - Filter out listings on the application side based on score. This I can do without a problem, except that now I can't implement pagination, because I have no way to determine the total number of filtered results without returning the whole set, which would affect performance/bandwidth etc... Also has same problems of the first point.
3 - Create a sort of 'combined' sort that aggregates a score between relevancy and user rating, which the results will then be sorted on. Firstly I'm not sure if this is even possible, and secondly it would be weird for the user if the results aren't actually listed in order of rating.
How has this been solved before? I'm open to any ideas!
Thanks

If they're not relevant, they should be excluded from the result set. Since you want to order by a dedicated field (i.e. user rating), you'll have to tweak how you decide which documents to include in the result at all.
In any case you'll have to define "what is relevant enough", since scores aren't really comparable between queries and doesn't say anything about "this was xyz relevant!".
You'll have to decide why those documents that are included aren't relevant and exclude them based on that criteria, and then either use the review score as a way to boost them further up (if you want the search to appear organic / by relevance). Otherwise you can just exclude them and sort by user score. But remember that user score, as an experience for the user, is usually a harder problem to make relevant than just order by the average of the votes.

Usually the client can choose different ordering options, by relevance or ratings for example. But you are right that ordering by rating is probably not useful enough. What you could do is take into account the rating in the relevance scoring. For example, by multiplying an "organic" score with a rating transformed as a small boost. In Solr you could do this with Function Queries. It is not hard science, and some magic is involved. Much is common sense. And it requires some very good evaluation and testing to see what works best.
Alternatively, if you do not want to treat it as a retrieval problem, you can apply faceting and let users do filtering of the results by rating. Let users help themselves. But I can imagine this does not work in all domains.
Engineers can define what relevancy is. Content similarity scoring is not only what constitutes relevancy. Many Information Retrieval researchers and engineers agree that contextual information should be used besides only the content similarity. This opens a plethora of possibilities to define a retrieval model. For example, what has become popular are Learning to Rank (LTR) approaches where different features are learnt from search logs to deliver more relevant documents to users given their user profiles and prior search behavior. Solr offers this as module.

searching YouTube for videos with specific range of views eg. between 9,000,000 and 11,000,000

first time posting.
I wanted to ask if anyone knows how I can search on YouTube for, let's say, music video's that have been viewed between a set number of times. Like the title says for example, between 9 and 11 million times.
One reason I want to do this is because I want to find good music that I haven't heard before. The logic I'm working on is that the Got Talent type video's that get viewed millions of times are generally viewed that many times for one of two reason. 1) they're amazing. 2) they're embarrassingly horrible.
And though I don't think a song being popular will necessarily mean I'll like it, I'm hoping this method will be successful to some degree.
Another reason is to look for trailers for independent films with a similar logic as above. Though with these movies I think I only hear about them six months to a year after they've been released because they're flying under the radar.
If I were to be able to search for movie trailers with 'x' number of views though.. for example, between 500,000 and a million, maybe I'd be able to find movies that I'll like quicker than via time passing and them getting mentioned to me by a friend.
Any help would be greatly appreciated as I've wanted to be able to perform these kind of searches for awhile now.
thanks

You will need to use YouTube API v3.
I havent written this exact request but it looks like you can list videos then filter by 'Chart' = 'mostPopular'
https://developers.google.com/youtube/v3/docs/videos/list
Perhaps a bit of background reading on the API would help too...
https://developers.google.com/youtube/v3/

First off, you would need the Youtube Data API. "v3" means nothing because it's simply the current version, like "Windows 10."
The API lets you get a video's view count, but doesn't put it in a range like 9 million to 11 million.
Youtube's own search function is pretty sophisticated. For instance,
https://www.youtube.com/results?search_query=movie+trailer&search_sort=video_view_count&filters=month. This gives all results for "movie trailer," within the last month, sorted by view count. You can customize the URL, i.e. "week" instead of month would return only trailers from the last week. Or year, etc. Essentially this is a "Videos: List: MostPopular" query, with subject filter.
I have a few Youtube API scripts, and I hardly think it's worth the hassle to do it that way when Youtube's advanced search get you 99% there. If you did, you would need to to a Search:list query for a given subject (i.e. "movie trailer"). Limited to a given time frame (i.e. last month). Then for each video ID, make a Videos:list query to get its view count. Then print all, sorted by views.

Boost SolR results using users behavior

I would like SolR to be able to "learn" from my website users' choices. By that i mean that i know which product the user click after he performed a search. So i collected a list of [term searched => number of clicks] for each product indexed in SolR. But i can't figure how to have a boost that depends on the user input. Is it possible to index some key/value pairs for a document and retrieve the value with a function usable in the boost parameter ?
I'm not sure to be clear, so i'll add a concrete example :
Let's say that when a user search for "garden chair", SolR returns me 3 products, "green garden chair", "blue chair", and "hamac for garden".
"green garden chair" ranks first, the hamac ranks last, as expected.
But, then, all the users searching for "garden chair" ends up clicking on the hamac.
I would like to help the hamac to rank first on the search "garden chair", WITHOUT altering the rank it got on other search. So i would like to be able to perform a key=>value based boost.
Is that possible to achieve with SolR ?
I'm sure that i can't be the first one needing such user-based search results improvement.
Thanks in advance.

You could you edismax bq, if you are using edismax (or maybe bf). For this to work, you obviously need to store the info (in a db, redis, whatever you fancy):
searched "garden chair":
clicked "hamac for garden": 10
clicked "green garden chair": 4
searched "green table":
...
And so forth, look this up when there is a search, and if there is info available for the search, send the bq boosting what you want.
Also, check out the QueryElevationComponent It might your purpose (although is stronger than just boosting....). There are two things to consider though:
Every time you change the click number you would need to modify the xml and reload, so it would be better if you could batch it to nightly or something like that.
there was a recent jira issue to allow you to provide similar functionality but by providing request params, no need of xml/reload, so check that out too

Solr / Lucene: scoring individual tag

I'm designing a Lucene search index that includes ranked tags for each document.
Example:
Document 1
tag: java , rank 1.2
tag: learning, rank 2.1
tag: bugs, rank 1.2
tag: architecture: rank 0.3
The tags comes from an automated classification algorithm that is also assigning a score.
How do I design the index so I can query for search for a combination of tags and return the most relevant results? Example, search for java+learning
I've initially created a FIELD for each tag and used the rank to boost the field for each document. Is this a good approach in terms of performance? What if I have 10,000 possible tags? Is it a good idea to have 10,000 FIELDS in Lucene?
Field tag = new Field(
FIELD_TAG+tag.getId(),
"y",
Field.Store.NO,
Field.Index.NOT_ANALYZED);
tag.setBoost(tag.getRank());
luceneDoc.add(tag);
If I instead add all the tags to the same field, how can I take into account the rank?

I had this problem in my search too... Tell me if I'm wrong...
The good was if you could have one field like "Tags" contain the value "java learning bugs architecture" and you use a WhiteSpaceTokenizer:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WhitespaceTokenizerFactory
But doing this you are not able to bost each words, you are able to boost the field "Tags"...
Doing this Lucene will not give a good scoring when user searchs for "java bugs" ou "architecture in java", but will return all documents that have this words.
But you can do like you said, a lot of "Tags" and boost each one... Or you can crate a new Query Parser http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html inheritance edismax (for example) to make a field works like you want.
Is that what you want?
Ow... One more thing, adding a lot of fields will make the docs indexer slow and index bigger (probably not good to search).

SOLR Query parameters to avoid flooding with the same manufacturer

I've been a long time browser here, but never have had a question that wasn't already asked. So here goes:
I've run into a problem using SOLR search where some searches on SOLR (let's say DVD Players) tend to return a lot of search results from the same manufacturer in the first 50 results.
Now assuming that I want to provide my end-user with the best experience searching, but also the best variety of products in my catalog, how would I go about providing a type of demerit to reduce the same brand from showing up in the search results more than 5 times. For the record I'm using a fairly standard DisMax search handler.
This logic would only be applied to extremely broad queries like 'DVD Players', or 'Hard Drives', and naturally I wouldn't use it to shape 'Samsung DVD Players' search results.
I don't know if SOLR has a nifty feature that does this automatically, or if I would have to start modifying search handler logic.

I haven't used this but I believe field collapsing / grouping would be what you want.
http://wiki.apache.org/solr/FieldCollapsing
If I understand this feature correctly it would group similar results kind of how http://news.google.com/ does it by grouping similar news stories.

Some ideas here, although I've not tried them myself.
You can use Carrot plugin for Solr to cluster search results lets say on manufacturer and then feed it to custom RequestHandler to re-order (cherry picking from each mfr. cluster) the result for diversity.
However, there is a downside to the approach that you may need to fetch larger than necessary and secondly the search results will be synthetic.

To achieve this is a lengthy and complex process but worth trying. Let's say the main field on which you are searching is a single field called title, first you'll need to make sure that all the documents containing "dvd player" in it have same score. This you can do by neglecting solr scoring parameteres like field norm (set omitNorms=true) & term frequency (write a solr plugin to neglect it) code attached..
Implementation Details:
1) compile the following class and put it into Solr WEB-INF/classes
package my.package;
import org.apache.lucene.search.DefaultSimilarity;
public class CustomSimilarity extends DefaultSimilarity {
public float tf(float freq) {
return freq > 0 ? 1.0f : 0.0f;
}
}
In solrconfig.xml use this new similarity class add
similarity class="my.package.CustomSimilarity"
All this will help you to make score for all the documents with "dvd player" in their title same. After that you can define one field of random type. Then when you query solr you can arrange first by score, then by the random field. Since score for all the documents containing DVD players would be same, results will get arranged by random field, giving the customer better variety of products in your catalog.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string