How do I paginate a Stream ranked feed? - getstream-io

I was pretty deep into integrating Stream into my existing pagination implementation (which is also used for paginating non-activity data stored in MySQL) when I came across this line in the Stream documentation under "Custom Ranking":
Please note: offset and id_lt cannot be used to read ranked feeds. Use score_lt for pagination instead.
This seems to be the only mention of score_lt in the docs. I can't find it discussed anywhere else, nor can I find an example of what its value should be. Should it be the same UUID I would use for id_lt if I were paginating a non-ranked feed? Or is it meant to be a score value of some kind that would be returned only by a ranked feed?
Normally I'd just try it and see, but ranked feeds are only available to paid plans and I'm still evaluating Stream.
This could have significant implications for how I implement pagination though, since I do want to be able to use ranked feeds in the future if I move forward with Stream.

When retrieving activities from a ranked feed using a specific ranking config, each activity will include a score attribute. You can use the score_lt parameter (along with the limit parameter) to paginate through the items in the ranked feed.
(When paginating through items on non-ranked feeds, we usually recommend using the id_lt parameter, which will just return activities by creation date, in chronological order from most-recent to least-recent. However, since older content in a ranked feed might be ranked higher than newer content, we have to paginate and order via the score attribute.)
--
Whenever you create a ranked feed, you'll create at least one ranked feed config. I'm going to name my ranked feed config ranked-feed-config-one (you can have as many as you'd like) which will look something like this:
{
  "score": "decay_linear(time) * popularity ^ 0.5",
  "defaults": {
    "popularity": 1
  }
}
Whenever you send a new activity into Stream, you can also provide an optional popularity parameter. (If you don't provide one, popularity will default to 1.)
Then, whenever you retrieve activities from the ranked feed, you can specify what ranking config you'd like to use (ranked-feed-config-one), like this:
someFeed.get({ ranking: 'ranked-feed-config-one' })
Each activity will be returned with (and ordered by) a score attribute. You'll save the last score attribute, and use that when supplying the score_lt parameter for future pagination calls.
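To make the cursor handling concrete, here is a minimal sketch of a score_lt pagination loop in Python. It does not call Stream: `fetch_page` is a hypothetical callable standing in for whatever client call you use (for example a wrapper around someFeed.get with ranking, limit and score_lt), and `FAKE_FEED` is an in-memory stand-in for a ranked feed. It also assumes score_lt is a strict "less than" comparison, as the name suggests.

```python
from typing import Callable, Iterator, Optional


def paginate_ranked(
    fetch_page: Callable[[int, Optional[float]], list],
    limit: int = 20,
) -> Iterator[dict]:
    """Yield every activity, passing the last score seen as score_lt."""
    score_lt = None  # the first request has no cursor
    while True:
        page = fetch_page(limit, score_lt)
        if not page:
            return
        yield from page
        # Pages are ordered by score, so the last item holds the lowest score.
        score_lt = page[-1]["score"]


# In-memory stand-in for a ranked feed, highest score first (hypothetical data).
FAKE_FEED = [{"id": i, "score": s} for i, s in enumerate([9.5, 7.2, 7.0, 3.1, 0.4])]


def fake_fetch(limit: int, score_lt: Optional[float]) -> list:
    rows = [a for a in FAKE_FEED if score_lt is None or a["score"] < score_lt]
    return rows[:limit]


all_ids = [a["id"] for a in paginate_ranked(fake_fetch, limit=2)]
```

One thing to watch with any strict score cursor: if two activities at a page boundary share exactly the same score, the second one would be skipped, so it is worth checking how your feed behaves in that case.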
--
Hopefully that helps clear things up! Let me know if there's anything else I can help answer for you.

You can use Limit & Offset Pagination.
someFeed.get({limit:20, offset:20})


Solr: how to manage irrelevant results when not sorting by relevance?

Case in point: say we have a search query that returns 2000 results ranging from very relevant to hardly relevant at all. When this is sorted by relevance this is fine, as the most relevant results are listed on the first page.
However, when sorting by another field (e.g. user rating) the results on the first page are full of hardly-relevant results, which is a problem for our client. Somehow we need to only show the 'relevant' results with highest ratings.
I can only think of a few solutions, all of which have problems:
1 - Filter out listings on the Solr side if the relevancy score is under a threshold. I'm not sure how to do this, and from what I've read this isn't a good idea anyway. E.g., if a query returns only 10 listings I would want to display them all instead of filtering any out. It seems impossible to determine a threshold that would work across the board. If anyone can show me otherwise, please show me how!
2 - Filter out listings on the application side based on score. This I can do without a problem, except that now I can't implement pagination, because I have no way to determine the total number of filtered results without returning the whole set, which would affect performance/bandwidth etc. This also has the same problems as the first point.
3 - Create a sort of 'combined' sort that aggregates a score between relevancy and user rating, which the results will then be sorted on. Firstly I'm not sure if this is even possible, and secondly it would be weird for the user if the results aren't actually listed in order of rating.
How has this been solved before? I'm open to any ideas!
Thanks
If they're not relevant, they should be excluded from the result set. Since you want to order by a dedicated field (i.e. user rating), you'll have to tweak how you decide which documents to include in the result at all.
In any case you'll have to define "what is relevant enough", since scores aren't really comparable between queries and don't say anything like "this was xyz relevant!".
You'll have to decide why those documents that are included aren't relevant and exclude them based on those criteria. Then you can either use the review score as a way to boost documents further up (if you want the search to appear organic / by relevance), or just exclude the irrelevant ones and sort by user score. But remember that user score, as an experience for the user, is usually a harder problem to make relevant than just ordering by the average of the votes.
Usually the client can choose different ordering options, by relevance or ratings for example. But you are right that ordering by rating is probably not useful enough. What you could do is take into account the rating in the relevance scoring. For example, by multiplying an "organic" score with a rating transformed as a small boost. In Solr you could do this with Function Queries. It is not hard science, and some magic is involved. Much is common sense. And it requires some very good evaluation and testing to see what works best.
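As a rough illustration of the multiplicative approach, the Python sketch below builds request parameters for Solr's edismax parser, whose `boost` parameter multiplies the relevance score by a function query. The field name `rating` and the damping function `log(sum(rating,10))` are assumptions picked for the example, not recommendations; the right transformation has to come out of your own evaluation.

```python
from urllib.parse import urlencode


def rating_boosted_params(user_query: str, rating_field: str = "rating") -> dict:
    """Build edismax parameters that scale relevance by a damped rating."""
    return {
        "defType": "edismax",
        "q": user_query,
        # Solr's log() is base 10, so a rating of 0 gives a factor of 1.0
        # and a rating of 5 gives roughly 1.18: a small boost, not a re-sort.
        "boost": f"log(sum({rating_field},10))",
    }


params = rating_boosted_params("dvd player")
query_string = urlencode(params)  # append to your /select URL
```

Because the factor stays close to 1, the organic score still dominates and the rating mostly breaks near-ties between similarly relevant documents.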
Alternatively, if you do not want to treat it as a retrieval problem, you can apply faceting and let users do filtering of the results by rating. Let users help themselves. But I can imagine this does not work in all domains.
Engineers can define what relevancy is. Content similarity scoring is not the only thing that constitutes relevancy. Many Information Retrieval researchers and engineers agree that contextual information should be used in addition to content similarity. This opens a plethora of possibilities for defining a retrieval model. For example, Learning to Rank (LTR) approaches have become popular: different features are learnt from search logs to deliver more relevant documents to users given their user profiles and prior search behavior. Solr offers this as a module.

How can I get all the exercises for a topic (e.g., math) and all its subtopics from the khanacademy api?

Khan Academy's API Explorer has an exercises section that mentions filtering by tags, but the url with math tag applied returns nothing.
The generic exercise objects don't contain the topic they're in. My guess is that there's an id to join on somewhere in the topictree/exercises json objects, but I don't know an efficient way to find it.
Here are the raw exercises json and raw topictree json (note, the second one is huge, and contains many topics other than math).
I don't think there is a nice way to return exercises from just a subtree of the topictree (e.g. just math). Tags are a different concept, and there isn't a tag common to everything in math. Probably your best bet is to load the full topictree with just Exercises (and Topics) and work from there:
http://www.khanacademy.org/api/v1/topictree?kind=Exercise
If you need to reference this structure repeatedly, it probably makes sense to download and filter it ahead of time, and maybe re-fetch it from time to time to account for changes to Khan Academy content. But it depends on your exact use case.
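A minimal Python sketch of the "download once, then filter" idea follows. It works on an already-parsed topictree and assumes each node is a dict with `kind` ("Topic" or "Exercise"), `slug`, and a `children` list; those field names are assumptions about the JSON layout, so check them against the data you actually download.

```python
def find_topic(node: dict, slug: str):
    """Depth-first search for the topic node with the given slug."""
    if node.get("kind") == "Topic" and node.get("slug") == slug:
        return node
    for child in node.get("children", []):
        found = find_topic(child, slug)
        if found is not None:
            return found
    return None


def collect_exercises(node: dict) -> list:
    """Return every Exercise node at or below this node."""
    if node.get("kind") == "Exercise":
        return [node]
    exercises = []
    for child in node.get("children", []):
        exercises.extend(collect_exercises(child))
    return exercises


# Tiny hand-made tree standing in for the real (huge) topictree JSON.
tree = {"kind": "Topic", "slug": "root", "children": [
    {"kind": "Topic", "slug": "math", "children": [
        {"kind": "Topic", "slug": "algebra", "children": [
            {"kind": "Exercise", "slug": "linear-equations"}]},
        {"kind": "Exercise", "slug": "counting"}]},
    {"kind": "Topic", "slug": "science", "children": [
        {"kind": "Exercise", "slug": "cells"}]},
]}

math_exercises = [e["slug"] for e in collect_exercises(find_topic(tree, "math"))]
```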
Generally, any content item can be referenced by content_id (sometimes just called id) or by slug, but unfortunately, the naming and usage aren't consistent everywhere.
You can use the following to get all the exercises:
http://www.khanacademy.org/api/v1/exercises
http://www.khanacademy.org/api/v1/topictree?kind=Exercise
I'm not sure what's the difference between these two - I don't use them.
I prefer to fetch the data for the individual topic nodes as follows:
http://www.khanacademy.org/api/v1/topic/%s
http://www.khanacademy.org/api/v1/topic/%s/exercises
http://www.khanacademy.org/api/v1/topic/%s/videos
where %s is the "node_slug" property for each topic. The root of the tree is just "root". The first one will give you the topic details and a list of sub-items in the "child_data" array. Use the "id" property of each sub-topic in this array to look up its details in the "children" array, matching on "internal_id" equal to "id". There you get the "node_slug" to use for the next API call for that sub-topic. The "child_data" array has all the sub-items in the order that they appear on the website when you're working with the missions.
I cache these responses so that I don't have to download everything every time.
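The join between "child_data" and "children" described above can be sketched in Python as follows. `fetch_topic` is a hypothetical callable that returns the parsed JSON for /api/v1/topic/<node_slug>; here it is replaced by a dict of fake responses. The `kind` key on "child_data" entries is an assumption used to tell sub-topics apart from other items.

```python
def crawl(slug: str, fetch_topic, cache: dict = None) -> list:
    """Return the node_slugs of a topic and all its sub-topics, in site order."""
    cache = {} if cache is None else cache
    if slug not in cache:  # cache responses so nothing is downloaded twice
        cache[slug] = fetch_topic(slug)
    topic = cache[slug]
    # Index the detailed "children" entries by their internal_id.
    by_internal_id = {c["internal_id"]: c for c in topic.get("children", [])}
    slugs = [slug]
    for item in topic.get("child_data", []):  # keeps the website's ordering
        child = by_internal_id.get(item["id"])
        if child is not None and item.get("kind") == "Topic":
            slugs.extend(crawl(child["node_slug"], fetch_topic, cache))
    return slugs


# Fake API responses (hypothetical ids and slugs) keyed by node_slug.
FAKE_RESPONSES = {
    "root": {"child_data": [{"id": "x1", "kind": "Topic"}],
             "children": [{"internal_id": "x1", "node_slug": "math"}]},
    "math": {"child_data": [{"id": "x2", "kind": "Topic"},
                            {"id": "e1", "kind": "Exercise"}],
             "children": [{"internal_id": "x2", "node_slug": "algebra"}]},
    "algebra": {"child_data": [], "children": []},
}

visited = crawl("root", FAKE_RESPONSES.__getitem__)
```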

Boost Solr results using user behavior

I would like Solr to be able to "learn" from my website users' choices. By that I mean that I know which product the user clicks after performing a search. So I collected a list of [term searched => number of clicks] for each product indexed in Solr. But I can't figure out how to have a boost that depends on the user input. Is it possible to index some key/value pairs for a document and retrieve the value with a function usable in the boost parameter?
I'm not sure I'm being clear, so I'll add a concrete example:
Let's say that when a user searches for "garden chair", Solr returns 3 products: "green garden chair", "blue chair", and "hamac for garden".
"green garden chair" ranks first, the hamac ranks last, as expected.
But then all the users searching for "garden chair" end up clicking on the hamac.
I would like to help the hamac rank first for the search "garden chair", WITHOUT altering the rank it gets on other searches. So I would like to be able to perform a key=>value based boost.
Is that possible to achieve with Solr?
I'm sure that I can't be the first one needing such user-based search result improvement.
Thanks in advance.
You could use the edismax bq parameter (or maybe bf), if you are using edismax. For this to work, you obviously need to store the info (in a db, redis, whatever you fancy):
searched "garden chair":
clicked "hamac for garden": 10
clicked "green garden chair": 4
searched "green table":
...
And so forth, look this up when there is a search, and if there is info available for the search, send the bq boosting what you want.
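The lookup-then-boost step could look roughly like this in Python. The click store is a plain dict here, the field name `name` is an assumption (a unique id field would usually be a safer target), and using the raw click count as the boost weight is just the simplest choice; no escaping of special characters in product names is attempted.

```python
# Hypothetical click store: search term -> {product name: click count}.
CLICKS = {
    "garden chair": {"hamac for garden": 10, "green garden chair": 4},
}


def boost_query(search_term: str, clicks: dict = CLICKS, field: str = "name"):
    """Return a bq string for this search term, or None if nothing is known."""
    per_product = clicks.get(search_term.strip().lower())
    if not per_product:
        return None
    ranked = sorted(per_product.items(), key=lambda kv: -kv[1])
    # Each clause boosts one product by the number of clicks it received.
    return " ".join(f'{field}:"{name}"^{count}' for name, count in ranked)


bq = boost_query("Garden Chair")
```

When the function returns None, the request is sent without any bq, so searches with no click history keep their normal ranking.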
Also, check out the QueryElevationComponent. It might serve your purpose (although it is stronger than just boosting). There are two things to consider though:
Every time you change the click number you would need to modify the XML and reload, so it would be better if you could batch it nightly or something like that.
There was a recent JIRA issue to provide similar functionality through request params, with no need for XML edits or a reload, so check that out too.

SOLR Query parameters to avoid flooding with the same manufacturer

I've been a long time browser here, but never have had a question that wasn't already asked. So here goes:
I've run into a problem using SOLR search where some searches on SOLR (let's say DVD Players) tend to return a lot of search results from the same manufacturer in the first 50 results.
Now assuming that I want to provide my end-user with the best experience searching, but also the best variety of products in my catalog, how would I go about providing a type of demerit to keep the same brand from showing up in the search results more than 5 times? For the record I'm using a fairly standard DisMax search handler.
This logic would only be applied to extremely broad queries like 'DVD Players', or 'Hard Drives', and naturally I wouldn't use it to shape 'Samsung DVD Players' search results.
I don't know if SOLR has a nifty feature that does this automatically, or if I would have to start modifying search handler logic.
I haven't used this but I believe field collapsing / grouping would be what you want.
http://wiki.apache.org/solr/FieldCollapsing
If I understand this feature correctly it would group similar results kind of how http://news.google.com/ does it by grouping similar news stories.
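For reference, a grouped request could be parameterised roughly as below. `group`, `group.field` and `group.limit` are Solr's result grouping parameters; the field name `manufacturer` is an assumption and would need to be a single-valued indexed field in your schema.

```python
def grouped_params(user_query: str, brand_field: str = "manufacturer",
                   per_brand: int = 5) -> dict:
    """Build dismax parameters that return at most per_brand docs per brand."""
    return {
        "defType": "dismax",
        "q": user_query,
        "group": "true",               # turn on result grouping
        "group.field": brand_field,    # one group per manufacturer
        "group.limit": str(per_brand), # cap documents returned per group
    }


params = grouped_params("dvd players")
```

Note that the response comes back organised by group rather than as one flat list, so the application has to flatten or interleave the groups before display.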
Some ideas here, although I've not tried them myself.
You can use the Carrot plugin for Solr to cluster search results, let's say on manufacturer, and then feed the clusters to a custom RequestHandler that re-orders the results for diversity (cherry-picking from each manufacturer cluster).
However, this approach has two downsides: you may need to fetch more results than necessary, and the resulting order will be synthetic.
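The cherry-picking step itself is simple to sketch on the application side. The Python below is not the Carrot plugin or a Solr RequestHandler: it only shows one possible round-robin re-ordering over results that already carry a manufacturer value (the key name `manufacturer` and the sample documents are made up).

```python
from collections import OrderedDict, deque


def interleave_by_brand(results: list, brand_key: str = "manufacturer") -> list:
    """Round-robin over brands, keeping each brand's internal relevance order."""
    clusters = OrderedDict()
    for doc in results:
        clusters.setdefault(doc[brand_key], deque()).append(doc)
    reordered = []
    while clusters:
        for brand in list(clusters):  # copy keys: clusters shrinks as we go
            reordered.append(clusters[brand].popleft())
            if not clusters[brand]:
                del clusters[brand]
    return reordered


hits = [
    {"id": "A1", "manufacturer": "Acme"},
    {"id": "A2", "manufacturer": "Acme"},
    {"id": "A3", "manufacturer": "Acme"},
    {"id": "B1", "manufacturer": "Bolt"},
    {"id": "C1", "manufacturer": "Core"},
]
order = [d["id"] for d in interleave_by_brand(hits)]
```

This illustrates both downsides mentioned above: the input has to be a larger result set than the page you display, and the final order no longer reflects the relevance scores directly.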
Achieving this is a lengthy and complex process, but worth trying. Let's say the main field you are searching on is a single field called title. First you'll need to make sure that all the documents containing "dvd player" in it have the same score. You can do this by neglecting Solr scoring parameters like field norm (set omitNorms=true) and term frequency (write a Solr plugin to neglect it; code below).
Implementation Details:
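1) Compile the following class and put it into Solr's WEB-INF/classes (the package is named mypackage here because "package" is a reserved word in Java and cannot be part of a package name):
package mypackage;

import org.apache.lucene.search.DefaultSimilarity;

// Ignores term frequency: a term scores the same whether it occurs once or many times.
public class CustomSimilarity extends DefaultSimilarity {
    @Override
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }
}
2) Register the new similarity class in the schema (schema.xml is where Solr reads the similarity declaration) by adding:
<similarity class="mypackage.CustomSimilarity"/>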
All this will help make the score the same for all documents with "dvd player" in their title. After that you can define one field of a random type. Then, when you query Solr, sort first by score and then by the random field. Since the score for all the documents containing "dvd player" would be the same, results will be ordered by the random field, giving the customer a better variety of products from your catalog.
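The query side of this might be built as follows. The sketch assumes the schema defines a dynamic field `random_*` backed by solr.RandomSortField (as Solr's example schema does), so that any `random_<seed>` name yields a different but repeatable ordering; the seed handling here is only one way to do it.

```python
import random


def varied_sort_params(user_query: str, seed: int = None) -> dict:
    """Sort by score first, then by a seeded random field to break ties."""
    if seed is None:
        seed = random.randrange(1_000_000)
    return {
        "q": user_query,
        # Documents with equal scores fall back to the random ordering.
        "sort": f"score desc, random_{seed} asc",
    }


params = varied_sort_params("dvd player", seed=42)
```

Keeping the same seed for a user's session keeps pagination stable, while a new seed per session varies which products appear first.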

Freebase: Format search result to list all properties of object of unknown type(s)

I'm trying to write a MQL query to format a search result in freebase (the "output" parameter in the search API). I essentially want to find the (simple) values of all the properties of a given search result (without knowing anything about the types of the result a priori). By "simple", I mean only the default properties if the values are complex objects.
E.g., if I search for "Yo La Tengo" and this takes me to the result for "/en/yo_la_tengo", I want to be able to get the group's members (I just need names, not instruments or dates started), albums (again, just names), films contributed to (again, just names), etc.
Is there a simple way to do this with a search output query, given that I know nothing about the types? I imagine there's some sort of reflection magic I can use, and I've tried mucking about with "/type/reflect", but I'm not getting anywhere. I'm brand-new to MQL (though I have extensive SQL experience), so this is a little daunting. Any ideas?
Edit: So to clarify, I think the problem I'm seeing is due to mediator types like "performance" (an actor in a film) or "marriage". E.g., with a query about Yo La Tengo, I can see most (all?) information that I'm interested in, but with a similar query about [The Muppet Movie]( freebase.com/api/service/search?limit=1&mql_output=%5B%7B%22%2Ftype%2Freflect%2Fany_reverse%22%3A%5B%7B%7D%5D%2C%22%2Ftype%2Freflect%2Fany_master%22%3A%5B%7B%7D%5D%2C%22%2Ftype%2Freflect%2Fany_value%22%3A%5B%7B%7D%5D%7D%5D&query=The%20Muppet%20Movie -- sorry, SO thinks I'm a spammer so I can't make this a link), I don't see Frank Oz referenced at all (probably because his performance is referenced instead). Is there a generic way for me to "follow" mediator types to get all their properties? E.g., is there a single output MQL that would allow me to get the actor in a performance (when linked from a film search result) and give me the spouse in a marriage (when linked from a person)?
Querying not only every property, but then following those properties another ply deep in the graph for all search results is going to be an incredibly expensive operation. What is the use case for this? Do you really have a UI where the user can see and effectively absorb all this information? To answer the question directly though, it's not possible to unpack mediator types automatically using mql_output on the search API.
I'd suggest combining a basic set of information on the search query with a deeper set of information on a topic that the user has expressed interest (e.g. by hovering over). This UI experience would be similar to that of Freebase Suggest.
In the years since the question was originally asked there have been some additional useful things added such as the "notable" pseudo-property which lets you see what the topic is notable for.
Of course everyone also needs to be moving to the new API, so the queries would be:
https://www.googleapis.com/freebase/v1/search?query=%22the%20muppet%20movie%22&limit=1&indent=true
https://www.googleapis.com/freebase/v1/topic/en/the_muppet_movie
AFAIK there is no way to do this in outright MQL, but you can:
Get all the properties of an object or type of object, then
Programmatically construct another MQL query to get those objects you want to know more about.
Look at this example:
[{
  "type|=": [
    "/film/actor",
    "/tv/tv_actor",
    "/celebrities/celebrity"
  ],
  "*": [{}]
}]
It grabs all the properties of all objects that have the type actor, tv_actor, or celebrity. When you run it, you'll see all the possible "follow" points you can explore.
This is not exactly what you want, but it should get you closer.
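The second step, constructing follow-up queries programmatically, might look like this in Python. It only builds query objects and does not call any Freebase endpoint. It assumes the first query's result is a dict whose property values are lists of objects carrying an "id"; the sample result and its ids are made up for illustration.

```python
import json


def follow_up_queries(first_result: dict) -> list:
    """Build one 'all properties' MQL query per object linked from the result."""
    seen = set()
    queries = []
    for prop, values in first_result.items():
        if not isinstance(values, list):
            continue  # plain values such as "name" have nothing to follow
        for value in values:
            obj_id = value.get("id") if isinstance(value, dict) else None
            if obj_id and obj_id not in seen:
                seen.add(obj_id)
                queries.append([{"id": obj_id, "*": [{}]}])
    return queries


# Hypothetical first-pass result: a film linked to a mediator (performance) node.
first = {
    "id": "/en/the_muppet_movie",
    "name": "The Muppet Movie",
    "/film/film/starring": [{"id": "/m/performance_1"}],
}
queries = follow_up_queries(first)
payload = json.dumps(queries[0])  # what would be sent as the next MQL query
```

Each follow-up query costs another round trip, which is why it makes sense to run them only for the topics a user actually shows interest in.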
