Back-filling a feed?

Is there a way to insert activities into a feed so they appear as if they were inserted at a specific time in the past? I had assumed that when adding items to a feed it would use the 'time' value to sort the results, even when propagated to other feeds following the initial feed, but it seems that's not the case and they just get sorted by the order they were added to the feed.
I'm working on a timeline view for our users, and I have a couple of reasons for wanting to insert activities at previous points in time:
1) We have a large number of entities in our database but a relatively small number of them will be followed (especially at first), so to be more efficient I had planned to only add activities for an entity once it had at least one follower. Once somebody follows it, I would like to go back 14 days and insert activities for that entity as if they were created at the time they occurred, so the new follower would see them in their feed at the appropriate place. Currently they will just see a huge group of activities from the past at the top of their feed, which is not useful.
2) Similarly, we already have certain following relationships within our database and at launch I would like to go back a certain amount of time and insert activities for all entities that already have followers so that the feed is immediately useful.
Is there any way to do this, or am I out of luck?
My feeds are a combination of flat and aggregated feeds - the main timeline for a user is aggregated, but most entity feeds are flat. All of my aggregation groups would be based on the time of the activity so ideally there would be a way to sort the final aggregation groups by time as well.

Feeds on Stream are sorted differently depending on their type:
Flat feeds are sorted by activity time, descending.
Aggregated feeds and Notification feeds sort activity groups by when they were last updated (activities inside each group are sorted by time, descending).
This means that you can back-fill flat feeds but not aggregated feeds.
One possible way to get something close to what you describe is to create the follow relationship with copy_limit set to a low number, so that only the most recent activities are propagated to the follower's feed.
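A minimal sketch of both pieces, assuming the stream-python client, placeholder credentials, and made-up feed group names ('entity' flat, 'timeline_aggregated' aggregated):

    import datetime
    import stream  # pip install stream-python

    client = stream.connect("YOUR_API_KEY", "YOUR_API_SECRET")  # placeholders

    # Back-fill the entity's flat feed: the 'time' field controls where it sorts.
    entity_feed = client.feed("entity", "42")
    ten_days_ago = datetime.datetime.utcnow() - datetime.timedelta(days=10)
    entity_feed.add_activity({
        "actor": "entity:42",
        "verb": "post",
        "object": "post:1",
        "foreign_id": "post:1",
        "time": ten_days_ago.isoformat(),
    })

    # When a user follows the entity, copy only a few of its recent activities.
    timeline = client.feed("timeline_aggregated", "7")
    timeline.follow("entity", "42", activity_copy_limit=10)

In the Python client the copy limit is exposed as activity_copy_limit on follow; check the client you use for the exact parameter name.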

Related

Designing Twitter Search - How to sort large datasets?

I'm reading an article about how to design a Twitter Search. The basic idea is to map tweets based on their ids to servers where each server has the mapping
English word -> A set of tweetIds having this word
Now if we want to find all the tweets that have some word, all we need is to query all servers and aggregate the results. The article casually suggests that we can also sort the results by some parameter like "popularity", but isn't that a heavy task, especially if the word is a hot word?
What is done in practice in such search systems?
Maybe some tradeoffs are being made?
Thanks!
First of all, there are two types of indexes: local and global.
A local index is stored on the same computer as the tweet data. For example, you may have 10 shards, and each of these shards has its own index; e.g. the word "car" -> sorted list of tweet ids.
When a search is run, we have to send the query to every server, since we don't know where the most popular tweets are. The query asks every server to return its top results. All of these results are collected on the same box - the one executing the user request - and that process picks the top 10 of the entire population.
Since the results are already sorted in the index itself, picking the top 10 across all lists is cheap - effectively constant work per page, because we only run a simple heap/watermark merge over a fixed number of tweets.
A second nice property is pagination: the next query is also sent to every box with additional data - give me your top 10 with popularity below X, where X is the popularity of the last tweet returned to the customer.
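A toy sketch of that scatter-gather merge and the popularity cursor (the shard data is invented for illustration):

    import heapq
    from itertools import islice

    # Each shard returns its matches already sorted by popularity, highest first.
    shard_results = [
        [(98, "t1"), (75, "t4"), (60, "t9")],   # shard 0: (popularity, tweet_id)
        [(91, "t2"), (55, "t7")],               # shard 1
        [(88, "t3"), (80, "t5"), (40, "t8")],   # shard 2
    ]

    def top_k(shards, k, below=None):
        # Apply the pagination cursor: keep only items with popularity below X.
        filtered = (
            [item for item in shard if below is None or item[0] < below]
            for shard in shards
        )
        # k-way merge of pre-sorted lists, keeping only the first k items.
        merged = heapq.merge(*filtered, key=lambda item: item[0], reverse=True)
        return list(islice(merged, k))

    page1 = top_k(shard_results, 3)                       # [(98, 't1'), (91, 't2'), (88, 't3')]
    page2 = top_k(shard_results, 3, below=page1[-1][0])   # next 3, popularity < 88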
A global index is a different beast - it does not live on the same boxes as the data (it could, but does not have to). In that case, when we search for a keyword, we know exactly where to look. And since the index itself is also sorted, it is fast to get the top 10 most popular results (or to paginate).
Since the global index returns only tweet ids and not the tweets themselves, we have to look up the tweet for every id - this is the classic N+1 problem: 1 query to get a list of ids and then one query per id. There are several ways to solve this - caching and data duplication are by far the most common approaches.
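A hedged sketch of the hydration step with a simple in-process cache and a single batched lookup instead of one query per id (fetch_tweets_by_ids stands in for whatever storage call you actually have):

    tweet_cache = {}  # tweet_id -> tweet document

    def fetch_tweets_by_ids(ids):
        # Placeholder: in a real system this is one batched query against the tweet store.
        return {i: {"id": i, "text": f"tweet {i}"} for i in ids}

    def hydrate(tweet_ids):
        """Turn ids from the global index into full tweets without N+1 queries."""
        missing = [i for i in tweet_ids if i not in tweet_cache]
        if missing:
            tweet_cache.update(fetch_tweets_by_ids(missing))
        return [tweet_cache[i] for i in tweet_ids]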

Using different time zones for strftime in aggregated feeds

If I understand correctly, if I'm attempting to aggregate activities in an aggregated feed group based on their calendar day, I necessarily do this relative to the UTC date. This unfortunately can yield confusing results if it's done somewhere like North America. Is it possible to aggregate on a date relative to a different time zone? Is there another way to achieve a similar result?
Currently it's not possible to provide an offset to the strftime function inside of an aggregation rule.
Without knowing all the specifics I think you may be able to achieve the desired result by adding a separate custom field such as local_date (e.g. with a string value of '2018-05-15'). This would be pre-computed and included with the Activity when it's added to a Stream feed, and referred to in the aggregation rule like {{ local_date }}.
The caveat / limitation is that you'll need to decide whether to use the 'local date' from the perspective of the user who creates an activity (which may differ from that of the user reading a feed containing the activity), or a system-wide date that's applied across the application regardless of where your users are located.
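A minimal sketch of pre-computing such a field, assuming the stream-python client and a timezone stored per user (the feed slug and field name are just examples):

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo  # Python 3.9+

    import stream

    client = stream.connect("YOUR_API_KEY", "YOUR_API_SECRET")  # placeholders

    user_tz = ZoneInfo("America/Los_Angeles")  # e.g. loaded from the user's profile
    now_utc = datetime.now(timezone.utc)

    client.feed("user", "42").add_activity({
        "actor": "user:42",
        "verb": "post",
        "object": "post:1",
        "time": now_utc.replace(tzinfo=None).isoformat(),  # activity time stays in UTC
        # Custom field the aggregation rule can reference as {{ local_date }}.
        "local_date": now_utc.astimezone(user_tz).date().isoformat(),
    })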

Two streams for inter-related models?

If we have users and posts - and I can follow a user (and see all their posts), or follow a particular post (and see all its edits/updates) - would each post be pushed to two separate streams, one for the user and another for the post?
My concern is that if a user follows an idea, and also the user feed, their aggregated activity-feed could show multiple instances of the same idea, one from each feed.
Every unique activity will appear at most once in a feed. To make an activity have the exact same internal ID in both feeds, you can use the to field, which adds the activity to other feed groups with the same activity UUID.
If that's not possible, you can at least make the activity unique by providing the same time and foreign_id values in both add calls; that combination identifies a unique activity as well.
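A rough sketch of both options with the stream-python client (feed slugs and ids are illustrative):

    import datetime
    import stream

    client = stream.connect("YOUR_API_KEY", "YOUR_API_SECRET")  # placeholders

    now = datetime.datetime.utcnow().isoformat()

    # Option 1: add once to the user feed and fan out to the post feed via 'to',
    # so both feeds hold the same activity UUID.
    client.feed("user", "alice").add_activity({
        "actor": "user:alice",
        "verb": "update",
        "object": "post:99",
        "foreign_id": "post:99",
        "time": now,
        "to": ["post:99"],
    })

    # Option 2: add to each feed separately, but reuse the same foreign_id and time
    # so the two copies are treated as the same activity.
    for feed in (client.feed("user", "alice"), client.feed("post", "99")):
        feed.add_activity({
            "actor": "user:alice",
            "verb": "update",
            "object": "post:99",
            "foreign_id": "post:99",
            "time": now,
        })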
Cheers!

Range-based, chronological pagination queries across multiple collections with MongoDB?

Is there an efficient way to do a range-based query across multiple collections, sorted by an index on timestamps? I basically need to pull in the latest 30 documents from 3 collections and the obvious way would be to query each of the collections for the latest 30 docs and then filter and merge the result. However that's somewhat inefficient.
Even if I were to select only the timestamp field in the query and then do a second batch of queries for the latest 30 docs, I'm not sure that would be a better approach. That would be 90 documents (whole or single-field) per pagination request.
Essentially the client can be subscribed to articles and each category of article differs by 0 - 2 fields. I just picked 3 since that is the average number of articles that users are subscribed to so far in the beta. Because of the possible field differences, I didn't think it would be very consistent to put all of the articles of different types in a single collection.
MongoDB operations operate on one and only one collection at a time. Thus you need to structure your schema with collections that match your query needs.
Option A: Get Ids from supporting collection, load full docs, sort in memory
Maintain a supporting collection that combines the ids, main collection names, and timestamps of the 3 collections. Query it to get your 30 id/collection pairs, then load the corresponding full documents with 3 additional queries (1 to each main collection). Remember those won't come back in the correct combined order, so you need to sort that page of results in memory before returning it to your client.
    {
        _id: ObjectId,   // same _id as the full document in its source collection
        updated: Date,   // the timestamp you sort and paginate on
        type: String     // name of the collection the full document lives in
    }
This way allows mongo to do the pagination for you.
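A minimal PyMongo sketch of Option A (the supporting collection 'article_index' follows the schema above; other names are illustrative):

    from pymongo import MongoClient, DESCENDING

    db = MongoClient()["mydb"]  # placeholder connection / database name

    # 1) Page through the supporting collection only.
    refs = list(
        db["article_index"].find({}, {"updated": 1, "type": 1})
        .sort("updated", DESCENDING)
        .limit(30)
    )

    # 2) Load the full documents, one batched query per source collection.
    ids_by_type = {}
    for ref in refs:
        ids_by_type.setdefault(ref["type"], []).append(ref["_id"])

    docs = {}
    for coll_name, ids in ids_by_type.items():
        for doc in db[coll_name].find({"_id": {"$in": ids}}):
            docs[doc["_id"]] = doc

    # 3) Re-apply the order from the supporting collection.
    page = [docs[ref["_id"]] for ref in refs if ref["_id"] in docs]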
Option B: 3 Queries, Union, Sort, Limit
Or, as you said, load 30 documents from each collection, sort the union in memory, drop the extra 60, and return the combined result. This avoids the overhead and synchronization maintenance of the extra collection.
So I would think your current approach (Option B as I call it) is the lesser of those 2 not-so-great options.
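For comparison, a sketch of Option B with PyMongo, merging the three already-sorted cursors in memory (collection names are illustrative):

    import heapq
    from itertools import islice

    from pymongo import MongoClient, DESCENDING

    db = MongoClient()["mydb"]  # placeholder
    collections = ["news_articles", "blog_articles", "video_articles"]  # illustrative

    cursors = [
        db[name].find().sort("updated", DESCENDING).limit(30)
        for name in collections
    ]

    # Each cursor is already sorted by 'updated' descending, so a k-way merge
    # gives the combined newest-first order; keep only the first 30.
    page = list(islice(
        heapq.merge(*cursors, key=lambda doc: doc["updated"], reverse=True),
        30,
    ))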
If your query is really to get the most recent articles based on a selection of categories, then I'd suggest you:
A) Store all of the documents in a single collection so a single query can fetch a combined, paged result (see the sketch after this list). Unless you have a very consistent date range across collections, you'd otherwise need to track date ranges and counts so that you can reasonably fetch a set of documents that can be merged - 30 from one collection may be older than all of the documents from another. With a single collection you can add an index on timestamp and category and then limit the results.
B) Cache everything aggressively so that you rarely need to do the merges
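A sketch of the single-collection variant from suggestion A above (names are illustrative):

    from pymongo import MongoClient, ASCENDING, DESCENDING

    db = MongoClient()["mydb"]  # placeholder
    articles = db["articles"]   # all article types in one collection, with a 'category' field

    # Compound index so the category filter and timestamp sort are both covered.
    articles.create_index([("category", ASCENDING), ("updated", DESCENDING)])

    subscribed = ["sports", "tech", "finance"]  # the user's categories
    page = list(
        articles.find({"category": {"$in": subscribed}})
        .sort("updated", DESCENDING)
        .limit(30)
    )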
You could use the same idea I explained here; although that post is about MongoDB text search, it applies to any kind of query:
MongoDB Index optimization when using text-search in the aggregation framework
The idea is to query each of your collections ordered by date and id, then sort/merge the results to return the first page. Subsequent pages are retrieved using the last document's date and id from the previous page.
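A hedged sketch of that (date, id) cursor with PyMongo for one collection (field and collection names are illustrative); the same filter would be applied to each collection before merging:

    from pymongo import MongoClient, DESCENDING

    db = MongoClient()["mydb"]  # placeholder

    def fetch_page(coll_name, page_size=30, last_date=None, last_id=None):
        """Fetch one page, newest first, resuming after the previous page's last (date, id)."""
        query = {}
        if last_date is not None:
            query = {"$or": [
                {"updated": {"$lt": last_date}},
                {"updated": last_date, "_id": {"$lt": last_id}},
            ]}
        return list(
            db[coll_name].find(query)
            .sort([("updated", DESCENDING), ("_id", DESCENDING)])
            .limit(page_size)
        )

    first = fetch_page("news_articles")
    if first:
        second = fetch_page("news_articles",
                            last_date=first[-1]["updated"],
                            last_id=first[-1]["_id"])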

How does solr work with data split into different services and therefore not synchronously available?

Take for instance an ecommerce store with catalog and price data in different web services. Now, we know that Solr does not allow partial updates to a document field (JIRA bug), so how do you index these two services?
I had three possibilities, but I'm not sure which one is correct:
Partial update - not possible
Solr join - have price and catalog in separate indexes and join them in Solr. You can't join them in your client-side code without breaking pagination and facet counts. I don't know if this is possible pre-Solr 4.0.
Have some sort of intermediate indexing service, which composes an entire document from the results of both services and sends it for indexing. However, there are two problems with this approach:
3.1 You can still compose documents partially and, once the document is complete, set a flag indicating that it is complete. However, each time a document has to be indexed, the service first has to check whether the document already exists in the index, edit it, and push it back. So, big performance hit.
3.2 Your intermediate service checks whether a particular id is available from all services - if not, it silently drops it and hopes that by the time it appears in the other service, the first service will already be populated. This is OK, but it means that an item is not searchable until all fields are available (not always desirable - if you don't have a price, you could simply mark it out-of-stock and still have it searchable).
Of all these methods, only #3.2 looks viable to me - does anyone know how you do this kind of thing with DIH? Because now you have two different entry points into indexing (2 different web services), and each has to check the other.
The usual way to solve this is close to your 3.2: write code that creates the document you want to index from the different available services. The usual flow would be to fetch all the items from the catalog, then fetch the prices while indexing. Whether you want to include catalog items that don't have prices available in the search depends on your business rules for the service. If you want to speed up the process (fetch product, fetch price, repeat), expand the API to fetch 1000 products and then the prices for all of those products at the same time.
There is no reason why you should drop an item from the index just because it doesn't have a price, unless you don't want items without prices in your index. It's up to you and your particular needs what information has to be available before indexing the document.
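A rough sketch of such an indexing job in Python (the catalog and price service calls and the Solr core name are placeholders; documents are posted to Solr's JSON update handler):

    import requests

    SOLR_UPDATE_URL = "http://localhost:8983/solr/products/update?commit=true"  # placeholder core

    def fetch_catalog_batch(offset, limit=1000):
        # Placeholder for your catalog web service; product ids assumed to be strings.
        return requests.get("http://catalog.internal/products",
                            params={"offset": offset, "limit": limit}).json()

    def fetch_prices(product_ids):
        # Placeholder for your price web service; returns a mapping of product id -> price.
        return requests.post("http://prices.internal/lookup", json={"ids": product_ids}).json()

    offset = 0
    while True:
        products = fetch_catalog_batch(offset)
        if not products:
            break
        prices = fetch_prices([p["id"] for p in products])
        docs = []
        for p in products:
            doc = {"id": p["id"], "name": p["name"]}
            if p["id"] in prices:
                doc["price"] = prices[p["id"]]
            else:
                doc["in_stock"] = False  # index it anyway, just flag it
            docs.append(doc)
        requests.post(SOLR_UPDATE_URL, json=docs).raise_for_status()
        offset += len(products)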
As far as I remember, 4.0 will probably support partial updates as it moves to the new abstraction layer for the index files, although I'm not sure it will make your situation much more flexible.
Approach 3.2 is the most common, though I think about it slightly differently. First, think about what you want in your search results, then create one Solr document for each potential result, with as much information as you can get. If it is OK to have a missing price, then add the document that way.
You may also want to match the documents in Solr, but get the latest data for display from the web services. That gives fresh results and avoids skew between the batch updates to Solr and the live data.
Don't hold your breath for fine-grained updates to be added to Solr and Lucene. They get a lot of their speed from not supporting record-level locking and updates.
