I have followed the advice here: stackoverflow aggregate answer
I am grouping posts together (shares for the same post together, likes for the same post together, regular posts as single activities). What I'm noticing, however, is that I end up with duplicates for a user: if a user shares a post and also likes it, it shows up twice on their getstream feed. Right now I have to do the filtering in my own backend service, in a specific order (if you share a post, remove the activity if you also liked it; if you like a post, remove the regular post). Is there a better way to solve this problem of duplicates?
One idea that comes to mind: when you post the share activity, make sure you send a foreign_id and a time (sending both will avoid duplicates in our system). Then, if the user also 'likes' the post, you could store a like counter in the activity metadata and send an update with the same foreign_id and time, incrementing the like count (see the sketch below).
Keep in mind that updates don't push to aggregated feeds or notification feeds, though, so you'd still want to push that 'like' activity to those feeds, too.
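A minimal sketch of that flow with the stream-python client; the feed names, the foreign_id scheme, and the like_count field are all illustrative assumptions, not prescribed by the API:

```python
import datetime
import stream  # pip install stream-python

client = stream.connect("YOUR_API_KEY", "YOUR_API_SECRET")
feed = client.feed("user", "42")

# Add the share with an explicit foreign_id and time so Stream can
# de-duplicate it and so we can address the same activity later.
shared_at = datetime.datetime.utcnow()
feed.add_activity({
    "actor": "user:42",
    "verb": "share",
    "object": "post:7",
    "foreign_id": "share:post:7:user:42",  # hypothetical ID scheme
    "time": shared_at,
    "like_count": 0,
})

# When the same user also likes the post, update the stored activity
# in place (matched by foreign_id + time) instead of inserting a
# second one.
client.update_activities([{
    "actor": "user:42",
    "verb": "share",
    "object": "post:7",
    "foreign_id": "share:post:7:user:42",
    "time": shared_at,
    "like_count": 1,  # counter kept in activity metadata
}])
```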
Is there a way to insert activities into a feed so they appear as if they were inserted at a specific time in the past? I had assumed that when adding items to a feed it would use the 'time' value to sort the results, even when propagated to other feeds following the initial feed, but it seems that's not the case and they just get sorted by the order they were added to the feed.
I'm working on a timeline view for our users, and I have a couple of reasons for wanting to insert activities at previous points in time:
1) We have a large number of entities in our database but a relatively small number of them will be followed (especially at first), so to be more efficient I had planned to only add activities for an entity once it has at least one follower. Once somebody follows it, I would like to go back 14 days and insert activities for that entity as if they were created at the time they occurred, so the new follower would see them in their feed at the appropriate place. Currently they will just see a huge group of activities from the past at the top of their feed, which is not useful.
2) Similarly, we already have certain following relationships within our database and at launch I would like to go back a certain amount of time and insert activities for all entities that already have followers so that the feed is immediately useful.
Is there any way to do this, or am I out of luck?
My feeds are a combination of flat and aggregated feeds - the main timeline for a user is aggregated, but most entity feeds are flat. All of my aggregation groups would be based on the time of the activity so ideally there would be a way to sort the final aggregation groups by time as well.
Feeds on Stream are sorted differently depending on their type:
Flat feeds are sorted by activity time, descending
Aggregated feeds and Notification feeds sort activity groups based on last-updated (activities inside groups are sorted by time descending)
This means that you can back-fill flat feeds but not aggregated feeds.
One possible way to get something similar to what you describe is to create the follow relationship with activity_copy_limit set to a low number, so that only the most recent activities are propagated to followers.
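As a sketch with the stream-python client (the feed slugs, IDs, and 14-day window are made-up assumptions):

```python
import datetime
import stream  # pip install stream-python

client = stream.connect("YOUR_API_KEY", "YOUR_API_SECRET")

# Back-fill a flat entity feed: flat feeds sort by activity time, so
# an activity added with a "time" in the past lands in the right spot.
entity_feed = client.feed("entity", "acme")
entity_feed.add_activity({
    "actor": "entity:acme",
    "verb": "publish",
    "object": "article:99",
    "foreign_id": "article:99",
    "time": datetime.datetime.utcnow() - datetime.timedelta(days=14),
})

# When a user starts following the entity, cap how many historical
# activities get copied into their timeline.
client.feed("timeline_aggregated", "7").follow(
    "entity", "acme", activity_copy_limit=20
)
```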
I have a user feed. When a user posts an activity, the same follower can like it two times and the like count increases. How do I avoid that?
When I post an activity, followers can like it multiple times.
The best way to avoid that is to not send duplicate reactions to Stream in the first place; the React library already enforces this. While we do not currently enforce uniqueness for reaction kinds on the API side, support for this will be added soon.
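Until then, one application-side guard is to check for an existing reaction before writing one. A minimal sketch against the stream-python reactions API; the helper name is hypothetical, and note the check-then-write is not atomic under concurrency:

```python
import stream  # pip install stream-python

client = stream.connect("YOUR_API_KEY", "YOUR_API_SECRET")

def like_once(activity_id, user_id):
    # Hypothetical helper: skip the write if this user already liked
    # the activity. Because check-then-write can race, a unique
    # constraint in your own DB is the more robust guard.
    existing = client.reactions.filter(activity_id=activity_id, kind="like")
    if any(r["user_id"] == user_id for r in existing["results"]):
        return None  # duplicate like; ignore it
    return client.reactions.add("like", activity_id, user_id)
```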
If we have users and posts - and I can follow a user (and see all their posts) or follow a particular post (and see all its edits/updates) - would each post be pushed to two separate streams, one for the user and another for the post?
My concern is that if a user follows an idea, and also the user feed, their aggregated activity-feed could show multiple instances of the same idea, one from each feed.
Every unique activity will appear at most once in a feed. To give the activity the exact same internal ID in both feeds, you might try using the to field: this adds the activity to different feed groups with the same activity UUID.
If that is not possible, you can achieve the same uniqueness by sending identical time and foreign_id values for both copies.
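For illustration, a sketch of the to-field approach with the stream-python client (the actor, object, and feed names are made up):

```python
import datetime
import stream  # pip install stream-python

client = stream.connect("YOUR_API_KEY", "YOUR_API_SECRET")

# One add_activity call, copied to the post's own feed via "to": both
# feeds receive the same activity UUID, so a user following both the
# author and the post sees it only once in their aggregated feed.
client.feed("user", "jane").add_activity({
    "actor": "user:jane",
    "verb": "post",
    "object": "idea:5",
    "foreign_id": "idea:5",
    "time": datetime.datetime.utcnow(),
    "to": ["post:5"],
})
```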
Cheers!
Help me please,
I am new to the Cassandra world, so I need some advice.
I am trying to design a data model for a Cassandra DB.
In my project I have
- users, who can follow each other,
- articles, which can be related to many topics.
Each user can follow many topics.
So the goal is to build an aggregated feed where the user will get:
articles from all topics they follow +
articles from all friends they follow +
their own articles.
I have searched for similar tasks and found the twissandra example project.
As I understand it, in that example we store only the IDs of tweets in the timeline; when we need the timeline, we fetch the tweet IDs and then fetch each tweet by ID in a separate non-blocking request. After collecting all the tweets, we return the list of tweets to the user.
So my question is: is this efficient?
Making ~41 requests to the DB to get one page of tweets?
My second question is about followers.
When someone creates a tweet, we get all of their followers and put the tweet ID into each follower's timeline,
but what if a user has thousands of followers?
Does that mean that to create a single tweet we have to write (1 + followers_count) times to the DB?
twissandra is more of a toy example. It will work for some workloads, but you may need to partition the data further (break up huge rows).
Essentially, though, yes, it is fairly efficient - it can be made more so by including the content in the timeline itself, but depending on your requirements that may be a bad idea (e.g. if you need deleting/editing). The writes should be a non-issue: 20k writes/sec/node is reasonable provided you have adequate systems.
If I understand your use case correctly, you will probably be fine with a twissandra-like schema, but be sure to test it with expected workloads. Keep in mind that at a certain scale everything gets a little more complicated (i.e. if you expect millions of articles you will need further partitioning; see https://academy.datastax.com/demos/getting-started-time-series-data-modeling).
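To make that concrete, here is a sketch of the twissandra-style write and read paths using the DataStax Python driver; the keyspace, schema, and helpers are illustrative of the pattern, not taken from twissandra verbatim:

```python
import uuid

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

# Illustrative twissandra-style schema (create once in cqlsh):
#   CREATE TABLE articles (article_id uuid PRIMARY KEY, author text, body text);
#   CREATE TABLE timeline (user_id text, time timeuuid, article_id uuid,
#                          PRIMARY KEY (user_id, time))
#     WITH CLUSTERING ORDER BY (time DESC);

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("feeds")  # hypothetical keyspace

insert_article = session.prepare(
    "INSERT INTO articles (article_id, author, body) VALUES (?, ?, ?)")
insert_timeline = session.prepare(
    "INSERT INTO timeline (user_id, time, article_id) VALUES (?, now(), ?)")
select_article = session.prepare(
    "SELECT article_id, author, body FROM articles WHERE article_id = ?")

def publish(author, body, follower_ids):
    # Fan-out on write: 1 insert for the article plus one per follower,
    # issued concurrently. Small writes like these are cheap in Cassandra.
    article_id = uuid.uuid4()
    session.execute(insert_article, (article_id, author, body))
    execute_concurrent_with_args(
        session, insert_timeline,
        [(uid, article_id) for uid in follower_ids],
        concurrency=50)
    return article_id

def read_articles(article_ids):
    # The "~41 requests" read path: fetch every article by ID, but in
    # parallel, so total latency stays close to a single round trip.
    results = execute_concurrent_with_args(
        session, select_article, [(aid,) for aid in article_ids])
    return [rows.one() for success, rows in results if success]
```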
Please let me know whether we have to call the Foursquare Venue Categories API at a regular interval,
or whether we can call it only once, store the category list in our database, and use it for searching items.
If category IDs never change, the second scenario will work for me.
Yes, you should call the categories endpoint at a regular interval, but that interval can be large.
They do make changes to the categories - we call the endpoint once a month or so (manually, actually) to update the hierarchy that we cache on our side.
We have not seen a category ID change; rather, more categories are added over time, and maybe some are removed (not really sure about removals).
It happens rarely, but we sometimes hit an error when we encounter a category ID that we do not recognize, and then we need to go refresh the categories list and rebuild our cache.
From the API docs (https://developer.foursquare.com/docs/venues/categories):
"...please download this list only once per session, but also avoid caching this data for longer than a week to avoid stale information."
So, you can store the list in your database, but you should refresh this data at least once a week.
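A minimal sketch of that refresh policy in Python with the requests library; the in-process cache and version date are illustrative, while the endpoint and userless auth parameters come from the v2 docs:

```python
import time

import requests

CATEGORIES_URL = "https://api.foursquare.com/v2/venues/categories"
ONE_WEEK = 7 * 24 * 3600

# Simple in-process cache; in production you would persist the fetched
# category tree to your database instead.
_cache = {"fetched_at": 0.0, "categories": None}

def get_categories(client_id, client_secret):
    # Refresh at most once a week, per the docs' guidance.
    if _cache["categories"] is None or time.time() - _cache["fetched_at"] > ONE_WEEK:
        resp = requests.get(CATEGORIES_URL, params={
            "client_id": client_id,
            "client_secret": client_secret,
            "v": "20180323",  # API version date; any fixed date you have tested
        })
        resp.raise_for_status()
        _cache["categories"] = resp.json()["response"]["categories"]
        _cache["fetched_at"] = time.time()
    return _cache["categories"]
```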