News feed-like time-series data on Cassandra

I am making a website and I want to store all users' posts in one table, ordered by the time they were posted. The Cassandra data model that I made is this:
CREATE TABLE Posts(
    ID uuid,
    title text,
    insertedTime timestamp,
    postHour int,
    contentURL text,
    userID text,
    PRIMARY KEY (postHour, insertedTime)
) WITH CLUSTERING ORDER BY (insertedTime DESC);
The question I'm facing is: when a user visits the posts page, it fetches the most recent ones by querying
SELECT * FROM Posts WHERE postHour = ?;
? = current hour
So far, when the user scrolls down, AJAX requests are made to get more posts from the server. JavaScript keeps track of the postHour of the last fetched item and sends it back to the server, along with the Cassandra PagingState, when requesting new posts.
But this approach will query more than one partition as the user scrolls down.
I want to know whether this model will perform without problems, and whether there is another model I should follow instead.
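For reference, the server-side fetch is roughly this (a sketch with the Python driver; the keyspace name and page size are made up):
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(['127.0.0.1']).connect('mykeyspace')  # keyspace name assumed

def fetch_page(post_hour, paging_state=None):
    # fetch_size controls how many rows Cassandra returns per page
    stmt = SimpleStatement("SELECT * FROM Posts WHERE postHour = %s", fetch_size=20)
    result = session.execute(stmt, (post_hour,), paging_state=paging_state)
    # current_rows holds only this page; paging_state goes back to the browser
    return result.current_rows, result.paging_state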
Someone please point me in the right direction.
Thank you.

That's a good start, but a few pointers:
You'll probably need more than just the postHour as the partition key. I'm guessing you don't want to store all the posts together regardless of the day and then page through them. What you're probably after here is:
PRIMARY KEY ((postYear, postMonth, postDay, postHour), insertedTime)
But there's still a problem. Your PRIMARY KEY has to uniquely identify a row (in this case, a post). I'm going to guess it's possible, although not likely, that two users could make posts with the same insertedTime value. What you really need, then, is to add the ID to make sure rows are unique:
PRIMARY KEY ((postYear, postMonth, postDay, postHour), insertedTime, ID)
At this point, I'd consider just combining your ID and insertedTime columns into a single ID column of type timeuuid. With those changes, your final table looks like:
CREATE TABLE Posts(
    ID timeuuid,
    postYear int,
    postMonth int,
    postDay int,
    postHour int,
    title text,
    contentURL text,
    userID text,
    PRIMARY KEY ((postYear, postMonth, postDay, postHour), ID)
) WITH CLUSTERING ORDER BY (ID DESC);
Whatever programming language you're using should have a way to generate a timeuuid from the inserted time, and then to extract that time from a timeuuid value if you want to show it in the UI or something. (Or you could use the CQL timeuuid functions to do the conversion.)
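For example, in plain CQL (all values made up), you can let Cassandra generate the timeuuid on insert and convert it back to a timestamp on read; toTimestamp() is the 2.2+ replacement for the older dateOf():
-- Let the cluster generate the timeuuid:
INSERT INTO Posts (ID, postYear, postMonth, postDay, postHour, title, contentURL, userID)
VALUES (now(), 2016, 5, 12, 14, 'My first post', 'http://example.com/1', 'user42');
-- Convert it back to a timestamp when reading:
SELECT toTimestamp(ID) AS insertedTime, title FROM Posts
WHERE postYear = 2016 AND postMonth = 5 AND postDay = 12 AND postHour = 14;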
As to your question about querying multiple partitions: yes, that's totally fine to do, but you could run into trouble if you're not careful. For example, what happens if there is a 48-hour period with no posts? Do you have to issue 48 queries that return empty results before finally getting some back on your 49th query? (That's probably going to be really slow and a crappy user experience.)
There are a couple of things you could do to try to mitigate that:
Make your partitions less granular. For example, instead of partitioning posts by hour, partition them by day or by month. If you know those partitions won't get too large (i.e. users won't make so many posts that a partition gets huge), that's probably the easiest solution.
Create a second table to keep track of which partitions actually have posts in them. For example, if you were to stick with posts by hour, you could create a table like this:
CREATE TABLE post_hours (
    postYear int,
    postMonth int,
    postDay int,
    postHour int,
    PRIMARY KEY (postYear, postMonth, postDay, postHour)
);
You'd then insert into this table (using a batch) any time a user adds a new post. You can then query this table before querying the Posts table, to figure out which partitions have posts and should be queried (and thus avoid querying a whole bunch of empty partitions).
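A sketch of that write path in CQL (all values made up); note that a logged batch here buys you atomicity across the two tables, not performance, since the rows live in different partitions:
BEGIN BATCH
    INSERT INTO Posts (ID, postYear, postMonth, postDay, postHour, title, contentURL, userID)
    VALUES (now(), 2016, 5, 12, 14, 'My first post', 'http://example.com/1', 'user42');
    INSERT INTO post_hours (postYear, postMonth, postDay, postHour)
    VALUES (2016, 5, 12, 14);
APPLY BATCH;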

Related

Cassandra secondary index vs. extra table and read

I'm facing a dilemma that my small knowledge of Cassandra doesn't allow me to solve.
I have an index table used to retrieve data for an item (a notification) using an external ID. However, the data contained in that table (in this case, the status of the notification) gets modified, so I need to update the index table as well. Here is the table design:
CREATE TABLE notification_by_external_id (
    external_id text,
    partition_key_date text,
    id uuid,
    status text,
    ...
    PRIMARY KEY (external_id, partition_key_date, id)
);
CREATE TABLE notification (
    partition_key_date text,
    status text,
    id uuid,
    ...
    PRIMARY KEY (partition_key_date, status, id)
);
The problem is that when I want to update the notification status (and hence the notification_by_external_id table), I don't have access to the external ID.
So far I have come up with two solutions, neither of which seems optimal, and I can't decide which one to go with.
Solution 1
Create an index on notification_by_external_id.id, but this will obviously be a high-cardinality column. There can be several external IDs for each notification, but we're talking about something around 5-10 to one, tops.
Solution 2
Create a table
CREATE TABLE external_id_notification (
    notification_id uuid,
    external_id text,
    PRIMARY KEY (notification_id, external_id)
);
but that would mean making one extra read operation (and, of course, maintaining another table), which I understand is also bad practice.
The thing to understand about secondary indexes is that their scalability issue is not with the number of rows in the table, but with the number of nodes in your cluster. A SELECT on an indexed column means that every single node has to process it and respond to it; each individual node can process the select efficiently, but they all have to take part.
Use secondary indexes for administrative purposes (i.e. you, on cqlsh) only. Do not use them in production.
That being said: you could duplicate all the information into your external_id_notification table. That would remove the need for the extra read operation. I know that relational databases taught you that duplicate data is bad (what if the copies differ?), and that you should always normalize. But you are not on a relational database. Denormalization is a thing, and on Cassandra you should always go for it, unless you absolutely cannot.
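A sketch of what that fully duplicated table could look like (the extra columns are assumed from the notification table above):
CREATE TABLE external_id_notification (
    notification_id uuid,
    external_id text,
    partition_key_date text,
    status text,
    PRIMARY KEY (notification_id, external_id)
);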

Single update results in thousands of writes

I'm looking for a viable answer to this use case. There are music tracks, and users have playlists of tracks. Let's say a user uploads a track, then a week later decides to edit the name (or make the track private, etc.). If the track has been added to ~10k different playlists, that single edit results in ~10k writes.
It takes a single query to get all the playlists the track has been added to (using a reverse lookup table), but then the application has to loop through all 10k results and perform the respective updates on the playlist table.
The only alternative I see to this is performing a join at the application level when retrieving playlists.
This is a common use case I keep running into and would like to know how best to handle it.
CREATE TABLE tracks (
    track_id timeuuid,
    url text,
    name text,
    PRIMARY KEY (track_id)
);
CREATE TABLE playlist_ordered_by_recently_added (
    playlist_id timeuuid,
    date_added_id timeuuid,
    track_id timeuuid,
    url text,
    name text,
    PRIMARY KEY (playlist_id, date_added_id)
) WITH CLUSTERING ORDER BY (date_added_id DESC);
CREATE TABLE playlist_ordered_by_recently_added_reverse_lookup (
    track_id timeuuid,
    playlist_id timeuuid,
    date_added_id timeuuid,
    PRIMARY KEY (track_id, playlist_id)
);
The "join" approach is the correct one, though I wouldn't call it "join".
To retrieve the track list, you will need to issue a first query against playlist_ordedred_by_recently_added (which gives you all the track_id(s), which is expected to be reasonably small), followed by a bunch of parallel queries to retrieve the tracks.url and tracks.name from your tracks table.
When you update, you only need to update the tracks table to change the name, once.
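A sketch of that read path (Python driver, assuming the tables above; the keyspace name is made up):
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('music')  # keyspace name assumed

def fetch_playlist(playlist_id):
    # One query for the playlist's track ids (a reasonably small result set)...
    rows = session.execute(
        "SELECT track_id FROM playlist_ordered_by_recently_added "
        "WHERE playlist_id = %s", (playlist_id,))
    # ...then a bunch of parallel lookups against the tracks table.
    futures = [
        session.execute_async(
            "SELECT url, name FROM tracks WHERE track_id = %s", (r.track_id,))
        for r in rows]
    return [f.result().one() for f in futures]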

Distributed pagination in Cassandra

I was searching for pagination in Cassandra and found this perfect topic here: Results pagination in Cassandra (CQL), with this answer accepted by the majority of people. But I want to do the same thing on multiple computers. I'll provide an example...
The problem
Let's say I have three computers that are connected to the same Cassandra DB. Each computer wants to take a few rows from the following table:
CREATE TABLE IF NOT EXISTS lp_webmap.page (
    domain_name1st text,
    domain_name2nd text,
    domain_name3rd text,
    location text,
    title text,
    rank float,
    updated timestamp,
    PRIMARY KEY (
        (domain_name1st, domain_name2nd, domain_name3rd), location
    )
);
Every computer takes a few rows and performs time-consuming calculations on them. For a fixed partition key (domain_name1st, domain_name2nd, domain_name3rd) and varying clustering key (location), there can still be thousands of results.
And now the problem comes: how can I quickly lock the rows that computer 1 is working on, so that the other computers don't take them?
Unusable solution
In standard SQL I would use something like this:
CREATE TABLE IF NOT EXISTS lp_registry.page_lock (
    domain_name1st text,
    domain_name2nd text,
    domain_name3rd text,
    page_from int,
    page_count int,
    locked timestamp,
    PRIMARY KEY (
        (domain_name1st, domain_name2nd, domain_name3rd), locked, page_from
    )
) WITH CLUSTERING ORDER BY (locked DESC);
This would allow me to do the following (sketched in CQL after this list):
Select the first 10 pages on computer 1 and lock them (page_from=1, page_count=10)
Check locks quickly on the other two machines and get unused pages for calculations
Take and lock a bigger number of pages on faster computers
Delete all locks for a given partition key once all its pages are processed
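In CQL, that flow would look roughly like this (all values made up):
-- Computer 1 locks the first 10 pages:
INSERT INTO lp_registry.page_lock
    (domain_name1st, domain_name2nd, domain_name3rd, page_from, page_count, locked)
VALUES ('com', 'example', 'www', 1, 10, toTimestamp(now()));
-- The other computers check which pages are already taken:
SELECT page_from, page_count FROM lp_registry.page_lock
WHERE domain_name1st = 'com' AND domain_name2nd = 'example' AND domain_name3rd = 'www';
-- Cleanup once the whole partition is processed:
DELETE FROM lp_registry.page_lock
WHERE domain_name1st = 'com' AND domain_name2nd = 'example' AND domain_name3rd = 'www';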
Question
However, I can't do LIMIT 20,10 in Cassandra, and I can't use that answer's approach either, since I want to paginate from different computers. Is there any way I can paginate through these pages quickly?

How to handle a change in denormalized data

What is the best approach for updating an unindexed regular column (not part of a primary key) throughout all the tables that contain it as a duplicate?
I.e., a user posts something and that post is duplicated in many tables for fast retrieval. But when that post changes (with an edit), it needs to be updated throughout the database, in all tables that contain it (tables that have different and unknown primary keys).
Solutions I'm thinking of:
Have a mapper table to track down the primary keys in all those tables, but that seems to lead to an explosion of tables (the post is not the only property that might change).
Use Solr to do the mapping, but I fear I would be using it for the wrong purpose.
Any enlightenment would be appreciated.
EDIT (fictional schema).
What if the post changes? Or even the user's display_name?
CREATE TABLE users (
    id uuid,
    display_name text,
    PRIMARY KEY ((id))
);
CREATE TABLE posts (
    id uuid,
    post text,
    poster_id uuid,
    poster_display_name text,
    tags set<text>,
    statistics map<int, bigint>,
    PRIMARY KEY ((id))
);
CREATE TABLE posts_by_user (
    user_id uuid,
    created timeuuid,
    post text,
    post_id uuid,
    tags set<text>,
    statistics map<int, bigint>,
    PRIMARY KEY ((user_id), created)
);
It depends on the frequency of the updates. For instance, if users only update their names infrequently (a handful of times per user account), then it may be OK to use a secondary index. Just know that using a 2i is a scatter-gather, so you'll see performance issues if it's a common operation. In those cases, you'll want to use a materialized view (either the built-in ones in 3.0, or one you manage yourself) to get the list of all the posts for a given user, and then update the user's display name.
I recommend doing this in a background job, and giving the user a message like "it may take [some unit of time] for the change in your name to be reflected everywhere".
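For instance, with the built-in materialized views in 3.0, a sketch of that (the view name is assumed, based on the fictional posts table above):
CREATE MATERIALIZED VIEW posts_by_poster AS
    SELECT * FROM posts
    WHERE poster_id IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY (poster_id, id);
-- The background job reads the post ids for the user, then rewrites the duplicate:
SELECT id FROM posts_by_poster WHERE poster_id = ?;
UPDATE posts SET poster_display_name = ? WHERE id = ?;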

Cassandra/Redis: Way to create feed without Cassandra 'IN' secondary index?

I'm having a bit of an issue with my application's Cassandra integration. I'm trying to create a content feed for my users. Users can create posts, which, in turn, have the field user_id. I'm using Redis for the entire social graph and Cassandra solely for the objects. In Redis, user 1 has a set named user:1:followers with all of his/her follower IDs. These follower IDs correspond to the ids in the Cassandra users table and the user_ids in the posts table.
My goal was originally to simply plug all of the user_ids from this Redis set into a query that would use FROM posts WHERE user_id IN (user_ids here) and grab all of the posts via a secondary index on user_id. The issue is that Cassandra purposely does not support the IN operator on secondary indexes, because such a query would force Cassandra to search ALL of its nodes for that value. As far as I can see, I'm left with only two options: either create a Redis list named user:1:follow_feed holding the post IDs and then fetch those posts via Cassandra's primary key in a single query, or keep it the way I have it now and run an individual query for every user_id in the user:1:follower set.
I'm really leaning against the first option, because I already have tons and tons of graph data in Redis, and that option would add a new list for every user. The second way is far worse: I would put a massive read load on Cassandra, and running individual queries for a whole set of IDs would take a long time. I'm kind of stuck between a rock and a hard place, as far as I can see. Is there any way to query secondary indexes with multiple values? If not, is there a more efficient way (RAM- and speed-wise) to load these content feeds than more Redis lists or multiple Cassandra queries? Thanks in advance.
Without knowing the schema of the posts table (and preferably the others, as well), it's really hard to make any useful suggestions.
It's unclear to me why you need user_id to be a secondary index, as opposed to part of your primary key.
In general it's quite useful to key content like posts off of the user that created it, since it allows you to do things like retrieve all posts (optionally over a given range, assuming they are chronologically sorted) very efficiently.
With Cassandra, if you find that a table can effectively answer some of the queries you want to perform but not others, you are usually best off denormalizing that table and creating another table with a different structure, in order to keep each of your queries to a single CQL partition and node.
CREATE TABLE posts (
    user_id int,
    post_id int,
    post_text text,
    PRIMARY KEY (user_id, post_id)
) WITH CLUSTERING ORDER BY (post_id DESC);
This table can answer queries such as:
select * from posts where user_id = 1234;
select * from posts where user_id = 1 and post_id = 53;
select * from posts where user_id = 1 and post_id > 5321 and post_id < 5400;
The reverse clustering on post_id makes retrieving the most recent posts the most efficient, by placing them physically at the beginning of the partition within the SSTable.
In that example, user_id being a partition column means that all CQL rows with this user_id will be hashed to the same partition, and hence the same physical nodes and, eventually, the same SSTables. That's why it's possible to:
retrieve all posts with that user_id, as they are stored contiguously
retrieve a slice of them by doing a ranged query on post_id
retrieve a single post by supplying both the partition column (user_id) and the clustering column (post_id)
In effect, this becomes a hashmap-of-hashmaps lookup. The one major caveat, though, is that when using partition and clustering columns, you always need to supply all columns from left to right in your query, without skipping any. So in this case, that means you can't retrieve an individual post without knowing the user_id that the post_id belongs to. That is addressable in user code (by storing a reverse mapping and doing the lookup when necessary, or by encoding the user_id into the post_id that is passed around your application), but it is definitely something to take into consideration.
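For the reverse-mapping route, the lookup table can be as simple as this sketch (the table name is assumed):
CREATE TABLE post_owner (
    post_id int PRIMARY KEY,
    user_id int
);
-- One extra lookup turns a bare post_id into the (user_id, post_id) pair:
SELECT user_id FROM post_owner WHERE post_id = ?;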
