Single update results in thousands of writes - cassandra

I'm looking for a viable answer to this use case. There are music tracks, and users have playlists of tracks. Let's say a user uploads a track, then a week later decides to edit the name (or make the track private, etc). If the track has been added to ~10k different playlists, that single edit results in ~10k writes.
It takes a single query to get all the playlists the track has been added to using
a reverse lookup table, then the application has to loop through all 10k
results and perform the respective updates on the playlist table.
The only alternative I see to this is performing a join at the application level when retrieving playlists.
This is a common use case I keep running into and would like to know how best to handle it.
CREATE TABLE tracks (
track_id timeuuid,
url text,
name text,
PRIMARY KEY (track_id)
)
CREATE TABLE playlist_ordered_by_recently_added (
playlist_id timeuuid,
date_added_id timeuuid,
track_id timeuuid,
url text,
name text,
PRIMARY KEY (playlist_id, date_added_id)
) WITH CLUSTERING ORDER BY (date_added_id DESC)
CREATE TABLE playlist_ordered_by_recently_added_reverse_lookup (
track_id timeuuid,
playlist_id timeuuid,
date_added_id timeuuid,
PRIMARY KEY (track_id, playlist_id)
)

The "join" approach is the correct one, though I wouldn't call it a "join".
To retrieve the track list, you first issue a query against playlist_ordered_by_recently_added (which gives you all the track_id(s), a set that is expected to be reasonably small), followed by a bunch of parallel queries to retrieve tracks.url and tracks.name from your tracks table.
When you update, you only need to update the tracks table to change the name, once.
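Concretely, the read path becomes one partition query plus per-track lookups, and the edit becomes a single write. A sketch, assuming url and name are dropped from the playlist table:

```sql
-- 1) one query for the playlist partition (returns the track ids in order)
SELECT date_added_id, track_id
FROM playlist_ordered_by_recently_added
WHERE playlist_id = ?;

-- 2) parallel lookups from the application, one per track_id
SELECT url, name FROM tracks WHERE track_id = ?;

-- 3) a later edit touches exactly one row, no matter how many playlists reference the track
UPDATE tracks SET name = ? WHERE track_id = ?;
```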

Cassandra secondary vs extra table and read

I'm facing a dilemma that my small knowledge of Cassandra doesn't allow me to solve.
I have an index table used to retrieve data about an item (a notification) using an external ID. However, the data contained in that table (in this case the status of the notification) can be modified, so I need to update the index table as well. Here is the table design:
TABLE notification_by_external_id (
external_id text,
partition_key_date text,
id uuid,
status text,
...
PRIMARY KEY (external_id, partition_key_date, id)
);
TABLE notification (
partition_key_date text,
status text,
id uuid,
...
PRIMARY KEY (partition_key_date, status, id)
);
The problem is that when I want to update the notification status (and hence the notification_by_external_id table), I don't have access to the external ID.
So far I have come up with 2 solutions, none of which seems optimal, and I can't decide which one to go with.
Solution 1
Create an index on notification_by_external_id.id, but this will obviously be a high-cardinality column. There can be several external IDs for each notification, but we're talking about something around 5-10 to one at most.
Solution 2
Create a table
TABLE external_id_notification (
notification_id uuid,
external_id text,
PRIMARY KEY (notification_id, external_id)
);
but that would mean making one extra read operation (and of course maintaining another table), which I understand is also a bad practice.
The thing to understand about secondary indexes is that their scalability issue is not with the number of rows in the table, but with the number of nodes in your cluster. A select on an indexed column means that every single node has to process it and respond to it, even though each node can evaluate its local part of the select efficiently.
Use secondary indexes for administrative purposes (i.e. you on cqlsh) only. Do not use them in production.
That being said, you could duplicate all the information into your external_id_notification table. That would alleviate the need for an extra read operation. I know relational databases taught you that duplicate data is bad (what if the copies differ?), and that you should always normalize. But you are not on a relational database. Denormalization is a thing, and in Cassandra you should always go for it, unless you absolutely cannot.
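A sketch of what that denormalized table could look like (the exact column set is an assumption; carry whatever you need at read time), with a logged batch keeping both copies in step. The application reads the external IDs for the notification once, then issues the batch per external ID:

```sql
CREATE TABLE external_id_notification (
    notification_id uuid,
    external_id text,
    partition_key_date text,
    status text,
    PRIMARY KEY (notification_id, external_id)
);

-- update both copies of the status atomically (from the application's point of view)
BEGIN BATCH
    UPDATE external_id_notification
       SET status = ?
     WHERE notification_id = ? AND external_id = ?;
    UPDATE notification_by_external_id
       SET status = ?
     WHERE external_id = ? AND partition_key_date = ? AND id = ?;
APPLY BATCH;
```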

Design data model for messaging system with Cassandra

I am new to Cassandra and trying to build a data model for a messaging system. I found a few solutions, but none of them exactly matches my requirements. There are two main requirements:
Get a list of last messages for a particular user, from all other users, sorted by time.
Get a list of messages for one-to-one message history, sorted by time as well.
I thought of something like this,
CREATE TABLE chat (
to_user text,
from_user text,
time text,
msg text,
PRIMARY KEY((to_user,from_user),time)
) WITH CLUSTERING ORDER BY (time DESC);
But this design has a few issues: I won't be able to satisfy the first requirement, since this design requires passing from_user as well. It would also be inefficient as the number of (to_user, from_user) pairs increases.
You are right. That one table won't satisfy both queries, so you will need two tables, one for each query. This is a core concept in Cassandra data modeling: query-driven design.
So the query looking for messages to a user:
CREATE TABLE chat_to_user (
to_user text,
from_user text,
time text,
msg text,
PRIMARY KEY ((to_user), time)
) WITH CLUSTERING ORDER BY (time DESC);
Messages from one user to another (note the table needs a distinct name, since both can't be called chat):
CREATE TABLE chat_by_sender (
to_user text,
from_user text,
time text,
msg text,
PRIMARY KEY ((to_user), from_user, time)
) WITH CLUSTERING ORDER BY (from_user ASC, time DESC);
Slight difference from yours: from_user is a clustering column and not part of the partition key. This is to minimize the number of select queries needed in application code.
It's possible to use the second table to satisfy both queries, but you will have to supply the 'from_user' to use a range query on time.
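The two queries then look like this (the table names chat_to_user and chat_by_sender are just placeholders to keep the two definitions distinct):

```sql
-- requirement 1: latest messages to a user, across all senders
SELECT from_user, time, msg
FROM chat_to_user
WHERE to_user = ?
LIMIT 50;

-- requirement 2: one-to-one history, newest first
SELECT time, msg
FROM chat_by_sender
WHERE to_user = ? AND from_user = ?
LIMIT 50;
```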

news feed like time-series data on cassandra

I am making a website and I want to store all users' posts in one table, ordered by the time they posted. The Cassandra data model that I made is this:
CREATE TABLE Posts(
ID uuid,
title text,
insertedTime timestamp,
postHour int,
contentURL text,
userID text,
PRIMARY KEY (postHour, insertedTime)
) WITH CLUSTERING ORDER BY (insertedTime DESC);
The question I'm facing is: when a user visits the posts page, it fetches the most recent posts by querying
SELECT * FROM Posts WHERE postHour = ?;
? = current hour
So far, when the user scrolls down, AJAX requests are made to get more posts from the server. JavaScript keeps track of the postHour of the last fetched item and sends it back to the server, along with the Cassandra PagingState, when requesting new posts.
But this approach will query more than one partition as the user scrolls down.
I want to know whether this model will perform without problems, and whether there is another model I could follow.
Someone please point me in the right direction.
Thank You.
That's a good start but a few pointers:
You'll probably need more than just postHour as the partition key. I'm guessing you don't want to store all the posts, regardless of the day, together and then page through them. What you're probably after here is:
PRIMARY KEY ((postYear, postMonth, postDay, postHour), insertedTime)
But there's still a problem. Your PRIMARY KEY has to uniquely identify a row (in this case a post). I'm going to guess it's possible, although not likely, that two users might make a post with the same insertedTime value. What you really need then is to add the ID to make sure they are unique:
PRIMARY KEY ((postYear, postMonth, postDay, postHour), insertedTime, ID)
At this point, I'd consider just combining your ID and insertedTime columns into a single ID column of type timeuuid. With those changes, your final table looks like:
CREATE TABLE Posts(
ID timeuuid,
postYear int,
postMonth int,
postDay int,
postHour int,
title text,
contentURL text,
userID text,
PRIMARY KEY ((postYear, postMonth, postDay, postHour), ID)
) WITH CLUSTERING ORDER BY (ID DESC);
Whatever programming language you're using should have a way to generate a timeuuid from the inserted time and then extract that time from a timeuuid value if you want to show it in the UI or something. (Or you could use the CQL timeuuid functions for doing the converting.)
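For example, with the CQL functions (toTimestamp to display the insert time, maxTimeuuid to build a range bound; the literal values are just placeholders):

```sql
-- newest posts in one hour bucket, with the insert time extracted from the id
SELECT toTimestamp(ID) AS posted_at, title, contentURL, userID
FROM Posts
WHERE postYear = 2017 AND postMonth = 5 AND postDay = 1 AND postHour = 14;

-- page within the bucket using a timeuuid range bound
SELECT title, userID
FROM Posts
WHERE postYear = 2017 AND postMonth = 5 AND postDay = 1 AND postHour = 14
  AND ID < maxTimeuuid('2017-05-01 14:30:00');
```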
As to your question about querying multiple partitions, yes, that's totally fine to do, but you could run into trouble if you're not careful. For example, what happens if there is a 48 hour period with no posts? Do you have to issue 48 queries that return empty results before finally getting some back on your 49th query? (That's probably going to be really slow and a crappy user experience.)
There are a couple things you could do to try and mitigate that:
Make your partitions less granular. For example, instead of doing posts by hour, make it posts by day, or posts by month. If you know that those partitions won't get too large (i.e. users won't make so many posts that the partition gets huge), that's probably the easiest solution.
Create a second table to keep track of which partitions actually have posts in them. For example, if you were to stick with posts by hour, you could create a table like this:
CREATE TABLE post_hours (
postYear int,
postMonth int,
postDay int,
postHour int,
PRIMARY KEY (postYear, postMonth, postDay, postHour)
);
You'd then insert into this table (using a Batch) anytime a user adds a new post. You can then query this table first before you query the Posts table to figure out which partitions have posts and should be queried (and thus avoid querying a whole bunch of empty partitions).
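The insert path could then look like this (a sketch; a logged batch keeps the two tables consistent at the cost of some write latency):

```sql
BEGIN BATCH
    INSERT INTO Posts (postYear, postMonth, postDay, postHour, ID, title, contentURL, userID)
    VALUES (?, ?, ?, ?, now(), ?, ?, ?);
    INSERT INTO post_hours (postYear, postMonth, postDay, postHour)
    VALUES (?, ?, ?, ?);
APPLY BATCH;
```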

Distributed pagination in Cassandra

I was searching for pagination in Cassandra and found this perfect topic here: Results pagination in Cassandra (CQL), with an answer accepted by the majority of people. But I want to do the same thing on multiple computers. I'll provide an example...
The problem
Let's say I have three computers that are connected to the same Cassandra DB. Each computer wants to take a few rows from the following table:
CREATE TABLE IF NOT EXISTS lp_webmap.page (
domain_name1st text,
domain_name2nd text,
domain_name3rd text,
location text,
title text,
rank float,
updated timestamp,
PRIMARY KEY (
(domain_name1st, domain_name2nd, domain_name3rd), location
)
);
Every computer takes a few rows and performs time-consuming calculations on them. For a fixed partition key (domain_name1st, domain_name2nd, domain_name3rd) and different clustering keys (location), there can still be thousands of results.
And now the problem comes: how can computer1 quickly lock the rows it is working on, so that the other computers skip them?
Unusable solution
In standard SQL I would use something like this:
CREATE TABLE IF NOT EXISTS lp_registry.page_lock (
domain_name1st text,
domain_name2nd text,
domain_name3rd text,
page_from int,
page_count int,
locked timestamp,
PRIMARY KEY (
(domain_name1st, domain_name2nd, domain_name3rd), locked, page_from
)
) WITH CLUSTERING ORDER BY (locked DESC);
This would allow me to do following:
Select the first 10 pages on computer1 and lock them (page_from=1, page_count=10)
Check the locks quickly on the other two machines and get unused pages for calculations
Take and lock a bigger amount of pages on the faster computers
Delete all locks for given partition key after all pages are processed
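The lock-claim step could be expressed in Cassandra with a lightweight transaction. A sketch on a simplified, hypothetical lock table (this still leaves the pagination part of the question open):

```sql
CREATE TABLE IF NOT EXISTS lp_registry.page_lock (
    domain_name1st text,
    domain_name2nd text,
    domain_name3rd text,
    page_from int,
    page_count int,
    locked timestamp,
    PRIMARY KEY ((domain_name1st, domain_name2nd, domain_name3rd), page_from)
);

-- computer1 claims pages starting at page_from=1; if two machines race,
-- the conditional insert succeeds on exactly one of them
INSERT INTO lp_registry.page_lock
    (domain_name1st, domain_name2nd, domain_name3rd, page_from, page_count, locked)
VALUES (?, ?, ?, 1, 10, toTimestamp(now()))
IF NOT EXISTS;
```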
Question
However, I can't do LIMIT 20,10 in Cassandra, and I also can't use that approach as-is, since I want to paginate from different computers. Is there any way I can paginate through these pages quickly?

How to handle a change in denormalized data

What is the best approach for updating an un-indexed regular column (not part of a primary key) throughout all the tables containing it as a duplicate?
i.e. the user posts something and that post is duplicated in many tables for fast retrieval. But when that post changes (with an edit), it needs to be updated throughout the database, in all the tables that contain it (tables that have different and unknown primary keys).
Solutions I'm thinking of:
Have a mapper table to track down the primary keys in all those tables, but this seems to lead to a table explosion (the post is not the only property that might change).
Use Solr to do the mapping, but I fear I would be using it for the wrong purpose.
Any enlightenments would be appreciated.
EDIT (fictional schema).
What if the post changes? or even the user's display_name?
CREATE TABLE users (
id uuid,
display_name text,
PRIMARY KEY ((id))
);
CREATE TABLE posts (
id uuid,
post text,
poster_id uuid,
poster_display_name text,
tags set<text>,
statistics map<int, bigint>,
PRIMARY KEY ((id))
);
CREATE TABLE posts_by_user (
user_id uuid,
created timeuuid,
post text,
post_id uuid,
tags set<text>,
statistics map<int, bigint>,
PRIMARY KEY ((user_id), created)
);
It depends on the frequency of the updates. For instance, if users only update their names infrequently (a handful of times per user account), then it may be OK to use a secondary index. Just know that using a 2i is a scatter-gather operation, so you'll see performance issues if it's a common operation. In those cases, you'll want to use a materialized view (either the built-in ones in 3.0 or one you manage yourself) to be able to get the list of all the posts for a given user, then update the user's display name.
I recommend doing this in a background job, and giving the user a message like "it may take [some unit of time] for the change in your name to be reflected everywhere".
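A 3.0 materialized view over this schema could look like the following sketch (the view name is made up; note that every base primary key column must appear in the view's key, and the IS NOT NULL clauses are required):

```sql
CREATE MATERIALIZED VIEW posts_by_poster AS
    SELECT id, poster_id, poster_display_name
    FROM posts
    WHERE poster_id IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY (poster_id, id);

-- background job: list the user's posts, then fix each denormalized copy
SELECT id FROM posts_by_poster WHERE poster_id = ?;
UPDATE posts SET poster_display_name = ? WHERE id = ?;
```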
