Cassandra CQL3 clustering order and pagination

I am building a user-favourites service using Cassandra. I want the favourites sorted by most recently favourited, and then to be able to paginate over the track_ids, i.e. the front end sends back the last track_id of the previous page of 200.
CREATE TABLE user_favorites (
  user_id uuid,
  track_id int,
  favourited_date timestamp,
  PRIMARY KEY ((user_id), favourited_date)
) WITH CLUSTERING ORDER BY (favourited_date DESC);
I've tried different combinations of primary and clustering keys but to no avail.
I am wondering if it is better to split this out over multiple tables also.

I solved it using the comment suggesting that the Java driver base64-encode the PagingState and return it to the client.
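The driver hands back an opaque paging state; the round trip through the front end can be sketched in Python, assuming `raw_paging_state` is the byte string obtained from the driver (the encoding helpers here are illustrative, not part of any driver API):

```python
import base64

def encode_page_token(raw_paging_state: bytes) -> str:
    # The paging state is opaque bytes; base64 makes it safe to embed
    # in a JSON response or a URL query parameter for the front end.
    return base64.urlsafe_b64encode(raw_paging_state).decode("ascii")

def decode_page_token(token: str) -> bytes:
    # Decode the token the client sends back, then hand the raw bytes
    # to the driver to resume fetching from the next page.
    return base64.urlsafe_b64decode(token.encode("ascii"))

# Round trip: the token survives transport unchanged.
state = b"\x00\x08\x00\x04user\x00"
token = encode_page_token(state)
assert decode_page_token(token) == state
```

The client treats the token as a black box: it never inspects it, only echoes it back to request the next page.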

Table layout for social app in YugabyteDB

[Question posted by a user on YugabyteDB Community Slack]
I was trying to see if we can avoid data denormalization using YB's secondary indexes. The primary table is something like below:
CREATE TABLE posts_by_user (
  user_id bigint,
  post_id bigserial,
  group_ids bigint[] null,
  tag_ids bigint[] null,
  content text null,
  ....
  PRIMARY KEY (user_id, post_id)
)
-- there can be multiple group ids (up to 20) that a user selects to publish his/her post in
-- there can be multiple tag ids (up to 20) that a user selects to publish his/her post with
This structure makes fetching by user_id easy. But suppose I want to fetch by group_id(s) or tag_id(s): either I will need to denormalize into secondary tables using a YB transaction, which requires additional app logic and could also hurt performance, because data will be written to multiple nodes based on the hash partition keys (group_ids and tag_ids).
Or I could use a secondary index to avoid writing additional logic. I have the following doubts about that:
YB stable version 2.8 does not allow creating a GIN secondary index on array columns; are there any rough timelines for when this will be available in a stable release?
Will this suffer the same performance issue, since multiple index entries will be updated on each client call, across multiple nodes, based on the partition keys (group_id(s) or tag_id(s))?
Other ideas are also most welcome for storing the data so that queries by user_id(s), group_id(s), and tag_id(s) stay fast and scalable.
The problem with a GIN index is that it won't be sorted on disk by the timestamp.
You have to create an index on (user_id, datetime DESC).
For groups you can maintain a separate table with a primary key of (group_id, datetime DESC, post_id DESC), and the same for tags.
Then on each feed request you can issue multiple queries for, say, 5 posts per user_id or group_id, and merge them in the application layer.
This will be the most efficient approach, since all records are sorted on disk and in memory at write time.
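Because each per-user or per-group query already returns its posts newest-first, the application-layer merge is a k-way merge. A minimal sketch, with plain `(timestamp, post_id)` tuples standing in for rows (illustrative data, not a YB API):

```python
import heapq

def merge_feeds(*feeds, limit=10):
    """k-way merge of per-source result lists, each already sorted
    newest-first by timestamp (the first tuple element)."""
    merged = heapq.merge(*feeds, key=lambda post: post[0], reverse=True)
    return list(merged)[:limit]

user_posts  = [(105, "u-post-3"), (90, "u-post-2")]
group_posts = [(110, "g-post-7"), (95, "g-post-6")]
tag_posts   = [(100, "t-post-1")]

feed = merge_feeds(user_posts, group_posts, tag_posts, limit=3)
# The newest three posts across all sources.
assert feed == [(110, "g-post-7"), (105, "u-post-3"), (100, "t-post-1")]
```

`heapq.merge` never materializes the full combined list, so the cost stays proportional to the page size rather than to the number of sources times their page lengths.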

Single update results in thousands of writes

I'm looking for a viable answer to this use case. There are music tracks, and users have playlists of tracks. Let's say a user uploads a track, then a week later decides to edit the name (or make the track private, etc). If the track has been added to ~10k different playlists, that single edit results in ~10k writes.
It takes a single query to get all the playlists the track has been added to using
a reverse lookup table, then the application has to loop through all 10k
results and perform the respective updates on the playlist table.
The only alternative I see to this is performing a join at the application level when retrieving playlists.
This is a common use case I keep running into and would like to know how best to handle it.
CREATE TABLE tracks (
  track_id timeuuid,
  url text,
  name text,
  PRIMARY KEY (track_id)
);
CREATE TABLE playlist_ordered_by_recently_added (
  playlist_id timeuuid,
  date_added_id timeuuid,
  track_id timeuuid,
  url text,
  name text,
  PRIMARY KEY (playlist_id, date_added_id)
) WITH CLUSTERING ORDER BY (date_added_id DESC);
CREATE TABLE playlist_ordered_by_recently_added_reverse_lookup (
  track_id timeuuid,
  playlist_id timeuuid,
  date_added_id timeuuid,
  PRIMARY KEY (track_id, playlist_id)
);
The "join" approach is the correct one, though I wouldn't call it a "join".
To retrieve the track list, issue a first query against playlist_ordered_by_recently_added (which gives you all the track_ids, a list expected to be reasonably small), followed by a batch of parallel queries to retrieve tracks.url and tracks.name from your tracks table.
When you update, you only need to update the tracks table to change the name, once.
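That fan-out read can be sketched with a thread pool; here plain dicts stand in for the two tables, and `fetch_track` stands in for the real per-track SELECT (all names are illustrative, not a driver API):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the tables: the canonical tracks table, and the
# track_ids returned by the first query against the playlist table.
tracks = {
    "t1": {"url": "u1", "name": "First"},
    "t2": {"url": "u2", "name": "Second"},
}
playlist = ["t2", "t1"]  # newest first, as the playlist table returns them

def fetch_track(track_id):
    # In a real app this would be an async SELECT against `tracks`.
    return tracks[track_id]

# Issue the per-track lookups in parallel; pool.map preserves the
# playlist ordering in the results.
with ThreadPoolExecutor(max_workers=8) as pool:
    full_playlist = list(pool.map(fetch_track, playlist))

assert [t["name"] for t in full_playlist] == ["Second", "First"]
```

Because name and url live only in `tracks`, the edit touches one row, and every playlist read picks up the new value automatically.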

Ordering rows cross-nodes in cassandra

I have a table:
CREATE TABLE sessions (
  session_id timeuuid,
  app_id text,
  PRIMARY KEY (session_id, app_id)
)
To have a good data distribution across nodes, I need to have the Partition key set as the session_id (as I expect millions of such sessions).
How can I get rows in DESC order when fetching the sessions that fall into a specific array of session ids? Something like this:
this.cassandraClient
.query()
.select("*")
.from("sessions")
.where("session_id", "in", instancesIds)
You can't do this directly with Cassandra and this table design. ASC/DESC ordering applies only within a single partition, not across partitions, and with session_id as the partition key each session is its own partition. You'll need to sort the results in your client.
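The client-side sort is simple once the IN query has fetched a row from each partition. In this sketch the rows are plain dicts, and the timeuuid is represented by an already-sortable number (an assumption for illustration; a real timeuuid needs its embedded timestamp extracted first):

```python
# Rows returned by the IN query, one per session partition,
# in no guaranteed order across partitions.
rows = [
    {"session_id": 3, "app_id": "a"},
    {"session_id": 1, "app_id": "b"},
    {"session_id": 2, "app_id": "a"},
]

# Sort newest-first on the client, since Cassandra cannot order
# results that span multiple partitions.
rows_desc = sorted(rows, key=lambda r: r["session_id"], reverse=True)
assert [r["session_id"] for r in rows_desc] == [3, 2, 1]
```

Since the IN list is bounded, this sort is over at most `len(instancesIds)` rows and is cheap compared with the network round trips.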

Design data model for messaging system with Cassandra

I am new to Cassandra and trying to build a data model for a messaging system. I found a few solutions, but none of them exactly matches my requirements. There are two main requirements:
Get a list of last messages for a particular user, from all other users, sorted by time.
Get a list of messages for one-to-one message history, sorted by time as well.
I thought of something like this,
CREATE TABLE chat (
  to_user text,
  from_user text,
  time text,
  msg text,
  PRIMARY KEY ((to_user, from_user), time)
) WITH CLUSTERING ORDER BY (time DESC);
But this design has a few issues. I won't be able to satisfy the first requirement, since this design requires passing from_user as well. It would also be inefficient as the number of (to_user, from_user) pairs grows.
You are right: that one table won't satisfy both queries, so you will need two tables, one for each query. This is a core concept in Cassandra data modeling: query-driven design.
So the query looking for messages to a user:
CREATE TABLE chat_by_user (
  to_user text,
  from_user text,
  time text,
  msg text,
  PRIMARY KEY ((to_user), time)
) WITH CLUSTERING ORDER BY (time DESC);
Messages from a user to another user.
CREATE TABLE chat_by_conversation (
  to_user text,
  from_user text,
  time text,
  msg text,
  PRIMARY KEY ((to_user), from_user, time)
) WITH CLUSTERING ORDER BY (from_user ASC, time DESC);
The slight difference from yours: from_user is a clustering column, not part of the partition key. This is to minimize the number of SELECT queries needed in application code.
It's possible to use the second table to satisfy both queries, but you will have to supply the from_user to use a range query on time.
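The two access patterns can be emulated over an in-memory message list, with plain tuples standing in for rows; this only illustrates which columns each query filters and sorts on (the data and helper names are made up):

```python
# (to_user, from_user, time, msg) rows, as they might live in the tables.
messages = [
    ("alice", "bob",   3, "hey"),
    ("alice", "carol", 5, "hi"),
    ("alice", "bob",   1, "yo"),
]

def inbox(to_user):
    """Query 1: all messages to a user, newest first (first table:
    partition on to_user, cluster on time DESC)."""
    rows = [m for m in messages if m[0] == to_user]
    return sorted(rows, key=lambda m: m[2], reverse=True)

def conversation(to_user, from_user):
    """Query 2: one-to-one history, newest first (second table:
    partition on to_user, cluster on from_user, then time DESC)."""
    rows = [m for m in messages if m[0] == to_user and m[1] == from_user]
    return sorted(rows, key=lambda m: m[2], reverse=True)

assert [m[3] for m in inbox("alice")] == ["hi", "hey", "yo"]
assert [m[3] for m in conversation("alice", "bob")] == ["hey", "yo"]
```

In Cassandra itself no application-side sorting happens: each table's clustering order returns the rows already in the order the function's `sorted` call produces here.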

How to handle a change in denormalized data

What is the best approach to updating an un-indexed regular column (not part of any primary key) across all the tables that contain it as a duplicate?
I.e. a user posts something, and that post is duplicated into many tables for fast retrieval. But when the post changes (with an edit), it needs to be updated throughout the database, in every table that contains it (tables with different, unknown primary keys).
Solutions I'm thinking of:
Have a mapper table to track the primary keys in all those tables, but that seems to lead to table explosion (the post is not the only property that might change).
Use Solr to do the mapping, but I fear I would be using it for the wrong purpose.
Any enlightenments would be appreciated.
EDIT (fictional schema).
What if the post changes? Or even the user's display_name?
CREATE TABLE users (
  id uuid,
  display_name text,
  PRIMARY KEY ((id))
);
CREATE TABLE posts (
  id uuid,
  post text,
  poster_id uuid,
  poster_display_name text,
  tags set<text>,
  statistics map<int, bigint>,
  PRIMARY KEY ((id))
);
CREATE TABLE posts_by_user (
  user_id uuid,
  created timeuuid,
  post text,
  post_id uuid,
  tags set<text>,
  statistics map<int, bigint>,
  PRIMARY KEY ((user_id), created)
);
It depends on the frequency of the updates. For instance, if users update their names only infrequently (a handful of times per user account), then it may be OK to use a secondary index. Just know that a secondary-index query is a scatter-gather, so you'll see performance issues if it's a common operation. In those cases, you'll want a materialized view (either the built-in ones in Cassandra 3.0, or one you manage yourself) to get the list of all the posts for a given user, then update the user's display name in each.
I recommend doing this in a background job, and giving the user a message like "it may take [some unit of time] for the change in your name to be reflected everywhere".
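A minimal sketch of that background fan-out, assuming a manually managed view keyed by poster; in-memory dicts stand in for the tables (the table names follow the schema above, the rest is illustrative):

```python
# In-memory stand-ins for the tables in the schema above.
users = {"u1": {"display_name": "Old Name"}}
posts = {
    "p1": {"poster_id": "u1", "poster_display_name": "Old Name"},
    "p2": {"poster_id": "u1", "poster_display_name": "Old Name"},
}
# Manually managed "materialized view": poster id -> their post ids.
posts_by_poster = {"u1": ["p1", "p2"]}

def rename_user(user_id, new_name):
    """Background job: update the canonical row first, then fan the
    new value out to every denormalized copy of the display name."""
    users[user_id]["display_name"] = new_name
    for post_id in posts_by_poster.get(user_id, []):
        posts[post_id]["poster_display_name"] = new_name

rename_user("u1", "New Name")
assert users["u1"]["display_name"] == "New Name"
assert all(p["poster_display_name"] == "New Name" for p in posts.values())
```

Writing the canonical row first means a reader who races the job sees at worst a stale duplicate, never a lost update; the job can be retried idempotently until every copy converges.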
