Distributed pagination in Cassandra

I was searching for how to paginate in Cassandra and found this topic: Results pagination in Cassandra (CQL), with an answer accepted by the majority of people. But I want to do the same thing across multiple computers. Here is an example.
The problem
Let's say I have three computers connected to the same Cassandra DB. Each computer wants to take a few rows from the following table:
CREATE TABLE IF NOT EXISTS lp_webmap.page (
domain_name1st text,
domain_name2nd text,
domain_name3rd text,
location text,
title text,
rank float,
updated timestamp,
PRIMARY KEY (
(domain_name1st, domain_name2nd, domain_name3rd), location
)
);
Every computer takes a few rows and performs time-consuming calculations on them. For a fixed partition key (domain_name1st, domain_name2nd, domain_name3rd) and varying clustering key (location), there can still be thousands of results.
And now the problem: how can computer 1 quickly lock the rows it is working on, so that the other computers skip them?
Unusable solution
In standard SQL I would use something like this:
CREATE TABLE IF NOT EXISTS lp_registry.page_lock (
domain_name1st text,
domain_name2nd text,
domain_name3rd text,
page_from int,
page_count int,
locked timestamp,
PRIMARY KEY (
(domain_name1st, domain_name2nd, domain_name3rd), locked, page_from
)
) WITH CLUSTERING ORDER BY (locked DESC);
This would allow me to do the following:
Select the first 10 pages on computer 1 and lock them (page_from=1, page_count=10)
Check locks quickly on the other two machines and get unused pages for calculations
Take and lock a larger number of pages on faster computers
Delete all locks for the given partition key after all pages are processed
Question
However, I can't do LIMIT 20,10 in Cassandra, and I also can't use the approach from the linked answer, since I want to paginate on different computers. Is there any way to paginate through these pages quickly?

Related

Cassandra read performance slowly decreases over time

We have a Cassandra cluster that consists of six nodes with 4 CPUs and 16 GB RAM each, backed by shared storage (SSD). I'm aware that shared storage is considered bad practice for Cassandra, but ours is limited to 3 Gb/s on reads and seems to be reliable against exigent disk requirements.
Cassandra is used as an operational database for continuous stream processing.
Initially Cassandra serves requests at ~1,700 rps and it looks nice:
The initial proxyhistograms:
But after a few minutes the performance starts to decrease, and over the next two hours it becomes more than three times worse.
At the same time we observe that the IOWait time increases:
And proxyhistograms show the following picture:
We can't understand the reasons that lie behind such behaviour. Any assistance is appreciated.
EDITED:
Table definitions:
CREATE TABLE IF NOT EXISTS subject.record(
subject_id UUID,
package_id text,
type text,
status text,
ch text,
creation_ts timestamp,
PRIMARY KEY((subject_id, status), creation_ts)
) WITH CLUSTERING ORDER BY (creation_ts DESC);
CREATE TABLE IF NOT EXISTS subject.c_record(
c_id UUID,
s_id UUID,
creation_ts timestamp,
ch text,
PRIMARY KEY(c_id, creation_ts, s_id)
) WITH CLUSTERING ORDER BY (creation_ts DESC);
CREATE TABLE IF NOT EXISTS subject.s_by_a(
s int,
number text,
hold_number int,
hold_type text,
s_id UUID,
PRIMARY KEY(
(s, number),
hold_type,
hold_number,
s_id
)
);
far from 100 Mb
While opinions vary on this, keeping your partitions in the 1MB to 2MB range is optimal. Cassandra typically doesn't perform well when returning large result sets. Keeping the partition size small helps queries perform better.
Without knowing what queries are being run, I can say that when queries deteriorate over time, time is usually the problem. Take this PRIMARY KEY definition, for example:
PRIMARY KEY((subject_id, status), creation_ts)
This tells Cassandra to store the data in a partition (hashed from a concatenation of subject_id and status), then to sort and enforce uniqueness by creation_ts. The problem is that there doesn't appear to be an inherent way to limit the size of the partition. As the clustering key is a timestamp, each new entry to a particular partition makes it larger over time.
Also, status by definition is temporary and subject to change. Because status is part of the partition key, every status update would require deleting the row from one partition and recreating it in another. When modeling systems like this, I usually recommend keeping status as a non-key column with a secondary index. While secondary indexes in Cassandra aren't a great solution either, this can work if the result set isn't too large.
In cases like this, taking a "bucketing" approach can help. Essentially, pick a time component to partition by, thus ensuring that partitions cannot grow indefinitely.
PRIMARY KEY((subject_id, month_bucket), creation_ts)
In this case, the application writes a timestamp (creation_ts) and the current month (month_bucket). This helps ensure that you're never putting more than a single month's worth of data in a single partition.
Now this is just an example. A whole month might be too much, in your case. It may need to be smaller, depending on your requirements. It's not uncommon for time-driven data to be partitioned by week, day, or even hour, depending on the required granularity.
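As a rough illustration of how the application could derive the bucket value at write time, here is a minimal Python sketch. The helper name time_bucket and the integer YYYYMM / YYYYMMDD / YYYYMMDDHH formats are assumptions made for this example; they are not part of Cassandra or of the original schema.

```python
from datetime import datetime, timezone


def time_bucket(ts: datetime, granularity: str = "month") -> int:
    """Return an integer bucket for ts (hypothetical helper, not a Cassandra API)."""
    # Bucket in UTC so that every writer derives the same partition for the same instant.
    ts = ts.astimezone(timezone.utc)
    if granularity == "month":
        return ts.year * 100 + ts.month
    if granularity == "day":
        return (ts.year * 100 + ts.month) * 100 + ts.day
    if granularity == "hour":
        return ((ts.year * 100 + ts.month) * 100 + ts.day) * 100 + ts.hour
    raise ValueError(f"unsupported granularity: {granularity}")


# The application would write both values with each row.
creation_ts = datetime(2019, 4, 23, 15, 30, tzinfo=timezone.utc)
month_bucket = time_bucket(creation_ts)
```

Switching to a finer granularity only changes the value written to the bucket column; the shape of the primary key stays the same.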

How to search record using ORDER_BY without the partition keys

I'm debugging an issue, and the relevant logs should fall in the time range 4/23/19 to 4/25/19.
There are hundreds of millions of records in our production database.
It's impossible to locate the target records when they come back in arbitrary order.
Is there any workaround to search in a time range without the partition key?
select * from XXXX.report_summary order by modified_at desc
Schema
...
"modified_at" "TimestampType" "regular"
"record_end_date" "TimestampType" "regular"
"record_entity_type" "UTF8Type" "clustering_key"
"record_frequency" "UTF8Type" "regular"
"record_id" "UUIDType" "partition_key"
First, ORDER BY is really quite superfluous in Cassandra. It can only operate on your clustering columns within a partition, and then only in the exact order of the clustering columns. The reason for this is that Cassandra reads sequentially from disk, so it writes all data according to the defined clustering order to begin with.
So IMO, ORDER BY in Cassandra is pretty useless, except for cases where you want to change the sort direction (ascending/descending).
Secondly, due to its distributed nature, you need to take a query-oriented approach to data modeling. In other words, your tables must be designed to support the queries you intend to run. Now you can find ways around this, but then you're basically doing a full table scan on a distributed cluster, which won't end well for anyone.
Therefore, the recommended way to go about that would be to build a table like this:
CREATE TABLE stackoverflow.report_summary_by_month (
record_id uuid,
record_entity_type text,
modified_at timestamp,
month_bucket bigint,
record_end_date timestamp,
record_frequency text,
PRIMARY KEY (month_bucket, modified_at, record_id)
) WITH CLUSTERING ORDER BY (modified_at DESC, record_id ASC);
Then, this query will work:
SELECT * FROM report_summary_by_month
WHERE month_bucket = 201904
AND modified_at >= '2019-04-23' AND modified_at < '2019-04-26';
The idea here, is that as you care about the order of the results, you need to partition by something else to allow for sorting to work. For this example, I picked month, hence I've "bucketed" your results by month into a partition key called month_bucket. Within each month, I'm clustering on modified_at in DESCending order. This way, the most-recent results are at the "top" of the partition. Then, I threw in record_id as a tie-breaker key to help ensure uniqueness.
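One consequence of bucketing is that a time range spanning a month boundary needs one query per bucket. A small sketch of how the application might enumerate the buckets to query, assuming the YYYYMM integer format used above (the function name month_buckets is made up for this example):

```python
from datetime import date
from typing import List


def month_buckets(start: date, end: date) -> List[int]:
    """Return the YYYYMM buckets covering [start, end], oldest first."""
    if end < start:
        raise ValueError("end must not be before start")
    buckets = []
    year, month = start.year, start.month
    # Walk month by month until the month containing `end` has been added.
    while (year, month) <= (end.year, end.month):
        buckets.append(year * 100 + month)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


# The range from the question stays inside a single partition.
buckets = month_buckets(date(2019, 4, 23), date(2019, 4, 25))
```

Each returned value would be bound to month_bucket in a separate SELECT, with the same modified_at bounds on every query.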
If you're still focused on doing this the wrong way:
You can actually run a range query on your current schema. But with "hundreds of millions of records" across several nodes, I don't have high hopes for that to work. But you can do it with the ALLOW FILTERING directive (which you shouldn't ever really use).
SELECT * FROM report_summary
WHERE modified_at >= '2019-04-23'
AND modified_at < '2019-04-26' ALLOW FILTERING;
This approach has the following caveats:
With many records across many nodes, it will likely time out.
Without being able to identify a single partition for this query, a coordinator node will be chosen, and that node has a high chance of becoming overloaded.
As this is pulling rows from multiple partitions, a sort order cannot be enforced.
ALLOW FILTERING makes Cassandra work in ways that it really wasn't designed to, so I would never use that on a production system.
If you really need to run a query like this, I recommend using an in-memory aggregation tool, like Spark.
Also, as the original question was about ORDER BY, I wrote an article a while back which better explains this topic: https://www.datastax.com/dev/blog/we-shall-have-order

Cassandra secondary vs extra table and read

I'm facing a dilemma that my small knowledge of Cassandra doesn't allow me to solve.
I have an index table used to retrieve data about an item (a notification) using an external ID. However, the data contained in that table (in this case the status of the notification) is modified, so I need to update the index table as well. Here is the table design:
TABLE notification_by_external_id (
external_id text,
partition_key_date text,
id uuid,
status text,
...
PRIMARY KEY (external_id, partition_key_date, id)
);
TABLE notification (
partition_key_date text,
status text,
id uuid,
...
PRIMARY KEY (partition_key_date, status, id)
);
The problem is that when I want to update the notification status (and hence the notification_by_external_id table), I don't have access to the external ID.
So far I have come up with two solutions, neither of which seems optimal, and I can't decide which one to go with.
Solution 1
Create an index on notification_by_external_id.id, but this will obviously be a high-cardinality column. There can be several external IDs for each notification, but we're talking about 5 to 10 per notification at most.
Solution 2
Create a table
TABLE external_id_notification (
notification_id uuid,
external_id text,
PRIMARY KEY (notification_id, external_id)
);
but that would mean one extra read operation (and, of course, maintaining another table), which I understand is also bad practice.
The thing to understand about secondary indexes is that their scalability issue is not with the number of rows in the table, but with the number of nodes in your cluster. A select on an indexed column means that every single node has to process it and respond to it, even though each node on its own can process the select efficiently.
Use secondary indexes for administrative purposes (i.e. you on cqlsh) only. Do not use them in production.
That being said, you could duplicate all the information into your external_id_notification table. That would remove the need for an extra read operation. I know that relational databases taught you that duplicate data is bad (what if it differs?) and that you should always normalize. But you are not on a relational database. Denormalization is a thing, and on Cassandra you should always go for it unless you absolutely cannot.

Single update results in thousands of writes

I'm looking for a viable answer to this use case. There are music tracks, and users have playlists of tracks. Let's say a user uploads a track, then a week later decides to edit the name (or make the track private, etc). If the track has been added to ~10k different playlists, that single edit results in ~10k writes.
It takes a single query to get all the playlists the track has been added to using a reverse lookup table, then the application has to loop through all 10k results and perform the respective updates on the playlist table.
The only alternative I see to this is performing a join at the application level when retrieving playlists.
This is a common use case I keep running into and would like to know how best to handle it.
CREATE TABLE tracks (
track_id timeuuid,
url text,
name text,
PRIMARY KEY (track_id)
);
CREATE TABLE playlist_ordered_by_recently_added (
playlist_id timeuuid,
date_added_id timeuuid,
track_id timeuuid,
url text,
name text,
PRIMARY KEY (playlist_id, date_added_id)
) WITH CLUSTERING ORDER BY (date_added_id DESC);
CREATE TABLE playlist_ordered_by_recently_added_reverse_lookup (
track_id timeuuid,
playlist_id timeuuid,
date_added_id timeuuid,
PRIMARY KEY (track_id, playlist_id)
);
The "join" approach is the correct one, though I wouldn't call it "join".
To retrieve the track list, you first issue a query against playlist_ordered_by_recently_added (which gives you all the track_ids, a list that is expected to be reasonably small), followed by a bunch of parallel queries to retrieve tracks.url and tracks.name from your tracks table.
When you update, you only need to update the tracks table to change the name, once.
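The read path described above might look roughly like this in application code. This is only a sketch: the two dictionaries are in-memory stand-ins for the tables, and fetch_playlist_track_ids, fetch_track, load_playlist and rename_track are hypothetical names. In a real application the two fetch functions would each run the CQL query noted in their comments.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-ins for the two tables (made-up data for illustration).
TRACKS = {
    "t1": {"url": "http://example.com/a", "name": "Song A"},
    "t2": {"url": "http://example.com/b", "name": "Song B"},
}
PLAYLIST = {"p1": ["t2", "t1"]}  # track_ids, most recently added first


def fetch_playlist_track_ids(playlist_id):
    # Stand-in for:
    # SELECT track_id FROM playlist_ordered_by_recently_added WHERE playlist_id = ?
    return PLAYLIST[playlist_id]


def fetch_track(track_id):
    # Stand-in for: SELECT url, name FROM tracks WHERE track_id = ?
    return {"track_id": track_id, **TRACKS[track_id]}


def load_playlist(playlist_id, max_workers=8):
    track_ids = fetch_playlist_track_ids(playlist_id)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, so the playlist order is preserved.
        return list(pool.map(fetch_track, track_ids))


def rename_track(track_id, new_name):
    # A single write, no matter how many playlists contain the track.
    TRACKS[track_id]["name"] = new_name
```

With this layout the playlist table would no longer need its own url and name columns, since those values are always read from tracks.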

news feed like time-series data on cassandra

I am making a website and I want to store all users' posts in one table, ordered by the time they were posted. The Cassandra data model that I made is this:
CREATE TABLE Posts(
ID uuid,
title text,
insertedTime timestamp,
postHour int,
contentURL text,
userID text,
PRIMARY KEY (postHour, insertedTime)
) WITH CLUSTERING ORDER BY (insertedTime DESC);
The question I'm facing is this: when a user visits the posts page, it fetches the most recent ones by querying
SELECT * FROM Posts WHERE postHour = ?;
? = current hour
So far, when the user scrolls down, AJAX requests are made to get more posts from the server. JavaScript keeps track of the postHour of the last fetched item and sends it back to the server along with the Cassandra PagingState when requesting new posts.
But this approach will query more than one partition when the user scrolls down.
I want to know whether this model would perform without problems, or whether there is another model I can follow.
Could someone please point me in the right direction?
Thank You.
That's a good start but a few pointers:
You'll probably need more than just the postHour as the partition key. I'm guessing you don't want to store all the posts together regardless of the day and then page through them. What you're probably after here is:
PRIMARY KEY ((postYear, postMonth, postDay, postHour), insertedTime)
But there's still a problem. Your PRIMARY KEY has to uniquely identify a row (in this case a post). I'm going to guess it's possible, although not likely, that two users might make a post with the same insertedTime value. What you really need then is to add the ID to make sure they are unique:
PRIMARY KEY ((postYear, postMonth, postDay, postHour), insertedTime, ID)
At this point, I'd consider just combining your ID and insertedTime columns into a single ID column of type timeuuid. With those changes, your final table looks like:
CREATE TABLE Posts(
ID timeuuid,
postYear int,
postMonth int,
postDay int,
postHour int,
title text,
contentURL text,
userID text,
PRIMARY KEY ((postYear, postMonth, postDay, postHour), ID)
) WITH CLUSTERING ORDER BY (ID DESC);
Whatever programming language you're using should have a way to generate a timeuuid from the inserted time and then extract that time from a timeuuid value if you want to show it in the UI or something. (Or you could use the CQL timeuuid functions for doing the converting.)
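For instance, in Python the standard uuid module can produce a version 1 (time-based) UUID for the current time, and the timestamp can be recovered from it with a little arithmetic. This is a sketch using only the standard library; the helper name timeuuid_to_datetime is made up here. Note that uuid.uuid1() always uses the current clock, so building a timeuuid for an arbitrary past time would need a driver utility or manual construction.

```python
import uuid
from datetime import datetime, timezone

# Number of 100-nanosecond intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timeuuid_to_datetime(u: uuid.UUID) -> datetime:
    """Extract the embedded timestamp from a version 1 UUID as a UTC datetime."""
    if u.version != 1:
        raise ValueError("not a version 1 (time-based) UUID")
    unix_100ns = u.time - UUID_EPOCH_OFFSET
    return datetime.fromtimestamp(unix_100ns / 1e7, tz=timezone.utc)


post_id = uuid.uuid1()                          # value for the ID column
inserted_time = timeuuid_to_datetime(post_id)   # what the UI would display
```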
As to your question about querying multiple partitions, yes, that's totally fine to do, but you could run into trouble if you're not careful. For example, what happens if there is a 48 hour period with no posts? Do you have to issue 48 queries that return empty results before finally getting some back on your 49th query? (That's probably going to be really slow and a crappy user experience.)
There are a couple of things you could do to try to mitigate that:
Make your partitions less granular. For example, instead of doing posts by hour, make it posts by day, or posts by month. If you know that those partitions won't get too large (i.e. users won't make so many posts that the partition gets huge), that's probably the easiest solution.
Create a second table to keep track of which partitions actually have posts in them. For example, if you were to stick with posts by hour, you could create a table like this:
CREATE TABLE post_hours (
postYear int,
postMonth int,
postDay int,
postHour int,
PRIMARY KEY (postYear, postMonth, postDay, postHour)
);
You'd then insert into this table (using a Batch) anytime a user adds a new post. You can then query this table first before you query the Posts table to figure out which partitions have posts and should be queried (and thus avoid querying a whole bunch of empty partitions).
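To make the second option concrete, here is a small sketch of how the application could pick the next partitions to read once it has the rows from post_hours. The function name next_partitions and the sample data are assumptions for illustration; the list of tuples stands in for the result of querying post_hours.

```python
from datetime import datetime


def next_partitions(non_empty_hours, before, limit=3):
    """Return up to `limit` non-empty hour partitions strictly older than `before`, newest first."""
    cursor = (before.year, before.month, before.day, before.hour)
    # Tuples compare element by element, so (year, month, day, hour) sorts chronologically.
    older = [h for h in non_empty_hours if h < cursor]
    return sorted(older, reverse=True)[:limit]


# Stand-in for rows read from post_hours: (postYear, postMonth, postDay, postHour).
non_empty = [(2019, 4, 23, 9), (2019, 4, 25, 14), (2019, 4, 21, 22)]

# The user has scrolled past the 14:00 partition on 2019-04-25.
to_query = next_partitions(non_empty, datetime(2019, 4, 25, 14), limit=2)
```

Each tuple in to_query would then be bound to the partition key of a single Posts query, so the empty hours in between are never touched.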

Resources