Cassandra partitioning strategy for systems with skewed traffic

Please bear with me for a slightly longer problem description.
I am new to the Cassandra world and I am trying to migrate my current product from an Oracle-based data layer to Cassandra.
In order to support range queries, I have created an entity like the one below:
create table if not exists my_system.my_system_log_dated(
    id uuid,
    client_request_id text,
    tenant_id text,
    vertical_id text,
    channel text,
    event text,
    event_type text,
    created_date date,
    primary key((created_date, tenant_id, vertical_id, channel, event),
        event_type, client_request_id, id)
) with clustering order by (created_date desc);
Now, I have come across several documentation/resources/blogs that mention I should keep my partition size under 100 MB for an optimally performing cluster. With the volume of traffic my system handles per day, there is no way I can keep certain combinations of the partitioning key under 100 MB with the above partitioning key.
To fix this I introduced a new factor called bucket_id and was thinking of assigning it the hour-of-day value to further break partitions into smaller chunks and keep them under 100 MB (even though this means I have to do 24 reads to serve the traffic details for one day, I am fine with some inefficiency in reads). Here is the schema with the bucket id:
create table if not exists my_system.my_system_log_dated(
    id uuid,
    client_request_id text,
    tenant_id text,
    vertical_id text,
    channel text,
    event text,
    bucket_id int,
    event_type text,
    created_date date,
    primary key((created_date, tenant_id, vertical_id, channel, event,
        bucket_id), event_type, client_request_id, id)
) with clustering order by (created_date desc);
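To illustrate the 24 reads mentioned above, serving one day of traffic for a given combination would mean issuing one query per hour bucket, roughly like this (just a sketch; the literal values are placeholders):

select * from my_system.my_system_log_dated
where created_date = '2020-01-15'
  and tenant_id = 'tenant-a'
  and vertical_id = 'vertical-1'
  and channel = 'web'
  and event = 'login'
  and bucket_id = 0;
-- repeated for bucket_id 1 through 23 (or issued in parallel by the application)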
Even with this, a couple of combinations of the partition key go over 100 MB, while all the other volumes sit comfortably within the range.
With this situation in mind, I have the following questions:
Is it an absolute blunder to have a few of your partitions go beyond the 100 MB limit?
With an even smaller bucket, say a 15-minute window, I get all combinations of the partition key under 100 MB, but that too creates heavily skewed partitions: the high-volume combinations of the partition key go up to 80 MB while the remaining ones are well under 15 MB. Is this something that will adversely impact the performance of my cluster?
Is there a better way to solve this problem?
Here is some more info that I thought may be useful:
The average row size for this entity is around 200 bytes.
I am also considering a future-proofing factor of 2 and am estimating for double the load.
Peak load for a specific combination of the partition key is around 2.8 million records in a day;
the same combination has a peak traffic hour of about 1.4 million records,
and the same in a 15-minute window is around 550,000 records.
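As a rough back-of-the-envelope check using the 200-byte average row size (uncompressed, and before applying the 2x future-proofing factor; on-disk partitions are typically smaller thanks to SSTable compression):

2,800,000 rows/day   x 200 bytes ≈ 560 MB for a day-level partition
1,400,000 rows/hour  x 200 bytes ≈ 280 MB for the busiest hourly bucket
  550,000 rows/15min x 200 bytes ≈ 110 MB for the busiest 15-minute bucket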
Thanks in advance for your inputs!!

Your approach with the bucket id looks good. Answering your questions:
No, it's not a hard limit, and in fact it might be too low considering the hardware improvements of the last few years. I have seen partitions of 2 GB and 5 GB (though they can give you a lot of headaches when doing repairs), but those are extreme cases and you shouldn't go near those values. Bottom line: if you don't go WAY above those 100 MB, you will be fine. If you have at least 15 GB of RAM, use G1GC and you're golden.
A uniform distribution of partition sizes is important to keep the data load balanced throughout the cluster, and it also means you can be confident that your queries will be close to the average latency (because they will be reading approximately the same amount of data), but it's not something that will cause performance issues on its own.
The approach looks good, but if this is a time series, which it appears to be from what you have described, then I recommend that you use TWCS (TimeWindowCompactionStrategy) on my_system.my_system_log_dated. Look into how to configure this compaction strategy, because the time window you set will be very important.
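For example, something along these lines (a sketch only; the window unit and size are placeholders that you should tune to your bucketing and retention):

alter table my_system.my_system_log_dated
with compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',
    'compaction_window_size': 1
};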

I was able to devise a bucketisation scheme that prevents any risk to cluster health from unexpected traffic spikes. It is described here: https://medium.com/walmartlabs/bucketisation-using-cassandra-for-time-series-data-scans-2865993f9c00

Related

Cassandra : optimal partition size

I plan to have a simple table like this (simple key/value use case):
CREATE TABLE my_data (
    id bigint,
    value blob,
    PRIMARY KEY (id)
);
With the following characteristics:
as you can see, one partition = one blob (value)
each value is always accessed by its corresponding key
each value is a blob of 1 MB max (average also 1 MB)
with 1 MB blobs, that gives 60 million partitions
What do you think about the 1 MB blobs? Is that OK for Cassandra?
Indeed, I could divide my data further and work with 1 KB blobs, but in that case it would lead to many more partitions in Cassandra (more than 600 million?), and many more partitions to retrieve for the same client-side query.
Thanks
The general recommendation is to stay as close to 100 MB partition sizes as possible, although this isn't a hard limit. There are some edge cases where partitions can get beyond 1 GB and still be acceptable for some workloads, as long as you're willing to accept the tradeoffs.
However, in your case, 1 MB blobs is a strong recommendation, but again not a hard limit. You will notice a significant performance hit for larger blob sizes if you do a reasonable load test.
600 million partitions is not a problem at all. Cassandra is designed to handle billions, even trillions of partitions and beyond. Cheers!

Cassandra read performance slowly decreases over time

We have a Cassandra cluster that consists of six nodes with 4 CPUs and 16 GB RAM each, on underlying shared storage (SSD). I'm aware that shared storage is considered bad practice for Cassandra, but ours is limited to 3 GB/s on reads and seems to hold up against demanding disk requirements.
Cassandra is used as an operational database for continuous stream processing.
Initially Cassandra serves requests at ~1,700 rps and it looks nice:
The initial proxyhistograms:
But after a few minutes the performance starts to decrease, becoming more than three times worse over the next two hours.
At the same time we observe that the IOWait time increases:
And proxyhistograms shows the following picture:
We can't understand the reasons that lie behind such behaviour. Any assistance is appreciated.
EDITED:
Table definitions:
CREATE TABLE IF NOT EXISTS subject.record(
    subject_id UUID,
    package_id text,
    type text,
    status text,
    ch text,
    creation_ts timestamp,
    PRIMARY KEY((subject_id, status), creation_ts)
) WITH CLUSTERING ORDER BY (creation_ts DESC);

CREATE TABLE IF NOT EXISTS subject.c_record(
    c_id UUID,
    s_id UUID,
    creation_ts timestamp,
    ch text,
    PRIMARY KEY(c_id, creation_ts, s_id)
) WITH CLUSTERING ORDER BY (creation_ts DESC);

CREATE TABLE IF NOT EXISTS subject.s_by_a(
    s int,
    number text,
    hold_number int,
    hold_type text,
    s_id UUID,
    PRIMARY KEY(
        (s, number),
        hold_type,
        hold_number,
        s_id
    )
);
Our partitions are far from 100 MB.
While opinions may vary on this, keeping your partitions in the 1 MB to 2 MB range is optimal. Cassandra typically doesn't perform well when returning large result sets. Keeping the partition size small helps queries perform better.
Without knowing what queries are being run, I can say that with queries which deteriorate over time... time is usually the problem. Take this PRIMARY KEY definition, for example:
PRIMARY KEY((subject_id, status), creation_ts)
This tells Cassandra to store the data in a partition (hashed from the concatenation of subject_id and status), then to sort and enforce uniqueness by creation_ts. The problem here is that there is no inherent way to limit the size of the partition. Since the clustering key is a timestamp, each new entry (to a particular partition) will cause it to grow larger and larger over time.
Also, status is by definition temporary and subject to change. For that to happen, partitions would have to be deleted and recreated with every status update. When modeling systems like this, I usually recommend status columns as non-key columns with a secondary index. While secondary indexes in Cassandra aren't a great solution either, they can work if the result set isn't too large.
With cases like this, taking a "bucketing" approach can help. Essentially, pick a time component to partition by, thus ensuring that partitions cannot grow infinitely.
PRIMARY KEY((subject_id, month_bucket), creation_ts)
In this case, the application writes a timestamp (creation_ts) and the current month (month_bucket). This helps ensure that you're never putting more than a single month's worth of data in a single partition.
Now this is just an example. A whole month might be too much, in your case. It may need to be smaller, depending on your requirements. It's not uncommon for time-driven data to be partitioned by week, day, or even hour, depending on the required granularity.
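A minimal sketch of what that could look like for the first table, with status kept as a regular column (per the earlier suggestion) and a text month bucket such as '2020-06' supplied by the application; the table name and the exact bucket format here are just illustrative:

CREATE TABLE IF NOT EXISTS subject.record_by_month(
    subject_id UUID,
    month_bucket text,
    creation_ts timestamp,
    package_id text,
    type text,
    status text,
    ch text,
    PRIMARY KEY((subject_id, month_bucket), creation_ts)
) WITH CLUSTERING ORDER BY (creation_ts DESC);

-- read a time slice from a single, bounded partition
SELECT * FROM subject.record_by_month
WHERE subject_id = 9b2c1f20-6b3a-4f0e-9c1d-1234567890ab
  AND month_bucket = '2020-06'
  AND creation_ts > '2020-06-15';

If filtering by status is still required, a secondary index on the status column could be added, with the caveats noted above.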

Cassandra: Is manual bucketing still needed when applying TWCS?

I am just about to start exploring Cassandra for (long-term) storage of time series (write-once) data that can potentially grow quite large.
Assuming probably the simplest possible time series:
CREATE TABLE raw_data (
    sensor uuid,
    timestamp timestamp,
    value int,
    PRIMARY KEY (sensor, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
To make sure partitions don't grow too large, many posts on the internet recommend bucketing, e.g. introducing a day component or just an up-counting bucket number, like
primary key((sensor, day, bucket), timestamp)
However, these strategies need to be managed manually, which seems quite cumbersome, especially for an unknown number of buckets.
But what if I just add:
AND compaction = {
'class': 'TimeWindowCompactionStrategy',
'compaction_window_size': 1,
'compaction_window_unit': 'DAYS'
};
As said e.g. in https://thelastpickle.com/blog/2016/12/08/TWCS-part1.html:
TWCS aims at simplifying DTCS by creating time windowed buckets of SSTables that are compacted with each other using the Size Tiered Compaction Strategy.
As far as I understand, this means that Cassandra, when using TWCS, internally creates read-only buckets anyway. So I am wondering whether I still need to manually implement the bucketing key day?
The purpose of the bucket is to stop the partition growing too large. Without the bucket the growth of the partition is unbounded - that is, the more data you collect for a particular sensor, the larger the partition becomes, with no ultimate limit.
Changing the compaction strategy alone will not stop growth of the partition, so you would still need the bucket.
(You wrote "Cassandra when using TWCS internally creates readonly buckets". Don't confuse this with the 'bucket' column. The same word is being used for two completely different things.)
On the other hand, if you were to set a TTL on the data, then this would effectively limit the size of the partition, because data older than the TTL would (eventually) be deleted from disk. So, if the TTL were small enough, you would no longer need the bucket. In this particular scenario (time series data collected in order, with a TTL), TWCS is the optimum compaction strategy.
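A sketch of how TTL and TWCS could be combined on the raw_data table from the question, assuming (purely for illustration) a 90-day retention period and daily compaction windows:

CREATE TABLE raw_data (
    sensor uuid,
    timestamp timestamp,
    value int,
    PRIMARY KEY (sensor, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC)
  -- expire rows after 90 days (90 * 86400 seconds), bounding each sensor partition
  AND default_time_to_live = 7776000
  AND compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_size': 1,
    'compaction_window_unit': 'DAYS'
  };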

Why does varying blob size give different performance?

My Cassandra table looks like this:
CREATE TABLE cs_readwrite.cs_rw_test (
    part_id bigint,
    s_id bigint,
    begin_ts bigint,
    end_ts bigint,
    blob_data blob,
    PRIMARY KEY (part_id, s_id, begin_ts, end_ts)
) WITH CLUSTERING ORDER BY (s_id ASC, begin_ts DESC, end_ts DESC);
When I insert 1 million rows per client, with an 8 KB blob per row, and test the insertion speed from different client hosts, the speed is almost constant at ~100 Mbps. But with the same table definition, from the same client hosts, if I insert rows with 16 bytes of blob data, my speed numbers are dramatically lower, at ~4 to 5 Mbps. Why is there such a speed difference? I am only measuring write speeds for now. My main concern is not speed (though some inputs would help): when I add more clients I see that the speed stays almost constant for the bigger blob size, but for 16-byte blobs the speed increases by only 10-20% per added client before it becomes constant.
I have also looked at the bin/nodetool tablehistograms output and adjusted the number of partitions in my test data so that no partition is > 100 MB.
Any insights/ links for documentation would be helpful. Thanks!
I think you are measuring the throughput in the wrong way. Throughput should be measured in transactions per second, not in data written per second.
Even though the amount of data written can play a role in determining the write throughput of a system, it usually depends on many other factors:
Compaction strategy: STCS is write-optimized whereas LCS is read-optimized.
Connection speed and latency between the client and the cluster, and between machines in the cluster.
CPU usage of the node which is processing data, sending data to other replicas and waiting for their acknowledgment.
Most writes are immediately written to memory instead of directly to disk, which makes the impact of the amount of data being written on the final write throughput almost negligible, whereas other fixed costs such as network delay and the CPU needed to coordinate the processing of data across nodes have a bigger impact.
The way you should see it is that with an 8 KB payload you are getting X transactions per second, and with 16 bytes you are getting Y transactions per second. Y will always be better than X, but it will not be linearly proportional to the size difference.
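To put rough numbers on it, taking the figures from the question at face value and reading them as megabytes per second (an assumption on my part):

~100 MB/s with 8 KB rows  ->  100,000,000 / 8,192 ≈ 12,000 writes/s
  ~5 MB/s with 16 B rows  ->    5,000,000 / 16    ≈ 312,000 writes/s

So the small-blob workload is actually completing far more writes per second; each write just carries less data.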
You can find how writes are handled in cassandra explained in detail here.
There is management overhead in Cassandra per row/partition: the more data (in bytes) you have in each row, the less that overhead impacts throughput in bytes per second. The reverse is true if you look at rows per second as a metric of throughput; the larger the payloads, the worse your rows/sec throughput gets.

Cassandra read performance degrades as we increase data on nodes

DB used: DataStax Cassandra Community 3.0.9
Cluster: 3 x (8-core, 15 GB AWS c4.2xlarge) with 300 GB io1 at 3000 IOPS.
Write consistency: QUORUM, read consistency: ONE, replication factor: 3
Problem:
I loaded our servers with 50,000 users, and each user had 1000 records initially; after some time, 20 more records were added to each user. I wanted to fetch the 20 additional records that were added later (query: select * from table where userID='xyz' and timestamp > 123), where userID and timestamp are part of the primary key. It worked fine when I had only 50,000 users. But as soon as I added another 20 GB of dummy data, the performance of the same query, i.e. fetching the 20 additional records for the 50,000 users, dropped significantly. Read performance degrades as the data grows. As far as I have read, this should not happen, since keys get cached and additional data should not matter.
What could be the possible cause for this? CPU and RAM utilisation is negligible, and I can't find out what is causing the query time to increase.
I have tried changing the compaction strategy to LeveledCompactionStrategy, but that didn't work either.
EDIT 1
EDIT 2
Heap size is 8 GB. The 20 GB of data was added in a way similar to how the initial 4 GB of data (the 50k userIDs) was added, and this was done to simulate a real-world scenario. The userID and timestamp values for the 20 GB of data are different and generated randomly. The scenario is that I have 50k userIDs with 1020 rows each, where 1000 rows were added first and an additional 20 rows were added after some timestamp, and I am fetching these 20 messages. It works fine if only the 50k userIDs are present, but once I have more userIDs (the additional 20 GB) and I try to fetch those same 20 messages (for the initial 50k userIDs), the performance degrades.
EDIT 3: cassandra.yaml
"Read performance is getting degraded with increase in data."
This should only happen when you add a lot of records to the same partition.
From what I can understand, your table may look like:
CREATE TABLE tbl (
    userID text,
    timestamp timestamp,
    ....
    PRIMARY KEY (userID, timestamp)
);
This model is good enough when the volume of data in a single partition is "bounded" (e.g. you have at most 10k rows in a single partition). The reason is that the coordinator comes under a lot of pressure when dealing with "unbounded" queries (that's why very large partitions are a big no-no).
That "rule" can be easily overlooked, and the net result is an overall slowdown. It can be explained simply: C* needs to read more and more data (and it will all be read from one node only) to satisfy your query, keeping the coordinator busy and slowing down the entire cluster. Data growth usually means slower query responses, and after a certain threshold, the infamous read timeout error.
That being said, it would be interesting to see whether your disk usage is "normal" or something is wrong. Give dstat -lrvn a shot to monitor your servers.
A final tip: depending on how many fields you are querying with SELECT * and on the amount of data retrieved, being served by an SSD may not be a big deal because you won't exploit the IOPS of your SSDs. In such cases, an ordinary HDD could lower the cost of the solution, and you wouldn't incur any penalty.

Resources