Cassandra: Is manual bucketing still needed when applying TWCS?

I am just about to start exploring Cassandra for long-term storage of time series (write-once) data that can potentially grow quite large.
Assume what is probably the simplest possible time series table:
CREATE TABLE raw_data (
sensor uuid,
timestamp timestamp,
value int,
primary key(sensor, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC)
To make sure partitions don't grow too large, many posts on the internet recommend bucketing, e.g. introducing a day or simply an incrementing bucket number, as in
primary key((sensor, day, bucket), timestamp)
However, these strategies need to be managed manually, which seems quite cumbersome, especially when the number of buckets is unknown.
But what if I instead add:
AND compaction = {
'class': 'TimeWindowCompactionStrategy',
'compaction_window_size': 1,
'compaction_window_unit': 'DAYS'
};
As stated, for example, in https://thelastpickle.com/blog/2016/12/08/TWCS-part1.html:
TWCS aims at simplifying DTCS by creating time windowed buckets of SSTables that are compacted with each other using the Size Tiered Compaction Strategy.
As far as I understand, this means that Cassandra, when using TWCS, internally creates read-only buckets anyway. So I am wondering: do I still need to manually implement the bucketing key day?

The purpose of the bucket is to stop the partition growing too large. Without the bucket the growth of the partition is unbounded - that is, the more data you collect for a particular sensor, the larger the partition becomes, with no ultimate limit.
Changing the compaction strategy alone will not stop growth of the partition, so you would still need the bucket.
(You wrote "Cassandra when using TWCS internally creates readonly buckets". Don't confuse this with the 'bucket' column. The same word is being used for two completely different things.)
On the other hand, if you were to set a TTL on the data then this would effectively limit the size of the partition, because data older than the TTL would (eventually) be deleted from disk. So, if the TTL were small enough, you would no longer need the bucket. In this particular scenario - time-series data collected in order with a TTL - TWCS is the optimum compaction strategy.
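For illustration only, a minimal sketch of that TTL-plus-TWCS scenario applied to the question's table (the 30-day retention period is an assumption, not something stated above):
CREATE TABLE raw_data (
  sensor uuid,
  timestamp timestamp,
  value int,
  PRIMARY KEY (sensor, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC)
  AND default_time_to_live = 2592000  -- 30 days (assumed); rows older than this eventually disappear
  AND compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_size': 1,
    'compaction_window_unit': 'DAYS'
  };
With a one-day window and a 30-day TTL this keeps roughly 30 SSTables per node for the table, and whole SSTables are dropped once everything in them has expired.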

Related

Cassandra read performance slowly decreases over time

We have a Cassandra cluster that consists of six nodes with 4 CPUs and 16 GB RAM each and underlying shared storage (SSD). I'm aware that shared storage is considered bad practice for Cassandra, but ours is limited to 3 Gb/s on reads and seems to be reliable against demanding disk requirements.
Cassandra is used as an operational database for continuous stream processing.
Initially, Cassandra serves requests at ~1,700 rps and it looks fine:
The initial proxyhistograms:
But after a few minutes the performance starts to decrease, becoming more than three times worse over the next two hours.
At the same time we observe that the IOWait time increases:
And proxyhistograms shows the following picture:
We can't understand the reasons that lie behind such behaviour. Any assistance is appreciated.
EDITED:
Table definitions:
CREATE TABLE IF NOT EXISTS subject.record(
subject_id UUID,
package_id text,
type text,
status text,
ch text,
creation_ts timestamp,
PRIMARY KEY((subject_id, status), creation_ts)
) WITH CLUSTERING ORDER BY (creation_ts DESC);
CREATE TABLE IF NOT EXISTS subject.c_record(
c_id UUID,
s_id UUID,
creation_ts timestamp,
ch text,
PRIMARY KEY(c_id, creation_ts, s_id)
) WITH CLUSTERING ORDER BY (creation_ts DESC);
CREATE TABLE IF NOT EXISTS subject.s_by_a(
s int,
number text,
hold_number int,
hold_type text,
s_id UUID,
PRIMARY KEY(
(s, number),
hold_type,
hold_number,
s_id
)
);
"far from 100 Mb"
While some opinions may vary on this, keeping your partitions in the 1 MB to 2 MB range is optimal. Cassandra typically doesn't perform well when returning large result sets. Keeping the partition size small helps queries perform better.
Without knowing what queries are being run, I can say that with queries which deteriorate over time... time is usually the problem. Take this PRIMARY KEY definition, for example:
PRIMARY KEY((subject_id, status), creation_ts)
This tells Cassandra to store the data in a partition (hashed from the concatenation of subject_id and status), then to sort and enforce uniqueness by creation_ts. The problem here is that there is no inherent way to limit the size of the partition. As the clustering key is a timestamp, each new entry (to a particular partition) will cause it to grow larger and larger over time.
Also, status is by definition temporary and subject to change. For that to happen, the row would have to be deleted and recreated in a new partition with every status update. When modeling systems like this, I usually recommend status columns as non-key columns with a secondary index. While secondary indexes in Cassandra aren't a great solution either, they can work if the result set isn't too large.
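A hedged sketch of that alternative, with status as a regular column plus a secondary index (the table and index names here are invented for illustration, not taken from the question):
CREATE TABLE subject.record_by_subject (
  subject_id uuid,
  creation_ts timestamp,
  status text,
  package_id text,
  type text,
  ch text,
  PRIMARY KEY (subject_id, creation_ts)
) WITH CLUSTERING ORDER BY (creation_ts DESC);

CREATE INDEX IF NOT EXISTS record_status_idx ON subject.record_by_subject (status);
A status update is then a plain UPDATE of a non-key column instead of a delete-and-reinsert into a different partition.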
With cases like this, taking a "bucketing" approach can help. Essentially, pick a time component to partition by, thus ensuring that partitions cannot grow infinitely.
PRIMARY KEY((subject_id, month_bucket), creation_ts)
In this case, the application writes a timestamp (creation_ts) and the current month (month_bucket). This helps ensure that you're never putting more than a single month's worth of data in a single partition.
Now this is just an example. A whole month might be too much, in your case. It may need to be smaller, depending on your requirements. It's not uncommon for time-driven data to be partitioned by week, day, or even hour, depending on the required granularity.
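As a sketch of that bucketed approach (the table name and the integer month encoding are assumptions; the bucket could just as well be a week or a day):
CREATE TABLE subject.record_by_month (
  subject_id uuid,
  month_bucket int,        -- e.g. 202401 for January 2024, computed by the application at write time
  creation_ts timestamp,
  status text,
  package_id text,
  PRIMARY KEY ((subject_id, month_bucket), creation_ts)
) WITH CLUSTERING ORDER BY (creation_ts DESC);

-- Reads then always supply the bucket along with the subject:
SELECT * FROM subject.record_by_month WHERE subject_id = ? AND month_bucket = ?;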

Cassandra partitioning strategy for systems with skewed traffic

Please bear with me for a slightly longer problem description.
I am a newbie to the Cassandra world and I am trying to migrate my current product from an Oracle-based data layer to Cassandra.
In order to support range queries I have created an entity like below:
create table if not exists my_system.my_system_log_dated(
id uuid,
client_request_id text,
tenant_id text,
vertical_id text,
channel text,
event text,
event_type text,
created_date date,
primary key((created_date, tenant_id, vertical_id, channel, event),
event_type, client_request_id, id)
) with clustering order by (created_date desc);
Now, I have come across several documents/resources/blogs that mention I should keep my partition size below 100 MB for an optimally performing cluster. With the volume of traffic my system handles per day for certain combinations of the partitioning key, there is no way I can keep it below 100 MB with the above partitioning key.
To fix this I introduced a new factor called bucket_id and was thinking of assigning it the hour of the day to further break partitions into smaller chunks and keep them below 100 MB (even though this means I have to do 24 reads to serve traffic details for one day, I am fine with some inefficiency in reads). Here is the schema with the bucket id:
create table if not exists my_system.my_system_log_dated(
id uuid,
client_request_id text,
tenant_id text,
vertical_id text,
channel text,
event text,
bucket_id int,
event_type text,
created_date date,
primary key((created_date, tenant_id, vertical_id, channel, event,
bucket_id), event_type, client_request_id, id)
) with clustering order by (created_date desc);
Even with this, a couple of combinations of the partitioning key go above 100 MB, while all the other volumes sit comfortably within the range.
With this situation in mind, I have the following questions:
Is it an absolute blunder to have a few of your partitions go beyond the 100 MB limit?
Though with an even smaller bucket, say a 15-minute window, I get all combinations of the partition key under 100 MB, that too creates heavily skewed partitions: high-volume combinations of the partition key go up to 80 MB while the remaining ones are well under 15 MB. Is this something that will adversely impact the performance of my cluster?
Is there a better way to solve this problem?
Here is some more info that I thought may be useful:
Avg row size for this entity is around 200 bytes.
I am also applying a future-proofing factor of 2 and estimating for double the load.
Peak load for a specific combination of the partition key is around 2.8 million records in a day,
the same combination has a peak traffic hour of about 1.4 million records,
and the same in a 15-minute window is around 550,000 records.
Thanks in advance for your inputs!!
Your approach with the bucket id looks good. Answering your questions:
No, it's not a hard limit, and actually it might be too low taking into account hardware improvements over the last few years. I have seen partitions of 2 GB and 5 GB (though they can give you a lot of headaches when doing repairs), but those are extreme cases; don't go near those values. Bottom line: if you don't go WAY above those 100 MB, you will be fine. If you have at least 15 GB of RAM, use G1GC and you're golden.
A uniform distribution of partition sizes is important to keep the data load balanced throughout the cluster, and it's also good so that you can be confident your queries will have close to an average latency (because they will be reading approximately the same amount of data), but it's not something that will cause performance issues on its own.
The approach looks good, but if this is a time series, which I think it is given what you said, then I recommend that you use TWCS (Time Window Compaction Strategy) on my_system.my_system_log_dated. Check how to configure this compaction strategy, because the time window you set will be very important.
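For example, a sketch of what enabling it could look like (the one-hour window mirrors the hourly bucket_id and is just an assumption; tune it to your retention and query pattern):
ALTER TABLE my_system.my_system_log_dated
WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_size': '1',
  'compaction_window_unit': 'HOURS'
};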
I was able to devise a bucketing scheme that prevents any risk to cluster health from unexpected traffic spikes. The approach is described here: https://medium.com/walmartlabs/bucketisation-using-cassandra-for-time-series-data-scans-2865993f9c00

Primary key cardinality causing Partition Too Large errors?

I'm inserting into Cassandra 3.12 via the Python (DataStax) driver and CQL BatchStatements [1]. With a primary key that results in a small number of partitions (10-20) all works well, but data is not uniformly distributed across nodes.
If I include a high cardinality column, for example time or client IP in addition to date, the batch inserts result in a Partition Too Large error, even though the number of rows and the row length is the same.
Higher cardinality keys should result in more but smaller partitions. How does a key generating more partitions result in this error?
[1] Although everything I have read suggests that batch inserts can be an anti-pattern, with a batch covering only one partition I still see the highest throughput compared to async or concurrent inserts for this case.
CREATE TABLE test
(
date date,
time time,
cid text,
loc text,
src text,
dst text,
size bigint,
s_bytes bigint,
d_bytes bigint,
time_ms bigint,
log text,
PRIMARY KEY ((date, loc, cid), src, time, log)
)
WITH compression = { 'class' : 'LZ4Compressor' }
AND compaction = {'compaction_window_size': '1',
'compaction_window_unit': 'DAYS',
'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy'};
I guess you meant Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Batch too large errors?
This is because of the parameter batch_size_fail_threshold_in_kb, which defaults to 50 kB of data in a single batch - and there are also earlier warnings at a 5 kB threshold via batch_size_warn_threshold_in_kb in cassandra.yaml (see http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html).
Can you share your data model? Just adding a column doesn't mean the partition key changes - maybe you only changed the primary key by adding a clustering column. Hint: PRIMARY KEY (a,b,c,d) uses only a as the partition key, while PRIMARY KEY ((a,b),c,d) uses a,b as the partition key - an easily overlooked mistake.
Apart from that, the additional column takes some space - so you can easily hit the threshold now; just reduce the batch size so it fits within the limits again. In general it's good practice to batch only upserts that affect a single partition, as you mentioned. Also make use of async queries and make parallel requests to different coordinators to gain some more speed.
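To illustrate the single-partition idea against the test table above (values are placeholders): every statement shares the same (date, loc, cid), so the whole batch lands on one partition and one replica set.
BEGIN UNLOGGED BATCH
  INSERT INTO test (date, loc, cid, src, time, log, size)
  VALUES ('2017-06-01', 'dc1', 'client-a', '10.0.0.1', '08:00:00.000', 'line 1', 512);
  INSERT INTO test (date, loc, cid, src, time, log, size)
  VALUES ('2017-06-01', 'dc1', 'client-a', '10.0.0.2', '08:00:01.000', 'line 2', 640);
APPLY BATCH;
An unlogged batch is sufficient here because all rows belong to the same partition.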

How do we track the impact expired entries have on a time series table?

We process the oldest data as it comes into the time-series table, and I am taking care to make sure that the oldest entries expire as soon as they are processed. The expectation is to have all the deletes at the bottom of the TimeUUID clustering column, so a query will always read a time slot without any deleted entries.
Will this scheme work? Are there any impacts of the expired columns that I should be aware of?
Keeping the timeuuid as part of the clustering key guarantees the sort order, so the most recent data comes first.
If Cassandra 3.1 (DSE 5.x) and above:
Now, regarding the deletes: avoid manual deletes and use TWCS. Here is how.
Let's say your job processes the data every X minutes, with X = 5 min (hopefully less than 24 hours). Set the compaction to TWCS (Time Window Compaction Strategy) and let's assume a TTL of 24 hours:
WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'HOURS',
  'compaction_window_size': '1'
};
Now there are 24 buckets created in a day, each with one hour of data. These 24 buckets correspond to 24 SSTables (after compaction) in your Cassandra data directory. During the 25th hour, the entire first bucket/SSTable automatically gets dropped by the TTL. Hence, instead of coding for deletes, let Cassandra take care of the cleanup. The beauty of TWCS is that it TTLs the entire data within that SSTable.
The reads from your application always go to the most recent bucket (the 24th SSTable in this case), so the reads never have to scan through the tombstones created by the TTL.
If Cassandra 2.x or DSE 4.x, where TWCS isn't available yet:
A way out, until you upgrade to Cassandra 3.1 or above, is to use artificial buckets. Say you introduce a time-bucket variable as part of the partition key and make the bucket value the date and hour. That way each partition is distinct, and you can adjust the bucket size to match the job-processing interval.
So when you delete, only the processed partition gets deleted and does not get in the way while reading unprocessed ones; scanning tombstones can thus be avoided.
It's an additional effort on the application side to write to the correct partition based on the current date/time bucket, but it's worth it in a production scenario to avoid tombstone scans.
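A hedged sketch of such an artificial bucket for the pre-TWCS case (the table, column names, and the date-plus-hour text encoding are invented for illustration):
CREATE TABLE events_by_hour (
  sensor_id uuid,
  hour_bucket text,        -- e.g. '2016-09-28-14', written by the application
  event_time timeuuid,
  payload blob,
  PRIMARY KEY ((sensor_id, hour_bucket), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Once the job has processed a bucket, the whole partition can be dropped in one statement:
DELETE FROM events_by_hour WHERE sensor_id = ? AND hour_bucket = ?;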
You can use TWCS to easily manage expired data, and perform filtering on a timestamp column at query time to ensure that your query always gets the latest results, as sketched below.
How do you "take care" of the oldest entries' expiry? Cassandra will not show records with an expired TTL, but they will persist in SSTables until the next compaction of that SSTable. If you are deleting the rows yourself, you can't be sure that your query will always read the latest records, since Cassandra is eventually consistent, and theoretically there can be moments when you will read stale data (or many such moments, depending on your consistency settings).
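A minimal sketch of the filter-at-query-time idea (the table and column names are assumed, since the question does not show a schema):
-- Only rows newer than the cut-off are scanned, so already-expired ranges are skipped.
SELECT * FROM sensor_events
WHERE sensor_id = ?
  AND event_time > minTimeuuid('2024-01-01 00:00:00+0000')
LIMIT 100;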

What would be the right compaction strategy for better performance of range queries on a clustering column?

I have Cassandra table
CREATE TABLE schema1 (
key bigint,
lowerbound bigint,
upperbound bigint,
data blob,
PRIMARY KEY (key, lowerbound, upperbound)
) WITH COMPACT STORAGE ;
I want to perform a range query by using CQL
Select lowerbound, upperbound from schema1 where key=(some key) and lowerbound<=123 order by lowerbound desc limit 1 allow filtering;
Any suggestions regarding the compaction strategy, please?
Note: my read:write ratio is 1:1.
Size-tiered compaction is the default, and should be appropriate for most use-cases. In 2012 DataStax posted an article titled When To Use Leveled Compaction, in which it specified three (main) conditions for which leveled compaction was a good idea:
High Sensitivity to Read Latency (your queries need to meet a latency SLA in the 99th percentile).
High Read/Write Ratio
Rows Are Frequently Updated
It also identifies three scenarios when leveled compaction is not a good idea:
Your Disks Can’t Handle the Compaction I/O
Write-heavy Workloads
Rows Are Write-Once
Note how none of the six scenarios I mentioned above are specific to range queries.
My question would be "what problem are you trying to fix?" You mentioned "performing better," but I have found that query performance issues tend to be more tied to data model design. Switching the compaction strategy isn't going to help much if you're running with an inefficient primary key strategy. By virtue of the fact that your query requires ALLOW FILTERING, I would say that changing compaction strategy isn't going to help much.
The DataStax docs contain a section on Slicing over partition rows, which appears to be somewhat similar to your query. Give it a look and see if it helps.
Leveled compaction means fewer SSTables are involved for your queries on a key, but it requires extra IO. Also, during compaction it uses about 10% more disk than the data size, while for size-tiered compaction you need double. Which is better depends on your setup, queries, etc. Are you experiencing performance problems? If not, and if I could deal with the extra IO, I might choose leveled, as it means I don't have to keep 50+% of headroom in terms of disk space for compaction. But again, there's no "one right way".
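If leveled compaction does turn out to fit your workload, switching it on is a one-line change (a sketch; the 160 MB sstable_size_in_mb shown is the usual default and optional):
ALTER TABLE schema1
WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': '160'
};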
Perhaps read this:
http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra
When Rows Are Frequently Updated
From the DataStax article:
Whether you're dealing with skinny rows where columns are overwritten frequently (like a “last access” timestamp in a Users column family) or wide rows where new columns are constantly added, when you update a row with size-tiered compaction it will be spread across multiple SSTables. Leveled compaction, on the other hand, keeps the number of SSTables that the row is spread across very low, even with frequent row updates.
