Primary key cardinality causing Partition Too Large errors? - cassandra

I'm inserting into Cassandra 3.12 via the Python (DataStax) driver and CQL BatchStatements [1]. With a primary key that results in a small number of partitions (10-20), everything works well, but the data is not uniformly distributed across the nodes.
If I include a high-cardinality column, for example time or client IP, in addition to date, the batch inserts result in a Partition Too Large error, even though the number of rows and the row length stay the same.
A higher-cardinality key should result in more, but smaller, partitions. How does a key that generates more partitions produce this error?
[1] Although everything I have read suggests that batch inserts can be an anti-pattern, with a batch covering only one partition I still see the highest throughput compared to async or concurrent inserts for this case.
CREATE TABLE test
(
    date date,
    time time,
    cid text,
    loc text,
    src text,
    dst text,
    size bigint,
    s_bytes bigint,
    d_bytes bigint,
    time_ms bigint,
    log text,
    PRIMARY KEY ((date, loc, cid), src, time, log)
)
WITH compression = { 'class' : 'LZ4Compressor' }
AND compaction = {'compaction_window_size': '1',
                  'compaction_window_unit': 'DAYS',
                  'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy'};

I guess you meant Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: Batch too large errors?
This is because of the parameter batch_size_fail_threshold_in_kb, which by default allows 50 kB of data in a single batch - there is also an earlier warning at the 5 kB threshold set by batch_size_warn_threshold_in_kb in cassandra.yaml (see http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html).
Can you share your data model? Just adding a column doesn't necessarily change the partition key - maybe you changed the primary key only by adding a clustering column. Hint: PRIMARY KEY (a,b,c,d) uses only a as the partition key, while PRIMARY KEY ((a,b),c,d) uses a,b as the partition key - an easily overlooked mistake.
Apart from that, the additional column takes some space, so you can easily hit the threshold now; just reduce the batch size so it fits within the limits again. In general it's good practice to batch only upserts that affect a single partition, as you mentioned. Also make use of async queries and issue parallel requests to different coordinators to gain some more speed.
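As a minimal sketch of that advice (not the poster's actual code), the DataStax Python driver can group rows by the full partition key and send small UNLOGGED batches that each touch a single partition; the contact point, keyspace name and rows_per_batch value below are assumptions.

from collections import defaultdict
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

cluster = Cluster(['127.0.0.1'])          # assumed contact point
session = cluster.connect('test_ks')      # hypothetical keyspace

insert = session.prepare(
    "INSERT INTO test (date, loc, cid, src, time, log, size) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)")

def write_rows(rows, rows_per_batch=20):
    # Group rows by the full partition key (date, loc, cid) so that each
    # batch is guaranteed to hit exactly one partition.
    by_partition = defaultdict(list)
    for row in rows:
        by_partition[(row['date'], row['loc'], row['cid'])].append(row)

    for part_rows in by_partition.values():
        # Keep each batch small enough to stay below
        # batch_size_warn_threshold_in_kb / batch_size_fail_threshold_in_kb.
        for i in range(0, len(part_rows), rows_per_batch):
            batch = BatchStatement(batch_type=BatchType.UNLOGGED)
            for r in part_rows[i:i + rows_per_batch]:
                batch.add(insert, (r['date'], r['loc'], r['cid'],
                                   r['src'], r['time'], r['log'], r['size']))
            session.execute(batch)

Tuning rows_per_batch (or measuring the serialized size) is what keeps each batch under the 50 kB failure threshold.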

Related

Cassandra read performance slowly decreases over time

We have a Cassandra cluster that consists of six nodes with 4 CPUs and 16 GB RAM each, on top of shared storage (SSD). I'm aware that shared storage is considered a bad practice for Cassandra, but ours is capped at 3 Gb/s on reads and has seemed reliable under demanding disk loads.
Cassandra is used as an operational database for continuous stream processing.
Initially Cassandra serves requests at ~1,700 rps and the initial proxyhistograms look healthy.
But after a few minutes the performance starts to degrade, becoming more than three times worse over the next two hours.
At the same time we observe that the IOWait time increases, and proxyhistograms reflect the much higher latencies.
We can't understand the reasons that lie behind such behaviour. Any assistance is appreciated.
EDITED:
Table definitions:
CREATE TABLE IF NOT EXISTS subject.record(
subject_id UUID,
package_id text,
type text,
status text,
ch text,
creation_ts timestamp,
PRIMARY KEY((subject_id, status), creation_ts)
) WITH CLUSTERING ORDER BY (creation_ts DESC);
CREATE TABLE IF NOT EXISTS subject.c_record(
c_id UUID,
s_id UUID,
creation_ts timestamp,
ch text,
PRIMARY KEY(c_id, creation_ts, s_id)
) WITH CLUSTERING ORDER BY (creation_ts DESC);
CREATE TABLE IF NOT EXISTS subject.s_by_a(
s int,
number text,
hold_number int,
hold_type text,
s_id UUID,
PRIMARY KEY(
(s, number),
hold_type,
hold_number,
s_id
)
);
On the partitions being "far from 100 Mb":
While opinions on this may vary, keeping your partitions in the 1 MB to 2 MB range is optimal. Cassandra typically doesn't perform well when returning large result sets, and keeping partitions small helps queries perform better.
Without knowing what queries are being run, I can say that when queries deteriorate over time... time is usually the problem. Take this PRIMARY KEY definition, for example:
PRIMARY KEY((subject_id, status), creation_ts)
This tells Cassandra to store the data in a partition (hashed from the combination of subject_id and status), then to sort and enforce uniqueness by creation_ts. The problem here is that there is no inherent way to limit the size of the partition: because the clustering key is a timestamp, each new entry to a particular partition makes it larger and larger over time.
Also, status is by definition temporary and subject to change. For that to happen, partitions would have to be deleted and recreated with every status update. When modeling systems like this, I usually recommend making status a non-key column with a secondary index. While secondary indexes in Cassandra aren't a great solution either, they can work if the result set isn't too large.
With cases like this, taking a "bucketing" approach can help. Essentially, pick a time component to partition by, thus ensuring that partitions cannot grow infinitely.
PRIMARY KEY((subject_id, month_bucket), creation_ts)
In this case, the application writes a timestamp (creation_ts) and the current month (month_bucket). This helps ensure that you're never putting more than a single month's worth of data in a single partition.
Now this is just an example. A whole month might be too much, in your case. It may need to be smaller, depending on your requirements. It's not uncommon for time-driven data to be partitioned by week, day, or even hour, depending on the required granularity.
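As a hedged sketch of that write path with the Python driver, assuming a hypothetical record_by_month table keyed by ((subject_id, month_bucket), creation_ts); the keyspace, table and column names are illustrative, not part of the original schema:

import uuid
from datetime import datetime, timezone
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('subject')

# Hypothetical bucketed variant of subject.record:
#   PRIMARY KEY ((subject_id, month_bucket), creation_ts)
insert = session.prepare(
    "INSERT INTO record_by_month (subject_id, month_bucket, creation_ts, status) "
    "VALUES (?, ?, ?, ?)")

def save_record(subject_id, status):
    now = datetime.now(timezone.utc)
    month_bucket = now.strftime('%Y%m')   # e.g. '202405': one partition per subject per month
    session.execute(insert, (subject_id, month_bucket, now, status))

save_record(uuid.uuid4(), 'NEW')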

How can I reduce or is it necessary to reducing partition count for large amount of data in Cassandra?

I have estimated ~500 million rows of data with 5 million unique numbers. My query must fetch data by number and event_date. With number as the partition key, there will be 5 million partitions. I think having that many small partitions is not good, and timeouts occur during queries. I'm having trouble defining the partition key. I have found some synthetic sharding strategies, but couldn't apply them to my model. I can define the partition key by taking the number modulo some value, but then rows aren't distributed evenly among the partitions.
How can I model this to reduce the partition count, and is it necessary to reduce it at all? Is there any limit on the partition count?
CREATE TABLE events_by_number_and_date (
    number bigint,
    event_date int, /*eg. 20200520*/
    event text,
    col1 int,
    col2 decimal,
    PRIMARY KEY (number, event_date)
);
For your query, changing the data model won't help, as you're using a query that is unsuitable for Cassandra. Although Cassandra supports aggregations such as max, count, avg, and sum, they are designed to work inside a single partition, not across the whole cluster. If you issue them without a restriction on the partition key, the coordinating node needs to reach every node in the cluster, and they will have to go through all the data in the cluster.
You can still run this kind of query, but it's better to use something like Spark to do it, as it's heavily optimized for parallel data processing, and the Spark Cassandra Connector is able to query the data correctly. If you can't use Spark, you can implement your own full token range scan, using code similar to this. But in any case, don't expect a "real-time" answer (< 1 sec).
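A rough sketch of such a full token range scan with the Python driver, assuming the default Murmur3Partitioner; the slice size and keyspace name are illustrative only:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_ks')        # hypothetical keyspace

MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1  # Murmur3 token range
SLICE = 2**54                             # ~1000 slices over the ring

scan = session.prepare(
    "SELECT number, event_date, event FROM events_by_number_and_date "
    "WHERE token(number) >= ? AND token(number) <= ?")

start = MIN_TOKEN
while start <= MAX_TOKEN:
    end = min(start + SLICE - 1, MAX_TOKEN)
    for row in session.execute(scan, (start, end)):
        pass                              # aggregate or process each row here
    start = end + 1

Each slice is a bounded token range, so the coordinator only has to talk to the replicas owning that range instead of the whole cluster at once.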

Cassandra: Is manual bucketing still needed when applying TWCS?

I am just about to start exploring Cassandra for (long-term) storage of time-series (write-once) data that can potentially grow quite large.
Assuming probably the simplest possible time series:
CREATE TABLE raw_data (
sensor uuid,
timestamp timestamp,
value int,
primary key(sensor, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC)
To make sure partitions don't grow too large, many posts on the internet recommend bucketing, e.g. introducing a day column or just an up-counting bucket number, like
primary key((sensor, day, bucket), timestamp)
However, these strategies need to be managed manually, which seems quite cumbersome, especially for an unknown number of buckets.
But what if I just add:
AND compaction = {
'class': 'TimeWindowCompactionStrategy',
'compaction_window_size': 1,
'compaction_window_unit': 'DAYS'
};
As said e.g. in https://thelastpickle.com/blog/2016/12/08/TWCS-part1.html:
TWCS aims at simplifying DTCS by creating time windowed buckets of SSTables that are compacted with each other using the Size Tiered Compaction Strategy.
As far as I understand, this means that when using TWCS, Cassandra internally creates read-only buckets anyway. So I am wondering whether I still need to manually implement the day bucketing key?
The purpose of the bucket is to stop the partition growing too large. Without the bucket the growth of the partition is unbounded - that is, the more data you collect for a particular sensor, the larger the partition becomes, with no ultimate limit.
Changing the compaction strategy alone will not stop growth of the partition, so you would still need the bucket.
(You wrote "Cassandra when using TWCS internally creates readonly buckets". Don't confuse this with the 'bucket' column. The same word is being used for two completely different things.)
On the other hand, if you were to set a TTL on the data, this would effectively limit the size of the partition, because data older than the TTL would (eventually) be deleted from disk. So if the TTL were small enough, you would no longer need the bucket. In this particular scenario - time-series data collected in order, with a TTL - TWCS is the optimal compaction strategy.
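As a small sketch of that last point, assuming a 30-day retention requirement: with TWCS on raw_data and a TTL attached to every write, partition growth is bounded without a manual bucket column (the keyspace name and TTL value below are assumptions).

import uuid
from datetime import datetime, timezone
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('metrics')      # hypothetical keyspace

TTL_SECONDS = 30 * 24 * 3600              # assumed 30-day retention

# The TTL could also be set once via the table's default_time_to_live
# option; binding it per write is shown here for illustration.
insert = session.prepare(
    "INSERT INTO raw_data (sensor, timestamp, value) "
    "VALUES (?, ?, ?) USING TTL ?")

def record(sensor_id, value):
    session.execute(insert, (sensor_id, datetime.now(timezone.utc), value, TTL_SECONDS))

record(uuid.uuid4(), 42)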

Avoiding filtering with a compound partition key in Cassandra

I am fairly new to Cassandra and currently have the following table:
CREATE TABLE time_data (
id int,
secondary_id int,
timestamp timestamp,
value bigint,
PRIMARY KEY ((id, secondary_id), timestamp)
);
The compound partition key (with secondary_id) is necessary in order to not violate max partition sizes.
The issue I am running into is that I would like to run the query SELECT * FROM time_data WHERE id = ?. Because the table has a compound partition key, this query requires filtering. I realize this queries a lot of data and partitions, but it is necessary for the application. For reference, id has relatively low cardinality and secondary_id has high cardinality.
What is the best way around this? Should I simply allow filtering on the query? Or is it better to create a secondary index like CREATE INDEX id_idx ON time_data (id)?
You will need to specify the full partition key on queries (ALLOW FILTERING will badly impact performance in most cases).
One way to go, if you know all the secondary_id values (you could add a table to track them if necessary), is to do the work in your application: query all (id, secondary_id) pairs and process the results afterwards. This has the disadvantage of being more complex, but the advantage that it can be done with async queries in parallel, so many nodes in your cluster participate in processing your task.
See also https://www.datastax.com/dev/blog/java-driver-async-queries
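A sketch of that fan-out using the driver's concurrent-execution helper; the keyspace and the secondary_ids_by_id tracking table are hypothetical additions, not part of the original schema:

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_ks')        # hypothetical keyspace

def fetch_all_for_id(id_value):
    # Look up the known secondary_ids for this id from a (hypothetical)
    # tracking table maintained by the application on write.
    sec_ids = [r.secondary_id for r in session.execute(
        "SELECT secondary_id FROM secondary_ids_by_id WHERE id = %s", (id_value,))]

    query = session.prepare(
        "SELECT * FROM time_data WHERE id = ? AND secondary_id = ?")

    # Each (id, secondary_id) pair is a single-partition query, so the
    # requests run in parallel and spread across coordinators.
    rows = []
    for success, result in execute_concurrent_with_args(
            session, query, [(id_value, s) for s in sec_ids], concurrency=50):
        if success:
            rows.extend(result)
    return rows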

Cassandra data model for application logs (billions of operations!)

Say I want to collect logs from a huge application cluster which produces 1,000-5,000 records per second. In the future this number might reach 100,000 records per second, aggregated from a 10,000-node datacenter.
CREATE TABLE operation_log (
-- Seconds will be used as row keys, thus each row will
-- contain 1000-5000 log messages.
time_s bigint,
time_ms int, -- Microseconds (to sort data within one row).
uuid uuid, -- Monotonous UUID (NOT time-based UUID1)
host text,
username text,
accountno bigint,
remoteaddr inet,
op_type text,
-- For future filters — renaming a column must be faster
-- than adding a column?
reserved1 text,
reserved2 text,
reserved3 text,
reserved4 text,
reserved5 text,
-- 16*n bytes of UUIDs of connected messages, usually 0,
-- sometimes up to 100.
submessages blob,
request text,
PRIMARY KEY ((time_s), time_ms, uuid)) -- Partition on time_s
-- Because queries will be "from current time into the past"
WITH CLUSTERING ORDER BY (time_ms DESC)
CREATE INDEX oplog_remoteaddr ON operation_log (remoteaddr);
...
(secondary indices on host, username, accountno, op_type);
...
CREATE TABLE uuid_lookup (
uuid uuid,
time_s bigint,
time_ms int,
PRIMARY KEY (uuid));
I want to use OrderedPartitioner which will spread data all over the cluster by its time_s (seconds). It must also scale to dozens of concurrent data writers as more application log aggregators are added to the application cluster (uniqueness and consistency is guaranteed by the uuid part of the PK).
Analysts will have to look at this data by performing these sorts of queries:
range query over time_s, filtering on any of the data fields (SELECT * FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND $filters),
pagination query from the results of the previous one (SELECT * FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND token(uuid) < token($uuid) AND $filters),
count messages filtered by any data fields within a time range (SELECT COUNT(*) FROM operation_log WHERE time_s < $time1 AND time_s > $time2 AND $filters),
group all data by any of the data fields within some range (will be performed by application code),
request dozens or hundreds of log messages by their uuid (hundreds of SELECT * FROM uuid_lookup WHERE uuid IN [00000005-3ecd-0c92-fae3-1f48, ...]).
My questions are:
Is this a sane data model?
Is using OrderedPartitioner the way to go here?
Does provisioning a few columns for potential filters make sense? Or is adding a column every once in a while cheap enough on a Cassandra cluster with some reserved headroom?
Is there anything that prevents it from scaling to 100000 inserted rows per second from hundreds of aggregators and storing a petabyte or two of queryable data, provided that the number of concurrent queryists will never exceed 10?
This data model is close to a sane model, with several important modifications/caveats:
Do not use ByteOrderedPartitioner, especially not with time as the key. Doing this will result in severe hotspots on your cluster, as you'll do most of your reads and all your writes to only part of the data range (and therefore a small subset of your cluster). Use Murmur3Partitioner.
To enable your range queries, you'll need a sentinel key--a key you can know in advance. For log data, this is probably a time bucket + some other known value that's not time-based (so your writes are evenly distributed).
Your indices might be ok, but it's hard to tell without knowing your data. Make sure your values are low in cardinality, or the index won't scale well.
Make sure any potential filter columns adhere to the low-cardinality rule. Better yet, if you don't need real-time queries, use Spark to do your analysis. You should create new columns as needed; this is not a big deal, since Cassandra stores them sparsely. Better still, if you use Spark, you can store these values in a map.
If you follow these guidelines, you can scale as big as you want. If not, you will have very poor performance and will likely get performance equivalent to a single node.
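To make the "time bucket + known value" sentinel idea concrete, here is a hedged sketch that assumes a hypothetical variant of operation_log partitioned by (hour, shard); the shard count, keyspace and column names are assumptions, not part of the model above.

from datetime import datetime, timedelta, timezone
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('logs')         # hypothetical keyspace

SHARDS = 16                               # assumed write-spreading factor per hour

# Assumed table layout: PRIMARY KEY ((hour, shard), time_ms, uuid)
query = session.prepare(
    "SELECT * FROM operation_log WHERE hour = ? AND shard = ?")

def fetch_range(start, end):
    # Enumerate every (hour, shard) partition covering the requested range;
    # both values are known in advance, so no ordered partitioner is needed.
    hours, t = [], start.replace(minute=0, second=0, microsecond=0)
    while t <= end:
        hours.append(t)
        t += timedelta(hours=1)
    params = [(h, s) for h in hours for s in range(SHARDS)]

    rows = []
    for ok, result in execute_concurrent_with_args(session, query, params):
        if ok:
            rows.extend(result)
    return rows

now = datetime.now(timezone.utc)
recent = fetch_range(now - timedelta(hours=2), now)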
