Large data in Cassandra renders cluster unresponsive

I have created a table in Cassandra 2.2.0 on AWS with a simple structure:
CREATE TABLE data_cache (
cache_id text,
time timeuuid,
request_json_data text,
PRIMARY KEY (cache_id, time)
) WITH CLUSTERING ORDER BY (time DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 3600
AND gc_grace_seconds = 86400
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
I have two data centers on AWS: eu and us-east.
The issue I am experiencing is that the table fills up too rapidly, to the point where there is no disk space left on the system. It is also problematic to truncate the table, as reads become unresponsive in cqlsh.
As you can see, I changed the default TTL to 3600 seconds (1 hour) and set gc_grace_seconds shorter than the default 10 days.
The data is currently 101 GB per cluster and the system has become unresponsive.
If I try a simple select count(*) from data_cache it returns a connection timeout; after 3 tries the cluster itself is lost. The error log reports a Java out-of-memory error.
What should I do differently? What am I doing wrong?
The TTL is there so that the data doesn't overwhelm the servers until we know how long we will keep the cache, which is why it is only set to 1 hour. If we decide the cache should be kept for 1 day, we will scale capacity accordingly, but we will also need to read from it, and because of the crash we are unable to do so.

What you are experiencing is to be expected. Cassandra is good at retrieving one particular record, but not at retrieving billions of rows at once. Indeed, your simple SELECT COUNT(*) FROM data_cache reads your entire dataset under the hood. Due to the nature of Cassandra, counting is hard.
If you query by BOTH cache_id and time everything is fine, but if you don't, you're asking for trouble, especially if you have no idea how wide your rows are.
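For example, a partition-scoped read like the sketch below (the cache_id value and time bound are made up for illustration) touches a single partition and caps the amount of data pulled back, which is the kind of query this model handles well:
SELECT time, request_json_data
FROM data_cache
WHERE cache_id = 'some-cache-id'                      -- hypothetical partition key value
  AND time > maxTimeuuid('2015-09-21 00:00:00+0000')  -- lower bound on the clustering column
LIMIT 100;                                            -- cap the rows read in one request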
Beware that TTLs generate tombstones, which will hit you sooner or later. A TTL does not guarantee that your free space will be reclaimed, even if you lower the grace period. Indeed, with default parameters, SizeTieredCompactionStrategy picks 4 SSTables of roughly equal size, and if you don't have such equally sized tables, compaction does, well, nothing. And in the worst case, SizeTieredCompactionStrategy requires free disk space at least the size of the biggest CF being compacted.
It seems to me you are trying to use Cassandra as a cache, but you are currently using it like a queue. I would rethink the data model. If you come back with a better specification of what you want to achieve, maybe we can help you.

I think your first issue is related to compaction, and more precisely to the ratio between write throughput and compaction throughput. In the cassandra.yaml file there is a field compaction_throughput_mb_per_sec. If its value is lower than your write load, Cassandra won't be able to clear space and will end up with no disk space and crashing nodes.
I am also wondering whether your data is spread correctly across your cluster. You are using cache_id as the partition key and time as the clustering key, which means every insert with the same cache_id goes to the same node. So if you have too few cache_id values, or too many time entries under the same cache_id, the workload will not be evenly distributed and there is a risk of unresponsive nodes. The limits to keep in mind are no more than 100,000 rows per partition and no more than 100 MB per partition.
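If you do end up with a handful of hot cache_id values, one common mitigation (sketched below with a hypothetical hour_bucket column; the bucketing scheme is an assumption, not part of the original table) is to add a time bucket to the partition key so that writes for the same cache_id are spread across several partitions:
CREATE TABLE data_cache_bucketed (
    cache_id text,
    hour_bucket text,            -- e.g. '2015-09-21-14', computed by the application (hypothetical column)
    time timeuuid,
    request_json_data text,
    PRIMARY KEY ((cache_id, hour_bucket), time)
) WITH CLUSTERING ORDER BY (time DESC)
  AND default_time_to_live = 3600;
Reads then target the bucket(s) covering the time range of interest instead of one ever-growing partition.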

Related

Cassandra does not compact shadowed rows for TWCS

I have a Cassandra table with a default TTL. Unfortunately, the default TTL was too small, so now I want to update the default TTL, but I also need to update all rows. Right now my table uses 80 GB of data. I am wondering how to perform this operation without negatively impacting performance.
For testing purposes, I adjusted the configuration of my table a little bit:
AND compaction = {'class' : 'TimeWindowCompactionStrategy',
'compaction_window_unit' : 'MINUTES',
'compaction_window_size' : 10 ,
'tombstone_compaction_interval': 60,
'log_all': true }
AND default_time_to_live = 86400
AND gc_grace_seconds = 100
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND speculative_retry = '99PERCENTILE';
I am using the Time Window Compaction Strategy, with compaction executed every 10 minutes. To speed things up, I set tombstone_compaction_interval to 1 minute, so after one minute an SSTable is taken into account for compaction. gc_grace_seconds is set to 100 seconds.
In my first scenario, I just overwrite every row without deleting it. As far as I understand, no tombstones are created in that scenario; I just shadow the previously inserted rows.
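In CQL terms the overwrite is just an upsert on the same primary key, roughly like the sketch below (keyspace, table and column names are hypothetical, since I haven't shown the full schema):
INSERT INTO ks.events (id, payload) VALUES (42, 'v1');  -- original row
INSERT INTO ks.events (id, payload) VALUES (42, 'v2');  -- same primary key: shadows 'v1', no tombstone written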
So I perform the following steps:
write data
nodetool flush - to flush memtable to sstable
overwrite all rows
nodetool flush
Even after one hour, both SSTables still exist:
-rw-r--r-- 1 cassandra cassandra 4.7M Jan 30 14:04 md-1-big-Data.db
-rw-r--r-- 1 cassandra cassandra 4.7M Jan 30 14:11 md-2-big-Data.db
Of course, if I execute nodetool compact, I end up with one SSTable of 4.7 MB, but I was expecting the old SSTable to be compacted away automatically, as happens when an SSTable contains many tombstones.
In the second scenario, I executed the same operations, but I explicitly removed every row before writing it again. The result was the following:
-rw-r--r-- 1 cassandra cassandra 4.7M Jan 30 16:16 md-4-big-Data.db
-rw-r--r-- 1 cassandra cassandra 6.2M Jan 30 16:35 md-5-big-Data.db
So the second SSTable was bigger, because it had to store the tombstones as well as the new values. But again, the SSTables were not compacted.
Can you explain why automatic compaction was not executed? In this case the old row, the tombstone and the new row could be replaced by a single entry representing the new row.
First, log_all set to true should not be left enabled in a production cluster for an indefinite period of time. You could test it in lower environments and then remove it in the production cluster; I assume it has been turned on temporarily for triaging purposes only. There are other red flags in your setup, for example setting gc_grace_seconds to 100 seconds: you lose the opportunity/flexibility to recover during a catastrophic situation, as you're compromising on the default hints generation and would have to rely on manual repairs, etc. You can read about why that's not a great idea in other SO questions.
The first question we need to ask is whether there is an opportunity for application downtime, and then decide among the options.
Given a downtime window, I might work with the procedure below. Remember, there are multiple ways to do this and this is just one of them.
Ensure that the application(s) aren't accessing the cluster.
Issue a DSBulk unload operation to export the data.
Truncate the table.
Ensure you have the right table properties set (e.g. compaction settings, default TTL, etc.); see the sketch after this list.
Issue a DSBulk load operation, specifying the desired TTL value in seconds using --dsbulk.schema.queryTtl number_seconds.
Perform your validation before opening up application traffic again.
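For step 4, the table-level properties can be adjusted with a plain ALTER TABLE before reloading; for example (a sketch with hypothetical keyspace/table names, an assumed 1-day default TTL, and gc_grace_seconds moved back towards its default):
ALTER TABLE my_ks.my_table
  WITH default_time_to_live = 86400   -- assumed new default TTL of 1 day
  AND gc_grace_seconds = 864000;      -- back towards the 10-day default discussed above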
Other reading references:
TWCS how does it work and when to use it?
hinted-handoff demystified

Cassandra service startup delays with WARN message "min_index_interval of 128 is too low" while reading SSTables in system.log file

We found a delay of 2 hours in starting the Cassandra service, with WARN messages in the system.log file for one table.
Here is one of the warnings seen on a few of the servers:
WARN [SSTableBatchOpen:5] 2022-08-29 10:01:13,732 IndexSummaryBuilder.java:115 - min_index_interval of 128 is too low for 5511836446 expected keys of avg size 64; using interval of 185 instead
Aaron's answer pointed to the right code: since you have a LOT of keys in a single SSTable, the default min_index_interval is no longer efficient and Cassandra recomputes it. This then triggers a rewrite of the index summary during startup, and in this very case it takes a very long time.
Aaron's suggestion of using sstablesplit would only be a temporary fix, as the SSTables will eventually get compacted together again and you'll be back in the same situation.
Changes will have to be made in production to remediate this anyway, and changing min_index_interval seems like an easy enough fix, while really being the only option that doesn't require deep schema changes to reduce the number of partitions per SSTable (or compaction strategy changes, which could have hard-to-predict performance impacts).
Note that changing min_index_interval will not trigger a rewrite of the SSTables straight away. Only newly written SSTables will get the new setting, though it can (and should) be forced onto all the SSTables using nodetool upgradesstables -a.
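For example (a sketch with hypothetical keyspace/table names; the value 256 is only an illustration and should be at least the recomputed interval reported in the warning):
ALTER TABLE my_ks.my_table WITH min_index_interval = 256;
-- then rewrite the existing SSTables so they pick up the new interval:
-- nodetool upgradesstables -a my_ks my_table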
On a side note, there seems to be some confusion in the comments between the partition index and secondary indexes. They are two distinct things, and the reported warning message refers to the partition index summary, not to secondary indexes.
It's difficult to discern a question from the above, so I'll assume you're wondering why Cassandra is taking 2 hours to start up.
If you look in the source of Cassandra 3.0, there are some clues given in the IndexSummaryBuilder class. Specifically, the calculations just prior to the warning:
if (maxExpectedEntriesSize > Integer.MAX_VALUE)
{
    // that's a _lot_ of keys, and a very low min index interval
    int effectiveMinInterval = (int) Math.ceil((double)(expectedKeys * expectedEntrySize) / Integer.MAX_VALUE);
    maxExpectedEntries = expectedKeys / effectiveMinInterval;
    maxExpectedEntriesSize = maxExpectedEntries * expectedEntrySize;
    assert maxExpectedEntriesSize <= Integer.MAX_VALUE : maxExpectedEntriesSize;
    logger.warn("min_index_interval of {} is too low for {} expected keys of avg size {}; using interval of {} instead",
                minIndexInterval, expectedKeys, defaultExpectedKeySize, effectiveMinInterval);
}
The comment about "that's a _lot_ of keys" is a big one, and 5,511,836,446 keys is certainly a lot.
The calculations shown in the method above are driven by the number of keys and the sampling interval for a particular SSTable, and are used to build the partition summary in RAM (the partition summary sits on the right-hand side of diagrams of Cassandra's read path).
Based on this, I would hypothesize that one particular table's SSTable file(s) are getting too big to handle efficiently. Have a look at the underlying data directory for that table. You may have to split some of those files with tools/bin/sstablesplit to make them more manageable.

How to set TTL on Cassandra sstable

We are using Cassandra 3.10 with a 6-node cluster.
Lately, we noticed that our data volume has increased drastically, by approximately 4 GB per day on each node.
We want to implement a more aggressive retention policy in which we change the compaction to TWCS with a 1-hour window size and set a TTL of a few days; this can be achieved via the table properties.
Since the ETL should be a slow process in order to lighten the Cassandra workload, it is possible that it will not finish extracting all the data before the TTL expires, so I wanted to know: is there a way for the ETL process to set TTL=0 on an entire SSTable once it is done extracting it?
TTL=0 is read as a tombstone. When next compacted it would either be written out as a tombstone or purged, depending on your gc_grace. Other than the overhead of writing the tombstones, it might be easier to just issue deletes, or to create SSTables that contain the necessary tombstones, than to rewrite all the existing SSTables. Whether range or point tombstones are more efficient will depend on your version and schema.
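For example, on 3.0+ a single range tombstone can be written straight from CQL, roughly like the sketch below (hypothetical table and column names; it shadows everything older than a cutoff within one partition):
DELETE FROM my_ks.events
WHERE sensor_id = 'abc-123'                      -- hypothetical partition key
  AND event_time < '2018-01-01 00:00:00+0000';   -- one range tombstone covering the old rows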
An option that might be easiest is to use a different compaction strategy altogether, or a custom one like https://github.com/protectwise/cassandra-util/tree/master/deleting-compaction-strategy. You can then simply purge data during compaction once it has been processed. How hard it is to mark what has or hasn't been processed still depends quite a bit on your schema.
You should set the TTL at both the table and the query level. Once the TTL expires, the data is converted to tombstones. Based on the gc_grace_seconds value, a subsequent compaction will clear the tombstones. You may also run a major compaction to clear tombstones, but that is not recommended in Cassandra; depending on the compaction strategy (with STCS, for instance) at least 50% free disk is required to run a healthy compaction.

Cassandra partitioning strategy for systems with skewed traffic

Please bear with me for a slightly longer problem description.
I am a newbie to the Cassandra world and I am trying to migrate my current product from an Oracle-based data layer to Cassandra.
In order to support range queries I have created an entity like the one below:
create table if not exists my_system.my_system_log_dated(
id uuid,
client_request_id text,
tenant_id text,
vertical_id text,
channel text,
event text,
event_type text,
created_date date,
primary key((created_date, tenant_id, vertical_id, channel, event),
event_type, client_request_id, id)
) with clustering order by (created_date desc);
Now, I have come across several documents/resources/blogs that mention I should keep my partition size under 100 MB for an optimally performing cluster. With the volume of traffic my system handles per day, there is no way I can keep certain combinations of the partition key under 100 MB with the above partitioning key.
To fix this I introduced a new factor called bucket_id and was thinking of assigning it the hour-of-day value, to further break partitions into smaller chunks and keep them under 100 MB (even though this means I have to do 24 reads to serve traffic details for one day, I am fine with some inefficiency in reads). Here is the schema with the bucket id:
create table if not exists my_system.my_system_log_dated(
id uuid,
client_request_id text,
tenant_id text,
vertical_id text,
channel text,
event text,
bucket_id int,
event_type text,
created_date date,
primary key((created_date, tenant_id, vertical_id, channel, event,
bucket_id), event_type, client_request_id, id)
) with clustering order by (created_date desc);
Even with this, a couple of combinations of the partition key go above 100 MB, while all other volumes sit comfortably within the range.
With this situation in mind I have below questions:
Is it an absolute blunder to have a few of your partitions go beyond the 100 MB limit?
With an even smaller bucket, say a 15-minute window, I get all combinations of the partition key under 100 MB, but that too creates heavily skewed partitions: high-volume combinations of the partition key go up to 80 MB while the remaining ones stay well under 15 MB. Is this something that will adversely impact the performance of my cluster?
Is there a better way to solve this problem?
Here is some more info that I thought may be useful:
Avg row size for this entity is around 200 bytes
I am also considering a future-proofing factor of 2, so I am estimating for double the load.
Peak load for a specific combination of the partition key is around 2.8 million records in a day
the same combination has a peak traffic hour of about 1.4 million records
and the same in 15 min window is around 550,000 records.
Thanks in advance for your inputs!!
Your approach with the bucket id looks good. Answering your questions:
No, it's not a hard limit, and actually it might even be too low, taking into account hardware improvements over the last few years. I have seen partitions of 2 GB and 5 GB (though they can give you a lot of headaches when doing repairs), but those are extreme cases and you shouldn't go near those values. Bottom line: if you don't go WAY above those 100 MB, you will be fine. If you have at least 15 GB of RAM, use G1GC and you're golden.
A uniform distribution of partition sizes is important to keep the data load balanced throughout the cluster, and it also helps keep your queries close to an average latency (because they read approximately the same amount of data), but it's not something that will cause performance issues on its own.
The approach looks good, but if this is a time series, which I think it is given what you said, then I recommend using TWCS (TimeWindowCompactionStrategy) on my_system.my_system_log_dated. Check how to configure this compaction strategy, because the time window you set will be very important.
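For example, switching the table to TWCS could look like the sketch below (the 1-day window is an assumption chosen to line up with the daily buckets; tune it to your actual retention and query pattern):
ALTER TABLE my_system.my_system_log_dated
  WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 1 };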
I was able to devise a bucketisation scheme that prevents any risk to cluster health from unexpected traffic spikes. It is described here: https://medium.com/walmartlabs/bucketisation-using-cassandra-for-time-series-data-scans-2865993f9c00

Check table size in cassandra historically

I have a Cassandra table (Cassandra version 2.0) with terabytes of data; here is what the schema looks like:
CREATE TABLE "my_table" (
key ascii,
timestamp bigint,
value blob,
PRIMARY KEY ((key), timestamp)
)
I'd like to delete some data, but first I want to estimate how much disk space it will reclaim.
Unfortunately, stats from JMX metrics are only available for the last two weeks, so that's not very useful.
Is there any way to check how much space is used by a certain set of data (for example, where timestamp < 1000)?
I was also wondering if there is a way to check the size of a query result set, so that I could do something like select * from my_table where timestamp < 1000 and see how many bytes the result occupies.
There is no mechanism to see the on-disk size of a specific set of data; the data can be pretty far removed from the coordinator of the request, and there are layers that affect its size, like compression and multiple SSTables, which make this difficult.
Also be aware that issuing a delete will not immediately reduce disk space. C* does not delete data in place; SSTables are immutable and cannot be changed. Instead it writes a tombstone entry that will disappear after gc_grace_seconds. When SSTables are merged, the tombstone + data combine into just the tombstone. Once gc_grace_seconds has passed, the tombstone is no longer copied during compaction.
The gc_grace period is there to prevent losing deletes in a distributed system, since until a repair runs (repairs should be scheduled roughly weekly) there is no absolute guarantee that the delete has been seen by all replicas. If a replica has not seen the delete and you remove the tombstone, the data can come back.
No, not really.
Using sstablemetadata you can find tombstone drop times and the minimum and maximum timestamps in the mc-####-big-Data.db files.
Additionally, if you're low on disk space, consider nodetool cleanup and nodetool clearsnapshot, and then finally nodetool repair.
