Cassandra does not compact shadowed rows for TWCS

I have a Cassandra table with a default TTL. Unfortunately, the default TTL was too small, so now I want to update it, but I also need to update all existing rows. Right now my table holds about 80 GB of data. I am wondering how to perform this operation without negatively impacting performance.
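Changing the table-level default itself is a single statement, roughly like the one below (keyspace, table name, and the new value are just placeholders); the part I am unsure about is rewriting the existing 80 GB of rows without hurting performance:
ALTER TABLE my_keyspace.my_table WITH default_time_to_live = 2592000;  -- placeholder value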
For testing purposes, I adjusted the configuration of my table a little:
AND compaction = {'class' : 'TimeWindowCompactionStrategy',
'compaction_window_unit' : 'MINUTES',
'compaction_window_size' : 10 ,
'tombstone_compaction_interval': 60,
'log_all': true }
AND default_time_to_live = 86400
AND gc_grace_seconds = 100
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND speculative_retry = '99PERCENTILE';
I am using the Time Window Compaction Strategy, and compaction is executed every 10 minutes. To speed everything up, I set tombstone_compaction_interval to one minute (60 seconds), so after one minute an SSTable is taken into account for compaction. gc_grace_seconds is set to 100 seconds.
In my first scenario, I just overwrite every row without deleting it. As far as I understand, tombstones are not created in that scenario; I just shadow the previously inserted rows.
So I perform the following steps:
write data
nodetool flush - to flush memtable to sstable
overwrite all rows
nodetool flush
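In terms of commands, one round of the test looks roughly like this (test.t1 stands in for my keyspace and table, already created with the TWCS settings above; the INSERTs stand in for writing and then overwriting all rows):
# write the data, then flush the memtable to an SSTable
cqlsh -e "INSERT INTO test.t1 (id, val) VALUES (1, 'first');"
nodetool flush test t1
# overwrite the same rows, then flush again to get a second SSTable
cqlsh -e "INSERT INTO test.t1 (id, val) VALUES (1, 'second');"
nodetool flush test t1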
Even after one hour, both SSTables still exist:
-rw-r--r-- 1 cassandra cassandra 4.7M Jan 30 14:04 md-1-big-Data.db
-rw-r--r-- 1 cassandra cassandra 4.7M Jan 30 14:11 md-2-big-Data.db
Of course, if I execute nodetool compact, I end up with one SSTable of 4.7 MB, but I expected that compacting away the old SSTable would happen automatically, as it does when an SSTable contains many tombstones.
In the second scenario, I executed the same operations, but I explicitly removed every row before writing it again. The result was the following:
-rw-r--r-- 1 cassandra cassandra 4.7M Jan 30 16:16 md-4-big-Data.db
-rw-r--r-- 1 cassandra cassandra 6.2M Jan 30 16:35 md-5-big-Data.db
So the second SSTable was bigger, because it has to store both the tombstones and the new values. But again, the SSTables were not compacted.
Can you explain why automatic compaction was not executed? In this case the old row, the tombstone, and the new row could be replaced by a single entry representing the new row.

First, log_all should not be left set to true in a production cluster for an indefinite period of time. You could test it out in lower environments and then remove it in the production cluster; I assume it is only turned on temporarily for triaging purposes. There are other red flags in your case above, for example setting gc_grace_seconds to 100 seconds: you lose the opportunity/flexibility to recover during a catastrophic situation, as you're compromising the default hint generation window and will have to perform manual repairs, etc. You can read about why that's not a great idea in other SO questions.
The first question we need to ask is whether there is an opportunity for application downtime, and then decide between the available options.
Given a downtime window, I would work with the procedure below. Remember, there are multiple ways to do this and this is just one of them.
Ensure that the application(s) aren't accessing the cluster.
Issue a DSBulk unload operation to get the data exported out.
Truncate the table.
Ensure you have the right table properties set (e.g. compaction settings, default TTL, etc.).
Issue a DSBulk load operation, specifying the desired TTL value in seconds with --dsbulk.schema.queryTtl number_seconds.
Perform your validation prior to opening application traffic back up.
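A rough sketch of those steps with DSBulk and cqlsh (keyspace, table, path, and TTL value are placeholders):
# export the current data
dsbulk unload -k my_ks -t my_table -url /backup/my_table
# wipe the table and fix the table-level properties
cqlsh -e "TRUNCATE my_ks.my_table;"
cqlsh -e "ALTER TABLE my_ks.my_table WITH default_time_to_live = 2592000;"
# reload the data with the desired TTL applied to every row
dsbulk load -k my_ks -t my_table -url /backup/my_table --dsbulk.schema.queryTtl 2592000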
Other reading references:
TWCS how does it work and when to use it?
hinted-handoff demystified

Related

How to set TTL on Cassandra sstable

We are using Cassandra 3.10 with a 6-node cluster.
Lately, we noticed that our data volume increased drastically, approximately 4 GB per day on each node.
We want to implement a more aggressive retention policy, in which we change the compaction to TWCS with a 1-hour window size and set a TTL of a few days; this can be achieved via the table properties.
Since the ETL should be a slow process in order to lighten the Cassandra workload, it is possible that it will not finish extracting all the data before the TTL expires, so I wanted to know: is there a way for the ETL process to set TTL=0 on an entire SSTable once it is done extracting it?
TTL=0 is read as a tombstone. When next compacted, it would either be written out as a tombstone or purged, depending on your gc_grace. Other than the overhead of writing the tombstones, it might be easier just to issue deletes, or to create SSTables that contain the necessary tombstones, than to rewrite all the existing SSTables. Whether range or point tombstones are more efficient will depend on your version and schema.
An option that might be easiest is to use a different compaction strategy altogether, or a custom one like https://github.com/protectwise/cassandra-util/tree/master/deleting-compaction-strategy. You can then simply purge data during compaction once it has been processed. How hard it would be to mark what has been processed still depends quite a bit on your schema.
You should set the TTL at both the table and the query level. Once the TTL expires, the data is converted to tombstones. Based on the gc_grace_seconds value, the next compaction will clear those tombstones. You may also run a major compaction to clear tombstones, but that is generally not recommended in Cassandra and depends on the compaction strategy; with STCS, at least 50% free disk space is required to run a healthy compaction.

Tombstone in Cassandra

I have a Cassandra table with a TTL of 60 seconds, and I have a few questions about it:
1) I am getting the following warning
Read 76 live rows and 1324 tombstone cells for query SELECT * FROM xx.yy WHERE token(y) >= token(fc872571-1253-45a1-ada3-d6f5a96668e8) LIMIT 100 (see tombstone_warn_threshold)
What does this mean?
2) As per my study, a tombstone is a flag written when data expires via TTL (and it is removed after gc_grace_seconds):
i) So, until the 10 days are up, does that mean the data won't be deleted?
ii) What are the consequences of it waiting for 10 days?
iii) Why is it such a long time, 10 days?
https://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html
gc_grace_seconds 864000 [10 days] The number of seconds after data is marked with a tombstone (deletion marker) before it is eligible for garbage-collection. Cassandra will not execute hints or batched mutations on a tombstoned record within its gc_grace_period. The default value allows a great deal of time for Cassandra to maximize consistency prior to deletion. For details about decreasing this value, see garbage collection below.
3) I read that performing compaction and repair using nodetool will delete the tombstones. How frequently do we need to run these in the background, and what are the consequences?
This means that your query returned 76 "live" or non-deleted/non-obsoleted rows of data, and that it had to sift through 1324 tombstones (deletion markers) to accomplish that.
In the world of distributed databases, deletes are hard. After all, if you delete a piece of data from one node, and you expect that deletion to happen on all of your nodes, how would you know if it worked? Quite literally, how do you replicate nothing? Tombstones (delete markers) are the answer to that question.
i. The data is gone (obsoleted, rather). The tombstone(s) will remain for gc_grace_seconds.
ii. The "consequence" is that you'll have to put up with those tombstone warning messages for that duration, or find a way to run your query without having to scan over the tombstones.
iii. The idea behind the 10 days is that if the tombstones are collected too early, your deleted data can "ghost" its way back onto some nodes. 10 days gives you enough time to run a weekly repair, which ensures your tombstones are properly replicated before removal.
Compaction removes tombstones. Repair replicates them. You should run repair once per week. While you can run compaction on-demand, don't. Cassandra has its own thresholds (based on number and size of SSTable files) to figure out when to run compaction, and it's best not to get in its way. If you do, you'll be manually running compaction from there on out, as you'll probably never reach the compaction conditions organically.
The consequences, are that both repair and compaction take compute resources, and can reduce a node's ability to serve requests. But they need to happen. You want them to happen. If compaction doesn't run, your SSTable files will grow in number and size; eventually causing rows to exist over multiple files, and queries for them will get slow. If repair doesn't run, your data is at risk of not being in-sync.
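As a sketch, that weekly repair could be scheduled per node with something like the following (keyspace name and cron schedule are placeholders):
# cron entry: primary-range repair of the keyspace every Sunday at 03:00, staggered per node
0 3 * * 0 nodetool repair -pr my_keyspace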

Cassandra - What is the difference between TTL at the table level and inserting data with a TTL

I have a Cassandra 2.1 cluster where we insert data through Java with a TTL, as the requirement is to persist the data for 30 days.
But this causes a problem, as the files containing old data and tombstones are kept on disk. This results in disk space being occupied by data that is no longer required. Repairs take a lot of time to clear this data (up to 3 days on a single node).
Is there a better way to delete the data?
I have come across this on datastax
Cassandra allows you to set a default_time_to_live property for an entire table. Columns and rows marked with regular TTLs are processed as described above; but when a record exceeds the table-level TTL, Cassandra deletes it immediately, without tombstoning or compaction. https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutDeletes.html?hl=tombstone
Will the data be deleted more efficiently if I set the TTL at the table level instead of setting it each time while inserting?
Also, the documentation is for Cassandra 3, so will I have to upgrade to a newer version to get any of these benefits?
Setting default_time_to_live applies the default TTL to all rows and columns in your table; if no individual TTL is set (and Cassandra has correct NTP time on all nodes), Cassandra can easily drop that data safely.
But keep some things in mind: your application is still able to set a specific TTL for a single row in your table, in which case normal processing applies. On top of that, even if the data is TTLed it won't get deleted immediately: SSTables are still immutable, and the tombstones will only be dropped during compaction.
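For illustration, the two levels look like this (keyspace, table, and column names are made up):
-- table-level default: applies to every write that does not carry its own TTL
ALTER TABLE my_ks.events WITH default_time_to_live = 2592000;  -- 30 days
-- per-write TTL: overrides the table default for this row only
INSERT INTO my_ks.events (id, payload) VALUES (uuid(), 'data') USING TTL 86400;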
What could really help you a lot (just guessing) would be an appropriate compaction strategy:
http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/dml/dmlHowDataMaintain.html#dmlHowDataMaintain__twcs-compaction
TimeWindowCompactionStrategy (TWCS)
Recommended for time series and expiring TTL workloads.
The TimeWindowCompactionStrategy (TWCS) is similar to DTCS with simpler settings. TWCS groups SSTables using a series of time windows. During compaction, TWCS applies STCS to uncompacted SSTables in the most recent time window. At the end of a time window, TWCS compacts all SSTables that fall into that time window into a single SSTable based on the SSTable maximum timestamp. Once the major compaction for a time window is completed, no further compaction of the data will ever occur. The process starts over with the SSTables written in the next time window.
This helps a lot when you choose your time windows correctly. All data in the last compacted SSTable will have roughly equal TTL values (hint: don't do out-of-order inserts or manual TTLs!). Cassandra keeps the latest expiration time in the SSTable metadata, and when that time has passed Cassandra simply deletes the entire SSTable, as all of its data is now obsolete. No need for compaction.
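As a sketch, switching the table to TWCS with windows sized for a 30-day TTL could look like this (the one-day window is only an illustrative choice, not a recommendation):
ALTER TABLE my_ks.events WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 1
};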
How do you run your repair? Incremental? Full? Reaper? How big in terms of nodes and data is your cluster?
The quick answer is yes. The way it is implemented is by deleting the SSTable(s) directly from disk. Deleting an SSTable without the need to compact it will clear up disk space faster. But you need to be sure that all the data in a specific SSTable is "older" than the globally configured TTL for the table.
This is the feature referred to in the paragraph you quoted. It was implemented for Cassandra 2.0, so it should be part of 2.1.

gc_grace_seconds to remove tombstone rows in cassandra

I am using the awesome Cassandra DB (3.7.0) and I have questions about tombstones.
I have a table called raw_data. This table has a default TTL of 1 hour and gets new data every second. Another processor then reads a row and removes it.
It seems like this raw_data table becomes slow at reading and writing after several days of running.
Is this because the deleted rows are staying around as tombstones? The table already has a TTL of 1 hour. Should I set gc_grace_seconds to something less than the default 10 days to remove tombstones more quickly? (By the way, I am running a single-node DB.)
Thank you in advance.
Deleting your data is one way to end up with tombstone problems; TTL is the other.
It is pretty normal for a Cassandra cluster to become slower and slower with each delete, and your cluster will eventually refuse to read data from this table.
Setting gc_grace_seconds to less than the default 10 days is only one part of the equation. The other part is the compaction strategy you use. Indeed, in order to remove tombstones a compaction is needed.
I'd change my mind about my single-node cluster and I'd go with the minimum standard 3 nodes with RF=3. Then I'd design my project around something that doesn't explicitly delete data. If you absolutely need to delete data, make sure that C* runs compaction periodically and removes tombstones (or force C* to run compactions), and make sure to have plenty of IOPS, because compaction is very IO intensive.
In short, tombstones are used by Cassandra to mark data as deleted and to replicate that deletion to the other nodes so the deleted data doesn't re-appear. These tombstones are kept in Cassandra until gc_grace_seconds has passed. Creating lots of tombstones might slow down your table. As you are using a single-node Cassandra, you don't have to replicate anything to other nodes, hence you can lower gc_grace_seconds to 1 day, which will not hurt. If you are planning to add new nodes and data centers in the future, change gc_grace_seconds accordingly.
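For example, lowering it to one day on this table could look like the statement below (the keyspace name is a placeholder; 86400 seconds = 1 day):
ALTER TABLE my_ks.raw_data WITH gc_grace_seconds = 86400;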

Large data in Cassandra renders cluster unresponsive

I have created a table in Cassandra 2.2.0 on AWS with a simple structure:
CREATE TABLE data_cache (
cache_id text,
time timeuuid,
request_json_data text,
PRIMARY KEY (cache_id, time)
) WITH CLUSTERING ORDER BY (time DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 3600
AND gc_grace_seconds = 86400
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
I have 2 data centers on AWS: eu and us-east.
The issue that I am experiencing is that the table fills up too rapidly, to the point that there is no more disk space on the system. It is also problematic to truncate the table, as reads become unresponsive in cqlsh.
As you can see, I changed the default TTL to 3600 seconds (1 hour) and gc_grace_seconds to be shorter than the default 10 days.
Currently the data is at 101 GB per cluster and the system has become unresponsive.
If I try a simple SELECT COUNT(*) FROM data_cache, I get a connection timeout; after 3 tries the cluster itself is lost. The error log reports a Java out-of-memory error.
What should I do differently? What am I doing wrong?
Currently the TTL is there so that data doesn't overwhelm the server until we know how long we will use the cache for, hence why it is only set to 1 hour. If we decide that the cache should be kept for 1 day, we will scale capacity accordingly, but we will also need to read from it, and due to the crash we are unable to do so.
What you are experiencing is to be expected. Cassandra is good at retrieving one particular record, but not at retrieving billions of rows at once. Indeed, your simple SELECT COUNT(*) FROM data_cache is reading your entire dataset under the hood. Due to the nature of Cassandra, counting is hard.
If you query by BOTH cache_id and time everything is fine, but if you don't, you're asking for trouble, especially if you have no idea how wide your rows are.
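For example, a query that stays within one partition and slices on the clustering column is the kind of access Cassandra handles well (the cache_id and time bounds are only illustrative):
SELECT request_json_data FROM data_cache
WHERE cache_id = 'some-cache-id'
  AND time > maxTimeuuid('2016-01-01 00:00+0000')
  AND time < minTimeuuid('2016-01-02 00:00+0000');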
Beware that TTLs generate tombstones, which will hit you sooner or later. The TTL alone doesn't guarantee that your free space will be reclaimed, even if you lower the grace period. Indeed, with default parameters, SizeTieredCompactionStrategy compacts 4 SSTables of roughly equal size, but if you don't have such equally sized tables, then compaction does, well, nothing. And in the worst case, SizeTieredCompactionStrategy requires the free space on your disk to be at least the size of the biggest CF being compacted.
It seems to me you are trying to use Cassandra as a cache, but you are currently using it like a queue. I would rethink the data model. If you come back with a better specification of what you want to achieve, maybe we can help you.
I think your first issue is related to compaction, and more precisely to the ratio between write throughput and compaction throughput. In the cassandra.yaml file there is a field compaction_throughput_mb_per_sec. If its value is lower than your write load, Cassandra won't be able to clear space, and it will end up with no disk space and crashing nodes.
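The throttle can be checked and, if needed, raised at runtime with nodetool (64 is only an illustrative value; 0 disables throttling entirely):
# show the current compaction throttle in MB/s
nodetool getcompactionthroughput
# raise it so compaction can keep up with the write load
nodetool setcompactionthroughput 64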
I am also wondering whether your data is correctly spread across your cluster. I see that you use cache_id as the partition key and time as the clustering key. That means any insert with the same cache_id goes to the same node. So if you have too few cache_id values, or too many time entries under the same cache_id, the workload will not be evenly distributed and there is a risk of unresponsive nodes. The limits you must keep in mind are no more than 100,000 rows per partition and no more than 100 MB per partition.
