Disk space not decreasing after gc_grace_seconds (10 days) elapsed - cassandra

I deleted a lot of data (10 billion rows) from my table (I wrote a small app that queries the token range from LONG.MIN_VALUE up to LONG.MAX_VALUE and DELETEs some of the data).
Disk space has not decreased in the 20 days since then (I also ran nodetool repair on 1 node out of 6), but the estimated number of keys has decreased accordingly.
Will the space decrease naturally in the future, or is there some Cassandra utility I need to run to reclaim the space?

In general, yes, the space will decrease accordingly (once compaction runs). Depending on the compaction strategy chosen for that table, it could take some time. Size Tiered Compaction Strategy, for example, requires by default that 4 sstables be of similar size before they are compacted. If you have very large sstables, they may not get compacted for quite some time, or indefinitely if there are never 4 of similar size. A manual compaction would fix that situation, but it would put everything in a single sstable, which is not recommended either. If the sstable resulting from a manual compaction is very small, it won't hurt you. If it ends up as a "large" sstable, then you have sacrificed "now" for "later" (again, because you now have only a single large sstable, it may take a very long time for it to participate in compaction). You can split the sstable after a manual compaction to remedy the situation you've created, but you'll have to take your node offline to do it. Anyway, the short answer is that over time the table should shrink accordingly; when depends on the compaction strategy chosen.
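The "4 sstables of similar size" rule can be illustrated with a rough Python sketch. This is a simplified model, not Cassandra's actual code: it assumes the default STCS options min_threshold=4, bucket_low=0.5 and bucket_high=1.5, and ignores min_sstable_size and max_threshold. The function names are made up for illustration.

```python
def stcs_buckets(sizes, bucket_low=0.5, bucket_high=1.5):
    """Group sstable sizes into buckets of 'similar' size (simplified model)."""
    buckets = []  # each bucket is a list of sstable sizes
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            # An sstable joins a bucket if it is within [low, high] x bucket average.
            if avg * bucket_low <= size <= avg * bucket_high:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return buckets


def compactable(sizes, min_threshold=4):
    """Return only the buckets that have enough sstables to trigger a compaction."""
    return [b for b in stcs_buckets(sizes) if len(b) >= min_threshold]


# Hypothetical sstable sizes in GB: four small ones and three large ones.
print(compactable([1, 1, 1, 1, 200, 210, 800]))
```

In this model only the four small sstables form a bucket that qualifies; the large ones wait until enough peers of similar size exist, which is the situation described above.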

Try running "nodetool garbagecollect", as this triggers a compaction that removes deleted data. You can check its progress with "nodetool compactionstats".

Related

Disk space requirement for compaction on a token range in scylla/cassandra

I am using SizeTieredCompaction strategy in Scylla db. I deleted half of my data in a specific token range (let's say x to y). My gc_grace_seconds is set to 6 hours. I want to get rid of all the tombstones that are created in this token range. If I run nodetool compact --start-token x --end-token y keyspace table on all the nodes in the cluster after gc_grace_seconds has passed, what would happen? Will it delete the tombstones, and how much disk space will it consume? Will it be the same as a nodetool compact major compaction, which needs 50% more space?
Scylla's documentation of nodetool compact (see https://docs.scylladb.com/operating-scylla/nodetool-commands/compact/) doesn't even mention the token range option, unfortunately. But the Cassandra documentation (https://cassandra.apache.org/doc/latest/operating/compaction/index.html) explains what the so-called sub-range compaction does:
It is possible to only compact a given sub range - this could be useful if you know a token that has been misbehaving - either gathering many updates or many deletes. (nodetool compact -st x -et y) will pick all sstables containing the range between x and y and issue a compaction for those sstables. For STCS this will most likely include all sstables but with LCS it can issue the compaction for a subset of the sstables.
With STCS the common case is that all sstables have tokens from all over the token ring, so your nodetool compact call will usually invoke a full major compaction of all sstables. The token range option will likely not exempt any of the sstables from being compacted. So the temporary disk space overhead will be as usual with STCS: at the end of the compaction, you have both the old sstables and the new one. You assumed the new sstable holds only half of the original data, so it will be around half the total size of the old sstables, and this is probably the "50%" you asked about.
To delete the tombstones you also need to run nodetool repair. See here for details on the repair procedure. Basically, repair compares data between nodes so that tombstones can be safely expired.
The space required for compaction depends on the specific workload, so it is impossible to give an exact answer without data about your workload. But 2x is a safe bet that includes a safety margin. After a full compaction the space used will be minimal, as only 1 copy of the data is saved on each node.
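The arithmetic behind the "50%" and "2x" figures can be sketched in a few lines of Python. This is only a back-of-the-envelope model that assumes the old sstables stay on disk until the new one is fully written; the function name and the 1000 GB figure are made up for illustration.

```python
def compaction_peak(input_bytes, surviving_fraction):
    """Return (temporary extra space, peak usage) while old and new sstables coexist."""
    if not 0.0 <= surviving_fraction <= 1.0:
        raise ValueError("surviving_fraction must be between 0 and 1")
    output_bytes = input_bytes * surviving_fraction  # size of the newly written sstable
    return output_bytes, input_bytes + output_bytes


# Half of the data was deleted: about 50% extra space is needed temporarily.
print(compaction_peak(1000, 0.5))
# Nothing can be purged: the worst case, where peak usage is 2x the input.
print(compaction_peak(1000, 1.0))
```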

Cassandra hard disk requirement with SizeTieredCompactionStrategy

I was going through Cassandra's SizeTieredCompactionStrategy and found out that it can sometimes double the size of the dataset's largest table during the compaction process. But I didn't get any information regarding when this can happen? Does anyone know about this?
This requirement arises from the fact that the compaction process needs enough space to take all the SSTables that should be compacted, read data from them, and write a new SSTable to the same disk. In the worst case, you have a table where all SSTables should be compacted, their total size is 50% of the available disk space, and no data will be thrown away. In that case, the compaction process will write a single SSTable equal in size to the input data. And if the input data occupies more than 50% of the disk space, compaction won't have enough space to write the new version.
In practice, you need enough space to compact the biggest SSTables in your biggest table with N compaction threads running at the same time. If you have many tables of similar size, then this restriction is not so strong...
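A hedged sketch of that headroom check, in Python: it assumes the worst case where no data is discarded (so each compaction writes as much as it reads) and where the N largest pending compactions run at once. The function names and sizes are hypothetical, not anything Cassandra exposes.

```python
def worst_case_headroom(compaction_inputs, concurrent_compactors):
    """Free space needed if the N largest compactions run at once and purge nothing."""
    totals = sorted((sum(sizes) for sizes in compaction_inputs), reverse=True)
    return sum(totals[:concurrent_compactors])


def has_room(free_bytes, compaction_inputs, concurrent_compactors):
    """True if the disk can absorb the worst-case temporary output."""
    return free_bytes >= worst_case_headroom(compaction_inputs, concurrent_compactors)


# Hypothetical pending compactions (sstable sizes in GB) and 2 compaction threads.
pending = [[100, 100, 100, 100], [10, 10, 10, 10], [1, 1, 1, 1]]
print(worst_case_headroom(pending, 2))
print(has_room(500, pending, 2), has_room(400, pending, 2))
```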

How to set TTL on Cassandra sstable

We are using Cassandra 3.10 with 6 nodes cluster.
Lately, we noticed that our data volume has increased drastically, by approximately 4GB per day on each node.
We want to implement a more aggressive retention policy in which we change the compaction strategy to TWCS with a 1-hour window size and set a TTL of a few days; this can be achieved via the table properties.
Since the ETL should be a slow process in order to lighten the Cassandra workload, it is possible that it will not finish extracting all the data before the TTL expires. So I wanted to know: is there a way for the ETL process to set TTL=0 on an entire SSTable once it is done extracting it?
TTL=0 is read as a tombstone. When next compacted, it would be written as a tombstone or purged, depending on your gc_grace. Other than the overhead of writing the tombstones, it might be easier just to do a delete, or to create sstables that contain the necessary tombstones, than to rewrite all the existing sstables. Whether it's more efficient to use range or point tombstones will depend on your version and schema.
An option that might be easiest is to use a different compaction strategy altogether, or a custom one like https://github.com/protectwise/cassandra-util/tree/master/deleting-compaction-strategy. You can then just purge data that has been processed when compactions run. How hard it is to mark what's been processed still depends quite a bit on your schema.
You should set TTL 0 at the table and query level as well. Once the TTL expires, the data will be converted to tombstones. Based on the gc_grace_seconds value, the next compaction will clear the tombstones. You may also run a major compaction to clear tombstones, but depending on the compaction strategy this is not recommended in Cassandra. With STCS, at least 50% free disk space is required for compaction to run healthily.
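The role of gc_grace_seconds can be sketched as a simple time check in Python. This is a simplification for an explicit delete with a known deletion time: it ignores other conditions Cassandra applies before dropping a tombstone (for example, overlapping sstables that are not part of the compaction). The function name is made up; 864000 seconds (10 days) is the default gc_grace_seconds.

```python
from datetime import datetime, timedelta, timezone


def tombstone_purgeable(deleted_at, gc_grace_seconds, now):
    """True once gc_grace_seconds has elapsed since the tombstone was written."""
    return now >= deleted_at + timedelta(seconds=gc_grace_seconds)


deleted_at = datetime(2024, 1, 1, tzinfo=timezone.utc)  # hypothetical deletion time
gc_grace = 864000  # 10 days, the default

print(tombstone_purgeable(deleted_at, gc_grace, deleted_at + timedelta(days=5)))
print(tombstone_purgeable(deleted_at, gc_grace, deleted_at + timedelta(days=20)))
```

Passing this check only makes the tombstone eligible; the space is reclaimed when a compaction actually rewrites the sstable that holds it.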

Cassandra 2.1 speed up full compaction

I have a Cassandra 2.1 cluster using Leveled Compaction Strategy.
Based on my calculation, the cluster will run out of space before compaction kicks in automatically when it reaches the next level. For that reason, I have a cron job that runs "nodetool compact" every week to run a full (major) compaction and remove tombstoned data points.
I noticed that full compaction consumes very little CPU/network resources. With a bigger data set, full compaction runs for days.
I have tried using "setcompactionthroughput" to set a higher number (128MB/s instead of the 32MB/s default), and even tried setting it to 0 (no limit), but full compaction speed doesn't seem to change at all.
Is there anything I can tune to make it faster? Thanks in advance.
There are very few cases where you should run full compaction via nodetool compact - it causes what you're likely seeing now (a single huge data file, which never naturally compacts with other sstables, even/especially when other deletions have happened).
Recovering from the state you're in isn't trivial, but is possible. If you have a lot of CPU/IO to spare, you can try toggling from STCS to LCS, and LeveledCompactionStrategy will naturally split up that huge file into thousands of tiny files, and will be much more aggressive about rewriting those files over time (so tombstones are compacted away much more regularly). This is very much CPU and IO intensive, so don't do it if you're near tipping. Also, it will duplicate all data on disk for a short period, so you'll need to be under 50% disk utilization to do this.
If you're over 50% disk utilization, you've backed yourself into a corner, and you'll probably need to add more disk temporarily in order to recover.

Cassandra cfstats: differences between Live and Total used space values

For about a month I have been seeing the following used-space values for 3 nodes (I have replication factor = 3) in my Cassandra cluster, in the nodetool cfstats output:
Pending Tasks: 0
Column Family: BinaryData
SSTable count: 8145
Space used (live): 787858513883
Space used (total): 1060488819870
For other nodes I see good values, something like:
Space used (live): 780599901299
Space used (total): 780599901299
You can see a 25% difference (~254 GB) between Live and Total space. It seems I have a lot of garbage on these 3 nodes which cannot be compacted for some reason.
The column family I'm talking about has LeveledCompactionStrategy configured with an SSTable size of 100 MB:
create column family BinaryData with key_validation_class=UTF8Type
and compaction_strategy=LeveledCompactionStrategy
and compaction_strategy_options={sstable_size_in_mb: 100};
Note that the total value has stayed the same for a month on all three nodes. I relied on Cassandra to normalize the data automatically.
What I tried in order to decrease the used space (without result):
nodetool cleanup
nodetool repair -pr
nodetool compact [KEYSPACE] BinaryData (nothing happens: major compaction is ignored for LeveledCompaction strategy)
Are there any other things I should try to clean up the garbage and free space?
OK, I have a solution. It looks like a Cassandra issue.
First, I went deep into the Cassandra 1.1.9 sources and noted that Cassandra performs some re-analysis of SSTables during node startup. It removes the SSTables marked as compacted, recalculates used space, and does some other stuff.
So, what I did was restart the 3 problem nodes. The Total and Live values became equal immediately after the restart completed, then the compaction process started, and used space is now decreasing.
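The gap between the two cfstats figures quoted in the question can be checked with a few lines of Python. The function name is made up; the byte counts are the ones from the cfstats output above. As far as I know, "total" includes obsolete sstables that are still on disk, which matches the behavior seen after the restart.

```python
def obsolete_space(live_bytes, total_bytes):
    """Return (bytes, GiB, share of total) held by sstables that are no longer live."""
    obsolete = total_bytes - live_bytes
    return obsolete, obsolete / 2**30, obsolete / total_bytes


obsolete, gib, share = obsolete_space(787858513883, 1060488819870)
print(f"{obsolete} bytes, {gib:.1f} GiB, {share:.1%} of total")
```

This gives roughly 254 GiB, a bit over a quarter of the total, in line with the figures in the question.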
Leveled compaction creates sstables of a fixed, relatively small size,
in your case it is 100Mb that are grouped into “levels”. Within each
level, sstables are guaranteed to be non-overlapping. Each level is
ten times as large as the previous.
So basically, from this statement in the Cassandra docs, we can conclude that in your case the next level, ten times larger, may not have formed yet, resulting in no compaction.
Coming to the second question: since you have kept the replication factor at 3, the data has 3 copies, which is why you see this anomaly.
And finally, the 25% difference between Live and Total space is, as you know, due to the delete operations.
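The level sizing in the quoted documentation can be sketched in Python. This is a simplified model that assumes a fanout of 10 and ignores L0; the function names are made up, and 787858 MB is just the live size from the question rounded to megabytes.

```python
def level_capacity_mb(level, sstable_size_mb, fanout=10):
    """Approximate capacity of level N (N >= 1): sstable size times fanout^N."""
    if level < 1:
        raise ValueError("this model only covers L1 and above")
    return sstable_size_mb * fanout ** level


def levels_needed(data_mb, sstable_size_mb, fanout=10):
    """Number of levels needed before the cumulative capacity covers data_mb."""
    level, capacity = 0, 0
    while capacity < data_mb:
        level += 1
        capacity += level_capacity_mb(level, sstable_size_mb, fanout)
    return level


for size_mb in (100, 15):
    print(size_mb, [level_capacity_mb(n, size_mb) for n in (1, 2, 3)],
          levels_needed(787858, size_mb))
```

In this model a larger sstable size means fewer, much bigger levels, so each level takes more data to fill before it spills over into the next one.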
For LeveledCompactionStrategy you want to set the sstable size to a max of around 15 MB. 100MB is going to cause you a lot of needless disk IO and it will cause it to take a long time for data to propagate to higher levels, making deleted data stick around for a long time.
With a lot of deletes, you are most likely hitting some of the issues with minor compactions not doing a great job of cleaning up deleted data in Cassandra 1.1. There are a bunch of fixes for tombstone cleanup during minor compaction in Cassandra 1.2, especially when combined with LCS. I would take a look at testing Cassandra 1.2 in your Dev/QA environment. 1.2 does still have some kinks being ironed out, so you will want to keep up to date with new versions, or even run off the 1.2 branch in git, but for your data size and usage pattern, I think it will give you some definite improvements.

Resources