cassandra disk space usage - garbage-collection

The problem: our Cassandra database occupies a lot of disk space. The estimated data size is about 10 GB, while the disk space occupied is about 100 GB. We do a lot of writes/deletes. We have two nodes.
Here's what we tried to do (in the order it was done):
Run compaction on both nodes - completed, but zero effect.
Set gc_grace to 0 (see the sketch after this list).
Run repair on both nodes - one node succeeded; on the other, the repair hung - the process was alive but ran for 3 days, after which we killed it.
Run compaction on both nodes - completed, but still zero effect.
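For context, gc_grace is a per-table setting; a minimal CQL sketch of that step (the keyspace and table names here are placeholders, not from the original setup):
-- gc_grace_seconds = 0 removes the tombstone grace period for this table
ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 0;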
Can someone help with this? What should we do next? :)

I faced a similar problem with Cassandra 2.0.9.
I succeeded in clearing space on the HDD by using nodetool clearsnapshot on every node. Snapshots can also be removed only for specified keyspaces/column families. Details on nodetool usage can be found in the nodetool documentation.
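For reference, the relevant nodetool invocations look roughly like this (the keyspace name is only a placeholder; check nodetool help clearsnapshot for the exact options in your version):
# remove all snapshots on this node
nodetool clearsnapshot
# remove snapshots for one keyspace only (keyspace name is an example)
nodetool clearsnapshot my_keyspace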

Related

How to limit number of validation compactions during repair on Apache Cassandra 3.11

While running a full repair on a Cassandra cluster with 15 nodes, RF=3 and 3 racks (single datacenter) using the command ./nodetool repair -pr -full -seq, I can see multiple validation compactions running at the same time (>10). Is there any way to limit simultaneous validations in Cassandra 3.11.1, like we can limit normal compactions?
As the cluster size has increased, I limited repairs to run table by table and also used -pr and -seq to restrict load on the nodes. But now the load is very high due to concurrent validation compactions. I need a way to restrict concurrent validation compactions to reduce load on the nodes during repairs. I'm also exploring Reaper to manage repairs, but I need a workaround for the load issues until I move to Reaper.
If you're seeing (validation) compactions becoming cumbersome, there are two settings that you should look at:
compaction_throughput_mb_per_sec
concurrent_compactors
compaction_throughput_mb_per_sec
This is the main tuneable setting for compaction. I mentioned this setting in a related answer here: Advise on stopping compaction to reduce slowness
I would recommend checking this setting, and then reducing it until contention is resolved. Or, you could try to set compaction throughput to 1 (the lowest setting) during the day. Then, raise it back up once business hours are over.
% bin/nodetool setcompactionthroughput 1
% bin/nodetool getcompactionthroughput
Current compaction throughput: 1 MB/s
But definitely check it first, just to see what you're running at, and then maybe consider halving it and checking the effect.
concurrent_compactors
So this defaults to the smaller of (number of disks, number of cores), with a minimum of 2 and a maximum of 8. There is some solid advice out there about forcing this to a value of 1 if you're using spinning disks, and maybe setting it to 4 for SSDs. The default is usually fine, but if it's too high, compactions can overwhelm disk I/O.
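Both knobs live in cassandra.yaml; a minimal sketch with illustrative values (not a recommendation for any particular hardware):
# cassandra.yaml (illustrative values)
compaction_throughput_mb_per_sec: 16    # default; lower it if compactions saturate disk I/O
concurrent_compactors: 1                # often suggested for spinning disks; the default is usually fine for SSDs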
tl;dr;
Focus on compaction throughput for now. My advice is to check it, lower it, observe, and repeat until things improve.

Cleanup space in almost full Cassandra Node

I have a Cassandra cluster (2 DCs) with 6 nodes each and RF 2. 4 of the nodes (in each DC) are getting full, so I need to clean up space very soon.
I tried to run a full repair, but that turned out to be a bad idea since the space started increasing even more and the repair eventually hung. As a last resort I am thinking of starting repairs and then cleaning up specific column families, starting from the smallest to the biggest.
i.e
nodetool repair -full foo_keyspace bar_columnfamily
nodetool cleanup foo_keyspace bar_columnfamily
Do you think that this procedure will be safe for the data?
Thank you
The commands that you presented in your question make several incorrect assumptions. First, "repair" is not supposed to, and will not, save any space. All repair does is find inconsistencies between different replicas and repair them. It will either do nothing (if there are no inconsistencies) or add data; it does not remove data.
Second, "cleanup" is something you need to do after adding new nodes to the cluster - after each node has sent some of its data to the new node, a "cleanup" removes that data from the old nodes. But cleanup is not relevant when you are not adding nodes.
The command you may be looking for is "compact". This can save space, but only when you know you had a lot of overwrites (rewriting existing rows), deletions, or data expirations (TTL). What compaction strategy are you using? If it's the default, size-tiered compaction strategy (STCS), you can start a major compaction (nodetool compact) but should be aware of a big risk involved:
Major compaction merges all the data into one sstable (Cassandra's on-disk file format), dropping deleted, expired or overwritten data. However, during this compaction process you have both input and output files, and in the worst case this may double your disk usage, and it may fail if the disk is more than 50% full. This is why a lot of Cassandra best-practice guides suggest never filling more than 50% of the disk. But this is just the worst case. You can get along with less free space if you know that the output file will be much smaller than the input (because most of the data has been deleted). Perhaps more usefully, if you have many separate tables (column families), you can compact each one separately (as you suggested, from smallest to biggest) and the maximum amount of disk space needed temporarily during the compaction can be much less than 50% of the disk.
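If you do go the per-table route, the invocation is nodetool compact with the keyspace and table spelled out; a sketch using the names from the question above, plus compactionstats to watch progress:
# compact one table at a time, smallest first
nodetool compact foo_keyspace bar_columnfamily
# monitor progress of the running compaction
nodetool compactionstats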
Scylla, a C++ reimplementation of Cassandra, is developing something known as "hybrid compaction" (see https://www.slideshare.net/ScyllaDB/scylla-summit-2017-how-to-ruin-your-performance-by-choosing-the-wrong-compaction-strategy) which is like Cassandra's size-tiered compaction but does compaction in small pieces instead of generating one huge file, to avoid the huge temporary disk usage during compaction. Unfortunately, Cassandra doesn't have this feature yet.
A good idea is to first start repair on the smallest table in the smallest keyspace, one by one, and let each repair complete. It will take time, but it is the safer way, with less chance of hanging or traffic loss.
Once the repair has completed, start cleanup in the same way, table by table. This way there is minimal impact on the node and the cluster.
You shouldn't fill more than about 50-60% of your disks, to leave room for compaction. If you're above that level of disk usage, you need to consider getting bigger disks or adding more nodes.
Datastax recommendations are usually good to follow: https://docs.datastax.com/en/dse-planning/doc/planning/planPlanningDiskCapacity.html

Cassandra - Compaction process stuck

On one of our servers, the compaction process is hanging. It's stuck at 80% and has been stuck for the last 3 days. Today we did a cluster restart (one host at a time), and again it is stuck at the same 80%. CPU usage is at 100% and there seems to be no I/O issue. We are seeing the following WARNING in system.log:
BatchStatement.java (line 226) Batch of prepared statements for [****, ****] is of size 7557, exceeding specified threshold of 5120 by 2437.
I have tried to stop this compaction using nodetool, but it does not stop.
Can someone please help?
How much disk space is left? You need at least 50% of the disk space available if you are using the STCS compaction strategy. Another possible reason: compaction is stuck on a large partition for a particular key, in which case you may need to delete the data for that key.
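For completeness, the commands generally used to inspect and stop a running compaction are below (behaviour varies by version, and a stop request only takes effect when the compaction task checks for it):
# see which compactions are running and how far along they are
nodetool compactionstats
# ask Cassandra to stop the currently active compactions
nodetool stop COMPACTION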

Cassandra 'nodetool repair -pr' taking way too much time

I am running a cluster with 1 datacenter (10 nodes) and Cassandra 2.1.7 installed on each. We are using SimpleStrategy (an old mistake).
The situation is, I have not run any nodetool repair since the beginning, and there is now about 200 GB of data with RF 3.
Running a full repair or an incremental repair is the same at this point, so I tried to run a full repair, but this resulted in the coordinator node going down.
So I ended up running a full partition-range repair (nodetool repair -pr) on each node, one at a time. But this is taking way too much time (15+ hours per node, hence weeks for all nodes).
Am I doing this wrong, or is this supposed to happen? Or is this a version problem?
In the future, if I run a full repair again after finishing this, will it take weeks as well?
Since full repair is mainly affected by data size, it should take about the same amount of time.
I suggest moving to incremental repairs; this should save you time and resources.
Here's an article about how to do this in 2.1:
https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
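After the migration steps in that document, the incremental repair itself is run with the -inc flag; in 2.1 it is typically combined with -par, since incremental repair does not work with the sequential (snapshot) mode. A rough sketch, one keyspace at a time (keyspace name is a placeholder):
# incremental, parallel repair of a single keyspace
nodetool repair -inc -par my_keyspace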
If your data size is too big, you can also use sub-range repair; it's similar to repair -pr but focuses on a sub-range of the token ring.
For more explanation:
https://www.pythian.com/blog/effective-anti-entropy-repair-cassandra
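A sub-range repair is the normal repair command with explicit token bounds; a hedged sketch (the token values and keyspace name are placeholders, typically produced by a script that splits each node's ranges):
# repair only the token range between the given start and end tokens
nodetool repair -st -9223372036854775808 -et -6148914691236517206 my_keyspace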

Cassandra cfstats: differences between Live and Total used space values

For about a month I have been seeing the following values of used space for 3 of the nodes (I have replication factor = 3) in my Cassandra cluster, in the nodetool cfstats output:
Pending Tasks: 0
Column Family: BinaryData
SSTable count: 8145
Space used (live): 787858513883
Space used (total): 1060488819870
For other nodes I see good values, something like:
Space used (live): 780599901299
Space used (total): 780599901299
You can note a 25% difference (~254 GB) between Live and Total space. It seems I have a lot of garbage on these 3 nodes which cannot be compacted for some reason.
The column family I'm talking about has a LeveledCompaction strategy configured with SSTable size of 100Mb:
create column family BinaryData with key_validation_class=UTF8Type
and compaction_strategy=LeveledCompactionStrategy
and compaction_strategy_options={sstable_size_in_mb: 100};
Note that the Total value has stayed like this for a month on all three of these nodes. I was relying on Cassandra to normalize the data automatically.
What I tried to decrease space (without result):
nodetool cleanup
nodetool repair -pr
nodetool compact [KEYSPACE] BinaryData (nothing happens: major compaction is ignored for LeveledCompaction strategy)
Are there any other things I should try to clean up the garbage and free space?
OK, I have a solution. It looks like a Cassandra issue.
First, I dug into the Cassandra 1.1.9 sources and noticed that Cassandra performs some re-analysis of SSTables during node startup. It removes the SSTables marked as compacted, recalculates used space, and does some other work.
So what I did was restart the 3 problem nodes. The Total and Live values became equal immediately after the restart completed, then the compaction process started and used space is now decreasing.
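For anyone repeating this, it was an ordinary rolling restart; assuming a package install with a cassandra service, something along these lines per node:
# flush memtables and stop accepting requests before the restart
nodetool drain
# restart the service (the exact command depends on how Cassandra is installed)
sudo service cassandra restart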
Leveled compaction creates sstables of a fixed, relatively small size, in your case 100 MB, that are grouped into “levels”. Within each level, sstables are guaranteed to be non-overlapping. Each level is ten times as large as the previous.
So basically, from this statement in the Cassandra docs, we can conclude that maybe in your case the ten-times-larger next level has not formed yet, resulting in no compaction.
Coming to the second point: since you have kept the replication factor at 3, the data has 3 duplicate copies, which contributes to this anomaly.
And finally, the 25% difference between Live and Total space is, as you know, due to delete operations.
For LeveledCompactionStrategy you want to set the sstable size to a max of around 15 MB. 100 MB is going to cause you a lot of needless disk I/O, and it will take a long time for data to propagate to higher levels, making deleted data stick around for a long time.
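Changing that is the same compaction_strategy_options attribute used when the column family was created; a sketch in the same cassandra-cli syntax as above (verify on a test cluster first):
update column family BinaryData
  with compaction_strategy_options={sstable_size_in_mb: 15};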
With a lot of deletes, you are most likely hitting some of the issues with minor compactions not doing a great job cleaning up deleted data in Cassandra 1.1. There are a bunch of fixes for tombstone cleanup during minor compaction in Cassandra 1.2. Especially when combined with LCS. I would take a look at testing Cassandra 1.2 in your Dev/QA environment. 1.2 does still have some kinks being ironed out, so you will want to make sure to keep up to date with installing new versions, or even running off of the 1.2 branch in git, but for your data size and usage pattern, I think it will give you some definite improvements.
