Increased disk space usage after nodetool cleanup - Apache Cassandra

We have an Apache Cassandra (version 3.11.4) cluster in production with five nodes in each of two DCs. We recently added the last two nodes, and after the repairs had finished we started the cleanup two days ago. The nodes are quite large: /data is a 2.8 TB mounted volume, and Cassandra used around 48% of it before the cleanup.
Cleanup finished on the first node after ~14 hours (I don't think it broke: there are no errors in the log, and nodetool compactionstats reports 0 pending tasks). During the cleanup the disk usage climbed to 81% and has never gone back down since.
Will Cassandra clean this up on its own, and if so, when? Or do we have to do something manually? We can't find any tmp files that could be removed by hand, so we're out of ideas. Has anyone run into this and found a solution?
Thanks in advance!

Check the old snapshots. Most probably you had many snapshots (from backups, or from truncated or dropped tables) that were hard links to the data files and therefore consumed no extra space. After nodetool cleanup, the data files were rewritten and new files were created, while the hard links still point to the original files, which now consume disk space on their own. Use nodetool listsnapshots to get a list of existing snapshots, and nodetool clearsnapshot to remove the snapshots you no longer need.
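A minimal sketch of what that looks like on one node; the snapshot tag is a placeholder, not taken from the question:

    # list existing snapshots and the "true size" they pin on disk
    nodetool listsnapshots
    # remove a snapshot that is no longer needed (the tag name is an example)
    nodetool clearsnapshot -t <snapshot_name>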

Related

How to rebalance and reclaim disk space after adding a Cassandra node

I have a 12-node Cassandra cluster that is heavy on data and whose disk space is nearing full capacity. I have expanded the cluster by adding one node and plan to add a couple more.
I can see that the data load went down after adding the new node; however, the disk usage has not.
I'm wary of running nodetool repair, as it may require additional disk space and the available space may not be sufficient.
There are suggestions to use nodetool cleanup, but it looks like this also causes a temporary increase in disk usage:
https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/tools/toolsCleanup.html
Please suggest whether there are better ways to clean up old data from the other nodes to reclaim disk space.
Unfortunately, nodetool cleanup is the only way to evict data that a node no longer owns after nodes are added to a cluster, and therefore the only way to reclaim that disk space.
For cleanup to work, it temporarily uses more space, since it needs to rewrite the SSTables into new ones. This can be problematic if you have really large SSTables that are several GB in size and don't have a lot of disk space left.
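As a side note (a sketch, not part of the original answer): nodetool cleanup accepts optional keyspace and table arguments, so you can work through the largest tables one at a time and watch disk usage between runs; the names below are placeholders.

    # clean up a single keyspace, or a single table within it
    nodetool cleanup my_keyspace
    nodetool cleanup my_keyspace my_table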
For large SSTables on tables configured with SizeTieredCompactionStrategy, you can work around this problem by splitting them into smaller files on another server using the sstablesplit tool. I've documented the instructions in https://community.datastax.com/questions/6415/. Cheers!
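A rough illustration of that workaround, assuming a 3.x-format SSTable; the path, file name, and 50 MB target size are examples, not taken from the linked post:

    # on a spare server, against a copy of the oversized SSTable, with no Cassandra process running
    sstablesplit --no-snapshot -s 50 /tmp/split-work/mc-1234-big-Data.db
    # then move the resulting smaller SSTables back into the table's data directory on the node
    # and load them, e.g. with: nodetool refresh <keyspace> <table>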

Guideline regarding nodetool repair in apache cassandra v3.0.9

We are using Apache Cassandra v3.0.9 and have 3 DCs. We are having continuous trouble running nodetool repair, and most of the time the repair process causes big outages. The three datacenters consist of 4, 4 and 15 nodes. The total data is around 200 GB at RF=3 and we are using LCS. The nodes have 16 GB of RAM, of which 6 GB is dedicated to the heap. Most of the time when we try to run a full repair, it fails with long GC pauses and nodes becoming unresponsive. Outside of repair our nodes are fine on heap and GC pauses are hardly 300 ms. I have the following questions.
Is it still required to run a full repair within gc_grace_seconds, or are incremental repairs good enough in Apache Cassandra v3.0.9?
Do I need to run incremental sequential repairs on every node of the cluster, on one node of each datacenter, or on just any one node of the whole cluster? One by one or concurrently?
What are the downsides of a repair failing because some nodes became unresponsive/died during the repair process? Are there any steps to take before starting another repair session?
What are the downsides of not scheduling repairs at all?
We started our Cassandra deployment straight away on version 3.0.9. Is the migration mentioned in the Apache Cassandra documentation still required?
Full repair is still needed. Incremental repair splits SSTables into "repaired" and "unrepaired" sets, and the "repaired" part will not be repaired again later, which is why incremental repair is more efficient. However, if there is data corruption in the "repaired" SSTables, only a full repair can fix it. Our experience is to run incremental repair every day and a full repair only once per grace period. Also, once you have incremental repair in place, you can make the grace period longer.
It's better to run incremental repair on each node one by one; you can use a cron job or code a simple scheduler to do that (see the sketch after this list).
If a repair fails, just run it again; there is no side effect I know of.
If you don't run repair, over time your data consistency is at risk. Cassandra is built on eventual consistency, which means it doesn't guarantee strong consistency when you write data unless you explicitly request it. Repair is important to guarantee that the data in the background is kept up to date and consistent.
If you already run full repair in your cluster, you shouldn't need to migrate explicitly.
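A minimal sketch of such a scheduler, assuming cron and nodetool on the PATH; the times, log path, and the staggering per node are assumptions, not from the answer:

    # crontab entry on one node: nightly incremental repair of this node
    # (incremental is the default repair mode in 3.0); stagger the hour on each
    # node (01:00, 03:00, ...) so only one node repairs at a time
    0 1 * * * /usr/bin/nodetool repair >> /var/log/cassandra/scheduled-repair.log 2>&1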

Is it recommended to do periodic cassandra repair

We recently had a disk fail in one of our Cassandra nodes (it's a 5-node Cassandra 2.2 cluster with a replication factor of 3). It took about a week or more to perform a full repair on that node. Each node contains 3/5 of the data, and running nodetool repair repaired 3/5 of the token ranges across all nodes. Now that it has been repaired, subsequent repairs will most likely be faster since it was an incremental repair. I am wondering if it's a good idea to perform periodic repairs on all nodes using nodetool repair -pr (we are on 2.2, and I think incremental repair is the default in 2.2).
I think it's a good idea because, if performed periodically, repair will take less time, as it only needs to repair SSTables that haven't been repaired yet. We also might have had instances where nodes were down for longer than the hinted handoff window and we probably didn't do anything about it.
Yes, it's good practice to run scheduled incremental repair. Run repair frequently enough that every node is repaired before reaching the time specified in the gc_grace_seconds setting.
Also, it's best to run incremental repair frequently and combine it with a full repair less often, say once per week or month. Incremental repair only repairs SSTables that were not previously marked as repaired, while a full repair covers more comprehensive cases such as SSTable bit rot. Check the reference from DataStax: https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesWhen.html
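For illustration, the two kinds of runs might look like this on a 2.2 node; the cadence is only the suggestion above, not a fixed rule:

    # frequent run: repair of this node's primary ranges
    # (incremental is the default repair mode in 2.2+)
    nodetool repair -pr
    # occasional run (e.g. weekly/monthly): force a full repair
    nodetool repair -full -pr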

cassandra nodetool repair what does it really do?

Hi all (cassandra noob here),
I'm trying to understand exactly what's going on with repair, and to get to the point where we can run our repairs on a schedule.
I have set up a 4-DC (3 nodes per DC) Cassandra 2.1 cluster, 4 GB RAM (1 GB heap), HDDs.
Due to various issues (repair taking too long, crashing nodes with OOM), I decided to start fresh and nuke everything: I deleted /var/lib/cassandra/data and /opt/cassandra/data.
I recreated my keyspaces (no data) and ran nodetool repair -par -inc -local.
I was surprised to see it took ~5 minutes to run; watching the logs, I see Merkle trees being generated...
I guess my question is: if my keyspaces have no data, what is it generating Merkle trees against?
After running a local-DC repair against each node in that DC, I decided to run a cross-DC repair, again with no data.
This time it took 4+ hours, with no data? This really feels wrong to me. What am I missing?
Thanks

How can I switch from multiple disks to a single disk in cassandra?

Because I ran out of space when shuffling, I was forced to add multiple disks on my Cassandra nodes.
When I finish compacting, cleaning up, and repairing, I'd like to remove them and return to one disk per node.
What is the procedure to make the switch?
Can I just kill cassandra, move the data from one disk to the other, remove the configuration for the second disk, and re-start cassandra?
I assume files will not have the same name and thus not be overwritten, is this the case?
1. Run disablegossip and disablethrift from nodetool, so that this node is seen as DOWN by the other nodes.
2. Flush/drain the memtables and run compaction to merge SSTables, if any. Optionally, take a snapshot as a precaution. This stops all the other nodes/clients from writing to this node, and since the memtables are flushed to disk, no data is left only in memory.
3. Stop Cassandra (although this node is down, the cluster is still available for reads/writes, so zero downtime).
4. Move the data/log contents from the other disk to the disk you want to keep.
5. Update the paths below in cassandra.yaml:
commitlog_directory
saved_caches_directory
data_file_directories
log_directory
6. Restart Cassandra.
7. Repeat this for all nodes. A rough command-line sketch of the per-node steps follows below.
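A hedged sketch of those per-node steps as shell commands; the service name, mount points, and rsync usage are assumptions, not part of the answer above:

    nodetool disablegossip
    nodetool disablethrift
    nodetool flush
    nodetool snapshot                 # optional precaution
    nodetool drain                    # flush memtables and stop accepting writes
    sudo service cassandra stop       # service name depends on your install
    # copy everything from the disk being removed onto the disk you are keeping
    rsync -a /data2/cassandra/ /data1/cassandra/
    # edit cassandra.yaml so data_file_directories, commitlog_directory and
    # saved_caches_directory all point at the remaining disk, then:
    sudo service cassandra start
    nodetool status                   # the node should come back as UN (Up/Normal)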
