How can I switch from multiple disks to a single disk in Cassandra?

Because I ran out of space when shuffling, I was forced to add multiple disks on my Cassandra nodes.
When I finish compacting, cleaning up, and repairing, I'd like to remove them and return to one disk per node.
What is the procedure to make the switch?
Can I just kill Cassandra, move the data from one disk to the other, remove the configuration for the second disk, and restart Cassandra?
I assume the files will not have the same names and thus will not be overwritten. Is this the case?

1. Run disablegossip and disablethrift from nodetool, so that this node is seen as DOWN by the other nodes.
2. Flush/drain the memtables and run compaction to merge SSTables, if any (optionally, take a snapshot as a precaution).
Together, these steps stop the other nodes and clients from writing to this node and leave the memtables flushed to disk.
3. Stop Cassandra (though this node is down, the cluster is still available for writes and reads, so there is zero downtime).
4. Move the data/log contents from the other disk to the disk you want to keep.
5. Edit cassandra.yaml to change the paths below:
   commitlog_directory
   saved_caches_directory
   data_file_directories
   log_directory
6. Restart Cassandra.
Do this for all nodes.
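On the question of whether files could overwrite each other during the move, one cautious option is to check for name collisions before moving anything. The following is only a minimal sketch, not a Cassandra tool: plan_merge and merge are hypothetical helper names, and the sketch assumes Cassandra is already stopped and that both directories share the same keyspace/table layout.

```python
import os
import shutil


def plan_merge(src_root, dst_root):
    """Return (moves, collisions) for merging src_root into dst_root.

    Paths are compared relative to each root, so
    src_root/ks/table/file maps to dst_root/ks/table/file.
    """
    moves, collisions = [], []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel_dir = os.path.relpath(dirpath, src_root)
        for name in sorted(filenames):
            src = os.path.join(dirpath, name)
            dst = os.path.normpath(os.path.join(dst_root, rel_dir, name))
            (collisions if os.path.exists(dst) else moves).append((src, dst))
    return moves, collisions


def merge(src_root, dst_root):
    """Move every file across, refusing to start if anything would be overwritten."""
    moves, collisions = plan_merge(src_root, dst_root)
    if collisions:
        raise FileExistsError(f"{len(collisions)} file(s) would be overwritten")
    for src, dst in moves:
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.move(src, dst)
    return len(moves)
```

Calling plan_merge first on the two data directories shows what would happen without touching any file; merge only moves files when the collision list is empty.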

Related

How to rebalance and reclaim disk space after adding a Cassandra node

I have a 12-node Cassandra cluster that carries a heavy data load, and disk space is nearing full capacity. I have expanded the cluster by adding 1 node and am planning to add a couple more.
I can see that the data load went down after adding the new node. However, disk usage has not gone down.
I am afraid to run nodetool repair, as it may require additional disk space and the available space may not be sufficient.
There are suggestions to use nodetool cleanup, but it looks like this will also cause a temporary increase in disk usage:
https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/tools/toolsCleanup.html
Please suggest whether there are better ways to clean up old data from the other nodes to reclaim disk space.
Unfortunately, nodetool cleanup is the only way to evict data that a node no longer owns after nodes are added to a cluster, and so the only way to reclaim that disk space.
For cleanup to work, it temporarily uses more space, since it needs to rewrite SSTables into new ones. This can be problematic if you have really large SSTables that are several GBs in size and don't have a lot of disk space left.
You can work around this problem for large SSTables that are configured with SizeTieredCompactionStrategy by splitting them into smaller files on another server using the sstablesplit tool. I've documented the instructions in https://community.datastax.com/questions/6415/. Cheers!

Increased disk space usage after nodetool cleanup - Apache Cassandra

We have an Apache Cassandra (version 3.11.4) cluster in production with 5 nodes in each of two DCs. We added the last two nodes recently, and after the repairs had finished we started the cleanup 2 days ago. The nodes are quite large: /data has 2.8TB of mounted disk space, and Cassandra used around 48% of it before the cleanup.
The cleanup finished on the first node after ~14 hours (I don't think it broke: there are no errors in the log, and nodetool compactionstats reports 0 pending tasks). During the cleanup the disk usage rose to 81%, and it has not gone back down since.
Will Cassandra clean this up, and if so, when? Or do we have to do something manually? We can't find any tmp files that could be removed manually, so we are out of ideas. Has anyone run into this and found a solution?
Thanks in advance!
Check for old snapshots. Most probably you had many snapshots (from backups, or from truncated or dropped tables) that were hard links to the data files and so did not consume extra space. When nodetool cleanup rewrote the data files, new files were created while the hard links kept pointing to the original files, which now consume disk space on their own. Use nodetool listsnapshots to list the existing snapshots, and nodetool clearsnapshot to remove the ones you no longer need.
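The hard-link effect described above can be reproduced on any filesystem that supports hard links. The sketch below is only an illustration under stated assumptions: the file names are made up, and the "cleanup" step is imitated by writing a smaller file and deleting the original, which is not how Cassandra itself is implemented.

```python
import os
import tempfile


def unique_bytes(root):
    """Sum file sizes under root, counting each inode only once."""
    seen, total = set(), 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            st = os.stat(os.path.join(dirpath, name))
            if (st.st_dev, st.st_ino) not in seen:
                seen.add((st.st_dev, st.st_ino))
                total += st.st_size
    return total


def snapshot_then_rewrite(workdir):
    """Imitate snapshot + cleanup on one file; return bytes used before and after."""
    data = os.path.join(workdir, "sstable-1-Data.db")  # hypothetical file name
    snap_dir = os.path.join(workdir, "snapshots")
    os.makedirs(snap_dir)
    with open(data, "wb") as f:
        f.write(b"x" * 1000)

    # Snapshot: a hard link to the same inode, so no extra space is used yet.
    os.link(data, os.path.join(snap_dir, "sstable-1-Data.db"))
    before = unique_bytes(workdir)

    # "Cleanup": write a new, smaller file and delete the old one.
    with open(os.path.join(workdir, "sstable-2-Data.db"), "wb") as f:
        f.write(b"x" * 800)
    os.remove(data)

    # The old data is still on disk because the snapshot link still points to it.
    return before, unique_bytes(workdir)
```

Running snapshot_then_rewrite(tempfile.mkdtemp()) shows usage growing from 1000 to 1800 bytes even though the live data shrank to 800 bytes; the space only comes back once the snapshot link is deleted.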

What is the purpose of Cassandra's commit log?

Could someone please help me understand the commit log and its use?
In Cassandra, when writing to disk, is the commit log the first entry point, or is it the MemTables?
If MemTables are what gets flushed to disk, what is the use of the commit log? Is its only purpose to resolve sync issues if a data node is down?
You can think of the commit log as an optimization, but Cassandra would be unusably slow without it. When MemTables get written to disk we call them SSTables. SSTables are immutable, meaning once Cassandra writes them to disk it does not update them. So when a column changes Cassandra needs to write a new SSTable to disk. If Cassandra was writing these SSTables to disk on every update it would be completely IO bound and very slow.
So Cassandra uses a few tricks to get better performance. Instead of writing SSTables to disk on every column update, it keeps the updates in memory and flushes those changes to disk periodically to keep the IO to a reasonable level. But this leads to the obvious problem that if the machine goes down or Cassandra crashes you would lose data on that node. To avoid losing data, in addition to keeping recent changes in memory, Cassandra writes the changes to its CommitLog.
You may be asking why writing to the CommitLog is any better than just writing the SSTables. The CommitLog is optimized for writing. Unlike SSTables, which store rows in sorted order, the CommitLog stores updates in the order in which they were processed by Cassandra. The CommitLog also stores changes for all the column families in a single file, so the disk doesn't need to do a bunch of seeks when it is receiving updates for multiple column families at the same time.
Basically, writing the CommitLog to disk is better because it has to write less data than writing SSTables does, and it writes all that data to a single place on disk.
Cassandra keeps track of what data has been flushed to SSTables and is able to truncate the Commit log once all data older than a certain point has been written.
When Cassandra starts up it has to read the commit log back from that last known good point in time (the point at which we know all previous writes were written to an SSTable). It re-applies the changes in the commit log to its MemTables so it can get into the same state when it stopped. This process can be slow so if you are stopping a Cassandra node for maintenance it is a good idea to use nodetool drain before shutting it down which will flush everything in the MemTables to SSTables and make the amount of work on startup a lot smaller.
The write path in Cassandra works like this:
Cassandra Node ----> Commitlog ----------------> Memtable
                         |                           |
                         |                           |
                         |---> Periodically          |---> Periodically
                               sync to disk                flush to SSTable
Memtable and Commitlog are NOT written (kind of) in parallel. Write to Commitlog must be finished before starting to write to Memtable. Related source code stack is:
org.apache.cassandra.service.StorageProxy.mutateMV: mutation.apply ->
org.apache.cassandra.db.Mutation.apply: Keyspace.open(keyspaceName).apply ->
org.apache.cassandra.db.Keyspace.apply ->
org.apache.cassandra.db.Keyspace.applyInternal {
    Tracing.trace("Appending to commitlog");
    commitLogPosition = CommitLog.instance.add(mutation)
    ...
    Tracing.trace("Adding to {} memtable",...
    ...
    upd.metadata().name(...);
    ...
    cfs.apply(...);
    ...
}
The purpose of the Commitlog is to be able to recreate the Memtable after a node crashes or gets rebooted. This is important, since the Memtable only gets flushed to disk when it is 'full' (meaning the configured Memtable size is exceeded) or when a flush is triggered from nodetool or OpsCenter. So the data in the Memtable is not persisted directly.
Having said that, it is a good idea to call nodetool flush before rebooting a node or container, to make sure your Memtables are fully persisted (flushed) to SSTables on disk. This also reduces the replay time of the Commitlog after the node or container comes up again.
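To make the ordering and the replay concrete, here is a toy model of the write path. It is a sketch under heavy simplifying assumptions and not Cassandra's implementation: ToyNode and its methods are invented names, the commit log is a plain list rather than segmented files shared by all tables, and the flush trigger is a simple entry count.

```python
class ToyNode:
    """Toy model of the write path: commit log first, then memtable."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # durable, append-only (survives a crash)
        self.sstables = []     # durable, immutable once written
        self.memtable = {}     # in memory only (lost on a crash)
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. append to the commit log
        self.memtable[key] = value            # 2. then update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Persist the memtable as a sorted SSTable, then discard the log."""
        if self.memtable:
            self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()

    def crash_and_restart(self):
        """Lose the memtable, then rebuild it by replaying the commit log."""
        self.memtable = {}
        for key, value in self.commit_log:
            self.memtable[key] = value

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable first
            if key in table:
                return table[key]
        return None
```

In this model, a crash after two unflushed writes is recovered entirely from the commit log, while a crash right after flush() replays nothing, which is why flushing or draining before a planned restart shortens startup.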

Cassandra node almost out of space, but nodetool cleanup is increasing disk use?

One of our nodes was at 95% disk use and we added another node to the cluster to hopefully rebalance but the disk space didn't drop on the node. I tried doing nodetool cleanup assuming that excess keys were on the node, but the disk space is increasing! Will cleanup actually reduce the size?
Yes, it will, but you have to be careful: cleanup runs as a compaction, which generates temporary files and tmp link files that increase disk usage until the cleaned-up, compacted table has been written.
So I would go into your data directory and work out your keyspace sizes using:
du -h -s *
Then clean up the smaller keyspaces individually (you can specify a keyspace with nodetool cleanup <keyspace>) until you have some headroom. To get an idea of how much space is being freed, tail the log and grep for cleaned compactions:
tail <system.log location> | grep 'eaned'
I'd recommend you don't try to clean up a keyspace that is more than half the size of your remaining disk space. Hopefully that is possible.
If you don't have enough space, you'll have to shut down the node, attach a bigger disk, copy the data files over to it, repoint the yaml to the new data directories, and then restart. This is useful when, for example, your SSDs are expensive and small while the main spinning disks are cheaper and bigger.
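The "no more than half the remaining space" rule of thumb can be turned into a small check. This is a rough sketch, not an official tool: dir_size, cleanup_candidates and scan are hypothetical helper names, and the sizes count every file under a keyspace directory, including snapshot hard links, so they may overstate real usage.

```python
import os
import shutil


def dir_size(path):
    """Total size in bytes of all files under path."""
    total = 0
    for dirpath, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def cleanup_candidates(keyspace_sizes, free_bytes):
    """Keyspaces no larger than half the free space, smallest first."""
    safe = [(size, ks) for ks, size in keyspace_sizes.items()
            if size <= free_bytes / 2]
    return [ks for size, ks in sorted(safe)]


def scan(data_dir):
    """Measure each keyspace directory and the free space on its disk."""
    sizes = {
        entry.name: dir_size(entry.path)
        for entry in os.scandir(data_dir) if entry.is_dir()
    }
    return sizes, shutil.disk_usage(data_dir).free
```

A possible use is sizes, free = scan(data_dir) followed by cleanup_candidates(sizes, free), where data_dir is whatever your data_file_directories entry points to. Rerunning the scan after each cleanup picks up the space that was just freed.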

Datastax Cassandra Remove and cleanup one column family

After some IT cleanup, we are noticing that we should probably do a full cleanup / restore for one column family. We believe that Cassandra has duplicate data that it is not cleaning up. Is it possible to clear out and just have Cassandra rebuild a single column family from scratch or a snapshot?
During an upgrade some of the nodes decided to rejoin the cluster rather than just restarting. During that process nodetool netstats showed that nodes were transferring new data files into the original nodes. The cluster is stable, but the disk usage grew substantially. I am thinking that we will migrate to a new ring, but in the meantime I would like to see if I can reduce some disk usage. The ring is stable, and repairs are looking fine.
If we are able to cleanup one cf it would relieve disk space usage a ton.
nodetool cleanup is not reducing the size of the sstables.
If we have a new node join the cluster, it uses approximately 50% of the disk space that the other nodes use.
We could do the dance of nodetool decommission && nodetool join, but that is not going to be fun :)
We have validated that the data in the ring is consistent, and repairs show that the data is consistent across the ring.
Adding a new node and successfully running repair means the data for the partition range(s) that has(have) been assigned to that node has been streamed to the new node.
If, after this has happened, you run nodetool cleanup, any data from the other nodes that is no longer needed is cleaned up.
If you still see that some of your nodes have more data than others, this may be because some of your partitions contain wider rows, or because your nodes are unbalanced. There should not be any data duplication scenario (if you can prove there is one, it would be worth a JIRA ticket).
You can run rebalance in OpsCenter or manually re-assign your tokens if you are looking to spread out the data more evenly across your nodes (or design your data model to avoid the aforementioned wide rows).
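For the manual route, evenly spaced tokens can be computed with a short calculation. The sketch below assumes the Murmur3Partitioner (token range -2**63 to 2**63 - 1) and one token per node, that is, no vnodes; balanced_tokens is a hypothetical helper name, and the right values for your cluster depend on your partitioner and datacenter layout.

```python
def balanced_tokens(node_count):
    """Evenly spaced tokens over the Murmur3 range, one per node."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    step = 2**64 // node_count
    return [i * step - 2**63 for i in range(node_count)]


if __name__ == "__main__":
    for node, token in enumerate(balanced_tokens(4)):
        print(f"node {node}: {token}")
```

With four nodes this spaces the tokens a quarter of the ring apart, starting at -2**63. Each node would then be moved to its token one at a time, followed by a cleanup.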
Use nodetool compact to purge the tombstones that are eligible for removal (older than gc_grace_seconds) and to merge all the updated versions of a record into a single record:
nodetool compact
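As a rough picture of what such a merge does, the toy function below combines several SSTables, keeps the newest version of each key and drops deleted keys. It is a simplified sketch, not Cassandra's compaction code: compact and TOMBSTONE are invented names, each SSTable is modelled as a dict, and every tombstone is dropped immediately, whereas Cassandra only drops tombstones once gc_grace_seconds has passed.

```python
TOMBSTONE = object()  # marker standing in for a deleted value


def compact(sstables):
    """Merge SSTables: keep the newest version of each key, drop deleted keys.

    Each SSTable is a dict of key -> (timestamp, value).
    """
    newest = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in newest or ts > newest[key][0]:
                newest[key] = (ts, value)
    return {k: v for k, v in sorted(newest.items()) if v[1] is not TOMBSTONE}
```

Three input tables holding two versions of one key and a delete for another key collapse into a single table with one live record, which is where the disk savings come from.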
