Cassandra: how to move a database from one server to another quickly?

It's a backup/restore question. We have a single node running Cassandra. We have also rsync'ed the /var/lib/cassandra folder onto another (backup) server.
In case of emergency, we can quickly boot a new server and transfer the files there. But the question is: will it work then? Let's assume we have the same Cassandra version on both the old and the new server, and the same OS version. Is it enough to simply transfer the whole /var/lib/cassandra folder? As you can understand, backups are a critical thing, so I want to be sure everything will be OK.
(We are currently using the dsc2.0 package from Ubuntu's repos.)
Yes, I know running a 'normal' cluster with 2-3 nodes would be a better choice; both performance and reliability would increase. But for now we have what we have: a single node. And for various reasons we will not switch to a multi-node cluster right now.
Thanks!
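For reference, a minimal sketch of the copy described above, assuming the default package layout and a standby server (backup-host is a placeholder name):

    # Flush memtables so the on-disk files are complete, then stop
    # Cassandra so nothing changes mid-copy.
    nodetool flush
    sudo service cassandra stop

    # Copy the whole directory (data, commitlog, saved_caches).
    rsync -az /var/lib/cassandra/ backup-host:/var/lib/cassandra/

    # On the standby: same Cassandra version and the same cluster_name
    # in cassandra.yaml, then start the service.
    sudo service cassandra start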

Related

cassandra 3.11.x mixing versions

We have a 6 node cassandra 3.11.3 cluster with ubuntu 16.04. These are virtual machines.
We are switching to physical machines: brand new servers (8!) that will have Debian 11 and presumably Cassandra 3.11.12.
Since the main version is always 3.11.x and Ubuntu 16.04 is out of support, the question is: can we just let the new machines join the old cluster and then decommission the outdated ones?
I hope to get some tips about this, because intuitively it seems fine, but we are not too sure about it.
Thank you.
We have a 6 node cassandra 3.11.3 cluster with ubuntu 16.04. These are virtual machines. We are switching to physical machines: brand new servers (8!)
Quick tip here: it's a good idea to build your clusters in multiples of your RF. Not sure what your RF is, but if RF=3, I'd either stay with six nodes or get one more and go to nine. It's all about even data distribution.
can we just let the new machines join the old cluster and then decommission the outdated ones?
In short, no. You'll want to upgrade the existing nodes to 3.11.12 first. I can't recall whether 3.11.3 and 3.11.12 are SSTable-compatible, but I wouldn't risk it.
Secondly, the best way to do this is to build your new (physical) nodes into the cluster as their own logical datacenter. Start them up empty, and then run a nodetool rebuild on each. Once that's complete, decommission the old nodes.
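A rough sketch of that sequence, assuming GossipingPropertyFileSnitch, a new datacenter named dc2, the old one named dc1, and a keyspace my_ks (all placeholder names):

    # On each new node before first start, in cassandra-rackdc.properties:
    #   dc=dc2
    #   rack=rack1
    # and set auto_bootstrap: false in cassandra.yaml so it joins empty.

    # Once all new nodes have joined, replicate the keyspace into dc2:
    cqlsh -e "ALTER KEYSPACE my_ks WITH replication =
      {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3};"

    # On each new node, stream its data from the old datacenter:
    nodetool rebuild -- dc1

    # After all rebuilds finish, retire the old nodes one at a time:
    nodetool decommission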
There is a somewhat simpler solution: move the data from each virtual machine to a physical server, as follows:
1. Prepare the Cassandra installation on the physical machine: configure the same cluster name, etc.
2. Stop Cassandra on the virtual machine and make sure that it won't restart.
3. Copy all Cassandra data (/var/lib/cassandra or similar) from the VM to the physical server.
4. Start the Cassandra process on the physical server.
Repeat that process for all VM nodes, updating seeds etc. along the way. After the process is finished, you can add the two physical servers that are left. Also, to speed up the process, you can do an initial copy of the data before stopping Cassandra on the VM, and after it's stopped, re-sync the data with rsync or similar (see the sketch below). This way you minimize the downtime.
This approach would be much faster than adding a new node and decommissioning the old one, as we don't need to stream the data twice. It works because, once a node is initialized, Cassandra identifies nodes by their assigned host ID (UUID), not by their IP address.
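A sketch of the two-pass copy for one node, run from the physical server (vm-node is a placeholder hostname; paths assume the default layout):

    # Pass 1: bulk copy while Cassandra is still running on the VM.
    rsync -az vm-node:/var/lib/cassandra/ /var/lib/cassandra/

    # Stop Cassandra on the VM, then pass 2: pick up only what changed
    # since pass 1 (--delete drops SSTables compacted away in between).
    ssh vm-node 'sudo service cassandra stop'
    rsync -az --delete vm-node:/var/lib/cassandra/ /var/lib/cassandra/

    # Start Cassandra on the physical server; the node keeps its host ID,
    # so the cluster sees the same node at a new address.
    sudo service cassandra start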
Another approach is to follow the instructions for replacing a dead node. In this case, streaming of the data happens only once, but it could be a bit slower than a direct copy of the data.
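With that procedure, the replacement node is started empty with a JVM flag naming the dead node's address (the IP below is a placeholder), e.g. in cassandra-env.sh:

    # On the new, empty node only -- remove the flag after first boot:
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.5"
    # On first boot the node takes over the dead node's token ranges and
    # streams their data from the surviving replicas.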

How to add sstablesplit to an existing Cassandra cluster?

I am a beginner in Cassandra and currently have a small cluster with a replication factor of 3 and most of the parameters left at their defaults.
What I noticed the other day is that the SSTables have become absolutely massive (>1TB), and the logs are now starting to complain that they cannot perform a compaction anymore. I've looked into it and decided to switch to LeveledCompactionStrategy, as well as to perform an sstablesplit on my existing SSTables.
However, at that point I noticed that sstablesplit did not come with my installation of Cassandra. Is there a way of installing just that tool? All the guides I've seen talk about installing the entire DataStax stack, which would probably invalidate my existing cluster or require a great deal of reinstalling, which at the moment I cannot do. The Cassandra installation was not set up by me.
At the same time, LCS is complaining that it cannot perform re-compaction because it's trying to recompact all SSTables at once, and since they now take up slightly more than 50% of the hard drive space, it can't find enough room to do so.
If sstablesplit is impossible (or inadvisable), is there any other way to resolve my issue of having several SSTables which are too massive to be re-compacted into more manageable chunks?
Thanks!
sstablesplit is part of the Cassandra codebase; you can use it even without it being packaged. Putting the cassandra-all jar and the lib jars on the classpath is everything you need to run it; that is all the sstablesplit script does: https://github.com/apache/cassandra/blob/trunk/tools/bin/sstablesplit.
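A rough sketch of running it by hand, assuming a package install with the jars under /usr/share/cassandra (adjust paths, keyspace, and table names to your setup):

    # sstablesplit works on offline files: stop the node first.
    sudo service cassandra stop

    # StandaloneSplitter is the class the sstablesplit script invokes;
    # -s sets the target size, in MB, of each split output file.
    java -cp '/usr/share/cassandra/*:/usr/share/cassandra/lib/*' \
        org.apache.cassandra.tools.StandaloneSplitter \
        -s 128 /var/lib/cassandra/data/my_ks/my_table-*/*-Data.db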
Is this in AWS or some cloud platform where you can get larger hosts temporarily? The easiest option is to replace the hosts with new ones that have 2x the disk space or so, migrate to LCS, and then switch back to keep costs down.
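For the LCS migration itself, the switch is a table-level ALTER (my_ks.my_table and the size are placeholders); each node then rewrites its SSTables into levels as compaction catches up:

    cqlsh -e "ALTER TABLE my_ks.my_table WITH compaction =
        {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};"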

Cassandra directory setup of replica datacenter

I have a Cassandra cluster and plan to add a new datacenter to replicate data. There will be no writes to this datacenter, only reads.
My questions are:
in this case is it still recommended to have separate drives for commit log and data?
if I know that my cluster will receive data only via hints (and lots of them), should I create a separate disk for the hints? I did not find any mention of this.
in this case is it still recommended to have separate drives for commit log and data?
So the whole idea of putting your commitlog on a separate mount point goes back to spinning disks being a choke point for I/O. If you have your cluster/nodes backed by SSDs, you shouldn't need to do that.
if I know that my cluster will receive data only via hints (and lots of them), should I create a separate disk for the hints?
Hints only build up when a node is down. When your writes happen, the snitch handles propagation to all of the required replicas. So no, I wouldn't worry about putting your hints directory on a separate mount point, either.
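For reference, these are the cassandra.yaml settings involved, shown with the packaged defaults; point them at separate mounts only if your disks actually call for it:

    # cassandra.yaml (packaged defaults)
    data_file_directories:
        - /var/lib/cassandra/data
    commitlog_directory: /var/lib/cassandra/commitlog
    hints_directory: /var/lib/cassandra/hints
    saved_caches_directory: /var/lib/cassandra/saved_caches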

Is it possible to recover a Cassandra node without a snapshot?

Offsite backups for Cassandra seem like a challenging thing. You basically have to make yet another copy of ALL your data, including the copies that already exist due to the replication factor. Snapshots make backups easy when you don't mind storing them on the same disk that your node already uses. I'm curious: in the event of a catastrophic failure of this disk, is it possible to recover the node using the nodes the data was replicated to?
Yes, you can restore the data on a crashed node using the procedure in the documentation: Replacing a dead node or dead seed node. It's written for Cassandra 3.x; please pick your Cassandra version from the drop-down menu at the top of the page.
But please note that you still need to do backups if your data is valuable. If you are using AWS, you can use this project to back up Cassandra to S3 storage.
If you are looking for offsite or off-host backups, you can also look at OpsCenter from DataStax or Talena software (my company). Both give you the ability to back up your database locally or to S3. As you would expect, you also get the ability to restore data after hardware failures, user errors, or logical corruption, which the replicas will not protect you against.
Yes, it is possible. Just execute "nodetool repair" in a terminal on the node with the missing data. It can take a lot of time. I would also recommend running a repair on each node every month to keep your data fully replicated, because Cassandra does not repair data automatically (for example, after a node failure).
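One note on routine repairs: a plain "nodetool repair" on every node repairs each token range once per replica. The -pr flag restricts each node to its primary ranges, so one pass over all nodes covers the ring exactly once:

    # Run on each node in turn (e.g. staggered from cron):
    nodetool repair -pr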

Cassandra: Simplest way to backup data from one machine and restore on a fresh machine

I have Cassandra running on a single machine. I need to back up a particular keyspace from there and set up the same schema, with all the data, on my local machine.
I understand that I can run the nodetool snapshot command to take a point-in-time snapshot of the keyspace.
But from the documentation, I understand that it requires the schema to already exist. Is there not any command which can take the backup together with the schema and restore it on another machine? The data is very small, hardly a few MBs.
If you have the same version of Cassandra on the single machine and on your local machine, there is a brute-force solution (not to be used in production): copy the whole $CASSANDRA_HOME/data folder (or sometimes /var/lib/cassandra/data) from one machine to the other...
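A sketch of the snapshot-based route the question hints at, which also captures the schema (my_ks is a placeholder keyspace; paths assume a package install):

    # On the source machine: dump the schema, then snapshot the keyspace.
    cqlsh -e "DESCRIBE KEYSPACE my_ks" > my_ks_schema.cql
    nodetool snapshot -t mybackup my_ks
    # Snapshot files land under
    # /var/lib/cassandra/data/my_ks/<table>/snapshots/mybackup/

    # On the target machine: recreate the schema, then stream the files in
    # (sstableloader expects the last two path components to be the
    # keyspace and table names).
    cqlsh -f my_ks_schema.cql
    sstableloader -d 127.0.0.1 /path/to/my_ks/<table>/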
