Cassandra restore from incremental backup files from multiple nodes - cassandra

I am also looking for an incremental/point-in-time backup and restore solution.
I have three Cassandra nodes and I enabled incremental backup. I tried copying one day's SSTable files from the backups folder on one node into the /data folder of a new Cassandra cluster, and it works. But I have three nodes, and the file names on all three nodes are the same, so I don't know how to restore the incremental backup files from all three nodes.
Your comments are really appreciated!

One simple solution is to configure the new cluster with exactly the same nodes, and everything works.
But if I want to replay the data onto fewer nodes, like only one, do I have to take a complete snapshot?
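One approach that comes up in the answers below is to stream each node's backup SSTables with sstableloader, which works even when the target cluster is smaller. A minimal sketch, assuming the default data layout; the keyspace/table names (ks/tbl) and the target contact point are placeholders, not anything from the question:

# Repeat for each source node's backups folder. sstableloader needs the
# last two path components of the directory to be <keyspace>/<table>,
# so stage the files first.
mkdir -p /tmp/restore/ks/tbl
cp /var/lib/cassandra/data/ks/tbl-*/backups/* /tmp/restore/ks/tbl/
# 10.0.0.10 is a placeholder contact point in the target cluster; the rows
# are streamed to whichever target nodes own them, regardless of node count.
sstableloader -d 10.0.0.10 /tmp/restore/ks/tbl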

Related

Is it possible to backup and restore Cassandra cluster using dsbulk?

I have searched the internet a lot and seen many ways to back up and restore a Cassandra cluster, such as nodetool snapshot and Medusa. But my question is: can I use dsbulk to back up a Cassandra cluster? What are its limitations? Why doesn't anyone suggest it?
It's possible to use it in some cases, but it's not practical, mainly for the following reasons (the list could be longer):
DSBulk puts additional load onto the cluster nodes because it goes through the standard read path. In contrast, nodetool snapshot just creates hard links to the data files, adding no extra load to the nodes.
It's harder to implement incremental backups with DSBulk - you need to come up with a SELECT condition that finds only the data that changed since the last backup, so you need a timestamp column, because you can't put a WHERE condition on the value of the writetime function. It will also require rescanning the whole data set anyway, and it's impossible to find out what data was deleted. With nodetool snapshot, you just compare which files have changed since the last backup and back up only those.
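For comparison, roughly what the two approaches look like on the command line; the keyspace/table (ks/item) and the output path are placeholder names for illustration:

# dsbulk "backup": a full export through the normal CQL read path,
# rescanning the whole table on every run.
dsbulk unload -h 127.0.0.1 -k ks -t item -url /backups/ks_item_$(date +%F)
# nodetool snapshot: hard links the current SSTable files on the local node,
# with essentially no extra read load.
nodetool snapshot -t backup_$(date +%F) ks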

Cassandra - copying sstable snapshot from one cluster to another

I know there are several similar questions out there, but I'm still confused about this. As there is a need for this mechanism (copying data from one cluster to another), I'm looking for a little clarification.
Let's assume a very simple scenario. I want to copy a table from one cassandra cluster (C1) to another (C2). The table I'm copying is called "item".
Let's assume the node count of each cluster is the same (source and target clusters have 4 nodes each). Not sure whether that matters or not.
I'm attempting to use snapshots and sstableloader to do the trick. I have been able to create a snapshot and copy the snapshot files from C1:N1 (cluster 1 node 1: .../myspace/item-xxxxxx/snapshot/######) to the target table directory on C2:N1 (cluster 2 node 1: .../myspace/item-xxxxxx). I used sstableloader to load the data and ran nodetool repair. Perfect.
The only problem is that, as the loaded snapshot was only from one of the source nodes, I only "restored" part of the data (about 485 of the 1k rows). So I'm thinking I'll copy the snapshot from C1:N2 to C2:N1 again and load it up. The problem is that all of the table files already exist on C2:N1. If I copy the snapshot files from C1:N2 into the table directory on C2:N1, I'll blow away the files that are already there. I didn't check all 4 target nodes, but I did check node 2 of the target, and the item table directory already existed there too with data files. I'm guessing all of the nodes on the target have data files, so I'm stuck on how to sstableload the other 3 source nodes' snapshot files.
So long story short (if that's possible):
How am I supposed to load multiple source snapshot files (one from each host on the source cluster) into a target cluster? And to complicate matters, will it matter if the source and target clusters have a different number of nodes? (I would think that having fewer nodes on the target would potentially be a bigger problem.)
What is really needed here, in my opinion, is a way to run sstableloader on the SOURCE cluster and have it stream the data to a target cluster. That would make life a lot easier, I would think.
Thanks in advance.
-Jim
There are two options for bulk loading, and it seems you may have them semi-merged together. You are mostly referring to the "copy the sstables" mechanism, which is pretty manual and may not be worth the trouble unless performance of the restore is a top priority. Using sstableloader is different and doesn't require that.
The sstableloader tool will connect to a node, find all the nodes in that node's cluster, and use the connection to build metadata/discovery. It will split/stream the sstables that you select to the target cluster in the appropriate token ranges (you won't need the repair). You can run sstableloader from the source cluster's nodes and point it at the destination cluster; you don't need to copy the sstables over yourself (although if the clusters are in different DCs, copying them first may be a bit faster).
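A rough sketch of that, run on each source node; myspace/item come from the question, while the snapshot tag, staging path and target contact point (192.168.1.50) are placeholders:

nodetool snapshot -t clone -- myspace
# sstableloader expects the last two path components to be <keyspace>/<table>,
# so stage the snapshot files accordingly before streaming.
mkdir -p /tmp/clone/myspace/item
cp /var/lib/cassandra/data/myspace/item-*/snapshots/clone/* /tmp/clone/myspace/item/
sstableloader -d 192.168.1.50 /tmp/clone/myspace/item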
If you have OpsCenter the automation of these steps can be done for you with a GUI https://docs.datastax.com/en/opscenter/5.2/opsc/online_help/services/opscBackupCloneCluster.html

how to create solr index backup and restore?

We are creating snapshots of all Cassandra keyspaces, but we also need to create a backup of the Solr index, which contains a huge amount of data and is needed for Solr indexing.
Here is the DataStax link for creating a backup.
I tried the following command:
$nodetool -h localhost rebuild_index ks cf ks.cf
which works fine for small data but takes much more time for a huge amount of data.
"Backup Solr Indexes" section in datastax doc.
and try to run:
$backup -d /var/lib/cassandra/data/solr.data -u root -v
and found this:
backup: Unrecognized or ambiguous switch '-d'; type 'backup help interactive' for detailed help.
This means the backup package is not for the Solr index. Where can we find a suitable backup package?
Could someone suggest how to create a backup and restore for the Solr index?
Assuming you'll be creating backups intended to restore a cluster with the same token layout, and you can make your backups in a rolling fashion, something like the following may at least be a starting point (a command-line sketch follows the steps):
For each node...
1.) nodetool drain the node to make sure your Solr cores are in sync with their backing Cassandra tables. (drain forces a memtable flush, which forces a Solr hard commit.)
2.) Shut down the node.
3.) Manually back up your data directories (.../solr.data for your index).
4.) Start the node again.
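A hedged per-node sketch of those steps; the data path, backup destination and service name are assumptions about a typical package install, not from the question:

nodetool drain                    # sync Solr cores with their backing Cassandra tables
sudo systemctl stop cassandra     # use your actual service name (e.g. dse) if different
# back up the data directories, including the Solr index
tar czf /mnt/backups/solr_$(hostname)_$(date +%F).tar.gz \
    /var/lib/cassandra/data/solr.data
sudo systemctl start cassandra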

Best practices for cleaning up Cassandra incremental backup folders

We have incremental backup on our Cassandra cluster. The "backups" folders under the data folders now contain a lot of data and some of them have millions of files.
According to the documentation: "DataStax recommends setting up a process to clear incremental backup hard-links each time a new snapshot is created."
It's not clear to me what the best way is to clear out these files. Can they all just be deleted when a snapshot is created, or should we delete files that are older than a certain period?
My thought was, just to be on the safe side, to run a regular script to delete files more than 30 days old:
find [Cassandra data root]/*/*/backups -type f -mtime +30 -delete
Am I being too careful? We're not concerned about having a long backup history.
Thanks.
You are probably being too careful, though that's not always a bad thing, but there are a number of considerations. A good pattern is to keep multiple snapshots (for example, weekly snapshots going back some period) plus all incremental backups taken during that period, so you can restore to known states. For example, if your most recent snapshot doesn't work for whatever reason, but you still have your previous snapshot plus all sstables since then, you can use that.
You can delete all created backups after your snapshot as the act of doing the snapshot flushes and hard links all sstables to a snapshots directory. Just make sure your snapshots are actually happening and completing (it's a pretty solid process since it hard links) before getting rid of old snapshots & deleting backups.
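A minimal sketch of tying the cleanup to snapshot creation, along the lines the documentation suggests; the tag format and data path are placeholders:

# take a new snapshot of all keyspaces; only clear the incremental backup
# hard links if the snapshot completed successfully
nodetool snapshot -t weekly_$(date +%F) && \
    find /var/lib/cassandra/data/*/*/backups -type f -delete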
You should also make sure to test your restore process as that'll give you a good idea of what you will need. You should be able to restore from your last snapshot + the sstables backed up since that time. Would be a good idea to fire up a new cluster and try restoring data from your snapshots + backups, or maybe try out this process in place in a test environment.
I like to point to this article, 'Cassandra and Backups', as a good rundown of backing up and restoring Cassandra.

Proper cassandra keyspace restore procedure

I am looking for confirmation that my Cassandra backup and restore procedures are sound and I am not missing anything. Can you please confirm, or tell me if something is incorrect/missing?
Backups:
I run daily full backups of the keyspaces I care about, via "nodetool snapshot keyspace_name -t current_timestamp". After the snapshot has been taken, I copy the data to a mounted disk dedicated to backups, then do a "nodetool clearsnapshot $keyspace_name -t $current_timestamp".
I also run hourly incremental backups - executing a "nodetool flush keyspace_name" and then moving the files from the backups directory of each keyspace into the backup mountpoint.
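As a rough illustration of that schedule (keyspace_name and the /mnt/backups mount point are placeholders standing in for the asker's keyspace and backup disk):

# daily full backup
ts=$(date +%Y%m%d%H%M%S)
nodetool snapshot -t "$ts" -- keyspace_name
for snap in /var/lib/cassandra/data/keyspace_name/*/snapshots/"$ts"; do
    table=$(basename "$(dirname "$(dirname "$snap")")")   # e.g. mytable-xxxxxx
    mkdir -p "/mnt/backups/full/$ts/$table"
    cp "$snap"/* "/mnt/backups/full/$ts/$table/"
done
nodetool clearsnapshot -t "$ts" -- keyspace_name

# hourly incremental backup
nodetool flush keyspace_name
for b in /var/lib/cassandra/data/keyspace_name/*/backups; do
    table=$(basename "$(dirname "$b")")
    mkdir -p "/mnt/backups/incremental/$table"
    mv "$b"/* "/mnt/backups/incremental/$table/" 2>/dev/null || true
done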
Restore:
So far, the only valid way I have found to do a restore (and tested/confirmed) is to do the following on ALL Cassandra nodes in the cluster (a per-node sketch follows these steps):
Stop Cassandra
Clear the commitlog *.log files
Clear the *.db files from the table I want to restore
Copy the snapshot/full backup files into that directory
Copy any incremental files I need to (I have not tested with multiple incrementals, but I am assuming I will have to overlay the files, in sequence from oldest to newest)
Start Cassandra
On one of the nodes, run a "nodetool repair keyspace_name"
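A per-node sketch of those steps as I understand them; keyspace_name/table_name, the snapshot timestamp and the backup mount are placeholders:

sudo systemctl stop cassandra
rm -f /var/lib/cassandra/commitlog/*.log
rm -f /var/lib/cassandra/data/keyspace_name/table_name-*/*.db
# snapshot files first, then incrementals in order from oldest to newest
cp /mnt/backups/full/<timestamp>/table_name-*/* \
   /var/lib/cassandra/data/keyspace_name/table_name-*/
cp /mnt/backups/incremental/table_name-*/* \
   /var/lib/cassandra/data/keyspace_name/table_name-*/
sudo systemctl start cassandra
# then, on one node only:
nodetool repair keyspace_name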
So my questions are:
Does the above backup and restore strategy seem valid? Are any steps inaccurate or anything missing?
Is there a way to do this without stopping Cassandra on EVERY node? For example, is there a way to restore the data on ONE node, then somehow make it "authoritative"? I tried this and, as expected, since the restored data is older, the newer data on the other nodes overwrites it when they sync up during repair.
Thank you!
There are two ways to restore Cassandra backups without restarting C*:
Copy the files into place, then run "nodetool refresh". This has the caveat that the rows will still be older than tombstones, so if you're trying to restore deleted data, it won't do what you want. It also only applies to the local server (you'll want to repair afterwards).
Use "sstableloader". This will load data to all nodes. You'll need to make sure you have the sstables from a complete replica, which may mean loading the sstables from multiple nodes. Added bonus, this works even if the cluster size has changed. I'm not sure if ordering matters here (that is, I don't know if row timestamps are preserved through the load or if they're redefined during load)
