Is it possible to back up and restore a Cassandra cluster using DSBulk?

I have searched the internet and found many ways to back up and restore a Cassandra cluster, such as nodetool snapshot and Medusa, but my question is: can I use DSBulk to back up a Cassandra cluster? What are its limitations? Why doesn't anyone suggest it?

It's possible to use it in some cases, but it's not practical, mainly for the following reasons (the list could be longer); example commands follow the list:
DSBulk puts additional load on the cluster nodes because it goes through the standard read path. In contrast, nodetool snapshot just creates hard links to the data files, adding no extra load to the nodes.
It's harder to implement incremental backups with DSBulk: you need to come up with a SELECT condition that finds only the data changed since the last backup, which means you need a timestamp column, because you can't put a WHERE condition on the value of the writetime function. It also requires rescanning all of the data anyway, and it's impossible to find out which data was deleted. With nodetool snapshot, you just compare which files have changed since the last backup and back up only those.
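To make the contrast concrete, this is roughly what each approach looks like on the command line (keyspace, table, tag, and paths are placeholders):
dsbulk unload -k my_keyspace -t my_table -url /backups/my_table.csv
nodetool snapshot -t backup_tag my_keyspace
The first command scans the whole table through the read path and writes it to CSV; the second just hard-links the existing SSTables on the node it runs on.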

Related

Combine Cassandra Snapshot with updated data

We deleted some old data within our 3-node Cassandra cluster (v3.11) some days ago, which shall now be restored from a snapshot. Is there a possibility to restore the data from the snapshot without losing updates made since the snapshot was taken?
There are two approaches which come to my mind
A)
Create an export via COPY keyspace.table TO xy.csv
Truncate the table
Restore the table from the snapshot via sstableloader
Reimport the newer data via COPY keyspace.table FROM xy.csv
B)
Just copy sstable files of snapshot into current table directory
Is A) a feasible option? What do we need to consider so that the COPY FROM/TO commands get synchronized over all nodes?
For option B) I read that the deletion commands that happened may be executed again (tombstone rows). Can I ignore this warning if we make sure the deletion commands happened more than 10 days ago (gc_grace_seconds)?
For exporting/importing data from Apache Cassandra®, there is an efficient tool -- DataStax Bulk Loader (aka DSBulk). You can refer to more documentation and examples here. For consistent reads and writes, you can add --datastax-java-driver.basic.request.consistency LOCAL_QUORUM to your unload & load commands.
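As a rough sketch, steps 1 and 4 of option A) might look like this with DSBulk instead of cqlsh COPY (keyspace, table, and file path are placeholders):
dsbulk unload -k my_keyspace -t my_table -url /tmp/my_table.csv --datastax-java-driver.basic.request.consistency LOCAL_QUORUM
dsbulk load -k my_keyspace -t my_table -url /tmp/my_table.csv --datastax-java-driver.basic.request.consistency LOCAL_QUORUM
Running both at LOCAL_QUORUM gives you reasonably consistent reads on export and durable writes on re-import.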

Using DSBulk for backup/restore takes too long

I use DSBulk for text-based backup and restore of a Cassandra cluster. I have created a Python script that backs up/restores all the tables in the cluster using dsbulk load/unload, but it takes a long time even for small amounts of data because a new session is created for each table (approx. 7 s). In my case I have 70 tables, so 70 × 7 s is added just for session creation. Is there a way to back up data from all tables in a cluster using a single session with dsbulk? From the docs, I see dsbulk is suitable only for loading/unloading a single table at a time. Is there any alternative or other approach for this? Please suggest if any!
Thanks..
No, there isn't a way to load/unload multiple tables in a single DSBulk execution because it doesn't make sense to do so.
In any case, unloading data to CSV isn't recommended as a means of backing up your cluster because there are no guarantees that the data will be consistent at a point in time.
The correct way of backing up a Cassandra cluster is using the nodetool snapshot command. For details, see Apache Cassandra Backups.
If you're interested, there is an open-source tool which allows you to automate backups -- https://github.com/thelastpickle/cassandra-medusa. Cheers!
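If you go that route, the basic Medusa workflow looks roughly like this (assuming Medusa is installed and configured on every node; the backup name and seed host are placeholders, and flags may differ between versions -- check the project's documentation):
medusa backup --backup-name=daily_2024_01_01
medusa list-backups
medusa restore-cluster --backup-name=daily_2024_01_01 --seed-target node1.example.com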

What are the differences between a data backup using nodetool and the cqlsh COPY command?

Currently we have two options to back up the data of the tables in a Cassandra keyspace. We can either use nodetool commands or use the COPY command from the cqlsh terminal.
1) What are the differences between these commands ?
2) Which one is most appropriate ?
3) Also, if we are using nodetool to take a backup, we would generally flush the data from memtables to SSTables before we issue the nodetool snapshot command. So my question is: should we employ the same technique of flushing the data if we use the cqlsh COPY command?
Any help is appreciated.
Thanks very much.
GREAT question!
1) What are the differences between these commands ?
Running a nodetool snapshot creates hard links to the SSTable files of the requested keyspace. It's the same as running this from the (Linux) command line:
ln {source} {link}
A cqlsh COPY is essentially the same as doing a SELECT * FROM on a table. It'll create a text file with the table's data in whichever format you have specified.
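For example, from within cqlsh (table name and output path are placeholders):
COPY my_keyspace.my_table TO '/backups/my_table.csv' WITH HEADER = true;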
In terms of their difference from a backup context, a file created using cqlsh COPY will contain data from all nodes, whereas nodetool snapshot needs to be run on each node in the cluster. In clusters where the number of nodes is greater than the replication factor, each snapshot will only be valid for the node on which it was taken.
2) Which one is most appropriate ?
It depends on what you're trying to do. If you simply need backups for a node/cluster, then nodetool snapshot is the way to go. If you're trying to export/import data into a new table or cluster, then COPY is the better approach.
Also worth noting, cqlsh COPY takes a while to run (depending on the amount of data in a table), and can be subject to timeouts if not properly configured. nodetool snapshot is nigh instantaneous; although the process of compressing and SCPing snapshot files to an off-cluster instance will take some time.
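If you do use cqlsh COPY on a larger table, the timeout-related options are worth setting; a sketch with arbitrary values:
COPY my_keyspace.my_table TO '/backups/my_table.csv' WITH PAGESIZE = 1000 AND PAGETIMEOUT = 60 AND MAXATTEMPTS = 10;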
3) Should we employ the same technique of flushing the data if we use the cqlsh copy command ?
No, that's not necessary. As cqlsh COPY works just like a SELECT, it will follow the normal Cassandra read path, which will check structures both in RAM and on-disk.
nodetool snapshot is a good approach for any amount of data, and it creates hard links within seconds. The COPY command takes much longer, depending on the size of the data and the cluster. For small amounts of data and for testing you may use the COPY command, but for production nodetool snapshot is recommended.

How to archive and purge Cassandra data

I have a Cassandra cluster with multiple data centres. I want to archive data monthly and then purge it. There are numerous articles about backing up and restoring, but none that mention archiving data in a Cassandra cluster.
Can someone please let me know how I can archive my data in a Cassandra cluster monthly and purge it?
I think there is no dedicated tool for archiving Cassandra data. You have to write either Spark jobs or MapReduce jobs that use CqlInputFormat to archive the data. You can follow the links below to understand how people are archiving data in Cassandra:
[1] - http://docs.wso2.org/display/BAM240/Archive+Cassandra+Data
[2] - http://docs.wso2.org/pages/viewpage.action?pageId=32345660
[3] - http://accelconf.web.cern.ch/AccelConf/ICALEPCS2013/papers/tuppc004.pdf
There is also a way to turn on incremental backups in Cassandra, which can be used like CDC.
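Turning that on is roughly this, per node (or set incremental_backups: true in cassandra.yaml):
nodetool enablebackup
After each memtable flush, hard links to the newly flushed SSTables then appear under each table's backups/ directory.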
It is best practice to use TimeWindowCompactionStrategy with a monthly window on your tables, along with a TTL of a month, so that data older than a month can be purged (a CQL sketch follows below).
If you write a purge job that does this work of deletion (on tables which do not have the correct compaction strategy applied), it can impact cluster performance, because searching the data on a date/month basis will overwhelm the cluster.
I have experienced this, where we ultimately had to go back, change the structure of the tables, and alter the compaction strategy. That is why getting the table design right in the first place is very important. We need to think up front not only about how the data will be inserted and read, but also about how it will be deleted, and then frame the keys, compaction, TTL, etc. accordingly.
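A rough sketch of that setup in CQL (keyspace/table names are placeholders; tune the window and TTL for your workload):
ALTER TABLE my_keyspace.events
  WITH compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_unit': 'DAYS', 'compaction_window_size': 30}
  AND default_time_to_live = 2592000;  -- roughly one month, in seconds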
For archiving, just write a few lines of code to read data from Cassandra and put it in your archival location.
Let me know if this helps you get the end result you want, or if you have further questions I can help with.

Proper cassandra keyspace restore procedure

I am looking for confirmation that my Cassandra backup and restore procedures are sound and I am not missing anything. Can you please confirm, or tell me if something is incorrect/missing?
Backups:
I run daily full backups of the keyspaces I care about, via "nodetool snapshot keyspace_name -t current_timestamp". After the snapshot has been taken, I copy the data to a mounted disk dedicated to backups, then do a "nodetool clearsnapshot $keyspace_name -t $current_timestamp".
I also run hourly incremental backups, executing a "nodetool flush keyspace_name" and then moving files from the backup directory of each keyspace into the backup mount point.
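For reference, the daily full-backup sequence described above might look roughly like this as a script (the data path, mount point, and keyspace are placeholders; assumes the default data directory layout):
ts=$(date +%Y%m%d%H%M%S)
nodetool snapshot -t "$ts" my_keyspace
# copy each table's snapshot directory, keeping the table name in the destination path
for d in /var/lib/cassandra/data/my_keyspace/*/snapshots/"$ts"; do
    table_dir=$(basename "$(dirname "$(dirname "$d")")")
    mkdir -p /mnt/backups/"$ts"/"$table_dir"
    cp -r "$d"/. /mnt/backups/"$ts"/"$table_dir"/
done
nodetool clearsnapshot -t "$ts" my_keyspace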
Restore:
So far, the only valid way I have found (and tested/confirmed) to do a restore is to do this, on ALL Cassandra nodes in the cluster (a command-line sketch follows the list):
Stop Cassandra
Clear the commitlog *.log files
Clear the *.db files from the table I want to restore
Copy the snapshot/full backup files into that directory
Copy any incremental files I need to (I have not tested with multiple incrementals, but I am assuming I will have to overlay the files, in sequence from oldest to newest)
Start Cassandra
On one of the nodes, run a "nodetool repair keyspace_name"
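In shell form, the per-node restore described above might look roughly like this (paths, keyspace/table names, and the backup location are placeholders, assuming the default package layout):
# run the following as root (or prefix each command with sudo)
systemctl stop cassandra
rm /var/lib/cassandra/commitlog/*.log
rm /var/lib/cassandra/data/my_keyspace/my_table-*/*.db
cp /mnt/backups/full_backup/my_table/* /var/lib/cassandra/data/my_keyspace/my_table-*/
# repeat the copy for each incremental backup, oldest to newest
chown -R cassandra:cassandra /var/lib/cassandra/data/my_keyspace
systemctl start cassandra
nodetool repair my_keyspace   # on one node only, after all nodes are back up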
So my questions are:
Does the above backup and restore strategy seem valid? Are any steps inaccurate or anything missing?
Is there a way to do this without stopping Cassandra on EVERY node? For example, is there a way to restore the data on ONE node, then somehow make it "authoritative"? I tried this and, as expected, since the restored data is older, the data on the other nodes (which is newer) overwrites it when they sync up during repair.
Thank you!
There are two ways to restore Cassandra backups without restarting C* (example commands follow the list):
Copy the files into place, then run "nodetool refresh". This has the caveat that the rows will still be older than tombstones. So if you're trying to restore deleted data, it won't do what you want. It also only applies to the local server (you'll want to repair after)
Use "sstableloader". This will load data to all nodes. You'll need to make sure you have the sstables from a complete replica, which may mean loading the sstables from multiple nodes. Added bonus, this works even if the cluster size has changed. I'm not sure if ordering matters here (that is, I don't know if row timestamps are preserved through the load or if they're redefined during load)
