Backups folder in Opscenter keyspace growing really huge - cassandra

We have a 10-node Cassandra cluster and configured a repair in OpsCenter. A backups folder is created for every table in the OpsCenter keyspace, and it keeps growing. Is there a solution to this, or do we have to manually delete the data in each backups folder?

First off, backups are different from snapshots. You can take a look at the backup documentation for OpsCenter to learn more.
Incremental backups:
From the DataStax docs:
When incremental backups are enabled (disabled by default), Cassandra
hard-links each flushed SSTable to a backups directory under the
keyspace data directory. This allows storing backups offsite without
transferring entire snapshots. Also, incremental backups combine with
snapshots to provide a dependable, up-to-date backup mechanism.
...
As with snapshots, Cassandra does not automatically clear
incremental backup files. DataStax recommends setting up a process to
clear incremental backup hard-links each time a new snapshot is
created.
You must have turned on incremental backups by setting incremental_backups to true in cassandra.yaml.
If you are interested in a backup strategy, I recommend you use the OpsCenter Backup Service instead. That way, you're able to control granularly which keyspace you want to back up and push your files to S3.
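The cleanup that the quoted docs recommend (clearing incremental backup hard links each time a new snapshot is created) could be sketched roughly as follows. This is only an illustration: the function names, the dry-run flag, and the assumed directory layout data_dir/<keyspace>/<table>/backups are mine, not part of Cassandra or OpsCenter.

```python
import os
from pathlib import Path


def stale_backup_files(data_dir, cutoff_epoch):
    """Return incremental-backup files last modified before cutoff_epoch.

    Assumes the layout data_dir/<keyspace>/<table>/backups/<files>.
    cutoff_epoch would typically be the time the newest snapshot was taken.
    """
    stale = []
    for backups_dir in Path(data_dir).glob("*/*/backups"):
        for f in backups_dir.iterdir():
            if f.is_file() and f.stat().st_mtime < cutoff_epoch:
                stale.append(f)
    return sorted(stale)


def remove_files(paths, dry_run=True):
    """Delete the given files; with dry_run=True only print what would go."""
    for p in paths:
        if dry_run:
            print(f"would remove {p}")
        else:
            os.remove(p)
```

Running it with dry_run=True first, and only deleting once the matching snapshot has been archived somewhere safe, seems the prudent order.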
Snapshots
Snapshots are hard links to old (no longer used) SSTables. Snapshots protect you from yourself. For example, if you accidentally truncate the wrong table, you'll still have a snapshot of that table that you can bring back. If you end up with too many snapshots, there are a couple of things you can do:
Don't run sequential repairs
This is related to repairs because sequential repairs generate a snapshot each time they run. To avoid this, run parallel repairs instead (with the -par flag, or by setting the number of repairs in the OpsCenter config file; see the note below).
Clear your snapshots
If you have too many snapshots and need to free up space (maybe once you have backed them up to S3, Glacier, or similar), use nodetool clearsnapshot to delete them. You can also remove them manually from your file system, but nodetool clearsnapshot removes the risk of rm -rf-ing the wrong thing.
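Before clearing anything, it can help to see how much space each snapshot is really pinning. Because snapshots are hard links, a snapshot file only costs extra disk once the live SSTable it points to is gone. The sketch below estimates that per snapshot name; the function name and the assumed layout data_dir/<keyspace>/<table>/snapshots/<name> are illustrative assumptions.

```python
import os
from pathlib import Path


def reclaimable_bytes_by_snapshot(data_dir):
    """Map snapshot name -> bytes held only by that snapshot's hard links.

    Assumes the layout data_dir/<keyspace>/<table>/snapshots/<name>/<files>.
    """
    totals = {}
    for snap_dir in Path(data_dir).glob("*/*/snapshots/*"):
        if not snap_dir.is_dir():
            continue
        size = 0
        for root, _dirs, files in os.walk(snap_dir):
            for name in files:
                st = os.stat(os.path.join(root, name))
                # st_nlink == 1: no other directory entry shares this inode,
                # so deleting this file actually frees its blocks.
                if st.st_nlink == 1:
                    size += st.st_size
        totals[snap_dir.name] = totals.get(snap_dir.name, 0) + size
    return totals
```

Treat the result as a lower bound: a file linked from two snapshots (or from a snapshot and a backups folder) has a link count above 1 and is not counted for either.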
Note: You may also be running repairs too fast if you don't have a ton of data (check my response to this other SO question for an explanation and the repair service config levers).

Related

Is it safe to copy cassandra snapshot files over sstable files in a running node?

Edited after reading nodetool tagged questions.
We take snapshots of our single node cassandra database daily. If I want to restore a snapshot either on that node, or on our staging server which is running a different instance of cassandra, my understanding is I have to:
nodetool disablegossip
nodetool disablebinary
nodetool drain
Copy the sstable files from the snapshot directories to the sstable directories under the keyspace directory.
Run nodetool refresh on each table.
Enable binary & gossip.
Is this sufficient to safely bring the snapshot sstable files in without cassandra overwriting them while I'm doing the refresh?
What is the opposite of nodetool drain?
Another edit: What about sstableloader? Should I use that instead? If so, how? I looked at the "documentation" and am none the wiser.
The steps you outlined aren't quite right. You don't need to shut down Cassandra, and you shouldn't just copy the files on top of the existing SSTables.
At a high level, the steps to restore table snapshots on a node are:
TRUNCATE the table you want to restore (will remove the SSTables from the data directories).
Copy the SSTables from data/ks_name/table-UUID/snapshots/snapshot_name subdirectory into the "live" data directory data/ks_name/table-UUID.
Run nodetool refresh -- ks_name table_name.
You will need to repeat these steps for each application table you want to restore. NOTE: Do NOT restore system tables, only application tables.
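The copy and refresh steps above could be scripted along these lines. This is a sketch under assumptions: the function names are hypothetical, the TRUNCATE is expected to have been run already, and skipping manifest.json and schema.cql (snapshot metadata rather than SSTable components) is my choice, not something the steps above require.

```python
import shutil
from pathlib import Path

# Snapshot metadata files that are not SSTable components (assumed safe to skip)
SNAPSHOT_METADATA = {"manifest.json", "schema.cql"}


def copy_snapshot_into_live(table_dir, snapshot_name):
    """Copy the SSTable files of one snapshot into the live table directory.

    table_dir is data/ks_name/table-UUID; returns the copied file names.
    """
    table_dir = Path(table_dir)
    snap_dir = table_dir / "snapshots" / snapshot_name
    if not snap_dir.is_dir():
        raise FileNotFoundError(f"no snapshot directory: {snap_dir}")
    copied = []
    for src in sorted(snap_dir.iterdir()):
        if src.is_file() and src.name not in SNAPSHOT_METADATA:
            shutil.copy2(src, table_dir / src.name)
            copied.append(src.name)
    return copied


def refresh_command(keyspace, table):
    """Build the nodetool refresh invocation used in the steps above."""
    return ["nodetool", "refresh", "--", keyspace, table]
```

The command is returned as an argument list rather than executed, so it can be reviewed or passed to subprocess.run once the copy has been checked.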
The detailed steps are documented in Restoring from a snapshot in Cassandra.
Restoring a snapshot into another cluster is something I prefer to call "cloning". The procedure for cloning snapshots to another cluster depends on whether the source and destination clusters have identical configuration.
If both source and destination clusters are identical, follow the steps I documented here -- https://community.datastax.com/questions/4534/. I've explained what identical configuration means in this post.
If they are not identical, follow the steps I documented here -- https://community.datastax.com/questions/4477/. Cheers!

Not able to restore cassandra data from snapshot

We take regular backups of our cluster and store the schema and snapshot backups on AWS S3 daily.
Somehow we lost all the data. While recovering from backup we were able to restore the schema, but after copying the snapshot files to the /var/lib/cassandra/data directory, the data does not show up in the tables.
After copying the data we ran nodetool refresh -- keyspace table, but still nothing shows up.
Could you please help with this?
I'm new to Apache Cassandra, but backup was the first topic I focused on.
If you want to restore from a snapshot (on a new node/cluster), you have to shut down Cassandra on every node and clear any existing data from these folders:
/var/lib/cassandra/data -> if you want to keep your system keyspaces, delete only your user keyspace folders
/var/lib/cassandra/commitlog
/var/lib/cassandra/hints
/var/lib/cassandra/saved_caches
After this, start Cassandra again (the whole cluster). Create the keyspace and the table you want to restore, matching the originals. In your snapshot folder you will find a schema.cql script for creating the table.
After creating the keyspaces and tables again, wait a moment (how long depends on the number of nodes in your cluster and the keyspaces you want to restore).
Shut down the Cassandra cluster again.
Copy the files from the snapshot folder to the new folders of the tables you want to restore. Do this on ALL NODES!
After copying the files, start the nodes one by one.
Once all nodes are running, run the nodetool repair command.
If you check the data via cqlsh, keep the CONSISTENCY level in mind (ALL/QUORUM).
That's the approach that works very well on my Cassandra cluster.
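For the "delete only your user keyspace folders" part of the procedure above, a small helper can list which directories would be affected before anything is removed. This is a sketch: the function name is hypothetical, and treating every keyspace whose name starts with "system" as a system keyspace is an assumption (distributions such as DSE add further internal keyspaces that it does not know about).

```python
from pathlib import Path

# Assumption: keyspaces whose names start with "system" belong to Cassandra
SYSTEM_PREFIX = "system"


def user_keyspace_dirs(data_dir):
    """Return keyspace directories under data_dir that are not system keyspaces."""
    return sorted(
        d for d in Path(data_dir).iterdir()
        if d.is_dir() and not d.name.startswith(SYSTEM_PREFIX)
    )
```

The prefix check errs on the side of keeping data: a user keyspace that happens to start with "system" is left out of the list rather than offered for deletion.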
The general steps for restoring a snapshot are:
1. Shut down Cassandra if it is still running.
2. Clear any existing data in the commitlog, data, and saved_caches directories.
3. Copy the snapshots to the relevant data directories.
4. Copy incremental backups to the data directory (if incremental backups are enabled). If required, set the restore_point_in_time parameter in commitlog_archiving.properties to the restore point.
5. Start Cassandra.
6. Run repair.
So try running repair after copying the data.

Restore Cassandra snapshot (from 3-node-cluster) on developer or test cluster (1-node cluster)

We have set up a backup/restore procedure for our Cassandra production environment via snapshots. The snapshot files, schema and token ring information are copied to S3.
The production cluster is a 3-node-cluster with a replication factor of 3.
For development and test, I would like to restore the snapshots from production into separated clusters. To save money and to keep maintenance easy, it would be nice to restore only the snapshot from one production node. Since we are using a replication factor of 3 in a 3-node-cluster, each snapshot should have all rows. Consistency is also not important for our use-case.
Is it possible (and how) to restore only a single snapshot?
All of your data should exist on all 3 nodes, so copying the SSTables from any 1 node to your test cluster should be sufficient. Making sure there's a recent repair beforehand may be a good idea if you are worried about consistency.
First create the same schema on the test cluster. Then take a snapshot on the production node with nodetool snapshot -t cloneme. Once complete, copy all the SSTables from the folder that is created (cloneme) into the equivalent table's folder on your test cluster. Then run nodetool refresh.
It gets much more complicated if you have a different topology (more nodes, different RF), but since you're going with "every node has all the data" it's pretty trivial.
Worth mentioning that OpsCenter has a feature to automate the copying of a backup to other clusters.
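One practical wrinkle with the copy step above: table directories are named after the table plus an ID suffix (the table-UUID form used elsewhere on this page), and a schema recreated on the test cluster will usually get different IDs. The sketch below pairs source and destination directories by table name only. The function names are hypothetical, and it assumes the directory name is "<table_name>-<id>" with no hyphen inside the table name.

```python
from pathlib import Path


def table_name(dir_name):
    """Strip the trailing '-<id>' from a table directory name."""
    name, sep, _table_id = dir_name.rpartition("-")
    return name if sep else dir_name


def match_table_dirs(src_keyspace_dir, dst_keyspace_dir):
    """Map each source table directory to the destination one with the same table name."""
    dst_by_name = {
        table_name(d.name): d
        for d in Path(dst_keyspace_dir).iterdir()
        if d.is_dir()
    }
    pairs = {}
    for src in Path(src_keyspace_dir).iterdir():
        if src.is_dir() and table_name(src.name) in dst_by_name:
            pairs[src] = dst_by_name[table_name(src.name)]
    return pairs
```

Source tables with no counterpart on the destination are simply left out of the mapping, which makes a missing CREATE TABLE easy to spot.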

Point-In-Time Cassandra backup & recovery

I have read about Cassandra backup & recovery here, and have a few questions:
Do the native Cassandra CLI commands suffice? I see a lot of people writing scripts and custom-making their own solutions.
What other tools out there would you recommend for Cassandra backup and recovery? I am looking for something that can help me manage the backup images (e.g. with point-in-time)
Do I need to invest significantly more into storage if I opt to backup my Cassandra tables?
Any insights would be appreciated.
Please try to limit your questions to one actual question.
Do the native Cassandra CLI commands suffice?
I assume that you mean nodetool snapshot, so for the most part, "yes." In addition, many users also choose to enable incremental backups. Combining snapshots and incremental backups (from the linked doc) "provides a dependable, up-to-date backup mechanism."
I see a lot of people writing scripts and custom-making their own solutions.
I have a backup script that runs on my nodes nightly. There are two reasons for this.
I don't want to have to manually take a snapshot for each keyspace every week, so I have the script do it.
Snapshot and incremental backup files don't remove themselves, so I have the script do that after a certain time threshold.
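A nightly script of the kind described above might be organised like this. It is not the answerer's actual script: the "nightly-" tag scheme, the function names, and the retention logic are assumptions for illustration. It only builds the nodetool commands (snapshot -t and clearsnapshot -t) and decides which tags have aged out; running them is left to the caller.

```python
from datetime import date, datetime, timedelta

TAG_PREFIX = "nightly-"  # hypothetical naming scheme for script-made snapshots


def snapshot_tag(day):
    """Tag for the snapshot taken on the given day, e.g. nightly-20181205."""
    return f"{TAG_PREFIX}{day:%Y%m%d}"


def snapshot_command(keyspace, day):
    """Command that snapshots one keyspace under the day's tag."""
    return ["nodetool", "snapshot", "-t", snapshot_tag(day), "--", keyspace]


def expired_tags(existing_tags, today, keep_days):
    """Return script-made tags older than keep_days; other tags are ignored."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for tag in existing_tags:
        if not tag.startswith(TAG_PREFIX):
            continue  # leave snapshots this script did not create alone
        try:
            taken = datetime.strptime(tag[len(TAG_PREFIX):], "%Y%m%d").date()
        except ValueError:
            continue  # prefix matches but the date part does not parse
        if taken < cutoff:
            expired.append(tag)
    return expired


def clear_command(tag):
    """Command that removes every snapshot carrying the given tag."""
    return ["nodetool", "clearsnapshot", "-t", tag]
```

Encoding the date in the tag keeps the retention decision independent of file timestamps, which can change when snapshots are copied or archived.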
What other tools out there would you recommend for Cassandra backup and recovery?
DataStax OpsCenter allows you to schedule backups, but I believe that is only a valid option in the Enterprise edition. You could also look at Netflix's Cassandra backup/recovery tool called Priam. There's also a company called Talena which claims to provide an extensive enterprise-grade backup solution for Cassandra (I don't know anyone who uses them, but they hit me with a marketing email recently so I thought I'd mention it).
Do I need to invest significantly more into storage if I opt to backup my Cassandra tables?
Incremental backups and snapshots can take up a great deal of space if you don't stay on top of them (deleting and/or archiving them). I would try them both out, and keep an eye on your disk usage while you do. If your business requirements have a statement on terms of service (how far back you would need to be able to restore to), you should be able to figure out how many days-worth of backups it makes sense for you to keep around. That should tell you whether or not you need more disk to fulfill those obligations.
Edit 20181205
Do you run nodetool snapshot on each node? What would be the approach if there are three nodes with 100% replication.
Typically yes, nodetool snapshot needs to be run on each node. This helps to ensure backup coverage, as not all of the nodes may be responsible for all of the data.
However, if your cluster runs in a configuration where the number of nodes equals your RF, then each node has a complete copy of the data. In that case, you would need to run nodetool snapshot on only one node, as long as you are confident that repairs are running regularly and your data is consistent.
With regards to point-in-time backup and recovery of Cassandra, there are a few aspects that you need to consider depending on what your needs and limitations are:
Storage Footprint
All the solutions available today will put a big strain on your infrastructure as they would require you to store 3x the data that you absolutely need to, assuming you have a replication factor of 3.
I agree with @Aaron, you need to manage the snapshots yourself because the tools will not do “garbage collection” for you :)
Failure resiliency
All the solutions out there, opscenter and others, provide limited failure resiliency. You will lose data if a Cassandra node goes down during a backup window.
This situation is exacerbated when you have incremental backups and a node failure happens during an incremental backup.
Recovery time/speed
Note that you may have to go through a “repair” process during recovery. This is needed because the node level snapshots that the native tools provide are not consistent across the cluster.
Depending on your RTO/RPO needs, this may not be adequate. I suggest you test both the backup and recovery times for your operations before you arrive at any solution.
If you are looking for enterprise grade solution for backup and recovery of Cassandra, you may want to check out the solution offered by “Datos IO”. It reduces your storage footprint by 3x while also providing failure resiliency and cluster consistency.

What are different ways to backup and restore cassandra cluster?

I am trying to backup the whole cluster consistently. What are different ways to backup and restore Cassandra cluster?
If you are using the DataStax Enterprise version, then the easiest way is to perform the backups and restore using OpsCenter.
If you are using the DataStax Community or open-source version of Cassandra, then use nodetool snapshot to create backups of tables and/or keyspaces.
Please bear in mind that SSTables are immutable, i.e. they never change once they are written to disk. So unlike RDBMS data files, SSTables are not updated.
To perform a snapshot cluster-wide, use SSH tools such as pssh to perform parallel snapshots on all nodes.
More information on the snapshot utility is available here.
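As an alternative to pssh, the same "snapshot every node in parallel" idea can be sketched in Python. The function names and the injectable run parameter are my own, and the sketch assumes passwordless SSH to each host with nodetool on the remote PATH.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_snapshot_command(host, tag):
    """Command that takes a tagged snapshot on one remote node."""
    return ["ssh", host, "nodetool", "snapshot", "-t", tag]


def snapshot_all(hosts, tag, run=subprocess.run):
    """Run the snapshot on every host concurrently; return host -> exit code."""
    def one(host):
        result = run(ssh_snapshot_command(host, tag), capture_output=True)
        return host, result.returncode

    # One worker per host so all snapshots start at roughly the same time
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        return dict(pool.map(one, hosts))
```

Even when started together, the snapshots are not taken at exactly the same instant on every node, so a repair after restoring may still be needed. Any non-zero exit code in the result marks a node whose snapshot should be retried.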
There are several ways to restore from snapshots. One way is to reload the data using the sstableloader tool, which reads the data back into the cluster. Another way is to copy the SSTable directory from the snapshot and run nodetool refresh. Finally, you can replace the existing data with the snapshot and restart the node.
More information on backups and restores is available here.

Resources