Strategies for Cassandra backups

We are considering additional backup strategies for our Cassandra database.
Currently we use:
nodetool snapshot - allows us to take a snapshot of all data on a given node at any time
the replication factor in itself is a backup - if any node ever goes down, we still have additional copies of the data
Is there any other efficient strategy? The biggest issue is that we need a copy of live data in an additional datacenter, which has just one server. So, for example, let's say we have 3 Cassandra servers that we want to replicate to one backup Cassandra server in a second datacenter. The only way I can think of doing it is to set the replication factor to 4, which means all the data would have to be written to all servers at all times.
We are also considering running Cassandra on top of a network-distributed file system like HDFS. Correct me if I'm wrong, but wouldn't that be a very inefficient use of Cassandra?
Are there any other efficient backup strategies for Cassandra that can back up the current data without impairing its performance?
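For reference, a minimal sketch of the per-datacenter setup that NetworkTopologyStrategy allows (keyspace and DC names are placeholders): each datacenter carries its own replication factor, so the backup datacenter can hold a single copy without raising the factor to 4 everywhere.

    # Give each datacenter its own replication factor; the backup DC holds one copy.
    cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {
        'class': 'NetworkTopologyStrategy', 'main_dc': 3, 'backup_dc': 1};"
    # Then, on the backup DC's server, stream the existing data across:
    nodetool rebuild -- main_dc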

Related

How does Cassandra improve performance by adding nodes?

I'm going to build an Apache Cassandra 3.11.x cluster with 44 nodes. Each application server will have one cluster node so that applications do reads/writes locally.
I have a couple of questions running in my mind; kindly answer if possible.
1. How many server IPs should be mentioned in the seed node parameter?
2. How does HA work when all of the mentioned seed nodes go down?
3. What is the disadvantage of mentioning all the server IPs in the seed node parameter?
4. How does Cassandra scale with respect to data, other than via the primary key and tunable consistency? My assumption is that the replication factor can improve HA but not performance; so how does performance increase by adding more nodes?
5. Is there any sharding mechanism in Cassandra?
Answers are in order:
It's recommended to point to at least 2 nodes per DC.
The seed/contact node is used only for the initial bootstrap - when your program reaches any of the listed nodes, it "learns" the topology of the whole cluster, and the driver then listens for node status changes and adjusts its list of available hosts. So even if the seed node(s) go down after the connection is already established, the driver will be able to reach the other nodes.
It's usually harder to maintain - you need to keep the configuration parameters for your driver and the list of nodes in sync.
When you have RF > 1, Cassandra may read or write data from/to any replica. The consistency level regulates how many nodes should return an answer for a read or write operation. When you add a new node, the data is redistributed to it, and if you have correctly selected the partition key, the new node starts to receive requests in parallel with the old nodes.
The partition key is responsible for selecting the replica(s) that will hold the data associated with it - you can see it as a shard. But you need to be careful with the choice of partition key - it's easy to create partitions that are too big, or partitions that are "hot" (receiving most of the operations in the cluster - for example, if you use the date as the partition key and always write/read data for today); see the sketches after these answers.
P.S. I would recommend reading the DataStax Architecture guide - it contains a lot of information about how Cassandra works.
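For questions 1-3, a sketch of where the seed list lives (IPs are placeholders; the same list should be set in cassandra.yaml on every node):

    # cassandra.yaml - listing two seeds per DC is the usual recommendation:
    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "10.0.1.1,10.0.1.2,10.0.2.1,10.0.2.2"

And for question 5, a sketch of the partition key acting as the sharding mechanism (keyspace and table names are hypothetical):

    cqlsh <<'EOF'
    CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
    -- Date-only partition key: every write for "today" lands in one partition,
    -- so a single replica set takes most of the cluster's traffic.
    CREATE TABLE IF NOT EXISTS demo.events_by_day (
        day       date,
        event_id  timeuuid,
        payload   text,
        PRIMARY KEY (day, event_id)
    );
    -- Composite partition key: "today" is spread across many partitions/nodes.
    CREATE TABLE IF NOT EXISTS demo.events_by_day_and_source (
        day       date,
        source_id int,
        event_id  timeuuid,
        payload   text,
        PRIMARY KEY ((day, source_id), event_id)
    );
    EOF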

Cassandra Database replication to another instance

Is it possible to replicate Cassandra data to another server instance so we can run read-only data operations on it? We have explored SAN, and it turned out to be too expensive in hardware.
Note: I am not allowed to copy the data into a file; it should be like mirroring of the data.
You can use Cassandra's internal replication for this. Use NetworkTopologyStrategy and configure a second datacenter that data will be replicated to (it can be smaller than your production cluster). Then use this datacenter for your read-only workload and the other one for production.
Your application needs to be configured to use LOCAL_QUORUM or another LOCAL consistency level so the second datacenter isn't used for production requests.
This technique is used, for example, to separate resource-demanding analytics workloads from the rest.
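A minimal sketch of the pieces involved, with hypothetical keyspace and DC names (dc_main for production, dc_read for the read-only instance):

    # Add the read-only DC to the keyspace's replication map:
    cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {
        'class': 'NetworkTopologyStrategy', 'dc_main': 3, 'dc_read': 1};"
    # On the dc_read node, stream the existing data over:
    nodetool rebuild -- dc_main
    # Production clients then use a LOCAL consistency level, e.g. in cqlsh:
    #   CONSISTENCY LOCAL_QUORUM;
    # so their reads/writes are acknowledged only by dc_main replicas.

With the Java driver, this is typically paired with a DC-aware load balancing policy (e.g. DCAwareRoundRobinPolicy) pinned to the production DC, so only the analytics clients contact the read-only DC.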

Can we back up only one availability zone of an AZ-replicated Cassandra cluster?

Since my Cassandra cluster is replicated across three availability zones, I would like to back up only one availability zone to lower the backup costs. I have also experimented with restoring nodes in a single availability zone and got back most of my data in a test environment. I would like to know if there are any drawbacks to this approach before deploying it in production. Is anyone following this approach in their production clusters?
Note: As I back up at regular intervals, I know that I may lose updates that were acknowledged by a quorum on the other two AZs' nodes at the time of the snapshot, but that's not a problem.
You can back up only a specific DC, or even specific nodes.
AFAIK, the only drawback is whether your data is consistent/up-to-date, and since you can afford to lose some data, it shouldn't be a problem. If you are, for example, performing writes with the ALL consistency level, the data should be up-to-date on all nodes.
BUT, you must be sure that your data is indeed replicated across multiple AZs, by configuring the rack/DC properties or by using the EC2 snitch, which is multi-AZ aware.
EDIT:
Global Snapshot
Running nodetool snapshot only takes a snapshot on a single node at a time. This creates just a partial backup of your entire data; you will want to run nodetool snapshot on all of the nodes in your cluster. And it's best to run them at the exact same time, so that you don't have fragmented data from a time perspective. You can do this a couple of different ways. The first is to use a parallel SSH program to execute the nodetool snapshot command on all nodes at the same time. The second is to create a cron job on each of the nodes to run at the same time; this assumes that your nodes have clocks that are in sync, which Cassandra relies on as well.
Link to the page: http://datascale.io/backing-up-cassandra-data/
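A minimal sketch of the parallel-SSH variant described above (the host list, SSH access, and snapshot tag are assumptions):

    #!/usr/bin/env bash
    # Trigger a cluster-wide snapshot at (roughly) the same moment on every node.
    # Assumes passwordless SSH to each host and nodetool on each host's PATH.
    HOSTS="10.0.0.1 10.0.0.2 10.0.0.3"
    TAG="backup-$(date +%Y%m%d-%H%M)"
    for h in $HOSTS; do
        ssh "$h" "nodetool snapshot -t $TAG" &   # run all nodes concurrently
    done
    wait
    echo "Snapshot '$TAG' requested on all nodes."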

Is it possible to recover a Cassandra node without a snapshot?

Offsite backups for Cassandra seem like a challenging thing. You basically have to make yet another copy of ALL your data, including the copies that exist due to the replication factor. Snapshots make backups easy when you don't mind storing them on the same disk that your node already uses. I'm curious - in the event of a catastrophic failure of this disk, is it possible to recover the node using the nodes that the data was replicated to?
Yes, you can restore data on a crashed node using the procedure in the documentation - "Replacing a dead node or dead seed node". That page is for Cassandra 3.x; please pick your Cassandra version from the drop-down menu at the top of the page.
But please note that you still need to take backups if your data is valuable. If you're using AWS, you can use this project to back up Cassandra to S3 storage.
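The core of that replacement procedure is starting a fresh node with the replace-address flag; a sketch assuming a package install (the dead node's IP and the paths are placeholders):

    # On the new replacement node, before its first start:
    echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.0.42"' | \
        sudo tee -a /etc/cassandra/cassandra-env.sh
    sudo service cassandra start
    # The node bootstraps by streaming the dead node's token ranges from the
    # surviving replicas; remove the flag once it has joined the ring.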
If you are looking for offsite or off-host backups, you can also look at OpsCenter from DataStax or Talena software (my company). Both give you the ability to back up your database locally or to S3. As you may expect, you also get the ability to restore data in case of hardware failures, user errors, or logical corruption, which the replicas will not protect you against.
Yes, it is possible. Just execute "nodetool repair" in a terminal on the node with the missing data. It can take a lot of time. I would also recommend executing a repair operation on each node every month to keep your data fully replicated, because Cassandra does not repair data automatically (for example, after a node goes down).
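A sketch of that monthly schedule as a cron entry (paths and timing are assumptions; -pr limits each node to its primary token ranges, so the work isn't repeated when every node runs the job):

    # /etc/cron.d/cassandra-repair - install on every node, ideally staggered
    # so that repairs don't all start at the same moment.
    # m h dom mon dow user      command
    0 3 1 * * cassandra nodetool repair -pr >> /var/log/cassandra/repair.log 2>&1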

Restore a Cassandra snapshot (from a 3-node cluster) on a developer or test cluster (1-node cluster)

We have set up a backup/restore procedure for our Cassandra production environment via snapshots. The snapshot files, schema and token ring information are copied to S3.
The production cluster is a 3-node cluster with a replication factor of 3.
For development and test, I would like to restore the snapshots from production into separate clusters. To save money and keep maintenance easy, it would be nice to restore only the snapshot from one production node. Since we are using a replication factor of 3 in a 3-node cluster, each snapshot should contain all rows. Consistency is also not important for our use case.
Is it possible (and how) to restore only a single snapshot?
All of your data should exist on all 3 nodes, so copying the sstables from any one node to your test cluster should be sufficient. Making sure there's a recent repair beforehand may be a good idea if you're worried about consistency.
First create the same schema on the test cluster. Then take a snapshot with nodetool snapshot -t cloneme. Once complete, copy all the sstables from the folder that is created (cloneme) into the equivalent table's folder on your test cluster, then run nodetool refresh.
It gets much more complicated if you have a different topology (more nodes, different RF), but since you're going with "every node has all the data", it's pretty trivial.
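Those steps, sketched as commands (keyspace, table, and data paths are placeholders and vary by installation):

    # 1. Recreate the production schema on the test cluster.
    cqlsh test-host -f schema.cql
    # 2. On one production node, take the snapshot.
    nodetool snapshot -t cloneme my_ks
    # 3. Copy the snapshot sstables into the matching table directory on the
    #    test node (the table directory's UUID suffix differs per cluster).
    scp -r prod-node:/var/lib/cassandra/data/my_ks/my_table-<uuid>/snapshots/cloneme/* \
        /var/lib/cassandra/data/my_ks/my_table-<uuid>/
    # 4. Load the copied sstables into the running test node.
    nodetool refresh my_ks my_table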
Worth mentioning that OpsCenter has a feature to automate the copying of a backup to other clusters.
