Copying Cassandra data from one cluster to another

I have a cluster set up in Cassandra on AWS. Now I need to move my cluster somewhere else. Since taking an image is not possible, I will create a new cluster that is an exact replica of the old one, and I need to move the data from the old cluster to the new one. How can I do so?
My cluster has 2 data centers, and each data center has 3 Cassandra machines, one of which is a seed.

Do you have connectivity between the old and new cluster? If yes, why not link the clusters and let Cassandra replicate the data to the new one? After the data is transferred, shut down the old cluster. Ideally you wouldn't even have any downtime.

You can take the SSTables from your data directory and then use sstableloader in the new data center to import the data.
Before doing this, you might consider running a compaction so that you have only one SSTable per table.
SSTable Loader

Using an SFTP server or some other means, transfer the SSTables from the old cluster to the new one (one DC is enough) and use sstableloader. Replication of the data to the other DC will be taken care of by Cassandra.
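One way to drive sstableloader over a whole keyspace is to loop over its table directories. A minimal sketch, assuming a standard data-directory layout; the keyspace path and seed address below are placeholders, not values from the original posts:

```shell
# Sketch only: print one sstableloader command per table directory under a
# keyspace's data directory, so you can review the commands before running them.
gen_load_cmds() {
  data_root=$1   # e.g. /var/lib/cassandra/data/my_keyspace (placeholder)
  seed=$2        # a node in the destination cluster (placeholder)
  for table_dir in "$data_root"/*/; do
    [ -d "$table_dir" ] || continue  # skip if the glob matched nothing
    echo "sstableloader -d $seed $table_dir"
  done
}

# Example usage (hypothetical layout):
#   gen_load_cmds /var/lib/cassandra/data/my_keyspace 10.0.0.10
```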

In Cassandra there are two replication strategies, SimpleStrategy and NetworkTopologyStrategy. With NetworkTopologyStrategy you can replicate to a different data center. See this documentation: Data replication
You can use the COPY command to export and import CSV from one table to another:
Simple data importing and exporting with Cassandra
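For reference, the cqlsh COPY commands look roughly like this (keyspace, table, and file names are placeholders); COPY is meant for relatively small datasets, so prefer sstableloader for very large tables:

```sql
-- On the source cluster, export to CSV:
COPY my_keyspace.my_table TO '/tmp/my_table.csv' WITH HEADER = true;

-- On the destination cluster, after recreating the schema, import:
COPY my_keyspace.my_table FROM '/tmp/my_table.csv' WITH HEADER = true;
```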


Now:
Single node
cassandra 3.11.3 + kairosdb 1.2
Two data storage paths:
/data/cassandra/data/kairosdb — 4 TB, old data
/data1/cassandra/data/kaiosdb — 1.1 TB, now writing data
Target:
Three nodes
cassandra 3.11.3 + kairosdb 1.2
One data storage path:
/data/cassandra/data/kairosdb
In this case, how do I migrate the data in the two data directories under a single node to a three-node cluster, where each node has only one data directory?
I understand how to do it (and have practiced it) when migrating a single node to a three-node cluster, but only with a single data directory. I have searched the Internet for a long time for how two data directories get migrated down to one, but found no reference material.
Data directories are something that the individual Cassandra node cares about, but the cluster doesn't.
Usually you'd want all nodes to share the same configuration, but for replication it really doesn't matter where the SSTables are on disk on each node.
So migrating here would be the same as you've practiced.
That said, the process I'd choose would be to add the new nodes as a second DC with the right replication, run a repair to get all the data in sync, and then decommission the original node.
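The new-DC route above usually boils down to a handful of nodetool steps. The DC names are placeholders and these commands run against a live cluster, so treat this as an outline rather than a runnable script:

```shell
# On each new node, once it has joined as part of the new DC and the
# keyspaces have been ALTERed to replicate there:
nodetool rebuild -- old_dc    # stream existing data from the old DC
nodetool repair --full        # make sure replicas are in sync

# On the old node, after replication to its DC has been removed:
nodetool decommission
```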

Is it possible to backup a 6-node DataStax Enterprise cluster and restore it to a new 4-node cluster?

I have this case: we have a 6-node DSE cluster, and the task is to back it up and restore all the keyspaces, tables, and data into a new cluster. But this new cluster has only 4 nodes.
Is it possible to do this?
Yes, it is definitely possible to do this. This operation is more commonly referred to as "cloning" -- you are copying the data from one DataStax Enterprise (DSE) cluster to another.
There is a Cassandra utility called sstableloader which reads the SSTables and loads them into a cluster even when the destination cluster's topology is not identical to the source's.
I have previously documented the procedure in How to migrate data in tables to a new Cassandra cluster which is also applicable to DSE clusters. Cheers!

Cassandra backup system keyspaces

I have a 3-node Cassandra cluster and a script which backs up all of the keyspaces, but when it comes to restoring on a fresh cluster, the data keyspaces restore correctly while the system_* keyspaces do not.
So, is it necessary to back up the system keyspaces in Cassandra?
You will need to back up the keyspace system_schema at the same time, as it contains the definitions of keyspaces, tables, and columns. The other system_* keyspaces should be left untouched.
A fresh cluster makes its own setup (token ranges etc.) based on its configuration. You can restore onto a new cluster, but you need to create the same schema and configuration as the old cluster. There are many ways to take a backup and restore it, depending on requirements; see:
https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsBackupRestore.html
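A snapshot that captures the schema definitions together with the data can be taken in one command. A sketch, where the tag and keyspace names are placeholders:

```shell
# Snapshot system_schema alongside the data keyspace under one tag:
nodetool snapshot -t pre_restore system_schema my_keyspace

# Hard-linked snapshot files then appear under each table directory:
#   <data_dir>/<keyspace>/<table-id>/snapshots/pre_restore/
```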

Cassandra and Spark

Hi, I have a high-level question regarding cluster topology and data replication with respect to Cassandra and Spark being used together in DataStax Enterprise.
My understanding was that if there were 6 nodes in a cluster and heavy computing (e.g. analytics) is done, then you could have 3 Spark nodes and 3 Cassandra nodes if you want. Or you don't need three nodes for analytics, but your jobs would not run as fast. The reason you don't want the heavy analytics on the Cassandra nodes is that the local memory is already being used up to handle the heavy read/write load of Cassandra.
This much is clear, but here are my questions :
How does the replicated data work then?
Are all the cassandra only nodes in one rack, and all the spark nodes in another rack?
Does all the data get replicated to the spark nodes?
How does that work if it does?
What are the recommended configuration steps to make sure the data is replicated properly to the Spark nodes?
How does the replicated data work then?
Regular Cassandra replication will operate between nodes and DCs. As far as replication goes, this is the same as having a C*-only cluster with two data centers.
Are all the cassandra only nodes in one rack, and all the spark nodes in another rack?
With the default DSE Snitch, your C* nodes will be in one DC and the Spark nodes in another DC. They will all be in a default rack. If you want to use multiple racks you will have to configure that yourself by using an advanced snitch. GPFS or PFS are good choices depending on your orchestration mechanisms. Learn more in the DataStax Documentation
Does all the data get replicated to the spark nodes? How does that work if it does?
Replication is controlled at the keyspace level and depends on your replication strategy:
SimpleStrategy will simply ask you for the number of replicas you want in your cluster (it is not data-center aware, so don't use it if you have multiple DCs):
create KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3 }
This assumes you only have one DC and that you'll have 3 copies of each bit of data.
NetworkTopologyStrategy lets you pick the number of replicas per DC:
create KEYSPACE tst WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1' : 2, 'DC2': 3 }
You can choose to have a different number of replicas per DC.
What are the recommended configuration steps to make sure the data is replicated properly to the Spark nodes?
The procedure to update RF is in the datastax documentation. Here it is verbatim:
Updating the replication factor: Increasing the replication factor increases the total number of copies of keyspace data stored in a Cassandra cluster. If you are using security features, it is particularly important to increase the replication factor of the system_auth keyspace from the default (1), because you will not be able to log into the cluster if the node with the lone replica goes down. It is recommended to set the replication factor for the system_auth keyspace equal to the number of nodes in each data center.
Procedure
Update a keyspace in the cluster and change its replication strategy options:
ALTER KEYSPACE system_auth WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};
Or if using SimpleStrategy:
ALTER KEYSPACE "Excalibur" WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
On each affected node, run the nodetool repair command. Wait until repair completes on a node, then move to the next node.
Know that increasing the RF in your cluster will generate lots of IO and CPU utilization as well as network traffic, while your data gets pushed around your cluster.
If you have a live production workload, you can throttle the impact by using nodetool getstreamthroughput / nodetool setstreamthroughput.
You can also throttle the resulting compactions with nodetool getcompactionthroughput / nodetool setcompactionthroughput.
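For reference, those throttles are set cluster-operation-wide per node; the values below are illustrative only (streaming is capped in megabits per second, compaction in megabytes per second):

```shell
nodetool getstreamthroughput        # show the current streaming limit
nodetool setstreamthroughput 100    # cap streaming at 100 Mb/s
nodetool getcompactionthroughput    # show the current compaction limit
nodetool setcompactionthroughput 16 # cap compaction at 16 MB/s
```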
How do Cassandra and Spark work together on the analytics nodes and not fight for resources? If you are not going to limit Cassandra at all in the whole cluster, then what is the point of limiting Spark? Just have all the nodes Spark-enabled.
The key point is that you won't be pointing your main transactional reads/writes at the analytics DC (use a consistency level like LOCAL_ONE or LOCAL_QUORUM to point those requests to the C* DC). Don't worry, your data still arrives at the analytics DC by virtue of replication, but you won't wait for acks to come back from analytics nodes in order to respond to customer requests. The second DC is eventually consistent.
You are right in that cassandra and spark are still running on the same boxes in the analytics DC (this is critical for data locality) and have access to the same resources (and you can do things like control the max spark cores so that cassandra still has breathing room). But you achieve workload isolation by having two Data Centers.
DataStax drivers, by default, will consider the DC of the first contact point they connect with as the local DC, so just make sure that your contact point list only includes machines in the local C* DC.
You can also specify the local datacenter yourself depending on the driver. Here's an example for the ruby driver, check the driver documentation for other languages.
use the :datacenter cluster method: the first datacenter found will be assumed current by default. Note that you can skip this option if you specify only hosts from the local datacenter in the :hosts option.
You are correct, you want to separate your cassandra and your analytics workload.
A typical setup could be:
3 Nodes in one datacenter (name: cassandra)
3 Nodes in second datacenter (name: analytics)
When creating your keyspaces you define them with a NetworkTopologyStrategy and a replication factor defined for each datacenter, like so:
CREATE KEYSPACE myKeyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'cassandra': 2, 'analytics': 2};
With this setup, your data will be replicated twice in each datacenter. This is done automatically by Cassandra: when you insert data in DC cassandra, the inserted data gets replicated to DC analytics automatically, and vice versa. Note: you can define which data is replicated by using separate keyspaces for the data you want analyzed and the data you don't.
In your cassandra.yaml you should use the GossipingPropertyFileSnitch. With this snitch you can define the DC and the rack of your node in the file cassandra-rackdc.properties. This information then gets propagated via the gossip protocol. So each node learns the topology of your cluster.
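With GossipingPropertyFileSnitch, each node declares its own DC and rack locally. For a node in the analytics DC above, cassandra-rackdc.properties would look roughly like this (the rack name is a placeholder):

```
# cassandra-rackdc.properties on a node in DC "analytics"
dc=analytics
rack=rack1
```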

Strategies for Cassandra backups

We are considering additional backup strategies for our Cassandra database.
Currently we use:
nodetool snapshot - allows us to take a snapshot of all data on a given node at any time
replication level in itself is a backup - if any node ever goes down we still have additional copies of the data
Is there any other efficient strategy? The biggest issue is that we need a copy of live data in an additional datacenter, which has just one server. So, for example, let's say we have 3 Cassandra servers that we want to replicate to one backup Cassandra server in a second datacenter. The only way I can think we can do it is to set the replication level to 4, which means all the data will have to be written to all servers at all times.
We are also considering running Cassandra on top of some network distributed file system like HDFS. Correct me if I'm wrong, but that would be a very inefficient use of Cassandra?
Are there any other efficient backup strategies for Cassandra that can back up the current data without impairing its performance?
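For reference, NetworkTopologyStrategy (covered earlier on this page) can express "three copies here, one copy in the backup DC" without raising a cluster-wide replication factor to 4. A sketch with placeholder keyspace and DC names:

```sql
CREATE KEYSPACE my_keyspace WITH replication =
  {'class': 'NetworkTopologyStrategy', 'main_dc': 3, 'backup_dc': 1};
```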
