I am migrating a simple 4-node Cassandra cluster from one cloud provider to another. The number of nodes in both clouds is the same; however, the new cluster is at version 3.11.0 and the old one is at 3.0.11. I am using sstableloader to stream data from one cluster to the other (the schema has been created on the new cluster separately). As per the release notes this should not be a problem.
However, for certain column families sstableloader reaches 100% progress but then hangs there for hours (time spent hanging >> time to stream). The total data to stream on each node is below 500 GB. Any help on why this is happening and how to avoid it is appreciated.
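For reference, a typical sstableloader invocation for this kind of migration looks roughly like the following (the host addresses and the table directory path here are placeholders, not the actual ones):

    sstableloader -d 10.0.1.10,10.0.1.11 /var/lib/cassandra/data/my_keyspace/my_table-<table-uuid>/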
Create a new node in the new cloud provider and add it to the existing cluster.
Flush tables from the memtables to SSTables on disk.
Remove (decommission) one node from the old cloud provider. Likewise repeat for each node (a rough command sketch follows below).
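A minimal sketch of those steps with nodetool, assuming the new node has already finished joining (all commands run on the nodes themselves):

    # on the node being retired: flush memtables to SSTables on disk
    nodetool flush

    # stream its data to the remaining replicas and leave the ring
    nodetool decommission

    # confirm the cluster state before repeating with the next node
    nodetool status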
Current:
Single node
cassandra 3.11.3 + kairosdb 1.2
Two data storage paths:
/data/cassandra/data/kairosdb - 4 TB, old data
/data1/cassandra/data/kairosdb - 1.1 TB, currently being written
Target:
Three nodes
cassandra 3.11.3 + kairosdb 1.2
One data storage path:
/data/cassandra/data/kairosdb
In this case, how do I migrate the data in the two data directories of the single node to a three-node cluster, where each node has only one data directory?
I understand how to do this (and have practiced it) when migrating a single node to a three-node cluster, but only when there is a single data directory. For the case of two data directories being migrated down to one, I have searched the Internet for a long time but found no reference material.
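For illustration, the difference presumably comes down to the data_file_directories setting in cassandra.yaml (this setting points at the parent data directory; the kairosdb keyspace folders above sit underneath it, and the exact layout here is an assumption):

    # current single node (assumed)
    data_file_directories:
        - /data/cassandra/data
        - /data1/cassandra/data

    # each node of the target cluster
    data_file_directories:
        - /data/cassandra/data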
Data directories are something that the individual Cassandra node cares about, but the cluster doesn't.
Usually you'd want all nodes to share the same configuration, but for replication it really doesn't matter where the SSTables live on disk on each node.
So migrating here would be the same as you've practiced.
That said, the process I'd choose would be to add the new nodes as a second DC with the right replication, run a repair to get all the data in sync, and then decommission the original node.
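A minimal sketch of that approach, assuming the keyspace is called kairosdb and the DCs are named DC1 (old) and DC2 (new); all names here are placeholders:

    -- in cqlsh: extend replication to the new DC
    ALTER KEYSPACE kairosdb WITH replication =
        {'class': 'NetworkTopologyStrategy', 'DC1': 1, 'DC2': 3};

    # on each new node: stream the existing data over from the old DC
    nodetool rebuild -- DC1

    # once the rebuild is done, repair, then retire the original node
    nodetool repair -full
    nodetool decommission    # run on the old node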
I'm trying to find articles regarding how Cassandra balances the load when adding a server/node - that is, after adding a node, how does Cassandra move certain partitions from the existing nodes to the new node, and how quickly can it be done?
When you add a node to your existing cluster, Cassandra will automatically assign token ranges to the new node and stream the relevant data to it. While this happens, nodetool status will show the node as JOINING.
After streaming completes, the node is part of your cluster and will handle requests like any other node, reducing the load on the others. But the data size on the old nodes won't shrink by itself - you need to issue nodetool cleanup to get rid of the now-obsolete data.
As for how quickly this can be done, that depends on your cluster's load and the amount of data that must be streamed - but streaming is limited by network bandwidth, of course.
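As a small illustration (nothing here is specific to your cluster), the relevant commands are:

    # on any node: the joining node shows up with state UJ (Up/Joining)
    nodetool status

    # once the new node reports UN (Up/Normal), reclaim space on each pre-existing node
    nodetool cleanup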
We have a requirement to replicate a Cassandra cluster with its existing nodes and existing data. Approximately 2.5 TB of data is on Azure and 3.5 TB on AWS. We need to pull the remaining data from AWS to Azure. Your kind help is appreciated.
There are many options here.
You can connect the two using GPFS (GossipingPropertyFileSnitch) - stand up a DC in Azure, replicate across, then remove the old DC.
You could unload the data via the Cassandra loader. https://github.com/brianmhess/cassandra-loader
You could take a snapshot and then stream the data to the new cluster via sstableloader.
It's hard to give a complete answer - it would depend on so many factors. The above should get you started at least.
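For the GPFS route, a minimal sketch of the per-node settings might look like the following (the DC and rack names are placeholders; after this you would extend keyspace replication to the new DC and run nodetool rebuild on the Azure nodes):

    # cassandra.yaml (every node)
    endpoint_snitch: GossipingPropertyFileSnitch

    # cassandra-rackdc.properties on the existing AWS nodes
    dc=AWS
    rack=rack1

    # cassandra-rackdc.properties on the new Azure nodes
    dc=AZURE
    rack=rack1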
I have 4 nodes in my cluster. When I take a snapshot, does it capture the latest data from the whole cluster, or only the data on that particular node? My question is: does a snapshot provide the latest data or not?
If it provides the latest data, there is no need to take a snapshot on each and every node in the cluster, right?
Snapshots flush all the memtables to disk (i.e. make SSTables) so that all the latest node data is present in your snapshot. The command works at the node level, meaning you back up the very latest data for each node individually, not for every node at once.
The advice given in the DataStax docs is that if you want to back up all the data at the same time, you should use a command-line utility that can execute requests in parallel (pssh is the suggested utility).
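For example, something along these lines with pssh (the host file, keyspace name, and snapshot tag are placeholders):

    # take a snapshot tagged "daily" of the my_keyspace keyspace on every host at once
    pssh -h cassandra_hosts.txt "nodetool snapshot -t daily my_keyspace"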
If it provides the latest data, there is no need to take a snapshot on each and every node in the cluster, right?
I can't really see a case where you need to back up your data daily when you're using Cassandra, unless you are running a single-node environment (which for C* seems slightly pointless). If you have a high enough replication factor, your data will always be backed up, and unless you are expecting a catastrophic hardware failure where all your servers (i.e. your entire cluster) simultaneously burst into flames, you do not need to back up daily.
I have a Cassandra cluster managed by Priam, with 3 nodes. I use ephemeral disks to store my Cassandra data, so when I start a node, the Cassandra data dir is empty.
I have Priam properly configured and I can see backups are saved in Amazon S3. Suppose a node goes down and then I start another node. Will Priam know how to automatically restore the backup from S3 when the node comes up again? The Cassandra data dir will start out empty, so I am assuming Priam would give the new node the same token as the old one and it would restore the data... Right?
Yes. I have been running standalone Cassandra on EC2, small Cassandra clusters on Mesos on EC2, and larger DataStax Enterprise clusters (with Cassandra) on EC2.
I have been using the Priam 3.x branch.
On restore, it calculates the initial_token, updates the cassandra.yaml file, restores the snapshot and incremental backup files, and restarts Cassandra.
According to Priam/Netflix conventions, if you have a 3-node cluster with Cassandra, your nodes should be named some_thing-other-things. Each node should be part of an auto-scaling group called some_thing. Each node should also use a security group named some_thing.
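As a purely hypothetical illustration of that naming convention (all names invented here, not taken from the Priam docs):

    Auto-scaling group : cass_prod
    Security group     : cass_prod
    Node names         : cass_prod-useast1a-1, cass_prod-useast1b-2, cass_prod-useast1c-3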
Create a 3-node dev cluster and test your backups and restores with data that you can easily recreate and don't care about too much. Get used to managing the auto-scaling groups and Priam. Then, try it on test clusters with data that you do care about.