cassandra 3.11.x mixing versions - cassandra

We have a 6-node Cassandra 3.11.3 cluster on Ubuntu 16.04. These are virtual machines.
We are switching to physical machines, on brand new (8!) servers that will have Debian 11 and presumably Cassandra 3.11.12.
Since the main version is always 3.11.x and Ubuntu 16.04 is out of support, the question is: can we just let the new machines join the old cluster and then decommission the outdated ones?
I hope to get some tips about this because intuitively it seems fine, but we are not too sure about that.
Thank you.

We are switching to physical machines on brand new (8!) servers...
Quick tip here: it's a good idea to build your clusters in multiples of your RF (replication factor). Not sure what your RF is, but if RF=3, I'd either stay with six nodes or get one more and go to nine. It's all about even data distribution.
can we just let the new machines join the old cluster and then decommission the outdated ones?
In short, no. You'll want to upgrade the existing nodes to 3.11.12 first. I can't recall whether 3.11.3 and 3.11.12 are SSTable compatible, but I wouldn't risk it.
Secondly, the best way to do this is to build your new (physical) nodes into the cluster as their own logical data center. Start them up empty, and then run a nodetool rebuild on each. Once that's complete, decommission the old nodes.
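A minimal command sketch of that sequence, assuming the old data center is named dc_old, the new one dc_new, and the keyspace my_ks (all placeholders):

    # On each new physical node (after the old nodes are upgraded to 3.11.12):
    #   - set cluster_name in conf/cassandra.yaml to match the existing cluster
    #   - with GossipingPropertyFileSnitch, set dc=dc_new in conf/cassandra-rackdc.properties
    #   - point seeds at existing nodes, then start the node empty

    # Replicate the keyspaces to the new DC (adjust RF to taste), e.g. in cqlsh:
    #   ALTER KEYSPACE my_ks WITH replication =
    #     {'class': 'NetworkTopologyStrategy', 'dc_old': 3, 'dc_new': 3};

    # Stream data into each new node from the old DC:
    nodetool rebuild -- dc_old

    # Once every new node is rebuilt and clients point at dc_new, drop dc_old
    # from the keyspace replication and run this on each old node:
    nodetool decommission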

There is a slightly simpler solution - move the data from each virtual machine to a physical server, as follows:
Prepare the Cassandra installation on the physical machine, configure the same cluster name, etc., but don't start it yet.
1. Stop Cassandra in the virtual machine & make sure that it won't start again
2. Copy all Cassandra data (/var/lib/cassandra or wherever it is located) from the VM to the physical server
3. Start the Cassandra process on the physical server
Repeat that process for all VM nodes, updating seeds, etc. along the way. After the process is finished, you can add the two physical servers that are left. Also, to speed up the process, you can do an initial copy of the data before stopping Cassandra in the VM and, after it's stopped, re-sync the data with rsync or similar. This way you can minimize the downtime. A command sketch of one such move is shown below.
This approach would be much faster than adding new nodes & decommissioning the old ones, as we won't need to stream data twice. It works because, once a node is initialized, Cassandra identifies nodes by their assigned UUID, not by IP address.
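A rough sketch of moving one VM node this way, run from its physical replacement; the hostname vm-node1 and the data path /var/lib/cassandra are placeholders:

    # 1. Initial copy while the VM is still serving traffic (can run for a while):
    rsync -avz vm-node1:/var/lib/cassandra/ /var/lib/cassandra/

    # 2. Stop Cassandra on the VM and make sure it cannot come back:
    ssh vm-node1 'nodetool drain && sudo systemctl stop cassandra && sudo systemctl disable cassandra'

    # 3. Re-sync whatever changed since the initial copy (should be quick):
    rsync -avz --delete vm-node1:/var/lib/cassandra/ /var/lib/cassandra/

    # 4. Start Cassandra on the physical server; it keeps the old node's host ID,
    #    so the rest of the cluster sees the same node with a new IP:
    sudo systemctl start cassandra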
Another approach is to follow the instructions for replacing a dead node. In this case the streaming of data will happen only once, but it could be a bit slower than a direct copy of the data.
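For reference, the dead-node replacement route on 3.11.x boils down to starting the new, empty node with the replace-address flag; the IP below is a placeholder for the node being replaced:

    # Add the flag before the first start, e.g. at the end of conf/cassandra-env.sh:
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.11"
    # Then start Cassandra; it bootstraps and streams the old node's data exactly once.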

Related

Cassandra cluster - Migrating all hosts in cluster

I am using Cassandra (3.5) with 20 nodes: data center-1 with 10 nodes and data center-2 with 10 nodes, holding a huge amount of data. All of these hosts are, say, legacy hosts. Now we have newer-generation hosts, say generation-2.
I have tried adding new nodes and decommissioning old nodes, but this is time consuming.
Q1: How can I migrate all hosts from legacy hosts to generation-2 hosts? What is the best approach for that?
Q2: What will be the rollback strategy?
Q3: Finally, how can I validate the data once I migrate to the generation-2 hosts?
If you are just replacing the nodes with newer hardware, keeping the same number of nodes, then it's simple (the operations should be done on every node; a command sketch follows the list):
1. Prepare the new installation on every node, with configuration identical to the existing nodes but with different IP addresses; don't start the nodes yet
2. (optional) Disable autocompaction with nodetool disableautocompaction - this can help step 5 finish faster
3. Copy the data from the old node to the new node using rsync (this could take a long time)
4. Execute nodetool drain & stop the old node
5. Use rsync to synchronize the changes that happened since the initial copy (this should be relatively fast)
6. Make sure that the old node won't start again (for example, remove the Cassandra package) - otherwise it could cause chaos
7. Start the new node
This works because a Cassandra node is identified by the UUID stored in its local system table, so changing the IP doesn't affect operations.
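As a rough illustration of one node's swap, run from the new machine; old-node and the data path are placeholders:

    # Optional: reduce churn in the data directory before copying
    ssh old-node 'nodetool disableautocompaction'

    # Initial bulk copy while the old node keeps serving requests
    rsync -avz old-node:/var/lib/cassandra/ /var/lib/cassandra/

    # Flush memtables and stop the old node for good
    ssh old-node 'nodetool drain && sudo systemctl stop cassandra && sudo systemctl disable cassandra'

    # Catch up on anything written since the first pass
    rsync -avz --delete old-node:/var/lib/cassandra/ /var/lib/cassandra/

    # Start the new node and verify it kept the old node's host ID
    sudo systemctl start cassandra
    nodetool info | grep ID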
P.S. In the future, if you need to replace a node that has completely died (not the scenario described here), use the node-replacement procedure - in that case you won't stream data twice, as happens when you decommission and then re-add a node.

How to migrate Cassandra data from a Windows server onto a Linux server?

I am not sure if my Windows-based Cassandra installation will work as-is on Linux-based Cassandra nodes.
My data resides in a Windows Cassandra database, and I now plan to move it onto a Linux server in order to use Elassandra.
Can the same data files be copied from the Windows OS to the Linux OS, into the same Cassandra directories?
Since the two use different file systems, I have some doubts about whether that will ever work.
If not, what is the workaround to migrate all the data?
The issue with the files has more to do with the version of Cassandra, rather than the OS. Cassandra's implementation in Java makes the underlying OS somewhat (albeit not completely) irrelevant.
Each version of Cassandra has a specific format for writing its SSTable files. As long as the version of Cassandra is the same between each server, copying the files should work.
Otherwise, if the Windows and Linux servers can see each other on the network, the easiest way to migrate would be to join the Linux server to the "cluster" on Windows. Just give the Linux server the IP of the Windows machine as its seed, set the cluster_name to be the same, and it should join. Then adjust the keyspace replication and run a repair.
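A minimal sketch of that join-and-repair path, assuming the Windows box is at 192.0.2.10, the keyspace is my_ks, and both machines run the same Cassandra version (all placeholders):

    # conf/cassandra.yaml on the Linux node:
    #   cluster_name: '<same name as on Windows>'
    #   seeds: "192.0.2.10"        (under seed_provider parameters)
    # Start Cassandra on the Linux node so it joins the Windows node.

    # Make sure the keyspace is replicated to both nodes, e.g. in cqlsh:
    #   ALTER KEYSPACE my_ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};

    # Then pull the data over to the Linux node:
    nodetool repair my_ks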
You should not repair, but rather stream the data from your existing DC with nodetool rebuild -- [source_dc_name]:
1. Start all nodes in your new DC with auto_bootstrap: false in conf/cassandra.yaml
2. Run nodetool rebuild on these nodes
3. Remove auto_bootstrap: false from conf/cassandra.yaml
If you start Elassandra in your new DC, the Elasticsearch indices will be rebuilt while streaming or repairing, so have fun with it!
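Roughly, on each node of the new DC (the source DC name DC1 is a placeholder):

    # conf/cassandra.yaml, only for the first start of the new-DC nodes:
    #   auto_bootstrap: false
    # Start the node, then stream all of its ranges from the existing DC:
    nodetool rebuild -- DC1
    # Afterwards remove auto_bootstrap: false (or set it to true) so any
    # future bootstrap behaves normally.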

Cassandra cluster on budget

I am learning Cassandra and want to run a cloud based cluster. I don't care much about speed.
What I want to really test is the replication and recovery features.
I would be running tests like
taking nodes offline every once in a while
kill -9 cassandra
powering off server
manually corrupting sstables/commitlog (not sure if this is recoverable)
I am thinking of going for a 4 node cluster.
Each node will have the following config:
2 GB RAM
10 GB SSD
2 CPUs (Virtual)
Two nodes will be in a European datacenter and other two will be in a North American data center.
I know 8GB is the recommended minimum for Cassandra. But that config would be quite expensive.
If it helps, I can run one more VM on a dedicated box. This VM can have 16 GB RAM and 8 virtual CPUs. I could also run 4 VMs with 4 GB RAM each on this box. But I guess having 4 separate VMs in different data centers would make a more realistic setup and bring to the fore any issues that may arise from network problems, latencies, etc.
Is it okay to run Cassandra on machines with this config? Please share your thoughts.
Many people run multiple instances of Cassandra on modern laptops using ccm ( https://github.com/pcmanus/ccm ). If you just want to get an idea of what it does (create a 3-node cluster, add data, add a 4th node, create a snapshot, remove a node, add it back, restore the snapshot, etc.), using ccm on a PC may be 'good enough'.
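For example, spinning up a throwaway local cluster with ccm looks roughly like this (cluster name and version are placeholders; the add-node flags follow ccm's README and may differ slightly between ccm versions):

    # Install ccm, then create and start a 3-node local cluster
    pip install ccm
    ccm create test_cluster -v 3.11.12 -n 3 -s

    # Poke at it: check status, kill a node, bring it back, add a 4th node
    ccm status
    ccm node1 stop
    ccm node1 start
    ccm add node4 -i 127.0.0.4 -j 7400 -b
    ccm node4 start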
Otherwise, you can certainly run with less than 1 GB of RAM, but it's not always fun. There have been some clusters on tiny hardware ( http://www.datastax.com/dev/blog/32-node-raspberry-pi-cassandra-cluster ). Depending on your budget, making a cluster of Raspberry Pis may be as cost effective as your 2-VM cluster.

What is meant by a node in cassandra?

I am new to Cassandra and I want to install it. So far I've read a small article on it.
But there one thing that I do not understand and it is the meaning of 'node'.
Can anyone tell me what a 'node' is, what it is for, and how many nodes we can have in one cluster ?
A node is the storage layer within a server.
Newer versions of Cassandra use virtual nodes, or vnodes. There are 256 vnodes per server by default.
A vnode is essentially the storage layer.
machine: a physical server, EC2 instance, etc.
server: an installation of Cassandra. Each machine has one installation of Cassandra. The Cassandra server runs core processes such as the snitch, the partitioner, etc.
vnode: The storage layer in a Cassandra server. There are 256 vnodes per server by default.
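If it helps to make that concrete: the number of vnodes per server is just the num_tokens setting in conf/cassandra.yaml, and nodetool can show you the result (a minimal sketch):

    # conf/cassandra.yaml (3.x default shown):
    #   num_tokens: 256
    # Each server then owns 256 token ranges, i.e. 256 vnodes.

    # Per-node summary, including how many tokens each node holds:
    nodetool status

    # Full listing of every token in the ring:
    nodetool ring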
Helpful tip:
Where you will get confused is that Cassandra terminology (in older blog posts, YouTube videos, and so on) has been used inconsistently. In older versions of Cassandra, each machine had one Cassandra server installed, and each server contained one node. Because of that 1-to-1-to-1 relationship between machine, server and node in old versions of Cassandra, people used the terms machine, server and node interchangeably.
Cassandra is a distributed database management system designed to handle large amounts of data across many commodity servers. Like all other distributed database systems, it provides high availability with no single point of failure.
You may get some idea from the paragraph above. Generally, when we talk about Cassandra, we mean a Cassandra cluster, not a single PC. A node in a cluster is just a fully functional machine that is connected to the other nodes in the cluster through a fast internal network. All the nodes work together so that even if one of them fails due to an unexpected error, the cluster as a whole can keep providing service.
All nodes in a Cassandra cluster are the same. There is no concept of master or slave nodes. There are multiple reasons for this design, and you can Google it for more details if you want.
Theoretically, you can have as many nodes as you want in a Cassandra cluster. For example, Apple reported running 75,000 nodes at the Cassandra Summit in 2014.
Of course you can try Cassandra with one machine. It still works with just one node in the cluster.
What is meant by a node in cassandra?
Cassandra Node is a place where data is stored.
Data center is a collection of related nodes.
A cluster is a component which contains one or more data centers.
In other words, it is a collection of Cassandra nodes which communicate with each other to perform a set of operations.
In Cassandra, each node is independent and at the same time interconnected to other nodes.
All the nodes in a cluster play the same role.
Every node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
In the case of failure of one node, Read/Write requests can be served from other nodes in the network.
If you're looking to understand Cassandra terminology, then the following post is a good reference:
http://exponential.io/blog/2015/01/08/cassandra-terminology/

Enabling vNodes in Cassandra 1.2.8

I have a 4-node cluster and I have upgraded all the nodes from an older version to Cassandra 1.2.8. The total data present in the cluster is about 8 GB. Now I need to enable vnodes on all 4 nodes of the cluster without any downtime. How can I do that?
As Nikhil said, you need to increase num_tokens and restart each node. This can be done one node at a time with no downtime.
However, increasing num_tokens doesn't cause any data to redistribute so you're not really using vnodes. You have to redistribute it manually via shuffle (explained in the link Lyuben posted, which often leads to problems), by decommissioning each node and bootstrapping back (which will temporarily leave your cluster extremely unbalanced with one node owning all the data), or by duplicating your hardware temporarily just like creating a new data center. The latter is the only reliable method I know of but it does require extra hardware.
In conf/cassandra.yaml you will need to comment out the initial_token parameter and enable the num_tokens parameter (256 by default, I believe). Do this for each node. Then restart the Cassandra service on each node and wait for the data to get redistributed throughout the cluster. 8 GB should not take too much time (provided your nodes are all in the same cluster), and read requests will still be served, though you might see degraded performance until the redistribution of data is complete.
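Concretely, the per-node change looks roughly like this (do one node at a time and wait for it to come back Up/Normal before moving on):

    # conf/cassandra.yaml on each node:
    #   # initial_token:        <- comment this out
    #   num_tokens: 256         <- enable this (256 is the usual default)

    # Rolling restart, one node at a time:
    nodetool drain
    sudo service cassandra restart
    nodetool status    # wait until this node shows UN (Up/Normal) before the next one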
EDIT: Here is a potential strategy to migrate your data:
Decommission two nodes of the cluster. The token-space should get distributed 50-50 between the other two nodes.
On the two decommissioned nodes, remove the existing data, and restart the Cassandra daemon with a different cluster name and with the num_token parameters enabled.
Migrate the 8 GB of data from the old cluster to the new cluster. You could write a quick script in python to achieve this. Since the volume of data is small enough, this should not take too much time.
Once the data is migrated in the new cluster, decommission the two old nodes from the old cluster. Remove the data and restart Cassandra on them, with the new cluster name and the num_tokens parameter. They will bootstrap and data will be streamed from the two existing nodes in the new cluster. Preferably, only bootstrap one node at a time.
With these steps, you should never face a situation where your service is completely down. You will be running with reduced capacity for some time, but again since 8GB is not a large volume of data you might be able to achieve this quickly enough.
TL;DR:
No, you need to restart the servers once the config has been edited.
The problem is that enabling vnodes means a lot of the data is redistributed randomly (the docs say in a vein similar to the classic 'nodetool move').
