Difference between a node and vnode - cassandra

I'm new to Cassandra and I'm confused about the concepts of nodes and vnodes.
Here's what I had read: The hierarchy of elements in Cassandra is:
Cluster - Data Centre - Rack - Server - Node
The node was described as a data storage layer within a server and the server was the actual physical machine containing the Cassandra software.
From what I could understand, it seems to me that vnodes are different/more efficient than nodes in certain cases.
However I'm having trouble placing them in this hierarchy.
Is a vnode just a different kind of node in the above hierarchy?
Or is it that, after vnodes were introduced, the element called server in the above hierarchy is now called a node, and the one called node is now called a vnode?

You can see vnodes as a next step in the hierarchy you've described, after physical nodes.
Vnodes make data distribution much more flexible: when you resize your cluster, they help redistribute data based on tokens.
There's a good explanation from datastax site: https://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2
EDIT: In old versions of Cassandra, the tokens were split so that each server had one token (range), and the data was replicated between physical machines based on the replication factor. The vnode architecture (also used in Riak, for example) virtualizes the "node" layer: the ring is split into a large number of token ranges (vnodes), and each physical node (Cassandra service) has a number of vnodes running on it. Please review the link provided; it has a very good explanation with examples.

Before Cassandra 1.2, each node was assigned a single token. Adding or replacing a node therefore meant manually calculating the initial_token property in cassandra.yaml, and it also caused significant data movement across the cluster.
"Cassandra’s 1.2 release introduced the concept of virtual nodes, also called vnodes for short. Instead of assigning a single token to a node, the token range is broken up into multiple smaller ranges. Each physical node is then assigned multiple tokens. By default, each node will be assigned 256 of these tokens, meaning that it contains 256 virtual nodes. Virtual nodes have been enabled by default since 2.0."
From Cassandra: The Definitive Guide, by Jeff Carpenter & Eben Hewitt.
Vnodes are useful because you can adjust the number of vnodes on each Cassandra instance (node) to match the machine's capabilities by setting the num_tokens property in the cassandra.yaml file.
Token assignments for vnodes are calculated by the org.apache.cassandra.dht.tokenallocator.ReplicationAwareTokenAllocator class.
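To make the difference between one token per node and many tokens per node concrete, here is a minimal Python sketch of a token ring. It is only an illustration under stated assumptions: tokens are picked at random (not by Cassandra's own allocator), MD5 stands in for the real partitioner hash, and the names build_ring, token_for and owner are made up for this example.

```python
import bisect
import hashlib
import random

TOKEN_SPACE = 2 ** 64  # illustrative token space, not Cassandra's exact range


def build_ring(nodes, num_tokens, seed=0):
    """Return a sorted list of (token, node) pairs with num_tokens tokens per node."""
    rng = random.Random(seed)
    ring = [(rng.randrange(TOKEN_SPACE), node)
            for node in nodes
            for _ in range(num_tokens)]
    ring.sort()
    return ring


def token_for(partition_key):
    """Hash a partition key to a token (MD5 is a stand-in for the real partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner(ring, partition_key):
    """Return the node holding the first token >= the key's token, wrapping around."""
    tokens = [token for token, _ in ring]
    index = bisect.bisect_left(tokens, token_for(partition_key))
    return ring[index % len(ring)][1]


nodes = ["node-a", "node-b", "node-c"]
single_token_ring = build_ring(nodes, num_tokens=1)   # one token (range) per node
vnode_ring = build_ring(nodes, num_tokens=256)        # 256 small ranges per node

print(len(single_token_ring), len(vnode_ring))  # 3 768
print(owner(vnode_ring, "user:42"))             # one of the three node names
```

The lookup is the same in both rings; vnodes only change how many entries each physical node contributes to the sorted token list.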

Related

How does Cassandra improve performance by adding nodes?

I'm going to build an Apache Cassandra 3.11.X cluster with 44 nodes. Each application server will host one cluster node so that the application reads and writes locally.
I have a couple of questions; please answer if possible.
1. How many server IPs should be listed in the seed node parameter?
2. How does HA work when all the listed seed nodes go down?
3. What is the disadvantage of listing all the server IPs in the seed node parameter?
4. How does Cassandra scale with respect to data, other than through the primary key and tunable consistency? My assumption is that the replication factor can improve HA but not performance, so how does performance increase when more nodes are added?
5. Is there any sharding mechanism in Cassandra?
Answers, in order:
1. It's recommended to point to at least 2 nodes per DC.
2. A seed/contact node is used only for the initial bootstrap: when your program reaches any of the listed nodes, it "learns" the topology of the whole cluster, and the driver then listens for node status changes and adjusts its list of available hosts. So even if the seed node(s) go down after the connection is established, the driver will still be able to reach other nodes.
3. It's usually harder to maintain: you need to keep the configuration parameters for your driver and the list of nodes in sync.
4. When you have RF > 1, Cassandra may read or write data from/to any replica. The consistency level controls how many nodes must respond to a read or write operation. When you add a new node, data is redistributed to it, and if you have chosen the partition key well, the new node starts to receive requests in parallel with the old nodes.
5. The partition key determines which replica(s) hold the data associated with it, so you can think of it as a shard. But you need to be careful when choosing the partition key: it's easy to create partitions that are too big, or partitions that are "hot" (receiving most of the operations in the cluster; for example, if you use the date as the partition key and always write and read data for today).
P.S. I would recommend reading the DataStax Architecture guide; it contains a lot of information about Cassandra as well...
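The "hot partition" warning in point 5 can be illustrated with a small Python sketch. It is a simplification under stated assumptions: four hypothetical nodes each own one equal slice of the token space, replication is ignored, MD5 stands in for the real partitioner, and the sensor and date values are made up.

```python
import hashlib
from collections import Counter

NODES = ["node-1", "node-2", "node-3", "node-4"]  # hypothetical node names
TOKEN_SPACE = 2 ** 64


def node_for(partition_key):
    """Map a partition key to a node by hashing it onto four equal token ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token * len(NODES) // TOKEN_SPACE]


def writes_per_node(partition_keys):
    """Count how many writes each node receives for the given partition keys."""
    return Counter(node_for(key) for key in partition_keys)


sensors = [f"sensor-{i}" for i in range(1000)]

# Partition key = date only: every write for today hashes to the same token.
by_date = writes_per_node("2024-01-01" for _ in sensors)

# Partition key = (date, sensor id): the writes hash to many different tokens.
by_date_and_sensor = writes_per_node(f"2024-01-01|{s}" for s in sensors)

print(len(by_date))              # 1: a single node takes all 1000 writes
print(dict(by_date_and_sensor))  # writes spread across the nodes
```

With the date alone as the key, adding more nodes would not help today's writes, because they all land in one partition.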

Working of vnodes in Cassandra

Can someone explain how vnode allocation works in Cassandra?
If we have a cluster of N nodes and a new node is added how are token ranges allocated to this new node?
Rebalancing a cluster is automatically accomplished when adding or removing nodes. When a node joins the cluster, it assumes responsibility for an even portion of data from the other nodes in the cluster. If a node fails, the load is spread evenly across other nodes in the cluster.
Here is some reading that might help you better understand how vnodes work and how ranges are being allocated - Virtual nodes in Cassandra 1.2
As I said above, Cassandra automatically handles the calculation of token ranges for each node in the cluster in proportion to their num_tokens value. Token assignments for vnodes are calculated by the org.apache.cassandra.dht.tokenallocator.ReplicationAwareTokenAllocator class.
When a new node joins the cluster, it will inject its own ranges and steal some ranges from the existing nodes. Also this video might help
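As a rough illustration of a new node "stealing" ranges, the Python sketch below measures how much of the token space each node owns before and after a fifth node joins. It assumes randomly chosen tokens, 256 per node, which is a simplification and not a reproduction of Cassandra's token allocator; the function names are invented for this example.

```python
import random

TOKEN_SPACE = 2 ** 64  # illustrative token space


def random_tokens(node, num_tokens, rng):
    """Pick num_tokens random tokens for a node (a simplifying assumption)."""
    return [(rng.randrange(TOKEN_SPACE), node) for _ in range(num_tokens)]


def ownership(ring):
    """Fraction of the token space owned by each node.

    Each token owns the range from the previous token (exclusive) up to itself.
    """
    ring = sorted(ring)
    widths = {}
    for i, (token, node) in enumerate(ring):
        previous = ring[i - 1][0]  # index -1 wraps around to the last token
        widths[node] = widths.get(node, 0) + (token - previous) % TOKEN_SPACE
    return {node: width / TOKEN_SPACE for node, width in widths.items()}


rng = random.Random(1)
ring = []
for name in ["node-1", "node-2", "node-3", "node-4"]:
    ring += random_tokens(name, 256, rng)
before = ownership(ring)

ring += random_tokens("node-5", 256, rng)  # the new node injects its own tokens
after = ownership(ring)

for node in sorted(after):
    print(node, round(before.get(node, 0.0), 3), "->", round(after[node], 3))
```

Each new token splits an existing range in two, so the new node ends up with many small pieces taken from all of the existing nodes rather than one large piece taken from a single neighbour.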

What is meant by a node in cassandra?

I am new to Cassandra and I want to install it. So far I've read a small article on it.
But there is one thing that I do not understand, and that is the meaning of 'node'.
Can anyone tell me what a 'node' is, what it is for, and how many nodes we can have in one cluster?
A node is the storage layer within a server.
Newer versions of Cassandra use virtual nodes, or vnodes. There are 256 vnodes per server by default.
A vnode is essentially the storage layer.
machine: a physical server, EC2 instance, etc.
server: an installation of Cassandra. Each machine has one installation of Cassandra. The Cassandra server runs core processes such as the snitch, the partitioner, etc.
vnode: The storage layer in a Cassandra server. There are 256 vnodes per server by default.
Helpful tip:
Where you will get confused is that Cassandra terminology (in older blog posts, YouTube videos, and so on) has been used inconsistently. In older versions of Cassandra, each machine had one Cassandra server installed, and each server contained one node. Because of this 1-to-1-to-1 relationship between machine, server and node in old versions of Cassandra, people used the terms machine, server and node interchangeably.
Cassandra is a distributed database management system designed to handle large amounts of data across many commodity servers. Like all other distributed database systems, it provides high availability with no single point of failure.
The paragraph above should already give you some idea. Generally, when we talk about Cassandra, we mean a Cassandra cluster, not a single PC. A node in a cluster is just a fully functional machine that is connected to the other nodes in the cluster through a fast internal network. All nodes work together to make sure that even if one of them fails due to an unexpected error, the cluster as a whole can still provide service.
All nodes in a Cassandra cluster are the same. There is no concept of a master node or slave nodes. There are multiple reasons for this design, and you can Google it for more details if you want.
Theoretically, you can have as many nodes as you want in a Cassandra cluster. For example, at the Cassandra Summit in 2014, Apple reported running 75,000 Cassandra nodes.
Of course, you can try Cassandra on one machine. It still works with just one node in the cluster.
What is meant by a node in cassandra?
A Cassandra node is a place where data is stored.
A data center is a collection of related nodes.
A cluster is a component that contains one or more data centers.
In other words, it is a collection of multiple Cassandra nodes that communicate with each other to perform a set of operations.
In Cassandra, each node is independent and at the same time interconnected with the other nodes.
All the nodes in a cluster play the same role.
Every node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
If one node fails, read/write requests can be served by other nodes in the network.
If you're looking to understand Cassandra terminology, then the following post is a good reference:
http://exponential.io/blog/2015/01/08/cassandra-terminology/

What are virtual nodes? And how do they help during partitioning in Cassandra?

I know we can use Cassandra's virtual node facility to avoid the additional overhead of assigning a token (start token) to each node of the cluster. Instead, we use num_tokens, whose default value is 256.
In what way are these virtual nodes making difference in partitioning? Is Cassandra setting/assigning a token range (max and minimum token) for a particular node?
What are virtual nodes?
Prior to Cassandra 1.2, each node was assigned to a specific token range. Now each node can support multiple, non-contiguous token ranges. Instead of a node being responsible for one large range of tokens, it is responsible for many smaller ranges. In this way, one physical node is essentially hosting many smaller "virtual" nodes.
In what way are these virtual nodes making a difference in partitioning?
Consider the image in this blog: Virtual nodes in Cassandra 1.2.
Having many smaller token ranges (virtual nodes) on each physical node allows for a more even distribution of data. This becomes evident when you add a physical node to the cluster: rebalancing (manually reassigning token ranges) is no longer necessary. As the Virtual Node documentation states, the new node "assumes responsibility for an even portion of data from the other nodes in the cluster."
Is Cassandra setting/assigning a token range (max and minimum token) for a particular node?
Yes, Cassandra predetermines the size of each virtual node. However, you can control the number of virtual nodes assigned to each physical node. Assume that your physical nodes are all configured for the default of 256 virtual nodes. If you add a new machine with more resources than your current nodes, and you want that machine to handle more load, you could configure it to allow 384 virtual nodes instead. Likewise, a machine with fewer resources could be configured to support a smaller number of virtual nodes.
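As a back-of-the-envelope check of the 256 versus 384 example, the following Python sketch computes the share of the token ranges each machine would be expected to hold if ownership were exactly proportional to num_tokens. That proportionality is an idealization (real allocations vary somewhat), and the node names are hypothetical.

```python
from fractions import Fraction

# num_tokens configured on each (hypothetical) machine
num_tokens = {"node-1": 256, "node-2": 256, "node-big": 384}

total = sum(num_tokens.values())
expected_share = {node: Fraction(count, total)
                  for node, count in num_tokens.items()}

for node, share in expected_share.items():
    print(node, share, f"{float(share):.1%}")
# node-1 2/7 28.6%
# node-2 2/7 28.6%
# node-big 3/7 42.9%
```

So in this three-machine example the larger machine would be expected to carry about one and a half times the data of each smaller one.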

Enabling vNodes in Cassandra 1.2.8

I have a 4 node cluster and I have upgraded all the nodes from an older version to Cassandra 1.2.8. Total data present in the cluster is of size 8 GB. Now I need to enable vNodes on all the 4 nodes of cluster without any downtime. How can I do that?
As Nikhil said, you need to increase num_tokens and restart each node. This can be done one node at a time with no downtime.
However, increasing num_tokens doesn't cause any data to be redistributed, so you're not really using vnodes. You have to redistribute it manually in one of three ways: via shuffle (explained in the link Lyuben posted), which often leads to problems; by decommissioning each node and bootstrapping it back, which will temporarily leave your cluster extremely unbalanced with one node owning all the data; or by temporarily duplicating your hardware, just like creating a new data center. The last option is the only reliable method I know of, but it does require extra hardware.
In conf/cassandra.yaml you will need to comment out the initial_token parameter and enable the num_tokens parameter (256 by default, I believe). Do this for each node. Then restart the Cassandra service on each node and wait for the data to be redistributed throughout the cluster. 8 GB should not take too much time (provided your nodes are all in the same cluster), and read requests will still be served, though you might see degraded performance until the redistribution of data is complete.
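The configuration change described above could be scripted roughly as follows. This is only a sketch that edits the text of a cassandra.yaml file: the function name enable_vnodes and the sample content are made up for illustration, it does not restart any node, and, as the other answer points out, changing num_tokens alone does not move existing data.

```python
import re


def enable_vnodes(yaml_text, num_tokens=256):
    """Comment out initial_token and set num_tokens in cassandra.yaml text."""
    lines = []
    saw_num_tokens = False
    for line in yaml_text.splitlines():
        if re.match(r"\s*initial_token\s*:", line):
            line = "# " + line                    # disable the single fixed token
        elif re.match(r"\s*#?\s*num_tokens\s*:", line):
            line = f"num_tokens: {num_tokens}"    # enable (or overwrite) num_tokens
            saw_num_tokens = True
        lines.append(line)
    if not saw_num_tokens:
        lines.append(f"num_tokens: {num_tokens}")  # add the setting if it was absent
    return "\n".join(lines) + "\n"


# Made-up sample content; a real cassandra.yaml has many more settings.
sample = "cluster_name: 'Test Cluster'\ninitial_token: 0\n# num_tokens: 256\n"
print(enable_vnodes(sample))
```

In practice you would back up the original file first and review the result by hand before restarting each node.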
EDIT: Here is a potential strategy to migrate your data:
1. Decommission two nodes of the cluster. The token space should get distributed 50-50 between the other two nodes.
2. On the two decommissioned nodes, remove the existing data, and restart the Cassandra daemon with a different cluster name and with the num_tokens parameter enabled.
3. Migrate the 8 GB of data from the old cluster to the new cluster. You could write a quick script in Python to do this. Since the volume of data is small, this should not take too much time.
4. Once the data is migrated to the new cluster, decommission the two old nodes from the old cluster. Remove the data and restart Cassandra on them with the new cluster name and the num_tokens parameter. They will bootstrap, and data will be streamed from the two existing nodes in the new cluster. Preferably, bootstrap only one node at a time.
With these steps, you should never face a situation where your service is completely down. You will be running with reduced capacity for some time, but again, since 8 GB is not a large volume of data, you might be able to achieve this quickly enough.
TL;DR: No, you need to restart the servers once the config has been edited.
The problem is that enabling vnodes means a lot of the data is redistributed randomly (the docs say in a vein similar to the classic ‘nodetool move’).

Resources