Cassandra Cluster Vs Nodes - cassandra

I am a beginner in Cassandra and I am trying to understand a few basic things.
1) Cassandra cluster: Does it mean a physical server? Is it possible to run multiple clusters on a single physical machine?
2) Cassandra nodes: By definition, it looks like one cluster can have multiple nodes. Can we have multiple nodes on a single physical machine, or does one node mean one machine?
3) I have two physical machines. I installed Cassandra on both and configured them to sync with each other, so if I create a keyspace with NetworkTopologyStrategy I can see it on both servers. Does that mean I created two clusters or two nodes?
I need help with the above questions.

Let us take the JVM as the unit.
Cassandra node: a single JVM instance running Cassandra. It can run on a physical machine, in a VM, or in a Docker container.
Cassandra cluster: a group of one or more Cassandra nodes.
So if you have 2 physical machines, you can run more than 2 nodes, depending on the capacity of the machines. You can also run multiple clusters. For example, you can create 6 VMs to get 6 nodes and group them into two clusters of 3 nodes each. Which cluster a node belongs to is controlled by its cassandra.yaml.
Does it mean that I created two clusters or two nodes?
It means you created two nodes and grouped them into one cluster.
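To make the node/cluster distinction concrete, here is a minimal Python sketch of the 6-VM example. It assumes each node is described only by a host name and the cluster_name value from its cassandra.yaml. The host names, cluster names and the helper function are hypothetical and only illustrate the grouping; a real cluster also needs matching seed settings.

```python
from collections import defaultdict

# Hypothetical inventory: six VMs, each running one Cassandra JVM (one node).
# The cluster_name values stand in for what each node's cassandra.yaml contains.
NODES = [
    {"host": "vm1", "cluster_name": "ClusterA"},
    {"host": "vm2", "cluster_name": "ClusterA"},
    {"host": "vm3", "cluster_name": "ClusterA"},
    {"host": "vm4", "cluster_name": "ClusterB"},
    {"host": "vm5", "cluster_name": "ClusterB"},
    {"host": "vm6", "cluster_name": "ClusterB"},
]


def group_by_cluster(nodes):
    """Group node host names by the cluster name they are configured with."""
    clusters = defaultdict(list)
    for node in nodes:
        clusters[node["cluster_name"]].append(node["host"])
    return dict(clusters)


if __name__ == "__main__":
    for name, hosts in group_by_cluster(NODES).items():
        print(f"{name}: {len(hosts)} nodes -> {hosts}")
```

Six JVMs therefore give six nodes, while the number of clusters depends only on how the nodes are configured, not on how many machines there are.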

Related

Warnings : Your replication factor 3 for keyspace books is higher than the number of nodes 1

When I tried:
CREATE KEYSPACE IF NOT EXISTS books WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': 3 };
I got:
Warnings :
Your replication factor 3 for keyspace books is higher than the number of nodes 1
I installed Cassandra on my local Ubuntu 22.04 machine and don't remember specifying how many nodes it should have, if that was even possible. Is it possible to have multiple nodes on a local machine? How can I check the number of Cassandra nodes, and how can I change it?
A "node" in Cassandra is an individual machine instance. If you're deploying in the cloud or on Kubernetes, increasing the number of nodes is a trivial thing. However, if you're just testing on your own machine, and you've installed Cassandra, then you have one node.
How should I check the number of Cassandra nodes?
You can check this by querying the local node's system.peers table. That table holds data on each of the other nodes in the cluster.
SELECT * FROM system.peers;
So, if you get two rows back, then you have a three-node cluster (the two peers plus the local node). If nothing comes back, then you have a one-node cluster.
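The counting rule can be written down as a tiny Python sketch. It assumes the rows returned by the system.peers query have already been fetched into a list by whatever client you use. The function name and the sample rows are made up for illustration.

```python
def cluster_size(peer_rows):
    """Number of nodes in the cluster: every peer row plus the local node."""
    return len(list(peer_rows)) + 1


if __name__ == "__main__":
    # Hypothetical result of "SELECT * FROM system.peers;" on a three-node cluster.
    sample_rows = [{"peer": "10.0.0.2"}, {"peer": "10.0.0.3"}]
    print(cluster_size(sample_rows))  # 3
    print(cluster_size([]))           # 1 (a single-node cluster has no peers)
```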
is it possible to have multi nodes on a local machine?
Yes. It's not easy, though. Cassandra makes exclusive use of a few ports on a machine instance, so you'd need multiple Cassandra directories, and each would need its own offsets of ports 7000, 7001, 7199, and 9042.
The easier way is to use Minikube to run multiple small Cassandra nodes inside a Kubernetes cluster. I've built a repo to help folks do this on Windows: https://github.com/aploetz/cassandra_minikube
There's also a companion YouTube video for it: https://www.youtube.com/watch?v=eMKXiItZ0Q4
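For the manual route mentioned above (several Cassandra directories, each with its own port offsets), the following Python sketch shows only the port arithmetic. The step of 100 and the helper name are assumptions. The first three keys match cassandra.yaml setting names, while the JMX port is normally set outside cassandra.yaml, so "jmx_port" is just a label here. The sketch does not check whether a given Cassandra version accepts different storage ports within one cluster.

```python
# Default ports named in the answer above.
BASE_PORTS = {
    "storage_port": 7000,
    "ssl_storage_port": 7001,
    "native_transport_port": 9042,
    "jmx_port": 7199,  # label only; JMX is configured outside cassandra.yaml
}


def ports_for_instance(index, step=100):
    """Ports for the index-th local instance (0 keeps the defaults).

    A step of 1 would not work: 7000 + 1 collides with the default 7001.
    """
    return {name: port + index * step for name, port in BASE_PORTS.items()}


if __name__ == "__main__":
    for i in range(3):
        print(f"instance {i}: {ports_for_instance(i)}")
```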

Three node Cassandra cluster all nodes configured different rack in same dc

I am a newbie in Cassandra.
In our production environment, a three-node Cassandra cluster is running and serving production traffic, but I have the following doubts:
1) All nodes are configured in different racks (rack 1, rack 2 and rack 3) in the same DC. Is this fine, or does this configuration have drawbacks?
2) We use RF 2 with NetworkTopologyStrategy for all keyspaces except the system tables, which are configured with RF 2 and SimpleStrategy. Is this fine, or does it need to be changed? Should we increase the replication factor of system_auth?
3) I now want to add another node in the same DC. What is the best procedure to do this without impacting live traffic?
The Cassandra version is Apache Cassandra 3.11.
Thanks in advance.
Ans 1) Having the Cassandra nodes in different racks is good for availability and fault tolerance.
Ans 2) You must increase the RF of system_auth so that you avoid cqlsh login issues from other nodes.
Ans 3) You can add a new node without affecting live traffic on the existing cluster. Please follow the procedure below.
http://cassandra.apache.org/doc/latest/operating/topo_changes.html
Cassandra is designed as a distributed system, and its architecture is tailored for multi-datacenter deployment. These features are robust and flexible enough that you can configure the cluster for geographical distribution, redundancy, failover and disaster recovery.
Multi-datacenter deployments suit global solutions in which some applications operate in one region and others in another region, while all of them use a single Cassandra cluster that spans datacenters across regions.
Even for single-region applications, having multiple datacenters is the preferred option because it provides disaster recovery if one region goes down.
Ans 1) For a single-DC Cassandra cluster, the recommendation is 4 nodes with RF 3: Rack 1 with 2 nodes and Rack 2 with 2 nodes. Remember that nodes in the same rack have a faster network between them than nodes in different racks. With two nodes in the same rack, LOCAL_QUORUM queries will be faster than on a cluster with every node in a different rack.
If you are not concerned about query latency, placing all nodes in different racks (3 racks) gives better disaster recovery than a two-rack deployment. Having said that, multi-DC deployments are always recommended for production clusters.
Ans 2) It is always recommended to increase the replication factor of the system_auth keyspace and change its replication class to NetworkTopologyStrategy. For more details, see https://docs.datastax.com/en/security/6.0/security/secSystemKeyspace.html
Ans 3) Yes, you can easily add a new node to an existing cluster without impacting traffic. For more details, see: https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsAddNodeToCluster.html
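The replication factors discussed in this thread (RF 2 in the question, RF 3 in the answer) can be related to quorum sizes with a little arithmetic. A quorum is floor(RF / 2) + 1 replicas. The Python sketch below uses made-up function names and shows how many replicas of a row may be down while quorum reads and writes still succeed.

```python
def quorum(replication_factor):
    """Replicas that must respond for a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor):
    """Replicas that can be down while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


if __name__ == "__main__":
    for rf in (1, 2, 3, 5):
        print(
            f"RF={rf}: quorum={quorum(rf)}, "
            f"tolerated failures={tolerated_replica_failures(rf)}"
        )
```

With RF 2 the quorum is 2, so a quorum request cannot tolerate any replica being down. With RF 3 the quorum is still 2, which leaves room for one unavailable replica.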

Creating Cassandra sub-clusters

I need to create K overlapping Cassandra clusters on N machines (K >> N). Each cluster can have between 1 and N nodes. I know that one way of doing so is to create a separate process (or Docker container) for each cluster that a node is a member of.
My question, however, is whether I can change Cassandra to allow the creation of sub-clusters. That is, only 1 Cassandra instance would run on each node, but I would be able to control data replication and data placement so that, for example, I could do a QUORUM write within a sub-cluster.
No, it's not possible to define sub-clusters as you describe: there is always a single Cassandra cluster per process.
But Cassandra has the notion of a datacenter, which defines where a machine resides, and of a keyspace, which defines how data is replicated between datacenters and nodes. And a consistency level such as QUORUM depends on the keyspace configuration.
In your case I would think in that direction: define datacenters, create the necessary keyspaces, and set up the correct replication factors for those keyspaces.

Cassandra: Who creates/distributes virtual Nodes among nodes - Leader?

In Cassandra, virtual nodes are created and distributed among nodes as described in http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2. But who carries out that process of creating the virtual nodes and distributing them among peers? Is it some sort of leader? How does it work?
Also, when a node joins, virtual nodes are redistributed, and there are many similar actions. Who does all of those?
Edit: Is it that when a node joins, it takes over some of the virtual nodes from the existing cluster, thus eliminating the need for a leader?
A new node retrieves information about the cluster from the seed nodes.
The new node takes its part of the ring based on the num_tokens parameter (by default the ring ends up distributed roughly evenly between all existing nodes) and bootstraps its part of the data.
The rest of the cluster becomes aware of the change by "gossiping", that is, via the gossip protocol.
Apart from the seed nodes, there is no need for any "master" in the process.
Old nodes do not automatically delete the partitions they no longer own; you need to run nodetool cleanup on the old nodes after adding a new node.
Here's a good article about it:
http://cassandra.apache.org/doc/latest/operating/topo_changes.html
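The leaderless join described in this answer can be illustrated with a simplified Python simulation. It is only a sketch under stated assumptions: the token space is modelled as 0 to 2**64 - 1, tokens are chosen uniformly at random by the joining node itself, and replication, gossip and data streaming are ignored. The function names and node names are hypothetical, and this is not Cassandra's actual allocation code.

```python
import random

RING_SIZE = 2**64  # simplified token space: 0 .. 2**64 - 1


def join(ring, node, num_tokens, rng):
    """Add `node` to `ring` (a dict token -> owner).

    The joining node picks its own tokens; no leader assigns them.
    """
    for _ in range(num_tokens):
        token = rng.randrange(RING_SIZE)
        while token in ring:  # extremely unlikely, but keep tokens unique
            token = rng.randrange(RING_SIZE)
        ring[token] = node


def ownership(ring):
    """Fraction of the token space owned by each node.

    A token owns the range from the previous token (exclusive) up to itself.
    """
    tokens = sorted(ring)
    shares = {}
    for i, token in enumerate(tokens):
        previous = tokens[i - 1]  # index -1 wraps around the ring for i == 0
        span = (token - previous) % RING_SIZE or RING_SIZE
        owner = ring[token]
        shares[owner] = shares.get(owner, 0.0) + span / RING_SIZE
    return shares


if __name__ == "__main__":
    rng = random.Random(42)
    ring = {}
    for name in ("node1", "node2", "node3"):
        join(ring, name, num_tokens=256, rng=rng)
    print("before:", ownership(ring))
    join(ring, "node4", num_tokens=256, rng=rng)
    print("after: ", ownership(ring))
```

Each new token only splits an existing range, so the joining node takes slices from many existing nodes at once, and the old nodes keep the now-unowned data on disk until nodetool cleanup is run.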

Cassandra clusters separation

I have one single-node cluster and just added a multi-node cluster (on 4 separate nodes; let's call them node1, node2, ..., node4). The single-node cluster uses localhost as its seed (seed_provider). The multi-node cluster uses the node1 and node2 hosts as seeds (SimpleSeedProvider).
To my surprise, when I started the multi-node cluster, I saw its nodes start talking to the single-node Cassandra and download data from it.
How can I prevent the new cluster from talking to the existing cluster? Am I missing anything else?
They will "gossip" on the network and detect each other if they are not separated.
Did you make sure the cluster_name value in your cassandra.yaml file is not the same for both of your clusters? That's how they tell each other apart, as the sample configuration file says:
# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
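A quick way to spot this kind of mix-up before starting the nodes is to compare the configured names. The Python sketch below assumes you have collected, for each host, the cluster you intended it to join and the cluster_name from its cassandra.yaml. The host names, cluster labels and helper function are hypothetical.

```python
from collections import defaultdict


def shared_cluster_names(hosts):
    """Return cluster_name values that are used by more than one intended cluster.

    `hosts` maps host -> (intended_cluster, cluster_name from cassandra.yaml).
    """
    groups = defaultdict(set)
    for intended, cluster_name in hosts.values():
        groups[cluster_name].add(intended)
    return {name: sorted(g) for name, g in groups.items() if len(g) > 1}


if __name__ == "__main__":
    # Hypothetical setup where both clusters were left with the same name.
    hosts = {
        "localhost": ("single", "Test Cluster"),
        "node1": ("multi", "Test Cluster"),
        "node2": ("multi", "Test Cluster"),
    }
    print(shared_cluster_names(hosts))
```

An empty result means every intended cluster has its own distinct cluster_name.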

Resources