Cassandra clusters separation

I have one single-node cluster and just added a multi-node cluster (on 4 separate nodes; let's call them node1, node2, ..., node4). The single-node cluster uses localhost as its seed. The multi-node cluster uses the node1 and node2 hosts as seeds (SimpleSeedProvider).
To my surprise, when I started the multi-node cluster I saw the new nodes start talking to the single-node Cassandra and download data from it.
How do I prevent the new cluster from talking to the existing one? Am I missing anything else?

They will "gossip" on the network and detect each other if they are not separated.
Did you make sure the cluster_name value in your cassandra.yaml file is not the same for both of your clusters? That's how they differentiate each other, as stated in the sample configuration file:
# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
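For example, a minimal sketch of the relevant cassandra.yaml settings for the two clusters (the cluster names below are made up for illustration; the host names come from the question):

# existing single-node cluster
cluster_name: 'LegacyCluster'
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "127.0.0.1"

# new multi-node cluster (node1..node4)
cluster_name: 'NewCluster'
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "node1,node2"

Note that a node which has already gossiped with (and pulled data from) the wrong cluster keeps that cluster name in its local system tables, so you will usually have to stop it and clear its data directories before restarting it with the corrected settings.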

Related

Warnings : Your replication factor 3 for keyspace books is higher than the number of nodes 1

When I tried:
CREATE KEYSPACE IF NOT EXISTS books WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': 3 };
I got:
Warnings :
Your replication factor 3 for keyspace books is higher than the number of nodes 1
I have installed Cassandra on my local Ubuntu 22.04 machine and don't remember specifying how many nodes it should have, if that's even possible. I'm just wondering: is it possible to have multiple nodes on a local machine? How should I check the number of Cassandra nodes, and how can I change it?
A "node" in Cassandra is an individual machine instance. If you're deploying in the cloud or on Kubernetes, increasing the number of nodes is a trivial thing. However, if you're just testing on your own machine, and you've installed Cassandra, then you have one node.
How should I check the number of Cassandra nodes?
You can check this by querying the local node's system.peers table. That table holds data on each of the other nodes in the cluster.
SELECT * FROM system.peers;
So, if you get two rows back, then you have a three node cluster. If nothing comes back, then you have a one node cluster.
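You can also check from a shell with nodetool, which prints one line per node in the cluster:
nodetool status
If that confirms you only have one node, the warning itself goes away once the keyspace's replication factor matches the node count, e.g. (a sketch reusing the keyspace from the question):
ALTER KEYSPACE books WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': 1 };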
Is it possible to have multiple nodes on a local machine?
Yes. It's not easy, though. Cassandra makes exclusive use of a few ports on a machine instance, so you'd need multiple Cassandra directories, and each would need its own offsets of ports 7000, 7001, 7199, and 9042.
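If you do go down that road, each extra instance needs its own copy of cassandra.yaml with its own directories and port offsets; a rough sketch for a second local instance (all paths and port values below are just illustrative):
# cassandra.yaml of the second local instance
data_file_directories:
    - /var/lib/cassandra2/data
commitlog_directory: /var/lib/cassandra2/commitlog
storage_port: 7100              # instead of 7000
ssl_storage_port: 7101          # instead of 7001
native_transport_port: 9142     # instead of 9042
# the JMX port (7199) is set in cassandra-env.sh, e.g. JMX_PORT="7299"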
The easier way is to use Minikube to run multiple, small Cassandra nodes inside a Kubernetes cluster. I've built a repo to help folks with doing this on Windows: https://github.com/aploetz/cassandra_minikube
There's also a companion YouTube video for it: https://www.youtube.com/watch?v=eMKXiItZ0Q4

DSE Analytics in Cluster Configuration

Previously we had a three-node cluster: two Cassandra nodes in one DC and one Spark-enabled node in a different DC.
Spark was running smoothly in that configuration.
Then we tried adding another Spark-enabled node to the Analytics DC. We configured GossipingPropertyFileSnitch and added seeds.
But now when we start the cluster, a Spark master is elected on each of the two nodes separately, so a Spark job still runs on a single node. What configuration are we missing to run Spark jobs across the cluster?
Most probably you didn't adjust the replication of the Analytics keyspaces, or didn't run a repair after adding the node. Please refer to the instructions in the official documentation.
Also, please check that you configured the same DC for both Analytics nodes, because the Spark master is elected per DC.
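A hedged sketch of what that adjustment usually looks like, assuming a DSE version that tracks the Spark Master in the dse_leases keyspace and a datacenter actually named Analytics (check your own keyspace and DC names with DESCRIBE KEYSPACES and nodetool status first):
ALTER KEYSPACE dse_leases WITH REPLICATION = { 'class': 'NetworkTopologyStrategy', 'Analytics': 2 };
and then, on each Analytics node:
nodetool repair dse_leases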

How to setup Spark with a multi node Cassandra cluster?

First of all, I am not using DSE Cassandra. I am building this on my own and using Microsoft Azure to host the servers.
I have a 2-node Cassandra cluster. I've managed to set up Spark on a single node, but I couldn't find any online resources about setting it up on a multi-node cluster.
This is not a duplicate of "how to setup spark Cassandra multi node cluster?"
To set it up on a single node, I've followed this tutorial "Setup Spark with Cassandra Connector".
You have two high-level tasks here:
set up Spark (single node or cluster);
set up Cassandra (single node or cluster).
These tasks are different and not related (unless we are talking about data locality).
How to set up Spark as a cluster is described here: Architecture overview.
Generally there are two deployment types: standalone, where you set up Spark on the hosts directly, or via a task scheduler (YARN, Mesos); choose based on your requirements.
As you built everything by yourself, I suppose you will use the standalone installation. The difference from a single node is network communication. By default Spark binds to localhost; in a cluster it more commonly uses FQDNs, so you should configure them in /etc/hosts and check hostname -f, or fall back to IPs.
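For example, a sketch of the /etc/hosts entries (the addresses and names are placeholders):
# /etc/hosts on every Spark/Cassandra host
10.0.0.11   spark-master.example.com   spark-master
10.0.0.12   spark-worker1.example.com  spark-worker1
10.0.0.13   spark-worker2.example.com  spark-worker2
hostname -f on each machine should then print that machine's FQDN.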
Take a look at this page, which lists all the ports needed for node communication. All of them should be open and reachable between the nodes.
Be aware that by default Spark uses TorrentBroadcastFactory with random ports.
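A rough sketch of a standalone setup across hosts, with the otherwise random ports pinned so they can be opened in a firewall (host names and port numbers are placeholders; the worker script is start-worker.sh in newer Spark releases and start-slave.sh in older ones):
# on the master host
$SPARK_HOME/sbin/start-master.sh
# on each worker host, pointing at the master
$SPARK_HOME/sbin/start-worker.sh spark://spark-master.example.com:7077
# conf/spark-defaults.conf, to pin the otherwise random driver/executor ports
spark.driver.port        7078
spark.blockManager.port  7079
spark.port.maxRetries    16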
For Cassandra, see these docs: 1, 2, and tutorials: 3, etc.
You will likely need 4. You could also run Cassandra inside Mesos using Docker containers.
P.S. If data locality is your case, you will have to come up with something of your own, because neither Mesos nor YARN handles scheduling Spark jobs for partitioned data closer to the Cassandra partitions.

DSE changing cassandra to use vnodes

We want to deploy a DSE cluster of 3 nodes, where each node is an Analytics node running Spark.
We want to use vnodes in Cassandra, because they give a much more even data distribution and make adding nodes easier. We deploy DSE on AWS, using one of the available AMI images.
Since DSE by default deploys the Cassandra cluster with single-token nodes, we have to manually change the cassandra.yaml file on all the nodes.
According to the DataStax documentation, I should (sketched below):
uncomment the num_tokens field (I left the default value of 256)
leave the initial_token field unassigned
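In cassandra.yaml that amounts to something like this on every node (only the two relevant fields shown):
num_tokens: 256
# initial_token:
Keep in mind that num_tokens only takes effect when a node first bootstraps; for a cluster that already holds data, the documented approach is to add a new vnode-enabled datacenter and migrate to it rather than flipping the setting in place.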
After that, when I run the nodetool status command, I see that my cluster still uses single-token mode.
According to this, I should restart the nodes in the cluster so that the changes take effect.
But after the nodes are restarted, whether through OpsCenter or the AWS console, I get errors, the nodes are unresponsive, and I cannot use nodetool on them; it fails with:
Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused'.
Is there something that I am doing wrong?
How do I enable vnodes on DSE when it is deployed using an AMI image?
Thank you

Cassandra Cluster Vs Nodes

I am a beginner in Cassandra and am trying to understand a few basic things.
1) Cassandra cluster: Does it mean a physical server? Is it possible to run multiple clusters on a single physical machine?
2) Cassandra nodes: By definition it looks like one cluster can have multiple nodes. Can we have multiple nodes on a single physical machine, or does one node mean one machine?
3) I have two physical machines, installed the Cassandra server on both, and configured them to sync with each other, so if I create a keyspace with NetworkTopologyStrategy I can see it on both servers. Does that mean I created two clusters or two nodes?
I need help with the above questions.
Let us use the JVM as the unit.
Cassandra node: a single JVM instance running Cassandra. It can run on a physical machine, a VM, or a Docker container.
Cassandra cluster: a group of one or more Cassandra nodes.
So if you have 2 physical machines you can always run more than 2 nodes, depending on the capacity of your machines. You can also run multiple clusters, e.g. you can create 6 VMs to prepare 6 nodes and group them into two clusters of 3 nodes each. This grouping is controlled by cassandra.yaml.
Does it mean that I created two clusters or two nodes?
No, it means you created two nodes and grouped them into one cluster.
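You can confirm this from either machine with nodetool, which reports what the running cluster itself thinks:
nodetool describecluster    # prints the cluster name both nodes share
nodetool status             # lists both nodes with their datacenter, rack and state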
