Cassandra seed values in a three node datacenter - cassandra

We are deploying our application to production this month, and our stack will include a 3-node, single-datacenter Cassandra 1.2 cluster. In anticipation of this, we have been getting our initial cassandra.yaml settings worked out. While doing this I ran into an interesting situation for which I haven't been able to find an answer.
This has to do with setting the -seeds parameter in each node's cassandra.yaml file. All of the reading I've done says it is best practice to:
Have at least 2 seeds per datacenter. This makes sense, so that if one of the seed nodes goes down, other nodes can still be seeded by the second seed.
These two seeds should be the same for all (in our case 3) nodes.
In the deployment I tested this on, I started out with all three nodes having a single seed: node 1's IP address. My intention was to change the seeds of all three nodes to the IP addresses of node 1 and node 2. First I did node 3 by the following steps (a rough command-level sketch follows the list):
Decommissioning the node.
Shutting down Cassandra.
Changing the -seeds value to ip_node1,ip_node2.
Starting up Cassandra.
Running nodetool status to ensure the node was added back to the cluster.
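For reference, a rough command-level sketch of those steps (the service name and config path are from a packaged install, and the IPs are the same placeholders as above):

    nodetool decommission            # remove the node from the ring first
    sudo service cassandra stop
    # edit /etc/cassandra/cassandra.yaml so the seed_provider entry reads:
    #     - seeds: "ip_node1,ip_node2"
    sudo service cassandra start
    nodetool status                  # confirm the node shows up in the ring again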
Next I did node 2, following the exact same steps I used for node 3. But something unexpected happened. When I restarted Cassandra on node 2, it did not join the existing ring. Instead it started its own single-node ring. It seems pretty obvious that of the two seed values I passed it, it used its own IP address and thus believed it was the first node in a new ring.
I was surprised Cassandra didn't use the other seed value I passed it (node 1's). The only way I could get it to join the existing datacenter was to set its seeds to one or both of the other nodes in the cluster.
An obvious workaround for this is to configure each of my three nodes' seeds value to the IP addresses of the other two nodes in the cluster. But since several sources have suggested this isn't a "Best Practice", I thought I'd ask how this should be handled. So my questions are:
Is it normal for Cassandra to always use its own IP address as a seed if it is in the seed list?
Is configuring the cluster the way I've suggested, which goes against best practice, a huge issue?

This might not be the solution to your question, but did you compare all your cassandra.yaml files?
They should all be the same, apart from things like listen_address.
Is it possible you might have had stray whitespace or a typo in the cluster name as well?
I just thought I'd mention it as something good to check.
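For example, a quick way to spot differences from one box (the hostnames and config path are placeholders for your environment):

    # cluster_name and seeds must match exactly on every node;
    # listen_address/rpc_address should each be the node's own address
    for host in node1 node2 node3; do
      echo "=== $host ==="
      ssh "$host" 'grep -E "cluster_name|seeds|listen_address|rpc_address|endpoint_snitch" /etc/cassandra/cassandra.yaml'
    done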


Can I upgrade a Cassandra cluster swapping in new nodes running the updated version?

I am relatively new to Cassandra... both as a User and as an Operator. Not what I was hired for, but it's now on my plate. If there's an obvious answer or detail I'm missing, I'll be more than happy to provide it... just let me know!
I am unable to find any recent or concrete documentation that explicitly spells out how tolerant Cassandra nodes will be when a node with a higher Cassandra version is introduced to an existing cluster.
Hypothetically, let's say I have 4 nodes in a cluster running 3.0.16 and I want to upgrade the cluster to 3.0.24 (the latest version as of posting, 2021-04-19). For reasons that are not important here, running an 'in-place' upgrade on each existing node is not possible. That is: I cannot simply stop Cassandra on the existing nodes and do a nodetool drain; service cassandra stop; apt upgrade cassandra; service cassandra start.
I've looked at the change log between 3.0.17 and 3.0.24 (inclusive) and don't see anything that looks like a major breaking change w/r/t the transport protocol.
So my question is: Can I introduce new nodes (running 3.0.24) to the c* cluster (comprised of 3.0.16 nodes) and then run nodetool decommission on each of the 3.0.16 nodes to perform a "one for one" replacement to upgrade the cluster?
Do I risk any data integrity issues with this procedure? Is there a specific reason why the procedure outlined above wouldn't work? What about if the number of tokens each node is responsible for is increased with the new nodes? E.g.: the 3.0.16 nodes equally split the keyspace over 128 tokens, but the new 3.0.24 nodes will split everything across 256 tokens.
EDIT: After some back and forth on the #cassandra channel on the Apache Slack, it appears as though there's no issue with the procedure. There were some other comorbid issues caused by other bits of automation that did threaten the data integrity of the cluster, however. In short, each new node was adding itself to the list of seed nodes as well. This can be seen in the logs: This node will not auto bootstrap because it is configured to be a seed node.
Each new node failed to bootstrap, but did not fail to take new writes.
EDIT 2: I am not in a k8s environment; this is 'basic' EC2. Likewise, the volume of data / node size is quite small, ranging from tens of megabytes to a few hundred gigs in production. In all cases, the cluster is fewer than 10 nodes. The case I outlined above was for a test/dev cluster which is normally 2 nodes in each of two distinct racks/AZs, for a total of 4 nodes in the cluster.
Running bootstrap & decommission will take quite a long time, especially if you have a lot of data - you will stream all data twice, and this will increase the load on the cluster. The simpler solution would be to replace the old nodes by copying their data onto new nodes that have the same configuration as the old nodes, but with a different IP and with 3.0.24 (don't start that node!). Step-by-step instructions are in this answer; when it's done correctly you will have minimal downtime and won't need to wait for bootstrap and decommission.
Another possibility, if you can't stop the running nodes, is to add all new nodes as a new datacenter, adjust the replication factor to include it, use nodetool rebuild to force copying of the data to the new DC, switch the application to the new datacenter, and then decommission the whole old datacenter without streaming the data. In this scenario you will stream the data only once. This approach also plays better if the new nodes will have a different num_tokens value, since it's not recommended to have different num_tokens on nodes of the same DC.
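A rough sketch of that new-datacenter approach, assuming a keyspace called my_ks and datacenters named dc_old and dc_new (all placeholders), might look like this:

    # 1. Add the new DC to the keyspace's replication settings
    cqlsh -e "ALTER KEYSPACE my_ks WITH replication =
      {'class': 'NetworkTopologyStrategy', 'dc_old': 3, 'dc_new': 3};"

    # 2. On every node in dc_new (started with auto_bootstrap: false),
    #    stream the existing data from the old DC
    nodetool rebuild -- dc_old

    # 3. After switching the application's contact points to dc_new,
    #    drop dc_old from replication and decommission its nodes
    #    (nothing streams, since dc_old no longer owns any replicas)
    cqlsh -e "ALTER KEYSPACE my_ks WITH replication =
      {'class': 'NetworkTopologyStrategy', 'dc_new': 3};"
    nodetool decommission    # run on each node in dc_old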
P.S. Usually it's not recommended to make cluster topology changes when you have nodes of different versions, but it may be OK for 3.0.16 -> 3.0.24.
To echo Alex's answer, 3.0.16 and 3.0.24 still use the same SSTable file format, so the complexity of the upgrade decreases dramatically. They'll still be able to stream data between the different versions, so your idea should work. If you're in a K8s-like environment, it might just be easier to redeploy with the new version and attach the old volumes to the replacement instances.
"What about if the number of tokens each node was responsible for was increased with the new nodes? E.G.: 0.16 nodes equally split the keyspace over 128 tokens but the new nodes 0.24 will split everything across 256 tokens."
A couple of points jump out at me about this one.
First of all, it is widely recognized by the community that the default num_tokens value of 256 is waaaaaay too high. Even 128 is too high. I would recommend something along the lines of 12 to 24 (we use 16).
I would definitely not increase it.
Secondly, changing num_tokens requires a data reload. The reason is that the token ranges change, and thus each node's responsibility for specific data changes. I have changed this before by standing up a new, logical datacenter and then switching over to it. But I would recommend not changing it if at all possible.
"In short, each new node was adding ITSSELF to list list of seed nodes as well."
So, while that's not recommended (every node a seed node), it's not a show-stopper. You can certainly run a nodetool repair/rebuild afterward to stream data to them. But yes, if you can get to the bottom of why each node is adding itself to the seed list, that would be ideal.
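If you do end up in that state, a minimal recovery sketch, run on each affected node after removing its own IP from the seed list and restarting it:

    nodetool repair    # stream back the data missed because bootstrap was skipped
    # or, to pull everything from another DC: nodetool rebuild -- <source-dc-name>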

Cassandra Query for a Specific Node

I am using a Cassandra 1.2.15 cluster with 4 nodes and a keyspace with a replication factor of 2 using SimpleStrategy. I am using Murmur3Partitioner. I have used the default configurations that are available in the yaml file. The first node is the seed node; the other 3 nodes point to the first node as their seed.
The first node's yaml configuration is:
initial_token: left empty
num_tokens: 256
auto_bootstrap: false
The other 3 nodes' yaml configuration is:
initial_token: left empty
num_tokens: 256
auto_bootstrap: true
I have three questions; my main question is Question 1.
Question 1:
I need to query a specific node in the cluster. I.e., in a four-node cluster, I need to make a query to select all the rows in a column family that are stored on node 2 alone. Is it possible? If yes, how do I proceed?
Question 2:
Is my yaml configuration correct for the above approach?
Question 3:
Will this configuration cause any trouble in the future if I add two more nodes to the cluster?
Q1 I need to query a specific node in the cluster. I.e., in a four-node cluster, I need to make a query to select all the rows in a column family that are stored on node 2 alone. Is it possible? If yes, how do I proceed?
Nope, not possible. What you can do is query a specific datacenter using the LOCAL_QUORUM or EACH_QUORUM consistency levels. Or you can connect to a specific node and query the system keyspace, which is specific to each node (by specifying that node's address in either cqlsh or your driver). There are some cases where this is useful, but it's not what you're after.
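For example, to see what an individual node knows about itself, you can point cqlsh at that node directly (the IP is a placeholder; on 1.2, cqlsh connects to the Thrift port, 9160 by default):

    cqlsh 10.0.0.2
    cqlsh> SELECT cluster_name, data_center, rack FROM system.local;
    cqlsh> SELECT peer, tokens FROM system.peers;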
Q2 Is my yaml configuration correct for the above approach?
In 1.2 I think it might be a better idea to populate the tokens on your own for your initial nodes rather than leaving that to C*.
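If you do assign tokens manually (i.e. set initial_token on each node instead of relying on num_tokens), evenly spaced Murmur3Partitioner tokens for a 4-node ring can be computed like this - just a sketch, using the token range -2^63 to 2^63 - 1:

    for i in 0 1 2 3; do
      echo "node$((i + 1)) initial_token: $(echo "$i * 2^62 - 2^63" | bc)"
    done
    # node1 initial_token: -9223372036854775808
    # node2 initial_token: -4611686018427387904
    # node3 initial_token: 0
    # node4 initial_token: 4611686018427387904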
As for auto_bootstrap, false is the right choice for a fresh cluster node:
This setting has been removed from default configuration. It makes new (non-seed)
nodes automatically migrate the right data to themselves. When initializing a
fresh cluster with no data, add auto_bootstrap: false.
Q3 Will this configuration cause any trouble in the future if I add two more nodes to the cluster?
I'd advise you to move away from SimpleStrategy, simply because it complicates the process of expanding to multiple data centres. Another thing to remember is to enable auto-bootstrap for your new nodes, and it should work quite nicely with vnodes.
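For the first point, switching the keyspace to NetworkTopologyStrategy in cqlsh could look like the following (the keyspace name my_ks and datacenter name datacenter1 are placeholders), followed by a nodetool repair my_ks on each node so the replicas settle:

    ALTER KEYSPACE my_ks
      WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 2};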

Is it ok to set all cassandra nodes as seeds?

I'm interested in speeding up the process of bootstrapping a cluster and adding/removing nodes (granted, in the case of node removal, most time will be spent draining the node). I saw in the source code that nodes that are seeds are not bootstrapped, and hence do not sleep for 30 seconds while waiting for gossip to stabilize. Thus, if all nodes are declared to be seeds, the process of creating a cluster will run 30 seconds faster. My question is: is this OK, and what are the downsides? Is there a hidden requirement in Cassandra that we have at least one non-seed node to perform a bootstrap (as suggested in the answer to the following question)? I know I can shorten RING_DELAY by modifying /etc/cassandra/cassandra-env.sh, but if simply setting all nodes to be seeds would be better or faster in some way, that might be preferable. (Intuitively, there must be a downside to setting all nodes to be seeds, since it appears to strictly improve startup time.)
Good question. Making all nodes seeds is not recommended. You want new nodes, and nodes that come up after going down, to automatically migrate the right data to themselves; bootstrapping does that. When initializing a fresh cluster without data, turn off bootstrapping. For data consistency, bootstrapping needs to be on at other times. A new start-up option, -Dcassandra.auto_bootstrap=false, was added in Cassandra 2.1: you start Cassandra with the option to put auto_bootstrap=false into effect temporarily, until the node goes down. When the node comes back up, the default auto_bootstrap=true is back in effect. Folks are less likely to go on indefinitely without bootstrapping after creating a cluster--no need to go back and forth configuring the yaml on each node.
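For example (a minimal sketch; paths differ between packaged and tarball installs), the property can be passed at start-up or added to the environment file:

    # tarball install: pass the system property directly at start-up
    bin/cassandra -Dcassandra.auto_bootstrap=false

    # packaged install: append to /etc/cassandra/cassandra-env.sh before starting
    JVM_OPTS="$JVM_OPTS -Dcassandra.auto_bootstrap=false"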
In multiple data-center clusters, the seed list should include at least one node from each data center. To prevent partitions in gossip communications, use the same list of seed nodes in all nodes in a cluster. This is critical the first time a node starts up.
These recommendations are mentioned on several different pages of 2.1 Cassandra docs: http://www.datastax.com/documentation/cassandra/2.1/cassandra/gettingStartedCassandraIntro.html.
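Concretely, the seed_provider fragment of cassandra.yaml should be identical on every node, for example (the IPs are placeholders):

    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "10.0.0.1,10.0.0.2"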

Consequences of all Cassandra Nodes being Seeds?

Is there any reason why it would be bad to have all Cassandra nodes be in the seed nodes list?
We're working on automated deployments of Cassandra, and so can easily maintain a list of every node that is supposed to be in the cluster and distribute this as a list of seed nodes to all existing nodes, and all new nodes on startup.
All the documentation I can find suggests having a minimum number of seeds, but doesn't clarify what would happen if all nodes were seeds. There is some mention of seeds being preferred in the gossip protocol, but it is not clear what the consequence would be if all nodes were seeds.
There is no reason, as far as I am aware, that it is bad to have all nodes as seeds in your list. I'll post some doc links below to give you some background reading, but to summarise: the seed nodes are used primarily for bootstrapping. Once a node is running, it maintains a list of nodes it has established gossip with for subsequent startups.
The only disadvantage of having too many is that the procedure for replacing nodes that are seed nodes is slightly different:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_replace_seed_node.html
Further background reading (note: some of the older docs, although superseded, sometimes contain lengthier explanations):
http://www.datastax.com/docs/1.0/cluster_architecture/gossip
http://www.datastax.com/documentation/cassandra/1.2/cassandra/initialize/initializeMultipleDS.html

One Cassandra node fails to join the Cassandra cluster

I'm configuring a three-node Cassandra cluster located on three different machines. The machines can ping one another, and ssh works too.
I have set up the Cassandra cluster on these three machines. Say they are A, B, and C, where A is the seed. Here, C joins the seed (A) successfully, and a joining log gets printed. When I analyze the cluster via A, I can see that C has joined and has 66.7% ownership; A has 33.3% ownership. (I have divided the tokens equally.)
But node B didn't join the cluster. There are no errors printed. The configurations of B and C are similar except for listen_address and rpc_address. I verified the config between these two, and they are similar.
This is probably a network issue, but I'm not sure whether that's the case. No issue gets printed. Any suggestions on things I can try here? This seems pretty strange. Maybe this is due to some port issue?
What Cassandra version are you on?
Try shutting down every node and bringing them up one by one. Cassandra prior to 1.1.6 (I think it was that point version) had a problem where nodes sometimes couldn't re-join the ring.
Secondly, make sure every node is configured with the same cluster name and with the same set of seed nodes.
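A few quick things worth checking from node B (replace <ip_of_A> with node A's address; the ports shown are the 1.x defaults):

    nc -zv <ip_of_A> 7000     # inter-node gossip/storage port
    nc -zv <ip_of_A> 9160     # Thrift client port
    grep '^cluster_name' /etc/cassandra/cassandra.yaml    # must be identical on A, B and C
    nodetool ring             # see which nodes B actually knows about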
