I'm configuring a three node Cassandra cluster which are located in three different machines. I can ping to one another, and ssh too.
I have setup the cassandra cluster in these three machines. Say they are A, B, C where A is the seed. Here, C joins to the seed (A) successfully, and a joining log get printed. When I analyze the cluster via A, I can see that C has joined, and has 66.7% ownership. 'A' has 33.3% ownership. (I have equally divided the tokens.)
But the node B didn't join the cluster. There is no errors printed. The configurations of B, and C are similar except for the listen_address , and rpc_address. I verified the config between these two, and they are similar.
This is probably an issue with network, but I'm not sure whether that's the case. There is no issue gets printed. Any suggestions on things I can try out here? This seems pretty strange. May be this is due to some port issue?
What Cassandra version are you on?
Try shutting down every node and bringing them up one by one. Cassandra prior to 1.1.6 (I think it was that point version) had a problem where nodes sometimes couldn't re-join the ring.
Secondly, make sure every node is configured with the same cluster name and with the same set of seed nodes.
Related
I am relatively new to Cassandra... both as a User and as an Operator. Not what I was hired for, but it's now on my plate. If there's an obvious answer or detail I'm missing, I'll be more than happy to provide it... just let me know!
I am unable to find any recent or concrete documentation that explicitly spells out how tolerant Cassandra nodes will be when a node with a higher Cassandra version is introduced to an existing cluster.
Hypothetically, let's say I have 4 nodes in a cluster running 3.0.16 and I wanted to upgrade the cluster to 3.0.24 (the latest version as of posting; 2021-04-19). For reasons that are not important here, running an 'in-place' upgrade on each existing node is not possible. That is: I can not simply stop Cassandra on the existing nodes and then do an nodetool drain; service cassandra stop; apt upgrade cassandra; service cassandra start.
I've looked at the change log between 3.0.17 and 3.0.24 (inclusive) and don't see anything that looks like a major breaking change w/r/t the transport protocol.
So my question is: Can I introduce new nodes (running 3.0.24) to the c* cluster (comprised of 3.0.16 nodes) and then run nodetool decommission on each of the 3.0.16 nodes to perform a "one for one" replacement to upgrade the cluster?
Do i risk any data integrity issues with this procedure? Is there a specific reason why the procedure outlined above wouldn't work? What about if the number of tokens each node was responsible for was increased with the new nodes? E.G.: 0.16 nodes equally split the keyspace over 128 tokens but the new nodes 0.24 will split everything across 256 tokens.
EDIT: After some back/forth on the #cassandra channel on the apache slack, it appears as though there's no issue w/ the procedure. There were some other comorbid issues caused by other bits of automation that did threaten the data-integrity of the cluster, however. In short, each new node was adding ITSSELF to list list of seed nodes as well. This can be seen in the logs: This node will not auto bootstrap because it is configured to be a seed node.
Each new node failed to bootstrap, but did not fail to take new writes.
EDIT2: I am not on a k8s environment; this is 'basic' EC2. Likewise, the volume of data / node size is quite small; ranging from tens of megabytes to a few hundred gigs in production. In all cases, the cluster is fewer than 10 nodes. The case I outlined above was for a test/dev cluster which is normally 2 nodes in two distinct rack/AZs for a total of 4 nodes in the cluster.
Running bootstrap & decommission will take quite a long time, especially if you have a lot of data - you will stream all data twice, and this will increase load onto cluster. The simpler solution would be to replace old nodes by copying their data onto new nodes that have the same configuration as old nodes, but with different IP and with 3.0.24 (don't start that node!). Step-by-step instructions are in this answer, when it's done correctly you will have minimal downtime, and won't need to wait for bootstrap decommission.
Another possibility if you can't stop running node is to add all new nodes as a new datacenter, adjust replication factor to add it, use nodetool rebuild to force copying of the data to new DC, switch application to new data center, and then decommission the whole data center without streaming the data. In this scenario you will stream data only once. Also, it will play better if new nodes will have different number of num_tokens - it's not recommended to have different num_tokens on the nodes of the same DC.
P.S. usually it's not recommended to do changes in cluster topology when you have nodes of different versions, but maybe it could be ok for 3.0.16 -> 3.0.24.
To echo Alex's answer, 3.0.16 and 3.0.24 still use the same SSTable file format, so the complexity of the upgrade decreases dramatically. They'll still be able to stream data between the different versions, so your idea should work. If you're in a K8s-like environment, it might just be easier to redeploy with the new version and attach the old volumes to the replacement instances.
"What about if the number of tokens each node was responsible for was increased with the new nodes? E.G.: 0.16 nodes equally split the keyspace over 128 tokens but the new nodes 0.24 will split everything across 256 tokens."
A couple of points jump out at me about this one.
First of all, it is widely recognized by the community that the default num_tokens value of 256 is waaaaaay too high. Even 128 is too high. I would recommend something along the lines of 12 to 24 (we use 16).
I would definitely not increase it.
Secondly, changing num_tokens requires a data reload. The reason, is that the token ranges change, and thus each node's responsibility for specific data changes. I have changed this before by standing up a new, logical data center, and then switching over to it. But I would recommend not changing that if at all possible.
"In short, each new node was adding ITSSELF to list list of seed nodes as well."
So, while that's not recommended (every node a seed node), it's not a show-stopper. You can certainly run a nodetool repair/rebuild afterward to stream data to them. But yes, if you can get to the bottom of why each node is adding itself to the seed list, that would be ideal.
I have added new 4 nodes to existing 4 nodes cluster. Now some data are missing on cluster.
What can be the reason for it? what can I do for resolve it?
Data missing keyspace RF is 1 when I was adding to the cluster. It can be a issue?
Note: Once I added new nodes to cluster executed repair command to all nodes
You really shouldn't be running a RF of 1.
I imagine that if you added them all in a short timeframe with a low RF that the VNodes got shuffled from one node to another without settling. I'm suprised a full repair didn't do anything.
You might check the HDDs of the original nodes to see if the repair didn't delete the old data. If it's still there you may be able to remove the new nodes (temporarily) and then add each node back in 1 by 1 while repairing.
Edit: additionally probably use an odd number of nodes.
I understand the seed nodes play the role of a "first point of contact" in the gossip protocol which allows the nodes to discover and maintain the list of their peers.
The documentation recommends the following.
A) Start the seeds one at a time, then the remaining nodes in the cluster.
B) Define the same seeds on each node.
I am not sure why. Below is my best guess.
A) If all the nodes started at the same time, one of them might contact a seed which does not have the right information yet. This might prevent/delay the lists of peers convergence.
B) This is because the seeds have more up-to-date information than the other nodes. As such the lists of peers converge faster if all the nodes use the same seeds.
Are my guesses correct? Is there any other reason? Are there cases where the lists of peers cannot converge because the seeds are configured differently on each node, or the nodes are not started in the right order?
It's important to point out that seed nodes are only relevant when a node first joins the cluster, e.g. when adding a new node to manage capacity. The node will contact the seed node as the first point of contact to get the current node list and topology. After that, the list is maintained through gossip, even when a node goes down and comes back online, so once a node has joined the cluster, seeds are no longer relevant.
To me, those recommendations are to ensure a clean startup of a new cluster.
A) For obvious reasons, the seeds should be started before non-seed nodes. Starting seed nodes one at a time ensure that nodes which start after the first will be able to contact an existing node you start up and get the cluster info. Otherwise they may think they are the first node in the cluster and spend time doing initial setup -- the nodes should eventually converge to form the desired cluster, but it may take longer.
B) If you have different seeds on each node, then you may inadvertently setup a node startup order which has to be followed for successful cluster creation. e.g. if C and D has seeds A and B, but E has seeds C,D, then you will need to set your configuration to start C or D after A or B, and then start E after that, even though they may all be seed nodes.
This rule lets you set your configuration to: Start seeds (any order) then start non-seeds (any order).
We are deploying our application to production this month and our stack will include a 3 node, single datacenter Cassandra version 1.2 cluster. In anticipation of this, we have been getting our initial Cassandra.yaml settings worked out. While doing this I ran into a interesting situation for which I haven't been able to find an answer.
This has to do with setting the -seeds parameter in each of the nodes Cassandra.yaml files. All of the reading I've done say it is best practice to:
Have at least 2 seeds per datacenter. This makes sense so that one of the nodes can come down and other nodes can be seeded by the second seed.
These two seeds should be the same for all (in our case 3) nodes.
In the deployment I tested this on, I started out with all three nodes having a single seed, node 1's IP address. My intention was to change the seeds of all three nodes to the IP address of node1 and node2. First I did node 3 by:
decommissioning the node.
Shutting down Cassandra.
changing the -seeds value to ip_node1,ip_node2
starting up Cassandra.
running nodetool status to ensure the node was added back to the cluster.
Next I did node 2, following the exact same steps I did for node 3. But something unexpected happened. When I restarted Cassandra on node 2, it did not join the existing ring. Instead it started its own single node ring. It seems pretty obvious that of the two seed parameters I passed it, it used its own IP address and thus believed it was the first node in a new ring.
I was surprised Cassandra didn't select the seed argument of the other seed value I passed it (node 2's). The only way I could get it to join the existing datacenter was to set its seeds to one or both of the other nodes in the cluster.
An obvious work around to this is to configure each of my three nodes seeds value to the IP addresses of the other two nodes in the cluster. But since several sources have suggested this isn't a "Best Practice" I thought I'd ask how this should be handled. So my question is:
Is it normal for Cassandra to always use its own IP address as a seed if it is in the seed list?
Is configuring the cluster the way I've suggested, which goes against best practice a huge issue?
This might not be the solution to your question but did you compare all your cassandra.yaml files?
They should all be the same, apart from things like listen_address.
Is it possible you might have had a whitespace or typo in the cluster name also?
I just thought I'd mention it as something good to check.
I'm interested in speeding up the process of bootstrapping a cluster and adding/removing nodes (Granted, in the case of node removal, most time will be spend draining the node). I saw in the source code that nodes that are seeds are not bootstrapped, and hence do not sleep for 30 seconds while waiting for gossip to stabilize. Thus, if all nodes are declared to be seeds, the process of creating a cluster will run 30 seconds faster. My question is is this ok? and what are the downsides of this? Is there a hidden requirement in cassandra that we have at least one non-seed node to perform a bootstrap (as suggested in the answer to the following question)? I know I can shorten RING_DELAY by modifying /etc/cassandra/cassandra-env.sh, but if simply setting all nodes to be seeds would be better or faster in some way, that might be better. (Intuitively, there must be a downside to setting all nodes to be seeds since it appears to strictly improve startup time.)
Good question. Making all nodes seeds is not recommended. You want new nodes and nodes that come up after going down to automatically migrate the right data. Bootstrapping does that. When initializing a fresh cluster without data, turn off bootstrapping. For data consistency, bootstrapping needs to be on at other times. A new start-up option -Dcassandra.auto_bootstrap=false was added to Cassandra 2.1: You start Cassandra with the option to put auto_bootstrap=false into effect temporarily until the node goes down. When the node comes back up the default auto_bootstrap=true is back in effect. Folks are less likely to go on indefinitely without bootstrapping after creating a cluster--no need to go back and forth configuring the yaml on each node.
In multiple data-center clusters, the seed list should include at least one node from each data center. To prevent partitions in gossip communications, use the same list of seed nodes in all nodes in a cluster. This is critical the first time a node starts up.
These recommendations are mentioned on several different pages of 2.1 Cassandra docs: http://www.datastax.com/documentation/cassandra/2.1/cassandra/gettingStartedCassandraIntro.html.