Is there any reason why it would be bad to have all Cassandra nodes be in the seed nodes list?
We're working on automated deployments of Cassandra, and so can easily maintain a list of every node that is supposed to be in the cluster and distribute this as a list of seed nodes to all existing nodes, and all new nodes on startup.
All the documentation I can find suggests having a minimum number of seeds, but doesn't clarify what would happen if all nodes were seeds. There is some mention of seeds being preferred in the gossip protocol, but it is not clear what the consequence would be if all nodes were seeds.
As far as I am aware, there is no reason why it is bad to have all nodes in your seed list. I'll post some doc links below for background reading, but to summarise: seed nodes are used primarily for bootstrapping. Once a node is running, it maintains its own list of nodes it has established gossip with and uses that for subsequent startups.
The only disadvantage of having too many is that the procedure for replacing a node is slightly different if it is a seed node:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_replace_seed_node.html
Further background reading (note: some of the older docs, although superseded, sometimes contain lengthier explanations):
http://www.datastax.com/docs/1.0/cluster_architecture/gossip
http://www.datastax.com/documentation/cassandra/1.2/cassandra/initialize/initializeMultipleDS.html
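Since the question mentions distributing a generated seed list via automation, here is a minimal sketch of templating the same list into every node's cassandra.yaml. The IPs and output path are hypothetical; in real automation the inventory would come from your deployment tooling:

```shell
# Hypothetical node inventory; in real automation this would come from
# your deployment tooling.
NODES="10.0.1.11 10.0.1.12 10.0.1.13"

# Build the comma-separated seed list -- here every node is a seed.
SEEDS=$(echo "$NODES" | tr ' ' ',')

# Render the seed_provider section of cassandra.yaml (illustrative path).
cat > /tmp/cassandra-seeds.yaml <<EOF
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "$SEEDS"
EOF

grep 'seeds:' /tmp/cassandra-seeds.yaml
```

Distributing the identical fragment to all nodes keeps the seed configuration consistent, which is what the gossip docs below recommend.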
I am relatively new to Cassandra... both as a User and as an Operator. Not what I was hired for, but it's now on my plate. If there's an obvious answer or detail I'm missing, I'll be more than happy to provide it... just let me know!
I am unable to find any recent or concrete documentation that explicitly spells out how tolerant Cassandra nodes will be when a node with a higher Cassandra version is introduced to an existing cluster.
Hypothetically, let's say I have 4 nodes in a cluster running 3.0.16 and I want to upgrade the cluster to 3.0.24 (the latest version as of posting, 2021-04-19). For reasons that are not important here, running an in-place upgrade on each existing node is not possible. That is, I cannot simply run nodetool drain; service cassandra stop; apt upgrade cassandra; service cassandra start on each existing node.
I've looked at the change log between 3.0.17 and 3.0.24 (inclusive) and don't see anything that looks like a major breaking change w/r/t the transport protocol.
So my question is: Can I introduce new nodes (running 3.0.24) to the c* cluster (comprised of 3.0.16 nodes) and then run nodetool decommission on each of the 3.0.16 nodes to perform a "one for one" replacement to upgrade the cluster?
Do I risk any data-integrity issues with this procedure? Is there a specific reason why the procedure outlined above wouldn't work? What if the number of tokens each node is responsible for is increased on the new nodes? E.g. the 3.0.16 nodes split the keyspace equally across 128 tokens, but the new 3.0.24 nodes split everything across 256 tokens.
EDIT: After some back and forth in the #cassandra channel on the Apache Slack, it appears there's no issue with the procedure. There were, however, some other concurrent issues caused by other bits of automation that did threaten the data integrity of the cluster. In short, each new node was also adding itself to the list of seed nodes. This can be seen in the logs: This node will not auto bootstrap because it is configured to be a seed node.
Each new node failed to bootstrap, but did not fail to take new writes.
EDIT2: I am not in a k8s environment; this is 'basic' EC2. Likewise, the volume of data per node is quite small, ranging from tens of megabytes to a few hundred gigabytes in production. In all cases, the cluster is fewer than 10 nodes. The case I outlined above was for a test/dev cluster, which is normally 2 nodes in each of two distinct racks/AZs for a total of 4 nodes in the cluster.
Running bootstrap & decommission will take quite a long time, especially if you have a lot of data: you will stream all data twice, which will increase the load on the cluster. The simpler solution is to replace the old nodes by copying their data onto new nodes that have the same configuration as the old nodes, but a different IP address and 3.0.24 installed (don't start those nodes yet!). Step-by-step instructions are in this answer; done correctly, you will have minimal downtime and won't need to wait for bootstrap/decommission.
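The copy-based replacement could be sketched as a dry-run script; `run` only prints each command, and the IPs, paths, and ssh/rsync invocations are illustrative assumptions, not the exact steps of the linked answer:

```shell
# Dry-run sketch: `run` prints each command instead of executing it.
run() { echo "would run: $*" | tee -a /tmp/replace_plan.txt; }
: > /tmp/replace_plan.txt

OLD_NODE=10.0.0.5       # node being retired (hypothetical IP)
NEW_NODE=10.0.0.105     # replacement: same config, new IP, 3.0.24 installed
DATA_DIR=/var/lib/cassandra

# 1. Flush and stop the old node so its SSTables are complete on disk.
run ssh "$OLD_NODE" "nodetool drain && service cassandra stop"
# 2. Copy the data directory onto the (not yet started) replacement.
run rsync -a "$OLD_NODE:$DATA_DIR/" "$NEW_NODE:$DATA_DIR/"
# 3. Start the replacement; with the data already in place it takes over
#    the old node's token ranges without streaming.
run ssh "$NEW_NODE" "service cassandra start"
```

Follow the linked answer for the authoritative sequence; the point of the sketch is that the data moves once, over rsync, rather than twice over Cassandra streaming.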
Another possibility, if you can't stop running nodes, is to add all the new nodes as a new datacenter, adjust the replication factor to include it, use nodetool rebuild to force copying of the data to the new DC, switch the application to the new datacenter, and then decommission the whole old datacenter without streaming its data. In this scenario you stream the data only once. This approach also works better if the new nodes have a different num_tokens, since it's not recommended to mix num_tokens values among nodes of the same DC.
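The new-datacenter approach could be sketched as a dry-run script; `run` only prints each command, and the keyspace name (myks), DC names (dc1/dc2), IPs, and replication factors are all illustrative assumptions:

```shell
# Dry-run sketch: `run` prints each command instead of executing it.
run() { echo "would run: $*" | tee -a /tmp/dc_plan.txt; }
: > /tmp/dc_plan.txt

NEW_NODES="10.0.2.1 10.0.2.2"   # hypothetical nodes already up in dc2

# 1. Add the new DC to the keyspace's replication settings.
run cqlsh -e "ALTER KEYSPACE myks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3}"

# 2. Have each new node stream its data from the old DC.
for node in $NEW_NODES; do
  run ssh "$node" "nodetool rebuild -- dc1"
done

# 3. After switching the application to dc2, drop dc1 from replication,
#    then decommission its nodes without re-streaming.
run cqlsh -e "ALTER KEYSPACE myks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc2': 3}"
```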
P.S. It's usually not recommended to change cluster topology while you have nodes of different versions, but it may be OK for 3.0.16 -> 3.0.24.
To echo Alex's answer, 3.0.16 and 3.0.24 still use the same SSTable file format, so the complexity of the upgrade decreases dramatically. They'll still be able to stream data between the different versions, so your idea should work. If you're in a K8s-like environment, it might just be easier to redeploy with the new version and attach the old volumes to the replacement instances.
"What if the number of tokens each node is responsible for is increased on the new nodes? E.g. the 3.0.16 nodes split the keyspace equally across 128 tokens, but the new 3.0.24 nodes split everything across 256 tokens."
A couple of points jump out at me about this one.
First of all, it is widely recognized by the community that the default num_tokens value of 256 is way too high. Even 128 is too high. I would recommend something in the range of 12 to 24 (we use 16).
I would definitely not increase it.
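For reference, the setting lives in cassandra.yaml; a fragment with the kind of value suggested above (illustrative only, and note that num_tokens must be set before a node first joins the cluster):

```yaml
# Illustrative cassandra.yaml fragment -- pick a value appropriate for
# your cluster size and test before rolling it out.
num_tokens: 16
```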
Secondly, changing num_tokens requires a data reload. The reason is that the token ranges change, and thus each node's responsibility for specific data changes. I have changed this before by standing up a new logical datacenter and then switching over to it. But I would recommend not changing it if at all possible.
"In short, each new node was also adding itself to the list of seed nodes."
So while that's not recommended (making every node a seed node), it's not a show-stopper. You can certainly run a nodetool repair/rebuild afterward to stream data to them. But yes, if you can get to the bottom of why each node is adding itself to the seed list, that would be ideal.
I have a Cassandra (version 2.2.8) cluster of 6 nodes in total, 3 of which are seed nodes. One of the seed nodes recently went down, and I need to replace it. My cluster is set up in a way that it cannot survive the loss of more than one node. I read this documentation on replacing a dead seed node:
https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsReplaceNode.html
As per the documentation, I am scared to remove the dead seed node from the seed list and do a rolling restart. If for any reason a node doesn't start, I'll lose the data.
How should I approach this scenario? Is it OK not to remove the dead seed node from the seed list until the new node is fully up and running? I already have two working seed nodes present in the seed list. Please advise.
In short: yes, it is okay to wait before removing the seed node.
Explanation: The seed node configuration does two things:
1. It bootstraps new nodes. A new node reads the seed configuration to get a first contact point into the Cassandra cluster. After the node has joined the cluster, it saves information about all Cassandra nodes in its system.peers table, and for all future starts it uses that information to connect to the cluster, not the seed configuration.
2. Cassandra also uses seeds to improve gossip: seed nodes are more likely to receive gossip messages than normal nodes, which speeds up how quickly nodes receive updates about other nodes, such as status changes.
Losing a seed node in your case only impacts point 2. Since you have two more seed nodes, I don't see this as a big issue. I would still do a rolling restart of all nodes as soon as you have updated your seed configuration.
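A possible ordering for the replacement, sketched as a dry-run script that only prints each command. All IPs and paths are hypothetical; follow the linked DataStax procedure for the authoritative steps, in particular around how the replacement node's own seed list must be set before it starts:

```shell
# Dry-run sketch: `run` prints each command instead of executing it.
run() { echo "would run: $*" | tee -a /tmp/seed_replace.txt; }
: > /tmp/seed_replace.txt

DEAD_SEED=10.0.0.3   # hypothetical IP of the failed seed
NEW_NODE=10.0.0.9    # its replacement
LIVE_NODES="10.0.0.1 10.0.0.2"

# 1. Start the replacement with the replace-address flag so it takes over
#    the dead node's token ranges.
run ssh "$NEW_NODE" "service cassandra start  # with -Dcassandra.replace_address=$DEAD_SEED"

# 2. Only once it is fully up: swap the dead seed for the new node in
#    every node's seed list, restarting one node at a time.
for node in $LIVE_NODES "$NEW_NODE"; do
  run ssh "$node" "sed -i s/$DEAD_SEED/$NEW_NODE/ /etc/cassandra/cassandra.yaml && service cassandra restart"
done
```

The key point, matching the answer above, is that the dead seed can stay in the seed lists until the replacement is fully up; only then do the rolling restarts happen, one node at a time.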
I understand the seed nodes play the role of a "first point of contact" in the gossip protocol which allows the nodes to discover and maintain the list of their peers.
The documentation recommends the following.
A) Start the seeds one at a time, then the remaining nodes in the cluster.
B) Define the same seeds on each node.
I am not sure why. Below is my best guess.
A) If all the nodes started at the same time, one of them might contact a seed which does not yet have the right information. This might prevent or delay convergence of the peer lists.
B) This is because the seeds have more up-to-date information than the other nodes. As such, the peer lists converge faster if all the nodes use the same seeds.
Are my guesses correct? Is there any other reason? Are there cases where the lists of peers cannot converge because the seeds are configured differently on each node, or the nodes are not started in the right order?
It's important to point out that seed nodes are only relevant when a node first joins the cluster, e.g. when adding a new node to manage capacity. The node will contact the seed node as the first point of contact to get the current node list and topology. After that, the list is maintained through gossip, even when a node goes down and comes back online, so once a node has joined the cluster, seeds are no longer relevant.
To me, those recommendations are to ensure a clean startup of a new cluster.
A) For obvious reasons, the seeds should be started before the non-seed nodes. Starting the seed nodes one at a time ensures that every node started after the first can contact an existing node and get the cluster info. Otherwise, a node may think it is the first node in the cluster and spend time doing initial setup; the nodes should eventually converge to form the desired cluster, but it may take longer.
B) If you have different seeds on each node, you may inadvertently create a node startup order that has to be followed for successful cluster creation. E.g. if C and D have seeds A and B, but E has seeds C and D, then you need to start C or D after A or B, and then start E after that, even though they may all be seed nodes.
This rule lets you set your configuration to: Start seeds (any order) then start non-seeds (any order).
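That rule could be sketched as a dry-run startup script (IPs are hypothetical; `run` only prints the commands):

```shell
# Dry-run sketch: `run` prints each command instead of executing it.
run() { echo "would run: $*" | tee -a /tmp/start_order.txt; }
: > /tmp/start_order.txt

SEEDS="10.0.0.1 10.0.0.2"       # hypothetical seed IPs
NON_SEEDS="10.0.0.3 10.0.0.4"

# Seeds first, one at a time (wait for Up/Normal before the next)...
for node in $SEEDS; do
  run ssh "$node" "service cassandra start && nodetool status"
done
# ...then the non-seeds, in any order.
for node in $NON_SEEDS; do
  run ssh "$node" "service cassandra start"
done
```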
We are deploying our application to production this month, and our stack will include a 3-node, single-datacenter Cassandra 1.2 cluster. In anticipation of this, we have been working out our initial cassandra.yaml settings. While doing this, I ran into an interesting situation for which I haven't been able to find an answer.
This has to do with setting the -seeds parameter in each node's cassandra.yaml file. All of the reading I've done says it is best practice to:
Have at least 2 seeds per datacenter. This makes sense: if one seed node goes down, other nodes can still be seeded by the second seed.
These two seeds should be the same for all (in our case 3) nodes.
In the deployment I tested this on, I started out with all three nodes having a single seed: node 1's IP address. My intention was to change the seeds of all three nodes to the IP addresses of node 1 and node 2. First I did node 3 by:
Decommissioning the node.
Shutting down Cassandra.
Changing the -seeds value to ip_node1,ip_node2.
Starting up Cassandra.
Running nodetool status to ensure the node was added back to the cluster.
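The seed-change step above can be sketched against a scratch copy of cassandra.yaml (illustrative path; the real file lives under /etc/cassandra on most packages):

```shell
# Scratch copy standing in for node 3's /etc/cassandra/cassandra.yaml.
CONF=/tmp/node3-cassandra.yaml
cat > "$CONF" <<'EOF'
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "ip_node1"
EOF

# The seed-change step: replace the single seed with both seeds.
sed -i 's/seeds: "ip_node1"/seeds: "ip_node1,ip_node2"/' "$CONF"

grep 'seeds:' "$CONF"   # now shows seeds: "ip_node1,ip_node2"
```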
Next I did node 2, following the exact same steps I used for node 3. But something unexpected happened: when I restarted Cassandra on node 2, it did not join the existing ring. Instead it started its own single-node ring. It seems pretty obvious that of the two seed values I passed it, it used its own IP address and thus believed it was the first node in a new ring.
I was surprised Cassandra didn't select the other seed value I passed it (node 1's). The only way I could get it to join the existing datacenter was to set its seeds to one or both of the other nodes in the cluster.
An obvious workaround is to configure each of my three nodes' seeds to be the IP addresses of the other two nodes in the cluster. But since several sources have suggested this isn't a best practice, I thought I'd ask how this should be handled. So my questions are:
Is it normal for Cassandra to always use its own IP address as a seed if it is in the seed list?
Is configuring the cluster the way I've suggested, against best practice, a huge issue?
This might not be the solution to your question but did you compare all your cassandra.yaml files?
They should all be the same, apart from things like listen_address.
Is it possible you might have had a whitespace or typo in the cluster name also?
I just thought I'd mention it as something good to check.
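A quick way to make that check is to diff the configs with the legitimately differing fields filtered out. A sketch with two illustrative files, where a stray trailing space in cluster_name shows up immediately:

```shell
# Two illustrative per-node configs that should match except for fields
# like listen_address; node2 has a sneaky trailing space in cluster_name.
cat > /tmp/node1.yaml <<'EOF'
cluster_name: 'Prod Cluster'
listen_address: 10.0.0.1
EOF
cat > /tmp/node2.yaml <<'EOF'
cluster_name: 'Prod Cluster '
listen_address: 10.0.0.2
EOF

# Filter out the legitimately differing fields, then diff what remains.
grep -v '^listen_address' /tmp/node1.yaml > /tmp/node1.common
grep -v '^listen_address' /tmp/node2.yaml > /tmp/node2.common
diff /tmp/node1.common /tmp/node2.common || echo "configs differ"
```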
I'm interested in speeding up the process of bootstrapping a cluster and adding/removing nodes (granted, in the case of node removal, most time will be spent draining the node). I saw in the source code that seed nodes are not bootstrapped, and hence do not sleep for 30 seconds while waiting for gossip to stabilize. Thus, if all nodes are declared to be seeds, creating a cluster runs 30 seconds faster. My question is: is this OK, and what are the downsides? Is there a hidden requirement in Cassandra that at least one non-seed node perform a bootstrap (as suggested in the answer to the following question)? I know I can shorten RING_DELAY by modifying /etc/cassandra/cassandra-env.sh, but if simply setting all nodes to be seeds would be better or faster in some way, that might be preferable. (Intuitively, there must be a downside to making all nodes seeds, since it appears to strictly improve startup time.)
Good question. Making all nodes seeds is not recommended. You want new nodes, and nodes that come back up after going down, to automatically migrate the right data, and bootstrapping does that. When initializing a fresh cluster without data, turn bootstrapping off; for data consistency, it needs to be on at all other times. A new start-up option, -Dcassandra.auto_bootstrap=false, was added in Cassandra 2.1: starting Cassandra with this option puts auto_bootstrap=false into effect temporarily, until the node goes down. When the node comes back up, the default auto_bootstrap=true is back in effect. This makes people less likely to run indefinitely without bootstrapping after creating a cluster, since there's no need to go back and forth editing the yaml on each node.
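A dry-run sketch of that start-up option (the flag is the one named in the answer above; invoking the bin/cassandra launcher directly is an assumption about your packaging):

```shell
# Dry-run sketch: `run` prints the command instead of executing it.
run() { echo "would run: $*" | tee -a /tmp/boot_flag.txt; }
: > /tmp/boot_flag.txt

# One-off start of a node in a fresh, empty cluster with bootstrapping
# disabled; the default auto_bootstrap=true returns on the next restart.
run cassandra -Dcassandra.auto_bootstrap=false
```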
In multiple data-center clusters, the seed list should include at least one node from each data center. To prevent partitions in gossip communications, use the same list of seed nodes in all nodes in a cluster. This is critical the first time a node starts up.
These recommendations are mentioned on several different pages of 2.1 Cassandra docs: http://www.datastax.com/documentation/cassandra/2.1/cassandra/gettingStartedCassandraIntro.html.