Bootstrap many new Cassandra nodes into a cluster with no errors - cassandra

I have a cluster of about 100 nodes and it keeps growing. I need to add 10-50 nodes on request. As far as I know, Cassandra defaults to cassandra.consistent.rangemovement=true, which means multiple nodes cannot bootstrap at the same time.
When I add many nodes using Terraform and a mostly default configuration (managed by Puppet), at least 2-3 of them end up in the UJ state at the same time, and eventually only one bootstraps successfully. Earlier I used a random delay before starting cassandra.service, but that stops working when adding 10+ nodes.
I'm trying to figure out how to implement a kind of "lock" for bootstrapping.
I have Consul and could take a lock for bootstrap in its KV store, for instance by acquiring it via the systemd ExecStartPre feature, but I can't work out how to release it after the bootstrap finishes.
I'm looking for any solution to this.
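A minimal sketch of that Consul KV lock idea, assuming Consul's HTTP API on localhost:8500 and jq available (the key name, paths and UN check are illustrative; instead of ExecStartPre it wraps the whole start so the lock can be released once the node is up):

    #!/usr/bin/env bash
    # bootstrap-with-lock.sh - hypothetical wrapper, one instance per new node
    set -euo pipefail
    CONSUL=http://localhost:8500
    KEY=cassandra/bootstrap-lock
    MY_IP=$(hostname -i | awk '{print $1}')

    # Create a Consul session and spin until this node holds the lock
    SESSION=$(curl -s -X PUT "$CONSUL/v1/session/create" -d '{"Name":"cassandra-bootstrap"}' | jq -r .ID)
    until [ "$(curl -s -X PUT "$CONSUL/v1/kv/$KEY?acquire=$SESSION")" = "true" ]; do
      sleep 15
    done

    systemctl start cassandra

    # Wait until this node reports itself as UN (Up/Normal) in the ring
    until nodetool status 2>/dev/null | grep -Eq "^UN +$MY_IP"; do
      sleep 30
    done

    # Release the lock so the next node may bootstrap, then drop the session
    curl -s -X PUT "$CONSUL/v1/kv/$KEY?release=$SESSION" > /dev/null
    curl -s -X PUT "$CONSUL/v1/session/destroy/$SESSION" > /dev/null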

I've done something similar using Rundeck before. Basically, we had Rundeck kick off a bash script, passing parameters about the deployment of our nodes as well as how many to add.
What we did was parse the output of nodetool status. We counted the total number of nodes as well as the number of UN indicators. If those two numbers didn't match, we'd sleep 30s and try again.
Once those numbers matched, we knew it was safe to add another node. The whole operation could take a while to add all the nodes, but it worked.
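A rough sketch of that wait loop (the grep patterns assume the standard nodetool status layout; adjust them for your version):

    #!/usr/bin/env bash
    # Block until every node in the ring reports UN (Up/Normal)
    while true; do
      ring=$(nodetool status)
      total=$(echo "$ring" | grep -c -E '^[UD][NLJM] ')   # every node, whatever its state
      up_normal=$(echo "$ring" | grep -c -E '^UN ')       # only the Up/Normal ones
      if [ "$total" -gt 0 ] && [ "$total" -eq "$up_normal" ]; then
        echo "All $total nodes are UN; safe to add the next node."
        break
      fi
      echo "$up_normal of $total nodes are UN; sleeping 30s..."
      sleep 30
    done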

Can I upgrade a Cassandra cluster by swapping in new nodes running the updated version?

I am relatively new to Cassandra... both as a User and as an Operator. Not what I was hired for, but it's now on my plate. If there's an obvious answer or detail I'm missing, I'll be more than happy to provide it... just let me know!
I am unable to find any recent or concrete documentation that explicitly spells out how tolerant Cassandra nodes will be when a node with a higher Cassandra version is introduced to an existing cluster.
Hypothetically, let's say I have 4 nodes in a cluster running 3.0.16 and I want to upgrade the cluster to 3.0.24 (the latest version as of posting; 2021-04-19). For reasons that are not important here, running an 'in-place' upgrade on each existing node is not possible. That is: I cannot simply stop Cassandra on the existing nodes and do a nodetool drain; service cassandra stop; apt upgrade cassandra; service cassandra start.
I've looked at the change log between 3.0.17 and 3.0.24 (inclusive) and don't see anything that looks like a major breaking change w/r/t the transport protocol.
So my question is: Can I introduce new nodes (running 3.0.24) to the c* cluster (comprised of 3.0.16 nodes) and then run nodetool decommission on each of the 3.0.16 nodes to perform a "one for one" replacement to upgrade the cluster?
Do I risk any data integrity issues with this procedure? Is there a specific reason why the procedure outlined above wouldn't work? What about if the number of tokens each node is responsible for is increased on the new nodes? E.g.: the 3.0.16 nodes equally split the keyspace with 128 tokens each, but the new 3.0.24 nodes will split everything across 256 tokens each.
EDIT: After some back and forth on the #cassandra channel on the Apache Slack, it appears as though there's no issue with the procedure. There were some other comorbid issues caused by other bits of automation that did threaten the data integrity of the cluster, however. In short, each new node was adding ITSELF to its own list of seed nodes as well. This can be seen in the logs: This node will not auto bootstrap because it is configured to be a seed node.
Each new node failed to bootstrap, but did not fail to take new writes.
EDIT2: I am not in a k8s environment; this is 'basic' EC2. Likewise, the volume of data / node size is quite small, ranging from tens of megabytes to a few hundred gigs in production. In all cases, the cluster is fewer than 10 nodes. The case I outlined above was for a test/dev cluster, which is normally 2 nodes in each of two distinct racks/AZs for a total of 4 nodes in the cluster.
Running bootstrap & decommission will take quite a long time, especially if you have a lot of data - you will stream all the data twice, and this will increase the load on the cluster. The simpler solution would be to replace the old nodes by copying their data onto new nodes that have the same configuration as the old nodes, but with a different IP and with 3.0.24 (don't start that node!). Step-by-step instructions are in this answer; when it's done correctly you will have minimal downtime and won't need to wait for bootstrap or decommission.
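Not the linked instructions themselves, but a hedged outline of that copy-based replacement (package-default paths and service commands; test on a non-production cluster first):

    # On the OLD 3.0.16 node: flush memtables and stop cleanly
    nodetool drain
    sudo service cassandra stop

    # Copy the whole data directory (including the system keyspace, which carries the
    # node's host ID and token assignments) to the NEW 3.0.24 node, which must not
    # have been started yet
    rsync -a /var/lib/cassandra/ new-node:/var/lib/cassandra/

    # On the NEW node: same cassandra.yaml settings (cluster_name, num_tokens, snitch,
    # etc.) apart from the listen/rpc addresses, then start it - it should take over
    # the old node's token ranges without any streaming
    sudo service cassandra start
    nodetool status    # the new IP should show up as UN with roughly the old node's load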
Another possibility, if you can't stop the running nodes, is to add all new nodes as a new data center, adjust the replication factor to include it, use nodetool rebuild to force copying of the data to the new DC, switch the application to the new data center, and then decommission the whole old data center without streaming the data. In this scenario you stream the data only once. This route also works better if the new nodes will have a different num_tokens value - it's not recommended to have different num_tokens on nodes of the same DC.
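A hedged sketch of that new-DC route (the DC names dc_old/dc_new and the keyspace my_ks are placeholders; repeat the ALTER for every non-system keyspace):

    # 1. Extend replication to the new data center
    cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_old': 3, 'dc_new': 3};"

    # 2. On every node in the new DC, stream the existing data over from the old DC
    nodetool rebuild -- dc_old

    # 3. After pointing the application at dc_new, drop the old DC from replication...
    cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_new': 3};"

    # 4. ...then run this on each old node; with no replicas left in dc_old,
    #    decommission should not stream any data
    nodetool decommission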
P.S. It's usually not recommended to change the cluster topology when you have nodes of different versions, but it may be OK for 3.0.16 -> 3.0.24.
To echo Alex's answer, 3.0.16 and 3.0.24 still use the same SSTable file format, so the complexity of the upgrade decreases dramatically. They'll still be able to stream data between the different versions, so your idea should work. If you're in a K8s-like environment, it might just be easier to redeploy with the new version and attach the old volumes to the replacement instances.
"What about if the number of tokens each node was responsible for was increased with the new nodes? E.G.: 0.16 nodes equally split the keyspace over 128 tokens but the new nodes 0.24 will split everything across 256 tokens."
A couple of points jump out at me about this one.
First of all, it is widely recognized by the community that the default num_tokens value of 256 is waaaaaay too high. Even 128 is too high. I would recommend something along the lines of 12 to 24 (we use 16).
I would definitely not increase it.
Secondly, changing num_tokens requires a data reload. The reason is that the token ranges change, and thus each node's responsibility for specific data changes. I have changed this before by standing up a new, logical data center and then switching over to it. But I would recommend not changing it if at all possible.
"In short, each new node was adding ITSSELF to list list of seed nodes as well."
So, while that's not recommended (every node a seed node), it's not a show-stopper. You can certainly run a nodetool repair/rebuild afterward to stream data to them. But yes, if you can get to the bottom of why each node is adding itself to the seed list, that would be ideal.

slurm - I/O shared between two nodes? Is that possible?

I am working with NGS data and the newest test files are massive.
Normally our pipeline uses just one node, and the output from the different tools goes to that node's ./scratch folder.
Using just one node is not possible with the current massive data set. That's why I would like to use at least 2 nodes to solve issues such as speed, not all jobs being submitted, etc.
Using multiple nodes or even multiple partitions is easy - I know which parameters to use for that step.
So my issue is not about missing parameters, but about the logic behind Slurm for solving the following I/O problem:
Let's say I have tool-A. Tool-A runs 700 jobs across two nodes (340 jobs on node1 and 360 jobs on node2) - the output is saved to ./scratch on each node separately.
Tool-B then uses the results from tool-A - which are now spread across two different nodes.
What is the best approach to fix that?
- Is there a parameter which tells Slurm which jobs belong together and where to find the input for tool-B?
- Would it be smarter to change the output from ./scratch to a local folder?
- Or would it be better to merge the output of tool-A from both nodes onto one node?
- Any other ideas?
I hope I made my issue simple to understand... My apologies if that is not the case!
My naive suggestion would be: why not share a scratch NFS volume across all nodes? This way all the output data of tool-A would be accessible to tool-B regardless of the node. It might not be the best solution for read/write speed, but to my mind it would be the easiest for your situation.
A more software-oriented solution (not too hard to develop) could be to implement a database that tracks where the files have been generated.
I hope it helps!
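A minimal sketch of the shared-NFS idea (the host name, subnet and paths are illustrative; on a managed HPC cluster you would normally ask the admins rather than mounting things yourself):

    # On a hypothetical storage host "nfs-server": export a shared scratch area
    echo '/export/scratch  10.0.0.0/24(rw,async,no_subtree_check)' | sudo tee -a /etc/exports
    sudo exportfs -ra

    # On every compute node: mount it at the same path, so tool-A's output is visible
    # to tool-B no matter which node either of them ran on
    sudo mkdir -p /shared_scratch
    sudo mount -t nfs nfs-server:/export/scratch /shared_scratch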
... just for those coming across this via search engines: if you cannot use any kind of shared filesystem (NFS, GPFS, Lustre, Ceph) and your data sets are not all massive, you could use "staging", meaning data transfer before and after your job actually runs.
Though this is termed "cast"ing in the Slurm universe, it generally means you define:
- files to be copied to all nodes assigned to your job BEFORE the job starts
- files to be copied from the nodes assigned to your job AFTER the job completes
This can be a way to get everything needed back and forth from/to your job's nodes even without a shared file system.
Check the man page of "sbcast" and amend your sbatch job scripts accordingly.
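A toy sbatch sketch of that staging pattern (the file names, tool flags and scratch paths are made up; sgather is the companion command for collecting results back, if your Slurm installation ships it):

    #!/bin/bash
    #SBATCH --job-name=staging-demo
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=1

    # Stage the input onto node-local scratch on every node of the allocation
    sbcast input.fastq /scratch/$USER/input.fastq

    # Run tool-A on each node against its local copy
    srun tool-A --in /scratch/$USER/input.fastq --out /scratch/$USER/toolA.out

    # Gather the per-node outputs back into the submit directory
    # (sgather appends the source node's name to each destination file)
    sgather /scratch/$USER/toolA.out toolA.out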

How to compute the time taken for data rebalancing in Hazelcast in case of a node failure?

I am trying to figure out how much time Hazelcast takes to rebalance (repartition) the data in case of a node failure, with varying backup counts.
Is there any way to figure this out?
I tried using the migration listener, but it's not notified in case of a node exit; the callback happens only when a node is added. I have tried this with three nodes, so as to rule out the data simply being reclaimed from the backup (and thus no migration happening).
The other approach I tried was using the "isClusterSafe" API. So when a member is notified of node exit (using MembershipListener), I measure the time till the "isClusterSafe" API returns true.
Is there any other way to figure out this? And will my second approach give accurate values?
You should take a look at the MigrationListener (not to be confused with PartitionListener). It seems this has hooks to know when a partition's migration has started and finished, so you can calculate the time taken per partition ID.
I would then use this in conjunction with the MembershipListener so that you could figure out when a partition has been migrated to another node due to a node failure (and not just some sort of scheduled rebalancing).

Is it ok to set all cassandra nodes as seeds?

I'm interested in speeding up the process of bootstrapping a cluster and adding/removing nodes (granted, in the case of node removal, most of the time will be spent draining the node). I saw in the source code that nodes that are seeds are not bootstrapped, and hence do not sleep for 30 seconds while waiting for gossip to stabilize. Thus, if all nodes are declared to be seeds, the process of creating a cluster will run 30 seconds faster. My question is: is this OK, and what are the downsides? Is there a hidden requirement in Cassandra that we have at least one non-seed node to perform a bootstrap (as suggested in the answer to the following question)? I know I can shorten RING_DELAY by modifying /etc/cassandra/cassandra-env.sh, but if simply setting all nodes to be seeds would be better or faster in some way, that might be preferable. (Intuitively, there must be a downside to setting all nodes to be seeds, since it appears to strictly improve startup time.)
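For reference, the RING_DELAY tweak mentioned above would look something like this (cassandra.ring_delay_ms defaults to 30000 ms; the value here is only an example):

    # Appended to /etc/cassandra/cassandra-env.sh
    JVM_OPTS="$JVM_OPTS -Dcassandra.ring_delay_ms=5000"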
Good question. Making all nodes seeds is not recommended. You want new nodes and nodes that come up after going down to automatically migrate the right data. Bootstrapping does that. When initializing a fresh cluster without data, turn off bootstrapping. For data consistency, bootstrapping needs to be on at other times. A new start-up option -Dcassandra.auto_bootstrap=false was added to Cassandra 2.1: You start Cassandra with the option to put auto_bootstrap=false into effect temporarily until the node goes down. When the node comes back up the default auto_bootstrap=true is back in effect. Folks are less likely to go on indefinitely without bootstrapping after creating a cluster--no need to go back and forth configuring the yaml on each node.
In multiple data-center clusters, the seed list should include at least one node from each data center. To prevent partitions in gossip communications, use the same list of seed nodes in all nodes in a cluster. This is critical the first time a node starts up.
These recommendations are mentioned on several different pages of 2.1 Cassandra docs: http://www.datastax.com/documentation/cassandra/2.1/cassandra/gettingStartedCassandraIntro.html.
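For reference, a couple of hedged ways that temporary -Dcassandra.auto_bootstrap=false flag is commonly passed (the exact file depends on your packaging; remove the override once the fresh cluster is up):

    # Tarball / manual start:
    cassandra -Dcassandra.auto_bootstrap=false

    # Packaged service: append the flag to cassandra-env.sh, start, and remove it afterwards
    echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.auto_bootstrap=false"' | sudo tee -a /etc/cassandra/cassandra-env.sh
    sudo service cassandra start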

SchemaDisagreementException

Trying to set up a cassandra cluster, but it tells me that I have a SchemaDisagreementException. What is weird is that writing sometimes works, just not all of the time. I assume that I must somewhere have different schemas, but I've cleared my data directories several times before, so it must not be in there. Where else would my schema be declared, other than in my code?
You've inspired a FAQ entry! :)
http://wiki.apache.org/cassandra/FAQ#schema_disagreement
(This didn't exist until 10 minutes ago, so no, you didn't miss it.)
The FAQ didn't help me resolve this issue; instead it confused me. I have documented all my concerns and points of confusion at http://nsinfra.blogspot.in/2013/06/cassandra-schema-disagreement-problem.html
Anyway, I got rid of this problem by synchronizing the clocks of the cluster nodes.
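If it helps anyone else, two quick checks along those lines (the time-sync command depends on which daemon your distro runs):

    # All nodes should report the same schema version UUID
    nodetool describecluster

    # Confirm the clocks really are in sync on every node (a classic cause of disagreement)
    ntpq -p    # for ntpd; use "chronyc tracking" or "timedatectl" on other setups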
I'm seeing this too, about 1 in 3 times.
I'm able to reproduce it on a clean cluster -- no operations performed apart from starting the nodes -- and listing the schema versions confirms that some nodes already have different schemas.
I get the error even with just 2 nodes, with the seeds on both nodes set to the two nodes' IP addresses. It occurs whether I start them concurrently or with a 2-minute delay before starting the second node.
This is Cassandra version 1.2.2. Times on the machines are in sync (<1s difference).
