autoscale a crate cluster - autoscaling

I'm playing around deploying crate in a Rancher environment.
It's working fine, but I have issues with two config params:
gateway.expected_nodes and gateway.recover_after_nodes.
What is the best practice regarding these two when it comes to scaling crate up and down?
/hw

The settings gateway.expected_nodes and gateway.recover_after_nodes are
only relevant during node startup.
scale-down: After you've removed some nodes you should update the configuration
to reflect the new number of nodes in the cluster. But you don't need to
restart.
scale-up: You should change the settings to the number of nodes you're going
to have. This should be done before you start those new nodes.
But you don't need to restart the existing nodes.
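To make that concrete, here is what the relevant lines in each node's crate.yml might look like for a cluster being scaled up to 5 nodes (the values are only an example; recover_after_nodes is commonly set to a majority of the nodes):

    # crate.yml - example values for a 5-node cluster
    gateway.expected_nodes: 5        # total number of nodes the cluster will have
    gateway.recover_after_nodes: 3   # start recovery once a majority of nodes has joined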
For a running node/cluster these values don't have any effect at all, which is why you don't necessarily have to restart (but the values should be correct in case you do restart). They're
only relevant during start-up: they control whether the node that is just starting
should recover the data from its filesystem, or whether it should wait for other
nodes in the cluster and receive the data from them.
For example, suppose you have 2 nodes: N1 and N2.
You create a table
You stop N2
You delete the table (on N1)
You start N2
N2 reads the gateway settings - they're wrong, so it thinks it's going to be the only node in the cluster and recovers the table from its filesystem, because it doesn't know that it got deleted on N1 (it doesn't know about N1 yet)
N2 eventually joins N1
The table is back in the cluster
Update:
Should I care about the warning in the admin UI if all nodes, once started or restarted, will have the correct settings?
If they will have the correct settings on a (re)start, the warnings can be ignored.

Related

Can I upgrade a Cassandra cluster swapping in new nodes running the updated version?

I am relatively new to Cassandra... both as a User and as an Operator. Not what I was hired for, but it's now on my plate. If there's an obvious answer or detail I'm missing, I'll be more than happy to provide it... just let me know!
I am unable to find any recent or concrete documentation that explicitly spells out how tolerant Cassandra nodes will be when a node with a higher Cassandra version is introduced to an existing cluster.
Hypothetically, let's say I have 4 nodes in a cluster running 3.0.16 and I want to upgrade the cluster to 3.0.24 (the latest version as of posting; 2021-04-19). For reasons that are not important here, running an 'in-place' upgrade on each existing node is not possible. That is: I can not simply stop Cassandra on the existing nodes and then do a nodetool drain; service cassandra stop; apt upgrade cassandra; service cassandra start.
I've looked at the change log between 3.0.17 and 3.0.24 (inclusive) and don't see anything that looks like a major breaking change w/r/t the transport protocol.
So my question is: Can I introduce new nodes (running 3.0.24) to the c* cluster (comprised of 3.0.16 nodes) and then run nodetool decommission on each of the 3.0.16 nodes to perform a "one for one" replacement to upgrade the cluster?
Do I risk any data integrity issues with this procedure? Is there a specific reason why the procedure outlined above wouldn't work? What about if the number of tokens each node was responsible for was increased with the new nodes? E.G.: 0.16 nodes equally split the keyspace over 128 tokens but the new nodes 0.24 will split everything across 256 tokens.
EDIT: After some back/forth on the #cassandra channel on the Apache Slack, it appears as though there's no issue w/ the procedure. There were some other comorbid issues caused by other bits of automation that did threaten the data-integrity of the cluster, however. In short, each new node was adding ITSELF to the list of seed nodes as well. This can be seen in the logs: This node will not auto bootstrap because it is configured to be a seed node.
Each new node failed to bootstrap, but did not fail to take new writes.
EDIT2: I am not on a k8s environment; this is 'basic' EC2. Likewise, the volume of data / node size is quite small; ranging from tens of megabytes to a few hundred gigs in production. In all cases, the cluster is fewer than 10 nodes. The case I outlined above was for a test/dev cluster which is normally 2 nodes in two distinct rack/AZs for a total of 4 nodes in the cluster.
Running bootstrap & decommission will take quite a long time, especially if you have a lot of data - you will stream all the data twice, and this will increase the load on the cluster. The simpler solution would be to replace the old nodes by copying their data onto new nodes that have the same configuration as the old nodes, but with a different IP and with 3.0.24 (don't start that node yet!). Step-by-step instructions are in this answer; when it's done correctly you will have minimal downtime and won't need to wait for bootstrap or decommission.
Another possibility, if you can't stop the running nodes, is to add all new nodes as a new datacenter, adjust the replication factor to include it, use nodetool rebuild to force copying of the data to the new DC, switch the application to the new data center, and then decommission the whole old data center without streaming the data. In this scenario you will stream the data only once. Also, this plays better if the new nodes will have a different num_tokens - it's not recommended to have different num_tokens on nodes of the same DC.
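Very roughly, the new-datacenter variant looks like this (keyspace and DC names are placeholders; check each step against the docs for your exact version):

    # 1. Start the new 3.0.24 nodes with auto_bootstrap: false and a new DC name in
    #    the snitch configuration, and keep clients on LOCAL_* consistency levels
    #    pinned to the old DC so they don't read from the still-empty one.

    # 2. Add the new DC to the keyspace's replication (CQL, names are examples):
    ALTER KEYSPACE my_ks WITH replication =
        {'class': 'NetworkTopologyStrategy', 'old_dc': 3, 'new_dc': 3};

    # 3. On every node in the new DC, stream the data over from the old DC:
    nodetool rebuild -- old_dc

    # 4. Switch the application to the new DC, remove 'old_dc' from the replication
    #    settings, and decommission the old nodes.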
P.S. Usually it's not recommended to make changes to the cluster topology when you have nodes of different versions, but it may be OK for 3.0.16 -> 3.0.24.
To echo Alex's answer, 3.0.16 and 3.0.24 still use the same SSTable file format, so the complexity of the upgrade decreases dramatically. They'll still be able to stream data between the different versions, so your idea should work. If you're in a K8s-like environment, it might just be easier to redeploy with the new version and attach the old volumes to the replacement instances.
"What about if the number of tokens each node was responsible for was increased with the new nodes? E.G.: 0.16 nodes equally split the keyspace over 128 tokens but the new nodes 0.24 will split everything across 256 tokens."
A couple of points jump out at me about this one.
First of all, it is widely recognized by the community that the default num_tokens value of 256 is waaaaaay too high. Even 128 is too high. I would recommend something along the lines of 12 to 24 (we use 16).
I would definitely not increase it.
Secondly, changing num_tokens requires a data reload. The reason is that the token ranges change, and thus each node's responsibility for specific data changes. I have changed this before by standing up a new, logical data center and then switching over to it. But I would recommend not changing it if at all possible.
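For reference, this is the single cassandra.yaml line that controls it (16 is just the value we happen to use, and it only takes effect on nodes that have not yet bootstrapped):

    # cassandra.yaml - only read the first time a node bootstraps
    num_tokens: 16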
"In short, each new node was adding ITSSELF to list list of seed nodes as well."
So, while that's not recommended (every node a seed node), it's not a show-stopper. You can certainly run a nodetool repair/rebuild afterward to stream data to them. But yes, if you can get to the bottom of why each node is adding itself to the seed list, that would be ideal.

Inconsistent Elassandra cluster state after node restart - less data on one node

I have migrated my existing data in a 4-node Cassandra cluster (with RF=3) to Elassandra, and after putting in my mappings the whole data set got indexed into Elassandra. After the completion of indexing, all nodes show a consistent result in the /_cat/indices?v API. But as soon as I restart any node, the data on that node is reduced substantially - the index size as well as the number of records. If I restart another node of the cluster, the problem shifts to that node and the previous node recovers automatically. For more details and the detailed use case, please refer to the issue I have created with Elassandra.
Upgrading to Elassandra v6.8.4.3 has resolved the problem. Thanks!

How to bring up the new node

This is a follow-up question to High Availability in Cassandra.
1) Let's say we have three nodes N1, N2 and N3, with RF = 3, WC = 3 and RC = 1, which means I cannot handle any node failure in case of a write.
2) Let's say N3 (imagine it holds the data) went down; as of now we will not be able to write the data with a consistency of '3'.
Question 1: Now if I bring a new node N4 up and attach it to the cluster, I will still not be able to write to the cluster with consistency 3. So how can I make the node N4 act as the third node?
Question 2: Let's say we have a 7-node cluster with RF = 3. If any node holding a replica went down, is there a way to make other existing nodes in the cluster act as a node holding that partition?
Look at the docs:
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsReplaceNode.html
You want to replace a dead node in your scenario. N3 should be removed from the ring and replaced by N4.
It should be easy to follow those instructions step by step. It is critical, if you installed the node via package management, to stop it before configuring it anew and to wipe out all existing data, caches and commitlogs from it (often found under /var/lib/cassandra/*).
Also it is possible to remove a dead node from the ring with nodetool removenode as described here http://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsRemoveNode.html and here https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRemoveNode.html - this removes the node from your cluster (and you should ensure that it can't come back after that, before wiping out its data).
Remember it only removes a dead node from the ring and assigns its token ranges to the remaining nodes, but no streaming will happen automatically. You will need to run nodetool repair after removing a dead node.
If you want to remove a live node you can use nodetool decommission - but as above, ensure the node does not re-enter the cluster by wiping out its data.
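As a rough sketch (a package-based install is assumed; IPs, host IDs and paths are placeholders - follow the linked DataStax docs for the authoritative steps):

    # Option A - replace the dead N3 with the new node N4:
    sudo service cassandra stop                 # on N4, before it ever joins
    sudo rm -rf /var/lib/cassandra/*            # wipe data, caches and commitlogs
    # in cassandra-env.sh on N4:
    #   JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<ip_of_dead_N3>"
    sudo service cassandra start

    # Option B - remove the dead node and let the remaining nodes take over its ranges:
    nodetool status                             # note the Host ID of the DN (down) node
    nodetool removenode <host_id_of_N3>
    nodetool repair                             # on the remaining nodes, to stream missing replicas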
Update:
Nodes in Cassandra are not "named" N1, N2, etc. internally. The nodes have a UUID and they own so-called token ranges which they are responsible for.
If a node is down - simply repair it if at all possible and bring it online again to rejoin your cluster - if that took less than the default 3 hours you are fine. Otherwise run nodetool repair.
But if the node is 'lost' completely and will never come back, run nodetool removenode for that dead node. This asks Cassandra to assign the token ranges the dead node was responsible for to the remaining nodes. After that, run nodetool repair so the nodes will stream the data that is missing. After that your cluster will have one node less, so it will be six nodes.
Suppose you have a 7-node cluster: N1, N2, N3, ..., N7. Suppose you have data with RF = 3, write consistency = 2 and read consistency = 2. Let's say nodes N1, N2 and N3 are holding the data. If any of these nodes goes down, the cluster will be completely fine and read/write operations will not be affected, as long as the consistency level for the read and write operations is satisfied.
Suppose you have data with RF = 3, write consistency = 3 and read consistency = 3. Let's say nodes N1, N2 and N3 are holding the data. If any of these nodes goes down, the operations will fail as the consistency level cannot be satisfied.
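To make that concrete, this is roughly how it looks from cqlsh with one of the three replicas down (keyspace and table names are made up):

    CONSISTENCY TWO;
    INSERT INTO my_ks.my_table (id, value) VALUES (1, 'ok');   -- succeeds: two replicas can still acknowledge
    CONSISTENCY THREE;
    INSERT INTO my_ks.my_table (id, value) VALUES (2, 'ok');   -- fails with an Unavailable error: only 2 of 3 replicas are up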
Now you can do two things if any of N1, N2, N3 goes down:
1) You can replace the node. In this case the newly replaced node will act like the old dead node.
2) You can manually add a new node N8 and remove the old dead node N3. In this case Cassandra will redistribute the dead node's token ranges among the ring and rebalance the partitions.

Cassandra seed values in a three node datacenter

We are deploying our application to production this month and our stack will include a 3 node, single datacenter Cassandra version 1.2 cluster. In anticipation of this, we have been getting our initial cassandra.yaml settings worked out. While doing this I ran into an interesting situation for which I haven't been able to find an answer.
This has to do with setting the -seeds parameter in each node's cassandra.yaml file. All of the reading I've done says it is best practice to:
Have at least 2 seeds per datacenter. This makes sense, so that one of the nodes can go down and the other nodes can still be seeded by the second seed.
These two seeds should be the same for all (in our case 3) nodes.
In the deployment I tested this on, I started out with all three nodes having a single seed: node 1's IP address. My intention was to change the seeds of all three nodes to the IP addresses of node1 and node2. First I did node 3 by:
Decommissioning the node.
Shutting down Cassandra.
Changing the -seeds value to ip_node1,ip_node2.
Starting up Cassandra.
Running nodetool status to ensure the node was added back to the cluster.
Next I did node 2, following the exact same steps I did for node 3. But something unexpected happened. When I restarted Cassandra on node 2, it did not join the existing ring. Instead it started its own single node ring. It seems pretty obvious that of the two seed parameters I passed it, it used its own IP address and thus believed it was the first node in a new ring.
I was surprised Cassandra didn't select the other seed value I passed it (node 1's). The only way I could get it to join the existing datacenter was to set its seeds to one or both of the other nodes in the cluster.
An obvious workaround to this is to configure each of my three nodes' seeds value to the IP addresses of the other two nodes in the cluster. But since several sources have suggested this isn't a "Best Practice", I thought I'd ask how this should be handled. So my question is:
Is it normal for Cassandra to always use its own IP address as a seed if it is in the seed list?
Is configuring the cluster the way I've suggested, which goes against best practice, a huge issue?
This might not be the solution to your question but did you compare all your cassandra.yaml files?
They should all be the same, apart from things like listen_address.
Is it possible you might have had a whitespace or typo in the cluster name also?
I just thought I'd mention it as something good to check.
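For what it's worth, the usual pattern is that the seed_provider block is literally identical on all three nodes, for example (IPs are placeholders):

    # cassandra.yaml - same on node1, node2 and node3
    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "10.0.0.1,10.0.0.2"   # node1 and node2; node3 is not a seed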

DynamicSnitch Reads from empty new datacenter

When adding a new datacenter, the DynamicSnitch causes us to read data from the new DC when the data is not there yet.
We have a Cassandra (1.0.11) cluster running on 3 datacenters and we want to add a fourth datacenter. The cluster is configured with PropertyFileSnitch and DynamicSnitch enabled with a 0.0 badness factor. The relevant keyspaces' replication factors are DC1:2, DC2:2, DC3:2. Our plan was to add the new datacenter to the ring, add it to the schema and run a rolling repair -pr on all the nodes so the new nodes would get all of their needed data.
Once we started the process we noticed that the new datacenter receives read calls from the other datacenters because it has a lower load and the DynamicSnitch decides it is better to read from it. The problem is that the data center still doesn't have the data and returns no results.
We tried removing the DynamicSnitch entirely, but once we did that, every time a single server got a bit of load we experienced extreme performance degradation.
Has anyone encountered this issue?
Is there a way to directly influence the score of a specific data center so it won't be picked by the DynamicSnitch?
Are there any better ways to add a datacenter in Cassandra 1.0.11? Has anyone written a snitch that handles these issues?
Thanks,
Izik.
You could bootstrap the nodes instead of adding to the ring without bootstrap and then repairing. The former ensures that no reads will be routed to it until it has all the data it needs. (That is why Cassandra defaults to auto_bootstrap: true and in fact disabling it is a sufficiently bad idea that we removed it from the example cassandra.yaml.)
The problem with this, and the reason that the documentation recommends adding all the nodes first without bootstrap, is that if you have N replicas configured for DC4, Cassandra will try to replicate the entire dataset for that keyspace to the first N nodes you add, which can be problematic!
So here are the options I see:
If your dataset is small enough, go ahead and use the bootstrap plan
Increase ConsistencyLevel on your reads so that they will always touch a replica that does have the data, as well as one that does not
Upgrade to 1.2 and use ConsistencyLevel.LOCAL_ONE on your reads which will force it to never make cross-DC requests
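For the third option, once you're on 1.2 the change is just the consistency level on your reads, e.g. from cqlsh (table name is made up) or the equivalent ConsistencyLevel.LOCAL_ONE setting in your client driver:

    CONSISTENCY LOCAL_ONE;                        -- reads stay inside the local DC
    SELECT * FROM my_ks.my_table WHERE id = 1;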
