Inconsistent Elassandra cluster state after node restart - less data on one node - cassandra

I have migrated my existing data from a 4-node Cassandra cluster (with RF=3) to Elassandra, and after putting my mappings in place the whole data set got indexed into Elassandra. After indexing completed, all nodes showed a consistent result in the /_cat/indices?v API. But as soon as I restart any node, the data on that node is reduced substantially, both the index size and the number of records. If I restart another node of the cluster, the problem shifts to that node and the previous node recovers automatically. For more details and the full use case, please refer to the issue I have created with Elassandra.

Upgrading to Elassandra v6.8.4.3 resolved the problem. Thanks!

Related

Avoid zombie data in cassandra

Recently I faced an issue in a customer setup with a 3-node cluster, where one node went down and came back online only after 12 days. The default gc_grace_seconds for most of the tables is set to 1 day in our scenario, and there are a lot of tables.
When this down node came up, stale data from it got replicated to the other nodes, leading to zombie data on all three nodes.
A solution I could think of is to clean the node before making it rejoin the cluster and then run a repair, which should prevent the zombie data from reappearing.
Is there any other possible solution to avoid this issue where I don't need to clean the node?
You should never bring a node back online if it has been down for longer than the shortest gc_grace_seconds.
This is a challenge in environments where GC grace is set to a very low value. In these situations, the procedure is to completely rebuild the node as if it had never been part of the cluster (a shell sketch follows these steps):
Completely wipe all contents of data/, commitlog/ and saved_caches/.
Remove the node's IP from its seeds list if it is listed as a seed node.
Replace the node with itself using the replace_address flag.
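A minimal shell sketch of that rebuild, assuming a systemd-managed package install with the default /var/lib/cassandra directories and /etc/cassandra config paths; adjust to your own data_file_directories, commitlog_directory and saved_caches_directory, and substitute the node's real IP:

    sudo systemctl stop cassandra
    sudo rm -rf /var/lib/cassandra/data/* \
                /var/lib/cassandra/commitlog/* \
                /var/lib/cassandra/saved_caches/*
    # Edit cassandra.yaml first so this node's own IP is no longer in "seeds:",
    # then have the node replace itself on startup:
    echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<node_ip>"' \
        | sudo tee -a /etc/cassandra/cassandra-env.sh
    sudo systemctl start cassandra
    # Remove the replace_address line from cassandra-env.sh again once the node
    # has finished rejoining the ring.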
Cheers!

Can I upgrade a Cassandra cluster swapping in new nodes running the updated version?

I am relatively new to Cassandra... both as a User and as an Operator. Not what I was hired for, but it's now on my plate. If there's an obvious answer or detail I'm missing, I'll be more than happy to provide it... just let me know!
I am unable to find any recent or concrete documentation that explicitly spells out how tolerant Cassandra nodes will be when a node with a higher Cassandra version is introduced to an existing cluster.
Hypothetically, let's say I have 4 nodes in a cluster running 3.0.16 and I want to upgrade the cluster to 3.0.24 (the latest version as of posting; 2021-04-19). For reasons that are not important here, running an 'in-place' upgrade on each existing node is not possible. That is: I cannot simply stop Cassandra on the existing nodes and then do a nodetool drain; service cassandra stop; apt upgrade cassandra; service cassandra start.
I've looked at the change log between 3.0.17 and 3.0.24 (inclusive) and don't see anything that looks like a major breaking change w/r/t the transport protocol.
So my question is: Can I introduce new nodes (running 3.0.24) to the c* cluster (comprised of 3.0.16 nodes) and then run nodetool decommission on each of the 3.0.16 nodes to perform a "one for one" replacement to upgrade the cluster?
Do I risk any data integrity issues with this procedure? Is there a specific reason why the procedure outlined above wouldn't work? What if the number of tokens each node is responsible for is increased with the new nodes? E.g., the 3.0.16 nodes equally split the keyspace over 128 tokens, but the new 3.0.24 nodes will split everything across 256 tokens.
EDIT: After some back and forth on the #cassandra channel on the Apache Slack, it appears as though there's no issue with the procedure. There were some other comorbid issues caused by other bits of automation that did threaten the data integrity of the cluster, however. In short, each new node was adding ITSELF to the list of seed nodes as well. This can be seen in the logs: This node will not auto bootstrap because it is configured to be a seed node.
Each new node failed to bootstrap, but did not fail to take new writes.
EDIT2: I am not in a k8s environment; this is 'basic' EC2. Likewise, the volume of data / node size is quite small, ranging from tens of megabytes to a few hundred gigs in production. In all cases, the cluster is fewer than 10 nodes. The case I outlined above was for a test/dev cluster which is normally 2 nodes in each of two distinct racks/AZs, for a total of 4 nodes in the cluster.
Running bootstrap & decommission will take quite a long time, especially if you have a lot of data - you will stream all data twice, and this will increase the load on the cluster. The simpler solution would be to replace the old nodes by copying their data onto new nodes that have the same configuration as the old nodes, but with a different IP and with 3.0.24 installed (don't start that node!). Step-by-step instructions are in this answer; when it's done correctly you will have minimal downtime and won't need to wait for bootstrap/decommission.
Another possibility, if you can't stop the running nodes, is to add all the new nodes as a new datacenter, adjust the replication factor to include it, use nodetool rebuild to force copying of the data to the new DC, switch the application to the new datacenter, and then decommission the whole old datacenter without streaming the data. In this scenario you will stream the data only once. Also, this approach works better if the new nodes will have a different num_tokens value - it's not recommended to have different num_tokens on nodes of the same DC.
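A hedged sketch of that new-datacenter approach, assuming NetworkTopologyStrategy with placeholder names (my_ks, dc_old, dc_new) and RF=3 in each DC:

    # 1. Point the application at dc_old only (LOCAL_* consistency levels), then
    #    add replicas in the new DC:
    cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_old': 3, 'dc_new': 3};"
    # 2. On every node in dc_new, stream the existing data from the old DC:
    nodetool rebuild dc_old
    # 3. Switch the application to dc_new, drop dc_old from replication, and
    #    decommission its nodes (nothing will stream back to them):
    cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_new': 3};"
    nodetool decommission    # run on each dc_old node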
P.S. It's usually not recommended to make changes in cluster topology when you have nodes of different versions, but it may be OK for 3.0.16 -> 3.0.24.
To echo Alex's answer, 3.0.16 and 3.0.24 still use the same SSTable file format, so the complexity of the upgrade decreases dramatically. They'll still be able to stream data between the different versions, so your idea should work. If you're in a K8s-like environment, it might just be easier to redeploy with the new version and attach the old volumes to the replacement instances.
"What about if the number of tokens each node was responsible for was increased with the new nodes? E.G.: 0.16 nodes equally split the keyspace over 128 tokens but the new nodes 0.24 will split everything across 256 tokens."
A couple of points jump out at me about this one.
First of all, it is widely recognized by the community that the default num_tokens value of 256 is waaaaaay too high. Even 128 is too high. I would recommend something along the lines of 12 to 24 (we use 16).
I would definitely not increase it.
Secondly, changing num_tokens requires a data reload. The reason is that the token ranges change, and thus each node's responsibility for specific data changes. I have changed this before by standing up a new, logical datacenter and then switching over to it. But I would recommend not changing it if at all possible.
"In short, each new node was adding ITSSELF to list list of seed nodes as well."
So, while that's not recommended (every node being a seed node), it's not a show-stopper. You can certainly run a nodetool repair/rebuild afterward to stream data to them. But yes, if you can get to the bottom of why each node is adding itself to the seed list, that would be ideal.
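If it helps, a hedged sketch of cleaning that up on one of the affected nodes; the config path is the Debian-package default and is an assumption:

    # Verify the node's own IP is not listed under "seeds:" in its own config:
    grep -n -A3 'seed_provider' /etc/cassandra/cassandra.yaml
    # After fixing the seed list and restarting the node, stream the data it
    # missed by skipping bootstrap:
    nodetool repair -full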

Cassandra: How to find node with matching token for restoring to newer cluster?

I want to restore data from an existing cluster to a newer cluster. I want to do so by copying the snapshot SSTables from the old cluster into the keyspaces of the newer cluster, as explained in http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/operations/ops_backup_snapshot_restore_t.html.
The same document says, " ... the snapshot must be copied to the correct node with matching tokens". What does "node with matching tokens" really mean?
My current cluster has 5 nodes, and each node has num_tokens: 256. I am going to create another cluster with the same number of nodes, the same num_tokens, and the same schema. Do I need to follow the ring order while copying SSTables to the newer cluster? How do I find the matching target node for a given source node?
I tried the command "nodetool ring" to check if I can use token values to match. But this command gives all the tokens for each host. How can I get the single token value (which determines the position of the node in the ring)? If I can get it, then I can find the matching nodes as well.
With vnodes it's really hard to copy the sstables over correctly, because it's not just one assigned token that you have to reassign, but 256. To do what you're asking, you need to follow the additional steps described at http://datascale.io/cloning-cassandra-clusters-fast-way/. Basically, reassign the 256 tokens of each node to a new node in the other cluster so the ring is the same. The article you listed describes loading onto the same cluster, which is a lot simpler because you don't have to worry about different topologies. Worth noting that even in that scenario, if a new node was added or a node was removed since the snapshot, it will not work.
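A hedged sketch of carrying one node's vnode tokens over; the source IP is a placeholder, and the pipeline simply turns the last column of "nodetool ring" (the token) into a comma-separated list:

    # Collect every token owned by one source host:
    nodetool ring | awk '$1 == "10.0.0.5" { print $NF }' | paste -sd, -
    # On the matching node in the new cluster, pin those tokens in cassandra.yaml
    # before its first start:
    #   num_tokens: 256
    #   initial_token: <comma-separated list from above>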
The safest bet will be to use sstableloader: it will walk through the sstables and distribute the data to the appropriate nodes. It also opens up the possibility of making changes without worrying whether everything is correct, and it ensures everything lands on the correct nodes, so no worries about human error. Each node in the original cluster can just run sstableloader on each sstable against the new cluster, and you will parallelize the work pretty well.
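For example, a minimal sstableloader invocation under assumed placeholder names (my_ks, my_table, my_snapshot) and target IPs; sstableloader expects the source directory to end in <keyspace>/<table>, so the snapshot files are staged first:

    mkdir -p /tmp/restore/my_ks/my_table
    cp /var/lib/cassandra/data/my_ks/my_table/snapshots/my_snapshot/* /tmp/restore/my_ks/my_table/
    # Stream the staged sstables into the new cluster (any reachable nodes work):
    sstableloader -d 10.0.1.10,10.0.1.11 /tmp/restore/my_ks/my_table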
I would strongly recommend you use this opportunity to decrease the number of vnodes to 32. The 256 default is excessive and absolutely horrible for rebuilds, Solr indexes, Spark, and most of all it ruins repairs. Especially if you use incremental repairs (the default), the additional ranges will cause many more anticompactions and more load. If you use sstableloader on each sstable it will just work. Increasing your streaming throughput in cassandra.yaml will potentially speed this up a bit as well.
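For reference, streaming throughput can be raised either in cassandra.yaml or at runtime; the value of 200 Mb/s below is just an illustrative assumption:

    # cassandra.yaml (requires a restart):
    #   stream_throughput_outbound_megabits_per_sec: 200
    # Or adjust a running node without a restart:
    nodetool setstreamthroughput 200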
If by chance you're using OpsCenter, this backup and restore to a new cluster is automated as well.

Can a new cassandra node be created by copying sstables directly to a new machine?

We had three nodes in our cluster with a replication factor of three. We lost one because of disk issues. The backed-up files from this node are also corrupted.
I am looking at these options:
Join a fresh node to the existing two nodes and let it catch up. This would be very slow, as there is about 1 TB of data.
Copy all sstable data from another node to a new node and add it as the third node. Since each node has all the data, the hope is that this should be super fast.
Will option two work, or will this cause other issues? Are there any other options?
We are running Cassandra version 1.1.10.
I haven't used 1.1, but option 2 should work. Make sure to use the same initial_token config value for the new host (although it shouldn't really matter with your topology). Running a repair afterwards is recommended.
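A hedged sketch of option 2 on a 1.1-era install; hostnames, paths and the token are placeholders, and the system keyspace is deliberately excluded because it holds the source node's identity:

    # On the surviving source node, flush memtables so the sstables are current:
    nodetool flush
    # Copy the data directories, excluding the system keyspace:
    rsync -a --exclude 'system' /var/lib/cassandra/data/ new-node:/var/lib/cassandra/data/
    # If the dead node still shows in "nodetool ring", remove it first
    # (nodetool removetoken <token> in the 1.1 series).
    # On the new node, set the dead node's token in cassandra.yaml before starting:
    #   initial_token: <token previously owned by the dead node>
    # Start Cassandra, then reconcile anything written since the copy:
    nodetool repair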

How to restart a seed node after its process crashes?

Are there any differences between replacing a dead node and restarting a dead node, especially for seed nodes? Actually, I'm a little bit confused about how to restart a dead seed node.
When the process of a seed node crashes, should I restart it without making any changes to cassandra.yaml? Or, as when replacing a seed node, should I remove its IP address from the seeds list (cassandra.yaml) on each node?
The documentation is not clear about that. It only deals with how to replace a dead node with another machine.
Thank you
If you are simply restarting a dead seed node, then you shouldn't need to alter your cassandra.yaml file before the restart. As long as you have addressed whatever caused the node to die, and your node has not been down longer than gc_grace_seconds (see note below), then restarting shouldn't be an issue.
The concerns noted in the documentation you have linked center around replacing dead seed nodes. The problem with replacing seed nodes, is that the new node will not bootstrap into the cluster if it is configured as a seed. In that case, a different node in the cluster should be promoted to be a seed node.
Note: the About Deletes section of the documentation warns about bringing a node back that has been down a long time. Specifically, longer than the value set for gc_grace_seconds (or the shortest value set, if you have changed it on any individual tables).
...if a node is down longer than the grace period, the node can miss the delete because the tombstone disappears after gc_grace_seconds. Cassandra always attempts to replay missed updates when the node comes back up again. After a failure, it is a best practice to run node repair to repair inconsistencies across all of the replicas when bringing a node back into the cluster. If the node doesn't come back within gc_grace_seconds, remove the node, wipe it, and bootstrap it again.
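A small hedged check that goes with that advice; the query assumes the Cassandra 3.x schema tables, and the repair is run on the node that was just restarted:

    # Find the shortest gc_grace_seconds across your tables (3.x+ schema):
    cqlsh -e "SELECT keyspace_name, table_name, gc_grace_seconds FROM system_schema.tables;"
    # If the node came back within that window, repair it so it catches up:
    nodetool repair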
