How to validate the success of a Cassandra version upgrade and cross-datacenter backup

I have a production Cassandra cluster with one datacenter of 3 hosts running version 1.0.7. I want to upgrade it from 1.0.7 to 2.1.8 and then add another datacenter with 3 hosts running 2.1.8.
I have experimented on a test cluster and can upgrade it without any ERRORs, but I still worry that data may be lost or modified. So I want to design a quick method to validate the following 2 points:
Is any data lost or damaged when the cluster is upgraded from 1.0.7 to 2.1.8?
I add an extra data center to the cluster and alter the keyspace replication strategy to NetworkTopologyStrategy with 2 replicas in each data center. How can I validate that both data centers hold the same replicas?
There are about 10G rows in the current cluster, so comparing the rows one by one is tedious. Is there a better way to validate the points above, or can I just trust Cassandra itself?

I'm not sure it's really practical (or necessary) in most cases to check every row of data.
I'd probably do some before-and-after checks like these:
Spot check some selected subset of rows. If some of them are correct, likely all of them are.
Compare the data sizes before and after the upgrade to make sure they are in the same ballpark.
Monitor the upgrade process for errors (which you're already doing).
Run full repairs on the nodes after the upgrade and see if there is an unusual amount of data movement suggesting some nodes were not fully populated.
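The spot check and the size comparison can be scripted. The sketch below assumes you have exported the same selected rows before and after the upgrade (for example with cqlsh COPY, or a small client script) into Python dicts keyed by partition key; the export step, the function names and the 25% size tolerance are all hypothetical choices, not something Cassandra provides.

```python
import hashlib
import random


def row_digest(row: dict) -> str:
    """Stable digest of one row; columns are sorted by name so order doesn't matter."""
    canonical = "|".join(f"{col}={row[col]!r}" for col in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def spot_check(before: dict, after: dict, sample_size: int, seed: int = 42) -> list:
    """Sample keys from the pre-upgrade export and return the ones whose row is
    missing or different in the post-upgrade export."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-checked later
    keys = sorted(before)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    return [k for k in sample
            if k not in after or row_digest(before[k]) != row_digest(after[k])]


def sizes_in_ballpark(bytes_before: int, bytes_after: int, tolerance: float = 0.25) -> bool:
    """True if the total on-disk size changed by at most `tolerance` (a fraction).
    The SSTable format changes between 1.0 and 2.1, so an exact match is not expected."""
    if bytes_before == 0:
        return bytes_after == 0
    return abs(bytes_after - bytes_before) / bytes_before <= tolerance
```

An empty list from spot_check means every sampled row came through unchanged; any returned key is worth looking at by hand before trusting the rest.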

Related

Can I upgrade a Cassandra cluster swapping in new nodes running the updated version?

I am relatively new to Cassandra, both as a user and as an operator. It's not what I was hired for, but it's now on my plate. If there's an obvious answer or detail I'm missing, I'll be more than happy to provide it; just let me know!
I am unable to find any recent or concrete documentation that explicitly spells out how tolerant Cassandra nodes will be when a node with a higher Cassandra version is introduced to an existing cluster.
Hypothetically, let's say I have 4 nodes in a cluster running 3.0.16 and I want to upgrade the cluster to 3.0.24 (the latest version as of posting, 2021-04-19). For reasons that are not important here, running an 'in-place' upgrade on each existing node is not possible. That is, I cannot simply run nodetool drain; service cassandra stop; apt upgrade cassandra; service cassandra start on each existing node.
I've looked at the changelog between 3.0.17 and 3.0.24 (inclusive) and don't see anything that looks like a major breaking change with respect to the transport protocol.
So my question is: can I introduce new nodes (running 3.0.24) to the Cassandra cluster (made up of 3.0.16 nodes) and then run nodetool decommission on each of the 3.0.16 nodes to perform a "one for one" replacement that upgrades the cluster?
Do I risk any data integrity issues with this procedure? Is there a specific reason why the procedure outlined above wouldn't work? And what if the number of tokens each node is responsible for is increased on the new nodes? For example, the 3.0.16 nodes equally split the keyspace over 128 tokens each, but the new 3.0.24 nodes would split everything across 256 tokens.
EDIT: After some back and forth on the #cassandra channel of the Apache Slack, it appears there's no issue with the procedure. However, some other comorbid issues caused by other bits of automation did threaten the data integrity of the cluster. In short, each new node was also adding itself to the list of seed nodes. This can be seen in the logs: This node will not auto bootstrap because it is configured to be a seed node.
Each new node failed to bootstrap, but did not fail to take new writes.
EDIT2: I am not on a k8s environment; this is 'basic' EC2. Likewise, the data volume per node is quite small, ranging from tens of megabytes to a few hundred gigs in production. In all cases, the cluster has fewer than 10 nodes. The case I outlined above was for a test/dev cluster, which normally has 2 nodes in each of two distinct racks/AZs, for a total of 4 nodes.

Running bootstrap and decommission will take quite a long time, especially if you have a lot of data: you will stream all the data twice, and this will increase the load on the cluster. The simpler solution is to replace the old nodes by copying their data onto new nodes that have the same configuration as the old nodes, but with a different IP and running 3.0.24 (don't start that node!). Step-by-step instructions are in this answer; when it's done correctly you will have minimal downtime and won't need to wait for bootstrap and decommission.
Another possibility, if you can't stop the running nodes, is to add all the new nodes as a new datacenter, adjust the replication factor to include it, use nodetool rebuild to force copying of the data to the new DC, switch the application to the new data center, and then (after removing the old DC from the keyspace replication) decommission the whole old data center without streaming the data. In this scenario you stream the data only once. It also works better if the new nodes will use a different num_tokens value, since it's not recommended to have different num_tokens on nodes of the same DC.
P.S. It's usually not recommended to make changes to the cluster topology while nodes are running different versions, but it may be OK for 3.0.16 -> 3.0.24.
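The new-datacenter route is easiest to follow as an ordered list of steps. The sketch below only builds the CQL and nodetool commands as strings and runs nothing; the keyspace, DC and node names are hypothetical, the application switch-over is left as a manual step, and system keyspaces such as system_auth would need the same treatment but are not covered.

```python
def replication_cql(keyspace: str, dcs: dict) -> str:
    """Build an ALTER KEYSPACE statement for NetworkTopologyStrategy."""
    parts = ", ".join(f"'{dc}': {rf}" for dc, rf in dcs.items())
    return (f"ALTER KEYSPACE {keyspace} WITH replication = "
            f"{{'class': 'NetworkTopologyStrategy', {parts}}};")


def migration_plan(keyspace: str, old_dc: str, new_dc: str, rf: int,
                   old_nodes: list, new_nodes: list) -> list:
    """Return (where_to_run, command) pairs in execution order."""
    steps = []
    # 1. Replicate the keyspace to both DCs; new writes now reach the new DC too.
    steps.append(("cqlsh", replication_cql(keyspace, {old_dc: rf, new_dc: rf})))
    # 2. Stream the existing data into each new node from the old DC (streamed once).
    for node in new_nodes:
        steps.append((node, f"nodetool rebuild {old_dc}"))
    # 3. (Manual) point the application at the new DC before continuing.
    # 4. Drop the old DC from replication so decommissioning it has nothing to stream.
    steps.append(("cqlsh", replication_cql(keyspace, {new_dc: rf})))
    # 5. Remove the old nodes from the ring.
    for node in old_nodes:
        steps.append((node, "nodetool decommission"))
    return steps
```

Keeping step 4 before step 5 is the design point: once the old DC owns no replicas, decommission only has to leave the ring.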

To echo Alex's answer, 3.0.16 and 3.0.24 still use the same SSTable file format, so the complexity of the upgrade decreases dramatically. They'll still be able to stream data between the different versions, so your idea should work. If you're in a K8s-like environment, it might just be easier to redeploy with the new version and attach the old volumes to the replacement instances.
"What about if the number of tokens each node was responsible for was increased with the new nodes? E.G.: 0.16 nodes equally split the keyspace over 128 tokens but the new nodes 0.24 will split everything across 256 tokens."
A couple of points jump out at me about this one.
First of all, it is widely recognized by the community that the default num_tokens value of 256 is waaaaaay too high. Even 128 is too high. I would recommend something along the lines of 12 to 24 (we use 16).
I would definitely not increase it.
Secondly, changing num_tokens requires a data reload. The reason is that the token ranges change, and thus each node's responsibility for specific data changes. I have done this before by standing up a new logical data center and then switching over to it. But I would recommend not changing it if at all possible.
"In short, each new node was adding ITSSELF to list list of seed nodes as well."
So, while that's not recommended (every node a seed node), it's not a show-stopper. You can certainly run a nodetool repair/rebuild afterward to stream data to them. But yes, if you can get to the bottom of why each node is adding itself to the seed list, that would be ideal.
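If the automation that edits the seed list is hard to fix right away, a pre-start check can at least catch a new node that lists itself as a seed, which is what produces the "will not auto bootstrap" log line quoted above. This is a hypothetical helper, not a Cassandra tool; it assumes a comma-separated seeds string of bare IP addresses, as in the seed_provider section of cassandra.yaml.

```python
def parse_seeds(seeds: str) -> list:
    """Split a cassandra.yaml-style seeds string such as "10.0.0.1,10.0.0.2"."""
    return [s.strip() for s in seeds.split(",") if s.strip()]


def will_auto_bootstrap(listen_address: str, seeds: str, auto_bootstrap: bool = True) -> bool:
    """A node that finds its own address in the seed list skips bootstrap,
    so it joins the ring without streaming any data first."""
    return auto_bootstrap and listen_address not in parse_seeds(seeds)
```

Running this in the provisioning step and refusing to start the node when it returns False would have stopped the empty-but-writable nodes described in the question.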

Cassandra: How to find node with matching token for restoring to newer cluster?

I want to restore data from an existing cluster to a newer cluster by copying the snapshot SSTables from the old cluster into the keyspaces of the newer cluster, as explained in http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/operations/ops_backup_snapshot_restore_t.html.
The same document says, " ... the snapshot must be copied to the correct node with matching tokens". What does it really mean by "node with matching tokens"?
My current cluster has 5 nodes, each with num_tokens: 256. I am going to create another cluster with the same number of nodes, the same num_tokens and the same schema. Do I need to follow the ring order while copying SSTables to the newer cluster? How do I find the matching target node for a given source node?
I tried the nodetool ring command to check whether I can use token values to match nodes. But this command lists all the tokens for each host. How can I get the single token value (which determines the position of the node in the ring)? If I can get it, I can find the matching nodes as well.

With vnodes it's really hard to copy the sstables over correctly, because it's not just one assigned token that you have to reassign, but 256. To do what you're asking you need some additional steps, described at http://datascale.io/cloning-cassandra-clusters-fast-way/. Basically, you reassign the 256 tokens of each node to a new node in the other cluster so the ring is the same. The article you linked describes loading the data on the same cluster, which is a lot simpler because you don't have to worry about different topologies. It's worth noting that even in that scenario, it will not work if a node was added or removed since the snapshot was taken.
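To answer the "which tokens belong to which node" part: with vnodes there is no single token per node, so one approach is to collect each node's full token list from nodetool ring output and give that list to the matching new node as initial_token (comma-separated, with num_tokens set to the list's length). The parser below is a sketch; it assumes the address is the first column and the token is the last column of each data row, which matches the usual layout but may differ between versions, and the old-to-new node mapping is up to you.

```python
import ipaddress
from collections import defaultdict


def tokens_by_node(ring_output: str) -> dict:
    """Group the Token column of `nodetool ring` output by node address."""
    grouped = defaultdict(list)
    for line in ring_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank, separator, or the lone wrap-around token line
        try:
            ipaddress.ip_address(fields[0])
            token = int(fields[-1])
        except ValueError:
            continue  # header or other non-data line
        grouped[fields[0]].append(token)
    return {node: sorted(tokens) for node, tokens in grouped.items()}


def initial_token_line(tokens: list) -> str:
    """Render a cassandra.yaml initial_token entry for one target node."""
    return "initial_token: " + ",".join(str(t) for t in tokens)
```

Each old node's line then goes into the cassandra.yaml of exactly one new node, which is what keeps the two rings identical.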
The safest bet is to use sstableloader: it walks through the sstables and distributes the data to the appropriate nodes. It also opens up the possibility of making changes without worrying whether everything is correct, and it ensures everything ends up on the correct nodes, so there are no worries about human error. Each node in the original cluster can simply run sstableloader on each of its sstables against the new cluster, which parallelizes the work pretty well.
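A per-node driver for that parallel load could look like the sketch below. It assumes the default layout <data_dir>/<keyspace>/<table directory>, that the last two path components are what sstableloader uses to identify keyspace and table, and that you point it at a copy of a snapshot rather than a live data directory; the function names and worker count are hypothetical, and only sstableloader's -d (initial hosts) option is used.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def loader_commands(data_dir: str, keyspace: str, target_hosts: list) -> list:
    """One sstableloader invocation per table directory under <data_dir>/<keyspace>."""
    ks_dir = os.path.join(data_dir, keyspace)
    hosts = ",".join(target_hosts)
    return [
        ["sstableloader", "-d", hosts, os.path.join(ks_dir, entry)]
        for entry in sorted(os.listdir(ks_dir))
        if os.path.isdir(os.path.join(ks_dir, entry))
    ]


def run_all(commands: list, workers: int = 4) -> list:
    """Run the loader commands in parallel; returns the exit codes in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))
```

Building the command list separately from running it makes it easy to review exactly what will be streamed before starting.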
I would strongly recommend you use this opportunity to decrease the number of vnodes to 32. The 256 default is excessive and absolutely horrible for rebuilds, Solr indexes and Spark, and most of all it ruins repairs. Especially if you use incremental repairs (the default), the additional ranges will cause many more anticompactions and much more load. If you use sstableloader on each sstable, it will just work. Increasing your streaming throughput in cassandra.yaml may speed this up a bit as well.
If by chance you're using OpsCenter, this backup and restore to a new cluster is automated as well.

Forcing request from local node

Can I force a query to fetch data from the local node itself? We have two data centers with replication factor 3 in each, and I want to check whether replication is working properly 1) across nodes and 2) across data centers. Can I force a query to read only from a particular node, to see whether the data is present on that node? I know getendpoints will tell me the nodes if I give it a key, but if I want to check table updates in general and see whether the data is being replicated, how can I do that? Apart from LOCAL_QUORUM, do we have any other option? Thanks

I can understand why someone might want to do this (as a sanity check), and I'll put some instructions on how to do it below. However, first of all, it's not really necessary. Because Cassandra is a distributed system, under normal circumstances you don't need to check that data is on a given node: replication places each row on nodes according to the replication strategy and where the snitch determines placement. So for a replication factor like DC1:3 and DC2:3, a row may be on any 3 nodes in each DC. As long as you can query the cluster as a whole, or each DC, and get the right results, you know that your replication is working OK.
Having said that, here is how you find a key. The caveat is that it has to be flushed to disc (you might need to invoke nodetool flush). It may seem convoluted, but this is how you trace a key to an sstable, so you may find it useful to know:
Use nodetool getendpoints to locate the nodes the key is on
Use nodetool getsstables to find the sstables the key is present in
Locate the file on disc and then use sstable2json to view the contents of the table, or use sstablekeys to just view the keys
Note: the sstable2json tools are only listed under Cassandra 1.2 docs but they should still be present in 2.1, at least I verified they are and have used them in DSE4.7 and 4.8.
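The first two steps can be chained so you see, for one key, which replicas own it and which sstables hold it on each replica. This is a sketch: it shells out to nodetool getendpoints and nodetool getsstables, assumes nodetool -h can reach each replica's JMX port (often it is only reachable locally, in which case run the getsstables part on each node), and only finds data that has already been flushed. The runner parameter is a hypothetical hook so the logic can be exercised without a cluster.

```python
import subprocess


def run_nodetool(args: list) -> str:
    """Default runner: execute nodetool with `args` and return its stdout."""
    return subprocess.run(["nodetool", *args], capture_output=True,
                          text=True, check=True).stdout


def nodetool_lines(args: list, runner=run_nodetool) -> list:
    """Run a nodetool subcommand and return its non-empty output lines."""
    return [line.strip() for line in runner(args).splitlines() if line.strip()]


def trace_key(keyspace: str, table: str, key: str, runner=run_nodetool) -> dict:
    """Map each replica of `key` to the sstable files containing it on that replica."""
    endpoints = nodetool_lines(["getendpoints", keyspace, table, key], runner)
    return {
        host: nodetool_lines(["-h", host, "getsstables", keyspace, table, key], runner)
        for host in endpoints
    }
```

A replica that shows up in getendpoints but returns no sstables is the one to inspect further (or to flush and check again).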

cassandra cluster, 1 table, how to plan forward

I am planning to create an application that will use just 1 Cassandra table. The replication factor will probably be 2 or 3. I might start with 2 Cassandra servers and then keep adding servers as needed. But I am not sure whether I need to pre-plan anything so that the table is distributed uniformly when I add more servers. Are there any best practices or things I need to be aware of? I read about tokens, http://www.datastax.com/docs/1.1/initialize/token_generation , but I am not sure what I need to do.
I suppose the keys have to be distributed uniformly across the cluster, so:
how will that happen, i.e. when I add the 2nd server and the 1st one already has 1 million keys?
do I need to pre-plan the keyspace or tables?

I can suggest two things.
First, when designing your schema, pick a good partition key (1st column in the primary key). You need to ensure a couple of things:
There are enough values such that you can distribute it to an arbitrary amount of nodes. For example, sex would be a bad partition key, because you only have two values and therefore can only distribute it to two nodes.
The distribution across different partition key values is more or less uniform. For example, country might not be best, because you will most likely have most of your rows in just a few unique countries.
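The effect of a low-cardinality partition key is easy to see in a toy model: hash each distinct key to a token and count which node is the primary owner. The sketch uses an MD5-based 64-bit token as a stand-in (Cassandra's default Murmur3Partitioner uses a different hash, but the point does not depend on it) and a hypothetical 4-node ring with one token per node.

```python
import hashlib
from collections import Counter


def token_for(key: str) -> int:
    """Map a partition key to a signed 64-bit token (MD5 stand-in, not Murmur3)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big", signed=True)


def primary_owner(token: int, ring: list) -> str:
    """ring is a sorted list of (node_token, node); a token belongs to the first
    node token >= it, wrapping around to the first entry."""
    for node_token, node in ring:
        if token <= node_token:
            return node
    return ring[0][1]


def partitions_per_node(keys, ring) -> Counter:
    """Count distinct partitions whose primary replica lands on each node."""
    return Counter(primary_owner(token_for(k), ring) for k in set(keys))


# Hypothetical 4-node ring splitting the signed 64-bit token space evenly.
RING = [(-2**62, "node1"), (0, "node2"), (2**62, "node3"), (2**63 - 1, "node4")]
```

However many rows there are, a key like sex produces only two distinct partitions and therefore at most two primary owners, while something like a user ID spreads over every node.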
Secondly, to ease deployment of new nodes later consider setting up your cluster to use virtual nodes (vnodes). If you do that you will be able to skip a few steps when expanding your cluster.
To configure virtual nodes, set num_tokens in cassandra.yaml to more than 1. This will decide how many virtual nodes your node will have. A recommended value is 256.
Later, when you add new nodes, you need to make sure auto_bootstrap is true (the default) in cassandra.yaml on your new nodes. Then you configure the network parameters as usual to match your cluster, and finally start the node. It should automatically bootstrap and start streaming the appropriate data. After everything has settled down, you can run cleanup (nodetool cleanup) on your other nodes so they purge the data they're no longer responsible for.
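To see why vnodes make growing from 2 servers painless, you can compute how much of the ring each node owns before and after a third node joins. The sketch assumes the classic vnode behaviour of picking random tokens (newer versions can use a token allocation algorithm instead) and the Murmur3 token range; node names and the seed are hypothetical.

```python
import random

RING_SIZE = 2**64                       # size of the Murmur3Partitioner token space
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1


def random_tokens(count: int, rng: random.Random) -> list:
    """Pick `count` random tokens, as a vnode node without initial_token would (simplified)."""
    return [rng.randint(MIN_TOKEN, MAX_TOKEN) for _ in range(count)]


def ownership(assignments: dict) -> dict:
    """{node: [tokens]} -> {node: fraction of the ring it is primary for}.
    Each token owns the range (previous token, token], wrapping at the end."""
    ring = sorted((t, node) for node, tokens in assignments.items() for t in tokens)
    owned = dict.fromkeys(assignments, 0)
    if len(ring) == 1:
        owned[ring[0][1]] = RING_SIZE
    else:
        for i, (token, node) in enumerate(ring):
            prev_token = ring[i - 1][0]  # for i == 0 this wraps to the last token
            owned[node] += (token - prev_token) % RING_SIZE
    return {node: size / RING_SIZE for node, size in owned.items()}
```

With 256 tokens each, a new third node ends up owning roughly a third of the ring, taken in small slices from both existing nodes, which is why no manual token planning is needed; the trade-off is the larger number of ranges that later answers warn about.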
For more detailed documentation, please see http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_add_node_to_cluster_t.html

DynamicSnitch Reads from empty new datacenter

When adding a new datacenter the dynamicSnitch causes us to read data from the new dc when the data is not there yet.
We have a Cassandra (1.0.11) cluster running in 3 datacenters and we want to add a fourth datacenter. The cluster is configured with PropertyFileSnitch, with DynamicSnitch enabled and a badness factor of 0.0. The relevant keyspaces' replication factors are DC1:2, DC2:2, DC3:2. Our plan was to add the new datacenter to the ring, add it to the schema, and run a rolling repair -pr on all the nodes so the new nodes would get all of the data they need.
Once we started the process, we noticed that the new datacenter receives read calls from the other datacenters because it has a lower load, and the DynamicSnitch decides it is better to read from it. The problem is that the datacenter doesn't have the data yet and returns no results.
We tried removing the DynamicSnitch entirely, but once we did, every time a single server got a bit of load we experienced extreme performance degradation.
Has anyone encountered this issue?
Is there a way to directly influence the score of a specific data center so it won't be picked by the DynamicSnitch?
Are there any better ways to add a datacenter in Cassandra 1.0.11? Has anyone written a snitch that handles these issues?
Thanks,
Izik.

You could bootstrap the nodes instead of adding them to the ring without bootstrap and then repairing. Bootstrapping ensures that no reads will be routed to a node until it has all the data it needs. (That is why Cassandra defaults to auto_bootstrap: true, and in fact disabling it is a sufficiently bad idea that we removed it from the example cassandra.yaml.)
The problem with this, and the reason the documentation recommends adding all the nodes first without bootstrap, is that if you have N replicas configured for DC4, Cassandra will try to replicate the entire dataset for that keyspace to the first N nodes you add, which can be problematic!
So here are the options I see:
If your dataset is small enough, go ahead and use the bootstrap plan
Increase ConsistencyLevel on your reads so that they will always touch a replica that does have the data, as well as one that does not
Upgrade to 1.2 and use ConsistencyLevel.LOCAL_ONE on your reads which will force it to never make cross-DC requests
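For the second option, the consistency level needed is simple pigeonhole arithmetic: if a read waits for more replies than there are still-empty replicas, at least one reply must come from a replica that has the data. The sketch assumes DC4 was added with 2 replicas like the other DCs (the question does not state its factor); the function names are illustrative only.

```python
def quorum(total_replicas: int) -> int:
    """Responses needed for QUORUM across all data centers."""
    return total_replicas // 2 + 1


def guarantees_populated_replica(responses_required: int, empty_replicas: int) -> bool:
    """True if any read waiting for this many replies must include a replica with data."""
    return responses_required > empty_replicas


# Assumed layout: DC1:2, DC2:2, DC3:2 plus the new, still-empty DC4:2.
REPLICATION = {"DC1": 2, "DC2": 2, "DC3": 2, "DC4": 2}
EMPTY = REPLICATION["DC4"]
TOTAL = sum(REPLICATION.values())
```

Under that assumption ONE and TWO can still be answered entirely by DC4, while THREE or QUORUM (5 of 8) always reach a populated replica, and the digest mismatch then lets Cassandra return the data that exists.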
