What is the correct order to restart a cluster for point-in-time restore? - cassandra

I have a mixed-workload cluster across multiple datacenters. I have run the sstableloader command for the tables I want to restore, using the snapshots I had backed up. I have added the commit log files I had backed up from the archive to a restore directory on all nodes, and I have updated the commitlog_archiving.properties file accordingly.
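For reference, a point-in-time restore setup in commitlog_archiving.properties typically looks something like this (the restore directory and timestamp below are illustrative placeholders, not my actual values):

# illustrative values only
restore_command=cp -f %from %to
restore_directories=/var/lib/cassandra/commitlog_restore
restore_point_in_time=2023:01:01 00:00:00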
What is the correct way and order to restart nodes of my cluster?
Do these considerations apply for restarting as well?

As a general rule, we recommend restarting the seed nodes in a DC before the other nodes so that gossip propagation happens faster, particularly for larger clusters (arbitrarily, 15+ nodes). It is important to note that a restart is not required if you restored data using sstableloader.
If you are just performing a rolling restart, the order of the DCs does not matter. It does matter if you are starting up a cluster from a cold shutdown, meaning all nodes are down and the cluster is completely offline.
When starting from a cold shutdown, it is important to start with the "Analytics DC" (nodes running in Analytics mode, i.e. with Spark enabled) because doing so makes it easier to elect a Spark master. Assuming the Analytics keyspaces are configured with the recommended replication factor of 3, you will need to start 2 or 3 nodes, beginning with the seeds, ideally 1 minute apart, because the LeaderManager requires a quorum of nodes to elect a Spark master.
We recommend leaving DCs with nodes running in Search mode (with Solr enabled) until last, as a matter of convenience, so that all the other DCs are operational before the cluster starts accepting Search requests from the application(s). Cheers!
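To make that cold-start ordering concrete, here is a minimal sketch, assuming hypothetical host names and that Cassandra/DSE is managed as a system service:

# Cold start: Analytics DC seeds first, ~1 minute apart,
# so a quorum is available to elect the Spark master
for host in analytics-seed1 analytics-seed2 analytics-node3; do
    ssh "$host" 'sudo service dse start'   # plain Cassandra would use 'service cassandra start'
    sleep 60
done
# Then the remaining Analytics nodes, then the other DCs (seeds first),
# and the Search (Solr) DC last.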

If you've done all of that, I don't think the order matters too much. You should restart your seed nodes first, though, so that the nodes in the cluster have a common entry point to find their way back in and correctly rejoin.

Related

Hazelcast High Availability in case of 3 nodes cluster

We are using Hazelcast IMDG as an in-memory grid. The number of nodes in our cluster is three, we have one sync backup, and the cluster is partition-aware. In that case, I expect the distributed map to be spread across the 3 nodes (more or less) homogeneously. If a node breaks down, leadership should be transferred to a healthy node (the one which has the sync backup for the lost data). If there is a write request to this newly assigned leader node, the same partition should be replicated synchronously to one of the surviving nodes. Does this mean that, in case of a node failure, approximately one third of the distributed map has to be replicated, and that all reads are blocked during that replication? Is availability hit, with one sync backup, when one of the three nodes is down, until approximately one third of the distribution is restored?
If a node goes down, the cluster will promote the backup partitions to primary.
And the migrations will start to create backups of these new primary partitions.
Please check the Data Partitioning section.
During migrations, read operations are not blocked.
Only write operations are blocked on the partition that is actively migrating.
Since the partitions are migrated one by one, the effect on availability is minimal.
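For context, the "one sync backup" described in the question corresponds to a map backup-count of 1, which is also Hazelcast's default; a minimal hazelcast.xml sketch, with a hypothetical map name:

<hazelcast xmlns="http://www.hazelcast.com/schema/config">
    <map name="myDistributedMap">
        <!-- one synchronous backup of every partition, no async backups -->
        <backup-count>1</backup-count>
        <async-backup-count>0</async-backup-count>
    </map>
</hazelcast>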

Way to determine healthy Cassandra cluster?

I've been tasked with re-writing some sub-par Ansible playbooks to stand up a Cassandra cluster in CentOS. Quite frankly, there doesn't seem to be much information on Cassandra out there.
I've managed to get the service running on all three nodes at the same time, using the following configuration file, info scrubbed.
HOSTIP=10.0.0.1
MSIP=10.10.10.10
ADMIN_EMAIL=my@email.com
LICENSE_FILE=/tmp/license.conf
USE_LDAP_REMOTE_HOST=n
ENABLE_AX=y
MP_POD=gateway
REGION=test-1
USE_ZK_CLUSTER=y
ZK_HOSTS="10.0.0.1 10.0.0.2 10.0.0.3"
ZK_CLIENT_HOSTS="10.0.0.1 10.0.0.2 10.0.0.3"
USE_CASS_CLUSTER=y
CASS_HOSTS="10.0.0.1:1,1 10.0.0.2:1,1 10.0.0.3:1,1"
CASS_USERNAME=test
CASS_PASSWORD=test
The HOSTIP changes depending on which node the configuration file is on.
The problem is, when I run nodetool ring, each node says there are only two nodes in the cluster: itself and one other, seemingly chosen at random from the other two.
What are some basic sanity checks to determine a "healthy" Cassandra cluster? Why is nodetool saying each one thinks there's a different node missing from the cluster?
nodetool status - overview of the cluster (load, state, ownership)
nodetool info - more granular details at the node-level
As for the node mismatch I would check the following:
cassandra-topology.properties - identical across the cluster (all 3 IPs listed)
cassandra.yaml - I typically keep this file the same across all nodes. The parameters that MUST stay the same across the cluster are: cluster_name, seeds, partitioner, and snitch.
verify all nodes can reach each other (ping, telnet, etc.; see the sketch below)
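A quick connectivity check, assuming the default ports (7000 inter-node, 9042 CQL), a standard nc, and the placeholder IPs from the question:

for ip in 10.0.0.1 10.0.0.2 10.0.0.3; do
    ping -c 1 "$ip" >/dev/null && echo "$ip reachable" || echo "$ip UNREACHABLE"
    nc -z -w 2 "$ip" 7000 || echo "$ip: storage port 7000 closed"
    nc -z -w 2 "$ip" 9042 || echo "$ip: CQL port 9042 closed"
done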
DataStax (Cassandra Vendor) has some good documentation. Please note that some features are only available on DataStax Enterprise -
http://docs.datastax.com/en/landing_page/doc/landing_page/current.html
Also check out the Apache Cassandra site -
http://cassandra.apache.org/community/
As well as the user forums -
https://www.mail-archive.com/user@cassandra.apache.org/
Actually, the thing you really want to check is whether all the nodes AGREE on the schema_id. nodetool status shows whether nodes are up, down, or joining, yet that does not really mean the cluster is 'healthy' enough to make schema changes or other changes.
The simplest way is:
nodetool describecluster
Cluster Information:
Name: FooBarCluster
Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
DynamicEndPointSnitch: enabled
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
43fe9177-382c-327e-904a-c8353a9df590: [10.136.2.1, 10.136.2.2, 10.136.2.3]
If the schema IDs do not match (for example, as below), you need to wait for the schema to settle, or run repairs:
43fe9177-382c-327e-904a-c8353a9df590: [10.136.2.1, 10.136.2.2]
43fe9177-382c-327e-904a-c8353a9dxxxx: [10.136.2.3]
However, running nodetool is 'heavy' and hard to parse.
The information is inside the database, you can check here:
'SELECT schema_version, release_version FROM system.local' and
'SELECT peer, schema_version, release_version FROM system.peers'
Then you compare schema_version across all nodes... if they match, the cluster is very likely healthy. You should ALWAYS check this before making any changes to schema.
Now, during a rolling upgrade, when changing engine versions, the release_version differs across nodes, so to support automatic rolling upgrades you need to check schema_version agreement within each release_version separately.
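A minimal sketch of that check from a single host, assuming cqlsh access to every node and placeholder IPs; it pulls the schema_version UUID out of each node's system.local:

for ip in 10.136.2.1 10.136.2.2 10.136.2.3; do
    v=$(cqlsh "$ip" -e "SELECT schema_version FROM system.local;" \
        | grep -Eo '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}')
    echo "$ip $v"
done
# All nodes should print the same UUID; an outlier has not converged yet.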
I'm not sure all of the problems you might be having, but...
Check the cassandra.yaml file. You need a minimum of three things to be the same: the seeds list (but do not list all nodes as seeds!), cluster_name, and the snitch. Make sure your listen_address is correct.
If you are using GossipingPropertyFileSnitch, then check the cassandra-topology.properties and/or cassandra-rackdc.properties files for accuracy.
Don't start all the nodes at the same time. Start the seed nodes first; the other nodes will "gossip" with the seed nodes to learn the cluster topology. Shut down the seed nodes last.
Don't use shared storage. That defeats the purpose of distributed data and is considered a cassandra anti-pattern.
If you're in AWS, don't use auto-scaling groups unless you know what you're doing.
Once you've done all that, use nodetool status | ring | info or jmx to see what the cluster is doing.
Datastax does have decent documentation for cassandra.
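To illustrate the settings called out above, a minimal per-node cassandra.yaml sketch (cluster name and IPs are placeholders):

cluster_name: 'MyCluster'            # must be identical on every node
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.0.0.1,10.0.0.2"    # same list everywhere; not every node
endpoint_snitch: GossipingPropertyFileSnitch    # must match across the cluster
listen_address: 10.0.0.1             # this node's own IP; differs per node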

Can we backup only one availability zone for AZ replicated Cassandra cluster

Since my Cassandra cluster is replicated across three availability zones, I would like to backup only one availability zone to lower the backup costs. I have also experimented restoring nodes in a single availability zone and got back most of my data in a test environment. I would like to know if there are any drawbacks to this approach before deploying this solution in production. Is anyone following this approach in your production clusters?
Note: As I back up at regular intervals, I know that I may lose updates that only reached the nodes in the other two AZs at the time of the snapshot, but that's not a problem.
You can back up only a specific DC, or even specific nodes.
AFAIK, the only drawback is whether your data is consistent/up-to-date, and since you can afford to lose some data, it shouldn't be a problem. If you are, for example, performing writes with the ALL consistency level, the data should be up-to-date on all nodes.
BUT, you must be sure that your data is indeed replicated between multiple AZs, by setting the rack/DC properties appropriately or by using the EC2 snitch, which supports multiple AZs.
EDIT:
Global Snapshot
Running nodetool snapshot only snapshots a single node at a time, which creates just a partial backup of your entire data. You will want to run nodetool snapshot on all of the nodes in your cluster, and it's best to run them at the exact same time, so that you don't have fragmented data from a time perspective. You can do this a couple of different ways. The first is to use a parallel ssh program to execute the nodetool snapshot command at the same time; the second is to create a cron job on each of the nodes to run at the same time. The second assumes that your nodes have clocks that are in sync, which Cassandra relies on as well.
Link to the page:
http://datascale.io/backing-up-cassandra-data/
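The parallel-ssh approach from the quote could look something like this sketch, assuming the pssh tool, passwordless SSH, and a hosts.txt file listing all nodes:

# One snapshot tag, expanded once locally, so every node uses the same name
pssh -h hosts.txt -i "nodetool snapshot -t backup_$(date +%Y%m%d)"

Because $(date ...) is expanded on the machine running pssh, every node gets an identically named snapshot even if their clocks drift slightly.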

Enabling vNodes in Cassandra 1.2.8

I have a 4 node cluster and I have upgraded all the nodes from an older version to Cassandra 1.2.8. Total data present in the cluster is of size 8 GB. Now I need to enable vNodes on all the 4 nodes of cluster without any downtime. How can I do that?
As Nikhil said, you need to increase num_tokens and restart each node. This can be done one node at a time with no downtime.
However, increasing num_tokens doesn't cause any data to redistribute so you're not really using vnodes. You have to redistribute it manually via shuffle (explained in the link Lyuben posted, which often leads to problems), by decommissioning each node and bootstrapping back (which will temporarily leave your cluster extremely unbalanced with one node owning all the data), or by duplicating your hardware temporarily just like creating a new data center. The latter is the only reliable method I know of but it does require extra hardware.
In conf/cassandra.yaml you will need to comment out the initial_token parameter and enable the num_tokens parameter (256 by default, I believe). Do this for each node, then restart the Cassandra service on each node and wait for the data to get redistributed throughout the cluster. 8 GB should not take too much time (provided your nodes are all in the same cluster), and read requests will still be functional, though you might see degraded performance until the redistribution of data is complete.
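The cassandra.yaml change being described, sketched:

# conf/cassandra.yaml
# initial_token:          <- comment out the old single-token assignment
num_tokens: 256           # enables vnodes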
EDIT: Here is a potential strategy to migrate your data:
Decommission two nodes of the cluster. The token-space should get distributed 50-50 between the other two nodes.
On the two decommissioned nodes, remove the existing data, and restart the Cassandra daemon with a different cluster name and with the num_token parameters enabled.
Migrate the 8 GB of data from the old cluster to the new cluster. You could write a quick script in Python to achieve this (or drive sstableloader, as sketched after these steps). Since the volume of data is small enough, this should not take too much time.
Once the data is migrated in the new cluster, decommission the two old nodes from the old cluster. Remove the data and restart Cassandra on them, with the new cluster name and the num_tokens parameter. They will bootstrap and data will be streamed from the two existing nodes in the new cluster. Preferably, only bootstrap one node at a time.
With these steps, you should never face a situation where your service is completely down. You will be running with reduced capacity for some time, but again since 8GB is not a large volume of data you might be able to achieve this quickly enough.
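One way the migration step above could be done, sketched with sstableloader instead of a custom Python script (the keyspace, table, paths, and target IP are placeholders):

# On an old-cluster node: snapshot, then arrange the SSTables in the
# keyspace/table layout that sstableloader expects and stream them over
nodetool snapshot -t migrate mykeyspace
mkdir -p /tmp/load/mykeyspace/mytable
cp /var/lib/cassandra/data/mykeyspace/mytable/snapshots/migrate/* /tmp/load/mykeyspace/mytable/
sstableloader -d 10.0.0.3 /tmp/load/mykeyspace/mytable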
TL;DR:
No, you need to restart the servers once the config has been edited.
The problem is that enabling vnodes means a lot of the data is redistributed randomly (the docs say "in a vein similar to the classic 'nodetool move'").

Best way to shrink a Cassandra cluster

So there is a fair amount of documentation on how to scale up a Cassandra cluster, but is there a good resource on how to "unscale" Cassandra and remove nodes from the cluster? Is it as simple as turning off a node, letting the cluster sync up again, and repeating?
The reason is for a site that expects high spikes of traffic, climbing from the daily few thousand hits to hundreds of thousands over a few days. The site will be "ramped up" beforehand, starting up multiple instances of the web server, Cassandra, etc. After the torrent of requests subsides, the goal is to turn off the instances that are no longer used, rather than pay for servers that are just sitting around.
If you just shut the nodes down and rebalance the cluster, you risk losing data that exists only on the removed nodes and hasn't been replicated yet.
A safe cluster shrink can easily be done with nodetool. First, run:
nodetool drain
... on the node being removed, to stop accepting writes and flush memtables, then:
nodetool decommission
... to move the node's data to other nodes. Then shut the node down, and run on some other node:
nodetool removetoken
... to remove the node from the cluster completely. The detailed documentation can be found here: http://wiki.apache.org/cassandra/NodeTool
From my experience, I'd recommend removing nodes one by one, not in batches. It takes more time, but it is much safer in case of network outages or hardware failures.
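Putting those commands together, a per-node removal sketch following the sequence above (the host name is a placeholder):

# On the node being removed: stop accepting writes, flush, then stream data away
ssh node-to-remove 'nodetool drain && nodetool decommission'

# Watch streaming progress from any node
nodetool netstats

# Only if the node died before it could decommission itself, run from a live node
# (removetoken is the older syntax; newer releases call it removenode):
nodetool removetoken <token-of-the-dead-node>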
When you remove nodes you may have to re-balance the cluster, moving some nodes to a new token. In a planned downscale, you need to:
1 - minimize the number of moves;
2 - if you have to move a node, minimize the amount of transferred data.
There's an article about cluster balancing that may be helpful:
Balancing Your Cassandra Cluster
Also, the beginning of this video is about add-node and remove-node operations and the best strategies to minimize the cluster impact of each of these operations.
Hopefully, these 2 references will give you enough information to plan your downscale.
First, on the node which will be removed, flush memory (memtables) to SSTables on disk:
nodetool flush
Second, run the command to leave the cluster:
nodetool decommission
This command will assign the ranges that the node was responsible for to other nodes and replicate the data appropriately.
To monitor the process you can use the command:
nodetool netstats
Found an article on how to remove nodes from Cassandra. It was helpful for me when scaling down Cassandra. All actions are described there step by step.
