We currently use Ec2Snitch with two AZs in a single AWS region. The goal was to provide resiliency even when one AZ is unavailable. Most data is replicated with RF=2, so with Ec2Snitch each AZ gets a copy.
We have now decided to move to GossipingPropertyFileSnitch. The main reason is that we have realized an AZ outage is a rare occurrence, and even if it happens, other systems in our stack don't tolerate it, so the whole app goes down anyway.
The other reason is that with Ec2Snitch and two AZs, we had to scale in multiples of 2 (one node in each AZ). With GossipingPropertyFileSnitch and just one rack, we can scale one node at a time.
When we change this snitch setting, will the topology change? I want to avoid having to run nodetool repair. Repairs have always failed for us, and they run forever.
Whether the topology changes depends on how you carry out the change. If you assign each node the same logical dc and rack that it is currently configured with, you shouldn't get a topology change.
After switching to GossipingPropertyFileSnitch, the rack you configure has to match the AZ the node was in under Ec2Snitch. You need to do a rolling restart for the reconfiguration to take effect.
Example cassandra-rackdc.properties for 2 nodes in 1 dc across 2 AZs:
# node=10.0.0.1, dc=first, AZ=1
dc_suffix=first
# Becomes
dc=first
rack=1
# node=10.0.0.2, dc=first, AZ=2
dc_suffix=first
# Becomes
dc=first
rack=2
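A minimal sketch of the mapping above, assuming you take the dc and rack names each node currently reports and write them out unchanged. The helper name `render_rackdc` and the node table are hypothetical and only mirror the example; they are not part of Cassandra.

```python
# Hypothetical helper: render cassandra-rackdc.properties content for
# GossipingPropertyFileSnitch from a node's *current* dc and rack names.

def render_rackdc(dc: str, rack: str) -> str:
    """Return the dc/rack properties as file content."""
    if not dc or not rack:
        raise ValueError("dc and rack must both be non-empty")
    return f"dc={dc}\nrack={rack}\n"


# Current topology taken from the example above (node -> (dc, rack)).
current_topology = {
    "10.0.0.1": ("first", "1"),
    "10.0.0.2": ("first", "2"),
}

for node, (dc, rack) in current_topology.items():
    print(f"# {node}")
    print(render_rackdc(dc, rack), end="")
```

Keeping the names identical to what the cluster already uses is the whole point: any difference in dc or rack is what would count as a topology change.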
On a side note, you need to find out why repairs are failing. Unfortunately, they are very important for cluster health.
I have a mixed workload cluster across multiple datacenters. I have run the sstableloader command for the tables I want to restore, using snapshots I had backed up. I have copied the commit log files I had backed up from the archive into a restore directory on all nodes, and I have updated the commitlog_archiving.properties file with these configs.
What is the correct way and order to restart nodes of my cluster?
Do these considerations apply for restarting as well?
As a general rule, we recommend restarting the seed nodes in a DC before the other nodes so that gossip propagation happens faster, particularly for larger clusters (arbitrarily, 15+ nodes). It is important to note that a restart is not required if you restored data using sstableloader.
If you are just performing a rolling restart, then the order of the DCs does not matter. It does matter if you are starting up a cluster from a cold shutdown, meaning all nodes are down and the cluster is completely offline.
When starting from a cold shutdown, it is important to start with the "Analytics DC" (nodes running in Analytics mode, i.e. with Spark enabled) because it makes it easier to elect a Spark master. Assuming that replication for the Analytics keyspaces is configured with the recommended replication factor of 3, you will need to start 2 or 3 nodes, beginning with the seeds and ideally 1 minute apart, because the LeaderManager requires a quorum of nodes to elect a Spark master.
We recommend leaving DCs with nodes running in Search mode (with Solr enabled) last as a matter of convenience so that all the other DCs are operational before the cluster starts accepting Search requests from the application(s). Cheers!
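The ordering rules above can be sketched as a simple sort, assuming each DC runs a single workload. The `Node` type, the workload labels and the addresses below are hypothetical and only illustrate the rule "Analytics first, Search last, seeds before other nodes"; they are not a DataStax API.

```python
from dataclasses import dataclass

# Cold-start priority per workload, following the advice above.
# The label strings themselves are assumptions made for this sketch.
WORKLOAD_ORDER = {"analytics": 0, "cassandra": 1, "search": 2}


@dataclass(frozen=True)
class Node:
    address: str
    dc: str
    workload: str  # one of the keys in WORKLOAD_ORDER
    is_seed: bool


def cold_start_order(nodes):
    """Sort by DC workload, then DC name, then seeds before non-seeds."""
    return sorted(
        nodes,
        key=lambda n: (WORKLOAD_ORDER[n.workload], n.dc, not n.is_seed, n.address),
    )


example = [
    Node("10.0.2.1", "search_dc", "search", True),
    Node("10.0.0.2", "analytics_dc", "analytics", False),
    Node("10.0.0.1", "analytics_dc", "analytics", True),
    Node("10.0.1.1", "oltp_dc", "cassandra", True),
]

for node in cold_start_order(example):
    print(node.address, node.dc, "seed" if node.is_seed else "non-seed")
```

The sketch only decides the order; pacing the starts (for example, about a minute apart in the Analytics DC) is left to whatever tool actually starts the nodes.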
If you've done all of that, I don't think the order matters too much. That said, you should restart your seed nodes first so that the other nodes in the cluster have a common entry point to find their way back in and rejoin correctly.
I already have a working datacenter with 3 nodes (replication factor 2). I want to add another datacenter with only one node that holds a backup copy of all data from the existing datacenter. The final setup:
dc1: 3 nodes (2 rf)
dc2: 1 node (1 rf)
My application would then connect only to dc1 nodes and send data. If dc1 breaks down, I can recover the data from dc2, which is on another physical machine in a different location. I could also use dc2 for AI queries or some other task. I'm new to Cassandra configuration, so I want to know whether I'm making a mistake in my thinking. I'm planning to use these docs to add the new dc: https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsAddDCToCluster.html
Is there anything more I should keep in mind to get this to work or some easier solution to have data backup?
Update: It won't only be a backup; we also want the application to connect to this second DC when dc1 is unavailable (e.g. during a power outage).
Update: dc2 is running. I had some problems with copying data from one dc to the other, and nodetool status didn't show both DCs, but after fixing the firewall rules for port 7000 I managed to connect both DCs and share data between them.
With this approach, your single node in dc2 will receive every write, so it will get more traffic than any node in dc1 (roughly 1.5 times as much, since each dc1 node holds about two thirds of the data with RF=2 on 3 nodes). It may also add load to the nodes in dc1, because they will need to store hints, etc. whenever the node in dc2 is unavailable. If you just need a backup, set up something like Medusa and store the data somewhere cheap, such as S3. Of course, it will take time to restore if you lose the whole DC.
But in reality, you need to think about your high-availability strategy: what will happen to your clients if you lose the primary DC? Is it acceptable to wait until recovery, or do you really require full fault tolerance? I recommend reading the Designing Fault-Tolerant Applications with DataStax and Apache Cassandra™ whitepaper from DataStax, which explains the details of designing truly fault-tolerant applications.
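For the dc1/dc2 layout discussed above, the per-node write load can be estimated with a little arithmetic. This is a rough sketch that assumes evenly balanced token ownership and ignores reads, hints and repair traffic; the function name is hypothetical.

```python
from fractions import Fraction


def write_share_per_node(replication_factor: int, node_count: int) -> Fraction:
    """Fraction of all writes that lands on each node of one datacenter.

    Every write is stored on `replication_factor` of the DC's `node_count`
    nodes, so with balanced ownership each node sees RF / N of the writes.
    """
    if not 1 <= replication_factor <= node_count:
        raise ValueError("need 1 <= replication_factor <= node_count")
    return Fraction(replication_factor, node_count)


dc1_share = write_share_per_node(2, 3)  # 3 nodes, RF=2
dc2_share = write_share_per_node(1, 1)  # 1 node, RF=1

print("dc1 per-node share:", dc1_share)       # 2/3
print("dc2 per-node share:", dc2_share)       # 1
print("dc2 / dc1 ratio:", dc2_share / dc1_share)  # 3/2
```

Under these assumptions the single dc2 node handles every write, while each dc1 node handles two thirds of them.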
I've been tasked with rewriting some sub-par Ansible playbooks that stand up a Cassandra cluster on CentOS. Quite frankly, there doesn't seem to be much information on Cassandra out there.
I've managed to get the service running on all three nodes at the same time, using the following configuration file, info scrubbed.
HOSTIP=10.0.0.1
MSIP=10.10.10.10
ADMIN_EMAIL=my@email.com
LICENSE_FILE=/tmp/license.conf
USE_LDAP_REMOTE_HOST=n
ENABLE_AX=y
MP_POD=gateway
REGION=test-1
USE_ZK_CLUSTER=y
ZK_HOSTS="10.0.0.1 10.0.0.2 10.0.0.3"
ZK_CLIENT_HOSTS="10.0.0.1 10.0.0.2 10.0.0.3"
USE_CASS_CLUSTER=y
CASS_HOSTS="10.0.0.1:1,1 10.0.0.2:1,1 10.0.0.3:1,1"
CASS_USERNAME=test
CASS_PASSWORD=test
The HOSTIP changes depending on which node the configuration file is on.
The problem is that when I run nodetool ring, each node says there are only two nodes in the cluster: itself and one other, seemingly chosen at random from the other two.
What are some basic sanity checks to determine a "healthy" Cassandra cluster? Why is nodetool saying each one thinks there's a different node missing from the cluster?
nodetool status - overview of the cluster (load, state, ownership)
nodetool info - more granular details at the node-level
As for the node mismatch I would check the following:
cassandra-topology.properties - identical across the cluster (all 3 IPs listed)
cassandra.yaml - I typically keep this file the same across all nodes. The parameters that MUST stay the same across the cluster are: cluster_name, seeds, partitioner, snitch.
verify all nodes can reach each other (ping, telnet, etc)
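The reachability check in the last item can be scripted. Below is a minimal sketch that tests whether a TCP connection can be opened to each peer; it assumes the default inter-node port 7000, and the function names are hypothetical. A successful connection only shows the port is reachable, not that the node is healthy.

```python
import socket


def can_connect(host: str, port: int = 7000, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


def unreachable_peers(peers, port: int = 7000, timeout: float = 3.0):
    """Return the subset of peers that could not be reached."""
    return [peer for peer in peers if not can_connect(peer, port, timeout)]
```

Run on each node in turn, for example `unreachable_peers(["10.0.0.1", "10.0.0.2", "10.0.0.3"])`; an empty list on every node means all of them can open the inter-node port on each other.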
DataStax (Cassandra Vendor) has some good documentation. Please note that some features are only available on DataStax Enterprise -
http://docs.datastax.com/en/landing_page/doc/landing_page/current.html
Also check out the Apache Cassandra site -
http://cassandra.apache.org/community/
As well as the user forums -
https://www.mail-archive.com/user@cassandra.apache.org/
Actually, the thing you really want to check is whether all the nodes "AGREE" on the schema version. nodetool status shows whether nodes are up, down, or joining, but that does not necessarily mean the cluster is 'healthy' enough for schema changes or other changes.
The simplest way is:
nodetool describecluster
Cluster Information:
Name: FooBarCluster
Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
DynamicEndPointSnitch: enabled
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
43fe9177-382c-327e-904a-c8353a9df590: [10.136.2.1, 10.136.2.2, 10.136.2.3]
If the schema versions do not match, you need to wait for the schema to settle, or run repairs. A mismatch looks like this, for example:
43fe9177-382c-327e-904a-c8353a9df590: [10.136.2.1, 10.136.2.2]
43fe9177-382c-327e-904a-c8353a9dxxxx: [10.136.2.3]
However, running nodetool is 'heavy' and its output is hard to parse.
The same information is stored inside the database, where you can query it:
'SELECT schema_version, release_version FROM system.local' and
'SELECT peer, schema_version, release_version FROM system.peers'
Then you compare schema_version across all nodes... if they match, the cluster is very likely healthy. You should ALWAYS check this before making any changes to schema.
Now, during a rolling upgrade, when changing engine versions, the release_version is different, so to support automatic rolling upgrades, you need to check schema_id matching within release_versions separately.
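The comparison described above can be sketched as follows, assuming you have already collected one (node, release_version, schema_version) row per node from system.local and system.peers with whatever client you use. The function names and the sample rows are hypothetical.

```python
from collections import defaultdict


def schema_versions_by_release(rows):
    """Group schema versions by release version.

    rows: iterable of (node, release_version, schema_version) tuples.
    Returns {release_version: set of schema_version values}.
    """
    grouped = defaultdict(set)
    for _node, release_version, schema_version in rows:
        grouped[release_version].add(schema_version)
    return dict(grouped)


def schema_is_settled(rows) -> bool:
    """True if, within each release version, all nodes share one schema version."""
    return all(len(v) == 1 for v in schema_versions_by_release(rows).values())


# Sample rows: two nodes agree, the third reports a different schema version.
rows = [
    ("10.136.2.1", "3.11.4", "43fe9177-382c-327e-904a-c8353a9df590"),
    ("10.136.2.2", "3.11.4", "43fe9177-382c-327e-904a-c8353a9df590"),
    ("10.136.2.3", "3.11.4", "43fe9177-382c-327e-904a-c8353a9dxxxx"),
]
print(schema_is_settled(rows))  # False
```

Grouping by release_version first is what lets the same check pass in the middle of a rolling upgrade, when nodes on different releases may legitimately report different schema versions.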
I'm not sure about all of the problems you might be having, but...
Check the cassandra.yaml file. At minimum, 3 things need to be the same on every node: the seeds list (but do not list all nodes as seeds!), cluster_name, and the snitch. Make sure your listen_address is correct.
If you are using GossipingPropertyFileSnitch, then check the cassandra-topology.properties and/or cassandra-rackdc.properties files for accuracy.
Don't start all the nodes at the same time. Start the seed nodes first; the other nodes will "gossip" with the seed nodes to learn the cluster topology. Shut down the seed nodes last.
Don't use shared storage. That defeats the purpose of distributed data and is considered a cassandra anti-pattern.
If you're in AWS, don't use auto-scaling groups unless you know what you're doing.
Once you've done all that, use nodetool status | ring | info or jmx to see what the cluster is doing.
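The first check in this list can be automated once each node's settings have been read into a dictionary. This is a minimal sketch; the flattened keys `cluster_name`, `seeds` and `snitch` are assumptions made for illustration and do not mirror the exact layout of cassandra.yaml.

```python
# Settings that should be identical on every node (hypothetical flat keys).
MUST_MATCH = ("cluster_name", "seeds", "snitch")


def _normalize(value):
    """Make list-like values order-independent and hashable."""
    if isinstance(value, (list, tuple, set)):
        return tuple(sorted(value))
    return value


def config_problems(configs):
    """configs: {node_address: {"cluster_name": ..., "seeds": [...], "snitch": ...}}

    Returns a list of human-readable problems (empty if none were found).
    """
    problems = []
    for key in MUST_MATCH:
        distinct = {_normalize(cfg.get(key)) for cfg in configs.values()}
        if len(distinct) > 1:
            problems.append(f"{key} differs across nodes")
    for node, cfg in configs.items():
        # Listing every node as a seed is what the advice above warns against.
        if set(cfg.get("seeds", [])) >= set(configs):
            problems.append(f"{node} lists every node as a seed")
    return problems
```

For example, `config_problems({"10.0.0.1": {...}, "10.0.0.2": {...}})` returns an empty list when the three settings agree and only some of the nodes are seeds.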
DataStax does have decent documentation for Cassandra.
Since my Cassandra cluster is replicated across three availability zones, I would like to back up only one availability zone to lower the backup costs. I have also experimented with restoring the nodes of a single availability zone and got back most of my data in a test environment. I would like to know if there are any drawbacks to this approach before deploying it in production. Is anyone following this approach in their production clusters?
Note: As I back up at regular intervals, I know that I may lose updates that had only reached a quorum of nodes in the other two AZs at the time of the snapshot, but that's not a problem.
You can back up only a specific dc, or even specific nodes.
AFAIK, the only drawback is whether your data is consistent and up to date, and since you can afford to lose some data, it shouldn't be a problem. And if you are, for example, performing writes with consistency level ALL, the data should be up to date on all nodes.
BUT you must be sure that your data is indeed replicated across multiple AZs, either by setting the rack/dc properties yourself or by using an EC2 snitch that supports multiple AZs.
EDIT:
Global Snapshot
Running nodetool snapshot is only run on a single node at a time.
This only creates a partial backup of your entire data. You will want
to run nodetool snapshot on all of the nodes in your cluster. But
it’s best to run them at the exact same time, so that you don’t have
fragmented data from a time perspective. You can do this a couple of
different ways. The first, is to use a parallel ssh program to
execute the nodetool snapshot command at the same time. The second,
is to create a cron job on each of the nodes to run at the same time.
The second assumes that your nodes have clocks that are in sync, which
Cassandra relies on as well.
Link to the page:
http://datascale.io/backing-up-cassandra-data/
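The "parallel ssh" option from the quote could look roughly like this. It is a sketch that assumes passwordless ssh to every node and nodetool on the remote PATH; the function names, host names and snapshot tag are hypothetical. The commands are started from a thread pool, so they begin at roughly, not exactly, the same moment.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def snapshot_command(host: str, tag: str):
    """Build the ssh command that runs `nodetool snapshot -t <tag>` on one host."""
    return ["ssh", host, "nodetool", "snapshot", "-t", tag]


def snapshot_all(hosts, tag: str, run=subprocess.run):
    """Start the snapshot on every host in parallel and collect the results.

    `run` is injectable so the function can be exercised without real hosts.
    With the default runner, a failing command raises CalledProcessError.
    """
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        futures = {
            host: pool.submit(run, snapshot_command(host, tag), check=True)
            for host in hosts
        }
        return {host: future.result() for host, future in futures.items()}
```

Usage would be something like `snapshot_all(["node1", "node2", "node3"], "nightly")`. Using one tag for all nodes makes it easy to tell which snapshot directories belong to the same backup run.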
I have a 4 node cluster and I have upgraded all the nodes from an older version to Cassandra 1.2.8. Total data present in the cluster is of size 8 GB. Now I need to enable vNodes on all the 4 nodes of cluster without any downtime. How can I do that?
As Nikhil said, you need to increase num_tokens and restart each node. This can be done one node at a time with no downtime.
However, increasing num_tokens doesn't cause any data to redistribute so you're not really using vnodes. You have to redistribute it manually via shuffle (explained in the link Lyuben posted, which often leads to problems), by decommissioning each node and bootstrapping back (which will temporarily leave your cluster extremely unbalanced with one node owning all the data), or by duplicating your hardware temporarily just like creating a new data center. The latter is the only reliable method I know of but it does require extra hardware.
In the conf/cassandra.yaml you will need to comment out the initial_token parameter, and enable the num_tokens parameter (by default 256 I believe). Do this for each node. Then you will have to restart the cassandra service on each node. And wait for the data to get redistributed throughout the cluster. 8 GB should not take too much time (provided your nodes are all in the same cluster), and read requests will still be functional, though you might see degraded performance until the redistribution of data is complete.
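The configuration edit described here can be sketched as a small text transformation. It assumes `initial_token` and `num_tokens` appear as top-level keys at the start of a line; the function name is hypothetical, and the default of 256 is the value mentioned above. As the other answer points out, this only changes the configuration and does not by itself move any data.

```python
import re


def enable_vnodes(yaml_text: str, num_tokens: int = 256) -> str:
    """Comment out initial_token and set num_tokens in cassandra.yaml text."""
    out = []
    saw_num_tokens = False
    for line in yaml_text.splitlines():
        if re.match(r"^initial_token\s*:", line):
            # Disable the single-token setting by commenting it out.
            out.append("# " + line)
        elif re.match(r"^#?\s*num_tokens\s*:", line):
            # Replace the first (possibly commented) num_tokens line;
            # any further num_tokens lines are dropped.
            if not saw_num_tokens:
                out.append(f"num_tokens: {num_tokens}")
                saw_num_tokens = True
        else:
            out.append(line)
    if not saw_num_tokens:
        out.append(f"num_tokens: {num_tokens}")
    return "\n".join(out) + "\n"


before = "cluster_name: 'Test'\ninitial_token: 0\n# num_tokens: 256\n"
print(enable_vnodes(before), end="")
```

Working on the text line by line, rather than parsing and re-serializing the YAML, leaves the comments and the ordering of the rest of the file untouched.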
EDIT: Here is a potential strategy to migrate your data:
Decommission two nodes of the cluster. The token-space should get distributed 50-50 between the other two nodes.
On the two decommissioned nodes, remove the existing data, and restart the Cassandra daemon with a different cluster name and with the num_tokens parameter enabled.
Migrate the 8 GB of data from the old cluster to the new cluster. You could write a quick script in Python to achieve this. Since the volume of data is small enough, this should not take too much time.
Once the data is migrated in the new cluster, decommission the two old nodes from the old cluster. Remove the data and restart Cassandra on them, with the new cluster name and the num_tokens parameter. They will bootstrap and data will be streamed from the two existing nodes in the new cluster. Preferably, only bootstrap one node at a time.
With these steps, you should never face a situation where your service is completely down. You will be running with reduced capacity for some time, but again since 8GB is not a large volume of data you might be able to achieve this quickly enough.
TL;DR:
No, you need to restart the servers once the config has been edited.
The problem is that enabling vnodes means a lot of the data is redistributed randomly (the docs say in a vein similar to the classic ‘nodetool move’