Cassandra multi datacenter, WAN hiccups and replication resiliency - cassandra

I have two DCs. My Cassandra ring spans both DCs. I use local quorum with replication factor=3. I do a write in DC1 with local quorum. Data gets written to multiple nodes in DC1. For the same write to propagate to DC2 only one copy is sent from the coordinator node in DC1 to a coordinator node in DC2 for optimizing traffic over the WAN (from what I have read in the Cassandra documentation)
Imagine there is a wan hiccup for a few seconds.
Questions about replication resiliency:
Will this Wan hiccup result in a Hinted Handoff (HH) being created in DC1's coordinator for DC2 to be delivered when the Wan link is up again?
I've read that HH only starts once the failure detector has recognized that a replica is unavailable. Will this hiccup of a few seconds cause some data to go undetected to DC2 till the next nodetool repair is run?
Is it advisable to run nodetool repairs in case of Wan hiccups that last a few seconds/minutes?

Yes
This is obsolete as of Cassandra 1.0, so no. (See http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-windows-service-new-cql-clients-and-more)
As above, no.

Related

Cassandra repair after datacenter went down

I have a Cassandra db (version 3.11.2) running in AWS, with 2 Datacenters - each in another AWS region and 3 nodes in each one.
The replication factor on all keyspaces is 3, so full replication of data on every node. The size of data is about 10GB per node.
All of our writes are in LOCAL_QUORUM against one DC (lets call it DC1). Basically the other DC is just for a kind of backup and disaster recovery, in case the AWS region for DC1 will be unavailable we will redirect traffic to DC2.
My issue is that we had a network disconnection between the two DCs, for several hours, and after several days we noticed that there is missing data in DC2. This all makes sense, since the time the DCs were apart is larger than the Hinted Handoff window (3 hours). So we need to run a repair to bring DC2 back to sync with DC1.
I went over the cassandra docs, and read countless SO answers and for the life of me I couldn't understand what is the right repair to do...
Do I need to issue a 'nodetool repair --full --sequential' from only one node? Do I need to run it on every node in the cluster? Maybe it's better to run 'nodetool rebuild'?
Executing nodetool cleanup on the nodes on datacenter2 should be able to bring up the data up to sync, but depending on the data size affected, this may be a task that can take time and resources. If the datacenter2 is only as a backup for disaster recovery purposes, it may be easier and quicker to backup the current dc1 cluster and restore it in the second datacenter (more information is available here.

Cassandra 2.2.8: routing traffic from a node using GossipingPropertyFileSnitch

using Cassandra 2.2.8 , gossipingpropertyfilesnitch
I'm repairing a node and compacting large number of sstables - i'm thinking to alleviate load on the cpu/node and want to routing incoming web traffic to other nodes in the cluster.
may you guys please share how i can route internet traffic to other nodes in the cluster so to let the node keep using cpu on the major maintenance work?
thanks in advance
Providing you have a replication factor and consistency level that can handle a node being down, you can remove the node from the cluster during the compactions
nodetool disablebinary
nodetool disablethrift
This would prevent your client application from sending requests and it acting as coordinator but it will still recieve the mutations from writes so it wont get behind. If you want to reduce load further you can completely remove it with
nodetool disablebinary
nodetool disablethrift
nodetool disablegossip
But make sure you enable gossip again before your max_hint_window_in_ms which is defined in the cassandra.yaml (default 3 hours). If you dont the hints for that node will expire and not be delivered, leading to a consistency issue that will not be resolved without a repair.
Once you reconnect wait for the pending hints and active hints are down to 0 before disabling gossip again. Note: pending will always be +1 since it has a regular scheduled task, so 1 not zero.
Can check the hint thread pool with OpsCenter, nodetool tpstats or via JMX with org.apache.cassandra.metrics:type=ThreadPools,path=internal,scope=HintedHandoff,name=PendingTasks and org.apache.cassandra.metrics:type=ThreadPools,path=internal,scope=HintedHandoff,name=ActiveTasks

How recover cassandra datacenter after failure

I have the following datacenter-aware configuration:
Primary Datacenter: 3 node cluster, RF=3
Secondary Datacenter: 3 node cluster, RF=33
Data size is more than 100GB per node
Assuming that secondary datacenter is breakdown. Recovered after failure after few days.
How I can sync data from primary datacenter to secondary quickly?
I tried "nodetool repair" with various keys. But it takes much time.
Nodetool rebuild will be faster in this case. It is just streams all the data from the other DC.

Cassandra and Spark

Hi I have a high level question regarding cluster topology and data replication with respect to cassandra and spark being used together in datastax enterprise.
It was my uderstanding that if there were 6 nodes in a cluster and there is heavy computing (e.g analytics) done then you could have three spark nodes and 3 cassandra nodes if you want. Or you don't need three nodes for analytics but your jobs would not run as fast. The reason you don't want the heavy analytics on the cassandra nodes is because the local memory is already being used up to handle the heavy read/write load of cassandra.
This much is clear, but here are my questions :
How does the replicated data work then?
Are all the cassandra only nodes in one rack, and all the spark nodes in another rack?
Does all the data get replicated to the spark nodes?
How does that work if it does?
What is the recommended configuration steps to make sure the data is replicated properly to the spark nodes?
How does the replicated data work then?
Regular Cassandra replication will operate between nodes and DC's. As far as replication goes this is the same as having a c* only cluster with two data centers.
Are all the cassandra only nodes in one rack, and all the spark nodes in another rack?
With the default DSE Snitch, your C* nodes will be in one DC and the Spark nodes in another DC. They will all be in a default rack. If you want to use multiple racks you will have to configure that yourself by using an advanced snitch. GPFS or PFS are good choices depending on your orchestration mechanisms. Learn more in the DataStax Documentation
Does all the data get replicated to the spark nodes? How does that work if it does?
Replication is controlled at the keyspace level and depends on your replication strategy:
SimpleStrategy will simply ask you the number of replicas you want in your cluster (it is not data center aware so don't use it if you have multiple DC's)
create KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3 }
This assumes you only have one DC and that you'll have 3 copies of each bit of data
NetworkTopology strategy let's you pick number of replicas per DC
create KEYSPACE tst WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1' : 2, 'DC2': 3 }
You can choose to have a different number of replicas per DC.
What is the recommended configuration steps to make sure the data is replicated properly to the spark nodes?
The procedure to update RF is in the datastax documentation. Here it is verbatim:
Updating the replication factor Increasing the replication factor
increases the total number of copies of keyspace data stored in a
Cassandra cluster. If you are using security features, it is
particularly important to increase the replication factor of the
system_auth keyspace from the default (1) because you will not be able
to log into the cluster if the node with the lone replica goes down.
It is recommended to set the replication factor for the system_auth
keyspace equal to the number of nodes in each data center.
Procedure
Update a keyspace in the cluster and change its replication strategy
options. ALTER KEYSPACE system_auth WITH REPLICATION = {'class' :
'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2}; Or if using
SimpleStrategy:
ALTER KEYSPACE "Excalibur" WITH REPLICATION = { 'class' :
'SimpleStrategy', 'replication_factor' : 3 }; On each affected node,
run the nodetool repair command. Wait until repair completes on a
node, then move to the next node.
Know that increasing the RF in your cluster will generate lots of IO and CPU utilization as well as network traffic, while your data gets pushed around your cluster.
If you have a live production workload, you can throttle the impact by using nodetool getstreamthroughput / nodetool setstreamthroughput.
You can also throttle the resulting compactions with nodetool getcompactionthroughput nodetool setcompactionthroughput
How does Cassandra and Spark work together on the analytics nodes and
not fight for resources? If you are not going to limit Cassandra at all in the whole cluster, then what is the point of limiting Spark, just have all the nodes Spark enabled.
The key point is that you won't be pointing your main transactional reads / writes at the Analytics DC (use something like consistency level ONE_LOCAL, or QUORUM_LOCAL to point those requests to the C* DC). Don't worry, your data still arrives at the analytics DC by virtue of replication, but you won't wait for acks to come back from analytics nodes in order to respond to customer requests. The second DC is eventually consistent.
You are right in that cassandra and spark are still running on the same boxes in the analytics DC (this is critical for data locality) and have access to the same resources (and you can do things like control the max spark cores so that cassandra still has breathing room). But you achieve workload isolation by having two Data Centers.
DataStax drivers, by default, will consider the DC of the first contact point they connect with as the local DC so just make sure that your contact point list only includes machines in the local (c* DC).
You can also specify the local datacenter yourself depending on the driver. Here's an example for the ruby driver, check the driver documentation for other languages.
use the :datacenter cluster method: First datacenter found will be
assumed current by default. Note that you can skip this option if you
specify only hosts from the local datacenter in :hosts option.
You are correct, you want to separate your cassandra and your analytics workload.
A typical setup could be:
3 Nodes in one datacenter (name: cassandra)
3 Nodes in second datacenter (name: analytics)
When creating your keyspaces you define them with a NetworkTopologyStrategy and a replication factor defined for each datacenter, like so:
CREATE KEYSPACE myKeyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'cassandra': 2, 'analytics': 2};
With this setup, your data will be replicated twice in each datacenter. This is done automatically by cassandra. So when you insert data in DC cassandra the inserted data will get replicated to DC analytics automatically and vice versa. Note: you can define what data is replicated by using seperate keyspaces for the data you want to be analyzed and the data you don't.
In your cassandra.yaml you should use the GossipingPropertyFileSnitch. With this snitch you can define the DC and the rack of your node in the file cassandra-rackdc.properties. This information then gets propagated via the gossip protocol. So each node learns the topology of your cluster.

How to separate ring from cluster in cassandra

We have a cassandra DSE cluster with 10 nodes for cassandra ring and 10 nodes for hadoop ring. Now the application writes the data to the cassandra ring and cassandra will replicate the data to hadoop ring.
We want to separate the two ring's and make them as two different cluster's and application writes the data to two clusters at the same time.
How to separate the cluster? is that possible?
we have ~600GB of data in the cluster and we cannot delete it.
You should test this first, but this basic procedure should work. It will need some tweaking if you have counters.
Set your application writing to both DCs using LOCAL_QUORUM.
Run repair on the whole cluster. This is to ensure each DC has a copy of the data.
Isolate the clusters so the two DCs can't talk to each other, probably using a firewall.
Assuming your DCs are DC1 and DC2, change your replication factor to be DC2:0 on DC1 and DC1:0 on DC2.
On each DC, run 'nodetool removenode' for each node in the other DC. This will just remove the DOWN nodes from the ring but won't have any affect on the data because the other nodes have replication factor zero.
This should work with zero data loss.

Resources