Failover and Replication in 2-node Cassandra cluster - cassandra

I run KairosDB on a 2-node Cassandra cluster, RF = 2, Write CL = 1, Read CL = 1. If 2 nodes are alive, client sends half of data to node 1 (e.g. metric from METRIC_1 to METRIC_5000) and the other half of data to node 2 (e.g. metric from METRIC_5001 to METRIC_10000). Ideally, each node always has a copy of all data. But if one node is dead, client sends all data to the alive node.
Client started sending data to the cluster. After 30 minutes, I turned node 2 off for 10 minutes. During this 10-minute period, client sent all data to node 1 properly. After that, I restarted node 2 and client continued sending data to 2 nodes properly. One hour later I stopped the client.
I wanted to check if the data which was sent to node 1 when node 2 was dead had been automatically replicated to node 2 or not. To do this, I turned node 1 off and queried the data within time when node 2 was dead from node 2 but it returned nothing. This made me think that the data had not been replicated from node 1 to node 2. I posted a question Doesn't Cassandra perform “late” replication when a node down and up again?. It seems that the data was replicated automatically but it was so slow.
What I expect is data in both 2 servers are the same (for redundancy purpose). That means the data sent to the system when node 2 is dead must be replicated from node 1 to node 2 automatically after node 2 becomes available (because RF = 2).
I have several questions here:
1) Is the replication truly slow? Or did I configure something wrong?
2) If client sends half of data to each node as in this question I think it's possible to lose data (e.g. node 1 receives data from client, while node 1 is replicating data to node 2 it suddenly goes down). Am I right?
3) If I am right in 2), I am going to do like this: client sends all data to both 2 nodes. This can solve 2) and also takes advantages of replication if one node is dead and is available later. But I am wondering that, this would cause duplication of data because both 2 nodes receive the same data. Is there any problem here?
Thank you!

Can you check the value of hinted_handoff_enabled in cassandra.yaml config file?
For your question: Yes you may lose data in some cases, until the replication is fully achieved, Cassandra is not exactly doing late replication - there are three mechanisms.
Hinted handoffs http://docs.datastax.com/en/cassandra/2.2/cassandra/operations/opsRepairNodesHintedHandoff.html
Repairs - http://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsRepair.html
Read Repairs - those may not help much on your use case - http://wiki.apache.org/cassandra/ReadRepair
AFAIK, if you are running a version greater than 0.8, the hinted handoffs should duplicate the data after node restarts without the need for a repair, unless data is too old (this should not be the case for 10 minutes). I don't know why those handoffs where not sent to your replica node when it was restarted, it deserves some investigation.
Otherwise, when you restart the node, you can force Cassandra to make sure that data is consistent by running a repair (e.g. by running nodetool repair).
By your description I have the feeling you are getting confused between the coordinator node and the node that is getting the data (even if the two nodes hold the data, the distinction is important).
BTW, what is the client behaviour with metrics sharding between node 1 and node 2 you are describing? Neither KairosDB nor Cassandra work like that, is it your own client that is sending metrics to different KairosDB instances?
The Cassandra partition is not made on metric name but on row key (partition key exactly, but it's the same with kairosDB). So every 3-weeks data for each unique series will be associated a token based on hash code, this token will be use for sharding/replication on the cluster.
KairosDB is able to communicate with several nodes and would round robin between those as coordinator nodes.
I hope this helps.

Related

Cassandra: what node will data be written if the needed node is down?

Suppose I have a Cassandra cluster with 3 nodes (node 0, node 1 and node 2) and replication factor of 1.
Suppose that I want to insert a new data to the cluster and the partition key directs the new row to node 1. However, node 1 is temporarily unavailable. In this case, will the new data be inserted to node 0 or node 2 (although it should not be placed there according to the partition key)?
In Cassandra, Replication Factor (RF) determines how many copies of data will ultimately exist and is set/configured at the keyspace layer. Again, its purpose is to define how many nodes/copies should exist if things are operating "normally". They could receive the data several ways:
During the write itself - assuming things are functioning "normally" and everything is available
Using Hinted Handoff - if one/some of the nodes are unavailable for a configured amount of time (< 3 hours), cassandra will automatically send the data to the node(s) when they become available again
Using manual repair - "nodetool repair" or if you're using DSE, ops center can repair/reconcile data for a table, keyspace, or entire cluster (nodesync is also a tool that is new to DSE and similar to repair)
During a read repair - Read operations, depending on the configurable client consistency level (described next) can compare data from multiple nodes to ensure accuracy/consistency, and fix things if they're not.
The configurable client consistency level (CL) will determine how many nodes must acknowledge they have successfully received the data in order for the client to be satisfied to move on (for writes) - or how many nodes to compare with when data is read to ensure accuracy (for reads). The number of nodes available must be equal to or greater than the client CL number specified or the application will error (for example it won't be able to compare a QUORUM level of nodes if a QUORUM number of nodes are not available). This setting does not dictate how many nodes will receive the data. Again, that's the RF keyspace setting. That will always hold true. What we're specifying here is how many must acknowledge each write or compare for each read in order the client to be happy at that moment. Hopefully that makes sense.
Now...
In your scenario with a RF=1, the application will receive an error upon the write as the single node that should receive the data (based off of a hash algorithm) is down (RF=1 again means only a single copy of the data will exist, and that single copy is determined by a hash algorithm to be the unavailable node). Does that make sense?
If you had a RF=2 (2 copies of data), then one of the two other nodes would receive the data (again, the hash algorithm picks the "base" node, and then another algorithm will chose where the cop(ies) go), and when the unavailable node became available, it would eventually receive the data (either by hinted handoff or repair). If you chose a RF=3 (3 copies) then the other 2 nodes would get the data, and again, once the unavailable node became available, it would eventually receive the data (either by hinted handoff or repair).
FYI, if you ever want to know where a piece of data will/does exist in a Cassandra cluster, you can run "nodetool getendpoints". The output will be where all copies will/do reside.

How to Manage Node Failure with Cassandra Replication Factor 1?

I have a three node Cassandra (DSE) cluster where I don't care about data loss so I've set my RF to 1. I was wondering how Cassandra would respond to read/write requests if a node goes down (I have CL=ALL in my requests right now).
Ideally, I'd like these requests to succeed if the data exists - just on the remaining available nodes till I replace the dead node. This keyspace is essentially a really huge cache; I can replace any of the data in the event of a loss.
(Disclaimer: I'm a ScyllaDB employee)
Assuming your partition key was unique enough, when using RF=1 each of your 3 nodes contains 1/3 of your data. BTW, in this case CL=ONE/ALL is basically the same as there's only 1 replica for your data and no High Availability (HA).
Requests for "existing" data from the 2 up nodes will succeed. Still, when one of the 3 nodes is down a 1/3 of your client requests (for the existing data) will not succeed, as basically 1/3 of you data is not available, until the down node comes up (note that nodetool repair is irrelevant when using RF=1), so I guess restore from snapshot (if you have one available) is the only option.
While the node is down, once you execute nodetool decommission, the token ranges will re-distribute between the 2 up nodes, but that will apply only for new writes and reads.
You can read more about the ring architecture here:
http://docs.scylladb.com/architecture/ringarchitecture/

Why can't cassandra survive the loss of no nodes without data loss. with replication factor 2

Hi I was trying out different configuration using the site
https://www.ecyrd.com/cassandracalculator/
But I could not understand the following results show for configuration
Cluster size 3
Replication Factor 2
Write Level 1
Read Level 1
You can survive the loss of no nodes without data loss.
For reference I have seen the question Cassandra loss of a node
But it still does not help to understand why Write level 1 will with replication 2 would make my cassandra cluster not survive the loss of no node without data loss?
A write request goes to all replica nodes and the even if 1 responds back , it is a success, so assuming 1 node is down, all write request will go to the other replica node and return success. It will be eventually consistent.
Can someone help me understand with an example.
I guess what the calculator is working with is the worst case scenario.
You can survive the loss of one node if your data is available redundantly on two out of three nodes. The thing with write level ONE is, that there is no guarantee that the data is actually present on two nodes right after your write was acknowledged.
Let's assume the coordinator of your write is one of the nodes holding a copy of the record you are writing. With write level ONE you are telling the cluster to acknowledge your write as soon as the write was committed to one of the two nodes that should hold the data. The coordinator might do that before even attempting to contact the other node (to boost latency percieved by the client). If in that moment, right after acknowledging the write but before attempting to contact the second node the coordinator node goes down and cannot be brought back, then you lost that write and the data with it.
When you read or write data, Cassandra computes the hash token for the data and distributes to respective nodes. When you have 3 node cluster with replication factor as 2 means your data is stored in 2 nodes. So at a point when 2 nodes are down which is responsible for a token A and this token is not part of node 3, eventually even you have one node you will still have TokenRangeOfflineException.
The point is we need replicas(Token) and not the nodes. Also see the similar question answered here.
This is the case because the write level is 1. And if the your application is writing on 1 node only (and waiting data to get eventually consistent/sync, which is going to take non-zero time), then data can get lost if that one server itself is lost before sync could happen

Best way to add multiple nodes to existing cassandra cluster

We have a 12 node cluster with 2 datacenters(each DC has 6 nodes) with RF-3 in each DC.
We are planning to increase cluster capacity by adding 3 nodes in each DC(total 6 nodes).
What is the best way to add multiple nodes at once.(ya,may be with 2 min difference).
auto_bootstrap:false - Use auto_bootstrap:false(as this is quicker process to start nodes) on all new nodes , start all nodes & then run 'nodetool rebuild' to get data streamed to this new nodes from exisitng nodes.
If I go this way , where read requests go soon starting this new nodes , as at this point it has only token range assigned to them(new nodes) but NO data got streamed to this nodes , will it cause Read request failures/CL issues/any other issue?
OR
auto_bootstrap:true - Use auto_bootstrap:true and then start one node at a time , wait until streaming process finishes(this might take time I guess as we have huge data approx 600 GB+ on each node) before starting next node.
If I go this way , I have to wait until whole streaming process done on a node before proceed adding next new node.
Kindly suggest a best way to add multiple nodes all at once.
PS: We are using c*-2.0.3.
Thanks in advance.
As each depends on streaming data over the network, it largely depends how distributed your cluster is, and where your data currently is.
If you have a single-DC cluster and latency is minimal between all nodes, then bringing up a new node with auto_bootstrap: true should be fine for you. Also, if at least one copy of your data has been replicated to your local datacenter (the one you are joining the new node to) then this is also the preferred method.
On the other hand, for multiple DCs I have found more success with setting auto_bootstrap: false and using nodetool rebuild. The reason for this, is because nodetool rebuild allows you to specify a data center as the source of the data. This path gives you the control to contain streaming to a specific DC (and more importantly, not to other DCs). And similar to the above, if you are building a new data center and your data has not yet been fully-replicated to it, then you will need to use nodetool rebuild to stream data from a different DC.
how read requests would be handled ?
In both of these scenarios, the token ranges would be computed for your new node when it joins the cluster, regardless of whether or not the data is actually there. So if a read request were to be sent to the new node at CL ONE, it should be routed to a node containing a secondary replica (assuming RF>1). If you queried at CL QUORUM (with RF=3) it should find the other two. That is of course, assuming that the nodes which are picking-up the slack are not overcome by their streaming activities that they cannot also serve requests. This is a big reason why the "2 minute rule" exists.
The bottom line, is that your queries do have a higher chance of failing before the new node is fully-streamed. Your chances of query success increase with the size of your cluster (more nodes = more scalability, and each bears that much less responsibility for streaming). Basically, if you are going from 3 nodes to 4 nodes, you might get failures. If you are going from 30 nodes to 31, your app probably won't notice a thing.
Will the new node try to pull data from nodes in the other data centers too?
Only if your query isn't using a LOCAL consistency level.
I'm not sure this was ever answered:
If I go this way , where read requests go soon starting this new nodes , as at this point it has only token range assigned to them(new nodes) but NO data got streamed to this nodes , will it cause Read request failures/CL issues/any other issue?
And the answer is yes. The new node will join the cluster, receive the token assignments, but since auto_bootstrap: false, the node will not receive any streamed data. Thus, it will be a member of the cluster, but will not have any old data. New writes will be received and processed, but existing data prior to the node joining, will not be available on this node.
With that said, with the correct CL levels, your new node will still do background and foreground read repair, so that it shouldn't respond any differently to requests. However, I would not go this route. With 2 DC's, I would divert traffic to DCA, add all of the nodes with auto_bootstrap: false to DCB, and then rebuild the nodes from DCA. The rebuild will need to be from DCA because the tokens have changed in DCB, and with auto_bootstrap: false, the data may no longer exist. You could also run repair, and that should resolve any discrepancies as well. Lastly, after all of the nodes have been bootstrapped, run cleanup on all of the nodes in DCB.

Can cassandra guarantee the replication factor when a node is resyncing it's data?

Let's say I have a 3 node cluster.
I am writing to node #1.
If node #2 in that cluster goes down, and then comes back up and is resyncing the data from the other nodes, and I continue writing to node #1, will the data be synchronously replicated to node #2? That is, is the replication factor of that write honored synchronously or is it behind the queue post resync?
Thanks
Steve
Yes granted that you are reading and writing at a consistency level that can handle 1 node becoming unavailable.
Consider the following scenario:
You have a 3 node cluster with a keyspace 'ks' with a replication factor of 3.
You are writing at a Consistency Level of 'QUORUM'
You are reading at a Consistency level of 'QUORUM'.
Node 2 goes down for 10 minutes.
Reads and Writes can successfully continue while node is down since 'QUORUM' only requires 2 (3/2+1=2) nodes to be available. While Node 2 is down, both Node 1 and 3 maintain 'hints' for Node 2.
Node 2 comes online. Node 1 and 3 send hints they recorded while Node 2 was down to Node 2.
If a read happens and the coordinating cassandra node detects that nodes are missing data/not consistent, it may execute a 'read repair'
If Node 2 was down for a long time, Node 1 and Node 3 may not retain all hints destined for it. In this case, an operator should consider running repairs on a scheduled basis.
Also note that when doing reads, if Cassandra finds that there is a data mismatch during a digest request, it will always consider the data with the newest timestamp as the right one (see 'Why cassandra doesn't need vector clocks').
Node2 will immediately start taking the new writes and also any hints stored for this node by others. It is good idea to run a read repair on the node after it is back up, which will ensure the data is accurate with other nodes.
Note that each column has a timestamp stored against it which will help cassandra determine which data is recent when running node repair.

Resources