Can Cassandra guarantee the replication factor when a node is resyncing its data?

Let's say I have a 3 node cluster.
I am writing to node #1.
If node #2 in that cluster goes down, then comes back up and is resyncing its data from the other nodes while I continue writing to node #1, will the data be synchronously replicated to node #2? That is, is the replication factor of that write honored synchronously, or does the write queue up behind the resync?

Yes, granted that you are reading and writing at a consistency level that can tolerate one node becoming unavailable.
Consider the following scenario:
You have a 3 node cluster with a keyspace 'ks' with a replication factor of 3.
You are writing at a Consistency Level of 'QUORUM'
You are reading at a Consistency level of 'QUORUM'.
Node 2 goes down for 10 minutes.
Reads and writes can successfully continue while the node is down, since 'QUORUM' only requires 2 (3/2 + 1 = 2) nodes to be available. While Node 2 is down, both Node 1 and Node 3 maintain 'hints' for Node 2.
Node 2 comes back online. Node 1 and Node 3 send it the hints they recorded while it was down.
If a read happens and the coordinating Cassandra node detects that replicas are missing data or are inconsistent, it may execute a 'read repair'.
If Node 2 was down for a long time, Node 1 and Node 3 may not retain all hints destined for it. In this case, an operator should consider running repairs on a scheduled basis.
Also note that when doing reads, if Cassandra finds a data mismatch during a digest request, it always considers the data with the newest timestamp to be correct (see 'Why Cassandra doesn't need vector clocks').
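One way to exercise this scenario, sketched with the DataStax Python driver; the keyspace 'ks' and the replication factor of 3 come from the scenario above, while the driver choice, contact points, table, and data are assumptions made for illustration:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    # Connect to any reachable node; the driver discovers the rest of the ring.
    cluster = Cluster(['10.0.0.1', '10.0.0.2', '10.0.0.3'])
    session = cluster.connect()

    # RF 3 on a 3-node cluster: every node holds a replica of every row.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS ks
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)
    session.execute("CREATE TABLE IF NOT EXISTS ks.t (id int PRIMARY KEY, val text)")

    # QUORUM needs floor(3/2) + 1 = 2 replica acks, so one node may be down.
    write = SimpleStatement("INSERT INTO ks.t (id, val) VALUES (%s, %s)",
                            consistency_level=ConsistencyLevel.QUORUM)
    session.execute(write, (1, 'hello'))

    read = SimpleStatement("SELECT val FROM ks.t WHERE id = %s",
                           consistency_level=ConsistencyLevel.QUORUM)
    print(session.execute(read, (1,)).one())

With these settings the write keeps succeeding while Node 2 is down, and the hints recorded by the other two nodes bring Node 2 back up to date once it returns.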

Node 2 will immediately start taking new writes and will also receive any hints stored for it by the other nodes. It is a good idea to run a repair (e.g. nodetool repair) on the node after it is back up, which will ensure its data is consistent with the other nodes.
Note that each column has a timestamp stored against it, which helps Cassandra determine which data is the most recent when running a repair.
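Reusing the session from the sketch above, that per-column timestamp can be inspected with CQL's WRITETIME function (the table and column are the same made-up placeholders):

    # WRITETIME returns microseconds since the epoch; on a conflict the
    # copy with the highest write timestamp wins.
    row = session.execute("SELECT val, WRITETIME(val) FROM ks.t WHERE id = 1").one()
    print(row)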

Related

Read fails with consistency ONE

I've got 3 nodes; 2 in datacenter 1 (node 1 and node 2) and 1 in datacenter 2 (node 3). Replication strategy: NetworkTopologyStrategy, dc1: 2, dc2: 1.
Initially I keep one of the nodes in dc1 off (node 2) and write 100,000 entries with consistency TWO (via a C++ program). After writing, I shut down the node in datacenter 2 (node 3) and turn on node 2.
Now, if I try to read those 100,000 entries (again via the C++ program) with consistency set to ONE, I am not able to read all of them; I can read only some of the entries. As I run the program again and again, it fetches more and more entries.
I was expecting that, since one of the two nodes that are up contains all 100,000 entries, the read program should fetch all of them on the first execution when the consistency is ONE.
Is this related to read repair? I'm thinking that because read repair is happening in the background, the node is not able to respond to all the queries? But nowhere could I find anything about this behavior.
Let's run through the scenario.
During the write of the 100K rows, (DC1) Node 1 and (DC2) Node 3 took all the writes. While that was happening, Node 1 may also have stored hints for Node 2 (DC1) for the default 3 hours, and then stopped.
Once Node 2 comes back online, unless a repair is run, it takes a while to catch up through replay of hints. If the node was down for more than 3 hours, a repair becomes mandatory.
During reads, the request can technically reach any node in the cluster, depending on the load balancing policy used by the driver. Unless told to use "DCAwareRoundRobinPolicy", the read request might reach either DC (DC1 or DC2 in this case). Since the requested consistency is "ONE", practically any alive node can respond; Node 1 and Node 2 (DC1) in this case. So Node 2 may not yet have all the data and can still respond with missing rows, and that's why you received empty results sometimes and correct data at other times.
With consistency "ONE", read repair doesn't even happen, as there is no other node to compare against. Even in the case of consistency "LOCAL_QUORUM" or "QUORUM", there is a read_repair_chance set at the table level, which defaults to 0.1. That means only 10% of reads will trigger a read repair; this saves performance by not triggering it on every read. Think about it: if read repair could bring the table entirely consistent across nodes, why would "nodetool repair" even exist?
To avoid this situation, it is best practice to run "nodetool repair" whenever a node comes back online, or to run queries with consistency "LOCAL_QUORUM" to get consistent data back.
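As a sketch of the second option, assuming the same DataStax Python driver as above and a local datacenter named 'dc1' (the contact point, keyspace, and table are placeholders):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
    from cassandra.query import SimpleStatement

    # Pin coordinators to dc1 instead of letting requests land in either DC.
    cluster = Cluster(
        ['10.0.0.1'],
        load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc='dc1')),
    )
    session = cluster.connect()

    # LOCAL_QUORUM means a majority of dc1's replicas (2 of 2 with dc1: 2)
    # must answer, so a freshly restarted, not-yet-caught-up node cannot
    # satisfy a read on its own.
    stmt = SimpleStatement("SELECT * FROM ks.t WHERE id = %s",
                           consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    rows = session.execute(stmt, (1,))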
Also remember: consistency "ONE" is comparable to an uncommitted read (a dirty read, WITH UR) in the RDBMS world, so expect to see unexpected data occasionally.
Per the documentation, consistency level ONE for reads:
Returns a response from the closest replica, as determined by the snitch. By default, a read repair runs in the background to make the other replicas consistent. Provides the highest availability of all the levels if you can tolerate a comparatively high probability of stale data being read. The replicas contacted for reads may not always have the most recent write.
Did you check that your code contacted the node that was always online and accepted writes?
The DSE Architecture guide, and especially the Database Internals section, provides a good overview of how Cassandra works.

Why can't Cassandra survive the loss of any node without data loss, with replication factor 2?

Hi, I was trying out different configurations using the site
https://www.ecyrd.com/cassandracalculator/
But I could not understand the following result shown for this configuration:
Cluster size 3
Replication Factor 2
Write Level 1
Read Level 1
You can survive the loss of no nodes without data loss.
For reference I have seen the question Cassandra loss of a node
But it still does not help me understand why write level 1 with replication factor 2 makes my Cassandra cluster unable to survive the loss of any node without data loss.
A write request goes to all replica nodes, and even if only one responds, it is a success. So, assuming one node is down, all write requests will go to the other replica node and return success, and the cluster will become eventually consistent.
Can someone help me understand with an example?
I guess what the calculator is working with is the worst-case scenario.
You can survive the loss of one node if your data is available redundantly on two out of three nodes. The thing with write level ONE is that there is no guarantee the data is actually present on two nodes right after your write is acknowledged.
Let's assume the coordinator of your write is one of the nodes holding a copy of the record you are writing. With write level ONE you are telling the cluster to acknowledge your write as soon as the write is committed to one of the two nodes that should hold the data. The coordinator might do that before even attempting to contact the other node (to reduce the latency perceived by the client). If in that moment, right after acknowledging the write but before contacting the second node, the coordinator goes down and cannot be brought back, then you have lost that write and the data with it.
When you read or write data, Cassandra computes the hash token for the data and routes the request to the responsible nodes. A 3-node cluster with a replication factor of 2 means your data is stored on 2 nodes. So when the 2 nodes responsible for a token A are both down, and that token is not replicated on node 3, you will still get a TokenRangeOfflineException even though one node is up.
The point is that we need the replicas (tokens), not the nodes. Also see the similar question answered here.
This is the case because the write level is 1. If your application writes to only one node (and waits for the data to become eventually consistent, which takes non-zero time), then the data can be lost if that one server is itself lost before the sync can happen.
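A concrete worked timeline for the calculator's worst case (3 nodes, RF 2, write level 1), following the answers above:

    t0: the client writes row X at CL ONE; the coordinator, node 1, is a replica for X
    t1: node 1 commits X locally and acknowledges the client (ONE is satisfied)
    t2: node 1 fails unrecoverably before forwarding X to the second replica, node 2
    t3: X exists on no surviving node, even though the client saw a successful write

That one acknowledged-but-unreplicated write is why the calculator reports "You can survive the loss of no nodes without data loss."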

Read repair for Cassandra with QUORUM

I need to understand read repair in Cassandra 3.0. For example, I have three nodes A, B and C, and my replication factor is 3. Now, I wrote with QUORUM and the write succeeded on nodes A and B, so the client receives success, but somehow the data was not written on node C (it was down, and the hint window elapsed).
I have not run a manual repair, and my read_repair_chance is 0.1.
A few days later, node A is down, leaving me with nodes B and C. So if I issue a read query with QUORUM, will read repair always write the data to node C and return success to the client, or is there a scenario where the client can receive an "unable to achieve consistency level" error?
If 2 out of 3 replicas are up, QUORUM consistency can be achieved, so the client will be able to read the data. Since one of the nodes doesn't have the data, a read repair will happen.
As per my understanding (I'm new to Cassandra), whenever a query is executed, the coordinator node checks whether the desired number of replicas (the requested consistency) can respond to the query. If so, the client receives the most recent version of the data (the timestamps of the data returned by each node are compared), and in case of a mismatch that recent version is written back to the remaining replicas via read repair.
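Spelled out for this scenario (node A down, B has the write, C does not), a QUORUM read would typically proceed like this:

    1. The coordinator contacts two replicas, B and C (quorum of 3 is 2).
    2. C's response does not match B's, so the coordinator compares timestamps.
    3. B's copy has the newer timestamp, so it wins and is returned to the client.
    4. The winning version is written back to C, repairing it.

So with 2 of 3 replicas alive, the QUORUM read succeeds; "unable to achieve consistency level" would only come back if a second replica were also unavailable when the request arrived.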

Failover and Replication in 2-node Cassandra cluster

I run KairosDB on a 2-node Cassandra cluster with RF = 2, write CL = 1, read CL = 1. When both nodes are alive, the client sends half of the data to node 1 (e.g. metrics METRIC_1 to METRIC_5000) and the other half to node 2 (e.g. metrics METRIC_5001 to METRIC_10000). Ideally, each node always has a copy of all the data. But if one node is dead, the client sends all data to the alive node.
The client started sending data to the cluster. After 30 minutes, I turned node 2 off for 10 minutes; during this period the client sent all data to node 1 properly. After that, I restarted node 2 and the client continued sending data to both nodes properly. One hour later I stopped the client.
I wanted to check whether the data sent to node 1 while node 2 was dead had been automatically replicated to node 2. To do this, I turned node 1 off and queried node 2 for the data from the period when it was dead, but it returned nothing. This made me think the data had not been replicated from node 1 to node 2. I posted a question, Doesn't Cassandra perform "late" replication when a node goes down and comes up again?. It seems the data was replicated automatically, but very slowly.
What I expect is that the data on both servers is the same (for redundancy). That is, the data sent to the system while node 2 is dead must be replicated from node 1 to node 2 automatically after node 2 becomes available again (because RF = 2).
I have several questions here:
1) Is the replication truly this slow? Or did I configure something wrong?
2) If the client sends half of the data to each node as in this question, I think it's possible to lose data (e.g. node 1 receives data from the client and suddenly goes down while replicating it to node 2). Am I right?
3) If I am right in 2), I am going to have the client send all data to both nodes. This solves 2) and also takes advantage of replication if one node dies and becomes available later. But I am wondering whether this causes duplication, because both nodes receive the same data. Is there any problem here?
Can you check the value of hinted_handoff_enabled in the cassandra.yaml config file?
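For reference, a sketch of the relevant settings in cassandra.yaml (the values shown are the usual defaults; max_hint_window_in_ms is what bounds the "data is too old" case mentioned below):

    hinted_handoff_enabled: true
    max_hint_window_in_ms: 10800000    # 3 hours; no hints are stored for a node down longer than this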
For your question: yes, you may lose data in some cases until replication is fully achieved. Cassandra is not exactly doing late replication; there are three mechanisms:
Hinted handoffs http://docs.datastax.com/en/cassandra/2.2/cassandra/operations/opsRepairNodesHintedHandoff.html
Repairs - http://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsRepair.html
Read repairs - those may not help much in your use case - http://wiki.apache.org/cassandra/ReadRepair
AFAIK, if you are running a version greater than 0.8, hinted handoffs should replicate the data after the node restarts without the need for a repair, unless the data is too old (which should not be the case for 10 minutes). I don't know why those hints were not sent to your replica node when it was restarted; it deserves some investigation.
Otherwise, when you restart the node, you can force Cassandra to make sure that data is consistent by running a repair (e.g. by running nodetool repair).
By your description I have the feeling you are getting confused between the coordinator node and the node that is getting the data (even if the two nodes hold the data, the distinction is important).
BTW, what is the client behaviour you are describing, with metrics sharded between node 1 and node 2? Neither KairosDB nor Cassandra works like that; is it your own client that is sending metrics to different KairosDB instances?
Cassandra does not partition on the metric name but on the row key (the partition key, strictly speaking, but it's the same thing with KairosDB). So every 3 weeks of data for each unique series is assigned a token based on its hash, and this token is used for sharding/replication across the cluster.
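If you want to see that placement yourself, CQL exposes the hash via the token() function. A quick sketch with the DataStax Python driver (the contact point, keyspace, and table names are hypothetical placeholders, not KairosDB's actual schema):

    from cassandra.cluster import Cluster

    session = Cluster(['10.0.0.1']).connect()
    # token() returns the hash Cassandra uses to place (and replicate)
    # each partition on the ring.
    for row in session.execute("SELECT key, token(key) FROM ks.data_points LIMIT 5"):
        print(row)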
KairosDB is able to communicate with several nodes and would round robin between those as coordinator nodes.
I hope this helps.

Cassandra rack replication fixed to (dead) nodes [RF and CL confusion]

1 cluster, 3 nodes, in 2 physical locations, grouped in 2 racks
RF 2
PropertyFileSnitch
CL QUORUM
The problem is:
The first node's (in RAC1) replica is placed on the third node in RAC2 and does not change if that node goes down; reads and writes fail.
If I bring the third node back up and shut down the second node, reads and writes work.
Both the second and third nodes replicate to the first node, and if the first node is down, reads and writes also fail.
The question is:
Is it possible to make it automatically detect dead nodes and point replication at the active nodes it detects?
If the first node is down, the second and third nodes should replicate data between each other
If the second or third node is down, the first node should detect which one is active and replicate to it
Update1:
Made some tests:
Shut down the first node - reads from the second and third nodes fail (Unable to complete request: one or more nodes were unavailable.)
Shut down the second node - reads from the first and third nodes work
Shut down the third node - reads from the first and second nodes fail
Very strange ...
Update2:
I think I found the answer. How it is now: 3 nodes, RF 2, writes and reads at CL 2. If one replica is down, reads and writes fail (I tested selecting different keys; some succeeded when one node was down and failed when another was down).
Now I am thinking of doing this: move all nodes to one rack and change RF to 3; for reads and writes I will use CL 2 (two replicas will be required for a write to succeed, and the third will be written in the background). Then there will be 3 replicas, and if one fails, CL 2 will still succeed.
Am I right?
Will writes succeed if there are 2 nodes active, replication factor is 3, and consistency level for current write operation is 2?
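Checking the arithmetic on that plan: with RF 3, a CL 2 write needs acknowledgements from 2 of the 3 replicas, which is exactly a quorum (3/2 + 1 = 2, as in the first answer above). With one node down, 2 replicas are still alive, so reads and writes at CL 2 succeed; with two nodes down, only 1 replica remains and CL 2 requests fail.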
Update3:
Yes, I think I'm on the right track. Same question here
From the screenshot it can be assumed that it is OpsCenter.
OpsCenter has a special feature called alerts; it will help you detect dead nodes.
Now, coming to the question of a node going down and reads/writes failing: these things depend on the read/write consistency level. Go through the consistency levels and you will be able to find the solution on your own.
UPDATE:
You might find this blog interesting.
The only time Cassandra will fail a request is when too few replicas are alive when the coordinator receives it. This might be the reason behind your strange situation.
You want to have all three nodes in RAC1 with a replication factor of 3 and use QUORUM for your reads/writes. This ensures that data is always persisted to two nodes, reads are consistent, and one node can fail without downtime or data loss. If you don't care about reads always being consistent, i.e. stale data is sometimes allowed, you can make reads more performant by using ONE for reads.
