Cassandra - data loss on a dead node with CL = 1

I'm a newbie to Cassandra and have a question on the commit log which is configured to use periodic mode (10 seconds).
Suppose we have a node that processes a request with CL = 1 and RF = 3. If the node is in a state in which the commit log has not been flushed to disk and replication of the data is also pending, would we lose data if the node crashes in this state?
A follow-up question: which node is responsible for replicating the data to the other nodes based on RF = 3? Is it the coordinator node, or some other node that processes the request depending on the consistency level?

I think the following link might be of use to you:
https://www.ecyrd.com/cassandracalculator/
Yes, data loss is possible in this scenario, because the data would not have reached any other node, so no copies exist; it is as if the data was never there. The thing is, this window is actually quite small, because with RF = 3 the other nodes will receive the insert within milliseconds (unless there is some really heavy load on the node).
All of the replica writes (for a single client request) are handled by the coordinator. Also, if a replica is not available when the coordinator needs to replicate to it, the coordinator stores the data as a hint.
So to sum it up: yes, data loss is possible, but the probability is really small.

With CL=ONE, when a coordinator crashes and goes down uncleanly, there is a window in which data loss is possible, before the mutation is sent to the replicas and the commit log is flushed. It's a pretty small window and an unlikely event, but if it's a concern, use LOCAL_QUORUM or the batch commit log mode.
The coordinator will send the data to all replicas and store hints for any replica that hasn't acknowledged.
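As a rough sketch of what this trade-off looks like from the client side, here is a hedged example using the DataStax Python driver (cassandra-driver); the contact point and the "demo"/"users" keyspace and table names are hypothetical:

    # Sketch: choosing the write consistency level per statement
    # (DataStax Python driver; "demo"/"users" are hypothetical names).
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("demo")

    # CL=ONE: acknowledged once a single replica has the write in its
    # commit log and memtable. A small loss window exists if that node
    # dies before replicating and before the periodic commitlog sync.
    session.execute(SimpleStatement(
        "INSERT INTO users (id, name) VALUES (1, 'alice')",
        consistency_level=ConsistencyLevel.ONE))

    # CL=LOCAL_QUORUM: waits for a majority of local replicas, so a
    # single node failure no longer loses an acknowledged write.
    session.execute(SimpleStatement(
        "INSERT INTO users (id, name) VALUES (2, 'bob')",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM))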


Will Cassandra reach eventual consistency without manual repair if there is no read for that data during gc_grace_seconds?

Assume the following
Replication factor is 3
A delete was issued with consistency 2
One of the replicas was busy (not down), so it dropped the request
The other two replicas added the tombstone and sent the response. So currently the record is marked for deletion on only two replicas.
No read repair happened, as there was no read for that data within gc_grace_seconds
Q1.
Will this data be resurrected when a read happens for that record after gc_grace_seconds, if there was no manual repair?
(I am not talking about a replica being down for more than gc_grace_seconds)
One of the replicas was busy (not down), so it dropped the request
In this case, the coordinator node realizes that the replica could not be written and stores it as a hint. Once the overwhelmed node starts taking requests again, the hint is "replayed" to get the replica consistent.
However, hints are only kept (by default) for 3 hours. After that time, they are dropped. So, if the busy node does not recover within that 3 hour window, then it will not be made consistent. And yes, in that case a query at consistency-ONE could allow that data to "ghost" its way back.
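To make the resurrection scenario concrete, here is a minimal sketch with the DataStax Python driver; the "demo.events" table and the id value are hypothetical:

    # Sketch of the delete/resurrection scenario (Python driver);
    # keyspace/table "demo.events" is hypothetical.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("demo")

    # Delete at CL=TWO: succeeds once two of three replicas hold the
    # tombstone. The busy replica may only get it later via a hint, and
    # hints expire after max_hint_window_in_ms (3 hours by default).
    session.execute(SimpleStatement(
        "DELETE FROM events WHERE id = 42",
        consistency_level=ConsistencyLevel.TWO))

    # Long after gc_grace_seconds, with no repair run in between: a CL=ONE
    # read that happens to hit the replica that never saw the tombstone
    # can return the "deleted" row again.
    row = session.execute(SimpleStatement(
        "SELECT * FROM events WHERE id = 42",
        consistency_level=ConsistencyLevel.ONE)).one()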

What does Cassandra return to client on dropped mutations?

When there are "dropped mutations" on the Cassandra side, does it return a corresponding failure to the calling client? Or does it always return a success response to the client that invoked the write, even though the corresponding mutations were dropped on the server side, resulting in data loss?
In one particular instance we observed lots of dropped mutations (around 6k dropped mutations per second) when we had a TPS of around 80K/sec and increased latency of 4000+ ms. The cluster is a 6-node cluster. I don't have the node/cassandra.yaml config with me right now. In general, how do I troubleshoot these dropped mutations?
Strangely, we couldn't reproduce this behavior at a later point.
On writes, if enough replicas respond within write_request_timeout_in_ms (2 seconds by default) you will see successful responses at the client.
So consider the case where you are writing with consistency QUORUM and a replication factor of 3. When a write is sent from a client to the coordinator, the coordinator sends a write request to all three replicas simultaneously. If 2 replicas are able to respond within write_request_timeout_in_ms, the coordinator will then send a successful response back to the client. Meanwhile, if the third replica is not able to begin processing the write mutation within write_request_timeout_in_ms, it will drop the mutation.
In this scenario, the fact that the mutation was dropped is not visible to the client, but that's OK from the client perspective! All you asked for was a quorum of nodes to acknowledge the write.
From an operational perspective, however, this is a cause for concern. You have replicas that aren't even able to start working on the mutation before the timeout elapses; that's not good!
There are multiple possible causes for this: garbage collection thrashing, hardware issues, or maybe your cluster is simply under-provisioned. Monitoring for dropped mutations to identify these situations is a good step towards understanding what is happening.
If you are worried about consistency issues between replicas, Cassandra employs multiple anti-entropy mechanisms to get into a consistent state. If inconsistencies are identified while reading data, read repair will get the replicas into a consistent state by applying the cells with the highest timestamp. Even if the data does match between the required replicas, a read repair may still be triggered, based on the table's configured read repair chance, to ensure consistent data among all replicas. You should run scheduled repairs as well.
One last note, in the case that not enough replicas respond to meet your consistency level, you will see WriteTimeoutExceptions surfaced to the client. This could mean that your replicas are dropping mutations, but that isn't necessarily the case. They could have begun processing the mutation, but not completed processing within the timeout. In this case, the write will be applied on those replicas.
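To illustrate what the client actually observes, here is a hedged sketch using the DataStax Python driver (the keyspace and table names are hypothetical). On the server side, dropped MUTATION counts are reported by nodetool tpstats, which is a reasonable starting point for the troubleshooting question above.

    # Sketch: client-visible outcomes of a QUORUM write (Python driver);
    # "demo"/"users" are hypothetical names.
    from cassandra import ConsistencyLevel, Unavailable, WriteTimeout
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("demo")

    write = SimpleStatement(
        "INSERT INTO users (id, name) VALUES (3, 'carol')",
        consistency_level=ConsistencyLevel.QUORUM)
    try:
        # Succeeds if 2 of 3 replicas ack within write_request_timeout_in_ms,
        # even if the third replica later drops the mutation; that drop is
        # invisible to the client.
        session.execute(write)
    except WriteTimeout as exc:
        # Fewer than QUORUM replicas acked in time. Replicas that did process
        # the mutation keep it; Cassandra does not roll writes back.
        print("write timed out:", exc)
    except Unavailable as exc:
        # Not enough replicas were alive to even attempt the write.
        print("not enough live replicas:", exc)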

Why can't Cassandra survive the loss of any node without data loss, with replication factor 2?

Hi, I was trying out different configurations using the site
https://www.ecyrd.com/cassandracalculator/
But I could not understand the following result shown for this configuration:
Cluster size 3
Replication Factor 2
Write Level 1
Read Level 1
You can survive the loss of no nodes without data loss.
For reference, I have seen the question "Cassandra loss of a node".
But it still does not help me understand why write level 1 with replication factor 2 would make my Cassandra cluster unable to survive the loss of any node without data loss.
A write request goes to all replica nodes, and even if only 1 responds back it is a success. So assuming 1 node is down, all write requests will go to the other replica node and return success. It will be eventually consistent.
Can someone help me understand with an example.
I guess what the calculator is working with is the worst case scenario.
You can survive the loss of one node if your data is available redundantly on two out of three nodes. The thing with write level ONE is that there is no guarantee that the data is actually present on two nodes right after your write was acknowledged.
Let's assume the coordinator of your write is one of the nodes holding a copy of the record you are writing. With write level ONE you are telling the cluster to acknowledge your write as soon as the write is committed on one of the two nodes that should hold the data. The coordinator might do that before even attempting to contact the other node (to reduce the latency perceived by the client). If in that moment, right after acknowledging the write but before attempting to contact the second node, the coordinator goes down and cannot be brought back, then you have lost that write and the data with it.
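My reading of the calculator's worst-case logic, expressed as a small Python sketch (this is an assumption about how it reasons, not its actual code):

    # Assumed worst-case reasoning behind the calculator (not its real code):
    # when a write is acknowledged, only `write_level` copies are guaranteed
    # to exist, so those are the only copies you can afford to lose from.
    def survivable_node_losses_without_data_loss(replication_factor, write_level):
        guaranteed_copies = min(write_level, replication_factor)
        return guaranteed_copies - 1

    print(survivable_node_losses_without_data_loss(2, 1))  # 0 -> "no nodes"
    print(survivable_node_losses_without_data_loss(3, 2))  # 1 node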
When you read or write data, Cassandra computes the hash token for the data and routes the request to the responsible nodes. A 3-node cluster with a replication factor of 2 means your data is stored on 2 nodes. So at the point when the 2 nodes responsible for a token A are down, and that token is not replicated on node 3, then even though you still have one node up you will get a TokenRangeOfflineException.
The point is that we need the replicas (token ranges), not just any nodes. Also see the similar question answered here.
This is the case because the write level is 1. If your application writes to only 1 node (and waits for the data to become eventually consistent/synced, which takes a non-zero amount of time), then data can be lost if that one server itself is lost before the sync can happen.

Does Cassandra read have inconsistency?

I am new to Cassandra and am trying to understand how it works. Say I write to a number of nodes. My understanding is that, depending on the hash value of the key, it is decided which node owns the data, and then the replication happens. While reading the data, the hash of the key determines which node has the data, and that node then responds. Now my question is: if reading and writing happen on the same set of nodes, which always have the data, then how does read inconsistency occur and how does Cassandra return stale data?
For tuning consistency, Cassandra allows you to set the consistency level on a per-query basis.
Now for your question, let's assume the consistency level is set to ONE and the replication factor is 3.
During WRITE request coordinator sends a write request to all replicas that own the row being written. As long as all replica nodes are up and available, they will get the write regardless of the consistency level specified by the client. The write consistency level determines how many replica nodes must respond with a success acknowledgment in order for the write to be considered successful. Success means that the data was written to the commit log and the memtable.
For example, in a single data center 10 node cluster with a replication factor of 3, an incoming write will go to all 3 nodes that own the requested row. If the write consistency level specified by the client is ONE, the first node to complete the write responds back to the coordinator, which then proxies the success message back to the client. A consistency level of ONE means that it is possible that 2 of the 3 replicas could miss the write if they happened to be down at the time the request was made. If a replica misses a write, Cassandra will make the row consistent later using one of its built-in repair mechanisms: hinted handoff, read repair, or anti-entropy node repair.
By default, hints are saved for three hours after a replica fails because if the replica is down longer than that, it is likely permanently dead. You can configure this interval of time using the max_hint_window_in_ms property in the cassandra.yaml file. If the node recovers after the save time has elapsed, run a repair to re-replicate the data written during the down time.
Now, when a READ request is performed, the coordinator node sends the request to the replica that can currently respond the fastest (hence it might go to any 1 of the 3 replicas).
Now imagine a situation where the data has not yet been replicated to the third replica and that replica is selected during the READ (the chances are very small); then you get inconsistent data.
This scenario assumes all nodes are up. If one of the nodes is down and read repair is not done once the node comes back up, then it can add to the problem.
READ With Different CONSISTENCY LEVEL
READ Request in Cassandra
Consider a scenario where the CL is QUORUM, in which case 2 out of 3 replicas must respond. The write request will go to all 3 replicas as usual; if the write fails on 2 replicas and succeeds on 1, Cassandra will return a failure to the client. Since Cassandra does not roll back, the record will continue to exist on the successful replica. Now, when a read comes in with CL=QUORUM, the read request will be forwarded to 2 replica nodes, and if one of them is the previously successful one, Cassandra will return the new record as it has the latest timestamp. But from the client's perspective this record was not written, as Cassandra had returned a failure during the write.
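The per-query consistency tuning mentioned above looks roughly like this with the DataStax Python driver (the "demo.users" table is hypothetical):

    # Sketch: per-query read consistency (Python driver);
    # "demo.users" is a hypothetical table.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("demo")

    # CL=ONE read: fastest, but may hit a replica that has not yet seen a
    # recent write and therefore return stale data.
    fast_read = SimpleStatement(
        "SELECT * FROM users WHERE id = 1",
        consistency_level=ConsistencyLevel.ONE)

    # CL=QUORUM read: 2 of 3 replicas answer and the newest timestamp wins,
    # so a write that was acknowledged at QUORUM is always visible.
    quorum_read = SimpleStatement(
        "SELECT * FROM users WHERE id = 1",
        consistency_level=ConsistencyLevel.QUORUM)

    print(session.execute(fast_read).one())
    print(session.execute(quorum_read).one())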

Cassandra difference between ANY and ONE consistency levels

Assumptions: RF = 3
In some video on the Internet about consistency levels, the speaker says that CL = ONE is better than CL = ANY, because when we use CL = ANY the coordinator will be happy to store only a hint (and the data) (we are assuming here that all the other nodes owning the corresponding partition key ranges are down), and we can potentially lose our data due to the coordinator's failure. But wait a minute... as I understand it, if we used CL = ONE and, for example, we had only one (of three) available nodes for this partition key, we would have only one node with the inserted data. The risk of loss is the same.
But I think we should compare equal situations: all nodes for a particular token are gone. Then it's better to discard the write operation than to write with such a big risk of losing the coordinator.
CL=ANY should probably never be used on a production server. Writes will be unavailable until the hint is written to a node owning that partition, because you can't read data while it's only in a hints log.
Using CL=ONE and RF=3 with two nodes down, you would have data stored in both a) the commit log and memtable on a node and b) the hints log. These are likely different nodes, but they could be the same 1/3 of the time. So, yes, with CL=ONE and CL=ANY you risk complete loss of data with a single node failure.
Instead of ANY or ONE, use CL=QUORUM or CL=LOCAL_QUORUM.
The thing is, the hints will only be stored for 3 hours by default, and for longer outages than that you have to run repairs. You can repair only if you have at least one copy of this data on one node somewhere in the cluster (hints that are stored on the coordinator don't count).
Consistency ONE guarantees that at least one node in the cluster has the write in its commit log, no matter what. ANY is, in the worst case, stored only in the hints of the coordinator (other nodes can't access it), and these are kept by default for a time frame of 3 hours. After those 3 hours pass, with ANY you are losing data if the other two instances are still down.
If you are worried about the risk, then use QUORUM and 2 nodes will have to guarantee saving the data. It's up to the application developer / designer to decide. QUORUM will usually have slightly higher write latency than ONE. But you can always add more nodes, etc., should the load dramatically increase.
Also have a look at this nice tool to see what impacts do various consistencies and replication factors have on applications:
https://www.ecyrd.com/cassandracalculator/
With RF = 3, 3 nodes in the cluster will actually get the write. Consistency is just about how long you want to wait for a response from them... If you use ONE, you will wait until one node has it in its commit log. But the coordinator will actually send the write to all 3. If some don't respond, the coordinator will save those writes as hints.
Most of the time, ANY in production is a bad idea.
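For completeness, the ANY vs ONE distinction on the write path, again as a hedged Python driver sketch (the "demo.users" names are hypothetical):

    # Sketch: ANY vs ONE writes (Python driver); "demo.users" is hypothetical.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("demo")

    # CL=ANY: can succeed even if no replica is reachable, because a hint
    # stored on the coordinator counts. The data is unreadable until the
    # hint is replayed, and hints expire after max_hint_window_in_ms.
    session.execute(SimpleStatement(
        "INSERT INTO users (id, name) VALUES (4, 'dave')",
        consistency_level=ConsistencyLevel.ANY))

    # CL=ONE: at least one real replica must have the write in its commit
    # log and memtable before the client is acknowledged.
    session.execute(SimpleStatement(
        "INSERT INTO users (id, name) VALUES (5, 'erin')",
        consistency_level=ConsistencyLevel.ONE))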
