How is strong consistency possible given the Two Generals Problem? (Cassandra)

Many distributed systems (e.g. databases) say they can provide strong consistency. For example, assuming N replicas of the data, with W replicas required to acknowledge a write and R replicas required to respond to a read, the Cassandra documentation says that as long as R + W > N you will get strong consistency. Intuitively, that makes sense. But when I started thinking about this at the individual-message level, I couldn't actually understand how it could be achieved.
To be specific, let's assume I have a Cassandra cluster with a replication factor of 3. For simplicity, let's assume only a single data partition, so we have exactly 3 nodes in the system: A, B, and C. A client attempts to write some data, x = 11, with a write consistency of W = 3, that is, the write is only considered complete if all replicas acknowledge it. So the client sends the write request to A, which then forwards it to B and C. Let's assume B ACKs the write but C does not. The write should then fail. Another client then does a read with R = 1 and happens to talk to B. As R + W = 1 + 3 = 4 > 3, this should be a strongly consistent read. However, B has already ACK'd the write, so there is at least some window of time during which B will return x = 11 if asked (it may only be a window because A might tell B "never mind, the write failed"). If the client never retries its write, we have now given wholly incorrect data to a client, and it doesn't seem like we can consider this strong consistency.
We can start to think about schemes to fix this. For example, maybe the protocol is that the nodes each ACK the message but won't return the new value until A reaches out to them again and tells them to commit (i.e. a two-phase commit). But again we run into trouble: now B and C can both initially ACK, so A tells them to commit, but C fails to get that message. As a result, a read from C would fail to return x = 11 even though the write appears to have succeeded. Attempts to fix this via additional rounds of messaging (e.g. each node has to ACK the commit phase) also inevitably run into issues, as the Two Generals Problem proves.
There's clearly something wrong with my reasoning here; Cassandra does provide strong consistency when used properly. My question is, at the node-to-node protocol level, how do they do it?

I think the answer is that "strong consistency" here is something akin to READ UNCOMMITTED: dirty reads, as in my initial example, are in fact allowed and do happen. Indeed, I found this in the Cassandra documentation:
If the write fails on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that node. However, the replicated write that succeeds on the other node is not automatically rolled back.
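To make the window concrete, here is a minimal sketch using the DataStax Python driver. It illustrates the behaviour described above rather than quoting any documented code, and the contact point, keyspace ks, table kv, and columns k and x are all hypothetical: a write at ALL that times out is not rolled back, so a later read at ONE may still see x = 11.
from cassandra import ConsistencyLevel, WriteTimeout
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("ks")

# W = 3 when RF = 3: every replica must acknowledge the write.
write = SimpleStatement("UPDATE kv SET x = 11 WHERE k = 'key1'",
                        consistency_level=ConsistencyLevel.ALL)
# R = 1: any single replica may answer the read.
read = SimpleStatement("SELECT x FROM kv WHERE k = 'key1'",
                       consistency_level=ConsistencyLevel.ONE)

try:
    session.execute(write)
except WriteTimeout:
    # Fewer than three replicas acknowledged in time. The replicas that did
    # apply the mutation keep it (there is no rollback), so the read below
    # can legitimately return x = 11 even though the write "failed".
    pass

print(session.execute(read).one())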

Related

What happens when data is written to a replica in Cassandra, but the coordinator returns a timeout exception to the client due to lack of quorum?

Scenario:
Client sends a write request to a coordinator node
Replication factor is 3 and Read/Write Consistency level is QUORUM.
Coordinator sends the request to nodes A, B and C. Data is committed to node A, but nodes B and C go down immediately after receiving the request from the coordinator.
Coordinator will send a timeout exception to the client since it has not received an ack from nodes B and C within the allotted time. The data on node A is now inconsistent with the data on nodes B and C.
Based on my understanding nodes B and C will be updated with the value on node A during read repair. So we had a timeout exception here, but the new value has been eventually written to all the nodes.
There could be other timeout exceptions where the new data has not been written to any of the nodes.
So it appears that the developer is expected to handle the timeout exception in the code, which may not be straightforward in all cases (because the new value may have been written in some cases and not in others, and the developer has to check for that during a retry after the timeout).
I'm just learning Cassandra. So if my understanding is not correct, please correct me.
Some of you may say that this happens in a relational DB too, but it's a rare occurrence there since it's not a distributed system.
Here are some articles that I found, but they do not address my question specifically.
What happens if a coordinator node goes down during a write in Apache Cassandra?
https://www.datastax.com/blog/2012/08/when-timeout-not-failure-how-cassandra-delivers-high-availability-part-1
If the data is written, it is consistent, even if nodes B and C did not send the ACK:
When data is received by a node, it first goes to the commit log, and if the node crashes it will replay the mutation as soon as it starts up again.
As the second article says, it is more like an InProgressException than a TimedOutException.
On the client side, if you get a TimedOutException you are not 100% sure that the data was written, but it may have been.
For your case, if the write was received by nodes B and C, the data is consistent even though they did not send an ACK. Even if only one of the two nodes received it, the data is still consistent thanks to the use of QUORUM.
On the cluster side, there are several mechanisms that help Cassandra become more consistent: hinted handoff, read repair, and repair.
For a better understanding, it may be worth taking a look at:
the write path:
https://docs.datastax.com/en/cassandra-oss/2.1/cassandra/dml/dml_write_path_c.html
hinted handoff:
https://docs.datastax.com/en/cassandra-oss/2.1/cassandra/dml/dml_about_hh_c.html
read repair:
https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsRepairNodesReadRepair.html
Thanks for the response. It still does not help answer the question from an end user/developer perspective since I need to write the code to handle the exception.
For whatever it's worth, I found the below article on DataStax.
https://www.datastax.com/blog/2014/10/cassandra-error-handling-done-right
If you refer to the sections on 'WriteTimeoutException' and 'Non-idempotent operations', you can see that the end user is expected to retry after receiving the exception.
If it's an idempotent operation, then no additional code is required on the application side. Things are not so straightforward for non-idempotent operations. Cassandra assumes that most write operations are idempotent, and I don't necessarily agree with this; the business rules depend on the application.
Examples of non-idempotent operations:
UPDATE table SET counter = counter + 1 WHERE key = 'xyz'
UPDATE table SET commission = commission * 1.02 WHERE key = 'abc'
The article gives some recommendations on how to handle non-idempotent operations using CAS/lightweight transactions in the 'Non-idempotent operations' section. This makes things complicated/ugly in the client code, especially when you have a lot of DML in the code.
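For what it's worth, here is a hedged sketch of that compare-and-set retry pattern with the Python driver (my own sketch, not the article's code), applied to the commission example above; the keyspace, the accounts table, and its key/commission columns are hypothetical.
from cassandra import ConsistencyLevel, WriteTimeout
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

def raise_commission(key, factor=1.02, max_attempts=5):
    # Read once and compute the target value; the retries below re-submit the
    # same conditional update, so the factor cannot be applied twice.
    row = session.execute(
        SimpleStatement("SELECT commission FROM accounts WHERE key = %s",
                        consistency_level=ConsistencyLevel.QUORUM),
        (key,)).one()
    current, proposed = row.commission, row.commission * factor
    for _ in range(max_attempts):
        try:
            result = session.execute(
                "UPDATE accounts SET commission = %s WHERE key = %s IF commission = %s",
                (proposed, key, current)).one()
            if result[0]:                   # '[applied]' is the first column of an LWT result
                return proposed
            if result.commission == proposed:
                return proposed             # an earlier timed-out attempt already applied
            raise RuntimeError("value changed concurrently; re-read before retrying")
        except WriteTimeout:
            continue                        # outcome unknown: re-issue the same CAS
    raise RuntimeError("could not complete the update")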
Even though it's NOT the answer that I was looking for, it appears that there's no better way at least for now.

Strong Consistency in Cassandra

According to a DataStax article, strong consistency can be guaranteed
if R + W > N,
where
R is the consistency level of read operations
W is the consistency level of write operations
N is the number of replicas
What does strong consistency mean here? Does it mean that every time a query's response is returned from the database, the response will always be the last updated value? If the conditions for strong consistency are maintained in Cassandra, are there no scenarios where the data returned might be inconsistent? In short, does strong consistency mean 100% consistency?
Edit 1
Adding some additional material on scenarios where Cassandra might not be consistent even when R + W > RF:
Write fails with Quorum CL
Cassandra's eventual consistency
Cassandra has tunable consistency with some tradeoffs you can choose.
R + W > N simply means that the set of replicas that acknowledged the write and the set of replicas answering the read must overlap in at least one node, and that node holds the newest acknowledged data, which is what makes the read consistent.
For example, if you write at CL.ONE you will need to read at CL.ALL to be sure of getting a consistent result (1 + N > N), but you might not want CL.ALL because then you cannot tolerate even a single node failure.
Often you can choose CL.QUORUM at both read and write time to ensure consistency and still tolerate node failures. For example, at RF=3 a QUORUM needs floor(3/2) + 1 = 2 nodes, so R + W > N becomes 2 + 2 = 4 > 3: your requests are consistent AND you can tolerate a single node failure.
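A small illustrative sketch (plain Python, not Cassandra code) of why the overlap is guaranteed: with R + W > N, every possible set of W write replicas intersects every possible set of R read replicas.
from itertools import combinations

def overlap_guaranteed(n, w, r):
    # True if every w-node write set shares at least one node with every
    # r-node read set chosen from the same n replicas.
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

quorum = lambda rf: rf // 2 + 1

print(overlap_guaranteed(3, quorum(3), quorum(3)))  # True:  2 + 2 > 3  (QUORUM + QUORUM)
print(overlap_guaranteed(3, 1, 3))                  # True:  1 + 3 > 3  (ONE + ALL)
print(overlap_guaranteed(3, 1, 1))                  # False: 1 + 1 <= 3 (ONE + ONE)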
One thing to keep in mind: it is really important to have tightly synchronized clocks on all your nodes (Cassandra and application servers); you will want NTP up and running.
While this is an old question, I thought I would chip in to set the record straight.
R + W > RF does not imply strong consistency
A system with R + W > RF will only be eventually consistent. The strong consistency guarantee breaks during node failures or in between writes. For example, consider the following scenario:
Assume that there are 3 nodes A, B, C with RF=3, W=3, R=2 (hence R + W = 5 > 3 = RF).
Further assume key k is associated with value v, i.e. (k, v) is stored in the database. Suppose the following series of actions occurs:
t=1: (k,v1) write request is sent to A,B,C from a user
t=2: (k,v1) reaches A and is written to store at A
t=3: Reader 1 sends a read request for key k, which is replied to by A and B
t=4: Reader 1 receives the response (k,v1), by the last-write-wins rule
t=5: Reader 1 sends another read request which gets served by nodes B and C
t=6: Reader 1 receives the response (k,v), which is an older value: INCONSISTENCY
t=7: (k,v1) reaches C and is written to store at C
t=8: (k,v1) reaches B and is written to store at B
This demonstrates that W + R > RF cannot guarantee strong consistency. To ensure strong consistency you might want to use another algorithm, such as Paxos or Raft, that helps ensure writes are atomic. You can read an interesting article on this here (do check out the FAQ section).
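To make the scenario concrete, here is a toy model (plain Python, not Cassandra internals) where each replica stores a (value, timestamp) pair and a read returns the newest value among the replicas it happens to contact; blocking read repair is deliberately left out to show the bare R + W > RF behaviour.
replicas = {
    "A": {"k": ("v", 0)},
    "B": {"k": ("v", 0)},
    "C": {"k": ("v", 0)},
}

def write(node, key, value, ts):
    replicas[node][key] = (value, ts)

def read(nodes, key):
    # last write wins: return the value with the highest timestamp
    return max((replicas[n][key] for n in nodes), key=lambda vt: vt[1])[0]

write("A", "k", "v1", ts=1)   # t=2: the new value has reached only A so far
print(read(["A", "B"], "k"))  # t=4: 'v1'
print(read(["B", "C"], "k"))  # t=6: 'v'  <- older value, the inconsistency above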
Edit:
Cassandra does have an internal mechanism (blocking read repair) that triggers synchronous writes before the response from the database is sent back to the client. This kind of synchronous read repair occurs when there are inconsistencies among the nodes queried to achieve the read consistency level, and it ensures something known as Monotonic Read Consistency [see below for definitions]. This causes the (k,v1) in the above example to be written to node B before the response to the first read request is returned, so the second read request would also see the updated value. (Thanks to @Nadav Har'El for pointing this out.)
However, this still does not guarantee strong consistency. Below are some definitions to clear this up:
Sequential/Strong Consistency: the result of any execution is the same as if the reads and writes occur in some order, and the operations of each individual processor appear in this sequence in the order specified by its program [as defined by Leslie Lamport]
Monotonic Read Consistency: once you read a value, all subsequent reads will return this value or a newer version
Sequential consistency would require the client program/reader to see the latest value that was written since the write statement is executed before the read statement in the sequence of program instructions.
For both reads and writes, the consistency levels of ANY, ONE, TWO, and THREE are considered weak, whereas QUORUM and ALL are considered strong.
Yes. If R + W is greater than the number of replicas, then you will always get consistent data: 100% consistency. But you will have to trade availability to achieve that level of consistency.
Cassandra has the concept of tunable consistency (the consistency level can be set on a per-query basis).
I would actually regard this strong consistency as strong read consistency. And it is session-scoped, a.k.a. Monotonic Read Consistency (refer to Nadav Har'El's answer).
But it is not sequential consistency, as Cassandra doesn't fully support locks or transactions, nor does it serialize write operations. There are only lightweight transactions, which support local serialization of writes and serialization of reads.
To make things easy to understand, let's say we have three nodes A, B, C, with the read consistency level set to 3 (ALL) and the write level set to 1 (ONE).
If there is only one client, it writes to any node, say A.
B and C might not be synchronized yet (eventually they will be: eventual consistency).
But when the client reads again, it must get responses from all three nodes, and by comparing timestamps the coordinator will use A's record, the latest one. This is Monotonic Read Consistency.
However, if there are two clients trying to update the record at the same time, or if they both read the value first and then rewrite it (e.g. increase a column by 100) at the same time:
Client C1 and Client C2 both read the current column value as 10, and they both decide to increase it by 100:
C1 only needs to write 110 to one node, and client C2 will do the same, so the final result on any node can be at most 110.
We thus lose 100 in these operations (a lost update). This is a problem caused by a race condition between concurrent writers. It has to be fixed by serializing the operations and using some form of locking, just as SQL databases do with transactions.
I know Cassandra now has counter columns, which might solve this particular case, but they are still limited compared to full transactions. And Cassandra is not meant to be transactional; it is a NoSQL database that sacrifices consistency for availability.
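A bare-bones sketch (plain Python, not Cassandra code) of the lost update described above: both clients read 10, both write 110, and last write wins keeps only one of the two increments.
column = {"value": 10, "ts": 0}

def read_value():
    return column["value"]

def write_value(value, ts):
    # last write wins: a higher timestamp overwrites a lower one
    if ts > column["ts"]:
        column.update(value=value, ts=ts)

c1 = read_value() + 100   # client C1 reads 10 and intends to write 110
c2 = read_value() + 100   # client C2 also reads 10 and intends to write 110
write_value(c1, ts=1)
write_value(c2, ts=2)

print(column["value"])    # 110, not 210: one increment was lost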

Why does R + W > RF mean immediate consistency?

I'm trying to explain this to myself.
Here's how I understand it:
Suppose I have 4 nodes, RF = 3 and CL = QUORUM for both read & write.
In my table (id, title) I write the row {id = 1, title = 'mytext'}; the write will return success if 2 nodes write it successfully. Say it succeeds: we now have (at least) 2 nodes with {id = 1, title = 'mytext'} and potentially one node with {id = 1, title = 'olddata'}.
Then any subsequent read (where id = 1) needs to get the same data from 2 nodes (QUORUM) in order to return successfully, which can never happen with the old data, because at most 1 node still contains the old data.
Is that accurate?
The number of nodes is not that important; what matters more is the RF, i.e. how many nodes hold a copy of the data. So CL QUORUM (with RF = 3) means:
2 nodes have to confirm it on write
2 nodes have to confirm it on read
Under the hood, the request will not actually fetch the full data from all of the nodes. Based on some statistics and the snitch component, the coordinator will choose one of the nodes that has the data, and the other nodes will just be asked for a hash (digest), not the whole data. If the received data matches the digests, it's returned to the client. If not, the coordinator will request the full data from the other nodes and resolve the conflict using a last-write-wins policy.
To be able to do this, clocks have to be in sync, usually via NTP, but some teams go as far as installing GPS receivers on their hosts to keep the clock skew really tight.
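A toy sketch (plain Python, not Cassandra internals) of that digest-read idea: one replica returns the full row, the others return only a hash, and on a mismatch the coordinator fetches the full rows and keeps the newest one.
import hashlib

def digest(row):
    value, ts = row
    return hashlib.md5(f"{value}:{ts}".encode()).hexdigest()

def coordinator_read(replicas, key):
    nodes = list(replicas)
    data_node, digest_nodes = nodes[0], nodes[1:]
    row = replicas[data_node][key]
    if all(digest(replicas[n][key]) == digest(row) for n in digest_nodes):
        return row[0]                            # digests agree: done
    # mismatch: fetch the full rows and resolve with last write wins
    return max((replicas[n][key] for n in nodes), key=lambda vt: vt[1])[0]

replicas = {
    "A": {1: ("mytext", 2)},
    "B": {1: ("mytext", 2)},
    "C": {1: ("olddata", 1)},   # this replica missed the newer write
}
print(coordinator_read(replicas, 1))  # 'mytext'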
In short, your reasoning is totally OK.
And if you want to learn a bit more about all the combinations, it doesn't hurt to look at the following link:
https://www.ecyrd.com/cassandracalculator/

Read/Write Strategy For Consistency Level

Based on Read Operation in Cassandra at Consistency level of Quorum?
there are three ways to read data consistently:
a. WRITE ALL + READ ONE
b. WRITE ONE + READ ALL
c. WRITE QUORUM + READ QUORUM
For a given piece of data, the write operation usually happens once, but reads happen often.
With read consistency in mind, is it possible to merge a and b?
That is, WRITE ONE -> READ ONE -> if not found -> READ ALL.
Does this approach mean that, in the usual case, each read/write operation only touches one node?
A READ ALL would only happen the first time the query hits a node that does not yet have the data.
So is my understanding correct?
Wilian, thanks for elaborating so thoroughly. I think I need to describe my use case, as below. I implemented a timeline that users can post to, and users can follow posts they find interesting, so a notification will be sent to the followers. To save bandwidth, users write/read posts at CL ONE. Eventually, users can always read the post after a while thanks to read repair. Followers will receive a notification when comments are added to a post they listen to. Here is my question: I must make sure followers can read the comments if the notification was delivered to them. So I intend to use CL ONE to check whether the comment has been synced to the node queried. If there is no result, I try CL ALL to sync the comment. Then other followers querying that node don't need to sync with other nodes, since the CL ALL read was done before, which saves bandwidth and lowers server overhead. As for your final scenario, I don't care whether the value is old or the latest, because the data was synced according to the notifications. I just need to ensure users can get the comment if the notification was delivered to the followers.
From the answer to the linked question, Carlo Bertuccini wrote:
What guarantees consistency is the following disequation
(WRITE CL + READ CL) > REPLICATION FACTOR
The cases A, B, and C in this question appear to be referring to the three minimum ways of satisfying that disequation, as given in the same answer.
Case A
WRITE ALL will send the data to all replicas. If your replication factor (RF) is three(3), then WRITE ALL writes three copies before reporting a successful write to the client. But you can't possibly see that the write occurred until the next read of the same data key. Minimally, READ ONE will read from a single one of the aforementioned replicas, and satisfies the necessary condition: WRITE(3) + READ(1) > RF(3)
Case B
WRITE ONE will send the data to only a single replica. In this case, the only way to get a consistent read is to read from all of them. The coordinator node will get all of the answers, figure out which one is the most recent and then send a "hint" to the out-of-date replicas, informing them that there's a newer value. The hint occurs asynchronously but only after the READ ALL occurs does it satisfy the necessary condition: WRITE(1) + READ(3) > RF(3)
Case C
QUORUM operations must involve FLOOR(RF / 2) + 1 replicas. In our RF=3 example, that is FLOOR(3 / 2) + 1 == 1 + 1 == 2. Again, consistency depends on both the reads and the writes. In the simplest case, the read operation talks to exactly the same replicas that the write operation used, but that's never guaranteed. In the general case, the coordinator node doing the read will talk to at least one of the replicas used by the write, so it will see the newer value. In that case, much like the READ ALL case, the coordinator node will get all of the answers, figure out which one is the most recent and then send a "hint" to the out-of-date replicas. Of course, this also satisfies the necessary condition: WRITE(2) + READ(2) > RF(3)
So to the OP's question...
Is it possible to "merge" cases A and B?
To ensure consistency it is only possible to "merge" if you mean WRITE ALL + READ ALL because you can always increase the number of readers or writers in the above cases.
However, WRITE ONE + READ ONE is not a good idea if you need to read consistent data, so my answer is: no. Again, using that disequation and our example RF=3: WRITE(1) + READ(1) > RF(3) does not hold. If you were to use this configuration, receiving an answer that there is no value cannot be trusted -- it simply means that the one replica contacted to do the read did not have a value. But values might exist on one or more of the other replicas.
So from that logic, it might seem that doing a READ ALL on receiving a no value answer would solve the problem. And it would for that use case, but there's another to consider: what if you get some value back from the READ ALL... how do you know that the value returned is "the latest" one? That's what's meant when we want consistency. If you care about reading the most recent write, then you need to satisfy the disequation.
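As a quick sanity check, here is a short sketch evaluating that disequation for the cases above at RF = 3:
RF = 3

def replicas_for(cl, rf=RF):
    # number of replicas that must respond at each consistency level
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[cl]

cases = [
    ("A: WRITE ALL    + READ ONE",    "ALL", "ONE"),
    ("B: WRITE ONE    + READ ALL",    "ONE", "ALL"),
    ("C: WRITE QUORUM + READ QUORUM", "QUORUM", "QUORUM"),
    ("   WRITE ONE    + READ ONE",    "ONE", "ONE"),
]
for name, w, r in cases:
    print(name, replicas_for(w) + replicas_for(r) > RF)
# A, B and C print True; WRITE ONE + READ ONE prints False, matching the answer above.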
Regarding the use case of "timeline" notifications in the edited question
If my understanding of your described scenario is correct, these are the main points to your use case:
Most (but not all?) timeline entries will be write-once (not modified later)
Any such entry can be followed (there is a list of followers)
Any such entry can be commented upon (there is a list of comments)
Any comment on a timeline entry should trigger a notification to the list of followers for that timeline entry
Trying to minimize cost (in this case, measured as bandwidth) for the "normal" case
Willing to rely on the anti-entropy features built into Cassandra (e.g. read repair)
I need to ensure users can get the comment if notification was delivered to followers.
Since most of your entries are write-once, and you care more about the existence of an entry and not necessarily the latest content for the entry, you might be able to get away with WRITE ONE + READ ONE with a fallback to READ ALL if you get no record for something that had some other indication it should exist (e.g. from a notification). For the timeline entry content, it does not sound like your case depends on consistency of the user content of the timeline entries.
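A hedged sketch of that "READ ONE, fall back to READ ALL" idea with the Python driver; the contact point and the keyspace, table, and column names (timeline, comments, entry_id) are hypothetical:
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("timeline")
QUERY = "SELECT * FROM comments WHERE entry_id = %s"

def read_comments(entry_id):
    cheap = SimpleStatement(QUERY, consistency_level=ConsistencyLevel.ONE)
    rows = list(session.execute(cheap, (entry_id,)))
    if rows:
        return rows
    # The single replica we asked may simply not have the data yet (we were
    # notified about the comment before replication caught up). Retry at ALL,
    # which compares every replica and lets read repair fix the stale one.
    expensive = SimpleStatement(QUERY, consistency_level=ConsistencyLevel.ALL)
    return list(session.execute(expensive, (entry_id,)))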
If you don't care about consistency, then this discussion is moot; read/write with whatever Consistency Level and let Cassandra's asynchronous replication and anti-entropy features do their work. That said, though your goal is minimizing network traffic/cost, if your workload is mostly reads then the added cost of doing writes at CL QUORUM or ALL may not actually be that much.
You also said:
Followers will receive the notification of comments added to post if they listen the post.
This statement implies that you care not only about whether the set of followers exists but also about its contents (which users are following). You have not detailed how you are storing/tracking the followers, but unless you ensure the consistency of this data it is possible that one or more followers are not notified of a new comment because you retrieved an out-of-date version of the follower list. Or, someone who "unfollowed" a post could still receive notifications for the same reason.
Cassandra is very flexible and allows each discrete read and write operation to use different consistency levels. Take advantage of this and ensure strong consistency where it is needed and relax it where you are sure that "reading the latest write" is not important to your application's logic and function.

Understand cassandra replication factor versus consistency level

I want to clarify the very basic concepts of replication factor and consistency level in Cassandra. I'd highly appreciate it if someone could answer the questions below.
RF- Replication Factor
RC- Read Consistency
WC- Write Consistency
2 cassandra nodes (Ex: A, B) RF=1, RC=ONE, WC=ONE or ANY
can I write data to node A and read from node B ?
what will happen if A goes down ?
3 cassandra nodes (Ex: A, B, C) RF=2, RC=QUORUM, WC=QUORUM
can I write data to node A and read from node C ?
what will happen if node A goes down ?
3 cassandra nodes (Ex: A, B, C) RF=3, RC=QUORUM, WC=QUORUM
can I write data to node A and read from node C ?
what will happen if node A goes down ?
Short summary: Replication factor describes how many copies of your data exist. Consistency level describes the behavior seen by the client. Perhaps there's a better way to categorize these.
As an example, you can have a replication factor of 2. When you write, two copies will always be stored, assuming enough nodes are up. When a node is down, writes for that node are stashed away and written when it comes back up, unless it's down long enough that Cassandra decides it's gone for good.
Now say in that example you write with a consistency level of ONE. The client will receive a success acknowledgement after the write is done on one node, without waiting for the second write. If you did a write with a CL of ALL, the acknowledgement to the client would wait until both copies are written. There are many other consistency level options, too many to cover all the variants here. Read the DataStax docs, though; they do a good job of explaining them.
In the same example, if you read with a consistency level of ONE, the response will be sent to the client after a single replica responds. Another replica may have newer data, in which case the response will not be up-to-date. In many contexts, that's quite sufficient. In others, the client will need the most up-to-date information, and you'll use a different consistency level on the read - perhaps a level ALL. In that way, the consistency of Cassandra and other post-relational databases is tunable in ways that relational databases typically are not.
Now getting back to your examples.
Example one: Yes, you can write to A and read from B, even if B doesn't have its own replica. B will ask A for it on your client's behalf. This is also true for your other cases where the nodes are all up. When they're all up, you can write to one and read from another.
For writes, with WC=ONE, if the node holding the single replica is up and is the one you're connected to, the write will succeed. If the replica belongs to the other node, the write will fail. If you use ANY, the write will succeed, assuming you're talking to the node that's up (I think you also have to have hinted handoff enabled for that). The down node will get the data later, and you won't be able to read it until after that occurs, not even from the node that's up.
In the other two examples, the replication factor will affect how many copies are eventually written, but it doesn't affect client behavior beyond what I've described above. QUORUM will affect client behavior in that you have to have a sufficient number of nodes up and responding for writes and reads. If at least floor(RF/2) + 1 of the replica nodes you need are up, writes and reads will succeed; if you don't have enough replica nodes up, reads and writes will fail. Overall, some QUORUM reads and writes can succeed while a node is down, provided that node either isn't needed to store your replica or its outage still leaves enough replica nodes available.
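A tiny sketch of that availability arithmetic: QUORUM needs floor(RF/2) + 1 replicas to respond, so it tolerates the remaining replicas being down.
def quorum(rf):
    return rf // 2 + 1

for rf in (1, 2, 3, 5):
    print(f"RF={rf}: QUORUM needs {quorum(rf)} replica(s), "
          f"tolerates {rf - quorum(rf)} down")
# RF=1 and RF=2 tolerate no replica loss; RF=3 tolerates 1; RF=5 tolerates 2.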
Check out this simple calculator which allows you to simulate different scenarios:
http://www.ecyrd.com/cassandracalculator/
For example with 2 nodes, a replication factor of 1, read consistency = 1, and write consistency = 1:
Your reads are consistent
You can survive the loss of no nodes.
You are really reading from 1 node every time.
You are really writing to 1 node every time.
Each node holds 50% of your data.
