According to the DataStax article, strong consistency can be guaranteed
if, R + W > N
where
R is the consistency level of read operations
W is the consistency level of write operations
N is the number of replicas
What does strong consistency mean here? Does it mean that every time a query's response is returned from the database, the response will always be the last updated value? If the conditions for strong consistency are maintained in Cassandra, are there no scenarios where the data returned might be inconsistent? In short, does strong consistency mean 100% consistency?
Edit 1
Adding some additional material on scenarios where Cassandra might not be consistent even when R + W > RF:
Write fails with Quorum CL
Cassandra's eventual consistency
Cassandra has tunable consistency with some tradeoffs you can choose.
R + W > N simply means that the write set and the read set must overlap in at least one node that holds the newest data, so the round trip is consistent.
For example, if you write at CL.ONE you will need to read at CL.ALL to be sure of a consistent result: 1 + N > N. But you might not want CL.ALL, since then you cannot tolerate a single node failure in your cluster.
Often you can choose CL.QUORUM at both read and write time to ensure consistency and tolerate node failures. For example, at RF=3 a QUORUM needs floor(3/2)+1 = 2 nodes available, so R + W > N becomes 2 + 2 = 4 > 3: your requests are consistent AND you can tolerate a single node failure.
One thing to keep in mind: it is really important to have tightly synchronized clocks on all your nodes (Cassandra and application servers), so you will want to have NTP up and running.
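To make the arithmetic above concrete, here is a minimal sketch in plain Python (no driver needed); the function names are just for illustration:

```python
# Minimal sketch of the R + W > N overlap rule and the QUORUM formula.

def quorum(rf):
    # QUORUM = floor(RF / 2) + 1
    return rf // 2 + 1

def overlapping(r, w, rf):
    # True if the read set and write set must share at least one replica
    return r + w > rf

rf = 3
print(quorum(rf))                               # 2
print(overlapping(quorum(rf), quorum(rf), rf))  # True: 2 + 2 = 4 > 3
print(overlapping(rf, 1, rf))                   # True: write ONE, read ALL (3 + 1 > 3)
print(overlapping(1, 1, rf))                    # False: ONE/ONE reads can miss the write
```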
While this is an old question, I thought I would chip in to set the record straight.
R + W > RF does not imply strong consistency
A system with R + W > RF will only be eventually consistent. The strong consistency guarantee breaks during node failures or while a write is still in flight. For example, consider the following scenario:
Assume that there are 3 nodes A,B,C with RF=3, W=3, R=2 (hence, R+W = 5 > 3 = RF)
Further assume key k is associated with value v, i.e. (k,v) is stored in the database. Suppose the following series of actions occurs:
t=1: (k,v1) write request is sent to A,B,C from a user
t=2: (k,v1) reaches A and is written to store at A
t=3: Reader 1 sends a read request for key k, which is replied to by A and B
t=4: Reader 1 receives response (k,v1) - by latest write wins rule
t=5: Reader 1 sends another read request which gets served by nodes B and C
t=6: Reader 1 receives response (k,v), which is an older value: an INCONSISTENCY
t=7: (k,v1) reaches C and is written to store at C
t=8: (k,v1) reaches B and is written to store at B
This demonstrates that W + R > RF alone cannot guarantee strong consistency. To ensure strong consistency, you might want to use a consensus algorithm such as Paxos or Raft that can help ensure the writes are atomic. You can read an interesting article on the same here (do check out the FAQ section).
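To make the timeline concrete, here is a toy simulation of the scenario above (plain Python; the dict-based "stores" are of course nothing like Cassandra's real storage engine, and it deliberately omits the blocking read repair discussed in the edit below):

```python
# Toy model: each replica holds a (value, timestamp) pair for key k, and a
# read at R=2 resolves conflicts by last-write-wins on the timestamp.

def read(stores, replicas):
    return max((stores[r] for r in replicas), key=lambda vt: vt[1])

stores = {"A": ("v", 0), "B": ("v", 0), "C": ("v", 0)}  # (k,v) everywhere

stores["A"] = ("v1", 1)              # t=2: the in-flight W=3 write lands on A
print(read(stores, ["A", "B"]))      # t=3/4: ('v1', 1) - newest value
print(read(stores, ["B", "C"]))      # t=5/6: ('v', 0)  - older value again!
stores["C"] = ("v1", 1)              # t=7
stores["B"] = ("v1", 1)              # t=8: the write finally completes
```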
Edit:
Cassandra does have an internal mechanism, called blocking read repair, that triggers synchronous writes before the response from the DB is sent back to the client. This synchronous read repair occurs when there are inconsistencies among the nodes queried to achieve the read consistency level, and it ensures something known as Monotonic Read Consistency [see below for definitions]. In the example above, it causes (k,v1) to be written to node B before the response to the first read request is returned, so the second read request would also see the updated value. (Thanks to @Nadav Har'El for pointing this out.)
However, this still does not guarantee strong consistency. Below are some definitions to clear this up:
Sequential/Strong Consistency: the result of any execution is the same as if the reads and writes occur in some order, and the operations of each individual processor appear in this sequence in the order specified by its program [as defined by Leslie Lamport]
Monotonic Read Consistency: once you read a value, all subsequent reads will return this value or a newer version
Sequential consistency would require the client program/reader to see the latest value that was written, since the write statement executes before the read statement in the program's instruction sequence.
For both reads and writes, the consistency levels of ANY , ONE , TWO , and THREE are considered weak, whereas QUORUM and ALL are considered strong.
Yes. If R + W is greater than the number of replicas, then you will always get consistent data: 100% consistency. But you will have to trade availability to achieve that higher consistency.
Cassandra has the concept of tunable consistency (consistency can be set on a per-query basis).
I would actually regard this strong consistency as strong read consistency, and it is session-level, a.k.a. Monotonic Read Consistency (refer to @NadavHar'El's answer).
But it is not sequential consistency, as Cassandra doesn't fully support locks or transactions and does not serialize write operations. There are only lightweight transactions, which support serialization of individual write and read operations.
To make things easy to understand, let's say we have three nodes, A, B and C, and set the read consistency level to 3 (ALL) and the write consistency level to 1 (ONE).
If there is only one client, it writes to any one node, say A.
B and C might not be synchronized yet. (Eventually they will be: eventual consistency.)
But when the client reads again, it needs responses from all three nodes; by comparing the timestamps, the latest record, A's, wins. This is Monotonic Read Consistency.
However, if there are two clients trying to update the record at the same time, or if they both try to read the value first and then rewrite it (e.g. increase a column by 100) at the same time:
Client C1 and Client C2 both read the current column value as 10, and they both decide to increase it by 100:
While C1 just needs to write 110 to one node, client C2 will do the same, and the final value on any node can only be 110 at most.
So we lose 100 across these operations (a lost update). This is an issue caused by a race condition, and it has to be fixed by serializing the operations with some form of locking, just as SQL databases do when implementing transactions; a sketch of a workaround follows below.
I know Cassandra now has counter columns, which might solve this particular case, but they are still limited compared to full transactions. And Cassandra is also not meant to be transactional, as it is a NoSQL database that sacrifices consistency for availability.
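For the read-then-rewrite case above, a compare-and-set via a lightweight transaction avoids the lost update. Below is a hedged sketch using the DataStax Python driver; the keyspace, table, and column names (demo, accounts, id, balance) are made up for illustration:

```python
# Sketch: avoid the lost update with a conditional (IF ...) write, i.e. a
# Cassandra lightweight transaction. If another client changed the value
# between our read and our write, was_applied is False and we retry.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo")  # hypothetical keyspace

def add_100(key):
    while True:
        row = session.execute(
            "SELECT balance FROM accounts WHERE id = %s", (key,)).one()
        result = session.execute(
            "UPDATE accounts SET balance = %s WHERE id = %s IF balance = %s",
            (row.balance + 100, key, row.balance))
        if result.was_applied:   # the conditional write accepted our precondition
            return
```

With this pattern, if C1 and C2 both read 10 and both try the conditional update, only one `IF balance = 10` succeeds; the other retries from a fresh read and ends at 210 instead of 110.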
Related
I have always used Cassandra in Spark applications, but I never looked into how it works internally. Reading the Cassandra documentation, I ran into a small question (which may be a beginner's question).
I read in a book (Cassandra: The Definitive Guide) and in the official Cassandra documentation that the formula would be:
(RF / 2) + 1.
So theoretically, if I have a cluster with 6 nodes and a replication factor of 3, I would only need a response from 2 nodes.
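In code, that formula is just integer division; a tiny sketch:

```python
rf = 3
print(rf // 2 + 1)  # 2: with RF=3, QUORUM waits on 2 replicas,
                    # no matter how many nodes (here 6) the cluster has
```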
And here come the small questions:
1 - What would this response be? (The query result with the data?)
2 - If no data matches the filters used in the query, is the empty result considered a response?
3 - And last but not least, if the empty result is considered a response, and the two nodes that complete the QUORUM don't have the replica data yet, my application that did the SELECT will conclude that this data doesn't exist in the database, right?
Your reasoning sounds correct to me.
Basically, if you're reading at LOCAL_QUORUM and have an RF of 3, it's possible that the coordinator accepts a response from two replicas that are both inconsistent and leaves out the third replica that had consistent data.
It's one of the reasons Cassandra is considered an eventually consistent DB, and also why regular repairs of the data are so important for production databases. Of course, if consistency mattered above all else, you could always read with a CL of ALL, but you'd sacrifice some response time as a tradeoff. Assuming the DB is provisioned well, though, it isn't likely that only a single replica receives an incoming write, unless you make a habit of only writing at a CL of ONE/LOCAL_ONE; it's certainly possible, just not likely. If consistency mattered, you'd be writing to the DB with a CL of at least LOCAL_QUORUM to avoid this very scenario.
To try and answer your questions directly: yes, having no data to return can be a valid response, and yes, if the two replicas chosen by the coordinator both agree that there is no data to return, the app will report that result.
1 - What would this response be? (The query result with the data?)
The coordinator node will wait for 2 of the 3 replicas (because CL=QUORUM) to respond to the query with the request results. It will then send the response to the client.
2 - If no data matches the filters used in the query, is the empty result considered a response?
Yes, the empty response will be sufficient and will be considered a valid response. Note that there is a last-write-wins mechanism (based on the row write time) used in case of conflict.
3 - And last but not least, if the empty result is considered a response, and the two nodes that complete the QUORUM don't have the replica data yet, my application that did the SELECT will conclude that this data doesn't exist in the database, right?
You have to understand that Apache Cassandra uses eventual consistency, meaning that the client decides on the desired CL. If you have strong consistency, meaning an overlap of the write CL and read CL (Write CL + Read CL > RF), then you will always retrieve the latest data. I recommend watching this video: https://www.youtube.com/watch?v=Gx-pmH-b5mI
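As a concrete illustration of choosing the CL per query, here is a hedged sketch using the DataStax Python driver; the keyspace and table names (demo, users) are placeholders:

```python
# Sketch: QUORUM writes + QUORUM reads with RF=3 gives 2 + 2 > 3, i.e. the
# read set always overlaps the write set in at least one replica.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("demo")  # hypothetical keyspace

write = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
read = SimpleStatement(
    "SELECT name FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM)

session.execute(write, (42, "alice"))
print(session.execute(read, (42,)).one())
```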
Scenario:
Client sends a write request to a coordinator node
Replication factor is 3 and Read/Write Consistency level is QUORUM.
Coordinator sends the request to nodes A, B and C. Data is committed to node A, but nodes B and C go down immediately after receiving the request from the coordinator.
Coordinator will send a timeout exception to the client since it has not received an ack from nodes B and C within the allotted time. The data on node A is now inconsistent with the data on nodes B and C.
Based on my understanding, nodes B and C will be updated with the value from node A during read repair. So we had a timeout exception here, but the new value has eventually been written to all the nodes.
There could be other timeout exceptions where the new data has not been written to any of the nodes.
So it appears that the developer is expected to handle the timeout exception in the code, which may not be straightforward in all cases (because the new value may have been written in some cases and not in others, and the developer has to check for that during a retry after the timeout).
I'm just learning Cassandra. So if my understanding is not correct, please correct me.
Some of you may say that this happens in a relational DB too, but it's a rare occurrence there since it's not a distributed system.
Here are some articles that I found, but it does not address my question specifically.
What happens if a coordinator node goes down during a write in Apache Cassandra?
https://www.datastax.com/blog/2012/08/when-timeout-not-failure-how-cassandra-delivers-high-availability-part-1
If the data is written, it is consistent, even if nodes B and C did not send the ACK:
When data is received by a node, it first goes to a commit log; if the node crashes, it will replay the mutation as soon as it starts up again.
As the second article says, it is more like an InProgressException than a TimedOutException.
On the client side, if you get a TimedOutException you are not 100% sure that the data was written, but it may have been.
In your case, if the write was received by nodes B and C, the data is consistent even though they did not send an ACK. Even if just one of the two nodes received it, the data is still consistent thanks to the use of QUORUM.
On the cluster side, there are several mechanisms that help Cassandra be more consistent: hinted handoff, read repair, and repair.
For better understanding, maybe worth taking a look at :
write path:
https://docs.datastax.com/en/cassandra-oss/2.1/cassandra/dml/dml_write_path_c.html
hinted handoff:
https://docs.datastax.com/en/cassandra-oss/2.1/cassandra/dml/dml_about_hh_c.html
read repair:
https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsRepairNodesReadRepair.html
Thanks for the response. It still does not help answer the question from an end user/developer perspective, since I need to write the code to handle the exception.
For whatever it's worth, I found the below article on DataStax.
https://www.datastax.com/blog/2014/10/cassandra-error-handling-done-right
If you refer to the sections on 'WriteTimeOutException' and 'Non-idempotent operations', you can see that the end user is expected to retry after receiving the exception.
If it's an idempotent operation, then no additional code is required on the application side. Things are not so straightforward for non-idempotent operations. Cassandra assumes that most write operations are generally idempotent, and I don't necessarily agree with this: the business rules depend on the application.
Example of non-idempotent operations:
update table set counter = counter + 1 where key = 'xyz'
or update table set commission = commission * 1.02 where key = 'abc'
The article gives some recommendations on how to handle non-idempotent operations using CAS/lightweight transactions in the 'Non-idempotent operations' section. This makes things complicated/ugly in the client code, especially when you have a lot of DML in the code.
Even though it's NOT the answer that I was looking for, it appears that there's no better way at least for now.
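For what the retry looks like in practice, here is a hedged sketch using the Python driver's exception types; the statement, keyspace, and table are placeholders, and the key assumption is that the write is a plain upsert (hence idempotent):

```python
# Sketch: a WriteTimeout does NOT mean the write failed, only that the
# coordinator gave up waiting for acks. For idempotent upserts, blindly
# retrying is safe; non-idempotent updates need CAS/LWT instead, as the
# DataStax article above recommends.
from cassandra import WriteTimeout
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo")  # hypothetical keyspace

def execute_idempotent(query, params, retries=3):
    for attempt in range(retries):
        try:
            return session.execute(query, params)
        except WriteTimeout:
            if attempt == retries - 1:
                raise  # still unknown whether the write landed; surface it

execute_idempotent("INSERT INTO users (id, name) VALUES (%s, %s)", (1, "bob"))
```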
When both reads and writes are set to QUORUM, I am guaranteed the client will always get the latest value when reading.
I realize this may be a novice question, but I'm not understanding how this setup doesn't provide consistency, availability, and partition tolerance.
With a quorum, you are unavailable (i.e. you won't accept reads or writes) if there aren't enough replicas available. You can choose to relax and read/write at lower consistency levels, granting you availability, but then you won't be consistent.
There's also the case where a quorum on reads and writes guarantees that the latest "written" data is retrieved. However, if a coordinator doesn't know about required replicas being down (i.e. gossip hasn't propagated after 2 of 3 nodes fail), it will issue a write to 3 replicas [assuming quorum consistency on a replication factor of 3]. The one live node will write, and the other 2 won't (they're down). The write times out (it doesn't fail). A write timeout where even one node has written IS NOT a write failure; it's a write "in progress". Let's say the down nodes come up now. If a client next requests that data with quorum consistency, one of two things happens:
Request goes to one of the two downed nodes, and to the "was live" node. Client gets latest data, read repair triggers, all is good.
Request goes to the two nodes that were down. OLD data is returned (assuming repair hasn't happened yet). The coordinator gets a digest from the third node and read repair kicks in. This is when the original write is considered "complete", and subsequent reads will get the fresh data. All is good, but one client will have received the old data, since the write was "in progress" but not "complete". This is a very rare scenario. One thing to note is that writes to Cassandra are upserts on keys, so retries are usually enough to get around this problem; however, when nodes genuinely go down, the initial read may be a problem.
Typically you balance your consistency and availability requirements. That's where the term tunable consistency comes from.
That said, the web is full of links that disprove (or at least try to disprove) Brewer's CAP theorem. From the theorem's point of view, C says that
all nodes see the same data at the same time
which is quite different from the guarantee that a client will always retrieve fresh information. Strictly following the theorem, in your situation C is not respected.
The DataStax documentation contains a section on Configuring Data Consistency. Looking through all of the available consistency configurations, for QUORUM it states:
Returns the record with the most recent timestamp after a quorum of replicas has responded regardless of data center. Ensures strong consistency if you can tolerate some level of failure.
Note that last part "tolerate some level of failure." Right there it's indicating that by using QUORUM consistency you are sacrificing availability (A).
The document referenced above also further defines the QUORUM level, stating that your replication factor comes into play as well:
If consistency is top priority, you can ensure that a read always reflects the most recent write by using the following formula:
(nodes_written + nodes_read) > replication_factor
For example, if your application is using the QUORUM consistency level for both write and read operations and you are using a replication factor of 3, then this ensures that 2 nodes are always written and 2 nodes are always read. The combination of nodes written and read (4) being greater than the replication factor (3) ensures strong read consistency.
In the end, it all depends on your application requirements. If your application needs to be highly-available, ONE is probably your best choice. On the other hand, if you need strong-consistency, then QUORUM (or even ALL) would be the better option.
The Cassandra 2.0 documentation contains the following paragraph on Atomicity:
For example, if using a write consistency level of QUORUM with a replication factor of 3, Cassandra will replicate the write to all nodes in the cluster and wait for acknowledgement from two nodes. If the write fails on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that node. However, the replicated write that succeeds on the other node is not automatically rolled back.
So, write requests are sent to 3 nodes, and we're waiting for 2 ACKs. Let's assume we only receive 1 ACK (before the timeout). So it's clear that if we read with consistency ONE, we may read the value, OK.
But which of the following statements is also true:
It may occur that the write has been persisted on a second node, but the node's ACK got lost? (Note: this could result in a read of the value even at read consistency QUORUM!)
It may occur that the write will be persisted later to a second node (e.g. due to hinted handoff)? (Note: this could result in a read of the value even at read consistency QUORUM!)
It's impossible that the write is persisted on a second node, and the written value will eventually be removed from the node via read repair?
It's impossible that the write is persisted on a second node, but it is necessary to perform a manual "undo" action?
I believe you are mixing atomicity and consistency. Atomicity is not guaranteed across nodes whereas consistency is. Only writes to a single row in a single node are atomic in the truest sense of atomicity.
The only time Cassandra will fail a write is when too few replicas are alive when the coordinator receives the request, i.e. it cannot meet the consistency level. Otherwise, your second statement is correct: it will store a hint so that the failed node (replica) gets this row replicated once it is back.
This article describes the different failure conditions.
http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure
I want to clarify some very basic concepts around replication factor and consistency level in Cassandra. I would highly appreciate it if someone could answer the questions below.
RF- Replication Factor
RC- Read Consistency
WC- Write Consistency
2 Cassandra nodes (Ex: A, B) RF=1, RC=ONE, WC=ONE or ANY
can I write data to node A and read from node B ?
what will happen if A goes down ?
3 Cassandra nodes (Ex: A, B, C) RF=2, RC=QUORUM, WC=QUORUM
can I write data to node A and read from node C ?
what will happen if node A goes down ?
3 Cassandra nodes (Ex: A, B, C) RF=3, RC=QUORUM, WC=QUORUM
can I write data to node A and read from node C ?
what will happen if node A goes down ?
Short summary: Replication factor describes how many copies of your data exist. Consistency level describes the behavior seen by the client. Perhaps there's a better way to categorize these.
As an example, you can have a replication factor of 2. When you write, two copies will always be stored, assuming enough nodes are up. When a node is down, writes for that node are stashed away and written when it comes back up, unless it's down long enough that Cassandra decides it's gone for good.
Now say in that example you write with a consistency level of ONE. The client will receive a success acknowledgement after the write is done on one node, without waiting for the second write. If you did a write with a CL of ALL, the acknowledgement to the client would wait until both copies are written. There are many other consistency level options, too many to cover all the variants here. Read the DataStax docs, though; they do a good job of explaining them.
In the same example, if you read with a consistency level of ONE, the response will be sent to the client after a single replica responds. Another replica may have newer data, in which case the response will not be up-to-date. In many contexts, that's quite sufficient. In others, the client will need the most up-to-date information, and you'll use a different consistency level on the read - perhaps a level ALL. In that way, the consistency of Cassandra and other post-relational databases is tunable in ways that relational databases typically are not.
Now getting back to your examples.
Example one: Yes, you can write to A and read from B, even if B doesn't have its own replica. B will ask A for it on your client's behalf. This is also true for your other cases where the nodes are all up. When they're all up, you can write to one and read from another.
For writes with WC=ONE, if the node holding the single replica is up and is the one you're connected to, the write will succeed; if that replica's node is down, the write will fail. If you use ANY, the write will succeed, assuming you're talking to a node that's up (I think you also have to have hinted handoff enabled for that). The down node will get the data later, and you won't be able to read it until after that occurs, not even from the node that's up.
In the other two examples, the replication factor will affect how many copies are eventually written, but it doesn't affect client behavior beyond what I've described above. QUORUM affects client behavior in that you have to have a sufficient number of nodes up and responding for writes and reads: if at least (replicas/2) + 1 of the nodes you need are up, writes and reads will succeed; if you don't have enough replica nodes up, reads and writes will fail. Overall, some QUORUM reads and writes can succeed while a node is down, as long as that node is either not needed to store your replica or its outage still leaves enough replica nodes available.
Check out this simple calculator which allows you to simulate different scenarios:
http://www.ecyrd.com/cassandracalculator/
For example with 2 nodes, a replication factor of 1, read consistency = 1, and write consistency = 1:
Your reads are consistent
You can survive the loss of no nodes.
You are really reading from 1 node every time.
You are really writing to 1 node every time.
Each node holds 50% of your data.
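For reference, here is a toy re-implementation of the checks behind those bullet points, using the rules discussed throughout this thread (the "data per node" figure assumes evenly balanced tokens):

```python
# Toy version of the calculator's checks.
def describe(nodes, rf, r, w):
    print("Consistent reads:", r + w > rf)            # overlap rule
    print("Survivable node losses:", rf - max(r, w))  # spare replicas
    print("Data held per node: %.0f%%" % (100.0 * rf / nodes))

describe(nodes=2, rf=1, r=1, w=1)
# Consistent reads: True
# Survivable node losses: 0
# Data held per node: 50%
```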