How does Cassandra know that it has completed QUORUM?

I have always used Cassandra in spark applications, but I never wondered how it works internally. Reading the Cassandra documentation I got a small doubt (which may be a beginner's doubt).
I read in a book (Cassandra The Definitive Guide) and in the official Cassandra documentation that the formula would be:
(RF / 2) + 1.
So theoretically, if I have a cluster with 6 nodes and a replication factor of 3, I would only need responses from 2 nodes.
And here come the small doubts:
1 - What would this response be? (The query return with the data?)
2 - If there was no data with the filters used in the query, is the empty return considered a response?
3 - And last but not least, if the empty return is considered a response, if these two nodes that complete the QUORUM don't have the replica data yet, my application that did the SELECT will understand that this data doesn't exist in the database, right?

Your reasoning sounds correct to me.
Basically, if you're reading at LOCAL_QUORUM and have an RF of 3, it's possible that the coordinator accepts a response from two replicas that are both inconsistent and leaves out the third replica that had consistent data.
It's one of the reasons Cassandra is considered an eventually consistent db, and also why regular repairs of the data are so important for production databases. Of course, if consistency mattered above all else, you could always read with a CL of ALL, but you'd sacrifice some amount of response time as a tradeoff. Assuming the db is provisioned well, though, while it's certainly in the realm of the possible, it isn't likely that only a single replica receives an incoming write unless you make a habit of only writing at a CL of ONE/LOCAL_ONE. If consistency mattered, you'd be writing to the db with a CL of at least LOCAL_QUORUM to avoid this very scenario.
To try and answer your questions directly, yes, having no data to return can be a valid response, and yes if the two replicas chosen by the coordinator both agree there is no data to return, the app will report that result.

1 - What would this response be? (The query return with the data?)
The coordinator node will wait for 2 replicas of the 3 (because CL=QUORUM) to respond to the query (with the request results). It will then send the response to the client.
2 - If there was no data with the filters used in the query, is the empty return considered a response?
Yes, the empty response will be sufficient and will be considered a valid response. Note that there is a last-write-wins mechanism (based on the row's write time) used in case of conflict.
3 - And last but not least, if the empty return is considered a response, if these two nodes that complete the QUORUM don't have the replica data yet, my application that did the SELECT will understand that this data doesn't exist in the database, right?
You have to understand that Apache Cassandra uses eventual consistency, meaning that the client decides on the desired CL. If you have strong consistency, meaning the write CL and read CL overlap (Write CL + Read CL > RF), then you will always retrieve the latest data. I recommend watching this video: https://www.youtube.com/watch?v=Gx-pmH-b5mI
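To make the CL discussion concrete, here is a minimal sketch of issuing a read at QUORUM with the DataStax Python driver; the contact point, keyspace, table and key are made up for illustration.

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])          # hypothetical contact point
    session = cluster.connect("my_keyspace")  # hypothetical keyspace

    # With RF=3 and CL=QUORUM the coordinator waits for 2 of the 3 replicas
    # to answer before returning a result (including an empty one) to the client.
    query = SimpleStatement(
        "SELECT * FROM users WHERE user_id = %s",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    row = session.execute(query, ("amit",)).one()
    print(row)  # None if the two replicas that answered have no matching data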

Related

How does Cassandra handle inconsistencies between two replicas?

I have a simple question on the strategy Cassandra opted for when the following scenario happens.
Scenario
At T1, replica 1 receives the write mutation like name = amit, language = english
At T1 + 1, replica 2 receives the update like language = japanese where name = amit
Assume that the write has not yet been replicated to replica 2 when the update for that record arrives. How does Cassandra handle this scenario?
My guess: maybe replica 2 will check the Lamport timestamp of the update message (say it is 102) and ask replica 1 for any records with timestamps less than 102, so that it (replica 2) can execute them first and then execute the update statement.
Any help would be appreciated.
Under the hood (for normal operations, not LWTs) both INSERTs and UPDATEs are UPSERTs - they aren't dependent on the previous state to perform an update of the data. When you perform an UPDATE, Cassandra just puts the corresponding value without checking whether the corresponding primary key exists, and that's all. And even if earlier operations arrive later, Cassandra will check the "write time" to resolve the conflict.
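To illustrate the upsert and last-write-wins behaviour, here is a hedged sketch using CQL through the Python driver; the table, column names and the explicit USING TIMESTAMP values are hypothetical and only there to make the conflict resolution visible.

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])          # hypothetical contact point
    session = cluster.connect("my_keyspace")  # hypothetical keyspace

    # An UPDATE is an upsert: it writes the cell even if no row existed before.
    session.execute(
        "UPDATE people USING TIMESTAMP 102 SET language = 'japanese' WHERE name = 'amit'"
    )

    # If the "earlier" write arrives afterwards with a lower write time, it does
    # not overwrite the newer cell: the highest timestamp wins per cell.
    session.execute(
        "INSERT INTO people (name, language) VALUES ('amit', 'english') USING TIMESTAMP 101"
    )

    row = session.execute("SELECT language FROM people WHERE name = 'amit'").one()
    print(row.language)  # 'japanese' - the cell with the later timestamp wins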
For your case, it will go as follows:
replica 1 receives the write and retransmits it to the other replicas in the cluster, including replica 2. If replica 2 isn't available at that moment, the mutation will be written as a hint that will be replayed when replica 2 is back up.
replica 2 may receive new updates, and also retransmits them to the other replicas.
The coordinator deals with the inconsistencies depending on the consistency level (CL) used. There are also other nuanced behaviours which, again, are tied to the consistency level for read and write requests.
CASE A - Failed writes
If your application uses a weak consistency of ONE or LOCAL_ONE for writes, the coordinator will (1) return a successful write response to the client/driver even if just one replica acknowledges the write, and (2) will store a hint for the replica(s) which did not respond.
When the replica(s) is back online, the coordinator will (3) replay the hint (resend the write/mutation) to the replica to keep it in sync with other replicas.
If your application uses a strong consistency of LOCAL_QUORUM or QUORUM for writes, the coordinator will (4) return a successful write response to the client/driver when the required number of replicas have acknowledged the write. If any replicas did not respond, the same hint storage in (2) and hint replay in (3) applies.
CASE B - Read with weak CL
If your application issues a read request with a CL of ONE or LOCAL_ONE, the coordinator will only ever query one replica and will return the result from that one replica.
Since the app requested the data from just one replica, the data does NOT get compared to any other replicas. This is the reason we recommend using a strong consistency level like LOCAL_QUORUM.
CASE C - Read with strong CL
For a read request with a CL of LOCAL_QUORUM against a keyspace with a local replication factor of 3, the coordinator will (5) request the data from 2 replicas. If the replicas don't match, the (6) data with the latest timestamp wins, and (7) a read-repair is triggered to repair the inconsistent replica.
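To tie cases A and C above to client code, here is a small sketch (again with the Python driver and hypothetical table names) showing how the write and read CLs in those cases are chosen per statement.

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])          # hypothetical contact point
    session = cluster.connect("my_keyspace")  # hypothetical keyspace

    # CASE A: weak write - the coordinator acknowledges after a single replica
    # responds and stores hints for any replicas that were down.
    weak_write = SimpleStatement(
        "INSERT INTO people (name, language) VALUES ('amit', 'english')",
        consistency_level=ConsistencyLevel.LOCAL_ONE,
    )
    session.execute(weak_write)

    # CASE C: strong read - the coordinator queries 2 of the 3 local replicas,
    # returns the newest cells, and may trigger a read repair on a mismatch.
    strong_read = SimpleStatement(
        "SELECT * FROM people WHERE name = 'amit'",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    )
    print(session.execute(strong_read).one())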
For more info, see the following documents:
How read requests are accomplished in Cassandra
How writes are accomplished in Cassandra

What happens if a node in Cassandra database fails while transferring data to client?

Let's say we have a Cassandra cluster of 6 nodes and RF=3. Suppose we run a query and, while processing or transferring the data, the node serving it fails. What are the possible outcomes for the following scenarios?
Let's say it's processing the required data from disk and the node dies in the process. Will the coordinator (the node that received our request) resend the request to one of the replica nodes, or just return an error to the client?
Let's say the node died while it was transferring data. Will the coordinator return partial data, or will it realise that the information is incomplete and resend the request to a different node (a replica)?
In either of these cases, as a programmer, do we have to explicitly code for any of these conditions, or is it all taken care of internally by Cassandra?
Thanks in advance.
P.S: I am sorry if a similar question has been asked before. I did try searching but I couldn't find it.
One of the most important concepts to understand in Cassandra is its variable "Consistency Level", or CL. Perhaps the most common setting is CL=QUORUM, which means that with RF=3 (each piece of data is replicated on 3 nodes), Cassandra will require two successful responses from two replicas before returning a result to the client.
In a request for a particular partition, the coordinator will start out by sending the client's request to 2 of the 3 replicas known to hold the partition. Cassandra keeps an estimate of the average response latency, and when a response takes longer than this estimate, it sends a third request to the third replica. Such a timeout will happen in the cases you mentioned - if the response doesn't complete quickly (it doesn't matter if it partially completed), the third request is sent. Unless two nodes are down at the same time, you will get your complete response and the client doesn't need to take care of anything. This is the "high availability" feature that Cassandra and other NoSQL databases are famous for.
Note that this answer is true even for extremely long responses (scanning an entire table, or fetching a very long partition). Such long responses are broken up into "pages" of reasonable length; each page is fetched in a separate request, and can come from 2 of the 3 replicas, not necessarily the same ones.
Everything I wrote above also applies to Scylla, as well as Cassandra.
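The paging behaviour described above is also visible from the client side; a minimal sketch with the Python driver, where the fetch size, keyspace and table are illustrative assumptions.

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])          # hypothetical contact point
    session = cluster.connect("my_keyspace")  # hypothetical keyspace

    # A long scan is fetched page by page; each page is a separate request and,
    # under CL=QUORUM, may be answered by a different pair of replicas.
    scan = SimpleStatement(
        "SELECT * FROM events",
        consistency_level=ConsistencyLevel.QUORUM,
        fetch_size=1000,  # rows per page
    )
    count = 0
    for row in session.execute(scan):  # the driver fetches further pages transparently
        count += 1
    print(count)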

Strong Consistency in Cassandra

According to a DataStax article, strong consistency can be guaranteed
if, R + W > N
where
R is the consistency level of read operations
W is the consistency level of write operations
N is the number of replicas
What does strong consistency mean here? Does it mean that 'every time' a query's response is given from the database, the response will 'always' be the last updated value? If the conditions for strong consistency are maintained in Cassandra, are there no scenarios where the data returned might be inconsistent? In short, does strong consistency mean 100% consistency?
Edit 1
Adding some additional material regarding some scenarios where Cassandra might not be consistent even when R+W>RF
Write fails with Quorum CL
Cassandra's eventual consistency
Cassandra has tunable consistency with some tradeoffs you can choose.
R + W > N - this simply means there must be at least one overlapping node in your round trip that has the actual, newest data available, in order to be consistent.
For example if you write at CL.ONE you will need to read at CL.ALL to be sure to get a consistent result: N+1 > N - but you might not want CL.ALL as you can not tolerate a single node failure in your cluster.
Often you can choose CL.QUORUM at read and write time to ensure consistency and tolerate node failures. For example at RF=3 a QUORUM needs (3/2)+1=2 nodes available, so R+W>N will be 4>3 - your requests are consistent AND you can tolerate a single node failure.
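To make that arithmetic explicit, here is a small sketch in plain Python of the quorum size and the R + W > N overlap check; the helper names are made up for the example.

    def quorum(replication_factor: int) -> int:
        """Quorum size: (RF / 2) + 1, using integer division."""
        return replication_factor // 2 + 1

    def overlaps(write_nodes: int, read_nodes: int, replication_factor: int) -> bool:
        """True when at least one replica is common to the write and read sets."""
        return write_nodes + read_nodes > replication_factor

    rf = 3
    print(quorum(rf))                            # 2
    print(overlaps(quorum(rf), quorum(rf), rf))  # True: 2 + 2 > 3
    print(overlaps(1, 1, rf))                    # False: ONE + ONE does not overlap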
One thing to keep in mind - it is really important to have tightly synchronized clocks on all your nodes (Cassandra and application); you will want to have NTP up and running.
While this is an old question, I thought I would chip in to set the record straight.
R+W>RF does not imply strong consistency
A system with R+W > RF will only be eventually consistent. The strong-consistency guarantee breaks during node failures or in between writes. For example, consider the following scenario:
Assume that there are 3 nodes A, B, C with RF=3, W=3, R=2 (hence R+W = 5 > 3 = RF).
Further assume key k is associated with value v, i.e. (k,v) is stored in the database. Suppose the following series of actions occurs:
t=1: (k,v1) write request is sent to A,B,C from a user
t=2: (k,v1) reaches A and is written to store at A
t=3: Reader 1 sends a read request for key k, which is replied to by A and B
t=4: Reader 1 receives response (k,v1) - by the last-write-wins rule
t=5: Reader 1 sends another read request which gets served by nodes B and C
t=6: Reader 1 receives response (k,v), which is an older value - INCONSISTENCY!
t=7: (k,v1) reaches C and is written to store at C
t=8: (k,v1) reaches B and is written to store at B
This demonstrates that W+R>RF cannot guarantee strong consistency. To ensure strong consistency you might want to use another algorithm, such as Paxos or Raft, that can help in ensuring that the writes are atomic. You can read an interesting article on this here (do check out the FAQ section).
Edit:
Cassandra does have an internal mechanism (called blocking read repair) that triggers synchronous writes before the response from the db is sent back to the client. This kind of synchronous read repair occurs in case of inconsistencies among the nodes queried to achieve the read consistency level, and ensures something known as Monotonic Read Consistency [see below for definitions]. This causes the (k,v1) in the above example to be written to node B before the response is returned for the first read request, so the second read request would also see the updated value. (Thanks to @Nadav Har'El for pointing this out.)
However, this still does not guarantee strong consistency. Below are some definitions to clear it up:
Sequential/Strong Consistency: the result of any execution is the same as if the reads and writes occur in some order, and the operations of each individual processor appear in this sequence in the order specified by its program [as defined by Leslie Lamport]
Monotonic Read Consistency: once you read a value, all subsequent reads will return this value or a newer version
Sequential consistency would require the client program/reader to see the latest value that was written since the write statement is executed before the read statement in the sequence of program instructions.
For both reads and writes, the consistency levels of ANY , ONE , TWO , and THREE are considered weak, whereas QUORUM and ALL are considered strong.
Yes. If R + W is greater than the number of replicas, then you will always get consistent data - 100% consistency. But you will have to trade availability to achieve higher consistency.
Cassandra has concept of tunable consistency (set consistency on query basis).
I would actually regard this strong consistency as strong read consistency. And it is sessional, a.k.a. Monotonic Read Consistency (refer to @Nadav Har'El's answer).
But it is not sequential consistency, as Cassandra doesn't fully support locks or transactions, and doesn't serialize write operations. There are only lightweight transactions, which support local serialization of write operations and serialization of read operations.
To make things easy to understand, let's say we have three nodes - A, B, C - and set the read quorum to 3 and the write to 1.
If there is only one client, it writes to any node - A.
B and C might not be synchronized yet. (Eventually they will be - eventual consistency.)
But when the client reads again, it is required to get responses from all three nodes, and by comparing the latest timestamps, A's record will be used. This is Monotonic Read Consistency.
However, if there are two clients trying to update the records at the same time, or if they try to read the value first and then rewrite it (e.g. increase a column by 100) at the same time:
Client C1 and client C2 both read the current column value as 10, and they both decide to increase it by 100.
While C1 just needs to write 110 to one node, client C2 will do the same, and the final result on any node can only ever be 110.
So we lose 100 in these operations (lost updates). This is an issue caused by a race condition, and it has to be fixed by serializing the operations and using some form of lock, just like other SQL databases implement transactions.
I know Cassandra now has a counter column type which might address this, but it is still limited in terms of full transactions. And Cassandra is also not supposed to be transactional, as it is a NoSQL database that sacrifices consistency for availability.
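Lightweight transactions, mentioned earlier, are the usual way to avoid this lost-update pattern; a hedged sketch with the Python driver, where the table, column and values are hypothetical.

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])          # hypothetical contact point
    session = cluster.connect("my_keyspace")  # hypothetical keyspace

    # Conditional update (a lightweight transaction, Paxos under the hood):
    # it is only applied if the value is still the one this client read,
    # so a concurrent increment cannot be silently lost.
    result = session.execute(
        "UPDATE balances SET value = 110 WHERE id = 'acct1' IF value = 10"
    )
    if not result.was_applied:
        # Another client changed the value first; re-read and retry.
        print("conditional update rejected, current state:", result.one())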

When would Cassandra not provide C, A, and P with W/R set to QUORUM?

When both read and write are set to quorum, I can be guaranteed the client will always get the latest value when reading.
I realize this may be a novice question, but I'm not understanding how this setup doesn't provide consistency, availability, and partitioning.
With a quorum, you are unavailable (i.e. won't accept reads or writes) if there aren't enough replicas available. You can choose to relax and read / write on lower consistency levels granting you availability, but then you won't be consistent.
There's also the case where a quorum on reads and writes guarantees that the latest "written" data is retrieved. However, if a coordinator doesn't know about required partitions being down (i.e. gossip hasn't propagated after 2 of 3 nodes fail), it will issue a write to 3 replicas (assuming quorum consistency on a replication factor of 3). The one live node will write, and the other 2 won't (they're down). The write times out (it doesn't fail). A write timeout where even one node has written IS NOT a write failure; it's a write "in progress". Let's say the down nodes come up now. If a client next requests that data with quorum consistency, one of two things happens:
Request goes to one of the two downed nodes, and to the "was live" node. Client gets latest data, read repair triggers, all is good.
Request goes to the two nodes that were down. OLD data is returned (assuming repair hasn't happened). The coordinator gets a digest from the third node and read repair kicks in. This is when the original write is considered "complete" and subsequent reads will get the fresh data. All is good, but one client will have received the old data, as the write was "in progress" but not "complete". This is a very small, rare scenario. One thing to note is that writes to Cassandra are upserts on keys, so retries are usually enough to get around this problem; however, in case nodes genuinely go down, the initial read may be a problem.
Typically you balance your consistency and availability requirements. That's where the term tunable consistency comes from.
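Both the unavailability case and the write-timeout case show up in the client as distinct errors; a sketch of how they might surface with the DataStax Python driver, noting that the exact behaviour (and whether you see these exceptions directly) depends on the configured retry policy, and the keyspace/table names are made up.

    from cassandra import ConsistencyLevel, Unavailable, WriteTimeout
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])          # hypothetical contact point
    session = cluster.connect("my_keyspace")  # hypothetical keyspace

    write = SimpleStatement(
        "INSERT INTO people (name, language) VALUES ('amit', 'english')",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    try:
        session.execute(write)
    except Unavailable as exc:
        # The coordinator already knows too few replicas are alive: the write is
        # rejected up front (consistency preserved, availability sacrificed).
        print("quorum unavailable:", exc.required_replicas, "needed,", exc.alive_replicas, "alive")
    except WriteTimeout:
        # Some replicas did not answer in time. This is NOT a failed write: it may
        # still be applied on the replicas that did receive it ("in progress").
        print("write timed out; outcome unknown until retry/repair")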
That said, the web is full of links that dispute (or at least try to dispute) Brewer's CAP theorem... From the theorem's point of view, the C says that
all nodes see the same data at the same time
which is quite different from the guarantee that a client will always retrieve fresh information. Strictly following the theorem, in your situation the C is not respected.
The DataStax documentation contains a section on Configuring Data Consistency. Looking through all of the available consistency configurations, for QUORUM it states:
Returns the record with the most recent timestamp after a quorum of replicas has responded regardless of data center. Ensures strong consistency if you can tolerate some level of failure.
Note that last part "tolerate some level of failure." Right there it's indicating that by using QUORUM consistency you are sacrificing availability (A).
The document referenced above also further defines the QUORUM level, stating that your replication factor comes into play as well:
If consistency is top priority, you can ensure that a read always reflects the most recent write by using the following formula:
(nodes_written + nodes_read) > replication_factor
For example, if your application is using the QUORUM consistency level for both write and read operations and you are using a replication factor of 3, then this ensures that 2 nodes are always written and 2 nodes are always read. The combination of nodes written and read (4) being greater than the replication factor (3) ensures strong read consistency.
In the end, it all depends on your application requirements. If your application needs to be highly-available, ONE is probably your best choice. On the other hand, if you need strong-consistency, then QUORUM (or even ALL) would be the better option.

Cassandra's atomicity and "rollback"

The Cassandra 2.0 documentation contains the following paragraph on Atomicity:
For example, if using a write consistency level of QUORUM with a replication factor of 3, Cassandra will replicate the write to all nodes in the cluster and wait for acknowledgement from two nodes. If the write fails on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that node. However, the replicated write that succeeds on the other node is not automatically rolled back.
So, write requests are sent to 3 nodes, and we're waiting for 2 ACKs. Let's assume we only receive 1 ACK (before the timeout). So it's clear that if we read with consistency ONE, we may read the value.
But which of the following statements is also true:
It may occur that the write has been persisted on a second node, but the node's ACK got lost? (Note: this could result in a read of the value even at read consistency QUORUM!)
It may occur that the write will be persisted later to a second node (e.g. due to hinted handoff)? (Note: this could result in a read of the value even at read consistency QUORUM!)
It's impossible that the write is persisted on a second node and the written value will eventually be removed from the node via read repair?
It's impossible that the write is persisted on a second node, but it is necessary to perform a manual "undo" action?
I believe you are mixing atomicity and consistency. Atomicity is not guaranteed across nodes whereas consistency is. Only writes to a single row in a single node are atomic in the truest sense of atomicity.
The only time Cassandra will fail a write is when too few replicas are alive when the coordinator receives the request, i.e. it cannot meet the consistency level. Otherwise your second statement is correct: the coordinator will store a hint so that the failed node (replica) gets this row replicated to it later.
This article describes the different failure conditions.
http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure
