How does Cassandra guarantee eventual consistency in cross-region replication?

I cannot find much documentation about it. The only thing I can find is that when the consistency level is not set to EACH_QUORUM, cross-region replication is done asynchronously.
But with asynchronous replication, is it possible to lose messages? How does Cassandra handle lost messages?

If you don't use EACH_QUORUM and a destination node that would accept a write is down, the coordinator node saves the write as a "hinted handoff".
When the destination node becomes available again, the coordinator replays the hinted handoffs on it.
For any occasion when hinted handoffs are lost, you have to run a repair on your cluster.
Also be aware that hints are only stored for a maximum of 3 hours by default.
For further info see documentation at:
http://www.datastax.com/dev/blog/modern-hinted-handoff
http://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesHintedHandoff.html
Hope this helps.
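To make the hint lifecycle concrete, here is a minimal sketch in plain Python. It is a toy model, not the Cassandra codebase: the `Coordinator` class, its fields, and the time handling are all made up for illustration. The only real detail it encodes is the default 3-hour hint window (`max_hint_window_in_ms`), after which hints are no longer stored and only repair can deliver the missed writes.

```python
HINT_WINDOW_SECONDS = 3 * 60 * 60  # default max hint window: 3 hours

class Coordinator:
    """Toy model of a coordinator storing hints for a down replica."""
    def __init__(self):
        self.hints = []          # (timestamp, key, value) destined for the down node
        self.replica_data = {}   # the remote replica's storage

    def write(self, key, value, replica_up, now):
        if replica_up:
            self.replica_data[key] = value
        else:
            # Replica is down: store a hint instead of the write.
            self.hints.append((now, key, value))

    def replica_recovers(self, now):
        # Replay only hints still within the hint window; older hints
        # are dropped, and those writes need a repair to be delivered.
        for ts, key, value in self.hints:
            if now - ts <= HINT_WINDOW_SECONDS:
                self.replica_data[key] = value
        self.hints = []

c = Coordinator()
c.write("k1", "v1", replica_up=False, now=0)    # hinted
c.write("k2", "v2", replica_up=False, now=0)    # hinted
c.replica_recovers(now=60)                      # within window: hints replayed
print(c.replica_data)                           # both writes delivered

c2 = Coordinator()
c2.write("k3", "v3", replica_up=False, now=0)
c2.replica_recovers(now=4 * 60 * 60)            # 4h later: hint expired
print(c2.replica_data)                          # empty; only repair can fix this
```

This is why the answer above says hinted handoff alone is not enough: a node that stays down longer than the hint window silently loses those writes until you run a repair.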

When you issue a write in Cassandra, the coordinator sends the write to all online replicas, and then blocks. The duration of the block corresponds to the consistency level - if you say "ALL", it blocks until all nodes ack the write. If you use "EACH_QUORUM", it blocks until a quorum of nodes in each datacenter acks the write.
For any replica that didn't ack the write, the coordinator will write a hint, and attempt to deliver that hint later (minutes, hours, no guarantee).
Note, though, that the writes were all sent at the same time - what you don't have is a guarantee as to which were delivered. Your guarantee is in the consistency level.
When you read, you'll do something similar - you'll block until you have an appropriate number of replicas answering. If you write with EACH_QUORUM, you can read with LOCAL_QUORUM and guarantee strong consistency. If you write with QUORUM, you could read with QUORUM. If you write with ONE, you could still guarantee strong consistency if you read with ALL.
To guarantee eventual consistency, you don't have to do anything - it'll eventually get there, as long as you wrote with CL >= ONE (CL ANY isn't really a guarantee).
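The overlap rule behind those write/read pairings can be sketched in a few lines of plain Python. This is a simplified single-datacenter model (the level names and ack counts are the standard ones for a replication factor of 3, but the function itself is illustrative, not a Cassandra API): reads see the latest write whenever the write ack set and the read ack set must overlap in at least one replica.

```python
RF = 3  # replication factor assumed for this sketch

def acks(level, rf=RF):
    """Replicas that must acknowledge at a given level (single-DC sketch)."""
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[level]

def strongly_consistent(write_level, read_level, rf=RF):
    # Strong consistency requires the write and read replica sets to overlap:
    # W + R > RF guarantees at least one replica saw the latest write.
    return acks(write_level, rf) + acks(read_level, rf) > rf

print(strongly_consistent("QUORUM", "QUORUM"))  # True:  2 + 2 > 3
print(strongly_consistent("ONE", "ALL"))        # True:  1 + 3 > 3
print(strongly_consistent("ONE", "ONE"))        # False: 1 + 1 <= 3
```

The EACH_QUORUM-write / LOCAL_QUORUM-read pairing mentioned above is the same idea applied per datacenter: a quorum was written in every DC, so a local quorum read must overlap it.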

Related

BLOCKING read repair blocks writes on other replicas?

Learning Cassandra. There are a couple of things about read repair that I don't understand.
The docs say this about BLOCKING read repair:
If a read repair is triggered, the read blocks writes sent to other replicas until the consistency level is reached by the writes.
To be honest, this entire sentence just doesn't make sense to me. First, why would read repair need to block writes? Isn't read repair in essence just a simple write of reconciled data? Second, how can read repair block writes on other replicas?
The docs also say that BLOCKING read repair breaks partition level write atomicity.
Cassandra attempts to provide partition level write atomicity, but since only the data covered by a SELECT statement is repaired by a read repair, read repair can break write atomicity when data is read at a more granular level than it is written. For example, read repair can break write atomicity if you write multiple rows to a clustered partition in a batch, but then select a single row by specifying the clustering column in a SELECT statement.
Again, I don't understand how write atomicity gets broken. A single-partition batch is atomic and isolated, right? Can someone explain it more?
What implications does this breaking of atomicity have for developers? I mean, it sure doesn't sound good.
EDIT:
For the first question see the accepted answer. For the second question this issue explains how atomicity gets broken.
I can see where the docs are a bit confusing. Allow me to expand on the subject and hopefully clarify it for you.
The wording in this paragraph could probably use a rewrite:
If a read repair is triggered, the read blocks writes sent to other replicas until the consistency level is reached by the writes.
It's referred to as a blocking read-repair because the reads are blocked (result is not returned to the client/driver by the coordinator) until the problematic replicas are repaired. The mutation/write is sent to the offending replica and the replica must acknowledge that the write is successful (i.e. persisted to commitlog).
The read-repair does not block ordinary writes -- it's just that the read request by the coordinator is blocked until the offending replica(s) involved in the request is repaired.
For the second part of your question, it's an extreme case where that scenario would take place, because it's really a race condition between the batch and the read-repair. I've worked on a lot of clusters and I've never run into that situation (maybe I'm just extremely lucky 🙂). I've certainly never had to worry about it before.
It has to be said that read-repairs happen because replicas miss mutations. In a distributed environment, you would expect the odd dropped mutation. But if it's a regular occurrence in the cluster, read-repair is the least of your worries, since you probably have a bigger underlying issue - unresponsive nodes from long GC pauses, or commitlog disks not able to keep up with writes. Cheers!
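The "reads are blocked, writes are not" point can be shown with a toy model. Everything here (the replica list, the timestamps, the `quorum_read` function) is invented for illustration, but the control flow matches the description above: on a QUORUM read with a digest mismatch, the coordinator sends the newest value to the stale replica and waits for its ack before returning the result to the client.

```python
# Three replicas of key "k" at RF=3; one missed a mutation and is stale.
replicas = [
    {"value": "new", "ts": 2},   # up-to-date replica
    {"value": "old", "ts": 1},   # stale replica (missed a write)
    {"value": "new", "ts": 2},
]

def quorum_read(replicas):
    quorum = len(replicas) // 2 + 1
    contacted = replicas[:quorum]                  # coordinator contacts 2 of 3
    newest = max(contacted, key=lambda r: r["ts"])
    repaired = 0
    for r in contacted:
        if r["ts"] < newest["ts"]:
            # Repair mutation: send the newest value and wait for the ack.
            r["value"], r["ts"] = newest["value"], newest["ts"]
            repaired += 1
    # Only now is the result returned to the client. The *read* was held up;
    # independent writes from other clients were never blocked.
    return newest["value"], repaired

value, repaired = quorum_read(replicas)
print(value, repaired)   # the stale replica was repaired before responding
```

So "blocking" refers to the client's read response, not to mutations from other clients.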

When would Cassandra not provide C, A, and P with W/R set to QUORUM?

When both read and write are set to quorum, I can be guaranteed the client will always get the latest value when reading.
I realize this may be a novice question, but I'm not understanding how this setup doesn't provide consistency, availability, and partitioning.
With a quorum, you are unavailable (i.e. won't accept reads or writes) if there aren't enough replicas available. You can choose to relax and read / write on lower consistency levels granting you availability, but then you won't be consistent.
There's also the case where a quorum on reads and writes guarantees you that the latest "written" data is retrieved. However, if a coordinator doesn't know about required replicas being down (i.e. gossip hasn't propagated after 2 of 3 nodes fail), it will issue a write to 3 replicas (assuming quorum consistency on a replication factor of 3). The one live node will write, and the other 2 won't (they're down). The write times out (it doesn't fail). A write timeout where even one node has written IS NOT a write failure. It's a write "in progress". Let's say the down nodes come up now. If a client next requests that data with quorum consistency, one of two things happens:
Request goes to one of the two downed nodes, and to the "was live" node. Client gets latest data, read repair triggers, all is good.
Request goes to the two nodes that were down. OLD data is returned (assuming repair hasn't happened). The coordinator gets a digest from the third node, and read repair kicks in. This is when the original write is considered "complete" and subsequent reads will get the fresh data. All is good, but one client will have received the old data, as the write was "in progress" but not "complete". This is a very rare scenario. One thing to note is that writes to Cassandra are upserts on keys, so retries are usually enough to get around this problem; however, in cases where nodes genuinely go down, the initial read may be a problem.
Typically you balance your consistency and availability requirements. That's where the term tunable consistency comes from.
That said, the web is full of links that disprove (or at least try to disprove) Brewer's CAP theorem. From the theorem's point of view, the C says that
all nodes see the same data at the same time
which is quite different from the guarantee that a client will always retrieve fresh information. Strictly following the theorem, in your situation, C is not respected.
The DataStax documentation contains a section on Configuring Data Consistency. In looking through all of the available consistency configurations, For QUORUM it states:
Returns the record with the most recent timestamp after a quorum of replicas has responded regardless of data center. Ensures strong consistency if you can tolerate some level of failure.
Note that last part "tolerate some level of failure." Right there it's indicating that by using QUORUM consistency you are sacrificing availability (A).
The document referenced above also further defines the QUORUM level, stating that your replication factor comes into play as well:
If consistency is top priority, you can ensure that a read always reflects the most recent write by using the following formula:
(nodes_written + nodes_read) > replication_factor
For example, if your application is using the QUORUM consistency level for both write and read operations and you are using a replication factor of 3, then this ensures that 2 nodes are always written and 2 nodes are always read. The combination of nodes written and read (4) being greater than the replication factor (3) ensures strong read consistency.
In the end, it all depends on your application requirements. If your application needs to be highly-available, ONE is probably your best choice. On the other hand, if you need strong-consistency, then QUORUM (or even ALL) would be the better option.
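The availability trade-off in that last paragraph can be quantified with a small sketch. The helper names here are made up, and this is a single-datacenter simplification, but the arithmetic is the standard quorum math: how many replica failures each level can tolerate while still serving requests.

```python
def quorum(rf):
    # Smallest majority of the replica set.
    return rf // 2 + 1

def failures_tolerated(rf, level):
    """Replica nodes that can be down while reads/writes at `level` still succeed."""
    needed = {"ONE": 1, "QUORUM": quorum(rf), "ALL": rf}[level]
    return rf - needed

for rf in (3, 5):
    for level in ("ONE", "QUORUM", "ALL"):
        print(f"RF={rf} {level}: tolerates {failures_tolerated(rf, level)} down")
# At RF=3, QUORUM tolerates 1 down replica and ALL tolerates none:
# that's the availability you give up for stronger consistency.
```

This is exactly the ONE-vs-QUORUM-vs-ALL trade-off described above, expressed as numbers.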

Cassandra's atomicity and "rollback"

The Cassandra 2.0 documentation contains the following paragraph on Atomicity:
For example, if using a write consistency level of QUORUM with a replication factor of 3, Cassandra will replicate the write to all nodes in the cluster and wait for acknowledgement from two nodes. If the write fails on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that node. However, the replicated write that succeeds on the other node is not automatically rolled back.
So, write requests are sent to 3 nodes, and we're waiting for 2 ACKs. Let's assume we only receive 1 ACK (before timeout). So it's clear, that if we read with consistency ONE, that we may read the value, ok.
But which of the following statements is also true:
It may occur, that the write has been persisted on a second node, but the node's ACK got lost? (Note: This could result in a read of the value even at read consistency QUORUM!)
It may occur, that the write will be persisted later to a second node (e.g. due to hinted handoff)? (Note: This could result in a read of the value even at read consistency QUORUM!)
It's impossible, that the write is persisted on a second node, and the written value will eventually be removed from the node via ReadRepair?
It's impossible, that the write is persisted on a second node, but it is necessary to perform a manual "undo" action?
I believe you are mixing atomicity and consistency. Atomicity is not guaranteed across nodes whereas consistency is. Only writes to a single row in a single node are atomic in the truest sense of atomicity.
The only time Cassandra will fail a write is when too few replicas are alive when the coordinator receives the request, i.e. it cannot meet the consistency level. Otherwise your second statement is correct: it will store a hint, and the failed node (replica) will have this row replicated to it later.
This article describes the different failure conditions.
http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure
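The "no rollback" behaviour from the quoted paragraph can be sketched as follows. This is a toy model (the `quorum_write` function and its storage dict are invented for illustration): a QUORUM write at RF=3 that gets only one ack is reported to the client as a timeout, yet the replica that did persist the write keeps the value, and that value will win on later reads and repairs.

```python
class WriteTimeout(Exception):
    pass

def quorum_write(replicas_alive, storage, key, value, rf=3):
    needed = rf // 2 + 1           # QUORUM for RF=3 is 2 acks
    acked = 0
    for node in range(rf):
        if node in replicas_alive:
            storage.setdefault(node, {})[key] = value  # persisted; no undo
            acked += 1
    if acked < needed:
        # Reported as a timeout to the client, but the write that succeeded
        # is NOT rolled back.
        raise WriteTimeout(f"acked {acked}/{needed}")

storage = {}
try:
    quorum_write(replicas_alive={0}, storage=storage, key="k", value="v")
except WriteTimeout as e:
    print("write failed:", e)
print(storage)   # node 0 still holds the value despite the reported failure
```

This is why your first two statements hold: the value may surface even on a QUORUM read once hints or repair propagate it, and no manual "undo" exists.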

Understand cassandra replication factor versus consistency level

I want to clarify some very basic concepts about replication factor and consistency level in Cassandra. I would highly appreciate it if someone could provide answers to the questions below.
RF- Replication Factor
RC- Read Consistency
WC- Write Consistency
2 cassandra nodes (Ex: A, B) RF=1, RC=ONE, WC=ONE or ANY
can I write data to node A and read from node B ?
what will happen if A goes down ?
3 cassandra nodes (Ex: A, B, C) RF=2, RC=QUORUM, WC=QUORUM
can I write data to node A and read from node C ?
what will happen if node A goes down ?
3 cassandra nodes (Ex: A, B, C) RF=3, RC=QUORUM, WC=QUORUM
can I write data to node A and read from node C ?
what will happen if node A goes down ?
Short summary: Replication factor describes how many copies of your data exist. Consistency level describes the behavior seen by the client. Perhaps there's a better way to categorize these.
As an example, you can have a replication factor of 2. When you write, two copies will always be stored, assuming enough nodes are up. When a node is down, writes for that node are stashed away and written when it comes back up, unless it's down long enough that Cassandra decides it's gone for good.
Now say in that example you write with a consistency level of ONE. The client will receive a success acknowledgement after a write is done to one node, without waiting for the second write. If you did a write with a CL of ALL, the acknowledgement to the client will wait until both copies are written. There are very many other consistency level options, too many to cover all the variants here. Read the Datastax doc, though, it does a good job of explaining them.
In the same example, if you read with a consistency level of ONE, the response will be sent to the client after a single replica responds. Another replica may have newer data, in which case the response will not be up-to-date. In many contexts, that's quite sufficient. In others, the client will need the most up-to-date information, and you'll use a different consistency level on the read - perhaps a level ALL. In that way, the consistency of Cassandra and other post-relational databases is tunable in ways that relational databases typically are not.
Now getting back to your examples.
Example one: Yes, you can write to A and read from B, even if B doesn't have its own replica. B will ask A for it on your client's behalf. This is also true for your other cases where the nodes are all up. When they're all up, you can write to one and read from another.
For writes, with WC=ONE, if the node holding the single replica is up and is the one you're connected to, the write will succeed. If the replica belongs on the other node, the write will fail. If you use ANY, the write will succeed, assuming you're talking to the node that's up (I think you also have to have hinted handoff enabled for that). The down node will get the data later, and you won't be able to read it until after that occurs, not even from the node that's up.
In the other two examples, the replication factor will affect how many copies are eventually written, but it doesn't affect client behavior beyond what I've described above. QUORUM will affect client behavior in that you have to have a sufficient number of replica nodes up and responding for writes and reads: at least (replication_factor / 2) + 1 of them. If you don't have enough nodes with replicas up, reads and writes will fail. Overall, some QUORUM reads and writes can still succeed while a node is down, as long as that node either holds no replica of your data, or its outage still leaves enough replica nodes available.
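To make the three scenarios concrete, here's a toy placement model in plain Python. The `replicas_for` placement (the first RF nodes after a deterministic hash of the key) is a made-up simplification; real Cassandra uses a partitioner and a token ring. But it's enough to see when reads succeed after node A goes down.

```python
def replicas_for(key, nodes, rf):
    # Illustrative placement: hash the key to a starting node, then take
    # the next rf nodes around the "ring".
    start = sum(key.encode()) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]

def can_read(key, nodes, rf, up, needed):
    alive = [n for n in replicas_for(key, nodes, rf) if n in up]
    return len(alive) >= needed

nodes = ["A", "B", "C"]

# Scenario 1: RF=1, RC=ONE. The single copy of this key lives on A,
# so when A is down no node can serve it, not even B or C.
print(can_read("k1", nodes, rf=1, up={"B", "C"}, needed=1))           # False

# Scenario 2 (RF=2) and 3 (RF=3) at QUORUM, with A down:
for rf in (2, 3):
    needed = rf // 2 + 1
    ok = can_read("k1", nodes, rf=rf, up={"B", "C"}, needed=needed)
    print(f"RF={rf}, QUORUM, A down -> {ok}")
# RF=2 needs both replicas (quorum of 2 is 2), so losing a replica-holding
# node fails the read; RF=3 needs only 2 of 3, so it survives.
```

Note that for RF=2 the outcome depends on whether A actually holds a replica of the key you ask for, which is exactly the "if its outage still leaves enough replica nodes available" caveat above.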
Check out this simple calculator which allows you to simulate different scenarios:
http://www.ecyrd.com/cassandracalculator/
For example with 2 nodes, a replication factor of 1, read consistency = 1, and write consistency = 1:
Your reads are consistent
You can survive the loss of no nodes.
You are really reading from 1 node every time.
You are really writing to 1 node every time.
Each node holds 50% of your data.

How important is it to enable read repair in Cassandra?

If I understand it correctly, upon a write request the write is sent to all N replicas, and the operation succeeds when the first W responses are received. Is this correct?
If it is, then combined with Hinted Handoff, it seems that all replicas will already get all writes as soon as possible, do we really have to do read repair in this case?
Thanks.
Short answer: you still need read repair.
Longer answer: there wasn't a good discussion of Hinted Handoff anywhere, so I wrote one.
For Cassandra 1.0+, read the updated article. The crucial part being:
At first glance, it may appear that Hinted Handoff lets you safely get away without needing repair. This is only true if you never have hardware failure.
It is possible for hinted handoff to fail for various reasons - for example, the node the hint was written to can itself fail. With read repair enabled, if hinted handoff doesn't work for some reason, read repair will fix it. And then you should also run "nodetool repair" on your nodes to catch any cases where read repair and hinted handoff both fail to fix all the data.
Check the wiki for more info.
http://wiki.apache.org/cassandra/AntiEntropy
http://wiki.apache.org/cassandra/HintedHandoff
The consistency level can be varied for each write (and read).
For example, let's say we have 10 nodes, with a replication factor of 3.
If we write with a consistency level of ANY, none of the eventual 3 replicas may initially have the data when the write call returns. If we use consistency level ONE, then only one of the eventual 3 replicas has to have the data before the write returns, so a read straight after the write may see outdated data if the read has a low consistency level.
See http://wiki.apache.org/cassandra/API for the definitions of the consistency levels, particularly the following:
Read level ONE: Will return the record returned by the first replica to respond. A consistency check is always done in a background thread to fix any consistency issues when ConsistencyLevel.ONE is used. This means subsequent calls will have correct data even if the initial read gets an older value. (This is called ReadRepair)
See also http://wiki.apache.org/cassandra/ReadRepair :
Read repair means that when a query is made against a given key, we perform a digest query against all the replicas of the key and push the most recent version to any out-of-date replicas. If a low ConsistencyLevel was specified, this is done in the background after returning the data from the closest replica to the client; otherwise, it is done before returning the data.
