Google Spanner's Availability - google-cloud-spanner

In a multi-region configuration of Spanner, what happens to read-write requests (and strong read requests) if all replicas in both read-write regions go down? What happens to read-only requests?

Short answer: assume that reads will fail.
Long answer: It's hard to say, as it depends on where the read originates and on the type of read. Note that from the application's point of view, it cannot send a request directly to a read replica.
Reads from a region close to where the read-write replicas are located will most likely fail, as they may be directed to the RW replicas.
For reads originating from a region where there is a working read-only replica, it depends on the type of read:
Strong read requests will fail (as they need to contact an RW replica).
Exact staleness reads at a timestamp from when the RW replicas were still up will succeed (up to the one-hour version GC window).
Bounded staleness reads will succeed as long as the staleness bound is greater than the period for which the RW replicas have been down.
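As an illustration, here is a minimal sketch of the three read types discussed above, using the google-cloud-spanner Python client. The instance ID, database ID, table, and column names are hypothetical placeholders.

```python
# Sketch of strong, exact-staleness, and bounded-staleness reads in Spanner.
# Instance/database/table names are hypothetical.
import datetime
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("my-instance")    # hypothetical instance ID
database = instance.database("my-database")  # hypothetical database ID

# Strong read: must contact a read-write replica, so it fails if all
# RW replicas are down.
with database.snapshot() as snapshot:
    rows = list(snapshot.execute_sql("SELECT id, value FROM items"))

# Exact staleness read: can be served by a read-only replica, as long as
# the requested timestamp falls within the version GC window (~1 hour).
with database.snapshot(exact_staleness=datetime.timedelta(minutes=10)) as snapshot:
    rows = list(snapshot.execute_sql("SELECT id, value FROM items"))

# Bounded staleness read: succeeds while the bound still covers a time
# when the RW replicas were up.
with database.snapshot(max_staleness=datetime.timedelta(minutes=10)) as snapshot:
    rows = list(snapshot.execute_sql("SELECT id, value FROM items"))
```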

Related

BLOCKING read repair blocks writes on other replicas?

Learning Cassandra. There are a couple of things about read repair that I don't understand.
The docs say this about BLOCKING read repair:
If a read repair is triggered, the read blocks writes sent to other replicas until the consistency level is reached by the writes.
To be honest, this entire sentence just doesn't make sense to me. First, why would read repair need to block writes? Isn't read repair in essence just a simple write of reconciled data? Second, how can read repair block writes on other replicas?
The docs also say that BLOCKING read repair breaks partition level write atomicity.
Cassandra attempts to provide partition level write atomicity, but since only the data covered by a SELECT statement is repaired by a read repair, read repair can break write atomicity when data is read at a more granular level than it is written. For example, read repair can break write atomicity if you write multiple rows to a clustered partition in a batch, but then select a single row by specifying the clustering column in a SELECT statement.
Again, I don't understand how write atomicity gets broken. Single-partition batch is atomic and isolated, right? Can someone explain it more?
What implications does this breaking of atomicity have for developers? I mean, it sure doesn't sound good.
EDIT:
For the first question see the accepted answer. For the second question this issue explains how atomicity gets broken.
I can see where the docs are a bit confusing. Allow me to expand on the subject and hopefully clarify it for you.
The wording in this paragraph could probably use a rewrite:
If a read repair is triggered, the read blocks writes sent to other replicas until the consistency level is reached by the writes.
It's referred to as a blocking read-repair because the reads are blocked (result is not returned to the client/driver by the coordinator) until the problematic replicas are repaired. The mutation/write is sent to the offending replica and the replica must acknowledge that the write is successful (i.e. persisted to commitlog).
The read-repair does not block ordinary writes -- it's just that the read request by the coordinator is blocked until the offending replica(s) involved in the request is repaired.
For the second part of your question, it's an extreme case where that scenario would take place, because it's really a race condition between the batch and the read repair. I've worked on a lot of clusters and I've never run into that situation (maybe I'm just extremely lucky 🙂). I've certainly never had to worry about it before.
It has to be said that read repairs only happen because replicas miss mutations. In a distributed environment, you would expect the odd dropped mutation. But if it's a regular occurrence in the cluster, read repair is the least of your worries, since you probably have a bigger underlying issue -- unresponsive nodes from long GC pauses, or commitlog disks not able to keep up with writes. Cheers!
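To make the write-atomicity scenario from the quoted docs concrete, here is a minimal sketch using the DataStax Python driver. The keyspace, table, and column names are hypothetical; it simply shows a single-partition batch followed by a single-row read.

```python
# Multiple rows written to one partition in a single batch, but only one
# row read back. A read repair triggered by the SELECT only repairs the
# rows the SELECT covers, which is how partition-level write atomicity
# can appear broken. Keyspace/table/columns are hypothetical.
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, SimpleStatement
from cassandra import ConsistencyLevel

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")  # hypothetical keyspace

# Single-partition batch: both rows share the partition key pk = 1.
batch = BatchStatement(consistency_level=ConsistencyLevel.QUORUM)
batch.add(SimpleStatement("INSERT INTO events (pk, ck, value) VALUES (1, 1, 'a')"))
batch.add(SimpleStatement("INSERT INTO events (pk, ck, value) VALUES (1, 2, 'b')"))
session.execute(batch)

# Reading one clustering row: if this read triggers a read repair, only
# row (1, 1) is repaired on stale replicas, not the whole partition.
row = session.execute(
    SimpleStatement(
        "SELECT * FROM events WHERE pk = 1 AND ck = 1",
        consistency_level=ConsistencyLevel.QUORUM,
    )
).one()
```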

Will Cassandra reach eventual consistency without manual repair if there is no read for that data during gc_grace_seconds?

Assume the following
Replication factor is 3
A delete was issued with consistency 2
One of the replicas was busy (not down), so it dropped the request
The other two replicas add the tombstone and send the response. So currently the record is marked for deletion in only two replicas.
No read repair happened, as there was no read for that data during gc_grace_seconds
Q1.
Will this data be resurrected when a read happens for that record after gc_grace_seconds if there was no manual repair?
(I am not talking about a replica being down for more than gc_grace_seconds.)
One of the replicas was busy (not down), so it dropped the request
In this case, the coordinator node realizes that the replica could not be written and stores it as a hint. Once the overwhelmed node starts taking requests again, the hint is "replayed" to get the replica consistent.
However, hints are only kept (by default) for 3 hours. After that time, they are dropped. So, if the busy node does not recover within that 3 hour window, then it will not be made consistent. And yes, in that case a query at consistency-ONE could allow that data to "ghost" its way back.
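For reference, here is a minimal sketch of the delete described in the question, issued at consistency TWO with the DataStax Python driver; the keyspace, table, and key are hypothetical.

```python
# Delete at CL=TWO with RF=3: the delete succeeds once two replicas write
# the tombstone. If the third replica dropped the mutation, the coordinator
# keeps a hint for it, but only for the hint window (3 hours by default);
# after that, only anti-entropy repair can propagate the tombstone.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")  # hypothetical keyspace

delete = SimpleStatement(
    "DELETE FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.TWO,
)
session.execute(delete, (42,))
```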

How does Cassandra guarantee eventual consistency in cross region replication?

I cannot find much documentation about it. The only thing I can find is that when the consistency level is not set to EACH_QUORUM, cross region replication is done asynchronously.
But in asynchronous style, is it possible to lose messages? How does Cassandra handle losing messages?
If you don't use EACH_QUORUM and a destination node which would accept a write is down, then the coordinator node saves the write as a "hinted handoff".
When the destination node becomes available again, the coordinator replays the hinted handoffs to it.
If hinted handoffs are lost, you have to run a repair on your cluster.
Also be aware that hints are only stored for a maximum of 3 hours by default.
For further info see documentation at:
http://www.datastax.com/dev/blog/modern-hinted-handoff
http://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesHintedHandoff.html
Hope this helps.
When you issue a write in Cassandra, the coordinator sends the write to all online replicas, and then blocks. The duration of the block corresponds to consistency level - if you say "ALL", it blocks until all nodes ack the write. If you use "EACH_QUORUM", it blocks until a quorum of nodes in each datacenter ack the write.
For any replica that didn't ack the write, the coordinator will write a hint, and attempt to deliver that hint later (minutes, hours, no guarantee).
Note, though, that the writes were all sent at the same time - what you don't have is a guarantee as to which were delivered. Your guarantee is in the consistency level.
When you read, you'll do something similar - you'll block until you have an appropriate number of replicas answering. If you write with EACH_QUORUM, you can read with LOCAL_QUORUM and guarantee strong consistency. If you write with QUORUM, you could read with QUORUM. If you write with ONE, you could still guarantee strong consistency if you read with ALL.
To guarantee eventual consistency, you don't have to do anything - it'll eventually get there, as long as you wrote with CL >= ONE (CL ANY isn't really a guarantee).
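Here is a minimal sketch of the write/read pairing mentioned above (EACH_QUORUM writes, LOCAL_QUORUM reads), using the DataStax Python driver; the keyspace and table names are hypothetical.

```python
# Writes at EACH_QUORUM (a quorum in every datacenter) paired with reads
# at LOCAL_QUORUM still overlap, giving strongly consistent reads.
# Keyspace/table names are hypothetical.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")  # hypothetical keyspace

# Write blocks until a quorum of replicas in EACH datacenter acks.
write = SimpleStatement(
    "INSERT INTO kv (k, v) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.EACH_QUORUM,
)
session.execute(write, ("key1", "value1"))

# Read only needs a quorum in the local datacenter, yet still overlaps
# with the EACH_QUORUM write, so it sees the latest value.
read = SimpleStatement(
    "SELECT v FROM kv WHERE k = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
row = session.execute(read, ("key1",)).one()
```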

Cassandra difference between ANY and ONE consistency levels

Assumptions: RF = 3
In a video about consistency levels, the speaker says that CL = ONE is better than CL = ANY, because with CL = ANY the coordinator is happy to store only a hint (and the data), assuming all the other nodes owning the corresponding partition key ranges are down, and we can potentially lose our data if the coordinator fails. But wait a minute... as I understand it, if we used CL = ONE and, for example, had only one (of three) nodes available for this partition key, we would have only one node with the inserted data. The risk of loss is the same.
But I think we should compare equal situations - all nodes for a particular token are down. Then it's better to reject the write operation than to write with such a big risk of losing the coordinator.
CL=ANY should probably never be used on a production server. The written data will be unreadable until the hint is replayed to a node owning that partition, because you can't read data while it only exists in a hints log.
Using CL=ONE and RF=3 with two nodes down, you would have data stored in both a) the commit log and memtable on a node and b) the hints log. These are likely different nodes, but they could be the same 1/3 of the time. So, yes, with CL=ONE and CL=ANY you risk complete loss of data with a single node failure.
Instead of ANY or ONE, use CL=QUORUM or CL=LOCAL_QUORUM.
The thing is, hints are only stored for 3 hours by default; for outages longer than that you have to run repairs. You can repair as long as you have at least one copy of the data on a node somewhere in the cluster (hints stored on the coordinator don't count).
Consistency ONE guarantees that at least one replica in the cluster has the write in its commit log no matter what. With ANY, the write is in the worst case stored only as a hint on the coordinator (other nodes can't access it), and hints are kept by default for 3 hours. If those 3 hours pass with the other two replicas still down, with ANY you are losing data.
If you are worried about the risk, then use QUORUM and 2 nodes will be guaranteed to have the data. It's up to the application developer / designer to decide. QUORUM will usually have slightly higher write latency than ONE, but you can always add more nodes should the load increase dramatically.
Also have a look at this nice tool to see what impact various consistency levels and replication factors have on applications:
https://www.ecyrd.com/cassandracalculator/
With RF = 3, all 3 replicas will actually receive the write. The consistency level is just about how long you want to wait for responses from them... If you use ONE, you wait until one node has it in its commit log, but the coordinator still sends the write to all 3. If a replica doesn't respond, the coordinator saves the write as a hint.
Most of the time, ANY in production is a bad idea.
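For illustration, a minimal sketch contrasting the two levels with the DataStax Python driver; the keyspace and table names are hypothetical.

```python
# CL=ANY vs CL=ONE: same insert, different durability guarantees.
# Keyspace/table names are hypothetical.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")  # hypothetical keyspace

insert_cql = "INSERT INTO kv (k, v) VALUES (%s, %s)"

# CL=ANY: succeeds even if no replica is reachable; in the worst case the
# write lives only as a hint on the coordinator, is unreadable until a
# replica receives it, and hints expire after 3 hours by default.
session.execute(
    SimpleStatement(insert_cql, consistency_level=ConsistencyLevel.ANY),
    ("key1", "value1"),
)

# CL=ONE: at least one replica must persist the write to its commit log,
# so the data is immediately readable from that replica.
session.execute(
    SimpleStatement(insert_cql, consistency_level=ConsistencyLevel.ONE),
    ("key1", "value1"),
)
```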

When would Cassandra not provide C, A, and P with W/R set to QUORUM?

When both read and write are set to quorum, I can be guaranteed the client will always get the latest value when reading.
I realize this may be a novice question, but I'm not understanding how this setup doesn't provide consistency, availability, and partitioning.
With a quorum, you are unavailable (i.e. won't accept reads or writes) if there aren't enough replicas available. You can choose to relax and read / write on lower consistency levels granting you availability, but then you won't be consistent.
There's also the case where a quorum on reads and writes guarantees you the latest "written" data is retrieved. However, if a coordinator doesn't know about required partitions being down (i.e. gossip hasn't propagated after 2 of 3 nodes fail), it will issue a write to 3 replicas (assuming quorum consistency with a replication factor of 3). The one live node will write, and the other 2 won't (they're down). The write times out (it doesn't fail). A write timeout where even one node has written IS NOT a write failure. It's a write "in progress". Let's say the down nodes come up now. If a client next requests that data with quorum consistency, one of two things happens:
Request goes to one of the two downed nodes, and to the "was live" node. Client gets latest data, read repair triggers, all is good.
Request goes to the two nodes that were down. OLD data is returned (assuming repair hasn't happened). The coordinator gets a digest from the third node, and read repair kicks in. This is when the original write is considered "complete" and subsequent reads will get the fresh data. All is good, but one client will have received the old data because the write was "in progress" but not "complete". This is a very small, rare scenario. One thing to note is that writes to Cassandra are upserts on keys, so retries are usually enough to get around this problem; however, if nodes genuinely go down, the initial read may be a problem.
Typically you balance your consistency and availability requirements. That's where the term tunable consistency comes from.
That said, the web is full of links that disprove (or at least try to disprove) Brewer's CAP theorem... From the theorem's point of view, the C says that
all nodes see the same data at the same time
which is quite different from the guarantee that a client will always retrieve fresh information. Strictly following the theorem, in your situation, the C is not respected.
The DataStax documentation contains a section on Configuring Data Consistency. Looking through all of the available consistency configurations, for QUORUM it states:
Returns the record with the most recent timestamp after a quorum of replicas has responded regardless of data center. Ensures strong consistency if you can tolerate some level of failure.
Note that last part "tolerate some level of failure." Right there it's indicating that by using QUORUM consistency you are sacrificing availability (A).
The document referenced above also further defines the QUORUM level, stating that your replication factor comes into play as well:
If consistency is top priority, you can ensure that a read always reflects the most recent write by using the following formula:
(nodes_written + nodes_read) > replication_factor
For example, if your application is using the QUORUM consistency level for both write and read operations and you are using a replication factor of 3, then this ensures that 2 nodes are always written and 2 nodes are always read. The combination of nodes written and read (4) being greater than the replication factor (3) ensures strong read consistency.
In the end, it all depends on your application requirements. If your application needs to be highly-available, ONE is probably your best choice. On the other hand, if you need strong-consistency, then QUORUM (or even ALL) would be the better option.
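A tiny Python sketch of the formula quoted above, just to make the arithmetic explicit (the helper names are made up):

```python
# (nodes_written + nodes_read) > replication_factor implies strong reads:
# every read quorum overlaps the write quorum in at least one replica.
def quorum(rf: int) -> int:
    """Number of replicas in a quorum for a given replication factor."""
    return rf // 2 + 1

def is_strongly_consistent(nodes_written: int, nodes_read: int, rf: int) -> bool:
    """True when every read overlaps at least one replica from the write."""
    return nodes_written + nodes_read > rf

rf = 3
print(is_strongly_consistent(quorum(rf), quorum(rf), rf))  # True:  2 + 2 = 4 > 3
print(is_strongly_consistent(1, quorum(rf), rf))           # False: 1 + 2 = 3 is not > 3
```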

Resources