The questions are regarding the “CAS operations” paragraph in the article: http://www.datastax.com/dev/blog/cassandra-error-handling-done-right
a)
If the paxos phase fails, the driver will throw a WriteTimeoutException with a WriteType.CAS as retrieved with WriteTimeoutException#getWriteType(). In this situation you can’t know if the CAS operation has been applied.
How do you understand this?
I thought that if the Paxos (prepare) phase fails, the coordinator will not initiate the commit phase at all?
I assume it does not matter how the Paxos phase fails (not enough replicas, replica timeouts, etc.).
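As a rough mental model, the distinction the article draws between the two write types can be sketched as client-side decision logic. This is illustrative only; the function name and return strings are not a real driver API:

```python
# Illustrative sketch (not a real driver API) of how a client might react
# to a write timeout, based on the WriteType the driver reports.
def handle_write_timeout(write_type):
    if write_type == "CAS":
        # Paxos (prepare) phase timed out: the CAS may or may not have been
        # applied. A subsequent read at SERIAL forces any uncommitted Paxos
        # state to be committed first, revealing the actual outcome.
        return "outcome unknown; re-read at SERIAL"
    if write_type == "SIMPLE":
        # Commit phase timed out: the Paxos round itself succeeded, so the
        # write will eventually be applied on all replicas.
        return "will be applied eventually; safe to ignore with SERIAL reads"
    return "other write type; consult the article's table"

print(handle_write_timeout("CAS"))     # -> outcome unknown; re-read at SERIAL
```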
b)
The commit phase is then similar to regular Cassandra writes… you can simply ignore this error if you make sure to use setConsistencyLevel(ConsistencyLevel.SERIAL) on the subsequent read statements on the column that was touched by this transaction, as it will force Cassandra to commit any remaining uncommitted Paxos state before proceeding with the read
Wondering about the above in relation to writes with ConsistencyLevel.QUORUM:
If the commit phase failed because there is no quorum (unavailable nodes or timeouts) then we get back WriteTimeoutException with a WriteType of SIMPLE, right?
In this case it is not clear if the write is actually successful or not, right?
So I’m not sure what are all the possibilities from now on (recover/rollback/nothing)?
Is it saying that if I use ConsistencyLevel.QUORUM for the read operation, I may see the old data version (as if the write above was not successful) for some time, and that after that, again with a QUORUM read, I will see that the write succeeded?
(Actually, I have seen exactly this in a 3-node cluster with replication factor 3: after a WriteTimeoutException (2 replicas were required but only 1 acknowledged the write), a quorum read just afterwards returned the old data, and then when I checked with cqlsh I saw the new data.)
How is this possible?
guess:
Probably, after the timeout, the coordinator sees that there is not yet a quorum for the commit phase (so subsequent QUORUM reads get the older data version) and returns WriteTimeoutException with WriteType.SIMPLE to the client. When the nodes that timed out eventually respond/commit, a quorum exists from that moment on, and all subsequent quorum reads will obtain the newer data version.
But I'm not sure how to explain the behavior when you read with SERIAL.
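The guess above can be sketched as a toy model, assuming last-write-wins resolution by timestamp (which is how Cassandra reconciles differing replica versions on a read):

```python
# Toy model of the observed behavior: RF=3, a write that landed on only one
# replica before the coordinator timed out. A QUORUM read contacts 2 of the
# 3 replicas and resolves conflicts by the highest write timestamp.
def quorum_read(contacted):
    return max(contacted, key=lambda r: r["ts"])["value"]

a = {"value": "old", "ts": 1}
b = {"value": "old", "ts": 1}
c = {"value": "new", "ts": 2}   # the one replica that applied the write

# A quorum read that happens to contact A and B still sees the old value.
print(quorum_read([a, b]))      # -> old

# Any quorum that includes C sees the new value (and read repair would then
# push it to the stale replica it contacted).
print(quorum_read([b, c]))      # -> new
```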
Related
I didn't understand from the documentation what happens after a node fails during writes.
I get the idea of quorum, but what happens after “the write transaction” fails?
For example:
I inserted the record and chose the level of consistency equal to QUORUM.
Assume the quorum is over 3 nodes, and 2 of 3 (or just 1 of 3) nodes wrote the data but the rest failed.
I got an error.
What happens with the record on the nodes which wrote it?
How can Cassandra prevent this row from propagating to other nodes through replica synchronization?
Or, if I get an error on the write, does it actually mean that this row could appear on each replica within some time?
Cassandra doesn't have transactions (except lightweight transactions, which are a different kind of thing). When some nodes have received and written the data and others have not, there is no rollback or anything like it: the data stays written. But the coordinator node sees that the consistency level could not be reached and reports an error back to the client application, so the write can be retried if necessary. If it is not retried, the data can still be propagated through repair operations, either read repair or explicit repair. But because the data is on a single node, that node may fail before a repair happens, and the data could be lost.
I'm not able to understand the scenario where, during the write process, the desired write consistency level cannot be met. For example, suppose I have 3 nodes, 2 in one data center (dc1) and the remaining one in the other data center (dc2), using NetworkTopologyStrategy. Now, if I'm writing with consistency level THREE and one of the nodes is down, what exactly will happen?
Since 2 nodes are up, they will be able to complete the write; however, since the consistency level cannot be met, the coordinator node will return a write error to the client.
What will happen to the data written to the 2 nodes? The client will not be expecting any data on any node, because it received a write error.
There is no rollback in Cassandra, then how does Cassandra remove failed writes?
According to the above link, Cassandra does not rollback writes.
Does Cassandra write to a node (which is up) even if consistency cannot be met?
The accepted answer in the above link states that "On the nodes that the write succeeded, the data is actually written and it is going to be eventually rolled back."
If the coordinator cannot write to enough replicas to meet the requested consistency level, it throws an UnavailableException and does not perform any writes.
If the coordinator doesn't know about a replica failure beforehand, i.e. the replica failed during the write, then the coordinator will throw a timeout exception and the client will have to handle it (retry policies).
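The distinction between the two failure modes might be sketched like this (a toy model, not Cassandra source; names are illustrative):

```python
# Toy model (not Cassandra source) of the two failure modes described above.
def coordinate_write(alive_replicas, consistency_level, acks_received):
    if alive_replicas < consistency_level:
        # Known up front: the coordinator refuses and writes nowhere.
        return "UnavailableException (no writes performed)"
    if acks_received < consistency_level:
        # Replica failed mid-write: partial writes remain, client must decide.
        return "WriteTimeoutException (partial writes may remain)"
    return "success"

print(coordinate_write(1, 2, 0))   # -> UnavailableException (no writes performed)
print(coordinate_write(2, 2, 1))   # -> WriteTimeoutException (partial writes may remain)
print(coordinate_write(3, 2, 2))   # -> success
```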
Cassandra Write Request
Suppose i have 3 nodes, the RF is 3, and using QUORUM CL. When i write a data record to the cluster, if one node succeed, one failed. So the whole write request is failed, what will happen to the succeed node? Will it be roll back automatically? or it will be propagated to other node via gossip. And finally the 3 nodes will all have the record even the original request was failed?
shutty's answer is wrong in subtle ways, though the article it refers to is correct and an excellent source. The first three points appear correct:
The query coordinator will try to persist your write on all nodes according to RF=3. If 2 of them fail, the CL=QUORUM write is considered failed.
A single node which accepted the failed write will not roll it back. It will persist it to its memtable/disk as if nothing suspicious happened.
Cassandra is an eventually consistent database, so it's absolutely fine for it to be in an inconsistent state for some period of time, converging to a consistent state at some point in the future.
However, the last two appear wrong; here's the corrected version:
The next time you read (CL=QUORUM) the key you previously failed to write, if there are still not enough nodes online, you'll get a failed read. If the two nodes that previously failed the write are online (and the one that succeeded is not), you'll receive the previous value, unaffected by the failed write.
If the node that succeeded in writing is also online, a QUORUM read will trigger a read repair, causing the nodes that missed the newer value to update to it; the newer value will then be returned. (Note: 'newer' is in the timestamp sense, so it is possible that data written more recently has an older timestamp; that would mean the cluster started in an inconsistent state.)
There's an article about it. TL;DR version:
The query coordinator will try to persist your write on all nodes according to RF=3. If 2 of them fail, the CL=QUORUM write is considered failed.
A single node which accepted the failed write will not roll it back. It will persist it to its memtable/disk as if nothing suspicious happened.
Cassandra is an eventually consistent database, so it's absolutely fine for it to be in an inconsistent state for some period of time, converging to a consistent state at some point in the future.
The next time you read (CL=QUORUM) the key you previously failed to write, if there are still not enough nodes online, you'll get a failed read. If the other 2 nodes come back to life, they will form a read quorum (even if the third node's data differs for that key) and you'll receive the previous value, unaffected by the failed write.
If Cassandra detects such a conflict for a single key, it performs a read repair: the conflicting minority nodes' data will be overwritten by the data from the quorum majority. So your node, which accepted the failed write, will self-heal the inconsistent row on the next successful quorum read.
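The self-healing step described here can be sketched as a toy last-write-wins repair. Note that real Cassandra reconciles by timestamp, not by majority vote:

```python
# Toy sketch of the read-repair step: replicas holding a stale version of
# the key are overwritten with the newest (highest-timestamp) version seen
# during the read. Real Cassandra resolves by timestamp (last-write-wins),
# not by counting which value the majority holds.
def read_repair(replicas):
    newest = max(replicas, key=lambda r: r["ts"])
    for r in replicas:
        if r["ts"] < newest["ts"]:
            r["value"], r["ts"] = newest["value"], newest["ts"]
    return newest["value"]

replicas = [
    {"value": "v2", "ts": 2},   # the node that accepted the "failed" write
    {"value": "v1", "ts": 1},
    {"value": "v1", "ts": 1},
]
print(read_repair(replicas))                        # -> v2
print(all(r["value"] == "v2" for r in replicas))    # -> True (self-healed)
```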
The below statement from Cassandra documentation is the reason for my doubt.
For example, if using a write consistency level of QUORUM with a replication factor of 3, Cassandra will replicate the write to all nodes in the cluster and wait for acknowledgement from two nodes. If the write fails on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that node. However, the replicated write that succeeds on the other node is not automatically rolled back.
Ref : http://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_atomicity_c.html
So does Cassandra write to a node (which is up) even if consistency cannot be met?
I got it. Cassandra will not even attempt to write if it knows that consistency cannot be met. If consistency CAN be met but there are not enough replicas up to satisfy the replication factor, then Cassandra writes to the currently available replicas and returns a success message. Later, when a replica comes back up, the write is propagated to it.
For example, if the replication factor is 3 and 1 of the 3 nodes is down, a write with a consistency level of 2 will succeed. But if the replication factor is 2 and 1 of the 2 nodes is down, a write with a consistency level of 2 will not even be attempted on the single node that is available.
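That availability check can be sketched as a one-line predicate (a simplification: the real coordinator also accounts for datacenters, the snitch, and pending range movements):

```python
# Sketch of the coordinator's up-front availability check: the write is not
# attempted anywhere when fewer live replicas exist than the requested
# consistency level requires.
def write_attempted(replication_factor, nodes_down, consistency_level):
    alive = replication_factor - nodes_down
    return alive >= consistency_level

print(write_attempted(3, 1, 2))   # -> True: RF=3, 1 down, CL=2 proceeds
print(write_attempted(2, 1, 2))   # -> False: not even the live node is written
```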
What is mentioned in the documentation is a case where the write was initiated while the consistency level could be met, but in between, one node went down and couldn't complete the write, whereas the write succeeded on another node. Since consistency cannot be met, the client gets a failure message. The record that was written to a single node is not removed; it is reconciled with the other replicas later during repair.
Consistency in Cassandra can be (and usually is) defined at the statement level. That means you specify, on a particular query, what level of consistency you need.
This implies that if the consistency level is not met, the statement above has not met its consistency requirements.
There is no rollback in Cassandra. What you have in Cassandra is eventual consistency. That means your statement might succeed in the future, if not immediately. When a replica node comes alive, the cluster (i.e. Cassandra's fault-tolerance machinery) will take care of writing to the replica node.
So, if your statement failed, it might still succeed in the future. This is in contrast to the RDBMS world, where an uncommitted transaction is rolled back as if nothing had happened.
Update:
I stand corrected. Thanks Arun.
From:
http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_about_hh_c.html
During a write operation, when hinted handoff is enabled and consistency can be met, the coordinator stores a hint about dead replicas in the local system.hints table under either of these conditions:
So it's still not a rollback. Nodes know the current cluster state and don't initiate the write if consistency cannot be met.
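The hinted-handoff behavior quoted above might be sketched as follows (a toy model; real hint storage in `system.hints` and hint replay are far more involved):

```python
# Toy sketch of hinted handoff: when a replica is down but the consistency
# level can still be met, the coordinator acknowledges the write with the
# live replicas and stores a hint to replay to the dead replica later.
hints = []

def write_with_hints(replicas, value):
    acked = 0
    for name, is_up in replicas.items():
        if is_up:
            acked += 1                    # live replica acknowledges the write
        else:
            hints.append((name, value))   # hint kept for the dead replica
    return acked

acked = write_with_hints({"n1": True, "n2": True, "n3": False}, "row-1")
print(acked, hints)   # -> 2 [('n3', 'row-1')]
```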
At driver level, you get an exception.
On the nodes that the write succeeded, the data is actually written and it is going to be eventually rolled back.
In a normal situation, you can consider that the data was not written to any of the nodes.
From the documentation:
If the write fails on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that node. However, the replicated write that succeeds on the other node is not automatically rolled back.
The Cassandra 2.0 documentation contains the following paragraph on Atomicity:
For example, if using a write consistency level of QUORUM with a replication factor of 3, Cassandra will replicate the write to all nodes in the cluster and wait for acknowledgement from two nodes. If the write fails on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that node. However, the replicated write that succeeds on the other node is not automatically rolled back.
So, write requests are sent to 3 nodes, and we wait for 2 ACKs. Let's assume we receive only 1 ACK (before the timeout). So it's clear that if we read with consistency ONE, we may read the value, ok.
But which of the following statements is also true:
It may occur that the write has been persisted on a second node, but the node's ACK got lost? (Note: this could result in reading the value even at read consistency QUORUM!)
It may occur that the write will be persisted later to a second node (e.g. due to hinted handoff)? (Note: this could result in reading the value even at read consistency QUORUM!)
It's impossible that the write is persisted on a second node and the written value will eventually be removed from that node via read repair?
It's impossible that the write is persisted on a second node, but a manual "undo" action is necessary?
I believe you are mixing up atomicity and consistency. Atomicity is not guaranteed across nodes, whereas consistency is. Only writes to a single row on a single node are atomic in the truest sense of atomicity.
The only time Cassandra will fail a write is when too few replicas are alive when the coordinator receives the request, i.e. it cannot meet the consistency level. Otherwise, your second statement is correct: a hint will be stored so that the failed node (replica) will eventually have this row replicated to it.
This article describes the different failure conditions.
http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure