Write consistency not met while writing in Cassandra

I'm not able to understand the scenario where, during the write process, the desired write consistency level cannot be met. For example, suppose I have 3 nodes, 2 in one data center (dc1) and the remaining one in the other data center (dc2), using NetworkTopologyStrategy. Now if I'm writing with consistency level THREE and one of the nodes is down, what exactly will happen?
Since 2 nodes are up, they will be able to complete the write; however, since the consistency level cannot be met, the coordinator node will return a write error to the client.
What will happen to the data written to the 2 nodes? The client will not expect any data on any node, because it received a write error.
There is no rollback in Cassandra, then how does Cassandra remove failed writes?
According to the above link, Cassandra does not rollback writes.
Does Cassandra write to a node (which is up) even if consistency cannot be met?
The accepted answer in the above link states that "On the nodes that the write succeeded, the data is actually written and it is going to be eventually rolled back."

If the coordinator cannot write to enough replicas to meet the requested consistency level, it throws an UnavailableException and does not perform any writes.
If the coordinator doesn't know about the replica failure beforehand, i.e. the replica failed during the write, then the coordinator will throw a timeout exception and the client will have to handle it (retry policies). A sketch of both failure modes follows the link below.
Cassandra Write Request
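To see the two failure modes in code, here is a minimal sketch with the DataStax Java driver (2.x/3.x style); the keyspace and `users` table are hypothetical assumptions. An UnavailableException means the coordinator refused up front and wrote nothing; a WriteTimeoutException means the write was attempted and may already sit on some replicas:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.UnavailableException;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class WriteFailureModes {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_ks"); // hypothetical keyspace
        Statement insert = new SimpleStatement(
                "INSERT INTO users (id, name) VALUES (1, 'alice')") // hypothetical table
                .setConsistencyLevel(ConsistencyLevel.THREE);
        try {
            session.execute(insert);
        } catch (UnavailableException e) {
            // The coordinator knew up front that too few replicas were alive:
            // no write was attempted anywhere.
            System.err.printf("Unavailable: %d required, %d alive%n",
                    e.getRequiredReplicas(), e.getAliveReplicas());
        } catch (WriteTimeoutException e) {
            // A replica failed (or was too slow) during the write: the data
            // may now exist on some replicas, and the client must decide
            // whether to retry (see retry policies).
            System.err.printf("Timeout: %d of %d acks received%n",
                    e.getReceivedAcknowledgements(), e.getRequiredAcknowledgements());
        } finally {
            cluster.close();
        }
    }
}
```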

Related

Cassandra writes consistency, what happens after “transaction” fails?

I didn’t get from the documentation what happens after a node fails during writes.
I got the idea of quorum but what happens after “the write transaction” fails?
For example:
I inserted the record and chose the level of consistency equal to QUORUM.
Assume QUORUM = 3 nodes and 2 of 3, or just 1 of 3, nodes wrote the data but the rest failed.
I got an error.
What happens with the record on the nodes which wrote it?
How can Cassandra prevent propagating this row to other nodes through replica synchronization?
Or, if I get an error on the write, does it actually mean that this row could appear on each replica within some time?
Cassandra doesn't have transactions (except lightweight transactions, which are a different kind of thing). When some nodes received and wrote the data and others did not, there is no rollback or anything like it: the data is written. But the coordinator node sees that the consistency level couldn't be reached and reports an error back to the client application, so the write can be retried if necessary. If it's not retried, the data can still be propagated through repair operations, either read repair or explicit repair. But because the data is on a single node, that node may fail before a repair happens, and the data could be lost.
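To illustrate the "retried if necessary" path, here is a minimal client-side sketch using the DataStax Java driver (the class and method names here are hypothetical, and it assumes the statement is idempotent):

```java
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public final class IdempotentRetry {
    private IdempotentRetry() {}

    // Replays an idempotent write a few times on timeout. If every attempt
    // fails, any copy already accepted by a replica simply stays there and
    // is spread later by read repair or anti-entropy repair; Cassandra
    // never rolls it back.
    public static void executeWithRetry(Session session, Statement write, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                session.execute(write);
                return; // enough replicas acknowledged on this attempt
            } catch (WriteTimeoutException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up: the write is in an "unknown" state
                }
                // Safe to replay only because the statement is idempotent
                // (same primary key, same values, no counters).
            }
        }
    }
}
```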

Does Cassandra read have inconsistency?

I am new to Cassandra and am trying to understand how it works. Say I write to a number of nodes. My understanding is that, depending on the hash value of the key, it is decided which node owns the data, and then replication happens. While reading the data, the hash of the key determines which node has the data, and that node responds. Now my question is: if reading and writing happen on the same set of nodes, which always have the data, then how does read inconsistency occur and Cassandra return stale data?
For tuning consistency, Cassandra allows you to set the consistency level on a per-query basis.
Now for your question, let's assume the consistency level is set to ONE and the replication factor is 3.
During WRITE request coordinator sends a write request to all replicas that own the row being written. As long as all replica nodes are up and available, they will get the write regardless of the consistency level specified by the client. The write consistency level determines how many replica nodes must respond with a success acknowledgment in order for the write to be considered successful. Success means that the data was written to the commit log and the memtable.
For example, in a single data center 10 node cluster with a replication factor of 3, an incoming write will go to all 3 nodes that own the requested row. If the write consistency level specified by the client is ONE, the first node to complete the write responds back to the coordinator, which then proxies the success message back to the client. A consistency level of ONE means that it is possible that 2 of the 3 replicas could miss the write if they happened to be down at the time the request was made. If a replica misses a write, Cassandra will make the row consistent later using one of its built-in repair mechanisms: hinted handoff, read repair, or anti-entropy node repair.
By default, hints are saved for three hours after a replica fails because if the replica is down longer than that, it is likely permanently dead. You can configure this interval of time using the max_hint_window_in_ms property in the cassandra.yaml file. If the node recovers after the save time has elapsed, run a repair to re-replicate the data written during the down time.
Now, when a READ request is performed, the coordinator node sends the request to the replica that can currently respond the fastest (hence it might go to any 1 of the 3 replicas).
Now imagine a situation where the data has not yet been replicated to the third replica, and during the READ that replica is selected (the chances are small): then you get inconsistent data.
This scenario assumes all nodes are up. If one of the nodes is down and read repair has not run once the node is back up, the problem gets worse.
READ With Different CONSISTENCY LEVEL
READ Request in Cassandra
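To make the per-query levels and the stale-read window above concrete, here is a minimal sketch with the DataStax Java driver (the keyspace and `users` table are hypothetical assumptions):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class PerQueryConsistency {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_ks"); // hypothetical keyspace

        // Write at ONE: the coordinator still sends the mutation to all
        // replicas that own the row, but reports success after one ack.
        session.execute(new SimpleStatement(
                "INSERT INTO users (id, name) VALUES (1, 'alice')")
                .setConsistencyLevel(ConsistencyLevel.ONE));

        // Read at ONE from the fastest replica: it may be one of the two
        // replicas that have not applied the write yet, hence a stale read.
        // Choosing levels with W + R > RF (e.g. QUORUM + QUORUM at RF = 3)
        // closes that window.
        Row row = session.execute(new SimpleStatement(
                "SELECT name FROM users WHERE id = 1")
                .setConsistencyLevel(ConsistencyLevel.ONE)).one();
        System.out.println(row == null ? "stale or missing" : row.getString("name"));

        cluster.close();
    }
}
```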
Consider a scenario where the CL is QUORUM, in which case 2 out of 3 replicas must respond. The write request will go to all 3 replicas as usual; if the write fails on 2 replicas and succeeds on 1, Cassandra returns a failure. Since Cassandra does not roll back, the record will continue to exist on the successful replica. Now, when a read comes in with CL=QUORUM, the read request is forwarded to 2 replica nodes, and if one of them is the previously successful one, Cassandra will return the new record, as it has the latest timestamp. But from the client's perspective this record was never written, since Cassandra returned a failure during the write.
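A sketch of that timeline with the Java driver, under the same assumptions (RF = 3, two replicas failing mid-write; the schema names are hypothetical):

```java
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public final class FailedWriteStillVisible {
    private FailedWriteStillVisible() {}

    // Assumes RF = 3 with two replicas down or too slow, so the QUORUM
    // write times out after only one replica applied the mutation.
    public static Row demo(Session session) {
        try {
            session.execute(new SimpleStatement(
                    "INSERT INTO users (id, name) VALUES (2, 'bob')")
                    .setConsistencyLevel(ConsistencyLevel.QUORUM));
        } catch (WriteTimeoutException e) {
            // Only 1 of the 2 required acks arrived: the client sees a
            // failure, but the row remains on the replica that applied it.
        }
        // Later, a QUORUM read that happens to include the successful
        // replica returns the row (its newer timestamp wins), and read
        // repair then copies it to the lagging replicas.
        return session.execute(new SimpleStatement(
                "SELECT name FROM users WHERE id = 2")
                .setConsistencyLevel(ConsistencyLevel.QUORUM)).one();
    }
}
```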

Cassandra WriteTimeoutException handling during CAS operations

The questions are about the “CAS operations” paragraph in this article: http://www.datastax.com/dev/blog/cassandra-error-handling-done-right
a)
If the paxos phase fails, the driver will throw a WriteTimeoutException with a WriteType.CAS as retrieved with WriteTimeoutException#getWriteType(). In this situation you can’t know if the CAS operation has been applied..
How do you understand this?
I thought that if the Paxos (prepare) phase fails, then the coordinator will not initiate the commit phase at all?
I guess that it does not matter how the paxos phase fails (not enough replicas or replica timeouts or ..).
b)
The commit phase is then similar to regular Cassandra writes… you can simply ignore this error if you make sure to use setConsistencyLevel(ConsistencyLevel.SERIAL) on the subsequent read statements on the column that was touched by this transaction, as it will force Cassandra to commit any remaining uncommitted Paxos state before proceeding with the read
Wondering about the above with relation to writes with ConsistencyLevel.QUORUM:
If the commit phase failed because there is no quorum (unavailable nodes or timeouts) then we get back WriteTimeoutException with a WriteType of SIMPLE, right?
In this case it is not clear if the write is actually successful or not, right?
So I’m not sure what are all the possibilities from now on (recover/rollback/nothing)?
Is it saying that if I use ConsistencyLevel.QUORUM for the read operation I can see the old data version (as if the above write was not successful) for some time and after that again with QUORUM read I will see that the write is successful?
(Actually, I’ve seen exactly this in a 3-node cluster with replication factor = 3: after a WriteTimeoutException (2 replicas were required but only 1 acknowledged the write), a QUORUM read just afterwards returned the old data, and then when I checked with cqlsh I saw the new data.)
How is this possible?
My guess:
Probably after the timeout the coordinator sees that there is no quorum for the commit phase yet (so subsequent QUORUM reads get the older data version) and returns the WriteTimeoutException with WriteType.SIMPLE to the client. And when the nodes that timed out actually respond/commit, a quorum exists from that future moment on, and all quorum reads after it will see the newer data version.
But I’m not sure about the explanation for when you use a read with SERIAL.
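Putting the article's advice together, here is a hedged sketch of CAS timeout handling with the DataStax Java driver (the `users` table is hypothetical; treating WriteType.SIMPLE as an eventually-visible write follows the article's "you can simply ignore this error" guidance):

```java
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.WriteType;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public final class CasTimeoutHandling {
    private CasTimeoutHandling() {}

    public static boolean insertUnique(Session session, int id, String name) {
        try {
            ResultSet rs = session.execute(new SimpleStatement(
                    "INSERT INTO users (id, name) VALUES (?, ?) IF NOT EXISTS",
                    id, name));
            return rs.wasApplied();
        } catch (WriteTimeoutException e) {
            if (e.getWriteType() == WriteType.CAS) {
                // The Paxos phase timed out: we cannot know whether the CAS
                // was applied. A SERIAL read forces any uncommitted Paxos
                // state to be committed before answering, so it reveals the
                // actual outcome.
                Row row = session.execute(new SimpleStatement(
                        "SELECT name FROM users WHERE id = ?", id)
                        .setConsistencyLevel(ConsistencyLevel.SERIAL)).one();
                return row != null && name.equals(row.getString("name"));
            }
            // WriteType.SIMPLE: the Paxos round succeeded and only the
            // commit phase timed out; the accepted value will eventually
            // reach the replicas, so the article treats it as ignorable.
            return true;
        }
    }
}
```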

Does Cassandra write to a node (which is up) even if consistency cannot be met?

The statement below from the Cassandra documentation is the reason for my doubt.
For example, if using a write consistency level of QUORUM with a replication factor of 3, Cassandra will replicate the write to all nodes in the cluster and wait for acknowledgement from two nodes. If the write fails on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that node. However, the replicated write that succeeds on the other node is not automatically rolled back.
Ref : http://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_atomicity_c.html
So does Cassandra write to a node (which is up) even if consistency cannot be met?
I got it. Cassandra will not even attempt the write if it knows in advance that consistency cannot be met. If consistency CAN be met but there are not enough live replicas to satisfy the replication factor, then Cassandra writes to the currently available replicas and returns a success message. Later, when the missing replica is up again, the data is written to it (via hinted handoff or repair).
For example, if the replication factor is 3 and 1 of the 3 nodes is down, a write with a consistency level of TWO will succeed. But if the replication factor is 2 and 1 of the 2 nodes is down, then a write with a consistency level of TWO will not even be attempted on the single node that is available.
What is mentioned in the documentation is the case where the write was initiated while consistency could be met, but one node went down in between and couldn't complete the write, whereas the write succeeded on the other node. Since consistency could not be met, the client gets a failure message. The record that was written to a single node is not rolled back; it is propagated to the other replicas later through hinted handoff or repair.
Consistency in Cassandra can be (and usually is) defined at the statement level. That means you specify on a particular query what level of consistency you need.
This implies that if the consistency level is not met, the statement above has not met its consistency requirements.
There is no rollback in Cassandra. What you have in Cassandra is eventual consistency. That means your statement might succeed in the future, if not immediately. When a replica node comes alive, the cluster (i.e., Cassandra's fault tolerance) will take care of writing to the replica node.
So, if your statement failed, it might still succeed in the future. This is contrary to the RDBMS world, where an uncommitted transaction is rolled back as if nothing had happened.
Update:
I stand corrected. Thanks Arun.
From:
http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_about_hh_c.html
During a write operation, when hinted handoff is enabled and consistency can be met, the coordinator stores a hint about dead replicas in the local system.hints table under either of these conditions:
So it's still not a rollback. Nodes know the current cluster state and don't initiate the write if consistency cannot be met.
At driver level, you get an exception.
On the nodes that the write succeeded, the data is actually written and it is going to be eventually rolled back.
In a normal situation, you can consider that the data was not written to any of the nodes.
From the documentation:
If the write fails on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that node. However, the replicated write that succeeds on the other node is not automatically rolled back.

Cassandra's atomicity and "rollback"

The Cassandra 2.0 documentation contains the following paragraph on Atomicity:
For example, if using a write consistency level of QUORUM with a replication factor of 3, Cassandra will replicate the write to all nodes in the cluster and wait for acknowledgement from two nodes. If the write fails on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that node. However, the replicated write that succeeds on the other node is not automatically rolled back.
So, write requests are sent to 3 nodes, and we're waiting for 2 ACKs. Let's assume we receive only 1 ACK (before the timeout). So it's clear that if we read with consistency ONE, we may read the value, OK.
But which of the following statements is also true:
It may occur that the write has been persisted on a second node, but the node's ACK got lost? (Note: this could result in a read of the value even at read consistency QUORUM!)
It may occur that the write will be persisted on a second node later (e.g. due to hinted handoff)? (Note: this could result in a read of the value even at read consistency QUORUM!)
It's impossible that the write is persisted on a second node, and the written value will eventually be removed from the node via read repair?
It's impossible that the write is persisted on a second node, but a manual "undo" action is necessary?
I believe you are mixing atomicity and consistency. Atomicity is not guaranteed across nodes whereas consistency is. Only writes to a single row in a single node are atomic in the truest sense of atomicity.
The only time Cassandra will fail a write is when too few replicas are alive when the coordinator receives the request, i.e. it cannot meet the consistency level. Otherwise your second statement is correct: it will store a hint so that the failed replica gets the row replicated to it once it is back up.
This article describes the different failure conditions.
http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure
