The Cassandra 2.0 documentation contains the following paragraph on Atomicity:
For example, if using a write consistency level of QUORUM with a replication factor of 3, Cassandra will replicate the write to all nodes in the cluster and wait for acknowledgement from two nodes. If the write fails on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that node. However, the replicated write that succeeds on the other node is not automatically rolled back.
So, write requests are sent to 3 nodes, and we're waiting for 2 ACKs. Let's assume we only receive 1 ACK (before timeout). So it's clear, that if we read with consistency ONE, that we may read the value, ok.
But which of the following statements is also true:
It may occur, that the write has been persisted on a second node, but the node's ACK got lost? (Note: This could result in a read of the value even at read consistency QUORUM!)
It may occur, that the write will be persisted later to a second node (e.g. due to hinted handoff)? (Note: This could result in a read of the value even at read consistency QUORUM!)
It's impossible, that the write is persisted on a second node, and the written value will eventually be removed from the node via ReadRepair?
It's impossible, that the write is persisted on a second node, but it is necessary to perform a manual "undo" action?
I believe you are mixing atomicity and consistency. Atomicity is not guaranteed across nodes whereas consistency is. Only writes to a single row in a single node are atomic in the truest sense of atomicity.
The only time Cassandra will fail a write is when too few replicas are alive when the coordinator receives the request i.e it cannot meet the consistency level. Otherwise your second statement is correct. It will hint that the failed node (replica) will need to have this row replicated.
This article describes the different failure conditions.
http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure
Related
I didn’t get from the documentation what happens after a node fails during writes.
I got the idea of quorum but what happens after “the write transaction” fails?
For example:
I inserted the record and chose the level of consistency equal to QUORUM.
Assume QUORUM = 3 nodes and 2 of 3, or just 1 of 3 nodes wrote the date but the rest didn’t and failed.
I got an error.
What happens with the record on the nodes which wrote it?
How can Casandra prevent propagating this row to other nodes through a replica synchronization?
Or if I get errors on writing it actually means that this row could appear within some time on each replica?
Cassandra doesn't have transactions (except light-weight transactions, that are also different kind of thing). When some node received & written the data, and other not - there is no rollback or something like this. This data is written. But coordinator node sees that consistency level couldn't be reached, and report error back to client application saying about it, so it could be retried if necessary. If it's not retried, then data could be propagated through the repair operations - either read repair, or through explicit repair. But because the data is on a single node, this means that this node mail fail before repair happens, and data could be lost.
I have a cassandra cluster with three nodes under normal circumstances. When I send write request cluster from node.js, I want all nodes to write back to me after writing, while reading, i want to be able to read which node I am connected to. I want this setup to continue when one of the three nodes has died. I chose replication factor= 3 consistency=2
How should I implement a configuration. Is the config implement true ?
With my respects...
So I unfortunately have no real clue about the provided numbers from the node JS driver, but I know something about the consistency levels, which I suspect you are using in the background, assuming that you are using this driver: http://datastax.github.io/nodejs-driver/.
Just a basic thing: The nodes don't write back to you directly, but your query is sent to one node, the coordinator of that query, which then distributes the query in your cluster according to your consistency level specifications (at least if it's a simple query, more complex destribution logic applies in case of batch queries). The coordinator then reports back to you when the query is executed.
Whether your requirements can be fulfilled at all depends on the replication factor you chose. The problem here is that cassandra only knows so many of them. The options for writing are: all (which at first looks what you want), quorum (which is also an option), one and any. So let's assume you choose all, because you want to write to all replicas. That's totally fine, but if one of the nodes goes down, there will be failing writes, because one of the replicas could not be updated. In case you are actually using replication factor 3, you can fallback to write with quorum, which is 2 nodes in this case. What should happen if another node fails? I know, very unlikely, but I've seen it in production, so it happens from time to time. Should the single last node be updated in this case? Then you need to fallback to consistency level 1. Everything fine.
But what if you choose the replication factor to be 5? Well, there is no way of saying: I want 4 nodes. You can only have a quorum in case of a failure of one node, and that would be 3, not 4. And the next fallback would be 1 and not 2.
The final question is: if you lose one node and you do a fallbak in the writing part, what happens when your node comes back (assuming that there are lost updates because some of the hinted handoffs are already discarded)? The reading part of your application can always read stale data because you always only read from a single node. It seems to me like you are trying to compensate for that in the write part. My personal idea would be using quorum when reading and writing, this way it's guaranteed that you read current data and a single node can go down (with replication factor 5 it's even 2 nodes). Also keep in mind that when you write to a node, cassandra will always attempt to write to the replicas in the background, so it tries to keep your data up to date. The risk of reading stale data even with a consistency level pair of one-one can be acceptable if you really need the speed.
The below statement from Cassandra documentation is the reason for my doubt.
For example, if using a write consistency level of QUORUM with a replication factor of 3, Cassandra will replicate the write to all nodes in the cluster and wait for acknowledgement from two nodes. If the write fails on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that node. However, the replicated write that succeeds on the other node is not automatically rolled back.
Ref : http://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_atomicity_c.html
So does Cassandra write to a node(which is up) even if Consistency cannot be met ?
I got it. Cassandra will not even attempt to write if it knows that consistency cannot be met. If consistency CAN be met, but does not have enough replicas to satisfy replication factor, then Cassandra would write to currently available replicas and gives a success message. Later when the replica is up again, it will write to other replica.
For e.g. If Replication factor is 3 , 1 of 3 nodes are down, then if I write with a Consistency of 2, the write will succeed. But if Replication factor is 2 and 1 of 2 nodes are down , then if I write with a Consistency of 2, Cassandra will not even write to that single node which is available.
What is mentioned in the documentation is a case where while write was initiated when the consistency can be met. But in between, one node went down and couldn't complete the write, whereas write succeeded in other node. Since consistency cannot be met, client would get a failure message. The record which was written to a single node would be removed later during node repair or compaction.
Consistency in Cassandra can (is?) be defined at statement level. That means you specify on a particular query, what level of consistency you need.
This will imply that if the consistency level is not met, the statement above has not met consistency requirements.
There is no rollback in Cassandra. What you have in Cassandra is Eventual consistency. That means your statement might be a success in future if not immediately. When a replica node comes a live, the cluster (aka the Cassandra's fault tolerance) will take care of writing to the replica node.
So, if your statement is failed, it might be succeeded in future. This is in contrary to the RDBMS world, where an uncommitted transaction is rolled back as if nothing has happened.
Update:
I stand corrected. Thanks Arun.
From:
http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_about_hh_c.html
During a write operation, when hinted handoff is enabled and consistency can be met, the coordinator stores a hint about dead replicas in the local system.hints table under either of these conditions:
So it's still not rollback. Nodes know the current cluster state and doesn't initiate the write if consistency cannot be met.
At driver level, you get an exception.
On the nodes that the write succeeded, the data is actually written and it is going to be eventually rolled back.
In a normal situation, you can consider that the data was not written to any of the nodes.
From the documentation:
If the write fails on one of the nodes but succeeds on the other,
Cassandra reports a failure to replicate the write on that node.
However, the replicated write that succeeds on the other node is not
automatically rolled back.
When both read and write are set to quorum, I can be guaranteed the client will always get the latest value when reading.
I realize this may be a novice question, but I'm not understanding how this setup doesn't provide consistency, availability, and partitioning.
With a quorum, you are unavailable (i.e. won't accept reads or writes) if there aren't enough replicas available. You can choose to relax and read / write on lower consistency levels granting you availability, but then you won't be consistent.
There's also the case where a quorum on reads and writes guarantees you the latest "written" data is retrieved. However, if a coordinator doesn't know about required partitions being down (i.e. gossip hasn't propagated after 2 of 3 nodes fail), it will issue a write to 3 replicas [assuming quorum consistency on a replication factor of 3.] The one live node will write, and the other 2 won't (they're down). The write times out (it doesn't fail). A write timeout where even one node has writte IS NOT a write failure. It's a write "in progress". Let's say the down nodes come up now. If a client next requests that data with quorum consistency, one of two things happen:
Request goes to one of the two downed nodes, and to the "was live" node. Client gets latest data, read repair triggers, all is good.
Request goes to the two nodes that were down. OLD data is returned (assuming repair hasn't happened). Coordinator gets digest from third, read repair kicks in. This is when the original write is considered "complete" and subsequent reads will get the fresh data. All is good, but one client will have received the old data as the write was "in progress" but not "complete". There is a very small rare scenario where this would happen. One thing to note is that write to cassandra are upserts on keys. So usually retries are ok to get around this problem, however in case nodes genuinely go down, the initial read may be a problem.
Typically you balance your consistency and availability requirements. That's where the term tunable consistency comes from.
Said that on the web it's full of links that disprove (or at least try to) the Brewer's CAP theorem ... from the theorem's point of view the C say that
all nodes see the same data at the same time
Which is quite different from the guarantee that a client will always retrieve fresh information. Strictly following the theorem, in your situation, the C it's not respected.
The DataStax documentation contains a section on Configuring Data Consistency. In looking through all of the available consistency configurations, For QUORUM it states:
Returns the record with the most recent timestamp after a quorum of replicas has responded regardless of data center. Ensures strong consistency if you can tolerate some level of failure.
Note that last part "tolerate some level of failure." Right there it's indicating that by using QUORUM consistency you are sacrificing availability (A).
The document referenced above also further defines the QUORUM level, stating that your replication factor comes into play as well:
If consistency is top priority, you can ensure that a read always
reflects the most recent write by using the following formula:
(nodes_written + nodes_read) > replication_factor
For example, if your application is using the QUORUM consistency level
for both write and read operations and you are using a replication
factor of 3, then this ensures that 2 nodes are always written and 2
nodes are always read. The combination of nodes written and read (4)
being greater than the replication factor (3) ensures strong read
consistency.
In the end, it all depends on your application requirements. If your application needs to be highly-available, ONE is probably your best choice. On the other hand, if you need strong-consistency, then QUORUM (or even ALL) would be the better option.
I want to clarify very basic concept of replication factor and consistency level in Cassandra. Highly appreciate if someone can provide answer to below questions.
RF- Replication Factor
RC- Read Consistency
WC- Write Consistency
2 cassandra nodes (Ex: A, B) RF=1, RC=ONE, WC=ONE or ANY
can I write data to node A and read from node B ?
what will happen if A goes down ?
3 cassandra nodes (Ex: A, B, C) RF=2, RC=QUORUM, WC=QUORUM
can I write data to node A and read from node C ?
what will happen if node A goes down ?
3 cassandra nodes (Ex: A, B, C) RF=3, RC=QUORUM, WC=QUORUM
can I write data to node A and read from node C ?
what will happen if node A goes down ?
Short summary: Replication factor describes how many copies of your data exist. Consistency level describes the behavior seen by the client. Perhaps there's a better way to categorize these.
As an example, you can have a replication factor of 2. When you write, two copies will always be stored, assuming enough nodes are up. When a node is down, writes for that node are stashed away and written when it comes back up, unless it's down long enough that Cassandra decides it's gone for good.
Now say in that example you write with a consistency level of ONE. The client will receive a success acknowledgement after a write is done to one node, without waiting for the second write. If you did a write with a CL of ALL, the acknowledgement to the client will wait until both copies are written. There are very many other consistency level options, too many to cover all the variants here. Read the Datastax doc, though, it does a good job of explaining them.
In the same example, if you read with a consistency level of ONE, the response will be sent to the client after a single replica responds. Another replica may have newer data, in which case the response will not be up-to-date. In many contexts, that's quite sufficient. In others, the client will need the most up-to-date information, and you'll use a different consistency level on the read - perhaps a level ALL. In that way, the consistency of Cassandra and other post-relational databases is tunable in ways that relational databases typically are not.
Now getting back to your examples.
Example one: Yes, you can write to A and read from B, even if B doesn't have its own replica. B will ask A for it on your client's behalf. This is also true for your other cases where the nodes are all up. When they're all up, you can write to one and read from another.
For writes, with WC=ONE, if the node for the single replica is up and is the one you're connect to, the write will succeed. If it's for the other node, the write will fail. If you use ANY, the write will succeed, assuming you're talking to the node that's up. I think you also have to have hinted handoff enabled for that. The down node will get the data later, and you won't be able to read it until after that occurs, not even from the node that's up.
In the other two examples, replication factor will affect how many copies are eventually written, but doesn't affect client behavior beyond what I've described above. The QUORUM will affect client behavior in that you will have to have a sufficient number of nodes up and responding for writes and reads. If you get lucky and at least (nodes/2) + 1 nodes are up out of the nodes you need, then writes and reads will succeed. If you don't have enough nodes with replicas up, reads and writes will fail. Overall some QUORUM reads and writes can succeed if a node is down, assuming that that node is either not needed to store your replica, or if its outage still leaves enough replica nodes available.
Check out this simple calculator which allows you to simulate different scenarios:
http://www.ecyrd.com/cassandracalculator/
For example with 2 nodes, a replication factor of 1, read consistency = 1, and write consistency = 1:
Your reads are consistent
You can survive the loss of no nodes.
You are really reading from 1 node every time.
You are really writing to 1 node every time.
Each node holds 50% of your data.