I have a setup with RF=2 and all my reads/writes are done with CL=1. There are a few places where I open a session, write an entry, do some backend processing, and read it again. This mostly works, but sometimes the read returns nil. We suspect that the coordinator node sends the read to a node different from the one where the write was done. My understanding was that a coordinator node sends the read request to both replica nodes and returns the results correctly.
We are not worried about updates to a row, as most of the time we need immediate consistency only for newly created rows. We really don't need QUORUM, and the RF=2 is mostly for HA, to tolerate the loss of one node. Any pointers on how to achieve immediate consistency with RF=2 and CL=1 would be greatly appreciated.
Having RF=3 with QUORUM would give you immediate consistency with the ability to tolerate a single node loss. Anything less than that and it's impossible to guarantee, as there will always be windows where one node sees a mutation before the other.
You need R + W > N for consistent reads/writes, where R is the number of nodes that must respond to a read, W is the number of nodes that must acknowledge a write, and N is the number of nodes holding the data (the RF).
With CL=ONE on reads/writes and RF=2 you have 1 + 1, which is not > 2. You could use ALL, TWO, or QUORUM on either the read or the write and you would get your consistency (TWO and QUORUM only work here because RF=2), but then any node failure will take away the ability to do either reads or writes.
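As an illustration, here is a minimal sketch (assuming the DataStax Python driver; the contact point, keyspace, and table names are made up) of raising the per-statement consistency so that R + W > N holds with RF=3:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical contact point, keyspace, and table.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

# With RF=3, QUORUM means 2 replicas: W=2 and R=2, so R + W = 4 > 3 = N.
insert = SimpleStatement(
    "INSERT INTO entries (id, payload) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(insert, (42, "hello"))

select = SimpleStatement(
    "SELECT payload FROM entries WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(select, (42,)).one()
```

Any single node can then fail and both statements still succeed, which is exactly what RF=2 with CL=ONE cannot give you.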
Related
I've got 3 nodes; 2 in datacenter 1 (node 1 and node 2) and 1 in datacenter 2 (node 3). Replication strategy: Network Topology, dc1:2, dc2: 1.
Initially I keep one of the nodes in dc1 off (node 2) and write 100 000 entries with consistency 2 (via a C++ program). After writing, I shut down the node in datacenter 2 (node 3) and turn on node 2.
Now, if I try to read those 100 000 entries I had written (again via c++ program) with consistency set as ONE, I'm not able to read all those 100 000 entries i.e. I'm able to read only some of the entries. As I run the program again and again, my program fetches more and more entries.
I was expecting that, since one of the 2 nodes that are up contains all 100 000 entries, the read program would fetch all the entries in the first execution with consistency ONE.
Is this related to read repair? I'm thinking that because read repair is happening in the background, the node is not able to respond to all the queries? But nowhere could I find anything documenting this behavior.
Let's run through the scenario.
During the write of the 100K rows, Node1 (DC1) and Node3 (DC2) took all the writes. While that was happening, Node1 may also have stored hints for Node2 (DC1) for the default 3 hours, and then stopped doing so.
Once Node2 comes back online, it takes a while to catch up through the replay of hints unless a repair is run. If the node was down for more than 3 hours, a repair becomes mandatory.
During the reads, a request can technically reach any node in the cluster, depending on the load-balancing policy used by the driver. Unless the driver is configured with "DCAwareRoundRobinPolicy", the read request might even reach either DC (DC1 or DC2 in this case). Since the requested consistency is "ONE", practically any live node can respond - Node1 or Node2 (DC1) in this case. Node2 may not yet have all the data, so it can still respond with a null value, and that is why you received empty data sometimes and correct data at other times.
With consistency "ONE", read repair doesn't even happen, as there is no other node to compare against. Here is the documentation on it. Even with consistency "LOCAL_QUORUM" or "QUORUM", there is a read_repair_chance set at the table level which defaults to 0.1, meaning only 10% of reads trigger a read repair. This saves performance by not triggering it on every read. Think about it: if read repair could make the table entirely consistent across nodes, why would "nodetool repair" even exist?
To avoid this situation, whenever a node comes back online it is best practice to run a "nodetool repair", or to run queries with consistency "LOCAL_QUORUM" to get consistent data back.
Also remember, consistency "ONE" is comparable to an uncommitted read (dirty read, WITH UR) in the RDBMS world, so expect to see unexpected data.
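As a sketch of the second suggestion (DataStax Python driver assumed; the contact point, DC name, keyspace, and table are made up), pinning requests to DC1 and reading at LOCAL_QUORUM so a single stale replica cannot answer on its own:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy

# Route requests only to dc1 and require a quorum of its replicas.
profile = ExecutionProfile(
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="dc1"),
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
cluster = Cluster(["10.0.0.1"], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect("my_keyspace")

rows = session.execute("SELECT * FROM entries WHERE id = %s", (42,))
```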
Per the documentation, consistency level ONE for reads:
Returns a response from the closest replica, as determined by the snitch. By default, a read repair runs in the background to make the other replicas consistent. Provides the highest availability of all the levels if you can tolerate a comparatively high probability of stale data being read. The replicas contacted for reads may not always have the most recent write.
Did you check that your code contacted the node that was always online and accepted the writes?
The DSE Architecture guide, and especially the Database Internals section, provides a good overview of how Cassandra works.
Assumptions: RF = 3
In some video on the Internet about consistency levels, the speaker says that CL=ONE is better than CL=ANY, because with CL=ANY the coordinator will be happy to store only a hint (and the data), assuming all the other nodes owning the corresponding partition key ranges are down, and we could potentially lose our data if the coordinator fails. But wait a minute... as I understand it, if we used CL=ONE and, for example, had only one (of three) available nodes for this partition key, we would have only one node with the inserted data. The risk of loss is the same.
But I think we should compare equal situations - all nodes for a particular token being down. In that case it's better to discard the write operation than to write with such a big risk of losing the coordinator.
CL=ANY should probably never be used on a production server. The write will be invisible to reads until the hint is replayed to a node owning that partition, because you can't read data while it only exists in a hints log.
Using CL=ONE and RF=3 with two nodes down, you would have the data stored in both a) the commit log and memtable of one node and b) a hints log. These are likely different nodes, but they could be the same node about 1/3 of the time. So yes, with both CL=ONE and CL=ANY you risk complete loss of the data from a single node failure.
Instead of ANY or ONE, use CL=QUORUM or CL=LOCAL_QUORUM.
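To make the difference concrete, a small sketch (DataStax Python driver assumed; the session setup and table names are made up) of the same insert issued at ANY versus QUORUM:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical

# ANY (writes only): the coordinator may report success after storing just a
# hint; the row can be unreadable until the hint is replayed, and is lost if
# the coordinator dies first.
risky = SimpleStatement(
    "INSERT INTO events (id, body) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ANY,
)

# QUORUM: with RF=3, two real replicas must acknowledge before success.
safer = SimpleStatement(
    "INSERT INTO events (id, body) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(safer, (1, "payload"))
```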
The thing is, hints are only stored for 3 hours by default; for outages longer than that you have to run repairs. You can repair only if at least one copy of the data exists on some node in the cluster (hints stored on the coordinator don't count).
Consistency ONE guarantees that at least one replica in the cluster has the write in its commit log no matter what. With ANY, the write may in the worst case be stored only in the coordinator's hints (other nodes can't access it), and hints are kept for 3 hours by default. Once those 3 hours pass, with ANY you are losing data if the other two replicas are down.
If you are worried about the risk, then use QUORUM, and 2 nodes will have to acknowledge the write. It's up to the application developer/designer to decide. QUORUM will usually have slightly higher write latency than ONE, but you can always add more nodes should the load dramatically increase.
Also have a look at this nice tool to see what impact various consistency levels and replication factors have on an application:
https://www.ecyrd.com/cassandracalculator/
With RF=3, 3 nodes in the cluster will actually receive the write. Consistency is just about how many of them you want to wait for... If you use ONE, you wait until one node has it in its commit log, but the coordinator still sends the write to all 3. If some of them don't respond, the coordinator saves the writes as hints.
Most of the time, ANY in production is a bad idea.
Scenario:
Total Nodes: 3 [ A, B, C ]
Replication factor: 2
Write consistency: Quorum (2 replicas need to ack)
Read consistency: Quorum
Node partition ranges:
A [ primary 1-25, replica 26-50 ]
B [ primary 26-50, replica 51-75 ]
C [ primary 51-75, replica 1-25 ]
Question:
Suppose I need to insert data with token 30 and node A is down. What would Cassandra's behavior be in this situation? Would Cassandra be able to write the data and report success back to the driver (even though a replica node is down and Cassandra needs 2 nodes to acknowledge the write)?
You only have 1 replica available for the write (B), so you'll get an error on the write (UnavailableException).
It's better to design your consistency levels / replication factor so that you can tolerate a node failure for a token range (consider bumping your RF to 3).
It's also better not to try to solve the availability problem by following the eventual-consistency path (R + W <= N), e.g. by setting W=1 in this case. We've tried that, and operationally it was not worth the effort.
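For illustration, a sketch of what the client would see (DataStax Python driver assumed; names are made up): with RF=2, QUORUM, and one replica for the token range down, the execute call raises Unavailable instead of quietly writing a single copy.

```python
from cassandra import ConsistencyLevel, Unavailable
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical

write = SimpleStatement(
    "INSERT INTO data (id, value) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)

try:
    session.execute(write, (30, "payload"))
except Unavailable as exc:
    # QUORUM needs 2 replicas for this token range, but only B is alive.
    print("write rejected, not enough live replicas:", exc)
```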
Is there a strong reason behind RF=2? Given the scenario, QUORUM will not be satisfied when a node is down, and your writes will fail. I suggest you revisit your RF.
You have identified one of the key reasons why RF=2 is not an advised replication factor for highly available Cassandra deployments. What happens depends on driver behavior (token-aware routing on or off):
Node B or C will be chosen as the coordinator
The coordinator will attempt to write to both B and A, because a quorum of 2 replicas is 2.
The coordinator will note that node A has not acknowledged the write and thus report back to the client that a Quorum was unable to be achieved.
Note, this does not mean that the write to node B failed; in fact the value is written to node B and the coordinator will store a hint for node A. However, you have not achieved your consistency goal, so in most situations it is advisable to retry the write until the node comes back up. In this specific situation, with RF=2 you are effectively doing ALL, which will not give the expected behavior when a node fails.
TL;DR: don't use QUORUM with RF=2.
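If you do stay on RF=2 with QUORUM, "retry the write until the node comes back up" looks roughly like this crude sketch from the client side (DataStax Python driver assumed; names are made up):

```python
import time

from cassandra import ConsistencyLevel, Unavailable
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical

write = SimpleStatement(
    "INSERT INTO data (id, value) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)

# Retry until both replicas for the token range are up and QUORUM (2 of 2) succeeds.
while True:
    try:
        session.execute(write, (30, "payload"))
        break
    except Unavailable:
        time.sleep(5)
```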
I want to clarify the very basic concepts of replication factor and consistency level in Cassandra. I would highly appreciate it if someone could answer the questions below.
RF - Replication Factor
RC - Read Consistency
WC - Write Consistency
2 Cassandra nodes (e.g. A, B), RF=1, RC=ONE, WC=ONE or ANY
Can I write data to node A and read it from node B?
What will happen if A goes down?
3 Cassandra nodes (e.g. A, B, C), RF=2, RC=QUORUM, WC=QUORUM
Can I write data to node A and read it from node C?
What will happen if node A goes down?
3 Cassandra nodes (e.g. A, B, C), RF=3, RC=QUORUM, WC=QUORUM
Can I write data to node A and read it from node C?
What will happen if node A goes down?
Short summary: Replication factor describes how many copies of your data exist. Consistency level describes the behavior seen by the client. Perhaps there's a better way to categorize these.
As an example, you can have a replication factor of 2. When you write, two copies will always be stored, assuming enough nodes are up. When a node is down, writes for that node are stashed away and written when it comes back up, unless it's down long enough that Cassandra decides it's gone for good.
Now say in that example you write with a consistency level of ONE. The client receives a success acknowledgement after the write is done on one node, without waiting for the second write. If you do a write with a CL of ALL, the acknowledgement to the client waits until both copies are written. There are many other consistency level options, too many to cover all the variants here. Read the DataStax docs, though; they do a good job of explaining them.
In the same example, if you read with a consistency level of ONE, the response will be sent to the client after a single replica responds. Another replica may have newer data, in which case the response will not be up-to-date. In many contexts, that's quite sufficient. In others, the client will need the most up-to-date information, and you'll use a different consistency level on the read - perhaps a level ALL. In that way, the consistency of Cassandra and other post-relational databases is tunable in ways that relational databases typically are not.
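To make that concrete, a minimal sketch (DataStax Python driver assumed; names are made up) of the same query issued at two different read consistency levels:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical

query = "SELECT value FROM items WHERE id = %s"

# ONE: returns as soon as a single replica answers; the result may be stale.
fast_read = SimpleStatement(query, consistency_level=ConsistencyLevel.ONE)

# ALL: waits for every replica, so the newest write is seen, but the read
# fails if any replica is down.
strict_read = SimpleStatement(query, consistency_level=ConsistencyLevel.ALL)

maybe_stale = session.execute(fast_read, (7,)).one()
up_to_date = session.execute(strict_read, (7,)).one()
```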
Now getting back to your examples.
Example one: Yes, you can write to A and read from B, even if B doesn't have its own replica. B will ask A for it on your client's behalf. This is also true for your other cases where the nodes are all up. When they're all up, you can write to one and read from another.
For writes with WC=ONE, the write will succeed if the node owning the single replica for that partition is up; if the replica belongs to the node that is down, the write will fail. If you use ANY, the write will succeed, assuming you're talking to the node that's up (I think you also have to have hinted handoff enabled for that). The down node will get the data later, and you won't be able to read it until after that occurs, not even from the node that's up.
In the other two examples, the replication factor affects how many copies are eventually written, but doesn't affect client behavior beyond what I've described above. QUORUM affects client behavior in that you have to have a sufficient number of replica nodes up and responding for writes and reads. If at least floor(RF/2) + 1 of the replicas for your data are up, writes and reads will succeed; if you don't have enough replicas up, they will fail. So with RF=2, QUORUM needs both replicas, while with RF=3 it needs only 2 of the 3. Overall, some QUORUM reads and writes can still succeed while a node is down, provided that node either doesn't hold a replica of your data, or its outage still leaves enough replicas available.
Check out this simple calculator which allows you to simulate different scenarios:
http://www.ecyrd.com/cassandracalculator/
For example with 2 nodes, a replication factor of 1, read consistency = 1, and write consistency = 1:
Your reads are consistent
You can survive the loss of no nodes.
You are really reading from 1 node every time.
You are really writing to 1 node every time.
Each node holds 50% of your data.
What is the best write/read strategy that is fault tolerant and fast for reads when all nodes are up?
I have 2 replicas in each datacenter, and at first I was considering using QUORUM for writes and LOCAL_QUORUM for reads, but reads would fail if one node crashed.
The other strategy that I came up with is to use QUORUM for writes and TWO for reads. It should work fast in normal conditions (because we will get results from the nearest nodes first) and it will work slower when any node crashes.
Is this a situation where consistency level TWO is recommended, or is it intended for some other purpose?
When would you use CL THREE?
Do you have a better strategy for consistent and fault tolerant writes/reads?
You first have to choose whether you want consistency or availability. If you choose consistency, then you need to have R + W > N, where R is how many nodes you read from, W is how many nodes you write to, and N is the number of replicas.
Then you have to choose whether you want reads/writes to always span multiple data centers.
Once you make those choices, you can then choose your consistency level (or it will be dictated to you).
If, for example, you decide you need consistency, and you don't want writes/reads to span multiple data centers, then you can read at LOCAL_QUORUM (which is 2 in your case) and write at ONE, or vice versa.
2 copies per DC is an odd choice. Typically you want to use LOCAL_QUORUM with 3 replicas in each data center. That lets you read and write using only nodes within a datacenter, while still allowing 1 node per DC to go down.
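A minimal sketch of that recommendation (DataStax Python driver assumed; the keyspace, DC names, and contact point are made up): 3 replicas per DC, all traffic kept in the local DC at LOCAL_QUORUM, so one node per DC can be down without affecting reads or writes.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy

# Keep all requests in dc1 and require 2 of its 3 local replicas.
profile = ExecutionProfile(
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="dc1"),
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
cluster = Cluster(["10.0.0.1"], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()

# 3 replicas per DC: LOCAL_QUORUM (2) still succeeds with 1 local node down.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS app WITH replication =
    {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3}
""")
```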