Consider two data centers in different geographical locations; correct me if any statement I have made is incorrect.
DC1's coordinator receives a write request from a client and forwards it to the local nodes containing the replicas. For the write to succeed at LOCAL_QUORUM consistency, a quorum of the local replicas must acknowledge the write to the coordinator within the write_request_timeout_in_ms = 5000 ms window.
When DC1's coordinator receives the write request, it also sends the write request to a coordinator in DC2, which passes the request on to the local nodes containing the replica data. These DC2 nodes are not required to acknowledge within the 5000 ms window for the write mutation to succeed; only the DC1 replicas need to respond within that time.
So my question is: what timers govern the write requests happening at DC2? The DC2 coordinator has to respond back to the DC1 coordinator within some specific timeout, otherwise the DC1 coordinator stores a hinted handoff for DC2. What is that specific timeout called in the cassandra.yaml file?
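For reference, this is roughly how the client issues the write (a minimal sketch with the DataStax Java driver 3.x; the contact point, keyspace, and table names are placeholders, not from any real setup):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class LocalQuorumWrite {
    public static void main(String[] args) {
        // Hypothetical contact point (a DC1 node) and schema.
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")
                .build();
             Session session = cluster.connect("my_ks")) {

            SimpleStatement write = new SimpleStatement(
                    "INSERT INTO users (id, name) VALUES (?, ?)", 42, "alice");
            // LOCAL_QUORUM: only a quorum of replicas in the coordinator's own
            // DC must acknowledge within write_request_timeout_in_ms; the DC2
            // replicas still receive the mutation, but off the critical path.
            write.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
            session.execute(write);
        }
    }
}
```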
Cassandra uses the concept of hinted handoff for consistency.
It means that if a node is down, the coordinator takes note of it and waits till it's up, and then resends the write request to it.
Does it mean that Cassandra sends a success response back to the client even while it's waiting for the unavailable node to come up? If yes, then what if all of the target nodes were down? Wouldn't that mean a successful response to the client without a single write?
Hints are not stored if consistency cannot be achieved
For example, suppose you have 3 replicas and all of them are down. If the write consistency is QUORUM, then no hints are stored and the write fails. A hint is stored only when one node is down and the coordinator received success responses from the other two.
The only exception is write consistency ANY. In that case a hint is stored and the write succeeds even if all replicas are down.
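For example (a minimal sketch with the 3.x Java driver; the contact point and schema are placeholders):

```java
import com.datastax.driver.core.*;

public class AnyWriteExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_ks")) {
            Statement write = new SimpleStatement(
                    "INSERT INTO users (id, name) VALUES (?, ?)", 1, "bob")
                    .setConsistencyLevel(ConsistencyLevel.ANY);
            // With CL.ANY a stored hint counts as success, so this can return
            // OK even while all three replicas are down, as long as the
            // coordinator itself can store the hint.
            session.execute(write);
        }
    }
}
```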
I am testing our Cassandra cluster for resiliency. It's a 9-node cluster with RF=3. When I disable all traffic on port 7000 of one node, the client gets a
com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded)
The reason is that this host was only partially able to communicate with the other nodes, but the driver then proceeds to retry on the same host:
INFO - c.d.d.c.p.LoggingRetryPolicy - - Retrying on read timeout on same host at consistency QUORUM (initial consistency: QUORUM, required responses: 2, received responses: 1, data retrieved: true, retries: 0)
It makes ALL its retries on the same host and never recovers; eventually the request fails.
I can create a custom policy, but I am wondering why it never tries any other nodes?
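If I do end up writing one, I assume it would look roughly like this (a sketch against the 3.x driver's RetryPolicy interface, untested; the retry limits are arbitrary):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.WriteType;
import com.datastax.driver.core.exceptions.DriverException;
import com.datastax.driver.core.policies.RetryPolicy;

// Retries read timeouts on a different host instead of hammering the
// same partially-partitioned coordinator.
public class NextHostOnTimeoutPolicy implements RetryPolicy {

    private static final int MAX_RETRIES = 2;

    @Override
    public RetryDecision onReadTimeout(Statement stmt, ConsistencyLevel cl,
                                       int required, int received,
                                       boolean dataRetrieved, int nbRetry) {
        // Move to the next host in the query plan rather than retrying here.
        return nbRetry < MAX_RETRIES ? RetryDecision.tryNextHost(cl)
                                     : RetryDecision.rethrow();
    }

    @Override
    public RetryDecision onWriteTimeout(Statement stmt, ConsistencyLevel cl,
                                        WriteType type, int required,
                                        int received, int nbRetry) {
        return RetryDecision.rethrow();
    }

    @Override
    public RetryDecision onUnavailable(Statement stmt, ConsistencyLevel cl,
                                       int required, int alive, int nbRetry) {
        return nbRetry == 0 ? RetryDecision.tryNextHost(cl)
                            : RetryDecision.rethrow();
    }

    @Override
    public RetryDecision onRequestError(Statement stmt, ConsistencyLevel cl,
                                        DriverException e, int nbRetry) {
        return RetryDecision.tryNextHost(cl);
    }

    @Override public void init(Cluster cluster) {}
    @Override public void close() {}
}
```

It would be registered with Cluster.builder().withRetryPolicy(new LoggingRetryPolicy(new NextHostOnTimeoutPolicy())), keeping the same log output shown above.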
As per your definitions, the database keeps only 3 copies of the information (RF=3). So even though you have 9 nodes, QUORUM is evaluated only against the 3 nodes that actually own the data; which nodes those are is determined by the number of tokens and their assignment to the nodes.
Before disabling the port on that node, was the cluster reported as healthy? (In other words, did nodetool status report all the nodes as UN, Up and Normal?) Is the latency reported by all the nodes similar? If you have a node with increased latencies, the query will time out before it gets a response from that node.
Before creating custom policies, and once you have confirmed that all the nodes are healthy, reachable, and available, you may want to explore a lower consistency level (like ANY or ONE), which can improve resiliency and performance at some cost in accuracy, or increase the replication factor, which increases the number of nodes where the data can be found, with the inconvenience that disk utilization will also increase. Both options are sketched below.
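For example (hypothetical names throughout; a sketch with the 3.x Java driver):

```java
import com.datastax.driver.core.*;

public class ResiliencyOptions {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // Option 1: lower the per-query consistency level.
            Statement read = new SimpleStatement(
                    "SELECT * FROM my_ks.users WHERE id = 42")
                    .setConsistencyLevel(ConsistencyLevel.ONE);
            session.execute(read);

            // Option 2: raise the replication factor; more replicas can serve
            // the data, at the cost of disk space. Run nodetool repair
            // afterwards so the new replicas receive the existing data.
            session.execute("ALTER KEYSPACE my_ks WITH replication = "
                    + "{'class': 'NetworkTopologyStrategy', 'dc1': 5}");
        }
    }
}
```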
I have a Cassandra cluster of 3 nodes in one DC. Each keyspace is configured with replication factor 3.
All my queries are executed with LOCAL_QUORUM consistency. If one of my nodes goes down (for test purposes I just kill it via a shell command) during a read request, that request fails with a ReadTimeoutException saying that only one replica responded (2 expected), but all subsequent read requests return data.
In my understanding this error shouldn't happen, because two nodes are still running and that should be enough for LOCAL_QUORUM consistency. How can I fix the exception?
I have a doubt from reading the DataStax documentation about Cassandra write consistency. My question is how Cassandra maintains a consistent state in the following scenario:
Write consistency level = QUORUM
Replication factor = 3
As per the docs, when a write occurs, the coordinator node sends the write request to all replicas in the cluster. If one replica succeeds and the others fail, the coordinator sends an error response back to the client, but node-1 has successfully written the data and it will not be rolled back.
In this case,
Will read-repair (or hinted-handoff or nodetool repair) replicate the inconsistent data from node-1 to node-2 and node-3?
If not, how does Cassandra take care of not replicating inconsistent data to the other replicas?
Can you please clarify?
You are completely right: read repair or the other repair mechanisms will update node-2 and node-3.
This means even a failed write will eventually reach the other nodes (as long as at least one replica succeeded). Cassandra doesn't have anything like the rollback that relational databases have.
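Replicas reconcile divergent cells by write timestamp (last write wins), which is why node-1's copy propagates rather than being rolled back. A small sketch, again with the 3.x Java driver and a hypothetical schema, showing how to inspect the timestamp that drives that reconciliation:

```java
import com.datastax.driver.core.*;

public class WritetimeCheck {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_ks")) {
            // WRITETIME exposes the cell's timestamp; during read repair the
            // replica with the highest timestamp wins, so node-1's "failed"
            // quorum write still ends up on node-2 and node-3.
            Row row = session.execute(
                    "SELECT name, WRITETIME(name) AS ts FROM users WHERE id = 42").one();
            System.out.printf("name=%s written at %d%n",
                    row.getString("name"), row.getLong("ts"));
        }
    }
}
```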
I don't see that there is anything wrong: the system does what you tell it, i.e., two override one, and since the error message was sent back to the client as "fail", the ultimate status should be "fail" as far as the read repair tool is concerned.
The Cassandra coordinator node keeps the failed replica's data in its storage and retries periodically (3 times or so); if a retry succeeds it delivers the latest data, otherwise it truncates the data from its storage.
For a read query, the coordinator node sends requests to the replica nodes and compares the results from all of them. If one of the replicas does not return the latest data, the coordinator sends a read-repair command to that node in order to keep the nodes in sync.
I have two Data Centers, each one with replication factor 3.
Will a write with CL.ALL block until the data is stored in both DCs (all 6 nodes, or 3 + 1)?
I would assume that it blocks until all 3 replicas in the local DC have acknowledged a successful write.
I would like to have something like CL.ALL_LOCAL, which stores data on all replicas in a single DC so that I can read with CL.ONE. The idea is that the write blocks until all replicas in a single DC have persisted the data, and a following read will have a high probability of reading fresh data.
There isn't currently a consistency level that provides what you are describing. The closest is LOCAL_QUORUM which will return after a quorum of nodes in the local datacenter respond.
You can file a ticket on JIRA to add this functionality if you would like: https://issues.apache.org/jira/browse/CASSANDRA
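Until then, the usual pattern is LOCAL_QUORUM for both writes and reads: with RF=3 per DC that is 2 writers + 2 readers, and 2 + 2 > 3, so a local read overlaps the preceding local write in at least one replica. A rough sketch (3.x Java driver; contact point and schema are hypothetical):

```java
import com.datastax.driver.core.*;

public class LocalQuorumPattern {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
             Session session = cluster.connect("my_ks")) {
            // LOCAL_QUORUM write followed by LOCAL_QUORUM read: the two sets
            // of local replicas must intersect, so the read sees the write.
            session.execute(new SimpleStatement(
                    "INSERT INTO users (id, name) VALUES (?, ?)", 7, "carol")
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM));
            Row r = session.execute(new SimpleStatement(
                    "SELECT name FROM users WHERE id = ?", 7)
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM)).one();
            System.out.println(r.getString("name"));
        }
    }
}
```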
I've checked the Cassandra 1.1 code and noticed interesting behavior when writing with CL.ALL in a multi-DC deployment. Possibly I've interpreted the code wrong... anyway:
At the beginning the coordinator collects the IP addresses of the nodes to send the row mutation to; this is independent of the consistency level provided by the client. In 1.0 it was all nodes from all DCs; from 1.1 it is all nodes from the local DC plus one node from each remote DC (the remaining remote nodes are listed as "forward to" targets in the message). Each mutation is sent by a separate thread, so the requests can run in parallel. Each such mutation is handled as a message by the messaging service. When a node in a remote DC receives the message, it forwards it to the remaining nodes provided in "forward to".
The consistency level provided by the client defines the number of nodes that must acknowledge the received message. In the case of CL.ALL this number is determined by the replication factor, and this is where it gets interesting: since we've sent the message to all nodes in the local DC and to nodes in the remote DCs, we will also get acknowledgements from those remote nodes. The required count is still the number defined by the replication factor, but depending on network latency we cannot be sure which nodes have confirmed the message; it could be a mix of nodes from the local and remote DCs, or it could be only nodes from the local DC. In the worst case it could happen that none of the local nodes has the message yet and the confirmations came from remote DCs (if you have many). This means that writing with CL.ALL does not guarantee that you can immediately read the data from your local DC.
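To illustrate the target selection I'm describing (this is just a sketch, not the actual Cassandra source; DC names and addresses are made up):

```java
import java.util.*;

// Illustrative sketch of the 1.1 behavior described above: send to every
// local replica directly, and to one replica per remote DC with the rest
// attached as "forward to" targets.
public class ForwardToSketch {

    static final String LOCAL_DC = "DC1";

    static Map<String, List<String>> groupByDc(Map<String, String> replicaToDc) {
        Map<String, List<String>> byDc = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : replicaToDc.entrySet()) {
            byDc.computeIfAbsent(e.getValue(), dc -> new ArrayList<>()).add(e.getKey());
        }
        return byDc;
    }

    public static void main(String[] args) {
        Map<String, String> replicaToDc = new LinkedHashMap<>();
        replicaToDc.put("10.0.0.1", "DC1");
        replicaToDc.put("10.0.0.2", "DC1");
        replicaToDc.put("10.0.0.3", "DC1");
        replicaToDc.put("10.1.0.1", "DC2");
        replicaToDc.put("10.1.0.2", "DC2");
        replicaToDc.put("10.1.0.3", "DC2");

        for (Map.Entry<String, List<String>> e : groupByDc(replicaToDc).entrySet()) {
            List<String> nodes = e.getValue();
            if (e.getKey().equals(LOCAL_DC)) {
                // Local DC: one message per replica, no forwarding.
                nodes.forEach(n -> System.out.println("send mutation to " + n));
            } else {
                // Remote DC: one message to a single relay node, which then
                // forwards to the rest listed in the "forward to" header.
                String relay = nodes.get(0);
                System.out.println("send mutation to " + relay
                        + " forward-to " + nodes.subList(1, nodes.size()));
            }
        }
    }
}
```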