Cassandra - write with CL.ALL with multiple data centers

I have two data centers, each with replication factor 3.
Will a write with CL.ALL block until the data is stored in both DCs (6 nodes, or 3 + 1)?
I would assume that it blocks until all 3 replicas in the local DC have acknowledged a successful write.
I would like to have something like CL.ALL_LOCAL, which stores data on all replicas in a single DC, so that I can read with CL.ONE. The idea is that the write blocks until all replicas in a single DC have persisted the data, so a following read has a high probability of reading fresh data.

There isn't currently a consistency level that provides what you are describing. The closest is LOCAL_QUORUM, which returns after a quorum of nodes in the local datacenter has responded.
You can file a ticket on JIRA to add this functionality if you would like.
https://issues.apache.org/jira/browse/CASSANDRA
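
To make the numbers in the question concrete, here is a minimal Java sketch of how many replica acknowledgements each consistency level waits for with two datacenters at replication factor 3 each. The class RequiredAcks, its method names, and the datacenter names dc1/dc2 are hypothetical; the formulas reflect my understanding of the documented consistency levels (quorum = RF / 2 + 1, ALL = every replica in every datacenter), not the actual Cassandra source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: approximates the number of acknowledgements a coordinator
// waits for. All names here are hypothetical, not Cassandra internals.
class RequiredAcks {
    static int quorum(int rf) {
        return rf / 2 + 1;
    }

    static int blockFor(String level, Map<String, Integer> rfPerDc, String localDc) {
        int total = 0;
        int eachQuorum = 0;
        for (int rf : rfPerDc.values()) {
            total += rf;               // replicas across all datacenters
            eachQuorum += quorum(rf);  // one quorum per datacenter
        }
        switch (level) {
            case "ONE":          return 1;
            case "QUORUM":       return quorum(total);
            case "LOCAL_QUORUM": return quorum(rfPerDc.get(localDc));
            case "EACH_QUORUM":  return eachQuorum;
            case "ALL":          return total;
            default: throw new IllegalArgumentException("unknown level: " + level);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> rf = new LinkedHashMap<>();
        rf.put("dc1", 3);
        rf.put("dc2", 3);
        String[] levels = {"ONE", "QUORUM", "LOCAL_QUORUM", "EACH_QUORUM", "ALL"};
        for (String level : levels) {
            System.out.println(level + " -> " + blockFor(level, rf, "dc1"));
        }
    }
}
```

Under these assumptions CL.ALL waits for all 6 replicas, while LOCAL_QUORUM waits for 2 of the 3 local ones.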

I've checked the Cassandra 1.1 code and noticed interesting behavior when writing with CL.ALL in a multi-DC deployment. I may have misread the code, but anyway:
At the beginning, the coordinator collects the IP addresses of the nodes that should receive the row mutation; this is independent of the consistency level provided by the client. In 1.0 these were all nodes from all DCs; from 1.1 on, it takes all nodes from the local DC plus one node from each remote DC (the remaining nodes are listed as "forward to" in the message). Each mutation is sent by a separate thread, so the requests can run in parallel. Each such mutation is handled as a message by the messaging service. When a node in a remote DC receives the message, it forwards it to the remaining nodes listed in "forward to".
The consistency level provided by the client defines the number of nodes that must acknowledge the received message. In the case of CL.ALL this number is determined by the replication factor. Now it gets interesting: since we've sent the message to all nodes in the local DC and to nodes in the remote DCs, we will also get acknowledgements from those remote nodes. The count is still the number defined by the replication factor, but depending on network latency we cannot be sure which nodes have confirmed the message: it could be a mix of nodes from the local and remote DCs, or only nodes from the local DC. In the worst case, it could happen that none of the local nodes got the message and the confirmations came from remote DCs (if you have many). This means that writing with CL.ALL does not guarantee that you can immediately read the message from your local DC.
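
A common rule of thumb for the read-after-write goal in the question: a read is guaranteed to reach at least one replica holding the latest write when the write acknowledgements plus the read acknowledgements exceed the replication factor, counting only replicas in the datacenter the read goes to. A small Java sketch of that check follows; the class and method names are hypothetical.

```java
// Sketch of the "R + W > RF" overlap rule; names are hypothetical.
class OverlapCheck {
    // True when any set of readAcks replicas must intersect any set of
    // writeAcks replicas out of rf replicas (pigeonhole argument).
    static boolean readSeesLatestWrite(int writeAcks, int readAcks, int rf) {
        return writeAcks + readAcks > rf;
    }

    public static void main(String[] args) {
        int rf = 3; // replicas in the local datacenter
        // Write acknowledged by all 3 local replicas, read with ONE.
        System.out.println(readSeesLatestWrite(3, 1, rf)); // true
        // Write and read both with LOCAL_QUORUM (2 of 3).
        System.out.println(readSeesLatestWrite(2, 2, rf)); // true
        // Write with LOCAL_QUORUM, read with ONE: overlap is not guaranteed.
        System.out.println(readSeesLatestWrite(2, 1, rf)); // false
    }
}
```

The check only holds if the acknowledgements really come from local replicas, which is exactly the concern raised above.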

Related

Cassandra LOCAL_QUORUM is waiting for remote datacenter responses

We have a cluster with 2 datacenters (one in the EU and one in the US), with 4 nodes each, deployed in AWS.
The nodes in each datacenter are spread across 3 racks (availability zones).
In the cluster we have a keyspace test with replication: NetworkTopologyStrategy, eu-west:3, us-east:3
In the keyspace we have a table called mytable that has only one column, 'id', of type text.
We were doing some tests on the performance of the database.
In CQLSH, with a consistency level of LOCAL_QUORUM, we were doing some inserts with TRACING ON and noticed that the requests were not behaving as we expected.
From the tracing data we found that the coordinator node was, as expected, contacting 2 other local nodes, but was also sending a request to one of the nodes in the remote datacenter. The problem was that the coordinator was waiting not only for the local nodes (which finished in no time) but for the remote nodes too.
Since our 2 datacenters are geographically far apart, our requests were taking a very long time to complete.
Notes:
- This does not happen with DSE, but our understanding was that we shouldn't need to pay crazy money for LOCAL_QUORUM to work as expected

There is a high probability that you're hitting CASSANDRA-9753, where a non-zero dclocal_read_repair_chance triggers a query against the remote DC. Check the trace for a hint that read repair was triggered for your query. If it was, you can set dclocal_read_repair_chance to 0; this parameter is deprecated anyway.

For functional and performance tests it would be better to use the driver instead of CQLSH, as most of the time that is how you will interact with the database.
For this case, you may use a DC-aware policy like:
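import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

// Route requests only to nodes in the named local datacenter.
Cluster cluster = Cluster.builder()
    .addContactPoint("127.0.0.1")
    .withLoadBalancingPolicy(
        DCAwareRoundRobinPolicy.builder()
            .withLocalDc("myLocalDC") // name of the client's local DC
            .build())
    .build();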
This is modified from the example here: all the clauses that allow interaction with remote datacenters are removed, since your purpose is to keep the calls in the local datacenter.

driver default retry policy

I am testing our Cassandra cluster for resiliency; it's a 9-node cluster with rf=3. When I disable all traffic on port 7000 of one node, the client gets a
com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded)
The reason is that this host was only partially able to communicate with the other nodes, but the driver then proceeds to retry on the same host
INFO - c.d.d.c.p.LoggingRetryPolicy - - Retrying on read timeout on same host at consistency QUORUM (initial consistency: QUORUM, required responses: 2, received responses: 1, data retrieved: true, retries: 0)
It performs ALL its retries on the same host and never recovers; eventually the request fails.
I can create a custom policy, but I am wondering why it never tries any other nodes?

Per your description, the database will have only 3 copies of the data (RF=3), so even though you have 9 nodes, QUORUM is evaluated only against the 3 nodes that actually own the data; ownership is determined by the tokens and their assignment to the nodes.
Before disabling the port on that node, was the cluster reported as healthy (in other words, did nodetool status report all nodes as UN, Up and Normal)? Is the latency reported by all the nodes similar? If one node has increased latency, the query may time out before it gets a response from it.
Before creating custom policies, and once you have confirmed that all the nodes are healthy, reachable and available, you may want to explore using a lower consistency level (like ONE; ANY is only available for writes), which can improve resiliency and performance at the cost of accuracy, or increasing the replication factor, which increases the number of nodes holding the data at the cost of more disk usage.
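
To illustrate why only 3 of the 9 nodes matter for a given key, here is a simplified Java sketch of replica placement on a token ring. It assumes one token per node (no vnodes) and a SimpleStrategy-like placement where the replicas are the next RF nodes clockwise from the key's token; the class name, node names n1..n9 and token values are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Simplified token ring: one token per node, replicas walk clockwise.
// This is an illustration, not Cassandra's actual placement code.
class RingSketch {
    static TreeMap<Integer, String> nineNodeRing() {
        TreeMap<Integer, String> ring = new TreeMap<>();
        for (int i = 1; i <= 9; i++) {
            ring.put(i * 10, "n" + i); // hypothetical tokens 10, 20, ..., 90
        }
        return ring;
    }

    static List<String> replicasFor(TreeMap<Integer, String> ring, int keyToken, int rf) {
        List<String> replicas = new ArrayList<>();
        // The first replica is the node with the first token >= the key's token.
        Integer token = ring.ceilingKey(keyToken);
        if (token == null) {
            token = ring.firstKey(); // wrap around the ring
        }
        while (replicas.size() < Math.min(rf, ring.size())) {
            replicas.add(ring.get(token));
            token = ring.higherKey(token);
            if (token == null) {
                token = ring.firstKey();
            }
        }
        return replicas;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> ring = nineNodeRing();
        System.out.println(replicasFor(ring, 25, 3)); // [n3, n4, n5]
        System.out.println(replicasFor(ring, 95, 3)); // [n1, n2, n3]
    }
}
```

In this model a QUORUM read for token 25 needs 2 of n3, n4 and n5; the other six nodes cannot contribute a response for that key.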

Cassandra DB - Node is down and a request is made to fetch data in that Node

If we configure our replication factor so that there are no replica nodes (data is stored in one place/node only) and the node that contains the requested data is down, how will the request be handled by Cassandra?
Will it return no data, or will the other nodes gossip and somehow pick up the data from the failed node's storage and send the required response? If the data is picked up, will the transfer between nodes happen as soon as the node goes down (gossip protocol) or after a request is made?
I have researched for a long time how gossip works and how Cassandra achieves high availability, but I am wondering about the availability of data in the case of "no replicas", since I do not want to waste additional storage for occasional failures and, at the same time, I need availability and no data loss (even if delayed).

I assume that when you say there are "no replica nodes" you mean that you have set the replication factor to 1. In this case a read will fail. A write can be stored as a hint on the coordinator (this requires consistency level ANY; with ONE or higher the coordinator rejects the write when the only replica is down), kept up to the maximum hint time, and replayed when the node comes back. If the node is down for longer than the hint time then that write will be lost. See: Hinted Handoff: repair during write path
In general, having only a single replica of data in your C* cluster goes against some of the basic design of how C* is meant to be used and is an anti-pattern. Data duplication is a normal and expected part of using C* and is what allows for its high availability. RF=1 introduces a single point of failure into the system, as the server containing that data can go down for any of a variety of reasons (including things like maintenance), which will cause requests to fail.
If you are truly looking for a solution that provides high availability and no data loss, then you need to increase your replication factor (the standard I usually see is RF=3) and set up your cluster's hardware in such a manner as to reduce or remove potential single points of failure.
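
The hint time limit mentioned above can be pictured with a very small Java model. It assumes the coordinator only records a hint while the target node has been down for no longer than the hint window, and it assumes a three-hour window (the default value of max_hint_window_in_ms as far as I know); the class and method names are hypothetical.

```java
import java.time.Duration;

// Simplified model of the hint window; not Cassandra's actual implementation.
class HintWindowSketch {
    // Assumed window of three hours (max_hint_window_in_ms = 10800000).
    static final Duration ASSUMED_WINDOW = Duration.ofHours(3);

    // A hint is recorded for a down replica only while the replica has been
    // down for no longer than the window at the time of the write.
    static boolean hintRecorded(Duration downtimeAtWrite, Duration window) {
        return downtimeAtWrite.compareTo(window) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(hintRecorded(Duration.ofMinutes(20), ASSUMED_WINDOW)); // true
        System.out.println(hintRecorded(Duration.ofHours(5), ASSUMED_WINDOW));    // false
    }
}
```

With RF=1, a write that falls outside the window has no other copy to be repaired from, which is why a single replica cannot give the "no data loss" property asked for.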

Cassandra throws ReadTimeout exception for LOCAL_QUORUM queries when one of three replicas goes down

I have a Cassandra cluster of 3 nodes in one DC. Each keyspace is configured with replication factor 3.
All my queries are executed with LOCAL_QUORUM consistency. If one of my nodes goes down (for test purposes I just kill it via a shell command) during a read request, the request fails with ReadTimeoutException, saying that only one replica responded (2 expected), but all subsequent read requests return data.
In my understanding this error shouldn't happen, because two nodes are still running and that should be enough for LOCAL_QUORUM consistency. How can I fix the exception?

How data will be consistent on cassandra cluster

I have a question after reading the DataStax documentation about Cassandra write consistency: how will Cassandra maintain a consistent state in the following scenario?
Write consistency level = Quorum
replication factor = 3
As per the docs, when a write occurs the coordinator node sends the write request to all replicas in the cluster. If one replica succeeds and the others fail, the coordinator node sends an error response back to the client, but node-1 has successfully written the data and that will not be rolled back.
In this case,
Will read repair (or hinted handoff or nodetool repair) replicate the inconsistent data from node-1 to node-2 and node-3?
If not, how does Cassandra avoid replicating the inconsistent data to the other replicas?
Can you please clarify?

You are completely right: read repair or the other methods will update node-2 and node-3.
This means that even a failed write will eventually update the other nodes (if at least one replica succeeded). Cassandra doesn't have anything like the rollback that relational databases have.

I don't see anything wrong here: the system does what you tell it, i.e., two override one, and since the error message sent back to the client says "fail", the ultimate status after read repair should be "fail".

The Cassandra coordinator node keeps the data for the failed replica in its own storage and retries periodically (3 times or so); if a retry succeeds it sends the latest data, otherwise it discards the data from its storage.
For a read query, the coordinator node sends requests to all the replica nodes and compares the results. If one of the replica nodes does not return the latest data, the coordinator sends a read repair command to that node to keep the nodes in sync.
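
The scenario from the question (node-1 accepted the write, node-2 and node-3 did not) can be sketched in Java as follows. This is a toy model of timestamp-based reconciliation, assuming the newest timestamp wins and that the read contacts all three replicas; the class, the Cell type and the values are hypothetical and do not mirror Cassandra's internal classes.

```java
// Toy model of read repair with last-write-wins reconciliation.
class ReadRepairSketch {
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Returns the newest cell among the contacted replicas and overwrites
    // any stale replica with it (the "repair" step).
    static Cell readWithRepair(Cell[] replicas) {
        Cell newest = replicas[0];
        for (Cell cell : replicas) {
            if (cell.timestamp > newest.timestamp) {
                newest = cell;
            }
        }
        for (int i = 0; i < replicas.length; i++) {
            if (replicas[i].timestamp < newest.timestamp) {
                replicas[i] = newest;
            }
        }
        return newest;
    }

    public static void main(String[] args) {
        Cell[] replicas = {
            new Cell("new", 200L), // node-1: the "failed" write landed here
            new Cell("old", 100L), // node-2
            new Cell("old", 100L)  // node-3
        };
        System.out.println(readWithRepair(replicas).value); // new
        System.out.println(replicas[1].value + " " + replicas[2].value); // new new
    }
}
```

In this model the write that was reported as failed still wins once a read includes node-1. A read that only contacts node-2 and node-3 would keep returning the old value until some repair mechanism reaches them.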
