Cassandra repairs on TWCS - cassandra

We have a 13 nodes Cassandra cluster (version 3.10) with RP 2 and read/write consistency of 1.
This means that the cluster isn't fully consistent, but eventually consistent. We chose this setup to speed up the performance, and we can tolerate a few seconds of inconsistency.
The tables are set with TWCS with read-repair disabled, and we don't run full repairs on them
However, we've discovered that some entries of the data are replicated only once, and not twice, which means that when the not-updated node is queried it fails to retrieve the data.
My first question is how could this happen? Shouldn't Cassandra replicate all the data?
Now if we choose to perform repairs, it will create overlapping tombstones, therefore they won't be deleted when their time is up. I'm aware of the unchecked_tombstone_compaction property to ignore the overlap, but I feel like it's a bad approach. Any ideas?

So you've obviously made some deliberate choices regarding your client CL. You've opted to potentially sacrifice consistency for speed. You have achieved your goals, but you assumed that data would always make it to all of the other nodes in the cluster that it belongs. There are no guarantees of that, as you have found out. How could that happen? There are multiple reasons I'm sure, some of which include: networking/issues, hardware overload (I/O, CPU, etc. - which can cause dropped mutations), cassandra/dse being unavailable for whatever reasons, etc.
If none of your nodes have not been "off-line" for at least a few hours (whether it be dse or the host being unavailable), I'm guessing your nodes are dropping mutations, and I would check two things:
1) nodetool tpstats
2) Look through your cassandra logs
For DSE: cat /var/log/cassandra/system.log | grep -i mutation | grep -i drop (and debug.log as well)
I'm guessing you're probably dropping mutations, and the cassandra logs and tpstats will record this (tpstats will only show you since last cassandra/dse restart). If you are dropping mutations, you'll have to try to understand why - typically some sort of load pressure causing it.
I have scheduled 1-second vmstat output that spools to a log continuously with log rotation so I can go back and check a few things out if our nodes start "mis-behaving". It could help.
That's where I would start. Either way, your decision to use read/write CL=1 has put you in this spot. You may want to reconsider that approach.

Consistency level=1 can create a problem sometimes due to many reasons like if data is not replicating to the cluster properly due to mutations or cluster/node overload or high CPU or high I/O or network problem so in this case you can suffer data inconsistency however read repair handles this problem some times if it is enabled. you can go with manual repair to ensure consistency of the cluster but you can get some zombie data too for your case.
I think, to avoid this kind of issue you should consider CL at least Quorum for write or you should run manual repair within GC_grace_period(default is 10 days) for all the tables in the cluster.
Also, you can use incremental repair so that Cassandra run repair in background for chunk of data. For more details you can refer below link
http://cassandra.apache.org/doc/latest/operating/repair.html or https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/tools/toolsRepair.html

Related

High CPU usage and traffic on some Cassandra nodes

As stated in the title, we are having a problem with our Cassandra cluster. There are 9 nodes with a replication factor of 3 using NetworkTopologyStrategy. All in the same DC and Rack. Cassandra version is 3.11.4 (planning to move on 3.11.10). Instances have 4 CPU and 32 GB RAM. (planning to move on 8 CPU)
Whenever we try to run repair on our cluster (using Cassandra Reaper on one of our nodes), we lose one node somewhere in the process. We quickly stop the repair, restart Cassandra service on the node and wait for it to join the ring. Therefore we are never able to run repair these days.
I observed the problem and realized that this problem is caused by high CPU usage on some of our nodes (exactly 3). You may see the 1 week interval graph in below. Ups and downs are caused by the usage of the app. In the mornings, it's very low.
I compared the running processes on each node and there is nothing extra on the high CPU nodes. I compared the configurations. They are identical. Couldn't find any difference.
I also realized that these nodes are the ones that take most of the traffic. See the 1 week interval graph in below. Both sent & received bytes.
I made some research. I found this thread and at the end it is recommended to set dynamic_snitch: false in Cassandra configuration. I looked at our snitch strategy which is GossipingPropertyFileSnitch. In practice, this strategy should work properly but I guess it doesn't.
The job of a snitch is to provide information about your network topology so that Cassandra can efficiently route requests.
My only observation that could be cause of this issue is there is a file called cassandra-topology.properties which is specifically told to be removed if using GossipingPropertyFileSnitch
The rack and datacenter for the local node are defined in cassandra-rackdc.properties and propagated to other nodes via gossip. If cassandra-topology.properties exists, it is used as a fallback, allowing migration from the PropertyFileSnitch.
I did not remove this file as I couldn't find any hard proof that this is causing the issue. If you have any knowledge on this or see any other reason to my problem, I would appreciate your help.
These two sentences tell me some important things about your cluster:
high CPU usage on some of our nodes (exactly 3).
I also realized that these nodes are the ones that take most of the traffic.
The obvious point, is that your replication factor (RF) is 3 (most common). The not-so-obvious, is that your data model is likely keyed on date or some other natural key which results in the same (3?) nodes serving all of the traffic for long periods of time. Running repair during those high-traffic periods will likely lead to issues.
Some things to try:
Have a look at the data model, and see if there's a better way to partition the data to distribute traffic over the rest of the cluster. This is often done with a modeling technique known as "bucketing" (adding another component...usually time based...to the partition key).
Are the partitions large? (Check with nodetool tablehistograms) And by "large," like > 10MB? It could also be that the large partitions are causing the repair operations to fail. If so, hopefully lowering resource consumption (below) will help.
Does your cluster sustain high amounts of write throughput? If so, it may also be dealing with compactions (nodetool compactionstats). You could try lowering compaction throughput (nodetool setcompactionthroughput) to free up some resources. Repair operations can also invoke compactions.
Likewise, you can also lower streaming throughput (nodetool setstreamthroughput) during repairs. Repairs will take longer to stream data, but if that's what is really tipping-over the node(s), it might be necessary.
In case you're not already, set up another instance and use Cassandra Reaper for repairs. It is so much better than triggering from cron. Plus, the UI allows for some finely-tuned config which might be necessary here. It also lets you pause and resume repairs, to pick-up where it leaves off.

How to determine the sync status is up to date for particular node in a Cassandra cluster?

Suppose I have two node cassandra cluster and they are reside on physically different data-centers. Suppose the database inside that cluster has replication factor is 2 which means every data in that database should be sync with each other. suppose this database is a massive database which have millions of records of its tables. I named those nodes centers as node1 and node2. Suppose node2 is not reliable and there was a crash on that server and take few days to fix and get the server back to up and running state. After that according to my understating there should be a gap between node1 and node2 and it may take significant time to sync node2 with node1. So need a way to measure the gap between node2 and node1 for the mean time of sync happen? After some times how should I assure that node2 is equal to node1? Please correct me if im wrong with this question according to the cassandra architechure.
So let's start with your description. 2 node cluster, which sounds fine, but 2 nodes in 2 different data centers (DCs) - bad design, but doable. Each data center should have multiple nodes to ensure your data is highly available. Anyway, that aside, let's assume you have a 2 node cluster with 1 node in each DC. The replication factor (RF) is defined at the keyspace level (not at the cluster level - each DC will have a RF setting for a particular keyspace (or 0 if not specified for a particular DC)). That being said, you can't have RF=2 for a keyspace for either of your DCs if you only have a single node in each one (RF, which is how many copies of the data that exist, can't be more than the number of nodes in the DC). So let's put that aside for now as well.
You have the possibility for DCs to become out of sync as well as nodes within a DC to become out of sync. There are multiple protections against this problem.
Consistency Level (CL)
This is a lever that you (the client) have to be able to help control how far out of sync things get. There's a trade off between availability v.s. consistency (with performance implications as well). The CL setting is configured at connection time and/or each statement level. For writes, the CL determines how many nodes must IMMEDIATELY ACKNOWLEDGE the write before giving your application the "green light" to move on (a number of nodes that you're comfortable with - knowing the more nodes you immediately require the more consistent your nodes and/or DC(s) will be, but the longer it will take and the less flexibility you have in nodes becoming unavailable without client failure). If you specify less than RF it doesn't mean that RF won't be met, it just means that they don't need to immediately acknowledge the write to move on. For reads, this setting determines how many nodes' data are compared before the result is returned (if cassandra finds a particular row doesn't match from the nodes it's comparing, it will "fix" them during the read before you get your results - this is called read repair). There are a handful of CL options by the client (e.g. ONE, QUORUM, LOCAL_ONE, LOCAL_QUOURM, etc.). Again, there is a trade-off between availability and consistency with the selected choice.
If you want to be sure your data is consistent when your queries run (when you read the data), ensure the write CL + the read CL > RF. You can ensure that's done on a LOCAL level (e.g. the DC that the read/write is occurring on, say, LOCAL_QUORUM) or globally (all DCs with QUORUM). By doing this, you'll be sure that while your cluster may be inconsistent, your results during reads will not be (i.e. the results will be consistent/accurate - which is all that anyone really cares about). With this setting you also allow some flexibility in unavailable nodes (e.g. for a 3 node DC you could have a single node be unavailable without client failure for either reads or writes).
If nodes do become out of sync, you have a few options at this point:
Repair
Repair (run by "nodetool repair") - this is a facility that you can schedule or manually run to reconcile your tables, keyspaces and/or the entire node with other nodes (either in the DC the node resides or the entire cluster). This is a "node level" command and must be run on each node to "fix" things. If you have DSE, Ops Center can run repairs in the background fixing "chunks" of data - cycling the process repetitively.
NodeSync
Similar to repair, this is a DSE specific tool similar to repair that helps keep data in sync (the newer version of repair).
Unavailable nodes:
Hinted Handoff
Cassandra has the ability to "hold onto" changes if nodes become unavailable during writes. It will hang onto changes for a specified period of time. If the unavailable nodes become available before time runs out, the changes are sent over for application. If time runs out, hint collection stops and one of the other options, above, need to be performed to catch things up.
Finally, there is no way to know how inconsistent things are (e.g. 30% inconsistent). You simply try to utilize the tools mentioned above to control consistency without completely sacrificing availability.
Hopefully that makes sense and helps.
-Jim

Why do tables get out of sync over time when Write Consistency ALL is used?

Iam running a cassandra 3.11.4 cluster with 1 data center, 2 racks and 11 nodes. My keyspaces and the tables are set to replication 2. I use the Prometheus-Grafana-Combo to monitor the cluster.
Observation: During (massive) inserts using Write-Consistency Level ALL (i.e. 2 nodes) the affected tables/nodes get slowly out of sync (worst case on one node: from 100% to 83% within 6 hours). My expectation is that this could only happen if I use ANY (or anything less than my replication factor).
I would really like to understand this behaviour.
What is also interesting: If I dare to use write consistency ANY I get exactly that- and even though all nodes are online Cassandra does not even seem attempt to write to all nodes. In any case (ANY or ALL) if have to perform incremental repairs.
First of all, your expectation is correct: Writes, regardless of what the consistency-level is (ALL or ONE or ANY or whatever), do make every attempt to write to all replicas. The different write-consistency levels only differ on when "success" is reported to the client: ALL waits until all writes were done, while ONE waits for just one (and does the other ones in the background). So unless one of your nodes goes down, or severely overloaded, none of the writes should be missing on any of the nodes, and there should be zero inconsistencies. The "hinted handoff" feature makes inconsistencies even less likely (if one node is temporarily down, other nodes save for it the writes it missed, and replay them later).
I think your only problem is that you're misinterpreting what the "percentrepaired" statistic means. The "percentrepaired" metric is used by incremental repair. In incremental repair, data on disk is split between "repaired" data (data that already went through a repair process) and "unrepaired" data - new data that still did not yes pass through repair. This does not mean that the new data is inconsistent or differs between nodes - it just that nobody checked that yet! To mark this new data "repaired" you'd need to run an (incremental) repair - it will realize the data does not differ between nodes, and mark it as "repaired".

Cassandra WRITE=ALL and READ=ONE applicability

Given we have a 2x5 nodes setup (across 2 DC) and replication factor of 3, and the fact that we create views asynchronously (so we can safely retry failed operations) does using WRITE=ALL and READ=ONE make sense?
If one replica fails, how can we know the recovery time so how to pick up right retry interval and timeout?
Any of the below combination should give you correct data:
WRITE=ALL READ=ONE
WRITE=ONE READ=ALL
WRITE=LOCAL_QUORUM READ=LOCAL_QUORUM
You can tune consistency level in your application, as per load of the application.
According to me, Number 3 LOCAL_QUORUM should work better, As sometimes a node can be under high load or maybe is down. Your application will not get affected.
In case, you have more writes than READ; WRITE CL=ALL will make your application slow.
The combination of WRITE=ALL and READ=ONE is correct in the sense of consistency - after you've written to all the replicas, you can indeed read from any one and expect the latest data. However, it is bad for high availability - if any one of the 6 replicas in both DCs is down, a write cannot complete. If one of the nodes is down for an hour, you cannot do any write for an hour. In some batch-processing setups this may make sense, but it usually not acceptable behavior for interactive workloads, where high-availability is a primary concern.
If you really don't care about high availability and just want to write when all the nodes are up, then I guess WRITE=ALL could work. You can tell when all the nodes are up using "nodetool", for example. Or just retry the writes periodically.

If all replicas will sync up eventually, what's the point of read repairs?

If all replicas will sync up eventually, what's the point of read repairs?
Wouldn't you have cases where if you have a write that's being sent to all replicas, then a read repair happens before the write, wouldn't you essentially be duplicating that same write twice?
Theres a few things, blocking read repairs, async read repairs, and if either are needed.
Blocking read repairs: Quorum reads are monotonically consistent for awhile now. If you read it twice you should get the same answer. People tend to use QUORUM reads as wanting stronger consistency, so the blocking read repairs prevent reads from going back in time. If this behavior ends it would cause potential surprises to existing applications. However the latency impact of these repairs have been causing issues and it may still be changed in very near future. You cannot currently disable this behavior and it will always be on.
Async read repairs: Repairs in background can be disabled or happen only a small percentage of time, or (recommended) only within a DC. This reduces background impact as there isnt as much cross DC traffic. This is controlled by the dc_local and global read repair chance settings. When you do a ONE or LOCAL_ONE etc query it will depending on that chance wait for the rest of the responses and compare results for a read repair.
Statistically your far more likely to be having unnecessary work with async read repairs as on a normal functioning perfect system they are not needed. Hinted Handoff however is not perfect and there are cases where hints are lost. In these situations the consistency will not be met until a anti-entropy repair is run (should be weekly or even daily depending on how repairs run, inc or full etc).
So other than for the sake of monotonic consistency (blocking on QUORUM+ requests), read repairs are not really critical or needed. Its something people have used to statistically put cluster in a more consistent state faster (maybe). Lots of people run without async read repairs (you cannot currently disable the read repair mechanism fwiw), and theres even serious talk of removing options around it completely due to misunderstandings.
One scenario that makes sense to me is this:
You write the data to a node (or a subset of the cluster)
You read the data (with Quorum), and one of the nodes has the fresher data.
because you specified QUROUM, several nodes are being asked for the value before the response is sent to the client. But because the data is fresher on one of the nodes, a blocking read-repair must first happen, to update all of them.
in this case, the read-repair needs to happen because the "eventual update" has yet to happen.
In highly dynamic applications with many nodes, there are times when an eventually consistent write doesn't make it to the node PRIOR to a read request for that piece of data on that node. This is common in environments with heavy load on an undersized cluster, unknown hardware issues and other reasons. Its likely also where write consistency is set to one, while read consistency is set to local_quorum.
Example 1: random & sporadic network drops due to an unknown network switch failing that affects the write to the node but doesn't affect the read.
Example 2: the write occurs during a peak load time period, and as a result doesn't make it to the overloaded node, prior to the read request.

Resources