Restoring cassandra from snapshot - cassandra

So I did something of a test run/disaster recovery practice deleting a table and restoring in Cassandra via snapshot on a test cluster I have built.
This test cluster has four nodes, and I used the node restart method, so after truncating the table in question, all nodes were shut down, commitlog directories cleared, and the current snapshot data copied back into the table directory for each node. Afterwards, I brought each node back up. Then following the documentation I ran a repair on each node, followed by a refresh on each node.
My question is, why is it necessary for me to run a repair on each node afterwards assuming none of the nodes were down except when I shut them down to perform the restore procedure? (in this test instance it was a small amount of data and took very little time to repair, if this happened in our production environment the repairs would take about 12 hours to perform so this could be a HUGE issue for us in a disaster scenario).
And I assume running the repair would be completely unnecessary on a single node instance, correct?
Just trying to figure out what the purpose of running the repair and subsequent refresh is.

What is repair?
Repair is one of Cassandra's main anti-entropy mechanisms. Essentially it ensures that all your nodes have the latest version of all the data. The reason it takes 12 hours (this is normal by the way) is that it is an expensive operation -- io and CPU intensive -- to generate Merkle trees for all your data, compare them with Merkle trees from other nodes, and stream any missing / outdated data.
Why run a repair after restoring from snapshots
Repair gives you a consistency baseline. For example: if the snapshots weren't taken at the exact same time, you have a chance of reading stale data if you're using CL ONE and hit a replica restored from the older snapshot. Repair ensures all your replicas are up to date with the latest data available.
tl;dr:
repairs would take about 12 hours to perform so this could be a HUGE
issue for us in a disaster scenario).
While your repair is running, you'll have some risk of reading stale data if your snapshots don't have the same exact data. If they are old snapshots, gc_grace may have already passed for some tombstones giving you a higher risk of zombie data if tombstones aren't well propagated across your cluster.
Related side rant - When to run a repair?
The colloquial definition of the term repair seems to imply that your system is broken. We think "I have to run a repair? I must have done something wrong to get to this un-repaired state!" This is simply not true. Repair is a normal maintenance operation with Cassandra. In fact, you should be running repair at least every gc_grace seconds to ensure data consistency and avoid zombie data (or use the opscenter repair service).
In my opinion, we should have called it AntiEntropyMaintenance or CassandraOilChange or something rather than Repair : )

Related

Cassandra repairs on TWCS

We have a 13 nodes Cassandra cluster (version 3.10) with RP 2 and read/write consistency of 1.
This means that the cluster isn't fully consistent, but eventually consistent. We chose this setup to speed up the performance, and we can tolerate a few seconds of inconsistency.
The tables are set with TWCS with read-repair disabled, and we don't run full repairs on them.
However, we've discovered that some entries of the data are replicated only once, and not twice, which means that when the not-updated node is queried it fails to retrieve the data.
My first question is how could this happen? Shouldn't Cassandra replicate all the data?
Now if we choose to perform repairs, it will create overlapping tombstones, therefore they won't be deleted when their time is up. I'm aware of the unchecked_tombstone_compaction property to ignore the overlap, but I feel like it's a bad approach. Any ideas?
So you've obviously made some deliberate choices regarding your client CL. You've opted to potentially sacrifice consistency for speed. You have achieved your goals, but you assumed that data would always make it to all of the other nodes in the cluster that it belongs. There are no guarantees of that, as you have found out. How could that happen? There are multiple reasons I'm sure, some of which include: networking/issues, hardware overload (I/O, CPU, etc. - which can cause dropped mutations), cassandra/dse being unavailable for whatever reasons, etc.
If none of your nodes have been "off-line" for at least a few hours (whether it be dse or the host being unavailable), I'm guessing your nodes are dropping mutations, and I would check two things:
1) nodetool tpstats
2) Look through your cassandra logs
For DSE: cat /var/log/cassandra/system.log | grep -i mutation | grep -i drop (and debug.log as well)
I'm guessing you're probably dropping mutations, and the cassandra logs and tpstats will record this (tpstats will only show you since last cassandra/dse restart). If you are dropping mutations, you'll have to try to understand why - typically some sort of load pressure causing it.
I have scheduled 1-second vmstat output that spools to a log continuously with log rotation so I can go back and check a few things out if our nodes start "mis-behaving". It could help.
That's where I would start. Either way, your decision to use read/write CL=1 has put you in this spot. You may want to reconsider that approach.
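The two checks suggested above (nodetool tpstats and the Cassandra logs) can be scripted. The following is an added example, not part of the original answers: it pulls the dropped-message counts out of tpstats output and finds when the drops were reported in system.log. The sample input is constructed, and the log wording it matches is an assumption modelled on Cassandra 3.x messages.

```python
import re

# Constructed sample of `nodetool tpstats` output (trimmed); all counts are invented.
TPSTATS = """\
Pool Name              Active   Pending   Completed   Blocked  All time blocked
MutationStage               0         0     8123456         0                 0

Message type           Dropped
READ                         0
HINT                         3
MUTATION                   412
READ_REPAIR                  0
"""

# Constructed system.log lines; the message wording is an assumption (3.x style).
SYSTEM_LOG = [
    "INFO  [ScheduledTasks:1] 2019-03-11 10:15:05,123 MessagingService.java:1236 - "
    "MUTATION messages were dropped in last 5000 ms: 37 internal and 5 cross node. "
    "Mean internal dropped latency: 2731 ms and Mean cross-node dropped latency: 3110 ms",
    "INFO  [CompactionExecutor:7] 2019-03-11 10:15:09,001 CompactionTask.java:255 - "
    "Compacted 4 sstables",
    "INFO  [ScheduledTasks:1] 2019-03-11 10:15:10,124 MessagingService.java:1236 - "
    "READ messages were dropped in last 5000 ms: 2 internal and 0 cross node.",
]

# Accepts "37 internal and 5 cross node" and "37 for internal timeout and 5 for cross node timeout".
DROP_RE = re.compile(
    r"(?P<ts>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d),\d+ .*? - (?P<type>\w+) messages were "
    r"dropped in last \d+ ms: (?P<internal>\d+) (?:for )?internal\D*?"
    r"(?P<cross>\d+) (?:for )?cross"
)

def dropped_from_tpstats(text):
    """Return {message_type: count} for non-zero rows of the 'Dropped' table."""
    dropped, in_table = {}, False
    for line in text.splitlines():
        parts = line.split()
        if line.startswith("Message type"):
            in_table = True
        elif in_table and len(parts) >= 2 and parts[1].isdigit() and int(parts[1]) > 0:
            dropped[parts[0]] = int(parts[1])
    return dropped

def dropped_from_log(lines, msg_type="MUTATION"):
    """Return (timestamp, internal, cross_node) per drop report for msg_type."""
    return [(m["ts"], int(m["internal"]), int(m["cross"]))
            for m in map(DROP_RE.search, lines) if m and m["type"] == msg_type]

if __name__ == "__main__":
    print(dropped_from_tpstats(TPSTATS))   # per-type totals since the last restart
    print(dropped_from_log(SYSTEM_LOG))    # when the drops happened
```

The tpstats totals reset on restart, so the log timestamps are what let you line the drops up against load data such as the vmstat spool mentioned above.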
Consistency level=1 can create a problem sometimes due to many reasons, like if data is not replicating to the cluster properly due to mutations or cluster/node overload or high CPU or high I/O or network problem, so in this case you can suffer data inconsistency. However, read repair handles this problem sometimes if it is enabled. You can go with manual repair to ensure consistency of the cluster, but you can get some zombie data too for your case.
I think, to avoid this kind of issue you should consider CL at least Quorum for write or you should run manual repair within GC_grace_period (default is 10 days) for all the tables in the cluster.
Also, you can use incremental repair so that Cassandra runs repair in the background for chunks of data. For more details you can refer to the links below:
http://cassandra.apache.org/doc/latest/operating/repair.html or https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/tools/toolsRepair.html

Why do tables get out of sync over time when Write Consistency ALL is used?

I am running a cassandra 3.11.4 cluster with 1 data center, 2 racks and 11 nodes. My keyspaces and the tables are set to replication 2. I use the Prometheus-Grafana-Combo to monitor the cluster.
Observation: During (massive) inserts using Write-Consistency Level ALL (i.e. 2 nodes) the affected tables/nodes get slowly out of sync (worst case on one node: from 100% to 83% within 6 hours). My expectation is that this could only happen if I use ANY (or anything less than my replication factor).
I would really like to understand this behaviour.
What is also interesting: If I dare to use write consistency ANY I get exactly that, and even though all nodes are online Cassandra does not even seem to attempt to write to all nodes. In any case (ANY or ALL) I have to perform incremental repairs.
First of all, your expectation is correct: Writes, regardless of what the consistency-level is (ALL or ONE or ANY or whatever), do make every attempt to write to all replicas. The different write-consistency levels only differ on when "success" is reported to the client: ALL waits until all writes were done, while ONE waits for just one (and does the other ones in the background). So unless one of your nodes goes down, or severely overloaded, none of the writes should be missing on any of the nodes, and there should be zero inconsistencies. The "hinted handoff" feature makes inconsistencies even less likely (if one node is temporarily down, other nodes save for it the writes it missed, and replay them later).
I think your only problem is that you're misinterpreting what the "percentrepaired" statistic means. The "percentrepaired" metric is used by incremental repair. In incremental repair, data on disk is split between "repaired" data (data that already went through a repair process) and "unrepaired" data - new data that still did not yet pass through repair. This does not mean that the new data is inconsistent or differs between nodes - it's just that nobody checked that yet! To mark this new data "repaired" you'd need to run an (incremental) repair - it will realize the data does not differ between nodes, and mark it as "repaired".

why lost some data after nodetool cleanup in cassandra

We added a new node to datacenter and then ran nodetool cleanup according to Add new node to existing cluster in cassandra. But after cleanup completed, we noticed that we lost some data.
What could be the reason?
Yes, it's important to understand that nodetool cleanup is a potentially destructive tool. Your cluster needs to be in a fully-repaired state (from regular, successful runs of nodetool repair prior).
When you add a new node to the cluster, the token ranges that each node is responsible for are adjusted, and lowered per node. This leaves data on the original nodes that they are no longer responsible for. And that is by design.
The idea was that if for whatever reason the node add process failed and you had to leave your cluster at its original size, then the data is still there. But if you can't guarantee that your cluster was in a fully-repaired state in the first place and cleanup was run, it's possible that not all replicas would have made it to their proper nodes. But like nodetool getendpoints the bootstrap process would have assumed that it was.
That's why it's important to ensure that you have been regularly running nodetool repair on your cluster before running nodetool cleanup.
nodetool cleanup frees partition keys no longer belonging to a node, so after adding a node and transferring its portion of data, this "portion" no longer belongs to the old node, so running cleanup will free some space on this node.
If you see that the old node now has lower storage, it is ok, there wasn't any data loss.
On the other hand, if you really can't find some data, it can be due to data corruption or deleted data (with tombstones). What do you mean by data loss anyway?

What to do if node repair wasn't run within GCGraceSeconds?

I don't believe any of my nodes have been down for an extended period of time, so I believe all of my deletes should have been replicated throughout all of them. However, I keep seeing recommendations as normal maintenance to run node repair within GCGraceSeconds. I don't believe node repair has ever been run on my cluster (I inherited it a few months ago). Do I have anything to worry about? Will I have zombie data if I run node repair even if I haven't had any nodes down for an extended time?
My main question is - what can I do to get out of this state so I can start routinely running nodetool repair?
Cassandra has no 'normal' deletes as relational databases have. When you delete something Cassandra just adds a record which marks data as deleted, named a 'tombstone'. Even if all of your tombstones are properly replicated they still live in your files, and can affect performance and even make some deleted records be 'alive' again.
In general, you need to run 'nodetool repair' on every node of your cluster regularly.
You can check details in the documentation.
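How a deleted record can come back to life, as an added example that is not part of the original question or answer: a simplified last-write-wins model of three replicas in which one replica misses a delete. It is not Cassandra's code; the cell layout, function names and the key "user:42" are assumptions made for illustration.

```python
GC_GRACE = 864000  # seconds; Cassandra's default gc_grace_seconds (10 days)

def write(replica, key, value, ts):
    replica[key] = {"value": value, "ts": ts, "tombstone": False}

def delete(replica, key, ts):
    # A delete is itself a write: a tombstone that shadows the older value.
    replica[key] = {"value": None, "ts": ts, "tombstone": True}

def compact(replica, now, gc_grace=GC_GRACE):
    """Purge tombstones older than gc_grace, as compaction is allowed to."""
    expired = [k for k, c in replica.items()
               if c["tombstone"] and now - c["ts"] > gc_grace]
    for key in expired:
        del replica[key]

def repair(replicas):
    """Anti-entropy: the newest cell per key is copied to every replica."""
    for key in set().union(*replicas):
        newest = max((r[key] for r in replicas if key in r), key=lambda c: c["ts"])
        for r in replicas:
            r[key] = dict(newest)

def read(replica, key):
    cell = replica.get(key)
    return None if cell is None or cell["tombstone"] else cell["value"]

def scenario(repaired_in_time):
    """Replica C misses a delete; return what A, B, C serve after gc_grace has passed."""
    a, b, c = {}, {}, {}
    for r in (a, b, c):
        write(r, "user:42", "alice", ts=0)
    for r in (a, b):                 # C dropped the delete mutation
        delete(r, "user:42", ts=100)
    if repaired_in_time:
        repair([a, b, c])            # tombstone reaches C inside the window
    for r in (a, b, c):
        compact(r, now=100 + GC_GRACE + 1)
    repair([a, b, c])                # the first repair after gc_grace
    return [read(r, "user:42") for r in (a, b, c)]

if __name__ == "__main__":
    print(scenario(True))    # delete holds on every replica
    print(scenario(False))   # C's stale value is streamed back: zombie data
```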

Clarifications about nodetool repair -pr

From the documentation:
Using the nodetool repair -pr (--partitioner-range) option repairs only the primary range for that node, the other replicas for that range still have to perform the Merkle tree calculation, causing a validation compaction. Because all the replicas are compacting at the same time, all the nodes may be slow to respond for that portion of the data.
There is probably never a time where I can accept all nodes to be slow for a certain portion of the data. But I wonder: Why does it do that (or is there maybe just a mixup with the "-par" option in the documentation?!), when nodetool repair seems to be smarter:
By default, the repair command takes a snapshot of each replica immediately and then sequentially repairs each replica from the snapshots. For example, if you have RF=3 and A, B and C represents three replicas, this command takes a snapshot of each replica immediately and then sequentially repairs each replica from the snapshots (A<->B, A<->C, B<->C) instead of repairing A, B, and C all at once. This allows the dynamic snitch to maintain performance for your application via the other replicas, because at least one replica in the snapshot is not undergoing repair.
However, the datastax blog addresses this issue:
This first phase can be intensive on disk io, however. You can mitigate this to some degree with compaction throttling (since this phase is what we call a validation compaction.) Sometimes that isn’t enough though, and some people try to mitigate this further by using the -pr (--partitioner-range) option to nodetool repair, which repairs only the primary range for that node. Unfortunately, the other replicas for that range will still have to perform the Merkle tree calculation, causing a validation compaction. This can be a problem, since all the replicas will be doing it at the same time, possibly making them all slow to respond for that portion of your data. Fortunately, there is way around this by using the -snapshot option.
That could be nice, but actually, there is no -snapshot option for nodetool repair (see the manpage, or the documentation) (has this option been removed?!)
So overall,
I cannot use nodetool repair -pr, it seems, because I always need at least to keep the system responsive enough to read/write with consistency ONE, without significant delay. (Note: We have only one data center.) Or am I missing/misunderstanding something?
Why is nodetool repair smart, keeping one node responsive, while nodetool repair -pr makes all nodes slow for a portion of data?
Where is the -snapshot option: Has it been removed, never implemented, or does it now maybe automatically work like that, also when using nodetool repair -pr?
The blog below addresses these issues:
http://www.datastax.com/dev/blog/repair-in-cassandra
A simple nodetool repair will not only kick off a repair on the node itself but also all the nodes that hold replicas of its ranges. While this is ok, it is very expensive and typically not an operation you'll carry out on a busy production system during peak times.
Consequently nodetool repair -pr will carry out a repair of the primary ranges on that node. You will need to run this on every node of the cluster as the blog says. Customers with large production systems will typically use this in a rolling fashion across their cluster.
On another note, Datastax OpsCenter offers the repair service which runs smaller sub-range repairs all the time, so although you're always repairing, it's going on in the background all the time at a lower resource level.
As for the snapshots, running a regular repair will invoke a snapshot as you stated; you can also invoke a snapshot yourself using nodetool snapshot.
Hope this helps!
