Our repair jobs failed for a long period (> 14 days).
Today I manually started a repair job with nodetool repair -pr. Afterwards it looks like we lost some data from a table.
Question:
Is it theoretically possible to lose data after a repair job?
If yes what can be done to avoid this?
You should not lose data with repair. If anything, you could gain back records that were deleted (resurrected zombie records).
One scenario where data might appear to be "lost" is when a tombstone that was missing on one node gets copied over from another node during repair. The deletion is the correct value, not a lost value. If your client CL was something small, say ONE, and you were reading from the node that had the data (but was missing the tombstone), you might think that all of a sudden you lost the cell, but again, that's the correct value.
Another scenario where things might appear to be "lost" is if the nodes' clocks ever got out of sync, so that certain cells in your cluster carry incorrect timestamps. That can get things messed up when repair tries to sync the replicas, because the version with the newest timestamp wins.
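To make the two scenarios above concrete, here is a minimal sketch of timestamp-based reconciliation. It is a simplified model and not Cassandra's actual code: the `Cell` class and `reconcile` function are hypothetical names, and a value of `None` stands in for a tombstone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    value: Optional[str]  # None stands in for a tombstone (deletion marker)
    timestamp: int        # write timestamp, e.g. microseconds since epoch


def reconcile(a: Cell, b: Cell) -> Cell:
    """Pick the cell with the newer timestamp (simplified last-write-wins)."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    # On a tie a tombstone wins; otherwise this sketch just returns b.
    return a if a.value is None else b


# Scenario 1: node A never received the delete, node B holds the tombstone.
node_a = Cell(value="hello", timestamp=100)
node_b = Cell(value=None, timestamp=200)
read_at_cl_one_before_repair = node_a        # stale but visible value
after_repair = reconcile(node_a, node_b)     # tombstone wins: cell "disappears"

# Scenario 2: a skewed clock stamped an older write far in the future.
skewed_old = Cell(value="old", timestamp=9_000)
correct_new = Cell(value="new", timestamp=300)
after_skew = reconcile(skewed_old, correct_new)  # the older write wins
```

In the first case the cell vanishing after repair is the correct outcome. In the second case the "wrong" value survives only because its timestamp is larger.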
That's all I can think of off the top of my head.
-Jim
My table is a time series one. The queries are going to process the latest entries and TTL expire them after successful processing. If they are not successfully processed, the TTL will not be set.
The only query I plan to run on this is to select all entries for a given entry_type. They will be processed and records corresponding to processed entries will be expired.
This way every time I run this query I will get all records in the table that are not processed and processing will be done. Is this a reasonable approach?
Would using a ListenableFuture with my own executor add any value to this, considering that the thread doing the select is just processing?
I am concerned about the TTL and tombstones. But if I use a clustering key of timeuuid type, is this OK?
You are right, one important thing getting in your way will be tombstones. By default you will keep them around for 10 days. Depending on your access pattern this might cause significant problems. You can lower this by setting gc_grace_seconds directly on the table, or change it in the cassandra yaml file. Then it will be valid for all newly created tables.
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html
It is very important that you run repair on the whole cluster at least once within this period. So if you lower this setting to, let's say, 2 days, then within two days you have to have one full repair done on the cluster. This is very important because otherwise processed data will reappear. I saw this happening multiple times, and it is never pleasant, especially if you are using Cassandra as a queue, and it seems to me that you might be doing that in your solution. I'll try to give some tips at the end of the answer.
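The rule above boils down to a simple time comparison. The following is only a sketch: `repair_overdue` is a hypothetical helper, and how you obtain the time of the last full repair is left to your own tooling.

```python
from datetime import datetime, timedelta


def repair_overdue(last_full_repair: datetime, now: datetime,
                   gc_grace_seconds: int) -> bool:
    """True if more than gc_grace_seconds passed since the last full repair."""
    return now - last_full_repair > timedelta(seconds=gc_grace_seconds)


now = datetime(2016, 5, 20, 12, 0, 0)
last_repair = now - timedelta(days=3)

default_grace = 10 * 24 * 3600   # 864000 seconds, the 10 day default
lowered_grace = 2 * 24 * 3600    # 172800 seconds, the 2 day example above

ok_with_default = repair_overdue(last_repair, now, default_grace)
late_with_lowered = repair_overdue(last_repair, now, lowered_grace)
```

With the default, a repair that finished three days ago is still within the window. With gc_grace_seconds lowered to two days, the same cluster is already at risk of deleted data coming back.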
I'm slightly worried about you setting the TTL dynamically depending on the result. What would be the point of putting a TTL on the data that was processed successfully and keeping the data that wasn't forever? I guess some sort of audit or something similar. Again, this is a queue pattern; try to avoid it if possible. Also, one thing to keep in mind is that you will almost always insert the data once in the beginning and then once again with the TTL if your processing went o.k.
Also, getting all entries might be a bit tricky. For a very moderate load of 10-100 req/s this might be reasonable, but if you have thousands per second, getting all the entries every time might not be a good idea. At least not if you put them into a single partition.
Separating the workload is also a good idea. So yes, using a ListenableFuture seems totally legit.
Setting the clustering key to be a timeuuid is usually the case with time series data, and I totally agree with you on this one.
In reality, as I mentioned earlier, you have to take into account that you will be keeping 10 days' worth of data (unless you tweak it) no matter what you do; it doesn't matter if you TTL it. It's still going to be there, and every time Cassandra scans the partition it will have to read the TTL-ed columns. In short, this is just pain. I would seriously consider using something like Kafka if I were you, because what you are describing simply looks to me like a queue.
If you still want to stick with Cassandra, then please consider using buckets (adding date info to the partition key and having a composite partition key). Depending on the load you are expecting, you will have to bucket by month, week, day, hour or even minute. In some cases you might even want to add artificial columns to reduce load on the cluster. But then again, this might be out of scope of this question.
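As a rough illustration of the bucketing idea, here is how a composite partition key could be derived on the client side. The function name, the day granularity, the `entry_id` parameter and the shard count are all assumptions made for this sketch; pick the bucket size from your own load calculations.

```python
import zlib
from datetime import datetime, timezone


def partition_key(entry_type: str, ts: datetime, entry_id: str,
                  shards: int = 4) -> tuple:
    """Build (entry_type, day bucket, shard) for a timezone-aware timestamp."""
    # Day bucket in UTC so that every client computes the same bucket.
    bucket = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    # "Artificial column": a stable hash spreads one day over several partitions.
    shard = zlib.crc32(entry_id.encode("utf-8")) % shards
    return (entry_type, bucket, shard)


ts = datetime(2016, 5, 17, 23, 59, tzinfo=timezone.utc)
key = partition_key("order", ts, "entry-123")
```

A reader then has to query every shard of the buckets it cares about, which is the price for avoiding one ever-growing partition.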
Be very careful when using Cassandra as a queue; it's a known antipattern. You can do it, but there are a lot of variables and it depends heavily on the load you have. I once consulted a team that sort of went down the path of Cassandra as a queue. Since using Cassandra was a must there, I recommended bucketing the data by day (I did some calculations that proved this was an o.k. time unit), and I also had a look at this solution: https://github.com/paradoxical-io/cassieq. There is a lot of good stuff in this repo on using Cassandra as a queue, data models etc. This team had zombie rows, slow reads because of the tombstones and so on.
Also, the way you described it, you might end up with "hot rows": since you would have just one wide partition where all your data would go, some nodes in the cluster might not be well utilised. This can be avoided with artificial columns.
When using cassandra as a queue it's very easy to mess a lot of things up. (But it's possible for moderate workloads)
I'm running a Cassandra 3.9 cluster, and today I noticed some NULL values in some generated reports.
I opened up cqlsh and after some queries I noticed that null values are appearing all over the data, apparently in random columns.
Replication factor is 3.
I've started a nodetool repair on the cluster but it hasn't finished yet.
My question is: I searched for this behavior and could not find it anywhere. Apparently the random appearance of NULL values in columns is not a common problem.
Does anyone know what's going on? This kind of data corruption seems pretty serious. Thanks in advance for any ideas.
ADDED Details:
Happens on columns that are frequently updated with toTimestamp(now()) which never returns NULL, so it's not about null data going in.
Happens on immutable columns that are only inserted once and never changed. (But other columns on the table are frequently updated.)
Do updates cause this like deletions do? Seems kinda serious to me, to wake up to a bunch of NULL values.
I also know specifically some of the data that has been lost, three entries I've already identified are for important entries which are missing. These have not been deleted for sure - there is no deletion on one specific table which is full of NULL everywhere.
I am the sole admin and nobody ran any nodetool commands overnight, 100% sure.
UPDATE
nodetool repair has been running for 6+ hours now and it fully recovered the data on one varchar column "item description".
It's a Cassandra issue and no, there were no deletions at all. And like I said, columns written by functions which never return null (toTimestamp(now())) had null in them.
UPDATE 2
So nodetool repair finished overnight but the NULLs were still there in the morning.
So I went node by node stopping and restarting them and voilà, the NULLs are gone and there was no data loss.
This is a major league bug if you ask me. I don't have the resources now to go after it, but if anyone else faces this here's the simple "fix":
Run nodetool repair -dcpar to fix all nodes in the datacenter.
Restart node by node.
I faced a similar issue some months ago. It's explained quite well in the following blog post. (This is not written by me.)
The null values actually have been caused by updates in this case.
http://datanerds.io/post/cassandra-no-row-consistency/
Mmmh... I think that if this was a Cassandra bug it would already be reported. So I smell a bug in your application code, but you didn't post any code, so this will remain only a (wild) guess until you provide some (I'd like to have a look at the update code).
You don't delete data, nor use TTL. It may seem there are no other ways to create NULL values, but there's one more tricky one: failing at binding, that is, explicitly binding to NULL. It may seem strange, but it happens...
Since
...null values are appearing all over the data...
I'd expect to catch this very fast by enabling some debugging or adding assert code on the values before issuing any updates.
Check whether the update query updates only the necessary columns, or whether it goes through Java beans that include the list of all columns in the table. The latter would explain NULL updates on other columns which weren't meant to be updated.
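One way to rule this out is to build the UPDATE from only the columns that actually carry a value, so nothing is ever bound to NULL by accident. This is a sketch and not anyone's production code: `build_update`, the table and the column names are hypothetical, and the `%s` placeholder style is an assumption you should adapt to your driver.

```python
def build_update(table: str, key_columns: list, values: dict) -> tuple:
    """Return (cql, params) that sets only the non-None, non-key columns."""
    assignments = {c: v for c, v in values.items()
                   if c not in key_columns and v is not None}
    if not assignments:
        raise ValueError("nothing to update: every non-key column is None")
    set_clause = ", ".join(f"{c} = %s" for c in assignments)
    where_clause = " AND ".join(f"{c} = %s" for c in key_columns)
    params = list(assignments.values()) + [values[c] for c in key_columns]
    return f"UPDATE {table} SET {set_clause} WHERE {where_clause}", params


# updated_at is None here, so it is left out instead of being bound to NULL.
cql, params = build_update(
    "items", ["id"],
    {"id": 7, "description": "widget", "updated_at": None},
)
```

Binding `None` for `updated_at` instead would overwrite that column with a null on every such update, which looks exactly like values disappearing at random.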
Is repair really needed if all operations execute at quorum?
Repair is generally needed to ensure all nodes are in sync, but quorum already ensures success is only returned when the quorum is in sync.
So if all operations execute at quorum, then do we need repair?
In our use case, we never update records; we simply add and then delete the record. (If we see the message again after a 'delete' failure, that is OK; it is not disastrous.) In fact, a repair could bring the record back to life. That would be undesirable (but not disastrous).
I would think with this situation, unless there was corruption of one of the nodes, we would not need repair.
I would also argue that with this setup, even if the delete succeeded and we saw the record again, it would not be a big deal. As such I think we could in fact set gc_grace=0: if the quorum operation succeeded, then only 2 would be left, which would never give us quorum against those 'offending' nodes, so we would never see those records anyway (unless a node dies).
So if a node dies post delete (assume 5 nodes, 3 for quorum), then we have a 'stalemate' of 2 vs 2 and cannot achieve quorum. However, hint-repair would kick in if one of those records were read again (I'm not clear if this WILL run, or only runs with the configured chance, i.e. 10% by default, if we had a quorum failure?).
Either way, with gc_grace=0 it would likely come back to life after the delete, so maybe having gc_grace=24 hours (to allow read-repair to correct) would reduce the chance of seeing the record again.
Thoughts?
Your basic thought process is sound - if you write with quorum and read with quorum and never overwrite, then yes, you can likely get by without repair.
You MAY need to run repair if you have to rebuild a failed node, as it's possible that the replacement could miss one of the replicas, and you'd be left with one of three, which may be missed upon read. If that happens, having run incremental repairs previously would make subsequent repairs faster, but it's not strictly necessary.
Your final two paragraphs aren't necessarily accurate - your logic in those is flawed (with 5 nodes and 1 dying, there is no 2v2 stalemate for quorum, that's fundamentally misunderstanding how quorum works). Hints are also best effort and only within a limited window, and read repair isn't guaranteed unless you change read repair to non-default settings.
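The quorum arithmetic behind that remark can be sketched as follows, assuming the question's "5 nodes" means a replication factor of 5. The function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Number of replica responses needed for QUORUM."""
    return replication_factor // 2 + 1


def quorum_reachable(replication_factor: int, replicas_alive: int) -> bool:
    """QUORUM only needs enough live replicas to answer; it is not a vote."""
    return replicas_alive >= quorum(replication_factor)


def read_overlaps_write(replication_factor: int, read_n: int, write_n: int) -> bool:
    """True if every read set must share at least one replica with every write set."""
    return read_n + write_n > replication_factor


rf = 5
needed = quorum(rf)                                  # 3
still_ok_after_one_death = quorum_reachable(rf, 4)   # 4 alive, 3 needed
overlap = read_overlaps_write(rf, needed, needed)    # 3 + 3 > 5
```

With one of five replicas down, four remain and three responses are enough, so there is no 2 vs 2 stalemate. Among the responses, the data with the newest timestamp is the one returned.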
I do know that it's a cassandra anti-pattern to delete rows (and more so – doing it frequently), but in my simple use case I have a local cassandra (single instance, replication factor set to 1) that I use for unit tests, which drop all tables before running, naturally to perform the tests with a clean slate.
Over time, the performance of this cassandra instance degraded extremely. It surprised me a bit that dropping the keyspaces altogether didn't help at all. Only by manually deleting everything in the cassandra data directory did I manage to recover all the performance.
This solution is quite fine for me as I don't care for the test data I delete over and over again, but it certainly feels a bit weird to have to delete these things manually on file system. Is there a better way to deal with such situation? Or am I going about this whole case completely wrong?
Based on the little information provided, I will provide some info:
First, deleting data creates tombstones in cassandra. The default behavior is to keep these tombstones for 10 days, set by the variable gc_grace_seconds.
Given you only have 1 node and don't care about the data once you delete it, you could set gc_grace_seconds to zero. You also could make sure to run compaction after you do a lot of deletes.
Documentation here:
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/tabProp.html
http://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsCompact.html
Lastly, there is a feature known as TTL, Time To Live. You could use that instead of deleting and let the database do the "deletes" once the data expires. If you go this route, I would still set gc_grace_seconds to zero and run compactions (via an hourly cronjob since it's a dev environment).
So I did something of a test run/disaster recovery practice deleting a table and restoring in Cassandra via snapshot on a test cluster I have built.
This test cluster has four nodes, and I used the node restart method, so after truncating the table in question, all nodes were shut down, the commitlog directories cleared, and the current snapshot data copied back into the table directory on each node. Afterwards, I brought each node back up. Then, following the documentation, I ran a repair on each node, followed by a refresh on each node.
My question is, why is it necessary for me to run a repair on each node afterwards assuming none of the nodes were down except when I shut them down to perform the restore procedure? (in this test instance it was a small amount of data and took very little time to repair, if this happened in our production environment the repairs would take about 12 hours to perform so this could be a HUGE issue for us in a disaster scenario).
And I assume running the repair would be completely unnecessary on a single node instance, correct?
Just trying to figure out what the purpose of running the repair and subsequent refresh is.
What is repair?
Repair is one of Cassandra's main anti-entropy mechanisms. Essentially it ensures that all your nodes have the latest version of all the data. The reason it takes 12 hours (this is normal, by the way) is that it is an expensive operation (IO and CPU intensive) to generate Merkle trees for all your data, compare them with Merkle trees from other nodes, and stream any missing or outdated data.
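The compare step can be illustrated with a toy hash tree. This is a heavily simplified sketch, not how Cassandra implements it: rows are plain dicts, ranges are hash buckets rather than token ranges, and all function names are made up for the example.

```python
import hashlib


def leaf_hashes(rows: dict, leaves: int) -> list:
    """Hash a replica's rows into a fixed number of leaf ranges."""
    buckets = [hashlib.sha256() for _ in range(leaves)]
    for key in sorted(rows):
        # Assign each key to a range by hashing the key itself.
        idx = int(hashlib.sha256(key.encode()).hexdigest(), 16) % leaves
        buckets[idx].update(f"{key}={rows[key]};".encode())
    return [b.hexdigest() for b in buckets]


def root_hash(hashes: list) -> str:
    """Combine hashes pairwise until a single root hash remains."""
    level = list(hashes)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [hashlib.sha256((level[i] + level[i + 1]).encode()).hexdigest()
                 for i in range(0, len(level), 2)]
    return level[0]


def ranges_to_stream(a: list, b: list) -> list:
    """Return indexes of leaf ranges that differ between two replicas."""
    if root_hash(a) == root_hash(b):
        return []  # identical roots: nothing to stream
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


replica_a = {"k1": "v1", "k2": "v2", "k3": "v3"}
replica_b = {"k1": "v1", "k2": "stale", "k3": "v3"}
mismatched = ranges_to_stream(leaf_hashes(replica_a, 4), leaf_hashes(replica_b, 4))
```

Only the ranges whose hashes differ need to be streamed, but building the hashes still means reading all the data, which is where the IO and CPU cost comes from.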
Why run a repair after restoring from snapshots?
Repair gives you a consistency baseline. For example, if the snapshots weren't taken at the exact same time, you have a chance of reading stale data if you're using CL ONE and hit a replica restored from the older snapshot. Repair ensures all your replicas are up to date with the latest data available.
tl;dr:
repairs would take about 12 hours to perform so this could be a HUGE issue for us in a disaster scenario).
While your repair is running, you'll have some risk of reading stale data if your snapshots don't have the same exact data. If they are old snapshots, gc_grace may have already passed for some tombstones giving you a higher risk of zombie data if tombstones aren't well propagated across your cluster.
Related side rant - When to run a repair?
The colloquial definition of the term repair seems to imply that your system is broken. We think "I have to run a repair? I must have done something wrong to get to this un-repaired state!" This is simply not true. Repair is a normal maintenance operation with Cassandra. In fact, you should be running repair at least once every gc_grace_seconds to ensure data consistency and avoid zombie data (or use the OpsCenter repair service).
In my opinion, we should have called it AntiEntropyMaintenance or CassandraOilChange or something rather than Repair : )