Like eventual consistency itself, scheduled repair seems eventually useful for keeping nodes from drifting too far away from one another.
I'm trying to understand why and when "scheduled repair" becomes mandatory. We are relatively new to operating Cassandra and are adopting it progressively. Even though no scheduled repairs are configured, a few of our services have been working quite well for months.
Hence, I have a few questions about repair:
1. What statistical evidence can a developer reliably look at to understand the immediate or eventual benefit of the repair process?
2. Is there any indicator (from logs or metrics) that warns ahead of time about the need for repair?
3. If we build a read-heavy (very rarely written) reference-data system, do we still need to repair regularly?
4. If materialized views were mistakenly used in the application, should we abstain from repair until we rewrite the application without materialized views?
The answer is simple -- repair is part of the normal operation of Cassandra.
There are no metrics/statistics/indicators that determine when to run repairs. You just have to run repairs once every gc_grace_seconds. It's as simple as that.
By default, gc_grace_seconds is 10 days, so for simplicity you should run repairs at least once a week if you're not using an automated tool like Reaper -- the free, open-source tool for automated Cassandra repairs. Cheers!
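As a rough illustration (the keyspace name, log path, and schedule below are placeholders, not anything from your setup), a weekly primary-range repair can be driven from cron on each node:
# Illustrative crontab entry: primary-range repair of keyspace "my_keyspace"
# every Sunday at 02:00 on this node.
0 2 * * 0 nodetool repair -pr my_keyspace >> /var/log/cassandra/repair.log 2>&1
Stagger the schedule across nodes so they aren't all repairing at once; Reaper does this kind of orchestration for you.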
I have a 10-node Cassandra cluster, currently running version 3.0.13.
How I launched it: nodetool repair -j 4 -pr
I would like to know if there are any configuration options to speed up this process; I still see "Anticompaction after repair" in progress when I check compactionstats.
The current state-of-the-art way of doing repairs is subrange repairs running all the time. See http://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html for some explanation:
While the idea behind incremental repair is brilliant, the implementation still has flaws that can cause severe damage to a production cluster, especially when using LCS and DTCS. The improvements and fixes planned for 4.0 will need to be thoroughly tested to prove they fixed incremental repair and allow it to be safely used as a daily routine.
That being said (or quoted), have a look at http://cassandra-reaper.io/ - a simple and easy tool for managing your repairs.
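For reference, a single subrange repair invocation looks roughly like this (the token boundaries and keyspace name are made-up placeholders; in practice a tool such as Reaper computes the ranges from the ring and schedules them continuously):
# Repair only one slice of the token ring for keyspace "my_keyspace".
nodetool repair -st -9223372036854775808 -et -9200000000000000000 my_keyspace
Because each invocation covers only a small slice of the ring, each run stays short and the repair load is spread evenly over time.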
Recently I've been watching the data locality on my Accumulo cluster and I've noticed that it seems to be deteriorating over time. My instinct tells me that it's due to the master redistributing the tablets to help balance out the cluster, specifically after I've completed a rolling restart.
I'm thinking of setting up manual major compactions to run overnight against all of my tables to keep this data locality as close to 100% as possible. Is this something any of you have done before or is there a better way to handle this?
As long as you continue to write more data into Accumulo, you'll have a "not-quite-100%" locality measurement. As you write more data, you'll cause tablets to split: one tablet becomes two. Typically, after a split, one of the children will be moved to another server because it invalidates the distribution of tablets which Accumulo is trying to maintain. Until the child tablet of a split itself gets automatically major compacted, you won't have any locality. This is actually an area where Accumulo could make more intelligent decisions about balancing tablets, favoring HDFS locality instead of just the distribution of tablets across tabletservers (but that would be a major effort to undertake).
For your case, it's certainly not absurd to consider running a major compaction from cron overnight (or whenever your "off-peak" time is). We could probably even do something smart and create a tool that judges the locality of every tablet in a table and only compacts the tablets that fall below some locality threshold (e.g. <90% local), which would help avoid re-compacting data that is already local.
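As a rough sketch (the table name, credentials, and schedule are placeholders), an overnight major compaction can be kicked off from cron with the Accumulo shell:
# Illustrative cron entry: every night at 01:00, major-compact "my_table"
# and wait for the compaction to finish (-w) before the shell exits.
0 1 * * * accumulo shell -u root -p secret -e "compact -t my_table -w"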
If you're interested, please feel free to subscribe and send a message to user@accumulo.apache.org; I would be happy to help out in more detail there.
So I did something of a test run / disaster-recovery practice: deleting a table and restoring it from a snapshot in Cassandra, on a test cluster I have built.
This test cluster has four nodes, and I used the node-restart method: after truncating the table in question, all nodes were shut down, the commitlog directories were cleared, and the current snapshot data was copied back into the table directory on each node. Afterwards, I brought each node back up. Then, following the documentation, I ran a repair on each node, followed by a refresh on each node.
My question is: why is it necessary to run a repair on each node afterwards, given that none of the nodes were down except when I shut them down to perform the restore procedure? (In this test instance there was a small amount of data and the repair took very little time; if this happened in our production environment the repairs would take about 12 hours to perform, so this could be a HUGE issue for us in a disaster scenario.)
And I assume running the repair would be completely unnecessary on a single node instance, correct?
Just trying to figure out what the purpose of running the repair and subsequent refresh is.
What is repair?
Repair is one of Cassandra's main anti-entropy mechanisms. Essentially it ensures that all your nodes have the latest version of all the data. The reason it takes 12 hours (this is normal, by the way) is that it is an expensive operation -- I/O- and CPU-intensive -- to generate Merkle trees for all your data, compare them with the Merkle trees from other nodes, and stream any missing or outdated data.
Why run a repair after restoring from snapshots?
Repair gives you a consistency baseline. For example: if the snapshots weren't taken at the exact same time, you have a chance of reading stale data if you're using CL ONE and hit a replica restored from the older snapshot. Repair ensures all your replicas are up to date with the latest data available.
tl;dr:
repairs would take about 12 hours to perform so this could be a HUGE issue for us in a disaster scenario
While your repair is running, you'll have some risk of reading stale data if your snapshots don't contain the exact same data. If they are old snapshots, gc_grace may have already passed for some tombstones, giving you a higher risk of zombie data if tombstones aren't well propagated across your cluster.
Related side rant - When to run a repair?
The colloquial definition of the term repair seems to imply that your system is broken. We think, "I have to run a repair? I must have done something wrong to get into this un-repaired state!" This is simply not true. Repair is a normal maintenance operation in Cassandra. In fact, you should be running repair at least every gc_grace seconds to ensure data consistency and avoid zombie data (or use the OpsCenter repair service).
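If you want to see what gc_grace_seconds is actually set to for a table (the keyspace and table names here are placeholders, and this assumes a cqlsh new enough to support -e), the table's schema shows it:
cqlsh -e "DESCRIBE TABLE my_keyspace.my_table;" | grep gc_grace_seconds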
In my opinion, we should have called it AntiEntropyMaintenance or CassandraOilChange or something rather than Repair : )
I have a single-node Cassandra installation on my development machine (and very little experience with Cassandra). I always had very little data in the node and I experienced no problems. Today I inserted about 9,000 elements into a table to experiment with a real-world use case. Now, when I start up the node, the boot time is extremely long. I get this in system.log:
Replaying /var/lib/cassandra/commitlog/CommitLog-3-1388134836280.log
...
Log replay complete, 9274 replayed mutations
That took 13 minutes and is hardly bearable. I wonder if there is a way to store data such that it can be read at once without replaying the log. After all, 9,000 elements are nothing and there must be a quicker way to boot. I googled for hints and searched Cassandra's documentation, but I didn't find anything. It's obvious that I'm not looking for the right things; would anybody be so kind as to point me to the right documents? Thanks.
There are a few things that might help. The most obvious thing you can do is flush the commit log before you shut down Cassandra. This is a good idea in production too. Before I stop a Cassandra node in production, I'll run the following commands:
nodetool disablethrift
nodetool disablegossip
nodetool drain
The first two commands gracefully shut down connections to clients connected to this node and then to other nodes in the ring. The drain command flushes memtables to disk (sstables). This should minimize what needs to be replayed on startup.
There are other factors that can make startup take a long time. Cassandra opens all the SSTables on disk at startup. So the more column families and SSTables you have on disk the longer it will take before a node is able to start serving clients. There was some work done in the 1.2 release to speed this up (so if you are not on 1.2 yet you should consider upgrading). Reducing the number of SSTables would probably improve your start time.
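If you want to check whether SSTable count is part of the problem, nodetool reports it per column family (the exact output labels vary a bit between versions):
nodetool cfstats | grep -E "Column Family:|SSTable count:"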
Since you mentioned this was a development machine I'll also give you my dev environment observations. On my development machine I do a lot of creating and dropping column families and key spaces. This can cause some of the system CFs to grow significantly and eventually cause a noticeable slowdown. The easiest way to handle this is to have a script that can quickly bootstrap a new database and blow away all the old data in /var/lib/cassandra.
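A minimal sketch of such a reset script, assuming a typical package install with the default data directories and an init service named cassandra (adjust paths and service name to your setup):
#!/bin/sh
# Wipe the local dev node and start fresh. This DESTROYS all local data.
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/data/* \
            /var/lib/cassandra/commitlog/* \
            /var/lib/cassandra/saved_caches/*
sudo service cassandra start
# Once the node is back up, recreate your schema, e.g. from a CQL script:
# cqlsh -f schema.cql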
I'm reading the Beginning CouchDB book by Apress and there is a line that confuses me a bit:
Also important to note is that CouchDB will never overwrite existing documents, but rather it will append a new document to the database, with the latest revision gaining prominence and the other being stored for archival purposes.
Doesn't this mean that after a couple of updates, you would have a huge database? Thank you!
The short answer is "not really, no".
In reality it depends on the average size of your documents and how many of them there are. That determines when you should run a compaction job on your database, which is the job that removes all of the previous revisions from the database. Read more about compaction at http://wiki.apache.org/couchdb/Compaction
Another sysadmin point: try to schedule your compaction jobs for when the database isn't under load. You care most specifically about write load, because if writes are coming in too quickly while compaction is running, then your compaction job could (in theory) run forever and take the database with it. However, I've also seen some not-so-nice behavior when running compaction under a heavy read load. So, if you can get away with compacting only once a day, do it at 3am with the rest of your system/database maintenance cron jobs.
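As a sketch (the database name, host, and credentials are placeholders), compaction is triggered with a single HTTP POST, which you can drop into a cron job for your off-peak hour:
# Trigger compaction of the "mydb" database.
curl -X POST http://admin:password@localhost:5984/mydb/_compact -H "Content-Type: application/json"
# Illustrative crontab entry: run it every night at 3am.
0 3 * * * curl -s -X POST http://admin:password@localhost:5984/mydb/_compact -H "Content-Type: application/json"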
Oh, and possibly most importantly: if you're just starting to learn CouchDB, then it's probably premature to start worrying about when to run your compaction jobs relative to your system's load. Premature optimization and all that - focus on other aspects for now.
Cheers.