Compactions in cassandra

I was trying to make a configuration change in the cassandra-env.sh file.
For the change to take effect I have to bounce my Cassandra nodes, but the nodes I want to bounce are currently running compactions.
So what will happen to these pending tasks if I bounce my nodes while the compactions are in progress?

As said before, compactions will stop when you bounce the nodes, but they will pick up again once you start the nodes back up. Nothing to worry about there. If you have really long compactions ongoing, you might want to wait for those to finish.
nodetool compactionstats -H is your friend for checking the current status and expected ETA of the running compactions.
If you want your nodes to start up faster, run nodetool flush, then nodetool drain, and then stop the node. (This way you clear the commitlog.)
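A rough sketch of that sequence, assuming a package install managed by systemd (the service name and init system may differ on your setup):
nodetool compactionstats -H        # check status / ETA of running compactions
nodetool flush                     # flush memtables to SSTables
nodetool drain                     # stop accepting writes and flush the commitlog
sudo systemctl stop cassandra      # service name is an assumption for this sketch
# ...apply the cassandra-env.sh change...
sudo systemctl start cassandra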

It's very simple: your pending compactions will fail on the nodes you bounce. In a development or test environment you may do whatever you want, but in a production environment the preferred approach is to let all pending compaction tasks complete and then go for the changes.
If you are in a hurry, then go for nodetool stop; it'll stop your compaction process, and then you can go for the changes.
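If you do need to stop them, a minimal sketch (COMPACTION is the standard operation type argument, but check nodetool help stop for your version):
nodetool stop COMPACTION    # cancel the compactions currently running on this node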

Related

Why shouldn't you run nodetool removenode?

I am wondering why it is a best practice to not run nodetool removenode. What is it used for? Is there a hierarchy of commands to run instead? What kind of issues arise when running said command? Any first hand experience/nightmare stories of using removenode? Overarching why not?
A default preference order would be:
Replacement node option (if replacement is planned)
Decommission
RemoveNode
Assassinate
However - there are situations where you will still choose a lower entry over an earlier one.
If the node being removed is operational, then you would normally run a decommission and allow the node to stream the data from itself to other nodes which will now be holding one of the replicas that was previously on the node being removed.
Removing a node will cause the token ranges to be recalculated and moved, potentially requiring all nodes to start streaming data to other nodes that now own those ranges.
If the node is not operational, you can perform a nodetool removenode - this will trigger the same range movement and cause a large amount of streaming. There are streaming throughput throttles that are in place by default and can be adjusted to limit this impact.
You can also forcibly terminate either a decommission or a removenode by using nodetool [decommission | removenode] force - however, this means that one of the replicas of the data has not been re-established to another node, leaving you with less resilience.
Why would you do that? For the same streaming reason: if you accept the loss of resilience for a period of time, you can roll out the repair node by node in a controlled manner. This option should not be considered your 'default approach' or a choice taken lightly - I cannot stress that enough.
The final option, when decommission / removenode is not available, is to assassinate the node - this is pretty much the same as performing a removenode followed by an immediate force. You then have to manage the repairs and clean-up in the same manner.
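As a quick command sketch of the options above (the Host ID comes from nodetool status; the throughput figure is only an example):
nodetool decommission                # on the node being removed, while it is still operational
nodetool status                      # on a live node: find the Host ID of the down node
nodetool removenode <host-id>        # on a live node, when the target node is down
nodetool removenode force            # only if the removal is stuck; leaves a replica un-restored
nodetool assassinate <ip-address>    # last resort when decommission/removenode are not possible
nodetool setstreamthroughput 200     # adjust the streaming throttle (Mb/s, example value)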
Outside of these 3 options, the best approach, if your intention is to replace the node, is to perform a replacement instead of a remove / add - this only requires that the new node has data streamed to it from the other replicas, and there is no further token ring range movement. Instructions here.
If the data disks are available, it is also possible to bring the replacement up without streaming the data; instructions here.
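A hedged sketch of the replacement route (the replace-address JVM flag is the standard mechanism, but follow the linked instructions for your exact version):
# on the brand-new replacement node, in cassandra-env.sh (or jvm.options):
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=<dead_node_ip>"
# then start the node; it streams the dead node's replicas and takes over its tokens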
The Datastax Documentation goes over the use case for nodetool removenode in depth.
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/tools/nodetool/toolsRemoveNode.html
The gist of why it can be bad is this:
Warning: This command triggers cluster streaming. In large environments, the additional streaming activity causes more pending gossip tasks in the output of nodetool tpstats. Nodes can start to appear offline and may need to be restarted to clear up the backlog of pending gossip tasks.
According to the document, this is when it should be used:
When the node is down and nodetool decommission cannot be used, use nodetool removenode. Run this command only on nodes that are down.
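If you do run it, you can keep an eye on the removal and on the gossip backlog that the warning mentions; a minimal sketch:
nodetool removenode status           # progress of the ongoing removal
nodetool tpstats | grep -i gossip    # watch GossipStage pending tasks during the removal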

how to know if cassandra autocompaction is enabled or not

Autocompactions can be enabled or disabled using nodetool enableautocompaction and disableautocompaction. But is there any way to know the status? I do not see any nodetool command which will show the status.
There is currently no mechanism to tell, short of taking a heap dump. The best option is to just run nodetool enableautocompaction if you want it on regardless, to be safe, or to set up alerting on compaction pending tasks.
I think you are searching for one of the commands below:
1. compactionhistory - Provides the history of compaction operations.
2. compactionstats - Provides statistics about a compaction. The total column shows the total number of uncompressed bytes of SSTables being compacted. The system log lists the names of the SSTables compacted.
As Chris suggested, nodetool compactionstats will probably help: if autocompaction is enabled you will see some running tasks and the pending tasks may be 0 or any number, but if autocompaction is disabled you will see many pending tasks and no running tasks in nodetool compactionstats.
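A rough heuristic based on that observation (it only infers the state from compactionstats output; it is not an authoritative status check):
nodetool compactionstats
# if "pending tasks" keeps growing while no compactions are ever listed as running,
# autocompaction has most likely been disabled on this node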

Can I avoid running repair while compaction is going on in Cassandra cluster?

I have scheduled incremental repair for everyday. But while the repair is going on, our monitoring system reports COMPACTIONEXECUTOR_PENDING tasks.
I am wondering if I can introduce a check to see whether compaction is running before I trigger repair.
I should be able to check whether compaction is running by parsing the output of the nodetool netstats and compactionstats commands.
I will proceed with repair only if both of the following checks pass:
nodetool netstats output contains Not sending any streams.
nodetool compactionstats output contains pending tasks: 0
But I want to get some expert opinion before I proceed.
Is my understanding correct?
I don't want to get into situation, in which, these checks are failing always and repair process is not getting triggered at all.
Thanks.
Compaction occurs regularly in Cassandra, so I'm a bit concerned that only triggering repair when pending_compactions=0 will result in repair not running often enough. But it depends on your traffic, of course; e.g. if you have few writes you won't do many compactions. You should probably add a maximum wait time for pending_compactions=0, so that if the condition is still not true after a specified time, repair will run anyway.
To answer your question: nodetool uses JMX to fetch MBeans from Cassandra. You can see all available MBeans here: http://cassandra.apache.org/doc/latest/operating/metrics.html
You want this MBean:
org.apache.cassandra.metrics:type=Compaction,name=PendingTasks
You can create your own JMX Client like this: How to connect to a java program on localhost jvm using JMX?
Or you can use jmxterm: https://github.com/jiaqi/jmxterm
My understanding is you could use it like this:
java -jar jmxterm-1.0.0-uber.jar
open localhost:7199
get -b org.apache.cassandra.metrics:type=Compaction,name=PendingTasks Value
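Putting the two checks from the question together with the maximum-wait suggestion above, a minimal shell sketch could look like this (the wait limit, sleep interval and repair options are placeholders, and the parsing assumes the usual nodetool output format):
#!/bin/sh
# Wait for streaming and compactions to settle, but never longer than MAX_WAIT,
# then run the scheduled repair regardless.
MAX_WAIT=3600
WAITED=0
while [ "$WAITED" -lt "$MAX_WAIT" ]; do
    STREAMS_IDLE=$(nodetool netstats | grep -c "Not sending any streams")
    PENDING=$(nodetool compactionstats | awk '/pending tasks/ {print $3; exit}')
    if [ "$STREAMS_IDLE" -ge 1 ] && [ "$PENDING" = "0" ]; then
        break
    fi
    sleep 60
    WAITED=$((WAITED + 60))
done
nodetool repair    # add your usual repair options here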

why lost some data after nodetool cleanup in cassandra

We added a new node to the datacenter and then ran nodetool cleanup, following Add new node to existing cluster in cassandra. But after the cleanup completed, we noticed that we had lost some data.
What could be the reason?
Yes, it's important to understand that nodetool cleanup is a potentially destructive tool. Your cluster needs to be in a fully-repaired state beforehand (from regular, successful runs of nodetool repair).
When you add a new node to the cluster, the token ranges that each node is responsible for are adjusted, and lowered per node. This leaves data on the original nodes that they are no longer responsible for. And that is by design.
The idea was that if, for whatever reason, the node-add process failed and you had to leave your cluster at its original size, the data would still be there. But if you can't guarantee that your cluster was in a fully-repaired state in the first place and cleanup was run, it's possible that not all replicas had made it to their proper nodes. Yet, like nodetool getendpoints, the bootstrap process would have assumed that they had.
That's why it's important to ensure that you have been regularly running nodetool repair on your cluster before running nodetool cleanup.
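As a sketch of that order of operations (run during low-traffic periods; the exact repair schedule is up to you):
nodetool repair -pr    # on each node, on a regular schedule, before any topology change
# ...bootstrap the new node and wait until it shows as UN in nodetool status...
nodetool cleanup       # then on each of the pre-existing nodes, one at a time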
nodetool cleanup frees partition keys that no longer belong to a node, so after adding a node and transferring its portion of data, that "portion" no longer belongs to the old node, and running cleanup will free some space on it.
If you see that the old node now uses less storage, that is expected; there wasn't any data loss.
On the other hand, if you really can't find some data, it can be due to data corruption or deleted data (with tombstones). What do you mean by data loss anyway?

Clarifications about nodetool repair -pr

From the documentation:
Using the nodetool repair -pr (--partitioner-range) option repairs only the primary range for that node, the other replicas for that range still have to perform the Merkle tree calculation, causing a validation compaction. Because all the replicas are compacting at the same time, all the nodes may be slow to respond for that portion of the data.
There is probably never a time when I can accept all nodes being slow for a certain portion of the data. But I wonder: why does it do that (or is there maybe just a mix-up with the "-par" option in the documentation?!), when nodetool repair seems to be smarter:
By default, the repair command takes a snapshot of each replica immediately and then sequentially repairs each replica from the snapshots. For example, if you have RF=3 and A, B and C represents three replicas, this command takes a snapshot of each replica immediately and then sequentially repairs each replica from the snapshots (A<->B, A<->C, B<->C) instead of repairing A, B, and C all at once. This allows the dynamic snitch to maintain performance for your application via the other replicas, because at least one replica in the snapshot is not undergoing repair.
However, the datastax blog addresses this issue:
This first phase can be intensive on disk io, however. You can mitigate this to some degree with compaction throttling (since this phase is what we call a validation compaction.) Sometimes that isn't enough though, and some people try to mitigate this further by using the -pr (--partitioner-range) option to nodetool repair, which repairs only the primary range for that node. Unfortunately, the other replicas for that range will still have to perform the Merkle tree calculation, causing a validation compaction. This can be a problem, since all the replicas will be doing it at the same time, possibly making them all slow to respond for that portion of your data. Fortunately, there is way around this by using the -snapshot option.
That could be nice, but actually, there is no -snapshot option for nodetool repair (see the manpage, or the documentation) (has this option been removed?!)
So overall,
I cannot use nodetool repair -pr, it seems, because I always need at least to keep the system responsive enough to read/write with consistency ONE, without significant delay. (Note: We have only one data center.) Or am I missing/misunderstanding something?
Why is nodetool repair smart, keeping one node responsive, while nodetool repair -pr makes all nodes slow for a portion of data?
Where is the -snapshot option: Has it been removed, never implemented, or does it now maybe automatically work like that, also when using nodetool repair -pr?
The blog below addresses these issues:
http://www.datastax.com/dev/blog/repair-in-cassandra
A simple nodetool repair will not only kick off a repair on the node itself but also on all the nodes that hold replicas of its ranges. While this is OK, it is very expensive and typically not an operation you'll carry out on a busy production system during peak times.
Consequently, nodetool repair -pr will carry out a repair of only the primary ranges on that node. You will need to run this on every node of the cluster, as the blog says. Customers with large production systems will typically use this in a rolling fashion across their cluster, as sketched below.
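A minimal sketch of that rolling pattern (host names are placeholders; wait for each node to finish before moving on, and run outside peak hours):
for host in node1 node2 node3; do
    ssh "$host" nodetool repair -pr
done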
On another note, Datastax OpsCenter offers the repair service, which runs smaller sub-range repairs all the time, so although you're always repairing, it's going on in the background at a lower resource level.
As for the snapshots, running a regular repair will invoke a snapshot as you stated; you can also invoke a snapshot yourself using nodetool snapshot.
Hope this helps!

Resources