Some of the nodes in our PROD cluster go yellow, red, or even grey because of high load, but the nodes are still working.
Timeouts come in bulk during this time.
All of this happens while compaction is running on the affected node.
Is there a way to control auto compaction for a keyspace, or to control compaction as a whole and run it on weekends during idle time?
This would give relief to the production nodes during business hours.
There may be multiple reasons for the high load; for example, it may be due to high TPS on the Cassandra cluster. Compaction is a heavy process, and with the STCS compaction strategy it needs at least 50% free disk space to run healthily. You can also check concurrent_reads and concurrent_writes in cassandra.yaml and tune them. You can also tune your heap if you are using G1GC. Compaction throughput can be tuned to match the system configuration (nodetool setcompactionthroughput). Auto compaction can be disabled with nodetool disableautocompaction, but this is not recommended on a production cluster: auto compaction must stay enabled to reclaim disk space. Upgrade the cluster if you are running an older version of Cassandra; newer versions perform better in my experience, and I am using 3.11.2 and 3.11.3.
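As a rough illustration of throttling compaction instead of disabling it, the sketch below picks a compaction throughput cap by time of day and builds the matching nodetool setcompactionthroughput command. The throughput values, the business-hours window, and the function names are assumptions made for this example, not recommendations from the answer above.

```python
from datetime import datetime
from typing import List

# Hypothetical caps in MB/s; tune them for your own hardware and workload.
BUSINESS_HOURS_MBPS = 8
OFF_HOURS_MBPS = 64


def compaction_throughput_for(now: datetime) -> int:
    """Return the compaction throughput cap (MB/s) for the given time."""
    is_weekend = now.weekday() >= 5         # Saturday is 5, Sunday is 6
    is_business_hour = 8 <= now.hour < 20   # assumed business hours
    if is_weekend or not is_business_hour:
        return OFF_HOURS_MBPS
    return BUSINESS_HOURS_MBPS


def throttle_command(now: datetime) -> List[str]:
    """Build the nodetool call; it could be run on each node, e.g. from cron."""
    cap = compaction_throughput_for(now)
    return ["nodetool", "setcompactionthroughput", str(cap)]


if __name__ == "__main__":
    print(" ".join(throttle_command(datetime.now())))
```

Compaction keeps running all week this way, so disk space is still reclaimed, but it competes less with client traffic during the day.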
What are the symptoms or signs that indicate the existing cluster nodes are over capacity and that more nodes need to be added? I want to know which performance symptoms mean it is time to add nodes to the cluster.
It depends a lot on the configuration and your use cases. You may have to look at the different metrics from your existing cluster. A few metrics that you should keep an eye on include:
CPU usage
Query Latency
Memory (depends on how you are using heap memory)
Disk usage
Based on these metrics, you should make a decision whether to scale out the cluster or not.
These are the common scenarios in which you should consider adding a new node:
Cluster performance has degraded. You are not getting the required throughput even after all the tuning.
You require more disk space. Generally you can increase disk space by adding a new disk, but beyond a limit (generally 2 TB) it is advised to add a new node.
You have metrics on hand that show your performance is degrading. For example, you can use nodetool tablehistograms to check read and write latency for a particular table. If read/write latency is within your required latencies, you are fine; if you see the system getting slower as traffic grows, that is a sign you should add a node to the cluster.
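The checks above can be sketched as a small decision helper. This is only an illustration: the NodeMetrics fields, the function name, and every threshold except the 2 TB disk figure mentioned above are hypothetical, and the values would have to come from your own monitoring (for example, latencies read from nodetool tablehistograms).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeMetrics:
    cpu_percent: float    # average CPU utilisation on the node
    p99_read_ms: float    # 99th percentile read latency
    p99_write_ms: float   # 99th percentile write latency
    disk_used_tb: float   # data size on the node


def scale_out_reasons(
    m: NodeMetrics,
    max_cpu: float = 80.0,       # assumed threshold
    max_read_ms: float = 50.0,   # assumed latency requirement
    max_write_ms: float = 20.0,  # assumed latency requirement
    max_disk_tb: float = 2.0,    # the rough per-node limit cited above
) -> List[str]:
    """Return the metrics that exceed their limits; empty means no action."""
    reasons = []
    if m.cpu_percent > max_cpu:
        reasons.append("cpu")
    if m.p99_read_ms > max_read_ms:
        reasons.append("read_latency")
    if m.p99_write_ms > max_write_ms:
        reasons.append("write_latency")
    if m.disk_used_tb > max_disk_tb:
        reasons.append("disk")
    return reasons


if __name__ == "__main__":
    print(scale_out_reasons(NodeMetrics(90.0, 10.0, 5.0, 2.5)))
```

A non-empty result is a prompt to investigate and tune first; adding a node is the step to take when tuning no longer helps.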
I am currently working on setting up a Cassandra cluster that will be used by different applications each with their own keyspace (in a multi-tenancy fashion).
So I was wondering if I could limit the usage of my cluster for each keyspace individually.
For example, if keyspace1 is using 65% of the cluster resources, every new request on that keyspace would be put in a queue so it doesn't impact requests on other keyspaces.
I know I can get statistics on each keyspace using nodetool cfstats, but I don't know how to take countermeasures.
"Cluster resources" is also a term that needs defining, as it could mean total CPU usage, JVM heap usage, or the proportion of writes/reads on each keyspace on the cluster at instant t.
Also, if you have strategies to avoid getting into this kind of situation, I'd be glad to hear about them!
No, Cassandra doesn't have such functionality. That's why it's recommended to set up separate clusters to isolate workloads from noisy neighbors...
Theoretically you could do this with Docker/Kubernetes/..., but it could take a lot of resources to build something that works reliably.
I am using DataStax Cassandra 4.8.16, with a cluster of 8 DCs and 5 nodes in each DC, running on VMs. For the last couple of weeks we have observed the performance issues below:
1) Increased drop count on the VMs.
2) LOCAL_QUORUM not achieved for some write operations.
3) Frequent compactions of OpsCenter.rollup_state and system.hints are visible in OpsCenter.
I would appreciate any help finding the root cause of this.
The presence of dropped mutations means that the cluster is heavily overloaded. It could be that the main load has increased, so that it plus the load from OpsCenter overloads the system. You need to look at statistics on the number of requests, latencies, etc., per node and per table, to see where the increase happened. Please also check the I/O statistics on the machines (for example, with iostat): queue sizes, read/write latencies, etc.
It's also recommended to use a dedicated OpsCenter cluster to store metrics. It can be smaller, and it doesn't require an additional DSE license. As the OpsCenter documentation says:
Important: In production environments, DataStax strongly recommends storing data in a separate DataStax Enterprise cluster.
Regarding VMs: this is usually not a recommended setup, but it depends heavily on the underlying hardware: number of CPUs, RAM, disk system.
Will any issues arise if I deprioritize the Cassandra "nodetool repair" command using "nice"? It causes high CPU "user time" load and is having a negative impact on our production systems, causing API timeouts on our Usergrid implementation. I have seen documentation on limiting network throughput, but iowait does not appear to be the issue. Additionally, are there any good methods for mitigating this problem?
The nodetool command doesn't actually do any work. It just calls a JMX operation in C* to kick off the repair and then listens for updates to print out, so running it under nice won't make any difference. There are a few main phases to a repair:
build merkle trees (on each node)
stream changes
compactions
Possibly the validation compaction (which on some versions can be controlled with the compaction throttle) or the streams (stream throughput can be set via nodetool or cassandra.yaml) are burning your CPU. If so, you can try using the throttles, but in some versions it won't make a difference.
After the repair completes, normal compactions kick off: for anticompaction in incremental repairs, and also for full repairs if a lot of differences were streamed. Some problems are very version specific, so pay attention to the logs and to when CPU is high in order to drill down further.
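The two throttles mentioned above map to nodetool setstreamthroughput (megabits per second) and nodetool setcompactionthroughput (MB per second). A minimal sketch that builds both commands follows; the helper name and the example values are assumptions, and, as noted above, the throttles may have no effect on some versions.

```python
from typing import List


def repair_throttle_commands(stream_megabits: int, compaction_mb: int) -> List[List[str]]:
    """Build the nodetool calls that cap streaming and compaction throughput.

    A value of 0 turns the corresponding throttle off.
    """
    if stream_megabits < 0 or compaction_mb < 0:
        raise ValueError("throughput caps must be non-negative")
    return [
        ["nodetool", "setstreamthroughput", str(stream_megabits)],
        ["nodetool", "setcompactionthroughput", str(compaction_mb)],
    ]


if __name__ == "__main__":
    # Example values only; print the commands instead of running them.
    for cmd in repair_throttle_commands(100, 16):
        print(" ".join(cmd))
```

Both settings take effect on the node where they are run and last until the next restart, so they would need to be applied on every node taking part in the repair.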
So there is a fair amount of documentation on how to scale up Cassandra, but is there a good resource on how to "unscale" Cassandra and remove nodes from the cluster? Is it as simple as turning off a node, letting the cluster sync up again, and repeating?
The reason is a site that expects high spikes of traffic, climbing from a few thousand daily hits to hundreds of thousands over a few days. The site will be "ramped up" beforehand by starting multiple instances of the web server, Cassandra, etc. After the torrent of requests subsides, the goal is to turn off the instances that are no longer used, rather than pay for servers that are just sitting around.
If you just shut the nodes down and rebalance the cluster, you risk losing data that exists only on the removed nodes and hasn't been replicated yet.
A safe cluster shrink can be done easily with nodetool. First, run:
nodetool drain
... on the node being removed, to stop accepting writes and flush memtables, then:
nodetool decommission
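... to move the node's data to other nodes. Then shut the node down and run, on some other node:
nodetool removetoken
... to remove the node from the cluster completely. Note that in Cassandra 1.2 and later, removetoken was replaced by nodetool removenode, which is meant for nodes that are down and could not be decommissioned. Detailed documentation can be found here: http://wiki.apache.org/cassandra/NodeTool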
From my experience, I'd recommend removing nodes one by one, not in batches. It takes more time, but it is much safer in case of network outages or hardware failures.
When you remove nodes you may have to rebalance the cluster, moving some nodes to a new token. In a planned downscale, you need to:
1 - minimize the number of moves.
2 - if you have to move a node, minimize the amount of transferred data.
There's an article about cluster balancing that may be helpful:
Balancing Your Cassandra Cluster
Also, the beginning of this video covers add-node and remove-node operations and the best strategies for minimizing the impact on the cluster in each of them.
Hopefully, these two references will give you enough information to plan your downscale.
First, on the node that will be removed, flush the in-memory memtables to SSTables on disk:
nodetool flush
Second, run the command that makes the node leave the cluster:
nodetool decommission
This command assigns the ranges that the node was responsible for to other nodes and replicates the data appropriately.
To monitor the process, you can use:
nodetool netstats
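The steps above can be sketched as a small script, run on the node being removed. The function name and the dry_run flag are hypothetical; by default the sketch only reports the commands so the sequence can be reviewed before anything is executed. Monitoring with nodetool netstats would be done from a separate terminal while decommission is running.

```python
import subprocess
from typing import List

# Commands from the steps above, in order.
DECOMMISSION_STEPS = [
    ["nodetool", "flush"],          # write memtables to SSTables on disk
    ["nodetool", "decommission"],   # stream this node's ranges to other nodes
]


def decommission_node(dry_run: bool = True) -> List[str]:
    """Run (or, with dry_run, only list) the decommission steps in order."""
    planned = []
    for cmd in DECOMMISSION_STEPS:
        planned.append(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence if a step fails.
            subprocess.run(cmd, check=True)
    return planned


if __name__ == "__main__":
    for line in decommission_node(dry_run=True):
        print(line)
```

Stopping on the first failure matters here: decommissioning should not start if the flush did not succeed.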
I found an article on how to remove nodes from Cassandra. It helped me scale down Cassandra; all the actions are described there step by step.