Cassandra compaction tasks stuck - cassandra

I'm running Datastax Enterprise in a cluster consisting of 3 nodes. They are all running under the same hardware: 2 Core Intel Xeon 2.2 Ghz, 7 GB RAM, 4 TB Raid-0
This should be enough for running a cluster with a light load, storing less than 1 GB of data.
Most of the time, everything is just fine but it appears that sometimes the running tasks related to the Repair Service in OpsCenter sometimes get stuck; this causes an instability in that node and an increase in load.
However, if the node is restarted, the stuck tasks don't show up and the load is at normal levels again.
Because of the fact that we don't have much data in our cluster we're using the min_repair_time parameter defined in opscenterd.conf to delay the repair service so that it doesn't complete too often.
It really seems a little bit weird that the tasks that says that are marked as "Complete" and are showing a progress of 100% don't go away, and yes, we've waited hours for them to go away but they won't; the only way that we've found to solve this is to restart the nodes.
Edit:
Here's the output from nodetool compactionstats
Edit 2:
I'm running under Datastax Enterprise v. 4.6.0 with Cassandra v. 2.0.11.83
Edit 3:
This is output from dstat on a node that behaving normally
This is output from dstat on a node with stucked compaction
Edit 4:
Output from iostat on node with stucked compaction, see the high "iowait"

azure storage
Azure divides disk resources among storage accounts under an individual user account. There can be many storage accounts in an individual user account.
For the purposes of running DSE [or cassandra], it is important to note that a single storage account should not should not be shared between more than two nodes if DSE [or cassandra] is configured like the examples in the scripts in this document. This document configures each node to have 16 disks. Each disk has a limit of 500 IOPS. This yields 8000 IOPS when configured in RAID-0. So, two nodes will hit 16,000 IOPS and three would exceed the limit.
See details here

So, this has been an issue that have been under investigation for a long time now and we've found a solution, however, we aren't sure what the underlaying problem that were causing the issues were but we got a clue even tho that, nothing can be confirmed.
Basically what we did was setting up a RAID-0 also known as Striping consisting of four disks, each at 1 TB of size. We should have seen somewhere 4x one disks IOPS when using the Stripe, but we didn't, so something was clearly wrong with the setup of the RAID.
We used multiple utilities to confirm that the CPU were waiting for the IO to respond most of the time when we said to ourselves that the node was "stucked". Clearly something with the IO and most probably our RAID-setup was causing this. We tried a few differences within MDADM-settings etc, but didn't manage to solve the problems using the RAID-setup.
We started investigating Azure Premium Storage (which still is in preview). This enables attaching disks to VMs whose underlaying physical storage actually are SSDs. So we said, well, SSDs => more IOPS, so let us give this a try. We did not setup any RAID using the SSDs. We are only using one single SSD-disk per VM.
We've been running the Cluster for almost 3 days now and we've stress tested it a lot but haven't been able to reproduce the issues.
I guess we didn't came down to the real cause but the conclusion is that some of the following must have been the underlaying cause for our problems.
Too slow disks (writes > IOPS)
RAID was setup incorrectly which caused the disks to function non-normally
These two problems go hand-in-hand and most likely is that we basically just was setting up the disks in the wrong way. However, SSDs = more power to the people, so we will definitely continue using SSDs.
If someone experience the same problems that we had on Azure with RAID-0 on large disks, don't hesitate to add to here.

Part of the problem you have is that you do not have a lot of memory on those systems and it is likely that even with only 1GB of data per node, your nodes are experiencing GC pressure. Check in the system.log for errors and warnings as this will provide clues as to what is happening on your cluster.

The rollups_60 table in the OpsCenter schema contains the lowest (minute level) granularity time series data for all your Cassandra, OS, and DSE metrics. These metrics are collected regardless of whether you have built charts for them in your dashboard so that you can pick up historical views when needed. It may be that this table is outgrowing your small hardware.
You can try tuning OpsCenter to avoid this kind of issues. Here are some options for configuration in your opscenterd.conf file:
Adding keyspaces (for example the opsc keyspace) to your ignored_keyspaces setting
You can also decrease the TTL on this table by tuning the 1min_ttlsetting
Sources:
Opscenter Config DataStax docs
Metrics Config DataStax Docs

Related

Frequent Compaction of OpsCenter.rollup_state on all the nodes consuming CPU cycles

I am using Datastax Cassandra 4.8.16. With cluster of 8 DC and 5 nodes on each DC on VM's. For last couple of weeks we observed below performance issue
1) Increase drop count on VM's.
2) LOCAL_QUORUM for some write operation not achieved.
3) Frequent Compaction of OpsCenter.rollup_state and system.hints are visible in Opscenter.
Appreciate any help finding the root cause for this.
Presence of dropped mutations means that cluster is heavily overloaded. It could be increase of the main load, so it + load from OpsCenter, overloaded system - you need to look into statistics about number of requests, latencies, etc. per nodes and per tables, to see where increase happened. Please also check the I/O statistics on machines (for example, with iostat) - sizes of the queues, read/write latencies, etc.
Also it's recommended to use a dedicated OpsCenter cluster to store metrics - it could be smaller size, and doesn't require an additional license for DSE. How it said in the OpsCenter's documentation:
Important: In production environments, DataStax strongly recommends storing data in a separate DataStax Enterprise cluster.
Regarding VMs - usually it's not really recommended setup, but heavily depends on what kind of underlying hardware - number of CPUs, RAM, disk system.

Cassandra cluster planning

I would like to know about hardware limitations in cluster planning (in TBs) specific to my use case. I have read few threads and documents related to it but some content seem to be over 5 years old. Thought of giving it a shot again:
Use case: Building a time-series cassandra cluster where there is from time-to-time bulk loading from data sources which are in Gigabytes. However, the end-user will majorly be focused in reading the data from the cluster. Quite rarely will be some update or delete on the rows
I have an initial hardware configuration with me to setup Cassandra cluster:
2*12 Cores
128 GB RAM
HDD SAS 3.27 TB
This is the initial plan that I come up with:
When I now speculate over the setup, and after reading the post:
should I further divide my nodes with lesser RAM, vCPUs and HDD?
If yes, what would be the good fit wrt my case?

planning for graphite components for big cassandra cluster monitoring

I am planning to setup a 80 nodes cassandra cluster (current version 2.1 but will upgrade to 3 in future).
I have gone though http://graphite.readthedocs.io/en/latest/tools.html which has list of tools that graphite supports.
I want to decide which tools to choose as listener and storage so that it could scale.
As a listener should i use the default carbon or should i choose graphite-ng ?
However as storage component, i am confused that whether default whisper is enough? Or should I look at ohter option (like Influxdata,cynite or some rdms db (postgres/mysql))?
As gui component i have finalized to use grafana for better visulization.
I think datadog + grafana will work fine but datadog is not opensource.So Please suggest an opensource scalable up to 100 cassandra nodes alternative.
I have 35 Cassandra nodes (different clusters) monitored without any problems with graphite + carbon + whisper + grafana. But i have to tell that re-configuring collection and aggregations windows with whisper is a pain.
There's many alternatives today for this job, you can use influxdb (+ telegraf) stack for example.
Also with datadog you don't need grafana, they're also a visualizing platform. I've worked with it some time ago, but they have some misleading names for some metrics in their plugin, and some metrics were just missing. As a pros for this platform, it's really easy to install and use.
We have a cassandra cluster of 36 nodes in production right now (we had 51 but migrated the instance type since then so we need less C* servers now), monitored using a single graphite server. We are also saving data for 30 days but in a 60s resolution. We excluded the internode metrics (e.g. open connections from a to b) because of the scaling of the metric count, but keep all other. This totals to ~510k metrics, each whisper file being ~500kb in size => ~250GB. iostat tells me, that we have write peaks to ~70k writes/s. This all is done on a single AWS i3.2xlarge instance which include 1.9TB nvme instance storage and 61GB of RAM. To fully utilize the power of the this disk type we increased the number of carbon caches. The cpu usage is very low (<20%) and so is the iowait (<1%).
I guess we could get away with a less beefy machine, but this gives us a lot of headroom for growing the cluster and we are constantly adding new servers. For the monitoring: Be prepared that AWS will terminate these machines more often than others, so backup and restore are more likely a regular operation.
I hope this little insight helped you.

Cassandra CPU performance

I deployed a Cassandra 2.2 ring composed by 4 nodes in the cloud with 8 vCPU and 8GB of ram. I am running some tests now with cassandra-stress and YCSB tools to test its performance. I am mainly interested in read requests with a small amount of write requests (95%/5%).
Running the experiments, I noticed that even setting a high number of threads (or clients) the CPU (and disk) does not saturate, but still always around the 60% of utilisation.
I am trying to figure out where is the bottleneck in my system. From the hardware point of view it seems all ok to me.
dstat
I also looked into the Cassandra configuration file to see if there are some tuning parameters to increase the system throughput. I increase the value of concurrent_read/write parameter, but it doesn't increase the performance.
The log file also does not contain any warning.
What it could be that is limiting my system?
Thanks
You might want to consider running cassandra-stress from outside the cluster and on multiple instances as described in
Usage of the Cassandra tool cassandra-stress

Repartitioning of data in Cassandra when adding new servers

Let's imagine I have a Cassandra cluster with 3 nodes, each having 100GB of available hard disk space. Replication Factor for this cluster is set to 3 and R/W CLs are set to 2, meaning I can tolerate one of my nodes going down without sacrificing consistency or availability.
Now imagine my servers have started to fill up (80GB as an example) and I would like to add another 3 servers of the same specification to my cluster, maintaining the same CLs and RFs.
My question is: after I've added the new nodes to my cluster and run the node repair tool, is it fair to assume that each of my nodes should roughly (more or less a few GBs) contain 40GB of data each?
If not, how can I add new nodes without having the fear of running out of hard disk space?
A little background of why I'm asking this question: I am developing an app that connects to a server that runs Cassandra for its data storage. As this is only developed by me, and I have limited resources in terms of money to buy servers, I've decided that I would like to buy small, cheap "servers" instead of the more expensive rack options but I'm really worried about the nodes running out of space if the disk allocation is not (at least partially)
homogenous.
Many thanks for you help,
My question is: after I've added the new nodes to my cluster and run
the node repair tool, is it fair to assume that each of my nodes
should roughly (more or less a few GBs) 40GB of data each
After also running nodetool cleanup you should see roughly 40GB of data on each node. Cleanup removes data which the node is no longer responsible for. If you don't run this command the old data will remain on the machine.

Resources