Cassandra CPU performance - cassandra

I deployed a Cassandra 2.2 ring composed by 4 nodes in the cloud with 8 vCPU and 8GB of ram. I am running some tests now with cassandra-stress and YCSB tools to test its performance. I am mainly interested in read requests with a small amount of write requests (95%/5%).
Running the experiments, I noticed that even setting a high number of threads (or clients) the CPU (and disk) does not saturate, but still always around the 60% of utilisation.
I am trying to figure out where is the bottleneck in my system. From the hardware point of view it seems all ok to me.
dstat
I also looked into the Cassandra configuration file to see if there are some tuning parameters to increase the system throughput. I increase the value of concurrent_read/write parameter, but it doesn't increase the performance.
The log file also does not contain any warning.
What it could be that is limiting my system?
Thanks

You might want to consider running cassandra-stress from outside the cluster and on multiple instances as described in
Usage of the Cassandra tool cassandra-stress

Related

Cassandra keep using 100% of CPU and not utilizing memoery?

We have setup Cassandra single node of 3.11 with JDK 1.8 on ec2 with instance type t2.large which has 2 CPU and 7 GB of RAM.
We facing the issue that Cassandra keeps reaching CPU 100% even we do not have that much load.
We have 7GB of RAM but Cassandra not utilizing that Memory.it only uses 1.7-1.8 GB of RAM.
What configuration needs to change to reduce CPU utilization to not reach to 100%.
what best configuration to get better performance out of Cassandra.
Right now we able to get only about 100-120 read and 50-100 write operation per sec.
Please, some one helps us to understand the issue and what ways to improve performance configuration.

Cassandra Performance Tuning

I have successfully installed a multi-node Cassandra cluster with 10nodes,
The nodetool status command shows every node is UP and NORMAL.
but the Performance I am getting is very bad.
here are my results:
Operations /seconds = 4000
Read Latency = 13ms
write Latency = 10ms
I am using YCSB to measure performance
Tuning that I have done till now:
Consistency level = 1
Replication Factor = 3
Heap size = 4GB
My Hardware:
Each node is a VM with CentOS
2GHZ CPU with 8 cores
8GB RAM
1GB/ps N/W
Please let me know what more settings I can tweak to get maximum performance out of my cluster.
If you have 1 system with 10 VMs running on it and 1 disk, the performance of any (not in-memory) database will be bad. Especially with spinning disks (no matter how expensive they are) is going to be a major contention point. With a really good SSD you may be able to pull off a few instances, but performance stress testing will likely always hit either that or a CPU bottleneck (if things configured correctly for system).
Pretty good chance with 4gb heaps and a stress workload you are going to be hitting GC and memory issues, do you have any monitoring around that? Can use visualvm and connect to the ip:7199 (ip set in cassandra-env.sh).
8gb of ram per vm is on the minimum spec end. You want at least 8gb of JVM heap with space for the offheap stuff and OS. A 16gb system is likely sufficient. Once again the shared disk will kill performance so it will only go so far. but should be able to do far better than 4k/sec.

Increasing Number of VMs decreases Cassandra Throughput. What can be reason?

I am using YCSB benchmarking tool to benchmark Cassandra cluster.
I am varying the number of Virtual machines in the cluster.
I am using 1 physical host and I am using 1,2,3,4 virtual machines for benchmarking(as shown in attached figure).
The generated workload is same all the time Workload C 10,000,00 operations, 10,000 records
Each VM has 2 GB RAM, 20GB drive
Cassandra - 1 seed node, endpoint_snitch - gossipproperty
Keyspace YCSB - Replication factor 3,
The problem is that when I increase the number of virtual machines in the cluster, the throughput decreases. What can be the reason?
By definition, by increasing compute resources(i.e virtual machines), the cluster should offer better performance, but the opposite is happening as shown in attached figure. Kindly explain what can be the probable reason for this? I am writing my thesis on this topic but I am unable to figure out the reason for this, please help, I will be grateful to you.
Throughput observed by varying number of VMs in Cassandra cluster:
Very likely hitting a disk IO bottleneck. Especially with non ssd drives this is completely expected. Unless you have dedicated disk/cpu per vm the competition for resources will cause contention like this. Also 2gb per vm is not enough to do any kind of performance benchmark with Cassandra since the minimum recommended JVM heap size is 8gb.
Cassandra is great at horizontal scaling (nearly linear), but that doesn't mean that simply adding VMs to one physical host will increase throughput - a single VM on the physical host will have less contention for resources (disk, cpu, memory, network) than 4, so it's likely one VM would perform better than 4.
By definition, if you WERE increasing resources, you SHOULD see it perform better - but you're not, you're simply adding contention to existing resources. If you want to scale cassandra, you need to test it with additional physical resources - more physical machines, not more VMs on the same machine.
Finally, as Chris Lohfink mentions, your VMs are too small to do meaningful tests - 8GB JVM heap is recommended, with another 8GB of vm page cache to support reads - running Cassandra with less than 16G of RAM is typically non-ideal in production.
You're trying to test a jet engine (a distribute database designed for hundreds or thousands of physical nodes) with a gas station level equipment - your benchmark hardware isn't viable for a real production environment, so your benchmark results aren't meaningful.

Why is my client CPU utilization so high when I use a cassandra cluster?

This is a follow-on question to Why is my cassandra throughput not improving when I add nodes?. I have configured my client and nodes as closely as I could to what is recommended here: http://docs.datastax.com/en/cassandra/2.1/cassandra/install/installRecommendSettings.html. The whole setup is not exactly world class (the client is on a laptop with 32G of RAM and a modern'ish processor, for example). I am more interested in developing an intuition for the cassandra infrastructure at this point.
I notice that if I shut down all but one of the nodes in the cluster and run my test client against it, I get a throughput of ~120-140 inserts/s and a CPU utilization of ~30-40%. When I crank up all 6 nodes and run this one client against them, I see a throughput of ~110-120 inserts/s and my CPU utilization gets to between ~80-100%.
All my tests are run with a clean DB (I completely delete all DB files and restart from scratch) and I insert 30M rows.
My test client is multi-threaded and each thread writes exclusively to one partition using unlogged batches, as recommend by various sources for a schema like mine (e.g. https://lostechies.com/ryansvihla/2014/08/28/cassandra-batch-loading-without-the-batch-keyword/).
Is this CPU spike expected behavior?

Cassandra compaction tasks stuck

I'm running Datastax Enterprise in a cluster consisting of 3 nodes. They are all running under the same hardware: 2 Core Intel Xeon 2.2 Ghz, 7 GB RAM, 4 TB Raid-0
This should be enough for running a cluster with a light load, storing less than 1 GB of data.
Most of the time, everything is just fine but it appears that sometimes the running tasks related to the Repair Service in OpsCenter sometimes get stuck; this causes an instability in that node and an increase in load.
However, if the node is restarted, the stuck tasks don't show up and the load is at normal levels again.
Because of the fact that we don't have much data in our cluster we're using the min_repair_time parameter defined in opscenterd.conf to delay the repair service so that it doesn't complete too often.
It really seems a little bit weird that the tasks that says that are marked as "Complete" and are showing a progress of 100% don't go away, and yes, we've waited hours for them to go away but they won't; the only way that we've found to solve this is to restart the nodes.
Edit:
Here's the output from nodetool compactionstats
Edit 2:
I'm running under Datastax Enterprise v. 4.6.0 with Cassandra v. 2.0.11.83
Edit 3:
This is output from dstat on a node that behaving normally
This is output from dstat on a node with stucked compaction
Edit 4:
Output from iostat on node with stucked compaction, see the high "iowait"
azure storage
Azure divides disk resources among storage accounts under an individual user account. There can be many storage accounts in an individual user account.
For the purposes of running DSE [or cassandra], it is important to note that a single storage account should not should not be shared between more than two nodes if DSE [or cassandra] is configured like the examples in the scripts in this document. This document configures each node to have 16 disks. Each disk has a limit of 500 IOPS. This yields 8000 IOPS when configured in RAID-0. So, two nodes will hit 16,000 IOPS and three would exceed the limit.
See details here
So, this has been an issue that have been under investigation for a long time now and we've found a solution, however, we aren't sure what the underlaying problem that were causing the issues were but we got a clue even tho that, nothing can be confirmed.
Basically what we did was setting up a RAID-0 also known as Striping consisting of four disks, each at 1 TB of size. We should have seen somewhere 4x one disks IOPS when using the Stripe, but we didn't, so something was clearly wrong with the setup of the RAID.
We used multiple utilities to confirm that the CPU were waiting for the IO to respond most of the time when we said to ourselves that the node was "stucked". Clearly something with the IO and most probably our RAID-setup was causing this. We tried a few differences within MDADM-settings etc, but didn't manage to solve the problems using the RAID-setup.
We started investigating Azure Premium Storage (which still is in preview). This enables attaching disks to VMs whose underlaying physical storage actually are SSDs. So we said, well, SSDs => more IOPS, so let us give this a try. We did not setup any RAID using the SSDs. We are only using one single SSD-disk per VM.
We've been running the Cluster for almost 3 days now and we've stress tested it a lot but haven't been able to reproduce the issues.
I guess we didn't came down to the real cause but the conclusion is that some of the following must have been the underlaying cause for our problems.
Too slow disks (writes > IOPS)
RAID was setup incorrectly which caused the disks to function non-normally
These two problems go hand-in-hand and most likely is that we basically just was setting up the disks in the wrong way. However, SSDs = more power to the people, so we will definitely continue using SSDs.
If someone experience the same problems that we had on Azure with RAID-0 on large disks, don't hesitate to add to here.
Part of the problem you have is that you do not have a lot of memory on those systems and it is likely that even with only 1GB of data per node, your nodes are experiencing GC pressure. Check in the system.log for errors and warnings as this will provide clues as to what is happening on your cluster.
The rollups_60 table in the OpsCenter schema contains the lowest (minute level) granularity time series data for all your Cassandra, OS, and DSE metrics. These metrics are collected regardless of whether you have built charts for them in your dashboard so that you can pick up historical views when needed. It may be that this table is outgrowing your small hardware.
You can try tuning OpsCenter to avoid this kind of issues. Here are some options for configuration in your opscenterd.conf file:
Adding keyspaces (for example the opsc keyspace) to your ignored_keyspaces setting
You can also decrease the TTL on this table by tuning the 1min_ttlsetting
Sources:
Opscenter Config DataStax docs
Metrics Config DataStax Docs

Resources