I recently used 9 Cassandra VMs on OpenStack to test our product. Each VM has 16 vCPUs, a 50 GB SSD, and 20 GB of memory, but I find each node can only sustain 10,000+ operations/second at 70% CPU, which is about 90,000 operations/second for 9 nodes.
The data model is a simple mixed read/write scenario on normal tables, and I haven't seen any obvious performance bottleneck during the test. On the internet I can see people achieving 4,000 operations/second on AWS T2 medium nodes (only 2 vCPUs), and some Cassandra training materials say they can achieve 6,000-12,000 transactions per second.
Can anyone share their benchmark results on Apache Cassandra?
First of all, Alex is right. Schema (specifically, primary key definitions) matters. The rest of this answer assumes your schema is free of those anti-patterns.
So the standard deployment image which I use for OpenStack is 16 GB RAM with 8 CPUs (@ 2.6 GHz). That's less RAM than I'd recommend for most production deploys, unless you have some extra time to engineer for efficiency. And yes, there are some clusters where that just wasn't enough and we had to build with more RAM. But this has largely been our standard for about 4 years.
The approach of many small nodes has worked well. There are clusters I've built which have sustained 250k ops/sec.
with 70% CPU
TBH, I've found that CPU with Cassandra doesn't matter as much as it does with other databases. When it gets high, it's usually an indicator of another problem.
I haven't seen any obvious performance bottleneck during the test.
On shared-resource environments (like OpenStack), noisy neighbors are one of the biggest problems. Our storage team has imposed IOPS limits on provisioned disks in an attempt to keep heavy loads from affecting others. Therefore, our top-performing clusters required specially-configured volumes allowing higher IOPS levels than would normally be permitted.
Cassandra's metrics can tell you if your disk latency is high. If you see that your disk (read or write) latency is in double-digit milliseconds, then your disk is likely rate-limiting you.
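For example (a hedged suggestion, reusing the example table from the histograms below), the per-table stats expose those latencies:
bin/nodetool tablestats stackoverflow.stockquotes
# Look for the "Local read latency" and "Local write latency" lines in the output;
# sustained double-digit millisecond values usually point at the disk.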
Another thing to look at is your table's histograms (with nodetool). That can give you all sorts of good info, specifically around things like latency and partition sizes.
bin/nodetool tablehistograms stackoverflow.stockquotes
stackoverflow/stockquotes histograms
Percentile Read Latency Write Latency SSTables Partition Size Cell Count
(micros) (micros) (bytes)
50% 0.00 0.00 0.00 124 5
75% 0.00 0.00 0.00 124 5
95% 0.00 0.00 0.00 124 5
98% 0.00 0.00 0.00 124 5
99% 0.00 0.00 0.00 124 5
Min 0.00 0.00 0.00 104 5
Max 0.00 0.00 0.00 124 5
If you look at the size of your usual partition, you can get an idea of how to optimize the table's chunk size. This value represents the size of the building blocks the table uses to interact with the disk.
AND compression = {'chunk_length_in_kb': '64',
'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
For instance, in the case above, I could save a lot on disk payloads just by setting my chunk_length_in_kb down to 1 (the minimum), since my partitions are all less than 1024 bytes.
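As a rough sketch of that change (reusing the example table above; existing SSTables only pick up a new chunk size once they are rewritten, e.g. by compaction or nodetool upgradesstables -a):
ALTER TABLE stackoverflow.stockquotes
  WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': '1'};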
In any case, give your disk stats a look and see if there are some "wins" to be had there.
Related
I have a three-node Cassandra cluster with replication factor 3 and consistency level LOCAL_QUORUM. My table consists of two columns of type MAP<BLOB, BLOB>. Each map contains up to 100 entries. I'm writing (appending) into both maps and reading from one (1R/1W per transaction).
After a few hours of writing and reading across 500k partitions, the table statistics were as follows:
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 10.00 35.43 2346.80 1597 60
75% 10.00 51.01 4055.27 2299 72
95% 12.00 105.78 17436.92 6866 215
98% 12.00 182.79 36157.19 6866 215
99% 12.00 454.83 52066.35 8239 215
Min 5.00 3.31 379.02 104 3
Max 14.00 186563.16 322381.14 9887 310
So far, so good. The next step was to create 30 million new partitions.
After about 15 hours of writing (in random partitions) I noticed a massive TPS drop (about 2k):
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 2.00 51.01 20924.30 1916 50
75% 3.00 73.46 43388.63 1916 60
95% 4.00 126.93 89970.66 1916 60
98% 4.00 219.34 107964.79 2299 72
99% 4.00 379.02 129557.75 6866 179
Min 0.00 3.97 51.01 104 3
Max 8.00 186563.16 322381.14 9887 310
Performing the first test again across 500k partitions, read latency remained high:
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 5.00 51.01 30130.99 1916 60
75% 6.00 73.46 62479.63 1916 60
95% 7.00 152.32 129557.75 1916 60
98% 8.00 263.21 155469.30 3311 103
99% 8.00 545.79 186563.16 6866 179
Min 3.00 3.97 454.83 104 3
Max 10.00 107964.79 557074.61 9887 310
Read latency increases even more when the workload involves writing a counter column (into another table):
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 10.00 42.51 62479.63 1916 50
75% 10.00 61.21 107964.79 1916 60
95% 12.00 105.78 186563.16 1916 60
98% 12.00 182.79 223875.79 3311 103
99% 12.00 379.02 268650.95 6866 179
Min 6.00 4.77 545.79 104 3
Max 14.00 129557.75 557074.61 9887 310
What would be the probable causes? Could the map column type be the root cause? Any suggestions (configuration or schema changes)?
I am using prepared statements:
Fetch row by partition ID
SELECT id,attr,uids FROM user_profile WHERE id=:id
Update map entries
UPDATE user_profile SET attr=attr+:attr, attr=attr-:attrstoremove, uids=uids+:newuserids, md=md+:metadata, md=md-:attrstoremove, up=:up WHERE id=:id
Increase counter
UPDATE user_profile_counter SET cnt=cnt+:cnt WHERE cnt_id=:cnt_id AND id=:id;
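(As a side note, here is a minimal, hypothetical sketch of how one of these named-parameter statements is prepared and bound with the DataStax Java driver 4.x listed in the cluster info below; the session and the profile id value are placeholders:)
// Hypothetical sketch (DataStax Java driver 4.x); assumes an existing CqlSession "session"
// and imports from com.datastax.oss.driver.api.core.cql.
PreparedStatement ps =
    session.prepare("SELECT id,attr,uids FROM user_profile WHERE id=:id");
BoundStatement bound = ps.bind().setString("id", someProfileId);  // someProfileId: placeholder String
Row row = session.execute(bound).one();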
This is my schema:
CREATE TABLE IF NOT EXISTS PROFILING.USER_PROFILE
(
-- PARTITION KEY
ID TEXT, -- PROFILE ID
-- DATA
ATTR MAP<BLOB, BLOB>, --USER ATTRIBUTES
MD MAP<BLOB, BLOB>, --METADATA PER ATTRIBUTE
UIDS SET<TEXT>,
UP TIMESTAMP, --LAST_UPDATE
PRIMARY KEY (ID)
) WITH caching = {
'keys' : 'ALL',
'rows_per_partition' : '1'
};
CREATE TABLE IF NOT EXISTS PROFILES.USER_PROFILE_COUNTER
(
-- PARTITION KEY
ID TEXT, -- PROFILE ID
-- CLUSTERING KEY
CNT_ID BLOB, -- COUNTER ID
-- DATA
CNT COUNTER, -- COUNTER VALUE
PRIMARY KEY (ID, CNT_ID)
) WITH caching = {
'keys' : 'ALL',
'rows_per_partition' : '10' };
The data are encrypted. Here is a row sample (consisting of three map entries):
YkceUdD6qEvOLw3Wgd8zWA |{0x95f56f594522:
0xacb7f42c7f0ac8187f17a8f2c04e5065, 0xa365a3dc007d:
0x24252727706b5065f9e1f65efec7ced8, 0xf0d55b110f87:
0x5a5ef3b0a041af8c7acf4040333afc96} |
{0x95f56f594522:
0x000d31363333333334363636323639, 0xa365a3dc007d:
0x000d31363333333431323938363735, 0xf0d55b110f87:
0x000d31363333333431323938363735} |
{'46TyNYCKTplibRyAfFsNRPQbvfQINNIIY4WmItuPayfvjDjEp49bnXSXLmD9hAm9'} |
2021-10-04 09:54:58.675000+0000
Cluster info (per node)
Total memory 18 GB (heap 5 GB)
6 CPU cores
Versions
Cassandra 3.11
DataStax Java driver 4.11.3
JDK 16
Edit
After a few more tests, a high number of disk IOPS was observed, most of them read operations. More specifically, 20k IOPS and 1500 MB/s of bandwidth were observed. I tried to reduce the number of SSTables touched by using LeveledCompactionStrategy and lowering the chunk_length_in_kb parameter to 4 KB, but there was not much difference.
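(For reference, a sketch of what those two changes look like as an ALTER TABLE statement against the schema above:)
ALTER TABLE profiling.user_profile
  WITH compaction = {'class': 'LeveledCompactionStrategy'}
  AND compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': '4'};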
Note that SAN storage is used. I know it's an anti-pattern but unfortunately, we have no other option. Any ideas?
So to at least provide some theories before the weekend, let me just touch on a couple of things.
JDK 16
Cassandra 3.11 won't run on anything higher than JDK 8, so I'm guessing this is on the client side?
Total memory 18GB(heap 5GB)
I suspect that the nodes would perform better with a larger heap. On an 18GB instance, I would bump that up to somewhere around 8GB-10GB. 9GB will give you a happy 50% of RAM, so that might work well.
If the heap size ends up exceeding 8GB in size, I would go with G1GC (Cassandra 3 still defaults to CMS GC). If you feel more comfortable staying on CMS, then I'd make the heap new size (Xmn) somewhere around 40% to 50% of the max (Xmx) and initial (Xms) sizes.
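As a rough sketch, assuming a stock Cassandra 3.11 install, those knobs live in conf/jvm.options (the values below are illustrative only, not a universal recommendation):
# conf/jvm.options -- illustrative values only
-Xms9G
-Xmx9G
# If staying on CMS, size the new generation at roughly 40-50% of the max heap:
# -Xmn4G
# Or switch to G1 by disabling the CMS flags and uncommenting:
# -XX:+UseG1GC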
three-node Cassandra cluster with replication factor 3
It's also quite possible that you need more than 3 nodes to handle this load. The Cassandra nodes might be filling up with write backpressure, and dying a slow death. After bumping the heap size, I'd try this next.
Have you had a look at the disk I/O metrics (IOPS in particular)? The partition sizes are good, and so is the query, so maybe the disk is slow?
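(Outside of Cassandra's own metrics, one quick, hypothetical way to eyeball that per node is iostat from the sysstat package:)
iostat -x 5
# Watch r/s and w/s (IOPS) plus the await/%util columns for the data volume.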
Not sure if you're running on shared infra or not (ex: OpenStack), but one thing I've run into in the past was noisy neighbors: an instance that you're sharing compute or storage with takes up all the bandwidth. I'd check with your infrastructure team and see if that's an issue.
Also, while working on that shared infra environment, I once ran into an issue where we found out that the storage scheduler had put all drives for one cluster on the same physical volume. Hence, we didn't have any real hardware availability, plus the cluster's high write throughput was basically choking itself out.
Have a look at some of that, and see if you find anything.
Edit 20211114
Yes, provisioning Cassandra's drives on a SAN is going to make performance difficult to come by. I can think of two things to try:
1. Direct your commitlog to a different drive than the data (hopefully a local, non-SAN disk), as sketched in the cassandra.yaml snippet below.
2. Provision each node's data drive to a different SAN.
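A minimal sketch of the relevant cassandra.yaml settings for the first option (the paths are examples only):
# cassandra.yaml -- example paths only
data_file_directories:
    - /var/lib/cassandra/data                       # SAN-backed volume
commitlog_directory: /mnt/local/cassandra/commitlog  # local, non-SAN drive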
We are running a two-node Cassandra cluster.
The following are the latency stats for reads and writes when executed independently:
99% write latency   avg write latency   99% read latency   avg read latency   GC time
545                 0.227               2816               1.793              2400
However, the total read time for the same batch set is almost 3 times worse when performing reads and writes in parallel (write latencies being almost unaffected).
99% read latency   avg read latency   GC time
4055               1.955              6851
There is no compaction recorded on the application keyspace, though we could see compaction on the system and system_schema keyspaces.
What might be causing the sizeable jump in read timings for the same sample set when writes happen concurrently with reads?
Another point to mention is that the bloom filter false positives are always 0, which seems to indicate bloom filters are being used effectively.
Any pointers to investigate are appreciated.
This is a continuation of an earlier question that I asked:
NoSpamLogger.java Maximum memory usage reached Cassandra
Based on that thread, I re-partitioned my data to be minute-wise instead of hourly. This has improved the stats for that table:
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 35.00 73.46 3449259.15 219342 72
75% 35.00 88.15 4966933.18 943127 258
95% 35.00 182.79 4966933.18 5839588 1109
98% 35.00 315.85 4966933.18 17436917 3973
99% 35.00 379.02 4966933.18 36157190 11864
Min 30.00 20.50 2874382.63 51 0
Max 35.00 2346.80 4966933.18 3449259151 2816159
Notice that the 99th percentile partition size is less than 40 MB, but the maximum partition size is still reported to be 3.44 GB.
Also, I continue to see the 'Maximum memory usage reached' error in the system.log every couple of days after a cluster restart.
So I am trying to hunt down the partitions that are reportedly large. How can I find these?
What are the settings to consider for lightweight transactions (compare-and-set) in Cassandra 2.1.8?
a. We are using a token-aware load balancing policy with LeveledCompactionStrategy on the table. The table has skinny rows with a single column in the primary key. We use prepared statements for all queries; they are prepared once and cached.
b. These are the settings:
i. Max heap 4 GB, new heap 1 GB, 4-core CPU, CentOS
ii. The connection pool is based on the concurrency settings for the test.
final PoolingOptions pools = new PoolingOptions();
pools.setNewConnectionThreshold(HostDistance.LOCAL, concurrency);
pools.setCoreConnectionsPerHost(HostDistance.LOCAL, maxConnections);
pools.setMaxConnectionsPerHost(HostDistance.LOCAL, maxConnections);
pools.setCoreConnectionsPerHost(HostDistance.REMOTE, maxConnections);
pools.setMaxConnectionsPerHost(HostDistance.REMOTE, maxConnections);
iii. Protocol version: V3
iv. TCP no-delay is set to true to disable Nagle's algorithm (the default).
v. Compression is enabled.
2. Throughput increases with concurrency on a single connection. For CAS RW, however, the throughput does not scale at the same rate as Simple RW:
100000 requests, 1 thread Simple RW CAS RW
Mean rate (ops/sec) 643 265.4
Mean latency (ms) 1.554 3.765
Median latency (ms) 1.332 2.996
75th percentile latency (ms) 1.515 3.809
95th percentile latency (ms) 2.458 8.121
99th percentile latency (ms) 5.038 11.52
Standard latency deviation 0.992 2.139
100000 requests, 25 threads Simple RW CAS RW
Mean rate (ops/sec) 7686 1881
Mean latency (ms) 3.25 13.29
Median latency (ms) 2.695 12.203
75th percentile latency (ms) 3.669 14.389
95th percentile latency (ms) 6.378 20.139
99th percentile latency (ms) 11.59 61.973
Standard latency deviation 3.065 6.492
The most important consideration for speed with LWTs is partition contention. If you have several concurrent updates on a single partition, it will be slower. Beyond that, you are looking at machine performance tuning.
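As an illustration (hypothetical table and values), a CAS write is any statement carrying an IF condition, and many of them racing on the same partition key will serialize against each other:
-- Hypothetical example: contention arises when many such updates
-- hit the same partition (the same id).
UPDATE account_settings
  SET email = 'new@example.com'
  WHERE id = 'user-42'
  IF email = 'old@example.com';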
There is a free, full course here to help with that: https://academy.datastax.com/courses/ds210-operations-and-performance-tuning
This is a followup question to this one: Why is my cassandra throughput not improving when I add nodes?
My schema currently looks like this (the blobs are roughly all the same size, about 140 bytes):
create keyspace nms WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 1 };
use nms;
CREATE TABLE qos(
hour timestamp,
qos int,
id int,
ts timestamp,
tz int,
data blob,
PRIMARY KEY ((hour, qos), id, ts));
In both scenarios, I have a single node. Other than the obvious IP address and storage locations, the Apache C* 2.1.5 config is out of the box.
When I run the client and single node in separate hosts, I get roughly 55K inserts/s. The cfhistograms output looks roughly like this:
nms/qos histograms
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 0.00 86.00 0.00 42510 535
75% 0.00 124.00 0.00 42510 642
95% 0.00 179.00 0.00 61214 1109
98% 0.00 215.00 0.00 61214 1109
99% 0.00 258.00 0.00 61214 1109
Min 0.00 4.00 0.00 150 3
Max 0.00 61214.00 0.00 61214 1109
When I run the client on the same host as the single node, I get roughly 90K inserts/s. A histogram snapshot looks like this (pretty much the same as above):
nms/qos histograms
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 0.00 86.00 0.00 42510 535
75% 0.00 103.00 0.00 42510 642
95% 0.00 179.00 0.00 61214 1109
98% 0.00 310.00 0.00 61214 1109
99% 0.00 535.00 0.00 61214 1109
Min 0.00 3.00 0.00 150 3
Max 0.00 126934.00 0.00 61214 1109
Why the big difference in insertion rates? I would have thought the rates would be equivalent, or better, in the split setup.
BTW, I see this odd behavior with all the permutations of hardware that I have available to me, so there is more to it than client horsepower.
Marc B, you are correct. If you see this and would like to post your comment as an answer, I will give you credit for it.
In more detail, what was happening was that while my connection to the network was 1 Gb, I was going through an unexpected 100 Mb router somewhere. Once I realized this and ensured all the moving parts were on the same 1 Gb network, my rates jumped to 180K inserts/s.
In case someone cares, the Linux command to check your interface speed is:
sudo ethtool eth0
The tool to test the speed between boxes is iperf.
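For example (the server address is a placeholder):
# On the receiving box:
iperf -s
# On the sending box:
iperf -c <server-ip>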