Cassandra Map collection type read latency

I have a three-node Cassandra cluster with replication factor 3 and consistency level LOCAL_QUORUM. My table has two MAP<BLOB, BLOB> columns. Each map contains up to 100 entries. I'm writing (appending) into both maps and reading from one of them (1R/1W per transaction).
After a few hours of writing and reading across 500k partitions, the table statistics were as follows:
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 10.00 35.43 2346.80 1597 60
75% 10.00 51.01 4055.27 2299 72
95% 12.00 105.78 17436.92 6866 215
98% 12.00 182.79 36157.19 6866 215
99% 12.00 454.83 52066.35 8239 215
Min 5.00 3.31 379.02 104 3
Max 14.00 186563.16 322381.14 9887 310
So far, so good. The next step was to create 30 million new partitions.
After about 15 hours of writing (in random partitions) I noticed a massive TPS drop (about 2k):
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 2.00 51.01 20924.30 1916 50
75% 3.00 73.46 43388.63 1916 60
95% 4.00 126.93 89970.66 1916 60
98% 4.00 219.34 107964.79 2299 72
99% 4.00 379.02 129557.75 6866 179
Min 0.00 3.97 51.01 104 3
Max 8.00 186563.16 322381.14 9887 310
When I performed the first test again across 500k partitions, read latency remained high:
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 5.00 51.01 30130.99 1916 60
75% 6.00 73.46 62479.63 1916 60
95% 7.00 152.32 129557.75 1916 60
98% 8.00 263.21 155469.30 3311 103
99% 8.00 545.79 186563.16 6866 179
Min 3.00 3.97 454.83 104 3
Max 10.00 107964.79 557074.61 9887 310
Read latency increases even more when the workload also writes a counter column (in another table):
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 10.00 42.51 62479.63 1916 50
75% 10.00 61.21 107964.79 1916 60
95% 12.00 105.78 186563.16 1916 60
98% 12.00 182.79 223875.79 3311 103
99% 12.00 379.02 268650.95 6866 179
Min 6.00 4.77 545.79 104 3
Max 14.00 129557.75 557074.61 9887 310
What would be the probable causes? Could the map column type be the root cause? Any suggestions (configuration or schema changes)?
I am using prepared statements:
Fetch row by partition ID
SELECT id,attr,uids FROM user_profile WHERE id=:id
Update map entries
UPDATE user_profile SET attr=attr+:attr, attr=attr-:attrstoremove, uids=uids+:newuserids, md=md+:metadata, md=md-:attrstoremove, up=:up WHERE id=:id
Increase counter
UPDATE user_profile_counter SET cnt=cnt+:cnt WHERE cnt_id=:cnt_id AND id=:id;
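For context, this is roughly how the map update above is prepared and bound with the 4.x Java driver. This is a sketch with illustrative class and variable names; note that the map-subtraction parameter (:attrstoremove) is bound as a set of keys, which is what CQL map subtraction expects, and the consistency level matches the LOCAL_QUORUM mentioned above.
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.BoundStatement;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import java.nio.ByteBuffer;
import java.time.Instant;
import java.util.Map;
import java.util.Set;

public class ProfileWriter {
    private final CqlSession session;
    private final PreparedStatement update;

    public ProfileWriter(CqlSession session) {
        this.session = session;
        // Same statement as above; map subtraction (attr - :attrstoremove) takes a SET of keys.
        this.update = session.prepare(
            "UPDATE user_profile SET attr=attr+:attr, attr=attr-:attrstoremove, "
          + "uids=uids+:newuserids, md=md+:metadata, md=md-:attrstoremove, up=:up WHERE id=:id");
    }

    public void appendToProfile(String profileId,
                                Map<ByteBuffer, ByteBuffer> newAttributes,
                                Map<ByteBuffer, ByteBuffer> newMetadata,
                                Set<ByteBuffer> keysToRemove,
                                Set<String> newUserIds) {
        BoundStatement bound = update.bind()
            .setMap("attr", newAttributes, ByteBuffer.class, ByteBuffer.class)
            .setMap("metadata", newMetadata, ByteBuffer.class, ByteBuffer.class)
            .setSet("attrstoremove", keysToRemove, ByteBuffer.class)
            .setSet("newuserids", newUserIds, String.class)
            .setInstant("up", Instant.now())
            .setString("id", profileId)
            .setConsistencyLevel(DefaultConsistencyLevel.LOCAL_QUORUM);
        session.execute(bound);
    }
}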
This is my schema:
CREATE TABLE IF NOT EXISTS PROFILING.USER_PROFILE
(
-- PARTITION KEY
ID TEXT, -- PROFILE ID
-- DATA
ATTR MAP<BLOB, BLOB>, --USER ATTRIBUTES
MD MAP<BLOB, BLOB>, --METADATA PER ATTRIBUTE
UIDS SET<TEXT>,
UP TIMESTAMP, --LAST_UPDATE
PRIMARY KEY (ID)
) WITH caching = {
'keys' : 'ALL',
'rows_per_partition' : '1'
};
CREATE TABLE IF NOT EXISTS PROFILES.USER_PROFILE_COUNTER
(
-- PARTITION KEY
ID TEXT, -- PROFILE ID
-- CLUSTERING KEY
CNT_ID BLOB, -- COUNTER ID
-- DATA
CNT COUNTER, -- COUNTER VALUE
PRIMARY KEY (ID, CNT_ID)
) WITH caching = {
'keys' : 'ALL',
'rows_per_partition' : '10' };
The data are encrypted. Here is a sample row (consisting of three map entries):
YkceUdD6qEvOLw3Wgd8zWA |
{0x95f56f594522: 0xacb7f42c7f0ac8187f17a8f2c04e5065,
 0xa365a3dc007d: 0x24252727706b5065f9e1f65efec7ced8,
 0xf0d55b110f87: 0x5a5ef3b0a041af8c7acf4040333afc96} |
{0x95f56f594522: 0x000d31363333333334363636323639,
 0xa365a3dc007d: 0x000d31363333333431323938363735,
 0xf0d55b110f87: 0x000d31363333333431323938363735} |
{'46TyNYCKTplibRyAfFsNRPQbvfQINNIIY4WmItuPayfvjDjEp49bnXSXLmD9hAm9'} |
2021-10-04 09:54:58.675000+0000
Cluster info (per node)
Total memory 18GB (heap 5GB)
6 CPU cores
Versions
Cassandra 3.11
DataStax Java driver 4.11.3
JDK 16
Edit
After a few more tests, I observed a high number of disk IOPS, most of them reads: roughly 20k IOPS and 1500 MB/s of bandwidth. I tried to reduce the number of SSTables touched per read by switching to LeveledCompactionStrategy and lowering the chunk_length_in_kb parameter to 4KB, but it made little difference.
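For reference, those changes were applied with statements along these lines (a sketch via the Java driver, using the values mentioned above; the same ALTERs can be run from cqlsh):
import com.datastax.oss.driver.api.core.CqlSession;

public class ApplySchemaTweaks {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Leveled compaction to reduce the number of SSTables touched per read.
            session.execute("ALTER TABLE profiling.user_profile "
                + "WITH compaction = {'class': 'LeveledCompactionStrategy'}");
            // Smaller compression chunks so a small-partition read decompresses less data.
            session.execute("ALTER TABLE profiling.user_profile "
                + "WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 4}");
            // Existing SSTables only pick up new compression settings after a rewrite,
            // e.g. nodetool upgradesstables -a.
        }
    }
}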
Note that SAN storage is used. I know it's an anti-pattern, but unfortunately we have no other option. Any ideas?

So to at least provide some theories before the weekend, let me just touch on a couple of things.
JDK 16
Cassandra 3.11 won't run on anything higher than JDK 8, so I'm guessing this is on the client side?
Total memory 18GB (heap 5GB)
I suspect that the nodes would perform better with a larger heap. On an 18GB instance, I would bump that up to somewhere around 8GB-10GB. 9GB will give you a happy 50% of RAM, so that might work well.
If the heap ends up exceeding 8GB, I would go with G1GC (Cassandra 3 still defaults to CMS). If you feel more comfortable staying on CMS, then I'd make the heap new size (Xmn) somewhere around 40% to 50% of the max (Xmx) and initial (Xms) sizes.
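With a stock install, that translates into something like this in conf/jvm.options (illustrative values only; by default cassandra-env.sh computes the heap size for you):
# conf/jvm.options (illustrative values only)
-Xms9G
-Xmx9G
# If staying on CMS, also size the young generation (~40-50% of Xmx):
-Xmn4G
# If moving to G1 instead, remove the CMS flags and the -Xmn line, and add:
# -XX:+UseG1GC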
three-node Cassandra cluster with replication factor 3
It's also quite possible that you need more than 3 nodes to handle this load. The Cassandra nodes might be filling up with write backpressure, and dying a slow death. After bumping the heap size, I'd try this next.
Have you had a look at the disk I/O metrics (IOPS in particular)? The partition sizes are good, and so is the query, so maybe the disk is slow?
Not sure if you're running on shared infra or not (e.g. OpenStack), but one thing I've run into in the past was noisy neighbors: another instance sharing your compute or storage and taking up all the bandwidth. I'd check with your infrastructure team and see if that's an issue.
Also, while working on that shared infra environment, I once ran into an issue where we found out that the storage scheduler had put all drives for one cluster on the same physical volume. Hence, we didn't have any real hardware availability, plus the cluster's high write throughput was basically choking itself out.
Have a look at some of that, and see if you find anything.
Edit 20211114
Yes, provisioning Cassandra's drives on a SAN makes it difficult to get good performance. I can think of two things to try:
Direct your commitlog to a different drive (hopefully a local, non-SAN) than the data.
Provision each node’s data drive to a different SAN.
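Both of those come down to two settings in cassandra.yaml (the paths below are just examples):
# cassandra.yaml (example paths)
commitlog_directory: /local-ssd/cassandra/commitlog
data_file_directories:
    - /san-volume/cassandra/data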

Related

High latency over time

I am running a Node.js application which uses Redis and the Sequelize library (to connect to MySQL). The application runs on Cloud Run. In the morning, when transactions start, responses are fast, but as time passes the 50th percentile response time stays under 1 second, whereas my 95th and 99th percentile response times run up to 15 seconds, resulting in very high latency. Yet memory stays at 20% of 512MB. Also, my 95th and 99th percentiles show more than 80% CPU, while my 50th percentile is below 30%. What could be the issue? Is it due to memory paging or some other reason?

Ask for cassandra benchmark result

I recently used 9 Cassandra VMs on OpenStack to test our product. Each VM has 16 vCPUs, a 50GB SSD, and 20GB of memory, but I find each node can only sustain 10,000+ operations/second at around 70% CPU, i.e. roughly 90,000 ops/second across the 9 nodes.
The data model is a simple mixed read/write scenario on normal tables, and I haven't seen any obvious performance bottleneck during the test. From the internet I can see that some people achieve 4,000 ops/s on AWS t2.medium nodes (only 2 vCPUs), and some Cassandra training materials say they achieve 6,000-12,000 transactions per second.
Can anyone share your benchmark results on Apache Cassandra?
First of all, Alex is right. Schema (specifically primary key definitions) matters. The rest of this answer assumes your schema is free of those anti-patterns.
So the standard deployment image which I use for OpenStack is 16GB RAM w/ 8 CPUs (@ 2.6GHz). That's less RAM than I'd recommend for most production deploys, unless you have some extra time to engineer for efficiency. And yes, there are some clusters where that just wasn't enough and we had to build with more RAM. But this has largely been our standard for about 4 years.
The approach of many small nodes has worked well. There are clusters I've built which have sustained 250k ops/sec.
with 70% CPU
TBH, I've found that CPU with Cassandra doesn't matter as much as it does with other databases. When it gets high, it's usually an indicator of another problem.
I haven't seen any obvious performance bottleneck during the test.
On shared resource environments (like OpenStack) noisy neighbors are one of the biggest problems. Our storage team has imposed IOPs limits on provisioned disks, in an attempt to keep heavy loads from affecting others. Therefore, our top performing clusters required specially-configured volumes to allow levels of IOPs higher than what would normally be allowed.
Cassandra's metrics can tell you if your disk latency is high. If you see that your disk (read or write) latency is in double-digit milliseconds, then your disk is likely rate-limiting you.
Another thing to look at is your table's histograms (with nodetool). That can give you all sorts of good info, specifically around things like latency and partition sizes.
bin/nodetool tablehistograms stackoverflow.stockquotes
stackoverflow/stockquotes histograms
Percentile Read Latency Write Latency SSTables Partition Size Cell Count
(micros) (micros) (bytes)
50% 0.00 0.00 0.00 124 5
75% 0.00 0.00 0.00 124 5
95% 0.00 0.00 0.00 124 5
98% 0.00 0.00 0.00 124 5
99% 0.00 0.00 0.00 124 5
Min 0.00 0.00 0.00 104 5
Max 0.00 0.00 0.00 124 5
If you look at the size of your usual partition, you can get an idea of how to optimize the table's chunk size. This value represents the size of the building blocks the table uses to interact with the disk.
AND compression = {'chunk_length_in_kb': '64',
'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
For instance, in the case above, I could save a lot on disk payloads just by setting my chunk_length_in_kb down to 1 (the minimum), since my partitions are all less than 1024 bytes.
In any case, give your disk stats a look and see if there are some "wins" to be had there.

How to Reduce load on cassandra server so to avoid NoHostAvailable Exceptions

I am running a 5-node Apache Cassandra cluster (3.11.4) with 48GB of RAM, a 12GB heap, and 6 vCPUs per node. I can see a lot of load (18GB) on the Cassandra nodes even when no data is being processed, and a lot of GC pauses, because of which I get "NoHostAvailable" exceptions when I try to push data to Cassandra.
Please suggest how I can reduce this load and avoid the "NoHostAvailable" connection failures.
ID : a65c8072-636a-480d-8774-2c5704361bec
Gossip active : true
Thrift active : true
Native Transport active: true
Load : 18.07 GiB
Generation No : 1576158587
Uptime (seconds) : 205965
Heap Memory (MB) : 3729.16 / 11980.81
Off Heap Memory (MB) : 12.81
Data Center : dc1
Rack : rack1
Exceptions : 21
Key Cache : entries 2704, size 5.59 MiB, capacity 100 MiB, 1966 hits, 4715 requests, 0.417 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 bytes, capacity 50 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache : entries 25, size 1.56 MiB, capacity 480 MiB, 4207149 misses, 4342386 requests, 0.031 recent hit rate, NaN microseconds miss latency
Percent Repaired : 34.58708788430304%
Token : (invoke with -T/--tokens to see all 256 tokens)
If you have 48GB of RAM, I recommend a heap of at least 16GB or 20GB. Make sure that you are using G1 GC (it is not the default, so it has to be enabled in the JVM options).
But NoHostAvailable may also depend on the consistency level you are using, among other factors.
On the other hand, you might consider throttling your application; sometimes pushing data more slowly leads to better overall throughput.
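If you go the throttling route, one simple approach (a sketch, assuming the 3.x Java driver, which is what throws NoHostAvailableException) is to cap the number of in-flight asynchronous requests:
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.google.common.util.concurrent.MoreExecutors;
import java.util.concurrent.Semaphore;

public class ThrottledWriter {
    private final Session session;
    // Cap on concurrent in-flight requests; tune this to what the cluster actually sustains.
    private final Semaphore inFlight = new Semaphore(128);

    public ThrottledWriter(Session session) {
        this.session = session;
    }

    public void write(Statement statement) throws InterruptedException {
        inFlight.acquire(); // blocks the producer once 128 requests are already pending
        ResultSetFuture future = session.executeAsync(statement);
        future.addListener(inFlight::release, MoreExecutors.directExecutor());
    }
}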

Identifying the large partition in cassandra

This is a continuation of an earlier question I asked:
NoSpamLogger.java Maximum memory usage reached Cassandra
Based on that thread I re-partitioned my data to be minute-wise instead of hourly. This has improved the stats for that table.
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 35.00 73.46 3449259.15 219342 72
75% 35.00 88.15 4966933.18 943127 258
95% 35.00 182.79 4966933.18 5839588 1109
98% 35.00 315.85 4966933.18 17436917 3973
99% 35.00 379.02 4966933.18 36157190 11864
Min 30.00 20.50 2874382.63 51 0
Max 35.00 2346.80 4966933.18 3449259151 2816159
As you can see, the 99th percentile partition size is less than 40MB, but the maximum partition size is still reported as 3.44GB.
Also, I continue to see the 'Maximum memory usage reached' error in system.log every couple of days after a cluster restart.
So I am trying to hunt down the partitions that are reportedly large. How can I find these?

Cassandra settings tuning for CAS updates

What are the settings to consider for lightweight transactions (compare-and-set) in Cassandra 2.1.8?
a. We are using a token-aware load balancing policy with LeveledCompactionStrategy set on the table. The table has skinny rows with a single column in the primary key. We use prepared statements for all queries; they are prepared once and cached.
b. These are the settings:
i. Max heap: 4G, new heap: 1G, 4-core CPU, CentOS
ii. Connection pool is based on the concurrency settings for the test.
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;

// Pool sizing driven by the test's concurrency settings
final PoolingOptions pools = new PoolingOptions();
pools.setNewConnectionThreshold(HostDistance.LOCAL, concurrency);
pools.setCoreConnectionsPerHost(HostDistance.LOCAL, maxConnections);
pools.setMaxConnectionsPerHost(HostDistance.LOCAL, maxConnections);
pools.setCoreConnectionsPerHost(HostDistance.REMOTE, maxConnections);
pools.setMaxConnectionsPerHost(HostDistance.REMOTE, maxConnections);
iii. Protocol version: V3
iv. TCP no-delay is set to true to disable Nagle's algorithm (the default).
v. Compression is enabled.
2. Throughput increases with concurrency on a single connection. For CAS RW, however, the throughput does not scale at the same rate as Simple RW.
100000 requests, 1 thread Simple RW CAS RW
Mean rate (ops/sec) 643 265.4
Mean latency (ms) 1.554 3.765
Median latency (ms) 1.332 2.996
75th percentile latency (ms) 1.515 3.809
95th percentile latency (ms) 2.458 8.121
99th percentile latency (ms) 5.038 11.52
Standard latency deviation 0.992 2.139
100000 requests, 25 threads Simple RW CAS RW
Mean rate (ops/sec) 7686 1881
Mean latency (ms) 3.25 13.29
Median latency (ms) 2.695 12.203
75th percentile latency (ms) 3.669 14.389
95th percentile latency (ms) 6.378 20.139
99th percentile latency (ms) 11.59 61.973
Standard latency deviation 3.065 6.492
The most important consideration for speed on LWT is partition contention. If several updates hit a single partition concurrently, they will be slower. Beyond that, you are looking at machine performance tuning.
There is a free, full course here to help with that: https://academy.datastax.com/courses/ds210-operations-and-performance-tuning
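To make the contention point concrete: a lightweight transaction is just a conditional statement, and the driver reports whether the condition held. A minimal sketch with the pre-4.0 Java driver (table and column names are hypothetical):
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class CasExample {
    // Hypothetical table: CREATE TABLE ks.settings (id text PRIMARY KEY, val text)
    static boolean compareAndSet(Session session, String id, String expected, String newVal) {
        ResultSet rs = session.execute(
            "UPDATE ks.settings SET val = ? WHERE id = ? IF val = ?",
            newVal, id, expected);
        // wasApplied() is false when the condition did not hold, i.e. another writer won the race.
        // Many concurrent CAS writes against the same id (partition) serialize behind Paxos,
        // which is why per-partition contention dominates LWT latency.
        return rs.wasApplied();
    }
}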
