I am running a 5-node Apache Cassandra cluster (3.11.4) with 48 GB RAM, a 12 GB heap, and 6 vCPUs per node. I can see a lot of load (18 GB) on the Cassandra nodes even when no data processing is going on. I also see a lot of GC pauses, and because of them I get "NoHostAvailable" exceptions when I try to push data to Cassandra.
Please suggest how I can reduce this load and avoid the "NoHostAvailable" connection failures. Here is the nodetool info output for one node:
ID : a65c8072-636a-480d-8774-2c5704361bec
Gossip active : true
Thrift active : true
Native Transport active: true
Load : 18.07 GiB
Generation No : 1576158587
Uptime (seconds) : 205965
Heap Memory (MB) : 3729.16 / 11980.81
Off Heap Memory (MB) : 12.81
Data Center : dc1
Rack : rack1
Exceptions : 21
Key Cache : entries 2704, size 5.59 MiB, capacity 100 MiB, 1966 hits, 4715 requests, 0.417 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 bytes, capacity 50 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache : entries 25, size 1.56 MiB, capacity 480 MiB, 4207149 misses, 4342386 requests, 0.031 recent hit rate, NaN microseconds miss latency
Percent Repaired : 34.58708788430304%
Token : (invoke with -T/--tokens to see all 256 tokens)
If you have 48 GB of RAM, I recommend going to a heap of at least 16 GB or 20 GB. Also make sure that you are using G1 GC (Cassandra 3.11 still defaults to CMS; G1 generally behaves better for heaps larger than 8 GB).
But NoHostAvailable may also depend on the consistency level that you are using, among other factors.
On the other hand, you may consider throttling your application; sometimes pushing data more slowly leads to better overall throughput.
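For the heap and GC change, this is roughly what it looks like in Cassandra 3.11's jvm.options (a sketch only; the 16 GB heap is just one of the values suggested above and should be tuned for your workload):
# jvm.options (sketch): fixed heap and G1 instead of the stock CMS settings
-Xms16G
-Xmx16G
# Comment out the CMS flags that ship enabled by default, e.g.:
#-XX:+UseParNewGC
#-XX:+UseConcMarkSweepGC
# Enable the G1 section instead:
-XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500
Each node needs a restart for the new heap and GC settings to take effect.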
Related
I am running a Node.js application that uses Redis and the Sequelize library (to connect to MySQL). The application runs on Cloud Run. In the morning, when transactions start, responses are fast. But as time passes, the 50th-percentile response time stays under 1 second, whereas my 95th- and 99th-percentile response times climb to as much as 15 seconds, resulting in very high latency. Memory stays at about 20% of 512 MB. Also, CPU is above 80% at the 95th and 99th percentiles but below 30% at the 50th percentile. What could be the issue? Is it due to memory paging or some other reason?
I have a three-node Cassandra cluster with replication factor 3 and consistency level LOCAL_QUORUM. My table consists of two columns of type MAP<BLOB, BLOB>. Each map contains up to 100 entries. I'm writing (appending) to both maps and reading from one (1 read/1 write per transaction).
After a few hours of writing and reading across 500k partitions, the table statistics were as follows:
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 10.00 35.43 2346.80 1597 60
75% 10.00 51.01 4055.27 2299 72
95% 12.00 105.78 17436.92 6866 215
98% 12.00 182.79 36157.19 6866 215
99% 12.00 454.83 52066.35 8239 215
Min 5.00 3.31 379.02 104 3
Max 14.00 186563.16 322381.14 9887 310
So far, so good. The next step was to create 30 million new partitions.
After about 15 hours of writing (to random partitions), I noticed a massive TPS drop (about 2k):
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 2.00 51.01 20924.30 1916 50
75% 3.00 73.46 43388.63 1916 60
95% 4.00 126.93 89970.66 1916 60
98% 4.00 219.34 107964.79 2299 72
99% 4.00 379.02 129557.75 6866 179
Min 0.00 3.97 51.01 104 3
Max 8.00 186563.16 322381.14 9887 310
Performing the first test again across 500k partitions, read latency remained high:
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 5.00 51.01 30130.99 1916 60
75% 6.00 73.46 62479.63 1916 60
95% 7.00 152.32 129557.75 1916 60
98% 8.00 263.21 155469.30 3311 103
99% 8.00 545.79 186563.16 6866 179
Min 3.00 3.97 454.83 104 3
Max 10.00 107964.79 557074.61 9887 310
Read latency increases even more when the workload also involves writing a counter column (into another table):
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 10.00 42.51 62479.63 1916 50
75% 10.00 61.21 107964.79 1916 60
95% 12.00 105.78 186563.16 1916 60
98% 12.00 182.79 223875.79 3311 103
99% 12.00 379.02 268650.95 6866 179
Min 6.00 4.77 545.79 104 3
Max 14.00 129557.75 557074.61 9887 310
What would be the probable causes? Could the map column type be the root cause? Any suggestions (configuration or schema changes)?
I am using prepared statements:
Fetch row by partition ID
SELECT id,attr,uids FROM user_profile WHERE id=:id
Update map entries
UPDATE user_profile SET attr=attr+:attr, attr=attr-:attrstoremove, uids=uids+:newuserids, md=md+:metadata, md=md-:attrstoremove, up=:up WHERE id=:id
Increase counter
UPDATE user_profile_counter SET cnt=cnt+:cnt WHERE cnt_id=:cnt_id AND id=:id;
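For context, this is roughly how the fetch statement is prepared and bound with the 4.x Java driver (a minimal sketch; the session setup and the sample profile ID are placeholders, not the production code):
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.BoundStatement;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.Row;
import java.nio.ByteBuffer;
import java.util.Map;

public class ProfileFetch {
    public static void main(String[] args) {
        // Contact points and local DC come from application.conf in driver 4.x
        try (CqlSession session = CqlSession.builder().withKeyspace("profiling").build()) {
            PreparedStatement fetch =
                    session.prepare("SELECT id, attr, uids FROM user_profile WHERE id = :id");
            // Bind the partition key by name; the ID below is just a sample value
            BoundStatement bound = fetch.bind().setString("id", "YkceUdD6qEvOLw3Wgd8zWA");
            Row row = session.execute(bound).one();
            if (row != null) {
                // attr is MAP<BLOB, BLOB>, so it comes back as ByteBuffer -> ByteBuffer
                Map<ByteBuffer, ByteBuffer> attr =
                        row.getMap("attr", ByteBuffer.class, ByteBuffer.class);
                System.out.println("attr entries: " + (attr == null ? 0 : attr.size()));
            }
        }
    }
}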
This is my schema:
CREATE TABLE IF NOT EXISTS PROFILING.USER_PROFILE
(
-- PARTITION KEY
ID TEXT, -- PROFILE ID
-- DATA
ATTR MAP<BLOB, BLOB>, --USER ATTRIBUTES
MD MAP<BLOB, BLOB>, --METADATA PER ATTRIBUTE
UIDS SET<TEXT>,
UP TIMESTAMP, --LAST_UPDATE
PRIMARY KEY (ID)
) WITH caching = {
'keys' : 'ALL',
'rows_per_partition' : '1'
};
CREATE TABLE IF NOT EXISTS PROFILES.USER_PROFILE_COUNTER
(
-- PARTITION KEY
ID TEXT, -- PROFILE ID
-- CLUSTERING KEY
CNT_ID BLOB, -- COUNTER ID
-- DATA
CNT COUNTER, -- COUNTER VALUE
PRIMARY KEY (ID, CNT_ID)
) WITH caching = {
'keys' : 'ALL',
'rows_per_partition' : '10' };
The data are encrypted. Here is a sample row (consisting of three map entries):
YkceUdD6qEvOLw3Wgd8zWA |{0x95f56f594522:
0xacb7f42c7f0ac8187f17a8f2c04e5065, 0xa365a3dc007d:
0x24252727706b5065f9e1f65efec7ced8, 0xf0d55b110f87:
0x5a5ef3b0a041af8c7acf4040333afc96} |
{0x95f56f594522:
0x000d31363333333334363636323639, 0xa365a3dc007d:
0x000d31363333333431323938363735, 0xf0d55b110f87:
0x000d31363333333431323938363735} |
{'46TyNYCKTplibRyAfFsNRPQbvfQINNIIY4WmItuPayfvjDjEp49bnXSXLmD9hAm9'} |
2021-10-04 09:54:58.675000+0000
Cluster info (per node)
Total memory 18 GB (heap 5 GB)
6 CPU cores
Versions
Cassandra 3.11
DataStax Java driver 4.11.3
JDK 16
Edit
After a few more tests, a high number of disk IOPS was observed, most of them read operations. More specifically, about 20k IOPS and 1500 MB/s of bandwidth were observed. I tried to reduce the number of SSTables touched by switching to LeveledCompactionStrategy and lowering the chunk_length_in_kb parameter to 4 KB, but it did not make much difference.
Note that SAN storage is used. I know it's an anti-pattern, but unfortunately we have no other option. Any ideas?
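For reference, the compaction and compression changes were applied with statements roughly like these (a sketch; LZ4Compressor is assumed as the compressor):
-- Switch the table to LCS and shrink the compression chunk size to 4 KB
ALTER TABLE profiling.user_profile
  WITH compaction = { 'class' : 'LeveledCompactionStrategy' };
ALTER TABLE profiling.user_profile
  WITH compression = { 'class' : 'LZ4Compressor', 'chunk_length_in_kb' : 4 };
-- Existing SSTables only pick up the new chunk size after a rewrite:
-- nodetool upgradesstables -a profiling user_profile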
So to at least provide some theories before the weekend, let me just touch on a couple of things.
JDK 16
Cassandra 3.11 won't run on anything higher than JDK 8, so I'm guessing this is from the client side?
Total memory 18 GB (heap 5 GB)
I suspect that the nodes would perform better with a larger heap. On an 18 GB instance, I would bump that up to somewhere around 8 GB to 10 GB; 9 GB would give you a comfortable 50% of RAM, so that might work well.
If the heap size ends up exceeding 8 GB, I would go with G1GC (Cassandra 3 still defaults to CMS). If you feel more comfortable staying on CMS, then I'd make the heap new size (Xmn) somewhere around 40% to 50% of the max (Xmx) and initial (Xms) sizes.
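In cassandra-env.sh terms, that is roughly the following (a sketch; the exact numbers are assumptions to tune against your GC logs):
# cassandra-env.sh (sketch): 9 GB heap, new generation at roughly 45% of the heap (CMS only)
MAX_HEAP_SIZE="9G"
HEAP_NEWSIZE="4G"
# If you enable G1 in jvm.options instead, do not set HEAP_NEWSIZE/-Xmn;
# G1 sizes the young generation itself to hit its pause-time target.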
three-node Cassandra cluster with replication factor 3
It's also quite possible that you need more than 3 nodes to handle this load. The Cassandra nodes might be filling up with write backpressure, and dying a slow death. After bumping the heap size, I'd try this next.
Have you had a look at the disk I/O metrics (IOPS in particular)? The partition sizes are good, and so is the query, so maybe the disk is slow?
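If it helps, a quick way to check (assuming the sysstat package is installed for iostat) is something like:
# Extended per-device stats in MB/s every 5 seconds; watch r/s, w/s, await and %util
iostat -x -m 5
# Cassandra's own view of disk pressure: pending compactions and flushes
nodetool compactionstats
nodetool tpstats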
Not sure if you're running on shared infrastructure or not (e.g., OpenStack), but one thing I've run into in the past is noisy neighbors: instances that you're sharing compute or storage with end up taking all the bandwidth. I'd check with your infrastructure team to see if that's an issue.
Also, while working on that shared infra environment, I once ran into an issue where we found out that the storage scheduler had put all drives for one cluster on the same physical volume. Hence, we didn't have any real hardware availability, plus the cluster's high write throughput was basically choking itself out.
Have a look at some of that, and see if you find anything.
Edit 20211114
Yes, provisioning Cassandra's drives on a SAN is going to make it hard to get good performance. I can think of two things to try (a config sketch follows the list):
Direct your commitlog to a different drive than the data (ideally a local, non-SAN disk).
Provision each node's data drive on a different SAN.
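For the first item, the relevant cassandra.yaml settings look roughly like this (the paths are placeholders for whatever local and SAN mounts you have):
# cassandra.yaml (sketch): commitlog on a local disk, data files on the SAN volume
commitlog_directory: /local-disk/cassandra/commitlog
data_file_directories:
    - /san-volume/cassandra/data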
My Cassandra application consists primarily of counter writes and reads, so having a counter cache is important for performance. I increased the counter cache size in cassandra.yaml from 1000 to 3500 and restarted the Cassandra service. The results were not what I expected: disk use went way up, throughput went way down, and it appears the counter cache is not being used at all, based on what I'm seeing in nodetool info (see below). It's been almost two hours now and performance is still very bad.
I saw this same pattern yesterday when I increased the counter cache from 0 to 1000. It went quite a while without using the counter cache at all, and then for some reason it started using it. My question is whether there is something I need to do to activate counter cache utilization.
Here are my settings in cassandra.yaml for the counter cache:
counter_cache_size_in_mb: 3500
counter_cache_save_period: 7200
counter_cache_keys_to_save: (currently left unset)
Here's what I get out of nodetool info after about 90 minutes:
Gossip active : true
Thrift active : false
Native Transport active: false
Load : 1.64 TiB
Generation No : 1559914322
Uptime (seconds) : 6869
Heap Memory (MB) : 15796.00 / 20480.00
Off Heap Memory (MB) : 1265.64
Data Center : WDC07
Rack : R10
Exceptions : 0
Key Cache : entries 1345871, size 1.79 GiB, capacity 1.95 GiB, 67936405 hits, 83407954 requests, 0.815 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 5294462, size 778.34 MiB, capacity 3.42 GiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache : entries 24064, size 1.47 GiB, capacity 1.47 GiB, 65602315 misses, 83689310 requests, 0.216 recent hit rate, 3968.677 microseconds miss latency
Percent Repaired : 8.561186035383143%
Token : (invoke with -T/--tokens to see all 256 tokens)
Here's a nodetool info on the Counter Cache prior to increasing the size:
Counter Cache : entries 6802239, size 1000 MiB, capacity 1000 MiB, 57154988 hits, 435820358 requests, 0.131 recent hit rate, 7200 save period in seconds
Update:
I've been running for several days now, trying various values of the counter cache size on various nodes. It is consistent: the counter cache isn't enabled until it reaches capacity. That's just how it works, as far as I can tell. If anybody knows a way to enable the cache before it is full, let me know. I'm setting it very high because that seems optimal, but it means the cache is down for several hours while it fills up, and while it's down my disks are absolutely maxed out with read requests...
Another update:
Further running shows that occasionally the counter cache does kick in before it fills up. I really don't know why that is. I don't see a pattern yet. I would love to know the criteria for when this does and does not work.
One last update:
While the counter cache is filling up, native transport is disabled for the node as well. With the counter cache set to 3.5 GB, I'm now 24 hours into this low-performance state with native transport disabled.
I have found a way to avoid, 100% of the time, the counter cache not being enabled and native transport being disabled. This approach avoids the serious performance problems I encountered while waiting for the counter cache to enable (sometimes for hours in my case, since I want a large counter cache):
1. Prior to starting Cassandra, set the cassandra.yaml field counter_cache_size_in_mb to 0.
2. After starting Cassandra and getting it up and running, use nodetool commands to set the cache sizes:
Example command:
nodetool setcachecapacity 2000 0 1000
In this example, the first value (2000) sets the key cache size in MB, the second value (0) is the row cache size, and the third value (1000) is the counter cache size.
Take measurements and decide if those are the optimal values. If not, repeat step 2 with new values as needed, without restarting Cassandra. A quick way to verify the new capacity is shown below.
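To verify the new capacity is active and the cache is being used (the grep just matches the nodetool info line shown earlier):
nodetool info | grep "Counter Cache"
# The hits/requests counters should start climbing once the cache is actually in use.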
Further details:
Some things that don't work:
Setting the counter_cache_size_in_mb value while the counter cache is not yet enabled. This is the case where you started Cassandra with a non-zero value for counter_cache_size_in_mb in cassandra.yaml and you have not yet reached that size threshold. If you do this, the counter cache will never be enabled. Just don't do this. I would call it a defect, but it is the way things currently work.
Testing that I did:
I tested this on five separate nodes, multiple times, with multiple values, both right after Cassandra came up and after it had been running for some time. The method I have described worked in every case. I guess I should have saved some screenshots of nodetool info to show the results.
One last thing: if any Cassandra developers are watching, could they please consider tweaking the code so that this workaround isn't necessary?
I have 6 nodes, 1 Solr and 5 Spark nodes, using DataStax. My cluster is on a server similar to Amazon EC2, with EBS volumes. Each node has 3 EBS volumes, which are composed into a logical data disk using LVM. In OpsCenter the same node frequently becomes unresponsive, which leads to connect timeouts in my data system. My data volume is around 400 GB with 3 replicas. I have 20 streaming jobs with a batch interval of one minute. Here are my error messages:
/var/log/cassandra/output.log:WARN 13:44:31,868 Not marking nodes down due to local pause of 53690474502 > 5000000000
/var/log/cassandra/system.log:WARN [GossipTasks:1] 2016-09-25 16:40:34,944 FailureDetector.java:258 - Not marking nodes down due to local pause of 64532052919 > 5000000000
/var/log/cassandra/system.log:WARN [GossipTasks:1] 2016-09-25 16:59:12,023 FailureDetector.java:258 - Not marking nodes down due to local pause of 66027485893 > 5000000000
/var/log/cassandra/system.log:WARN [GossipTasks:1] 2016-09-26 13:44:31,868 FailureDetector.java:258 - Not marking nodes down due to local pause of 53690474502 > 5000000000
EDIT:
These are my more specific configurations. I would like to know whether I am doing something wrong, and if so, how I can find out in detail what it is and how to fix it.
Our heap is set to:
MAX_HEAP_SIZE="16G"
HEAP_NEWSIZE="4G"
Current heap:
[root@iZ11xsiompxZ ~]# jstat -gc 11399
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT
0.0 196608.0 0.0 196608.0 6717440.0 2015232.0 43417600.0 23029174.0 69604.0 68678.2 0.0 0.0 1041 131.437 0 0.000 131.437
[root@iZ11xsiompxZ ~]# jmap -heap 11399
Attaching to process ID 11399, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.102-b14
using thread-local object allocation.
Garbage-First (G1) GC with 23 thread(s)
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 51539607552 (49152.0MB)
NewSize = 1363144 (1.2999954223632812MB)
MaxNewSize = 30920409088 (29488.0MB)
OldSize = 5452592 (5.1999969482421875MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 21807104 (20.796875MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 17592186044415 MB
G1HeapRegionSize = 16777216 (16.0MB)
Heap Usage:
G1 Heap:
regions = 3072
capacity = 51539607552 (49152.0MB)
used = 29923661848 (28537.427757263184MB)
free = 21615945704 (20614.572242736816MB)
58.059545404588185% used
G1 Young Generation:
Eden Space:
regions = 366
capacity = 6878658560 (6560.0MB)
used = 6140461056 (5856.0MB)
free = 738197504 (704.0MB)
89.26829268292683% used
Survivor Space:
regions = 12
capacity = 201326592 (192.0MB)
used = 201326592 (192.0MB)
free = 0 (0.0MB)
100.0% used
G1 Old Generation:
regions = 1443
capacity = 44459622400 (42400.0MB)
used = 23581874200 (22489.427757263184MB)
free = 20877748200 (19910.572242736816MB)
53.04110320109241% used
40076 interned Strings occupying 7467880 bytes.
I don't know why this happens. Thanks a lot.
The message you see, Not marking nodes down due to local pause, is due to the JVM pausing. Although you're doing some good things here by posting JVM information, a good place to start is often just looking at /var/log/cassandra/system.log and checking for things such as ERROR and WARN entries. Also check the length and frequency of GC events by grepping for GCInspector.
Tools such as nodetool tpstats are your friend here, for seeing if you have backed-up or dropped mutations, blocked flush writers, and so on.
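A minimal first pass might look like this (log path as in the output above):
# Errors, warnings and GC pauses reported by Cassandra itself
grep -e ERROR -e WARN /var/log/cassandra/system.log | tail -50
grep GCInspector /var/log/cassandra/system.log | tail -20
# Thread pool backlog: check the Pending/Blocked columns and the dropped-message summary
nodetool tpstats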
Docs here have some good things to check for: https://docs.datastax.com/en/landing_page/doc/landing_page/troubleshooting/cassandra/cassandraTrblTOC.html
Also check that your nodes have the recommended production settings; this is often overlooked:
http://docs.datastax.com/en/landing_page/doc/landing_page/recommendedSettingsLinux.html
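The usual items from that checklist are disabling swap and raising kernel and resource limits; roughly (a sketch based on the DataStax recommendations; adjust paths and the user name for your install):
# Disable swap so the JVM is never paged out (also remove swap entries from /etc/fstab)
sudo swapoff --all

# /etc/sysctl.d/cassandra.conf
vm.max_map_count = 1048575

# /etc/security/limits.d/cassandra.conf (user running Cassandra)
cassandra - memlock unlimited
cassandra - nofile  100000
cassandra - nproc   32768
cassandra - as      unlimited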
Also, one thing to note: Cassandra is rather I/O sensitive, and "normal" EBS might not be fast enough for what you need here. Throw Solr into the mix too, and you can see a lot of I/O contention when a Cassandra compaction and a Lucene merge go for the disk at the same time.
I'm investigating a production Cassandra 1.1 performance problem:
Background: read latencies are going above a second. The ring is spread over 2 data centers, 5 nodes in each, on the east and west coasts. The nodes have 64GB of RAM. Row caching is disabled, the JVM heap size is set to 8GB, key caching is enabled with a max capacity of 2GB.
Problem: the key cache hit rate is abysmal, nearly 0%, and despite all the misses, the cache is not filling up:
(from "nodetool info", here's the key cache info for 2 of the nodes):
Key Cache : size 172992 (bytes), capacity 2147483616 (bytes), 112226 hits, 81631832 requests, 0.000 recent hit rate, 14400 save period in seconds
Key Cache : size 166896 (bytes), capacity 2147483616 (bytes), 94182 hits, 62270620 requests, 0.000 recent hit rate, 14400 save period in seconds
Has anyone seen this before, where there are lots of key cache misses and lots of room in the key cache, and yet the cache is not being populated? Thanks in advance.
The key cache is designed to speed up access to existing data, not non-existing data. You should instead look into why reads for non-existing data are not being short-circuited at the bloom filter level.
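One way to check that is to look at the per-column-family bloom filter counters (a sketch; on 1.1 the command is nodetool cfstats, renamed tablestats in later releases):
# A high "Bloom Filter False Positives" / "False Ratio" means reads for missing rows
# are falling through to the SSTables instead of being skipped
nodetool cfstats | grep -i -e "Column Family" -e "Bloom Filter"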