I am running a Node.js application that uses Redis and the Sequelize library (to connect to MySQL). The application runs on Cloud Run. In the morning, when transactions start, responses are fast, but as time passes the 50th percentile response time stays under 1 second, whereas my 95th and 99th percentile response times are close to 15 seconds, resulting in very high latency. Yet memory stays at 20% of 512 MB. Also, CPU utilization at the 95th and 99th percentiles is above 80%, while at the 50th percentile it is below 30%. What could be the issue? Is it due to memory paging or some other reason?
Related
I set an alarm on average CPU utilization (1 minute) > 65 on my Node.js Elastic Beanstalk environment. While installing Node.js dependencies, the EC2 instance uses a lot of CPU.
However, I found that the "average" CPU usage didn't exceed this threshold, while the "maximum" CPU utilization did. Why does the Elastic Beanstalk alarm fire even though the average CPU utilization doesn't exceed the threshold?
[screenshots: CloudWatch graphs of maximum and average CPU utilization]
Why is it happening? I'm tired of false positive CPU alarms :(
How do I solve this problem?
I set an alarm on average CPU utilization (1 minute) > 65 on my Node.js Elastic Beanstalk environment.
It means that the CloudWatch alarm will average CPU utilization over a 1-minute period and trigger the alarm if that average crosses 65.
In the first screenshot, it seems that CPU utilization was high from roughly 9:57 until 10:07, about 10 minutes.
In the second screenshot, the average peaked at about 30 during this period. Let's do some math to understand why:
CPU utilization was not consistently high. The graph shows the peaks recorded, but if the CPU is at 90% for 3 seconds and 10% for the other 57 seconds, the 1-minute average is only (90 * 3 + 10 * 57) / 60 = 14%.
Your case is much the same, which is why the maximum and average graphs look so different.
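As a quick sanity check of that arithmetic, here is a tiny sketch; the 90%/3 s and 10%/57 s samples are just the illustrative numbers from above:

public class CpuAverage {
    public static void main(String[] args) {
        // Illustrative samples from the example above: 3 s at 90% CPU, 57 s at 10% CPU
        double weightedSum = 90.0 * 3 + 10.0 * 57; // percent-seconds over the minute
        double average = weightedSum / 60.0;       // time-weighted 1-minute average
        System.out.println("1-minute average CPU: " + average + "%"); // prints 14.0%
    }
}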
I have a Spark batch job that runs every minute and processes ~200k records per batch. The usual processing delay of the app is ~30 seconds. In the app, for each record processed, we make a write request to DynamoDB. At times, the server-side DDB write latency is ~5 ms instead of 3.5 ms (a ~43% increase over the usual 3.5 ms). This causes the overall delay of the app to jump by 6 times (~3 minutes).
How can an extra ~1.5 ms per DDB call increase the overall latency of the app by 6 times?
PS: I have verified the root cause by overlaying the CloudWatch graphs of DDB put latency and the Spark app's processing delay.
Thanks,
Vinod.
Just a ballpark estimate:
If the average latency is 3.5 ms and about half of your 200k records are processed in 5 ms instead of 3.5 ms, this would leave us with:
200,000 * 0.5 * (5 - 3.5) ms = 150,000 ms
of total delay, which is 150 seconds or 2.5 minutes. I don't know how well the process is parallelized, but this seems to be within the expected delay.
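The same back-of-the-envelope estimate as a small sketch; the records-per-batch and latencies come from the question, while the assumption that roughly half the writes hit the slower path is only illustrative:

public class DelayEstimate {
    public static void main(String[] args) {
        long records = 200_000;       // records per batch (from the question)
        double slowShare = 0.5;       // assumption: ~half the writes see the higher latency
        double extraMs = 5.0 - 3.5;   // extra latency per slow write, in ms
        double totalExtraMs = records * slowShare * extraMs;
        // ~150,000 ms of added serial delay, i.e. roughly 2.5 minutes
        System.out.printf("Extra delay: %.0f ms (~%.1f minutes)%n",
                totalExtraMs, totalExtraMs / 60_000);
    }
}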
The graph shows the CPU's max is > 96%, but the CPU's average is < 10%.
How can this be the case? (I mean, shouldn't the average be > 40%, or at least > 30%?)
Not really. I estimated some of the values from the graph, put them in a spreadsheet, and calculated a 5-minute average, along with the max CPU and the average of the 5-minute averages. Below is what it looks like. When you average over a time window, it smooths out all the peaks and lows.
Max    5 Min Avg
 85
 40
 20
  5
 25    35
 40    26
  5    19
 10    17
 99    35.8

Max    Avg of 5 Min Avgs
 99    26.56
If the CPU is continually high, then your overall average will start growing.
However, that average does look rather low on your graph. You aren't showing the min CPU either, so it may be short bursts of high usage with mostly low CPU in between; you should graph that as well.
Are you trying to configure alerts or scaling? Then you should look at the average over a small period, e.g. 5 minutes, and if that exceeds a threshold (usually 75-80%), send the alert and/or scale out.
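To make the smoothing effect concrete, here is a small sketch that recomputes the table above; the samples are the rough values estimated from the graph, not real measurements:

import java.util.Arrays;

public class RollingAverage {
    public static void main(String[] args) {
        // CPU samples estimated from the graph (same values as the table above)
        double[] samples = {85, 40, 20, 5, 25, 40, 5, 10, 99};
        int window = 5;

        double overallMax = Arrays.stream(samples).max().getAsDouble();
        double sumOfAverages = 0;
        int windows = 0;
        for (int i = 0; i + window <= samples.length; i++) {
            double avg = Arrays.stream(samples, i, i + window).average().getAsDouble();
            System.out.printf("5-sample average ending at sample %d: %.1f%n", i + window, avg);
            sumOfAverages += avg;
            windows++;
        }
        // The max stays at 99 while the average of the rolling averages is only ~26.6
        System.out.printf("Max = %.0f, average of 5-sample averages = %.2f%n",
                overallMax, sumOfAverages / windows);
    }
}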
I asked Microsoft Azure support about this. The answer I received was not great and essentially amounts to "Yes, it does that." They suggested only using the average statistic since (as we've noticed) "max" doesn't work. This is due to the way data gets aggregated internally. The Microsoft Product engineering team has a request (ID: 9900425) in their large list to get this fixed, so it may happen someday.
I did not find any documentation on how that aggregation works, nor would Microsoft provide any.
Existing somewhat useful docs:
Data sources: https://learn.microsoft.com/en-us/azure/azure-monitor/agents/data-sources#azure-resources
Metrics and data collection: https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/data-platform-metrics#data-collection
What are the settings to consider for lightweight transactions (compare-and-set) in Cassandra 2.1.8?
a. We are using a token-aware load balancing policy with LeveledCompactionStrategy on the table. The table has skinny rows with a single column in the primary key. We use prepared statements for all queries; they are prepared once and cached.
b. These are the settings (a sketch of how they fit together follows the list):
i. Max Heap – 4G, New Heap – 1G, 4 Core CPU, CentOS
ii. Connection pool is based on the concurrency settings for the test.
final PoolingOptions pools = new PoolingOptions();
// Number of in-flight requests per connection that triggers opening a new connection
pools.setNewConnectionThreshold(HostDistance.LOCAL, concurrency);
// Core and max connections kept per host, for both local and remote hosts
pools.setCoreConnectionsPerHost(HostDistance.LOCAL, maxConnections);
pools.setMaxConnectionsPerHost(HostDistance.LOCAL, maxConnections);
pools.setCoreConnectionsPerHost(HostDistance.REMOTE, maxConnections);
pools.setMaxConnectionsPerHost(HostDistance.REMOTE, maxConnections);
iii. protocol version – V3
iv. TCP no-delay is set to true to disable Nagle's algorithm (the default).
v. Compression is enabled.
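For completeness, here is a sketch of how those settings might be wired into the Cluster builder. It assumes the 3.x Java driver (consistent with setNewConnectionThreshold above); the contact point is a placeholder and LZ4 is assumed for the compression codec:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PoolingOptions;
import com.datastax.driver.core.ProtocolOptions;
import com.datastax.driver.core.ProtocolVersion;
import com.datastax.driver.core.SocketOptions;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class ClusterSetup {
    public static Cluster build(PoolingOptions pools) {
        return Cluster.builder()
                .addContactPoint("10.0.0.1")                                 // placeholder contact point
                .withProtocolVersion(ProtocolVersion.V3)                     // iii. protocol version V3
                .withCompression(ProtocolOptions.Compression.LZ4)            // v. compression (LZ4 assumed)
                .withSocketOptions(new SocketOptions().setTcpNoDelay(true))  // iv. disable Nagle's algorithm
                .withPoolingOptions(pools)                                   // ii. pool configured as above
                .withLoadBalancingPolicy(                                    // a. token-aware load balancing
                        new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                .build();
    }
}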
2. Throughput increases with concurrency on a single connection, but for CAS RW the throughput does not scale at the same rate as Simple RW:
100,000 requests, 1 thread         Simple RW    CAS RW
Mean rate (ops/sec)                      643     265.4
Mean latency (ms)                      1.554     3.765
Median latency (ms)                    1.332     2.996
75th percentile latency (ms)           1.515     3.809
95th percentile latency (ms)           2.458     8.121
99th percentile latency (ms)           5.038     11.52
Latency standard deviation (ms)        0.992     2.139

100,000 requests, 25 threads       Simple RW    CAS RW
Mean rate (ops/sec)                     7686      1881
Mean latency (ms)                       3.25     13.29
Median latency (ms)                    2.695    12.203
75th percentile latency (ms)           3.669    14.389
95th percentile latency (ms)           6.378    20.139
99th percentile latency (ms)           11.59    61.973
Latency standard deviation (ms)        3.065     6.492
The most important consideration for speed with LWT is partition contention. If you have several concurrent updates hitting a single partition, it will be slower. Beyond that, you are looking at machine performance tuning.
There is a free, full course here to help with that: https://academy.datastax.com/courses/ds210-operations-and-performance-tuning
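For reference, a minimal sketch of what a CAS (LWT) write looks like with the Java driver; the keyspace, table, and column names are hypothetical:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class CasWriteExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {            // hypothetical keyspace
            // Prepared once and cached, as in the setup described above
            PreparedStatement cas = session.prepare(
                    "UPDATE accounts SET balance = ? WHERE id = ? IF balance = ?"); // hypothetical table
            ResultSet rs = session.execute(cas.bind(90, "acct-1", 100));
            // wasApplied() reports whether the compare-and-set condition held; many concurrent
            // CAS writes to the same partition contend on Paxos and get slower
            System.out.println("applied = " + rs.wasApplied());
        }
    }
}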
I tried load testing our site with terrible results, so I decided to load test a new ("out of the box") Sails.js project with no changes other than the local.js port (8080) and environment mode (production). We are using Google Cloud Platform for both site hosting and load testing. The site resources can easily handle the requests:
30% CPU usage, disk I/O 16 KB/sec, RAM < 10%, no DB used
The average and max response times, in milliseconds:
Users    Avg (ms)    Max (ms)
 250         10          89
 500         10         122
 750         26         847
1000         50        3000    (averages start jumping faster from this point)
2000        700        6400
2500       1115        7611
4000       3030       10370
Is there possibly some bottleneck created by a limit around 1,000, since that's when the bad delays start?
When I try profiling, the major part of the delay is attributed to (idle).
Out of the box, Sails.js seems nowhere near the hundreds of thousands of concurrent users that others have achieved with good response times.