My web app faces huge CPU spikes. Not because of traffic increasing, but because of heavy operations, such as reports going out. Some of these push the CPU from a healthy 30% load to 100% for the next 2-10 minutes... Here I'll describe it as if I had only one server, but I've seen up to 4 servers going crazy because the alignment of the stars made around 50 of my clients want a report at the same time... I'm hosted on Azure and I use auto-scale to handle these spikes: if the load goes north of 70% for more than 2 minutes, a new instance goes up.
The thing is, because server 1 is 100% backed up, when server 2 comes up, the load balancer will (I hope) direct every new request to server 2 until server 1 can handle more again. Because of this (expected) behavior, I was wondering if I should raise the minimum number of threads in the pool so the new instance can handle the incoming requests faster.
My usual rate is around 15 requests/s, so I thought I should start the pool with at least 50...
What do you guys think?
Edit 1 (2017-07-13)
So far this is working fine... I'll try a higher setting and see what happens.
This strategy proved itself very helpful and mitigated a lot of issues. Not all my problems are gone, but the errors/timeouts decreased immensely.
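For anyone trying the same thing, here is a minimal sketch of raising the pool's floor at startup. It assumes classic ASP.NET on the full .NET Framework with a Global.asax; the value 50 is just the starting point discussed above, not a recommendation:

    using System;
    using System.Web;

    public class Global : HttpApplication
    {
        protected void Application_Start()
        {
            // Read the current minimums so we only ever raise them, never lower.
            System.Threading.ThreadPool.GetMinThreads(out int workerMin, out int ioMin);

            // Start the pool at 50 workers so a burst right after scale-out
            // doesn't have to wait on the pool's slow thread-injection rate.
            System.Threading.ThreadPool.SetMinThreads(Math.Max(workerMin, 50), ioMin);
        }
    }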
Related
So my requirement is to run 90 concurrent users executing multiple scenarios (15 scenarios) simultaneously for 30 minutes on a virtual machine. For some of the threads I use a Concurrency Thread Group and for others a normal Thread Group.
Now my issues are:
1) After I execute all 15 scenarios, the max response time displayed for each scenario is very high (>40 sec). Any suggestions to reduce this high max response time?
2) One of the scenarios submits a web form. There is no issue when I submit just one, but during the 90-concurrent-user execution some of the form submissions get a 500 error code. Is the error because I use looping to achieve the 30-minute duration?
In order to reduce the response time you need to find the reason for it; the usual suspects are:
Lack of resources like CPU, RAM, etc. - make sure to monitor resource consumption, e.g. using the JMeter PerfMon Plugin.
Incorrect configuration of the middleware (application server, database, etc.); all these components need to be properly tuned for high loads. For example, if you set the maximum number of connections on the application server to 10 and you have 90 threads, the other 80 threads will queue up waiting for the next available executor (see the sketch after this list). The same applies to the database connection pool.
Inefficient application code - use a profiler tool to inspect what's going on under the hood and why the slowest functions are that slow; it might be the case that your application's algorithms are not efficient enough.
If your test succeeds with a single thread and fails under load, that definitely indicates a bottleneck. Try increasing the load gradually and see how many users the application can support without performance degradation and/or errors. HTTP 5xx status codes indicate server-side errors, so it's also worth inspecting your application logs for more insight.
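To make the connection-pool queuing point concrete, here is a hypothetical C# sketch (the numbers 10 and 90 come straight from the example above; a SemaphoreSlim stands in for the real connection pool):

    using System;
    using System.Diagnostics;
    using System.Linq;
    using System.Threading;
    using System.Threading.Tasks;

    class PoolQueueingDemo
    {
        // 10 "connections" shared by 90 concurrent "users".
        static readonly SemaphoreSlim ConnectionPool = new SemaphoreSlim(10);

        static async Task<long> SimulateRequestAsync()
        {
            var sw = Stopwatch.StartNew();
            await ConnectionPool.WaitAsync();     // queues here when all 10 are busy
            try { await Task.Delay(1000); }       // the "query" itself takes 1 second
            finally { ConnectionPool.Release(); }
            return sw.ElapsedMilliseconds;        // queue time + work time
        }

        static async Task Main()
        {
            var times = await Task.WhenAll(
                Enumerable.Range(0, 90).Select(_ => SimulateRequestAsync()));
            // The first 10 requests take ~1s; the last wave waits ~8s in the queue first.
            Console.WriteLine($"min={times.Min()}ms max={times.Max()}ms");
        }
    }

The work takes the same 1 second for everyone; it is purely the queuing in front of the 10 slots that inflates the measured max response time.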
Because I was struggling with thread starvation symptoms, I decided to rewrite the whole call chain (I/O, DB paths) in async/await style.
After a huge effort to rewrite everything as async/await, my terrible surprise was that it had no effect under medium/high load.
So I decided to reduce the test to the following sequence:
WebStressTool -> IIS (reverse proxy) -> Kestrel (out-of-process) -> await MiddlewarePipeline.Next() -> async Controller.Action -> await db.StoredProcedure (SQL: WAITFOR DELAY 1000ms)
The processing time for every request should be near 1000ms across the fully async/await call chain.
I expect that as the requests/second increase, the response time should remain the same (1000ms).
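For context, a minimal sketch of how that chain might look, assuming ASP.NET Core MVC with plain ADO.NET (the controller name and connection string are placeholders, and the stored procedure is replaced by an inline WAITFOR DELAY):

    using System.Data.SqlClient;
    using System.Threading.Tasks;
    using Microsoft.AspNetCore.Mvc;

    [Route("api/[controller]")]
    public class TestController : Controller
    {
        // Placeholder; the real connection string comes from configuration.
        private const string ConnectionString =
            "Server=.;Database=LoadTest;Integrated Security=true";

        [HttpGet]
        public async Task<IActionResult> Get()
        {
            // Fully async end to end: no thread is held during the 1000ms wait,
            // so response time should stay flat as the request rate grows.
            using (var conn = new SqlConnection(ConnectionString))
            using (var cmd = new SqlCommand("WAITFOR DELAY '00:00:01'", conn))
            {
                await conn.OpenAsync();
                await cmd.ExecuteNonQueryAsync();
            }
            return Ok();
        }
    }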
Here is the result:
1-9 rqs/sec -> response time: 1010ms (WebStressTool: 9 rqs/sec). It's perfect, as expected.
15 rqs/sec -> response time: 2200ms (WebStressTool: 9 rqs/sec). Bottleneck.
25 rqs/sec -> response time: 4500ms (WebStressTool: 9 rqs/sec). Bottleneck.
....and so on
As the requests/sec increase, the response time increases exponentially.
The normal behavior for a fully async/await scenario should be: as the rqs/sec increase, the response time should remain the same, ~1000ms (scalability).
So the symptoms are not related to thread starvation but to some bottleneck somewhere... who knows.
After more tests I discovered that the issue appears only when running under IIS (reverse proxy). The application's threads do not exceed 40 and the response time is huge under load. Repeating the test without IIS (just the Kestrel server), the result was a success: the application is scalable.
It seems that running Kestrel behind IIS is the issue.
Any suggestions, ideas, hints etc would be highly appreciated.
Best regards
Well, I have a terrible suspicion: running the stress scenario on my development machine was limited by IIS's own hidden concurrency limitation on client editions of Windows. Only Windows Server has no such limitation. This means I spent hundreds of hours improving the response time, rewriting hot paths as async, caching, etc., and I was misguided by this hidden limitation.
Now I'm moving the stress test to Windows Server 2012. I'll be back soon.
We've recently created a new Standard 1 GB Azure Redis cache specifically for distributed locking, separate from our main Redis cache. This was done to improve stability on our main Redis cache, a very long-term issue that this action seems to have significantly helped with.
On our new cache, we observe bursts of ~100 errors within the same few seconds every 1-3 days. The errors are either:
No connection is available to service this operation (StackExchange.Redis error)
Or:
Could not acquire distributed lock: Conflicted (RedLock.net error)
As they are errors from different packages, I suspect the Redis cache itself is the problem here. None of the stats during this time look out of the ordinary, and the workload should fit comfortably in the Standard 1 GB size.
I'm guessing this could be caused by the advertised "Low" network performance of this tier - is that the likely cause?
Your theory sounds plausible.
Checking for insufficient network bandwidth
Here is a handy table showing the maximum observed bandwidth for various pricing tiers. Take a look at the observed maximum for your SKU, then head over to your Redis blade in the Azure Portal and choose Metrics. Set the aggregation to Max and look at the sum of Cache Read and Cache Write: this is your total bandwidth consumed. Overlay that sum against the time period when you're experiencing the errors and see whether the problem is network throughput. If it is, scale up.
Checking server load
Also on the Metrics tab, take a look at Server Load. This is the percentage of time that Redis is busy and unable to process requests. If you hit 100%, Redis cannot respond to new requests and you will experience timeouts. If that's the case, scale up.
Reusing ConnectionMultiplexer
You can also run out of connections to a Redis server if you're spinning up a new instance of StackExchange.Redis.ConnectionMultiplexer per request. The connection limits per SKU are listed on the pricing page. You can check whether you're exceeding the maximum for your SKU on the Metrics tab: select Max aggregation and choose Connected Clients as your metric.
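The usual fix is a single shared multiplexer for the lifetime of the process, e.g. the lazy-singleton pattern (the connection string below is a placeholder):

    using System;
    using StackExchange.Redis;

    public static class RedisConnection
    {
        // One multiplexer per process; it pipelines all commands over a few sockets.
        private static readonly Lazy<ConnectionMultiplexer> LazyConnection =
            new Lazy<ConnectionMultiplexer>(() => ConnectionMultiplexer.Connect(
                "yourcache.redis.cache.windows.net:6380,password=...,ssl=True,abortConnect=False"));

        public static ConnectionMultiplexer Connection => LazyConnection.Value;
    }

    // Usage: get databases from the shared connection instead of reconnecting.
    // IDatabase db = RedisConnection.Connection.GetDatabase();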
Thread Exhaustion
This doesn't sound like your error, but I'll include it for completeness in this Rogue's Gallery of Redis issues, and it comes into play with Azure Web Apps. By default, the thread pool will start with 4 threads that can be immediately allocated to work. When you need more than four threads, they're doled out at a rate of roughly one new thread per 500ms. So if you dump a ton of requests on a Web App in a short period of time, you can end up queuing work and eventually having requests dropped before they even get to Redis. To see whether this is a problem, go to Metrics for your Web App, choose Threads, and set the aggregation to Max. If you see a huge spike in a short period of time that corresponds with your trouble, you've found a culprit. Resolutions include making proper use of async/await, and when that gets you no further, using ThreadPool.SetMinThreads to raise the minimum to a value close to or above the maximum thread usage you see during your bursts.
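You can see the injection rate for yourself with a small console sketch like this (the numbers are illustrative, and exact pacing varies by runtime version):

    using System;
    using System.Diagnostics;
    using System.Threading;

    class InjectionRateDemo
    {
        static void Main()
        {
            ThreadPool.SetMinThreads(4, 4);   // mimic a small starting pool
            var sw = Stopwatch.StartNew();

            for (int i = 0; i < 12; i++)
            {
                int id = i;
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    // Report when this item actually got a thread, then hog it.
                    Console.WriteLine($"item {id} started at {sw.ElapsedMilliseconds}ms");
                    Thread.Sleep(10_000);
                });
            }

            // Items beyond the minimum start roughly 500ms apart.
            Thread.Sleep(8_000);
        }
    }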
Rob has some great suggestions, but I did want to add information on troubleshooting traffic bursts and poor ThreadPool settings. Please see: Troubleshoot Azure Cache for Redis client-side issues.
Bursts of traffic combined with poor ThreadPool settings can result in delays in processing data already sent by the Redis Server but not yet consumed on the client side.
Monitor how your ThreadPool statistics change over time using an example ThreadPoolLogger (a sketch of such a logger appears below). You can use TimeoutException messages from StackExchange.Redis like the one below to investigate further:
System.TimeoutException: Timeout performing EVAL, inst: 8, mgr: Inactive, queue: 0, qu: 0, qs: 0, qc: 0, wr: 0, wq: 0, in: 64221, ar: 0,
IOCP: (Busy=6,Free=999,Min=2,Max=1000), WORKER: (Busy=7,Free=8184,Min=2,Max=8191)
Notice that in both the IOCP section and the WORKER section the Busy value is greater than the Min value. This difference means your ThreadPool settings need adjusting.
You can also see in: 64221. This value indicates that 64,221 bytes have been received at the client's kernel socket layer but haven't yet been read by the application. This difference typically means that your application (for example, StackExchange.Redis) isn't reading data from the network as quickly as the server is sending it.
You can configure your ThreadPool settings to make sure the pool scales up quickly under burst scenarios.
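A minimal sketch of such a ThreadPoolLogger might look like this (the sampling interval and output sink are up to you):

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    public static class ThreadPoolLogger
    {
        // Periodically samples pool usage so Busy-vs-Min spikes show up in logs.
        public static void Start(TimeSpan interval)
        {
            Task.Run(async () =>
            {
                while (true)
                {
                    ThreadPool.GetMaxThreads(out int maxWorker, out int maxIo);
                    ThreadPool.GetAvailableThreads(out int freeWorker, out int freeIo);
                    ThreadPool.GetMinThreads(out int minWorker, out int minIo);

                    Console.WriteLine(
                        $"WORKER: Busy={maxWorker - freeWorker}, Free={freeWorker}, Min={minWorker} | " +
                        $"IOCP: Busy={maxIo - freeIo}, Free={freeIo}, Min={minIo}");

                    await Task.Delay(interval);
                }
            });
        }
    }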
I hope you find this additional information helpful.
How frequently does Traffic Manager monitor endpoints? It's clearly not event-driven (when an endpoint goes down, it takes up to 30 seconds - 2.5 minutes to identify the endpoint's status, per my observations). Can we configure this frequency? I cannot see any configuration for it.
Is there a relationship between the Traffic Manager monitoring interval and the TTL?
This may look like a general question, but my real issue is that I experience service downtime in a failover scenario (failover of the primary). I understand the effect of the TTL: until the client's DNS cache expires, it keeps calling the cached endpoint. I spent a lot of time on this and have now narrowed it down to one specific question.
The issue is the delay before Traffic Manager identifies the endpoint's status after it is stopped or started. I need a logical explanation for this and could not find any Azure reference that explains it.
Traffic Manager settings
I need to understand this delay and plan for the downtime.
I have gone through the same issue. Check this link; it explains the monitoring behaviour:
Traffic Manager Monitoring
The monitoring system performs a GET, but does not receive a response in 10 seconds or less. It then performs three more tries at 30 second intervals. This means that at most, it takes approximately 1.5 minutes for the monitoring system to detect when a service becomes unavailable. If one of the tries is successful, then the number of tries is reset. Although not shown in the diagram, if the 200 OK message(s) come back more than 10 seconds after the GET, the monitoring system will still count this as a failed check.
This explains the 30 sec - 2.5 min delay.
Basically, the maximum delay would be about 1.5 minutes plus the TTL; for example, with a 300-second TTL, clients could keep resolving to the failed endpoint for up to roughly 6.5 minutes.
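Since responses slower than 10 seconds count as failed probes, it's also worth pointing the monitor at a dedicated lightweight endpoint. A minimal ASP.NET Core sketch (the route name is my own assumption):

    using Microsoft.AspNetCore.Mvc;

    [Route("health")]
    public class HealthController : Controller
    {
        // Traffic Manager treats any response slower than 10 seconds as a
        // failed check, so this endpoint does no I/O and returns immediately.
        [HttpGet]
        public IActionResult Get() => Ok("OK");
    }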
I cannot figure out the cause of the bottleneck on this site: response times become very bad once about 400 users are reached. The site is on Google Compute Engine, using an instance group with network load balancing. We created the project with Sails.js.
I have been doing load testing with Google Container Engine using Kubernetes, running the locust.py script.
The main results for one of the tests are:
RPS: 30
Spawn rate: 5 p/s
Total users: 1000
Avg response time: 27500 ms!! (27.5 seconds)
The response time is initially great, below one second, but when the test reaches about 400 users the response time starts to jump massively.
I have tested the obvious factors that can influence that response time; results below:
Compute Engine instances
(2 x n1-standard-2, 200 GB disk, 7.5 GB RAM per instance):
Only about 20% CPU utilization
Outgoing network bytes: 340k bytes/sec
Incoming network bytes: 190k bytes/sec
Disk operations: 1 op/sec
Memory: below 10%
MySQL:
Max_used_connections: 41 (below the configured maximum)
Connection errors: 0
All other results for MySQL also seem fine, no reason to cause bottleneck.
I tried the same test with a freshly created Sails.js project and it did better, but the results were still terrible: ~5 seconds response time for about 2000 users.
What else should I test? What could be the bottleneck?
Are you doing any file reading/writing? This is a major obstacle in Node.js and will always cause some issues. Caching read files or removing the need for such code should be done as much as possible. In my own experience, serving files like images, CSS and JS through my Node server would start causing trouble when the number of concurrent requests increased. The solution was to serve all of this through a CDN.
Another problem could be the MySQL driver. We had some issues with connections not being closed correctly (we weren't using Sails.js, but I think it used the same driver at the time I ran into this), which caused problems on the MySQL server, resulting in long delays when fetching data from the database. You should time/track the number of MySQL queries and make sure they aren't delayed.
Lastly, it could be some specific issue with Sails.js and Google Compute Engine. You should make sure there aren't any open issues on either of these about the problem you are experiencing.