I am currently stress testing a .Net Core application, targeting netcoreapp2.2, that is hosted on Azure as a App Service connected to a P1V2 (210 ACU, 3.5GB memory) service plan with 2 instances.
The endpoint that I'm stress testing is very simple, it validates a Oauth2.0 token, gets the user and some info about the user from a P2 (250 DTU) Azure hosted database, total 4 db queries per request, and returns the string "Pong".
When running 15 concurrent users (or more) in 200 loops I see the stop(s) in processing seen in the image (between the high peaks). The service plan never hits more than around 20-35% CPU and the database never uses more than 2% load. Increasing the users decreases the average throughput.
When looking at the slow requests it is like it just randomly stops, never at the same place. When I look at the DB requests I never see a request that takes longer than a couple of 100 milliseconds while some requests can take upwards to 5-6s to process.
It feels like I reach some limit which results in something stopping for a period of time, but I can't figure out where the problem lies.
When running the same stress locally I don't see these stops.
I'm using jmeter cli to run the stress tests against both environments.
Any help is greatly appreciated, thanks!
This could be because of Azure DDOS protection behaviour.
If your application is being attacked by a DDOS attack, Microsoft will
stop all connections to your end point and in effect taking down your
service.
To avoid this you need to setup Web application firewall (WAF) to exclude healthy requests.
Related
We have a service which pings our EP1 Premium service and yesterday we received 3 client side timeout errors after 2 minutes of waiting. When opening the trace in App insights, these requests which time out are not even logged and have no trace of ever being received Azure side, and therefore stay unanswered. By looking at the metrics provided in the Azure Functions app, I found out that 1-2 minutes after the request has been sent, the app loses all its ability to work as its Total App Domains falls to 0 as well as all connections, threads and so on and this state lasts until the next request is received, therefore "skipping" the request that happened beforehand. This is a big issue as I need to make sure requests get answered in a timely manner.
The client service sent HTTP requests to the Azure Functions app expecting an answer, only to time out while the Azure-side doesn't have any record of ever receiving the request.
I believe this issues is related to Consumption Plan of Azure Functions called Cold Start behaviour. The "skipping" mechanism is explained below:
Apps may scale to zero when idle, meaning some requests may have additional latency at startup. The consumption plan does have some optimizations to help decrease cold start time, including pulling from pre-warmed placeholder functions that already have the function host and language processes running.https://learn.microsoft.com/en-us/azure/azure-functions/functions-scale#cold-start-behavior
Please also consider of having look on this article, which explains the behaviour. https://azure.microsoft.com/en-us/blog/understanding-serverless-cold-start/
I have recently setup an 'Azure Database for MySQL flexible server' using the burstable tier. The database is queried by a React frontend via a node.js api; which each run on their own seperate Azure app services.
I've noticed that when I come to the app first thing in the morning, there is a delay before database queries complete. The React app is clearly running when I first come to it, which is serving the html front-end with no delays, but queries to the database do not return any data for maybe 15-30 seconds, like it is warming up. After this initial slow performance though, it then runs with no delays.
The database contains about 10 records at the moment, and 5 tables, so it's tiny.
This delay could conceivably be due to some delay with the node.js server, but as the React server is running on the same type of infrastructure (an app service), configured in the same way, and is immediately available when I go to its URL, I don't think this is the issue. I also have no such delays in my dev environment which runs on my local PC.
I therefore suspect there is some delay with the database server, but I'm not sure how to troubleshoot. Before I dive down that rabbit hole though, I was wondering whether a delay when you first start querying a database (after, say, 12 hours of inactivity) is simply a characteristic of the burtsable tier on Azure?
There may be more factors affecting this (see comments from people on my original question), but my solution has been to set two global variables which cache data, improving initial load times. The following should be set to ON in the Azure config:
'innodb_buffer_pool_dump_at_shutdown'
'innodb_buffer_pool_load_at_startup'
This is explained further in the following best practices documentation: https://learn.microsoft.com/en-us/azure/mysql/single-server/concept-performance-best-practices in the section marked 'Use InnoDB buffer pool Warmup'
In some 1-5% of our requests, we are seeing slow communication between APIs (REST API requests). Both APIs are developed by us and hosted on Azure, each app service on its own app service plan in the same region, P1v2 tier.
What we are seeing on application insights is that POST or GET requests on origin API can take a few seconds to execute, while real execution time on destination API is only a few milliseconds.
Examples (first line POST request on origin, second execution time on destination API): slow req 1, slow req 2
Our best guess is that the time difference is lost in communication between components. We don't have an explanation for it since the payload is really small and in most cases, communication takes less than 5 milliseconds.
We dismiss the possible explanation it could be due to component cold start since it happens during constant load and no horizontal scaling was performed.
Do you have any idea what might cause it or how to do additional analysis in order to discover it?
If you're running multiple sites on the App Service Plan, then enable the "Always On" setting for your web app > All Settings > Application Settings > Click on Always On
See here for details: https://azure.microsoft.com/en-us/documentation/articles/web-sites-configure/
When Always On is off, the site is shut down after 20 minutes of inactivity to free up resources for any additional websites that might be using the same App Service Plan.
The amount of information it needs to collect, process and then present itself requires some time, and involve internal calls as well, that is why considering the server load and usage, it takes around 6 to 7 seconds sometimes even more.
To Troubleshoot that latency, try this steps, provided by Microsoft.
We have a dotnet core 3.0 solution running in a docker image running on Azure. For now, we haven't set it up in a k8s cluster. Our app service plan is PremiumV2, which basically means that we're running on dedicated hardware and not sharing our resources with anyone else.
We have a simple api-call to get the executing user based on the JWT. This validates the JWT, gets the users mail from the claims and queries cosmos to get more information about the user. When the request is sent from Postman the first time, it takes roughly 320 ms, however the subsequential requests takes around 50 ms. If we're waiting, lets say 10 minutes more or less, the requests is back at around 300 ms and again the subsequential requests takes around 50 ms. This indicates that the behavior is re-producable. Its worth mentioning that its not only this call that we see this behavior, but every "first" requests to our api takes more time than the other following.
Looking into application insights, apparently cosmos is not the bottleneck here. We've also configured the app service to be "always on"
Any ideas on how we can trace down this issue? Has anyone else experienced the same behavior? Is there any settings or configuration we should look at in Azure?
When we migrated our apps to azure from rackspace, we saw almost 50% of http requests getting read timeouts.
We tried placing the client both inside and outside azure with the same results. The client in this case is also a server btw, so no geographic/browser issues either.
We even tried increasing the size of the box to ensure azure wasn't throttling. But even using D boxes for a single request, the result was the same.
Once we moved out apps out of azure they started functioning properly again.
Each query was done directly on an instance using a public ip, so no load balancer issues either.
Almost 50% of queries ran into this issue. The timeout was set to 15 minutes.
Region was US East 2
Having 50% of HTTP requests timing out is not normal behavior. This is why you need to analyze what is causing those timeouts by validating the requests are hitting your VM. For this, I would recommend you running a packet capture on your server and analyze response times, as well as look for high number of retransmissions; it is even better if you can take a simultaneous network trace on your clients machines so you can do TCP sequence number analysis and compare packets sent vs received.
If you are seeing high latencies in the packet capture or high number of retransmissions, it requires detailed analysis. I strongly suggest you to open a support incident so Microsoft support can help you investigate your issue further.