Azure App Service - WEBSITE_HEALTHCHECK_MAXPINGFAILURES and the Load Balancing time setting

Description
The required number of failed requests for an instance to be deemed unhealthy and removed from the load balancer. For example, when set to 2, your instances will be removed after 2 failed pings. (Default value is 10)
Here is the description for WEBSITE_HEALTHCHECK_MAXPINGFAILURES. What is the difference between WEBSITE_HEALTHCHECK_MAXPINGFAILURES and the Load Balancing setting shown in the portal?
I found that when I change Load Balancing to 5, the value of WEBSITE_HEALTHCHECK_MAXPINGFAILURES changes to 5 as well.
Test
Localhost sends two requests per minute.
Before enabling Health check, no requests are received.
After enabling Health check, every instance receives two requests per minute.

There is no difference: the portal's Load Balancing option is just a friendlier UI for setting WEBSITE_HEALTHCHECK_MAXPINGFAILURES. And because the pings arrive at 1-minute intervals, the number of failed pings it allows also corresponds to the number of minutes an instance can keep failing before App Service deems it unhealthy and removes it from the load balancer:
Health check pings this path on all instances of your App Service app at 1-minute intervals.
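If you want to set it outside the portal, it is just an app setting. Below is a minimal sketch using the azure-mgmt-web SDK; the subscription, resource group, and app name are placeholders, and the exact SDK calls are my assumption based on the current azure-mgmt-web package, so treat it as a starting point rather than the official way:

    # Sketch: update the app settings so that WEBSITE_HEALTHCHECK_MAXPINGFAILURES = 5
    # (equivalent to setting Load Balancing = 5 in the portal).
    # <subscription-id>, <resource-group>, and <app-name> are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.web import WebSiteManagementClient

    client = WebSiteManagementClient(DefaultAzureCredential(), "<subscription-id>")

    # Read the current app settings, change the one value, and write them back.
    settings = client.web_apps.list_application_settings("<resource-group>", "<app-name>")
    settings.properties["WEBSITE_HEALTHCHECK_MAXPINGFAILURES"] = "5"
    client.web_apps.update_application_settings("<resource-group>", "<app-name>", settings)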

Related

1 minute Service timeout for AMLS models deployed on ACI or AKS

We created an image-scoring model with Azure Machine Learning Service (AMLS) and deployed it from the AMLS portal to both ACI and AKS.
It works for smaller images, but for larger images it times out after exactly 1 minute on both ACI and AKS.
Scoring an image is expected to take a few minutes.
Is this a limitation of the AMLS deployment, or do ACI and AKS time out the deployed web service after 60 seconds?
Any workaround would be welcome.
ACI error:
Post http://localhost:5001/score: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
AKS error:
Replica closed connection before replying
If you are deploying a service in AKS, then @Greg's solution should be sufficient for most cases. However, if your value for scoring_timeout_ms is going to exceed 60000 milliseconds (i.e. 60 seconds), then I recommend also tuning the following config settings. When your model gets deployed in Kubernetes as a deployment, we define a LivenessProbe so that if your model container becomes unresponsive, Kubernetes can automatically restart your container in an effort to restore the health of your model.
period_seconds: the time interval between each LivenessProbe execution. If your model is going to take 45 seconds to respond to a scoring request, one thing you can do is increase the interval between probes from the default 10 seconds to 30 seconds (or more).
failure_threshold: the number of LivenessProbe failures after which Kubernetes restarts your model container. If you run the LivenessProbe every 10 seconds and your model takes 45 seconds to respond, you can increase failure_threshold from the default 3 to 10, meaning Kubernetes restarts the container only after 10 consecutive probe failures.
timeout_seconds: how long the LivenessProbe waits before giving up. Another option is to increase timeout_seconds from the default 2 seconds to 30 seconds, so the probe waits up to 30 seconds when your container is busy but still gets an earlier reply when it is not.
There is no single "correct" setting to modify, but the combination of these will definitely help in preventing the 502 "Replica closed connection before replying" error; a combined sketch follows below.
The deployment configuration class has a timeout setting you can change in its constructor, which can help, though some clients will time out anyway.
https://learn.microsoft.com/en-us/python/api/azureml-core/azureml.core.webservice.aks.aksservicedeploymentconfiguration?view=azure-ml-py
scoring_timeout_ms : int => A timeout to enforce for scoring calls to this Webservice. Defaults to 60000

Azure Front Door - How to do rolling update of backend pool?

Has anyone successfully done rolling updates with Azure Front Door? We have an application in 2 regions, and we want to disable the backend in region 1 while it gets updated, then do the same for the backend in region 2. However, there seems to be a ridiculous amount of lag between disabling or removing a backend from a pool and Front Door actually taking it out of rotation, which makes this basically impossible.
We've tried:
Disabling/totally removing backends
Setting high/low backend priorities/weights
Modifying health probe intervals
Changing sample size/successful samples/latency to 1/1/100
I watch an endpoint during the deployment process that tells me which region served the request; it never changes during the operation, and it becomes unavailable while that region is being updated. There's gotta be a way to do this, right?
I have a suggestion:
1. Reduce the health probe interval.
2. Reduce the sample size and the successful samples required. (Make sure you are probing a simple HTTP page so your backend resource can handle the load; you will start receiving probes from all the POP servers at the interval you specified.)
3. For the server that needs maintenance, stop the service or make the probe fail so that all traffic switches to the healthy server (see the polling sketch below). Then do the maintenance and start the service again. This will make sure your service is not disrupted.
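For step 3, here is a rough sketch of the deployment-side wait: after you make region 1's probe fail (stop the service, or flip a maintenance flag your probe page honours), poll the "which region served this" endpoint the question mentions until Front Door reports the other region. The URL and the /whoami path are hypothetical:

    # Rough sketch: wait until Front Door has shifted all traffic to the other region
    # before updating the one whose probe you just failed.
    import time

    import requests

    FRONT_DOOR_URL = "https://example-frontdoor.azurefd.net/whoami"  # hypothetical

    def wait_for_failover(expected_region: str, timeout_s: int = 600) -> bool:
        """Poll Front Door until requests are answered by the expected region."""
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            try:
                region = requests.get(FRONT_DOOR_URL, timeout=5).text.strip()
                if region == expected_region:
                    return True
            except requests.RequestException:
                pass  # transient errors while traffic shifts are expected
            time.sleep(10)
        return False

    if wait_for_failover("region2"):
        print("All traffic is on region 2; safe to update region 1")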

Load test on Azure

I am running a load test using JMeter on my Azure web services.
I scaled my service to the S2 tier with 4 instances and run 4 JMeter instances with 500 threads each.
It starts perfectly fine, but after a while calls start failing with timeout errors (HTTP status 500).
I checked the HTTP request queue on Azure and found that it is very high on the 2nd instance and very low on two of the others.
Please help me get my load test to succeed.
I assume you are using Azure App Service. If you check the settings of your app, you will notice ARR Instance Affinity is enabled by default. A brief explanation:
ARR cleverly keeps track of connecting users by giving them a special cookie (known as an affinity cookie), which allows it to know, upon subsequent requests, to which server instance they were talking to. This way, we can be sure that once a client establishes a session with a specific server instance, it will keep talking to the same server as long as his session is active.
This is an important feature for session-sensitive applications, but if it's not your case then you can safely disable it to improve the load balance between your instances and avoid situations like the one you've described.
Disabling ARR’s Instance Affinity in Windows Azure Web Sites
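A quick client-side way to see the affinity cookie in action (the URL is hypothetical; ARRAffinity is the cookie name App Service issues): a requests.Session resends the cookie and therefore sticks to one instance, while cookie-less requests are free to be balanced.

    # Sketch: observe ARR instance affinity from the client side.
    import requests

    URL = "https://example-app.azurewebsites.net/"  # hypothetical

    # A Session keeps cookies, so the ARRAffinity cookie pins every request to one instance.
    with requests.Session() as sticky:
        for _ in range(3):
            sticky.get(URL)
        print("sticky client ->", sticky.cookies.get("ARRAffinity"))

    # Fresh, cookie-less requests can land on different instances.
    for _ in range(3):
        resp = requests.get(URL)
        print("fresh client  ->", resp.cookies.get("ARRAffinity"))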
It might also be due to caching of DNS name resolution at the JVM or OS level, so that all your requests hit only one server. If that is the case, add a DNS Cache Manager to your Test Plan and it should resolve your issue.
See The DNS Cache Manager: The Right Way To Test Load Balanced Apps article for more detailed explanation and configuration instructions.

Azure WebSites / App Service Unexplained 502 errors

We have a stateless WebApp (with a shared Azure Redis Cache) that we would like to scale automatically via the Azure auto-scale service. When I activate the auto-scale-out, or even when I activate 3 fixed instances for the WebApp, I get the opposite effect: response times increase exponentially or I get HTTP 502 errors.
This happens whether I use our configured Traffic Manager URL (which worked fine for months with single instances) or the native URL (.azurewebsites.net). Could this have something to do with Traffic Manager? If so, where can I find info on this combination (having searched)? And how do I properly leverage auto-scale with Traffic Manager failover/performance? I have tried putting Traffic Manager in both failover and performance mode with no evident effect. I can gladly provide links via private channels.
UPDATE: We have reproduced the situation now the "other way around": On the account where we were getting the frequent 5XX errors, we have removed all load balanced servers (only one server per app now) and the problem disappeared. And, on the other account, we started to balance across 3 servers (no traffic manager configured) and soon got the frequent 502 and 503 show stoppers.
Related hypothesis here: https://ask.auth0.com/t/health-checks-response-with-500-http-status/446/8
Possibly the cause? Any takers?
UPDATE
After reverting all WebApps to single instances to rule out any relationship to load balancing, things ran fine for a while. Then the same "502" behavior reappeared across all servers for a period of approx. 15 min on 04.Jan.16, then disappeared again.
UPDATE
Problem reoccurred for a period of 10 min at 12.55 UTC/GMT on 08.Jan.16 and then disappeared again after a few min. Checking logfiles now for more info.
UPDATE
Problem reoccurred for a period of 90 min at roughly 11.00 UTC/GMT on 19.Jan.16 also on .scm. page. This is the "reference-client" Web App on the account with a Web App named "dummy1015". "502 - Web server received an invalid response while acting as a gateway or proxy server."
I don't think Traffic Manager is the issue here. Since Traffic Manager works at the DNS level, it cannot be the source of the 5XX errors you are seeing. To confirm, I suggest the following:
Check if the increased response times are coming from the DNS lookup or from the web request (a timing sketch follows below).
Introduce Traffic Manager whilst keeping your single-instance / non-load-balanced setup, and confirm that the problem does not reappear.
This will help confirm if the issue relates to Traffic Manager or some other aspect of the load-balancing.
Regards,
Jonathan Tuliani
Program Manager
Azure Networking - DNS and Traffic Manager
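To follow the first suggestion above, it helps to time the DNS lookup and the HTTP request separately. A small sketch with a hypothetical hostname:

    # Sketch: separate DNS lookup time from end-to-end HTTP response time.
    import socket
    import time

    import requests

    HOST = "example-app.trafficmanager.net"  # hypothetical

    t0 = time.perf_counter()
    addrs = socket.getaddrinfo(HOST, 443)
    dns_ms = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    resp = requests.get(f"https://{HOST}/", timeout=30)
    http_ms = (time.perf_counter() - t0) * 1000

    print(f"DNS lookup:          {dns_ms:.1f} ms ({len(addrs)} records)")
    print(f"HTTP GET end to end: {http_ms:.1f} ms (status {resp.status_code})")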

Does Azure load balancer allow connection draining

I can't seem to find any documentation for it.
If connection draining is not available, how is one supposed to do zero-downtime deployments?
Rick Rainey answered essentially the same question on Server Fault. He states:
The recommended way to do this is to have a custom health probe in your load balanced set. For example, you could have a simple healthcheck.html page on each of your VMs (in wwwroot for example) and direct the probe from your load balanced set to this page. As long as the probe can retrieve that page (HTTP 200), the Azure load balancer will keep sending user requests to the VM.
When you need to update a VM, you can simply rename the healthcheck.html to a different name such as _healthcheck.html. This will cause the probe to start receiving HTTP 404 errors and will take that machine out of the load balanced rotation because it is not getting HTTP 200. Existing connections will continue to be serviced but the Azure LB will stop sending new requests to the VM.
After your updates on the VM have been completed, rename _healthcheck.html back to healthcheck.html. The Azure LB probe will start getting HTTP 200 responses and as a result start sending requests to this VM again.
Repeat this for each VM in the load balanced set.
Note, however, that Kevin Williamson from Microsoft states in his MSDN blog post Heartbeats, Recovery, and the Load Balancer, "Make sure your probe path is not a simple HTML page, but actually includes logic to determine your service health (eg. Try to connect to your SQL database)." So you may actually want an aspx page that can check several factors, including a custom "drain" flag you put somewhere.
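Along those lines, the probe target could be a small dynamic page instead of a static file. A sketch, assuming Flask, a hypothetical drain-flag file, and a hypothetical SQL Server reachable on port 1433 (all of these are illustrative, not from the answer above):

    # Sketch of a probe page with real health logic plus a "drain" flag, as suggested above.
    import os
    import socket

    from flask import Flask

    app = Flask(__name__)

    DRAIN_FLAG = "/var/app/drain.flag"              # create this file instead of renaming the page
    SQL_HOST, SQL_PORT = "sqlserver.internal", 1433  # hypothetical dependency to check

    def sql_reachable(timeout=2):
        try:
            with socket.create_connection((SQL_HOST, SQL_PORT), timeout=timeout):
                return True
        except OSError:
            return False

    @app.route("/healthcheck")
    def healthcheck():
        if os.path.exists(DRAIN_FLAG):
            return "draining", 503        # probe fails, so the LB stops sending new connections
        if not sql_reachable():
            return "sql unreachable", 503
        return "OK", 200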
Your clients need to simply retry.
The load balancer only forwards a request to an instance that is alive (as determined by pings); it doesn't keep track of connections. So if you have long-standing connections, it is your responsibility to clean them up on restart events, or leave it to the OS to clean them up on restarts (which is obviously not graceful in most cases).
Zero-downtime means that you'll always be able to reach an instance that is alive, nothing more; it gives you no guarantees on long-running requests.
Note that when a probe is down, only new connections go to other VMs; existing connections are not impacted.
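As a sketch of the "simply retry" advice, here is a small client-side retry loop with backoff (the URL in the usage comment is hypothetical):

    # Minimal client-side retry with backoff, per the "simply retry" advice above.
    import time

    import requests

    def get_with_retries(url, attempts=4, backoff_s=2.0):
        for attempt in range(1, attempts + 1):
            try:
                resp = requests.get(url, timeout=10)
                if resp.status_code < 500:
                    return resp
            except requests.RequestException:
                pass  # e.g. connection reset while an instance restarts
            if attempt < attempts:
                time.sleep(backoff_s * attempt)
        raise RuntimeError(f"{url} still failing after {attempts} attempts")

    # Example usage:
    # get_with_retries("https://example-app.cloudapp.net/api/data")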
