Routing requests away from an unhealthy (503ing) instance? - azure

We have a Web App hosted on multiple (scaled-out) Premium Dv2 instances using Azure App Service.
Occasionally our application fails to start up after a restart. This results in a 503 Service Unavailable response for requests to that instance. But when this happens, requests still get routed evenly between this instance and the healthy instances.
Shouldn't the load-balancer rather route requests away from this instance? Can this be achieved?
NOTE: We are not using API Management or App Service Environment.

Shouldn't the load-balancer rather route requests away from this instance?
Azure Load Balancer can probe the health of the various server instances. When a probe fails to respond, the load balancer stops sending new connections to the unhealthy instances.
AFAIK, before you get the 503 error, the request still gets routed to that instance.
But when this happens, requests still get routed evenly between this instance and the healthy instances.
I found the following possible scenarios in which requests are still routed to unhealthy instances:
1. The timeout and frequency values set in SuccessFailCount determine whether an instance is confirmed to be running or not running. In the Azure portal, the timeout is set to two times the value of the frequency.
2. The HTTP server doesn't respond at all after the timeout period. Depending on the timeout value that is set, multiple probe requests might go unanswered before the probe gets marked as not running.
3. If you have web roles that use w3wp.exe, you also get automatic monitoring of your website. Failures in your website code return a non-200 status to the load balancer probe. Consequently, the load balancer doesn't take that instance out of rotation.
4. The TCP server doesn't respond at all after the timeout period. When the probe is marked as not running depends on the number of failed probe requests that were configured to go unanswered before marking the probe as not running.
For more detail, you could refer to this article.
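To make the threshold behaviour in the list above concrete, here is a minimal sketch of how a probe that needs several consecutive failures might track one instance. The class name ProbeTracker and the threshold value are my own assumptions for illustration, not Azure's actual implementation; a real load balancer may also require more than one success before putting an instance back.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeTracker:
    """Hypothetical model of a threshold-based health probe for one instance."""
    unhealthy_threshold: int = 2   # assumed: consecutive failures before removal
    consecutive_failures: int = 0
    in_rotation: bool = True

    def record(self, status_code: Optional[int]) -> bool:
        """Record one probe result; None models a probe that timed out."""
        if status_code == 200:
            # Simplification: a single success restores the instance.
            self.consecutive_failures = 0
            self.in_rotation = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.unhealthy_threshold:
                self.in_rotation = False
        return self.in_rotation


tracker = ProbeTracker(unhealthy_threshold=2)
results = [tracker.record(code) for code in (200, 503, None, 200)]
print(results)  # [True, True, False, True]
```

Note that after the first failed probe the instance is still in rotation, which is one reason requests can keep landing on an instance that is already returning 503.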

Related

Why is azure load balancer still sending traffic to nodes after health probe down?

I have 2 Azure VM sitting behind a Standard Azure Load Balancer.
The load balancer has a healthprobe pinging every 5 seconds with HTTP on /health for each VM.
Interval is set to 5, port is set to 80 and /health, and "unhealthy threshold" is set to 2.
During deployment of an application, we set the /health-endpoint to return 503 and then wait 35 seconds to allow the load balancer to mark the instance as down, and so stop sending new traffic.
However, the load balancer does not seem to fully take the VM out of rotation. It still sends inbound traffic to the down instance, causing downtime for our customers.
I can see in IIS-logs that the /health-endpoint is indeed returning 503 when it should.
Any ideas what's wrong? Can it be some sort of TCP keep-alive?
I got confirmation from Microsoft that this is working "as intended", which makes the Azure Load Balancer a bad fit for web applications. This is the answer from Microsoft:
I was able to discuss your observation with the internal team.
They explained that the Load balancer does not currently have
“Connection Draining” feature and would not terminate existing
connections.
Connection Draining is available with the Application Gateway
Connection Draining.
I heard this is being planned for the Load Balancer as well, as part of
the future roadmap. You could also add your voice to the request for this
feature for the Load Balancer by filling in the feedback form.
Load Balancer is a pass-through service that does not terminate existing TCP connections; the flow is always between the client and the VM's guest OS and application. If a backend endpoint's health probe fails, established TCP connections to this backend endpoint continue, but the load balancer stops sending new flows to the unhealthy instance. This is by design, to give you the opportunity to shut down the application gracefully and avoid any unexpected and sudden termination of ongoing application workflow.
You may also consider configuring TCP reset on idle (https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-tcp-reset) to reduce the number of idle connections.
I would suggest the following approach:
Place a healthcheck.html page on each of your VMs. As long as the probe can retrieve the page, the load balancer will keep sending user requests to the VM.
When you do the deployment, simply rename healthcheck.html to a different name such as _healthcheck.html. This will cause the probe to start receiving HTTP 404 errors and will take that machine out of the load-balanced rotation.
After your deployment has completed, rename _healthcheck.html back to healthcheck.html. The Azure LB probe will start getting HTTP 200 responses and, as a result, the load balancer will start sending requests to this VM again.
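The rename steps could be scripted as part of the deployment. This is only a sketch: the function names are my own, and the demo uses a temporary directory as a stand-in for the real web root, whose path depends on your server setup.

```python
import tempfile
from pathlib import Path

ACTIVE_NAME = "healthcheck.html"    # page the probe requests
PARKED_NAME = "_healthcheck.html"   # name used while deploying


def take_out_of_rotation(web_root: Path) -> None:
    """Rename the probe page so the probe starts getting HTTP 404."""
    (web_root / ACTIVE_NAME).rename(web_root / PARKED_NAME)


def put_back_in_rotation(web_root: Path) -> None:
    """Restore the probe page so the probe gets HTTP 200 again."""
    (web_root / PARKED_NAME).rename(web_root / ACTIVE_NAME)


# Demo against a temporary directory standing in for the web root.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ACTIVE_NAME).write_text("OK")

    take_out_of_rotation(root)
    during_deploy = (root / ACTIVE_NAME).exists()

    put_back_in_rotation(root)
    after_deploy = (root / ACTIVE_NAME).exists()

print(during_deploy, after_deploy)  # False True
```

In a real deployment you would wait between the rename and the actual deploy long enough for the probe to register the failures.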
Thanks,
Manu

Azure Traffic Manager Priority Routing is not working

I created an Azure Traffic Manager profile with Priority routing. As per this:
The Traffic Manager profile contains a prioritized list of service
endpoints. By default, Traffic Manager sends all traffic to the
primary (highest-priority) endpoint. If the primary endpoint is not
available, Traffic Manager routes the traffic to the second endpoint.
If both the primary and secondary endpoints are not available, the
traffic goes to the third, and so on
My Traffic Manager monitoring
Low Priority
High Priority
I tried to increase the priority and decrease the priority but there is no change.
Still, you can see that Traffic Manager is pointing towards the teststatic site alone.
Another question from the above doc
If the primary endpoint is not available
Here, what is meant by not available? As I'm using Azure Web Apps for my testing, I thought that stopping my web app would make it not available. But I was wrong: even though I stopped the web app, Traffic Manager still points to the stopped web app. So I'm confused about what is meant by not available here.
In your screenshots, the test endpoint monitor status is always Degraded. This indicates that the endpoint is not included in DNS responses and does not receive traffic, so Traffic Manager still points towards the teststatic site alone. If the monitoring protocol is HTTP or HTTPS, Traffic Manager considers an endpoint to be Online only when the probe receives an HTTP 200 response back from the probe path. Any non-200 response is a failure.
You need to troubleshoot the Degraded state on Azure Traffic Manager; see Traffic Manager shows monitor status is degraded – Resolution.
What is meant by not available here?
Traffic Manager chooses an endpoint based on the status of each endpoint (disabled endpoints are not returned), the current health of each endpoint, and the chosen traffic-routing method. An endpoint is not available if it is not included in the DNS response, that is, if it is disabled or unhealthy. The exception is when all endpoints are degraded, in which case all of them are considered for the query response. You can get more details from the endpoint monitor status.
An endpoint is unhealthy when any of the following events occur: A
non-200 response is received (including a different 2xx code, or a
301/302 redirect); Request for client authentication; Timeout (the
timeout threshold is 10 seconds); Unable to connect.
Besides, type ipconfig /flushdns to flush the DNS resolver cache when you verify the Traffic Manager settings.
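As a rough illustration of the selection rules described above, here is a small sketch of Priority routing. The Endpoint class and pick_endpoint function are hypothetical names of mine, and the code assumes that a lower priority number means a higher priority; it models the described behaviour and is not Traffic Manager's actual code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    name: str
    priority: int                       # assumption: lower number = higher priority
    enabled: bool = True
    probe_status: Optional[int] = 200   # None models a timeout or failed connection

    @property
    def online(self) -> bool:
        # Only an exact 200 counts; other 2xx codes and 301/302 redirects fail.
        return self.probe_status == 200


def pick_endpoint(endpoints: List[Endpoint]) -> Optional[str]:
    """Return the name of the endpoint that would be handed out, if any."""
    enabled = [e for e in endpoints if e.enabled]
    if not enabled:
        return None
    online = [e for e in enabled if e.online]
    # Exception from the text: if every endpoint is degraded, consider them all.
    eligible = online or enabled
    return min(eligible, key=lambda e: e.priority).name


endpoints = [
    Endpoint("test", priority=1, probe_status=403),        # degraded
    Endpoint("teststatic", priority=2, probe_status=200),  # online
]
print(pick_endpoint(endpoints))  # teststatic
```

This matches what you see: as long as the higher-priority endpoint is degraded, changing priorities makes no difference, because only the online endpoint is eligible.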

How to stop requests to one Azure VM in load balanced set and avoid the 503 error?

I get Service unavailable 503 error occasionally after stopping app pool in one Azure VM in a load balanced set. I stopped the app pool to perform maintenance on the VM.
Is there a better way to stop requests to one VM and avoid the 503 error?
You can define probes associated with your load balancer. Those probes will ping each instance at a defined interval (default is 5 seconds). After consecutive failures (default is 2 failures in a row), the load balancer will no longer route traffic to that VM. So by default the user should not encounter any 503s once the VM has been unresponsive for 10 seconds.
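The 10 seconds is just the probe interval multiplied by the number of failures needed. A quick sketch of that arithmetic follows; the function names and the request rate in the example are made up for illustration, and the result is an approximation that ignores probe timeouts.

```python
def detection_window_seconds(probe_interval_s, failures_needed):
    """Approximate time before an unresponsive VM is taken out of rotation."""
    return probe_interval_s * failures_needed


def requests_hitting_bad_instance(window_s, requests_per_second, instance_count):
    """Rough count of requests sent to the bad VM during the window,
    assuming traffic is spread evenly across all instances."""
    return window_s * requests_per_second / instance_count


window = detection_window_seconds(probe_interval_s=5, failures_needed=2)
print(window)  # 10

# Hypothetical load: 30 requests/second spread across 3 VMs.
print(requests_hitting_bad_instance(window, requests_per_second=30, instance_count=3))  # 100.0
```

So even with the defaults, some users can still see 503s during that window unless you take the VM out of rotation before stopping the app pool.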

How do I get notified that my azure endpoint has been removed from the load-balanced set

I have 3 Ubuntu VMs configured (they are not websites but are configured as a cloud service) with HTTP endpoints in a load-balanced set. This is working really well with the probe configured to check every 15 seconds on port 80. When the URL does not return a status of 200, the VM gets removed from the load-balanced set until the next time that it returns a status of 200. I have used the web portal to configure the endpoints and probe settings.
When an unhealthy instance is taken out of the load-balanced set, I would like to be informed (preferably via email) of the situation so that I can fix the issue.

Delete Azure VM Instance from load balanced Cloud Service

I have 2 Azure VMs (Linux) being load balanced by a public Azure Cloud Service. Both instances show in the Azure Management portal for the same cloud service. I want to take down one instance and perform some maintenance. However, since the instance is still showing even though the VM has been shut down, the Cloud Service is still directing traffic to it. How do I delete an instance from the Cloud Service or stop the Cloud Service from directing traffic to a particular VM instance? Then afterwards, how does one re-associate an existing VM with that service (i.e. change from one Cloud Service to another)?
Note: SSH works into the VM but other ports used by the VM are not working acting like they are trying to go to the other VM even though the correct endpoints are created to the active VM.
The purpose of a port probe in a load-balanced set is for the load balancer to be able to detect whether or not a VM is able to accept traffic. When configuring the load-balanced endpoint you can specify a webpage or a TCP endpoint for the probe - and this should be present on each instance. Traffic will be directed to the VM as long as the webpage returns 200 OK or the TCP endpoint accepts the connection when the load balancer probes. You can specify the time interval between probes and the number of probes that must fail before the endpoint is deemed dead and should be taken out of rotation (defaults are every 15 seconds and 2 probes).
You can take a VM out of load-balancer rotation by ensuring that the configured probe page returns something other than 200 OK and then bring it back into rotation by having it once again send a 200 OK.
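One way to do this is a probe page whose status you can flip from inside the application. Below is a minimal sketch using only the Python standard library; the /health path, the in-process maintenance flag, and the local demo server are assumptions for illustration, and a real app would expose the switch through whatever mechanism suits it (a file, a config value, an admin endpoint).

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process switch; set it before maintenance, clear it after.
maintenance = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        # 200 keeps the VM in rotation; anything else takes it out.
        status = 503 if maintenance.is_set() else 200
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def probe(url):
    """Return the HTTP status code a probe would see."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


# Demo: run the server on a free local port and probe it three times.
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/health"

before = probe(url)
maintenance.set()
during = probe(url)
maintenance.clear()
after = probe(url)

server.shutdown()
server.server_close()
print(before, during, after)  # 200 503 200
```

The rest of the site keeps serving normally while the flag is set, so in-flight work can finish before you start the maintenance.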
When I have needed to keep my webservice running and returning status of 200 I have had to resort to removing the endpoint from the load-balanced set. It is pretty simple to do but it does take usually a minute for the webPortal to remove the endpoint and then again once you recreate the endpoint to put it back in the set.
