I am using Microsoft Azure Web App for managing my releases (testing and production environments) via deployment slots.
Recently, I made some infrastructure changes to my application, and after thorough testing, I created a new deployment slot and cloned the production configuration into it.
I started increasing traffic to the new slot gradually, to catch any problems caused by the changes. After seeing no exceptions, and as a final test, I turned the new deployment slot's traffic up to 100%.
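For reference, this is the kind of traffic-routing rule I mean; a minimal Azure CLI sketch with hypothetical resource names (the "Traffic %" field in the portal's deployment slots blade does the same thing):

```
# Route 10% of production traffic to the new slot.
az webapp traffic-routing set \
  --resource-group <rg> --name <app> --distribution <new-slot>=10

# Final test: send 100% of traffic to the new slot.
az webapp traffic-routing set \
  --resource-group <rg> --name <app> --distribution <new-slot>=100
```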
After receiving a few emails from some clients, I checked Application Insights and my database. Application Insights showed no exceptions, and new data was being inserted into the database.
After further investigation of the logs, I found out that a huge number of the new requests were still being routed to the main production slot, which has 0% traffic. These requests were being rejected with status code 405 (Method Not Allowed).
I investigated the possibility of this being caused by the routing path being saved in the x-ms-routing-name cookie. However, my application gets tens of new users every day, so they cannot have had any cookies saved on their devices. I also gave it an extra couple of hours to make sure this was not caused by any caching policy. Unfortunately, nothing changed over that time.
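For what it's worth, the slot routing can also be forced per request with the x-ms-routing-name query parameter (the response then pins the client via the cookie of the same name); a quick way to test, with a hypothetical host name:

```
# Pin a request to the production slot ("self") or to a named slot; the
# response sets the x-ms-routing-name cookie for subsequent requests.
curl -i "https://<app>.azurewebsites.net/?x-ms-routing-name=self"
curl -i "https://<app>.azurewebsites.net/?x-ms-routing-name=<slot-name>"
```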
I also tried the functionality of the new testing slot by opening it directly via its URL, and everything worked as expected.
To understand what's happening, I read the Web Apps documentation, but I couldn't find anything useful for my case.
Has anybody faced this problem before?
How can I prevent new requests from going to deployment slots with 0% traffic?
Am I doing something wrong in my deployment?
Could it be a configuration error? Though I did clone the production configuration.
I highly appreciate any answers.
I'm working on an application that's hosted within Azure using an App Service; it sits behind an Azure Firewall and WAF (for reasons).
Over the Christmas break, most of my test environments went to sleep and never came back (they started dying after between 7 and 16 days of idle time). I could see the firewall attempting to health-check them every 2 seconds, but at some point they all stopped responding. The App Service started returning 500.30 errors (which are visible in the AppServiceHttpLogs), but our applications weren't starting, and there were no Application Insights logs (i.e. the app wasn't started/starting).
We also noticed that if we made any configuration change to any of the environments (not the app), the app would start and behave just fine.
It is worth noting that "Always On" is turned off, because as far as I'm aware, the cold start should just cause some initial request latency (after 20 minutes of idle).
Has anybody got a good suggestion as to what happened? Could there be some weird interaction between "Always On" and Azure Firewall, and if so, why did it take weeks before it kicked in?
Thanks.
To answer my own question (partially).
There was an update to Azure, which rolled out across our environments over a couple of weeks. After the update, there was a ~50% chance that the automatic restart killed our apps.
The apps were dying because, after a restart, there was a chance that the App Service would not route to its Key Vault via the VNet but instead via a public IP, which would be rejected by Key Vault.
We determined that this was the issue using Kudu --> Tools --> Diagnostic dump --> (some dump).zip --> LogFiles --> eventlog.xml
If you ever want to find App Service startup failure stack traces, this is a great place to look.
Now we've got to work out why Key Vault requests sometimes don't get routed via the VNet and instead go via the public IP.
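One thing on our list to check (an assumption on my part, not a confirmed fix) is forcing all outbound traffic through the integrated VNet so the Key Vault calls can't fall back to a public route; hypothetical resource names:

```
# Force all outbound traffic from the app through the VNet integration.
az webapp config appsettings set \
  --resource-group <rg> --name <app> \
  --settings WEBSITE_VNET_ROUTE_ALL=1
```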
I have 6 Web Apps (ASP.NET, Windows) running on Azure, and they have been running for years. I do tweak them from time to time, but there have been no major changes.
About a week ago, all of them started to leak handles, as shown in the image: this is just the last 30 days, but the constant curve goes back "forever". Now, while I did make some minor changes to some of the sites, there are at least 3 sites that I did not touch at all.
But still, major leakage started for all sites a week ago. Any ideas what could be causing this?
I would like to add that one of the sites has only a single .aspx page, and another site does not have any code at all; it's just there to run a WebJob containing the Let's Encrypt script. That hasn't changed for several months.
So basically, I'm looking for any pointers, but I doubt this has anything to do with my code, given that 2 of the sites contain none of my code and still show the same symptom.
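In case it helps anyone watching for the same symptom: the portal graph's counter can also be pulled from the CLI; a sketch assuming the App Service metric is named "Handles", with a hypothetical resource ID:

```
# Hourly handle counts for the web app; defaults to the most recent window,
# add --start-time/--end-time to widen the range.
az monitor metrics list \
  --resource "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Web/sites/<app>" \
  --metric "Handles" --interval PT1H
```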
Final information from the product team:
The Microsoft Azure team has investigated the issue you experienced, which resulted in an increased number of handles in your application. An excessive number of handles can potentially contribute to application slowness and crashes.
Upon investigation, engineers discovered that a recent upgrade of Azure App Service, containing improvements to platform monitoring, resulted in a leak of registry key handles in application worker processes. The registry key handle in question is not properly closed by a module which is owned by the platform and is injected into every Web App. This module ensures various basic functionalities and features of Azure App Service, such as correct processing of HTTP headers, remote debugging (if enabled and applicable), correct return of responses through load balancers to clients, and others. This module was recently improved to include additional information passed around within the infrastructure (this information does not leave the boundary of Azure App Service, so it is not visible to customers). It includes the versions of the modules which processed each request, so that internal detection of issues caused by component version changes can be easier and faster. The issue is caused by not closing a specific registry key handle while reading the version information from the machine's registry.
As a workaround/mitigation in case customers see any issues (like an application increased latency), it is advised to restart a web app which resets all handles and instantly cleans up all leaks in memory.
Engineers prepared a fix which will be rolled out in the next regularly scheduled upgrade of the platform. There is also a parallel rollout of a temporary fix which should finish by 12/23. Any apps restarted after this temporary fix is rolled out shouldn’t observe the issue anymore as the restarted processes will automatically pick up a new version of the module in question.
We are continuously taking steps to improve the Azure Web App service and our processes to ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):
• Fixing the registry key handle leak in the platform module
• Fixing the gap in test coverage and monitoring to ensure that such regressions are automatically detected before they are rolled out to customers
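For completeness, the restart workaround they mention is scriptable; a one-liner with hypothetical resource names:

```
# Restarting the web app resets all handles held by its worker processes.
az webapp restart --resource-group <rg> --name <app>
```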
So it appears this is a problem with Azure. Here is the relevant part of the current response from Azure technical support:
==>
We discussed this with the PG team directly and observed that a few other customers are also facing this issue, so our product team is actively working to resolve it as early as possible. There is a good chance that the fixes will be available within a few days, unless something unexpected comes in and prevents us from completing the patch.
<==
Will add more info as it becomes available.
I'm deploying updates to my Function App through the VS publish window. I set up a deployment slot with auto swap turned on, and my updates through VS go to the slot. The problem is that right after the publish succeeds, when I test my API endpoints, I briefly receive 503 errors. I was under the impression that auto swap was seamless and end users would not experience such interruptions. Am I missing something? How can I make my deployments unnoticeable to users?
Switching to something like API Management or Traffic Manager is obviously an option, but slots are designed to do exactly what you want, and they should work the way you expect.
I looked into this a bit. Unfortunately, I can reproduce your issue, which surprised me. A few things feel a bit off when using Azure Functions with slots, so maybe there is some weirdness under the covers.
The official documentation does not mention anything about this, however; quite the opposite:
Traffic redirection is seamless; no requests are dropped because of a swap.
If a function is running during a swap, execution continues and the next triggers are routed to the swapped app instance.
You don't even need to use Auto Swap. Just publish to both slots and swap the slots manually. When observing the responses, the following pattern can be seen:
Responses of old code
Responses of new code
503 errors for ~10 seconds
Request slowdown
Responses of new code
I tried:
AppService Plan & Consumption Plan
ARR Affinity On/Off
Azure Function V2 and V3 runtime
This seems like a bug to me. I would suggest you create a support case and maybe an issue on GitHub. I might do so myself if I find the time in the next few days. See also this issue:
https://github.com/Azure/Azure-Functions/issues/862
Edit: the linked GitHub issue, and also the Medium article mentioned by Ron, point out that you can set WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG to 1, and this should help with the 503 errors. It is documented behavior, buried very deep in the App Service docs. Why it is not mentioned for Azure Functions eludes me.
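If it helps, the setting can be applied from the CLI; since it has to survive the swap, it presumably needs to be present on both the production app and the slot (hypothetical names; slot name assumed to be "staging"):

```
az webapp config appsettings set --resource-group <rg> --name <app> \
  --settings WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG=1

az webapp config appsettings set --resource-group <rg> --name <app> \
  --slot staging \
  --settings WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG=1
```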
Did you see this?
Does Azure Functions throw 503 errors while app settings are being updated?
Depending on how you are doing the swap, it could be triggering a restart because the app settings are "changing".
There is also this, which would probably help, but it's only a Premium feature:
https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-warmup?tabs=csharp
I would also check out
https://medium.com/@yapaxinl/azure-deployment-slots-how-not-to-make-deployment-worse-23c5819d1a17
I believe the solution would be adding an API Management instance in front of your Azure Functions, then implementing a retry policy in it. This error seems to be related to the DNS swap between the slots.
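A sketch of what such a retry policy could look like in the API Management backend section (the status-code condition, count, and interval are illustrative, not recommendations):

```
<backend>
  <!-- Retry the backend call up to 3 times, 2 seconds apart,
       whenever it answers 503 during the slot swap. -->
  <retry condition="@(context.Response.StatusCode == 503)" count="3" interval="2">
    <forward-request buffer-request-body="true" />
  </retry>
</backend>
```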
The general practice a lot of people follow is maintaining two hosts (Azure Functions/App Services) in two different regions behind Azure Traffic Manager, and deployment goes as follows (see the CLI sketch after this list):
disable first region in traffic manager
swap functions in the first region
enable first region in traffic manager
disable second region in traffic manager
swap functions in the second region
enable second region in traffic manager
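A sketch of that sequence for the first region using the Azure CLI (all names hypothetical; the same three steps then repeat for the second region):

```
# 1. Disable the first region's endpoint in Traffic Manager.
az network traffic-manager endpoint update \
  --resource-group <rg> --profile-name <tm-profile> \
  --type azureEndpoints --name <region1-endpoint> \
  --endpoint-status Disabled

# 2. Swap the function app's staging slot into production.
az functionapp deployment slot swap \
  --resource-group <rg> --name <region1-app> --slot staging

# 3. Re-enable the first region's endpoint.
az network traffic-manager endpoint update \
  --resource-group <rg> --profile-name <tm-profile> \
  --type azureEndpoints --name <region1-endpoint> \
  --endpoint-status Enabled
```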
Although this does not solve the issue of Azure Functions returning 503s, it does make them unnoticeable to the user, as you always route to the stable endpoint.
Having two regions also helps handle other issues, like an Azure outage in a specific region.
We have a few of our internal business services hosted on an isolated ASE in Azure.
These services run on a medium app service plan with 2 instances.
This environment has been in production and in use for a little more than a month now, and it has been performing fairly well apart from the occasional sudden CPU spike to 100% in one of the instances, which brings down the services.
We don't have auto-scaling set up, but we have 2 instances running all the time.
The services are `aspnetcore` Web APIs, and the runtime is .NET Core 2.0.
Every time I have come across this issue in the last couple of weeks, I have not been lucky enough to log in to Kudu and get a process dump to investigate further. The business is breathing down my neck to get the service up and running as quickly as possible, and the easiest route is to restart the faulting service or swap slots with a pre-prod environment.
Access to the ASE is also restricted from our network, which makes it all the more difficult for me to switch to Wi-Fi and then go through jump boxes to log in to Kudu. I had asked our Ops engineer to get me the dump when this issue is reported, but he has not been listening to me either, mostly for the same reasons that keep me from doing it myself.
All exceptions I can see in Application Insights are due to the services themselves going down, and there are no exceptions there which could cause the issue in the first place (at least I've not found one yet).
This led me to take a few guesses and look at metrics, and the only thing raising my suspicion is garbage collection. I don't see any sudden spike in the GC graphs either: each time the service is restarted, the graph is fairly straight over 24 hours, but it increases day by day and ends up like the one below.
But the working memory is a sinusoidal graph, which leads me to think there are no memory leaks. Still, is the above graph over 3 days normal?
The drop is when I restart the service. But all services have a similar trajectory, even the one that has not gone down.
I am not sure if this is a problem with an individual service or an environment configuration I have overlooked.
The API endpoints are simple CRUD operations and publish events to a service bus topic after each operation. There is a static `HttpClient` instance used to fetch data from another service. Apart from that there are no unmanaged resources and the DB connections are always wrapped in `using` statements.
I understand I would need a process dump to investigate further, but my biggest concern is why the application gateway (load balancer) is not sending traffic to the healthy instance. Because the gateway goes unhealthy, Cloudflare returns a `502` response to clients using the API.
MS support hasn't been able to help and has not answered whether our load balancers are working correctly.
The average number of requests is about 50-60 per minute.
CPU runs at less than 10% apart from these sudden surges.
Thanks
It could be that the backend is pegged at 100% CPU and is unable to respond to Application Gateway health probes. When such an issue occurs, were you able to verify, using Backend health logs, the health state of your backends? If both backend instances were unhealthy, it would explain the 502s. If one of them was healthy and responding to probes, then new requests sent to Application Gateway would indeed flow to the healthy instance. If you suspect that is not the case then please reply back with subscription id, gateway name and approximate time window of incident for us to take a look.
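If it's useful, the current backend health state can also be pulled on demand from the CLI (hypothetical names), which is a quick way to see which instance the gateway considered unhealthy:

```
az network application-gateway show-backend-health \
  --resource-group <rg> --name <gateway-name>
```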
I deployed a Node.js server to an Azure Web App, and it worked fine. But I see that sometimes the response time is very slow. Also, I see that somewhere above 500 requests/second the server starts to fail handling requests, while using only 15% CPU. I checked, and the server returns 500 errors because the pipe is busy (per the Win32 error code). That's why I was wondering if there is something I can change in the iisnode config to improve the server's request capacity.
I have already enabled the Always On feature and added a check in Pingdom to keep the site alive. I also already changed nodeProcessCountPerApplication to 0 so it uses one node process per available core.
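For reference, the iisnode knobs I'm looking at live in web.config; a minimal sketch based on iisnode's sample configuration (the values shown are defaults as I understand them, not what I'm running, and defaults can differ between versions):

```
<configuration>
  <system.webServer>
    <!-- 0 = one node.exe process per available CPU core -->
    <iisnode nodeProcessCountPerApplication="0"
             maxConcurrentRequestsPerProcess="1024"
             maxNamedPipeConnectionRetry="100"
             namedPipeConnectionRetryDelay="250" />
  </system.webServer>
</configuration>
```

Since the 500s come back as "pipe is busy", the named-pipe retry settings are the ones I was thinking of experimenting with first.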
Thank you,
Omer
One thing you can do is enable Always On. Without it, when your site hasn't been visited for 20 minutes, it gets taken down. Then, the next time someone makes a request to your site, Azure Web Apps warms it up (sets it up again), but this process takes a few seconds.
Note that Always On is only available for sites in the Basic, Standard, or Premium SKUs.
Also, check out this page for tips on debugging Node.js apps in Azure Web Apps: https://azure.microsoft.com/en-us/documentation/articles/web-sites-nodejs-debug/