Following a recent investigation into an Azure web api going down (it does not like cold restarts as the queued requests then swamp the server, which 503's), I received the following:
Your application was restarted as site binding status changed. This
can most likely occur due to recent deployment slot swap operations.
In some cases after the swap the web app in the production slot may
restart later without any action taken by the app owner. This restart
may take place several hours/days after the swap took place. This
usually happens when the underlying storage infrastructure of Azure
App Service undergoes some changes. When that happens the application
will restart on all VMs at the same time which may result in a cold
start and a high latency of the HTTP requests. This event occurred
multiple times during the day.
The recommendation was
to minimize the random cold starts, you can set this app setting
WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG to 1 in every slot of
the app.
Can someone please elaborate on this?
Am I right in thinking that if we ever do a swap (eg: staging to production) at some random point in the future the app will restart?
What does the app setting actually do and how will it stop Azure restarting the production slot?
Answer from the link provided by Patrick Goode, whose google-fu is far better than mine:
"Just to explain the specifics of what
WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG app setting does. By
default we put the site’s hostnames into the site’s
applicationHost.config file “bindings” section. Then when the swap
happens the hostnames in the applicationHost.config get out of sync
with what the actual site’s hostnames are. That does not affect the
app in anyway while it is running, but as soon as some storage event
occurs, e.g. storage volume fail over, that discrepancy causes the
worker process app domain to recycle. If you use this app setting then
instead of the hostnames we will put the sitename into the “bindings”
section of the appHost.config file. The sitename does not change
during the swap so there will be no such discrepancy after the swap
and hence there should not be a restart."
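To make the quoted explanation concrete, here is a toy model of the mechanism it describes. It is only an illustration: `Slot`, `write_bindings`, `swap` and `out_of_sync` are hypothetical names, not Azure APIs, and the real platform logic is certainly more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    """Hypothetical stand-in for one deployment slot."""
    site_name: str                                 # stays with the slot during a swap
    hostnames: list                                # exchanged between slots by a swap
    bindings: list = field(default_factory=list)   # what was written to applicationHost.config


def write_bindings(slot, use_sitename):
    # Default behaviour: copy the hostnames. With the app setting: use the sitename.
    slot.bindings = [slot.site_name] if use_sitename else list(slot.hostnames)


def swap(a, b):
    # The swap exchanges hostnames; already-written bindings are left untouched.
    a.hostnames, b.hostnames = b.hostnames, a.hostnames


def out_of_sync(slot, use_sitename):
    # A later storage event would notice this discrepancy and recycle the app.
    expected = [slot.site_name] if use_sitename else list(slot.hostnames)
    return slot.bindings != expected


def mismatch_after_swap(use_sitename):
    prod = Slot("site-a", ["www.example.com"])
    staging = Slot("site-b", ["staging.example.com"])
    for s in (prod, staging):
        write_bindings(s, use_sitename)
    swap(prod, staging)
    return out_of_sync(prod, use_sitename)
```

In this model `mismatch_after_swap(False)` reports a discrepancy and `mismatch_after_swap(True)` does not, which is the difference the quote attributes to the setting.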
Looks like this setting is supposed to prevent 'random cold restarts'
https://ruslany.net/2019/06/azure-app-service-deployment-slots-tips-and-tricks
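Since the recommendation is to set the value in every slot, a small helper that builds the Azure CLI invocations can save some typing. This is a sketch, assuming the `az` CLI is installed and that `az webapp config appsettings set` is used with its `--resource-group`, `--name`, `--slot` and `--settings` options; the resource group, app and slot names below are placeholders, and the function only builds the commands rather than running them.

```python
SETTING = "WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG"


def appsetting_commands(resource_group, app_name, slots):
    """Return one `az` argument list for production plus one per named slot."""
    commands = []
    for slot in [None, *slots]:          # None stands for the production slot
        cmd = [
            "az", "webapp", "config", "appsettings", "set",
            "--resource-group", resource_group,
            "--name", app_name,
            "--settings", f"{SETTING}=1",
        ]
        if slot is not None:
            cmd += ["--slot", slot]      # production is addressed by omitting --slot
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Placeholder names; print the commands for review before running them.
    for cmd in appsetting_commands("my-rg", "my-api", ["staging"]):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the output looks right.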
Related
I'm working on an application that's hosted within Azure using an AppService, it sits behind an Azure Firewall and WAF (for reasons).
Over the Christmas break, most of my test environments went to sleep and never came back (they started dying after between 7 and 16 days of idle time). I could see the firewall attempting to health check them every 2 seconds, but at some point they all stopped responding. The AppService started returning 500.30 errors (which are visible in the AppServiceHttpLogs), but our applications weren't starting, and there were no ApplicationInsights logs (i.e. the app wasn't started/starting).
We also noticed that if we made any configuration change to any of the environments (not the app), the app would start and behave just fine.
It is worth noting that "AlwaysOn" is configured off, because as far as I'm aware, the startup will just cause some initial request latency (after 20 minutes of idle).
Has anybody got a good suggestion as to what happened? Could there be some weird interaction between "AlwaysOn" and Azure Firewall, and if so, why did it take weeks before it kicked in?
Thanks.
To answer my own question (partially).
There was an update to Azure, which rolled out across our environments over a couple of weeks. After the update there was a ~50% chance that the automatic restart killed our apps.
The apps were dying because, after a restart, there was a chance that the App Service would route to its Key Vault not via the VNet but via a public IP, which Key Vault would reject.
We determined that this was the issue using kudu --> tools --> diagnostic dump --> (some dump).zip --> LogFiles --> eventlog.xml
If you ever want to find App Service startup failure stack traces, this is a great place to look.
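Once eventlog.xml has been extracted from the dump, the error entries can be pulled out with a few lines of Python. This is a rough sketch that assumes the file uses the usual Windows event layout (`Event` elements with `System/Level`, `System/TimeCreated` and `EventData/Data` children, where level 1 is critical and 2 is error); check a real file first, since the layout is an assumption here and `error_events` is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def error_events(xml_text, max_level=2):
    """Return (timestamp, level, message) for events at or below max_level.

    Assumes Windows event levels: 1 = critical, 2 = error, 3 = warning.
    """
    root = ET.fromstring(xml_text)
    found = []
    for event in root.iter("Event"):
        level_text = (event.findtext("System/Level") or "").strip()
        if not level_text.isdigit():
            continue                      # skip entries without a numeric level
        level = int(level_text)
        if 1 <= level <= max_level:
            time_el = event.find("System/TimeCreated")
            when = time_el.get("SystemTime") if time_el is not None else None
            message = " ".join((d.text or "").strip() for d in event.iter("Data"))
            found.append((when, level, message))
    return found
```

Call it with the file contents, for example `error_events(open("eventlog.xml", encoding="utf-8").read())`, and look for the stack trace in the message field.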
Now we've got to work out why sometimes keyvault requests don't get routed via vnet, and instead go via the public IP.
For the past month, one of our web applications, hosted as a Web App on Azure, has been having a problem and I cannot find its root cause.
This WebApp is hosted on Azure on a 2 x B2 App Service Plan. On the same App Service Plan there is another WebApp that is currently working without any issue.
This WebApp is an ASP.NET WebApi application and exposes a set of REST APIs.
Effect: for no apparent reason (at least from what I know so far), the ThreadCount metric starts to climb, sometimes very slowly, sometimes within a few minutes. When that happens, no requests seem to be served and the service is dead.
Solution: a simple restart of the application (and this means a restart of the AppPool) causes an immediate drop in the ThreadCount and everything works as usual again.
Other observations: there is no "periodicity" in this event. It happened in the evening, in the morning and in the afternoon. It seems that evening is a preferred timeframe, but I won't say there is any correlation.
What I measured through Azure Monitoring Metric:
- Request Count seems to oscillate normally. There is no peak that causes that increase in ThreadCount
- CPU and Memory seem to be normal, nothing strange.
- Response time looks normal, like the other metrics.
- Connections (that should be related to sockets) oscillates normally. So I'd exclude something related to DB connections.
What may I do in order to understand what's going on?
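One cheap way to catch this early is to watch exported ThreadCount samples for a sustained climb and alert before the service stops responding. The sketch below is only an illustration: `sustained_climb`, the window size and the threshold are arbitrary assumptions, and the samples are expected to be a plain list of numbers exported from the metrics view.

```python
def sustained_climb(thread_counts, window=5, min_increase=50):
    """True if the last `window` samples never drop and rise by min_increase or more."""
    if len(thread_counts) < window:
        return False                       # not enough data to judge
    recent = thread_counts[-window:]
    never_drops = all(later >= earlier for earlier, later in zip(recent, recent[1:]))
    return never_drops and (recent[-1] - recent[0]) >= min_increase
```

Normal oscillation such as `[40, 42, 41, 43, 42]` is ignored, while a run like `[40, 60, 90, 130, 180]` is flagged.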
After a lot of research, this happened to be related to a wrong usage of Dependency Injection (using Ninject) and an application that wasn't designed to use it.
In order to diagnose, I discovered a very helpful feature in Azure. You can reach it by entering into the app that is having the problem, click on "Diagnose and solve problems" then click on "Diagnostic tools" and then select "Collect .NET profiler report". In that panel, after configuring the storage for the diagnostic files, you can select "Add thread report".
In those reports you can easily see what's going wrong.
Hope this helps.
By looking at my Pingdom reports I have noticed that my WebSite instance is getting recycled. Basically, Pingdom is used to keep my site warm. When I look deeper into the Azure logs, i.e. /LogFiles/kudu/trace, I notice a number of small XML files with "shutdown" or "startup" suffixes, e.g.:
2015-07-29T20-05-05_abc123_002_Shutdown_0s.xml
While I suspect this might be to do with MS patching VMs, I am not sure. My application is not showing any raised exceptions, hence my suspicion that it is happening at the OS level. Is there a way to find out why my instance is being shut down?
I also admit I am using a single S2 instance, scalable to three depending on CPU usage. We may have to review this to use a 2-3 setup. Obviously this doubles the costs.
EDIT
I have looked at my Operation Logs and all I see is "UpdateWebsite" with a status of "succeeded", but nothing for the times at which I saw the above files. So it seems that the "instance" is being shut down, but the event is not appearing in the "Operation Log". Why would this be? I had about 5 yesterday, yet the last "Operation Log" entry was 29/7.
An example of one of yesterday's shutdown xml file:
2015-08-05T13-26-18_abc123_002_Shutdown_1s.xml
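To line these files up against Pingdom's timeline, the names can be parsed into timestamps. The sketch below only handles names shaped like the two shown above; reading the middle parts as an instance id and a sequence number is a guess, and `parse_trace_name` is a made-up helper.

```python
import re
from datetime import datetime

# Matches names such as 2015-07-29T20-05-05_abc123_002_Shutdown_0s.xml
TRACE_NAME = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2})"
    r"_(?P<instance>[^_]+)_(?P<seq>\d+)_(?P<kind>[A-Za-z]+)_(?P<secs>\d+)s\.xml$"
)


def parse_trace_name(name):
    """Split a trace file name into its parts, or return None if it does not match."""
    match = TRACE_NAME.match(name)
    if match is None:
        return None
    return {
        "when": datetime.strptime(match.group("stamp"), "%Y-%m-%dT%H-%M-%S"),
        "instance": match.group("instance"),   # assumed meaning
        "sequence": match.group("seq"),        # assumed meaning
        "kind": match.group("kind"),           # e.g. Shutdown
        "seconds": int(match.group("secs")),
    }
```

Filtering a directory listing on `kind == "Shutdown"` and sorting by `when` gives a quick list of shutdown times.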
You should see entries regarding backend maintenance in the operation logs.
As for keeping your site alive, Standard plans allow you to use the "Always On" feature, which pretty much does what Pingdom is doing to keep your website warm. Just enable it using the Configure tab of the portal.
Configure web apps in Azure App Service
https://azure.microsoft.com/en-us/documentation/articles/web-sites-configure/
Every site on Azure runs 2 applications: one is yours and the other is the scm endpoint (a.k.a. Kudu). These "shutdown" traces are for the Kudu app, not for your site.
If you want similar traces for your site, you'll have to implement them yourself, just like Kudu does. If you don't have Always On enabled, Kudu gets shut down after an hour of inactivity (as far as I remember).
Aside from that, as you mentioned, Azure will shut down your app during machine upgrades, though I don't think these shutdowns result in operation log events.
Are you seeing any side effects? Is this causing downtime?
When upgrades to the service are going on, your site might get moved to a different machine. We bring the site up on a new machine before shutting it down on the old one and letting connections drain, however this should not result in any perceivable downtime.
I have a WebAPI application running on Azure WebSites. It is running in Basic mode and I have the option to make it "Always On". There seems to be conflicting information online about what this means exactly. I know the effect, but the "how" matters a lot here. In particular, does something automatically hit an endpoint in my application periodically? If so, can I control the endpoint it hits?
As I mentioned, it is a Web API application and the default route does non-trivial work and results in a notable amount of outbound traffic and it will also result in items being placed onto a work queue that will eventually be processed. I want the application always on (no cold start times) but I don't want some service making requests of application.
As soon as your Azure Website is marked as AlwaysOn, your site root will be hit within a few seconds. We also make sure your site is up and running on all the workers (if you have configured auto scale option or such). After that, if the worker process crashes, alwaysOn makes sure that it comes back up.
You cannot control the endpoint that it hits.
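Given that the root is what gets hit, one possible workaround (a design suggestion, not something this answer prescribes) is to keep the root route trivial and move the expensive work to its own path. The sketch below shows the idea as a tiny Python WSGI app standing in for the Web API; the `/work` path and the in-memory `work_queue` are made up for illustration.

```python
work_queue = []   # stand-in for the real work queue


def app(environ, start_response):
    """Minimal WSGI app: cheap root for warm-up pings, real work elsewhere."""
    path = environ.get("PATH_INFO", "/")
    method = environ.get("REQUEST_METHOD", "GET")
    if path == "/":
        # Warm-up requests land here: no outbound traffic, nothing queued.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"ok"]
    if path == "/work" and method == "POST":
        work_queue.append("job")          # the non-trivial work is triggered here only
        start_response("202 Accepted", [("Content-Type", "text/plain")])
        return [b"queued"]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

With this split, periodic requests to the root keep the process warm without placing anything on the queue.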
I haven't found a definitive list out there, but hopefully someone's got one going or we can come up with one ourselves. What causes disruptions for .NET applications, or general service disruption, running on IIS? For instance, web.config changes will cause a recompilation in JIT (while just deploying a single page doesn't affect the whole app), and iisresets halt everything (natch, but you see where I'm going). How about things like creating a new virtual directory under a current web app?
It's helpful to know all the cases so you know whether you can effect a change on a server without causing issues with the whole thing.
EDIT: I had IIS 6 in mind when I asked, but of course a list of anything different in other versions would be helpful as well to people.
It depends on what exactly you mean by disruptions. An IISReset can cause a Service Unavailable message to display for a short time as IIS is shut down and restarted.
Changes to the web.config, or adding a .dll file to the bin directory of an application, cause a recycle of the application domain, but that is not exactly a disruption; it is more of a "delay" in responding. The user will NOT see an error, just a delayed response from the server. You can also get that from changing any files in App_Code, or .vb files on non-WAP sites.
You can also get IIS worker process shutdowns due to inactivity; the default setting is 20 minutes. Again, this is a delay, not a lack of service.