Scenarios where an Azure Web Role restarts itself

I noticed that one of the web roles stopped and restarted itself. Could someone help me understand in what scenarios a web role restarts itself?
And is there any way to find out why the web role restarted?

That happens once in a while when Azure performs guest OS upgrades: it stops instances while honoring upgrade domains, then starts them again shortly thereafter. This is the most frequent scenario; the same can happen if the server hosting the VM is diagnosed as faulty, but that is quite rare.
You should be ready for such restarts - they are normal - and your code should be designed to continue working after such a restart.
Here's a post with more details on the upgrade process.
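In practice that means treating a stop/start as routine: keep each unit of work small and idempotent, and use the OnStop notification to wind down cleanly. A minimal sketch, assuming a standard .NET role built on Microsoft.WindowsAzure.ServiceRuntime (the batch-processing method is hypothetical):

```csharp
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    // Signals background work to stop when Azure begins shutting the instance down.
    private readonly CancellationTokenSource shutdown = new CancellationTokenSource();

    public override void Run()
    {
        while (!shutdown.IsCancellationRequested)
        {
            // Hypothetical unit of work: keep each iteration short and idempotent
            // so that a restart mid-batch can safely be re-run afterwards.
            ProcessNextBatch();
        }
    }

    public override void OnStop()
    {
        // Azure calls OnStop before the restart and allows a grace period
        // (around five minutes) to finish up cleanly.
        shutdown.Cancel();
        base.OnStop();
    }

    private void ProcessNextBatch() { /* hypothetical work item */ }
}
```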

Related

Azure App Services die when not used for a couple of weeks

I'm working on an application that's hosted within Azure using an AppService, it sits behind an Azure Firewall and WAF (for reasons).
Over the Christmas break, most of my test environments went to sleep and never came back (they started dying after between 7 and 16 days of idle time). I could see the firewall attempting to health-check them every 2 seconds, but at some point they all stopped responding. The App Service started returning 500.30 errors (which are visible in the AppServiceHttpLogs), but our applications weren't starting, and there were no Application Insights logs (i.e. the app wasn't started/starting).
We also noticed that if we made any configuration change to the environment (not the app), the app would start and behave just fine.
It is worth noting that "AlwaysOn" is configured off because, as far as I'm aware, the cold start should just cause some initial request latency (after 20 minutes of idle).
Has anybody got a good suggestion as to what happened? Could there be some weird interaction between "AlwaysOn" and Azure Firewall, and if so, why did it take weeks before it kicked in?
Thanks.
To answer my own question (partially).
There was an update to Azure, which rolled out across our environments over a couple of weeks. After the update there was a ~50% chance that the automatic restart killed our apps.
The apps were dying because, after a restart, there was a chance that the app service would not route to its Key Vault via the VNet, but instead via a public IP, which would be rejected by Key Vault.
We determined that this was the issue using Kudu --> Tools --> Diagnostic dump --> (some dump).zip --> LogFiles --> eventlog.xml
If you ever want to find App Service startup failure stack traces, this is a great place to look.
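If you'd rather script that than click through the Kudu UI, the same file is reachable over Kudu's VFS API. A hedged sketch; the site name and the deployment (publish-profile) credentials are placeholders:

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

class FetchEventLog
{
    static async Task Main()
    {
        // Placeholders: your app's SCM site and its deployment credentials.
        var user = "$my-app";
        var password = "<deployment-password>";
        var url = "https://my-app.scm.azurewebsites.net/api/vfs/LogFiles/eventlog.xml";

        using var client = new HttpClient();
        var token = Convert.ToBase64String(Encoding.ASCII.GetBytes($"{user}:{password}"));
        client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Basic", token);

        // Kudu's VFS endpoint serves files from the site's file system, including LogFiles.
        Console.WriteLine(await client.GetStringAsync(url));
    }
}
```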
Now we've got to work out why Key Vault requests sometimes aren't routed via the VNet and instead go via the public IP.
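One way to see which path an instance is actually taking is to resolve the vault's hostname from inside the app: with a working private endpoint the name should come back as a private address. A small sketch with a placeholder vault name; it needs to run inside the App Service (e.g. from a diagnostic endpoint) so it uses the app's DNS:

```csharp
using System;
using System.Net;
using System.Threading.Tasks;

class CheckVaultRoute
{
    static async Task Main()
    {
        // Placeholder vault hostname.
        var addresses = await Dns.GetHostAddressesAsync("my-vault.vault.azure.net");

        foreach (var ip in addresses)
        {
            // RFC 1918 ranges suggest the private endpoint (VNet route) is in use.
            var b = ip.GetAddressBytes();
            bool isPrivate = b.Length == 4 &&
                (b[0] == 10 ||
                 (b[0] == 172 && b[1] >= 16 && b[1] <= 31) ||
                 (b[0] == 192 && b[1] == 168));
            Console.WriteLine($"{ip} ({(isPrivate ? "private - VNet route" : "public route")})");
        }
    }
}
```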

Azure WebJobs: Does re-publishing the associated website cause existing jobs to stop running?

I have an Azure Website where I would like to be able to republish the website without stopping any webjobs that might be running in the background.
Ignoring the fact that it's bad practice to publish while the site is being used, this scenario means that a large queue might keep the webjobs firing 24/7 as load increases on the website.
I'm not sure if publishing the website (and not the webjobs) causes the webjobs (scheduled and on-demand) to cancel. Do they?
I think they do, and in that case, is there anything you can do to prevent that? I risk jobs being stopped halfway through because of the need to publish, and I don't want to sit there waiting for the queue to be empty before publishing. A method of allowing currently running jobs to finish without starting new runs would be fine too.
If the WebJob files (under wwwroot/app_data/jobs/...) are not updated by the publish, the jobs will not restart.
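If a publish does touch the jobs, a continuous WebJob can at least wind down gracefully: Kudu signals shutdown by creating the file whose path is in the WEBJOBS_SHUTDOWN_FILE environment variable, and the job gets a short (configurable) grace period to exit. A minimal sketch; the work loop is hypothetical:

```csharp
using System;
using System.IO;
using System.Threading;

class ContinuousJob
{
    static void Main()
    {
        var stopping = false;

        // Kudu creates this file when the WebJob is about to be stopped.
        var shutdownFile = Environment.GetEnvironmentVariable("WEBJOBS_SHUTDOWN_FILE");
        if (shutdownFile != null)
        {
            var watcher = new FileSystemWatcher(Path.GetDirectoryName(shutdownFile));
            watcher.Created += (s, e) =>
            {
                if (string.Equals(e.FullPath, shutdownFile, StringComparison.OrdinalIgnoreCase))
                    stopping = true;
            };
            watcher.EnableRaisingEvents = true;
        }

        while (!stopping)
        {
            // Hypothetical unit of work: finish the current message, then
            // re-check the flag before taking the next one.
            ProcessNextMessage();
        }
    }

    static void ProcessNextMessage() => Thread.Sleep(1000); // stand-in for real work
}
```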

Does Azure force-kill processes by itself? My Node.js/Java/JMeter processes are force-killed

I am using Windows Azure for a performance test on about 8 nodes, each running a different application. Since it's a performance test, we do generate quite a bit of traffic.
The test ran just fine for a few hours. Then suddenly we realised a few of the applications - Node.js, JMeter, and even Java processes - had been force-killed, each at a different time.
We find nothing in the logs that indicates an out-of-memory condition or any other error or application issue. And this happens pretty often, once every few hours. For example, we saw JMeter shut down once every 3-4 hours, and then once it happened after 10 hours of continuous running.
So we suspect Azure is using root permissions to force-kill the above processes.
Did any of you notice this with your applications on Azure, and do you know why?
Short answer: no, Azure does not kill your processes. There is no such thing as 'root permissions' to kill specific processes.
Are you running an IaaS VM or a PaaS Web/Worker Role? For PaaS, check out http://blogs.msdn.com/b/kwill/archive/2013/08/09/windows-azure-paas-compute-diagnostics-data.aspx for where to start getting diagnostic data. For IaaS, troubleshoot it like you would on-premises (DebugDiag, WinDBG, procmon, Application/System event logs, etc.), since there is really nothing Azure-specific that would cause this behavior.
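Either way, a quick first pass is to scan the Windows event logs around the time a process disappeared; an unexpected termination usually leaves an Error or Warning entry. A small sketch using System.Diagnostics.EventLog (Windows-only; the one-hour window is just an example):

```csharp
using System;
using System.Diagnostics;

class KillTimeline
{
    static void Main()
    {
        // Example window: entries from the hour around the observed kill.
        var from = DateTime.Now.AddHours(-1);

        foreach (var logName in new[] { "System", "Application" })
        {
            using var log = new EventLog(logName);
            foreach (EventLogEntry entry in log.Entries)
            {
                if (entry.TimeGenerated >= from &&
                    (entry.EntryType == EventLogEntryType.Error ||
                     entry.EntryType == EventLogEntryType.Warning))
                {
                    Console.WriteLine($"{entry.TimeGenerated} [{logName}] {entry.Source}: {entry.Message}");
                }
            }
        }
    }
}
```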

Worker Role goes Cycling... after some time, how can I get an alert?

I have a worker role deployed that works fine for a period of time (days...), but at some point it stops or crashes, and then it can't restart at all and stays "Cycling...". The only solution is to reimage the role.
How can I set an automatic alert so I get an email when the Role becomes unresponsive (and Cycling...) ?
Thanks
Alerts or notifications like this are not available today, but they are being worked on. If this is causing service interruptions, you could always sign up for an external monitoring service that will send you alerts whenever your site is down.
However, I would recommend solving the root cause of the problem rather than just Reimaging it to fix the symptom. Here is how I would start:
You are most likely hitting the issue described in http://blogs.msdn.com/b/kwill/archive/2012/09/19/role-instance-restarts-due-to-os-upgrades.aspx. In particular, see #1 under Common Issues, where it talks about common causes for a role not restarting properly after being rebooted due to OS updates. Notice that #1 also talks about how to simulate these types of Azure environment issues (i.e. manually doing a Reboot from the portal) so you can reproduce the failure and debug it.
To troubleshoot the issue I would recommend reading through the troubleshooting series at http://blogs.msdn.com/b/kwill/archive/2013/08/09/windows-azure-paas-compute-diagnostics-data.aspx. Of particular interest to you is probably the "Troubleshooting Scenario 2 – Role Recycling After Running Fine For 2 Weeks"
Azure cannot notify you of such conditions. Consider placing a try/catch around your loop in the WorkerRole with a catch that can email you in case of an issue.
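A hedged sketch of that pattern; the SMTP server and addresses are placeholders you'd swap for your own:

```csharp
using System;
using System.Net.Mail;
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    public override void Run()
    {
        while (true)
        {
            try
            {
                DoWork(); // hypothetical unit of work
            }
            catch (Exception ex)
            {
                // Alert and keep the loop alive: if Run() ever returns, Azure
                // recycles the instance, which is how roles end up "Cycling...".
                TrySendAlert(ex);
                Thread.Sleep(TimeSpan.FromMinutes(1)); // back off before retrying
            }
        }
    }

    private static void TrySendAlert(Exception ex)
    {
        try
        {
            // Placeholder SMTP settings.
            using (var smtp = new SmtpClient("smtp.example.com"))
            {
                smtp.Send("alerts@example.com", "me@example.com",
                          "Worker role exception", ex.ToString());
            }
        }
        catch { /* never let the alert itself crash the role */ }
    }

    private void DoWork() => Thread.Sleep(1000); // stand-in for real work
}
```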
Alternatively, if you're open to using third-party services, consider AzureWatch (I'm affiliated with the product). It can alert you in case your instance becomes Unresponsive, Busy, or goes through other non-Ready statuses.

Automated application provisioning and file locking

We have WFF 2.5 installed, and have used it to successfully configure a farm and provision a secondary server in our test environment.
Our environment (controller, primary, secondary servers) is Windows Server 2008 R2 Web Edition, running IIS 7.5, with WFF 2.5 installed.
We have ongoing issues with a .tmp file in an app-pool-related directory being locked. Process Monitor indicates that it is the worker process (w3wp.exe) locking the file.
The exact error message is: Failed to run operation "ProvisionApplications". Failed to run method "Microsoft.Web.Farm.SyncApplicationsRemoteMethod" on server "abc". Exception in response stream. An error was encountered when processing operation "Delete File" on "ABC85DA.tmp". The error code was 0x80070020. The process cannot access "C:\inetpub\temp\appPools\ABC85DA.tmp" because it is being used by another process.
If I shut down the Windows Process Activation Service, which AFAIK hosts the worker process, the error disappears.
Obviously, however, to bring the server online we need to start the service, and as soon as we do, the automated provisioning step fails, and WFF marks the server as unhealthy and takes it out of the farm.
I have tried to turn Application Provisioning off by unchecking "Enable Application Provisioning" under the Application Provisioning Module, but the operation still seems to fire every 30 seconds.
So - two problems really:
1. How to solve the file locking issue on the app pool temp file.
2. How to turn off the automated application provisioning operation on secondary servers? (This is really a second-prize workaround in case there is no solution to problem 1.)
TIA
Rebooting the ARR server caused it to stop attempting to repeatedly provision the secondary servers (in other words, it finally applied the "Enable Application Provisioning" option, which I had turned off).
Otherwise, I think the locking issue would still occur. This may not be an issue, since you can turn off the Windows Process Activation Service while initially provisioning the server, and during any subsequent application provisioning intended to sync the servers.
Automated provisioning on a schedule will still be an issue, I suspect.
