....and the site doesn't respond in that time at all. Some responses even fail.
This is giving a horrible user experience where the site suddenly stops reacting for the people that are online at that moment.
The site is completely warmed up and responsive through the temporarily url.
Here is the log:
Is there anything to speed this up?
Is this normal behaviour or is this an error that should be reported?
From the log you posted it seems that the operation that consumes a long time is "ChangeDeploymentConfigurationBySlot".
An azure configuration change by default triggers the restart of the role instance (which might explain why your clients are experiencing downtime).
You can get further details in this blog: https://alexandrebrisebois.wordpress.com/2013/09/29/handling-cloud-service-role-configuration-changes-in-windows-azure/
Related
By looking at my Pingdom reports I have noted that my WebSite instance is getting recycled. Basically Pingdom is used to keep my site warm. When I look deeper into the Azure Logs ie /LogFiles/kudu/trace I notice a number of small xml files with "shutdown" or "startup" suffixes ie:
2015-07-29T20-05-05_abc123_002_Shutdown_0s.xml
While I suspect this might be to do with MS patching VMs, I am not sure. My application is not showing any raised exceptions, hence my suspicions that it is happening at the OS level. Is there a way to find out why my Instance is being shutdown?
I also admit I am using a one S2 instance scalable to three dependent on CPU usage. We may have to review this to use a 2-3 setup. Obviously this doubles the costs.
EDIT
I have looked at my Operation Logs and all I see is "UpdateWebsite" with status of "succeeded", however nothing for the times I saw the above files for. So it seems that the "instance" is being shutdown, but the event is not appearing in the "Operation Log". Why would this be? Had about 5 yesterday, yet the last "Operation Log" entry was 29/7.
An example of one of yesterday's shutdown xml file:
2015-08-05T13-26-18_abc123_002_Shutdown_1s.xml
You should see entries regarding backend maintenance in operation logs like this:
As for keeping your site alive, standard plans allows you to use the "Always On" feature which pretty much do what pingdom is doing to keep your website warm. Just enable it by using the configure tab of portal.
Configure web apps in Azure App Service
https://azure.microsoft.com/en-us/documentation/articles/web-sites-configure/
Every site on Azure runs 2 applications. 1 is yours and the other is the scm endpoint (a.k.a Kudu) these "shutdown" traces are for the kudu app, not for your site.
If you want similar traces for your site, you'll have to implement them yourself just like kudu does. If you don't have Always On enabled, Kudu get's shutdown after an hour of inactivity (as far as I remember).
Aside from that, like you mentioned Azure will shutdown your app during machine upgrade, though I don't think these shutdowns result in operational log events.
Are you seeing any side-effects? is this causing downtime?
When upgrades to the service are going on, your site might get moved to a different machine. We bring the site up on a new machine before shutting it down on the old one and letting connections drain, however this should not result in any perceivable downtime.
At certain times during the week while I'm testing my Mobile Services app I get a 503 error (Service Unavailable). It happens whether I try to call the app from localhost or live on my Azure Website. It hangs around for 10-15 minutes and then goes away on its own. It doesn't seem to be caused by anything in particular that I am doing (i.e. I have not updated any code). The 503 error occurs when I'm trying to call one of my custom APIs in my Mobile Services account. A few of the requests make it through (strangely enough) but the majority return a 503 error.
I've seen that someone had a very similar problem here (Why does Azure give me an intermittent Error 503. The service is unavailable?) without an acceptable resolution.
I am using the free version of Mobile Services but I should be no where near pushing the limits of what the free version can handle; I am the sole user of the app right now.
It will soon be time to make the service live and I'm shuddering at the thought of support calls that will come in during one of these funky states the service gets into. Any help in debugging the problem would be greatly appreciated.
EDIT:
I've narrowed this down to a database problem. I have one main query (sproc) that I use to feed data to the UI. I noticed that when I get the 503 errors the query takes about 13 seconds (when run in SSMS). When things are running "normally", the query takes less than a second.
This doesn't solve my problem though, in fact it makes it more perplexing because I am using the Business Edition of Windows Azure SQL Database and there shouldn't be a 13 second fluctuation in execution time!
This problem seems to happen randomly. Is there some kind of caching in SQL Server that could explain this? Maybe my query really does take 13 seconds to execute and the caching superficially speeds it up.
Could you try transitioning your database/server to one of the "editions"? They have resource governance to promote predictable performance. Web/Business suffer from a noisy neighbor problem. It sounds like that may be your issue, considering it is intermittent.
Here's a link to a page describing the editions. https://msdn.microsoft.com/en-us/library/azure/dn741340.aspx
I have a number of small MVC apps deployed as Microsoft Windows Azure websites. This has been working for several months.
Yesterday I rolled out a new one, and the deployment was unremarkable, everything worked fine. But a couple of hours later, access to the site was unavailable. The symptoms were that when the browser tried to navigate to the URL for that site, it would try to load for several minutes and then just give up with a completely blank page.
I attempted to stop and restart the site, and it worked once, but the symptoms came back several minutes later. Then I tried to stop and restart, and it didn't work.
I deployed the identical app to three additional URLs. Again, immediately on deployment, they all work fine, however, they fail at some interval in the future. They seem to not all fail at once. Sometimes restarting the site will fix the problem, and sometimes not.
IMPORTANT: If I wait for some period of time, the site may start to work again on its own.
However, deploying four versions of the app so that our users can go to a backup one if the primary one is not working is not optimal.
Any words of wisdom as to how I might go about debugging this?
ADDITIONAL INFO NOV 25, 2013:
When sites are failing, the IIS logs show either 500 or 502 Internal Service Errors. Our own MVC code is never hit, not even app_start.
You can start by checking the logs and remote debugging
http://www.drdobbs.com/windows/azure-sdk-22-supports-visual-studio-2013/240163499
Are the apps working locally?
Might not be the same problem, but from time to time our Azure instances will get the blue question mark of death as a status.
The reason we found out was that Microsoft will do upgrades on instances from time to time. If you have just one instance in a cloud service/role, then from time to time they will do maintenance and during that time it will be dead.
I have confirmed this with their support.
The only way to get around this that I know of is to create two instances. Then Microsoft guarantees ~99% availability.
Of course I also confirmed with them that this means twice the cost. =/
If that's not the issue I would enable RDP and get onto the machine to see what the problem is. Microsoft has these tools to help debug problems: http://blogs.msdn.com/b/kwill/archive/2013/08/26/azuretools-the-diagnostic-utility-used-by-the-windows-azure-developer-support-team.aspx
First, you should always run multiple instances of your web role with more than 1 upgrade domain. This is configurable in the service definition (CSDEF). Without this, you don't get an SLA from Microsoft, so you can't really complain that the VMs go down.
Second, to figure out what might be going on with these boxes, you should have both logs (my preference is to roll my own with page blobs or table storage), AND you should always have RDP access to a pre-production environment (production as well if you're not too fussed about security). Once on the box, look through the event viewer for errors.
Third, when an outage occurs check out the azure service dashboard (http://www.windowsazure.com/en-us/support/service-dashboard/) for outages.
Lastly, contact Microsoft support. It may take a few hours, but they are pretty good.
That it is happening repeatedly and for extended periods of time (more than 5 minutes), I would be there's something wrong with your hosted service. Again, RDP in and poke around. Good luck.
To debug your sites try to enable diagnostic logs:
http://www.windowsazure.com/en-us/develop/net/common-tasks/diagnostics-logging-and-instrumentation/
Another nice way to look around your site is using the debug console:
https://github.com/projectkudu/kudu/wiki/Kudu-console
I have a worker role deployed that works fine for a period of time (days...) but at some point it stops or crashes, then it can't restart at all and stays "Cycling...". The only solution is to Reimage the Role.
How can I set an automatic alert so I get an email when the Role becomes unresponsive (and Cycling...) ?
Thanks
Alerts or notifications like this are not available today, but they are being worked on. If this is causing service interruptions you could always sign up for an external monitoring service which will send you alerts whenever your site is down.
However, I would recommend solving the root cause of the problem rather than just Reimaging it to fix the symptom. Here is how I would start:
You are most likely hitting the issue described in http://blogs.msdn.com/b/kwill/archive/2012/09/19/role-instance-restarts-due-to-os-upgrades.aspx. In particular, see #1 under Common Issues where it talks about common causes for a role to not restart properly after being rebooted due to OS updates. Notice that #1 also talks about how to simulate these types of Azure environment issues (ie. manually do a Reboot from the portal) so you can reproduce the failure and debug it.
To troubleshoot the issue I would recommend reading through the troubleshooting series at http://blogs.msdn.com/b/kwill/archive/2013/08/09/windows-azure-paas-compute-diagnostics-data.aspx. Of particular interest to you is probably the "Troubleshooting Scenario 2 – Role Recycling After Running Fine For 2 Weeks"
Azure cannot notify you of such conditions. Consider placing a try/catch around your loop in the WorkerRole with a catch that can email you in case of an issue.
Alternatively, if you're open to using third party services, consider AzureWatch (I'm affiliated with the product). It can alert you in case your instance becomes Unresponsive, Busy, or goes thru other non-Ready status
I have two free subscriptions for windows azure and because I exceeded the limit on the first one, Microsoft closed it down. So I tried to deploy my application from the other subscription, and changed a few settings, and it seems to take a lot longer and the dns name of the depolyed application (in production area) does not seem to work. (I've been waiting for about 15 minutes.. in the other subscription it was almost immediate that the link started to work..). Also my webrole seems to be in a state of busy for a very long time..
The application always worked fine and now I'm getting all this trouble just by switching subscription?? I'm getting really frustrated with this especially because I all worked perfectly before. Now I have to 'waste' my time getting all the things to work again and I can't start with anything new. I don't think this is normal but I can't seem to find the solution to this either.
edit:
Over half an hour the dns finally started working but this still does not fix the problem with the extreme slow deploying and the busy state of the webrole..
Please study the discussion below to understand why the time to deploy an application could vary between 10-30 minutes:
Is there a way to reduce time between Azure deployment start and role OnStart() code being invoked?
Above details will helped you to get the answer about your statement ".. this still does not fix the problem with the extreme slow deploying and the busy state of the webrole.."..
To add more about that, when your application is deployment phase it goes through several state and in some cases the time taken in one state could be longer then expected and during this time you will see status as "Busy", "Initializing", "Starting.." etc and these state actually explain which level you are during your deployment. I hope this helps you to understand the time taken during deployment.