Windows Azure: Unexpected & unclean virtual-machine shutdown

I am using a Large instance of a virtual machine on Windows Azure. The instance runs Microsoft SQL Server 2012 with light usage on Windows Server 2012, fully up to date. No user is logged in at the time of the failures.
However, at seemingly random times (between zero and three times a day), the VM halts and shuts down. It does not come back online until someone logs in to the Management Portal and starts the VM again. No memory dump is created, so I am guessing that the host halts the running VM, rather than some configuration within the guest OS causing the halt. The subscription has billable funds. Other VMs in the subscription are also affected.
Only event logs generated:
Kernel-Power logged:
The system has rebooted without cleanly shutting down first. This
error could be caused if the system stopped responding, crashed, or
lost power unexpectedly.
Kernel-Boot logged:
The last shutdown's success status was false. The last boot's success
status was true.
How can this be resolved? There is no way to initiate a support request within Azure.

The first thing I would do is install monitoring software such as New Relic or Foglight and check whether you are running out of memory or whether a process is pinning the CPU.
This will give you visibility into activity on the box over time, and evidence should you need it to open a support request.
Azure now has paid support only:
http://www.windowsazure.com/en-us/support/plans/
We use the Developer plan for exactly this type of situation, where you are a bit lost trying to figure out a problem. The cost of $30 is small compared with the monthly cost of running a SQL Server 2012 VM, which makes it worth having. Microsoft's support is generally very good; they have more diagnostic information and will be able to tell you whether this is caused by an Azure failure or something else.
Getting diagnostics going should be the first port of call, though: then you can see what is going on and gather evidence to help track down the problem.

Related

Azure Web Site CPU High at random intervals of the day

I have had an Azure Web Site running for 6 months, and on Friday 1st April 2016 at 9:50pm the CPU was very high, which had an impact on the performance of the web site. Stopping and restarting the web service solved the problem, but it came back at 13:00. Since then the CPU has stayed high, making the web site unusable.
I've tried all the monitoring tools, DaaS, and Event Logs, checked for open connections, and ensured my software is closing or disposing objects correctly.
But the CPU is still high. The only way to resolve it is to restart the web service, but I don't want to keep doing this.
Has anyone else experienced a similar problem, and what was the solution?
The only thing in the event logs that looks like an issue is the odd "A network-related or instance-specific error occurred while establishing a connection to SQL Server", which could be because SQL Azure is not available.
Please help
Hmmm, high CPU means that your web site is executing code, perhaps a faulty loop on some infrequently used code path.
The brute-force way to identify what code is being executed would be to add tracing to your solution with System.Diagnostics.Trace.WriteLine("I am here") and then check the Azure Application Log.
Another way would be to attach the Visual Studio debugger during high CPU and check what is being executed.
A third way would be to take a dump or minidump from the Kudu site and analyze it with WinDbg:
1) Which thread is consuming CPU:
!runaway
2) What that thread is doing (requires the SOS extension to be loaded):
!clrstack
hth,
Aldo

Syncing clocks on multiple Azure VMs

I have a requirement to write a load test measuring message transmission latencies. In order to simulate a large number of simultaneous users without running into thread contention problems on one box, I'm spinning up multiple servers in Azure.
When I got my first results back, I was a little shocked to see that the results indicated the message was received before it was sent. I immediately realized that, while I had an implicit assumption that all the VMs would have their clocks synced to within milliseconds, that was clearly not the case.
I've spent several hours googling ways to resolve this, and I'm not getting anywhere. One thought was to have each VM query the time on a central server using NetRemoteTOD(), and then establish a per-machine correction factor to be added to the time measured from the local machine's clock. However, when I tried to run that method, I got error 2184, "The service has not been started". I have verified that both the RPC service and the Windows Time service are running on both the client and target machines, and I have not been successful in finding any information indicating what other service needs to be running (or even if the error really means what it seems to mean). (I also get the same error when running between my development desktop and a server on our corporate network. However, I can run it successfully against a PDC on the corporate network, but I can't find a PDC on Azure, since neither machine is part of a domain.)
So, does anyone have any information on what service needs to be started to get NetRemoteTOD (or the Windows NET TIME command, which relies on NetRemoteTOD under the covers) working? Alternatively, does anyone have a suggestion for some other technique to get a consistent time reference across multiple VMs in Azure? (Note that I don't necessarily need their clocks synced; I just need a way to establish a consistent correction factor to reference the times to a common source. Note also that I need sub-second accuracy; about 100 msec will probably do.) Basically, I just need a Windows function or shell command that will get me the time to sub-second accuracy on a given remote server.
Thanks in advance.
PS. Azure servers are running Server 2008 R2 SP1
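For illustration, the per-machine correction factor described above can be estimated with a round-trip measurement (Cristian's algorithm, the same idea NTP builds on). This is only a rough sketch in Python, not a NetRemoteTOD fix: `query_remote_time` is a hypothetical callable that returns the reference server's clock in seconds (for example from a small time endpoint you host on one VM), and the function name and parameters are made up for the example.

```python
import time


def estimate_offset(query_remote_time, samples=9, clock=time.time):
    """Estimate (remote clock - local clock) in seconds.

    query_remote_time is a hypothetical callable returning the reference
    server's current time in seconds. The sample with the smallest round
    trip is kept, because its midpoint assumption is the least distorted.
    Returns (offset, round_trip); the error is at most round_trip / 2.
    """
    best = None
    for _ in range(samples):
        t0 = clock()                  # local time just before the request
        remote = query_remote_time()  # reference time reported by the server
        t1 = clock()                  # local time just after the reply
        round_trip = t1 - t0
        # Assume the server read its clock halfway through the round trip.
        offset = remote - (t0 + t1) / 2.0
        if best is None or round_trip < best[0]:
            best = (round_trip, offset)
    return best[1], best[0]
```

Each load-test VM would add its own offset to local timestamps before they are compared. With round trips of a few milliseconds inside one datacenter, the error bound of half a round trip should sit comfortably within the 100 msec target, assuming the reference server answers promptly.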

How does one know why an Azure WebSite instance (WebApp) was shut down?

By looking at my Pingdom reports I have noticed that my WebSite instance is getting recycled. Basically, Pingdom is used to keep my site warm. When I look deeper into the Azure logs, i.e. /LogFiles/kudu/trace, I notice a number of small XML files with "shutdown" or "startup" suffixes, e.g.:
2015-07-29T20-05-05_abc123_002_Shutdown_0s.xml
While I suspect this might be to do with MS patching VMs, I am not sure. My application is not showing any raised exceptions, hence my suspicion that it is happening at the OS level. Is there a way to find out why my instance is being shut down?
I should also mention that I am using one S2 instance, scalable to three depending on CPU usage. We may have to review this and use a 2-3 setup. Obviously this doubles the costs.
EDIT
I have looked at my Operation Logs and all I see is "UpdateWebsite" with a status of "succeeded", but nothing for the times of the above files. So it seems that the "instance" is being shut down, but the event is not appearing in the "Operation Log". Why would this be? I had about 5 yesterday, yet the last "Operation Log" entry was 29/7.
An example of one of yesterday's shutdown XML files:
2015-08-05T13-26-18_abc123_002_Shutdown_1s.xml
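To line these files up against Pingdom reports or Operation Log entries, the file names can be turned into a list of timestamped events. A minimal Python sketch follows; the name layout (timestamp, site id, sequence number, event, duration in seconds) is only inferred from the two example names in this question and is an assumption, as is the helper name `parse_trace_name`.

```python
import re
from datetime import datetime

# Layout assumed from the two example names; not a documented format.
TRACE_NAME = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2})_"
    r"(?P<site>[^_]+)_(?P<seq>\d+)_"
    r"(?P<event>Shutdown|Startup)_(?P<secs>\d+)s\.xml$",
    re.IGNORECASE,
)


def parse_trace_name(name):
    """Return a dict describing one trace file name, or None if it does not match."""
    m = TRACE_NAME.match(name)
    if m is None:
        return None
    return {
        "time": datetime.strptime(m.group("ts"), "%Y-%m-%dT%H-%M-%S"),
        "site": m.group("site"),
        "event": m.group("event").capitalize(),
        "seconds": int(m.group("secs")),
    }
```

Sorting the parsed entries by "time" gives a timeline of shutdowns and startups that can be compared with the monitoring data.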
You should see entries regarding backend maintenance in the operation logs.
As for keeping your site alive, Standard plans allow you to use the "Always On" feature, which does pretty much what Pingdom is doing to keep your website warm. Just enable it on the Configure tab of the portal.
Configure web apps in Azure App Service
https://azure.microsoft.com/en-us/documentation/articles/web-sites-configure/
Every site on Azure runs 2 applications: one is yours and the other is the scm endpoint (a.k.a. Kudu). These "shutdown" traces are for the Kudu app, not for your site.
If you want similar traces for your site, you'll have to implement them yourself, just as Kudu does. If you don't have Always On enabled, Kudu gets shut down after an hour of inactivity (as far as I remember).
Aside from that, as you mentioned, Azure will shut down your app during machine upgrades, though I don't think these shutdowns result in operation log events.
Are you seeing any side effects? Is this causing downtime?
When upgrades to the service are going on, your site might get moved to a different machine. We bring the site up on a new machine before shutting it down on the old one and letting connections drain, however this should not result in any perceivable downtime.

Attaching a new disk to Azure IAAS VM caused a reboot?

I recently attached a new empty data disk to an Azure IaaS VM running Windows Server 2012 Datacenter. As soon as the disk was added, Windows rebooted. This surprised me greatly, as I didn't expect adding a data disk to cause a reboot. I looked in the event log and didn't see any errors; the event log indicated that NT AUTHORITY\SYSTEM initiated the reboot.
I attached another disk after it came back up, and it behaved as expected, the disk was added without a reboot.
Does anyone know why/what circumstances would cause an operation like that to make the system trigger a reboot?
Thank you!
This isn't expected behaviour, and isn't anything I've seen before.
I've added numerous data disks to numerous IaaS VMs on Azure, in both Windows 2008 R2 and Windows 2012 R2, and I've never seen a VM reboot automatically as a result of that action.
Is there any chance you were doing anything else at the same time which may have caused the reboot, such as a silent software installation? Alternatively, you might have been particularly unfortunate if it coincided with a scheduled Azure maintenance window (although you would normally see something in the Event Log if that was the case).

Website on Azure Virtual Machine stops responding every day

I have a website (orders.cpidealers.com) running on an Azure Virtual Machine currently configured to Basic, A2 (2 cores, 3.5 GB memory) monitoring 3 endpoints.
Every morning since Tuesday, June 24:
The website has been unavailable (the browser just spins; I don't even get a 401 or any error),
I can't RDP into the virtual machine,
The endpoint status shows a warning triangle (although when I click on the link next to it, some say Not Available while others give a time; I'm not sure how to interpret the endpoint status box).
To resolve the problem, I log in to Azure and restart the Virtual Machine. So far, everything seems to work fine for the remainder of the day, until I arrive at work the next morning at 7:30 (Mountain Time).
Any suggestions on how to troubleshoot this?
Well, it seems to me that your app somehow manages to hang IIS by exhausting resources. I can't tell you more without any data. You should enable some performance counter monitoring and see what is going on.
http://azure.microsoft.com/en-us/documentation/articles/cloud-services-dotnet-use-performance-counters/
http://www.codeproject.com/Articles/303686/Windows-Azure-Diagnostics-Performance-Counters-In
It looks like the system was hanging as Rouen mentioned. From that, we found this article which seems to have resolved the problem: IIS: Web Application hangs periodically needs system reboot
Here is everything my developer did:
I changed a few other things on the server. I set the SQL Server to never auto-close, which should help performance in the morning, and set gupdate to manual (we did that together). Then I found this article, which seems to describe our problem exactly, so I set the Credentials Manager to automatic and restarted.
