One of my 10 Azure VMs running Windows has suddenly become inaccessible! The Azure Management Console shows the state of this VM as "running", but the Dashboard shows no server activity since my last RDP logout 16 hours ago. I tried restarting the instance with no success; it remained inaccessible (no RDP access, hosted application down, unsuccessful ping...).
I changed the instance size from A5 to A6 using the management portal and everything went back to normal. The Windows Event Viewer showed no errors except the unexpected shutdown today after my instance size change. Nothing was logged between my RDP logout yesterday and the system startup today after changing the size.
I can't afford to have a server down for 16 hours! Luckily, this was the development server.
How can I find out what went wrong? Has anyone faced a similar issue with Azure?
Thanks
There is no easy way to troubleshoot this without capturing the VM in a stuck state.
Sounds like you followed the recommended steps, i.e.:
- Check that the VM is running (portal/PowerShell/CLI).
- Check that the endpoints are valid.
- Restart the VM.
- Force a redeployment by changing the instance size (see the sketch below).
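For anyone hitting this on current (ARM) VMs, these checks and the redeployment can be scripted with the Azure CLI. A minimal sketch, assuming a resource group and VM called MyResourceGroup/MyVM (placeholder names, not from the original question):

    # Report the power state Azure believes the VM is in
    az vm get-instance-view --resource-group MyResourceGroup --name MyVM \
        --query "instanceView.statuses[?starts_with(code,'PowerState')].displayStatus"

    # Force the VM onto a different host without changing its size
    az vm redeploy --resource-group MyResourceGroup --name MyVM

    # Or force a move indirectly by resizing, as was done above (A5 -> A6)
    az vm resize --resource-group MyResourceGroup --name MyVM --size Standard_A6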
To understand why it happened, it would have been necessary to leave the VM in the stuck state and open a support case to investigate.
There is work underway to make both self-service diagnosis and redeployment easier for situations like this.
Apparently nothing was wrong! After the reboot, the machine was installing updates to complete the restart. In my panic, I rebooted it again, stopped it, started it again, and even changed its configuration, thinking it was dead, while in fact it was only installing updates.
Too bad that we cannot disable the automatic reboot or estimate the time it takes to complete.
I have an f1-micro gcloud VM instance running Ubuntu 20.04.
It has 0.2 vCPUs and 600 MB of memory.
When I write freezing/crashing, I mean that the instance just stops responding to anything.
From my monitoring I can see that the CPU peaks at 40% usage (usually steady under 1%), while memory is always around 60% (both stats with my Node.js server running).
When I open an SSH connection to my instance and run my Node.js server in the background, everything works fine as long as I keep the SSH connection alive. As soon as I close the connection, it takes a few more minutes until the instance freezes/crashes. Without closing the SSH connection, I can keep it running for hours without any problem.
I don't get any crash or freeze information from gcloud itself. The instance has a green checkmark and is, in a sense, still running. I just can't open a new SSH connection, and the only way to do anything with the instance again is to restart it.
I have Cloud Logging active, and there are no messages in there either.
So with this knowledge, my question is: does gcloud somehow boost SSH-connected VMs to keep them alive? Because I don't know what else could cause this behaviour.
My Node.js server uses around 120 MB, another service uses 80 MB, and the GCP monitoring agent uses 30 MB. The Linux free command on the instance shows between 60 MB and 100 MB of available memory.
In addition to John Hanley's and Mike's answers: you can edit your machine type based on your needs.
- In the Google Cloud Console, go to VM instances under Compute Engine.
- Select the instance name to open its Overview page.
- Make sure to stop the instance before editing it.
- Select a machine type that matches your application's needs.
- Save.
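The same change can be scripted with the gcloud CLI. A minimal sketch, assuming an instance called my-instance in zone us-central1-a (placeholder names) and moving to e2-small for more memory:

    # Stop the instance first; the machine type cannot be changed while it runs
    gcloud compute instances stop my-instance --zone=us-central1-a

    # Switch to a machine type with more memory (e2-small has 2 GB)
    gcloud compute instances set-machine-type my-instance \
        --zone=us-central1-a --machine-type=e2-small

    # Start it again
    gcloud compute instances start my-instance --zone=us-central1-a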
For more info and guides, you may refer to the links below:
Edit Instance
Machine Family Categories
Since there were no answers that explained the strange behaviour I encountered:
I haven't figured it out either, but at least my server won't crash/freeze anymore.
I somehow fixed it by running my Node.js application as an actual background job using forever instead of launching it with node main.js &.
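In case it helps anyone, this is roughly what that looks like (a sketch; main.js stands in for your own entry point):

    # Install forever globally (one-time)
    npm install -g forever

    # Start the app as a managed daemon, detached from the SSH session,
    # instead of backgrounding it with `node main.js &`
    forever start main.js

    # Verify it is still running after the SSH session is closed
    forever list

A plain node main.js & still belongs to the SSH session's process group, so it can be killed when the session ends; a process supervisor like forever (or systemd) avoids that.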
I have a VM hosted in Azure, created using Resource Manager. I've come to use it today and can't RDP to the machine. When I view the Boot Diagnostics it shows "Please wait"; after a period of time it goes to the logon screen. When I view the CPU usage I can see it drop, which I assume is the VM restarting.
I've tried the following:
Reset Password
Reset Configuration
Redeploy
I've also looked at the network interfaces and tried adding it to a network security group with an RDP rule, but still nothing.
Is there anything else I can check?
EDIT
When I first start the VM up and look at the Boot Diagnostics I can see the login screen. When I try to RDP to the machine it says it can't connect.
The CPU drops where I assume it's restarting. I've also tried RDPing to the machine from another machine on the same VPN.
I raised a support ticket about this. Support noticed the following in the logs: "Rebooting VM to apply DSC configuration." The DSC extension was causing the machine to reboot.
They advised me to go to the VM in the portal, then Extensions, and uninstall the PowerShell DSC extension. I'm not sure what caused this, i.e. I did not knowingly install it, but once I uninstalled it I was able to RDP. Support have asked me to try installing it again and see if the same thing happens, but I haven't had a chance to do this yet.
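If you prefer to do the same from the command line, the extension can be listed and removed with the Azure CLI. A sketch; MyResourceGroup/MyVM are placeholders, and the extension name should be whatever the list command actually reports (often Microsoft.Powershell.DSC):

    # See which extensions are installed on the VM
    az vm extension list --resource-group MyResourceGroup --vm-name MyVM --output table

    # Remove the DSC extension by the name shown in the list
    az vm extension delete --resource-group MyResourceGroup --vm-name MyVM \
        --name Microsoft.Powershell.DSC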
We've got a classic VM on Azure. All it's doing is running SQL Server with a lot of databases (we've got another VM, a web server, which is the web-facing side and accesses the classic SQL VM for data).
The problem is that since yesterday morning we are experiencing outages every 2-3 hours. There doesn't seem to be any reason for it. We've been working with Azure support, but they seem to be struggling to work out what the issue is. There doesn't seem to be anything in the event logs that gives us any information.
All that happens is that we receive a Pingdom alert saying the box is down; we then can't remote into it as it times out, and all database calls to it fail. Five minutes later it will come back up. It doesn't seem to fully reboot or anything; it just halts.
Any ideas on what this could be caused by? Or any places we could look for better info? Or ways to prevent this from happening?
The only thing in the event logs that occurs around the same time is a DNS Client event: "Name resolution for the name [DNSName] timed out after none of the configured DNS servers responded."
The smartest/quickest recovery:
Did you check SQL Server by connecting inside the VM (internally) using localhost or 127.0.0.1\InstanceName? If you can connect to SQL Server internally without any issue, then capture or snapshot the SQL Server VM and create a new VM from the captured image (i.e. without losing any data). A connectivity check is sketched below.
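To run that internal check from an RDP session on the VM itself, something like sqlcmd works (a sketch; MyInstance is a placeholder instance name):

    REM Default instance, Windows authentication
    sqlcmd -S localhost -E -Q "SELECT @@SERVERNAME"

    REM Named instance
    sqlcmd -S 127.0.0.1\MyInstance -E -Q "SELECT 1"

If these succeed while external calls time out, the fault is in networking rather than SQL Server itself.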
This issue may be caused by one of the following:
Azure Network Firewall
Windows Server Update
This ended up being a fault with the node/sector that our VM was on. I fixed it by enlarging our VM instance (4 cores to 8 cores); this forced Azure to move it to another node/sector, and that rectified the issue.
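For reference, if you're on a current (ARM) VM rather than classic, the same trick can be scripted with the Azure CLI (a sketch; the names and size are placeholders):

    # Resizing typically deallocates the VM and brings it up on different hardware
    az vm resize --resource-group MyResourceGroup --name MyVM --size Standard_D8s_v3

    # A redeploy forces a move to a new host without changing the size at all
    az vm redeploy --resource-group MyResourceGroup --name MyVM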
Whenever we get the error "Role Instances are taking longer than expected", the only possible options are:
- Shut down the emulators and try again.
- Restart the machine and see if that helps.
- Uninstall the Azure Tools for that version.
Sometimes the uninstall takes a long time, sometimes even days. It appears that some process or service is blocking it. Has anyone faced this before? If so, does anyone know which process might be blocking it?
When an instance starts, it will run the OnStart method on the worker/web role (depending on your service type). The more work you do in there, the more time it will take to start up the role. Common culprits are the cache, as mentioned, and blob/table storage (if you read/write/create when the role starts).
Try minimizing OnStart's workload and moving any storage work into async tasks.
I have had similar problems in the past as well:
IISConfigurator could not map the web roles in IIS. In my case it was due to corrupted file system ACLs on the code directory. See logs under C:\Users\YOUR_USER_NAME\AppData\Local\dftmp\IISConfiguratorLogs\
Another cause might be that something else has tied up the port numbers Azure is trying to bind your web role to, or that the ports the local storage emulator needs for blobs/tables/queues (10000-10002) have been taken by another app. Open a command prompt and run netstat -anb to check; for example:
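A quick way to see whether the emulator ports are already taken (a sketch; note netstat -b needs an elevated prompt, and findstr treats the space-separated terms as alternatives):

    REM List listening sockets and filter for the storage emulator ports
    netstat -ano | findstr ":10000 :10001 :10002"

    REM The PID in the last column can then be looked up in Task Manager

I used -ano here because -b spreads the process name across a separate line, which doesn't survive the findstr filter.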
Try running Visual Studio using the "Run as Administrator" option.
Windows Azure: RDP for web/worker roles is configured successfully. All works fine; I can connect to the servers via RDP, and I can see the logon screen, desktop, and so on. But after 3-10 seconds everything freezes. It seems like a disconnect. After reconnecting it's all the same: I can work for 3-10 seconds. What should I do to fix this?
Solution:
This trouble was caused by the instance restarting. So before connecting via RDP, wait for the node to stabilize first :)
Does the role stay in a running state? I have RDP'ed into many instances of both Web and Worker roles and I have not seen this behavior.
Do you have any other details that you can share? Have you installed/modified anything as a Startup task that might be causing an issue? Have you tried from another client computer?
It looks like a connection problem. I suggest you contact Microsoft support and give them your subscription ID and deployment ID; they can then investigate your machine in depth.