Today we had an issue with Azure VMs where one VM in an availability set of 2 just stopped responding. After a few minutes we noticed that the machine was shut down and that the other VM in the set wasn't turned on (which should be fine, as this isn't a failover). We took a look at the VM monitoring and there wasn't a single log entry telling us there was any downtime. The only thing we found was 2 strange entries in the Management Services - Operation Logs:
11/12/2013 10:12:02 PM AutoscaleAction Succeeded VirtualMachinesAvailabilitySet:xyz Autoscale
11/12/2013 9:36:56 PM AutoscaleAction Succeeded VirtualMachinesAvailabilitySet:xyz Autoscale
The first one came with the following details:
Description: The autoscale engine attempting to scale resource
'xyz' from 0 instances count to 1 instances count.
LastScaleActionTime: 20131106T173020Z
NewInstancesCount: 1
OldInstancesCount: 0
Second one:
The autoscale engine attempting to scale resource 'xyz' from 2
instances count to 1 instances count.
LastScaleActionTime: 20131112T203656Z
NewInstancesCount: 1
OldInstancesCount: 2
Does anyone know what may have happened?
UPDATE
Azure Support has provided me with feedback and explained that the machines were down due to a host update.
Regards
Whenever you use autoscale you set an instance range that tells Azure the minimum and maximum number of VMs you want to be running at a given point in time. In this case, it looks like you've set the minimum to be 1. That would explain why, when both VMs were stopped, it turned on one of them.
In addition, the scale from 2 to 1 was likely because load was low on your VMs (assuming you're scaling by CPU). If the average CPU remains below the target you've established (by default 60%), it will scale down until it hits the minimum (in this case, 1).
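To make that concrete, here is a toy sketch (in Python) of the per-evaluation decision a CPU-based autoscale engine makes; the 60% target and the 1/2 instance range mirror the scenario above, and this is an illustration of the logic rather than Azure's actual engine.

```python
# Toy model of a CPU-based autoscale decision, for illustration only.
# The real engine evaluates aggregated metrics over a time window and
# applies cooldowns; the thresholds and counts here are assumptions.

def decide_instance_count(avg_cpu_percent: float,
                          current_count: int,
                          minimum: int = 1,
                          maximum: int = 2,
                          target_cpu: float = 60.0) -> int:
    """Return the instance count the engine would scale to."""
    if current_count < minimum:
        # Below the configured floor (e.g. every VM stopped): bring it back up.
        return minimum
    if avg_cpu_percent > target_cpu and current_count < maximum:
        return current_count + 1      # scale out under load
    if avg_cpu_percent < target_cpu and current_count > minimum:
        return current_count - 1      # scale in when idle
    return current_count              # within range, nothing to do


if __name__ == "__main__":
    print(decide_instance_count(avg_cpu_percent=5.0, current_count=0))   # 0 -> 1
    print(decide_instance_count(avg_cpu_percent=10.0, current_count=2))  # 2 -> 1
```

With a minimum of 1 and a mostly idle workload, both logged actions (0 to 1, and 2 to 1) are exactly what this logic produces.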
Both of my machines were down because of the host update, and AutoScaling was set to go from 1 to 2 machines based on CPU usage. So I have found out that AutoScaling won't turn the second machine on during a host update (which would have been pretty helpful and would have kept my apps online).
I think that explains the 0 to 1 instances issue, so don't use AutoScaling with the above setup to get HA.
Regards
Related
I'm running an Aurora Serverless v2 instance, and the AWS documentation states you should only be paying for the resources you are actually using. I have an instance that is at 0 connections most of the time, but it never scales down below 2 ACUs. See the images below for reference. I have the instance set up to scale between 0.5 and 16 ACUs, but no matter the load it always stays at a baseline of 2 ACUs.
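For reference, a minimal sketch of how the 0.5-16 ACU range described above would be set with boto3; the cluster identifier is a placeholder, and this only configures the range, it does not change the observed 2 ACU floor.

```python
# Sketch: apply the Aurora Serverless v2 capacity range described above.
# Assumes boto3 credentials are configured; "my-aurora-cluster" is a placeholder.
import boto3

rds = boto3.client("rds")

rds.modify_db_cluster(
    DBClusterIdentifier="my-aurora-cluster",   # placeholder cluster name
    ServerlessV2ScalingConfiguration={
        "MinCapacity": 0.5,                    # ACUs
        "MaxCapacity": 16.0,
    },
    ApplyImmediately=True,
)
```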
I had to turn off the AI monitoring on the DB and then restart the instance. The DB then started at the minimum capacity.
I can confirm this behaviour but as yet can't explain it. We have three databases running, all with the same schema and with different ACU limits set. Our production and staging databases sit at near-flatlines close to the maximum capacity allowed, whilst the other behaves as we would expect and only scales up when we actually send it load.
We have tried rebooting the instances but they immediately scale up and do not appear willing to scale down.
We have full support with AWS, so we will raise a ticket with them and report back here if we get an explanation/solution.
We are using AKS 1.19.1 in our sandbox environment and have separate system and user node pools. We have multiple applications running in our user node pool, along with Istio as the service mesh. The current node count of the user node pool is 24 and autoscaling is enabled.
Now, as part of cost optimisation, we are planning to scale the user node pool down to zero during non-working hours (say, after office hours or overnight).
Reference: https://learn.microsoft.com/en-us/azure/aks/scale-cluster
Is this the recommended way to scale down to zero nodes for such a cluster with a node pool size of 25 nodes?
If yes:
When we re-enable the autoscaling property (every day, after setting the count to zero at night), will this automatically increase the node count and auto-start the application pods, or do we need to roll out a restart of the pods separately?
What factors does getting back to the normal running state depend on, and how long might it take?
Is there any way to schedule this scaling down to zero at night and then automatically re-enable the autoscaling property in the morning?
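For context, the kind of night/morning switch being asked about could be scripted roughly as below with the azure-mgmt-containerservice SDK; the resource names are placeholders and the exact model/attribute names may differ between SDK versions, so treat this as a sketch rather than a drop-in script.

```python
# Sketch: scale an AKS user node pool to zero at night and re-enable the
# cluster autoscaler in the morning. Placeholders: subscription ID, "my-rg",
# "my-aks", "usernodepool". Attribute names follow the AgentPool model in
# azure-mgmt-containerservice and may need adjusting for your SDK version.
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerservice import ContainerServiceClient

SUBSCRIPTION_ID = "<subscription-id>"
client = ContainerServiceClient(DefaultAzureCredential(), SUBSCRIPTION_ID)


def scale_to_zero(rg: str, cluster: str, pool: str) -> None:
    """Evening job: turn the autoscaler off and drop the pool to 0 nodes."""
    agent_pool = client.agent_pools.get(rg, cluster, pool)
    agent_pool.enable_auto_scaling = False
    agent_pool.min_count = None
    agent_pool.max_count = None
    agent_pool.count = 0
    client.agent_pools.begin_create_or_update(rg, cluster, pool, agent_pool).result()


def enable_autoscaling(rg: str, cluster: str, pool: str,
                       min_count: int = 3, max_count: int = 25) -> None:
    """Morning job: re-enable the cluster autoscaler with a min/max range."""
    agent_pool = client.agent_pools.get(rg, cluster, pool)
    agent_pool.enable_auto_scaling = True
    agent_pool.min_count = min_count
    agent_pool.max_count = max_count
    client.agent_pools.begin_create_or_update(rg, cluster, pool, agent_pool).result()
```

Both functions would be run on a schedule (for example from an Azure Automation runbook or a cron job). In general, pods left Pending while the pool was at zero are scheduled again by Kubernetes once nodes come back, without a separate rollout restart.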
Just use KEDA and let it manage the scale up / down for you:
https://learn.microsoft.com/en-us/learn/modules/aks-app-scale-keda/
I have an Azure App Service Plan and have enabled autoscale based on a schedule. The default number of servers is always 1.
During weekdays it should scale up at 12:50 pm to 2. Then at 4 pm it should scale back DOWN to 1.
It always scales up correctly, but it does not scale down at 4 pm; it remains on 2 servers. Today it scaled up correctly from 1 to 2 servers at 12:50. However, it is now 16:50 and it has remained on 2 servers. It has NOT scaled back down to 1 at the END time of the rule.
The default scale rule is not triggering, as my CPU is never more than 10% and memory never more than about 58%.
This is how my configuration looks:
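The screenshot is not reproduced here, but purely as an illustration of the kind of schedule described above, weekday recurrence profiles in the Microsoft.Insights/autoscaleSettings schema look roughly like the Python dicts below; the capacities and times are taken from the description and the time zone is a placeholder, so this is not the actual portal configuration.

```python
# Illustrative only: the shape of schedule-based autoscale profiles similar to
# the scenario described (2 instances from 12:50 on weekdays, 1 from 16:00).
# Field names follow the Microsoft.Insights/autoscaleSettings ARM schema.
business_hours_profile = {
    "name": "Weekday afternoon",
    "capacity": {"minimum": "2", "maximum": "2", "default": "2"},
    "rules": [],
    "recurrence": {
        "frequency": "Week",
        "schedule": {
            "timeZone": "UTC",   # placeholder time zone
            "days": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
            "hours": [12],
            "minutes": [50],
        },
    },
}

off_hours_profile = {
    "name": "Weekday evening",
    "capacity": {"minimum": "1", "maximum": "1", "default": "1"},
    "rules": [],
    "recurrence": {
        "frequency": "Week",
        "schedule": {
            "timeZone": "UTC",   # placeholder time zone
            "days": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
            "hours": [16],
            "minutes": [0],
        },
    },
}
```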
I have tried several times, on several subscriptions, using a couple of different accounts, and I keep running into the exact same issue when attempting to deploy a new Service Fabric cluster through the Azure portal. I tried this with both secure and unsecured clusters (to ensure that my certificate setup was not to blame), as well as with 5-node clusters and single-node test clusters. In all cases the error was exactly the same.
At step 4, in all cases, the portal indicates that the portal-generated ARM template is valid and allows me to start the deployment process. After about 10 minutes I get the dreaded Deployment Failed icon on my dashboard for the 20th time!
Clicking on the icon takes me to the error logs, which indicate that there was an issue with "Write Deployments".
I also see that all the necessary resource types have been generated (storage accounts, VM scale sets, etc.).
However, looking at the VM scale set, I see another (more descriptive) issue stating that there was a provisioning error with the code "ProvisioningState/failed/InternalDiskManagementError" and that an internal disk management error occurred.
I am at a complete loss. I am not doing anything custom; this is all on the Azure portal, and as I mentioned, I tried both simple test clusters without security or logging and 5-node clusters with security and logging enabled. In all cases I get the exact same error. This is on 3 different Azure accounts.
The only other thing that I might try is different regions (I've only been targeting West US 2) and maybe some variants on the VM size (I've been targeting A0 for cost).
Has anyone else run into similar issues? I've been able to deploy clusters before (a few months back) but ever since then I keep getting stopped by this bug!
UPDATE 1
I attempted a deployment in West US 2 using the A1_V2 VM size and again got the Write Deployments failure, but this time the VM scale set shows a different error:
ProvisioningState/failed/VMExtensionHandlerNonTransientError
Handler 'Microsoft.Azure.Diagnostics.IaaSDiagnostics' has reported failure for VM Extension 'VMDiagnosticsVmExt_vmNodeType0Name' with terminal error code '1007' and error message: 'Install failed for plugin (name: Microsoft.Azure.Diagnostics.IaaSDiagnostics, version 1.10.0.0) with exception Command C:\Packages\Plugins\Microsoft.Azure.Diagnostics.IaaSDiagnostics\1.10.0.0\DiagnosticsInstall.cmd of Microsoft.Azure.Diagnostics.IaaSDiagnostics has not exited on time! Killing it...'
UPDATE 2
I made a deployment in Central US using a D-sized VM and was able to deploy just fine. At this point it seems that either the region or the VM size is causing the issues. I'm going to make a few more deployments using various VM sizes and regions and will continue updating here with my findings...
UPDATE 3
Was able to create a single node Standard_D1_v2 cluster in West US 2.
UPDATE 4
Was able to create a 3 node Standard_A2_v2 cluster in West US 2.
Region is not the issue.
UPDATE 5
A second attempt at deploying an A1_V2 VM in West US 2 resulted in the same error as the last time this VM size was used:
ProvisioningState/failed/VMExtensionHandlerNonTransientError
FINAL UPDATE
The issue seems to be that the VMs I was using are underpowered.
I hope that Microsoft updates their portal so the next developer does not run into the same issues as I did. Right now the portal makes you think that your setup is valid (it even passes the validation in step 4) and then fails without any clarity. I opened a support ticket and even the Azure techs are giving me the runaround, having me check my resource provider settings! They have no clue that I'm using insufficient VM sizes!
I also think it's way too expensive for developers to have to pay so much just to get some test nodes up in the cloud. And I'm still perplexed that I was able to get a 5-node A0 cluster up and running, but no longer can! Maybe there was a Service Fabric software update since then?
The recommended VM SKU is Standard D3 or Standard D3_V2 or equivalent with a minimum of 14 GB of local SSD.
The minimum supported use VM SKU is Standard D1 or Standard D1_V2 or equivalent with a minimum of 14 GB of local SSD.
Partial core VM SKUs like Standard A0 are not supported for production workloads.
Standard A1 SKU is not supported for production workloads for performance reasons.
Source
These errors are usually caused by using unsupported VM sizes. As a workaround for test clusters, you can first deploy using something like D3_V2 and after successful deployment, scale down.
I have 2 Azure VMs created: 1 Windows Server 2012 R2 and 1 Ubuntu 14.
It takes both of those VMs approximately 5 minutes to start up.
Is there a way I can speed up the process?
I don't need the VMs running continuously; I prefer to start/stop them as needed.
There are no steps you can take to speed up the VM start. Resources must be allocated and the VM provisioned.
What I can recommend is setting up a script to auto-start/stop your VMs based on a schedule. For example, if you are using them in a classroom environment, you can set it up so that they start early (6 am?) and shut down each day by 5 pm.
You can find some more information about this here.
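As a rough illustration (not necessarily the linked article's approach), such a schedule could be driven by a small script using the Azure SDK for Python; the resource group and VM names below are placeholders, and the begin_start / begin_deallocate method names assume a recent azure-mgmt-compute release.

```python
# Sketch: start or deallocate a set of Azure VMs, intended to be run from a
# scheduler (e.g. an Azure Automation runbook or cron) at 6 am / 5 pm.
# "my-rg" and the VM names are placeholders.
import sys

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "my-rg"
VM_NAMES = ["win2012r2-vm", "ubuntu14-vm"]

compute = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)


def start_all() -> None:
    for name in VM_NAMES:
        # Starting still takes a few minutes while resources are allocated.
        compute.virtual_machines.begin_start(RESOURCE_GROUP, name).result()


def stop_all() -> None:
    for name in VM_NAMES:
        # Deallocate (not just power off) so compute charges stop.
        compute.virtual_machines.begin_deallocate(RESOURCE_GROUP, name).result()


if __name__ == "__main__":
    start_all() if sys.argv[1:] == ["start"] else stop_all()
```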