I was looking at https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-automatic-upgrade to see if we could change our VMSS (currently using a Manual upgrade policy mode) to an automatic rolling-upgrade one, but found that the application health probe couldn't reuse our existing App Gateway health probe, as it specifically needs to be a Load Balancer one. Bummer.
Anyhow, I thought I'd test our VMSS to ensure we can manually Upgrade each instance from the Portal/CLI by deliberately picking an old 16.04 LTS image version (instead of the 'latest' version tag). From az vm image list --location canadacentral --publisher Canonical --offer UbuntuServer --sku 16.04-LTS --all --output table I picked the first 16.04 image published in 2018, i.e. 16.04.201801050. The latest one is 16.04.201811140.
Microsoft.Compute/virtualMachineScaleSets/cluster?api-version=2018-06-01:
"properties": {
    "singlePlacementGroup": false,
    "upgradePolicy": {
        "mode": "Manual",
        "automaticOSUpgrade": false
    },
    ...
    "imageReference": {
        "publisher": "Canonical",
        "offer": "UbuntuServer",
        "sku": "16.04-LTS",
        "version": "16.04.201801050"
    },
I can confirm that each new VMSS instance does indeed have the desired 16.04.201801050 image by SSHing onto the box (with plenty of updates to apply):
```
Welcome to Ubuntu 16.04.3 LTS (GNU/Linux 4.11.0-1016-azure x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
Get cloud support with Ubuntu Advantage Cloud Guest:
http://www.ubuntu.com/business/services/cloud
202 packages can be updated.
118 updates are security updates.
jiraadmin@jiranode-000001:~$ apt list linux-image-azure
Listing... Done
linux-image-azure/xenial-updates,xenial-security 4.15.0.1032.37 amd64 [upgradable from: 4.11.0.1016.16]
N: There is 1 additional version. Please use the '-a' switch to see it
```
but I was surprised to see that the Portal and REST API show each instance with latestModelApplied set to true (which it clearly is not):
Microsoft.Compute/virtualMachineScaleSets/cluster/virtualMachines/0?api-version=2018-06-01:
"properties": {
    "latestModelApplied": true,
    "vmId": "...",
    "hardwareProfile": {},
    "storageProfile": {
        "imageReference": {
            "publisher": "Canonical",
            "offer": "UbuntuServer",
            "sku": "16.04-LTS",
            "version": "16.04.201801050"
        }
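The same flag can also be checked from the CLI (the resource group name below is a placeholder; the scale set name "cluster" comes from the REST call above):
```
# Show latestModelApplied per instance ("my-rg" is a placeholder)
az vmss list-instances \
  --resource-group "my-rg" \
  --name "cluster" \
  --query "[].{instance:instanceId, latestModelApplied:latestModelApplied}" \
  --output table
```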
Clicking on the Upgrade button for the VM instance in the Azure Portal kicks off a very short-lived task with no changes made to the underlying VM.
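For reference, the CLI equivalent of that Upgrade button (same placeholder names) is:
```
# Apply the latest VMSS model to instance 0 (a no-op when the platform
# already considers the instance up to date)
az vmss update-instances \
  --resource-group "my-rg" \
  --name "cluster" \
  --instance-ids 0
```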
So I assumed the following:
Specifying an older image version than 'latest' should leave each VMSS instance's latestModelApplied set to false.
Clicking on the Upgrade button in the Portal should bring the "old" image version up to the 'latest' image version, i.e. essentially performing a 'sudo apt-get upgrade' or 'sudo apt dist-upgrade'. In practice it does neither.
Clicking on Reimage in the Portal, you get a warning about resetting the instance back to its original state, but https://learn.microsoft.com/en-us/rest/api/compute/virtualmachinescalesets/reimage suggests it's going to upgrade the OS image, i.e. sudo apt dist-upgrade. It does the former: it reinstalls the original image, blowing away everything.
So at the minute it appears to me that you can't use the Portal to maintain OS and security updates on the currently running VM due to the erroneous latestModelApplied property. Are the behaviour and my assumptions above correct?
Thanks,
Stephen.
A guy from MS sorted out my (incorrect) assumptions at https://github.com/Azure/vm-scale-sets/issues/62.
Related
I run the same code snippet as for other extensions:
az vm extension set \
--resource-group "azure-vm-arm-rg" \
--vm-name "azure-vm" \
--name "WindowsAgent.AzureSecurityCenter" \
--publisher "Qualys"
...and I'm getting:
The handler for VM extension type 'Qualys.WindowsAgent.AzureSecurityCenter'
has reported terminal failure for VM extension 'WindowsAgent.AzureSecurityCenter'
with error message: 'Enable failed for plugin (name: Qualys.WindowsAgent.AzureSecurityCenter,
version 1.0.0.10) with exception Command
C:\Packages\Plugins\Qualys.WindowsAgent.AzureSecurityCenter\1.0.0.10\enableCommandHndlr.cmd
of Qualys.WindowsAgent.AzureSecurityCenter has exited with Exit code: 4306'.
I have no issues installing this extension via the Azure UI in Security Center.
I suspect licensing is the root cause, but I don't have any dedicated licenses; I believe Security Center manages them automatically.
Any ideas how to install the Qualys extension automatically?
I encountered the same issue. It was because the extension was added too soon after the VM had started. The prerequisite is that the Azure Virtual Machine agent should be running on the VM before the extension is added.
For my solution, I added dependencies on other extensions before running this one. That gave the machine enough time to start and have the Azure Virtual Machine agent running before the Qualys extension is added:
{
    "type": "microsoft.compute/virtualmachines/providers/serverVulnerabilityAssessments",
    "apiVersion": "2015-06-01-preview",
    "name": "[concat(parameters('virtualMachineName'), '/Microsoft.Security/Default')]",
    "dependsOn": [
        "[concat('Microsoft.Compute/virtualMachines/', parameters('virtualMachineName'))]",
        "[concat('Microsoft.Compute/virtualMachines/', parameters('virtualMachineName'), '/extensions/AzurePolicyforWindows')]",
        "[concat('Microsoft.Compute/virtualMachines/', parameters('virtualMachineName'), '/extensions/Microsoft.Insights.VMDiagnosticsSettings')]",
        "[concat('Microsoft.Compute/virtualMachines/', parameters('virtualMachineName'), '/extensions/AzureNetworkWatcherExtension')]"
    ]
}
Make sure you have no Azure Policies configured which do things like require tags, as this can block the extension installation and only give the error message: The resource operation completed with terminal provisioning state 'Failed'.
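A quick way to rule that out is to list the policy assignments in scope (the resource group name here is the one from the question):
```
# List policy assignments at subscription scope
az policy assignment list --output table

# Narrow the check to the VM's resource group
az policy assignment list --resource-group "azure-vm-arm-rg" --output table
```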
I'm deploying a Linux Ubuntu 16.04 LTS VMSS using an Azure ARM template with the Custom Script Extension. Content of customscript.sh:
apt-get update
apt-get install build-essential -y
...
but it fails at the apt-get update step itself with this error:
Splitting up /var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_xenial_InRelease into data and signature failed
E: GPG error: http://archive.ubuntu.com/ubuntu xenial InRelease: Clearsigned file isn't valid, got 'NODATA' (does the network require authentication?)
When I log in to the VM and run apt-get update manually, it completes successfully.
ARM Template for deployment of Linux VMSS using Custom Script Extension:
"virtualMachineProfile": {
"extensionProfile": {
"extensions": [
{
"name": "Custom Deployment",
"properties": {
"publisher": "Microsoft.Azure.Extensions",
"typeHandlerVersion": "2.1",
"autoUpgradeMinorVersion": true,
"protectedSettings": {
"commandToExecute": "[concat('/bin/bash customscript.sh')]",
"fileUris": [
"[parameters('scriptToExecuteLinux')]"
]
},
"type": "CustomScript"
}
},
}
......
}
Please let me know if I'm performing any step incorrectly.
Thanks in advance.
I have basically no experience with Azure, but it could be that you're running the script as the wrong user: if you're running it as root, maybe try running it as a normal user, and vice versa.
I see a couple of issues in your code:
1. I think you are not using the concat function correctly; there is only one argument, so the concat is redundant.
2. Cross-check that you have all the key-value pairs required by the extension schema documented here: https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/custom-script-linux
For more detailed errors, please check the file /var/lib/waagent/custom-script/download/0/stderr; you might find some answers there.
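As an aside, 'NODATA' errors from apt right after boot are sometimes transient; a more defensive customscript.sh that retries the update (a sketch, not a confirmed fix) could look like:
```
#!/bin/bash
# Retry apt-get update a few times; mirror/network hiccups right after
# boot often clear within a minute or so.
for attempt in 1 2 3 4 5; do
    apt-get update && break
    echo "apt-get update failed (attempt ${attempt}), retrying..." >&2
    sleep 15
done
apt-get install -y build-essential
```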
When you ran it manually, did you try /bin/bash script.sh or sh script.sh?
As the title says, when I try to run my Node.js-based Azure Function, I come across the following error:
The following 1 functions are in error: [7/2/19 1:41:17 AM] ***: The binding type(s) 'blobTrigger' are not registered. Please ensure the type is correct and the binding extension is installed.
I tried func extensions install --force with no luck. Any ideas? My development environment is macOS, and I tried both the npm-based azure-functions-core-tools install and the brew-based one; neither works.
The scariest part is that this used to work fine on the same machine; all of a sudden it just stopped working.
Basically, you can refer to the official tutorial for Linux, Create your first function hosted on Linux using Core Tools and the Azure CLI (preview), to start your work.
Since macOS and Linux use the same bash shell, I will build my sample demo for you on Linux and avoid any incompatible operations. First of all, it is assumed that there is a usable Node.js runtime in your environment; my node and npm versions are v10.16.0 and 6.9.0.
First, install azure-functions-core-tools via npm and inspect it.
Next, init a project MyFunctionProj via func.
Then create a new function with a blob trigger.
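The original answer showed these steps as screenshots; the commands were roughly as follows (naming the function MyBlobTrigger is my assumption, matching the config files further down):
```
# Install the Core Tools globally and confirm the version
npm install -g azure-functions-core-tools
func --version

# Scaffold a project, then try to add a blob-triggered function
func init MyFunctionProj      # pick "node" when prompted for a worker runtime
cd MyFunctionProj
func new                      # pick "Azure Blob Storage trigger"; I named mine MyBlobTrigger
```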
There is an issue with func new about the requirement for the .NET Core SDK, so I went to https://www.microsoft.com/net/download to install it. The steps I used are Linux ones and don't apply to macOS directly, but I think you can easily adapt them yourself; I followed the official installation instructions.
After installing the .NET Core SDK, try func new again.
This time it completes.
Then change two configuration files, MyFunctionProj/local.settings.json and MyFunctionProj/MyBlobTrigger/function.json, as below.
MyFunctionProj/local.settings.json
{
    "IsEncrypted": false,
    "Values": {
        "FUNCTIONS_WORKER_RUNTIME": "node",
        "AzureWebJobsStorage": "<your real storage connection string, like DefaultEndpointsProtocol=https;AccountName=<your account name>;AccountKey=<your account key>;EndpointSuffix=core.windows.net>"
    }
}
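If you don't have the connection string to hand, the CLI can print it (account and resource group names are placeholders):
```
az storage account show-connection-string \
  --name "<your account name>" \
  --resource-group "<your resource group>" \
  --output tsv
```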
MyFunctionProj/MyBlobTrigger/function.json
{
    "bindings": [
        {
            "name": "myBlob",
            "type": "blobTrigger",
            "direction": "in",
            "path": "<the container name you want to monitor>/{name}",
            "connection": "AzureWebJobsStorage"
        }
    ]
}
Then run func host start --build to start it up without any errors.
Let's upload a test file named test.txt via Azure Storage Explorer to the container <the container name you want to monitor> configured in the function.json file. You will see that MyBlobTrigger has been triggered and works fine.
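If you prefer the CLI to Storage Explorer, the equivalent upload (reusing the container name and connection string from local.settings.json) is roughly:
```
az storage blob upload \
  --container-name "<the container name you want to monitor>" \
  --name test.txt \
  --file test.txt \
  --connection-string "<your real storage connection string>"
```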
Hope it helps.
I'm trying to create a pool of virtual machines built on my custom image. I've successfully created a custom image and added it to my batch account.
But when I try to create a pool based on this image from the Azure portal, I get an error.
There was an error encountered while performing the last resize on the
pool. Please try resizing the pool again. Code: AllocationFailed
Message: Desired number of dedicated nodes could not be allocated
Details: Reason - The source managed disk or snapshot associated with
the virtual machine Image Id was not found.
While creating a pool in the portal I use my image name, as there's no option to set an image ID. But the image ID in the JSON is correct, and I can see the image listed in the portal under the correct Batch account.
Here's my pool properties json:
{
    "id": "my-pool-0",
    "displayName": "my-pool-0",
    "lastModified": "2018-12-04T15:54:06.467Z",
    "creationTime": "2018-12-04T15:44:18.197Z",
    "state": "active",
    "stateTransitionTime": "2018-12-04T15:44:18.197Z",
    "allocationState": "steady",
    "allocationStateTransitionTime": "2018-12-04T16:09:11.667Z",
    "vmSize": "standard_a2",
    "resizeTimeout": "PT15M",
    "currentDedicatedNodes": 0,
    "currentLowPriorityNodes": 0,
    "targetDedicatedNodes": 1,
    "targetLowPriorityNodes": 0,
    "enableAutoScale": false,
    "autoScaleFormula": null,
    "autoScaleEvaluationInterval": null,
    "enableInterNodeCommunication": false,
    "maxTasksPerNode": 1,
    "url": "https://mybatch.westeurope.batch.azure.com/pools/my-pool-0",
    "resizeErrors": [
        {
            "message": "Desired number of dedicated nodes could not be allocated",
            "code": "AllocationFailed",
            "values": [
                {
                    "name": "Reason",
                    "value": "The source managed disk or snapshot associated with the virtual machine Image Id was not found."
                }
            ]
        }
    ],
    "virtualMachineConfiguration": {
        "imageReference": {
            "publisher": null,
            "offer": null,
            "sku": null,
            "version": null,
            "virtualMachineImageId": "/subscriptions/79b59716-301e-401a-bb8b-22edg5c1he4j/resourceGroups/resource-group-1/providers/Microsoft.Compute/images/my-image"
        },
        "nodeAgentSKUId": "batch.node.ubuntu 18.04"
    },
    "applicationLicenses": null
}
It seems like the error text has nothing to do with what is actually wrong. Has anyone encountered this error, or does anyone know a way to troubleshoot it?
UPDATE
Packer JSON used to create the image (taken from here):
{
    "builders": [{
        "type": "azure-arm",
        "client_id": "ffxcvbd0-c867-429a-bxcv-8ee0acvb6f99",
        "client_secret": "cvb54cvb-202d-4wq-bb8b-22cdfbce4f",
        "tenant_id": "ae33sdfd-a54c-40af-b20c-80810f0ff5da",
        "subscription_id": "096da34-4604-4bcb-85ae-2afsdf22192b",
        "managed_image_resource_group_name": "resource-group-1",
        "managed_image_name": "my-image",
        "os_type": "Linux",
        "image_publisher": "Canonical",
        "image_offer": "UbuntuServer",
        "image_sku": "18.04-LTS",
        "azure_tags": {
            "dept": "Engineering",
            "task": "Image deployment"
        },
        "location": "West Europe",
        "vm_size": "Standard_DS2_v2"
    }],
    "provisioners": [{
        "execute_command": "chmod +x {{ .Path }}; {{ .Vars }} sudo -E sh '{{ .Path }}'",
        "inline": [
            "export DEBIAN_FRONTEND=noninteractive",
            "apt-get update",
            "apt-get upgrade -y",
            "apt-get -y install nginx",
            ...
            "/usr/sbin/waagent -force -deprovision+user && export HISTSIZE=0 && sync"
        ],
        "inline_shebang": "/bin/sh -x",
        "type": "shell"
    }]
}
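For completeness, the image is then built from this template in the usual way (assuming it is saved as packer.json):
```
packer build packer.json
```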
With your issue, I did the same test as you. The steps:
Create the managed image through Packer.
Create the Batch Pool with the managed image in the same subscription and region.
And then I got the same error as you. Then I ran another test: create the image from a snapshot, and then create the Batch Pool with that image. Luck! The pool works well.
In Azure you can prepare a managed image from snapshots of an Azure
VM's OS and data disks, from a generalized Azure VM with managed
disks, or from a generalized on-premises VHD that you upload.
Referring to this description, it seems the custom image cannot be created through Packer. I'm not sure about this, but the snapshot route really works. Hope this will help you.
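A rough sketch of that snapshot route via the CLI (names are illustrative; the OS disk ID comes from the source VM):
```
# Snapshot the source VM's OS disk...
az snapshot create \
  --resource-group "resource-group-1" \
  --name "my-snapshot" \
  --source "<os-disk-id>"

# ...then build a managed image from the snapshot
az image create \
  --resource-group "resource-group-1" \
  --name "my-image-from-snapshot" \
  --os-type Linux \
  --source "my-snapshot"
```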
Update
Take a look at the document Custom Images with Batch Shipyard. The description:
Note: Currently creating an ARM Image directly with Packer can only be
used with User Subscription Batch accounts. For standard Batch Service
pool allocation mode Batch accounts, Packer will need to create a VHD
first, then you will need to import the VHD to an ARM Image. Please
follow the appropriate path that matches your Batch account pool
allocation mode.
In my test, I followed the same steps that Packer performs to create the image. While the source VM exists, the custom image can be used normally for the Batch Pool, but it fails if you delete the source VM. So, as the description says, a standard Batch Service account can only use an image created from a VHD file that Packer creates, and that VHD file must exist for the Pool's lifetime.
If you're using a managed image, then your imageReference section should look like this:
"imageReference": {
"id": "/subscriptions/79b59716-301e-401a-bb8b-22edg5c1he4j/resourceGroups/resource-group-1/providers/Microsoft.Compute/images/my-image"
},
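With a trimmed pool definition (just id, vmSize, virtualMachineConfiguration, and the target node counts) saved as pool.json, the pool can then be created from the CLI; a sketch, reusing the account names from the question:
```
# Authenticate against the Batch account and set it as the default
az batch account login \
  --name mybatch \
  --resource-group resource-group-1

# Create the pool from the JSON definition
az batch pool create --json-file pool.json
```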
I'm presently hosting an Azure scale set running Windows Server 2012 R2 that is set up with the Chef extension (Chef.Bootstrap.WindowsAzure.ChefClient). When a VM is provisioned, the extension reports back via the Azure portal that it succeeded; however, the registered node on the Chef server is not updated to retain the provided run list, and the first run isn't fully completed. This causes subsequent chef-client runs to be performed with an empty run list. When I look at the reports on the Chef server, I see a run with a status of aborted and no error.
Upon review of the chef-client.log file in the WindowsAzure Plugins directory, I can see that it tries to execute the run list but seems to be interrupted with the following FATAL error:
FATAL: Errno::EINVAL: Invalid argument # io_writev - <STDOUT>
There is no chef-stacktrace.out file created either. The ARM extension definition looks like:
{
    "type": "extensions",
    "name": "ChefClient",
    "properties": {
        "publisher": "Chef.Bootstrap.WindowsAzure",
        "type": "ChefClient",
        "typeHandlerVersion": "1210.12",
        "autoUpgradeMinorVersion": true,
        "settings": {
            "client_rb": "ssl_verify_mode :verify_none\nnode_name ENV[\"COMPUTERNAME\"]",
            "runlist": "recipe[example]",
            "autoUpdateClient": "false",
            "deleteChefConfig": "false",
            "bootstrap_options": {
                "chef_server_url": "https://mychefserver.com/organizations/myorg",
                "validation_client_name": "myorg-validator",
                "environment": "dev"
            }
        },
        "protectedSettings": {
            "validation_key": "-----BEGIN RSA PRIVATE KEY----- ... -----END RSA PRIVATE KEY-----"
        }
    }
}
In order to troubleshoot, I've reduced my example cookbook down to a single resource which installs IIS. Even that one step I've executed in multiple ways, such as using windows_feature, powershell_script, and dsc_script; all end up with the same error. Here is the current script:
powershell_script 'Install IIS' do
code 'Add-WindowsFeature Web-Server'
guard_interpreter :powershell_script
not_if "(Get-WindowsFeature -Name Web-Server).Installed"
end
If I override the run list and call chef-client manually, everything succeeds. I'm having trouble honing in on whether the problem lies with the Azure Chef Extension, the Chef client, or the cookbook.
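(The manual run was along these lines; treat the exact flags as illustrative:)
```
# Run chef-client by hand with an explicit run list
chef-client --runlist 'recipe[example]'
```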
As far as I can tell, communication with the Chef server looks good: the necessary .pem files are exchanged, chef-client is installed, and the cookbook is downloaded from the server and cached. The cache gets removed on the subsequent (empty run list) run, however. Here are the contents of first-boot.json:
{"run_list":["recipe[example]"]}
Here are the versions in play:
chef-client version: 14.1.12
Azure Chef Extension version: 1210.12.110.1001
Server version: Windows Server 2012 R2
Any ideas what could be going on?
It turns out my analysis was incorrect about which resource was causing the problem: the first boot run was failing when using dsc_script as the resource to install the web server. When using the following powershell_script resource instead, it succeeded, and the run list was attached for future runs.
powershell_script 'Install IIS' do
code 'Add-WindowsFeature Web-Server'
guard_interpreter :powershell_script
not_if "(Get-WindowsFeature -Name Web-Server).Installed"
end