Unable to create a pool with custom images on MS Azure

I'm trying to create a pool of virtual machines built on my custom image. I've successfully created a custom image and added it to my Batch account.
But when I try to create a pool based on this image from the Azure portal, I get an error:
There was an error encountered while performing the last resize on the
pool. Please try resizing the pool again. Code: AllocationFailed
Message: Desired number of dedicated nodes could not be allocated
Details: Reason - The source managed disk or snapshot associated with
the virtual machine Image Id was not found.
When creating the pool in the portal I select my image by name, as there's no option to set an image ID. The image ID in the JSON is correct, though, and I can see the image listed in the portal under the correct Batch account.
Here's my pool properties json:
{
"id": "my-pool-0",
"displayName": "my-pool-0",
"lastModified": "2018-12-04T15:54:06.467Z",
"creationTime": "2018-12-04T15:44:18.197Z",
"state": "active",
"stateTransitionTime": "2018-12-04T15:44:18.197Z",
"allocationState": "steady",
"allocationStateTransitionTime": "2018-12-04T16:09:11.667Z",
"vmSize": "standard_a2",
"resizeTimeout": "PT15M",
"currentDedicatedNodes": 0,
"currentLowPriorityNodes": 0,
"targetDedicatedNodes": 1,
"targetLowPriorityNodes": 0,
"enableAutoScale": false,
"autoScaleFormula": null,
"autoScaleEvaluationInterval": null,
"enableInterNodeCommunication": false,
"maxTasksPerNode": 1,
"url": "https://mybatch.westeurope.batch.azure.com/pools/my-pool-0",
"resizeErrors": [
{
"message": "Desired number of dedicated nodes could not be allocated",
"code": "AllocationFailed",
"values": [
{
"name": "Reason",
"value": "The source managed disk or snapshot associated with the virtual machine Image Id was not found."
}
]
}
],
"virtualMachineConfiguration": {
"imageReference": {
"publisher": null,
"offer": null,
"sku": null,
"version": null,
"virtualMachineImageId": "/subscriptions/79b59716-301e-401a-bb8b-22edg5c1he4j/resourceGroups/resource-group-1/providers/Microsoft.Compute/images/my-image"
},
"nodeAgentSKUId": "batch.node.ubuntu 18.04"
},
"applicationLicenses": null
}
It seems like the error text has nothing to do with what is actually wrong. Has anyone encountered this error, or does anyone know a way to troubleshoot it?
UPDATE
Packer json used to create the image (taken from here)
{
"builders": [{
"type": "azure-arm",
"client_id": "ffxcvbd0-c867-429a-bxcv-8ee0acvb6f99",
"client_secret": "cvb54cvb-202d-4wq-bb8b-22cdfbce4f",
"tenant_id": "ae33sdfd-a54c-40af-b20c-80810f0ff5da",
"subscription_id": "096da34-4604-4bcb-85ae-2afsdf22192b",
"managed_image_resource_group_name": "resource-group-1",
"managed_image_name": "my-image",
"os_type": "Linux",
"image_publisher": "Canonical",
"image_offer": "UbuntuServer",
"image_sku": "18.04-LTS",
"azure_tags": {
"dept": "Engineering",
"task": "Image deployment"
},
"location": "West Europe",
"vm_size": "Standard_DS2_v2"
}],
"provisioners": [{
"execute_command": "chmod +x {{ .Path }}; {{ .Vars }} sudo -E sh '{{ .Path }}'",
"inline": [
"export DEBIAN_FRONTEND=noninteractive",
"apt-get update",
"apt-get upgrade -y",
"apt-get -y install nginx",
...
"/usr/sbin/waagent -force -deprovision+user && export HISTSIZE=0 && sync"
],
"inline_shebang": "/bin/sh -x",
"type": "shell"
}]
}

I ran the same test as you. The steps were:
Create the managed image through Packer.
Create the Batch pool with the managed image in the same subscription and region.
I got the same error as you. I then ran another test that creates the image from a snapshot and creates the Batch pool from that image. This time the pool worked.
In Azure you can prepare a managed image from snapshots of an Azure
VM's OS and data disks, from a generalized Azure VM with managed
disks, or from a generalized on-premises VHD that you upload.
Going by this description, it seems a custom image created directly through Packer cannot be used for the pool. I'm not sure about this, but the snapshot approach does work. Hope this helps.
Update
Take a look at the document Custom Images with Batch Shipyard, which says:
Note: Currently creating an ARM Image directly with Packer can only be
used with User Subscription Batch accounts. For standard Batch Service
pool allocation mode Batch accounts, Packer will need to create a VHD
first, then you will need to import the VHD to an ARM Image. Please
follow the appropriate path that matches your Batch account pool
allocation mode.
In my test, I followed the same steps that Packer performs to create the image. While the source VM exists, the custom image can be used normally for a Batch pool, but allocation fails once you delete the source VM. So, as the note says, a standard Batch Service account can only use an image created from the VHD file that Packer creates, and that VHD file must exist for the lifetime of the pool.

If you're using a managed image, then your imageReference section should look like this:
"imageReference": {
"id": "/subscriptions/79b59716-301e-401a-bb8b-22edg5c1he4j/resourceGroups/resource-group-1/providers/Microsoft.Compute/images/my-image"
},
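As a minimal sketch (Python, nothing here calls Azure), the resource ID used in imageReference can be assembled from its parts, which makes a wrong subscription or resource group easier to spot. The helper names managed_image_id and image_reference are hypothetical, and the "id" key simply follows the snippet above.

```python
import json


def managed_image_id(subscription_id: str, resource_group: str, image_name: str) -> str:
    """Build the ARM resource ID of a managed image (hypothetical helper)."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Compute/images/{image_name}"
    )


def image_reference(subscription_id: str, resource_group: str, image_name: str) -> dict:
    """Return an imageReference fragment that only sets the image ID."""
    return {"id": managed_image_id(subscription_id, resource_group, image_name)}


if __name__ == "__main__":
    # Placeholder subscription ID; substitute your own values.
    fragment = image_reference("<subscription-id>", "resource-group-1", "my-image")
    print(json.dumps({"imageReference": fragment}, indent=2))
```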


How to run Presto discovery service standalone?

How to run Presto Discovery Service standalone so it's neither a coordinator nor a worker? What are the requirements of a HTTP endpoint to become a discovery service for a Presto cluster?
I found this thread on presto-users mailing list where David Phillips wrote:
If you want to run discovery as a standalone service, separate from
Presto, that is an option. We used to publish instructions for doing
this, but got rid of them years ago, as running discovery inside the
coordinator worked fine (even on large clusters with hundreds of
machines).
Does this still hold?
Yes, you can run a standalone discovery service. The cases for this are rare and in general I recommend just running it on the coordinator.
On your discovery node:
Download the discovery service tar.gz in a version that is compatible with your Presto nodes (e.g. Presto version 347 is compatible with discovery service 1.29) and untar it to a directory.
Similar to a Presto server setup, create an etc directory under the service root and configure node.properties and jvm.config.
Add config.properties, which for the discovery service is as simple as this:
http-server.http.port=8081
Update these lines in your coordinator/worker config.properties.
discovery-server.enabled=false
discovery.uri=http://discovery.example.com:8081
Restart your services. (The discovery service is started the same way the Presto services are started, using bin/launcher.)
Once the coordinator and workers come up, run curl -XGET http://discovery.example.com:8081/v1/service; you should see output that contains:
{
"environment": "production",
"services": [
{
"id": "d2b7141e-d83f-4d23-be86-285ff2a9f53d",
"nodeId": "57ac8bd3-c55e-4170-b363-80d10023ece8",
"type": "presto",
"pool": "general",
"location": "/57ac8bd3-c55e-4170-b363-80d10023ece8",
"properties": {
"node_version": "347",
"coordinator": "true",
"http": "http://coord.example.com:8080",
"http-external": "http://coord.example.com:8080",
"connectorIds": "system"
}
},
{
"id": "f0abafae-052a-4758-95c6-d19355043bc6",
"nodeId": "57ac8bd3-c55e-4170-b363-80d10023ece8",
"type": "presto-coordinator",
"pool": "general",
"location": "/57ac8bd3-c55e-4170-b363-80d10023ece8",
"properties": {
"http": "http://coord.example.com:8080",
"http-external": "http://coord.example.com:8080"
}
},
{
"id": "1f5096de-189e-4e25-bac3-adc079981d86",
"nodeId": "8d7e820f-dd01-4227-ad6e-f74b97202647",
"type": "presto",
"pool": "general",
"location": "/8d7e820f-dd01-4227-ad6e-f74b97202647",
"properties": {
"node_version": "347",
"coordinator": "false",
"http": "http://worker1.example.com:8080",
"http-external": "http://worker1.example.com:8080",
"connectorIds": "system"
}
},
....
]
}
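To check that every node has registered, the /v1/service response can be summarized with a short script. This is a sketch, assuming the response has the shape shown above; the function name summarize_services and the trimmed sample payload are illustrative and not part of Presto.

```python
import json


def summarize_services(payload: dict) -> dict:
    """Group announced 'presto' services into coordinators and workers by HTTP URI."""
    summary = {"coordinators": [], "workers": []}
    for service in payload.get("services", []):
        if service.get("type") != "presto":
            continue  # skip 'presto-coordinator' and any other announcement types
        props = service.get("properties", {})
        role = "coordinators" if props.get("coordinator") == "true" else "workers"
        summary[role].append(props.get("http"))
    return summary


# Trimmed-down sample in the same shape as the output above.
SAMPLE = json.loads("""
{
  "environment": "production",
  "services": [
    {"type": "presto",
     "properties": {"coordinator": "true", "http": "http://coord.example.com:8080"}},
    {"type": "presto-coordinator",
     "properties": {"http": "http://coord.example.com:8080"}},
    {"type": "presto",
     "properties": {"coordinator": "false", "http": "http://worker1.example.com:8080"}}
  ]
}
""")

if __name__ == "__main__":
    print(summarize_services(SAMPLE))
```

Against a live cluster, the payload would instead come from an HTTP GET on the discovery URI followed by json.loads.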

Azure Container Service (AKS) kubeconfig file outdated

I am learning about K8s and set up a release pipeline with a kubectl apply. I set up the AKS cluster via Terraform, and on the first run all seemed fine. After I destroyed the cluster and reran the pipeline, I got issues which I believe are related to the kubeconfig file mentioned in the exception. I tried the cloud shell etc. to get to the file or reset it, but I wasn't successful. How can I get back to a clean state?
2020-12-09T09:08:51.7047177Z ##[section]Starting: kubectl apply
2020-12-09T09:08:51.7482440Z ==============================================================================
2020-12-09T09:08:51.7483217Z Task : Kubectl
2020-12-09T09:08:51.7483729Z Description : Deploy, configure, update a Kubernetes cluster in Azure Container Service by running kubectl commands
2020-12-09T09:08:51.7484058Z Version : 0.177.0
2020-12-09T09:08:51.7484996Z Author : Microsoft Corporation
2020-12-09T09:08:51.7485587Z Help : https://learn.microsoft.com/azure/devops/pipelines/tasks/deploy/kubernetes
2020-12-09T09:08:51.7485955Z ==============================================================================
2020-12-09T09:08:52.7640528Z [command]C:\ProgramData\Chocolatey\bin\kubectl.exe --kubeconfig D:\a\_temp\kubectlTask\1607504932712\config apply -f D:\a\r1\a/medquality-cordapp/k8s
2020-12-09T09:08:54.1555570Z Unable to connect to the server: dial tcp: lookup mq-k8s-dfee38f6.hcp.switzerlandnorth.azmk8s.io: no such host
2020-12-09T09:08:54.1798118Z ##[error]The process 'C:\ProgramData\Chocolatey\bin\kubectl.exe' failed with exit code 1
2020-12-09T09:08:54.1853710Z ##[section]Finishing: kubectl apply
Update, workflow tasks of the release pipeline:
Initially I get the artifact, clone of the repo containing the k8s yamls, then the stage does a kubectl apply.
"workflowTasks": [
{
"environment": {},
"taskId": "cbc316a2-586f-4def-be79-488a1f503564",
"version": "0.*",
"name": "kubectl apply",
"refName": "",
"enabled": true,
"alwaysRun": false,
"continueOnError": false,
"timeoutInMinutes": 0,
"definitionType": null,
"overrideInputs": {},
"condition": "succeeded()",
"inputs": {
"kubernetesServiceEndpoint": "82e5971b-9ac6-42c6-ac43-211d2f6b60e4",
"namespace": "",
"command": "apply",
"useConfigurationFile": "false",
"configuration": "",
"arguments": "-f $(System.DefaultWorkingDirectory)/medquality-cordapp/k8s",
"secretType": "dockerRegistry",
"secretArguments": "",
"containerRegistryType": "Azure Container Registry",
"dockerRegistryEndpoint": "",
"azureSubscriptionEndpoint": "",
"azureContainerRegistry": "",
"secretName": "",
"forceUpdate": "true",
"configMapName": "",
"forceUpdateConfigMap": "false",
"useConfigMapFile": "false",
"configMapFile": "",
"configMapArguments": "",
"versionOrLocation": "version",
"versionSpec": "1.7.0",
"checkLatest": "false",
"specifyLocation": "",
"cwd": "$(System.DefaultWorkingDirectory)",
"outputFormat": "json",
"kubectlOutput": ""
}
}
]
I can see you are using kubernetesServiceEndpoint as the service connection type in the Kubectl task.
Once I destroyed the cluster I reran the pipeline, I get issues....
After the cluster is destroyed, the kubernetesServiceEndpoint in Azure DevOps still points at the original cluster. A Kubectl task that uses this endpoint keeps looking for the old cluster and fails with the above error, since that cluster no longer exists.
You can fix this by updating the kubernetesServiceEndpoint in Azure DevOps with the newly created cluster:
Go to Azure DevOps Project settings --> Service connections --> find your Kubernetes service connection --> click Edit to update the configuration.
If your Kubernetes cluster gets destroyed and recreated frequently, I would suggest using Azure Resource Manager as the service connection type for the Kubectl task instead.
With azureSubscriptionEndpoint and azureResourceGroup specified, it does not matter how many times the cluster is recreated, as long as the cluster's name does not change.
See the document on creating an Azure Resource Manager service connection.
When you destroy and reprovision an AKS cluster, the kube API URL and some other things change but, as you found out, nothing updates this automatically on your configured clients.
What I do to get access to new and reprovisioned AKS clusters is:
az aks get-credentials --subscription <sub> -g <rg> -n <aksname> -a --overwrite-existing
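If the refresh needs to run from a script after each Terraform apply, the command can be wrapped as below. This is only a sketch: the function names are hypothetical, it assumes the az CLI is installed and logged in, and it uses the long forms of the flags shown in the command above.

```python
import subprocess


def get_credentials_command(subscription: str, resource_group: str,
                            cluster_name: str, admin: bool = False) -> list:
    """Build the 'az aks get-credentials' argument list (hypothetical helper)."""
    cmd = [
        "az", "aks", "get-credentials",
        "--subscription", subscription,
        "--resource-group", resource_group,
        "--name", cluster_name,
        "--overwrite-existing",  # replace any stale entry for this cluster
    ]
    if admin:
        cmd.append("--admin")  # long form of -a
    return cmd


def refresh_kubeconfig(subscription: str, resource_group: str,
                       cluster_name: str, admin: bool = False) -> None:
    """Run the command; raises CalledProcessError if az exits non-zero."""
    cmd = get_credentials_command(subscription, resource_group, cluster_name, admin)
    subprocess.run(cmd, check=True)
```

Calling refresh_kubeconfig("<sub>", "<rg>", "<aksname>", admin=True) after the cluster is recreated rewrites the local kubeconfig entry. It does not touch a Kubernetes service connection stored in Azure DevOps.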

How to deploy a Linux Azure Function using the Github Docker Registry

I cannot get a deployment of an Azure Function from a private repository to work using the new GitHub artifact repo for Docker (https://github.com/features/packages).
My linux_fx_version is:
'linux_fx_version': 'DOCKER|{}'.format(self.docker_image_id)
with docker_image_id having the value organisation/project-name/container-name:latest
For the other settings, I am using
{ "name": "DOCKER_REGISTRY_SERVER_PASSWORD", "value": self.docker_password },
{ "name": "DOCKER_REGISTRY_SERVER_USERNAME", "value": self.docker_username },
{ "name": "DOCKER_REGISTRY_SERVER_URL", "value": self.docker_url },
with the docker_url being https://docker.pkg.github.com/, and the password being the token with read:packages
Things look good, and yet I get the following (I am not able to fetch any deployment logs as the runtime is unreachable).
Error:
Azure Functions Runtime is unreachable. Click here for details on storage configuration.
Solution found.
Use https://docker.pkg.github.com/ as the docker URL,
and docker.pkg.github.com/<org>/<project-name>/<container-name>:<version> as the linux_fx_version
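Put together, the working values can be generated as in the sketch below. It assumes the DOCKER| prefix from the question is kept in front of the fully qualified image name; the helper names are hypothetical and only reuse the setting names from the question.

```python
REGISTRY_HOST = "docker.pkg.github.com"


def linux_fx_version(org: str, project: str, container: str, version: str = "latest") -> str:
    """Fully qualified image reference, including the registry host."""
    # Assumption: the 'DOCKER|' prefix from the question is still required.
    return f"DOCKER|{REGISTRY_HOST}/{org}/{project}/{container}:{version}"


def registry_app_settings(username: str, token: str) -> list:
    """App settings for the registry; the token needs the read:packages scope."""
    return [
        {"name": "DOCKER_REGISTRY_SERVER_URL", "value": f"https://{REGISTRY_HOST}/"},
        {"name": "DOCKER_REGISTRY_SERVER_USERNAME", "value": username},
        {"name": "DOCKER_REGISTRY_SERVER_PASSWORD", "value": token},
    ]


if __name__ == "__main__":
    print(linux_fx_version("organisation", "project-name", "container-name"))
```

The difference from the question is that the registry host is now part of the image name rather than appearing only in DOCKER_REGISTRY_SERVER_URL.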

IotEdge - Error calling Create module image-classifier-service

I'm very new to Azure IoT Edge and I'm trying to deploy this sample to my Raspberry Pi: Image Recognition with Azure IoT Edge and Cognitive Services,
but after running Build & Push IoT Edge Solution and deploying it to a single device ID, I see neither of the two modules listed in docker ps -a or iotedge list.
When I check the EdgeAgent logs there is an error message, and it seems EdgeAgent fails while creating those modules (camera-capture and image-classifier-service).
I've tried :
1. Re-build it from fresh folder package
2. Pull the image manually from Azure Portal and run the image manually by script
I'm stuck on this for days.
In deployment.arm32v7.json I define the image for those modules with the registered registry URL:
"modules": {
"camera-capture": {
"version": "1.0",
"type": "docker",
"status": "running",
"restartPolicy": "always",
"settings": {
"image": "zzzz.azurecr.io/camera-capture-opencv:1.1.12-arm32v7",
"createOptions": "{\"Env\":[\"Video=0\",\"azureSpeechServicesKey=2f57f2d9f1074faaa0e9484e1f1c08c1\",\"AiEndpoint=http://image-classifier-service:80/image\"],\"HostConfig\":{\"PortBindings\":{\"5678/tcp\":[{\"HostPort\":\"5678\"}]},\"Devices\":[{\"PathOnHost\":\"/dev/video0\",\"PathInContainer\":\"/dev/video0\",\"CgroupPermissions\":\"mrw\"},{\"PathOnHost\":\"/dev/snd\",\"PathInContainer\":\"/dev/snd\",\"CgroupPermissions\":\"mrw\"}]}}"
}
},
"image-classifier-service": {
"version": "1.0",
"type": "docker",
"status": "running",
"restartPolicy": "always",
"settings": {
"image": "zzzz.azurecr.io/image-classifier-service:1.1.5-arm32v7",
"createOptions": "{\"HostConfig\":{\"Binds\":[\"/home/pi/images:/images\"],\"PortBindings\":{\"8000/tcp\":[{\"HostPort\":\"80\"}],\"5679/tcp\":[{\"HostPort\":\"5679\"}]}}}"
}
Error message from EdgeAgent Logs :
(Inner Exception #0) Microsoft.Azure.Devices.Edge.Agent.Edgelet.EdgeletCommunicationException- Message:Error calling Create module
image-classifier-service: Could not create module image-classifier-service
caused by: Could not pull image zzzzz.azurecr.io/image-classifier-service:1.1.5-arm32v7
caused by: Get https://zzzzz.azurecr.io/v2/image-classifier-service/manifests/1.1.5-arm32v7: unauthorized: authentication required
When trying to run the pulled image by script :
sudo docker run --rm --name testName -it zzzz.azurecr.io/camera-capture-opencv:1.1.12-arm32v7
None
I get this error :
Camera Capture Azure IoT Edge Module. Press Ctrl-C to exit.
Error: Time:Fri May 24 10:01:09 2019 File:/usr/sdk/src/c/iothub_client/src/iothub_client_core_ll.c Func:retrieve_edge_environment_variabes Line:191 Environment IOTEDGE_AUTHSCHEME not set
Error: Time:Fri May 24 10:01:09 2019 File:/usr/sdk/src/c/iothub_client/src/iothub_client_core_ll.c Func:IoTHubClientCore_LL_CreateFromEnvironment Line:1572 retrieve_edge_environment_variabes failed
Error: Time:Fri May 24 10:01:09 2019 File:/usr/sdk/src/c/iothub_client/src/iothub_client_core.c Func:create_iothub_instance Line:941 Failure creating iothub handle
Unexpected error IoTHubClient.create_from_environment, IoTHubClientResult.ERROR from IoTHub
When you pulled the image directly with docker run, it pulled but then failed to run outside of the edge runtime, which is expected. But when the edge agent tried to pull it, it failed because it was not authorized. No credentials were supplied to the runtime, so it attempted to access the registry anonymously.
Make sure that you add your container registry credentials to the deployment so that edge runtime can pull images. The deployment should contain something like the following in the runtime settings:
"MyRegistry" :{
"username": "<username>",
"password": "<password>",
"address": "<registry-name>.azurecr.io"
}
As @silent pointed out in the comments, the documentation is here, including an example deployment that includes container registry credentials.
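A quick way to catch this before deploying is to compare each module's registry host with the credentials in the manifest. The sketch below assumes the usual deployment manifest layout (modulesContent, then $edgeAgent, then properties.desired) and that every module image name starts with an explicit registry host; images from public registries would be flagged too, so treat the result as a hint. The function name is hypothetical.

```python
def modules_missing_credentials(manifest: dict) -> list:
    """Return names of modules whose image registry has no registryCredentials entry."""
    desired = manifest["modulesContent"]["$edgeAgent"]["properties.desired"]
    credentials = desired["runtime"]["settings"].get("registryCredentials", {})
    known_hosts = {entry["address"] for entry in credentials.values()}
    missing = []
    for name, module in desired.get("modules", {}).items():
        # 'zzzz.azurecr.io/camera-capture-opencv:1.1.12-arm32v7' -> 'zzzz.azurecr.io'
        host = module["settings"]["image"].split("/", 1)[0]
        if host not in known_hosts:
            missing.append(name)
    return sorted(missing)


# Minimal manifest in the assumed layout, with no credentials configured.
SAMPLE = {
    "modulesContent": {
        "$edgeAgent": {
            "properties.desired": {
                "runtime": {"settings": {"registryCredentials": {}}},
                "modules": {
                    "camera-capture": {
                        "settings": {"image": "zzzz.azurecr.io/camera-capture-opencv:1.1.12-arm32v7"}
                    },
                    "image-classifier-service": {
                        "settings": {"image": "zzzz.azurecr.io/image-classifier-service:1.1.5-arm32v7"}
                    },
                },
            }
        }
    }
}

if __name__ == "__main__":
    print(modules_missing_credentials(SAMPLE))
```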

Azure VMSS Linux OS Upgrade

I was looking at https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-automatic-upgrade and seeing if we could change our VMSS (currently using a Manual upgrade policy mode) to an automatic rolling update one but found that the application health probe couldn't use our existing App Gateway health probe as it needed to be specifically a LoadBalancer one. Bummer.
Anyhow, I thought I'd test our VMSS to ensure we can manually upgrade each instance from the Portal/CLI by deliberately picking an old 16.04 LTS image ID (instead of the 'latest' version tag). From "az vm image list --location canadacentral --publisher Canonical --offer UbuntuServer --sku 16.04-LTS --all --output table" I picked the first 16.04 image published in 2018, i.e. 16.04.201801050. The latest one is "16.04.201811140".
Microsoft.Compute/virtualMachineScaleSets/cluster?api-version=2018-06-01:
"properties": {
"singlePlacementGroup": false,
"upgradePolicy": {
"mode": "Manual",
"automaticOSUpgrade": false
},
...
"imageReference": {
"publisher": "Canonical",
"offer": "UbuntuServer",
"sku": "16.04-LTS",
"version": "16.04.201801050"
},
By SSHing onto the box I can confirm that each new VMSS instance indeed has the desired "16.04.201801050" image (with plenty of updates to apply):
```
Welcome to Ubuntu 16.04.3 LTS (GNU/Linux 4.11.0-1016-azure x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
Get cloud support with Ubuntu Advantage Cloud Guest:
http://www.ubuntu.com/business/services/cloud
202 packages can be updated.
118 updates are security updates.
jiraadmin@jiranode-000001:~$ apt list linux-image-azure
Listing... Done
linux-image-azure/xenial-updates,xenial-security 4.15.0.1032.37 amd64 [upgradable from: 4.11.0.1016.16]
N: There is 1 additional version. Please use the '-a' switch to see it
```
but I was surprised to see that the Portal and REST API have each of the instances with the latest model applied set to true (which clearly it is not)
Microsoft.Compute/virtualMachineScaleSets/cluster/virtualMachines/0?api-version=2018-06-01:
"properties": {
"latestModelApplied": true,
"vmId": "...",
"hardwareProfile": {},
"storageProfile": {
"imageReference": {
"publisher": "Canonical",
"offer": "UbuntuServer",
"sku": "16.04-LTS",
"version": "16.04.201801050"
}
Clicking on the Upgrade button for the VM instance in the Azure Portal kicks off a very short-lived task with no changes made to the underlying VM.
So I assumed the following:
Specifying an image version older than the 'latest' one should set the VMSS instance's latestModelApplied to false.
Clicking the Upgrade button in the Portal should bring the "old" image version up to the 'latest' image version, i.e. essentially perform a 'sudo apt-get upgrade' or 'sudo apt dist-upgrade'. With latestModelApplied set to false, it doesn't do either.
Clicking Reimage in the Portal gives a warning about returning the instance to its original state, but https://learn.microsoft.com/en-us/rest/api/compute/virtualmachinescalesets/reimage suggests it is going to upgrade the OS image, i.e. sudo apt dist-upgrade. It does the former: it reinstalls the original image, blowing away everything.
So at the moment it appears to me that you can't use the Portal to maintain OS and security updates on the currently running VM, due to the erroneous latestModelApplied property. Are the behaviour and my assumptions above correct?
Thanks,
Stephen.
A guy from MS sorted out my (incorrect) assumptions at https://github.com/Azure/vm-scale-sets/issues/62.
