How can I get information on my template failing to start?

I'm using Azure Lab Services (for classrooms), and I can't start my Template VM. The "start VM" trigger works, but the VM fails to start and returns to a "stopped" state without any error message in the Labs environment or the Azure Portal. Is there a way to pull more debugging information on why my Template didn't start, or can someone who has experienced this problem suggest a troubleshooting option?

Yes, you can troubleshoot this further by checking the Activity log of your Lab account in the Azure portal.
Expand the failed event and you should see the error code and the message. Then switch to the JSON representation and look for the statusMessage key within properties, which has more details.
For example:
..
"properties": {
"statusMessage": "{\"status\":\"Failed\",\"error\":{\"code\":\"ResourceOperationFailure\",\"message\":\"The resource operation completed with terminal provisioning state 'Failed'.\",\"details\":[{\"code\":\"ResourceGroupNotFound\",\"message\":\"Resource group 'MX-RG-xxxxx' could not be found.\"}]}}"
},
..
This should hopefully give you enough information to take the next steps.
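Note that statusMessage is itself a JSON-encoded string, so it needs a second decode before the nested details are readable. Below is a minimal Python sketch, assuming the event has been copied out of the portal as JSON with the shape shown above. The function name extract_error_details is hypothetical, and the sample event simply rebuilds the example from this answer.

```python
import json


def extract_error_details(event):
    """Return (code, message) pairs from an Activity log event's statusMessage.

    Assumes the event dict has the shape shown in the example above:
    properties.statusMessage holds a JSON-encoded string.
    """
    raw = event.get("properties", {}).get("statusMessage")
    if not raw:
        return []
    status = json.loads(raw)  # second decode: statusMessage is a string
    error = status.get("error")
    if not error:
        return []
    pairs = [(error.get("code"), error.get("message"))]
    # Nested "details" entries usually carry the more specific cause.
    for detail in error.get("details", []):
        pairs.append((detail.get("code"), detail.get("message")))
    return pairs


# Sample event rebuilt from the example in this answer.
event = {
    "properties": {
        "statusMessage": json.dumps({
            "status": "Failed",
            "error": {
                "code": "ResourceOperationFailure",
                "message": "The resource operation completed with terminal "
                           "provisioning state 'Failed'.",
                "details": [{
                    "code": "ResourceGroupNotFound",
                    "message": "Resource group 'MX-RG-xxxxx' could not be found.",
                }],
            },
        })
    }
}

for code, message in extract_error_details(event):
    print(f"{code}: {message}")
```

The last pair printed is typically the one to act on, since the outer error only reports that the operation failed.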
There's an ongoing outage for Azure Lab Services. Please follow updates here.

Related

Azure Machine Learning pipeline: How to retry upon failure?

So I've got an Azure Machine Learning pipeline here that consists of a number of PythonScriptStep tasks - pretty basic really.
Some of these script steps fail intermittently due to network issues or somesuch - really nothing unexpected. The solution here is always to simply trigger a rerun of the failed experiment in the browser interface of Azure Machine Learning studio.
Despite my best efforts I haven't been able to figure out how to set a retry parameter either on the script step objects, the pipeline object, or any other AZ ML-related object.
This is a common pattern in pipelines of any sort: Task fails once - retry a couple of times before deciding it actually fails.
Does anyone have pointers for me please?
Edit: One helpful user suggested an external solution to this which requires an Azure Logic App that listens to ML pipeline events and re-triggers failed pipelines via an HTTP request. While this solution may work for some it just takes you down another rabbit hole of setting up, debugging, and maintaining another external component. I'm looking for a simple "retry upon task failure" option that (IMO) must be baked into the Azure ML pipeline framework and is hopefully just poorly documented.
I assume that if a script fails, you want to rerun the entire pipeline. In that case, it is pretty simple with Logic Apps. What you need is the following:
You need to make a PipelineEndpoint for your pipeline so it can be triggered by something outside Azure ML.
You need to set up a Logic App to listen for failed runs. See the following: https://medium.com/geekculture/notifications-on-azure-machine-learning-pipelines-with-logic-apps-5d5df11d3126. Instead of printing a message to Microsoft Teams as in that example, you instead invoke your pipeline through its endpoint.
(this would ideally be a comment but it exceeded the word limit)
@user787267's answer above helped me set up the retry pipeline, so I thought I'd add a few more details that might help someone else set this up.
How to set up the HTTP action
Method: POST
URI: The pipeline endpoint that you configured
Headers: `Key`: Content-Type -- `Value`: application/json
Body:
{
  "ExperimentName": "my_experiment_name",
  "ParameterAssignments": {
    "param1": "value1",
    "param2": "value2"
  },
  "RunSource": "SDK"
}
Authentication Type: Managed Identity
Managed Identity: System-assigned managed identity
You can set up the managed identity by going to the logic app's page and opening the Identity tab, then following the steps there. You'll need to give the managed identity permissions over the space in which your ML instance lives.
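To test the endpoint before wiring it into the Logic App, the same request can be built in Python. This is only a sketch using the standard library. The URL and token are placeholders, the function name build_rerun_request is hypothetical, and the bearer token header is an assumption for calls made outside the Logic App, where the managed identity is not available.

```python
import json
import urllib.request


def build_rerun_request(endpoint_url, experiment_name, parameters, token):
    """Build the POST request that re-triggers a published pipeline.

    The body mirrors the HTTP action described above. The Authorization
    header is an assumption for calls made without a managed identity.
    """
    body = {
        "ExperimentName": experiment_name,
        "ParameterAssignments": parameters,
        "RunSource": "SDK",
    }
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_rerun_request(
    "https://example.invalid/pipeline-endpoint",  # placeholder URL
    "my_experiment_name",
    {"param1": "value1", "param2": "value2"},
    "<access-token>",  # placeholder token
)

# Sending is left commented out so the sketch runs without network access:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Keeping request construction separate from sending makes it easy to inspect the body and headers before anything is actually triggered.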

Backup Windows server Azure VM new Azure Recovery Service Vault error code BMSUserErrorContainerObjectNotFound

I have a new VM running Windows (Windows Server 2016 Datacenter).
When I try to enable backup and select a new Recovery Services vault, I get a deployment error:
Deployment to resource group test failed.
Additional details from the underlying API that might be helpful: At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.
Resource: vault242/Azure/iaasvmcontainer;iaasvmcontainerv2;test;web01/vm;iaasvmcontainerv2;test;web01
Type: Microsoft.RecoveryServices/vaults/backupFabrics/protectionContainers/protectedItems
Status: Conflict
Status message:
{
  "status": "Failed",
  "error": {
    "code": "BMSUserErrorContainerObjectNotFound",
    "message": "Item not found"
  }
}
I can't find any information on the code BMSUserErrorContainerObjectNotFound, or on why the protected item is not created automatically.
My apologies for the delay in the response.
Were you able to resolve the issue?
If not, let's review it.
As I understand it, you are enabling backup for the Azure VM.
There could be multiple reasons for this failure.
Did you perform these steps manually in the Azure Portal, through a template deployment, or through a script? I suspect you are using a template deployment or some kind of script, and that this is a syntax issue.
Alternatively, it could be a transient issue caused by request load on the Azure side. In that case, you need to retry the operation.
Some additional questions: do you get the failure on one specific machine or on all machines? In a specific region?
Do you get the same failure when you use an existing vault?
If you can provide the information above, it will help narrow down the root cause.
I ran into this error as well today, and I think it is an Azure portal bug that occurs when enabling Backup from the VM blade.
Instead, you can initiate the backup from the "Recovery Services vaults" blade and add the VM to it.

Where to find logs for databricks workspace?

I created a Databricks component with a VNet based on this template and documentation. The problem is that we receive an error when trying to launch a workspace.
"We've encountered an error creating your workspace. Please wait a few minutes and try again."
In the documentation, there is a similar error in the troubleshooting section, but it's not the same.
It could be a network problem, as the documentation suggests, but the ARM template has been tested in other Azure environments and works properly.
Workspace creation fails, but we don't know why.
Does anyone know where to find any kind of logs about workspace creation or know anything about this error message?
Thanks.
This error means that your workspace failed to be provisioned. We had this when a Policy on our subscription blocked the resource from being created. The policy was to ensure that Tags were set. Check to see if you have any Policies enabled.
Any logs you can see will be in the resource group under the Deployments blade, but they probably won't show anything useful. You should raise a support ticket if you cannot track down the problem yourself.
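If you export the deployment operations as JSON (for example with the Azure CLI's az deployment operation group list), a short script can pick out the failed ones and their error codes. The Python sketch below is only illustrative. The field names (provisioningState, statusMessage, targetResource) are assumptions about the shape of that export, the function name failed_operations is hypothetical, and the sample data is made up rather than taken from a real deployment.

```python
import json


def failed_operations(operations):
    """Return (resource type, error code, message) for each failed operation.

    The field names below are assumptions about the shape of the exported
    deployment operations; adjust them if your export differs.
    """
    failures = []
    for op in operations:
        props = op.get("properties", {})
        if props.get("provisioningState") != "Failed":
            continue
        status = props.get("statusMessage") or {}
        if isinstance(status, str):  # some exports JSON-encode this field
            status = json.loads(status)
        error = status.get("error", {})
        target = props.get("targetResource") or {}
        failures.append(
            (target.get("resourceType"), error.get("code"), error.get("message"))
        )
    return failures


# Hypothetical sample data, not taken from a real deployment.
sample = [
    {"properties": {"provisioningState": "Succeeded"}},
    {
        "properties": {
            "provisioningState": "Failed",
            "targetResource": {"resourceType": "Microsoft.Databricks/workspaces"},
            "statusMessage": {
                "error": {
                    "code": "RequestDisallowedByPolicy",
                    "message": "Resource was disallowed by policy.",
                }
            },
        }
    },
]

for resource_type, code, message in failed_operations(sample):
    print(f"{resource_type}: {code}: {message}")
```

A policy block like the one described above would be expected to surface as an error code on the failed operation, which is quicker to spot this way than by clicking through each operation in the portal.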

Azure Portal - Unable to Create a Signal R Service

I tried creating a SignalR service and got the "Deployment failed" message shown below.
Deployment to resource group '' failed. Additional details
from the underlying API that might be helpful: At least one resource
deployment operation failed. Please list deployment operations for
details. Please see https://aka.ms/arm-debug for usage details
I can see the service in my SignalR list even after getting the above error while creating it.
However, the Overview tab shows an error, and after clicking on it I can see a code that says "Invalid RG".
Is there a problem with my RG?
Thanks,
Praveen
It seems to be a problem with Azure SignalR itself. I tried to create the service via the portal and PowerShell, and got the same error.
I have opened an issue on GitHub; you can track it for progress.
Update:
It works fine in the portal today; it seems something was wrong with it yesterday.

Azure Service Fabric ARM template Provisioning Failed

I have a script that uses an ARM template to provision an Azure Service Fabric cluster (official Windows servers), along with other dependencies such as storage. I do not provision through the portal.
Facts:
Two days ago, I used this script to provision the cluster with complete success.
I tried the same again yesterday, and the provisioning failed (with the error below).
Just to reassure you that the provisioning script works: I can successfully provision with this script on another subscription, where it consistently and reliably succeeds.
The error:
Resource Microsoft.Insights/autoscaleSettings '1NodeVMSetAutoScale' failed with message 'The metric with namespace '' and name '\Processor(_Total)\% Processor Time' is not supported for this resource id '/subscriptions/----/resourceGroups/-cluster/providers/Microsoft.Compute/virtualMachineScaleSets/1'.'
8:10:01 PM - Resource Microsoft.Insights/autoscaleSettings '2NodeVMSetAutoScale' failed with message 'The metric with namespace '' and name '\Processor(_Total)\% Processor Time' is not supported for this resource id '/subscriptions/----/resourceGroups/cluster/providers/Microsoft.Compute/virtualMachineScaleSets/2'.'
8:10:01 PM - "Template output evaluation skipped: at least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details."
'string' does not contain a definition for 'error'
My question is why? What could be the reason for it not to consistently succeed? Can you please help with troubleshooting steps?
Related info: https://azure.microsoft.com/en-us/documentation/articles/insights-autoscale-common-metrics/
2 questions:
1) what region are you deploying in?
2) In the new subscription, can you check what resource providers you have registered, and in what regions? In the CLI, the commands look like:
azure config mode arm
azure provider list
azure provider show Microsoft.Insights
I have faced the same issue for a week in my subscriptions. The way out was to change the diagnostics configuration by adding the counter "\Processor(_Total)\% Processor Time" under the performance counters section of the WAD diagnostics configuration. You can also take a sneak peek here, where autoscale is discussed: Service Fabric Autoscale
Please share your template, or part of it, so it can be analysed further.
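The diagnostics change described above can be sketched in Python as a small helper that adds the counter to a WadCfg dictionary if it is missing. This assumes the common WadCfg layout (DiagnosticMonitorConfiguration, then PerformanceCounters, then PerformanceCounterConfiguration). The function name ensure_counter, the default sample rate and transfer period, and the starting configuration are all illustrative assumptions, so check them against your own template.

```python
def ensure_counter(wad_cfg, specifier, sample_rate="PT15S", unit="Percent"):
    """Add a performance counter to a WadCfg dict if it is not already listed.

    Assumes the usual WadCfg layout:
    DiagnosticMonitorConfiguration -> PerformanceCounters
    -> PerformanceCounterConfiguration (a list of counters).
    Returns True if the counter was added, False if it was already present.
    """
    counters = (
        wad_cfg.setdefault("DiagnosticMonitorConfiguration", {})
        .setdefault("PerformanceCounters", {"scheduledTransferPeriod": "PT1M"})
        .setdefault("PerformanceCounterConfiguration", [])
    )
    if any(c.get("counterSpecifier") == specifier for c in counters):
        return False  # already present, nothing to change
    counters.append(
        {"counterSpecifier": specifier, "sampleRate": sample_rate, "unit": unit}
    )
    return True


# Illustrative starting point; a real template will contain more settings.
wad_cfg = {"DiagnosticMonitorConfiguration": {"overallQuotaInMB": "50000"}}
added = ensure_counter(wad_cfg, r"\Processor(_Total)\% Processor Time")
print(added)
```

The idea is that the autoscale rule can only reference a metric the scale set actually emits, so the counter named in the error has to be collected by the diagnostics extension first.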
