Azure Machine Learning pipeline: How to retry upon failure?

So I've got an Azure Machine Learning pipeline here that consists of a number of PythonScriptStep tasks - pretty basic, really.
Some of these script steps fail intermittently due to network issues or some such - nothing unexpected, really. The solution is always to simply trigger a rerun of the failed experiment in the browser interface of Azure Machine Learning studio.
Despite my best efforts I haven't been able to figure out how to set a retry parameter on the script step objects, the pipeline object, or any other Azure ML-related object.
This is a common pattern in pipelines of any sort: a task fails once, so retry it a couple of times before deciding it has actually failed.
Does anyone have pointers for me please?
Edit: One helpful user suggested an external solution to this which requires an Azure Logic App that listens to ML pipeline events and re-triggers failed pipelines via an HTTP request. While this solution may work for some, it just takes you down another rabbit hole of setting up, debugging, and maintaining another external component. I'm looking for a simple "retry upon task failure" option that (IMO) must be baked into the Azure ML pipeline framework and is hopefully just poorly documented.

I assume that if a script fails, you want to rerun the entire pipeline. In that case, it is pretty simple with Logic Apps. What you need is the following:
You need to make a PipelineEndpoint for your pipeline so it can be triggered by something outside Azure ML (see the sketch after these steps).
You need to set up a Logic App to listen for failed runs. See the following: https://medium.com/geekculture/notifications-on-azure-machine-learning-pipelines-with-logic-apps-5d5df11d3126. Instead of posting a message to Microsoft Teams as in that example, you invoke your pipeline through its endpoint.
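In case it helps, here is a minimal sketch of publishing a pipeline as a PipelineEndpoint with the v1 Python SDK. The step, compute target, and endpoint name are placeholders; in practice you would reuse the Pipeline you already build from your PythonScriptSteps.

from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineEndpoint
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()  # assumes a config.json for your workspace

# Placeholder step/pipeline; reuse your existing Pipeline object instead
step = PythonScriptStep(name="my_step", script_name="train.py",
                        compute_target="cpu-cluster", source_directory=".")
pipeline = Pipeline(workspace=ws, steps=[step])

endpoint = PipelineEndpoint.publish(
    workspace=ws,
    name="my_pipeline_endpoint",  # placeholder name
    pipeline=pipeline,
    description="Published so a Logic App can re-trigger failed runs")
print(endpoint.endpoint)  # the REST URL the Logic App will POST to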

(this would ideally be a comment but it exceeded the word limit)
user787267's answer above helped me set up the retry pipeline, so I thought I'd add a few more details that might help someone else set this up.
How to set up the HTTP action
Method: POST
URI: The pipeline endpoint that you configured
Headers: `Content-Type`: `application/json`
Body:
{
    "ExperimentName": "my_experiment_name",
    "ParameterAssignments": {
        "param1": "value1",
        "param2": "value2"
    },
    "RunSource": "SDK"
}
Authentication Type: Managed Identity
Managed Identity: System-assigned managed identity
You can set up the managed identity by going to the Logic App's page, clicking on the Identity tab, and then following the steps there. You'll need to give the managed identity permissions over the workspace (or resource group) in which your Azure ML instance lives.
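If you want to sanity-check the endpoint and body outside the Logic App first, roughly the same call can be made from Python. This is just a sketch: the endpoint URL is a placeholder, and it authenticates with an interactive AAD login rather than the managed identity the Logic App uses.

import requests
from azureml.core.authentication import InteractiveLoginAuthentication

auth = InteractiveLoginAuthentication()
headers = auth.get_authentication_header()  # {"Authorization": "Bearer ..."}
headers["Content-Type"] = "application/json"

body = {
    "ExperimentName": "my_experiment_name",
    "ParameterAssignments": {"param1": "value1", "param2": "value2"},
    "RunSource": "SDK",
}

# Placeholder URL: use the endpoint of the PipelineEndpoint you published
endpoint_url = "https://<region>.api.azureml.ms/pipelines/v1.0/subscriptions/..."
response = requests.post(endpoint_url, headers=headers, json=body)
response.raise_for_status()
print(response.json())  # contains the Id of the newly submitted run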

Related

AzureBlobCredentialMissing Error only occurs when triggered, versus no error in Debug

I get the following error in a pipeline whose first activity is a lookup on a storage container to get the contents of a file. When I test the connections, linked service, and datasets, or debug the pipeline, I do not receive any errors. However, when the pipeline is triggered by the storage event, it throws this error:
ErrorCode=AzureBlobCredentialMissing,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Please provide either connectionString or sasUri or serviceEndpoint to connect to Blob.,Source=Microsoft.DataTransfer.ClientLibrary,'
As per your scenario, where debugging succeeds but the triggered runs fail, my assumption is that your dev changes have not been published, which is why the trigger runs fail. In simple terms, the most recently published version of your linked service is different from your development version, which hasn't been published.
If you are using source control, then I would recommend following this tutorial for best practices - Automated publishing for continuous integration and delivery
If you are using CI/CD, then the issue might indeed be caused by the DevOps pipeline not overriding the linked service parameters. Try redeploying the resource by following the step below and it should work as expected. (The linked service parameters have to be overwritten in the Azure resource template.)
For example, if you have a parameterized linked service, you will still have to add its parameter values to the overrideParameters section of the AzureResourceManagerTemplateDeployment task.

Change Endpoint in Azure Devops Invoke Rest API task

I'm trying to invoke another endpoint from this task, but I cannot get it to work. Does anyone know of a way to do that?
The endpoint that DevOps defaults to:
https://management.azure.com
The desired endpoint:
https://germanywestcentral.api.azureml.ms
The documentation is not really clear on whether this is possible at all, or at least I can't find it.
Resources:
https://learn.microsoft.com/en-us/azure/devops/pipelines/tasks/utility/http-rest-api?view=azure-devops

How can I get information on my template failing to start?

I'm using Azure Labs Services (for classrooms), and I can't start my Template VM. The "start VM" trigger will work, but the VM will fail to start and return to a "stopped" state without any error message in the Labs environment or the Azure Portal. Is there a way I can pull more debugging information as to why my Template didn't start, or a possible troubleshooting option from someone who's experienced this problem before?
Yes of course, you can troubleshoot it further by checking the Activity logs of your Lab account from within the Azure portal as follows:
Expanding the failed event further, you should be able to see the Error code and the Message. Switching to the JSON representation, look for the statusMessage key within properties, which has more details.
For example:
..
"properties": {
"statusMessage": "{\"status\":\"Failed\",\"error\":{\"code\":\"ResourceOperationFailure\",\"message\":\"The resource operation completed with terminal provisioning state 'Failed'.\",\"details\":[{\"code\":\"ResourceGroupNotFound\",\"message\":\"Resource group 'MX-RG-xxxxx' could not be found.\"}]}}"
},
..
This should hopefully give you enough information to take the next steps.
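If you prefer to pull the same information programmatically instead of clicking through the portal, something along these lines should work with the azure-mgmt-monitor and azure-identity packages; the subscription ID, resource group name, and time window below are placeholders.

from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

subscription_id = "<subscription-guid>"   # placeholder
resource_group = "<lab-resource-group>"   # placeholder

client = MonitorManagementClient(DefaultAzureCredential(), subscription_id)

# Activity log entries for the last day in the lab's resource group
since = (datetime.utcnow() - timedelta(days=1)).isoformat()
filter_str = f"eventTimestamp ge '{since}' and resourceGroupName eq '{resource_group}'"

for event in client.activity_logs.list(filter=filter_str):
    if event.status and event.status.value == "Failed":
        # event.properties usually carries the statusMessage JSON shown above
        print(event.operation_name.localized_value, event.properties)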
There's an ongoing outage for Azure Lab Services. Please follow updates here.

'Failed to encrypt sub-resource payload' error when attempting CI/CD

We are trying to set up CI/CD deployment with DevOps using the documentation provided here: https://learn.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment. We are using a shared IR (integration runtime) that has been set up in the target environment prior to deployment.
The release succeeds if the deployment mode setting is set to validation only, but fails when incremental or complete is selected. We get the following error when using override template parameters:
2018-09-21T17:07:43.2936188Z ##[error]BadRequest: {
"error": {
"code": "BadRequest",
"message": "Failed to encrypt sub-resource payload
Please make sure your shared IR is online when doing the deployment; otherwise you may hit this problem, because the self-hosted IR is used to encrypt your payload.
If you confirm the above has been done and you still get this error, please share the request activity ID with us and we can investigate further.
Make sure that you've entered the right connection string into your parameters JSON for any linked services you are using. This fixed the error for me although I don't have a full CI/CD environment with IR established.
I was able to solve it using Azure Key Vault.
I added the connection string as a secret.
In the connection string I also included the authentication data (username and password).
The limitation of this approach is that you lose the ability to pass parameters, for example dynamic values such as the database name or the user.
I would suggest looking into the connection string of the linked service to which you have attached the IR. For my Azure SQL based linked service I had to use something like the following; a plain server name would not suffice, and otherwise you will get "message": "Failed to encrypt sub-resource payload".
"typeProperties": {
"connectionString": "Integrated Security=False;Encrypt=True;Connection Timeout=30;Data Source=axxx-xxx-xx-xxxx.database.windows.net;Initial Catalog=\"#{split(linkedService().LS_ASQL_SERVERDB,';')[1]}\""
}
I had to override the parameters because the connection string was secure. Use dummy values for the username, password, and connection string if you don't have the original ones, and then deploy.
The IR already running doesn't make sense when doing a full deployment of an ADF instance. The IR key is generated within the ADF instance you deploy, so you've created circular logic: you cannot deploy the IR until the deployment of ADF is complete, but you can't complete the deployment of ADF until the IR is deployed.
So far our answer has been to let the ARM template fail at this point, which is after the IR registration in the template, so the IR key is then generated. We use that to deploy the IR, then re-run the template and it succeeds... it's stupid and hacky, and there has to be a saner way to do this than intentional failure/retry.

How can I programmatically (C#) read the autoscale settings for a WebApp?

I'm trying to build a small program to change the autoscale settings for our Azure WebApps, using the Microsoft.WindowsAzure.Management.Monitoring and Microsoft.WindowsAzure.Management.WebSites NuGet packages.
I have been roughly following the guide here.
However, we are interested in scaling WebApps / App Services rather than Cloud Services, so I am trying to use the same code to read the autoscale settings but providing a resource ID for our WebApp. I have already got the credentials required for making a connection (using a browser window popup for Active Directory authentication, but I understand we can use X.509 management certificates for non-interactive programs).
This is the request I'm trying to make. Credentials already established, and an exception is thrown earlier if they're not valid.
AutoscaleClient autoscaleClient = new AutoscaleClient(credentials);
var resourceId = AutoscaleResourceIdBuilder.BuildWebSiteResourceId(webspaceName: WebSpaceNames.NorthEuropeWebSpace, serverFarmName: "Default2");
AutoscaleSettingGetResponse get = autoscaleClient.Settings.Get(resourceId); // exception here
The WebApp (let's call it "MyWebApp") is part of an App Service Plan called "Default2" (Standard: 1 small), in a Resource Group called "WebDevResources", in the North Europe region. I expect that my problem is that I am using the wrong names to build the resourceId in the code - the naming conventions in the library don't map well onto what I can see in the Azure Portal.
I'm assuming that BuildWebSiteResourceId is the correct method to call, see MSDN documentation here.
However the two parameters it takes are webspaceName and serverFarmName, neither of which match anything in the Azure portal (or Google). I found another example which seemed to be using the WebApp's geo region for webSpaceName, so I've used the predefined value for North Europe where our app is hosted.
While trying to find the correct value for serverFarmName in the Azure Portal, I found the Resource ID for the App Service Plan, which looks like this:
/subscriptions/{subscription-guid}/resourceGroups/WebDevResources/providers/Microsoft.Web/serverfarms/Default2
That resource ID isn't valid for the call I'm trying to make, but it does support the idea that a 'serverfarm' is the same as an App Service Plan.
When I run the code, regardless of whether the resourceId parameters seem to be correct or garbage, I get this error response:
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">
{"Code":"SettingNotFound","Message":"Could not find the autoscale settings."}
</string>
So, how can I construct the correct resource ID for my WebApp or App Service Plan? Or alternatively, is there a different tree I should be barking up to programmatically manage WebApp scaling?
Update:
The solution below got the info I wanted. I also found the Azure resource explorer at resources.azure.com extremely useful to browse existing resources and find the correct names. For example, the name for my autoscale settings is actually "Default2-WebDevResources", i.e. "{AppServicePlan}-{ResourceGroup}" which I wouldn't have expected.
There is a preview service https://resources.azure.com/ where you can inspect all your resources easily. If you search for autoscale in the UI you will easily find the settings for your resource. It will also show you how to call the relevant REST API endpoint to read or update that resource.
It's a great tool for revealing a lot of details for your deployed resources and it will actually give you an ARM template stub for the resource you are looking at.
And to answer your question, you could programmatically call the REST API from a client with updated settings for autoscale. The REST API is one way of doing this, the SDK another and PowerShell a third.
The guide which you're following is based on the Azure Service Management model, aka Classic mode, which is deprecated and exists mainly for backwards compatibility.
You should use the latest Microsoft.Azure.Insights NuGet package for getting the autoscale settings.
Sample code using the nuget above is as below:
using Microsoft.Azure.Management.Insights;
using Microsoft.Rest;
//... Get necessary values for the required parameters
var client = new InsightsManagementClient(new TokenCredentials(token));
client.AutoscaleSettings.Get(resourceGroupName, autoScaleSettingName);
Besides, autoscaleSettings is a resource under the "Microsoft.Insights" provider, not under the "Microsoft.Web" provider, which explains why you are not able to find it with your serverfarm resourceId.
See the REST API Reference below for getting the autoscale settings.
GET
https://management.azure.com/subscriptions/{subscription-id}/resourceGroups/{resource-group-name}/providers/microsoft.insights/autoscaleSettings/{autoscale-setting-name}?api-version={api-version}
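For completeness, since the REST API is one of the options mentioned above, here is a rough sketch of the same GET from Python using requests. The subscription ID, setting name, and bearer token are placeholders, and the api-version may differ depending on what your subscription supports.

import requests

subscription_id = "<subscription-guid>"    # placeholder
resource_group = "WebDevResources"
setting_name = "Default2-WebDevResources"  # "{AppServicePlan}-{ResourceGroup}", as noted in the update above
token = "<aad-bearer-token>"               # placeholder: acquire via Azure AD

url = (f"https://management.azure.com/subscriptions/{subscription_id}"
       f"/resourceGroups/{resource_group}"
       f"/providers/microsoft.insights/autoscaleSettings/{setting_name}")

resp = requests.get(url,
                    headers={"Authorization": f"Bearer {token}"},
                    params={"api-version": "2015-04-01"})
resp.raise_for_status()
print(resp.json())  # the autoscale setting resource, including profiles and rules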
