Azure Machine Learning Execute Pipeline Configuration to pass input data - azure-machine-learning-service

I would like to create a Synapse pipeline for batch inferencing: ingest data into the data lake, use that data as input to call a batch endpoint that has already been created (via the Machine Learning Execute Pipeline activity), and then capture the output back into the data lake (appended to a table) for the next steps.
The Microsoft documentation for setting up this scenario is very poor, and everything I have tried has failed.
Below is the Azure Machine Learning Execute Pipeline configuration. I need to pass the value for dataset_param using a data asset instance that is already available, as shown below.
However, the activity complains that dataset_param is not provided, and I am not sure how to pass this value.
Here is the original experiment / pipeline / endpoint created by the DevOps pipeline. The Synapse pipeline above simply calls this endpoint.
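For context, when the published AML pipeline is authored with the v1 (azureml-core) SDK, a dataset parameter such as dataset_param is usually exposed as a PipelineParameter wrapped in a DatasetConsumptionConfig, so that callers (including the Machine Learning Execute Pipeline activity) can override it at invocation time. A minimal sketch under that assumption - the dataset name, script, and compute target below are placeholders:
```python
from azureml.core import Workspace, Dataset
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# Placeholder: a dataset / data asset already registered in the workspace.
default_ds = Dataset.get_by_name(ws, name="batch-input")

# Expose the dataset as an overridable pipeline parameter named "dataset_param"
# so the caller can substitute a different dataset when invoking the pipeline.
dataset_param = PipelineParameter(name="dataset_param", default_value=default_ds)
dataset_input = DatasetConsumptionConfig("dataset_input", dataset_param).as_mount()

score_step = PythonScriptStep(
    name="batch-score",
    script_name="score.py",            # hypothetical scoring script
    source_directory="./scripts",
    arguments=["--input", dataset_input],
    inputs=[dataset_input],
    compute_target="cpu-cluster",      # hypothetical compute target
)
```
If the parameter is defined this way in the published pipeline, the value supplied from the Synapse activity should resolve against it; if the published pipeline does not declare the parameter at all, no value passed from Synapse will be accepted.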

Related

Iterating through the parameters in csv file and run pipeline with each parameter

Hi, I have a scenario where I have a CSV file in Azure Data Lake Storage. While running an Azure pipeline, the parameters from that file have to be picked up one by one in an iterative manner, and for each parameter a Databricks notebook should be run.
Is there a solution for this - how do I iterate through the values in the CSV file?
If you are in Azure, you should consider Azure Data Factory (ADF) or Azure Synapse Analytics, both of which have pipelines. Both are good for moving data from place to place and for data orchestration. For example, you could have an ADF pipeline with a Lookup activity that reads your .csv and then calls a ForEach activity with a parameterised Databricks notebook inside.
Interestingly, the ForEach activity runs in parallel, so it could deal with multiple lines at once, depending on your Databricks cluster size etc.
You could try to do all of this within a single Databricks notebook, which I'm sure is possible, but I would say that is a more code-heavy approach, and you still have questions around scheduling, doing tasks in parallel, orchestration etc.
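To make concrete what flows where in that Lookup -> ForEach -> notebook pattern, here is a rough Python equivalent of the same loop, assuming the notebook is wrapped in a Databricks job and using the Jobs 2.1 run-now API; the file name, job id and environment variables are placeholders:
```python
import csv
import os
import requests

# Rough equivalent of Lookup (read the CSV) + ForEach (one notebook run per row).
host = os.environ["DATABRICKS_HOST"]          # e.g. https://adb-xxxx.azuredatabricks.net
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

with open("parameters.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Each CSV row becomes the notebook parameters for one job run.
        resp = requests.post(
            f"{host}/api/2.1/jobs/run-now",
            headers=headers,
            json={"job_id": 1234, "notebook_params": row},  # 1234 is a placeholder job id
        )
        resp.raise_for_status()
        print("started run", resp.json()["run_id"])
```
In ADF the same mapping is declarative: the Lookup output becomes the ForEach items, and each item's fields are passed as base parameters to the Databricks Notebook activity.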

azure data factory pipeline and activities Monitor

I have about 120 pipelines with almost 400 activities altogether, and I would like to log them in our data lake storage so we can report on their performance using Power BI. I came across How to get output parameter from Executed Pipeline in ADF?, but that seems to work with a single pipeline; I am wondering if I could get all the pipelines in my ADF, and their activities, in one single call.
Thanks
The sources in these pipelines presumably vary, which makes it difficult to apply a single monitoring logic across all of them.
One way is to store the logs individually for each pipeline by running queries with pipeline parameters; refer to Option 2 in this tutorial.
That said, the most feasible and appropriate way to monitor ADF pipelines and activities is to use Azure Data Factory Analytics.
This solution provides a summary of the overall health of your Data Factory, with options to drill into details and to troubleshoot unexpected behavior patterns. With rich, out-of-the-box views you can get insights into key processing, including:
At-a-glance summary of data factory pipeline, activity and trigger runs
Ability to drill into data factory activity runs by type
Summary of data factory top pipeline, activity errors
Go to the Azure Marketplace, choose the Analytics filter, and search for Azure Data Factory Analytics (Preview).
Select Create and then create or select the Log Analytics Workspace.
Installing this solution creates a default set of views inside the workbooks section of the chosen Log Analytics workspace. As a result, the following metrics become enabled:
ADF Runs - 1) Pipeline Runs by Data Factory
ADF Runs - 2) Activity Runs by Data Factory
ADF Runs - 3) Trigger Runs by Data Factory
ADF Errors - 1) Top 10 Pipeline Errors by Data Factory
ADF Errors - 2) Top 10 Activity Errors by Data Factory
ADF Errors - 3) Top 10 Trigger Errors by Data Factory
ADF Statistics - 1) Activity Runs by Type
ADF Statistics - 2) Trigger Runs by Type
ADF Statistics - 3) Max Pipeline Runs Duration
You can visualize the preceding metrics, look at the queries behind these metrics, edit the queries, create alerts, and take other actions.
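If you prefer the programmatic route mentioned earlier (querying the run data yourself and landing it in the data lake, rather than relying on the workbook views), a sketch with the azure-mgmt-datafactory SDK could look like the following; the subscription, resource group and factory names are placeholders:
```python
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Time window for the query - here, the last 24 hours.
window = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow(),
)

# One call returns runs for every pipeline in the factory within the window.
runs = client.pipeline_runs.query_by_factory("<resource-group>", "<factory-name>", window)
for run in runs.value:
    # A second call per run fetches that run's activity runs.
    activities = client.activity_runs.query_by_pipeline_run(
        "<resource-group>", "<factory-name>", run.run_id, window
    )
    for act in activities.value:
        print(run.pipeline_name, act.activity_name, act.status, act.duration_in_ms)
```
The rows printed here could instead be written to a file in the data lake and picked up by Power BI, which covers the "all pipelines and activities in one pass" requirement without touching the individual pipelines.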

Using Azure Data Factory to ingest incoming data from a REST API

Is there a way to create an Azure ADF pipeline to ingest incoming POST requests? I have a gateway app (outside Azure) that publishes data via REST as it arrives from the application, and this data needs to be ingested into a data lake. I already use REST calls from another pipeline to pull data, but this basically needs to do the reverse: the data will be pushed, and I need to be constantly 'listening' for those calls.
Is this something an ADF pipeline should do, or are there other Azure components able to do it?
The previous comment is right and is one approach to get this working, but it would need a bit of coding (for the Azure Function).
An alternative solution that caters to your requirement uses Azure Logic Apps together with Azure Data Factory.
Step 1: Create an HTTP-triggered Logic App, which will be invoked by your gateway app; the data is posted to this REST-callable endpoint.
Step 2: Create an ADF pipeline with a parameter; this parameter holds the data that needs to be pushed to the data lake. It can be raw data and can be transformed in a step within the pipeline before being pushed to the data lake.
Step 3: Once the Logic App is triggered, simply use the Azure Data Factory actions to invoke the pipeline created in Step 2 and pass the posted data as a pipeline parameter.
That should be it - with this you can spin up a code-less solution.
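The Logic App's built-in Data Factory action handles Step 3 without any code; if you want to test the same run-with-parameter call outside the Logic App, a rough sketch with the azure-mgmt-datafactory Python SDK (the pipeline and parameter names are hypothetical) would be:
```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Start the pipeline from Step 2 and hand the posted payload to its parameter.
run = client.pipelines.create_run(
    "<resource-group>",
    "<factory-name>",
    "ingest_posted_data",                              # hypothetical pipeline name
    parameters={"payload": '{"example": "posted data"}'},
)
print("queued pipeline run:", run.run_id)
```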
If your outside application is already pushing via REST, why not have it make calls directly to the Data Lake REST APIs? This would cut out the middle steps and bring everything under your control.
Azure Data Factory is a batch data movement service. If you want to push the data over HTTP, you can implement a simple Azure Function to accept the data and write it to the Azure Data Lake.
See Azure Functions HTTP triggers and bindings overview
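As a sketch of that approach, an HTTP-triggered Azure Function (Python v1 programming model, with the HTTP binding defined in function.json) that accepts the POSTed payload and writes it to ADLS Gen2 could look roughly like this; the connection-string app setting and filesystem name are assumptions:
```python
import os
import uuid

import azure.functions as func
from azure.storage.filedatalake import DataLakeServiceClient


def main(req: func.HttpRequest) -> func.HttpResponse:
    body = req.get_body()
    if not body:
        return func.HttpResponse("Empty payload", status_code=400)

    # Connection string and filesystem name are assumed app settings / names.
    service = DataLakeServiceClient.from_connection_string(os.environ["ADLS_CONNECTION"])
    fs = service.get_file_system_client("raw")

    # Write each request to its own file; downstream jobs can merge or partition later.
    file_client = fs.get_file_client(f"gateway/{uuid.uuid4()}.json")
    file_client.upload_data(body, overwrite=True)

    return func.HttpResponse("Accepted", status_code=202)
```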

Manage Azure BlobStorage file append with ADF

I have an Azure Data Factory pipeline that stores data, with some transformation, by calling an Azure Data Flow.
The file name in blob storage should be the pipeline run ID.
The Copy activity has a 'Copy behavior' option, but I cannot find a related option on the sink transformation in a Data Flow.
Now I have a situation where the same Data Flow is called more than once within the same pipeline execution. Because of that, my file gets overwritten in the blob, but I want to append new data to the same file if it already exists.
For example, if the pipeline run ID is '9500d37b-70cc-4dfb-a351-3a0fa2475e32' and the Data Flow is called twice from that pipeline execution, then 9500d37b-70cc-4dfb-a351-3a0fa2475e32.csv ends up containing only the data from the second Data Flow run.
Data Flow doesn't support copyBehavior, which means it doesn't support merging or appending files.
Every time you call the Data Flow, it creates a new file '9500d37b-70cc-4dfb-a351-3a0fa2475e32.csv' and replaces the existing '9500d37b-70cc-4dfb-a351-3a0fa2475e32.csv'.
Hope this helps.
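If the run-id file really does need to accumulate data across Data Flow calls, one possible workaround outside Data Flow is to have each call write its own part file and then append the parts to the run-id file, for example from an Azure Function invoked by the pipeline after each Data Flow run. A rough sketch using the ADLS Gen2 append API, with the connection string and part-file path as placeholders:
```python
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder connection string, filesystem and paths.
service = DataLakeServiceClient.from_connection_string("<adls-connection-string>")
fs = service.get_file_system_client("output")

# The target file named after the pipeline run id (as in the question).
target = fs.get_file_client("9500d37b-70cc-4dfb-a351-3a0fa2475e32.csv")
if not target.exists():
    target.create_file()

# Read one Data Flow output part and append it at the current end of the target file.
part = fs.get_file_client("parts/part-00000.csv").download_file().readall()
offset = target.get_file_properties().size
target.append_data(part, offset=offset, length=len(part))
target.flush_data(offset + len(part))
```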

What service is used for triggering an Azure Pipeline that is assigned a Machine Learning task?

I have a model trained with SVM, with the dataset as a CSV uploaded as a blob to blob storage. How can I update the CSV, and how can that change be used to trigger the pipeline that re-trains the ML model?
If you mean triggering the build/release pipeline in Azure DevOps, then you need to set up CI/CD for the build/release pipeline. The pipeline will then be triggered when a new commit/changeset is pushed to the repository.
In your scenario it seems you stored the CSV file in blob storage rather than in a normal repository, so you cannot trigger the pipeline directly.
However, as a workaround, you can create a new build pipeline (e.g. Pipeline A) that runs commands/scripts in a command-line task to update the CSV file, and then use that build pipeline (Pipeline A) to trigger another pipeline (Pipeline B). Pipeline B will then be triggered when the CSV file has been updated successfully in Pipeline A.
I'm not familiar with Machine Learning, but the following articles may help:
Machine Learning DevOps (MLOps) with Azure ML
Enabling CI/CD for Machine Learning project with Azure Pipelines
If you don't want the CSV upload to happen in a pipeline, you can write an Azure Function or Azure Logic App. Those can be triggered on changes or creations of blobs. Inside, you could make a REST call to either start your pipeline (see api-for-automating-azure-devops-pipelines) or retrain your model.
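As a rough sketch of that Function approach, a blob-triggered Azure Function (Python v1 programming model, with the blob binding configured in function.json) could queue the Azure DevOps pipeline via the Runs REST API; the organization, project, pipeline id and PAT app setting are placeholders:
```python
import base64
import os

import azure.functions as func
import requests


def main(blob: func.InputStream):
    """Runs whenever the CSV blob is created or updated; queues the DevOps pipeline."""
    org, project, pipeline_id = "<org>", "<project>", "<pipeline-id>"
    url = (
        f"https://dev.azure.com/{org}/{project}/_apis/pipelines/"
        f"{pipeline_id}/runs?api-version=7.1-preview.1"
    )

    # Personal access token stored as an app setting; basic auth with empty username.
    pat = os.environ["DEVOPS_PAT"]
    token = base64.b64encode(f":{pat}".encode()).decode()

    resp = requests.post(url, json={}, headers={"Authorization": f"Basic {token}"})
    resp.raise_for_status()
```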
