What service is used for triggering an Azure Pipeline that is assigned a Machine Learning task?

I have a model trained with SVM on a dataset stored as a CSV blob in Azure Blob Storage. How can I update the CSV, and how can that change be used to trigger the pipeline that retrains the ML model?

If you mean triggering a build/release pipeline in Azure DevOps, you need to enable CI/CD triggers for that pipeline. The pipeline is then triggered whenever a new commit/changeset is pushed to the repository.
In your scenario, however, the CSV file is stored in blob storage rather than in a normal repository, so you cannot trigger the pipeline directly from the file change.
As a workaround, you can create a new build pipeline (e.g. Pipeline A) that updates the CSV file from a command line task, and have Pipeline A trigger another pipeline (e.g. Pipeline B), for example via the REST call sketched below. Pipeline B is then triggered whenever Pipeline A has updated the CSV file successfully.
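As a rough sketch of that last step, Pipeline A could queue Pipeline B through the Azure DevOps pipeline Runs REST API from a script task. Everything below is an assumption for illustration: the organization, project, and pipeline ID are placeholders, and SYSTEM_ACCESSTOKEN must be mapped from $(System.AccessToken) in the task definition (a personal access token works as well).

```python
# Sketch only: queue "Pipeline B" from a script step in "Pipeline A".
# ORG, PROJECT and PIPELINE_B_ID are hypothetical placeholders; SYSTEM_ACCESSTOKEN
# is expected to be mapped from $(System.AccessToken) in the task's env section.
import os
import requests

ORG = "my-org"
PROJECT = "my-project"
PIPELINE_B_ID = 42  # definition ID of Pipeline B

url = (
    f"https://dev.azure.com/{ORG}/{PROJECT}"
    f"/_apis/pipelines/{PIPELINE_B_ID}/runs?api-version=7.1-preview.1"
)

# Basic auth with an empty user name and the job access token as the password.
response = requests.post(url, json={}, auth=("", os.environ["SYSTEM_ACCESSTOKEN"]))
response.raise_for_status()
print("Queued Pipeline B, run id:", response.json().get("id"))
```

Note that the identity running the job (or the personal access token used instead) needs permission to queue runs of Pipeline B.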
I'm not familiar with Machine Learning, but the following articles may help:
Machine Learning DevOps (MLOps) with Azure ML
Enabling CI/CD for Machine Learning project with Azure Pipelines

If you don't want the CSV upload to happen in a pipeline, you can write an Azure Function or an Azure Logic App. Both can be triggered when a blob is created or changed. Inside, you could make a REST call to start your pipeline (see api-for-automating-azure-devops-pipelines) or retrain your model directly.
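For the Azure Function route, a minimal sketch using the Python v2 programming model could look like the following. It posts to the same pipeline Runs endpoint as the earlier sketch; the container path and the AZDO_* app settings are hypothetical placeholders, not something from the original answer.

```python
# Sketch: blob-triggered Azure Function (Python v2 programming model) that starts
# an Azure DevOps pipeline whenever the training CSV is created or updated.
import logging
import os

import azure.functions as func
import requests  # would need to be listed in the function app's requirements.txt

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob",
                  path="training-data/{name}",   # hypothetical container/path
                  connection="AzureWebJobsStorage")
def on_training_csv_changed(blob: func.InputStream):
    logging.info("Blob changed: %s (%s bytes)", blob.name, blob.length)

    url = (
        f"https://dev.azure.com/{os.environ['AZDO_ORG']}/{os.environ['AZDO_PROJECT']}"
        f"/_apis/pipelines/{os.environ['AZDO_PIPELINE_ID']}/runs?api-version=7.1-preview.1"
    )
    response = requests.post(url, json={}, auth=("", os.environ["AZDO_PAT"]))
    response.raise_for_status()
```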

Related

Azure Machine Learning Execute Pipeline Configuration to pass input data

I would like to create a Synapse pipeline for batch inferencing: ingest data into the data lake, use it as input to call a batch endpoint that was already created (through the Machine Learning Execute Pipeline activity), and then capture the output back into the data lake (appended to a table) to continue with the next steps.
Microsoft's documentation for setting up such a scenario is very sparse, and everything I have tried has failed.
Below is the Azure Machine Learning Execute Pipeline configuration. I need to pass a value for dataset_param using a data asset instance that is already available, as shown below.
But it complains that dataset_param is not provided, and I am not sure how to pass this value.
Here is the original experiment / pipeline / endpoint created by the DevOps pipeline. I just call this endpoint from the Synapse pipeline.

Running Code in Azure Repo from Azure Data Factory

Main problem:
I need to orchestrate the run of Python scripts using an Azure Data Factory pipeline.
What I have tried:
Databricks: The problem with this solution is that it is costly, a little slow (the need to spin up clusters), and it requires that I write my code in a notebook.
Batch activity from ADF: It, too, is costly and a little slow. I don't have to write my code in a notebook, but I do have to manually put it in a storage account, which is not great when debugging or updating.
My question:
Is there a way to run code from an Azure Repos (or GitHub) repository directly from Data Factory? Something like the Batch activity, but reading the code from the repository itself instead of from a storage account?
Thanks for your help
Based on the document "Pipelines and activities in Azure Data Factory", Azure Repos Git and GitHub repositories are not supported as source or sink data stores for ADF pipelines. So it is not possible to run code directly from a Git repository in an ADF pipeline.
However, ADF has a source control option that lets you configure a Git repository with either Azure Repos or GitHub. You can then configure CI/CD pipelines in Azure DevOps that integrate with ADF, and those CI/CD pipelines can run code directly from the Git repository.
For more details, you can see the document "CI/CD in ADF".

How to export files generated to Azure DevOps from Azure Databricks after a job terminates?

We are using Azure DevOps to submit a training job to Databricks. The training job uses a notebook to train a machine learning model, and we use the Databricks CLI to submit the job from Azure DevOps.
In one of the steps in the notebook, we create a .pkl file. We want to download this to the build agent and publish it as an artifact in Azure DevOps. How do we do this?
It really depends on how that file is stored:
If it is just saved on DBFS, you can use databricks fs cp 'dbfs:/....' local-path
If the file is stored on the local file system of the driver, copy it to DBFS first (for example, with dbutils.fs.cp, as in the sketch below) and then use the previous item
If the model is tracked by MLflow, you can either explicitly export the model to DBFS via the MLflow API (or REST API) (you can export it to DevOps directly as well, you just need the correct credentials, etc.), or use this tool to export models/experiments/runs to local disk
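A minimal notebook-side sketch of the second and third options, assuming the notebook wrote the pickle to /tmp/model.pkl (the paths are hypothetical); once the file is on DBFS, the build agent can pull it with the databricks fs cp command above and publish it as a pipeline artifact.

```python
# Inside the Databricks notebook; dbutils and mlflow are provided by the
# Databricks runtime. All paths below are hypothetical placeholders.
import mlflow

local_path = "/tmp/model.pkl"                     # where the notebook wrote the pickle
dbfs_path = "dbfs:/FileStore/models/model.pkl"

# Option 1: copy from the driver's local disk to DBFS, then on the build agent run:
#   databricks fs cp dbfs:/FileStore/models/model.pkl ./model.pkl
dbutils.fs.cp(f"file:{local_path}", dbfs_path)

# Option 2: attach the file to an MLflow run so it can be exported later.
with mlflow.start_run():
    mlflow.log_artifact(local_path)
```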

How to use the same pipeline in different environments with a varying number of customers inside Azure Data Factory?

I have a Copy Data pipeline in Azure Data Factory. I need to deploy the same Data Factory instance to multiple environments such as DEV, QA, and PROD using a release pipeline.
The pipeline transfers data from a customer's storage account (blob container) to a centralized data lake, so it is a many-to-one flow (many customers > one data lake).
Suppose I am in the DEV environment and have one demo customer there, with an ADF Copy Data pipeline defined for it. In the PROD environment the number of customers will grow, and I don't want to create multiple copies of the same pipeline in the production Data Factory.
I am looking for a solution that lets me keep a single copy pipeline in the Data Factory and deploy/promote the same Data Factory from one environment to the other, even when the number of customers varies between environments.
I am also doing CI/CD in Azure Data Factory using Git integration with Azure Repos.
You will have to create the additional linked services and datasets that do not exist in the non-production environment, so that every new "customer" storage account is mapped to the pipeline instance.
With CI/CD routines you can deliver this incrementally, i.e. parameterize your release pipeline with variable groups and update the Data Factory instance with newer pipelines and new datasets/linked services.
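As an illustration of keeping a single pipeline while the customer list grows, one option is to parameterize the Copy Data pipeline (for example, source storage account and container as pipeline parameters) and start it once per customer. This is only a sketch with the azure-mgmt-datafactory SDK under assumed names; in the setup described above, the per-environment values would come from the release pipeline's variable groups rather than a hard-coded list.

```python
# Sketch: run one parameterized ADF pipeline once per customer.
# All names (subscription, resource group, factory, pipeline, parameters,
# customer list) are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-data-platform"
FACTORY_NAME = "adf-central"
PIPELINE_NAME = "CopyCustomerToDataLake"

# In DEV this list might contain a single demo customer; in PROD it grows.
customers = [
    {"storage_account": "customer1blob", "container": "exports"},
    {"storage_account": "customer2blob", "container": "exports"},
]

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

for customer in customers:
    run = client.pipelines.create_run(
        RESOURCE_GROUP,
        FACTORY_NAME,
        PIPELINE_NAME,
        parameters={
            "sourceStorageAccount": customer["storage_account"],
            "sourceContainer": customer["container"],
        },
    )
    print(f"Started run {run.run_id} for {customer['storage_account']}")
```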

Use of Azure Event Grid to trigger an ADF pipeline to move on-premises CSV files to an Azure database

We have a series of CSV files landing every day (a daily delta) that need to be loaded into an Azure database using Azure Data Factory (ADF). We have created an ADF pipeline that moves data straight from an on-premises folder to an Azure DB table, and it is working.
Now we need this pipeline to run based on an event rather than on a schedule: specifically, on the creation of a specific file in the same local folder. This file is created when the daily delta files have finished landing. Let's call it SRManifest.csv.
The question is: how do we create a trigger to start the pipeline when SRManifest.csv is created? I have looked into Azure Event Grid, but it does not seem to work with on-premises folders.
You're right that you cannot configure an Event Grid trigger to watch local files, since you're not writing to Azure Storage. You'd need to generate your own signal after writing your local file content.
Aside from timer-based triggers, event-based triggers in ADF are tied to Azure Storage, so the only way to use them here would be to drop some type of "signal" file in a well-known storage location, after your files are written locally, to trigger your ADF pipeline to run.
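A minimal sketch of that "signal file" idea, assuming a storage event trigger in ADF is configured to watch the target container; the connection string, container, and file paths are hypothetical, and the script would run on-premises once the daily delta files have landed.

```python
# Sketch: after the local CSV files have landed, upload SRManifest.csv to a
# well-known container so a storage event trigger can start the ADF pipeline.
# Connection string, container, and paths are hypothetical placeholders.
import os

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(os.environ["STORAGE_CONNECTION_STRING"])
signal_blob = service.get_blob_client(container="signals", blob="SRManifest.csv")

with open(r"C:\daily-delta\SRManifest.csv", "rb") as data:
    # The resulting BlobCreated event is what fires the ADF trigger.
    signal_blob.upload_blob(data, overwrite=True)
```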
Alternatively, you can trigger an ADF pipeline programmatically (.NET and Python SDKs support this; maybe other ones do as well, plus there's a REST API). Again, you'd have to build this, and run your trigger program after your local content has been created. If you don't want to write a program, you can use PowerShell (via Invoke-AzDataFactoryV2Pipeline).
There are other tools/services that integrate with Data Factory as well; I wasn't attempting to provide an exhaustive list.
Have a look at the Azure Logic Apps File System connector triggers. More details here.
