Running Code in Azure Repo from Azure Data Factory

Main problem:
I need to orchestrate the run of Python scripts using an Azure Data Factory pipeline.
What I have tried:
Databricks: The problem with this solution is that it is costly, a little slow (clusters need to spin up), and it requires me to write my code in a notebook.
Batch activity from ADF: It too is costly and a little slow. I don't have to write my code in a notebook, but I do have to manually put it in a storage account, which is not great when debugging or updating.
My question:
Is there a way to run code from an Azure repo (or GitHub repo) directly from Data Factory? Like the Batch activity, but instead of reading the code from a storage account, reading it from the repo itself?
Thanks for your help

Based on the statement in the document "Pipelines and activities in Azure Data Factory", Azure Repos and GitHub repositories are not supported as source or sink data stores for ADF pipelines. So it is not possible to run code from a Git repository directly in an ADF pipeline.
However, ADF has a source control option that lets you configure a Git repository with either Azure Repos or GitHub. You can then set up CI/CD pipelines in Azure DevOps that integrate with ADF, and those CI/CD pipelines can run code directly from the Git repository.
For more details, see the document "CI/CD in ADF".
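To illustrate the "integrate with ADF" part, here is a minimal sketch of my own (not from the original answer) of how a CI/CD job could kick off an ADF pipeline run with the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory, and pipeline names are placeholders.

# Sketch only: trigger an ADF pipeline run from a CI/CD job.
# Assumes azure-identity and azure-mgmt-datafactory are installed and that the
# identity behind DefaultAzureCredential has permissions on the Data Factory.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<subscription-id>"   # placeholder
resource_group = "<resource-group>"     # placeholder
factory_name = "<data-factory-name>"    # placeholder

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Kick off the pipeline that wraps your Python scripts; parameters are optional.
run = adf_client.pipelines.create_run(
    resource_group, factory_name, "<pipeline-name>", parameters={}
)
print(f"Started pipeline run: {run.run_id}")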

Related

Clone pipelines from one ADF to another

I need to clone a copy of existing pipelines (pipeline count: 10-20) from one subscription to another subscription's ADF. Is there any way to do this using Azure DevOps?
Option1:
Using Git configuration, you can publish the data factory to a Git branch. Connect your new data factory to the same repository and build from that branch. Resources such as pipelines, datasets, and triggers will carry over. You can then delete the pipelines that are not used.
Option2:
You can manually copy the JSON code of each pipeline, dataset, and linked service and use the same code in the new data factory. (Use the same names when creating the pipelines/datasets/linked services.)
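If you would rather script Option 2 than copy-paste JSON by hand, the following is a rough sketch of my own (not part of the original answer) using the azure-mgmt-datafactory Python SDK; all names are placeholders, and the linked services and datasets referenced by the pipelines would need the same treatment first.

# Sketch: copy pipeline definitions from one factory/subscription to another.
# Linked services and datasets referenced by the pipelines must exist in the
# target factory before these pipelines will validate.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

cred = DefaultAzureCredential()
source = DataFactoryManagementClient(cred, "<source-subscription-id>")  # placeholder
target = DataFactoryManagementClient(cred, "<target-subscription-id>")  # placeholder

SRC_RG, SRC_DF = "<source-rg>", "<source-factory>"  # placeholders
DST_RG, DST_DF = "<target-rg>", "<target-factory>"  # placeholders

for pipeline in source.pipelines.list_by_factory(SRC_RG, SRC_DF):
    # Re-create each pipeline under the same name in the target factory.
    target.pipelines.create_or_update(DST_RG, DST_DF, pipeline.name, pipeline)
    print(f"Copied pipeline: {pipeline.name}")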

How to export files generated to Azure DevOps from Azure Databricks after a job terminates?

We are using Azure DevOps to submit a training job to Databricks. The training job uses a notebook to train a machine learning model. We are using the Databricks CLI to submit the job from ADO.
In the notebook, in one of the steps, we create a .pkl file. We want to download this to the build agent and publish it as an artifact in Azure DevOps. How do we do this?
It really depends on how that file is stored:
If it is just saved on DBFS, you can use databricks fs cp 'dbfs:/....' local-path on the build agent.
If the file is stored on the local file system of the driver, copy it to DBFS first (for example, by using dbutils.fs.cp, as in the sketch after this list), and then use the previous item.
If the model is tracked by MLflow, then you can either explicitly export the model to DBFS via the MLflow API (or REST API) (you can push it to DevOps directly as well, you just need the correct credentials, etc.) or use this tool to export models/experiments/runs to local disk.
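As a minimal sketch of the first two items (my addition, with placeholder paths): the notebook copies the locally written .pkl onto DBFS, and the build agent then pulls it down with the Databricks CLI and publishes it.

# Inside the Databricks notebook: copy the locally written .pkl onto DBFS.
# Paths below are placeholders for illustration.
dbutils.fs.cp("file:/tmp/model.pkl", "dbfs:/models/model.pkl")

# Then, on the Azure DevOps build agent (a shell step, shown here as comments):
#   databricks fs cp dbfs:/models/model.pkl $(Build.ArtifactStagingDirectory)/model.pkl
# followed by a PublishBuildArtifacts task to publish the file as an artifact.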

Azure Data Factory, How get output from scala (jar job)?

We have an Azure Data Factory pipeline, and one step is a jar job that should return output used in the next steps.
It is possible to get output from a notebook with dbutils.notebook.exit(....).
I need a similar feature to retrieve output from the main class of a jar.
Thanks!
Image of my pipeline
Actually, as far as I know there is no built-in feature to execute a jar job directly. However, you could implement it easily with the Azure Databricks service.
There are two ways in the Azure Databricks workspace:
If your jar is an executable jar, just use Set JAR, which lets you set the main class and parameters.
Alternatively, you could use a notebook that executes dbutils.notebook.exit(....) or something similar.
Back in ADF: ADF has a Databricks activity, and you can get its output for the next steps. Any concerns, please let me know.
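To make the notebook route concrete, here is a minimal sketch of my own (the library call is hypothetical) of a wrapper notebook that runs the jar's logic and hands a result back to ADF; in ADF the value is then available to later activities as @activity('<Notebook activity name>').output.runOutput.

# Wrapper notebook (Databricks): call into the jar/library attached to the
# cluster, collect whatever the next ADF steps need, and return it as a string.
import json

# Hypothetical call into your jar's entry point; replace with your real class/method.
# result = spark._jvm.com.example.Main.run()  # illustration only
result = {"status": "ok", "outputPath": "dbfs:/tmp/output"}  # placeholder values

# dbutils.notebook.exit only returns strings, so serialize structured data as JSON.
dbutils.notebook.exit(json.dumps(result))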
Updates:
As far as I know, there is no feature similar to dbutils.notebook.exit(....) in the Jar activity. So far I can only offer a workaround: inside the jar execution, store the parameters in a specific file that resides in, for example, blob storage. Then use a Lookup activity after the jar activity to get the params for the next steps (see the sketch below).
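The jar itself would do this in Scala, but to show the shape of the workaround, here is a rough Python equivalent of my own (paths and values are placeholders) that writes a small JSON file to a blob-backed DBFS mount for a Lookup activity to read:

# Sketch of the workaround: persist the "output" as a small JSON file in blob
# storage (here via a DBFS mount); an ADF Lookup activity reads it afterwards.
import json

params = {"rowsProcessed": 1234, "status": "succeeded"}  # placeholder values

# "/mnt/output" is assumed to be a mount point backed by the blob container
# that the Lookup activity's dataset points at.
dbutils.fs.put("/mnt/output/params.json", json.dumps(params), True)  # True = overwrite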
Updates at 1.21.2020
Got some updates from MSFT in the github link: https://github.com/MicrosoftDocs/azure-docs/issues/46347
Sending output is a feature that only notebooks support for notebook workflows and not jar or python executions in databricks. This should be a feature ask for databricks and only then ADF can support it. I would recommend you to submit this as a product feedback on Azure Databricks feedback forum.
It seems that output from a jar execution is not supported by Azure Databricks, and ADF naturally only supports features that Azure Databricks exposes. You could push for progress by contacting the Azure Databricks team. I have shared all I know here.

What service is used for triggering an Azure Pipeline that is assigned a Machine Learning task?

I have an SVM model trained on a dataset stored as a CSV blob in blob storage. How can I update the CSV, and how can the change be used to trigger the pipeline that retrains the ML model?
If you mean triggering the build/release pipeline in Azure DevOps, then you need to set up CI/CD triggers for the build/release pipeline. The pipeline will then be triggered when a new commit/changeset is pushed to the repository.
In your scenario, it seems you stored the CSV file in blob storage rather than in a normal repository, so you cannot trigger the pipeline directly.
However, as a workaround you can create a new build pipeline (e.g. Pipeline A) that runs commands/scripts in a command-line task to update the CSV file, and then have Pipeline A trigger another pipeline (e.g. Pipeline B). Pipeline B will then be triggered whenever Pipeline A updates the CSV file successfully (a sketch of the upload step follows).
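As a hedged sketch of the "update the CSV file" step in Pipeline A (my addition; the connection string variable, container, and file names are placeholders), the command-line task could run a small script like this using the azure-storage-blob package:

# Sketch: upload/overwrite the training CSV in blob storage from Pipeline A.
# Requires the azure-storage-blob package; all names below are placeholders.
import os
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(os.environ["STORAGE_CONNECTION_STRING"])
blob = service.get_blob_client(container="training-data", blob="dataset.csv")

with open("dataset.csv", "rb") as data:
    blob.upload_blob(data, overwrite=True)

print("CSV updated; Pipeline B can now be triggered.")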
I'm not familiar with machine learning, but the following articles may help: "Machine Learning DevOps (MLOps) with Azure ML" and "Enabling CI/CD for Machine Learning projects with Azure Pipelines".
If you don't want the CSV upload to happen in a pipeline, you can write an Azure Function or an Azure Logic App. Both can be triggered on the creation or change of blobs. Inside, you could make a REST call to either start your pipeline (see api-for-automating-azure-devops-pipelines) or retrain your model. A sketch of the function approach follows.
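For the Azure Function route, here is a rough sketch of my own (organization, project, pipeline id, and the PAT setting are placeholders, the blob-trigger binding configuration is omitted, and the exact REST endpoint should be checked against the linked article) of a blob-triggered Python function that queues an Azure DevOps pipeline run:

# Sketch: blob-triggered Azure Function (Python) that queues an Azure DevOps
# pipeline run via the REST API whenever the CSV blob is created or updated.
# The blob-trigger binding (function.json) is not shown here.
import base64
import os

import requests
import azure.functions as func


def main(myblob: func.InputStream):
    organization = "<org>"    # placeholder
    project = "<project>"     # placeholder
    pipeline_id = 42          # placeholder pipeline (definition) id

    # Personal access token stored in the Function App settings (assumption).
    pat = os.environ["AZURE_DEVOPS_PAT"]
    auth = base64.b64encode(f":{pat}".encode()).decode()

    url = (f"https://dev.azure.com/{organization}/{project}"
           f"/_apis/pipelines/{pipeline_id}/runs?api-version=6.0-preview.1")
    response = requests.post(url, json={}, headers={"Authorization": f"Basic {auth}"})
    response.raise_for_status()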

How to Migrate Azure Data Factory (v2) Jobs to another Data Factory (v2)

How can I move all existing jobs to another Azure Data Factory?
I am trying to move existing jobs from one Data Factory to another but am not able to find a solution. Any suggestions, please?
As far as I know there is no easy import/export facility.
I recommend connecting your Data Factories to source control (Git). You can then copy and paste the JSON definitions between the two repos using a text editor.
For propagating pipelines between environments, you can look into the documentation for CI/CD in Azure Data Factory.
