Azure Databricks CI/CD pipeline to delete notebooks in production

I have a CI/CD pipeline in place to deploy notebooks from dev to production in an Azure Databricks workspace.
However, it does not delete notebooks from production when those notebooks have been removed from development and are no longer in the Azure git repository.
I want to delete all notebooks that have been removed from source as part of the build/release process.
Is there a way to achieve this?

The easiest way: when there are new commits in the Azure DevOps git repository, you could redeploy the notebooks with the Clean Workspace Folder option checked.
Otherwise, you could add a PowerShell script task to compare the files in the two folders. The following case may give you a start: Comparing folders and content with PowerShell
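If you'd rather script the comparison yourself, here is a rough sketch of the same idea in Python against the Databricks Workspace API: list what is deployed, compare it with the checked-out source, and delete anything that no longer exists in source. The workspace URL, token variable and folder paths are placeholders, and the 1:1 notebook-to-file mapping is an assumption, so adapt it to your layout.

```python
import os
import requests

# Illustrative values - replace with your own workspace URL, PAT and paths.
HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = os.environ["DATABRICKS_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

SOURCE_DIR = "notebooks"      # checked-out repo folder on the build agent
TARGET_DIR = "/Production"    # workspace folder the pipeline deploys into

# Notebooks that still exist in source control (without extension, as Databricks stores them).
source_notebooks = {
    os.path.splitext(f)[0]
    for f in os.listdir(SOURCE_DIR)
    if f.endswith((".py", ".scala", ".sql", ".r"))
}

# Notebooks currently present in the production workspace folder.
resp = requests.get(f"{HOST}/api/2.0/workspace/list",
                    headers=HEADERS, params={"path": TARGET_DIR})
resp.raise_for_status()
deployed = [o for o in resp.json().get("objects", []) if o["object_type"] == "NOTEBOOK"]

# Delete anything deployed that is no longer in source.
for obj in deployed:
    name = obj["path"].rsplit("/", 1)[-1]
    if name not in source_notebooks:
        requests.post(f"{HOST}/api/2.0/workspace/delete",
                      headers=HEADERS,
                      json={"path": obj["path"], "recursive": False}).raise_for_status()
        print(f"Deleted {obj['path']}")
```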

Related

How can I do CI/CD of a Databricks notebook in Azure DevOps?

I want to do CI/CD of my Databricks notebook. Steps I followed:
I have integrated my Databricks with Azure Repos.
Created a build artifact using a YAML script which will hold my notebook.
Deployed the build artifact into the Databricks workspace in YAML.
Now I want to:
Execute and schedule the Databricks notebook from the Azure DevOps pipeline itself.
How can I set up multiple environments like Stage, Dev, and Prod using YAML?
My notebook itself calls other notebooks - can I do this?
How can I solve this?
It's doable, and with Databricks Repos you really don't need to create a build artifact & deploy it - it's better to use the Repos API or the databricks repos CLI to update another checkout that will be used for tests.
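As a sketch of what "update another checkout" can look like against the Repos API (PATCH /api/2.0/repos/{repo_id}) - the workspace URL, repo ID and branch name below are placeholders:

```python
import os
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
TOKEN = os.environ["DATABRICKS_TOKEN"]
REPO_ID = "123456"  # ID of the staging checkout, e.g. taken from GET /api/2.0/repos

# Pull the staging checkout to the latest commit of the branch under test.
resp = requests.patch(
    f"{HOST}/api/2.0/repos/{REPO_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"branch": "main"},
)
resp.raise_for_status()
print(resp.json())  # returns the repo object, including the new head_commit_id
```

The same update can usually be done from a shell task with the databricks CLI (databricks repos update) if you prefer not to call the REST API directly.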
For testing notebooks I always recommend the Nutter library from Microsoft, which simplifies notebook testing by letting you trigger their execution from the command line.
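A minimal Nutter fixture, as far as I understand its run_/assertion_ naming convention - the notebook path, table name and timeout are illustrative only:

```python
# Runs inside a Databricks notebook; `dbutils` and `spark` are provided by the runtime.
from runtime.nutterfixture import NutterFixture


class MyDataJobTest(NutterFixture):
    def run_row_count(self):
        # Execute the notebook under test (relative path and timeout are illustrative).
        dbutils.notebook.run("./my_data_job", 600)

    def assertion_row_count(self):
        count = spark.sql("SELECT COUNT(*) AS c FROM my_output_table").first()["c"]
        assert count > 0


result = MyDataJobTest().execute_tests()
print(result.to_string())
# When triggered from the Nutter CLI in a DevOps pipeline, surface the result:
result.exit(dbutils)
```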
You can include other notebooks using the %run directive - it's important to use relative paths instead of absolute paths. You can organize dev/staging/prod either as folders inside Repos, or as fully separated environments - it's up to you.
I have a demo of notebook testing & Repos integration with CI/CD - it contains all the necessary instructions on how to set up dev/staging/prod, plus an Azure DevOps pipeline that will test the notebook & trigger the release pipeline.
The only thing that I want to mention explicitly: for Azure DevOps you will need to use an Azure DevOps personal access token, because identity passthrough doesn't work with the APIs yet.
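For reference, registering that Azure DevOps PAT as the workspace's Git credential can be scripted as well; a sketch against the Git Credentials API, where the provider string and placeholder values are my assumptions and worth double-checking against the docs:

```python
import os
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]
AZDO_PAT = os.environ["AZDO_PAT"]        # Azure DevOps personal access token
AZDO_USER = "build-user@example.com"     # illustrative DevOps user name

resp = requests.post(
    f"{HOST}/api/2.0/git-credentials",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json={
        "git_provider": "azureDevOpsServices",
        "git_username": AZDO_USER,
        "personal_access_token": AZDO_PAT,
    },
)
resp.raise_for_status()
```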

Automate deploying of Synapse artifacts to a DevOps repo

I'm trying to deploy some Synapse artifacts to a Synapse workspace with DevOps repo integration via a Python runbook. Using the azure-synapse-artifacts library of the Python Azure SDK, the artifacts are published directly to the live mode of the Synapse workspace. Is there any way to deploy artifacts to a DevOps repo branch for Synapse? I didn't find any DevOps repo APIs or libraries, just ones for the direct integration of Git.
We can use CI/CD in this case, as this process will help move entities from one environment to another; for this we need to configure our Synapse workspace as the source in Git.
Below are a few straightforward steps we can follow:
Set up the Azure Synapse workspace and configure a pipeline in Azure DevOps.
Under staging, while creating the DevOps project, we can select Add Artifacts and select GIT.
Configure the workflow file and add the workflow.
You can refer to the MS Docs for a detailed explanation of each step in achieving this task.

Configure Azure DevOps repository in Databricks through ARM template or PowerShell

I am looking for a sample ARM template which can set up my Azure DevOps repository in Azure Databricks. This will help me deploy my master branch directly to the ADB workspace.
I tried to do it manually in the portal and it works, but the Repos path for the notebooks shows my email_id, which is not good in production.
I want to configure this through PowerShell or an ARM template while creating Databricks. I am facing the same problem with Azure Data Factory as well.
Please help me resolve it.
It's not possible as of today - there is no API for creating a checkout. It will be possible only when Databricks Repos starts to provide a corresponding API for creating checkouts of repositories, not only the "Update checkout" API that is available right now.
If you're concerned about the checkout being created in your own folder, you can just create a folder inside Repos, call it something like "Production", and then do the checkout inside that folder (pictures are taken from my demo of Repos with Azure DevOps):
To deploy notebooks from your master branch to another workspace, I would recommend triggering a deployment pipeline from the master branch onto the target Databricks workspace.
That way, there is no need to set up Repos in the target environment.
You use Repos in your development workspace (with your email in the path).
You commit to the branch you work on and eventually merge / PR to master.
Once on the master branch, a DevOps pipeline is triggered and deploys the notebooks to your target workspace on the path you want (see the sketch below).
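A rough sketch of what that deployment pipeline step could do with the Workspace Import API, using the agent's checkout of master as the source; the workspace URL, folder names and the Python-only filter are placeholders:

```python
import base64
import os
import requests

HOST = "https://adb-target.azuredatabricks.net"  # placeholder target workspace URL
TOKEN = os.environ["DATABRICKS_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

SOURCE_DIR = "notebooks"     # folder in the master-branch checkout on the agent
TARGET_DIR = "/Production"   # target workspace folder (no Repos needed there)

# Make sure the target folder exists.
requests.post(f"{HOST}/api/2.0/workspace/mkdirs",
              headers=HEADERS, json={"path": TARGET_DIR}).raise_for_status()

for fname in os.listdir(SOURCE_DIR):
    if not fname.endswith(".py"):
        continue  # illustrative: only Python-source notebooks in this sketch
    with open(os.path.join(SOURCE_DIR, fname), "rb") as f:
        content = base64.b64encode(f.read()).decode()
    requests.post(
        f"{HOST}/api/2.0/workspace/import",
        headers=HEADERS,
        json={
            "path": f"{TARGET_DIR}/{os.path.splitext(fname)[0]}",
            "language": "PYTHON",
            "format": "SOURCE",
            "content": content,
            "overwrite": True,
        },
    ).raise_for_status()
```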

DevOps on Databricks Notebook with specific Path

I am trying to implement Azure DevOps on Databricks notebooks.
My dev instance Databricks notebooks are integrated with a git repository, in the folder structure below.
I have created a build pipeline which detects changes in the Databricks folder for each code folder (CodeA and CodeB) using the Triggers tab of the build pipeline, as shown below.
But at the time of publishing artifacts, how can we select the path so that we get the Databricks files only from each code folder, as shown in the folder structure above?
If that is not possible and I have to select the parent folder Code, which includes the Databricks files for CodeA and CodeB, then how can I deploy it into the Shared folder of the Databricks UAT instance, which has the folder structure below?
Ideally it should be as shown in the diagram below.
Any way to achieve this? Any leads appreciated.
You can just select the parent folder Code/, which includes the Databricks files for CodeA and CodeB, to publish in the build pipeline.
Then you need to create a release pipeline and use the third-party task Databricks Deploy Notebooks to deploy the notebooks.
When you create the release pipeline, click Add to select your build pipeline and add the artifacts.
Add a stage in the release pipeline. Add the Databricks Deploy Notebooks task to the stage job.
Click the 3 dots of the Source files path field to select the databricks folder. Enter the Target files path of your Azure Databricks workspace.
Here you can select the path so that each databricks folder gets deployed to its corresponding folder in Azure Databricks. See below.
Then configure the Authentication method. See the document here to get a Databricks bearer token for the task.
Add multiple Databricks Deploy Notebooks tasks and change the Source files path and Target files path fields accordingly to deploy to the different target folders (a scripted alternative is sketched after this answer).
You can check this tutorial for more information.
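If the number of task instances grows, the same per-folder mapping can be scripted instead; a small sketch using the legacy databricks-cli's workspace import_dir command (the folder mapping mirrors the CodeA/CodeB structure above, and the host/token are expected in the DATABRICKS_HOST / DATABRICKS_TOKEN environment variables):

```python
import subprocess

# Source folders in the published artifact -> target folders in the UAT workspace.
# The mapping mirrors the CodeA/CodeB layout from the question; adjust to your repo.
FOLDER_MAP = {
    "Code/CodeA/databricks": "/Shared/CodeA",
    "Code/CodeB/databricks": "/Shared/CodeB",
}

# Assumes DATABRICKS_HOST and DATABRICKS_TOKEN are set for the legacy databricks-cli.
for source, target in FOLDER_MAP.items():
    subprocess.run(
        ["databricks", "workspace", "import_dir", source, target, "--overwrite"],
        check=True,
    )
```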

Delete Multiple Azure Data Factory Pipeline

I want to delete 50+ ADF pipelines linked to Azure DevOps Git. We can do it manually via the Azure front end, but it's a tedious task. I have tried deleting them via PowerShell, but PowerShell can only delete the pipelines present under Data Factory mode (PFA the screenshot); it does not affect pipelines linked to Azure DevOps Git.
Can anyone suggest a better approach to this activity?
If you just want them out of Git, you can create a feature branch from the ADF editor. Then use Visual Studio or any Git repo navigator to pull that branch to your local file system. Manually delete the files, push back to Git, and do a pull request to merge back into your master.
If you really want to purge them completely, you can do the same manual deletion on your ADF publish branch too.
I accomplished the task of deleting the pipelines by:
Creating a clone of the Dev branch in my local Visual Studio.
Deleting all the .json files (corresponding to the pipelines I wanted to delete) from the newly cloned local branch.
Committing, syncing and pushing the change to the Dev Git branch.
This is the easiest way I could find to delete 50+ pipelines in one shot (a minimal scripted version of the deletion step is sketched below).
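The deletion step itself can be scripted before you commit and push; a minimal sketch, assuming the default ADF collaboration-branch layout (one JSON file per pipeline under a pipeline/ folder) and a shared name prefix for the pipelines to remove - both assumptions are mine:

```python
from pathlib import Path

# Local clone of the ADF collaboration (Dev) branch.
repo_root = Path(r"C:\src\my-adf-repo")  # illustrative path

# The default ADF git layout keeps one JSON file per pipeline under pipeline/.
for json_file in (repo_root / "pipeline").glob("obsolete_*.json"):
    print(f"Deleting {json_file.name}")
    json_file.unlink()

# Then commit, sync and push the change, and PR it back into the Dev/master branch.
```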
