Programmatically importing a library to the workspace in Databricks

I would like to programmatically add a (Python wheel) library to the /Shared workspace on Databricks. It is easy to do in the GUI (Workspace > Import > Library), but I cannot figure out how to do it with the Databricks CLI.
So I thought that I had two possible strategies:
1. Install it as a library
2. Copy it as a file to the workspace
It seems that 1) is not feasible because "library" in the CLI refers to actual installations on clusters, while 2) is not feasible because workspace import expects a language (Python, R, SQL, etc.) and interprets the files as scripts.
So I am a bit lost on how to approach this.

As per my observation:
Note: databricks workspace import "Imports a file from local to the Databricks workspace."
I have tried the databricks workspace import command and confirmed that it copies the wheel as a plain file.
Here is how to install a library using the Azure Databricks CLI:
Copy the library from the local directory to DBFS using the DBFS CLI:
databricks fs cp "C:\Users\Azurewala\Downloads\wheel-0.33.4-py2.py3-none-any.whl" dbfs:/FileStore/jars
Create a cluster using the API or UI.
Get the cluster id using databricks clusters list and copy the cluster-id.
Attach the library in DBFS to the cluster using the libraries CLI:
databricks libraries install --cluster-id "0802-090441-honks846" --whl "dbfs:/FileStore/jars/wheel-0.33.4-py2.py3-none-any.whl"
This successfully installs the library using the Azure Databricks CLI.
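If you would rather drive the install step from Python than from the CLI, the same call can go through the Libraries REST API. A minimal sketch, where the workspace URL and token are placeholders and the cluster id and wheel path are the ones from the commands above:
# Hedged sketch: install a wheel already uploaded to DBFS on a cluster via the
# Libraries REST API (POST /api/2.0/libraries/install).
import requests

host = "https://<your-workspace>.azuredatabricks.net"   # placeholder
token = "<personal-access-token>"                        # placeholder

resp = requests.post(
    f"{host}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "cluster_id": "0802-090441-honks846",
        "libraries": [{"whl": "dbfs:/FileStore/jars/wheel-0.33.4-py2.py3-none-any.whl"}],
    },
)
resp.raise_for_status()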
Hope this helps.

Related

How to export files generated to Azure DevOps from Azure Databricks after a job terminates?

We are using Azure DevOps to submit a Training Job to Databricks. The training job uses a notebook to train a Machine Learning Model. We are using databricks CLI to submit the job from ADO.
In one of the steps in the notebook we create a .pkl file; we want to download this to the build agent and publish it as an artifact in Azure DevOps. How do we do this?
It really depends on how that file is stored:
If it is just saved on DBFS, you can use databricks fs cp 'dbfs:/....' local-path
If the file is stored on the local file system, then copy it to DBFS (for example, using dbutils.fs.cp) and then use the previous item (see the sketch after this list).
If the model is tracked by MLflow, then you can either explicitly export the model to DBFS via the MLflow API (or REST API) (you can export to DevOps directly as well, you just need the correct credentials, etc.) or use this tool to export models/experiments/runs to local disk.
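As a hedged illustration of the first two items above (all paths are hypothetical placeholders):
# Inside the Databricks notebook: copy the model written to the driver's local
# disk onto DBFS.
dbutils.fs.cp("file:/tmp/model.pkl", "dbfs:/FileStore/models/model.pkl")

# On the Azure DevOps build agent, pull it down with the Databricks CLI and
# publish ./model.pkl as a pipeline artifact:
#   databricks fs cp dbfs:/FileStore/models/model.pkl ./model.pkl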

Move Files from Azure Files to ADLS Gen 2 and Back using Databricks

I have a Databricks process which currently generates a bunch of text files that get stored in Azure Files. These files need to be moved to ADLS Gen 2 on a scheduled basis, and back to the File Share.
How can this be achieved using Databricks?
Installing the azure-storage-file-share package and using the Azure Files SDK for Python is the only way to access files in Azure Files from Azure Databricks.
Install the library: azure-storage-file-share (https://pypi.org/project/azure-storage-file-share/)
Note: pip install only installs the package on the driver node, so the data has to be loaded on the driver first (for example into pandas). The library must be deployed as a Databricks library before it can be used on Spark worker nodes.
Python - Load file from Azure Files to Azure Databricks - Stack Overflow
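A minimal sketch of that approach, run on the driver from a notebook; the connection string, share, file, and DBFS target below are all placeholders:
# Hedged sketch: download a file from Azure Files with azure-storage-file-share,
# then copy it onto DBFS so it can be moved on to ADLS Gen 2.
from azure.storage.fileshare import ShareFileClient

file_client = ShareFileClient.from_connection_string(
    conn_str="<azure-files-connection-string>",  # placeholder
    share_name="myshare",                        # placeholder share
    file_path="exports/data.txt",                # placeholder file
)

# Download to the driver's local disk first (the SDK only runs on the driver) ...
with open("/tmp/data.txt", "wb") as f:
    f.write(file_client.download_file().readall())

# ... then copy onto DBFS / a mounted ADLS Gen 2 path
dbutils.fs.cp("file:/tmp/data.txt", "dbfs:/mnt/adls/data.txt")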
An alternative could be copying the data from Azure File Storage to ADLS Gen 2 via Azure Data Factory using the Copy activity: Copy data from/to Azure File Storage - Azure Data Factory & Azure Synapse | Microsoft Docs

ADF databricks python activity to pick python script from blob storage not from dbfs

I'm running a Databricks Python activity from Azure Data Factory. I want to pick the Python/shell script from Azure Blob Storage/Data Lake instead of a DBFS path. My current ADF Databricks Python activity does not allow a path without 'dbfs:/'.
Could you please help me here?
Only a DBFS file path is supported in the Databricks Python activity: https://learn.microsoft.com/en-us/azure/data-factory/transform-data-databricks-python#databricks-python-activity-properties
You need to consider other methods for uploading the Python file to DBFS, such as using the Databricks CLI, potentially as part of your CI/CD pipeline (see the sketch below).
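For example, the pipeline can push the script to DBFS with the DBFS REST API. A hedged sketch, where the workspace URL, token, and paths are placeholders (the inline contents form is limited to files of roughly 1 MB):
# Hedged sketch: upload a small Python script to DBFS via the DBFS REST API
# (POST /api/2.0/dbfs/put).
import base64
import requests

host = "https://<your-workspace>.azuredatabricks.net"   # placeholder
token = "<personal-access-token>"                        # placeholder

with open("script.py", "rb") as f:
    contents = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    f"{host}/api/2.0/dbfs/put",
    headers={"Authorization": f"Bearer {token}"},
    json={"path": "/scripts/script.py", "contents": contents, "overwrite": True},
)
resp.raise_for_status()
# ADF can now reference the script as dbfs:/scripts/script.py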

Create Azure databricks notebook from storage account

We have a Python script stored as a blob in an Azure storage account. We want to deploy/create this script (as a notebook) in the Azure Databricks workspace so that we can later run an Azure Data Factory pipeline which executes the notebook created/deployed in Databricks.
We want to create/deploy this script only once, as and when it becomes available in blob storage.
I have searched the web but couldn't find a proper solution for this.
Is it possible to deploy/create a notebook from a storage account? If yes, how?
Thank you.
You can import a notebook into Databricks using a URL, but I expect that you won't want to make that notebook public.
Another solution would be to combine the azcopy tool with the Databricks CLI (workspace sub-command). Something like this:
azcopy cp "https://[account].blob.core.windows.net/[container]/[path/to/script.py]" .
databricks workspace import -l PYTHON script.py '<location_on_databricks>'
You can also do it completely in a notebook, combining the dbutils.fs.cp command with Databricks' Workspace REST API, but that could be more complicated as you need to get a personal access token, base64-encode the notebook, etc.
We can also use the Databricks REST API 2.0 to import the Python script into the Databricks workspace.
Here is the API definition: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/workspace#--import
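A minimal sketch of that call, assuming a placeholder workspace URL, token, and target path:
# Hedged sketch: import a Python script as a notebook via the Workspace REST API
# (POST /api/2.0/workspace/import).
import base64
import requests

host = "https://<your-workspace>.azuredatabricks.net"   # placeholder
token = "<personal-access-token>"                        # placeholder

with open("script.py", "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    f"{host}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "path": "/Shared/script",   # placeholder target notebook path
        "format": "SOURCE",
        "language": "PYTHON",
        "content": content,
        "overwrite": True,
    },
)
resp.raise_for_status()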

Access databricks secrets in pyspark/python job

Databricks secrets can be accessed within notebooks using dbutils; however, since dbutils is not available outside notebooks, how can one access secrets in PySpark/Python jobs, especially if they are run using MLflow?
I have already tried How to load databricks package dbutils in pyspark, which does not work for remote jobs or MLflow project runs.
In raw PySpark you cannot do this. However, if you are developing a PySpark application specifically for Databricks, then I strongly recommend you look at Databricks Connect.
This allows access to parts of dbutils, including secrets, from an IDE. It also simplifies how you access storage so that it aligns with how the code will run in production.
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect
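A minimal sketch of reading a secret through Databricks Connect (the classic flavour, where dbutils is built from the SparkSession; the scope and key names are placeholders):
# Hedged sketch: access a secret from an IDE / remote job via Databricks Connect.
from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils

spark = SparkSession.builder.getOrCreate()
dbutils = DBUtils(spark)

# Reads the secret from the remote workspace's secret scope
api_key = dbutils.secrets.get(scope="my-scope", key="my-api-key")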
