Import a CSV file using Databricks CLI in Repos - databricks

We are using Databricks to generate ETL scripts. One step requires us to upload small CSVs into a Repos folder. I can do this manually using the import window in the Repos GUI. However, I would like to do this programmatically using the Databricks CLI. Is this possible? I have tried using the Workspace API, but that only works for source code files.

Unfortunately it's not possible right now, because there is no API for that which could be used by databricks-cli. But you can add and commit the files to the Git repository, and then use databricks repos update to pull them into the workspace.
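A minimal sketch of that flow, assuming the repo already exists under /Repos and a databricks-cli version that includes the repos command group (the repo path, branch, and file names below are placeholders):

# Commit the CSV to the Git repository that backs the Repo
git add data/lookup.csv
git commit -m "Add lookup CSV"
git push origin main

# Pull the new commit into the workspace copy of the repo
# (older CLI versions may only accept --repo-id instead of --path)
databricks repos update --path /Repos/someone@example.com/etl-project --branch main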

Related

Azure DevOps - Importing from one project/repository to another

I am trying to import project from one repository to another on Azure DevOps and running into this error:
Your import of https://asbc@dev.azure.com/application/repo/_git/mycode repository failed due to VS403655: The push was rejected because storage:621b7803-367a-307b-8a15-0164783333e2 contains ':', which isn't a valid file or directory character.
All the solutions I have found suggest a force push to the new repo, but we need to retain history. I would appreciate it if there is any way to copy these files without using third-party tools or custom development against the APIs.

Run .ipynb on Databricks without the import ui

Is there a way to run (or convert) .ipynb files on a Databricks cluster without using the Databricks import UI? Basically I want to be able to develop in Jupyter but also be able to run the file on Databricks, where it's pulled through git.
It's possible to import Jupyter notebooks into the Databricks workspace as Databricks notebooks and then execute them. You can use:
the Workspace Import REST API
the databricks workspace import command of databricks-cli.
P.S. Unfortunately you can't open an .ipynb file just by committing it into a Repo; it will be treated as JSON. You need to import it to convert it into a Databricks notebook.
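As a hedged example with the legacy databricks-cli (the local file name and target workspace path are placeholders):

# Import a Jupyter notebook and convert it into a Databricks Python notebook
databricks workspace import --language PYTHON --format JUPYTER ./analysis.ipynb /Users/someone@example.com/analysis

After the import, the notebook can be run interactively or attached to a job like any other workspace notebook.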

Import a GitHub repo into Databricks community edition

I am trying to import some data from a public repo on GitHub so that I can use it from my Databricks notebooks.
So far I have tried to connect my Databricks account with my GitHub account as described here, without results though, since it seems that GitHub support comes with some non-community licensing. I get an error message when I try to set the GitHub token, which is required for the GitHub integration.
The same question has been asked before on the official Databricks forum.
What is the best way to import and store a GitHub repo on databricks community edition?
I managed to solve this using shell commands from the notebook itself. To retrieve the repository for the first time I did a git clone via HTTPS:
%sh git clone https://github.com/SomeDataRepo/TheData.git --depth 1 --branch=master /dbfs/FileStore/TheData/
Why not SSH? SSH requires setting up SSH keys, which was not necessary in my case.
Finally, every time that I need a fresh version of the data I execute a git pull before executing my program:
%sh git -C /dbfs/FileStore/TheData/ pull
Assuming you have Python installed on your desktop: install the databricks-cli, clone the git repo to your local machine, then use the workspace CLI to import the entire repo as a directory.
https://docs.databricks.com/dev-tools/cli/workspace-cli.html
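A rough sketch of those steps with the legacy databricks-cli (the repo URL, local path, and workspace folder are placeholders; note that, per the Workspace API limitation mentioned above, import_dir only picks up notebook/source formats, not arbitrary files):

# Install and configure the CLI (prompts for host and personal access token)
pip install databricks-cli
databricks configure --token

# Clone the repo locally
git clone https://github.com/SomeDataRepo/TheData.git

# Import the whole directory into your workspace
databricks workspace import_dir ./TheData /Users/someone@example.com/TheData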
The simplest way is to just import the .dbc file directly into your user workspace on Community Edition, as explained by Databricks here:
Import GitHub repo into Community Edition Workspace
1. In GitHub, in the pane to the right, under Releases, click on the Latest link.
2. Under Assets, look for the link to the DBC file.
3. Right-click the DBC file's link and copy the link location (there is no need to download this file).
4. Back in Databricks, click on the Workspace icon in the navigation pane to the left.
5. In the Workspace swimlane, click the Home button to open your home folder. It should open the folder /Users/your-email-address, as in /Users/student@example.com.
6. In the swimlane for your email address, click on the down chevron and select Import.
7. In the Import Notebooks dialog, select URL, paste in the URL copied in step 3, and click Import.
8. Once the import is done, select the new folder for this course to view the course's notebooks.
Which notebook you should start with depends on your courseware and/or instructor.

How to use Airflow-API-Plugin?

I want to List and Trigger DAGs using this https://github.com/airflow-plugins/airflow_api_plugin github repo. How and where should I place this plugin in my airflow folder so that I can call the endpoints?
Is there anything that I need to change in the airflow.cfg file?
The repository you listed has not been updated in a while. Why not just use the experimental REST APIs included in Airflow? You can find them here: https://airflow.apache.org/docs/stable/api.html
Use:
GET /api/experimental/dags/<DAG_ID>/dag_runs
to get a list of DAG runs, and
POST /api/experimental/dags/<DAG_ID>/dag_runs
to trigger a new DAG run.
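For instance, a sketch against the Airflow 1.10 experimental API, assuming the webserver runs at http://localhost:8080 and the DAG is named my_dag (both placeholders):

# List existing runs of a DAG
curl -X GET http://localhost:8080/api/experimental/dags/my_dag/dag_runs

# Trigger a new run, optionally passing a conf payload
curl -X POST http://localhost:8080/api/experimental/dags/my_dag/dag_runs \
  -H 'Content-Type: application/json' \
  -d '{"conf": {"process_date": "2020-01-01"}}'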

What is a good Databricks workflow

I'm using Azure Databricks for data processing, with notebooks and pipelines.
I'm not satisfied with my current workflow:
The notebook used in production can't be modified without breaking production. When I want to develop an update, I duplicate the notebook, change the source code until I'm satisfied, then I replace the production notebook with my new notebook.
My browser is not an IDE! I can't easily go to a function definition. I have lots of notebooks; if I want to modify or even just see the documentation of a function, I need to switch to the notebook where this function is defined.
Is there a way to do efficient and systematic testing?
Git integration is very simple, but this is not my main concern.
Great question. Definitely don't modify your production code in place.
One recommended pattern is to keep separate folders in your workspace for dev-staging-prod. Do your dev work and then run tests in staging before finally promoting to production.
You can use the Databricks CLI to pull and push a notebook from one folder to another without breaking existing code. Going one step further, you can incorporate this pattern with git to sync with version control. In either case, the CLI gives you programmatic access to the workspace and that should make it easier to update code for production jobs.
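For example, a minimal sketch of this promote-between-folders pattern with the legacy databricks-cli (the /Staging and /Prod folder names and the notebook name are just placeholder conventions, not anything prescribed by Databricks):

# Export the notebook from the staging folder as source code
databricks workspace export --format SOURCE --overwrite /Staging/etl_notebook ./etl_notebook.py

# Import it into the production folder once it has been reviewed and tested
databricks workspace import --language PYTHON --format SOURCE --overwrite ./etl_notebook.py /Prod/etl_notebook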
Regarding your second point about IDEs: Databricks offers Databricks Connect, which lets you use your IDE while running commands on a cluster. Based on your pain points I think this is a great solution for you, as it will give you more visibility into the functions you have defined, and so on. You can also write and run your unit tests this way.
Once you have your scripts ready to go, you can always import them into the workspace as notebooks and run them as jobs. Also note that you can run .py scripts as jobs using the REST API.
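To illustrate that last point, a rough sketch of submitting a .py script as a one-off run through the Jobs REST API; the workspace URL, token, DBFS path, and cluster settings are all placeholder assumptions:

curl -X POST https://<databricks-instance>/api/2.0/jobs/runs/submit \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -d '{
        "run_name": "nightly_etl",
        "new_cluster": {
          "spark_version": "7.3.x-scala2.12",
          "node_type_id": "Standard_DS3_v2",
          "num_workers": 2
        },
        "spark_python_task": {
          "python_file": "dbfs:/scripts/etl.py",
          "parameters": ["2020-01-01"]
        }
      }'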
I personally prefer to package my code, and copy the *.whl package to DBFS, where I can install the tested package and import it.
Edit: To be more explicit.
The notebook used in production can't be modified without breaking production. When I want to develop an update, I duplicate the notebook, change the source code until I'm satisfied, then I replace the production notebook with my new notebook.
This can be solved either by having separate DEV/TST/PRD environments, or by having versioned packages that can be modified in isolation. I'll clarify later on.
My browser is not an IDE! I can't easily go to a function definition. I have lots of notebooks; if I want to modify or even just see the documentation of a function, I need to switch to the notebook where this function is defined.
Is there a way to do efficient and systematic testing?
Yes. Using the versioned-packages method I mentioned, in combination with databricks-connect, you can use your IDE, implement tests, and have proper git integration.
Git integration is very simple, but this is not my main concern.
Built-in git integration is actually very poor when working in bigger teams. You can't develop in the same notebook simultaneously, as there's a flat, linear accumulation of changes that is shared with your colleagues. Besides that, you have to link and unlink repositories, which is prone to human error and can cause your notebooks to be synchronized into the wrong folders, breaking runs because notebooks can't be imported. I advise you to also use my packaging solution.
The packaging solution works as follows (reference):
1. On your desktop, install pyspark.
2. Download some anonymized data to work with.
3. Develop your code with small bits of data, writing unit tests.
4. When ready to test on big data, uninstall pyspark and install databricks-connect.
5. When performance and integration are sufficient, push the code to your remote repo.
6. Create a build pipeline that runs automated tests and builds the versioned package.
7. Create a release pipeline that copies the versioned package to DBFS (see the sketch after this list).
8. In a "runner notebook", accept "process_date" and "data folder/filepath" as arguments, and import modules from your versioned package.
9. Pass the arguments to your module to run your tested code.
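A minimal shell sketch of the build-and-release steps above; the package name, version, DBFS path, and cluster id are assumptions for illustration only:

# Build the versioned wheel from the package source
python -m pip wheel --no-deps -w dist .

# Copy the wheel to DBFS with the legacy databricks-cli
databricks fs cp dist/my_etl-1.2.0-py3-none-any.whl dbfs:/packages/my_etl-1.2.0-py3-none-any.whl

# Install the wheel on an existing cluster through the Libraries API
curl -X POST https://<databricks-instance>/api/2.0/libraries/install \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -d '{"cluster_id": "0123-456789-abcde", "libraries": [{"whl": "dbfs:/packages/my_etl-1.2.0-py3-none-any.whl"}]}'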
The way we are doing it:
- Integrate the dev notebooks with Azure DevOps.
- Create custom build and deployment tasks for notebook, job, package and cluster deployments. This is fairly easy to do with the Databricks REST API:
https://docs.databricks.com/dev-tools/api/latest/index.html
- Create a release pipeline for Test, Staging and Production deployments.
- Deploy on Test and test.
- Deploy on Staging and test.
- Deploy on Production.
Hope this can help.
