Databricks CLI - dbfs commands to copy files - Azure

I'm working on the deployment of the Purview ADB Lineage Solution Accelerator. In step 3 of the "Install OpenLineage on Your Databricks Cluster" section, the author asks you to run the following in PowerShell to upload the init script and jar to DBFS using the Databricks CLI.
dbfs mkdirs dbfs:/databricks/openlineage
dbfs cp --overwrite ./openlineage-spark-*.jar dbfs:/databricks/openlineage/
dbfs cp --overwrite ./open-lineage-init-script.sh dbfs:/databricks/openlineage/open-lineage-init-script.sh
Question: Do I correctly understand the above code as follows? If not, I would like to know exactly what the code is doing before I run it.
1. The first line creates a folder openlineage in the root directory of DBFS.
2. It is assumed that you are running the PowerShell commands from the location where the .jar file and open-lineage-init-script.sh are located.
3. The second and third lines copy the jar and .sh files from your local directory to dbfs:/databricks/openlineage/ in DBFS.

dbfs mkdirs is the equivalent of the UNIX mkdir -p, i.e. under the DBFS root it will create a folder named databricks, and inside it another folder named openlineage - and it will not complain if these directories already exist.
2 and 3: yes. Files/directories not prefixed with dbfs:/ refer to your local filesystem. Note that you can copy from DBFS to local or vice versa, or between two DBFS locations - just not between two local paths.
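The mkdir -p semantics described above can be sketched with Python's pathlib on a local filesystem; this is only an analogy for what dbfs mkdirs does, using a temporary directory as a stand-in for the DBFS root:

```python
from pathlib import Path
import tempfile

# Temporary directory standing in for the DBFS root (illustration only)
root = Path(tempfile.mkdtemp())
target = root / "databricks" / "openlineage"

# Like `dbfs mkdirs`, mkdir(parents=True, exist_ok=True) creates all
# intermediate folders and does not complain if they already exist.
target.mkdir(parents=True, exist_ok=True)
target.mkdir(parents=True, exist_ok=True)  # second call: no error

print(target.is_dir())
```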

Related

Creating symbolic link in a Databricks FileStore's folder

I would like to create a symbolic link, as in a Linux environment with the command ln -s.
Unfortunately I can't find anything similar in a Databricks FileStore.
It also seems that ln is not a member of dbutils.
Is there a way to do this, maybe differently?
Thanks a lot for your help.
FileStore is located on DBFS (Databricks File System), which is backed by either S3 or ADLS, neither of which has a notion of a symlink. You have a choice: rename the file, copy it, or modify your code to generate the correct file name from an alias.
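The "generate the file name from an alias" option mentioned above could be as simple as a lookup table resolved at read time. A minimal sketch; the alias names and DBFS paths below are made up for illustration:

```python
# Hypothetical alias table replacing symlinks on DBFS (paths are illustrative)
ALIASES = {
    "latest_model": "dbfs:/FileStore/models/model_v3.pkl",
}

def resolve(path: str) -> str:
    """Return the real path behind an alias, or the path itself if not aliased."""
    return ALIASES.get(path, path)

print(resolve("latest_model"))
print(resolve("dbfs:/tmp/x"))
```

Updating the "symlink" then means changing one dictionary entry instead of copying or renaming files.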

Can't Access /dbfs/FileStore using shell commands in databricks runtime version 7

In databricks runtime version 6.6 I am able to successfully run a shell command like the following:
%sh ls /dbfs/FileStore/tables
However, in runtime version 7 this no longer works. Is there any way to directly access /dbfs/FileStore in runtime version 7? I need to run commands to unzip a parquet zip file in /dbfs/FileStore/tables. This used to work in version 6.6, but Databricks' new "upgrade" breaks this simple core functionality.
Not sure if this matters, but I am using the Community Edition of Databricks.
When you run %sh ls /dbfs/FileStore/tables, you can't access /dbfs/FileStore because, by default, the folder /dbfs/FileStore does not exist in DBFS.
Try uploading some files to /dbfs/FileStore/tables.
Now run the same command again, %sh ls /dbfs/FileStore/tables, and you will see results, because data has been uploaded into the /dbfs/FileStore/tables folder.
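The "folder doesn't exist until something is written there" behaviour is ordinary filesystem behaviour, not a DBFS quirk; a local sketch of the same failure and recovery, using a temporary directory in place of /dbfs/FileStore:

```python
import os
import tempfile

base = tempfile.mkdtemp()
tables = os.path.join(base, "FileStore", "tables")

# Listing a directory that was never created fails, like ls on a missing DBFS path
try:
    os.listdir(tables)
    existed = True
except FileNotFoundError:
    existed = False

# After something is written there, the same listing succeeds
os.makedirs(tables)
with open(os.path.join(tables, "sample.csv"), "w") as f:
    f.write("a,b\n1,2\n")

print(existed, os.listdir(tables))
```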
The /dbfs mount doesn't work on Community Edition with DBR >= 7.x - it's a known limitation.
You can work around this limitation by working with files on the driver node and uploading or downloading them with the dbutils.fs.cp command (docs). Your code will then look like the following:
# write a file to the local filesystem using Python I/O APIs
...
# upload file to DBFS
dbutils.fs.cp('file:/tmp/local-path', 'dbfs:/FileStore/tables/dbfs_file.txt')
and reading from DBFS will look like the following:
# copy a file from DBFS to the local filesystem
dbutils.fs.cp('dbfs:/tmp/dbfs_file.txt', 'file:/tmp/local-path')
# read the file locally
...
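Outside a Databricks notebook there is no dbutils, but the same round trip can be sketched locally with shutil.copy standing in for dbutils.fs.cp; the directories and file names below are illustrative stand-ins for the driver filesystem and DBFS:

```python
import os
import shutil
import tempfile

local_dir = tempfile.mkdtemp()  # plays the role of the driver's local filesystem
dbfs_dir = tempfile.mkdtemp()   # plays the role of dbfs:/FileStore/tables

# write a file to the "local" filesystem using ordinary Python I/O
src = os.path.join(local_dir, "local-path.txt")
with open(src, "w") as f:
    f.write("hello from the driver\n")

# "upload": dbutils.fs.cp('file:/...', 'dbfs:/...') imitated by shutil.copy
dst = os.path.join(dbfs_dir, "dbfs_file.txt")
shutil.copy(src, dst)

# "download" it again and read it back
back = os.path.join(local_dir, "copy-back.txt")
shutil.copy(dst, back)
with open(back) as f:
    print(f.read())
```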
I know this question is a year old, but I wanted to share other posts that I found helpful in case someone has the same question.
I found the comments on this similar question helpful: How to access DBFS from shell?. The comments on that post also reference Not able to cat dbfs file in databricks community edition cluster. FileNotFoundError: [Errno 2] No such file or directory:, which I found helpful as well.
I learned that in Community Edition, ls /dbfs/FileStore/tables is not possible because DBFS itself is not mounted on the nodes and the feature is disabled.

Convert Databricks notebooks (.dbc) into standard .py files

I often get sent Databricks notebooks from various sources to move around, look at, or refactor. Due to different tenancies I can't log into the actual environment. These are usually sent as .dbc files, and I can convert them by opening a new Databricks environment and re-saving them as .py files. I was wondering if there is a way to do this from the command line, like nbconvert for Jupyter?
It's a bit of a pain to import a whole host of files and then re-convert them to Python just for the sake of reading code.
Source control is not always an option due to permissions.
Import the .dbc in your Databricks workspace, for example in the Shared directory.
Then, as suggested by Carlos, install the Databricks CLI on your local computer and set it up.
pip install databricks-cli
databricks configure --token
and run the following to export the notebooks as .py files into your local folder:
mkdir export_notebooks
cd export_notebooks
databricks workspace export_dir /Shared ./
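The exported .py sources keep Databricks-specific marker comments (typically a "# Databricks notebook source" header, "# COMMAND ----------" cell separators, and "# MAGIC" lines). If you want plain Python afterwards, a small post-processing pass can strip them; a sketch, assuming those marker strings:

```python
# Marker comments assumed to appear in Databricks .py exports
MARKERS = ("# Databricks notebook source", "# COMMAND ----------", "# MAGIC")

def strip_markers(text: str) -> str:
    """Remove Databricks export markers, leaving ordinary Python source."""
    kept = [line for line in text.splitlines()
            if not line.strip().startswith(MARKERS)]
    return "\n".join(kept).strip() + "\n"

# Example of what an exported notebook source might look like
exported = (
    "# Databricks notebook source\n"
    "print('hello')\n"
    "# COMMAND ----------\n"
    "print('world')\n"
)
print(strip_markers(exported))
```

Running this over each file in export_notebooks would leave readable, plain .py files.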

Download files (databricks/driver)

I tried to download an Excel file that I generated via pandas, but I can't find it ... I know it is in file:/databricks/driver, but I can't download it ...
Is it possible to transfer it into storage, or to my local machine?
I tried the following, but it didn't work.
dbutils.fs.cp('file:/databricks/driver/test.xlsx','dbfs:/mnt/datalake/test.xlsx')
Note: Using the Databricks GUI, you can download full results (max 1 million rows).
OR
Using Databricks CLI:
To download full results (more than 1 million rows), first save the file to DBFS and then copy it to the local machine using the Databricks CLI, as follows.
dbfs cp "dbfs:/FileStore/tables/AA.csv" "A:\AzureAnalytics"
Reference: Databricks file system
The DBFS command-line interface (CLI) uses the DBFS API to expose an easy-to-use command-line interface to DBFS. Using this client, you can interact with DBFS using commands similar to those you use on a Unix command line. For example:
# List files in DBFS
dbfs ls
# Put local file ./apple.txt to dbfs:/apple.txt
dbfs cp ./apple.txt dbfs:/apple.txt
# Get dbfs:/apple.txt and save to local file ./apple.txt
dbfs cp dbfs:/apple.txt ./apple.txt
# Recursively put local dir ./banana to dbfs:/banana
dbfs cp -r ./banana dbfs:/banana
Reference: Installing and configuring Azure Databricks CLI

Pyspark list files by filetype in a directory

I want to list files by filetype in a directory. The directory has .csv, .pdf, etc. file types, and I want to list all the .csv files.
I am using the following command
dbutils.fs.ls("/mnt/test-output/*.csv")
I am expecting to get the list of all csv files in that directory.
I am getting the following error in databricks
java.io.FileNotFoundException: No such file or directory: /test-output/*.csv
Try using a shell cell with %sh. You can access DBFS and the mnt directory from there, too.
%sh
ls /dbfs/mnt/*.csv
Should get you a result like
/dbfs/mnt/temp.csv
%fs is a shortcut to dbutils and its access to the file system. dbutils doesn't support all Unix shell functions and syntax, so that's probably the issue you ran into. Notice also that when running the %sh cell, we access DBFS with /dbfs/.
I think you're mixing DBFS up with the local file system. Where is /mnt/test-output/*.csv?
If you're trying to read from DBFS, then it will work.
Can you try running dbutils.fs.ls("/") to ensure that /mnt exists in DBFS?
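Since dbutils.fs.ls does not expand glob patterns like *.csv, a common workaround is to list the directory and filter the names in Python. A sketch with fnmatch, using a made-up list of names in place of real ls output:

```python
from fnmatch import fnmatch

# Stand-in for the file names returned by dbutils.fs.ls("/mnt/test-output/")
names = ["report.csv", "summary.pdf", "data.csv", "notes.txt"]

# Keep only the entries matching the *.csv pattern
csv_files = [n for n in names if fnmatch(n, "*.csv")]
print(csv_files)
```

On a real cluster you would build `names` from the `.name` attribute of each FileInfo that dbutils.fs.ls returns.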
