Download files (databricks/driver) - databricks

I tried to download an Excel file that I generated via pandas, but I can't find it. I know it is in file:/databricks/driver, but I can't download it.
Is it possible to transfer it into storage or transfer it to my local machine?
I tried the following, but it didn't work:
dbutils.fs.cp('file:/databricks/driver/test.xlsx','dbfs:/mnt/datalake/test.xlsx')

Note: Using the Databricks GUI, you can download full results (max 1 million rows).
OR
Using Databricks CLI:
To download full results (more than 1 million rows), first save the file to DBFS and then copy the file to your local machine using the Databricks CLI as follows.
dbfs cp "dbfs:/FileStore/tables/AA.csv" "A:\AzureAnalytics"
Reference: Databricks file system
The DBFS command-line interface (CLI) uses the DBFS API to expose an easy-to-use command-line interface to DBFS. Using this client, you can interact with DBFS using commands similar to those you use on a Unix command line. For example:
# List files in DBFS
dbfs ls
# Put local file ./apple.txt to dbfs:/apple.txt
dbfs cp ./apple.txt dbfs:/apple.txt
# Get dbfs:/apple.txt and save to local file ./apple.txt
dbfs cp dbfs:/apple.txt ./apple.txt
# Recursively put local dir ./banana to dbfs:/banana
dbfs cp -r ./banana dbfs:/banana
Reference: Installing and configuring Azure Databricks CLI
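For the notebook side of that workflow, a rough sketch could look like this (the DataFrame, file names, and paths are placeholders, and it assumes an Excel writer such as openpyxl is installed on the cluster):
import pandas as pd

# Hypothetical data; substitute your own DataFrame.
df = pd.DataFrame({"a": [1, 2, 3]})

# 1. Plain pandas I/O writes to the driver's local disk (e.g. /databricks/driver or /tmp).
df.to_excel("/tmp/test.xlsx", index=False)

# 2. Copy from the driver's local filesystem into DBFS; note the file:/ prefix.
dbutils.fs.cp("file:/tmp/test.xlsx", "dbfs:/FileStore/tables/test.xlsx")

# 3. Then, from your own machine, pull it down with the CLI:
#    dbfs cp dbfs:/FileStore/tables/test.xlsx ./test.xlsx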

Related

Databricks cli - dbfs commands to copy files

I'm working on the Deployment of the Purview ADB Lineage Solution Accelerator. In step 3 of the Install OpenLineage on Your Databricks Cluster section, the author asks to run the following in PowerShell to upload the init script and jar to DBFS using the Databricks CLI.
dbfs mkdirs dbfs:/databricks/openlineage
dbfs cp --overwrite ./openlineage-spark-*.jar dbfs:/databricks/openlineage/
dbfs cp --overwrite ./open-lineage-init-script.sh dbfs:/databricks/openlineage/open-lineage-init-script.sh
Question: Do I correctly understand the above code as follows? If not, I would like to know what exactly the code is doing before running it.
1. The first line creates a folder openlineage in the root directory of DBFS.
2. It's assumed that you are running the PowerShell command from the location where the .jar and open-lineage-init-script.sh files are located.
3. The second and third lines copy the jar and .sh files from your local directory to dbfs:/databricks/openlineage/ in the DBFS of Databricks.
1. dbfs mkdirs is the equivalent of UNIX mkdir -p, i.e. under the DBFS root it will create a folder named databricks, and inside it another folder named openlineage, and it will not complain if these directories already exist.
2. and 3. Yes. Files/directories not prefixed with dbfs:/ refer to your local filesystem. Note that you can copy from DBFS to local or vice versa, or between two DBFS locations, just not between two local paths.
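For reference, a rough notebook-side equivalent of those three CLI commands using dbutils (the local paths and the jar file name below are made-up placeholders, not from the accelerator docs):
# Like mkdir -p: creates dbfs:/databricks and dbfs:/databricks/openlineage if missing.
dbutils.fs.mkdirs("dbfs:/databricks/openlineage")
# file:/ paths are the driver's local filesystem; dbfs:/ paths are DBFS.
dbutils.fs.cp("file:/tmp/openlineage-spark.jar", "dbfs:/databricks/openlineage/openlineage-spark.jar")
dbutils.fs.cp("file:/tmp/open-lineage-init-script.sh", "dbfs:/databricks/openlineage/open-lineage-init-script.sh")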

AWS CLI S3 CP --recursive function works in console but not in .sh file

I have the following line inside of a .sh file:
aws s3 cp s3://bucket/folder/ /home/ec2-user/ --recursive
When I run this line in the console, it runs fine and completes as expected. When I run this line inside of a .sh file, the line returns the following error.
Unknown options: --recursive
Here is my full .sh script.
echo "Test"
aws s3 cp s3://bucket/folder/ /home/ec2-user/ --recursive
echo "Test"
python /home/ec2-user/Run.py
If I manually add the Run.py file (not having it copied over from S3 as desired) and run the script, I get the following output.
Test
Unknown options: --recursive
Test
Hello World
If I remove the files which are expected to be transferred by S3 out of the Linux environment and rely on the AWS S3 command, I get the following output:
Test
Unknown options: --recursive
Test
python: can't open file '/home/ec2-user/Run.py': [Errno 2] No such file or directory
Note that there are multiple files that can be transferred inside of the specified S3 location. All of these files are transferred as expected when running the AWS command from the console.
I initially thought this was a line-ending error, as I am copying the .sh file over from Windows. I have made sure that my line endings are correct, which is confirmed by the rest of the script running as expected. As a result, this issue seems isolated to the actual AWS command. If I remove --recursive from the call, it transfers a single file successfully.
Any ideas on why the --recursive option would be working in the console but not in the .sh file?
P.s.
$ aws --version
aws-cli/2.1.15 Python/3.7.3 Linux/4.14.209-160.339.amzn2.x86_64 exe/x86_64.amzn.2 prompt/off
I think it is either about the position of the option, or somewhere your AWS CLI path is pointing to an older version, because the recursive option was added later. For me, it works both ways, inside the shell script as well as in the console.
aws s3 cp --recursive s3://bucket/folder/ /home/ec2-user/
$ aws --version
aws-cli/2.1.6 Python/3.7.4 Darwin/20.2.0 exe/x86_64 prompt/off
Try printing the version inside the shell script.
The cp --recursive method lists the source path and copies (overwriting) everything to the destination path.
Also, instead of using --recursive, consider sync. sync recursively copies new and updated files from the source directory to the destination. It only creates folders in the destination if they contain one or more files.
aws s3 sync s3://bucket/folder/ /home/ec2-user/
The sync method first lists both the source and destination paths and copies only the differences (name, size, etc.).
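If you'd rather not depend on which aws binary the script resolves at all, a boto3 sketch does the same recursive copy (boto3 is a swap-in here, not part of the original script, and the bucket, prefix, and destination are the placeholders from the question):
import os
import boto3

bucket = "bucket"        # placeholder bucket name from the question
prefix = "folder/"       # placeholder prefix
dest_dir = "/home/ec2-user"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        rel_path = obj["Key"][len(prefix):]
        if not rel_path or rel_path.endswith("/"):   # skip the prefix and folder markers
            continue
        local_path = os.path.join(dest_dir, rel_path)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(bucket, obj["Key"], local_path)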

Can't Access /dbfs/FileStore using shell commands in databricks runtime version 7

In databricks runtime version 6.6 I am able to successfully run a shell command like the following:
%sh ls /dbfs/FileStore/tables
However, in runtime version 7, this no longer works. Is there any way to directly access /dbfs/FileStore in runtime version 7? I need to run commands to unzip a parquet zip file in /dbfs/FileStore/tables. This used to work in version 6.6, but Databricks' new "upgrade" breaks this simple core functionality.
Not sure if this matters but I am using the community edition of databricks.
When you run %sh ls /dbfs/FileStore/tables, you can't access /dbfs/FileStore using shell commands in Databricks runtime version 7 because, by default, the folder /dbfs/FileStore does not exist in DBFS.
Try uploading some files to /dbfs/FileStore/tables.
Now run the same command again, %sh ls /dbfs/FileStore/tables, and you will see results because data has now been uploaded into the /dbfs/FileStore/tables folder.
The /dbfs mount doesn't work on Community Edition with DBR >= 7.x - it's a known limitation.
You can work around this limitation by working with files on the driver node and uploading or downloading them using the dbutils.fs.cp command (docs). So your code will look like the following:
# write a file to the local filesystem using Python I/O APIs
...
# upload file to DBFS
dbutils.fs.cp('file:/tmp/local-path', 'dbfs:/FileStore/tables/dbfs_file.txt')
and reading from DBFS will look like the following:
# copy file from DBFS to local file_system
dbutils.fs.cp('dbfs:/tmp/dbfs_file.txt', 'file:/tmp/local-path')
# read the file locally
...
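Applied to the unzip use case from the question, a rough sketch of that round trip might look like this (the file names are made up):
import zipfile

# 1. Copy the zip from DBFS down to the driver's local disk.
dbutils.fs.cp("dbfs:/FileStore/tables/data.zip", "file:/tmp/data.zip")

# 2. Unzip it locally with ordinary Python, since %sh on /dbfs is unavailable here.
with zipfile.ZipFile("/tmp/data.zip") as z:
    z.extractall("/tmp/data")

# 3. Push the extracted files back into DBFS so Spark can read them.
dbutils.fs.cp("file:/tmp/data", "dbfs:/FileStore/tables/data", recurse=True)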
I know this question is a year old, but I wanted to share other posts that I found helpful in case someone has the same question.
I found the comments in this similar question to be helpful: How to access DBFS from shell?. The comments in that post also reference Not able to cat dbfs file in databricks community edition cluster. FileNotFoundError: [Errno 2] No such file or directory:, which I found helpful as well.
I learned that in Community Edition, ls /dbfs/FileStore/tables is not possible because DBFS itself is not mounted on the nodes and the feature is disabled.

Convert Databricks notebooks (.dbc) into standard .py files

I often get sent Databricks notebooks from various sources to move around / look at / refactor. Due to different tenancies I can't log into the actual environment. These are usually sent as .dbc files, and I can convert them by opening up a new Databricks environment and re-saving them as a .py file. I was wondering if there is a method where I could do this from the command line, like nbconvert for Jupyter?
It's a bit of a pain to import a whole host of files, then re-convert to Python just for the sake of reading code.
Source control is not always an option due to permissions.
Import the .dbc in your Databricks workspace, for example in the Shared directory.
Then, as suggested by Carlos, install the Databricks CLI on your local computer and set it up.
pip install databricks-cli
databricks configure --token
and run the following to export the notebooks as .py files into your local folder:
mkdir export_notebooks
cd export_notebooks
databricks workspace export_dir /Shared ./
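If you only need a single notebook and want to skip the export_dir step, the Workspace API's export endpoint can also be called directly; a minimal sketch, assuming a workspace URL, a personal access token, and a notebook path of your own:
import base64
import requests

host = "https://<your-workspace-instance>"   # placeholder workspace URL
token = "<personal-access-token>"            # placeholder token
notebook_path = "/Shared/my_notebook"        # placeholder notebook path

resp = requests.get(
    f"{host}/api/2.0/workspace/export",
    headers={"Authorization": f"Bearer {token}"},
    params={"path": notebook_path, "format": "SOURCE"},
)
resp.raise_for_status()

# The response carries the notebook source base64-encoded in the "content" field.
source = base64.b64decode(resp.json()["content"]).decode("utf-8")
with open("my_notebook.py", "w") as f:
    f.write(source)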

Pyspark list files by filetypes in a directory

I want to list files by file type in a directory. The directory has .csv, .pdf, and other file types, and I want to list all the .csv files.
I am using the following command
dbutils.fs.ls("/mnt/test-output/*.csv")
I am expecting to get the list of all csv files in that directory.
I am getting the following error in databricks
java.io.FileNotFoundException: No such file or directory: /test-output/*.csv
Try using a shell cell with %sh. You can access DBFS and the mnt directory from there, too.
%sh
ls /dbfs/mnt/*.csv
Should get you a result like
/dbfs/mnt/temp.csv
%fs is a shortcut to dbutils and its access to the file system. dbutils doesn't support all Unix shell functions and syntax, so that's probably the issue you ran into. Notice also how, when running the %sh cell, we access DBFS with /dbfs/.
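Since dbutils.fs.ls doesn't expand globs, another option is to list the directory and filter the results in Python (using the mount path from the question):
# List the mount and keep only the .csv entries.
csv_files = [f.path for f in dbutils.fs.ls("/mnt/test-output/") if f.path.endswith(".csv")]
print(csv_files)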
I think you're mixing DBFS with local file system. Where is /mnt/test-output/*.csv?
If you're trying to read from DBFS then it will work.
Can you try running dbutils.fs.ls("/") to ensure that /mnt exists in DBFS?
