PySpark: list files by file type in a directory - Linux

I want to list files by file type in a directory. The directory contains .csv, .pdf, etc. file types, and I want to list all the .csv files.
I am using the following command:
dbutils.fs.ls("/mnt/test-output/*.csv")
I am expecting to get a list of all the CSV files in that directory.
I am getting the following error in Databricks:
java.io.FileNotFoundException: No such file or directory: /test-output/*.csv

Try using a shell cell with %sh. You can access DBFS and the mnt directory from there, too.
%sh
ls /dbfs/mnt/*.csv
That should get you a result like:
/dbfs/mnt/temp.csv
%fs is a shortcut to dbutils and its access to the file system. dbutils doesn't support all Unix shell functions and syntax (wildcards, for example), so that's probably the issue you ran into. Notice also that when running the %sh cell we access DBFS with the /dbfs/ prefix.
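If you want to stay in Python rather than use a %sh cell, here is a minimal sketch, assuming the /mnt/test-output mount exists: dbutils.fs.ls doesn't expand wildcards, so list the directory and filter the result yourself.
# dbutils.fs.ls returns FileInfo objects; keep only the .csv entries
csv_files = [f.path for f in dbutils.fs.ls("/mnt/test-output/") if f.name.endswith(".csv")]
print(csv_files)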

I think you're mixing up DBFS with the local file system. Where is /mnt/test-output/*.csv?
If you're trying to read from DBFS, then it will work.
Can you try running dbutils.fs.ls("/") to ensure that /mnt exists in DBFS?

Related

Databricks cli - dbfs commands to copy files

I'm working on the deployment of the Purview ADB Lineage Solution Accelerator. In step 3 of the 'Install OpenLineage on Your Databricks Cluster' section, the author asks to run the following in PowerShell to upload the init script and jar to DBFS using the Databricks CLI.
dbfs mkdirs dbfs:/databricks/openlineage
dbfs cp --overwrite ./openlineage-spark-*.jar dbfs:/databricks/openlineage/
dbfs cp --overwrite ./open-lineage-init-script.sh dbfs:/databricks/openlineage/open-lineage-init-script.sh
Question: Do I understand the above code correctly, as described below? If that is not the case, I would like to know exactly what the code is doing before running it.
1. The first line creates a folder openlineage in the root directory of DBFS.
2. It's assumed that you are running the PowerShell command from the location where the .jar and open-lineage-init-script.sh files are located.
3. The second and third lines copy the jar and .sh files from your local directory to dbfs:/databricks/openlineage/ in the DBFS of Databricks.
1. dbfs mkdirs is the equivalent of UNIX mkdir -p, i.e. under the DBFS root it will create a folder named databricks and, inside it, another folder named openlineage - and it will not complain if these directories already exist.
2. and 3. Yes. Files/directories not prefixed with dbfs:/ refer to your local filesystem. Note that you can copy from DBFS to local or vice versa, or between two DBFS locations - just not from local to local.
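For reference, a rough notebook-side equivalent of those three CLI lines using dbutils.fs - the /tmp paths and the plain jar file name below are only illustrative, and it assumes the two files were first copied onto the driver node:
# create the nested folders under the DBFS root (no error if they already exist)
dbutils.fs.mkdirs("dbfs:/databricks/openlineage")
# copy the jar and the init script from the driver's local filesystem into DBFS
dbutils.fs.cp("file:/tmp/openlineage-spark.jar", "dbfs:/databricks/openlineage/openlineage-spark.jar")
dbutils.fs.cp("file:/tmp/open-lineage-init-script.sh", "dbfs:/databricks/openlineage/open-lineage-init-script.sh")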

clickhouse-client import file not found

I'm trying to import the example data from the link, but when I run the command below it can't find the CSV file. Error: -bash: full_dataset .csv: No such file or directory. ClickHouse is installed with the ClickHouse Docker image.
Where do I need to keep the "full_dataset.csv" file?
You need to either supply the entire path or move the CSV file to the ClickHouse bin folder where clickhouse-client resides.
I solved this problem by moving the "full_dataset.csv" file to the "root" directory.
You need to download the CSV dataset from the link first, and only then execute the code on the ClickHouse documentation page; this is indicated at https://recipenlg.cs.put.poznan.pl/dataset

Creating symbolic link in a Databricks FileStore's folder

I would like to create a symbolic link, as in a Linux environment with the command ln -s.
Unfortunately, I can't find anything similar for a Databricks FileStore.
And it seems that an ln operation is not a member of dbutils.
Is there a way to do this maybe differently?
Thanks a lot for your help.
FileStore is located on DBFS (Databricks File System), which is backed either by S3 or ADLS, neither of which has a notion of a symlink. You have a choice: either rename the file, copy it, or modify your code to generate the correct file name from an alias.
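If you go the alias route, one hedged sketch (the paths and names below are hypothetical) is to keep the mapping in your code instead of on the filesystem:
# map the friendly name you wanted the symlink for onto the real DBFS path
aliases = {"latest.csv": "dbfs:/FileStore/data/report_2021_06.csv"}  # hypothetical alias -> real path

def resolve(name):
    # fall back to the name itself when no alias is defined
    return aliases.get(name, name)

df = spark.read.csv(resolve("latest.csv"), header=True)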

Can't Access /dbfs/FileStore using shell commands in databricks runtime version 7

In Databricks runtime version 6.6, I am able to successfully run a shell command like the following:
%sh ls /dbfs/FileStore/tables
However, in runtime version 7, this no longer works. Is there any way to directly access /dbfs/FileStore in runtime version 7? I need to run commands to unzip a parquet zip file in /dbfs/FileStore/tables. This used to work in version 6.6, but Databricks' new "upgrade" breaks this simple core functionality.
Not sure if this matters but I am using the community edition of databricks.
When you run %sh ls /dbfs/FileStore/tables, you can't access /dbfs/FileStore using shell commands in Databricks runtime version 7 because, by default, the folder '/dbfs/FileStore' does not exist in DBFS.
Try uploading some files to '/dbfs/FileStore/tables'.
Now run the same command again, %sh ls /dbfs/FileStore/tables, and you will see results because we have uploaded data into the /dbfs/FileStore/tables folder.
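One quick way to create that folder and drop something into it straight from a notebook is a small dbutils sketch - the file name and contents below are just placeholders:
# create the folder and write a small file so that %sh ls has something to show
dbutils.fs.mkdirs("dbfs:/FileStore/tables")
dbutils.fs.put("dbfs:/FileStore/tables/example.txt", "hello", True)  # True = overwrite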
The /dbfs mount doesn't work on Community Edition with DBR >= 7.x - it's a known limitation.
You can work around this limitation by working with files on the driver node and uploading or downloading files using the dbutils.fs.cp command (docs). So your code will look like the following:
#write a file to local filesystem using Python I/O APIs
...
# upload file to DBFS
dbutils.fs.cp('file:/tmp/local-path', 'dbfs:/FileStore/tables/dbfs_file.txt')
and reading from DBFS will look like the following:
# copy file from DBFS to local file_system
dbutils.fs.cp('dbfs:/tmp/dbfs_file.txt', 'file:/tmp/local-path')
# read the file locally
...
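Putting the two halves together, a fuller sketch of the round trip (the file name and contents are only illustrative):
local_path = "/tmp/local-path"
dbfs_path = "dbfs:/FileStore/tables/dbfs_file.txt"

# write the file on the driver with plain Python I/O, then push it to DBFS
with open(local_path, "w") as f:
    f.write("some content\n")
dbutils.fs.cp("file:" + local_path, dbfs_path)

# later: pull it back from DBFS to the driver and read it locally
dbutils.fs.cp(dbfs_path, "file:" + local_path)
with open(local_path) as f:
    print(f.read())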
I know this question is a year old, but I wanted to share other posts that I found helpful in case someone has the same question.
I found the comments on this similar question to be helpful: How to access DBFS from shell?. The comments in the aforementioned post also reference Not able to cat dbfs file in databricks community edition cluster. FileNotFoundError: [Errno 2] No such file or directory:, which I found helpful as well.
I learned that in Community Edition, ls /dbfs/FileStore/tables is not possible because DBFS itself is not mounted on the nodes and the feature is disabled.

Uploading files into hadoop

I've recently downloaded Oracle VirtualBox and I want to take some data and import it into HDFS. I want to state that I am a complete novice when it comes to these things. I've tried copying the instructions from a Udacity course, which do NOT work.
I apologize if the terminology I'm using is not accurate.
So in my VM space I have the following files
Computer
Training's Home (Provided by Udacity)
Eclipse
Trash
Inside Training's Home, on the left-hand side under Places, I have:
training
Desktop
File System
Network
Trash
Documents
Pictures
Downloads
On the right-hand side, when I select training, there are many folders, one of which is udacity_training. When I select this there are two folders: code and data. When I select data there are two further items, something called access_log.gz and purchases.txt, which is the data I want to load into HDFS.
Copying the command entered in the Udacity tutorial, I typed:
[training#localhost ~]$ ls access_log.gz purchases.txt
This gave the error messages
ls: cannot access access_log.gz: No such file or directory
ls: cannot access purchases: No such file or directory
I then tried the next lines just to see what would happen, which were:
[training#localhost ~]$ hadoop fs -ls
[training#localhost ~]$ hadoop fs -put purchases.txt
Again, an error saying:
put: 'purchases.txt': No such file or directory
What am I doing wrong? I don't really understand command-line prompts; I think they're in Linux? So what I'm typing looks quite alien to me. I want to be able to understand what I'm typing. Could someone help me access the data, and perhaps also provide some info on where I can learn what I'm actually typing into the command line? Any help is greatly appreciated.
Please start by learning the basics of Linux and Hadoop commands.
To answer your question, try the options below.
Use the command cd /dir_name to go to the required directory, and then use:
hadoop fs -put /file_name /hdfs/path
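If you prefer to drive this from Python instead of typing the commands, here is a hedged equivalent of those two steps with subprocess, assuming the data sits under ~/udacity_training/data as described in the question:
import subprocess
from pathlib import Path

# the cd step: run hadoop from the folder that holds the file
data_dir = Path.home() / "udacity_training" / "data"   # assumed location of purchases.txt
# put the file into your HDFS home directory (mirrors the tutorial's command)
subprocess.run(["hadoop", "fs", "-put", "purchases.txt"], cwd=data_dir, check=True)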

Resources