I am trying to understand the way Databricks stores files and I am a bit unsure of what the difference is between dbfs:/ and file:/ (see image below)
From what I have been able to deduce from here, file:/ seems to be where external files downloaded via curl/wget end up, under the following folder path:
%fs ls "file:/databricks/driver"
But what is file:/ really, why does it exist, and how is it different from dbfs:/?
For the record, I am using the free Community Edition of Databricks.
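For context, the comparison I keep coming back to looks roughly like this (just a sketch; dbutils and display are available by default in a Databricks notebook):
# dbfs:/ is the distributed Databricks File System root
display(dbutils.fs.ls("dbfs:/"))
# file:/ is the local filesystem of the driver node (where my curl/wget downloads land)
display(dbutils.fs.ls("file:/databricks/driver"))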
Related
I'm a beginner to Spark and just picked up the highly recommended 'Spark: The Definitive Guide' textbook. While running the code examples, I came across the first one that needed me to upload the flight-data CSV files provided with the book. I've uploaded the files to the following location, as shown in the screenshot:
/FileStore/tables/spark_the_definitive_guide/data/flight-data/csv
In the past I've used Azure Databricks to upload files directly onto DBFS and access them with the ls command without any issues. But now, in the Community Edition of Databricks (Runtime 9.1), I don't seem to be able to do so.
When I try to access the CSV files I just uploaded into DBFS using the command below:
%sh ls /dbfs/FileStore/tables/spark_the_definitive_guide/data/flight-data/csv
I keep getting the below error:
ls: cannot access '/dbfs/FileStore/tables/spark_the_definitive_guide/data/flight-data/csv': No such file or directory
I tried finding out a solution and came across the suggested workaround of using dbutils.fs.cp() as below:
dbutils.fs.cp('C:/Users/myusername/Documents/Spark_the_definitive_guide/Spark-The-Definitive-Guide-master/data/flight-data/csv', 'dbfs:/FileStore/tables/spark_the_definitive_guide/data/flight-data/csv')
dbutils.fs.cp('dbfs:/FileStore/tables/spark_the_definitive_guide/data/flight-data/csv/', 'C:/Users/myusername/Documents/Spark_the_definitive_guide/Spark-The-Definitive-Guide-master/data/flight-data/csv/', recurse=True)
Neither of them worked. Both threw the error: java.io.IOException: No FileSystem for scheme: C
This is really blocking me from proceeding with my learning. It would be super cool if someone could help me solve this soon. Thanks in advance.
I believe the way you are trying to use it is wrong; use it like this.
To list the data:
display(dbutils.fs.ls("/FileStore/tables/spark_the_definitive_guide/data/flight-data/"))
To copy between Databricks directories:
dbutils.fs.cp("/FileStore/jars/d004b203_4168_406a_89fc_50b7897b4aa6/databricksutils-1.3.0-py3-none-any.whl","/FileStore/tables/new.whl")
For a local copy you need the premium version, where you create a token and configure the databricks-cli to send files from your computer to the DBFS of your Databricks account:
databricks fs cp C:/folder/file.csv dbfs:/FileStore/folder
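Once the files are actually in DBFS (whether through the Import UI or the CLI), they can be read straight from the dbfs:/ path with Spark. A minimal sketch, reusing the path from the question and assuming the notebook's built-in spark and display objects:
# Read the uploaded flight-data CSVs directly from DBFS
df = spark.read.csv(
    "dbfs:/FileStore/tables/spark_the_definitive_guide/data/flight-data/csv/",
    header=True,
    inferSchema=True,
)
display(df.limit(5))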
I have a fresh Azure Databricks instance that I'm doing some experimenting on. Per the Databricks documentation, I activated the DBFS File Browser in the Admin Console.
However, when browsing the DBFS root location, only FileStore, mnt and user folders are showing (see below). Reading this Databricks doc, I expected to also see databricks-datasets, databricks-results and databricks/init, but these are not showing in the GUI.
However, I am able to access e.g. databricks-datasets programmatically through a notebook command:
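For example, a command along these lines (an illustrative sketch, not the exact cell) returns the listing without any problem:
# Listing the databricks-datasets root works fine from a notebook
display(dbutils.fs.ls("dbfs:/databricks-datasets"))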
Does anyone know what is going on here? At first I thought it may be different since it's an instance of Azure Databricks, but the Azure Databricks documentation is exactly the same and suggests I should be able to see the same root folders.
Why can I not see some DBFS root folders in the DBFS File Browser GUI, even though I can programmatically access them?
I have the same issue. No folders or files appear in the Databricks UI at the following location: dbfs/FileStore/, even after I do an upload. But they do appear in the notebook when I run dbutils.fs.ls("/FileStore/").
However, the folders and files can be found in the UI at the following location: /FileStore/
I followed the guidance in this wiki page to create an application package for my Azure Batch pool, but now my nodes are stuck in an unusable state because it fails to unzip. I can't find anything in the documentation that talks about what kind of compressed file is acceptable here, other than "a zip file".
I have a collection of database files used for some genomic sequencing tools, stored in a folder structure. I created a compressed archive from it using tar -zvcf and gave it a .zip extension. That did not work, so I tried uploading the same file with a .tar.gz extension, and it also failed.
The Batch Node is running the CentOS image Azure Batch recommends for container applications, and my startup task is not running in the context of the container.
Can anyone point me to documentation or personal experience that helps clarify what kind of files can be used for this? Thank you in advance!
Yes, you are correct, but let me clear up the confusion: tar is a different archive format from zip (more detail here: What is the difference between tar and zip?). As mentioned several times in the documentation you linked, the Batch application package feature only supports the *.zip format, so simply changing the file extension from *.tar.gz to *.zip is not the right approach; they are two different ways of compressing the data.
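If you need to rebuild the archive as a genuine .zip, one option is Python's standard library (a rough sketch; "genomic_db" is a placeholder name for your database folder):
import shutil

# Pack the contents of ./genomic_db into genomic_db.zip (a real zip archive,
# not a renamed tar.gz), which the application package feature can extract
shutil.make_archive("genomic_db", "zip", root_dir="genomic_db")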
Extra docs:
https://azure.microsoft.com/en-au/blog/application-packages-and-task-dependencies-now-available-on-azure-batch/
https://kb.winzip.com/help/winzip/AboutZIPsAndOtherArchives_4.htm
Thanks and hope it helps.
We have a few .py files on my local machine that need to be stored/saved under the FileStore path on DBFS. How can I achieve this?
I tried the copy actions from the dbutils.fs module.
I tried the code below, but it did not work; I know something is not right with my source path. Or is there a better way of doing this? Please advise.
'''
dbUtils.fs.cp ("c:\\file.py", "dbfs/filestore/file.py")
'''
It sounds like you want to copy a file from your local machine to a DBFS path on the Azure Databricks servers. However, because the Azure Databricks notebook is a browser-based interactive interface, code running in the cloud cannot directly operate on files on your local machine.
So here are the solutions you can try.
1. As @Jon said in the comment, you can follow the official document Databricks CLI to install the Databricks CLI locally via the Python tooling command pip install databricks-cli, and then copy a file to DBFS.
2. Follow the official document Accessing Data to import data via Drop files into or browse to files in the Import & Explore Data box on the landing page, though using the CLI is also recommended, as in the figure below.
3. Upload the files to Azure Blob Storage, then follow the official document Data sources / Azure Blob Storage to run operations such as dbutils.fs.cp, as sketched after this list.
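A rough sketch of option 3, where <storage-account>, <container>, and <access-key> are placeholders for your own values:
# Give the cluster access to the storage account (access-key pattern for wasbs)
spark.conf.set(
    "fs.azure.account.key.<storage-account>.blob.core.windows.net",
    "<access-key>",
)
# Copy the uploaded .py file from Blob Storage into DBFS
dbutils.fs.cp(
    "wasbs://<container>@<storage-account>.blob.core.windows.net/file.py",
    "dbfs:/FileStore/file.py",
)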
Hope it helps.
We are moving from Pentaho 3.8 to Pentaho 7.1, quite an upgrade. :)
However, many things have changed, so I need some help every now and then. On 3.8 we had a folder on the HDD where all our reports were stored. I am quite used to managing this folder through SVN, so I was trying to do it the same way on Pentaho 7.1, but it's not working.
At first I switched pentaho-server/pentaho-solutions/system/jackrabbit/repository.xml back from the postgres to the FileSystem settings.
However, it did not work. I could not find the folders created through the web app anywhere on the HDD.
As a next step, I tried to create a folder on the HDD, located in pentaho-server/pentaho-solutions/. I also added an index.xml file so it would be recognized, and refreshed/restarted everything I could find in Pentaho, including Pentaho itself. I still can't see this folder in the web app.
Now I am searching for the right location to maintain those files, but there are so many possibilities that I could spend days working on it.
Can someone give me a hint, or has anyone done something similar?
My system is Linux, and I use the Community Edition of pentaho-server.
Since version 5, Pentaho only uses Jackrabbit for storing the repository. There is no longer a physical copy on your hard drive.
Your best shot is using CBF2 and the import/export scripts to sync the Jackrabbit repository with a folder on your drive, which you can then manage with SVN.
CBF 2 blog post