https://docs.databricks.com/dev-tools/databricks-utils.html
I am trying to use dbutils.fs.rm in a job on Azure to delete a DBFS folder. Doing this any other way is a big pain; dbutils.fs.rm resolves all the issues, but it seems to work only in a notebook.
The issues I am having involve subfolders that contain files. I want an easy way within Python to delete a folder and all of its contents.
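One possible approach, sketched under an assumption: if the cluster running the job exposes DBFS through the local /dbfs mount (this is not the case everywhere), the standard library can remove a folder and everything under it. The helper name delete_tree and the example path are hypothetical, not part of any Databricks API.

```python
import shutil
from pathlib import Path


def delete_tree(path: str) -> bool:
    """Delete a file or a folder with all of its contents.

    Returns True if something was removed, False if the path did not exist.
    """
    target = Path(path)
    if not target.exists():
        return False
    if target.is_dir():
        # rmtree walks the directory and removes nested files and subfolders.
        shutil.rmtree(target)
    else:
        target.unlink()
    return True


# Assumed usage on a cluster where DBFS is mounted at /dbfs (hypothetical path):
# delete_tree("/dbfs/mnt/some-folder")
#
# Where dbutils is available, the equivalent call is
# dbutils.fs.rm("dbfs:/mnt/some-folder", recurse=True)
```

The function works on any local path, so it can be tried outside Databricks before pointing it at a /dbfs location.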
I'm a beginner to Spark and just picked up the highly recommended 'Spark: The Definitive Guide' textbook. While running the code examples, I came across the first example that needed me to upload the flight-data csv files provided with the book. I've uploaded the files to the following location, as shown in the screenshot:
/FileStore/tables/spark_the_definitive_guide/data/flight-data/csv
In the past I have used Azure Databricks to upload files directly onto DBFS and access them using the ls command without any issues. But now, in the Community Edition of Databricks (Runtime 9.1), I don't seem to be able to do so.
When I try to access the csv files I just uploaded into dbfs using the below command:
%sh ls /dbfs/FileStore/tables/spark_the_definitive_guide/data/flight-data/csv
I keep getting the below error:
ls: cannot access '/dbfs/FileStore/tables/spark_the_definitive_guide/data/flight-data/csv': No such file or directory
I tried finding out a solution and came across the suggested workaround of using dbutils.fs.cp() as below:
dbutils.fs.cp('C:/Users/myusername/Documents/Spark_the_definitive_guide/Spark-The-Definitive-Guide-master/data/flight-data/csv', 'dbfs:/FileStore/tables/spark_the_definitive_guide/data/flight-data/csv')
dbutils.fs.cp('dbfs:/FileStore/tables/spark_the_definitive_guide/data/flight-data/csv/', 'C:/Users/myusername/Documents/Spark_the_definitive_guide/Spark-The-Definitive-Guide-master/data/flight-data/csv/', recurse=True)
Neither of them worked. Both threw the error: java.io.IOException: No FileSystem for scheme: C
This is really blocking me from proceeding with my learning. It would be supercool if someone can help me solve this soon. Thanks in advance.
I believe you are using it the wrong way. Use it like this:
to list the data:
display(dbutils.fs.ls("/FileStore/tables/spark_the_definitive_guide/data/flight-data/"))
to copy between databricks directories:
dbutils.fs.cp("/FileStore/jars/d004b203_4168_406a_89fc_50b7897b4aa6/databricksutils-1.3.0-py3-none-any.whl","/FileStore/tables/new.whl")
To copy from your local machine you need the premium version, where you create a token and configure the databricks-cli to send files from your computer to the DBFS of your Databricks account:
databricks fs cp C:/folder/file.csv dbfs:/FileStore/folder
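The "No FileSystem for scheme: C" error in the question is consistent with the drive letter being read as a URI scheme. The sketch below only illustrates that idea with Python's own URL parser, which is an assumption for illustration and not the parser Databricks uses; the helper path_scheme is hypothetical.

```python
from urllib.parse import urlparse


def path_scheme(path: str) -> str:
    """Return the URI scheme a generic URL parser sees in a path string."""
    return urlparse(path).scheme


for candidate in (
    "C:/Users/myusername/Documents/file.csv",  # Windows drive letter
    "dbfs:/FileStore/tables/file.csv",         # explicit DBFS scheme
    "/FileStore/tables/file.csv",              # no scheme at all
):
    print(repr(path_scheme(candidate)), candidate)
```

A path starting with C: looks like a URI whose scheme is the single letter before the colon, so there is no file system registered for it on the cluster side.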
I have a fresh Azure Databricks instance that I'm doing some experimenting on. Per the Databricks documentation, I activated the DBFS File Browser in the Admin Console.
However, when browsing the DBFS root location, only FileStore, mnt and user folders are showing (see below). Reading this Databricks doc, I expected to also see databricks-datasets, databricks-results and databricks/init, but these are not showing in the GUI.
However, I am able to access e.g. databricks-datasets programmatically through a notebook command:
Does anyone know what is going on here? At first I thought it may be different since it's an instance of Azure Databricks, but the Azure Databricks documentation is exactly the same and suggests I should be able to see the same root folders.
Why can I not see some DBFS root folders in the DBFS File Browser GUI, even though I can programmatically access them?
I have the same issue. There is no folder/file appearing in the UI of Databricks at the following location: dbfs/FileStore/ even after I do an upload. But it does appear in the notebook when I run dbutils.fs.ls("/FileStore/").
However, the folders and files can be found in the UI at the following location: /FileStore/
I am trying to pass a whole directory of Python files that are referenced in the main Python file of an Azure Synapse Spark Job Definition, but the files do not appear in the location and I get a ModuleNotFoundError. I am trying to upload them like this:
abfss://[directory path in data lake]/*
You have to trick the Spark job definition by exporting it, editing it as JSON, and importing it back.
After the export, open in a text editor and add the following:
"conf": {
"spark.submit.pyFiles":
"path-to-abfss/module1.zip, path-to-abfss/module2.zip"
},
Now, import the JSON back.
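The module1.zip and module2.zip archives referenced above have to be built first. A minimal sketch of one way to do that with the standard library, assuming each archive should contain the .py files of one local folder with their relative paths preserved; the helper name zip_modules and the example paths are hypothetical.

```python
import zipfile
from pathlib import Path


def zip_modules(source_dir: str, zip_path: str) -> list[str]:
    """Pack every .py file under source_dir into zip_path.

    Returns the archive-relative names that were written.
    """
    source = Path(source_dir)
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for py_file in sorted(source.rglob("*.py")):
            # Store paths relative to source_dir so package imports still resolve.
            arcname = py_file.relative_to(source).as_posix()
            archive.write(py_file, arcname)
            added.append(arcname)
    return added


# Assumed usage (hypothetical local paths):
# zip_modules("src/module1", "module1.zip")
```

The resulting archive still has to be uploaded to the abfss location that the conf entry points at.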
The way to achieve this on Synapse is to package your Python files into a wheel package and upload the wheel package to a specific location in the Azure Data Lake Storage, from which your Spark pool will load it every time it starts. This makes the custom Python packages available to all jobs and notebooks using that Spark pool.
You can find more details on the official documentation: https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-manage-python-packages#install-wheel-files
We are trying to use the cloud hot folder functionality, and in order to do so we are modifying our existing hot-folder implementation, which was not originally built for use in the cloud.
Following the steps on this help page:
https://help.sap.com/viewer/0fa6bcf4736c46f78c248512391eb467/SHIP/en-US/4abf9290a64f43b59fbf35a3d8e5ba4d.html
We are trying to test the cloud functionality locally. I have an azurite docker container running on my machine and I have modified the mentioned properties in the local.properties file, but the files are not being picked up by hybris in any of the cases we have tried.
First we have in our local azurite storage a blob storage called hybris. Within this blob storage we have folders master>hotfolder, and according to docs uploading a sample.csv file into this should trigger a hot folder upload.
Also we have a mapping for our hot-folder import that scans the files within this folder: #{baseDirectory}/${tenantId}/sample/classifications. {baseDirectory} is configured using a property like so: ${HYBRIS_DATA_DIR}/sample/import
Can we keep these mappings within our hot folder xml definitions, or do we need to change them?
How should the blob container be named in order for it to be accessible to hybris?
Thank you very much,
I would be very happy to provide any further information.
In the end I did manage to run cloud hot folder imports on my local machine.
It was a matter of correctly configuring a number of properties that are used by cloudhotfolder and azurecloudhotfolder extensions.
Simply use the following properties to set the desired behaviour of the system:
cluster.node.groups=integration,yHotfolderCandidate
azure.hotfolder.storage.account.connection-string=DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:32770/devstoreaccount1;
azure.hotfolder.storage.container.hotfolder=${tenantId}/your/path/here
cloud.hotfolder.default.mapping.file.name.pattern=^(customer|product|url_media|sampleFilePattern|anotherFileNamePattern)-\\d+.*
cloud.hotfolder.default.images.root.url=http://127.0.0.1:32785/devstoreaccount1/${azure.hotfolder.storage.container.name}/master/path/to/media/folder
cloud.hotfolder.default.mapping.header.catalog=YourProductCatalog
And that is it. If there are existing routings for the traditional hot folder import, these can also be used, but their mappings should be included in the value of the cloud.hotfolder.default.mapping.file.name.pattern property.
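To see which file names the pattern above would accept, here is a small sketch in Python. It assumes the pattern behaves the same way under Python's re module as it does in hybris (which evaluates it as a Java regex); for a pattern this simple that seems a safe assumption. The helper name is_picked_up is hypothetical.

```python
import re

# In the .properties file the backslash is doubled ("\\d");
# the regular expression itself uses a single "\d".
FILE_NAME_PATTERN = re.compile(
    r"^(customer|product|url_media|sampleFilePattern|anotherFileNamePattern)-\d+.*"
)


def is_picked_up(file_name: str) -> bool:
    """Return True if the file name matches the hot folder mapping pattern."""
    return FILE_NAME_PATTERN.match(file_name) is not None


for name in ("product-001.csv", "customer-20240101-full.csv", "sample.csv", "product.csv"):
    print(name, is_picked_up(name))
```

Under this pattern a file needs one of the listed prefixes followed by a hyphen and at least one digit, so a file named sample.csv would not match.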
I am trying to do the same: set up a local dev environment to test out the cloud hotfolder. It seems that you have had some success. Can you tell me where you found the azurecloudhotfolder extension, which is called out here: https://help.sap.com/viewer/0fa6bcf4736c46f78c248512391eb467/SHIP/en-US/4abf9290a64f43b59fbf35a3d8e5ba4d.html
Thanks
We have a few .py files on my local machine that need to be stored on the FileStore path on DBFS. How can I achieve this?
I tried the copy actions of the dbutils.fs module.
I tried the code below but it did not work; I know something is not right with my source path. Is there a better way of doing this? Please advise.
'''
dbUtils.fs.cp ("c:\\file.py", "dbfs/filestore/file.py")
'''
It sounds like you want to copy a local file to a DBFS path on the Azure Databricks servers. However, because the Azure Databricks notebook is a browser-based interactive interface, code running in the cloud cannot directly operate on files on your local machine.
Here are some solutions you can try:
As @Jon said in the comment, you can follow the official document Databricks CLI to install the Databricks CLI locally via the Python tool command pip install databricks-cli, and then copy a file to DBFS.
Follow the official document Accessing Data to import data via "Drop files into or browse to files" in the Import & Explore Data box on the landing page, as in the figure below, although using the CLI is also recommended.
Upload your files to Azure Blob Storage, then follow the official document Data sources / Azure Blob Storage to perform operations on them, including dbutils.fs.cp.
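If a script is preferred over the CLI, another option, not mentioned in the list above, is the DBFS REST API. The sketch below only builds the request for the put endpoint and does not send it. The host, token and paths are placeholders, the helper name build_dbfs_put_request is hypothetical, and sending file contents inline like this is, as far as I know, only suitable for small files.

```python
import base64
import json
import urllib.request


def build_dbfs_put_request(host: str, token: str, local_bytes: bytes,
                           dbfs_path: str, overwrite: bool = True) -> urllib.request.Request:
    """Build (but do not send) a DBFS API 2.0 'put' request for a small file."""
    payload = {
        "path": dbfs_path,  # absolute DBFS path, e.g. /FileStore/file.py
        "contents": base64.b64encode(local_bytes).decode("ascii"),
        "overwrite": overwrite,
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/dbfs/put",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Assumed usage (placeholder workspace URL and token):
# with open("file.py", "rb") as handle:
#     request = build_dbfs_put_request(
#         "https://<your-workspace-url>", "<personal-access-token>",
#         handle.read(), "/FileStore/file.py")
# urllib.request.urlopen(request)
```

The script runs on the local machine, so it can read the local file, which code inside the notebook cannot do.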
Hope it helps.