AWS S3 to Databricks mount is not working - databricks

I have mounted 'mybucket' using the mount commands and I was able to list all the objects using the command below:
%fs
ls /mnt/mybucket/
However, I have folders inside folders in 'mybucket' and I want to run the command below, but it is not working:
%fs
ls /mnt/mybucket/*/*/
Any help is much appreciated. Thanks

The dbutils.fs.ls and its magic variant %fs ls don't support wildcards, so you need to iterate over the files yourself, with something like this:
def list_files(path, max_level=1, cur_level=0):
    # Recursively list files under `path`, descending at most `max_level` directory levels.
    d = dbutils.fs.ls(path)
    for i in d:
        if i.name.endswith("/") and i.size == 0 and cur_level < (max_level - 1):
            # A zero-size entry ending in "/" is a folder: recurse one level deeper.
            yield from list_files(i.path, max_level, cur_level + 1)
        else:
            yield i.path

files = list_files("/mnt/mybucket", 1)
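Note that list_files returns a generator, and max_level controls how deep it descends. A sketch (not tested against your bucket) that would cover the two nested folder levels from the question:

# Descend up to three levels and materialize the generator into a concrete list of paths.
files = list(list_files("/mnt/mybucket", max_level=3))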

If you attempt to create a mount point within an existing mount point, for example:
Mount one storage account to /mnt/storage1
Mount a second storage account to /mnt/storage1/storage2
This will fail because nested mounts are not supported in Databricks. The recommended approach is to create a separate mount point for each storage object.
For example:
Mount one storage account to /mnt/storage1
Mount a second storage account to /mnt/storage2
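For illustration, a minimal sketch of the separate-mounts approach, assuming two Azure Blob Storage containers whose account keys are held in the hypothetical variables account1_key and account2_key:

# Mount each storage account under its own mount point; nothing is nested.
dbutils.fs.mount(
    source="wasbs://container1@account1.blob.core.windows.net",
    mount_point="/mnt/storage1",
    extra_configs={"fs.azure.account.key.account1.blob.core.windows.net": account1_key},
)
dbutils.fs.mount(
    source="wasbs://container2@account2.blob.core.windows.net",
    mount_point="/mnt/storage2",
    extra_configs={"fs.azure.account.key.account2.blob.core.windows.net": account2_key},
)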

Another option is to unmount and mount again:
dbutils.fs.unmount("/mnt/mount_name")
dbutils.fs.mount("s3a://%s" % aws_bucket_name, "/mnt/%s" % mount_name)

Related

How to open index.html file in Databricks or browser?

I am trying to open an index.html file through Databricks. Can someone please let me know how to deal with it? I am trying to use GX with Databricks, and currently Databricks stores this file here: dbfs:/great_expectations/uncommitted/data_docs/local_site/index.html. I want to send the index.html file to stakeholders.
I suspect that you need to copy the whole folder, as there should be images, etc. The simplest way to do that is to use the Databricks CLI fs cp command to access DBFS and copy the files to local storage, like this:
databricks fs cp -r 'dbfs:/.....' local_name
To open the file directly in the notebook, you can use something like this (note that dbfs:/ should be replaced with /dbfs/):
with open("/dbfs/...", "r") as f:
    data = "".join([l for l in f])
displayHTML(data)
but this will break links to images. Alternatively, you can follow this approach to display the Data Docs inside the notebook.

List all file names located in an Azure Blob Storage

I want to list in Databricks all the file names located in an Azure Blob Storage account.
My Azure Blob Storage is structured like this:
aaa
------bbb
------------bbb1.xml
------------bbb2.xml
------ccc
------------ccc1.xml
------------ccc2.xml
------------ccc3.xml
If I do:
dbutils.fs.ls('wasbs://xxx@xxx.blob.core.windows.net/aaa')
only the subfolders bbb and ccc are listed, like this:
[FileInfo(path='wasbs://xxx@xxx.blob.core.windows.net/aaa/bbb/', name='bbb/', size=0),
FileInfo(path='wasbs://xxx@xxx.blob.core.windows.net/aaa/ccc/', name='ccc/', size=0)]
I want to go down to the last subfolder level and see all the file names located under aaa: bbb1.xml, bbb2.xml, ccc1.xml, ccc2.xml and ccc3.xml.
If I do:
dbutils.fs.ls('wasbs://xxx@xxx.blob.core.windows.net/aaa/*')
an error occurs because the path cannot be parameterized.
Any idea to do this in Databricks?
dbutils.fs.ls doesn't support wildcards, which is why you're getting an error. You have a few choices:
Use the Python SDK for Azure Blob Storage to list the files - it could be faster than a recursive dbutils.fs.ls, but you will need to set up authentication, etc.
You can make recursive calls to dbutils.fs.ls, using a function like the one below, but it's not very performant:
def list_files(path, max_level=1, cur_level=0):
    """
    Lists files under the given path, recursing up to the max_level
    """
    d = dbutils.fs.ls(path)
    for i in d:
        if i.name.endswith("/") and i.size == 0 and cur_level < (max_level - 1):
            yield from list_files(i.path, max_level, cur_level + 1)
        else:
            yield i.path
You can use the Hadoop API to access the files in your container, similar to this answer.
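For reference, a rough sketch of the Hadoop FileSystem route from a notebook (the container and account names and the authentication setup are assumptions; the JVM objects come from the active Spark session):

# Get the Hadoop FileSystem behind the wasbs path and list the files recursively.
hadoop_path = spark._jvm.org.apache.hadoop.fs.Path("wasbs://xxx@xxx.blob.core.windows.net/aaa")
fs = hadoop_path.getFileSystem(spark._jsc.hadoopConfiguration())
it = fs.listFiles(hadoop_path, True)  # True = recurse into subfolders
while it.hasNext():
    print(it.next().getPath().toString())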

file transfer from DBFS to Azure Blob Storage

I need to transfer the files in the DBFS path below:
%fs ls /FileStore/tables/26AS_report/customer_monthly_running_report/parts/
to the Azure Blob Storage location below:
dbutils.fs.ls("wasbs://"+blob.storage_account_container+"#"
+ blob.storage_account_name+".blob.core.windows.net/")
What series of steps should I follow? Please suggest.
The simplest way would be to load the data into a dataframe and then write that dataframe to the target.
df = spark.read.format(format).load("dbfs:/FileStore/tables/26AS_report/customer_monthly_running_report/parts/*")
df.write.format(format).save("wasbs://"+blob.storage_account_container+"@" + blob.storage_account_name+".blob.core.windows.net/")
You will have to replace "format" with the source file format and the format you want in the target folder.
Keep in mind that if you do not want to do any transformations to the data but just move it, it will most likely be more efficient not to use PySpark but to use the AzCopy command-line tool instead. You can also run that in Databricks with the %sh magic command if needed.
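For reference, a rough sketch of the AzCopy route from a notebook cell (the SAS token, container name, and the assumption that the azcopy binary is installed on the driver are all illustrative):

%sh
# Copy the folder from the DBFS local mount to the target container using a SAS URL.
azcopy copy "/dbfs/FileStore/tables/26AS_report/customer_monthly_running_report/parts" \
  "https://<storage_account>.blob.core.windows.net/<container>/?<sas_token>" --recursive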

How to copy multiple blob containers from one storage account to another storage account (different subscriptions)

I have to copy multiple blob containers (each container has multiple files) from one storage account to another storage account. The hierarchy is below:
Container 1
Folder 1
file x
file y
Folder 2
file x
file y
Container 2
Folder 1
file x
file y
Folder 2
file x
file y
(I have around 50 containers.)
Here's what I've tried:
a) Used an ADF template. The copy operation is unable to copy the data (Folder 1, Folder 2 and the files under the folders) inside the containers.
b) AzCopy - cannot use it, since it doesn't copy archived files.
Is there any other way to perform this operation?
In Data Factory, we can copy all the folders/files in one container to another container using the Binary format. I have answered the same question on Stack Overflow; you can search for it easily.
The settings look like those shown in the screenshot there (not reproduced here).
But we cannot auto-create the containers in the sink with all their folders and files! Data Factory doesn't support creating the containers.
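One way around that limitation would be to pre-create the destination containers before running the copy; a hedged sketch using the azure-storage-blob Python SDK, assuming connection strings for both accounts are available (the variable names are illustrative):

from azure.storage.blob import BlobServiceClient

src = BlobServiceClient.from_connection_string(source_connection_string)       # hypothetical variable
dst = BlobServiceClient.from_connection_string(destination_connection_string)  # hypothetical variable

# Create every source container name in the destination account so the Data Factory binary copy has targets.
for container in src.list_containers():
    dst.create_container(container.name)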

How to find the RELATIVE PATH of a BLOB?

I want to integrate my Blob Storage with Azure Databricks. I found this in the Azure documentation as part of the connection setup. Can someone help me with where I can find blob_relative_path?
blob_account_name = "azureopendatastorage"
blob_container_name = "citydatacontainer"
blob_relative_path = "Safety/Release/city=Boston"
blob_sas_token = r"?st=2019-02-26T02%3A34%3A32Z&se=2119-02-27T02%3A34%3A00Z&sp=rl&sv=2018-03-28&sr=c&sig=XlJVWA7fMXCSxCKqJm8psMOh0W4h7cSYO28coRqF2fs%3D"
The relative path of a blob is the path to the file of interest in your blob container. For instance, consider the following hierarchy:
Home > [blob_account_name] > [blob_container_name] > fakeDirectory/fakeSubDirectory/file.csv
The path to your file of interest in the identified container (i.e. fakeDirectory/fakeSubDirectory/file.csv) is the blob_relative_path.
Hope this helps!
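For illustration, a sketch of how those four values are typically combined to read the data from a notebook (the wasbs URI layout and the SAS config key follow the standard pattern; treat the read format as an assumption):

# Register the SAS token for the container, then build the full wasbs path from the parts above.
spark.conf.set(
    f"fs.azure.sas.{blob_container_name}.{blob_account_name}.blob.core.windows.net",
    blob_sas_token)
wasbs_path = f"wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/{blob_relative_path}"
df = spark.read.parquet(wasbs_path)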
