I"m trying to mount azure storage blob into azure Databricks using python notebook using below code.
mount_name = '/mnt/testMount'
if not any(mount.mountPoint == mount_name for mount in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source = "wasbs://%s@%s.blob.core.windows.net" % (container, accountName),
        mount_point = mount_name,
        extra_configs = {"fs.azure.account.key.%s.blob.core.windows.net" % (accountName): accountKey})
The mount was successful: I can see it with print(dbutils.fs.mounts()) and also with the DBFS CLI from my Linux VM (dbfs ls dbfs:/mnt/testMount).
However, it is not visible in the UI, and it is not accessible from the Python notebook: FileNotFoundError: [Errno 2] No such file or directory: '/mnt/testMount/'.
Can someone please let me know if you have faced this issue and what the fix is?
Thanks
I recommend reading the section Managed and unmanaged tables of the official document User Guide > Databases and Tables.
When you mount Azure Blob Storage to DBFS, it becomes part of the Azure Databricks file system; tables you create on it from notebook code are unmanaged tables.
In Azure Databricks, you need to provide the path in the notebook as "dbfs:/mnt/azureblobshare/<your_file_name_with_extension>".
Example: if I upload a file "MyFile.txt", then the file path in the notebook will be
filePath="dbfs:/mnt/azureblobshare/MyFile.txt"
This should work for you.
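For illustration, a minimal sketch (reusing the /mnt/testMount mount point from the question; the file name is hypothetical): dbutils and Spark understand the dbfs:/ scheme directly, while plain Python file APIs go through the local /dbfs mount, which is often the cause of a FileNotFoundError like the one above.
# Sketch only: mount point taken from the question, file name is hypothetical.
display(dbutils.fs.ls("dbfs:/mnt/testMount/"))            # dbutils understands dbfs:/
df = spark.read.text("dbfs:/mnt/testMount/MyFile.txt")    # so does Spark

# Plain Python file APIs (open, os, pandas) need the /dbfs fuse prefix instead.
with open("/dbfs/mnt/testMount/MyFile.txt", "r") as f:
    print(f.readline())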
My Azure web app needs to download 1000+ very small files from a blob storage directory and process them.
If I list them and then download them one by one, it takes ages... Is there a faster way to do it, such as downloading them all together?
PS: I use the following code:
import pickle
from azure.storage.blob import ContainerClient, BlobClient

blob_list = ...  # list all files in a blob storage directory
for blob in blob_list:
    blob_client = BlobClient.from_connection_string(connection_string, container_name, blob)
    downloader = blob_client.download_blob(0)
    blob = pickle.loads(downloader.readall())
I would also point out that since you are using azure-batch, you could use the blob mount configuration for your Linux VMs. The idea is to mount the container to the VM, which removes the download time entirely because the data appears as a drive attached to the VM.
Docs: https://learn.microsoft.com/en-us/azure/batch/virtual-file-mount
Python SDK reference: https://learn.microsoft.com/en-us/python/api/azure-batch/azure.batch.models.mountconfiguration?view=azure-python
Blob file system configuration: https://learn.microsoft.com/en-us/python/api/azure-batch/azure.batch.models.azureblobfilesystemconfiguration?view=azure-python
Key thing (just for knowledge): under the hood, the blob file system mount uses the blobfuse driver: https://learn.microsoft.com/en-us/azure/batch/virtual-file-mount#azure-blob-file-system
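For reference, a minimal sketch of what the pool-level mount could look like with the azure-batch Python SDK (this is not the poster's code; account, container, and key values are placeholders, and the full list of fields is in the docs above):
from azure.batch import models as batchmodels

# Sketch: mount a blob container on every node of a Batch pool via blobfuse.
blob_mount = batchmodels.MountConfiguration(
    azure_blob_file_system_configuration=batchmodels.AzureBlobFileSystemConfiguration(
        account_name="mystorageaccount",   # placeholder storage account
        container_name="mycontainer",      # placeholder container
        account_key="<account-key>",       # or use sas_key=... instead
        relative_mount_path="data",        # appears under AZ_BATCH_NODE_MOUNTS_DIR/data on each node
    )
)
# Pass [blob_mount] as mount_configuration to PoolAddParameter when creating the pool.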
Thanks, and I hope this helps.
I used Azure Databricks for a similar problem. You can easily mount Azure storage accounts (e.g. ADLS Gen2) in Databricks and then deal with the storage files like local files. You can either copy the files or run your processing/transformation directly, without downloading them at all.
You can find the Databricks mount steps in this LINK.
In Databricks you can also use the dbutils functions to get OS-like access to your files after mounting your ADLS; a small sketch follows.
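For example (mount point and file names here are hypothetical):
# Sketch: shell-like file operations on a mounted path via dbutils; names are hypothetical.
files = dbutils.fs.ls("/mnt/myadls/input/")                               # list files in place
dbutils.fs.cp("/mnt/myadls/input/a.pkl", "/mnt/myadls/archive/a.pkl")     # copy without downloading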
I hope this approach helps.
How do I write to an Azure file share from Azure Databricks Spark jobs?
I configured the Hadoop storage key and values:
spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.STORAGEKEY.file.core.windows.net",
  "SECRETVALUE"
)

val wasbFileShare = s"wasbs://testfileshare@STORAGEKEY.file.core.windows.net/testPath"

df.coalesce(1).write.mode("overwrite").csv(wasbFileShare)
When I try to save the dataframe to the Azure file share, I see the following resource-not-found error, although the URI exists:
Exception in thread "main" org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: The requested URI does not represent any resource on the server.
Steps to connect to an Azure file share from Databricks:
First, install the Microsoft Azure Storage File Share client library for Python in Databricks using pip: https://pypi.org/project/azure-storage-file-share/
After installing it, create a storage account. Then you can create a file share from Databricks:
from azure.storage.fileshare import ShareClient
share = ShareClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<file share name that you want to create>")
share.create_share()
This code uploads a file into the file share from Databricks:
from azure.storage.fileshare import ShareFileClient
file_client = ShareFileClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<your_fileshare_name>", file_path="my_file")
with open("./SampleSource.txt", "rb") as source_file:
file_client.upload_file(source_file)
Refer to this link for further information: https://pypi.org/project/azure-storage-file-share/
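Tying this back to the question, a hedged sketch of writing the dataframe out first and then pushing the resulting CSV to the file share (the output path, share name, and file names are hypothetical):
from azure.storage.fileshare import ShareFileClient

# 1) Write a single CSV part file to DBFS.
df.coalesce(1).write.mode("overwrite").option("header", "true").csv("dbfs:/tmp/df_out")
part = [f.path for f in dbutils.fs.ls("dbfs:/tmp/df_out") if f.name.startswith("part-")][0]

# 2) Upload that part file to the file share through the local /dbfs mount.
file_client = ShareFileClient.from_connection_string(
    conn_str="<connection_string>", share_name="<your_fileshare_name>", file_path="my_output.csv")
with open(part.replace("dbfs:", "/dbfs"), "rb") as source_file:
    file_client.upload_file(source_file)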
I am trying to connect Databricks to my blob containers in Azure Data Lake Storage Gen2.
I can't find what my file-system-name or my storage-account-name is anywhere for the connection:
dbutils.fs.ls("abfss://file-system-name#storage-account-name.dfs.core.windows.net/")
Thanks. If someone could reference an example, that would be great.
Oh man! They could have provided better documentation with an example; it took some time to click.
file-system-name means the container name. For instance, if you have a container 'data' created in a storage account "myadls2021", it would look like below:
val data = spark.read
  .option("header", "true")
  .csv("abfss://data@myadls2021.dfs.core.windows.net/Baby_Names__Beginning_2007.csv")
I would recommend following this documentation:
https://docs.databricks.com/data/data-sources/azure/azure-datalake-gen2.html#azure-data-lake-storage-gen2
It explains how to access and/or mount Azure Data Lake Storage Gen2 from Databricks.
file-system-name is the container name, and storage-account-name is the Azure storage account name.
dbutils.fs.mount(
    source = "abfss://raw@storageaccadls01.dfs.core.windows.net/",
    mount_point = "/mnt/raw",
    extra_configs = configs)
You can then access the storage account files from the DBFS mount point location.
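For completeness, a sketch of the configs dict used above (service-principal OAuth for ABFS; the client id, secret, and tenant id are placeholders):
# Sketch: OAuth (service principal) settings for ABFS; all values are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": "<client-secret>",
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}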
I have an Azure Data Lake Storage Gen2 account with hierarchical namespace enabled. I generated a SAS token for the account, and I receive data into a folder in the File Share (File Service). Now I want to access these files through Azure Databricks and Python. However, it seems that Azure Databricks can only access the File System (called Blob Container in Gen1), not the File Share. I also failed to generate a SAS token for the File System.
I wish to have a storage instance for which I can generate a SAS token to give to my client, and which I can access from Azure Databricks using Python. It does not matter whether it is a File System, File Share, ADLS Gen2 or Gen1, as long as it somehow works.
I use the following to access the File System from Databricks:
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "my_client_id",
"fs.azure.account.oauth2.client.secret": "my_client_secret",
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/"+"My_tenant_id" +"/oauth2/token",
"fs.azure.createRemoteFileSystemDuringInitialization": "true"}
dbutils.fs.mount(source = "abfss://"+"my_file_system"+"#"+"my_storage_account"+".dfs.core.windows.net/MyFolder",
mount_point = "/mnt/my_mount",
extra_configs = configs)
This works fine, but I cannot make it access the File Share. I also have a SAS token with a connection string like this:
connection_string = (
    'BlobEndpoint=https://<my_storage>.blob.core.windows.net/;' +
    'QueueEndpoint=https://<my_storage>.queue.core.windows.net/;' +
    'FileEndpoint=https://<my_storage>.file.core.windows.net/;' +
    'TableEndpoint=https://<my_storage>.table.core.windows.net/;' +
    'SharedAccessSignature=sv=2018-03-28&ss=bfqt&srt=sco&sp=rwdlacup&se=2019-09-26T17:12:38Z&st=2019-08-26T09:12:38Z&spr=https&sig=<my_sig>'
)
I manage to use this to upload things to the File Share, but not to the File System. Is there any kind of Azure storage that can be accessed with both a SAS token and Azure Databricks?
Steps to connect to an Azure file share from Databricks:
First, install the Microsoft Azure Storage File Share client library for Python in Databricks using pip: https://pypi.org/project/azure-storage-file-share/
After installing it, create a storage account. Then you can create a file share from Databricks:
from azure.storage.fileshare import ShareClient
share = ShareClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<file share name that you want to create>")
share.create_share()
See this for further reference on connection strings: https://learn.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string
Code to upload a file into the file share from Databricks:
from azure.storage.fileshare import ShareFileClient
file_client = ShareFileClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<your_fileshare_name>", file_path="my_file")
with open("./SampleSource.txt", "rb") as source_file:
file_client.upload_file(source_file)
Refer to this link for further information: https://pypi.org/project/azure-storage-file-share/
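Since the question is also about reading the files that arrive in the File Share from Databricks, here is a hedged sketch of the download direction with the same client library (share name, file path, and local path are hypothetical):
import pandas as pd
from azure.storage.fileshare import ShareFileClient

# Sketch: pull one file from the file share onto the driver's local disk, then read it.
file_client = ShareFileClient.from_connection_string(
    conn_str="<connection_string>", share_name="<your_fileshare_name>", file_path="incoming/data.csv")
with open("/tmp/data.csv", "wb") as target_file:
    target_file.write(file_client.download_file().readall())

pdf = pd.read_csv("/tmp/data.csv")   # e.g. continue with spark.createDataFrame(pdf)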
I am manipulating some data using Azure Databricks. The data lives in an Azure Data Lake Storage Gen1 account. I mounted the data into DBFS, but now, after transforming it, I would like to write it back to my data lake.
To mount the data I used the following:
configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
"dfs.adls.oauth2.client.id": "<your-service-client-id>",
"dfs.adls.oauth2.credential": "<your-service-credentials>",
"dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/<your-directory-id>/oauth2/token"}
dbutils.fs.mount(source = "adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>", mount_point = "/mnt/<mount-name>",extra_configs = configs)
I want to write back a .csv file. For this task I am using the following line:
dfGPS.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>")
However, I get the following error:
IllegalArgumentException: u'No value for dfs.adls.oauth2.access.token.provider found in conf file.'
Is there any piece of code that can help me, or a link that walks me through this?
Thanks.
If you mount Azure Data Lake Store, you should use the mount point to store your data, instead of "adl://...". For details on how to mount Azure Data Lake Store (ADLS) Gen1, see the Azure Databricks documentation. You can verify that the mount point works with:
dbutils.fs.ls("/mnt/<newmountpoint>")
So, after mounting ADLS Gen1, try:
dfGPS.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("mnt/<mount-name>/<your-directory-name>")
This should work if you added the mount point properly and the service principal also has access rights on the ADLS account.
Spark always writes multiple files into the output directory, because each partition is saved individually.
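If a single CSV file is needed rather than a directory of part files, one sketch (still using the placeholder mount and directory names from above) is to coalesce to one partition and then copy the resulting part file:
# Sketch: produce one part file, then give it a stable name with dbutils.
out_dir = "/mnt/<mount-name>/<your-directory-name>/gps_out"
dfGPS.coalesce(1).write.mode("overwrite").option("header", "true").csv(out_dir)

part = [f.path for f in dbutils.fs.ls(out_dir) if f.name.startswith("part-")][0]
dbutils.fs.cp(part, "/mnt/<mount-name>/<your-directory-name>/gps_out.csv")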