Write/save DataFrame to Azure File Share from Azure Databricks

How do I write to an Azure File Share from Azure Databricks Spark jobs?
I configured the Hadoop storage key and value:
spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.STORAGEKEY.file.core.windows.net",
  "SECRETVALUE"
)
val wasbFileShare = s"wasbs://testfileshare@STORAGEKEY.file.core.windows.net/testPath"
df.coalesce(1).write.mode("overwrite").csv(wasbFileShare)
When I try to save the DataFrame to the Azure File Share, I see the following resource-not-found error even though the URI is present:
Exception in thread "main" org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: The requested URI does not represent any resource on the server.

Steps to connect to an Azure File Share from Databricks:
First, install the Microsoft Azure Storage File Share client library for Python using pip in Databricks: https://pypi.org/project/azure-storage-file-share/
After installing it, create a storage account. Then you can create a file share from Databricks:
from azure.storage.fileshare import ShareClient
share = ShareClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<file share name that you want to create>")
share.create_share()
This code uploads a file to the file share from Databricks:
from azure.storage.fileshare import ShareFileClient
file_client = ShareFileClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<your_fileshare_name>", file_path="my_file")
with open("./SampleSource.txt", "rb") as source_file:
    file_client.upload_file(source_file)
Refer to this link for further information: https://pypi.org/project/azure-storage-file-share/
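Tying this back to the original question, a minimal sketch of getting a DataFrame into the file share is to write the CSV to DBFS first and then upload the resulting part file with ShareFileClient. This assumes a PySpark DataFrame named df; the DBFS staging path, share name, output file name, and connection string below are placeholders, not values from the question.

import glob
from azure.storage.fileshare import ShareFileClient

# Write a single CSV part file to a DBFS staging path (placeholder path).
df.coalesce(1).write.mode("overwrite").csv("dbfs:/tmp/df_csv_out")

# Locate the part file through the local /dbfs mount of the same path.
part_file = glob.glob("/dbfs/tmp/df_csv_out/part-*.csv")[0]

# Upload it to the file share (placeholder connection string and share name).
file_client = ShareFileClient.from_connection_string(
    conn_str="<your_connection_string>",
    share_name="<your_fileshare_name>",
    file_path="output.csv")
with open(part_file, "rb") as source_file:
    file_client.upload_file(source_file)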

Related

Using Dask to load data from Azure Data Lake Gen2 with SAS Token

I'm looking for a way to load data from Azure Data Lake Gen2 using Dask. The container holds only Parquet files, and I only have the account name, account endpoint, and a SAS token.
When I use the Azure SDK FileSystemClient, I can navigate easily with only those values:
azure_file_system_client = FileSystemClient(
    account_url=endpoint,
    file_system_name="container-name",
    credential=sas_token,
)
When I try to do the same with abfs in Dask, using adlfs as the backend, as below:
ENDPOINT = f"https://{ACCOUNT_NAME}.dfs.core.windows.net"
storage_options = {'connection_string': f"{ENDPOINT}/{CONTAINER_NAME}/?{sas_token}"}

ddf = dd.read_parquet(
    f"abfs://{CONTAINER_NAME}/**/*.parquet",
    storage_options=storage_options
)
I get the following error:
ValueError: unable to connect to account for Connection string missing required connection details.
Any thoughts?
Thanks in advance :)
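As a hedged sketch of an alternative (not a confirmed fix): adlfs also accepts the account name and SAS token as separate storage_options keys instead of a connection string, which avoids building the string by hand.

import dask.dataframe as dd

# Pass the account name and SAS token separately; adlfs forwards these
# to its AzureBlobFileSystem backend.
storage_options = {
    "account_name": ACCOUNT_NAME,
    "sas_token": sas_token,
}

ddf = dd.read_parquet(
    f"abfs://{CONTAINER_NAME}/**/*.parquet",
    storage_options=storage_options,
)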

Azure databricks DBFS mount not visible

I"m trying to mount azure storage blob into azure Databricks using python notebook using below code.
mount_name = '/mnt/testMount'

if not any(mount.mountPoint == mount_name for mount in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source = "wasbs://%s@%s.blob.core.windows.net" % (container, accountName),
        mount_point = mount_name,
        extra_configs = {"fs.azure.account.key.%s.blob.core.windows.net" % (accountName): accountKey})
The mount was successful and I was able to see it using print(dbutils.fs.mounts()),
and also using the DBFS CLI on my Linux VM: dbfs ls dbfs:/mnt/testMount.
But it is not visible in the UI, nor accessible from the Python notebook: FileNotFoundError: [Errno 2] No such file or directory: '/mnt/testMount/'.
Can someone please let me know if you have faced this issue and what the fix is?
Thanks
I recommend reading the section Managed and unmanaged tables of the official document User Guide > Databases and Tables.
When you mount Azure Blob Storage to DBFS it becomes part of the Azure Databricks filesystem, and tables created over it from notebook code are unmanaged tables.
In Azure Databricks, you need to provide the path in the notebook as "dbfs:/mnt/azureblobshare/<your_file_name_with_extension>".
Example: if I upload a file "MyFile.txt", then the file path in the notebook will be
filePath = "dbfs:/mnt/azureblobshare/MyFile.txt"
This should work for you.
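For the /mnt/testMount mount from the question, a minimal sketch of how the path is addressed depending on the API (file names are placeholders): Spark and dbutils use the dbfs:/ scheme, while Python's built-in file functions go through the local /dbfs mount point.

# List and read the mount through Spark/dbutils with the dbfs:/ scheme.
display(dbutils.fs.ls("dbfs:/mnt/testMount"))
df = spark.read.csv("dbfs:/mnt/testMount/MyFile.csv", header=True)

# Python's open() needs the local /dbfs prefix rather than dbfs:/ or /mnt alone.
with open("/dbfs/mnt/testMount/MyFile.txt") as f:
    print(f.read())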

How to connect Azure Data Lake Store gen 2 File Share with Azure Databricks?

I have an Azure Data Lake Storage Gen2 account with hierarchical namespace enabled. I generated a SAS-token to the account, and I receive data to a folder in the File Share (File service). Now I want to access these files through Azure Databricks and Python. However, it seems like Azure Databricks can only access the File System (called Blob Container in Gen1), and not the File Share. I also failed to generate a SAS-token to the File System.
I wish to have a storage instance for which I can generate a SAS-token to give to my client, and access the same from Azure Databricks using Python. It does not matter whether it is a File System, File Share, ADLS Gen2 or Gen1, as long as it somehow works.
I use the following to access the File System from Databricks:
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "my_client_id",
           "fs.azure.account.oauth2.client.secret": "my_client_secret",
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/" + "My_tenant_id" + "/oauth2/token",
           "fs.azure.createRemoteFileSystemDuringInitialization": "true"}

dbutils.fs.mount(
    source = "abfss://" + "my_file_system" + "@" + "my_storage_account" + ".dfs.core.windows.net/MyFolder",
    mount_point = "/mnt/my_mount",
    extra_configs = configs)
This works fine, but I cannot make it access the File Share. I also have a SAS-token with a connection string like this:
connection_string = (
    'BlobEndpoint=https://<my_storage>.blob.core.windows.net/;' +
    'QueueEndpoint=https://<my_storage>.queue.core.windows.net/;' +
    'FileEndpoint=https://<my_storage>.file.core.windows.net/;' +
    'TableEndpoint=https://<my_storage>.table.core.windows.net/;' +
    'SharedAccessSignature=sv=2018-03-28&ss=bfqt&srt=sco&sp=rwdlacup&se=2019-09-26T17:12:38Z&st=2019-08-26T09:12:38Z&spr=https&sig=<my_sig>'
)
I manage to use it to upload things to the File Share, but not to the File System. Is there any kind of Azure storage that can be accessed with both a SAS-token and Azure Databricks?
Steps to connect to an Azure File Share from Databricks:
First, install the Microsoft Azure Storage File Share client library for Python using pip in Databricks: https://pypi.org/project/azure-storage-file-share/
After installing it, create a storage account. Then you can create a file share from Databricks:
from azure.storage.fileshare import ShareClient
share = ShareClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<file share name that you want to create>")
share.create_share()
See this for further reference on connection strings: https://learn.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string
Code to upload a file to the file share from Databricks:
from azure.storage.fileshare import ShareFileClient
file_client = ShareFileClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<your_fileshare_name>", file_path="my_file")
with open("./SampleSource.txt", "rb") as source_file:
    file_client.upload_file(source_file)
Refer to this link for further information: https://pypi.org/project/azure-storage-file-share/

save spark ML model in azure blobs

I tried saving my machine learning model in PySpark to an Azure blob, but this gives an error.
lr.save('wasbs:///user/remoteuser/models/')
Illegal Argument Exception: Cannot initialize WASB file system, URI authority not recognized.
I also tried:
m = lr.save('wasbs://'+container_name+'@'+storage_account_name+'.blob.core.windows.net/models/')
but I get "unable to identify user identity" in the stack trace.
P.S.: I am not using Azure HDInsight. I am just using Databricks and Azure Blob Storage.
To access Azure Blob Storage from Azure Databricks directly (not mounted), you have to set an account access key:
spark.conf.set(
    "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>")
or a SAS for a container. Then you should be able to access the Blob Storage:
val df = spark.read.parquet("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")
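Applied to the original PySpark model-save case, a minimal sketch (account name, access key, and container name are placeholders):

# Set the account access key, then save the model to a fully qualified
# wasbs:// path rather than 'wasbs:///...'.
spark.conf.set(
    "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>")

lr.save("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/models/")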

Error reading CSV file in Azure HDInsight Blob Store

The Jupyter notebook gives the following error:
u'Path does not exist: wasb://spk123-2018@dreamcatcher.blob.core.windows.net/data/raw-flight-data.csv;
The CSV file does exist in the blob store of Azure HDInsight.
What command are you running in Jupyter?
Check for two possible problems:
1. The file actually doesn't exist. You can check by going into the Azure portal -> storage account and checking whether the container has the file.
2. You didn't connect the storage account to the cluster when you created the cluster.
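For the first check, you can also verify from the notebook itself by listing the folder through the Hadoop FileSystem API with the fully qualified wasb:// URI (a sketch that reuses the container and account names from the error message above):

# List the target folder to confirm the CSV is really there, then read it.
path = "wasb://spk123-2018@dreamcatcher.blob.core.windows.net/data/"
jvm = spark._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(
    jvm.java.net.URI(path), spark._jsc.hadoopConfiguration())
for status in fs.listStatus(jvm.org.apache.hadoop.fs.Path(path)):
    print(status.getPath())

df = spark.read.csv(path + "raw-flight-data.csv", header=True)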
