I tried saving my machine learning model in PySpark to Azure Blob Storage, but it gives an error.
lr.save('wasbs:///user/remoteuser/models/')
IllegalArgumentException: Cannot initialize WASB file system, URI authority not recognized.
I also tried:
m = lr.save('wasbs://'+container_name+'@'+storage_account_name+'.blob.core.windows.net/models/')
But that gives "unable to identify user identity" in the stack trace.
P.S.: I am not using Azure HDInsight. I am just using Databricks and Azure Blob Storage.
To access Azure Blob Storage from Azure Databricks directly (not mounted), you have to set an account access key:
spark.conf.set(
"fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
"<your-storage-account-access-key>")
or a SAS for a container. Then you should be able to access the Blob Storage:
val df = spark.read.parquet("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")
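Putting the two pieces together for the original question, a minimal PySpark sketch (the names are placeholders and lr is assumed to be the model from the question):
spark.conf.set(
    "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>")
# With the account key set, the model can be saved to a fully qualified wasbs URI
# (note the @ between container and storage account)
lr.save("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/models/")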
I'm trying to learn Spark, Databricks & Azure.
I'm trying to access ADLS Gen2 from Databricks using PySpark.
I can't find the proper way; I believe it's super simple, but I haven't managed it.
Currently each time I receive the following:
Unable to access container {name} in account {name} using anonymous
credentials, and no credentials found for them in the configuration.
I already have Gen2 running, plus I have a SAS URI for access.
What I have tried so far:
(based on this link: https://learn.microsoft.com/pl-pl/azure/databricks/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sas-access):
spark.conf.set(f"fs.azure.account.auth.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", {SAS_URI})
spark.conf.set(f"fs.azure.sas.token.provider.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", {SAS_URI})
Then to reach out to data:
sd_xxx = spark.read.parquet(f"wasbs://{CONTAINER_NAME}@{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net/{proper_path_to_files/}")
Your configuration is incorrect. The first property should be set to just the value SAS, while the second should be set to the name of a Scala/Java class that returns the SAS token; you can't just use a URI with the SAS information in it, you need to implement some custom code.
If you want to use wasbs, the protocol for accessing Azure Blob Storage, it could also be used for accessing ADLS Gen2 (although that's not recommended), but then you need to use blob.core.windows.net instead of dfs.core.windows.net, and also set the correct Spark property for Azure Blob access.
The more common procedure to follow is here: Access Azure Data Lake Storage Gen2 using OAuth 2.0 with an Azure service principal
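For completeness, a minimal sketch of that wasbs-based alternative (placeholders follow the question's naming; SAS_TOKEN here is a hypothetical variable holding the raw SAS token string, not a full URI):
# Container-scoped SAS set against the blob endpoint, then read via wasbs
spark.conf.set(
    f"fs.azure.sas.{CONTAINER_NAME}.{STORAGE_ACCOUNT_NAME}.blob.core.windows.net",
    SAS_TOKEN)
sd_xxx = spark.read.parquet(
    f"wasbs://{CONTAINER_NAME}@{STORAGE_ACCOUNT_NAME}.blob.core.windows.net/{proper_path_to_files}")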
I am trying to access data files stored in ADLS location via Azure Databricks using storage account access keys.
To access data files, I am using python notebook in azure databricks and below command works fine,
spark.conf.set(
"fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
"<access-key>"
)
However, when I try to list the directory using below command, it throws an error
dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net")
ERROR:
ExecutionError: An error occurred while calling z:com.databricks.backend.daemon.dbutils.FSUtils.ls.
: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, GET, https://<storage-account-name>.dfs.core.windows.net/<container-name>?upn=false&resource=filesystem&maxResults=500&timeout=90&recursive=false, AuthorizationPermissionMismatch, "This request is not authorized to perform this operation using this permission. RequestId:<request-id> Time:2021-08-03T08:53:28.0215432Z"
I am not sure what permission it requires or how to proceed.
Also, I am using ADLS Gen2 and Azure Databricks (Trial - premium).
Thanks in advance!
The complete config key is called "spark.hadoop.fs.azure.account.key.adlsiqdigital.dfs.core.windows.net"
However, it would be beneficial for a production environment to use a service account and a mount point. This way, the actions on the storage can be traced back to this application more easily than with just the generic access key, and the mount point avoids specifying the connection string everywhere in your code.
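As an illustration of that approach, a mount backed by a service principal typically looks roughly like the sketch below (client id, secret scope, tenant id, container, and mount name are all placeholders; the authoritative steps are in the Databricks documentation on OAuth 2.0 with a service principal):
# Sketch: mount ADLS Gen2 through a service principal instead of the account key
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-client-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"}
dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs)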
Try this out.
spark.conf.set("fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net","<your-storage-account-access-key>")
dbutils.fs.mount(source = "abfss://<container-name>#<your-storage-account-name>.dfs.core.windows.net/", mount_point = "/mnt/test")
You can mount an ADLS storage account using the access key via Databricks and then read/write data. Please try the code below:
dbutils.fs.mount(
source = "wasbs://<container-name>#<storage-account-name>.blob.core.windows.net",
mount_point = "/mnt/<mount-name>",
extra_configs = {"fs.azure.account.key.<storage-account-name>.blob.core.windows.net":dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")})
dbutils.fs.ls("/mnt/<mount-name>")
I'm looking for a way to load data from an Azure Data Lake Gen2 container using Dask. The contents of the container are only parquet files, but I only have the account name, account endpoint, and a SAS token.
When I use the Azure SDK's FileSystemClient, I can navigate easily with only those values.
azure_file_system_client = FileSystemClient(
account_url=endpoint,
file_system_name="container-name",
credential=sas_token,
)
When I try to do the same with abfs in Dask, using adlfs as the backend, as below:
ENDPOINT = f"https://{ACCOUNT_NAME}.dfs.core.windows.net"
storage_options={'connection_string': f"{ENDPOINT}/{CONTAINER_NAME}/?{sas_token}"}
ddf = dd.read_parquet(
f"abfs://{CONTAINER_NAME}/**/*.parquet",
storage_options=storage_options
)
I get the following error:
ValueError: unable to connect to account for Connection string missing required connection details.
Any thoughts?
Thanks in advance :)
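One thing worth trying (an untested sketch): adlfs also accepts the account name and SAS token as separate storage_options keys, i.e. the same values that already work with the FileSystemClient above:
import dask.dataframe as dd

# ACCOUNT_NAME, CONTAINER_NAME and sas_token are the same values used with the SDK above
storage_options = {"account_name": ACCOUNT_NAME, "sas_token": sas_token}
ddf = dd.read_parquet(
    f"abfs://{CONTAINER_NAME}/**/*.parquet",
    storage_options=storage_options)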
How do I write to an Azure file share from Azure Databricks Spark jobs?
I configured the Hadoop storage key and values.
spark.sparkContext.hadoopConfiguration.set(
"fs.azure.account.key.STORAGEKEY.file.core.windows.net",
"SECRETVALUE"
)
val wasbFileShare =
s"wasbs://testfileshare#STORAGEKEY.file.core.windows.net/testPath"
df.coalesce(1).write.mode("overwrite").csv(wasbBlob)
When I try to save the dataframe to the Azure file share, I see the following "resource not found" error, although the URI is present.
Exception in thread "main" org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: The requested URI does not represent any resource on the server.
Steps to connect to an Azure file share from Databricks:
First, install the Microsoft Azure Storage File Share client library for Python using pip install in Databricks: https://pypi.org/project/azure-storage-file-share/
After installing it, create a storage account. Then you can create a file share from Databricks:
from azure.storage.fileshare import ShareClient
share = ShareClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<file share name that you want to create>")
share.create_share()
This code uploads a file to the file share from Databricks:
from azure.storage.fileshare import ShareFileClient
file_client = ShareFileClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<your_fileshare_name>", file_path="my_file")
with open("./SampleSource.txt", "rb") as source_file:
    file_client.upload_file(source_file)
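To verify the upload, the same client can download the file back (a small sketch reusing the same hypothetical connection string and share name):
from azure.storage.fileshare import ShareFileClient

file_client = ShareFileClient.from_connection_string(conn_str="<same connection string as above>", share_name="<your_fileshare_name>", file_path="my_file")
# download_file() returns a stream downloader; readall() reads the whole file into memory
with open("./SampleDownload.txt", "wb") as target_file:
    target_file.write(file_client.download_file().readall())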
Refer to this link for further information: https://pypi.org/project/azure-storage-file-share/
I'm new to Databricks. I wrote sample code to read a Storage Blob in Azure Databricks.
blob_account_name = "sars"
blob_container_name = "mpi"
blob_sas_token =r"**"
ini_path = "58154388-b043-4080-a0ef-aa5fdefe22c8"
inputini = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, ini_path)
spark.conf.set("fs.azure.sas.%s.%s.blob.core.windows.net"% (blob_container_name, blob_account_name), blob_sas_token)
print(inputini)
ini=sc.textFile(inputini).collect()
It throws an error:
Container mpi in account sars.blob.core.windows.net not found
I guess it doesn't attach the SAS token to the wasbs link, so it doesn't have permission to read the data.
How do I attach the SAS token to the wasbs link?
This is expected behaviour; you cannot read private storage from Databricks this way. In order to access private data from storage where the firewall is enabled or which was created in a VNet, you will have to deploy Azure Databricks in your Azure Virtual Network and then whitelist the VNet address range in the firewall of the storage account. You can refer to "Configure Azure Storage firewalls and virtual networks".
WITH PRIVATE ACCESS:
When you have set the access level to "Private (no anonymous access)".
Output: Error message
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Container carona in account cheprasas.blob.core.windows.net not found, and we can't create it using anoynomous credentials, and no credentials found for them in the configuration.
WITH CONTAINER ACCESS:
When you have set the access level to "Container (Anonymous read access for containers and blobs)".
Output: You will be able to see the output without any issue.
Reference: Quickstart: Run a Spark job on Azure Databricks using the Azure portal.