I am trying to access data files stored in ADLS location via Azure Databricks using storage account access keys.
To access the data files, I am using a Python notebook in Azure Databricks, and the command below works fine:
spark.conf.set(
"fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
"<access-key>"
)
However, when I try to list the directory using the command below, it throws an error:
dbutils.fs.ls("abfss://<container-name>#<storage-account-name>.dfs.core.windows.net")
ERROR:
ExecutionError: An error occurred while calling z:com.databricks.backend.daemon.dbutils.FSUtils.ls.
: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, GET, https://<storage-account-name>.dfs.core.windows.net/<container-name>?upn=false&resource=filesystem&maxResults=500&timeout=90&recursive=false, AuthorizationPermissionMismatch, "This request is not authorized to perform this operation using this permission. RequestId:<request-id> Time:2021-08-03T08:53:28.0215432Z"
I am not sure what permission it requires or how to proceed.
Also, I am using ADLS Gen2 and Azure Databricks (trial, Premium tier).
Thanks in advance!
The complete config key is "spark.hadoop.fs.azure.account.key.<storage-account-name>.dfs.core.windows.net".
However, for a production environment it would be beneficial to use a service principal and a mount point. That way, actions on the storage can be traced back to this application more easily than with the generic access key, and the mount point avoids having to specify the connection details everywhere in your code.
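A minimal sketch of that approach, assuming a service principal whose secret is kept in a Databricks secret scope; every <...> value and the scope/key names are placeholders:

# Sketch: mount ADLS Gen2 with a service principal (OAuth) instead of the account key.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<sp-secret-name>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs)

The service principal also needs an RBAC role on the storage account (for example, Storage Blob Data Contributor) for the mount to be usable.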
Try this out.
spark.conf.set("fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net","<your-storage-account-access-key>")
dbutils.fs.mount(source = "abfss://<container-name>@<your-storage-account-name>.dfs.core.windows.net/", mount_point = "/mnt/test")
You can mount an ADLS storage account using an access key via Databricks and then read/write data. Please try the code below:
dbutils.fs.mount(
source = "wasbs://<container-name>#<storage-account-name>.blob.core.windows.net",
mount_point = "/mnt/<mount-name>",
extra_configs = {"fs.azure.account.key.<storage-account-name>.blob.core.windows.net":dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")})
dbutils.fs.ls("/mnt/<mount-name>")
I'm trying to learn Spark, Databricks & Azure.
I'm trying to access ADLS Gen2 from Databricks using PySpark.
I can't find the proper way; I believe it's super simple, but I have failed so far.
Currently, each time I receive the following:
Unable to access container {name} in account {name} using anonymous
credentials, and no credentials found for them in the configuration.
I already have ADLS Gen2 running, and I have a SAS URI for access.
What I have tried so far (based on this link: https://learn.microsoft.com/pl-pl/azure/databricks/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sas-access):
spark.conf.set(f"fs.azure.account.auth.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", {SAS_URI})
spark.conf.set(f"fs.azure.sas.token.provider.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", {SAS_URI})
Then, to reach the data:
sd_xxx = spark.read.parquet(f"wasbs://{CONTAINER_NAME}@{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net/{proper_path_to_files/}")
Your configuration is incorrect. The first property should be set to just SAS, while the second should be the name of a Scala/Java class that will return the SAS token; you can't simply use a URI with the SAS information embedded in it, you need to implement some custom code.
wasbs is the protocol for accessing Azure Blob Storage. Although it could be used to access ADLS Gen2 (not recommended, though), you would need to use blob.core.windows.net instead of dfs.core.windows.net, and also set the corresponding Spark properties for Azure Blob access.
The more common procedure to follow is here: Access Azure Data Lake Storage Gen2 using OAuth 2.0 with an Azure service principal
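For reference, the direct-access pattern from that doc looks roughly like this; treat it as a sketch, with every <...> value a placeholder and the client secret assumed to live in a secret scope:

# Sketch: direct ABFS access with a service principal (OAuth) via spark.conf.
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net",
               dbutils.secrets.get(scope="<scope-name>", key="<sp-secret-name>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
df = spark.read.parquet("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-files>")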
I'm looking for a way to load data from an Azure Data Lake Gen2 container using Dask. The container holds only Parquet files, and I only have the account name, account endpoint, and a SAS token.
When I use the Azure SDK's FileSystemClient, I can navigate easily with only those values:
from azure.storage.filedatalake import FileSystemClient

azure_file_system_client = FileSystemClient(
    account_url=endpoint,
    file_system_name="container-name",
    credential=sas_token,
)
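For instance, listing paths with that client works along these lines (a sketch; get_paths is the azure-storage-file-datalake call I would expect here):

# Sketch: enumerate every path in the container via the SDK client above.
for path in azure_file_system_client.get_paths(recursive=True):
    print(path.name)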
When I try to do the same with abfs in Dask, using adlfs as the backend, as below:
ENDPOINT = f"https://{ACCOUNT_NAME}.dfs.core.windows.net"
storage_options={'connection_string': f"{ENDPOINT}/{CONTAINER_NAME}/?{sas_token}"}
ddf = dd.read_parquet(
f"abfs://{CONTAINER_NAME}/**/*.parquet",
storage_options=storage_options
)
I get the following error:
ValueError: unable to connect to account for Connection string missing required connection details.
Any thoughts?
Thanks in advance :)
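A possible direction, offered only as an untested sketch: adlfs also accepts the account name and SAS token as separate storage options rather than a connection string, which may sidestep the parsing error. The names below reuse the placeholders from the question.

import dask.dataframe as dd

# Sketch: pass account_name and sas_token directly instead of a connection string.
storage_options = {
    "account_name": ACCOUNT_NAME,
    "sas_token": sas_token,
}
ddf = dd.read_parquet(
    f"abfs://{CONTAINER_NAME}/**/*.parquet",
    storage_options=storage_options,
)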
I have an Azure Data Lake Gen2 with public endpoint and a standard Azure ML instance.
I have created both components with my user and I am listed as Contributor.
I want to use data from this data lake in Azure ML.
I have added the data lake as a Datastore using Service Principal authentication.
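(For context, that registration step typically looks roughly like the sketch below, assuming the v1 azureml-core SDK; the datastore name and service principal values are placeholders.)

from azureml.core import Workspace, Datastore

# Sketch: register ADLS Gen2 as a datastore with service principal credentials.
ws = Workspace.from_config()
datastore = Datastore.register_azure_data_lake_gen2(
    workspace=ws,
    datastore_name="my_adls_gen2",
    filesystem="<container-name>",
    account_name="<storage-account-name>",
    tenant_id="<tenant-id>",
    client_id="<service-principal-app-id>",
    client_secret="<service-principal-secret>",
)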
When I then try to create a Tabular Dataset using the Azure ML GUI, I get the following error:
Access denied
You do not have permission to the specified path or file.
{
"message": "ScriptExecutionException was caused by StreamAccessException.\n StreamAccessException was caused by AuthenticationException.\n 'AdlsGen2-ListFiles (req=1, existingItems=0)' for '[REDACTED]' on storage failed with status code 'Forbidden' (This request is not authorized to perform this operation using this permission.), client request ID '1f9e329b-2c2c-49d6-a627-91828def284e', request ID '5ad0e715-a01f-0040-24cb-b887da000000'. Error message: [REDACTED]\n"
}
I have tried having our Azure Portal admin, who has admin access to both Azure ML and the Data Lake, try the same, and she gets the same error.
I tried creating the Dataset using the Python SDK and get a similar error:
ExecutionError:
Error Code: ScriptExecution.StreamAccess.Authentication
Failed Step: 667ddfcb-c7b1-47cf-b24a-6e090dab8947
Error Message: ScriptExecutionException was caused by StreamAccessException.
StreamAccessException was caused by AuthenticationException.
'AdlsGen2-ListFiles (req=1, existingItems=0)' for 'https://mydatalake.dfs.core.windows.net/mycontainer?directory=mydirectory/csv&recursive=true&resource=filesystem' on storage failed with status code 'Forbidden' (This request is not authorized to perform this operation using this permission.), client request ID 'a231f3e9-b32b-4173-b631-b9ed043fdfff', request ID 'c6a6f5fe-e01f-0008-3c86-b9b547000000'. Error message: {"error":{"code":"AuthorizationPermissionMismatch","message":"This request is not authorized to perform this operation using this permission.\nRequestId:c6a6f5fe-e01f-0008-3c86-b9b547000000\nTime:2020-11-13T06:34:01.4743177Z"}}
| session_id=75ed3c11-36de-48bf-8f7b-a0cd7dac4d58
I have created Datastores and Datasets for both a normal blob storage account and a managed SQL database with no issues, and I have only Contributor access to those, so I cannot understand why I should not be authorized to add the data lake. The fact that our admin gets the same error leads me to believe there is some other issue.
I hope you can help me identify what it is or give me some clue of what more to test.
Edit:
I see I might have duplicated this post: How to connect AMLS to ADLS Gen 2?
I will test that solution and close this post if it works
This was actually a duplicate of How to connect AMLS to ADLS Gen 2?.
The solution is to give the service principal that Azure ML uses to access the data lake the Storage Blob Data Reader role. Note that you have to wait at least a few minutes for this to take effect.
I have mounted my blob storage to dbfs:/mnt/ under the name /mnt/deltalake, and the blob storage container name is deltalake.
Mounting to DBFS is done using an Azure Key Vault-backed secret scope.
When I try to create a database using CREATE DATABASE abc LOCATION '/mnt/deltalake/databases/abc', it errors out saying the path does not exist.
However, when I was using a DBFS path as storage, as in CREATE DATABASE abc LOCATION '/user/hive/warehouse/databases/abc', it was always successful.
I wonder what is going wrong. Suggestions, please.
Using a mount point, you should be able to access existing files and write new files through Databricks.
However, I believe SQL commands such as CREATE DATABASE only work against the underlying Hive metastore.
You may need to create a database for your blob storage outside of Databricks, and then connect to that database to read and write from it using Databricks.
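If you want to keep the database on the mount, one thing worth checking, offered only as an untested sketch, is whether the mounted path resolves at all and whether an explicit dbfs: scheme helps:

# Sketch: verify the mount before pointing CREATE DATABASE at it.
dbutils.fs.mkdirs("/mnt/deltalake/databases/abc")      # ensure the target directory exists
display(dbutils.fs.ls("/mnt/deltalake/databases"))     # sanity-check that the mount is readable
spark.sql("CREATE DATABASE abc LOCATION 'dbfs:/mnt/deltalake/databases/abc'")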
I'm working with a Databricks notebook backed by a Spark cluster and am having trouble connecting to Azure Blob Storage. I used this link and tried the section "Access Azure Blob Storage Directly - Set up an account access key". I get no errors here:
spark.conf.set(
"fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
"<your-storage-account-access-key>")
But I receive errors when I try to do an ls on the directory:
dbutils.fs.ls("wasbs://<your-container-name>#<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Unable to access container <container name> in account <storage account name>core.windows.net using anonymous credentials, and no credentials found for them in the configuration.
If there is a better way, please provide a suggestion as well. Thanks.
You need to pass the storage account name and key while setting up the configuration. You can find these in the Azure portal.
spark.conf.set(
"fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
"<your-storage-account-access-key>")
Also, while doing the ls, you need to include the container name and directory name:
dbutils.fs.ls("wasbs://<your-container-name>#<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")
Hope this will resolve your issue!
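As a follow-up, rather than pasting the access key into the notebook, you can pull it from a Databricks secret scope; a sketch, assuming a scope and key have already been created (the names are placeholders):

# Sketch: fetch the storage account key from a secret scope instead of hard-coding it.
spark.conf.set(
    "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
    dbutils.secrets.get(scope="<scope-name>", key="<storage-key-name>"))
dbutils.fs.ls("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")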