Read data from GCS in azure databricks with account impersonation - apache-spark

I am trying to read data from a GCS bucket into Azure Databricks (Spark/PySpark) using service account impersonation. I tried setting the following properties:
"google.cloud.auth.service.account.enable", "false"
"fs.gs.auth.impersonation.service.account.for.user.<USER_NAME>"
"fs.gs.auth.impersonation.service.account.for.group.<GROUP_NAME>"
"fs.gs.auth.impersonation.service.account"
But authentication fails with the error below:
IllegalArgumentException: No valid credential configuration discovered:
Any idea or suggestion?

Related

Accessing Azure ADLS gen2 with Pyspark on Databricks

I'm trying to learn Spark, Databricks & Azure.
I'm trying to access GEN2 from Databricks using Pyspark.
I can't find a proper way, I believe it's super simple but I failed.
Currently each time I receive the following:
Unable to access container {name} in account {name} using anonymous
credentials, and no credentials found for them in the configuration.
I already have a running Gen2 account, and I have a SAS_URI to access it.
What I have tried so far
(based on this link: https://learn.microsoft.com/pl-pl/azure/databricks/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sas-access):
spark.conf.set(f"fs.azure.account.auth.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", {SAS_URI})
spark.conf.set(f"fs.azure.sas.token.provider.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", {SAS_URI})
Then to reach out to data:
sd_xxx = spark.read.parquet(f"wasbs://{CONTAINER_NAME}@{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net/{proper_path_to_files/}")
Your configuration is incorrect. The first property should be set to just the literal value SAS, and the second to the name of a Scala/Java class that returns the SAS token. You can't simply pass a URI with the SAS information in it; you need to implement some custom code.
wasbs is the protocol for accessing Azure Blob Storage. It can be used to access ADLS Gen2 (although that is not recommended), but then you need to use blob.core.windows.net instead of dfs.core.windows.net, and also set the correct Spark property for Azure Blob access.
The more common procedure is described here: Access Azure Data Lake Storage Gen2 using OAuth 2.0 with an Azure service principal
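As a rough sketch of the OAuth route mentioned above, the Spark properties for a service principal could be assembled like this. The helper names (adls_oauth_conf, abfss_uri) and all placeholder values are hypothetical and only for illustration; the property names are the ones used by the ABFS driver, and on Databricks the secret would normally come from a secret scope rather than a literal.

```python
def adls_oauth_conf(storage_account, client_id, client_secret, tenant_id):
    """Build the Spark conf entries for ABFS OAuth access (hypothetical helper)."""
    host = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def abfss_uri(container, storage_account, path=""):
    """Build an abfss:// URI; note the '@' between container and account."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{path.lstrip('/')}"


# In a Databricks notebook, where `spark` is predefined (placeholder values):
# for key, value in adls_oauth_conf("mystorage", "<app-id>", "<secret>", "<tenant-id>").items():
#     spark.conf.set(key, value)
# df = spark.read.parquet(abfss_uri("mycontainer", "mystorage", "path/to/files"))
```

With this approach the path uses abfss and dfs.core.windows.net rather than wasbs.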

Using Dask to load data from Azure Data Lake Gen2 with SAS Token

I'm looking for a way to load data from an Azure Data Lake Gen2 using Dask. The container holds only parquet files, and I only have the account name, account endpoint and a SAS token.
When I use Azure SDK for a File System, I can navigate easily with only those values.
azure_file_system_client = FileSystemClient(
account_url=endpoint,
file_system_name="container-name",
credential=sas_token,
)
When I try to do the same using abfs in Dask with adlfs as the backend, as below:
ENDPOINT = f"https://{ACCOUNT_NAME}.dfs.core.windows.net"
storage_options={'connection_string': f"{ENDPOINT}/{CONTAINER_NAME}/?{sas_token}"}
ddf = dd.read_parquet(
f"abfs://{CONTAINER_NAME}/**/*.parquet",
storage_options=storage_options
)
I get the following error:
ValueError: unable to connect to account for Connection string missing required connection details.
Any thoughts?
Thanks in advance :)
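One thing that may be worth trying, offered as a sketch rather than a confirmed fix: the error suggests that connection_string is expected to be an Azure Storage connection string (semicolon-separated key=value pairs), not a URL. adlfs also accepts account_name and sas_token as separate storage options, so the token can be passed directly. The helper name adlfs_sas_options is hypothetical, and stripping a leading "?" from the token is a defensive assumption.

```python
def adlfs_sas_options(account_name, sas_token):
    """Build storage_options for adlfs from an account name and SAS token.

    Hypothetical helper: drops a leading '?' in case the token was copied
    from the query part of a URL.
    """
    return {"account_name": account_name, "sas_token": sas_token.lstrip("?")}


# Usage with Dask (requires dask and adlfs; names as in the question):
# import dask.dataframe as dd
# ddf = dd.read_parquet(
#     f"abfs://{CONTAINER_NAME}/**/*.parquet",
#     storage_options=adlfs_sas_options(ACCOUNT_NAME, sas_token),
# )
```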

AuthenticationException when creating Azure ML Dataset from Azure Data Lake Gen2 Datastore

I have an Azure Data Lake Gen2 with public endpoint and a standard Azure ML instance.
I have created both components with my user and I am listed as Contributor.
I want to use data from this data lake in Azure ML.
I have added the data lake as a Datastore using Service Principal authentication.
When I then try to create a Tabular Dataset using the Azure ML GUI, I get the following error:
Access denied
You do not have permission to the specified path or file.
{
"message": "ScriptExecutionException was caused by StreamAccessException.\n StreamAccessException was caused by AuthenticationException.\n 'AdlsGen2-ListFiles (req=1, existingItems=0)' for '[REDACTED]' on storage failed with status code 'Forbidden' (This request is not authorized to perform this operation using this permission.), client request ID '1f9e329b-2c2c-49d6-a627-91828def284e', request ID '5ad0e715-a01f-0040-24cb-b887da000000'. Error message: [REDACTED]\n"
}
I have asked our Azure Portal Admin, who has Admin access to both Azure ML and the Data Lake, to try the same, and she gets the same error.
I tried creating the Dataset using Python sdk and get a similar error:
ExecutionError:
Error Code: ScriptExecution.StreamAccess.Authentication
Failed Step: 667ddfcb-c7b1-47cf-b24a-6e090dab8947
Error Message: ScriptExecutionException was caused by StreamAccessException.
StreamAccessException was caused by AuthenticationException.
'AdlsGen2-ListFiles (req=1, existingItems=0)' for 'https://mydatalake.dfs.core.windows.net/mycontainer?directory=mydirectory/csv&recursive=true&resource=filesystem' on storage failed with status code 'Forbidden' (This request is not authorized to perform this operation using this permission.), client request ID 'a231f3e9-b32b-4173-b631-b9ed043fdfff', request ID 'c6a6f5fe-e01f-0008-3c86-b9b547000000'. Error message: {"error":{"code":"AuthorizationPermissionMismatch","message":"This request is not authorized to perform this operation using this permission.\nRequestId:c6a6f5fe-e01f-0008-3c86-b9b547000000\nTime:2020-11-13T06:34:01.4743177Z"}}
| session_id=75ed3c11-36de-48bf-8f7b-a0cd7dac4d58
I have created Datastores and Datasets for both a normal blob storage and a managed SQL database with no issues, and I only have Contributor access to those, so I cannot understand why I should not be authorized to add the data lake. The fact that our admin gets the same error leads me to believe there is some other issue.
I hope you can help me identify what it is or give me some clue of what more to test.
Edit:
I see I might have duplicated this post: How to connect AMLS to ADLS Gen 2?
I will test that solution and close this post if it works
This was actually a duplicate of How to connect AMLS to ADLS Gen 2?.
The solution is to assign the Storage Blob Data Reader role to the service principal that Azure ML uses to access the data lake. Note that you have to wait at least a few minutes for the change to take effect.

save spark ML model in azure blobs

I tried saving my machine learning model from PySpark to Azure Blob Storage, but it gives an error.
lr.save('wasbs:///user/remoteuser/models/')
Illegal Argument Exception: Cannot initialize WASB file system, URI authority not recognized.
I also tried:
m = lr.save('wasbs://'+container_name+'@'+storage_account_name+'.blob.core.windows.net/models/')
But the stack trace says it is unable to identify the user identity.
P.S. : I am not using Azure HDInsight. I am just using Databricks and Azure blob storage
To access Azure Blob Storage from Azure Databricks directly (not mounted), you have to set an account access key:
spark.conf.set(
"fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
"<your-storage-account-access-key>")
or a SAS for a container. Then you should be able to access the Blob Storage:
val df = spark.read.parquet("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")
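For the PySpark case in the question, the same idea could look roughly like the sketch below. The helper names (wasbs_uri, account_key_conf) and the placeholder values are hypothetical; lr is the model object from the question. Note that the container and the storage account are separated by "@" in a wasbs URI, and that a URI without an authority (wasbs:///...) gives the driver nothing to resolve.

```python
def wasbs_uri(container, storage_account, path=""):
    """Build a wasbs:// URI of the form container@account.blob.core.windows.net/path."""
    return f"wasbs://{container}@{storage_account}.blob.core.windows.net/{path.lstrip('/')}"


def account_key_conf(storage_account):
    """Name of the Spark property that holds the storage account access key."""
    return f"fs.azure.account.key.{storage_account}.blob.core.windows.net"


# In a Databricks notebook, where `spark` is predefined (placeholder values):
# spark.conf.set(account_key_conf("mystorage"), "<your-storage-account-access-key>")
# lr.save(wasbs_uri("mycontainer", "mystorage", "models/lr"))
```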

Connect to Blob storage "no credentials found for them in the configuration"

I'm working with Databricks notebook backed by spark cluster. Having trouble trying to connect to the Azure blob storage. I used this link and tried the section Access Azure Blob Storage Directly - Set up an account access key. I get no errors here:
spark.conf.set(
"fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
"<your-storage-account-access-key>")
But receive errors when I try and do an 'ls' on the directory:
dbutils.fs.ls("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Unable to access container <container name> in account <storage account name>core.windows.net using anonymous credentials, and no credentials found for them in the configuration.
If there is a better way, please suggest that as well. Thanks.
You need to pass the storage account name and key when setting up the configuration. You can find both in the Azure portal.
spark.conf.set(
"fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
"<your-storage-account-access-key>")
Also, when doing the ls, you need to include the container name and directory name:
dbutils.fs.ls("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")
Hope this will resolve your issue!
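If the error persists, one possible check is that the account in the URI matches the account in the property name, since the key is looked up by the host part of the URI. The sketch below derives the expected property name from a wasbs URI; the function name expected_key_name is hypothetical, and the final comparison against spark.conf is left as a comment because it needs a live notebook.

```python
from urllib.parse import urlparse


def expected_key_name(wasbs_url):
    """Return the account-key property name that a wasbs URI relies on."""
    parsed = urlparse(wasbs_url)
    if parsed.scheme not in ("wasb", "wasbs") or "@" not in parsed.netloc:
        raise ValueError(f"not a wasbs://container@account... URI: {wasbs_url}")
    # netloc looks like "<container>@<account>.blob.core.windows.net"
    _container, host = parsed.netloc.split("@", 1)
    return f"fs.azure.account.key.{host}"


# In a Databricks notebook (placeholder names):
# key_name = expected_key_name("wasbs://mycontainer@mystorage.blob.core.windows.net/mydir")
# spark.conf.get(key_name)  # raises if the property was never set for this account
```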
