Accessing Azure ADLS Gen2 with PySpark on Databricks

I'm trying to learn Spark, Databricks & Azure.
I'm trying to access GEN2 from Databricks using Pyspark.
I can't find a proper way, I believe it's super simple but I failed.
Currently each time I receive the following:
Unable to access container {name} in account {name} using anonymous
credentials, and no credentials found for them in the configuration.
I already have ADLS Gen2 running, and I have a SAS_URI to access it.
What I have tried so far (based on this link: https://learn.microsoft.com/pl-pl/azure/databricks/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sas-access):
spark.conf.set(f"fs.azure.account.auth.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", {SAS_URI})
spark.conf.set(f"fs.azure.sas.token.provider.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", {SAS_URI})
Then, to read the data:
sd_xxx = spark.read.parquet(f"wasbs://{CONTAINER_NAME}@{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net/{proper_path_to_files/}")

Your configuration is incorrect. The first property should be set to the literal value SAS, while the second should name a Scala/Java class that returns the SAS token. You can't just pass a URI containing the SAS information; you need to provide a SAS token provider implementation.
If you want to use wasbs, that is the protocol for accessing Azure Blob Storage. Although it can also be used to access ADLS Gen2 (not recommended), you then need to use blob.core.windows.net instead of dfs.core.windows.net and set the corresponding Spark property for Azure Blob access (fs.azure.sas.<container-name>.<storage-account-name>.blob.core.windows.net).
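For illustration, a minimal sketch of the SAS setup against the dfs endpoint. It assumes a runtime whose ABFS driver ships the built-in org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider (newer Hadoop/Databricks versions); on older runtimes you would implement your own SASTokenProvider class instead. The variable names follow the question:
# SAS_TOKEN is the token string itself (e.g. starting with "sv=..."), not the full SAS URI
spark.conf.set(f"fs.azure.account.auth.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", "SAS")
spark.conf.set(f"fs.azure.sas.token.provider.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set(f"fs.azure.sas.fixed.token.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", SAS_TOKEN)
# Read with the abfss scheme against the dfs endpoint rather than wasbs
sd_xxx = spark.read.parquet(f"abfss://{CONTAINER_NAME}@{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net/path/to/files/")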

The more common procedure to follow is here: Access Azure Data Lake Storage Gen2 using OAuth 2.0 with an Azure service principal
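As a rough sketch of that service principal approach (the application id, tenant id, and secret scope names below are hypothetical placeholders; the client secret is read from a Databricks secret scope):
spark.conf.set(f"fs.azure.account.auth.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net",
               dbutils.secrets.get(scope="<scope-name>", key="<service-principal-secret>"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")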

Related

Access S3 files from Azure Synapse Notebook

Goal:
Move a large number of files from AWS S3 to ADLS Gen2 as fast as possible using Azure Synapse, with a parameterized regex expression for the filename pattern, from a Synapse Notebook.
What I tried so far:
I know that to access ADLS Gen2 we can use
mssparkutils.fs.ls('abfss://container_name@storage_account_name.blob.core.windows.net/foldername'), which works, but what is the equivalent to access S3?
I used mssparkutils.credentials.getSecret('AKV name','secretname') and mssparkutils.credentials.getSecret('AKV name','secret key id') to fetch secret details in the Synapse notebook, but I was unable to configure S3 access from Synapse.
Question: Do I have to use the existing linked service using the credentials.getFullConnectionString(LinkedService) API ?
In short, my question is, How do I configure connectivity to S3 from within Synapse Notebook?
Answering my own question here: AzCopy worked. The links below helped me finish the task. The steps are as follows.
1. Install AzCopy on your machine.
2. In your terminal, go to the directory where the executable is installed and run "azcopy login"; authorize with your Azure Active Directory credentials in the browser, using the link and the code shown in the terminal.
3. Authorize with S3 using the environment variables below:
set AWS_ACCESS_KEY_ID=
set AWS_SECRET_ACCESS_KEY=
4. For ADLS Gen2, you are already authorized from step 2.
5. Use the commands (whichever suits your need) from the links below.
https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10
https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-s3
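For illustration, a minimal Python sketch of steps 3 to 5, assuming azcopy is on the PATH and the interactive login from step 2 has already been completed; the bucket, account, and container names are hypothetical placeholders:
import os
import subprocess

# Authorize the S3 side via environment variables (step 3)
env = os.environ.copy()
env["AWS_ACCESS_KEY_ID"] = "<access-key-id>"          # placeholder
env["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"  # placeholder

# Copy an S3 bucket/prefix into an ADLS Gen2 container (step 5)
subprocess.run(
    ["azcopy", "copy",
     "https://s3.amazonaws.com/<bucket-name>/<prefix>",
     "https://<storage-account-name>.blob.core.windows.net/<container-name>/<folder>",
     "--recursive"],
    env=env,
    check=True,
)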

Azure Data Lake Gen2 Storage Account blob vs dfs choice

I am new to the Azure Data Lake Storage Gen2 service. I have a Storage Account with the "Hierarchical namespace" option enabled.
I am using AzCopy to move some files and folders. From the command line I can, within the address string, use either the "blob" or the "dfs" token:
'https://myaccount.blob.core.windows.net/mycontainer/myfolder'
or
'https://myaccount.dfs.core.windows.net/mycontainer/myfolder'
again within the .\azcopy.exe copy command.
Apparently both ways succeed and give the same result. My question is: is there any difference if I use blob or dfs in the address string? If yes, what is it?
Also, whichever token I choose, in the Azure portal a file address is always given with the blob token.
thanks
In the storage account's Endpoints page, you can see all the endpoints available for its services.
Both blob and dfs work for you because both of them are supported in Azure Data Lake Storage Gen2. However, on Gen1 you may only have the blob service and not the dfs service available; in that case, you won't be able to use the dfs endpoint.
blob and dfs represent the resource type in the endpoint URL.

Access data from ADLS using Azure Databricks

I am trying to access data files stored in an ADLS location via Azure Databricks using storage account access keys.
To access the data files, I am using a Python notebook in Azure Databricks, and the command below works fine:
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
    "<access-key>"
)
However, when I try to list the directory using the command below, it throws an error:
dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net")
ERROR:
ExecutionError: An error occurred while calling z:com.databricks.backend.daemon.dbutils.FSUtils.ls.
: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, GET, https://<storage-account-name>.dfs.core.windows.net/<container-name>?upn=false&resource=filesystem&maxResults=500&timeout=90&recursive=false, AuthorizationPermissionMismatch, "This request is not authorized to perform this operation using this permission. RequestId:<request-id> Time:2021-08-03T08:53:28.0215432Z"
I am not sure what permission it requires or how to proceed.
Also, I am using ADLS Gen2 and Azure Databricks (Trial - Premium).
Thanks in advance!
The complete config key is called "spark.hadoop.fs.azure.account.key.adlsiqdigital.dfs.core.windows.net"
However, for a production environment it would be beneficial to use a service principal and a mount point. That way, actions on the storage can be traced back to this application more easily than with the generic access key, and the mount point avoids specifying the connection string everywhere in your code.
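A rough sketch of that approach, assuming an Azure AD service principal whose secret is stored in a Databricks secret scope (all names below are placeholders):
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<sp-secret-name>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
# Mount once, then read through the mount point instead of repeating credentials in each notebook
dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs,
)
df = spark.read.parquet("/mnt/<mount-name>/path/to/files")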
Try this out.
spark.conf.set("fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net","<your-storage-account-access-key>")
dbutils.fs.mount(source = "abfss://<container-name>#<your-storage-account-name>.dfs.core.windows.net/", mount_point = "/mnt/test")
You can mount the ADLS storage account using an access key via Databricks and then read/write data. Please try the code below:
dbutils.fs.mount(
    source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
    mount_point = "/mnt/<mount-name>",
    extra_configs = {"fs.azure.account.key.<storage-account-name>.blob.core.windows.net": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")}
)
dbutils.fs.ls("/mnt/<mount-name>")

Using Dask to load data from Azure Data Lake Gen2 with SAS Token

I'm looking for a way to load data from an Azure Data Lake Gen2 container using Dask. The container holds only Parquet files, and I only have the account name, account endpoint, and a SAS token.
When I use the Azure SDK FileSystemClient, I can navigate easily with only those values:
azure_file_system_client = FileSystemClient(
    account_url=endpoint,
    file_system_name="container-name",
    credential=sas_token,
)
When I try to do the same with abfs in Dask, using adlfs as the backend, as below:
ENDPOINT = f"https://{ACCOUNT_NAME}.dfs.core.windows.net"
storage_options={'connection_string': f"{ENDPOINT}/{CONTAINER_NAME}/?{sas_token}"}
ddf = dd.read_parquet(
f"abfs://{CONTAINER_NAME}/**/*.parquet",
storage_options=storage_options
)
I get the following error:
ValueError: unable to connect to account for Connection string missing required connection details.
Any thoughts?
Thanks in advance :)
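For reference, a sketch of how adlfs typically expects the credentials to be passed, i.e. as separate account_name and sas_token entries in storage_options rather than a connection string built from the endpoint URL; treat the exact parameter names as an assumption to verify against your adlfs version:
import dask.dataframe as dd

STORAGE_OPTIONS = {
    "account_name": ACCOUNT_NAME,  # just the account name, not the full endpoint URL
    "sas_token": sas_token,        # the SAS token string
}
ddf = dd.read_parquet(
    f"abfs://{CONTAINER_NAME}/**/*.parquet",
    storage_options=STORAGE_OPTIONS,
)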

Error running Spark on Databricks: constructor public XXX is not whitelisted

I was using Azure Databricks and trying to run some example python code from this page.
But I get this exception:
py4j.security.Py4JSecurityException: Constructor public org.apache.spark.ml.classification.LogisticRegression(java.lang.String) is not whitelisted.
This error shows up with some library methods when using a High Concurrency cluster with credential passthrough enabled. If that is your scenario, a workaround may be to use a different cluster mode.
py4j.security.Py4JSecurityException: ... is not whitelisted
This exception is thrown when you have accessed a method that Azure Databricks has not explicitly marked as safe for Azure Data Lake Storage credential passthrough clusters. In most cases, this means that the method could allow a user on an Azure Data Lake Storage credential passthrough cluster to access another user's credentials.
Reference: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/adls-passthrough.html
