I am unable to mount ADLS Gen2. Please assist - apache-spark

Code to mount ADLS Gen2:
Error while mounting ADLS Gen2:

The error message clearly says that the provided tenant was not found in your subscription. Please make sure the tenant ID you pass is the one associated with your subscription.
Prerequisites:
Create and grant permissions to a service principal. If your selected access method requires a service principal with adequate permissions and you do not have one, follow these steps:
Step 1: Create an Azure AD application and service principal that can access resources. Note the following properties:
application-id: An ID that uniquely identifies the application.
directory-id: An ID that uniquely identifies the Azure AD instance.
service-credential: A string that the application uses to prove its identity.
storage-account-name: The name of the storage account.
filesystem-name: The name of the filesystem.
Step 2: Register the service principal and grant it the correct role assignment, such as the Storage Blob Data Contributor role, on the Azure Data Lake Storage Gen2 account.
Step 3: Mount Azure Data Lake Storage Gen2 by passing the values directly.
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "06ecxx-xxxxx-xxxx-xxx-xxxef", #Enter <appId> = Application ID
"fs.azure.account.oauth2.client.secret": "1i_7Q-XXXXXXXXXXXXXXXXXXgyC.Szg", #Enter <password> = Client Secret created in AAD
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/72f98-xxxx-xxxx-xxx-xx47/oauth2/token", #Enter <tenant> = Tenant ID
"fs.azure.createRemoteFileSystemDuringInitialization": "true"}
dbutils.fs.mount(
source = "abfss://<filesystem>#<StorageName>.dfs.core.windows.net/<Directory>", #Enter <container-name> = filesystem name <storage-account-name> = storage name
mount_point = "/mnt/<mountname>",
extra_configs = configs)
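Once the mount command completes, you can quickly confirm it worked by listing the mount point. A minimal check, using the same placeholder names as above:
# List the mounted directory; a successful call confirms the OAuth configs and role assignment are correct
display(dbutils.fs.ls("/mnt/<mountname>"))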
For more details, refer to Azure Databricks - Azure Data Lake Storage Gen2.

Related

Access cross tenant Storage Account (firewall protected) from Az Synapse (dedicated SQL pool) in a different tenant

I have an Az Synapse (dedicated SQL pool) configured with a managed VNet in tenant A and a storage account in tenant B. The storage account is firewall-protected and only certain VNets and IPs can access it. I want to create external tables from Az Synapse and hence access the storage account residing in the other tenant.
I have created a private endpoint on the storage account using Az Synapse, and the necessary IAM roles are in place.
The external table is created and I can retrieve the data when the firewall on storage account is lifted.
However, when the storage account firewall is enabled, I get the following error:
Msg 105019, Level 16, State 1, Line 1
External file access failed due to internal error: 'Error occurred while accessing HDFS: Java exception raised on call to HdfsBridge_IsDirExist. Java exception message:
HdfsBridge::isDirExist - Unexpected error encountered checking whether directory exists or not: AbfsRestOperationException: Operation failed: "This request is not authorized to perform this operation.", 403, HEAD, https://someadlsl001.dfs.core.windows.net/somecontainer/?upn=false&action=getAccessControl&timeout=90'
The SQL queries used in the Synapse workspace SQL script are:
CREATE DATABASE SCOPED CREDENTIAL cred WITH IDENTITY = '{clientID of service principal}@https://login.microsoftonline.com/{tenantID}/oauth2/token', SECRET = 'xxxxxxxxxxxxxxxx'
CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH ( LOCATION = 'abfss://somecontainer@someadlsl001.dfs.core.windows.net/weather.csv', CREDENTIAL = cred, TYPE = HADOOP );
CREATE EXTERNAL TABLE [dbo].[WeatherData2] (
[usaf] [nvarchar](100) NULL
)
WITH
(
LOCATION='/',
DATA_SOURCE = AzureDataLakeStore,
FILE_FORMAT = csvFile,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
select * from [dbo].[WeatherData2]
Please help
You can retrieve the data when the firewall on the storage account is disabled, which suggests the problem lies with the role assignment or the firewall rules rather than with the external table itself.
Make sure the service principal is assigned the Storage Blob Data Contributor role on the storage account.
Also make sure you whitelist the client IP addresses in the storage account firewall.
Reference - https://learn.microsoft.com/en-us/answers/questions/648148/spark-pool-notebook-error.html
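If you want to confirm that the role assignment itself is fine before digging into the network rules, you can test the same service principal directly against the storage endpoint from a machine whose IP or VNet is allowed through the firewall. A minimal sketch, assuming the azure-identity and azure-storage-file-datalake Python packages and the names from the question:
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Same service principal that the database scoped credential uses
credential = ClientSecretCredential(tenant_id="<tenantID>",
                                    client_id="<clientID of service principal>",
                                    client_secret="<secret>")

service = DataLakeServiceClient(account_url="https://someadlsl001.dfs.core.windows.net",
                                credential=credential)

# A 403 here points at RBAC/ACLs; success here but failure from Synapse points at the firewall rules
for path in service.get_file_system_client("somecontainer").get_paths():
    print(path.name)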
Configuration of rules that grant access to subnets in virtual networks that are a part of a different Azure Active Directory tenant are currently only supported through PowerShell, CLI and REST APIs. Such rules cannot be configured through the Azure portal, though they may be viewed in the portal.*
*Configure Azure Storage firewalls and virtual networks
You mention your firewall is configured to allow only certain VNets & IPs. You might need to elaborate for us on what your rules are specifically, but the documentation is very clear on how this is configured when accessing the storage account from another tenant.
AZ CLI:
az storage account network-rule add -g myRg --account-name mystorageaccount --subnet $subnetId
Az Powershell:
Add-AzStorageAccountNetworkRule -ResourceGroupName "myRg" -Name "mystorageaccount" -VirtualNetworkResourceId $subnetId
And this might be stating the obvious, but any IP range in the IP address whitelist only applies to the public endpoints of the storage account. Keep that in mind if you're trying to whitelist resources across tenants or on-premises.

unable to create mount point in databricks for adls gen 2

I am trying to create a mount point to ADLS Gen2 using Key Vault in Databricks, however I am not able to do so due to an error I am getting.
I have Contributor access, and I tried giving the SPN both Storage Blob Data Contributor and Contributor access, but I am still not able to create the mount point.
I request some help please.
configs= {"fs.azure.account.auth.type":"OAuth",
"fs.azure.account.oauth.provider.type":"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id":"abcdefgh",
"fs.azure.account.oauth2.client.secret":dbutils.secrets.get(scope="myscope",key="mykey"),
"fs.azure.account.oauth2.client.endpoint":"https://login.microsoftonline.com/tenantid/oauth2/token",
"fs.azure.createRemoteFileSystemDuringInitialization": "true"}
dbutils.fs.mount(
source= "abfss://cont1#storageaccount.dfs.core.windows.net/",
mount_point="/mnt/cont1",
extra_configs=configs)
The error I am getting is:
An error occurred while calling o280.mount.
: HEAD https://storageaccount.dfs.core.windows.net/cont1?resource=filesystem&timeout=90
StatusCode=403
StatusDescription=This request is not authorized to perform this operation.
Note: When performing the steps in the "Assign the application to a role" section, make sure to assign the Storage Blob Data Contributor role to the service principal on the storage account.
Repro: I provided the Owner permission to the service principal and tried to run dbutils.fs.ls("/mnt/azure/"), which returned the same error message as above.
Solution: I then assigned the Storage Blob Data Contributor role to the service principal.
After that, I was able to get the output without any error message.
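One extra point worth noting: role assignments can take a few minutes to propagate, and a mount attempted before the role was in place stays broken until it is refreshed. A small sketch, assuming the /mnt/cont1 mount point from the question:
# Check whether /mnt/cont1 is already mounted (possibly from a failed earlier attempt)
for m in dbutils.fs.mounts():
    print(m.mountPoint, "->", m.source)

# If it is, drop the stale mount and rerun the dbutils.fs.mount(...) call from the question
dbutils.fs.unmount("/mnt/cont1")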
For more details, refer “Tutorial: Azure Data Lake Storage Gen2, Azure Databricks & Spark”.

Azure Databricks: can't connect to Azure Data Lake Storage Gen2

I have storage account kagsa1 with container cont1 inside, and I need it to be accessible (mounted) via Databricks.
If I use the storage account key from Key Vault, it works correctly:
configs = {
    "fs.azure.account.key.kagsa1.blob.core.windows.net": dbutils.secrets.get(scope = "kv-db1", key = "storage-account-access-key")
}
dbutils.fs.mount(
    source = "wasbs://cont1@kagsa1.blob.core.windows.net",
    mount_point = "/mnt/cont1",
    extra_configs = configs)
dbutils.fs.ls("/mnt/cont1")
..but if I'm trying to connect using Azure Active Directory credentials:
configs = {
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class": spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}
dbutils.fs.ls("abfss://cont1@kagsa1.dfs.core.windows.net/")
..it fails:
ExecutionError: An error occurred while calling z:com.databricks.backend.daemon.dbutils.FSUtils.ls.
: GET https://kagsa1.dfs.core.windows.net/cont1?resource=filesystem&maxResults=5000&timeout=90&recursive=false
StatusCode=403
StatusDescription=This request is not authorized to perform this operation using this permission.
ErrorCode=AuthorizationPermissionMismatch
ErrorMessage=This request is not authorized to perform this operation using this permission.
Databricks workspace tier is Premium,
Cluster has Azure Data Lake Storage Credential Passthrough option enabled,
Storage account has hierarchical namespace option enabled,
Filesystem was initialized with
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
dbutils.fs.ls("abfss://cont1#kagsa1.dfs.core.windows.net/")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
and I have full access to container in storage account:
What am I doing wrong?
Note: When performing the steps in the "Assign the application to a role" section, make sure to assign the Storage Blob Data Contributor role on the storage account. With credential passthrough, requests are made with your own Azure AD identity, so that identity needs the role as well; Owner or Contributor alone is not sufficient for data access.
As part of a repro, I provided the Owner permission to the service principal and tried to run dbutils.fs.ls("/mnt/azure/"), which returned the same error message as above.
I then assigned the Storage Blob Data Contributor role to the service principal.
After that, I was able to get the output without any error message.
For more details, refer “Tutorial: Azure Data Lake Storage Gen2, Azure Databricks & Spark”.
Reference: Azure Databricks - ADLS Gen2 throws 403 error message.
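If you would rather mount the container than go through the abfss URI each time, the same passthrough configuration can be passed to dbutils.fs.mount. A sketch using the container and account names from the question (this still requires a cluster with credential passthrough enabled and the role assignment above):
configs = {
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class": spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}

# Mount the container using the signed-in user's passed-through credentials
dbutils.fs.mount(
    source="abfss://cont1@kagsa1.dfs.core.windows.net/",
    mount_point="/mnt/cont1",
    extra_configs=configs)

dbutils.fs.ls("/mnt/cont1")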

Reusing a token obtained from DeviceCodeCredential

I am using the following code to obtain a token for azure blob service:
from azure.storage.blob import BlobServiceClient
from azure.identity import InteractiveBrowserCredential, DeviceCodeCredential, ClientSecretCredential
credential = DeviceCodeCredential(authority="login.microsoftonline.com", tenant_id="***", client_id="***")
blobber = BlobServiceClient(account_url="https://***.blob.core.windows.net", credential=credential)
blobs = blobber.list_containers()
for b in blobs:
    print(b)
It works perfectly.
However, during a certain timeframe, a user may need to invoke the blob service more than once. The key point is that the process may close and reopen several times.
Making the user go through the interactive token acquisition process each time the process restarts would be very annoying. I would like to persist the token and reuse it in later flows until it expires (assume persistence is secure).
What type of credential should I use? ClientSecretCredential doesn't work. Alternatively, perhaps there is a token cache mechanism I am not aware of.
EDIT:
I reposted a variation of this question. It also has a working answer.
Thank you Jim Xu.
According to my research, DeviceCodeCredential doesn't cache tokens; each get_token(*scopes, **kwargs) call begins a new authentication flow.
Based on your need, you can use ClientSecretCredential. To implement it, refer to the following steps.
Create a service principal and assign an Azure RBAC role (such as Storage Blob Data Owner, Storage Blob Data Contributor, or Storage Blob Data Reader) to it so it can perform Azure AD authentication and access Azure Blob storage. For more details, please refer to the documentation.
I use the Azure CLI:
# Create a service principal and assign the Storage Blob Data Contributor role at the storage account level
az login
az ad sp create-for-rbac -n "MyApp" --role "Storage Blob Data Contributor" \
    --scope "/subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>" --sdk-auth

# Or, if the service principal already exists, just assign the Storage Blob Data Contributor role at the storage account level
az role assignment create --assignee <sp_name> --role "Storage Blob Data Contributor" \
    --scope "/subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
Code
from azure.identity import ClientSecretCredential
from azure.storage.blob import BlobServiceClient

# Authenticate with the service principal created above
token_credential = ClientSecretCredential(
    sp_tenant_id,          # tenant (directory) ID
    sp_application_id,     # application (client) ID
    sp_application_secret  # client secret
)

# Instantiate a BlobServiceClient using the token credential
blob_service_client = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    credential=token_credential)

blobs = blob_service_client.list_containers()
for b in blobs:
    print(b)
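As a small follow-up, the tenant ID, client ID, and client secret do not need to be hard-coded; one common pattern is to read them from environment variables (the AZURE_* names below are just a convention, not required by the SDK):
import os
from azure.identity import ClientSecretCredential

# Assumes the three variables have been set in the environment or injected from a secure store
token_credential = ClientSecretCredential(
    tenant_id=os.environ["AZURE_TENANT_ID"],
    client_id=os.environ["AZURE_CLIENT_ID"],
    client_secret=os.environ["AZURE_CLIENT_SECRET"],
)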

connecting data lake storage gen 2 with databricks

I am trying to connect Azure Databricks with Data Lake Storage Gen2, and I am not able to match the client, secret scope, and key.
I have data in an Azure Data Lake Gen2. I am trying to follow these instructions:
https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-datalake-gen2.html#requirements-azure-data-lake
I have created a 'service principal' with the role "Storage Blob Data Contributor" and obtained its credentials.
I have created secret scopes in both Azure Key Vault and Databricks with keys and values.
When I try the code below, the authentication fails to recognize the secret scope and key. It is not clear to me from the documentation whether it is necessary to use the Azure Key Vault or the Databricks secret scope.
val configs = Map(
"fs.azure.account.auth.type" -> "OAuth",
"fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id" -> "<CLIENT-ID>",
"fs.azure.account.oauth2.client.secret" -> dbutils.secrets.get(scope = "<SCOPE-NAME>", key = "<KEY-VALUE>"),
"fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/XXXXXXXXXX/oauth2/token")
If anybody could help on this, please advise / confirm:
What should CLIENT-ID be? I understand this to be from the storage account.
Where should the SCOPE-NAME and KEY-VALUE be created: in Azure Key Vault or in Databricks?
The XXXX in https://login.microsoftonline.com/XXXXXXXXXX/oauth2/token should be your TenantID (get this from the Azure Active Directory tab in the Portal > Properties > DirectoryID).
The Client ID is the ApplicationID/Service Principal ID (sadly these names are used interchangeably in the Azure world - but they are all the same thing).
If you have not created a service principal yet follow these instructions: https://learn.microsoft.com/en-us/azure/storage/common/storage-auth-aad-app#register-your-application-with-an-azure-ad-tenant - make sure you grant the service principal access to your lake once it is created.
You should create a secret scope and secret for the service principal's key, as this is something you want to keep out of free text. You cannot create this in the Databricks UI (yet). Use one of these (a Python sketch against the REST API follows the list):
CLI - https://docs.databricks.com/user-guide/secrets/secrets.html#create-a-secret
PowerShell - https://github.com/DataThirstLtd/azure.databricks.cicd.tools/wiki/Set-DatabricksSecret
REST API - https://docs.databricks.com/api/latest/secrets.html#put-secret
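If you prefer to stay in Python, the same can be done against the Secrets REST API with plain requests. A rough sketch, assuming a workspace URL and a personal access token (the endpoint paths are the Secrets API 2.0 ones linked above):
import requests

# Hypothetical values: fill in your own workspace URL and personal access token
workspace_url = "https://<databricks-instance>"
headers = {"Authorization": "Bearer <personal-access-token>"}

# Create a Databricks-backed secret scope
requests.post(f"{workspace_url}/api/2.0/secrets/scopes/create",
              headers=headers,
              json={"scope": "<SCOPE-NAME>"}).raise_for_status()

# Store the service principal's client secret under a key in that scope
requests.post(f"{workspace_url}/api/2.0/secrets/put",
              headers=headers,
              json={"scope": "<SCOPE-NAME>", "key": "<KEY-NAME>", "string_value": "<client-secret>"}).raise_for_status()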
Right now I do not think you can create these secrets in Azure Key Vault, though I expect to see that in the future. Technically you could manually integrate with Key Vault using its APIs, but that gives you another headache: you need a secret credential just to connect to Key Vault.
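For illustration, a manual Key Vault lookup would look roughly like this (assuming the azure-keyvault-secrets and azure-identity packages; note that it still needs its own credential to reach Key Vault, which is exactly the extra headache mentioned above):
from azure.identity import ClientSecretCredential
from azure.keyvault.secrets import SecretClient

# You still need a credential to reach Key Vault in the first place
credential = ClientSecretCredential(tenant_id="<tenant-id>",
                                    client_id="<client-id>",
                                    client_secret="<client-secret>")
client = SecretClient(vault_url="https://<vault-name>.vault.azure.net", credential=credential)
service_principal_secret = client.get_secret("<secret-name>").value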
I was facing the same issue; the only extra thing I did was to assign the default permissions for the application on the Data Lake Gen2 blob container in Azure Storage Explorer. This required the object ID of the application, which is not the one shown in the portal UI; it can be retrieved with the command "az ad sp show --id <application-id>" in the Azure CLI.
After assigning the permission on the blob container, create a new file and then try to access it.
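If you prefer to script that ACL change instead of using Storage Explorer, a rough sketch with the azure-storage-file-datalake package is below (all names are placeholders; <object-id> is the value returned by az ad sp show):
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Credential with rights to manage ACLs on the account (placeholder values)
credential = ClientSecretCredential(tenant_id="<tenant-id>",
                                    client_id="<admin-client-id>",
                                    client_secret="<admin-client-secret>")
service = DataLakeServiceClient(account_url="https://<storage-account>.dfs.core.windows.net",
                                credential=credential)

# Grant the application's object ID rwx on the container root, plus a default ACL for new children
root = service.get_file_system_client("<container-name>").get_directory_client("/")
root.update_access_control_recursive(acl="user:<object-id>:rwx,default:user:<object-id>:rwx")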

Resources