Connect to ADLS Gen1 with Azure Databricks using SPN + certificate - azure

I want to connect to a Data Lake Store from Databricks using a service principal with a certificate (PFX or PEM). On the Databricks documentation page there is only a reference to using access tokens.
https://docs.databricks.com/data/data-sources/azure/azure-datalake.html
Is it possible to use a certificate?

Related

Accessing ADLS Gen2 using Azure Databricks

I am new to the cloud (Azure) and trying to connect to ADLS Gen2 using Azure Databricks, but I am not able to access it.
Steps tried:
Got the access key from the Azure portal, which was for a specific storage account.
Tried to create a secret scope as well, which was looking for a Key Vault.
Is there any other way we can directly read the data from ADLS using Python, or any other Python module that can help us achieve this?

How to access ADLS blob containers from Databricks using User Assigned Identity

I have an ADLS storage account with blob containers. I have successfully mounted ADLS with a service principal in Databricks and am able to do my necessary transformations on the data.
Now I'm in the process of using user-assigned managed identities to avoid keeping secrets in my code. For this, I have created the required managed identity and enabled it for my service principal by assigning the necessary role on the storage account.
My question is: how can I use the managed identity, or how can I do my transformations on the ADLS storage from Databricks without mounting or using secrets?
Please suggest a working solution or any helpful forum for the same.
Thanks.
You can authenticate automatically to Azure Data Lake Storage Gen1
(ADLS Gen1) and Azure Data Lake Storage Gen2 (ADLS Gen2) from Azure
Databricks clusters using the same Azure Active Directory (Azure AD)
identity that you use to log into Azure Databricks. When you enable
Azure Data Lake Storage credential passthrough for your cluster,
commands that you run on that cluster can read and write data in Azure
Data Lake Storage without requiring you to configure service principal
credentials for access to storage.
Enable Azure Data Lake Storage credential passthrough for a High Concurrency cluster
High concurrency clusters can be shared by multiple users. They support only Python and SQL with Azure Data Lake Storage credential passthrough.
When you create a cluster, set Cluster Mode to High Concurrency.
Under Advanced Options, select Enable credential passthrough for user-level data access and only allow Python and SQL commands.
Enable Azure Data Lake Storage credential passthrough for a Standard cluster
When you create a cluster, set the Cluster Mode to Standard.
Under Advanced Options, select Enable credential passthrough for user-level data access and select the user name from the Single User Access drop-down.
Access Azure Data Lake Storage directly using credential passthrough
After configuring Azure Data Lake Storage credential passthrough and creating storage containers, you can access data directly in Azure Data Lake Storage Gen1 using an adl:// path and Azure Data Lake Storage Gen2 using an abfss:// path.
Example (Python):
spark.read.csv("adl://<storage-account-name>.azuredatalakestore.net/MyData.csv").collect()
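For ADLS Gen2, the same direct access works over the abfss:// path; a minimal sketch, assuming hypothetical container and file names:
# Requires credential passthrough to be enabled on the cluster; names are placeholders.
spark.read.csv("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/MyData.csv").collect()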
Refer to this official documentation: Access Azure Data Lake Storage using Azure Active Directory credential passthrough

Error when I am trying to connect between Azure Data factory and Azure Data lake Gen2

Hello Azure Data Factory experts,
I am trying to connect from Azure Data Factory to Data Lake Gen2 by creating a linked service in Azure Data Factory, but I got the error below.
Can anyone help?
BR,
Mohammed
ADLS Gen2 operation failed for: Storage operation '' on container
'XXXXXXXXXX' get failed with 'Operation returned an invalid status
code 'Forbidden''. Possible root causes: (1). It's possible because
the service principal or managed identity don't have enough permission
to access the data. (2). It's possible because some IP address ranges
of Azure Data Factory are not allowed by your Azure Storage firewall
settings. Azure Data Factory IP ranges please refer
https://learn.microsoft.com/en-us/azure/data-factory/azure-integration-runtime-ip-addresses..
Account: 'azurestoragforalltype'. ErrorCode:
'AuthorizationPermissionMismatch'. Message: 'This request is not
authorized to perform this operation using this permission.'.
RequestId: '83628e11-d01f-0020-33c3-cc8430000000'. TimeStamp: 'Fri, 29
Oct 2021 12:49:39 GMT'.. Operation returned an invalid status code
'Forbidden' Activity ID: ef66fc97-6bbb-4d4e-99fb-95c0ba3a4c31.
Hi, you should assign yourself at least Contributor access to the lake and then try to connect to the lake using the managed identity from ADF; this should solve your problem.
Below are the different authentication types that the Azure Data Lake Storage Gen2 connector supports:
Account key authentication
Service principal authentication
System-assigned managed identity authentication
User-assigned managed identity authentication
As per your error message, you might be using the service principal or managed identity authentication method in the Azure Data Lake Gen2 connector.
You must grant proper permissions to the service principal/managed identity. Grant at least Execute permission for ALL upstream folders and the file system, along with Read permission for the files to copy. Alternatively, in Access control (IAM), grant at least the Storage Blob Data Reader role.
You can check this document to see examples of how permissions work in Azure Data Lake Gen2.
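Outside of Data Factory, you can sanity-check the service principal's permissions on the container with a short script; a minimal sketch, assuming the azure-identity and azure-storage-file-datalake Python packages and hypothetical placeholder values:
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder values; the client secret should normally come from a secure store, not be hard-coded.
credential = ClientSecretCredential(tenant_id="<directory-id>",
                                    client_id="<application-id>",
                                    client_secret="<service-credential>")
service = DataLakeServiceClient(account_url="https://<storage-account-name>.dfs.core.windows.net",
                                credential=credential)
file_system = service.get_file_system_client("<container-name>")

# Listing paths fails with AuthorizationPermissionMismatch if the RBAC role or ACLs are insufficient.
for path in file_system.get_paths():
    print(path.name)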

How to mount Azure Data Lake Store on DBFS

I need to mount Azure Data Lake Store Gen1 data folders on the Azure Databricks File System (DBFS) using Azure service principal client credentials. Please help with the same.
There are three ways of accessing Azure Data Lake Storage Gen1:
Pass your Azure Active Directory credentials, also known as credential passthrough.
Mount an Azure Data Lake Storage Gen1 filesystem to DBFS using a service principal and OAuth 2.0.
Use a service principal directly.
1. Pass your Azure Active Directory credentials, also known as credential passthrough:
You can authenticate automatically to Azure Data Lake Storage Gen1 from Azure Databricks clusters using the same Azure Active Directory (Azure AD) identity that you use to log into Azure Databricks. When you enable your cluster for Azure AD credential passthrough, commands that you run on that cluster will be able to read and write your data in Azure Data Lake Storage Gen1 without requiring you to configure service principal credentials for access to storage.
Enable Azure Data Lake Storage credential passthrough for a standard cluster
For complete setup and usage instructions, see Secure access to Azure Data Lake Storage using Azure Active Directory credential passthrough.
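With passthrough enabled, notebook commands can touch the lake directly without any service principal configuration; a minimal sketch, with a hypothetical folder name:
# No credentials in the notebook; the cluster passes your Azure AD identity through.
dbutils.fs.ls("adl://<storage-resource>.azuredatalakestore.net/<directory-name>")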
2. Mount an Azure Data Lake Storage Gen1 filesystem to DBFS using a service principal and OAuth 2.0
Step 1: Create and grant permissions to the service principal
If your selected access method requires a service principal with adequate permissions, and you do not have one, follow these steps:
Create an Azure AD application and service principal that can access resources. Note the following properties:
application-id: An ID that uniquely identifies the client application.
directory-id: An ID that uniquely identifies the Azure AD instance.
service-credential: A string that the application uses to prove its identity.
Register the service principal, granting the correct role assignment, such as Contributor, on the Azure Data Lake Storage Gen1 account.
Step 2: Mount the Azure Data Lake Storage Gen1 resource using a service principal and OAuth 2.0
Python code:
# <prefix> is typically fs.adl on recent Databricks runtimes (dfs.adls on older runtimes).
configs = {"<prefix>.oauth2.access.token.provider.type": "ClientCredential",
           "<prefix>.oauth2.client.id": "<application-id>",
           "<prefix>.oauth2.credential": dbutils.secrets.get(scope="<scope-name>", key="<key-name-for-service-credential>"),
           "<prefix>.oauth2.refresh.url": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
    source="adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs)
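Once mounted, the folder behaves like any other DBFS path; a minimal usage sketch, with a hypothetical file name:
# Read through the mount point; the file name is a placeholder.
df = spark.read.csv("/mnt/<mount-name>/MyData.csv", header=True)
# Unmount when the mount is no longer needed.
dbutils.fs.unmount("/mnt/<mount-name>")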
3. Access directly with Spark APIs using a service principal and OAuth 2.0
You can access an Azure Data Lake Storage Gen1 storage account directly (as opposed to mounting with DBFS) with OAuth 2.0 using the service principal.
Access using the DataFrame API:
To read from your Azure Data Lake Storage Gen1 account, you can configure Spark to use service credentials with the following snippet in your notebook:
spark.conf.set("<prefix>.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("<prefix>.oauth2.client.id", "<application-id>")
spark.conf.set("<prefix>.oauth2.credential", dbutils.secrets.get(scope="<scope-name>", key="<key-name-for-service-credential>"))
spark.conf.set("<prefix>.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
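After setting these configs, you can read directly over the adl:// path; a minimal sketch, with a hypothetical file name:
# Direct read using the service principal configured above; the file name is a placeholder.
df = spark.read.csv("adl://<storage-resource>.azuredatalakestore.net/<directory-name>/MyData.csv")
df.show()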
Reference: Azure Databricks - Azure Data Lake Storage Gen1

Using service principal to access blob storage from Databricks

I followed Access an Azure Data Lake Storage Gen2 account directly with OAuth 2.0 using the Service Principal and want to achieve the same but with Blob Storage general purpose v2 (with hierarchical namespace disabled). Is it possible to get this working, or is authenticating using an access key or SAS the only way?
No, that is not possible as of now. An OAuth bearer token is supported only for Azure Data Lake Storage Gen2 (with the hierarchical namespace enabled when creating the storage account). To access Azure Data Lake Storage Gen2 the ABFS driver is used:
abfss://<your-file-system-name>@<your-storage-account-name>.dfs.core.windows.net/
To access Blob Storage you use WASB:
wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net
which supports only access key or SAS token based access.
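For example, a minimal sketch of reading from Blob Storage in a Databricks notebook with an account access key kept in a secret scope (scope and key names are hypothetical):
# Account-key auth for the WASB driver; scope/key names are placeholders.
spark.conf.set("fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
               dbutils.secrets.get(scope="<scope-name>", key="<storage-account-key>"))
df = spark.read.csv("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/MyData.csv")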
