How to mount Azure Data Lake Store on DBFS - azure

I need to mount Azure Data Lake Store Gen1 data folders on the Azure Databricks File System (DBFS) using Azure service principal client credentials. Please help with this.

There are three ways of accessing Azure Data Lake Storage Gen1:
1. Pass your Azure Active Directory credentials, also known as credential passthrough.
2. Mount an Azure Data Lake Storage Gen1 filesystem to DBFS using a service principal and OAuth 2.0.
3. Use a service principal directly.
1. Pass your Azure Active Directory credentials, also known as credential passthrough:
You can authenticate automatically to Azure Data Lake Storage Gen1 from Azure Databricks clusters using the same Azure Active Directory (Azure AD) identity that you use to log into Azure Databricks. When you enable your cluster for Azure AD credential passthrough, commands that you run on that cluster will be able to read and write your data in Azure Data Lake Storage Gen1 without requiring you to configure service principal credentials for access to storage.
Enable Azure Data Lake Storage credential passthrough for a standard cluster
For complete setup and usage instructions, see Secure access to Azure Data Lake Storage using Azure Active Directory credential passthrough.
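Once passthrough is enabled on the cluster, commands in the notebook read and write data with no credentials configured. A minimal sketch, assuming a placeholder account name and file path:
# Passthrough-enabled cluster: no service principal settings are needed in the notebook.
# <storage-resource> and MyData.csv are placeholders for your own account and file.
df = spark.read.csv("adl://<storage-resource>.azuredatalakestore.net/MyData.csv")
display(df)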
2. Mount an Azure Data Lake Storage Gen1 filesystem to DBFS using a service principal and OAuth 2.0
Step 1: Create and grant permissions to the service principal
If your selected access method requires a service principal with adequate permissions, and you do not have one, follow these steps:
Create an Azure AD application and service principal that can access resources. Note the following properties:
application-id: An ID that uniquely identifies the client application.
directory-id: An ID that uniquely identifies the Azure AD instance.
service-credential: A string that the application uses to prove its identity.
Register the service principal, granting the correct role assignment, such as Contributor, on the Azure Data Lake Storage Gen1 account.
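For illustration, the three noted properties map onto the notebook roughly as follows, with the client secret stored in a Databricks secret scope rather than hard-coded (the scope and key names below are placeholders):
# Placeholders for the Azure AD app registration created above.
application_id = "<application-id>"   # uniquely identifies the client application
directory_id = "<directory-id>"       # uniquely identifies the Azure AD tenant
# service-credential: keep it out of the notebook and read it from a Databricks secret scope.
service_credential = dbutils.secrets.get(scope="<scope-name>", key="<key-name-for-service-credential>")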
Step 2: Mount the Azure Data Lake Storage Gen1 resource using a service principal and OAuth 2.0
Python code:
configs = {"<prefix>.oauth2.access.token.provider.type": "ClientCredential",
"<prefix>.oauth2.client.id": "<application-id>",
"<prefix>.oauth2.credential": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
"<prefix>.oauth2.refresh.url": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}
# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
mount_point = "/mnt/<mount-name>",
extra_configs = configs)
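Once mounted, the data is reachable through the regular DBFS path, and the mount can be removed when it is no longer needed. A short usage sketch (the file name is a placeholder):
# Read through the mount point like any other DBFS path.
df = spark.read.csv("/mnt/<mount-name>/MyData.csv", header=True)
df.show()

# Unmount when the mount is no longer needed.
dbutils.fs.unmount("/mnt/<mount-name>")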
3. Access directly with Spark APIs using a service principal and OAuth 2.0
You can access an Azure Data Lake Storage Gen1 storage account directly (as opposed to mounting with DBFS) with OAuth 2.0 using the service principal.
Access using the DataFrame API:
To read from your Azure Data Lake Storage Gen1 account, you can configure Spark to use service credentials with the following snippet in your notebook:
spark.conf.set("<prefix>.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("<prefix>.oauth2.client.id", "<application-id>")
spark.conf.set("<prefix>.oauth2.credential","<key-name-for-service-credential>"))
spark.conf.set("<prefix>.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
Reference: Azure Databricks - Azure Data Lake Storage Gen1

Related

How to access ADLS blob containers from Databricks using User Assigned Identity

I have an ADLS storage account with blob containers. I have successfully mounted ADLS with a service principal in Databricks and am able to do my necessary transformations on the data.
Now I'm in the process of using user-assigned managed identities to avoid keeping secrets in my code. For this, I have created the required managed identity and enabled it for my service principal by assigning the necessary role on the storage account.
My question is: how can I use the managed identity, or how can I do my transformations on the ADLS storage from Databricks without mounting or using secrets?
Please suggest a working solution or any helpful forum for the same.
Thanks.
You can authenticate automatically to Azure Data Lake Storage Gen1 (ADLS Gen1) and Azure Data Lake Storage Gen2 (ADLS Gen2) from Azure Databricks clusters using the same Azure Active Directory (Azure AD) identity that you use to log into Azure Databricks. When you enable Azure Data Lake Storage credential passthrough for your cluster, commands that you run on that cluster can read and write data in Azure Data Lake Storage without requiring you to configure service principal credentials for access to storage.
Enable Azure Data Lake Storage credential passthrough for a High Concurrency cluster
High concurrency clusters can be shared by multiple users. They support only Python and SQL with Azure Data Lake Storage credential passthrough.
When you create a cluster, set Cluster Mode to High Concurrency.
Under Advanced Options, select Enable credential passthrough for user-level data access and only allow Python and SQL commands.
Enable Azure Data Lake Storage credential passthrough for a Standard cluster
When you create a cluster, set the Cluster Mode to Standard.
Under Advanced Options, select Enable credential passthrough for user-level data access and select the user name from the Single User Access drop-down.
Access Azure Data Lake Storage directly using credential passthrough
After configuring Azure Data Lake Storage credential passthrough and creating storage containers, you can access data directly in Azure Data Lake Storage Gen1 using an adl:// path and Azure Data Lake Storage Gen2 using an abfss:// path.
Example:
Python - spark.read.csv("adl://<storage-account-name>.azuredatalakestore.net/MyData.csv").collect()
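For an ADLS Gen2 account the same pattern uses an abfss:// path; a minimal sketch with placeholder container, account, and file names:
Python - spark.read.csv("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/MyData.csv").collect()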
Refer to this official documentation: Access Azure Data Lake Storage using Azure Active Directory credential passthrough

Reading data from azure blob storage without Sas token and master key in spark scala

How can we read data from Azure Blob Storage without using a SAS token or the account key, if the user already has a role such as Contributor or Reader on the storage account or container?
Since you do not want to use either a SAS token or the account key, you can leverage Azure AD authentication by assigning one of the following RBAC roles to the identity your Spark application runs as (a usage sketch follows the role list):
Storage Blob Data Owner: Use to set ownership and manage POSIX access control for Azure Data Lake Storage Gen2. For more information, see Access control in Azure Data Lake Storage Gen2.
Storage Blob Data Contributor: Use to grant read/write/delete permissions to Blob storage resources.
Storage Blob Data Reader: Use to grant read-only permissions to Blob storage resources.
Storage Blob Delegator: Get a user delegation key to use to create a shared access signature that is signed with Azure AD credentials for a container or blob.
https://learn.microsoft.com/en-us/azure/storage/blobs/authorize-access-azure-active-directory#azure-built-in-roles-for-blobs
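Assuming the account is reachable through the dfs endpoint with the ABFS driver (i.e. an ADLS Gen2 / hierarchical-namespace account) and a service principal has been granted one of the roles above, a minimal sketch looks like the following; all angle-bracket values are placeholders, and the same configuration keys can be set from Scala:
# Sketch: OAuth client-credentials auth via the ABFS driver. The service principal is
# assumed to hold Storage Blob Data Reader (or Contributor) on the account.
account = "<storage-account-name>"
spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net", "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net",
               dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key>"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<directory-id>/oauth2/token")
df = spark.read.csv(f"abfss://<container-name>@{account}.dfs.core.windows.net/MyData.csv")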

Connect to ADLS gen 1 with Azure Databricks using SPN + certificate

I want to connect to a Data Lake Store in Databricks using a service principal with a certificate (PFX or PEM). On the Databricks page there is only a reference to using access tokens.
https://docs.databricks.com/data/data-sources/azure/azure-datalake.html
Is it possible to use a certificate?

Calling Azure AD secure functions from Azure Data Factory

I recently secured my Azure Functions with Azure Active Directory, so an access token must be set in the auth header in order to call them. I am able to do that successfully from my front-end Angular apps. But in my backend I also have Azure Data Factory. How can I enable Azure Data Factory to use Azure AD when calling the functions, rather than the host key?
You can use the Web activity to get the bearer token and then pass it to the subsequent calls.
Azure Data Factory supports Managed Identity:
Managed identity for Data Factory
When creating a data factory, a managed identity can be created along with factory creation. The managed identity is a managed application registered to Azure Active Directory, and it represents this specific data factory.
Managed identity for Data Factory is used by the following features:
Store credential in Azure Key Vault, in which case data factory managed identity is used for Azure Key Vault authentication.
Connectors including Azure Blob storage, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, and Azure SQL Data Warehouse.
Web activity.

Using service principal to access blob storage from Databricks

I followed Access an Azure Data Lake Storage Gen2 account directly with OAuth 2.0 using the Service Principal and want to achieve the same but with Blob Storage general-purpose v2 (with the hierarchical filesystem disabled). Is it possible to get this working, or is authenticating with an access key or SAS the only way?
No, that is not possible as of now. OAuth bearer tokens are supported only for Azure Data Lake Storage Gen2 (with the hierarchical namespace enabled when creating the storage account). To access Azure Data Lake Storage Gen2 the ABFS driver is used:
abfss://<your-file-system-name>@<your-storage-account-name>.dfs.core.windows.net/
To access Blob Storage you use WASB:
wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net
which only supports access key or SAS token-based access.
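For the WASB route, the usual pattern is to supply the storage account access key (or a SAS) through Spark configuration. A minimal access-key sketch, with placeholder names and the key read from a Databricks secret scope:
# WASB access with the storage account key; OAuth is not an option here.
# All angle-bracket values are placeholders.
spark.conf.set(
    "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
    dbutils.secrets.get(scope="<scope-name>", key="<storage-account-key-name>"))
df = spark.read.csv("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/MyData.csv")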
