Using a service principal to access Blob Storage from Databricks

I followed Access an Azure Data Lake Storage Gen2 account directly with OAuth 2.0 using the Service Principal and want to achieve the same but with a general-purpose v2 Blob Storage account (with the hierarchical namespace disabled). Is it possible to get this working, or is authenticating with an access key or SAS the only way?

No, that is not possible as of now. OAuth bearer tokens are supported only for Azure Data Lake Storage Gen2 (that is, a storage account created with the hierarchical namespace enabled). Azure Data Lake Storage Gen2 is accessed through the ABFS driver:
abfss://<your-file-system-name>@<your-storage-account-name>.dfs.core.windows.net/
Blob Storage is accessed through the WASB driver:
wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net
which supports only access-key and SAS-token based authentication.
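For completeness, here is a minimal sketch of the access-key approach that WASB does support, run from a Databricks notebook; the storage-account name, container name, secret-scope name and file path below are placeholders, not values from the question:
Python code:
# Give the WASB driver the storage account access key,
# read from a Databricks secret scope rather than hard-coded.
spark.conf.set(
    "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
    dbutils.secrets.get(scope = "<scope-name>", key = "<storage-account-key-name>"))

# Read a file straight from the container.
df = spark.read.csv("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<path-to-file>.csv")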

Related

How to access ADLS blob containers from Databricks using User Assigned Identity

I have an ADLS storage account with blob containers. I have successfully mounted ADLS in Databricks with a service principal and am able to do the necessary transformations on the data.
Now I'm in the process of using user-assigned managed identities to avoid keeping secrets in my code. For this, I have created the required managed identity and enabled it for my service principal by assigning the necessary role on the storage account.
My question is: how can I use the managed identity, i.e. how can I do my transformations on the ADLS storage from Databricks without mounting or using secrets?
Please suggest a working solution or any helpful forum for the same.
Thanks.
You can authenticate automatically to Azure Data Lake Storage Gen1 (ADLS Gen1) and Azure Data Lake Storage Gen2 (ADLS Gen2) from Azure Databricks clusters using the same Azure Active Directory (Azure AD) identity that you use to log into Azure Databricks. When you enable Azure Data Lake Storage credential passthrough for your cluster, commands that you run on that cluster can read and write data in Azure Data Lake Storage without requiring you to configure service principal credentials for access to storage.
Enable Azure Data Lake Storage credential passthrough for a High Concurrency cluster
High concurrency clusters can be shared by multiple users. They support only Python and SQL with Azure Data Lake Storage credential passthrough.
When you create a cluster, set Cluster Mode to High Concurrency.
Under Advanced Options, select Enable credential passthrough for user-level data access and only allow Python and SQL commands.
Enable Azure Data Lake Storage credential passthrough for a Standard cluster
When you create a cluster, set the Cluster Mode to Standard.
Under Advanced Options, select Enable credential passthrough for user-level data access and select the user name from the Single User Access drop-down.
Access Azure Data Lake Storage directly using credential passthrough
After configuring Azure Data Lake Storage credential passthrough and creating storage containers, you can access data directly in Azure Data Lake Storage Gen1 using an adl:// path and Azure Data Lake Storage Gen2 using an abfss:// path.
Example:
Python - spark.read.csv("adl://<storage-account-name>.azuredatalakestore.net/MyData.csv").collect()
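The same passthrough setup also works against an ADLS Gen2 account via the abfss:// path; for example (the file-system and account names are placeholders):
Python - spark.read.csv("abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/MyData.csv").collect()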
Refer to this official documentation: Access Azure Data Lake Storage using Azure Active Directory credential passthrough

Reading data from azure blob storage without Sas token and master key in spark scala

How can we read data from Azure Blob Storage without using a SAS token or the account (master) key, given that the user already has a role like Contributor or Reader on the storage account or container?
Since you do not want to use either a SAS token or the account key, you can leverage Azure AD authentication by assigning any of the following RBAC roles to the identity your Spark job runs as (see the sketch after the role list).
Storage Blob Data Owner: Use to set ownership and manage POSIX access control for Azure Data Lake Storage Gen2. For more information, see Access control in Azure Data Lake Storage Gen2.
Storage Blob Data Contributor: Use to grant read/write/delete permissions to Blob storage resources.
Storage Blob Data Reader: Use to grant read-only permissions to Blob storage resources.
Storage Blob Delegator: Get a user delegation key to use to create a shared access signature that is signed with Azure AD credentials for a container or blob.
https://learn.microsoft.com/en-us/azure/storage/blobs/authorize-access-azure-active-directory#azure-built-in-roles-for-blobs
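As a sketch of how this looks from Spark once a service principal holds one of the Storage Blob Data roles: the following assumes the account is accessed through the ABFS driver (ADLS Gen2), and the account, container, application-id, directory-id and secret-scope names are placeholders:
Python code:
# OAuth client-credentials configuration for the ABFS driver.
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net",
               dbutils.secrets.get(scope = "<scope-name>", key = "<service-credential-key-name>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net",
               "https://login.microsoftonline.com/<directory-id>/oauth2/token")

# With Storage Blob Data Reader (or higher) assigned, reads succeed without a SAS token or account key.
df = spark.read.csv("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-file>.csv")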

Is it possible to create a SAS token for ADLS Gen1

I think I know the answer but just to be sure, is it possible to generate a SAS URI for ADLS GEN1?
What is the alternative? use Service principals?
Thnx,
Hennie
No, SAS tokens are not supported for Azure Data Lake Storage Gen1 for now. We can't generate a SAS URI for ADLS Gen1; SAS only supports Blob, Queue, File and Table storage.
We can get this from the SAS document Delegate access with a shared access signature:
The service SAS delegates access to a resource in just one of the storage services: the Blob, Queue, Table, or File service.
ADLS Gen1 only supports access control lists (ACLs) to control access permissions on files and folders.
Ref: Access control in Azure Data Lake Storage Gen1
HTH.

How to mount Azure Data Lake Store on DBFS

I need to mount Azure Data Lake Store Gen1 data folders on the Azure Databricks File System (DBFS) using Azure service principal client credentials. Please help with the same.
There are three ways of accessing Azure Data Lake Storage Gen1:
Pass your Azure Active Directory credentials, also known as credential passthrough.
Mount an Azure Data Lake Storage Gen1 filesystem to DBFS using a service principal and OAuth 2.0.
Use a service principal directly.
1. Pass your Azure Active Directory credentials, also known as credential passthrough:
You can authenticate automatically to Azure Data Lake Storage Gen1 from Azure Databricks clusters using the same Azure Active Directory (Azure AD) identity that you use to log into Azure Databricks. When you enable your cluster for Azure AD credential passthrough, commands that you run on that cluster will be able to read and write your data in Azure Data Lake Storage Gen1 without requiring you to configure service principal credentials for access to storage.
Enable Azure Data Lake Storage credential passthrough for a standard cluster
For complete setup and usage instructions, see Secure access to Azure Data Lake Storage using Azure Active Directory credential passthrough.
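With passthrough enabled on the cluster, Gen1 data can then be read directly; for example (the account name and path are placeholders):
Python - spark.read.csv("adl://<storage-account-name>.azuredatalakestore.net/<path-to-file>.csv").collect()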
2. Mount an Azure Data Lake Storage Gen1 filesystem to DBFS using a service principal and OAuth 2.0.
Step 1: Create and grant permissions to the service principal
If your selected access method requires a service principal with adequate permissions, and you do not have one, follow these steps:
Create an Azure AD application and service principal that can access resources. Note the following properties:
application-id: An ID that uniquely identifies the client application.
directory-id: An ID that uniquely identifies the Azure AD instance.
service-credential: A string that the application uses to prove its identity.
Register the service principal, granting the correct role assignment, such as Contributor, on the Azure Data Lake Storage Gen1 account.
Step 2: Mount the Azure Data Lake Storage Gen1 resource using a service principal and OAuth 2.0
Python code:
# Replace <prefix> with the OAuth configuration prefix expected by your Databricks Runtime version.
configs = {"<prefix>.oauth2.access.token.provider.type": "ClientCredential",
           "<prefix>.oauth2.client.id": "<application-id>",
           "<prefix>.oauth2.credential": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
           "<prefix>.oauth2.refresh.url": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
    source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
    mount_point = "/mnt/<mount-name>",
    extra_configs = configs)
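Once mounted, the data is available through DBFS; for example (the mount name and file path are placeholders):
Python - df = spark.read.csv("/mnt/<mount-name>/<path-to-file>.csv")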
3. Access directly with Spark APIs using a service principal and OAuth 2.0
You can access an Azure Data Lake Storage Gen1 storage account directly (as opposed to mounting with DBFS) with OAuth 2.0 using the service principal.
Access using the DataFrame API:
To read from your Azure Data Lake Storage Gen1 account, you can configure Spark to use service credentials with the following snippet in your notebook:
spark.conf.set("<prefix>.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("<prefix>.oauth2.client.id", "<application-id>")
spark.conf.set("<prefix>.oauth2.credential","<key-name-for-service-credential>"))
spark.conf.set("<prefix>.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
Reference: Azure Databricks - Azure Data Lake Storage Gen1

Secure access to Microsoft Azure Blob Storage

I have a frontend container, a backend container and Azure Blob Storage. Users of the frontend/backend are authenticated: the backend validates the user credentials, and users are allowed to access their media files stored in Azure Blob Storage.
I would like users to access their media files directly from Azure Blob Storage, so that the backend is not stressed too much by acting as a proxy. The media references for each user are stored in the backend.
How would you achieve this using Azure Blob Storage and its access control (or is this a misuse of Azure Blob Storage)?
You can implement this by generating a SAS token for your blob container or for an individual blob.
With a SAS, you can grant clients access to resources in your storage account without sharing your account keys.
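As a minimal sketch of how the backend could issue such a token with the azure-storage-blob Python SDK (v12); the account, container, blob and key values are placeholders:
Python code:
from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlobSasPermissions, generate_blob_sas

# The backend generates a short-lived, read-only SAS for the user's blob
# and returns the resulting URL to the frontend.
sas_token = generate_blob_sas(
    account_name = "<storage-account-name>",
    container_name = "<container-name>",
    blob_name = "<user-media-file>",
    account_key = "<storage-account-key>",  # kept on the backend only
    permission = BlobSasPermissions(read = True),
    expiry = datetime.now(timezone.utc) + timedelta(minutes = 15))

blob_url = ("https://<storage-account-name>.blob.core.windows.net/"
            "<container-name>/<user-media-file>?" + sas_token)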
