How to access ADLS blob containers from Databricks using User Assigned Identity - azure

I have an ADLS storage account with blob containers. I have successfully mounted ADLS in Databricks with a Service Principal and am able to do the necessary transformations on the data.
Now I'm in the process of moving to User Assigned Managed Identities to avoid keeping secrets in my code. For this, I have created the required Managed Identity and enabled it for my service principal by assigning the necessary role on the storage account.
My question is: how can I use the Managed Identity, or how can I do my transformations on the ADLS storage from Databricks, without mounting or using secrets?
Please suggest a working solution or any helpful forum for the same.
Thanks.

You can authenticate automatically to Azure Data Lake Storage Gen1 (ADLS Gen1) and Azure Data Lake Storage Gen2 (ADLS Gen2) from Azure Databricks clusters using the same Azure Active Directory (Azure AD) identity that you use to log into Azure Databricks. When you enable Azure Data Lake Storage credential passthrough for your cluster, commands that you run on that cluster can read and write data in Azure Data Lake Storage without requiring you to configure service principal credentials for access to storage.
Enable Azure Data Lake Storage credential passthrough for a High Concurrency cluster
High concurrency clusters can be shared by multiple users. They support only Python and SQL with Azure Data Lake Storage credential passthrough.
When you create a cluster, set Cluster Mode to High Concurrency.
Under Advanced Options, select Enable credential passthrough for user-level data access and only allow Python and SQL commands.
Enable Azure Data Lake Storage credential passthrough for a Standard cluster
When you create a cluster, set the Cluster Mode to Standard.
Under Advanced Options, select Enable credential passthrough for user-level data access and select the user name from the Single User Access drop-down.
Access Azure Data Lake Storage directly using credential passthrough
After configuring Azure Data Lake Storage credential passthrough and creating storage containers, you can access data directly in Azure Data Lake Storage Gen1 using an adl:// path and Azure Data Lake Storage Gen2 using an abfss:// path.
Example (Python):
spark.read.csv("adl://<storage-account-name>.azuredatalakestore.net/MyData.csv").collect()
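For ADLS Gen2, the same passthrough pattern applies with an abfss:// URI. A minimal sketch, assuming a CSV file and placeholder container, account, and file names:
# Requires a cluster with credential passthrough enabled; all bracketed names are placeholders.
df = spark.read.csv("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/MyData.csv")
df.show()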
Refer to this official documentation: Access Azure Data Lake Storage using Azure Active Directory credential passthrough

Related

Get the full path of a file in azure synapse studio using pyspark

I need to process a PDF file from my storage account. In the local environment, we used to access the file with a path like 'C:\path\file1.pdf'. But how can I access data in an Azure storage account from Azure Synapse Studio using PySpark (Python)?
Manual method: if you want to construct the full path to the file in the storage account yourself, use the following formats.
For ADLS Gen2 accounts: 'abfss://<FileSystemName>@<StorageName>.dfs.core.windows.net/FilePath/FileName'
For Azure Blob accounts: 'wasbs://<ContainerName>@<StorageName>.blob.core.windows.net/FilePath/FileName'
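As a rough sketch of the manual method, assuming a CSV file and placeholder account, container, and path names, either path can be handed straight to the Spark reader in a Synapse notebook (spark is the session object the notebook provides):
# ADLS Gen2 path
df_gen2 = spark.read.csv("abfss://<FileSystemName>@<StorageName>.dfs.core.windows.net/FilePath/FileName.csv")
# Blob storage path
df_blob = spark.read.csv("wasbs://<ContainerName>@<StorageName>.blob.core.windows.net/FilePath/FileName.csv")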
Automatic method: here are the steps to get the full path of a file in Azure Synapse Studio using PySpark. You can create a linked service to connect to the external data (Azure Blob Storage / ADLS Gen1 / Gen2).
Step 1: You can analyze the data in your workspace default ADLS Gen2 account, or you can link an ADLS Gen2 or Blob storage account to your workspace through "Manage" > "Linked Services" > "New".
Step 2: Once a connection is created, the underlying data of that connection is available for analysis in the Data hub or for pipeline activities in the Integrate hub.
Step 3: You have now connected Azure Data Lake Gen2 without having to pass any path manually.
Reference: Azure Synapse Analytics - Analyze data in a storage account

Reading data from azure blob storage without Sas token and master key in spark scala

How can we read data from Azure Blob storage without using a SAS token or the account (master) key, if the user already has a role such as Contributor or Reader on the storage account or container?
Since you do not want to use either a SAS token or an account key, you can leverage AAD authentication by assigning any of the following RBAC roles to the identity your Spark job runs as (a sketch follows the list below).
Storage Blob Data Owner: Use to set ownership and manage POSIX access control for Azure Data Lake Storage Gen2. For more information, see Access control in Azure Data Lake Storage Gen2.
Storage Blob Data Contributor: Use to grant read/write/delete permissions to Blob storage resources.
Storage Blob Data Reader: Use to grant read-only permissions to Blob storage resources.
Storage Blob Delegator: Get a user delegation key to use to create a shared access signature that is signed with Azure AD credentials for a container or blob.
https://learn.microsoft.com/en-us/azure/storage/blobs/authorize-access-azure-active-directory#azure-built-in-roles-for-blobs
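As a hedged sketch (shown in Python, though the same spark.conf.set calls work from Scala), one way to use such a role without a SAS token or account key is to run as a service principal that holds the role and configure the ABFS driver for OAuth. Every bracketed name below is a placeholder, and the client secret should come from a secret store rather than plain text:
# Placeholders: <storage-account>, <container>, <application-id>, <client-secret>, <tenant-id>
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", "<client-secret>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
# With one of the roles above assigned to the service principal, no key or SAS is needed for the read.
df = spark.read.csv("abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/file.csv")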

How to mount Azure Data Lake Store on DBFS

I need to mount Azure Data Lake Store Gen1 data folders on the Azure Databricks File System using Azure Service Principal client credentials. Please help with the same.
There are three ways of accessing Azure Data Lake Storage Gen1:
Pass your Azure Active Directory credentials, also known as credential passthrough.
Mount an Azure Data Lake Storage Gen1 filesystem to DBFS using a service principal and OAuth 2.0.
Use a service principal directly.
1. Pass your Azure Active Directory credentials, also known as credential passthrough:
You can authenticate automatically to Azure Data Lake Storage Gen1 from Azure Databricks clusters using the same Azure Active Directory (Azure AD) identity that you use to log into Azure Databricks. When you enable your cluster for Azure AD credential passthrough, commands that you run on that cluster will be able to read and write your data in Azure Data Lake Storage Gen1 without requiring you to configure service principal credentials for access to storage.
Enable Azure Data Lake Storage credential passthrough for a standard cluster
For complete setup and usage instructions, see Secure access to Azure Data Lake Storage using Azure Active Directory credential passthrough.
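With passthrough enabled, a read against an adl:// path needs no credential configuration at all; a minimal sketch with a placeholder account and file name:
# Your own Azure AD identity is used -- no service principal or secret in the notebook.
df = spark.read.csv("adl://<storage-account-name>.azuredatalakestore.net/MyData.csv")
df.show()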
2. Mount an Azure Data Lake Storage Gen1 filesystem to DBFS using a service principal and OAuth 2.0
Step 1: Create and grant permissions to a service principal
If your selected access method requires a service principal with adequate permissions, and you do not have one, follow these steps:
Create an Azure AD application and service principal that can access resources. Note the following properties:
application-id: An ID that uniquely identifies the client application.
directory-id: An ID that uniquely identifies the Azure AD instance.
service-credential: A string that the application uses to prove its identity.
Register the service principal, granting the correct role assignment, such as Contributor, on the Azure Data Lake Storage Gen1 account.
Step 2: Mount the Azure Data Lake Storage Gen1 resource using a service principal and OAuth 2.0
Python code:
configs = {
    "<prefix>.oauth2.access.token.provider.type": "ClientCredential",
    "<prefix>.oauth2.client.id": "<application-id>",
    "<prefix>.oauth2.credential": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
    "<prefix>.oauth2.refresh.url": "https://login.microsoftonline.com/<directory-id>/oauth2/token"
}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
    source = "adl://<storage-resource>.azuredatalakestore.net/<directory-name>",
    mount_point = "/mnt/<mount-name>",
    extra_configs = configs)
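Once mounted, the folder is addressable through the /mnt path like any other DBFS location. A short usage sketch (MyData.csv is a placeholder file name):
display(dbutils.fs.ls("/mnt/<mount-name>"))            # list the mounted folder
df = spark.read.csv("/mnt/<mount-name>/MyData.csv")    # read through the mount point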
3. Access directly with Spark APIs using a service principal and OAuth 2.0
You can access an Azure Data Lake Storage Gen1 storage account directly (as opposed to mounting with DBFS) with OAuth 2.0 using the service principal.
Access using the DataFrame API:
To read from your Azure Data Lake Storage Gen1 account, you can configure Spark to use service credentials with the following snippet in your notebook:
spark.conf.set("<prefix>.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("<prefix>.oauth2.client.id", "<application-id>")
spark.conf.set("<prefix>.oauth2.credential","<key-name-for-service-credential>"))
spark.conf.set("<prefix>.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
Reference: Azure Databricks - Azure Data Lake Storage Gen1

Using service principal to access blob storage from Databricks

I followed Access an Azure Data Lake Storage Gen2 account directly with OAuth 2.0 using the Service Principal and want to achieve the same but with blob storage general purpose v2 (with hierarchical fs disabled). Is it possible to get this working, or authenticating using access key or SAS is the only way?
No, that is not possible as of now. OAuth bearer tokens are supported only for Azure Data Lake Storage Gen2 (with the hierarchical namespace enabled when creating the storage account). To access Azure Data Lake Storage Gen2 the ABFS driver is used:
abfss://<your-file-system-name>@<your-storage-account-name>.dfs.core.windows.net/
To access Blob Storage you use WASB:
wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net
which does not support OAuth bearer tokens; it only supports account key and SAS token based access.
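To make the contrast concrete, here is a hedged sketch of the blob-endpoint pattern: with no hierarchical namespace, the WASB driver is configured with the account key, pulled here from a secret scope (all bracketed names are placeholders):
spark.conf.set(
    "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
    dbutils.secrets.get(scope = "<scope-name>", key = "<storage-account-key-name>"))
df = spark.read.csv("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/MyData.csv")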

HDInsight Spark cluster - can't connect to Azure Data Lake Store

So I have created an HDInsight Spark Cluster. I want it to access Azure Data Lake Store.
To create the HDInsight Spark cluster I followed the instructions at https://azure.microsoft.com/en-gb/documentation/articles/data-lake-store-hdinsight-hadoop-use-portal, however there was no option in the Azure Portal to configure AAD or add a Service Principal.
So my cluster was created using Azure Blob Storage only. Now I want to extend it to access Azure Data Lake Store. However, the "Cluster AAD Identity" dialog states "Service Principal: DISABLED" and all fields in the dialog are greyed out and disabled. I can't see any way to extend the storage to point to ADLS.
Any help would be appreciated!
Thanks :-)
You can move your data from Blob storage to ADLS with Data Factory, but you can't access ADLS directly from a Spark cluster created this way.
Please create an Azure HDInsight cluster with a Service Principal; the Service Principal should have access to your Data Lake Storage account.
You can configure an existing cluster to use Data Lake Storage, but that is very complicated, and in fact there is no documentation for it.
So the recommended way is to create the cluster with a Service Principal.
Which type of cluster did you create?
In our Linux cluster, all the options listed in the guide you linked are available.
