I think I know the answer but just to be sure, is it possible to generate a SAS URI for ADLS GEN1?
What is the alternative? Use service principals?
Thnx,
Hennie
No, SAS tokens don't support Azure Data Lake Gen1 for now; we can't generate a SAS URI for ADLS Gen1. SAS only supports Blob, Queue, File, and Table storage.
We can see this in the SAS document Delegate access with a shared access signature:
The service SAS delegates access to a resource in just one of the storage services: the Blob, Queue, Table, or File service.
ADLS Gen1 only supports access control lists (ACLs) to control access permissions on its files and folders.
Ref: Access control in Azure Data Lake Storage Gen1
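As for the alternative, a minimal sketch of service principal authentication against ADLS Gen1 with the azure-datalake-store Python package; the tenant, app, secret, and store names below are placeholders, and the service principal still needs appropriate ACLs on the target paths:

# Minimal sketch: authenticate to ADLS Gen1 with a service principal instead of a SAS URI.
# All values are placeholders.
from azure.datalake.store import core, lib

token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<app-id>",
                 client_secret="<app-secret>")
adls = core.AzureDLFileSystem(token, store_name="<adls-gen1-account-name>")

# The principal can only reach paths its ACLs allow.
print(adls.ls("/"))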
HTH.
I have an ADLS storage account with blob containers. I have successfully mounted ADLS with a service principal in Databricks and am able to do the necessary transformations on the data.
Now I'm in the process of using user-assigned managed identities to avoid keeping secrets in my code. For this, I have created the required managed identity and enabled it by assigning the necessary role on the storage account.
My question is: how can I use the managed identity, or how can I do my transformations on the ADLS storage from Databricks without mounting or using secrets?
Please suggest a working solution or any helpful forum for the same.
Thanks.
You can authenticate automatically to Azure Data Lake Storage Gen1 (ADLS Gen1) and Azure Data Lake Storage Gen2 (ADLS Gen2) from Azure Databricks clusters using the same Azure Active Directory (Azure AD) identity that you use to log into Azure Databricks. When you enable Azure Data Lake Storage credential passthrough for your cluster, commands that you run on that cluster can read and write data in Azure Data Lake Storage without requiring you to configure service principal credentials for access to storage.
Enable Azure Data Lake Storage credential passthrough for a High Concurrency cluster
High concurrency clusters can be shared by multiple users. They support only Python and SQL with Azure Data Lake Storage credential passthrough.
When you create a cluster, set Cluster Mode to High Concurrency.
Under Advanced Options, select Enable credential passthrough for user-level data access and only allow Python and SQL commands.
Enable Azure Data Lake Storage credential passthrough for a Standard cluster
When you create a cluster, set the Cluster Mode to Standard.
Under Advanced Options, select Enable credential passthrough for user-level data access and select the user name from the Single User Access drop-down.
Access Azure Data Lake Storage directly using credential passthrough
After configuring Azure Data Lake Storage credential passthrough and creating storage containers, you can access data directly in Azure Data Lake Storage Gen1 using an adl:// path and Azure Data Lake Storage Gen2 using an abfss:// path.
Example:
Python - spark.read.csv("adl://<storage-account-name>.azuredatalakestore.net/MyData.csv").collect()
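A slightly fuller sketch, assuming credential passthrough is enabled on the cluster; the account, file system, and file names are placeholders:

# No keys or secrets in code; the cluster passes through the notebook user's Azure AD identity.
# ADLS Gen1 (adl:// path)
df_gen1 = spark.read.csv("adl://<storage-account-name>.azuredatalakestore.net/MyData.csv")
# ADLS Gen2 (abfss:// path)
df_gen2 = spark.read.csv("abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/MyData.csv")
df_gen2.show(5)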
Refer to this official documentation: Access Azure Data Lake Storage using Azure Active Directory credential passthrough
How can we read data from Azure Blob storage without using a SAS token or account key, if the user already has a role like Contributor or Reader on the storage account or container?
Since you do not want to use either a SAS token or an account key, you can leverage AAD authentication by assigning any of the following RBAC roles to the identity your Spark job runs as; a minimal example follows the role list.
Storage Blob Data Owner: Use to set ownership and manage POSIX access control for Azure Data Lake Storage Gen2. For more information, see Access control in Azure Data Lake Storage Gen2.
Storage Blob Data Contributor: Use to grant read/write/delete permissions to Blob storage resources.
Storage Blob Data Reader: Use to grant read-only permissions to Blob storage resources.
Storage Blob Delegator: Get a user delegation key to use to create a shared access signature that is signed with Azure AD credentials for a container or blob.
https://learn.microsoft.com/en-us/azure/storage/blobs/authorize-access-azure-active-directory#azure-built-in-roles-for-blobs
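A minimal sketch of such an AAD-authenticated read, assuming the identity resolved by DefaultAzureCredential holds Storage Blob Data Reader (or higher) on the account or container; the account, container, and blob names are placeholders:

# Read a blob with Azure AD (RBAC) instead of a SAS token or account key.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

credential = DefaultAzureCredential()
service = BlobServiceClient(
    account_url="https://<storage-account-name>.blob.core.windows.net",
    credential=credential,
)
blob = service.get_blob_client(container="<container-name>", blob="<blob-name>")
data = blob.download_blob().readall()
print(len(data), "bytes read")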
I am currently trying to send a CSV file to Azure Data Lake Gen2 from an Azure Function with Node.js, but I am unable to do so. Any suggestions regarding this would be really helpful.
Thanks.
I have tried to use the Blob storage credentials of the ADLS Gen2 account with the Blob storage APIs, but I am getting an error.
For now this cannot be implemented with the SDK. Please check this known issue:
Blob storage APIs are disabled to prevent feature operability issues that could arise because Blob Storage APIs aren't yet interoperable with Azure Data Lake Gen2 APIs.
And in the table of features, you can find the information about APIs for Data Lake Storage Gen2 storage accounts:
multi-protocol access on Data Lake Storage is currently in public preview. This preview enables you to use Blob APIs in the .NET, Java, Python SDKs with accounts that have a hierarchical namespace. The SDKs don't yet contain APIs that enable you to interact with directories or set access control lists (ACLs). To perform those functions, you can use Data Lake Storage Gen2 REST APIs.
So if you want to implement it, you have to use the REST API: Azure Data Lake Store REST API.
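A rough sketch of the REST sequence (create, append, flush), shown here in Python for illustration; the same calls apply from Node.js. It assumes you already have an AAD bearer token with access to the account, and all names are placeholders:

# Data Lake Storage Gen2 REST API: create a file, append bytes, then flush to commit.
import requests

account = "<storage-account-name>"   # placeholder
filesystem = "<file-system-name>"    # placeholder
path = "output/MyData.csv"           # placeholder
token = "<aad-bearer-token>"         # placeholder

base = f"https://{account}.dfs.core.windows.net/{filesystem}/{path}"
headers = {"Authorization": f"Bearer {token}", "x-ms-version": "2019-12-12"}
data = b"col1,col2\n1,2\n"

# 1. Create the (empty) file.
requests.put(f"{base}?resource=file", headers=headers).raise_for_status()
# 2. Append the bytes at position 0.
requests.patch(f"{base}?action=append&position=0", headers=headers, data=data).raise_for_status()
# 3. Flush to commit the appended bytes.
requests.patch(f"{base}?action=flush&position={len(data)}", headers=headers).raise_for_status()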
I followed Access an Azure Data Lake Storage Gen2 account directly with OAuth 2.0 using the Service Principal and want to achieve the same but with general-purpose v2 blob storage (with the hierarchical namespace disabled). Is it possible to get this working, or is authenticating with an access key or SAS the only way?
No, that is not possible as of now. OAuth bearer tokens are supported for Azure Data Lake Storage Gen2 (with the hierarchical namespace enabled when creating the storage account). To access Azure Data Lake Storage Gen2, the ABFS driver is used:
abfss://<your-file-system-name>@<your-storage-account-name>.dfs.core.windows.net/
To access the Blob Storage you use WASB:
wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net
which only supports access key or SAS token based access.
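A sketch of that contrast in a Databricks/Spark session; all account, container, and credential values are placeholders:

# ABFS (ADLS Gen2, hierarchical namespace enabled): OAuth with a service principal.
spark.conf.set("fs.azure.account.auth.type.<account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<account>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<account>.dfs.core.windows.net", "<app-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<account>.dfs.core.windows.net", "<app-secret>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<account>.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
df_gen2 = spark.read.csv("abfss://<file-system>@<account>.dfs.core.windows.net/MyData.csv")

# WASB (plain Blob storage, no hierarchical namespace): account key or SAS only, no OAuth.
spark.conf.set("fs.azure.account.key.<account>.blob.core.windows.net", "<account-key>")
df_blob = spark.read.csv("wasbs://<container>@<account>.blob.core.windows.net/MyData.csv")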
What is the difference between mounting an Azure Data Lake Store Gen2 on Databricks using a service principal and direct access using a SAS key?
I want to know the difference in terms of data transfer and security of access.
Thanks
If you mount the storage, all users on all clusters get access.
If you do not mount and instead connect directly in the session using either a service principal or a SAS (I don't think a SAS key is officially supported, BTW), the user in that session must have access to the credentials to create the connection.
Service principals can also have lower-level permissions applied within the lake, such as restricting access to certain folders.
Note that with ADLS Gen2 you now also have the option of passing through the user credentials: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/adls-passthrough.html
I do not know of any performance differences.
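For reference, a minimal sketch of the mount approach with a service principal; once mounted, the path is visible to every user on every cluster, whereas a session-scoped configuration stays with the user who set it. All names, the secret scope, and the mount point are placeholders:

# Mount ADLS Gen2 with a service principal (placeholders throughout).
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<app-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<key>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
dbutils.fs.mount(
    source="abfss://<file-system>@<account>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs,
)
# Any user on any cluster can now read through the mount point.
df = spark.read.csv("/mnt/<mount-name>/MyData.csv")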