Faster way to grant access privileges to ADLS on HDInsight cluster provisioning? - azure

I have an Azure Data Lake Store (ADLS) containing ~100k files that I need to access from an HDInsight cluster for analysis. When I provision the cluster via Azure Portal, I use this ADLS for the cluster's storage and assign rwx privileges for all files on the ADLS using a service principal + the "Data Lake Store Access" feature. This feature appears to grant access to each file one at a time, at a rate of about 2k per minute: it takes over an hour just to grant the permissions!
Is there a faster way to grant a new cluster rwx privileges on its associated ADLS?

Yes, there is a better way to get this all set up. On a one-time basis, you need to add permissions for an Azure Active Directory group to all your files and folders. Once that is set up, whenever you create a new HDInsight cluster, the service principal simply needs to be made a member of the group.
So to summarize:
1. Create a new Azure Active Directory group.
2. Propagate permissions in your ADLS account to this group on the appropriate files and folders (a sketch of this step follows below).
3. Create your HDInsight cluster, choosing the right service principal when creating it.
4. Add the service principal to the group created in step 1.
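For step 2, a minimal sketch of the one-time propagation using the azure-datalake-store Python SDK (ADLS Gen1) might look like the following; the store name and group object ID are placeholders, and the portal or PowerShell can do the same job:

# Hedged sketch: grant an AAD group rwx across an ADLS Gen1 account, one time.
from azure.datalake.store import core, lib

token = lib.auth()  # interactive AAD login; a service principal credential also works
adls = core.AzureDLFileSystem(token, store_name='mydatalakestore')  # placeholder store name

group_id = '<aad-group-object-id>'  # placeholder object ID of the group from step 1
access_entry = 'group:{}:rwx'.format(group_id)
default_entry = 'default:group:{}:rwx'.format(group_id)

# On the root: an access ACL entry, plus a default ACL entry so that newly
# created children inherit the group's permissions automatically.
adls.modify_acl_entries('/', access_entry)
adls.modify_acl_entries('/', default_entry)

# Existing files still need the entry applied to each path; walk() lists the
# files under the root (existing directories may need ls() handling as well).
for path in adls.walk('/'):
    adls.modify_acl_entries(path, access_entry)

The one-time cost of touching every existing file is still there, but from then on each new cluster only needs a group membership change, which is instant.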
Hope this helps and do let me know if you have questions.

Related

Access Control from Databricks to Azure Storage Accounts and Containers

Our Databricks workspace needs to access different data sets but we need to ensure that access control can be granted on a role or individual level. The data sets are planned to be available as files on Data Lake Gen2 that will be read into dataframes etc. These files in storage accounts can be organized as seen fit for access rights (either 1 storage account per dataset - which might hit the 256 limit soon - or 1 dataset per container and thus several datasets in a storage account).
Our architectural guidelines require the access to be via service principal. However, I think this would give each user in the Databricks workspace the same access rights to different storage accounts (datasets).
Is there another feasible solution with accessing storage accounts from Databricks via service principal but at the same time have fine-grained control about access rights of individual users or at least on a role-level? Can this be achieved on a container level or only on a storage account level?
I tried using a service principal to access the storage accounts from within a Databricks workspace, but that grants every user the same access to the storage accounts.
Usually, when a user works with data, it happens in two steps:
Checking permissions for accessing a specific piece of data
Actually accessing the data in the storage account if it's allowed
This scheme is fully supported on Databricks as follows:
If your organization has already adopted Unity Catalog (UC), then it's easy: you just add the storage accounts/containers as external locations, create tables for the data in these locations, and then grant permissions to work with specific tables to users or (better) groups. Actual data access is then performed with the credentials attached to those external locations, not with each individual user's identity.
If you haven't adopted UC yet, you can enforce access via Table Access Control (TACL). In this case you will need to attach a service principal to a TACL-enabled cluster, but the actual enforcement happens in the TACL service, and data is read/written only if the user/role has permission to do so.
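As an illustration of the UC route (not the only way to configure it), a sketch run from a Databricks notebook could look like the following; the external location, storage credential, catalog/schema/table, and group names are all placeholders:

# Register the container as an external location backed by a storage credential.
spark.sql("""
  CREATE EXTERNAL LOCATION IF NOT EXISTS dataset_a_location
  URL 'abfss://dataset-a@mystorageaccount.dfs.core.windows.net/'
  WITH (STORAGE CREDENTIAL my_storage_credential)
""")

# Expose the data as a table governed by Unity Catalog.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.analytics.dataset_a
  USING DELTA
  LOCATION 'abfss://dataset-a@mystorageaccount.dfs.core.windows.net/dataset_a'
""")

# Fine-grained access: only the group that should see this dataset gets SELECT
# (the group also needs USE CATALOG / USE SCHEMA on the parent objects).
spark.sql("GRANT SELECT ON TABLE main.analytics.dataset_a TO `team-a-analysts`")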

How to stop users from accessing a certain mount point in Azure Databricks

I use two different ADLS accounts: one is open to all, and the other is a secured location with privileges given to only a few individuals.
But these privileges, granted through RBAC, are only applicable through the Azure portal, and users are still able to access the secured ADLS through a mount point set up in Azure Databricks.
Is there a way to restrict access on this mount point?
Thanks.
As per the official documentation and MSFT Q&A, all users have read and write access to the objects in object storage mounted to DBFS. We cannot restrict users from using the mount point.
You can raise a feature request here.
However, you can use the role-based access control feature for notebooks, clusters, jobs, and tables by selecting the Premium tier.

Azure Databricks cluster doesn't have access to mounted ADLS Gen2

I followed the documentation azure-datalake-gen2-sp-access and mounted an ADLS Gen2 storage account in Databricks, but when I try to view the data from the GUI I get the following error:
Cluster easy-matches-cluster-001 does not have the proper credentials to view the content. Please select another cluster.
I can't find any documentation about this, only something about premium Databricks, so can I only access it with a premium Databricks resource?
Edit1: I can see the mounted storage with dbutils.
After mounting the storage account, please run this command to check whether you have data access permissions on the mount point created.
dbutils.fs.ls("/mnt/<mount-point>")
If you have data access, you will see the files inside the storage account.
In case you don't have data access, you will get this error: "This request is not authorized to perform this operation using this permission", 403.
If you are able to mount the storage but unable to access it, check whether the ADLS Gen2 account has the necessary roles assigned.
I was able to repro the same. Since you are using an Azure Active Directory application, you have to assign the "Storage Blob Data Contributor" role to that Azure Active Directory application as well.
Below are the steps for granting the Storage Blob Data Contributor role to the registered application:
1. Select your ADLS account. Navigate to Access Control (IAM). Select Add role assignment.
2. Select the role Storage Blob Data Contributor, then search for and select your registered Azure Active Directory application and assign it.
Back in the Access Control (IAM) tab, search for your AAD app and check its access.
3. Run dbutils.fs.ls("/mnt/<mount-point>") to confirm access.
Solved it by unmounting, remounting, and restarting the cluster. I followed this doc: https://learn.microsoft.com/en-us/azure/databricks/kb/dbfs/remount-storage-after-rotate-access-key
If you still encounter the same issue even after the Access Control assignments are in place, do the following (a sketch follows below):
Use dbutils.fs.unmount() on each mount point to unmount the storage accounts.
Restart the cluster.
Remount.
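A rough sketch of that unmount/remount cycle, assuming the same service-principal (OAuth) mount described in azure-datalake-gen2-sp-access; the secret scope, application/tenant IDs, container, account, and mount-point names are placeholders:

# Unmount the existing mount point (repeat for each mounted storage account),
# then restart the cluster before remounting.
dbutils.fs.unmount("/mnt/<mount-point>")

# Remount with the service principal credentials.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<secret-key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-point>",
    extra_configs=configs,
)

# Verify that the cluster can now list the data.
display(dbutils.fs.ls("/mnt/<mount-point>"))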

Is there any method by which I can restrict other users from viewing my container in Azure Data Lake Gen2?

Problem statement: there are two different teams working on two different projects for the same client. Both teams have access to the Azure resource group in which the Azure Data Lake Storage account has been created. Now the client wants us to use the same data lake storage for both projects, but they also want the team working on specific containers to have no access to the containers the other team uses, and vice versa.
Example:
Azure Data Lake Storage - both teams have access to this
-> container1 - only team 1 should have access to this
-> container2 - only team 2 should have access to this
Can anyone please suggest how we can achieve this?
Thanks in advance!
You can manage access to containers, directories, and blobs by using the access control lists (ACLs) feature in Azure Data Lake Storage Gen2.
You can associate a security principal with an access level for files and directories. Each association is captured as an entry in an access control list (ACL). Each file and directory in your storage account has an access control list. When a security principal attempts an operation on a file or directory, an ACL check determines whether that security principal (user, group, service principal, or managed identity) has the correct permission level to perform the operation.
To manage the ACL on the container, follow the below steps:
Go to the container in the storage account.
Navigate to any container, directory, or blob. Right-click the object, and then select Manage ACL.
The Access permissions tab of the Manage ACL page appears. Use the controls in this tab to manage access to the object.
To add a security principal to the ACL, select the Add principal button.
Find the security principal by using the search box, and then click the Select button.
You should create a security group in Azure AD for each of your teams, and then maintain permissions on the group rather than for individual users.
Refer: Access control lists (ACLs) in Azure Data Lake Storage Gen2
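If you prefer to script this rather than click through the portal, a hedged sketch with the azure-storage-file-datalake Python SDK could look like the one below; the account name, container names, and group object IDs are placeholders:

# Grant each team's AAD security group rwx on its own container's root ACL.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mystorageaccount.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# Full ACL spec for the container root: owner keeps rwx, the team group gets
# rwx, everyone else gets nothing.
team1_group = "<team1-aad-group-object-id>"
acl = "user::rwx,group::r-x,other::---,group:{}:rwx".format(team1_group)

# "/" addresses the container's root directory here; adjust if your SDK
# version expects a different root reference.
container1_root = service.get_file_system_client("container1").get_directory_client("/")
container1_root.set_access_control(acl=acl)

# Repeat for container2 with team 2's group. Existing files and folders inside
# a container need the entry applied to them as well (newer SDK versions offer
# recursive ACL helpers for that).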

Not able to provide access to Azure Data Lake Store files from HDInsight cluster

I was trying to create a new HDInsight cluster and wanted to connect it to an already created Azure Data Lake Store (ADLS) account. I selected HDI 3.5 as the cluster type. I was able to select my Data Lake as the storage, but when I created the service principal and tried to grant it ADLS access, I don't see my ADLS root folder. Do I need to provide any additional permissions on my ADLS in order for it to appear in the Manage ADLS Access blade? Any help would be appreciated.
You need to create a root folder in the ADLS account before creating the HDI cluster so that the root path is defined while creating the cluster.
For example, create a root folder /clusters; the root path for your cluster would then be /clusters/yourclustername. Also make sure you give access to the root folder in the "Manage Access" blade while creating the HDI cluster.
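A small sketch of that preparation step using the azure-datalake-store Python SDK; the store name and AAD object ID are placeholders, and the same can be done from the portal's Data Explorer:

# Create the /clusters root folder and give your login rwx on it before
# provisioning the HDI cluster.
from azure.datalake.store import core, lib

adls = core.AzureDLFileSystem(lib.auth(), store_name='mydatalakestore')  # placeholder store name
adls.mkdir('/clusters')
adls.modify_acl_entries('/clusters', 'user:<your-aad-object-id>:rwx')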
Refer this for more info:
https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-hdinsight-hadoop-use-portal
The ADLS file system has a separate UNIX-like access control mechanism called ACLs (access control lists). Your login (email ID) needs ACL access on the root folder in order to access it from the HDInsight cluster.
