First some background:
I want to give different groups of data scientists access to Azure Data Lake Gen2. However, we don’t want to give them access to the entire data lake, because for security reasons they are not supposed to see all the data. They must be able to see only a limited set of files/folders. We are doing that by adding the data scientists’ AAD groups to the ACLs of the data lake folders. You can refer to the following link for more background on what I am talking about:
https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-access-control
Now the problem:
Since the data scientists are granted access to a very specific/limited area, they are able to access/browse those folders/files using Azure Databricks (Python commands/code etc.). However, they are not able to browse them using Azure Storage Explorer.
So is there some way for them to browse the data lake using Azure Storage Explorer or some other GUI tool?
Or is it possible to create a custom role for this scenario and grant it to the data scientists’ AAD groups, so that they only have access to the specific area (i.e. a custom role that would only have “execute” access on the ADLS Gen2 file systems)?
As far as I know, there is no way to use an RBAC role to control access to individual folders in a file system (container), because when we assign a role to an AAD group we need to define a scope, and the smallest scope in Azure Data Lake Gen2 is the file system (container). If you just want to control access at that level, you do not need to create a custom role; you can directly use the built-in role Storage Blob Data Reader. If a user has that role, they can read all files in the file system. For more details, please refer to the document
It is not possible to access data via Storage Explorer with only ACL permissions assigned. Unfortunately, you need to combine ACLs with an RBAC role assigned at the Storage Account level (e.g. Reader) so that users can see the Storage Account itself in Storage Explorer. You can then introduce granular permissions using ACLs on specific containers/folders/files. With Reader, however, they will still be able to see the names of all the containers in the Storage Account (but not the containers' content unless it is granted via ACL or a data RBAC assignment at container level).
As you noticed, the only option to access specific folder/file using only ACL permissions is via code e.g. Powershell or Python.
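A minimal Python sketch of the code-based route, assuming the `azure-storage-file-datalake` and `azure-identity` packages are installed. The function names and all account/container/folder values are placeholders, not anything from the question. The first helper only spells out the documented ACL rule that listing a folder needs Read and Execute on that folder plus Execute on every parent up to the container root.

```python
from typing import List, Tuple


def acl_entries_for_listing(folder: str, group_object_id: str) -> List[Tuple[str, str]]:
    """Return (path, ACL entry) pairs a group needs in order to list `folder`.

    Every ancestor directory, including the container root "/", needs
    Execute (--x) so it can be traversed; the target folder needs r-x.
    """
    parts = [p for p in folder.strip("/").split("/") if p]
    entries = [("/", f"group:{group_object_id}:--x")]
    for depth in range(1, len(parts)):
        entries.append(("/".join(parts[:depth]), f"group:{group_object_id}:--x"))
    if parts:
        entries.append(("/".join(parts), f"group:{group_object_id}:r-x"))
    else:
        # The folder is the container root itself.
        entries[0] = ("/", f"group:{group_object_id}:r-x")
    return entries


def list_folder(account_name: str, container: str, folder: str) -> List[str]:
    """List the direct children of `folder` as the signed-in Azure AD identity."""
    # Imported lazily so the helper above works without the Azure SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url=f"https://{account_name}.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    file_system = service.get_file_system_client(container)
    return [p.name for p in file_system.get_paths(path=folder, recursive=False)]
```

A call such as `list_folder("mydatalake", "raw", "sales/2023")` (hypothetical names) should only succeed if the caller's identity or one of its groups holds the entries returned by `acl_entries_for_listing`.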
Related
Our Databricks workspace needs to access different data sets but we need to ensure that access control can be granted on a role or individual level. The data sets are planned to be available as files on Data Lake Gen2 that will be read into dataframes etc. These files in storage accounts can be organized as seen fit for access rights (either 1 storage account per dataset - which might hit the 256 limit soon - or 1 dataset per container and thus several datasets in a storage account).
Our architectural guidelines require the access to be via service principal. However, I think this would give each user in the Databricks workspace the same access rights to different storage accounts (datasets).
Is there another feasible solution that accesses storage accounts from Databricks via service principal but at the same time gives fine-grained control over the access rights of individual users, or at least on a role level? Can this be achieved on a container level or only on a storage account level?
I tried to use service principal to access storage accounts from within a Databricks workspace which then grants every user the access to the storage accounts.
Usually when user is working with the data it happens in two steps:
Checking permissions for accessing a specific piece of data
Actually accessing the data in the storage account if it's allowed
This scheme is fully supported on Databricks in the following ways:
If your organization has already adopted Unity Catalog (UC), then it's easy: you just add storage accounts/containers as external locations, create tables for data in these locations, and then grant permissions on specific tables to users or (better) roles. Actual data access will be done
If you haven't adopted UC yet, then you can enforce access via Table Access Control (TACL). In this case you will need to attach a service principal to a TACL-enabled cluster, but the actual enforcement is done by the TACL service, and data will be read/written only if the user/role has permission to do that.
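As an illustration of the Unity Catalog route, here is a small sketch that builds the SQL statements for one external location, one external table and one grant. The location, credential, table and group names are hypothetical, the storage credential is assumed to exist already, and the data is assumed to be stored in Delta format.

```python
from typing import List


def unity_catalog_statements(
    location_name: str,
    location_url: str,
    storage_credential: str,
    table: str,
    table_url: str,
    group: str,
) -> List[str]:
    """Build SQL for an external location, an external table and a SELECT grant."""
    return [
        # Register the container (or a folder in it) as an external location.
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS `{location_name}` "
        f"URL '{location_url}' WITH (STORAGE CREDENTIAL `{storage_credential}`)",
        # Expose the files under that location as a table (assumes Delta format).
        f"CREATE TABLE IF NOT EXISTS {table} USING DELTA LOCATION '{table_url}'",
        # Grant read access on the table to a group rather than to single users.
        f"GRANT SELECT ON TABLE {table} TO `{group}`",
    ]


statements = unity_catalog_statements(
    location_name="sales_location",
    location_url="abfss://sales@mydatalake.dfs.core.windows.net/",
    storage_credential="my_credential",
    table="main.sales.orders",
    table_url="abfss://sales@mydatalake.dfs.core.windows.net/orders",
    group="data-scientists",
)
# In a Databricks notebook one could then run: for s in statements: spark.sql(s)
```

The group would also need `USE CATALOG` and `USE SCHEMA` on the parent catalog and schema before the `SELECT` grant becomes usable.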
I have an ADLS Gen2 account deployed in Azure. We are providing data to different teams to transform. For security reasons, we only grant ACL permissions. Now that the data is becoming huge, whenever a new team is introduced we run into problems when granting access at container level.
Currently we are using PowerShell. It takes around 5+ hrs if the data in the container is 20GB+.
Is there any way to reduce the time? Is there another language we can use, or an alternative solution?
It sounds like you have a single storage container and are granting access on a per item basis.
This is not sustainable as the number of items and the number of teams grows.
You need to group the data in a way that you can grant access to a team for a set of data.
Possible options:
Create several storage accounts, grant access to teams on a storage account level
Create containers within the storage account, place data in containers, grant access on container level.
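If ACLs are kept, one thing that may help is granting the team's group on a whole directory tree with a single recursive call from the Python SDK instead of item by item. The sketch below assumes the `azure-storage-file-datalake` and `azure-identity` packages; the function names and all argument values are placeholders, and how much time it saves will depend on the number of items.

```python
def group_acl_entry(group_object_id: str, permissions: str = "r-x", default: bool = False) -> str:
    """Build one ACL entry for an Azure AD group, e.g. 'group:<id>:r-x'.

    With default=True the entry is a default ACL entry, which new children
    of a directory inherit when they are created.
    """
    valid = len(permissions) == 3 and all(
        ch in allowed for ch, allowed in zip(permissions, ("r-", "w-", "x-"))
    )
    if not valid:
        raise ValueError(f"permissions must look like 'r-x', got {permissions!r}")
    entry = f"group:{group_object_id}:{permissions}"
    return f"default:{entry}" if default else entry


def grant_recursively(account_name: str, container: str, directory: str,
                      group_object_id: str, permissions: str = "r-x"):
    """Add or update the group's entry on `directory` and everything below it."""
    # Imported lazily so group_acl_entry works without the Azure SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url=f"https://{account_name}.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    directory_client = service.get_file_system_client(container).get_directory_client(directory)
    # Merges the entry into existing ACLs instead of replacing them.
    return directory_client.update_access_control_recursive(
        acl=group_acl_entry(group_object_id, permissions)
    )
```

Setting a matching default entry on the top-level directories when they are created means items written later inherit the grant, so the recursive pass is only needed for data that already exists.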
Problem statement: There are two different teams working on two different projects for the same client. Both teams have access to the Azure resource group in which the Azure Data Lake Storage account has been created. Now the client wants us to use the same data lake storage for both projects, but they also want a team working on a specific container not to have access to the containers the other team uses, and vice versa.
Example--
Azure data lake storage -both team have access to this
->container1--only team 1 should have access to this
->container2--only team 2 should have access to this
Can anyone please suggest how we can achieve this?
Thanks In advance!!
You can manage the access to containers, directories and blobs by using Access control lists (ACLs) feature in Azure Data Lake Storage Gen2.
You can associate a security principal with an access level for files and directories. Each association is captured as an entry in an access control list (ACL). Each file and directory in your storage account has an access control list. When a security principal attempts an operation on a file or directory, an ACL check determines whether that security principal (user, group, service principal, or managed identity) has the correct permission level to perform the operation.
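To make the idea of an ACL check concrete, here is a simplified model in Python. It loosely follows the access-check pseudocode in the ADLS Gen2 ACL documentation but is only a sketch: super-users, the owning user, the owning group and the sticky bit are left out, and the object IDs are made up.

```python
from dataclasses import dataclass
from typing import List, Set

PERMISSION_BITS = {"r": 4, "w": 2, "x": 1}


def to_bits(perms: str) -> int:
    """Convert a string such as 'r-x' into its numeric form (here 5)."""
    return sum(PERMISSION_BITS[ch] for ch in perms if ch != "-")


@dataclass(frozen=True)
class AclEntry:
    kind: str      # "user", "group", "mask" or "other"
    identity: str  # object ID; empty for owner, owning group, mask and other
    perms: str     # e.g. "r-x"


def parse_acl(acl: str) -> List[AclEntry]:
    """Parse 'user::rwx,group:<id>:r-x,...' into entries (access scope only)."""
    entries = []
    for raw in acl.split(","):
        kind, identity, perms = raw.strip().split(":")
        entries.append(AclEntry(kind, identity, perms))
    return entries


def is_allowed(acl: str, user_id: str, group_ids: Set[str], desired: str) -> bool:
    """Simplified check: named user first, then named groups, then 'other'."""
    entries = parse_acl(acl)
    want = to_bits(desired)
    mask = next((to_bits(e.perms) for e in entries if e.kind == "mask"), 7)

    # A matching named-user entry decides the outcome on its own.
    for e in entries:
        if e.kind == "user" and e.identity == user_id:
            return (want & to_bits(e.perms) & mask) == want

    # Any named group the user belongs to may grant the permission.
    for e in entries:
        if e.kind == "group" and e.identity in group_ids:
            if (want & to_bits(e.perms) & mask) == want:
                return True

    # Otherwise fall back to the 'other' entry.
    other = next((to_bits(e.perms) for e in entries if e.kind == "other"), 0)
    return (want & other & mask) == want


acl = "user::rwx,group::r-x,group:team1:r-x,mask::r-x,other::---"
team1_can_read = is_allowed(acl, "user-a", {"team1"}, "r-x")
team2_can_read = is_allowed(acl, "user-b", {"team2"}, "r--")
```

With this ACL on container1, a member of team1 can read and traverse it while a member of team2 falls through to `other` and is denied.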
To manage the ACL on the container, follow the below steps:
Go to the container in the storage account.
Navigate to any container, directory, or blob. Right-click the object, and then select Manage ACL.
The Access permissions tab of the Manage ACL page appears. Use the controls in this tab to manage access to the object.
To add a security principal to the ACL, select the Add principal button.
Find the security principal by using the search box, and then click the Select button.
You should create a security group in Azure AD for each of your teams, and then maintain permissions on the group rather than for individual users.
Refer: Access control lists (ACLs) in Azure Data Lake Storage Gen2
I need to enable one external user to access a single directory in a single container in my data lake, in order to upload some data. From what I see in the documentation, it should be possible to simply use RBAC & ACL, so that the user can authenticate himself later on using PowerShell and Connect-AzureAD (or obtain an OAuth2 token).
However, I am having trouble with all those inherited permissions. Once I add a user to my active directory, he is not able to see anything, unless I give him at least reader access on the subscription level. This gives him at least reader permission on all the resources in this subscription, which cannot be removed.
Is it possible to configure this access in such a way, that my user is only able to see a single datalake, single container, and a single folder within this container?
If you want just the one user to access only a single directory/container in your storage account, you should rather look at Shared Access Signatures or Stored Access policies.
For SAS : https://husseinsalman.com/securing-access-to-azure-storage-part-4-shared-access-signature/
For SAS built on top of Stored Access Policies : https://husseinsalman.com/securing-access-to-azure-storage-part-5-stored-access-policy/
Once you have configured the permissions just for that directory/container, you can send that Shared Access Signature to the user and he/she can use Azure Storage Explorer to perform any file upload/delete etc. actions on your container.
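A hedged sketch of generating such a directory-scoped SAS in Python, assuming the `azure-storage-file-datalake` package and an account key as the signing credential. The function names, the 8-hour lifetime and the chosen permissions are illustrative assumptions only.

```python
from datetime import datetime, timedelta, timezone


def directory_sas_url(account_name: str, file_system: str, directory: str, sas_token: str) -> str:
    """Combine the directory URL and a SAS token into one shareable URL."""
    return (
        f"https://{account_name}.dfs.core.windows.net/"
        f"{file_system}/{directory.strip('/')}?{sas_token.lstrip('?')}"
    )


def make_upload_sas(account_name: str, account_key: str, file_system: str,
                    directory: str, hours_valid: int = 8) -> str:
    """Create a SAS limited to one directory and return the full URL."""
    # Imported lazily so directory_sas_url works without the Azure SDK installed.
    from azure.storage.filedatalake import DirectorySasPermissions, generate_directory_sas

    token = generate_directory_sas(
        account_name=account_name,
        file_system_name=file_system,
        directory_name=directory,
        credential=account_key,
        # Enough to browse the directory and upload into it; no delete.
        permission=DirectorySasPermissions(read=True, create=True, write=True, list=True),
        expiry=datetime.now(timezone.utc) + timedelta(hours=hours_valid),
    )
    return directory_sas_url(account_name, file_system, directory, token)
```

The resulting URL can be pasted into Azure Storage Explorer's connect dialog as a SAS URL. Anyone holding it has the granted permissions until it expires, so a short lifetime is the safer choice.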
Download Azure storage explorer here : https://azure.microsoft.com/en-us/features/storage-explorer/#overview
For how to use Azure Storage Explorer : https://www.red-gate.com/simple-talk/cloud/azure/using-azure-storage-explorer/
More on using Azure storage explorer with azure data lake Gen 2 : https://medium.com/microsoftazure/guidance-for-using-azure-storage-explorer-with-azure-ad-authorization-for-azure-storage-data-access-663c2c88efb
I have an Azure Storage account with ADLS Gen2 containers. Permissions for users are added by code, which in effect goes to the storage container > Access Control (IAM) > Roles > Storage Blob Data Contributor and then adds a user, group, or service principal.
Is there an easy way in Python to check whether a user or service principal is in a specific role (such as Storage Blob Data Contributor) for a specific container?
I've attached a screenshot of the Azure screen whose functionality I want to replicate in Python.
I've tried Role Assignments - List For Scope with a filter, but it does not seem to return the same result.
Screenshot
One of the options you could try is the REST API Get Container ACL. This will provide you with a list of the entities that have access to the container. You can run a quick search in this list to verify the access.
I couldn't find anything similar in the Python SDK.
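For the role-based check described in the question, one possible sketch is to list the role assignments at the container scope with the `azure-mgmt-authorization` package and match them in Python. This is an assumption on my side rather than a confirmed equivalent of the portal screen: the helper names are made up, attribute names can differ between package versions, and the returned list may also contain assignments inherited from higher scopes.

```python
from typing import Iterable, List, Tuple

# Built-in role definition GUID for "Storage Blob Data Contributor".
STORAGE_BLOB_DATA_CONTRIBUTOR = "ba92f5b4-2d11-453d-a403-e96b0029c9fe"


def container_scope(subscription_id: str, resource_group: str, account: str, container: str) -> str:
    """Build the ARM scope string that identifies a single container."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account}"
        f"/blobServices/default/containers/{container}"
    )


def has_role(assignments: Iterable[Tuple[str, str]], principal_id: str, role_guid: str) -> bool:
    """Check (principal_id, role_definition_id) pairs for a direct assignment."""
    for assigned_principal, role_definition_id in assignments:
        same_principal = assigned_principal.lower() == principal_id.lower()
        # role_definition_id is a full resource path ending in the role GUID.
        same_role = role_definition_id.lower().endswith(role_guid.lower())
        if same_principal and same_role:
            return True
    return False


def fetch_assignments(subscription_id: str, scope: str) -> List[Tuple[str, str]]:
    """Fetch role assignments for a scope as (principal_id, role_definition_id)."""
    # Imported lazily so the helpers above work without the Azure SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.authorization import AuthorizationManagementClient

    client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)
    return [
        (a.principal_id, a.role_definition_id)
        for a in client.role_assignments.list_for_scope(scope)
    ]
```

Note that `has_role` only sees direct assignments: if the role was granted to a group, the pair carries the group's object ID, so the user's group memberships would have to be checked separately.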