I am trying to understand why my ACL permissions are not working properly in Databricks.
Scenario: I have two users, one with full permissions on the file system and the other without any permissions.
I tried mounting the Gen2 file system in Databricks using two different methods.
1. Using a service principal (OAuth):
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": clientid,
           "fs.azure.account.oauth2.client.secret": credential,
           "fs.azure.account.oauth2.client.endpoint": refresh_url}

dbutils.fs.mount(
    source = "abfss://xyz@abc.dfs.core.windows.net/",
    mount_point = "/mnt/xyz",
    extra_configs = configs)
2. Using credential passthrough:
configs = {
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class": spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}

dbutils.fs.mount(
    source = "abfss://xyz@abc.dfs.core.windows.net/",
    mount_point = "/mnt/xyz",
    extra_configs = configs)
Both methods mount the file system. But when I use:
dbutils.fs.ls("/mnt/xyz")
it displays all the files and folders, even for the user who has no permissions on the data lake.
I would be glad if someone could explain to me what's wrong.
Thanks
This is expected behavior when you enable Azure Data Lake Storage credential passthrough.
Note: When a cluster is enabled for Azure Data Lake Storage credential passthrough, commands run on that cluster can read and write data in Azure Data Lake Storage without requiring users to configure service principal credentials to access the storage. The credentials are set automatically, based on the user initiating the action.
Reference: Enable Azure Data Lake Storage credential passthrough for your workspace and Simplify Data Lake Access with Azure AD Credential Passthrough.
You probably forgot to add permissions in the Access Control (IAM) of the container.
To check this, you can go to the container in the Azure portal and click Switch to Azure AD User Account. If you don't have rights, you will see an error message.
For example, you can add the Storage Blob Data Contributor role to get read and write access.
Note: the data lake takes a few minutes to refresh the credentials, so you need to wait a bit after adding the role.
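As a quick check once the role has been added and has had time to propagate, you can re-run the listing as each user. This is a minimal sketch, assuming the /mnt/xyz mount from the question and a cluster with credential passthrough enabled:

# Minimal check (assumes the /mnt/xyz mount from the question and a passthrough-enabled cluster).
# With credential passthrough, the listing runs under the calling user's Azure AD identity,
# so a user without IAM/ACL rights on the data lake should get a 403 here.
try:
    display(dbutils.fs.ls("/mnt/xyz"))
except Exception as e:
    print("Access denied for the current user:", e)

With a service-principal mount (method 1), by contrast, every workspace user sees whatever the service principal can see, regardless of their own permissions.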
I mounted my Azure storage account using dbutils and Python as shown on this page, with the Azure Service Principal method:
https://learn.microsoft.com/en-us/azure/databricks/dbfs/mounts
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<application-id>",
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
    source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point = "/mnt/<mount-name>",
    extra_configs = configs)
but I also saw there is an option to connect with Spark to the Azure Blob File System (ABFS) driver, as on this page:
https://learn.microsoft.com/en-us/azure/databricks/external-data/azure-storage
service_credential = dbutils.secrets.get(scope="<scope>",key="<service-credential-key>")
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
I couldn't find information about the difference. In which use cases is it better to use one or the other? Is one method faster than the other at reading data from the Azure storage account?
Thanks a lot in advance!
When you mount your storage account, you make it accessible to everyone who has access to your Databricks workspace.
But when you use spark.conf.set to connect to and use your storage account, access is limited to those who have access to that cluster.
As highlighted in the same Microsoft document, Access Azure Data Lake Storage Gen2 and Blob Storage, mounting is among the deprecated ways of accessing storage accounts and is no longer recommended. Therefore, depending on your requirements, you can choose either mounting or setting configurations, taking security into consideration.
If you want to use mounting, you can try setting up the mount point with credential passthrough.
Is one method faster than the other to get information from the stored data in the Azure Storage Account?
As far as I know, the rate at which information can be accessed would not change. The main difference is that using mounting is not as secure as using spark.conf.set because it is accessible to all users.
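For example, with the session-scoped configuration from the question in place, data can be read directly through the ABFS driver without any mount point. This is a minimal sketch; the container, account and path are placeholders:

# Minimal sketch (placeholder container/account/path): with the spark.conf.set
# configuration above applied, read directly via the abfss:// URI; no mount is involved,
# and only users with access to this cluster/session can use these credentials.
df = spark.read.parquet(
    "abfss://<container-name>@<storage-account>.dfs.core.windows.net/<path-to-data>"
)
display(df.limit(10))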
A Databricks job used to connect to ADLS Gen2 storage and process the files successfully.
Recently, after renewing the service principal secret and updating the secret in Key Vault, the jobs are failing.
Using the Databricks CLI (databricks secrets list-scopes --profile mycluster), I was able to identify which Key Vault is being used, and I also verified that the corresponding secrets are updated correctly.
Within the notebook, I followed the link and was able to access ADLS.
Below is what I used to test the Key Vault values and access ADLS.
scopename = "name-of-the-scope-used-in-databricks-workspace"
appId = dbutils.secrets.get(scope=scopename, key="name-of-the-key-from-keyvault-referring-appid")
directoryId = dbutils.secrets.get(scope=scopename, key="name-of-key-from-keyvault-referring-TenantId")
secretValue = dbutils.secrets.get(scope=scopename, key="name-of-key-from-keyvault-referring-Secretkey")
storageAccount = "ADLS-Gen2-StorageAccountName"

spark.conf.set(f"fs.azure.account.auth.type.{storageAccount}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storageAccount}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storageAccount}.dfs.core.windows.net", appId)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storageAccount}.dfs.core.windows.net", secretValue)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storageAccount}.dfs.core.windows.net", f"https://login.microsoftonline.com/{directoryId}/oauth2/token")

dbutils.fs.ls("abfss://<container-name>@<storage-accnt-name>.dfs.core.windows.net/<folder>")
With an attached cluster, the above successfully displays the list of folders/files within the ADLS Gen2 storage.
Below is the code used to create the mount point, which used the old secret info.
scope_name = "name-of-the-scope-from-workspace"
directoryId = dbutils.secrets.get(scope=scope_name, key="name-of-key-from-keyvault-which-stores-tenantid-value")

configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": dbutils.secrets.get(scope=scope_name, key="name-of-key-from-key-vault-referring-to-clientid"),
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope=scope_name, key="name-of-key-from-key-vault-referring-to-secretvalue-generated-in-sp-secrets"),
           "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{directoryId}/oauth2/token"}

storage_acct_name = "storageaccountname"
container_name = "name-of-container"
mount_point = "/mnt/appadls/content"

if not any(mount.mountPoint == mount_point for mount in dbutils.fs.mounts()):
    print(f"Mounting {mount_point} to DBFS filesystem")
    dbutils.fs.mount(
        source = f"abfss://{container_name}@{storage_acct_name}.dfs.core.windows.net/",
        mount_point = mount_point,
        extra_configs = configs)
else:
    print(f"Mount point {mount_point} has already been mounted.")
In my case, the Key Vault is updated with the client ID, tenant/directory ID, and SP secret key.
After renewing the service principal, when accessing the /mnt/ path I see the exception below.
...
response '{"error":"invalid_client","error_description":"AADSTS7000215: Invalid client secret is provided.
The only thing I could think of is that the mount point was created with the old secrets, as in the code above. After renewing the service principal, do I need to unmount and re-create the mount point?
So I finally tried unmounting and re-mounting the ADLS Gen2 storage, and now I am able to access it.
I didn't expect that the configuration would somehow be persisted with the mount; I assumed that just updating the service principal secret would be sufficient.
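For reference, a minimal sketch of that refresh step, reusing the mount_point, container_name, storage_acct_name and configs names from the code above (configs must be rebuilt first so it picks up the rotated secret from the Key Vault-backed scope):

# Hedged sketch reusing names from the question: drop the stale mount and recreate it,
# so the rotated service principal secret (re-read from the Key Vault-backed scope into
# `configs`) is the one actually stored with the mount.
mount_point = "/mnt/appadls/content"
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_point)

dbutils.fs.mount(
    source = f"abfss://{container_name}@{storage_acct_name}.dfs.core.windows.net/",
    mount_point = mount_point,
    extra_configs = configs)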
After completing tutorial 1, I am working on tutorial 2 from the Microsoft Azure team to run the following query (shown in step 3). But the query execution gives the error shown below:
Question: What may be the cause of the error, and how can we resolve it?
Query:
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://contosolake.dfs.core.windows.net/users/NYCTripSmall.parquet',
        FORMAT='PARQUET'
    ) AS [result]
Error:
Warning: No datasets were found that match the expression 'https://contosolake.dfs.core.windows.net/users/NYCTripSmall.parquet'. Schema cannot be determined since no files were found matching the name pattern(s) 'https://contosolake.dfs.core.windows.net/users/NYCTripSmall.parquet'. Please use WITH clause in the OPENROWSET function to define the schema.
NOTE: The path of the file in the container is correct; I actually generated the query above just by right-clicking the file inside the container and generating the script, as shown below:
Remarks:
Azure Data Lake Storage Gen2 account name: contosolake
Container name: users
Firewall settings used on the Azure Data lake account:
Azure Data Lake Storage Gen2 account is allowing public access (ref):
Container has required access level (ref)
UPDATE:
The owner of the subscription is someone else, and I did not get the option to check the "Assign myself the Storage Blob Data Contributor role on the Data Lake Storage Gen2 account" box described in item 3 of the Basics tab > Workspace details section of tutorial 1. I also do not have permission to add roles, although I'm the owner of the Synapse workspace. So I am using the workaround described in Configure anonymous public read access for containers and blobs from the Azure team.
--Workaround
If you are unable to grant Storage Blob Data Contributor, use ACLs to grant permissions.
All users that need access to some data in this container also need to have the EXECUTE permission on all parent folders up to the root (the container). Learn more about how to set ACLs in Azure Data Lake Storage Gen2.
Note:
Execute permission on the container level needs to be set within the Azure Data Lake Gen2. Permissions on the folder can be set within Azure Synapse.
Go to the container holding NYCTripSmall.parquet.
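If you prefer to script the ACLs rather than set them in the portal or Storage Explorer, something along these lines should work with the azure-storage-file-datalake Python package. This is not part of the original answer, and the account key and the user's Azure AD object ID are placeholders:

# Hedged sketch (not from the original answer): grant the querying user r-x on the
# "users" container and everything under it via the azure-storage-file-datalake SDK.
# <account-key> and <user-object-id> are placeholders you would substitute.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://contosolake.dfs.core.windows.net",
    credential="<account-key>")
container = service.get_file_system_client(file_system="users")

# r-x applied recursively covers the required EXECUTE on all parent folders
# and READ on NYCTripSmall.parquet itself.
container.get_directory_client("/").update_access_control_recursive(
    acl="user:<user-object-id>:r-x")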
--Update
As per your update in the comments, it seems you would have to do as below.
Contact the Owner of the storage account and ask them to perform the following tasks:
* Assign the workspace MSI to the Storage Blob Data Contributor role on the storage account
* Assign you to the Storage Blob Data Contributor role on the storage account
--
I was able to get the query results by following the tutorial doc you mentioned, for the same dataset.
Since you confirm that the file is present and in the right path, refresh the linked ADLS source and publish the query before running, in case it is a transient issue.
Two things I suspect:
* Try setting Microsoft network routing under the Network routing settings of the ADLS account.
* Check that the built-in pool is online and that you have at least the Contributor role on both the Synapse workspace and the storage account (if the credentials currently used to run the query did not create the resources).
I have an Azure Data Lake Storage Gen2 account with hierarchical namespace enabled. I generated a SAS token for the account, and I receive data into a folder in the File Share (File service). Now I want to access these files through Azure Databricks and Python. However, it seems that Azure Databricks can only access the File System (called a Blob Container in Gen1), and not the File Share. I also failed to generate a SAS token for the File System.
I wish to have a storage instance for which I can generate a SAS token to give to my client, and which I can access from Azure Databricks using Python. It does not matter whether it is a File System, File Share, ADLS Gen2 or Gen1, as long as it somehow works.
I use the following to access the File System from Databricks:
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "my_client_id",
           "fs.azure.account.oauth2.client.secret": "my_client_secret",
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/" + "My_tenant_id" + "/oauth2/token",
           "fs.azure.createRemoteFileSystemDuringInitialization": "true"}

dbutils.fs.mount(
    source = "abfss://" + "my_file_system" + "@" + "my_storage_account" + ".dfs.core.windows.net/MyFolder",
    mount_point = "/mnt/my_mount",
    extra_configs = configs)
This works fine, but I cannot make it access the File Share. And I have a SAS token with a connection string like this:
connection_string = (
'BlobEndpoint=https://<my_storage>.blob.core.windows.net/;'+
'QueueEndpoint=https://<my_storage>.queue.core.windows.net/;'+
'FileEndpoint=https://<my_storage>.file.core.windows.net/;'+
'TableEndpoint=https://<my_storage>.table.core.windows.net/;'+
'SharedAccessSignature=sv=2018-03-28&ss=bfqt&srt=sco&sp=rwdlacup&se=2019-09-26T17:12:38Z&st=2019-08-26T09:12:38Z&spr=https&sig=<my_sig>'
)
I manage to use this to upload things to the File Share, but not to the File System. Is there any kind of Azure storage that can be accessed both with a SAS token and from Azure Databricks?
Steps to connect to an Azure File Share from Databricks:
First, install the Microsoft Azure Storage File Share client library for Python using pip install in Databricks: https://pypi.org/project/azure-storage-file-share/
After installing it, create a storage account. Then you can create a file share from Databricks:
from azure.storage.fileshare import ShareClient
share = ShareClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<file share name that you want to create>")
share.create_share()
Use this for further reference: https://learn.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string
Code to upload a file into the file share through Databricks:
from azure.storage.fileshare import ShareFileClient
file_client = ShareFileClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<your_fileshare_name>", file_path="my_file")
with open("./SampleSource.txt", "rb") as source_file:
    file_client.upload_file(source_file)
Refer to this link for further information: https://pypi.org/project/azure-storage-file-share/
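To read the uploaded file back from the share on the Databricks driver, a similar client can be used in the other direction. This is a hedged sketch with the same placeholder connection string; "my_file" and the local target path are examples:

# Hedged sketch: download the file from the share to local driver storage using the
# same SAS-based connection string; "my_file" and "/tmp/my_file" are placeholders.
from azure.storage.fileshare import ShareFileClient

file_client = ShareFileClient.from_connection_string(
    conn_str="<connection_string>",
    share_name="<your_fileshare_name>",
    file_path="my_file")

with open("/tmp/my_file", "wb") as target_file:
    stream = file_client.download_file()
    stream.readinto(target_file)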
When I try to mount ADLS Gen2 in Databricks, I get this issue: "StatusDescription=This request is not authorized to perform this operation" if the ADLS Gen2 firewall is enabled. The request works fine if the firewall is disabled.
Can someone help, please?
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": clientID,
           "fs.azure.account.oauth2.client.secret": keyID,
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/" + tenantID + "/oauth2/token"}

dbutils.fs.mount(
    source = "abfss://" + fileSystem + "@" + accountName + ".dfs.core.windows.net/",
    mount_point = "/mnt/adlsGen2",
    extra_configs = configs)
StatusCode=403
StatusDescription=This request is not authorized to perform this operation.
ErrorCode=
ErrorMessage=
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:134)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:498)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getIsNamespaceEnabled(AzureBlobFileSystemStore.java:164)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus(AzureBlobFileSystemStore.java:445)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:362)
at com.databricks.backend.daemon.dbutils.DBUtilsCore.verifyAzureFileSystem(DBUtilsCore.scala:486)
at com.databricks.backend.daemon.dbutils.DBUtilsCore.mount(DBUtilsCore.scala:435)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
If you enable the firewall on an Azure Data Lake Storage Gen2 account, this configuration only works with Azure Databricks if you deploy Azure Databricks in your own virtual network. It does not work with workspaces deployed without the VNet injection feature.
On the storage account, you have to enable access from the public Databricks subnet.
This error is caused by the service principal not having read/execute permission on the file path, not by the firewall.
FYI, on the storage account you can allow trusted Microsoft services to access the resource; this includes Databricks. But, as I said, I do not believe you have a firewall issue.
To resolve the permissions issue, I would first look at the IAM roles for the file system. From the Azure portal, go to the storage account > File systems and open the Access Control (IAM) blade. Using the Check access screen, paste the client/application ID of your service principal and check what permissions it has.
To have read access to the file system, the SP must be in one of the following roles:
* Owner
* Storage Blob Data Contributor
* Storage Blob Data Owner
* Storage Blob Data Reader
Any of these roles will give full access to read all files in the FileSystem.
If not you can still grant permissions at a folder/file level using Azure Storage Explorer. Remember that all folders in the chain must have Execute permission at each level. For example:
/Root/SubFolder1/SubFolder2/file.csv
You must grant Execute on Root, SubFolder1 and SubFolder2, as well as Read on file.csv.
Further details: https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-access-control
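As a quick way to separate a permissions problem from a firewall problem, you can set the same OAuth credentials session-scoped and list the path directly, without mounting. This is a minimal sketch reusing the clientID, keyID, tenantID, fileSystem and accountName variables from the question:

# Diagnostic sketch reusing variables from the question: if this direct listing also
# returns 403, the service principal's permissions (or the firewall/VNet setup) are the
# problem rather than dbutils.fs.mount itself.
spark.conf.set(f"fs.azure.account.auth.type.{accountName}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{accountName}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{accountName}.dfs.core.windows.net", clientID)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{accountName}.dfs.core.windows.net", keyID)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{accountName}.dfs.core.windows.net",
               "https://login.microsoftonline.com/" + tenantID + "/oauth2/token")

display(dbutils.fs.ls(f"abfss://{fileSystem}@{accountName}.dfs.core.windows.net/"))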
You need to use Vnet-Injection during creation. This blog post walks you through it.
https://www.keithmsmith.com/azure-data-lake-firewall-databricks/
I also faced the same issue, but later figured out that you need to have only the Storage Blob Data Contributor role specified on your data lake for your service principal.
If you have given only the Contributor role, it will not work.
Having both Contributor and Storage Blob Data Contributor will not work either.
You have to provide just Storage Blob Data Contributor on your Data Lake Gen2.