Mount ADLS Gen2 to Databricks when firewall is enabled

When I try to mount ADLS Gen2 in Databricks, I get "StatusDescription=This request is not authorized to perform this operation" if the ADLS Gen2 firewall is enabled. The request works fine if the firewall is disabled.
Can someone help, please?
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": clientID,
           "fs.azure.account.oauth2.client.secret": keyID,
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/" + tenantID + "/oauth2/token"}

dbutils.fs.mount(
    source = "abfss://" + fileSystem + "@" + accountName + ".dfs.core.windows.net/",
    mount_point = "/mnt/adlsGen2",
    extra_configs = configs)
StatusCode=403
StatusDescription=This request is not authorized to perform this operation.
ErrorCode=
ErrorMessage=
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:134)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:498)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getIsNamespaceEnabled(AzureBlobFileSystemStore.java:164)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus(AzureBlobFileSystemStore.java:445)
at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:362)
at com.databricks.backend.daemon.dbutils.DBUtilsCore.verifyAzureFileSystem(DBUtilsCore.scala:486)
at com.databricks.backend.daemon.dbutils.DBUtilsCore.mount(DBUtilsCore.scala:435)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)

If you enable the firewall on an Azure Data Lake Storage Gen2 account, this configuration only works with Azure Databricks if you deploy Azure Databricks in your own virtual network. It does not work with workspaces deployed without the VNet injection feature.
On the storage account you then have to allow access from the public Databricks subnet.
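If you want to script that firewall change rather than use the portal, a minimal sketch with the azure-mgmt-storage package could look like the following; all names and IDs are placeholders (not values from the question), and the subnet must already have the Microsoft.Storage service endpoint enabled:
# Sketch only: add the Databricks public (host) subnet of a VNet-injected workspace
# to the storage account firewall. All names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountUpdateParameters, VirtualNetworkRule

subscription_id = "<subscription-id>"
resource_group = "<storage-resource-group>"
account_name = "<storage-account-name>"
# Resource ID of the Databricks public (host) subnet; it needs the Microsoft.Storage service endpoint.
subnet_id = ("/subscriptions/<subscription-id>/resourceGroups/<vnet-resource-group>"
             "/providers/Microsoft.Network/virtualNetworks/<vnet-name>/subnets/<public-subnet-name>")

client = StorageManagementClient(DefaultAzureCredential(), subscription_id)
rules = client.storage_accounts.get_properties(resource_group, account_name).network_rule_set
rules.default_action = "Deny"  # keep the firewall enabled
rules.virtual_network_rules = (rules.virtual_network_rules or []) + [
    VirtualNetworkRule(virtual_network_resource_id=subnet_id)]
client.storage_accounts.update(resource_group, account_name,
                               StorageAccountUpdateParameters(network_rule_set=rules))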

This error is caused by the service principal not having read/execute permission on the file path, not by the firewall.
FYI: on the Azure Storage account you can allow trusted Microsoft services to access the resource, which includes Databricks. But as I said, I do not believe you have a firewall issue.
To resolve the permissions issue, I would first look at the IAM roles for the file system. From the Azure portal go to the storage account > File systems and open the Access Control (IAM) blade. Using the Check access screen, paste the client/application ID of your service principal and check what permissions it has.
To have read access to the filesystem the SP must be in one of the following roles:
* Owner
* Storage Blob Data Contributor
* Storage Blob Data Owner
* Storage Blob Data Reader
Any of these roles will give full access to read all files in the FileSystem.
If not, you can still grant permissions at the folder/file level using Azure Storage Explorer. Remember that every folder in the chain must have Execute permission at each level. For example, for:
/Root/SubFolder1/SubFolder2/file.csv
you must grant Execute on Root, SubFolder1 and SubFolder2, as well as Read on SubFolder2.
Further details: https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-access-control
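If you would rather script those folder ACLs than click through Storage Explorer, here is a minimal sketch using the azure-storage-file-datalake package; the account, file system, credential and object ID values are placeholders (not from the question), and it assumes the service principal has no existing ACL entry on these folders:
# Sketch: give the service principal Execute on each folder in the chain and
# Read+Execute on the folder holding the file, mirroring the example above.
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

credential = ClientSecretCredential("<tenant-id>", "<admin-client-id>", "<admin-client-secret>")
service = DataLakeServiceClient("https://<storage-account>.dfs.core.windows.net", credential=credential)
fs = service.get_file_system_client("<filesystem>")

sp_oid = "<service-principal-object-id>"
for path, perms in [("Root", "--x"),
                    ("Root/SubFolder1", "--x"),
                    ("Root/SubFolder1/SubFolder2", "r-x")]:
    directory = fs.get_directory_client(path)
    current_acl = directory.get_access_control()["acl"]
    # Append a new ACL entry for the service principal (edit the existing entry instead if one is present).
    directory.set_access_control(acl=current_acl + ",user:%s:%s" % (sp_oid, perms))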

You need to use Vnet-Injection during creation. This blog post walks you through it.
https://www.keithmsmith.com/azure-data-lake-firewall-databricks/

I also faced the same issue, but later figured out that you need to have only the Storage Blob Data Contributor role assigned on your data lake for your service principal.
If you have given only the Contributor role, it will not work.
Both Contributor and Storage Blob Data Contributor together will not work either.
You just have to grant Storage Blob Data Contributor on your Data Lake Gen2.
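For what it's worth, that single role assignment can also be scripted; the sketch below assumes a recent azure-mgmt-authorization package, placeholder IDs, and the well-known built-in role definition GUID for Storage Blob Data Contributor:
# Sketch: assign Storage Blob Data Contributor to the service principal at the
# storage account scope. All IDs are placeholders.
import uuid
from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

subscription_id = "<subscription-id>"
scope = ("/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
         "/providers/Microsoft.Storage/storageAccounts/<storage-account-name>")
# Built-in role definition ID for "Storage Blob Data Contributor"
role_definition_id = ("/subscriptions/<subscription-id>/providers/Microsoft.Authorization"
                      "/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe")

client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)
client.role_assignments.create(
    scope,
    str(uuid.uuid4()),  # the role assignment name must be a new GUID
    RoleAssignmentCreateParameters(
        role_definition_id=role_definition_id,
        principal_id="<service-principal-object-id>",
        principal_type="ServicePrincipal"))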

Difference between connect and mount in Azure Databricks

I mounted my Azure Storage Account using dbutils and Python as on this page, using the Azure service principal method:
https://learn.microsoft.com/en-us/azure/databricks/dbfs/mounts
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<application-id>",
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
    source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point = "/mnt/<mount-name>",
    extra_configs = configs)
but I also saw there is an option to connect with Spark through the Azure Blob Filesystem (ABFS) driver, as on this page:
https://learn.microsoft.com/en-us/azure/databricks/external-data/azure-storage
service_credential = dbutils.secrets.get(scope="<scope>",key="<service-credential-key>")
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
I couldn't find information about the difference. In which use cases is it better to use one or the other? Is one method faster than the other for reading the data stored in the Azure Storage Account?
Thanks a lot in advance!
When you mount your storage account, you make it accessible to everyone that has access to your Databricks workspace.
But when you use spark.conf.set to connect to and use your storage account, access is limited to those who have access to that cluster.
As highlighted in the same Microsoft document, Access Azure Data Lake Storage Gen2 and Blob Storage, mounting is among the deprecated ways of accessing storage accounts and is no longer recommended. Therefore, depending on your requirements, you can choose either mounting or setting configurations, taking security into consideration.
If you want to choose mounting, you can try setting up mount point using credential passthrough.
Is one method faster than the other to get information from the stored data in the Azure Storage Account?
As far as I know, the rate at which the data can be accessed would not change. The main difference is that mounting is not as secure as using spark.conf.set, because a mount is accessible to all users.
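To make the difference concrete, here is a small sketch of reading the same (hypothetical) file both ways once the code above has run; the file path and the placeholder names are illustrative only:
# 1) Through the mount point: any user of the workspace can read it this way.
df_mounted = spark.read.csv("/mnt/<mount-name>/raw/sales.csv", header=True)

# 2) Directly through the abfss:// URI: this only works on a cluster/session where
#    the spark.conf.set(...) OAuth settings shown above have been applied.
df_direct = spark.read.csv(
    "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/raw/sales.csv",
    header=True)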

DataBricks UnityCatalog create table fails with "Failed to acquire a SAS token UnauthorizedAccessException: PERMISSION_DENIED: request not authorized"

I'm new to DataBricks Unity Catalog and I'm trying to follow the quickstart notebook on https://docs.databricks.com/_static/notebooks/unity-catalog-example-notebook.html.
It seems to me I did everything I had to do:
* I created a Databricks access connector in Azure (which becomes a managed identity)
* I created an ADLS Gen2 storage account (data lake with hierarchical namespace) plus a container
* On my data lake container I assigned the Storage Blob Data Contributor role to the managed identity above
* I created a new Databricks Premium workspace
* I created a new metastore in Unity Catalog that "binds" the access connector to the data lake
* I bound the metastore to the premium Databricks workspace
* I gave my Databricks user Admin permission on the above Databricks workspace
* I created a new cluster in the same premium workspace, choosing runtime 11.1 and "single user" access mode
* I ran the notebook, which correctly created a new catalog, assigned proper rights to it, created a schema, and confirmed that I am the owner of that schema
The only (but most important) SQL command of the same notebook that fails is the one that tries to create a managed Delta table and insert two records:
CREATE TABLE IF NOT EXISTS quickstart_catalog_mauromi.quickstart_schema_mauromi.quickstart_table
(columnA Int, columnB String) PARTITIONED BY (columnA);
When I run it, it starts working and in fact it begins creating the folder structure for this Delta table in my storage account; however, it then fails with the following error:
java.util.concurrent.ExecutionException: Failed to acquire a SAS token for list on /data/a3b9da69-d82a-4e0d-9015-51646a2a93fb/tables/eab1e2cc-1c0d-4ee4-9a57-18f17edcfabb/_delta_log due to java.util.concurrent.ExecutionException: com.databricks.sql.managedcatalog.acl.UnauthorizedAccessException: PERMISSION_DENIED: request not authorized
Please consider that I didn't have any folders created under the "unity-catalog" container before running the table creation command. So it seems it can successfully create the folder structure, but after it creates the "table" folder it can't acquire the SAS token.
I can't understand this, since I am an admin in this workspace, the Databricks managed identity is assigned the contributor role on the storage container, and Databricks actually starts creating the other folders. What else should I configure?
I found it: it is not enough to assign the Storage Blob Data Contributor role to the Azure Databricks access connector at the container level; you also need to assign the same role to the same connector at the STORAGE ACCOUNT level.
I couldn't find this information in the documentation, and I frankly can't understand why it is needed, since the Delta table path was created.
However, this way it works.
I solved this issue by doing the following:
* Grant the "Access Connector for Azure Databricks" the "Storage Blob Data Reader" role at the storage account level.
* Grant the "Access Connector for Azure Databricks" the "Storage Blob Data Contributor" role at the level of the container used by the workspace.
That keeps the permissions a bit more restrictive without having to go all the way to the 'Owner' level.
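Once both role assignments are in place (they can take a few minutes to propagate), a quick way to verify the fix is to re-run the failing step from the quickstart notebook; a sketch, reusing the catalog, schema and table names from the question (the inserted values are just examples):
# Sketch: re-create the managed Delta table and insert two sample records.
spark.sql("""
    CREATE TABLE IF NOT EXISTS quickstart_catalog_mauromi.quickstart_schema_mauromi.quickstart_table
    (columnA INT, columnB STRING) PARTITIONED BY (columnA)
""")
spark.sql("""
    INSERT INTO quickstart_catalog_mauromi.quickstart_schema_mauromi.quickstart_table
    VALUES (1, 'one'), (2, 'two')
""")
display(spark.sql(
    "SELECT * FROM quickstart_catalog_mauromi.quickstart_schema_mauromi.quickstart_table"))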

Trying to set up a linked service through a registered app to Azure Data Lake Storage and keep getting error 24200

I am new to Azure. We have Azure Data Lake Storage set up, and I am trying to create a linked service from Data Factory to the Azure Data Lake Storage Gen2. It keeps failing when I test the linked service to the Data Lake storage. As far as I can see, I have granted the "Storage Blob Contributor" role to the user on the Azure Data Lake Storage, but I still keep getting a permission denied error when I test the linked service.
ADLS Gen2 operation failed for: Storage operation '' on container 'testconnection' get failed with 'Operation returned an invalid status code 'Forbidden''. Possible root causes: (1). It's possible because the service principal or managed identity don't have enough permission to access the data. (2). It's possible because some IP address ranges of Azure Data Factory are not allowed by your Azure Storage firewall settings. Azure Data Factory IP ranges please refer https://learn.microsoft.com/en-us/azure/data-factory/azure-integration-runtime-ip-addresses.. Account: 'dlsisrdatapoc001'. ErrorCode: 'AuthorizationFailure'. Message: 'This request is not authorized to perform this operation.'.
What I could observe is that when I open the network to all (public) on the Data Lake storage, it works; when I restrict the firewall with CIDR ranges, it fails. I couldn't narrow down the cause of the problem. I do have "Allow Azure services on the trusted services list to access this account" checked.
Completely lost.
As mentioned in the error description, the error usually occurs if you don't have sufficient permissions to perform the action or if you haven't added the required IPs to the firewall settings of your storage account.
To resolve the error, please check whether you added the Storage Blob Data Contributor role to your managed identity along with the user, like below:
Go to Azure Portal -> Storage Accounts -> Your Storage Account -> Access Control (IAM) -> Add role assignment
Make sure to select the managed identity, based on the authentication method you selected while creating the linked service.
As mentioned in this MS doc, make sure to add all the required IPs based on your resource location and service tag.
Download the JSON file to find the IP ranges for the service tag in your resource location and add them in the firewall settings.
If you use the resource instance option instead of CIDR ranges, make sure to select the resource type Microsoft.DataFactory/factories.
For more details, please refer to the links below:
Error when I am trying to connect between Azure Data factory and Azure Data lake Gen2 by Anushree Garg
Storage Account V2 access with firewall, VNET to data factory V2 by Cindy Pau
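If you prefer to script the firewall part of the steps above instead of using the portal, a rough sketch with the azure-mgmt-storage package is shown below; the names are placeholders and the IP ranges have to come from the service tag JSON (DataFactory.<your-region>) you downloaded:
# Sketch: add Azure Data Factory IP ranges (from the service tag JSON for your
# region) to the storage account firewall. All names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import IPRule, StorageAccountUpdateParameters

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
account_name = "<storage-account-name>"
adf_ip_ranges = ["<adf-ip-range-1>", "<adf-ip-range-2>"]  # taken from the service tag JSON

client = StorageManagementClient(DefaultAzureCredential(), subscription_id)
rules = client.storage_accounts.get_properties(resource_group, account_name).network_rule_set
rules.ip_rules = (rules.ip_rules or []) + [IPRule(ip_address_or_range=r) for r in adf_ip_ranges]
client.storage_accounts.update(resource_group, account_name,
                               StorageAccountUpdateParameters(network_rule_set=rules))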

azure datalake gen2 databricks ACLs permissions

I am trying to understand why my ACL permissions are not working properly in Databricks.
Scenario: I have 2 users, one with full permissions on the file system and the other without any permissions.
I tried mounting the Gen2 file system in Databricks using 2 different methods:
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": clientid,
           "fs.azure.account.oauth2.client.secret": credential,
           "fs.azure.account.oauth2.client.endpoint": refresh_url}

dbutils.fs.mount(
    source = "abfss://xyz@abc.dfs.core.windows.net/",
    mount_point = "/mnt/xyz",
    extra_configs = configs)
and using credential passthrough:
configs = {
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class": spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}

dbutils.fs.mount(
    source = "abfss://xyz@abc.dfs.core.windows.net/",
    mount_point = "/mnt/xyz",
    extra_configs = configs)
Both mount the file system. But when I use:
dbutils.fs.ls("/mnt/xyz")
it displays all the files and folders to the user who has no permissions on the data lake.
I would be glad if someone could explain to me what's wrong.
Thanks
This is expected behavior when you enable Azure Data Lake Storage credential passthrough.
Note: When a cluster is enabled for Azure Data Lake Storage credential passthrough, commands run on that cluster can read and write data in Azure Data Lake Storage without requiring users to configure service principal credentials to access the storage. The credentials are set automatically, based on the user initiating the action.
Reference: Enable Azure Data Lake Storage credential passthrough for your workspace and Simplify Data Lake Access with Azure AD Credential Passthrough.
You probably forgot to add permissions in the Access Control (IAM) of the container.
To check this, you can go to the container in the Azure portal and click on Switch to Azure AD User Account. If you don't have rights, you will see an error message.
For example, you can add the Storage Blob Data Contributor role to get read and write access.
Note: the data lake takes a few minutes to refresh the credentials, so you need to wait a little after adding the role.
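As a quick check, you can also list the existing mount points to see which source and configuration each one was created with: a mount created with the service principal configuration (the first method) is shared by every workspace user with the service principal's permissions, so per-user ACLs only come into play with the passthrough mount. A small sketch, using the /mnt/xyz mount point from the question:
# Sketch: inspect how /mnt/xyz is currently mounted before testing per-user ACLs.
for m in dbutils.fs.mounts():
    if m.mountPoint == "/mnt/xyz":
        print(m.mountPoint, "->", m.source)

# To re-test with ACLs in effect, unmount and remount with only the passthrough configs:
# dbutils.fs.unmount("/mnt/xyz")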

How to read a blob in Azure databricks with SAS

I'm new to Databricks. I wrote sample code to read a storage blob in Azure Databricks.
blob_account_name = "sars"
blob_container_name = "mpi"
blob_sas_token =r"**"
ini_path = "58154388-b043-4080-a0ef-aa5fdefe22c8"
inputini = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, ini_path)
spark.conf.set("fs.azure.sas.%s.%s.blob.core.windows.net"% (blob_container_name, blob_account_name), blob_sas_token)
print(inputini)
ini=sc.textFile(inputini).collect()
It throws an error:
Container mpi in account sars.blob.core.windows.net not found
I guess it doesn't attach the SAS token to the wasbs link, so it doesn't have permission to read the data.
How do I attach the SAS token to the wasbs link?
This is expected behaviour; you cannot read private storage from Databricks anonymously. In order to access private data from storage where the firewall is enabled, or which was created in a VNet, you will have to deploy Azure Databricks in your own Azure virtual network and then whitelist the VNet address range in the firewall of the storage account. You can refer to Configure Azure Storage firewalls and virtual networks.
WITH PRIVATE ACCESS:
When you have set the access level to "Private (no anonymous access)".
Output: Error message
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Container carona in account cheprasas.blob.core.windows.net not found, and we can't create it using anoynomous credentials, and no credentials found for them in the configuration.
WITH CONTAINER ACCESS:
When you have set the access level to "Container (anonymous read access for containers and blobs)".
Output: You will be able to see the output without any issue.
Reference: Quickstart: Run a Spark job on Azure Databricks using the Azure portal.
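Coming back to the question of how to attach the SAS token to the wasbs path: with the container and account in the right order (container@account), the per-container SAS setting from the question should be picked up by the DataFrame reader. A minimal sketch, reusing the variable names from the question:
# Sketch: read the blob with the SAS token set for the container (names from the question above).
spark.conf.set(
    "fs.azure.sas.%s.%s.blob.core.windows.net" % (blob_container_name, blob_account_name),
    blob_sas_token)

ini_df = spark.read.text(
    "wasbs://%s@%s.blob.core.windows.net/%s" % (blob_container_name, blob_account_name, ini_path))
ini = [row.value for row in ini_df.collect()]
print(ini)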
