Cannot list Azure Storage Gen 2 files with Databricks

I wonder if my Databricks code is addressing the correct location and if the "Contributor" right is enough for accessing storage.
I have Azure Storage Gen 2 with container named staging. (Url in Azure portal is https://datalaketest123.blob.core.windows.net/staging)
I have mounted Azure Storage Gen 2 with Azure Databricks.
I have configured credential passthrough and assume that I get access to storage with my AD user (Contributor rights).
I have the variable: source = 'abfss://' + in_fileSystemName + '@' + storageAccountName + '.dfs.core.windows.net/'
I then tried to list the file system with the command: dbutils.fs.ls(source)
I get error:
GET https://datalaketest123.dfs.core.windows.net/staging?
resource=filesystem&maxResults=500&timeout=90&recursive=false
---------------------------------------------------------------------------
ExecutionError Traceback (most recent call last)
<command-1012822525241408> in <module>
27 # COMMAND ----------
28 source = 'abfss://' + in_fileSystemName + '@' + storageAccountName + '.dfs.core.windows.net/'
---> 29 dbutils.fs.ls(source)
30
31 # COMMAND ----------
/local_disk0/tmp/1235891082005-0/dbutils.py in f_with_exception_handling(*args, **kwargs)
312 exc.__context__ = None
313 exc.__cause__ = None
--> 314 raise exc
315 return f_with_exception_handling
316
ExecutionError: An error occurred while calling z:com.databricks.backend.daemon.dbutils.FSUtils.ls.
: GET https://datalaketest123.dfs.core.windows.net/staging?
resource=filesystem&maxResults=500&timeout=90&recursive=false
StatusCode=403
StatusDescription=This request is not authorized to perform this operation using this permission.
ErrorCode=AuthorizationPermissionMismatch

Per the official Databricks docs, Contributor alone is not enough: it needs to be one of the Storage Blob Data roles (Storage Blob Data Owner, Contributor, or Reader; see the docs).

When performing the steps in the "Assign the application to a role" section, make sure that your user account has the Storage Blob Data Contributor role assigned to it.
Repro: I granted the Owner permission to the service principal and tried to run dbutils.fs.ls("mnt/azure/"); it returned the same error message as above.
Solution: Assign the Storage Blob Data Contributor role to the service principal.
After assigning the Storage Blob Data Contributor role to the service principal, I was finally able to get the output without any error message.
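For reference, a minimal sketch of the listing call once the role is in place, reusing the question's variable names (assuming credential passthrough is enabled on the cluster and the AD user holds a Storage Blob Data role on the account):
# Container and account values taken from the question; adjust as needed.
in_fileSystemName = "staging"
storageAccountName = "datalaketest123"
# abfss://<container>@<account>.dfs.core.windows.net/ -- note the '@' separator.
source = "abfss://" + in_fileSystemName + "@" + storageAccountName + ".dfs.core.windows.net/"
# With Storage Blob Data Contributor (or Reader) assigned, this lists the container root.
display(dbutils.fs.ls(source))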
Reference: “Tutorial: Azure Data Lake Storage Gen2, Azure Databricks & Spark”.

Related

Trying to set up a linked service through a registered app to Azure Data Lake Storage and keep getting a 24200 error

I am new to Azure. We have Azure Data Lake Storage set up. I am trying to set up the linked service from Data Factory to Azure Data Lake Storage Gen2. It keeps failing when I test the linked service against the data lake storage. As far as I can see, I have granted the "Storage blob contributor" role to the user on the Azure Data Lake Storage. I still keep getting a permission denied error when I test the linked service:
ADLS Gen2 operation failed for: Storage operation '' on container 'testconnection' get failed with 'Operation returned an invalid status code 'Forbidden''. Possible root causes: (1). It's possible because the service principal or managed identity don't have enough permission to access the data. (2). It's possible because some IP address ranges of Azure Data Factory are not allowed by your Azure Storage firewall settings. Azure Data Factory IP ranges please refer https://learn.microsoft.com/en-us/azure/data-factory/azure-integration-runtime-ip-addresses.. Account: 'dlsisrdatapoc001'. ErrorCode: 'AuthorizationFailure'. Message: 'This request is not authorized to perform this operation.'.
What I could observe is that when I open the network to all (public) on the data lake storage it works, but when I set the firewall with CIDR it fails. I couldn't narrow down the cause of the problem. I do have "Allow Azure services on the trusted services list to access this account" checked.
Completely lost
As mentioned in the error description, the error usually occurs if you don't have sufficient permissions to perform the action or if you don't add the required IPs in the firewall settings of your storage account.
To resolve the error, please check whether you have added the Storage Blob Data Contributor role for your managed identity as well as for the user, like below:
Go to Azure Portal -> Storage Accounts -> Your Storage Account -> Access Control (IAM) -> Add role assignment
Make sure to select the managed identity, based on the authentication method you selected while creating linked service.
As mentioned in this MsDoc, make sure to add all the required IPs based on your resource location and service tag.
Download the JSON file to know the IP range for service tag in your resource location and add them in the firewall settings like below:
Make sure to select the Resource type as Microsoft.DataFactory/factories while choosing CIDR.
For more details, please refer to the links below:
Error when I am trying to connect between Azure Data factory and Azure Data lake Gen2 by Anushree Garg
Storage Accoung V2 access with firewall, VNET to data factory V2 by Cindy Pau
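As a side note, if it helps to tell a missing role apart from a firewall block, here is a rough diagnostic sketch using the Python SDK (packages azure-identity and azure-storage-file-datalake), assuming the linked service authenticates with a service principal. All credential values are placeholders, and the firewall outcome depends on where the script runs, since Data Factory uses its own IP ranges:
# pip install azure-identity azure-storage-file-datalake
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient
credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<application-id>",
    client_secret="<client-secret>",
)
service = DataLakeServiceClient(
    account_url="https://dlsisrdatapoc001.dfs.core.windows.net",
    credential=credential,
)
# Listing the container exercises both RBAC and the storage firewall:
# AuthorizationPermissionMismatch -> missing Storage Blob Data role / ACLs,
# AuthorizationFailure            -> request blocked by the network rules.
file_system = service.get_file_system_client("testconnection")
for path in file_system.get_paths(recursive=False):
    print(path.name)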

Azure Data Storage Access from Databricks

I cannot access Azure Data Lake Storage from Databricks.
I have no premium Azure Databricks service. I am trying to access ADLS Gen 2 directly as per the latest documentation: https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sp-access#access-adls-gen2-directly
I have granted the service principal "Contributor" permissions on this account.
This is the Error message from notebook:
Operation failed: "This request is not authorized to perform this operation using this permission.", 403, GET, https://geolocationinc.dfs.core.windows.net/instruments?upn=false&resource=filesystem&maxResults=500&timeout=90&recursive=false, AuthorizationPermissionMismatch, "This request is not authorized to perform this operation using this permission. ...;
this is my spark config setup:
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net", dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")```
The correct role is "Storage Blob Data Contributor" not "Contributor".
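For completeness, the linked documentation also sets fs.azure.account.auth.type, which the snippet above omits. A minimal sketch of the full direct-access configuration with placeholder values, once the service principal holds the Storage Blob Data Contributor role:
storage_account = "<storage-account-name>"
# OAuth / service-principal configuration for the ABFS driver.
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
# Once the role assignment has propagated (it can take a few minutes), direct listing should succeed:
dbutils.fs.ls(f"abfss://instruments@{storage_account}.dfs.core.windows.net/")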

Azure Synapse serverless SQL pool - query execution fails

After completing tutorial 1, I am working on tutorial 2 from the Microsoft Azure team, which runs the following query (shown in step 3). But the query execution gives the error shown below:
Question: What may be the cause of the error, and how can we resolve it?
Query:
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://contosolake.dfs.core.windows.net/users/NYCTripSmall.parquet',
        FORMAT='PARQUET'
    ) AS [result]
Error:
Warning: No datasets were found that match the expression 'https://contosolake.dfs.core.windows.net/users/NYCTripSmall.parquet'. Schema cannot be determined since no files were found matching the name pattern(s) 'https://contosolake.dfs.core.windows.net/users/NYCTripSmall.parquet'. Please use WITH clause in the OPENROWSET function to define the schema.
NOTE: The path of the file in the container is correct; I actually generated the query above by right-clicking the file inside the container and generating the script.
Remarks:
Azure Data Lake Storage Gen2 account name: contosolake
Container name: users
Firewall settings used on the Azure Data Lake account:
Azure Data Lake Storage Gen2 account is allowing public access
Container has the required access level
UPDATE:
The owner of the subscription is someone else, and I did not get the option to check the "Assign myself the Storage Blob Data Contributor role on the Data Lake Storage Gen2 account" box described in item 3 of the Basics tab > Workspace details section of tutorial 1. I also do not have permissions to add roles, although I am the owner of the Synapse workspace. So I am using the workaround described in Configure anonymous public read access for containers and blobs from the Azure team.
--Workaround
If you are unable to grant Storage Blob Data Contributor, use ACLs to grant permissions.
All users that need access to some data in this container also need to have the EXECUTE permission on all parent folders up to the root (the container). Learn more about how to set ACLs in Azure Data Lake Storage Gen2.
Note:
Execute permission at the container level needs to be set within Azure Data Lake Gen2. Permissions on the folder can be set within Azure Synapse.
Go to the container holding NYCTripSmall.parquet.
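If you prefer to set the ACLs programmatically rather than through the portal, here is a rough sketch with the azure-storage-file-datalake SDK, run by an identity that already has rights to change ACLs; the object ID is a placeholder for the user or workspace MSI that needs access:
# pip install azure-identity azure-storage-file-datalake
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient
service = DataLakeServiceClient(
    account_url="https://contosolake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
users = service.get_file_system_client("users")
# Grant read + execute to one principal on the container root and everything below it,
# which covers EXECUTE on the parent folders and READ on NYCTripSmall.parquet.
root = users.get_directory_client("/")
root.update_access_control_recursive(acl="user:<object-id-of-user-or-msi>:r-x")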
--Update
As per your update in the comments, it seems you will have to do the following.
Contact the owner of the storage account and ask them to perform these tasks:
Assign the workspace MSI to the Storage Blob Data Contributor role on the storage account.
Assign you to the Storage Blob Data Contributor role on the storage account.
--
I was able to get the query results by following the tutorial doc you mentioned, for the same dataset.
Since you confirm that the file is present and in the right path, refresh the linked ADLS source and publish the query before running, just in case it is a transient issue.
Two things I suspect are:
Try setting Microsoft network routing in the Network routing settings of the ADLS account.
Check if the built-in pool is online and you have at least Contributor roles on both the Synapse workspace and the storage account (if the credentials used to run the query did not create the resources).

Write data to Azure Data Lake Storage Gen 2 using Azure Synapse Analytics notebook

I am connecting to a RESTful API from an Azure Synapse Analytics notebook and writing the JSON output to Azure Data Lake Storage Gen 2.
pyspark code:
import requests
response = requests.get('https://api.web.com/v1/data.json')
data = response.json()
from pyspark.sql import *
df = spark.read.json(sc.parallelize([data]))
from pyspark.sql.types import *
account_name = "name of account"
container_name = "name of container"
relative_path = "name of file path" #abfss://<container_name>#<storage_account_name>.dfs.core.windows.net/<path>
adls_path = 'abfss://%s#%s.dfs.core.windows.net/%s' % (container_name, account_name, relative_path)
spark.conf.set('fs.%s#%s.dfs.core.windows.net/%s' % (container_name, account_name), "account_key") #not sure I'm doing the configuration right
df.write.mode("overwrite").json(adls_path)
Error:
Py4JJavaError : An error occurred while calling o536.json.
: Operation failed: "This request is not authorized to perform this operation.", 403, HEAD, https://storageaccount.dfs.core.windows.net/container/?upn=false&action=getAccessControl&timeout=90
Note: Storage Blob Data Contributor: Use to grant read/write/delete permissions to Blob storage resources.
If you are not assigning Storage Blob Data Contributor to the users who are accessing the storage account, they will not be able to access the data in ADLS Gen2 due to the lack of permission on the storage account.
If they try to access data from ADLS gen2 without the "Storage Blob Data Contributor" role on the storage account, they will receive the error message: Operation failed: "This request is not authorized to perform this operation.",403.
Once the storage account is created, select Access control (IAM) from the left navigation. Then assign the following roles or ensure they are already assigned.
Assign yourself to the Storage Blob Data Owner role on the Storage Account.
After granting Storage Blob Data Contributor role on the storage account wait for 5-10 minutes and re-try the operation.
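Put together, once the role is in place for the identity running the notebook, the write itself only needs a correctly formed abfss path. A minimal sketch with placeholder names (in a Synapse notebook the signed-in user or workspace MSI is typically used automatically for the linked storage, so no account-key configuration should be required):
import json
import requests
# Pull the JSON payload and turn it into a DataFrame, as in the question.
response = requests.get('https://api.web.com/v1/data.json')
df = spark.read.json(spark.sparkContext.parallelize([json.dumps(response.json())]))
account_name = "<storage-account-name>"
container_name = "<container-name>"
relative_path = "<folder/subfolder>"
# abfss://<container>@<account>.dfs.core.windows.net/<path> -- note the '@' separator.
adls_path = 'abfss://%s@%s.dfs.core.windows.net/%s' % (container_name, account_name, relative_path)
df.write.mode("overwrite").json(adls_path)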

How to read a blob in Azure databricks with SAS

I'm new to Databricks. I wrote sample code to read a storage blob in Azure Databricks.
blob_account_name = "sars"
blob_container_name = "mpi"
blob_sas_token =r"**"
ini_path = "58154388-b043-4080-a0ef-aa5fdefe22c8"
inputini = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, ini_path)
spark.conf.set("fs.azure.sas.%s.%s.blob.core.windows.net" % (blob_container_name, blob_account_name), blob_sas_token)
print(inputini)
ini=sc.textFile(inputini).collect()
It throws the error:
Container mpi in account sars.blob.core.windows.net not found
I guess it doesn't attach the SAS token to the wasbs link, so it doesn't have permission to read the data.
How do I attach the SAS token to the wasbs link?
This is expected behaviour: you cannot read private storage from Databricks this way. In order to access private data from storage where the firewall is enabled, or which was created in a VNet, you will have to deploy Azure Databricks in your own Azure Virtual Network and then whitelist the VNet address range in the firewall of the storage account. You could refer to Configure Azure Storage firewalls and virtual networks.
WITH PRIVATE ACCESS:
When you have set the access level to "Private (no anonymous access)".
Output: Error message
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Container carona in account cheprasas.blob.core.windows.net not found, and we can't create it using anoynomous credentials, and no credentials found for them in the configuration.
WITH CONTAINER ACCESS:
When you have set the access level to "Container (Anonymous read access for containers and blobs)".
Output: You will be able to see the output without any issue.
Reference: Quickstart: Run a Spark job on Azure Databricks using the Azure portal.
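On the original question of how to attach the SAS token: the pattern from the question works once the '@' separator is used and the fs.azure.sas.<container>.<account> key matches the path, assuming the SAS grants Read and List on the container and the storage firewall does not block the cluster (see the answer above). A minimal sketch:
blob_account_name = "sars"
blob_container_name = "mpi"
blob_sas_token = dbutils.secrets.get(scope="<scope-name>", key="<sas-token-key>")  # or the raw token string
ini_path = "58154388-b043-4080-a0ef-aa5fdefe22c8"
# Register the SAS for this container/account pair before touching the path.
spark.conf.set("fs.azure.sas.%s.%s.blob.core.windows.net" % (blob_container_name, blob_account_name), blob_sas_token)
# wasbs://<container>@<account>.blob.core.windows.net/<path> -- note the '@' separator.
inputini = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, ini_path)
# Session-level spark.conf settings are picked up by the DataFrame reader; the RDD API
# (sc.textFile) typically needs the SAS set in the cluster's Spark configuration instead.
ini = spark.read.text(inputini).collect()
print(ini)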
