How to read a blob in Azure databricks with SAS - azure

I'm new to Databricks. I write sample code to read Storage Blob in Azure Databricks.
blob_account_name = "sars"
blob_container_name = "mpi"
blob_sas_token =r"**"
ini_path = "58154388-b043-4080-a0ef-aa5fdefe22c8"
inputini = 'wasbs://%s#%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, ini_path)
spark.conf.set("fs.azure.sas.%s.%s.blob.core.windows.net"% (blob_container_name, blob_account_name), blob_sas_token)
print(inputini)
ini=sc.textFile(inputini).collect()
It throw error:
Container mpi in account sars.blob.core.windows.net not found
I guess it doesn't attach the SAS token in WASBS link, so that it doesn't permission to read the data.
How to attach the SAS in wasbs link.

This is excepted behaviour, you cannot access the read private storage from Databricks. In order to access private data from storage where firewall is enabled or when created in a vnet, you will have to Deploy Azure Databricks in your Azure Virtual Network then whitelist the Vnet address range in the firewall of the storage account. You could refer to configure Azure Storage firewalls and virtual networks.
WITH PRIVATE ACCESS:
When you have provided access level to "Private (no anonymous access)".
Output: Error message
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Container carona in account cheprasas.blob.core.windows.net not found, and we can't create it using anoynomous credentials, and no credentials found for them in the configuration.
WITH CONTAINER ACCESS:
When you have provided access level to "Container (Anonymous read access for containers and blobs)".
Output: You will able to see the output without any issue.
Reference: Quickstart: Run a Spark job on Azure Databricks using the Azure portal.

Related

Access denied to storage account from Azure Data Factory

My goal is to run an exe file stored in a private Azure Blob container.
The exe is simple : it creates a text file, write the current datetime in it, and then push it to the private Azure Blob container.
This has to be sent from Azure Data Factory. To do this, here is my environment :
Azure Data Factory running with the simple pipeline :
https://i.stack.imgur.com/txQ9r.png
Private storage account with the following configuration :
https://i.stack.imgur.com/SJrGX.png
A linked service connected to the storage account :
https://i.stack.imgur.com/8xW5l.png
A private managed virtual network approved :
https://i.stack.imgur.com/G2DH3.png
A linked service connected to an Azure Batch :
https://i.stack.imgur.com/Yaq6C.png
A batch account linked to the right storage account
A pool running on this batch account
Two things that I need to add in context :
When I set the storage account to public, it works and I find the text file in my blob storage. So the process works well, but there is a security issue somewhere I can't find.
All the resources (ADF, Blob storage, Batch account) used have a role has contributor/owner of the blob with a managed identity.
Here is the error I get when I set the storage account to private :
{
"errorCategory":0,
"code":"BlobAccessDenied",
"message":"Access for one of the specified Azure Blob(s) is denied",
"details":[
{
"Name":"BlobSource",
"Value":"https://XXXXXXXXXXXXXXXXX/testv2.exe?sv=2018-03-28&sr=b&sig=XXXXXXXXXXXXXXXXXX&sp=r"
},
{
"Name":"FilePath",
"Value":"D:\\batch\\tasks\\workitems\\XXXXXXXXXXX\\job-1\\XXXXXXXXXXXXXXXXXXXXXXXX\\testv2.exe"
}
]
}
Thank you for your help!
Solution found Azure community support :
Check Subnet information under Network Configuration from the Azure portal > Batch Account > Pool > Properties. Take note and write the information down.
Navigate to the storage account, and select Networking. In the Firewalls and virtual networks setting, select Enable from selected virtual networks and IP addresses for Public network access. Add the Batch pool's subnet in the firewall allowlist.
If the subnet doesn't enable the service endpoint, when you select it, a notification will be displayed as follows:
The following networks don't have service endpoints enabled for 'Microsoft.Storage'. Enabling access will take up to 15 minutes to complete. After starting this operation, it is safe to leave and return later if you don't wish to wait.
Therefore, before you add the subnet, check it in the Batch virtual network to see if the service endpoint for the storage account is enabled.
After you complete the configurations above, the Batch nodes in the pool can access the storage account successfully.

Connecting Blob Storage to a Synapse Workspace with Public Network Workspace Access Disabled

I'm trying to connect blob storage from my RG storage account to the Data tab in my synapse workspace, but I get the following error: "The public network interface on this Workspace is not accessible. To connect to this Workspace, use the Private Endpoint from inside your virtual network or enable public network access for this workspace."
Public network access to my workspace must be disabled for company reasons. I made private endpoint connections on my synapse resource to Dev, Sql, and Sql-On-Demand, but I'm not sure where to go from there.
Thanks!
Go to Azure Synapse -> Manage -> Managed private endpoints -> +New and add private endpoints.
Accessing blob storage : If you already created linked service the follow below image. If not, please created and follow this Ms Doc for creating linked service.
Fastest way to access Azure blob storage
For for information follow this reference by Dennes Torres.

Trying to set a link services through registerd app to azure data lake storage and keep getting 24200 error

I am new to azure. We have azure data lake storage set. I am trying to set the link services from the data factory to the azure data lake storage gen2. It keeps failing when I test the link service to the data lake storage. As far as I can see, I have granted the "Storage blob contributor" role to the user in the azure data lake storage. I still keep getting permission denied error when I test the link services
ADLS Gen2 operation failed for: Storage operation '' on container 'testconnection' get failed with 'Operation returned an invalid status code 'Forbidden''. Possible root causes: (1). It's possible because the service principal or managed identity don't have enough permission to access the data. (2). It's possible because some IP address ranges of Azure Data Factory are not allowed by your Azure Storage firewall settings. Azure Data Factory IP ranges please refer https://learn.microsoft.com/en-us/azure/data-factory/azure-integration-runtime-ip-addresses.. Account: 'dlsisrdatapoc001'. ErrorCode: 'AuthorizationFailure'. Message: 'This request is not authorized to perform this operation.'.
What I could observe is that when I open the network to all (public) in the data lake storage, it works, when I set the firewall with CIDR it fails. Couldn't narrow the cause of the problem. I do have the "Allow azure services on the trusted services list to access this account" checked.
Completely lost
As mentioned in the error description, the error usually occurs if you don't have sufficient permissions to perform the action or if you don't add the required IPs in the firewall settings of your storage account.
To resolve the error, please check if you added the Storage Blob Data Contributor role to your managed identity along with the user like below:
Go to Azure Portal -> Storage Accounts -> Your Storage Account -> Access Control (IAM) ->Add role assignment
Make sure to select the managed identity, based on the authentication method you selected while creating linked service.
As mentioned in this MsDoc, make sure to add all the required IPs based on your resource location and service tag.
Download the JSON file to know the IP range for service tag in your resource location and add them in the firewall settings like below:
Make sure to select the Resource type as
Microsoft.DataFactory/factories while choosing CIDR.
For more in detail, please refer below links:
Error when I am trying to connect between Azure Data factory and Azure Data lake Gen2 by Anushree Garg
Storage Accoung V2 access with firewall, VNET to data factory V2 by Cindy Pau

Azure Synapse severless SQL pool - query execution fails

After completing tutorial 1, I am working on this tutorial 2 from Microsoft Azure team to run the following query (shown in step 3). But the query execution gives the error shown below:
Question: What may be the cause of the error, and how can we resolve it?
Query:
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://contosolake.dfs.core.windows.net/users/NYCTripSmall.parquet',
FORMAT='PARQUET'
) AS [result]
Error:
Warning: No datasets were found that match the expression 'https://contosolake.dfs.core.windows.net/users/NYCTripSmall.parquet'. Schema cannot be determined since no files were found matching the name pattern(s) 'https://contosolake.dfs.core.windows.net/users/NYCTripSmall.parquet'. Please use WITH clause in the OPENROWSET function to define the schema.
NOTE: The path of the file in the container is correct, and actually I generated the following query just by right clicking the file inside container and generated the script as shown below:
Remarks:
Azure Data Lake Storage Gen2 account name: contosolake
Container name: users
Firewall settings used on the Azure Data lake account:
Azure Data Lake Storage Gen2 account is allowing public access (ref):
Container has required access level (ref)
UPDATE:
The owner of the subscription is someone else, and I did not get the option Check the "Assign myself the Storage Blob Data Contributor role on the Data Lake Storage Gen2 account" box described in item 3 of Basics tab > Workspace details section of tutorial 1. I also do not have permissions to add roles - although I'm the owner of synapse workspace. So I am using workaround described in the Configure anonymous public read access for containers and blobs from Azure team.
--Workaround
If you are unable to granting Storage Blob Data Contributor, use ACL to grant permissions.
All users that need access to some data in this container also needs
to have the EXECUTE permission on all parent folders up to the root
(the container). Learn more about how to set ACLs in Azure Data Lake
Storage Gen2.
Note:
Execute permission on the container level needs to be set within the
Azure Data Lake Gen2. Permissions on the folder can be set within
Azure Synapse.
Go to the container holding NYCTripSmall.parquet.
--Update
As per your update in comments, it seems you would have to do as below.
Contact the Owner of the storage account, and ask them to perform the following tasks:
Assign the workspace MSI to the Storage Blob Data Contributor role on
the storage account
Assign you to the Storage Blob Data Contributor role on the storage
account
--
I was able to get the query results following the tutorial doc you have mentioned for the same dataset.
Since you confirm that the file is present and in the right path, refresh linked ADLS source and publish query before running, just in case if a transient issue.
Two things I suspect are
Try setting Microsoft network routing in Network Routing settings in ADLS account.
Check if built-in pool is online and you have atleast contributer roles on both Synapse workspace and Storage account. (If the current credentials using to run the query has not created the resources)

Connect to Blob storage "no credentials found for them in the configuration"

I'm working with Databricks notebook backed by spark cluster. Having trouble trying to connect to the Azure blob storage. I used this link and tried the section Access Azure Blob Storage Directly - Set up an account access key. I get no errors here:
spark.conf.set(
"fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
"<your-storage-account-access-key>")
But receive errors when I try and do an 'ls' on the directory:
dbutils.fs.ls("wasbs://<your-container-name>#<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Unable to access container <container name> in account <storage account name>core.windows.net using anonymous credentials, and no credentials found for them in the configuration.
If there is a better way, please provide suggestion as well. thanks
You need to pass the storage account name and key while setting up the configuration . You can find this from azure portal.
spark.conf.set(
"fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
"<your-storage-account-access-key>")
Also while doing the ls you need to add
Container name and directory name.
dbutils.fs.ls("wasbs://<your-container-name>#<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")
Hope this will resolve your issue!

Resources