Accessing ADLS Gen 2 storage from Databricks - python-3.x

I am trying to read files in ADLS Gen 2 Storage from Databricks Notebook using Python.
However, the storage container has its public access level set to "Private".
I have Storage Account Contributor and Storage Blob Data Contributor access.
How can Databricks be allowed to read from and write to the ADLS storage?

According to the information you provided, the Storage Account Contributor role has been assigned to your account, so you have permission to retrieve the storage account access key. We can therefore use the access key to authenticate, and then read from and write to ADLS Gen 2 storage. For more details, please refer to here.
For example
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope-name>", key="<storage-account-access-key-name>"))

dbutils.fs.ls("abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")

Related

List all containers within an Azure storage account through Databricks

I want to dynamically get all storage accounts, and the containers within them, from an Azure subscription using Databricks.
With that, I can go through each file within a container and get the files and their sizes, which I have done earlier. Now I want to dynamically set the storage account and container to process from my Databricks environment.
Per my experience and understanding, for all operations on a storage account from Databricks, authentication happens at the storage account level. Whether you access the storage account through a service principal or a storage account access key, both are scoped to the storage account, so you can list the containers within that storage account. But there is no option to list the storage accounts within a subscription this way. As a workaround, you can use PowerShell to get the storage accounts within the subscription and pass those values into your logic.
You can use the code below to get the list of containers within a storage account.
from azure.storage.blob.blockblobservice import BlockBlobService

blob_service = BlockBlobService(account_name='storageaccount', account_key='accesskey')
containers = blob_service.list_containers()
for c in containers:
    print(c.name)
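Note that BlockBlobService comes from the legacy azure-storage SDK. With the current azure-storage-blob (v12) package, an equivalent listing looks roughly like this (a sketch, reusing the same placeholder account name and key):

from azure.storage.blob import BlobServiceClient

# Placeholder account name and key -- substitute your own values.
account_url = "https://storageaccount.blob.core.windows.net"
service = BlobServiceClient(account_url=account_url, credential="accesskey")

# list_containers() returns an iterable of ContainerProperties
for container in service.list_containers():
    print(container.name)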

unable to create mount point in databricks for adls gen 2

I am trying to create a mount point to ADLS Gen2 using Key Vault in Databricks; however, I am not able to do so because of an error I am getting.
I have Contributor access, and I tried assigning both Storage Blob Data Contributor and Contributor access to the SPN, but I am still not able to create the mount points.
I request some help, please.
configs= {"fs.azure.account.auth.type":"OAuth",
"fs.azure.account.oauth.provider.type":"org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id":"abcdefgh",
"fs.azure.account.oauth2.client.secret":dbutils.secrets.get(scope="myscope",key="mykey"),
"fs.azure.account.oauth2.client.endpoint":"https://login.microsoftonline.com/tenantid/oauth2/token",
"fs.azure.createRemoteFileSystemDuringInitialization": "true"}
dbutils.fs.mount(
source= "abfss://cont1#storageaccount.dfs.core.windows.net/",
mount_point="/mnt/cont1",
extra_configs=configs)
The error I am getting is:
An error occurred while calling o280.mount.
: HEAD https://storageaccount.dfs.core.windows.net/cont1?resource=filesystem&timeout=90
StatusCode=403
StatusDescription=This request is not authorized to perform this operation.
When performing the steps in the "Assign the application to a role" section, make sure the Storage Blob Data Contributor role is assigned to the service principal.
Repro: I provided the Owner permission to the service principal and tried to run dbutils.fs.ls("mnt/azure/"), which returned the same error message as above.
Solution: Assign the Storage Blob Data Contributor role to the service principal.
Finally, after assigning the Storage Blob Data Contributor role to the service principal, I was able to get the output without any error message.
For more details, refer to "Tutorial: Azure Data Lake Storage Gen2, Azure Databricks & Spark".
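Once the role assignment has propagated, re-running the mount should succeed. If a failed attempt left a stale mount behind, it may help to unmount first; a hedged sketch reusing the names from the question:

# Unmount if a previous (failed) attempt left /mnt/cont1 behind, then retry.
if any(m.mountPoint == "/mnt/cont1" for m in dbutils.fs.mounts()):
    dbutils.fs.unmount("/mnt/cont1")

dbutils.fs.mount(
    source="abfss://cont1@storageaccount.dfs.core.windows.net/",
    mount_point="/mnt/cont1",
    extra_configs=configs)  # the same configs dict as above

# Verify the mount works
display(dbutils.fs.ls("/mnt/cont1"))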

Least privilege permissions for az storage blob upload-batch

Our CI pipeline needs to back up some files to Azure Blob Storage. I'm using the Azure CLI like this: az storage blob upload-batch -s . -d container/directory --account-name myaccount
When giving the service principal contributor access, it works as expected. However, I would like to lock down permissions so that the service principal is allowed to add files, but not delete, for example. What are the permissions required for this?
I've created a custom role giving it the same permissions as Storage Blob Data Contributor minus delete. This (and also just using the Storage Blob Data Contributor role directly) fails with a "Storage account ... not found" error. OK, I then proceeded to add more read permissions to the blob service. Not enough; now I'm at a point where it wants to perform Microsoft.Storage/storageAccounts/listKeys/action. But if I give it access to the storage keys, then what's the point? With the storage keys the SP will have full access to the account, which is what I want to avoid in the first place. Why is az storage blob upload-batch requesting keys, and can I prevent this from happening?
I've created a custom role giving it the same permissions as Storage Blob Data Contributor minus delete. This (and also just using the Storage Blob Data Contributor role directly) fails with a Storage account ... not found.
I can also reproduce your issue; actually, what you did will work. The trick is the --auth-mode parameter of the command: if you do not specify it, it defaults to key, and the command will then list all the storage accounts in your subscription. When it finds your storage account, it lists the keys of the account and uses a key to upload the blobs.
However, your "Storage Blob Data Contributor minus delete" role has no permission to list storage accounts, so you get the error.
To solve the issue, just specify --auth-mode login in your command. It will then use the credential of your service principal to get an access token and use that token to call the REST API - Put Blob to upload the blobs. For the principle, see Authorize access to blobs and queues using Azure Active Directory.
az storage blob upload-batch -s . -d container/directory --account-name myaccount --auth-mode login
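If you want the same token-based flow from Python instead of the CLI, it looks roughly like this. A sketch, assuming hypothetical environment variables holding the service principal credentials and the same account/container names as above:

import os
from azure.identity import ClientSecretCredential
from azure.storage.blob import BlobServiceClient

# Hypothetical environment variables holding the service principal credentials.
credential = ClientSecretCredential(
    tenant_id=os.environ["AZURE_TENANT_ID"],
    client_id=os.environ["AZURE_CLIENT_ID"],
    client_secret=os.environ["AZURE_CLIENT_SECRET"],
)

# Uses the AAD token (same idea as --auth-mode login), so no account key is needed.
service = BlobServiceClient(
    account_url="https://myaccount.blob.core.windows.net",
    credential=credential,
)
container = service.get_container_client("container")

# Upload a single file into the "directory" virtual folder.
with open("backup.zip", "rb") as data:
    container.upload_blob(name="directory/backup.zip", data=data, overwrite=True)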

How to get a list of all Azure blob containers across all regions, and how to get the region of an Azure blob container

I need help with two questions using the Java SDK:
How do I get a list of all Azure blob containers across all regions, with their region name/code? Do I need to loop over all the storage accounts to get the list of blob containers?
How do I get the region code of an Azure account from a blobServiceClient, BlobContainerClient, or BlobContainerItem?
As far as I know, Azure does not associate regions with containers; regions are associated with the storage account (correct me if I am wrong).
How do I get the region of a storage account from a blobServiceClient?
I can get the account info using
blobServiceClient.getAccountInfo()
but it does not have any region information in it.
Note
I have the storage account key, with which I generate a connection string and get the blobServiceClient, but there is no way to get the region of that storage account from it.
For information about or operations on the storage account resource itself (instead of the containers, blobs, etc. under the storage account), you could use a management client, as in this sample.
Related code is here: basically, list all storage accounts (by subscription or by resource group), then check the region() method of each.
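The question asked about the Java SDK, but the same idea in Python (for consistency with the rest of this thread) uses the management-plane package azure-mgmt-storage, where the region is exposed as location. A sketch, assuming your identity can read the subscription and that the subscription ID is supplied via a hypothetical environment variable:

import os
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

# AZURE_SUBSCRIPTION_ID is a placeholder environment variable for illustration.
subscription_id = os.environ["AZURE_SUBSCRIPTION_ID"]
client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# Each storage account carries its region in the `location` property;
# containers inherit the region of their parent account.
for account in client.storage_accounts.list():
    print(account.name, account.location)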

Azure Databricks: can't connect to Azure Data Lake Storage Gen2

I have a storage account kagsa1 with a container cont1 inside, and I need it to be accessible (mounted) via Databricks.
If I use the storage account key from Key Vault, it works correctly:
configs = {
    "fs.azure.account.key.kagsa1.blob.core.windows.net": dbutils.secrets.get(scope="kv-db1", key="storage-account-access-key")
}

dbutils.fs.mount(
    source="wasbs://cont1@kagsa1.blob.core.windows.net",
    mount_point="/mnt/cont1",
    extra_configs=configs)

dbutils.fs.ls("/mnt/cont1")
..but if I try to connect using Azure Active Directory credentials:
configs = {
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class": spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}

dbutils.fs.ls("abfss://cont1@kagsa1.dfs.core.windows.net/")
..it fails:
ExecutionError: An error occurred while calling z:com.databricks.backend.daemon.dbutils.FSUtils.ls.
: GET https://kagsa1.dfs.core.windows.net/cont1?resource=filesystem&maxResults=5000&timeout=90&recursive=false
StatusCode=403
StatusDescription=This request is not authorized to perform this operation using this permission.
ErrorCode=AuthorizationPermissionMismatch
ErrorMessage=This request is not authorized to perform this operation using this permission.
The Databricks workspace tier is Premium,
the cluster has the Azure Data Lake Storage credential passthrough option enabled,
the storage account has the hierarchical namespace option enabled,
and the filesystem was initialized with
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
dbutils.fs.ls("abfss://cont1#kagsa1.dfs.core.windows.net/")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
and I have full access to the container in the storage account.
What am I doing wrong?
Note: When performing the steps in the "Assign the application to a role" section, make sure to assign the Storage Blob Data Contributor role to the service principal.
As part of the repro, I provided the Owner permission to the service principal and tried to run dbutils.fs.ls("mnt/azure/"), which returned the same error message as above.
I then assigned the Storage Blob Data Contributor role to the service principal.
Finally, after assigning the Storage Blob Data Contributor role to the service principal, I was able to get the output without any error message.
For more details, refer to "Tutorial: Azure Data Lake Storage Gen2, Azure Databricks & Spark".
Reference: Azure Databricks - ADLS Gen2 throws 403 error message.
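If you still want the container mounted (rather than accessed directly via the abfss URI), the same credential passthrough configs can also back a mount once the role assignment has propagated. A hedged sketch, reusing the account and container names from the question:

# Credential passthrough mount -- requires a Premium workspace and a cluster
# with Azure Data Lake Storage credential passthrough enabled (as above).
configs = {
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class":
        spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}

dbutils.fs.mount(
    source="abfss://cont1@kagsa1.dfs.core.windows.net/",
    mount_point="/mnt/cont1",
    extra_configs=configs)

dbutils.fs.ls("/mnt/cont1")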
