I would like to know if there is any way I can create Azure Databricks mount points using the Azure Databricks Resource Provider. Various Azure Service Principals are used to give access to various mount points in ADLS Gen2.
So can these mount points be set up inside Databricks with the right Service Principal access? Can this be done using Terraform, or what is the best way to do this?
Thanks
You can't do it with the azurerm provider, as it works only with Azure-related objects, and a DBFS mount is specific to Databricks. But the Databricks Terraform provider has the databricks_mount resource that is designed for that task. Just take into account that, because there is no such thing as a "mount API", mounting is performed by spinning up a small cluster and running the dbutils.fs.mount command inside it.
P.S. Mounts are really not recommended anymore, because all users of the workspace will have access to the mount's content using the permissions of the service principal that was used for mounting.
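For illustration, here is a minimal sketch of the dbutils.fs.mount call that such a mount boils down to (the databricks_mount resource effectively runs the equivalent on that temporary cluster). All angle-bracket values and the secret scope/key names are placeholders, not values taken from your setup:

    # Minimal sketch: mount ADLS Gen2 with a service principal via OAuth.
    # Replace all <...> placeholders and the secret scope/key with your own values.
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-id>",
        "fs.azure.account.oauth2.client.secret":
            dbutils.secrets.get(scope="<secret-scope>", key="<secret-key>"),
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }

    dbutils.fs.mount(
        source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
        mount_point="/mnt/<mount-name>",
        extra_configs=configs,
    )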
Currently, I am trying to dynamically provision Azure Blob Storage for Kubernetes using the Container Storage Interface (CSI) plugin. The Azure documentation is quite confusing. The GitHub page says to just create a storage class and continue creating the StatefulSet, and the integration is complete.
While the official Azure docs say to create a PVC and a pod, followed by a StatefulSet. This still seems like an incomplete doc and is pretty unclear to me. Any leads on this will be much appreciated.
How does this work exactly? My understanding is: create a PVC and a StatefulSet after creating a storage class, and it should work. If anyone has implemented this in their project, please shed some light.
I got an understanding of how this works. We can create a Persistent Volume Claim in two ways. The first way is to create a separate resource file whose kind is PersistentVolumeClaim, and the other way is to rely on the StatefulSet's volumeClaimTemplates. It's up to the user to decide whether they want to rely on the StatefulSet or on a separate PVC file.
To dynamically provision Azure Blob Storage, install the CSI driver in the target environment and create a storage class. Create a Persistent Volume Claim (PVC) resource and use the target storage class name in it. Now, whenever the PVC resource is deployed, the storage class dynamically provisions the required volume automatically. Quite straightforward and simple (see the sketch below).
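To make the flow concrete, here is a rough sketch using the Kubernetes Python client (the equivalent YAML manifests work the same way). It assumes the Azure Blob CSI driver is already installed; the names "blob-sc" and "blob-pvc", the namespace, and the skuName parameter are illustrative placeholders:

    # Rough sketch: StorageClass backed by the Azure Blob CSI driver plus a standalone PVC.
    from kubernetes import client, config

    config.load_kube_config()

    # StorageClass that delegates provisioning to the Blob CSI driver
    sc = client.V1StorageClass(
        metadata=client.V1ObjectMeta(name="blob-sc"),
        provisioner="blob.csi.azure.com",
        parameters={"skuName": "Standard_LRS"},
        reclaim_policy="Delete",
    )
    client.StorageV1Api().create_storage_class(body=sc)

    # Deploying this PVC triggers dynamic provisioning from the storage class above;
    # a StatefulSet's volumeClaimTemplates would reference the same storageClassName.
    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="blob-pvc"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteMany"],
            storage_class_name="blob-sc",
            resources=client.V1ResourceRequirements(requests={"storage": "5Gi"}),
        ),
    )
    client.CoreV1Api().create_namespaced_persistent_volume_claim(
        namespace="default", body=pvc
    )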
We need to create a shared storage account for Synapse and Databricks. However, we can only use existing storage accounts in Synapse, while Databricks creates a separate resource group on its own and there is no option to use an existing one. Also, why does the managed resource group created by Databricks have locks in it?
There are two things regarding storage accounts & Databricks:
Databricks automatically creates a storage account for each workspace to hold the so-called DBFS root. This storage account is meant to keep only temporary data, libraries, cluster logs, models, etc. It's not designed to hold production data, as this storage account isn't accessible from outside the Databricks workspace.
Databricks can work with storage accounts created outside of the workspace (documentation) - just create a dedicated storage account to keep your data, and access it using the abfss protocol as described in the documentation, or mount it into the workspace (although that's not recommended anymore). Then you can access that storage account from Synapse and other tools as well. A sketch of the abfss approach follows below.
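As a rough illustration of the abfss approach, run in a notebook (all angle-bracket values and the secret scope/key are placeholders, and the service principal is assumed to already have a data-access role such as Storage Blob Data Contributor on the account):

    # Sketch: read directly over abfss with a service principal, no mount involved.
    acct = "<storage-account>.dfs.core.windows.net"
    spark.conf.set(f"fs.azure.account.auth.type.{acct}", "OAuth")
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{acct}",
                   "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{acct}", "<application-id>")
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{acct}",
                   dbutils.secrets.get(scope="<secret-scope>", key="<secret-key>"))
    spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{acct}",
                   "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

    df = spark.read.parquet(f"abfss://<container>@{acct}/<path>")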
I use two different ADLS accounts: one is open to all, and the other is a secured location with privileges given to only a few individuals.
But these privileges given through RBAC are only applicable through the Azure portal, and users are still able to access the secured ADLS through a mount point set up on Azure Databricks.
Is there a way to restrict the access on this mount point?
Thanks.
As per the official documentation and Microsoft Q&A, all users have read and write access to the objects in object storage mounted to DBFS. You cannot restrict users from using the mount point.
You can raise a feature request here.
However, you can use the role-based access control feature for notebooks, clusters, jobs, and tables by selecting the Premium tier.
I followed the documentation azure-datalake-gen2-sp-access and mounted ADLS Gen2 storage in Databricks, but when I try to see the data from the GUI I get the following error:
Cluster easy-matches-cluster-001 does not have the proper credentials to view the content. Please select another cluster.
I can't find any documentation, only something about premium Databricks, so can I only access it with a premium Databricks resource?
Edit1: I can see the mounted storage with dbutils.
After mounting the storage account, please run this command to check whether you have data access permissions on the mount point you created.
dbutils.fs.ls("/mnt/<mount-point>")
If you have data access, you will see the files inside the storage account.
In case you don't have data access, you will get this error: "This request is not authorized to perform this operation using this permission", 403.
If you are able to mount the storage but unable to access it, check whether the ADLS Gen2 account has the necessary roles assigned.
I was able to repro the same. Since you are using an Azure Active Directory application, you would have to assign the "Storage Blob Data Contributor" role to that Azure Active Directory application too.
Below are the steps for granting the Storage Blob Data Contributor role to the registered application:
1. Select your ADLS account. Navigate to Access Control (IAM). Select Add role assignment.
2. Select the role Storage Blob Data Contributor, search for and select your registered Azure Active Directory application, and assign it.
Back in the Access Control (IAM) tab, search for your AAD app and check the access.
3. Run dbutils.fs.ls("/mnt/<mount-point>") to confirm access.
Solved by unmounting, remounting, and restarting the cluster. I followed this doc: https://learn.microsoft.com/en-us/azure/databricks/kb/dbfs/remount-storage-after-rotate-access-key
If you still encounter the same issue even after the Access Control (IAM) roles have been checked, do the following:
Use dbutils.fs.unmount() to unmount all storage accounts.
Restart the cluster.
Remount the storage (see the sketch below).
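A minimal sketch of those steps in a notebook; the mount point is a placeholder, and configs stands for the same OAuth settings dict used for the original mount:

    # Sketch: unmount every user mount under /mnt/, then remount after a cluster restart.
    for m in dbutils.fs.mounts():
        if m.mountPoint.startswith("/mnt/"):   # leave system mounts alone
            dbutils.fs.unmount(m.mountPoint)

    # Restart the cluster from the Compute UI, then remount:
    dbutils.fs.mount(
        source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
        mount_point="/mnt/<mount-point>",
        extra_configs=configs,  # same configs dict used for the original mount
    )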
I have an Azure Pay-As-You-Go subscription with an Azure Storage general-purpose v1 account where I store files. I was surprised yesterday when I found another storage account, in a different location, which I hadn't created. A screenshot is given for details:
If you have any knowledge about it or have faced the same behavior on Azure Storage, please guide me and share your experience, as I want to know what it is for and why it has been created in a different location, while my other services are in a different resource group in the North Europe location.
"please guide and share your experience as I want to know what it is for and why it has been created in a different location as my other services are in a different resource group in the North Europe location."
When you use Azure Cloud Shell, on the initial start, Cloud Shell prompts you to associate a new or existing file share to persist files across sessions.
When you use basic settings and select only a subscription, Cloud Shell creates three resources on your behalf in the supported region that's nearest to you. The auto-generated storage account is always named cs<uniqueGuid>; read here.
Also, Azure creates a disk image of your $Home directory to persist all contents within the directory. The disk image is saved in your specified file share as acc_<User>.img at fileshare.storage.windows.net/fileshare/.cloudconsole/acc_<User>.img, and it automatically syncs changes.
About the region: it depends on the region you chose for the associated Azure storage account when you initially started Cloud Shell. Associated Azure storage accounts must reside in the same region as the Cloud Shell machine that you're mounting them to. Its region is not related to your other Azure resource groups at all. You can also run clouddrive unmount to re-select an associated storage account for the Azure file share.
To find your current region you may run env in Bash and locate the variable ACC_LOCATION.