Difference between connect and mount in Azure Databricks - azure

I mounted my Azure Storage Account using dbutils and Python like in this page, with the method using Azure Service Principal:
https://learn.microsoft.com/en-us/azure/databricks/dbfs/mounts
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "<application-id>",
"fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}
# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
mount_point = "/mnt/<mount-name>",
extra_configs = configs)
But I also saw that there is an option to connect Spark to the storage account through the Azure Blob File System (ABFS) driver, as on this page:
https://learn.microsoft.com/en-us/azure/databricks/external-data/azure-storage
service_credential = dbutils.secrets.get(scope="<scope>",key="<service-credential-key>")
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
I couldn't find any information about the difference. In which use cases is it better to use one or the other? Is one method faster than the other to get information from the stored data in the Azure Storage Account?
Thanks a lot in advance!

When you mount your storage account, it becomes accessible to everyone who has access to your Databricks workspace.
When you instead use spark.conf.set to connect to your storage account, access is limited to those who have access to that cluster.
As the same Microsoft document, Access Azure Data Lake Storage Gen2 and Blob Storage, points out, mounting is among the deprecated ways of accessing storage accounts and is no longer recommended. So choose between mounting and setting configurations according to your requirements, taking security into consideration.
If you do want to mount, you can try setting up the mount point using credential passthrough.
Is one method faster than the other to get information from the stored data in the Azure Storage Account?
As far as I know, the rate at which the data can be accessed does not change. The main difference is that mounting is less secure than using spark.conf.set, because the mount is accessible to all workspace users.
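To make the relationship between the two approaches concrete, here is a minimal sketch. It assumes the same service principal settings as in the question, and the helper name session_scoped_configs is hypothetical. It only illustrates that the spark.conf.set keys are the mount keys with the storage account host appended; it does not touch Spark itself.

```python
def session_scoped_configs(mount_configs, storage_account):
    """Append the account host to each generic key, as the spark.conf.set variant does."""
    host = f"{storage_account}.dfs.core.windows.net"
    return {f"{key}.{host}": value for key, value in mount_configs.items()}


# Placeholder values; in a notebook the secret would come from dbutils.secrets.get.
mount_configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth2.client.id": "<application-id>",
}

for key, value in session_scoped_configs(mount_configs, "<storage-account>").items():
    print(key, "=", value)
    # In a Databricks notebook you would call: spark.conf.set(key, value)
```

The unsuffixed dictionary is what goes into extra_configs for dbutils.fs.mount, while the suffixed keys are what you would pass to spark.conf.set.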

Related

Azure Data Lake Gen2 Storage Account blob vs dfs choice

I am new to Azure Data Lake Storage Gen2 service. I have a Storage Account with "Hierarchical namespace" option Enabled.
I am using AzCopy to move some files and folders. From the command line I can use either the "blob" or the "dfs" string token within the address string:
'https://myaccount.blob.core.windows.net/mycontainer/myfolder'
or
'https://myaccount.dfs.core.windows.net/mycontainer/myfolder'
again within the .\azcopy.exe copy command.
Apparently both ways succeed and give the same result. My question is: is there any difference if I use blob or dfs in the address string? If yes, what is it?
Also, whichever string token I choose, the Azure portal always shows a file address with the blob string token.
thanks
On the storage account's Endpoints page, you can see all the endpoints available for its services.
Both blob and dfs work for you because both are supported in Azure Data Lake Storage Gen2. However, in Gen1, you may only have the blob service but not the dfs service available. In that case, you won't be able to use the dfs endpoint.
blob and dfs represent the resource type in the endpoint URL.
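As a small illustration of the point that the service label is just one part of the endpoint host name, here is a sketch that swaps it in a URL. The function name swap_endpoint is hypothetical, and it assumes the host follows the account.service.core.windows.net pattern shown in the question.

```python
from urllib.parse import urlsplit, urlunsplit


def swap_endpoint(url, service):
    """Return url with the storage service label (blob, dfs, ...) replaced."""
    parts = urlsplit(url)
    # Host is assumed to look like <account>.<service>.core.windows.net
    account, _old_service, rest = parts.netloc.split(".", 2)
    return urlunsplit(parts._replace(netloc=f"{account}.{service}.{rest}"))


print(swap_endpoint("https://myaccount.blob.core.windows.net/mycontainer/myfolder", "dfs"))
```

Only the host changes; the container and folder path stay the same for both endpoints.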

Accessing Azure ADLS gen2 with Pyspark on Databricks

I'm trying to learn Spark, Databricks & Azure.
I'm trying to access GEN2 from Databricks using Pyspark.
I can't find a proper way. I believe it's super simple, but I have failed so far.
Currently each time I receive the following:
Unable to access container {name} in account {name} using anonymous
credentials, and no credentials found for them in the configuration.
I already have Gen2 running, and I have a SAS_URI for access.
What I have tried so far
(based on this link: https://learn.microsoft.com/pl-pl/azure/databricks/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sas-access):
spark.conf.set(f"fs.azure.account.auth.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", {SAS_URI})
spark.conf.set(f"fs.azure.sas.token.provider.type.{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net", {SAS_URI})
Then to reach out to data:
sd_xxx = spark.read.parquet(f"wasbs://{CONTAINER_NAME}@{STORAGE_ACCOUNT_NAME}.dfs.core.windows.net/{proper_path_to_files/}")
Your configuration is incorrect. The first property should be set to just the value SAS, while the second should be set to the name of a Scala/Java class that returns the SAS token. You can't just pass a URI with SAS information in it; you need to implement some custom code.
wasbs is the protocol for accessing Azure Blob Storage. It can be used to access ADLS Gen2 (although this is not recommended), but then you need to use blob.core.windows.net instead of dfs.core.windows.net, and also set the correct Spark property for Azure Blob access.
The more common procedure is described here: Access Azure Data Lake Storage Gen2 using OAuth 2.0 with an Azure service principal
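For what it's worth, newer Databricks documentation describes a built-in fixed SAS token provider, so custom code may not be needed on recent runtimes. The sketch below only builds the property names and values as a dictionary. The provider class name and the fs.azure.sas.fixed.token key are taken from that documentation and should be treated as assumptions for your runtime version; the helper name sas_configs is hypothetical.

```python
def sas_configs(storage_account, sas_token):
    """Build session properties for SAS access to one ADLS Gen2 account."""
    host = f"{storage_account}.dfs.core.windows.net"
    return {
        # The auth type is the literal string "SAS", not the SAS URI itself.
        f"fs.azure.account.auth.type.{host}": "SAS",
        # Assumed class name; check that your runtime ships this provider.
        f"fs.azure.sas.token.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider",
        f"fs.azure.sas.fixed.token.{host}": sas_token,
    }


for key, value in sas_configs("<storage-account>", "<sas-token>").items():
    print(key, "=", value)
    # In a Databricks notebook you would call: spark.conf.set(key, value)
```

With these set, the read path would use the abfss scheme and the dfs host, for example abfss://<container>@<storage-account>.dfs.core.windows.net/<path>.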

azure datalake gen2 databricks ACLs permissions

I am trying to understand why my ACL permissions are not working properly in Databricks.
Scenario: I have 2 users, one with full permissions on the file system and the other without any permissions.
I tried mounting the Gen2 file system in Databricks using 2 different methods.
1.
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": clientid,
"fs.azure.account.oauth2.client.secret": credential,
"fs.azure.account.oauth2.client.endpoint": refresh_url}
dbutils.fs.mount(
source = "abfss://xyz@abc.dfs.core.windows.net/",
mount_point = "/mnt/xyz",
extra_configs = configs)
and using passthrough
2.
configs = {
"fs.azure.account.auth.type": "CustomAccessToken",
"fs.azure.account.custom.token.provider.class": spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}
dbutils.fs.mount(
source = "abfss://xyz@abc.dfs.core.windows.net/",
mount_point = "/mnt/xyz",
extra_configs = configs)
Both mount the file system. But when I use:
dbutils.fs.ls("/mnt/xyz")
it displays all the files and folders, even for the user who has no permissions on the data lake.
I would be glad if someone could explain what's wrong.
Thanks
This is expected behavior when you enable Azure Data Lake Storage credential passthrough.
Note: When a cluster is enabled for Azure Data Lake Storage credential passthrough, commands run on that cluster can read and write data in Azure Data Lake Storage without requiring users to configure service principal credentials to access the storage. The credentials are set automatically, based on the user initiating the action.
Reference: Enable Azure Data Lake Storage credential passthrough for your workspace and Simplify Data Lake Access with Azure AD Credential Passthrough.
You probably forgot to add permissions in the Access Control (IAM) of the container.
To check this, go to the container in the Azure portal and click Switch to Azure AD User Account. If you don't have rights, you will see an error message.
For example, you can add the role Storage Blob Data Contributor to get read and write access.
Note: Data Lake takes a few minutes to refresh the credentials, so you need to wait a little after adding the role.

What is my file-system-name and storage-account-name? And how do I find it?

I am trying to connect databricks to my blob containers in azure data lake gen2.
I can't find what my file-system-name is or my storage-account-name is anywhere for a connection.
dbutils.fs.ls("abfss://file-system-name@storage-account-name.dfs.core.windows.net/")
Thanks. If someone could reference an example that would be great.
They could have provided better documentation with an example; it took me some time to figure it out.
file-system-name means the container name. For instance, if you have a container 'data' created in a storage account "myadls2021", it would look like this:
val data = spark.read
.option("header", "true")
.csv("abfss://data@myadls2021.dfs.core.windows.net/Baby_Names__Beginning_2007.csv")
I would recommend following this documentation :
https://docs.databricks.com/data/data-sources/azure/azure-datalake-gen2.html#azure-data-lake-storage-gen2
It explains how to access and/or mount an Azure Datalake Gen2 from Databricks.
file-system-name is the container name. storage-account-name is the Azure storage account.
dbutils.fs.mount(
source = "abfss://raw@storageaccadls01.dfs.core.windows.net/",
mount_point = "/mnt/raw",
extra_configs = configs)
You can then access the storage account files from the DBFS mount point location.
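To summarize how the two names fit into the URI, here is a minimal sketch. The helper name abfss_uri is hypothetical; the container and account names are the ones used in the answers above.

```python
def abfss_uri(container, storage_account, path=""):
    """Build an abfss URI: the container (file system) goes before '@', the account after."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{path.lstrip('/')}"


print(abfss_uri("data", "myadls2021", "Baby_Names__Beginning_2007.csv"))
print(abfss_uri("raw", "storageaccadls01"))
```

In a Databricks notebook the resulting string can be passed to dbutils.fs.ls or spark.read, provided access to the account has been configured.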

How to connect Azure Data Lake Store gen 2 File Share with Azure Databricks?

I have an Azure Data Lake Storage Gen2 account with hierarchical namespace enabled. I generated a SAS token for the account, and I receive data in a folder in the File Share (File Service). Now I want to access these files through Azure Databricks and Python. However, it seems that Azure Databricks can only access the File System (called Blob Container in Gen1), and not the File Share. I also failed to generate a SAS token for the File System.
I would like a storage instance for which I can generate a SAS token to give to my client, and which I can also access from Azure Databricks using Python. It does not matter whether it is File System, File Share, ADLS Gen2 or Gen1, as long as it works somehow.
I use the following to access the File System from databricks:
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": "my_client_id",
"fs.azure.account.oauth2.client.secret": "my_client_secret",
"fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/"+"My_tenant_id" +"/oauth2/token",
"fs.azure.createRemoteFileSystemDuringInitialization": "true"}
dbutils.fs.mount(source = "abfss://"+"my_file_system"+"@"+"my_storage_account"+".dfs.core.windows.net/MyFolder",
mount_point = "/mnt/my_mount",
extra_configs = configs)
Works fine but I cannot make it access the File Share. And I have a SAS-token with a connection string like this:
connection_string = (
'BlobEndpoint=https://<my_storage>.blob.core.windows.net/;'+
'QueueEndpoint=https://<my_storage>.queue.core.windows.net/;'+
'FileEndpoint=https://<my_storage>.file.core.windows.net/;'+
'TableEndpoint=https://<my_storage>.table.core.windows.net/;'+
'SharedAccessSignature=sv=2018-03-28&ss=bfqt&srt=sco&sp=rwdlacup&se=2019-09-26T17:12:38Z&st=2019-08-26T09:12:38Z&spr=https&sig=<my_sig>'
)
Which I manage to use to upload stuff to the file share, but not to the file system. Is there any kind of Azure storage that can be accessed by both a SAS-token and azure databricks?
Steps to connect to an Azure file share from Databricks:
First, install the Microsoft Azure Storage File Share client library for Python using pip install in Databricks: https://pypi.org/project/azure-storage-file-share/
After installing it, create a storage account. Then you can create a file share from Databricks:
from azure.storage.fileshare import ShareClient
share = ShareClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<file share name that you want to create>")
share.create_share()
See this page for further reference: https://learn.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string
Code to upload a file into the file share from Databricks:
from azure.storage.fileshare import ShareFileClient
file_client = ShareFileClient.from_connection_string(conn_str="<connection_string consists of FileEndpoint=myFileEndpoint(https://storageaccountname.file.core.windows.net/);SharedAccessSignature=sasToken>", share_name="<your_fileshare_name>", file_path="my_file")
with open("./SampleSource.txt", "rb") as source_file:
    file_client.upload_file(source_file)
Refer to this link for further information: https://pypi.org/project/azure-storage-file-share/
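The conn_str placeholder above is a FileEndpoint plus a SharedAccessSignature joined by a semicolon. A minimal sketch of assembling it is shown below; the helper name file_share_connection_string is hypothetical, and stripping a leading "?" from the token is an assumption about how the token was copied from the portal.

```python
def file_share_connection_string(storage_account, sas_token):
    """Assemble a SAS-based connection string for the File service only."""
    endpoint = f"https://{storage_account}.file.core.windows.net/"
    # A token copied from the portal may start with '?'; the connection string wants it without.
    return f"FileEndpoint={endpoint};SharedAccessSignature={sas_token.lstrip('?')}"


print(file_share_connection_string("storageaccountname", "<sasToken>"))
```

The returned string is what you would pass as conn_str to ShareClient.from_connection_string or ShareFileClient.from_connection_string.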
