Azure Data Lake Store as EXTERNAL TABLE in Databricks

How can I create an EXTERNAL TABLE in Azure Databricks which reads from Azure Data Lake Store? I am having trouble seeing in the documentation whether it is even possible. I have a set of CSV files in a specific folder in Azure Data Lake Store, and I want to do a CREATE EXTERNAL TABLE in Azure Databricks which points to the CSV files.

1. Reference mounted directories
You can mount the Azure Data Lake Store (ADLS) to Azure Databricks DBFS (requires Databricks Runtime 4.0 or higher):
# Get Azure Data Lake Store credentials from the secret store
clientid = dbutils.secrets.get(scope="adls", key="clientid")
credential = dbutils.secrets.get(scope="adls", key="credential")
refreshurl = dbutils.secrets.get(scope="adls", key="refreshurl")
accounturl = dbutils.secrets.get(scope="adls", key="accounturl")
# Mount the ADLS account into DBFS
configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
           "dfs.adls.oauth2.client.id": clientid,
           "dfs.adls.oauth2.credential": credential,
           "dfs.adls.oauth2.refresh.url": refreshurl}
dbutils.fs.mount(
    source=accounturl,
    mount_point="/mnt/adls",
    extra_configs=configs)
Table creation works the same way as with DBFS. Just reference the mount point together with the directory in ADLS, e.g.:
%sql
CREATE TABLE product
USING CSV
OPTIONS (header "true", inferSchema "true")
LOCATION "/mnt/adls/productscsv/"
The location clause automatically implies EXTERNAL. See also Azure Databricks Documentation.
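As a quick check, here is a minimal sketch (assuming the /mnt/adls mount point and the productscsv folder from above) that lists the mounted files and queries the new table from a notebook cell:
# Confirm the mount works by listing the CSV files through the mount point
display(dbutils.fs.ls("/mnt/adls/productscsv/"))
# Query the table that was created on top of the mounted folder
display(spark.sql("SELECT * FROM product LIMIT 10"))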
2. Reference the Data Lake Store in the table definition directly
You can also reference the storage directly without mounting the storage. This scenario makes sense if the metadata or parts of the code are also used in other platforms. In this scenario access to the storage has to be defined on the cluster or notebook level (see this Databricks documentation for ADLS Gen1 or this documentation for Gen2 configuration details) or Azure AD Credential Passthrough is used.
The table definition would look like this for ADLS Gen1:
CREATE TABLE sampletable
(L_ORDERKEY BIGINT,
L_PARTKEY BIGINT,
L_SUPPKEY BIGINT,
L_SHIPMODE STRING,
L_COMMENT STRING)
USING csv
OPTIONS ('DELIMITER' '|')
LOCATION "adl://<your adls>.azuredatalakestore.net/directory1/sampletable"
;
For Azure Data Lake Gen2 the location reference looks like:
LOCATION "abfss://<file_system>#<account_name.dfs.core.windows.net/directory/tablename"

You should consider looking at this link: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-datalake.html
Access Azure Data Lake Store using the Spark API
To read from your Data Lake Store account, you can configure Spark to use service credentials with the following snippet in your notebook:
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", "{YOUR SERVICE CLIENT ID}")
spark.conf.set("dfs.adls.oauth2.credential", "{YOUR SERVICE CREDENTIALS}")
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/{YOUR DIRECTORY ID}/oauth2/token")
It doesn't mention the use of External Table.
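Once those settings are in place you can still read the files with the DataFrame API and register the result as a temporary view yourself; here is a minimal sketch, with the account name and path as placeholders:
# Read the CSV files directly from ADLS Gen1 using the configured service credentials
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("adl://<your-adls-account>.azuredatalakestore.net/<your-directory-name>/"))
# Expose the result to SQL without mounting anything
df.createOrReplaceTempView("product_csv")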

Related

How to use the externalDataSource option to write from Databricks to Synapse?

While I was reading the documentation I came across this option, "externalDataSource":
A pre-provisioned external data source to read data from Azure Synapse. An external data source can only be used with PolyBase and removes the CONTROL permission requirement since the connector does not need to create a scoped credential and an external data source to load data
And in the note below it says
externalDataSource is relevant only when reading data from Azure Synapse and writing data from Azure Databricks to a new table in Azure Synapse with PolyBase semantics. You should not specify other storage authentication types while using externalDataSource
Is there any difference in performance when writing to Synapse?
Also, I don't know what the input is; the documentation says:
df = spark.read \
    .format("com.databricks.spark.sqldw") \
    .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
    .option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>") \
    .option("externalDataSource", "<your-pre-provisioned-data-source>") \
    .option("dbTable", "<your-table-name>") \
    .load()
"your-pre-provisioned-data-source" should be the same name of "dbtalbe"?
Is there any difference in performance when writing to Synapse?
Azure Synapse Analytics supports various data loading techniques; loading data with PolyBase is the quickest and most efficient. PolyBase is a data virtualization feature that lets you use T-SQL to access external data kept in Azure Data Lake Storage.
To use PolyBase you need to create an external data source that points to Azure Data Lake Storage.
To create the external data source, you first need a database scoped credential for Azure Data Lake Storage:
-- Create a db master key.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<EnterStrongPasswordHere>';
-- Create a database scoped credential.
CREATE DATABASE SCOPED CREDENTIAL ADL_User
WITH
    IDENTITY = '<client_id>@<OAuth_2.0_Token_EndPoint>',
    SECRET = '<key>';
After creating the scoped credential, create the external data source, which points to the external Azure storage location and references the credential needed to access it.
CREATE EXTERNAL DATA SOURCE <data_source_name>
WITH
( LOCATION = '<prefix>://<path>'
[, CREDENTIAL = <database scoped credential> ]
, TYPE = HADOOP
)
[;]
"your-pre-provisioned-data-source" should be the same name of "dbtalbe"?
You can use above created external data source name in place of "your-pre-provisioned-data-source"
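For the write direction, here is a minimal sketch mirroring the read example above (all connection values are placeholders, and df stands for whatever DataFrame you want to persist); it writes a DataFrame to a new Synapse table through the connector with externalDataSource set:
# Write a DataFrame to a new Azure Synapse table with PolyBase semantics,
# using the pre-provisioned external data source (placeholders throughout)
(df.write
   .format("com.databricks.spark.sqldw")
   .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
   .option("tempDir", "abfss://<your-container-name>@<your-storage-account-name>.dfs.core.windows.net/<your-directory-name>")
   .option("externalDataSource", "<your-pre-provisioned-data-source>")
   .option("dbTable", "<your-new-table-name>")
   .save())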

Create external table in Azure Databricks

I am new to Azure Databricks and trying to create an external table pointing to an Azure Data Lake Storage (ADLS) Gen2 location.
From a Databricks notebook I have tried to set the Spark configuration for ADLS access. Still, I am unable to execute the DDL I created.
Note: One solution working for me is mounting the ADLS account to the cluster and then using the mount location in the external table's DDL. But I needed to check whether it is possible to create an external table DDL with an ADLS path without a mount location.
# Using Principal credentials
spark.conf.set("dfs.azure.account.auth.type", "OAuth")
spark.conf.set("dfs.azure.account.oauth.provider.type", "ClientCredential")
spark.conf.set("dfs.azure.account.oauth2.client.id", "client_id")
spark.conf.set("dfs.azure.account.oauth2.client.secret", "client_secret")
spark.conf.set("dfs.azure.account.oauth2.client.endpoint",
"https://login.microsoftonline.com/tenant_id/oauth2/token")
DDL
create external table test(
  id string,
  name string
)
partitioned by (pt_batch_id bigint, pt_file_id integer)
STORED as parquet
location 'abfss://container@account_name.dfs.core.windows.net/dev/data/employee'
Error Received
Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.contracts.exceptions.ConfigurationPropertyNotFoundException Configuration property account_name.dfs.core.windows.net not found.);
I need help in knowing whether it is possible to refer to the ADLS location directly in the DDL.
Thanks.
Sort of, if you can use Python (or Scala).
Start by making the connection:
TenantID = "blah"
def connectLake():
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", dbutils.secrets.get(scope = "LIQUIX", key = "lake-sp"))
spark.conf.set("fs.azure.account.oauth2.client.secret", dbutils.secrets.get(scope = "LIQUIX", key = "lake-key"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/"+TenantID+"/oauth2/token")
connectLake()
lakePath = "abfss://liquix#mystorageaccount.dfs.core.windows.net/"
Using Python you can register a table using:
spark.sql("CREATE TABLE DimDate USING PARQUET LOCATION '"+lakePath+"/PRESENTED/DIMDATE/V1'")
You can now query that table if you have executed the connectLake() function - which is fine in your current session/notebook.
The problem now is that if a new session comes in and tries to select * from that table, it will fail unless it runs the connectLake() function first. There is no way around that limitation, as you have to provide credentials to access the lake.
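To illustrate the point, here is a short sketch of what a new session would have to run before the table is queryable (assuming the connectLake() definition above is available, e.g. via %run):
# Re-establish the lake credentials for this session, then query the table
connectLake()
display(spark.sql("SELECT * FROM DimDate LIMIT 10"))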
You may want to consider ADLS Gen2 credential pass through: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/adls-passthrough.html
Note that this requires using a High Concurrency cluster.

Write DataFrame from Databricks to Data Lake

It happens that I am manipulating some data using Azure Databricks. Such data is in an Azure Data Lake Storage Gen1. I mounted the data into DBFS, but now, after transforming the data I would like to write it back into my data lake.
To mount the data I used the following:
configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
"dfs.adls.oauth2.client.id": "<your-service-client-id>",
"dfs.adls.oauth2.credential": "<your-service-credentials>",
"dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/<your-directory-id>/oauth2/token"}
dbutils.fs.mount(source = "adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>", mount_point = "/mnt/<mount-name>",extra_configs = configs)
I want to write back a .csv file. For this task I am using the following line
dfGPS.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>")
However, I get the following error:
IllegalArgumentException: u'No value for dfs.adls.oauth2.access.token.provider found in conf file.'
Any piece of code that can help me? Or link that walks me through.
Thanks.
If you mount Azure Data Lake Store, you should use the mount point to store your data, instead of "adl://...". For details on how to mount Azure Data Lake Store (ADLS) Gen1, see the Azure Databricks documentation. You can verify that the mount point works with:
dbutils.fs.ls("/mnt/<newmountpoint>")
So try after mounting ADLS Gen 1:
dfGPS.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("mnt/<mount-name>/<your-directory-name>")
This should work if you added the mount point properly and the service principal also has the required access rights on the ADLS account.
Spark always writes multiple files into a directory, because each partition is saved individually. See also the following Stack Overflow question.
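If you really need a single CSV file, a minimal sketch (at the cost of collapsing the data into one partition, so only advisable for small data sets) would be:
# Coalesce to one partition so Spark writes a single part file into the directory
(dfGPS.coalesce(1)
      .write.mode("overwrite")
      .option("header", "true")
      .csv("/mnt/<mount-name>/<your-directory-name>"))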

Is it possible to access the Azure Table service from Databricks?

I have loaded data into the Azure Table service. I would like to access the data from Databricks the same way we access data from Azure Blob storage.
Unfortunately, Azure Databricks does not support Azure Table storage as a data source.
For more details about the data sources supported by Azure Databricks, refer to this link.
Besides, if you want Azure Databricks to support it, you could post your idea on the feedback forum.
I think the above answer is old - so here is my update.
I am currently accessing data from Azure Tables through DataBricks like this:
import pandas as pd
from azure.cosmosdb.table.tableservice import TableService

# Set up the Azure Table storage connection
table_service = TableService(account_name='accountX',
                             account_key=None, sas_token="tokenX")

data = table_service.query_entities('tableX')       # read the table entities
df_raw = pd.DataFrame([asset for asset in data])    # move it to pandas if you prefer
You need your own credentials for account_name and sas_token; tableX is the name of the table you want to access.
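If you then want to continue with Spark rather than pandas, a minimal follow-up sketch:
# Convert the pandas DataFrame into a Spark DataFrame for further processing
df_spark = spark.createDataFrame(df_raw)
display(df_spark)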

Hive external tables map to Azure Blob storage

Is there a way to create a Hive external table with a location pointing to Azure Storage? We actually want to connect SAP HANA (SDA) to Blob storage, so it seems the only way is to create an external Hive table first which points to Azure Blob storage and then use the Hive ODBC connector/Spark connector to connect it to SAP HANA. Does anyone have any idea how to achieve that?
You can create external tables in Hive or Spark on Azure. There are several options available:
Azure HDInsight
Azure Databricks (via Spark)
Hadoop distros supporting Azure Blob Storage (e.g. HDP)
External table creation would reference the data in the Blob storage account. See the following example for a Hive table created in HDInsight (wasb is used in the location):
CREATE EXTERNAL TABLE IF NOT EXISTS <database name>.<external textfile table name>
(
field1 string,
field2 int,
...
fieldN date
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>'
LINES TERMINATED BY '<line separator>'
STORED AS TEXTFILE
LOCATION 'wasb:///<directory in Azure blob>'
TBLPROPERTIES("skip.header.line.count"="1");
or in Azure Databricks:
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (name STRING, age INT)
COMMENT 'This table is created with existing data'
LOCATION 'wasbs://<containername>@<storage-account>.blob.core.windows.net/<directory>'
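Note that the Databricks cluster or notebook also needs access to the Blob storage account for this to work; here is a minimal sketch of a notebook-level setting using the storage account key (the secret scope and key names are placeholders):
# Grant the session access to the Blob storage account via its access key
spark.conf.set(
    "fs.azure.account.key.<storage-account>.blob.core.windows.net",
    dbutils.secrets.get(scope="blob", key="account-key"))  # hypothetical scope/key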
See also:
HDInsight Documentation
Azure Databricks Documentation
I don't know what SAP supports. ODBC access is possible for all of the solutions.
