How to access one Databricks workspace's Delta tables from another Databricks workspace

I want to access the Delta tables of one Databricks environment from another Databricks environment by creating a global Hive metastore in one of them. Please let me know whether this is possible.
Thanks in advance.

There are two aspects here:
The data itself: it must be available to the other workspaces. This is done by writing the data to a shared storage account/container. You can either mount that storage account or use direct access (via a service principal or AAD passthrough). Don't write the data to the built-in DBFS root, which isn't available to other workspaces. After you write the data using dataframe.write.format("delta").save("some_path_on_adls"), you can read it from another workspace that has access to that shared storage. This can be done either
via Spark API: spark.read.format("delta").load("some_path_on_adls")
via SQL, using the following syntax in place of a table name (see docs):
delta.`some_path_on_adls`
The metadata: if you want to expose the saved data as SQL tables with database and table names instead of paths, you have the following choices:
Use the built-in metastore to save the data to a location on ADLS, and then create a so-called external table in the other workspace, inside its own metastore. In the source workspace, do:
dataframe.write.format("delta").option("path", "some_path_on_adls")\
.saveAsTable("db_name.table_name")
and in the other workspace execute the following SQL (either via %sql in a notebook or via the spark.sql function):
CREATE TABLE db_name.table_name USING DELTA LOCATION 'some_path_on_adls'
Use an external metastore that is shared by multiple workspaces. In this case you just need to save the data correctly:
dataframe.write.format("delta").option("path", "some_path_on_adls")\
.saveAsTable("db_name.table_name")
You still need to save it to a shared location so that the data is accessible from the other workspace, but you don't need to register the table explicitly, because the other workspace reads the metadata from the same metastore database.
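The path-based query and the registration step from the first metadata option can be sketched in Python as follows. This is a minimal sketch rather than part of the original answer: the helper names are hypothetical, `spark` is assumed to be the SparkSession that a Databricks notebook provides, and `IF NOT EXISTS` is an addition that makes the call safe to repeat.

```python
def delta_path_reference(path):
    """Return the SQL form delta.`<path>` used in place of a table name."""
    if "`" in path:
        raise ValueError("path must not contain a backtick")
    return "delta.`{}`".format(path)


def external_table_ddl(db_name, table_name, path):
    """Build the statement that registers existing Delta data as a table."""
    if "'" in path:
        raise ValueError("path must not contain a single quote")
    return "CREATE TABLE IF NOT EXISTS {}.{} USING DELTA LOCATION '{}'".format(
        db_name, table_name, path
    )


def register_external_delta_table(spark, db_name, table_name, path):
    """Run in the consuming workspace; `spark` is the notebook's SparkSession."""
    # The database has to exist in the consuming workspace's own metastore.
    spark.sql("CREATE DATABASE IF NOT EXISTS {}".format(db_name))
    spark.sql(external_table_ddl(db_name, table_name, path))
```

In a notebook this would be called as register_external_delta_table(spark, "db_name", "table_name", "some_path_on_adls"), after which the table can be queried by name in that workspace.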

Related

Databricks - transfer data from one databricks workspace to another

How can I transform my data in databricks workspace 1 (DBW1) and then push it (send/save the table) to another databricks workspace (DBW2)?
On DBW1 I installed this JDBC driver.
Then I tried:
(df.write
    .format("jdbc")
    .options(
        url="jdbc:spark://<DBW2-url>:443/default;transportMode=http;ssl=1;httpPath=<http-path-of-cluster>;AuthMech=3;UID=<uid>;PWD=<pat>",
        driver="com.simba.spark.jdbc.Driver",
        dbtable="default.fromDBW1"
    )
    .save()
)
However, when I run it I get:
java.sql.SQLException: [Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.catalyst.parser.ParseException:
How to do this correctly?
Note: each DBW is in different subscription.
In my view, the more scalable approach is to write directly to ADLS instead of using JDBC. It needs to be done as follows:
You need a separate storage account for your data. In any case, storing the actual data in the DBFS root isn't recommended because it's not accessible from outside, which makes things such as migration more complicated.
You need a way to access that storage account (ADLS or Blob storage). You can access the data directly (via abfss:// or wasbs:// URLs).
In the target workspace, you just create a table over the data you wrote, a so-called unmanaged table. Just do (see doc):
create table <name>
using delta
location 'path_or_url_to_data'
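The direct-access URLs mentioned above follow a fixed pattern, with the container before the @ and the storage account after it. A small helper for building them might look like this; it is a sketch only, and the function name and its arguments are hypothetical.

```python
def storage_url(container, account, path="", scheme="abfss"):
    """Build a direct-access URL for ADLS Gen2 (abfss) or Blob storage (wasbs)."""
    endpoints = {
        "abfss": "dfs.core.windows.net",   # ADLS Gen2 endpoint
        "wasbs": "blob.core.windows.net",  # Blob storage endpoint
    }
    if scheme not in endpoints:
        raise ValueError("scheme must be 'abfss' or 'wasbs'")
    if not container or not account:
        raise ValueError("container and account are required")
    # The container comes before the @, the storage account after it.
    return "{}://{}@{}.{}/{}".format(
        scheme, container, account, endpoints[scheme], path.lstrip("/")
    )
```

The resulting string can be used both as the target of df.write.format("delta").save(...) in DBW1 and as the location in the create table statement in DBW2, provided both workspaces are able to authenticate to the storage account.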

Creating database in Azure databricks on External Blob Storage giving error

I have mounted my blob storage under dbfs:/mnt/ as /mnt/deltalake,
and the blob storage container is named deltalake.
The mount to DBFS uses an Azure Key Vault-backed secret scope.
When I try to create a database using CREATE DATABASE abc with location '/mnt/deltalake/databases/abc', it fails with an error saying the path does not exist.
However, when I used a DBFS path as storage, with CREATE DATABASE abc with location '/user/hive/warehouse/databases/abc', it always succeeded.
I wonder what is going wrong.
Suggestions, please.
Using a mount point, you should be able to access existing files or write new files through Databricks.
However, I believe the SQL commands, such as CREATE DATABASE, only work on the underlying Hive metastore.
You may need to create a database for your blob storage outside of Databricks, and then connect to that database to read from and write to it using Databricks.

Create External table in Azure databricks

I am new to Azure Databricks and am trying to create an external table pointing to an Azure Data Lake Storage (ADLS) Gen2 location.
From a Databricks notebook I have tried to set the Spark configuration for ADLS access. Still, I am unable to execute the DDL I created.
Note: One solution that works for me is mounting the ADLS account on the cluster and then using the mount location in the external table's DDL. But I need to check whether it is possible to create an external table DDL with an ADLS path, without a mount location.
# Using Principal credentials
spark.conf.set("dfs.azure.account.auth.type", "OAuth")
spark.conf.set("dfs.azure.account.oauth.provider.type", "ClientCredential")
spark.conf.set("dfs.azure.account.oauth2.client.id", "client_id")
spark.conf.set("dfs.azure.account.oauth2.client.secret", "client_secret")
spark.conf.set("dfs.azure.account.oauth2.client.endpoint",
"https://login.microsoftonline.com/tenant_id/oauth2/token")
DDL
create external table test(
id string,
name string
)
partitioned by (pt_batch_id bigint, pt_file_id integer)
STORED as parquet
location 'abfss://container@account_name.dfs.core.windows.net/dev/data/employee'
Error Received
Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.contracts.exceptions.ConfigurationPropertyNotFoundException Configuration property account_name.dfs.core.windows.net not found.);
Is it possible to refer to an ADLS location directly in the DDL?
Thanks.
Sort of, if you can use Python (or Scala).
Start by making the connection:
TenantID = "blah"

def connectLake():
    # Configure OAuth (service principal) access to the lake for the current Spark session
    spark.conf.set("fs.azure.account.auth.type", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.client.id", dbutils.secrets.get(scope = "LIQUIX", key = "lake-sp"))
    spark.conf.set("fs.azure.account.oauth2.client.secret", dbutils.secrets.get(scope = "LIQUIX", key = "lake-key"))
    spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/"+TenantID+"/oauth2/token")

connectLake()
# No trailing slash here: the table path appended later starts with "/"
lakePath = "abfss://liquix@mystorageaccount.dfs.core.windows.net"
Using Python you can register a table using:
spark.sql("CREATE TABLE DimDate USING PARQUET LOCATION '"+lakePath+"/PRESENTED/DIMDATE/V1'")
You can now query that table, provided you have executed the connectLake() function, which is fine in your current session/notebook.
The problem arises when a new session comes in: a select * from that table will fail unless that session runs connectLake() first. There is no way around that limitation, as you have to provide credentials to access the lake.
You may want to consider ADLS Gen2 credential passthrough: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/adls-passthrough.html
Note that this requires using a High Concurrency cluster.
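The ABFS driver also accepts the same OAuth settings scoped to a single storage account, with the account's host name appended to each key. The sketch below builds those keys for service principal (client credentials) authentication. The helper names and arguments are hypothetical, `spark` is assumed to be the notebook's SparkSession, and this is not claimed to resolve the error reported in the question.

```python
def oauth_conf_for_account(account, client_id, client_secret, tenant_id):
    """Return account-scoped ABFS OAuth settings for one storage account."""
    host = "{}.dfs.core.windows.net".format(account)
    return {
        "fs.azure.account.auth.type." + host: "OAuth",
        "fs.azure.account.oauth.provider.type." + host:
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id." + host: client_id,
        "fs.azure.account.oauth2.client.secret." + host: client_secret,
        "fs.azure.account.oauth2.client.endpoint." + host:
            "https://login.microsoftonline.com/{}/oauth2/token".format(tenant_id),
    }


def apply_conf(spark, conf):
    """Apply every setting to the current Spark session."""
    for key, value in conf.items():
        spark.conf.set(key, value)
```

As in connectLake() above, the client id and secret would normally come from dbutils.secrets.get(...) rather than from literals, and the settings still have to be applied in every new session.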

Hive table creation with adl credentials configurable

Team,
In a Hive create table statement I need to load an adl file. When I googled, I found that the Provider Type, Client Id and Client Credential need to be configured in core-site.xml. My requirement is to configure these credentials dynamically while creating a Hive table. The same is done when loading an s3 file into a Hive table:
create table employee(
id int,
name string
) location 's3a://<access_key>:<secret_key>@<my-bucket>/<s3_path>'
Similarly, can the same be achieved when creating a Hive table on an adl file path?
Thanks
no, for reason (2)
That secrets-in-URI feature has been removed from S3A because it ended up causing your critical AWS secrets to be logged everywhere: your application logs, error messages in exceptions, etc.
You'll need to come up with another solution, I'm afraid.
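To illustrate the leak described above: any code that logs or echoes a table location also echoes whatever credentials are embedded in it. The sketch below uses only the Python standard library to detect and strip inline credentials from a location string before it is logged or placed in DDL. The function names and the sample values are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def has_embedded_credentials(uri):
    """True if the URI carries a user or password part before the host."""
    parts = urlsplit(uri)
    return parts.username is not None or parts.password is not None


def strip_credentials(uri):
    """Return the URI with any user:password@ prefix removed from the host part."""
    parts = urlsplit(uri)
    # Everything after the last "@" in the network location is the host (and port).
    host = parts.netloc.rpartition("@")[2]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
```

A location such as "s3a://ACCESSKEY:SECRETKEY@my-bucket/data/employee" would be flagged by has_embedded_credentials, and strip_credentials would reduce it to the bucket and path only. The credentials themselves then have to be supplied through configuration instead.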

Hive external tables map to azure blob storage

Is there a way to create a Hive external table with its location pointing to Azure Storage? We actually want to connect SAP HANA (SDA) to blob storage, so it seems the only way is to first create an external Hive table that points to Azure blob storage and then use the Hive ODBC connector/Spark connector to connect it to SAP HANA. Does anyone have any idea how to achieve that?
You can create external tables in Hive or Spark on Azure. There are several options available:
Azure HDInsight
Azure Databricks (via Spark)
Hadoop distros supporting Azure Blob Storage (e.g. HDP)
External table creation would reference the data in the Blob storage account. See the following example for a Hive table created in HDInsight (wasb is used in the location):
CREATE EXTERNAL TABLE IF NOT EXISTS <database name>.<external textfile table name>
(
field1 string,
field2 int,
...
fieldN date
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>'
lines terminated by '<line separator>' STORED AS TEXTFILE
LOCATION 'wasb:///<directory in Azure blob>'
TBLPROPERTIES("skip.header.line.count"="1");
or in Azure Databricks:
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (name STRING, age INT)
COMMENT 'This table is created with existing data'
LOCATION 'wasbs://<containername>@<storage-account>.blob.core.windows.net/<directory>'
See also:
HDInsight Documentation
Azure Databricks Documentation
I don't know what SAP supports. ODBC access is possible with all of these solutions.
