Hive external tables map to azure blob storage - apache-spark

Is there a way to create a Hive external table with a location pointing to Azure Storage? We actually want to connect SAP HANA (SDA) to blob storage, so it seems the only way is to first create an external Hive table that points to Azure Blob Storage and then use the Hive ODBC connector or Spark connector to connect it to SAP HANA. Does anyone have any idea how to achieve that?

You can create external tables in Hive or Spark on Azure. There are several options available:
Azure HDInsight
Azure Databricks (via Spark)
Hadoop distros supporting Azure Blob Storage (e.g. HDP)
External table creation would reference the data in the Blob storage account. See the following example for a Hive table created in HDInsight (wasb is used in the location):
CREATE EXTERNAL TABLE IF NOT EXISTS <database name>.<external textfile table name>
(
field1 string,
field2 int,
...
fieldN date
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>'
lines terminated by '<line separator>' STORED AS TEXTFILE
LOCATION 'wasb:///<directory in Azure blob>'
TBLPROPERTIES("skip.header.line.count"="1");
or in Azure Databricks:
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (name STRING, age INT)
COMMENT 'This table is created with existing data'
LOCATION 'wasbs://<containername>@<storage-account>.blob.core.windows.net/<directory>'
See also:
HDInsight Documentation
Azure Databricks Documentation
I don't know what SAP supports. ODBC access is possible with all of these solutions.
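The order of the parts in a wasbs:// location is easy to get wrong (the container comes before the @, the storage account after it). As a minimal sketch, a small helper can assemble the location string; the function name and the sample values are made up for illustration and are not part of any Azure or Hive API.

```python
def wasbs_location(container: str, account: str, directory: str = "") -> str:
    """Build a wasbs:// URI of the form
    wasbs://<container>@<account>.blob.core.windows.net/<directory>.

    Hypothetical helper: it only formats a string and does not contact Azure.
    """
    if not container or not account:
        raise ValueError("container and account are required")
    # Strip leading/trailing slashes so the result has exactly one separator.
    return f"wasbs://{container}@{account}.blob.core.windows.net/{directory.strip('/')}"


# Example with made-up names; paste the result into the LOCATION clause.
print(wasbs_location("data", "myaccount", "/raw/sales/"))
```

The resulting string can be used as the LOCATION value in either of the CREATE EXTERNAL TABLE statements above.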

Related

How to create a table in databricks from an existing table on SQL

Can someone let me know how to create a table in Azure Databricks from a table that exists on Azure sql server? (assuming Databricks already has a jdbc connection to the sql server).
For example, the following will create a table if it doesn't exist from a location in my datalake.
CREATE TABLE IF NOT EXISTS newDB.MyTable USING delta LOCATION
'/mnt/dblake/BASE/Public/Adventureworks/delta/SalesLT.Product/'
I would like to do the same, but with the table existing on SQL Server.
Here is the basic solution for creation of an external table over an Azure SQL table
You can take the url (connection string) from the Azure Portal
create table if not exists mydb.mytable
using jdbc
options (url = 'jdbc:sqlserver://mysqlserver.database.windows.net:1433;database=mydb;user=myuser;password=mypassword;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;', dbtable = 'dbo.mytable')
Check the following links for additional options
https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
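If you prefer the DataFrame API over DDL, the same connection details can be passed as JDBC options. The sketch below only builds the options dictionary; the helper name is hypothetical, and it assumes the server is an Azure SQL logical server reachable on port 1433, as in the URL above. Passing user and password as separate options keeps them out of the URL string.

```python
def azure_sql_jdbc_options(server: str, database: str, table: str,
                           user: str, password: str) -> dict:
    """Return Spark JDBC options for an Azure SQL table.

    Hypothetical helper: `server` is the logical server name without the
    .database.windows.net suffix.
    """
    url = (
        f"jdbc:sqlserver://{server}.database.windows.net:1433;"
        f"database={database};encrypt=true;trustServerCertificate=false;"
        "hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
    )
    return {"url": url, "dbtable": table, "user": user, "password": password}


opts = azure_sql_jdbc_options("mysqlserver", "mydb", "dbo.mytable", "myuser", "mypassword")
# In a Databricks notebook, where `spark` is predefined, you could then run:
#   df = spark.read.format("jdbc").options(**opts).load()
```

Note that loading into a DataFrame and saving it as a table copies the data, whereas the `using jdbc` table above keeps reading from SQL Server on every query.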

Databricks - transfer data from one databricks workspace to another

How can I transform my data in databricks workspace 1 (DBW1) and then push it (send/save the table) to another databricks workspace (DBW2)?
On the DBW1 I installed this JDBC driver.
Then I tried:
(df.write
.format("jdbc")
.options(
url="jdbc:spark://<DBW2-url>:443/default;transportMode=http;ssl=1;httpPath=<http-path-of-cluster>;AuthMech=3;UID=<uid>;PWD=<pat>",
driver="com.simba.spark.jdbc.Driver",
dbtable="default.fromDBW1"
)
.save()
)
However, when I run it I get:
java.sql.SQLException: [Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.catalyst.parser.ParseException:
How to do this correctly?
Note: each DBW is in different subscription.
From my point of view, the more scalable way would be to write directly into ADLS instead of using JDBC. This needs to be done as follows:
You need to have a separate storage account for your data. In any case, using the DBFS root to store the actual data isn't recommended, as it's not accessible from outside the workspace, which makes things like migration more complicated.
You need to have a way to access that storage account (ADLS or Blob Storage). You can access the data directly (via abfss:// or wasbs:// URLs).
In the target workspace you just create a table over the data you have written, a so-called unmanaged table. Just do (see doc):
create table <name>
using delta
location 'path_or_url_to_data'
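The two halves of that approach can be sketched in Python as follows. This is an illustration under assumptions, not a tested migration script: the function names are hypothetical, `df` is the transformed DataFrame in DBW1, and both workspaces are assumed to already have access to the shared storage path.

```python
def save_as_delta(df, path: str) -> None:
    """Run in the source workspace (DBW1): write the DataFrame as Delta
    files at `path`. Overwrite mode is an assumption; pick what you need."""
    df.write.format("delta").mode("overwrite").save(path)


def register_table_sql(table: str, path: str) -> str:
    """Build the statement to run in the target workspace (DBW2), which
    registers an unmanaged table over the files already written."""
    return f"CREATE TABLE IF NOT EXISTS {table} USING delta LOCATION '{path}'"


# In DBW1:  save_as_delta(df, "<abfss-or-wasbs-url-of-the-table>")
# In DBW2:  spark.sql(register_table_sql("default.fromDBW1", "<same url>"))
```

Because only metadata is created in DBW2, no data travels through a JDBC connection between the workspaces.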

Create External table in Azure databricks

I am new to Azure Databricks and am trying to create an external table pointing to an Azure Data Lake Storage (ADLS) Gen2 location.
From a Databricks notebook I have tried to set the Spark configuration for ADLS access. Still, I am unable to execute the DDL I created.
Note: One solution that works for me is mounting the ADLS account to the cluster and then using the mount location in the external table's DDL. But I need to check whether it is possible to create an external table DDL with an ADLS path, without a mount location.
# Using Principal credentials
spark.conf.set("dfs.azure.account.auth.type", "OAuth")
spark.conf.set("dfs.azure.account.oauth.provider.type", "ClientCredential")
spark.conf.set("dfs.azure.account.oauth2.client.id", "client_id")
spark.conf.set("dfs.azure.account.oauth2.client.secret", "client_secret")
spark.conf.set("dfs.azure.account.oauth2.client.endpoint",
"https://login.microsoftonline.com/tenant_id/oauth2/token")
DDL
create external table test(
id string,
name string
)
partitioned by (pt_batch_id bigint, pt_file_id integer)
STORED as parquet
location 'abfss://container@account_name.dfs.core.windows.net/dev/data/employee'
Error Received
Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.contracts.exceptions.ConfigurationPropertyNotFoundException Configuration property account_name.dfs.core.windows.net not found.);
I need help in knowing whether it is possible to refer to an ADLS location directly in the DDL?
Thanks.
Sort of, if you can use Python (or Scala).
Start by making the connection:
TenantID = "blah"
def connectLake():
    spark.conf.set("fs.azure.account.auth.type", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.client.id", dbutils.secrets.get(scope = "LIQUIX", key = "lake-sp"))
    spark.conf.set("fs.azure.account.oauth2.client.secret", dbutils.secrets.get(scope = "LIQUIX", key = "lake-key"))
    spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/"+TenantID+"/oauth2/token")
connectLake()
lakePath = "abfss://liquix@mystorageaccount.dfs.core.windows.net/"
Using Python you can register a table using:
spark.sql("CREATE TABLE DimDate USING PARQUET LOCATION '"+lakePath+"/PRESENTED/DIMDATE/V1'")
You can now query that table if you have executed the connectLake() function, which is fine in your current session/notebook.
The problem is that if a new session comes in and tries to select * from that table, it will fail unless it runs the connectLake() function first. There is no way around that limitation, as you have to provide credentials to access the lake.
You may want to consider ADLS Gen2 credential passthrough: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/adls-passthrough.html
Note that this requires using a High Concurrency cluster.
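Since every new session has to apply the same settings, it can help to keep them in one place. The sketch below collects them in a dictionary and uses the account-scoped form of the keys (the same key names with the storage account's host name appended), which the Hadoop ABFS driver also accepts. The helper names and sample values are hypothetical; in practice the id and secret would come from dbutils.secrets.get as above.

```python
def adls_oauth_conf(account: str, client_id: str, client_secret: str,
                    tenant_id: str) -> dict:
    """Return account-scoped OAuth settings for ADLS Gen2 (abfss://)."""
    host = f"{account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def apply_conf(spark, conf: dict) -> None:
    """Apply the settings to the current Spark session."""
    for key, value in conf.items():
        spark.conf.set(key, value)


conf = adls_oauth_conf("mystorageaccount", "<client-id>", "<client-secret>", "<tenant-id>")
# In a notebook: apply_conf(spark, conf)
```

Scoping the keys to one account means settings for several storage accounts can coexist in the same session.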

Azure Data Lake Store as EXTERNAL TABLE in Databricks

How can I create an EXTERNAL TABLE in Azure Databricks which reads from Azure Data Lake Store? I am having trouble seeing in the documentation if it is even possible. I have a set of CSV files in a specific folder in Azure Data lake Store, and I want to do a CREATE EXTERNAL TABLE in Azure Databricks which points to the CSV files.
1. Reference mounted directories
You can mount the Azure Data Lake Store (ADLS) to Azure Databricks DBFS (requires 4.0 runtime or higher):
# Get Azure Data Lake Store credentials from the secret store
clientid = dbutils.secrets.get(scope = "adls", key = "clientid")
credential = dbutils.secrets.get(scope = "adls", key = "credential")
refreshurl = dbutils.secrets.get(scope = "adls", key = "refreshurl")
accounturl = dbutils.secrets.get(scope = "adls", key = "accounturl")
# Mount the ADLS
configs = {
    "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
    "dfs.adls.oauth2.client.id": clientid,
    "dfs.adls.oauth2.credential": credential,
    "dfs.adls.oauth2.refresh.url": refreshurl}
dbutils.fs.mount(
    source = accounturl,
    mount_point = "/mnt/adls",
    extra_configs = configs)
Table creation works the same way as with DBFS. Just reference the mount point with the directory in ADLS, e.g.:
%sql
CREATE TABLE product
USING CSV
OPTIONS (header "true", inferSchema "true")
LOCATION "/mnt/adls/productscsv/"
The location clause automatically implies EXTERNAL. See also Azure Databricks Documentation.
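Re-running a notebook that calls dbutils.fs.mount on an already mounted path raises an error, so a guard around the mount call can be useful. The following is a minimal sketch; ensure_mounted is a hypothetical helper, and it takes dbutils as a parameter only so that it can be read and tried outside a notebook.

```python
def ensure_mounted(dbutils, source: str, mount_point: str, configs: dict) -> bool:
    """Mount `source` at `mount_point` unless something is already mounted there.

    Returns True if a mount was created, False if the mount point already existed.
    """
    # dbutils.fs.mounts() lists the current mounts; each entry has a mountPoint.
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point in existing:
        return False
    dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=configs)
    return True


# In a notebook, with accounturl and configs defined as in the mount example:
#   ensure_mounted(dbutils, accounturl, "/mnt/adls", configs)
```

The guard only checks the mount point name; it does not verify that the existing mount points to the same source.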
2. Reference the Data Lake Store in the table definition directly
You can also reference the storage directly without mounting the storage. This scenario makes sense if the metadata or parts of the code are also used in other platforms. In this scenario access to the storage has to be defined on the cluster or notebook level (see this Databricks documentation for ADLS Gen1 or this documentation for Gen2 configuration details) or Azure AD Credential Passthrough is used.
The table definition would look like this for ADLS Gen1:
CREATE TABLE sampletable
(L_ORDERKEY BIGINT,
L_PARTKEY BIGINT,
L_SUPPKEY BIGINT,
L_SHIPMODE STRING,
L_COMMENT STRING)
USING csv
OPTIONS ('DELIMITER' '|')
LOCATION "adl://<your adls>.azuredatalakestore.net/directory1/sampletable"
;
For Azure Data Lake Gen2 the location reference looks like:
LOCATION "abfss://<file_system>@<account_name>.dfs.core.windows.net/directory/tablename"
You should consider looking at this link: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-datalake.html
Access Azure Data Lake Store using the Spark API
To read from your Data Lake Store account, you can configure Spark to use service credentials with the following snippet in your notebook:
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", "{YOUR SERVICE CLIENT ID}")
spark.conf.set("dfs.adls.oauth2.credential", "{YOUR SERVICE CREDENTIALS}")
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/{YOUR DIRECTORY ID}/oauth2/token")
It doesn't mention the use of External Table.

Create External Data Source with HDInsight

I'm trying to create an external data source with my HDInsight cluster. While doing so, I need to provide the location as the Hadoop Name Node IP address and port number.
So, where can I find the Name Node IP address and the Resource Manager location (IP address and port number) on an HDInsight cluster?
I already browsed through core-site.xml and yarn-site.xml and found nothing for HDInsight.
--- 3: syntax for Creating an external data source.
CREATE EXTERNAL DATA SOURCE MyHadoopCluster WITH (
TYPE = HADOOP,
LOCATION ='hdfs://10.xxx.xx.xxx:xxxx',
RESOURCE_MANAGER_LOCATION = '10.xxx.xx.xxx:xxxx',
CREDENTIAL = HadoopUser1
);
-- LOCATION (Required) : Hadoop Name Node IP address and port.
-- RESOURCE MANAGER LOCATION (Optional): Hadoop Resource Manager location to enable pushdown computation.
-- CREDENTIAL (Optional): the database scoped credential, created above.
Thanks.
If I understand your question correctly, you already have an HDInsight cluster and are trying to get Azure SQL DW to talk to it via an external table. If you search the Syntax section of the documentation for CREATE EXTERNAL DATA SOURCE for "Azure SQL Data Warehouse", you will see that the only way PolyBase in Azure SQL DW works at the moment is by talking to Azure Blob Storage and Azure Data Lake Store. (Stay tuned to that documentation page, as PolyBase in Azure SQL DW will get more flexible over time as they continue to enhance it.)
So for now you should have HDInsight write to an external table defined in Hive and then have Azure SQL DW point at the same folder in blob storage and declare its own external table that reads those blobs.
