Databricks Delta - Error: Overlapping auth mechanisms using deltaTable.detail() - databricks

In Azure Databricks, I have a Unity Catalog metastore created on ADLS in its own container (metastore@stgacct.dfs.core.windows.net/) connected w/ the Azure identity. Works fine.
I have a container on the same storage account called data. I'm using Notebook-scoped creds to gain access to that container. Using abfss://data@stgacct... Works fine.
Using the python Delta API, I'm creating an object for my DeltaTable using: deltaTable = DeltaTable.forName(spark, "mycat.myschema.mytable"). I'm able to perform normal Delta functions using that object like MERGE. Works fine.
However, if I attempt to run the deltaTable.detail() command, I get the error: "Your query is attempting to access overlapping paths through multiple authorization mechanisms, which is not currently supported."
It's as if Spark doesn't know which credential to use to fulfill the .detail() command; the metastore identity or the SPN I used when I scoped my creds for the data container - which also has rights to the metastore container.
To test: If I restart my cluster, which drops the spark conf for ADLS, and I attempt to run the command deltaTable = DeltaTable.forName(spark, "mycat.myschema.mytable") and then deltaTable.detail(), I get the error "Failure to initialize configurationInvalid configuration value detected for fs.azure.account.key" - as if it's not using the metastore credentials which I would have expected since it's a unity/managed table (??).
Suggestions?
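One way to reason about the "overlapping paths" message is to compare where each credential points. The sketch below is only a hypothetical helper (storage_target and same_account are illustration names, not Databricks APIs). It parses the two container roots from the question, assuming the usual abfss://container@account form, and checks whether they sit on the same storage account. It does not reproduce how Databricks itself detects an overlap.

```python
from urllib.parse import urlsplit


def storage_target(uri):
    """Split an abfs(s):// URI into (account_host, container, path)."""
    parts = urlsplit(uri)
    if parts.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an ABFS URI: {uri}")
    # In abfss://container@account.dfs.core.windows.net/path the container
    # sits in the "user" slot of the URI and the account host in the "host" slot.
    return parts.hostname, parts.username, parts.path.rstrip("/")


def same_account(uri_a, uri_b):
    """True when both URIs point at the same storage account host."""
    return storage_target(uri_a)[0] == storage_target(uri_b)[0]


# Container roots taken from the question; the account name is the asker's placeholder.
metastore_root = "abfss://metastore@stgacct.dfs.core.windows.net/"
data_root = "abfss://data@stgacct.dfs.core.windows.net/"

print(storage_target(metastore_root))
print(storage_target(data_root))
print(same_account(metastore_root, data_root))
```

If both roots resolve to the same account host, a credential configured at the storage-account level would also apply to the metastore container, which is consistent with the asker's note that the SPN has rights to both.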

Related

Databricks - transfer data from one databricks workspace to another

How can I transform my data in databricks workspace 1 (DBW1) and then push it (send/save the table) to another databricks workspace (DBW2)?
On the DBW1 I installed this JDBC driver.
Then I tried:
(df.write
    .format("jdbc")
    .options(
        url="jdbc:spark://<DBW2-url>:443/default;transportMode=http;ssl=1;httpPath=<http-path-of-cluster>;AuthMech=3;UID=<uid>;PWD=<pat>",
        driver="com.simba.spark.jdbc.Driver",
        dbtable="default.fromDBW1"
    )
    .save()
)
However, when I run it I get:
java.sql.SQLException: [Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.catalyst.parser.ParseException:
How to do this correctly?
Note: each DBW is in different subscription.
From my point of view, the more scalable way would be to write directly into ADLS instead of using JDBC. But this needs to be done as follows:
You need to have a separate storage account for your data. In any case, using the DBFS Root to store the actual data isn't recommended because it's not accessible from outside, which makes things like migration more complicated.
You need to have a way to access that storage account (ADLS or Blob storage). You can access the data directly (via abfss:// or wasbs:// URLs).
In the target workspace you just create a table over the data you wrote, a so-called unmanaged table. Just do (see doc):
create table <name>
using delta
location 'path_or_url_to data'
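The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names, the container "shared", the account "mystorageaccount" and the folder "exports/fromDBW1" are all hypothetical, and it assumes both workspaces already have credentials for the storage account.

```python
def abfss_path(container, account, folder):
    """Build an abfss:// URL for a folder in an ADLS Gen2 container."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{folder.strip('/')}"


def create_table_sql(table, location):
    """DDL for an unmanaged Delta table over files that already exist."""
    return f"CREATE TABLE {table} USING DELTA LOCATION '{location}'"


def export_table(df, location):
    """Run in DBW1: write the DataFrame as Delta files to the shared storage."""
    df.write.format("delta").mode("overwrite").save(location)


def register_table(spark, table, location):
    """Run in DBW2: register a table that points at the files written by DBW1."""
    spark.sql(create_table_sql(table, location))


# Hypothetical names, for illustration only.
location = abfss_path("shared", "mystorageaccount", "exports/fromDBW1")
print(create_table_sql("default.fromDBW1", location))
```

In a notebook you would call export_table(df, location) in DBW1 and register_table(spark, "default.fromDBW1", location) in DBW2.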

Permission denied while inserting data from Azure Databricks to Synapse in production environment

We have a scenario in our project where we insert data from Databricks dataframes into Azure Synapse. We could do this without issues in the Dev environment with admin access, but we could not run it in higher environments, where we are only provided INSERT permission on the schema.
The error message I get…
Py4JJavaError: An error occurred while calling o2445.save. :
com.databricks.spark.sqldw.SqlDWSideException: SQL DW failed to
execute the JDBC query produced by the connector. Underlying
SQLException(s): - com.microsoft.sqlserver.jdbc.SQLServerException:
User does not have permission to perform this action. [ErrorCode =
15247] [SQLState = S0001]
Assuming you took this approach, you will need CONTROL on the database (db_owner) in Synapse, because Databricks currently requires it to run CREATE DATABASE SCOPED CREDENTIAL.
Though this feedback item is related to Azure Data Factory, if it were completed then more granular permissions could be used. So please vote and see my comment.

Create External table in Azure databricks

I am new to Azure Databricks and trying to create an external table pointing to an Azure Data Lake Storage (ADLS) Gen-2 location.
From a Databricks notebook I have tried to set the Spark configuration for ADLS access. Still, I am unable to execute the DDL I created.
Note: One solution that works for me is mounting the ADLS account to the cluster and then using the mount location in the external table's DDL. But I need to check whether it is possible to create an external table DDL with an ADLS path, without a mount location.
# Using Principal credentials
spark.conf.set("dfs.azure.account.auth.type", "OAuth")
spark.conf.set("dfs.azure.account.oauth.provider.type", "ClientCredential")
spark.conf.set("dfs.azure.account.oauth2.client.id", "client_id")
spark.conf.set("dfs.azure.account.oauth2.client.secret", "client_secret")
spark.conf.set("dfs.azure.account.oauth2.client.endpoint",
"https://login.microsoftonline.com/tenant_id/oauth2/token")
DDL
create external table test(
id string,
name string
)
partitioned by (pt_batch_id bigint, pt_file_id integer)
STORED as parquet
location 'abfss://container@account_name.dfs.core.windows.net/dev/data/employee'
Error Received
Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.contracts.exceptions.ConfigurationPropertyNotFoundException Configuration property account_name.dfs.core.windows.net not found.);
I need help in knowing if this is possible to refer to ADLS location directly in DDL?
Thanks.
Sort of, if you can use Python (or Scala).
Start by making the connection:
TenantID = "blah"
def connectLake():
    spark.conf.set("fs.azure.account.auth.type", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.client.id", dbutils.secrets.get(scope = "LIQUIX", key = "lake-sp"))
    spark.conf.set("fs.azure.account.oauth2.client.secret", dbutils.secrets.get(scope = "LIQUIX", key = "lake-key"))
    spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/"+TenantID+"/oauth2/token")
connectLake()
lakePath = "abfss://liquix@mystorageaccount.dfs.core.windows.net/"
Using Python you can register a table using:
spark.sql("CREATE TABLE DimDate USING PARQUET LOCATION '"+lakePath+"/PRESENTED/DIMDATE/V1'")
You can now query that table if you have executed the connectLake() function - which is fine in your current session/notebook.
The problem now is that if a new session comes in and tries select * from that table, it will fail unless connectLake() is run first. There is no way around that limitation, as you have to provide credentials to access the lake.
You may want to consider ADLS Gen2 credential pass through: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/adls-passthrough.html
Note that this requires using a High Concurrency cluster.
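A variation on the connectLake() idea is to scope the same OAuth settings to a single storage account by appending the account host to each key, which is the per-account form of the ABFS configuration properties. The sketch below is an illustration under stated assumptions: oauth_conf and connect_lake are hypothetical helper names, and the account, client and tenant values are placeholders you would normally read from a secret scope.

```python
def oauth_conf(account, client_id, client_secret, tenant_id):
    """Return account-scoped ABFS OAuth settings as a plain dict."""
    host = f"{account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def connect_lake(spark, conf):
    """Apply every setting to the current Spark session."""
    for key, value in conf.items():
        spark.conf.set(key, value)


# Placeholder values for illustration; in a notebook the id and secret would
# come from dbutils.secrets.get(...) as in the answer above.
conf = oauth_conf("mystorageaccount", "client_id", "client_secret", "tenant_id")
for key in conf:
    print(key)
```

Calling connect_lake(spark, conf) in a notebook then has the same session-only effect as connectLake(), so the limitation described above still applies.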

Error running Spark on Databricks: constructor public XXX is not whitelisted

I was using Azure Databricks and trying to run some example python code from this page.
But I get this exception:
py4j.security.Py4JSecurityException: Constructor public org.apache.spark.ml.classification.LogisticRegression(java.lang.String) is not whitelisted.
This error shows up with some library methods when using a High Concurrency cluster with credential passthrough enabled. If that is your scenario, a workaround that may be an option is to use a different cluster mode.
py4j.security.Py4JSecurityException: ... is not whitelisted
This exception is thrown when you have accessed a method that Azure Databricks has not explicitly marked as safe for Azure Data Lake Storage credential passthrough clusters. In most cases, this means that the method could allow a user on an Azure Data Lake Storage credential passthrough cluster to access another user’s credentials.
Reference: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/adls-passthrough.html

How to create/access Hive tables with external Metastore on additional Azure Blob Storage?

I want to perform some data transformation in Hive with Azure Data Factory (v1) running an Azure HDInsight On Demand cluster (3.6).
Since the HDInsight On Demand cluster gets destroyed after some idle time and I want/need to keep the metadata about the Hive tables (e.g. partitions), I also configured an external Hive metastore, using an Azure SQL Server database.
Now I want to store all production data on a storage account separate from the "default" account, where Data Factory and HDInsight also create containers for logging and other runtime data.
So I have the following resources:
Data Factory with HDInsight On Demand (as a linked service)
SQL Server and database for Hive metastore (configured in HDInsight On Demand)
Default storage account to be used by Data Factory and HDInsight On Demand cluster (blob storage, general purpose v1)
Additional storage account for data ingress and Hive tables (blob storage, general purpose v1)
Except the Data Factory, which is in location North Europe, all resources are in the same location West Europe, which should be fine (the HDInsight cluster must be in the same location as any storage accounts you want to use). All Data Factory related deployment is done using the DataFactoryManagementClient API.
An example Hive script (deployed as a HiveActivity in Data Factory) looks like this:
CREATE TABLE IF NOT EXISTS example_table (
    deviceId string,
    createdAt timestamp,
    batteryVoltage double,
    hardwareVersion string,
    softwareVersion string
)
PARTITIONED BY (year string, month string) -- year and month from createdAt
CLUSTERED BY (deviceId) INTO 256 BUCKETS
STORED AS ORC
LOCATION 'wasb://container@additionalstorage.blob.core.windows.net/example_table'
TBLPROPERTIES ('transactional'='true');
INSERT INTO TABLE example_table PARTITION (year, month) VALUES ("device1", timestamp "2018-01-22 08:57:00", 2.7, "hw1.32.2", "sw0.12.3");
Following the documentation here and here, this should be rather straightforward: Simply add the new storage account as an additional linked service (using the additionalLinkedServiceNames property).
However, this resulted in the following exceptions when a Hive script tried to access a table stored on this account:
IllegalStateException Error getting FileSystem for wasb : org.apache.hadoop.fs.azure.AzureException: org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException exitCode=2: Error reading S/MIME message
139827842123416:error:0D06B08E:asn1 encoding routines:ASN1_D2I_READ_BIO:not enough data:a_d2i_fp.c:247:
139827842123416:error:0D0D106E:asn1 encoding routines:B64_READ_ASN1:decode error:asn_mime.c:192:
139827842123416:error:0D0D40CB:asn1 encoding routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
Some googling told me that this happens when the key provider is not configured correctly (i.e. the exception is thrown because it tries to decrypt the key even though it is not encrypted). After manually setting fs.azure.account.keyprovider.<storage_name>.blob.core.windows.net to org.apache.hadoop.fs.azure.SimpleKeyProvider it seemed to work for reading and "simple" writing of data to tables, but failed again when the metastore got involved (creating a table, adding new partitions, ...):
ERROR exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: org.apache.hadoop.fs.azure.AzureException com.microsoft.azure.storage.StorageException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:783)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4434)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:316)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
[...]
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Caused by: MetaException(message:Got exception: org.apache.hadoop.fs.azure.AzureException com.microsoft.azure.storage.StorageException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result$create_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:38593)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result$create_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:38561)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result.read(ThriftHiveMetastore.java:38487)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_table_with_environment_context(ThriftHiveMetastore.java:1103)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_table_with_environment_context(ThriftHiveMetastore.java:1089)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.create_table_with_environment_context(HiveMetaStoreClient.java:2203)
at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.create_table_with_environment_context(SessionHiveMetaStoreClient.java:99)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:736)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:724)
[...]
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:178)
at com.sun.proxy.$Proxy5.createTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:777)
... 24 more
I tried googling that again, but had no luck finding anything usable. I think it may have something to do with the fact that the metastore service runs separately from Hive and for some reason does not have access to the configured storage account keys... but to be honest, I think this should all just work without manually tinkering with the Hadoop/Hive configuration.
So, my question is: What am I doing wrong and how is this supposed to work?
You need to make sure you also add the hadoop-azure.jar and the azure-storage-5.4.0.jar to your Hadoop Classpath export in your hadoop-env.sh.
export HADOOP_CLASSPATH=/usr/lib/hadoop-client/hadoop-azure.jar:/usr/lib/hadoop-client/lib/azure-storage-5.4.0.jar:$HADOOP_CLASSPATH
And you will need to add the storage key via the following parameter in your core-site.
fs.azure.account.key.{storageaccount}.blob.core.windows.net
When you create your DB and table you need to specify the location using your storage account and the user id
Create table {Tablename}
...
LOCATION 'wasbs://{container}@{storageaccount}.blob.core.windows.net/{filepath}'
If you still have problems after trying the above, check whether the storage account is V1 or V2. We had an issue where a V2 storage account did not work with our version of HDP.
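To make the naming in this answer concrete, here is a small Python sketch that builds the core-site property entry and the matching wasbs:// location from the same account name. The helper names are hypothetical, and "additionalstorage", "container" and the key value are placeholders. It only generates text; it does not touch a cluster.

```python
import xml.etree.ElementTree as ET


def account_key_property(account, key):
    """Render a Hadoop <property> element holding the storage account key."""
    prop = ET.Element("property")
    name = f"fs.azure.account.key.{account}.blob.core.windows.net"
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = key
    return ET.tostring(prop, encoding="unicode")


def wasbs_location(container, account, filepath):
    """Build the LOCATION value for a table stored on that account."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{filepath.lstrip('/')}"


# Placeholder values for illustration only.
print(account_key_property("additionalstorage", "<storage-key>"))
print(wasbs_location("container", "additionalstorage", "example_table"))
```

Generating both strings from one account name keeps the property name and the LOCATION host in step, since a mismatch between them is an easy mistake to make.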
