Does the PD component have any authentication method to limit the databases that Spark can access? - tidb

Does PD have any authentication method to limit the databases Spark can access? If someone knows my PD address, they can use TiSpark to connect to TiDB and query my databases.
I have set up a TiDB cluster, added a new user, connected to TiDB as that user, and created a new database.
When I use Spark to connect to TiDB through PD and run "show databases", it returns all of my databases, including the new database I created with the new user.
My spark session is:
val _spark = SparkSession.builder()
  .master("spark://127.0.0.1:7077") // local[*]
  .config("spark.tispark.pd.addresses", "127.0.0.1:2379")
  .config("spark.sql.extensions", "org.apache.spark.sql.TiExtensions")
  .appName("SparkApp")
  .getOrCreate()
I am worried that if someone knows where my PDs are, they can get into my databases. I have read the TiDB documentation carefully, but it does not mention this anywhere.

PD does have one: TLS authentication. The following link describes how to enable TLS authentication between TiDB cluster components: https://github.com/pingcap/docs/blob/df2a250b463079a35143ef913198732d4c6be5dd/v2.1/how-to/secure/enable-tls-between-components.md
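For reference, a minimal sketch of what this looks like on the PD side, assuming the certificate paths below are placeholders for certificates you generate yourself (TiKV and TiDB have corresponding options under their own [security] sections; see the linked guide for the exact names). Once certificates are configured, connections that do not present a certificate signed by your CA are rejected, so knowing the PD address alone is no longer enough:
# pd.toml - sketch only, paths are placeholders
[security]
cacert-path = "/path/to/ca.pem"
cert-path = "/path/to/pd-server.pem"
key-path = "/path/to/pd-server-key.pem"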

Related

Databricks Delta - Error: Overlapping auth mechanisms using deltaTable.detail()

In Azure Databricks, I have a Unity Catalog metastore created on ADLS in its own container (metastore@stgacct.dfs.core.windows.net/), connected with the Azure identity. Works fine.
I have a container on the same storage account called data. I'm using notebook-scoped creds to gain access to that container, using abfss://data@stgacct... Works fine.
Using the python Delta API, I'm creating an object for my DeltaTable using: deltaTable = DeltaTable.forName(spark, "mycat.myschema.mytable"). I'm able to perform normal Delta functions using that object like MERGE. Works fine.
However, if I attempt to run the deltaTable.detail() command, I get the error: "Your query is attempting to access overlapping paths through multiple authorization mechanisms, which is not currently supported."
It's as if Spark doesn't know which credential to use to fulfill the .detail() command: the metastore identity, or the SPN I used when I scoped my creds for the data container (which also has rights to the metastore container).
To test: if I restart my cluster, which drops the Spark conf for ADLS, and I run deltaTable = DeltaTable.forName(spark, "mycat.myschema.mytable") and then deltaTable.detail(), I get the error "Failure to initialize configuration: Invalid configuration value detected for fs.azure.account.key" - as if it's not using the metastore credentials, which I would have expected it to since it's a Unity/managed table (??).
Suggestions?

AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] when using Hive warehouse

We recently enabled Kerberos authentication on our Spark cluster, but we found that when we submit Spark jobs in cluster mode, the code cannot connect to Hive.
Should we be using Kerberos to authenticate to Hive, and if yes, how? As detailed below, I think we have to specify a keytab and principal, but I don't know exactly what.
This is the exception we get:
Traceback (most recent call last):
File "/mnt/resource/hadoop/yarn/local/usercache/sa-etl/appcache/application_1649255698304_0003/container_e01_1649255698304_0003_01_000001/__pyfiles__/utils.py", line 222, in use_db
spark.sql("CREATE DATABASE IF NOT EXISTS `{db}`".format(db=db))
File "/usr/hdp/current/spark3-client/python/pyspark/sql/session.py", line 723, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/hdp/current/spark3-client/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/usr/hdp/current/spark3-client/python/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: java.lang.RuntimeException: java.io.IOException: DestHost:destPort hn1-pt-dev.MYREALM:8020 , LocalHost:localPort wn1-pt-dev/10.208.3.12:0. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
Additionally, I saw this exception:
org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS], while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over hn0-pt-dev.myrealm/10.208.3.15:8020
This is the script that produces the exception, which, as you can see, happens on the CREATE DATABASE:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').enableHiveSupport().getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS TestDb")
Environment and relevant information
We have an ESP enabled HDInsight Cluster in Azure, it is inside a virtual network. AADDS works fine for logging into the cluster. The cluster is connected to a Storage Account, communicating to it with ABFS and storing the Hive warehouse on there. We are using Yarn.
We want to execute Spark jobs using PySpark from the Azure Data Factory, which uses Livy, but if we can get it to work with spark-submit cli it will hopefully also work with Livy.
We are using Spark 3.1.1 and Kerberos 1.10.3-30.
The exception only occurs when we use spark-submit --deploy-mode cluster, when using client mode there is no exception and the database is created.
When we remove the .enableHiveSupport the exception also disappears, so it apparently has something to do with the authentication to Hive.
We do need the Hive warehouse though, because we need to access tables from within multiple Spark sessions so they need to be persisted.
We can access HDFS, also in cluster mode, as sc.textFile('/example/data/fruits.txt').collect() works fine.
Similar questions and possible solutions
In the exception, I see that it is the worker node which tries to access the head node. The port is 8020, which I think is the NameNode port, so this does indeed sound HDFS-related - except that, to my understanding, we can access HDFS but not Hive.
https://spark.apache.org/docs/latest/running-on-yarn.html#kerberos suggests specifying the principal and keytab file explicitly, so I found the keytab file with klist -k and added --principal myusername@MYREALM --keytab /etc/krb5.keytab to the spark-submit command line (the same keytab file as in one of the linked questions below). However, I got
Exception in thread "main" org.apache.hadoop.security.KerberosAuthException: failure to login: for principal: myusername@MYREALM from keytab /etc/krb5.keytab javax.security.auth.login.LoginException: Unable to obtain password from user
Maybe I have the wrong keytab file though, because when I run klist -k /etc/krb5.keytab I only see slots with entries like HN0-PT-DEV@MYREALM and host/hn0-pt-dev.myrealm@MYREALM.
If I look in the keytabs for hdfs/hive in /etc/security/keytabs I also see only entries for hdfs/hive users.
When I try adding all the extraJavaOptions specified in How to use Apache Spark to query Hive table with Kerberos? but don't specify principal/keytab, I get KrbException: Cannot locate default realm even though the default realm in /etc/krb5.conf is correct.
In Ambari, I can see the settings spark.yarn.keytab={{hive_kerberos_keytab}} and spark.yarn.principal={{hive_kerberos_principal}}.
https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-faq#how-do-i-create-a-keytab-for-an-hdinsight-esp-cluster- I created a keytab for my user and specified that file instead, but that didn't help.
It appears that many other answers/websites also suggest specifying the principal/keytab explicitly:
Spark on YARN + Secured hbase For HBase instead of Hive, but same conclusion.
https://www.ibm.com/docs/en/spectrum-conductor/2.4.1?topic=ssbaig-submitting-spark-batch-applications-kerberos-enabled-hdfs-keytab
Issue with Spark Java API, Kerberos, and Hive
spark-submit failing to connect to metastore due to Kerberos : Caused by GSSException: No valid credentials provided . but works in local-client mode
https://docs.cloudera.com/documentation/enterprise/5-7-x/topics/sg_spark_auth.html#concept_bvc_pcy_dt (I couldn't find similar documentation from Microsoft)
spark-submit,Client cannot authenticate via:[TOKEN, KERBEROS];
Other questions:
https://spark.apache.org/docs/2.1.1/running-on-yarn.html#running-in-a-secure-cluster To start with the official documentation: it explains that
For a Spark application to interact with HDFS, HBase and Hive, it must acquire the relevant tokens using the Kerberos credentials of the user launching the application —that is, the principal whose identity will become that of the launched Spark application. This is normally done at launch time: in a secure cluster Spark will automatically obtain a token for the cluster’s HDFS filesystem, and potentially for HBase and Hive.
Well, the user launching the application has a valid ticket, as can be seen in the output of klist. The user has contributor access to the blob storage (not sure if that is actually needed). I don't understand what is meant by "Spark will automatically obtain a token for Hive [at launch time]" though. I did restart all services on the cluster, but that didn't help.
Kerberos authentication with Hadoop cluster from Spark stand alone cluster running on Kubernetes cluster This is a situation with two clusters. As explained here:
in yarn-cluster mode, the Spark client uses the local Kerberos ticket to connect to Hadoop services and retrieve special auth tokens that are then shipped to the YARN container running the driver; then the driver broadcasts the token to the executors
When running Spark on Kubernetes to access kerberized Hadoop cluster, how do you resolve a "SIMPLE authentication is not enabled" error on executors? For older Spark version.
Cannot connect to HIVE with Secured kerberos. I am using UserGroupInformation.loginUserFromKeytab() Something about JAAS
Spark-submit job fails on yarn nodemanager with error Client cannot authenticate via:[TOKEN, KERBEROS] No answer
Client cannot authenticate via: [TOKEN, KERBEROS) Not making sense to me.
Hive is not accessible via Spark In Kerberos Environment : Client cannot authenticate via:[TOKEN, KERBEROS] Added spark.security.credentials.hadoopfs.enabled=true
https://funclojure.tumblr.com/post/155129283948/hdfs-kerberos-java-client-api-pains about jars
org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] Issue No answer
https://issues.apache.org/jira/browse/SPARK-27554 No answer
java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] old
Possible things to try:
https://spark.apache.org/docs/2.1.1/running-on-yarn.html#troubleshooting-kerberos Enable more in-detail logging.
https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-linux-ambari-ssh-tunnel Viewing the Namenode UI might give some information
Updates
When logged in as Hive user:
kinit then supply hive password:
Password for hive/hn0-pt-dev.myrealm@MYREALM:
kinit: Password incorrect while getting initial credentials
hive@hn0-pt-dev:/tmp$ klist -k /etc/security/keytabs/hive.service.keytab
Keytab name: FILE:/etc/security/keytabs/hive.service.keytab
KVNO Principal
---- --------------------------------------------------------------------------
0 hive/hn0-pt-dev.myrealm@MYREALM
0 hive/hn0-pt-dev.myrealm@MYREALM
0 hive/hn0-pt-dev.myrealm@MYREALM
0 hive/hn0-pt-dev.myrealm@MYREALM
0 hive/hn0-pt-dev.myrealm@MYREALM
hive@hn0-pt-dev:/tmp$ kinit -k /etc/security/keytabs/hive.service.keytab
kinit: Client '/etc/security/keytabs/hive.service.keytab@MYREALM' not found in Kerberos database while getting initial credentials
In general, you have to either complete a kinit successfully or pass a principal/keytab to be able to use Kerberos with Spark/Hive. There are some settings that complicate the use of Hive (impersonation).
Generally speaking, if you can kinit and use HDFS to write to your own folder, your keytab is working:
kinit # enter user info
hdfs dfs -touch /home/myuser/somefile # guarantees you have a home directory... spark needs this
Once you know that is working, you should check whether you can write to Hive.
Either use a JDBC connection or use beeline with a connection string like the one below:
jdbc:hive2://HiveHost:10001/default;principal=myuser@HOST1.COM;
This helps to find where the issue is.
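For reference, a minimal Scala sketch of the JDBC route, assuming the Hive JDBC driver (org.apache.hive:hive-jdbc) is on the classpath and a Kerberos ticket has already been obtained with kinit. The host, port, and principal are placeholders copied from the connection string above; in many setups the principal in the URL is the HiveServer2 service principal (e.g. hive/_HOST@REALM) rather than your own user:
import java.sql.DriverManager

// Register the Hive JDBC driver and connect to HiveServer2 using the
// Kerberos ticket from the current login context (no password in the URL).
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection(
  "jdbc:hive2://HiveHost:10001/default;principal=myuser@HOST1.COM")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SHOW DATABASES")
while (rs.next()) println(rs.getString(1))
conn.close()
If this succeeds from the head node but the same query fails from inside a cluster-mode Spark job, that points at token delegation in cluster mode rather than at your credentials.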
If you are looking at an issue with Hive, you need to check impersonation:
HiveServer2 Impersonation Important: This is not the recommended method to implement HiveServer2 authorization. Cloudera recommends you use Sentry to implement this instead. HiveServer2 impersonation lets users execute queries and access HDFS files as the connected user rather than as the super user. Access policies are applied at the file level using the HDFS permissions specified in ACLs (access control lists). Enabling HiveServer2 impersonation bypasses Sentry from the end-to-end authorization process. Specifically, although Sentry enforces access control policies on tables and views within the Hive warehouse, it does not control access to the HDFS files that underlie the tables. This means that users without Sentry permissions to tables in the warehouse may nonetheless be able to bypass Sentry authorization checks and execute jobs and queries against tables in the warehouse as long as they have permissions on the HDFS files supporting the table.
If you are on Windows, you should watch out for the ticket cache. You should consider setting up your own personal ticket cache location, because Windows typically uses one generic location for all users. (This allows users to log in over top of each other, creating weird errors.)
If you are having Hive issues, the Hive logs themselves often help you to understand why the process isn't working. (But you will only have a log if some of the Kerberos handshake was successful; if it was completely unsuccessful you won't see anything.)
Check Ranger and see if there are any Errors.
If you want to use cluster mode and access the Hive warehouse, you need to specify a keytab and principal to spark-submit (this is clear in the official docs):
Using a Keytab: By providing Spark with a principal and keytab (e.g. using spark-submit with --principal and --keytab parameters), the application will maintain a valid Kerberos login that can be used to retrieve delegation tokens indefinitely.
Note that when using a keytab in cluster mode, it will be copied over to the machine running the Spark driver. In the case of YARN, this means using HDFS as a staging area for the keytab, so it’s strongly recommended that both YARN and HDFS be secured with encryption, at least.
You need to create your own keytab.
After creating the keytab, make sure that the right user has permissions for it, otherwise you'll just get Unable to obtain password from user again.
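For concreteness, a cluster-mode submission with an explicit keytab would then look roughly like this; the principal, keytab path, and script name are placeholders for the user keytab you created (not the service keytabs under /etc/security/keytabs):
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal myusername@MYREALM \
  --keytab /home/myusername/myusername.keytab \
  my_job.py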
If you are using Livy, --proxy-user will conflict with --principal, but that's easy to fix (use livy.impersonation.enabled=false).

Could not find Linked Service linked_service; the linked service does not exist or is not published

I am trying to connect to a storage account in a Scala notebook via Synapse. I am following instructions as outlined from this documentation: https://www.drware.com/using-msi-to-authenticate-on-a-synapse-spark-notebook-while-querying-the-storage-2/
My code looks like this:
val sc = spark.sparkContext
spark.conf.set("spark.storage.synapse.linkedServiceName", linked_service)
spark.conf.set("fs.azure.account.oauth.provider.type", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider")
val config_load_path = s"abfss://$storage_container_name@$storage_account_name.dfs.core.windows.net/"
val df_config = mssparkutils.fs.ls(config_load_path)
But for some reason, the error that I keep getting is:
"Could not find Linked Service linked_service; the linked service does
not exist or is not published."
What am I doing wrong? I am able to connect to this linked service if I switch to pySpark and use spark.read, so it's not as if I set up the linked service incorrectly.
It looks like my Azure auth tokens had expired, so the notebook could not access the storage account. In that case, refresh the page or reopen your browser entirely.

Azure Databricks externalize metastore - MSFT Script not running

I am trying to set up Azure Databricks with an external Hive metastore on Azure SQL.
While doing the setup, I created the Azure SQL database. Now I have to run an MSFT-provided SQL script that contains the table and index creation statements.
When I ran it, it was able to create the new tables but failed on index creation. I have full access on the database, so maybe some grant is missing. Also, why do MSFT or Databricks have such a lengthy process?
Or is there a better way to externalize the metadata? Please help.
To set up an external metastore using the Azure Databricks UI, check out Set up an external metastore using the UI:
Click the Clusters button on the sidebar.
Click Create Cluster.
Enter the following Spark configuration options (a filled-in sketch follows after these steps):
# Hive-specific configuration options.
# spark.hadoop prefix is added to make sure these Hive specific options propagate to the metastore client.
# JDBC connect string for a JDBC metastore
spark.hadoop.javax.jdo.option.ConnectionURL <mssql-connection-string>
# Username to use against metastore database
spark.hadoop.javax.jdo.option.ConnectionUserName <mssql-username>
# Password to use against metastore database
spark.hadoop.javax.jdo.option.ConnectionPassword <mssql-password>
# Driver class name for a JDBC metastore
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
# Spark specific configuration options
spark.sql.hive.metastore.version <hive-version>
# Skip this one if <hive-version> is 0.13.x.
spark.sql.hive.metastore.jars <hive-jar-source>
Continue your cluster configuration, following the instructions in Configure clusters.
Click Create Cluster to create the cluster.
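As an illustration only, a filled-in version of the configuration above might look like the sketch below. The server name, database, credentials, and Hive version are placeholders; 2.3.7 with builtin jars is just one common combination on recent Databricks runtimes, and referencing a Databricks secret for the password is one option rather than hard-coding it:
# Sketch with placeholder values
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://myserver.database.windows.net:1433;database=hivemetastore
spark.hadoop.javax.jdo.option.ConnectionUserName metastore_admin
spark.hadoop.javax.jdo.option.ConnectionPassword {{secrets/my-scope/metastore-password}}
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.sql.hive.metastore.version 2.3.7
spark.sql.hive.metastore.jars builtin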

Accessing HDFS files of multiple Kerberos users in the same Spark Job

We are running Spark on Kubernetes accessing a Kerberized HDFS cluster. We can access data from individual users by using HDFS Delegation Tokens and from service accounts by using service keytabs.
However, we would like to read/write data from multiple HDFS accounts in the same Spark job. In particular:
Read from a user account, process the data and then save the result to a directory belonging to the service account as an intermediate step of the job (for caching/sharing among users).
Read from a user account and from a service account in the same job.
All the documentation I was able to find so far covers only the scenario of a single kerberos user per Spark job.
Is it at all possible to use multiple kerberos credentials in a single Spark Job? That is, when reading from hdfs://mycluster/user/a use credentials of user A, and when reading from hdfs://mycluster/user/b use credentials of user B? We are launching Spark programmatically, as part of a larger Scala program.
We are able to access multiple user accounts from a Java program by using the Hadoop HDFS API directly, doing something like this:
import java.security.PrivilegedAction
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.UserGroupInformation

// Log in each user from its own keytab and keep a separate UGI per user.
val ugi1 = UserGroupInformation.loginUserFromKeytabAndReturnUGI(user1, keytab1)
val ugi2 = UserGroupInformation.loginUserFromKeytabAndReturnUGI(user2, keytab2)

// Obtain a FileSystem handle inside each user's security context.
val fs1 = ugi1.doAs(new PrivilegedAction[FileSystem] {
  override def run(): FileSystem = FileSystem.get(...)
})
val fs2 = ugi2.doAs(new PrivilegedAction[FileSystem] {
  override def run(): FileSystem = FileSystem.get(...)
})
// Code using fs1 and fs2
We would like to do something similar from a Spark job (running on a Kubernetes cluster). Is this possible? If so, how could we do it?
