spark-submit failing to connect to metastore due to Kerberos: "Caused by GSSException: No valid credentials provided", but works in local client mode - apache-spark

It seems the PySpark shell inside Docker works in local/client mode and is able to connect to Hive. However, issuing spark-submit with all dependencies fails with the error below.
20/08/24 14:03:01 INFO storage.BlockManagerMasterEndpoint: Registering block manager test.server.com:41697 with 6.2 GB RAM, BlockManagerId(3, test.server.com, 41697, None)
20/08/24 14:03:02 INFO hive.HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
20/08/24 14:03:02 INFO hive.metastore: Trying to connect to metastore with URI thrift://metastore.server.com:9083
20/08/24 14:03:02 ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
Running a simple Pi example on PySpark works fine with no Kerberos issues, but when trying to access Hive I get a Kerberos error.
Spark-submit command:
spark-submit --master yarn --deploy-mode cluster --files=/etc/hive/conf/hive-site.xml,/etc/hive/conf/yarn-site.xml,/etc/hive/conf/hdfs-site.xml,/etc/hive/conf/core-site.xml,/etc/hive/conf/mapred-site.xml,/etc/hive/conf/ssl-client.xml --name fetch_hive_test --executor-memory 12g --num-executors 20 test_hive_minimal.py
test_hive_minimal.py is a simple pyspark script to show tables in test db:
from pyspark.sql import SparkSession

# declaration
appName = "test_hive_minimal"
master = "yarn"

# Create the Spark session
sc = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .config("spark.hadoop.hive.enforce.bucketing", "True") \
    .config("spark.hadoop.hive.support.quoted.identifiers", "none") \
    .config("hive.exec.dynamic.partition", "True") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .getOrCreate()

# Define the function to load data from Teradata
# custom freeform query
sql = "show tables in user_tables"
df_new = sc.sql(sql)
df_new.show()

sc.stop()
Can anyone throw some light on how to fix this? Aren't Kerberos tickets managed automatically by YARN? All other Hadoop resources are accessible.
UPDATE:
The issue was fixed after sharing a volume mount on the Docker container and passing the keytab/principal along with hive-site.xml for accessing the metastore.
# the keytab below is mounted into the container and kept at a local Docker path
spark-submit --master yarn \
    --deploy-mode cluster \
    --jars /srv/python/ext_jars/terajdbc4.jar \
    --files=/etc/hive/conf/hive-site.xml \
    --keytab /home/alias/.kt/alias.keytab \
    --principal alias@realm.com.org \
    --name td_to_hive_test \
    --driver-cores 2 \
    --driver-memory 2G \
    --num-executors 44 \
    --executor-cores 5 \
    --executor-memory 12g \
    td_to_hive_test.py

I think your driver has tickets, but that is not the case for your executors. Add the following parameters to your spark-submit:
--principal: you can get the principal this way: klist -k
--keytab: path to the keytab
More information: https://spark.apache.org/docs/latest/running-on-yarn.html#yarn-specific-kerberos-configuration
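Not part of the original answer, but one quick way to confirm this diagnosis is to have the executors report their own credential caches. A minimal PySpark sketch (the app name is hypothetical, and klist must exist on the executor hosts):

import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("executor_klist_check").getOrCreate()

def run_klist(_):
    # klist prints the container user's credential cache, or an error such as
    # "No credentials cache found" if the executor has no Kerberos credentials.
    result = subprocess.run(["klist"], capture_output=True, text=True)
    yield result.stdout + result.stderr

reports = (
    spark.sparkContext
    .parallelize(range(2), 2)   # two partitions so the check runs as separate tasks
    .mapPartitions(run_klist)
    .collect()
)
for report in reports:
    print(report)

spark.stop()

If klist on the driver shows a ticket but the executor output does not, the --principal/--keytab flags above are the usual fix.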

Can you try the below command-line property while running the job on the cluster?
-Djavax.security.auth.useSubjectCredsOnly=false
You can add the above property to the spark-submit command.
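For reference, this JVM flag is normally forwarded through spark.driver.extraJavaOptions and spark.executor.extraJavaOptions (for example --conf spark.executor.extraJavaOptions=-Djavax.security.auth.useSubjectCredsOnly=false on the spark-submit line). As a hedged sketch, the executor side can also be set from PySpark code; the driver side cannot, because the driver JVM is already running when the builder executes:

from pyspark.sql import SparkSession

# Sketch only; the app name is hypothetical.
spark = (
    SparkSession.builder
    .appName("use_subject_creds_test")
    .config("spark.executor.extraJavaOptions",
            "-Djavax.security.auth.useSubjectCredsOnly=false")
    # The matching driver-side flag has to be supplied via
    # --conf spark.driver.extraJavaOptions=... on spark-submit or in spark-defaults.conf.
    .enableHiveSupport()
    .getOrCreate()
)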

Related

Why does my spark application fail in cluster mode but succeed in client mode?

I am trying to run the below PySpark program, which copies files from an HDFS cluster.
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession

def read_file(spark):
    try:
        csv_data = spark.read.csv('hdfs://hostname:port/user/devuser/example.csv')
        csv_data.write.format('csv').save('/tmp/data')
        print('count of csv_data: {}'.format(csv_data.count()))
    except Exception as e:
        print(e)
        return False
    return True

if __name__ == "__main__":
    spark = SparkSession.builder.master('yarn') \
        .config('spark.app.name', 'dummy_App') \
        .config('spark.executor.memory', '2g') \
        .config('spark.executor.cores', '2') \
        .config('spark.yarn.keytab', '/home/devuser/devuser.keytab') \
        .config('spark.yarn.principal', 'devuser@NAME.COM') \
        .config('spark.executor.instances', '2') \
        .config('hadoop.security.authentication', 'kerberos') \
        .config('spark.yarn.access.hadoopFileSystems', 'hdfs://hostname:port') \
        .getOrCreate()
    if read_file(spark):
        print('Read the file successfully..')
    else:
        print('Reading failed..')
If I run the above code using spark-submit with deploy mode as client, the job runs fine and I can see the output in the directory /tmp/data:
spark-submit --master yarn --num-executors 1 --deploy-mode client --executor-memory 1G --executor-cores 1 --driver-memory 1G --files /home/hdfstest/conn_props/core-site_fs.xml,/home/hdfstest/conn_props/hdfs-site_fs.xml check_con.py
But if I run the same code with --deploy-mode cluster,
spark-submit --master yarn --deploy-mode cluster --num-executors 1 --executor-memory 1G --executor-cores 1 --driver-memory 1G --files /home/hdfstest/conn_props/core-site_fs.xml,/home/hdfstest/conn_props/hdfs-site_fs.xml check_con.py
the job fails with a Kerberos exception as below:
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:173)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:390)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:615)
at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:411)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:801)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:797)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:797)
... 55 more
The code contains the keytab information of the cluster where I am reading the file from. But I don't understand why it fails in cluster mode but runs in client mode.
Should I make any config changes in the code to run it in cluster mode? Could anyone let me know how I can fix this problem?
Edit 1: I tried to pass the keytab and principal details from spark-submit instead of hard-coding them inside the program, as below:
def read_file(spark):
    try:
        csv_data = spark.read.csv('hdfs://hostname:port/user/devuser/example.csv')
        csv_data.write.format('csv').save('/tmp/data')
        print('count of csv_data: {}'.format(csv_data.count()))
    except Exception as e:
        print(e)
        return False
    return True

if __name__ == "__main__":
    spark = SparkSession.builder.master('yarn') \
        .config('spark.app.name', 'dummy_App') \
        .config('spark.executor.memory', '2g') \
        .config('spark.executor.cores', '2') \
        .config('spark.executor.instances', '2') \
        .config('hadoop.security.authentication', 'kerberos') \
        .config('spark.yarn.access.hadoopFileSystems', 'hdfs://hostname:port') \
        .getOrCreate()
    if read_file(spark):
        print('Read the file successfully..')
    else:
        print('Reading failed..')
Spark-submit:
spark-submit --master yarn --deploy-mode cluster --name checkCon --num-executors 1 --executor-memory 1G --executor-cores 1 --driver-memory 1G --files /home/devuser/conn_props/core-site_fs.xml,/home/devuser/conn_props/hdfs-site_fs.xml --principal devuser@NAME.COM --keytab /home/devuser/devuser.keytab check_con.py
Exception:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, executor name, executor 2): java.io.IOException: DestHost:destPort <port given in the csv_data statement> , LocalHost:localPort <localport name>. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
This is because of the .config('spark.yarn.keytab', '/home/devuser/devuser.keytab') conf.
In client mode your driver runs locally, where the given path is available, so the job completes successfully.
Whereas in cluster mode /home/devuser/devuser.keytab is not available or accessible on the data nodes and the driver, so it fails.
(SparkSession
    .builder
    .master('yarn')
    .config('spark.app.name', 'dummy_App')
    .config('spark.executor.memory', '2g')
    .config('spark.executor.cores', '2')
    .config('spark.yarn.keytab', '/home/devuser/devuser.keytab')  # This line is causing the problem
    .config('spark.yarn.principal', 'devuser@NAME.COM')
    .config('spark.executor.instances', '2')
    .config('hadoop.security.authentication', 'kerberos')
    .config('spark.yarn.access.hadoopFileSystems', 'hdfs://hostname:port')
    .getOrCreate())
Don't hardcode the spark.yarn.keytab & spark.yarn.principal configs.
Pass them as part of the spark-submit command.
spark-submit --class ${APP_MAIN_CLASS} \
--master yarn \
--deploy-mode cluster \
--name ${APP_INSTANCE} \
--files ${APP_BASE_DIR}/conf/${ENV_NAME}/env.conf,${APP_BASE_DIR}/conf/example-application.conf \
--conf spark.yarn.keytab=path_to_keytab \
--conf spark.yarn.principal=principal@REALM.COM \
--jars ${JARS} \
[...]
AFAIK, the below approach will solve your problem. Assume KERBEROS_KEYTAB_PATH=/home/devuser/devuser.keytab and KERBEROS_PRINCIPAL=devuser@NAME.COM.
Approach 1: with kinit command
step 1: kinit and proceed with spark-submit
kinit -kt ${KERBEROS_KEYTAB_PATH} ${KERBEROS_PRINCIPAL}
step 2: Run klist and verify that Kerberos is working correctly for the logged-in devuser.
Ticket cache: FILE:/tmp/krb5cc_XXXXXXXXX_XXXXXX
Default principal: devuser@NAME.COM
Valid starting         Expires                Service principal
07/30/2020 15:52:28    07/31/2020 01:52:28    krbtgt/NAME.COM@NAME.COM
renew until 08/06/2020 15:52:28
step 3: Replace the Spark code with a SparkSession like this:
sparkSession = SparkSession.builder().config(sparkConf).appName("TEST1").enableHiveSupport().getOrCreate()
step 4: Run spark-submit as below.
$SPARK_HOME/bin/spark-submit --class com.test.load.Data \
--master yarn \
--deploy-mode cluster \
--driver-memory 2g \
--executor-memory 2g \
--executor-cores 2 --num-executors 2 \
--conf "spark.driver.cores=2" \
--name "TEST1" \
--principal ${KERBEROS_PRINCIPAL} \
--keytab ${KERBEROS_KEYTAB_PATH} \
--conf spark.files=$SPARK_HOME/conf/hive-site.xml \
/home/devuser/sparkproject/Test-jar-1.0.jar 2> /home/devuser/logs/test1.log
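Since the question here uses PySpark rather than a jar, a rough PySpark counterpart to step 3 could look like the sketch below (assumptions: a valid TGT from the kinit in step 1, and any XML configs shipped with --files as in step 4):

from pyspark.sql import SparkSession

# Minimal sketch; relies on the ticket cache created by kinit in step 1.
spark = (
    SparkSession.builder
    .appName("TEST1")
    .enableHiveSupport()
    .getOrCreate()
)

# The same read as in the question, now authenticated via the kinit ticket.
csv_data = spark.read.csv('hdfs://hostname:port/user/devuser/example.csv')
print('count of csv_data: {}'.format(csv_data.count()))
spark.stop()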

How to use Apache Spark to query Hive table with Kerberos?

I am attempting to use Scala with Apache Spark locally to query a Hive table which is secured with Kerberos. I have no issues connecting and querying the data programmatically without Spark. However, the problem comes when I try to connect and query in Spark.
My code when run locally without spark:
Class.forName("org.apache.hive.jdbc.HiveDriver")
System.setProperty("kerberos.keytab", keytab)
System.setProperty("kerberos.principal", keytab)
System.setProperty("java.security.krb5.conf", krb5.conf)
System.setProperty("java.security.auth.login.config", jaas.conf)
val conf = new Configuration
conf.set("hadoop.security.authentication", "Kerberos")
UserGroupInformation.setConfiguration(conf)
UserGroupInformation.createProxyUser("user", UserGroupInformation.getLoginUser)
UserGroupInformation.loginUserFromKeytab(user, keytab)
UserGroupInformation.getLoginUser.checkTGTAndReloginFromKeytab()
if (UserGroupInformation.isLoginKeytabBased) {
  UserGroupInformation.getLoginUser.reloginFromKeytab()
} else if (UserGroupInformation.isLoginTicketBased) {
  UserGroupInformation.getLoginUser.reloginFromTicketCache()
}
val con = DriverManager.getConnection("jdbc:hive://hdpe-hive.company.com:10000", user, password)
val ps = con.prepareStatement("select * from table limit 5").executeQuery();
Does anyone know how I could include the keytab, krb5.conf and jaas.conf into my Spark initialization function so that I am able to authenticate with Kerberos to get the TGT?
My Spark initialization function:
conf = new SparkConf().setAppName("mediumData")
.setMaster(numCores)
.set("spark.driver.host", "localhost")
.set("spark.ui.enabled","true") //enable spark UI
.set("spark.sql.shuffle.partitions",defaultPartitions)
sparkSession = SparkSession.builder.config(conf).enableHiveSupport().getOrCreate()
I do not have files such as hive-site.xml, core-site.xml.
Thank you!
Looking at your code, you need to set the following properties in the spark-submit command on the terminal.
spark-submit --master yarn \
    --principal YOUR_PRINCIPAL_HERE \
    --keytab YOUR_KEYTAB_HERE \
    --conf spark.driver.extraJavaOptions="-Djava.security.auth.login.config=JAAS_CONF_PATH -Djava.security.krb5.conf=KRB5_PATH" \
    --conf spark.executor.extraJavaOptions="-Djava.security.auth.login.config=JAAS_CONF_PATH -Djava.security.krb5.conf=KRB5_PATH" \
    --class YOUR_MAIN_CLASS_NAME_HERE code.jar
Note that the two -D options are combined into a single extraJavaOptions value; repeating --conf for the same key keeps only the last value.

Warning: Skip remote jar hdfs

I would like to submit a Spark job with an additional jar configured on HDFS; however, Hadoop gives me a warning about skipping the remote jar. Although I can still get my final results on HDFS, I cannot obtain the effect of the additional remote jar. I would appreciate it if you could give me some suggestions.
Many thanks,
root@cluster-1-m:~# hadoop fs -ls hdfs://10.146.0.4:8020/tmp/jvm-profiler-1.0.0.jar
-rwxr-xr-x 2 hdfs hadoop 7097056 2019-01-23 14:44 hdfs://10.146.0.4:8020/tmp/jvm-profiler-1.0.0.jar
root@cluster-1-m:~# /usr/lib/spark/bin/spark-submit \
--deploy-mode cluster \
--master yarn \
--conf spark.jars=hdfs://10.146.0.4:8020/tmp/jvm-profiler-1.0.0.jar \
--conf spark.driver.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar \
--conf spark.executor.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar \
--class com.github.ehiggs.spark.terasort.TeraSort \
/root/spark-terasort-master/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar /tmp/data/terasort_in /tmp/data/terasort_out
Warning: Skip remote jar hdfs://10.146.0.4:8020/tmp/jvm-profiler-1.0.0.jar.
19/01/24 02:20:31 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at cluster-1-m/10.146.0.4:8032
19/01/24 02:20:31 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at cluster-1-m/10.146.0.4:10200
19/01/24 02:20:34 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1548293702222_0002

WARN Session: Error creating pool to /xxx.xxx.xxx.xxx:28730

I'm trying to connect to a ScyllaDB database running on IBM Cloud from Spark 2.3 running on IBM Analytics Engine.
I'm starting the spark shell like so ...
$ spark-shell --master local[1] \
--files jaas.conf \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0,datastax:spark-cassandra-connector:2.3.0-s_2.11,commons-configuration:commons-configuration:1.10 \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf" \
--conf spark.cassandra.connection.host=xxx1.composedb.com,xxx2.composedb.com,xxx3.composedb.com \
--conf spark.cassandra.connection.port=28730 \
--conf spark.cassandra.auth.username=scylla \
--conf spark.cassandra.auth.password=SECRET \
--conf spark.cassandra.connection.ssl.enabled=true \
--num-executors 1 \
--executor-cores 1
Then executing the following spark scala code:
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
val stocksRdd = sc.cassandraTable("stocks", "stocks")
stocksRdd.count()
However, I see a bunch of warnings:
18/08/23 10:11:01 WARN Cluster: You listed xxx1.composedb.com/xxx.xxx.xxx.xxx:28730 in your contact points, but it wasn't found in the control host's system.peers at startup
18/08/23 10:11:01 WARN Cluster: You listed xxx1.composedb.com/xxx.xxx.xxx.xxx:28730 in your contact points, but it wasn't found in the control host's system.peers at startup
18/08/23 10:11:06 WARN Session: Error creating pool to /xxx.xxx.xxx.xxx:28730
com.datastax.driver.core.exceptions.ConnectionException: [/xxx.xxx.xxx.xxx:28730] Pool was closed during initialization
...
However, after the stacktrace in the warning, I see the output I am expecting:
res2: Long = 4
If I navigate to the Compose UI, I see a mapping JSON:
[
{"xxx.xxx.xxx.xxx:9042":"xxx1.composedb.com:28730"},
{"xxx.xxx.xxx.xxx:9042":"xxx2.composedb.com:28730"},
{"xxx.xxx.xxx.xxx:9042":"xxx3.composedb.com:28730"}
]
It seems the warning is related to the map file.
What are the implications of the warning? Can I ignore it?
NOTE: I've seen a similar question, however I believe this question is different because of the map file and I have no control over how the scylladb cluster has been setup by Compose.
This is just a warning. The warning is happening because the IPs that Spark is trying to reach are not known to Scylla itself. Apparently Spark is connecting to the cluster and retrieving the expected information, so you should be fine.

Fail to submit spark job

I am trying to run the Spark-solr Twitter example with spark-solr-3.4.4-shaded.jar,
bin/spark-submit --master local[2] \
    --conf "spark.driver.extraJavaOptions=-Dtwitter4j.oauth.consumerKey=? -Dtwitter4j.oauth.consumerSecret=? -Dtwitter4j.oauth.accessToken=? -Dtwitter4j.oauth.accessTokenSecret=?" \
    --class com.lucidworks.spark.SparkApp \
    ./target/spark-solr-3.1.1-shaded.jar \
    twitter-to-solr -zkHost localhost:9983 -collection socialdata
but it fails and the following message is shown:
INFO ContextHandler: Started o.e.j.s.ServletContextHandler@29182679{/metrics/json,null,AVAILABLE,@Spark}
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.SparkContext.jobProgressListener()Lorg/apache/spark/ui/jobs/JobProgressListener;
I can confirm the path for ./target/spark-solr-3.1.1-shaded.jar is correct.
I suspect there is something wrong in --class com.lucidworks.spark.SparkApp (ClassPath), but I am not sure.
I am running in local mode and I changed the parameters as instructed in the example.
Version:
Spark 2.1.1
Spark-solr 3.1.1
Solr 6.6.0
