Why does my Spark application fail in cluster mode but succeed in client mode? - apache-spark

I am trying to run the below PySpark program, which copies a file from an HDFS cluster.
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession

def read_file(spark):
    try:
        csv_data = spark.read.csv('hdfs://hostname:port/user/devuser/example.csv')
        csv_data.write.format('csv').save('/tmp/data')
        print('count of csv_data: {}'.format(csv_data.count()))
    except Exception as e:
        print(e)
        return False
    return True

if __name__ == "__main__":
    spark = SparkSession.builder.master('yarn') \
        .config('spark.app.name', 'dummy_App') \
        .config('spark.executor.memory', '2g') \
        .config('spark.executor.cores', '2') \
        .config('spark.yarn.keytab', '/home/devuser/devuser.keytab') \
        .config('spark.yarn.principal', 'devuser@NAME.COM') \
        .config('spark.executor.instances', '2') \
        .config('hadoop.security.authentication', 'kerberos') \
        .config('spark.yarn.access.hadoopFileSystems', 'hdfs://hostname:port') \
        .getOrCreate()

    if read_file(spark):
        print('Read the file successfully..')
    else:
        print('Reading failed..')
If I run the above code using spark-submit with the deploy mode set to client, the job runs fine and I can see the output in the directory /tmp/data:
spark-submit --master yarn --num-executors 1 --deploy-mode client --executor-memory 1G --executor-cores 1 --driver-memory 1G --files /home/hdfstest/conn_props/core-site_fs.xml,/home/hdfstest/conn_props/hdfs-site_fs.xml check_con.py
But if I run the same code with --deploy-mode cluster,
spark-submit --master yarn --deploy-mode cluster --num-executors 1 --executor-memory 1G --executor-cores 1 --driver-memory 1G --files /home/hdfstest/conn_props/core-site_fs.xml,/home/hdfstest/conn_props/hdfs-site_fs.xml check_con.py
The job fails with a Kerberos exception, as below:
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:173)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:390)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:615)
at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:411)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:801)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:797)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:797)
... 55 more
The code contains the keytab information for the cluster I am reading the file from, but I don't understand why it fails in cluster mode yet runs in client mode.
Should I make any config changes in the code to run it in cluster mode? Could anyone tell me how I can fix this problem?
Edit 1: I tried passing the keytab and principal details through spark-submit instead of hard-coding them inside the program, as below:
def read_file(spark):
    try:
        csv_data = spark.read.csv('hdfs://hostname:port/user/devuser/example.csv')
        csv_data.write.format('csv').save('/tmp/data')
        print('count of csv_data: {}'.format(csv_data.count()))
    except Exception as e:
        print(e)
        return False
    return True

if __name__ == "__main__":
    spark = SparkSession.builder.master('yarn') \
        .config('spark.app.name', 'dummy_App') \
        .config('spark.executor.memory', '2g') \
        .config('spark.executor.cores', '2') \
        .config('spark.executor.instances', '2') \
        .config('hadoop.security.authentication', 'kerberos') \
        .config('spark.yarn.access.hadoopFileSystems', 'hdfs://hostname:port') \
        .getOrCreate()

    if read_file(spark):
        print('Read the file successfully..')
    else:
        print('Reading failed..')
Spark-submit:
spark-submit --master yarn --deploy-mode cluster --name checkCon --num-executors 1 --executor-memory 1G --executor-cores 1 --driver-memory 1G --files /home/devuser/conn_props/core-site_fs.xml,/home/devuser/conn_props/hdfs-site_fs.xml --principal devuser@NAME.COM --keytab /home/devuser/devuser.keytab check_con.py
Exception:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, executor name, executor 2): java.io.IOException: DestHost:destPort <port given in the csv_data statement> , LocalHost:localPort <localport name>. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]

This is caused by the .config('spark.yarn.keytab','/home/devuser/devuser.keytab') setting.
In client mode your driver runs locally, where that path exists, so the job completes successfully.
In cluster mode, however, /home/devuser/devuser.keytab is not available or accessible on the data nodes and the driver node, so it fails.
(SparkSession
    .builder
    .master('yarn')
    .config('spark.app.name', 'dummy_App')
    .config('spark.executor.memory', '2g')
    .config('spark.executor.cores', '2')
    .config('spark.yarn.keytab', '/home/devuser/devuser.keytab')  # This line is causing the problem
    .config('spark.yarn.principal', 'devuser@NAME.COM')
    .config('spark.executor.instances', '2')
    .config('hadoop.security.authentication', 'kerberos')
    .config('spark.yarn.access.hadoopFileSystems', 'hdfs://hostname:port')
    .getOrCreate())
Don't hard-code the spark.yarn.keytab and spark.yarn.principal configs.
Pass them as part of the spark-submit command instead (a sketch of the corresponding session builder follows the command below).
spark-submit --class ${APP_MAIN_CLASS} \
--master yarn \
--deploy-mode cluster \
--name ${APP_INSTANCE} \
--files ${APP_BASE_DIR}/conf/${ENV_NAME}/env.conf,${APP_BASE_DIR}/conf/example-application.conf \
--conf spark.yarn.keytab=path_to_keytab \
--conf spark.yarn.principal=principal@REALM.COM \
--jars ${JARS} \
[...]
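With the credentials supplied on the command line, the session builder in the PySpark program can then drop the hard-coded Kerberos settings entirely. A minimal sketch, reusing the placeholder values from the question:
from pyspark.sql import SparkSession

# Kerberos credentials now come from spark-submit (--principal / --keytab),
# so only application-level settings are configured here.
spark = (SparkSession.builder
         .master('yarn')
         .appName('dummy_App')
         .config('spark.executor.memory', '2g')
         .config('spark.executor.cores', '2')
         .config('spark.executor.instances', '2')
         .getOrCreate())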

AFAIK, the approach below will solve your problem. In the commands that follow, KERBEROS_KEYTAB_PATH=/home/devuser/devuser.keytab and KERBEROS_PRINCIPAL=devuser@NAME.COM.
Approach 1: with the kinit command
Step 1: Run kinit and proceed with spark-submit.
kinit -kt ${KERBEROS_KEYTAB_PATH} ${KERBEROS_PRINCIPAL}
Step 2: Run klist and verify that Kerberos is working correctly for the logged-in devuser.
Ticket cache: FILE:/tmp/krb5cc_XXXXXXXXX_XXXXXX
Default principal: devuser@NAME.COM
Valid starting Expires Service principal
07/30/2020 15:52:28 07/31/2020 01:52:28 krbtgt/NAME.COM@NAME.COM
renew until 08/06/2020 15:52:28
Step 3: Replace the Spark session creation in your code with the following (no hard-coded Kerberos configs):
sparkSession = SparkSession.builder().config(sparkConf).appName("TEST1").enableHiveSupport().getOrCreate()
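If the application is PySpark, as in the original question, the equivalent session creation would look roughly like this (a sketch; the app name mirrors the example above):
from pyspark.sql import SparkSession

# No Kerberos settings here: the ticket obtained via kinit (or the
# --principal/--keytab options passed to spark-submit) is used instead.
spark = (SparkSession.builder
         .appName("TEST1")
         .enableHiveSupport()
         .getOrCreate())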
Step 4: Run spark-submit as below.
$SPARK_HOME/bin/spark-submit --class com.test.load.Data \
--master yarn \
--deploy-mode cluster \
--driver-memory 2g \
--executor-memory 2g \
--executor-cores 2 --num-executors 2 \
--conf "spark.driver.cores=2" \
--name "TEST1" \
--principal ${KERBEROS_PRINCIPAL} \
--keytab ${KERBEROS_KEYTAB_PATH} \
--conf spark.files=$SPARK_HOME/conf/hive-site.xml \
/home/devuser/sparkproject/Test-jar-1.0.jar 2> /home/devuser/logs/test1.log

Related

spark-submit failing to connect to metastore due to Kerberos: Caused by GSSException: No valid credentials provided, but works in local-client mode

It seems the PySpark shell in Docker works in local-client mode and is able to connect to Hive. However, issuing spark-submit with all dependencies fails with the error below.
20/08/24 14:03:01 INFO storage.BlockManagerMasterEndpoint: Registering block manager test.server.com:41697 with 6.2 GB RAM, BlockManagerId(3, test.server.com, 41697, None)
20/08/24 14:03:02 INFO hive.HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
20/08/24 14:03:02 INFO hive.metastore: Trying to connect to metastore with URI thrift://metastore.server.com:9083
20/08/24 14:03:02 ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
Running a simple Pi example with PySpark works fine with no Kerberos issues, but trying to access Hive gives a Kerberos error.
Spark-submit command:
spark-submit --master yarn --deploy-mode cluster --files=/etc/hive/conf/hive-site.xml,/etc/hive/conf/yarn-site.xml,/etc/hive/conf/hdfs-site.xml,/etc/hive/conf/core-site.xml,/etc/hive/conf/mapred-site.xml,/etc/hive/conf/ssl-client.xml --name fetch_hive_test --executor-memory 12g --num-executors 20 test_hive_minimal.py
test_hive_minimal.py is a simple PySpark script that shows the tables in a test database:
from pyspark.sql import SparkSession
#declaration
appName = "test_hive_minimal"
master = "yarn"
# Create the Spark session
sc = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .config("spark.hadoop.hive.enforce.bucketing", "True") \
    .config("spark.hadoop.hive.support.quoted.identifiers", "none") \
    .config("hive.exec.dynamic.partition", "True") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .getOrCreate()
# Define the function to load data from Teradata
#custom freeform query
sql = "show tables in user_tables"
df_new = sc.sql(sql)
df_new.show()
sc.stop()
Can anyone shed some light on how to fix this? Aren't Kerberos tickets managed automatically by YARN? All other Hadoop resources are accessible.
UPDATE:
The issue was fixed after sharing a volume mount with the Docker container and passing the keytab (mounted at a local path inside the container) and principal along with hive-site.xml for accessing the metastore.
spark-submit --master yarn \
--deploy-mode cluster \
--jars /srv/python/ext_jars/terajdbc4.jar \
--files=/etc/hive/conf/hive-site.xml \
--keytab /home/alias/.kt/alias.keytab \
--principal alias@realm.com.org \
--name td_to_hive_test \
--driver-cores 2 \
--driver-memory 2G \
--num-executors 44 \
--executor-cores 5 \
--executor-memory 12g \
td_to_hive_test.py
I think your driver has tickets, but that is not the case for your executors. Add the following parameters to your spark-submit:
--principal: you can get the principal this way: klist -k
--keytab: the path to the keytab
More information: https://spark.apache.org/docs/latest/running-on-yarn.html#yarn-specific-kerberos-configuration
Can you try the below command-line property while running the job on the cluster?
-Djavax.security.auth.useSubjectCredsOnly=false
You can add the above property to the spark-submit command.
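If you prefer to wire that JVM property in from the PySpark application rather than the command line, a sketch along these lines should work for the executors; the driver-side equivalent still belongs on spark-submit (via --conf spark.driver.extraJavaOptions=...), because the driver JVM is already running when this code executes:
from pyspark.sql import SparkSession

# Set the flag on the executor JVM options; executors are launched after
# the session is created, so this setting takes effect.
spark = (SparkSession.builder
         .config("spark.executor.extraJavaOptions",
                 "-Djavax.security.auth.useSubjectCredsOnly=false")
         .getOrCreate())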

How to fix "Connection refused error" when running a cluster mode spark job

I am running the TeraSort benchmark with Spark on a university cluster that uses the SLURM job management system. It works fine when I use --master local[8], but when I set the master to my current node I get a connection refused error.
I run this command to launch the app locally without a problem:
> spark-submit \
--class com.github.ehiggs.spark.terasort.TeraGen \
--master local[8] \
target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar 1g \
data/terasort_in
When I use cluster mode, pointing --master at the cluster node currently in use (iris-055), I get the following error:
> spark-submit \
--class com.github.ehiggs.spark.terasort.TeraGen \
--master spark://iris-055:7077 \
--deploy-mode cluster \
--executor-memory 20G \
--total-executor-cores 24 \
target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar 5g \
data/terasort_in
Output:
WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
...
/* many lines of timeout logs etc. */
...
Caused by: java.net.ConnectException: Connection refused
... 11 more
I expect the command to run smoothly and terminate, but I cannot get past this connection error.
The problem could be that the --conf variables are not defined. This could work:
spark-submit \
--class com.github.ehiggs.spark.terasort.TeraGen \
--master spark://iris-055:7077 \
--conf spark.driver.memory=4g \
--conf spark.executor.memory=20g \
--executor-memory 20g \
--total-executor-cores 24 \
target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar 5g \
data/terasort_in

Spark-submit throwing error from yarn-cluster mode

How do we pass a config file to the executors when we submit a Spark job in yarn-cluster mode?
If I change my spark-submit command below to --master yarn-client, it works fine and I get the expected output.
spark-submit \
--files /home/cloudera/conf/omega.config \
--class com.mdm.InitProcess \
--master yarn-cluster \
--num-executors 7 \
--executor-memory 1024M \
/home/cloudera/Omega.jar \
/home/cloudera/conf/omega.config
My Spark Code:
object InitProcess
{
def main(args: Array[String]): Unit = {
val config_loc = args(0)
val config = ConfigFactory.parseFile(new File(config_loc ))
val jobName =config.getString("job_name")
.....
}
}
I am getting the below error
17/04/05 12:01:39 ERROR yarn.ApplicationMaster: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'job_name'
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'job_name'
Could someone help me run this command with --master yarn-cluster?
The difference between yarn-client and yarn-cluster is that in yarn-client the driver runs on the machine executing the spark-submit command.
In your case, the config file is located at /home/cloudera/conf/omega.config, which can be found when running in yarn-client mode because the driver runs on the current machine, which holds that full path to the file.
It can't be accessed in yarn-cluster mode, though, because the driver runs on another host, which does not hold that path.
I'd suggest executing the command in the following format:
spark-submit \
--master yarn-cluster \
--num-executors 7 \
--executor-memory 1024M \
--class com.mdm.InitProcess \
--files /home/cloudera/conf/omega.config \
/home/cloudera/Omega.jar omega.config
Send the config file using --files with its full path name, and pass it to the jar as just the file name (not the full path), since the file will be downloaded to an unknown location on the workers.
In your code, you can use SparkFiles.get(filename) to get the actual full path of the downloaded file on the worker.
The change in your code should be something similar to:
val config_loc = SparkFiles.get(args(0))
SparkFiles docs
public class SparkFiles
Resolves paths to files added through SparkContext.addFile().
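For comparison, the same pattern in PySpark (a sketch assuming the file was shipped with --files /home/cloudera/conf/omega.config):
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Files distributed with --files are downloaded to a scratch directory on
# each node; SparkFiles.get resolves the local path from the bare filename.
config_loc = SparkFiles.get("omega.config")
print(config_loc)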

How to execute Spark programs with Dynamic Resource Allocation?

I am using the spark-submit command to execute Spark jobs with parameters such as:
spark-submit --master yarn-cluster --driver-cores 2 \
--driver-memory 2G --num-executors 10 \
--executor-cores 5 --executor-memory 2G \
--class com.spark.sql.jdbc.SparkDFtoOracle2 \
Spark-hive-sql-Dataframe-0.0.1-SNAPSHOT-jar-with-dependencies.jar
Now I want to execute the same program using Spark's dynamic resource allocation. Could you please help with how to use dynamic resource allocation when executing Spark programs?
For Spark dynamic allocation, spark.dynamicAllocation.enabled needs to be set to true because it is false by default.
This also requires spark.shuffle.service.enabled to be set to true, since the Spark application runs on YARN. Check this link to start the shuffle service on each NodeManager in YARN.
The following configurations are also relevant:
spark.dynamicAllocation.minExecutors,
spark.dynamicAllocation.maxExecutors, and
spark.dynamicAllocation.initialExecutors
These options can be configured for a Spark application in three ways:
1. From spark-submit with --conf <prop_name>=<prop_value> (note that spark.dynamicAllocation.initialExecutors=10 below is equivalent to --num-executors 10)
spark-submit --master yarn-cluster \
--driver-cores 2 \
--driver-memory 2G \
--num-executors 10 \
--executor-cores 5 \
--executor-memory 2G \
--conf spark.dynamicAllocation.minExecutors=5 \
--conf spark.dynamicAllocation.maxExecutors=30 \
--conf spark.dynamicAllocation.initialExecutors=10 \
--class com.spark.sql.jdbc.SparkDFtoOracle2 \
Spark-hive-sql-Dataframe-0.0.1-SNAPSHOT-jar-with-dependencies.jar
2. Inside the Spark program with SparkConf
Set the properties on a SparkConf, then create the SparkSession or SparkContext with it:
val conf: SparkConf = new SparkConf()
conf.set("spark.dynamicAllocation.minExecutors", "5");
conf.set("spark.dynamicAllocation.maxExecutors", "30");
conf.set("spark.dynamicAllocation.initialExecutors", "10");
.....
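In PySpark the same settings could be applied like this (a sketch that also sets the two prerequisites mentioned above):
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set("spark.dynamicAllocation.enabled", "true")
conf.set("spark.shuffle.service.enabled", "true")  # required when running on YARN
conf.set("spark.dynamicAllocation.minExecutors", "5")
conf.set("spark.dynamicAllocation.maxExecutors", "30")
conf.set("spark.dynamicAllocation.initialExecutors", "10")

spark = SparkSession.builder.config(conf=conf).getOrCreate()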
3. spark-defaults.conf, usually located in $SPARK_HOME/conf/
Place the same configurations in spark-defaults.conf to apply them to all Spark applications when no configuration is passed from the command line or in code.
Spark - Dynamic Allocation Confs
I just did a small demo with Spark's dynamic resource allocation. The code is on my Github. Specifically, the demo is in this release.

submit spark Diagnostics: java.io.FileNotFoundException:

I'm using following command
bin/spark-submit --class com.my.application.XApp
--master yarn-cluster
--executor-memory 100m
--num-executors 50
/Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar
1000
and I am getting a java.io.FileNotFoundException:, and on my YARN cluster I can see the app status as FAILED.
The jar is available at that location. Is there any specific place I need to put this jar when using cluster-mode spark-submit?
Exception:
Diagnostics: java.io.FileNotFoundException: File file:/Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar does not exist
Failing this attempt. Failing the application.
You must pass the jar file to the execution nodes by adding it to the --jars argument of spark-submit, e.g.
bin/spark-submit --class com.my.application.XApp
--master yarn-cluster
--jars "/Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar"
--executor-memory 100m
--num-executors 50
/Users/nish1013/proj1/target/x-service-1.0.0-201512141101-assembly.jar 1000
