Pyspark yarn cluster submit error (Cannot run Program Python) - apache-spark

I am trying to submit PySpark code with a pandas UDF (to use fbprophet...). It works fine with a local submit but fails with the following error when submitted in cluster mode:
Job aborted due to stage failure: Task 2 in stage 2.0 failed 4 times, most recent failure: Lost task 2.3 in stage 2.0 (TID 41, ip-172-31-11-94.ap-northeast-2.compute.internal, executor 2): java.io.IOException: Cannot run program
"/mnt/yarn/usercache/hadoop/appcache/application_1620263926111_0229/container_1620263926111_0229_01_000001/environment/bin/python": error=2, No such file or directory
My spark-submit command:
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python \
--jars jars/org.elasticsearch_elasticsearch-spark-20_2.11-7.10.2.jar \
--py-files dependencies.zip \
--archives ./environment.tar.gz#environment \
--files config.ini \
$1
I made environment.tar.gz with conda-pack, dependencies.zip contains my local packages, and config.ini holds the settings to load.
Is there any way to handle this problem?

You can't use a local path:
--archives ./environment.tar.gz#environment
Publish environment.tar.gz to HDFS first:
venv-pack -o environment.tar.gz
# or conda pack
hdfs dfs -put -f environment.tar.gz /spark/app_name/
hdfs dfs -chmod 0664 /spark/app_name/environment.tar.gz
and change the spark-submit argument:
--archives hdfs:///spark/app_name/environment.tar.gz#environment
More info: PySpark on YARN in self-contained environments
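As a sketch of how the `#environment` alias ties the archive to the interpreter path: YARN unpacks each `--archives` entry into the container's working directory under the name given after `#`, which is why `PYSPARK_PYTHON` points at `./environment/bin/python`. The helper below is hypothetical and only illustrates that naming convention:

```python
def executor_python_path(archives_arg):
    """Given an --archives value like 'hdfs:///path/env.tar.gz#environment',
    return the relative interpreter path the executors should use.
    The alias after '#' becomes the directory the archive is unpacked into
    inside each YARN container's working directory."""
    if "#" in archives_arg:
        alias = archives_arg.rsplit("#", 1)[1]
    else:
        # Without an alias, YARN uses the archive file name itself.
        alias = archives_arg.rsplit("/", 1)[-1]
    return "./{}/bin/python".format(alias)

print(executor_python_path("hdfs:///spark/app_name/environment.tar.gz#environment"))
# → ./environment/bin/python
```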

Related

spark-submit failing to connect to metastore due to Kerberos ("Caused by GSSException: No valid credentials provided"), but works in local-client mode

In Docker, the PySpark shell in local-client mode works and is able to connect to Hive. However, issuing spark-submit with all dependencies fails with the error below:
20/08/24 14:03:01 INFO storage.BlockManagerMasterEndpoint: Registering block manager test.server.com:41697 with 6.2 GB RAM, BlockManagerId(3, test.server.com, 41697, None)
20/08/24 14:03:02 INFO hive.HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
20/08/24 14:03:02 INFO hive.metastore: Trying to connect to metastore with URI thrift://metastore.server.com:9083
20/08/24 14:03:02 ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
Running a simple Pi example on PySpark works fine with no Kerberos issues, but trying to access Hive produces the Kerberos error.
Spark-submit command:
spark-submit --master yarn --deploy-mode cluster --files=/etc/hive/conf/hive-site.xml,/etc/hive/conf/yarn-site.xml,/etc/hive/conf/hdfs-site.xml,/etc/hive/conf/core-site.xml,/etc/hive/conf/mapred-site.xml,/etc/hive/conf/ssl-client.xml --name fetch_hive_test --executor-memory 12g --num-executors 20 test_hive_minimal.py
test_hive_minimal.py is a simple PySpark script that shows the tables in a test DB:
from pyspark.sql import SparkSession

# declarations
appName = "test_hive_minimal"
master = "yarn"

# Create the Spark session
sc = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .enableHiveSupport() \
    .config("spark.hadoop.hive.enforce.bucketing", "True") \
    .config("spark.hadoop.hive.support.quoted.identifiers", "none") \
    .config("hive.exec.dynamic.partition", "True") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .getOrCreate()

# custom freeform query
sql = "show tables in user_tables"
df_new = sc.sql(sql)
df_new.show()
sc.stop()
Can anyone throw some light on how to fix this? Aren't Kerberos tickets managed automatically by YARN? All other Hadoop resources are accessible.
UPDATE:
The issue was fixed after sharing a volume mount on the Docker container and passing the keytab (mounted and kept in a Docker-local path) and principal along with hive-site.xml for accessing the metastore.
spark-submit --master yarn \
--deploy-mode cluster \
--jars /srv/python/ext_jars/terajdbc4.jar \
--files=/etc/hive/conf/hive-site.xml \
--keytab /home/alias/.kt/alias.keytab \
--principal alias@realm.com.org \
--name td_to_hive_test \
--driver-cores 2 \
--driver-memory 2G \
--num-executors 44 \
--executor-cores 5 \
--executor-memory 12g \
td_to_hive_test.py
I think that your driver has tickets, but that is not the case for your executors. Add the following parameters to your spark-submit:
--principal: you can get the principal this way: klist -k
--keytab: path to the keytab
More information: https://spark.apache.org/docs/latest/running-on-yarn.html#yarn-specific-kerberos-configuration
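Since the answer suggests getting the principal with `klist -k`, here is a hedged sketch of pulling the first principal out of that output programmatically; the exact listing format can vary between Kerberos implementations, and the sample text below is illustrative:

```python
def principal_from_klist(klist_output):
    """Extract the first principal from `klist -k` output.
    Entry lines look like: '   2 alias@REALM.COM' (KVNO, then principal)."""
    for line in klist_output.splitlines():
        parts = line.strip().split()
        if len(parts) == 2 and parts[0].isdigit() and "@" in parts[1]:
            return parts[1]
    return None

sample = """Keytab name: FILE:/home/alias/.kt/alias.keytab
KVNO Principal
---- ----------------------------------------------------------
   2 alias@REALM.COM
"""
print(principal_from_klist(sample))  # → alias@REALM.COM
```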
Can you try the below command-line property while running the job on the cluster?
-Djavax.security.auth.useSubjectCredsOnly=false
You can add the above property to the spark-submit command.

Why does my Spark application fail in cluster mode but succeed in client mode?

I am trying to run the below PySpark program, which copies files from an HDFS cluster.
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession

def read_file(spark):
    try:
        csv_data = spark.read.csv('hdfs://hostname:port/user/devuser/example.csv')
        csv_data.write.format('csv').save('/tmp/data')
        print('count of csv_data: {}'.format(csv_data.count()))
    except Exception as e:
        print(e)
        return False
    return True

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master('yarn') \
        .config('spark.app.name', 'dummy_App') \
        .config('spark.executor.memory', '2g') \
        .config('spark.executor.cores', '2') \
        .config('spark.yarn.keytab', '/home/devuser/devuser.keytab') \
        .config('spark.yarn.principal', 'devuser@NAME.COM') \
        .config('spark.executor.instances', '2') \
        .config('hadoop.security.authentication', 'kerberos') \
        .config('spark.yarn.access.hadoopFileSystems', 'hdfs://hostname:port') \
        .getOrCreate()
    if read_file(spark):
        print('Read the file successfully..')
    else:
        print('Reading failed..')
If I run the above code using spark-submit with deploy mode as client, the job runs fine and I can see the output in the dir /tmp/data
spark-submit --master yarn --num-executors 1 --deploy-mode client --executor-memory 1G --executor-cores 1 --driver-memory 1G --files /home/hdfstest/conn_props/core-site_fs.xml,/home/hdfstest/conn_props/hdfs-site_fs.xml check_con.py
But if I run the same code with --deploy-mode cluster,
spark-submit --master yarn --deploy-mode cluster --num-executors 1 --executor-memory 1G --executor-cores 1 --driver-memory 1G --files /home/hdfstest/conn_props/core-site_fs.xml,/home/hdfstest/conn_props/hdfs-site_fs.xml check_con.py
the job fails with a Kerberos exception as below:
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:173)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:390)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:615)
at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:411)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:801)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:797)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:797)
... 55 more
The code contains the keytab information of the cluster I am reading the file from, but I don't understand why it fails in cluster mode yet runs in client mode.
Should I make any config changes in the code to run it in cluster mode? Could anyone let me know how I can fix this problem?
Edit 1: I tried to pass the keytab and principal details from spark-submit instead of hard-coding them inside the program, as below:
def read_file(spark):
    try:
        csv_data = spark.read.csv('hdfs://hostname:port/user/devuser/example.csv')
        csv_data.write.format('csv').save('/tmp/data')
        print('count of csv_data: {}'.format(csv_data.count()))
    except Exception as e:
        print(e)
        return False
    return True

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master('yarn') \
        .config('spark.app.name', 'dummy_App') \
        .config('spark.executor.memory', '2g') \
        .config('spark.executor.cores', '2') \
        .config('spark.executor.instances', '2') \
        .config('hadoop.security.authentication', 'kerberos') \
        .config('spark.yarn.access.hadoopFileSystems', 'hdfs://hostname:port') \
        .getOrCreate()
    if read_file(spark):
        print('Read the file successfully..')
    else:
        print('Reading failed..')
Spark-submit:
spark-submit --master yarn --deploy-mode cluster --name checkCon --num-executors 1 --executor-memory 1G --executor-cores 1 --driver-memory 1G --files /home/devuser/conn_props/core-site_fs.xml,/home/devuser/conn_props/hdfs-site_fs.xml --principal devuser@NAME.COM --keytab /home/devuser/devuser.keytab check_con.py
Exception:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, executor name, executor 2): java.io.IOException: DestHost:destPort <port given in the csv_data statement> , LocalHost:localPort <localport name>. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
This is because of the .config('spark.yarn.keytab','/home/devuser/devuser.keytab') conf.
In client mode your driver runs locally, where the given path is available, so the job completes successfully.
In cluster mode, /home/devuser/devuser.keytab is not available or accessible on the data nodes and driver, so it fails.
# The spark.yarn.keytab line below is what causes the problem:
SparkSession \
    .builder \
    .master('yarn') \
    .config('spark.app.name', 'dummy_App') \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.cores', '2') \
    .config('spark.yarn.keytab', '/home/devuser/devuser.keytab') \
    .config('spark.yarn.principal', 'devuser@NAME.COM') \
    .config('spark.executor.instances', '2') \
    .config('hadoop.security.authentication', 'kerberos') \
    .config('spark.yarn.access.hadoopFileSystems', 'hdfs://hostname:port') \
    .getOrCreate()
Don't hardcode the spark.yarn.keytab & spark.yarn.principal configs.
Pass them as part of the spark-submit command:
spark-submit --class ${APP_MAIN_CLASS} \
--master yarn \
--deploy-mode cluster \
--name ${APP_INSTANCE} \
--files ${APP_BASE_DIR}/conf/${ENV_NAME}/env.conf,${APP_BASE_DIR}/conf/example-application.conf \
--conf spark.yarn.keytab=path_to_keytab \
--conf spark.yarn.principal=principal@REALM.COM \
--jars ${JARS} \
[...]
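To keep the keytab and principal out of application code entirely, the submit command can also be assembled programmatically. This is a hypothetical sketch (paths and names are placeholders, not values from the question):

```python
def build_submit_cmd(app, keytab, principal, extra_conf=None):
    """Assemble a spark-submit argv that passes Kerberos credentials as
    command-line flags instead of hardcoded SparkSession configs."""
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster",
           "--keytab", keytab, "--principal", principal]
    for key, value in (extra_conf or {}).items():
        cmd += ["--conf", "{}={}".format(key, value)]
    cmd.append(app)  # the application script goes last
    return cmd

print(" ".join(build_submit_cmd(
    "check_con.py", "/home/devuser/devuser.keytab", "devuser@NAME.COM",
    {"spark.executor.memory": "1G"})))
```

Feeding the result to subprocess.run keeps the credentials in deployment config rather than in the job itself.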
AFAIK, the below approach will solve your problem. Assume KERBEROS_KEYTAB_PATH=/home/devuser/devuser.keytab and KERBEROS_PRINCIPAL=devuser@NAME.COM.
Approach 1: with the kinit command
Step 1: kinit, then proceed with spark-submit:
kinit -kt ${KERBEROS_KEYTAB_PATH} ${KERBEROS_PRINCIPAL}
Step 2: Run klist and verify that Kerberos is working correctly for the logged-in devuser:
Ticket cache: FILE:/tmp/krb5cc_XXXXXXXXX_XXXXXX
Default principal: devuser@NAME.COM

Valid starting       Expires              Service principal
07/30/2020 15:52:28  07/31/2020 01:52:28  krbtgt/NAME.COM@NAME.COM
        renew until 08/06/2020 15:52:28
Step 3: In the Spark code, create the session with SparkSession:
sparkSession = SparkSession.builder().config(sparkConf).appName("TEST1").enableHiveSupport().getOrCreate()
Step 4: Run spark-submit as below:
$SPARK_HOME/bin/spark-submit --class com.test.load.Data \
--master yarn \
--deploy-mode cluster \
--driver-memory 2g \
--executor-memory 2g \
--executor-cores 2 --num-executors 2 \
--conf "spark.driver.cores=2" \
--name "TEST1" \
--principal ${KERBEROS_PRINCIPAL} \
--keytab ${KERBEROS_KEYTAB_PATH} \
--conf spark.files=$SPARK_HOME/conf/hive-site.xml \
/home/devuser/sparkproject/Test-jar-1.0.jar 2> /home/devuser/logs/test1.log

Spark-submit in cluster mode

I'm facing a problem launching a Spark application in cluster mode.
This is the .sh:
export SPARK_MAJOR_VERSION=2
spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 8G \
--executor-memory 8G \
--total-executor-cores 4 \
--num-executors 4 \
/home/hdfs/spark_scripts/ETL.py &> /home/hdfs/spark_scripts/log_spark.txt
In the YARN logs, I found an ImportError related to a .py file that I need in "ETL.py". In other words, in "ETL.py" I've got a line in which I do this:
import AppUtility
AppUtility.py is in the same path as ETL.py.
In local mode, it works.
This is the YARN log:
20/04/28 10:59:59 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
Container: container_e64_1584554814241_22431_02_000001 on ftpandbit02.carte.local_45454
LogAggregationType: AGGREGATED
LogType:stdout
LogLastModifiedTime:Tue Apr 28 10:57:10 +0200 2020
LogLength:138
LogContents:
Traceback (most recent call last):
File "ETL.py", line 8, in
import AppUtility
ImportError: No module named AppUtility
End of LogType:stdout
End of LogType:prelaunch.err
It depends on either client mode or cluster mode.
If you use Spark in Yarn client mode,
you'll need to install any dependencies to the machines on which Yarn starts the executors. That's the only surefire way to make this work.
Using Spark in YARN cluster mode is a different story. You can distribute Python dependencies with:
./bin/spark-submit --py-files AppUtility.py /home/hdfs/spark_scripts/ETL.py
The --py-files directive sends the file to the Spark workers but does not add it to the PYTHONPATH.
To add the dependencies to the PYTHONPATH and fix the ImportError, add the following line to the Spark job, ETL.py:
sc.addPyFile(PATH)
PATH: AppUtility.py (it can be a local file, a file in HDFS, a .zip, or an HTTP, HTTPS, or FTP URI)
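The dependency shipped with `--py-files` or `addPyFile` can also be a zip of several modules. A minimal sketch of bundling top-level modules so that a plain `import AppUtility` resolves from the archive root (file names here are illustrative):

```python
import os
import zipfile

def zip_py_deps(module_paths, out_path="dependencies.zip"):
    """Bundle local .py modules into a zip usable with --py-files or
    sc.addPyFile(). Each module is stored at the archive root so that
    a bare `import AppUtility` works on the executors."""
    with zipfile.ZipFile(out_path, "w") as zf:
        for path in module_paths:
            zf.write(path, arcname=os.path.basename(path))
    return out_path
```

Ship the resulting zip with `--py-files dependencies.zip`, or call `sc.addPyFile('dependencies.zip')` from the job itself.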

Warning: Skip remote jar hdfs

I would like to submit a Spark job configured with an additional jar on HDFS, but Hadoop gives me a warning about skipping the remote jar. Although I can still get my final results on HDFS, I cannot obtain the effect of the additional remote jar. I would appreciate any suggestions.
Many thanks,
root@cluster-1-m:~# hadoop fs -ls hdfs://10.146.0.4:8020/tmp/jvm-profiler-1.0.0.jar
-rwxr-xr-x 2 hdfs hadoop 7097056 2019-01-23 14:44 hdfs://10.146.0.4:8020/tmp/jvm-profiler-1.0.0.jar
root@cluster-1-m:~# /usr/lib/spark/bin/spark-submit \
--deploy-mode cluster \
--master yarn \
--conf spark.jars=hdfs://10.146.0.4:8020/tmp/jvm-profiler-1.0.0.jar \
--conf spark.driver.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar \
--conf spark.executor.extraJavaOptions=-javaagent:jvm-profiler-1.0.0.jar \
--class com.github.ehiggs.spark.terasort.TeraSort \
/root/spark-terasort-master/target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar /tmp/data/terasort_in /tmp/data/terasort_out
Warning: Skip remote jar hdfs://10.146.0.4:8020/tmp/jvm-profiler-1.0.0.jar.
19/01/24 02:20:31 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at cluster-1-m/10.146.0.4:8032
19/01/24 02:20:31 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at cluster-1-m/10.146.0.4:10200
19/01/24 02:20:34 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1548293702222_0002
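One detail worth noting (a sketch of the mechanism, not a confirmed fix): in YARN cluster mode, jars listed in spark.jars are localized into each container's working directory under their base names, which is why the -javaagent option can reference just the file name rather than the full HDFS path. A hypothetical helper deriving that option:

```python
def javaagent_opt(jar_uri):
    """Derive the -javaagent option for a jar shipped via spark.jars.
    YARN localizes each such jar into the container working directory
    under its base name, so the agent option references only the name."""
    return "-javaagent:" + jar_uri.rstrip("/").rsplit("/", 1)[-1]

print(javaagent_opt("hdfs://10.146.0.4:8020/tmp/jvm-profiler-1.0.0.jar"))
# → -javaagent:jvm-profiler-1.0.0.jar
```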

Exception in thread "main" org.apache.spark.SparkException: Must specify the driver container image

I am trying to do spark-submit on Minikube (Kubernetes) from my local machine's CLI with the command:
spark-submit --master k8s://https://127.0.0.1:8001 --name cfe2 \
--deploy-mode cluster --class com.yyy.Test --conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=docker.io/anantpukale/spark_app:1.1 local://spark-0.0.1-SNAPSHOT.jar
I have a simple Spark job jar built on version 2.3.0. I have also containerized it in Docker, and Minikube is up and running on VirtualBox.
Below is the exception stack:
Exception in thread "main" org.apache.spark.SparkException: Must specify the driver container image
at org.apache.spark.deploy.k8s.submit.steps.BasicDriverConfigurationStep$$anonfun$3.apply(BasicDriverConfigurationStep.scala:51)
at org.apache.spark.deploy.k8s.submit.steps.BasicDriverConfigurationStep$$anonfun$3.apply(BasicDriverConfigurationStep.scala:51)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.deploy.k8s.submit.steps.BasicDriverConfigurationStep.<init>(BasicDriverConfigurationStep.scala:51)
at org.apache.spark.deploy.k8s.submit.DriverConfigOrchestrator.getAllConfigurationSteps(DriverConfigOrchestrator.scala:82)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:229)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply(KubernetesClientApplication.scala:227)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2585)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:227)
at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:192)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2018-04-06 13:33:52 INFO ShutdownHookManager:54 - Shutdown hook called
2018-04-06 13:33:52 INFO ShutdownHookManager:54 - Deleting directory C:\Users\anant\AppData\Local\Temp\spark-6da93408-88cb-4fc7-a2de-18ed166c3c66
Looks like a bug with the default value for the parameter spark.kubernetes.driver.container.image, which should fall back to spark.kubernetes.container.image. So try specifying the driver/executor container images directly:
spark.kubernetes.driver.container.image
spark.kubernetes.executor.container.image
From the source code, the only available conf options are:
spark.kubernetes.container.image
spark.kubernetes.driver.container.image
spark.kubernetes.executor.container.image
And I noticed that Spark 2.3.0 has changed a lot in terms of k8s implementation compared to 2.2.0. For example, instead of specifying driver and executor separately, the official starter's guide is to use a single image given to spark.kubernetes.container.image.
See if this works:
spark-submit \
--master k8s://http://127.0.0.1:8001 \
--name cfe2 \
--deploy-mode cluster \
--class com.oracle.Test \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=docker/anantpukale/spark_app:1.1 \
--conf spark.kubernetes.authenticate.submission.oauthToken=YOUR_TOKEN \
--conf spark.kubernetes.authenticate.submission.caCertFile=PATH_TO_YOUR_CERT \
local://spark-0.0.1-SNAPSHOT.jar
The token and cert can be found on k8s dashboard. Follow the instructions to make Spark 2.3.0 compatible docker images.
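When debugging long spark-submit commands like the ones above, it can help to parse out the `--conf key=value` pairs and confirm that spark.kubernetes.container.image really made it into the settings (a missing `=` is easy to overlook). This checker is a hypothetical sketch, not part of Spark:

```python
def parse_conf_args(argv):
    """Collect `--conf key=value` pairs from a spark-submit argv and
    flag malformed entries that are missing the '='."""
    confs, malformed = {}, []
    i = 0
    while i < len(argv):
        if argv[i] == "--conf" and i + 1 < len(argv):
            entry = argv[i + 1]
            if "=" in entry:
                key, value = entry.split("=", 1)
                confs[key] = value
            else:
                malformed.append(entry)
            i += 2  # skip both '--conf' and its argument
        else:
            i += 1
    return confs, malformed

confs, malformed = parse_conf_args(
    ["spark-submit", "--conf", "spark.executor.instances=2",
     "--conf", "spark.kubernetes.container.image", "docker.io/x/app:1.1"])
print(confs)      # → {'spark.executor.instances': '2'}
print(malformed)  # → ['spark.kubernetes.container.image']
```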
