Spark-submit throwing error from yarn-cluster mode - apache-spark

How do we pass a config file to the executors when we submit a Spark job in yarn-cluster mode?
If I change the spark-submit command below to --master yarn-client, it works fine and I get the expected output.
spark-submit \
--files /home/cloudera/conf/omega.config \
--class com.mdm.InitProcess \
--master yarn-cluster \
--num-executors 7 \
--executor-memory 1024M \
/home/cloudera/Omega.jar \
/home/cloudera/conf/omega.config
My Spark Code:
import java.io.File
import com.typesafe.config.ConfigFactory

object InitProcess
{
  def main(args: Array[String]): Unit = {
    val config_loc = args(0)
    val config = ConfigFactory.parseFile(new File(config_loc))
    val jobName = config.getString("job_name")
    .....
  }
}
I am getting the below error
17/04/05 12:01:39 ERROR yarn.ApplicationMaster: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'job_name'
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'job_name'
Could someone help me run this command with --master yarn-cluster?

The difference between yarn-client and yarn-cluster is that in yarn-client the driver runs on the machine that executes the spark-submit command.
In your case, the config file is at /home/cloudera/conf/omega.config, which can be found when you run in yarn-client mode because the driver runs on the current machine, which holds this full/path/to/file.
But it can't be accessed in yarn-cluster mode, because the driver runs on another host, which doesn't hold this full/path/to/file.
I'd suggest executing the command in the following format:
spark-submit \
--master yarn-cluster \
--num-executors 7 \
--executor-memory 1024M \
--class com.mdm.InitProcess \
--files /home/cloudera/conf/omega.config \
/home/cloudera/Omega.jar omega.config
Send the config file using --files with its full path name, and pass it to the jar as a parameter by file name only (not the full path), since the file will be downloaded to an unknown location on the workers.
In your code, you can use SparkFiles.get(filename) to get the actual full path of the downloaded file on the worker.
The change in your code should be something like:
val config_loc = SparkFiles.get(args(0))
SparkFiles docs
public class SparkFiles
Resolves paths to files added through SparkContext.addFile().
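For reference, the same pattern is available in PySpark; a minimal sketch, reusing the question's config file purely for illustration:
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Equivalent of --files: ship the config file to every node.
spark.sparkContext.addFile("/home/cloudera/conf/omega.config")

# SparkFiles.get resolves the bare file name to its local path on this node.
local_path = SparkFiles.get("omega.config")
print(local_path)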

Related

spark-submit command with --py-files fails if the driver class path or executor class path is not set

I have a main script as below
from pyspark.sql.session import SparkSession
..............
..............
..............
import callmodule as cm   # <<<--- imported from another pyspark script that is inside callmod.zip
..............
..............
..............
When I submit the spark command as below, it fails with Error: No module named Callmodule
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip mainscript.py
When I submit the spark command with the driver class path (without the executor class path) as below, it runs successfully.
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --driver-class-path C:\pyspark\scripts\callmod.zip mainscript.py
(or)
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.driver.extraClassPath=C:\pyspark\scripts\callmod.zip mainscript.py
(or)
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.driver.extraLibraryPath=C:\pyspark\scripts\callmod.zip mainscript.py
When I submit the spark command with the executor class path (without the driver class path), it also runs successfully.
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.executor.extraClassPath=C:\pyspark\scripts\callmod.zip mainscript.py
(or)
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.executor.extraLibraryPath=C:\pyspark\scripts\callmod.zip mainscript.py
Can you explain where the below import statement runs? On the driver or the executor?
import callmodule as cm
Why does the code not fail with Error: No Module Named callmodule when only the driver classpath is set, or only the executor classpath is set?
You are using --master local, so the driver is the same as the executor. Therefore, setting the classpath on either the driver or the executor produces the same behaviour, and neither causes an error.
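As an aside, a minimal PySpark sketch (reusing the question's zip path, which is an assumption here) of SparkContext.addPyFile, the programmatic counterpart of --py-files; it adds the archive to the Python search path of both the driver and the executors:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("pyfiles-demo").getOrCreate()

# Programmatic counterpart of --py-files: ships callmod.zip to the executors
# and adds it to the Python search path on the driver as well.
spark.sparkContext.addPyFile("C:/pyspark/scripts/callmod.zip")

import callmodule as cm  # resolvable on the driver and inside executor tasks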

Why does my Spark application fail in cluster mode but succeed in client mode?

I am trying to run the below PySpark program, which copies a file from an HDFS cluster.
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession
def read_file(spark):
    try:
        csv_data = spark.read.csv('hdfs://hostname:port/user/devuser/example.csv')
        csv_data.write.format('csv').save('/tmp/data')
        print('count of csv_data: {}'.format(csv_data.count()))
    except Exception as e:
        print(e)
        return False
    return True

if __name__ == "__main__":
    spark = SparkSession.builder.master('yarn').config('spark.app.name','dummy_App').config('spark.executor.memory','2g').config('spark.executor.cores','2').config('spark.yarn.keytab','/home/devuser/devuser.keytab').config('spark.yarn.principal','devuser@NAME.COM').config('spark.executor.instances','2').config('hadoop.security.authentication','kerberos').config('spark.yarn.access.hadoopFileSystems','hdfs://hostname:port').getOrCreate()
    if read_file(spark):
        print('Read the file successfully..')
    else:
        print('Reading failed..')
If I run the above code using spark-submit with deploy mode as client, the job runs fine and I can see the output in the dir /tmp/data
spark-submit --master yarn --num-executors 1 --deploy-mode client --executor-memory 1G --executor-cores 1 --driver-memory 1G --files /home/hdfstest/conn_props/core-site_fs.xml,/home/hdfstest/conn_props/hdfs-site_fs.xml check_con.py
But if I run the same code with --deploy-mode cluster,
spark-submit --master yarn --deploy-mode cluster --num-executors 1 --executor-memory 1G --executor-cores 1 --driver-memory 1G --files /home/hdfstest/conn_props/core-site_fs.xml,/home/hdfstest/conn_props/hdfs-site_fs.xml check_con.py
The job fails with a Kerberos exception as below:
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:173)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:390)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:615)
at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:411)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:801)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:797)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:797)
... 55 more
The code contains the keytab information for the cluster I am reading the file from. But I don't understand why it fails in cluster mode yet runs in client mode.
Should I make any config changes in the code to run it in cluster mode? Could anyone let me know how I can fix this problem?
Edit 1: I tried to pass the keytab and principal details from spark-submit instead of hard-coding them inside the program, as below:
def read_file(spark):
    try:
        csv_data = spark.read.csv('hdfs://hostname:port/user/devuser/example.csv')
        csv_data.write.format('csv').save('/tmp/data')
        print('count of csv_data: {}'.format(csv_data.count()))
    except Exception as e:
        print(e)
        return False
    return True

if __name__ == "__main__":
    spark = SparkSession.builder.master('yarn').config('spark.app.name','dummy_App').config('spark.executor.memory','2g').config('spark.executor.cores','2').config('spark.executor.instances','2').config('hadoop.security.authentication','kerberos').config('spark.yarn.access.hadoopFileSystems','hdfs://hostname:port').getOrCreate()
    if read_file(spark):
        print('Read the file successfully..')
    else:
        print('Reading failed..')
Spark-submit:
spark-submit --master yarn --deploy-mode cluster --name checkCon --num-executors 1 --executor-memory 1G --executor-cores 1 --driver-memory 1G --files /home/devuser/conn_props/core-site_fs.xml,/home/devuser/conn_props/hdfs-site_fs.xml --principal devuser@NAME.COM --keytab /home/devuser/devuser.keytab check_con.py
Exception:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, executor name, executor 2): java.io.IOException: DestHost:destPort <port given in the csv_data statement> , LocalHost:localPort <localport name>. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
This is because of the .config('spark.yarn.keytab','/home/devuser/devuser.keytab') setting.
In client mode your job runs locally, where the given path is available, so the job completes successfully.
Whereas in cluster mode /home/devuser/devuser.keytab is not available or accessible on the data nodes and the driver host, so it fails.
SparkSession\
.builder\
.master('yarn')\
.config('spark.app.name','dummy_App')\
.config('spark.executor.memory','2g')\
.config('spark.executor.cores','2')\
.config('spark.yarn.keytab','/home/devuser/devuser.keytab')\ # This line is causing the problem
.config('spark.yarn.principal','devuser@NAME.COM')\
.config('spark.executor.instances','2')\
.config('hadoop.security.authentication','kerberos')\
.config('spark.yarn.access.hadoopFileSystems','hdfs://hostname:port')\
.getOrCreate()
Don't hardcode the spark.yarn.keytab and spark.yarn.principal configs.
Pass these configs as part of the spark-submit command.
spark-submit --class ${APP_MAIN_CLASS} \
--master yarn \
--deploy-mode cluster \
--name ${APP_INSTANCE} \
--files ${APP_BASE_DIR}/conf/${ENV_NAME}/env.conf,${APP_BASE_DIR}/conf/example-application.conf \
--conf spark.yarn.keytab=path_to_keytab \
--conf spark.yarn.principal=principal@REALM.COM \
--jars ${JARS} \
[...]
AFAIK, the below approach will solve your problem,
with KERBEROS_KEYTAB_PATH=/home/devuser/devuser.keytab and KERBEROS_PRINCIPAL=devuser@NAME.COM.
Approach 1: with kinit command
Step 1: Run kinit and proceed with spark-submit:
kinit -kt ${KERBEROS_KEYTAB_PATH} ${KERBEROS_PRINCIPAL}
Step 2: Run klist and verify that Kerberos is working correctly for the logged-in devuser:
Ticket cache: FILE:/tmp/krb5cc_XXXXXXXXX_XXXXXX
Default principal: devuser@NAME.COM
Valid starting       Expires              Service principal
07/30/2020 15:52:28  07/31/2020 01:52:28  krbtgt/NAME.COM@NAME.COM
renew until 08/06/2020 15:52:28
Step 3: Replace the Spark session creation in your code with the below (no keytab/principal config, relying on the kinit ticket):
sparkSession = SparkSession.builder.config(conf=sparkConf).appName("TEST1").enableHiveSupport().getOrCreate()
Step 4: Run spark-submit as below.
$SPARK_HOME/bin/spark-submit --class com.test.load.Data \
--master yarn \
--deploy-mode cluster \
--driver-memory 2g \
--executor-memory 2g \
--executor-cores 2 --num-executors 2 \
--conf "spark.driver.cores=2" \
--name "TEST1" \
--principal ${KERBEROS_PRINCIPAL} \
--keytab ${KERBEROS_KEYTAB_PATH} \
--conf spark.files=$SPARK_HOME/conf/hive-site.xml \
/home/devuser/sparkproject/Test-jar-1.0.jar 2> /home/devuser/logs/test1.log

Pass arguments from a file to multiple spark jobs

Is it possible to have one master file that stores a list of arguments that can be referenced from a spark-submit command?
Example of the properties file, configurations.txt (does not have to be .txt):
school_library = "central"
school_canteen = "Nothernwall"
Expected requirement:
Calling it in one spark-submit:
spark-submit --master yarn \
--deploy-mode cluster \
--jars sample.jar \
/home/user/helloworld.py configurations.school_library
Calling it in another spark-submit:
spark-submit --master yarn \
--deploy-mode cluster \
--jars sample.jar \
/home/user/helloworld.py configurations.school_canteen
Calling both in another spark-submit:
spark-submit --master yarn \
--deploy-mode cluster \
--jars sample.jar \
/home/user/helloworld.py configurations.school_library configurations.school_canteen
Yes.
You can do that with the --files option.
For example, you are submitting a Spark job with a config file /data/config.conf:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
--files /data/config.conf \
/path/to/examples.jar
This file will be uploaded and placed in the working directory of the driver, so you have to access it by its file name.
Ex:
new FileInputStream("config.conf")
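Since the application in the question (helloworld.py) is Python, here is a minimal sketch of how it might open the shipped file by name and print the keys passed on the command line; the parsing and the handling of the configurations. prefix are illustrative assumptions:
# helloworld.py -- minimal sketch; assumes configurations.txt was shipped with --files
import sys
from pyspark.sql import SparkSession

def load_config(path):
    # Parse simple key = "value" lines into a dict.
    settings = {}
    with open(path) as fh:
        for line in fh:
            if "=" in line:
                key, value = line.split("=", 1)
                settings[key.strip()] = value.strip().strip('"')
    return settings

if __name__ == "__main__":
    spark = SparkSession.builder.appName("helloworld").getOrCreate()
    # --files places the file in the container's working directory,
    # so it can be opened by its bare name.
    config = load_config("configurations.txt")
    for arg in sys.argv[1:]:            # e.g. configurations.school_library
        key = arg.split(".", 1)[-1]     # -> school_library
        print(key, "=", config.get(key))
    spark.stop()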
The spark-submit parameter --properties-file can also be used.
Property names have to start with the "spark." prefix, for example:
spark.mykey=myvalue
Values in this case are extracted from the configuration (SparkConf).
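For example, if the file passed with --properties-file contains spark.mykey=myvalue, the value can be read back from the runtime configuration inside the job; a small sketch:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Properties loaded via --properties-file end up in the Spark configuration.
print(spark.conf.get("spark.mykey"))                    # myvalue
print(spark.sparkContext.getConf().get("spark.mykey"))  # same value via SparkConf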

SparkConf not reading spark-submit arguments

SparkConf on pyspark does not read the configuration arguments passed to spark-submit.
My python code is something like
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("foo")
sc = SparkContext(conf=conf)
# processing code...
sc.stop()
and I submit it with
PYSPARK_PYTHON="/opt/anaconda/bin/python" spark-submit foo.py \
--master local[4] --conf="spark.driver.memory=16g" --executor-memory 16g
but none of the configuration arguments are applied. That is, the application is executed with the default values of local[*] for master, 1g for driver memory and 1g for executor memory. This was confirmed by the Spark GUI.
However, the configuration arguments are followed if I use pyspark to submit the application:
PYSPARK_PYTHON="/opt/anaconda/bin/python" pyspark --master local[4] \
--conf="spark.driver.memory=8g"
Notice that --executor-memory 16g was also changed to --conf="spark.executor.memory=16g" because the former doesn't work either.
What am I doing wrong?
I believe you need to remove the = sign from --conf=. Your spark-submit script should be
PYSPARK_PYTHON="/opt/anaconda/bin/python" spark-submit foo.py \
--master local[4] --conf spark.driver.memory=16g --executor-memory 16g
Note that spark-submit also supports setting driver memory with the flag --driver-memory 16G
Apparently, the order of the arguments matters. The last argument should be the name of the Python script. So the call should be
PYSPARK_PYTHON="/opt/anaconda/bin/python" spark-submit \
--master local[4] --conf="spark.driver.memory=16g" --executor-memory 16g foo.py
or, following #glennie-helles-sindholt's advice,
PYSPARK_PYTHON="/opt/anaconda/bin/python" spark-submit \
--master local[4] --driver-memory 16g --executor-memory 16g foo.py
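Either way, you can verify which settings actually took effect by printing the resolved configuration from inside the application; a small sketch:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("foo")
sc = SparkContext(conf=conf)

# Print the effective settings; once foo.py comes last on the spark-submit
# command line, the submitted values should appear here.
for key, value in sc.getConf().getAll():
    print(key, "=", value)

sc.stop()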

custom log using spark

I'm trying to configure a custom log using spark-submit; this is my configuration:
driver:
-DlogsPath=/var/opt/log\
-DlogsFile=spark-submit-driver.log\
-Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties\
spark.driver.extraJavaOptions -> -DlogsPath=/var/opt/log -DlogsFile=spark-submit-driver.log -Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties
executor:
-DlogsPath=/var/opt/log\
-DlogsFile=spark-submit-executor.log\
-Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties\
spark.executor.extraJavaOptions -> -DlogsPath=/var/opt/log -DlogsFile=spark-submit-executor.log -Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties
The spark-submit-driver.log is created and filled fine, but spark-submit-executor.log is not created.
Any idea?
Please try using log4j as below while running your job through spark-submit.
Example:
spark-submit --class com.something.Driver \
--master yarn \
--driver-memory 1g \
--executor-memory 1g \
--driver-java-options '-Dlog4j.configuration=file:/absolute path to log4j property file/log4j.properties' \
--conf spark.executor.extraJavaOptions='-Dlog4j.configuration=file:/absolute path to log4j property file/log4j.properties' \
jarfilename.jar
Note: you have to define both properties, with --driver-java-options and --conf spark.executor.extraJavaOptions; you can also use the default log4j.properties.
Please try using
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties"
or
--files /Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties
The below submit works for me.
bin/spark-submit --class com.viaplay.log4jtest.log4jtest --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties" --master local[*] /Users/feng/SparkLog4j/SparkLog4jTest/target/SparkLog4jTest-1.0-jar-with-dependencies.jar
