How do we pass a config file to the executors when we submit a Spark job on yarn-cluster?
If I change the spark-submit command below to use --master yarn-client, it works fine and I get the expected output.
spark-submit \
--files /home/cloudera/conf/omega.config \
--class com.mdm.InitProcess \
--master yarn-cluster \
--num-executors 7 \
--executor-memory 1024M \
/home/cloudera/Omega.jar \
/home/cloudera/conf/omega.config
My Spark Code:
import java.io.File
import com.typesafe.config.ConfigFactory

object InitProcess
{
  def main(args: Array[String]): Unit = {
    val config_loc = args(0)
    val config = ConfigFactory.parseFile(new File(config_loc))
    val jobName = config.getString("job_name")
    .....
  }
}
I am getting the below error
17/04/05 12:01:39 ERROR yarn.ApplicationMaster: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'job_name'
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'job_name'
Could someone help me run this command with --master yarn-cluster?
The difference between yarn-client and yarn-cluster is that in yarn-client mode the driver runs on the machine from which you run the spark-submit command.
In your case the config file is at /home/cloudera/conf/omega.config, which can be read when you run in yarn-client mode because the driver runs on the machine that holds that path.
In yarn-cluster mode, however, the driver runs on another host, which does not hold that path, so the file cannot be accessed there.
I'd suggest executing the command in the following format:
spark-submit \
--master yarn-cluster \
--num-executors 7 \
--executor-memory 1024M \
--class com.mdm.InitProcess \
--files /home/cloudera/conf/omega.config \
/home/cloudera/Omega.jar omega.config
This sends the config file via --files with its full path, and passes it to the jar as just the file name (not the full path), since the file will be downloaded to a location on the workers that is not known in advance.
In your code, you can use SparkFiles.get(filename) to get the actual full path of the downloaded file on the worker.
The change in your code should be something similar to:
val config_loc = SparkFiles.get(args(0))
From the SparkFiles docs:
public class SparkFiles
Resolves paths to files added through SparkContext.addFile().
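For completeness, the same --files / SparkFiles pattern in PySpark looks roughly like the sketch below (a minimal illustration, not the asker's code; it reuses the omega.config file name from the question and simply reads the downloaded file as text):

import sys

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("InitProcess").getOrCreate()

# A file shipped with --files is downloaded into each container's working
# directory; SparkFiles.get() resolves its actual local path there.
config_path = SparkFiles.get(sys.argv[1])   # e.g. SparkFiles.get("omega.config")

with open(config_path) as f:
    config_text = f.read()                  # parse job_name etc. from this text

spark.stop()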
Related
I have a main script as below
from pyspark.sql.session import SparkSession
..............
..............
..............
import callmodule as cm      # <<<--- imported from another pyspark script, packaged in callmod.zip
..............
..............
..............
When I submit the spark command as below, it fails with Error: No module named callmodule:
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip mainscript.py
When I submit the spark command with the driver classpath (without the executor classpath) as below, it runs successfully:
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --driver-class-path C:\pyspark\scripts\callmod.zip mainscript.py
(or)
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.driver.extraClassPath=C:\pyspark\scripts\callmod.zip mainscript.py
(or)
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.driver.extraLibraryPath=C:\pyspark\scripts\callmod.zip mainscript.py
When I submit the spark command with the executor classpath (without the driver classpath), it also runs successfully:
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.executor.extraClassPath=C:\pyspark\scripts\callmod.zip mainscript.py
(or)
spark-submit --master local --py-files C:\pyspark\scripts\callmod.zip --conf spark.executor.extraLibraryPath=C:\pyspark\scripts\callmod.zip mainscript.py
Can you explain where the below import statement is executed: on the driver or on the executor?
import callmodule as cm
Why does the code not fail with Error: No module named callmodule when only the driver classpath is set, or only the executor classpath is set?
You are using --master local, so the driver is the same as the executor. Therefore, setting classpath on either driver or executor produces the same behaviour, and neither would cause an error.
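To make this concrete, a small illustrative check (not from the original post) is sketched below: it prints the host the driver runs on and the hosts the tasks run on. With --master local these are the same machine, so a module shipped with --py-files is visible to both.

import socket

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()
sc = spark.sparkContext

print("driver host:", socket.gethostname())

# Run a trivial job and collect the hostnames the tasks executed on.
task_hosts = (sc.parallelize(range(4), 4)
                .map(lambda _: socket.gethostname())
                .distinct()
                .collect())
print("task hosts :", task_hosts)   # in local mode: the same single hostname

spark.stop()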
I am trying to run the below PySpark program, which copies a file from an HDFS cluster.
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession

def read_file(spark):
    try:
        csv_data = spark.read.csv('hdfs://hostname:port/user/devuser/example.csv')
        csv_data.write.format('csv').save('/tmp/data')
        print('count of csv_data: {}'.format(csv_data.count()))
    except Exception as e:
        print(e)
        return False
    return True

if __name__ == "__main__":
    spark = (SparkSession.builder
             .master('yarn')
             .config('spark.app.name', 'dummy_App')
             .config('spark.executor.memory', '2g')
             .config('spark.executor.cores', '2')
             .config('spark.yarn.keytab', '/home/devuser/devuser.keytab')
             .config('spark.yarn.principal', 'devuser#NAME.COM')
             .config('spark.executor.instances', '2')
             .config('hadoop.security.authentication', 'kerberos')
             .config('spark.yarn.access.hadoopFileSystems', 'hdfs://hostname:port')
             .getOrCreate())
    if read_file(spark):
        print('Read the file successfully..')
    else:
        print('Reading failed..')
If I run the above code using spark-submit with --deploy-mode client, the job runs fine and I can see the output in the directory /tmp/data:
spark-submit --master yarn --num-executors 1 --deploy-mode client --executor-memory 1G --executor-cores 1 --driver-memory 1G --files /home/hdfstest/conn_props/core-site_fs.xml,/home/hdfstest/conn_props/hdfs-site_fs.xml check_con.py
But if I run the same code with --deploy-mode cluster,
spark-submit --master yarn --deploy-mode cluster --num-executors 1 --executor-memory 1G --executor-cores 1 --driver-memory 1G --files /home/hdfstest/conn_props/core-site_fs.xml,/home/hdfstest/conn_props/hdfs-site_fs.xml check_con.py
the job fails with a Kerberos exception as below:
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:173)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:390)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:615)
at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:411)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:801)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:797)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:797)
... 55 more
The code contains the keytab information for the cluster from which I am reading the file, but I don't understand why it fails in cluster mode while it runs in client mode.
Should I make any config changes in the code to run it in cluster mode? Could anyone tell me how I can fix this problem?
Edit 1: I tried to pass the keytab and principal details via spark-submit instead of hard-coding them inside the program, as below:
def read_file(spark):
    try:
        csv_data = spark.read.csv('hdfs://hostname:port/user/devuser/example.csv')
        csv_data.write.format('csv').save('/tmp/data')
        print('count of csv_data: {}'.format(csv_data.count()))
    except Exception as e:
        print(e)
        return False
    return True

if __name__ == "__main__":
    spark = (SparkSession.builder
             .master('yarn')
             .config('spark.app.name', 'dummy_App')
             .config('spark.executor.memory', '2g')
             .config('spark.executor.cores', '2')
             .config('spark.executor.instances', '2')
             .config('hadoop.security.authentication', 'kerberos')
             .config('spark.yarn.access.hadoopFileSystems', 'hdfs://hostname:port')
             .getOrCreate())
    if read_file(spark):
        print('Read the file successfully..')
    else:
        print('Reading failed..')
Spark-submit:
spark-submit --master yarn --deploy-mode cluster --name checkCon --num-executors 1 --executor-memory 1G --executor-cores 1 --driver-memory 1G --files /home/devuser/conn_props/core-site_fs.xml,/home/devuser/conn_props/hdfs-site_fs.xml --principal devuser#NAME.COM --keytab /home/devuser/devuser.keytab check_con.py
Exception:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, executor name, executor 2): java.io.IOException: DestHost:destPort <port given in the csv_data statement> , LocalHost:localPort <localport name>. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
This is because of the .config('spark.yarn.keytab', '/home/devuser/devuser.keytab') setting.
In client mode your driver runs locally, where the given path is available, so the job completes successfully.
In cluster mode, however, /home/devuser/devuser.keytab is not available or accessible on the data nodes and the driver host, so it fails.
(SparkSession
    .builder
    .master('yarn')
    .config('spark.app.name', 'dummy_App')
    .config('spark.executor.memory', '2g')
    .config('spark.executor.cores', '2')
    .config('spark.yarn.keytab', '/home/devuser/devuser.keytab')    # <-- this line is causing the problem
    .config('spark.yarn.principal', 'devuser#NAME.COM')
    .config('spark.executor.instances', '2')
    .config('hadoop.security.authentication', 'kerberos')
    .config('spark.yarn.access.hadoopFileSystems', 'hdfs://hostname:port')
    .getOrCreate())
Don't hardcode the spark.yarn.keytab and spark.yarn.principal configs.
Pass these configs as part of the spark-submit command:
spark-submit --class ${APP_MAIN_CLASS} \
--master yarn \
--deploy-mode cluster \
--name ${APP_INSTANCE} \
--files ${APP_BASE_DIR}/conf/${ENV_NAME}/env.conf,${APP_BASE_DIR}/conf/example-application.conf \
--conf spark.yarn.keytab=path_to_keytab \
--conf spark.yarn.principal=principal#REALM.COM \
--jars ${JARS} \
[...]
AFAIK, the below approach will solve your problem.
In the following, KERBEROS_KEYTAB_PATH=/home/devuser/devuser.keytab and KERBEROS_PRINCIPAL=devuser#NAME.COM.
Approach 1: with kinit command
step 1: Run kinit, then proceed with spark-submit.
kinit -kt ${KERBEROS_KEYTAB_PATH} ${KERBEROS_PRINCIPAL}
step 2: Run klist and verify that Kerberos authentication is working correctly for the logged-in devuser.
Ticket cache: FILE:/tmp/krb5cc_XXXXXXXXX_XXXXXX
Default principal: devuser#NAME.COM
Valid starting Expires Service principal
07/30/2020 15:52:28 07/31/2020 01:52:28 krbtgt/NAME.COM#NAME.COM
renew until 08/06/2020 15:52:28
step 3: In your code, build the SparkSession without the keytab/principal settings (the kinit ticket from step 1 is used instead):
sparkSession = SparkSession.builder().config(sparkConf).appName("TEST1").enableHiveSupport().getOrCreate()
step 4: Run spark-submit as below.
$SPARK_HOME/bin/spark-submit --class com.test.load.Data \
--master yarn \
--deploy-mode cluster \
--driver-memory 2g \
--executor-memory 2g \
--executor-cores 2 --num-executors 2 \
--conf "spark.driver.cores=2" \
--name "TEST1" \
--principal ${KERBEROS_PRINCIPAL} \
--keytab ${KERBEROS_KEYTAB_PATH} \
--conf spark.files=$SPARK_HOME/conf/hive-site.xml \
/home/devuser/sparkproject/Test-jar-1.0.jar 2> /home/devuser/logs/test1.log
Is it possible to have one master file that stores a list of arguments that can be referenced from a spark-submit command?
Example of the properties file, configurations.txt (does not have to be .txt):
school_library = "central"
school_canteen = "Nothernwall"
Expected requirement:
Calling it in one spark-submit:
spark-submit --master yarn \
--deploy-mode cluster \
--jars sample.jar \
/home/user/helloworld.py configurations.school_library
Calling it in another spark-submit:
spark-submit --master yarn \
--deploy-mode cluster \
--jars sample.jar \
/home/user/helloworld.py configurations.school_canteen
Calling both in another spark-submit:
spark-submit --master yarn \
--deploy-mode cluster \
--jars sample.jar \
/home/user/helloworld.py configurations.school_library configurations.school_canteen
Yes.
You can do that with the --files option.
For example, you are submitting a Spark job with a config file /data/config.conf:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
--files /data/config.conf \
/path/to/examples.jar
This file will be uploaded and placed in the working directory of the driver, so you have to access it by its file name.
Ex:
new FileInputStream("config.conf")
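Since the application in the question is a Python script (helloworld.py), the same idea in PySpark could look like the sketch below; the configurations.txt layout and the school_library / school_canteen keys are taken from the question, while the hand-rolled parsing and argument handling are only an illustration:

# helloworld.py
# Submitted with, for example:
#   spark-submit --master yarn --deploy-mode cluster \
#     --files /path/to/configurations.txt helloworld.py school_library
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("helloworld").getOrCreate()

# The file shipped with --files lands in the container's working directory,
# so it can be opened by its bare name.
settings = {}
with open("configurations.txt") as f:
    for line in f:
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip().strip('"')

requested = sys.argv[1]                      # e.g. "school_library"
print(requested, "=", settings.get(requested))

spark.stop()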
Spark-submit parameter "--properties-file" can be used.
Property names have to start with the "spark." prefix, for example:
spark.mykey=myvalue
Values in this case are extracted from the configuration (SparkConf).
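As a minimal sketch of this approach (the spark.school_library / spark.school_canteen names below simply mirror the question's example; --properties-file and SparkConf are standard Spark):

# my.properties, passed with: spark-submit --properties-file my.properties app.py
#   spark.school_library=central
#   spark.school_canteen=Nothernwall

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Properties loaded from --properties-file end up in the application's
# Spark configuration and can be read back through SparkConf.
conf = spark.sparkContext.getConf()
print(conf.get("spark.school_library"))    # -> central
print(conf.get("spark.school_canteen"))    # -> Nothernwall

spark.stop()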
SparkConf on pyspark does not read the configuration arguments passed to spark-submit.
My python code is something like
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("foo")
sc = SparkContext(conf=conf)
# processing code...
sc.stop()
and I submit it with
PYSPARK_PYTHON="/opt/anaconda/bin/python" spark-submit foo.py \
--master local[4] --conf="spark.driver.memory=16g" --executor-memory 16g
but none of the configuration arguments are applied. That is, the application is executed with the default values of local[*] for the master, 1g for driver memory, and 1g for executor memory. This was confirmed in the Spark UI.
However, the configuration arguments are followed if I use pyspark to submit the application:
PYSPARK_PYTHON="/opt/anaconda/bin/python" pyspark --master local[4] \
--conf="spark.driver.memory=8g"
Notice that --executor-memory 16g was also changed to --conf="spark.executor.memory=16g" because the former doesn't work either.
What am I doing wrong?
I believe you need to remove the = sign from --conf=. Your spark-submit script should be
PYSPARK_PYTHON="/opt/anaconda/bin/python" spark-submit foo.py \
--master local[4] --conf spark.driver.memory=16g --executor-memory 16g
Note that spark-submit also supports setting driver memory with the flag --driver-memory 16G
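If it helps, a quick way to double-check from inside the application which settings actually took effect (a small sketch using the standard SparkConf API, independent of the Spark UI):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("foo")
sc = SparkContext(conf=conf)

# Print the configuration the application actually ended up with,
# e.g. spark.master, spark.driver.memory, spark.executor.memory.
for key, value in sorted(sc.getConf().getAll()):
    print(key, "=", value)

sc.stop()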
Apparently, the order of the arguments matters. The last argument should be the name of the Python script, so the call should be
PYSPARK_PYTHON="/opt/anaconda/bin/python" spark-submit \
--master local[4] --conf="spark.driver.memory=16g" --executor-memory 16g foo.py
or, following #glennie-helles-sindholt's advice,
PYSPARK_PYTHON="/opt/anaconda/bin/python" spark-submit \
--master local[4] --driver-memory 16g --executor-memory 16g foo.py
I'm trying to configure custom logging using spark-submit; this is my configuration:
driver:
-DlogsPath=/var/opt/log\
-DlogsFile=spark-submit-driver.log\
-Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties\
spark.driver.extraJavaOptions -> -DlogsPath=/var/opt/log -DlogsFile=spark-submit-driver.log -Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties
executor:
-DlogsPath=/var/opt/log\
-DlogsFile=spark-submit-executor.log\
-Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties\
spark.executor.extraJavaOptions -> -DlogsPath=/var/opt/log -DlogsFile=spark-submit-executor.log -Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties
The spark-submit-driver.log is created and filled fine, but spark-submit-executor.log is not created.
Any idea?
Please try passing the log4j configuration explicitly when running your job through spark-submit.
Example:
spark-submit --class com.something.Driver \
--master yarn \
--driver-memory 1g \
--executor-memory 1g \
--driver-java-options '-Dlog4j.configuration=file:/absolute path to log4j property file/log4j.properties' \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/absolute path to log4j property file/log4j.properties" \
jarfilename.jar
Note: You have to define both properties, one with --driver-java-options and one with --conf spark.executor.extraJavaOptions; you can also use the default log4j.properties.
Please try to use
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties"
or
--files
/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties
The below submit works for me:
bin/spark-submit --class com.viaplay.log4jtest.log4jtest --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties" --master local[*] /Users/feng/SparkLog4j/SparkLog4jTest/target/SparkLog4jTest-1.0-jar-with-dependencies.jar