How to add .py files to spark-submit? - apache-spark

I have a spark-submit command that uses --py-files. When I run the script I get ModuleNotFoundError: No module named 'apps', although the imports are done properly.
My code looks like:
spark_submit = '''
/opt/spark/bin/spark-submit \
--master spark://spark-master:7077 \
--jars /opt/spark-jars/postgresql-42.2.22.jar \
--driver-cores {} \
--driver-memory {} \
--executor-cores {} \
--executor-memory {} \
--num-executors {} \
--py-files /opt/spark-apps/l2lpt/models/FBP.py,/opt/spark-apps/l2lpt/utils/exceptions.py,/opt/spark-apps/l2lpt/anomaly_detector/anomaly_detector.py,/opt/spark-apps/l2lpt/l2lpt.py \
/opt/spark-apps/l2lpt/main.py {} {}
'''.format(
    driver_cores,
    driver_memory,
    executor_cores,
    executor_memory,
    num_executors,
    offset,
    timestamp)
Does the order in which the .py files are added matter? I can't work out in what order these .py files need to be added. Do I need to add every .py file that my main() function will call?

When you spark-submit a PySpark application (Spark with Python), you need to specify the .py file you want to run and supply a .egg or .zip file for the dependency libraries.

# Run the command below from a terminal or the PyCharm terminal:
spark-submit --master local --deploy-mode client .\filename.py
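To address the original error above (ModuleNotFoundError: No module named 'apps'), the usual cause is that listing individual .py files flattens the package layout, so imports such as from apps... no longer resolve. The order of the entries after --py-files generally does not matter; every file or archive is shipped and put on the PYTHONPATH of the driver and executors. What matters is that the package directories (with their __init__.py files) survive, which is why zipping the whole package tree is usually easier than listing individual files. A minimal sketch, assuming the packages live under /opt/spark-apps and using deps.zip as an illustrative archive name:
# Archive the package directories so the package structure is preserved inside the zip
cd /opt/spark-apps
zip -r deps.zip l2lpt    # add any other top-level package your imports reference (e.g. apps) the same way

# Ship the archive instead of individual .py files
/opt/spark/bin/spark-submit \
--master spark://spark-master:7077 \
--jars /opt/spark-jars/postgresql-42.2.22.jar \
--py-files /opt/spark-apps/deps.zip \
/opt/spark-apps/l2lpt/main.py <offset> <timestamp>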

Related

Passing custom log4j.properties file from s3

I'm trying to set custom logging configurations. If I add the log file to the cluster and reference it in my spark submit, the configurations take effect. But if I try to access the file using --files s3://... then it doesn't work.
Works (assuming I placed the file in the home dir):
spark-submit \
--master yarn \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
Doesn't work:
spark-submit \
--master yarn \
--files s3://my_path/log4j.properties \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
How can I use a config file in s3 to set the logging configuration?
You can't, not directly: Log4j always loads its configuration files from the local filesystem.
You can, however, put the config inside a JAR, and since Spark downloads JARs with your job, you can get at it indirectly. Create a JAR containing only the log4j.properties file and tell Spark to load it with the job.
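A minimal sketch of that approach, assuming the properties file has already been copied from S3 to the submitting machine (for example with the AWS CLI), and using log4j-config.jar and my-application.jar as illustrative names:
# Build a JAR whose only content is the properties file, at the JAR root
jar cf log4j-config.jar log4j.properties

# Ship the JAR with the job and reference the file by its classpath name (no file: prefix)
spark-submit \
--master yarn \
--jars log4j-config.jar \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
my-application.jar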

Path of jars added to a Spark Job - spark-submit

I am using Spark 2.1 (BTW) on a YARN cluster.
I am trying to upload JARs to the YARN cluster and use them to replace the on-site (already in place) Spark JARs.
I am trying to do so through spark-submit.
The question Add jars to a Spark Job - spark-submit - and the related answers - are full of interesting points.
One helpful answer is the following one:
spark-submit --jars additional1.jar,additional2.jar \
--driver-class-path additional1.jar:additional2.jar \
--conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
--class MyClass main-application.jar
So, I understand the following:
"--jars" is for uploading jar on each node
"--driver-class-path" is for using uploaded jar for the driver.
"--conf spark.executor.extraClassPath" is for using uploaded jar for executors.
While I master the filepaths for "--jars" within a spark-submit command, what will be the filepaths of the uploaded JAR to be used in "--driver-class-path" for example ?
The doc says: "JARs and files are copied to the working directory for each SparkContext on the executor nodes"
Fine, but for the following command, what should I put instead of XXX and YYY ?
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path XXX:YYY \
--conf spark.executor.extraClassPath=XXX:YYY \
--class MyClass main-application.jar
When using spark-submit, how can I reference the "working directory for the SparkContext" to form XXX and YYY filepath ?
Thanks.
PS: I have tried
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path some1.jar:some2.jar \
--conf spark.executor.extraClassPath=some1.jar:some2.jar \
--class MyClass main-application.jar
No success (if I made no mistake)
And I have tried also:
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--driver-class-path ./some1.jar:./some2.jar \
--conf spark.executor.extraClassPath=./some1.jar:./some2.jar \
--class MyClass main-application.jar
No success either.
spark-submit by default uses client mode.
In client mode, you should not use --jars in conjunction with --driver-class-path.
--driver-class-path overwrites the original classpath instead of prepending to it, as one might expect.
--jars automatically adds the extra jars to the driver and executor classpaths, so you do not need to add their paths manually.
It seems that in cluster mode --driver-class-path is ignored.
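In practice that means the command from the question can usually be reduced to the following sketch (client mode), letting --jars handle both classpaths:
# --jars distributes the jars and adds them to the driver and executor classpaths
spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
--class MyClass main-application.jar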

Switching between Spark YARN Client and Cluster mode when using Typesafe config

I've been struggling with an issue handling multiple config files with Spark on YARN and switching between cluster and client mode.
In my application, I need to load two config files:
An application config
An environment config
My current setup:
example-application.conf:
include required(file("env.conf"))
app {
  source {
    source-name: "some-source"
    source-type: "file"
    source-path: ${env.some-source-path}
  }
  ....
}
env.conf:
env {
  some-source-path: "/path/to/file"
}
Code:
// Spark submit that works:
$SPARK_HOME/bin/spark-submit --class ${APP_MAIN_CLASS} \
--master yarn \
--deploy-mode cluster \
--name ${APP_INSTANCE} \
--files ${APP_BASE_DIR}/conf/${ENV_NAME}/env.conf,${APP_BASE_DIR}/conf/example-application.conf \
--principal ${PRINCIPAL_NAME} --keytab ${KEYTAB_PATH} \
--jars ${JARS} \
--num-executors 10 \
--executor-memory 4g \
--executor-cores 4 \
${APP_JAR} "example-application.conf" "$#"
// How above file is loaded in code:
val appConfFile = new File(configFileName) // configFileName = "example-application.conf"
val conf = ConfigFactory.parseFile(appConfFile)
In cluster mode, the above setup works because the --files option of spark-submit copies the files to all the nodes involved, into the same location as the jars, so providing just the name of the config file is good enough.
However, I am not sure how to get this setup to work such that I can easily swap my application between client and cluster mode. In client mode, the application fails because ConfigFactory cannot find example-application.conf to parse it. I can fix this by providing the full path for the application config, but then the include required(file("env.conf")) directive will fail.
Any recommendations on how to set this up so that I can easily swap between cluster and client mode?
Thanks!
Pass the complete path of the config file as part of spark-submit and handle the logic of extracting the right value inside your Spark code:
If spark.submit.deployMode=client, take the full path, i.e. ${APP_BASE_DIR}/conf/example-application.conf.
If spark.submit.deployMode=cluster, take only the file name, i.e. example-application.conf.
// Spark submit that works:
$SPARK_HOME/bin/spark-submit --class ${APP_MAIN_CLASS} \
--master yarn \
--deploy-mode cluster \
--name ${APP_INSTANCE} \
--files ${APP_BASE_DIR}/conf/${ENV_NAME}/env.conf,${APP_BASE_DIR}/conf/example-application.conf \
--principal ${PRINCIPAL_NAME} --keytab ${KEYTAB_PATH} \
--jars ${JARS} \
--num-executors 10 \
--executor-memory 4g \
--executor-cores 4 \
${APP_JAR} ${APP_BASE_DIR}/conf/example-application.conf "$#"
// How above file is loaded in code:
val configFile = if(!spark.conf.get("spark.submit.deployMode").contains("client")) configFileName.split("/").last else configFileName
val appConfFile = new File(configFile) // configFileName = "example-application.conf"
val conf = ConfigFactory.parseFile(appConfFile)

How to pass an external resource yml/property file while running a spark job on a cluster?

I am using spark-sql version 2.4.1, Jackson JARs, and Java 8.
In my Spark program/job I am reading a few configurations/properties from an external "conditions.yml" file which is placed in the "resources" folder of my Java project, as below:
ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
try {
    driverConfig = mapper.readValue(
        Configuration.class.getClassLoader().getResourceAsStream("conditions.yml"), Configuration.class);
} catch (IOException e) {
    // handle/log the failure to load conditions.yml
}
If I want to pass "conditions.yml" file from outside while submitting spark-job how to pass this file ? where it should be placed?
In my program I am reading from "resouces" directory i.e. .getResourceAsStream("conditions.yml") ...if i pass this property file from spark-submit ...will the job takes from here from resouces or external path ?
If I want to pass as external file , do I need to change the code above ?
Updated Question:
In my Spark driver program I am reading the property file as a program argument, which is loaded as below:
Config props = ConfigFactory.parseFile(new File(args[0]));
When running my Spark job from a shell script, I submit it as below:
$SPARK_HOME/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--name MyDriver \
--jars "/local/jars/*.jar" \
--files hdfs://files/application-cloud-dev.properties,hdfs://files/condition.yml \
--class com.sp.MyDriver \
--executor-cores 3 \
--executor-memory 9g \
--num-executors 5 \
--driver-cores 2 \
--driver-memory 4g \
--driver-java-options -Dconfig.file=./application-cloud-dev.properties \
--conf spark.executor.extraJavaOptions=-Dconfig.file=./application-cloud-dev.properties \
--conf spark.driver.extraClassPath=. \
--driver-class-path . \
ca-datamigration-0.0.1.jar application-cloud-dev.properties condition.yml
Error:
The properties are not being loaded. What is wrong here? What is the correct way to pass program arguments to a Spark job written in Java?
You will have to use --files <path to your file> in the spark-submit command to be able to pass any files.
The syntax for that is:
--files /home/user/config/my-file.yml
If the file is on HDFS, then provide the HDFS path.
This should copy the file to the classpath and your code should be able to find it from the driver.
The implementation for reading the file could look something like this:
import java.util.Properties
import scala.io.Source

def readProperties(propertiesPath: String): Properties = {
  // the file shipped with --files ends up on the classpath (the command above adds "." to it)
  val url = getClass.getResource("/" + propertiesPath)
  assert(url != null, s"Could not create URL to read $propertiesPath properties file")
  val source = Source.fromURL(url)
  val properties = new Properties
  properties.load(source.bufferedReader)
  properties
}
Hope that is what you are looking for.
You can add:
spec:
  args:
    - --deploy-mode
    - cluster

How do I reference modules .egg files supplied via the --py-files option of spark-submit?

I'm using spark-submit with the --py-files option to include an egg (spark_submit_test_lib-0.1-py2.7.egg) that I've built.
Structure of that .egg is basically:
root
|- EGG-INFO
|- spark_submit_test_lib
   |- __init__.pyc
   |- __init__.py
   |- spark_submit_test_lib.pyc
   |- spark_submit_test_lib.py
      |- def do_sum()
In my driver script spark_submit_test.py I have this import:
from spark_submit_test_lib import do_sum
I submit to my hadoop cluster using:
spark-submit --queue 'myqueue' --py-files spark_submit_test_lib-0.1-py2.7.egg --deploy-mode cluster --master yarn spark_submit_test.py
It fails with the error:
ImportError: No module named spark_submit_test_lib
I tried changing the import statement to
from spark_submit_test_lib.spark_submit_test_lib import do_sum
but to no avail, still getting the same error.
I see someone has had a similar problem (in that case he/she wants spark-submit to use a file inside the .egg as the driver - so a similar problem but not the same): What filepath or dot notation should I use when using spark-submit.py with .egg files as an argument to --py-files but at the time of writing there are no answers to it.
This command works for me:
spark2-submit --master yarn \
--driver-memory 20g \
--num-executors 50 \
--executor-cores 1 \
--deploy-mode client \
--jars spark-avro_2.11-3.2.0.jar \
--py-files spark_submit_test_lib-0.1-py2.7.egg \
driver.py
I think this is because the --py-files argument is meant to supply files that will be used by nodes on the Spark cluster, not by your driver program. I believe your driver Python program needs to be local. I could be wrong about this, but this is what I have experienced and my eventual conclusion to the question you linked.
