spark.executor.extraJavaOptions ignored in spark-submit - apache-spark

I am new to Spark and am trying to profile a local Spark job.
Here is the command I am running. It produces a warning saying that my executor options are being ignored because they are non-Spark config properties.
Warning:
Warning: Ignoring non-spark config property: “spark.executor.extraJavaOptions=javaagent:statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar=server=localhost,port=8086,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=MyNamespace.MySparkApplication,tagMapping=namespace.application”
Command:
./bin/spark-submit --master local[2] --class org.apache.spark.examples.GroupByTest --conf “spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar=server=localhost,port=8086,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=MyNamespace.MySparkApplication,tagMapping=namespace.application” --name HdfsWordCount --jars /Users/shprin/statD/statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar libexec/examples/jars/spark-examples_2.11-2.3.0.jar
Spark version: 2.0.3
Please let me know how to solve this.
Thanks in advance.

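I think the problem is the quote characters around the spark.executor.extraJavaOptions setting. The command uses typographic (curly) double quotes, “ and ”, which the shell does not treat as quoting characters, so they stay in the argument and the property name no longer starts with spark. (note the “ at the start of the name in the warning). Use plain ASCII quotes instead, for example single quotes: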
./bin/spark-submit --master local[2] --conf 'spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar=server=localhost,port=8086,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=MyNamespace.MySparkApplication,tagMapping=namespace.application' --class org.apache.spark.examples.GroupByTest --name HdfsWordCount --jars /Users/shprin/statD/statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar libexec/examples/jars/spark-examples_2.11-2.3.0.jar
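The effect of the quote characters can be checked in any shell without running Spark. The sketch below uses a hypothetical helper, conf_key (not part of Spark), and a shortened, made-up agent string. It prints the property name that a program would receive from each form of quoting.

```shell
# Hypothetical helper (not part of Spark): print the part of a --conf
# argument that comes before the first '=' (the property name).
conf_key() {
  printf '%s\n' "${1%%=*}"
}

# ASCII single quotes are removed by the shell before the program sees the argument.
straight_key=$(conf_key 'spark.executor.extraJavaOptions=-javaagent:profiler.jar=server=localhost,port=8086')

# Typographic quotes are ordinary characters to the shell, so they stay in the argument.
curly_key=$(conf_key “spark.executor.extraJavaOptions=-javaagent:profiler.jar=server=localhost,port=8086”)

echo "straight: $straight_key"
echo "curly:    $curly_key"
```

Only the first name starts with spark.; the second begins with the “ character, which matches the property name shown in the warning.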

In addition to the answers above: if your parameter contains both spaces and single quotes (for instance, a query parameter), enclose it in escaped double quotes (\").
Example:
spark-submit --master yarn --deploy-mode cluster --conf "spark.driver.extraJavaOptions=-DfileFormat=PARQUET -Dquery=\"select * from bucket where code in ('A')\" -Dchunk=yes" spark-app.jar
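To see what the shell actually hands over with this kind of quoting, the value can be inspected without Spark. This is a minimal sketch; count_args is a hypothetical helper, and the value is copied from the example above.

```shell
# Hypothetical helper: report how many arguments the shell passes on.
count_args() {
  echo "$#"
}

# Same quoting as the example above: outer double quotes, inner escaped ones.
conf_value="spark.driver.extraJavaOptions=-DfileFormat=PARQUET -Dquery=\"select * from bucket where code in ('A')\" -Dchunk=yes"

# --conf and its value arrive as exactly two arguments, spaces included.
arg_count=$(count_args --conf "$conf_value")

echo "arguments: $arg_count"
printf '%s\n' "$conf_value"
```

The printed value still contains the inner double quotes around the query. The outer quotes keep the whole setting together as one argument and stop the shell from expanding the *.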

Related

Load properties file in Spark classpath during spark-submit execution

I'm installing the Spark Atlas Connector in a spark-submit script (https://github.com/hortonworks-spark/spark-atlas-connector).
Due to security restrictions, I can't put atlas-application.properties in the spark/conf directory.
I used these two options in the spark-submit command:
--driver-class-path "spark.driver.extraClassPath=hdfs:///directory_to_properties_files" \
--conf "spark.executor.extraClassPath=hdfs:///directory_to_properties_files" \
When I launch spark-submit, I encounter this issue:
20/07/20 11:32:50 INFO ApplicationProperties: Looking for atlas-application.properties in classpath
20/07/20 11:32:50 INFO ApplicationProperties: Looking for /atlas-application.properties in classpath
20/07/20 11:32:50 INFO ApplicationProperties: Loading atlas-application.properties from null
Please see the CDP Atlas configuration article:
https://community.cloudera.com/t5/Community-Articles/How-to-pass-atlas-application-properties-configuration-file/ta-p/322158
Client Mode:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --driver-java-options="-Datlas.conf=/tmp/" /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10
Cluster Mode:
sudo -u spark spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --files /tmp/atlas-application.properties --conf spark.driver.extraJavaOptions="-Datlas.conf=./" /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10

How to specify the spark.authenticate.secret value when running a Spark program

I got this error when I launched my Spark program in client deploy mode:
java.lang.illegalargumentexception: error: a secret key must be specified via the spark.authenticate.secret config.
deploy-mode: client
master: YARN
How do I resolve this error? I have no clue.
I got a similar error in yarn-cluster mode; the following helped me:
https://issues.apache.org/jira/browse/SPARK-23476
If spark.authenticate=true is specified as a cluster-wide config, then the following has to be added to the spark-submit command:
--conf "spark.authenticate=false" --conf "spark.shuffle.service.enabled=false" --conf "spark.dynamicAllocation.enabled=false" --conf "spark.network.crypto.enabled=false" --conf "spark.authenticate.enableSaslEncryption=false"

How to choose the queue for Spark job using spark-submit?

Is there a way to provide parameters or settings to choose the queue in which I'd like my spark_submit job to run?
Use --queue.
An example spark-submit command would be:
spark-submit --master yarn --conf spark.executor.memory=48G --conf spark.driver.memory=6G --packages [packages separated by ,] --queue [queue_name] --class [class_name] [jar_file] [arguments]
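On YARN, the queue can also be given as a configuration property, spark.yarn.queue, which as far as I know is the property that --queue sets. This is a minimal sketch, assuming a hypothetical queue name (analytics) and placeholder class and jar names. It only prints the command instead of running it.

```shell
queue_name="analytics"                        # hypothetical queue name
queue_conf="spark.yarn.queue=${queue_name}"   # property form of --queue

# Placeholder class and jar; echo the command rather than submitting a job.
submit_cmd="spark-submit --master yarn --conf ${queue_conf} --class com.example.Main app.jar"
echo "$submit_cmd"
```

The property form is handy when all settings are kept in one place, such as a spark-defaults.conf file.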

How to pass environment variables to spark driver in cluster mode with spark-submit

spark-submit lets you configure executor environment variables with --conf spark.executorEnv.FOO=bar, and the Spark REST API lets you pass environment variables through the environmentVariables field.
Unfortunately, I've found nothing similar for configuring the driver's environment variables when submitting the driver with spark-submit in cluster mode:
spark-submit --deploy-mode cluster myapp.jar
Is it possible to set the environment variables of the driver with spark-submit in cluster mode?
On YARN at least, this works:
spark-submit --deploy-mode cluster --conf spark.yarn.appMasterEnv.FOO=bar myapp.jar
It's mentioned in http://spark.apache.org/docs/latest/configuration.html#environment-variables that:
Note: When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file.
I have tested that it can also be passed with the --conf flag of spark-submit, so you don't have to edit global conf files.
On YARN in cluster mode, it worked by adding the environment variables to the spark-submit command with --conf, as below:
spark-submit --master yarn-cluster --num-executors 15 --executor-memory 52g --executor-cores 7 --driver-memory 52g --conf "spark.yarn.appMasterEnv.FOO=/Path/foo" --conf "spark.executorEnv.FOO2=/path/foo2" app.jar
You can also add them to the conf/spark-defaults.conf file.
On Amazon EMR, you can use the configuration classification below to set environment variables on the executor and master nodes:
[
  {
    "Classification": "yarn-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "VARIABLE_NAME": "VARIABLE_VALUE"
        }
      }
    ]
  }
]
If you just set spark.yarn.appMasterEnv.FOO = "foo", then the env variable won't be present on executor instances.
Did you try passing the value with
--conf spark.driver.FOO="bar"
and then reading it with
spark.conf.get("spark.driver.FOO")
For deploy-mode cluster
As previous answers mentioned, to pass an environment variable to the Spark driver (the YARN application master), use:
--conf spark.yarn.appMasterEnv.FOO=bar # pass the value bar as FOO
--conf spark.yarn.appMasterEnv.FOO=${FOO} # pass the current value of the local FOO variable
--conf spark.yarn.appMasterEnv.FOO2=bar2 # multiple variables are passed as separate --conf options
Thanks @juhoautio
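The ${FOO} form relies on the submitting shell expanding the variable before spark-submit sees the argument, so the quoting around it matters. A small sketch that only inspects the argument strings (the variable names expanded_conf and literal_conf are made up for illustration):

```shell
# Value in the submitting shell; this is what ${FOO} refers to.
export FOO=bar

# Double quotes allow expansion: the argument carries the current value.
expanded_conf="spark.yarn.appMasterEnv.FOO=${FOO}"

# Single quotes prevent expansion: the literal text ${FOO} would be passed on.
literal_conf='spark.yarn.appMasterEnv.FOO=${FOO}'

echo "$expanded_conf"
echo "$literal_conf"
```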
For deploy-mode client
Even though it's not mentioned in the question: if you are starting to doubt your skills and found this SO page through Google, I just want to reassure you.
In client mode, all the environment variables from your current shell are available to the Spark driver. So basically:
export FOO=bar
export FOO2=bar2
definitely works, so if you can't access the value for some reason, the cause must be something else :-)
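One candidate for "something else" is a variable that was set but never exported. The sketch below does not involve Spark. It uses a child shell as a stand-in for the process that spark-submit starts, and it deliberately leaves FOO2 unexported to show the difference.

```shell
export FOO=bar     # exported: passed on to child processes
unset FOO2         # make sure FOO2 is not inherited from elsewhere
FOO2=bar2          # set but not exported: visible in this shell only

# A child shell stands in for the process started by spark-submit.
child_foo=$(sh -c 'printf "%s" "${FOO:-unset}"')
child_foo2=$(sh -c 'printf "%s" "${FOO2:-unset}"')

echo "FOO in child:  $child_foo"
echo "FOO2 in child: $child_foo2"
```

The child sees FOO but not FOO2, which is why the export keyword in the two lines above matters.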
Yes, that is possible. Which variables do you need? You can pass them in the spark-submit command, like the one you're already running:
spark-submit --deploy-mode cluster myapp.jar
Take the variables from http://spark.apache.org/docs/latest/configuration.html and use the ones that fit your optimization needs. This link could also be helpful.
I used to run in cluster mode, but now I run on YARN, so my settings are as follows (hopefully helpful):
hastimal@nm:/usr/local/spark$ ./bin/spark-submit --class com.hastimal.Processing --master yarn-cluster --num-executors 15 --executor-memory 52g --executor-cores 7 --driver-memory 52g --driver-cores 7 --conf spark.default.parallelism=105 --conf spark.driver.maxResultSize=4g --conf spark.network.timeout=300 --conf spark.yarn.executor.memoryOverhead=4608 --conf spark.yarn.driver.memoryOverhead=4608 --conf spark.akka.frameSize=1200 --conf spark.io.compression.codec=lz4 --conf spark.rdd.compress=true --conf spark.broadcast.compress=true --conf spark.shuffle.spill.compress=true --conf spark.shuffle.compress=true --conf spark.shuffle.manager=sort /users/hastimal/Processing.jar Main_Class /inputRDF/rdf_data_all.nt /output /users/hastimal/ /users/hastimal/query.txt index 2
In this command, the values after the jar are the arguments of the main class:
cc /inputData/data_all.txt /output /users/hastimal/ /users/hastimal/query.txt index 2

custom log using spark

I'm trying to configure custom logging with spark-submit. This is my configuration:
driver:
-DlogsPath=/var/opt/log\
-DlogsFile=spark-submit-driver.log\
-Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties\
spark.driver.extraJavaOptions -> -DlogsPath=/var/opt/log -DlogsFile=spark-submit-driver.log -Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties
executor:
-DlogsPath=/var/opt/log\
-DlogsFile=spark-submit-executor.log\
-Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties\
spark.executor.extraJavaOptions -> -DlogsPath=/var/opt/log -DlogsFile=spark-submit-executor.log -Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties
The spark-submit-driver.log file is created and filled correctly, but spark-submit-executor.log is not created.
Any idea?
Try passing the log4j configuration explicitly when you run the job through spark-submit.
Example:
spark-submit --class com.something.Driver \
--master yarn \
--driver-memory 1g \
--executor-memory 1g \
--driver-java-options '-Dlog4j.configuration=file:/absolute path to log4j property file/log4j.properties' \
--conf 'spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/absolute path to log4j property file/log4j.properties' \
jarfilename.jar
Note: You have to define the property for both the driver (--driver-java-options) and the executors (--conf spark.executor.extraJavaOptions). You can also use the default log4j.properties.
Please try
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties"
or
--files /Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties
The spark-submit command below works for me.
bin/spark-submit --class com.viaplay.log4jtest.log4jtest --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties" --master local[*] /Users/feng/SparkLog4j/SparkLog4jTest/target/SparkLog4jTest-1.0-jar-with-dependencies.jar
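A common way to get the same logging configuration onto the executors is to ship the file with --files and refer to it by its bare name in the executor options. This is a minimal sketch, assuming YARN with the driver in client mode, Log4j 1.x (the log4j.configuration property used throughout this thread), and hypothetical paths and class names. It only assembles and prints the command.

```shell
log4j_file="/path/to/log4j.properties"   # hypothetical local path
log4j_name=$(basename "$log4j_file")     # name of the shipped copy

# The driver reads the local file; executors read the copy shipped by --files.
driver_opts="-Dlog4j.configuration=file:${log4j_file}"
executor_opts="-Dlog4j.configuration=${log4j_name}"

# Placeholder class and jar; echo the command rather than submitting a job.
echo spark-submit --master yarn \
  --files "$log4j_file" \
  --driver-java-options "$driver_opts" \
  --conf "spark.executor.extraJavaOptions=${executor_opts}" \
  --class com.example.Main app.jar
```

Newer Spark releases that bundle Log4j 2 read a different system property (log4j.configurationFile), so the option name would need to change there.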
