How to add JVM option -Xss512m to spark-submit? - apache-spark

How to add JVM option -Xss512m to spark-submit?
In other words, where do I have to use spark.executor.extraJavaOptions and spark.driver.extraJavaOptions?

The Java options have to be specified via the --conf parameter, so ideally you would do:
spark-submit --class YOUR_MAIN_CLASS --conf "spark.executor.extraJavaOptions=-Xss512m" --conf "spark.driver.extraJavaOptions=-Xss512m" APP.jar

Related

Spark metrics sink doesn't expose executor's metrics

I'm using Spark on YARN with
Ambari 2.7.4
HDP Standalone 3.1.4
Spark 2.3.2
Hadoop 3.1.1
Graphite on Docker latest
I was trying to get Spark metrics with Graphite sink following this tutorial.
Advanced spark2-metrics-properties in Ambari are:
driver.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
executor.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
worker.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
master.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=ap-test-m.c.gcp-ps.internal
*.sink.graphite.port=2003
*.sink.graphite.protocol=tcp
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=app-test
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource
Spark submit:
export HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf/; spark-submit --class com.Main --master yarn --deploy-mode client --driver-memory 1g --executor-memory 10g --num-executors 2 --executor-cores 2 spark-app.jar /data
As a result, I'm only getting driver metrics.
I also tried adding metrics.properties to the spark-submit command together with the global Spark metrics props, but that didn't help.
Finally, I tried setting the conf both in spark-submit and in the Java SparkConf:
--conf "spark.metrics.conf.driver.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "spark.metrics.conf.executor.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "worker.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "master.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "spark.metrics.conf.*.sink.graphite.host"="host"
--conf "spark.metrics.conf.*.sink.graphite.port"=2003
--conf "spark.metrics.conf.*.sink.graphite.period"=10
--conf "spark.metrics.conf.*.sink.graphite.unit"=seconds
--conf "spark.metrics.conf.*.sink.graphite.prefix"="app-test"
--conf "spark.metrics.conf.*.source.jvm.class"="org.apache.spark.metrics.source.JvmSource"
But that didn't help either.
CSVSink also gives only driver metrics.
UPDATE
When I submit the job in cluster mode, I get the same metrics as in the Spark History Server, but the JVM metrics are still absent.
Posting to a dated question, but maybe it will help.
It seems the executors do not have the metrics.properties file on their filesystems.
One way to confirm this is to look at the executor logs:
2020-01-16 10:00:10 ERROR MetricsConfig:91 - Error loading configuration file metrics.properties
java.io.FileNotFoundException: metrics.properties (No such file or directory)
at org.apache.spark.metrics.MetricsConfig.loadPropertiesFromFile(MetricsConfig.scala:132)
at org.apache.spark.metrics.MetricsConfig.initialize(MetricsConfig.scala:55)
at org.apache.spark.metrics.MetricsSystem.<init>(MetricsSystem.scala:95)
at org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:233)
To fix this on YARN, pass two parameters to spark-submit:
$ spark-submit \
--files metrics.properties \
--conf spark.metrics.conf=metrics.properties
The --files option ensures that the listed files are shipped to the executors.
The spark.metrics.conf option specifies a custom location for the metrics configuration file.
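Applied to the submit command from the question, that would look roughly like this (same paths and class name as above, only the two metrics options added):
export HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf/; spark-submit --class com.Main --master yarn --deploy-mode client --files metrics.properties --conf spark.metrics.conf=metrics.properties --driver-memory 1g --executor-memory 10g --num-executors 2 --executor-cores 2 spark-app.jar /data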
Another way to fix the issue would be to place the metrics.properties file into $SPARK_HOME/conf/metrics.properties on both the driver and executor before starting the job.
More on metrics here: https://spark.apache.org/docs/latest/monitoring.html

How to set spark.driver.extraClassPath through Apache Livy on Azure Spark cluster?

I would like to add some configuration when a Spark job is submitted via Apache Livy to an Azure cluster. Currently, to launch a Spark job via Apache Livy in the cluster, I use the following command:
curl -X POST --data '{"file": "/home/xxx/lib/MyJar.jar", "className": "org.springframework.boot.loader.JarLauncher"}' -H "Content-Type: application/json" localhost:8998/batches
This command generates the following process:
……. org.apache.spark.deploy.SparkSubmit --conf spark.master=yarn-cluster --conf spark.yarn.tags=livy-batch-51-qHXmHXWg --conf spark.yarn.submit.waitAppCompletion=false --class org.springframework.boot.loader.JarLauncher adl://home/home/xxx/lib/MyJar.jar
Due to a technical issue when running the jar, I need to introduce two configurations into this command:
--conf "spark.driver.extraClassPath=/home/xxx/lib /jars/*"
--conf "spark.executor.extraClassPath=/home/xxx/lib/jars/*"
It's related to a logback issue when running on Spark, which uses log4j2; the extra classpath adds the logback jars.
I found here https://groups.google.com/a/cloudera.org/forum/#!topic/hue-user/fcRM3YiqAAA that it can be done by adding this conf to LIVY_SERVER_JAVA_OPTS or spark-defaults.conf
From Ambari I modified LIVY_SERVER_JAVA_OPTS in livy-env.sh (in the Spark2 & Livy menus) and
Advanced spark2-defaults in Spark2.
Unfortunately, this is not working on our side, even though I can see that the LivyServer is launched with -Dspark.driver.extraClassPath.
Is there any specific configuration to add in Azure HDInsight to make it work?
Note that the process should look like
……. org.apache.spark.deploy.SparkSubmit --conf spark.master=yarn-cluster --conf spark.yarn.tags=livy-batch-51-qHXmHXWg --conf spark.yarn.submit.waitAppCompletion=false --conf "spark.driver.extraClassPath=/home/xxx/lib/jars/*" --conf "spark.executor.extraClassPath=/home/xxx/lib/jars/*" --class org.springframework.boot.loader.JarLauncher adl://home/home/xxx/lib/MyJar.jar
Thx
Add the following
"conf":{ "spark.driver.extraClassPath":"wasbs:///pathtojar.jar","spark.yarn.user.classpath.first":"true"}

How to choose the queue for Spark job using spark-submit?

Is there a way to provide parameters or settings to choose the queue in which I'd like my spark_submit job to run?
By using the --queue option.
So an example spark-submit job would be:
spark-submit --master yarn --conf spark.executor.memory=48G --conf spark.driver.memory=6G --packages [packages separated by ,] --queue [queue_name] --class [class_name] [jar_file] [arguments]
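On YARN, the queue can also be set as a configuration property, which is equivalent to --queue and can also live in spark-defaults.conf:
--conf spark.yarn.queue=queue_name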

How to pass environment variables to spark driver in cluster mode with spark-submit

spark-submit allows configuring the executor environment variables with --conf spark.executorEnv.FOO=bar, and the Spark REST API allows passing some environment variables with the environmentVariables field.
Unfortunately, I've found nothing similar for configuring the environment variables of the driver when submitting it with spark-submit in cluster mode:
spark-submit --deploy-mode cluster myapp.jar
Is it possible to set the environment variables of the driver with spark-submit in cluster mode?
On YARN at least, this works:
spark-submit --deploy-mode cluster --conf spark.yarn.appMasterEnv.FOO=bar myapp.jar
It's mentioned in http://spark.apache.org/docs/latest/configuration.html#environment-variables that:
Note: When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file.
I have tested that it can be passed with the --conf flag of spark-submit, so you don't have to edit global conf files.
On YARN in cluster mode, it worked by adding the environment variables to the spark-submit command using --conf, as below:
spark-submit --master yarn-cluster --num-executors 15 --executor-memory 52g --executor-cores 7 --driver-memory 52g --conf "spark.yarn.appMasterEnv.FOO=/Path/foo" --conf "spark.executorEnv.FOO2=/path/foo2" app.jar
Also, you can do it by adding them to the conf/spark-defaults.conf file.
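In conf/spark-defaults.conf that would look roughly like this (placeholder values taken from the command above):
spark.yarn.appMasterEnv.FOO   /Path/foo
spark.executorEnv.FOO2        /path/foo2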
You can use the Classification below to set up environment variables on the executor and master nodes:
[
  {
    "Classification": "yarn-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "VARIABLE_NAME": "VARIABLE_VALUE"
        }
      }
    ]
  }
]
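This Classification format is the one used by EMR configuration objects; if that is your environment, it is typically supplied when the cluster is created via the --configurations option of aws emr create-cluster, e.g. --configurations file://./yarn-env.json (the file name is illustrative).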
If you just set spark.yarn.appMasterEnv.FOO = "foo", then the env variable won't be present on executor instances.
Did you test with
--conf spark.driver.FOO="bar"
and then getting the value with
spark.conf.get("spark.driver.FOO")?
For deploy-mode cluster
As previous answers mentioned, if you want to pass an env variable to the application master (i.e. the driver in cluster mode), use:
--conf spark.yarn.appMasterEnv.FOO=bar // pass bar value to FOO variable
--conf spark.yarn.appMasterEnv.FOO=${FOO} // passing current FOO env variable
--conf spark.yarn.appMasterEnv.FOO2=bar2 // multiple variables are passed separately
Thanks @juhoautio
For deploy-mode client
Even though it's not mentioned in the question, if you are starting to doubt your skills and found this SO page on Google, I just wanted to reassure you.
When using client mode, all the env variables from your current shell are available to the driver. So basically:
export FOO=bar
export FOO2=bar2
is definitely working, so if you can't access this value for some reason, it must be something else :-)
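So a minimal client-mode sketch (variable names are just examples) would be:
export FOO=bar
export FOO2=bar2
spark-submit --master yarn --deploy-mode client myapp.jar
The driver process, running in that same shell, will see FOO and FOO2.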
Yes, that is possible. Whatever variables you need, you can pass them in spark-submit like you're doing:
spark-submit --deploy-mode cluster myapp.jar
Take the variables from http://spark.apache.org/docs/latest/configuration.html and use the ones that fit your optimization needs. This link could also be helpful.
I used to run in cluster mode but now I'm running on YARN, so my variables are as follows (hopefully helpful):
hastimal#nm:/usr/local/spark$ ./bin/spark-submit --class com.hastimal.Processing --master yarn-cluster --num-executors 15 --executor-memory 52g --executor-cores 7 --driver-memory 52g --driver-cores 7 --conf spark.default.parallelism=105 --conf spark.driver.maxResultSize=4g --conf spark.network.timeout=300 --conf spark.yarn.executor.memoryOverhead=4608 --conf spark.yarn.driver.memoryOverhead=4608 --conf spark.akka.frameSize=1200 --conf spark.io.compression.codec=lz4 --conf spark.rdd.compress=true --conf spark.broadcast.compress=true --conf spark.shuffle.spill.compress=true --conf spark.shuffle.compress=true --conf spark.shuffle.manager=sort /users/hastimal/Processing.jar Main_Class /inputRDF/rdf_data_all.nt /output /users/hastimal/ /users/hastimal/query.txt index 2
In this command, the following are the arguments to my jar's main class:
cc /inputData/data_all.txt /output /users/hastimal/
/users/hastimal/query.txt index 2

custom log using spark

I'm trying to configure custom logging using spark-submit; this is my configuration:
driver:
-DlogsPath=/var/opt/log\
-DlogsFile=spark-submit-driver.log\
-Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties\
spark.driver.extraJavaOptions -> -DlogsPath=/var/opt/log -DlogsFile=spark-submit-driver.log -Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties
executor:
-DlogsPath=/var/opt/log\
-DlogsFile=spark-submit-executor.log\
-Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties\
spark.executor.extraJavaOptions -> -DlogsPath=/var/opt/log -DlogsFile=spark-submit-executor.log -Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties
The spark-submit-driver.log is created and filled fine, but spark-submit-executor.log is not created.
Any idea?
Please try using log4j while running your job through spark-submit.
Example:
spark-submit --class com.something.Driver \
--master yarn \
--driver-memory 1g \
--executor-memory 1g \
--driver-java-options '-Dlog4j.configuration=file:/absolute path to log4j property file/log4j.properties' \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/absolute path to log4j property file/log4j.properties" \
jarfilename.jar
Note: You have to define both properties, --driver-java-options and --conf spark.executor.extraJavaOptions; you can also use the default log4j.properties.
Please try to use
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties"
or
--files
/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties
The spark-submit command below works for me.
bin/spark-submit --class com.viaplay.log4jtest.log4jtest --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties" --master local[*] /Users/feng/SparkLog4j/SparkLog4jTest/target/SparkLog4jTest-1.0-jar-with-dependencies.jar
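If the executor log file is still missing, one likely cause (an assumption, not something confirmed in the question) is that the file referenced by spark.executor.extraJavaOptions only exists on the driver machine. A sketch that ships it with --files and points the executors at the copy in their working directory:
spark-submit --class com.something.Driver --master yarn --files /absolute/path/to/log4j.properties --driver-java-options '-Dlog4j.configuration=file:/absolute/path/to/log4j.properties' --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" jarfilename.jar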
