How to choose the queue for Spark job using spark-submit?

How to choose the queue for Spark job using spark-submit? - apache-spark

Is there a way to provide parameters or settings to choose the queue in which I'd like my spark_submit job to run?

By using --queue
So an example of a spark-submit job would be:-
spark-submit --master yarn --conf spark.executor.memory=48G --conf spark.driver.memory=6G --packages [packages separated by ,] --queue [queue_name] --class [class_name] [jar_file] [arguments]

Related

Load properties file in Spark classpath during spark-submit execution

I'm installing the Spark Atlas Connector in a spark submit script (https://github.com/hortonworks-spark/spark-atlas-connector)
Due to security restrictions, I can't put the atlas-application.properties in the spark/conf repository.
I used the two options in the spark-submit :
--driver-class-path "spark.driver.extraClassPath=hdfs:///directory_to_properties_files" \
--conf "spark.executor.extraClassPath=hdfs:///directory_to_properties_files" \
When I launch the spark-submit, I encounter this issue :
20/07/20 11:32:50 INFO ApplicationProperties: Looking for atlas-application.properties in classpath
20/07/20 11:32:50 INFO ApplicationProperties: Looking for /atlas-application.properties in classpath
20/07/20 11:32:50 INFO ApplicationProperties: Loading atlas-application.properties from null

Please find CDP Atals Configuration article.
https://community.cloudera.com/t5/Community-Articles/How-to-pass-atlas-application-properties-configuration-file/ta-p/322158
Client Mode:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --driver-java-options="-Datlas.conf=/tmp/" /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10
Cluster Mode:
sudo -u spark spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --files /tmp/atlas-application.properties --conf spark.driver.extraJavaOptions="-Datlas.conf=./" /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10

Spark metrics sink doesn't expose executor's metrics

I'm using Spark on YARN with
Ambari 2.7.4
HDP Standalone 3.1.4
Spark 2.3.2
Hadoop 3.1.1
Graphite on Docker latest
I was trying to get Spark metrics with Graphite sink following this tutorial.
Advanced spark2-metrics-properties in Ambari are:
driver.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
executor.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
worker.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
master.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=ap-test-m.c.gcp-ps.internal
*.sink.graphite.port=2003
*.sink.graphite.protocol=tcp
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=app-test
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource
Spark submit:
export HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf/; spark-submit --class com.Main --master yarn --deploy-mode client --driver-memory 1g --executor-memory 10g --num-executors 2 --executor-cores 2 spark-app.jar /data
As a result I'm only getting driver metrics.
Also, I was trying to add metrics.properties to spark-submit command together with global spark metrics props, but that didn't help.
And finally, I tried conf in spark-submit and in java SparkConf:
--conf "spark.metrics.conf.driver.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "spark.metrics.conf.executor.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "worker.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "master.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "spark.metrics.conf.*.sink.graphite.host"="host"
--conf "spark.metrics.conf.*.sink.graphite.port"=2003
--conf "spark.metrics.conf.*.sink.graphite.period"=10
--conf "spark.metrics.conf.*.sink.graphite.unit"=seconds
--conf "spark.metrics.conf.*.sink.graphite.prefix"="app-test"
--conf "spark.metrics.conf.*.source.jvm.class"="org.apache.spark.metrics.source.JvmSource"
But that didn't help either.
CSVSink also gives only driver metrics.
UPD
When I submit job in cluster mode - I'm getting the same metrics as in Spark History Server. But the jvm metrics are still absent.

Posting to a dated question, but maybe it will help.
Seems like executors do not have metrics.properties file on their filesystems.
One way to confirm this would be to look at the executor logs:
2020-01-16 10:00:10 ERROR MetricsConfig:91 - Error loading configuration file metrics.properties
java.io.FileNotFoundException: metrics.properties (No such file or directory)
at org.apache.spark.metrics.MetricsConfig.loadPropertiesFromFile(MetricsConfig.scala:132)
at org.apache.spark.metrics.MetricsConfig.initialize(MetricsConfig.scala:55)
at org.apache.spark.metrics.MetricsSystem.<init>(MetricsSystem.scala:95)
at org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:233)
To fix this on yarn pass two parameters to spark-submit:
$ spark-submit \
--files metrics.properties \
--conf spark.metrics.conf=metrics.properties
The --files option ensures that files specified in the option will be shared to executors.
The spark.metrics.conf option specifies a custom file location for the metrics.
Another way to fix the issue would be to place the metrics.properties file into $SPARK_HOME/conf/metrics.properties on both the driver and executor before starting the job.
More on metrics here: https://spark.apache.org/docs/latest/monitoring.html

Submit job to DCOS Spark with multiple instances?

I have two instances of spark in my DCOS cluster, when I submit my job via CLI
dcos spark run --submit-args="\
--driver-cores 8 \
--driver-memory 16384M \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=hdfs://hdfs/history \
--class com.CalcPi \
<url to job -spark-test-assembly-0.0.5-SNAPSHOT.jar> 99000000"`
the job is forever stuck in the queue. But when I have only one instance everything works fine. I have already try
--deploy-mode cluster --supervise

The following config options are hopefully the answer you are looking for:
dcos config set spark.app_id spark-one
dcos spark run ...
dcos config set spark.app_id spark-two
dcos spark run ...

How to pass environment variables to spark driver in cluster mode with spark-submit

spark-submit allows to configure the executor environment variables with --conf spark.executorEnv.FOO=bar, and the Spark REST API allows to pass some environment variables with the environmentVariables field.
Unfortunately I've found nothing similar to configure the environment variable of the driver when submitting the driver with spark-submit in cluster mode:
spark-submit --deploy-mode cluster myapp.jar
Is it possible to set the environment variables of the driver with spark-submit in cluster mode?

On YARN at least, this works:
spark-submit --deploy-mode cluster --conf spark.yarn.appMasterEnv.FOO=bar myapp.jar
It's mentioned in http://spark.apache.org/docs/latest/configuration.html#environment-variables that:
Note: When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file.
I have tested that it can be passed with --conf flag for spark-submit, so that you don't have to edit global conf files.

On Yarn in cluster mode, it worked by adding the environment variables in the spark-submit command using --conf as below-
spark-submit --master yarn-cluster --num-executors 15 --executor-memory 52g --executor-cores 7 --driver-memory 52g --conf "spark.yarn.appMasterEnv.FOO=/Path/foo" --conf "spark.executorEnv.FOO2=/path/foo2" app.jar
Also, you can do it by adding them in conf/spark-defaults.conf file.

You can use the below Classification to set-up environment variables on executor and master node:
[
{
"Classification": "yarn-env",
"Properties": {},
"Configurations": [
{
"Classification": "export",
"Properties": {
"VARIABLE_NAME": VARIABLE_VALUE,
}
}
]
}
]
If you just set spark.yarn.appMasterEnv.FOO = "foo", then the env variable won't be present on executor instances.

Did you test with
--conf spark.driver.FOO="bar"
and then getting value with
spark.conf.get("spark.driver.FOO")

For deploy-mode cluster
As previous answers mentioned, if you want to pass env variable to spark master, you want to use:
--conf spark.yarn.appMasterEnv.FOO=bar // pass bar value to FOO variable
--conf spark.yarn.appMasterEnv.FOO=${FOO} // passing current FOO env variable
--conf spark.yarn.appMasterEnv.FOO2=bar2 // multiple variables are passed separately
Thanks #juhoautio
For deploy-mode client
Even though its not mentioned in the question, if you are starting to doubt your skills and found this SO page in google I just wanted to reassure you.
When using client mode, all the ENV variables from your current SHELL are available for spark master. So basically:
export FOO=bar
export FOO2=bar2
is definitely working, so if you can't access this value for some reason, it must be something else :-)

Yes, That is possible. What are the variables you need you could post that in spark-submit like you're doing?
spark-submit --deploy-mode cluster myapp.jar
Take variables from http://spark.apache.org/docs/latest/configuration.html and depends on your optimization use these. This link could also be helpful.
I used to use in cluster mode but now I'm using in YARN so my variables are as follows: (Hopefully helpful)
hastimal#nm:/usr/local/spark$ ./bin/spark-submit --class com.hastimal.Processing --master yarn-cluster --num-executors 15 --executor-memory 52g --executor-cores 7 --driver-memory 52g --driver-cores 7 --conf spark.default.parallelism=105 --conf spark.driver.maxResultSize=4g --conf spark.network.timeout=300 --conf spark.yarn.executor.memoryOverhead=4608 --conf spark.yarn.driver.memoryOverhead=4608 --conf spark.akka.frameSize=1200 --conf spark.io.compression.codec=lz4 --conf spark.rdd.compress=true --conf spark.broadcast.compress=true --conf spark.shuffle.spill.compress=true --conf spark.shuffle.compress=true --conf spark.shuffle.manager=sort /users/hastimal/Processing.jar Main_Class /inputRDF/rdf_data_all.nt /output /users/hastimal/ /users/hastimal/query.txt index 2
In this, my jar following are arguments of class.
cc /inputData/data_all.txt /output /users/hastimal/
/users/hastimal/query.txt index 2

custom log using spark

I´m trying to configure a custom log using spark-submit, this my configure:
driver:
-DlogsPath=/var/opt/log\
-DlogsFile=spark-submit-driver.log\
-Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties\
spark.driver.extraJavaOptions -> -DlogsPath=/var/opt/log -DlogsFile=spark-submit-driver.log -Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties
executor:
-DlogsPath=/var/opt/log\
-DlogsFile=spark-submit-executor.log\
-Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties\
spark.executor.extraJavaOptions -> -DlogsPath=/var/opt/log -DlogsFile=spark-submit-executor.log -Dlog4j.configuration=jar:file:../bin/myapp.jar!/log4j.properties
The spark-submit-drive.log is created and filled fine but spark-submit-executor.log is not crated
any idea?

Please try using log4j while running your job through spark submit.
Example:
spark-submit -- class com.something.Driver
--master yarn \
--driver-memory 1g \
--executor-memory 1g \
--driver-java-options '-Dlog4j.configuration=file:/absolute path to log4j property file/log4j.properties' \
--conf spark.executor.extraJavaOptions '-Dlog4j.configuration=file:/absolute path to log4j property file/log4j.properties' \
jarfilename.jar
Note: You have to define both the properties with driver-java-options and conf spark.executor.extraJavaOptions, also you can use the default log4j.properties

Please try to use
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties"
or
--file
/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties
The below submit it works for me.
bin/spark-submit --class com.viaplay.log4jtest.log4jtest --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/Users/feng/SparkLog4j/SparkLog4jTest/target/log4j2.properties" --master local[*] /Users/feng/SparkLog4j/SparkLog4jTest/target/SparkLog4jTest-1.0-jar-with-dependencies.jar

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

How to choose the queue for Spark job using spark-submit? - apache-spark

Is there a way to provide parameters or settings to choose the queue in which I'd like my spark_submit job to run?

By using --queue So an example of a spark-submit job would be:- spark-submit --master yarn --conf spark.executor.memory=48G --conf spark.driver.memory=6G --packages [packages separated by ,] --queue [queue_name] --class [class_name] [jar_file] [arguments]

Related

Load properties file in Spark classpath during spark-submit execution

Spark metrics sink doesn't expose executor's metrics

Submit job to DCOS Spark with multiple instances?

How to pass environment variables to spark driver in cluster mode with spark-submit

custom log using spark

Categories

Resources