Where to specify Spark configs when running Spark app in EMR cluster - apache-spark

When I am running Spark app on EMR, what is the difference between adding configs to spark/conf spark-defaults.conf file VS adding them when running spark submit?
For example, If I adding this to my conf spark-defaults.conf :
spark.master yarn
spark.executor.instances 4
spark.executor.memory 29G
spark.executor.cores 3
spark.yarn.executor.memoryOverhead 4096
spark.yarn.driver.memoryOverhead 2048
spark.driver.memory 12G
spark.driver.cores 1
spark.default.parallelism 48
Is that the same as adding it to command line arguments :
Arguments :/home/hadoop/spark/bin/spark-submit --deploy-mode cluster
--master yarn-cluster --conf spark.driver.memory=12G --conf spark.executor.memory=29G --conf spark.executor.cores=3 --conf
spark.executor.instances=4 --conf
spark.yarn.executor.memoryOverhead=4096 --conf
spark.yarn.driver.memoryOverhead=2048 --conf spark.driver.cores=1
--conf spark.default.parallelism=48 --class com.emr.spark.MyApp s3n://mybucket/application/spark/MeSparkApplication.jar
?
And would it be the same if I add this in my Java Code, for example:
SparkConf sparkConf = new SparkConf().setAppName(applicationName);
sparkConf.set("spark.executor.instances", "4");

The difference is in priority. According to spark documentation:
Properties set directly on the SparkConf take highest precedence, then
flags passed to spark-submit or spark-shell, then options in the
spark-defaults.conf file

Related

Spark-submit with properties file for python

I am trying to load spark configuration through the properties file using the following command in Spark 2.4.0.
spark-submit --properties-file props.conf sample.py
It gives the following error
org.apache.spark.SparkException: Dynamic allocation of executors requires the external shuffle service. You may enable this through spark.shuffle.service.enabled.
The props.conf file has this
spark.master yarn
spark.submit.deployMode client
spark.authenticate true
spark.sql.crossJoin.enabled true
spark.dynamicAllocation.enabled true
spark.driver.memory 4g
spark.driver.memoryOverhead 2048
spark.executor.memory 2g
spark.executor.memoryOverhead 2048
Now, when I try to run the same by adding all arguments to the command itself, it works fine.
spark2-submit \
--conf spark.master=yarn \
--conf spark.submit.deployMode=client \
--conf spark.authenticate=true \
--conf spark.sql.crossJoin.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.driver.memory=4g \
--conf spark.driver.memoryOverhead=2048 \
--conf spark.executor.memory=2g \
--conf spark.executor.memoryOverhead=2048 \
sample.py
This works as expected.
I don't think spark supports --properties-file, one workaround is making the change on $SPARK_HOME/conf/spark-defaults.conf, spark will auto load it.
You can refer https://spark.apache.org/docs/latest/submitting-applications.html#loading-configuration-from-a-file

Spark metrics sink doesn't expose executor's metrics

I'm using Spark on YARN with
Ambari 2.7.4
HDP Standalone 3.1.4
Spark 2.3.2
Hadoop 3.1.1
Graphite on Docker latest
I was trying to get Spark metrics with Graphite sink following this tutorial.
Advanced spark2-metrics-properties in Ambari are:
driver.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
executor.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
worker.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
master.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=ap-test-m.c.gcp-ps.internal
*.sink.graphite.port=2003
*.sink.graphite.protocol=tcp
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=app-test
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource
Spark submit:
export HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf/; spark-submit --class com.Main --master yarn --deploy-mode client --driver-memory 1g --executor-memory 10g --num-executors 2 --executor-cores 2 spark-app.jar /data
As a result I'm only getting driver metrics.
Also, I was trying to add metrics.properties to spark-submit command together with global spark metrics props, but that didn't help.
And finally, I tried conf in spark-submit and in java SparkConf:
--conf "spark.metrics.conf.driver.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "spark.metrics.conf.executor.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "worker.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "master.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "spark.metrics.conf.*.sink.graphite.host"="host"
--conf "spark.metrics.conf.*.sink.graphite.port"=2003
--conf "spark.metrics.conf.*.sink.graphite.period"=10
--conf "spark.metrics.conf.*.sink.graphite.unit"=seconds
--conf "spark.metrics.conf.*.sink.graphite.prefix"="app-test"
--conf "spark.metrics.conf.*.source.jvm.class"="org.apache.spark.metrics.source.JvmSource"
But that didn't help either.
CSVSink also gives only driver metrics.
UPD
When I submit job in cluster mode - I'm getting the same metrics as in Spark History Server. But the jvm metrics are still absent.
Posting to a dated question, but maybe it will help.
Seems like executors do not have metrics.properties file on their filesystems.
One way to confirm this would be to look at the executor logs:
2020-01-16 10:00:10 ERROR MetricsConfig:91 - Error loading configuration file metrics.properties
java.io.FileNotFoundException: metrics.properties (No such file or directory)
at org.apache.spark.metrics.MetricsConfig.loadPropertiesFromFile(MetricsConfig.scala:132)
at org.apache.spark.metrics.MetricsConfig.initialize(MetricsConfig.scala:55)
at org.apache.spark.metrics.MetricsSystem.<init>(MetricsSystem.scala:95)
at org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:233)
To fix this on yarn pass two parameters to spark-submit:
$ spark-submit \
--files metrics.properties \
--conf spark.metrics.conf=metrics.properties
The --files option ensures that files specified in the option will be shared to executors.
The spark.metrics.conf option specifies a custom file location for the metrics.
Another way to fix the issue would be to place the metrics.properties file into $SPARK_HOME/conf/metrics.properties on both the driver and executor before starting the job.
More on metrics here: https://spark.apache.org/docs/latest/monitoring.html

Correct spark configuration to fully utilise EMR cluster resources

I'm quite new to configuring spark, so wanted to know whether I am fully utilising my EMR cluster.
The EMR cluster is using spark 2.4 and hadoop 2.8.5.
The app reads loads of small gzipped json files from s3, transforms the data and writes them back out to s3.
I've read various articles, but I was hoping I could get my configuration double checked in case there were set settings that conflict with each other or something.
I'm using a c4.8xlarge cluster with each of the 3 worker nodes having 36 cpu cores and 60gb of ram.
So that's 108 cpu cores and 180gb of ram overall.
Here is my spark-submit settings that I paste in the EMR add step box:
--class com.example.app
--master yarn
--driver-memory 12g
--executor-memory 3g
--executor-cores 3
--num-executors 33
--conf spark.executor.memory=5g
--conf spark.executor.cores=3
--conf spark.executor.instances=33
--conf spark.driver.cores=16
--conf spark.driver.memory=12g
--conf spark.default.parallelism=200
--conf spark.sql.shuffle.partitions=500
--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
--conf spark.speculation=false
--conf spark.yarn.am.memory=1g
--conf spark.executor.heartbeatInterval=360000
--conf spark.network.timeout=420000
--conf spark.hadoop.fs.hdfs.impl.disable.cache=true
--conf spark.kryoserializer.buffer.max=512m
--conf spark.shuffle.consolidateFiles=true
--conf spark.hadoop.fs.s3a.multiobjectdelete.enable=false
--conf spark.hadoop.fs.s3a.fast.upload=true
--conf spark.worker.instances=3

Dynamic Resource allocation in Spark-Yarn Cluster Mode

When i use the below setting to start the spark application (default is yarn-client mode) works fine
spark_memory_setting="--master yarn --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.yarn.queue=ciqhigh --conf spark.dynamicAllocation.initialExecutors=50 --conf spark.dynamicAllocation.maxExecutors=50 --executor-memory 2G --driver-memory 4G"
ISSUE
Whereas when i change the deploy mode as cluster,application not starting up. Not even throwing any error to move on.
spark_memory_setting="--master yarn-cluster --deploy-mode=cluster --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.yarn.queue=ciqhigh --conf spark.dynamicAllocation.initialExecutors=50 --conf spark.dynamicAllocation.maxExecutors=50 --executor-memory 2G --driver-memory 4G"
LOG
18/01/08 01:21:00 WARN Client: spark.yarn.am.extraJavaOptions will not
take effect in cluster mode
This is the last line from the logger.
Any suggestions most welcome.
One important think to highlight here, the spark application which am trying to deploy starts the apache thrift server. After my searching, i think its coz of thrift am not able to able to run yarn in cluster mode. Any help to run in cluster mode.
the option --master yarn-cluster is wrong.. this is not a valid master url it should be just "yarn" instead of "yarn-cluster".. just cross check..

How to execute Spark programs with Dynamic Resource Allocation?

I am using spark-summit command for executing Spark jobs with parameters such as:
spark-submit --master yarn-cluster --driver-cores 2 \
--driver-memory 2G --num-executors 10 \
--executor-cores 5 --executor-memory 2G \
--class com.spark.sql.jdbc.SparkDFtoOracle2 \
Spark-hive-sql-Dataframe-0.0.1-SNAPSHOT-jar-with-dependencies.jar
Now i want to execute the same program using Spark's Dynamic Resource allocation. Could you please help with the usage of Dynamic Resource Allocation in executing Spark programs.
In Spark dynamic allocation spark.dynamicAllocation.enabled needs to be set to true because it's false by default.
This requires spark.shuffle.service.enabled to be set to true, as spark application is running on YARN. Check this link to start the shuffle service on each NodeManager in YARN.
The following configurations are also relevant:
spark.dynamicAllocation.minExecutors,
spark.dynamicAllocation.maxExecutors, and
spark.dynamicAllocation.initialExecutors
These options can be configured to Spark application in 3 ways
1. From Spark submit with --conf <prop_name>=<prop_value>
spark-submit --master yarn-cluster \
--driver-cores 2 \
--driver-memory 2G \
--num-executors 10 \
--executor-cores 5 \
--executor-memory 2G \
--conf spark.dynamicAllocation.minExecutors=5 \
--conf spark.dynamicAllocation.maxExecutors=30 \
--conf spark.dynamicAllocation.initialExecutors=10 \ # same as --num-executors 10
--class com.spark.sql.jdbc.SparkDFtoOracle2 \
Spark-hive-sql-Dataframe-0.0.1-SNAPSHOT-jar-with-dependencies.jar
2. Inside Spark program with SparkConf
Set the properties in SparkConf then create SparkSession or SparkContext with it
val conf: SparkConf = new SparkConf()
conf.set("spark.dynamicAllocation.minExecutors", "5");
conf.set("spark.dynamicAllocation.maxExecutors", "30");
conf.set("spark.dynamicAllocation.initialExecutors", "10");
.....
3. spark-defaults.conf usually located in $SPARK_HOME/conf/
Place the same configurations in spark-defaults.conf to apply for all spark applications if no configuration is passed from command-line as well as code.
Spark - Dynamic Allocation Confs
I just did a small demo with Spark's dynamic resource allocation. The code is on my Github. Specifically, the demo is in this release.

Resources