Dynamic Resource allocation in Spark-Yarn Cluster Mode - apache-spark

When i use the below setting to start the spark application (default is yarn-client mode) works fine
spark_memory_setting="--master yarn --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.yarn.queue=ciqhigh --conf spark.dynamicAllocation.initialExecutors=50 --conf spark.dynamicAllocation.maxExecutors=50 --executor-memory 2G --driver-memory 4G"
ISSUE
Whereas when i change the deploy mode as cluster,application not starting up. Not even throwing any error to move on.
spark_memory_setting="--master yarn-cluster --deploy-mode=cluster --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.yarn.queue=ciqhigh --conf spark.dynamicAllocation.initialExecutors=50 --conf spark.dynamicAllocation.maxExecutors=50 --executor-memory 2G --driver-memory 4G"
LOG
18/01/08 01:21:00 WARN Client: spark.yarn.am.extraJavaOptions will not
take effect in cluster mode
This is the last line from the logger.
Any suggestions most welcome.
One important think to highlight here, the spark application which am trying to deploy starts the apache thrift server. After my searching, i think its coz of thrift am not able to able to run yarn in cluster mode. Any help to run in cluster mode.

the option --master yarn-cluster is wrong.. this is not a valid master url it should be just "yarn" instead of "yarn-cluster".. just cross check..

Related

SparkSubmit launched by Kubernetes Spark Operator stuck with no error details

In K8S Spark Operator, submitted job are getting stuck at Java thread, at the following command with no error details:
/opt/tools/Linux/jdk/openjdk1.8.0.332_8.62.0.20_x64/bin/java -cp /opt/spark/conf/:/opt/spark/jars/ org.apache.spark.deploy.SparkSubmit* --master k8s://https://x.y.z.a:443 --deploy-mode cluster --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent --conf spark.executor.memory=512m --conf spark.driver.memory=512m --conf spark.network.crypto.enabled=true --conf spark.driver.cores=0.100000 --conf spark.io.encryption.enabled=true --conf spark.kubernetes.driver.limit.cores=200m --conf spark.kubernetes.driver.label.version=3.0.1 --conf spark.app.name=sparkimpersonationx42aa8bff --conf spark.kubernetes.submission.waitAppCompletion=false --conf spark.executor.cores=1 --conf spark.authenticate=true --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.namespace=abc --conf spark.kubernetes.container.image=placeholder:94 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=b651fb42-90fd-4675-8e2f-9b4b6e380010 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=sparkimpersonationx42aa8bff --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=b651fb42-90fd-4675-8e2f-9b4b6e380010 --conf spark.kubernetes.driver.pod.name=sparkimpersonationx42aa8bff-driver --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-driver-abc --conf spark.executor.instances=1 --conf spark.kubernetes.executor.label.version=3.0.1 --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=sparkimpersonationx42aa8bff --class org.apache.spark.examples.SparkPi --jars local:///sample-apps/sample-basic-spark-operator/extra-jars/* local:///sample-apps/sample-basic-spark-operator/sample-basic-spark-operator.jar
From the available information, the causes for this can be:
The workload pods cannot get scheduled on your k8s nodes. You can check this with kubectl get pods, are the pods Running?
The resource limits have been reached and the pods are unresponsive.
The spark-operator itself might not be running. You should check the logs for the operator itself.
That's all I can say from what is available.

Why does spark-shell/pyspark gets less resources compared to spark-submit

I am running a PySpark code which runs on a single node due to code requirements.
But when I run the code using PySpark
pyspark --master yarn --executor-cores 1 --driver-memory 60g --executor-memory 5g --conf spark.driver.maxResultSize=4g
I get Segment error and the code fails, when I checked in Yarn I saw that my job was not getting resources even when they were available.
But when I use spark-submit
spark-submit --master yarn --executor-cores 1 --driver-memory 60g --executor-memory 5g --conf spark.driver.maxResultSize=4g code.py
the code gets all the resources it needs and the code runs perfectly.
Am I missing some fundamental aspect of spark or why is this happening?

Spark metrics sink doesn't expose executor's metrics

I'm using Spark on YARN with
Ambari 2.7.4
HDP Standalone 3.1.4
Spark 2.3.2
Hadoop 3.1.1
Graphite on Docker latest
I was trying to get Spark metrics with Graphite sink following this tutorial.
Advanced spark2-metrics-properties in Ambari are:
driver.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
executor.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
worker.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
master.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=ap-test-m.c.gcp-ps.internal
*.sink.graphite.port=2003
*.sink.graphite.protocol=tcp
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=app-test
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource
Spark submit:
export HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf/; spark-submit --class com.Main --master yarn --deploy-mode client --driver-memory 1g --executor-memory 10g --num-executors 2 --executor-cores 2 spark-app.jar /data
As a result I'm only getting driver metrics.
Also, I was trying to add metrics.properties to spark-submit command together with global spark metrics props, but that didn't help.
And finally, I tried conf in spark-submit and in java SparkConf:
--conf "spark.metrics.conf.driver.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "spark.metrics.conf.executor.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "worker.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "master.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "spark.metrics.conf.*.sink.graphite.host"="host"
--conf "spark.metrics.conf.*.sink.graphite.port"=2003
--conf "spark.metrics.conf.*.sink.graphite.period"=10
--conf "spark.metrics.conf.*.sink.graphite.unit"=seconds
--conf "spark.metrics.conf.*.sink.graphite.prefix"="app-test"
--conf "spark.metrics.conf.*.source.jvm.class"="org.apache.spark.metrics.source.JvmSource"
But that didn't help either.
CSVSink also gives only driver metrics.
UPD
When I submit job in cluster mode - I'm getting the same metrics as in Spark History Server. But the jvm metrics are still absent.
Posting to a dated question, but maybe it will help.
Seems like executors do not have metrics.properties file on their filesystems.
One way to confirm this would be to look at the executor logs:
2020-01-16 10:00:10 ERROR MetricsConfig:91 - Error loading configuration file metrics.properties
java.io.FileNotFoundException: metrics.properties (No such file or directory)
at org.apache.spark.metrics.MetricsConfig.loadPropertiesFromFile(MetricsConfig.scala:132)
at org.apache.spark.metrics.MetricsConfig.initialize(MetricsConfig.scala:55)
at org.apache.spark.metrics.MetricsSystem.<init>(MetricsSystem.scala:95)
at org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:233)
To fix this on yarn pass two parameters to spark-submit:
$ spark-submit \
--files metrics.properties \
--conf spark.metrics.conf=metrics.properties
The --files option ensures that files specified in the option will be shared to executors.
The spark.metrics.conf option specifies a custom file location for the metrics.
Another way to fix the issue would be to place the metrics.properties file into $SPARK_HOME/conf/metrics.properties on both the driver and executor before starting the job.
More on metrics here: https://spark.apache.org/docs/latest/monitoring.html

Correct spark configuration to fully utilise EMR cluster resources

I'm quite new to configuring spark, so wanted to know whether I am fully utilising my EMR cluster.
The EMR cluster is using spark 2.4 and hadoop 2.8.5.
The app reads loads of small gzipped json files from s3, transforms the data and writes them back out to s3.
I've read various articles, but I was hoping I could get my configuration double checked in case there were set settings that conflict with each other or something.
I'm using a c4.8xlarge cluster with each of the 3 worker nodes having 36 cpu cores and 60gb of ram.
So that's 108 cpu cores and 180gb of ram overall.
Here is my spark-submit settings that I paste in the EMR add step box:
--class com.example.app
--master yarn
--driver-memory 12g
--executor-memory 3g
--executor-cores 3
--num-executors 33
--conf spark.executor.memory=5g
--conf spark.executor.cores=3
--conf spark.executor.instances=33
--conf spark.driver.cores=16
--conf spark.driver.memory=12g
--conf spark.default.parallelism=200
--conf spark.sql.shuffle.partitions=500
--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
--conf spark.speculation=false
--conf spark.yarn.am.memory=1g
--conf spark.executor.heartbeatInterval=360000
--conf spark.network.timeout=420000
--conf spark.hadoop.fs.hdfs.impl.disable.cache=true
--conf spark.kryoserializer.buffer.max=512m
--conf spark.shuffle.consolidateFiles=true
--conf spark.hadoop.fs.s3a.multiobjectdelete.enable=false
--conf spark.hadoop.fs.s3a.fast.upload=true
--conf spark.worker.instances=3

How to pass environment variables to spark driver in cluster mode with spark-submit

spark-submit allows to configure the executor environment variables with --conf spark.executorEnv.FOO=bar, and the Spark REST API allows to pass some environment variables with the environmentVariables field.
Unfortunately I've found nothing similar to configure the environment variable of the driver when submitting the driver with spark-submit in cluster mode:
spark-submit --deploy-mode cluster myapp.jar
Is it possible to set the environment variables of the driver with spark-submit in cluster mode?
On YARN at least, this works:
spark-submit --deploy-mode cluster --conf spark.yarn.appMasterEnv.FOO=bar myapp.jar
It's mentioned in http://spark.apache.org/docs/latest/configuration.html#environment-variables that:
Note: When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file.
I have tested that it can be passed with --conf flag for spark-submit, so that you don't have to edit global conf files.
On Yarn in cluster mode, it worked by adding the environment variables in the spark-submit command using --conf as below-
spark-submit --master yarn-cluster --num-executors 15 --executor-memory 52g --executor-cores 7 --driver-memory 52g --conf "spark.yarn.appMasterEnv.FOO=/Path/foo" --conf "spark.executorEnv.FOO2=/path/foo2" app.jar
Also, you can do it by adding them in conf/spark-defaults.conf file.
You can use the below Classification to set-up environment variables on executor and master node:
[
{
"Classification": "yarn-env",
"Properties": {},
"Configurations": [
{
"Classification": "export",
"Properties": {
"VARIABLE_NAME": VARIABLE_VALUE,
}
}
]
}
]
If you just set spark.yarn.appMasterEnv.FOO = "foo", then the env variable won't be present on executor instances.
Did you test with
--conf spark.driver.FOO="bar"
and then getting value with
spark.conf.get("spark.driver.FOO")
For deploy-mode cluster
As previous answers mentioned, if you want to pass env variable to spark master, you want to use:
--conf spark.yarn.appMasterEnv.FOO=bar // pass bar value to FOO variable
--conf spark.yarn.appMasterEnv.FOO=${FOO} // passing current FOO env variable
--conf spark.yarn.appMasterEnv.FOO2=bar2 // multiple variables are passed separately
Thanks #juhoautio
For deploy-mode client
Even though its not mentioned in the question, if you are starting to doubt your skills and found this SO page in google I just wanted to reassure you.
When using client mode, all the ENV variables from your current SHELL are available for spark master. So basically:
export FOO=bar
export FOO2=bar2
is definitely working, so if you can't access this value for some reason, it must be something else :-)
Yes, That is possible. What are the variables you need you could post that in spark-submit like you're doing?
spark-submit --deploy-mode cluster myapp.jar
Take variables from http://spark.apache.org/docs/latest/configuration.html and depends on your optimization use these. This link could also be helpful.
I used to use in cluster mode but now I'm using in YARN so my variables are as follows: (Hopefully helpful)
hastimal#nm:/usr/local/spark$ ./bin/spark-submit --class com.hastimal.Processing --master yarn-cluster --num-executors 15 --executor-memory 52g --executor-cores 7 --driver-memory 52g --driver-cores 7 --conf spark.default.parallelism=105 --conf spark.driver.maxResultSize=4g --conf spark.network.timeout=300 --conf spark.yarn.executor.memoryOverhead=4608 --conf spark.yarn.driver.memoryOverhead=4608 --conf spark.akka.frameSize=1200 --conf spark.io.compression.codec=lz4 --conf spark.rdd.compress=true --conf spark.broadcast.compress=true --conf spark.shuffle.spill.compress=true --conf spark.shuffle.compress=true --conf spark.shuffle.manager=sort /users/hastimal/Processing.jar Main_Class /inputRDF/rdf_data_all.nt /output /users/hastimal/ /users/hastimal/query.txt index 2
In this, my jar following are arguments of class.
cc /inputData/data_all.txt /output /users/hastimal/
/users/hastimal/query.txt index 2

Resources