How to specify the spark.authenticate.secret value while executing a Spark program - apache-spark

I got this error when I launched my Spark program with deploy-mode: client.
java.lang.IllegalArgumentException: Error: A secret key must be specified via the spark.authenticate.secret config.
deploy-mode: client
master: YARN.
How do I resolve this error? I have no clue.

I got a similar error in yarn-cluster mode; the following helped me:
https://issues.apache.org/jira/browse/SPARK-23476
If spark.authenticate=true is specified as a cluster-wide config, then the following has to be added
--conf "spark.authenticate=false" --conf "spark.shuffle.service.enabled=false" --conf "spark.dynamicAllocation.enabled=false" --conf "spark.network.crypto.enabled=false" --conf "spark.authenticate.enableSaslEncryption=false"
to the spark-submit command.
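For example, a full client-mode submit with these overrides might look like the following; the application class and JAR names here are placeholders, not from the original question:
spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf "spark.authenticate=false" \
  --conf "spark.shuffle.service.enabled=false" \
  --conf "spark.dynamicAllocation.enabled=false" \
  --conf "spark.network.crypto.enabled=false" \
  --conf "spark.authenticate.enableSaslEncryption=false" \
  --class com.example.MyApp \
  my-app.jar
Alternatively, if your cluster policy requires authentication, keep spark.authenticate=true and supply the secret explicitly, e.g. --conf spark.authenticate.secret=<your-secret>.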

Related

Spark metrics sink doesn't expose executor's metrics

I'm using Spark on YARN with
Ambari 2.7.4
HDP Standalone 3.1.4
Spark 2.3.2
Hadoop 3.1.1
Graphite on Docker latest
I was trying to get Spark metrics with a Graphite sink, following this tutorial.
Advanced spark2-metrics-properties in Ambari are:
driver.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
executor.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
worker.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
master.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=ap-test-m.c.gcp-ps.internal
*.sink.graphite.port=2003
*.sink.graphite.protocol=tcp
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=app-test
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource
Spark submit:
export HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf/; spark-submit --class com.Main --master yarn --deploy-mode client --driver-memory 1g --executor-memory 10g --num-executors 2 --executor-cores 2 spark-app.jar /data
As a result I'm only getting driver metrics.
Also, I tried adding metrics.properties to the spark-submit command together with the global Spark metrics properties, but that didn't help.
And finally, I tried passing the settings via --conf in spark-submit and in the Java SparkConf:
--conf "spark.metrics.conf.driver.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "spark.metrics.conf.executor.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "worker.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "master.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "spark.metrics.conf.*.sink.graphite.host"="host"
--conf "spark.metrics.conf.*.sink.graphite.port"=2003
--conf "spark.metrics.conf.*.sink.graphite.period"=10
--conf "spark.metrics.conf.*.sink.graphite.unit"=seconds
--conf "spark.metrics.conf.*.sink.graphite.prefix"="app-test"
--conf "spark.metrics.conf.*.source.jvm.class"="org.apache.spark.metrics.source.JvmSource"
But that didn't help either.
CSVSink also gives only driver metrics.
UPD
When I submit the job in cluster mode, I get the same metrics as in the Spark History Server, but the JVM metrics are still absent.
Posting to a dated question, but maybe it will help.
It seems like the executors do not have a metrics.properties file on their filesystems.
One way to confirm this would be to look at the executor logs:
2020-01-16 10:00:10 ERROR MetricsConfig:91 - Error loading configuration file metrics.properties
java.io.FileNotFoundException: metrics.properties (No such file or directory)
at org.apache.spark.metrics.MetricsConfig.loadPropertiesFromFile(MetricsConfig.scala:132)
at org.apache.spark.metrics.MetricsConfig.initialize(MetricsConfig.scala:55)
at org.apache.spark.metrics.MetricsSystem.<init>(MetricsSystem.scala:95)
at org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:233)
To fix this on YARN, pass two parameters to spark-submit:
$ spark-submit \
--files metrics.properties \
--conf spark.metrics.conf=metrics.properties
The --files option ensures that the files listed there are distributed to the executors.
The spark.metrics.conf option specifies a custom file location for the metrics.
Another way to fix the issue would be to place the metrics.properties file into $SPARK_HOME/conf/metrics.properties on both the driver and executor before starting the job.
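For reference, a minimal metrics.properties along these lines, reusing the host/port/prefix values from the question as placeholders, could look like this; shipping it with --files and pointing spark.metrics.conf at it lets the executors pick it up too:
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=ap-test-m.c.gcp-ps.internal
*.sink.graphite.port=2003
*.sink.graphite.protocol=tcp
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=app-test
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource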
More on metrics here: https://spark.apache.org/docs/latest/monitoring.html

How to share files on master node to executors in Spark, How to use --files argument?

Can someone please explain how I can ship files from the master to all executors using the --files argument in spark-submit?
/bin/spark-submit --master yarn --queue development --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=128G --files /keras/mnist.npz
But this gives me an error. I am new to Spark.
Exception in thread "main" java.lang.IllegalArgumentException: Missing application resource.
The command is missing the application JAR and its main class; that is what "Missing application resource" refers to. Find more details in Running Spark on YARN.
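As a sketch, the command needs at least a main class and an application JAR at the end; the class and JAR names below are placeholders:
/bin/spark-submit --master yarn --queue development \
  --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=128G \
  --files /keras/mnist.npz \
  --class com.example.MyApp my-app.jar
On the executors, a file shipped this way can then be resolved by name, for example via SparkFiles.get("mnist.npz").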

spark.executor.extraJavaOptions ignored in spark-submit

I am a newbie trying to profile a local spark job.
Here is the command I am trying to execute, but I get a warning stating that my executor options are being ignored because they are non-Spark config properties.
error:
Warning: Ignoring non-spark config property: “spark.executor.extraJavaOptions=javaagent:statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar=server=localhost,port=8086,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=MyNamespace.MySparkApplication,tagMapping=namespace.application”
Command:
./bin/spark-submit --master local[2] --class org.apache.spark.examples.GroupByTest --conf “spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar=server=localhost,port=8086,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=MyNamespace.MySparkApplication,tagMapping=namespace.application” --name HdfsWordCount --jars /Users/shprin/statD/statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar libexec/examples/jars/spark-examples_2.11-2.3.0.jar
Spark version : 2.0.3
Please let me know, how to solve this.
Thanks in Advance.
I think the problem is the quotes you are using around spark.executor.extraJavaOptions: they are typographic (curly) double quotes, which the shell does not treat as quoting characters. Use plain single quotes instead.
./bin/spark-submit --master local[2] --conf 'spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar=server=localhost,port=8086,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=MyNamespace.MySparkApplication,tagMapping=namespace.application' --class org.apache.spark.examples.GroupByTest --name HdfsWordCount --jars /Users/shprin/statD/statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar libexec/examples/jars/spark-examples_2.11-2.3.0.jar
Apart from the answers above, if your parameter contains both spaces and single quotes (for instance a query parameter), you should enclose it in escaped double quotes \".
Example:
spark-submit --master yarn --deploy-mode cluster --conf "spark.driver.extraJavaOptions=-DfileFormat=PARQUET -Dquery=\"select * from bucket where code in ('A')\" -Dchunk=yes" spark-app.jar

How to set spark.driver.extraClassPath through Apache Livy on Azure Spark cluster?

I would like to add some configuration when a Spark job is submitted via Apache Livy into an Azure cluster. Currently, to launch a Spark job via Apache Livy in the cluster, I use the following command:
curl -X POST --data '{"file": "/home/xxx/lib/MyJar.jar", "className": "org.springframework.boot.loader.JarLauncher"}' -H "Content-Type: application/json" localhost:8998/batches
This command generates the following process:
……. org.apache.spark.deploy.SparkSubmit --conf spark.master=yarn-cluster --conf spark.yarn.tags=livy-batch-51-qHXmHXWg --conf spark.yarn.submit.waitAppCompletion=false --class org.springframework.boot.loader.JarLauncher adl://home/home/xxx/lib/MyJar.jar
Due to a technical issue when running the jar, I need to introduce two configurations into this command:
--conf "spark.driver.extraClassPath=/home/xxx/lib /jars/*"
--conf "spark.executor.extraClassPath=/home/xxx/lib/jars/*"
It's related to a logback issue when running on Spark, which uses log4j2; the extra classpath adds the logback jars.
I found here https://groups.google.com/a/cloudera.org/forum/#!topic/hue-user/fcRM3YiqAAA that it can be done by adding this conf to LIVY_SERVER_JAVA_OPTS or spark-defaults.conf
From Ambari I modified LIVY_SERVER_JAVA_OPTS in livy-env.sh (in the spark2 & livy menu) and
Advanced spark2-defaults in Spark2.
Unfortunately this is not working on our side, even though I can see that the Livy server is launched with -Dspark.driver.extraClassPath.
Is there any specific configuration to add in Azure Hdinsight to make it working?
Note that the process should be like
……. org.apache.spark.deploy.SparkSubmit --conf spark.master=yarn-cluster --conf spark.yarn.tags=livy-batch-51-qHXmHXWg --conf spark.yarn.submit.waitAppCompletion=false **--conf "spark.driver.extraClassPath=/home/xxx/lib /jars/*" --conf "spark.executor.extraClassPath=/home/xxx/lib/jars/*"**
--class org.springframework.boot.loader.JarLauncher adl://home/home/xxx/lib/MyJar.jar
Thx
Add the following to the Livy request body:
"conf":{ "spark.driver.extraClassPath":"wasbs:///pathtojar.jar","spark.yarn.user.classpath.first":"true"}

Dynamic Resource allocation in Spark-Yarn Cluster Mode

When I use the setting below to start the Spark application (default is yarn-client mode), it works fine:
spark_memory_setting="--master yarn --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.yarn.queue=ciqhigh --conf spark.dynamicAllocation.initialExecutors=50 --conf spark.dynamicAllocation.maxExecutors=50 --executor-memory 2G --driver-memory 4G"
ISSUE
Whereas when I change the deploy mode to cluster, the application does not start up. It does not even throw any error to move on.
spark_memory_setting="--master yarn-cluster --deploy-mode=cluster --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.yarn.queue=ciqhigh --conf spark.dynamicAllocation.initialExecutors=50 --conf spark.dynamicAllocation.maxExecutors=50 --executor-memory 2G --driver-memory 4G"
LOG
18/01/08 01:21:00 WARN Client: spark.yarn.am.extraJavaOptions will not take effect in cluster mode
This is the last line from the logger.
Any suggestions most welcome.
One important thing to highlight here: the Spark application I am trying to deploy starts the Apache Thrift server. After my searching, I think it's because of Thrift that I am not able to run in YARN cluster mode. Any help to run in cluster mode is appreciated.
The option --master yarn-cluster is wrong; it is not a valid master URL. It should be just "yarn" (combined with --deploy-mode cluster) instead of "yarn-cluster". Just cross-check.
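Under that assumption, a corrected version of the setting from the question would be something like:
spark_memory_setting="--master yarn --deploy-mode cluster --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.yarn.queue=ciqhigh --conf spark.dynamicAllocation.initialExecutors=50 --conf spark.dynamicAllocation.maxExecutors=50 --executor-memory 2G --driver-memory 4G"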