Spark YARN on EMR - JavaSparkContext - IllegalStateException: Library directory does not exist

I have a Java Spark job that works on a manually deployed Spark 1.6.0 in standalone mode on an EC2 instance.
I am submitting this job with spark-submit to an EMR 5.3.0 cluster, from the master node, using YARN, but it fails.
The spark-submit line is:
spark-submit --class <startclass> --master yarn --queue default --deploy-mode cluster --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://`hostname -f`:8020/tmp/ourSparkLogs --driver-memory 4G --executor-memory 4G --executor-cores 2 hdfs://`hostname -f`:8020/data/x.jar yarn-client
The trailing "yarn-client" is the first argument to the x.jar application, which passes it to the SparkConf via setMaster:
conf.setMaster(args[0]);
When I submit it, it starts out running fine, until I initialize the JavaSparkContext from a SparkConf,
JavaSparkContext sc = new JavaSparkContext(conf);
... and then Spark crashes.
In the YARN log, I can see the following,
yarn logs -applicationId application_1487325147456_0051
...
17/02/17 16:27:13 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/02/17 16:27:13 INFO Client: Deleted staging directory hdfs://ip-172-31-8-237.eu-west-1.compute.internal:8020/user/ec2-user/.sparkStaging/application_1487325147456_0052
17/02/17 16:27:13 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalStateException: Library directory '/mnt/yarn/usercache/ec2-user/appcache/application_1487325147456_0051/container_1487325147456_0051_01_000001/assembly/target/scala-2.11/jars' does not exist; make sure Spark is built.
...
Noting the WARN about spark.yarn.jars not being set, I found a Spark YARN JAR file in
/usr/lib/spark/jars/
... and uploaded it to HDFS, following Cloudera's guide on running Spark applications on YARN, and added that conf, so my spark-submit line became:
spark-submit --class <startclass> --master yarn --queue default --deploy-mode cluster --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://`hostname -f`:8020/tmp/ourSparkLogs --conf spark.yarn.jars=hdfs://`hostname -f`:8020/sparkyarnlibs/spark-yarn_2.11-2.1.0.jar --driver-memory 4G --executor-memory 4G --executor-cores 2 hdfs://`hostname -f`:8020/data/x.jar yarn-client
But that did not work and gave this:
Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
I am puzzled as to what causes that library directory error and how to proceed from here.

You have specified "--deploy-mode cluster" and yet are calling conf.setMaster("yarn-client") from the code. Using a master URL of "yarn-client" means "use YARN as the master, and use client mode (not cluster mode)", so I wouldn't be surprised if this is somehow confusing Spark because on one hand you're telling it to use cluster mode and on the other you're telling it to use client mode.
By the way, using a master URL like "yarn-client" or "yarn-cluster" is actually deprecated because the "-client" or "-cluster" part is not really part of the Master but rather is the deploy mode. That is, "--master yarn-client" is really more of a shortcut/alias for "--master yarn --deploy-mode client", and similarly "--master yarn-cluster" just means "--master yarn --deploy-mode cluster".
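The alias relationship described above can be sketched in a few lines of plain Python. This is only an illustration of the mapping, not Spark's own parsing code; the function names `split_legacy_master` and `spark_submit_flags` are hypothetical helpers invented for this example.

```python
# Illustrative only: models the alias described above, not Spark internals.
# Both function names are hypothetical.

LEGACY_MASTERS = {
    "yarn-client": ("yarn", "client"),
    "yarn-cluster": ("yarn", "cluster"),
}


def split_legacy_master(master_url):
    """Return (master, deploy_mode); deploy_mode is None if the URL encodes none."""
    return LEGACY_MASTERS.get(master_url, (master_url, None))


def spark_submit_flags(master_url):
    """Build the explicit --master / --deploy-mode flags for a master URL."""
    master, deploy_mode = split_legacy_master(master_url)
    flags = ["--master", master]
    if deploy_mode is not None:
        flags += ["--deploy-mode", deploy_mode]
    return flags


print(spark_submit_flags("yarn-client"))
```

Seen this way, passing "--deploy-mode cluster" on the command line while the code calls setMaster("yarn-client") asks for two different deploy modes at once.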
My recommendation would be to not call conf.setMaster() from your code, since the master is already set to "yarn" automatically in /etc/spark/conf/spark-defaults.conf. For this reason, you also don't need to pass "--master yarn" to spark-submit.
Lastly, it sounds like you need to decide whether you really want to use client deploy mode or cluster deploy mode. With client deploy mode, the driver runs on the master instance, and with cluster deploy mode, the driver runs in a YARN container on one of the core/task instances. See https://spark.apache.org/docs/latest/running-on-yarn.html for more information.
If you want to use client deploy mode, you don't need to pass anything extra because it's already the default. If you want to use cluster deploy mode, pass "--deploy-mode cluster".

Related

FileNotFound error when running spark-submit

I am trying to run the spark-submit command on my Hadoop cluster
Here is a summary of my Hadoop Cluster:
The cluster is built using 5 VirtualBox VM's connected on an internal network
There is 1 namenode and 4 datanodes created.
All the VM's were built from the Bitnami Hadoop Stack VirtualBox image
When I run the following command:
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar 10
I receive the following error:
java.io.FileNotFoundException: File file:/home/bitnami/sparkStaging/bitnami/.sparkStaging/application_1658417340986_0002/__spark_conf__.zip does not exist
I also get a similar error when trying to create a sparkSession using PySpark:
spark = SparkSession.builder.appName('appName').getOrCreate()
I have tried/verified the following:
The environment variables HADOOP_HOME, SPARK_HOME and HADOOP_CONF_DIR have been set in my .bashrc file
SPARK_DIST_CLASSPATH and HADOOP_CONF_DIR have been defined in spark-env.sh
Added spark.master yarn, spark.yarn.stagingDir file:///home/bitnami/sparkStaging and spark.yarn.jars file:///opt/bitnami/hadoop/spark/jars/ in spark-defaults.conf
I believe spark.yarn.stagingDir needs to be an HDFS path.
More specifically, the YARN staging directory needs to be reachable from every node that runs a Spark container; a local file path on the machine where you run spark-submit is not enough.
The path that isn't found is being reported from the YARN cluster, where /home/bitnami might not exist, or the Unix user running the Spark executor containers does not have access to that path.
Similarly, spark.yarn.jars (or spark.yarn.archive) should be HDFS paths because these will get downloaded, in parallel, across all executors.
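As a rough way to spot this kind of misconfiguration, the sketch below flags YARN-related settings whose value uses an explicit file: scheme. It is a plain-Python illustration, not part of Spark; the function name `node_local_settings` and the host name `namenode` are made up for the example. It does not handle comma-separated lists, and it deliberately ignores scheme-less paths, since those are resolved against the cluster's default filesystem. A file: path can still work if it points at storage mounted identically on every node, so treat a hit as a hint rather than proof.

```python
from urllib.parse import urlparse

# Settings from the discussion above whose paths are read by YARN containers.
SHARED_PATH_KEYS = ("spark.yarn.stagingDir", "spark.yarn.jars", "spark.yarn.archive")


def node_local_settings(conf):
    """Return the keys in `conf` whose value uses an explicit file: scheme.

    Hypothetical helper: `conf` is a plain dict of Spark property names to values.
    """
    return [
        key
        for key in SHARED_PATH_KEYS
        if key in conf and urlparse(conf[key]).scheme == "file"
    ]


# The values reported in the question.
question_conf = {
    "spark.master": "yarn",
    "spark.yarn.stagingDir": "file:///home/bitnami/sparkStaging",
    "spark.yarn.jars": "file:///opt/bitnami/hadoop/spark/jars/",
}
print(node_local_settings(question_conf))
```

With the question's settings, both spark.yarn.stagingDir and spark.yarn.jars are reported; an hdfs:// value would not be.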
Since the Spark job is supposed to be submitted to the Hadoop cluster managed by YARN, master and deploy-mode have to be set. From the Spark 3.3.0 docs:
# Run on a YARN cluster in cluster deploy mode
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
/path/to/examples.jar \
1000
Or programmatically:
spark = SparkSession.builder.appName('appName').master("yarn").config("spark.submit.deployMode", "cluster").getOrCreate()

spark-submit on kubernetes cluster does not recognise k8s --master property

I have successfully installed a Kubernetes cluster and can verify this by:
C:\windows\system32>kubectl cluster-info
Kubernetes master is running at https://<ip>:<port>
KubeDNS is running at https://<ip>:<port>/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Then I am trying to run the SparkPi with the Spark I downloaded from https://spark.apache.org/downloads.html .
spark-submit --master k8s://https://192.168.99.100:8443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=gettyimages/spark c:\users\<username>\Desktop\spark-2.4.0-bin-hadoop2.7\examples\jars\spark-examples_2.11-2.4.0.jar
I am getting this error:
Error: Master must either be yarn or start with spark, mesos, local
Run with --help for usage help or --verbose for debug output
I tried versions 2.4.0 and 2.3.3. I also tried
spark-submit --help
to see what I can get regarding the --master property. This is what I get:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
According to the documentation [https://spark.apache.org/docs/latest/running-on-kubernetes.html] on running Spark workloads in Kubernetes, spark-submit does not even seem to recognise the k8s value for master. [ included in possible Spark masters: https://spark.apache.org/docs/latest/submitting-applications.html#master-urls ]
Any ideas? What would I be missing here?
Thanks
The issue was that my CMD was picking up a previous spark-submit version I had installed (2.2), even though I was running the command from the bin directory of the Spark installation.
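One way to check for this situation is to list every spark-submit launcher that the PATH search would find, in order. The sketch below is a generic plain-Python helper, not a Spark tool; the function name `executables_on_path` is hypothetical, and the file name `spark-submit.cmd` is an assumption about the Windows launcher's name.

```python
import os


def executables_on_path(name, path=None):
    """Return every file called `name` in the PATH directories, in search order."""
    if path is None:
        path = os.environ.get("PATH", "")
    matches = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            matches.append(candidate)
    return matches


# Assumed launcher name on Windows; use "spark-submit" on Linux or macOS.
for match in executables_on_path("spark-submit.cmd"):
    print(match)
```

If more than one path is printed, an older installation may be shadowing the one you intend to use. Checking the SPARK_HOME environment variable as well is probably worthwhile, since it can also point at a different installation.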

SparkConf settings not used when running Spark app in cluster mode on YARN

I wrote a Spark application that sets some configuration via a SparkConf instance, like this:
SparkConf conf = new SparkConf().setAppName("Test App Name");
conf.set("spark.driver.cores", "1");
conf.set("spark.driver.memory", "1800m");
conf.set("spark.yarn.am.cores", "1");
conf.set("spark.yarn.am.memory", "1800m");
conf.set("spark.executor.instances", "30");
conf.set("spark.executor.cores", "3");
conf.set("spark.executor.memory", "2048m");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> inputRDD = sc.textFile(...);
...
When I run this application with the command (master=yarn & deploy-mode=client)
spark-submit --class spark.MyApp --master yarn --deploy-mode client /home/myuser/application.jar
everything seems to work fine, the Spark History UI shows correct executor information:
But when running it with (master=yarn & deploy-mode=cluster)
my Spark UI shows wrong executor information (~512 MB instead of ~1400 MB):
Also, my app name is Test App Name when running in client mode, but spark.MyApp when running in cluster mode. It seems, however, that some default settings are used when running in cluster mode. What am I doing wrong here? How can I apply these settings in cluster mode?
I'm using Spark 1.6.2 on a HDP 2.5 cluster, managed by YARN.
OK, I think I found the problem! In short: there is a difference between configuring Spark in standalone mode and in YARN-managed mode!
So when you run Spark applications in the Standalone mode, you can focus on the Configuration documentation of Spark, see http://spark.apache.org/docs/1.6.2/configuration.html
You can use the following settings for Driver & Executor CPU/RAM (just as explained in the documentation):
spark.executor.cores
spark.executor.memory
spark.driver.cores
spark.driver.memory
BUT: When running Spark inside a YARN-managed Hadoop environment, you have to be careful with the following settings and consider the following points:
rely on the "Spark on YARN" documentation rather than on the Configuration documentation linked above: http://spark.apache.org/docs/1.6.2/running-on-yarn.html (the properties explained there have a higher priority than the ones explained in the Configuration documentation, which seems to describe only standalone cluster vs. client mode, not YARN cluster vs. client mode!)
you can't use SparkConf to set properties in yarn-cluster mode! Instead use the corresponding spark-submit parameters:
--executor-cores 5
--executor-memory 5g
--driver-cores 3
--driver-memory 3g
In yarn-client mode you can't use the spark.driver.cores and spark.driver.memory properties! You have to use the corresponding AM properties in a SparkConf instance:
spark.yarn.am.cores
spark.yarn.am.memory
You can't set these AM properties via spark-submit parameters!
To set executor resources in yarn-client mode you can use
spark.executor.cores and spark.executor.memory in SparkConf
--executor-cores and --executor-memory parameters in spark-submit
if you set both, the SparkConf settings overwrite the spark-submit parameter values!
This is the textual form of my notes:
I hope these findings help somebody else...
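The precedence noted above (SparkConf values overriding spark-submit parameters) matches the order given in the Spark configuration documentation: values set on SparkConf win over spark-submit flags, which win over spark-defaults.conf. A minimal plain-Python sketch of that layering follows. The function name `effective_conf` is hypothetical and this is not Spark code. It only models properties that are read after the SparkConf is built; anything needed before the driver JVM starts, such as driver memory in cluster mode, falls outside this model.

```python
def effective_conf(spark_defaults, submit_flags, spark_conf):
    """Merge three layers of settings; later layers override earlier ones."""
    merged = dict(spark_defaults)  # lowest precedence: spark-defaults.conf
    merged.update(submit_flags)    # next: flags passed to spark-submit
    merged.update(spark_conf)      # highest: values set on SparkConf in code
    return merged


conf = effective_conf(
    {"spark.executor.memory": "1g", "spark.master": "yarn"},
    {"spark.executor.memory": "5g"},
    {"spark.executor.memory": "2048m"},
)
print(conf["spark.executor.memory"])  # prints 2048m
```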
Just to add on to D. Müller's answer:
The same issue happened to me, and I tried the settings in several different combinations. I am running PySpark 2.0.0 on a YARN cluster.
I found that driver-memory must be written during spark submit but executor-memory can be written in script (i.e. SparkConf) and the application will still work.
My application will die if driver-memory is less than 2g. The error is:
ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM
ERROR yarn.ApplicationMaster: User application exited with status 143
CASE 1:
driver & executor both written in SparkConf
spark = (SparkSession
.builder
.appName("driver_executor_inside")
.enableHiveSupport()
.config("spark.executor.memory","4g")
.config("spark.executor.cores","2")
.config("spark.yarn.executor.memoryOverhead","1024")
.config("spark.driver.memory","2g")
.getOrCreate())
spark-submit --master yarn --deploy-mode cluster myscript.py
CASE 2:
- driver in spark submit
- executor in SparkConf in script
spark = (SparkSession
.builder
.appName("executor_inside")
.enableHiveSupport()
.config("spark.executor.memory","4g")
.config("spark.executor.cores","2")
.config("spark.yarn.executor.memoryOverhead","1024")
.getOrCreate())
spark-submit --master yarn --deploy-mode cluster --conf spark.driver.memory=2g myscript.py
The job finished with a succeeded status. Executor memory was correct.
CASE 3:
- driver in spark submit
- executor not written
spark = (SparkSession
.builder
.appName("executor_not_written")
.enableHiveSupport()
.config("spark.executor.cores","2")
.config("spark.yarn.executor.memoryOverhead","1024")
.getOrCreate())
spark-submit --master yarn --deploy-mode cluster --conf spark.driver.memory=2g myscript.py
Apparently the executor memory is not set here, meaning that CASE 2 actually picked up the executor memory setting even though it was written inside SparkConf.
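The pattern from the three cases (driver memory on the spark-submit command line, executor settings left in the script) can be sketched as a small command builder. This is a plain-Python illustration based only on the observations above. The function name `build_submit_command` is hypothetical, and the set of keys treated as submit-time settings is an assumption, not an authoritative list.

```python
# Assumption: settings that must be known before the driver starts in cluster mode.
SUBMIT_TIME_KEYS = {"spark.driver.memory", "spark.driver.cores"}


def build_submit_command(script, settings):
    """Build a spark-submit argv that passes only the submit-time settings as --conf."""
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    for key in sorted(settings):
        if key in SUBMIT_TIME_KEYS:
            cmd += ["--conf", f"{key}={settings[key]}"]
    cmd.append(script)
    return cmd


settings = {"spark.driver.memory": "2g", "spark.executor.memory": "4g"}
print(" ".join(build_submit_command("myscript.py", settings)))
```

For these settings the builder reproduces the command used in CASE 2; spark.executor.memory is left out because it stays in the script's builder configuration.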

Error: spark-assembly-1.4.1-hadoop2.6.0.jar does not exist

I'm trying to submit a Spark app from a terminal on my local machine to my cluster. I'm using --master yarn-cluster. I need the driver program to run on my cluster too, not on the machine I submit the application from (i.e. my local machine).
I'm using
bin/spark-submit \
--class com.my.application.XApp \
--master yarn-cluster --executor-memory 100m \
--num-executors 50 hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar \
1000
and getting error
Diagnostics: java.io.FileNotFoundException: File
file:/Users/nish1013/Dev/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar
does not exist
I can see in my service list ,
YARN + MapReduce2 2.7.1.2.3 Apache Hadoop NextGen MapReduce (YARN)
Spark 1.4.1.2.3 Apache Spark is a fast and general engine for
large-scale data processing.
already installed.
My spark-env.sh in local machine
export HADOOP_CONF_DIR=/Users/nish1013/Dev/hadoop-2.7.1/etc/hadoop
Has anyone encountered something similar before?
I think the right command is the following:
bin/spark-submit \
--class com.my.application.XApp \
--master yarn-cluster --executor-memory 100m \
--num-executors 50 --conf spark.yarn.jars=hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar \
1000
or you can add
spark.yarn.jars hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar
to your spark-defaults.conf file

how to : spark yarn cluster

I have set up a Hadoop cluster with 3 machines: one master and 2 slaves.
On the master I installed Spark:
SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true sbt/sbt clean assembly
I added HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop to spark-env.sh.
Then I ran SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop ./bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop2.4.0.jar
I checked localhost:8088 and saw the SparkPi application running.
Is this all that is needed, or should I also install Spark on the 2 slave machines?
How can I get all the machines started?
Is there any help doc out there? I feel like I am missing something.
In Spark standalone mode we start the master and the workers:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
I would also like to know how to get more than one worker running in this case.
I know we can configure the slaves in conf/slaves, but can anyone share an example?
Please help, I am stuck.
Assuming you're using Spark 1.1.0, as it says in the documentation (http://spark.apache.org/docs/1.1.0/submitting-applications.html#master-urls), for the master parameter you can use values yarn-cluster or yarn-client. You do not need to use deploy-mode parameter in that case.
You do not have to install Spark on all the YARN nodes. That is what YARN is for: to distribute your application (in this case Spark) over a Hadoop cluster.
