Spark's example throws FileNotFoundException in client mode - apache-spark

I have: Ubuntu 14.04, Hadoop 2.7.7, Spark 2.2.0.
I just installed everything.
When I try to run Spark's example:
bin/spark-submit --deploy-mode client \
--class org.apache.spark.examples.SparkPi \
examples/jars/spark-examples_2.11-2.2.0.jar 10
I get the following error:
INFO yarn.Client:
client token: N/A
diagnostics: Application application_1552490646290_0007 failed 2 times due to AM Container for
appattempt_1552490646290_0007_000002 exited with exitCode: -1000 For
more detailed output, check application tracking
page:http://ip-123-45-67-89:8088/cluster/app/application_1552490646290_0007 Then,
click on links to logs of each attempt. Diagnostics: File
file:/tmp/spark-f5879f52-6777-481a-8ecf-bbb55e376901/__spark_libs__6948713644593068670.zip
does not exist java.io.FileNotFoundException: File
file:/tmp/spark-f5879f52-6777-481a-8ecf-bbb55e376901/__spark_libs__6948713644593068670.zip
does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:421)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:473)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
at java.lang.Thread.run(Thread.java:748)
I get the same error both in client mode and cluster mode.

It seems to fail while loading the Spark libs. As Daniel points out, it could be related to your read permissions. It could also be caused by running out of disk space.
In our case, to avoid transfer latency to the master and read/write permission problems on the local machine, we put the Spark libs in the HDFS of the YARN cluster and point to them with the spark.yarn.archive property.
jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
export HADOOP_USER_NAME=hadoop
hadoop fs -mkdir -p /apps/spark/
hadoop fs -put -f spark-libs.jar /apps/spark/
# spark-defaults.conf
spark.yarn.archive hdfs:///apps/spark/spark-libs.jar
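
The jar cv0f step above only produces an uncompressed archive with the jars at its root. Here is a minimal Python sketch of that packaging step, using only the standard library. The function name build_spark_libs_archive and the paths are hypothetical, it packs only *.jar files and writes no manifest, and the upload to HDFS is still done with hadoop fs -put as above.

```python
import zipfile
from pathlib import Path


def build_spark_libs_archive(jars_dir, out_path):
    """Pack every *.jar in jars_dir into an uncompressed archive at out_path.

    Jars are stored at the archive root, similar to `jar cv0f ... -C jars/ .`.
    Returns the sorted list of entry names written.
    """
    jars = sorted(Path(jars_dir).glob("*.jar"))
    if not jars:
        raise FileNotFoundError(f"no jars found in {jars_dir}")
    # ZIP_STORED means no compression, like the `0` flag of the jar tool.
    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for jar in jars:
            archive.write(jar, arcname=jar.name)
    return [jar.name for jar in jars]
```

Failing early on an empty directory is deliberate: an empty archive would upload fine and only fail later inside the YARN containers.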

First, the application jar must be a path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes.
Second, if you run on YARN, you have to point your master to yarn (see submitting-applications in the Spark docs) and put your jar file in HDFS:
# Run on a YARN cluster
# Connect to a YARN cluster in client or cluster mode depending on the value
# of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR
# or YARN_CONF_DIR variable.
export HADOOP_CONF_DIR=XXX
# --deploy-mode can be client for client mode
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
hdfs:///path/to/spark-examples.jar \
1000

Related

FileNotFound error when running spark-submit

I am trying to run the spark-submit command on my Hadoop cluster.
Here is a summary of my Hadoop cluster:
The cluster is built using 5 VirtualBox VMs connected on an internal network.
There is 1 namenode and 4 datanodes.
All the VMs were built from the Bitnami Hadoop Stack VirtualBox image.
When I run the following command:
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.3.jar 10
I receive the following error:
java.io.FileNotFoundException: File file:/home/bitnami/sparkStaging/bitnami/.sparkStaging/application_1658417340986_0002/__spark_conf__.zip does not exist
I also get a similar error when trying to create a sparkSession using PySpark:
spark = SparkSession.builder.appName('appName').getOrCreate()
I have tried/verified the following:
The environment variables HADOOP_HOME, SPARK_HOME and HADOOP_CONF_DIR have been set in my .bashrc file.
SPARK_DIST_CLASSPATH and HADOOP_CONF_DIR have been defined in spark-env.sh.
Added spark.master yarn, spark.yarn.stagingDir file:///home/bitnami/sparkStaging and spark.yarn.jars file:///opt/bitnami/hadoop/spark/jars/ in spark-defaults.conf.
I believe spark.yarn.stagingDir needs to be an HDFS path.
More specifically, the YARN staging directory needs to be available to all Spark executors; it cannot be just a local file path on the machine where you run spark-submit.
The path that isn't found is being reported from the YARN cluster, where /home/bitnami might not exist, or the Unix user running the Spark executor containers does not have access to that path.
Similarly, spark.yarn.jars (or spark.yarn.archive) should be HDFS paths because these will get downloaded, in parallel, across all executors.
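
A quick way to spot this kind of misconfiguration is to scan the relevant properties for file: URIs. The following is only a heuristic sketch in plain Python: the function name find_local_only_settings and the dict-based config are hypothetical, and a file: path is not wrong in itself if it really exists on every node (for example on a shared mount).

```python
from urllib.parse import urlparse

# Properties from the question whose values YARN containers must be able to read.
CLUSTER_VISIBLE_KEYS = (
    "spark.yarn.stagingDir",
    "spark.yarn.jars",
    "spark.yarn.archive",
)


def find_local_only_settings(conf):
    """Return {key: value} for cluster-visible settings that use a file: URI."""
    flagged = {}
    for key in CLUSTER_VISIBLE_KEYS:
        value = conf.get(key)
        if value is None:
            continue
        # spark.yarn.jars may hold a comma-separated list of URIs.
        for uri in value.split(","):
            if urlparse(uri.strip()).scheme == "file":
                flagged[key] = value
                break
    return flagged
```

With the settings from the question, both spark.yarn.stagingDir and spark.yarn.jars are flagged, while an hdfs:/// value passes.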
Since the Spark job is supposed to be submitted to the Hadoop cluster managed by YARN, master and deploy-mode have to be set. From the Spark 3.3.0 docs:
# Run on a YARN cluster in cluster deploy mode
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
/path/to/examples.jar \
1000
Or programmatically:
spark = SparkSession.builder.appName('appName').master("yarn").config("spark.submit.deployMode", "cluster").getOrCreate()
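
If the submission is scripted, it can help to build the argument list in one place so that master and deploy mode are never forgotten. This is a rough sketch that only uses the flags shown in the docs excerpt above. The helper name build_spark_submit_args and its parameters are hypothetical, and it only builds the list; actually running it (for example with subprocess.run) is left to the caller.

```python
def build_spark_submit_args(app_jar, main_class, app_args=(), master="yarn",
                            deploy_mode="cluster", executor_memory=None,
                            num_executors=None):
    """Return a spark-submit argument list with master and deploy mode set."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"deploy_mode must be client or cluster, got {deploy_mode!r}")
    args = [
        "spark-submit",
        "--class", main_class,
        "--master", master,
        "--deploy-mode", deploy_mode,
    ]
    if executor_memory is not None:
        args += ["--executor-memory", executor_memory]
    if num_executors is not None:
        args += ["--num-executors", str(num_executors)]
    # Options go before the application jar; everything after the jar
    # is handed to the application itself.
    args.append(app_jar)
    args += [str(arg) for arg in app_args]
    return args
```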

Capture Spark executor logs in a local file in YARN cluster mode

I am running Spark Streaming in YARN cluster mode, and I want to capture the logs and write them to a local file on the driver. For this I have created a custom log4j.properties file in which I have specified the driver's local file path, but I can only see the driver's logs in this file. Why are my executors' logs not captured in this file, and how can I capture them? I have tried different approaches; my spark-submit command is as follows:
spark-submit --master yarn --deploy-mode cluster \
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/home/log/conf/log4j.properties" \
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/home/log/conf/log4j.properties" \
--class com.Word.count.SparkStream /home/project/WordCount/target/Count-0.0.1-SNAPSHOT.jar
Could you post your log4j.properties? I assume that you can see the executor logs in each executor node's local directory.

Spark Structured Streaming job failing in cluster mode

I am using spark-sql version 2.4.1 in my application.
While writing data to an HDFS folder, I am facing this issue in my Spark streaming application.
Error:
yarn.Client: Deleted staging directory hdfs://dev/user/xyz/.sparkStaging/application_1575699597805_47
20/02/24 14:02:15 ERROR yarn.Client: Application diagnostics message: User class threw exception: org.apache.hadoop.security.AccessControlException: Permission denied: user= xyz, access=WRITE, inode="/tmp/hadoop-admin":admin:supergroup:drwxr-xr-x
.
.
.
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=xyz, access=WRITE, inode="/tmp/hadoop-admin":admin:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:350)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:251)
I face this issue when I run in yarn-cluster mode, i.e.
--master yarn \
--deploy-mode cluster \
But when I run in “yarn-client” mode it runs fine i.e.
--master yarn \
--deploy-mode client \
What is the root cause of this problem?
The fundamental question here: why is it trying to write to "/tmp/hadoop-admin/" instead of the respective user directory, i.e. hdfs://qa2/user/xyz/?
I have come across this fix:
https://issues.apache.org/jira/browse/SPARK-26825
How can I implement it in my spark-sql application?
The only difference between the working --deploy-mode client and the failing --deploy-mode cluster cases is the location of the driver. In client deploy mode, the driver runs on the machine on which you execute spark-submit (usually an edge node that is configured to use a YARN cluster but is not part of it), while in cluster deploy mode the driver runs as part of the YARN cluster (on one of the nodes under YARN's control).
It looks like you've got a misconfigured edge node.
I would not be surprised if a regular Spark SQL-only application failed too, i.e. if this had nothing to do with the streaming query (Spark Structured Streaming) and would fail for any Spark application.
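
To see why the write is refused, it helps to read the inode part of the message as owner:group:mode. Below is a small sketch in plain Python that parses that message and applies the usual owner/group/other check. The function names are hypothetical, xyz's group membership is an assumption (the message does not show it), and superuser and ACL handling are ignored.

```python
import re

# Matches e.g. user=xyz, access=WRITE, inode="/tmp/hadoop-admin":admin:supergroup:drwxr-xr-x
DENIED = re.compile(
    r'user=\s*(?P<user>[^,\s]+), access=(?P<access>\w+), '
    r'inode="(?P<path>[^"]+)":(?P<owner>[^:]+):(?P<group>[^:]+):'
    r'(?P<mode>[d-][rwxtTsS-]{9})'
)


def parse_denied(message):
    """Extract user, access, path, owner, group and mode from the error text."""
    match = DENIED.search(message)
    if match is None:
        raise ValueError("not a recognised permission-denied message")
    return match.groupdict()


def can_write(user, user_groups, owner, group, mode):
    """Apply the owner/group/other write check to a mode such as drwxr-xr-x."""
    if user == owner:
        bits = mode[1:4]
    elif group in user_groups:
        bits = mode[4:7]
    else:
        bits = mode[7:10]
    return bits[1] == "w"
```

For the message in the question, only the owner admin has the w bit on /tmp/hadoop-admin, so xyz is refused whether or not it belongs to supergroup.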

Spark YARN on EMR - JavaSparkContext - IllegalStateException: Library directory does not exist

I have a Java Spark job that works on a manually deployed Spark 1.6.0 in standalone mode on an EC2 instance.
I am spark-submitting this job to an EMR 5.3.0 cluster on the master using YARN, but it fails.
The spark-submit line is:
spark-submit --class <startclass> --master yarn --queue default --deploy-mode cluster --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://`hostname -f`:8020/tmp/ourSparkLogs --driver-memory 4G --executor-memory 4G --executor-cores 2 hdfs://`hostname -f`:8020/data/x.jar yarn-client
The "yarn-client" is the first argument to the x.jar application and is fed to the SparkContext as setMaster,
conf.setMaster(args[0]);
When I submit it, it starts out running fine, until I initialize the JavaSparkContext from a SparkConf,
JavaSparkContext sc = new JavaSparkContext(conf);
... and then Spark crashes.
In the YARN log, I can see the following,
yarn logs -applicationId application_1487325147456_0051
...
17/02/17 16:27:13 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/02/17 16:27:13 INFO Client: Deleted staging directory hdfs://ip-172-31-8-237.eu-west-1.compute.internal:8020/user/ec2-user/.sparkStaging/application_1487325147456_0052
17/02/17 16:27:13 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalStateException: Library directory '/mnt/yarn/usercache/ec2-user/appcache/application_1487325147456_0051/container_1487325147456_0051_01_000001/assembly/target/scala-2.11/jars' does not exist; make sure Spark is built.
...
Noting the WARN of spark.yarn.jars flag missing, I found a spark yarn JAR file in
/usr/lib/spark/jars/
... and uploaded it to HDFS per Cloudera's guide on how to run YARN applications on Spark and tried to add that conf, so this became my spark-submit line,
spark-submit --class <startclass> --master yarn --queue default --deploy-mode cluster --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://`hostname -f`:8020/tmp/ourSparkLogs --conf spark.yarn.jars=hdfs://`hostname -f`:8020/sparkyarnlibs/spark-yarn_2.11-2.1.0.jar --driver-memory 4G --executor-memory 4G --executor-cores 2 hdfs://`hostname -f`:8020/data/x.jar yarn-client
But that did not work and gave this:
Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
I am really puzzled as to what that Library error is caused by and how to proceed onwards from here.
You have specified "--deploy-mode cluster" and yet are calling conf.setMaster("yarn-client") from the code. Using a master URL of "yarn-client" means "use YARN as the master, and use client mode (not cluster mode)", so I wouldn't be surprised if this is somehow confusing Spark because on one hand you're telling it to use cluster mode and on the other you're telling it to use client mode.
By the way, using a master URL like "yarn-client" or "yarn-cluster" is actually deprecated because the "-client" or "-cluster" part is not really part of the Master but rather is the deploy mode. That is, "--master yarn-client" is really more of a shortcut/alias for "--master yarn --deploy-mode client", and similarly "--master yarn-cluster" just means "--master yarn --deploy-mode cluster".
My recommendation would be to not call conf.setMaster() from your code, since the master is already set to "yarn" automatically in /etc/spark/conf/spark-defaults.conf. For this reason, you also don't need to pass "--master yarn" to spark-submit.
Lastly, it sounds like you need to decide whether you really want to use client deploy mode or cluster deploy mode. With client deploy mode, the driver runs on the master instance, and with cluster deploy mode, the driver runs in a YARN container on one of the core/task instances. See https://spark.apache.org/docs/latest/running-on-yarn.html for more information.
If you want to use client deploy mode, you don't need to pass anything extra because it's already the default. If you want to use cluster deploy mode, pass "--deploy-mode cluster".
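
To make the alias point concrete, here is a tiny sketch in plain Python (not Spark code) of how a legacy master URL maps to a master plus a deploy mode, and how the conflict in the question could be detected before submitting. The function names are hypothetical.

```python
LEGACY_MASTERS = {
    "yarn-client": ("yarn", "client"),
    "yarn-cluster": ("yarn", "cluster"),
}


def split_legacy_master(master):
    """Map yarn-client / yarn-cluster to (master, deploy_mode); others pass through."""
    return LEGACY_MASTERS.get(master, (master, None))


def check_deploy_mode(master, deploy_mode=None):
    """Return (master, deploy_mode), raising if the two contradict each other."""
    resolved_master, implied_mode = split_legacy_master(master)
    if implied_mode and deploy_mode and implied_mode != deploy_mode:
        raise ValueError(
            f"master {master!r} implies {implied_mode} mode "
            f"but deploy mode is {deploy_mode!r}"
        )
    # Client is the default when nothing is specified.
    return resolved_master, deploy_mode or implied_mode or "client"
```

Calling check_deploy_mode("yarn-client", "cluster") raises, which is exactly the combination in the question: --deploy-mode cluster on the command line and "yarn-client" passed to setMaster.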

How to give dependent jars to spark submit in cluster mode

I am running Spark using cluster mode for deployment. Below is the command:
JARS=$JARS_HOME/amqp-client-3.5.3.jar,$JARS_HOME/nscala-time_2.10-2.0.0.jar,\
$JARS_HOME/rabbitmq-0.1.0-RELEASE.jar,\
$JARS_HOME/kafka_2.10-0.8.2.1.jar,$JARS_HOME/kafka-clients-0.8.2.1.jar,\
$JARS_HOME/spark-streaming-kafka_2.10-1.4.1.jar,\
$JARS_HOME/zkclient-0.3.jar,$JARS_HOME/protobuf-java-2.4.0a.jar
dse spark-submit -v --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--executor-memory 512M \
--total-executor-cores 3 \
--deploy-mode "cluster" \
--master spark://$MASTER:7077 \
--jars=$JARS \
--supervise \
--class "com.testclass" $APP_JAR input.json \
--files "/home/test/input.json"
The above command works fine in client mode, but when I use it in cluster mode I get a class-not-found exception:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$
In client mode the dependent jars get copied to the /var/lib/spark/work directory, whereas in cluster mode they do not. Please help me solve this.
EDIT:
I am using NFS and I have mounted the same directory on all the Spark nodes under the same name. Still I get the error. How is it able to pick up the application jar, which is also in the same directory, but not the dependent jars?
In client mode the dependent jars are getting copied to the
/var/lib/spark/work directory whereas in cluster mode it is not.
In cluster mode, the driver program runs in the cluster, not locally (as it does in client mode), so the dependent jars must be accessible in the cluster; otherwise the driver program and the executors will throw java.lang.NoClassDefFoundError.
When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster.
Your extra jars can be added to --jars; they will be copied to the cluster automatically.
Please refer to the "Advanced Dependency Management" section at the link below:
http://spark.apache.org/docs/latest/submitting-applications.html
As the Spark documentation says, either:
keep all jars and dependencies in the same local path on all nodes in the cluster, or
keep the jars in a distributed file system to which all nodes have access.
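
One more thing worth checking in a command like the one in the question: spark-submit treats everything after the application jar as arguments for the application, so an option placed there (such as the trailing --files) is not seen by spark-submit. Here is a rough Python sketch of that split. The function names are hypothetical, and the set of value-less flags is a partial list assumed for this example.

```python
# Options that take no value (a partial list, assumed for this sketch).
FLAG_OPTIONS = {"-v", "--verbose", "--supervise"}


def split_submit_command(tokens):
    """Split spark-submit tokens into (options, app_jar, app_args).

    The first token that is neither an option nor an option's value is taken
    as the application jar; everything after it goes to the application.
    """
    options = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if not token.startswith("-"):
            return options, token, tokens[i + 1:]
        options.append(token)
        # `--name value` consumes the next token; `--name=value` and flags do not.
        if token not in FLAG_OPTIONS and "=" not in token:
            i += 1
            if i < len(tokens):
                options.append(tokens[i])
        i += 1
    return options, None, []


def misplaced_options(tokens):
    """Return option-like tokens that ended up after the application jar."""
    _, _, app_args = split_submit_command(tokens)
    return [arg for arg in app_args if arg.startswith("--")]
```

Run against the tokens of the command in the question, this reports --files as misplaced; moving it before the application jar lets spark-submit handle it.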
Spark properties
