Spark in Yarn Cluster Mode - Yarn client reports FAILED even when job completes successfully - apache-spark

I am experimenting with running Spark in yarn cluster mode (v2.3.0). We have traditionally been running in yarn client mode, but some jobs are submitted from .NET web services, so we have to keep a host process running in the background when using client mode (HostingEnvironment.QueueBackgroundWorkTime...). We are hoping we can execute these jobs in a more "fire and forget" style.
Our jobs continue to run successfully, but we see a curious entry in the logs where the yarn client that submits the job to the application manager is always reporting failure:
18/11/29 16:54:35 INFO yarn.Client: Application report for application_1539978346138_110818 (state: RUNNING)
18/11/29 16:54:36 INFO yarn.Client: Application report for application_1539978346138_110818 (state: RUNNING)
18/11/29 16:54:37 INFO yarn.Client: Application report for application_1539978346138_110818 (state: FINISHED)
18/11/29 16:54:37 INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: N/A
ApplicationMaster host: <ip address>
ApplicationMaster RPC port: 0
queue: root.default
start time: 1543510402372
final status: FAILED
tracking URL: http://server.host.com:8088/proxy/application_1539978346138_110818/
user: p800s1
Exception in thread "main" org.apache.spark.SparkException: Application application_1539978346138_110818 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1153)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1568)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:892)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/11/29 16:54:37 INFO util.ShutdownHookManager: Shutdown hook called
We always create a SparkSession and always return sys.exit(0) (although that appears to be ignored by the Spark framework regardless of how we submit a job). We also have our own internal error logging that routes to Kafka/ElasticSearch. No errors are reported during the job run.
Here's an example of the submit command: spark2-submit --keytab /etc/keytabs/p800s1.ktf --principal p800s1#OURDOMAIN.COM --master yarn --deploy-mode cluster --driver-memory 2g --executor-memory 4g --class com.path.to.MainClass /path/to/UberJar.jar arg1 arg2
This seems to be harmless noise, but I don't like noise that I don't understand. Has anyone experienced something similar?

Related

PySpark EMR step fails with exit code 1

I'm learning to use AWS EMR for the first time to submit my Spark jobs. The script I'm using is very short (restaurant.py):
from pyspark import SparkContext, SQLContext
from pyspark.sql import SparkSession
class SparkRawConsumer:
def __init__(self):
self.sparkContext = SparkContext.getOrCreate()
self.sparkContext.setLogLevel("ERROR")
self.sqlContext = SQLContext(self.sparkContext)
self.df = self.sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('zomato.csv')
if __name__ == "__main__":
sparkConsumer = SparkRawConsumer()
print(sparkConsumer.df.count())
sparkConsumer.df.groupBy("City").agg({"Average Cost for two": "avg", "Aggregate rating": "avg"})
I used the AWS GUI to submit my step, but the CLI export is
spark-submit --deploy-mode cluster s3://data-pipeline-testing-yu-chen/dependencies/restaurant.py -files s3://data-pipeline-testing-yu-chen/dependencies/zomato.csv
However, the step will run for a few minutes, and then return an exit code of 1. I'm pretty confused what exactly is going on, and finding it difficult to interpret the output of my syserr:
18/07/28 06:40:10 INFO Client: Application report for application_1532756827478_0012 (state: RUNNING)
18/07/28 06:40:11 INFO Client: Application report for application_1532756827478_0012 (state: RUNNING)
18/07/28 06:40:12 INFO Client: Application report for application_1532756827478_0012 (state: RUNNING)
18/07/28 06:40:13 INFO Client: Application report for application_1532756827478_0012 (state: FINISHED)
18/07/28 06:40:13 INFO Client:
client token: N/A
diagnostics: User application exited with status 1
ApplicationMaster host: myip
ApplicationMaster RPC port: 0
queue: default
start time: 1532759825922
final status: FAILED
tracking URL: http://myip.compute.internal:20888/proxy/application_1532756827478_0012/
user: hadoop
18/07/28 06:40:13 INFO Client: Deleted staging directory hdfs://myip.compute.internal:8020/user/hadoop/.sparkStaging/application_1532756827478_0012
Exception in thread "main" org.apache.spark.SparkException: Application application_1532756827478_0012 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1165)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1520)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/07/28 06:40:13 INFO ShutdownHookManager: Shutdown hook called
18/07/28 06:40:13 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-dedwd323x
18/07/28 06:40:13 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-dedwd323x
Command exiting with ret '1'
I am able to run the script by SSH-ing into my master instance, and then issuing spark-submit restaurant.py. I've loaded a CSV file into my master instance by using:
[hadoop#my-ip ~]$ aws s3 sync s3://data-pipeline-testing-yu-chen/dependencies/ .
Then I load my restaurant.csv file into HDFS:
hadoop fs -put zomato.csv ./zomato.csv
My guess is that the -files option I am passing in isn't getting used the way I intend it to be used, but I'm really at a loss for how to interpret the console output and begin debugging.

Spark Job Container exited with exitCode: -1000

I have been struggling to run sample job with spark 2.0.0 in yarn cluster mode, job exists with exitCode: -1000 without any other clues. Same job runs properly in local mode.
Spark command:
spark-submit \
--conf "spark.yarn.stagingDir=/xyz/warehouse/spark" \
--queue xyz \
--class com.xyz.TestJob \
--master yarn \
--deploy-mode cluster \
--conf "spark.local.dir=/xyz/warehouse/tmp" \
/xyzpath/java-test-1.0-SNAPSHOT.jar $#
TestJob class:
public class TestJob {
public static void main(String[] args) throws InterruptedException {
SparkConf conf = new SparkConf();
JavaSparkContext jsc = new JavaSparkContext(conf);
System.out.println(
"TOtal count:"+
jsc.parallelize(Arrays.asList(new Integer[]{1,2,3,4})).count());
jsc.stop();
}
}
Error Log:
17/10/04 22:26:52 INFO Client: Application report for application_1506717704791_130756 (state: ACCEPTED)
17/10/04 22:26:52 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.xyz
start time: 1507181210893
final status: UNDEFINED
tracking URL: http://xyzserver:8088/proxy/application_1506717704791_130756/
user: xyz
17/10/04 22:26:53 INFO Client: Application report for application_1506717704791_130756 (state: ACCEPTED)
17/10/04 22:26:54 INFO Client: Application report for application_1506717704791_130756 (state: ACCEPTED)
17/10/04 22:26:55 INFO Client: Application report for application_1506717704791_130756 (state: ACCEPTED)
17/10/04 22:26:56 INFO Client: Application report for application_1506717704791_130756 (state: FAILED)
17/10/04 22:26:56 INFO Client:
client token: N/A
diagnostics: Application application_1506717704791_130756 failed 5 times due to AM Container for appattempt_1506717704791_130756_000005 exited with exitCode: -1000
For more detailed output, check application tracking page:http://xyzserver:8088/cluster/app/application_1506717704791_130756Then, click on links to logs of each attempt.
Diagnostics: Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.xyz
start time: 1507181210893
final status: FAILED
tracking URL: http://xyzserver:8088/cluster/app/application_1506717704791_130756
user: xyz
17/10/04 22:26:56 INFO Client: Deleted staging directory /xyz/spark/.sparkStaging/application_1506717704791_130756
Exception in thread "main" org.apache.spark.SparkException: Application application_1506717704791_130756 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1167)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1213)
When I browse the page http://xyzserver:8088/cluster/app/application_1506717704791_130756 it doesn't exists.
No Yarn application logs found-
$yarn logs -applicationId application_1506717704791_130756
/apps/yarn/logs/xyz/logs/application_1506717704791_130756 does not have any log files.
What could be the possibly rootcause of this error and how to get detailed error logs?
After spending nearly one whole day I found the rootcause. When I remove spark.yarn.stagingDir it starts working and I am still not sure why spark is complaining about it-
Previous Spark Submit-
spark-submit \
--conf "spark.yarn.stagingDir=/xyz/warehouse/spark" \
--queue xyz \
--class com.xyz.TestJob \
--master yarn \
--deploy-mode cluster \
--conf "spark.local.dir=/xyz/warehouse/tmp" \
/xyzpath/java-test-1.0-SNAPSHOT.jar $#
New-
spark-submit \
--queue xyz \
--class com.xyz.TestJob \
--master yarn \
--deploy-mode cluster \
--conf "spark.local.dir=/xyz/warehouse/tmp" \
/xyzpath/java-test-1.0-SNAPSHOT.jar $#

Property spark.yarn.jars - how to deal with it?

My knowledge with Spark is limited and you would sense it after reading this question. I have just one node and spark, hadoop and yarn are installed on it.
I was able to code and run word-count problem in cluster mode by below command
spark-submit --class com.sanjeevd.sparksimple.wordcount.JobRunner
--master yarn
--deploy-mode cluster
--driver-memory=2g
--executor-memory 2g
--executor-cores 1
--num-executors 1
SparkSimple-0.0.1SNAPSHOT.jar
hdfs://sanjeevd.br:9000/user/spark-test/word-count/input
hdfs://sanjeevd.br:9000/user/spark-test/word-count/output
It works just fine.
Now I understood that 'spark on yarn' requires spark jar files available on the cluster and if I don't do anything then every time I run my program it will copy hundreds of jar files from $SPARK_HOME to each node (in my case it's just one node). I see that code's execution pauses for some time before it finishes copying. See below -
16/12/12 17:24:03 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/12/12 17:24:06 INFO yarn.Client: Uploading resource file:/tmp/spark-a6cc0d6e-45f9-4712-8bac-fb363d6992f2/__spark_libs__11112433502351931.zip -> hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0001/__spark_libs__11112433502351931.zip
16/12/12 17:24:08 INFO yarn.Client: Uploading resource file:/home/sanjeevd/personal/Spark-Simple/target/SparkSimple-0.0.1-SNAPSHOT.jar -> hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0001/SparkSimple-0.0.1-SNAPSHOT.jar
16/12/12 17:24:08 INFO yarn.Client: Uploading resource file:/tmp/spark-a6cc0d6e-45f9-4712-8bac-fb363d6992f2/__spark_conf__6716604236006329155.zip -> hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0001/__spark_conf__.zip
Spark's documentation suggests to set spark.yarn.jars property to avoid this copying. So I set below below property in spark-defaults.conf file.
spark.yarn.jars hdfs://sanjeevd.br:9000//user/spark/share/lib
http://spark.apache.org/docs/latest/running-on-yarn.html#preparations
To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
Btw, I have all the jar files from LOCAL /opt/spark/jars to HDFS /user/spark/share/lib. They are 206 in number.
This makes my jar failed. Below is the error -
spark-submit --class com.sanjeevd.sparksimple.wordcount.JobRunner --master yarn --deploy-mode cluster --driver-memory=2g --executor-memory 2g --executor-cores 1 --num-executors 1 SparkSimple-0.0.1-SNAPSHOT.jar hdfs://sanjeevd.br:9000/user/spark-test/word-count/input hdfs://sanjeevd.br:9000/user/spark-test/word-count/output
16/12/12 17:43:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/12 17:43:07 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/12/12 17:43:07 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers
16/12/12 17:43:07 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (5120 MB per container)
16/12/12 17:43:07 INFO yarn.Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
16/12/12 17:43:07 INFO yarn.Client: Setting up container launch context for our AM
16/12/12 17:43:07 INFO yarn.Client: Setting up the launch environment for our AM container
16/12/12 17:43:07 INFO yarn.Client: Preparing resources for our AM container
16/12/12 17:43:07 INFO yarn.Client: Uploading resource file:/home/sanjeevd/personal/Spark-Simple/target/SparkSimple-0.0.1-SNAPSHOT.jar -> hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0005/SparkSimple-0.0.1-SNAPSHOT.jar
16/12/12 17:43:07 INFO yarn.Client: Uploading resource file:/tmp/spark-fae6a5ad-65d9-4b64-9ba6-65da1310ae9f/__spark_conf__7881471844385719101.zip -> hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0005/__spark_conf__.zip
16/12/12 17:43:08 INFO spark.SecurityManager: Changing view acls to: sanjeevd
16/12/12 17:43:08 INFO spark.SecurityManager: Changing modify acls to: sanjeevd
16/12/12 17:43:08 INFO spark.SecurityManager: Changing view acls groups to:
16/12/12 17:43:08 INFO spark.SecurityManager: Changing modify acls groups to:
16/12/12 17:43:08 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sanjeevd); groups with view permissions: Set(); users with modify permissions: Set(sanjeevd); groups with modify permissions: Set()
16/12/12 17:43:08 INFO yarn.Client: Submitting application application_1481592214176_0005 to ResourceManager
16/12/12 17:43:08 INFO impl.YarnClientImpl: Submitted application application_1481592214176_0005
16/12/12 17:43:09 INFO yarn.Client: Application report for application_1481592214176_0005 (state: ACCEPTED)
16/12/12 17:43:09 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1481593388442
final status: UNDEFINED
tracking URL: http://sanjeevd.br:8088/proxy/application_1481592214176_0005/
user: sanjeevd
16/12/12 17:43:10 INFO yarn.Client: Application report for application_1481592214176_0005 (state: FAILED)
16/12/12 17:43:10 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1481592214176_0005 failed 1 times due to AM Container for appattempt_1481592214176_0005_000001 exited with exitCode: 1
For more detailed output, check application tracking page:http://sanjeevd.br:8088/cluster/app/application_1481592214176_0005Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1481592214176_0005_01_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1481593388442
final status: FAILED
tracking URL: http://sanjeevd.br:8088/cluster/app/application_1481592214176_0005
user: sanjeevd
16/12/12 17:43:10 INFO yarn.Client: Deleting staging directory hdfs://sanjeevd.br:9000/user/sanjeevd/.sparkStaging/application_1481592214176_0005
Exception in thread "main" org.apache.spark.SparkException: Application application_1481592214176_0005 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1132)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1175)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/12/12 17:43:10 INFO util.ShutdownHookManager: Shutdown hook called
16/12/12 17:43:10 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-fae6a5ad-65d9-4b64-9ba6-65da1310ae9f
Do you know what wrong am I doing? The task's log says below -
Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
I understand the error that ApplicationMaster class is not found but my question is why it is not found - where this class is supposed to be? I don't have assembly jar since I'm using spark 2.0.1 where there is no assembly comes bundled.
What this has to do with spark.yarn.jars property? This property is to help spark run on yarn, and that should be it. What additional I need to do when using spark.yarn.jars?
Thanks in reading this question and for your help in advance.
You could also use the spark.yarn.archive option and set that to the location of an archive (you create) containing all the JARs in the $SPARK_HOME/jars/ folder, at the root level of the archive. For example:
Create the archive: jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
Upload to HDFS: hdfs dfs -put spark-libs.jar /some/path/.
2a. For a large cluster, increase the replication count of the Spark archive so that you reduce the amount of times a NodeManager will do a remote copy. hdfs dfs –setrep -w 10 hdfs:///some/path/spark-libs.jar (Change the amount of replicas proportional to the number of total NodeManagers)
Set spark.yarn.archive to hdfs:///some/path/spark-libs.jar
I was finally able to make sense of this property. I found by hit-n-trial that correct syntax of this property is
spark.yarn.jars=hdfs://xx:9000/user/spark/share/lib/*.jar
I didn't put *.jar in the end and my path was just ended with /lib. I tried putting actual assembly jar like this - spark.yarn.jars=hdfs://sanjeevd.brickred:9000/user/spark/share/lib/spark-yarn_2.11-2.0.1.jar but no luck. All it said that unable to load ApplicationMaster.
I posted my response to a similar question asked by someone at https://stackoverflow.com/a/41179608/2332121
If you look at spark.yarn.jars documentation it says the following
List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.
This means that you are actually overriding the SPARK_HOME/jars and telling yarn to pick up all the jars required for the application run from your path,If you set spark.yarn.jars property, all the dependent jars for spark to run should be present in this path, If you go and look inside spark-assembly.jar present in SPARK_HOME/lib , org.apache.spark.deploy.yarn.ApplicationMaster class is present, so make sure that all the spark dependencies are present in the HDFS path that you specify as spark.yarn.jars.

Spark streaming job fails after getting stopped by Driver

I have a spark streaming job which reads in data from Kafka and does some operations on it. I am running the job over a yarn cluster, Spark 1.4.1, which has two nodes with 16 GB RAM each and 16 cores each.
I have these conf passed to the spark-submit job :
--master yarn-cluster --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 3
The job returns this error and finishes after running for a short while :
INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 11,
(reason: Max number of executor failures reached)
.....
ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0:
Stopped by driver
Updated :
These logs were found too :
INFO yarn.YarnAllocator: Received 3 containers from YARN, launching executors on 3 of them.....
INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down.
....
INFO yarn.YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.
INFO yarn.ExecutorRunnable: Starting Executor Container.....
INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down...
INFO yarn.YarnAllocator: Completed container container_e10_1453801197604_0104_01_000006 (state: COMPLETE, exit status: 1)
INFO yarn.YarnAllocator: Container marked as failed: container_e10_1453801197604_0104_01_000006. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_e10_1453801197604_0104_01_000006
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
What might be the reasons for this? Appreciate some help.
Thanks
can you please show your scala/java code that is reading from kafka? I suspect you probably not creating your SparkConf correctly.
Try something like
SparkConf sparkConf = new SparkConf().setAppName("ApplicationName");
also try running application in yarn-client mode and share the output.
I got the same issue. and I have found 1 solution to fix the issue by removing sparkContext.stop() at the end of main function, leave the stop action for GC.
Spark team has resolved the issue in Spark core, however, the fix has just been master branch so far. We need to wait until the fix has been updated into the new release.
https://issues.apache.org/jira/browse/SPARK-12009

Spark driver program launching in `cluster` mode failed in a weird way

I'm new to Spark. Now I encountered a problem: when I launch a program in a standalone spark cluster while command line:
./spark-submit --class scratch.Pi --deploy-mode cluster --executor-memory 5g --name pi --driver-memory 5g --driver-java-options "-XX:MaxPermSize=1024m" --master spark://bx-42-68:7077 hdfs://bx-42-68:9000/jars/pi.jar
It will throws following error:
15/01/28 19:48:51 INFO Slf4jLogger: Slf4jLogger started
15/01/28 19:48:51 INFO Utils: Successfully started service 'driverClient' on port 59290.
Sending launch command to spark://bx-42-68:7077
Driver successfully submitted as driver-20150128194852-0003
... waiting before polling master for driver state
... polling master for driver state
State of driver-20150128194852-0003 is FAILED
Master of cluster outputs following log:
15/01/28 19:48:52 INFO Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper
15/01/28 19:48:52 INFO Master: Launching driver driver-20150128194852-0003 on worker worker-20150126133948-bx-42-151-26286
15/01/28 19:48:55 INFO Master: Removing driver: driver-20150128194852-0003
15/01/28 19:48:57 INFO Master: akka.tcp://driverClient#bx-42-68:59290 got disassociated, removing it.
15/01/28 19:48:57 INFO Master: akka.tcp://driverClient#bx-42-68:59290 got disassociated, removing it.
15/01/28 19:48:57 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://driverClient#bx-42-68:59290] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/01/28 19:48:57 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.16.42.68%3A48091-16#-1393479428] was not delivered. [9] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
And the corresponding worker for launching driver program outputs:
15/01/28 19:48:52 INFO Worker: Asked to launch driver driver-20150128194852-0003
15/01/28 19:48:52 INFO DriverRunner: Copying user jar hdfs://bx-42-68:9000/jars/pi.jar to /data11/spark-1.2.0-bin-hadoop2.4/work/driver-20150128194852-0003/pi.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/01/28 19:48:55 INFO DriverRunner: Launch Command: "/opt/apps/jdk-1.7.0_60/bin/java" "-cp" "/data11/spark-1.2.0-bin-hadoop2.4/work/driver-20150128194852-0003/pi.jar:::/data11/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/data11/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar" "-XX:MaxPermSize=128m" "-Dspark.executor.memory=5g" "-Dspark.akka.askTimeout=10" "-Dspark.rdd.compress=true" "-Dspark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" "-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" "-Dspark.app.name=YANL" "-Dspark.driver.extraJavaOptions=-XX:MaxPermSize=1024m" "-Dspark.jars=hdfs://bx-42-68:9000/jars/pi.jar" "-Dspark.master=spark://bx-42-68:7077" "-Dspark.storage.memoryFraction=0.6" "-Dakka.loglevel=WARNING" "-XX:MaxPermSize=1024m" "-Xms5120M" "-Xmx5120M" "org.apache.spark.deploy.worker.DriverWrapper" "akka.tcp://sparkWorker#bx-42-151:26286/user/Worker" "scratch.Pi"
15/01/28 19:48:55 WARN Worker: Driver driver-20150128194852-0003 exited with failure
My spark-env.sh is:
export SCALA_HOME=/opt/apps/scala-2.11.5
export JAVA_HOME=/opt/apps/jdk-1.7.0_60
export SPARK_HOME=/data11/spark-1.2.0-bin-hadoop2.4
export PATH=$JAVA_HOME/bin:$PATH
export SPARK_MASTER_IP=`hostname -f`
export SPARK_LOCAL_IP=`hostname -f`
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=10.16.42.68:2181,10.16.42.134:2181,10.16.42.151:2181,10.16.42.150:2181,10.16.42.125:2181 -Dspark.deploy.zookeeper.dir=/spark"
SPARK_WORKER_MEMORY=43g
SPARK_WORKER_CORES=22
And my spark-defaults.conf is:
spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.executor.memory 20g
spark.rdd.compress true
spark.storage.memoryFraction 0.6
spark.serializer org.apache.spark.serializer.KryoSerializer
However, when I launch the program with client mode with following command, it works fine.
./spark-submit --class scratch.Pi --deploy-mode client --executor-memory 5g --name pi --driver-memory 5g --driver-java-options "-XX:MaxPermSize=1024m" --master spark://bx-42-68:7077 /data11/pi.jar
The reason why it works in "client" mode and not in "cluster" mode is because there is no support for "cluster" mode in a standalone cluster.(mentioned in the spark documentation).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors.
Note that cluster mode is currently not supported for standalone
clusters, Mesos clusters, or python applications.
If you look at "Submitting Applications" section in spark documentation, it is clearly mentioned that the support for cluster mode is not available in standalone clusters.
Reference link : http://spark.apache.org/docs/1.2.0/submitting-applications.html
Go to above link and have a look at "Launching Applications with spark-submit" section.
Think it will help. Thanks.

Resources