Spark Job Container exited with exitCode: -1000 - apache-spark

I have been struggling to run a sample job with Spark 2.0.0 in YARN cluster mode. The job exits with exitCode: -1000 and no other clues. The same job runs properly in local mode.
Spark command:
spark-submit \
  --conf "spark.yarn.stagingDir=/xyz/warehouse/spark" \
  --queue xyz \
  --class com.xyz.TestJob \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.local.dir=/xyz/warehouse/tmp" \
  /xyzpath/java-test-1.0-SNAPSHOT.jar $@
TestJob class:
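import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class TestJob {
    public static void main(String[] args) throws InterruptedException {
        // Master and deploy mode come from spark-submit, so the conf stays empty here.
        SparkConf conf = new SparkConf();
        JavaSparkContext jsc = new JavaSparkContext(conf);
        System.out.println(
            "Total count:" +
            jsc.parallelize(Arrays.asList(1, 2, 3, 4)).count());
        jsc.stop();
    }
}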
Error Log:
17/10/04 22:26:52 INFO Client: Application report for application_1506717704791_130756 (state: ACCEPTED)
17/10/04 22:26:52 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.xyz
start time: 1507181210893
final status: UNDEFINED
tracking URL: http://xyzserver:8088/proxy/application_1506717704791_130756/
user: xyz
17/10/04 22:26:53 INFO Client: Application report for application_1506717704791_130756 (state: ACCEPTED)
17/10/04 22:26:54 INFO Client: Application report for application_1506717704791_130756 (state: ACCEPTED)
17/10/04 22:26:55 INFO Client: Application report for application_1506717704791_130756 (state: ACCEPTED)
17/10/04 22:26:56 INFO Client: Application report for application_1506717704791_130756 (state: FAILED)
17/10/04 22:26:56 INFO Client:
client token: N/A
diagnostics: Application application_1506717704791_130756 failed 5 times due to AM Container for appattempt_1506717704791_130756_000005 exited with exitCode: -1000
For more detailed output, check application tracking page:http://xyzserver:8088/cluster/app/application_1506717704791_130756Then, click on links to logs of each attempt.
Diagnostics: Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.xyz
start time: 1507181210893
final status: FAILED
tracking URL: http://xyzserver:8088/cluster/app/application_1506717704791_130756
user: xyz
17/10/04 22:26:56 INFO Client: Deleted staging directory /xyz/spark/.sparkStaging/application_1506717704791_130756
Exception in thread "main" org.apache.spark.SparkException: Application application_1506717704791_130756 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1167)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1213)
When I browse to http://xyzserver:8088/cluster/app/application_1506717704791_130756, the page doesn't exist.
No YARN application logs are found either:
$ yarn logs -applicationId application_1506717704791_130756
/apps/yarn/logs/xyz/logs/application_1506717704791_130756 does not have any log files.
What could be the root cause of this error, and how can I get detailed error logs?

After spending nearly a whole day on this, I found what triggers it: when I remove spark.yarn.stagingDir, the job starts working. I am still not sure why Spark complains about that setting.
Previous spark-submit:
spark-submit \
  --conf "spark.yarn.stagingDir=/xyz/warehouse/spark" \
  --queue xyz \
  --class com.xyz.TestJob \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.local.dir=/xyz/warehouse/tmp" \
  /xyzpath/java-test-1.0-SNAPSHOT.jar $@
New spark-submit:
spark-submit \
  --queue xyz \
  --class com.xyz.TestJob \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.local.dir=/xyz/warehouse/tmp" \
  /xyzpath/java-test-1.0-SNAPSHOT.jar $@
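On the second half of the question (how to get more detail when yarn logs comes back empty), here is a minimal sketch, assuming the spark-submit client output was redirected to a file. The helper name yarn_failure_summary and the log file are made up for illustration. It only pulls the application ID and the last real diagnostics line out of the client output and prints the two standard YARN CLI commands to try next. As far as I know, -1000 is what YARN reports when a container never produced a real exit status, for example when localizing its resources failed. That would also fit the empty aggregated logs here, in which case the NodeManager log on the host that tried to start the AM is the place to look.

```shell
#!/usr/bin/env bash
# Sketch: summarize a failed submission from saved spark-submit client output.
# yarn_failure_summary is a hypothetical helper name, not part of Spark or YARN.

yarn_failure_summary() {
  local log_file="$1"
  local app_id diagnostics

  # First application ID that appears in the client output.
  app_id="$(awk 'match($0, /application_[0-9]+_[0-9]+/) {
                   print substr($0, RSTART, RLENGTH); exit
                 }' "$log_file")"

  # Last diagnostics line that is not just "N/A".
  diagnostics="$(awk '/diagnostics: / && !/diagnostics: N\/A/ {
                        sub(/^.*diagnostics: /, ""); last = $0
                      }
                      END { print last }' "$log_file")"

  echo "application: ${app_id}"
  echo "diagnostics: ${diagnostics}"
  # Standard YARN CLI commands to run next (printed, not executed here).
  echo "next: yarn application -status ${app_id}"
  echo "next: yarn logs -applicationId ${app_id}"
}
```

Usage would be something like yarn_failure_summary submit.log, where submit.log is wherever you redirected the spark-submit output.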

Related

Spark streaming job on yarn keeps terminating after a few hours

I have a Spark streaming job that consumes a Kafka topic and writes to a database. I submitted the job to YARN with the following parameters:
spark-submit \
--jars mongo-spark-connector_2.11-2.4.0.jar,mongo-java-driver-3.11.0.jar,spark-sql-kafka-0-10_2.11-2.4.5.jar \
--driver-class-path mongo-spark-connector_2.11-2.4.0.jar:mongo-java-driver-3.11.0.jar:spark-sql-kafka-0-10_2.11-2.4.5.jar \
--conf spark.executor.extraClassPath=mongo-spark-connector_2.11-2.4.0.jar:mongo-java-driver-3.11.0.jar:spark-sql-kafka-0-10_2.11-2.4.5.jar \
--conf spark.driver.extraClassPath=mongo-spark-connector_2.11-2.4.0.jar:mongo-java-driver-3.11.0.jar:spark-sql-kafka-0-10_2.11-2.4.5.jar \
--class com.example.StreamingApp \
--driver-memory 2g \
--num-executors 6 --executor-cores 3 --executor-memory 3g \
--conf spark.streaming.backpressure.enabled=true \
--conf spark.streaming.backpressure.pid.minRate=10 \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.maxAppAttempts=4 \
--conf spark.yarn.am.attemptFailuresValidityInterval=1h \
--conf spark.yarn.max.executor.failures=16 \
--conf spark.yarn.executor.failuresValidityInterval=1h \
--conf spark.task.maxFailures=8 \
--queue users.adminuser \
--conf spark.speculation=true \
StreamingApp-2-4.0.1-SNAPSHOT.jar
But it terminates after a few hours with the following message on the terminal:
21/02/14 04:05:14 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:14 INFO yarn.Client:
client token: N/A
diagnostics: Attempt recovered after RM restart
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.users.adminuser
start time: 1613260105314
final status: UNDEFINED
tracking URL: https://XXXXXXXXXx:8090/proxy/application_1613217899387_6697/
user: adminuser
21/02/14 04:05:15 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:16 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:17 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:18 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:19 INFO yarn.Client: Application report for application_1613217899387_6697 (state: ACCEPTED)
21/02/14 04:05:20 INFO yarn.Client: Application report for application_1613217899387_6697 (state: FINISHED)
21/02/14 04:05:20 INFO yarn.Client:
client token: N/A
diagnostics: Attempt recovered after RM restartDue to executor failures all available nodes are blacklisted
ApplicationMaster host: XXXXXXXXXx
ApplicationMaster RPC port: 41848
queue: root.users.adminuser
start time: 1613260105314
final status: FAILED
tracking URL: https://XXXXXXXXXx:8090/proxy/application_1613217899387_6697/
user: adminuser
21/02/14 04:05:20 ERROR yarn.Client: Application diagnostics message: Attempt recovered after RM restartDue to executor failures all available nodes are blacklisted
Exception in thread "main" org.apache.spark.SparkException: Application application_1613217899387_6697 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1155)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1603)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:851)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:926)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:935)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/02/14 04:05:20 INFO util.ShutdownHookManager: Shutdown hook called
21/02/14 04:05:20 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-e51d75e7-f19b-4f2f-8d46-b91b1af064b3
21/02/14 04:05:20 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-16d23c16-30f5-4ab7-95b2-4ac4ad584905
I've relaunched a few times, but the same thing keeps happening.
Spark version is 2.4.0-cdh6.2.1, ResourceManager version is 3.0.0-cdh6.2.1

Spark in Yarn Cluster Mode - Yarn client reports FAILED even when job completes successfully

I am experimenting with running Spark in yarn cluster mode (v2.3.0). We have traditionally been running in yarn client mode, but some jobs are submitted from .NET web services, so we have to keep a host process running in the background when using client mode (HostingEnvironment.QueueBackgroundWorkTime...). We are hoping we can execute these jobs in a more "fire and forget" style.
Our jobs continue to run successfully, but we see a curious entry in the logs where the yarn client that submits the job to the application manager is always reporting failure:
18/11/29 16:54:35 INFO yarn.Client: Application report for application_1539978346138_110818 (state: RUNNING)
18/11/29 16:54:36 INFO yarn.Client: Application report for application_1539978346138_110818 (state: RUNNING)
18/11/29 16:54:37 INFO yarn.Client: Application report for application_1539978346138_110818 (state: FINISHED)
18/11/29 16:54:37 INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: N/A
ApplicationMaster host: <ip address>
ApplicationMaster RPC port: 0
queue: root.default
start time: 1543510402372
final status: FAILED
tracking URL: http://server.host.com:8088/proxy/application_1539978346138_110818/
user: p800s1
Exception in thread "main" org.apache.spark.SparkException: Application application_1539978346138_110818 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1153)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1568)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:892)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/11/29 16:54:37 INFO util.ShutdownHookManager: Shutdown hook called
We always create a SparkSession and always return sys.exit(0) (although that appears to be ignored by the Spark framework regardless of how we submit a job). We also have our own internal error logging that routes to Kafka/ElasticSearch. No errors are reported during the job run.
Here's an example of the submit command: spark2-submit --keytab /etc/keytabs/p800s1.ktf --principal p800s1@OURDOMAIN.COM --master yarn --deploy-mode cluster --driver-memory 2g --executor-memory 4g --class com.path.to.MainClass /path/to/UberJar.jar arg1 arg2
This seems to be harmless noise, but I don't like noise that I don't understand. Has anyone experienced something similar?

ERROR : User did not initialize spark context

Error log:
TestSuccessfull
2018-08-20 04:52:15 INFO ApplicationMaster:54 - Final app status: FAILED, exitCode: 13
2018-08-20 04:52:15 ERROR ApplicationMaster:91 - Uncaught exception:
java.lang.IllegalStateException: User did not initialize spark context!
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:498)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:800)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:799)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:824)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
2018-08-20 04:52:15 INFO SparkContext:54 - Invoking stop() from shutdown hook
Error log on the console after the submit command:
2018-08-20 05:47:35 INFO Client:54 - Application report for application_1534690018301_0035 (state: ACCEPTED)
2018-08-20 05:47:36 INFO Client:54 - Application report for application_1534690018301_0035 (state: ACCEPTED)
2018-08-20 05:47:37 INFO Client:54 - Application report for application_1534690018301_0035 (state: FAILED)
2018-08-20 05:47:37 INFO Client:54 -
client token: N/A
diagnostics: Application application_1534690018301_0035 failed 2 times due to AM Container for appattempt_1534690018301_0035_000002 exited with exitCode: 13
Failing this attempt.Diagnostics: [2018-08-20 05:47:36.454]Exception from container-launch.
Container id: container_1534690018301_0035_02_000001
Exit code: 13
My code:
val sparkConf = new SparkConf().setAppName("Gathering Data")
val sc = new SparkContext(sparkConf)
Submit command:
spark-submit --class spark_basic.Test_Local --master yarn --deploy-mode cluster /home/IdeaProjects/target/Spark-1.0-SNAPSHOT.jar
Description:
I have installed Spark on Hadoop in pseudo-distributed mode.
spark-shell works fine; the problem only occurs when I use cluster mode.
My code also works fine. I am able to print the output, but at the end it gives this error.
I presume your code has a line that sets the master to local:
SparkConf.setMaster("local[*]")
If so, comment out that line and try again, since you are already setting the master to yarn in your command:
/usr/cdh/current/spark-client/bin/spark-submit --class com.test.sparkApp --master yarn --deploy-mode cluster --num-executors 40 --executor-cores 4 --driver-memory 17g --executor-memory 22g --files /usr/cdh/current/spark-client/conf/hive-site.xml /home/user/sparkApp.jar
Finally I got it working with the following.
spark-submit:
/home/mahendra/Marvaland/SparkEcho/spark-2.3.0-bin-hadoop2.7/bin/spark-submit --master yarn --class spark_basic.Test_Local /home/mahendra/IdeaProjects/SparkTraining/target/SparkTraining-1.0-SNAPSHOT.jar
Spark session:
val spark = SparkSession.builder()
  .appName("DataETL")
  .master("local[1]")
  .enableHiveSupport()
  .getOrCreate()
Thanks @cricket_007
This error may occur if you submit the Spark job like this:
spark-submit --class some.path.com.Main --master yarn --deploy-mode cluster some_spark.jar
(that is, passing master and deploy-mode as CLI arguments) while your code also creates the context with new SparkContext.
Either get the context with val sc = SparkContext.getOrCreate(), or do not pass the master and deploy-mode arguments to spark-submit if you want to keep new SparkContext.
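The first answer and the asker's own follow-up both point at a master hard-coded to local in the application code. A minimal sketch of a pre-submit check, assuming the sources live under a directory you pass in. The helper name find_hardcoded_master is hypothetical, and the grep pattern only covers the two spellings that appear in this thread, setMaster("local...") and .master("local...").

```shell
#!/usr/bin/env bash
# Sketch: list source lines that hard-code a local master.
# find_hardcoded_master is a hypothetical helper, not a Spark tool.

find_hardcoded_master() {
  local src_dir="$1"
  # Matches setMaster("local...") and .master("local...") in Scala/Java sources.
  # Prints file:line:text for each hit; exits non-zero when nothing matches.
  grep -rnE --include='*.scala' --include='*.java' \
    '(setMaster|\.master)\("local' "$src_dir"
}
```

No output means no hit. Any line it prints is a candidate to remove, or to make conditional, before submitting with --master yarn --deploy-mode cluster.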

Spark job error: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN

I am submitting a Spark job that writes to a Kerberized cluster with the following command. I didn't add any code to the Spark program to enable authentication; I just passed the principal and keytab to spark-submit.
But I am getting a 'Failed to renew token' error. My Spark program could connect to the Hive metastore.
What is causing this?
> ./spark-submit --class com.abcd.xyz.voice.cc.cc.cc --verbose --master
> yarn --deploy-mode cluster --executor-cores 6 --executor-memory 6g
> --driver-java-options "-Dlog4j.configuration=file:/app/home/abcd/conf/my_Driver.log4j"
> --principal myowner@CABLE.abcd.COM --keytab /app/home/emm/myfile.keytab --conf
> "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/app/home/emm/conf/my_Executor.log4j
> -Ddm.logging.name=LegalDemand" /app/home/emm/bin/myjar.jar --files file:///app/home/emm/mykeytab.keytab --conf
> spark.hadoop.fs.hdfs.impl.disable.cache=true
> /app/home/emm/conf/my.properties
17/07/11 18:02:42 INFO yarn.Client:
client token: N/A
diagnostics: Failed to renew token: Kind: TIMELINE_DELEGATION_TOKEN, Service: 172.27.30.133:8188, Ident:
(owner=myowner, renewer=yarn, realUser=, issueDate=1499796160528,
maxDate=1500400960528, sequenceNumber=74505, masterKeyId=294)
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1499796160725
final status: FAILED
tracking URL: http://abcd.net:8088/proxy/application_1499697586013_1727/
user: myowner

Spark on Mesos Cluster - Task Fails

I'm trying to run a Spark application in a Mesos cluster where I have one master and one slave. The slave has 8GB RAM assigned for Mesos. The master is running the Spark Mesos Dispatcher.
I use the following command to submit a Spark application (which is a streaming application).
spark-submit --master mesos://mesos-master:7077 --class com.verifone.media.ums.scheduling.spark.SparkBootstrapper --deploy-mode cluster scheduling-spark-0.5.jar
And I see the following output, which shows it was successfully submitted.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/09/01 12:52:38 INFO RestSubmissionClient: Submitting a request to launch an application in mesos://mesos-master:7077.
15/09/01 12:52:39 INFO RestSubmissionClient: Submission successfully created as driver-20150901072239-0002. Polling submission state...
15/09/01 12:52:39 INFO RestSubmissionClient: Submitting a request for the status of submission driver-20150901072239-0002 in mesos://mesos-master:7077.
15/09/01 12:52:39 INFO RestSubmissionClient: State of driver driver-20150901072239-0002 is now QUEUED.
15/09/01 12:52:40 INFO RestSubmissionClient: Server responded with CreateSubmissionResponse:
{
"action" : "CreateSubmissionResponse",
"serverSparkVersion" : "1.4.1",
"submissionId" : "driver-20150901072239-0002",
"success" : true
}
However, this fails in Mesos, and when I look at the Spark Cluster UI, I see the following message.
task_id { value: "driver-20150901070957-0001" } state: TASK_FAILED message: "" slave_id { value: "20150831-082639-167881920-5050-4116-S6" } timestamp: 1.441091399975446E9 source: SOURCE_SLAVE reason: REASON_MEMORY_LIMIT 11: "\305-^E\377)N\327\277\361:\351\fm\215\312"
Seems like it is related to memory, but I'm not sure whether I have to configure something here to get this working.
UPDATE
I looked at the mesos logs in the slave, and I see the following message.
E0901 07:56:26.086618 1284 fetcher.cpp:515] Failed to run mesos-fetcher: Failed to fetch all URIs for container '33183181-e91b-4012-9e21-baa37485e755' with exit status: 256
So I thought that this could be because of the Spark Executor URL, so I modified the spark-submit to be as follows and increased memory for both driver and slave, but still I see the same error.
spark-submit \
--master mesos://mesos-master:7077 \
--class com.verifone.media.ums.scheduling.spark.SparkBootstrapper \
--deploy-mode cluster \
--driver-memory 1G \
--executor-memory 4G \
--conf spark.executor.uri=http://d3kbcqa49mib13.cloudfront.net/spark-1.4.1-bin-hadoop2.6.tgz \
scheduling-spark-0.5.jar
UPDATE 2
I got past this point by following @hartem's advice (see comments). Tasks are running now, but the actual Spark application still does not run in the cluster. When I look at the logs I see the following. After the last line, Spark does not seem to proceed any further.
15/09/01 10:33:41 INFO SparkContext: Added JAR file:/tmp/mesos/slaves/20150831-082639-167881920-5050-4116-S8/frameworks/20150831-082639-167881920-5050-4116-0004/executors/driver-20150901103327-0002/runs/47339c12-fb78-43d6-bc8a-958dd94d0ccf/spark-1.4.1-bin-hadoop2.6/../scheduling-spark-0.5.jar at http://192.172.1.31:33666/jars/scheduling-spark-0.5.jar with timestamp 1441103621639
I0901 10:33:41.728466 4375 sched.cpp:157] Version: 0.23.0
I0901 10:33:41.730764 4383 sched.cpp:254] New master detected at master@192.172.1.10:7077
I0901 10:33:41.730908 4383 sched.cpp:264] No credentials provided. Attempting to register without authentication
I had a similar issue. The problem was that the slave could not find the jar required to run the class (SparkPi). Once I gave the HTTP URL of the jar, it worked: the jar has to be placed on a distributed system, not on the local file system.
/home/centos/spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
--name SparkPiTestApp \
--class org.apache.spark.examples.SparkPi \
--master mesos://xxxxxxx:7077 \
--deploy-mode cluster \
--executor-memory 5G --total-executor-cores 30 \
http://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.4.0-SNAPSHOT.jar 100
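A minimal sketch of that check, assuming you want to catch a local-only jar path before submitting in cluster mode. The helper name jar_reachable_from_cluster is hypothetical, and the list of schemes treated as reachable is my assumption, not something Spark or Mesos defines.

```shell
#!/usr/bin/env bash
# Sketch: flag application jar locations that only exist on the submitting host.
# jar_reachable_from_cluster is a hypothetical helper; the scheme list is an assumption.

jar_reachable_from_cluster() {
  case "$1" in
    http://*|https://*|hdfs://*) return 0 ;;  # can be fetched over the network
    *) return 1 ;;                            # bare path or file:// is local to one machine
  esac
}

# Example: warn before submitting with a local path.
if ! jar_reachable_from_cluster "scheduling-spark-0.5.jar"; then
  echo "jar is a local path; the slave may not be able to fetch it"
fi
```

A local path can still work if the same file exists at the same location on every node, so treat the warning as a hint rather than a hard failure.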
Could you please do export GLOG_v=1 before launching the slave and see if there is anything interesting in the slave log? I would also look for stdout and stderr files under the slave working directory and see if they contain any clues.

Resources