Application failed 2 times due to AM container for appattempt_ exited with exitCode: 0 - apache-spark

When I submit my Spark program, it fails at the end, but with an ExitCode: 0 as shown in the picture.
The program should write a table to Hive, and despite the failure the table was created successfully.
But I can't figure out the origin of the error. Can you help, please?
yarn logs -applicationId gives the following output here

I finally solved my problem. In fact, today I surprisingly got another error from the same program saying
ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver X:37478 disassociated! Shutting down.
I found many solutions talking about memory or timeouts.
What did the trick was simply that I had forgotten to close my SparkSession (spark.close()).
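In PySpark terms, the fix is just making sure the session is stopped once the job finishes; a minimal sketch (app and table names are placeholders, and in the Python API the equivalent call is spark.stop()):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-writer").enableHiveSupport().getOrCreate()
try:
    df = spark.range(10).toDF("id")                            # placeholder workload
    df.write.mode("overwrite").saveAsTable("my_db.my_table")   # hypothetical table name
finally:
    spark.stop()   # stop the session so the YARN ApplicationMaster unregisters cleanly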

Related

Spark job failing with "Fail to know the executor driver is alive or not", "Cannot find endpoint: spark://CoarseGrainedScheduler#<host:port>"

I'm running a job on a local Spark cluster (PySpark). When I run it with a small dataset it works fine, but once the dataset is large, I get an error. I'm wondering (1) how to find logs from the scheduler process that appears to be crashing, and (2) more generally, what might be going on and how to debug the problem. Thanks in advance. Happy to provide more info.
Here's the error (from what I understand to be the driver logs):
block-manager-ask-thread-pool-224 ERROR BlockManagerMasterEndpoint: Fail to know the executor driver is alive or not.
org.apache.spark.SparkException: Exception thrown in awaitResult:
at...
...
<stacktrace>
...
Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find endpoint: spark://CoarseGrainedScheduler#<host:port>
and then immediately below that
block-manager-ask-thread-pool-224 WARN BlockManagerMasterEndpoint: Error trying to remove shuffle 25. The executor driver may have been lost.
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from <host:port>
What I know about my job: I'm using PySpark and running Spark standalone, using a local cluster with 72 workers (the machine has 96 cores). Here's my config:
spark:
  master: "local[72]"
  files:
    maxPartitionBytes: 67108864
  sql:
    files:
      maxPartitionBytes: 67108864
  driver:
    memory: "50g"
    maxResultSize: "2g"
    supervise: true
    cores: 72
    log:
      dfsDir: <my/logs/dir>
      persistToDfs:
        enabled: true
  loglevel: "WARN"
  logConf: true
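For reference, a rough PySpark equivalent of the settings above (the YAML is consumed by my own wrapper, so the mapping to Spark properties is an assumption; spark.driver.supervise and spark.driver.cores are omitted since they only apply to cluster deploy mode):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[72]")
    .config("spark.files.maxPartitionBytes", 67108864)
    .config("spark.sql.files.maxPartitionBytes", 67108864)
    .config("spark.driver.memory", "50g")                        # only takes effect if set before the driver JVM starts
    .config("spark.driver.maxResultSize", "2g")
    .config("spark.driver.log.dfsDir", "<my/logs/dir>")          # placeholder path kept from the config above
    .config("spark.driver.log.persistToDfs.enabled", "true")
    .config("spark.logConf", "true")
    .getOrCreate()
)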
I've set SPARK_LOG_DIR and SPARK_WORKER_LOG_DIR in an attempt to see scheduler logs, but as far as I can tell I still only see driver (worker?) logs, with the above error. I'm monitoring memory usage and it doesn't seem like my machine is memory-constrained, but I can't be sure I'm checking at the right moments. The machine has about 1 TB of memory and tens of terabytes of free disk space.
Thanks in advance!

Spark application in incomplete section of spark-history even when completed

In my Spark History, some applications have been "incomplete" for a week now. I've tried killing them, closing sparkContext(), and killing the main .py process, but nothing helped.
For example,
yarn application -status <id>
shows:
...
State: FINISHED
Final-State: SUCCEDED
...
Log Aggregation Status: TIME_OUT
...
But in Spark History I still see it in the incomplete section of my applications. If I open this application there, I can see 1 active job with 1 alive executor, but they have been doing nothing for a whole week. This seems like a logging bug, but as far as I know this problem affects only me; my coworkers don't have it.
This thread didn't help me, because I don't have access to start-history-server.sh.
I suppose this problem is because of
Log Aggregation Status: TIME_OUT
because my "completed" applications have
Log Aggregation Status: SUCCEDED
What can I do to fix this? Right now I have 90+ incomplete applications.
I've found a clear description of my problem in the same situation (YARN, Spark, etc.), but there is no solution: What is 'Active Jobs' in Spark History Server Spark UI Jobs section
From Spark Monitoring and Instrumentation:
...
3. Applications which exited without registering themselves as completed will be listed as incomplete, even though they are no longer running. This can happen if an application crashes.
...
Meaning:
The History Server's UI shows only those Spark applications whose event logs it can find in its spark.eventLog.dir directory (a config typically set to /user/spark/applicationHistory in Hadoop). If a log doesn't end with the special ApplicationEnd event:
{"Event":"SparkListenerApplicationEnd","Timestamp":1667223930402}
...the application is considered incomplete (even if it is no longer running) and will be displayed on the Incomplete Applications page.
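A quick way to confirm this for a particular log (a sketch, assuming the event log is uncompressed and has been copied to a local path; the path below is a placeholder):

import json

def has_application_end(event_log_path):
    # Return True if the log contains the final ApplicationEnd event shown above.
    with open(event_log_path) as f:
        for line in f:
            line = line.strip()
            if line and json.loads(line).get("Event") == "SparkListenerApplicationEnd":
                return True
    return False

print(has_application_end("/tmp/spark-events/application_1234_0001"))   # hypothetical local copy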
For your question this means that "moving" the application to the Completed Apps page won't be trivial: it would require manually editing the event log and re-uploading it to the SHS directory in Hadoop. Moreover, it won't solve anything, since most likely your application keeps crashing before it can write that final message, and its next run will end up on the same Incomplete page again.
To diagnose the reason why it fails, you could look at the application driver logs for any clues, such as errors or exception messages. Graceful shutdown looks different depending on what kind of resource manager and deploy mode your app is using. For deploy-mode=cluster and the YARN RM, it would look something like this:
22/10/31 11:11:11 INFO spark.SparkContext: Successfully stopped SparkContext
22/10/31 11:11:11 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
22/10/31 11:11:11 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
22/10/31 11:11:11 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
22/10/31 11:11:11 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://.../.../.sparkStaging/application_<appId>
22/10/31 11:11:11 INFO util.ShutdownHookManager: Shutdown hook called
22/10/31 11:11:11 INFO util.ShutdownHookManager: Deleting directory /.../.../appcache/application_<appId>/spark-<guid>

Databricks error: Internal error, sorry. Attach your notebook to a different cluster or restart the current cluster

I'm working in Databricks 8.1. If I run my PySpark code line by line using Shift + Enter, it runs fine, but if I run the entire notebook using the Run All button, I get an internal error. The complete error message is:
Internal error, sorry. Attach your notebook to a different cluster or restart the current cluster.
com.databricks.rpc.RPCResponseTooLarge: rpc response (of 20984709 bytes) exceeds limit of 20971520 bytes
  at com.databricks.rpc.Jetty9Client$$anon$1.onContent(Jetty9Client.scala:370)
  at shaded.v9_4.org.eclipse.jetty.client.api.Response$Listener$Adapter.onContent(Response.java:248)
  at shaded.v9_4.org.eclipse.jetty.client.ResponseNotifier.notifyContent(ResponseNotifier.java:135)
  at shaded.v9_4.org.eclipse.jetty.client.ResponseNotifier.notifyContent(ResponseNotifier.java:126)
  at shaded.v9_4.org.eclipse.jetty.client.HttpReceiver.responseContent(HttpReceiver.java:340)
  at shaded.v9_4.org.eclipse.jetty.client.http.HttpReceiverOverHTTP.content(HttpReceiverOverHTTP.java:283)
  at shaded.v9_4.org.eclipse.jetty.http.HttpParser.parseContent(HttpParser.java:1762)
  at shaded.v9_4.org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:1490)
  at shaded.v9_4.org.eclipse.jetty.client.http.HttpReceiverOverHTTP.parse(HttpReceiverOverHTTP.java:172)
  at shaded.v9_4.org.eclipse.jetty.client.http.HttpReceiverOverHTTP.process(HttpReceiverOverHTTP.java:135)
  at shaded.v9_4.org.eclipse.jetty.client.http.HttpReceiverOverHTTP.receive(HttpReceiverOverHTTP.java:73)
  at shaded.v9_4.org.eclipse.jetty.client.http.HttpChannelOverHTTP.receive(HttpChannelOverHTTP.java:133)
  at shaded.v9_4.org.eclipse.jetty.client.http.HttpConnectionOverHTTP.onFillable(HttpConnectionOverHTTP.java:151)
  at shaded.v9_4.org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
  at shaded.v9_4.org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
  at shaded.v9_4.org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:426)
  at shaded.v9_4.org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:320)
  at shaded.v9_4.org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:158)
  at shaded.v9_4.org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
  at shaded.v9_4.org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
  at shaded.v9_4.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
  at shaded.v9_4.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
  at shaded.v9_4.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
  at shaded.v9_4.org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
  at shaded.v9_4.org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:367)
  at shaded.v9_4.org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:782)
  at shaded.v9_4.org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:914)
  at java.base/java.lang.Thread.run(Thread.java:834)
I have restarted the cluster several times, but I'm still getting the same issue.
Can you suggest steps to resolve it?
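No accepted fix is recorded in this thread; as a general mitigation (an assumption on my part, not something confirmed by the question), keeping each cell's output well below the ~20 MB RPC limit shown in the trace avoids this class of error, for example:

# Sketch: avoid sending a huge result back to the notebook frontend in one cell.
big_df = spark.range(0, 100_000_000).toDF("id")               # hypothetical large result

# Instead of collecting or printing the whole DataFrame as cell output...
# rows = big_df.collect()

# ...persist it and only show a small sample in the notebook.
big_df.write.mode("overwrite").saveAsTable("tmp_results")     # hypothetical table name
big_df.limit(20).show()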

getOrCreate deployment failing randomly

When attempting to call H2OContext.getOrCreate with a valid SparkContext, we intermittently see deployment failures:
17/04/21 17:21:32 ERROR TaskSchedulerImpl: Lost executor 0 on 172.17.0.4: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
17/04/21 17:21:38 ERROR LiveListenerBus: Listener ExecutorAddNotSupportedListener threw an exception
java.lang.IllegalArgumentException: Executor without H2O instance discovered, killing the cloud!
at org.apache.spark.listeners.ExecutorAddNotSupportedListener.onExecutorAdded(H2OSparkListener.scala:27)
at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:61)
at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1252)
at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
The H2OContext.getOrCreate causes the error:
Context.spark_session = SparkSession.builder.getOrCreate()
Context.h2o_context = H2OContext.getOrCreate(Context.spark_session)
Any thoughts from the H2O Crew?
This is a known behaviour of the Sparkling Water internal backend at the moment. To avoid it, the external Sparkling Water backend can be used. More information can be found here: https://github.com/h2oai/sparkling-water/blob/master/doc/backends.md
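A minimal sketch of switching to the external backend from PySpark, assuming the spark.ext.h2o.backend.cluster.mode option described in the linked backends document (verify the option names against your Sparkling Water version; the extra settings needed to start or connect to the external H2O cluster are omitted):

from pyspark.sql import SparkSession
from pysparkling import H2OContext   # pysparkling version must match the Sparkling Water version

spark = (
    SparkSession.builder
    .appName("sparkling-water-external")                        # hypothetical app name
    .config("spark.ext.h2o.backend.cluster.mode", "external")   # assumed option from the backends doc
    .getOrCreate()
)
hc = H2OContext.getOrCreate(spark)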
I'm currently working on a JIRA that should eliminate the behaviour above as well. It's a work in progress; this JIRA, https://0xdata.atlassian.net/browse/SW-369, can be tracked to follow the status of the task.

Spark - How to identify a failed Job through 'SparkLauncher'

I am using Spark 2.0, and sometimes my job fails due to problems with the input. For example, I am reading CSV files from an S3 folder based on the date, and if there's no data for the current date, my job has nothing to process, so it throws an exception like the one below. This gets printed in the driver's logs.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: s3n://data/2016-08-31/*.csv;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
...
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/09/03 10:51:54 INFO SparkContext: Invoking stop() from shutdown hook
16/09/03 10:51:54 INFO SparkUI: Stopped Spark web UI at http://192.168.1.33:4040
16/09/03 10:51:54 INFO StandaloneSchedulerBackend: Shutting down all executors
16/09/03 10:51:54 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
Spark App app-20160903105040-0007 state changed to FINISHED
However, despite this uncaught exception, my Spark job status is 'FINISHED'. I would expect it to be 'FAILED' because there was an exception. Why is it marked as FINISHED, and how can I find out whether the job failed or not?
Note: I am spawning the Spark jobs using SparkLauncher and listening to state changes through AppHandle, but the state change I receive is FINISHED whereas I am expecting FAILED.
The FINISHED you see is for the Spark application, not a job. It is FINISHED because the Spark context was able to start and stop properly.
You can see any job's information using JavaSparkStatusTracker.
For active jobs nothing additional needs to be done, since it has a getActiveJobIds method.
To get finished/failed jobs, you will need to set a job group ID in the thread from which you are calling for a Spark execution:
JavaSparkContext sc;
...
sc.setJobGroup(MY_JOB_ID, "Some description");
Then, whenever you need, you can read the status of each job within the specified job group:
JavaSparkStatusTracker statusTracker = sc.statusTracker();
for (int jobId : statusTracker.getJobIdsForGroup(MY_JOB_ID)) {   // jobs submitted under MY_JOB_ID
    final SparkJobInfo jobInfo = statusTracker.getJobInfo(jobId);
    final JobExecutionStatus status = jobInfo.status();
}
The JobExecutionStatus can be one of RUNNING, SUCCEEDED, FAILED, or UNKNOWN; the last one covers the case where the job has been submitted but not actually started.
Note: all of this is available from the Spark driver, i.e. the jar you are launching using SparkLauncher, so the code above should be placed inside that jar.
If you want to check in general whether there were any failures on the SparkLauncher side, you can make the application started from the jar exit with a code different from 0 (e.g. System.exit(1)) when a job failure is detected. The Process returned by SparkLauncher::launch has an exitValue method, so you can detect whether it failed or not.
You can always go to the Spark History Server and click on your job ID to get the job details.
If you are using YARN, you can go to the Resource Manager web UI to track your job status.
