Why is a Spark application's final status FAILED when it finishes successfully? - apache-spark

My Spark 2.0.0 application runs on YARN 2.7.2. It finishes successfully, but YARN marks it as failed with this error:
Final app status: FAILED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
I see no errors on the executors or the driver, and the application writes the data it is supposed to.

This turned out to be caused by calling System.exit(0) explicitly in my code. System.exit triggers the JVM shutdown hooks before the YARN ApplicationMaster has had a chance to report the final SUCCEEDED status, which is exactly what the exit code 16 message describes. After removing the call, the problem is gone.

Related

How to fix this fatal error while running Spark jobs on an HDInsight cluster? Session 681 unexpectedly reached final status 'dead'. See logs:

I am running PySpark code on an HDInsight cluster and getting this error:
The code failed because of a fatal error: Session 681 unexpectedly
reached final status 'dead'. See logs:
I don't have experience with YARN or Hadoop. I tried a few links provided on Stack Overflow, but none of them helped. One strange thing is that I was able to run the same code yesterday without this error.
I just ran this import:
from pyspark.sql import SparkSession
This is the error I am getting:
19/06/21 20:35:35 INFO Client:
client token: N/A
diagnostics: [Fri Jun 21 20:35:35 +0000 2019] Application is Activated, waiting for resources to be assigned for AM. Details : AM Partition = <DEFAULT_PARTITION> ; Partition Resource = <memory:819200, vCores:240> ; Queue's Absolute capacity = 50.0 % ; Queue's Absolute used capacity = 99.1875 % ; Queue's Absolute max capacity = 100.0 % ;
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1561149335158
final status: UNDEFINED
tracking URL: https://mmsorderpredhdi.azurehdinsight.net/yarnui/hn/proxy/application_1560840076505_0062/
user: livy
19/06/21 20:35:35 INFO ShutdownHookManager: Shutdown hook called
19/06/21 20:35:35 INFO ShutdownHookManager: Deleting directory /tmp/spark-bb63c5f0-7579-4456-b32a-0e643ca97ecc
YARN Diagnostics:
Application killed by user..
Question: Does this have something to do with the queue's absolute used capacity?
Could you please check the logs to find the exact issue?
Where do I find the log file?
On an Azure HDInsight cluster, you can find the Livy log by connecting to one of the head nodes with SSH and downloading the file at this path:
hdfs dfs -ls /app-logs/livy/logs-ifile
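If that file is not where you expect, the aggregated YARN logs for the session can usually also be pulled with the yarn logs command; a minimal sketch, assuming the application id from the output above and that log aggregation is enabled on the cluster:
yarn logs -applicationId application_1560840076505_0062 > application_1560840076505_0062.log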
For more details, refer to "Access Apache Hadoop YARN application logs on Linux-based HDInsight".
You may also refer to "How to start sparksession in pyspark".
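As for the queue question: the diagnostics show the default queue at 99.1875 % absolute used capacity, which is why the ApplicationMaster is still waiting for resources to be assigned. A quick, hedged way to inspect the queue from a head node (assuming the standard YARN CLI is available) is:
yarn queue -status default
If the queue really is full, freeing capacity or killing the applications occupying it should let the session get scheduled.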
Hope this helps.

Dkron job failed to execute at the scheduled time

I'm using Dkron 0.10.4. I created scheduled jobs in Dkron and they were working fine previously, but suddenly the jobs are not executed at the scheduled time. The output shows
rpc error: code = Unknown desc = exit status 1
and status "False"
You could open dkron.yml in /etc/dkron and add a debug line like:
log-level: debug
Then run the Dkron agent:
# dkron agent --config /etc/dkron/dkron.yml
You should see debug lines with more information about your RPC error.

Stopping a job in Spark

I'm using Spark version 1.3. I have a job that's taking forever to finish.
To fix it, I made some optimizations to the code, and started the job again. Unfortunately, I launched the optimized code before stopping the earlier version, and now I cannot stop the earlier job.
Here are the things I've tried to kill this app:
Through the Web UI
result: The Spark UI has no "kill" option for apps (I'm assuming spark.ui.killEnabled has not been enabled; I'm not the owner of this cluster).
Through the command line: spark-class org.apache.spark.deploy.Client kill mymasterURL app-XXX
result: I get this message:
Driver app-XXX has already finished or does not exist
But I see in the web UI that it is still running, and the resources are still occupied.
Through the command line via spark-submit: spark-submit --master mymasterURL --deploy-mode cluster --kill app-XXX
result: I get this error:
Error: Killing submissions is only supported in standalone mode!
I also tried to retrieve the SparkContext in order to stop it (via SparkContext.stop() or cancelAllJobs()), but have been unsuccessful, as getOrCreate is not available in 1.3; I have not been able to retrieve the SparkContext of the initial app.
I'd appreciate any ideas!
Edit: I've also tried killing the app through yarn by executing: yarn application -kill app-XXX
result: I got this error:
Exception in thread "main" java.lang.IllegalArgumentException:
Invalid ApplicationId prefix: app-XX. The valid ApplicationId should
start with prefix application
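For reference, yarn application -kill only accepts YARN application ids, which start with application_; the app-XXX ids reported by the Spark UI are Spark's own ids. If the job really was submitted to YARN, a hedged sketch for finding and killing it would be:
# list running YARN applications and look for the one matching your job's name
yarn application -list -appStates RUNNING
# kill it using the YARN id from that listing (the id below is hypothetical)
yarn application -kill application_1428487296152_25597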

How to know if app is in RUNNING state to kill spark-submit process?

I am creating a shell script which will be executed from Jenkins, because we have many streaming jobs and it seems easier to manage them from Jenkins. So I have created the script below.
#!/bin/bash
spark-submit "spark parameters here" > /dev/null 2>&1 &
processId=$!
echo $processId
sleep 5m
kill $processId
If I don't have the sleep, the spark-submit process is killed immediately and no Spark application is submitted. With the sleep, the spark-submit process gets enough time to submit the Spark application.
My question is: is there a better way to know whether the Spark application is in the RUNNING state, so that the spark-submit process can be killed?
Spark 1.6.0 with YARN
You should spark-submit your Spark application and then use yarn application -status <ApplicationId>, as described in the application section of the YARN commands documentation:
Prints the status of the application.
You could get <ApplicationId> from the logs of spark-submit (in client deploy mode) or use yarn application -list -appTypes SPARK -appStates RUNNING.
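A hedged sketch of how the Jenkins script above could wait for the RUNNING state instead of using a fixed sleep (this assumes cluster deploy mode on YARN, and that the application id can be grepped out of the spark-submit output, so adjust the log path and pattern to your setup):
#!/bin/bash
# submit and keep spark-submit's output so the YARN application id can be extracted from it
spark-submit "spark parameters here" > submit.log 2>&1 &
processId=$!

# wait for an application id to appear in the output (the pattern is an assumption)
appId=""
while [ -z "$appId" ]; do
  sleep 5
  appId=$(grep -o 'application_[0-9_]*' submit.log | head -n 1)
done

# once YARN reports the application as RUNNING, the local spark-submit process can be killed
until yarn application -status "$appId" 2>/dev/null | grep -q 'State : RUNNING'; do
  sleep 5
done
kill $processId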
I don't know what Spark version you are using or if you are running in standalone mode, but anyway, you can use the REST API for submitting/killing your apps. The last time I checked it was pretty much undocumented, but it worked properly.
When you submit an application, you will get a submissionId which you can use later for either getting the current state or killing it. The possible states are documented here:
// SUBMITTED: Submitted but not yet scheduled on a worker
// RUNNING: Has been allocated to a worker to run
// FINISHED: Previously ran and exited cleanly
// RELAUNCHING: Exited non-zero or due to worker failure, but has not yet started running again
// UNKNOWN: The state of the driver is temporarily not known due to master failure recovery
// KILLED: A user manually killed this driver
// FAILED: The driver exited non-zero and was not supervised
// ERROR: Unable to run or restart due to an unrecoverable error (e.g. missing jar file)
This is especially useful for long-running apps (e.g. streaming), since you don't have to babysit the shell script.
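For illustration, a hedged sketch of what those calls look like against a standalone master's REST endpoint (port 6066 by default; the submissionId below is a made-up example of the id returned by the original submission request):
# query the current state of a submission
curl http://<master-host>:6066/v1/submissions/status/driver-20170101000000-0001
# ask the master to kill it
curl -X POST http://<master-host>:6066/v1/submissions/kill/driver-20170101000000-0001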

Neo4j local server does not start

I am running Linux in VirtualBox and am having an issue that I did not encounter on my machine with Linux as the primary OS.
When launching the Neo4j service through sudo ./neo4j start in /opt/neo4j-community-2.3.1/bin, I get a timeout with the message: Failed to start within 120 seconds. Neo4j Server may have failed to start, please check the logs.
My log from /opt/neo4j-community-2.3.1/data/graph.db/messages.log says:
http://pastebin.com/wUA715QQ
and data/log/console.log says:
2016-01-06 02:07:03.404+0100 INFO Successfully started database
2016-01-06 02:07:03.603+0100 INFO Successfully stopped database
2016-01-06 02:07:03.604+0100 INFO Successfully shutdown Neo4j Server
2016-01-06 02:07:03.608+0100 ERROR Failed to start Neo4j: Starting Neo4j failed: Component 'org.neo4j.server.security.auth.FileUserRepository#9ab182' was successfully initialized, but failed to start. Please see attached cause exception. Starting Neo4j failed: Component 'org.neo4j.server.security.auth.FileUserRepository#9ab182' was successfully initialized, but failed to start. Please see attached cause exception.
org.neo4j.server.ServerStartupException: Starting Neo4j failed: Component 'org.neo4j.server.security.auth.FileUserRepository#9ab182' was successfully initialized, but failed to start. Please see attached cause exception.
at org.neo4j.server.exception.ServerStartupErrors.translateToServerStartupError(ServerStartupErrors.java:67)
at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:234)
at org.neo4j.server.Bootstrapper.start(Bootstrapper.java:97)
at org.neo4j.server.CommunityBootstrapper.start(CommunityBootstrapper.java:48)
at org.neo4j.server.CommunityBootstrapper.main(CommunityBootstrapper.java:35)
Caused by: org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.server.security.auth.FileUserRepository#9ab182' was successfully initialized, but failed to start. Please see attached cause exception.
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:462)
at org.neo4j.kernel.lifecycle.LifeSupport.start(LifeSupport.java:111)
at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:194)
... 3 more
Caused by: java.nio.file.AccessDeniedException: /opt/neo4j-community-2.3.1/data/dbms/auth
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.Files.readAllBytes(Files.java:3152)
at org.neo4j.server.security.auth.FileUserRepository.loadUsersFromFile(FileUserRepository.java:208)
at org.neo4j.server.security.auth.FileUserRepository.start(FileUserRepository.java:73)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:452)
... 5 more
Any idea why the server won't start?
Check the permissions on /opt/neo4j-community-2.3.1/data/dbms/auth
See the line that says:
Caused by: java.nio.file.AccessDeniedException: /opt/neo4j-community-2.3.1/data/dbms/auth
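A common cause is that the auth file is owned by (or only readable by) a different user than the one the server runs as, for example after mixing sudo and non-sudo starts. A hedged sketch of checking and fixing it, assuming Neo4j should run as your current user:
ls -l /opt/neo4j-community-2.3.1/data/dbms/auth
# if it belongs to another user, hand it back to the user that runs Neo4j
sudo chown "$USER": /opt/neo4j-community-2.3.1/data/dbms/auth
sudo chmod 600 /opt/neo4j-community-2.3.1/data/dbms/auth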
