Spark - How to identify a failed Job through 'SparkLauncher'

I am using Spark 2.0 and sometimes my job fails due to problems with the input. For example, I am reading CSV files from an S3 folder based on the date, and if there's no data for the current date, my job has nothing to process, so it throws an exception like the one below. This gets printed in the driver's logs.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: s3n://data/2016-08-31/*.csv;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
...
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/09/03 10:51:54 INFO SparkContext: Invoking stop() from shutdown hook
16/09/03 10:51:54 INFO SparkUI: Stopped Spark web UI at http://192.168.1.33:4040
16/09/03 10:51:54 INFO StandaloneSchedulerBackend: Shutting down all executors
16/09/03 10:51:54 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
Spark App app-20160903105040-0007 state changed to FINISHED
However, despite this uncaught exception, my Spark Job status is 'FINISHED'. I would expect it to be in 'FAILED' status because there was an exception. Why is it marked as FINISHED? How can I find out whether the job failed or not?
Note: I am spawning the Spark jobs using SparkLauncher, and listening to state changes through AppHandle. But the state change I receive is FINISHED whereas I am expecting FAILED.
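For reference, the launch-and-listen code looks roughly like this (a minimal sketch; the jar path, main class, and master URL are placeholders):
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

// Minimal sketch: launch the driver and listen for state changes.
// startApplication() throws IOException, so call this from a method that handles it.
SparkAppHandle handle = new SparkLauncher()
        .setAppResource("/path/to/my-spark-job.jar")   // placeholder
        .setMainClass("com.example.MyJob")             // placeholder
        .setMaster("spark://master-host:7077")         // placeholder
        .startApplication(new SparkAppHandle.Listener() {
            @Override
            public void stateChanged(SparkAppHandle h) {
                // Even for the failing run described above, the final state reported here is FINISHED.
                System.out.println("State changed to " + h.getState());
            }

            @Override
            public void infoChanged(SparkAppHandle h) {
                // not used
            }
        });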

The FINISHED state you see is for the Spark application, not for a job. It is FINISHED because the Spark context was able to start and stop properly.
You can see job information using JavaSparkStatusTracker.
For active jobs nothing additional needs to be done, since it has a getActiveJobIds() method.
For getting finished/failed jobs, you will need to set up a job group ID in the thread from which you trigger the Spark execution:
JavaSparkContext sc;
...
sc.setJobGroup(MY_JOB_ID, "Some description");
Then, whenever you need, you can read the status of each job within the specified job group:
JavaSparkStatusTracker statusTracker = sc.statusTracker();
for (int jobId : statusTracker.getJobIdsForGroup(MY_JOB_ID)) {
    final SparkJobInfo jobInfo = statusTracker.getJobInfo(jobId);
    final JobExecutionStatus status = jobInfo.status();
}
The JobExecutionStatus can be one of RUNNING, SUCCEEDED, FAILED, or UNKNOWN; the last one covers the case where a job has been submitted but has not actually started.
Note: all of this is available from the Spark driver, i.e. the jar you are launching using SparkLauncher, so the code above should be placed inside that jar.
If you want to check in general whether there were any failures from the SparkLauncher side, you can make the application started from the jar exit with a code different from 0 (e.g. with System.exit(1)) whenever a job failure is detected. The Process returned by SparkLauncher::launch has an exitValue() method, so you can detect whether it failed or not.
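Putting the two sides together, a minimal sketch could look like this (the jar path and class name are placeholders; handling of the checked IOException/InterruptedException is omitted):
// Driver side (inside the launched jar): derive the exit code from the job statuses.
boolean anyFailed = false;
for (int jobId : statusTracker.getJobIdsForGroup(MY_JOB_ID)) {
    SparkJobInfo info = statusTracker.getJobInfo(jobId);
    if (info != null && info.status() == JobExecutionStatus.FAILED) {
        anyFailed = true;
    }
}
sc.stop();
if (anyFailed) {
    System.exit(1);   // non-zero exit code tells the launcher something went wrong
}

// Launcher side: read the exit code of the spawned driver process.
Process driver = new SparkLauncher()
        .setAppResource("/path/to/my-spark-job.jar")   // placeholder
        .setMainClass("com.example.MyJob")             // placeholder
        .launch();
int exitCode = driver.waitFor();   // or poll driver.exitValue() once it has exited
if (exitCode != 0) {
    System.err.println("Spark application failed with exit code " + exitCode);
}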

You can always go to the Spark History Server and click on your job ID to get the job details.
If you are using YARN, you can go to the ResourceManager web UI to track your job status.

Related

Spark application in incomplete section of spark-history even when completed

In my Spark-history, some applications have been "incomplete" for a week now. I've tried to kill them, close the sparkContext(), and kill the main .py process, but nothing helped.
For example,
yarn application -status <id>
shows:
...
State: FINISHED
Final-State: SUCCEEDED
...
Log Aggregation Status: TIME_OUT
...
But in Spark-History I still see it in the incomplete section of my applications. If I open such an application there, I can see 1 active job with 1 alive executor, but they have been doing nothing for the whole week. This seems like a logging bug, but as far as I know the problem only affects me; other coworkers don't have it.
This thread didn't help me, because I don't have access to start-history-server.sh.
I suppose this problem is because of
Log Aggregation Status: TIME_OUT
since my "completed" applications have
Log Aggregation Status: SUCCEEDED
What can I do to fix this? Right now I have 90+ incomplete applications.
I've found a clear description of my problem with the same situation (YARN, Spark, etc.), but there is no solution: What is 'Active Jobs' in Spark History Server Spark UI Jobs section
From Spark Monitoring and Instrumentation:
...
3. Applications which exited without registering themselves as completed will be listed as incomplete, even though they are no longer running. This can happen if an application crashes.
...
Meaning:
The History Server's UI shows only those Spark applications whose event logs it can find in its spark.eventLog.dir directory (a config typically set to /user/spark/applicationHistory in Hadoop). If a log doesn't end with the special ApplicationEnd event:
{"Event":"SparkListenerApplicationEnd","Timestamp":1667223930402}
...the application is considered incomplete (even if it is no longer running) and will be displayed on the Incomplete Applications page.
For your question this means that "moving" an application to the Completed Apps page won't be trivial: it would require manually editing the event log and re-uploading it to the SHS directory in Hadoop. Moreover, it wouldn't solve anything, since most likely your application keeps crashing before it can write that final message, and its next run will end up on the same Incomplete page again.
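If you just want to see which of your runs never wrote that final event, a rough sketch along these lines could help. It assumes the event logs are plain, uncompressed JSON files readable from a path you can access directly (in practice they usually sit on HDFS and may be compressed, so adapt the file access accordingly); the directory used is only the typical default mentioned above.
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Rough sketch: list event logs that never recorded the ApplicationEnd event.
public class FindIncompleteEventLogs {
    public static void main(String[] args) throws IOException {
        Path eventLogDir = Paths.get("/user/spark/applicationHistory");
        try (DirectoryStream<Path> logs = Files.newDirectoryStream(eventLogDir)) {
            for (Path log : logs) {
                List<String> lines = Files.readAllLines(log);
                boolean hasEndEvent = !lines.isEmpty()
                        && lines.get(lines.size() - 1).contains("SparkListenerApplicationEnd");
                if (!hasEndEvent) {
                    System.out.println("No ApplicationEnd event: " + log.getFileName());
                }
            }
        }
    }
}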
To diagnose the reason why it fails, you can look at the application driver logs for any clues, such as errors or exception messages. A graceful shutdown looks different depending on which resource manager and deploy mode your app is using. For deploy-mode=cluster with the YARN RM, it would look something like this:
22/10/31 11:11:11 INFO spark.SparkContext: Successfully stopped SparkContext
22/10/31 11:11:11 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
22/10/31 11:11:11 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
22/10/31 11:11:11 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
22/10/31 11:11:11 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://.../.../.sparkStaging/application_<appId>
22/10/31 11:11:11 INFO util.ShutdownHookManager: Shutdown hook called
22/10/31 11:11:11 INFO util.ShutdownHookManager: Deleting directory /.../.../appcache/application_<appId>/spark-<guid>

Spark streaming error during job runtime in cluster (yarn resource manager)

I am facing the following error:
I wrote an application based on Spark Streaming (DStream) to pull messages coming from PubSub. Unfortunately, I am facing errors during the execution of this job. I am using a cluster composed of 4 nodes to execute the Spark job.
After 10 minutes of the job running without any specific error, I get the following error permanently:
ERROR org.apache.spark.streaming.CheckpointWriter: Could not submit checkpoint task to the thread pool executor
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler@68395dc9 rejected from java.util.concurrent.ThreadPoolExecutor@1a1acc25[Running, pool size = 1, active threads = 1, queued tasks = 1000, completed tasks = 412]

Unable to gracefully finish an Airflow DAG

I have a spark-streaming job that runs on EMR, scheduled by Airflow. We want to gracefully terminate this EMR cluster every week.
But when I issue the kill or SIGTERM signal to the running spark-streaming application, it is reported as a "failed" task in the Airflow DAG. This prevents the DAG from moving further and prevents the next run from triggering.
Is there any way either to kill the running spark-streaming app to mark success or to let the DAG complete even though it sees the task as failed?
Is there any way either to kill the running spark-streaming app to mark success or to let the DAG complete even though it sees the task as failed?
For the first part, can you share your code that kills the Spark app? I think you should be able to have this task return success and have everything downstream "just work".
I'm not too familiar with EMR, but looking at the docs it looks like "job flow" is their name for the Spark cluster. In that case, are you using the built-in EmrTerminateJobFlowOperator?
I wonder if the failed task is the terminating cluster propagating back an error code or something. Also, is it possible that the cluster is failing to terminate and your code is raising an exception, leading to a failed task?
To answer the second part, if you have multiple upstream tasks, you can use an alternate trigger rule on the operator to determine which downstream tasks run.
class TriggerRule(object):
    ALL_SUCCESS = 'all_success'
    ALL_FAILED = 'all_failed'
    ALL_DONE = 'all_done'
    ONE_SUCCESS = 'one_success'
    ONE_FAILED = 'one_failed'
    DUMMY = 'dummy'
https://github.com/apache/incubator-airflow/blob/master/airflow/utils/trigger_rule.py
https://github.com/apache/incubator-airflow/blob/master/docs/concepts.rst#trigger-rules

Spark - Cosmos - connector problems

I am playing around with the Azure Spark-CosmosDB connector, which lets you access CosmosDB nodes directly from a Spark cluster for analytics using Jupyter on HDInsight.
I have been following the steps described here, including uploading the required jars to Azure storage and executing the %%configure magic to prepare the environment.
But it always seems to terminate due to an I/O exception when trying to open the jar (see the YARN log below):
17/10/09 20:10:35 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.io.IOException: Error accessing /mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_1507534135641_0014/container_1507534135641_0014_01_000001/azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar)
17/10/09 20:10:35 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
17/10/09 20:10:35 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: java.io.IOException: Error accessing /mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_1507534135641_0014/container_1507534135641_0014_01_000001/azure-cosmosdb-spark-0.0.3-SNAPSHOT.jar)
I'm not sure whether this is related to the jar not being copied to the worker nodes.
Any idea? Thanks, Nick

Spark 2.0 Standalone mode Dynamic Resource Allocation Worker Launch Error

I'm running Spark 2.0 in standalone mode. I successfully configured it to launch on a server, and I was also able to configure the IPython PySpark kernel as an option in Jupyter Notebook. Everything works fine, but I'm facing the problem that for each notebook I launch, all 4 of my workers are assigned to that application. So if another person from my team tries to launch another notebook with the PySpark kernel, it simply does not work until I stop the first notebook and release all the workers.
To solve this problem I'm trying to follow the instructions from the Spark 2.0 documentation.
So, on my $SPARK_HOME/conf/spark-defaults.conf I have the following lines:
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.executorIdleTimeout 10
Also, on $SPARK_HOME/conf/spark-env.sh I have:
export SPARK_WORKER_MEMORY=1g
export SPARK_EXECUTOR_MEMORY=512m
export SPARK_WORKER_INSTANCES=4
export SPARK_WORKER_CORES=1
But when I try to launch the workers, using $SPARK_HOME/sbin/start-slaves.sh, only the first worker is successfully launched. The log from the first worker ends up like this:
16/11/24 13:32:06 INFO Worker: Successfully registered with master spark://cerberus:7077
But the logs from workers 2-4 show me this error:
INFO ExternalShuffleService: Starting shuffle service on port 7337 with useSasl = false
16/11/24 13:32:08 ERROR Inbox: Ignoring error
java.net.BindException: Address already in use
It seems (to me) that the first worker successfully starts the shuffle service on port 7337, but workers 2-4 "do not know" about this and try to launch another shuffle service on the same port.
The problem also occurs for all workers (1-4) if I first launch a shuffle service (using $SPARK_HOME/sbin/start-shuffle-service.sh) and then try to launch all the workers ($SPARK_HOME/sbin/start-slaves.sh).
Is there any option to get around this, so that all workers can verify whether a shuffle service is already running and connect to it instead of trying to create a new one?
I had the same issue and seemed to get it working by removing the spark.shuffle.service.enabled item from the config file (in fact I don't have any dynamicAllocation-related items in there) and instead putting this in the SparkConf when I request a SparkContext:
import pyspark

sconf = pyspark.SparkConf() \
    .setAppName("sc1") \
    .set("spark.dynamicAllocation.enabled", "true") \
    .set("spark.shuffle.service.enabled", "true")
sc1 = pyspark.SparkContext(conf=sconf)
I start the master & slaves as normal:
$SPARK_HOME/sbin/start-all.sh
And I have to start one instance of the shuffle service:
$SPARK_HOME/sbin/start-shuffle-service.sh
Then I started two notebooks with this context and got them both to do a small job. The first notebook's application does the job and is in the RUNNING state, the second notebook's application is in the WAITING state. After a minute (default idle timeout), the resources get reallocated and the second context gets to do its job (and both are in RUNNING state).
Hope this helps,
John
