Spark streaming error during job runtime in cluster (yarn resource manager) - apache-spark

I wrote an application based on Spark Streaming (DStream) to pull messages from PubSub, and I am running the job on a cluster of 4 nodes. Unfortunately, I am facing errors during its execution.
After about 10 minutes of the job running without any specific error, I start getting the following error continuously:
ERROR org.apache.spark.streaming.CheckpointWriter:
Could not submit checkpoint task to the thread pool executor java.util.concurrent.RejectedExecutionException: Task org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler@68395dc9 rejected
from java.util.concurrent.ThreadPoolExecutor@1a1acc25
[Running, pool size = 1, active threads = 1, queued tasks = 1000, completed tasks = 412]
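For reference, a checkpointed DStream application of the kind described above typically looks like the minimal sketch below. The class name, batch interval, checkpoint directory and socket source are illustrative placeholders (the real job reads from PubSub); the CheckpointWriter named in the error is the component that persists this checkpoint data as batches complete.
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CheckpointedStreamSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("pubsub-dstream-app");
        // Hypothetical 10-second batch interval.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        // Enabling checkpointing is what feeds the CheckpointWriter thread pool named in the error.
        jssc.checkpoint("hdfs:///tmp/checkpoints");  // hypothetical directory
        // Stand-in source; the real application would create its PubSub DStream here.
        JavaDStream<String> messages = jssc.socketTextStream("localhost", 9999);
        messages.print();  // any output operation; checkpoint data is written as batches complete
        jssc.start();
        jssc.awaitTermination();
    }
}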

Related

Unable to gracefully finish an Airflow DAG

I have a spark-streaming job that runs on EMR, scheduled by Airflow. We want to gracefully terminate this EMR cluster every week.
But when I issue a kill or SIGTERM signal to the running spark-streaming application, it is reported as a "failed" task in the Airflow DAG. This prevents the DAG from moving further and blocks the next run from triggering.
Is there any way either to kill the running spark-streaming app to mark success or to let the DAG complete even though it sees the task as failed?
For the first part, can you share your code that kills the Spark app? I think you should be able to have this task return success and have everything downstream "just work".
I'm not too familiar with EMR, but looking at the docs it looks like "job flow" is their name for the Spark cluster. In that case, are you using the built-in EmrTerminateJobFlowOperator?
I wonder if the failed task is the cluster termination propagating back an error code or something like that. Also, is it possible that the cluster is failing to terminate and your code is raising an exception, leading to a failed task?
To answer the second part, if you have multiple upstream tasks, you can use an alternate trigger rule on the operator to determine which downstream tasks run.
class TriggerRule(object):
    ALL_SUCCESS = 'all_success'
    ALL_FAILED = 'all_failed'
    ALL_DONE = 'all_done'
    ONE_SUCCESS = 'one_success'
    ONE_FAILED = 'one_failed'
    DUMMY = 'dummy'
https://github.com/apache/incubator-airflow/blob/master/airflow/utils/trigger_rule.py
https://github.com/apache/incubator-airflow/blob/master/docs/concepts.rst#trigger-rules

NodeJS Agenda scheduler: cluster with 2 or 3 workers, jobs are not getting "distributed" evenly

I'm using the great NodeJS Agenda task scheduler (https://github.com/agenda/agenda) to schedule background jobs in my Node/Express app.
Agenda can also run in a Node cluster configuration; however, I'm running into a problem with that (maybe I overlooked something, because this seems pretty basic).
So I've used the code example from the README (https://github.com/agenda/agenda#spawning--forking-processes) to set up a cluster with N workers, each worker (Node cluster process) has an agenda instance.
Now suppose I have 2 workers (processes) and I call agenda.now() from worker 1; the job can then be picked up (processed) by either of the 2 workers, because both of them are monitoring the queue, right?
However, I always see the job being picked up by the first worker; the other one(s) never pick it up.
What am I overlooking? All the workers should be monitoring the queue so all of them should be picking up jobs.
Setting lockLimit distributes the jobs more evenly across a cluster. It can be set when instantiating Agenda.
It takes a number specifying the maximum number of jobs that can be locked at any given moment; by default it is 0, meaning no maximum.
const agenda = new Agenda({
  lockLimit: 3,
  db: { address: __mongodb_connection__, collection: "jobCollectionName" }
});

Spark - How to identify a failed Job through 'SparkLauncher'

I am using Spark 2.0 and sometimes my job fails due to problems with the input. For example, I read CSV files from an S3 folder based on the date, and if there is no data for the current date, my job has nothing to process, so it throws an exception like the following. This gets printed in the driver's logs.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: s3n://data/2016-08-31/*.csv;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
...
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/09/03 10:51:54 INFO SparkContext: Invoking stop() from shutdown hook
16/09/03 10:51:54 INFO SparkUI: Stopped Spark web UI at http://192.168.1.33:4040
16/09/03 10:51:54 INFO StandaloneSchedulerBackend: Shutting down all executors
16/09/03 10:51:54 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
Spark App app-20160903105040-0007 state changed to FINISHED
However, despite this uncaught exception, my Spark Job status is 'FINISHED'. I would expect it to be in 'FAILED' status because there was an exception. Why is it marked as FINISHED? How can I find out whether the job failed or not?
Note: I am spawning the Spark jobs using SparkLauncher, and listening to state changes through AppHandle. But the state change I receive is FINISHED whereas I am expecting FAILED.
The FINISHED state you see is for the Spark application, not a job. It is FINISHED because the Spark context was able to start and stop properly.
You can see job information using JavaSparkStatusTracker.
For active jobs nothing additional is needed, since it has a getActiveJobIds method.
To get finished/failed jobs, you need to set a job group ID in the thread from which you trigger the Spark execution:
JavaSparkContext sc;
...
sc.setJobGroup(MY_JOB_ID, "Some description");
Then, whenever you need to, you can read the status of each job within the specified job group:
JavaSparkStatusTracker statusTracker = sc.statusTracker();
// Iterate over all jobs submitted under the group ID set via setJobGroup above.
for (int jobId : statusTracker.getJobIdsForGroup(MY_JOB_ID)) {
    final SparkJobInfo jobInfo = statusTracker.getJobInfo(jobId);
    final JobExecutionStatus status = jobInfo.status();
}
JobExecutionStatus can be one of RUNNING, SUCCEEDED, FAILED or UNKNOWN; the last one covers the case where a job has been submitted but has not actually started.
Note: all of this is available from the Spark driver, i.e. the jar you are launching with SparkLauncher, so the code above has to be placed inside that jar.
If you want a general check for failures from the SparkLauncher side, you can make the application started from the jar exit with a non-zero code (e.g. System.exit(1)) when a job failure is detected. The Process returned by SparkLauncher::launch has an exitValue method, so you can detect whether it failed or not.
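Putting the two sides together, a minimal sketch could look like the following; the job group ID, jar path, main class and master URL are illustrative assumptions, not part of the original answer.
// Driver side (inside the jar launched by SparkLauncher): exit with a non-zero
// code if any job in the given group finished as FAILED.
import org.apache.spark.JobExecutionStatus;
import org.apache.spark.SparkJobInfo;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaSparkStatusTracker;

public class DriverExitOnFailure {
    static void exitIfAnyJobFailed(JavaSparkContext sc, String jobGroup) {
        JavaSparkStatusTracker tracker = sc.statusTracker();
        for (int jobId : tracker.getJobIdsForGroup(jobGroup)) {
            SparkJobInfo info = tracker.getJobInfo(jobId);
            if (info != null && info.status() == JobExecutionStatus.FAILED) {
                System.exit(1);  // surfaces the failure as the driver's exit code
            }
        }
    }
}

// Launcher side: wait for the spawned process and inspect its exit code.
import org.apache.spark.launcher.SparkLauncher;

public class LaunchAndCheck {
    public static void main(String[] args) throws Exception {
        Process spark = new SparkLauncher()
                .setAppResource("/path/to/my-app.jar")    // hypothetical application jar
                .setMainClass("com.example.MyDriver")     // hypothetical driver class
                .setMaster("yarn")
                .launch();
        int exitCode = spark.waitFor();                   // non-zero if the launched process reported a failure
        System.out.println(exitCode == 0 ? "Spark application succeeded" : "Spark application failed");
    }
}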
You can always go to the Spark history server and click on your job ID to get the job details.
If you are using YARN, you can go to the resource manager web UI to track your job status.

Deploy Spark app to remote Spark master: get error java.util.concurrent.RejectedExecutionException

For one Spark app, I use "sbt run" on my own laptop to submit the job to a remote master.
I run into the following problem:
ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread
Thread[appclient-registration-retry-thread,5,run-main-group-0]
java.util.concurrent.RejectedExecutionException: Task
java.util.concurrent.FutureTask@65d92aad rejected from
java.util.concurrent.ThreadPoolExecutor@1480f818[Running, pool size = 1,
active threads = 1, queued tasks = 0, completed tasks = 0]

Spark Java Application no cores and waiting

I'm new to Spark and I'm trying to develop my first application. I'm only trying to count the lines in a file, but I get this error:
2015-11-28 10:21:34 WARN TaskSchedulerImpl:71 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I have enough cores and enough memory. I read that it can be a firewall problem, but I'm getting this error both on my server and on my MacBook, and the MacBook definitely has no firewall. If I open the UI, it says that the application is WAITING and apparently the application is getting no cores at all:
Application ID Name Cores Memory per Node State
app-20151128102116-0002 (kill) New app 0 1024.0 MB WAITING
My code is very simple:
SparkConf sparkConf = new SparkConf().setAppName("New app");
sparkConf.setMaster("spark://MacBook-Air.local:7077");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
JavaRDD<String> textFile = sc.textFile("/Users/mattiazeni/Desktop/test.csv.bz2");
if (logger.isInfoEnabled()) {
    logger.info(textFile.count());
}
If I try to run the same program from the Scala shell, it works great.
Any suggestions?
Check that workers are running: there should be at least one worker listed on the Spark master web UI at http://<master-host>:8080.
If none are running, use sbin/start-slaves.sh (or sbin/start-all.sh).
