Spark Web UI showing Job SUCCEEDED but Tasks Succeeded Less than Total - apache-spark

In the "Details for Job n section, the UI shows "Status: SUCCEEDED", however one of the Stages shows 22029 succeeded Tasks out of 59400 total tasks. I'm running this through a Python Jupyter notebook running Spark 3.0.1, and I haven't stopped the Spark context yet so the application is still running. In fact, the Stages Tab shows the stage in question as still active. I don't understand how the stage could still be active, yet the Job is listed as Completed and Successful in the UI.
The relevant code (I think) is below, where I try to parallelize many SQL queries as much as possible and then union the resulting DataFrames together. Finally, I write the result to cloud storage as Parquet.
EDIT: I can also see the same information from the REST API, using the endpoints documented in the monitoring docs, and those values match what I see in the Web UI.
There are no jobs appearing in the Jobs tab as failed, and I believe that ultimately the data is successfully written and correct.
I have seen many instances of "Dropping event from queue appStatus" in the logs, which likely means one of the listeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler. Because of that, I am experimenting with increasing the parameter spark.scheduler.listenerbus.eventqueue.capacity to see whether it changes the reported Succeeded and Total task counts for that stage.
After increasing spark.scheduler.listenerbus.eventqueue.capacity from its default of 10000 to 65000, there is a corresponding decrease in dropped events, and the Succeeded task count reported for that stage improved from ~22K to ~47K. The difference between Succeeded and Total tasks for that stage is on the order of the number of dropped events in the log, so I will see whether eliminating the dropped events resolves the discrepancy.
import functools

def make_df(query: str):
    return spark.sql(query)

# Only overwrite the partitions that are present in the output
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Run each SQL query and union the resulting DataFrames together
df_list = list(map(make_df, queries))
df = functools.reduce(lambda x, y: x.union(y), df_list)

# Repartition on the partition columns and write to cloud storage as Parquet
df.repartition("col1", "col2") \
    .write.partitionBy("col1", "col2") \
    .mode("overwrite") \
    .parquet(path)
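For reference, a minimal sketch of how I'm setting the larger queue capacity. It has to be set before the SparkContext is created (for example via spark-submit --conf or the session builder); 65000 is just the value I tried, and the app name below is a placeholder:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("my-union-job")  # placeholder app name
    # Larger event queue so the app-status listener drops fewer task events
    .config("spark.scheduler.listenerbus.eventqueue.capacity", "65000")
    .getOrCreate()
)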
Why would my job be reporting as Successful when there are still tasks remaining that aren't successful?

Related

How to submit jobs across multiple partitions at the same time (Slurm)

After I submitted a job to node/partition cn430 today, I noticed that the node stays busy. After the previous job finished, my job still didn't start running because of priority. Then I noticed that all of these jobs have the same prefix, namely 4988443, which is ahead of my job id 4988560. It seems that the user has submitted about 1000 jobs with the same priority across multiple partitions, and I am wondering how to do that.
First off, cn430 really looks like a node rather than a partition. The partition it belongs to seems to be named shared-gp.
What you see is a job array. It is a way to submit a large number of jobs that only differ in a specific parameter, for example as sketched below. Each job in the array is scheduled independently, so if you do not request a specific node (e.g. with -w or --nodelist), Slurm will dispatch them to whichever nodes are available.
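For instance, something like sbatch --array=0-999 job.sh (the script name is just a placeholder) submits 1000 array tasks in one go, and each of them shows up in the queue as jobid_index.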
Note that job priorities decay over time if fairshare is enabled, so the jobs that are currently pending will have their priority decrease because of the ones currently running.

Best approach to check if Spark streaming jobs are hanging

I have a Spark streaming application which basically gets a trigger message from Kafka that kicks off the batch processing, which could potentially take up to 2 hours.
There were incidents where some of the jobs were hanging indefinitely and didn't complete within the usual time, and currently there is no way to figure out the status of a job without checking the Spark UI manually. I want a way to tell whether the currently running Spark jobs are hanging or not. So basically, if a job hangs for more than 30 minutes, I want to notify the users so they can take action. What options do I have?
I see I can use metrics from driver and executors. If I were to choose the most important one, it would be the last received batch records. When StreamingMetrics.streaming.lastReceivedBatch_records == 0 it probably means that Spark streaming job has been stopped or failed.
But in my scenario, I will receive only 1 streaming trigger event and then it will kick start the processing which may take up to 2 hours so I won't be able to rely on the records received.
Is there a better way? TIA
YARN provides a REST API for checking the status of applications as well as cluster resource utilization.
An API call returns the list of running applications with their start times and other details. You can have a simple REST client that triggers, say, once every 30 minutes, checks whether the job has been running for more than 2 hours, and sends a simple mail alert if so.
Here is the API documentation:
https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API
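A minimal sketch of that kind of poller, assuming the ResourceManager address and the send_alert function are placeholders you'd adapt:

import requests

RM = "http://resourcemanager-host:8088"  # placeholder ResourceManager address
TWO_HOURS_MS = 2 * 60 * 60 * 1000

def check_long_running_apps():
    # Cluster Applications API: list all RUNNING applications
    resp = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"})
    resp.raise_for_status()
    apps = (resp.json().get("apps") or {}).get("app", [])
    for app in apps:
        if app["elapsedTime"] > TWO_HOURS_MS:
            # send_alert is a placeholder for your mail/notification client
            send_alert(f"{app['name']} ({app['id']}) has been running for {app['elapsedTime'] // 60000} min")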
Maybe a simple solution like this: at the start of the processing, launch a waiting thread.
val TWO_HOURS = 2 * 60 * 60 * 1000

val t = new Thread(new Runnable {
  override def run(): Unit = {
    try {
      Thread.sleep(TWO_HOURS)
      // send an email saying the job didn't finish in time
    } catch {
      // interrupted => processing finished within 2 hours, nothing to do
      case _: InterruptedException =>
    }
  }
})
t.start()
And at the point where you can tell that the batch processing has ended:
t.interrupt()
If processing finishes within 2 hours, the waiting thread is interrupted and the e-mail is not sent. If processing is not done, the e-mail will be sent.
Let me draw your attention to Streaming Query listeners. These are quite handy, lightweight things that can monitor your streaming query's progress.
In an application that has multiple queries, you can figure out which queries are lagging or have stopped due to some exception.
Please find below sample code to understand the implementation. I hope you can adapt this piece to better suit your needs. Thanks!
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

var recordsReadCount = 0L

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {
    // logger message to show that the query has started
  }
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    synchronized {
      if (event.progress.name.equalsIgnoreCase("QueryName")) {
        recordsReadCount = recordsReadCount + event.progress.numInputRows
        // logger messages to show continuous progress
      }
    }
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {
    synchronized {
      // logger message to show the reason for termination
    }
  }
})
I'm using Kubernetes currently with the Google Spark Operator. [1]
Some of my streaming jobs hang while using Spark 2.4.3: a few tasks fail, then the current batch job never progresses.
I set a timeout using a StreamingProgressListener so that a thread signals when no new batch is submitted for a long time. The signal is then forwarded to a Pushover client that sends a notification to an Android device, and then System.exit(1) is called. The Spark Operator will eventually restart the job.
[1] https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
One way is to monitor the output of the Spark job that was kicked off. For example:
If it writes to HDFS, monitor the HDFS output directory for the last-modified file timestamp or the number of files generated.
If it writes to a database, you could have a query that checks the timestamp of the last record inserted into your job's output table.
If it writes to Kafka, you could use Kafka's GetOffsetShell to get the output topic's current offset (see the example just below).
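For instance, something along these lines (broker and topic are placeholders): kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list <broker:9092> --topic <output-topic> --time -1 prints the latest offset per partition, which you can compare between two checks to see whether the job is still producing output.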
Utilize TaskContext.
This provides contextual information for a task and supports adding listeners for task completion/failure (see addTaskCompletionListener).
More detailed information, such as the task's 'attemptNumber' or 'taskMetrics', is available as well.
This information can be used by your application during runtime to determine if there is a 'hang' (depending on the problem).
More information about what is 'hanging' would be useful in providing a more specific solution.
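A minimal PySpark sketch of reading task context inside a partition function (the completion/failure listener API mentioned above lives on the JVM TaskContext; the DataFrame below is a placeholder):

from pyspark import TaskContext

def process_partition(rows):
    ctx = TaskContext.get()  # only valid inside a running task
    # attemptNumber() > 0 means this partition is being retried
    print(f"stage={ctx.stageId()} partition={ctx.partitionId()} attempt={ctx.attemptNumber()}")
    for row in rows:
        yield row

# e.g. df.rdd.mapPartitions(process_partition).count()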
I had a similar scenario to deal with about a year ago and this is what I did -
As soon as Kafka receives a message, the Spark streaming job picks up the event and starts processing.
The Spark streaming job sends an alert email to the support group saying "Event received and Spark transformation STARTED". The start timestamp is stored.
After the Spark processing/transformations are done, it sends an alert email to the support group saying "Spark transformation ENDED successfully". The end timestamp is stored.
These two steps help the support group notice when the success email does not arrive after processing has started, so they can investigate by looking at the Spark UI for job failure or delayed processing (maybe the job is hung due to resource unavailability for a long time).
Lastly, store the event id or details in an HDFS file along with the start and end timestamps, and save this file to the HDFS path that some Hive log table points to. This is helpful for future reference on how the Spark code performs over time, and it can be fine-tuned if required.
Hope this is helpful.

How to debug a slow PySpark application

There may be an obvious answer to this, but I couldn't find any after a lot of googling.
In a typical program, I'd normally add log messages to time different parts of the code and find out where the bottleneck is. With Spark/PySpark, however, transformations are evaluated lazily, which means most of the code executes in almost constant time (at least not as a function of the dataset's size) until an action is called at the end.
So how would one go about timing individual transformations and perhaps making some parts of the code more efficient by doing things differently where necessary and possible?
You can use the Spark UI to see the execution plan of your jobs and the time of each phase. Then you can optimize your operations using those statistics. Here is a very good presentation about monitoring Spark apps using the Spark UI: https://youtu.be/mVP9sZ6K__Y (Spark Summit Europe 2016, by Jacek Laskowski)
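If you also want rough wall-clock numbers from your own code (a common trick, not part of the answer above), you can force evaluation of an intermediate DataFrame with an action and time it. The DataFrame and column names below are placeholders, and the numbers are only approximate because each count() re-runs the lineage up to that point:

import time

def timed_count(df, label):
    # count() is an action, so it forces everything needed to build `df` to run
    start = time.time()
    n = df.count()
    print(f"{label}: {n} rows in {time.time() - start:.1f}s")
    return df

# Placeholder pipeline: time each step separately
step1 = timed_count(raw_df.filter("col1 IS NOT NULL"), "filter")
step2 = timed_count(step1.groupBy("col1").count(), "groupBy")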
Any job troubleshooting should include the steps below.
Step 1: Gather data about the issue
Step 2: Check the environment
Step 3: Examine the log files
Step 4: Check cluster and instance health
Step 5: Review configuration settings
Step 6: Examine input data
From a Hadoop admin's perspective, basic troubleshooting for a long-running Spark job: go to RM > Application ID.
a) Check for AM & non-AM preemption. This can happen if more memory than required is assigned to either the driver or the executors, which can get preempted by a higher-priority job/YARN queue.
b) Click on the AppMaster URL and review the environment variables.
c) Check the Jobs section and review the event timeline. Check whether executors start immediately after the driver or take time.
d) If the driver process is taking time, see whether collect()/collectAsList() is running on the driver, as these methods tend to take time because they retrieve all the elements of the RDD/DataFrame/Dataset (from all nodes) to the driver node.
e) If there is no issue in the event timeline, go to the incomplete task > stages and check Shuffle Read Size/Records for any data skew issue.
f) If all tasks are complete and the Spark job is still running, go to the Executors page > driver process thread dump > search for the driver, and look at the operation the driver is working on. The NameNode operation methods we may see there (if any) include:
getFileInfo()
getFileList()
rename()
merge()
getblockLocation()
commit()

Why does web UI show different durations in Jobs and Stages pages?

I am running a dummy Spark job that does exactly the same set of operations in every iteration. The following figure shows 30 iterations, where each job corresponds to one iteration. The duration is always around 70 ms except for jobs 0, 4, 16, and 28. The behavior of job 0 is expected, as that is when the data is first loaded.
But when I click on job 16 to enter its detailed view, the duration is only 64 ms, which is similar to the other jobs; the screenshot of this duration is as follows:
I am wondering where Spark spends the (2000 - 64) ms on job 16?
Gotcha! That's exactly the same question I asked myself a few days ago. I'm glad to share the findings with you (hoping that where I'm lacking understanding, others will chime in and fill the gaps).
The difference between what you can see in Jobs and Stages pages is the time required to schedule the stage for execution.
In Spark, a single job can have one or many stages with one or many tasks. That creates an execution plan.
By default, a Spark application runs in FIFO scheduling mode which is to execute one Spark job at a time regardless of how many cores are in use (you can check it in the web UI's Jobs page).
Quoting Scheduling Within an Application:
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into "stages" (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
You should then see how many tasks a single job will execute and divide it by the number of cores the Spark application has been assigned (you can check it in the web UI's Executors page).
That will give you an estimate of how many "cycles" you may need to wait before all the tasks (and hence the jobs) complete.
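For example, a single stage of 200 tasks on an application with 8 executor cores needs roughly 200 / 8 = 25 such cycles of task execution before the job can finish (the numbers here are made up purely for illustration).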
NB: That's where dynamic allocation comes into play, as you may sometimes want more cores later while starting with very few upfront. That's the conclusion I offered to my client when we noticed a similar behaviour.
I can see that all the jobs in your example have 1 stage with 1 task (which makes them very simple and highly unrealistic for a production environment). That tells me that your machine could have been busier at different intervals, and so the time Spark took to schedule a job was longer, but once scheduled, the corresponding stage finished like the stages of the other jobs. I'd say it's a beauty of profiling that it may sometimes (often?) be very unpredictable and hard to reason about.
Just to shed more light on the internals of how the web UI works: it uses a bunch of Spark listeners that collect the current status of the running Spark application. There is at least one Spark listener per page in the web UI. They intercept different execution times depending on their role.
Read about the org.apache.spark.scheduler.SparkListener interface and review the different callbacks to learn about the variety of events they can intercept.

dask processes tasks twice

I noticed that tasks of a dask graph can be executed several times by different workers.
Also, I see this log in the scheduler console (I don't know if it is related to resilience):
"WARNING - Lost connection to ... while sending result: Stream is closed"
Is there a way to prevent dask from executing the same task twice on different workers?
Note that I'm using:
dask 0.15.0
distributed 1.15.1
Thx
Bertrand
The short answer is "no".
Dask reserves the right to call your function many times. This might occur if a worker goes down, or if Dask does some load balancing and moves tasks around the cluster just after they have started.
However, you can significantly reduce the likelihood of a task running multiple times by turning off work stealing:
def turn_off_stealing(dask_scheduler):
    # Stop the work-stealing extension's periodic callback (private API, version-dependent)
    dask_scheduler.extensions['stealing']._pc.stop()

client.run(turn_off_stealing)
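(In more recent versions of distributed, work stealing can also be disabled through configuration via the distributed.scheduler.work-stealing setting, if I recall correctly; check the docs for your version.)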
