How to debug a slow PySpark application - apache-spark

There may be an obvious answer to this, but I couldn't find any after a lot of googling.
In a typical program, I'd normally add log messages to time different parts of the code and find out where the bottleneck is. With Spark/PySpark, however, transformations are evaluated lazily, which means most of the code is executed in almost constant time (not a function of the dataset's size at least) until an action is called at the end.
So how would one go about timing individual transformations and perhaps making some parts of the code more efficient by doing things differently where necessary and possible?

You can use the Spark UI to see the execution plan of your jobs and the time spent in each phase. Then you can optimize your operations using those statistics. Here is a very good presentation about monitoring Spark apps using the Spark UI: https://youtu.be/mVP9sZ6K__Y (Spark Summit Europe 2016, by Jacek Laskowski)
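If you still want log-style timings next to the UI numbers, one option is to force evaluation at chosen points and time only those points. Here is a minimal sketch of the idea; the `timed` helper and the `lazy_squares` stand-in are hypothetical names, and a plain Python generator plays the role of a lazy DataFrame so the snippet runs without a cluster.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> elapsed wall-clock seconds


@contextmanager
def timed(label):
    """Record how long the enclosed block takes under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def lazy_squares(n):
    # Stand-in for a lazy transformation: nothing is computed until consumed.
    return (i * i for i in range(n))


with timed("define"):
    pipeline = lazy_squares(200_000)  # returns immediately, like a transformation

with timed("materialize"):
    total = sum(pipeline)  # forces evaluation, like an action

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f}s")
```

In PySpark the equivalent would be wrapping an action such as `df.count()` on an intermediate DataFrame. Keep in mind that every forced action launches an extra job, so this is for diagnosis only and the numbers include everything upstream of that point.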

Any job troubleshooting should follow the steps below.
Step 1: Gather data about the issue
Step 2: Check the environment
Step 3: Examine the log files
Step 4: Check cluster and instance health
Step 5: Review configuration settings
Step 6: Examine input data
From the Hadoop admin perspective, basic troubleshooting for a long-running Spark job starts at RM > Application ID.
a) Check for AM & Non-AM Preempted. This can happen if more than the required memory is assigned to either the driver or the executors, which can then get preempted for a higher-priority job/YARN queue.
b) Click on the AppMaster URL. Review the Environment variables.
c) Check the Jobs section and review the Event timeline. Check whether executors start immediately after the driver or take time.
d) If the driver process is taking time, see if collect()/collectAsList() is running on the driver, as these methods tend to take time because they retrieve all the elements of the RDD/DataFrame/Dataset (from all nodes) to the driver node.
e) If there is no issue in the event timeline, go to the incomplete task > stages and check Shuffle Read Size/Records for any data skew.
f) If all tasks are complete and the Spark job is still running, go to Executor page > Driver process thread dump > search for driver, and look for the operation the driver is working on. Below are the NameNode operation methods we may see there (if any).
getFileInfo()
getFileList()
rename()
merge()
getblockLocation()
commit()
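For the skew check in step e), a quick way to put a number on it is to compare the largest per-task shuffle read against the median. This is only a rough sketch: the `skew_ratio` helper is a hypothetical name, and the sizes are made-up values of the kind you would copy from a stage's task table.

```python
from statistics import median


def skew_ratio(shuffle_read_bytes):
    """Return max/median of per-task shuffle read sizes (1.0 means evenly spread)."""
    sizes = [s for s in shuffle_read_bytes if s > 0]
    if not sizes:
        return 1.0
    return max(sizes) / median(sizes)


# Hypothetical per-task shuffle read sizes in bytes; one task reads far more.
task_sizes = [120_000, 130_000, 125_000, 118_000, 9_600_000]
ratio = skew_ratio(task_sizes)
print(f"max/median shuffle read: {ratio:.1f}x")
```

A ratio far above 1 suggests a few keys carry most of the data. What counts as "too high" depends on the job, so treat any threshold as a judgment call.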

Related

Spark Web UI showing Job SUCCEEDED but Tasks Succeeded Less than Total

In the "Details for Job n" section, the UI shows "Status: SUCCEEDED", however one of the Stages shows 22029 succeeded Tasks out of 59400 total tasks. I'm running this through a Python Jupyter notebook running Spark 3.0.1, and I haven't stopped the Spark context yet so the application is still running. In fact, the Stages tab shows the stage in question as still active. I don't understand how the stage could still be active, yet the Job is listed as Completed and Successful in the UI.
The relevant code (I think) is below, where I try to parallelize as much as possible many SQL queries and then union the result dataframes together. Lastly, I'm writing them to cloud storage in parquet.
EDIT: I also can see the same information from the REST API using the endpoints documented here in the docs, and those values are the same as I see in the Web UI.
There are no jobs appearing in the Jobs tab as failed, and I believe that ultimately the data is successfully written and correct.
I have seen in the logs many instances of "Dropping event from queue appStatus. This likely means one of the listeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler." Because of that, I am experimenting with the parameter spark.scheduler.listenerbus.eventqueue.capacity to increase it and see if that results in a difference in the reporting of Succeeded and Total tasks for that stage.
Upon increasing spark.scheduler.listenerbus.eventqueue.capacity from default of 10000 to 65000, there seems to be a corresponding decrease in events dropped, as well as an increase in Succeeded Tasks reported for that stage, improving to ~ 47K from ~ 22K. I have also noticed that the difference in Succeeded and Total tasks for that stage is on the order of the number of dropped events in the log so I will see if limiting the dropped events can resolve the discrepancy.
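import functools

def make_df(query: str):
    # Build a (lazy) DataFrame for one SQL query; `spark` is the active SparkSession.
    df = spark.sql(query)
    return df

# Overwrite only the partitions that receive new data, not the whole output path.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
# `queries` is a list of SQL strings defined elsewhere in the notebook.
df_list = list(map(make_df, queries))
# Union all per-query DataFrames into one.
df = functools.reduce(lambda x, y: x.union(y), df_list)
df.repartition("col1", "col2")\
    .write.partitionBy("col1", "col2")\
    .mode("overwrite")\
    .parquet(path)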
Why would my job be reporting as Successful when there are still tasks remaining that aren't successful?
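To compare the reporting gap against the number of dropped events, a small helper over stage records can list the stages whose succeeded count falls short of the total. This is a sketch only: the field names `stageId`, `numTasks` and `numCompleteTasks` are assumed to match what the REST API returns for stages and should be checked against your Spark version, and the stage IDs below are made up.

```python
def incomplete_stages(stages):
    """Return (stageId, succeeded, total, missing) for stages with succeeded < total."""
    report = []
    for stage in stages:
        total = stage["numTasks"]
        done = stage["numCompleteTasks"]
        if done < total:
            report.append((stage["stageId"], done, total, total - done))
    return report


# Hypothetical records; the first uses the task counts quoted in the question.
sample = [
    {"stageId": 7, "numTasks": 59400, "numCompleteTasks": 22029},
    {"stageId": 8, "numTasks": 200, "numCompleteTasks": 200},
]

for stage_id, done, total, missing in incomplete_stages(sample):
    print(f"stage {stage_id}: {done}/{total} succeeded, {missing} unaccounted for")
```

If the "unaccounted for" figure tracks the count of dropped events in the log, that supports the idea that the UI is missing task-end events rather than the tasks themselves failing.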

Netsuite Map Reduce yielding

I read in the documentation that soft limits on governance cause map/reduce scripts to yield and reschedule. My problem is that I cannot see where the docs explain what happens during the yield. Is getInputData called again to regather the same data set to be mapped, or is the initial data set persisted somewhere, with already mapped and reduced records excluded from processing?
With yielding, the getInputData stage is not called again. From the docs:
If a job monopolizes a processor for too long, the system can
naturally finish the job after the current map or reduce function has
completed. In this case, the system creates a new job to continue
executing remaining key/value pairs. Based on its priority and
submission timestamp, the new job either starts right after the
original job has finished, or it starts later, to allow
higher-priority jobs processing other scripts to execute. For more
details, see SuiteScript 2.0 Map/Reduce Yielding.
This is different from server restarts or interruptions, however.
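To make the yielding behaviour concrete, here is a toy model in Python. It is not SuiteScript and uses no NetSuite API: the function names are hypothetical, and the soft limit is modelled as a fixed number of key/value pairs per job, which is an assumption made purely for illustration.

```python
def run_with_yielding(pairs, pairs_per_job):
    """Process key/value pairs in successive jobs; each job yields after `pairs_per_job` pairs."""
    remaining = list(pairs)  # produced once by the input stage, then kept
    jobs = []
    while remaining:
        batch = remaining[:pairs_per_job]
        remaining = remaining[pairs_per_job:]
        # A new job continues with only the pairs that are still unprocessed.
        jobs.append([key for key, _ in batch])
    return jobs


input_calls = []  # one entry per call to the input stage


def get_input_data():
    input_calls.append(1)
    return [(f"k{i}", i) for i in range(7)]


jobs = run_with_yielding(get_input_data(), pairs_per_job=3)
print(jobs)
print("input stage calls:", len(input_calls))
```

The input stage runs once, and each follow-up job picks up where the previous one stopped, which matches the quoted description of a new job executing the remaining key/value pairs.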

Apache Spark DAGScheduler Flow of Data

I am trying to understand how exactly the Apache Spark scheduler works. To do so, I've set up a local cluster with one master and two workers. I submit only one application, which simply reads 4 files (2 small (~10 MB) and 2 big (~1.1 GB)), joins them and collects the result. In addition, I cache the two small files in memory.
I am running in standalone cluster mode with FIFO. I've understood how the stages are formed, but I cannot figure out how the flow of data (the arrows) is determined. When I look at the Spark UI, I notice that each time, even though the stages are formed in the same way, the arrows (flow of data and control, I guess) are different. It's like the scheduler works non-deterministically.
I've read the relevant chapters (about the DAG and Task Scheduler) from Jacek Laskowski's book, but it still isn't clear in my head how the flow of control is determined. Thanks in advance for the help.
Cheers,
Jim
It's like the scheduler works non-deterministically.
Yes, there's some randomness in scheduling tasks to make it more "fair". In that sense Spark scheduler does work "non-deterministically", but within acceptable limits of execution placement (i.e. assigning tasks with lesser location preferences to executors).
The component in Apache Spark that does the work of selecting a task for a task set (that corresponds to a stage) is TaskSetManager:
Schedules the tasks within a single TaskSet in the TaskSchedulerImpl. This class keeps track of each task, retries tasks if they fail (up to a limited number of times), and handles locality-aware scheduling for this TaskSet via delay scheduling. The main interfaces to it are resourceOffer, which asks the TaskSet whether it wants to run a task on one node, and statusUpdate, which tells it that one of its tasks changed state (e.g. finished).

Why does web UI show different durations in Jobs and Stages pages?

I am running a dummy Spark job that does exactly the same set of operations in every iteration. The following figure shows 30 iterations, where each job corresponds to one iteration. It can be seen that the duration is always around 70 ms except for jobs 0, 4, 16, and 28. The behavior of job 0 is expected, as that is when the data is first loaded.
But when I click on job 16 to enter its detailed view, the duration is only 64 ms, which is similar to the other jobs. The screenshot of this duration is as follows:
I am wondering where Spark spends the (2000 - 64) ms on job 16?
Gotcha! That's the very same question I asked myself a few days ago. I'm glad to share the findings with you (hoping that where my understanding is lacking, others chime in and fill the gaps).
The difference between what you can see in Jobs and Stages pages is the time required to schedule the stage for execution.
In Spark, a single job can have one or many stages with one or many tasks. That creates an execution plan.
By default, a Spark application runs in FIFO scheduling mode which is to execute one Spark job at a time regardless of how many cores are in use (you can check it in the web UI's Jobs page).
Quoting Scheduling Within an Application:
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into "stages" (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
You should then see how many tasks a single job will execute and divide it by the number of cores the Spark application has been assigned (you can check it in the web UI's Executors page).
That will give you the estimate on how many "cycles" you may need to wait before all tasks (and hence the jobs) complete.
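As a back-of-the-envelope sketch of that division (the `estimated_waves` helper is a hypothetical name, and it assumes each task occupies one core, which is the default for `spark.task.cpus`):

```python
import math


def estimated_waves(num_tasks, total_cores):
    """Estimate how many rounds of tasks are needed when each task uses one core."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    return math.ceil(num_tasks / total_cores)


# Hypothetical numbers: 200 tasks on an application that was given 16 cores.
print(estimated_waves(200, 16))
```

Rounding up matters: the last, partially filled wave still takes roughly as long as a full one.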
NB: That's where dynamic allocation comes to the stage, as you may sometimes want more cores later and start with very few upfront. That's the conclusion I offered to my client when we noticed a similar behaviour.
I can see that all the jobs in your example have 1 stage with 1 task (which makes them very simple and highly unrealistic in a production environment). That tells me that your machine could have got busier at different intervals, so the time Spark took to schedule a Spark job was longer, but once scheduled the corresponding stage finished as quickly as the other stages from other jobs. I'd say it's a beauty of profiling that it may sometimes (often?) get very unpredictable and hard to reason about.
Just to shed more light on the internals of how the web UI works: the web UI uses a bunch of Spark listeners that collect the current status of the running Spark application. There is at least one Spark listener per page in the web UI. They intercept different execution times depending on their role.
Read about the org.apache.spark.scheduler.SparkListener interface and review the different callbacks to learn about the variety of events they can intercept.

Performance analysis of U-SQL script

When I run a U-SQL script from the portal/Visual Studio, it goes through stages like preparing, queued, running, finalizing. What exactly happens behind the scenes in all these stages? Will there be any execution time difference when the job is run from Visual Studio/the portal in dev and production environments? We need to clock the speeds and record the time the script would take in production. Ultimately, the goal is to run these scripts as Data Factory activities in production.
I assume that there would be differences since I assume your dev environment would probably run at lower resource usage (lower degree of parallelism both between jobs and inside a job) than your production environment. Otherwise there should be no difference.
Note that we are still working on performance so if you are running into particular issues, please let us know.
The phases roughly do the following (I am probably missing some parts):
preparing: includes compilation, optimization, Codegen, preparing the execution graph and required resources and putting the job into the queue.
queueing: The job sits in the queue to get executed once the job is at the top of the queue and resources are available to start the job. This can be impacted by setting the maximal number of jobs that can run in parallel (a setting you can set by "calling" support/us).
running: Actual job execution. This will be affected by resources: Maximal number of parallelism that is specified on the job, network bandwidth, store access (throttling, bandwidth).
finalizing: Cleanup and stitching results into files, "sealing" table files. This can be more expensive depending on where you write the data (ADL is faster than WASB for example).

Resources