Why does web UI show different durations in Jobs and Stages pages? - apache-spark

I am running a dummy spark job that does the exactly same set of operations in every iteration. The following figure shows 30 iterations, where each job corresponds to one iteration. It can be seen the duration is always around 70 ms except for job 0, 4, 16, and 28. The behavior of job 0 is expected as it is when the data is first loaded.
But when I click on job 16 to enter its detailed view, the duration is only 64 ms, which is similar to the other jobs, the screen shot of this duration is as follows:
I am wondering where does Spark spend the (2000 - 64) ms on job 16?

Gotcha! That's exactly the very same question I asked myself few days ago. I'm glad to share the findings with you (hoping that when I'm lucking understanding others chime in and fill the gaps).
The difference between what you can see in Jobs and Stages pages is the time required to schedule the stage for execution.
In Spark, a single job can have one or many stages with one or many tasks. That creates an execution plan.
By default, a Spark application runs in FIFO scheduling mode which is to execute one Spark job at a time regardless of how many cores are in use (you can check it in the web UI's Jobs page).
Quoting Scheduling Within an Application:
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into "stages" (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
You should then see how many tasks a single job will execute and divide it by the number of cores the Spark application have assigned (you can check it in the web UI's Executors page).
That will give you the estimate on how many "cycles" you may need to wait before all tasks (and hence the jobs) complete.
NB: That's where dynamic allocation comes to the stage as you may sometimes want more cores later and start with a very few upfront. That's what the conclusion I offered to my client when we noticed a similar behaviour.
I can see that all the jobs in your example have 1 stage with 1 task (which make them very simple and highly unrealistic in production environment). That tells me that your machine could have got busier at different intervals and so the time Spark took to schedule a Spark job was longer but once scheduled the corresponding stage finished as the other stages from other jobs. I'd say it's a beauty of profiling that it may sometimes (often?) get very unpredictable and hard to reason about.
Just to shed more light on the internals of how web UI works. web UI uses a bunch of Spark listeners that collect current status of the running Spark application. There is at least one Spark listener per page in web UI. They intercept different execution times depending on their role.
Read about org.apache.spark.scheduler.SparkListener interface and review different callback to learn about the variety of events they can intercept.

Related

JCL should read internal reader than completely submit outer JCL

I have a batch job that has 10 steps in STEP5. I have written an internal JCL and I want after Internal reader step are completed successfully my next step in the parent job which is STEP06 to execute. Could you please give any resolution to this problem.
For what you have described, there are 2 approaches:
Break your process into 3 jobs - steps 1-5 as one job, the second
job consisting of the JCL submitted in sep 5 and the third job
consisting of step 6 (or steps 6-10 - you do not make it clear if
the main JCL has 6 steps and the 'inner' JCL 4 steps, making the 10
steps you mention, or if the main JCL has 10 steps).
The execution of the 3 jobs will need to be serialised somehow.
Simply have the 'inner' JCL as a series of steps in the 'outer' JCL
so that you only have once job with steps that run in order.
The typical approach to this sort of issue would be to use scheduler to handle the 3 part process as 3 jobs the middle one perhaps not submitted by the scheduler but monitored/tracked by it.
With a good scheduler setup, there is a good chance that even if the jobs were run on different machines or even different types of machines that they could be tracked.
To have a single job delayed halfway through is possible but would require some sort of program to run in a loop (waiting so as not to use excessive cpu) checking for an event (a dataset being created or deleted, the job could itself could be checked or numerous other ways).
Another way could be to have part 1 submit a job to do part 2 and that job to then submit another as part 3.
Yet another way, perhaps frowned upon depedning upon it's importance, would be to have three jobs the first part submitted to run, the third part submitted but on hold. The first submits the second which releases the third.
Of course there is also the possibility that one job could do everthing as a single job.

How to submit jobs across multiple partitions at the same time (Slurm)

After I submit a job to node/partition cn430 today, I find that the node is keeping obsessed,
After the previous job finished, my job still didn't get running due to priority. Then I noticed that all of these jobs have the same prefix, namely 4988443, which is ahead of my job id 4988560.
It seems that the user has submitted about 1000 jobs together with the same priority across multiple partitions,
I am wondering how to implement it.
Firstoff, cn430 really looks like a node rather than a partition. The partition to which it belongs seems to be named shared-gp.
What you see is a job array. It is a way to submit a large number of jobs that only differ in a specific parameter. Each job in the array is scheduled independently, so if you do not request a specific node (e.g. with -wor --nodelist), Slurm will broadcast them to the nodes that are available.
Note that the job priorities will decay overtime if faishare is being implemented so the jobs that are currently pending will have their priority decrease because of those currently running.

How to debug a slow PySpark application

There may be an obvious answer to this, but I couldn't find any after a lot of googling.
In a typical program, I'd normally add log messages to time different parts of the code and find out where the bottleneck is. With Spark/PySpark, however, transformations are evaluated lazily, which means most of the code is executed in almost constant time (not a function of the dataset's size at least) until an action is called at the end.
So how would one go about timing individual transformations and perhaps making some parts of the code more efficient by doing things differently where necessary and possible?
You can use Spark UI to see the execution plan of your jobs and time of each phase of them. Then you can optimize your operations using that statistics. Here is a very good presentation about monitoring Spark Apps using Spark UI https://youtu.be/mVP9sZ6K__Y (Spark Sumiit Europe 2016, by Jacek Laskowski)
Any job troubleshooting should have the below steps.
Step 1: Gather data about the issue
Step 2: Check the environment
Step 3: Examine the log files
Step 4: Check cluster and instance health
Step 5: Review configuration settings
Step 6: Examine input data
From the Hadoop Admin perspective, Spark long-running job basic troubleshooting. Go to RM > Application ID.
a) Check for AM & Non-AM Preempted. This can happen if more that required memory is assigned either to driver or executors which can get preempted for a high priority job/YARN queue.
b) Click on AppMaster url. Review Environment variables.
c) Check Jobs section, review Event timeline. Check if executors are getting started immediately after driver or taking time.
d) If Driver process is taking time, see if collect()/ collectAsList() is running on driver as these method tends to take time as they retrieve all the elements of the RDD/DataFrame/Dataset (from all nodes) to the driver node.
e) If no issue in event timeline, go to the incomplete task > stages and check Shuffle Read Size/Records for any Data Skewness issue.
f) If all tasks are complete and still Spark job is running, then go to Executor page > Driver process thread dump > Search for driver. And lookout for operation the driver is working on. Below will be NameNode operation method we can see there (if any).
*getFileInfo()
getFileList()
rename()
merge()
getblockLocation()
commit()*

Apache Spark DAGScheduler Flow of Data

I am trying to understand how exactly Apache Spark scheduler works. To do so, i've set a local cluster with one master and two workers. I only submit one application, which simply reads 4 files (2 small (~10MB) and 2 big(~1,1GB)),joins them and collects the result. In addition, i cache in memory the two small files.
I am running the standalone cluster mode with FIFO.I've understood how the stages are formed but i cannot figure out how the flow of data is determined(the arrows). When i look at SparkUI, i notice that each time,even though the stages are formed in the same way, the arrows( flow of data and control i guess) are different. It's like the scheduler works non-deterministically.
I've read the relative chapters (about DAG and Task Scheduler) from Jacek Laskowski's book, but it isn't still clear in my head how the flow of control is determined . Thanks in advance for the help.
Cheers,
Jim
It's like the scheduler works non-deterministically.
Yes, there's some randomness in scheduling tasks to make it more "fair". In that sense Spark scheduler does work "non-deterministically", but within acceptable limits of execution placement (i.e. assigning tasks with lesser location preferences to executors).
The component in Apache Spark that does the work of selecting a task for a task set (that corresponds to a stage) is TaskSetManager:
Schedules the tasks within a single TaskSet in the TaskSchedulerImpl. This class keeps track of each task, retries tasks if they fail (up to a limited number of times), and handles locality-aware scheduling for this TaskSet via delay scheduling. The main interfaces to it are resourceOffer, which asks the TaskSet whether it wants to run a task on one node, and statusUpdate, which tells it that one of its tasks changed state (e.g. finished).

Spark Streaming Execution Flow

I am a newbie to Spark Streaming and I have some doubts regarding the same like
Do we need always more than one executor or with one we can do our job
I am pulling data from kafka using createDirectStream which is receiver less method and batch duration is one minute , so is my data is received for one batch and then processed during other batch duration or it is simultaneously processed
If it is processed simultaneously then how is it assured that my processing is finished in the batch duration
How to use the that web UI to monitor and debugging
Do we need always more than one executor or with one we can do our job
It depends :). If you have a very small volume of traffic coming in, it could very well be that one machine code suffice in terms of load. In terms of fault tolerance that might not be a very good idea, since a single executor could crash and make your entire stream fault.
I am pulling data from kafka using createDirectStream which is
receiver less method and batch duration is one minute , so is my data
is received for one batch and then processed during other batch
duration or it is simultaneously processed
Your data is read once per minute, processed, and only upon the completion of the entire job will it continue to the next. As long as your batch processing time is less than one minute, there shouldn't be a problem. If processing takes more than a minute, you will start to accumulate delays.
If it is processed simultaneously then how is it assured that my
processing is finished in the batch duration?
As long as you don't set spark.streaming.concurrentJobs to more than 1, a single streaming graph will be executed, one at a time.
How to use the that web UI to monitor and debugging
This question is generally too broad for SO. I suggest starting with the Streaming tab that gets created once you submit your application, and start diving into each batch details and continuing from there.
To add a bit more on monitoring
How to use the that web UI to monitor and debugging
Monitor your application in the Streaming tab on localhost:4040, the main metrics to look for are Processing Time and Scheduling Delay. Have a look at the offical doc : http://spark.apache.org/docs/latest/streaming-programming-guide.html#monitoring-applications
batch duration is one minute
Your batch duration a bit long, try to adjust it with lower values to improve your latency. 4 seconds can be a good start.
Also it's a good idea to monitor these metrics on Graphite and set alerts. Have a look at this post https://stackoverflow.com/a/29983398/3535853

Resources