I often use the Spark UI to monitor my jobs as it is quite convenient. I like the timeline as it gives me hints about what takes time and where I can improve things.
However, when I run a job with a lot of executors, the beginning of the timeline is completely flooded by the addition of each individual executor:
Here I zoomed out as far as I could, and even then I could not fit the whole timeline in a single window. I know I can just scroll down to the bottom, but it becomes quite tedious since I have to do it every time I reload the page.
So I wonder if there is a setting somewhere to disable those markers about executors being added. Unfortunately, I have not been able to find anything relevant online (maybe I am searching with the wrong keywords?).
Related
You may have Spark code that joins, filters, then groupBys something, and at the end calls take(1), for example. But when you look at the Spark UI, it only shows take(1) taking a long time, as an action that contains all of those transformations. There seems to be no way to see which transformation is taking a long time.
So, how do I find out which transformation is taking a long time in Spark UI?
You can use the Stages tab in the Spark UI. The Stages tab displays a summary page that shows the current state of all stages of all jobs in the Spark application.
At the top of the page is a summary with the count of all stages by status (active, pending, completed, skipped, and failed).
You need to identify your transformation operation there. If you use the same transformation multiple times, you can tell them apart by clicking on "details", which shows the exact line of code from which each one is called.
Check the time spent, and if you are still not satisfied, visit the Storage tab to check whether you are persisting your datasets correctly. If a dataset is not persisted, Spark may compute the same thing many times.
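To illustrate that last point, here is a minimal sketch (the input path and column names are invented) of persisting a dataset that two actions reuse, so the second action reads from the cache instead of recomputing the whole pipeline:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("persist-example").getOrCreate()
import spark.implicits._

// Hypothetical pipeline: 'events' is reused by two downstream actions.
val events = spark.read.parquet("/data/events")   // assumed input path
  .filter($"status" === "ok")                     // assumed column

events.cache()            // persist so both actions below share one computation

val byUser = events.groupBy("user").count()
byUser.show()             // first action: materializes and caches 'events'
println(events.count())   // second action: served from the cache, visible in the Storage tab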
Good Luck!
I'm trying to optimize a program with Spark SQL. The program is basically one HUGE SQL query (joining about 10 tables with many cases, etc.). I'm more used to DataFrame-API-oriented programs, and those showed the different stages much better.
It's quite well structured and I understand it more or less. However I have a problem, I always use Spark UI SQL view to get hints on where to focus the optimizations.
However, in this kind of program the Spark UI SQL view shows nothing. Is there a reason for this, or a way to force it to show something?
I'm expecting to see each join/scan with the number of output rows after it, and so on, but I only see a full "WholeStageCodegen" for a "Parsed logical plan" which is about 800 lines long.
I can't show the code, but it has the following characteristics:
1- The action triggering it is show(20)
2- It takes about 1 hour to execute (with few executors so far)
3- It has a persist before the show/action
4- It uses Kudu, Hive and in-memory tables (registered before this query)
5- The logical plan is about 700 lines long
Is there a way to improve the tracing there? Maybe by disabling WholeStageCodegen (sketched below), although that may hurt performance...
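For reference, a minimal sketch of turning off whole-stage code generation while profiling, so the SQL view shows individual operators instead of one fused node (the session setup is illustrative, and execution is usually slower with this flag off):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("profiling-session")                     // hypothetical app name
  .config("spark.sql.codegen.wholeStage", "false")  // show individual operators in the SQL view
  .getOrCreate()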
Thanks!
I am currently using Spark to process documents. I have two servers at my disposal (innov1 and innov2), and I am using YARN as the resource manager.
The first step is to gather the paths of the files from a database, filter them, repartition them, and persist them in an RDD[String]. However, I can't manage to get the persisted blocks shared fairly among all the executors:
[Screenshot: persisted RDD memory taken among executors]
and this leads to the executors not doing the same amount of work after that:
[Screenshot: work done by each executor (ignore the 'dead' executors here; that is a separate problem)]
And this happens randomly: sometimes it's innov1 that takes all the persisted blocks, and then only the executors on innov1 work (though it tends to be innov2 in general). Right now, each time two executors land on innov1, I just kill the job and relaunch it, hoping they end up on innov2 (which is utterly stupid and defeats the purpose of using Spark).
What I have tried so far (and that didn't work):
making the driver sleep 60 seconds before loading from the database (maybe innov1 takes more time to wake up?)
adding spark.scheduler.minRegisteredResourcesRatio=1.0 when I submit the job (same idea as above)
persisting with replication x2 (idea from this link), hoping that some of the blocks would be replicated on innov1 (see the sketch after this list)
Note for point 3: sometimes the replica was persisted on the same executor (which is a bit counter-intuitive), or, even weirder, not replicated at all (is innov2 unable to communicate with innov1?).
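For clarity, "replication x2" in point 3 refers to a replicated storage level, roughly like this sketch (rdd stands in for the RDD being persisted):

import org.apache.spark.storage.StorageLevel

// Replicated persist: each block should be stored on two executors.
rdd.persist(StorageLevel.MEMORY_ONLY_2)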
I am open to any suggestion, or link to similar problems I would have missed.
Edit:
I can't really put code here, as it's part of my company's product. I can give a simplified version however:
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

val rawHBaseRDD: RDD[(ImmutableBytesWritable, Result)] = sc
  .newAPIHadoopRDD(...)
  .map(x => (x._1, x._2)) // from the doc of newAPIHadoopRDD
  .repartition(200)
  .persist(StorageLevel.MEMORY_ONLY)

val pathsRDD: RDD[(String, String)] = rawHBaseRDD
  .mapPartitions { iter =>
    // extract the key and the path from ImmutableBytesWritable and
    // Result.rawCells()
    ...
  }
  .filter(...) // some condition
  .repartition(200)
  .persist(StorageLevel.MEMORY_ONLY)
For both persists, everything ends up on innov2. Is it possible that this is because the data is only on innov2? Even if that is the case, I would assume that repartition helps share the rows between innov1 and innov2, but that doesn't happen here.
Your persisted dataset is not very big: only about 100MB according to your screenshot. You have allocated 10 cores with 20GB of memory, so the 100MB fits easily into the memory of a single executor, and that is basically what is happening.
In other words, you have allocated many more resources than are actually needed, so Spark just randomly picks the subset of resources that it needs to complete the job. Sometimes those resources happen to be on one worker, sometimes on another and sometimes it uses resources from both workers.
You have to remember that, to Spark, it makes no difference whether all resources are placed on a single machine or on 100 different machines, as long as you are not trying to use more resources than are available (in which case you would get an OOM).
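As a hedged illustration of right-sizing the request instead of over-allocating (the numbers below are made-up examples, not recommendations):

import org.apache.spark.sql.SparkSession

// Illustrative right-sizing: a ~100MB cached RDD does not need 10 cores / 20GB.
val spark = SparkSession.builder()
  .appName("right-sized-job")               // hypothetical app name
  .config("spark.executor.instances", "2")  // fewer executors to spread over
  .config("spark.executor.cores", "2")
  .config("spark.executor.memory", "2g")
  .getOrCreate()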
Unfortunately (fortunately?), the problem solved itself today. I assume it is not Spark-related, as I hadn't modified the code before the resolution.
It's probably due to the complete reboot of all services with Ambari (even if I am not 100% sure, because I had already tried this before), as it's the only "major" change that happened today.
I have a job running whose Event Timeline looks as follows. I am trying to understand the gaps between these single lines; they seem to be parallel but not immediately sequential with the other stages...
Any other insight from this, and what is the cluster doing during these gaps?
Without any code to look at, a blind guess is that during those gaps the driver is busy doing some work. If you are doing a .collect(), or a broadcast(), or any type of local processing in the driver program, then the executors will sit idle, waiting to have work assigned to them.
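As a hedged illustration of that pattern (the DataFrame and column names are invented), driver-only work between two cluster-side steps shows up as a gap in the timeline:

// Stage runs on the executors, then results come back to the driver.
val summary = df.groupBy("key").count().collect()

// Driver-only processing: no tasks are scheduled while this runs,
// which appears as a gap between stages in the event timeline.
val lookup = summary.map(r => r.getString(0) -> r.getLong(1)).toMap

// Executors get work again once the broadcast value is used downstream.
val lookupBroadcast = spark.sparkContext.broadcast(lookup)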
Note that the visualization only shows the tasks from the table below it. If you change the page size or the sorting of that table, you can see the actual pattern.
I'm playing with the idea of having long-running aggregations (possibly a one day window). I realize other solutions on this site say that you should use batch processing for this.
I'm specifically interested in understanding this function, though. It sounds like it would use constant space to do an aggregation over the window, one interval at a time. If that is true, it sounds like a day-long aggregation would be viable (especially since it uses checkpointing in case of failure).
Does anyone know if this is the case?
The function is documented here: https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html
A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
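For reference, a minimal sketch of that overload over a one-day window (the socket source, batch interval, and checkpoint directory are placeholder assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val conf = new SparkConf().setAppName("daily-window")   // hypothetical app name
val ssc = new StreamingContext(conf, Minutes(5))        // assumed batch interval
ssc.checkpoint("/tmp/checkpoints")                      // required by the inverse-reduce overload

val counts = ssc.socketTextStream("localhost", 9999)    // placeholder source
  .map(word => (word, 1L))
  .reduceByKeyAndWindow(
    (a: Long, b: Long) => a + b,   // reduce: add counts entering the window
    (a: Long, b: Long) => a - b,   // inverse reduce: subtract counts leaving the window
    Minutes(24 * 60),              // window length: one day
    Minutes(5)                     // slide interval
  )

counts.print()
ssc.start()
ssc.awaitTermination()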
After researching this on the MapR forums, it seems that it does use a constant amount of memory, making a daily window possible, assuming you can fit one day of data in your allocated resources.
The two downsides are that:
A daily aggregation may only take 20 minutes to compute. Keeping a day-long window open means you're using all those cluster resources permanently rather than just for 20 minutes a day, so stand-alone batch aggregations are far more resource-efficient (a batch sketch follows below).
It's hard to deal with late data when you're streaming over exactly one day. If your data is tagged with dates, then you need to wait until all your data arrives. A one-day streaming window would only be good if you were literally just analyzing the last 24 hours of data, regardless of its content.
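As a hedged sketch of that batch alternative (paths, schema, and column names are all invented):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("daily-batch-agg").getOrCreate()

// Run once a day (e.g., from a scheduler) over the previous day's partition,
// then release the cluster resources instead of holding a 24h streaming window.
spark.read.parquet("/data/events/dt=2017-06-01")   // assumed daily partition
  .groupBy("key")                                  // assumed aggregation key
  .count()                                         // one row per key with its daily count
  .write.parquet("/data/aggregates/dt=2017-06-01")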