Failure tolerance in Spark - apache-spark

Is there a way to set, for each stage, how many failures I can tolerate when running a Spark job? For example, if I have 1000 nodes and I tolerate 10 failures, then in a case where 5 nodes have failed, my job will not rerun them and will ignore their results.
As a result, I would get a slightly less accurate result, but such a capability would speed up execution, since I would get a result without waiting for the failing nodes, assuming their execution is taking too long.
Thanks!

I think what you're looking for is
spark.speculation=true
This is from http://spark.apache.org/docs/1.2.0/configuration.html#scheduling
It uses a heuristic to relaunch the task on another machine if one is clearly lagging.
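As a minimal sketch, this is how you could turn it on in a SparkSession-based app (the 1.2.0 page linked above predates SparkSession, but the property names are unchanged; the quantile/multiplier values shown are just the documented defaults, not tuned recommendations):

import org.apache.spark.sql.SparkSession

// Enable speculative execution so straggling tasks get re-launched on other executors.
val spark = SparkSession.builder()
  .appName("speculation-example")
  .config("spark.speculation", "true")
  .config("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before speculating
  .config("spark.speculation.multiplier", "1.5")  // how much slower than the median counts as "lagging"
  .getOrCreate()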

Related

Invisible Delays between Spark Jobs

There are 4 major actions (JDBC writes) in the application, plus a few counts, which in total take around 4-5 minutes to complete.
But the total uptime of the application is around 12-13 minutes.
I see certain jobs named "run at ThreadPoolExecutor.java:1149". The long, invisible delays occur just before these jobs show up in the Spark UI.
I want to know the possible causes for these delays.
My application reads 8-10 CSV files and 5-6 views from tables. There are around 59 joins, a few groupBy with agg(sum), and 3 unions.
I am not able to reproduce the issue in the DEV/UAT environments since the data there is not that large.
It only happens in production, where the application is run for me by my manager.
If anyone has come across such delays in their jobs, please share your experience of what the potential cause could be. Currently I am working around the unions, i.e. caching the associated DataFrames and calling count so as to benefit from the cache in the subsequent union (yet to test whether the unions are the reason for the delays).
Similarly, I tried to break the long chain of transformations with cache and count in between, to cut the long lineage.
The time dropped from the initial 18 minutes to 12 minutes, but the issue with invisible delays still persists.
Thanks in advance
I assume you don't have CPU- or IO-heavy code between your Spark jobs.
So if it really is Spark, 99% of the time it is query-planning delay.
You can use
spark.listenerManager.register(QueryExecutionListener) to check different metrics of query-planning performance.
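For example, a minimal sketch of such a listener (the exact onFailure signature has shifted slightly between Spark versions, so check your version's API):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

val spark = SparkSession.builder().appName("plan-timing").getOrCreate()

// Log how long each action took end-to-end; a long duration on a trivial amount of
// data is a strong hint that planning, not execution, is the bottleneck.
spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(f"$funcName took ${durationNs / 1e6}%.1f ms; plan: ${qe.executedPlan.nodeName}")

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"$funcName failed: ${exception.getMessage}")
})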

Spark optimize "DataFrame.explain" / Catalyst

I've got a complex piece of software which performs really complex SQL queries (well, not queries exactly, but Spark plans, you know). The plans are dynamic; they change based on user input, so I can't "cache" them.
I've got a phase in which Spark takes 1.5-2 minutes building the plan. Just to make sure, I added "logXXX", then explain(true), then "logYYY", and the explain alone takes 1 minute 20 seconds to execute.
I've tried breaking the lineage, but this seems to cause worse performance because the actual execution time becomes longer.
I can't parallelize driver work any further (I already did, but this task can't be overlapped with anything else).
Any ideas/guidance on how to improve the plan builder in Spark? (for example, flags to try enabling/disabling and such)
Is there a way to cache plans in Spark? (so I can build them in parallel and then execute them)
I've tried disabling all possible optimizer rules and setting min iterations to 30... but nothing seems to affect that concrete point :S
I tried disabling wholeStageCodegen and it helped a little, but the execution got longer, so that's not a fix either.
Thanks!
PS: The plan does contain multiple unions (<20, but with quite complex plans inside each union), which are the cause of the time, but splitting them apart also affects execution time.
Just in case it helps someone (and if no one provides more insights).
I couldn't manage to reduce the optimizer time (and I'm not sure reducing it would even be good, as I might lose execution time).
One of the last parts of my plan was scanning two big tables and getting one column from each of them (using windows, aggregations, etc.).
So I split my code in two parts:
1- The big plan (cached)
2- The small plan which scans and aggregates the two big tables (cached)
And added one more part:
3- Left join/enrich the big plan with the output of "2" (this takes like 10 seconds, the dataset is not that big) and finish the remaining computation.
Now I launch both actions (1, 2) in parallel (using driver-level parallelism/threads), cache the resulting DataFrames, wait for both, and afterwards perform 3.
With this, while the Spark driver (thread 1) is calculating the big plan (~2 minutes), the executors are executing part "2" (which has a small plan but big scans/shuffles), and then both get "mixed" in around 10-15 seconds, which is a good improvement in execution time on top of the 1:30 I save while calculating the plan.
Comparing times:
Before I would have:
1:30 Spark optimizing time + 6 minutes execution time
Now I have:
max(
  1:30 Spark optimizing time + 4 minutes execution time,
  0:02 Spark optimizing time + 2 minutes execution time
)
+ 15 seconds joining both parts
Not a huge saving, but quite a few "expensive" people are waiting for it to finish :)
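For anyone wanting to reproduce the pattern, here is a rough sketch of what I mean by driver-level parallelism. buildBigPlan / buildSmallAggregation stand in for parts 1 and 2, and the "id" join key is a placeholder, not my real schema:

import org.apache.spark.sql.DataFrame
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

def runInParallel(buildBigPlan: () => DataFrame,
                  buildSmallAggregation: () => DataFrame): DataFrame = {
  // Each future plans, caches and materializes its DataFrame on its own driver thread,
  // so Catalyst works on plan 1 while the executors are already busy scanning for plan 2.
  val bigF   = Future { val df = buildBigPlan().cache();          df.count(); df }
  val smallF = Future { val df = buildSmallAggregation().cache(); df.count(); df }

  val big   = Await.result(bigF, Duration.Inf)
  val small = Await.result(smallF, Duration.Inf)

  // Part 3: enrich the big DataFrame with the small aggregated output; both are cached,
  // so this last join is the cheap 10-15 second step.
  big.join(small, Seq("id"), "left")
}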

Spark tasks stuck at RUNNING

I'm trying to run a Spark ML pipeline (load some data from JDBC, run some transformers, train a model) on my Yarn cluster, but each time I run it, a couple of my executors (sometimes one, sometimes 3 or 4) get stuck running their first task set (that'd be 3 tasks, one for each of their 3 cores), while the rest run normally, checking off 3 at a time.
In the UI, you'd see something like this:
Some things I have observed so far:
When I set up my executors to use 1 core each with spark.executor.cores (i.e. run 1 task at a time), the issue does not occur;
The stuck executors always seem to be the ones that had to have some partitions shuffled to them in order to run the task;
The stuck tasks would ultimately get successfully speculatively executed by another instance;
Occasionally, a single task gets stuck in an executor that is otherwise normal; the other 2 cores keep working fine, however;
The stuck executor instances look like everything is normal: CPU is at ~100%, there is plenty of memory to spare, the JVM processes are alive, neither Spark nor Yarn logs anything out of the ordinary, and they can still receive instructions from the driver, such as "drop this task, someone else speculatively executed it already" -- though, for some reason, they don't drop it;
Those executors never get killed off by the driver, so I imagine they keep sending their heartbeats just fine;
Any ideas as to what may be causing this or what I should try?
TLDR: Make sure your code is thread-safe and race-condition-free before you blame Spark.
Figured it out. For posterity: I was using a thread-unsafe data structure (a mutable HashMap). Since the tasks running on an executor's cores share its JVM, this was resulting in data races that were locking up the separate threads/tasks.
The upshot: when you have spark.executor.cores > 1 (and you probably should), make sure your code is thread-safe.
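A contrived illustration of the difference (not the actual pipeline code):

import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.LongAdder
import scala.collection.mutable

// Unsafe: one mutable map shared by every task thread in the executor JVM.
// Concurrent read-modify-write can corrupt the map and leave threads stuck.
object UnsafeCounter {
  val counts = mutable.HashMap.empty[String, Long]
  def record(key: String): Unit =
    counts(key) = counts.getOrElse(key, 0L) + 1   // racy
}

// Safer: a concurrent structure (or, better still, no shared state at all:
// keep per-partition locals and combine them with a reduce/aggregate).
object SafeCounter {
  val counts = new ConcurrentHashMap[String, LongAdder]()
  def record(key: String): Unit =
    counts.computeIfAbsent(key, _ => new LongAdder()).increment()
}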

Why does web UI show different durations in Jobs and Stages pages?

I am running a dummy Spark job that does exactly the same set of operations in every iteration. The following figure shows 30 iterations, where each job corresponds to one iteration. As can be seen, the duration is always around 70 ms, except for jobs 0, 4, 16, and 28. The behavior of job 0 is expected, as that is when the data is first loaded.
But when I click on job 16 to enter its detailed view, the duration is only 64 ms, which is similar to the other jobs. The screenshot of this duration is as follows:
I am wondering where Spark spends the (2000 - 64) ms on job 16.
Gotcha! That's exactly the same question I asked myself a few days ago. I'm glad to share the findings with you (hoping that where my understanding is lacking, others chime in and fill the gaps).
The difference between what you can see in Jobs and Stages pages is the time required to schedule the stage for execution.
In Spark, a single job can have one or many stages, each with one or many tasks. Together they form an execution plan.
By default, a Spark application runs in FIFO scheduling mode, which means it executes one Spark job at a time regardless of how many cores are in use (you can check it on the web UI's Jobs page).
Quoting Scheduling Within an Application:
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into "stages" (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
You should then check how many tasks a single job will execute and divide that by the number of cores the Spark application has been assigned (you can check it on the web UI's Executors page).
That will give you an estimate of how many "cycles" you may need to wait before all tasks (and hence the jobs) complete.
NB: That's where dynamic allocation comes on stage, as you may sometimes want more cores later while starting with very few upfront. That's the conclusion I offered to my client when we noticed similar behaviour.
I can see that all the jobs in your example have 1 stage with 1 task (which makes them very simple and highly unrealistic for a production environment). That tells me that your machine could have been busier at different intervals, so the time Spark took to schedule a job was longer, but once scheduled, the corresponding stage finished as quickly as the stages of the other jobs. I'd say it's a beauty of profiling that it may sometimes (often?) be very unpredictable and hard to reason about.
Just to shed more light on the internals of how the web UI works: the web UI uses a bunch of Spark listeners that collect the current status of the running Spark application. There is at least one Spark listener per page in the web UI. They intercept different execution times depending on their role.
Read about the org.apache.spark.scheduler.SparkListener interface and review its different callbacks to learn about the variety of events they can intercept.
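As a starting point, a rough sketch of a custom listener that surfaces the scheduling gap; it naively assumes one job at a time, which matches the FIFO case above:

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerStageCompleted}

class SchedulingDelayListener extends SparkListener {
  @volatile private var lastJobSubmitted: Long = 0L

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    lastJobSubmitted = jobStart.time
    println(s"Job ${jobStart.jobId} submitted at ${jobStart.time}")
  }

  override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
    val info = event.stageInfo
    for {
      started  <- info.submissionTime
      finished <- info.completionTime
    } println(s"Stage ${info.stageId}: waited ${started - lastJobSubmitted} ms to start, " +
              s"ran for ${finished - started} ms")
  }
}

// Register it right after creating the context:
// sc.addSparkListener(new SchedulingDelayListener)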

How to make Spark fail fast with clarity

I'm learning Spark, and quite often I have some issue that causes tasks and stages to fail. With my default configuration, there are rounds of retries and a bunch of ERROR messages to that effect.
While I totally appreciate the idea of retrying tasks once I finally get to production, I'd love to know how to make my application fail at the first sign of trouble, so that I can avoid all the extra noise in the logs and within the application history itself. For example, if I run it out of memory, I'd love to just see the OOM exception near the end of my log and have the whole app fail.
What's the best way to set up the configuration for this kind of workflow?
You can set spark.task.maxFailures to 1.
spark.task.maxFailures is the number of failures of any individual task that are tolerated before giving up on the job; its default value is 4.
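For example, assuming a SparkSession-based app (the setting has to be in place before the SparkContext is created, so put it in the builder or pass --conf spark.task.maxFailures=1 to spark-submit):

import org.apache.spark.sql.SparkSession

// Fail-fast development setup: abort the job on the first task failure.
val spark = SparkSession.builder()
  .appName("fail-fast-dev")
  .config("spark.task.maxFailures", "1")
  .getOrCreate()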
