Spark scheduling tasks of a job onto a single executor - apache-spark

I've noticed that Spark can sometimes schedule all tasks of a job onto the same executor when the other executors are busy processing tasks from other jobs.
Here is a quick example from the Spark shell, where I created 3 executors with 1 core each:
#1 sc.parallelize(List.range(0,1000), 2).mapPartitions(x => { Thread.sleep(100000); x }).count()
#2 sc.parallelize(List.range(0,1000), 10).mapPartitions(x => { Thread.sleep(1000); x }).count()
Job #1 hogs the two executors by sleeping, and job #2 ends up running all of its 10 tasks sequentially on the 3rd executor. It is understandable from Spark's perspective that there weren't enough resources, but shouldn't it at least try to distribute the tasks, even if that means they get delayed? What happens now is that the one executor becomes a hotspot for the next stage of that job, because all of the shuffle files have been persisted on that executor.
While this is a manually constructed example, we have noticed in our production setup that jobs with, say, 10k tasks end up being executed on only a handful of executors.
Just to give you some context: we create a single application and run several jobs within it (16 jobs in parallel, sometimes even 48). Should we have one application per job instead? What's the rationale behind creating one application vs. multiple? All jobs are usually independent of each other.
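For reference, the parallel submission looks roughly like the sketch below (simplified; the pool names and the job body are placeholders, and it assumes spark.scheduler.mode=FAIR was set when the application was created so that concurrent jobs share executors instead of queueing FIFO).
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val jobs = (1 to 16).map { i =>
  Future {
    // Each independent job is submitted from its own thread and tagged with a
    // scheduler pool; the pool names here are arbitrary placeholders.
    sc.setLocalProperty("spark.scheduler.pool", s"pool-$i")
    sc.parallelize(0 until 1000, 10).map(_ * 2).count()
  }
}
Await.result(Future.sequence(jobs), Duration.Inf)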

Related

Spark executors and shuffle in local mode

I am running a TPC-DS benchmark for Spark 3.0.1 in local mode and using sparkMeasure to get workload statistics. I have 16 total cores and the SparkContext is available as
Spark context available as 'sc' (master = local[*], app id = local-1623251009819)
Q1. For local[*], driver and executors are created in a single JVM with 16 threads. Considering Spark's configuration which of the following will be true?
1 worker instance, 1 executor having 16 cores/threads
1 worker instance, 16 executors each having 1 core
For a particular query, sparkMeasure reports shuffle data as follows
shuffleRecordsRead => 183364403
shuffleTotalBlocksFetched => 52582
shuffleLocalBlocksFetched => 52582
shuffleRemoteBlocksFetched => 0
shuffleTotalBytesRead => 1570948723 (1498.0 MB)
shuffleLocalBytesRead => 1570948723 (1498.0 MB)
shuffleRemoteBytesRead => 0 (0 Bytes)
shuffleRemoteBytesReadToDisk => 0 (0 Bytes)
shuffleBytesWritten => 1570948723 (1498.0 MB)
shuffleRecordsWritten => 183364480
Q2. Regardless of the query specifics, why is there data shuffling when everything is inside a single JVM?
An executor is a JVM process. When you use local[*], you run Spark locally with as many worker threads as there are logical cores on your machine, so: 1 executor and as many worker threads as logical cores. When you configure SPARK_WORKER_INSTANCES=5 in spark-env.sh and run start-master.sh and start-slave.sh spark://localhost:7077 to bring up a standalone Spark cluster on your local machine, you have one master and 5 workers. If you want to submit your application to this cluster, you must configure it like SparkSession.builder().appName("app").master("spark://localhost:7077"); in this case you can't specify [*] or [2], for example. But when you specify the master to be local[*], a single JVM process is created, the master and all workers live inside that JVM, and after your application finishes that JVM instance is destroyed. local[*] and spark://localhost:7077 are two separate things.
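To make the distinction concrete, here is roughly how the two setups are selected when building a session; this is a sketch assuming a standalone master has already been started on localhost:7077, and in practice you would pick one of the two per application.
import org.apache.spark.sql.SparkSession

// Single-JVM local mode: driver, scheduler and worker threads all live in one process.
val localSpark = SparkSession.builder()
  .appName("local-app")
  .master("local[*]")          // as many worker threads as logical cores
  .getOrCreate()

// Standalone mode: the application is submitted to an external master started with
// start-master.sh / start-slave.sh; executors then run inside the workers, not in this JVM.
val clusterSpark = SparkSession.builder()
  .appName("cluster-app")
  .master("spark://localhost:7077")
  .getOrCreate()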
Workers do their job using tasks, and each task is actually a thread, i.e. task = thread. Workers have memory and assign a memory partition to each task so it can do its work, such as reading a part of a dataset into its own memory partition or transforming data that has been read. When a task such as a join needs other partitions, a shuffle occurs regardless of whether the job runs on a cluster or locally. On a cluster, two tasks may be on different machines, so network transmission is added on top of the other steps, such as writing the result and then having it read by another task. Locally, if task B needs the data in task A's partition, task A still has to write it down and task B then reads it to do its job.
Local mode is the same as the non-distributed, single-JVM deployment mode.
Q1: It is neither. In this mode Spark spawns all execution components, namely the Driver, N threads for data processing, and the Master, in a single JVM.
If I had to map it onto one of your two options I would say "1 worker instance, 16 executors each having 1 core", but as said, this is not the right way to look at it. Another option would be N workers with M executors of 1 core each, where N x M = 16.
The default parallelism is the number of threads as specified in the
master URL = local[*].
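You can verify this directly in the shell; the values below assume a 16-core machine running with master = local[*].
sc.defaultParallelism   // 16 with local[*] on a 16-core machine
sc.master               // "local[*]"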
Q2: The threads service partitions concurrently, one at a time, as many as needed, sequentially within the current stage, being assigned by the Driver when free. A stage boundary is where shuffling happens, regardless of whether you run on a YARN cluster or locally.
Shuffling, then, what is that? A shuffle occurs when data needs to be re-arranged across the existing partitions, e.g. by a groupBy or orderBy: we may have M partitions before the groupBy and N partitions after it. This is the wide-transformation concept at the core of Spark's parallel processing, so even with local[*] it applies.
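As a quick illustration (the "bucket" column is made up), even in local[*] a wide transformation like groupBy forces an exchange of data between partitions, which shows up as local shuffle reads and writes in the metrics above:
import spark.implicits._

// Narrow step: each of the 8 partitions is processed independently, no shuffle.
val df = spark.range(0L, 1000000L, 1L, numPartitions = 8)
  .withColumn("bucket", $"id" % 10)

// Wide step: rows with the same bucket must land in the same partition, so Spark
// writes shuffle files locally and reads them back, even inside a single JVM.
val grouped = df.groupBy("bucket").count()
grouped.explain()   // the plan contains an Exchange (shuffle) node
grouped.collect()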

Why are Spark task ids not executed in order?

I ran the simplest word count program, code below:
val text = spark.read.textFile("/datasets/wordcount_512m.txt")
text.flatMap(line => line.split(" ")).groupByKey(identity).count().collect()
My HDFS block size is 128 MB, and there are two executors, each with two cores. I looked into the Spark UI; in stage 0 everything is normal.
There are four tasks running in parallel.
But a strange thing happened in stage 1: some task ids do not execute in order.
As the picture shows, some higher task ids run before lower ones (task 91 runs before task 0). What do these out-of-order task ids represent?
Spark stages within a job must execute in order, else none of it would make sense computationally.
Within a stage there are tasks, one per partition. It does not matter in which order these tasks execute, as long as they all complete. That is the point of parallel computation: there are no dependencies between them, so their scheduling order is not relevant.
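A quick way to see this for yourself (illustrative only) is to tag each record with the partition that processed it; the completion order of the tasks varies from run to run, but the final result is identical:
import org.apache.spark.TaskContext

val rdd = sc.parallelize(1 to 100, 10)

// One task per partition; the order in which the tasks finish is not guaranteed,
// only that all of them complete before the stage does.
rdd.mapPartitions { iter =>
  val pid = TaskContext.getPartitionId()
  iter.map(x => (pid, x))
}.collect()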

Why is there a delay in the launch of spark executors?

While trying to optimise a Spark job, I am having trouble understanding a delay of 3-4 s in the launch of the second executor and of 6-7 s for the third and fourth executors.
This is what I'm working with:
Spark 2.2
Two worker nodes with 8 CPU cores each (master node separate).
Executors are configured to use 3 cores each.
Following is the screenshot of the jobs tab in Spark UI.
The job is divided into three stages. As seen, second, third and fourth executors are added only during the second stage.
Following is a snapshot of Stage 0.
And the following is a snapshot of Stage 1.
As seen in the image above, executor 2 (on the same worker as the first) takes around 3 s to launch. Executors 3 and 4 (on the second worker) take even longer, approximately 6 s.
I tried playing around with the spark.locality.wait setting, with values of 0s, 1s, and 1ms, but there does not seem to be any change in the launch times of the executors.
Is there some other reason for this delay? Where else can I look to understand this better?
You might be interested to check Spark's executor request policy, and review the settings spark.dynamicAllocation.schedulerBacklogTimeout and spark.dynamicAllocation.sustainedSchedulerBacklogTimeout for your application.
A Spark application with dynamic allocation enabled requests additional executors when it has pending tasks waiting to be scheduled. ...
Spark requests executors in rounds. The actual request is triggered when there have been pending tasks for spark.dynamicAllocation.schedulerBacklogTimeout seconds, and then triggered again every spark.dynamicAllocation.sustainedSchedulerBacklogTimeout seconds thereafter if the queue of pending tasks persists. Additionally, the number of executors requested in each round increases exponentially from the previous round. For instance, an application will add 1 executor in the first round, and then 2, 4, 8 and so on executors in the subsequent rounds.
Another potential source for a delay could be spark.locality.wait. Since in Stage 1 you have quite a bit of tasks with sub-optimal locality levels (Rack local: 59), and the default for spark.locality.wait is 3 seconds, it could actually be the primary reason for the delays that you're seeing.
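If you want to experiment with these knobs, they can be set when the application is created; the values below are just examples, not recommendations.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("executor-launch-tuning")
  // Dynamic allocation also needs the external shuffle service (or shuffle
  // tracking on newer versions) to be enabled; omitted here for brevity.
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
  .config("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "1s")
  // How long the scheduler waits for a data-local slot before falling back
  // to a less local one.
  .config("spark.locality.wait", "1s")
  .getOrCreate()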
It takes time for YARN to create the executors; nothing can be done about that overhead. If you want to optimize it, you can set up a long-running Spark server and submit requests to it, which saves the warm-up time.

Why would the Spark executor run out of memory when multiple tasks are executed serially?

Please see the update for the follow up.
Please note I am not interested in increasing the parallelism of this process. I am trying to understand the executor memory model.
Let's say my application is decomposed to 1 stage (several mappers, filter, store result to hdfs [in other words, no reducers])
Let's say I have:
10 Executors (1 core per Executor, 5 GB per executor)
10 Partitions
10 Tasks (I know that each task requires 5 GB to complete successfully)
I end up with 10 tasks, each task running on an executor successfully.
Now same application and same setup but this time I have reduced the number of executors:
5 Executors (1 core per Executor, 5 GB per executor)
10 Partitions
10 Tasks (I know that each task requires 5 GB to complete successfully)
I still have 10 tasks. This time the 5 executors execute 5 tasks in parallel successfully, but when an executor tries to execute the second set of tasks (tasks 6-10), it tries to get more than the specified amount of memory and YARN kills it...
I thought what would happen is that 5 tasks run successfully and then 5 more tasks run successfully, since all tasks are identical...
But it looks like the executor is forced to carry some memory footprint over from the execution of the first 5 tasks.
Follow up
The reason the executors were failing was that I was doing a lot of string manipulation and had configured spark.yarn.executor.memoryOverhead too small (512 MB).
Once I fixed this problem, I repeated the experiment and was able to successfully process the 10 tasks using 5 executors.
I am leaving this question and the findings here as documentation, in case someone has the same question...
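For reference, this is roughly how the overhead can be bumped; the 2048 MB value is just an example, and on newer Spark versions the setting is spelled spark.executor.memoryOverhead.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-overhead-example")
  .config("spark.executor.instances", "5")
  .config("spark.executor.cores", "1")
  .config("spark.executor.memory", "5g")
  // Off-heap headroom for strings, buffers, native allocations, etc.
  // (spark.yarn.executor.memoryOverhead on older versions, spark.executor.memoryOverhead later).
  .config("spark.yarn.executor.memoryOverhead", "2048")
  .getOrCreate()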

Spark executor & tasks concurrency

In Spark, an executor may run many tasks concurrently, maybe 2 or 5 or 6.
How does Spark figure out (or calculate) the number of tasks that can run concurrently in the same executor?
An executor may be executing one task; can another task be placed to run concurrently on the same executor? What is the criterion for that?
An executor has a fixed number of cores and a fixed amount of memory. Since we do not specify memory and core requirements per task in Spark, how can we calculate how many tasks can run concurrently in an executor?
The number of tasks that run in parallel within an executor equals the number of cores configured.
You can always change this number through configuration.
The total number of tasks run by an executor overall (parallel or sequential) depends on the total number of tasks created (i.e. the number of splits) and on the number of executors.
All tasks running in one executor share the same configured memory. Internally, the executor just launches as many threads as it has cores.
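In other words, the number of concurrent task slots per executor is spark.executor.cores divided by spark.task.cpus. A minimal sketch of how that is configured (example values only):
import org.apache.spark.sql.SparkSession

// With 4 cores per executor and 1 CPU per task, each executor runs up to
// 4 tasks at the same time, all sharing the executor's heap.
val spark = SparkSession.builder()
  .appName("task-concurrency-example")
  .config("spark.executor.cores", "4")
  .config("spark.task.cpus", "1")   // raise this to give each task more than one core
  .getOrCreate()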
One likely issue could be skewed partitions in the RDD you are processing. If 2-6 partitions hold most of the data, then, in order to reduce data shuffling over the network, Spark will try to have the executors process the data residing locally on their own nodes. So you'll see those 2-6 executors working for a long time while the others finish their data in a few milliseconds.
You can find more about this in this stackoverflow question.
