How does Spark parallelize slices to tasks/executors/workers? - apache-spark

I have a 2-node Spark cluster with 4 cores per node.
MASTER
(Worker-on-master) (Worker-on-node1)
Spark config:
slaves: master, node1
SPARK_WORKER_INSTANCES=1
I am trying to understand Spark's parallelize behaviour. The SparkPi example has this code:
import scala.math.random   // needed for the random calls below

val slices = 8 // my test value for slices
val n = 100000 * slices
// spark here is the SparkContext created earlier in the SparkPi example
val count = spark.parallelize(1 to n, slices).map { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
As per documentation:
Spark will run one task for each slice of the cluster. Typically you want 2-4 slices for each CPU in your cluster.
I set slices to 8, which means the working set will be divided among 8 tasks on the cluster; in turn, each worker node gets 4 tasks (1:1 per core).
Questions:
1. Where can I see task-level details? Inside executors I don't see a task breakdown, so I can't see the effect of slices in the UI.
2. How can I programmatically find the working set size for the map function above? I assume it is n/slices (100000 above).
3. Are the multiple tasks run by an executor executed sequentially, or in parallel in multiple threads?
4. What is the reasoning behind 2-4 slices per CPU? I assume that ideally we should tune SPARK_WORKER_INSTANCES to correspond to the number of cores in each node (in a homogeneous cluster) so that each core gets its own executor and task (1:1:1).

I will try to answer your question as best I can:
1.- Where can I see task level details?
When submitting a job, Spark stores information about the task breakdown on each worker node, apart from the master. This data is stored, I believe (I have only tested Spark on EC2), in the work folder under the Spark directory.
2.- How to programmatically find the working set size for the map function?
Although I am not sure whether it stores the in-memory size of the slices, the logs mentioned in the first answer provide information about the number of lines each RDD partition contains.
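If you want to check this programmatically rather than digging through logs, here is a minimal sketch (assuming sc is the SparkContext and the same parallelize call as in the question) that prints how many records land in each partition:

val rdd = sc.parallelize(1 to 100000 * 8, 8)
// Number of partitions ("slices") the RDD was split into.
println(s"partitions: ${rdd.partitions.length}")
// Records per partition; for parallelize(1 to n, slices) each should hold roughly n/slices.
rdd.mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size)) }
  .collect()
  .foreach { case (idx, count) => println(s"partition $idx -> $count records") }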
3.- Are the multiple tasks run by an executor run sequentially or in parallel in multiple threads?
I believe different tasks inside a node run sequentially. This is shown in the logs indicated above, which indicate the start and end time of every task.
4.- Reasoning behind 2-4 slices per CPU
Some nodes finish their tasks faster than others. Having more slices than available cores distributes the tasks in a balanced way, avoiding long processing times caused by slower nodes.

Taking a stab at #4:
For #4 it's worth noting that "slices" and "partitions" are the same thing; there is a bug filed, and efforts to clean up the docs: https://issues.apache.org/jira/browse/SPARK-1701
Here's a link that expands the reasoning in #4: http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism
Specifically look at the line:
In general, we recommend 2-3 tasks per CPU core in your cluster.
An important consideration is avoiding shuffles, and setting the number of slices is part of that. It's a more complicated subject than I can explain fully here; the basic idea is to partition your data into enough partitions/slices up front so that Spark doesn't have to re-shuffle later to get more partitions.
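As a rough sketch of that idea (assuming sc is the SparkContext and that the cluster from the question offers 8 cores in total; the numbers are illustrative only):

val totalCores = 8
val targetPartitions = 3 * totalCores   // 2-3 tasks per core, per the tuning guide
// Ask for enough partitions when the RDD is first created ...
val data = sc.parallelize(1 to 800000, targetPartitions)
// ... rather than creating it with too few and repartitioning later, which forces a shuffle:
// val late = sc.parallelize(1 to 800000, 2).repartition(targetPartitions)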

1) Where can I see task-level details? Inside executors I don't see a task breakdown, so I can't see the effect of slices in the UI.
I do not understand your question, because from the UI we can definitely see the effect of partitioning (or slices, if you prefer).
2) How to programmatically find the working set size for the map function above? I assume it is n/slices (100000 above)
Please give more details about which size you are interested in. If you mean the amount of memory consumed by each worker ... each Spark partition defaults to 64 MB (one HDFS block) so ... From the official Spark documentation:
Spark prints the serialized size of each task on the master, so you can look at that to decide whether your tasks are too large; in general tasks larger than about 20 KB are probably worth optimizing.
3) Are the multiple tasks run by an executor run sequentially or in parallel in multiple threads?
A good source for this is this question:
Spark executor & tasks concurrency
4) Reasoning behind 2-4 slices per CPU.
I assume ideally we should tune SPARK_WORKER_INSTANCES to correspond to number of cores in each node (in a homogeneous cluster) so that each core gets its own executor and task (1:1:1)
The major goal is to avoid idle workers: once a worker finishes one task, it always has something to work on while waiting for other nodes to complete longer tasks. With a (1:1:1) ratio the workers would sit idle.

Related

Spark UI Executor

In the Spark UI, 18 executors are added and 6 executors are removed. When I checked the Executors tab, I saw many dead and excluded executors. Currently, dynamic allocation is used in EMR.
I've looked up some posts about dead executors, but these are mostly related to job failure. In my case, it seems that the job itself did not fail, yet I can still see dead and excluded executors.
What are these "dead" and "excluded" executors?
How do they affect the performance of the current Spark cluster configuration?
(If they affect performance) what would be a good way to improve the performance?
With dynamic allocation enabled, Spark tries to adjust the number of executors to the number of tasks in active stages. Let's take a look at this example:
The job starts, and the first stage reads from a huge source, which takes some time. Let's say this source is partitioned and Spark generates 100 tasks to get the data. If each executor has 5 cores, Spark is going to spawn 20 executors to ensure the best parallelism (20 executors x 5 cores = 100 tasks in parallel).
Let's say that in the next step you do a repartition or a sort-merge join; with shuffle partitions set to 200, Spark is going to generate 200 tasks. It is smart enough to figure out that it currently has only 100 cores available, so if new resources are available it will try to spawn another 20 executors (40 executors x 5 cores = 200 tasks in parallel).
Now the join is done; in the next stage you have only 50 partitions. To process these in parallel you don't need 40 executors, 10 are enough (10 executors x 5 cores = 50 tasks in parallel). If this stage takes long enough, Spark can free some resources, and you are going to see removed executors.
Now comes the next stage, which involves repartitioning. The number of partitions equals 200. With 10 executors you can process only 50 partitions in parallel, so Spark will try to get new executors...
You can read this blog post: https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/
The problem with the spark.dynamicAllocation.enabled property is that
it requires you to set subproperties. Some example subproperties are
spark.dynamicAllocation.initialExecutors, minExecutors, and
maxExecutors. Subproperties are required for most cases to use the
right number of executors in a cluster for an application, especially
when you need multiple applications to run simultaneously. Setting
subproperties requires a lot of trial and error to get the numbers
right. If they’re not right, the capacity might be reserved but never
actually used. This leads to wastage of resources or memory errors for
other applications.
Here you will find some hints. From my experience, it is worth setting maxExecutors if you are going to run a few jobs in parallel on the same cluster, as most of the time it is not worth starving other jobs just to get 100% efficiency from one job.
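As a hedged sketch of how those subproperties might be wired up when the application builds its session (the numbers are placeholders, not recommendations):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-example")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.initialExecutors", "4")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "20")   // cap it so one job cannot starve the others
  .config("spark.shuffle.service.enabled", "true")        // dynamic allocation typically needs the external shuffle service on YARN
  .getOrCreate()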

Wordcount in a large file using Spark

I have a question about how I can work on large files using Spark. Let's say I have a really large file (1 TB) while I only have access to 500 GB of RAM in my cluster. A simple word count application would look as follows:
sc.textFile(path_to_file).flatMap(split_line_to_words).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
When I do not have access to enough memory, will the above application fail due to OOM? If so, what are some ways I can fix this?
Well, this is not an issue.
N partitions, each roughly the size of an HDFS(-like) block, will be created physically on the worker nodes at some stage, resulting in N small tasks to execute that easily fit inside the 500 GB over the life of the Spark app.
Partitions, and their task equivalents, run concurrently based on how many executors you have allocated. If you have, say, M executors with 1 core each, then at most M tasks run concurrently. It also depends on the scheduling and resource-allocation mode.
Spark, like any OS as it were, handles situations of size and resources; depending on the resources available, more or less can be done. The DAG scheduler plays a role in all this, but let's keep it simple here.
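For reference, a Scala sketch of the same word count (the paths are placeholders); nothing in it requires the whole 1 TB in memory, since each task streams through its own partition and reduceByKey spills to disk when its buffers grow too large:

val lines = sc.textFile("hdfs:///path/to/big/file")   // one partition per HDFS block by default
println(s"input partitions: ${lines.partitions.length}")
val counts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///path/to/output")       // write out results instead of collecting them to the driver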

Number of Executor Cores and benefits or otherwise - Spark

Some run-time clarifications are requested.
In a thread elsewhere I read that a Spark Executor should only have a single Core allocated. However, I wonder if this is really always true. Reading the various SO questions and the like, as well as Karau, Wendell et al., it is clear that there are equal and opposite experts who state that one should in some cases specify more Cores per Executor, but the discussion tends to be more technical than functional. That is to say, functional examples are lacking.
My understanding is that a Partition of an RDD or DF, DS, is serviced by a single Executor. Fine, no issue, makes perfect sense. So, how can the Partition benefit from multiple Cores?
If I have a map followed by, say, a filter, these are not two Tasks that can be interleaved (as in what Informatica does); my understanding is that they are fused together. This being so, what is an example of a benefit from an assigned Executor running more Cores?
From JL: In other (more technical) words, a Task is a computation on the records in a RDD partition in a Stage of a RDD in a Spark Job. What does that mean, functionally speaking, in practice?
Moreover, can an Executor be allocated if not all its Cores can be acquired? I presume there is a wait period and that after a while it may be allocated in a more limited capacity. True?
From a highly rated answer on SO, What is a task in Spark? How does the Spark worker execute the jar file?, the following is stated: When you create the SparkContext, each worker starts an executor. From another SO question: When a SparkContext is created, each worker node starts an executor.
Not sure I follow these assertions. If Spark does not know the number of partitions etc. in advance, why allocate Executors so early?
I ask this, as even this excellent post How are stages split into tasks in Spark? does not give a practical example of multiple Cores per Executor. I can follow the post clearly and it fits in with my understanding of 1 Core per Executor.
My understanding is that a Partition (...) serviced by a single Executor.
That's correct; however, the opposite is not true: a single executor can handle multiple partitions / tasks, across multiple stages or even multiple RDDs.
then what is an example of benefit from an assigned Executor running more Cores?
First and foremost, processing multiple tasks at the same time. Since each executor is a separate JVM, which is a relatively heavy process, it might be preferable to keep only one instance for a number of threads. Additionally, it can provide further advantages, like exposing shared memory that can be used across multiple tasks (for example, to store broadcast variables).
A secondary application is applying multiple threads to a single partition when the user invokes multi-threaded code. That, however, is not something that is done by default (Number of CPUs per Task in Spark).
See also What are the benefits of running multiple Spark tasks in the same JVM?
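As a rough sketch of the knobs involved (the numbers are illustrative only): with 4 cores per executor and 1 CPU per task, each executor JVM runs up to 4 tasks at once, and all of them share the same heap, and therefore the same broadcast variables.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("multi-core-executors")
  .set("spark.executor.cores", "4")   // task slots per executor JVM
  .set("spark.task.cpus", "1")        // cores reserved per task; raise only if your task code is itself multi-threaded
val sc = new SparkContext(conf)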
If Spark does not know the number of partitions etc. in advance, why allocate Executors so early?
Pretty much by extension of the points made above: executors are not created to handle a specific task / partition. They are long-running processes and, as long as dynamic allocation is not enabled, they are intended to last for the full lifetime of the corresponding application / driver (preemption or failures, as well as the already-mentioned dynamic allocation, can affect that, but that's the basic model).

How the Number of partitions and Number of concurrent tasks in spark calculated

I have a cluster with 4 nodes (each with 16 cores) using Spark 1.0.1.
I have an RDD which I've repartitioned so it has 200 partitions (hoping to increase the parallelism).
When I do a transformation (such as filter) on this RDD, I can't seem to get more than 64 tasks (my total number of cores across the 4 nodes) going at one point in time. By tasks, I mean the number of tasks that appear under the Application Spark UI. I tried explicitly setting the spark.default.parallelism to 128 (hoping I would get 128 tasks concurrently running) and verified this in the Application UI for the running application but this had no effect. Perhaps, this is ignored for a 'filter' and the default is the total number of cores available.
I'm fairly new with Spark so maybe I'm just missing or misunderstanding something fundamental. Any help would be appreciated.
This is correct behavior. Each "core" can execute exactly one task at a time, with each task corresponding to a partition. If your cluster only has 64 cores, you can only run at most 64 tasks at once.
You could run multiple workers per node to get more executors. That would give you more cores in the cluster. But however many cores you have, each core will run only one task at a time.
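A small sketch of what that looks like in practice, assuming sc is the SparkContext of the 4-node / 64-core cluster described above:

val rdd = sc.parallelize(1 to 1000000).repartition(200)
println(rdd.partitions.length)    // 200 -> the next stage will consist of 200 tasks
println(sc.defaultParallelism)    // roughly the number of cores offered to the application
// The filter below still yields a 200-task stage, but only as many tasks as there
// are cores (64 here) are actually running at any instant; the rest wait in the queue.
rdd.filter(_ % 2 == 0).count()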
You can see more details in the following thread:
How does Spark parallelize slices to tasks/executors/workers?

what factors affect how many spark job concurrently

We recently set up the Spark Job Server to which Spark jobs are submitted. But we found out that our 20-node (8 cores / 128 GB memory per node) Spark cluster can only afford 10 Spark jobs running concurrently.
Can someone share some detailed info about which factors actually affect how many Spark jobs can run concurrently? How can we tune the configuration so that we can take full advantage of the cluster?
The question is missing some context, but first: it seems that Spark Job Server limits the number of concurrent jobs (unlike Spark itself, which puts a limit on the number of tasks, not jobs):
From application.conf
# Number of jobs that can be run simultaneously per context
# If not set, defaults to number of cores on machine where jobserver is running
max-jobs-per-context = 8
If that's not the issue (you set the limit higher, or are using more than one context), then the total number of cores in the cluster (8*20 = 160) is the maximum number of concurrent tasks. If each of your jobs creates 16 tasks, Spark would queue the next incoming job waiting for CPUs to be available.
Spark creates a task per partition of the input data, and the number of partitions is decided according to the partitioning of the input on disk, or by calling repartition or coalesce on the RDD/DataFrame to manually change the partitioning. Some other actions that operate on more than one RDD (e.g. union) may also change the number of partitions.
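A quick sketch of those knobs (assuming sc is the SparkContext; the HDFS path is a placeholder):

val raw = sc.textFile("hdfs:///data/events")   // partitioning follows the input layout (e.g. HDFS blocks)
println(raw.partitions.length)
val wider  = raw.repartition(160)   // full shuffle: spread the data across as many partitions as there are cores
val narrow = wider.coalesce(40)     // coalesce narrows without a full shuffle: fewer, larger partitions
// Each stage of a job turns into one task per partition, and at most 160 of those
// tasks (the total core count) run at the same time across this cluster.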
Some things that could limit the parallelism that you're seeing:
If your job consists of only map operations (or other shuffle-less operations), it will be limited to the number of partitions of data you have. So even if you have 20 executors, if you have 10 partitions of data, it will only spawn 10 tasks (unless the data is splittable, as with something like Parquet, LZO-indexed text, etc.).
If you're performing a take() operation (without a shuffle), it performs an exponential take, using only one task and then growing until it collects enough data to satisfy the take operation. (Another question similar to this)
Can you share more about your workflow? That would help us diagnose it.

Resources