Assigning fixed resources for a single task on an executor - apache-spark

According to Deep Dive: Apache Spark Memory Management, there is contention between tasks running in parallel on the same executor.
From Spark 1.0+ there are two possible options:
Option 1: Static assignment - resources are shared across tasks equally,
Option 2: Dynamic assignment - resources are shared across tasks dynamically.
AFAIK Spark uses the second option by default. Is there a possibility to manually specify the maximum resources for each task?
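To make the two options concrete, here is a rough PySpark sketch of the knobs that determine a task's share under the default dynamic assignment. The property names are standard Spark settings; the arithmetic in the comments is an approximation of the 1/(2N)..1/N bounds from the talk, and as far as I know there is no explicit per-task maximum setting.

from pyspark.sql import SparkSession

# N = number of tasks running concurrently in one executor
#   = spark.executor.cores / spark.task.cpus
# Under dynamic assignment each running task gets between 1/(2N) and 1/N
# of the unified execution/storage pool, so fixing these settings bounds
# the per-task share even without a per-task limit.
spark = (SparkSession.builder
         .appName("per-task-share-sketch")
         .config("spark.executor.memory", "8g")    # heap per executor
         .config("spark.memory.fraction", "0.6")   # share usable for execution + storage
         .config("spark.executor.cores", "4")      # task slots per executor
         .config("spark.task.cpus", "2")           # cores claimed by one task
         .getOrCreate())
# Here N = 4 / 2 = 2, so each task can use roughly between
# 8g * 0.6 / 4 ~ 1.2g and 8g * 0.6 / 2 ~ 2.4g of execution memory.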

Related

If we create multiple Spark Sessions using newSession() method, how is the driver memory shared between multiple spark sessions

In my Spark application I am creating multiple (2 - 3) Spark sessions with the help of the newSession() method. While submitting the application, I am configuring spark.driver.memory to 24g.
How will this memory get distributed between the 2 Spark sessions if they are processing 2 different datasets in parallel? Thanks.
Sessions are used for configuration management, not for resource management or parallel in-application processing. There is no built-in mechanism for per-session resource allocation, and all sessions are part of the same application from the cluster manager's perspective.
It means first-come-first-served - there is no separation, and whoever occupies resources first wins.
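A minimal PySpark sketch of that point - both sessions below share a single SparkContext and therefore a single driver heap (the 24g figure is just the value from the question):

from pyspark.sql import SparkSession

# One driver JVM, one spark.driver.memory pool, no matter how many sessions.
# (In practice driver memory must be set before the JVM starts, e.g. on spark-submit.)
spark = (SparkSession.builder
         .appName("shared-driver-memory")
         .config("spark.driver.memory", "24g")
         .getOrCreate())
other = spark.newSession()   # fresh SQL conf and temp views, same SparkContext

# Same underlying SparkContext, so work from either session competes for the
# same 24g on a first-come-first-served basis.
assert spark.sparkContext is other.sparkContext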

Airflow: how to specify quantitative usage of a resource pool?

I am looking at several open-source workflow schedulers for a DAG of jobs with heterogeneous RAM usage. The scheduler should not only cap the number of concurrent threads, but should also keep the total RAM usage of all concurrent tasks below the available memory.
In this Luigi Q&A, it was explained that
You can set how many of the resource is available in the config, and then how many of the resource the task consumes as a property on the task. This will then limit you to running n of that task at a time.
in config:
[resources]
api=1
in code for Task:
resources = {"api": 1}
For Airflow, I haven't been able to find the same functionality in its docs. The best that seems possible is to specify a number of available slots in a resource pool, and to also specify that a task instance uses a single slot in a resource pool. However, it appears there is no way to specify that a task instance uses more than one slot in a pool.
Question: specifically for Airflow, how can I specify a quantitative resource usage of a task instance?
Assuming you're using the CeleryExecutor, then starting from Airflow version 1.9.0 you can manage the concurrency of Celery's workers. This is not exactly the memory management you've been asking about, but rather the number of concurrent worker threads executing tasks.
The tweakable parameter is called CELERYD_CONCURRENCY; how to manage Celery-related config in Airflow is explained nicely elsewhere.
[Edit]
Actually, Pools could also be used to limit concurrency.
Let's say you want to limit a resource-hungry task_id so that only 2 instances run at the same time. The only thing you need to do is:
create a pool (in the UI: Admin -> Pools), give it a name, e.g. my_pool, and define the task's concurrency in the Slots field (in this case 2)
when instantiating the Operator that will execute this task_id, pass the defined pool name (pool="my_pool")
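A minimal sketch of the second step in code (the DAG and task names are made up; pool="my_pool" is the 2-slot pool created in the UI above):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("resource_hungry_dag",
          start_date=datetime(2018, 1, 1),
          schedule_interval="@daily")

# Every instance of this task occupies one slot of my_pool, so at most
# 2 instances run at the same time, regardless of other concurrency settings.
hungry = BashOperator(
    task_id="resource_hungry_task",
    bash_command="echo 'heavy work'",   # placeholder command
    pool="my_pool",
    dag=dag,
)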

What "module" takes care of assigning partitions to specific nodes, YARN/cluster manager or Spark itself?

Which module in Apache Spark takes care of assigning partitions to specific nodes in the cluster, i.e. which module keeps the mapping between a partition and a specific node? Is this done by YARN/the cluster manager, or is it managed by Spark core itself?
Is this done by YARN/the cluster manager, or is it managed by Spark core itself?
It's done as part of Spark Core's TaskScheduler, and more specifically the TaskSetManager, which responds to resource offers (where the resources are CPUs and RAM, with CPUs being the only scheduling factor that matters).
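Spark's actual implementation lives in TaskSchedulerImpl/TaskSetManager; purely as a conceptual illustration (none of the names below are Spark APIs), the offer-based matching with a preference for data locality looks roughly like this:

def assign_tasks(offers, pending_tasks):
    # offers: list of (host, free_cpus); pending_tasks: list of dicts with an
    # "id" and "preferred_hosts" (hosts that already hold the partition's data).
    assignments = []
    for host, free_cpus in offers:
        while free_cpus > 0 and pending_tasks:
            # Prefer a task whose partition data is local to this host,
            # otherwise fall back to any pending task.
            local = [t for t in pending_tasks if host in t["preferred_hosts"]]
            task = local[0] if local else pending_tasks[0]
            pending_tasks.remove(task)
            assignments.append((task["id"], host))
            free_cpus -= 1
    return assignments

print(assign_tasks(
    offers=[("node-1", 2), ("node-2", 1)],
    pending_tasks=[{"id": 0, "preferred_hosts": ["node-1"]},
                   {"id": 1, "preferred_hosts": ["node-1"]},
                   {"id": 2, "preferred_hosts": []}]))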

How does spark.dynamicAllocation.enabled influence the order of jobs?

I need an understanding of when to use spark.dynamicAllocation.enabled - what are the advantages and disadvantages of using it? I have a queue where jobs get submitted.
9:30 AM --> Job A gets submitted with dynamicAllocation enabled.
10:30 AM --> Job B gets submitted with dynamicAllocation enabled.
Note: my data is huge (the processing will be done on 10 GB of data with transformations).
Which job gets preference in the allocation of executors, Job A or Job B, and how does Spark coordinate between the 2 applications?
Dynamic Allocation of Executors is about resizing your pool of executors.
Quoting Dynamic Allocation:
spark.dynamicAllocation.enabled Whether to use dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload.
And later on in Dynamic Resource Allocation:
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
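For reference, a minimal PySpark sketch of the settings involved when turning it on (the values are illustrative; on YARN the external shuffle service is typically required so that removed executors' shuffle files survive):

from pyspark.sql import SparkSession

# Executors are requested when tasks queue up and released after sitting idle.
spark = (SparkSession.builder
         .appName("dynamic-allocation-sketch")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "1")
         .config("spark.dynamicAllocation.maxExecutors", "20")
         .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
         .getOrCreate())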
In other words, job A will usually finish before job B is executed. Spark jobs are usually executed sequentially, i.e. a job has to finish before another can start.
Usually...
SparkContext is thread-safe and can handle multiple jobs within a single Spark application. That means you may submit jobs at the same time or one after another and, with the right configuration, expect the two jobs to run in parallel.
Quoting Scheduling Within an Application:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc.
It is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
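A sketch of that in-application case, assuming FAIR mode and two placeholder jobs submitted from separate threads:

import threading
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("fair-scheduling-sketch")
         .config("spark.scheduler.mode", "FAIR")   # default is FIFO
         .getOrCreate())

def job_a():
    spark.range(10_000_000).count()                          # placeholder action

def job_b():
    spark.range(10_000_000).selectExpr("sum(id)").collect()  # placeholder action

# Submitted from separate threads, the two jobs' tasks are interleaved
# round-robin instead of the first job getting all resources first.
threads = [threading.Thread(target=job_a), threading.Thread(target=job_b)]
for t in threads:
    t.start()
for t in threads:
    t.join()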
Wrapping up...
Which job gets preference in the allocation of executors, Job A or Job B, and how does Spark coordinate between the 2 applications?
Job A.
Unless you have enabled Fair Scheduler Pools:
The fair scheduler also supports grouping jobs into pools, and setting different scheduling options (e.g. weight) for each pool. This can be useful to create a “high-priority” pool for more important jobs, for example, or to group the jobs of each user together and give users equal shares regardless of how many concurrent jobs they have instead of giving jobs equal shares.
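Pools are opted into per thread via a SparkContext local property - a short sketch (the pool name is an assumption; pool weights and minShare would be declared in the fairscheduler.xml referenced by spark.scheduler.allocation.file):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("fair-pools-sketch")
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

# Jobs launched from the current thread are charged to the "high_priority" pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "high_priority")
spark.range(1_000_000).count()
# Clear it so later jobs from this thread go back to the default pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", None)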

If the task slots in one executor can be shared by different Spark application tasks?

I have read some literature about Spark task scheduling and found that some papers mention that an Executor is monopolized by only one application at any given moment.
So I am wondering whether the task slots in one executor can be shared by different Spark applications at the same time.
