Does memory configuration really matter with fair scheduler? - apache-spark

We have a Hadoop cluster with the Fair Scheduler configured. When there were not many jobs in the cluster, we used to see the running job try to take as much memory and as many cores as were available.
With the Fair Scheduler, do executor memory and cores really matter for Spark jobs? Or is it up to the Fair Scheduler to decide how much to give?

Under the Fair Scheduler's policy, the first job submitted to it gets all the available resources.
When a second job starts, the resources are split so that each job gets roughly (available resources)/(no. of jobs).
So the main thing to look at is the maximum container memory you have requested for the job. If that equals the total resources available, then it is expected that your job uses all of them.
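To make the split concrete, here is a minimal sketch of equal sharing with a per-job cap, written as plain Python. It is a toy model, not YARN's actual algorithm: it ignores weights, minimum shares, queues and preemption, and the job names and numbers are made up for illustration.

```python
def fair_shares(total, demands):
    """Split `total` resource units equally among jobs.

    demands maps a (hypothetical) job name to the maximum it asked for.
    A job never receives more than its demand; whatever it leaves unused
    is redistributed among the jobs that still want more.
    """
    shares = {job: 0.0 for job in demands}
    remaining = float(total)
    active = {job for job, demand in demands.items() if demand > 0}
    while active and remaining > 1e-9:
        equal_slice = remaining / len(active)
        for job in list(active):
            grant = min(equal_slice, demands[job] - shares[job])
            shares[job] += grant
            remaining -= grant
            if demands[job] - shares[job] <= 1e-9:
                active.discard(job)  # demand met, stop giving it more
    return shares


# One job asking for everything gets everything.
print(fair_shares(100, {"job1": 100}))
# Two greedy jobs end up with half each.
print(fair_shares(100, {"job1": 100, "job2": 100}))
# A job with a small request is capped; the rest goes to the other job.
print(fair_shares(100, {"job1": 100, "job2": 20}))
```

In this model the executor memory and core settings act as the `demands`: they cap what a job can take, while the scheduler decides how the pool is divided when demands exceed what is available.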

Related

best way to run 300+ concurrent spark jobs in Dataproc?

I have a Dataproc cluster with 2 worker nodes (n1s2). There is an external server which submits around 360 spark jobs within an hour (with a couple of minutes spacing between each submission). The first job completes successfully but the subsequent ones get stuck and do not proceed at all.
Each job crunches some timeseries numbers and writes to Cassandra. And the time taken is usually 3-6 minutes when the cluster is completely free.
I feel this could be solved by just scaling up the cluster, but that would become very costly for me.
What would be the other options to best solve this use case?
Running 300+ concurrent jobs on a cluster with 2 worker nodes doesn't sound feasible. First estimate how much resource (CPU, memory, disk) each job needs, then plan the cluster size accordingly. YARN metrics such as available CPU, available memory, and especially pending memory help identify when the cluster is short of resources.
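As a rough back-of-the-envelope version of that estimate, the sketch below uses the numbers from the question (360 submissions per hour, 3 to 6 minutes per job on an idle cluster) to work out how many jobs are in flight on average, then how many workers that implies. The per-job footprint (2 cores, 2 GB) and the usable worker capacity (2 cores, 6 GB) are assumptions for illustration only; substitute measured values.

```python
import math


def concurrent_jobs(jobs_per_hour, minutes_per_job):
    """Average number of jobs in flight: arrival rate times job duration."""
    return jobs_per_hour / 60.0 * minutes_per_job


def workers_needed(concurrency, cores_per_job, mem_gb_per_job,
                   cores_per_worker, mem_gb_per_worker):
    """Workers required to hold `concurrency` jobs, limited by CPU or memory."""
    by_cpu = concurrency * cores_per_job / cores_per_worker
    by_mem = concurrency * mem_gb_per_job / mem_gb_per_worker
    return math.ceil(max(by_cpu, by_mem))


best_case = concurrent_jobs(360, 3)   # 18.0 jobs in flight
worst_case = concurrent_jobs(360, 6)  # 36.0 jobs in flight

# Assumed footprint: 2 cores and 2 GB per job; assumed worker: 2 cores, 6 GB usable.
print(workers_needed(best_case, 2, 2, 2, 6))   # 18
print(workers_needed(worst_case, 2, 2, 2, 6))  # 36
```

Since 3 to 6 minutes is the duration on an idle cluster, real durations under contention will be longer and the concurrency higher, so these figures are a lower bound. If scaling up is too costly, the same arithmetic shows what else has to change: fewer cores or less memory per job, or spreading submissions over a longer window.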

How to make slurm make a scheduling decision when jobs are submitted?

I'm using the backfill scheduler with Slurm to manage a small GPU cluster. The backfill scheduler makes a scheduling decision every bf_interval seconds (the default value is 30 seconds). This means that even when GPU resources are available, I sometimes have to wait a while until they are allocated. I can obviously reduce bf_interval, but given that we don't have a lot of job submissions, it would be good if I could force Slurm to run the scheduling routine the moment a job is queued. Is this possible?
By default Slurm does it. From the documentation:
Slurm is designed to perform a quick and simple scheduling attempt at events such as job submission or completion and configuration changes.
Have you changed the default configuration for this? And are you sure that the lack of scheduling on submission is your problem?

How does spark.dynamicAllocation.enabled influence the order of jobs?

I need to understand when to use spark.dynamicAllocation.enabled. What are the advantages and disadvantages of using it? I have a queue where jobs get submitted.
9:30 AM --> Job A gets submitted with dynamicAllocation enabled.
10:30 AM --> Job B gets submitted with dynamicAllocation enabled.
Note: My Data is huge (processing will be done on 10GB data with transformations).
Which job gets preference in the allocation of executors, Job A or Job B, and how does Spark coordinate between the 2 applications?
Dynamic Allocation of Executors is about resizing your pool of executors.
Quoting Dynamic Allocation:
spark.dynamicAllocation.enabled Whether to use dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload.
And later on in Dynamic Resource Allocation:
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
In other words, job A will usually finish before job B is executed. Spark jobs are usually executed sequentially, i.e. a job has to finish before another can start.
Usually...
SparkContext is thread-safe and can handle multiple jobs from a Spark application. That means you may submit jobs at the same time or one after another, and in some configurations expect these two jobs to run in parallel.
Quoting Scheduling Within an Application:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc.
It is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
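The difference between the two modes quoted above can be illustrated with a small simulation. This is a toy model in plain Python, not Spark's scheduler: it ignores stages, locality and pools, and the job names, task counts and slot count are hypothetical.

```python
from collections import deque


def schedule(jobs, slots, mode="FIFO"):
    """Simulate task launches round by round.

    jobs:  dict of job name -> number of pending tasks, in submission order.
    slots: how many tasks can launch per round.
    Returns a list of rounds, each a list of job names whose tasks launched.
    """
    pending = dict(jobs)
    rounds = []
    while any(pending.values()):
        launched = []
        if mode == "FIFO":
            # Earlier jobs take every slot they can use before later jobs.
            for name in pending:
                while pending[name] and len(launched) < slots:
                    pending[name] -= 1
                    launched.append(name)
        else:
            # FAIR: hand out slots one at a time, cycling through the jobs.
            queue = deque(name for name in pending if pending[name])
            while queue and len(launched) < slots:
                name = queue.popleft()
                pending[name] -= 1
                launched.append(name)
                if pending[name]:
                    queue.append(name)
        rounds.append(launched)
    return rounds


def finish_round(rounds, name):
    """1-based index of the last round in which `name` launched a task."""
    return max(i for i, launched in enumerate(rounds, start=1) if name in launched)


jobs = {"long": 6, "short": 2}
fifo = schedule(jobs, slots=2, mode="FIFO")
fair = schedule(jobs, slots=2, mode="FAIR")
print(finish_round(fifo, "short"))  # 4: waits behind the long job
print(finish_round(fair, "short"))  # 2: gets slots right away
```

Both modes take four rounds in total here; what changes is that the short job no longer waits for the long one to drain.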
Wrapping up...
Which job gets preference in the allocation of executors, Job A or Job B, and how does Spark coordinate between the 2 applications?
Job A.
Unless you have enabled Fair Scheduler Pools:
The fair scheduler also supports grouping jobs into pools, and setting different scheduling options (e.g. weight) for each pool. This can be useful to create a “high-priority” pool for more important jobs, for example, or to group the jobs of each user together and give users equal shares regardless of how many concurrent jobs they have instead of giving jobs equal shares.
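Pools are typically declared in an XML allocation file that Spark reads through the `spark.scheduler.allocation.file` setting when `spark.scheduler.mode` is `FAIR`. The sketch below generates such a file with Python's standard library; the pool names and values are hypothetical, and only the three per-pool options (`schedulingMode`, `weight`, `minShare`) are written.

```python
import xml.etree.ElementTree as ET


def build_allocations(pools):
    """Render a fair scheduler allocation file as an XML string.

    pools: dict of pool name -> dict with schedulingMode, weight, minShare.
    """
    root = ET.Element("allocations")
    for name, opts in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        for key in ("schedulingMode", "weight", "minShare"):
            ET.SubElement(pool, key).text = str(opts[key])
    return ET.tostring(root, encoding="unicode")


# Hypothetical pools: a heavier-weighted pool for important jobs and a default one.
xml_text = build_allocations({
    "high_priority": {"schedulingMode": "FAIR", "weight": 2, "minShare": 2},
    "default": {"schedulingMode": "FIFO", "weight": 1, "minShare": 0},
})
print(xml_text)
```

A job is then routed to a pool from the thread that submits it, for example with `sc.setLocalProperty("spark.scheduler.pool", "high_priority")`; jobs submitted without that property go to the default pool.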

Does Spark's Fair Scheduler pool provide inter- or intra-application scheduling?

I am quite confused, because these pools get created for each Spark application, and the pool is created even if I set its minShare higher than the total number of cores in the cluster.
So if these pools are intra-application, do I need to assign different pools to different Spark jobs manually? If I use sparkContext.setLocalProperty to set the pool, then all the stages of that application go to that pool.
The point is: can jobs from two different applications go into the same pool? If application a1 sets its pool to p1 through the SparkContext, and another application a2 also sets its pool to p1, do the jobs of both applications go to the same pool p1, or is p1 for a1 different from p1 for a2?
As described in Spark's official documentation in Scheduling Within an Application:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads.
and later in the same document:
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
With that, scheduling happens within the resources given to a Spark application, and how much an application gets depends on the CPUs/vcores and memory available in the cluster manager.
The Fair Scheduler mode is essentially for Spark applications with parallel jobs, so pools are scoped to a single application: pool p1 in a1 is separate from pool p1 in a2.

Hadoop: Using cgroups for TaskTracker tasks

Is it possible to configure cgroups or Hadoop in a way that each process that is spawned by the TaskTracker is assigned to a specific cgroup?
I want to enforce memory limits using cgroups. It is possible to assign a cgroup to the TaskTracker, but if jobs wreak havoc, the TaskTracker will probably also be killed by the oom-killer because they are in the same group.
Let's say I have 8 GB of memory on a machine. I want to reserve 1.5 GB for the DataNode and system utilities and let the Hadoop TaskTracker use 6.5 GB of memory. Now I start a job using the streaming API that spawns 4 mappers and 2 reducers (each of these could in theory use 1 GB of RAM) and that eats more memory than allowed. Now the cgroup memory limit will be hit and the oom-killer starts to kill a job. I would rather use a cgroup for each map and reduce task, e.g. a cgroup that is limited to 1 GB of memory.
Is this a real or a more theoretical problem? Would the oom-killer really kill the Hadoop TaskTracker, or would it start killing the forked processes first? If the latter is true most of the time, my idea would probably work. If not, a bad job would still kill the TaskTracker on all cluster machines and require manual restarts.
Is there anything else to look for when using cgroups?
Have you looked at the Hadoop parameters that let you set and cap the heap allocations for the TaskTracker's child processes (tasks)? Also, do not forget to look at the possibility of JVM reuse.
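As a rough sketch of how those parameters could be sized for the machine described in the question (6.5 GB for the TaskTracker side, 4 map and 2 reduce slots), the helper below derives a per-task heap cap. The 512 MB reserved for the TaskTracker daemon and the 25% allowance for non-heap JVM memory are assumptions, not recommendations; the property names are the Hadoop 1.x ones.

```python
def tasktracker_settings(budget_mb, map_slots, reduce_slots,
                         daemon_mb=512, overhead_fraction=0.25):
    """Derive Hadoop 1.x task settings that fit inside a memory budget.

    budget_mb:         memory available to the TaskTracker and its tasks.
    daemon_mb:         assumed reserve for the TaskTracker daemon itself.
    overhead_fraction: assumed non-heap JVM overhead on top of -Xmx.
    """
    slots = map_slots + reduce_slots
    per_process_mb = (budget_mb - daemon_mb) / slots
    heap_mb = int(per_process_mb / (1 + overhead_fraction))
    return {
        "mapred.tasktracker.map.tasks.maximum": str(map_slots),
        "mapred.tasktracker.reduce.tasks.maximum": str(reduce_slots),
        "mapred.child.java.opts": "-Xmx%dm" % heap_mb,
        # 1 = a fresh JVM per task (no reuse); -1 would reuse without limit.
        "mapred.job.reuse.jvm.num.tasks": "1",
    }


# 6.5 GB = 6656 MB, 4 mappers and 2 reducers as in the question.
for key, value in tasktracker_settings(6656, 4, 2).items():
    print(key, "=", value)
```

One caveat for the streaming case in the question: `-Xmx` only bounds the Java heap of the task JVM, not the external process that streaming launches, so a heap cap alone may not be enough there. Hadoop 1.x also has a `mapred.child.ulimit` setting that may be worth checking for that purpose.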
useful links:
http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
http://developer.yahoo.com/hadoop/tutorial/module7.html
How to avoid OutOfMemoryException when running Hadoop?
http://www.quora.com/Why-does-Hadoop-use-one-JVM-per-task-block
If you have a lot of students and staff accessing the Hadoop cluster to submit jobs, you should probably look at job scheduling in Hadoop.
Here is the gist of some types you may be interested in -
Fair scheduler:
The core idea behind the fair share scheduler was to assign resources to jobs such that on average over time, each job gets an equal share of the available resources.
To ensure fairness, each user is assigned to a pool. In this way, if one user submits many jobs, he or she can receive the same share of cluster resources as all other users (independent of the work they have submitted).
Capacity scheduler:
In capacity scheduling, instead of pools, several queues are created, each with a configurable number of map and reduce slots. Each queue is also assigned a guaranteed capacity (where the overall capacity of the cluster is the sum of each queue's capacity). Capacity scheduling was defined for large clusters, which may have multiple, independent consumers and target applications.
Here's the link from which I shamelessly copied the above, due to lack of time.
http://www.ibm.com/developerworks/library/os-hadoop-scheduling/index.html
To configure Hadoop use this link: http://hadoop.apache.org/docs/r1.1.1/fair_scheduler.html#Installation

Resources