Hadoop: Using cgroups for TaskTracker tasks - linux

Is it possible to configure cgroups or Hadoop in such a way that each process spawned by the TaskTracker is assigned to a specific cgroup?
I want to enforce memory limits using cgroups. It is possible to assign a cgroup to the TaskTracker, but if jobs wreak havoc, the TaskTracker will probably also be killed by the oom-killer because they are in the same group.
Let's say I have 8GB of memory on a machine. I want to reserve 1.5GB for the DataNode and system utilities and let the Hadoop TaskTracker use 6.5GB of memory. Now I start a job using the streaming API that spawns 4 mappers and 2 reducers (each of which could in theory use 1GB of RAM) and eats more memory than allowed. The cgroup memory limit is hit and the oom-killer starts killing. I would rather use a cgroup for each map and reduce task, e.g. a cgroup that is limited to 1GB of memory.
Is this a real or more of a theoretical problem? Would the oom-killer really kill the Hadoop TaskTracker, or would it start killing the forked processes first? If the latter is true most of the time, my idea would probably work. If not, a bad job would still kill the TaskTracker on all cluster machines and require manual restarts.
Is there anything else to look for when using cgroups?
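Roughly what I have in mind for the per-task groups, as a sketch using the cgroup v1 memory controller (the group name, the 1GB limit, and the wrapper idea are illustrative, not something Hadoop does out of the box):
# Sketch: create a 1GB memory cgroup for a single task attempt and run the
# task inside it; Hadoop would need a wrapper or hook to do this per attempt.
TASK_GROUP=/sys/fs/cgroup/memory/hadoop/task_attempt_001
mkdir -p "$TASK_GROUP"
echo $((1024 * 1024 * 1024)) > "$TASK_GROUP/memory.limit_in_bytes"
# Move the current shell into the group, then exec the real task command
# so the task and all its children inherit the limit.
echo $$ > "$TASK_GROUP/cgroup.procs"
exec ./real_task_command "$@"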

Have you looked at the Hadoop parameters that let you set and cap the heap allocation for the TaskTracker's child processes (tasks)? Also, don't forget to look at the JVM reuse option.
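For example, with a Hadoop 1.x streaming job you can pass the per-task heap and the JVM reuse setting as generic options (a sketch; the jar path and values are illustrative):
# Cap each child task JVM at 1GB and reuse JVMs within the job
# (Hadoop 1.x property names; jar path and values are illustrative).
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -D mapred.child.java.opts=-Xmx1024m \
  -D mapred.job.reuse.jvm.num.tasks=-1 \
  -input /data/input -output /data/output \
  -mapper ./my_mapper.py -reducer ./my_reducer.py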
useful links:
http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
http://developer.yahoo.com/hadoop/tutorial/module7.html
How to avoid OutOfMemoryException when running Hadoop?
http://www.quora.com/Why-does-Hadoop-use-one-JVM-per-task-block

If you have a lot of students and staff accessing the Hadoop cluster for job submission, you can probably look at job scheduling in Hadoop.
Here is the gist of some scheduler types you may be interested in:
Fair scheduler:
The core idea behind the fair share scheduler was to assign resources to jobs such that on average over time, each job gets an equal share of the available resources.
To ensure fairness, each user is assigned to a pool. In this way, if one user submits many jobs, he or she can receive the same share of cluster resources as all other users (independent of the work they have submitted).
Capacity scheduler:
In capacity scheduling, instead of pools, several queues are created, each with a configurable number of map and reduce slots. Each queue is also assigned a guaranteed capacity (where the overall capacity of the cluster is the sum of each queue's capacity). Capacity scheduling was defined for large clusters, which may have multiple, independent consumers and target applications.
Here's the link from which I shamelessly copied the above, due to lack of time.
http://www.ibm.com/developerworks/library/os-hadoop-scheduling/index.html
To configure Hadoop use this link: http://hadoop.apache.org/docs/r1.1.1/fair_scheduler.html#Installation
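As a rough illustration of the setup that installation guide describes, a Fair Scheduler allocation file for separate student and staff pools might look like this (pool names and limits are made up; mapred-site.xml also needs mapred.jobtracker.taskScheduler set to org.apache.hadoop.mapred.FairScheduler and mapred.fairscheduler.allocation.file pointing at this file):
# Sketch of a minimal Hadoop 1.x Fair Scheduler allocation file
# (pool names and limits are illustrative).
cat > /etc/hadoop/conf/fair-scheduler.xml <<'EOF'
<?xml version="1.0"?>
<allocations>
  <pool name="students">
    <maxRunningJobs>10</maxRunningJobs>
  </pool>
  <pool name="staff">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <weight>2.0</weight>
  </pool>
</allocations>
EOF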

Related

Can Slurm run 3 separate computers as one "node"?

I'm an intern who's been tasked with installing Slurm across three compute units running Ubuntu. The way things work now, people SSH into one of the compute units and run their job there, since all three units share memory through NFS mounting. Otherwise they are separate machines.
My issue is that, from what I've read in the documentation, it seems that when installing Slurm I would specify each of these compute units as a completely separate node, and any jobs I'd like to run that use multiple cores would still be limited by how many cores are available on an individual node. My supervisor has told me, however, that the three units should be installed as a single node, and that when a job comes in that needs more cores than are available on a single compute unit, Slurm should just use all the cores. The intention is that we won't change how we execute jobs (like a parallelized R script), just "wrap" them in an sbatch script before sending them to Slurm for scheduling and execution.
So is my supervisor correct that Slurm can run our parallelized scripts unchanged with more cores than are available on a single machine?
Running a script with more threads than available cores is pointless. It does not provide any performance increase, rather the opposite, as more threads have to be managed while the computing power stays the same.
But he is right in the sense that you can wrap your current script and send it to Slurm for execution using a whole node. The three machines will still be three nodes, though. They cannot work as a single node because, well, they are not a single node/machine: they share neither memory nor buses nor peripherals; they just share some disk over the network.
You say that
any jobs I'd like to run that use multiple cores would still be limited by how many cores are available on the individual node
but that's the current situation with SSH. Nothing is lost by using Slurm to manage the resources. In fact, Slurm will take care of giving each job the proper resources and of preventing other users from interfering with your computations.
Your best bet: create a cluster of three nodes as usual and let people send their jobs asking for as many resources as they need, without exceeding what is available on a single node.
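For example, a wrapper for an existing parallelized R script could look roughly like this (the script name and the resource numbers are illustrative; the core count cannot exceed what one machine actually has):
#!/bin/bash
#SBATCH --job-name=r-analysis
#SBATCH --nodes=1                 # a single job stays on a single machine
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16        # illustrative: all cores of one compute unit
#SBATCH --mem=0                   # request all of the node's memory
#SBATCH --time=08:00:00           # illustrative wall-time limit
# The R script itself is unchanged; Slurm only reserves the resources.
Rscript my_parallel_script.R
Submitted with sbatch, the job waits in the queue until one of the three nodes has the requested cores free.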

Does memory configuration really matter with fair scheduler?

We have a Hadoop cluster with the Fair Scheduler configured. We used to see a scenario where, when there were not many jobs to run in the cluster, the running job would try to take as much memory and as many cores as were available.
With the Fair Scheduler, do executor memory and cores really matter for Spark jobs? Or is it up to the Fair Scheduler to decide how much to give?
It's the policy of the Fair Scheduler that the first job assigned to it gets all the available resources.
When a second job is run, all the resources are divided so each job gets (available resources)/(number of jobs).
Now the main thing to focus on is the maximum container memory you have given the job to run with. If it is equal to the total resources available, then it is legitimate for your job to use all of them.
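Concretely, assuming Spark on YARN, the resources a job can grab are still bounded by what you ask for at submit time, e.g. (all values are illustrative):
# The Fair Scheduler shares the cluster between jobs, but these requests
# still cap this particular job's footprint (values are illustrative).
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --driver-memory 2g \
  my_job.py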

Is there a way to set the niceness setting of spark executor processes?

I have a cluster of machines that I have to share with other processes. Let's just say I am not a nice person and want my Spark executor processes to have a higher priority than other people's processes. How can I set that?
I am using standalone mode, v2.0.1, running on RHEL 7.
Spark does not currently (2.4.0) support nice process priorities. Grepping through the codebase, there is no usage of nice, and hence no easy way to set the process priority of executors using out-of-the-box Spark. It would be a little odd for Spark to do this, since it only assumes it can start a JVM, not that the base operating system is UNIX.
There are hacky ways to get around this that I do NOT recommend. For instance, if you are using Mesos as a resource manager, you could set spark.mesos.executor.docker.image to an image where java actually calls nice -1 old-java "$@".
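Another OS-level workaround, again nothing Spark supports itself, would be to renice the executor JVMs after they start. A sketch, assuming standalone-mode executors show up as CoarseGrainedExecutorBackend processes and that you have root (needed for a negative nice value):
# Raise the priority of Spark executor JVMs already running on this box
# (process-name match and nice value are illustrative).
for pid in $(pgrep -f CoarseGrainedExecutorBackend); do
  sudo renice -n -5 -p "$pid"
done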
Allocate all the resources to the Spark application, leaving only the minimal resources needed for the OS to run.
A simple scenario:
Imagine a cluster with six nodes running NodeManagers (YARN mode), each equipped with 16 cores and 64GB of memory. The NodeManager capacities, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should probably be set to 63 * 1024 = 64512 (megabytes) and 15, respectively. We avoid allocating 100% of the resources to YARN containers because the node needs some resources to run the OS and Hadoop daemons. In this case, we leave a gigabyte and a core for these system processes.
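As a rough shell sketch of that arithmetic (node size per the scenario above; the amounts reserved for the system are the illustrative 1GB and 1 core):
# Compute the NodeManager capacities for yarn-site.xml from the node's size.
total_mem_gb=64; total_cores=16
echo "yarn.nodemanager.resource.memory-mb = $(( (total_mem_gb - 1) * 1024 ))"   # 64512
echo "yarn.nodemanager.resource.cpu-vcores = $(( total_cores - 1 ))"            # 15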

What are the benefits of running multiple Spark instances per node (physical machine)?

Is there any advantage to starting more than one spark instance (master or worker) on a particular machine/node?
The Spark standalone documentation doesn't explicitly say anything about starting a cluster or multiple workers on the same node. It does seem to implicitly assume that one worker equals one node.
Their hardware provisioning page says:
Finally, note that the Java VM does not always behave well with more than 200 GB of RAM. If you purchase machines with more RAM than this, you can run multiple worker JVMs per node. In Spark’s standalone mode, you can set the number of workers per node with the SPARK_WORKER_INSTANCES variable in conf/spark-env.sh, and the number of cores per worker with SPARK_WORKER_CORES.
So aside from working with large amounts of memory or testing cluster configuration, is there any benefit to running more than one worker per node?
I think the obvious benefit is to improve the resource utilization of the hardware per box without losing performance. In terms of parallelism, one big executor with multiple cores seems to perform the same as multiple executors with fewer cores.
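For reference, the large-memory case from the quoted docs would be configured in conf/spark-env.sh roughly like this (the machine size and the split are illustrative):
# conf/spark-env.sh -- sketch for a single 256GB / 32-core machine.
# Two workers keep each JVM heap well below the ~200GB the docs warn about
# (all values are illustrative).
SPARK_WORKER_INSTANCES=2
SPARK_WORKER_CORES=16      # cores per worker
SPARK_WORKER_MEMORY=120g   # memory per worker, leaving headroom for the OS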

Presto configuration

As I set up a Presto cluster and try to do some performance tuning, I wonder if there is a more comprehensive configuration guide for Presto, e.g. how can I control how many CPU cores a Presto worker uses? And is it good practice to start multiple Presto workers on a single server (in which case I wouldn't need a dedicated server to run the coordinator)?
Besides, I don't quite understand the task.max-memory argument. Will the Presto worker start multiple tasks for a single query? If so, maybe I can use task.max-memory together with the -Xmx JVM argument to control the level of parallelism?
Thanks in advance.
Presto is a multithreaded Java program and works hard to use all available CPU resources when processing a query (assuming the input table is large enough to warrant such parallelism). You can artificially constrain the amount of CPU resources that Presto uses at the operating system level using cgroups, CPU affinity, etc.
There is no reason or benefit to starting multiple Presto workers on a single machine. You should not do this because they will needlessly compete with each other for resources and likely perform worse than a single process would.
We use a dedicated coordinator in our deployments that have 50+ machines because we found that having the coordinator process queries would slow it down while it performs the query coordination work, which has a negative impact on overall query performance. For small clusters, dedicating a machine to coordination is likely a waste of resources. You'll need to run some experiments with your own cluster setup and workload to determine which way is best for your environment.
You can have a single Presto process act as both a coordinator and worker, which can be useful for tiny clusters or testing purposes. To do so, add this to the etc/config.properties file:
coordinator=true
node-scheduler.include-coordinator=true
Your idea of starting a dedicated coordinator process on a machine shared with a worker process is interesting. For example, on a machine with 16 processors, you could use cgroups or CPU affinity to dedicate 2 cores to the coordinator process and restrict the worker process to 14 cores. We have never tried this, but it could be a good option for small clusters.
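If you try that split, the CPU-affinity part could be as simple as pinning the two JVMs after they start (the core ranges are illustrative, and the PIDs have to be found for your own setup, e.g. from the launcher's pid files):
# Pin a shared coordinator + worker on a 16-core box (sketch).
taskset -p -c 0-1  "$COORDINATOR_PID"   # coordinator gets cores 0-1
taskset -p -c 2-15 "$WORKER_PID"        # worker gets the remaining 14 cores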
A task is a stage in a query plan that runs on a worker (the CLI shows the list of stages while the query is running). For a query like SELECT COUNT(*) FROM t, there will be a task on every worker that performs the table scan and partial aggregation, and another task on a single worker for the final aggregation. More complex queries that have joins, subqueries, etc., can result in multiple tasks on every worker node for a single query.
-Xmx must be higher than task.max-memory, or at least equal to it; otherwise you are likely to see OOM issues, as I have experienced before.
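A quick way to sanity-check that relationship on a worker (file locations follow the standard Presto layout; the parsing assumes both values are whole gigabytes, e.g. -Xmx16G and task.max-memory=8GB):
# Warn if the worker heap (-Xmx in etc/jvm.config) is smaller than
# task.max-memory in etc/config.properties (sketch; simplistic parsing).
heap_gb=$(grep -oE -- '-Xmx[0-9]+' etc/jvm.config | grep -oE '[0-9]+')
task_gb=$(grep -E '^task\.max-memory' etc/config.properties | grep -oE '[0-9]+')
if [ "${heap_gb:-0}" -lt "${task_gb:-0}" ]; then
  echo "WARNING: -Xmx (${heap_gb}G) is smaller than task.max-memory (${task_gb}GB)"
fi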
Also, since Presto 0.113 they have changed the way Presto manages query memory and the corresponding configuration properties.
Please refer to this link:
https://prestodb.io/docs/current/installation/deployment.html
For your question regarding how many CPU cores a Presto worker can use, I think it's controlled by the parameter task.concurrency, which is 16 by default.
