Why does Scylla not use cgroup blkio for its I/O scheduler?

Recently I found an article.
And I noticed in the article that the I/O scheduler in Scylla uses a simpler form of traffic control for I/O, one which just takes task_quota, IOPS and I/O bandwidth into account.
To my knowledge, cgroup blkio also uses these three factors for I/O scheduling.
I am confused: what is the difference between the Scylla I/O scheduler and cgroup blkio? Why does Scylla not use cgroup blkio directly?

Linux control groups and the Linux block layer do have a scheduler. The main issue is granularity: in Linux it is process based, which is not good enough for Scylla, a multi-threaded application. Moreover, Scylla has many types of compute and I/O producers; some are latency sensitive (like read and write operations), and some are background operations that can be executed later (like compaction, streaming and repair).
Linux cgroups and blkio cannot distinguish between those; only Scylla's userspace, which tags them, can be the component that schedules and queues them.
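For comparison, here is a minimal sketch of what the kernel-side control looks like with cgroup v1 blkio (the device major:minor, paths and limit are illustrative, not Scylla's actual setup):

# create a blkio cgroup and throttle writes for every process placed in it
mkdir /sys/fs/cgroup/blkio/scylla
echo "8:0 104857600" > /sys/fs/cgroup/blkio/scylla/blkio.throttle.write_bps_device   # ~100 MB/s on /dev/sda
echo "$SCYLLA_PID" > /sys/fs/cgroup/blkio/scylla/cgroup.procs                        # $SCYLLA_PID: pid of the scylla process

The limit applies to the whole process group, which is exactly the granularity problem described above: a single Scylla process mixes latency-sensitive and background I/O, so it has to prioritize between them internally.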
More data on this blog: https://www.scylladb.com/2018/04/19/scylla-i-o-scheduler-3/

Related

memory allocation and scheduling in a Linux machine with NUMA

I'm trying to understand the default approach of the Linux OS, running on a machine with NUMA nodes, to allocating memory, scheduling tasks and load balancing.
Regarding memory allocation, I've read this (from this doc):
"The system will eventually run out
of memory local to nodes running large processes.
In response to these severe problems, memory is, by default, not allocated exclusively on the local node. To utilize all the system’s memory the default strategy is to
stripe the memory. This guarantees equal use of all the
memory of the system"
Is this true? Does the OS try to utilize all the system's memory even if it has to allocate memory from "far" nodes?
Regarding scheduling and load balancing:
"The Linux scheduler is aware of the NUMA topology of the platform–embodied in the “scheduling domains” data structures [see Documentation/scheduler/sched-domains.rst]–and the scheduler attempts to minimize task migration to distant scheduling domains. However, the scheduler does not take a task’s NUMA footprint into account directly. Thus, under sufficient imbalance, tasks can migrate between nodes, remote from their initial node and kernel data structures."
So the Linux scheduler will try to run the task on cores from the same node first, and will only migrate it if those are unavailable or the memory consumption causes an imbalance?
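A hedged way to explore this yourself, assuming the numactl package is installed (these are standard numactl flags; ./my_app is a placeholder for whatever workload you want to test):

numactl --hardware                             # list NUMA nodes, their CPUs and free memory
numactl --show                                 # show the policy the current shell runs under
numactl --interleave=all ./my_app              # stripe allocations round-robin across all nodes
numactl --cpunodebind=0 --membind=0 ./my_app   # pin the task and its memory to node 0

Comparing the interleaved and node-bound runs is a simple way to see the effect of the default policy.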

Cgroup and Slurm

I know how to use cgroups (allocating memory, CPU usage, ...) and Slurm (submitting, suspending/stopping a job). I would like to know how cgroups work with Slurm. Where can I set the memory or CPU usage when I submit a job to Slurm? I read the documentation from SchedMD (https://slurm.schedmd.com/cgroups.html), but it doesn't give a good explanation; maybe it is a misunderstanding on my part. Can anyone explain how to allocate resources for a job using cgroups in Slurm? Thanks in advance.
cgroup usage in Slurm is configured at the admin level and is transparent to the user. When configured to use cgroups, Slurm creates cgroups for jobs and deletes them automatically, sizing them based on the resources requested at job submission (e.g. --mem, --ntasks, etc.)
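For example, a minimal sketch (the exact behaviour depends on your site's cgroup.conf; ConstrainCores and ConstrainRAMSpace are the usual knobs there):

# cgroup.conf (admin side), typically something like:
#   ConstrainCores=yes
#   ConstrainRAMSpace=yes
# user side: the requested resources become the cgroup limits
sbatch --ntasks=1 --cpus-per-task=2 --mem=4G my_job.sh

With that configuration, Slurm confines the job's processes to a cgroup limited to 2 cores and 4 GB of RAM, and removes the cgroup when the job ends.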

Is there a way to set the niceness setting of spark executor processes?

I have a cluster of machines that I have to share with other processes. Let's just say I am not a nice person and want my Spark executor processes to have a higher priority than other people's processes. How can I set that?
I am using standalone mode, v2.01, running on RHEL 7.
Spark does not currently (2.4.0) support nice process priorities. Grepping through the codebase, there is no usage of nice, and hence no easy way to set process priority on executors with out-of-the-box Spark. It would also be a little odd for Spark to do this, since it only assumes it can start a JVM, not that the underlying operating system is UNIX.
There are hacky ways to get around this that I do NOT recommend. For instance, if you are using Mesos as a resource manager, you could set spark.mesos.executor.docker.image to an image where java actually calls nice -1 old-java "$@".
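The wrapper inside such an image could be as small as the following sketch (old-java is a hypothetical name for the real JVM binary after renaming it; raising priority with a negative nice value requires root or CAP_SYS_NICE):

#!/bin/sh
# installed as "java" inside the image; bumps priority and delegates to the real, renamed binary
exec nice -n -1 /usr/bin/old-java "$@"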
Allocate all the resources to the Spark application, leaving only the minimal resources needed for the OS to run.
A simple scenario:
Imagine a cluster with six nodes running NodeManagers (YARN mode), each equipped with 16 cores and 64GB of memory. The NodeManager capacities, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should probably be set to 63 * 1024 = 64512 (megabytes) and 15 respectively. We avoid allocating 100% of the resources to YARN containers because the node needs some resources to run the OS and Hadoop daemons. In this case, we leave a gigabyte and a core for these system processes.
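A hedged sketch of how those numbers translate into practice (the property names are the standard YARN ones; the spark-submit values are illustrative, not a tuned recommendation):

# yarn-site.xml on every NodeManager, per the scenario above:
#   yarn.nodemanager.resource.memory-mb  = 64512
#   yarn.nodemanager.resource.cpu-vcores = 15
# a submission that stays inside those per-node limits:
spark-submit --master yarn \
  --num-executors 6 \
  --executor-cores 5 \
  --executor-memory 19G \
  my_app.py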

Presto configuration

As I set up a Presto cluster and try to do some performance tuning, I wonder if there's a more comprehensive configuration guide for Presto, e.g. how can I control how many CPU cores a Presto worker can use? And is it good practice to start multiple Presto workers on a single server (in which case I wouldn't need a dedicated server to run the coordinator)?
Besides, I don't quite understand the task.max-memory argument. Will a Presto worker start multiple tasks for a single query? If so, maybe I can use task.max-memory together with the -Xmx JVM argument to control the level of parallelism?
Thanks in advance.
Presto is a multithreaded Java program and works hard to use all available CPU resources when processing a query (assuming the input table is large enough to warrant such parallelism). You can artificially constrain the amount of CPU resources that Presto uses at the operating system level using cgroups, CPU affinity, etc.
There is no reason or benefit to starting multiple Presto workers on a single machine. You should not do this because they will needlessly compete with each other for resources and likely perform worse than a single process would.
We use a dedicated coordinator in our deployments that have 50+ machines because we found that having the coordinator process queries would slow it down while it performs the query coordination work, which has a negative impact on overall query performance. For small clusters, dedicating a machine to coordination is likely a waste of resources. You'll need to run some experiments with your own cluster setup and workload to determine which way is best for your environment.
You can have a single Presto process act as both a coordinator and worker, which can be useful for tiny clusters or testing purposes. To do so, add this to the etc/config.properties file:
coordinator=true
node-scheduler.include-coordinator=true
Your idea of starting a dedicated coordinator process on a machine shared with a worker process is interesting. For example, on a machine with 16 processors, you could use cgroups or CPU affinity to dedicate 2 cores to the coordinator process and restrict the worker process to 14 cores. We have never tried this, but it could be a good option for small clusters.
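A hedged sketch of the CPU-affinity half of that idea, assuming two separate Presto installations on the same 16-core machine (core ranges are illustrative; cgroup cpusets would work equally well):

taskset -c 0,1  coordinator/bin/launcher start   # coordinator pinned to 2 cores
taskset -c 2-15 worker/bin/launcher start        # worker pinned to the other 14 cores

Each installation still needs its own etc/ directory (different node.id, data directory and HTTP port), which is why this is more operational work than the single shared process above.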
A task is a stage in a query plan that runs on a worker (the CLI shows the list of stages while the query is running). For a query like SELECT COUNT(*) FROM t, there will be a task on every worker that performs the table scan and partial aggregation, and another task on a single worker for the final aggregation. More complex queries that have joins, subqueries, etc., can result in multiple tasks on every worker node for a single query.
-Xmx must be higher than, or at least equal to, task.max-memory; otherwise you are likely to see OOM issues, as I have experienced before.
Also, since Presto 0.113 the way Presto manages query memory, and the corresponding configuration properties, have changed.
Please refer to this link:
https://prestodb.io/docs/current/installation/deployment.html
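As a rough sketch of how the two relate (the values are illustrative; on 0.113+ the relevant properties are query.max-memory and query.max-memory-per-node rather than task.max-memory):

# etc/jvm.config
-Xmx16G
# etc/config.properties  (the per-node limit must fit well inside -Xmx)
query.max-memory-per-node=8GB
query.max-memory=50GB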
For your question regarding "how many CPU cores a Presto worker can use": I think this is controlled by the task.concurrency parameter, which defaults to 16.
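If you do want to change it, a sketch of the setting (in etc/config.properties on the workers):

task.concurrency=8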

Hadoop: Using cgroups for TaskTracker tasks

Is it possible to configure cgroups or Hadoop in a way that each process that is spawned by the TaskTracker is assigned to a specific cgroup?
I want to enforce memory limits using cgroups. It is possible to assign a cgroup to the TaskTracker, but if jobs wreak havoc the TaskTracker will probably also be killed by the OOM killer, because they are in the same group.
Let's say I have 8GB of memory on a machine. I want to reserve 1.5GB for the DataNode and system utilities and let the Hadoop TaskTracker use 6.5GB of memory. Now I start a job using the streaming API that spawns 4 mappers and 2 reducers (each of which could in theory use 1GB of RAM) and eats more memory than allowed. Now the cgroup memory limit will be hit and the OOM killer starts killing. I would rather use a cgroup for each map and reduce task, e.g. a cgroup that is limited to 1GB of memory.
Is this a real or more of a theoretical problem? Would the OOM killer really kill the Hadoop TaskTracker, or would it start killing the forked processes first? If the latter is true most of the time, my idea would probably work. If not, a bad job would still kill the TaskTracker on all cluster machines and require manual restarts.
Is there anything else to look for when using cgroups?
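To make the per-task idea above concrete, here is a minimal sketch using cgroup v1 directly (paths and the 1 GB limit are illustrative, and you would need one such group per spawned task):

mkdir /sys/fs/cgroup/memory/task42
echo $((1024*1024*1024)) > /sys/fs/cgroup/memory/task42/memory.limit_in_bytes
echo "$TASK_PID" > /sys/fs/cgroup/memory/task42/cgroup.procs   # $TASK_PID: pid of the forked map/reduce task
# if this task exceeds 1 GB, the OOM killer picks a victim inside this group only,
# leaving the TaskTracker (in its own group) untouched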
Have you looked at the Hadoop parameters that let you set and cap the heap allocation for the TaskTracker's child processes (tasks)? Also don't forget to look at the possibility of JVM reuse.
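Concretely, the parameters referred to above are, for the old MR1/TaskTracker generation (names as I recall them, so verify against your Hadoop version):

# mapred-site.xml
#   mapred.child.java.opts         = -Xmx1024m   (heap for each spawned map/reduce task JVM)
#   mapred.job.reuse.jvm.num.tasks = 1           (set >1, or -1 for unlimited, to reuse task JVMs)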
Useful links:
http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
http://developer.yahoo.com/hadoop/tutorial/module7.html
How to avoid OutOfMemoryException when running Hadoop?
http://www.quora.com/Why-does-Hadoop-use-one-JVM-per-task-block
If you have a lot of students and staff accessing the Hadoop cluster for job submission, you can probably look at job scheduling in Hadoop.
Here is the gist of some scheduler types you may be interested in:
Fair scheduler:
The core idea behind the fair share scheduler was to assign resources to jobs such that on average over time, each job gets an equal share of the available resources.
To ensure fairness, each user is assigned to a pool. In this way, if one user submits many jobs, he or she can receive the same share of cluster resources as all other users (independent of the work they have submitted).
Capacity scheduler:
In capacity scheduling, instead of pools, several queues are created, each with a configurable number of map and reduce slots. Each queue is also assigned a guaranteed capacity (where the overall capacity of the cluster is the sum of each queue's capacity). Capacity scheduling was defined for large clusters, which may have multiple, independent consumers and target applications.
Here's the link from which I shamelessly copied the above, due to lack of time.
http://www.ibm.com/developerworks/library/os-hadoop-scheduling/index.html
To configure Hadoop, use this link: http://hadoop.apache.org/docs/r1.1.1/fair_scheduler.html#Installation
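For the Fair Scheduler on that generation of Hadoop, enabling it is roughly this (a sketch; see the installation link above for the authoritative steps and the allocation-file format):

# mapred-site.xml
#   mapred.jobtracker.taskScheduler      = org.apache.hadoop.mapred.FairScheduler
#   mapred.fairscheduler.allocation.file = /path/to/allocations.xml   (per-pool / per-user limits)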

Resources