I found several documents in stackoverflow with complex explanation mostly i don't understand. In simple term what is difference Process vs thread prioritization?
TL;DR - A simple answer is that there is no difference between a cgroup process and thread, but it is complex for multi-core systems due to migration.
The mechanics to set the thread priority versus the process priority is different from user space. Increase thread priority. The accounting and scheduling are generally the same after the priority is set.
As a goal, the scheduling of process versus thread is equal. Ie, there is no difference according to the goals.
In a simplifying move, Linux turns process scheduling into thread scheduling by treating a scheduled process as if it were single-threaded. If a process is multi-threaded with N threads, then N scheduling actions would be required to cover the threads. Threads within a multi-threaded process remain related in that they share resources such as memory address space. Linux threads are sometimes described as lightweight processes, with the lightweight underscoring the sharing of resources among the threads within a process. opensource.com
...but the hardware complicates this. Technologies like Hyperthreading will make it more efficient to keep threads on the same core. The layout of L1, L2, TLB, etc shared among cores on hardware prefers that threads are kept on the same CPU islands. This negates cache flushing and other effects. cgroups was implemented to keep related processes together. For instance auto-cgroup launched from the same TTY will share similar shared libraries and I/O and pipe from one process to another will benefit from data being in the cache when we move from the generating process to the consuming process.
Because threads have tighter coupling it is far less likely that they will migrate between cores independantly. Processes that do not share a cgroup are more likely to migrate to an idle core, so it is possible that they may end up getting more CPU, due to the asymmetry of work performed on different cores due to access to hardware resources. The configuration of 'server' versus 'desktop', or scheduling latency can also have an effect. It is possible that power management might also come into play. Ie, Does the code keep a core sleeping or wake it for process migration? And of course, the particular kernel version will have an effect as well as these noted high level configuration options.
References
A decade of wasted cores
Linux fair scheduling
Top-tier memory
Related
I'm running chains of programs, many of which like to make their own decisions about how many cores or threads to use and I have some control over how data is partitioned.
I was hoping this would be a fire and forget situation... as in the operating system would just put thread and process spawning on hold until enough resources freed up... but alas, instead a lot of competition for resources ensued.
Are there any operating systems or OS settings (Linux in particular) that are equipped to deal with an explosion in processes/threads and avoid thrashing?
Are there any guidelines on how to parallelize a workflow that is embarrassingly parallel across many steps and many levels? Are there any tools that help devise a strategy based on a scheduling paradigm?
Are there any operating systems or OS settings (Linux in particular) that are equipped to deal with an explosion in processes/threads and avoid thrashing?
Threads/Processes are OS resources and like nearly all OS resources, they are expensive. This is especially true for processes since a context switch from one process to another has a pretty big overhead (eg. TLB flush and possibly a direct/delayed cache flush) and they generally operate on different part of the memory.
Using many threads in one process is generally not much a problem as long as they are not all ready to be scheduled at the same time. If so, the scheduler needs to map them on available cores and this scheduling is a quite expensive. In fact, Scheduling problems are generally NP-complete though heuristics are used in practice. The scheduler needs to take into account many parameter such as IOs, locks/wait, locality, affinity, fairness, priorities, etc. Additionally, each thread has its own stack (generally few MiB) so the number of threads needs not to be too big so not to take too much memory. Contexts switches from one thread to another should still cause some cache issue due to the stack to be in different location in memory and they can be quickly flushed. Thrashing tends to happens more frequently if threads operates on different datasets rather than operating on the same problem and benefit from shared memory through synchronization can be expensive too so the granularity need to be carefully tuned.
Note that you can tune the scheduler on Linux (typically the IO scheduler) but while some scheduler may behave better than others for your target application none are perfect. Application-level scheduling tends to be much more effective in practice.
Are there any guidelines on how to parallelize a workflow that is embarrassingly parallel across many steps and many levels? Are there any tools that help devise a strategy based on a scheduling paradigm?
This is hard to help you without more information, but you can schedule the work yourself on a pool of worker threads (typically the number of physical or logical cores). You can use green-threads (like fibers) or tasks for that. Task scheduling is good for many reasons: you can specify dependencies between tasks, switching from one task to another is usually cheaper than fiber context switch, the stack can be reused for many tasks (and be kept in the cache), you can tune the scheduling of the tasks based on your target application. That being said, task scheduling is good only if tasks do not wait for each other: they need to be split in multiple tasks in this case (ie. continuation). This is not always possible nor simple (eg. call to external libraries). Fibers are better in this specific case (but they have also some issues).
There are a lot of articles that discuss multi-core myth. That, in order to really benefit from multiple cores, one needs to write parallel algorithms. Many of them mention Amdahl's law.
Lets assume for simplicity that we have a desktop computer with a 4-core commodity CPU. And assume that the goal is to improve our application performance, as well as overall system performance.
I wonder how CPU cores are used to perform tasks.
Whether threads from a single process are allocated all cores
Or threads from different processes are scheduled to run on different cores.
If the latter is the case, then why is the myth even discussed? Won't multitasking OSes always benefit from multi-core CPUs, even if all the processes are single threaded? Are threads from the same process more likely to be scheduled at the same time on multiple cores?
What are some factors that matter? CPU cache maybe? Some application related maybe? Why?
Why would you ever want to use parallel libraries/algorithms? After all, CPU resources are shared between all running processes and there are always enough of them.
Is there an "active process" notion? i.e. process that gets most attention from the scheduler. If so, then how much more attention does this process usually get?
Whether threads from a single process are allocated all cores
Yes.
Or threads from different processes are scheduled to run on different cores.
Yes, that too.
If the latter is the case, then why is the myth even discussed? Won't multitasking OSes always benefit from multi-core CPUs, even if all the processes are single threaded?
To some extent, yes. But if that process is doing a lot of computation and the only one we care about at some particular time, the benefit will be pretty low.
On the other hand, it also means the process won't be as likely to be interrupted just because the OS has to do something like handle a disk interrupt, an arriving network packet, or something like that. Interrupting a process to handle some hardware task not only reduce the CPU time the process gets but it also pollutes the CPU caches causing the process to run more slowly when it resumes. So multi-core CPUs can allow a single-threaded process to command a core for a higher percentage of the time and in longer bursts.
Are threads from the same process more likely to be scheduled at the same time on multiple cores?
Typically no. Why would you want to do that? That would tend to degrade overall system performance as threads from the same process are more likely to step on each other's toes. You want the system to get other process' work done efficiently so you get the CPU back.
Is there an "active process" notion?
To some extent. Windows has precisely such a notion -- a "foreground process". Most OSes don't. But they do have a "dynamic priority boost" feature. Basically, if a process is sitting around doing nothing and then needs to do something, it is given some priority as a "reward". This allows a process that sits around waiting for work to be done to get its work done quickly and makes the system feel more interactive and responsive. It often makes little sense on servers, but it's helpful on desktops. Whether this is implemented on threads individually or on all the threads of a process as a group is implementation specific.
If you run separate processes or threads that doesn't needs to interact each others then it will be far better having 4 cores rather then having just 1.
As soon as the processes or threads needs to share some data, you will get the overhead to serialize the access to the shared data.
A lot depends on how good an application is written to run on a multi-core CPU. It may happen in the worst case that trying to run an application on a 4-core CPU is slower than running it on a single core CPU; more likely the increase in performance would be far less than 100%.
Suppose I have a multi-threaded application (say ~40 threads) running on a multiprocessor system (say 8 cores) with Linux as the operating system where different threads are more essentially LWP (Light Weight Processes) being scheduled by the kernel.
What would be benefits/drawbacks of using the CPU affinity? Whether CPU affinity is going to help by localizing the threads to a subset of cores thus minimizing cache sharing/misses?
If you use strict affinity, then a particular thread MUST run on that processor (or set of processors). If you have many threads that work completely independently, and they work on larger chunks of memory than a few kilobytes, then it's unlikely you'll benefit much from running on one particular core - since it's quite possible the other threads running on this particular CPU would have thrown out any L1 cache, and quite possibly L2 caches too. Which is more important for performance - cahce content or "getting to run sooner"? Are some CPU's always idle, or is the CPU load 100% on every core?
However, only you know (until you tell us) what your threads are doing. How big is the "working set" (how much memory - code and data) are they touching each time they get to run? How long does each thread run when they are running? What is the interaction with other threads? Are other threads using shared data with "this" thread? How much and what is the pattern of sharing?
Finally, the ultimate answer is "What makes it run faster?" - an answer you can only find by having good (realistic) benchmarks and trying the different possible options. Even if you give us every single line of code, running time measurements for each thread, etc, etc, we could only make more or less sophisticated guesses - until these have been tried and tested (with VARYING usage patterns), it's almost impossible to know.
In general, I'd suggest that having many threads either suggest that each thread isn't very busy (CPU-wise), or you are "doing it wrong"... More threads aren't better if they are all running flat out - better to have fewer threads in that case, because they are just going to fight each other.
The scheduler already tries to keep threads on the same cores, and to avoid migrations. This suggests that there's probably not a lot of mileage in managing thread affinity manually, unless:
you can demonstrate that for some reason the kernel is doing a bad a job for your particular application; or
there's some specific knowledge about your application that you can exploit to good effect.
localizing the threads to a subset of cores thus minimizing cache
sharing/misses
Not necessarily, you have to consider cache coherence too, if two or more threads access a shared memory buffer and each one is bound to a different CPU core their caches have to be synchronized if one thread writes to a shared cache line there will be a significant overhead to invalidate other caches.
Erlang is known for being able to support MANY lightweight processes; it can do this because these are not processes in the traditional sense, or even threads like in P-threads, but threads entirely in user space.
This is well and good (fantastic actually). But how then are Erlang threads executed in parallel in a multicore/multiprocessor environment? Surely they have to somehow be mapped to kernel threads in order to be executed on separate cores?
Assuming that that's the case, how is this done? Are many lightweight processes mapped to a single kernel thread?
Or is there another way around this problem?
Answer depends on the VM which is used:
1) non-SMP: There is one scheduler (OS thread), which executes all Erlang processes, taken from the pool of runnable processes (i.e. those who are not blocked by e.g. receive)
2) SMP: There are K schedulers (OS threads, K is usually a number of CPU cores), which executes Erlang processes from the shared process queue. It is a simple FIFO queue (with locks to allow simultaneous access from multiple OS threads).
3) SMP in R13B and newer: There will be K schedulers (as before) which executes Erlang processes from multiple process queues. Each scheduler has it's own queue, so process migration logic from one scheduler to another will be added. This solution will improve performance by avoiding excessive locking in shared process queue.
For more information see this document prepared by Kenneth Lundin, Ericsson AB, for Erlang User Conference, Stockholm, November 13, 2008.
I want to ammend previous answers.
Erlang, or rather the Erlang runtime system (erts), defaults the number of schedulers (OS threads) and the number of runqueues to number of processing elements on your platform. That is processors cores or hardware threads. You can change these settings in runtime using:
erlang:system_flag(schedulers_online, NP) -> PrevNP
The Erlang processes does not have any affinity to any schedulers yet. The logic balancing the processes between the schedulers follows two rules. 1) A starving scheduler will steal work from another scheduler. 2) Migration paths are setup to push processes from schedulers with lots of processes to schedulers with less work. This is done to assure fairness in reduction count (execution time) for each process.
Schedulers however can be locked to specific processing elements. This not done by default. To let erts do the scheduler->core affinity use:
erlang:system_flag(scheduler_bind_type, default_bind) -> PrevBind
Several other bind types can be found in the documentation. Using affinity can greatly improve performance in heavy load situations! Especially in high lock contention situations. Also, the linux kernel cannot handle hyperthreads to say the least. If you have hyperthreads on your platform you should really use this feature in erlang.
I'm purely guessing here, but I'd imagine that there's a small number of threads, which pick processes from a common process pool for execution. Once a process hits a blocking operation, the thread executing it puts it aside and picks another. When a process being executed causes another process to become unblocked, that newly unblocked process gets placed into the pool. I suppose a thread might also stop execution of a process even when it's not blocked at certain points to serve other processes.
I would like to add some input to what was described in the accepted answer.
Erlang Scheduler is the essential part of the Erlang Runtime System and provides its own abstraction and implementation of the conception of lightweight processes atop the OS threads.
Each Scheduler runs within a single OS thread. Normally, there are as many schedulers as CPU (cores) are on he hardware (it is configurable though and naturally does not bring much value when number of schedulers exceeds those of hardware cores). The system might also be configured that scheduler will not jump between OS threads.
Now, when the Erlang process is being created it is entirely the responsibility of the ERTS and Scheduler to manage life cycle and resources consumption as well as its memory footprint etc.
One of the core implementation details is that each process has a time budget of 2000 reductions available when the Scheduler picks up that process from the run queue. Each progress in the system (even I/O) is guaranteed to have a reductions budget. That is what actually makes ERTS a system with preemptive multitasking.
I would recommend a great blog post on that topic by Jesper Louis Andersen http://jlouisramblings.blogspot.com/2013/01/how-erlang-does-scheduling.html
As the short answer: Erlang processes are not OS threads and do not map to them directly. Erlang Schedulers are what runs on the OS threads and provide smart implementation of more finely grained Erlang processes hiding those details behind programmer's eyes.
Erlang is known for being able to support MANY lightweight processes; it can do this because these are not processes in the traditional sense, or even threads like in P-threads, but threads entirely in user space.
This is well and good (fantastic actually). But how then are Erlang threads executed in parallel in a multicore/multiprocessor environment? Surely they have to somehow be mapped to kernel threads in order to be executed on separate cores?
Assuming that that's the case, how is this done? Are many lightweight processes mapped to a single kernel thread?
Or is there another way around this problem?
Answer depends on the VM which is used:
1) non-SMP: There is one scheduler (OS thread), which executes all Erlang processes, taken from the pool of runnable processes (i.e. those who are not blocked by e.g. receive)
2) SMP: There are K schedulers (OS threads, K is usually a number of CPU cores), which executes Erlang processes from the shared process queue. It is a simple FIFO queue (with locks to allow simultaneous access from multiple OS threads).
3) SMP in R13B and newer: There will be K schedulers (as before) which executes Erlang processes from multiple process queues. Each scheduler has it's own queue, so process migration logic from one scheduler to another will be added. This solution will improve performance by avoiding excessive locking in shared process queue.
For more information see this document prepared by Kenneth Lundin, Ericsson AB, for Erlang User Conference, Stockholm, November 13, 2008.
I want to ammend previous answers.
Erlang, or rather the Erlang runtime system (erts), defaults the number of schedulers (OS threads) and the number of runqueues to number of processing elements on your platform. That is processors cores or hardware threads. You can change these settings in runtime using:
erlang:system_flag(schedulers_online, NP) -> PrevNP
The Erlang processes does not have any affinity to any schedulers yet. The logic balancing the processes between the schedulers follows two rules. 1) A starving scheduler will steal work from another scheduler. 2) Migration paths are setup to push processes from schedulers with lots of processes to schedulers with less work. This is done to assure fairness in reduction count (execution time) for each process.
Schedulers however can be locked to specific processing elements. This not done by default. To let erts do the scheduler->core affinity use:
erlang:system_flag(scheduler_bind_type, default_bind) -> PrevBind
Several other bind types can be found in the documentation. Using affinity can greatly improve performance in heavy load situations! Especially in high lock contention situations. Also, the linux kernel cannot handle hyperthreads to say the least. If you have hyperthreads on your platform you should really use this feature in erlang.
I'm purely guessing here, but I'd imagine that there's a small number of threads, which pick processes from a common process pool for execution. Once a process hits a blocking operation, the thread executing it puts it aside and picks another. When a process being executed causes another process to become unblocked, that newly unblocked process gets placed into the pool. I suppose a thread might also stop execution of a process even when it's not blocked at certain points to serve other processes.
I would like to add some input to what was described in the accepted answer.
Erlang Scheduler is the essential part of the Erlang Runtime System and provides its own abstraction and implementation of the conception of lightweight processes atop the OS threads.
Each Scheduler runs within a single OS thread. Normally, there are as many schedulers as CPU (cores) are on he hardware (it is configurable though and naturally does not bring much value when number of schedulers exceeds those of hardware cores). The system might also be configured that scheduler will not jump between OS threads.
Now, when the Erlang process is being created it is entirely the responsibility of the ERTS and Scheduler to manage life cycle and resources consumption as well as its memory footprint etc.
One of the core implementation details is that each process has a time budget of 2000 reductions available when the Scheduler picks up that process from the run queue. Each progress in the system (even I/O) is guaranteed to have a reductions budget. That is what actually makes ERTS a system with preemptive multitasking.
I would recommend a great blog post on that topic by Jesper Louis Andersen http://jlouisramblings.blogspot.com/2013/01/how-erlang-does-scheduling.html
As the short answer: Erlang processes are not OS threads and do not map to them directly. Erlang Schedulers are what runs on the OS threads and provide smart implementation of more finely grained Erlang processes hiding those details behind programmer's eyes.