What's the relationship between forkOn and the -qm RTS flag? - multithreading

Suppose that I have a program that only spawn threads using forkOn. In such scenario, there will be no load balancing of Haskell threads among different capabilities. So is there a difference in executing this program with and without +RTS -qm?
According to the documentation, -qm disables the thread migration, which I think it has a similar effect of using only forkOn. Am I correct in this assumption? I'm sure not how clear the documentation is in this regard.

I'm no expert on the subject, but I'll give it a shot anyway.
GHC (The Haskell compiler) can have one or multiple HECs (Haskell Execution Context, also known as cap or capability). With runtime flag +RTS -N <number> or setNumCapabilities function it's possible to define how many those HECs are available for program. One HEC is one operating system thread. The runtime scheduler distributes Haskell lightweight threads between HECs.
With forkOn function, it's possible to select which HEC the thread is ran on. getNumCapabilities returns the number of capabilities (HECs).
Thread migration means that Haskell threads can be migrated (moved) to another HEC. The runtime flag +RTS -qm disables this thread migration.
Documentation about forkOn states that
Like forkIO, but lets you specify on which capability the thread should run. Unlike a forkIO thread, a thread created by forkOn will stay on the same capability for its entire lifetime (forkIO threads can migrate between capabilities according to the scheduling policy).
so with forkOn it's possible to select one single HEC the thread is ran in.
Compared to forkIO which states that
Foreign calls made by this thread are not guaranteed to be made by any particular OS thread; if you need foreign calls to be made by a particular OS thread, then use forkOS instead.
Now, are forkOn function and +RTS -qm (disabled thread migration) the same thing? Probably not. With forkOn user explicitly selects which HEC the Haskell thread is ran on (for example it's possible to put all Haskell threads into same HEC). With +RTS -qm and forkIO the Haskell threads don't switch between HECs, but there's no way knowing which HEC the Haskell thread spawned by forkIO ends in.
References:
Runtime Support for Multicore Haskell
The GHC scheduler
GHC(STG,Cmm,asm) illustrated

Related

How is preemptive scheduling implemented for user-level threads in Linux?

With user-level threads there are N user-level threads running on top of a single kernel thread. This is in contrast to pthreads where only one user thread runs on a kernel thread.
The N user-level threads are preemptively scheduled on the single kernel thread. But what are the details of how that is done.
I heard something that suggested that the threading library sets things up so that a signal is sent by the kernel and that is the mechanism to yank execution from an individual user-level thread to a signal handler that can then do the preemptive scheduling.
But what are the details of how state such as registers and thread structs are saved and/or mutated to make this all work? Is there maybe a very simple of user-level threads that is useful for learning the details?
To get the details right, use the source! But this is what I remember from when I read it...
There are two ways user-level threads can be scheduled: voluntarily and preemptively.
Voluntary scheduling: threads must call a function periodically to pass the use of the CPU to another thread. This function is called yield() or schedule() or something like that.
Preemptive scheduling: the library forcefully removes the CPU from one thread and passes it to another. This is usually done with timer signals, such as SIGALARM (see man ualarm for the details).
About how to do the real switch, if your OS is friendly and provides the necessary functions, that is easy. In Linux you have the makecontext() / swapcontext() functions that make swapping from one task to another easy. Again, see the man pages for details.
Unfortunately, these functions are removed from POSIX, so other UNIX may not have them. If that's the case, there are other tricks that can be done. The most popular was the one calling sigaltstack() to set up an alternate stack for managing the signals, then kill() itself to get to the alternate stack, and longjmp() from the signal function to the actual user-mode-thread you want to run. Clever, uh?
As a side note, in Windows user-mode threads are called fibers and are fully supported also (see the docs of CreateFiber()).
The last resort is using assembler, that can be made to work almost everywhere, but it is totally system specific. The steps to create a UMT would be:
Allocate a stack.
Allocate and initialize a UMT context: a struct to hold the value of the relevant CPU registers.
And to switch from one UMT to another:
Save the current context.
Switch the stack.
Restore the next context in the CPU and jump to the next instruction.
These steps are relatively easy to do in assembler, but quite impossible in plain C without support from any of the tricks cited above.

kernel thread native thread os thread

can any one please tell me. Are all term "kernel thread", "native thread" and "Os thread" represent kernel thread? Or they are different? If they are different what is relationship among all?
There's no real standard for that. Terminology varies depending on context. However I'll try to explain the different kind of threads that I know of (and add fibers just for completeness as I've seen people call them threads).
-- Threading within the kernel
These are most likely what your kernel thread term refers to. They only exist at the kernel level. They allow (a somewhat limited) parallel execution of the kernel code itself.
-- Application threading
These are what the term thread generally means. They are separate threads of parallel execution which may be scheduled on different processors, that share the same address space and are handled as a single process by the operating system.
The POSIX standard defines the properties threads should have in POSIX compliant systems (in fact the libraries and how each library entry is supposed to behave). Windows threading model is extremely similar to the POSIX one and, AFAIK, it's safe to talk of threading in general the way I did: parallel execution that happens within the same process and can be scheduled on different processors.
-- Ancient linux threading
In the early days the linux kernel did not support threading. However it did support creating two different processes that shared the same address space. There was a project (LinuxThreads) that tried to use this to implement some sort of threading abilities.
The problem was, of course, that the kernel would still treat them as separate processes. The result was therefore not POSIX compliant. For example the treatment of signals was problematic (as signals are a process level concept). It was IN THIS VERY SPECIFIC CONTEXT that the term "native" started to become common. It refers to "native" as in "kernel level" support for threading.
With help from the kernel actual support for POSIX compliant threading was finally implemented. Today that's the only kind of threading that really deserves the name. The old way is, in fact, not real threading at all. It's a sharing of the address space by multiple processes, and as such should be referred to. But there was a time when that was referred to as threading (as it was the only thing you could do with Linux).
-- User level and Green threading
This is another context where "native" is often used to contrast to another threading model. Green threads and userl level threads are threads that do happen within the same process, but they are totally handled at userlevel. Green threads are used in virtual machines (especially those that implement pcode execution, as is the case for the java virtual machine), and they are also implemented at library level by many languages (examples: Haskell, Racket, Smalltalk).
These threads do not need to rely on any threading facilities by the kernel (but often do rely on asynchronous I/O). As such they generally cannot schedule on separate processors. In these contexts "native thread" or "OS thread" could be used to refer to the actual kernel scheduled threads in contrast to the green/user level threads.
Note that "cannot be scheduled on separate processors" is only true if they are used alone. In an hybrid system that has both user level/green threads and native/os threads, it may be possible to create exactly one native/os thread for each processor (and on some systems to set the affinity mask so that each only runs on a specific processor) and then effectively assign the userlevel threads to these.
-- Fibers and cooperative multitasking
I have seen some people call these threads. It's improper, the correct name is fibers. They are also a model of parallel execution, but contrary to threads (and processes) they are cooperative. Which means that whenever a fiber is running, the other fibers will not run until the running fiber voluntarily "yields" execution accepting to be suspended and eventually resumed later.

forkIO threads and OS threads

If I create a thread using forkIO I need to provide a function to run and get back an identifier (threadID). I then can communicate with this animal via e.g. the workloads, MVARs etc.. However, to my understanding the created thread is very limited and can only work in sort of a SIMD fashion where the function that was provided for thread creation is the instruction. I cannot change the function that I provided when the thread was initiated. I understand that these user threads are eventually by the OS mapped to OS threads.
I would like to know how the Haskell threads and the OS threads do interface. Why can Haskell threads that do completely different things be mapped to one and the same OS thread? Why was there no need to initiate the OS thread with a fixed instruction (as it is needed in forkIO)? How does the scheduler(?) recognize user threads in an application that could possibly be distributed? In other words, why are OS threads so flexible?
Last, is there any way to dump the heap of a selected thread from within the application?
First, let's address one quick misconception:
I understand that these user threads are eventually by the OS mapped to OS threads.
Actually, the Haskell runtime is in charge of choosing which Haskell thread a particular OS thread from its pool is executing.
Now the questions, one at a time.
Why can Haskell threads that do completely different things be mapped to one and the same OS thread?
Ignoring FFI for the moment, all OS threads are actually running the Haskell runtime, which keeps track of a list of ready Haskell threads. The runtime chooses a Haskell thread to execute, and jumps into the code, executing until the thread yields control back to the runtime. At that moment, the runtime has a chance to continue executing the same thread or pick a different one.
In short: many Haskell threads can be mapped to a single OS thread because in reality that OS thread is doing only one thing, namely, running the Haskell runtime.
Why was there no need to initiate the OS thread with a fixed instruction (as it is needed in forkIO)?
I don't understand this question (and I think it stems from a second misconception). You start OS threads with a fixed instruction in exactly the same sense that you start Haskell threads with a fixed instruction: for each thing, you just give a chunk of code to execute and that's what it does.
How does the scheduler(?) recognize user threads in an application that could possibly be distributed?
"Distributed" is a dangerous word: usually, it refers to spreading code across multiple machines (presumably not what you meant here). As for how the Haskell runtime can tell when there's multiple threads, well, that's easy: you tell it when you call forkIO.
In other words, why are OS threads so flexible?
It's not clear to me that OS threads are any more flexible than Haskell threads, so this question is a bit strange.
Last, is there any way to dump the heap of a selected thread from within the application?
I actually don't really know of any tools for dumping the Haskell heap at all, in multithreaded applications or otherwise. You can dump a representation of the part of the heap reachable from a particular object, if you like, using a package like vacuum. I've used vacuum-cairo to visualize these dumps with great success in the past.
For further information, you may enjoy the middle two sections, "Conventions" and "Foreign Imports", from my intro to multithreaded gtk2hs programming, and perhaps also bits of the section on "The Non-Threaded Runtime".
Instead of trying to directly answer your question, I will try to provide a conceptual model for how multi-threaded Haskell programs are implemented. I will ignore many details, and complexities.
Operating systems implement preemptive multithreading using hardware interrupts to allow multiple "threads" of computation to run logically on the same core at the same time.
The threads provided by operating systems tend to be heavy weight. They are well suited to certain types of "multi-threaded" applications, and, on systems like Linux, are fundamentally the same tool that allows multiple programs to run at the same time (a task they excel at).
But, these threads are bit heavy weight for many uses in high level languages such as Haskell. Essentially, the GHC runtime works as mini-OS, implementing its own "threads" on top of the OS threads, in the same way an OS implements threads on top of cores.
It is conceptually easy to imagine that a language like Haskell would be implemented in this way. Evaluating Haskell consists of "forcing thunks" where a thunk is a unit of computation that might 1. depend on another value (thunk) and/or 2. create new thunks.
Thus, one can imagine multiple threads each evaluating thunks at the same time. One would construct a queue of thunks to be evaluated. Each thread would pop the top of the queue, and evaluate that thunk until it was completed, then select a new thunk from the queue. The operation par and its ilk can "spark" new computation by adding a thunk to that queue.
Extending this model to IO actions is not particularly hard to imagine either. Instead of each simply forcing pure thunk, we imagine the unit of Haskell computation being somewhat more complicated. Psuedo Haskell for such a runtime:
type Spark = (ThreadId,Action)
data Action = Compute Thunk | Perform IOAction
note: this is for conceptual understanding only, don't think things are implemented this way
When we run a Spark, we look for exceptions "thrown" to that thread ID. Assuming we have none, execution consists of either forcing a thunk or performing an IO action.
Obviously, my explanation here has been very hand-wavy, and ignored some complexity. For more, the GHC team have written excellent articles such as "Runtime Support for Multicore Haskell" by Marlow et al. You might also want to look at text book on Operating Systems, as they often go in some depth on how to build a scheduler.

Identifying the current HEC for a function in Haskell

I'm writing a parallel Haskell program using Strategies. It's not doing what it's supposed to do, and I would like to inspect which Haskell Execution Context (HEC) a function is executed in.
Is there a getHEC call or something similar which I could use in my debug output?
You can find out which capability (i.e. CPU core) a Haskell thread is running on by calling threadCapability from Control.Concurrent.
If you're running your program with +RTS -N, there will be one OS-level thread (HEC) spawned per core, so the capability number returned by threadCapability will tell you which OS thread your forkIO green thread is running on. If, however, you are explicitly specifying the number of OS threads with +RTS -Nn, where n is some integer other than the number of cores on your system, this will probably be less useful to you.
You might also find ThreadScope to be useful for debugging and visualizing the execution of parallel programs.

Is it possible to create threads without system calls in Linux x86 GAS assembly?

Whilst learning the "assembler language" (in linux on a x86 architecture using the GNU as assembler), one of the aha moments was the possibility of using system calls. These system calls come in very handy and are sometimes even necessary as your program runs in user-space.
However system calls are rather expensive in terms of performance as they require an interrupt (and of course a system call) which means that a context switch must be made from your current active program in user-space to the system running in kernel-space.
The point I want to make is this: I'm currently implementing a compiler (for a university project) and one of the extra features I wanted to add is the support for multi-threaded code in order to enhance the performance of the compiled program. Because some of the multi-threaded code will be automatically generated by the compiler itself, this will almost guarantee that there will be really tiny bits of multi-threaded code in it as well. In order to gain a performance win, I must be sure that using threads will make this happen.
My fear however is that, in order to use threading, I must make system calls and the necessary interrupts. The tiny little (auto-generated) threads will therefore be highly affected by the time it takes to make these system calls, which could even lead to a performance loss...
my question is therefore twofold (with an extra bonus question underneath it):
Is it possible to write assembler
code which can run multiple threads
simultaneously on multiple cores at
once, without the need of system
calls?
Will I get a performance gain if I have really tiny threads (tiny as in the total execution time of the thread), performance loss, or isn't it worth the effort at all?
My guess is that multithreaded assembler code is not possible without system calls. Even if this is the case, do you have a suggestion (or even better: some real code) for implementing threads as efficient as possible?
The short answer is that you can't. When you write assembly code it runs sequentially (or with branches) on one and only one logical (i.e. hardware) thread. If you want some of the code to execute on another logical thread (whether on the same core, on a different core on the same CPU or even on a different CPU), you need to have the OS set up the other thread's instruction pointer (CS:EIP) to point to the code you want to run. This implies using system calls to get the OS to do what you want.
User threads won't give you the threading support that you want, because they all run on the same hardware thread.
Edit: Incorporating Ira Baxter's answer with Parlanse. If you ensure that your program has a thread running in each logical thread to begin with, then you can build your own scheduler without relying on the OS. Either way, you need a scheduler to handle hopping from one thread to another. Between calls to the scheduler, there are no special assembly instructions to handle multi-threading. The scheduler itself can't rely on any special assembly, but rather on conventions between parts of the scheduler in each thread.
Either way, whether or not you use the OS, you still have to rely on some scheduler to handle cross-thread execution.
"Doctor, doctor, it hurts when I do this". Doctor: "Don't do that".
The short answer is you can do multithreaded programming without
calling expensive OS task management primitives. Simply ignore the OS for thread
scheduling operations. This means you have to write your own thread
scheduler, and simply never pass control back to the OS.
(And you have to be cleverer somehow about your thread overhead
than the pretty smart OS guys).
We chose this approach precisely because windows process/thread/
fiber calls were all too expensive to support computation
grains of a few hundred instructions.
Our PARLANSE programming langauge is a parallel programming language:
See http://www.semdesigns.com/Products/Parlanse/index.html
PARLANSE runs under Windows, offers parallel "grains" as the abstract parallelism
construct, and schedules such grains by a combination of a highly
tuned hand-written scheduler and scheduling code generated by the
PARLANSE compiler that takes into account the context of grain
to minimimze scheduling overhead. For instance, the compiler
ensures that the registers of a grain contain no information at the point
where scheduling (e.g., "wait") might be required, and thus
the scheduler code only has to save the PC and SP. In fact,
quite often the scheduler code doesnt get control at all;
a forked grain simply stores the forking PC and SP,
switches to compiler-preallocated stack and jumps to the grain
code. Completion of the grain will restart the forker.
Normally there's an interlock to synchronize grains, implemented
by the compiler using native LOCK DEC instructions that implement
what amounts to counting semaphores. Applications
can fork logically millions of grains; the scheduler limits
parent grains from generating more work if the work queues
are long enough so more work won't be helpful. The scheduler
implements work-stealing to allow work-starved CPUs to grab
ready grains form neighboring CPU work queues. This has
been implemented to handle up to 32 CPUs; but we're a bit worried
that the x86 vendors may actually swamp use with more than
that in the next few years!
PARLANSE is a mature langauge; we've been using it since 1997,
and have implemented a several-million line parallel application in it.
Implement user-mode threading.
Historically, threading models are generalised as N:M, which is to say N user-mode threads running on M kernel-model threads. Modern useage is 1:1, but it wasn't always like that and it doesn't have to be like that.
You are free to maintain in a single kernel thread an arbitrary number of user-mode threads. It's just that it's your responsibility to switch between them sufficiently often that it all looks concurrent. Your threads are of course co-operative rather than pre-emptive; you basically scatted yield() calls throughout your own code to ensure regular switching occurs.
If you want to gain performance, you'll have to leverage kernel threads. Only the kernel can help you get code running simultaneously on more than one CPU core. Unless your program is I/O bound (or performing other blocking operations), performing user-mode cooperative multithreading (also known as fibers) is not going to gain you any performance. You'll just be performing extra context switches, but the one CPU that your real thread is running will still be running at 100% either way.
System calls have gotten faster. Modern CPUs have support for the sysenter instruction, which is significantly faster than the old int instruction. See also this article for how Linux does system calls in the fastest way possible.
Make sure that the automatically-generated multithreading has the threads run for long enough that you gain performance. Don't try to parallelize short pieces of code, you'll just waste time spawning and joining threads. Also be wary of memory effects (although these are harder to measure and predict) -- if multiple threads are accessing independent data sets, they will run much faster than if they were accessing the same data repeatedly due to the cache coherency problem.
Quite a bit late now, but I was interested in this kind of topic myself.
In fact, there's nothing all that special about threads that specifically requires the kernel to intervene EXCEPT for parallelization/performance.
Obligatory BLUF:
Q1: No. At least initial system calls are necessary to create multiple kernel threads across the various CPU cores/hyper-threads.
Q2: It depends. If you create/destroy threads that perform tiny operations then you're wasting resources (the thread creation process would greatly exceed the time used by the tread before it exits). If you create N threads (where N is ~# of cores/hyper-threads on the system) and re-task them then the answer COULD be yes depending on your implementation.
Q3: You COULD optimize operation if you KNEW ahead of time a precise method of ordering operations. Specifically, you could create what amounts to a ROP-chain (or a forward call chain, but this may actually end up being more complex to implement). This ROP-chain (as executed by a thread) would continuously execute 'ret' instructions (to its own stack) where that stack is continuously prepended (or appended in the case where it rolls over to the beginning). In such a (weird!) model the scheduler keeps a pointer to each thread's 'ROP-chain end' and writes new values to it whereby the code circles through memory executing function code that ultimately results in a ret instruction. Again, this is a weird model, but is intriguing nonetheless.
Onto my 2-cents worth of content.
I recently created what effectively operate as threads in pure assembly by managing various stack regions (created via mmap) and maintaining a dedicated area to store the control/individualization information for the "threads". It is possible, although I didn't design it this way, to create a single large block of memory via mmap that I subdivide into each thread's 'private' area. Thus only a single syscall would be required (although guard pages between would be smart these would require additional syscalls).
This implementation uses only the base kernel thread created when the process spawns and there is only a single usermode thread throughout the entire execution of the program. The program updates its own state and schedules itself via an internal control structure. I/O and such are handled via blocking options when possible (to reduce complexity), but this isn't strictly required. Of course I made use of mutexes and semaphores.
To implement this system (entirely in userspace and also via non-root access if desired) the following were required:
A notion of what threads boil down to:
A stack for stack operations (kinda self explaining and obvious)
A set of instructions to execute (also obvious)
A small block of memory to hold individual register contents
What a scheduler boils down to:
A manager for a series of threads (note that processes never actually execute, just their thread(s) do) in a scheduler-specified ordered list (usually priority).
A thread context switcher:
A MACRO injected into various parts of code (I usually put these at the end of heavy-duty functions) that equates roughly to 'thread yield', which saves the thread's state and loads another thread's state.
So, it is indeed possible to (entirely in assembly and without system calls other than initial mmap and mprotect) to create usermode thread-like constructs in a non-root process.
I only added this answer because you specifically mention x86 assembly and this answer was entirely derived via a self-contained program written entirely in x86 assembly that achieves the goals (minus multi-core capabilities) of minimizing system calls and also minimizes system-side thread overhead.
System calls are not that slow now, with syscall or sysenter instead of int. Still, there will only be an overhead when you create or destroy the threads. Once they are running, there are no system calls. User mode threads will not really help you, since they only run on one core.
First you should learn how to use threads in C (pthreads, POSIX theads). On GNU/Linux you will probably want to use POSIX threads or GLib threads.
Then you can simply call the C from assembly code.
Here are some pointers:
Posix threads: link text
A tutorial where you will learn how to call C functions from assembly: link text
Butenhof's book on POSIX threads link text

Resources