I'm writing a parallel Haskell program using Strategies. It's not doing what it's supposed to do, and I would like to inspect which Haskell Execution Context (HEC) a function is executed in.
Is there a getHEC call or something similar which I could use in my debug output?
You can find out which capability (i.e. CPU core) a Haskell thread is running on by calling threadCapability from Control.Concurrent.
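For instance, here is a minimal sketch that tags debug output with the current capability:

import Control.Concurrent

main :: IO ()
main = do
  tid <- myThreadId
  (cap, pinned) <- threadCapability tid
  -- 'cap' is the capability (HEC) number; 'pinned' tells you whether the
  -- thread is bound to that capability (e.g. because it was forked with forkOn).
  putStrLn ("running on capability " ++ show cap ++ ", pinned: " ++ show pinned)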
If you're running your program with +RTS -N, there will be one OS-level thread (HEC) spawned per core, so the capability number returned by threadCapability will tell you which OS thread your forkIO green thread is running on. If, however, you are explicitly specifying the number of OS threads with +RTS -Nn, where n is some integer other than the number of cores on your system, this will probably be less useful to you.
You might also find ThreadScope useful for debugging and visualizing the execution of parallel programs (compile with -eventlog and run with +RTS -l to produce the .eventlog file that ThreadScope reads).
I'm trying to understand how the parent and various child OS threads work in a Haskell program compiled with GHC's -threaded flag.
Using
module Main where

import Control.Concurrent

main :: IO ()
main = do
  threadDelay 9999999999
Compiling with -threaded on GHC 8.6.5 and running with +RTS -N3, for instance, I see:
$ pstree -p 6615
hello(6615)─┬─{ghc_ticker}(6618)
├─{hello:w}(6616)
├─{hello:w}(6617)
├─{hello:w}(6619)
├─{hello:w}(6620)
├─{hello:w}(6621)
├─{hello:w}(6622)
└─{hello:w}(6623)
It looks like I get N*2 + 1 of these "hello:w" threads as I vary +RTS -N.
What are these "hello:w" threads, and why are there apparently two per HEC + 1?
And what does ghc_ticker do?
I also noticed that on a large real service I'm testing with +RTS -N4 I get e.g. 14 of these "my-service:w" threads, and under load their thread IDs seem to churn (only half of them stay alive until I kill the service).
Why 14, and why are half of them spawned only to die?
I'd also accept an answer that helped guide me to instrumenting my code to figure out these latter two questions.
The ghc_ticker is spawned at startup; it runs this function. Its purpose is described as
The interval timer is used for profiling and for context switching in
the threaded build.
The other *:w threads are workers; they are created whenever there is more work to do (a Task, in RTS terms) but there are no spare workers left; see here.
On startup GHC creates one worker per capability; after that, workers are created as needed and reused when possible. It's hard to say why you have 14 workers in the -N4 case. I can only guess that they are serving IO manager threads: see here. Don't forget about the FFI either: a (safe) foreign call can block a worker, prompting the RTS to spawn another. You can try putting a breakpoint in createOSThread to see why workers are created. (Compiling with -debug and running with +RTS -Ds will also make the RTS print scheduler events as they happen.)
You can read more about the scheduler here.
ADDED:
Hmm, I think I can explain the N*2+1 workers: one worker per capability (N in total) is created at startup; N more are the IO manager event loops, one per capability; plus one IO manager timer thread. With +RTS -N3 that gives 3 + 3 + 1 = 7, matching the seven hello:w threads in the pstree output above. Though I'm not sure why the first N workers (created at startup) were not reused for the IO manager threads.
The GDB manual states that when using all-stop mode for debugging a multithreaded application, it is not possible to advance every thread in lock-step by exactly one statement. This makes sense since a step in GDB essentially allows all threads to be scheduled by the OS (however the OS decides to do this) until the next statement is reached by the thread for which the step was called.
My question is this: Is it reasonable to assume that the average scheduling behavior of the OS in between GDB steps is comparable to the average scheduling behavior of the OS when not stepping (while still using GDB to keep as many variables constant as possible), or does the stepping muck with the scheduling enough that the advancement of threads is not (on average) the same as without stepping?
If the stepping does affect the behavior, how can I get an accurate representation of multithreaded program flow and program state at discrete points in my program? Will recording and playing back be viable?
Is it reasonable to assume that the average scheduling behavior of the OS in between GDB steps is comparable to the average scheduling behavior of the OS when not stepping
Not really. The "not stepping" average behavior will have threads either running out of their time quanta or blocking on system calls. In the "stepping" case, the threads are unlikely to ever run out of their time quanta (because the time between steps is likely to be very short). So the average behavior is likely to be very different.
how can I get an accurate representation of multithreaded program flow and program state at discrete points in my program?
In general, you shouldn't care about multithreaded program flow. It is impossible to debug multithreaded programs that way.
When doing multithreaded programming, you must care about preserving invariants (every resource that can be accessed by multiple threads is protected against data races, etc.). If you do, your program will just work (TM). If you don't, you are unlikely to find all ways that the program misbehaves anyway.
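To keep with the Haskell flavor of the rest of this page, here is a minimal sketch of that advice: the shared counter is protected by an MVar, so the final result is the same no matter how the OS (or a debugger's stepping) interleaves the threads.

import Control.Concurrent
import Control.Monad (forM_, replicateM_)

main :: IO ()
main = do
  counter <- newMVar (0 :: Int)   -- invariant: every update holds the MVar
  done    <- newEmptyMVar
  forM_ [1 .. 4 :: Int] $ \_ -> forkIO $ do
    replicateM_ 1000 (modifyMVar_ counter (pure . (+ 1)))
    putMVar done ()
  replicateM_ 4 (takeMVar done)
  readMVar counter >>= print      -- always prints 4000, whatever the interleaving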
Suppose that I have a program that only spawn threads using forkOn. In such scenario, there will be no load balancing of Haskell threads among different capabilities. So is there a difference in executing this program with and without +RTS -qm?
According to the documentation, -qm disables thread migration, which I think has a similar effect to using only forkOn. Am I correct in this assumption? I'm not sure how clear the documentation is in this regard.
I'm no expert on the subject, but I'll give it a shot anyway.
GHC (the Haskell compiler) can have one or multiple HECs (Haskell Execution Contexts, also known as caps or capabilities). With the runtime flag +RTS -N<number> or the setNumCapabilities function it's possible to define how many HECs are available to the program. Each HEC is backed by an operating system thread. The runtime scheduler distributes Haskell lightweight threads between HECs.
With the forkOn function, it's possible to select which HEC a thread is run on. getNumCapabilities returns the number of capabilities (HECs).
Thread migration means that Haskell threads can be migrated (moved) to another HEC. The runtime flag +RTS -qm disables this thread migration.
The documentation for forkOn states that
Like forkIO, but lets you specify on which capability the thread should run. Unlike a forkIO thread, a thread created by forkOn will stay on the same capability for its entire lifetime (forkIO threads can migrate between capabilities according to the scheduling policy).
so with forkOn it's possible to pin a thread to one single HEC, as the sketch below shows.
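A minimal sketch: spawn one pinned thread per capability and have each report where it runs (run with +RTS -N to use all cores). Results are sent back over an MVar so the output doesn't interleave.

import Control.Concurrent

main :: IO ()
main = do
  n       <- getNumCapabilities
  results <- newEmptyMVar
  mapM_ (\cap -> forkOn cap $ do
            (c, pinned) <- myThreadId >>= threadCapability
            putMVar results ("capability " ++ show c ++ ", pinned: " ++ show pinned))
        [0 .. n - 1]
  mapM_ (\_ -> takeMVar results >>= putStrLn) [0 .. n - 1]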
Compare this to forkIO, whose documentation states that
Foreign calls made by this thread are not guaranteed to be made by any particular OS thread; if you need foreign calls to be made by a particular OS thread, then use forkOS instead.
Now, are the forkOn function and +RTS -qm (disabled thread migration) the same thing? Probably not. With forkOn the user explicitly selects which HEC each Haskell thread runs on (for example, it's possible to put all Haskell threads on the same HEC). With +RTS -qm and forkIO the Haskell threads don't switch between HECs, but there's no way of knowing which HEC a thread spawned by forkIO ends up on.
References:
Runtime Support for Multicore Haskell
The GHC scheduler
GHC(STG,Cmm,asm) illustrated
When I run simply "matlab", maxNumCompThreads returns 4.
When I run "matlab -singleCompThread", maxNumCompThreads returns 1.
However in both instances, ps uH p <PID> | wc -l (which I picked up from another question on SO to determine the number of threads a process is using) returns 35.
What gives? Can somebody explain to me what the 35 represents, and whether or not I can trust maxNumCompThreads as indicating that MATLAB is only using one thread?
The number of threads used by MATLAB for computation (maxNumCompThreads) is different from the number of threads MATLAB.exe uses to manage its internal functions: the interpreter, memory manager, command line, who knows what else. If you were writing MATLAB, imagine the number of threads required to manage the various ongoing, independent tasks. Perhaps have a look at the Octave or FreeMat code to get an idea.
Many of the threads you see are used by the JVM that MATLAB launches. You could try the flag "-nojvm" to cut things down further; obviously, without the JVM, functionality is very limited. "-singleCompThread" limits only the threads used for numeric computation: MATLAB's intrinsic multithreading as well as threads used by external libraries such as MKL and FFTW.
If I create a thread using forkIO I need to provide a function to run, and I get back an identifier (ThreadId). I can then communicate with this animal via e.g. workloads, MVars, etc. However, to my understanding the created thread is very limited and can only work in a sort of SIMD fashion, where the function provided at thread creation is the instruction. I cannot change the function that I provided when the thread was initiated. I understand that these user threads are eventually mapped by the OS to OS threads.
I would like to know how the Haskell threads and the OS threads do interface. Why can Haskell threads that do completely different things be mapped to one and the same OS thread? Why was there no need to initiate the OS thread with a fixed instruction (as it is needed in forkIO)? How does the scheduler(?) recognize user threads in an application that could possibly be distributed? In other words, why are OS threads so flexible?
Last, is there any way to dump the heap of a selected thread from within the application?
First, let's address one quick misconception:
I understand that these user threads are eventually mapped by the OS to OS threads.
Actually, the Haskell runtime is in charge of choosing which Haskell thread a particular OS thread from its pool is executing.
Now the questions, one at a time.
Why can Haskell threads that do completely different things be mapped to one and the same OS thread?
Ignoring FFI for the moment, all OS threads are actually running the Haskell runtime, which keeps track of a list of ready Haskell threads. The runtime chooses a Haskell thread to execute, and jumps into the code, executing until the thread yields control back to the runtime. At that moment, the runtime has a chance to continue executing the same thread or pick a different one.
In short: many Haskell threads can be mapped to a single OS thread because in reality that OS thread is doing only one thing, namely, running the Haskell runtime.
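You can observe this many-to-one mapping yourself. A small sketch, to be run with e.g. +RTS -N2 so that eight Haskell threads share two capabilities:

import Control.Concurrent
import Control.Monad (forM_, replicateM_)

main :: IO ()
main = do
  done <- newEmptyMVar
  forM_ [1 .. 8 :: Int] $ \i -> forkIO $ do
    (cap, _) <- myThreadId >>= threadCapability
    putMVar done ("Haskell thread " ++ show i ++ " ran on capability " ++ show cap)
  replicateM_ 8 (takeMVar done >>= putStrLn)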
Why was there no need to initiate the OS thread with a fixed instruction (as it is needed in forkIO)?
I don't understand this question (and I think it stems from a second misconception). You start OS threads with a fixed instruction in exactly the same sense that you start Haskell threads with a fixed instruction: for each thing, you just give a chunk of code to execute and that's what it does.
How does the scheduler(?) recognize user threads in an application that could possibly be distributed?
"Distributed" is a dangerous word: usually, it refers to spreading code across multiple machines (presumably not what you meant here). As for how the Haskell runtime can tell when there's multiple threads, well, that's easy: you tell it when you call forkIO.
In other words, why are OS threads so flexible?
It's not clear to me that OS threads are any more flexible than Haskell threads, so this question is a bit strange.
Last, is there any way to dump the heap of a selected thread from within the application?
I actually don't really know of any tools for dumping the Haskell heap at all, in multithreaded applications or otherwise. You can dump a representation of the part of the heap reachable from a particular object, if you like, using a package like vacuum. I've used vacuum-cairo to visualize these dumps with great success in the past.
For further information, you may enjoy the middle two sections, "Conventions" and "Foreign Imports", from my intro to multithreaded gtk2hs programming, and perhaps also bits of the section on "The Non-Threaded Runtime".
Instead of trying to directly answer your question, I will try to provide a conceptual model for how multi-threaded Haskell programs are implemented. I will ignore many details, and complexities.
Operating systems implement preemptive multithreading using hardware interrupts to allow multiple "threads" of computation to run logically on the same core at the same time.
The threads provided by operating systems tend to be heavyweight. They are well suited to certain types of "multi-threaded" applications, and, on systems like Linux, are fundamentally the same tool that allows multiple programs to run at the same time (a task they excel at).
But these threads are a bit heavyweight for many uses in high-level languages such as Haskell. Essentially, the GHC runtime works as a mini-OS, implementing its own "threads" on top of the OS threads, in the same way an OS implements threads on top of cores.
It is conceptually easy to imagine that a language like Haskell would be implemented in this way. Evaluating Haskell consists of "forcing thunks" where a thunk is a unit of computation that might 1. depend on another value (thunk) and/or 2. create new thunks.
Thus, one can imagine multiple threads each evaluating thunks at the same time. One would construct a queue of thunks to be evaluated. Each thread would pop the top of the queue, and evaluate that thunk until it was completed, then select a new thunk from the queue. The operation par and its ilk can "spark" new computation by adding a thunk to that queue.
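In GHC, this queue of potential work corresponds to the spark pool. A minimal sketch using par and pseq from Control.Parallel (the parallel package):

import Control.Parallel (par, pseq)

-- Spark the evaluation of 'a' (add it to the queue of potential work)
-- while this thread evaluates 'b', then combine the results.
parSum :: [Int] -> [Int] -> Int
parSum xs ys = a `par` (b `pseq` (a + b))
  where
    a = sum xs
    b = sum ys

main :: IO ()
main = print (parSum [1 .. 1000000] [1 .. 1000000])

Compile with -threaded and run with +RTS -N2, and an idle HEC can pick up the spark and turn it into real parallel work.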
Extending this model to IO actions is not particularly hard to imagine either. Instead of each thread simply forcing pure thunks, we imagine the unit of Haskell computation being somewhat more complicated. Pseudo-Haskell for such a runtime:
type Spark = (ThreadId, Action)                -- ThreadId as in Control.Concurrent
data Action = Compute Thunk | Perform IOAction -- force a pure thunk, or run an effect
Note: this is for conceptual understanding only; don't think things are actually implemented this way.
When we run a Spark, we look for exceptions "thrown" to that thread ID. Assuming we have none, execution consists of either forcing a thunk or performing an IO action.
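To make the worker-popping-a-queue idea concrete, here is a toy version in the same spirit as the pseudo-code above (purely illustrative): sparks are modeled as plain IO actions on a Chan, and each worker loops forever taking the next one.

import Control.Concurrent
import Control.Monad (forM_, forever, join)

-- Toy "scheduler": each worker repeatedly pops the next action from a
-- shared queue and performs it. (Real sparks are thunks to be forced,
-- and GHC's scheduler is far more involved.)
worker :: Chan (IO ()) -> IO ()
worker queue = forever (join (readChan queue))

main :: IO ()
main = do
  queue <- newChan
  forM_ [1 .. 2 :: Int] $ \_ -> forkIO (worker queue)
  forM_ [1 .. 5 :: Int] $ \i -> writeChan queue (print i)
  threadDelay 100000  -- crude: give the workers time to drain the queue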
Obviously, my explanation here has been very hand-wavy and has ignored some complexity. For more, the GHC team have written excellent articles such as "Runtime Support for Multicore Haskell" by Marlow et al. You might also want to look at a textbook on operating systems, as they often go into some depth on how to build a scheduler.