Understanding the child process of a threaded GHC haskell program - multithreading

I'm trying to understand how the parent and various child OS threads work in a Haskell program compiled with GHC -threaded.
Using
module Main where
import Control.Concurrent
main :: IO ()
main = do
    threadDelay 9999999999
Compiling with -threaded on ghc 8.6.5, and running with +RTS -N3 for instance, I can see
$ pstree -p 6615
hello(6615)─┬─{ghc_ticker}(6618)
            ├─{hello:w}(6616)
            ├─{hello:w}(6617)
            ├─{hello:w}(6619)
            ├─{hello:w}(6620)
            ├─{hello:w}(6621)
            ├─{hello:w}(6622)
            └─{hello:w}(6623)
It looks like I get N*2 + 1 of these "hello:w" threads as I vary +RTS -N.
What are these "hello:w" threads, and why are there apparently two per HEC + 1?
And what does ghc_ticker do?
I also noticed on a large real service I'm testing with +RTS -N4 I'm getting e.g. 14 of these "my-service:w" threads, and when under load these process IDs seem to churn (half of them stay alive until I kill the service).
Why 14, and why are half of them spawned and die?
I'd also accept an answer that helped guide me to instrumenting my code to figure out these latter two questions.

The ghc_ticker is spawned at startup; it runs this function. Its purpose is described as
The interval timer is used for profiling and for context switching in
the threaded build.
The other *:w threads are workers, they are created whenever there is more work to do (aka Task), but there are no more spare workers, see here
On startup GHC creates one worker per capability; after that, workers are created as needed and reused when possible. It's hard to say why you have 14 workers in the -N4 case. I can only guess that they are serving IO manager threads: see here. Let's not forget about FFI either: an FFI call may block a worker. You can try putting a breakpoint in createOSThread to see why workers are created.
You can read more about scheduler here
ADDED:
Hmm, I think I can explain the N*2+1 workers: N workers are created at startup, one per capability; N more are IO manager event loops, one per capability; plus one IO manager timer thread. Though I'm not sure why the first N workers (created at startup) were not reused for the IO manager threads.
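As a starting point for instrumenting this from inside the program, here is a minimal sketch that prints the number of capabilities next to the number of OS threads the kernel reports. It assumes Linux (where /proc/self/task has one entry per thread) and the directory package that ships with GHC; countOsThreads is a hypothetical helper name, not part of any library.

```haskell
module Main where

import Control.Concurrent (getNumCapabilities, rtsSupportsBoundThreads, threadDelay)
import System.Directory (listDirectory)

-- | Number of OS threads in this process.
-- Assumption: Linux, where /proc/self/task has one entry per thread.
countOsThreads :: IO Int
countOsThreads = length <$> listDirectory "/proc/self/task"

main :: IO ()
main = do
  caps <- getNumCapabilities
  -- rtsSupportsBoundThreads is True when linked with -threaded.
  putStrLn ("threaded RTS:         " ++ show rtsSupportsBoundThreads)
  putStrLn ("capabilities:         " ++ show caps)
  before <- countOsThreads
  putStrLn ("OS threads now:       " ++ show before)
  -- Sleep briefly so any lazily started RTS threads have a chance to appear.
  threadDelay 1000000
  after <- countOsThreads
  putStrLn ("OS threads after 1 s: " ++ show after)
```

Compiling with -threaded and running with different +RTS -N values should let you compare the counts with what pstree shows; calling countOsThreads periodically in a real service is one way to watch workers being created under load.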

Related

What could delay pthread_join() after threads have exited successfully?

My main thread creates 8 worker threads (on a machine with a 4 core, 8 thread CPU), and then waits for them to complete with pthread_join(). The threads all exit successfully, and the pthread_join() successfully completes. However, I log the times that the threads exit and the time that pthread_join() completes for the last thread; the threads all exit essentially simultaneously (not surprising -- they are servicing a queue of work to be done), and the pthread_join() sometimes takes quite a long time to complete -- I have seen times in excess of 15 minutes after the last worker thread has exited!
More information: The worker threads are all set at the highest allowable round-robin scheduling priority (SCHED_RR); I have tried setting the main thread (waiting on the pthread_join()s) to the same thing and have also tried setting it to the highest SCHED_FIFO priority (where so far I have only seen it take as long as 27 seconds to complete; more testing is needed). My test is very CPU and memory intensive and takes about 90 -- 100 minutes to complete; during that time it is generally using all 8 threads at close to 100% capacity, and fairly quickly gets to where it is using about 90% of the 256 GB of RAM. This is running on a Linux (Fedora) OS at run level 3 (so no graphics or Window Manager -- essentially just a terminal -- because at the usual run level 5, a process using that much memory gets killed by the system).
An earlier version that took closer to 4 hours to complete (I have since made some performance improvements...) and in which I did not bother explicitly setting the priority of the main thread once took over an hour and 20 minutes for the pthread_join() to complete. I mention it because I don't really think that the main thread priority should be much of an issue -- there is essentially nothing else happening on the machine, it is not even on the network.
As I mentioned, all the threads complete with EXIT_SUCCESS. And in lighter weight tests, where the processing is over in seconds, I see no such delay. And so I am left suspecting that this is a scheduler issue. I know very little about the scheduler, but informally the impression I have is that here is this thread that has been waiting on a pthread_join() for well over an hour; perhaps the scheduler eventually shuffles it off to a queue of "very unlikely to require any processing time" tasks, and only checks it rarely.
Okay, eventually it completes. But ultimately, to get my work done, I have to run about 1000 of these, and some are likely to take a great deal longer than the 90 minutes or so that the case I have been testing takes. So I have to worry that the pthread_join() in those cases might delay even longer, and with 1000 iterations, those delays are going to add up to real time...
Thanks in advance for any suggestions.
In response to Nate's excellent questions and suggestions:
I have used top to spy on the process when it is in this state; all I can report is that it is using minimal CPU (maybe an occasional 2%, compared to the usual 700 - 800% that top reports for 8 threads running flat out, modulo some contention for locked resources). I am aware that top has all kinds of options I haven't investigated, and will look into how to run it to display information about the state of the main thread. (I see: I can use the -H option, and look in the S column... will do.) It is definitely not a matter of all the memory being swapped out -- my code is very careful to stay below the limit of physical memory, and does some disk I/O of its own to save and restore information that can't fit in memory. As a result little to no virtual memory is in use at any time.
I don't like my theory about the scheduler either... It's just the best I have been able to come up with so far...
As far as how I am determining when things happen: The exiting code does:
time_t now;
time(&now);
printf("Thread exiting, %s", ctime(&now));
pthread_exit(EXIT_SUCCESS);
and then the main thread does:
for (int i = 0; i < WORKER_THREADS; i++)
{
pthread_join(threads[i], NULL);
}
time(&now);
printf("Last worker thread has exited, %s", ctime(&now));
I like the idea of printing something each time pthread_join() returns, to see if we're waiting for the first thread to complete, the last thread to complete, or one in the middle, and will make that change.
A couple of other potentially relevant facts that have occurred to me since my original posting: I am using the GMP (GNU Multiprecision Arithmetic) library, which I can't really imagine matters; and I am also using a 3rd party (open source) library to create "canonical graphs," and that library, in order to be used in a multithreaded environment, does use some thread_local storage. I will have to dig into the particulars; still, it doesn't seem like cleaning that up should take any appreciable amount of time, especially without also using an appreciable amount of CPU.

What's the relationship between forkOn and the -qm RTS flag?

Suppose that I have a program that only spawns threads using forkOn. In such a scenario, there will be no load balancing of Haskell threads among different capabilities. So is there a difference between executing this program with and without +RTS -qm?
According to the documentation, -qm disables thread migration, which I think has an effect similar to using only forkOn. Am I correct in this assumption? I'm not sure how clear the documentation is in this regard.
I'm no expert on the subject, but I'll give it a shot anyway.
A program compiled with GHC (the Haskell compiler) can have one or more HECs (Haskell Execution Contexts, also known as caps or capabilities). The runtime flag +RTS -N <number> or the setNumCapabilities function defines how many HECs are available to the program. Each HEC is run by one operating system thread at a time. The runtime scheduler distributes lightweight Haskell threads among the HECs.
With the forkOn function, it's possible to select which HEC the thread runs on. getNumCapabilities returns the number of capabilities (HECs).
Thread migration means that Haskell threads can be migrated (moved) to another HEC. The runtime flag +RTS -qm disables this thread migration.
Documentation about forkOn states that
Like forkIO, but lets you specify on which capability the thread should run. Unlike a forkIO thread, a thread created by forkOn will stay on the same capability for its entire lifetime (forkIO threads can migrate between capabilities according to the scheduling policy).
so with forkOn it's possible to select the single HEC the thread runs on.
Compared to forkIO which states that
Foreign calls made by this thread are not guaranteed to be made by any particular OS thread; if you need foreign calls to be made by a particular OS thread, then use forkOS instead.
Now, are the forkOn function and +RTS -qm (disabled thread migration) the same thing? Probably not. With forkOn the user explicitly selects which HEC the Haskell thread runs on (for example, it's possible to put all Haskell threads on the same HEC). With +RTS -qm and forkIO the Haskell threads don't switch between HECs, but there's no way of knowing which HEC a Haskell thread spawned by forkIO ends up on.
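To make the forkOn side of this concrete, here is a small sketch that forks one thread per capability with forkOn and asks each thread where it is via threadCapability, which returns the capability number together with a flag saying whether the thread is locked to it. The name pinnedReport is made up for this illustration.

```haskell
module Main where

import Control.Concurrent
import Control.Monad (forM)

-- | Fork one thread per capability with forkOn and collect, for each thread,
-- (requested capability, capability it reports, locked-to-capability flag).
pinnedReport :: IO [(Int, Int, Bool)]
pinnedReport = do
  n <- getNumCapabilities
  boxes <- forM [0 .. n - 1] $ \i -> do
    box <- newEmptyMVar
    _ <- forkOn i $ do
      tid <- myThreadId
      -- threadCapability :: ThreadId -> IO (Int, Bool)
      (cap, pinned) <- threadCapability tid
      putMVar box (i, cap, pinned)
    pure box
  -- Wait for every forked thread to report back.
  mapM takeMVar boxes

main :: IO ()
main = pinnedReport >>= mapM_ print
```

Replacing forkOn i with forkIO in the same sketch, and running with and without +RTS -qm, is one way to observe the difference discussed above: the flag reflects explicit pinning by forkOn, while -qm only changes the scheduler's behaviour.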
References:
Runtime Support for Multicore Haskell
The GHC scheduler
GHC(STG,Cmm,asm) illustrated

Excessive amount of system calls when using `threadDelay`

I have a couple of Haskell processes running in production on a system with 12 cores. All processes are compiled with -threaded and run with 12 capabilities. One library they all use is resource-pool, which keeps a pool of database connections.
What's interesting is that even though all processes are practically idle they consume around 2% CPU time. Inspecting one of these processes with strace -p $(pgrep processname) -f reveals that the process is doing an unreasonable amount of system calls even though it should not really be doing anything. To put things into perspective:
Running strace on a process with -N2 for 5 seconds produces a 66K log file.
Running it with (unreasonable) -N64 yields a 60 Megabyte log.
So the number of capabilities increases the amount of system calls being issued drastically.
Digging deeper we find that resource-pool is running a reaper thread which fires every second to inspect if it can clean up some resources. We can simulate the same behavior with this trivial program.
module Main where
import Control.Concurrent
import Control.Monad (forever)
main :: IO ()
main = forever $ do
    threadDelay (10 ^ 6)
If I pass -B to the runtime system I get audio feedback whenever a GC is issued, which in this case is every 60 seconds.
When I suppress these GC cycles by passing -I0 to the RTS, running the same strace command on the process yields log files of only around 70K. Since the process is also running a scotty server, GC is triggered when requests come in, so collections seem to happen when I actually need them.
Since we are going to increase the number of Haskell processes on this machine by a large amount over the course of the next year, I was wondering how to keep their idle cost at a reasonable level. Apparently passing -I0 is a rather bad idea (?). Another idea would be to just decrease the number of capabilities from 12 to maybe something like 4. Is there any other way to tweak the RTS so that I can keep the processes from burning too many CPU cycles while idling?
The way GHC's memory management is structured, in order to keep memory usage under control, a 'major GC' is periodically needed, during the running of the program. This is a relatively expensive operation, and it 'stops the world' - the program makes no progress whilst this is occurring.
Obviously, it is undesirable for this to happen at any crucial point of program execution. Therefore by default, whenever a GHC-compiled program goes idle, a major GC is performed. This is usually an unobtrusive way of keeping the garbage level down and overall memory efficiency and performance up, without interrupting program interaction. This is known as 'idle GC'.
However, this can become a problem in scenarios like this one: many concurrent processes, each of them woken frequently, running for a short amount of time, then going back to idle. This is a common scenario for server processes. In this case, when idle GC kicks in, it doesn't obstruct the process it is running in, which has completed its work, but it does steal resources from other processes running on the system. Since the program idles frequently, there is no need to incur the overhead of a major GC on every single idle.
The 'brute force' approach would be to pass the RTS option -I0 to the program, disabling idle GC entirely. This will solve this in the short run, but misses an opportunity to collect garbage. This could allow garbage to accumulate, causing GC to kick in at an inopportune moment.
Partly in response to this question, the flag -Iw was added to the GHC runtime system. This establishes a minimum interval between which idle GCs are allowed to run. For example, -Iw5 will not run idle GC until 5 seconds have elapsed since the last GC, even if the program idles several times. This should solve the problem.
Just keep in mind the caveat in the GHC User's Guide:
This is an experimental feature, please let us know if it causes problems and/or could benefit from further tuning.
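To check what a given setting actually does, one option is to watch the GC counters from inside an idling program. The following is only a sketch: it assumes the program is built with -threaded and -rtsopts and started with +RTS -T so that GHC.Stats is enabled, and gcCounts is a hypothetical helper name.

```haskell
module Main where

import Control.Concurrent (threadDelay)
import Control.Monad (forM_)
import GHC.Stats (gcs, getRTSStats, getRTSStatsEnabled, major_gcs)

-- | (total GCs, major GCs) so far, or Nothing when the program was
-- not started with +RTS -T and statistics are therefore unavailable.
gcCounts :: IO (Maybe (Int, Int))
gcCounts = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then pure Nothing
    else do
      s <- getRTSStats
      pure (Just (fromIntegral (gcs s), fromIntegral (major_gcs s)))

main :: IO ()
main = forM_ [1 :: Int .. 10] $ \tick -> do
  -- Idle for one second, like the reaper loop in the question.
  threadDelay 1000000
  counts <- gcCounts
  putStrLn ("tick " ++ show tick ++ ": " ++ maybe "run with +RTS -T" show counts)
```

Running it once with the default settings, once with -I0 and once with -Iw5 (each together with -T) should show how often major collections happen while the program is otherwise idle.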
Happy Haskelling!

golang thread count misleading

I have written a small application in Go, which starts 4 threads for doing various things, plus one main thread. So in total there are 5 threads. But if I start Activity Monitor and monitor the process, this is what I see
First of all, why 7 threads? And the number is not constant: sometimes it is 5 and other times it is 7. Also, all 4 threads started by the main thread end after doing what they are supposed to. I verify that the threads end by putting a defer statement at the top of each thread. Still, the thread count in Activity Monitor stays at 7.
Does anyone know what is going on here? Are these extra threads started by the Go runtime? Is there a way to find out how many active threads in my program were started by my code and not by the Go runtime?
Yes, they are started by the runtime. For example, http://play.golang.org/p/c0cIngo_sO will print that 4 goroutines are running.
Goroutines aren't threads: 1 OS thread can handle 100s of goroutines. However, if you're doing something heavy or using a blocking system call, the runtime will start a new thread to handle the other goroutines.
I suppose you mean Goroutines when you say threads.
The Go runtime transparently multiplexes lightweight Goroutines onto OS threads. That's also why you don't need to call functions like select()—that's the runtime's job.
If you spawn 7 goroutines and some of them block, the runtime might decide to terminate the idle OS threads. This is why you see fewer threads than goroutines.
I think you are mistaking goroutines for threads.
In your Go program, what you call a thread is actually a goroutine, which is a coroutine implemented by Go's runtime rather than a real thread. Every Go program runs on this runtime, and the runtime uses real threads to execute goroutines. Different goroutines may or may not run on the same thread, and you cannot tell which. You can use runtime.GOMAXPROCS to make use of a multi-core CPU.
The threads you see in the monitor are real threads.

Identifying the current HEC for a function in Haskell

I'm writing a parallel Haskell program using Strategies. It's not doing what it's supposed to do, and I would like to inspect which Haskell Execution Context (HEC) a function is executed in.
Is there a getHEC call or something similar which I could use in my debug output?
You can find out which capability (i.e. HEC) a Haskell thread is running on by calling threadCapability from Control.Concurrent.
If you're running your program with +RTS -N, there will be one OS-level thread (HEC) spawned per core, so the capability number returned by threadCapability will tell you which OS thread your forkIO green thread is running on. If, however, you are explicitly specifying the number of OS threads with +RTS -Nn, where n is some integer other than the number of cores on your system, this will probably be less useful to you.
You might also find ThreadScope to be useful for debugging and visualizing the execution of parallel programs.
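For debug output, threadCapability can be wrapped in a small helper. This is a sketch rather than an existing API: whereAmI is a made-up name, and it only works in IO, so it covers code run via forkIO and friends rather than pure code evaluated by sparks.

```haskell
module Main where

import Control.Concurrent
import Control.Monad (forM_, replicateM_)

-- | Hypothetical debug helper: describe which capability (HEC)
-- the calling Haskell thread is on at this moment.
whereAmI :: String -> IO String
whereAmI label = do
  tid <- myThreadId
  (cap, pinned) <- threadCapability tid
  pure (label ++ ": " ++ show tid ++ " on capability " ++ show cap
        ++ (if pinned then " (pinned)" else ""))

main :: IO ()
main = do
  done <- newEmptyMVar
  forM_ [1 :: Int .. 4] $ \i -> forkIO $ do
    msg <- whereAmI ("worker " ++ show i)
    putMVar done msg
  -- Print from the main thread so the lines do not interleave.
  replicateM_ 4 (takeMVar done >>= putStrLn)
```

Keep in mind that an unpinned thread can migrate, so the reported capability is only a snapshot taken at the time of the call.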
