I want to write a program whose main thread forks a new thread for computation and waits on it to finish for a period of time. If the child thread does not finish in the given time, it is timed out and killed. I have the following code for this.
import Control.Concurrent

fibs :: Int -> Int
fibs 0 = 0
fibs 1 = 1
fibs n = fibs (n-1) + fibs (n-2)

main = do
  mvar <- newEmptyMVar
  -- timer thread: signals a timeout after one second
  tid <- forkIO $ do
    threadDelay (1 * 1000 * 1000)
    putMVar mvar Nothing
  -- computation thread: signals whether the answer looked right
  tid' <- forkIO $ do
    if fibs 1234 == 100
      then putStrLn "Incorrect answer" >> putMVar mvar (Just False)
      else putStrLn "Maybe correct answer" >> putMVar mvar (Just True)
  putStrLn "Waiting for result or timeout"
  result <- takeMVar mvar
  killThread tid
  killThread tid'
I compiled the above program with ghc -O2 Test.hs and with ghc -O2 -threaded Test.hs and ran it, but in both cases the program just hangs without printing anything or exiting. If I add a threadDelay (2 * 1000 * 1000) to the computation thread before the if block, the program works as expected and finishes after a second, as the timer thread is able to fill the MVar.
Why is threading not working as I expect?
GHC uses a sort of hybrid of cooperative and preemptive multitasking in its concurrency implementation.
At the Haskell level, it seems preemptive because threads don't need to explicitly yield and can be seemingly interrupted by the runtime at any time. But at the runtime level, threads "yield" whenever they allocate memory. Since almost all Haskell threads are constantly allocating, this usually works pretty well.
However, if a particular calculation can be optimized into non-allocating code, it may become uncooperative at the runtime level and so un-preemptible at the Haskell level. As @Carl pointed out, it's actually the -fomit-yields flag, which is implied by -O2, that allows this to happen:
-fomit-yields
Tells GHC to omit heap checks when no allocation is being performed. While this improves binary sizes by about 5%, it also means that threads run in tight non-allocating loops will not get preempted in a timely fashion. If it is important to always be able to interrupt such threads, you should turn this optimization off. Consider also recompiling all libraries with this optimization turned off, if you need to guarantee interruptibility.
Obviously, in the single-threaded runtime (no -threaded flag), this means that one thread can completely starve out all other threads. Less obviously, the same thing can happen even if you compile with -threaded and use +RTS -N options. The problem is that an uncooperative thread can starve out the runtime scheduler itself. If at some point the uncooperative thread is the only thread currently scheduled to run, it will become uninterruptible, and the scheduler will never be re-run to consider scheduling additional threads, even if they could run on other O/S threads.
If you're just trying to test some stuff, change the signature of fibs to fibs :: Integer -> Integer. Since Integer causes allocation, everything will start working again (with or without -threaded).
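For reference, that one-line change is just:

fibs :: Integer -> Integer
fibs 0 = 0
fibs 1 = 1
fibs n = fibs (n-1) + fibs (n-2)

Since every Integer addition allocates, each recursive call is once again a safe point at which the scheduler can preempt the thread.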
If you run into this problem in real code, the easiest solution, by far, is the one suggested by @Carl: if you need to guarantee interruptibility of threads, the code should be compiled with -fno-omit-yields, which keeps scheduler calls in non-allocating code. As per the documentation, this increases binary sizes; I assume it comes with a small performance penalty, too.
Alternatively, if the computation is already in IO, then explicitly yielding in the optimized loop may be a good approach. For a pure computation, you could convert it to IO and yield, though usually you can find a simple way to introduce an allocation again. In most realistic situations, there will be a way to introduce only a "few" yields or allocations -- enough to make the thread responsive again but not enough to seriously affect performance. (For example, if you have some nested recursive loops, yield or force an allocation in the outermost loop.)
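As a rough sketch of that last approach (fibsIO and the yield interval are my own illustrative choices, not anything prescribed by a library):

import Control.Concurrent (yield)
import Control.Monad (when)

-- Move the computation into IO and yield now and then, so the scheduler
-- gets a chance to run the timer thread even if the arithmetic itself
-- compiles to non-allocating code.
fibsIO :: Int -> IO Int
fibsIO = go
  where
    go 0 = pure 0
    go 1 = pure 1
    go n = do
      when (n `mod` 100 == 0) yield  -- occasional, cheap safe point
      a <- go (n - 1)
      b <- go (n - 2)
      pure (a + b)

main :: IO ()
main = fibsIO 30 >>= print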
Related
I'm trying to understand how the parent and various child OS threads work in a haskell program compiled with GHC -threaded.
Using
module Main where

import Control.Concurrent

main = do
  threadDelay 9999999999
Compiling with -threaded on ghc 8.6.5, and running with +RTS -N3 for instance, I can see
$ pstree -p 6615
hello(6615)─┬─{ghc_ticker}(6618)
            ├─{hello:w}(6616)
            ├─{hello:w}(6617)
            ├─{hello:w}(6619)
            ├─{hello:w}(6620)
            ├─{hello:w}(6621)
            ├─{hello:w}(6622)
            └─{hello:w}(6623)
It looks like I get N*2 + 1 of these "hello:w" threads as I vary +RTS -N.
What are these "hello:w" threads, and why are there apparently two per HEC + 1?
And what does ghc_ticker do?
I also noticed that on a large real service I'm testing with +RTS -N4, I get e.g. 14 of these "my-service:w" threads, and under load these IDs seem to churn (half of them stay alive until I kill the service).
Why 14, and why are half of them spawned and die?
I'd also accept an answer that helped guide me to instrumenting my code to figure out these latter two questions.
The ghc_ticker is spawned at startup; it runs this function. Its purpose is described as
The interval timer is used for profiling and for context switching in
the threaded build.
The other *:w threads are workers; they are created whenever there is more work to do (aka a Task) but there are no spare workers left, see here.
On startup GHC creates one worker per capability; after that they are created as needed and reused when possible. It's hard to say why you have 14 workers in the -N4 case. I can only guess that they are serving the IO manager threads: see here. Let's not forget about the FFI either: an FFI call may block a worker. You can try to put a breakpoint in createOSThread to see why workers are created.
You can read more about the scheduler here.
ADDED:
Hmm, I think I can explain the N*2+1 workers: one worker per capability (N in total) is created at startup; N more run the IO manager event loops, one per capability; plus one IO manager timer thread. Though I'm not sure why the first N workers (created at startup) were not reused for the IO manager threads.
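If you want to instrument this yourself, a low-effort starting point (a sketch; it assumes your GHC ships the debug RTS, which the -debug link flag selects) is to enable the RTS's scheduler tracing and watch the workers being created:

-- Probe.hs: compile with   ghc -threaded -debug Probe.hs
-- and run with             ./Probe +RTS -N4 -Ds
-- The -Ds flag prints scheduler debug output, including task/worker
-- creation, which should show where the extra threads come from.
import Control.Concurrent

main :: IO ()
main = do
  n <- getNumCapabilities
  putStrLn ("capabilities: " ++ show n)
  threadDelay (10 * 1000 * 1000)  -- stay alive long enough to run pstree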
Suppose that I have a program that only spawns threads using forkOn. In such a scenario, there will be no load balancing of Haskell threads among different capabilities. So is there a difference in executing this program with and without +RTS -qm?
According to the documentation, -qm disables thread migration, which I think has an effect similar to using only forkOn. Am I correct in this assumption? I'm not sure how clear the documentation is in this regard.
I'm no expert on the subject, but I'll give it a shot anyway.
GHC (the Haskell compiler) can have one or multiple HECs (Haskell Execution Contexts, also known as caps or capabilities). With the runtime flag +RTS -N<number> or the setNumCapabilities function it's possible to define how many HECs are available to the program. Roughly speaking, one HEC corresponds to one operating system thread. The runtime scheduler distributes Haskell lightweight threads between HECs.
With the forkOn function it's possible to select which HEC a thread runs on; getNumCapabilities returns the number of capabilities (HECs).
Thread migration means that Haskell threads can be migrated (moved) to another HEC. The runtime flag +RTS -qm disables this thread migration.
Documentation about forkOn states that
Like forkIO, but lets you specify on which capability the thread should run. Unlike a forkIO thread, a thread created by forkOn will stay on the same capability for its entire lifetime (forkIO threads can migrate between capabilities according to the scheduling policy).
so with forkOn it's possible to pin a thread to one single HEC.
Compare that with forkIO, whose documentation states that
Foreign calls made by this thread are not guaranteed to be made by any particular OS thread; if you need foreign calls to be made by a particular OS thread, then use forkOS instead.
Now, are the forkOn function and +RTS -qm (disabled thread migration) the same thing? Probably not. With forkOn the user explicitly selects which HEC each Haskell thread runs on (for example, it's possible to put all Haskell threads on the same HEC). With +RTS -qm and forkIO the Haskell threads don't move between HECs, but there's no way of knowing which HEC a thread spawned by forkIO ends up on.
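To make the difference visible, here is a small sketch (run with something like +RTS -N2; threadCapability from Control.Concurrent reports which capability a thread is on and whether it is pinned there):

import Control.Concurrent

main :: IO ()
main = do
  done <- newEmptyMVar
  let report = do
        tid <- myThreadId
        cap <- threadCapability tid  -- (capability number, pinned?)
        print cap
        putMVar done ()
  _ <- forkOn 0 report  -- stays on HEC 0 for its entire lifetime
  _ <- forkOn 1 report  -- stays on HEC 1
  _ <- forkIO report    -- starts on some HEC; may migrate unless -qm is given
  mapM_ (const (takeMVar done)) [1 .. 3 :: Int]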
References:
Runtime Support for Multicore Haskell
The GHC scheduler
GHC(STG,Cmm,asm) illustrated
Many languages that support multi-threading provide an action that allows a thread to offer a context switch to other threads, for example Haskell's yield.
However, the documentation doesn't say what the actual use cases are: when is it appropriate to use these yield functions, and when not?
Recently I've seen one such use case in Improving the performance of Warp again where it turns out that when a network server sends a message, it's worth calling yield before trying to receive data again, because it takes the client some time to process the answer and issue another request.
I'd like to see other examples or guidelines when calling yield brings some benefit.
I'm mainly interested in Haskell, but I don't mind learning about other languages or the concept in general.
Note: This has nothing to do with generators or coroutines, such as yield in Python or Ruby.
GHC's IO manager uses yield to improve performance. The usage can be found on GitHub, but I'll paste it here as well:
step :: EventManager -> IO State
step mgr@EventManager{..} = do
  waitForIO
  state <- readIORef emState
  state `seq` return state
  where
    waitForIO = do
      n1 <- I.poll emBackend Nothing (onFdEvent mgr)
      when (n1 <= 0) $ do
        yield
        n2 <- I.poll emBackend Nothing (onFdEvent mgr)
        when (n2 <= 0) $ do
          _ <- I.poll emBackend (Just Forever) (onFdEvent mgr)
          return ()
A helpful comment explains the usage of yield:
If the [first non-blocking] poll fails to find events, we yield, putting the poll loop thread at the end of the Haskell run queue. When it comes back around, we do one more non-blocking poll, in case we get lucky and have ready events. If that also returns no events, then we do a blocking poll.
So yield is used to minimize the number of blocking polls the EventManager must perform.
GHC only suspends threads at specific safe points (in particular when allocating memory). Quoting The Glasgow Haskell Compiler by Simon Marlow and Simon Peyton-Jones:
A context switch only occurs when the thread is at a safe point, where very little additional state needs to be saved. Because we use accurate GC, the stack of the thread can be moved and expanded or shrunk on demand. Contrast these with OS threads, where every context switch must save the entire processor state, and where stacks are immovable so a large chunk of address space has to be reserved up front for each thread.
[...]
Having said that, the implementation does have one problem that users occasionally run into, especially when running benchmarks. We mentioned above that lightweight threads derive some of their efficiency by only context-switching at "safe points", points in the code that the compiler designates as safe, where the internal state of the virtual machine (stack, heap, registers, etc.) is in a tidy state and garbage collection could take place. In GHC, a safe point is whenever memory is allocated, which in almost all Haskell programs happens regularly enough that the program never executes more than a few tens of instructions without hitting a safe point. However, it is possible in highly optimised code to find loops that run for many iterations without allocating memory. This tends to happen often in benchmarks (e.g., functions like factorial and Fibonacci). It occurs less often in real code, although it does happen. The lack of safe points prevents the scheduler from running, which can have detrimental effects. It is possible to solve this problem, but not without impacting the performance of these loops, and often people care about saving every cycle in their inner loops. This may just be a compromise we have to live with.
Therefore it can happen that a program with a tight loop contains no such safe points and never switches threads. In that case yield is necessary to let other threads run. See this question and this answer.
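A contrived sketch of the failure mode (my own example; whether the loop actually starves the other thread depends on how aggressively your GHC optimizes it):

import Control.Concurrent
import Control.Monad (forever)

-- With -O2 a loop like this can compile to non-allocating code; the
-- explicit yield reinstates a safe point on every iteration so the
-- printer thread below keeps running.
spin :: Int -> IO ()
spin 0 = pure ()
spin n = yield >> spin (n - 1)

main :: IO ()
main = do
  _ <- forkIO (forever (putStrLn "still running" >> threadDelay 500000))
  spin 50000000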
Is there any way in Haskell (using GHC if it matters, for code that needs to run on Linux and Windows) to perform bounded computation? That is, "compute the result of this function if it is feasible to do so, but if the attempt has used more than X CPU cycles, Y stack space or Z heap space, and still not done, stop and return an indication that it was not possible to complete the computation"?
System.Timeout.timeout :: Int -> IO a -> IO (Maybe a)
http://lambda.haskell.org/hp-tmp/docs/2011.2.0.0/ghc-doc/libraries/base-4.3.1.0/System-Timeout.html#v:timeout
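A usage sketch (my own example; note the caveat from the first question above: the timed-out computation must hit scheduler safe points, e.g. by allocating or by being compiled with -fno-omit-yields, or the timeout cannot interrupt it):

import System.Timeout (timeout)

fibs :: Integer -> Integer
fibs 0 = 0
fibs 1 = 1
fibs n = fibs (n-1) + fibs (n-2)

main :: IO ()
main = do
  -- give the computation one second; force the result inside the IO action
  r <- timeout (1 * 1000 * 1000) (return $! fibs 35)
  putStrLn (maybe "timed out" show r)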
Here's a hackish solution you could try: spawn your computation with forkIO, and let the parent thread (or a monitoring thread which has access to the forked thread's ThreadId) periodically poll for any quantity you'd want, and throw an asynchronous exception to the computing thread as necessary (interestingly, that's exactly how timeout works.)
The next question would be whether there's a way to find out how big the heap currently is from within Haskell. Total memory consumption and CPU cycles can be found out by spawning shell commands or by querying the OS in some other way (I wouldn't know how to do that on Windows).
It's not a perfect solution, but it's a simple one, which you could implement and test in a couple of minutes.
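A sketch of that idea (boundedCompute is a made-up name; here the only "quantity" polled is wall-clock time, but the watchdog could just as well consult GHC.Stats or an external measurement):

import Control.Concurrent
import Control.Monad (void)

boundedCompute :: Int -> IO a -> IO (Maybe a)
boundedCompute budgetMicros action = do
  resultVar <- newEmptyMVar
  worker   <- forkIO (action >>= \x -> void (tryPutMVar resultVar (Just x)))
  watchdog <- forkIO (threadDelay budgetMicros >> void (tryPutMVar resultVar Nothing))
  result <- takeMVar resultVar
  killThread worker    -- delivered as an asynchronous exception
  killThread watchdog
  pure result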
At the per-process level, you can use GHC's RTS options to control maximum stack and heap sizes: +RTS -M<size> caps the heap, and +RTS -K<size> limits the stack of an individual thread.
If I share an IORef among multiple threads, and use atomicModifyIORef to write to it:
atomicModifyIORef ref (\_ -> (new, ()))
Is it safe to read the value with plain old readIORef? Or is there a chance readIORef will return the old value in another thread after atomicModifyIORef has modified it?
I think that's what the documentation implies:
atomicModifyIORef acts as a barrier to reordering. Multiple atomicModifyIORef operations occur in strict program order. An atomicModifyIORef is never observed to take place ahead of any earlier (in program order) IORef operations, or after any later IORef operations.
I just want to be sure.
atomicModifyIORef guarantees that nothing happens between the atomic read and the following atomic write, thus making the whole operation atomic. The comment that you quoted just states that no two atomicModifyIORef operations will ever happen in parallel, and that the optimizer won't reorder the statements around it (separate reads and writes can safely be moved around in some cases; for example, in a' <- read a; b' <- read b; write c $ a' + b', the two reads can safely be reordered).
readIORef is already atomic, since it only performs one operation.
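To see why the atomicity matters, here's a small sketch: ten threads each increment a shared counter a thousand times. Because atomicModifyIORef' makes each read-modify-write indivisible, no increments are lost:

import Control.Concurrent
import Control.Monad (forM_, replicateM_)
import Data.IORef

main :: IO ()
main = do
  counter <- newIORef (0 :: Int)
  done <- newEmptyMVar
  forM_ [1 .. 10 :: Int] $ \_ -> forkIO $ do
    replicateM_ 1000 (atomicModifyIORef' counter (\n -> (n + 1, ())))
    putMVar done ()
  replicateM_ 10 (takeMVar done)
  print =<< readIORef counter  -- reliably prints 10000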
However, you are discussing a different issue. Yes, if the atomicModify happens at t=3ms and the read happens at t=4ms, you will get the modified value. However, threads aren't guaranteed to run in parallel, so if you do (pseudocode):
forkIO $ do
  sleep 100 ms
  atomicModify
sleep 1000 ms
read
... there's no guarantee that the read will happen after the modify (on a modern OS it is extremely unlikely not to, though), because the operating system might decide to schedule the new short-lived thread in such a way that the two threads never actually run in parallel.
If you want to share a mutable reference between several threads, you really should use a TVar instead of an IORef; that's the whole motivation for TVars, after all. You use TVars pretty much the same way as IORefs, but any access or modification has to be wrapped in an atomically block, which is guaranteed to run as a single atomic operation.
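A minimal sketch of the TVar version (assumes the stm package is available):

import Control.Concurrent.STM

main :: IO ()
main = do
  ref <- newTVarIO (0 :: Int)
  atomically (writeTVar ref 42)  -- plays the role of atomicModifyIORef
  v <- readTVarIO ref            -- an atomic, consistent read
  print v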
You don't want to use an IORef with multiple threads, since it gives basically no guarantees. I usually use an MVar instead.