Bounded computation in Haskell

Is there any way in Haskell (using GHC if it matters, for code that needs to run on Linux and Windows) to perform bounded computation? That is, "compute the result of this function if it is feasible to do so, but if the attempt has used more than X CPU cycles, Y stack space or Z heap space and is still not done, stop and return an indication that it was not possible to complete the computation"?

System.Timeout.timeout :: Int -> IO a -> IO (Maybe a)
http://lambda.haskell.org/hp-tmp/docs/2011.2.0.0/ghc-doc/libraries/base-4.3.1.0/System-Timeout.html#v:timeout
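A minimal sketch of how timeout might be applied to a pure computation, assuming it is enough to force the result to weak head normal form. The helper name boundedEval is made up for illustration; timeout and evaluate are the real base functions. Since the timeout arrives as an asynchronous exception, it can only interrupt code that reaches a safe point (in practice, code that allocates).

```haskell
import Control.Exception (evaluate)
import System.Timeout (timeout)

-- | Hypothetical helper: try to evaluate a value to weak head normal form
-- within the given number of microseconds. Returns Nothing on timeout.
-- Only the outermost constructor is forced; nested lazy fields stay lazy.
boundedEval :: Int -> a -> IO (Maybe a)
boundedEval micros x = timeout micros (evaluate x)
```

For example, boundedEval 1000000 (sum [1 .. 1000 :: Int]) should give Just 500500, while a computation that is still running when the limit expires gives Nothing. This bounds wall-clock time only, not CPU cycles, stack or heap.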

Here's a hackish solution you could try: spawn your computation with forkIO, and let the parent thread (or a monitoring thread which has access to the forked thread's ThreadId) periodically poll for any quantity you'd want, and throw an asynchronous exception to the computing thread as necessary (interestingly, that's exactly how timeout works.)
The next question would be whether there's a way to find out how big the heap currently is from within Haskell. Total memory consumption and cycles you can find out by spawning shell commands, or querying the OS in another way (I wouldn't know how to do that on Windows.)
It's not a perfect solution, but it's a simple one, which you could implement and test in a couple of minutes.
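The monitoring-thread idea could be sketched roughly as follows. The names runBounded and heapOver are hypothetical; the budget check is passed in as an IO Bool so that any quantity can be polled. The heapOver example assumes the program is run with +RTS -T so that GHC.Stats is populated, and it only sees the live heap size as of the most recent garbage collection. Like timeout, this relies on the worker reaching a safe point: killThread waits until the exception has been delivered.

```haskell
import Control.Concurrent (forkFinally, forkIO, killThread, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, takeMVar, tryPutMVar)
import Control.Monad (void)
import Data.Word (Word64)
import GHC.Stats (gc, gcdetails_live_bytes, getRTSStats, getRTSStatsEnabled)

-- | Hypothetical helper: run an action in its own thread while a monitor
-- thread polls `overBudget` every `pollMicros` microseconds. If the budget
-- is exceeded, the worker is killed and Nothing is returned. An exception
-- thrown by the action itself is also reported as Nothing in this sketch.
runBounded :: Int -> IO Bool -> IO a -> IO (Maybe a)
runBounded pollMicros overBudget action = do
  result <- newEmptyMVar
  -- The finaliser runs whether the action returned, threw, or was killed.
  worker <- forkFinally action $ \r ->
    void (tryPutMVar result (either (const Nothing) Just r))
  let poll = do
        threadDelay pollMicros
        over <- overBudget
        if over
          then killThread worker >> void (tryPutMVar result Nothing)
          else poll
  monitor <- forkIO poll
  outcome <- takeMVar result
  killThread monitor -- harmless if the monitor has already finished
  return outcome

-- | Example budget check: is the live heap (as of the last GC) above a limit?
-- Returns False when RTS statistics are not enabled (no +RTS -T).
heapOver :: Word64 -> IO Bool
heapOver limit = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then return False
    else do
      stats <- getRTSStats
      return (gcdetails_live_bytes (gc stats) > limit)
```

A call might look like runBounded 10000 (heapOver (256 * 1024 * 1024)) someAction, where someAction stands for whatever computation needs bounding.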

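On a per-process level, you can use GHC's RTS options to control the maximum stack and heap sizes: -K sets the maximum stack size and -M the maximum heap size (for example, +RTS -K64m -M512m -RTS, on a program linked with -rtsopts).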

Related

What happens with CPU context when Goroutines are switching?

If I understand correctly how goroutines work on top of system threads, they are run from a queue one by one. But does this mean that every goroutine loads and unloads its context to the CPU? If so, what's the difference between system threads and goroutines?
The most significant problem is the time cost of context switching. Is that correct?
What mechanism underlies detecting which data was requested by which goroutine? For example: I send a request to a DB from goroutine A and don't wait for the response, and at the same time a switch to the next goroutine occurs. How does the system know that the request came from A and not from B or C?

Goroutines, memory and OS threads
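Goroutine stacks start small and grow as needed (early Go releases used segmented stacks; since Go 1.3 the runtime uses contiguous stacks that are copied when they grow). The Go runtime does the scheduling, not the OS. The runtime multiplexes the goroutines onto a relatively small number of real OS threads.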
Goroutines switch cost
Goroutines are scheduled cooperatively, and when a switch occurs only 3 registers need to be saved and restored: the Program Counter, the Stack Pointer, and DX. From the OS's perspective, a Go program behaves like an event-driven program.
Goroutines and CPU
You cannot directly control the number of threads that the runtime will create. You can, however, set the maximum number of CPUs that can execute Go code simultaneously, either through the GOMAXPROCS environment variable or by calling runtime.GOMAXPROCS(n).
Program Counter (a completely different story)
In computing, a program is a specific set of ordered operations for a computer to perform. An instruction is an order given to a computer processor by a program. Within a computer, an address is a specific location in memory or storage. A program counter register is one of a small set of data holding places that the processor uses.
This is a different story of how programs work and communicate with each other and it doesn't directly relate to a goroutine topic.
Sources:
http://blog.nindalf.com/how-goroutines-work/
https://gobyexample.com/goroutines
http://tleyden.github.io/blog/2014/10/30/goroutines-vs-threads/
http://whatis.techtarget.com/definition/program-counter

Gs, Ms, Ps
A "G" is simply a goroutine. It's represented by type g. When a goroutine exits, its g object is returned to a pool of free gs and can later be reused for some other goroutine.
An "M" is an OS thread that can be executing user Go code, runtime code, a system call, or be idle. It's represented by type m. There can be any number of Ms at a time since any number of threads may be blocked in system calls.
Finally, a "P" represents the resources required to execute user Go code, such as scheduler and memory allocator state. It's represented by type p. There are exactly GOMAXPROCS Ps. A P can be thought of like a CPU in the OS scheduler and the contents of the p type like per-CPU state. This is a good place to put state that needs to be sharded for efficiency, but doesn't need to be per-thread or per-goroutine.
The scheduler's job is to match up a G (the code to execute), an M (where to execute it), and a P (the rights and resources to execute it). When an M stops executing user Go code, for example by entering a system call, it returns its P to the idle P pool. In order to resume executing user Go code, for example on return from a system call, it must acquire a P from the idle pool.
All g, m, and p objects are heap allocated, but are never freed, so their memory remains type stable. As a result, the runtime can avoid write barriers in the depths of the scheduler.
User stacks and system stacks
Every non-dead G has a user stack associated with it, which is what user Go code executes on. User stacks start small (e.g., 2K) and grow or shrink dynamically.
Every M has a system stack associated with it (also known as the M's "g0" stack because it's implemented as a stub G) and, on Unix platforms, a signal stack (also known as the M's "gsignal" stack). System and signal stacks cannot grow, but are large enough to execute runtime and cgo code (8K in a pure Go binary; system-allocated in a cgo binary).
Runtime code often temporarily switches to the system stack using systemstack, mcall, or asmcgocall to perform tasks that must not be preempted, that must not grow the user stack, or that switch user goroutines. Code running on the system stack is implicitly non-preemptible and the garbage collector does not scan system stacks. While running on the system stack, the current user stack is not used for execution.
Ref: https://github.com/golang/go/blob/master/src/runtime/HACKING.md

What are some good use cases for calling 'yield' in a thread?

Many languages that support multi-threading provide an action that allows a thread to offer a context switch to other threads, for example Haskell's yield.
However, the documentation doesn't say what the actual use case is. When is it appropriate to use these yield functions, and when not?
Recently I've seen one such use case in Improving the performance of Warp again where it turns out that when a network server sends a message, it's worth calling yield before trying to receive data again, because it takes the client some time to process the answer and issue another request.
I'd like to see other examples or guidelines when calling yield brings some benefit.
I'm mainly interested in Haskell, but I don't mind learning about other languages or the concept in general.
Note: This has nothing to do with generators or coroutines, such as yield in Python or Ruby.

GHC's IO manager uses yield to improve performance. The usage can be found on github but I'll paste it here as well.
    step :: EventManager -> IO State
    step mgr@EventManager{..} = do
      waitForIO
      state <- readIORef emState
      state `seq` return state
      where
        waitForIO = do
          -- First try a non-blocking poll.
          n1 <- I.poll emBackend Nothing (onFdEvent mgr)
          when (n1 <= 0) $ do
            -- Nothing ready: let other Haskell threads run, then retry.
            yield
            n2 <- I.poll emBackend Nothing (onFdEvent mgr)
            when (n2 <= 0) $ do
              -- Still nothing: fall back to a blocking poll.
              _ <- I.poll emBackend (Just Forever) (onFdEvent mgr)
              return ()
A helpful comment explains the usage of yield:
If the [first non-blocking] poll fails to find events, we yield, putting the poll loop thread at end of the Haskell run queue. When it comes back around, we do one more non-blocking poll, in case we get lucky and have ready events. If that also returns no events, then we do a blocking poll.
So yield is used to minimize the number of blocking polls the EventManager must perform.
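The same "try, yield, try again, then block" control flow can be illustrated outside the IO manager with an MVar. This is only a sketch of the pattern: takeWithYield and demo are made-up names, and blocking on an MVar is far cheaper than a blocking OS-level poll, so no performance benefit is claimed here.

```haskell
import Control.Concurrent (forkIO, yield)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar, tryTakeMVar)

-- | Hypothetical helper mirroring the poll loop: a non-blocking attempt,
-- then a yield and a second non-blocking attempt, and only then a
-- blocking wait.
takeWithYield :: MVar a -> IO a
takeWithYield box = do
  first <- tryTakeMVar box
  case first of
    Just x -> return x
    Nothing -> do
      yield -- give producers a chance to run before blocking
      second <- tryTakeMVar box
      maybe (takeMVar box) return second

-- | Hypothetical demo: a producer thread fills the box, the caller takes it.
demo :: IO Int
demo = do
  box <- newEmptyMVar
  _ <- forkIO (putMVar box 7)
  takeWithYield box
```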

GHC only suspends threads at specific safe points (in particular when allocating memory). Quoting The Glasgow Haskell Compiler by Simon Marlow and Simon Peyton-Jones:
A context switch only occurs when the thread is at a safe point, where very little additional state needs to be saved. Because we use accurate GC, the stack of the thread can be moved and expanded or shrunk on demand. Contrast these with OS threads, where every context switch must save the entire processor state, and where stacks are immovable so a large chunk of address space has to be reserved up front for each thread.
[...]
Having said that, the implementation does have one problem that users occasionally run into, especially when running benchmarks. We mentioned above that lightweight threads derive some of their efficiency by only context-switching at "safe points", points in the code that the compiler designates as safe, where the internal state of the virtual machine (stack, heap, registers, etc.) is in a tidy state and garbage collection could take place. In GHC, a safe point is whenever memory is allocated, which in almost all Haskell programs happens regularly enough that the program never executes more than a few tens of instructions without hitting a safe point. However, it is possible in highly optimised code to find loops that run for many iterations without allocating memory. This tends to happen often in benchmarks (e.g., functions like factorial and Fibonacci). It occurs less often in real code, although it does happen. The lack of safe points prevents the scheduler from running, which can have detrimental effects. It is possible to solve this problem, but not without impacting the performance of these loops, and often people care about saving every cycle in their inner loops. This may just be a compromise we have to live with.
Therefore it can happen that a program with a tight loop has no such points and never switches threads. Then yield is necessary to let other threads run. See this question and this answer.
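As a rough illustration, a loop that might otherwise compile to non-allocating code can be given explicit yield points. The function name and the stride parameter are invented for this sketch. GHC also has a -fno-omit-yields flag that asks the compiler to keep yield points in such loops, at some cost in speed.

```haskell
import Control.Concurrent (yield)

-- | Hypothetical example: count up to `limit`, calling yield once every
-- `stride` iterations so that other Haskell threads get a chance to run.
-- `stride` must be positive.
countWithYield :: Int -> Int -> IO Int
countWithYield stride limit = go 0
  where
    go n
      | n >= limit = return n
      | n `rem` stride == 0 = yield >> (go $! n + 1)
      | otherwise = go $! n + 1
```

A larger stride means fewer calls to yield and less overhead, but longer stretches during which other threads cannot be scheduled.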

Limiting thread memory access per thread in GHC

I'm wondering, is it possible to limit the amount of memory a thread uses? I'm looking at running a server where untrusted user code is submitted and run. I can use SafeHaskell to ensure that it doesn't perform any unauthorized IO, but I need to make sure that a user's code doesn't crash the entire server, i.e. by causing a stack overflow or out-of-memory heap error.
Is there a way to limit the amount of memory each individual thread can access, or some way to ensure that if one thread consumes a massive amount of memory, that only that thread is terminated?
Perhaps, is there a way that when any thread encounters an out of memory error, I can catch the exception and choose which thread dies?
I'm talking more about concurrency, in the sense of forkIO and STM threads, rather than parallelism with par and seq.
Note: this is very similar to this question, but it never received an answer to the general problem, rather the answers dealt with the specific scenario of the question. Additionally, it's possible that since 2011, something might have changed in GHC 7.8, maybe with the new IO manager?

I don't know about Haskell, but in general, the answer to your question is no. In all programming languages/runtimes/operating systems/etc. that I know of, threads are nothing more than different paths of execution through the same code. The important thing in this case, is that threads always share the same virtual address space.
That being said, there is no technical reason why a memory allocator in your particular language & runtime system could not use a thread-specific variable to track how much has been allocated by any given thread, and impose an arbitrary limit.
No technical reason why it couldn't do that, but if thread A allocates an object which is subsequently accessed by thread B, thread C, thread D,... Then what sense does it make to penalize thread A for having allocated it? There is no practical way to track the "ownership" of an object that is accessed by many threads in the general case, which is why none of the languages/runtimes/OSes/etc. that I know of attempt to do it.
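For what it's worth, GHC does provide something along these lines: since base 4.8, System.Mem exposes a per-thread allocation counter that can be turned into a limit. The sketch below (withAllocLimit is a made-up name) uses it. Note that it bounds the total bytes a thread allocates, not the live data it retains, so it sidesteps rather than solves the ownership problem described above.

```haskell
import Control.Exception (AllocationLimitExceeded (..), try)
import Data.Int (Int64)
import System.Mem (disableAllocationLimit, enableAllocationLimit, setAllocationCounter)

-- | Hypothetical helper: run an action in the current thread with a cap on
-- the number of bytes it may allocate. Returns Nothing if the cap is hit.
-- The limit counts allocation, not residency: a long-running loop that
-- allocates short-lived garbage will eventually hit it too.
withAllocLimit :: Int64 -> IO a -> IO (Maybe a)
withAllocLimit bytes action = do
  setAllocationCounter bytes
  enableAllocationLimit
  r <- try action
  disableAllocationLimit
  case r of
    Left AllocationLimitExceeded -> return Nothing
    Right x -> return (Just x)
```

Running untrusted code this way would normally be done inside a thread created with forkIO, so that each submission gets its own counter.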

Ensuring even CPU time distribution among threads in Haskell

I have a planning algorithm written in Haskell which is tasked with evaluating a set of possible plans in a given amount of time, where the evaluation process is one which may be run for arbitrary amounts of time to produce increasingly accurate results. The natural and purportedly most efficient way to do this is to give each evaluation task its own lightweight Haskell thread, and have the main thread harvest the results after sleeping for the specified amount of time.
But in practice, invariably one or two threads will be CPU-starved for the entire available time. My own experimentation with semaphores/etc to control execution has shown this to be surprisingly difficult to fix, as I can't seem to force a given thread to stop executing (including using "yield" from Control.Concurrent.)
Is there a good known way to ensure that an arbitrary number of Haskell threads (not OS threads) each receive a roughly even amount of CPU-time over a (fairly short) span of wall-clock-time? Failing that, a good way to ensure that a number of threads executing an identical iteration fairly "take turns" on a given number of cores such that all cores are being used?

AFAIK, Haskell threads should all receive roughly equal amounts of CPU power as long as they are all actively trying to do work. The only reason that wouldn't happen is if they start making blocking I/O calls, or if each thread runs only for a few milliseconds or something.
Perhaps the problem you are seeing is actually that each thread just runs for a split second, yielding an unevaluated expression as its result, which the main thread then evaluates itself? If that were the case, it would look like the main thread is getting all the CPU time.
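If that is the cause, one possible remedy is to force each result inside the worker thread before handing it over. A minimal sketch, with forkEval and harvest as invented names; evaluate only forces to weak head normal form, so a deeply lazy structure would need something stronger (for instance force from the deepseq package, which is an assumption outside this sketch).

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)

-- | Hypothetical helper: evaluate a value to weak head normal form in a new
-- thread, so the work is done by the worker rather than by whoever reads
-- the MVar. If evaluation throws, the MVar is never filled.
forkEval :: a -> IO (MVar a)
forkEval x = do
  box <- newEmptyMVar
  _ <- forkIO (evaluate x >>= putMVar box)
  return box

-- | Hypothetical helper: wait for all workers and collect their results.
harvest :: [MVar a] -> IO [a]
harvest = mapM takeMVar
```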

forkIO threads and OS threads

If I create a thread using forkIO, I need to provide a function to run, and I get back an identifier (ThreadId). I can then communicate with this animal via, e.g., the workloads, MVars, etc. However, to my understanding the created thread is very limited and can only work in a sort of SIMD fashion, where the function that was provided at thread creation is the instruction. I cannot change the function that I provided when the thread was initiated. I understand that these user threads are eventually by the OS mapped to OS threads.
I would like to know how the Haskell threads and the OS threads do interface. Why can Haskell threads that do completely different things be mapped to one and the same OS thread? Why was there no need to initiate the OS thread with a fixed instruction (as it is needed in forkIO)? How does the scheduler(?) recognize user threads in an application that could possibly be distributed? In other words, why are OS threads so flexible?
Last, is there any way to dump the heap of a selected thread from within the application?

First, let's address one quick misconception:
I understand that these user threads are eventually by the OS mapped to OS threads.
Actually, the Haskell runtime is in charge of choosing which Haskell thread a particular OS thread from its pool is executing.
Now the questions, one at a time.
Why can Haskell threads that do completely different things be mapped to one and the same OS thread?
Ignoring FFI for the moment, all OS threads are actually running the Haskell runtime, which keeps track of a list of ready Haskell threads. The runtime chooses a Haskell thread to execute, and jumps into the code, executing until the thread yields control back to the runtime. At that moment, the runtime has a chance to continue executing the same thread or pick a different one.
In short: many Haskell threads can be mapped to a single OS thread because in reality that OS thread is doing only one thing, namely, running the Haskell runtime.
Why was there no need to initiate the OS thread with a fixed instruction (as it is needed in forkIO)?
I don't understand this question (and I think it stems from a second misconception). You start OS threads with a fixed instruction in exactly the same sense that you start Haskell threads with a fixed instruction: for each thing, you just give a chunk of code to execute and that's what it does.
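To illustrate that the "fixed" chunk of code does not restrict what a thread ends up doing, here is a sketch of a Haskell thread whose code is simply "run whatever job arrives next". The names startWorker and submit are hypothetical, and the sketch assumes jobs do not throw (an exception would kill the worker and leave the caller waiting).

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forever, join)

-- | Hypothetical helper: start one thread that executes every IO action
-- written to the returned channel, in order.
startWorker :: IO (Chan (IO ()))
startWorker = do
  jobs <- newChan
  _ <- forkIO (forever (join (readChan jobs)))
  return jobs

-- | Hypothetical helper: hand a job to the worker and wait for its result.
submit :: Chan (IO ()) -> IO a -> IO a
submit jobs job = do
  reply <- newEmptyMVar
  writeChan jobs (job >>= putMVar reply)
  takeMVar reply
```

The same worker thread can then be given unrelated jobs, one after another, for as long as the program runs.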
How does the scheduler(?) recognize user threads in an application that could possibly be distributed?
"Distributed" is a dangerous word: usually, it refers to spreading code across multiple machines (presumably not what you meant here). As for how the Haskell runtime can tell when there's multiple threads, well, that's easy: you tell it when you call forkIO.
In other words, why are OS threads so flexible?
It's not clear to me that OS threads are any more flexible than Haskell threads, so this question is a bit strange.
Last, is there any way to dump the heap of a selected thread from within the application?
I actually don't really know of any tools for dumping the Haskell heap at all, in multithreaded applications or otherwise. You can dump a representation of the part of the heap reachable from a particular object, if you like, using a package like vacuum. I've used vacuum-cairo to visualize these dumps with great success in the past.
For further information, you may enjoy the middle two sections, "Conventions" and "Foreign Imports", from my intro to multithreaded gtk2hs programming, and perhaps also bits of the section on "The Non-Threaded Runtime".

Instead of trying to directly answer your question, I will try to provide a conceptual model for how multi-threaded Haskell programs are implemented. I will ignore many details and complexities.
Operating systems implement preemptive multithreading using hardware interrupts to allow multiple "threads" of computation to run logically on the same core at the same time.
The threads provided by operating systems tend to be heavy weight. They are well suited to certain types of "multi-threaded" applications, and, on systems like Linux, are fundamentally the same tool that allows multiple programs to run at the same time (a task they excel at).
But these threads are a bit heavyweight for many uses in high-level languages such as Haskell. Essentially, the GHC runtime works as a mini-OS, implementing its own "threads" on top of the OS threads, in the same way an OS implements threads on top of cores.
It is conceptually easy to imagine that a language like Haskell would be implemented in this way. Evaluating Haskell consists of "forcing thunks" where a thunk is a unit of computation that might 1. depend on another value (thunk) and/or 2. create new thunks.
Thus, one can imagine multiple threads each evaluating thunks at the same time. One would construct a queue of thunks to be evaluated. Each thread would pop the top of the queue, and evaluate that thunk until it was completed, then select a new thunk from the queue. The operation par and its ilk can "spark" new computation by adding a thunk to that queue.
Extending this model to IO actions is not particularly hard to imagine either. Instead of each thread simply forcing a pure thunk, we imagine the unit of Haskell computation being somewhat more complicated. Pseudo-Haskell for such a runtime:
    type Spark = (ThreadId, Action)
    data Action = Compute Thunk | Perform IOAction
note: this is for conceptual understanding only, don't think things are implemented this way
When we run a Spark, we look for exceptions "thrown" to that thread ID. Assuming we have none, execution consists of either forcing a thunk or performing an IO action.
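In the same conceptual spirit, a toy scheduler can make the "queue of work" picture concrete. This is purely illustrative and not how GHC works: a "thread" here is just a list of IO steps, every step boundary acts as a switch point, and the names Thread, roundRobin and runLogged are invented for the sketch.

```haskell
import Data.IORef (modifyIORef, newIORef, readIORef)

-- | A toy "thread": a list of steps, with a possible switch after each one.
type Thread = [IO ()]

-- | Run one step of the thread at the front of the queue, then move the
-- rest of that thread to the back. Finished threads are dropped.
roundRobin :: [Thread] -> IO ()
roundRobin [] = return ()
roundRobin ([] : rest) = roundRobin rest
roundRobin ((step : more) : rest) = step >> roundRobin (rest ++ [more])

-- | Hypothetical demo: each thread is given as a list of labels; every step
-- records its label, and the order in which steps ran is returned.
runLogged :: [[String]] -> IO [String]
runLogged threads = do
  logRef <- newIORef []
  roundRobin [[modifyIORef logRef (label :) | label <- labels] | labels <- threads]
  reverse <$> readIORef logRef
```

With two threads whose steps are labelled a1, a2 and b1, runLogged [["a1", "a2"], ["b1"]] returns ["a1", "b1", "a2"]: the steps interleave even though everything runs on a single OS thread.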
Obviously, my explanation here has been very hand-wavy and has ignored some complexity. For more, the GHC team have written excellent articles such as "Runtime Support for Multicore Haskell" by Marlow et al. You might also want to look at a textbook on operating systems, as they often go into some depth on how to build a scheduler.
