I am reading semi-explicit parallelism in Haskell, and get some confusion.
par :: a -> b -> b
People say that this approach allows us to make automatically parallelization by evaluating every sub-expression of a Haskell program in parallel. But this approach has the following disadvantages:
1) It creates far too many small items of the work, which cannot be efficiently scheduled. As I understand, if you use par function for every lines of Haskell program, it will creates too many threads, and it's not practical at all. Is that right?
2) With this approach, parallelism is limited by data dependencies in the source program. If I understand correctly, it means every sub-expression must be independent. Like, in the par function, a and b must be independent.
3) The Haskell runtime system does not necessarily create a thread to compute the value of the expression a. Instead, it creates a spark, which has the potential to be executed on a different thread from the parent thread.
So, my question is : finally the runtime system will create a thread to compute a or not? Or if the expression a is needed to compute the expression b, the system will create a new thread to compute a? Otherwise, it will not. Is this true?
I am a newbie to Haskell, so maybe my questions are still basic for all of you. Thanks for your answer.
The par combinator you mention is part of the Glasgow parallel Haskell (GpH) which implements semi-explicit parallelism, which however means it is not fully implicit and hence does not provide automatic parallelisation. The programmer still needs to identify subexperssions deemed worthwhile executing in parallel to avoid the issue you mention in 1).
Moreover, the annotation is not prescriptive (as e.g. pthread_create in C or forkIO in Haskell) but advisory, that means the runtime system finally decides whether to evaluate the subexpressions in parallel or not. This provides additional flexibility and a way for dynamic granularity control. Additionally, so-called Evaluation Strategies have been designed to abstract over par and pseq and to separate specification of coordination from computation. For instance, a strategy parListChunk allows to chunk a list and force it to weak head normal form (this is a case where some strictness is needed).
2) Parallelism is limited by data dependencies in the sense that the computation defines the way the graph is reduced and which computation is demanded at which point. It is not true that every sub-expression must be independent. For instance E1 par E2 returns the result of E2, it means to be useful, some part of E1 needs to be used in E2 and hence E2 depends on E1.
3) The picture is slightly confused here because of the GHC-specific terminology. There are Capabilities (or Haskell Execution Contexts) which implement parallel graph reduction and maintain a spark pool and a thread pool each. Usually there is one Capability per core (can be thought of as OS threads). On the other end of the continuum there are sparks, which are basically pointers to parts of the graph that have not been evaluated yet (thunks). And there are threads (actually sort of tasks or work units), so that to be evaluated in parallel a spark needs to be turned into a thread (which has a so called thread state object that contains the necessary execution environment and allows a thunk to be evaluated in parallel). A thunk may depend on results of other thunks and blocks until these results arrive. These threads are much more lightweight than OS threads and are being multiplexed onto the available Capablities.
So, in summary, a runtime will not even neccesarily create a lightweight thread to evaluate a sub-expression. By the way, random work-stealing is used for load-balancing.
This is a very high-level approach to parallelism and avoids race conditions and deadlocks by design. The synchronisation is implicitly mediated through graph reduction. A nice position statement discusses further Why Parallel Functional Programming Matters. For more information on the abstract machine behind the scenes have a look at Stackless Tagless G-Machine and checkout the notes on Haskell Execution Model on GHC wiki (usually the most up-to-date documentation alongside the source code).
Yes, you are correct. You would not gain anything by creating a spark for every expression you want computed. You would get way, way too many sparks. Trying to manage this is what Data Parallel Haskell is about. DPH is a way of breaking down a nested computations into well-sized chunks which can then be computed in parallel. Keep in mind that this is still a research effort and probably not ready for mainstream consumption.
Once again, you are correct. If a depends on b you have to compute as much of a as b needs to be able to start computation of b.
Yup. Threads actually have a pretty high overhead compared to some of the alternatives. Sparks are somewhat like thunks only they can be computedd independently of time.
No, the RTS will not create a thread to compute a. You can decide how many threads the RTS should have running (+RTS -N6 for six threads) and they will be kept alive for the duration of the program.
par only creates a spark. A spark is not a thread. The sparks occupy a work pool, and the scheduler performs work stealing – i.e. when a thread goes idle it picks up a spark from the pool and computes it.
Related
Two (related) questions about concurrent programming in Haskell compared to Java/Scala:
What do Haskeller's use for concurrent data structures? Is there anything analogous to Java's java.util.concurrent.{ConcurrentHashMap, ConcurrentSkipListSet, ..}. MVar (Map k v) does not count :). Shared mutable state is evil, but occasionally necessary.
Is there any equivalent to Java's ExecutorService? AFAIK, Haskell threads (see fork#, yield#, etc. in GHC.Exts) are all scheduled by something built into the RTS. But what if I specifically want to use a fork join pool, or schedule some computations on a thread pool? Being able to put Future's on specific execution contexts is really handy in Scala, and I don't know how to do that in Haskell.
An entire book could be written on this topic, so I'll try to touch on just the points you're asking about.
You can use the par combinator to schedule pure computations to be done at some future point in time. The RTS already implements work-stealing queues for this and already maintains one thread per CPU core for running them. (If you link your program with the appropriate switches.) Note that this won't help one bit for impure code, and won't let you specify which thread or which core the code executes on.
For shared mutable storage, you have two options.
Explicit locking using MVar. This has all the usual pitfalls of locking in other programming languages. (Deadlocks, forgetting to lock things, locking too many things, locking things too long, not locking them for long enough...) So MVar (Map k v) absolutely counts!
STM. You seem to misunderstand what this does. The whole point of STM is that you don't need locks. It lets you use shared mutable data structures "as if" they're not shared, but it automatically prevents data races, inconsistent state, and all the other usual problems with not using locks. It also allows a thread to wait on multiple conditions simultaneously. It's an incredible framework!
If you want to run code on a specific OS thread, you're probably looking for forkOS rather than forkIO.
Given your use case, I suspect STM is probably what you're looking for. If you have a specific task you're trying to do, post another question and you will probably get more specific advise.
I've been working a lot with concurrency at the practical level, and therefore I've also started to study it theoretically to gain insight into this field of computer science.
However, I've trouble understanding the following:
Why cannot a Lock for 2-threads be implemented using only 1 shared variable satisfying mutual exclusion and deadlock freedom?
More generally, why is at least n shared variables needed for a n-thread lock satisfying mutual exclusion and deadlock freedom?
Consider two threads A and B. I see that A must write to this variable in order to signify it acquires the lock. The variable could be a boolean. Is it because that A needs to read the variable before writing it, and this is two operations? (not done atomically)
Most likely, you're reading things that make assumptions about the platform's capabilities that are no longer realistic. You're probably considering the case where a CPU has no prefetching, no posted writes, total read and store ordering, no compiler optimization that affect memory visibility or memory operation ordering, and no risk of word tearing, but does not have an atomic "read-modify-write" operation like increment or compare-exchange. With these assumptions, there's really no way to do it with one variable.
This is an interesting theoretical problem, but has very little practical relevance. Modern CPUs do have all of those optimizations -- they prefetch reads, they post writes to buffers, they re-order reads and stores, and compilers optimize away memory options. Word tearing is typically not an issue for aligned operations to native integer types. But, more importantly, modern CPUs have sophisticated, high-performance atomic operations such as increment, decrement, compare-exchange, and so on.
When you write synchronization primitives, the exercise is highly platform-specific. The combination of capabilities available to you varies from platform to platform. Even more importantly, their costs vary drastically from platform to platform, so even if many solutions are possible, they may not be equally good.
Lastly, you have to have a deep understanding of what each primitive actually makes the platform do. For example, on modern Intel CPUs, there is hyper-threading. It's important that, for example, a thread waiting for a spinlock doesn't starve another thread sharing the physical core. That requires deep understanding of how hyper-threading actually works. Similarly, it's easy to code a spinlock so that you take the mother of all mispredicted branches when you acquire the lock and blow out the pipelines at the instant where performance is the most critical. You need to understand how branch prediction works and how it interacts with instruction pipelining to avoid this issue.
The vast majority of programmers should never, ever write synchronization primitives and use them in actual, real world code. Getting them to work with assured reliability is hard, and getting them to perform properly is much, much harder. And to top it off, it's not possible to measure their performance easily. (Of course, it's great to experiment, so long as you don't get an exaggerated sense of the usefulness of your experimental code.)
How does GHC handle thunks that are accessed by multiple threads (either explicit threads, or the internal ones that evaluate sparks)? Can it happen that multiple threads evaluate the same thunk, duplicating work? Or, if thunks are synchronized, how, so that performance doesn't suffer?
From the blog post #Lambdageek linked, the GHC Commentary and the GHC User's Guide I piece together the following:
GHC tries to prevent reevaluating thunks, but because true locking between threads is expensive, and thunks are usually pure and so harmless to reevaluate, it normally does so in a sloppy manner, with a small chance of duplicating work anyhow.
The method it uses for avoiding work is to replace thunks by a blackhole, a special marker that tells other threads (or sometimes, the thread itself; that's how <<loop>> detection happens) that the thunk is being evaluated.
Given this, there are at least three options:
By default, it uses "lazy blackholing", where this is only done before the thread is about to pause. It then "walks" its stack and creates "true" blackholes for new thunks, using locking to ensure each thunk only gets one thread blackholing it, and aborting its own evaluation if it discovers another thread has already blackholed a thunk. This is cheaper as it doesn't need to consider thunks whose evaluation time is so short that it fits completely between two pauses.
With the -feager-blackholing-flag, blackholes are instead created as soon as a thunk starts evaluating, and the User's Guide recommends this if you are doing a lot of parallelism. However, because locking on every thunk would be too expensive, these blackholes are cheaper "eager" ones, which are not synchronized with other threads (although other threads can still see them if there's not a race condition). Only when the thread pauses do these get turned into "true" blackholes.
A third case, which the blog post was particularly about, is used for special functions like unsafePerformIO, for which it's harmful to evaluate a thunk more than once. In this case, the thread uses a "true" blackhole with real locking, but creates it immediately, by inserting an artificial thread pause before the real evaluation.
To summarize the article linked to in the comments: the thunks in GHC are not strictly atomic between multiple threads: it is possible in a race condition for multiple threads to evaluate the same thunk, duplicating work. But, this is not a big deal in practice because:
Guaranteed purity means it never matters whether a thunk is evaluated twice, in terms of program semantics. unsafePerformIO is a case where that might be an issue, but it turns out that unsafePerformIO is careful to avoid re-running the same IO action.
Because all values are thunks, most thunks turn out to be quite small, so occasionally duplicating the work to force one is not a big deal in terms of program speed. You might imagine that it's expensive to duplicate, for example, last [1,2..10000000] because that whole computation is expensive. But of course really the outermost thunk just resolves to another thunk, something like:
case [1,2..10000000] of
[x] -> x
(_:xs) -> last xs
[] -> error "empty list"
and if two threads duplicate the work to turn the call to last into a use of case, that's pretty cheap in the grand scheme of things.
Yes, sometimes the same thunk can be evaluated by multiple threads. GHC runtime tries to minimize the probability of duplicated work, so in practice it is rare. Please see "Haskell on a Shared-Memory Multiprocessor" paper for low level details, mostly the "Lock-free thunk evaluation" section. (I'd recommend the paper for every professional haskell developer btw.)
Many languages that support multi-threading provide an action that allows a thread to offer a context switch to another threads. For example Haskell's yield.
However, the documentation doesn't say what is the actual use case. When it's appropriate to use these yield functions, and when not?
Recently I've seen one such use case in Improving the performance of Warp again where it turns out that when a network server sends a message, it's worth calling yield before trying to receive data again, because it takes the client some time to process the answer and issue another request.
I'd like to see other examples or guidelines when calling yield brings some benefit.
I'm mainly interested in Haskell, but I don't mind learning about other languages or the concept in general.
Note: This has nothing to do with generators or coroutines, such as yield in Python or Ruby.
GHC's IO manager uses yield to improve performance. The usage can be found on github but I'll paste it here as well.
step :: EventManager -> IO State
step mgr#EventManager{..} = do
waitForIO
state <- readIORef emState
state `seq` return state
where
waitForIO = do
n1 <- I.poll emBackend Nothing (onFdEvent mgr)
when (n1 <= 0) $ do
yield
n2 <- I.poll emBackend Nothing (onFdEvent mgr)
when (n2 <= 0) $ do
_ <- I.poll emBackend (Just Forever) (onFdEvent mgr)
return ()
A helpful comment explains the usage of yield :
If the [first non-blocking] poll fails to find events, we yield, putting the poll loop thread at
end of the Haskell run queue. When it comes back around, we do one more
non-blocking poll, in case we get lucky and have ready events. If that also returns no events, then we do a blocking poll.
So yield is used to minimize the number of blocking polls the EventManager must perform.
GHC only suspends threads at specific safe points (in particular when allocating memory). Quoting The Glasgow Haskell Compiler by Simon Marlow and Simon Peyton-Jones:
A context switch only occurs when the thread is at a safe point, where very little additional state needs to be saved. Because we use accurate GC, the stack of the thread can be moved and expanded or shrunk on demand. Contrast these with OS threads, where every context switch must save the entire processor state, and where stacks are immovable so a large chunk of address space has to be reserved up front for each thread.
[...]
Having said that, the implementation does have one problem that users occasionally run into, especially when running benchmarks. We mentioned above that lightweight threads derive some of their efficiency by only context-switching at "safe points", points in the code that the compiler designates as safe, where the internal state of the virtual machine (stack, heap, registers, etc.) is in a tidy state and garbage collection could take place. In GHC, a safe point is whenever memory is allocated, which in almost all Haskell programs happens regularly enough that the program never executes more than a few tens of instructions without hitting a safe point. However, it is possible in highly optimised code to find loops that run for many iterations without allocating memory. This tends to happen often in benchmarks (e.g., functions like factorial and Fibonacci). It occurs less often in real code, although it does happen. The lack of safe points prevents the scheduler from running, which can have detrimental effects. It is possible to solve this problem, but not without impacting the performance of these loops, and often people care about saving every cycle in their inner loops. This may just be a compromise we have to live with.
Therefore it can happen that a program with a tight loop has no such points and never switches threads. Then yield is necessary to let other threads run. See this question and this answer.
If I create a thread using forkIO I need to provide a function to run and get back an identifier (threadID). I then can communicate with this animal via e.g. the workloads, MVARs etc.. However, to my understanding the created thread is very limited and can only work in sort of a SIMD fashion where the function that was provided for thread creation is the instruction. I cannot change the function that I provided when the thread was initiated. I understand that these user threads are eventually by the OS mapped to OS threads.
I would like to know how the Haskell threads and the OS threads do interface. Why can Haskell threads that do completely different things be mapped to one and the same OS thread? Why was there no need to initiate the OS thread with a fixed instruction (as it is needed in forkIO)? How does the scheduler(?) recognize user threads in an application that could possibly be distributed? In other words, why are OS threads so flexible?
Last, is there any way to dump the heap of a selected thread from within the application?
First, let's address one quick misconception:
I understand that these user threads are eventually by the OS mapped to OS threads.
Actually, the Haskell runtime is in charge of choosing which Haskell thread a particular OS thread from its pool is executing.
Now the questions, one at a time.
Why can Haskell threads that do completely different things be mapped to one and the same OS thread?
Ignoring FFI for the moment, all OS threads are actually running the Haskell runtime, which keeps track of a list of ready Haskell threads. The runtime chooses a Haskell thread to execute, and jumps into the code, executing until the thread yields control back to the runtime. At that moment, the runtime has a chance to continue executing the same thread or pick a different one.
In short: many Haskell threads can be mapped to a single OS thread because in reality that OS thread is doing only one thing, namely, running the Haskell runtime.
Why was there no need to initiate the OS thread with a fixed instruction (as it is needed in forkIO)?
I don't understand this question (and I think it stems from a second misconception). You start OS threads with a fixed instruction in exactly the same sense that you start Haskell threads with a fixed instruction: for each thing, you just give a chunk of code to execute and that's what it does.
How does the scheduler(?) recognize user threads in an application that could possibly be distributed?
"Distributed" is a dangerous word: usually, it refers to spreading code across multiple machines (presumably not what you meant here). As for how the Haskell runtime can tell when there's multiple threads, well, that's easy: you tell it when you call forkIO.
In other words, why are OS threads so flexible?
It's not clear to me that OS threads are any more flexible than Haskell threads, so this question is a bit strange.
Last, is there any way to dump the heap of a selected thread from within the application?
I actually don't really know of any tools for dumping the Haskell heap at all, in multithreaded applications or otherwise. You can dump a representation of the part of the heap reachable from a particular object, if you like, using a package like vacuum. I've used vacuum-cairo to visualize these dumps with great success in the past.
For further information, you may enjoy the middle two sections, "Conventions" and "Foreign Imports", from my intro to multithreaded gtk2hs programming, and perhaps also bits of the section on "The Non-Threaded Runtime".
Instead of trying to directly answer your question, I will try to provide a conceptual model for how multi-threaded Haskell programs are implemented. I will ignore many details, and complexities.
Operating systems implement preemptive multithreading using hardware interrupts to allow multiple "threads" of computation to run logically on the same core at the same time.
The threads provided by operating systems tend to be heavy weight. They are well suited to certain types of "multi-threaded" applications, and, on systems like Linux, are fundamentally the same tool that allows multiple programs to run at the same time (a task they excel at).
But, these threads are bit heavy weight for many uses in high level languages such as Haskell. Essentially, the GHC runtime works as mini-OS, implementing its own "threads" on top of the OS threads, in the same way an OS implements threads on top of cores.
It is conceptually easy to imagine that a language like Haskell would be implemented in this way. Evaluating Haskell consists of "forcing thunks" where a thunk is a unit of computation that might 1. depend on another value (thunk) and/or 2. create new thunks.
Thus, one can imagine multiple threads each evaluating thunks at the same time. One would construct a queue of thunks to be evaluated. Each thread would pop the top of the queue, and evaluate that thunk until it was completed, then select a new thunk from the queue. The operation par and its ilk can "spark" new computation by adding a thunk to that queue.
Extending this model to IO actions is not particularly hard to imagine either. Instead of each simply forcing pure thunk, we imagine the unit of Haskell computation being somewhat more complicated. Psuedo Haskell for such a runtime:
type Spark = (ThreadId,Action)
data Action = Compute Thunk | Perform IOAction
note: this is for conceptual understanding only, don't think things are implemented this way
When we run a Spark, we look for exceptions "thrown" to that thread ID. Assuming we have none, execution consists of either forcing a thunk or performing an IO action.
Obviously, my explanation here has been very hand-wavy, and ignored some complexity. For more, the GHC team have written excellent articles such as "Runtime Support for Multicore Haskell" by Marlow et al. You might also want to look at text book on Operating Systems, as they often go in some depth on how to build a scheduler.