I've been asking a few questions about concurrency in Haskell, particular TVar, and I've had concerns about livelock with TVar.
Instead, I've proposed this solution.
(1) Wrap all shared data in the program in one data structure, and wrap that in an IORef.
(2) Simply do any changes using atomicModifyIORef.
I believe this prevents both deadlocks and livelocks (whereas TVar only prevents the former). Also, because atomicModifyIORef simply links another thunk into a chain (a couple of pointer operations), it is not a bottleneck. All of the actual operations on the data can be done in parallel, as long as they do not mutually depend on each other. The Haskell runtime system will work this out.
I however feel like this is too simple. Are there any "gotchas" I've missed?
This design would probably be ok if the following are true:
reads will be much more prevalent than writes
a number of reads will be interspersed between writes
(possibly) writes will affect only a small portion of the global data structure
Of course, given those conditions, pretty much any concurrency system would be fine. Since you're concerned about livelock, I suspect you're dealing with a more complicated access pattern. In that case, read on.
Your design appears to be guided by the following chain of reasoning:
atomicModifyIORef is very cheap, because it just creates thunks
because atomicModifyIORef is cheap, it's not going to cause thread contention
Cheap data access + no contention = Concurrency FTW!
Here's the missing step in this reasoning: your IORef modifications only create thunks, and you have no control over where thunks are evaluated. If you can't control where the data is evaluated, you have no real parallelism.
Since you haven't yet presented the intended data access patterns, this is speculation; however, I expect that what will happen is that your repeated modifications to the data will build up a chain of thunks. Then at some point you'll read from the data and force an evaluation, causing all of those thunks to be evaluated sequentially in one thread. At this point, you may as well have written single-threaded code to begin with.
The way around this is to ensure that your data is evaluated (at least as far as you would like it to be) before it's written into the IORef. This is what the return parameter of atomicModifyIORef is for.
Consider these functions, meant to modify aVar :: IORef [Int]:
doubleList1 :: [Int] -> ([Int], ())
doubleList1 xs = (map (*2) xs, ())
doubleList2 :: [Int] -> ([Int], [Int])
doubleList2 xs = let ys = map (*2) xs in (ys,ys)
doubleList3 :: [Int] -> ([Int], Int)
doubleList3 xs = let ys = map (*2) xs in (ys, sum ys)
Here's what happens when you use these functions as arguments:
!() <- atomicModifyIORef aVar doubleList1 - only a thunk is created, no data is evaluated. An unpleasant surprise for whichever thread reads from aVar next!
!oList <- atomicModifyIORef aVar doubleList2 - the new list is evaluated only so far as to determine the initial constructor, that is (:) or []. Still no real work has been done.
!oSum <- atomicModifyIORef aVar doubleList3 - by evaluating the sum of the list, this guarantees that the computation is fully evaluated.
In the first two cases, very little work is done, so the atomicModifyIORef will exit quickly. But the real work wasn't done in that thread, and now you don't know when it will happen.
In the third case, you know the work was done in the intended thread. First a thunk is created and the IORef updated, then the thread begins to evaluate the sum and finally returns the result. But suppose some other thread reads the data while the sum is being calculated. It may start evaluating the thunk itself, and now you've got two threads doing duplicate work.
In a nutshell, this design hasn't solved anything. It's likely to work in situations where your concurrency problems weren't hard, but for extreme cases like you've been considering, you're still going to be burning cycles with multiple threads doing duplicate work. And unlike STM, you have no control over how and when to retry. With STM you can at least abort in the middle of a transaction; with thunk evaluation it's entirely out of your hands.
Well, it's not going to compose well. And serializing all of your shared-memory modifications through a single IORef means that only one thread will be able to modify shared memory at a time; all you've really done is build a global lock. Yes, it will work, but it will be slow and nowhere near as flexible as TVars or even MVars.
AFAICT, if your computation leaves unevaluated thunks after it does its thing with the IORef contents, those thunks will simply be evaluated in whatever thread tries to use the result, rather than being evaluated in parallel as you would like. See the gotchas section of the MVar documentation.
It might be more interesting and helpful for others if you provided a concrete problem that you're trying to solve (or a simplified, but similar one).
Related
Let's say I have multiple threads that are reading from a file and I want to make sure that only a single thread is reading from the file at any point in time.
One way to implement this is to use an mvar :: MVar () and ensure mutual exclusion as follows:
thread = do
  ...
  _ <- takeMVar mvar
  x <- readFile "somefile" -- critical section
  putMVar mvar ()
  ...
  -- do something that evaluates x.
The above should work fine in strict languages, but unless I'm missing something, I might run into problems with this approach in Haskell. In particular, since x is evaluated only after the thread exits the critical section, it seems to me that the file will only be read after the thread has executed putMVar, which defeats the point of using MVars in the first place, as multiple threads may read the file at the same time.
Is the problem that I'm describing real and, if so, how do I get around it?
Yes, it's real. You get around it by avoiding all the base functions that are implemented using unsafeInterleaveIO. I don't have a complete list, but that's at least readFile, getContents, hGetContents. IO actions that don't do lazy IO -- like hGet or hGetLine -- are fine.
If you must use lazy IO, then fully evaluate its results in an IO action inside the critical section, e.g. by combining rnf and evaluate.
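For example, a minimal sketch of that fix, assuming the mvar :: MVar () lock from the question (force is rnf packaged up by Control.DeepSeq; exception safety via bracket is omitted for brevity):
import Control.Concurrent.MVar (MVar, takeMVar, putMVar)
import Control.DeepSeq (force)
import Control.Exception (evaluate)

-- Read the file strictly inside the critical section, so no lazy-IO
-- thunk escapes past putMVar.
readFileLocked :: MVar () -> FilePath -> IO String
readFileLocked mvar path = do
  _ <- takeMVar mvar
  x <- readFile path
  _ <- evaluate (force x) -- the entire file is read right here
  putMVar mvar ()
  return x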
Some other commentary on related things that aren't directly answers to this question:
Laziness and lazy IO are really separate concepts. They happen to share a name because humans are lazy at naming. Most IO actions do not involve lazy IO and do not run into this problem.
There is a related problem about stuffing unevaluated pure computations into your MVar and accidentally evaluating it on a different thread than you were expecting, but if you avoid lazy IO then evaluating on the wrong thread is merely a performance bug rather than an actual semantics bug.
readFile should be named unsafeReadFile because it's unsafe in the same way as unsafeInterleaveIO. If you stay away from functions that have, or should have, the unsafe prefix then you won't have this problem.
Haskell isn't a lazily evaluated language. It's a language in which, as in mathematics, evaluation order doesn't matter (except that you mustn't spend an unbounded amount of time trying to evaluate a function's argument before evaluating the function body). Compilers are free to reorder computations for efficiency reasons, and GHC does, so programs compiled with GHC aren't lazily evaluated as a rule.
readFile (along with getContents and hGetContents) is one of a small number of standard Haskell functions without the unsafe prefix that violate Haskell's value semantics. GHC has to specially disable its optimizations when it encounters such functions because they make program transformations observable that aren't supposed to be observable.
These functions are convenient hacks that can make some toy programs easier to write. You shouldn't use them in threaded code, or, in my opinion, at all. I think they shouldn't even be used in introductory programming courses (which is probably what they were meant for) because they give beginners a totally wrong impression of how evaluation in Haskell is supposed to work.
update: please bear in mind, I've just started learning Haskell
Let's say we're writing an application with the following general functionality:
when starting, it gathers some data from an external source;
this data is a set of complex structures which contain lists, arrays, ints, strings, etc.;
when running, the application serves a web API (servlets) that provides access to the data.
Now, if the application were written in Java, we could use a static ConcurrentHashMap object to store the data (represented as Java classes). During startup, the app could fill the map with data, and then servlets could access it, providing an API to the clients.
If the application were written in Erlang, we could use ETS/DETS for storing the data (as native Erlang structures).
Now the question: what is the proper Haskell way of implementing such a design?
It shouldn't be a DB; it should be some sort of lightweight in-memory store that can hold complex structures (native Haskell structures) and that is accessible from different threads (servlets, to speak in Java-world terms). Haskell has no static global vars as in Java, and no ETS or OTP as in Erlang, so how do you do this the right way (without using external solutions like Redis)?
Thanks
update: another important part of the question - since Haskell doesn't (?) have 'global static' variables, what would be the right way to implement this 'globally accessible' data-keeping object (say it's from "stm-containers")? Should I initialize it somewhere in the 'main' function and then just pass it to every REST API handler? Or is there some other, more correct way?
It's not clear from your question whether the client API will provide ways of mutating the data.
If not (i.e., the API will only be about querying), then any immutable data structure will suffice, since one beauty of immutable data is that it can be safely accessed from multiple threads, with you being sure that it can't change. No need for the overhead of locks or other strategies for dealing with concurrency. You'll simply construct the immutable data during initialisation and then just query it. For this, consider a package like "unordered-containers".
If your API will also be mutating the data, then you will need mutable data structures, which are optimised for concurrency. "stm-containers" is one package which provides those.
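To the update in the question: yes, the usual approach is to create the structure in main and pass it to every handler. A minimal sketch, using a TVar around an ordinary Map rather than any particular web framework (handleRequest is a hypothetical stand-in for a servlet):
import Control.Concurrent.STM (TVar, atomically, modifyTVar', newTVarIO, readTVarIO)
import qualified Data.Map.Strict as Map

type Store = TVar (Map.Map String Int)

-- A hypothetical request handler: it receives the store explicitly,
-- so no global variables are needed.
handleRequest :: Store -> String -> IO (Maybe Int)
handleRequest store key = Map.lookup key <$> readTVarIO store

main :: IO ()
main = do
  -- gather the data at startup
  store <- newTVarIO (Map.fromList [("answer", 42)])
  -- mutate it later if the API allows that
  atomically $ modifyTVar' store (Map.insert "extra" 7)
  -- serve queries from any number of threads
  handleRequest store "answer" >>= print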
First off, I'm going to assume you mean it needs to be available to multiple threads, not multiple processes. (The difference being that threads share memory, processes do not.) If that assumption is wrong, much of your question doesn't make sense.
So, the first important point: Haskell has mutable data structures. They can easily be shared between threads. Here's a small example:
import Control.Concurrent
import Control.Monad

main :: IO ()
main = do
  v <- newMVar 0 :: IO (MVar Int)
  -- a writer thread that increments the shared counter forever
  forkIO . forever $ do
    x <- takeMVar v
    putMVar v $! x + 1
  -- the main thread observes the counter ten times
  forM_ [1..10] $ \_ -> do
    x <- readMVar v
    threadDelay 100
    print x
Note the use of ($!) when putting the value in the MVar. MVars don't enforce that their contents are evaluated. There's some subtlety in making sure everything works properly. You will get lots of space leaks until you understand Haskell's evaluation model. That's part of why this sort of thing is usually done in a library that handles all those details.
Given this, the first pass approach is to just store a map of some sort in an MVar. Unless it's under a lot of contention, that actually has pretty good performance properties.
When it is under contention, you have a good fallback secondary approach, especially when using a hash map. That's striping. Instead of storing one map in one MVar, use N maps in N MVars. The first step in a lookup is using the hash to determine which of the N MVars to look in.
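A minimal sketch of the striping idea (the stripe count, Data.Hashable, and the container choices here are illustrative assumptions, not a finished library):
import Control.Concurrent.MVar (MVar, modifyMVar_, newMVar, readMVar)
import Data.Hashable (Hashable, hash)
import qualified Data.HashMap.Strict as HM
import qualified Data.Vector as V

type Striped k v = V.Vector (MVar (HM.HashMap k v))

newStriped :: Int -> IO (Striped k v)
newStriped n = V.replicateM n (newMVar HM.empty)

-- The key's hash picks the stripe, so we lock only 1/N of the map.
stripeFor :: Hashable k => Striped k v -> k -> MVar (HM.HashMap k v)
stripeFor s k = s V.! (hash k `mod` V.length s)

insert :: (Eq k, Hashable k) => Striped k v -> k -> v -> IO ()
insert s k v = modifyMVar_ (stripeFor s k) (\m -> return $! HM.insert k v m)
-- ($!) avoids leaving an unevaluated map thunk in the MVar, per the note above

lookupStriped :: (Eq k, Hashable k) => Striped k v -> k -> IO (Maybe v)
lookupStriped s k = HM.lookup k <$> readMVar (stripeFor s k)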
There are fancy lock-free algorithms, which could be implemented using finer-grained mutable values. But in general, they are a lot of engineering effort for a few percent improvement in performance that doesn't really matter in most use cases.
Haskell is functional and pure, so basically it has all the properties needed for a compiler to be able to tackle implicit parallelism.
Consider this trivial example:
f = do
  a <- Just 1
  b <- Just $ Just 2
  -- ^ The above line does not utilize an `a` variable, so it can be safely
  -- executed in parallel with the preceding line
  c <- b
  -- ^ The above line references a `b` variable, so it can only be executed
  -- sequentially after it
  return (a, c)
  -- On the exit from a monad scope we wait for all computations to finish and
  -- gather the results
Schematically the execution plan can be described as:
do
|
+---------+---------+
|                   |
a <- Just 1         b <- Just $ Just 2
|                   |
|                   c <- b
|                   |
+---------+---------+
|
return (a, c)
Why is there no such functionality implemented in the compiler with a flag or a pragma yet? What are the practical reasons?
This is a long studied topic. While you can implicitly derive parallelism in Haskell code, the problem is that there is too much parallelism, at too fine a grain, for current hardware.
So you end up spending effort on bookkeeping, not on running things faster. Since we don't have infinite parallel hardware, it is all about picking the right granularity -- too coarse and there will be idle processors, too fine and the overheads will be unacceptable.
What we have is more coarse grained parallelism (sparks) suitable for generating thousands or millions of parallel tasks (so not at the instruction level), which map down onto the mere handful of cores we typically have available today.
Note that for some subsets (e.g. array processing) there are fully automatic parallelization libraries with tight cost models.
For background on this see Feedback Directed Implicit Parallelism, where they introduce an automated approach to the insertion of par in arbitrary Haskell programs.
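For a sense of what such an insertion looks like when done by hand (a sketch with the parallel package's Control.Parallel; nfib stands in for any expensive pure call):
import Control.Parallel (par, pseq)

-- Spark the first recursive call so it may run on another core,
-- evaluate the second locally, then combine.
nfib :: Int -> Integer
nfib n
  | n < 2     = 1
  | otherwise = x `par` (y `pseq` x + y)
  where
    x = nfib (n - 1)
    y = nfib (n - 2)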
While your code block may not be the best example due to implicit data dependence between the a and b, it is worth noting that these two bindings commute, in that
f = do
  a <- Just 1
  b <- Just $ Just 2
  ...
will give the same results as
f = do
  b <- Just $ Just 2
  a <- Just 1
  ...
so this could still be parallelized in a speculative fashion. It is worth noting that this does not need to have anything to do with monads. We could, for instance, evaluate all independent expressions in a let-block in parallel, or we could introduce a version of let that would do so. The lparallel library for Common Lisp does this.
Now, I am by no means an expert on the subject, but this is my understanding of the problem.
A major stumbling block is determining when it is advantageous to parallelize the evaluation of multiple expressions. There is overhead associated with starting separate threads for evaluation, and, as your example shows, it may result in wasted work. Some expressions may be too small to make parallel evaluation worth the overhead. As I understand it, coming up with a fully accurate metric of the cost of an expression would amount to solving the halting problem, so you are relegated to a heuristic approach to determining what to evaluate in parallel.
Then, it is not always faster to throw more cores at a problem. Even when explicitly parallelizing a problem with the many Haskell libraries available, you will often not see much speedup just by evaluating expressions in parallel, due to heavy memory allocation and usage and the strain this puts on the garbage collector and CPU cache. You end up needing a nice compact memory layout and to traverse your data intelligently. Having 16 threads traverse linked lists will just bottleneck you at your memory bus and could actually make things slower.
At the very least, which expressions can be effectively parallelized is something that is not obvious to many programmers (at least it isn't to this one), so getting a compiler to do it effectively is non-trivial.
Short answer: Sometimes running stuff in parallel turns out to be slower, not faster. And figuring out when it is and when it isn't a good idea is an unsolved research problem.
However, you still can be "suddenly utilizing all those cores without ever bothering with threads, deadlocks and race conditions". It's not automatic; you just need to give the compiler some hints about where to do it! :-D
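For example, a single Strategy annotation is often the only hint needed (a sketch using the parallel package; expensive is a made-up placeholder for real work):
import Control.Parallel.Strategies (parMap, rdeepseq)

-- A deliberately expensive pure function (hypothetical stand-in).
expensive :: Int -> Integer
expensive n = sum [1 .. 1000000 + fromIntegral n]

main :: IO ()
main = print (sum (parMap rdeepseq expensive [1 .. 8]))
-- compile with -threaded and run with +RTS -N to use all cores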
One of the reasons is that Haskell is non-strict and does not evaluate anything by default. In general the compiler does not know whether the computations of a and b terminate, hence trying to compute them would be a waste of resources:
x :: Maybe ([Int], [Int])
x = Just undefined
y :: Maybe ([Int], [Int])
y = Just (undefined, undefined)
z :: Maybe ([Int], [Int])
z = Just ([0], [1..])
a :: Maybe ([Int], [Int])
a = undefined
b :: Maybe ([Int], [Int])
b = Just ([0], map fib [0..])
  where fib 0 = 1
        fib 1 = 1
        fib n = fib (n - 1) + fib (n - 2)
Consider them with the following functions:
main1 x = case x of
  Just _  -> putStrLn "Just"
  Nothing -> putStrLn "Nothing"
The (a, b) part does not need to be evaluated: as soon as you know that x matches Just _, you can proceed to the branch. Hence main1 will work for all of the values above except a.
main2 x = case x of
  Just (_, _) -> putStrLn "Just"
  Nothing     -> putStrLn "Nothing"
This function forces evaluation of the tuple. Hence x will terminate with an error, while the rest will work.
main3 x = case x of
  Just (a, b) -> print a >> print b
  Nothing     -> putStrLn "Nothing"
This function will print the first list and then the second. It will work for z (printing an infinite stream of numbers, which Haskell can deal with). b will eventually run out of memory.
Now in general you don't know if computation terminates or not and how many resources it will consume. Infinite lists are perfectly fine in Haskell:
main = maybe (return ()) (print . take 5 . snd) b -- Prints the first 5 Fibonacci numbers
Hence spawning threads to evaluate expressions in Haskell might try to evaluate something which is not meant to be evaluated fully -- say, a list of all primes -- which programmers nevertheless use as part of a structure. The above examples are very simple, and you may argue that the compiler could notice them; however, that is not possible in general due to the halting problem (you cannot write a program which takes an arbitrary program and its input and checks whether it terminates), so this is not a safe optimization.
In addition - as mentioned in other answers - it is hard to predict whether the overhead of an additional thread is worth it. Even though GHC doesn't spawn new threads for sparks - they are multiplexed onto a fixed number of kernel threads (setting aside a few exceptions) - you still need to move data from one core to another and synchronize between them, which can be quite costly.
However, Haskell does have guided parallelization, which doesn't break the purity of the language, via par and similar functions.
Actually there was such an attempt, though not on common hardware, due to the low number of cores available. The project is called Reduceron. It runs Haskell code with a high degree of parallelism. If it were ever released as a proper 2 GHz ASIC core, we'd have a serious breakthrough in Haskell execution speed.
I'm currently digesting the nice presentation Why learn Haskell? by Keegan McAllister. There he uses the snippet
minimum = head . sort
as an illustration of Haskell's lazy evaluation by stating that minimum has time-complexity O(n) in Haskell. However, I think the example is kind of academic in nature. I'm therefore asking for a more practical example where it's not trivially apparent that most of the intermediate calculations are thrown away.
Have you ever written an AI? Isn't it annoying that you have to thread pruning information (e.g. maximum depth, the minimum cost of an adjacent branch, or other such information) through the tree traversal function? This means you have to write a new tree traversal every time you want to improve your AI. That's dumb. With lazy evaluation, this is no longer a problem: write your tree traversal function once, to produce a huge (maybe even infinite!) game tree, and let your consumer decide how much of it to consume.
Writing a GUI that shows lots of information? Want it to run fast anyway? In other languages, you might have to write code that renders only the visible scenes. In Haskell, you can write code that renders the whole scene, and then later choose which pixels to observe. Similarly, rendering a complicated scene? Why not compute an infinite sequence of scenes at various detail levels, and pick the most appropriate one as the program runs?
You write an expensive function, and decide to memoize it for speed. In other languages, this requires building a data structure that tracks which inputs for the function you know the answer to, and updating the structure as you see new inputs. Remember to make it thread safe -- if we really need speed, we need parallelism, too! In Haskell, you build an infinite data structure, with an entry for each possible input, and evaluate the parts of the data structure that correspond to the inputs you care about. Thread safety comes for free with purity.
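A minimal sketch of that idea is the classic list-memoized Fibonacci, where an infinite list plays the role of the table with an entry for every input:
-- fibs is an infinite, lazily built table of results; only the
-- entries we actually index are computed, each at most once.
fibs :: [Integer]
fibs = map fib [0 ..]

fib :: Int -> Integer
fib 0 = 0
fib 1 = 1
fib n = fibs !! (n - 1) + fibs !! (n - 2)

main :: IO ()
main = print (fib 100) -- fast: a linear number of additions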
Here's one that's perhaps a bit more prosaic than the previous ones. Have you ever found a time when && and || weren't the only things you wanted to be short-circuiting? I sure have! For example, I love the <|> function for combining Maybe values: it takes the first one of its arguments that actually has a value. So Just 3 <|> Nothing = Just 3; Nothing <|> Just 7 = Just 7; and Nothing <|> Nothing = Nothing. Moreover, it's short-circuiting: if it turns out that its first argument is a Just, it won't bother doing the computation required to figure out what its second argument is.
And <|> isn't built in to the language; it's tacked on by a library. That is: laziness allows you to write brand new short-circuiting forms. (Indeed, in Haskell, even the short-circuiting behavior of (&&) and (||) aren't built-in compiler magic: they arise naturally from the semantics of the language plus their definitions in the standard libraries.)
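For reference, the standard definitions are just ordinary pattern matches; laziness in the second argument is what makes them short-circuit:
import Prelude hiding ((&&), (||))

(&&) :: Bool -> Bool -> Bool
True  && x = x
False && _ = False -- the second argument is never forced

(||) :: Bool -> Bool -> Bool
True  || _ = True  -- the second argument is never forced
False || x = x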
In general, the common theme here is that you can separate the production of values from the determination of which values are interesting to look at. This makes things more composable, because the choice of what is interesting to look at need not be known by the producer.
Here's a well-known example I posted to another thread yesterday. Hamming numbers are numbers that don't have any prime factors larger than 5. I.e. they have the form 2^i*3^j*5^k. The first 20 of them are:
[1,2,3,4,5,6,8,9,10,12,15,16,18,20,24,25,27,30,32,36]
The 500000th one is:
1962938367679548095642112423564462631020433036610484123229980468750
The program that printed the 500000th one (after a brief moment of computation) is:
-- merge two ascending lists, dropping duplicates
merge xxs@(x:xs) yys@(y:ys) =
  case x `compare` y of
    LT -> x : merge xs  yys
    EQ -> x : merge xs  ys
    GT -> y : merge xxs ys

hamming = 1 : m 2 `merge` m 3 `merge` m 5
  where
    m k = map (k *) hamming

main = print (hamming !! 499999)
Computing that number with reasonable speed in a non-lazy language takes quite a bit more code and head-scratching. There are a lot more examples like this.
Consider generating and consuming the first n elements of an infinite sequence. Without lazy evaluation, the naive encoding would run forever in the generation step, and never consume anything. With lazy evaluation, only as many elements are generated as the code tries to consume.
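A tiny self-contained illustration of that split (iterate describes the infinite sequence; take decides how much of it is ever forced):
-- the producer describes all powers of two; the consumer forces five
powersOfTwo :: [Integer]
powersOfTwo = iterate (* 2) 1

main :: IO ()
main = print (take 5 powersOfTwo) -- [1,2,4,8,16]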
I often see the usage and explanation of Haskell's parallel strategies connected to pure computations (for example fib). However, I do not often see them used with monadic constructions: is there a reasonable interpretation of the effect of par and related functions when applied to ST s or IO? Would any speedup be gained from such a usage?
Parallelism in the IO monad is more correctly called "Concurrency", and is supported by forkIO and friends in the Control.Concurrent module.
The difficulty with parallelising the ST monad is that ST is necessarily single-threaded - that's its purpose. There is a lazy variant of the ST monad, Control.Monad.ST.Lazy, which in principle could support parallel evaluation, but I'm not aware of anyone having tried to do this.
There's a new monad for parallel evaluation called Eval, which can be found in recent versions of the parallel package. I recommend using the Eval monad with rpar and rseq instead of par and pseq these days, because it leads to more robust and readable code. For example, the usual fib example can be written
fib n = if n < 2 then 1 else
  runEval $ do
    x <- rpar (fib (n-1))
    y <- rseq (fib (n-2))
    return (x+y)
There are some situations where this makes sense, but in general you shouldn't do it. Examine the following:
doPar =
  let a = unsafePerformIO $ someIOCalc 1
      b = unsafePerformIO $ someIOCalc 2
  in a `par` b `pseq` a+b
In doPar, a calculation for a is sparked, then the main thread evaluates b. But it's possible that, after the main thread finishes the calculation of b, it will begin to evaluate a as well. Now you have two threads evaluating a, meaning that some of the IO actions will be performed twice (or possibly more). But if one thread finishes evaluating a, the other will just drop what it's done so far. In order for this to be safe, you need a few things to be true:
It's safe for the IO actions to be performed multiple times.
It's safe for only some of the IO actions to be performed (e.g. there's no cleanup)
The IO actions are free of any race conditions. If one thread mutates some data when evaluating a, will the other thread also working on a behave sensibly? Probably not.
Any foreign calls are re-entrant (you need this for concurrency in general of course)
If your someIOCalc looks like this
someIOCalc n = do
  prelaunchMissiles
  threadDelay n
  launchMissiles
it's absolutely not safe to use this with par and unsafePerformIO.
Now, is it ever worth it? Maybe. Sparks are cheap, even cheaper than threads, so in theory it should be a performance gain. In practice, perhaps not so much. Roman Leshchinskiy has a nice blog post about this.
Personally, I've found it much simpler to reason about forkIO.