Why is there no implicit parallelism in Haskell?


Haskell is functional and pure, so basically it has all the properties needed for a compiler to be able to tackle implicit parallelism.
Consider this trivial example:
f = do
  a <- Just 1
  b <- Just $ Just 2
  -- ^ The above line does not utilize an `a` variable, so it can be safely
  -- executed in parallel with the preceding line
  c <- b
  -- ^ The above line references a `b` variable, so it can only be executed
  -- sequentially after it
  return (a, c)
  -- On the exit from a monad scope we wait for all computations to finish and
  -- gather the results
Schematically the execution plan can be described as:
               do
                |
      +---------+---------+
      |                   |
  a <- Just 1      b <- Just $ Just 2
      |                   |
      |                 c <- b
      |                   |
      +---------+---------+
                |
           return (a, c)
Why is there no such functionality implemented in the compiler with a flag or a pragma yet? What are the practical reasons?

This is a long studied topic. While you can implicitly derive parallelism in Haskell code, the problem is that there is too much parallelism, at too fine a grain, for current hardware.
So you end up spending effort on bookkeeping, not running things faster.
Since we don't have infinite parallel hardware, it is all about picking the right granularity -- too coarse and there will be idle processors, too fine and the overheads will be unacceptable.
What we have is more coarse grained parallelism (sparks) suitable for generating thousands or millions of parallel tasks (so not at the instruction level), which map down onto the mere handful of cores we typically have available today.
Note that for some subsets (e.g. array processing) there are fully automatic parallelization libraries with tight cost models.
For background on this see Feedback Directed Implicit Parallelism, where they introduce an automated approach to the insertion of par in arbitrary Haskell programs.
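To make the granularity point concrete, here is a minimal sketch of an explicit spark with a size cutoff. The names parFib and cutoff are made up for illustration, and par and pseq are taken from GHC.Conc in base (the parallel package exposes the same functions via Control.Parallel). Whether this is actually faster depends on the cutoff and the machine, which is exactly the tuning problem described above.

```haskell
import GHC.Conc (par, pseq)

-- Plain sequential Fibonacci, used below the cutoff.
fib :: Int -> Integer
fib n
  | n < 2     = toInteger n
  | otherwise = fib (n - 1) + fib (n - 2)

-- Spark one branch only while the subproblem is large enough to repay
-- the bookkeeping. `cutoff` is an assumed tuning knob, not a library setting.
parFib :: Int -> Int -> Integer
parFib cutoff n
  | n < cutoff = fib n
  | otherwise  = x `par` (y `pseq` (x + y))
  where
    x = parFib cutoff (n - 1)
    y = parFib cutoff (n - 2)

main :: IO ()
main = print (parFib 20 27)
```

With GHC the sparks only run on other cores if the program is built with -threaded and started with +RTS -N; otherwise the code still gives the same answer, just sequentially.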

While your code block may not be the best example due to implicit data
dependence between the a and b, it is worth noting that these two
bindings commute in that
f = do
  a <- Just 1
  b <- Just $ Just 2
  ...
will give the same results as
f = do
  b <- Just $ Just 2
  a <- Just 1
  ...
so this could still be parallelized in a speculative fashion. It is worth noting that
this does not need to have anything to do with monads. We could, for instance, evaluate
all independent expressions in a let-block in parallel or we could introduce a
version of let that would do so. The lparallel library for Common Lisp does this.
Now, I am by no means an expert on the subject, but this is my understanding
of the problem.
A major stumbling block is determining when it is advantageous to parallelize the
evaluation of multiple expressions. There is overhead associated with starting
the separate threads for evaluation, and, as your example shows, it may result
in wasted work. Some expressions may be too small to make parallel evaluation
worth the overhead. As I understand it, coming up with a fully accurate metric
of the cost of an expression would amount to solving the halting problem, so
you are relegated to using a heuristic approach to determining what to
evaluate in parallel.
Then it is not always faster to throw more cores at a problem. Even when
explicitly parallelizing a problem with the many Haskell libraries available,
you will often not see much speedup just by evaluating expressions in parallel
due to heavy memory allocation and usage and the strain this puts on the garbage
collector and CPU cache. You end up needing a nice compact memory layout and
to traverse your data intelligently. Having 16 threads traverse linked lists will
just bottleneck you at your memory bus and could actually make things slower.
At the very least, what expressions can be effectively parallelized is something that is
not obvious to many programmers (at least it isn't to this one), so getting a compiler to
do it effectively is non-trivial.

Short answer: Sometimes running stuff in parallel turns out to be slower, not faster. And figuring out when it is and when it isn't a good idea is an unsolved research problem.
However, you still can be "suddenly utilizing all those cores without ever bothering with threads, deadlocks and race conditions". It's not automatic; you just need to give the compiler some hints about where to do it! :-D

One of the reasons is that Haskell is non-strict and does not evaluate anything by default. In general the compiler does not know whether the computation of a and b terminates, so trying to compute them could be a waste of resources:
x :: Maybe ([Int], [Int])
x = Just undefined
y :: Maybe ([Int], [Int])
y = Just (undefined, undefined)
z :: Maybe ([Int], [Int])
z = Just ([0], [1..])
a :: Maybe ([Int], [Int])
a = undefined
b :: Maybe ([Int], [Int])
b = Just ([0], map fib [0..])
  where fib 0 = 1
        fib 1 = 1
        fib n = fib (n - 1) + fib (n - 2)
Consider these values with the following functions:
main1 x = case x of
  Just _ -> putStrLn "Just"
  Nothing -> putStrLn "Nothing"
The (a, b) part does not need to be evaluated. As soon as you know that x = Just _ you can proceed to the branch, hence it will work for all of the values except a.
main2 x = case x of
  Just (_, _) -> putStrLn "Just"
  Nothing -> putStrLn "Nothing"
This function forces evaluation of the tuple. Hence x will terminate with an error while the rest (apart from a, as before) will work.
main3 x = case x of
  Just (a, b) -> print a >> print b
  Nothing -> putStrLn "Nothing"
This function will print the first list and then the second. It will work for z (printing an infinite stream of numbers, but Haskell can deal with that). b will eventually run out of memory.
Now in general you don't know if computation terminates or not and how many resources it will consume. Infinite lists are perfectly fine in Haskell:
main = maybe (return ()) (print . take 5 . snd) b -- Prints first 5 Fibonacci numbers
Hence spawning threads to evaluate expressions in Haskell might try to evaluate something which is not meant to be evaluated fully (say, a list of all primes) yet which programmers use as part of a structure. The above examples are very simple and you may argue that a compiler could notice them; however, that is not possible in general due to the halting problem (you cannot write a program which takes an arbitrary program and its input and checks whether it terminates), so it is not a safe optimization.
In addition, as other answers mention, it is hard to predict whether the overhead of an additional thread is worth paying. Even though GHC doesn't spawn new threads for sparks, using green threading instead (with a fixed number of kernel threads, setting aside a few exceptions), you still need to move data from one core to another and synchronize between them, which can be quite costly.
However, Haskell does have guided parallelization that does not break the purity of the language, through par and similar functions.

Actually, there was such an attempt, though not on common hardware, because of the low number of cores available there. The project is called Reduceron. It runs Haskell code with a high level of parallelism. If it were ever released as a proper 2 GHz ASIC core, we would have a serious breakthrough in Haskell execution speed.

Related

Always guaranteed evaluation order of `seq` (with strange behavior of `pseq` in addition)

The documentation of seq function says the following:
A note on evaluation order: the expression seq a b does not guarantee that a will be evaluated before b. The only guarantee given by seq is that the both a and b will be evaluated before seq returns a value. In particular, this means that b may be evaluated before a. If you need to guarantee a specific order of evaluation, you must use the function pseq from the "parallel" package.
So I have a lazy version of sum function with accumulator:
sum :: Num a => [a] -> a
sum = go 0
  where
    go acc [] = acc
    go acc (x:xs) = go (x + acc) xs
Obviously, this is extremely slow on big lists. Now I'm rewriting this function using seq:
sum :: Num a => [a] -> a
sum = go 0
  where
    go acc [] = acc
    go acc (x:xs) = let acc' = x + acc
                    in acc' `seq` go acc' xs
And I see a huge performance increase! But I wonder how reliable it is. Did I get it by luck? GHC may evaluate the recursive call first (according to the documentation) and still accumulate thunks. It looks like I need to use pseq to ensure that acc' is always evaluated before the recursive call. But with pseq I see a performance decrease compared to the seq version. Numbers on my machine (for calculating sum [1 .. 10^7]):
naive: 2.6s
seq: 0.2s
pseq: 0.5s
I'm using GHC-8.2.2 and I compile with stack ghc -- File.hs command.
After I tried to compile with stack ghc -- -O File.hs command performance gap between seq and pseq is gone. They now both run in 0.2s.
So does my implementation exhibit the properties I want? Or does GHC have some implementation quirk? Why is pseq slower? Does there exist some example where seq a b has different results depending on evaluation order (same code but different compiler flags/different compilers/etc.)?

The answers so far have focused on the seq versus pseq performance issues, but I think you originally wanted to know which of the two you should use.
The short answer is: while both should generate nearly identically performing code in practice (at least when proper optimization flags are turned on), the primitive seq, and not pseq, is the correct choice for your situation. Using pseq is non-idiomatic, confusing, and potentially counterproductive from a performance standpoint, and your reason for using it is based on a flawed understanding of what its order-of-evaluation guarantee means and what it implies with respect to performance. While there are no guarantees about performance across different sets of compiler flags (much less across other compilers), if you ever run into a situation where the seq version of the above code runs significantly slower than the pseq version using "production quality" optimization flags with the GHC compiler, you should consider it a GHC bug and file a bug report.
The long answer is, of course, longer...
First, let's be clear that seq and pseq are semantically identical in the sense that they both satisfy the equations:
seq _|_ b = _|_
seq a b = b -- if a is not _|_
pseq _|_ b = _|_
pseq a b = b -- if a is not _|_
This is really the only thing that either of them guarantees semantically, and since the definition of the Haskell language (as given in, say, the Haskell report) only makes -- at best -- semantic guarantees and does not deal with performance or implementation, there's no reason to choose between one or the other for reasons of guaranteed performance across different compilers or compiler flags.
Furthermore, in your particular seq-based version of the function sum, it's not too difficult to see that there is no situation in which seq is called with an undefined first argument but a defined second argument (assuming the use of a standard numeric type), so you aren't even using the semantic properties of seq. You could re-define seq as seq a b = b and have exactly the same semantics. Of course, you know this -- that's why your first version didn't use seq. Instead, you're using seq for an incidental performance side-effect, so we're out of the realm of semantic guarantees and back in the realm of specific GHC compiler implementation and performance characteristics (where there aren't really any guarantees to speak of).
Second, that brings us to the intended purpose of seq. It is rarely used for its semantic properties because those properties aren't very useful. Who would want a computation seq a b to return b except that it should fail to terminate if some unrelated expression a fails to terminate? (The exceptions -- no pun intended -- would be things like handling exceptions, where you might use seq or deepseq, which is based on seq, to force evaluation of a non-terminating expression in either an uncontrolled or controlled way, before starting evaluation of another expression.)
Instead, seq a b is intended to force evaluation of a to weak head normal form before returning the result of b to prevent accumulation of thunks. The idea is, if you have an expression b which builds a thunk that could potentially accumulate on top of another unevaluated thunk represented by a, you can prevent that accumulation by using seq a b. The "guarantee" is a weak one: GHC guarantees that it understands you don't want a to remain an unevaluated thunk when seq a b's value is demanded. Technically, it doesn't guarantee that a will be "evaluated before" b, whatever that means, but you don't need that guarantee. When you worry that, without this guarantee, GHC might evaluate the recursive call first and still accumulate thunks, this is as ridiculous as worrying that pseq a b might evaluate its first argument, then wait 15 minutes (just to make absolutely sure the first argument has been evaluated!), before evaluating its second.
This is a situation where you should trust GHC to do the right thing. It may seem to you that the only way to realize the performance benefit of seq a b is for a to be evaluated to WHNF before evaluation of b starts, but it is conceivable that there are optimizations in this or other situations that technically start evaluating b (or even fully evaluate b to WHNF) while leaving a unevaluated for a short time to improve performance while still preserving the semantics of seq a b. By using pseq instead, you may prevent GHC from making such optimizations. (In your sum program situation, there undoubtedly is no such optimization, but in a more complex use of seq, there might be.)
Third, it's important to understand what pseq is actually for. It was first described in Marlow 2009 in the context of concurrent programming. Suppose we want to parallelize two expensive computations foo and bar and then combine (say, add) their results:
foo `par` (bar `seq` foo+bar) -- parens redundant but included for clarity
The intention here is that -- when this expression's value is demanded -- it creates a spark to compute foo in parallel and then, via the seq expression, starts evaluating bar to WHNF (i.e., its numeric value, say) before finally evaluating foo+bar which will wait on the spark for foo before adding and returning the results.
Here, it's conceivable that GHC will recognize that for a specific numeric type, (1) foo+bar automatically fails to terminate if bar does, satisfying the formal semantic guarantee of seq; and (2) evaluating foo+bar to WHNF will automatically force evaluation of bar to WHNF preventing any thunk accumulation and so satisfying the informal implementation guarantee of seq. In this situation, GHC may feel free to optimize the seq away to yield:
foo `par` foo+bar
particularly if it feels that it would be more performant to start evaluation of foo+bar before finishing evaluating bar to WHNF.
What GHC isn't smart enough to realize is that -- if evaluation of foo in foo+bar starts before the foo spark is scheduled, the spark will fizzle, and no parallel execution will occur.
It's really only in this case, where you need to explicitly delay demanding the value of a sparked expression to allow an opportunity for it to be scheduled before the main thread "catches up" that you need the extra guarantee of pseq and are willing to have GHC forgo additional optimization opportunities permitted by the weaker guarantee of seq:
foo `par` (bar `pseq` foo+bar)
Here, pseq will prevent GHC from introducing any optimization that might allow foo+bar to start evaluating (potentially fizzling the foo spark) before bar is in WHNF (which, we hope, allows enough time for the spark to be scheduled).
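As a rough, runnable version of that last expression, the sketch below fills in foo and bar with a made-up computation (expensive and parSum are illustrative names only) and takes par and pseq from GHC.Conc in base, which is where Control.Parallel gets them as far as I know.

```haskell
import GHC.Conc (par, pseq)

-- Stand-in for an expensive computation: sum of squares up to n.
expensive :: Int -> Integer
expensive n = sum [toInteger i * toInteger i | i <- [1 .. n]]

-- Spark foo, evaluate bar on the current thread, and only then
-- demand foo + bar, so the spark has a chance to be picked up first.
parSum :: Int -> Int -> Integer
parSum m n = foo `par` (bar `pseq` (foo + bar))
  where
    foo = expensive m
    bar = expensive n

main :: IO ()
main = print (parSum 2000000 2000000)
```

The result is the same whether or not the spark is ever run in parallel; the ordering only affects how much work overlaps.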
The upshot is that, if you're using pseq for anything other than concurrent programming, you're using it wrong. (Well, maybe there are some weird situations, but...) If all you want to do is force strict evaluation and/or thunk evaluation to improve performance in non-concurrent code, using seq (or $! which is defined in terms of seq or Haskell strict data types which are defined in terms of $!) is the correct approach.
(Or, if @Kindaro's benchmarks are to be believed, maybe merciless benchmarking with specific compiler versions and flags is the correct approach.)
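For comparison, here is a small sketch of the two seq-based alternatives named above, applied to the same accumulator loop: one forces the accumulator with $!, the other keeps it in a strict constructor field. The names sumStrict, sumField and Acc are invented for this example, and no performance claim is attached to either version.

```haskell
-- Accumulator forced with ($!) before each recursive call,
-- so no chain of (+) thunks builds up.
sumStrict :: Num a => [a] -> a
sumStrict = go 0
  where
    go acc []     = acc
    go acc (x:xs) = (go $! x + acc) xs

-- The same loop with the accumulator held in a strict field:
-- matching on Acc forces the value stored by the previous step.
data Acc a = Acc !a

sumField :: Num a => [a] -> a
sumField = unwrap . go (Acc 0)
  where
    unwrap (Acc a)    = a
    go acc []         = acc
    go (Acc a) (x:xs) = go (Acc (x + a)) xs

main :: IO ()
main = do
  print (sumStrict [1 .. 1000000 :: Int])
  print (sumField  [1 .. 1000000 :: Int])
```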

I only see such a difference with optimizations turned off.
With ghc -O both pseq and seq perform the same.
The relaxed semantics of seq allow transformations resulting in slower code indeed. I can't think of a situation where that actually happens. We just assume GHC does the right thing. Unfortunately, we don't have a way to express that behavior in terms of a high-level semantics for Haskell.
Why is pseq slower?
pseq x y = x `seq` lazy y
pseq is thus implemented using seq. The observed overhead is due to the extra indirection of calling pseq.
Even if these ultimately get optimized away, it may not necessarily be a good idea to use pseq instead of seq. While the stricter ordering semantics seem to imply the intended effect (that go does not accumulate a thunk), it may disable some further optimizations: perhaps evaluating x and evaluating y can be decomposed into low-level operations, some of which we wouldn't mind to cross the pseq boundary.
Does there exist some example where seq a b has different results depending on evaluation order (same code but different compiler flags/different compilers/etc.)?
This can throw either "a" or "b".
seq (error "a") (error "b")
I guess there is a rationale explained in the paper about exceptions in Haskell, A Semantics for imprecise exceptions.
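A small way to observe this, sketched with Control.Exception from base (whichError is a made-up name): the program catches whichever ErrorCall surfaces and reports its message. Which of the two messages you get may depend on the compiler version and flags, so no particular output is promised here.

```haskell
import Control.Exception (ErrorCall (..), evaluate, try)

-- Force the seq expression and report which error was actually raised.
whichError :: IO String
whichError = do
  r <- try (evaluate (seq (error "a" :: Int) (error "b" :: Int)))
  case r of
    Left (ErrorCall msg) -> return msg
    Right _              -> return "no exception"

main :: IO ()
main = whichError >>= putStrLn
```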

Edit: My theory was foiled, as the timings I observed were in fact heavily skewed by the influence of profiling itself; with profiling off, the data goes against the theory. Moreover, the timings vary quite a bit between versions of GHC. I am collecting better observations even now, and I will further edit this answer as I arrive at a conclusive point.
Concerning the question "why pseq is slower", I have a theory.
Let us re-phrase acc' `seq` go acc' xs as strict (go (strict acc') xs).
Similarly, acc' `pseq` go acc' xs is re-phrased as lazy (go (strict acc') xs).
Now, let us re-phrase go acc (x:xs) = let ... in ... to go acc (x:xs) = strict (go (x + acc) xs) in the case of seq.
And to go acc (x:xs) = lazy (go (x + acc) xs) in the case of pseq.
Now, it is easy to see that, in the case of pseq, go gets assigned a lazy thunk that will be evaluated at some later point. In the definition of sum, go never appears to the left of pseq, and thus, during the run of sum, the evaluation will not be forced at all. Moreover, this happens for every recursive call of go, so thunks accumulate.
This is a theory built from thin air, but I do have a partial proof. Specifically, I did find out that go allocates linear memory in pseq case but not in the case of seq. You may see for yourself if you run the following shell commands:
for file in SumNaive.hs SumPseq.hs SumSeq.hs
do
    stack ghc \
        --library-profiling \
        --package parallel \
        -- \
        $file \
        -main-is ${file%.hs} \
        -o ${file%.hs} \
        -prof \
        -fprof-auto
done
for file in SumNaive.hs SumSeq.hs SumPseq.hs
do
    time ./${file%.hs} +RTS -P
done
And compare the memory allocation of the go cost centre:
COST CENTRE           ...  ticks  bytes
SumNaive.prof:sum.go  ...  782    559999984
SumPseq.prof:sum.go   ...  669    800000016
SumSeq.prof:sum.go    ...  161    0
postscriptum
Since there appears to be discord on the question of which optimizations actually play to what effect, I am putting my exact source code and time measures, so that there is a common baseline.
SumNaive.hs
module SumNaive where
import Prelude hiding (sum)
sum :: Num a => [a] -> a
sum = go 0
  where
    go acc [] = acc
    go acc (x:xs) = go (x + acc) xs
main = print $ sum [1..10^7]
SumSeq.hs
module SumSeq where
import Prelude hiding (sum)
sum :: Num a => [a] -> a
sum = go 0
  where
    go acc [] = acc
    go acc (x:xs) = let acc' = x + acc
                    in acc' `seq` go acc' xs
main = print $ sum [1..10^7]
SumPseq.hs
module SumPseq where
import Prelude hiding (sum)
import Control.Parallel (pseq)
sum :: Num a => [a] -> a
sum = go 0
  where
    go acc [] = acc
    go acc (x:xs) = let acc' = x + acc
                    in acc' `pseq` go acc' xs
main = print $ sum [1..10^7]
Time without optimizations:
./SumNaive +RTS -P 4.72s user 0.53s system 99% cpu 5.254 total
./SumSeq +RTS -P 0.84s user 0.00s system 99% cpu 0.843 total
./SumPseq +RTS -P 2.19s user 0.22s system 99% cpu 2.408 total
Time with -O:
./SumNaive +RTS -P 0.58s user 0.00s system 99% cpu 0.584 total
./SumSeq +RTS -P 0.60s user 0.00s system 99% cpu 0.605 total
./SumPseq +RTS -P 1.91s user 0.24s system 99% cpu 2.147 total
Time with -O2:
./SumNaive +RTS -P 0.57s user 0.00s system 99% cpu 0.570 total
./SumSeq +RTS -P 0.61s user 0.01s system 99% cpu 0.621 total
./SumPseq +RTS -P 1.92s user 0.22s system 99% cpu 2.137 total
It may be seen that:
The naive variant has poor performance without optimizations, but excellent performance with either -O or -O2, to the extent that it outperforms all others.
The seq variant has good performance that is improved very little by optimizations, so that with either -O or -O2 the naive variant outperforms it.
The pseq variant has consistently poor performance: about twice as fast as the naive variant without optimization, and about four times slower than the others with either -O or -O2. Optimization affects it about as little as the seq variant.

Parallel tree search

Let's say I have a lazy Tree whose leaves are possible solutions to a problem
data Tree a = Node [Tree a] | Leaf (Maybe a)
I need to find just one solution (or find out that there are none).
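As a sequential baseline for what "find just one solution" means on this type, here is a minimal sketch; solve is a name chosen for illustration and asum comes from Data.Foldable in base. It is not a parallel search, only the lazy depth-first version that a parallel one would have to beat.

```haskell
import Data.Foldable (asum)

data Tree a = Node [Tree a] | Leaf (Maybe a)

-- Leftmost solution, depth first. Because (<|>) on Maybe ignores its
-- right argument once it has a Just, later branches are never forced.
solve :: Tree a -> Maybe a
solve (Leaf m)  = m
solve (Node ts) = asum (map solve ts)

main :: IO ()
main = print (solve example)
  where
    example :: Tree Int
    example = Node [Leaf Nothing, Node [Leaf (Just 3)], Leaf (Just 4)]
```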
I have a P-core machine. From both time and memory efficiency considerations, it only makes sense to search along P different branches in parallel.
For example, suppose you have four branches of about the same computational complexity (corresponding to T seconds of CPU time), and each of them has an answer.
If you evaluate all four branches truly in parallel on a dual-core machine, then they all will finish in about 2T seconds.
If you evaluate just the first two branches and postpone the other two, then you'll get an answer in only T seconds, also using half as much memory.
My question is, is it possible to use any of the parallel Haskell infrastructure (Par monad, parallel strategies, ...) to achieve this, or do I have to use lower-level tools like async?

Both Strategies and the Par monad will only start evaluating a new parallel task if there is a CPU available, so in your example with four branches on a 2-core machine, only two will be evaluated. Furthermore, Strategies will GC the other tasks once you have an answer (although it might take a while to do that).
However, if each of those two branches creates more tasks, then you probably wanted to give priority to the newer tasks (i.e., depth-first), but Strategies at least will give priority to the older tasks. The Par monad I think gives priority to the newer ones (but I'd have to check that), however the Par monad will evaluate all the tasks before returning an answer, because that is how it enforces determinism.
So probably the only way to get this to work exactly as you want it, at the moment, is to write a custom scheduler for the Par monad.

At least the Par monad and the strategies from the parallel package only allow building pure, unconditional parallel systems, which look pretty in pictures like this:
    a
   / \
  b   c
   \ / \
    d   e
     \ ...
While in the general case you really need impure inter-thread communication:
solve :: Tree a -> Maybe a
smartPartition :: Tree a -> Int -> [[Tree a]]
smartPartition tree p = ... -- split the tree in fairly even chunks,
                            -- one per each machine core
solveP :: Tree a -> IO (Maybe a)
solveP tree = do
    resRef <- newIORef Nothing
    p <- getNumCapabilities -- number of cores available to the runtime
    -- work is defined here so that it can see resRef
    let work [] = return Nothing
        work (t:ts) = do
            res <- readIORef resRef
            if isJust res then return res else do
                let tRes = solve t
                if isNothing tRes then work ts else do
                    writeIORef resRef tRes
                    return tRes
    -- parallel :: [IO a] -> IO [a] is assumed to come from a package
    -- such as parallel-io
    results <- parallel (map work (smartPartition tree p))
    return (msum results)
However, if your single-leaf computations are sufficiently and equally expensive, using strategies should not (I'm not sure) harm performance much:
partitionLeafs :: Tree a -> Int -> [[Tree a]]
solveP :: NFData a => Int -> Tree a -> Maybe a
solveP p = msum . map step . transpose . flip partitionLeafs p
  where step = msum . parMap rdeepseq solve
P.S. I feel that I understand the problem area no better than you do, so you likely already know all of the above. I wrote this answer to move the discussion along, because I find the question very interesting.

Benefit of avoiding multiple list traversals

I've seen many examples in functional languages about processing a list and constructing a function to do something with its elements after receiving some additional value (usually not present at the time the function was generated), such as:
Calculating the difference between each element and the average
(the last 2 examples under "Lazy Evaluation")
Staging a list append in strict functional languages such as ML/OCaml, to avoid traversing the first list more than once
(the section titled "Staging")
Comparing a list to another with foldr (i.e. generating a function to compare another list to the first)
listEq a b = foldr comb null a b
  where comb x frec [] = False
        comb x frec (e:es) = x == e && frec es
cmp1To10 = listEq [1..10]
In all these examples, the authors generally remark the benefit of traversing the original list only once. But I can't keep myself from thinking "sure, instead of traversing a list of N elements, you are traversing a chain of N evaluations, so what?". I know there must be some benefit to it, could someone explain it please?
Edit: Thanks to both for the answers. Unfortunately, that's not what I wanted to know. I'll try to clarify my question, so it's not confused with the (more common) one about creating intermediate lists (which I already read about in various places). Also thanks for correcting my post formatting.
I'm interested in the cases where you construct a function to be applied to a list, where you don't yet have the necessary value to evaluate the result (be it a list or not). Then you can't avoid generating references to each list element (even if the list structure is not referenced anymore). And you have the same memory accesses as before, but you don't have to deconstruct the list (pattern matching).
For example, see the "staging" chapter in the mentioned ML book. I've tried it in ML and Racket, more specifically the staged version of "append" which traverses the first list and returns a function to insert the second list at the tail, without traversing the first list many times. Surprisingly for me, it was much faster even considering it still had to copy the list structure as the last pointer was different on each case.
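For readers who have not seen the staged append, this is roughly what it looks like when transcribed to Haskell. The name stagedAppend is made up here, and the sketch is only meant to show the shape of the technique: the first list is walked once by foldr, and the result is a function that is told later what goes at the tail. In a lazy language with an optimizing compiler it may well behave no differently from plain (++).

```haskell
-- Walk the first list once and return a function that only needs
-- the list to be placed at the tail.
stagedAppend :: [a] -> ([a] -> [a])
stagedAppend = foldr (\x rest -> \ys -> x : rest ys) id

main :: IO ()
main = do
  let appendTo123 = stagedAppend [1, 2, 3 :: Int]
  print (appendTo123 [4, 5])
  print (appendTo123 [])
```

The same staged function is reused for both calls, which is the situation the staging argument is about.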
The following is a variant of map which, once applied to a list, should be faster when only the function changes. As Haskell is not strict, I would have to force the evaluation of listMap [1..100000] in cachedList (or maybe not, as after the first application it should still be in memory).
listMap = foldr comb (const [])
  where comb x rest = \f -> f x : rest f
cachedList = listMap [1..100000]
doubles = cachedList (2*)
squares = cachedList (\x -> x*x)
-- print doubles and squares
-- ...
I know in Haskell it doesn't make a difference (please correct me if I'm wrong) using comb x rest f = ... vs comb x rest = \f -> ..., but I chose this version to emphasize the idea.
Update: after some simple tests, I couldn't find any difference in execution times in Haskell. The question then is only about strict languages such as Scheme (at least the Racket implementation, where I tested it) and ML.

Executing a few extra arithmetic instructions in your loop body is cheaper than executing a few extra memory fetches, basically.
Traversals mean doing lots of memory access, so the less you do, the better. Fusion of traversals reduces memory traffic, and increases the straight line compute load, so you get better performance.
Concretely, consider this program to compute some math on a list:
go :: [Int] -> [Int]
go = map (+2) . map (^3)
Clearly, we design it with two traversals of the list. Between the first and the second traversal, a result is stored in an intermediate data structure. However, it is a lazy structure, so only costs O(1) memory.
Now, the Haskell compiler immediately fuses the two loops into:
go = map ((+2) . (^3))
Why is that? After all, both are O(n) complexity, right?
The difference is in the constant factors.
Considering this abstraction: for each step of the first pipeline we do:
i <- read memory -- cost M
j = i ^ 3 -- cost A
write memory j -- cost M
k <- read memory -- cost M
l = k + 2 -- cost A
write memory l -- cost M
so we pay 4 memory accesses, and 2 arithmetic operations.
For the fused result we have:
i <- read memory -- cost M
j = (i ^ 3) + 2 -- cost 2A
write memory j -- cost M
where A and M are the constant factors for doing math on the ALU and memory access.
There are other constant factors as well (two loop branches instead of one).
So unless memory access is free (it is not, by a long shot) then the second version is always faster.
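Written out per list element, using the same M and A as above (the C names are only labels for this comparison), the two cost tallies are:

```latex
\[
\begin{aligned}
C_{\text{two passes}} &= 4M + 2A \\
C_{\text{fused}}      &= 2M + 2A \\
C_{\text{two passes}} - C_{\text{fused}} &= 2M
\end{aligned}
\]
```

So the arithmetic is the same in both versions and the saving is two memory accesses per element, before counting the extra loop branch.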
Note that compilers that operate on immutable sequences can implement array fusion, the transformation that does this for you. GHC is such a compiler.
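The two formulations discussed above can be put side by side as a small sketch; twoPass and onePass are illustrative names. This only checks that they compute the same list. Whether GHC really rewrites one into the other depends on optimization flags, so inspecting the generated Core would be the way to confirm it for a given build.

```haskell
-- Written as two traversals.
twoPass :: [Int] -> [Int]
twoPass = map (+ 2) . map (^ (3 :: Int))

-- Written as one traversal with the functions composed.
onePass :: [Int] -> [Int]
onePass = map ((+ 2) . (^ (3 :: Int)))

main :: IO ()
main = print (twoPass [1 .. 5] == onePass [1 .. 5])
```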

There is another very important reason. If you traverse a list only once, and you have no other reference to it, the GC can release the memory claimed by the list elements as you traverse them. Moreover, if the list is generated lazily, you always have only a constant memory consumption. For example
import Data.List
main = do
  let xs = [1..10000000]
      sum = foldl' (+) 0 xs
      -- count the elements: bump the accumulator, ignore the element
      len = foldl' (\l _ -> l + 1) 0 xs
  print (sum / len)
computes sum, but needs to keep the reference to xs and the memory it occupies cannot be released, because it is needed to compute len later. (Or vice versa.) So the program consumes a considerable amount of memory, the larger xs the more memory it needs.
However, if we traverse the list only once, it is created lazily and the elements can be garbage-collected immediately, so no matter how big the list is, the program takes only O(1) memory.
{-# LANGUAGE BangPatterns #-}
import Data.List
main = do
  let xs = [1..10000000]
      (sum, len) = foldl' (\(!s,!l) x -> (s + x, l + 1)) (0, 0) xs
  print (sum / len)

Sorry in advance for a chatty-style answer.
That's probably obvious, but if we're talking about the performance, you should always verify hypotheses by measuring.
A couple of years ago I was thinking about the operational semantics of GHC, the STG machine. And I asked myself the same question — surely the famous "one-traversal" algorithms are not that great? It only looks like one traversal on the surface, but under the hood you also have this chain-of-thunks structure which is usually quite similar to the original list.
I wrote a few versions (varying in strictness) of the famous RepMin problem — given a tree filled with numbers, generate the tree of the same shape, but replace every number with the minimum of all the numbers. If my memory is right (remember — always verify stuff yourself!), the naive two-traversal algorithm performed much faster than various clever one-traversal algorithms.
I also shared my observations with Simon Marlow (we were both at an FP summer school during that time), and he said that they use this approach in GHC. But not to improve performance, as you might have thought. Instead, he said, for a big AST (such as Haskell's one) writing down all the constructors takes much space (in terms of lines of code), and so they just reduce the amount of code by writing down just one (syntactic) traversal.
Personally I avoid this trick because if you make a mistake, you get a loop which is a very unpleasant thing to debug.

So the answer to your question is, partial compilation. Done ahead of time, it makes it so that there's no need to traverse the list to get to the individual elements - all the references are found in advance and stored inside the pre-compiled function.
As to your concern about the need for that function to be traversed too, it would be true in interpreted languages. But compilation eliminates this problem.
In the presence of laziness, this coding trick may have the opposite effect. Given the full equations, a compiler such as GHC can perform all kinds of optimizations that essentially eliminate the lists completely and turn the code into the equivalent of loops. This happens when we compile with, e.g., the -O2 switch.
Writing out the partial equations may prevent this compiler optimization and force the actual creation of functions, with a drastic slowdown of the resulting code. I tried your cachedList code and saw a 0.01s execution time turn into 0.20s (I don't remember the exact test I did).

Tail optimization guarantee - loop encoding in Haskell

So the short version of my question is, how are we supposed to encode loops in Haskell, in general? There is no tail optimization guarantee in Haskell, bang patterns aren't even a part of the standard (right?), and the fold/unfold paradigm is not guaranteed to work in all situations. Here's a case in point where only bang patterns did the trick for me of making it run in constant space (not even using $! helped ... although the testing was done at Ideone.com, which uses ghc-6.8.2).
It is basically about a nested loop, which in list-paradigm can be stated as
prod (sum,concat) . unzip $
    [ (c, [r | t]) | k<-[0..kmax], j<-[0..jmax], let (c,r,t)=...]
prod (f,g) x = (f.fst $ x, g.snd $ x)
Or in pseudocode:
let list_store = [] in
for k from 0 to kmax
    for j from 0 to jmax
        if test(k,j)
            list_store += [entry(k,j)]
        count += local_count(k,j)
result = (count, list_store)
Until I added the bang-pattern to it, I got either a memory blow-out or even a stack overflow. But bang patterns are not part of the standard, right? So the question is, how is one to code the above, in standard Haskell, to run in constant space?
Here is the test code. The calculation is fake, but the problems are the same. EDIT: The foldr-formulated code is:
testR m n = foldr f (0,[])
               [ (c, [(i,j) | (i+j) == d ])
                 | i<- [0..m], j<-[0..n],
                   let c = if (rem j 3) == 0 then 2 else 1 ]
  where d = m + n - 3
        f (!c1, []) (!c, h) = (c1+c,h)
        f (!c1, (x:_)) (!c, h) = (c1+c,x:h)
Trying to run print $ testR 1000 1000 produces stack overflow. Changing to foldl only succeeds if using bang-patterns in f, but it builds the list in reversed order. I'd like to build it lazily, and in the right order. Can it be done with any kind of fold, for the idiomatic solution?
EDIT: to sum up the answer I got from @ehird: there's nothing to fear in using bang patterns. Though not in standard Haskell itself, a bang pattern is easily encoded in it as f ... c ... = case (seq c False) of {True -> undefined; _ -> ...}. The lesson is that only a pattern match forces a value, and seq does NOT force anything by itself, but rather arranges that when seq x y is forced (by a pattern match) x will be forced too, and y will be the answer. Contrary to what I could understand from the Online Report, $! does NOT force anything by itself, though it is called a "strict application operator".
And the point from @stephentetley: strictness is very important in controlling the space behaviour. So it is perfectly OK to encode loops in Haskell with proper use of strictness annotations (bang patterns, where needed), to write any kind of special folding (i.e. structure-consuming) function that is needed, as I ended up doing in the first place, and to rely on GHC to optimize the code.
Thank you very much to all for your help.

Bang patterns are simply sugar for seq — whenever you see let !x = y in z, that can be translated into let x = y in x `seq` z. seq is standard, so there's no issue with translating programs that use bang patterns into a portable form.
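As a minimal sketch of that translation, here is an accumulating loop written with seq only, so it needs no language extension. The name sumTo is made up for this example.

```haskell
-- Haskell 2010 only: the accumulator is forced with seq instead of a
-- bang pattern, so no thunk chain builds up across iterations.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go acc i
      | i > n     = acc
      | otherwise =
          let acc' = acc + i
          in acc' `seq` go acc' (i + 1)
```

With BangPatterns enabled, the same loop could be written as go !acc i = ..., which is arguably easier to read. The seq form is the portable spelling of the same idea.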
It is true that Haskell makes no guarantees about performance — the report does not even define an evaluation order (only that it must be non-strict), let alone the existence or behaviour of a runtime stack. However, while the report doesn't specify a specific method of implementation, you can certainly optimise for one.
For example, call-by-need (and thus sharing) is used by all Haskell implementations in practice, and is vital for optimising Haskell code for memory usage and speed. Indeed, the pure memoisation trick [1] relies on sharing (without it, it'll just slow things down).
This basic structure lets us see, for example, that stack overflows are caused by building up too-large thunks. Since you haven't posted your entire code, I can't tell you how to rewrite it without bang patterns, but I suspect [ (c, [r | t]) | ... ] should become [ c `seq` r `seq` t `seq` (c, [r | t]) | ... ]. Of course, bang patterns are more convenient; that's why they're such a common extension! (On the other hand, you probably don't need to force all of those; knowing what to force is entirely dependent on the specific structure of the code, and wildly adding bang patterns to everything usually just slows things down.)
Indeed, "tail recursion" per se does not mean all that much in Haskell: if your accumulator parameters aren't strict, you'll overflow the stack when you later try to force them, and indeed, thanks to laziness, many non-tail-recursive programs don't overflow the stack; printing repeat 1 won't ever overflow the stack, even though the definition — repeat x = x : repeat x — clearly has recursion in a non-tail position. This is because (:) is lazy in its second argument; if you traverse the list, you'll have constant space usage, as the repeat x thunks are forced, and the previous cons cells are thrown away by the garbage collector.
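A small sketch of the repeat case, using a primed name so it does not clash with the Prelude. The recursive call sits under the (:) constructor, not in tail position, yet a consumer can walk the list as far as it likes without deep recursion.

```haskell
-- Same shape as the Prelude's repeat: recursion in a non-tail position,
-- guarded by the lazy constructor (:).
repeat' :: a -> [a]
repeat' x = x : repeat' x

-- Only the demanded prefix is ever built.
firstFive :: [Int]
firstFive = take 5 (repeat' 1)

-- Walking a million cells deep forces one cons cell at a time.
millionth :: Int
millionth = repeat' 1 !! 1000000
```

Each step of the consumer forces exactly one repeat' x thunk, and the cells already passed become garbage, which is why stack depth does not grow with the position in the list.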
On a more philosophical note, tail-recursive loops are generally considered suboptimal in Haskell. In general, rather than iteratively computing a result in steps, we prefer to generate a structure with all the step-equivalents at the leaves, and do a transformation (like a fold) on it to produce the final result. This is a much higher-level view of things, made efficient by laziness (the structure is built up and garbage-collected as it's processed, rather than all at once). [2]
This can take some getting used to at first, and it certainly doesn't work in all cases (extremely complicated loop structures might be a pain to translate efficiently [3]), but directly translating tail-recursive loops into Haskell can be painful precisely because it isn't really all that idiomatic.
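To illustrate the "generate a structure, then fold it" style, here is a small made-up example (the function is not from the question): an imperative loop that sums the squares of the even numbers up to n becomes a list of step results consumed by a strict fold.

```haskell
import Data.List (foldl')

-- The list comprehension generates the "steps"; foldl' consumes them
-- with a strict accumulator, so the list never needs to exist in memory
-- all at once.
sumOfEvenSquares :: Int -> Int
sumOfEvenSquares n = foldl' (+) 0 [ k * k | k <- [0 .. n], even k ]
```

The generator, the filter, and the reduction are separate pieces that can be changed independently, which is the main attraction of this style over a hand-written loop.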
As far as the paste you linked to goes, id $! x doesn't work to force anything because it's the same as x `seq` id x, which is the same as x `seq` x, which is the same as x. Basically, whenever x `seq` y is forced, x is forced, and the result is y. You can't use seq to just force things at arbitrary points; you use it to cause the forcing of thunks to depend on other thunks.
In this case, the problem is that you're building up a large thunk in c, so you probably want to make auxk and auxj force it; a simple method would be to add a clause like auxj _ _ c _ | seq c False = undefined to the top of the definition. (The guard is always checked, forcing c to be evaluated, but always results in False, so the right-hand side is never evaluated.)
Personally, I would suggest keeping the bang pattern you have in the final version, as it's more readable, but f c _ | seq c False = undefined would work just as well too.
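For reference, here is the guard trick in a self-contained form. The function countUp is a hypothetical stand-in for auxj and simply counts k steps; only the first clause matters.

```haskell
-- The first clause never succeeds: its guard evaluates `seq c False`,
-- which forces c and then yields False, so matching falls through to
-- the next clause with c already evaluated.
countUp :: Int -> Int -> Int
countUp c _ | seq c False = undefined
countUp c k
  | k <= 0    = c
  | otherwise = countUp (c + 1) (k - 1)
```

This is plain Haskell 2010 and behaves like countUp !c k = ... with BangPatterns enabled.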
[1] See Elegant memoization with functional memo tries and the data-memocombinators library.
[2] Indeed, GHC can often even eliminate the intermediate structure entirely using fusion and deforestation, producing machine code similar to how the computation would be written in a low-level imperative language.
[3] Although if you have such loops, it's quite possible that this style of programming will help you simplify them: laziness means that you can easily separate independent parts of a computation out into separate structures, then filter and combine them, without worrying that you'll be duplicating work by making intermediate computations that will later be thrown away.

OK let's work from the ground up here.
You have a list of entries
entries = [(k,j) | j <- [0..jmax], k <- [0..kmax]]
And based on those indexes, you have tests and counts
tests m n = map (\(k,j) -> j + k == m + n - 3) entries
counts = map (\(_,j) -> if (rem j 3) == 0 then 2 else 1) entries
Now you want to build up two things: a "total" count, and the list of entries that "pass" the test. The problem, of course, is that you want to generate the latter lazily, while the former (to avoid exploding the stack) should be evaluated strictly.
If you evaluate these two things separately, then you must either 1) prevent sharing entries (generate it twice, once for each calculation), or 2) keep the entire entries list in memory. If you evaluate them together, then you must either 1) evaluate strictly, or 2) have a lot of stack space for the huge thunk created for the count. Option #2 for both cases is rather bad. Your imperative solution deals with this problem simply by evaluating simultaneously and strictly. For a solution in Haskell, you could take Option #1 for either the separate or the simultaneous evaluation. Or you could show us your "real" code and maybe we could help you find a way to rearrange your data dependencies; it may turn out you don't need the total count, or something like that.
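A sketch of Option #1 for separate evaluation, under stated assumptions: the test and count formulas are copied from the question's fake testR, and the names entries, totalCount, passing and testSeparate are made up here. Making entries a function of m and n means each consumer generates its own copy of the list. This is only a sketch; whether an optimising compiler ends up sharing the two lists anyway should be checked by profiling.

```haskell
import Data.List (foldl')

-- Index pairs, regenerated on every call rather than stored once.
entries :: Int -> Int -> [(Int, Int)]
entries m n = [ (i, j) | i <- [0 .. m], j <- [0 .. n] ]

-- The total count, evaluated strictly in constant space.
totalCount :: Int -> Int -> Int
totalCount m n =
  foldl' (+) 0 [ if j `rem` 3 == 0 then 2 else 1 | (_, j) <- entries m n ]

-- The entries that pass the test, produced lazily and in order.
passing :: Int -> Int -> [(Int, Int)]
passing m n = [ (i, j) | (i, j) <- entries m n, i + j == m + n - 3 ]

testSeparate :: Int -> Int -> (Int, [(Int, Int)])
testSeparate m n = (totalCount m n, passing m n)
```

The price is that the index space is enumerated twice, which is usually cheap compared with holding the whole list in memory or building a huge thunk for the count.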

Using parallel strategies with monads

I often see the usage and explanation of Haskell's parallel strategies connected to pure computations (for example fib). However, I do not often see them used with monadic constructions: is there a reasonable interpretation of the effect of par and related functions when applied to ST s or IO? Would any speedup be gained from such usage?

Parallelism in the IO monad is more correctly called "Concurrency", and is supported by forkIO and friends in the Control.Concurrent module.
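As a minimal sketch of what that looks like, the helper below runs two IO actions in separate threads and waits for both results through MVars. The name concurrentPair is made up for this example, and exception handling is deliberately left out: if either action throws, the caller blocks.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Run both actions concurrently and collect their results.
-- Assumption: neither action throws an exception.
concurrentPair :: IO a -> IO b -> IO (a, b)
concurrentPair ioA ioB = do
  boxA <- newEmptyMVar
  boxB <- newEmptyMVar
  _ <- forkIO (ioA >>= putMVar boxA)
  _ <- forkIO (ioB >>= putMVar boxB)
  a <- takeMVar boxA   -- blocks until the first thread has finished
  b <- takeMVar boxB   -- blocks until the second thread has finished
  return (a, b)
```

To actually use several cores, the program has to be built with GHC's -threaded flag and run with +RTS -N. Without them the threads are still concurrent but share one core.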
The difficulty with parallelising the ST monad is that ST is necessarily single-threaded - that's its purpose. There is a lazy variant of the ST monad, Control.Monad.ST.Lazy, which in principle could support parallel evaluation, but I'm not aware of anyone having tried to do this.
There's a new monad for parallel evaluation called Eval, which can be found in recent versions of the parallel package. I recommend using the Eval monad with rpar and rseq instead of par and pseq these days, because it leads to more robust and readable code. For example, the usual fib example can be written
import Control.Parallel.Strategies (runEval, rpar, rseq)

fib n = if n < 2 then 1 else
          runEval $ do
            x <- rpar (fib (n-1))
            y <- rseq (fib (n-2))
            return (x+y)

There are some situations where this makes sense, but in general you shouldn't do it. Examine the following:
doPar =
  let a = unsafePerformIO $ someIOCalc 1
      b = unsafePerformIO $ someIOCalc 2
  in a `par` b `pseq` a+b
In doPar, a calculation for a is sparked, then the main thread evaluates b. But it's possible that after the main thread finishes the calculation of b, it will begin to evaluate a as well. Now you have two threads evaluating a, meaning that some of the IO actions will be performed twice (or possibly more). But if one thread finishes evaluating a, the other will just drop what it's done so far. In order for this to be safe, you need a few things to be true:
It's safe for the IO actions to be performed multiple times.
It's safe for only some of the IO actions to be performed (e.g. there's no cleanup).
The IO actions are free of any race conditions. If one thread mutates some data when evaluating a, will the other thread also working on a behave sensibly? Probably not.
Any foreign calls are re-entrant (you need this for concurrency in general, of course).
If your someIOCalc looks like this
someIOCalc n = do
  prelaunchMissiles
  threadDelay n
  launchMissiles
it's absolutely not safe to use this with par and unsafePerformIO.
Now, is it ever worth it? Maybe. Sparks are cheap, even cheaper than threads, so in theory it should be a performance gain. In practice, perhaps not so much. Roman Leshchinskiy has a nice blog post about this.
Personally, I've found it much simpler to reason about forkIO.
