Runhaskell performance anomaly - haskell

I'm trying to understand a performance anomaly I'm observing when running a program under runhaskell.
The program in question is:
isFactor n = (0 ==) . (mod n)
factors x = filter (isFactor x) [2..x]
main = putStrLn $ show $ sum $ factors 10000000
When I run this, it takes 1.18 seconds.
However, if I redefine isFactor as:
isFactor n f = (0 ==) (mod n f)
then the program takes 17.7 seconds.
That's a huge difference in performance and I would expect the programs to be equivalent. Does anybody know what I'm missing here?
Note: This doesn't happen when compiled under GHC.

Although the functions should be identical, there's a difference in how they're applied. With the first definition, isFactor is fully applied at the call site isFactor x. In the second definition, it isn't, because now isFactor explicitly takes two arguments.
Even minimal optimizations are enough for GHC to see through this and create identical code for both definitions. However, if you compile with -O0 -ddump-simpl you can determine that, with no optimizations, this makes a difference (at least with ghc-7.2.1, YMMV with other versions).
With the first isFactor, GHC creates a single function that is passed as the predicate to GHC.List.filter, with the calls to mod 10000000 and (==) inlined. For the second definition, what happens instead is that most of the calls within isFactor are let-bound references to class functions, which are not shared between multiple invocations of isFactor. So there's a lot of dictionary overhead that's completely unnecessary.
This is almost never an issue because even the default compiler settings will optimize it away; runhaskell, however, apparently doesn't even do that much. Even so, I have occasionally structured code as someFun x y = \z -> ... because I knew that someFun would be partially applied and this was the only way to preserve sharing between calls (i.e. GHC's optimizer wasn't sufficiently clever).
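As a hypothetical illustration of that pattern (my own sketch, not the poster's code): here shared is computed at most once per partial application of someFun and reused for every z, because the returned lambda closes over it.
someFun :: Int -> Int -> Int -> Int
someFun x y =
    let shared = expensive x + y   -- computed at most once per partial application someFun x y
    in  \z -> shared * z           -- every call of the resulting function reuses 'shared'
  where
    expensive n = sum [1 .. n]     -- stand-in for a costly computation

main :: IO ()
main = print (sum (map (someFun 100000 1) [1 .. 10]))  -- 'expensive 100000' runs once, not ten times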

As I understand it, runhaskell does little to no optimisation. It's designed to quickly load and run code. If it did more optimisation, it would take longer for your code to start running. Of course, in this case, the code runs faster with optimisations.
As I understand it, if a compiled version of the code exists, then runhaskell will use it. So if performance matters to you, just make sure you compile with optimisations turned on first. (I think you might even be able to pass switches to runhaskell to turn on optimisations - you'd have to check the documentation...)

Related

Program runs fine with profiling, slower without profiling

I'm observing the following execution times:
~9s when compiled with stack build
~5s when compiled with stack build --profile
I'd expect non-profiled execution time to be below 1s, which tells me that the profiled time above makes sense; it's the non-profiled time that is abnormally slow.
A few more details:
The program reads a relational algebra-like DSL, applies a series of rule-based transformations, and outputs a translation to SQL. Parsing is done with megaparsec. I/O is String based and is relatively small (~150 KB), so I would exclude I/O as a source of the problem. Transformations involve recursive rewriting rules over an ADT. On a few occasions, I use ugly-memo to speed up such recursive rewrites.
Using stack 2.9.1 with LTS 18.28, ghc 8.10.7
(EDIT: upgrading to LTS 20.11, ghc 9.2.5, does not help)
In the cabal file:
ghc-options: -O2 -Wall -fno-warn-unused-do-bind -fno-warn-missing-signatures -fno-warn-name-shadowing -fno-warn-orphans
ghc-prof-options: -O2 -fprof-auto "-with-rtsopts=-p -V0.0001 -hc -s"
Notice that none of the above is new, but I have never observed this behaviour before.
I've seen this related question, but compiling with -fno-state-hack does not help
Compiling with -O1 doesn't help (about the same as -O2), and -O0 is significantly slower, as expected.
The profiling information shows no culprit. The problem only shows up with non-profiled execution.
I realise I'm not giving many details. In particular, I'm not including any code snippet because it's a large code base and I have no idea of which part of it could trigger this behaviour. My point is indeed that I don't know how to narrow it down.
So my question is not "where is the problem?", but rather: what could I do to get closer to the source of the issue, given that the obvious tool (profiling) turns out to be useless in this case?
Though I still don't know the exact reason for the difference between profiled and non-profiled execution, I have now found the culprit in my code.
A function that recurses over the ADT was implemented as follows:
f :: (?env :: Env) => X -> Y
f = memo go
  where
    go :: (?env :: Env) => X -> Y
    go (...) = ...
    go x = f (children x)  -- memoised recursion over x's children
Note that:
each recursion is memoised, as the data structure may contain many duplicates
the function uses an Env parameter, which is used internally but never changes during the whole recursion. Note that it is declared as an implicit parameter.
Profiling info about this function showed many more function calls than expected - three orders of magnitude more! (Still pretty weird that, despite this, the profiled execution was reasonably fast.)
Turns out, the implicit parameter was the issue. I don't know the exact reason, but I had read blogs about how evil implicit parameters can be and now I have first-hand proof of that.
Changing this function to use only explicit parameters fixed it completely:
f :: Env -> X -> Y
f env = memo (go env)
  where
    go :: Env -> X -> Y
    go env (...) = ...
    go env x = f env (children x)  -- memoised recursion over x's children
EDIT: Thanks to @leftaroundabout for pointing out that my solution above isn't quite correct - it under-uses memoisation (see discussion below). The correct version should be:
f :: Env -> X -> Y
f env = f_env
  where
    f_env = memo $ go env
    go env (...) = ...
    go env x = f_env (children x)
What the implicit parameter seemed to cause is that uglymemo failed to recognise previously stored calls, so I got all the overhead that memoisation brings and none of the benefit.
My problem is kind of solved, but if someone can explain the underlying reasons for which implicit parameters can cause this behaviour, and why would profile execution be faster, I'd still be very interested to learn that.
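For what it's worth, here is a self-contained sketch of the corrected pattern (my own illustration, not the poster's code), assuming memo :: Ord a => (a -> b) -> a -> b from Data.MemoUgly in the uglymemo package (which I take to be what the question calls ugly-memo), and using a toy Env and a Collatz-like recursion as stand-ins for the real DSL:
import Data.MemoUgly (memo)

type Env = Int  -- stand-in for the real environment type

-- The memo table lives inside f_env, is created once per environment, and is
-- shared by every recursive call because the recursion goes through f_env.
f :: Env -> Int -> Int
f env = f_env
  where
    f_env = memo (go env)
    go _ 1 = 0
    go e n
      | even n    = 1 + f_env (n `div` 2)
      | otherwise = 1 + f_env (3 * n + e)

main :: IO ()
main = print (map (f 1) [1 .. 20])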
This sounds like a contradiction: "So my question is not "where is the problem?", but rather: what could I do to get closer to the source of the issue"
Run it under a debugger, and just manually pause it during those 9s or 5s. Do this several times. The stack will show you exactly where the time is going, whether it's I/O or not. That's the technique (sometimes called "random pausing").
What have you got to lose?

Always guaranteed evaluation order of `seq` (with strange behavior of `pseq` in addition)

The documentation of seq function says the following:
A note on evaluation order: the expression seq a b does not guarantee that a will be evaluated before b. The only guarantee given by seq is that both a and b will be evaluated before seq returns a value. In particular, this means that b may be evaluated before a. If you need to guarantee a specific order of evaluation, you must use the function pseq from the "parallel" package.
So I have a lazy version of the sum function with an accumulator:
sum :: Num a => [a] -> a
sum = go 0
  where
    go acc []     = acc
    go acc (x:xs) = go (x + acc) xs
Obviously, this is extremely slow on big lists. Now I'm rewriting this function using seq:
sum :: Num a => [a] -> a
sum = go 0
  where
    go acc []     = acc
    go acc (x:xs) = let acc' = x + acc
                    in  acc' `seq` go acc' xs
And I see a huge performance increase! But I wonder how reliable it is. Did I just get lucky? Because GHC can evaluate the recursive call first (according to the documentation) and still accumulate thunks. It looks like I need to use pseq to ensure that acc' is always evaluated before the recursive call. But with pseq I see a performance decrease compared to the seq version. Numbers on my machine (for calculating sum [1 .. 10^7]):
naive: 2.6s
seq: 0.2s
pseq: 0.5s
I'm using GHC-8.2.2 and I compile with the stack ghc -- File.hs command.
After I compile with the stack ghc -- -O File.hs command, the performance gap between seq and pseq is gone. They now both run in 0.2s.
So does my implementation exhibit the properties I want? Or does GHC have some implementation quirk? Why is pseq slower? Does there exist some example where seq a b has different results depending on evaluation order (same code but different compiler flags/different compilers/etc.)?
The answers so far have focused on the seq versus pseq performance issues, but I think you originally wanted to know which of the two you should use.
The short answer is: while both should generate nearly identically performing code in practice (at least when proper optimization flags are turned on), the primitive seq, and not pseq, is the correct choice for your situation. Using pseq is non-idiomatic, confusing, and potentially counterproductive from a performance standpoint, and your reason for using it is based on a flawed understanding of what its order-of-evaluation guarantee means and what it implies with respect to performance. While there are no guarantees about performance across different sets of compiler flags (much less across other compilers), if you ever run into a situation where the seq version of the above code runs significantly slower than the pseq version using "production quality" optimization flags with the GHC compiler, you should consider it a GHC bug and file a bug report.
The long answer is, of course, longer...
First, let's be clear that seq and pseq are semantically identical in the sense that they both satisfy the equations:
seq _|_ b = _|_
seq a b = b -- if a is not _|_
pseq _|_ b = _|_
pseq a b = b -- if a is not _|_
This is really the only thing that either of them guarantees semantically, and since the definition of the Haskell language (as given in, say, the Haskell Report) only makes -- at best -- semantic guarantees and does not deal with performance or implementation, there's no reason to choose between one or the other for reasons of guaranteed performance across different compilers or compiler flags.
Furthermore, in your particular seq-based version of the function sum, it's not too difficult to see that there is no situation in which seq is called with an undefined first argument but a defined second argument (assuming the use of a standard numeric type), so you aren't even using the semantic properties of seq. You could re-define seq as seq a b = b and have exactly the same semantics. Of course, you know this -- that's why your first version didn't use seq. Instead, you're using seq for an incidental performance side-effect, so we're out of the realm of semantic guarantees and back in the realm of specific GHC compiler implementation and performance characteristics (where there aren't really any guarantees to speak of).
Second, that brings us to the intended purpose of seq. It is rarely used for its semantic properties because those properties aren't very useful. Who would want a computation seq a b to return b except that it should fail to terminate if some unrelated expression a fails to terminate? (The exceptions -- no pun intended -- would be things like handling exceptions, where you might use seq or deepseq (which is based on seq) to force evaluation of a non-terminating expression in either an uncontrolled or controlled way, before starting evaluation of another expression.)
Instead, seq a b is intended to force evaluation of a to weak head normal form before returning the result of b to prevent accumulation of thunks. The idea is, if you have an expression b which builds a thunk that could potentially accumulate on top of another unevaluated thunk represented by a, you can prevent that accumulation by using seq a b. The "guarantee" is a weak one: GHC guarantees that it understands you don't want a to remain an unevaluated thunk when seq a b's value is demanded. Technically, it doesn't guarantee that a will be "evaluated before" b, whatever that means, but you don't need that guarantee. When you worry that, without this guarantee, GHC might evaluate the recursive call first and still accumulate thunks, this is as ridiculous as worrying that pseq a b might evaluate its first argument, then wait 15 minutes (just to make absolutely sure the first argument has been evaluated!), before evaluating its second.
This is a situation where you should trust GHC to do the right thing. It may seem to you that the only way to realize the performance benefit of seq a b is for a to be evaluated to WHNF before evaluation of b starts, but it is conceivable that there are optimizations in this or other situations that technically start evaluating b (or even fully evaluate b to WHNF) while leaving a unevaluated for a short time to improve performance while still preserving the semantics of seq a b. By using pseq instead, you may prevent GHC from making such optimizations. (In your sum program situation, there undoubtedly is no such optimization, but in a more complex use of seq, there might be.)
Third, it's important to understand what pseq is actually for. It was first described in Marlow 2009 in the context of concurrent programming. Suppose we want to parallelize two expensive computations foo and bar and then combine (say, add) their results:
foo `par` (bar `seq` foo+bar) -- parens redundant but included for clarity
The intention here is that -- when this expression's value is demanded -- it creates a spark to compute foo in parallel and then, via the seq expression, starts evaluating bar to WHNF (i.e., its numeric value, say) before finally evaluating foo+bar which will wait on the spark for foo before adding and returning the results.
Here, it's conceivable that GHC will recognize that for a specific numeric type, (1) foo+bar automatically fails to terminate if bar does, satisfying the formal semantic guarantee of seq; and (2) evaluating foo+bar to WHNF will automatically force evaluation of bar to WHNF preventing any thunk accumulation and so satisfying the informal implementation guarantee of seq. In this situation, GHC may feel free to optimize the seq away to yield:
foo `par` foo+bar
particularly if it feels that it would be more performant to start evaluation of foo+bar before finishing evaluating bar to WHNF.
What GHC isn't smart enough to realize is that, if evaluation of foo in foo+bar starts before the foo spark is scheduled, the spark will fizzle, and no parallel execution will occur.
It's really only in this case -- where you need to explicitly delay demanding the value of a sparked expression, to allow an opportunity for it to be scheduled before the main thread "catches up" -- that you need the extra guarantee of pseq and are willing to have GHC forgo additional optimization opportunities permitted by the weaker guarantee of seq:
foo `par` (bar `pseq` foo+bar)
Here, pseq will prevent GHC from introducing any optimization that might allow foo+bar to start evaluating (potentially fizzling the foo spark) before bar is in WHNF (which, we hope, allows enough time for the spark to be scheduled).
The upshot is that, if you're using pseq for anything other than concurrent programming, you're using it wrong. (Well, maybe there are some weird situations, but...) If all you want to do is force strict evaluation and/or thunk evaluation to improve performance in non-concurrent code, using seq (or $! which is defined in terms of seq or Haskell strict data types which are defined in terms of $!) is the correct approach.
(Or, if @Kindaro's benchmarks are to be believed, maybe merciless benchmarking with specific compiler versions and flags is the correct approach.)
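For concreteness, here is a hedged sketch (mine, not part of the original answer) of those idiomatic alternatives: forcing the accumulator with ($!), which is defined in terms of seq, or simply using the strict left fold from Data.List:
import Data.List (foldl')

sumBang :: Num a => [a] -> a
sumBang = go 0
  where
    go acc []     = acc
    go acc (x:xs) = (go $! acc + x) xs   -- ($!) forces acc + x to WHNF before the recursive call

sumFoldl' :: Num a => [a] -> a
sumFoldl' = foldl' (+) 0                 -- foldl' forces the accumulator at each step

main :: IO ()
main = print ( sumBang   [1 .. 10 ^ (7 :: Int)]
             , sumFoldl' [1 .. 10 ^ (7 :: Int)] )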
I only see such a difference with optimizations turned off.
With ghc -O both pseq and seq perform the same.
The relaxed semantics of seq allow transformations resulting in slower code indeed. I can't think of a situation where that actually happens. We just assume GHC does the right thing. Unfortunately, we don't have a way to express that behavior in terms of a high-level semantics for Haskell.
Why is pseq slower?
pseq x y = x `seq` lazy y
pseq is thus implemented using seq. The observed overhead is due to the extra indirection of calling pseq.
Even if these ultimately get optimized away, it may not necessarily be a good idea to use pseq instead of seq. While the stricter ordering semantics seem to imply the intended effect (that go does not accumulate a thunk), it may disable some further optimizations: perhaps evaluating x and evaluating y can be decomposed into low-level operations, some of which we wouldn't mind to cross the pseq boundary.
Does there exist some example where seq a b has different results depending on evaluation order (same code but different compiler flags/different compilers/etc.)?
This can throw either "a" or "b".
seq (error "a") (error "b")
I guess there is a rationale explained in the paper about exceptions in Haskell, A Semantics for Imprecise Exceptions.
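A small sketch (my own) of how you might observe this; which of the two errors gets reported may vary with compiler version and optimisation flags:
import Control.Exception (SomeException, evaluate, try)

main :: IO ()
main = do
  -- Forcing seq (error "a") (error "b") raises an imprecise exception;
  -- either "a" or "b" may be the one caught here.
  r <- try (evaluate (seq (error "a") (error "b"))) :: IO (Either SomeException ())
  print r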
Edit: My theory was foiled, as the timings I observed were in fact heavily skewed by the influence of profiling itself; with profiling off, the data goes against the theory. Moreover, the timings vary quite a bit between versions of GHC. I am collecting better observations even now, and I will further edit this answer once I arrive at a conclusive point.
Concerning the question "why is pseq slower", I have a theory.
Let us re-phrase acc' `seq` go acc' xs as strict (go (strict acc') xs).
Similarly, acc' `pseq` go acc' xs is re-phrased as lazy (go (strict acc') xs).
Now, let us re-phrase go acc (x:xs) = let ... in ... to go acc (x:xs) = strict (go (x + acc) xs) in the case of seq.
And to go acc (x:xs) = lazy (go (x + acc) xs) in the case of pseq.
Now, it is easy to see that, in the case of pseq, go gets assigned a lazy thunk that will be evaluated at some later point. In the definition of sum, go never appears to the left of pseq, and thus, during the run of sum, the evaluation will not be forced at all. Moreover, this happens for every recursive call of go, so thunks accumulate.
This is a theory built from thin air, but I do have a partial proof. Specifically, I found out that go allocates linear memory in the pseq case but not in the seq case. You may see for yourself if you run the following shell commands:
for file in SumNaive.hs SumPseq.hs SumSeq.hs
do
    stack ghc \
        --library-profiling \
        --package parallel \
        -- \
        $file \
        -main-is ${file%.hs} \
        -o ${file%.hs} \
        -prof \
        -fprof-auto
done

for file in SumNaive.hs SumSeq.hs SumPseq.hs
do
    time ./${file%.hs} +RTS -P
done
-- And compare the memory allocation of the go cost centre.
COST CENTRE ... ticks bytes
SumNaive.prof:sum.go ... 782 559999984
SumPseq.prof:sum.go ... 669 800000016
SumSeq.prof:sum.go ... 161 0
postscriptum
Since there appears to be discord on the question of which optimizations actually play to what effect, I am putting my exact source code and time measures, so that there is a common baseline.
SumNaive.hs
module SumNaive where

import Prelude hiding (sum)

sum :: Num a => [a] -> a
sum = go 0
  where
    go acc []     = acc
    go acc (x:xs) = go (x + acc) xs

main = print $ sum [1..10^7]
SumSeq.hs
module SumSeq where

import Prelude hiding (sum)

sum :: Num a => [a] -> a
sum = go 0
  where
    go acc []     = acc
    go acc (x:xs) = let acc' = x + acc
                    in  acc' `seq` go acc' xs

main = print $ sum [1..10^7]
SumPseq.hs
module SumPseq where

import Prelude hiding (sum)
import Control.Parallel (pseq)

sum :: Num a => [a] -> a
sum = go 0
  where
    go acc []     = acc
    go acc (x:xs) = let acc' = x + acc
                    in  acc' `pseq` go acc' xs

main = print $ sum [1..10^7]
Time without optimizations:
./SumNaive +RTS -P 4.72s user 0.53s system 99% cpu 5.254 total
./SumSeq +RTS -P 0.84s user 0.00s system 99% cpu 0.843 total
./SumPseq +RTS -P 2.19s user 0.22s system 99% cpu 2.408 total
Time with -O:
./SumNaive +RTS -P 0.58s user 0.00s system 99% cpu 0.584 total
./SumSeq +RTS -P 0.60s user 0.00s system 99% cpu 0.605 total
./SumPseq +RTS -P 1.91s user 0.24s system 99% cpu 2.147 total
Time with -O2:
./SumNaive +RTS -P 0.57s user 0.00s system 99% cpu 0.570 total
./SumSeq +RTS -P 0.61s user 0.01s system 99% cpu 0.621 total
./SumPseq +RTS -P 1.92s user 0.22s system 99% cpu 2.137 total
It may be seen that:
The naive variant has poor performance without optimizations, but excellent performance with either -O or -O2 -- to the extent that it outperforms all others.
The seq variant has good performance that is improved very little by optimizations, so that with either -O or -O2 the naive variant outperforms it.
The pseq variant has consistently poor performance: about twice as fast as the naive variant without optimization, and about four times slower than the others with either -O or -O2. Optimization affects it about as little as it affects the seq variant.

How does the <<loop>> error "work" in detail?

I'm working on this tool where the user can define-and-include in [config files | content text-files | etc] their own "templates" (like mustache etc), and these can reference others, so they can induce a loop. Just when I was about to create a "max-loops" setting, I realized that with runghc the program after a while just quits with a farewell message of just <<loop>>. That's actually good enough for me, but it raises a few ponderations:
how does GHC or the runtime actually detect it's stuck in a loop, and how would it distinguish between a wanted long-running operation and an accidental infinite loop? The halting problem is still a thing last I checked..
any (time or iteration) limits that can be custom-set to the compiler or the runtime?
is this runghc-only or does it exist in all final compile outputs?
will any -O (optimization) flags set much later when building releases disable this apparent built-in loop detection?
All stuff I can figure out the hard way of course, but who knows maybe someone already looked into this in more detail.. (hard to google/ddg for "haskell" "<<loop>>" because they strip the angle brackets and then show results for "how to loop in Haskell" etc..)
This is a simple "improvement" of the STG runtime which was implemented in GHC. I'll share what I have understood, but GHC experts can likely provide more useful and accurate information.
GHC compiles to an intermediate language called Core, after having done several optimizations. You can see it using ghc -ddump-simpl ...
Very roughly, in Core, an unevaluated binding (like let x = 1+y+x in f x) creates a thunk. Some memory is allocated somewhere to represent the closure, and x is made to point at it.
When (and if) x is forced by f, then the thunk is evaluated. Here's the improvement: before the evaluation starts, the thunk of x is overwritten with a special value called BLACKHOLE. After x is evaluated (to WHNF) then the black hole is again overwritten with the actual value (so we don't recompute it if e.g. f x = x+x).
If the black hole is ever forced, <<loop>> is triggered. This is actually an IO exception (those can be raised in pure code, too, so this is fine).
Examples:
let x = 1+x in 2*x -- <<loop>>
let g x = g (x+1) in g 0 -- diverges
let h x = h (10-x) in h 0 -- diverges, even if h 0 -> h 10 -> h 0 -> ...
let g0 = g10 ; g10 = g0 in g0 -- <<loop>>
Note that each call of h 0 is considered a distinct thunk, hence no black hole is forced there.
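For instance, the first example can be reproduced as a tiny program (my own sketch, not from the answer):
main :: IO ()
main = do
  let x = 1 + x :: Int   -- x's thunk is defined in terms of itself
  print (2 * x)          -- demanding x forces the blackholed thunk: typically aborts with <<loop>>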
The tricky part is that it's not completely trivial to understand which thunks are actually created in Core since GHC can perform several optimizations before emitting Core. Hence, we should regard <<loop>> as a bonus, not as a given / hard guarantee by GHC. New optimizations in the future might replace some <<loop>>s with actual non-termination.
If you want to google something, "GHC, blackhole, STG" should be good keywords.

Why is there no implicit parallelism in Haskell?

Haskell is functional and pure, so basically it has all the properties needed for a compiler to be able to tackle implicit parallelism.
Consider this trivial example:
f = do
  a <- Just 1
  b <- Just $ Just 2
  -- ^ The above line does not utilize an `a` variable, so it can be safely
  --   executed in parallel with the preceding line
  c <- b
  -- ^ The above line references a `b` variable, so it can only be executed
  --   sequentially after it
  return (a, c)
  -- On the exit from a monad scope we wait for all computations to finish and
  -- gather the results
Schematically the execution plan can be described as:
do
|
+---------+---------+
| |
a <- Just 1 b <- Just $ Just 2
| |
| c <- b
| |
+---------+---------+
|
return (a, c)
Why is there no such functionality implemented in the compiler with a flag or a pragma yet? What are the practical reasons?
This is a long studied topic. While you can implicitly derive parallelism in Haskell code, the problem is that there is too much parallelism, at too fine a grain, for current hardware.
So you end up spending effort on book keeping, not running things faster.
Since we don't have infinite parallel hardware, it is all about picking the right granularity -- too coarse and there will be idle processors, too fine and the overheads will be unacceptable.
What we have is more coarse grained parallelism (sparks) suitable for generating thousands or millions of parallel tasks (so not at the instruction level), which map down onto the mere handful of cores we typically have available today.
Note that for some subsets (e.g. array processing) there are fully automatic parallelization libraries with tight cost models.
For background on this see Feedback Directed Implicit Parallelism, where they introduce an automated approach to the insertion of par in arbitrary Haskell programs.
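To make the spark model concrete, here is a hedged sketch (mine, not from the answer) of explicit coarse-grained parallelism with par and pseq from the parallel package; compile with something like ghc -O2 -threaded and run with +RTS -N2:
import Control.Parallel (par, pseq)

expensiveA, expensiveB :: Int
expensiveA = sum [1 .. 40000000]   -- two independent, reasonably large computations
expensiveB = sum [1 .. 50000000]

main :: IO ()
main =
  -- Spark expensiveA for parallel evaluation, make the main thread work on
  -- expensiveB first, then combine the results.
  print (expensiveA `par` (expensiveB `pseq` (expensiveA + expensiveB)))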
While your code block may not be the best example due to implicit data dependence between the a and b, it is worth noting that these two bindings commute, in that
f = do
  a <- Just 1
  b <- Just $ Just 2
  ...

will give the same results as

f = do
  b <- Just $ Just 2
  a <- Just 1
  ...
so this could still be parallelized in a speculative fashion. It is worth noting that this does not need to have anything to do with monads. We could, for instance, evaluate all independent expressions in a let-block in parallel, or we could introduce a version of let that would do so. The lparallel library for Common Lisp does this.
Now, I am by no means an expert on the subject, but this is my understanding of the problem.
A major stumbling block is determining when it is advantageous to parallelize the evaluation of multiple expressions. There is overhead associated with starting the separate threads for evaluation, and, as your example shows, it may result in wasted work. Some expressions may be too small to make parallel evaluation worth the overhead. As I understand it, coming up with a fully accurate metric of the cost of an expression would amount to solving the halting problem, so you are relegated to using a heuristic approach to determining what to evaluate in parallel.
Then it is not always faster to throw more cores at a problem. Even when explicitly parallelizing a problem with the many Haskell libraries available, you will often not see much speedup just by evaluating expressions in parallel, due to heavy memory allocation and usage and the strain this puts on the garbage collector and CPU cache. You end up needing a nice compact memory layout and to traverse your data intelligently. Having 16 threads traverse linked lists will just bottleneck you at your memory bus and could actually make things slower.
At the very least, what expressions can be effectively parallelized is something that is not obvious to many programmers (at least it isn't to this one), so getting a compiler to do it effectively is non-trivial.
Short answer: Sometimes running stuff in parallel turns out to be slower, not faster. And figuring out when it is and when it isn't a good idea is an unsolved research problem.
However, you still can be "suddenly utilizing all those cores without ever bothering with threads, deadlocks and race conditions". It's not automatic; you just need to give the compiler some hints about where to do it! :-D
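One hedged example of such a hint (mine, not the answerer's): parMap from Control.Parallel.Strategies (in the parallel package) sparks the evaluation of each list element; slowFib is just an illustrative workload. Again, compile with -threaded and run with +RTS -N to actually use multiple cores:
import Control.Parallel.Strategies (parMap, rdeepseq)

slowFib :: Int -> Integer
slowFib n | n < 2     = 1
          | otherwise = slowFib (n - 1) + slowFib (n - 2)

main :: IO ()
main = print (sum (parMap rdeepseq slowFib [25 .. 32]))  -- each slowFib call may run on a different core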
One of the reasons is that Haskell is non-strict and does not evaluate anything by default. In general the compiler does not know whether the computations of a and b terminate, hence trying to compute them would be a waste of resources:
x :: Maybe ([Int], [Int])
x = Just undefined
y :: Maybe ([Int], [Int])
y = Just (undefined, undefined)
z :: Maybe ([Int], [Int])
z = Just ([0], [1..])
a :: Maybe ([Int], [Int])
a = undefined
b :: Maybe ([Int], [Int])
b = Just ([0], map fib [0..])
  where fib 0 = 1
        fib 1 = 1
        fib n = fib (n - 1) + fib (n - 2)
Consider these values with the following functions:
main1 x = case x of
  Just _  -> putStrLn "Just"
  Nothing -> putStrLn "Nothing"
The (a, b) part does not need to be evaluated: as soon as you get that x = Just _ you can proceed to the branch - hence it will work for all of the values above except a.
main2 x = case x of
  Just (_, _) -> putStrLn "Just"
  Nothing     -> putStrLn "Nothing"
This function forces evaluation of the tuple, hence x will terminate with an error while the rest will work.
main3 x = case x of
  Just (a, b) -> print a >> print b
  Nothing     -> putStrLn "Nothing"
This function will first print the first list and then the second. It will work for z (resulting in an infinite stream of numbers being printed, which Haskell can deal with). b will eventually run out of memory.
Now, in general you don't know whether a computation terminates or not, nor how many resources it will consume. Infinite lists are perfectly fine in Haskell:
main = maybe (return ()) (print . take 5 . snd) b -- Prints first 5 Fibonacci numbers
Hence spawning threads to evaluate expressions in Haskell might try to evaluate something which is not meant to be evaluated fully - say, the list of all primes - which programmers nevertheless use as part of a structure. The above examples are very simple, and you may argue that the compiler could notice them - however, it is not possible in general due to the halting problem (you cannot write a program which takes an arbitrary program and its input and checks whether it terminates) - therefore it is not a safe optimization.
In addition - as mentioned in other answers - it is hard to predict whether the overhead of an additional thread is worth it. Even though GHC doesn't spawn new kernel threads for sparks, thanks to green threading (with a fixed number of kernel threads, setting aside a few exceptions), you still need to move data from one core to another and synchronize between them, which can be quite costly.
However, Haskell does have guided parallelization which doesn't break the purity of the language, via par and similar functions.
Actually there was such an attempt, although not on common hardware, due to the low quantity of available cores. The project is called Reduceron. It runs Haskell code with a high level of parallelism. If it were ever released as a proper 2 GHz ASIC core, we'd have a serious breakthrough in Haskell execution speed.

Does Haskell optimizer utilize memoization for repeated function calls in a scope?

Consider this function:
f as = if length as > 100 then length as else 100
Since the function is pure, it's obvious that the length will be the same in both calls. My question is: does the Haskell optimizer turn the code above into the equivalent of the following?
f as =
  let l = length as
  in if l > 100 then l else 100
If it does, which optimization level enables it? If it doesn't, why not? In this scenario, memory waste can't be the reason (as explained in this answer), because the introduced variable gets released as soon as the function execution finishes.
Please note that this is not a duplicate of this question because of the local scope, and thus it may get a radically different answer.
GHC now does some CSE by default, as the -fcse flag is on.
On by default. Enables the common-sub-expression elimination optimisation. Switching this off can be useful if you have some unsafePerformIO expressions that you don't want commoned-up.
However, it is conservative, due to the problems with introducing sharing (and thus space leaks).
The CSE pass has been getting a bit better, though.
Finally, note there is a plugin for full CSE, http://hackage.haskell.org/package/cse-ghc-plugin, if you have code that could benefit from that.
Even in such a local setting, it is still the case that it is not obvious that the introduction of sharing is always an optimization. Consider this example definition
f = if length [1 .. 1000000] > 0 then head [1 .. 1000000] else 0
vs. this one
f = let xs = [1 .. 1000000] in if length xs > 0 then head xs else 0
and you'll find that in this case, the first behaves much better, as each of the computations performed on the list is cheap, whereas the second version will cause the list to be unfolded completely in memory by length, and it can only be discarded after head has been reduced.
The case you are describing has more to do with common subexpression elimination than memoization; however, it seems that GHC currently doesn't do that either, because unintended sharing might lead to space leaks.
