Would seq ever be used instead of pseq?

If pseq ensures order of evaluation and seq doesn't, why does seq exist? Is there any time that seq should be used over pseq?

It says on the documentation page,
[pseq] restricts the transformations that the compiler can do, and ensures that the user can retain control of the evaluation order
Therefore, if all you need to do is ensure strictness so that you don't blow the stack, use seq. I don't know of any examples where being able to transform
a `seq` b
into
b `seq` a `seq` b
would help performance though, sorry.

What does seq actually do in Haskell?

From Real World Haskell I read
It operates as follows: when a seq expression is evaluated, it forces its first argument to be evaluated, then returns its second argument. It doesn't actually do anything with the first argument: seq exists solely as a way to force that value to be evaluated.
where I've emphasised the then because to me it implies an order in which the two things happen.
From Hackage I read
The value of seq a b is bottom if a is bottom, and otherwise equal to b. In other words, it evaluates the first argument a to weak head normal form (WHNF). seq is usually introduced to improve performance by avoiding unneeded laziness.
A note on evaluation order: the expression seq a b does not guarantee that a will be evaluated before b. The only guarantee given by seq is that the both a and b will be evaluated before seq returns a value. In particular, this means that b may be evaluated before a. […]
Furthermore, if I click on the # Source link from there, the page doesn't exist, so I can't see the code of seq.
That seems in line with a comment under this answer:
[…] seq cannot be defined in normal Haskell
On the other hand (or on the same hand, really), another comment reads
The 'real' seq is defined in GHC.Prim as seq :: a -> b -> b; seq = let x = x in x. This is only a dummy definition. Basically seq is specially syntax handled particularly by the compiler.
Can anybody shed some light on this topic? Especially in the following respects.
What source is right?
Is seq's implementation really not writable in Haskell?
If so, what does it even mean? That it is a primitive? What does this tell me about what seq actually does?
In seq a b is a guaranteed to be evaluated before b at least in the case that b makes use of a, e.g. seq a (a + x)?
Other answers have already discussed the meaning of seq and its relationship to pseq. But there appears to be quite some confusion about what exactly the implications of seq’s caveats are.
It is true, technically speaking, that a `seq` b does not guarantee a will be evaluated before b. This may seem troubling: how could it possibly serve its purpose if that were the case? Let’s consider the example Jon gave in their answer:
foldl' :: (a -> b -> a) -> a -> [b] -> a
foldl' f acc [] = acc
foldl' f acc (x : xs) =
  acc' `seq` foldl' f acc' xs
  where
    acc' = f acc x
Surely, we care about acc' being evaluated before the recursive call here. If it is not, the whole purpose of foldl' is lost! So why not use pseq here? And is seq really all that useful?
Fortunately, the situation is not actually so dire. seq really is the right choice here. GHC would never actually choose to compile foldl' such that it evaluates the recursive call before evaluating acc', so the behavior we want is preserved. The difference between seq and pseq is rather what flexibility the optimizer has to make a different decision when it thinks it has particularly good reason to.
Understanding seq and pseq’s strictness
To understand what that means, we must learn to think a little like the GHC optimizer. In practice, the only concrete difference between seq and pseq is how they affect the strictness analyzer:
seq is considered strict in both of its arguments. That is, in a function definition like
f a b c = (a `seq` b) + c
f will be considered strict in all three of its arguments.
pseq is just like seq, but it’s only considered strict in its first argument, not its second one. That means in a function definition like
g a b c = (a `pseq` b) + c
g will be considered strict in a and c, but not b.
What does this mean? Well, let’s first define what it means for a function to “be strict in one of its arguments” in the first place. The idea is that if a function is strict in one of its arguments, then a call to that function is guaranteed to evaluate that argument. This has several implications:
Suppose we have a function foo :: Int -> Int that is strict in its argument, and suppose we have a call to foo that looks like this:
foo (x + y)
A naïve Haskell compiler would construct a thunk for the expression x + y and pass the resulting thunk to foo. But we know that evaluating foo will necessarily force that thunk, so we’re not gaining anything from this laziness. It would be better to evaluate x + y immediately, then pass the result to foo to save a needless thunk allocation.
Since we know there’s never any reason to pass a thunk to foo, we gain an opportunity to make additional optimizations. For example, the optimizer could choose to internally rewrite foo to take an unboxed Int# instead of an Int, avoiding not only a thunk construction for x + y but avoiding boxing the resulting value altogether. This allows the result of x + y to be passed directly, on the stack, rather than on the heap.
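As a hand-written sketch of the kind of worker/wrapper rewrite described above (the split and the names fooWrapper/fooWorker are illustrative; GHC performs this automatically and its generated code differs):

```haskell
{-# LANGUAGE MagicHash #-}

import GHC.Exts (Int (I#), Int#, (+#))

-- A function that is strict in its argument.
foo :: Int -> Int
foo n = n `seq` n + 1

-- A hand-written sketch of the worker/wrapper split GHC can perform:
-- the wrapper unpacks the boxed Int, and the worker operates on the
-- unboxed Int# directly, with no thunk or box for the argument.
fooWrapper :: Int -> Int
fooWrapper (I# n#) = I# (fooWorker n#)

fooWorker :: Int# -> Int#
fooWorker n# = n# +# 1#

main :: IO ()
main = print (foo 41, fooWrapper 41)  -- (42,42)
```

The wrapper is small enough to inline at call sites, so callers that already hold an evaluated Int end up calling the worker directly, passing the raw value on the stack.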
As you can see, strictness analysis is rather crucial to making an efficient Haskell compiler, as it allows the compiler to make far more intelligent decisions about how to compile function calls, among other things. For that reason, we generally want strictness analysis to find as many opportunities to eagerly evaluate things as possible, letting us save on useless heap allocations.
With this in mind, let’s return to our f and g examples above. Let’s think about what strictness we’d intuitively expect these functions to have:
Recall that the body of f is (a `seq` b) + c. Even if we ignore the special properties of seq altogether, we know that it eventually evaluates to its second argument. This means f ought to be at least as strict as if its body were just b + c (with a entirely unused).
We know that evaluating b + c must fundamentally evaluate both b and c, so f must, at the very least, be strict in both b and c. Whether it’s strict in a is the more interesting question. If seq were actually just flip const, it would not be, as a would not be used, but of course the whole point of seq is to introduce artificial strictness, so in fact f is also considered strict in a.
Happily, the strictness of f I mentioned above is entirely consistent with our intuition about what strictness it ought to have. f is strict in all of its arguments, precisely as we would expect.
Intuitively, all of the above reasoning for f should also apply to g. The only difference is the replacement of seq with pseq, and we know that pseq provides a stronger guarantee about evaluation order than seq does, so we’d expect g to be at least as strict as f… which is to say, also strict in all its arguments.
However, remarkably, this is not the strictness GHC infers for g. GHC considers g strict in a and c, but not b, even though by our definition of strictness above, g is rather obviously strict in b: b must be evaluated for g to produce a result! As we’ll see, it is precisely this discrepancy that makes pseq so deeply magical, and why it’s generally a bad idea.
The implications of strictness
We’ve now seen that seq leads to the strictness we’d expect while pseq does not, but it’s not immediately obvious what that implies. To illustrate, consider a possible call site where f is used:
f a (b + 1) c
We know that f is strict in all its arguments, so by the same reasoning we used above, GHC should evaluate b + 1 eagerly and pass its result to f, avoiding a thunk.
At first blush, this might seem all well and good, but wait: what if a is a thunk? Even though f is also strict in a, it’s just a bare variable—maybe it was passed in as an argument from somewhere else—and there’s no reason for GHC to eagerly force a here if f is going to force it itself. The only reason we force b + 1 is to spare a new thunk from being created, and we save nothing by forcing the already-created a at the call site. This means a might in fact be passed as an unevaluated thunk.
This is something of a problem, because in the body of f, we wrote a `seq` b, requesting a be evaluated before b. But by our reasoning above, GHC just went ahead and evaluated b first! If we really, really need to make sure b isn’t evaluated until after a is, this type of eager evaluation can’t be allowed.
Of course, this is precisely why pseq is considered lazy in its second argument, even though it actually is not. If we replace f with g, then GHC would obediently allocate a fresh thunk for b + 1 and pass it on the heap, ensuring it is not evaluated a moment too soon. This of course means more heap allocation, no unboxing, and (worst of all) no propagation of strictness information further up the call chain, creating potentially cascading pessimizations. But hey, that’s what we asked for: avoid evaluating b too early at all costs!
Hopefully, this illustrates why pseq is seductive, but ultimately counterproductive unless you really know what you’re doing. Sure, you guarantee the evaluation you’re looking for… but at what cost?
The takeaways
Hopefully the above explanation has made clear how both seq and pseq have advantages and disadvantages:
seq plays nice with the strictness analyzer, exposing many more potential optimizations, but those optimizations might disrupt the order of evaluation we expect.
pseq preserves the desired evaluation order at all costs, but it only does this by outright lying to the strictness analyzer so it’ll stay off its case, dramatically weakening its ability to help the optimizer do good things.
How do we know which tradeoffs to choose? While we may now understand why seq can sometimes fail to evaluate its first argument before its second, we don’t have any more reason to believe this is an okay thing to let happen.
To soothe your fears, let’s take a step back and think about what’s really happening here. Note that GHC never actually compiles the a `seq` b expression itself in such a way that a fails to be evaluated before b. Given an expression like a `seq` (b + c), GHC won’t ever secretly stab you in the back and evaluate b + c before evaluating a. Rather, what it does is much more subtle: it might indirectly cause b and c to be individually evaluated before evaluating the overall b + c expression, since the strictness analyzer will note that the overall expression is still strict in both b and c.
How all this fits together is incredibly tricky, and it might make your head spin, so perhaps you don’t find the previous paragraph all that soothing after all. But to make the point more concrete, let’s return to the foldl' example at the beginning of this answer. Recall that it contains an expression like this:
acc' `seq` foldl' f acc' xs
In order to avoid the thunk blowup, we need acc' to be evaluated before the recursive call to foldl'. But given the above reasoning, it still always will be! The difference that seq makes here relative to pseq is, again, only relevant for strictness analysis: it allows GHC to infer that this expression is also strict in f and xs, not just acc', which in this situation doesn’t actually change much at all:
The overall foldl' function still isn’t considered strict in f, since in the first case of the function (the one where xs is []), f is unused, so for some call patterns, foldl' is lazy in f.
foldl' can be considered strict in xs, but that is totally uninteresting here, since xs is only a piece of one of foldl'’s arguments, and that strictness information doesn’t affect the strictness of foldl' at all.
So, if there is not actually any difference here, why not use pseq? Well, suppose foldl' is inlined some finite number of times at a call site, since maybe the shape of its second argument is partially known. The strictness information exposed by seq might then expose several additional optimizations at the call site, leading to a chain of advantageous optimizations. If pseq had been used, those optimizations would be obscured, and GHC would produce worse code.
The real takeaway here is therefore that even though seq might sometimes not evaluate its first argument before its second, this is only technically true, the way it happens is subtle, and it’s pretty unlikely to break your program. This should not be too surprising: seq is the tool the authors of GHC expect programmers to use in this situation, so it would be rather rude of them to make it do the wrong thing! seq is the idiomatic tool for this job, not pseq, so use seq.
When do you use pseq, then? Only when you really, really care about a very specific evaluation order, which usually only happens for one of two reasons: you are using par-based parallelism, or you’re using unsafePerformIO and care about the order of side effects. If you’re not doing either of these things, then don’t use pseq. If all you care about is use cases like foldl', where you just want to avoid needless thunk build-up, use seq. That’s what it’s for.
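A minimal sketch of the par-based use case, assuming the parallel package is available (Control.Parallel provides par and pseq; the workloads here are stand-ins):

```haskell
import Control.Parallel (par, pseq)

-- par sparks foo for parallel evaluation, and pseq makes sure bar is
-- evaluated before foo + bar demands foo, so the spark has a chance
-- to be scheduled instead of fizzling.
parSum :: Int
parSum = foo `par` (bar `pseq` foo + bar)
  where
    foo = sum [1 .. 1000000 :: Int]
    bar = sum [1 .. 2000000 :: Int]

main :: IO ()
main = print parSum  -- 2500001500000
```

To actually see parallelism you would compile with -threaded and run with +RTS -N; without that, the sparks simply run sequentially but the result is the same.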
seq introduces an artificial data dependency between two thunks. Normally, a thunk is forced to evaluate only when pattern-matching demands it. If the thunk a contains the expression case b of { … }, then forcing a also forces b. So there is a dependency between the two: in order to determine the value of a, we must evaluate b.
seq specifies this relationship between any two arbitrary thunks. When seq c d is forced, c is forced in addition to d. Note that I don’t say before: according to the standard, an implementation is free to force c before d or d before c or even some mixture thereof. It’s only required that if c does not halt, then seq c d also doesn’t halt. If you want to guarantee evaluation order, you can use pseq.
The diagrams below illustrate the difference. A black arrowhead (▼) indicates a real data dependency, the kind that you could express using case; a white arrowhead (▽) indicates an artificial dependency.
Forcing seq a b must force both a and b.
  │
┌─▼───────┐
│ seq a b │
└─┬─────┬─┘
  │     │
┌─▽─┐ ┌─▼─┐
│ a │ │ b │
└───┘ └───┘
Forcing pseq a b must force b, which must first force a.
  │
┌─▼────────┐
│ pseq a b │
└─┬────────┘
  │
┌─▼─┐
│ b │
└─┬─┘
  │
┌─▽─┐
│ a │
└───┘
As it stands, it must be implemented as an intrinsic because its type, forall a b. a -> b -> b, claims that it works for any types a and b, without any constraint. It used to belong to a typeclass, but this was removed and made into a primitive because the typeclass version was considered to have poor ergonomics: adding seq to try to fix a performance issue in a deeply nested chain of function calls would require adding a boilerplate Seq a constraint on every function in the chain. (I would prefer the explicitness, but it would be hard to change now.)
So seq, and syntactic sugar for it like strict fields in data types or BangPatterns in patterns, is about ensuring that something is evaluated by attaching it to the evaluation of something else that will be evaluated. The classic example is foldl'. Here, the seq ensures that when the recursive call is forced, the accumulator is also forced:
foldl' :: (a -> b -> a) -> a -> [b] -> a
foldl' f acc [] = acc
foldl' f acc (x : xs) =
  acc' `seq` foldl' f acc' xs
  where
    acc' = f acc x
This asks the compiler to ensure that if f is strict, such as (+) on a strict data type like Int, then the accumulator is reduced to an Int at each step, rather than building a chain of thunks to be evaluated only at the end.
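A quick usage sketch of the foldl' above: summing a large list runs in constant space, because each accumulator is forced before the next recursive call.

```haskell
-- Hide any foldl' the Prelude may export so the local one is used.
import Prelude hiding (foldl')

-- The strict left fold from the answer: seq forces each accumulator
-- before the next recursive call, so no thunk chain builds up.
foldl' :: (a -> b -> a) -> a -> [b] -> a
foldl' f acc [] = acc
foldl' f acc (x : xs) =
  acc' `seq` foldl' f acc' xs
  where
    acc' = f acc x

main :: IO ()
main = print (foldl' (+) 0 [1 .. 1000000 :: Int])  -- 500000500000
</imports>
```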
Real World Haskell is mistaken, and all the other things you quoted are correct. If you care deeply about the evaluation order, use pseq instead.

Automatically inserting laziness in Haskell

Haskell pattern matching is often head strict. For example, f (x:xs) = ...
requires the input list to be evaluated to (thunk : thunk). But sometimes such evaluation is not needed and the function can afford to be non-strict in some arguments, for example f (x:xs) = 3.
Ideally, in such situations we could avoid evaluating the argument to get the behaviour of const 3, which could be done with an irrefutable pattern: f ~(x:xs) = 3. This gives us performance benefits and greater error tolerance.
My question is: Does GHC already implement such transformations via some kind of strictness analysis? Appreciate it if you could also point me to some readings on it.
As far as I know, GHC will never make something more lazy than specified by the programmer, even if it can prove that doesn't change the semantics of the term. I don't think there is any fundamental reason to avoid changing the laziness of a term when we can prove the semantics don't change; I suspect it's more of an empirical observation that we don't know of any situations where that would be a really great idea. (And if a transformation would change the semantics, I would consider it a bug for GHC to make that change.)
There is only one possible exception that comes to mind, the so-called "full laziness" transformation, described well on the wiki. In short, GHC will translate
\a b -> let c = {- something that doesn't mention b -} in d
to
\a -> let c = {- same thing as before -} in \b -> d
to avoid recomputing c each time the argument is applied to a new b. But it seems to me that this transformation is more about memoization than about laziness: the two terms above appear to me to have the same (denotational) semantics wrt laziness/strictness, and are only operationally different.
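As a concrete sketch of the transformation (the names here are illustrative, not GHC's actual output): the b-independent binding is floated out of the inner lambda, so it is computed once per a and shared across applications to different bs.

```haskell
-- Before full laziness: c is (conceptually) rebuilt for every b.
addBig :: Int -> Int -> Int
addBig a b = let c = sum [1 .. a] in c + b

-- After full laziness (written by hand here): c is bound once per a
-- and shared by every subsequent application to a b.
addBig' :: Int -> Int -> Int
addBig' a = let c = sum [1 .. a] in \b -> c + b

main :: IO ()
main = do
  let g = addBig' 1000  -- sum [1 .. 1000] is computed once...
  print (g 1, g 2)      -- ...and shared between these two calls
```

Both versions denote the same function; only the operational behavior (how often c is recomputed, and how long it is retained) differs, which is exactly the memoization-versus-laziness point made above.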

Always guaranteed evaluation order of `seq` (with strange behavior of `pseq` in addition)

The documentation of seq function says the following:
A note on evaluation order: the expression seq a b does not guarantee that a will be evaluated before b. The only guarantee given by seq is that the both a and b will be evaluated before seq returns a value. In particular, this means that b may be evaluated before a. If you need to guarantee a specific order of evaluation, you must use the function pseq from the "parallel" package.
So I have a lazy version of sum function with accumulator:
sum :: Num a => [a] -> a
sum = go 0
  where
    go acc []     = acc
    go acc (x:xs) = go (x + acc) xs
Obviously, this is extremely slow on big lists. Now I'm rewriting this function using seq:
sum :: Num a => [a] -> a
sum = go 0
  where
    go acc []     = acc
    go acc (x:xs) = let acc' = x + acc
                    in acc' `seq` go acc' xs
And I see a huge performance increase! But I wonder how reliable it is? Did I get it by luck? Because GHC can evaluate the recursive call first (according to the documentation) and still accumulate thunks. It looks like I need to use pseq to ensure that acc' is always evaluated before the recursive call. But with pseq I see a performance decrease compared to the seq version. Numbers on my machine (for calculating sum [1 .. 10^7]):
naive: 2.6s
seq: 0.2s
pseq: 0.5s
I'm using GHC 8.2.2 and I compile with the stack ghc -- File.hs command.
After I tried compiling with the stack ghc -- -O File.hs command, the performance gap between seq and pseq was gone. They both now run in 0.2s.
So does my implementation exhibit the properties I want? Or does GHC have some implementation quirk? Why is pseq slower? Does there exist some example where seq a b has different results depending on evaluation order (same code but different compiler flags/different compilers/etc.)?
The answers so far have focused on the seq versus pseq performance issues, but I think you originally wanted to know which of the two you should use.
The short answer is: while both should generate nearly identically performing code in practice (at least when proper optimization flags are turned on), the primitive seq, and not pseq, is the correct choice for your situation. Using pseq is non-idiomatic, confusing, and potentially counterproductive from a performance standpoint, and your reason for using it is based on a flawed understanding of what its order-of-evaluation guarantee means and what it implies with respect to performance. While there are no guarantees about performance across different sets of compiler flags (much less across other compilers), if you ever run into a situation where the seq version of the above code runs significantly slower than the pseq version using "production quality" optimization flags with the GHC compiler, you should consider it a GHC bug and file a bug report.
The long answer is, of course, longer...
First, let's be clear that seq and pseq are semantically identical in the sense that they both satisfy the equations:
seq _|_ b = _|_
seq a b = b -- if a is not _|_
pseq _|_ b = _|_
pseq a b = b -- if a is not _|_
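These equations can be observed directly. A minimal sketch using Control.Exception.evaluate from base to force the values in IO (shown for seq; pseq behaves the same way semantically):

```haskell
import Control.Exception (SomeException, evaluate, try)

main :: IO ()
main = do
  -- seq a b = b when a is not bottom:
  print (seq (1 :: Int) (2 :: Int))                  -- 2
  -- seq _|_ b = _|_: forcing the result yields the exception from a.
  r <- try (evaluate (seq (undefined :: Int) (2 :: Int)))
         :: IO (Either SomeException Int)
  putStrLn (either (const "bottom") show r)          -- bottom
```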
This is really the only thing that either of them guarantees semantically, and since the definition of the Haskell language (as given, say, in the Haskell Report) only makes -- at best -- semantic guarantees and does not deal with performance or implementation, there's no reason to choose between one or the other for reasons of guaranteed performance across different compilers or compiler flags.
Furthermore, in your particular seq-based version of the function sum, it's not too difficult to see that there is no situation in which seq is called with an undefined first argument but a defined second argument (assuming the use of a standard numeric type), so you aren't even using the semantic properties of seq. You could re-define seq as seq a b = b and have exactly the same semantics. Of course, you know this -- that's why your first version didn't use seq. Instead, you're using seq for an incidental performance side-effect, so we're out of the realm of semantic guarantees and back in the realm of specific GHC compiler implementation and performance characteristics (where there aren't really any guarantees to speak of).
Second, that brings us to the intended purpose of seq. It is rarely used for its semantic properties because those properties aren't very useful. Who would want a computation seq a b to return b except that it should fail to terminate if some unrelated expression a fails to terminate? (The exceptions -- no pun intended -- would be things like handling exceptions, where you might use seq or deepSeq which is based on seq to force evaluation of a non-terminating expression in either an uncontrolled or controlled way, before starting evaluation of another expression.)
Instead, seq a b is intended to force evaluation of a to weak head normal form before returning the result of b to prevent accumulation of thunks. The idea is, if you have an expression b which builds a thunk that could potentially accumulate on top of another unevaluated thunk represented by a, you can prevent that accumulation by using seq a b. The "guarantee" is a weak one: GHC guarantees that it understands you don't want a to remain an unevaluated thunk when seq a b's value is demanded. Technically, it doesn't guarantee that a will be "evaluated before" b, whatever that means, but you don't need that guarantee. When you worry that, without this guarantee, GHC might evaluate the recursive call first and still accumulate thunks, this is as ridiculous as worrying that pseq a b might evaluate its first argument, then wait 15 minutes (just to make absolutely sure the first argument has been evaluated!), before evaluating its second.
This is a situation where you should trust GHC to do the right thing. It may seem to you that the only way to realize the performance benefit of seq a b is for a to be evaluated to WHNF before evaluation of b starts, but it is conceivable that there are optimizations in this or other situations that technically start evaluating b (or even fully evaluate b to WHNF) while leaving a unevaluated for a short time to improve performance while still preserving the semantics of seq a b. By using pseq instead, you may prevent GHC from making such optimizations. (In your sum program situation, there undoubtedly is no such optimization, but in a more complex use of seq, there might be.)
Third, it's important to understand what pseq is actually for. It was first described in Marlow 2009 in the context of concurrent programming. Suppose we want to parallelize two expensive computations foo and bar and then combine (say, add) their results:
foo `par` (bar `seq` foo+bar) -- parens redundant but included for clarity
The intention here is that -- when this expression's value is demanded -- it creates a spark to compute foo in parallel and then, via the seq expression, starts evaluating bar to WHNF (i.e., its numeric value, say) before finally evaluating foo+bar, which will wait on the spark for foo before adding and returning the results.
Here, it's conceivable that GHC will recognize that for a specific numeric type, (1) foo+bar automatically fails to terminate if bar does, satisfying the formal semantic guarantee of seq; and (2) evaluating foo+bar to WHNF will automatically force evaluation of bar to WHNF preventing any thunk accumulation and so satisfying the informal implementation guarantee of seq. In this situation, GHC may feel free to optimize the seq away to yield:
foo `par` foo+bar
particularly if it feels that it would be more performant to start evaluation of foo+bar before finishing evaluating bar to WHNF.
What GHC isn't smart enough to realize is that, if evaluation of foo in foo+bar starts before the foo spark is scheduled, the spark will fizzle, and no parallel execution will occur.
It's really only in this case, where you need to explicitly delay demanding the value of a sparked expression to allow an opportunity for it to be scheduled before the main thread "catches up" that you need the extra guarantee of pseq and are willing to have GHC forgo additional optimization opportunities permitted by the weaker guarantee of seq:
foo `par` (bar `pseq` foo+bar)
Here, pseq will prevent GHC from introducing any optimization that might allow foo+bar to start evaluating (potentially fizzling the foo spark) before bar is in WHNF (which, we hope, allows enough time for the spark to be scheduled).
The upshot is that, if you're using pseq for anything other than concurrent programming, you're using it wrong. (Well, maybe there are some weird situations, but...) If all you want to do is force strict evaluation and/or thunk evaluation to improve performance in non-concurrent code, using seq (or $! which is defined in terms of seq or Haskell strict data types which are defined in terms of $!) is the correct approach.
(Or, if @Kindaro's benchmarks are to be believed, maybe merciless benchmarking with specific compiler versions and flags is the correct approach.)
I only see such a difference with optimizations turned off.
With ghc -O both pseq and seq perform the same.
The relaxed semantics of seq allow transformations resulting in slower code indeed. I can't think of a situation where that actually happens. We just assume GHC does the right thing. Unfortunately, we don't have a way to express that behavior in terms of a high-level semantics for Haskell.
Why is pseq slower?
pseq x y = x `seq` lazy y
pseq is thus implemented using seq. The observed overhead is due to the extra indirection of calling pseq.
Even if these ultimately get optimized away, it may not necessarily be a good idea to use pseq instead of seq. While the stricter ordering semantics seem to imply the intended effect (that go does not accumulate a thunk), it may disable some further optimizations: perhaps evaluating x and evaluating y can be decomposed into low-level operations, some of which we wouldn't mind to cross the pseq boundary.
Does there exist some example where seq a b has different results depending on evaluation order (same code but different compiler flags/different compilers/etc.)?
This can throw either "a" or "b".
seq (error "a") (error "b")
I guess there is a rationale explained in the paper about exceptions in Haskell, A Semantics for imprecise exceptions.
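A sketch demonstrating this (which of the two errors you observe may depend on compiler version and optimization flags, per the imprecise-exceptions semantics; the helper name whichError is just for illustration):

```haskell
import Control.Exception (ErrorCall, evaluate, try)

-- Force the expression and report which error escaped.
whichError :: IO (Either ErrorCall Int)
whichError = try (evaluate (seq (error "a" :: Int) (error "b" :: Int)))

main :: IO ()
main = do
  r <- whichError
  case r of
    Left e  -> print e  -- may print "a" or "b", depending on flags
    Right _ -> putStrLn "no exception (cannot happen here)"
```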
Edit: My theory was foiled, as the timings I observed were in fact heavily skewed by the influence of profiling itself; with profiling off, the data goes against the theory. Moreover, the timings vary quite a bit between versions of GHC. I am collecting better observations even now, and I will further edit this answer as I arrive at a conclusive point.
Concerning the question "why pseq is slower", I have a theory.
Let us re-phrase acc' `seq` go acc' xs as strict (go (strict acc') xs).
Similarly, acc' `pseq` go acc' xs is re-phrased as lazy (go (strict acc') xs).
Now, let us re-phrase go acc (x:xs) = let ... in ... to go acc (x:xs) = strict (go (x + acc) xs) in the case of seq.
And to go acc (x:xs) = lazy (go (x + acc) xs) in the case of pseq.
Now, it is easy to see that, in the case of pseq, go gets assigned a lazy thunk that will be evaluated at some later point. In the definition of sum, go never appears to the left of pseq, and thus, during the run of sum, the evaluation will not be forced at all. Moreover, this happens for every recursive call of go, so thunks accumulate.
This is a theory built from thin air, but I do have a partial proof. Specifically, I did find out that go allocates linear memory in the pseq case, but not in the case of seq. You may see for yourself if you run the following shell commands:
for file in SumNaive.hs SumPseq.hs SumSeq.hs
do
    stack ghc \
        --library-profiling \
        --package parallel \
        -- \
        $file \
        -main-is ${file%.hs} \
        -o ${file%.hs} \
        -prof \
        -fprof-auto
done

for file in SumNaive.hs SumSeq.hs SumPseq.hs
do
    time ./${file%.hs} +RTS -P
done
Then compare the memory allocation of the go cost centre.
COST CENTRE ... ticks bytes
SumNaive.prof:sum.go ... 782 559999984
SumPseq.prof:sum.go ... 669 800000016
SumSeq.prof:sum.go ... 161 0
postscriptum
Since there appears to be discord on the question of which optimizations actually play to what effect, I am putting down my exact source code and time measurements, so that there is a common baseline.
SumNaive.hs
module SumNaive where

import Prelude hiding (sum)

sum :: Num a => [a] -> a
sum = go 0
  where
    go acc []     = acc
    go acc (x:xs) = go (x + acc) xs

main = print $ sum [1..10^7]
SumSeq.hs
module SumSeq where

import Prelude hiding (sum)

sum :: Num a => [a] -> a
sum = go 0
  where
    go acc []     = acc
    go acc (x:xs) = let acc' = x + acc
                    in acc' `seq` go acc' xs

main = print $ sum [1..10^7]
SumPseq.hs
module SumPseq where

import Prelude hiding (sum)
import Control.Parallel (pseq)

sum :: Num a => [a] -> a
sum = go 0
  where
    go acc []     = acc
    go acc (x:xs) = let acc' = x + acc
                    in acc' `pseq` go acc' xs

main = print $ sum [1..10^7]
Time without optimizations:
./SumNaive +RTS -P 4.72s user 0.53s system 99% cpu 5.254 total
./SumSeq +RTS -P 0.84s user 0.00s system 99% cpu 0.843 total
./SumPseq +RTS -P 2.19s user 0.22s system 99% cpu 2.408 total
Time with -O:
./SumNaive +RTS -P 0.58s user 0.00s system 99% cpu 0.584 total
./SumSeq +RTS -P 0.60s user 0.00s system 99% cpu 0.605 total
./SumPseq +RTS -P 1.91s user 0.24s system 99% cpu 2.147 total
Time with -O2:
./SumNaive +RTS -P 0.57s user 0.00s system 99% cpu 0.570 total
./SumSeq +RTS -P 0.61s user 0.01s system 99% cpu 0.621 total
./SumPseq +RTS -P 1.92s user 0.22s system 99% cpu 2.137 total
It may be seen that:
The Naive variant performs poorly without optimizations, but excellently with either -O or -O2 -- to the extent that it outperforms all others.
The seq variant performs well and is improved very little by optimizations, so that with either -O or -O2 the Naive variant outperforms it.
The pseq variant performs consistently poorly: about twice as fast as the Naive variant without optimizations, and three to four times slower than the others with either -O or -O2. Optimization affects it about as little as it does the seq variant.

Does Haskell optimizer utilize memoization for repeated function calls in a scope?

Consider this function:
f as = if length as > 100 then length as else 100
Since the function is pure, it's obvious that the length will be the same in both calls. My question is: does the Haskell optimizer turn the code above into the equivalent of the following?
f as =
  let l = length as
  in if l > 100 then l else 100
If it does, which optimization level enables it? If it doesn't, why not? In this scenario wasted memory can't be the reason, as explained in this answer, because the introduced variable is released as soon as the function finishes executing.
Please note that this is not a duplicate of this question because of the local scope, and thus it may get a radically different answer.
GHC now does some CSE by default, as the -fcse flag is on.
On by default. Enables the common-sub-expression elimination
optimisation. Switching this off can be useful if you have some
unsafePerformIO expressions that you don't want commoned-up.
However, it is conservative, due to the problems with introducing sharing (and thus space leaks).
The CSE pass is getting a bit better, though.
Finally, note that there is a plugin for full CSE, if you have code that could benefit from it:
http://hackage.haskell.org/package/cse-ghc-plugin
Even in such a local setting, it is not obvious that introducing sharing is always an optimization. Consider this example definition
f = if length [1 .. 1000000] > 0 then head [1 .. 1000000] else 0
vs. this one
f = let xs = [1 .. 1000000] in if length xs > 0 then head xs else 0
and you'll find that in this case, the first behaves much better, as each of the computations performed on the list is cheap, whereas the second version will cause the list to be unfolded completely in memory by length, and it can only be discarded after head has been reduced.
The case you are describing has more to do with common subexpression elimination than with memoization; however, it seems that GHC currently doesn't do that either, because unintended sharing might lead to space leaks.
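Given that GHC's CSE is conservative, the reliable approach is to write the sharing you want explicitly, as in the question's own rewrite. A minimal sketch:

```haskell
-- Explicit sharing via let: length is computed once, regardless of
-- whether the compiler performs CSE.
f :: [Int] -> Int
f as = let l = length as
       in if l > 100 then l else 100
```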

Tail optimization guarantee - loop encoding in Haskell

So the short version of my question is: how are we supposed to encode loops in Haskell, in general? There is no tail-call optimization guarantee in Haskell, bang patterns aren't even part of the standard (right?), and the fold/unfold paradigm is not guaranteed to work in every situation. Here's a case in point where only bang patterns did the trick of making it run in constant space (not even using $! helped), although the testing was done at Ideone.com, which uses ghc-6.8.2.
It is basically about a nested loop, which in list-paradigm can be stated as
prod (sum,concat) . unzip $
  [ (c, [r | t]) | k <- [0..kmax], j <- [0..jmax], let (c,r,t) = ... ]

prod (f,g) x = (f.fst $ x, g.snd $ x)
Or in pseudocode:
let list_store = [] in
for k from 0 to kmax
    for j from 0 to jmax
        if test(k,j)
            list_store += [entry(k,j)]
        count += local_count(k,j)
result = (count, list_store)
Until I added the bang-pattern to it, I got either a memory blow-out or even a stack overflow. But bang patterns are not part of the standard, right? So the question is, how is one to code the above, in standard Haskell, to run in constant space?
Here is the test code. The calculation is fake, but the problems are the same. EDIT: The foldr-formulated code is:
testR m n = foldr f (0,[])
              [ (c, [ (i,j) | (i+j) == d ])
              | i <- [0..m], j <- [0..n],
                let c = if (rem j 3) == 0 then 2 else 1 ]
  where
    d = m + n - 3
    f (!c1, [])    (!c, h) = (c1+c, h)
    f (!c1, (x:_)) (!c, h) = (c1+c, x:h)
Trying to run print $ testR 1000 1000 produces a stack overflow. Changing to foldl succeeds only with bang patterns in f, and it builds the list in reverse order. I'd like to build it lazily, and in the right order. Can it be done with any kind of fold, for an idiomatic solution?
EDIT: to sum up the answer I got from @ehird: there's nothing to fear in using bang patterns. Though not part of standard Haskell itself, they are easily encoded in it as f ... c ... = case seq c False of { True -> undefined; _ -> ... }. The lesson is that only a pattern match forces a value; seq does NOT force anything by itself, but rather arranges that when seq x y is forced (by a pattern match), x will be forced too, and y will be the answer. Contrary to what I understood from the Online Report, $! does NOT force anything by itself either, though it is called the "strict application operator".
And the point from @stephentetley: strictness is very important in controlling space behaviour. So it is perfectly OK to encode loops in Haskell with proper use of strictness annotations (bang patterns, where needed), writing whatever kind of special folding (i.e. structure-consuming) function is required (as I ended up doing in the first place), and relying on GHC to optimize the code.
Thank you very much to all for your help.
Bang patterns are simply sugar for seq — whenever you see let !x = y in z, that can be translated into let x = y in x `seq` z. seq is standard, so there's no issue with translating programs that use bang patterns into a portable form.
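A sketch of that desugaring for a strict function argument (both definitions below should behave identically; the names go and go' are mine, not from the answer):

```haskell
{-# LANGUAGE BangPatterns #-}

-- Bang-pattern version: the pattern forces acc on every call.
go :: Int -> [Int] -> Int
go !acc []     = acc
go !acc (x:xs) = go (acc + x) xs

-- Desugared version: the same strictness written with explicit seq.
go' :: Int -> [Int] -> Int
go' acc xs = acc `seq` case xs of
  []     -> acc
  (y:ys) -> go' (acc + y) ys
```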
It is true that Haskell makes no guarantees about performance — the report does not even define an evaluation order (only that it must be non-strict), let alone the existence or behaviour of a runtime stack. However, while the report doesn't specify a specific method of implementation, you can certainly optimise for one.
For example, call-by-need (and thus sharing) is used by all Haskell implementations in practice, and is vital for optimising Haskell code for memory usage and speed. Indeed, the pure memoisation trick1 (as relies on sharing (without it, it'll just slow things down).
This basic structure lets us see, for example, that stack overflows are caused by building up too-large thunks. Since you haven't posted your entire code, I can't tell you how to rewrite it without bang patterns, but I suspect [ (c, [r | t]) | ... ] should become [ c `seq` r `seq` t `seq` (c, [r | t]) | ... ]. Of course, bang patterns are more convenient; that's why they're such a common extension! (On the other hand, you probably don't need to force all of those; knowing what to force is entirely dependent on the specific structure of the code, and wildly adding bang patterns to everything usually just slows things down.)
Indeed, "tail recursion" per se does not mean all that much in Haskell: if your accumulator parameters aren't strict, you'll overflow the stack when you later try to force them, and indeed, thanks to laziness, many non-tail-recursive programs don't overflow the stack; printing repeat 1 won't ever overflow the stack, even though the definition — repeat x = x : repeat x — clearly has recursion in a non-tail position. This is because (:) is lazy in its second argument; if you traverse the list, you'll have constant space usage, as the repeat x thunks are forced, and the previous cons cells are thrown away by the garbage collector.
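To make the repeat point concrete, here is a trivial sketch (repeat' is just a hand-rolled repeat):

```haskell
-- Recursion in a non-tail position, yet perfectly safe: each cons
-- cell is produced on demand, and consumed cells are garbage-collected.
repeat' :: a -> [a]
repeat' x = x : repeat' x
```

print (take 5 (repeat' 1)) traverses only five cells, in constant space.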
On a more philosophical note, tail-recursive loops are generally considered suboptimal in Haskell. In general, rather than iteratively computing a result in steps, we prefer to generate a structure with all the step-equivalents at the leaves, and do a transformation (like a fold) on it to produce the final result. This is a much higher-level view of things, made efficient by laziness (the structure is built up and garbage-collected as it's processed, rather than all at once).2
This can take some getting used to at first, and it certainly doesn't work in all cases — extremely complicated loop structures might be a pain to translate efficiently3 — but directly translating tail-recursive loops into Haskell can be painful precisely because it isn't really all that idiomatic.
As far as the paste you linked to goes, id $! x doesn't work to force anything because it's the same as x `seq` id x, which is the same as x `seq` x, which is the same as x. Basically, whenever x `seq` y is forced, x is forced, and the result is y. You can't use seq to just force things at arbitrary points; you use it to cause the forcing of thunks to depend on other thunks.
In this case, the problem is that you're building up a large thunk in c, so you probably want to make auxk and auxj force it; a simple method would be to add a clause like auxj _ _ c _ | seq c False = undefined to the top of the definition. (The guard is always checked, forcing c to be evaluated, but always results in False, so the right-hand side is never evaluated.)
Personally, I would suggest keeping the bang pattern you have in the final version, as it's more readable, but f c _ | seq c False = undefined would work just as well too.
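Here is the guard idiom in a runnable form (sumTo is my own toy example, not the poster's code): the guard forces the accumulator to WHNF on every call but never matches, so evaluation always falls through to the next clause.

```haskell
-- A left-recursive loop kept in constant space by a seq guard.
sumTo :: Int -> Int -> Int
sumTo c n | seq c False = undefined  -- forces c; the guard is never taken
sumTo c 0 = c
sumTo c n = sumTo (c + n) (n - 1)
```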
1 See Elegant memoization with functional memo tries and the data-memocombinators library.
2 Indeed, GHC can often even eliminate the intermediate structure entirely using fusion and deforestation, producing machine code similar to how the computation would be written in a low-level imperative language.
3 Although if you have such loops, it's quite possible that this style of programming will help you simplify them — laziness means that you can easily separate independent parts of a computation out into separate structures, then filter and combine them, without worrying that you'll be duplicating work by making intermediate computations that will later be thrown away.
OK let's work from the ground up here.
You have a list of entries
entries = [(k,j) | j <- [0..jmax], k <- [0..kmax]]
And based on those indexes, you have tests and counts
tests m n = map (\(k,j) -> j + k == m + n - 3) entries
counts = map (\(_,j) -> if (rem j 3) == 0 then 2 else 1) entries
Now you want to build up two things: a "total" count, and the list of entries that "pass" the test. The problem, of course, is that you want to generate the latter lazily, while the former (to avoid exploding the stack) should be evaluated strictly.
If you evaluate these two things separately, then you must either 1) prevent sharing entries (generate it twice, once for each calculation), or 2) keep the entire entries list in memory. If you evaluate them together, then you must either 1) evaluate strictly, or 2) have a lot of stack space for the huge thunk created for the count. Option #2 for both cases is rather bad. Your imperative solution deals with this problem simply by evaluating simultaneously and strictly. For a solution in Haskell, you could take Option #1 for either the separate or the simultaneous evaluation. Or you could show us your "real" code and maybe we could help you find a way to rearrange your data dependencies; it may turn out you don't need the total count, or something like that.
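For what it's worth, here is one hedged sketch of the simultaneous-and-strict option (my own reformulation of the testR example, not from the answer): a strict left fold keeps the count in constant space, while the kept entries are accumulated as a difference list so they come out in the original order.

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.List (foldl')

-- Strict count, in-order list: the bang forces the count at each
-- step; the difference list h appends in O(1) and is turned into
-- an ordinary list only at the end.
testL :: Int -> Int -> (Int, [(Int, Int)])
testL m n = finish (foldl' step (0, id) pairs)
  where
    d = m + n - 3
    pairs  = [ (i, j) | i <- [0 .. m], j <- [0 .. n] ]
    cost j = if rem j 3 == 0 then 2 else 1
    step (!c, h) (i, j)
      | i + j == d = (c + cost j, h . ((i, j) :))
      | otherwise  = (c + cost j, h)
    finish (c, h) = (c, h [])
```

print (testL 1000 1000) should then run without blowing the stack, though unlike the lazy foldr version, the result list only becomes available once the whole fold has finished.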