What does seq actually do in Haskell?

From Real World Haskell I read
It operates as follows: when a seq expression is evaluated, it forces its first argument to be evaluated, then returns its second argument. It doesn't actually do anything with the first argument: seq exists solely as a way to force that value to be evaluated.
where I've emphasised the then because to me it implies an order in which the two things happen.
From Hackage I read
The value of seq a b is bottom if a is bottom, and otherwise equal to b. In other words, it evaluates the first argument a to weak head normal form (WHNF). seq is usually introduced to improve performance by avoiding unneeded laziness.
A note on evaluation order: the expression seq a b does not guarantee that a will be evaluated before b. The only guarantee given by seq is that the both a and b will be evaluated before seq returns a value. In particular, this means that b may be evaluated before a. […]
Furthermore, if I click on the # Source link from there, the page doesn't exist, so I can't see the code of seq.
That seems in line with a comment under this answer:
[…] seq cannot be defined in normal Haskell
On the other hand (or on the same hand, really), another comment reads
The 'real' seq is defined in GHC.Prim as seq :: a -> b -> b; seq = let x = x in x. This is only a dummy definition. Basically seq is specially syntax handled particularly by the compiler.
Can anybody shed some light on this topic? Especially in the following respects.
What source is right?
Is seq's implementation really not writable in Haskell?
If so, what does it even mean? That it is a primitive? What does this tell me about what seq actually does?
In seq a b is a guaranteed to be evaluated before b at least in the case that b makes use of a, e.g. seq a (a + x)?

Other answers have already discussed the meaning of seq and its relationship to pseq. But there appears to be quite some confusion about what exactly the implications of seq’s caveats are.
It is true, technically speaking, that a `seq` b does not guarantee a will be evaluated before b. This may seem troubling: how could it possibly serve its purpose if that were the case? Let’s consider the example Jon gave in their answer:
foldl' :: (a -> b -> a) -> a -> [b] -> a
foldl' f acc [] = acc
foldl' f acc (x : xs)
  = acc' `seq` foldl' f acc' xs
  where
    acc' = f acc x
Surely, we care about acc' being evaluated before the recursive call here. If it is not, the whole purpose of foldl' is lost! So why not use pseq here? And is seq really all that useful?
Fortunately, the situation is not actually so dire. seq really is the right choice here. GHC would never actually choose to compile foldl' such that it evaluates the recursive call before evaluating acc', so the behavior we want is preserved. The difference between seq and pseq is rather what flexibility the optimizer has to make a different decision when it thinks it has particularly good reason to.
Understanding seq and pseq’s strictness
To understand what that means, we must learn to think a little like the GHC optimizer. In practice, the only concrete difference between seq and pseq is how they affect the strictness analyzer:
seq is considered strict in both of its arguments. That is, in a function definition like
f a b c = (a `seq` b) + c
f will be considered strict in all three of its arguments.
pseq is just like seq, but it’s only considered strict in its first argument, not its second one. That means in a function definition like
g a b c = (a `pseq` b) + c
g will be considered strict in a and c, but not b.
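A minimal, compilable version of f and g is sketched below for anyone who wants to inspect what GHC infers. It assumes pseq is imported from GHC.Conc in base (Control.Parallel from the parallel package exports it too); the Int signatures, the diverges helper and main are additions for illustration. Semantically, both functions are bottom whenever any argument is bottom, which is what main checks. The difference discussed here shows up only in the inferred demand signatures, which GHC can dump with a flag such as -ddump-str-signatures when compiling with -O (the exact flag name depends on the GHC version).

```haskell
import Control.Exception (SomeException, evaluate, handle)
import GHC.Conc (pseq)

f :: Int -> Int -> Int -> Int
f a b c = (a `seq` b) + c

g :: Int -> Int -> Int -> Int
g a b c = (a `pseq` b) + c

-- Helper for this sketch: True if forcing the value to WHNF throws.
diverges :: Int -> IO Bool
diverges x = handle h (evaluate x >> pure False)
  where
    h :: SomeException -> IO Bool
    h _ = pure True

main :: IO ()
main = do
  print (f 1 2 3, g 1 2 3)
  -- Each call below has exactly one undefined argument.
  results <- mapM diverges
    [ f undefined 2 3, f 1 undefined 3, f 1 2 undefined
    , g undefined 2 3, g 1 undefined 3, g 1 2 undefined
    ]
  print results
```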
What does this mean? Well, let’s first define what it means for a function to “be strict in one of its arguments” in the first place. The idea is that if a function is strict in one of its arguments, then a call to that function is guaranteed to evaluate that argument. This has several implications:
Suppose we have a function foo :: Int -> Int that is strict in its argument, and suppose we have a call to foo that looks like this:
foo (x + y)
A naïve Haskell compiler would construct a thunk for the expression x + y and pass the resulting thunk to foo. But we know that evaluating foo will necessarily force that thunk, so we’re not gaining anything from this laziness. It would be better to evaluate x + y immediately, then pass the result to foo to save a needless thunk allocation.
Since we know there’s never any reason to pass a thunk to foo, we gain an opportunity to make additional optimizations. For example, the optimizer could choose to internally rewrite foo to take an unboxed Int# instead of an Int, avoiding not only a thunk construction for x + y but avoiding boxing the resulting value altogether. This allows the result of x + y to be passed directly, on the stack, rather than on the heap.
As you can see, strictness analysis is rather crucial to making an efficient Haskell compiler, as it allows the compiler to make far more intelligent decisions about how to compile function calls, among other things. For that reason, we generally want strictness analysis to find as many opportunities to eagerly evaluate things as possible, letting us save on useless heap allocations.
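To make the unboxing idea concrete, here is a hand-written sketch of the kind of split the optimizer might perform for a strict function. The body of foo and the names fooWrapped and fooWorker are hypothetical; GHC generates its own worker and wrapper internally and does not need you to write this. The sketch relies on the MagicHash extension and the primitives exported by GHC.Exts.

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), Int#, (*#), (+#))

-- Hypothetical strict function; the body is made up for illustration.
foo :: Int -> Int
foo n = n * 2 + 1

-- Hand-written "wrapper": unboxes the argument, calls the worker, reboxes.
fooWrapped :: Int -> Int
fooWrapped (I# n) = I# (fooWorker n)

-- Hand-written "worker": operates on a raw machine integer, with no thunk
-- and no heap-allocated box for the argument.
fooWorker :: Int# -> Int#
fooWorker n = (n *# 2#) +# 1#

main :: IO ()
main = do
  let x = 3 :: Int
      y = 4 :: Int
  print (foo (x + y))        -- 15
  print (fooWrapped (x + y)) -- 15
```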
With this in mind, let’s return to our f and g examples above. Let’s think about what strictness we’d intuitively expect these functions to have:
Recall that the body of f is (a `seq` b) + c. Even if we ignore the special properties of seq altogether, we know that it eventually evaluates to its second argument. This means f ought to be at least as strict as if its body were just b + c (with a entirely unused).
We know that evaluating b + c must fundamentally evaluate both b and c, so f must, at the very least, be strict in both b and c. Whether it’s strict in a is the more interesting question. If seq were actually just flip const, it would not be, as a would not be used, but of course the whole point of seq is to introduce artificial strictness, so in fact f is also considered strict in a.
Happily, the strictness of f I mentioned above is entirely consistent with our intuition about what strictness it ought to have. f is strict in all of its arguments, precisely as we would expect.
Intuitively, all of the above reasoning for f should also apply to g. The only difference is the replacement of seq with pseq, and we know that pseq provides a stronger guarantee about evaluation order than seq does, so we’d expect g to be at least as strict as f… which is to say, also strict in all its arguments.
However, remarkably, this is not the strictness GHC infers for g. GHC considers g strict in a and c, but not b, even though by our definition of strictness above, g is rather obviously strict in b: b must be evaluated for g to produce a result! As we’ll see, it is precisely this discrepancy that makes pseq so deeply magical, and why it’s generally a bad idea.
The implications of strictness
We’ve now seen that seq leads to the strictness we’d expect while pseq does not, but it’s not immediately obvious what that implies. To illustrate, consider a possible call site where f is used:
f a (b + 1) c
We know that f is strict in all its arguments, so by the same reasoning we used above, GHC should evaluate b + 1 eagerly and pass its result to f, avoiding a thunk.
At first blush, this might seem all well and good, but wait: what if a is a thunk? Even though f is also strict in a, it’s just a bare variable (maybe it was passed in as an argument from somewhere else), and there’s no reason for GHC to eagerly force a here if f is going to force it itself. The only reason we force b + 1 is to spare a new thunk from being created, but we save nothing by forcing the already-created a at the call site. This means a might in fact be passed as an unevaluated thunk.
This is something of a problem, because in the body of f, we wrote a `seq` b, requesting a be evaluated before b. But by our reasoning above, GHC just went ahead and evaluated b first! If we really, really need to make sure b isn’t evaluated until after a is, this type of eager evaluation can’t be allowed.
Of course, this is precisely why pseq is considered lazy in its second argument, even though it actually is not. If we replace f with g, then GHC would obediently allocate a fresh thunk for b + 1 and pass it on the heap, ensuring it is not evaluated a moment too soon. This of course means more heap allocation, no unboxing, and (worst of all) no propagation of strictness information further up the call chain, creating potentially cascading pessimizations. But hey, that’s what we asked for: avoid evaluating b too early at all costs!
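The call-site scenario can be turned into a small experiment, sketched below. It is an experiment rather than a proof: callF, callG, the NOINLINE pragmas and the trace messages are all additions for illustration, pseq is assumed to come from GHC.Conc, and the order in which the messages appear is not specified and may differ between GHC versions and optimisation levels. Only the printed results are guaranteed.

```haskell
import Debug.Trace (trace)
import GHC.Conc (pseq)

f :: Int -> Int -> Int -> Int
f a b c = (a `seq` b) + c
{-# NOINLINE f #-}

g :: Int -> Int -> Int -> Int
g a b c = (a `pseq` b) + c
{-# NOINLINE g #-}

-- Call sites shaped like the one in the text: a is a bare variable,
-- while b + 1 is a fresh expression that would otherwise need a thunk.
callF :: Int -> Int -> Int -> Int
callF a b c = f a (b + 1) c
{-# NOINLINE callF #-}

callG :: Int -> Int -> Int -> Int
callG a b c = g a (b + 1) c
{-# NOINLINE callG #-}

main :: IO ()
main = do
  -- trace writes its message to stderr when the traced value is forced.
  print (callF (trace "f: a forced" 1) (trace "f: b forced" 2) 3)
  print (callG (trace "g: a forced" 1) (trace "g: b forced" 2) 3)
```

Comparing the stderr output of a build with -O0 against one with -O is one way to see whether the optimizer reordered anything on your particular compiler.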
Hopefully, this illustrates why pseq is seductive, but ultimately counterproductive unless you really know what you’re doing. Sure, you guarantee the evaluation you’re looking for… but at what cost?
The takeaways
Hopefully the above explanation has made clear how both seq and pseq have advantages and disadvantages:
seq plays nice with the strictness analyzer, exposing many more potential optimizations, but those optimizations might disrupt the order of evaluation we expect.
pseq preserves the desired evaluation order at all costs, but it only does this by outright lying to the strictness analyzer so it’ll stay off its case, dramatically weakening its ability to help the optimizer do good things.
How do we know which tradeoffs to choose? While we may now understand why seq can sometimes fail to evaluate its first argument before its second, we don’t have any more reason to believe this is an okay thing to let happen.
To soothe your fears, let’s take a step back and think about what’s really happening here. Note that GHC never actually compiles the a `seq` b expression itself in such a way that a fails to be evaluated before b. Given an expression like a `seq` (b + c), GHC won’t ever secretly stab you in the back and evaluate b + c before evaluating a. Rather, what it does is much more subtle: it might indirectly cause b and c to be individually evaluated before evaluating the overall b + c expression, since the strictness analyzer will note that the overall expression is still strict in both b and c.
How all this fits together is incredibly tricky, and it might make your head spin, so perhaps you don’t find the previous paragraph all that soothing after all. But to make the point more concrete, let’s return to the foldl' example at the beginning of this answer. Recall that it contains an expression like this:
acc' `seq` foldl' f acc' xs
In order to avoid the thunk blowup, we need acc' to be evaluated before the recursive call to foldl'. But given the above reasoning, it still always will be! The difference that seq makes here relative to pseq is, again, only relevant for strictness analysis: it allows GHC to infer that this expression is also strict in f and xs, not just acc', which in this situation doesn’t actually change much at all:
The overall foldl' function still isn’t considered strict in f, since in the first case of the function (the one where xs is []), f is unused, so for some call patterns, foldl' is lazy in f.
foldl' can be considered strict in xs, but that is totally uninteresting here, since xs is only a piece of one of foldl'’s arguments, and that strictness information doesn’t affect the strictness of foldl' at all.
So, if there is not actually any difference here, why not use pseq? Well, suppose foldl' is inlined some finite number of times at a call site, since maybe the shape of its second argument is partially known. The strictness information exposed by seq might then expose several additional optimizations at the call site, leading to a chain of advantageous optimizations. If pseq had been used, those optimizations would be obscured, and GHC would produce worse code.
The real takeaway here is therefore that even though seq might sometimes not evaluate its first argument before its second, this is only technically true, the way it happens is subtle, and it’s pretty unlikely to break your program. This should not be too surprising: seq is the tool the authors of GHC expect programmers to use in this situation, so it would be rather rude of them to make it do the wrong thing! seq is the idiomatic tool for this job, not pseq, so use seq.
When do you use pseq, then? Only when you really, really care about a very specific evaluation order, which usually only happens for one of two reasons: you are using par-based parallelism, or you’re using unsafePerformIO and care about the order of side effects. If you’re not doing either of these things, then don’t use pseq. If all you care about is use cases like foldl', where you just want to avoid needless thunk build-up, use seq. That’s what it’s for.
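For the par-based case, the usual shape looks roughly like the sketch below. The fib function and the name parFibSum are made up for illustration, and par and pseq are assumed to come from GHC.Conc in base. Here the order really matters: the current thread should work on y while the spark for x is (possibly) picked up by another core, so x + y must not be started before y is done.

```haskell
import GHC.Conc (par, pseq)

-- Deliberately naive Fibonacci, used only as a source of work.
fib :: Int -> Integer
fib n
  | n < 2     = toInteger n
  | otherwise = fib (n - 1) + fib (n - 2)

-- Spark x for possible parallel evaluation, evaluate y on the current
-- thread, and only then combine the two results.
parFibSum :: Int -> Int -> Integer
parFibSum m n = x `par` (y `pseq` (x + y))
  where
    x = fib m
    y = fib n

main :: IO ()
main = print (parFibSum 20 21) -- 17711
```

To see any actual parallelism, the program has to be built with -threaded and run with +RTS -N; without those it still computes the same result on one core.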

seq introduces an artificial data dependency between two thunks. Normally, a thunk is forced to evaluate only when pattern-matching demands it. If the thunk a contains the expression case b of { … }, then forcing a also forces b. So there is a dependency between the two: in order to determine the value of a, we must evaluate b.
seq specifies this relationship between any two arbitrary thunks. When seq c d is forced, c is forced in addition to d. Note that I don’t say before: according to the standard, an implementation is free to force c before d or d before c or even some mixture thereof. It’s only required that if c does not halt, then seq c d also doesn’t halt. If you want to guarantee evaluation order, you can use pseq.
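The "if c does not halt, neither does seq c d" rule, together with the fact that seq only forces to weak head normal form, can be checked with a short program. The throws helper below is an addition for this sketch; it uses handle and evaluate from Control.Exception to turn a diverging (here: undefined) value into an observable Bool.

```haskell
import Control.Exception (SomeException, evaluate, handle)

-- Helper for this sketch: True if forcing the value to WHNF throws.
throws :: a -> IO Bool
throws x = handle h (evaluate x >> pure False)
  where
    h :: SomeException -> IO Bool
    h _ = pure True

main :: IO ()
main = do
  -- The first argument is bottom, so the whole expression is bottom.
  throws (seq (undefined :: Int) "d") >>= print        -- True
  -- Just _ is already a constructor application: WHNF is reached
  -- without touching the undefined payload.
  throws (seq (Just (undefined :: Int)) "d") >>= print -- False
  -- Same for a list cell whose element is undefined.
  throws (seq [undefined :: Int] "d") >>= print        -- False
```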
The diagrams below illustrate the difference. A black arrowhead (▼) indicates a real data dependency, the kind that you could express using case; a white arrowhead (▽) indicates an artificial dependency.
Forcing seq a b must force both a and b.
  │
┌─▼───────┐
│ seq a b │
└─┬─────┬─┘
  │     │
┌─▽─┐ ┌─▼─┐
│ a │ │ b │
└───┘ └───┘
Forcing pseq a b must force b, which must first force a.
  │
┌─▼────────┐
│ pseq a b │
└─┬────────┘
  │
┌─▼─┐
│ b │
└─┬─┘
  │
┌─▽─┐
│ a │
└───┘
As it stands, it must be implemented as an intrinsic because its type, forall a b. a -> b -> b, claims that it works for any types a and b, without any constraint. It used to belong to a typeclass, but this was removed and made into a primitive because the typeclass version was considered to have poor ergonomics: adding seq to try to fix a performance issue in a deeply nested chain of function calls would require adding a boilerplate Seq a constraint on every function in the chain. (I would prefer the explicitness, but it would be hard to change now.)
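A class-based seq can be written in ordinary Haskell, because each instance knows its type's constructors and can force a value by pattern matching. The sketch below is hypothetical: the class name Seq, the method seq' and the function sumStrict are made up here and are not the historical API. It also shows the ergonomic cost mentioned above, since the Seq a constraint has to appear on every function that uses seq', directly or indirectly.

```haskell
-- Hypothetical class; names are invented for this sketch.
class Seq a where
  seq' :: a -> b -> b

instance Seq Int where
  -- Matching against a literal forces the Int before either branch runs.
  seq' n y = case n of
    0 -> y
    _ -> y

instance Seq [a] where
  -- Matching on the outermost constructor forces the list to WHNF only.
  seq' []      y = y
  seq' (_ : _) y = y

-- The constraint leaks into the signature of every caller.
sumStrict :: (Num a, Seq a) => [a] -> a
sumStrict = go 0
  where
    go acc []       = acc
    go acc (x : xs) = let acc' = acc + x in acc' `seq'` go acc' xs

main :: IO ()
main = do
  print (sumStrict [1 .. 100000 :: Int])
  -- The undefined element is never touched: only the first cons cell is forced.
  putStrLn (seq' [undefined :: Int] "list forced to its first constructor only")
```

What cannot be written this way is a single definition that works at every type with no constraint at all, which is the sense in which the unconstrained seq has to be a primitive.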
So seq, and syntactic sugar for it like strict fields in data types or BangPatterns in patterns, is about ensuring that something is evaluated by attaching it to the evaluation of something else that will be evaluated. The classic example is foldl'. Here, the seq ensures that when the recursive call is forced, the accumulator is also forced:
foldl' :: (a -> b -> a) -> a -> [b] -> a
foldl' f acc [] = acc
foldl' f acc (x : xs)
  = acc' `seq` foldl' f acc' xs
  where
    acc' = f acc x
This asks the compiler to ensure that, if f is strict, such as (+) on a strict data type like Int, the accumulator is reduced to an Int at each step rather than building a chain of thunks that is evaluated only at the end.
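As a usage sketch, the same accumulator loop can be written with the library's foldl' from Data.List or with a bang pattern, one of the sugared forms of seq. The function name sumBang is made up for this example, and the expected values in the comments assume a 64-bit Int.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- The accumulator is forced on every iteration by the bang pattern,
-- which plays the role of the explicit seq.
sumBang :: [Int] -> Int
sumBang = go 0
  where
    go !acc []       = acc
    go !acc (x : xs) = go (acc + x) xs

main :: IO ()
main = do
  let xs = [1 .. 1000000] :: [Int]
  print (foldl' (+) 0 xs) -- 500000500000
  print (sumBang xs)      -- 500000500000
```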

Real World Haskell is mistaken, and all the other things you quoted are correct. If you care deeply about the evaluation order, use pseq instead.

Related

Clarification on optimizing Single-constructor datatypes in GHC

I was reading about how to optimize my Haskell code and came across a note about Single-constructor datatypes in GHC.
Excerpt:
GHC loves single-constructor datatypes, such as tuples. A single-constructor datatype can be unpacked when it is passed to a strict function. For example, given this function:
f (x,y) = ...
GHC's strictness analyser will detect that f is strict in its argument, and compile the function like this:
f z = case z of (x,y) -> f' x y
f' x y = ...
where f is called the wrapper, and f' is called the worker. The wrapper is inlined everywhere, so for example if you had a call to f like this:
... f (3,4) ...
this will end up being compiled to
... f' 3 4 ...
and the tuple has been completely optimised away.
Does this mean I should go through my program and wrap up all function arguments into one tuple? I don't really see how this is an optimization when the tuple gets unwrapped anyway.
Is this an alternative to the INLINE pragma? Should I use both? Only one? Is one better?

I don't really see how this is an optimization when the tuple gets unwrapped anyway.
That is the optimisation: that the tuple gets unwrapped. IOW, the program that eventually runs never contains a tuple at all, it just contains a two-argument function call.
One might also put this in more pessimistic terms: ab initio, tuples are inherently bad for performance. That's because a tuple argument requires three pointer indirections: a thunk for the entire tuple, a thunk for the fst element, and a thunk for the snd element. So in principle, for performance it's a very bad idea to wrap your data into tuples. (Better put it in data structs with strict fields.) That is, of course, unless you really do need laziness at all of these spots.
However, and that's what the quote is all about, in practice it's generally still fine to use tuples in GHC, because it can usually optimise the indirection away if it can prove that it isn't actually needed.
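The difference between a lazy tuple and a data type with strict fields can be observed directly. In the sketch below, the Point type, the norm1 functions and the forcedOrNot helper are all hypothetical names introduced for illustration; the UNPACK pragmas ask GHC to store the Ints inline in the constructor.

```haskell
import Control.Exception (SomeException, evaluate, handle)

-- Hypothetical pair type with strict, unpacked fields.
data Point = Point {-# UNPACK #-} !Int {-# UNPACK #-} !Int

norm1Tuple :: (Int, Int) -> Int
norm1Tuple (x, y) = abs x + abs y

norm1Point :: Point -> Int
norm1Point (Point x y) = abs x + abs y

-- Helper for this sketch: reports whether forcing a value to WHNF throws.
forcedOrNot :: a -> IO String
forcedOrNot x = handle h (evaluate x >> pure "built without forcing the fields")
  where
    h :: SomeException -> IO String
    h _ = pure "a field was forced"

main :: IO ()
main = do
  print (norm1Tuple (3, -4))        -- 7
  print (norm1Point (Point 3 (-4))) -- 7
  -- A lazy tuple may hold an unevaluated, even undefined, component.
  forcedOrNot (1 :: Int, undefined :: Int) >>= putStrLn
  -- Building a Point forces both fields, so the undefined is hit at once.
  forcedOrNot (Point 1 undefined) >>= putStrLn
```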

Automatically inserting laziness in Haskell

Haskell pattern matching is often head-strict; for example, f (x:xs) = ...
requires the input list to be evaluated to (thunk : thunk). But sometimes such evaluation is not needed and the function can afford to be non-strict in some arguments, for example f (x:xs) = 3.
Ideally, in such situations we could avoid evaluating arguments to get the behaviour of const 3, which could be done with an irrefutable pattern: f ~(x:xs) = 3. This gives us performance benefits and greater error tolerance.
My question is: Does GHC already implement such transformations via some kind of strictness analysis? Appreciate it if you could also point me to some readings on it.

As far as I know, GHC will never make something more lazy than specified by the programmer, even if it can prove that doesn't change the semantics of the term. I don't think there is any fundamental reason to avoid changing the laziness of a term when we can prove the semantics don't change; I suspect it's more of an empirical observation that we don't know of any situations where that would be a really great idea. (And if a transformation would change the semantics, I would consider it a bug for GHC to make that change.)
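For the specific function in the question, the two versions are observably different, which is one reason a compiler could not swap one for the other silently. The sketch below uses the made-up names strictF and lazyF and a small outcome helper built on Control.Exception; strictF is intentionally partial, as in the question.

```haskell
import Control.Exception (SomeException, evaluate, handle)

-- Ordinary pattern: the argument is forced to its outermost constructor.
strictF :: [Int] -> Int
strictF (_x : _xs) = 3

-- Irrefutable pattern: the argument is never inspected.
lazyF :: [Int] -> Int
lazyF ~(_x : _xs) = 3

-- Helper for this sketch: shows the result, or notes that forcing it threw.
outcome :: Int -> IO String
outcome v = handle h (fmap show (evaluate v))
  where
    h :: SomeException -> IO String
    h _ = pure "exception"

main :: IO ()
main = do
  outcome (strictF [1, 2]) >>= putStrLn   -- 3
  outcome (lazyF [1, 2]) >>= putStrLn     -- 3
  outcome (strictF undefined) >>= putStrLn -- exception
  outcome (lazyF undefined) >>= putStrLn   -- 3
  -- The two also differ on the empty list, where the strict match fails.
  outcome (strictF []) >>= putStrLn        -- exception
  outcome (lazyF []) >>= putStrLn          -- 3
```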
There is only one possible exception that comes to mind, the so-called "full laziness" transformation, described well on the wiki. In short, GHC will translate
\a b -> let c = {- something that doesn't mention b -} in d
to
\a -> let c = {- same thing as before -} in \b -> d
to avoid recomputing c each time the argument is applied to a new b. But it seems to me that this transformation is more about memoization than about laziness: the two terms above appear to me to have the same (denotational) semantics wrt laziness/strictness, and are only operationally different.

Can compilers deduce/prove mathematically?

I'm starting to learn functional programming languages like Haskell and ML, and most of the exercises show off things like:
foldr (+) 0 [ 1 ..10]
which is equivalent to
sum = 0
for( i in [1..10] )
sum += i
So that leads me to wonder: why can't the compiler recognise that this is an arithmetic progression and use the O(1) formula to calculate it?
Especially for pure FP languages without side effect?
The same applies for
sum reverse list == sum list
Given a + b = b + a
and definition of reverse, can compilers/languages prove it automatically?

Compilers generally don't try to prove this kind of thing automatically, because it's hard to implement.
As well as adding the logic to the compiler to transform one fragment of code into another, you have to be very careful that it only tries to do it when it's actually safe - i.e. there are often lots of "side conditions" to worry about. For example in your example above, someone might have written an instance of the type class Num (and hence the (+) operator) where the a + b is not b + a.
However, GHC does have rewrite rules which you can add to your own source code and could be used to cover some relatively simple cases like the ones you list above, particularly if you're not too bothered about the side conditions.
For example, and I haven't tested this, you might use the following rule for one of your examples above:
{-# RULES
"sum/reverse" forall list . sum (reverse list) = sum list
#-}
Note the parentheses around reverse list - what you've written in your question actually means (sum reverse) list and wouldn't typecheck.
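A self-contained variant of the same idea is sketched below. It deliberately uses locally defined, monomorphic stand-ins (mySum and myReverse, both made-up names) marked NOINLINE, on the assumption that a rule is easier to get to fire on functions the simplifier is not allowed to inline first. The rule is only applied when compiling with optimisation, and GHC does not check that the two sides are actually equal; that remains the programmer's responsibility.

```haskell
-- Monomorphic stand-ins for sum and reverse, kept out of the inliner's
-- way so that the rule below has something to match on.
mySum :: [Int] -> Int
mySum = foldr (+) 0
{-# NOINLINE mySum #-}

myReverse :: [Int] -> [Int]
myReverse = reverse
{-# NOINLINE myReverse #-}

{-# RULES
"mySum/myReverse" forall xs. mySum (myReverse xs) = mySum xs
  #-}

sumOfReversed :: [Int] -> Int
sumOfReversed xs = mySum (myReverse xs)

main :: IO ()
main = print (sumOfReversed [1 .. 10]) -- 55, whether or not the rule fires
```

Compiling with -O -ddump-rule-firings is one way to check whether the rule was actually used.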
EDIT:
As you're looking for official sources and pointers to research, I've listed a few.
Obviously it's hard to prove a negative but the fact that no-one has given an example of a general-purpose compiler that does this kind of thing routinely is probably quite strong evidence in itself.
As others have pointed out, even simple arithmetic optimisations are surprisingly dangerous, particularly on floating point numbers, and compilers generally have flags to turn them off - for example Visual C++, gcc. Even integer arithmetic isn't always clear-cut and people occasionally have big arguments about how to deal with things like overflow.
As Joachim noted, integer variables in loops are one place where slightly more sophisticated optimisations are applied because there are actually significant wins to be had. Muchnick's book is probably the best general source on the topic but it's not that cheap. The wikipedia page on strength reduction is probably as good an introduction as any to one of the standard optimisations of this kind, and has some references to the relevant literature.
FFTW is an example of a library that does all kinds of mathematical optimization internally. Some of its code is generated by a customised compiler the authors wrote specifically for the purpose. It's worthwhile because the authors have domain-specific knowledge of optimizations that in the specific context of the library are both worth the effort and safe.
People sometimes use template metaprogramming to write "self-optimising libraries" that again might rely on arithmetic identities, see for example Blitz++. Todd Veldhuizen's PhD dissertation has a good overview.
If you descend into the realms of toy and academic compilers all sorts of things go. For example my own PhD dissertation is about writing inefficient functional programs along with little scripts that explain how to optimise them. Many of the examples (see Chapter 6) rely on applying arithmetic rules to justify the underlying optimisations.
Also, it's worth emphasising that the last few examples are of specialised optimisations being applied only to certain parts of the code (e.g. calls to specific libraries) where it is expected to be worthwhile. As other answers have pointed out, it's simply too expensive for a compiler to go searching for all possible places in an entire program where an optimisation might apply. The GHC rewrite rules that I mentioned above are a great example of a compiler exposing a generic mechanism for individual libraries to use in a way that's most appropriate for them.

The answer
No, compilers don’t do that kind of stuff.
One reason why
And for your examples, it would even be wrong: Since you did not give type annotations, the Haskell compiler will infer the most general type, which would be
foldr (+) 0 [ 1 ..10] :: Num a => a
and similar
(\list -> sum (reverse list)) :: Num a => [a] -> a
and the Num instance for the type that is being used might well not fulfil the mathematical laws required for the transformation you suggest. The compiler should, before everything else, avoid changing the meaning (i.e. the semantics) of your program.
More pragmatically: The cases where the compiler could detect such large-scale transformations rarely occur in practice, so it would not be worth it to implement them.
An exception
Notable exceptions are linear transformations in loops. Most compilers will rewrite
for (int i = 0; i < n; i++) {
    ... 200 + 4 * i ...
}
to
for (int i = 0, j = 200; i < n; i++, j += 4) {
    ... j ...
}
or something similar, as that pattern often occurs in code working on arrays.

The optimizations you have in mind will probably not be done even in the presence of monomorphic types, because there are so many possibilities and so much knowledge required. For example, in this example:
sum list == sum (reverse list)
The compiler would need to know or take into account the following facts:
sum = foldl (+) 0
(+) is commutative
reverse list is a permutation of list
foldl x c l, where x is commutative and c is a constant, yields the same result for all permutations of l.
This all seems trivial. Sure, the compiler can most probably look up the definition of sum and inline it. It could be required that (+) be commutative, but remember that + is just another symbol without attached meaning to the compiler. The third point would require the compiler to prove some non-trivial properties about reverse.
But the point is:
You don't want the compiler to perform those calculations with each and every expression. Remember, to make this really useful, you'd have to heap up a lot of knowledge about many, many standard functions and operators.
You still can't replace the expression above with True unless you can rule out the possibility that list or some list element is bottom. Usually, one cannot do this. You can't even do the following "trivial" optimization of f x == f x in all cases
f x `seq` True
For, consider
f x = (undefined :: Bool, x)
then
f x `seq` True ==> True
f x == f x ==> undefined
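The two outcomes can be reproduced with a short program. The Int argument type and the describe helper are additions for this sketch; describe uses handle and evaluate from Control.Exception to print either the Bool or the word "undefined" when forcing it throws.

```haskell
import Control.Exception (SomeException, evaluate, handle)

f :: Int -> (Bool, Int)
f x = (undefined, x)

-- Helper for this sketch: shows the Bool, or reports that forcing it threw.
describe :: Bool -> IO String
describe b = handle h (fmap show (evaluate b))
  where
    h :: SomeException -> IO String
    h _ = pure "undefined"

main :: IO ()
main = do
  -- seq stops at the tuple constructor, so the undefined is never touched.
  describe (f 1 `seq` True) >>= putStrLn -- True
  -- (==) on tuples compares the first components, forcing the undefined.
  describe (f 1 == f 1) >>= putStrLn     -- undefined
```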
That being said, regarding your first example slightly modified for monomorphism:
f n = n * foldl (+) 0 [1..10] :: Int
it is imaginable to optimize the program by moving the expression out of its context and replace it with the name of a constant, like so:
const1 = foldl (+) 0 [1..10] :: Int
f n = n * const1
This is because the compiler can see that the expression must be constant.

What you're describing looks like super-compilation. In your case, if the expression had a monomorphic type like Int (as opposed to polymorphic Num a => a), the compiler could infer that the expression foldr (+) 0 [1 ..10] has no external dependencies, therefore it could be evaluated at compile time and replaced by 55. However, AFAIK no mainstream compiler currently does this kind of optimization.
(In functional programming, "proving" is usually associated with something different. In languages with dependent types, types are powerful enough to express complex propositions, and then through the Curry-Howard correspondence programs become proofs of such propositions.)

As others have noted, it's unclear that your simplifications even hold in Haskell. For instance, I can define
newtype NInt = N Int
instance Num NInt where
  N a + _ = N a
  N b * _ = N b
  ... -- etc
and now sum . reverse :: Num a => [a] -> a does not equal sum :: Num a => [a] -> a since I can specialize each to [NInt] -> NInt where sum . reverse == sum clearly does not hold.
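A runnable version of this counterexample is sketched below. Two things in it are assumptions of the sketch rather than part of the definition above: the remaining Num methods are filled in arbitrarily, and the fold is written out explicitly as foldr (+) 0 under the name sumR. The explicit right fold matters, because which way the library's sum folds is an implementation detail, and with a left fold starting from 0 this particular instance would return N 0 for every list.

```haskell
newtype NInt = N Int
  deriving (Eq, Show)

-- Deliberately law-breaking instance: (+) and (*) ignore their right operand.
instance Num NInt where
  N a + _ = N a
  N a * _ = N a
  abs (N a) = N (abs a)
  signum (N a) = N (signum a)
  negate (N a) = N (negate a)
  fromInteger = N . fromInteger

-- Explicit right fold, so the result does not depend on how sum is defined.
sumR :: Num a => [a] -> a
sumR = foldr (+) 0

main :: IO ()
main = do
  let xs = [N 1, N 2, N 3]
  print (sumR xs)           -- N 1
  print (sumR (reverse xs)) -- N 3
  -- For a lawful instance such as Int the two agree.
  print (sumR [1, 2, 3 :: Int] == sumR (reverse [1, 2, 3]))
```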
This is one general tension that exists around optimizing "complex" operations: you actually need quite a lot of information in order to successfully prove that it's okay to optimize something. This is why the syntax-level compiler optimizations that do exist are usually monomorphic and related to the structure of programs; it's usually such a simplified domain that there's "no way" for the optimization to go wrong. Even that is often unsafe because the domain is never quite so simplified and well-known to the compiler.
As an example, a very popular "high-level" syntactic optimization is stream fusion. In this case the compiler is given enough information to know that stream fusion can occur and is basically safe, but even in this canonical example we have to skirt around notions of non-termination.
So what does it take to have \x -> sum [0..x] get replaced by \x -> x*(x + 1)/2? The compiler would need a theory of numbers and algebra built-in. This is not possible in Haskell or ML, but becomes possible in dependently typed languages like Coq, Agda, or Idris. There you could specify things like
revCommute :: (_+_ :: a -> a -> a)
-> Commutative _+_
-> foldr _+_ z (reverse as) == foldr _+_ z as
and then, theoretically, tell the compiler to rewrite according to revCommute. This would still be difficult and finicky, but at least we'd have enough information around. To be clear, I'm writing something very strange above, a dependent type. The type not only depends on the ability to introduce both a type and a name for the argument inline, but also the existence of the entire syntax of your language "at the type level".
There are a lot of differences between what I just wrote and what you'd do in Haskell, though. First, in order to form a basis where such promises can be taken seriously, we must throw away general recursion (and thus we already don't have to worry about questions of non-termination like stream-fusion does). We also must have enough structure around to create something like the promise Commutative _+_---this likely depends upon there being an entire theory of operators and mathematics built into the language's standard library else you would need to create that yourself. Finally, the richness of type system required to even express these kinds of theories adds a lot of complexity to the entire system and tosses out type inference as you know it today.
But, given all that structure, I'd never be able to create an obligation Commutative _+_ for the _+_ defined to work on NInts and so we could be certain that foldr (+) 0 . reverse == foldr (+) 0 actually does hold.
But now we'd need to tell the compiler how to actually perform that optimization. For stream-fusion, the compiler rules only kick in when we write something in exactly the right syntactic form to be "clearly" an optimization redex. The same kinds of restrictions would apply to our sum . reverse rule. In fact, already we're sunk because
foldr (+) 0 . reverse
foldr (+) 0 (reverse as)
don't match. They're "obviously" the same due to some rules we could prove about (.), but that means that now the compiler must invoke two built-in rules in order to perform our optimization.
At the end of the day, you need a very smart optimization search over the sets of known laws in order to achieve the kinds of automatic optimizations you're talking about.
So not only do we add a lot of complexity to the entire system, require a lot of base work to build-in some useful algebraic theories, and lose Turing completeness (which might not be the worst thing), we also only get a finicky promise that our rule would even fire unless we perform an exponentially painful search during compilation.
Blech.
The compromise that exists today tends to be that sometimes we have enough control over what's being written to be mostly certain that a certain obvious optimization can be performed. This is the regime of stream fusion and it requires a lot of hidden types, carefully written proofs, exploitations of parametricity, and hand-waving before it's something the community trusts enough to run on their code.
And it doesn't even always fire. For an example of battling that problem take a look at the source of Vector for all of the RULES pragmas that specify all of the common circumstances where Vector's stream-fusion optimizations should kick in.
All of this is not at all a critique of compiler optimizations or dependent type theories. Both are really incredible. Instead it's just an amplification of the tradeoffs involved in introducing such an optimization. It's not to be done lightly.

Fun fact: Given two arbitrary formulas, do they both give the same output for the same inputs? The answer to this trivial question is not computable! In other words, it is mathematically impossible to write a computer program that always gives the correct answer in finite time.
Given this fact, it's perhaps not surprising that nobody has a compiler that can magically transform every possible computation into its most efficient form.
Also, isn't this the programmer's job? If you want the sum of an arithmetic sequence commonly enough that it's a performance bottleneck, why not just write some more efficient code yourself? Similarly, if you really want Fibonacci numbers (why?), use the O(1) algorithm.
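As a small illustration of "write some more efficient code yourself", here is a sketch for the arithmetic-sequence case. The function names sumNaive and sumClosed are made up for this example, and n is assumed to be non-negative.

```haskell
-- Sum of the n terms a, a+d, a+2d, ... by adding every term: O(n).
sumNaive :: Integer -> Integer -> Integer -> Integer
sumNaive a d n = sum (take (fromIntegral n) (iterate (+ d) a))

-- The same sum via the closed form n * (2a + (n-1)d) / 2.
-- The product is always even, so the integer division is exact.
sumClosed :: Integer -> Integer -> Integer -> Integer
sumClosed a d n = n * (2 * a + (n - 1) * d) `div` 2

main :: IO ()
main = do
  print (sumNaive 1 1 100)   -- 5050
  print (sumClosed 1 1 100)  -- 5050
```

The compiler has no way of knowing that the first definition can be replaced by the second; the programmer does.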

Is everything in Haskell stored in thunks, even simple values?

What do the thunks for the following value/expression/function look like in the Haskell heap?
val = 5 -- is `val` a pointer to a box containing 5?
add x y = x + y
result = add 2 val
main = print $ result
Would be nice to have a picture of how these are represented in Haskell, given its lazy evaluation mode.

Official answer
It's none of your business. Strictly implementation detail of your compiler.
Short answer
Yes.
Longer answer
To the Haskell program itself, the answer is always yes, but the compiler can and will do things differently if it finds out that it can get away with it, for performance reasons.
For example, for add x y = x + y, a compiler might generate code that works with thunks for x and y and constructs a thunk as a result.
But consider the following:
foo :: Int -> Int -> Int
foo x y = x * x + y * y
Here, an optimizing compiler will generate code that first takes x and y out of their boxes, then does all the arithmetic, and then stores the result in a box.
Advanced answer
This paper describes how GHC switched from one way of implementing thunks to another that was actually both simpler and faster:
http://research.microsoft.com/en-us/um/people/simonpj/papers/eval-apply/

In general, even primitive values in Haskell (e.g. of type Int and Float) are represented by thunks. This is indeed required by the non-strict semantics; consider the following fragment:
bottom :: Int
bottom = div 1 0
This definition will generate a div-by-zero exception only if the value of bottom is inspected, but not if the value is never used.
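A minimal sketch of that behaviour, assuming GHC and the standard Control.Exception module; the name unused is made up for the example:

```haskell
import Control.Exception (ArithException, evaluate, try)

bottom :: Int
bottom = div 1 0

-- bottom sits in the pair as an unevaluated thunk; fst never looks at it.
unused :: Int
unused = fst (42, bottom)

main :: IO ()
main = do
  print unused  -- 42: bottom was never inspected
  -- evaluate forces bottom, and only now does the exception occur.
  result <- try (evaluate bottom) :: IO (Either ArithException Int)
  case result of
    Left _  -> putStrLn "forcing bottom raised an arithmetic exception"
    Right n -> print n
```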
Consider now the add function:
add :: Int -> Int -> Int
add x y = x+y
A naive implementation of add must force the thunk for x, force the thunk for y, add the values and create an (evaluated) thunk for the result. This is a huge overhead for arithmetic compared to strict functional languages (not to mention imperative ones).
However, an optimizing compiler such as GHC can mostly avoid this overhead; this is a simplified view of how GHC translates the add function:
add :: Int -> Int -> Int
add (I# x) (I# y) = case# (x +# y) of z -> I# z
Internally, basic types like Int are seen as datatypes with a single constructor. The type Int# is the "raw" machine type for integers and +# is the primitive addition on raw types.
Operations on raw types are implemented directly on bit-patterns (e.g. registers) --- not thunks. Boxing and unboxing are then translated as constructor application and pattern matching.
The advantage of this approach (not visible in this simple example) is that the compiler is often capable of inlining such definitions and removing intermediate boxing/unboxing operations, leaving only the outermost ones.
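The simplified translation above uses a made-up case# form. With GHC's MagicHash extension, roughly the same thing can be written by hand as ordinary source code. This is only a sketch; addUnboxed is a hypothetical name, and it is not meant to be the code GHC itself generates.

```haskell
{-# LANGUAGE MagicHash #-}

import GHC.Exts (Int (I#), (+#))

-- Pattern matching on I# unboxes the arguments (forcing them),
-- +# adds the raw Int# values, and I# boxes the result again.
addUnboxed :: Int -> Int -> Int
addUnboxed (I# x) (I# y) = I# (x +# y)

main :: IO ()
main = print (addUnboxed 2 5)  -- 7
```

An expression of type Int# such as x +# y can never be a thunk, which is why no explicit forcing step appears in this version.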

It would be absolutely correct to wrap every value in a thunk. But since Haskell is non-strict, compilers can choose when to evaluate thunks/expressions. In particular, compilers can choose to evaluate an expression earlier than strictly necessary, if it results in better code.
Optimizing Haskell compilers such as GHC perform strictness analysis to figure out which values will always be computed.
Initially, the compiler has to assume that none of a function's arguments are ever used. Then it goes over the body of the function and tries to find function applications that 1) are known to be strict in (at least some of) their arguments and 2) always have to be evaluated to compute the function's result.
In your example, we have the function (+), which is strict in both its arguments. Thus the compiler knows that both x and y are always required to be evaluated at this point.
Now it just so happens that the expression x+y is always necessary to compute the function's result, so the compiler can record that the function add is strict in both x and y.
The generated code for add* will thus expect integer values as parameters and not thunks. The algorithm becomes much more complicated when recursion is involved (a fixed point problem), but the basic idea remains the same.
Another example:
mkList x y =
  if x then y : []
  else []
This function will take x in evaluated form (as a boolean) and y as a thunk. The expression x needs to be evaluated in every possible execution path through mkList, thus we can have the caller evaluate it. The expression y, on the other hand, is never used in any function application that is strict in its arguments. The cons function : never looks at y; it just stores it in a list. Thus y needs to be passed as a thunk in order to satisfy the lazy Haskell semantics.
mkList False undefined -- absolutely legal
*: add is of course polymorphic and the exact type of x and y depends on the instantiation.
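The difference between the two arguments of mkList can be observed directly. In this small sketch the type signature is an added assumption (it is the most general type the definition admits):

```haskell
mkList :: Bool -> a -> [a]
mkList x y =
  if x then y : []
  else []

main :: IO ()
main = do
  -- y is stored in the list but never inspected, so undefined is harmless.
  print (length (mkList True (undefined :: Int)))  -- 1
  print (null (mkList False (undefined :: Int)))   -- True
```

Passing undefined as the first argument instead would fail as soon as the result is demanded, because the if has to inspect x on every path.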

Short answer: Yes.
Long answer:
val = 5
This has to be stored in a thunk, because imagine if we wrote this anywhere in our code (like, in a library we imported or something):
val = undefined
If this has to be evaluated when our program starts, it would crash, right? If we actually use that value for something, that would be what we want, but if we don't use it, it shouldn't be able to influence our program so catastrophically.
For your second example, let me change it a little:
divide x y = x / y -- named divide so that it does not clash with Prelude's div
The result of applying this function has to be stored in a thunk as well, because imagine some code like this:
average list =
  if null list
    then 0
    else divide (sum list) (fromIntegral (length list))
If the application of divide were evaluated eagerly here, before the if had picked a branch, the division would be carried out even when the list is null (i.e. empty). For the empty list we want the function to return 0 without ever attempting that division, and keeping the application in a thunk is what makes this possible.
Your final example is just a variation of example 1, and it has to be lazy for the same reasons.
All this being said, it is possible to force the compiler to make some values strict, but that goes beyond the scope of this question.
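For the curious, here is one way this can look in GHC, using the BangPatterns extension. It is only a sketch: mean and its helper go are hypothetical names, and the same effect could be had with seq.

```haskell
{-# LANGUAGE BangPatterns #-}

-- The bangs force the running total and count at every step, so no
-- chain of unevaluated (+) thunks builds up while walking the list.
mean :: [Double] -> Double
mean = go 0 0
  where
    go :: Double -> Int -> [Double] -> Double
    go !total !count [] =
      if count == 0 then 0 else total / fromIntegral count
    go !total !count (x : xs) = go (total + x) (count + 1) xs

main :: IO ()
main = print (mean [1, 2, 3, 4])  -- 2.5
```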

I think the others answered your question nicely, but just for completeness's sake let me add that GHC offers you the possibility of using unboxed values directly as well. This is what Haskell Wiki says about it:
When you are really desperate for speed, and you want to get right down to the “raw bits.” Please see GHC Primitives for some information about using unboxed types.
This should be a last resort, however, since unboxed types and primitives are non-portable. Fortunately, it is usually not necessary to resort to using explicit unboxed types and primitives, because GHC's optimiser can do the work for you by inlining operations it knows about, and unboxing strict function arguments. Strict and unpacked constructor fields can also help a lot. Sometimes GHC needs a little help to generate the right code, so you might have to look at the Core output to see whether your tweaks are actually having the desired effect.
One thing that can be said for using unboxed types and primitives is that you know you're writing efficient code, rather than relying on GHC's optimiser to do the right thing, and being at the mercy of changes in GHC's optimiser down the line. This may well be important to you, in which case go for it.
As mentioned, it's non-portable, so you need a GHC language extension. See here for their docs.
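Before reaching for explicit unboxed types, the "strict and unpacked constructor fields" mentioned in the quote are often enough. A minimal sketch, where Point and norm2 are invented for the example; as far as I know, GHC only honours the UNPACK pragma when compiling with optimisation:

```haskell
-- The bangs make both fields strict; UNPACK asks GHC to store the raw
-- doubles directly in the constructor instead of pointers to boxes.
data Point = Point {-# UNPACK #-} !Double {-# UNPACK #-} !Double
  deriving (Show)

norm2 :: Point -> Double
norm2 (Point x y) = x * x + y * y

main :: IO ()
main = print (norm2 (Point 3 4))  -- 25.0
```

The source code stays ordinary Haskell apart from the pragma, which is why this is usually tried first.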

Would seq ever be used instead of pseq?

If pseq ensures order of evaluation and seq doesn't, why does seq exist? Is there any time that seq should be used over pseq?

It says on the documentation page,
[pseq] restricts the transformations that the compiler can do, and ensures that the user can retain control of the evaluation order
Therefore, if all you need to do is ensure strictness so that you don't build up a long chain of thunks and overflow the stack, use seq. I don't know of any examples where being able to transform
a `seq` b
into
b `seq` a `seq` b
would help performance though, sorry.
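The two typical uses can be sketched side by side. This assumes GHC, where par and pseq are exported from GHC.Conc (the parallel package's Control.Parallel offers the same names); sumStrict and parSum are hypothetical names.

```haskell
import GHC.Conc (par, pseq)

-- seq is enough here: we only care that acc' is evaluated at some
-- point before the result, not about its order relative to the rest.
sumStrict :: [Int] -> Int
sumStrict = go 0
  where
    go acc [] = acc
    go acc (x : xs) =
      let acc' = acc + x
       in acc' `seq` go acc' xs

-- pseq matters here: b must be evaluated before a + b is entered,
-- otherwise the main thread might evaluate a itself and make the
-- spark created by par pointless.
parSum :: [Int] -> [Int] -> Int
parSum xs ys = a `par` (b `pseq` (a + b))
  where
    a = sum xs
    b = sum ys

main :: IO ()
main = do
  print (sumStrict [1 .. 1000])            -- 500500
  print (parSum [1 .. 100] [101 .. 200])   -- 20100
```

Both functions return the same values if seq and pseq are swapped; only the evaluation-order guarantee differs.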
