MonadFix in a strict language - haskell

I'm working on a camlp4 extension for Haskell-like do notation in OCaml, and trying to figure out how GHC compiles recursive do-bindings (enabled with -XDoRec).
I wonder if it is possible for a monadic fixpoint combinator to exist in a strict language (like OCaml/F#/SML/...).
If so, what would it look like? Would it be very useful?

The F# computation expression syntax (related to Haskell do) supports recursion:
let rec ones = seq {
  yield 1
  yield! ones }
This is supported because the computation builder has to support the Delay operation in addition to the other monadic (or MonadPlus) operations. The code is translated to something like:
let rec ones =
  seq.Combine
    ( seq.Yield(1),
      seq.Delay(fun () -> seq.YieldFrom(ones)) )
The type of Delay is, in general, (unit -> M<'T>) -> M<'T> and the trick is that it wraps a computation with effects (or immediate recursive reference) into a delayed computation that is evaluated on demand.
If you want to learn more about how the mechanism works in F#, then the following two papers are relevant:
Syntax Matters: Writing abstract computations in F#
Initializing Mutually Referential Abstract Objects: The Value Recursion Challenge
The first one describes how the F# computation expression syntax is desugared (and how Delay is inserted - and in general, how F# combines delayed and eager computations with effects) and the second one describes how F# handles let rec declarations with values - like the ones value above.
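For contrast with the strict setting, here is a minimal sketch of what a monadic fixpoint looks like in Haskell itself — essentially the MonadFix instance for Maybe from Control.Monad.Fix, written as a standalone function. The recursive let only works because Haskell is lazy; a strict language would evaluate the right-hand side immediately, which is exactly why it needs Delay-style suspensions instead:

-- A sketch of the Maybe fixpoint (essentially the MonadFix instance in
-- Control.Monad.Fix, as a plain function): the recursive let feeds the
-- eventual result back into f before it has been computed.
mfixMaybe :: (a -> Maybe a) -> Maybe a
mfixMaybe f = let a = f (unJust a) in a
  where unJust (Just x) = x
        unJust Nothing  = error "mfixMaybe: Nothing"

-- e.g. mfixMaybe (\xs -> Just (1 : xs)) is Just an infinite list of 1s.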


Is the monadic IO construct in Haskell just a convention?

Regarding Haskell's monadic IO construct:
Is it just a convention, or is there an implementation reason for it?
Could you not just FFI into libc.so instead to do your I/O, and skip the whole IO-monad thing?
Would it work anyway, or is the outcome nondeterministic because of:
(a) Haskell's lazy evaluation?
(b) another reason, like that the GHC is pattern-matching for the IO monad and then handling it in a special way (or something else)?
What is the real reason - in the end you end up using a side effect, so why not do it the simple way?
Yes, monadic I/O is a consequence of Haskell being lazy. Specifically, though, monadic I/O is a consequence of Haskell being pure, which is effectively necessary for a lazy language to be predictable.†
This is easy to illustrate with an example. Imagine for a moment that Haskell were not pure, but still lazy. Instead of putStrLn having the type String -> IO (), it would simply have the type String -> (), and it would print a string to stdout as a side-effect. The trouble is that this would only happen when putStrLn is actually called, and in a lazy language, functions are only called when their results are needed.
Here’s the trouble: putStrLn produces (). Looking at a value of type () is useless, because () means “boring”. That means that this program would do what you expect:
main :: ()
main =
  case putStr "Hello, " of
    () -> putStrLn " world!"
-- prints “Hello, world!\n”
But I think you can agree that that programming style is pretty odd. The case ... of is necessary, however, because it forces the evaluation of the call to putStr by matching against (). If you tweak the program slightly:
main :: ()
main =
  case putStr "Hello, " of
    _ -> putStrLn " world!"
…now it only prints world!\n, and the first call isn’t evaluated at all.
This actually gets even worse, though, because it becomes even harder to predict as soon as you start trying to do any actual programming. Consider this program:
printAndAdd :: String -> Integer -> Integer -> Integer
printAndAdd msg x y = putStrLn msg `seq` (x + y)

main :: ()
main =
  let x = printAndAdd "first" 1 2
      y = printAndAdd "second" 3 4
  in (y + x) `seq` ()
Does this program print out first\nsecond\n or second\nfirst\n? Without knowing the order in which (+) evaluates its arguments, we don’t know. And in Haskell, evaluation order isn’t even always well-defined, so it’s entirely possible that the order in which the two effects are executed is actually completely impossible to determine!
This problem doesn’t arise in strict languages with a well-defined evaluation order, but in a lazy language like Haskell, we need some additional structure to ensure side-effects are (a) actually evaluated and (b) executed in the correct order. Monads happen to be an interface that elegantly provides the necessary structure to enforce that order.
Why is that? And how is that even possible? Well, the monadic interface provides a notion of data dependency in the signature for >>=, which enforces a well-defined evaluation order. Haskell’s implementation of IO is “magic”, in the sense that it’s implemented in the runtime, but the choice of the monadic interface is far from arbitrary. It seems to be a fairly good way to encode the notion of sequential actions in a pure language, and it makes it possible for Haskell to be lazy and referentially transparent without sacrificing predictable sequencing of effects.
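A tiny illustration with ordinary Prelude functions: in the desugared form below, the second action is literally a function of the first action's result, so the runtime has no freedom about the order of the two effects.

-- The read must happen before the print: putStrLn's argument depends on
-- getLine's result, and (>>=) is what threads that dependency through.
main :: IO ()
main = getLine >>= \name -> putStrLn ("Hello, " ++ name)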
It’s worth noting that monads are not the only way to encode side-effects in a pure way—in fact, historically, they’re not even the only way Haskell handled side-effects. Don’t be misled into thinking that monads are only for I/O (they’re not), only useful in a lazy language (they’re plenty useful to maintain purity even in a strict language), only useful in a pure language (many things are useful monads that aren’t just for enforcing purity), or that you need monads to do I/O (you don’t). They do seem to have worked out pretty well in Haskell for those purposes, though.
† Regarding this, Simon Peyton Jones once noted that “Laziness keeps you honest” with respect to purity.
Could you just FFI into libc.so instead to do IO and skip the IO Monad thing?
Taking from https://en.wikibooks.org/wiki/Haskell/FFI#Impure_C_Functions, if you declare an FFI function as pure (so, with no reference to IO), then
GHC sees no point in calculating twice the result of a pure function
which means the result of the function call is effectively cached. For example, a program where a foreign impure pseudo-random number generator is declared to return a CUInt
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign
import Foreign.C.Types

foreign import ccall unsafe "stdlib.h rand"
  c_rand :: CUInt

main = putStrLn (show c_rand) >> putStrLn (show c_rand)
returns the same thing every call, at least on my compiler/system:
16807
16807
If we change the declaration to return an IO CUInt
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign
import Foreign.C.Types

foreign import ccall unsafe "stdlib.h rand"
  c_rand :: IO CUInt

main = c_rand >>= putStrLn . show >> c_rand >>= putStrLn . show
then this results in (probably) a different number returned each call, since the compiler knows it's impure:
16807
282475249
So you're back to having to use IO for the calls to the standard libraries anyway.
Let's say using FFI we defined a function
c_write :: String -> ()
which lies about its purity, in that whenever its result is forced it prints the string. So that we don't run into the caching problems in Michal's answer, we can define these functions to take an extra () argument.
c_write :: String -> () -> ()
c_rand :: () -> CUInt
On an implementation level this will work as long as CSE is not too aggressive (which it is not in GHC because that can lead to unexpected memory leaks, it turns out). Now that we have things defined this way, there are many awkward usage questions that Alexis points out—but we can solve them using a monad:
newtype IO a = IO { runIO :: () -> a }

instance Monad IO where
  return  = IO . const
  m >>= f = IO $ \() -> let x = runIO m () in x `seq` runIO (f x) ()

rand :: IO CUInt
rand = IO c_rand
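A minimal usage sketch, assuming the definitions above (on a modern GHC you would also need the corresponding Functor and Applicative boilerplate, omitted here):

-- Hypothetical example: the seq buried in (>>=) forces each c_rand result
-- in sequence when the composite action is finally run with runIO ... ().
twoRands :: IO (CUInt, CUInt)
twoRands = rand >>= \x -> rand >>= \y -> return (x, y)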
Basically, we just stuff all of Alexis's awkward usage questions into a monad, and as long as we use the monadic interface, everything stays predictable. In this sense IO is just a convention—because we can implement it in Haskell there is nothing fundamental about it.
That's from the operational vantage point.
On the other hand, Haskell's semantics in the report are specified using denotational semantics alone. And, in my opinion, the fact that Haskell has a precise denotational semantics is one of the most beautiful and useful qualities of the language, allowing me a precise framework to think about abstractions and thus manage complexity with precision. And while the usual abstract IO monad has no accepted denotational semantics (to the lament of some of us), it is at least conceivable that we could create a denotational model for it, thus preserving some of the benefits of Haskell's denotational model. However, the form of I/O we have just given is completely incompatible with Haskell's denotational semantics.
Simply put, there are only supposed to be two distinguishable values (modulo fatal error messages) of type (): () and ⊥. If we treat FFI as the fundamentals of I/O and use the IO monad only "as a convention", then we effectively add a jillion values to every type—to continue having a denotational semantics, every value must be adjoined with the possibility of performing I/O prior to its evaluation, and with the extra complexity this introduces, we essentially lose all our ability to consider any two distinct programs equivalent except in the most trivial cases—that is, we lose our ability to refactor.
Of course, because of unsafePerformIO this is already technically the case, and advanced Haskell programmers do need to think about the operational semantics as well. But most of the time, including when working with I/O, we can forget about all that and refactor with confidence, precisely because we have learned that when we use unsafePerformIO, we must be very careful to ensure it plays nicely, that it still affords us as much denotational reasoning as possible. If a function has unsafePerformIO, I automatically give it 5 or 10 times more attention than regular functions, because I need to understand the valid patterns of use (usually the type signature tells me everything I need to know), I need to think about caching and race conditions, I need to think about how deep I need to force its results, etc. It's awful[1]. The same care would be necessary of FFI I/O.
In conclusion: yes it's a convention, but if you don't follow it then we can't have nice things.
[1] Well actually I think it's pretty fun, but it's surely not practical to think about all those complexities all the time.
That depends on what the meaning of "is" is—or at least what the meaning of "convention" is.
If a "convention" means "the way things are usually done" or "an agreement among parties covering a particular matter" then it is easy to give a boring answer: yes, the IO monad is a convention. It is the way the designers of the language agreed to handle IO operations and the way that users of the language usually perform IO operations.
If we are allowed to choose a more interesting definition of "convention" then we can get a more interesting answer. If a "convention" is a discipline imposed on a language by its users in order to achieve a particular goal without assistance from the language itself, then the answer is no: the IO monad is the opposite of a convention. It is a discipline enforced by the language that assists its users in constructing and reasoning about programs.
The purpose of the IO type is to create a clear distinction between the types of "pure" values and the types of values which require execution by the runtime system to generate a meaningful result. The Haskell type system enforces this strict separation, preventing a user from (say) creating a value of type Int which launches the proverbial missiles. This is not a convention in the second sense: its entire goal is to move the discipline required to perform side effects in a safe and consistent way from the user and onto the language and its compiler.
Could you just FFI into libc.so instead to do IO and skip the IO Monad thing?
It is, of course, possible to do IO without an IO monad: see almost every other extant programming language.
Would it work anyway, or is the outcome nondeterministic because of Haskell's lazy evaluation, or something else, like GHC pattern-matching for the IO monad and handling it in a special way?
There is no such thing as a free lunch. If Haskell allowed any value to require execution involving IO then it would have to lose other things that we value. The most important of these is probably referential transparency: if myInt could sometimes be 1 and sometimes be 5 depending on external factors then we would lose most of our ability to reason about our programs in a rigorous way (known as equational reasoning).
Laziness was mentioned in other answers, but the issue with laziness would specifically be that sharing would no longer be safe. If x in let x = someExpensiveComputationOf y in x * x was not referentially transparent, GHC would not be able to share the work and would have to compute it twice.
What is the real reason?
Without the strict separation of effectful values from non-effectful values provided by IO and enforced by the compiler, Haskell would effectively cease to be Haskell. There are plenty of languages that don't enforce this discipline. It would be nice to have at least one around that does.
In the end you end up using a side effect. So why not do it the simple way?
Yes, in the end your program is represented by a value called main with an IO type. But the question isn't where you end up, it's where you start: If you start by being able to differentiate between effectful and non-effectful values in a rigorous way then you gain a lot of advantages when constructing that program.
What is the real reason - in the end you end up using a side effect, so why not do it the simple way?
...you mean like Standard ML? Well, there's a price to pay - instead of being able to write:
any :: (a -> Bool) -> [a] -> Bool
any p = or . map p
you would have to type out this:
any :: (a -> Bool) -> [a] -> Bool
any p [] = False
any p (y:ys) = p y || any p ys
Could you not just FFI into libc.so instead to do your I/O, and skip the whole IO-monad thing?
Let's rephrase the question:
Could you not just do I/O like Standard ML, and skip the whole IO-monad thing?
...because that's effectively what you would be trying to do. Why "trying"?
SML is strict, and relies on syntactic ordering to specify the order of evaluation everywhere;
Haskell is non-strict, and relies on data dependencies to specify the order of evaluation for certain expressions e.g. I/O actions.
So:
Would it work anyway, or is the outcome nondeterministic because of:
(a) Haskell's lazy evaluation?
(a) - the combination of non-strict semantics and visible effects is generally useless. For an amusing exhibition of just how useless this combination can be, watch this presentation by Erik Meijer (the slides can be found here).

Is there a real-world applicability for the continuation monad outside of academic use?

(later visitors: two answers to this question both give excellent insight, if you are interested you should probably read them both, I could only accept one as a limitation of SO)
From all the discussions I find online on the continuation monad, they either mention how it could be used with some trivial examples, or they explain that it is a fundamental building block, as in this article claiming that the mother of all monads is the continuation monad.
I wonder if there is applicability outside of this range. I mean, does it make sense to wrap a recursive function, or mutual recursion in a continuation monad? Does it aid in readability?
Here's an F# version of the continuation monad taken from this SO post:
type ContinuationMonad() =
  member this.Bind (m, f) = fun c -> m (fun a -> f a c)
  member this.Return x = fun k -> k x

let cont = ContinuationMonad()
Is it merely of academic interest, for instance to help understand monads, or computation builders? Or is there some real-world applicability, added type-safety, or does it circumvent typical programming problems that are hard to solve otherwise?
I.e., the continuation monad with call/cc from Ryan Riley shows that it is complex to handle exceptions, but it doesn't explain what problem it is trying to solve and the examples don't show why it needs specifically this monad. Admittedly, I just don't understand what it does, but it may be a treasure trove!
(Note: I am not interested in understanding how the continuation monad works, I think I have a fair grasp of it, I just don't see what programming problem it solves.)
The "mother of all monads" stuff is not purely academic. Dan Piponi references Andrzej Filinski's Representing Monads, a rather good paper. The upshot of it is if your language has delimited continuations (or can mimic them with call/cc and a single piece of mutable state) then you can transparently add any monadic effect to any code. In other words, if you have delimited continuations and no other side effects, you can implement (global) mutable state or exceptions or backtracking non-determinism or cooperative concurrency. You can do each of these just by defining a few simply functions. No global transformation or anything needed. Also, you only pay for the side-effects when you use them. It turns out the Schemers were completely right about call/cc being highly expressive.
If your language doesn't have delimited continuations, you can get them via the continuation monad (or better, the double-barrelled continuation monad). Of course, if you're going to write in monadic style anyway – which is a global transformation – why not just use the desired monad from the get-go? For Haskellers, this is typically what we do; however, there are still benefits from using the continuation monad in many cases (albeit hidden away). A good example is the Maybe/Option monad, which is like having exceptions except there's only one type of exception. Basically, this monad captures the pattern of returning an "error code" and checking it after each function call. And that's exactly what the typical definition does, except by "function call" I mean every (monadic) step of the computation. Suffice to say, this is pretty inefficient, especially when the vast majority of the time there is no error. If you reflect Maybe into the continuation monad though, while you have to pay the cost of the CPSed code (which GHC Haskell handles surprisingly well), you only pay to check the "error code" in places where it matters, i.e. catch statements. In Haskell, the Codensity monad that danidiaz mentioned is a better choice because the last thing Haskellers want is to make it so that arbitrary effects can be transparently interleaved in their code.
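To make the "reflect Maybe into continuations" idea concrete, here is a sketch of mine (not taken from Filinski's paper) of the two-continuation encoding in Haskell: each bind merely threads the success and failure continuations along, and the Maybe "error code" is only materialized at the boundary, where a catch would live.

{-# LANGUAGE RankNTypes #-}

-- Success continuation plus failure continuation; no per-bind tag checks.
newtype MaybeK a = MaybeK { runMaybeK :: forall r. (a -> r) -> r -> r }

instance Functor MaybeK where
  fmap g m = MaybeK $ \ok ko -> runMaybeK m (ok . g) ko

instance Applicative MaybeK where
  pure x    = MaybeK $ \ok _  -> ok x
  mf <*> mx = MaybeK $ \ok ko -> runMaybeK mf (\g -> runMaybeK mx (ok . g) ko) ko

instance Monad MaybeK where
  m >>= g = MaybeK $ \ok ko -> runMaybeK m (\x -> runMaybeK (g x) ok ko) ko

-- Failure just invokes the other continuation.
abortK :: MaybeK a
abortK = MaybeK $ \_ ko -> ko

-- The only place the "error code" is actually inspected.
toMaybe :: MaybeK a -> Maybe a
toMaybe m = runMaybeK m Just Nothing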
As danidiaz also mentioned, many monads are more easily or more efficiently implemented using essentially a continuation monad or some variant. Backtracking search is one example. While not the newest thing on backtracking, one of my favorite papers that used it was Typed Logical Variables in Haskell. The techniques used in it were also used in the Wired hardware description language. Also from Koen Claessen is A Poor Man's Concurrency Monad. More modern uses of the ideas in this example include the monad for deterministic parallelism in Haskell (A Monad for Deterministic Parallelism) and scalable I/O managers (Combining Events And Threads For Scalable Network Services). I'm sure I could find similar techniques used in Scala. If it wasn't provided, you could use a continuation monad to implement asynchronous workflows in F#. In fact, Don Syme references exactly the same papers I just referenced. If you can serialize functions but don't have continuations, you can use a continuation monad to get them and do the serialized-continuation type of web programming made popular by systems like Seaside. Even without serializable continuations, you can use the pattern (essentially the same as async) to at least avoid callbacks while storing the continuations locally and only sending a key.
Ultimately, relatively few people outside of Haskellers are using monads in any capacity, and as I alluded to earlier, Haskellers tend to want to use more controllable monads than the continuation monad, though they do use them internally quite a bit. Nevertheless, continuation monads or continuation-monad-like things, particularly for asynchronous programming, are becoming less uncommon. As C#, F#, Scala, Swift, and even Java start incorporating support for monadic or at least monadic-style programming, these ideas will become more broadly used. If the Node developers were more conversant with this, maybe they would have realized you could have your cake and eat it too with regards to event-driven programming.
To provide a more direct F#-specific answer (even though Derek already covered that too), the continuation monad pretty much captures the core of how asynchronous workflows work.
A continuation monad is a function that, when given a continuation, eventually calls the continuation with the result (it may never call it or it may call it repeatedly too):
type Cont<'T> = ('T -> unit) -> unit
F# asynchronous computations are a bit more complex - they take a success continuation, together with exception and cancellation continuations, and also include the cancellation token. Using a slightly simplified definition, the F# core library uses (see the full definition here):
type AsyncParams =
  { token : CancellationToken
    econt : exn -> unit
    ccont : exn -> unit }

type Async<'T> = ('T -> unit) * AsyncParams -> unit
As you can see, if you ignore AsyncParams, it is pretty much the continuation monad. In F#, I think the "classical" monads are more useful as an inspiration than as a direct implementation mechanism. Here, the continuation monad provides a useful model of how to handle certain kinds of computations - and with many additional async-specific aspects, the core idea can be used to implement asynchronous computations.
I think this is quite different to how monads are used in classic academic works or in Haskell, where they tend to be used "as is" and perhaps composed in various ways to construct more complex monads that capture more complex behaviour.
This may be just my personal opinion, but I'd say that the continuation monad is not practically useful in itself, but it is a basis for some very practical ideas. (Just like lambda calculus is not really practically useful in itself, but it can be seen as an inspiration for nice practical languages!)
I certainly find it easier to read a recursive function implemented using the continuation monad compared to one implemented using explicit recursion. For example, given this tree type:
type 'a Tree =
  | Node of 'a * 'a Tree * 'a Tree
  | Empty
here's one way to write a bottom-up fold over a tree:
let rec fold e f t = cont {
  match t with
  | Node(a,t1,t2) ->
      let! r1 = fold e f t1
      let! r2 = fold e f t2
      return f a r1 r2
  | Empty -> return e }
This is clearly analogous to a naïve fold:
let rec fold e f t =
  match t with
  | Node(a,t1,t2) ->
      let r1 = fold e f t1
      let r2 = fold e f t2
      f a r1 r2
  | Empty -> e
except that the naïve fold will blow the stack when called on a deep tree because it's not tail recursive, while the fold written using the continuation monad won't. You can of course write the same thing using explicit continuations, but to my eye the amount of clutter that they add distracts from the structure of the algorithm (and putting them in place is not completely fool-proof):
let rec fold e f t k =
  match t with
  | Node(a,t1,t2) ->
      fold e f t1 (fun r1 ->
        fold e f t2 (fun r2 ->
          k (f a r1 r2)))
  | Empty -> k e
Note that in order for this to work, you'll need to modify your definition of ContinuationMonad to include
member this.Delay f v = f () v
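For comparison, here is a rough Haskell rendering of the same fold — a sketch using the standard Cont type from the transformers package in place of the hand-rolled builder:

import Control.Monad.Trans.Cont (Cont, runCont)

data Tree a = Node a (Tree a) (Tree a) | Empty

-- The recursive calls build up continuations on the heap instead of
-- stack frames, just as in the F# version.
foldT :: b -> (a -> b -> b -> b) -> Tree a -> Cont r b
foldT e _ Empty          = return e
foldT e f (Node a t1 t2) = do
  r1 <- foldT e f t1
  r2 <- foldT e f t2
  return (f a r1 r2)

runFold :: b -> (a -> b -> b -> b) -> Tree a -> b
runFold e f t = runCont (foldT e f t) id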

Why is Haskell fully declarative?

I'm not very good at understanding the difference between the imperative and declarative programming paradigms. I have read about Haskell being a declarative language. To an extent I would say yes, but there is something that bothers me with respect to the definition of imperative.
When I have a data structure and use Haskell's functions to transform it, I have actually just said WHAT to transform. So I give the data structure as an argument to a function and am happy with the result.
But what if there is no function that actually satisfies my needs?
I would start writing my own function, which expects the data structure as an argument. After that I would start writing how the data structure should be processed. Since I can only call native Haskell functions, I'm still within the declarative paradigm, right? But what about when I start using an "if statement"? Wouldn't that end the declarative nature, since I'm about to tell the program HOW to do stuff from that point?
Perhaps this is a matter of perspective. The way I see it, there is nothing imperative about defining things in terms of other things because we can always replace something with its definition (as long as the definition is pure). That is, if we have f x = x + 1, then any place we see f z we can replace with z + 1. So pure functions shouldn't really be considered instructions; they should be considered definitions.
A lot of Haskell code is considered declarative for this reason. We simply define things as (pure) functions of other things.
It is possible to write imperative-style code in Haskell as well. Sometimes we really do want to say something like "do A, then do B, then do C". This adds a new dimension to the simple world of function application: we need a notion of 'happens before'. Haskell has embraced the Monad concept to describe computations that have an order of evaluation. This turns out to be very handy because it can encapsulate effects such as changing state.
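For instance (plain IO code, using only standard functions), do notation chains actions with >>= so that a "happens before" relation exists between them:

-- "Do A, then B, then C", expressed through the monadic interface.
main :: IO ()
main = do
  putStrLn "A"
  putStrLn "B"
  putStrLn "C"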
There is no if statement in Haskell, only an if expression if a then b else c, which is really equivalent to ternary expressions like a ? b : c in C-like languages or b if a else c in Python. You cannot leave out the else, and it can be used anywhere an arbitrary expression can be used. It is further constrained that a must have type Bool, and b and c must have the same type. It is syntactic sugar for
case a of
  True  -> b
  False -> c
Don't get too hung up on the label "declarative"; it's just a style of programming that the language supports. Consider, for example, a typical declarative definition of the factorial function:
fact 0 = 1
fact n = n * fact (n - 1)
This is also just syntactic sugar for something you would see as more imperative:
fact n = case n of
  0 -> 1
  _ -> n * fact (n - 1)
Frankly, the term "declarative programming" is more of a marketing term than anything else. The informal definition that it means "specifying what to do, not how to do it" is vague and open to interpretation and definitely far from a black-and-white boundary. In practice, it seems to be applied to anything that broadly falls into the categories of functional programming, logic programming or the use of Domain-Specific Languages (DSLs).
Consequently (and I realise this probably doesn't qualify as an answer to your question, but still :)), I recommend you do not waste your time wondering whether something is still declarative or not. The terms imperative vs. functional vs. logic programming are already a bit more meaningful, so perhaps it's more useful to reflect on those.
Declarative languages are constructed from expressions whereas imperative languages are constructed from statements.
The usual explanation is what to do versus how to do it. You've already found the confusion in this. If you take it this way, declarative is using definitions (by name) and imperative is writing definitions. What is a declarative language then? One which only names definitions? If so then the only Haskell program you could write is precisely main!
There is a declarative way and an imperative way to have branching, and it follows directly from the definition. A declarative branch is an expression and an imperative branch is a statement. Haskell has case … of … (an expression) and C has if (…) {…} else {…} (a statement).
What is the difference between expressions and statements? Expressions have a value whereas statements have effects. What is the difference between values and effects?
For expressions, there is a function μ which maps any expression e to its value μ(e). This is also called the semantics, or meaning, and is ideally a well-defined mathematical object. This way of defining values is called denotational semantics. There are other methods.
For statements, there is a state P immediately before a statement S and a state Q immediately after. The effect of S is the delta from P to Q. This way of defining effects is called Hoare logic. There are other methods.
You can't use an "if statement" in Haskell, because there is none. You can use an "if-expression"
if c then a else b
but this is just syntactic sugar for something like
let f True a b = a; f False a b = b in f c a b
which is again fully declarative.

Can every functional language be lazy?

In a functional language, functions are first class citizens and thus calling them is not the only thing I can do. I can also store them away.
Now when I have a language, which is strict by default, then I am still not forced to evaluate a function call. I have the option to store the function and its parameters e.g. in a tuple for later evaluation.
So instead of
x = f a b c
I do something like
x = (f,a,b,c)
And later, I can evaluate this thing with something like
eval (f,a,b,c) = f a b c
Well, there is probably more to it, because I want to evaluate each unevaluated function call only once, but it seems to me that this can also be solved with a data structure which is a bit fancier than just a tuple.
The inverse also seems to be the case, because e.g. in Haskell, which is lazy by default, I can enforce evaluation with seq or BangPatterns.
So is it correct to say that every functional language has the potential of being lazy, but most of them are just not lazy by default and thus require additional programming effort to call a function lazily, whereas Haskell is lazy by default and requires additional programming effort to call a function in a strict way?
Should that be the case, what is more difficult for the programmer: writing lazy function calls in a strict language or writing strict function calls in a lazy language?
As a side note: was Simon Peyton Jones serious when he said "the next version of Haskell will be strict"? I first thought that this was a joke. But now I think strict by default isn't all that bad as long as you can be lazy if required.
Lazy evaluation, at the low level, is implemented by a concept called a thunk, which comprises two things:
A closure that computes the value of the deferred computation
A set-once mutable storage cell for memoizing the result.
The first part, the closure, can be modeled in an even simpler way than your tuple with the function and its arguments. You can just use a function that accepts unit or no arguments (depending on how your language works), and in its body you apply the function to the arguments. To compute the result, you just invoke the function.
Paul Johnson mentions Scheme, which is a perfect language to demonstrate this. As a Scheme macro (pseudocode, untested):
(define-syntax delay
  (syntax-rules ()
    ((delay expr ...)
     ;; `(delay expr)` evaluates to a lambda that closes over
     ;; two variables—one to store the result, one to record
     ;; whether the thunk has been forced.
     (let ((value #f)
           (done? #f))
       (lambda ()
         (unless done?
           (set! value (begin expr ...))
           (set! done? #t))
         value)))))
(define (force thunk)
  ;; Thunks are procedures, so to force them you just invoke them.
  (thunk))
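For comparison, the same memoizing thunk rendered in Haskell with an explicit mutable cell (a sketch of mine, purely illustrative, since Haskell's built-in thunks already behave this way under the hood):

import Data.IORef

-- Either the pending computation or its memoized result.
newtype Thunk a = Thunk (IORef (Either (IO a) a))

delay :: IO a -> IO (Thunk a)
delay act = Thunk <$> newIORef (Left act)

-- Run the computation at most once; later forces return the cached value.
-- (Not thread-safe; see the discussion of locking further down.)
force :: Thunk a -> IO a
force (Thunk ref) = do
  st <- readIORef ref
  case st of
    Right v  -> return v
    Left act -> do
      v <- act
      writeIORef ref (Right v)
      return v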
But to get this back to the title of the question: does this mean that every functional language can be lazy? The answer is no. Eager languages can implement thunking and use it to provide opt-in delayed evaluation at user-selected spots, but this isn't the same as having pervasive lazy evaluation like Haskell implementations provide.
The answer is a qualified yes. Your intuition that laziness can be implemented in a strict language where functions are first-class objects is correct. But going into the details reveals a number of subtleties.
Let's take a functional language (by which I mean a language where functions can be constructed and manipulated as first-class objects, like in the lambda calculus), where function application is strict (i.e. the function¹ and its argument(s) are fully evaluated before the function is applied). I'll use the syntax of Standard ML, since this is a popular and historically important strict functional language. A strict application F A (where F and A are two expressions) can be delayed by encoding it as
Thunk (F, A)
This object, which contains a function and an argument, is called a thunk. We can define a type of thunks:
datatype ('a, 'b) thunk = Thunk of ('a -> 'b) * 'a;
and a function to evaluate a thunk:
fun evaluate (Thunk (f, x)) = f x;
Nice and easy so far. But we have not, in fact, implemented lazy evaluation! What we've implemented is normal-order evaluation, also known as call-by-name. The difference is that if the value of the thunk is used more than once, it is calculated every time. Lazy evaluation (also known as call-by-need) requires evaluating the expression at most once.
In a pure, strict language, lazy evaluation is in fact impossible to implement. The reason is that evaluating a thunk modifies the state of the system: it goes from unevaluated to evaluated. Implementing lazy evaluation requires a way to change the state of the system.
There's a subtlety here: if the semantics of the language is defined purely in terms of the termination status of expressions and the value of terminating expressions, then, in a pure language, call-by-need and call-by-name are indistinguishable. Call-by-value (i.e. strict evaluation) is distinguishable because fewer expressions terminate — call-by-need and call-by-name hide any non-termination that happens in a thunk that is never evaluated. The equivalence of call-by-need and call-by-name allows lazy evaluation to be considered as an optimization of normal-order evaluation (which has nice theoretical properties). But in many programs, using call-by-name instead of call-by-value would blow up the running time by computing the value of the same expressions over and over again.
In a language with mutable state, lazy evaluation can be expressed by storing the value into the thunk when it is calculated.
datatype ('a, 'b) lazy_state = Lazy of ('a -> 'b) * 'a | Value of 'b;
type ('a, 'b) lazy = ('a, 'b) lazy_state ref;

fun lazy (f, x) = ref (Lazy (f, x));
fun force r =
  case !r of
      Value y => y
    | Lazy (f, x) => let val y = f x in (r := Value y; y) end;
This code is not very complicated, so even in ML dialects that provide lazy evaluation as a library feature (possibly with syntactic sugar), it isn't used all that often in practice — often, the point at which the value will be needed is a known location in the programs, and programmers just use a function and pass it its argument at that location.
While this is getting into subjective territory, I would say that it's much easier to implement lazy evaluation like this, than to implement strict evaluation in a language like Haskell. Forcing strict evaluation in Haskell is basically impossible (except by wrapping everything into a state monad and essentially writing ML code in Haskell syntax). Of course, strict evaluation doesn't change the values calculated by the program, but it can have a significant impact on performance (and, in particular, it is sometimes much appreciated because it makes performance a lot more predictable — predicting the performance of a Haskell program can be very hard).
This evaluate-and-store mechanism is effectively the core of what a Haskell compiler does under the hood. Haskell is pure², so you cannot implement this in the language itself! However, it's sound for the compiler to do it under the hood, because this particular use of side effects does not break the purity of the language, so it does not invalidate any program transformation. The reason storing the value of a thunk is sound is that it turns call-by-name evaluation into call-by-need, and as we saw above, this neither changes the values of terminating expressions, nor changes which expressions terminate.
This approach can be somewhat problematic in a language that combines purely functional local evaluation with a multithreaded environment and message passing between threads. (This is notably the model of Erlang.) If one thread starts evaluating a thunk, and another thread needs its value just then, what is going to happen? If no precautions are taken, then both threads will calculate the value and store it. In a pure language, this is harmless in the sense that both threads will calculate the same value anyway³. However this can hurt performance. To ensure that a thunk is evaluated only once, the calculation of the value must be wrapped in a lock; this helps with long calculations that are performed many times but hurts short calculations that are performed only once, as taking and releasing a lock takes some time.
¹ The function, not the function body of course.
² Or rather, the fragment of Haskell that doesn't use a side effect monad is pure.
³ It is necessary for the transition between a delayed thunk and a computed value to be atomic — concurrent threads must be able to read a lazy value and get either a valid delayed thunk or a valid computed value, not some mixture of the two that isn't a valid object. At the processor level, the transition from delayed thunk to computed value is usually a pointer assignment, which on most architectures is atomic, fortunately.
What you propose will work. The details for doing this in Scheme can be found in SICP. One could envisage a strict version of Haskell in which there is a "lazy" function which does the opposite of what "seq" does in Haskell. However adding this to a strict Haskell-like language would require compiler magic because otherwise the thunk gets forced before being passed to "lazy".
However if your language has uncontrolled effects then this can get hairy, because an effect happens whenever its enclosing function gets evaluated, and figuring out when that is going to happen in a lazy language is difficult. That's why Haskell has the IO monad.

What are Haskell's strictness points?

We all know (or should know) that Haskell is lazy by default. Nothing is evaluated until it must be evaluated. So when must something be evaluated? There are points where Haskell must be strict. I call these "strictness points", although this particular term isn't as widespread as I had thought. According to me:
Reduction (or evaluation) in Haskell only occurs at strictness points.
So the question is: what, precisely, are Haskell's strictness points? My intuition says that main, seq / bang patterns, pattern matching, and any IO action performed via main are the primary strictness points, but I don't really know why I know that.
(Also, if they're not called "strictness points", what are they called?)
I imagine a good answer will include some discussion about WHNF and so on. I also imagine it might touch on lambda calculus.
Edit: additional thoughts about this question.
As I've reflected on this question, I think it would be clearer to add something to the definition of a strictness point. Strictness points can have varying contexts and varying depth (or strictness). Falling back to my definition that "reduction in Haskell only occurs at strictness points", let us add to that definition this clause: "a strictness point is only triggered when its surrounding context is evaluated or reduced."
So, let me try to get you started on the kind of answer I want. main is a strictness point. It is specially designated as the primary strictness point of its context: the program. When the program (main's context) is evaluated, the strictness point of main is activated. Main's depth is maximal: it must be fully evaluated. Main is usually composed of IO actions, which are also strictness points, whose context is main.
Now you try: discuss seq and pattern matching in these terms. Explain the nuances of function application: how is it strict? How is it not? What about deepseq? let and case statements? unsafePerformIO? Debug.Trace? Top-level definitions? Strict data types? Bang patterns? Etc. How many of these items can be described in terms of just seq or pattern matching?
A good place to start is by understanding this paper: A Natural Semantics for Lazy Evaluation (Launchbury). That will tell you when expressions are evaluated for a small language similar to GHC's Core. Then the remaining question is how to map full Haskell to Core, and most of that translation is given by the Haskell report itself. In GHC we call this process "desugaring", because it removes syntactic sugar.
Well, that's not the whole story, because GHC includes a whole raft of optimisations between desugaring and code generation, and many of these transformations will rearrange the Core so that things get evaluated at different times (strictness analysis in particular will cause things to be evaluated earlier). So to really understand how your program will be evaluated, you need to look at the Core produced by GHC.
Perhaps this answer seems a bit abstract to you (I didn't specifically mention bang patterns or seq), but you asked for something precise, and this is about the best we can do.
I would probably recast this question as, Under what circumstances will Haskell evaluate an expression? (Perhaps tack on a "to weak head normal form.")
To a first approximation, we can specify this as follows:
Executing IO actions will evaluate any expressions that they “need.” (So you need to know if the IO action is executed, e.g. its name is main, or it is called from main, AND you need to know what the action needs.)
An expression that is being evaluated (hey, that's a recursive definition!) will evaluate any expressions it needs.
From your intuitive list, main and IO actions fall into the first category, and seq and pattern matching fall into the second category. But I think that the first category is more in line with your idea of "strictness point", because that is in fact how we cause evaluation in Haskell to become observable effects for users.
Giving all of the details specifically is a large task, since Haskell is a large language. It's also quite subtle, because Concurrent Haskell may evaluate things speculatively, even though we end up not using the result in the end: this is a third breed of things that cause evaluation. The second category is quite well studied: you want to look at the strictness of the functions involved. The first category too can be thought to be a sort of "strictness", though this is a little dodgy because evaluate x and seq x $ return () are actually different things! You can treat it properly if you give some sort of semantics to the IO monad (explicitly passing a RealWorld# token works for simple cases), but I don't know if there's a name for this sort of stratified strictness analysis in general.
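To pin down the evaluate x versus seq x $ return () distinction mentioned above, using standard functions only: evaluate ties the forcing of its argument to the execution of an IO action, whereas seq only ties it to whenever the surrounding expression happens to be reduced, which need not line up with the execution of neighbouring effects.

import Control.Exception (evaluate)

-- Forcing as a proper effect: x is evaluated when this action is
-- executed, in sequence with the surrounding IO.
forceAsAction :: a -> IO a
forceAsAction = evaluate

-- Forcing by reduction: x is evaluated whenever this expression itself
-- is reduced, which may not coincide with running the returned action.
forceByReduction :: a -> IO ()
forceByReduction x = x `seq` return ()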
C has the concept of sequence points, which are guarantees for particular operations that one operand will be evaluated before the other. I think that's the closest existing concept, but the essentially equivalent term strictness point (or possibly force point) is more in line with Haskell thinking.
In practice Haskell is not a purely lazy language: for instance pattern matching is usually strict (so trying a pattern match forces evaluation to happen at least far enough to accept or reject the match).
…
Programmers can also use the seq primitive to force an expression to evaluate regardless of whether the result will ever be used.
$! is defined in terms of seq.
—Lazy vs. non-strict.
So your thinking about !/$! and seq is essentially right, but pattern matching is subject to subtler rules. You can always use ~ to force lazy pattern matching, of course. An interesting point from that same article:
The strictness analyzer also looks for cases where sub-expressions are always required by the outer expression, and converts those into eager evaluation. It can do this because the semantics (in terms of "bottom") don't change.
Let's continue down the rabbit hole and look at the docs for optimisations performed by GHC:
Strictness analysis is a process by which GHC attempts to determine, at compile-time, which data definitely will 'always be needed'. GHC can then build code to just calculate such data, rather than the normal (higher overhead) process for storing up the calculation and executing it later.
—GHC Optimisations: Strictness Analysis.
In other words, strict code may be generated anywhere as an optimisation, because creating thunks is unnecessarily expensive when the data will always be needed (and/or may only be used once).
…no more evaluation can be performed on the value; it is said to be in normal form. If we are at any of the intermediate steps so that we've performed at least some evaluation on a value, it is in weak head normal form (WHNF). (There is also a 'head normal form', but it's not used in Haskell.) Fully evaluating something in WHNF reduces it to something in normal form…
—Wikibooks Haskell: Laziness
(A term is in head normal form if there is no beta-redex in head position. A redex is a head redex if it is preceded only by lambda abstractors of non-redexes.) So when you start to force a thunk, you're working in WHNF; when there are no more thunks left to force, you're in normal form. Another interesting point:
…if at some point we needed to, say, print z out to the user, we'd need to fully evaluate it…
Which naturally implies that, indeed, any IO action performed from main does force evaluation, which should be obvious considering that Haskell programs do, in fact, do things. Anything that needs to go through the sequence defined in main must be in normal form and is therefore subject to strict evaluation.
C. A. McCann got it right in the comments, though: the only thing that's special about main is that main is defined as special; pattern matching on the constructor is sufficient to ensure the sequence imposed by the IO monad. In that respect only seq and pattern-matching are fundamental.
Haskell is AFAIK not a pure lazy language, but rather a non-strict language. This means that it does not necessarily evaluate terms at the last possible moment.
A good source for haskell's model of "lazyness" can be found here: http://en.wikibooks.org/wiki/Haskell/Laziness
Basically, it is important to understand the difference between a thunk and weak head normal form (WHNF).
My understanding is that Haskell pulls computations through backwards as compared to imperative languages. What this means is that in the absence of "seq" and bang patterns, it will ultimately be some kind of side effect that forces the evaluation of a thunk, which may cause prior evaluations in turn (true laziness).
As this would lead to a horrible space leak, the compiler then figures out how and when to evaluate thunks ahead of time to save space. The programmer can then support this process by giving strictness annotations (en.wikibooks.org/wiki/Haskell/Strictness, www.haskell.org/haskellwiki/Performance/Strictness) to further reduce space usage in the form of nested thunks.
I am not an expert in the operational semantics of haskell, so I will just leave the link as a resource.
Some more resources:
http://www.haskell.org/haskellwiki/Performance/Laziness
http://www.haskell.org/haskellwiki/Haskell/Lazy_Evaluation
Lazy doesn't mean do nothing. Whenever your program pattern matches a case expression, it evaluates something -- just enough anyway. Otherwise it can't figure out which RHS to use. Don't see any case expressions in your code? Don't worry, the compiler is translating your code to a stripped down form of Haskell where they are hard to avoid using.
For a beginner, a basic rule of thumb is let is lazy, case is less lazy.
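A two-line illustration of that rule of thumb:

-- The let binds a thunk without inspecting it; the case must evaluate xs
-- (to WHNF) before it can choose a branch.
headOrZero :: [Int] -> Int
headOrZero xs =
  let d = 0 in      -- lazy: d is only forced if the [] branch needs it
  case xs of        -- less lazy: xs is forced to WHNF right here
    (y : _) -> y
    []      -> d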
This is not a full answer aiming for karma, but just a piece of the puzzle -- to the extent that this is about semantics, bear in mind that there are multiple evaluation strategies that provide the same semantics. One good example here -- and the project also speaks to how we typically think of Haskell semantics -- was the Eager Haskell project, which radically altered evaluation strategies while maintaining the same semantics: http://csg.csail.mit.edu/pubs/haskell.html
The Glasgow Haskell compiler translates your code into a Lambda-calculus-like language called core. In this language, something is going to be evaluated, whenever you pattern match it by a case-statement. Thus if a function is called, the outermost constructor and only it (if there are no forced fields) is going to be evaluated. Anything else is canned in a thunk. (Thunks are introduced by let bindings).
Of course this is not exactly what happens in the real language. The compiler converts Haskell into Core in a very sophisticated way, making as many things as possible lazy and anything that is always needed eager. Additionally, there are unboxed values and tuples that are always strict.
If you try to evaluate a function by hand, you can basically think:
Try to evaluate the outermost constructor of the result.
Anything else that is needed to get the result (but only if it's really needed) is also going to be evaluated. The order doesn't matter.
In the case of IO you have to evaluate the results of all statements from first to last. This is a bit more complicated, since the IO monad does some tricks to force evaluation in a specific order.
We all know (or should know) that Haskell is lazy by default. Nothing is evaluated until it must be evaluated.
No.
Haskell is not a lazy language
Haskell is a language in which evaluation order doesn't matter because there are no side effects.
It's not quite true that evaluation order doesn't matter, because the language allows for infinite loops. If you aren't careful, it's possible to get stuck in a cul-de-sac where you evaluate a subexpression forever when a different evaluation order would have led to termination in finite time. So it's more accurate to say:
Haskell implementations must evaluate the program in a way that terminates if there is any evaluation order that terminates. Only if every possible evaluation order fails to terminate can the implementation fail to terminate.
This still leaves implementations with a huge freedom in how they evaluate the program.
A Haskell program is a single expression, namely let {all top-level bindings} in Main.main. Evaluation can be understood as a sequence of reduction (small-)steps which change the expression (which represents the current state of the executing program).
You can divide reduction steps into two categories: those that are provably necessary (provably will be part of any terminating sequence of reductions), and those that aren't. You can divide the provably necessary reductions somewhat vaguely into two subcategories: those that are "obviously" necessary, and those that require some nontrivial analysis to prove them necessary.
Performing only obviously necessary reductions is what's called "lazy evaluation". I don't know whether a purely lazy evaluating implementation of Haskell has ever existed. Hugs may have been one. GHC definitely isn't.
GHC performs reduction steps at compile time that aren't provably necessary; for example, it will replace 1+2::Int with 3::Int even if it can't prove that the result will be used.
GHC may also perform not-provably-necessary reductions at run time in some circumstances. For example, when generating code to evaluate f (x+y), if x and y are of type Int and their values will be known at run time, but f can't be proven to use its argument, there is no reason not to compute x+y before calling f. It uses less heap space and less code space and is probably faster even if the argument isn't used. However, I don't know whether GHC actually takes these sorts of optimization opportunities.
GHC definitely performs evaluation steps at run time that are proven necessary only by fairly complex cross-module analyses. This is extremely common and may represent the bulk of the evaluation of realistic programs. Lazy evaluation is a last-resort fallback evaluation strategy; it isn't what happens as a rule.
There was an "optimistic evaluation" branch of GHC that did much more speculative evaluation at run time. It was abandoned because of its complexity and the ongoing maintenance burden, not because it didn't perform well. If Haskell was as popular as Python or C++ then I'm sure there would be implementations with much more sophisticated runtime evaluation strategies, maintained by deep-pocketed corporations. Non-lazy evaluation isn't a change to the language, it's just an engineering challenge.
Reduction is driven by top-level I/O, and nothing else
You can model interaction with the outside world by means of special side-effectful reduction rules like: "If the current program is of the form getChar >>= <expr>, then get a character from standard input and reduce the program to <expr> applied to the character you got."
The entire goal of the run time system is to evaluate the program until it has one of these side-effecting forms, then do the side effect, then repeat until the program has some form that implies termination, such as return ().
There are no other rules about what is reduced when. There are only rules about what can reduce to what.
For example, the only rules for if expressions are that if True then <expr1> else <expr2> can be reduced to <expr1>, if False then <expr1> else <expr2> can be reduced to <expr2>, and if <exc> then <expr1> else <expr2>, where <exc> is an exceptional value, can be reduced to an exceptional value.
If the expression representing the current state of your program is an if expression, you have no choice but to perform reductions on the condition until it's True or False or <exc>, because that's the only way you'll ever get rid of the if expression and have any hope of reaching a state that matches one of the I/O rules. But the language specification doesn't tell you to do that in so many words.
These sorts of implicit ordering constraints are the only way that evaluation can be "forced" to happen. This is a frequent source of confusion for beginners. For example, people sometimes try to make foldl more strict by writing foldl (\x y -> x `seq` x+y) instead of foldl (+). This doesn't work, and nothing like it can ever work, because no expression can make itself evaluate. The evaluation can only "come from above". seq is not special in any way in this regard.
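The standard library's fix for the foldl example works precisely by arranging for the demand to come from above:

import Data.List (foldl')

-- Each intermediate accumulator is forced as the fold itself is reduced,
-- so no chain of (+) thunks builds up.
sumStrict :: [Int] -> Int
sumStrict = foldl' (+) 0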
Reduction happens everywhere
Reduction (or evaluation) in Haskell only occurs at strictness points. [...] My intuition says that main, seq / bang patterns, pattern matching, and any IO action performed via main are the primary strictness points [...].
I don't see how to make sense of that statement. Every part of the program has some meaning, and that meaning is defined by reduction rules, so reduction happens everywhere.
To reduce a function application <expr1> <expr2>, you have to evaluate <expr1> until it has a form like (\x -> <expr1'>) or (getChar >>=) or something else that matches a rule. But for some reason function application doesn't tend to show up on lists of expressions that allegedly "force evaluation", while case always does.
You can see this misconception in a quote from the Haskell wiki, found in another answer:
In practice Haskell is not a purely lazy language: for instance pattern matching is usually strict
I don't understand what could qualify as a "purely lazy language" to whoever wrote that, except, perhaps, a language in which every program hangs because the runtime never does anything. If pattern matching is a feature of your language then you've got to actually do it at some point. To do it, you have to evaluate the scrutinee enough to determine whether it matches the pattern. That's the laziest way to match a pattern that is possible in principle.
~-prefixed patterns are often called "lazy" by programmers, but the language spec calls them "irrefutable". Their defining property is that they always match. Because they always match, you don't have to evaluate the scrutinee to determine whether they match or not, so a lazy implementation won't. The difference between regular and irrefutable patterns is what expressions they match, not what evaluation strategy you're supposed to use. The spec says nothing about evaluation strategies.
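A concrete way to see that the difference is about what matches, not how it is evaluated:

-- The irrefutable pattern always matches, so the pair is never forced;
-- the ordinary pattern must force it to see the (,) constructor.
lazyOne :: (Int, Int) -> Int
lazyOne ~(_, _) = 1        -- lazyOne undefined == 1

strictOne :: (Int, Int) -> Int
strictOne (_, _) = 1       -- strictOne undefined == undefined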
main is a strictness point. It is specially designated as the primary strictness point of its context: the program. When the program (main's context) is evaluated, the strictness point of main is activated. [...] Main is usually composed of IO actions, which are also strictness points, whose context is main.
I'm not convinced that any of that has any meaning.
Main's depth is maximal: it must be fully evaluated.
No, main only has to be evaluated "shallowly", to make I/O actions appear at the top level. main is the entire program, and the program isn't completely evaluated on every run because not all code is relevant to every run (in general).
discuss seq and pattern matching in these terms.
I already talked about pattern matching. seq can be defined by rules that are similar to case and application: for example, (\x -> <expr1>) `seq` <expr2> reduces to <expr2>. This "forces evaluation" in the same way that case and application do. WHNF is just a name for what these expressions "force evaluation" to.
Explain the nuances of function application: how is it strict? How is it not?
It's strict in its left expression just like case is strict in its scrutinee. It's also strict in the function body after substitution just like case is strict in the RHS of the selected alternative after substitution.
What about deepseq?
It's just a library function, not a builtin.
Incidentally, deepseq is semantically weird. It should take only one argument. I think that whoever invented it just blindly copied seq with no understanding of why seq needs two arguments. I count deepseq's name and specification as evidence that a poor understanding of Haskell evaluation is common even among experienced Haskell programmers.
let and case statements?
I talked about case. let, after desugaring and type checking, is just a way of writing an arbitrary expression graph in tree form. Here's a paper about it.
unsafePerformIO?
To an extent it can be defined by reduction rules. For example, case unsafePerformIO <expr> of <alts> reduces to unsafePerformIO (<expr> >>= \x -> return (case x of <alts>)), and at the top level only, unsafePerformIO <expr> reduces to <expr>.
This doesn't do any memoization. You could try to simulate memoization by rewriting every unsafePerformIO expression to explicitly memoize itself, and creating the associated IORefs... somewhere. But you could never reproduce GHC's memoization behavior, because it depends on unpredictable details of the optimization process, and because it isn't even type safe (as shown by the infamous polymorphic IORef example in the GHC documentation).
Debug.Trace?
Debug.Trace.trace is just a simple wrapper around unsafePerformIO.
Top-level definitions?
Top-level variable bindings are the same as nested let bindings. data, class, import, and such are a whole different ball game.
Strict data types? Bang patterns?
Just sugar for seq.
