What is the mathematical significance of "all (==1) [1,1..]" not terminating? - haskell

Intuitively, I would expect the "mathematical" answer to all (==1) [1,1..] to be True because all of the elements in a list that only contains 1s are equal to 1. However I understand that "computationally", the process of evaluating the infinite list in order to check that each element does in fact equal 1 will never terminate, therefore the expression instead "evaluates" to bottom or ⊥.
I find this result counter-intuitive and a little unnerving. I think the fact that the list is an infinite one confuses the issue both mathematically and computationally, and I would love to hear from anyone who has some insight and experience in this area
My question is, which is the most mathematically correct answer? ⊥ or True?
Some elaboration as to why one answer is more correct than the other would also be much appreciated.
edit: This might indirectly have something to do with the Curry-Howard isomorphism (Programs are proofs and types are theorems) and Gödel's incompleteness theorems. If I remember correctly, one of the incompleteness theorems can be (incredibly roughly) summarised as saying that "sufficiently powerful formal systems (like mathematics or a programming language) cannot prove all true statements that can be expressed in the system"

The value
all (==1) [1,1..]
is the least upper bound of the sequence
all (==1) (⊥)
all (==1) (1 : ⊥)
all (==1) (1 : 1 : ⊥)
...
and all terms of this sequence are ⊥, so the least upper bound is also ⊥. (All Haskell functions are continuous: preserve least upper bounds.)
This is using the denotational semantics for Haskell and doesn't depend (directly) on choosing any particular evaluation strategy.

In programming, we use not classical logic, but intuitionistic (constructive) logic. We can still interpret types as theorems, but we don’t care about the truth of these theorems; instead, we talk about whether they’re constructively provable. Even though all (== 1) [1, 1 ..] is true, we can’t prove it in Haskell, so we get ⊥ (here, an infinite loop).
In constructive logic, we don’t have the law of the excluded middle, nor double negation elimination as a result. The Haskell function type all (== 1) :: [Int] -> Bool doesn’t represent the theorem [Int] → Bool, which would be a total function; it represents the theorem [Int] → ¬¬Bool. If all can prove the theorem by producing a result, then that result will be of type Bool; otherwise, the result is bottom.

I don't know enough about computability to answer this properly, but I do draw comfort from the simplicity in the language design. In this case, I find it simple and elegant that all doesn't have to know anything about the input it's given. It's arguably easier for human to reason about the snippet that you have given.
Sure, now it would be good for the infinite list comprehension to somehow tell all that it's an infinite list of ones. But... that's what it says "by being of that value". Having a bit more generic metadata about the repeated sequence would be more successful for optimization purposes, but I think the simplicity would be reduced, and complexity introduced.

Related

how lazy is haskell when dealing with complicated program flow

I was wondering how smart/lazy Haskell is. Can I always be sure that Haskell will only do what is necessary to generate a certain output?
No.
Haskell specifies a denotational semantics for its core lambda-like calculus, and you can rely on that semantics. Additionally, there is a metatheory proof that a particular reduction order -- known colloquially as "lazy evaluation" -- realizes that semantics; and so many people use that as their mental model of how Haskell programs behave.
There are two broad categories of ways that a Haskell program may end up evaluating more than necessary:
The Haskell implementation may choose to evaluate more. GHC uses lazy evaluation in most places, but I believe it will use other evaluation orders for efficiency in some cases. You could also look at the Eager Haskell project, which is attempting to use another implementation strategy. In principle, an implementation would be within its rights to choose to speculatively fork some computations to another thread (and then throw away the results if they weren't needed). And so on and so forth -- there are many possible variations on this theme.
The denotational semantics specified may demand more evaluation than "necessary". For example, one that occasionally trips up beginners:
primes :: [Int]
primes = 2 : filter prime [3,5..]
prime :: Int -> Bool
prime x = and [x `mod` p /= 0 | p <- primes, p < x]
When checking whether 3 should be in the list primes, it is in fact not necessary to check any of the elements of primes past 2, because the sequence is strictly monotonically increasing. But Haskell is not (does not try to be) smart enough to notice that; it will go straight on trying to check the rest of the primes and end up in an infinite loop instead of giving the list of primes.
An even smaller example: you could think that x && False is always False, but x will typically be evaluated anyway, because the semantics says this should be an infinite loop if x is. (Contrast False && x, which typically does not result in evaluating x.)
That said, when you say "complex structure", one thing that comes to mind is: does Haskell do the laziness thing even with custom data types that I define? That is, do complex structures like hash maps and balanced trees and k-d trees and so forth get treated lazily? The answer there is yes, at least for GHC; there is nothing fundamentally special about any of the types in the Prelude except IO. Lists, booleans, Maybes, and so forth are lazy not because the compiler knows special things about them, but simply as a consequence of the lazy evaluation reduction rules specified by Haskell and their declarations as types.
Of course, there are ways to opt-out of laziness. Some of the structures you will see on Hackage do that in various ways; but don't worry, usually this will be declared in their documentation.

Where does the Haskell function `seq` get its name from?

I get that seq is used to improve performance by avoiding unneeded laziness. I just want to know where the name is derived from? Is it from "sequence" or "sequential"? And how does the name relate to strict evaluation?
It comes from sequence point. That's a well-known concept in C, and it's indeed quite similar to the seq operator in Haskell: every computation on the left should be done before any computation on the right.
Of course, Haskell seq is a bit less demanding than that: it merely requests that the thing on the left is evaluated to weak head normal form before the result on the right is evaluated. And it doesn't really guarantee any particular evaluation order at all†, only that if the expression on the left is ⊥ then the one on the right must not be evaluated.
See pseq or deepseq for stronger alternatives, which come closer to what C calls sequence points.
†Actually, C or C++ sequence points don't guarantee computation order either, only that any side effects are in the right order. But, in C side effects are ubiquitous so apart from low-level optimisations you can usually assume that sequence-point order will be upheld, whereas GHC will in fact quite often throw seqs away if only it knows that the expressions do not diverge.

Does Prolog use lazy evaluation like Haskell? [duplicate]

Because Prolog uses chronological backtracking(from the Prolog Wikipedia page) even after an answer is found(in this example where there can only be one solution), would this justify Prolog as using eager evaluation?
mother_child(trude, sally).
father_child(tom, sally).
father_child(tom, erica).
father_child(mike, tom).
sibling(X, Y) :- parent_child(Z, X), parent_child(Z, Y).
parent_child(X, Y) :- father_child(X, Y).
parent_child(X, Y) :- mother_child(X, Y).
With the following output:
?- sibling(sally, erica).
true ;
false.
To summarize the discussion with #WillNess below, yes, Prolog is strict. However, Prolog's execution model and semantics are substantially different from the languages that are usually labelled strict or non-strict. For more about this, see below.
I'm not sure the question really applies to Prolog, because it doesn't really have the kind of implicit evaluation ordering that other languages have. Where this really comes into play in a language like Haskell, you might have an expression like:
f (g x) (h y)
In a strict language like ML, there is a defined evaluation order: g x will be evaluated, then h y, and f (g x) (h y) last. In a language like Haskell, g x and h y will only be evaluated as required ("non-strict" is more accurate than "lazy"). But in Prolog,
f(g(X), h(Y))
does not have the same meaning, because it isn't using a function notation. The query would be broken down into three parts, g(X, A), h(Y, B), and f(A,B,C), and those constituents can be placed in any order. The evaluation strategy is strict in the sense that what comes earlier in a sequence will be evaluated before what comes next, but it is non-strict in the sense that there is no requirement that variables be instantiated to ground terms before evaluation can proceed. Unification is perfectly content to complete without having given you values for every variable. I am bringing this up because you have to break down a complex, nested expression in another language into several expressions in Prolog.
Backtracking has nothing to do with it, as far as I can tell. I don't think backtracking to the nearest choice point and resuming from there precludes a non-strict evaluation method, it just happens that Prolog's is strict.
That Prolog pauses after giving each of the several correct answers to a problem has nothing to do with laziness; it is a part of its user interaction protocol. Each answer is calculated eagerly.
Sometimes there will be only one answer but Prolog doesn't know that in advance, so it waits for us to press ; to continue search, in hopes of finding another solution. Sometimes it is able to deduce it in advance and will just stop right away, but only sometimes.
update:
Prolog does no evaluation on its own. All terms are unevaluated, as if "quoted" in Lisp.
Prolog will unfold your predicate definitions as written and is perfectly happy to keep your data structures full of unevaluated uninstantiated holes, if so entailed by your predicate definitions.
Haskell does not need any values, a user does, when requesting an output.
Similarly, Prolog produces solutions one-by-one, as per the user requests.
Prolog can even be seen to be lazier than Haskell where all arithmetic is strict, i.e. immediate, whereas in Prolog you have to explicitly request the arithmetic evaluation, with is/2.
So perhaps the question is ill-posed. Prolog's operations model is just too different. There are no "results" nor "functions", for one; but viewed from another angle, everything is a result, and predicates are "multi"-functions.
As it stands, the question is not correct in what it states. Chronological backtracking does not mean that Prolog will necessarily backtrack "in an example where there can be only one solution".
Consider this:
foo(a, 1).
foo(b, 2).
foo(c, 3).
?- foo(b, X).
X = 2.
?- foo(X, 2).
X = b.
So this is an example that does have only one solution and Prolog recognizes that, and does not attempt to backtrack. There are cases in which you can implement a solution to a problem in a way that Prolog will not recognize that there is only one logical solution, but this is due to the implementation and is not inherent to Prolog's execution model.
You should read up on Prolog's execution model. From the Wikipedia article which you seem to cite, "Operationally, Prolog's execution strategy can be thought of as a generalization of function calls in other languages, one difference being that multiple clause heads can match a given call. In that case, [emphasis mine] the system creates a choice-point, unifies the goal with the clause head of the first alternative, and continues with the goals of that first alternative." Read Sterling and Shapiro's "The Art of Prolog" for a far more complete discussion of the subject.
from Wikipedia I got
In eager evaluation, an expression is evaluated as soon as it is bound to a variable.
Then I think there are 2 levels - at user level (our predicates) Prolog is not eager.
But it is at 'system' level, because variables are implemented as efficiently as possible.
Indeed, attributed variables are implemented to be lazy, and are rather 'orthogonal' to 'logic' Prolog variables.

Can compilers deduce/prove mathematically?

I'm starting to learn functional programming language like Haskell, ML and most of the exercises will show off things like:
foldr (+) 0 [ 1 ..10]
which is equivalent to
sum = 0
for( i in [1..10] )
sum += i
So that leads me to think why can't compiler know that this is Arithmetic Progression and use O(1) formula to calculate?
Especially for pure FP languages without side effect?
The same applies for
sum reverse list == sum list
Given a + b = b + a
and definition of reverse, can compilers/languages prove it automatically?
Compilers generally don't try to prove this kind of thing automatically, because it's hard to implement.
As well as adding the logic to the compiler to transform one fragment of code into another, you have to be very careful that it only tries to do it when it's actually safe - i.e. there are often lots of "side conditions" to worry about. For example in your example above, someone might have written an instance of the type class Num (and hence the (+) operator) where the a + b is not b + a.
However, GHC does have rewrite rules which you can add to your own source code and could be used to cover some relatively simple cases like the ones you list above, particularly if you're not too bothered about the side conditions.
For example, and I haven't tested this, you might use the following rule for one of your examples above:
{-# RULES
"sum/reverse" forall list . sum (reverse list) = sum list
#-}
Note the parentheses around reverse list - what you've written in your question actually means (sum reverse) list and wouldn't typecheck.
EDIT:
As you're looking for official sources and pointers to research, I've listed a few.
Obviously it's hard to prove a negative but the fact that no-one has given an example of a general-purpose compiler that does this kind of thing routinely is probably quite strong evidence in itself.
As others have pointed out, even simple arithmetic optimisations are surprisingly dangerous, particularly on floating point numbers, and compilers generally have flags to turn them off - for example Visual C++, gcc. Even integer arithmetic isn't always clear-cut and people occasionally have big arguments about how to deal with things like overflow.
As Joachim noted, integer variables in loops are one place where slightly more sophisticated optimisations are applied because there are actually significant wins to be had. Muchnick's book is probably the best general source on the topic but it's not that cheap. The wikipedia page on strength reduction is probably as good an introduction as any to one of the standard optimisations of this kind, and has some references to the relevant literature.
FFTW is an example of a library that does all kinds of mathematical optimization internally. Some of its code is generated by a customised compiler the authors wrote specifically for the purpose. It's worthwhile because the authors have domain-specific knowledge of optimizations that in the specific context of the library are both worth the effort and safe
People sometimes use template metaprogramming to write "self-optimising libraries" that again might rely on arithmetic identities, see for example Blitz++. Todd Veldhuizen's PhD dissertation has a good overview.
If you descend into the realms of toy and academic compilers all sorts of things go. For example my own PhD dissertation is about writing inefficient functional programs along with little scripts that explain how to optimise them. Many of the examples (see Chapter 6) rely on applying arithmetic rules to justify the underlying optimisations.
Also, it's worth emphasising that the last few examples are of specialised optimisations being applied only to certain parts of the code (e.g. calls to specific libraries) where it is expected to be worthwhile. As other answers have pointed out, it's simply too expensive for a compiler to go searching for all possible places in an entire program where an optimisation might apply. The GHC rewrite rules that I mentioned above are a great example of a compiler exposing a generic mechanism for individual libraries to use in a way that's most appropriate for them.
The answer
No, compilers don’t do that kind of stuff.
One reason why
And for your examples, it would even be wrong: Since you did not give type annotations, the Haskell compiler will infer the most general type, which would be
foldr (+) 0 [ 1 ..10] :: Num a => a
and similar
(\list -> sum (reverse list)) :: Num a => [a] -> a
and the Num instance for the type that is being used might well not fulfil the mathematical laws required for the transformation you suggest. The compiler should, before everything else, avoid to change the meaning (i.e. the semantics) of your program.
More pragmatically: The cases where the compiler could detect such large-scale transformations rarely occur in practice, so it would not be worth it to implement them.
An exception
Note notable exceptions are linear transformations in loops. Most compilers will rewrite
for (int i = 0; i < n; i++) {
... 200 + 4 * i ...
}
to
for (int i = 0, j = 200; i < n; i++, j += 4) {
... j ...
}
or something similar, as that pattern does often occur in code working on array.
The optimizations you have in mind will probably not be done even in the presence of monomorphic types, because there are so many possibilities and so much knowledge required. For example, in this example:
sum list == sum (reverse list)
The compiler would need to know or take into account the following facts:
sum = foldl (+) 0
(+) is commutative
reverse list is a permutation of list
foldl x c l, where x is commutative and c is a constant, yields the same result for all permutations of l.
This all seems trivial. Sure, the compiler can most probably look up the definition of sumand inline it. It could be required that (+) be commutative, but remember that +is just another symbol without attached meaning to the compiler. The third point would require the compiler to prove some non trivial properties about reverse.
But the point is:
You don't want to perform the compiler to do those calculations with each and every expression. Remember, to make this really useful, you'd have to heap up a lot of knowledge about many, many standard functions and operators.
You still can't replace the expression above with True unless you can rule out the possibility that list or some list element is bottom. Usually, one cannot do this. You can't even do the following "trivial" optimization of f x == f x in all cases
f x `seq` True
For, consider
f x = (undefined :: Bool, x)
then
f x `seq` True ==> True
f x == f x ==> undefined
That being said, regarding your first example slightly modified for monomorphism:
f n = n * foldl (+) 0 [1..10] :: Int
it is imaginable to optimize the program by moving the expression out of its context and replace it with the name of a constant, like so:
const1 = foldl (+) 0 [1..10] :: Int
f n = n * const1
This is because the compiler can see that the expression must be constant.
What you're describing looks like super-compilation. In your case, if the expression had a monomorphic type like Int (as opposed to polymorphic Num a => a), the compiler could infer that the expression foldr (+) 0 [1 ..10] has no external dependencies, therefore it could be evaluated at compile time and replaced by 55. However, AFAIK no mainstream compiler currently does this kind of optimization.
(In functional programming "proving" is usually associated with something different. In languages with dependent types types are powerful enough to express complex proposition and then through the Curry-Howard correspondence programs become proofs of such propositions.)
As others have noted, it's unclear that your simplifications even hold in Haskell. For instance, I can define
newtype NInt = N Int
instance Num NInt where
N a + _ = N a
N b * _ = N b
... -- etc
and now sum . reverse :: Num [a] -> a does not equal sum :: Num [a] -> a since I can specialize each to [NInt] -> NInt where sum . reverse == sum clearly does not hold.
This is one general tension that exists around optimizing "complex" operations—you actually need quite a lot of information in order to successfully prove that it's okay to optimize something. This is why the syntax-level compiler optimization which do exist are usually monomorphic and related to the structure of programs---it's usually such a simplified domain that there's "no way" for the optimization to go wrong. Even that is often unsafe because the domain is never quite so simplified and well-known to the compiler.
As an example, a very popular "high-level" syntactic optimization is stream fusion. In this case the compiler is given enough information to know that stream fusion can occur and is basically safe, but even in this canonical example we have to skirt around notions of non-termination.
So what does it take to have \x -> sum [0..x] get replaced by \x -> x*(x + 1)/2? The compiler would need a theory of numbers and algebra built-in. This is not possible in Haskell or ML, but becomes possible in dependently typed languages like Coq, Agda, or Idris. There you could specify things like
revCommute :: (_+_ :: a -> a -> a)
-> Commutative _+_
-> foldr _+_ z (reverse as) == foldr _+_ z as
and then, theoretically, tell the compiler to rewrite according to revCommute. This would still be difficult and finicky, but at least we'd have enough information around. To be clear, I'm writing something very strange above, a dependent type. The type not only depends on the ability to introduce both a type and a name for the argument inline, but also the existence of the entire syntax of your language "at the type level".
There are a lot of differences between what I just wrote and what you'd do in Haskell, though. First, in order to form a basis where such promises can be taken seriously, we must throw away general recursion (and thus we already don't have to worry about questions of non-termination like stream-fusion does). We also must have enough structure around to create something like the promise Commutative _+_---this likely depends upon there being an entire theory of operators and mathematics built into the language's standard library else you would need to create that yourself. Finally, the richness of type system required to even express these kinds of theories adds a lot of complexity to the entire system and tosses out type inference as you know it today.
But, given all that structure, I'd never be able to create an obligation Commutative _+_ for the _+_ defined to work on NInts and so we could be certain that foldr (+) 0 . reverse == foldr (+) 0 actually does hold.
But now we'd need to tell the compiler how to actually perform that optimization. For stream-fusion, the compiler rules only kick in when we write something in exactly the right syntactic form to be "clearly" an optimization redex. The same kinds of restrictions would apply to our sum . reverse rule. In fact, already we're sunk because
foldr (+) 0 . reverse
foldr (+) 0 (reverse as)
don't match. They're "obviously" the same due to some rules we could prove about (.), but that means that now the compiler must invoke two built-in rules in order to perform our optimization.
At the end of the day, you need a very smart optimization search over the sets of known laws in order to achieve the kinds of automatic optimizations you're talking about.
So not only do we add a lot of complexity to the entire system, require a lot of base work to build-in some useful algebraic theories, and lose Turing completeness (which might not be the worst thing), we also only get a finicky promise that our rule would even fire unless we perform an exponentially painful search during compilation.
Blech.
The compromise that exists today tends to be that sometimes we have enough control over what's being written to be mostly certain that a certain obvious optimization can be performed. This is the regime of stream fusion and it requires a lot of hidden types, carefully written proofs, exploitations of parametricity, and hand-waving before it's something the community trusts enough to run on their code.
And it doesn't even always fire. For an example of battling that problem take a look at the source of Vector for all of the RULES pragmas that specify all of the common circumstances where Vector's stream-fusion optimizations should kick in.
All of this is not at all a critique of compiler optimizations or dependent type theories. Both are really incredible. Instead it's just an amplification of the tradeoffs involved in introducing such an optimization. It's not to be done lightly.
Fun fact: Given two arbitrary formulas, do they both give the same output for the same inputs? The answer to this trivial question is not computable! In other words, it is mathematically impossible to write a computer program that always gives the correct answer in finite time.
Given this fact, it's perhaps not surprising that nobody has a compiler that can magically transform every possible computation into its most efficient form.
Also, isn't this the programmer's job? If you want the sum of an arithmetic sequence commonly enough that it's a performance bottleneck, why not just write some more efficient code yourself? Similarly, if you really want Fibonacci numbers (why?), use the O(1) algorithm.

Haskell pattern match "diverge" and ⊥

I'm trying to understand the Haskell 2010 Report section 3.17.2 "Informal Semantics of Pattern Matching". Most of it, relating to a pattern match succeeding or failing seems straightforward, however I'm having difficulty understanding the case which is described as the pattern match "diverging".
I'm semi-persuaded it means that the match algorithm does not "converge" to an answer (hence the match function never returns). But if doesn't return, then, how can it return a value, as suggested by the parenthetical "i.e. return ⊥"? And what does it mean to "return ⊥" anyway? How one handle that outcome?
Item 5 has the particularly confusing (to me) point "If the value is ⊥, the match diverges". Is this just saying that a value of ⊥ produces a match result of ⊥? (Setting aside that I don't know what that outcome means!)
Any illumination, possibly with an example, would be appreciated!
Addendum after a couple of lengthy answers:
Thanks Tikhon and all for your efforts.
It seems my confusion comes from there being two different realms of explanation: The realm of Haskell features and behaviors, and the realm of mathematics/semantics, and in Haskell literature these two are intermingled in an attempt to explain the former in terms of the latter, without sufficient signposts (to me) as to which elements belong to which.
Evidently "bottom" ⊥ is in the semantics domain, and does not exist as a value within Haskell (ie: you can't type it in, you never get a result that prints out as " ⊥").
So, where the explanation says a function "returns ⊥", this refers to a function that does any of a number of inconvenient things, like not terminate, throw an exception, or return "undefined". Is that right?
Further, those who commented that ⊥ actually is a value that can be passed around, are really thinking of bindings to ordinary functions that haven't yet actually been called upon to evaluate ("unexploded bombs" so to speak) and might never, due to laziness, right?
The value is ⊥, usually pronounced "bottom". It is a value in the semantic sense--it is not a normal Haskell value per se. It represents computations that do not produce a normal Haskell value: exceptions and infinite loops, for example.
Semantics is about defining the "meaning" of a program. In Haskell, we usually talk about denotational semantics, where the value is a mathematical object of some sort. The most trivial example would be that the expression 10 (but also the expression 9 + 1) have denotations of the number 10 (rather than the Haskell value 10). We usually write that ⟦9 + 1⟧ = 10 meaning that the denotation of the Haskell expression 9 + 1 is the number 10.
However, what do we do with an expression like let x = x in x? There is no Haskell value for this expression. If you tried to evaluate it, it would simply never finish. Moreover, it is not obvious what mathematical object this corresponds to. However, in order to reason about programs, we need to give some denotation for it. So, essentially, we just make up a value for all these computations, and we call the value ⊥ (bottom).
So ⊥ is just a way to define what a computation that doesn't return "means".
We also define other computations like undefined and error "some message" as ⊥ because they also do not have obvious normal values. So throwing an exception corresponds to ⊥. This is exactly what happens with a failed pattern match.
The usual way of thinking about this is that every Haskell type is "lifted"--it contains ⊥. That is, Bool corresponds to {⊥, True, False} rather than just {True, False}. This represents the fact that Haskell programs are not guaranteed to terminate and can have exceptions. This is also true when you define your own type--the type contains every value you defined for it as well as ⊥.
Interestingly, since Haskell is non-strict, ⊥ can exist in normal code. So you could have a value like Just ⊥, and if you never evaluate it, everything will work fine. A good example of this is const: const 1 ⊥ evaluates to 1. This works for failed pattern matches as well:
const 1 (let Just x = Nothing in x) -- 1
You should read the section on denotational semantics in the Haskell WikiBook. It's a very approachable introduction to the subject, which I personally find very fascinating.
Denotational semantics
So, briefly denotational semantics, which is where ⊥ lives, is a mapping from Haskell values to some other space of values. You do this to give meaning to programs in a more formal manner than just talking about what programs should do—you say that they must respect their denotational semantics.
So for Haskell, you often think about how Haskell expressions denote mathematical values. You often see Strachey brackets ⟦·⟧ to denote the "semantic mapping" from Haskell to Math. Finally, we want our semantic brackets to be compatible with semantic operations. For instance
⟦x + y⟧ = ⟦x⟧ + ⟦y⟧
where on the left side + is the Haskell function (+) :: Num a => a -> a -> a and on the right side it's the binary operation in a commutative group. While is cool, because then we know that we can use the properties from the semantic map to know how our Haskell functions should work. To wit, let's write the commutative property "in Math"
⟦x⟧ + ⟦y⟧ == ⟦y⟧ + ⟦x⟧
= ⟦x + y⟧ == ⟦y + x⟧
= ⟦x + y == y + x⟧
where the third step also indicates that the Haskell (==) :: Eq a => a -> a -> a ought to have the properties of a mathematical equivalence relationship.
Well, except...
Anyway, that's all well and good until we remember that computers are finite things and Maths don't much care about that (unless you're using intuitionistic logic, and then you get Coq). So, we have to take note of places where our semantics don't follow Math quite right. Here are three examples
⟦undefined⟧ = ??
⟦error "undefined"⟧ = ??
⟦let x = x in x⟧ = ??
This is where ⊥ comes into play. We just assert that so far as the denotational semantics of Haskell are concerned each of those examples might as well mean (the newly introduced Mathematical/semantic concept of) ⊥. What are the Mathematical properties of ⊥? Well, this is where we start to really dive into what the semantic domain is and start talking about monotonicity of functions and CPOs and the like. Essentially, though, ⊥ is a mathematical object which plays roughly the same game as non-termination does. To the point of view of the semantic model, ⊥ is toxic and it infects expressions with its toxic-nondeterminism.
But it's not a Haskell-the-language concept, just a Semantic-domain-of-the-language-Haskell thing. In Haskell we have undefined, error and infinite looping. This is important.
Extra-semantic behavior (side note)
So the semantics of ⟦undefined⟧ = ⟦error "undefined"⟧ = ⟦let x = x in x⟧ = ⊥ are clear once we understand the mathematical meanings of ⊥, but it's also clear that those each have different effects "in reality". This is sort of like "undefined behavior" of C... it's behavior that's undefined so far as the semantic domain is concerned. You might call it semantically unobservable.
So how does pattern matching return ⊥?
So what does it mean "semantically" to return ⊥? Well, ⊥ is a perfectly valid semantic value which has the infection property which models non-determinism (or asynchronous error throwing). From the semantic point of view it's a perfectly valid value which can be returned as is.
From the implementation point of view, you have a number of choices, each of which map to the same semantic value. undefined isn't quite right, nor is entering an infinite loop, so if you're going to pick a semantically undefined behavior you might as well pick one that's useful and throw an error
*** Exception: <interactive>:2:5-14: Non-exhaustive patterns in function cheers

Resources