I am wondering exactly how list comprehensions are evaluated in Haskell. After reading "Removing syntactic sugar: List comprehension in Haskell" and "Haskell Lazy Evaluation and Reuse", I still don't really understand if
[<function> x|x <- <someList>, <somePredicate>]
is actually exactly equivalent (not just in outcome but in evaluation) to
map (<function>) . filter (<somePredicate>) $ <someList>
and if so, does this mean it can potentially reduce time complexity drastically to build up the desired list with only desired elements?
Also, how does this work in terms of infinite lists? To be specific: I assume something like:
[x|x <- [1..], x < 20]
will be evaluated in finite time, but how "obvious" does the fact that there are no more elements above some value which satisfy the predicate need to be, for the compiler to consider it? Would
[x|x <- [1..], (sum.map factorial $ digits x) == x]
work (see Project Euler problem 34, https://projecteuler.net/problem=34)? There is obviously an upper bound: an n-digit number x is at least 10^(n-1), while its digit factorials sum to at most n*9!, and from some n on, n*9! < 10^(n-1) always holds. But do I need to supply that bound, or will the compiler find it?
There's nothing obvious about the fact that a particular infinite sequence has no more elements matching a predicate. When you pass a list to filter, it has no way of knowing any other properties of the elements than that an element can be passed to the predicate.
You can write your own version of Ord a => List a which can describe a sequence as ascending or not, and a version of filter that can use this information to stop looking at elements past a particular threshold. Unfortunately, list comprehensions won't use either of them.
Instead, I'd use a combination of takeWhile and a comprehension without a predicate / a plain map. Somewhere in the takeWhile arguments, you will supply the compiler the information about the expected upper bound; for a number of n decimal digits, it would be 10^n.
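As a rough sketch of that approach for the Project Euler 34 example: the question uses factorial and digits without defining them, so the definitions below are assumptions, and the bound 7 * 9! is one possible choice (an 8-digit number is at least 10^7, which already exceeds 8 * 9!). The search starts at 10 because the problem statement excludes single-digit numbers.

```haskell
import Data.Char (digitToInt)

-- Assumed helper: the question does not show its definition.
factorial :: Int -> Int
factorial n = product [1 .. n]

-- Assumed helper: decimal digits of a non-negative number.
digits :: Int -> [Int]
digits = map digitToInt . show

-- Assumed bound: no number with more than 7 digits can equal
-- the sum of the factorials of its digits.
upperBound :: Int
upperBound = 7 * factorial 9

-- takeWhile makes the source list finite; the guard then filters it.
curious :: [Int]
curious =
  [ x
  | x <- takeWhile (<= upperBound) [10 ..]
  , sum (map factorial (digits x)) == x
  ]
```

Without the takeWhile, the comprehension would yield the same elements lazily but would never reach the end of the list.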
[<function> x|x <- <someList>, <somePredicate>]
should always evaluate to the same result as
map (<function>) . filter (<somePredicate>) $ <someList>
However, there is no guarantee that this is how the compiler will actually do it. The section on list comprehensions in the Haskell Report only mentions what list comprehensions should do, not how they should work. So each compiler is free to do as its developers find best. Therefore, you should not assume anything about the performance of list comprehensions or that the compiler will do any optimizations.
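To make the equivalence concrete, here is a small sketch with hypothetical stand-ins f and p for <function> and <somePredicate>. Note that the guard tests x, not f x, so the filter has to run before the map. The third version follows the concatMap translation that the Haskell Report uses to specify the meaning of comprehensions; a compiler may or may not generate code this way.

```haskell
-- Hypothetical stand-ins for <function> and <somePredicate>.
f :: Int -> Int
f = (* 2)

p :: Int -> Bool
p = even

viaComprehension :: [Int] -> [Int]
viaComprehension xs = [f x | x <- xs, p x]

-- The guard applies to x itself, so filtering comes first.
viaMapFilter :: [Int] -> [Int]
viaMapFilter = map f . filter p

-- Shape of the Haskell Report translation for this comprehension.
viaConcatMap :: [Int] -> [Int]
viaConcatMap = concatMap (\x -> if p x then [f x] else [])
```

All three are lazy, so something like take 3 (viaComprehension [1..]) terminates even though the source list is infinite.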
Related
I'm just starting out with Haskell, so I'm trying to wrap my head around the "Haskell way of thinking." Is there a reason to use pattern matching to solve Problem 1 here basically by unwrapping the whole list and calling the function recursively, instead of just retrieving the last element directly like myLast lst = lst !! ((length lst) - 1)? It seems almost brute-force, but I assume it's just my lack of familiarity here.
A few things I can think of:
(!!) and length are ultimately implemented using recursion over the structure of the list. That being so, it can be a worthwhile learning exercise to implement those basic functions using explicit recursion.
Keep in mind that, under the hood, the retrieval of the last element is not direct. Since we are dealing with linked lists, length has to go through all elements of the list, and (!!) has to go through all elements up to the desired index. That being so, lst !! (length lst - 1) runs through the whole list twice, rather than once. (This is one of the reasons why, as a rule of thumb, length is better avoided unless you actually need to know the number of elements in and of itself, and not just as a proxy to something else.)
Pattern matching is a neat way of stating facts about the structure of data types. If, while consuming a list recursively, you match a [x] pattern (or, equivalently, x : [] -- an element consed to the empty list), you know that x is the last element. In a way, matching [x] involves one less level of indirection than accessing the list element at index length lst - 1, as it only deals with the structure of the list, without requiring an indexing scheme to be bolted on the top of it.
With all that said, there is something fundamentally right about your feeling that explicit recursion feels "almost brute-force". In time, you'll find out about folds, mapping functions, and other ways to capture and abstract common recursive patterns, making it possible to write in a more fluent manner.
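For comparison, the three styles discussed above might look like this. The names myLastIndexed and myLastFold are made up for illustration, and the error message for the empty list is an assumption, since the exercise does not say what should happen in that case.

```haskell
-- Explicit recursion: one traversal; the [x] pattern identifies the last element.
myLast :: [a] -> a
myLast []       = error "myLast: empty list"
myLast [x]      = x
myLast (_ : xs) = myLast xs

-- Index-based version from the question: length walks the list once,
-- and (!!) walks it a second time.
myLastIndexed :: [a] -> a
myLastIndexed lst = lst !! (length lst - 1)

-- The same recursion pattern captured by a fold: keep the newest element.
myLastFold :: [a] -> a
myLastFold = foldl1 (\_ x -> x)
```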
From my understanding, lazy evaluation means that arguments are not evaluated before they are passed to a function, but only when their values are actually used.
But in a Haskell tutorial, I see an example.
xs = [1,2,3,4,5,6,7,8]
doubleMe(doubleMe(doubleMe(xs)))
The author said an imperative language would probably pass through the list once and make a copy and then return it. Then it would pass through the list another two times and return the result.
But in a lazy language, it would first compute
doubleMe(doubleMe(doubleMe(1)))
This will give back a doubleMe(1), which is 2. Then 4, and finally 8.
So it only does one pass through the list and only when you really need it.
This confuses me. Why doesn't a lazy language take the list as a whole instead of splitting it? I mean, we can ignore what the list or the expression is before we use it. But we need to evaluate the whole thing when we use it, don't we?
A list like [1,2,3,4,5,6,7,8] is just syntactic sugar for this: 1:2:3:4:5:6:7:8:[].
In this case, all the values in the list are numeric constants, but we could define another, smaller list like this:
1:1+1:[]
All Haskell lists are linked lists, which means that they have a head and a tail. In the above example, the head is 1, and the tail is 1+1:[].
If you only want the head of the list, there's no reason to evaluate the rest of the list:
(h:_) = 1:1+1:[]
Here, h refers to 1. There's no reason to evaluate the rest of the list (1+1:[]) if h is all you need.
That's how lists are lazily evaluated. 1+1 remains a thunk (an unevaluated expression) until the value is required.
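A small experiment can make this visible. The sketch below assumes a doubleMe that doubles every element (the question does not show its definition) and uses undefined to stand for "the rest of the list"; if that part were ever evaluated, the program would crash.

```haskell
-- Assumed definition: the tutorial's doubleMe is not shown in the question.
doubleMe :: [Int] -> [Int]
doubleMe []       = []
doubleMe (x : xs) = 2 * x : doubleMe xs

-- Only the first cons cell is demanded, so the undefined tail is never touched.
firstTripled :: Int
firstTripled = head (doubleMe (doubleMe (doubleMe (1 : undefined))))

-- Matching (h:_) needs the first cell only; the second element stays a thunk.
headOnly :: Int
headOnly = h
  where
    (h : _) = 1 : undefined : []
```

Each doubleMe produces one cell as soon as the layer below has produced one, which is why the three nested calls behave like a single pass.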
This is a question I've been wondering about for a while. if statements are staples in most programming languages (at least the ones I've worked with), but in Haskell they seem to be quite frowned upon. I understand that for complex situations, Haskell's pattern matching is much cleaner than a bunch of ifs, but is there any real difference?
For a simple example, take a homemade version of sum (yes, I know it could just be foldr (+) 0):
sum :: [Int] -> Int
-- separate all the cases out
sum [] = 0
sum (x:xs) = x + sum xs
-- guards
sum xs
  | null xs = 0
  | otherwise = (head xs) + sum (tail xs)
-- case
sum xs = case xs of
  [] -> 0
  _ -> (head xs) + sum (tail xs)
-- if statement
sum xs = if null xs then 0 else (head xs) + sum (tail xs)
As a second question, which one of these options is considered "best practice" and why? My professor way back when always used the first method whenever possible, and I'm wondering if that's just his personal preference or if there was something behind it.
The problem with your examples is not the if expressions, it's the use of partial functions like head and tail. If you try to call either of these with an empty list, it throws an exception.
> head []
*** Exception: Prelude.head: empty list
> tail []
*** Exception: Prelude.tail: empty list
If you make a mistake when writing code using these functions, the error will not be detected until run time. If you make a mistake with pattern matching, your program will not compile.
For example, let's say you accidentally switched the then and else parts of your function.
-- Compiles, throws error at run time.
sum xs = if null xs then (head xs) + sum (tail xs) else 0
-- Doesn't compile. Also stands out more visually.
sum [] = x + sum xs
sum (x:xs) = 0
Note that your example with guards has the same problem.
I think the Boolean Blindness article answers this question very well. The problem is that boolean values have lost all their semantic meaning as soon as you construct them. That makes them a great source for bugs and also makes the code more difficult to understand.
Your first version, the one preferred by your prof, has the following advantages compared to the others:
no mention of null
list components are named in the pattern, so no mention of head and tail.
I do think that this one is considered "best practice".
What's the big deal? Why would we especially want to avoid head and tail? Well, everybody knows that those functions are not total, so one automatically tries to make sure that all cases are covered. A pattern match on [] not only stands out more than null xs; a series of pattern matches can also be checked by the compiler for completeness. Hence, the idiomatic version with a complete pattern match is easier to grasp (for the trained Haskell reader) and can be checked for exhaustiveness by the compiler.
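As an illustration of the completeness check, here is the pattern-matching version under the name sumList (chosen only to avoid clashing with the Prelude's sum). With GHC, the check is a warning enabled by -Wall, so this sketch assumes you compile with warnings on.

```haskell
-- Compile with: ghc -Wall
-- Both constructors of the list type are covered, so GHC has nothing to report.
sumList :: [Int] -> Int
sumList []       = 0
sumList (x : xs) = x + sumList xs

-- If the equation for [] were deleted, GHC would warn that the
-- patterns are non-exhaustive and name the missing case.
```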
The second version is slightly better than the last one because one sees at once that all cases are handled. Still, in the general case the RHS of the second equation could be longer and there could be a where clause with a couple of definitions, the last of which could be something like:
where
  ... many definitions here ...
  head xs = ... alternative redefinition of head ...
To be absolutely sure of what the RHS does, one has to make sure common names have not been redefined.
The 3rd version is the worst one IMHO: a) The 2nd match fails to deconstruct the list and still uses head and tail. b) The case is slightly more verbose than the equivalent notation with 2 equations.
In many programming languages, if-statements are fundamental primitives, and things like switch-blocks are just syntax sugar to make deeply-nested if-statements easier to write.
Haskell does it the other way around. Pattern matching is the fundamental primitive, and an if-expression is literally just syntax sugar for pattern matching. Similarly, constructs like null and head are simply user-defined functions, which are all ultimately implemented using pattern matching. So pattern matching is the thing at the bottom of it all. (And therefore potentially more efficient than calling user-defined functions.)
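To see what "syntax sugar for pattern matching" means in practice, here is a sketch of how these constructs can be written with nothing but patterns. The names ifThenElse, myNull and myHead are hypothetical, picked to avoid clashing with the Prelude; the real library definitions may differ in detail.

```haskell
-- if c then t else e behaves like a case on the two Bool constructors.
ifThenElse :: Bool -> a -> a -> a
ifThenElse c t e = case c of
  True  -> t
  False -> e

-- null written as a pattern match on the list constructors.
myNull :: [a] -> Bool
myNull [] = True
myNull _  = False

-- head written as a pattern match; the empty case has no sensible result.
myHead :: [a] -> a
myHead (x : _) = x
myHead []      = error "myHead: empty list"
```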
In many cases - such as the ones you list above - it's simply a matter of style. The compiler can almost certainly optimise things to the point where all versions are roughly equal in performance. But generally [not always!] pattern matching makes it clearer exactly what you're trying to achieve.
(It's annoyingly easy to write an if-expression where you get the two alternatives the wrong way around. You'd think it would be a rare mistake, but it's surprisingly common. With a pattern match, there's little chance of making that specific mistake, although there's still plenty of other things to screw up.)
Each call to null, head and tail entails a pattern match. But the 1st version in your question does just one pattern match, and reuses its results through named components of the pattern.
Just for that, it is better. But it is also more visually apparent, more readable.
Pattern matching is better than a string of if-then-else statements for (at least) the following reasons:
it is more declarative
it interacts well with sum-types
Pattern matching helps to reduce the amount of "accidental complexity" in your code - that is, code that is really more about implementation details rather than the essential logic of your program.
In most other languages, when the compiler/run-time sees a string of if-then-else statements it has no choice but to test the conditions in exactly the order the programmer specified them. But pattern matching encourages the programmer to focus more on describing what should happen versus how things should be performed. Due to purity and immutability of values in Haskell the compiler can consider the collection of patterns as a whole and decide how best to implement them.
An analogy would be C's switch statement. If you dump the assembly code for various switch statements you will see that sometimes the compiler will generate a chain/tree of comparisons and in other cases it will generate a jump table. The programmer uses the same syntax in both cases - the compiler chooses the implementation based on what the comparison values are. If they form a contiguous block of values the jump table method is used, otherwise a comparison tree is used. And this separation of concerns allows the compiler to implement even more strategies in the future if other patterns among the comparison values are detected.
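The point about sum types can be sketched with a small made-up example; Shape and area are hypothetical names, not taken from the question. Each equation handles one constructor and names its fields directly, and nothing in the source dictates how the compiler dispatches between them.

```haskell
-- A hypothetical sum type with two alternatives.
data Shape
  = Circle Double            -- radius
  | Rectangle Double Double  -- width and height

-- One equation per constructor; the fields are bound by the pattern itself.
area :: Shape -> Double
area (Circle r)      = pi * r * r
area (Rectangle w h) = w * h
```

Adding a third constructor later would make the match incomplete, which GHC can report when warnings are enabled.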
I'm trying to learn me a Haskell (for great good), and one of the many different things I'm doing is trying to tackle some Project Euler problems as I'm going along to test my mettle.
In doing some of the Fibonacci based problems, I stumbled on and started playing around with the recursive infinite list version of the Fibonacci sequence:
fibs = 1 : 2 : zipWith (+) fibs (tail fibs)
For one of the PE problems, I needed to extract the subsequence of even Fibonacci numbers less than 4,000,000. I decided to do this with a list comprehension, and in my playing around with the code I stumbled on to something that I don't quite understand; I'm assuming that it's my weak grasp on Haskell's lazy evaluation scheme that's complicating things.
The following comprehension works just fine:
[x | x <- takeWhile (<= 4000000) fibs, even x]
The next comprehension spins forever. I had the output printed to stdout, and while it stops producing elements at the correct place, it seems to keep evaluating the recursively defined list forever after reaching the capped value: the last item in the list is printed with a comma, but no further list items or closing square bracket follow:
[x | x <- fibs, x <= 4000000, even x]
So what exactly is the secret sauce used by the various functions that do play well with infinite lists?
The function takeWhile keeps taking elements of the input list until it reaches the first element that doesn't satisfy the predicate, and then it stops. As long as there is at least one element that doesn't satisfy the predicate, takeWhile turns infinite lists into finite lists.
Your first expression says
Keep taking elements of this infinite list until you find one greater than 4,000,000 and then stop. Include each element in the output if it's even.
The second expression says
Keep taking elements of this infinite list. Include each element in the output if it's less than or equal to 4,000,000 and it's even.
When you observe an output that hangs forever, the function is busily generating more Fibonacci numbers and checking to see if they're less than or equal to 4,000,000. None of them are, which is why nothing is printed to the screen, but the function has no way of knowing that it's not going to encounter a small number a bit further down the list, so it has to keep checking.
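The difference becomes easier to see with a hand-written takeWhile. The sketch below uses the name myTakeWhile to avoid clashing with the Prelude and is only meant to show the shape of the function, not the library's actual source: the otherwise branch returns [] and drops the rest of the input, which a filter or a comprehension guard never does.

```haskell
fibs :: [Integer]
fibs = 1 : 2 : zipWith (+) fibs (tail fibs)

-- Stops at the first element that fails the predicate and never looks further.
myTakeWhile :: (a -> Bool) -> [a] -> [a]
myTakeWhile _ [] = []
myTakeWhile p (x : xs)
  | p x       = x : myTakeWhile p xs
  | otherwise = []

-- Finite: the source list ends once a Fibonacci number exceeds the cap.
evenFibs :: [Integer]
evenFibs = [x | x <- myTakeWhile (<= 4000000) fibs, even x]

-- Same elements, but the list itself never ends; only a bounded
-- consumer such as take can stop the evaluation.
evenFibsUnbounded :: [Integer]
evenFibsUnbounded = [x | x <- fibs, x <= 4000000, even x]
```

For instance, take 11 evenFibsUnbounded terminates because exactly eleven elements qualify, whereas length evenFibsUnbounded would run forever.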
I'm playing around with Haskell at the moment and thus stumbled upon the list comprehension feature.
Naturally, I would have used a closure to do this kind of thing:
Prelude> [x|x<-[1..7],x>4] -- list comprehension
[5,6,7]
Prelude> filter (\x->x>4) [1..7] -- closure
[5,6,7]
I still don't feel this language, so which way would a Haskell programmer go?
What are the differences between these two solutions?
Idiomatic Haskell would be filter (> 4) [1..7]
Note that you are not capturing any of the lexical scope in your closure, and are instead making use of a sectioned operator. That is to say, you want a partial application of >, which operator sections give you immediately. List comprehensions are sometimes attractive, but the usual perception is that they do not scale as nicely as the usual suite of higher order functions ("scale" with respect to more complex compositions). That kind of stylistic decision is, of course, largely subjective, so YMMV.
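Side by side, the three spellings look like this; the names are invented for illustration. The section (> 4) is shorthand for the lambda \x -> x > 4.

```haskell
-- Operator section: (> 4) is the function that tests whether its argument exceeds 4.
viaSection :: [Int]
viaSection = filter (> 4) [1 .. 7]

-- The explicit lambda from the question.
viaLambda :: [Int]
viaLambda = filter (\x -> x > 4) [1 .. 7]

-- The list comprehension from the question.
viaComprehension :: [Int]
viaComprehension = [x | x <- [1 .. 7], x > 4]
```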
List comprehensions come in handy if the elements are somewhat complex and one needs to filter them by pattern matching, or the mapping part feels too complex for a lambda abstraction, which should be short (or so I feel), or if one has to deal with nested lists. In the latter case, a list comprehension is often more readable than the alternatives (to me, anyway).
For example something like:
[ (f b, (g . fst) a) | (Just a, Right bs) <- somelist, a `notElem` bs, (_, b) <- bs ]
But for your example, the section (>4) is a really nice way to write (\a -> a > 4), and because you use it only for filtering, most people would prefer Anthony's solution.