Why does function concat use foldr? Why not foldl' - haskell

Most resources recommend using foldl', so what is the reason for using foldr in concat instead of foldl'?

EDIT: I talk about laziness and productivity in this answer, and in my excitement I forgot a very important point that jpmariner focuses on in their answer: left-associating (++) is quadratic time!
foldl' is appropriate when your accumulator is a strict type, like most small types such as Int, or even large spine-strict data structures like Data.Map. If the accumulator is strict, then the entire list must be consumed before any output can be given. foldl' uses tail recursion to avoid blowing up the stack in these cases, but foldr doesn't and will perform badly. On the other hand, foldl' must consume the entire list in this way.
foldl f z [] = z
foldl f z [1] = f z 1
foldl f z [1,2] = f (f z 1) 2
foldl f z [1,2,3] = f (f (f z 1) 2) 3
The final element of the list is required to evaluate the outermost application, so there is no way to partially consume the list. If we expand this with (++), we will see:
foldl (++) [] [[1,2],[3,4],[5,6]]
= (([] ++ [1,2]) ++ [3,4]) ++ [5,6]
^^
= ([1,2] ++ [3,4]) ++ [5,6]
= ((1 : [2]) ++ [3,4]) ++ [5,6]
^^
= (1 : ([2] ++ [3,4])) ++ [5,6]
^^
= 1 : (([2] ++ [3,4]) ++ [5,6])
(I admit this looks a little magical if you don't have a good feel for cons lists; it's worth getting dirty with the details though)
See how we have to evaluate every (++) (marked with ^^ when they are evaluated) on the way down before the 1 bubbles out to the front? The 1 is "hiding" under function applications until then.
foldr, on the other hand, is good for non-strict accumulators like lists, because it allows the accumulator to yield information before the entire list is consumed, which can bring many classically linear-space algorithms down to constant space! This also means that if your list is infinite, foldr is your only choice, unless your goal is to heat your room using your CPU.
foldr f z [] = z
foldr f z [1] = f 1 z
foldr f z [1,2] = f 1 (f 2 z)
foldr f z [1,2,3] = f 1 (f 2 (f 3 z))
foldr f z [1..] = f 1 (f 2 (f 3 (f 4 (f 5 ...
We have no trouble expressing the outermost applications without having to see the entire list. Expanding foldr the same way we did foldl:
foldr (++) z [[1,2],[3,4],[5,6]]
= [1,2] ++ ([3,4] ++ ([5,6] ++ []))
= (1 : [2]) ++ ([3,4] ++ ([5,6] ++ []))
^^
= 1 : ([2] ++ ([3,4] ++ ([5,6] ++ [])))
1 is yielded immediately without having to evaluate any of the (++)s but the first one. Because none of those (++)s are evaluated, and Haskell is lazy, they don't even have to be generated until more of the output list is consumed, meaning concat can run in constant space for an expression like this:
concat [ [1..n] | n <- [1..] ]
which in a strict language would require intermediate lists of arbitrary length.
If these reductions look a little too magical, and if you want to go deeper, I suggest examining the source of (++) and doing some simple manual reductions against its definition to get a feel for it. (Just remember [1,2,3,4] is notation for 1 : (2 : (3 : (4 : [])))).
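For reference, this is essentially how the Prelude defines (++) (GHC's actual source adds fusion-related plumbing, but this is the operational core):
(++) :: [a] -> [a] -> [a]
[]     ++ ys = ys
(x:xs) ++ ys = x : (xs ++ ys)
Note how the left operand is rebuilt cons by cons, while the right operand is shared as-is.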
In general, the following seems to be a strong rule of thumb for efficiency: use foldl' when your accumulator is a strict data structure, and foldr when it's not. And if you see a friend using regular foldl and don't stop them, what kind of friend are you?
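To make that rule of thumb concrete, here is a small self-contained sketch (the helper names sumInts and squares are mine, not from any library):
import Data.List (foldl')

-- Strict accumulator: foldl' keeps the running total evaluated,
-- so the fold runs in constant stack space.
sumInts :: [Int] -> Int
sumInts = foldl' (+) 0

-- Lazy, list-shaped accumulator: foldr yields output incrementally,
-- so this works even on infinite input.
squares :: [Int] -> [Int]
squares = foldr (\x acc -> x * x : acc) []

main :: IO ()
main = do
  print (sumInts [1 .. 1000000])   -- 500000500000
  print (take 5 (squares [1 ..]))  -- [1,4,9,16,25]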

what is the reason for using foldr in concat instead of foldl'?
What if the result gets fully evaluated?
If you consider [1,2,3] ++ [6,7,8] within an imperative programming mindset, all you have to do is redirect the next pointer at node 3 towards node 6, assuming of course you may alter your left side operand.
This being Haskell, you may NOT alter your left side operand, unless the optimizer is able to prove that ++ is the sole user of its left side operand.
Short of such a proof, other Haskell expressions pointing to node 1 have every right to assume that node 1 is forever at the beginning of a list of length 3. In Haskell, the properties of a pure expression cannot be altered during its lifetime.
So, in the general case, operator ++ has to do its job by duplicating its left side operand, and the duplicate of node 3 may then be set to point to node 6. On the other hand, the right side operand can be taken as is.
So if you fold the concat expression starting from the right, each component of the concatenation must be duplicated exactly once. But if you fold the expression starting from the left, you are facing a lot of repetitive duplication work.
Let's try to check that quantitatively. To ensure that no optimizer will get in the way by proving anything, we'll just use the ghci interpreter. Its strong point is interactivity not optimization.
So let's introduce the various candidates to ghci, and switch statistics mode on:
$ ghci
λ>
λ> import qualified Data.List as L
λ> myConcat0 = L.foldr (++) []
λ> myConcat1 = L.foldl (++) []
λ> myConcat2 = L.foldl' (++) []
λ>
λ> :set +s
λ>
We'll force full evaluation by using lists of numbers and printing their sum.
First, let's get baseline performance by folding from the right:
λ>
λ> sum $ concat [ [x] | x <- [1..10000::Integer] ]
50005000
(0.01 secs, 3,513,104 bytes)
λ>
λ> sum $ myConcat0 [ [x] | x <- [1..10000::Integer] ]
50005000
(0.01 secs, 3,513,144 bytes)
λ>
Second, let's fold from the left, to see whether that improves matters or not.
λ>
λ> sum $ myConcat1 [ [x] | x <- [1..10000::Integer] ]
50005000
(1.26 secs, 4,296,646,240 bytes)
λ>
λ> sum $ myConcat2 [ [x] | x <- [1..10000::Integer] ]
50005000
(1.28 secs, 4,295,918,560 bytes)
λ>
So folding from the left allocates much more transient memory and takes much more time, probably because of this repetitive duplication work.
As a last check, let's double the problem size:
λ>
λ> sum $ myConcat2 [ [x] | x <- [1..20000::Integer] ]
200010000
(5.91 secs, 17,514,447,616 bytes)
λ>
We see that doubling the problem size causes the resource consumptions to get multiplied by about 4. Folding from the left has quadratic cost in the case of concat.
Looking at the excellent answer by luqui, we see that both concerns:
the need to be able to access the beginning of the result list lazily
the need to avoid quadratic cost for full evaluation
point in the same direction, namely in favor of folding from the right.
Hence the Haskell library's concat function uses foldr.
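Indeed, the Haskell Report defines concat exactly this way (GHC's real definition is equivalent, modulo list-fusion plumbing):
concat :: [[a]] -> [a]
concat xss = foldr (++) [] xss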
Addendum:
After running some tests using GHC v8.6.5 with the -O3 option instead of ghci, it appears that my preconceived idea of the optimizer interfering with the measurements was erroneous.
Even with -O3, for a problem size of 20,000, the foldr-based concat function is about 500 times faster than the foldl'-based one.
So either the optimizer fails to prove that it is OK to alter/reuse the left operand, or it just does not try at all.

Related

Lazy Catalan Numbers in Haskell

How might I go about efficiently generating an infinite list of Catalan numbers? What I have now works reasonably quickly, but it seems to me that there should be a better way.
c 1 = [1]
c n = sum (zipWith (*) xs (reverse xs)) : xs
  where xs = c (n-1)
catalan = map (head . c) [1..]
I made an attempt at using fix instead, but the lambda isn't lazy enough for the computation to terminate:
catalan = fix (\xs -> xs ++ [zipWith (*) xs (reverse xs)])
I realize (++) isn't ideal.
Does such a better way exist? Can that function be made sufficiently lazy? There's an explicit formula for the nth, I know, but I'd rather avoid it.
The Catalan numbers [wiki] can be defined inductively with:
C(0) = 1 and C(n+1) = (4n+2) × C(n) / (n+2).
So we can implement this as:
catalan :: Integral i => [i]
catalan = xs
  where xs = 1 : zipWith f [0..] xs
        f n cn = div ((4*n+2) * cn) (n+2)
For example:
Prelude> take 10 catalan
[1,1,2,5,14,42,132,429,1430,4862]
I'm guessing you're looking for a lazy, infinite, self-referential list of all the Catalan numbers using one of the basic recurrence relations. That's a common thing to do with the Fibonacci numbers after all. But it would help to specify the recurrence relation you mean, if you want answers to your specific question. I'm guessing this is the one you mean:
cat :: Integer -> Integer
cat 1 = 1
cat n = sum [ cat i * cat (n - i) | i <- [1 .. n - 1] ]
If so, the conversion to a self-referential form looks like this:
import Data.List (inits)
cats :: [Integer]
cats = 1 : [ sum (zipWith (*) pre (reverse pre)) | pre <- tail (inits cats) ]
This is quite a lot more complex than the Fibonacci examples, because the recurrence refers to all previous entries in the list, not just a fixed small number of the most recent. Using inits from Data.List is the easiest way to get the prefix at each position. I used tail there because its first result is the empty list, and that's not helpful here. The rest is a straightforward rewrite of the recurrence relation that I don't have much to say about. Except...
It's going to perform pretty badly. I mean, it's better than the exponential recursive calls of my cat function, but there's a lot of list manipulation going on that's allocating and then throwing away a lot of memory cells. That recurrence relation is not a very good fit for the recursive structure of the list data type. You can explore a lot of ways to make it more efficient, but they'll all be pretty bad in the end. For this particular case, going to a closed-form solution is the way to go if you want performance.
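That said, it does produce the right numbers; a quick check in GHCi:
> take 10 cats
[1,1,2,5,14,42,132,429,1430,4862]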
Apparently, what you wanted is
> cats = 1 : unfoldr (\ fx -> let x = sum $ zipWith (*) fx cats in Just (x, x:fx)) [1]
> take 10 cats
[1,1,2,5,14,42,132,429,1430,4862]
This avoids the repeated reversing of the prefixes (as in the linked answer), by unfolding with the state being a reversed prefix while consing onto the state as well as producing the next element.
We don't have to maintain the non-reversed prefix, since zipping the reversed prefix with the catalans list itself takes care of the prefix's length.
You did say you wanted to avoid the direct formula.
The Catalan numbers are best understood by their generating function, which satisfies the relation
f(t) = 1 + t f(t)^2
This can be expressed in Haskell as
f :: [Int]
f = 1 : convolve f f
for a suitable definition of convolve. It is helpful to factor out convolve, for many other counting problems take this form. For example, a generalized Catalan number enumerates ternary trees, and its generating function satisfies the relation
g(t) = 1 + t g(t)^3
which can be expressed in Haskell as
g :: [Int]
g = 1 : convolve g (convolve g g)
convolve can be written using Haskell primitives as
convolve :: [Int] -> [Int] -> [Int]
convolve xs = map (sum . zipWith (*) xs) . tail . scanl (flip (:)) []
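Trying both out (the second sequence, which counts ternary trees, is OEIS A001764):
> take 8 f
[1,1,2,5,14,42,132,429]
> take 8 g
[1,1,3,12,55,273,1428,7752]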
For these two examples and many other special cases, there are formulas that are quicker to evaluate. convolve is however more general, and cognitively more efficient. In a typical scenario, one has understood a counting problem in terms of a polynomial relation on its generating function, and now wants to compute some numbers in order to look them up in The On-Line Encyclopedia of Integer Sequences. One wants to get in and out, indifferent to complexity. What language will be least fuss?
If one has seen the iconic Haskell definition for the Fibonacci numbers
fibs :: [Int]
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
then one imagines there must be a similar idiom for products of generating functions. That search is what brought me here.

How does GHC know how to cache one function but not the others?

I'm reading Learn You a Haskell (loving it so far) and it teaches how to implement elem in terms of foldl, using a lambda. The lambda solution seemed a bit ugly to me so I tried to think of alternative implementations (all using foldl):
import qualified Data.Set as Set
import qualified Data.List as List
-- LYAH implementation
elem1 :: (Eq a) => a -> [a] -> Bool
y `elem1` ys =
  foldl (\acc x -> if x == y then True else acc) False ys

-- When I thought about stripping duplicates from a list
-- the first thing that came to my mind was the mathematical set
elem2 :: (Eq a) => a -> [a] -> Bool
y `elem2` ys =
  head $ Set.toList $ Set.fromList $ filter (==True) $ map (==y) ys

-- Then I discovered `nub` which seems to be highly optimized:
elem3 :: (Eq a) => a -> [a] -> Bool
y `elem3` ys =
  head $ List.nub $ filter (==True) $ map (==y) ys
I loaded these functions in GHCi and did :set +s and then evaluated a small benchmark:
3 `elem1` [1..1000000] -- => (0.24 secs, 160,075,192 bytes)
3 `elem2` [1..1000000] -- => (0.51 secs, 168,078,424 bytes)
3 `elem3` [1..1000000] -- => (0.01 secs, 77,272 bytes)
I then tried to do the same on a (much) bigger list:
3 `elem3` [1..10000000000000000000000000000000000000000000000000000000000000000000000000]
elem1 and elem2 took a very long time, while elem3 was instantaneous (almost identical to the first benchmark).
I think this is because GHC knows that 3 is a member of [1..1000000], and the big number I used in the second benchmark is bigger than 1000000, hence 3 is also a member of [1..bigNumber] and GHC doesn't have to compute the expression at all.
But how is it able to automatically cache (or memoize, a term that Land of Lisp taught me) elem3 but not the two other ones?
Short answer: this has nothing to do with caching; the first two implementations force Haskell to iterate over all the elements.
No, this is because foldl works left to right, and it will keep iterating over the list until the list is exhausted.
Therefore you had better use foldr: from the moment it finds a 3 in the list, it will cut off the search.
This is because foldr is defined as:
foldr f z [x1, x2, x3] = f x1 (f x2 (f x3 z))
whereas foldl is implemented as:
foldl f z [x1, x2, x3] = f (f (f z x1) x2) x3
Note that the outermost f binds with x3, so even if, due to laziness, you do not evaluate the first operand, you still need to iterate to the end of the list.
If we implement the foldl and foldr version, we get:
y `elem1l` ys = foldl (\acc x -> if x == y then True else acc) False ys
y `elem1r` ys = foldr (\x acc -> if x == y then True else acc) False ys
We then get:
Prelude> 3 `elem1l` [1..1000000]
True
(0.25 secs, 112,067,000 bytes)
Prelude> 3 `elem1r` [1..1000000]
True
(0.03 secs, 68,128 bytes)
Stripping the duplicates from the list will not improve the efficiency. What improves the efficiency here is that you use map, which works left to right. Note furthermore that nub works lazily, so nub is a no-op here: since you are only interested in the head, Haskell does not need to perform membership checks against the already-seen elements.
The performance is almost identical:
Prelude List> 3 `elem3` [1..1000000]
True
(0.03 secs, 68,296 bytes)
In case you work with a Set, however, you do not enforce uniqueness lazily: you first fetch all the elements into the set, so again you will iterate over all the elements instead of cutting off the search after the first hit.
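A quick way to see this difference (using the question's imports) is to feed both an infinite list; nub streams, while Set.fromList must consume everything before producing anything:
Prelude> head $ List.nub [1..]
1
Prelude> head $ Set.toList $ Set.fromList [1..]
^CInterrupted.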
Explanation
foldl starts with the leftmost element of the list, applies the computation, and does so again recursively with the result and the next value of the list, and so on.
foldl f z [x1, x2, ..., xn] == (...((z `f` x1) `f` x2) `f`...) `f` xn
So in order to produce the result, it has to traverse all the list.
Conversely, in your function elem3, as everything is lazy, nothing gets computed at all until you call head.
But in order to compute that value, you need just the first element of the (filtered) list, so you only have to go as far as the first 3 encountered in your big list, which is very soon; the whole list is not traversed. If you asked for the 1000000th element, elem3 would probably perform as badly as the other ones.
Laziness
Laziness ensures that your language is always composable: breaking a function into subfunctions does not change what is done.
What you are seeing can lead to a space leak, which is really about how control flow works in a lazy language. Both in a strict and in a lazy language, your code decides what gets evaluated, but with a subtle difference:
In a strict language, the builder of the function chooses, as it forces evaluation of its arguments: whoever is called is in charge.
In a lazy language, the consumer of the function chooses: whoever calls is in charge. It may choose to evaluate only the first element (by calling head), or every other element, all provided its own caller chooses to evaluate its own computation as well. There is a whole chain of command deciding what to do.
In that reading, your foldl-based elem function uses that "inversion of control" in an essential way: elem gets asked to produce a value; foldl goes deep inside the list; if the element is y, then it returns the trivial computation True; if not, it forwards the request to the computation acc. In other words, what you read as the values acc, x, or even True are really placeholders for computations, which you receive and yield back. And indeed, acc may be some unbelievably complex computation (or a divergent one like undefined): as long as you transfer control to the computation True, your caller will never see the existence of acc.
foldr vs foldl vs foldl' vs speed
As suggested in another answer, foldr probably best matches your intent of how to traverse the list, and will shield you from space leaks (foldl' will prevent space leaks as well if you really want to traverse the other way; plain foldl can lead to a buildup of complex computations, which, incidentally, can be very useful for circular computations).
But the speed issue is really an algorithmic one. There might be a better data structure for set membership, but only if you know beforehand that you have a certain pattern of usage.
For instance, it might be useful to pay some upfront cost to build a Set and then have fast membership queries, but that is only worthwhile if you know you will have such a pattern: a few sets and lots of queries against them, as in the sketch below. Other data structures are optimal for other patterns, and it's interesting to note that, from an API/specification/interface point of view, they usually look the same to the consumer. That's a general phenomenon in any language, and why many people love abstract data types/modules in programming.
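A sketch of that trade-off (the names table and query are hypothetical; the pattern assumes one collection and many queries):
import qualified Data.Set as Set

-- Pay the O(n log n) build cost once...
table :: Set.Set Int
table = Set.fromList [1 .. 1000000]

-- ...then every membership query costs O(log n) rather than O(n).
query :: Int -> Bool
query x = x `Set.member` table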
Using foldr and expecting it to be faster really encodes the assumption that, given your static knowledge of your future access pattern, the values whose membership you are likely to test will sit near the beginning of the list. Using foldl would be fine if you expected your values to be near the end of it.
Note that using foldl you may have to traverse the spine of the entire list, but you do not construct the values themselves until you need them (for instance, to test for equality), as long as you have not found the element you are searching for.

Find the K'th element of a list using foldr and function application ($) explanation

I'm currently at 6th chapter of Learn you a Haskell... Just recently started working my way on 99 questions.
The 3rd problem is to find the K'th element of a list. I've implemented it using take and zip.
The problem I have is understanding the alternate solution offered:
elementAt''' xs n = head $ foldr ($) xs
                         $ replicate (n - 1) tail
I'm "almost there" but I don't quite get it. I know the definition of the $ but.. Can you please explain to me the order of the execution of the above code. Also, is this often used as a solution to various problems, is this idiomatic or just... acrobatic ?
If you expand the definition of foldr
foldr f z (x1:x2:x3:...:[]) = x1 `f` x2 `f` x3 `f`... `f` z
you see that elementAt''' becomes
elementAt''' xs n = head (tail $ tail $ ... $ tail $ xs)
(note: it should be replicate n tail instead of replicate (n-1) tail if indexing is 0-based).
So you apply tail to xs the appropriate number of times, which gives the same result as drop (n-1) xs if xs is long enough but raises an error if it's too short, and then take the head of the resulting list (if xs is too short, head would likewise raise an error in the drop (n-1) version).
What it does is thus
discard the first element of the list
discard the first element of the resulting list (n-1 times altogether)
take the head of the resulting list
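For comparison, a direct version with the same semantics (1-based indexing, erroring on lists that are too short) could be written as:
elementAt :: [a] -> Int -> a
elementAt xs n = head (drop (n - 1) xs)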
Also, is this often used as a solution to various problems, is this idiomatic or just... acrobatic
In this case, just acrobatic. The foldr has to expand the full application before it can work back to the front taking the tails, thus it's less efficient than the straightforward traversal.
Break it down into the two major steps. First, the function replicates tail (n-1) times. So you end up with something like
elementAt''' xs n = head $ foldr ($) xs [tail, tail, tail, ..., tail]
Now, the definition of foldr on a list expands to something like this
foldr f x [y1, y2, y3, ..., yn] = (y1 `f` (y2 `f` (... (yn `f` x)))...)
So, that fold will expand to (replace f with $ and all the ys with tail)
foldr ($) xs [tail, tail, tail, ..., tail]
= (tail $ (tail $ (tail $ ... (tail xs))) ... )
{- Since $ is right associative anyway -}
= tail $ tail $ tail $ tail $ ... $ tail xs
where there are (n-1) calls to tail composed together. After taking n-1 tails, it just extracts the first element of the remaining list and gives that back.
Another way to write it that makes the composition more explicit (in my opinion) would be like this:
elementAt n = head . (foldr (.) id $ replicate (n-1) tail)
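For example, with this 1-based version:
> elementAt 3 [10,20,30,40]
30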

The performance of (++) with lazy evaluation

I have been wondering about this a lot, and I haven't found satisfying answers.
Why is (++) "expensive"? Under lazy evaluation, we won't evaluate an expression like
xs ++ ys
before necessary, and even then, we will only evaluate the part we need, when we need them.
Can someone explain what I'm missing?
If you access the whole resulting list, lazy evaluation won't save any computation. It will only delay it until you need each particular element, but at the end, you have to compute the same thing.
If you traverse the concatenated list xs ++ ys, accessing each element of the first part (xs) adds a little constant overhead, checking if xs was spent or not.
So, it makes a big difference if you associate ++ to the left or to the right.
If you associate n lists of length k to the left like (..(xs1 ++ xs2) ... ) ++ xsn then accessing each of the first k elements will take O(n) time, accessing each of the next k ones will take O(n-1) etc. So traversing the whole list will take O(k n^2). You can check that
sum $ foldl (++) [] (replicate 100000 [1])
takes really long.
If you associate n lists of length k to the right like xs1 ++ ( ..(xsn_1 ++ xsn) .. ) then you'll get only constant overhead for each element, so traversing the whole list will be only O(k n). You can check that
sum $ foldr (++) [] (replicate 100000 [1])
is quite reasonable.
Edit: This is just the magic hidden behind ShowS. If you convert each string xs to showString xs :: String -> String (showString is just an alias for (++)) and compose these functions, then no matter how you associate their composition, at the end they will be applied from right to left - just what we need to get the linear time complexity. (This is simply because (f . g) x is f (g x).)
You can check that both
length $ (foldl (.) id (replicate 1000000 (showString "x"))) ""
and
length $ (foldr (.) id (replicate 1000000 (showString "x"))) ""
run in a reasonable time (foldr is a bit faster because it has less overhead when composing functions from the right, but both are linear in the number of elements).
It's not too expensive on its own, the problem arises when you start combining a whole lot of ++ from left to right: such a chain is evaluated like
  ( ([1,2] ++ [3,4]) ++ [5,6] ) ++ [7,8]
≡ let a = ([1,2] ++ [3,4]) ++ [5,6]
        ≡ let b = [1,2] ++ [3,4]
                ≡ let c = [1,2]
                  in head c : tail c ++ [3,4]
                ≡ 1 : [2] ++ [3,4]
                ≡ 1 : 2 : [] ++ [3,4]
                ≡ 1 : 2 : [3,4]
                ≡ [1,2,3,4]
          in head b : tail b ++ [5,6]
        ≡ 1 : [2,3,4] ++ [5,6]
        ≡ 1:2 : [3,4] ++ [5,6]
        ≡ 1:2:3 : [4] ++ [5,6]
        ≡ 1:2:3:4 : [] ++ [5,6]
        ≡ 1:2:3:4:[5,6]
        ≡ [1,2,3,4,5,6]
  in head a : tail a ++ [7,8]
≡ 1 : [2,3,4,5,6] ++ [7,8]
≡ 1:2 : [3,4,5,6] ++ [7,8]
≡ 1:2:3 : [4,5,6] ++ [7,8]
≡ 1:2:3:4 : [5,6] ++ [7,8]
≡ 1:2:3:4:5 : [6] ++ [7,8]
≡ 1:2:3:4:5:6 : [] ++ [7,8]
≡ 1:2:3:4:5:6 : [7,8]
≡ [1,2,3,4,5,6,7,8]
where you clearly see the quadratic complexity. Even if you only want to evaluate up to the n-th element, you still have to dig your way through all those lets. That's why ++ is infixr, for [1,2] ++ ( [3,4] ++ ([5,6] ++ [7,8]) ) is actually much more efficient. But if you're not careful while designing, say, a simple serialiser, you may easily end up with a chain like the one above. This is the main reason why beginners are warned about ++.
That aside, Prelude.++ is slow compared to e.g. ByteString operations for the simple reason that it works by traversing linked lists, which always have suboptimal cache usage etc., but that's not as problematic; it prevents you from achieving C-like performance, but properly written programs using only plain lists and ++ can still easily rival similar programs written in e.g. Python.
I would like to add one thing or two to Petr's answer.
As he pointed out, repeatedly appending lists at the beginning is quite cheap, while appending at the end is not. This is true as long as you use Haskell's lists.
However, there are certain circumstances in which you HAVE TO append to the end (e.g., you are building a string to be printed out). With regular lists you have to deal with the quadratic complexity mentioned in his answer, but there's a much better solution in these cases: difference lists (see also my question on the topic).
Long story short, by describing lists as compositions of functions instead of concatenations of shorter lists, you are able to append lists or individual elements at the beginning or at the end of your difference list by composing functions, in constant time. Once you're done, you can extract a regular list in time linear in the number of elements.
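A minimal sketch of the idea (the dlist package provides a polished version of exactly this):
-- A difference list represents a list as "the function that prepends it".
newtype DList a = DList ([a] -> [a])

fromList :: [a] -> DList a
fromList xs = DList (xs ++)

toList :: DList a -> [a]
toList (DList f) = f []

-- Appending is function composition: O(1), however you associate it.
append :: DList a -> DList a -> DList a
append (DList f) (DList g) = DList (f . g)

-- Adding a single element at the end is O(1) too.
snoc :: DList a -> a -> DList a
snoc (DList f) x = DList (f . (x :))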

A way to measure performance

Given Exercise 14 from 99 Haskell Problems:
(*) Duplicate the elements of a list.
Eg.:
*Main> dupli''' [1..10]
[1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10]
I've implemented 4 solutions:
{-- my first attempt --}
dupli :: [a] -> [a]
dupli [] = []
dupli (x:xs) = replicate 2 x ++ dupli xs
{-- using concatMap and replicate --}
dupli' :: [a] -> [a]
dupli' xs = concatMap (replicate 2) xs
{-- using foldl --}
dupli'' :: [a] -> [a]
dupli'' xs = foldl (\acc x -> acc ++ [x,x]) [] xs
{-- using foldl 2 --}
dupli''' :: [a] -> [a]
dupli''' xs = reverse $ foldl (\acc x -> x:x:acc) [] xs
Still, I don't know how to really measure performance.
So, what's the recommended function (from the above list) in terms of performance?
Any suggestions?
These all seem more complicated (and/or less efficient) than they need to be. Why not just this:
dupli [] = []
dupli (x:xs) = x:x:(dupli xs)
Your last example is close to a good fold-based implementation, but you should use foldr, which will obviate the need to reverse the result:
dupli = foldr (\x xs -> x:x:xs) []
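A quick sanity check in GHCi:
*Main> dupli [1,2,3]
[1,1,2,2,3,3]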
As for measuring performance, the "empirical approach" is profiling. As Haskell programs grow in size, they can get fairly hard to reason about in terms of runtime and space complexity, and profiling is your best bet. Also, a crude but often effective empirical approach when gauging the relative complexity of two functions is to simply compare how long they each take on some sufficiently large input; e.g. time how long length $ dupli [1..1000000] takes and compare it to dupli'', etc.
But for a program this small it shouldn't be too hard to figure out the runtime complexity of the algorithm based on your knowledge of the data structure(s) in question--in this case, lists. Here's a tip: any time you use concatenation (x ++ y), the runtime complexity is O(length x). If concatenation is used inside of a recursive algorithm operating on all the elements of a list of size n, you will essentially have an O(n^2) algorithm. Both examples I gave, and your last example, are O(n), because the only operation used inside the recursive definition is (:), which is O(1).
As recommended, you can use the criterion package. A good description is http://www.serpentine.com/blog/2009/09/29/criterion-a-new-benchmarking-library-for-haskell/.
To summarize it here and adapt it to your question, here are the steps.
Install criterion with
cabal install criterion -fchart
And then add the following to your code
import Criterion.Main

l = [(1::Int)..1000]

main = defaultMain [ bench "1" $ nf dupli l
                   , bench "2" $ nf dupli' l
                   , bench "3" $ nf dupli'' l
                   , bench "4" $ nf dupli''' l
                   ]
You need the nf in order to force the evaluation of the whole result list. Otherwise you'll get just the thunk for the computation.
After that compile and run
ghc -O --make dupli.hs
./dupli -t png -k png
and you get pretty graphs of the running times of the different functions.
It turns out that dupli''' is the fastest of your functions, but the foldr version that pelotom listed beats everything.
