What are the space complexities of inits and tails? - haskell

TL;DR
After reading the passage about persistence in Okasaki's Purely Functional Data Structures and going over his illustrative examples about singly linked lists (which is how Haskell's lists are implemented), I was left wondering about the space complexities of Data.List's inits and tails...
It seems to me that
the space complexity of tails is linear in the length of its argument, and
the space complexity of inits is quadratic in the length of its argument,
but a simple benchmark indicates otherwise.
Rationale
With tails, the original list can be shared. Computing tails xs simply consists in walking along list xs and creating a new pointer to each element of that list; no need to recreate part of xs in memory.
In contrast, because each element of inits xs "ends in a different way", there can be no such sharing, and all the possible prefixes of xs must be recreated from scratch in memory.
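For reference, a tails in the spirit of the standard one (a sketch, not the exact Data.List source) makes this sharing visible: every suffix it returns is the original list itself, just starting one element further along.
tailsSketch :: [a] -> [[a]]
tailsSketch xs = xs : case xs of
  []      -> []
  _ : xs' -> tailsSketch xs'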
Benchmark
The simple benchmark below shows there isn't much of a difference in memory allocation between the two functions:
-- Main.hs
import Data.List (inits, tails)
main = do
  let intRange = [1 .. 10 ^ 4] :: [Int]
  print $ sum intRange
  print $ fInits intRange
  print $ fTails intRange
fInits :: [Int] -> Int
fInits = sum . map sum . inits
fTails :: [Int] -> Int
fTails = sum . map sum . tails
After compiling my Main.hs file with
ghc -prof -fprof-auto -O2 -rtsopts Main.hs
and running
./Main +RTS -p
the Main.prof file reports the following:
COST CENTRE MODULE %time %alloc
fInits Main 60.1 64.9
fTails Main 39.9 35.0
The memory allocated for fInits and that allocated for fTails have the same order of magnitude... Hum...
What is going on?
Are my conclusions about the space complexities of tails (linear) and inits (quadratic) correct?
If so, why does GHC allocate roughly as much memory for fInits as for fTails? Does list fusion have something to do with this?
Or is my benchmark flawed?

The implementation of inits in the Haskell Report, which is identical or nearly identical to the implementations used up to base 4.7.0.1 (GHC 7.8.3), is horribly slow. In particular, the fmap applications stack up recursively, so forcing successive elements of the result gets slower and slower.
inits [1,2,3,4] = [] : fmap (1:) (inits [2,3,4])
= [] : fmap (1:) ([] : fmap (2:) (inits [3,4]))
= [] : [1] : fmap (1:) (fmap (2:) ([] : fmap (3:) (inits [4])))
....
The simplest asymptotically optimal implementation, explored by Bertram Felgenhauer, is based on applying take with successively larger arguments:
inits xs = [] : go (1 :: Int) xs where
  go !l (_:ls) = take l xs : go (l+1) ls
  go _ [] = []
Felgenhauer was able to eke some extra performance out of this using a private, non-fusing version of take, but it was still not as fast as it could be.
The following very simple implementation is significantly faster in most cases:
inits = map reverse . scanl (flip (:)) []
In some weird corner cases (like map head . inits), this simple implementation is asymptotically non-optimal. I therefore wrote a version using the same technique, but based on Chris Okasaki's Banker's queues, that is both asymptotically optimal and nearly as fast. Joachim Breitner optimized it further, primarily by using a strict scanl' rather than the usual scanl, and this implementation got into GHC 7.8.4.
inits can now produce the spine of the result in O(n) time; forcing the entire result requires O(n^2) time because none of the conses can be shared among the different initial segments. If you want really absurdly fast inits and tails, your best bet is to use Data.Sequence; Louis Wasserman's implementation is magical. Another possibility would be to use Data.Vector—it presumably uses slicing for such things.
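For illustration, here is one way you might try the Data.Sequence versions (my own sketch; Data.Sequence exports its own inits and tails, which operate on Seq rather than on lists):
import qualified Data.Sequence as Seq
import Data.Foldable (toList)
-- convert a list to a Seq, take inits/tails there, and convert back,
-- so the results can be compared with the Data.List versions
seqInits :: [a] -> [[a]]
seqInits = map toList . toList . Seq.inits . Seq.fromList
seqTails :: [a] -> [[a]]
seqTails = map toList . toList . Seq.tails . Seq.fromList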

Related

Problems with enforcing strictness in haskell

If I want to pretend that Haskell is strict and I have an algorithm in mind that does not exploit laziness (so for instance it does not use infinite lists), what problems can occur if I use only strict data types and annotate every function that I use to be strict in its arguments? Will there be a performance penalty, and if so how bad? Can worse problems occur? I know it is dirty, pointless and ugly to mindlessly make every function and data type strict, and I do not intend to do so in practice, but I only want to understand if, by doing so, Haskell becomes strict by default.
Secondly, if I tone down the paranoia and only make the data structures strict: will I have to worry about space leaks brought about by a lazy implementation only when I am using some form of accumulation? In other words, assume that the algorithm would not exhibit a space leak in a strict language. Also assume that I implemented it in Haskell using only strict data structures, but was careful to use seq to evaluate any variable that was being passed on in a recursion, or used functions which internally are careful to do that (like foldl'). Would I avoid any space leaks? Remember that I am assuming that in a strict language the same algorithm does not lead to a space leak. So it is a question about the implementation difference between lazy and strict.
The reason I ask the second question is that, apart from cases where one is trying to take advantage of laziness by using a lazy data structure or a spine-strict one, all the examples of space leaks that I have seen until now only involve thunks building up in an accumulator because the function that was recursively called did not evaluate the accumulator before applying itself to it. I am aware that if one wants to take advantage of laziness then one has to be extra careful, but that caution would be needed in a strict-by-default language too.
Thank you.
Laziness speeding things up
You could be worse off. The naive definition of ++ is:
xs ++ ys = case xs of (x:xs) -> x : (xs ++ ys)
                      []     -> ys
Laziness makes this O(1), though it may also add O(1) processing to extract the cons. Without laziness, the ++ needs to be evaluated immediately, causing an O(n) operation. (If you've never seen the O(.) notation, it is something computer science has stolen from engineers: given a function f, the set O(f(n)) is the set of all algorithms which are eventually at-worst-proportional to f(n), where n is the number of bits of input fed to the function. [Formally, there exist k and N such that for all n > N the algorithm takes time less than k * f(n).]) So I'm saying that laziness makes the above operation O(1), or eventually constant-time, but adds a constant overhead to each extraction, whereas strictness makes the operation O(n), or eventually linear in the number of list elements, assuming that those elements have a fixed size.
There are some practical examples here, but the O(1) added processing time can potentially also "stack up" into an O(n) dependency, so the most obvious examples are O(n^2) both ways. Still there can be a difference in these examples. For example, one situation that doesn't work well is using a stack (last-in first-out, which is the style of Haskell lists) for a queue (first-in first-out).
So here's a quick library consisting of strict left-folds; I've used case statements so that each line can be pasted into GHCi (with a let):
data SL a = Nil | Cons a !(SL a) deriving (Ord, Eq, Show)
slfoldl' f acc xs = case xs of Nil -> acc; Cons x xs' -> let acc' = f acc x in acc' `seq` slfoldl' f acc' xs'
foldl' f acc xs = case xs of [] -> acc; x : xs' -> let acc' = f acc x in acc' `seq` foldl' f acc' xs'
slappend xs ys = case xs of Nil -> ys; Cons x xs' -> Cons x (slappend xs' ys)
sl_test n = foldr Cons Nil [1..n]
test n = [1..n]
sl_enqueue xs x = slappend xs (Cons x Nil)
sl_queue = slfoldl' sl_enqueue Nil
enqueue xs x = xs ++ [x]
queue = foldl' enqueue []
The trick here is that both queue and sl_queue follow the xs ++ [x] pattern to append an element to the end of the list, which takes a list and builds up an exact copy of that list. GHCi can then run some simple tests. First we make two items and force their thunks to prove that this operation itself is quite fast and not too prohibitively expensive in memory:
*Main> :set +s
*Main> let vec = test 10000; slvec = sl_test 10000
(0.02 secs, 0 bytes)
*Main> [foldl' (+) 0 vec, slfoldl' (+) 0 slvec]
[50005000,50005000]
(0.02 secs, 8604632 bytes)
Now we do the actual tests: summing the queue-versions:
*Main> slfoldl' (+) 0 $ sl_queue slvec
50005000
(22.67 secs, 13427484144 bytes)
*Main> foldl' (+) 0 $ queue vec
50005000
(1.90 secs, 4442813784 bytes)
Notice that both of these suck in terms of memory performance (the list-append stuff is still secretly O(n^2)): they eventually occupy gigabytes of space, but the strict version nevertheless occupies three times the space and takes ten times the time.
Sometimes the data structures should be changed
If you really want a strict queue, there are a couple of options. One is finger trees, as in Data.Sequence -- the viewr way of getting at the rightmost elements is a little complicated, but it works. However, that is a bit heavy, and one common solution is O(1) amortized: define the structure
data Queue x = Queue !(SL x) !(SL x)
where the SL terms are the strict stacks above. Define a strict reverse, let's call it slreverse, the obvious way, then consider:
enqueue :: Queue x -> x -> Queue x
enqueue (Queue xs ys) el = Queue xs (Cons el ys)
dequeue :: Queue x -> Maybe (x, Queue x)
dequeue (Queue Nil Nil) = Nothing
dequeue (Queue Nil ys) = dequeue (Queue (slreverse ys) Nil)
dequeue (Queue (Cons x xs) ys) = Just (x, Queue xs ys)
This is "amortized O(1)": each time that a dequeue reverses the list, costing O(k) steps for some k, we ensure that we are creating a structure which won't have to pay these costs for k more steps.
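For completeness, here is one obvious way to write slreverse, plus a tiny usage check (a sketch assuming the SL, Queue, enqueue and dequeue definitions above are in scope):
-- reverse the strict stack via an accumulator
slreverse :: SL a -> SL a
slreverse = go Nil
  where
    go acc Nil         = acc
    go acc (Cons x xs) = go (Cons x acc) xs
-- enqueue 1, 2, 3 and then dequeue once; the oldest element comes out first
demo :: Maybe Int
demo = fst <$> dequeue (enqueue (enqueue (enqueue (Queue Nil Nil) 1) 2) 3)
-- demo == Just 1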
Laziness hides errors
Another interesting point comes from the data/codata distinction, where data are finite structures traversed by recursion on subunits (that is, every data expression halts) while codata are the rest of the structures -- strict lists vs. streams. It turns out that when you properly make this distinction, there is no formal difference between strict data and lazy data -- the only formal difference between strict and lazy is how they handle terms within themselves which loop infinitely: strict will explore the loop and hence will also loop infinitely, while lazy will simply hand the infinite-loop onwards without descending into it.
As such you will find that slhead (Cons x undefined) will fail where head (x : undefined) succeeds. So you may "uncover" hidden infinite loops or bugs when you do this.
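To make that concrete, here is a hypothetical slhead for the strict SL type above (my own sketch, not from the original answer):
-- head for the strict list; because the tail field of Cons is strict,
-- merely evaluating Cons x undefined already forces the undefined
slhead :: SL a -> a
slhead (Cons x _) = x
slhead Nil        = error "slhead: empty list"
-- slhead (Cons 1 undefined)  -- throws Prelude.undefined
-- head   (1 : undefined)     -- returns 1; the lazy tail is never inspected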
Caution when making "everything strict"
Not everything necessarily becomes strict when you use strict data structures in your language: notice that I made a point above to define strict foldl, not foldl, for both lists and strict-lists. Common data structures in Haskell will be lazy -- lists, tuples, stuff in popular libraries -- and explicit calls to seq still help when building up a complicated expression.

Lazy Evaluation - Space Leak

Thinking Functionally with Haskell provides the following code for calculating the mean of a list of Floats.
mean :: [Float] -> Float
mean [] = 0
mean xs = sum xs / fromIntegral (length xs)
Prof. Richard Bird comments:
Now we are ready to see what is really wrong with mean: it has a space leak. Evaluating mean [1..1000] will cause the list to be expanded and retained in memory after summing because there is a second pointer to it, namely in the computation of its length.
If I understand this text correctly, he's saying that, if there was no pointer to xs in the length computation, then the xs memory could've been freed after calculating the sum?
My confusion is - if the xs is already in memory, isn't the length function simply going to use the same memory that's already being taken up?
I don't understand the space leak here.
The sum function does not need to keep the entire list in memory; it can look at an element at a time then forget it as it moves to the next element.
Because Haskell has lazy evaluation by default, if you have a function that creates a list, sum could consume it without the whole list ever being in memory (each time a new element is generated by the producing function, it would be consumed by sum then released).
The exact same thing happens with length.
On the other hand, the mean function feeds the list to both sum and length. So during the evaluation of sum, we need to keep the list in memory so it can be processed by length later.
[Update] to be clear, the list will be garbage collected eventually. The problem is that it stays longer than needed. In such a simple case it is not a problem, but in more complex functions that operate on infinite streams, this would most likely cause a memory leak.
Others have explained what the problem is. The cleanest solution is probably to use Gabriel Gonzalez's foldl package. Specifically, you'll want to use
import qualified Control.Foldl as L
import Control.Foldl (Fold)
import Control.Applicative
meanFold :: Fractional n => Fold n (Maybe n)
meanFold = f <$> L.sum <*> L.genericLength
  where
    f _ 0 = Nothing
    f s l = Just (s / l)
mean :: (Fractional n, Foldable f) => f n -> Maybe n
mean = L.fold meanFold
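A quick usage sketch (assuming the foldl package is installed); L.fold combines both component folds into a single strict pass over the input, so this runs in constant space:
main :: IO ()
main = print (mean ([1 .. 1000000] :: [Double]))
-- prints Just 500000.5 without ever holding the whole list in memory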
if there was no pointer to xs in the length computation, then the xs memory could've been freed after calculating the sum?
No, you're missing the important aspect of lazy evaluation here. You're right that length will use the same memory as was allocated during the sum call, the memory in which we had expanded the whole list.
But the point here is that allocating memory for the whole list shouldn't be necessary at all. If there was no length computation but only the sum, then memory could've been freed during calculating the sum. Notice that the list [1..1000] is lazily generated only when it is consumed, so in fact the mean [1..1000] should run in constant space.
You might write the function like the following, to get an idea of how to avoid such a space leak:
import Control.Arrow
mean [] = 0
mean xs = uncurry (/) $ foldr (\x -> (x+) *** (1+)) (0, 0) xs
-- or more verbosely
mean xs = let (sum, len) = foldr (\x (s, l) -> (x+s, 1+l)) (0, 0) xs
          in sum / len
which should traverse xs only once. However, Haskell is damn lazy - and computes the first tuple components only when evaluating sum and the second ones only later for len. We need to use some more tricks to actually force the evaluation:
{-# LANGUAGE BangPatterns #-}
import Data.List
mean [] = 0
mean xs = uncurry (/) $ foldl' (\(!s, !l) x -> (x+s, 1+l)) (0,0) xs
which really runs in constant space, as you can confirm in ghci by using :set +s.
The space leak is that the entire evaluated xs is held in memory for the length function. This is wasteful, as we aren't going to be using the actual values of the list after evaluating sum, nor do we need them all in memory at the same time, but Haskell doesn't know that.
A way to remove the space leak would be to recalculate the list each time:
sum [1..1000] / fromIntegral (length [1..1000])
Now the application can immediately start discarding values from the first list as it is evaluating sum, since it is not referenced anywhere else in the expression.
The same applies for length. The thunks it generates can be marked for deletion immediately, since nothing else could possibly want them evaluated further.
EDIT:
Implementation of sum in Prelude:
sum l = sum' l 0
  where
    sum' []     a = a
    sum' (x:xs) a = sum' xs (a+x)
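As written, sum' builds a chain of (a+x) thunks in its accumulator unless the compiler's strictness analysis removes them; a version that forces the accumulator explicitly could look like this (a sketch using foldl' from Data.List, not the actual Prelude source):
import Data.List (foldl')
-- force the running total at every step so no chain of (+) thunks builds up
sumStrict :: Num a => [a] -> a
sumStrict = foldl' (+) 0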

Efficient version of 'inits'

That is, inits "abc" == ["","a","ab","abc"]
There is a standard version of inits in Data.List, but below I have written a version myself:
myInits = f id
  where
    f start (l:ls) = (start []) : (f (start . (l:)) ls)
    f start []     = [(start [])]
Whilst my version is quite a bit simpler than the standard version, I suspect it's not as good for efficiency reasons. I suspect that when myInits l is fully evaluated it takes O(n^2) space. Take for example, myTails, an implementation of tails:
myTails a@(_:ls) = a : myTails ls
myTails [] = [[]]
Which is almost exactly the same as the standard version and I suspect achieves O(n) space by reusing the tails of the lists.
Could someone explain:
Why my version of inits is bad.
Why another approach is better (either the standard one in Data.List or your own).
Your myInits uses a technique called a difference list to make functions that build lists in linear time. I believe, but haven't checked, that the running time for completely evaluating myInits is O(n^2) requiring O(n^2) space. Fully evaluating inits also requires O(n^2) running time and space. Any version of inits will require O(n^2) space; lists built with : and [] can only share their tails, and there are no tails in common among the results of inits. The version of inits in Data.List uses an amortized time O(1) queue, much like the simpler queue described in the second half of a related answer. The Snoc referenced in the source code in Data.List is word play on Cons (another name for :) backwards - being able to append an item to the end of the list.
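To spell the difference-list idea out (a minimal sketch of my own, not the Data.List code):
-- a difference list represents a list as "the function that prepends it";
-- appending an element at the end is then function composition, which is O(1)
type DList a = [a] -> [a]
emptyD :: DList a
emptyD = id
snocD :: DList a -> a -> DList a
snocD d x = d . (x :)
fromD :: DList a -> [a]
fromD d = d []
-- fromD (snocD (snocD (snocD emptyD 1) 2) 3) == [1,2,3]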
Briefly experimenting with these functions suggests myInits performs satisfactorily when used sparsely on a large list. On my computer, in ghci, myInits [1..] !! 8000000 yields results in a few seconds. Unfortunately, I have the horrifyingly inefficient implementation that shipped with ghc 7.8.3, so I can't compare myInits to inits.
Strictness
There is one big difference between myInits and inits and between myTails and tails. They have different meanings when applied to undefined or _|_ (pronounced "bottom", another symbol we use for undefined).
inits has the strictness property inits (xs ++ _|_) = inits xs ++ _|_, which, when xs is the empty list [], says that inits will still yield at least one result when applied to undefined:
inits (xs ++ _|_) = inits xs ++ _|_
inits ([] ++ _|_) = inits [] ++ _|_
inits _|_ = [[]] ++ _|_
inits _|_ = [] : _|_
We can see this experimentally
> head . inits $ undefined
[]
myInits does not have this property either for the empty list or for longer lists.
> head $ myInits undefined
*** Exception: Prelude.undefined
> take 3 $ myInits ([1,2] ++ undefined)
[[],[1]*** Exception: Prelude.undefined
We can fix this if we realize that f in myInits would yield start [] in either branch. Therefore, we can delay the pattern matching until it is needed to decide what to do next.
myInits' = f id
  where
    f start list = (start []) :
      case list of
        (l:ls) -> f (start . (l:)) ls
        []     -> []
This increase in laziness makes myInits' work just like inits.
> head $ myInits' undefined
[]
> take 3 $ myInits' ([1,2] ++ undefined)
[[],[1],[1,2]]
Similarly, the difference between your myTails and tails in Data.List is that tails yields the entire list as the first result before deciding if there will be a remainder of the list. The documentation says it obeys tails _|_ = _|_ : _|_, but it actually obeys a much stronger rule that's hard to describe easily.
The building of the prefix functions can be separated from their reification as actual lists:
diffInits = map ($ []) . scanl (\a x -> a . (x:)) id
This is noticeably faster (tested inside GHCi), and is lazier than your version (see Cirdec's answer for the discussion):
diffInits _|_ == [] : _|_
diffInits (xs ++ _|_) == diffInits xs ++ _|_
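For instance (worked out by hand rather than copied from a session):
-- scanl produces the prefix functions id, id . ('a':), (id . ('a':)) . ('b':), ...
-- and map ($ []) turns each of them into the corresponding prefix
checkDiffInits :: Bool
checkDiffInits = diffInits "abc" == ["", "a", "ab", "abc"]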

How to avoid stack space overflows?

I've been a bit surprised by GHC throwing stack overflows when I need to get the value of a large list containing memory-intensive elements.
I expected GHC to have TCO, so I'd never run into such situations.
To simplify the case as much as possible, look at the following straightforward implementations of functions returning Fibonacci numbers (taken from the HaskellWiki). The goal is to display the millionth number.
import Data.List
-- elegant recursive definition
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
-- a bit tricky, using unfoldr from Data.List
fibs' = unfoldr (\(a,b) -> Just (a,(b,a+b))) (0,1)
-- version using iterate
fibs'' = map fst $ iterate (\(a,b) -> (b,a+b)) (0,1)
-- calculate the number by definition
fib_at 0 = 0
fib_at 1 = 1
fib_at n = fib_at (n-1) + fib_at (n-2)
main = do
  {-- All following expressions abort with
      Stack space overflow: current size 8388608 bytes.
      Use `+RTS -Ksize -RTS' to increase it.
  --}
  print $ fibs !! (10^6)
  print . last $ take (10^6) fibs
  print $ fibs' !! (10^6)
  print $ fibs'' !! (10^6)
  -- following expression does not finish after several
  -- minutes
  print $ fib_at (10^6)
The source is compiled with ghc -O2.
What am I doing wrong? I'd like to avoid recompiling with an increased stack size or other specific compiler options.
These links here will give you a good introduction to your problem of too many thunks (space leaks).
If you know what to look out for (and have a decent model of lazy evaluation), then solving them is quite easy, for example:
{-# LANGUAGE BangPatterns #-}
import Data.List
fibs' = unfoldr (\(!a,!b) -> Just (a,(b,a+b))) (0,1)
main = do
  print $ fibs' !! (10^6) -- no more stack overflow
All of the definitions (except the useless fib_at) will delay all the + operations, which means that when you have selected the millionth element it is a thunk with a million delayed additions. You should try something stricter.
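As one illustration of "something stricter" (my own sketch, not part of the original answers): keep the two running Fibonacci values in a strict, tail-recursive loop so no chain of delayed additions can form:
{-# LANGUAGE BangPatterns #-}
-- the bang patterns force both running values at every step
fibAt' :: Int -> Integer
fibAt' n = go n 0 1
  where
    go 0 a _    = a
    go !k !a !b = go (k - 1) b (a + b)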
As others have pointed out, Haskell being lazy, you have to force evaluation of the thunks to avoid stack overflow.
It appears to me that this version of fibs' should work up to 10^6:
fibs' = unfoldr (\(a,b) -> Just (seq a (a, (b, a + b) ))) (0,1)
I recommend studying this wiki page on folds and having a look at the seq function.

A way to measure performance

Given Exercise 14 from 99 Haskell Problems:
(*) Duplicate the elements of a list.
Eg.:
*Main> dupli''' [1..10]
[1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10]
I've implemented 4 solutions:
{-- my first attempt --}
dupli :: [a] -> [a]
dupli [] = []
dupli (x:xs) = replicate 2 x ++ dupli xs
{-- using concatMap and replicate --}
dupli' :: [a] -> [a]
dupli' xs = concatMap (replicate 2) xs
{-- using foldl --}
dupli'' :: [a] -> [a]
dupli'' xs = foldl (\acc x -> acc ++ [x,x]) [] xs
{-- using foldl 2 --}
dupli''' :: [a] -> [a]
dupli''' xs = reverse $ foldl (\acc x -> x:x:acc) [] xs
Still, I don't know how to really measure performance.
So what's the recommended function (from the above list) in terms of performance?
Any suggestions?
These all seem more complicated (and/or less efficient) than they need to be. Why not just this:
dupli [] = []
dupli (x:xs) = x:x:(dupli xs)
Your last example is close to a good fold-based implementation, but you should use foldr, which will obviate the need to reverse the result:
dupli = foldr (\x xs -> x:x:xs) []
As for measuring performance, the "empirical approach" is profiling. As Haskell programs grow in size, they can get fairly hard to reason about in terms of runtime and space complexity, and profiling is your best bet. Also, a crude but often effective empirical approach when gauging the relative complexity of two functions is to simply compare how long they each take on some sufficiently large input; e.g. time how long length $ dupli [1..1000000] takes and compare it to dupli'', etc.
But for a program this small it shouldn't be too hard to figure out the runtime complexity of the algorithm based on your knowledge of the data structure(s) in question--in this case, lists. Here's a tip: any time you use concatenation (x ++ y), the runtime complexity is O(length x). If concatenation is used inside of a recursive algorithm operating on all the elements of a list of size n, you will essentially have an O(n^2) algorithm. Both examples I gave, and your last example, are O(n), because the only operation used inside the recursive definition is (:), which is O(1).
As recommended you can use the criterion package. A good description is http://www.serpentine.com/blog/2009/09/29/criterion-a-new-benchmarking-library-for-haskell/.
To summarize it here and adapt it to your question, here are the steps.
Install criterion with
cabal install criterion -fchart
And then add the following to your code
import Criterion.Main
l = [(1::Int)..1000]
main = defaultMain [ bench "1" $ nf dupli l
                   , bench "2" $ nf dupli' l
                   , bench "3" $ nf dupli'' l
                   , bench "4" $ nf dupli''' l
                   ]
You need the nf in order to force the evaluation of the whole result list. Otherwise you'll get just the thunk for the computation.
After that compile and run
ghc -O --make dupli.hs
./dupli -t png -k png
and you get pretty graphs of the running times of the different functions.
It turns out that dupli''' is the fastest of your functions, but the foldr version that pelotom listed beats everything.
