Benefit of DiffList - Haskell

Learn You a Haskell demonstrates the DiffList concept:
*Main Control.Monad.Writer> let f = \xs -> "dog" ++ ("meat" ++ xs)
*Main Control.Monad.Writer> f "foo"
"dogmeatfoo"
Is the primary benefit of the DiffList that the list gets constructed from left to right?

The DList package lists some of the asymptotics: https://hackage.haskell.org/package/dlist-0.5/docs/Data-DList.html
You'll note that lots of operations take only O(1), including cons, snoc, and append. However, inspecting the list forces all of the pending operations each time, so if you are doing more inspection than construction, or interleaving the two, the DList approach won't necessarily be a win.
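To see where the O(1) append comes from, here is a minimal difference-list sketch along the lines of the DiffList from Learn You a Haskell (my own paraphrase, not the dlist source):

newtype DiffList a = DiffList { getDiffList :: [a] -> [a] }

toDiffList :: [a] -> DiffList a
toDiffList xs = DiffList (xs ++)

fromDiffList :: DiffList a -> [a]
fromDiffList (DiffList f) = f []

-- appending is just function composition: O(1) regardless of the lists' lengths
append :: DiffList a -> DiffList a -> DiffList a
append (DiffList f) (DiffList g) = DiffList (f . g)

Only fromDiffList actually builds a list, which is why repeatedly inspecting the result during construction can throw the advantage away.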

Related

Example of a data structure with lazy spine and strict leaves

One of the performance tricks mentioned here is this:
As a safe default: lazy in the spine, strict in the leaves.
I'm having trouble imagining such a data structure.
If I take lists as an example and make them strict in the leaves, won't the spine automatically be strict?
Is there any example of a data structure where the spine is lazy and the leaves are strict?
"Lazy in the spine, strict in the leaves" is a property of the API, not (just) a property of the data structure. Here's an example of how it might look for lists:
module StrictList (StrictList, runStrictList, nil, cons, uncons, repeat) where

import Prelude hiding (repeat)  -- needed so the exported repeat is unambiguous

newtype StrictList a = StrictList { runStrictList :: [a] }

nil :: StrictList a
nil = StrictList []

-- cons forces the element before building the cell, so every stored leaf is in WHNF
cons :: a -> StrictList a -> StrictList a
cons x (StrictList xs) = x `seq` StrictList (x:xs)

uncons :: StrictList a -> Maybe (a, StrictList a)
uncons (StrictList []) = Nothing
uncons (StrictList (x:xs)) = Just (x, StrictList xs)

-- the spine is infinite (hence necessarily lazy), but the single leaf is forced up front
repeat :: a -> StrictList a
repeat x = x `seq` StrictList (let xs = x:xs in xs)
Note that compared to built-in lists, this API is quite impoverished -- that's just to keep the illustration small, not for a fundamental reason. The key point here is that you can still support things like repeat, where the spine is necessarily lazy (it's infinite!) but all the leaves are evaluated before anything else happens. Many of the other list operations that can produce infinite lists can be adapted to leaf-strict versions (though not all, as you observe).
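A quick GHCi check of that claim (my own, assuming the module above is compiled and importable):

> import qualified StrictList as S
> fst <$> S.uncons (S.repeat (error "boom" :: Int))
*** Exception: boom
> fst <$> S.uncons (S.cons 42 (S.repeat 0))
Just 42

The first call dies immediately because repeat forces its leaf before building the infinite spine; the second only ever demands the first cell.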
You should also notice that it is not necessarily possible to take a leaf-lazy, spine-lazy structure and turn it into a leaf-strict, spine-lazy one in a natural way; e.g. one could not write a generic fromList :: [a] -> StrictList a such that:
fromList (repeat x) = repeat x and
runStrictList (fromList xs) = xs for all finite-length xs.
(Forgive my punning, I'm a repeat offender).
This bit of advice mixes up two related, but distinct, ideas. Haskell programmers are often sloppy about the distinction, but it matters here.
Strict vs. non-strict
This is a semantic distinction. A function f is strict if f _|_ = _|_, and non-strict otherwise.
Eager (call by value) vs. lazy (call by need)
This is a matter of implementation, and can have major performance implications. Lazy evaluation is one way to implement non-strict semantics.
What the claim really means
It actually means that the data structure should be both: lazy in the spine and strict (indeed, eager) in the leaves. The right amount of laziness in the spine of a data structure can be very helpful. Sometimes it gives asymptotic improvements in performance. It can also improve cache utilization and cut garbage collection costs. On the other hand, too much laziness (even in the spine, in some cases!) can lead to a harmful accumulation of deferred computations. From an API standpoint, it can be very helpful to ensure that insertion operations are eager (and therefore strict), so that you know that everything stored in the structure has been forced.
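The containers library illustrates that last point: Data.Map.Lazy stores values as given, while Data.Map.Strict forces each value when it is inserted. A quick GHCi check (my own, not part of the original answer):

> import qualified Data.Map.Lazy as ML
> import qualified Data.Map.Strict as MS
> ML.size (ML.insert 1 (undefined :: Int) ML.empty)
1
> MS.size (MS.insert 1 (undefined :: Int) MS.empty)
*** Exception: Prelude.undefined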

Functional Programming-Style Map Function that adds elements?

I know and love my filter, map and reduce, which happen to be part of more and more languages that are not really purely functional.
I found myself needing a similar function though: something like map, but instead of one to one it would be one to many.
I.e. one element of the original list might be mapped to multiple elements in the target list.
Is there already something like this out there or do I have to roll my own?
This is exactly what >>= specialized to lists does.
> [1..6] >>= \x -> take (x `mod` 3) [1..]
[1,1,2,1,1,2]
It's concatenating together the results of
> map (\x -> take (x `mod` 3) [1..]) [1..6]
[[1],[1,2],[],[1],[1,2],[]]
You do not have to roll your own. There are many relevant functions here, but I'll highlight three.
First of all, there is the concat function, which already comes in the Prelude (the standard library that's loaded by default). What this function does, when applied to a list of lists, is return the list containing the concatenated contents of the sublists.
EXERCISE: Write your own version of concat :: [[a]] -> [a].
So using concat together with map, you could write this function:
concatMap :: (a -> [b]) -> [a] -> [b]
concatMap f = concat . map f
...except that you don't actually need to write it, because it's such a common pattern that the Prelude already has it (at a more general type than what I show here—the library version takes any Foldable, not just lists).
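For example (my own quick illustration), a one-to-many mapping that sends each number to itself and its double:

> concatMap (\x -> [x, 2 * x]) [1, 2, 3]
[1,2,2,4,3,6]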
Finally, there is also the Monad instance for list, which can be defined this way:
instance Monad [] where
return a = [a]
as >>= f = concatMap f as
So the >>= operator (the centerpiece of the Monad class), when working with lists, is exactly the same thing as concatMap.
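A quick sanity check that the two agree, reusing the example from the first answer above:

> ([1..6] >>= \x -> take (x `mod` 3) [1..]) == concatMap (\x -> take (x `mod` 3) [1..]) [1..6]
True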
EXERCISE: Skim through the documentation of the Data.List module. Figure out how to import the module into your code and play around with some of the functions.

Rewriting as a practical optimization technique in GHC: Is it really needed?

I was reading the paper by Simon Peyton Jones et al., "Playing by the Rules: Rewriting as a practical optimization technique in GHC". In its second section, "The basic idea", they write:
Consider the familiar map function, that applies a function to each element of a list. Written in Haskell, map looks like this:
map f [] = []
map f (x:xs) = f x : map f xs
Now suppose that the compiler encounters the following call of map:
map f (map g xs)
We know that this expression is equivalent to
map (f . g) xs
(where “.” is function composition), and we know that the latter expression is more efficient than the former because there is no intermediate list. But the compiler has no such knowledge.
One possible rejoinder is that the compiler should be smarter --- but the programmer will always know things that the compiler cannot figure out. Another suggestion is this: allow the programmer to communicate such knowledge directly to the compiler. That is the direction we explore here.
My question is, why can't we make the compiler smarter? The authors say that “but the programmer will always know things that the compiler cannot figure out”. However, that's not a valid answer because the compiler can indeed figure out that map f (map g xs) is equivalent to map (f . g) xs, and here is how:
Start with map f (map g xs) and consider the two clauses of map.

Case xs = []: the inner call matches the clause map f [] = [], so map g [] = [].
Hence map f (map g []) = map f [], and the outer call matches the same clause, so map f (map g []) = [].

Case xs = x:xs: the inner call matches the clause map f (x:xs) = f x : map f xs, so map g (x:xs) = g x : map g xs.
Hence map f (map g (x:xs)) = map f (g x : map g xs), and the outer call matches the same clause, so map f (map g (x:xs)) = f (g x) : map f (map g xs).

We now have the rules:

map f (map g []) = []
map f (map g (x:xs)) = f (g x) : map f (map g xs)
As you can see, f (g x) is just (f . g) x, and map f (map g xs) is being called recursively. This is exactly the definition of map (f . g) xs. The algorithm for this automatic conversion seems pretty simple. So why not implement this instead of rewrite rules?
Aggressive inlining can derive many of the equalities that rewrite rules are short-hand for.
The difference is that inlining is "blind", so you don't know in advance if the result will be better or worse, or even if it will terminate.
Rewrite rules, however, can do completely non-obvious things, based on much higher level facts about the program. Think of rewrite rules as adding new axioms to the optimizer. By adding these you have a richer rule set to apply, making complicated optimizations easier to apply.
Stream fusion, for example, changes the data type representation. This cannot be expressed through inlining, as it involves a representation type change (we reframe the optimization problem in terms of the Stream ADT). Easy to state in rewrite rules, impossible with inlining alone.
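For reference, the map/map law from the question is essentially the example rewrite rule from the paper; in GHC's RULES syntax it looks like this (GHC's own list optimizations in base are actually phrased as foldr/build fusion rules rather than this direct form):

{-# RULES
"map/map"  forall f g xs.  map f (map g xs) = map (f . g) xs
  #-}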
Something in that direction was investigated in a Bachelor’s thesis of Johannes Bader, a student of mine: Finding Equations in Functional Programs (PDF file).
To some degree it is certainly possible, but
it is quite tricky. Finding such equations is in a sense as hard as finding proofs in a theorem prover, and
it is often not very useful, because it tends to find equations that the programmer would rarely write directly.
It is, however, useful for cleaning up after other transformations such as inlining and various forms of fusion.
This comes down to a trade-off between optimizing the specific case and optimizing the general case, and that trade-off can produce funny situations where you know how to make something faster, but it is better for the language in general if you don't.
In the specific case of the maps in the structure you give, the compiler could find the optimization. However, what about related structures? What if the function isn't map? What if there's an additional layer of indirection, such as a function that returns map? In those cases, the compiler cannot optimize easily. This is the general-case problem.
Now, if you do optimize the special case, one of two outcomes occurs:
Nobody relies on it, because they aren't sure whether it is there or not. In this case, articles like the one you quote get written.
People do start relying on it, and now every developer is forced to remember "maps done in this configuration get automatically converted to the fast version for me, but in that configuration they don't." This starts to shape the way people use the language, and can actually reduce readability!
Given the need for developers to think about such optimizations in the general case, we expect to see them doing these optimizations in the simple case anyway, decreasing the need for the optimization in the first place!
Now, if it turns out that the particular case you are interested in accounts for something massive, like 2% of the world's Haskell codebase, there would be a much stronger argument for applying your special-case optimization.

Haskell io-streams memory usage

I am trying to parse large log files in Haskell. I'm using System.IO.Streams, but it seems to eat a lot of memory when I fold over the input. Here are two (ugly) examples:
First, load 1M Ints into memory in a list.
let l = foldl (\aux p -> p:aux) [] [1..1000000]
return (sum l)
Memory consumption is beautiful. The Ints take 3MB and the list needs 6MB:
see memory consumption of building list of 1M Int
Then try the same with a Stream of ByteStrings. We need an ugly back-and-forth conversion, but I don't think it makes any difference.
let s = Streams.fromList $ map (B.pack . show) [1..1000000]
l <- s >>=
     Streams.map bsToInt >>=
     Streams.fold (\aux p -> p:aux) []
return (sum l)
see memory consumption of building a list of Ints from a stream
Why does it need more memory? And it's even worse if I read from a file: it needs 90MB.
result <- withFileAsInput file load
putStrLn $ "loaded " ++ show result
  where
    load is = do
      l <- Streams.lines is >>=
           Streams.map bsToInt >>=
           Streams.fold (\aux p -> p:aux) []
      return (sum l)
My assumption is that Streams.fold has some issue, because the library's built-in countInput function doesn't use it. Any ideas?
EDIT
After some investigation, I reduced the question to this: why does this code need an extra 50MB?
do
  let l  = map (Builder.toLazyByteString . intDec) [1..1000000]
  let l2 = map (fst . fromJust . B.readInt) l
  return (foldl' (\aux p -> p:aux) [] l2)
Without the conversions it needs only 30MB; with the conversions, 90MB.
In your first example, the foldl (\aux p -> p:aux) [] is redundant. It constructs a list with the same elements as the list it takes as an argument! Without the redundancy, the example is equivalent to sum [1..1000000] or foldl (+) 0 [1..1000000]. Also, it would be better to use the strict left fold foldl' to avoid the accumulation of reducible expressions on the heap. See Foldr Foldl Foldl' on the Haskell wiki.
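With the redundancy removed and the strict fold in place, the first example amounts to something like this (total is just an illustrative name):

import Data.List (foldl')

total :: Int
total = foldl' (+) 0 [1..1000000]  -- the accumulator is forced at each step, so no chain of suspended (+) builds up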
In your last example, you are using System.IO.Streams.Combinators.fold for building a list of all the integers which are read from the file, and then try to sum the list like you did in your first example.
The problem is that, because of the sequencing of file read operations imposed by the IO monad, all the data in the file has been read before you start summing the list, and is lurking on the heap, possibly still untransformed from the original Strings and taking even more memory.
The solution is to perform the actual sum inside the fold as each new element arrives; that way you don't need to have the full list in memory at any time, only the current element (being able to do this while performing I/O is one of the aims of streaming libraries). And the fold provided by io-streams is strict, analogous to foldl'. So you don't accumulate reducible expressions on the heap, either.
Try something like System.IO.Streams.Combinators.fold (+) 0.
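In the setting of the question, load would then look something like this (a sketch, reusing the bsToInt helper and the Streams import from the question):

load is = Streams.lines is >>=
          Streams.map bsToInt >>=
          Streams.fold (+) 0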
So the problem was the lazy creation of ByteStrings, not the stream fold after all.
See
Why creating and disposing temporal ByteStrings eats up my memory in Haskell?

How do you create a rewrite pass based on whether two expressions refer to the same bound name?

How do you find and rewrite expressions that refer to the same bound name? For example, in the expression
let xs = ...
in ...map f xs...map g xs...
both the expression map f xs and the expression map g xs refer to the same bound name, namely xs. Are there any standard compiler analyses that would let us identify this situation and rewrite the two map expressions to e.g.
let xs = ...
    e  = unzip (map (f *** g) xs)
in ...fst e...snd e...
I've been thinking about the problem in terms of a tree traversal. For example, given the AST:
data Ast = Map (a -> b) Ast
         | Var String
         | ...
we could try to write a tree traversal to detect this case, but that seems difficult, since two Map nodes that refer to the same Var might appear at widely different places in the tree. This analysis seems easier if we inverted all the references in the AST, making it a graph, but I wanted to see whether there are any alternatives to that approach.
I think what you are looking for is a set of program transformations usually referred to as Tupling, Fusion, and Supercompilation, which fall under the more general theory of Unfold/Fold transformation. You can achieve what you want as follows.
First, perform speculative evaluation (Unfolding) by "driving" the definition of map over its argument, which gives rise to two new pseudo-programs, depending on whether xs is of the form y:ys or []. In pseudocode:
let y:ys = ...
in ...(f y):(map f ys)...(g y):(map g ys)...
let [] = ...
in ...[]...[]...
Then perform abstractions for shared structure (Tupling) and generalisations (Folding) with respect to the original program to stop otherwise perpetual unfolding:
let xs = ...
in ...(fst tuple)...(snd tuple)...
  where
    tuple = generalisation xs
    generalisation []     = ([], [])
    generalisation (y:ys) = let tuple = generalisation ys
                            in ((f y):(fst tuple), (g y):(snd tuple))
I hope this gives you an idea, but program transformation is a research field in its own right, and it is hard to explain well without drawing directed acyclic graphs.
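For the concrete map f / map g example from the question, the tupled result corresponds to a single-pass helper along these lines (mapPair is a hypothetical name, not a library function):

mapPair :: (a -> b) -> (a -> c) -> [a] -> ([b], [c])
mapPair _ _ []     = ([], [])
mapPair f g (y:ys) = let (fs, gs) = mapPair f g ys
                     in (f y : fs, g y : gs)

-- so that   ...map f xs...map g xs...
-- becomes   let (fxs, gxs) = mapPair f g xs in ...fxs...gxs...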

Resources