Suppose I have data on a bunch of people, and I want to be able to look them up in different ways. Maybe there's some kind of data structure (like a binary tree) that facilitates lookup by name. And maybe there's another (like a list) that's by order of creation. And perhaps many more.
In many languages, you would have each person allocated exactly once on the heap. Each data structure would contain pointers to that memory. Thus, you're not allocating a new set of people every time you add a new way to look them up.
How about in Haskell? Is there any way to avoid memory duplication when different data structures need to index the same data?
I feel sure there's a deeper, more knowledgeable answer to this question, but for the time being...
Since in a pure functional programming language data is immutable, there's no need to do anything other than copy the pointer instead of copying its target.
As a quick and very dirty example, I fired up the ghci interpreter:
Prelude> let x = replicate 10000 'm' in all (==x) $ replicate 10000 x
True
(1.61 secs, 0 bytes)
I admit these stats are unreliable, but the point is that it is clearly not allocating memory for 10000 separate copies of a 10000-character list.
Summary:
The way to avoid memory duplication is to
(a) use Haskell,
(b) avoid pointlessly reconstructing your data.
How can I pointlessly reconstruct my data?
A very simple and pointless example:
pointlessly_reconstruct_list :: [a] -> [a]
pointlessly_reconstruct_list [] = []
pointlessly_reconstruct_list (x:xs) = x:xs
This kind of thing causes a duplicate of the list structure.
Have you got any examples that are a little less pointless but still simple?
Interestingly, if you do xs ++ ys you essentially reconstruct xs in order to place ys at the end of it (replacing []), so the list structure of xs is nearly copied wholesale. However, there's no need to replicate the actual data, and there certainly only needs to be one copy of ys.
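To connect this back to the original question about indexing the same people several ways, here is a minimal sketch (the Person type and its fields are made up for illustration): two independent indexes are built over the same values, and neither copies the people themselves.

import qualified Data.Map as Map

data Person = Person { name :: String, age :: Int }

people :: [Person]                -- indexed by order of creation
people = [Person "Alice" 30, Person "Bob" 25]

byName :: Map.Map String Person   -- indexed by name
byName = Map.fromList [(name p, p) | p <- people]

-- Both structures hold references to the same Person values on the heap;
-- building byName allocates a Map spine, not new Person records.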
Related
The Applicative instance for Data.Sequence generally performs very well. Almost all the methods are incrementally asymptotically optimal in time and space. That is, given fully forced/realized inputs, it's possible to access any portion of the result in asymptotically optimal time and memory residency. There is one remaining exception: (<*). I only know two ways to implement it as yet:
The default implementation
xs <* ys = liftA2 const xs ys
This implementation takes O(|xs| * |ys|) time and space to fully realize the result, but only O(log(min(k, |xs|*|ys|-k))) to access just the kth element of the result.
A "monadic" implementation
xs <* ys = xs >>= replicate (length ys)
This takes only O(|xs| * log |ys|) time and space, but it's not incremental; accessing an arbitrary element of the result requires O(|xs| * log |ys|) time and space.
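For reference, the behaviour both implementations must agree on (a quick GHCi check of my own, not part of either implementation): each element of xs appears once per element of ys, in xs order.

Prelude Data.Sequence> fromList [1,2] <* fromList "ab"
fromList [1,1,2,2]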
I have long believed that it should be possible to have our cake and eat it too, but I've never been able to juggle the pieces in my mind well enough to get there. To do so appears to require a combination of ideas (but not actual code) from the implementations of liftA2 and replicate. How can this be done?
Note: it surely won't be necessary to incorporate anything like the rigidify mechanism of liftA2. The replicate-like pieces should surely produce only the sorts of "rigid" structures we use rigidify to get from user-supplied trees.
Update (April 6, 2020)
Mission accomplished! I managed to find a way to do it. Unfortunately, it's a little too complicated for me to understand everything going on, and the code is ... rather opaque. I will upvote and accept a good explanation of what I've written, and will also happily accept suggestions for clarity improvements and comments on GitHub.
Update 2
Many thanks to Li-Yao Xia and Bertram Felgenhauer for helping to clean up and document my draft code. It's now considerably less difficult to understand, and will appear in the next version of containers. It would still be nice to get an answer to close out this question.
I know a few different ways to build a Map in Haskell:
build it from a list using fromList
build it from a sorted list using fromAscList
use the fact that Map is a Monoid (or a Semigroup) and mconcat a list of singletons.
I understand that the amortized complexity of #1 is O(n*log(n)), whereas #2 is O(n).
I guess #3 should be roughly equivalent to #1 and might be subject to fusion.
The amortized complexity matters because Haskell is lazy by default: even though a lookup in a Map is O(log(n)), in practice it can be interleaved with the construction of the Map itself, which is O(n*log(n)), so the lookup can effectively cost O(n*log(n)) (especially if you rebuild the map each time you need it). This can also happen if you use a hardcoded Map.
For example, am I right to think that lookup 'b' (fromList [('a', 1), ('b', 2)]) is actually equivalent to just doing the lookup in the list, without using an intermediate Map?
So is there a difference between #1 and #3, or between sorting the list and then calling fromList?
Update
Also, if I need a map to be computed only once, do I need to make sure GHC doesn't inline it, so that it is shared between functions?
Use case
I realize the question might be a bit blurry; it actually corresponds to several different use cases I encountered recently.
The first one corresponds to a "static join". I have an app which manages items, and each item code can be split into a style and a variation (for example, 'T-Shirt-Red' => ('T-Shirt', 'Red')). The split is based on rules (and regexps) and is quite slow. To avoid recomputing the rules every time, the split is done once and stored in a db table. I have a few pure functions which need to be able to split an item code, so I pass them a function Text -> (Text, Text). That function is actually a lookup partially applied to a Map. The code is similar to this:
getSplitter :: Handler (Text -> (Text, Text))
getSplitter = do
    sku'style'vars <- runDB $ rawSQL "SELECT sku, style, var FROM style_cache " [] -- load the split table
    let sku'map = fromList [ (sku, (style, var))
                           | (sku, style, var) <- sku'style'vars
                           ]
    return $ flip lookup sku'map
This one can easily be sped up by sorting the items by sku and using fromDistinctAscList (which is actually faster than fromAscList). However, I still have some issues with how to cache the map between different requests.
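A sketch of what that variant could look like (buildSkuMap is a name I made up; it assumes the rows arrive already sorted by sku and without duplicates, e.g. via ORDER BY sku in the query):

import qualified Data.Map as Map
import Data.Text (Text)

buildSkuMap :: [(Text, Text, Text)] -> Map.Map Text (Text, Text)
buildSkuMap rows =
    -- fromDistinctAscList is O(n), but silently builds a broken Map if the
    -- input is not strictly ascending by key.
    Map.fromDistinctAscList [ (sku, (style, var)) | (sku, style, var) <- rows ]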
The second case is a manual join between two tables. I usually do something along the lines of:
do
    sku'infos <- selectList [] []   -- load item info
    let skuInfo = fromList sku'infos
    orderLines <- selectList [] []  -- load orders
    -- find the info corresponding to each order item
    return $ map (\o -> (o, lookup (orderSku o) skuInfo)) orderLines
There again I can sort sku'infos in SQL and use fromDistinctAscList.
A third case is fetching miscellaneous info related to an item category from different tables.
For example, I want to be able to compare the sales (sales table) and the purchases (purchases table) by category.
In pure SQL I would do something along the lines of:
SELECT style, sum(sales.amount), sum(purchase.amount)
FROM style_cache
LEFT JOIN sales USING(sku)
LEFT JOIN purchases USING(sku)
GROUP BY style
However, this is a simplified example; in practice the aggregation is much more complicated and has to be done in Haskell, as does the join. To do so I load each table separately (grouping what I can in SQL), return a Map Style SalesInfo, a Map Style PurchaseInfo, etc., and merge them. The tables are quite big, and I realize I end up loading everything into memory, whereas I could probably be much more efficient by "zipping" things manually, but I'm not sure how.
I'm not sure I understood the entire motivation behind this question, but I'll make a few comments:
Map is spine-strict -- which means the tree structure of a Map and the keys themselves are forced (at least far enough to do all the requisite comparisons) on every Map operation. So I would expect Data.Map.lookup k (fromList xs) to take O(n*log(n)) comparisons (where n is the length of xs), whereas I would expect Prelude.lookup k xs to take O(n) comparisons (actually just equality checks, but usually that's pretty much the same complexity as a comparison).
If fromAscList . sort is reliably faster than fromList, this is a performance bug in Data.Map and the library should just be changed to define fromList = fromAscList . sort. I would be very surprised if this were the case. People have spent a fair bit of time optimizing containers, so I wouldn't expect to see any fruit hanging as low as that.
Yes, inlining breaks sharing.
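As a minimal sketch of that last point (the names and table contents here are made up): keep the Map in a top-level binding so it is computed at most once, and use NOINLINE to keep GHC from duplicating the definition at use sites.

import qualified Data.Map.Strict as Map

-- Top-level bindings (CAFs) are evaluated at most once and then shared.
splitTable :: Map.Map String (String, String)
splitTable = Map.fromList [("T-Shirt-Red", ("T-Shirt", "Red"))]
{-# NOINLINE splitTable #-}   -- ask GHC not to inline the Map into callers

splitSku :: String -> Maybe (String, String)
splitSku sku = Map.lookup sku splitTable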
Referencing dfeuer's answer to the question "Least expensive way to construct cyclic list in Haskell", which says that using cyclic lists 'defeats' the garbage collector, as it has to keep everything you've consumed from a cyclic list allocated until you drop the reference to every cons cell in the list.
Apparently in Haskell a cyclic list and an infinite list are two separate things. This blog (https://unspecified.wordpress.com/2010/03/30/a-doubly-linked-list-in-haskell/) says that if you implement cycle as follows:
cycle xs = xs ++ cycle xs
it is an infinite list, not a cyclic list. To make it cyclic you have to implement it like this (as is found in the Prelude source code):
cycle xs = xs' where xs' = xs ++ xs'
What exactly is the difference between these two implementations? And why is it that if you are holding onto one cons cell somewhere in a cyclic list, that the garbage collector has to keep everything before it allocated as well?
The difference is entirely in the memory representation. From the point of view of the semantics of the language, they're indistinguishable—you can't write a function that can tell them apart, so your two versions of cycle are considered two implementations of the same function (they're the exact same mapping of arguments to results). In fact, I don't know if the language definition guarantees that one of those is cyclical and the other infinite.
But anyway, let's bring out the ASCII art. Cyclical list:
+----+----+              +----+----+
| x0 |  ----> ... ------>| xn |  | |
+----+----+              +----+--|-+
  ^                              |
  |                              |
  +------------------------------+
Infinite list:
+----+----+
| x0 | -----> thunk that produces infinite list
+----+----+
The thing with the cyclical list is that from every cons cell in the list there is a path to all of the others and itself. This means that from the point of view of the garbage collector, if one of the cons cells is reachable, then all are. In the plain infinite list, on the other hand, there aren't any cycles, so from a given cons cell only its successors are reachable.
Note that the infinite list representation is more powerful than the cyclical one, because the cyclical representation only works with lists that repeat after some number of elements. For example, the list of all prime numbers can be represented as an infinite list, but not as a cyclical one.
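As a tiny illustration of such a list (the classic trial-division sieve, simple rather than efficient): no finite cycle of cons cells could ever represent it.

primes :: [Integer]
primes = sieve [2..]
  where sieve (p:xs) = p : sieve [x | x <- xs, x `mod` p /= 0]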
Note also that this distinction can be generalized into two distinct ways of implementing the fix function:
fix, fix' :: (a -> a) -> a
fix f = let result = f result in result
fix' f = f (fix' f)
-- Circular version of cycle:
cycle xs = fix (xs++)
-- Infinite list version of cycle:
cycle' xs = fix' (xs++)
The GHC libraries go for my fix definition. The way GHC compiles code means that the thunk created for result is used both as the result and the argument of the application of f. I.e., the thunk, when forced, will call the object code for f with the thunk itself as its argument, and replace the thunk's contents with the result.
Cyclic lists and infinite lists are different operationally, but not semantically.
A cyclic list is literally a loop in memory - imagine a singly linked list with the pointers following a cycle - so takes up constant space. Because each cell in the list can be reached from any other cell, holding onto any one cell will cause the entire list to be held onto.
An infinite list will take up more and more space as you evaluate more of it. Earlier elements will be garbage collected if no longer needed, so programs that process it may still run in constant space, although the garbage collection overhead will be higher. If earlier elements in the list are needed, for example because you hold onto a reference to the head of the list, then the list will consume linear space as you evaluate it, and will eventually exhaust available memory.
The reason for this difference is that without optimisations, a typical Haskell implementation like GHC will allocate memory once for a value, like xs' in the second definition of cycle, but will repeatedly allocate memory for a function invocation, like cycle xs in the first definition.
In principle optimisations might turn one definition into the other, but because of the quite different performance characteristics, it's unlikely that this will happen in practice as compilers are generally quite conservative about making programs behave worse. In some cases the cyclic variant will be worse because of the garbage collection properties already mentioned.
cycle xs = xs ++ cycle xs -- 1
cycle xs = xs' where xs' = xs ++ xs' -- 2
What exactly is the difference between these two implementations?
Using GHC, the difference is that implementation #2 creates a self-referential value (xs'), while #1 merely creates a thunk that evaluates to the same list.
And why is it that if you are holding onto one cons cell somewhere in a cyclic list, that the garbage collector has to keep everything before it allocated as well?
This is again GHC specific. As Luis said, if you have a reference to one cons cell in a cyclic list, then you can reach the whole list just by going around the cycle. The garbage collector is conservative and won't collect anything that you can still reach.
Haskell is pure and where-refactoring is sound... only when you disregard memory usage (and a few other things like CPU usage and computation time). Haskell the language does not specify what the compiler should do to differentiate #1 and #2. GHC the implementation follows certain memory management patterns which are sensible, but not immediately obvious.
I've seen many examples in functional languages about processing a list and constructing a function to do something with its elements after receiving some additional value (usually not present at the time the function was generated), such as:
Calculating the difference between each element and the average
(the last 2 examples under "Lazy Evaluation")
Staging a list append in strict functional languages such as ML/OCaml, to avoid traversing the first list more than once
(the section titled "Staging")
Comparing a list to another with foldr (i.e. generating a function to compare another list to the first)
listEq a b = foldr comb null a b
  where comb x frec []     = False
        comb x frec (e:es) = x == e && frec es

cmp1To10 = listEq [1..10]
In all these examples, the authors generally remark the benefit of traversing the original list only once. But I can't keep myself from thinking "sure, instead of traversing a list of N elements, you are traversing a chain of N evaluations, so what?". I know there must be some benefit to it, could someone explain it please?
Edit: Thanks to both for the answers. Unfortunately, that's not what I wanted to know. I'll try to clarify my question, so it's not confused with the (more common) one about creating intermediate lists (which I already read about in various places). Also thanks for correcting my post formatting.
I'm interested in the cases where you construct a function to be applied to a list, where you don't yet have the necessary value to evaluate the result (be it a list or not). Then you can't avoid generating references to each list element (even if the list structure is not referenced anymore). And you have the same memory accesses as before, but you don't have to deconstruct the list (pattern matching).
For example, see the "staging" chapter in the mentioned ML book. I've tried it in ML and Racket, more specifically the staged version of "append" which traverses the first list and returns a function to insert the second list at the tail, without traversing the first list many times. Surprisingly for me, it was much faster even considering it still had to copy the list structure as the last pointer was different on each case.
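For concreteness, here is a rough Haskell transcription of that staged append (my own sketch; in Haskell the payoff is muted by laziness, which is exactly what this question is probing):

-- Traverse xs once, building a chain of closures; the returned function can
-- then append any ys without walking xs again (at least in a strict language;
-- in Haskell the closures are themselves built lazily).
appendStaged :: [a] -> ([a] -> [a])
appendStaged []     = \ys -> ys
appendStaged (x:xs) = let rest = appendStaged xs
                      in  \ys -> x : rest ys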
The following is a variant of map which, once applied to a list, should be faster when changing the function. Since Haskell is not strict, I would have to force the evaluation of listMap [1..100000] in cachedList (or maybe not, since after the first application it should still be in memory).
listMap = foldr comb (const [])
  where comb x rest = \f -> f x : rest f

cachedList = listMap [1..100000]

doubles = cachedList (2*)
squares = cachedList (\x -> x*x)
-- print doubles and squares
-- ...
I know in Haskell it doesn't make a difference (please correct me if I'm wrong) using comb x rest f = ... vs comb x rest = \f -> ..., but I chose this version to emphasize the idea.
Update: after some simple tests, I couldn't find any difference in execution times in Haskell. The question then is only about strict languages such as Scheme (at least the Racket implementation, where I tested it) and ML.
Executing a few extra arithmetic instructions in your loop body is cheaper than executing a few extra memory fetches, basically.
Traversals mean doing lots of memory access, so the less you do, the better. Fusion of traversals reduces memory traffic, and increases the straight line compute load, so you get better performance.
Concretely, consider this program to compute some math on a list:
go :: [Int] -> [Int]
go = map (+2) . map (^3)
Clearly, we design it with two traversals of the list. Between the first and the second traversal, a result is stored in an intermediate data structure. However, it is a lazy structure, so only costs O(1) memory.
Now, the Haskell compiler immediately fuses the two loops into:
go = map ((+2) . (^3))
Why is that? After all, both are O(n) complexity, right?
The difference is in the constant factors.
Considering this abstraction: for each step of the first pipeline we do:
i <- read memory -- cost M
j = i ^ 3 -- cost A
write memory j -- cost M
k <- read memory -- cost M
l = k + 2 -- cost A
write memory l -- cost M
so we pay 4 memory accesses, and 2 arithmetic operations.
For the fused result we have:
i <- read memory -- cost M
j = (i ^ 3) + 2 -- cost 2A
write memory j -- cost M
where A and M are the constant factors for doing math on the ALU and memory access.
There are other constant factors as well: two loop branches instead of one.
So unless memory access is free (it is not, by a long shot) then the second version is always faster.
Note that compilers that operate on immutable sequences can implement array fusion, the transformation that does this for you. GHC is such a compiler.
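As a sketch (my own, not GHC's actual source) of how such a fusion step can be expressed, here is a rewrite rule for two adjacent maps; GHC's list fusion reaches the same result through its build/foldr machinery rather than a rule written exactly like this.

module MapFusionSketch where

{-# RULES
"map/map sketch" forall f g xs. map f (map g xs) = map (f . g) xs
  #-}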
There is another very important reason. If you traverse a list only once, and you have no other reference to it, the GC can release the memory claimed by the list elements as you traverse them. Moreover, if the list is generated lazily, you always have only a constant memory consumption. For example
import Data.List

main = do
    let xs  = [1..10000000]
        sum = foldl' (+) 0 xs
        len = foldl' (\_ -> (+ 1)) 0 xs
    print (sum / len)
computes sum, but it needs to keep the reference to xs, and the memory xs occupies cannot be released, because it is needed to compute len later. (Or vice versa.) So the program consumes a considerable amount of memory; the larger xs is, the more memory it needs.
However, if we traverse the list only once, it is created lazily and its elements can be garbage-collected immediately, so no matter how big the list is, the program takes only O(1) memory.
{-# LANGUAGE BangPatterns #-}
import Data.List

main = do
    let xs = [1..10000000]
        (sum, len) = foldl' (\(!s, !l) x -> (s + x, l + 1)) (0, 0) xs
    print (sum / len)
Sorry in advance for a chatty-style answer.
That's probably obvious, but if we're talking about the performance, you should always verify hypotheses by measuring.
A couple of years ago I was thinking about the operational semantics of GHC, the STG machine. And I asked myself the same question — surely the famous "one-traversal" algorithms are not that great? It only looks like one traversal on the surface, but under the hood you also have this chain-of-thunks structure which is usually quite similar to the original list.
I wrote a few versions (varying in strictness) of the famous RepMin problem — given a tree filled with numbers, generate the tree of the same shape, but replace every number with the minimum of all the numbers. If my memory is right (remember — always verify stuff yourself!), the naive two-traversal algorithm performed much faster than various clever one-traversal algorithms.
I also shared my observations with Simon Marlow (we were both at an FP summer school during that time), and he said that they use this approach in GHC. But not to improve performance, as you might have thought. Instead, he said, for a big AST (such as Haskell's one) writing down all the constructors takes much space (in terms of lines of code), and so they just reduce the amount of code by writing down just one (syntactic) traversal.
Personally I avoid this trick because if you make a mistake, you get a loop which is a very unpleasant thing to debug.
So the answer to your question is, partial compilation. Done ahead of time, it makes it so that there's no need to traverse the list to get to the individual elements - all the references are found in advance and stored inside the pre-compiled function.
As to your concern about the need for that function to be traversed too, it would be true in interpreted languages. But compilation eliminates this problem.
In the presence of laziness this coding trick may lead to the opposite result. Given full equations, a compiler such as GHC is able to perform all kinds of optimizations which essentially eliminate the lists completely and turn the code into the equivalent of loops. This happens when we compile the code with, e.g., the -O2 switch.
Writing out the partial equations may prevent this compiler optimization and force the actual creation of functions, with a drastic slowdown of the resulting code. I tried your cachedList code and saw a 0.01s execution time turn into 0.20s (I don't remember right now the exact test I did).
I'm working on a small concept project in Haskell which requires a circular buffer. I've managed to create a buffer using arrays which has O(1) rotation, but of course requires O(N) for insertion/deletion. I've found an implementation using lists which appears to take O(1) for insertion and deletion, but since it maintains a left and right list, crossing a certain border when rotating will take O(N) time. In an imperative language, I could implement a doubly linked circular buffer with O(1) insertion, deletion, and rotation. I'm thinking this isn't possible in a purely functional language like Haskell, but I'd love to know if I'm wrong.
If you can deal with amortized O(1) operations, you could probably use either Data.Sequence from the containers package, or Data.Dequeue from the dequeue package. The former uses finger trees, while the latter uses the "Banker's Dequeue" from Okasaki's Purely Functional Data Structures (a prior version online here).
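A sketch of what the circular-buffer operations could look like on top of Data.Sequence (the function names are mine): rotation, insertion, and removal at the ends are all amortized O(1).

import Data.Sequence (Seq, ViewL (..), viewl, (<|), (|>))

-- Rotate one step: move the front element to the back.
rotate1 :: Seq a -> Seq a
rotate1 s = case viewl s of
  EmptyL  -> s
  x :< xs -> xs |> x

-- Insert at the "current position" (the front).
push :: a -> Seq a -> Seq a
push = (<|)

-- Remove the current element, if any.
pop :: Seq a -> Maybe (a, Seq a)
pop s = case viewl s of
  EmptyL  -> Nothing
  x :< xs -> Just (x, xs)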
The ST monad allows you to describe and execute imperative algorithms in Haskell. You can use STRefs for the mutable pointers of your doubly linked list.
Self-contained algorithms described using ST are executed using runST. Different runST executions may not share ST data structures (STRef, STArray, ..).
If the algorithm is not "self contained" and the data structure is required to be maintained with IO operations performed in between its uses, you can use stToIO to access it in the IO monad.
Regarding whether this is purely functional or not - I guess it's not?
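To make that concrete, here is a rough sketch of the kind of node you could build with STRefs (the types and names are mine, and this is not a full buffer implementation):

import Control.Monad.ST (ST)
import Data.STRef (STRef, newSTRef, readSTRef, writeSTRef)

-- A doubly linked node whose neighbour pointers are mutable STRefs.
data Node s a = Node
  { nodeValue :: a
  , nodePrev  :: STRef s (Node s a)
  , nodeNext  :: STRef s (Node s a)
  }

-- A one-element circular list: the node's prev and next point to itself.
singletonNode :: a -> ST s (Node s a)
singletonNode x = do
  p <- newSTRef undefined   -- placeholders, overwritten just below
  n <- newSTRef undefined
  let node = Node x p n
  writeSTRef p node
  writeSTRef n node
  return node

-- O(1) insertion after a node, by rewiring four pointers.
insertAfter :: Node s a -> a -> ST s (Node s a)
insertAfter node x = do
  nxt <- readSTRef (nodeNext node)
  p   <- newSTRef node
  n   <- newSTRef nxt
  let new = Node x p n
  writeSTRef (nodeNext node) new
  writeSTRef (nodePrev nxt)  new
  return new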
It sounds like you might need something a bit more complicated than this (since you mentioned doubly-linked lists), but maybe this will help. This function acts like map over a mutable cyclic list:
mapOnCycling f = concat . tail . iterate (map f)
Use like:
*Main> (+1) `mapOnCycling` [3,2,1]
[4,3,2,5,4,3,6,5,4,7,6,5,8,7,6,9,8,7,10,9...]
And here's one that acts like mapAccumL:
mapAccumLOnCycling f acc xs =
    let (acc', xs') = mapAccumL f acc xs
    in  xs' ++ mapAccumLOnCycling f acc' xs'
Anyway, if you care to elaborate even more on what exactly your data structure needs to be able to "do" I would be really interested in hearing about it.
EDIT: as camccann mentioned, you can use Data.Sequence for this, which according to the docs should give you O(1) time complexity (is there such a thing as O(1) amortized time?) for viewing or adding elements at both the left and right ends of the sequence, as well as modifying the ends along the way. Whether this will have the performance you need, I'm not sure.
You can treat the "current location" as the left end of the Sequence. Here we shuttle back and forth along a sequence, producing an infinite list of values. Sorry if it doesn't compile, I don't have GHC at the moment:
{-# LANGUAGE ViewPatterns #-}
import Data.Sequence (Seq, ViewL (..), ViewR (..), viewl, viewr, (<|), (|>))

shuttle (viewl -> a :< as) = a : shuttle (rotate (a + 1 <| as))
  where rotate | even a    = rotateForward
               | otherwise = rotateBack
        rotateBack    (viewr -> as' :> a') = a' <| as'
        rotateForward (viewl -> a' :< as') = as' |> a'