How to iterate over a tree with a memory limit in Haskell?

I know that there is a solution for iterating through a tree using zippers (see details here), though it is not clear to me whether it is possible to apply memory constraints to this approach.
Context
I was given the following problem to solve in Haskell:
Design an iterator that will iterate through a binary tree in-order.
Assume the binary tree is stored on disk and can contain up to 10 levels, and therefore can contain up to (2^10 - 1) nodes, and we can store at most 100 nodes in memory at any given time.
The goal of this iterator is to load a small fraction of the binary tree from disk to memory each time it's incremented, so that we don't need to load the entire tree into
memory all at once.
I assumed that the memory part is not possible to represent in Haskell, but I was told that it is not true.
Question: what can be used in Haskell to achieve that memory behaviour? Any suggestions, approaches and directions are appreciated. This is just out of curiosity; I've already failed at solving this problem.

If the iterator loads part of the tree each time it is incremented then there are two options:
It exists in the IO monad and works just like in an imperative language.
It is exploiting laziness and interleaved IO. This is the approach taken by functions like readFile which give you the entire contents of a file as one lazy list. The actual file is read on-demand as your application traverses the list.
The latter option is the interesting one here.
The tricky part of lazy lists is retainers. Suppose your file contains a list of numbers. If you compute the sum like this
nums <- map read . lines <$> readFile "numbers.txt"
putStrLn $ "The total is " <> show (sum nums)
then the program will run in constant space. But if you want the average:
putStrLn $ "The average is " <> show (sum nums / fromIntegral (length nums))
then the program will load the entire file into memory. This is because it has to traverse the list twice, once to compute the sum and once to compute the length. It can only do this by holding the entire list.
(The solution is to compute the sum and length in parallel within one pass. But that's beside the point here).
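For completeness, a sketch of that one-pass version: a single strict recursion accumulates the sum and the length together, so the list can be consumed as it is read:
{-# LANGUAGE BangPatterns #-}

-- One pass over the list, accumulating sum and length at the same time.
sumAndLength :: [Double] -> (Double, Int)
sumAndLength = go 0 0
  where
    go !s !n []     = (s, n)
    go !s !n (x:xs) = go (s + x) (n + 1) xs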
The challenge for the tree problem you pose is to come up with an approach to iteration which avoids retaining the tree.
Let's assume that each node in the file contains offsets in the file for the left and right child nodes. We can write a function in the IO monad which seeks to an offset and reads the node there.
data MyNode = MyNode { leftChild :: Int, rightChild :: Int } -- Rest of data to be filled in.
readNodeData :: Handle -> Int -> IO MyNode
From there it would be simple to write a function which traverses the entire file to create a Tree MyNode. If you implement this using unsafeInterleaveIO then you can get a tree which is read lazily as you traverse it.
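A minimal sketch of that lazy tree reader, assuming the MyNode fields above and (as in the list version below) an offset of -1 for a missing child:
import System.IO (Handle)
import System.IO.Unsafe (unsafeInterleaveIO)

data Tree a = Leaf | Branch (Tree a) a (Tree a)

-- Each subtree is read from disk only when the traversal forces it.
readTree :: Handle -> Int -> IO (Tree MyNode)
readTree _ (-1) = return Leaf
readTree h pos = unsafeInterleaveIO $ do
  n <- readNodeData h pos
  left  <- readTree h (leftChild n)
  right <- readTree h (rightChild n)
  return (Branch left n right)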
unsafeInterleaveIO is unsafe because you don't know when the IO will be done. You don't even know what order it will happen in, because it only happens when the value is forced during evaluation. In this way it's like the "promise" structures you get in some other languages. In this particular case it isn't a problem, because we can assume the file doesn't change during the evaluation.
Unfortunately this doesn't solve the problem, because the entire tree will be held in memory by the time you finish. Your traversal has to retain the root at least as long as it's traversing the left side, and as long as it does so it will retain the whole of the rest of the tree.
The solution is to rewrite the IO part to return a list instead of a tree, something like this:
readNode :: Handle -> Int -> IO [MyNode]
readNode _ (-1) = return [] -- Null case for an empty child.
readNode h pos = unsafeInterleaveIO $ do
  n <- readNodeData h pos -- Defined above.
  lefts <- readNode h (leftChild n)
  rights <- readNode h (rightChild n)
  return $ lefts ++ [n] ++ rights
This returns the entire tree as a lazy list. As you traverse the list the relevant nodes will be read on demand. As long as you don't retain the list (see above) your program will not need to hold anything more than the current node and its parents.
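A hypothetical consumer, assuming the root node is stored at offset 0 and MyNode has a Show instance:
import System.IO (IOMode (ReadMode), withFile)

main :: IO ()
main = withFile "tree.bin" ReadMode $ \h -> do
  nodes <- readNode h 0        -- Lazy in-order list of the whole tree.
  mapM_ print (take 100 nodes) -- Only the current node and its parents stay live.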

Related

Effective way to build a (Lazy) Map in Haskell

I know a few different ways to build a Map in Haskell:
build it from a list using fromList
build it from a sorted list using fromAscList
Use the fact that Map is a Monoid (or a Semigroup) and concat singletons.
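Concretely, the three constructions might look like this (a sketch with a tiny hypothetical list; note that Map's Monoid instance is left-biased union):
import qualified Data.Map as M
import Data.List (sortOn)

pairs :: [(Char, Int)]
pairs = [('b', 2), ('a', 1)]

m1, m2, m3 :: M.Map Char Int
m1 = M.fromList pairs                               -- #1
m2 = M.fromAscList (sortOn fst pairs)               -- #2 (input must be ascending)
m3 = mconcat [ M.singleton k v | (k, v) <- pairs ]  -- #3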
I understand that the amortized complexity of #1 is O(n*log(n)), whereas #2 is O(n).
I guess #3 should be roughly equivalent to #1 and might be subject to fusion.
The amortized complexity matters because Haskell is lazy by default: even though a lookup in a Map is O(log(n)), in practice it can be interleaved with the construction of the Map itself, which is O(n*log(n)), making the lookup effectively O(n*log(n)) (especially if you rebuild the map each time you need it). This can also happen if you use a hardcoded Map.
For example, am I right to think that lookup 'b' (fromList [('a', 1), ('b', 2)]) is actually equivalent to just doing a lookup in the list, without using an intermediate Map?
So is there a difference between #1 and #3, or between sorting the list and then calling fromList?
Update
Also, If I need a map to be only computed once, do I need to make sure GHC doesn't inline it, so it is shared between functions ?
Use case
I realized that the question might be a bit blurry; it in fact corresponds to different use cases I encountered recently.
The first one corresponds to a "static join". I have an app which manages items, and each item code can be split into a style and a variation (for example, 'T-Shirt-Red' => ('T-Shirt', 'Red')). The split is based on rules (and regexes) and is quite slow. To avoid recomputing the rules every time, the split is done once and stored in a DB table. I have a few pure functions which need to be able to split an item code, so I pass them a function Text -> (Text, Text). That function is actually a lookup partially applied to a Map. The code is similar to this:
getSplitter :: Handler (Text -> (Text, Text))
getSplitter = do
  sku'style'vars <- runDB $ rawSQL "SELECT sku, style, var FROM style_cache" [] -- load the split table
  let sku'map = fromList [ (sku, (style, var))
                         | (sku, style, var) <- sku'style'vars
                         ]
  return $ flip lookup sku'map
This one can easily be sped up by sorting the items by sku and using fromDistinctAscList (which is actually faster than fromAscList). However, I still have an issue with how to cache it between different requests.
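A sketch of that variant, assuming the skus are unique (e.g. because sku is a primary key), since fromDistinctAscList requires strictly ascending keys:
import qualified Data.Map as M
import Data.List (sortOn)
import Data.Text (Text)

buildSkuMap :: [(Text, Text, Text)] -> M.Map Text (Text, Text)
buildSkuMap rows =
  M.fromDistinctAscList [ (sku, (style, var))
                        | (sku, style, var) <- sortOn (\(s, _, _) -> s) rows
                        ]
If the rows come back already ordered (ORDER BY sku in the query), the sortOn can be dropped.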
The second case is a manual join between two tables. I usually do something along these lines:
do
  sku'infos <- selectList [] [] -- load item info
  let skuInfo = fromList sku'infos
  orderLines <- selectList [] [] -- load orders
  -- find the info corresponding to each order item
  return $ map (\o -> (o, lookup (orderSku o) skuInfo)) orderLines
There again I can sort sku'infos in SQL and use fromDistinctAscList.
A third case is fetching miscellaneous info related to an item category from different tables.
For example I want to be able to compare the sales (sales table) and the purchase (purchases table) by category.
In pure SQL I would do something along the lines of:
SELECT style, sum(sales.amount), sum(purchase.amount)
FROM style_cache
LEFT JOIN sales USING(sku)
LEFT JOIN purchases USING(sku)
GROUP by style
However, this is a simplified example; in practice the aggregation is much more complicated and has to be done in Haskell, as does the join. To do so I load each table separately (grouping what I can in SQL) and return a Map Style SalesInfo, a Map Style PurchaseInfo, etc., and merge them. The tables are quite big, and I realize I end up loading everything into memory, whereas I could probably be much more efficient by "zipping" things manually, but I'm not sure how.
I'm not sure I understood the entire motivation behind this question, but I'll make a few comments:
Map is spine-strict -- which means the tree structure of a Map and the keys themselves are forced (at least far enough to do all the requisite comparisons) on every Map operation. So I would expect Data.Map.lookup k (fromList xs) to take O(n*log(n)) comparisons (where n is the length of xs), whereas I would expect Prelude.lookup k xs to take O(n) comparisons (actually just equality checks, but usually that's pretty much the same complexity as a comparison).
If fromAscList . sort is reliably faster than fromList, this is a performance bug in Data.Map and the library should just be changed to define fromList = fromAscList . sort. I would be very surprised if this were the case. People have spent a fair bit of time optimizing containers, so I wouldn't expect to see any fruit hanging as low as that.
Yes, inlining breaks sharing.
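Regarding the update, a tiny sketch of what that looks like in practice (hypothetical contents; the NOINLINE pragma tells GHC to keep one shared copy rather than duplicating the construction at each use site):
import qualified Data.Map as M

splitTable :: M.Map String (String, String)
splitTable = M.fromList [("T-Shirt-Red", ("T-Shirt", "Red"))] -- hypothetical data
{-# NOINLINE splitTable #-}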

Eq testing for large DAG structures in Haskell

I'm new to Haskell (a couple of months). I have a Haskell program that assembles a large expression DAG (not a tree, a DAG), potentially deep and with multiple merging paths (i.e., the number of different paths from root to leaves is huge). I need a fast way to test these DAGs for equality. The default Eq derivation will just recurse, exploring the same nodes multiple times. Currently this causes my program to take 60 seconds for relatively small expressions, and not even finish for larger ones. The profiler indicates it is busy checking equality most of the time. I would like to implement a custom Eq that does not have this problem. I don't have a way to solve this problem that does not involve a lot of rewriting, so I want to hear your thoughts.
My first attempt was to 'instrument' tree nodes with a hash that I compute incrementally, using Data.Hashable.hash, as I build the tree. This approach gave me an easy way to test two things aren't equal without looking deep into the structure. But often in this DAG, because of the paths in the DAG merging, the structures are indeed equal. So the hashes are equal, and I revert to full blown equality testing.
If I had a way to do physical equality, then a lot of my problems here would go away: if they are physically equal, then that's it. Otherwise if the hash is different then that's it. Only go deeper if they are physically not the same, but their hash agrees.
I could also imitate git and compute a SHA-1 per node to decide equality outright (no need to recurse). I know for a fact that this would help, because if I let equality be decided fully in terms of hash equality, the program runs in tens of milliseconds for the largest expressions. This approach also has the nice advantage that if two DAGs happen to be content-equal without being physically equal, I would be able to detect it fast in that case as well. (With IDs, I'd still have to do a traversal at that point.) So I like the semantics more.
This approach, however, involves a lot more work than just calling the Data.Hashable.hash function, because I have to derive it for every variant of the DAG node type. Moreover, I have multiple DAG representations with slightly different node definitions, so I would need to do this hashing trick twice or more if I decide to add more representations.
What would you do?
Part of the problem here is that Haskell has no concept of object identity, so when you say you have a DAG where you refer to the same node twice, as far as Haskell is concerned it's just two values in different places in a tree. This is fundamentally different from the OO concept, where an object is identified by its location in memory, so the distinction between "same object" and "different objects with equal fields" is meaningful.
To solve your problem you need to detect when you are visiting the same object that you saw earlier, and in order to do that you need to have a concept of "same object" that is independent of the value. There are two basic ways to attack this:
Store all your objects in a vector (i.e. an array), and use the vector index as an object identity. Replace values with indices throughout your data structure.
Give each object a unique "identity" field so you can tell if you have seen this one before when traversing the DAG.
The former is how the Data.Graph module in the containers package does it. One advantage is that, if you have a single mapping from DAG to vector, then DAG equality becomes just vector equality.
Any efficient way to test for equality will be intertwined with the way you build up the DAG values.
Here is an idea which keeps track of all nodes ever created in a Map. As new nodes are added to the Map they are assigned a unique id. Creating nodes now becomes monadic, because you have to thread this Map (and the next available id) through your computation. In this example the nodes are implemented as rose trees, and the order of the children is not significant, hence the call to sort when deriving the key into the map.
import Control.Monad.State
import Data.List (sort)
import qualified Data.Map as M

data Node = Node { _eqIdent  :: Int    -- equality identifier
                 , _value    :: String -- value associated with the node
                 , _children :: [Node] -- children
                 }
  deriving (Show)

type BuildState = (Int, M.Map (String, [Int]) Node)

buildNode :: String -> [Node] -> State BuildState Node
buildNode value nodes = do
  (nextid, nodeMap) <- get
  let key = (value, sort (map _eqIdent nodes)) -- the identity of the node
  case M.lookup key nodeMap of
    Nothing -> do
      let n        = Node nextid value nodes
          nodeMap' = M.insert key n nodeMap
      put (nextid + 1, nodeMap')
      return n
    Just node -> return node

nodeEquality :: Node -> Node -> Bool
nodeEquality a b = _eqIdent a == _eqIdent b
nodeEquality :: Node -> Node -> Bool
nodeEquality a b = _eqIdent a == _eqIdent b
One caveat -- this approach requires that you know all the children of a node when you build it.
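A hypothetical usage sketch, in which the shared leaf is created once, so both parents refer to a node with the same identity:
buildExample :: (Node, BuildState)
buildExample = runState build (0, M.empty)
  where
    build = do
      leaf <- buildNode "x" []       -- created once, id 0
      l    <- buildNode "f" [leaf]   -- both parents share the same leaf
      r    <- buildNode "g" [leaf]
      buildNode "root" [l, r]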

Multiple lookup structures for same data: Memory duplication?

Suppose I have data on a bunch of people, and I want to be able to look them up in different ways. Maybe there's some kind of data structure (like a binary tree) that facilitates lookup by name. And maybe there's another (like a list) that's by order of creation. And perhaps many more.
In many languages, you would have each person allocated exactly once on the heap. Each data structure would contain pointers to that memory. Thus, you're not allocating a new set of people every time you add a new way to look them up.
How about in Haskell? Is there any way to avoid memory duplication when different data structures need to index the same data?
I feel sure there's a deeper, more knowledgeable answer to this question, but for the time being...
Since in a pure functional programming language data is immutable, there's no need to do anything other than copy the pointer instead of copying its target.
As a quick and very dirty example, I fired up the ghci interpreter:
Prelude> let x = replicate 10000 'm' in all (==x) $ replicate 10000 x
True
(1.61 secs, 0 bytes)
I admit that these stats are unreliable, but what it's not doing is allocating memory for all 10000 copies of a list 10000 characters long.
Summary: the way to avoid memory duplication is to
(a) use Haskell;
(b) avoid pointlessly reconstructing your data.
How can I pointlessly reconstruct my data?
A very simple and pointless example:
pointlessly_reconstruct_list :: [a] -> [a]
pointlessly_reconstruct_list [] = []
pointlessly_reconstruct_list (x:xs) = x:xs
This kind of thing causes a duplicate of the list structure.
Have you got any examples that are a little less pointless but still simple?
Interestingly, if you do xs ++ ys you essentially reconstruct xs in order to place ys at the end of it (replacing []), so the list structure of xs is nearly copied wholesale. However, there's no need to replicate the actual data, and there certainly only needs to be one copy of ys.
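To make the original question concrete, here is a sketch (with hypothetical Person data) of two lookup structures sharing one set of values:
import qualified Data.Map as M

data Person = Person { name :: String, age :: Int }

people :: [Person]             -- order-of-creation index
people = [Person "Alice" 30, Person "Bob" 25]

byName :: M.Map String Person  -- lookup-by-name index
byName = M.fromList [ (name p, p) | p <- people ]
Both structures hold pointers to the same two Person heap objects; building byName allocates the Map's spine but copies nothing of the Person values themselves.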

Benefit of avoiding multiple list traversals

I've seen many examples in functional languages about processing a list and constructing a function to do something with its elements after receiving some additional value (usually not present at the time the function was generated), such as:
Calculating the difference between each element and the average
(the last 2 examples under "Lazy Evaluation")
Staging a list append in strict functional languages such as ML/OCaml, to avoid traversing the first list more than once
(the section titled "Staging")
Comparing a list to another with foldr (i.e. generating a function to compare another list to the first)
listEq a b = foldr comb null a b
  where comb x frec []     = False
        comb x frec (e:es) = x == e && frec es

cmp1To10 = listEq [1..10]
In all these examples, the authors generally remark on the benefit of traversing the original list only once. But I can't keep myself from thinking "sure, instead of traversing a list of N elements, you are traversing a chain of N evaluations, so what?". I know there must be some benefit to it; could someone explain it, please?
Edit: Thanks to both for the answers. Unfortunately, that's not what I wanted to know. I'll try to clarify my question, so it's not confused with the (more common) one about creating intermediate lists (which I already read about in various places). Also thanks for correcting my post formatting.
I'm interested in the cases where you construct a function to be applied to a list, when you don't yet have the necessary value to evaluate the result (be it a list or not). Then you can't avoid keeping references to each list element (even if the list structure itself is no longer referenced). And you have the same memory accesses as before, but you don't have to deconstruct the list (pattern matching).
For example, see the "staging" chapter in the mentioned ML book. I've tried it in ML and Racket, more specifically the staged version of "append", which traverses the first list and returns a function to insert the second list at the tail, without traversing the first list many times. Surprisingly for me, it was much faster, even considering it still had to copy the list structure, as the last pointer was different in each case.
The following is a variant of map which, after being applied to a list, should be faster when changing the function. As Haskell is not strict, I would have to force the evaluation of listMap [1..100000] in cachedList (or maybe not, as after the first application it should still be in memory).
listMap = foldr comb (const [])
  where comb x rest = \f -> f x : rest f

cachedList = listMap [1..100000]

doubles = cachedList (2*)
squares = cachedList (\x -> x*x)
-- print doubles and squares
-- ...
I know in Haskell it doesn't make a difference (please correct me if I'm wrong) using comb x rest f = ... vs comb x rest = \f -> ..., but I chose this version to emphasize the idea.
Update: after some simple tests, I couldn't find any difference in execution times in Haskell. The question then is only about strict languages such as Scheme (at least the Racket implementation, where I tested it) and ML.
Executing a few extra arithmetic instructions in your loop body is cheaper than executing a few extra memory fetches, basically.
Traversals mean doing lots of memory access, so the less you do, the better. Fusion of traversals reduces memory traffic, and increases the straight line compute load, so you get better performance.
Concretely, consider this program to compute some math on a list:
go :: [Int] -> [Int]
go = map (+2) . map (^3)
Clearly, we design it with two traversals of the list. Between the first and the second traversal, a result is stored in an intermediate data structure. However, it is a lazy structure, so only costs O(1) memory.
Now, the Haskell compiler immediately fuses the two loops into:
go = map ((+2) . (^3))
Why is that? After all, both are O(n) complexity, right?
The difference is in the constant factors.
Considering this abstraction: for each step of the first pipeline we do:
i <- read memory -- cost M
j = i ^ 3 -- cost A
write memory j -- cost M
k <- read memory -- cost M
l = k + 2 -- cost A
write memory l -- cost M
so we pay 4 memory accesses, and 2 arithmetic operations.
For the fused result we have:
i <- read memory -- cost M
j = (i ^ 3) + 2 -- cost 2A
write memory j -- cost M
where A and M are the constant factors for doing math on the ALU and memory access.
There are other constant factors as well: two loop branches instead of one.
So unless memory access is free (it is not, by a long shot) then the second version is always faster.
Note that compilers that operate on immutable sequences can implement array fusion, the transformation that does this for you. GHC is such a compiler.
There is another very important reason. If you traverse a list only once, and you have no other reference to it, the GC can release the memory claimed by the list elements as you traverse them. Moreover, if the list is generated lazily, you always have only a constant memory consumption. For example
import Data.List

main = do
  let xs = [1..10000000]
      sum = foldl' (+) 0 xs
      len = foldl' (\l _ -> l + 1) 0 xs
  print (sum / len)
computes sum, but needs to keep the reference to xs, so the memory it occupies cannot be released, because it is needed to compute len later. (Or vice versa.) So the program consumes a considerable amount of memory: the larger xs is, the more memory it needs.
However, if we traverse the list only once, it is created lazily and the elements can be GC'd immediately, so no matter how big the list is, the program takes only O(1) memory.
{-# LANGUAGE BangPatterns #-}
import Data.List

main = do
  let xs = [1..10000000]
      (sum, len) = foldl' (\(!s, !l) x -> (s + x, l + 1)) (0, 0) xs
  print (sum / len)
Sorry in advance for a chatty-style answer.
That's probably obvious, but if we're talking about the performance, you should always verify hypotheses by measuring.
A couple of years ago I was thinking about the operational semantics of GHC, the STG machine. And I asked myself the same question — surely the famous "one-traversal" algorithms are not that great? It only looks like one traversal on the surface, but under the hood you also have this chain-of-thunks structure which is usually quite similar to the original list.
I wrote a few versions (varying in strictness) of the famous RepMin problem — given a tree filled with numbers, generate the tree of the same shape, but replace every number with the minimum of all the numbers. If my memory is right (remember — always verify stuff yourself!), the naive two-traversal algorithm performed much faster than various clever one-traversal algorithms.
I also shared my observations with Simon Marlow (we were both at an FP summer school during that time), and he said that they use this approach in GHC. But not to improve performance, as you might have thought. Instead, he said, for a big AST (such as Haskell's one) writing down all the constructors takes much space (in terms of lines of code), and so they just reduce the amount of code by writing down just one (syntactic) traversal.
Personally I avoid this trick because if you make a mistake, you get a loop which is a very unpleasant thing to debug.
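For reference, here is a sketch of the one-traversal RepMin written as the classic circular program; the knot-tying on m is exactly the kind of thing that turns into a loop if you get it wrong:
data Tree = Leaf Int | Fork Tree Tree deriving Show

repMin :: Tree -> Tree
repMin t = t'
  where
    (m, t') = go t              -- m is used inside go before it is fully known
    go (Leaf n)   = (n, Leaf m) -- every leaf is replaced by the final minimum
    go (Fork l r) = let (ml, l') = go l
                        (mr, r') = go r
                    in (min ml mr, Fork l' r')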
So the answer to your question is: partial compilation. Done ahead of time, it means that there's no need to traverse the list to get to the individual elements; all the references are found in advance and stored inside the pre-compiled function.
As to your concern about the need for that function to be traversed too, it would be true in interpreted languages. But compilation eliminates this problem.
In the presence of laziness this coding trick may lead to the opposite results. With full equations, a compiler like GHC is able to perform all kinds of optimizations which essentially eliminate the lists completely and turn the code into the equivalent of loops. This happens when we compile the code with e.g. the -O2 switch.
Writing out the partial equations may prevent this compiler optimization and force the actual creation of functions, with a drastic slowdown of the resulting code. I tried your cachedList code and saw a 0.01s execution time turn into 0.20s (I don't remember right now the exact test I did).

How to create data for Criterion benchmarks?

I am using criterion to benchmark my Haskell code. I'm doing some heavy computations for which I need random data. I've written my main benchmark file like this:
main :: IO ()
main = newStdGen >>= defaultMain . benchmarks

benchmarks :: RandomGen g => g -> [Benchmark]
benchmarks gen =
  [ bgroup "Group"
      [ bench "MyFun" $ nf benchFun (dataFun gen)
      ]
  ]
I keep the benchmarks and the data generators for them in different modules:
benchFun :: ([Double], [Double]) -> [Double]
benchFun (ls, sig) = fun ls sig
dataFun :: RandomGen g => g -> ([Double], [Double])
dataFun gen = (take 5 $ randoms gen, take 1024 $ randoms gen)
This works, but I have two concerns. First, is the time needed to generate random data included in the benchmark? I found a question that touches on that subject, but honestly speaking I'm unable to apply it to my code. To check whether this happens, I wrote an alternative version of my data generator enclosed within the IO monad. I placed the benchmarks list in main, called the generator, extracted the result with <-, and then passed it to the benchmarked function. I saw no difference in performance.
My second concern is related to generating random data. Right now the generator, once created, is never updated, which leads to generating the same data within a single run. This is not a major problem, but nevertheless it would be nice to do it properly. Is there a neat way to generate different random data within each data* function? "Neat" means "without making the data functions acquire a StdGen within IO".
EDIT: As noted in a comment below, I don't really care about data randomness. What is important to me is that the time needed to generate the data is not included in the benchmark.
This works, but I have two concerns. First, is the time needed to generate random data included in the benchmark?
Yes it would. All of the random generation should be happening lazily.
To check whether this happens, I wrote an alternative version of my data generator enclosed within the IO monad. I placed the benchmarks list in main, called the generator, extracted the result with <-, and then passed it to the benchmarked function. I saw no difference in performance.
This is expected (if I understand what you mean); the random values from randoms gen aren't going to be generated until they're needed (i.e. inside your benchmark loop).
Is there a neat way to generate different random data within each data* function? "Neat" means "without making the data functions acquire a StdGen within IO".
You need either to be in IO, or to create a StdGen from an integer seed you supply with mkStdGen.
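For instance, a seeded, IO-free variant of dataFun might look like this (42 is an arbitrary seed; split produces independent generators for the two lists):
import System.Random (mkStdGen, randoms, split)

dataFun' :: ([Double], [Double])
dataFun' = (take 5 (randoms g1), take 1024 (randoms g2))
  where
    (g1, g2) = split (mkStdGen 42)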
Re. your main question of how to get the PRNG stuff out of your benchmarks: you should be able to fully evaluate the random input before your defaultMain (benchmarks g) stuff, with evaluate and force, like:
import Control.DeepSeq (force)
import Control.Exception (evaluate)

myBench g = do
  randInputEvaled <- evaluate $ force $ dataFun g
  defaultMain
    [ bench "MyFun" $ nf benchFun randInputEvaled
    , ...
where force evaluates its argument to normal form, but this will still happen lazily. So to get it evaluated outside of bench, we use evaluate to leverage monadic sequencing. You could also do things like call seq on the tail of each of the lists in your tuple, etc., if you wanted to avoid the imports.
That kind of thing should work fine, unless you need to hold a huge amount of test data in memory.
EDIT: this method is also a good idea if you want to get your data from IO, like reading from the disk, and don't want that mixed in to your benchmarks.
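As a side note, newer criterion versions (1.0 and later, if I remember correctly) ship an env combinator built for exactly this: it runs a setup action and forces the result to normal form before the measured runs. A sketch, worth checking against your criterion version:
import Criterion.Main (bench, defaultMain, env, nf)
import System.Random (mkStdGen, randoms)

main :: IO ()
main = defaultMain
  [ env (pure (take 1024 (randoms (mkStdGen 42)) :: [Double])) $ \xs ->
      bench "sum" $ nf sum xs  -- xs is fully evaluated before timing starts
  ]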
You could try reading the random data from a disk file instead. (In fact, if you're on some Unix-like OS, you could even use /dev/urandom.)
However, the I/O time might dwarf the computation time; it depends on how much random data you need.
(E.g., if your benchmark reads random numbers and calculates their sum, it's going to be I/O-limited. If your benchmark reads a random number and does some huge calculation based on just that one number, the I/O adds hardly any overhead at all.)
