Data.Map: how do I tell if I "need value-strict maps"? - haskell

When choosing between Data.Map.Lazy and Data.Map.Strict, the docs tell us for the former:
API of this module is strict in the keys, but lazy in the values. If you need value-strict maps, use Data.Map.Strict instead.
and for the latter likewise:
API of this module is strict in both the keys and the values. If you need value-lazy maps, use Data.Map.Lazy instead.
How do more seasoned Haskellers than me tend to intuit this "need"? Use case in point, in a run-and-done (i.e. not daemon-like/long-running) command-line tool: readFile-ing a simple line-based custom config file where many (not all) lines define key:value pairs to be collected into a Map. Once done, we rewrite many values in it depending on other values that were read later (thanks to immutability, this process creates a new Map and discards the initial incarnation).
(Although in practice this file likely won't often, or ever, reach even 1,000 lines, let's assume for the sake of learning that for some users it will before long.)
Any given run of the tool will perhaps look up some 20-100% of the (rewritten-on-load, although with lazy evaluation I'm never quite sure "when really") key:value pairs, anywhere between once and dozens of times.
How do I reason about the differences between "value-strict" and "value-lazy" Data.Maps here? What happens "under the hood", in terms of mainstream computing if you will?
Fundamentally, such maps are of course about "storing once, looking up many times" --- but then, what in computing isn't, "fundamentally"? And furthermore the whole concept of lazy evaluation's thunks seems to boil down to this very principle, so why not always stay value-lazy?
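For concreteness, a minimal sketch of the loading step I have in mind (names made up, the real parsing is messier; lines without a ':' are simply skipped):
import qualified Data.Map.Lazy as M   -- or Data.Map.Strict; that is the question

parseConfig :: String -> M.Map String String
parseConfig = M.fromList . concatMap parseLine . lines
  where
    parseLine ln = case break (== ':') ln of
      (key, ':' : value) -> [(key, value)]
      _                  -> []          -- not a key:value line; skip it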

How do I reason about the differences between "value-strict" and "value-lazy" Data.Maps here?
Value-lazy is the norm in Haskell. It means that what gets stored is not necessarily a plain value but possibly a thunk (i.e. a recipe for how to compute the value). For example, let's say you extract the value from a line like this:
tail (dropWhile (/= ':') line)
A value-strict map would actually extract the value upon insert, while a lazy one would happily just remember how to get it. That thunk is then also what you get back on a lookup.
Here are some pros and cons:
Lazy values may need more memory, not only for the thunk itself but also for the data it references (here, line).
Strict values may need more memory. In our case this can happen when the string gets parsed into some memory-hungry structure like a list, JSON or XML.
Lazy values may need less CPU if your code doesn't end up needing every value.
Too deeply nested thunks may cause stack overflows when the value is finally needed.
There is also a semantic difference: in lazy mode, you may get away with it when the code that extracts the value would fail (like the one above, which fails if there isn't a ':' on the line), as long as you only check whether the key is present. In strict mode, your program crashes upon insert (see the sketch at the end of this answer).
As always, there are no fixed measures like: "If your evaluated value needs less than 20 bytes and takes less than 30µs to compute, use strict, else use lazy."
Normally, you just go with one, and when you notice extreme runtimes or memory usage you try the other.
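Here's a minimal sketch of that last point (the key name and the malformed line are made up for illustration):
import qualified Data.Map.Lazy   as ML
import qualified Data.Map.Strict as MS

-- A line with no ':' at all, so extracting its value blows up:
badValue :: String
badValue = tail (dropWhile (/= ':') "no colon on this line")

main :: IO ()
main = do
  -- Lazy map: the failing extraction is stored as a thunk, and a mere
  -- presence check never forces it, so this prints True.
  print (ML.member "key" (ML.insert "key" badValue ML.empty))
  -- Strict map: insert forces the value to WHNF, so this line dies
  -- with "Prelude.tail: empty list" before member is even reached.
  print (MS.member "key" (MS.insert "key" badValue MS.empty))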

Here's a small experiment that shows a difference between Data.Map.Lazy and Data.Map.Strict. This code exhausts the heap:
import Data.Foldable
import qualified Data.Map.Lazy as M
main :: IO ()
main = print $ foldl' (\kv i -> M.adjust (+i) 'a' kv)
               (M.fromList [('a',0)])
               (cycle [0])
(Better to compile with a small maximum heap, like ghc Main.hs -with-rtsopts="-M20m".)
The foldl' keeps the map in WHNF as we iterate over the infinite list of zeros. However, thunks accumulate in the modified value until the heap is exhausted.
The same code with Data.Map.Strict simply loops forever. In the strict variant, the values are in WHNF whenever the map is in WHNF.
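(For reference, a sketch of that strict variant; it differs only in the import and, as described above, loops forever instead of exhausting the heap:)
import Data.Foldable
import qualified Data.Map.Strict as M

main :: IO ()
main = print $ foldl' (\kv i -> M.adjust (+i) 'a' kv)
               (M.fromList [('a',0)])
               (cycle [0])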

Related

Are sequences faster than vectors for searching in haskell?

I am kind of new to using data structures in Haskell besides lists. My goal is to choose one container among Data.Vector, Data.Sequence, Data.List, etc. My problem is the following:
I have to create a sequence (mathematically speaking). The sequence starts at 0. In each iteration two new elements are generated, but only one should be appended, based on whether the first element is already in the sequence. So each iteration involves a call to the elem function (see the pseudo-code below).
appendNewItem :: [Integer] -> [Integer]
appendNewItem acc = let firstElem  = someFunc
                        secondElem = someOtherFunc
                        newElem    = if firstElem `elem` acc
                                       then secondElem
                                       else firstElem
                    in  acc `append` newElem

sequenceUptoN :: Int -> [Integer]
sequenceUptoN n = (iterate appendNewItem [0]) !! n
Where the append and iterate functions vary depending on which collection you use (I am using lists in the type signatures for simplicity).
The question is: which data structure should I use? Is Data.Sequence faster for this task because of its finger-tree inner structure?
Thanks a lot!!
No, sequences are not faster for searching. A Vector is just a flat chunk of memory, which generally gives the best lookup performance. If you want to optimise searching, use Data.Vector.Unboxed. (The normal, “boxed” variant is also pretty good, but it actually contains only references to the elements in the flat memory chunk, so it's not quite as fast for lookups.)
However, because of the flat memory layout, Vectors are not good for (pure-functional) appending: basically, whenever you add a new element, the whole array must be copied so as not to invalidate the old one (which somebody else might still be using). If you need to append, Seq is a pretty good choice, although it's not as fast as destructive appending: for maximum performance, you'll want to pre-allocate an uninitialized Data.Vector.Unboxed.Mutable.MVector of the required size, populate it using the ST monad, and freeze the result. But this is much more fiddly than the purely functional alternatives, so unless you need to squeeze out every bit of performance, Data.Sequence is the way to go. If you only want to append, but not look up elements, then a plain old list in reverse order would also do the trick.
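The "preallocate, fill in ST, freeze" pattern mentioned above looks roughly like this (a sketch that fills the vector with squares as a stand-in for real data):
import           Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed as VU
import qualified Data.Vector.Unboxed.Mutable as VUM

buildSquares :: Int -> VU.Vector Int
buildSquares n = runST $ do
  mv <- VUM.unsafeNew n                              -- uninitialized mutable vector
  mapM_ (\i -> VUM.write mv i (i * i)) [0 .. n - 1]  -- destructive fills
  VU.unsafeFreeze mv                                 -- O(1) freeze to an immutable Vector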
I suggest using Data.Sequence in conjunction with Data.Set: the Sequence to hold the sequence of values and the Set to track the collection.
Sequence, List, and Vector are all structures for working with values where the position in the structure has primary importance when it comes to indexing. In lists we can manipulate elements at the front efficiently, in sequences we can manipulate elements in time logarithmic in the distance to the closest end, and in vectors we can access any element in constant time. Vectors, however, are not that useful if the length keeps changing, so that rules out their use here.
However, you also need to look up whether a certain value is already present, which these structures don't help with: you have to search the whole list/sequence/vector to be certain that a new value isn't there. Data.Map and Data.Set are structures whose index is based on Ord, and they let you look up and insert in O(log n). So, at the cost of some memory you can check for the presence of firstElem in your Set in O(log n) time and then add newElem to the end of the sequence in constant time. Just make sure to keep the two structures in sync when adding or removing elements (a sketch follows below).

Haskell list construction and memory usage

Suppose I have the following piece of code:
a = reverse b
doSomething a
Will memory for the list a be actually allocated, or will doSomething simply reuse the list b? If the memory is going to be allocated, is there a way to avoid it? Doubling memory usage just because I need a reversed list doesn't sound particularly nice.
In the worst case, both a and b will exist in memory in their entirety. Note that even then, the contents of the two lists will only exist once, shared between both lists, so we're only talking about the "spine" of the lists existing twice.
In the best case, depending on how b is defined and what doSomething does, the compiler might do some hoopy magic to turn the whole thing into a tight constant-space loop that generates the contents of the list as it processes them, possibly involving no memory allocation at all. Maybe.
But even in the very worst case, you're only duplicating the spine of the list. You'll never duplicate the actual elements in the list.
(Each cons node is, what, 3 pointers? I think...)
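For intuition, here is roughly how reverse builds its result (a sketch of the usual accumulator definition, not GHC's exact source): each step allocates one new cons cell but reuses the element x itself, so the input and output share all of their elements and differ only in their spines.
myReverse :: [a] -> [a]
myReverse = go []
  where
    go acc []     = acc
    go acc (x:xs) = go (x:acc) xs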

Understanding Haskell's `map` - Stack or Heap?

Given the following function:
f :: [String]
f = map integerToWord [1..999999999]
integerToWord :: Integer -> String
Let's ignore the implementation. Here's a sample output:
ghci> integerToWord 123999
"onehundredtwentythreethousandandninehundredninetynine"
When I execute f, do all results, i.e. f(0) through f(999999999) get stored on the stack or heap?
Note - I'm assuming that Haskell has a stack and heap.
After running this function for ~1 minute, I don't see the RAM increasing from its original usage.
To be precise: when you "just execute" f, it isn't evaluated unless you use its result somehow. And when you do, it's stored according to what the caller requires.
As for this example, it isn't stored anywhere: the function is applied to each number, the result is output to your terminal and then discarded. So at any given moment you only allocate enough memory for the current value and its result (an approximation, but precise enough for this case).
References:
https://wiki.haskell.org/Non-strict_semantics
https://wiki.haskell.org/Lazy_vs._non-strict
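A minimal sketch of that streaming behaviour, using show as a stand-in for integerToWord: each string is demanded, printed, and becomes garbage before the next one is produced, so the program runs in roughly constant space.
main :: IO ()
main = mapM_ (putStrLn . show) [1 .. 999999999 :: Integer]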
First: To split hairs, the following answer applies to GHC. A different Haskell compiler could plausibly implement things differently.
There is indeed a heap and a stack. Almost everything goes on the heap, and hardly anything goes on the stack.
Consider, for example, the expression
let x = foo 17 in ...
Let's assume that the optimiser doesn't transform this into something completely different. The call to foo doesn't appear on the stack at all; instead, we create a note on the heap saying that we need to do foo 17 at some point, and x becomes a pointer to this note.
So, to answer your question: when you call f, a note that says "we need to execute map integerToWord [1..999999999] someday" gets stored on the heap, and you get a pointer to that. What happens next depends on what you do with that result.
If, for example, you try to print the entire thing, then yes, the result of every call to integerToWord ends up on the heap. At any given moment, only a single call to integerToWord is on the stack.
Alternatively, if you just try to access the 8th element of the result, then a bunch of "call integerToWord on this number someday" notes end up on the heap, plus the actual result for the 8th element, plus a note for the rest of the list.
Incidentally, there's a package out there ("vacuum"?) which lets you print out the actual object graphs for what you're executing. You might find it interesting.
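If you want to see those heap "notes" directly, GHCi's :sprint command prints _ for anything that is still an unevaluated thunk. A small session along these lines (using show in place of integerToWord, which GHCi doesn't know about) illustrates the idea:
ghci> let x = map show [1..5 :: Integer]
ghci> :sprint x
x = _
ghci> length x
5
ghci> :sprint x
x = [_,_,_,_,_]
length forced the spine of the list, but each element is still an unevaluated thunk sitting on the heap.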
GHC programs use a stack and a heap... but it doesn't work at all like the eager language stack machines you're familiar with. Somebody else is gonna have to explain this, because I can't.
The other challenge in answering your question is that GHC uses the following two techniques:
Lazy evaluation
List fusion
Lazy evaluation in Haskell means that (as the default rule) expressions are only evaluated when their value is demanded, and even then they may only be partially evaluated, only as far as needed to resolve a pattern match that requires the value. So we can't say what your map example does without knowing what is demanding its value.
List fusion is a set of rewrite rules built into GHC that recognize a number of situations where the output of a "good" list producer is only ever consumed as the input of a "good" list consumer. In these cases, GHC can fuse the producer and the consumer into an object-code loop without ever allocating list cells.
In your case:
[1..999999999] is a good producer
map is both a good consumer and a good producer
But you seem to be using ghci, which doesn't do fusion. You need to compile your program with -O for fusion to happen.
You haven't told us what would be consuming the output of the map. If it's a good consumer it will fuse with the map.
But there's a good chance that GHC would eliminate most or all of the list cell allocations if you compiled (with -O) a program that just prints the result of that code. In that case, the list would not exist as a data structure in memory at all—the compiler would generate object code that does something roughly equivalent to this:
for (int i = 1; i <= 999999999; i++) {
    print(integerToWord(i));
}

Why are bacon.js observables classified lazy sequences?

My understanding of lazy sequences is that they don't load data into memory until it's accessed by the program. So I can see how this would make sense if there were a large list of numbers waiting to be consumed, but the sequence only pulled in data from the producer when the iterator called the next method.
But observables append items to themselves whenever the producer pushes them. So it's not as if the sequence loads data when the consumer asks for it; it loads it whenever the producer sends it. So in what way are observables lazy?
There are 2 kinds of laziness in Bacon.js Observables:
Observables don't register to their underlying data source (for example, an AJAX fetch) until there is at least one Observer. This laziness practically gives you automatic resource management in the sense that connections to data sources are automatically opened and closed based on demand. Also if there are multiple Observers, only a single connection to the data source is used and the results are shared.
Observables don't evaluate the functions passed to map, combine etc. until the value is actually used. So if you do an expensive calculation in the function you give to map and only sample the stream once a second, only the sampled values will actually be evaluated.
What you don't get is backpressure management. So if your data source, say observable a, produces infinite values sequentially, you can either
process them immediately using a.onValue
take 1000 first of them using a.take(1000).onValue
take until some condition using a.takeUntil((x) -> x > 1000).onValue
But you cannot affect the rate that the source produces the values, because Bacon.js provides no way to tell the source that "I'm interested in more values later, I'll tell you when". I've been thinking about adding something like this, but it would complicate matters a lot.
As the bottom line, I'd say that Bacon.js is not your ideal library for working with infinite lists, for instance, in the Haskell style. And AFAIK, neither is any other Javascript FRP lib.
A lazy sequence is a sequence that evaluates its elements at the last time possible. What's useful about them is that you can pass the sequence around and even perform operations on it without evaluating its contents.
Since you don't evaluate the sequence, you can even create an infinite sequence; just make sure that you don't evaluate the whole of it. For example, the following Haskell program creates an infinite sequence of natural numbers, then lazily multiplies each element by 2, producing an infinite sequence of even numbers, then takes the first 5 elements, evaluating them (and only them):
take 5 (map (*2) [1..])
-- [2,4,6,8,10]
Basically, with lazy sequences and a set of functions that work over them, like map, you can write programs that create and process potentially infinite streams of data in a composable way.
Which, incidentally, is exactly what (functional) reactive programming is about. An observable can be seen as a potentially infinite stream of pairs (value, timestamp). An operation on an observable, such as map, simply creates another stream based on the original one. Attaching a consumer to an observable evaluates the elements of it that "already happened", while leaving the rest un-evaluated.
For example, Bacon's observable.map can be implemented as a function that lazily applies a function to the 'value' part of the stream (again in Haskell):
map f = Prelude.map (\(value, tstamp) -> (f value, tstamp))
while observable.delay adds delay to the 'timestamp' part:
delay dt = Prelude.map (\(value, tstamp) -> (value, tstamp + dt))
where Prelude.map is just a regular "map" function over lazy sequences.
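Putting those pieces together, here is a self-contained sketch of that model (the Stream synonym and the ticks generator are made up for illustration):
type Stream a = [(a, Double)]

mapS :: (a -> b) -> Stream a -> Stream b
mapS f = map (\(value, tstamp) -> (f value, tstamp))

delayS :: Double -> Stream a -> Stream a
delayS dt = map (\(value, tstamp) -> (value, tstamp + dt))

-- An infinite "observable": the value n arrives at time n.
ticks :: Stream Int
ticks = [(n, fromIntegral n) | n <- [0 ..]]

main :: IO ()
main = print (take 5 (delayS 0.5 (mapS (* 2) ticks)))
-- [(0,0.5),(2,1.5),(4,2.5),(6,3.5),(8,4.5)]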
Hopefully I haven't confused you any further!

Repa performance versus lists

In the Numeric Haskell Repa Tutorial Wiki, there is a passage that reads (for context):
10.1 Fusion, and why you need it
Repa depends critically on array fusion to achieve fast code. Fusion is a fancy name for the
combination of inlining and code transformations performed by GHC when
it compiles your program. The fusion process merges the array filling
loops defined in the Repa library, with the "worker" functions that
you write in your own module. If the fusion process fails, then the
resulting program will be much slower than it needs to be, often 10x
slower than an equivalent program using plain Haskell lists. On the other
hand, provided fusion works, the resulting code will run as fast as an
equivalent cleanly written C program. Making fusion work is not hard
once you understand what's going on.
The part that I don't understand is this:
"If the fusion process fails, then the
resulting program will be much slower than it needs to be, often 10x
slower than an equivalent program using plain Haskell lists."
I understand why it would run slower if stream fusion fails, but why does it run that much slower than lists?
Thanks!
Typically, because lists are lazy, and Repa arrays are strict.
If you fail to fuse a lazy list traversal, e.g.
map f . map g
you pay O(1) cost per value for leaving the intermediate (lazy) cons cell there.
If you fail to fuse the same traversal over a strict sequence, you pay at least O(n) per value for allocating a strict intermediate array.
Also, since fusion mangles your code into an unrecognizable Stream data type to improve analysis, when fusion fails you can be left with code that has just too many constructors and other overheads.
Edit: This is not correct--see Don Nelson's comment (and his answer--he knows a lot more about the library than I do).
Immutable arrays cannot share components; disregarding fusion, any modification to an immutable array must reallocate the entire array. By contrast, while list operations are non-destructive, they can share parts: f i (h:t) = i:t, for example, replaces the head of a list in constant time by creating a new list in which the first cell points to the second cell of the original list. Moreover, because lists can be built incrementally, such functions as generators that build a list by repeated calls to a function can still run in O(n) time, while the equivalent function on an immutable array without fusion would need to reallocate the array with every call to the function, taking O(n^2) time.
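To make the contrast concrete, here is a sketch using the boxed immutable arrays from the array package (Data.Array, my choice for illustration) next to the list version: replacing the head of a list allocates one cons cell and shares the tail, while updating an immutable array with (//) rebuilds the whole array.
import qualified Data.Array as A

-- List: one new cons cell; the tail is shared with the original.
replaceHeadList :: a -> [a] -> [a]
replaceHeadList i (_:t) = i : t
replaceHeadList i []    = [i]

-- Immutable array: without fusion, (//) must copy the entire array.
replaceHeadArray :: a -> A.Array Int a -> A.Array Int a
replaceHeadArray i arr = arr A.// [(fst (A.bounds arr), i)]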
