Understanding Haskell's `map` - Stack or Heap?

Given the following function:
f :: [String]
f = map integerToWord [1..999999999]
integerToWord :: Integer -> String
Let's ignore the implementation. Here's a sample output:
ghci> integerToWord 123999
"onehundredtwentythreethousandandninehundredninetynine"
When I execute f, do all results, i.e. integerToWord 1 through integerToWord 999999999, get stored on the stack or the heap?
Note - I'm assuming that Haskell has a stack and heap.
After running this function for ~1 minute, I don't see the RAM increasing from its original usage.

To be precise - when you "just execute" f, nothing is evaluated until you use its result somehow. And when you do, the result is stored according to what the caller requires of it.
In this example it isn't stored anywhere: integerToWord is applied to each number in turn, each result is written to your terminal and then discarded. So at any given moment you only allocate enough memory to hold the current number and its result (an approximation, but precise enough for this case).
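You can watch this in GHCi with :sprint, which prints a binding without forcing it. A small sketch (show stands in for integerToWord, since its implementation isn't given; the exact :sprint output may vary slightly by GHC version):

ghci> let xs = map show [1..999999999 :: Integer]
ghci> :sprint xs
xs = _
ghci> xs !! 2
"3"
ghci> :sprint xs
xs = _ : _ : "3" : _

Nothing was computed when xs was bound; demanding one element forced the spine up to it and that element, and everything else is still unevaluated.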
References:
https://wiki.haskell.org/Non-strict_semantics
https://wiki.haskell.org/Lazy_vs._non-strict

First: To split hairs, the following answer applies to GHC. A different Haskell compiler could plausibly implement things differently.
There is indeed a heap and a stack. Almost everything goes on the heap, and hardly anything goes on the stack.
Consider, for example, the expression
let x = foo 17 in ...
Let's assume that the optimiser doesn't transform this into something completely different. The call to foo doesn't appear on the stack at all; instead, we create a note on the heap saying that we need to do foo 17 at some point, and x becomes a pointer to this note.
So, to answer your question: when you call f, a note that says "we need to execute map integerToWord [1..999999999] someday" gets stored on the heap, and you get a pointer to that. What happens next depends on what you do with that result.
If, for example, you try to print the entire thing, then yes, the result of every integerToWord call ends up on the heap. At any given moment, only the single call currently being evaluated occupies the stack.
Alternatively, if you just try to access the 8th element of the result, then "apply integerToWord to this number someday" notes for the elements you skipped end up on the heap, plus the actual 8th result, plus a note for the rest of the list.
Incidentally, there's a package out there ("vacuum"?) which lets you print out the actual object graphs for what you're executing. You might find it interesting.

GHC programs use a stack and a heap... but it doesn't work at all like the eager language stack machines you're familiar with. Somebody else is gonna have to explain this, because I can't.
The other challenge in answering your question is that GHC uses the following two techniques:
Lazy evaluation
List fusion
Lazy evaluation in Haskell means that (as the default rule) expressions are only evaluated when their value is demanded, and even then they may only be partially evaluated, just far enough to resolve a pattern match that requires the value. So we can't say what your map example does without knowing what is demanding its value.
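For instance (a minimal sketch, with show standing in for integerToWord), head only has to resolve one pattern match, so it forces a single cons cell of the mapped list and nothing more, even though the input list is infinite:

firstWord :: String
firstWord = head (map show ([1 ..] :: [Integer]))  -- terminates; only one cell is ever built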
List fusion is a set of rewrite rules built into GHC, that recognize a number of situations where the output of a "good" list producer is only ever consumed as the input of a "good" list consumer. In these cases, Haskell can fuse the producer and the consumer into an object-code loop without ever allocating list cells.
In your case:
[1..999999999] is a good producer
map is both a good consumer and a good producer
But you seem to be using ghci, which doesn't do fusion. You need to compile your program with -O for fusion to happen.
You haven't told us what would be consuming the output of the map. If it's a good consumer it will fuse with the map.
But there's a good chance that GHC would eliminate most or all of the list cell allocations if you compiled (with -O) a program that just prints the result of that code. In that case, the list would not exist as a data structure in memory at all—the compiler would generate object code that does something roughly equivalent to this:
for (int i = 1; i <= 999999999; i++) {
    print(integerToWord(i));
}
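For reference, the Haskell program corresponding to that loop might look like the sketch below (the stub for integerToWord is a placeholder, since the question leaves the implementation out). Compiled with -O, [1..] is a good producer and mapM_ is a good consumer, so the intermediate list can fuse away; you can check whether the rules fired with -ddump-rule-firings.

-- hypothetical stand-in for the question's integerToWord
integerToWord :: Integer -> String
integerToWord = show

main :: IO ()
main = mapM_ (putStrLn . integerToWord) [1 .. 999999999]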

Related

Best (mutable) queue data structure available in Haskell

Dear(est) Stack Exchangers,
I am currently implementing some algorithms which require access to a "queue" (FIFO) data structure. I am using the ST monad, and thus am looking for queue implementations which fit well with the ST monad's "mutability of memory". At this point, I am just tempted to use newSTRef on a list (but again, accessing the last element is O(n), which I would want to avoid as much as I can). I also thought of using Data.Sequence, though I am not sure if it actually will be "mutable" if used inside the ST monad without newSTRef initialisation.
Can the kind members of Stack Exchange guide a beginner in Haskell as to what would be the best data structure (or module) in the aforementioned context?
Options include implementing a traditional ring buffer on top of STArray, or using a mutable singly-linked list built out of STRefs, as in:
type CellRef s a = STRef s (Cell s a)
data Cell s a = End | Cell a (CellRef s a)
data Q s a = Q { readHead, writeHead :: CellRef s a }
If you want the easy growth of Q but like the low pointer overhead of a ring buffer, you can get a middle ground by making each cell have an STArray that slowly fills up. When it's full, allocate a fresh cell; when reading from it empties it, advance to the next cell. You get the idea.
There is a standard implementation of a FIFO queue as two LIFO stacks, one containing items starting from the front of the queue (with the next item to be removed on top), and the other containing items starting from the back (with the most recently pushed item on top). When popping from the queue, if the front stack is empty, you replace it with the reversal of the back stack.
If both stacks are implemented as Haskell lists, then adding a value to the queue is O(1), and removing a value is amortized O(1) if the data structure is used in a single-threaded way. The constant factor isn't bad. You can put the whole data structure in an STRef (which guarantees single-threaded use). The implementation is just a few lines of code, as sketched below. You should definitely do this in preference to your O(n) single-list idea.
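Here is a minimal sketch of that two-stack queue living in a single STRef (the names newQueue/push/pop are just illustrative):

import Control.Monad.ST
import Data.STRef

-- front stack (next item to pop on top) and back stack (most recent push on top)
type Queue s a = STRef s ([a], [a])

newQueue :: ST s (Queue s a)
newQueue = newSTRef ([], [])

push :: Queue s a -> a -> ST s ()
push q x = modifySTRef' q (\(front, back) -> (front, x : back))

pop :: Queue s a -> ST s (Maybe a)
pop q = do
  (front, back) <- readSTRef q
  case front of
    x : xs -> do writeSTRef q (xs, back); return (Just x)
    []     -> case reverse back of
                []     -> return Nothing
                x : xs -> do writeSTRef q (xs, []); return (Just x)

Used as, for example, runST (newQueue >>= \q -> push q 1 >> push q 2 >> pop q), which yields Just 1.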
You can also use Data.Sequence. Like the two-stack queue, it is a pure functional data structure, i.e., operations on it return a new data structure and leave the old one unchanged. But, like the two-stack queue, you can make it mutable by simply writing the new data structure into the STRef that held the old one. The constant factor for Data.Sequence is probably a bit worse than the two-stack queue, but in exchange you get a larger set of efficient operations.
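Under the same approach, a Data.Sequence variant is just as short (again only a sketch):

import Control.Monad.ST
import Data.STRef
import Data.Sequence (Seq, ViewL (..), viewl, (|>))
import qualified Data.Sequence as Seq

type SeqQueue s a = STRef s (Seq a)

newSeqQueue :: ST s (SeqQueue s a)
newSeqQueue = newSTRef Seq.empty

pushSeq :: SeqQueue s a -> a -> ST s ()
pushSeq q x = modifySTRef' q (|> x)

popSeq :: SeqQueue s a -> ST s (Maybe a)
popSeq q = do
  s <- readSTRef q
  case viewl s of
    EmptyL    -> return Nothing
    x :< rest -> do writeSTRef q rest; return (Just x)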
The mutable list in David Wagner's answer is likely to be less efficient because it requires two heap objects per item in the queue. You may be able to avoid that in GHC by writing
Cell a {-# UNPACK #-} !(CellRef s a)
in place of Cell a (CellRef s a). I'm not certain that that will work, though. If it does, this is likely to be somewhat faster than the other list-based approaches.
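Spelled out, that suggestion gives the following declaration (untested, as noted above):

data Cell s a = End | Cell a {-# UNPACK #-} !(CellRef s a)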

Haskell list construction and memory usage

Suppose I have the following piece of code:
a = reverse b
doSomething a
Will memory for the list a be actually allocated, or will doSomething simply reuse the list b? If the memory is going to be allocated, is there a way to avoid it? Doubling memory usage just because I need a reversed list doesn't sound particularly nice.
In the worst case, both a and b will exist in memory in their entirety. Note that even then, the contents of the two lists will only exist once, shared between both lists, so we're only talking about the "spine" of the lists existing twice.
In the best case, depending on how b is defined and what doSomething does, the compiler might do some hoopy magic to turn the whole thing into a tight constant-space loop that generates the contents of the list as it processes them, possibly involving no memory allocation at all. Maybe.
But even in the very worst case, you're duplicating the spine of the lists. You'll never duplicate the actual elements in the list.
(Each cons node is three words in GHC: a header plus two pointers, one to the element and one to the rest of the list.)

Data.Map: how do I tell if I "need value-strict maps"?

When choosing between Data.Map.Lazy and Data.Map.Strict, the docs tell us for the former:
API of this module is strict in the keys, but lazy in the values. If you need value-strict maps, use Data.Map.Strict instead.
and for the latter likewise:
API of this module is strict in both the keys and the values. If you need value-lazy maps, use Data.Map.Lazy instead.
How do more seasoned Haskellers than me tend to intuit this "need"? Use-case in point, in a run-and-done (ie. not daemon-like/long-running) command-line tool: readFileing a simple lines-based custom config file where many (not all) lines define key:value pairs to be collected into a Map. Once done, we rewrite many values in it depending on other values in it that were read later (thanks to immutability, in this process we create a new Map and discard the initial incarnation).
(Although in practice this file likely won't often or ever reach even a 1000 lines, let's just assume for the sake of learning that for some users it will before long.)
Any given run of the tool will perhaps lookup some 20-100% of the (rewritten on load, although with lazy-eval I'm never quite sure "when really") key:value pairs, anywhere between once and dozens of times.
How do I reason about the differences between "value-strict" and "value-lazy" Data.Maps here? What happens "under the hood", in terms of mainstream computing if you will?
Fundamentally, such hash-maps are of course about "storing once, looking up many times" --- but then, what in computing isn't, "fundamentally". And furthermore the whole concept of lazy-eval's thunks seems to boil down to this very principle, so why not always stay value-lazy?
How do I reason about the differences between "value-strict" and "value-lazy" Data.Maps here?
Value-lazy is the norm in Haskell. This means that not necessarily values, but thunks (i.e. recipes for how to compute the value) are stored. For example, let's say you extract the value from a line like this:
tail (dropWhile (/= ':') line)
Then a value-strict map would actually extract the value upon insert, while a lazy one would happily just remember how to get it. That thunk is then also what you would get back on a lookup.
Here are some pros and cons:
lazy values may need more memory, not only for the thunk itself, but also for the data that are referenced there (here line).
strict values may need more memory. In our case this could be so when the string gets interpreted to yield some memory hungry structure like lists, JSON or XML.
using lazy values may need less CPU if your code doesn't need every value.
deeply nested thunks may cause stack overflows when the value is finally needed.
there is also a semantic difference: in lazy mode, you may get away with extraction code that would fail (like the code above, which fails if there isn't a ':' on the line), as long as you only check whether the key is present. In strict mode, your program crashes upon insert.
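A tiny sketch of that semantic difference, with an error value standing in for extraction code that fails:

import qualified Data.Map.Lazy as ML
import qualified Data.Map.Strict as MS

lazyOk :: Bool
lazyOk = ML.member "k" (ML.insert "k" (error "no ':' on line" :: String) ML.empty)
-- True: the bad value is never demanded

strictBoom :: Bool
strictBoom = MS.member "k" (MS.insert "k" (error "no ':' on line" :: String) MS.empty)
-- throws when evaluated: the strict insert forces the value right away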
As always, there are no fixed measures like: "If your evaluated value needs less than 20 bytes and takes less than 30µs to compute, use strict, else use lazy."
Normally, you just go with one and when you notice extreme runtimes/memory usage you try the other.
Here's a small experiment that shows a difference between Data.Map.Lazy and Data.Map.Strict. This code exhausts the heap:
import Data.Foldable
import qualified Data.Map.Lazy as M

main :: IO ()
main = print $ foldl' (\kv i -> M.adjust (+i) 'a' kv)
               (M.fromList [('a',0)])
               (cycle [0])
(Better to compile with a small maximum heap, like ghc Main.hs -with-rtsopts="-M20m".)
The foldl' keeps the map in WHNF as we iterate over the infinite list of zeros. However, thunks accumulate in the modified value until the heap is exhausted.
The same code with Data.Map.Strict simply loops forever. In the strict variant, the values are in WHNF whenever the map is in WHNF.

Call by need: When is it used in Haskell?

http://en.wikipedia.org/wiki/Evaluation_strategy#Call_by_need says:
"Call-by-need is a memoized version of call-by-name where, if the function argument is evaluated, that value is stored for subsequent uses. [...] Haskell is the most well-known language that uses call-by-need evaluation."
However, the value of a computation is not always stored for faster access (for example consider a recursive definition of fibonacci numbers). I asked someone on #haskell and the answer was that this memoization is done automatically "only in one instance, e.g. if you have `let foo = bar baz', foo will be evaluated once".
My question is: what exactly does "instance" mean here? Are there cases other than let in which memoization is done automatically?
Describing this behavior as "memoization" is misleading. "Call by need" just means that a given input to a function will be evaluated somewhere between 0 and 1 times, never more than once. (It could be partially evaluated as well, which means the function only needed part of that input.) In contrast, "call by name" is simply expression substitution, which means if you give the expression 2 + 3 as an input to a function, it may be evaluated multiple times if the input is used more than once. Both call by need and call by name are non-strict: if the input is not used, then it is never evaluated. Most programming languages are strict, and use a "call by value" approach, which means that all inputs are evaluated before you begin evaluating the function, whether or not the inputs are used. This all has nothing to do with let expressions.
Haskell does not perform any automatic memoization. Let expressions are not an example of memoization. However, most compilers will evaluate let bindings in a call-by-need-esque fashion. If you model a let expression as a function, then the "call by need" mentality does apply:
let foo = expression one in expression two that uses foo
==>
(\foo -> expression two that uses foo) (expression one)
This doesn't correctly model recursive bindings, but you get the idea.
The haskell language definition does not define when, or how often, code is invoked. Infinite loops are defined in terms of 'the bottom' (written ⊥), which is a value (which exists within all types) that represents an error condition. The compiler is free to make its own decisions regarding when and how often to evaluate things as long as the program (and presence/absence of error conditions, including infinite loops!) behaves according to spec.
That said, the usual way of doing this is that most expressions generate 'thunks' - basically a pointer to some code and some context data. The first time you attempt to examine the result of the expression (ie, pattern match it), the thunk is 'forced'; the pointed-to code is executed, and the thunk overwritten with real data. This in turn can recursively evaluate other thunks.
Of course, doing this all the time is slow, so the compiler usually tries to analyze when you'd end up forcing a thunk right away anyway (ie, when something is 'strict' on the value in question), and if it finds this, it'll skip the whole thunk thing and just call the code right away. If it can't prove this, it can still make this optimization as long as it makes sure that executing the thunk right away can't crash or cause an infinite loop (or it handles these conditions somehow).
If you don't want to have to get very technical about this, the essential point is that when you have an expression like some_expensive_computation of all these arguments, you can do whatever you want with it; store it in a data structure, create a list of 53 copies of it, pass it to 6 other functions, etc, and then even return it to your caller for the caller to do whatever it wants with it.
What Haskell will (mostly) do is evaluate it at most once; if the program ever needs to know something about what that expression returned in order to make a decision, then it will be evaluated (at least enough to know which way the decision should go). That evaluation will affect all the other references to the same expression, even if they are now scattered around in data structures and other not-yet-evaluated expressions all throughout your program.
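A small sketch of that "at most once, shared by every reference" behaviour, using Debug.Trace purely as instrumentation to make the evaluation visible:

import Debug.Trace (trace)

main :: IO ()
main = do
  let shared = trace "evaluating shared" (sum [1 .. 1000000 :: Int])
  print (shared + shared)        -- "evaluating shared" is printed only once
  print (map (+ shared) [1, 2])  -- still not recomputed; the evaluated value is reused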

Keeping State in a Purely Functional Language

I am trying to figure out how to do the following. Assume that you are working on a controller for a DC motor, and you want to keep it spinning at a certain speed set by the user:
(def set-point (ref {:sp 90}))

(while true
  (let [curr (read-speed)]
    (controller @set-point curr)))
Now, that set-point can change at any time via a web application. I can't think of a way to do this without using ref, so my question is: how do functional languages deal with this sort of thing? (Even though the example is in Clojure, I am interested in the general idea.)
This will not answer your question but I want to show how these things are done in Clojure. It might help someone reading this later so they don't think they have to read up on monads, reactive programming or other "complicated" subjects to use Clojure.
Clojure is not a purely functional language and in this case it might be a good idea to leave the pure functions aside for a moment and model the inherent state of the system with identities.
In Clojure, you would probably use one of the reference types. There are several to choose from and knowing which one to use might be difficult. The good news is they all support the unified update model, so changing the reference type later should be pretty straightforward.
I've chosen an atom but depending on your requirements it might be more appropriate to use a ref or an agent.
The motor is an identity in your program. It is a "label" for some thing that has different values at different times and these values are related to each other (i.e., the speed of the motor). I have put a :validator on the atom to ensure that the speed never drops below zero.
(def motor (atom {:speed 0} :validator (comp not neg? :speed)))
(defn add-speed [n]
  (swap! motor update-in [:speed] + n))

(defn set-speed [n]
  (swap! motor update-in [:speed] (constantly n)))
> (add-speed 10)
> (add-speed -8)
> (add-speed -4) ;; This will not change the state of motor,
                 ;; since the speed would drop below zero and
                 ;; the validator does not allow that!
> (:speed @motor)
2
> (set-speed 12)
> (:speed @motor)
12
If you want to change the semantics of the motor identity you have at least two other reference types to choose from.
If you want to change the speed of the motor asynchronously you would use an agent. Then you need to replace swap! with send. This would be useful if, for example, the clients adjusting the motor speed are different from the clients using the motor speed, so that it's fine for the speed to be changed "eventually".
Another option is to use a ref, which would be appropriate if the motor needs to coordinate with other identities in your system. If you choose this reference type, you replace swap! with alter. In addition, all state changes are run in a transaction with dosync to ensure that all identities in the transaction are updated atomically.
Monads are not needed to model identities and state in Clojure!
For this answer, I'm going to interpret "a purely functional language" as meaning "an ML-style language that excludes side effects" which I will interpret in turn as meaning "Haskell" which I'll interpret as meaning "GHC". None of these are strictly true, but given that you're contrasting this with a Lisp derivative and that GHC is rather prominent, I'm guessing this will still get at the heart of your question.
As always, the answer in Haskell is a bit of sleight-of-hand where access to mutable data (or anything with side effects) is structured in such a way that the type system guarantees that it will "look" pure from the inside, while producing a final program that has side effects where expected. The usual business with monads is a large part of this, but the details don't really matter and mostly distract from the issue. In practice, it just means you have to be explicit about where side effects can occur and in what order, and you're not allowed to "cheat".
Mutability primitives are generally provided by the language runtime, and accessed through functions that produce values in some monad also provided by the runtime (often IO, sometimes more specialized ones). First, let's take a look at the Clojure example you provided: it uses ref, which is described in the documentation here:
While Vars ensure safe use of mutable storage locations via thread isolation, transactional references (Refs) ensure safe shared use of mutable storage locations via a software transactional memory (STM) system. Refs are bound to a single storage location for their lifetime, and only allow mutation of that location to occur within a transaction.
Amusingly, that whole paragraph translates pretty directly to GHC Haskell. I'm guessing that "Vars" are equivalent to Haskell's MVar, while "Refs" are almost certainly equivalent to TVar as found in the stm package.
So to translate the example to Haskell, we'll need a function that creates the TVar:
setPoint :: STM (TVar Int)
setPoint = newTVar 90
...and we can use it in code like this:
updateLoop :: IO ()
updateLoop = do tvSetPoint <- atomically setPoint
                sequence_ . repeat $ update tvSetPoint
  where update tv = do curSpeed <- readSpeed
                       curSet   <- atomically $ readTVar tv
                       controller curSet curSpeed
In actual use my code would be far more terse than that, but I've left things more verbose here in hopes of being less cryptic.
I suppose one could object that this code isn't pure and is using mutable state, but... so what? At some point a program is going to run and we'd like it to do input and output. The important thing is that we retain all the benefits of code being pure, even when using it to write code with mutable state. For instance, I've implemented an infinite loop of side effects using the repeat function; but repeat is still pure and behaves reliably and nothing I can do with it will change that.
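To make that concrete, here is a self-contained sketch along the same lines; readSpeed and controller are hypothetical stubs, and a second thread stands in for the web application that changes the set point:

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM (atomically, newTVarIO, readTVarIO, writeTVar)
import Control.Monad (forever)

-- hypothetical stand-ins for the hardware interface in the question
readSpeed :: IO Int
readSpeed = return 85

controller :: Int -> Int -> IO ()
controller setPt cur = putStrLn ("set point " ++ show setPt ++ ", current " ++ show cur)

main :: IO ()
main = do
  setPoint <- newTVarIO (90 :: Int)
  _ <- forkIO $ do                     -- the "web application" thread
         threadDelay 1000000
         atomically (writeTVar setPoint 120)
  forever $ do                         -- the control loop
    cur <- readSpeed
    sp  <- readTVarIO setPoint
    controller sp cur
    threadDelay 100000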
A technique to tackle problems that apparently scream for mutability (like GUI or web applications) in a functional way is Functional Reactive Programming.
The pattern you need for this is called a monad. If you really want to get into functional programming you should try to understand what monads are used for and what they can do. As a starting point I would suggest this link.
As a short informal explanation for monads:
Monads can be seen as data plus a context that is passed around in your program. This is the "space suit" often used in explanations. You pass data and context around together and insert operations into this monad. There is usually no way to get the data back out once it is inserted into the context; you can only go the other way round, inserting operations so that they handle the data combined with its context. This way it almost seems as if you get the data out, but if you look closely you never do.
Depending on your application the context can be almost anything: a data structure that combines multiple entities, exceptions, optionals, or the real world (the I/O monad). In the paper linked above the context is the execution state of an algorithm, so this is quite similar to the things you have in mind.
In Erlang you could use a process to hold the value. Something like this:
holdVar(SomeVar) ->
    receive                            %% wait for a message
        {From, get} ->                 %% if you receive a get
            From ! {value, SomeVar},   %% respond with SomeVar
            holdVar(SomeVar);          %% recurse to start listening again
        {From, {set, SomeNewVar}} ->   %% if you receive a set
            From ! {ok},               %% respond with ok
            holdVar(SomeNewVar)        %% recurse with the SomeNewVar that
                                       %% you received in the message
    end.
