Best (mutable) queue data structure available in Haskell

Dear(est) Stack Exchangers,
I am currently implementing some algorithms that require a queue (FIFO) data structure. I am using the ST monad, and am therefore looking for queue implementations that work well with the ST monad's mutable memory. At this point, I am tempted to just use newSTRef on a list (but then accessing the last element is O(n), which I would like to avoid as much as I can). I also thought of using Data.Sequence, though I am not sure whether it would actually be “mutable” if used inside the ST monad without being wrapped in an STRef.
Can the kind members of Stack Exchange guide a beginner in Haskell as to what would be the best data structure (or module) in the aforementioned context?

Options include implementing a traditional ring buffer on top of STArray, or using a mutable singly-linked list built out of STRefs, as in:
type CellRef s a = STRef s (Cell s a)
data Cell s a = End | Cell a (CellRef s a)
data Q s a = Q { readHead, writeHead :: CellRef s a }
If you want the easy growth of Q but like the low pointer overhead of a ring buffer, you can get a middle ground by making each cell have an STArray that slowly fills up. When it's full, allocate a fresh cell; when reading from it empties it, advance to the next cell. You get the idea.
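The linked-list option can be sketched as follows. This is a minimal illustration, not a definitive implementation: it keeps the Cell and CellRef types from above, but the Queue record (whose two heads are themselves stored in STRefs so they can advance in place) and the names newQueue, enqueue, dequeue and demoLinked are assumptions introduced here.

```haskell
import Control.Monad.ST (ST, runST)
import Data.STRef (STRef, newSTRef, readSTRef, writeSTRef)

type CellRef s a = STRef s (Cell s a)
data Cell s a = End | Cell a (CellRef s a)

-- Hypothetical variant of Q: both heads live in STRefs so they can move.
data Queue s a = Queue
  { readHead  :: STRef s (CellRef s a)  -- cell to read next
  , writeHead :: STRef s (CellRef s a)  -- the End cell that the next push overwrites
  }

newQueue :: ST s (Queue s a)
newQueue = do
  end <- newSTRef End
  Queue <$> newSTRef end <*> newSTRef end

-- O(1): turn the current End cell into a value cell pointing at a fresh End.
enqueue :: Queue s a -> a -> ST s ()
enqueue q x = do
  newEnd <- newSTRef End
  oldEnd <- readSTRef (writeHead q)
  writeSTRef oldEnd (Cell x newEnd)
  writeSTRef (writeHead q) newEnd

-- O(1): read the front cell and advance the read head past it.
dequeue :: Queue s a -> ST s (Maybe a)
dequeue q = do
  cellRef <- readSTRef (readHead q)
  cell <- readSTRef cellRef
  case cell of
    End -> pure Nothing
    Cell x next -> do
      writeSTRef (readHead q) next
      pure (Just x)

demoLinked :: [Maybe Int]
demoLinked = runST $ do
  q <- newQueue
  mapM_ (enqueue q) [1, 2, 3]
  a <- dequeue q
  b <- dequeue q
  enqueue q 4
  c <- dequeue q
  d <- dequeue q
  e <- dequeue q
  pure [a, b, c, d, e]
```

Here demoLinked evaluates to [Just 1, Just 2, Just 3, Just 4, Nothing]: items come out in insertion order, and an empty queue yields Nothing.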

There is a standard implementation of a FIFO queue as two LIFO stacks, one containing items starting from the front of the queue (with the next item to be removed on top), and the other containing items starting from the back (with the most recently pushed item on top). When popping from the queue, if the front stack is empty, you replace it with the reversal of the back stack.
If both stacks are implemented as Haskell lists, then adding a value to the queue is O(1), and removing a value is amortized O(1) if the data structure is used in a single-threaded way. The constant factor isn't bad. You can put the whole data structure in an STRef (which guarantees single-threaded use). The implementation is just a few lines of code. You should definitely do this in preference to your O(n) single-list idea.
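A minimal sketch of the two-stack queue stored in an STRef might look like this. The names TwoStack, push, pop, QRef, newQ, enqueueST, dequeueST and demoTwoStack are made up for illustration.

```haskell
import Control.Monad.ST (ST, runST)
import Data.STRef (STRef, newSTRef, readSTRef, writeSTRef, modifySTRef')

-- Front stack (next item to remove on top) and back stack (newest item on top).
data TwoStack a = TwoStack [a] [a]

push :: a -> TwoStack a -> TwoStack a
push x (TwoStack f b) = TwoStack f (x : b)

-- When the front stack is empty, replace it with the reversed back stack.
pop :: TwoStack a -> Maybe (a, TwoStack a)
pop (TwoStack (x : f) b) = Just (x, TwoStack f b)
pop (TwoStack [] b) = case reverse b of
  []      -> Nothing
  (x : f) -> Just (x, TwoStack f [])

type QRef s a = STRef s (TwoStack a)

newQ :: ST s (QRef s a)
newQ = newSTRef (TwoStack [] [])

enqueueST :: QRef s a -> a -> ST s ()
enqueueST ref x = modifySTRef' ref (push x)

dequeueST :: QRef s a -> ST s (Maybe a)
dequeueST ref = do
  q <- readSTRef ref
  case pop q of
    Nothing      -> pure Nothing
    Just (x, q') -> writeSTRef ref q' >> pure (Just x)

demoTwoStack :: [Maybe Char]
demoTwoStack = runST $ do
  q <- newQ
  mapM_ (enqueueST q) "abc"
  x <- dequeueST q
  enqueueST q 'd'
  rest <- mapM (const (dequeueST q)) [1 .. 4 :: Int]
  pure (x : rest)
```

demoTwoStack evaluates to [Just 'a', Just 'b', Just 'c', Just 'd', Nothing]. The pure push and pop functions can also be used on their own, outside ST.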
You can also use Data.Sequence. Like the two-stack queue, it is a pure functional data structure, i.e., operations on it return a new data structure and leave the old one unchanged. But, like the two-stack queue, you can make it mutable by simply writing the new data structure into the STRef that held the old one. The constant factor for Data.Sequence is probably a bit worse than the two-stack queue, but in exchange you get a larger set of efficient operations.
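The same pattern with Data.Sequence could be sketched like this, assuming the containers package; pushSeq, popSeq and demoSeq are hypothetical helper names.

```haskell
import Control.Monad.ST (ST, runST)
import Data.STRef (STRef, newSTRef, readSTRef, writeSTRef, modifySTRef')
import Data.Sequence (Seq, ViewL (..), viewl, (|>))
import qualified Data.Sequence as Seq

-- Append at the right end by writing the new Seq back into the STRef.
pushSeq :: STRef s (Seq a) -> a -> ST s ()
pushSeq ref x = modifySTRef' ref (|> x)

-- Remove from the left end, if there is anything to remove.
popSeq :: STRef s (Seq a) -> ST s (Maybe a)
popSeq ref = do
  s <- readSTRef ref
  case viewl s of
    EmptyL  -> pure Nothing
    x :< s' -> writeSTRef ref s' >> pure (Just x)

demoSeq :: [Maybe Int]
demoSeq = runST $ do
  ref <- newSTRef Seq.empty
  mapM_ (pushSeq ref) [10, 20]
  a <- popSeq ref
  pushSeq ref 30
  b <- popSeq ref
  c <- popSeq ref
  d <- popSeq ref
  pure [a, b, c, d]
```

demoSeq evaluates to [Just 10, Just 20, Just 30, Nothing].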
The mutable list in David Wagner's answer is likely to be less efficient because it requires two heap objects per item in the queue. You may be able to avoid that in GHC by writing
Cell a {-# UNPACK #-} !(CellRef s a)
in place of Cell a (CellRef s a). I'm not certain that that will work, though. If it does, this is likely to be somewhat faster than the other list-based approaches.

Related

Are sequences faster than vectors for searching in haskell?

I am fairly new to using data structures in Haskell other than lists. My goal is to choose one container among Data.Vector, Data.Sequence, Data.List, etc. My problem is the following:
I have to create a sequence (mathematically speaking). The sequence starts at 0. In each iteration two new elements are generated, but only one is appended, depending on whether the first element is already in the sequence. So in each iteration there is a call to the elem function (see the pseudo-code below).
appendNewItem :: [Integer] -> [Integer]
appendNewItem acc = let firstElem  = someFunc
                        secondElem = someOtherFunc
                        newElem    = if firstElem `elem` acc
                                       then secondElem
                                       else firstElem
                    in acc `append` newElem

sequenceUptoN :: Int -> [Integer]
sequenceUptoN n = (iterate appendNewItem [0]) !! n
Here the append and iterate functions vary depending on which collection you use (I am using lists in the type signature for simplicity).
The question is: which data structure should I use? Is Data.Sequence faster for this task because of its finger-tree internal structure?
Thanks a lot!!
No, sequences are not faster for searching. A Vector is just a flat chunk of memory, which gives generally the best lookup performance. If you want to optimise searching, use Data.Vector.Unboxed. (The normal, “boxed” variant is also pretty good, but it actually contains only references to the elements in the flat memory-chunk, so it's not quite as fast for lookups.)
However, because of the flat memory layout, Vectors are not good for (pure-functional) appending: basically, whenever you add a new element, the whole array must be copied so as to not invalidate the old one (which somebody else might still be using). If you need to append, Seq is a pretty good choice, although it's not as fast as destructive appending: for maximum performance, you'll want to pre-allocate an uninitialized Data.Vector.Unboxed.Mutable.MVector of the required size, populate it using the ST monad, and freeze the result. But this is much more fiddly than purely-functional alternatives, so unless you need to squeeze out every bit of performance, Data.Sequence is the way to go. If you only want to append, but not look up elements, then a plain old list in reverse order would also do the trick.
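The pre-allocate, populate and freeze approach can be sketched as below. To stay within the libraries bundled with GHC, this sketch swaps in STUArray from the array package for vector's MVector; the idea is the same, but the API is not the one named above. The function name squares is made up for illustration.

```haskell
import Data.Array.ST (newArray, runSTUArray, writeArray)
import Data.Array.Unboxed (UArray, elems)

-- Allocate a mutable unboxed array once, fill it destructively in ST,
-- and let runSTUArray freeze it into an immutable UArray without copying.
squares :: Int -> UArray Int Int
squares n = runSTUArray $ do
  arr <- newArray (0, n - 1) 0
  mapM_ (\i -> writeArray arr i (i * i)) [0 .. n - 1]
  pure arr

firstSquares :: [Int]
firstSquares = elems (squares 5)
```

firstSquares evaluates to [0, 1, 4, 9, 16].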
I suggest using Data.Sequence in conjunction with Data.Set. The Sequence to hold the sequence of values and the Set to track the collection.
Sequence, List, and Vector are all structures for working with values where the position in the structure is what matters most for indexing. In lists we can manipulate elements at the front efficiently, in sequences we can manipulate elements in time logarithmic in the distance to the closest end, and in vectors we can access any element in constant time. Vectors, however, are not that useful if the length keeps changing, so that rules out their use here.
However, you also need to look up a certain value within the list, which these structures don't help with. You have to search the whole of a list/sequence/vector to be certain that a new value isn't present. Data.Map and Data.Set are two of the structures for which you define an index value based on Ord, and which let you look up and insert in O(log n). So, at the cost of extra memory, you can check for the presence of firstElem in your Set in O(log n) time and then add newElem to the end of the sequence in constant time. Just make sure to keep these two structures in sync when adding or removing elements.
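A sketch of the Seq-plus-Set combination follows. Since someFunc and someOtherFunc are not given in the question, this example assumes a Recamán-style rule as a stand-in (first candidate is last minus n, second is last plus n, and negative candidates are rejected). The names Acc, step and sequenceUpToN are likewise hypothetical.

```haskell
import Data.Foldable (foldl', toList)
import Data.Sequence (Seq, (|>))
import qualified Data.Sequence as Seq
import qualified Data.Set as Set

-- The Seq keeps the order; the Set answers "already present?" in O(log n).
data Acc = Acc
  { accItems :: !(Seq Integer)
  , accSeen  :: !(Set.Set Integer)
  , accLast  :: !Integer
  }

-- Assumed rule (not from the question): a Recamán-style choice of candidates.
step :: Acc -> Integer -> Acc
step (Acc items seen lastElem) n =
  let firstElem  = lastElem - n
      secondElem = lastElem + n
      newElem
        | firstElem < 0 || firstElem `Set.member` seen = secondElem
        | otherwise = firstElem
  in Acc (items |> newElem) (Set.insert newElem seen) newElem

-- Both structures are updated together in step, so they stay in sync.
sequenceUpToN :: Int -> [Integer]
sequenceUpToN n =
  toList (accItems (foldl' step start [1 .. fromIntegral n]))
  where
    start = Acc (Seq.singleton 0) (Set.singleton 0) 0
```

With this assumed rule, sequenceUpToN 9 evaluates to [0, 1, 3, 6, 2, 7, 13, 20, 12, 21].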

Understanding Haskell's `map` - Stack or Heap?

Given the following function:
f :: [String]
f = map integerToWord [1..999999999]
integerToWord :: Integer -> String
Let's ignore the implementation. Here's a sample output:
ghci> integerToWord 123999
"onehundredtwentythreethousandandninehundredninetynine"
When I execute f, do all results, i.e. integerToWord 1 through integerToWord 999999999, get stored on the stack or heap?
Note - I'm assuming that Haskell has a stack and heap.
After running this function for ~1 minute, I don't see the RAM increasing from its original usage.
To be precise: when you "just execute" f, it is not evaluated unless you use its result somehow. When you do, how much is stored depends on what the caller needs.
As for this example: it's not stored anywhere. The function is applied to each number, the result is printed to your terminal and then discarded. So at any given moment you only allocate enough memory to hold the current value and its result (an approximation, but precise enough for this case).
References:
https://wiki.haskell.org/Non-strict_semantics
https://wiki.haskell.org/Lazy_vs._non-strict
First: To split hairs, the following answer applies to GHC. A different Haskell compiler could plausibly implement things differently.
There is indeed a heap and a stack. Almost everything goes on the heap, and hardly anything goes on the stack.
Consider, for example, the expression
let x = foo 17 in ...
Let's assume that the optimiser doesn't transform this into something completely different. The call to foo doesn't appear on the stack at all; instead, we create a note on the heap saying that we need to do foo 17 at some point, and x becomes a pointer to this note.
So, to answer your question: when you call f, a note that says "we need to execute map integerToWord [1..999999999] someday" gets stored on the heap, and you get a pointer to that. What happens next depends on what you do with that result.
If, for example, you try to print the entire thing, then yes, the result of every call to integerToWord ends up on the heap. At any given moment, only a single call to integerToWord is on the stack.
Alternatively, if you just try to access the 8th element of the result, then a bunch of "call integerToWord 5 someday" notes end up on the heap, plus the result of integerToWord 8, plus a note for the rest of the list.
Incidentally, there's a package out there ("vacuum"?) which lets you print out the actual object graphs for what you're executing. You might find it interesting.
GHC programs use a stack and a heap... but it doesn't work at all like the eager language stack machines you're familiar with. Somebody else is gonna have to explain this, because I can't.
The other challenge in answering your question is that GHC uses the following two techniques:
Lazy evaluation
List fusion
Lazy evaluation in Haskell means that (as the default rule) expressions are only evaluated when their value is demanded, and even then they may only be partially evaluated, only as far as needed to resolve a pattern match that requires the value. So we can't say what your map example does without knowing what is demanding its value.
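A small sketch of that demand-driven behaviour: describe is a made-up stand-in for integerToWord that deliberately fails on one input, which shows that elements nobody asks for are never evaluated.

```haskell
-- Hypothetical stand-in for integerToWord; evaluating element 3 would crash.
describe :: Integer -> String
describe n
  | n == 3    = error "element 3 was evaluated"
  | otherwise = "number " ++ show n

results :: [String]
results = map describe [1 .. 999999999]

-- Only the fifth element is demanded; the thunks before it are skipped, not run.
fifth :: String
fifth = results !! 4

-- length needs only the list cells, not the strings inside them.
count :: Int
count = length (take 10 results)
```

Here fifth is "number 5" and count is 10, and neither triggers the error, because describe 3 is never demanded.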
List fusion is a set of rewrite rules built into GHC that recognize a number of situations where the output of a "good" list producer is only ever consumed as the input of a "good" list consumer. In these cases, GHC can fuse the producer and the consumer into an object-code loop without ever allocating list cells.
In your case:
[1..999999999] is a good producer
map is both a good consumer and a good producer
But you seem to be using ghci, which doesn't do fusion. You need to compile your program with -O for fusion to happen.
You haven't told us what would be consuming the output of the map. If it's a good consumer it will fuse with the map.
But there's a good chance that GHC would eliminate most or all of the list cell allocations if you compiled (with -O) a program that just prints the result of that code. In that case, the list would not exist as a data structure in memory at all—the compiler would generate object code that does something roughly equivalent to this:
for (int i = 1; i <= 999999999; i++) {
    print(integerToWord(i));
}

Any way to manually indicate element of a MutableArray# safe to GC?

In my application I'm working with MutableArrays (via the primitive package) shared across threads. I know when individual elements are no longer used and I'd like some way (unsafeMarkGarbage or something) to indicate to the runtime that they can be collected. At least I'd like to experiment with that if such a function or equivalent technique exists.
EDIT, to add a bit more detail: I've got a conceptual "infinite tape" implemented as a linked list of short MutableArray segments, something like:
data Seg a = Seg (MutableArray RealWorld a) (IORef (Maybe (Seg a)))
I access the tape using a concurrent counter and always know when an element of the tape will no longer be accessed. In certain cases when a thread is descheduled it's possible that entire array segments (both the array and its elements) which could have been GC'd will stick around as their references will persist.
An ideal solution would avoid an additional write (maybe that's silly), avoid another layer of indirection in the array, and allow entire MutableArrays to be collected when all their elements expire.
Weak references do seem to be the most promising sort of mechanism I've seen, but I can't yet see how they can help me here.
I would suggest you store undefined in the positions that you would like to garbage collect.
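A minimal sketch of that suggestion, using a boxed IOArray from the array package as a stand-in for primitive's MutableArray (an assumption made to keep the example within GHC's bundled libraries). The helper name releaseSlot is made up. It does cost the extra write the question hoped to avoid.

```haskell
import Data.Array.IO (IOArray, newListArray, readArray, writeArray)

-- Overwrite a slot with a placeholder thunk so the array no longer references
-- the old element. The placeholder is never evaluated unless someone reads
-- and forces the slot afterwards.
releaseSlot :: IOArray Int a -> Int -> IO ()
releaseSlot arr i = writeArray arr i (error "slot already released")

demoRelease :: IO (String, String)
demoRelease = do
  arr <- newListArray (0, 2) ["a", "b", "c"]
  releaseSlot arr 1
  x <- readArray arr 0
  z <- readArray arr 2
  pure (x, z)
```

Once the slot is overwritten, the old element can be collected as soon as nothing else references it. The other slots are unaffected.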

Is the whole Map copied when a new binding is inserted?

I would like to better understand the internals of e.g. Data.Map. When I insert a new binding into a Map, then, because data is immutable, I get back a new data structure that is identical to the old one plus the new binding.
I would like to understand how this is achieved. Does the compiler eventually implement this by copying the whole data structure with e.g. millions of bindings? Can it generally be said that mutable data structures/arrays (e.g. Data.Judy) or imperative programming languages perform better in such cases? Does immutable data have any advantage when it comes to dictionaries/key-value stores?
Map is built on a tree data structure. Basically, a new Map value is constructed, but it'll be filled almost entirely with pointers to the old structure. Since values never change in Haskell, this is a safe, and very important optimisation, known as sharing.
This means that you can have many similar versions of the same data structure hanging around, but only the branches of the tree that differ will be stored anew; the rest will simply be pointers to the original copy of the branch. And, of course, if you throw away the old Map, the branches you did change will be reclaimed by the garbage collector.
Sharing is key to the performance of immutable data structures. You might find this Wikipedia article helpful; it has some enlightening graphs showing how modified data gets represented with sharing.
No. The documentation for Data.Map.insert states that insertion takes O(log n) time. It would be impossible to satisfy that bound if it had to copy the entire structure.
Data.Map doesn't copy the old map; it (lazily) allocates O(log N) new nodes, which point to (and thereby share) most of the old map.
Because "updating" the map doesn't disrupt old versions, this kind of datastructure gives you greater freedom in building concurrent algorithms.
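A short sketch of what this means in practice, assuming the containers package (the names oldMap, newMap and persistenceDemo are made up): the old map remains fully usable after the insert. The example shows persistence only; the sharing itself is an internal detail that is not observable from pure code.

```haskell
import qualified Data.Map.Strict as Map

oldMap :: Map.Map Int String
oldMap = Map.fromList [(k, show k) | k <- [1 .. 1000]]

-- insert returns a new map; oldMap is left untouched.
newMap :: Map.Map Int String
newMap = Map.insert 1001 "new" oldMap

persistenceDemo :: (Int, Int, Maybe String, Maybe String)
persistenceDemo =
  ( Map.size oldMap
  , Map.size newMap
  , Map.lookup 1001 oldMap
  , Map.lookup 1001 newMap
  )
```

persistenceDemo evaluates to (1000, 1001, Nothing, Just "new").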

Is it possible to make fast big circular buffer arrays for stream recording in Haskell?

I'm considering converting a C# app to Haskell as my first "real" Haskell project. However I want to make sure it's a project that makes sense. The app collects data packets from ~15 serial streams that come at around 1 kHz, loads those values into the corresponding circular buffers on my "context" object, each with ~25000 elements, and then at 60 Hz sends those arrays out to OpenGL for waveform display. (Thus it has to be stored as an array, or at least converted to an array every 16 ms). There are also about 70 fields on my context object for which I only maintain the current (latest) value, not the stream waveform.
There are several aspects of this project that map well to Haskell, but the thing I worry about is the performance. If for each new datapoint in any of the streams, I'm having to clone the entire context object with 70 fields and 15 25000-element arrays, obviously there's going to be performance issues.
Would I get around this by putting everything in the IO-monad? But then that seems to somewhat defeat the purpose of using Haskell, right? Also all my code in C# is event-driven; is there an idiom for that in Haskell? It seems like adding a listener creates a "side effect" and I'm not sure how exactly that would be done.
Look at this link, under the section "The ST monad":
http://book.realworldhaskell.org/read/advanced-library-design-building-a-bloom-filter.html
Back in the section called “Modifying array elements”, we mentioned
that modifying an immutable array is prohibitively expensive, as it
requires copying the entire array. Using a UArray does not change
this, so what can we do to reduce the cost to bearable levels?
In an imperative language, we would simply modify the elements of the
array in place; this will be our approach in Haskell, too.
Haskell provides a special monad, named ST, which lets us work
safely with mutable state. Compared to the State monad, it has some
powerful added capabilities.
We can thaw an immutable array to give a mutable array; modify the
mutable array in place; and freeze a new immutable array when we are
done.
...
The IO monad also provides these capabilities. The major difference between the two is that the ST monad is intentionally designed so that we can escape from it back into pure Haskell code.
So it should be possible to modify in place, and it won't defeat the purpose of using Haskell after all.
Yes, you would probably want to use the IO monad for mutable data. I don't believe the ST monad is a good fit for this problem space because the data updates are interleaved with actual IO actions (reading input streams). As you would need to perform the IO within ST by using unsafeIOToST, I find it preferable to just use IO directly. The other approach with ST is to continually thaw and freeze an array; this is messy because you need to guarantee that old copies of the data are never used.
Although evidence shows that a pure solution (in the form of Data.Sequence.Seq) is often faster than using mutable data, given your requirement that data be pushed out to OpenGL, you'll possibly get better performance from working with the array directly. I would use the functions from Data.Vector.Storable.Mutable (from the vector package), as then you have access to the ForeignPtr for export.
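A rough sketch of an in-place circular buffer in IO. It swaps in IOUArray from the array package (bundled with GHC) for Data.Vector.Storable.Mutable, so it illustrates the update pattern but not the ForeignPtr export. The Ring type and the functions newRing, pushSample and snapshot are hypothetical names.

```haskell
import Data.Array.IO (IOUArray, newArray, readArray, writeArray)
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

data Ring = Ring
  { ringData :: IOUArray Int Double  -- fixed-size unboxed storage
  , ringNext :: IORef Int            -- index of the next slot to overwrite
  , ringSize :: Int
  }

newRing :: Int -> IO Ring
newRing n = do
  arr <- newArray (0, n - 1) 0
  next <- newIORef 0
  pure (Ring arr next n)

-- O(1) destructive write; nothing is copied.
pushSample :: Ring -> Double -> IO ()
pushSample r x = do
  i <- readIORef (ringNext r)
  writeArray (ringData r) i x
  writeIORef (ringNext r) ((i + 1) `mod` ringSize r)

-- Read the contents from oldest to newest (e.g. once per display frame).
snapshot :: Ring -> IO [Double]
snapshot r = do
  start <- readIORef (ringNext r)
  mapM (\k -> readArray (ringData r) ((start + k) `mod` ringSize r))
       [0 .. ringSize r - 1]
```

For example, after pushing 1, 2, 3 and 4 into a ring of size 3, snapshot returns [2.0, 3.0, 4.0]. The sketch is not thread-safe; concurrent writers would need an MVar or similar around the ring.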
You can look at arrows (Yampa) for one very common approach to event-driven code. Another area is Functional Reactive Programming (FRP). There are starting to be some reasonably mature libraries in this domain, such as Netwire or reactive-banana. I don't know if they'd provide adequate performance for your requirements though; I've mostly used them for GUI-type programming.
