I'm learning Haskell and read a couple of articles regarding performance differences of Haskell lists and (insert your language)'s arrays.
Being a learner I obviously just use lists without even thinking about performance difference.
I recently started investigating and found numerous data structure libraries available in Haskell.
Can someone please explain the difference between Lists, Arrays, Vectors, Sequences without going very deep in computer science theory of data structures?
Also, are there some common patterns where you would use one data structure instead of another?
Are there any other forms of data structures that I am missing and might be useful?
Lists Rock
By far the most friendly data structure for sequential data in Haskell is the List
data [a] = a:[a] | []
Lists give you ϴ(1) cons and pattern matching. The standard library, and for that matter the prelude, is full of useful list functions that should litter your code (foldr,map,filter). Lists are persistant , aka purely functional, which is very nice. Haskell lists aren't really "lists" because they are coinductive (other languages call these streams) so things like
ones :: [Integer]
ones = 1:ones
twos = map (+1) ones
tenTwos = take 10 twos
work wonderfully. Infinite data structures rock.
Lists in Haskell provide an interface much like iterators in imperative languages (because of laziness). So, it makes sense that they are widely used.
On the other hand
The first problem with lists is that to index into them (!!) takes ϴ(k) time, which is annoying. Also, appends can be slow ++, but Haskell's lazy evaluation model means that these can be treated as fully amortized, if they happen at all.
The second problem with lists is that they have poor data locality. Real processors incur high constants when objects in memory are not laid out next to each other. So, in C++ std::vector has faster "snoc" (putting objects at the end) than any pure linked list data structure I know of, although this is not a persistant data structure so less friendly than Haskell's lists.
The third problem with lists is that they have poor space efficiency. Bunches of extra pointers push up your storage (by a constant factor).
Sequences Are Functional
Data.Sequence is internally based on finger trees (I know, you don't want to know this) which means that they have some nice properties
Purely functional. Data.Sequence is a fully persistant data structure.
Darn fast access to the beginning and end of the tree. ϴ(1) (amortized) to get the first or last element, or to append trees. At the thing lists are fastest at, Data.Sequence is at most a constant slower.
ϴ(log n) access to the middle of the sequence. This includes inserting values to make new sequences
High quality API
On the other hand, Data.Sequence doesn't do much for the data locality problem, and only works for finite collections (it is less lazy than lists)
Arrays are not for the faint of heart
Arrays are one of the most important data structures in CS, but they dont fit very well with the lazy pure functional world. Arrays provide ϴ(1) access to the middle of the collection and exceptionally good data locality/constant factors. But, since they dont fit very well into Haskell, they are a pain to use. There are actually a multitude of different array types in the current standard library. These include fully persistant arrays, mutable arrays for the IO monad, mutable arrays for the ST monad, and un-boxed versions of the above. For more check out the haskell wiki
Vector is a "better" Array
The Data.Vector package provides all of the array goodness, in a higher level and cleaner API. Unless you really know what you are doing, you should use these if you need array like performance. Of-course, some caveats still apply--mutable array like data structures just dont play nice in pure lazy languages. Still, sometimes you want that O(1) performance, and Data.Vector gives it to you in a useable package.
You have other options
If you just want lists with the ability to efficiently insert at the end, you can use a difference list. The best example of lists screwing up performance tends to come from [Char] which the prelude has aliased as String. Char lists are convient, but tend to run on the order of 20 times slower than C strings, so feel free to use Data.Text or the very fast Data.ByteString. I'm sure there are other sequence oriented libraries I'm not thinking of right now.
Conclusion
90+% of the time I need a sequential collection in Haskell lists are the right data structure. Lists are like iterators, functions that consume lists can easily be used with any of these other data structures using the toList functions they come with. In a better world the prelude would be fully parametric as to what container type it uses, but currently [] litters the standard library. So, using lists (almost) every where is definitely okay.
You can get fully parametric versions of most of the list functions (and are noble to use them)
Prelude.map ---> Prelude.fmap (works for every Functor)
Prelude.foldr/foldl/etc ---> Data.Foldable.foldr/foldl/etc
Prelude.sequence ---> Data.Traversable.sequence
etc
In fact, Data.Traversable defines an API that is more or less universal across any thing "list like".
Still, although you can be good and write only fully parametric code, most of us are not and use list all over the place. If you are learning, I strongly suggest you do too.
EDIT: Based on comments I realize I never explained when to use Data.Vector vs Data.Sequence. Arrays and Vectors provide extremely fast indexing and slicing operations, but are fundamentally transient (imperative) data structures. Pure functional data structures like Data.Sequence and [] let efficiently produce new values from old values as if you had modified the old values.
newList oldList = 7 : drop 5 oldList
doesn't modify old list, and it doesn't have to copy it. So even if oldList is incredibly long, this "modification" will be very fast. Similarly
newSequence newValue oldSequence = Sequence.update 3000 newValue oldSequence
will produce a new sequence with a newValue for in the place of its 3000 element. Again, it doesn't destroy the old sequence, it just creates a new one. But, it does this very efficiently, taking O(log(min(k,k-n)) where n is the length of the sequence, and k is the index you modify.
You cant easily do this with Vectors and Arrays. They can be modified but that is real imperative modification, and so cant be done in regular Haskell code. That means operations in the Vector package that make modifications like snoc and cons have to copy the entire vector so take O(n) time. The only exception to this is that you can use the mutable version (Vector.Mutable) inside the ST monad (or IO) and do all your modifications just like you would in an imperative language. When you are done, you "freeze" your vector to turn in into the immutable structure you want to use with pure code.
My feeling is that you should default to using Data.Sequence if a list is not appropriate. Use Data.Vector only if your usage pattern doesn't involve making many modifications, or if you need extremely high performance within the ST/IO monads.
If all this talk of the ST monad is leaving you confused: all the more reason to stick to pure fast and beautiful Data.Sequence.
Related
so far I've only found vector and sequences, but neither of those could replace an element of a list in O(1). Such data structure would of course violate the immutable character of Haskells structures, but maybe there still exist some dirty implementation?
Every feedback is welcom.
As you suggest yourself – I'm also pretty sure a safe, purely-functional update in O(1) is not possible. What is possible is in O(log n) with a tree-like implementation; for instance, instead of [a] you could use Data.Map.Map Int a with a contiguous region of indices. Also, it is possible to do a batch update of k ≤ n elements in a list or vector, in only O(n) instead of the O(k·n) it would take to manually insert them one-by-one. Check out //.
If none of that is fast enough for you, then yes, you will need to go into the dark realm of mutability. Fortunately, Haskell offers a good safety armour and flashlight for such journeys: the ST monad. The way it works is, you wrap the entire region where you need to do mutable updates in runST. Inside that region, you use MVectors, which support O(1) mutable element updates, much like you could in an imperative language. But thanks to a type-system trick, runST ensures that all these side-effects stay confined to within the local scope.
I'm an intermediate Haskell programmer with tons of experience in strict FP and non-FP languages. Most of my Haskell code analyzes moderately large datasets (10^6..10^9 things), so laziness is always lurking. I have a reasonably good understanding of thunks, WHNF, pattern matching, and sharing, and I've been able to fix leaks with bang patterns and seq, but this profile-and-pray approach feels sordid and wrong.
I want to know how experienced Haskell programmers approach laziness at design time. I'm not asking about easy items like Data.ByteString.Lazy or foldl'; rather, I want to know how you think about the lower-level lazy machinery that causes runtime memory problems and tricky debugging.
How do you think about thunks, pattern matching, and sharing during design time?
What design patterns and idioms do you use to avoid leaks?
How did you learn these patterns and idioms, and do you have some good refs?
How do you avoid premature optimization of non-leaking non-problems?
(Amended 2014-05-15 for time budgeting):
Do you budget substantial project time for finding and fixing memory problems?
Or, do your design skills typically circumvent memory problems, and you get the expected memory consumption very early in the development cycle?
I think most of the trouble with "strictness leaks" happens because people don't have a good conceptual model. Haskellers without a good conceptual model tend to have and propagate the superstition that stricter is better. Perhaps this intuition comes from their results from toying with small examples & tight loops. But it is incorrect. It's just as important to be lazy at the right times as to be strict at the right times.
There are two camps of data types, usually referred to as "data" and "codata". It is essential to respect the patterns of each one.
Operations which produce "data" (Int, ByteString, ...) must be forced close to where they occur. If I add a number to an accumulator, I am careful to make sure that it will be forced before I add another one. A good understanding of laziness is very important here, especially its conditional nature (i.e. strictness propositions don't take the form "X gets evaluated" but rather "when Y is evaluated, so is X").
Operations which produce and consume "codata" (lists most of the time, trees, most other recursive types) must do so incrementally. Usually codata -> codata transformation should produce some information for each bit of information they consume (modulo skipping like filter). Another important piece for codata is that you use it linearly whenever possible -- i.e. use the tail of a list exactly once; use each branch of a tree exactly once. This ensures that the GC can collect pieces as they are consumed.
Things take a special amount of care when you have codata that contains data. E.g. iterate (+1) 0 !! 1000 will end up producing a size-1000 thunk before evaluating it. You need to think about conditional strictness again -- the way to prevent this case is to ensure that when a cons of the list is consumed, the addition of its element occurs. iterate violates this, so we need a better version.
iterate' :: (a -> a) -> a -> [a]
iterate' f x = x : (x `seq` iterate' f (f x))
As you start composing things, of course it gets harder to tell when bad cases happen. In general it is hard to make efficient data structures / functions that work equally well on data and codata, and it's important to keep in mind which is which (even in a polymorphic setting where it's not guaranteed, you should have one in mind and try to respect it).
Sharing is tricky, and I think I approach it mostly on a case-by-case basis. Because it's tricky, I try to keep it localized, choosing not to expose large data structures to module users in general. This can usually be done by exposing combinators for generating the thing in question, and then producing and consuming it all in one go (the codensity transformation on monads is an example of this).
My design goal is to get every function to be respectful of the data / codata patterns of my types. I can usually hit it (though sometimes it requires some heavy thought -- it has become natural over the years), and I seldom have leak problems when I do. But I don't claim that it's easy -- it requires experience with the canonical libraries and patterns of the language. These decisions are not made in isolation, and everything has to be right at once for it to work well. One poorly tuned instrument can ruin the whole concert (which is why "optimization by random perturbation" almost never works for these kinds of issues).
Apfelmus's Space Invariants article is helpful for developing your space/thunk intuition further. Also see Edward Kmett's comment below.
The fact that Haskell's default String implementation is not efficient both in terms of speed and memory is well known. As far as I know the [] lists in general are implemented in Haskell as singly-linked lists and for most small/simple data types (e.g. Int) it doesn't seem like a very good idea, but for String it seems like total overkill. Some of the opinions on this matter include:
Real World Haskell
On simple benchmarks like this, even programs written in interpreted languages such as Python can outperform Haskell code that uses String by an order of magnitude.
Efficient String Implementation in Haskell
Since a String is just [Char], that is a linked list of Char, it means Strings have poor locality of reference, and again means that Strings are fairly large in memory, at a minimum it's N * (21bits + Mbits) where N is the length of the string and M is the size of a pointer (...). Strings are much less likely to be able to be optimized to loops, etc. by the compiler.
I know that Haskell has ByteStrings (and Arrays) in several nice flavors and that they can do the job nicely, but I would expect the default implementation to be the most efficient one.
TL;DR: Why is Haskell's default String implementation a singly-linked list even though it is terribly inefficient and rarely used for real world applications (except for the really simple ones)? Are there historical reasons? Is it easier to implement?
Why is Haskell's default String implementation a singly-linked list
Because singly-linked lists support:
induction via pattern matching
have useful properties, such as Monad, Functor
are properly parametrically polymorphic
are naturally lazy
and so String as [Char] (unicode points) means a string type that fits the language goals (as of 1990), and essentially come "for free" with the list library.
In summary, historically the language designers were interested more in well-designed core data types, than the modern problems of text processing, so we have an elegant, easy to understand, easy to teach String type, that isn't quite a unicode text chunk, and isn't a dense, packed, strict data type.
Efficiency is only one axis to measure an abstraction on. While lists are pretty inefficient for text-y operations, they are darn convenient in that there's a lot of list operations implemented polymorphically that have useful interpretations when specialized to [Char], so you get a lot of reuse both in the library implementation and in the user's brain.
It's not clear that, were the language being designed today from scratch with our current level of experience, the same decision would be made; however, it's not always possible to make decisions perfectly before experience is available.
At this point, it's probably historical: the optimizations that have made things like ByteString so efficient are recent, whereas [Char] predates them all by many years.
I am looking for a data structure in Haskell that supports both fast indexing and fast append. This is for a memoization problem which arises from recursion.
From the way vectors work in c++ (which are mutable, but that shouldn't matter in this case) it seems immutable vectors with both (amortized) O(1) append and O(1) indexing should be possible (ok, it's not, see comments to this question). Is this poossible in Haskell or should I go with Data.Sequence which has (AFAICT anyway) O(1) append and O(log(min(i,n-i))) indexing?
On a related note, as a Haskell newbie I find myself longing for a practical, concise guide to Haskell data structures. Ideally this would give a fairly comprehensive overview over the most practical data structures along with performance characteristics and pointers to Haskell libraries where they are implemented. It seems that there is a lot of information out there, but I have found it to be a little scattered. Am I asking too much?
For simple memoization problems, you typically want to build the table once and then not modify it later. In that case, you can avoid having to worry about appending, by instead thinking of the construction of the memoization table as one operation.
One method is to take advantage of lazy evaluation and refer to the table while we're constructing it.
import Data.Array
fibs = listArray (0, n-1) $ 0 : 1 : [fibs!(i-1) + fibs!(i-2) | i <- [2..n-1]]
where n = 100
This method is especially useful when the dependencies between the elements of the table makes it difficult to come up with a simple order of evaluating them ahead of time. However, it requires using boxed arrays or vectors, which may make this approach unsuitable for large tables due to the extra overhead.
For unboxed vectors, you have operations like constructN which lets you build a table in a pure way while using mutation underneath to make it efficient. It does this by giving the function you pass an immutable view of the prefix of the vector constructed so far, which you can then use to compute the next element.
import Data.Vector.Unboxed as V
fibs = constructN 100 f
where f xs | i < 2 = i
| otherwise = xs!(i-1) + xs!(i-2)
where i = V.length xs
If memory serves, C++ vectors are implemented as an array with bounds and size information. When an insertion would increase the bounds beyond the size, the size is doubled. This is amortized O(1) time insertion (not O(1) as you claim), and can be emulated just fine in Haskell using the Array type, perhaps with suitable IO or ST prepended.
Take a look at this to make a more informed choice of what you should use.
But the simple thing is, if you want the equivalent of a C++ vector, use Data.Vector.
I am writing program that does alot of table lookups. As such, I was perusing the Haskell documentation when I stumbled upon Data.Map (of course), but also Data.HashMap and Data.Hashtable. I am no expert on hashing algorithms and after inspecting the packages they all seem really similar. As such I was wondering:
1: what are the major differences, if any?
2: Which would be the most performant with a high volume of lookups on maps/tables of ~4000 key-value pairs?
1: What are the major differences, if any?
Data.Map.Map is a balanced binary tree internally, so its time complexity for lookups is O(log n). I believe it's a "persistent" data structure, meaning it's implemented such that mutative operations yield a new copy with only the relevant parts of the structure updated.
Data.HashMap.Map is a Data.IntMap.IntMap internally, which in turn is implemented as Patricia tree; its time complexity for lookups is O(min(n, W)) where W is the number of bits in an integer. It is also "persistent.". New versions (>= 0.2) use hash array mapped tries. According to the documentation: "Many operations have a average-case complexity of O(log n). The implementation uses a large base (i.e. 16) so in practice these operations are constant time."
Data.HashTable.HashTable is an actual hash table, with time complexity O(1) for lookups. However, it is a mutable data structure -- operations are done in-place -- so you're stuck in the IO monad if you want to use it.
2: Which would be the most performant with a high volume of lookups on maps/tables of ~4000 key-value pairs?
The best answer I can give you, unfortunately, is "it depends." If you take the asymptotic complexities literally, you get O(log 4000) = about 12 for Data.Map, O(min(4000, 64)) = 64 for Data.HashMap and O(1) = 1 for Data.HashTable. But it doesn't really work that way... You have to try them in the context of your code.
The obvious difference between Data.Map and Data.HashMap is that the former needs keys in Ord, the latter Hashable keys. Most of the common keys are both, so that's not a deciding criterion. I have no experience whatsoever with Data.HashTable, so I can't comment on that.
The APIs of Data.HashMap and Data.Map are very similar, but Data.Map exports more functions, some, like alter are absent in Data.HashMap, others are provided in strict and non-strict variants, while Data.HashMap (I assume you meant the hashmap from unordered-containers) provides lazy and strict APIs in separate modules. If you are using only the common part of the API, switching is really painless.
Concerning performance, Data.HashMap of unordered-containers has pretty fast lookup, last I measured, it was clearly faster than Data.IntMap or Data.Map, that holds in particular for the (not yet released) HAMT branch of unordered-containers. I think for inserts, it was more or less on par with Data.IntMap and somewhat faster than Data.Map, but I'm a bit fuzzy on that.
Both are sufficiently performant for most tasks, for those tasks where they aren't, you'll probably need a tailor-made solution anyway. Considering that you ask specifically about lookups, I would give Data.HashMap the edge.
Data.HashTable's documentation now says "use the hashtables package". There's a nice blog post explaining why hashtables is a good package here. It uses the ST monad.