Data.Text vs. Rope - haskell

I've been looking into ropes as an alternative to Data.Text, and I like what I see so much that I am now forced to ask this question.... Is there any case where Data.Text would be the better choice?
Here are the points that lead me to this (correct me if I am wrong on any of these):
A single-leaf rope is internally (almost) the same thing as a Data.Text value. The overhead of a single-node rope vs. Text is minor: just a flag to distinguish a branch from a leaf. If you really want Data.Text, just use an unsplit rope.
Complexity is universally equal or better for ropes: insert/delete is O(log N) vs. O(N), and get-by-index is O(log N) to O(N) (depending on tree depth) vs. O(N). (A toy sketch of the structure follows these points.)
I've read that the success of ropes proved to be a mixed bag in c, because performance was harmed by thread safety code. However these concerns shouldn't matter in immutable Haskell. In fact, it would seem to me that because of this, Haskell and ropes are ideal for each other.
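To make the leaf/branch point above concrete, here is a toy rope over Data.Text chunks -- a sketch of the general idea only, not the actual Data.Rope representation or API:
import qualified Data.Text as T

-- A leaf is just a packed chunk; a branch caches the length of its left
-- subtree so indexing can descend in time proportional to the tree depth.
data Rope
  = Leaf !T.Text
  | Branch !Int Rope Rope   -- Int is the cached length of the left subtree

ropeLength :: Rope -> Int
ropeLength (Leaf t)       = T.length t
ropeLength (Branch l _ r) = l + ropeLength r

ropeIndex :: Rope -> Int -> Char
ropeIndex (Leaf t) i = T.index t i
ropeIndex (Branch l left right) i
  | i < l     = ropeIndex left i
  | otherwise = ropeIndex right (i - l)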
Again, like in my previous similar questions, I am more interested in the abstract qualities of the structures, not the current situation (library usage, how hardened the code is, etc). If you rewrote the Haskell libraries tomorrow, would you substitute Data.Rope for Data.Text?

The "finger tree of packed arrays" seems like a good representation choice, although I would worry about constant overhead. Some effort with aggressive stream fussion and some other optimizations for short strings might fix this, but Data.Rope lacks these features. Right now Data.Rope is not really a Data.Text replacmenet.
It mainly implements strings of bytes, not strings of characters -- it replaces ByteString, not Text. Unicode support is important.
Despite Edward's amazing-ness, Rope is not nearly as mature. It hasn't had any contributions in a year, and is rarely used.
Data.Text has a huge engineering effort behind it, is highly optimized, and is well known and well documented.

A long time ago when I tried to use Rope, the author told me it wasn't really usable yet; it was just an experiment. One problem with Hackage is the difficulty of learning which packages/versions are really production-ready.
Is the Rope as Unicode-compatible as Data.Text?

Related

Is avoiding partial functions any easier in Haskell than other languages?

We're urged to avoid partial functions with seemingly more emphasis in Haskell than other languages.
Is this because partial functions are a more frequent risk in Haskell than other languages (c.f. this question), or is it that avoiding them in other languages is impractical to the point of little consideration?
Is this because partial functions are a more frequent risk in Haskell than other languages (c.f. this question), or is it that avoiding them in other languages is impractical to the point of little consideration?
Certainly the latter. The most commonly used languages all have some notion of the null value as an inhabitant of every type, the practical effect being that every value is akin to haskell's Maybe a.
You can argue that in haskell we have the same issue: bottoms can hide anywhere, e.g.
uhoh :: String
uhoh = error "oops"
But this isn't really the case. In haskell all bottoms are morally equivalent and we can reason about code as if they didn't exist. If we could catch exceptions in pure code, this would no longer be the case. Here's an interesting discussion.
And just a subjective addendum: I think intermediate haskell developers tend to be aware of whether a function is partial, and to complain loudly when they are surprised to find they were wrong. At the same time, a fair portion of the Prelude contains partial functions, such as tail and (/), and these haven't changed in spite of much attention and many alternative preludes, which I think is evidence that the language and standard library probably struck a pretty decent balance.
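For concreteness, the usual way to avoid a partial function is to make the failure case visible in the type; this is a trivial sketch in the style of the safe package's headMay:
-- A total alternative to the partial head: callers must handle the empty case.
headMay :: [a] -> Maybe a
headMay []    = Nothing
headMay (x:_) = Just x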
EDIT I agree that Alexey Romanov's answer is an important part of the picture as well.
One reason why partial functions are significantly worse in Haskell compared to other languages is the lack of stack traces by default. When you call e.g. head on an empty list, you only get Prelude.head: empty list. Good luck figuring out which call of head is the problem or where the empty list came from! Of course, it may not even be in your code, but in some library you are using.
To get a stack trace, you need to either compile with profiling enabled or make it available explicitly: see https://hackage.haskell.org/package/base-4.9.1.0/docs/GHC-Stack.html and https://wiki.haskell.org/Debugging. And both of these options appeared in relatively recent GHC versions (and work on improving them is ongoing).
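For reference, a minimal sketch of the explicit GHC.Stack route (headOrDie is a made-up helper, not a library function):
import GHC.Stack (HasCallStack)

-- The HasCallStack constraint makes GHC record the call site, so the error
-- message names where headOrDie was called instead of just "empty list".
headOrDie :: HasCallStack => [a] -> a
headOrDie (x:_) = x
headOrDie []    = error "headOrDie: empty list"

main :: IO ()
main = print (headOrDie ([] :: [Int]))
-- The error output now includes a CallStack pointing at the call in main.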

How do experienced Haskell developers approach laziness at *design* time?

I'm an intermediate Haskell programmer with tons of experience in strict FP and non-FP languages. Most of my Haskell code analyzes moderately large datasets (10^6..10^9 things), so laziness is always lurking. I have a reasonably good understanding of thunks, WHNF, pattern matching, and sharing, and I've been able to fix leaks with bang patterns and seq, but this profile-and-pray approach feels sordid and wrong.
I want to know how experienced Haskell programmers approach laziness at design time. I'm not asking about easy items like Data.ByteString.Lazy or foldl'; rather, I want to know how you think about the lower-level lazy machinery that causes runtime memory problems and tricky debugging.
How do you think about thunks, pattern matching, and sharing during design time?
What design patterns and idioms do you use to avoid leaks?
How did you learn these patterns and idioms, and do you have some good refs?
How do you avoid premature optimization of non-leaking non-problems?
(Amended 2014-05-15 for time budgeting):
Do you budget substantial project time for finding and fixing memory problems?
Or, do your design skills typically circumvent memory problems, and you get the expected memory consumption very early in the development cycle?
I think most of the trouble with "strictness leaks" happens because people don't have a good conceptual model. Haskellers without a good conceptual model tend to have and propagate the superstition that stricter is better. Perhaps this intuition comes from experience with small examples and tight loops. But it is incorrect. It's just as important to be lazy at the right times as to be strict at the right times.
There are two camps of data types, usually referred to as "data" and "codata". It is essential to respect the patterns of each one.
Operations which produce "data" (Int, ByteString, ...) must be forced close to where they occur. If I add a number to an accumulator, I am careful to make sure that it will be forced before I add another one. A good understanding of laziness is very important here, especially its conditional nature (i.e. strictness propositions don't take the form "X gets evaluated" but rather "when Y is evaluated, so is X").
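A tiny illustration of forcing "data" close to where it is produced, using nothing beyond the standard library:
import Data.List (foldl')

-- foldl leaves the accumulator as a growing chain of (+) thunks; foldl'
-- forces it at every step, keeping evaluation close to production.
sumLazy, sumStrict :: [Int] -> Int
sumLazy   = foldl  (+) 0
sumStrict = foldl' (+) 0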
Operations which produce and consume "codata" (lists most of the time, trees, most other recursive types) must do so incrementally. Usually a codata -> codata transformation should produce some information for each bit of information it consumes (modulo skipping, like filter). Another important point for codata is to use it linearly whenever possible -- i.e. use the tail of a list exactly once; use each branch of a tree exactly once. This ensures that the GC can collect pieces as they are consumed.
Special care is needed when you have codata that contains data. E.g. iterate (+1) 0 !! 1000 will end up building a size-1000 thunk before evaluating it. You need to think about conditional strictness again -- the way to prevent this is to ensure that when a cons of the list is consumed, the addition of its element occurs. iterate violates this, so we need a better version.
iterate' :: (a -> a) -> a -> [a]
iterate' f x = x : (x `seq` iterate' f (f x))
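With this version, iterate' (+1) 0 !! 1000 forces each addition as the corresponding cons cell is consumed, so at most one pending addition exists at any time instead of a chain of 1000.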
As you start composing things, of course it gets harder to tell when bad cases happen. In general it is hard to make efficient data structures / functions that work equally well on data and codata, and it's important to keep in mind which is which (even in a polymorphic setting where it's not guaranteed, you should have one in mind and try to respect it).
Sharing is tricky, and I think I approach it mostly on a case-by-case basis. Because it's tricky, I try to keep it localized, choosing not to expose large data structures to module users in general. This can usually be done by exposing combinators for generating the thing in question, and then producing and consuming it all in one go (the codensity transformation on monads is an example of this).
My design goal is to get every function to be respectful of the data / codata patterns of my types. I can usually hit it (though sometimes it requires some heavy thought -- it has become natural over the years), and I seldom have leak problems when I do. But I don't claim that it's easy -- it requires experience with the canonical libraries and patterns of the language. These decisions are not made in isolation, and everything has to be right at once for it to work well. One poorly tuned instrument can ruin the whole concert (which is why "optimization by random perturbation" almost never works for these kinds of issues).
Apfelmus's Space Invariants article is helpful for developing your space/thunk intuition further. Also see Edward Kmett's comment below.

Why is Haskell's default string implementation a linked list of chars?

The fact that Haskell's default String implementation is inefficient in terms of both speed and memory is well known. As far as I know, [] lists in general are implemented in Haskell as singly-linked lists, and for most small/simple data types (e.g. Int) that doesn't seem like a very good idea, but for String it seems like total overkill. Some of the opinions on this matter include:
Real World Haskell
On simple benchmarks like this, even programs written in interpreted languages such as Python can outperform Haskell code that uses String by an order of magnitude.
Efficient String Implementation in Haskell
Since a String is just [Char], that is a linked list of Char, it means Strings have poor locality of reference, and again means that Strings are fairly large in memory, at a minimum it's N * (21bits + Mbits) where N is the length of the string and M is the size of a pointer (...). Strings are much less likely to be able to be optimized to loops, etc. by the compiler.
I know that Haskell has ByteStrings (and Arrays) in several nice flavors and that they can do the job nicely, but I would expect the default implementation to be the most efficient one.
TL;DR: Why is Haskell's default String implementation a singly-linked list even though it is terribly inefficient and rarely used for real world applications (except for the really simple ones)? Are there historical reasons? Is it easier to implement?
Why is Haskell's default String implementation a singly-linked list
Because singly-linked lists:
support induction via pattern matching
have useful properties, such as Monad and Functor instances
are properly parametrically polymorphic
are naturally lazy
and so String as [Char] (Unicode code points) means a string type that fits the language goals (as of 1990), and essentially comes "for free" with the list library.
In summary, historically the language designers were interested more in well-designed core data types, than the modern problems of text processing, so we have an elegant, easy to understand, easy to teach String type, that isn't quite a unicode text chunk, and isn't a dense, packed, strict data type.
Efficiency is only one axis to measure an abstraction on. While lists are pretty inefficient for text-y operations, they are darn convenient in that there's a lot of list operations implemented polymorphically that have useful interpretations when specialized to [Char], so you get a lot of reuse both in the library implementation and in the user's brain.
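For instance, the ordinary list functions specialize to String with no extra machinery (a trivial example):
import Data.Char (toUpper)

shout :: String -> String
shout = map toUpper          -- map reused at type [Char]

wordCount :: String -> Int
wordCount = length . words   -- so are length and words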
It's not clear that, were the language being designed today from scratch with our current level of experience, the same decision would be made; however, it's not always possible to make decisions perfectly before experience is available.
At this point, it's probably historical: the optimizations that have made things like ByteString so efficient are recent, whereas [Char] predates them all by many years.

Performant Haskell hashed structure

I am writing a program that does a lot of table lookups. As such, I was perusing the Haskell documentation when I stumbled upon Data.Map (of course), but also Data.HashMap and Data.HashTable. I am no expert on hashing algorithms and after inspecting the packages they all seem really similar. As such I was wondering:
1: What are the major differences, if any?
2: Which would be the most performant with a high volume of lookups on maps/tables of ~4000 key-value pairs?
1: What are the major differences, if any?
Data.Map.Map is a balanced binary tree internally, so its time complexity for lookups is O(log n). I believe it's a "persistent" data structure, meaning it's implemented such that mutative operations yield a new copy with only the relevant parts of the structure updated.
Data.HashMap.Map is a Data.IntMap.IntMap internally, which in turn is implemented as a Patricia tree; its time complexity for lookups is O(min(n, W)) where W is the number of bits in an integer. It is also "persistent". New versions (>= 0.2) use hash array mapped tries. According to the documentation: "Many operations have an average-case complexity of O(log n). The implementation uses a large base (i.e. 16) so in practice these operations are constant time."
Data.HashTable.HashTable is an actual hash table, with time complexity O(1) for lookups. However, it is a mutable data structure -- operations are done in-place -- so you're stuck in the IO monad if you want to use it.
2: Which would be the most performant with a high volume of lookups on maps/tables of ~4000 key-value pairs?
The best answer I can give you, unfortunately, is "it depends." If you take the asymptotic complexities literally, you get O(log 4000) = about 12 for Data.Map, O(min(4000, 64)) = 64 for Data.HashMap and O(1) = 1 for Data.HashTable. But it doesn't really work that way... You have to try them in the context of your code.
The obvious difference between Data.Map and Data.HashMap is that the former needs keys in Ord, the latter Hashable keys. Most of the common keys are both, so that's not a deciding criterion. I have no experience whatsoever with Data.HashTable, so I can't comment on that.
The APIs of Data.HashMap and Data.Map are very similar, but Data.Map exports more functions; some, like alter, are absent in Data.HashMap, and others are provided in strict and non-strict variants, while Data.HashMap (I assume you meant the hashmap from unordered-containers) provides lazy and strict APIs in separate modules. If you are using only the common part of the API, switching is really painless.
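To illustrate how interchangeable the shared API is, here is a small sketch; the strict module names are assumptions on my part and may differ between package versions:
import qualified Data.Map.Strict     as M   -- containers: needs Ord keys
import qualified Data.HashMap.Strict as HM  -- unordered-containers: needs Hashable keys

table :: M.Map String Int
table = M.fromList [("alice", 1), ("bob", 2)]

table' :: HM.HashMap String Int
table' = HM.fromList [("alice", 1), ("bob", 2)]

-- The common part of the two APIs looks identical at the call site.
lookups :: (Maybe Int, Maybe Int)
lookups = (M.lookup "alice" table, HM.lookup "alice" table')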
Concerning performance, Data.HashMap of unordered-containers has pretty fast lookup, last I measured, it was clearly faster than Data.IntMap or Data.Map, that holds in particular for the (not yet released) HAMT branch of unordered-containers. I think for inserts, it was more or less on par with Data.IntMap and somewhat faster than Data.Map, but I'm a bit fuzzy on that.
Both are sufficiently performant for most tasks; for those tasks where they aren't, you'll probably need a tailor-made solution anyway. Considering that you ask specifically about lookups, I would give Data.HashMap the edge.
Data.HashTable's documentation now says "use the hashtables package". There's a nice blog post explaining why hashtables is a good package here. It uses the ST monad.
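The package also ships IO wrappers; here is a minimal sketch assuming its Data.HashTable.IO module (check the package docs for the current API):
import qualified Data.HashTable.IO as H

type Table = H.BasicHashTable String Int

main :: IO ()
main = do
  ht <- H.new :: IO Table
  H.insert ht "alice" 1
  H.lookup ht "alice" >>= print   -- Just 1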

What makes Iteratees worth the complexity?

First, I understand the how of iteratees, well enough that I could probably write a simplistic and buggy implementation without referring back to any existing ones.
What I'd really like to know is why people seem to find them so fascinating, or under what circumstances their benefits justify their complexity. Comparing them to lazy I/O there is a very clear benefit, but that seems an awful lot like a straw man to me. I never felt comfortable about lazy I/O in the first place, and I avoid it except for the occasional hGetContents or readFile, mostly in very simple programs.
In real-world scenarios I generally use traditional I/O interfaces with control abstractions appropriate to the task. In that context I just don't see the benefit of iteratees, or to what task they are an appropriate control abstraction. Most of the time they seem more like unnecessary complexity or even a counterproductive inversion of control.
I've read a fair number of articles about them and sources that make use of them, but have not yet found a compelling example that actually made me think anything along the lines of "oh, yea, I'd have used them there too." Maybe I just haven't read the right ones. Or perhaps there is a yet-to-be-devised interface, simpler than any I've yet seen, that would make them feel less like a Swiss Army Chainsaw.
Am I just suffering from not-invented-here syndrome or is my unease well-founded? Or is it perhaps something else entirely?
As to why people find them so fascinating, I think because they're such a simple idea. The recent discussion on Haskell-cafe about a denotational semantics for iteratees devolved into a consensus that they're so simple they're barely worth describing. The phrase "little more than a glorified left-fold with a pause button" sticks out to me from that thread. People who like Haskell tend to be fond of simple, elegant structures, so the iteratee idea is likely very appealing.
For me, the chief benefits of iteratees are
Composability. Not only can iteratees be composed, but enumerators can too. This is very powerful.
Safe resource usage. Resources (memory and handles mostly) cannot escape their local scope. Compare to strict I/O, where it's easier to create space leaks by not cleaning up.
Efficiency. Iteratees can be highly efficient: competitive with or better than both lazy I/O and strict I/O.
I have found that iteratees provide the greatest benefits when working with a single logical piece of data that comes from multiple sources. This is when the composability is most helpful, and resource management with strict I/O most annoying (e.g. nested allocas or brackets).
For example, in a work-in-progress audio editor, a single logical chunk of sound data is a set of offsets into multiple audio files. I can process that single chunk of sound by doing something like this (from memory, but I think this is right):
-- Compose one enumerator per file with Kleisli composition (>=>),
-- terminating the stream with enumEof.
enumSound :: MonadIO m => Sound -> Enumerator s m a
enumSound snd = foldr (>=>) enumEof . map enumFile $ sndFiles snd
This seems clear, concise, and elegant to me, much more so than the equivalent strict I/O. Iteratees are also powerful enough to incorporate any processing I want to do, including writing output, so I find this very nice. If I used lazy I/O I could get something as elegant, but the extra care to make sure resources are consumed and GC'd would outweigh the advantages IMO.
I also like that you need to explicitly retain data in iteratees, which avoids the notorious mean xs = sum xs / length xs space leak.
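For reference, the leak in question and a one-pass, strict-accumulator alternative (a sketch; the names are mine):
{-# LANGUAGE BangPatterns #-}

-- The classic leak: sum and length each traverse xs, so the whole list is
-- retained until the second traversal finishes.
meanLeaky :: [Double] -> Double
meanLeaky xs = sum xs / fromIntegral (length xs)

-- A single pass with strict accumulators runs in constant space.
mean :: [Double] -> Double
mean = go 0 (0 :: Int)
  where
    go !s !n []     = s / fromIntegral n
    go !s !n (x:xs) = go (s + x) (n + 1) xs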
Of course, I don't use iteratees for everything. As an alternative I really like the with* idiom, but when you have multiple resources that need to be nested that gets complex very quickly.
Essentially, it's about doing IO in a functional style, correctly and efficiently. That's all, really.
Correct and efficient are easy enough using quasi-imperative style with strict IO. Functional style is easy with lazy IO, but it's technically cheating (using unsafeInterleaveIO under the hood) and can have issues with resource management and efficiency.
In very, very general terms, a lot of pure functional code follows a pattern of taking some data, recursively expanding it into smaller pieces, transforming the pieces in some fashion, then recombining it into a final result. The structure may be implicit (in the call graph of the program) or an explicit data structure being traversed.
But this falls apart when IO is involved. Say your initial data is a file handle, the "recursively expand" step is reading a line from it, and you can't read the entire file into memory at once. This forces the entire read-transform-recombine process to be done for each line before reading the next one, so instead of the clean "unfold, map, fold" structure the steps get mashed together into explicitly recursive monadic functions using strict IO.
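As a rough sketch of that contrast (total and totalIO are hypothetical examples summing one number per line):
{-# LANGUAGE BangPatterns #-}
import System.IO (Handle, hGetLine, hIsEOF)

-- Pure style: expand, transform, and recombine as separate, composable stages.
total :: String -> Int
total = sum . map read . lines

-- Strict-IO style: the same computation, but reading, parsing, and
-- accumulating are fused into one explicitly recursive monadic loop.
totalIO :: Handle -> IO Int
totalIO h = go 0
  where
    go !acc = do
      eof <- hIsEOF h
      if eof
        then return acc
        else do
          line <- hGetLine h
          go (acc + read line)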
Iteratees provide an alternative structure to solve the same problem. The "transform and recombine" steps are extracted and, instead of being functions, are changed into a data structure representing the current state of the computation. The "recursively expand" step is given the responsibility of obtaining the data and feeding it to an (otherwise passive) iteratee.
What benefits does this offer? Among other things:
Because an iteratee is a passive object that performs single steps of a computation, they can be easily composed in different ways--for instance, interleaving two iteratees instead of running them sequentially.
The interface between iteratees and enumerators is pure, just a stream of values being processed, so a pure function can be freely spliced in between them.
Data sources and computations are oblivious to each other's internal workings, decoupling input and resource management from processing and output.
The end result is that a program can have a high-level structure much closer to what a pure functional version would look like, with many of the same benefits to compositionality, while simultaneously having efficiency comparable to the more imperative, strict IO version.
As for being "worth the complexity"? Well, that's the thing--they're really not that complex, just a bit new and unfamiliar. The idea's been floating around for only, what, a couple years? Give it some time for things to shake out as people use iteratee-based IO in larger projects (e.g., with things like Snap), and for more examples/tutorials to appear. It's likely that, in hindsight, the current implementations will seem very rough around the edges.
Somewhat related: You may want to read this discussion about functional-style IO. Iteratees aren't mentioned all that much, but the central issue is very similar. In particular this solution, which is both very elegant and goes even further than iteratees in abstracting incremental IO.
under what circumstances their benefits justify their complexity
Every language has strict (classical) IO, where all resources are managed by the user. Haskell also provides ubiquitous lazy IO, where all resource management is delegated to the system.
However, that can create problems, as the scope of resources is dependent on runtime demand properties.
Iteratees strike a third way:
High level abstractions, like lazy IO.
Explicit, lexical scoping of resources, like strict IO.
It is justified when you have complex IO processing tasks, but very tight bounds on resource use. An example is a web server.
Indeed, Snap is built around iteratee IO on top of epoll.
