I am looking for a "List" with log(N) insert/delete at index i. I put the word "List" in quotes, because I don't mean it to be an actual Haskell List, but just any ordered container with a bunch of objects (in fact, it probably needs some sort of tree internally). I am surprised that I haven't found a great solution yet....
This is the best solution I have so far: use Data.Sequence with these two functions.

import Prelude hiding (drop, splitAt)
import Data.Sequence

doInsert position value seq = before >< (value <| after)
  where
    (before, after) = splitAt position seq

doDelete position seq = before >< (drop 1 after)
  where
    (before, after) = splitAt position seq
While these are technically all log(N) functions, it feels like I am doing a bunch of extra stuff for each insert/delete.... In other words, this scales like K*log(N) for a K that is larger than it should be. Plus, since I have to define this stuff myself, I feel like I am using Sequence for something it wasn't designed for.
Is there a better way?
This is a related question (although it only deals with Sequences, I would happily use anything else): Why doesn't Data.Sequence have `insert' or `insertBy', and how do I efficiently implement them?
And yes, this question is related to this other one I posted a few days ago: Massive number of XML edits
There are structures such as catenable sequences, catenable queues, etc. that can give you O(1) join. However, all such structures I know of get away with this by giving you O(i) split. If you want split and join both to be as cheap as possible, I think a finger tree (i.e. Sequence) is your best bet. The entire point of the way Sequence is structured is to keep the amount of juggling needed for splitting, joining, and the like asymptotically small.
If you could get away with an IntMap which started sparse and got denser, and was occasionally "rebuilt" when things got too dense, then that might give you better performance -- but that really depends on your patterns of use.
If you work carefully with the predefined Haskell lists (for example when concatenating two lists), there should be no problem with them.
If you'd like a list-like structure with efficient insertion and deletion, an AVL tree or any other kind of balanced binary tree would work.
For example, store in the AVL tree tuples (Int, a), where the Int is the index in the list and a is the element.
Insertion and deletion would then be logarithmic on average.
I hope this answers your question.
I'm new to Haskell and understand that it is (basically) a pure functional language, which has the advantage that the result of a function will not change across multiple evaluations. Given this, I'm puzzled by why I can't easily mark a function in such a way that it remembers the result of its first evaluation, and does not have to be evaluated again each time its value is required.
In Mathematica, for example, there is a simple idiom for accomplishing this:
f[x_]:=f[x]= ...
but in Haskell, the closest things I've found is something like
f' = (map f [0 ..] !!)
  where f 0 = ...
        f n = f' ...
which in addition to being far less clear (and apparently limited to Int arguments?) does not (seem to) preserve results within an interactive session.
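For concreteness, here is what that idiom looks like with Fibonacci standing in for f (fib' and fib are just illustrative names, not part of the original code):

fib' :: Int -> Integer
fib' = (map fib [0 ..] !!)
  where
    fib 0 = 0
    fib 1 = 1
    fib n = fib' (n - 1) + fib' (n - 2)

Because fib' is a top-level partial application, the list map fib [0 ..] is built once and shared between calls.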
Admittedly (and clearly), I don't understand exactly what's going on here; but naively, it seems like Haskell should have some way, at the function definition level, of
taking advantage of the fact that its functions are functions and skipping re-computation of their results once they have been computed, and
indicating a desire to do this at the function definition level with a simple and clean idiom.
Is there a way to accomplish this in Haskell that I'm missing? I understand (sort of) that Haskell can't store the evaluations as "state", but why can't it simply (in effect) redefine evaluated functions to be their computed value?
This grows out of this question, in which lack of this feature results in terrible performance.
Use a suitable library, such as MemoTrie.
import Data.MemoTrie

f' = memo f
  where f 0 = ...
        f n = f' ...
That's hardly less nice than the Mathematica version, is it?
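For instance, here is a self-contained sketch with Fibonacci standing in for f (fib' and fib are illustrative names; Int is used as the argument type because it has a HasTrie instance):

import Data.MemoTrie (memo)

fib' :: Int -> Integer
fib' = memo fib
  where
    fib 0 = 0
    fib 1 = 1
    fib n = fib' (n - 1) + fib' (n - 2)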
Regarding
“why can't it simply (in effect) redefine evaluated functions to be their computed value?”
Well, it's not so easy in general. These values have to be stored somewhere. Even for an Int-valued function, you can't just allocate an array with all possible values – it wouldn't fit in memory. The list solution only works because Haskell is lazy and therefore allows infinite lists, but that's not particularly satisfying since lookup is O(n).
For other types it's simply hopeless – you'd need to somehow diagonalise an uncountably infinite domain.
You need some cleverer organisation. I don't know how Mathematica does this, but it probably uses a lot of “proprietary magic”. I wouldn't be so sure that it does really work the way you'd like, for any inputs.
Haskell fortunately has type classes, and these allow you to express exactly what a type needs in order to be quickly memoisable. HasTrie is such a class.
Thinking of maps as representations of finite functions, a map of two or more variables can be given either in curried or uncurried form; that is, the uncurried type Map (a,b) c and the curried type Map a (Map b c) are isomorphic, or something close to it.
What practical considerations are there — efficiency, etc — for choosing between the two representations?
The Ord instance of tuples uses lexicographic order, so Map (a, b) c is going to sort by a first anyway, so the overall order will be the same. Regarding practical considerations:
Because Data.Map is a binary search tree, splitting at a key is comparable to a lookup, so getting a submap for a given a in the uncurried form won't be significantly more expensive than in the curried form.
The curried form may produce a less balanced tree overall, for the obvious reason of having multiple trees instead of just one.
The curried form will have a bit of extra overhead to store the nested maps.
The nested maps of the curried form representing "partial applications" can be shared if some a values produce the same result.
Similarly, "partial application" of the curried form gives you the existing inner map, while the uncurried form must construct a new map.
So the uncurried form is clearly better in general, but the curried form may be better if you expect to do "partial application" often and would benefit from sharing of Map b c values.
Note that some care will be necessary to ensure you actually benefit from that potential sharing; you'll need to explicitly define any shared inner maps and reuse the single value when constructing the full map.
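A hedged sketch of that explicit sharing (sharedInner and curriedExample are illustrative names, not from the answer):

import qualified Data.Map as M

sharedInner :: M.Map Char Int
sharedInner = M.fromList [('x', 1), ('y', 2)]

-- both outer keys refer to the same inner map value, so it is stored once on the heap
curriedExample :: M.Map String (M.Map Char Int)
curriedExample = M.fromList [("a", sharedInner), ("b", sharedInner)]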
Edit: Tikhon Jelvis points out in the comments that the memory overhead of the tuple constructors--which I did not think to account for--is not at all negligible. There is certainly some overhead to the curried form, but that overhead is proportional to how many distinct a values there are. The tuple constructor overhead in the uncurried form, on the other hand, is proportional to the total number of keys.
So if, on average, for any given value of a there are three or more distinct keys using it you'll probably save memory using the curried version. The concerns about unbalanced trees still apply, of course. The more I think about it, the more I suspect the curried form is unequivocally better except perhaps if your keys are very sparse and unevenly distributed.
Note that because arity of definitions does matter to GHC, the same care is required when defining functions if you want subexpressions to be shared; this is one reason you sometimes see functions defined in a style like this:
foo x = go
  where
    z = expensiveComputation x
    go y = doStuff y z
Tuples are lazy in both elements, so the tuple version introduces a little extra laziness. Whether this is good or bad strongly depends on your usage. (In particular, comparisons may force the tuple elements, but only if there are lots of duplicate a values.)
Beyond that, I think it's going to depend on how many duplicates you have. If a is almost always different whenever b is, you're going to have a lot of small trees, so the tuple version might be better. On the other hand, if the opposite is true, the non-tuple version may save you a little time (not constantly recomparing a once you've found the appropriate subtree and you're looking for b).
I'm reminded of tries, and how they store common prefixes once. The non-tuple version seems to be a bit like that. A trie can be more efficient than a BST if there's lots of common prefixes, and less efficient if there aren't.
But the bottom line: benchmark it!! ;-)
Apart from the efficiency aspects, there's also a pragmatic side to this question: what do you want to do with this structure?
Do you, for instance, want to be able to store an empty map for a given value of type a? If so, then the curried version might be more practical!
Here's a simple example: let's say we want to store String-valued properties of persons - say the value of some fields on that person's stackoverflow profile page.
import Data.Map (Map, fromList)

type Person = String
type Property = String

curriedMap :: Map Person (Map Property String)
curriedMap = fromList
    [ ("yatima2975", fromList [("location","Utrecht"),("age","37")])
    , ("PLL", fromList []) ]

uncurriedMap :: Map (Person,Property) String
uncurriedMap = fromList
    [ (("yatima2975","location"), "Utrecht")
    , (("yatima2975","age"), "37") ]
With the uncurried version, there is no nice way to record the fact that user "PLL" is known to the system but hasn't filled in any information. A person/property pair ("PLL",undefined) is going to cause runtime crashes, since Map is strict in the keys.
You could change the type of uncurriedMap to Map (Person,Property) (Maybe String) and store Nothings in there, and that might very well be the best solution in this case; but where there's an unknown/varying number of properties (e.g. depending on the kind of Person) that will also run into difficulties.
So, I guess it also depends on whether you need a query function like this:
data QueryResult = PersonUnknown | PropertyUnknownForPerson | Value String
query :: Person -> Property -> Map (Person, Property) String -> QueryResult
This is hard to write (if not impossible) against the uncurried version, but easy against the curried one.
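A hedged sketch of that query against the curried representation (assuming the Person/Property synonyms and the QueryResult type above; queryCurried is an illustrative name):

import qualified Data.Map as M

queryCurried :: Person -> Property -> M.Map Person (M.Map Property String) -> QueryResult
queryCurried person prop m =
    case M.lookup person m of
        Nothing    -> PersonUnknown   -- person not present in the outer map at all
        Just props -> maybe PropertyUnknownForPerson Value (M.lookup prop props)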
I am looking for a data structure in Haskell that supports both fast indexing and fast append. This is for a memoization problem which arises from recursion.
From the way vectors work in C++ (which are mutable, but that shouldn't matter in this case) it seems immutable vectors with both (amortized) O(1) append and O(1) indexing should be possible (ok, it's not, see comments to this question). Is this possible in Haskell, or should I go with Data.Sequence, which has (AFAICT anyway) O(1) append and O(log(min(i,n-i))) indexing?
On a related note, as a Haskell newbie I find myself longing for a practical, concise guide to Haskell data structures. Ideally this would give a fairly comprehensive overview over the most practical data structures along with performance characteristics and pointers to Haskell libraries where they are implemented. It seems that there is a lot of information out there, but I have found it to be a little scattered. Am I asking too much?
For simple memoization problems, you typically want to build the table once and then not modify it later. In that case, you can avoid having to worry about appending, by instead thinking of the construction of the memoization table as one operation.
One method is to take advantage of lazy evaluation and refer to the table while we're constructing it.
import Data.Array
fibs = listArray (0, n-1) $ 0 : 1 : [fibs!(i-1) + fibs!(i-2) | i <- [2..n-1]]
  where n = 100
This method is especially useful when the dependencies between the elements of the table makes it difficult to come up with a simple order of evaluating them ahead of time. However, it requires using boxed arrays or vectors, which may make this approach unsuitable for large tables due to the extra overhead.
For unboxed vectors, you have operations like constructN which lets you build a table in a pure way while using mutation underneath to make it efficient. It does this by giving the function you pass an immutable view of the prefix of the vector constructed so far, which you can then use to compute the next element.
import Data.Vector.Unboxed as V

fibs = constructN 100 f
  where
    f xs | i < 2     = i
         | otherwise = xs!(i-1) + xs!(i-2)
      where i = V.length xs
If memory serves, C++ vectors are implemented as an array with capacity and size information. When an insertion would push the size past the capacity, the capacity is doubled. This gives amortized O(1) insertion (not O(1) as you claim), and it can be emulated just fine in Haskell using the Array type, perhaps wrapped in IO or ST.
Take a look at this to make a more informed choice of what you should use.
But the simple thing is, if you want the equivalent of a C++ vector, use Data.Vector.
I'm learning Haskell and read a couple of articles regarding performance differences of Haskell lists and (insert your language)'s arrays.
Being a learner I obviously just use lists without even thinking about performance difference.
I recently started investigating and found numerous data structure libraries available in Haskell.
Can someone please explain the difference between Lists, Arrays, Vectors and Sequences, without going very deep into the computer science theory of data structures?
Also, are there some common patterns where you would use one data structure instead of another?
Are there any other forms of data structures that I am missing and might be useful?
Lists Rock
By far the most friendly data structure for sequential data in Haskell is the List
data [a] = a:[a] | []
Lists give you ϴ(1) cons and pattern matching. The standard library, and for that matter the Prelude, is full of useful list functions that should litter your code (foldr, map, filter). Lists are persistent, a.k.a. purely functional, which is very nice. Haskell lists aren't really "lists" because they are coinductive (other languages call these streams), so things like
ones :: [Integer]
ones = 1:ones
twos = map (+1) ones
tenTwos = take 10 twos
work wonderfully. Infinite data structures rock.
Lists in Haskell provide an interface much like iterators in imperative languages (because of laziness). So, it makes sense that they are widely used.
On the other hand
The first problem with lists is that indexing into them (!!) takes ϴ(k) time, which is annoying. Also, appends (++) can be slow, but Haskell's lazy evaluation model means that these can be treated as fully amortized, if they happen at all.
The second problem with lists is that they have poor data locality. Real processors incur high constants when objects in memory are not laid out next to each other. So, in C++, std::vector has a faster "snoc" (putting objects at the end) than any pure linked list data structure I know of, although std::vector is not a persistent data structure and so is less friendly than Haskell's lists.
The third problem with lists is that they have poor space efficiency. Bunches of extra pointers push up your storage (by a constant factor).
Sequences Are Functional
Data.Sequence is internally based on finger trees (I know, you don't want to know this) which means that they have some nice properties
Purely functional. Data.Sequence is a fully persistent data structure.
Darn fast access to the beginning and end of the sequence: ϴ(1) (amortized) to get, add, or remove the first or last element. At the thing lists are fastest at, Data.Sequence is at most a constant factor slower. Appending two sequences is ϴ(log(min(n1,n2))).
ϴ(log n) access to the middle of the sequence. This includes inserting values to make new sequences.
High quality API
On the other hand, Data.Sequence doesn't do much for the data locality problem, and only works for finite collections (it is less lazy than lists)
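A small usage sketch (the names example and middleElement are illustrative, not from the answer):

import Data.Sequence (Seq, (<|), (|>), fromList, index)

example :: Seq Int
example = (0 <| fromList [1, 2, 3]) |> 4   -- O(1) amortized cons and snoc at either end

middleElement :: Int
middleElement = example `index` 2          -- O(log(min(i, n - i))) indexing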
Arrays are not for the faint of heart
Arrays are one of the most important data structures in CS, but they don't fit very well with the lazy pure functional world. Arrays provide ϴ(1) access to the middle of the collection and exceptionally good data locality/constant factors. But, since they don't fit very well into Haskell, they are a pain to use. There are actually a multitude of different array types in the current standard library. These include fully persistent arrays, mutable arrays for the IO monad, mutable arrays for the ST monad, and unboxed versions of the above. For more, check out the Haskell wiki.
Vector is a "better" Array
The Data.Vector package provides all of the array goodness, in a higher-level and cleaner API. Unless you really know what you are doing, you should use these if you need array-like performance. Of course, some caveats still apply: mutable, array-like data structures just don't play nicely in pure lazy languages. Still, sometimes you want that O(1) performance, and Data.Vector gives it to you in a usable package.
You have other options
If you just want lists with the ability to efficiently insert at the end, you can use a difference list. The best example of lists screwing up performance tends to come from [Char], which the Prelude has aliased as String. Char lists are convenient, but tend to run on the order of 20 times slower than C strings, so feel free to use Data.Text or the very fast Data.ByteString. I'm sure there are other sequence-oriented libraries I'm not thinking of right now.
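For reference, a minimal difference-list sketch (DList, fromListD, snocD and toListD are illustrative names; a production version lives in the dlist package):

-- a list represented as "the function that prepends it to whatever follows",
-- so appending one element at the end is just O(1) function composition
type DList a = [a] -> [a]

fromListD :: [a] -> DList a
fromListD xs = (xs ++)

snocD :: DList a -> a -> DList a
snocD dl x = dl . (x :)

toListD :: DList a -> [a]
toListD dl = dl []

-- e.g. toListD (snocD (fromListD [1,2]) 3) == [1,2,3]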
Conclusion
90+% of the time I need a sequential collection in Haskell, lists are the right data structure. Lists are like iterators: functions that consume lists can easily be used with any of these other data structures via the toList functions they come with. In a better world, the Prelude would be fully parametric as to what container type it uses, but currently [] litters the standard library. So, using lists (almost) everywhere is definitely okay.
You can get fully parametric versions of most of the list functions (and are noble to use them)
Prelude.map ---> Prelude.fmap (works for every Functor)
Prelude.foldr/foldl/etc ---> Data.Foldable.foldr/foldl/etc
Prelude.sequence ---> Data.Traversable.sequence
etc
In fact, Data.Traversable defines an API that is more or less universal across anything "list-like".
Still, although you can be good and write only fully parametric code, most of us are not and use lists all over the place. If you are learning, I strongly suggest you do too.
EDIT: Based on comments I realize I never explained when to use Data.Vector vs Data.Sequence. Arrays and Vectors provide extremely fast indexing and slicing operations, but are fundamentally transient (imperative) data structures. Pure functional data structures like Data.Sequence and [] let you efficiently produce new values from old values, as if you had modified the old values.
newList oldList = 7 : drop 5 oldList
doesn't modify the old list, and it doesn't have to copy it. So even if oldList is incredibly long, this "modification" will be very fast. Similarly,
newSequence newValue oldSequence = Sequence.update 3000 newValue oldSequence
will produce a new sequence with newValue in place of its 3000th element. Again, it doesn't destroy the old sequence, it just creates a new one. But it does this very efficiently, taking O(log(min(k, n-k))) where n is the length of the sequence and k is the index you modify.
You can't easily do this with Vectors and Arrays. They can be modified, but that is real imperative modification, and so can't be done in regular Haskell code. That means operations in the Vector package that make modifications, like snoc and cons, have to copy the entire vector and so take O(n) time. The only exception to this is that you can use the mutable version (Vector.Mutable) inside the ST monad (or IO) and do all your modifications just like you would in an imperative language. When you are done, you "freeze" your vector to turn it into the immutable structure you want to use with pure code.
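A hedged sketch of that freeze workflow (buildVector is an illustrative name, not from the answer):

import qualified Data.Vector as V
import qualified Data.Vector.Mutable as MV
import Control.Monad.ST (runST)

buildVector :: V.Vector Int
buildVector = runST $ do
    mv <- MV.replicate 10 0   -- mutable vector, visible only inside ST
    MV.write mv 0 42          -- in-place, imperative-style update
    V.freeze mv               -- freeze into an immutable vector usable from pure code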
My feeling is that you should default to using Data.Sequence if a list is not appropriate. Use Data.Vector only if your usage pattern doesn't involve making many modifications, or if you need extremely high performance within the ST/IO monads.
If all this talk of the ST monad is leaving you confused: all the more reason to stick to pure fast and beautiful Data.Sequence.
I'm currently digesting the nice presentation Why learn Haskell? by Keegan McAllister. There he uses the snippet
minimum = head . sort
as an illustration of Haskell's lazy evaluation by stating that minimum has time-complexity O(n) in Haskell. However, I think the example is kind of academic in nature. I'm therefore asking for a more practical example where it's not trivially apparent that most of the intermediate calculations are thrown away.
Have you ever written an AI? Isn't it annoying that you have to thread pruning information (e.g. maximum depth, the minimum cost of an adjacent branch, or other such information) through the tree traversal function? This means you have to write a new tree traversal every time you want to improve your AI. That's dumb. With lazy evaluation, this is no longer a problem: write your tree traversal function once, to produce a huge (maybe even infinite!) game tree, and let your consumer decide how much of it to consume.
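A hedged sketch of that idea (GameTree, gameTree and prune are illustrative names, not from the answer): the producer builds the whole tree lazily, and a consumer such as prune decides how much of it to force.

-- a position together with all positions reachable from it
data GameTree pos = Node pos [GameTree pos]

-- build the (possibly infinite) tree of every line of play, lazily
gameTree :: (pos -> [pos]) -> pos -> GameTree pos
gameTree moves p = Node p (map (gameTree moves) (moves p))

-- one consumer: only look d plies deep; deeper parts are never evaluated
prune :: Int -> GameTree pos -> GameTree pos
prune 0 (Node p _)        = Node p []
prune d (Node p children) = Node p (map (prune (d - 1)) children)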
Writing a GUI that shows lots of information? Want it to run fast anyway? In other languages, you might have to write code that renders only the visible scenes. In Haskell, you can write code that renders the whole scene, and then later choose which pixels to observe. Similarly, rendering a complicated scene? Why not compute an infinite sequence of scenes at various detail levels, and pick the most appropriate one as the program runs?
You write an expensive function, and decide to memoize it for speed. In other languages, this requires building a data structure that tracks which inputs for the function you know the answer to, and updating the structure as you see new inputs. Remember to make it thread safe -- if we really need speed, we need parallelism, too! In Haskell, you build an infinite data structure, with an entry for each possible input, and evaluate the parts of the data structure that correspond to the inputs you care about. Thread safety comes for free with purity.
Here's one that's perhaps a bit more prosaic than the previous ones. Have you ever found a time when && and || weren't the only things you wanted to be short-circuiting? I sure have! For example, I love the <|> function for combining Maybe values: it takes the first one of its arguments that actually has a value. So Just 3 <|> Nothing = Just 3; Nothing <|> Just 7 = Just 7; and Nothing <|> Nothing = Nothing. Moreover, it's short-circuiting: if it turns out that its first argument is a Just, it won't bother doing the computation required to figure out what its second argument is.
And <|> isn't built in to the language; it's tacked on by a library. That is: laziness allows you to write brand new short-circuiting forms. (Indeed, in Haskell, even the short-circuiting behavior of (&&) and (||) aren't built-in compiler magic: they arise naturally from the semantics of the language plus their definitions in the standard libraries.)
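For instance, a user-defined short-circuiting form for Maybe might look like this (orElse is an illustrative name; the real <|> lives in Control.Applicative):

orElse :: Maybe a -> Maybe a -> Maybe a
orElse (Just x) _ = Just x   -- the second argument is never forced here
orElse Nothing  y = y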
In general, the common theme here is that you can separate the production of values from the determination of which values are interesting to look at. This makes things more composable, because the choice of what is interesting to look at need not be known by the producer.
Here's a well-known example I posted to another thread yesterday. Hamming numbers are numbers that don't have any prime factors larger than 5. I.e. they have the form 2^i*3^j*5^k. The first 20 of them are:
[1,2,3,4,5,6,8,9,10,12,15,16,18,20,24,25,27,30,32,36]
The 500000th one is:
1962938367679548095642112423564462631020433036610484123229980468750
The program that printed the 500000th one (after a brief moment of computation) is:
merge xxs@(x:xs) yys@(y:ys) =
    case (x `compare` y) of
        LT -> x : merge xs  yys
        EQ -> x : merge xs  ys
        GT -> y : merge xxs ys

hamming = 1 : m 2 `merge` m 3 `merge` m 5
  where
    m k = map (k *) hamming

main = print (hamming !! 499999)
Computing that number with reasonable speed in a non-lazy language takes quite a bit more code and head-scratching. There are a lot of examples here
Consider generating and consuming the first n elements of an infinite sequence. Without lazy evaluation, the naive encoding would run forever in the generation step, and never consume anything. With lazy evaluation, only as many elements are generated as the code tries to consume.
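A minimal illustration of that pattern (firstSquares is an illustrative name):

-- the infinite list of squares is only produced as far as 'take n' demands
firstSquares :: Int -> [Integer]
firstSquares n = take n (map (^ 2) [1 ..])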