How is insert O(log(n)) in Data.Set? - haskell

When looking through the docs of Data.Set, I saw that insertion of an element into the tree is mentioned to be O(log(n)). However, I would intuitively expect it to be O(n*log(n)) (or maybe O(n)?), as referential transparency requires creating a full copy of the previous tree in O(n).
I understand that for example (:) can be made O(1) instead of O(n), as here the full list doesn't have to be copied; the new list can be optimized by the compiler to be the first element plus a pointer to the old list (note that this is a compiler - not a language level - optimization). However, inserting a value into a Data.Set involves rebalancing that looks quite complex to me, to the point where I doubt that there's something similar to the list optimization. I tried reading the paper that is referenced by the Set docs, but couldn't answer my question with it.
So: how can inserting an element into a binary tree be O(log(n)) in a (purely) functional language?

There is no need to make a full copy of a Set in order to insert an element into it. Internally, elements are stored in a tree, which means that you only need to create new nodes along the path of the insertion. Untouched nodes can be shared between the pre-insertion and post-insertion versions of the Set. And as Deitrich Epp pointed out, in a balanced tree O(log(n)) is the length of the path of the insertion. (Sorry for omitting that important fact.)
Say your Tree type looks like this:
data Tree a = Node a (Tree a) (Tree a)
            | Leaf
... and say you have a Tree that looks like this
let t = Node 10 tl (Node 15 Leaf tr')
... where tl and tr' are some named subtrees. Now say you want to insert 12 into this tree. Well, that's going to look something like this:
let t' = Node 10 tl (Node 15 (Node 12 Leaf Leaf) tr')
The subtrees tl and tr' are shared between t and t', and you only had to construct 3 new Nodes to do it, even though the size of t could be much larger than 3.
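The sharing described above can be sketched as a plain recursive insert. This is a minimal illustration with no rebalancing, not the actual Data.Set code; the names insert and size and the contents of tl and tr' are made up for the example.

```haskell
-- Minimal sketch: an unbalanced persistent binary search tree.
data Tree a = Node a (Tree a) (Tree a)
            | Leaf
  deriving (Show, Eq)

-- Rebuild only the nodes on the search path. The subtree we do not
-- descend into is placed in the new node as-is, so it is shared
-- with the old tree rather than copied.
insert :: Ord a => a -> Tree a -> Tree a
insert x Leaf = Node x Leaf Leaf
insert x t@(Node y l r) =
  case compare x y of
    LT -> Node y (insert x l) r
    GT -> Node y l (insert x r)
    EQ -> t

-- Number of nodes in a tree.
size :: Tree a -> Int
size Leaf         = 0
size (Node _ l r) = 1 + size l + size r

-- Hypothetical example data mirroring the text.
tl, tr', t, t' :: Tree Int
tl  = Node 5 (Node 3 Leaf Leaf) (Node 7 Leaf Leaf)
tr' = Node 20 Leaf Leaf
t   = Node 10 tl (Node 15 Leaf tr')
t'  = insert 12 t
```

Here t' has the shape Node 10 tl (Node 15 (Node 12 Leaf Leaf) tr'), and t is still usable and unchanged afterwards.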
EDIT: Rebalancing
With respect to rebalancing, think about it like this, and note that I claim no rigor here. Say you have an empty tree. Already balanced! Now say you insert an element. Already balanced! Now say you insert another element. Well, with two elements one side has to be one level deeper than the other, so you can't do much there.
Here's the tricky part. Say you insert another element. This could go two ways: left or right; balanced or unbalanced. In the case that it's unbalanced, you can clearly perform a rotation of the tree to balance it. In the case that it's balanced, already balanced!
What's important to note here is that you're constantly rebalancing. It's not like you have a mess of a tree, decide to insert an element, but before you do that, you rebalance, and then leave a mess after you've completed the insertion.
Now say you keep inserting elements. The tree's gonna get unbalanced, but not by much. And when that does happen, first off you're correcting that immediately, and secondly, the correction occurs along the path of the insertion, which is O(log(n)) in a balanced tree. The rotations in the paper you linked to touch at most three nodes in the tree each, so you're doing O(3 * log(n)) work when rebalancing. That's still O(log(n)).
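To make the "a rotation only touches a few nodes" point concrete, here is a rough sketch of a single left rotation on the same kind of tree. It is only an illustration under my own naming (rotateLeft, inOrder, height); the balancing scheme in the paper decides when to rotate and also uses double rotations, which are not shown.

```haskell
data Tree a = Node a (Tree a) (Tree a)
            | Leaf
  deriving (Show, Eq)

-- Single left rotation: the right child becomes the new root.
-- Two Node constructors are allocated; the subtrees a, b and c are
-- reused untouched, however large they are.
rotateLeft :: Tree a -> Tree a
rotateLeft (Node x a (Node y b c)) = Node y (Node x a b) c
rotateLeft t                       = t  -- nothing to rotate

-- In-order traversal, to check that a rotation keeps the ordering.
inOrder :: Tree a -> [a]
inOrder Leaf         = []
inOrder (Node x l r) = inOrder l ++ [x] ++ inOrder r

-- Number of nodes on the longest root-to-leaf path.
height :: Tree a -> Int
height Leaf         = 0
height (Node _ l r) = 1 + max (height l) (height r)
```

For example, rotating the right-leaning chain Node 1 Leaf (Node 2 Leaf (Node 3 Leaf Leaf)) gives Node 2 (Node 1 Leaf Leaf) (Node 3 Leaf Leaf): same elements in the same order, height 2 instead of 3.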

To add extra emphasis to what dave4420 said in a comment, there are no compiler optimizations involved in making (:) run in constant time. You could implement your own list data type, and run it in a simple non-optimizing Haskell interpreter, and it would still be O(1).
A list is defined to be an initial element plus a list (or it's empty in the base case). Here's a definition that's equivalent to native lists:
data List a = Nil | Cons a (List a)
So if you've got an element and a list, and you want to build a new list out of them with Cons, that's just creating a new data structure directly from the arguments the constructor requires. There is no more need to even examine the tail list (let alone copy it), than there is to examine or copy the string when you do something like Person "Fred".
You are simply mistaken when you claim that this is a compiler optimization and not a language level one. This behaviour follows directly from the language level definition of the list data type.
Similarly, for a tree defined to be an item plus two trees (or an empty tree), when you insert an item into a non-empty tree it must either go in the left or right subtree. You'll need to construct a new version of that tree containing the element, which means you'll need to construct a new parent node containing the new subtree. But the other subtree doesn't need to be traversed at all; it can be put in the new parent tree as is. In a balanced tree, that's a full half of the tree that can be shared.
Applying this reasoning recursively should show you that there's actually no copying of data elements necessary at all; there's just the new parent nodes needed on the path down to the inserted element's final position. Each new node stores 3 things: an item (shared directly with the item reference in the original tree), an unchanged subtree (shared directly with the original tree), and a newly created subtree (which shares almost all of its structure with the original tree). There will be O(log(n)) of those in a balanced tree.

Related

Are sequences faster than vectors for searching in haskell?

I am kind of new to using data structures in Haskell other than lists. My goal is to choose one container among Data.Vector, Data.Sequence, Data.List, etc. My problem is the following:
I have to create a sequence (mathematically speaking). The sequence starts at 0. In each iteration two new elements are generated, but only one should be appended, depending on whether the first element is already in the sequence. So in each iteration there is a call to the elem function (see the pseudo-code below).
appendNewItem :: [Integer] -> [Integer]
appendNewItem acc = let firstElem  = someFunc
                        secondElem = someOtherFunc
                        newElem    = if firstElem `elem` acc
                                       then secondElem
                                       else firstElem
                    in acc `append` newElem

sequenceUptoN :: Int -> [Integer]
sequenceUptoN n = (iterate appendNewItem [0]) !! n
Where the append and iterate functions vary depending on which collection you use (I am using lists in the type signature for simplicity).
The question is: which data structure should I use? Is Data.Sequence faster for this task because of its finger tree inner structure?
Thanks a lot!!
No, sequences are not faster for searching. A Vector is just a flat chunk of memory, which generally gives the best lookup performance. If you want to optimise searching, use Data.Vector.Unboxed. (The normal, “boxed” variant is also pretty good, but it actually contains only references to the elements in the flat memory-chunk, so it's not quite as fast for lookups.)
However, because of the flat memory layout, Vectors are not good for (pure-functional) appending: basically, whenever you add a new element, the whole array must be copied so as to not invalidate the old one (which somebody else might still be using). If you need to append, Seq is a pretty good choice, although it's not as fast as destructive appending: for maximum performance, you'll want to pre-allocate an uninitialized Data.Vector.Unboxed.Mutable.MVector of the required size, populate it using the ST monad, and freeze the result. But this is much more fiddly than purely-functional alternatives, so unless you need to squeeze out every bit of performance, Data.Sequence is the way to go. If you only want to append, but not look up elements, then a plain old list in reverse order would also do the trick.
I suggest using Data.Sequence in conjunction with Data.Set. The Sequence to hold the sequence of values and the Set to track the collection.
Sequence, List, and Vector are all structures for working with values where the position in the structure has primary importance when it comes to indexing. In lists we can manipulate elements at the front efficiently, in sequences we can manipulate elements in time proportional to the log of the distance to the closest end, and in vectors we can access any element in constant time. Vectors, however, are not that useful if the length keeps changing, so that rules out their use here.
However, you also need to look up a certain value within the collection, which these structures don't help with. You have to search the whole of a list/sequence/vector to be certain that a new value isn't present. Data.Map and Data.Set are two of the structures for which you define an index value based on Ord, and which let you look up/insert in O(log(n)). So, at the cost of memory usage, you can check for the presence of firstElem in your Set in O(log(n)) time and then add newElem to the end of the sequence in constant time. Just make sure to keep these two structures in sync when adding or removing elements.
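A rough sketch of that Sequence-plus-Set combination, adapted from the question's pseudo-code. Since someFunc and someOtherFunc are not given, I pass in a hypothetical candidates function that returns (firstElem, secondElem) for the sequence so far; the Acc record and its field names are also just made up for this example.

```haskell
import           Data.Foldable (toList)
import           Data.Sequence (Seq, (|>))
import qualified Data.Sequence as Seq
import           Data.Set      (Set)
import qualified Data.Set      as Set

-- The values in order, plus a set of the same values for membership tests.
data Acc = Acc { accSeq :: Seq Integer, accSeen :: Set Integer }

-- 'candidates' is a hypothetical stand-in for someFunc / someOtherFunc:
-- given the sequence so far, it returns (firstElem, secondElem).
appendNewItem :: (Seq Integer -> (Integer, Integer)) -> Acc -> Acc
appendNewItem candidates (Acc s seen) =
  let (firstElem, secondElem) = candidates s
      newElem | firstElem `Set.member` seen = secondElem  -- O(log n) test
              | otherwise                   = firstElem
  in Acc (s |> newElem) (Set.insert newElem seen)         -- keep both in sync

-- Start from [0] and apply n append steps, as in the question.
sequenceUptoN :: (Seq Integer -> (Integer, Integer)) -> Int -> [Integer]
sequenceUptoN candidates n =
  toList (accSeq (iterate (appendNewItem candidates) start !! n))
  where
    start = Acc (Seq.singleton 0) (Set.singleton 0)
```

Both structures are updated in the same function, which is the easiest way to make sure they never get out of step.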

How to generate stable id for AST nodes in functional programming?

I want to replace a specific AST node with another one, and the node to replace is specified by interactive user input.
In non-functional programming, you can use a mutable data structure where each AST node has an object reference, so when I need to refer to a specific node, I can use this reference.
But in functional programming, using IORef is not recommended, so I need to generate an id for each AST node, and I want this id to be stable, which means:
1. when a node is not changed, the generated id will also not change.
2. when a child node is changed, its parent's id will not change.
And, to make it clear that it is an id instead of a hash value:
3. two different sub nodes that compare equal but correspond to different parts of an expression should have different ids.
So, what should I do to approach this?
Perhaps you could use the path from the root to the node as the id of that node. For example, for the datatype
data AST = Lit Int
         | Add AST AST
         | Neg AST
You could have something like
data ASTPathPiece = AddGoLeft
                  | AddGoRight
                  | NegGoDown

type ASTPath = [ASTPathPiece]
This satisfies conditions 2 and 3 but, alas, it doesn't satisfy 1 in general. The index into a list will change if you insert a node in a previous position, for example.
If you are rendering the AST into another format, perhaps you could add hidden attributes in the result nodes that identified which ASTPathPiece led to them. Traversing the result nodes upwards to the root would let you reconstruct the ASTPath.
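As a sketch of how such a path could drive the substitution the question asks about: follow the path from the root and rebuild only the nodes along it. The function name substituteAt is hypothetical, and returning Nothing for a path that doesn't fit the tree is just one possible choice.

```haskell
data AST = Lit Int
         | Add AST AST
         | Neg AST
  deriving (Show, Eq)

data ASTPathPiece = AddGoLeft
                  | AddGoRight
                  | NegGoDown
  deriving (Show, Eq)

type ASTPath = [ASTPathPiece]

-- Replace the node reached by following the path from the root.
-- Arguments: path, replacement node, original tree.
-- Returns Nothing when the path does not fit the shape of the tree.
substituteAt :: ASTPath -> AST -> AST -> Maybe AST
substituteAt []                new _         = Just new
substituteAt (AddGoLeft  : ps) new (Add l r) = (\l' -> Add l' r) <$> substituteAt ps new l
substituteAt (AddGoRight : ps) new (Add l r) = Add l <$> substituteAt ps new r
substituteAt (NegGoDown  : ps) new (Neg e)   = Neg <$> substituteAt ps new e
substituteAt _                 _   _         = Nothing
```

For instance, substituteAt [AddGoRight, NegGoDown] (Lit 9) (Add (Lit 1) (Neg (Lit 2))) yields Just (Add (Lit 1) (Neg (Lit 9))), while a path that doesn't match the tree yields Nothing.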

Quadtree object movement

So I need some help brainstorming, from a theoretical standpoint. Right now I have some code that just draws some objects. The objects lie in the leaves of a quadtree. Now as the objects move I want to keep them placed in the correct leaf of the quadtree.
Right now I am just reconstructing the quadtree on the objects after I change their position. I was trying to figure out a way to correct the tree without rebuilding it completely. All I can think of is having a bunch of pointers to adjacent leaf nodes.
Does anyone have an idea of how to figure out the node into which an object moves without just having a ton of pointers everywhere or a link to articles on this? All I could find was different ways to build the quadtree, nothing about updating it.
If I understand your question, you want some way of mapping between spatial coordinates and leaves of the quadtree.
Here's one possible solution I've been looking at:
For simplicity, let's do the 1D case first. And let's assume we have 32 gridpoints in x. Every grid point then corresponds to some leaf on a tree of depth five with two branches per node (depth 0 = the whole grid, depth 1 = 2 points, depth 2 = 4 points... depth 5 = 32 points).
Each leaf could be represented by the branch indices leading to the leaf. At each level there are two branches we can label A and B. So, a particular leaf might be labeled BBABB, which would mean: go down the B branch, then the B branch, then the A branch, then the B branch and then the B branch.
So, how do you map e.g. BBABB to an x grid point between 0..31? Just convert it to binary, so that BBABB->11011 = 27. Thus, the mapping from gridpoint to leaf-node is simply a matter of translating the letters A and B into 0s and 1s and then interpreting the result as a binary number.
For the 2D case, it's only slightly more complicated. Now we have four branches from each node, so we can label each branch path using a four-letter alphabet, e.g. starting from the root and taking the 3rd branch and then the fourth branch and then the first branch and then the second branch and then the second branch again we would generate the string CDABB.
Now to convert the string (e.g. 'CDABB') into a pair of gridvalues (x,y).
Let's assume A is lower-left, B is lower right, C is upper left and D is upper right. Then, symbolically, we could write, A.x=0, A.y=0 / B.x=1, B.y=0 / C.x=0, C.y=1 / D.x=1, D.y=1.
Taking the example CDABB, we first look at its x values: (CDABB).x = (01011) = 11, which gives us the x grid point. Similarly for y: (CDABB).y = (11000) = 24.
Finally, if you want to find out e.g. the node immediately to the right of CDABB, then simply convert it to a pair of binary numbers in x and y, add 1 to the x value and convert the new pair of binary numbers back into a string.
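The mapping above could be sketched like this (a rough illustration only; the names Branch, bits, toGrid, fromGrid and rightNeighbour are my own, and it assumes every leaf sits at the same depth):

```haskell
import Data.List (foldl')

-- Branch labels as in the text: A lower-left, B lower-right,
-- C upper-left, D upper-right.
data Branch = A | B | C | D
  deriving (Show, Eq)

-- The x bit and y bit contributed by one branch.
bits :: Branch -> (Int, Int)
bits A = (0, 0)
bits B = (1, 0)
bits C = (0, 1)
bits D = (1, 1)

-- Read a root-to-leaf path as two binary numbers, most significant bit first.
toGrid :: [Branch] -> (Int, Int)
toGrid = foldl' step (0, 0)
  where
    step (x, y) b = let (bx, by) = bits b in (2 * x + bx, 2 * y + by)

-- Inverse of toGrid for a tree of the given depth.
fromGrid :: Int -> (Int, Int) -> [Branch]
fromGrid depth (x, y) =
  [ branch (bit x i) (bit y i) | i <- [depth - 1, depth - 2 .. 0] ]
  where
    bit n i = (n `div` (2 ^ i)) `mod` 2
    branch :: Int -> Int -> Branch
    branch 0 0 = A
    branch 1 0 = B
    branch 0 1 = C
    branch _ _ = D

-- Leaf immediately to the right, or Nothing at the right edge of the grid.
rightNeighbour :: [Branch] -> Maybe [Branch]
rightNeighbour path
  | x + 1 < 2 ^ depth = Just (fromGrid depth (x + 1, y))
  | otherwise         = Nothing
  where
    depth  = length path
    (x, y) = toGrid path
```

With this, toGrid [C,D,A,B,B] is (11, 24), and the leaf to its right, at x = 12 = 01100 and y = 24 = 11000, is CDBAA.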
I'm sure this has all been discovered, but I haven't yet found this information on the web.
If you have the spatial data necessary to insert an element into the quad-tree in the first place (ex: its point or rectangle), then you have the same data needed to remove it.
An easy way is to remove the element from the quad-tree before you move it, using the same data you used to originally insert it, then move it, then re-insert it.
Removal from the quad-tree can first remove the element from the leaf node(s), then if the leaf nodes become empty, remove them from their parents. If the parents become empty, remove them from their parents, and so forth.
This simple method is efficient enough for a complex world of objects moving every frame as long as you implement the quad-tree efficiently (ex: use a free list for the nodes). There shouldn't have to be a heap allocation on a per-node basis to insert it, nor a heap deallocation involved in removing every single node. Most node allocations/deallocations should be a simple constant-time operation just involving, say, the manipulation of a couple of integers or pointers.
You can also make this a little more complex if you like. You can start off storing the previous position of an object and then move it. If the new position occupies nodes other than the previous position, then remove the object from the nodes it no longer occupies and insert it to the new ones. Otherwise just keep it in the same node(s).
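The "only update when the occupied node changes" idea might look roughly like the sketch below. To keep it short, it makes some big assumptions: objects are points, all leaves are equally sized square cells, and a flat Map from cell coordinates to object ids stands in for the quadtree's leaf level. The names Cell, ObjId, Index, cellOf and moveObject are all hypothetical.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set        as Set

type Cell  = (Int, Int)                    -- grid coordinates of a leaf cell
type ObjId = Int
type Index = Map.Map Cell (Set.Set ObjId)  -- stand-in for the quadtree's leaves

-- Leaf cell containing a point, for square cells of side cellSize.
cellOf :: Double -> (Double, Double) -> Cell
cellOf cellSize (px, py) = (floor (px / cellSize), floor (py / cellSize))

-- Move an object from its old position to its new one.
-- If it stays in the same leaf, the index is returned untouched.
-- Otherwise the object is removed from the old leaf (dropping the leaf
-- if it becomes empty) and inserted into the new one.
moveObject :: Double -> ObjId -> (Double, Double) -> (Double, Double) -> Index -> Index
moveObject cellSize obj old new idx
  | oldCell == newCell = idx
  | otherwise =
      Map.insertWith Set.union newCell (Set.singleton obj)
        (Map.update dropObj oldCell idx)
  where
    oldCell = cellOf cellSize old
    newCell = cellOf cellSize new
    dropObj s =
      let s' = Set.delete obj s
      in if Set.null s' then Nothing else Just s'
```

In a real quadtree the "drop the empty leaf" step would continue upwards to the parents, as described above; the flat map only shows the bookkeeping for a single level.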
Update
I usually try to avoid linking my previous answers, but in this case I ended up doing a pretty comprehensive write up on the topic which would be hard to replicate anywhere else. Here it is: https://stackoverflow.com/a/48330314/4842163

How is a map of STRING (to integer or anything) stored internally? How are they ordered/balanced?

I know that the Map container in STL is internally a Red-Black Tree, which is a self-balancing tree.
In Map, the lowest element is at the top of the tree. So, for a map of integer to 'anything', the lowest integer will be at the top and so on. It always balances itself. That's why we get a log n complexity while searching for an integer and its associated value.
But in the case of a map of string to 'anything', how does it balance and order itself, if it does? Which string would be at the top of the tree? Does it compare the ASCII values or something?
This might be lame, but I need to know this as I have to ensure that I am adhering to the complexity of log n in my code.
> In Map, the lowest element is at the top of the tree.
No, the least element is at the left-most node, the one you reach by following the left child pointer until it's null. In the case of strings, the least element is the one that compares "less than" all the other strings you've inserted into the map. The default comparison is lexicographic.
The balancing proceeds as usual, by a bunch of pointer swaps. The path from the root to any leaf is guaranteed to have length Θ(lg n), no matter what.
(This is assuming the original STL implementation of map, which is still common, though a conforming C++ library might use a different structure.)

How do I find a matching subtree?

I have a large binary tree, T. T "matches". Some number of subtrees of T will also match. In fact, the matching subtrees need not even be full subtrees: they can be truncated, too. By truncated subtree, I mean that nodes in the subtree may not contain children all the way down - some nodes that have children may have their children removed.
An example: see this link. The tree represented by poem1, stanza1, stanza2, line3 is an example of a truncated subtree.
Determining if a tree matches requires performing a calculation on that entire tree. It's not progressive.
How the heck do I find all matches?
http://en.wikipedia.org/wiki/Subgraph_isomorphism_problem
sounds roughly like what you're trying to find (except that you're trying this on all subgraphs of an original graph as well, making it even harder). I don't really know how you are defining "matches" (equality, pattern, color coordinated, sticks with chemicals on the end that ignite when struck?), so it might be quite a different problem.
