Remove root element from a heap tree - haskell

How do I remove the smallest element of a heap tree?
This element is at the root of the tree. If I remove that, I'm left with two independent subtrees.
data Heap a = Empty
            | Node a (Heap a) (Heap a)
The type of the function is:
removeMin :: Heap a -> (a, Heap a)
It should return the tree and the minimum removed.
Should I make an auxiliary function to build a new tree, or is there a faster way to do this?

Your type, as written, raises some questions:
Q: What's the output from removeMin Empty?
A: You can't produce an a from nothing, so the result should be wrapped in Maybe.
Q: If I've put (+), (-) and (*) in a Heap (Int -> Int -> Int), which one should be returned by removeMin?
A: Not all data types have an ordering (notably, functions lack one), so it makes sense to require that the data type have an Ord instance.
So the updated type becomes:
removeMin :: Ord a => Heap a -> Maybe (a, Heap a)
Now consider it case by case:
Empty has no min element:
removeMin Empty = Nothing
If one branch is empty, the remaining heap is the other branch
removeMin (Node a Empty r) = Just (a, r)
removeMin (Node a l Empty) = Just (a, l)
Convince yourself that this works for Node a Empty Empty.
If neither branch is empty, then the new minimum must be the root of one of the two branches.
The branches of the resulting Heap are the branch with the larger root, unchanged, and the branch with the smaller root, with its minimum removed.
Fortunately, we already have a helper to remove the minimum from a Heap!
removeMin (Node a l@(Node la _ _) r@(Node ra _ _)) = Just (a, Node mina maxN minN')
  where (minN, maxN) = if la <= ra then (l,r) else (r,l)
        Just (mina, minN') = removeMin minN
Now, while this produces a valid heap, it's not necessarily the best algorithm because it's not guaranteed to produce a balanced heap. A poorly balanced heap is no better than a linked list, giving you O(n) insertion and deletion times where a balanced heap can give you O(log n).

You will need an auxiliary function to build the new tree, but don't worry: it will not perform poorly. Subtrees that do not change are shared with the original heap rather than copied, so only the nodes you actually rebuild are allocated, even for large (or lazily built, recursive) data structures.
I assume you can write such an auxiliary function yourself? It is straightforward; if you run into trouble, I can write it out later.

Think of it this way: After removing the top node, you're left with two heaps. So you need to implement (recursive) merging of two heaps, something like
merge :: (Ord a) => Heap a -> Heap a -> Heap a
You could also implement a monoid instance for Heap
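-- since GHC 8.4, a Monoid instance also requires a Semigroup instance
instance (Ord a) => Semigroup (Heap a) where
  (<>) = merge -- the merging function
instance (Ord a) => Monoid (Heap a) where
  mempty = Empty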
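The merge function described above could be sketched as follows. This is only one possible implementation: swapping the children on every step is a skew-heap-style choice made for this sketch (the answer does not specify a balancing strategy), and the helpers insert, fromList and drain are hypothetical names added for illustration.

```haskell
data Heap a = Empty
            | Node a (Heap a) (Heap a)

-- Merge two heaps. The smaller root becomes the new root; the other heap
-- is merged into one of its children. Swapping the children on each step
-- is an assumption of this sketch (skew-heap style), not part of the answer.
merge :: Ord a => Heap a -> Heap a -> Heap a
merge Empty h = h
merge h Empty = h
merge h1@(Node x l1 r1) h2@(Node y l2 r2)
  | x <= y    = Node x (merge r1 h2) l1
  | otherwise = Node y (merge r2 h1) l2

-- Removing the root leaves two heaps, which are merged back together.
removeMin :: Ord a => Heap a -> Maybe (a, Heap a)
removeMin Empty        = Nothing
removeMin (Node x l r) = Just (x, merge l r)

-- Hypothetical helpers, used only to exercise the functions above.
insert :: Ord a => a -> Heap a -> Heap a
insert x = merge (Node x Empty Empty)

fromList :: Ord a => [a] -> Heap a
fromList = foldr insert Empty

-- Repeatedly remove the minimum, yielding the elements in ascending order.
drain :: Ord a => Heap a -> [a]
drain h = case removeMin h of
  Nothing      -> []
  Just (x, h') -> x : drain h'
```

With these definitions, drain (fromList [5,3,8,1,4]) yields the elements in ascending order.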

Related

Can you implement Binary Search Tree in Haskell with O(log n) insertion?

If I understand correctly, modifying a Binary Search Tree in Haskell (insertion or deletion) requires copying the whole tree, which effectively makes it O(n). Is there a way to implement it in O(log n), or would the compiler perhaps optimize O(n) insertion down to O(log n) "under the hood"?
"If I understand correctly, modifying a Binary Search Tree in Haskell (insertion or deletion) requires copying the whole tree, which effectively makes it O(n)."
You do not need to copy the entire tree. Indeed, let us work with a simple unbalanced binary search tree, like:
data Tree a = Node (Tree a) a (Tree a) | Empty deriving (Eq, Show)
then we can insert a value with:
insertIn :: Ord a => a -> Tree a -> Tree a
insertIn x = go
  where go Empty = Node Empty x Empty
        go n@(Node l v r)
          | x < v = Node (go l) v r
          | x > v = Node l v (go r)
          | otherwise = n
Here we reuse r in case we construct a Node (go l) v r, and we reuse l in case we construct a Node l v (go r). For each node we visit, we create a new node where one of the two subtrees is used in the new node. This means that the new tree will point to the same subtree objects as the original tree.
In this example, the number of new nodes thus scales with O(d), with d the depth of the tree. If the tree is fairly balanced, then it will insert in O(log n).
Of course, you can improve the algorithm and define an AVL tree or red-black tree by storing more balancing information in each node; in that case you can guarantee O(log n) insertion time.
The fact that all data is immutable here helps to reuse parts of the tree: we know that l and r can not change, so the two trees will share a large amount of nodes and thus reduce the amount of memory necessary if you want to use both the original and the new tree.
If no reference to the old tree is needed any longer, the garbage collector will eventually collect the "old" nodes that have been replaced in the new tree.

what is the most appropriate data structure supporting binary search to solve LIS?

I would like to solve the Longest increasing subsequence problem in Haskell, with the patience sorting algorithm.
I first did it with lists, and it worked in O(n^2) time.
Now I would like to create an algorithm that solves it in O(n log n) time. To do that, I need to find the "first fit" when I insert each value v; in other words, to find the first pile whose last element is bigger than v in O(log n) time.
data OrderedStruct a -- ??? (structure to be chosen)
-- I need a method to remove elements in O(log n) time
popFirstBigger :: Ord a => a -> OrderedStruct a -> (Maybe a, OrderedStruct a)
popFirstBigger a t = undefined
-- and one to insert them in O(log n) time or faster
insert :: Ord a => a -> OrderedStruct a -> OrderedStruct a
insert a t = undefined
I could do it with a balanced binary search tree, but I would like to know if there is a shorter way.
For example, any structure that supports binary search would be sufficient.
Does such a data structure exist in standard Haskell (Seq, for example)?
Otherwise which data structure could I use?
Data.Set offers insert, delete, and lookupGT, all in O(log n) time.
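As a rough illustration of how those Data.Set operations fit together, here is a minimal sketch that computes only the length of the longest strictly increasing subsequence. It keeps just the top of each pile in the set, and it uses lookupGE instead of lookupGT so that equal elements replace a pile top rather than extend the sequence; both choices, and the name lisLength, are assumptions of this sketch rather than part of the answer.

```haskell
import qualified Data.Set as Set
import Data.List (foldl')

-- Length of the longest strictly increasing subsequence.
-- The set holds the current top of every pile; each step costs O(log n).
lisLength :: Ord a => [a] -> Int
lisLength = Set.size . foldl' step Set.empty
  where
    step tops v = case Set.lookupGE v tops of
      -- first pile whose top is >= v: put v on it (replace the top)
      Just t  -> Set.insert v (Set.delete t tops)
      -- no such pile: start a new one
      Nothing -> Set.insert v tops
```

For example, lisLength [3,1,4,1,5,9,2,6] evaluates to 4 (one such subsequence is 1, 4, 5, 9). Recovering the subsequence itself would need extra bookkeeping that this sketch leaves out.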

Haskell vector C++ push_back analogue

I've discovered that Haskell's Data.Vector.* modules lack the functionality of C++'s std::vector::push_back. There is grow/unsafeGrow, but they seem to have O(n) complexity.
Is there a way to grow vectors in O(1) amortized time for an element?
No, there really is no such facility in Data.Vector. It isn't too difficult to implement this from scratch using MutableArray like Data.Vector.Mutable does (see my implementation below), but there are some significant drawbacks. In particular, all of its operations end up happening inside some state context, usually ST or IO. This has the following downsides:
Any code that manipulates such a data structure ends up having to be monadic
The compiler is much less likely to be able to optimize. For example, libraries like vector use something really clever called fusion to optimize away intermediate allocations. This sort of thing is not possible in a state context.
Parallelism is going to be a lot tougher: in ST I can't even have two threads and in IO I will have race conditions all over the place. The nasty bit here is that any sharing is going to have to happen in IO.
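As if all this wasn't enough, garbage collection also tends to be cheaper for pure code: GHC's generational collector has to do extra bookkeeping for mutable arrays and references that it can skip for immutable data.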
What do I do then?
It isn't particularly often that you need exactly this behaviour. Usually you are better off using an immutable data structure that does something similar (thereby avoiding all of the aforementioned problems). Just limiting ourselves to containers that come with GHC, some alternatives include:
if you are almost always just using push_back, maybe you just want a stack (a plain old [a]).
if you anticipate doing more push_back than lookups, Data.Sequence gives you O(1) appending to either end and O(log n) lookup.
if you are interested in a lot of operations especially hashmap-like, Data.IntMap is pretty optimized. Even if the theoretical cost of those operations is O(log n), you will need a pretty big IntMap to start feeling those costs.
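To make the Data.Sequence option concrete, here is a small sketch of push_back-style appending with the |> operator. The names pushBack and built are made up for this example.

```haskell
import Data.Sequence (Seq, (|>))
import qualified Data.Sequence as Seq

-- Append one element at the right end; this returns a new sequence
-- and leaves the old one untouched.
pushBack :: Seq a -> a -> Seq a
pushBack = (|>)

-- Build a sequence by pushing three elements onto an empty one.
built :: Seq Int
built = foldl pushBack Seq.empty [10, 20, 30]
```

Here Seq.length built is 3 and Seq.lookup 1 built is Just 20. Unlike the mutable vector below, none of this needs ST or IO.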
Making something like C++ vector
Of course, if one doesn't care about the restrictions mentioned initially, there is no reason not to have a C++ like vector. Just for fun, I went ahead and implemented this from scratch (needs packages data-default and primitive).
The reason this code is probably not already in some library is that it goes against much of the spirit of Haskell (I do this with the intent of conforming to a C++ style vector).
The only operation that actually makes a new vector is newEmpty; everything else "modifies" an existing vector. Since pushBack doesn't return a new GrowVector, it has to modify the existing one (including its length and/or capacity), so length and capacity have to be "pointers". In turn, that means that even getting the length is a monadic operation.
While this isn't unboxed, it would not be too difficult to replicate vector's data family approach; it is just tedious [1].
With that said:
module GrowVector (
  GrowVector, newEmpty, size, read, write, pushBack, popBack
) where
import Data.Primitive.Array
import Data.Primitive.MutVar
import Data.Default
import Control.Monad
import Control.Monad.Primitive (PrimState, PrimMonad)
import Prelude hiding (length, read)
data GrowVector s a = GrowVector
  { underlying :: MutVar s (MutableArray s a) -- ^ underlying array
  , length :: MutVar s Int -- ^ perceived length of vector
  , capacity :: MutVar s Int -- ^ actual capacity
  }
type GrowVectorIO = GrowVector (PrimState IO)
-- | Make a new empty vector with the given capacity. O(n)
newEmpty :: (Default a, PrimMonad m) => Int -> m (GrowVector (PrimState m) a)
newEmpty cap = do
  arr <- newArray cap def
  GrowVector <$> newMutVar arr <*> newMutVar 0 <*> newMutVar cap
-- | Read an element in the vector (unchecked). O(1)
read :: PrimMonad m => GrowVector (PrimState m) a -> Int -> m a
g `read` i = do arr <- readMutVar (underlying g); arr `readArray` i
-- | Find the size of the vector. O(1)
size :: PrimMonad m => GrowVector (PrimState m) a -> m Int
size g = readMutVar (length g)
-- | Double the vector capacity. O(n)
resize :: (Default a, PrimMonad m) => GrowVector (PrimState m) a -> m ()
resize g = do
  curCap <- readMutVar (capacity g) -- read current capacity
  curArr <- readMutVar (underlying g) -- read current array
  curLen <- readMutVar (length g) -- read current length
  let newCap = max 1 (2 * curCap) -- double, but never stay stuck at capacity 0
  newArr <- newArray newCap def -- allocate the bigger array
  copyMutableArray newArr 0 curArr 0 curLen -- copy the old elements over (offsets are 0-based)
  underlying g `writeMutVar` newArr -- use the new array in the vector
  capacity g `writeMutVar` newCap -- update the capacity in the vector
-- | Write an element to the array (unchecked). O(1)
write :: PrimMonad m => GrowVector (PrimState m) a -> Int -> a -> m ()
write g i x = do arr <- readMutVar (underlying g); writeArray arr i x
-- | Pop an element of the vector, mutating it (unchecked). O(1)
popBack :: PrimMonad m => GrowVector (PrimState m) a -> m a
popBack g = do
  s <- size g
  x <- g `read` (s - 1) -- the last element lives at index s - 1
  length g `modifyMutVar'` subtract 1
  pure x
-- | Push an element. (Amortized) O(1)
pushBack :: (Default a, PrimMonad m) => GrowVector (PrimState m) a -> a -> m ()
pushBack g x = do
  s <- readMutVar (length g) -- read current size
  c <- readMutVar (capacity g) -- read current capacity
  when (s == c) (resize g) -- resize only when the array is full
  write g s x -- write to the first unused slot (index s)
  length g `modifyMutVar'` (+1) -- increase the length
Current semantics of grow
I think the github issue does a pretty good job of explaining the semantics:
I think the intended semantics are that it may do a realloc, but not guaranteed to, and all the current implementations do the simpler copying semantics because for on heap allocations the cost should be roughly the same.
Basically you should use grow when you want a new mutable vector of an increased size, starting with the elements of the old vector (and no longer care about the old vector). This is quite useful - for example one could implement GrowVector using MVector and grow.
[1] The approach is that for every new type of unboxed vector you want to have, you make a data instance that "expands" your type into a fixed number of unboxed arrays (or other unboxed vectors). This is the point of a data family: to allow different instantiations of a type to have totally different runtime representations, and to also be extensible (you can add your own data instance if you want).

Is there a way to avoid copying the whole search path of a binary tree on insert?

I've just started working my way through Okasaki's Purely Functional Data Structures, but have been doing things in Haskell rather than Standard ML. However, I've come across an early exercise (2.5) that's left me a bit stumped on how to do things in Haskell:
Inserting an existing element into a binary search tree copies the entire search path even though the copied nodes are indistinguishable from the originals. Rewrite insert using exceptions to avoid this copying. Establish only one handler per insertion rather than one handler per iteration.
Now, my understanding is that ML, being an impure language, gets by with a conventional approach to exception handling not so different to, say, Java's, so you can accomplish it something like this:
datatype tree = E | T of tree * int * tree
exception ElementPresent
fun insert (x, t) =
  let fun go E = T (E, x, E)
        | go (T (l, y, r)) =
            if x < y then T (go l, y, r)
            else if y < x then T (l, y, go r)
            else raise ElementPresent
  in go t
  end
  handle ElementPresent => t
I don't have an ML implementation, so this may not be quite right in terms of the syntax.
My issue is that I have no idea how this can be done in Haskell, outside of doing everything in the IO monad, which seems like cheating and even if it's not cheating, would seriously limit the usefulness of a function which really doesn't do any mutation. I could use the Maybe monad:
data Tree a = Empty | Fork (Tree a) a (Tree a)
  deriving (Show)
insert :: (Ord a) => a -> Tree a -> Tree a
insert x t = maybe t id (go t)
  where go Empty = return (Fork Empty x Empty)
        go (Fork l y r)
          | x < y = do l' <- go l; return (Fork l' y r)
          | x > y = do r' <- go r; return (Fork l y r')
          | otherwise = Nothing
This means everything winds up wrapped in Just on the way back up when the element isn't found, which requires more heap allocation, and sort of defeats the purpose. Is this allocation just the price of purity?
EDIT to add: A lot of why I'm wondering about the suitability of the Maybe solution is that the optimization described only seems to save you all the constructor calls you would need in the case where the element already exists, which means heap allocations proportional to the length of the search path. The Maybe also avoids those constructor calls when the element already exists, but then you get a number of Just constructor calls equal to the length of the search path. I understand that a sufficiently smart compiler could elide all the Just allocations, but I don't know if, say, the current version of GHC is really that smart.
In terms of cost, the ML version is actually very similar to your Haskell version.
Every recursive call in the ML version results in a stack frame. The same is true in the Haskell version. This is going to be proportional in size to the path that you traverse in the tree. Also, both versions will of course allocate new nodes for the entire path if an insertion is actually performed.
In your Haskell version, every recursive call might also eventually result in the allocation of a Just node. This will go on the minor heap, which is just a block of memory with a bump pointer. For all practical purposes, GHC's minor heap is roughly equivalent in cost to the stack. Since these are short-lived allocations, they won't normally end up being moved to the major heap at all.
GHC generally cannot elide path copying in cases like that. However, there is a way to do it manually, without incurring any of the indirection/allocation costs of Maybe. Here it is:
{-# LANGUAGE MagicHash #-}
import GHC.Exts (reallyUnsafePtrEquality#)
data Tree a = Empty | Fork (Tree a) a (Tree a)
  deriving (Show)
insert :: (Ord a) => a -> Tree a -> Tree a
insert x Empty = Fork Empty x Empty
insert x node@(Fork l y r)
  | x < y = let l' = insert x l in
      case reallyUnsafePtrEquality# l l' of
        1# -> node -- subtree unchanged: reuse the original node
        _ -> Fork l' y r
  | x > y = let r' = insert x r in
      case reallyUnsafePtrEquality# r r' of
        1# -> node -- subtree unchanged: reuse the original node
        _ -> Fork l y r'
  | otherwise = node
The pointer equality function does exactly what's in the name. Here it is safe because even if the equality returns a false negative we only do a bit of extra copying, and nothing worse happens.
It's not the most idiomatic or prettiest Haskell, but the performance benefits can be significant. In fact, this trick is used very frequently in unordered-containers.
As fizruk indicates, the Maybe approach is not significantly different from what you'd get in Standard ML. Yes, the whole path is copied, but the new copy is discarded if it turns out not to be needed. The Just constructor itself may not even be allocated on the heap—it can't escape from insert, let alone the module, and you don't do anything weird with it, so the compiler is free to analyze it to death.
Edit
There are efficiency problems, now that I think of it. Your use of Maybe conceals the fact that you're actually making two passes—one down to find the insertion point and one up to build the tree. The solution to this is to drop Maybe Tree in favor of (Tree,Bool) and use strictness annotations, or to switch to continuation-passing style. Also, if you choose to stay with the three-way logic, you may want to use the three-way comparison function. Alternatively, you can go all the way to the bottom each time and check later if you hit a duplicate.
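As one possible reading of the continuation-passing suggestion above, here is a minimal sketch. Instead of wrapping results in Maybe, go carries a continuation k that rebuilds the path; when the element is already present, the continuation is simply dropped and the original tree t is returned. The three-way compare is used as suggested. The helper toList is a hypothetical addition for inspection only, and note that the continuations are themselves closures, so this is not claimed to be allocation-free.

```haskell
data Tree a = Empty | Fork (Tree a) a (Tree a)
  deriving (Show)

-- Insert in continuation-passing style: k rebuilds the search path,
-- and is discarded entirely if x is already in the tree.
insert :: Ord a => a -> Tree a -> Tree a
insert x t = go t id
  where
    go Empty k = k (Fork Empty x Empty)
    go (Fork l y r) k = case compare x y of
      LT -> go l (\l' -> k (Fork l' y r))
      GT -> go r (\r' -> k (Fork l y r'))
      EQ -> t -- element present: return the original tree unchanged

-- In-order traversal, used here only to inspect the result.
toList :: Tree a -> [a]
toList Empty        = []
toList (Fork l y r) = toList l ++ [y] ++ toList r
```

For instance, toList (foldr insert Empty [3,1,2,3,1]) gives [1,2,3].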
If you have a predicate that checks whether the key is already in the tree, you can look before you leap:
insert x t = if contains t x then t else insert' x t
This traverses the tree twice, of course. Whether that's as bad as it sounds should be determined empirically: it might just load the relevant part of the tree into the cache.

Tree Fold operation?

I am taking a class in Haskell, and we need to define the fold operation for a tree defined by:
data Tree a = Lf a | Br (Tree a) (Tree a)
I can not seem to find any information on the "tfold" operation or really what it supposed to do. Any help would be greatly appreciated.
I always think of folds as a way of systematically replacing constructors by other functions. So, for instance, if you have a do-it-yourself List type (defined as data List a = Nil | Cons a (List a)), the corresponding fold can be written as:
listfold nil cons Nil = nil
listfold nil cons (Cons a b) = cons a (listfold nil cons b)
or, maybe more concisely, as:
listfold nil cons = go where
  go Nil = nil
  go (Cons a b) = cons a (go b)
The type of listfold is b -> (a -> b -> b) -> List a -> b. That is to say, it takes two 'replacement constructors'; one telling how a Nil value should be transformed into a b, another replacement constructor for the Cons constructor, telling how the first value of the Cons constructor (of type a) should be combined with a value of type b (why b? because the fold has already been applied recursively!) to yield a new b, and finally a List a to apply the whole she-bang to - with a result of b.
In your case, the type of tfold should be (a -> b) -> (b -> b -> b) -> Tree a -> b by analogous reasoning; hopefully you'll be able to take it from there!
Imagine you define that a tree should be shown in the following manner,
<1 # <<2#3> # <4#5>>>
Folding such a tree means replacing each branch node with an actual supplied operation to be performed on the results of fold recursively performed on the data type's constituents (here, the node's two child nodes, which are themselves, each, a tree), for example with +, producing
(1 + ((2+3) + (4+5)))
So, for leaves you should just take the values inside them, and for branches, recursively apply the fold for each of the two child nodes, and combine the two results with the supplied function, the one with which the tree is folded. (edit:) When "taking" values from leaves, you could additionally transform them, applying a unary function. So in general, your folding will need two user-provided functions, one for leaves, Lf, and another one for combining the results of recursively folding the tree-like constituents (i.e. branches) of the branching nodes, Br.
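The description above, with one function for leaves and one for branches, might look like the following sketch. The name tfold comes from the question; example, total and render are made-up names for illustration, and this is only one way to write it.

```haskell
data Tree a = Lf a | Br (Tree a) (Tree a)

-- Replace every Lf with `leaf` and every Br with `branch`.
tfold :: (a -> b) -> (b -> b -> b) -> Tree a -> b
tfold leaf branch = go
  where
    go (Lf a)   = leaf a
    go (Br l r) = branch (go l) (go r)

-- The tree shown above as <1 # <<2#3> # <4#5>>>.
example :: Tree Int
example = Br (Lf 1) (Br (Br (Lf 2) (Lf 3)) (Br (Lf 4) (Lf 5)))

-- Folding with (+) computes (1 + ((2+3) + (4+5))).
total :: Int
total = tfold id (+) example

-- Folding with a string-building function reproduces the bracket notation.
render :: String
render = tfold show (\l r -> "<" ++ l ++ "#" ++ r ++ ">") example
```

Here total is 15 and render is "<1#<<2#3>#<4#5>>>", which shows that the same fold can produce either a number or a compound value depending on the functions supplied.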
Your tree data type could have been defined differently, e.g. with possibly empty leaves, and with internal nodes also carrying the values. Then you'd have to provide a default value to be used instead of the empty leaf nodes, and a three-way combination operation. Still you'd have the fold defined by two functions corresponding to the two cases of the data type definition.
Another distinction to realize here is, what you fold, and how you fold it. I.e. you could fold your tree in a linear fashion, (1+(2+(3+(4+5)))) == ((1+) . (2+) . (3+) . (4+) . (5+)) 0, or you could fold a linear list in a tree-like fashion, ((1+2)+((3+4)+5)) == (((1+2)+(3+4))+5). It is all about how you parenthesize the resulting "expression". Of course in the classic take on folding the expression's structure follows that of the data structure being folded; but variations do exist. Note also, that the combining operation might not be strict, and the "result" type it consumes/produces might express compound (lists and such), as well as atomic (numbers and such), values.
(update 2019-01-26) This re-parenthesization is possible if the combining operation is associative, like +: (a1+a2)+a3 == a1+(a2+a3). A data type together with such associative operation and a "zero" element (a+0 == 0+a == a) is known as "Monoid", and the notion of folding "into" a Monoid is captured by the Foldable type class.
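To illustrate folding "into" a Monoid, here is a minimal sketch of a Foldable instance for the Tree type from the question. Defining foldMap is enough; functions such as sum, length and foldr then come from the class. The value example is a made-up name for this illustration.

```haskell
data Tree a = Lf a | Br (Tree a) (Tree a)

-- Map every leaf into a monoid and combine the results with (<>).
instance Foldable Tree where
  foldMap f (Lf a)   = f a
  foldMap f (Br l r) = foldMap f l <> foldMap f r

example :: Tree Int
example = Br (Lf 1) (Br (Br (Lf 2) (Lf 3)) (Br (Lf 4) (Lf 5)))
```

With this instance, sum example is 15 and foldr (:) [] example lists the leaves from left to right as [1,2,3,4,5].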
A fold on a list is a reduction from a list into a single element. It takes a function and then applies that function to elements, two at a time, until it has only one element. For example:
Prelude> foldl1 (+) [3,5,6,7]
21
...is found by doing operations one-by-one:
3 + 5 == 8
8 + 6 == 14
14 + 7 == 21
A fold can be written
ourFold :: (a -> a -> a) -> [a] -> a
ourFold _ [a] = a -- pattern-match for a single-element list. Our work is done.
ourFold aFunction (x0:x1:xs) = ourFold aFunction ((aFunction x0 x1):xs)
A tree fold would do this, but move up or down the branches of the tree. To do this, it first needs to pattern-match to see whether you're operating on a leaf (Lf) or a branch (Br).
treeFold _ (Lf a) = a -- A one-leaf tree folds to the value it holds
treeFold f (Br a b) = -- ...
The rest is left up to you, since it's homework. If you're stuck, try first thinking of what the type should be.
A fold is an operation which "compacts" a data structure into a single value using an operation. There are variations depending if you have a start value and execution order (e.g. for lists you have foldl, foldr, foldl1 and foldr1), so the correct implementation depends on your assignment.
I guess your tfold should simply replace all leaves with their values, and all branches with applications of the given operation. Draw an example tree with some numbers, and "collapse" it given an operation like (+). After this, it should be easy to write a function doing the same.

Resources