Efficient heaps in purely functional languages - haskell

As an exercise in Haskell, I'm trying to implement heapsort. The heap is usually implemented as an array in imperative languages, but this would be hugely inefficient in purely functional languages. So I've looked at binary heaps, but everything I found so far describes them from an imperative viewpoint and the algorithms presented are hard to translate to a functional setting. How can I efficiently implement a heap in a purely functional language such as Haskell?
Edit: By efficient I mean it should still be in O(n*log n), but it doesn't have to beat a C program. Also, I'd like to use purely functional programming. What else would be the point of doing it in Haskell?

There are a number of Haskell heap implementations in an appendix to Okasaki's Purely Functional Data Structures. (The source code can be downloaded at the link. The book itself is well worth reading.) None of them are binary heaps, per se, but the "leftist" heap is very similar. It has O(log n) insertion, removal, and merge operations. There are also more complicated data structures like skew heaps, binomial heaps, and splay heaps which have better performance.
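For a concrete feel, here is a minimal leftist heap along the lines of Okasaki's (a sketch of mine, not code from the book): merge does all the work, and insert and deleteMin are thin wrappers around it, each O(log n) because the right spine has at most logarithmic length.

data LHeap a = E | T Int a (LHeap a) (LHeap a) -- rank, element, left, right

rank :: LHeap a -> Int
rank E           = 0
rank (T r _ _ _) = r

-- Build a node, swapping children so the right spine stays shortest.
makeT :: a -> LHeap a -> LHeap a -> LHeap a
makeT x a b
  | rank a >= rank b = T (rank b + 1) x a b
  | otherwise        = T (rank a + 1) x b a

merge :: Ord a => LHeap a -> LHeap a -> LHeap a
merge h E = h
merge E h = h
merge h1@(T _ x a1 b1) h2@(T _ y a2 b2)
  | x <= y    = makeT x a1 (merge b1 h2)
  | otherwise = makeT y a2 (merge h1 b2)

insert :: Ord a => a -> LHeap a -> LHeap a
insert x = merge (T 1 x E E)

deleteMin :: Ord a => LHeap a -> LHeap a
deleteMin (T _ _ a b) = merge a b
deleteMin E           = E -- Okasaki raises an error here instead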

Jon Fairbairn posted a functional heapsort to the Haskell Cafe mailing list back in 1997:
http://www.mail-archive.com/haskell@haskell.org/msg01788.html
I reproduce it below, reformatted to fit this space. I've also slightly simplified the code of merge_heap.
I'm surprised treefold isn't in the standard prelude since it's so useful. Translated from the version I wrote in Ponder in October 1992 -- Jon Fairbairn
module Treefold where

-- treefold (*) z [a,b,c,d,e,f] = (((a*b)*(c*d))*(e*f))
treefold f zero [] = zero
treefold f zero [x] = x
treefold f zero (a:b:l) = treefold f zero (f a b : pairfold l)
  where pairfold (x:y:rest) = f x y : pairfold rest
        pairfold l = l -- here l will have fewer than 2 elements

module Heapsort where

import Treefold

data Heap a = Nil | Node a [Heap a]

heapify x = Node x []

heapsort :: Ord a => [a] -> [a]
heapsort = flatten_heap . merge_heaps . map heapify
  where
    merge_heaps :: Ord a => [Heap a] -> Heap a
    merge_heaps = treefold merge_heap Nil

    flatten_heap Nil = []
    flatten_heap (Node x heaps) = x : flatten_heap (merge_heaps heaps)

    merge_heap heap Nil = heap
    merge_heap node_a@(Node a heaps_a) node_b@(Node b heaps_b)
      | a < b     = Node a (node_b : heaps_a)
      | otherwise = Node b (node_a : heaps_b)
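Loading the Heapsort module in ghci:

ghci> heapsort [5,3,8,1,9,2]
[1,2,3,5,8,9]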

You could also use the ST monad, which allows you to write imperative code while safely exposing a purely functional interface.

As an exercise in Haskell, I implemented an imperative heapsort with the ST Monad.
{-# LANGUAGE ScopedTypeVariables #-}
import Control.Monad (forM, forM_)
import Control.Monad.ST (ST, runST)
import Data.Array.MArray (newListArray, readArray, writeArray)
import Data.Array.ST (STArray)
import Data.STRef (newSTRef, readSTRef, writeSTRef)

heapSort :: forall a. Ord a => [a] -> [a]
heapSort list = runST $ do
    let n = length list
    heap <- newListArray (1, n) list :: ST s (STArray s Int a)
    heapSizeRef <- newSTRef n
    let heapifyDown pos = do
            val <- readArray heap pos
            heapSize <- readSTRef heapSizeRef
            let children = filter (<= heapSize) [pos*2, pos*2+1]
            childrenVals <- forM children $ \i -> do
                childVal <- readArray heap i
                return (childVal, i)
            let (minChildVal, minChildIdx) = minimum childrenVals
            if null children || val < minChildVal
                then return ()
                else do
                    -- swap with the smaller child and keep sinking
                    writeArray heap pos minChildVal
                    writeArray heap minChildIdx val
                    heapifyDown minChildIdx
        lastParent = n `div` 2
    -- build the heap bottom-up
    forM_ [lastParent,lastParent-1..1] heapifyDown
    -- repeatedly extract the minimum, shrinking the heap each time
    forM [n,n-1..1] $ \i -> do
        top <- readArray heap 1
        val <- readArray heap i
        writeArray heap 1 val
        writeSTRef heapSizeRef (i-1)
        heapifyDown 1
        return top
By the way, I contest the claim that there is no point in doing this in Haskell if it isn't purely functional. I think my toy implementation is much nicer than what one would achieve in C++ with templates, passing everything around to the inner functions.

And here is a Fibonacci Heap in Haskell:
https://github.com/liuxinyu95/AlgoXY/blob/algoxy/datastruct/heap/other-heaps/src/FibonacciHeap.hs
And here is a PDF describing some other k-ary heaps, based on Okasaki's work.
https://github.com/downloads/liuxinyu95/AlgoXY/kheap-en.pdf

Just as in efficient quicksort algorithms written in Haskell, you need to use monads (state transformers) to do things in place.

Arrays in Haskell aren't as hugely inefficient as you might think, but typical practice in Haskell would probably be to implement this using ordinary data types, like this:
data Heap a = Empty | Heap a (Heap a) (Heap a)
fromList :: Ord a => [a] -> Heap a
toSortedList :: Ord a => Heap a -> [a]
heapSort = toSortedList . fromList
If I were solving this problem, I might start by stuffing the list elements into an array, making it easier to index them for heap creation.
import Data.Array

fromList xs = heapify 0
  where
    size = length xs
    elems = listArray (0, size - 1) xs :: Array Int a
    heapify n = ...
If you're using a binary max heap, you might want to keep track of the size of the heap as you remove elements so you can find the bottom right element in O(log N) time. You could also take a look at other types of heaps that aren't typically implemented using arrays, like binomial heaps and Fibonacci heaps.
A final note on array performance: in Haskell there's a tradeoff between using static arrays and using mutable arrays. With static arrays, you have to create new copies of the arrays when you change the elements. With mutable arrays, the garbage collector has a hard time keeping different generations of objects separated. Try implementing the heapsort using an STArray and see how you like it.
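For comparison, a fully persistent alternative that fits the Heap type above is a skew heap. A sketch (the function bodies are mine, not the answer's): everything reduces to merge, and heapSort is O(n log n) amortized.

data Heap a = Empty | Heap a (Heap a) (Heap a)

merge :: Ord a => Heap a -> Heap a -> Heap a
merge h Empty = h
merge Empty h = h
merge h1@(Heap x l1 r1) h2@(Heap y l2 r2)
  | x <= y    = Heap x (merge r1 h2) l1 -- swap children on the way down
  | otherwise = Heap y (merge r2 h1) l2

fromList :: Ord a => [a] -> Heap a
fromList = foldr (\x h -> merge (Heap x Empty Empty) h) Empty

toSortedList :: Ord a => Heap a -> [a]
toSortedList Empty        = []
toSortedList (Heap x l r) = x : toSortedList (merge l r)

heapSort :: Ord a => [a] -> [a]
heapSort = toSortedList . fromList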

I tried to port the standard binary heap to a functional setting. There is an article describing the idea: A Functional Approach to Standard Binary Heaps. All the source code listings in the article are in Scala, but it can be ported quite easily to any other functional language.

Here is a page containing an ML version of HeapSort. It's quite detailed and should provide a good starting point.
http://flint.cs.yale.edu/cs428/coq/doc/Reference-Manual021.html

Related

Efficient way to do n-ary branch / tabulated functions?

I'm trying to get some basic information on the performance characteristics of branches in SBV.
Let's suppose I have an SInt16 and a very sparse lookup table Map Int16 a. I can implement the lookup with nested ite:
sCase :: (Mergeable a) => SInt16 -> a -> Map Int16 a -> a
sCase x def = go . toList
  where
    go [] = def
    go ((k,v):kvs) = ite (x .== literal k) v (go kvs)
However, this means the generated tree will be very deep.
Does that matter?
If yes, is it better to instead generate a balanced tree of branches, effectively mirroring the Map's structure? Or is there some other scheme that would give even better performance?
If there are fewer than 256 entries in the map, would it change anything to "compress" it so that sCase works on an SInt8 and a Map Int8 a?
Is there some built-in SBV combinator for this use case that works better than iterated ite?
EDIT: It turns out that it matters a lot what my a is, so let me add some more detail to that. I am currently using sCase to branch in a stateful computation modeled as an RWS r w s a, with the following instances:
instance forall a. Mergeable a => Mergeable (Identity a) where
    symbolicMerge force cond thn els =
        Identity $ symbolicMerge force cond (runIdentity thn) (runIdentity els)

instance (Mergeable s, Mergeable w, Mergeable a, forall a. Mergeable a => Mergeable (m a)) => Mergeable (RWST r w s m a) where
    symbolicMerge force cond thn els = Lazy.RWST $
        symbolicMerge force cond (runRWST thn) (runRWST els)
So stripping away all the newtypes, I'd like to branch into something of type r -> s -> (a, s, w) s.t. Mergeable s, Mergeable w and Mergeable a.
Symbolic look-ups are expensive
Symbolic array lookup will be expensive regardless of what data structure you use. It boils down to the fact that there's no information available to the symbolic execution engine to cut down the state space, so it ends up doing more or less what you coded yourself.
SMTLib Arrays
However, the best solution in these cases is to actually use SMT's support for arrays: http://smtlib.cs.uiowa.edu/theories-ArraysEx.shtml
SMTLib arrays are different than what you'd consider as an array in a regular programming language: It does not have bounds. In that sense, it's more of a map from inputs to outputs, spanning the entire domain. (i.e., they are equivalent to functions.) But SMT has custom theories to deal with arrays and thus they can handle problems involving arrays much more efficiently. (On the down-side, there's no notion of index-out-of-bounds or somehow controlling the range of elements you can access. You can code those up yourself on top of the abstraction though, leaving it up to you to decide how you want to handle such invalid accesses.)
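For example, a bounds check coded on top of the unbounded abstraction might look like this (a sketch; checkedRead and the out-of-range default are my own choices, not an SBV API):

checkedRead :: SArray Int16 Int16 -> (SInt16, SInt16) -> SInt16 -> SInt16
checkedRead arr (lo, hi) i =
    ite (i .>= lo .&& i .<= hi)
        (readArray arr i)
        0 -- whatever out-of-range policy you want goes here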
If you are interested in learning more about how SMT solvers deal with arrays, the classic reference is: http://theory.stanford.edu/~arbrad/papers/arrays.pdf
Arrays in SBV
SBV supports arrays, through the SymArray class: https://hackage.haskell.org/package/sbv-8.7/docs/Data-SBV.html#t:SymArray
The SFunArray type actually does not use SMTLib arrays. This was designed to support solvers that didn't understand Arrays, such as ABC: https://hackage.haskell.org/package/sbv-8.7/docs/Data-SBV.html#t:SFunArray
The SArray type fully supports SMTLib arrays: https://hackage.haskell.org/package/sbv-8.7/docs/Data-SBV.html#t:SArray
There are some differences between these types, and the above links describe them. However, for most purposes, you can use them interchangeably.
Converting a Haskell map to an SBV array
Going back to your original question, I'd be tempted to use an SArray to model such a look up. I'd code it as:
{-# LANGUAGE ScopedTypeVariables #-}
import Data.SBV
import qualified Data.Map as M
import Data.Int
-- Fill an SBV array from a map
mapToSArray :: (SymArray array, SymVal a, SymVal b) => M.Map a (SBV b) -> array a b -> array a b
mapToSArray m a = foldl (\arr (k, v) -> writeArray arr (literal k) v) a (M.toList m)
And use it as:
g :: Symbolic SBool
g = do let def = 0
       -- get a symbolic array, initialized with def
       arr <- newArray "myArray" (Just def)
       let m :: M.Map Int16 SInt16
           m = M.fromList [(5, 2), (10, 5)]
       -- Fill the array from the map
       let arr' :: SArray Int16 Int16 = mapToSArray m arr
       -- A simple problem:
       idx1 <- free "idx1"
       idx2 <- free "idx2"
       pure $ 2 * readArray arr' idx1 + 1 .== readArray arr' idx2
When I run this, I get:
*Main> sat g
Satisfiable. Model:
idx1 = 5 :: Int16
idx2 = 10 :: Int16
You can run it as satWith z3{verbose=True} g to see the SMTLib output it generates, which avoids costly lookups by simply delegating those tasks to the backend solver.
Efficiency
The question of whether this will be "efficient" really depends on how many elements your map has that you're constructing the array from. The larger the number of elements and the trickier the constraints, the less efficient it will be. In particular, if you ever write to an index that is symbolic, I'd expect slow-downs in solving time. If they're all constants, it should be relatively performant. As is usual in symbolic programming, it's really hard to predict any performance without seeing the actual problem and experimenting with it.
Arrays in the query context
The function newArray works in the symbolic context. If you're in a query context, instead use freshArray: https://hackage.haskell.org/package/sbv-8.7/docs/Data-SBV-Control.html#v:freshArray

Haskell vector C++ push_back analogue

I've discovered that Haskell Data.Vector.* miss C++ std::vector::push_back's functionality. There is grow/unsafeGrow, but they seem to have O(n) complexity.
Is there a way to grow vectors in O(1) amortized time for an element?
No, there really is no such facility in Data.Vector. It isn't too difficult to implement this from scratch using MutableArray like Data.Vector.Mutable does (see my implementation below), but there are some significant drawbacks. In particular, all of its operations end up happening inside some state context, usually ST or IO. This has the downsides that
Any code that manipulates such a data structure ends up having to be monadic
The compiler is much less likely to be able to optimize. For example, libraries like vector use something really clever called fusion to optimize away intermediate allocations. This sort of thing is not possible in a state context.
Parallelism is going to be a lot tougher: in ST I can't even have two threads and in IO I will have race conditions all over the place. The nasty bit here is that any sharing is going to have to happen in IO.
As if all this wasn't enough, garbage collection also performs better inside pure code.
What do I do then?
It isn't particularly often that you have a need for exactly this behaviour - usually you are better off using an immutable data structure (thereby avoiding all of the aforementioned problems) that does something similar. Limiting ourselves to the containers that come with GHC, some alternatives include:
if you are almost always just using push_back, maybe you just want a stack (a plain old [a]).
if you anticipate doing more push_back than lookups, Data.Sequence gives you O(1) appending to either end and O(log n) lookup (see the sketch after this list).
if you are interested in a lot of operations, especially hashmap-like ones, Data.IntMap is pretty optimized. Even if the theoretical cost of those operations is O(log n), you will need a pretty big IntMap to start feeling those costs.
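For instance, the Data.Sequence route looks like this (a small sketch; the names pushBacks and secondElement are mine):

import Data.Sequence (Seq, (|>), index)
import qualified Data.Sequence as Seq

pushBacks :: Seq Int
pushBacks = foldl (|>) Seq.empty [1 .. 10] -- (|>) appends on the right in O(1)

secondElement :: Int
secondElement = pushBacks `index` 1 -- O(log n) positional lookup; here 2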
Making something like C++ vector
Of course, if one doesn't care about the restrictions mentioned initially, there is no reason not to have a C++ like vector. Just for fun, I went ahead and implemented this from scratch (needs packages data-default and primitive).
The reason this code is probably not already in some library is that it goes against much of the spirit of Haskell (I do this with the intent of conforming to a C++ style vector).
The only operation that actually makes a new vector is newVector - everything else "modifies" an existing vector. Since pushBack doesn't return a new GrowVector, it has to modify the existing one (including its length and/or capacity), so length and capacity have to be "pointers". In turn, that means that even getting the length is a monadic operation.
While this isn't unboxed, it would not be too difficult to replicate vector's data family approach - it is just tedious [1].
With that said:
module GrowVector (
    GrowVector, newEmpty, size, read, write, pushBack, popBack
) where

import Data.Primitive.Array
import Data.Primitive.MutVar
import Data.Default
import Control.Monad
import Control.Monad.Primitive (PrimState, PrimMonad)
import Prelude hiding (length, read)

data GrowVector s a = GrowVector
    { underlying :: MutVar s (MutableArray s a) -- ^ underlying array
    , length :: MutVar s Int                    -- ^ perceived length of vector
    , capacity :: MutVar s Int                  -- ^ actual capacity
    }

type GrowVectorIO = GrowVector (PrimState IO)

-- | Make a new empty vector with the given capacity. O(n)
newEmpty :: (Default a, PrimMonad m) => Int -> m (GrowVector (PrimState m) a)
newEmpty cap = do
    arr <- newArray cap def
    GrowVector <$> newMutVar arr <*> newMutVar 0 <*> newMutVar cap

-- | Read an element in the vector (unchecked). O(1)
read :: PrimMonad m => GrowVector (PrimState m) a -> Int -> m a
g `read` i = do arr <- readMutVar (underlying g); arr `readArray` i

-- | Find the size of the vector. O(1)
size :: PrimMonad m => GrowVector (PrimState m) a -> m Int
size g = readMutVar (length g)

-- | Double the vector capacity. O(n)
resize :: (Default a, PrimMonad m) => GrowVector (PrimState m) a -> m ()
resize g = do
    curCap <- readMutVar (capacity g)         -- read current capacity
    curArr <- readMutVar (underlying g)       -- read current array
    curLen <- readMutVar (length g)           -- read current length
    newArr <- newArray (2 * curCap) def       -- allocate a new array twice as big
    copyMutableArray newArr 0 curArr 0 curLen -- copy the old array over
    underlying g `writeMutVar` newArr         -- use the new array in the vector
    capacity g `modifyMutVar'` (*2)           -- update the capacity in the vector

-- | Write an element to the array (unchecked). O(1)
write :: PrimMonad m => GrowVector (PrimState m) a -> Int -> a -> m ()
write g i x = do arr <- readMutVar (underlying g); writeArray arr i x

-- | Pop an element off the vector, mutating it (unchecked). O(1)
popBack :: PrimMonad m => GrowVector (PrimState m) a -> m a
popBack g = do
    s <- size g
    x <- g `read` (s - 1)
    length g `modifyMutVar'` subtract 1
    pure x

-- | Push an element. (Amortized) O(1)
pushBack :: (Default a, PrimMonad m) => GrowVector (PrimState m) a -> a -> m ()
pushBack g x = do
    s <- readMutVar (length g)    -- read current size
    c <- readMutVar (capacity g)  -- read current capacity
    when (s == c) (resize g)      -- if need be, resize
    write g s x                   -- write to the back of the array
    length g `modifyMutVar'` (+1) -- increase the length
Current semantics of grow
I think the github issue does a pretty good job of explaining the semantics:
I think the intended semantics are that it may do a realloc, but not guaranteed to, and all the current implementations do the simpler copying semantics because for on heap allocations the cost should be roughly the same.
Basically you should use grow when you want a new mutable vector of an increased size, starting with the elements of the old vector (and no longer care about the old vector). This is quite useful - for example one could implement GrowVector using MVector and grow.
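A rough sketch of that idea (names are mine; the MVector's physical length serves as the capacity, and the caller threads the logical length through):

import qualified Data.Vector.Mutable as MV
import Control.Monad.Primitive (PrimMonad, PrimState)

type Grow m a = (MV.MVector (PrimState m) a, Int) -- (buffer, logical length)

pushBackMV :: PrimMonad m => Grow m a -> a -> m (Grow m a)
pushBackMV (v, len) x = do
    v' <- if len == MV.length v
            then MV.grow v (max 1 (MV.length v)) -- double the capacity
            else pure v
    MV.write v' len x
    pure (v', len + 1)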
[1] The approach is that for every new type of unboxed vector you want to have, you make a data instance that "expands" your type into a fixed number of unboxed arrays (or other unboxed vectors). This is the point of data family - to allow different instantiations of a type to have totally different runtime representations, and to also be extensible (you can add your own data instance if you want).

What is "a Haskell way" to transpose a graph?

Suppose I have a tree represented as a list of parents and I want to reverse the edges, obtaining a list of children for each node. For this tree - http://i.stack.imgur.com/uapqT.png - the transformation would look like:
[0,0,0,1,1,2,5,4,4] -> [[2,1],[4,3],[5],[],[8,7],[6],[],[],[]]
It's not limited to graph transposing, however. I have a few other problems that I would solve in an imperative language in the following way: traverse some source data array and non-sequentially update a resulting array as I get to know something about it.
Essentially, my question is "what is Haskell's idiomatic way to solve things like this?". As I understand it, I can do it in an imperative way by means of mutable vectors, but isn't there some purely functional method? If not, how would I properly use mutables?
Finally, I need it to work fast, that is, in O(n), and non-standard packages are not an option for me.
It's worth considering the pure functions in Data.Vector or Data.Array that internally use mutation in order to be more efficient (the accum-s in both libraries, plus the unfolds and construct-s in vector).
The accum-s are great when we don't care about intermediate states of an array during construction. They're nicely applicable for transposing graphs, although we have to provide a range for the node keys:
{-# LANGUAGE TupleSections #-}
import qualified Data.Array as A
type Graph = [(Int, [Int])]
transpose :: (Int, Int) -> Graph -> Graph
transpose range g =
    A.assocs $ A.accumArray (flip (:)) [] range (do {(i, ns) <- g; map (,i) ns})
Here we first unroll the graph into an adjacency list, but with swapped pairs of indices, and then accumulate them into an array. It's roughly as fast as a standard imperative loop over a mutable array, and it's more convenient than the ST monad.
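For instance, a quick check in ghci (each child list comes out in reverse order of insertion, because flip (:) conses onto the front):

ghci> transpose (0,2) [(0,[1,2]),(1,[2])]
[(0,[]),(1,[0]),(2,[1,0])]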
Alternatively, we can just use IntMap, likely alongside the State monad, and just port our imperative algorithms as they are, and the performance will be satisfactory for most purposes.
Fortunately IntMap provides a lot of higher-order functions, so we're not (always) forced to program in an imperative style with it. There's an analogue for accum, for instance:
import qualified Data.IntMap.Strict as IM
transpose :: Graph -> Graph
transpose g =
    IM.assocs $ IM.fromListWith (++) (do {(i, ns) <- g; (i,[]) : map (,[i]) ns})
A purely functional way would be to use a map to store the information, producing an O(n log n) algorithm:
import qualified Data.IntMap as IM
import Data.Maybe (fromMaybe)
childrenMap :: [Int] -> IM.IntMap [Int]
childrenMap xs = foldr addChild IM.empty $ zip xs [0..]
  where
    addChild :: (Int, Int) -> IM.IntMap [Int] -> IM.IntMap [Int]
    addChild (parent, child) = IM.alter (Just . (child :) . fromMaybe []) parent
You could also use an imperative solution and keep things pure using the ST monad, which is obviously O(n), but the imperative code somewhat obscures the main idea:
import Control.Monad (forM_)
import Data.Array
import Data.Array.MArray
import Data.Array.ST
childrenST :: [Int] -> [[Int]]
childrenST xs = elems $ runSTArray $ do
    let l = length xs
    arr <- newArray (0, l - 1) []
    let add (parent, child) =
            writeArray arr parent . (child :) =<< readArray arr parent
    forM_ (zip xs [0..]) add
    return arr
One drawback of this approach is that if an index is out of bounds, it just fails.
Another is that you traverse the list twice. However, if you used arrays instead of lists everywhere, this wouldn't matter.

How can I get a strict accumArray?

I have a list of key-value pairs and I want to count how many times each key occurs and what values it occurs with, but when I try, I get a stack overflow. Here's a simplified version of the code I'm running:
import Array
add (n, vals) val = n `seq` vals `seq` (n+1,val:vals)
histo = accumArray add (0,[]) (0,9) [(0, n) | n <- [0..5000000]]
main = print histo
When I compile this with 'ghc -O' and run it, I get "Stack space overflow: current size 8388608 bytes."
I think I know what's going on: accumArray has the same properties as foldl, and so I need a strict version of accumArray. Unfortunately, the only one I've found is in Data.Array.Unboxed, which doesn't work for an array of lists.
The documentation says that when the accumulating function is strict, then accumArray should be too, but I can't get this to work, and the discussion here claims that the documentation is wrong (at least for GHC).
Is there a strict version of accumArray other than the one in Data.Array.Unboxed? Or is there a better way to do what I want?
Well, strict doesn't necessarily mean that no thunks are created, it just means that if an argument is bottom, the result is bottom too. But accumArray is not that strict, it just writes bottoms to the array if they occur. It can't really do anything else, since it must allow for non-strict functions that could produce defined values from intermediate bottoms. And the strictness analyser can't rewrite it so that the accumulation function is evaluated to WHNF on each write if it is strict, because that would change the semantics of the programme in a rather drastic way (an array containing some bottoms vs. bottom).
That said, I agree that there's an unfortunate lack of strict and eager functions in several areas.
For your problem, you can use a larger stack (+RTS -K128M didn't suffice here, but 256M did), or you can use
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.ST
import GHC.Arr
strictAccumArray :: Ix i => (e -> a -> e) -> e -> (i,i) -> [(i,a)] -> Array i e
strictAccumArray fun ini (l,u) ies = case iar of
    Array _ _ m barr -> Array l u m barr
  where
    iar = runSTArray $ do
        let n = safeRangeSize (l,u)
            stuff = [(safeIndex (l,u) n i, e) | (i, e) <- ies]
        arr <- newArray (0,n-1) ini
        let go ((i,v):ivs) = do
                old <- unsafeRead arr i
                unsafeWrite arr i $! fun old v
                go ivs
            go [] = return arr
        go stuff
With a strict write, the thunks are kept small, so there's no stack overflow. But beware, the lists take a lot of space, so if your list is too long, you may get a heap exhaustion.
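With that in place, the question's example should go through unchanged apart from the function name (a sketch; the seq-s in the question's add already force the pair's components, and the strict write forces the pair itself):

histo :: Array Int (Int, [Int])
histo = strictAccumArray add (0,[]) (0,9) [(0, n) | n <- [0..5000000]]
  where add (n, vals) val = n `seq` vals `seq` (n+1, val:vals)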
Another option would be to use a Data.Map (or Data.IntMap, if the version of containers is 0.4.1.0 or later) instead of an array, since that comes with insertWith', which forces the result of the combining function on use. The code could for example be
import qualified Data.Map as M -- or Data.IntMap
import Data.List (foldl')
histo :: M.Map Int (Int,[Int]) -- M.IntMap (Int,[Int])
histo = foldl' upd M.empty [(0,n) | n <- [0 .. 15000000]]
  where
    upd mp (i,n) = M.insertWith' add i (1,[n]) mp
    add (j,val:_) (k,vals) = k `seq` vals `seq` (k+j,val:vals)
    add _ pr = pr -- to avoid non-exhaustive pattern warning
Disadvantages of using a Map are
the combining function must have type a -> a -> a, so it needs to be a bit more complicated in your case.
an update is O(log size) instead of O(1), so for large histograms, it will be considerably slower.
Maps and IntMaps have some book-keeping overhead, so they will use more space than an array. But if the list of updates is large compared to the number of indices, the difference will be negligible in this case, where the size of the values grows with each update (the overhead is k words per key, independent of the size of the values).

Random-Pivot Quicksort in Haskell

Is it possible to implement a quicksort in Haskell (with RANDOM-PIVOT) that still has a simple Ord a => [a]->[a] signature?
I'm starting to understand monads, and, for now, I'm kind of interpreting them as something like a 'command pattern', which works great for IO.
So, I understand that a function that returns a random number should actually return a monadic value like IO, because, otherwise, it would break referential transparency. I also understand that there should be no way to 'extract' the random integer from the returned monadic value, because, otherwise, it would, again, break referential transparency.
But yet, I still think that it should be possible to implement a 'pure' [a]->[a] quicksort function, even if it uses a random pivot, because it IS referentially transparent. From my point of view, the random pivot is just an implementation detail, and shouldn't change the function's signature.
OBS: I'm not actually interested in the specific quicksort problem (so, I don't want to sound rude but I'm not looking for "use mergesort" or "random pivot doesn't increase performance in practice" kind of answers) I'm actually interested in how to implement a 'pure' function that uses 'impure' functions inside it, in cases like quicksort, where I can assure that the function actually is a pure one.
Quicksort is just a good example.
You are making the false assumption that picking the pivot point is just an implementation detail. Consider a partial ordering on a set, like a quicksort on cards where card a < card b if the face value is less, but where comparisons across suits evaluate like this:

4 spades < 4 hearts (false)
4 hearts < 4 spades (false)
4 hearts = 4 spades (false)

In that case the choice of pivots would determine the final ordering of the cards. In precisely the same way, the output of a function like

a = get random integer
b = a + 3
print b

is determined by a. If you are randomly choosing something, then your computation is, or could be, non-deterministic.
OK, check this out.
Selected portions copied from the hashable package, and some voodoo magic language pragmas:
{-# LANGUAGE FlexibleInstances, UndecidableInstances,
             NoMonomorphismRestriction, OverlappingInstances #-}
import System.Random (mkStdGen, next, split)
import Data.List (foldl')
import Data.Bits (shiftL, xor)

class Hashable a where
    hash :: a -> Int

instance (Integral a) => Hashable a where
    hash = fromIntegral

instance Hashable Char where
    hash = fromEnum

instance (Hashable a) => Hashable [a] where
    hash = foldl' combine 0 . map hash

-- ask the authors of the hashable package about this if interested
combine h1 h2 = (h1 + h1 `shiftL` 5) `xor` h2
OK, so now we can take a list of anything Hashable and turn it into an Int. I've provided Char and Integral a instances here; more and better instances are in the hashable package, which also allows salting and stuff.
This is all just so we can make a number generator.
genFromHashable = mkStdGen . hash
So now the fun part. Let's write a function that takes a random number generator, a comparator function, and a list. Then we'll sort the list by consulting the generator to select a pivot, and the comparator to partition the list.
qSortByGen _ _ [] = []
qSortByGen g f xs = qSortByGen g'' f l ++ mid ++ qSortByGen g''' f r
  where (l, mid, r) = partition (`f` pivot) xs
        pivot = xs !! (pivotLoc `mod` length xs)
        (pivotLoc, g') = next g
        (g'', g''') = split g'

partition f = foldl' step ([],[],[])
  where step (l,mid,r) x = case f x of
            LT -> (x:l,mid,r)
            EQ -> (l,x:mid,r)
            GT -> (l,mid,x:r)
Library functions: next grabs an Int from the generator, and produces a new generator. split forks the generator into two distinct generators.
My functions: partition uses f :: a -> Ordering to partition the list into three lists. If you know folds, it should be quite clear. (Note that it does not preserve the initial ordering of the elements in the sublists; it reverses them. Using a foldr could remedy this were it an issue.) qSortByGen works just like I said before: consult the generator for the pivot, partition the list, fork the generator for use in the two recursive calls, recursively sort the left and right sides, and concatenate it all together.
Convenience functions are easy to compose from here
qSortBy f xs = qSortByGen (genFromHashable xs) f xs
qSort = qSortBy compare
Notice the final function's signature.
ghci> :t qSort
qSort :: (Ord a, Hashable a) => [a] -> [a]
The type inside the list must implement both Hashable and Ord. There's the "pure" function you were asking for, with one logical added requirement. The more general functions are less restrictive in their requirements.
ghci> :t qSortBy
qSortBy :: (Hashable a) => (a -> a -> Ordering) -> [a] -> [a]
ghci> :t qSortByGen
qSortByGen
:: (System.Random.RandomGen t) =>
t -> (a -> a -> Ordering) -> [a] -> [a]
Final notes
qSort will behave exactly the same way for all inputs. The "random" pivot selection is, in fact, deterministic. But it is obscured by hashing the list and then seeding a random number generator, making it "random" enough for me. ;)
qSort also only works for lists with length less than maxBound :: Int, which ghci tells me is 9,223,372,036,854,775,807. I thought there would be an issue with negative indexes, but in my ad-hoc testing I haven't run into it yet.
Or, you can just live with the IO monad for "truer" randomness.
qSortIO xs = do g <- getStdGen -- add getStdGen to your imports
                return $ qSortByGen g compare xs
ghci> :t qSortIO
qSortIO :: (Ord a) => [a] -> IO [a]
ghci> qSortIO "Hello world"
" Hdellloorw"
ghci> qSort "Hello world"
" Hdellloorw"
In such cases, where you know that the function is referentially transparent but you can't prove it to the compiler, you may use the function unsafePerformIO :: IO a -> a from the module System.IO.Unsafe.
For instance, you may use unsafePerformIO to get an initial random state and then do anything using just this state.
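A sketch of that pattern (the helper globalGen is mine, and the NOINLINE pragma is needed so GHC doesn't duplicate the effect; this reuses qSortByGen from the earlier answer, and unlike the hashing version its results can differ from run to run):

import System.IO.Unsafe (unsafePerformIO)
import System.Random (StdGen, newStdGen)

globalGen :: StdGen
globalGen = unsafePerformIO newStdGen
{-# NOINLINE globalGen #-}

qSortRandom :: Ord a => [a] -> [a]
qSortRandom = qSortByGen globalGen compare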
But please note: don't use it if it's not really needed, and even then, think twice about it. unsafePerformIO is somewhat the root of all evil, since its consequences can be dramatic - anything from coercing different types to crashing the RTS is possible with this function.
Haskell provides the ST monad to perform non-referentially-transparent actions with a referentially transparent result.
Note that it doesn't enforce referential transparency; it just ensures that potentially non-referentially-transparent temporary state can't leak out. Nothing can prevent you from returning manipulated pure input data that was rearranged in a non-reproducible way. The best approach is to implement the same thing in both ST and pure ways and use QuickCheck to compare them on random inputs.
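For the quicksort above, such a check might look like this (a sketch; the property name is mine):

import Test.QuickCheck
import Data.List (sort)

-- Equality with the library sort covers both sortedness and permutation.
prop_matchesSort :: [Int] -> Bool
prop_matchesSort xs = qSort xs == sort xs

main :: IO ()
main = quickCheck prop_matchesSort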
