How to create data for Criterion benchmarks? - haskell

I am using criterion to benchmark my Haskell code. I'm doing some heavy computations for which I need random data. I've written my main benchmark file like this:
main :: IO ()
main = newStdGen >>= defaultMain . benchmarks
benchmarks :: RandomGen g => g -> [Benchmark]
benchmarks gen =
    [
      bgroup "Group"
      [
        bench "MyFun" $ nf benchFun (dataFun gen)
      ]
    ]
I keep benchmarks and data generators for them in different modules:
benchFun :: ([Double], [Double]) -> [Double]
benchFun (ls, sig) = fun ls sig
dataFun :: RandomGen g => g -> ([Double], [Double])
dataFun gen = (take 5 $ randoms gen, take 1024 $ randoms gen)
This works, but I have two concerns. First, is the time needed to generate random data included in the benchmark? I found a question that touches on that subject, but honestly I'm unable to apply it to my code. To check whether this happens, I wrote an alternative version of my data generator in the IO monad. I moved the benchmarks list into main, called the generator, extracted the result with <-, and then passed it to the benchmarked function. I saw no difference in performance.
My second concern is related to generating random data. Right now the generator, once created, is not updated, which leads to generating the same data within a single run. This is not a major problem, but nevertheless it would be nice to do it properly. Is there a neat way to generate different random data within each data* function? "Neat" means "without making data functions acquiring StdGen within IO"?
EDIT: As noted in the comment below, I don't really care about data randomness. What is important to me is that the time needed to generate the data is not included in the benchmark.

This works, but I have two concerns. First, is the time needed to generate random data included in the benchmark?
Yes, it would be. All of the random generation happens lazily, that is, only when something demands the values.
To check whether this happens, I wrote an alternative version of my data generator in the IO monad. I moved the benchmarks list into main, called the generator, extracted the result with <-, and then passed it to the benchmarked function. I saw no difference in performance.
This is expected (if I understand what you mean); the random values from randoms gen aren't going to be generated until they're needed (i.e. inside your benchmark loop).
Is there a neat way to generate different random data within each data* function? "Neat" means "without making data functions acquiring StdGen within IO"?
You need either to be in IO or to create a StdGen from an integer seed that you supply, using mkStdGen.
Re. your main question of how you should get the pRNG stuff out of your benchmarks, you should be able to evaluate the random input fully before your defaultMain (benchmarks g) stuff, with evaluate and force like:
import Control.DeepSeq (force)
import Control.Exception (evaluate)
myBench g = do
    randInputEvaled <- evaluate $ force $ dataFun g
    defaultMain [
        bench "MyFun" $ nf benchFun randInputEvaled
        ...
Here force evaluates its argument to normal form, but only once the result of force is itself demanded, so on its own it is still lazy. To get the data evaluated outside of bench, we use evaluate, which relies on monadic sequencing in IO. You could also do things like call seq on the tail of each of the lists in your tuple, etc., if you wanted to avoid the imports.
That kind of thing should work fine, unless you need to hold a huge amount of test data in memory.
EDIT: this method is also a good idea if you want to get your data from IO, like reading from the disk, and don't want that mixed in to your benchmarks.
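As a related sketch, criterion also exports env from Criterion.Main, which runs a setup action and forces its result to normal form before the benchmarks that use it are timed. The version below assumes a criterion release that provides env and the random package; benchFun is a hypothetical stand-in for the real function, and the fixed seeds are an assumption made to keep runs reproducible.

```haskell
import Criterion.Main (bench, bgroup, defaultMain, env, nf)
import System.Random (mkStdGen, randoms)

-- Hypothetical stand-in for the function being benchmarked.
benchFun :: ([Double], [Double]) -> [Double]
benchFun (ls, sig) = zipWith (*) (cycle ls) sig

-- Builds the input once. Fixed seeds (an assumption) keep runs reproducible,
-- and two different generators keep the two lists from sharing a prefix.
setupData :: IO ([Double], [Double])
setupData =
  return (take 5 (randoms (mkStdGen 1)), take 1024 (randoms (mkStdGen 2)))

main :: IO ()
main = defaultMain
  [ -- The lambda must not pattern match strictly on its argument.
    env setupData $ \input ->
      bgroup "Group"
        [ bench "MyFun" $ nf benchFun input ]
  ]
```

With this layout the data generation is done in the setup step rather than inside the measured nf call.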

You could try reading the random data from a disk file instead. (In fact, if you're on some Unix-like OS, you could even use /dev/urandom.)
However, depending on how much random data you need, the I/O time might dwarf the computation time.
(E.g., if your benchmark reads random numbers and calculates their sum, it's going to be I/O-limited. If your benchmark reads a random number and does some huge calculation based on just that one number, the I/O adds hardly any overhead at all.)

Related

How to iterate over tree with memory limit in Haskell?

I know that there is a solution for iterating through the Tree using Zippers (see details here). However, it is not clear to me whether it is possible to apply memory constraints to this approach.
Context
I was given the following problem to solve in Haskell:
Design an iterator that will iterate through a binary tree in-order.
Assume the binary tree is stored on disk and can contain up to 10 levels, and therefore can contain up to (2^10 - 1) nodes, and we can store at most 100 nodes in memory at any given time.
The goal of this iterator is to load a small fraction of the binary tree from disk to memory each time it's incremented, so that we don't need to load the entire tree into memory all at once.
I assumed that the memory part is not possible to represent in Haskell, but I was told that it is not true.
Question: what can be used in Haskell to achieve that memory behaviour? Any suggestions, approaches and directions are appreciated. This is just out of curiosity, I've already failed at solving this problem.
If the iterator loads part of the tree each time it is incremented then there are two options:
It exists in the IO monad and works just like in an imperative language.
It is exploiting laziness and interleaved IO. This is the approach taken by functions like readFile which give you the entire contents of a file as one lazy list. The actual file is read on-demand as your application traverses the list.
The latter option is the interesting one here.
The tricky part of lazy lists is retainers. Suppose your file contains a list of numbers. If you compute the sum like this
nums <- map read . lines <$> readFile "numbers.txt"
putStrLn $ "The total is " <> show (sum nums)
then the program will run in constant space. But if you want the average:
putStrLn $ "The average is " <> show (sum nums / fromIntegral (length nums))
then the program will load the entire file into memory. This is because it has to traverse the list twice, once to compute the sum and once to compute the length. It can only do this by holding the entire list.
(The solution is to compute the sum and length in parallel within one pass. But that's beside the point here).
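A minimal sketch of that single-pass idea, using only base: a strict accumulator carries the running sum and count through one foldl', so nothing forces the list to be kept after it has been consumed. The names Acc and mean are illustrative, not from the original answer.

```haskell
import Data.List (foldl')

-- Strict accumulator: running sum and element count.
data Acc = Acc !Double !Int

-- | Mean computed in one traversal, so the list can be garbage-collected
-- as it is consumed. Returns NaN for an empty list (0 / 0).
mean :: [Double] -> Double
mean xs = total / fromIntegral count
  where
    Acc total count = foldl' step (Acc 0 0) xs
    step (Acc s n) x = Acc (s + x) (n + 1)

main :: IO ()
main = print (mean [1, 2, 3, 4])  -- prints 2.5
```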
The challenge for the tree problem you pose is to come up with an approach to iteration which avoids retaining the tree.
Let's assume that each node in the file contains offsets in the file for the left and right child nodes. We can write a function in the IO monad which seeks to an offset and reads the node there.
data MyNode = MyNode Int Int ..... -- Rest of data to be filled in.
readNodeData :: Handle -> Int -> IO MyNode
From there it would be simple to write a function which traverses the entire file to create a Tree MyNode. If you implement this using unsafeInterleaveIO then you can get a tree which is read lazily as you traverse it.
unsafeInterleaveIO is unsafe because you don't know when the IO will be done. You don't even know what order it will happen in, because it only happens when the value is forced during evaluation. In this way it's like the "promise" structures you get in some other languages. In this particular case this isn't a problem because we can assume the file doesn't change during the evaluation.
Unfortunately this doesn't solve the problem because the entire tree will be held in memory by the time you finish. Your traversal has to retain the root, at least as long as it's traversing the left side, and as long as it does so it will retain the whole of the rest of the tree.
The solution is to rewrite the IO part to return a list instead of a tree, something like this:
readNode :: Handle -> Int -> IO [MyNode]
readNode _ (-1) = return [] -- Null case for empty child.
readNode h pos = unsafeInterleaveIO $ do
    n <- readNodeData h pos -- Needs to be defined elsewhere.
    lefts <- readNode h (leftChild n)
    rights <- readNode h (rightChild n)
    return $ lefts ++ [n] ++ rights
This returns the entire tree as a lazy list. As you traverse the list the relevant nodes will be read on demand. As long as you don't retain the list (see above) your program will not need to hold anything more than the current node and its parents.
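To see the on-demand behaviour without a real file, here is a small simulation under stated assumptions: the "disk" is an in-memory list whose index plays the role of the file offset, MyNode has a hypothetical three-field layout, and an IORef counts simulated reads in place of the Handle. It illustrates when reads happen; it does not measure memory use.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Hypothetical node layout: child offsets (-1 means "no child") and a payload.
data MyNode = MyNode { leftChild :: Int, payload :: Int, rightChild :: Int }

-- Stand-in for the file: the list index plays the role of the file offset.
disk :: [MyNode]
disk = [MyNode 1 4 2, MyNode (-1) 2 (-1), MyNode (-1) 6 (-1)]

-- Simulated disk read that counts how many nodes have been loaded.
readNodeData :: IORef Int -> Int -> IO MyNode
readNodeData counter pos = do
  modifyIORef' counter (+ 1)
  return (disk !! pos)

-- In-order traversal as a lazy list; each node is read when first demanded.
readNode :: IORef Int -> Int -> IO [MyNode]
readNode _ (-1) = return []
readNode counter pos = unsafeInterleaveIO $ do
  n <- readNodeData counter pos
  lefts <- readNode counter (leftChild n)
  rights <- readNode counter (rightChild n)
  return (lefts ++ [n] ++ rights)

main :: IO ()
main = do
  counter <- newIORef 0
  nodes <- readNode counter 0
  readIORef counter >>= print          -- prints 0: nothing read yet
  print (take 1 (map payload nodes))   -- prints [2]
  readIORef counter >>= print          -- prints 2: root and its left child
  print (map payload nodes)            -- prints [2,4,6]
  readIORef counter >>= print          -- prints 3
```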

(Edited) How to get random number in Haskell without IO

I want to have a function that returns a different StdGen on each call, without IO.
I've tried to use unsafePerformIO, as the following code.
import System.IO.Unsafe
import System.Random
myStdGen :: StdGen
myStdGen = unsafePerformIO getStdGen
But when I try to call myStdGen in ghci, I always get the same value. Have I abused unsafePerformIO? Or are there any other ways to reach my goal?
EDIT
Sorry, I think I should describe my question more precisely.
Actually, I'm implementing a variation of the treap data structure, which needs a special 'merge' operation. It relies on some randomness to guarantee amortized O(log n) expected time complexity.
I've tried to use a pair like (Tree, StdGen) to keep a random generator for each treap. When inserting new data into the treap, I use random to give a random value to the new node and then update my generator. But I've encountered a problem. I have a function called empty which returns an empty treap, and I used the myStdGen function above to get the random generator for this treap. However, if I have two empty treaps, their StdGen is the same. So after I insert data into both treaps and then want to merge them, their random values are the same, too. Therefore, I lose the randomness that I rely on.
That's why I would like to have a somehow "global" random generator, which yields different StdGen for each call, so that each empty treap could have different StdGen.
Have I abused unsafePerformIO?
Heck yes! The "distinguishing features of a pure function" are
No side-effects
Referentially transparent, i.e. each subsequent evaluation of the result must yield the same value.
There is in fact a way to achieve your "goal", but the idea is just wrong.
myStdGen :: () -> StdGen
myStdGen () = unsafePerformIO getStdGen
Because this is a (useless) function call instead of a CAF, it'll evaluate the IO action at each call separately.
Still, I think the compiler is pretty much free to optimise that away, so definitely don't rely on it.
EDIT upon trying I noticed that getStdGen itself always gives the same generator when used within a given process, so even if the above would use more reasonable types it would not work.
Note that correct use of pseudorandomness in algorithms etc. does not need IO everywhere. For instance, you can manually seed your StdGen and then properly propagate it with split etc. A much nicer way to handle that is with a randomness monad. The program as a whole will then always yield the same result, but internally have all different random numbers as needed to work usefully.
Alternatively, you can obtain a generator from IO, but still write your algorithm in a pure random monad rather than IO.
There's another way to obtain "randomness" in a completely pure algorithm: require the input to be Hashable! Then, you can effectively use any function argument as a random seed. This is a bit of a strange solution, but might work for your treap application (though I reckon some people would not classify it as a treap anymore, but as a special kind of hashmap).
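A minimal sketch of the seed-and-split approach mentioned above, applied to the treap scenario. The Treap type here is a hypothetical placeholder that only holds its generator, and the code assumes a version of the random package whose System.Random exports split (newer releases may flag it as deprecated).

```haskell
import System.Random (StdGen, mkStdGen, random, split)

-- Hypothetical minimal treap stand-in: only the generator it owns.
newtype Treap = Treap StdGen

-- Create an empty treap and hand back a fresh generator for the next one.
emptyTreap :: StdGen -> (Treap, StdGen)
emptyTreap g = let (g1, g2) = split g in (Treap g1, g2)

-- Draw the next node priority from a treap's own generator.
nextPriority :: Treap -> (Int, Treap)
nextPriority (Treap g) = let (p, g') = random g in (p, Treap g')

main :: IO ()
main = do
  let (t1, g1) = emptyTreap (mkStdGen 2024)  -- seed chosen arbitrarily
      (t2, _)  = emptyTreap g1
  -- Each treap draws from its own generator.
  print (fst (nextPriority t1), fst (nextPriority t2))
```

The caller threads the leftover generator from one emptyTreap call to the next, which is exactly the bookkeeping a random monad would hide.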
This is not a good use of unsafePerformIO.
The reason you see the same number repeatedly in GHCi is that GHCi itself does not know that the value is impure, and so remembers the value from the last time you called it. You can type IO commands into the top level of GHCi, so you would expect to see a different value if you just type getStdGen. However, this won't work either, due to an obscure part of the way GHCi works involving not reverting top-level expressions. You can turn this off with :set +r:
> :set +r
> getStdGen
2144783736 1
> getStdGen
1026741422 1
Note that your impure function pretending to be pure will still not work.
> myStdGen
480142475 1
> myStdGen
480142475 1
> myStdGen
480142475 1
You really do not want to go down this route. unsafePerformIO is not supposed to be used this way, and nor is it a good idea at all. There are ways to get what you wanted (like unsafePerformIO randomIO :: Int) but they will not lead you to good things. Instead you should be doing calculations based on random numbers inside a random monad, and running that in the IO monad.
Update
I see from your update why you wanted this in the first place.
There are many interesting thoughts on the problem of randomness within otherwise referentially transparent functions in this answer.
Despite the fact that some people advocate the use of unsafePerformIO in this case, it is still a bad idea for a number of reasons which are outlined in various parts of that page. In the end, if a function depends on a source of randomness, it is best for that to be specified in its type, and the easiest way to do that is to put it in a random monad. This is still a pure function, just one that takes a generator when it is called. You can provide this generator by asking for a random one in the main IO routine.
A good example of how to use the random monad can be found here.
Yes, you have abused unsafePerformIO. There are very few valid reasons to use unsafePerformIO, such as when interfacing with a C library, and it's also used in the implementation of a handful of core libraries (I think ST being one of them). In short, don't use unsafePerformIO unless you're really really sure of what you're doing. It is not meant for generating random numbers.
Recall that functions in Haskell have to be pure, meaning that they only depend on their inputs. This means that you can have a pure function that generates a "random" number, but that number depends on the random generator you pass to it. You could do something like
myStdGen :: StdGen
myStdGen = mkStdGen 42
Then you could do
randomInt :: StdGen -> (Int, StdGen)
randomInt g = random g
But then you must use the new StdGen returned from this function moving forward, or you will always get the same output from randomInt.
So you may be wondering, how do you cleanly generate random numbers without resorting to IO? The answer is the State monad. It looks similar to
newtype State s a = State { runState :: s -> (a, s) }
And its monad instance looks like
instance Monad (State s) where
    return a = State $ \s -> (a, s)
    State m >>= f = State $ \s ->
        let (a, newState) = m s
            State g = f a
        in g newState
It's a little confusing to look at the first time, but essentially all the state monad does is fancy function composition. See LYAH for a more detailed explanation. What's important to note here is that the type of s does not change between steps, just the a parameter can change.
You'll notice that s -> (a, s) looks a lot like our function StdGen -> (Int, StdGen), with s ~ StdGen and a ~ Int. That means that if we did
randomIntS :: State StdGen Int
randomIntS = State randomInt
Then we could do
twoRandInts :: State StdGen (Int, Int)
twoRandInts = do
    a <- randomIntS
    b <- randomIntS
    return (a, b)
Then it can be run by supplying an initial state:
main = do
    g <- getStdGen
    print $ runState twoRandInts g
The StdGen still comes out of IO, but then all the logic itself occurs within the state monad purely.

getting and testing a random item in a list in Haskell

Let's say there is a list of all possible things
all3PStrategies :: [Strategy3P]
all3PStrategies = [strategyA, strategyB, strategyC, strategyD] -- could be longer, maybe even infinite, but this is good enough for demonstrating
Now we have another function that takes an integer N and two strategies, and uses the first strategy for N times, and then uses the second strategy for N times and continues to repeat for as long as needed.
What happens if N is 0? In that case I want to return a random strategy, since N = 0 defeats the purpose of the function, but it must ultimately apply a particular strategy.
rotatingStrategy [] [] _ = chooseRandom all3PStrategies
rotatingStrategy strategy3P1 strategy3P2 n
    | … -- other code for what really happens
So I am trying to get a random strategy from the list. I think this will do it:
chooseRandom :: [a] -> RVar a
But how do I test it using Haddock/doctest?
-- >>> chooseRandom all3PStrategies
-- What goes here, since I can't guarantee what will be returned...?
I think random functions kind of go against the Haskell idea of functional programming, but I am also likely mistaken. In imperative languages the random function uses various parameters (like Time in Java) to determine the random number, so can't I just plug in the particular parameters to ensure which random number I will get?
If you do this: chooseRandom :: [a] -> RVar a, then you won't be able to use IO. You need to be able to include the IO monad throughout the type declaration, including the test cases.
Said more plainly, as soon as you use the IO monad, all return types must include the type of the IO monad, which is not likely to be included in the list that you want returned, unless you edit the structure of the list to accommodate items that have the IO Type included.
There are several ways to implement chooseRandom. If you use a version that returns RVar Strategy3P, you will still need to sample the RVar using runRVar to get a Strategy3P that you can actually execute.
You can also solve the problem using the IO monad, which is really no different: instead of thinking of chooseRandom as a function that returns a probability distribution that we can sample as necessary, we can think of it as a function that returns a computation that we can evaluate as necessary. Depending on your perspective, this might make things more or less confusing, but at least it avoids the need to install the rvar package. One implementation of chooseRandom using IO is the pick function from this blog post:
import System.Random (randomRIO)
pick :: [a] -> IO a
pick xs = randomRIO (0, (length xs - 1)) >>= return . (xs !!)
This code is arguably buggy: it crashes at runtime when you give it the empty list. If you're worried about that, you can detect the error at compile time by wrapping the result in Maybe, but if you know that your strategy list will never be empty (for example, because it's hard-coded) then it's probably not worth bothering.
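A minimal sketch of the Maybe-wrapped variant described above, assuming the random package's System.Random module. The name pickMaybe is hypothetical; the empty list is handled by returning Nothing instead of crashing in (!!).

```haskell
import System.Random (randomRIO)

-- | Like pick, but total: returns Nothing for an empty list.
pickMaybe :: [a] -> IO (Maybe a)
pickMaybe [] = return Nothing
pickMaybe xs = do
  i <- randomRIO (0, length xs - 1)
  return (Just (xs !! i))

main :: IO ()
main = do
  pickMaybe ([] :: [String]) >>= print          -- prints Nothing
  pickMaybe ["cat", "dog", "fish"] >>= print    -- prints Just one of the three
```

Callers are then forced by the type to decide what an empty strategy list should mean.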
It probably follows that it's not worth testing either, but there are a number of solutions to the fundamental problem, which is how to test monadic functions. In other words, given a monadic value m a, how can we interrogate it in our testing framework (ideally by reusing functions that work on the raw value a)? This is a complex problem addressed in the QuickCheck library and the associated research paper, Testing Monadic Code with QuickCheck.
However, it doesn't look like it would be easy to integrate QuickCheck with doctest, and the problem is really too simple to justify investing in a whole new testing framework! Given that you just need some quick-and-dirty testing code (that won't actually be part of your application), it's probably OK to use unsafePerformIO here, even though many Haskellers would consider it a code smell:
{-|
>>> let xs = ["cat", "dog", "fish"]
>>> elem (unsafePerformIO $ pick xs) xs
True
-}
pick :: [a] -> IO a
Just make sure you understand why using unsafePerformIO is "unsafe" (it's non-deterministic in general), and why it doesn't really matter for this case in particular (because failure of the standard RNG isn't really a big enough risk, for this application, to justify the extra work we'd require to capture it in the type system).
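Regarding the question's idea of plugging in the parameters to fix the outcome: one alternative sketch, assuming the random package, is a pure variant that takes the generator as an explicit argument. With a fixed seed from mkStdGen the result is reproducible, so a test can compare two runs or check membership without unsafePerformIO. The name pickWith and the seed 7 are illustrative.

```haskell
import System.Random (RandomGen, mkStdGen, randomR)

-- | Pure variant of pick: the generator is an explicit argument.
-- Like pick, it assumes a non-empty list.
pickWith :: RandomGen g => g -> [a] -> (a, g)
pickWith g xs = (xs !! i, g')
  where
    (i, g') = randomR (0, length xs - 1) g

main :: IO ()
main = do
  let xs = ["cat", "dog", "fish"]
      (a, _) = pickWith (mkStdGen 7) xs
      (b, _) = pickWith (mkStdGen 7) xs
  print (a == b)       -- prints True: same seed, same choice
  print (a `elem` xs)  -- prints True
```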

Non-Trivial Lazy Evaluation

I'm currently digesting the nice presentation Why learn Haskell? by Keegan McAllister. There he uses the snippet
minimum = head . sort
as an illustration of Haskell's lazy evaluation by stating that minimum has time-complexity O(n) in Haskell. However, I think the example is kind of academic in nature. I'm therefore asking for a more practical example where it's not trivially apparent that most of the intermediate calculations are thrown away.
Have you ever written an AI? Isn't it annoying that you have to thread pruning information (e.g. maximum depth, the minimum cost of an adjacent branch, or other such information) through the tree traversal function? This means you have to write a new tree traversal every time you want to improve your AI. That's dumb. With lazy evaluation, this is no longer a problem: write your tree traversal function once, to produce a huge (maybe even infinite!) game tree, and let your consumer decide how much of it to consume.
Writing a GUI that shows lots of information? Want it to run fast anyway? In other languages, you might have to write code that renders only the visible scenes. In Haskell, you can write code that renders the whole scene, and then later choose which pixels to observe. Similarly, rendering a complicated scene? Why not compute an infinite sequence of scenes at various detail levels, and pick the most appropriate one as the program runs?
You write an expensive function, and decide to memoize it for speed. In other languages, this requires building a data structure that tracks which inputs for the function you know the answer to, and updating the structure as you see new inputs. Remember to make it thread safe -- if we really need speed, we need parallelism, too! In Haskell, you build an infinite data structure, with an entry for each possible input, and evaluate the parts of the data structure that correspond to the inputs you care about. Thread safety comes for free with purity.
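A small sketch of that memoisation idea using only base: the table is a lazy, conceptually infinite list with one entry per input, and each entry is evaluated at most once. Indexing a list is linear in the index, so this illustrates the principle rather than an efficient memo table.

```haskell
-- | Fibonacci memoised through a lazy, conceptually infinite list:
-- entry n is a thunk that is evaluated at most once and then shared.
memoFib :: Int -> Integer
memoFib = (table !!)
  where
    table = map fib [0 ..]
    fib 0 = 0
    fib 1 = 1
    fib n = memoFib (n - 1) + memoFib (n - 2)

main :: IO ()
main = print (memoFib 50)  -- prints 12586269025
```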
Here's one that's perhaps a bit more prosaic than the previous ones. Have you ever found a time when && and || weren't the only things you wanted to be short-circuiting? I sure have! For example, I love the <|> function for combining Maybe values: it takes the first one of its arguments that actually has a value. So Just 3 <|> Nothing = Just 3; Nothing <|> Just 7 = Just 7; and Nothing <|> Nothing = Nothing. Moreover, it's short-circuiting: if it turns out that its first argument is a Just, it won't bother doing the computation required to figure out what its second argument is.
And <|> isn't built in to the language; it's tacked on by a library. That is: laziness allows you to write brand new short-circuiting forms. (Indeed, in Haskell, even the short-circuiting behavior of (&&) and (||) isn't built-in compiler magic: it arises naturally from the semantics of the language plus their definitions in the standard libraries.)
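The short-circuiting can be observed directly by putting an error in the position that should never be evaluated. The sketch below uses <|> from Control.Applicative and a hypothetical user-defined operator, orElse, written as an ordinary function.

```haskell
import Control.Applicative ((<|>))

-- | A user-defined short-circuiting form: the fallback is only
-- evaluated when the first argument is Nothing.
orElse :: Maybe a -> a -> a
orElse (Just x) _ = x
orElse Nothing  d = d

main :: IO ()
main = do
  print (Just 3 <|> error "never evaluated" :: Maybe Int)  -- prints Just 3
  print (Just 3 `orElse` error "never evaluated" :: Int)   -- prints 3
  print (Nothing <|> Just 7 :: Maybe Int)                  -- prints Just 7
```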
In general, the common theme here is that you can separate the production of values from the determination of which values are interesting to look at. This makes things more composable, because the choice of what is interesting to look at need not be known by the producer.
Here's a well-known example I posted to another thread yesterday. Hamming numbers are numbers that don't have any prime factors larger than 5. I.e. they have the form 2^i*3^j*5^k. The first 20 of them are:
[1,2,3,4,5,6,8,9,10,12,15,16,18,20,24,25,27,30,32,36]
The 500000th one is:
1962938367679548095642112423564462631020433036610484123229980468750
The program that printed the 500000th one (after a brief moment of computation) is:
merge xxs@(x:xs) yys@(y:ys) =
    case (x`compare`y) of
        LT -> x:merge xs yys
        EQ -> x:merge xs ys
        GT -> y:merge xxs ys
hamming = 1 : m 2 `merge` m 3 `merge` m 5
    where
        m k = map (k *) hamming
main = print (hamming !! 499999)
Computing that number with reasonable speed in a non-lazy language takes quite a bit more code and head-scratching. There are a lot of examples here
Consider generating and consuming the first n elements of an infinite sequence. Without lazy evaluation, the naive encoding would run forever in the generation step, and never consume anything. With lazy evaluation, only as many elements are generated as the code tries to consume.
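As a tiny illustration of that last point, using only base: the producer below describes an infinite list, and the consumer decides how much of it is ever computed.

```haskell
-- An infinite list of squares; nothing is computed until it is consumed.
squares :: [Integer]
squares = map (^ 2) [1 ..]

main :: IO ()
main = do
  print (take 5 squares)            -- prints [1,4,9,16,25]
  print (takeWhile (< 50) squares)  -- prints [1,4,9,16,25,36,49]
```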

"Lazy IO" in Haskell?

I'm trying a little experiment in Haskell, wondering if it is possible to exploit laziness to process IO. I'd like to write a function that takes a String (a list of Chars) and produces a string, lazily. I would then like to be able to lazily feed it characters from IO, so each character would be processed as soon as it was available, and the output would be produced as the necessary characters became available. However, I'm not quite sure if/how I can produce a lazy list of characters from input inside the IO monad.
Regular String IO in Haskell is lazy. So your example should just work out of the box.
Here's an example, using the 'interact' function, which applies a function to a lazy stream of characters:
interact :: (String -> String) -> IO ()
Let's filter out the letter 'e' from the input stream, lazily (i.e. run in constant space):
main = interact $ filter (/= 'e')
You could also use getContents and putStr if you like. They're all lazy.
Running it to filter the letter 'e' from the dictionary:
$ ghc -O2 --make A.hs
$ ./A +RTS -s < /usr/share/dict/words
...
2 MB total memory in use (0 MB lost due to fragmentation)
...
so we see that it ran in a constant 2M footprint.
The simplest method of doing lazy IO involves functions such as interact, readFile, hGetContents, and such, as dons says; there's a more extended discussion of these in the book Real World Haskell that you might find useful. If memory serves me, all such functions are eventually implemented using the unsafeInterleaveIO that ephemient mentions, so you can also build your own functions that way if you want.
On the other hand, it might be wise to note that unsafeInterleaveIO is exactly what it says on the tin: unsafe IO. Using it, or functions based on it, breaks purity and referential transparency. This allows apparently pure functions (that is, functions that do not return an IO action) to affect the outside world when evaluated, produce different results from the same arguments, and all those other unpleasant things. In practice, most sensible ways of using unsafeInterleaveIO won't cause problems, and simple mistakes will usually result in obvious and easily diagnosed bugs, but you've lost some nice guarantees.
There are alternatives, of course; you can find assorted libraries on Hackage that provide restricted, safer lazy IO or conceptually different approaches. However, given that problems arise only rarely in practical use, I think most people are inclined to stick with the built-in, technically unsafe functions.
unsafeInterleaveIO :: IO a -> IO a
unsafeInterleaveIO allows an IO computation to be deferred lazily. When passed a value of type IO a, the IO will only be performed when the value of a is demanded. This is used to implement lazy file reading, see System.IO.hGetContents.
For example, main = getContents >>= return . map Data.Char.toUpper >>= putStr is lazy; as you feed characters to stdin, you will get characters on stdout.
(This is the same as writing main = interact $ map Data.Char.toUpper, as in dons's answer.)
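For building your own lazy reader along these lines, here is a hedged sketch of a getContents-like function written with unsafeInterleaveIO. The name lazyGetChars is hypothetical, and this is not how the library implements getContents; it only shows the pattern of deferring each read until the next character is demanded.

```haskell
import Data.Char (toUpper)
import System.IO (isEOF)
import System.IO.Unsafe (unsafeInterleaveIO)

-- | A hand-rolled lazy stdin reader: each cell of the resulting list
-- triggers one more read only when the consumer demands it.
lazyGetChars :: IO String
lazyGetChars = unsafeInterleaveIO $ do
  eof <- isEOF
  if eof
    then return []
    else do
      c <- getChar
      rest <- lazyGetChars
      return (c : rest)

main :: IO ()
main = lazyGetChars >>= putStr . map toUpper
```

When exactly output appears on the terminal still depends on the buffering mode of stdin and stdout.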

Resources