I have a web server written in Haskell that computes some data in multiple steps.
I want to accurately measure and display how long each action takes.
In the presence of laziness, what is a good way to do this?
Note that "benchmarking" is not quite the right terminology since I only only want to measure time in a production system and not sample many runs. I know that for that case I can use criterion.
You can use force from Control.DeepSeq to fully evaluate a data structure (and thus demand and measure its computation).
One problem is that forcing a large data structure takes some time itself!
This is because a deepseq (used by force) will walk down your algebraic data type tree, visiting every node (but not doing anything with it).
When you perform only a cheap operation on each node, such as map (*2) mylist, and try to measure how long it takes, this traversal overhead can become significant, skewing your measurements.
import Control.DeepSeq
import Control.Exception (evaluate)
import Data.Time (diffUTCTime, getCurrentTime)

-- | Measures how long a computation takes, printing both the time and the
-- overhead of `force` to stdout. So it forces *twice*.
benchmarkForce :: NFData a => String -> IO a -> IO a
benchmarkForce msg action = do
  before <- getCurrentTime

  -- Force the first time to measure computation + forcing
  result <- evaluate . force =<< action
  after <- getCurrentTime

  -- Force again to see how long forcing itself takes
  _ <- evaluate . force $ result
  afterAgain <- getCurrentTime

  putStrLn $ msg ++ ": " ++ show (diffTimeMs before after) ++ " ms"
          ++ " (force time: " ++ show (diffTimeMs after afterAgain) ++ " ms)"
  return result
  where
    -- Time difference `t2 - t1` in milliseconds
    diffTimeMs t1 t2 = realToFrac (t2 `diffUTCTime` t1) * 1000.0 :: Double
The first evaluate . force run will make sure your action and its return value are evaluated entirely.
By doing a second force run over the result, we can measure how much overhead it added to the first traversal.
This of course comes at the expense of two traversals; being able to measure how much time a deepseq wastes requires you to waste that time twice.
Here is an example that measures some pure functions with it:
main :: IO ()
main = do
  l <- benchmarkForce "create list" $
    return [1..10000000 :: Integer]
  _ <- benchmarkForce "double each list element" $
    return $ map (*2) l
  _ <- benchmarkForce "map id l" $
    return $ map id l
  return ()
(Of course it also works with functions in IO.)
The output:
create list: 1091.936 ms (force time: 71.33200000000001 ms)
double each list element: 1416.0569999999998 ms (force time: 96.808 ms)
map id l: 484.493 ms (force time: 67.232 ms)
As we can see, the force creates around 13% overhead in the map id l case.
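As noted above, this also works for actions that really live in IO. A hypothetical usage (the file path is made up; readFile is lazy, so forcing the result is what makes the timing meaningful):

contents <- benchmarkForce "read config" $ readFile "config.txt"

A side note: getCurrentTime reads the wall clock, which can jump if the system time is adjusted. On GHC 8.4 or later the same idea can be built on GHC.Clock.getMonotonicTime instead; a minimal sketch, forcing only once:

import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)
import GHC.Clock (getMonotonicTime)  -- base >= 4.11

-- Like benchmarkForce, but against a monotonic clock.
timeForce :: NFData a => String -> IO a -> IO a
timeForce msg action = do
  before <- getMonotonicTime
  result <- evaluate . force =<< action
  after  <- getMonotonicTime
  putStrLn $ msg ++ ": " ++ show ((after - before) * 1000) ++ " ms"
  return result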
Related
I would like to write some basic bench-marking code in Haskell, namely design a function:
benchmark :: M a -> String -> Int -> M ()
which given an action (a monadic value with respect to my own monad M), a function name for messaging purposes, and a number of iterations, executes the action as many times as iterations, and displays a message on stdout about the time spent executing the code. For the purpose of this question, I assume that the monad M is an instance of the class MonadIO. In particular I have some function:
liftIO :: IO a -> M a
Having spent many hours on 'Real World Haskell', I have been warned that "Non-strict evaluation can turn performance measurement and analysis into something of a minefield". However, I am not looking for a definitive answer, but rather for suggestions and guidance on how best to approach this.
My naive approach is to separate the time measuring code:
benchmark action name iterations = do
  start <- liftIO getPOSIXTime
  runAction action iterations
  end <- liftIO getPOSIXTime
  let time = realToFrac (end - start) :: Double
  liftIO $ printf "Benchmark: %s, %d iterations ran in %.3f seconds\n"
    name iterations time
from the real crunch of the problem, namely that of running an action as a loop:
runAction :: Monad m => m a -> Int -> m ()
runAction action iterations
  | iterations <= 0 = return ()
  | otherwise = do
      action
      runAction action (iterations - 1)
The issue I have with my solution is that the code seems to spend more time in 'boiler-plate' than running the action: if I attempt to benchmark the simplest possible action, return (), I may get a time of, say, 1000 ms for one million iterations. I can see that the specifics of the monad M play an important role (replacing M by IO brings the time down from 1000 ms to 250 ms). A one-million-iteration loop doing nothing may also take on the order of 250 ms in interpreted Python or Scheme (so the IO monad is not so bad), but it is a lot faster in C. Normally Haskell is mightily fast (not so far from C). I have tried a solution involving forM and [1..1000000], but without improvement.
Is there a way to approach this so the 'boiler-plate' code does not overwhelm the code being tested?
EDIT: The problem seemingly goes away when using the -O2 compiler optimization option, as suggested by @luqui (i.e. a one-million-iteration monadic loop doing nothing has very good performance relative to other languages, at least in my case). So this question is settled as far as I can see.
You can try this way:
runAction :: MonadIO m => m a -> Int -> m NominalDiffTime
runAction action = fmap sum . flip replicateM action'
  where
    action' = do
      start <- liftIO getPOSIXTime
      action
      end <- liftIO getPOSIXTime
      let !delta = end - start
      return delta
benchmark action name iterations = do
  time <- realToFrac <$> runAction action iterations
  liftIO $ printf "Benchmark: %s, %d iterations ran in %.3f seconds\n"
    name iterations (time :: Double)
Or, you can just rewrite your runAction:
runAction = flip replicateM_
Or, just fix overhead:
time :: MonadIO m => m a -> m NominalDiffTime
time action = do
  start <- liftIO getPOSIXTime
  action
  end <- liftIO getPOSIXTime
  let !delta = end - start
  return delta

benchmark action name iterations = do
  !overhead <- time $ replicateM_ iterations (return ())
  !delta <- time $ replicateM_ iterations action
  let time = realToFrac . abs $ delta - overhead :: Double
  liftIO $ printf "Benchmark: %s, %d iterations ran in %.3f seconds\n"
    name iterations time
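A quick way to try this out with M instantiated to plain IO (assumes the BangPatterns extension for the ! bindings above, plus imports of Control.Monad (replicateM_), Data.Time.Clock.POSIX (getPOSIXTime) and Text.Printf (printf)):

main :: IO ()
main = benchmark (return ()) "no-op" 1000000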
I occasionally would like to delay specific parts of a pure algorithm while developing / testing, so I can monitor the evaluation simply by watching the lazy result build up piece by piece (which would generally be too fast to be useful in the final, un-delayed version). I then find myself inserting ugly stuff like sum [1..1000000] `seq` q, which kind of works (though often with the usual thunk-explosion problems, because I never think much about this), but is rather trial-and-error-like.
Is there a nicer, more controllable alternative that's still just as simple, when I want to do some quick testing in that way and can't be bothered to do proper profiling, criterion etc.?
I'd also like to avoid unsafePerformIO $ threadDelay, though I reckon this might actually be an appropriate use.
This looping solution avoids calling threadDelay, but still calls unsafePerformIO, so maybe we don't gain much:
import Data.AdditiveGroup
import Data.Thyme.Clock
import Data.Thyme.Clock.POSIX
import System.IO.Unsafe
pureWait :: NominalDiffTime -> ()
pureWait time = let tsList = map unsafePerformIO (repeat getPOSIXTime)
                in case tsList of
                     (t:ts) -> loop t ts
  where
    loop t (t':ts') = if (t' ^-^ t) > time
                        then ()
                        else loop t ts'

main :: IO ()
main = do
  putStrLn . show $ pureWait (fromSeconds 10)
UPDATE: Here's an alternative solution. First determine (using IO) how many iterations you need to achieve a given delay, and then just use a pure looping function.
import Data.List (foldl', genericTake, intersperse)

pureWait :: Integer -> Integer
pureWait i = foldl' (+) 0 $ genericTake i $ intersperse (negate 1) (repeat 1)

calibrate :: NominalDiffTime -> IO Integer
calibrate timeSpan = let iterations = iterate (*2) 2 in loop iterations
  where
    loop (i:is) = do
      t1 <- getPOSIXTime
      if pureWait i == 0
        then do
          t2 <- getPOSIXTime
          if (t2 ^-^ t1) > timeSpan
            then return i
            else loop is
        else error "should never happen"

main :: IO ()
main = do
  requiredIterations <- calibrate (fromSeconds 10)
  putStrLn $ "iterations required for delay: " ++ show requiredIterations
  putStrLn . show $ pureWait requiredIterations
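If the underlying goal is just to watch evaluation happen in the right order, a different lightweight trick (not from the answer above, just a common debugging idiom) is Debug.Trace: rather than slowing evaluation down, make each piece announce itself the first time it is forced.

import Debug.Trace (trace)

-- Each value prints a message the first time it is demanded.
traced :: String -> a -> a
traced label = trace (label ++ " evaluated")

example :: [Int]
example = [ traced ("element " ++ show i) (i * i) | i <- [1 .. 5] ]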
I have a set of problems that I would like to evaluate in parallel. These problems are expressed using a simple expression type very similar to this:
-- Expressions are either a constant value or two expressions
-- combined using a certain operation
data Expr
  = Const NumType
  | Binary BinOp Expr Expr

-- The possible operations
data BinOp = Add | Sub | Mul | Div
  deriving (Eq)
These expressions are built on the fly and should evaluate to a certain result which may be valid or invalid. This is expressed as a monad to stop computation when encountering invalid results.
data Result a
  = Val { val :: a }
  | Exc { exc :: String }

-- Functor/Applicative instances are required by GHC >= 7.10
instance Functor Result where
  fmap f (Val v) = Val (f v)
  fmap _ (Exc e) = Exc e

instance Applicative Result where
  pure = Val
  Val f <*> r = fmap f r
  Exc e <*> _ = Exc e

instance Monad Result where
  (Exc e) >>= _ = Exc e
  (Val v) >>= g = g v
To determine a value of each solved problem I have two relevant functions:
eval :: Expr -> Result NumType
score :: Expr -> NumType
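(For concreteness, a minimal sketch of an eval of this shape, assuming NumType is some Fractional type; the real definition may differ:)

eval :: Expr -> Result NumType
eval (Const n) = Val n
eval (Binary op l r) = do
  x <- eval l
  y <- eval r
  case op of
    Add -> Val (x + y)
    Sub -> Val (x - y)
    Mul -> Val (x * y)
    Div -> if y == 0 then Exc "division by zero" else Val (x / y)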
And finally I have solve functions that will return a [Expr]. This leads to my main function currently looking like this:
main :: IO ()
main = do
  strAvailableNumbers <- getLine
  strTargetNumber <- getLine
  let numbers = parseList strAvailableNumbers
      target = parseTargetNumber strTargetNumber in
    sequence $ map print $
      solveHeuristic1 (Problem target numbers) [Add] [Sub] ++
      solveHeuristic2 (Problem target numbers)
  return ()
The basic idea is that I read a list of numbers and a target number from stdin and then print expressions on stdout.
But I have two problems that I would like to solve and I am not quite sure how related they are:
Those heuristics run entirely unaware of each other and therefore don't know whether the score of their solution is higher than any other. I would like to introduce some kind of state to the map function, to only print the new Expr if its score is higher than that of the Expr printed previously.
I would like to do these computations in parallel and attempted to do so by using (parMap rseq) instead of map, compiling with the -threaded option and running it using +RTS -N2. The result is a runtime increase from 5 seconds to 7 seconds. Not what I expected, although time shows the CPU utilization is higher. I guess I am not correctly using parMap, or am doing something wrong by using ++. So how would I run a list of independent functions, each returning a list of elements, in parallel?
Update: Created a gist with complete source code.
The problem here is that evaluating an IO action with seq does approximately nothing. So you're just running things sequentially with slightly more overhead.
You can refactor things to make them pure again:
main :: IO ()
main = do
  mapM_ (`seq` print "found it") -- make sure we're not benchmarking printing stuff
    . concat
    . parMap rdeepseq (solve [1..10000000])
    $ [42, 42]
  return ()
And add instances of NFData so we can use rdeepseq, which will fully evaluate things:

instance NFData BinOp -- BinOp is just an enum, WHNF = NF

instance NFData Expr where
  rnf (Const a) = a `deepseq` ()
  rnf (Binary b e1 e2) = b `deepseq` e1 `deepseq` e2 `deepseq` ()
And now if we run it we get... a stack overflow. I bumped up the size of the search space enough to make it take long enough to be worth benchmarking, and now fully loading both structures into memory blows the default stack. Raising the stack size so that nothing blows up, we run 40% faster (3 vs 5 seconds) with -N2 than without, which I would consider the expected result. Visually, when running this I can see two cores briefly jump to 100%.
Final compilation and run sequence:

> ghc -O2 -threaded -rtsopts bench.hs
> ./bench +RTS -K10000000 -N2
I'm trying to make a conduit that's sort of a cross between takeWhile and isolate. That is, it will consume from the input and yield to the output until either the predicate no longer holds or it has reached the byte limit. I know the type signature will be
isolateWhile :: (Monad m) => Int -> (Word8 -> Bool) -> Conduit ByteString m ByteString
As an example of its use:
{-# LANGUAGE OverloadedStrings #-}
import Data.Conduit
import qualified Data.Conduit.List as CL
import qualified Data.Conduit.Binary as CB
import Control.Monad.Trans.Class

charToWord = fromIntegral . fromEnum

example :: Int -> Char -> IO ()
example limit upTo = do
  untaken <- CB.sourceLbs "Hello, world!" $= conduit $$ CB.sinkLbs
  putStrLn $ "Left " ++ show untaken
  where
    conduit = do
      taken <- toConsumer $ isolateWhile limit (/= charToWord upTo) =$ CB.sinkLbs
      lift $ putStrLn $ "Took " ++ show taken
      CL.map id -- pass the rest through untouched
I expect that
ghci> example 5 'l'
Took "He"
Left "llo, world!"
ghci> example 5 'w'
Took "Hello"
Left ", world!"
However, the simplest possible definition of isolateWhile:
isolateWhile limit pred = CB.isolate limit =$= CB.takeWhile pred
yields
ghci> example 5 'l'
Took "He"
Left ", world!"
ghci> example 5 'w'
Took "Hello"
Left ", world!"
In other words, isolate will eat up the entire Hello, leaving He to takeWhile and discarding the llo. This data loss is undesirable for my application. However, it is notable that the second case yields the expected result.
If I swap the operands of =$= like so:
isolateWhile limit pred = CB.takeWhile pred =$= CB.isolate limit
Then
ghci> example 5 'l'
Took "He"
Left ", world!"
ghci> example 5 'w'
Took "Hello"
Left ""
Now I've fixed the first test, but broken the second one! This time, takeWhile will take whatever it needs and isolate will take a subset of that; but whatever takeWhile uses that isolate doesn't will be discarded, and this is undesirable.
Lastly, I tried:
isolateWhile limit pred = do
  untaken <- CB.isolate limit =$= (CB.takeWhile pred >> CL.consume)
  mapM_ leftover $ reverse untaken
This actually works! Whatever isolate accepts and takeWhile doesn't is consumed by the CL.consume and placed back into the stream with leftover. Unfortunately, this seems like a horrible kludge, and undesirably (although not unusably so) it will buffer up to limit bytes in memory only to put it back with leftover. That seems like a waste.
The only solution I can think of is to write it in terms of the primitives await, yield and leftover as takeWhile and isolate are themselves written. While this would solve all the problems without wasting much, it seems like there must be a better way.
Am I missing something, or is there really no better way to write this?
There's a known limitation in the current version of conduit: fusion always discards downstream leftovers, which is exactly what you're running into here. There are some discussions right now about an architecture to resolve this, but for the moment, writing your function in terms of the primitives is likely your best option.
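For reference, a sketch of such a primitive-based definition (an untested illustration against the conduit API of that era, not a vetted implementation):

import Control.Monad (unless)
import qualified Data.ByteString as BS
import Data.Conduit
import Data.Word (Word8)

-- Consume until the predicate fails or `limit` bytes have been passed
-- through, putting any unused part of the current chunk back as leftover.
isolateWhile :: Monad m => Int -> (Word8 -> Bool) -> Conduit BS.ByteString m BS.ByteString
isolateWhile limit p
  | limit <= 0 = return ()
  | otherwise = do
      mbs <- await
      case mbs of
        Nothing -> return ()
        Just bs -> do
          let wanted = BS.takeWhile p (BS.take limit bs)
              rest   = BS.drop (BS.length wanted) bs
          unless (BS.null wanted) $ yield wanted
          if BS.null rest
            then isolateWhile (limit - BS.length wanted) p
            else leftover rest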
Recently I was trying to determine the time needed to calculate a waveform using the vector storage type.
I wanted to do so without requiring to print the length or something like that. Finally I came up with the following two definitions. It seems simple enough, and from what I can tell it prints a non-zero computation time as expected the first time I run the function, but I'm wondering if there are any laziness caveats here that I've missed.
{-# LANGUAGE BangPatterns #-}

import System.IO
import System.CPUTime
import qualified Data.Vector.Storable as V

timerIO f = do
  start <- getCPUTime
  x <- f
  let !y = x
  end <- getCPUTime
  let diff = fromIntegral (end - start) / (10 ^ 12)
  putStr $ "Computation time: " ++ show diff ++ " sec\n"

timer f = timerIO $ do return f

main :: IO ()
main = do
  let sr = 1000.0
      time = V.map (/ sr) $ V.enumFromN 0 120000 :: V.Vector Float
      wave = V.map (\x -> sin $ x * 2 * pi * 10) time :: V.Vector Float
  timer wave
  timer wave
prints,
Computation time: 0.16001 sec
Computation time: 0.0 sec
Are there any hidden bugs here? I'm not sure that the let with a strictness flag is really the best way to go. Is there a more concise way to write this? Are there any standard functions that already do this that I should know about?
Edit: I should mention that I had read about criterion but in this case I was not looking for a robust way to calculate average timing for profiling-only purposes; rather I was looking for a simple / low-overhead way to integrate single timers into my program for tracing the timing of some computations during normal running of the application. Criterion is cool, but this was a slightly different use case.
If evaluating to weak head normal form is enough (for strict Vectors or UArrays it is), then your timing code works well¹. However, instead of the bang pattern in the let-binding, you could put a bang on the monadic bind,
start <- getCPUTime
!x <- f
end <- getCPUTime
which to me looks nicer, or you could use Control.Exception.evaluate
start <- getCPUTime
evaluate f
end <- getCPUTime
which has the advantage of (supposed) portability, whereas bang patterns are a GHC extension. If WHNF is not enough, you would need to force full evaluation, for example using rnf or deepseq, like
start <- getCPUTime
!x <- rnf `fmap` f
end <- getCPUTime
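or, a variant using force from the deepseq package together with evaluate (same effect as the rnf version, just spelled differently; not from the original snippet):

start <- getCPUTime
x <- evaluate . force =<< f
end <- getCPUTime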
However, repeatedly timing the same computation with that is hairy. If, as in your example, you give the thing a name, and call it
timer wave
timer wave
the compiler shares the computation, so it's only done once and all but the first timer calls return zero (or very close to zero) times. If you call it with code instead of a name,
timer (V.map (\x -> sin $ x * 2 * pi * 10) time :: V.Vector Float)
timer (V.map (\x -> sin $ x * 2 * pi * 10) time :: V.Vector Float)
the compiler can still share the computation, if it does common subexpression elimination. And although GHC doesn't do much CSE, it does some and I'm rather confident it would spot and share this (when compiling with optimisations). To reliably make the compiler repeat the computations, you need to hide the fact that they are the same from it (or use some low-level internals), which is not easy to do without influencing the time needed for the computation.
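One sketch of hiding the sharing (my illustration, not from the answer): make the computation depend on a value that only exists at run time, so the two calls are genuinely different expressions.

main :: IO ()
main = do
  freq <- readLn :: IO Float  -- runtime input the compiler cannot predict
  let sr = 1000.0
      ts = V.map (/ sr) $ V.enumFromN 0 120000 :: V.Vector Float
      wave f = V.map (\x -> sin $ x * 2 * pi * f) ts :: V.Vector Float
  timer (wave freq)        -- does the real work
  timer (wave (freq + 1))  -- different argument, so the work is redone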
¹ It works well if the computation takes a significant amount of time. If it takes only a short time, the jitter introduced by outside influences (CPU load, scheduling, ...) will make single timings far too unreliable. Then you should do multiple measurements, and for that, as has been mentioned elsewhere, the criterion library is an excellent way to relieve you of the burden of writing robust timing code.
Are you familiar with the deepseq package? It's used by the criterion package for pretty much the purpose you describe.
Speaking of which, you may want to consider whether criterion itself does what you need anyway.