Adding a Progress Bar to Parallel Code (Haskell)

Adding a Progress Bar to Parallel Code (Haskell) - haskell

In Haskell, I have a list that can be evaluated in parallel. Each individual evaluation doesn't take that long, but there are many of them (1 million, for example). I'm using the following library. The plan is split the list into chunks and run them in parallel. I have something like the following that works:
import Control.Parallel.Strategies
import Control.DeepSeq
-- Imagine this being slightly more expensive
kindaExpensiveComputation :: Int -> [Int]
kindaExpensiveComputation n = replicate n 42
main :: IO ()
main = do
let n = 1000000
let args = replicate n 20
let chunkSize = n `div` 10
let result = force $ withStrategy (parListChunk chunkSize rseq) . map kindaExpensiveComputation $ args
-- do stuff with result here
-- end program
I would like to add a progress bar to this so I can keep track of how much of the list has been done. My instinct was to try something like the following:
import Control.Parallel.Strategies
import Control.DeepSeq
import System.ProgressBar
-- Imagine this being slightly more expensive
kindaExpensiveComputation :: ProgressBar s -> Int -> IO [Int]
kindaExpensiveComputation pb n = do
let res = replicate n 42
incProgress pb 1
return res
main :: IO ()
main = do
let n = 1000000
let args = replicate n 20
let chunkSize = n `div` 10
pb <- newProgressBar defStyle 10 (Progress 0 n ())
let result = force $ withStrategy (parListChunk chunkSize rseq) . map (kindaExpensiveComputation pb) $ args
-- do stuff with result here
-- end program
But force doesn't seem to be able to handle IO. I've tried a couple other things, but whatever I try evaluates the list of IO [Int] in parallel but not the actual contents of the IO. I see that the parallel library has some functions like withStrategyIO, although I'm not sure how to use it or if it's what I'm looking for.
I think my understanding of how Haskell evaluates expressions is causing my confusion, so any pointers on that would be helpful as well.

Unfortunately, I believe GHC currently does not expose any functionality for observing (in IO, obviously) the evaluation progress of parallel computations. You will need to use concurrency in the form of forkIO and friends (or a library that wraps them like the async package) instead.

Related

Haskell: How to use a HashMap in a main function

I beg for your help, speeding up the following program:
main = do
jobsToProcess <- fmap read getLine
forM_ [1..jobsToProcess] $ \_ -> do
[r, k] <- fmap (map read . words) getLine :: IO [Int]
putStrLn $ doSomeReallyLongWorkingJob r k
There could(!) be a lot of identical jobs to do, but it's not up to me modifying the inputs, so I tried to use Data.HashMap for backing up already processed jobs. I already optimized the algorithms in the doSomeReallyLongWorkingJob function, but now it seems, it's quite as fast as C.
But unfortunately it seems, I'm not able to implement a simple cache without producing a lot of errors. I need a simple cache of Type HashMap (Int, Int) Int, but everytime I have too much or too few brackets. And IF I manage to define the cache, I'm stuck in putting data into or retrieving data from the cache cause of lots of errors.
I already Googled for some hours but it seems I'm stuck. BTW: The result of the longrunner is an Int as well.

It's pretty simple to make a stateful action that caches operations. First some boilerplate:
{-# LANGUAGE FlexibleContexts #-}
import Control.Monad.State
import Data.Map (Map)
import qualified Data.Map as M
import Debug.Trace
I'll use Data.Map, but of course you can substitute in a hash map or any similar data structure without much trouble. My long-running computation will just add up its arguments. I'll use trace to show when this computation is executed; we'll hope not to see the output of the trace when we enter a duplicate input.
reallyLongRunningComputation :: [Int] -> Int
reallyLongRunningComputation args = traceShow args $ sum args
Now the caching operation will just look up whether we've seen a given input before. If we have, we'll return the precomputed answer; otherwise we'll compute the answer now and store it.
cache :: (MonadState (Map a b) m, Ord a) => (a -> b) -> a -> m b
cache f x = do
mCached <- gets (M.lookup x)
case mCached of
-- depending on your goals, you may wish to force `result` here
Nothing -> modify (M.insert x result) >> return result
Just cached -> return cached
where
result = f x
The main function now just consists of calling cache reallyLongRunningComputation on appropriate inputs.
main = do
iterations <- readLn
flip evalStateT M.empty . replicateM_ iterations
$ liftIO getLine
>>= liftIO . mapM readIO . words
>>= cache reallyLongRunningComputation
>>= liftIO . print
Let's try it in ghci!
> main
5
1 2 3
[1,2,3]
6
4 5
[4,5]
9
1 2
[1,2]
3
1 2
3
1 2 3
6
As you can see by the bracketed outputs, reallyLongRunningComputation was called the first time we entered 1 2 3 and the first time we entered 1 2, but not the second time we entered these inputs.

I hope i'm not too far off base, but first you need a way to carry around the past jobs with you. Easiest would be to use a foldM instead of a forM.
import Control.Monad
import Data.Maybe
main = do
jobsToProcess <- fmap read getLine
foldM doJobAcc acc0 [1..jobsToProcess]
where
acc0 = --initial value of some type of accumulator, i.e. hash map
doJobAcc acc _ = do
[r, k] <- fmap (map read . words) getLine :: IO [Int]
case getFromHash acc (r,k) of
Nothing -> do
i <- doSomeReallyLongWorkingJob r k
return $ insertNew acc (r,k) i
Just i -> do
return acc
Note, I don't actually use the interface for putting and getting the hash table key. It doesn't actually have to be a hash table, Data.Map from containers could work. Or even a list if its going to be a small one.
Another way to carry around the hash table would be to use a State transformer monad.

I am just adding this answer since I feel like the other answers are diverging a bit from the original question, namely using hashtable constructs in Main function (inside IO monad).
Here is a minimal hashtable example using hashtables module. To install the module with cabal, simply use
cabal install hashtables
In this example, we simply put some values in a hashtable and use lookup to print a value retrieved from the table.
import qualified Data.HashTable.IO as H
main :: IO ()
main = do
t <- H.new :: IO (H.CuckooHashTable Int String)
H.insert t 22 "Hello world"
H.insert t 5 "No problem"
msg <- H.lookup t 5
print msg
Notice that we need to use explicit type annotation to specify which implementation of the hashtable we wish to use.

Haskell random numbers

I am trying to generate a sample of random numbers in Haskell
import System.Random
getSample n = take n $ randoms g where
g = newStdGen
but it seems I am not quite using newStdGen the right way. What am I missing?

First off, you probably don't want to use newStdGen. The biggest problem is that you'll get a different seed every time you run your program, so no results will be reproducible. In my opinion, mkStdGen is a better choice as it requires you to give it a seed. This means you will get the same sequence of (pseudo)random numbers every time. If you want a different sequence, just change the seed.
The second problem with newStdGen is that since it's impure, you'll end up in the IO monad which can be a bit inconvenient.
sample :: Int -> IO [Int]
sample n = do
gen <- newStdGen
return $ take n $ randoms gen
You can use do-notation to 'extract' the values and then sum them:
main :: IO ()
main = do
xs <- sample 10
s = sum xs
print s
Or you could 'fmap' the function over the result (but notice that at some point you will probably need to extract the value):
main :: IO ()
main = do
s <- fmap sum $ sample 10
print s
The fmap function is a generalized version of map. Just like map applies a function to the values inside a list, fmap can apply a function to values inside IO.
Another problem with this sample function is that if we call it again, it starts with a fresh seed instead of continuing the previous (pseudo)random sequence. Again, this make reproducing results impossible. In order to fix this problem, we need to pass in the seed and return a new seed. Unfortunately, randoms does not return the next seed for us, so we'll have to write this from scratch using random.
sample :: Int -> StdGen -> ([Int],StdGen)
sample n seed1 = case n of
0 -> ([],seed1)
k -> let (rs,seed2) = sample (k-1) seed1
(r, seed3) = random seed2
in ((r:rs),seed3)
Our main function is now
main :: IO ()
main = do
let seed1 = mkStdGen 123456
(xs,seed2) = sample 10 seed1
s = sum xs
(ys,seed3) = sample 10 seed2
t = sum ys
print s
print t
I know this seems like an awful lot of work just to to use random numbers, but the advantages are worth it. We can generate all of our randomness with a single seed which guarantees that the results can be reproduced.
Of course, this being Haskell, we can take advantage of Monads to get rid of all the manual threading of the seed values. This is a slightly more advanced method, but well worth learning since monads are ubiquitous in Haskell code.
We need these imports:
import System.Random
import Control.Monad
import Control.Applicative
Then we'll create a newtype which represents the action of turning a seed into a value and the next seed.
newtype Rand a = Rand { runRand :: StdGen -> (a,StdGen) }
We need Functor and Applicative instances or GHC will complain, but we can avoid implementing them for this example.
instance Functor Rand
instance Applicative Rand
And now for the Monad instance. This is where the magic happens. The >>= function (called bind) is the one place where we specify how to thread the seed value through the computation.
instance Monad Rand where
return x = Rand ( \seed -> (x,seed) )
ra >>= f = Rand ( \s1 -> let (a,s2) = runRand ra s1
in runRand (f a) s2 )
newRand :: Rand Int
newRand = Rand ( \seed -> random seed )
Now our sample function is extremely simple! We can take advantage of replicateM from Control.Monad which repeats a given action and accumulates the results in a list. All that funny business with the seed values is taken care of behind the scenes
sample :: Int -> Rand [Int]
sample n = replicateM n newRand
main :: IO ()
main = do
let seed1 = mkStdGen 124567
(xs,seed2) = runRand (sample 10) seed1
s = sum xs
print s
We can even stay inside the Rand monad if we need to generate random values multiple times.
main :: IO ()
main = do
let seed1 = mkStdGen 124567
(xs,seed2) = flip runRand seed1 $ do
x <- newRand
bs <- sample 5
cs <- sample 10
return $ x : (bs ++ cs)
s = sum xs
print s
I hope this helps!

Synchronising parallel processes in Haskell

In Haskell, how can I run multiple (monad?) functions (such as print) in parallel and see their output in the order of their finish time? I want three processes each one ending up in a print function.
import Control.Parallel
main = a `par` b `pseq` (a,b)
where
a = print("ack",ack 3 10)
b = print("fac",fac 42)
If I don't use pseq, it will show the last one specified in par combination. I want to make sure all processes are finished before the program ends. I tried this but it does not show the output of a,b:
...
main = a `par` b `pseq` print("done.")
...
Note: my program ends with the following lines:
fac 0 = 1
fac n = n * fac (n-1)
ack 0 n = n+1
ack m 0 = ack (m-1) 1
ack m n = ack (m-1) (ack m (n-1))

Don't use Control.Parallel for running IO actions concurrently.
Control.Concurrent.Async should do what you want – Control.Parallel is used for hinting which pure operations can be evaluated simultaneously, not for running multiple actions (monadic or otherwise) at the same time.

In the C family of languages, there's a difference between these 2 lines
a = foobar; // This just sets a to be a function pointer to foobar
b = foobar(); // This actually runs foobar and stores the result in `b`
Your code basically has the same problem as if you were writing C and forgot the () at the end of all your function calls. It assigns the function pointers a and b in parallel but doesn't actually call them.
If you are writing pure functions in Haskell, you can add parallelism using par, but it doesn't work for IO because IO a is like a function pointer. Your code "evaluates" a and b in parallel but the result of that evaluation is still waiting for you to actually execute it.
For code that lives in IO, you have to import Control.Concurrent and use the function forkIO to create the new thread. Then because the child threads all automatically die when the main thread finishes, you need some way of making the main thread wait (I'll use MVars in the example because they're the simplest reliable way to do it)
import Control.Concurrent
import Control.Concurrent.MVar
main = do
-- MVars have a type parameter because you can also use them to send data
-- between threads, but in this example I'm basically just using them as locks
await <- newEmptyMVar :: IO (MVar ())
bwait <- newEmptyMVar :: IO (MVar ())
forkIO $ print ("ack",ack 3 10) >> putMVar await ()
forkIO $ print ("fac",fac 42) >> putMVar bwait ()
takeMVar await
takeMVar bwait

How do I get lazy streaming into the foldl'?

How does one make their own streaming code? I was generating about 1,000,000,000 random pairs of war decks, and I wanted them to be lazy streamed into a foldl', but I got a space leak! Here is the relevant section of code:
main = do
games <- replicateM 1000000000 $ deal <$> sDeck --Would be a trillion, but Int only goes so high
let res = experiment Ace games --experiment is a foldl'
print res --res is tiny
When I run it with -O2, it first starts freezing up my computer, and then the program dies and the computer comes back to life (and Google Chrome then has the resources it needs to yell at me for using up all its resources.)
Note: I tried unsafeInterleaveIO, and it didn't work.
Full code is at: http://lpaste.net/109977

replicateM doesn't do lazy streaming. If you need to stream results from monadic actions, you should use a library such as conduit or pipes.
Your example code could be written to support streaming with conduits like this:
import Data.Conduit
import qualified Data.Conduit.Combinators as C
main = do
let games = C.replicateM 1000000 $ deal <$> sDeck
res <- games $$ C.foldl step Ace
-- where step is the function you want to fold with
print res
The Data.Conduit.Combinators module is from the conduit-combinators package.
As a quick-and-dirty solution you could implement a streaming version of replicateM using lazy IO.
import System.IO.Unsafe
lazyReplicateIO :: Integer -> IO a -> IO [a] --Using Integer so I can make a trillion copies
lazyReplicateIO 0 _ = return []
lazyReplicateIO n act = do
a <- act
rest <- unsafeInterleaveIO $ lazyReplicateIO (n-1) act
return $ a : rest
But I recommend using a proper streaming library.

The equivalent pipes solution is:
import Pipes
import qualified Pipes.Prelude as Pipes
-- Assuming the following types
action :: IO A
acc :: S
step :: S -> A -> S
done :: S -> B
main = do
b <- Pipes.fold step acc done (Pipes.replicateM 1000000 action)
print (b :: B)

How to best "waste" a roughly specified time by only "burning CPU" with pure functional calculations?

I occasionally would like to delay specific parts of a pure algorithm while developing / testing, so I can monitor the evaluation simply by watching the lazy result build up piece by piece (which would generally be too fast to be useful in the final, un-delayed version). I then find myself inserting ugly stuff like sum [1..1000000] `seq` q, which kind of works (though often with the usual thunk-explosion problems, because I never think much about this), but is rather trial-and-error-like.
Is there a nicer, more controllable alternative that's still just as simple, when I want to do some quick testing in that way and can't be bothered to do proper profiling, criterion etc.?
I'd also like to avoid unsafePerformIO $ threadDelay, though I reckon this might actually be an appropriate use.

This looping solution avoids calling threadDelay, but still calls unsafePerformIO, so maybe we don't gain much:
import Data.AdditiveGroup
import Data.Thyme.Clock
import Data.Thyme.Clock.POSIX
import System.IO.Unsafe
pureWait :: NominalDiffTime -> ()
pureWait time = let tsList = map unsafePerformIO ( repeat getPOSIXTime ) in
case tsList of
(t:ts) -> loop t ts
where
loop t (t':ts') = if (t' ^-^ t) > time
then ()
else loop t ts'
main :: IO ()
main = do
putStrLn . show $ pureWait (fromSeconds 10)
UPDATE: Here's an altenative solution. First determine (using IO) how many iterations do you need to achieve a given delay, and then just use a pure looping function.
pureWait :: Integer -> Integer
pureWait i = foldl' (+) 0 $ genericTake i $ intersperse (negate 1) (repeat 1)
calibrate :: NominalDiffTime -> IO Integer
calibrate timeSpan = let iterations = iterate (*2) 2 in loop iterations
where
loop (i:is) = do
t1 <- getPOSIXTime
if pureWait i == 0
then do
t2 <- getPOSIXTime
if (t2 ^-^ t1) > timeSpan
then return i
else loop is
else error "should never happen"
main :: IO ()
main = do
requiredIterations <- calibrate (fromSeconds 10)
putStrLn $ "iterations required for delay: " ++ show requiredIterations
putStrLn . show $ pureWait requiredIterations

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Adding a Progress Bar to Parallel Code (Haskell) - haskell

Unfortunately, I believe GHC currently does not expose any functionality for observing (in IO, obviously) the evaluation progress of parallel computations. You will need to use concurrency in the form of forkIO and friends (or a library that wraps them like the async package) instead.

Related

Haskell: How to use a HashMap in a main function

Haskell random numbers

Synchronising parallel processes in Haskell

How do I get lazy streaming into the foldl'?

How to best "waste" a roughly specified time by only "burning CPU" with pure functional calculations?

Categories

Resources