When more than one thread is waiting to write an MVar, they are executed in first-in first-out scheme. I want to execute thread as per shortest job scheduling.
I have tired to code this using MVar. Here job is to calculate a Fibonacci number and write a MVar. 1st thread calculates Fibonacci 30 and 2nd thread calculates Fibonacci 10. As time taken for calculating Fibonacci 10 is less than 30, thus 2nd thread should execute first. I a not getting the desired result from the following block of code.
How to implement shortest job first scheduling in Haskell (or may be using Haskell STM)?
Code
module Main
where
import Control.Parallel
import Control.Concurrent
import System.IO
nfib :: Int -> Int
nfib n | n <= 2 = 1
| otherwise = par n1 (pseq n2 (n1 + n2 ))
where n1 = nfib (n-1)
n2 = nfib (n-2)
type MInt = MVar Int
updateMVar :: MInt -> Int -> IO ()
updateMVar n v = do x1 <- readMVar n
let y = nfib v
x2 <- readMVar n
if x1 == x2
then do t <- takeMVar n
putMVar n y
else return()
main :: IO ()
main = do
n <- newEmptyMVar
putMVar n 0
forkIO(updateMVar n 30)
t <- readMVar n
putStrLn("n is : " ++ (show t))
forkIO(updateMVar n 10)
t <- readMVar n
putStrLn("n is : " ++ (show t))
Output
n is : 832040
n is : 55
To implement scheduling you need to use MVars and threads together. Start with an empty MVar. Fork the jobs you wish to run in the background. The main thread can then block on each result in turn. The fastest will come first. Like so:
{-# LANGUAGE BangPatterns #-}
import Control.Parallel
import Control.Concurrent
import System.IO
nfib :: Int -> Int
nfib n | n <= 2 = 1
| otherwise = par n1 (pseq n2 (n1 + n2 ))
where n1 = nfib (n-1)
n2 = nfib (n-2)
main :: IO ()
main = do
result <- newEmptyMVar
forkIO $ do
let !x = nfib 40
putMVar result x
forkIO $ do
let !x = nfib 30
putMVar result x
t <- takeMVar result
print $ "Fastest result was: " ++ show t
t <- takeMVar result
print $ "Slowest result was: " ++ show t
Note that it is important to use bang patterns to evaluate the fibonacci calls outside of the MVar (don't want to simply return an unevaluated thunk to the main thread).
Compile with the threaded runtime:
$ ghc -o A --make A.hs -threaded -fforce-recomp -rtsopts
[1 of 1] Compiling Main ( A.hs, A.o )
Linking A.exe ...
And run on two cores:
$ ./A.exe +RTS -N2
"Fastest result was: 832040"
"Slowest result was: 102334155"
Productivity is pretty good as well (use +RTS -s to see runtime performance statistics).
Productivity 89.3% of total user, 178.1% of total elapsed
The first thread to finish will have its result printed first. The main thread will then block until the second thread is done.
The main thing is to take advantage of MVar empty/full semantics to block the main thread on each of the children threads.
Related
The program below works if run with runhaskell or if compiled but not with -O2. If compiled with -O2 it seems to hang.
I'm using GHC 7.10.2.
I've changed the min/max iterations to 10 and 20 respectively. It will
generate anywhere from 20 to 100 MB of output into the file test.out.
Run time is about 15 - 60 secs.
Program Explanation
Below is a multi-threaded program that has a pool of workers and a manager. The workers generate traces to be used in plotting a Buddhabrot, put it in a queue, and a manager periodically empties the queue and writes the data to disk. When a certain amount of data has been generated, the program stops.
But when the program runs the manager thread only does one check, and then it gets stuck (the worker threads are still running). However, if I remove the part where the manager thread writes to file, then everything seems to work. I just don't understand why...
import Control.Concurrent
import Control.Concurrent.Async
import Control.Concurrent.STM
import Control.Monad
( forever
, unless
)
import Control.Monad.Loops
import System.IO
import System.Random
import qualified Data.Binary as B
import qualified Data.ByteString.Lazy as BS
type Coord = (Double, Double)
type Trace = [Coord]
-- | Represents a rectangle in the complex plane, bounded by a lower left
-- coordinate and an upper right coordinate.
data Plane
= Plane { ll :: Coord, ur :: Coord }
deriving (Show)
-- | Adds two coordinates.
(+.) :: Coord -> Coord -> Coord
(r1, i1) +. (r2, i2) = (r1 + r2, i1 + i2)
-- | Multiplies two coordinates.
(*.) :: Coord -> Coord -> Coord
(r1, i1) *. (r2, i2) = (r1*r2 - i1*i2, r1*i2 + r2*i1)
-- | Computes the square of a coordinate.
square :: Coord -> Coord
square (r, i) = (r*r - i*i, 2*r*i)
-- | Distance from origin to a given coordinate.
distFromOrigin :: Coord -> Double
distFromOrigin (r, i) = r*r + i*i
-- | A structure for passing data to the worker threads.
data WorkerData
= WorkerData { wdMinIt :: Int
, wdMaxIt :: Int
, wdTraceQueue :: TQueue Trace
-- ^ A queue of traces to be written to disk.
}
-- | A structure for passing data to the manager thread.
data ManagerData
= ManagerData { mdOutHandle :: Handle
-- ^ Handle to the output file.
, mdNumTraces :: Integer
-- ^ Number of traces to gather.
, mdTraceQueue :: TQueue Trace
-- ^ A queue of traces to be written to disk.
}
-- | Encodes an entity to binary bytestring.
encode :: B.Binary a => a -> BS.ByteString
encode = B.encode
-- | Writes a lazy bytestring to file.
writeToFile :: Handle -> BS.ByteString -> IO ()
writeToFile = BS.hPut
mkManagerData :: TQueue Trace -> IO ManagerData
mkManagerData t_queue =
do let out_f = "test.out"
out_h <- openBinaryFile out_f WriteMode
let num_t = 1000
return $ ManagerData { mdOutHandle = out_h
, mdNumTraces = num_t
, mdTraceQueue = t_queue
}
mkWorkerData :: TQueue Trace -> IO WorkerData
mkWorkerData t_queue =
do let min_it = 10 -- 1000
max_it = 20 -- 10000
return $ WorkerData { wdMinIt = min_it
, wdMaxIt = max_it
, wdTraceQueue = t_queue
}
-- | The actions to be performed by the manager thread.
runManager :: ManagerData -> IO ()
runManager m_data =
do execute 0
return ()
where execute count =
do new_traces <- purgeTQueue $ mdTraceQueue m_data
let new_count = count + (toInteger $ length new_traces)
putStrLn $ "Found " ++ (show $ new_count) ++ " traces so far. "
if length new_traces > 0
then do putStrLn $ "Writing new traces to file..."
_ <- mapM (writeToFile (mdOutHandle m_data))
(map encode new_traces)
putStr "Done"
else return ()
putStrLn ""
unless (new_count >= mdNumTraces m_data) $
do threadDelay (1000 * 1000) -- Sleep 1s
execute new_count
-- | The actions to be performed by a worker thread.
runWorker :: WorkerData -> IO ()
runWorker w_data =
forever $
do c <- randomCoord
case computeTrace c (wdMinIt w_data) (wdMaxIt w_data) of
Just t -> atomically $ writeTQueue (wdTraceQueue w_data) t
Nothing -> return ()
-- | Reads all values from a given 'TQueue'. If any other thread reads from the
-- same 'TQueue' during the execution of this function, then this function may
-- deadlock.
purgeTQueue :: Show a => TQueue a -> IO [a]
purgeTQueue q =
whileJust (atomically $ tryReadTQueue q)
(return . id)
-- | Generates a random coordinate to trace.
randomCoord :: IO Coord
randomCoord =
do x <- randomRIO (-2.102613, 1.200613)
y <- randomRIO (-1.237710, 1.239710)
return (x, y)
-- | Computes a trace, using the classical Mandelbrot function, for a given
-- coordinate and minimum and maximum iteration count. If the length of the
-- trace is less than the minimum iteration count, or exceeds the maximum
-- iteration count, 'Nothing' is returned.
computeTrace
:: Coord
-> Int
-- ^ Minimum iteration count.
-> Int
-- ^ Maximum iteration count.
-> Maybe Trace
computeTrace c0 min_it max_it =
if isUsefulCoord c0
then let step c = square c +. c0
computeIt c it = if it < max_it
then computeIt (step c) (it + 1)
else it
computeTr [] = error "computeTr: empty list"
computeTr (c:cs) = if length cs < max_it
then computeTr (step c:(c:cs))
else (c:cs)
num_it = computeIt c0 0
in if num_it >= min_it && num_it <= max_it
then Just $ reverse $ computeTr [c0]
else Nothing
else Nothing
-- | Checks if a given coordinate is useful by checking if it belongs in the
-- cardioid or period-2 bulb of the Mandelbrot.
isUsefulCoord :: Coord -> Bool
isUsefulCoord (x, y) =
let t1 = x - 1/4
p = sqrt (t1*t1 + y*y)
is_in_cardioid = x < p - 2*p*p + 1/4
t2 = x + 1
is_in_bulb = t2*t2 + y*y < 1/16
in not is_in_cardioid && not is_in_bulb
main :: IO ()
main =
do t_queue <- newTQueueIO
m_data <- mkManagerData t_queue
w_data <- mkWorkerData t_queue
let num_workers = 1
workers <- mapM async (replicate num_workers (runWorker w_data))
runManager m_data
_ <- mapM cancel workers
_ <- mapM waitCatch workers
putStrLn "Tracing finished"
Why It Fails
After reviewing the answers below, I finally realized why it doesn't work as intended. The program does not hang, but the time it takes for the manager thread to encode a single trace is in the order of tens of seconds (and when encoded it consumes several megabytes)! This means that even if there are some tens of traces in the queue when exhausted -- on my machine the workers manage to produce about 250 traces before the queue is exhausted by the manger thread -- it will take forever before the next exhaust.
Hence it matters little what solution I choose unless the work of the manager thread is greatly reduced. For that, I will have to abandon my idea of dumping each individual trace to file and instead process it once generated.
The problem is two-fold:
(1) The manager thread doesn't process any
Traces until it has exhausted the queue.
(2) The worker thread can add elements to the queue very, very quickly.
This results in a race that the manager thread rarely wins. [ This also explains the observed behavior with -O2 - the optimization just made the worker thread faster. ]
Adding some debugging code shows that the worker can add
items to the queue in excess of 100K Traces per second.
Moreover, even though the manager is only interested in
writing out the first 1000 Traces, the worker doesn't
stop at this limit. So, under certain circumstances,
the manager is never able to exit this loop:
purgeTQueue q =
whileJust (atomically $ tryReadTQueue q)
(return . id)
The simplest way to fix the code is to have the
manager thread use readTQueue to read and process just one
item off the queue at a time. This will also block
the manager thread when the queue us empty obviating
the need to the manager thread to periodically sleep.
Changing purgeTQueue to:
purgeTQueue = do item <- atomically $ readTQueue (mdTraceQueue m_data)
return [item]
and removing the threadDelay from runManager fixes the problem.
Example code available in the Lib4.hs module at: https://github.com/erantapaa/mandel
I am pretty new to Haskell threads (and parallel programming in general) and I am not sure why my parallel version of an algorithm runs slower than the corresponding sequential version.
The algorithm tries to find all k-combinations without using recursion. For this, I am using this helper function, which given a number with k bits set, returns the next number with the same number of bits set:
import Data.Bits
nextKBitNumber :: Integer -> Integer
nextKBitNumber n
| n == 0 = 0
| otherwise = ripple .|. ones
where smallest = n .&. (-n)
ripple = n + smallest
newSmallest = ripple .&. (-ripple)
ones = (newSmallest `div` smallest) `shiftR` 1 - 1
It is now easy to obtain sequentially all k-combinations in the range [(2^k - 1), (2^(n-k)+...+ 2^(n-1)):
import qualified Data.Stream as ST
combs :: Int -> Int -> [Integer]
combs n k = ST.takeWhile (<= end) $ kBitNumbers start
where start = 2^k - 1
end = sum $ fmap (2^) [n-k..n-1]
kBitNumbers :: Integer -> ST.Stream Integer
kBitNumbers = ST.iterate nextKBitNumber
main :: IO ()
main = do
params <- getArgs
let n = read $ params !! 0
k = read $ params !! 1
print $ length (combs n k)
My idea is that this should be easily parallelizable splitting this range into smaller parts. For example:
start :: Int -> Integer
start k = 2 ^ k - 1
end :: Int -> Int -> Integer
end n k = sum $ fmap (2 ^) [n-k..n-1]
splits :: Int -> Int -> Int -> [(Integer, Integer, Int)]
splits n k numSplits = fixedRanges ranges []
where s = start k
e = end n k
step = (e-s) `div` (min (e-s) (toInteger numSplits))
initSplits = [s,s+step..e]
ranges = zip initSplits (tail initSplits)
fixedRanges [] acc = acc
fixedRanges [x] acc = acc ++ [(fst x, e, k)]
fixedRanges (x:xs) acc = fixedRanges xs (acc ++ [(fst x, snd x, k)])
At this point, I would like to run each split in parallel, something like:
runSplit :: (Integer, Integer, Int) -> [Integer]
runSplit (start, end, k) = ST.takeWhile (<= end) $ kBitNumbers (fixStart start)
where fixStart s
| popCount s == k = s
| otherwise = fixStart $ s + 1
For pallalelization I am using the monad-par package:
import Control.Monad.Par
import System.Environment
import qualified Data.Set as S
main :: IO ()
main = do
params <- getArgs
let n = read $ params !! 0
k = read $ params !! 1
numTasks = read $ params !! 2
batches = runPar $ parMap runSplit (splits n k numTasks)
reducedNumbers = foldl S.union S.empty $ fmap S.fromList batches
print $ S.size reducedNumbers
The result is that the sequential version is way faster and it uses little memory, while the parallel version consumes a lot of memory and it is noticeable slower.
What might be the reasons causing this? Are threads a good approach for this problem? For example, every thread generates a (potentially large) list of integers and the main thread reduces the results; are threads expected to need much memory or are simply meant to produce simple results (i.e. only cpu-intensive computations)?
I compile my program with stack build --ghc-options -threaded --ghc-options -rtsopts --executable-profiling --library-profiling and run it with ./.stack-work/install/x86_64-osx/lts-6.1/7.10.3/bin/combinatorics 20 3 4 +RTS -pa -N4 -RTS for n=20, k=3 and numSplits=4. An example of the profiling report for the parallel version can be found here and for the sequential version here.
In your sequential version calling combs does not build up a list in memory since after length consumes an element it isn't needed anymore and is freed. Indeed, GHC may not even allocate storage for it.
For instance, this will take a while but won't consume a lot of memory:
main = print $ length [1..1000000000] -- 1 billion
In your parallel version you are generating sub-lists, concatenating them together, building Sets, etc. and therefore the results of each sub-task have to be kept in memory.
A fairer comparison would be to have each parallel task compute the length of the k-bit numbers in its assigned range, and then add up the results. That way the k-bit numbers found by each parallel task wouldn't have to be kept in memory and would operate more like the sequential version.
Update
Here is an example of how to use parMap. Note: under 7.10.2 I've had mixed success getting the parallelism to fire - sometimes it does and sometimes it doesn't. (Figured it out - I was using -RTS -N2 instead of +RTS -N2.)
{-
compile with: ghc -O2 -threaded -rtsopts foo.hs
compare:
time ./foo 26 +RTS -N1
time ./foo 26 +RTS -N2
-}
import Data.Bits
import Control.Parallel.Strategies
import System.Environment
nextKBitNumber :: Integer -> Integer
nextKBitNumber n
| n == 0 = 0
| otherwise = ripple .|. ones
where smallest = n .&. (-n)
ripple = n + smallest
newSmallest = ripple .&. (-ripple)
ones = (newSmallest `div` smallest) `shiftR` 1 - 1
combs :: Int -> Int -> [Integer]
combs n k = takeWhile (<= end) $ iterate nextKBitNumber start
where start = 2^k - 1
end = shift start (n-k)
main :: IO ()
main = do
( arg1 : _) <- getArgs
let n = read arg1
print $ parMap rseq (length . combs n) [1..n]
good approaches for this problem
What do you mean by this problem? If it's how to write, analyze and tune a parallel Haskell program, then this is required background reading:
Simon Marlow: Parallel and Concurrent Programming in Haskell
http://community.haskell.org/~simonmar/pcph/
in particular, Section 15 (Debugging, Tuning, ..)
Use threadscope! (a graphical viewer for thread profile information generated by the Glasgow Haskell compiler) https://hackage.haskell.org/package/threadscope
I am trying to compare the perf of two implementations of finding the Nth smallest number in a binary search tree. This is just a toy learning problem. My naïve attempt at measuring follows:
getNth :: Tree Int -> Int -> Either String Int
eval :: [Either b Int] -> Int
eval = (foldl (+) 0) . rights
main :: IO ()
main = do
let t = foldl treeInsertBalanced EmptyTree [1..1000000]
first = getNth t 100000
iterations = 100000000000
second = take iterations $ repeat $ getNth t 100
third = take iterations $ repeat $ getNth' t 100
print $ "dummy to cause eval: " ++ (show first)
print ""
time1 <- System.CPUTime.getCPUTime
print $ eval second
time2 <- System.CPUTime.getCPUTime
print $ eval third
time3 <- System.CPUTime.getCPUTime
let secondTime = time2-time1
thirdTime = time3-time2
timeDiff = secondTime - thirdTime
print $ "take version = " ++ (show secondTime)
print $ "opt version = " ++ (show thirdTime)
print $ "diff = " ++ (show timeDiff)
I am having a hard time figuring out how to include laziness where I want it, and prevent it where I do not.
I want the tree to be completely constructed before I begin measuring the functions that operate on it. This is why I try to force an evaluation of t by calling getNth on it and then printing it.
Is this doing what I hope it is doing.
Will t remain fully evaluated when I use it subsequently.
The difference in implementation between the two getNth functions is that the first uses the 'take' function on a simple depth first search of the tree. The second does a depth first search with an explicit early return. I want to know whether the simple 'take' implementation has to walk the whole tree or not. How can I determine that in a simpler way than measuring the performance of the two functions. I tried introducing an 'error' or an 'undefined' as a value in the tree, but of course neither was evaluated unless it was the Nth element in the tree. Is there another, simple way of determining whether the getNth function is truly laze or not?
(Available as a .lhs at http://pastebin.com/jpg0vSNd )
Some observations:
A good way to force evalution of a value is to use deepseq from Control.DeepSeq.
repeat does not re-evaluate it's argument.
GHC is pretty good at spotting expression which are the same, so sometimes you have to disguise your function calls with identical arguments to make GHC re-evalute the function call.
Here is an example of using deepseq:
import Control.DeepSeq (deepseq)
import Control.Monad
import Debug.Trace
import System.TimeIt
import System.Environment
theList = [1..8] ++ [undefined] ++ [10] :: [Int]
main1 = do
print $ length theList
print $ deepseq theList (length theList)
The first print statement emits 10. The second throws an exception because
the deepseq call tried to evaluate the undefined element.
To see that repeat does not re-evaluate it's argument, consider this example:
foo = repeat $ trace "(here)" 2
main2 = print $ take 3 foo
The result of running main2 is:
[(here)
2,2,2]
What is happening is that when the head of foo is called for repeat evaluate it's argument. This calls trace which prints (here) and returns 2. This value is saved by repeat when the rest of the list foo is needed.
Finally, here is a demonstration of how good GHC is at spotting function calls with identical arguemnts.
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)
theN = 34 -- use 24 if running under ghci
compute1 = fib theN
compute2 k = fib theN
compute3 k = fib (k+theN-k)
fib theN is just a function call which takes a while to compute (about 0.6 secs)
loop1 n = forM_ [1..n] $ \_ -> print compute1
loop2 n = forM_ [1..n] $ \k -> print (compute2 k)
loop3 n = forM_ [1..n] $ \k -> print (compute3 k)
timeLoop loop = do timeIt $ loop 1
timeIt $ loop 2
timeIt $ loop 3
timeIt $ loop 10
main4 = timeLoop loop1
main5 = timeLoop loop2
main6 = timeLoop loop3
main = do (arg:_) <- getArgs
case arg of
"4" -> main4
"5" -> main5
"6" -> main6
The run times depend on whether or not you compile with -O2 or not.
Typical reuts are:
w/o -O2 with -O2
main4 1 secs 0.1 sec
main5 13 secs 0.1 sec
main6 13 secs 1.0 sec
Some conclusions:
A top-level expression like compute1 is memoized.
Adding an ignored parameter (e.g. compute2) will fool GHC into recomputing a function call if -O2 is not used.
With -O2 a trickier way of disguising the function call may be needed to get GHC to re-evaluate it in a loop.
In the following Haskell code, how to force main thread to wait till all its child threads finish.
I could not able to use forkFinally as given in the section "Terminating the Program" here in this link: (http://hackage.haskell.org/package/base-4.7.0.2/docs/Control-Concurrent.html).
I get desired result when using TMVar. But I want to do this with TVar.
Please help.
module Main
where
import Control.Monad
import Control.Concurrent
import Control.Concurrent.STM
type TInt = TVar Int
transTest :: TInt -> Int -> IO ()
transTest n t = do
atomically $ do
t1 <- readTVar n
doSomeJob t
t2 <- readTVar n
writeTVar n t
doSomeJob :: Int -> STM ()
doSomeJob t = do
x <- newTVar 0
let l = 10^6*t
forM_ [1..l] (\i -> do
writeTVar x i )
main :: IO ()
main = do
n <- newTVarIO 0
let v = 5
forkIO (transTest n v)
let v = 3
forkIO (transTest n v)
let v = 7
forkIO (transTest n v)
let v = 1
forkIO (transTest n v)
r <- atomically $ readTVar n
putStrLn("Last updated value = " ++ (show r))
What I did in the past was to create a little MVar for each forked thread and then use forkFinally to fork the threads such that at the very end, each thread would put a dummy value into the MVar (i.e. I used the MVar as a synchronisation primitive). I could then call takeMVar on those MVars to wait.
I wrapped it into a little helper function:
forkThread :: IO () -> IO (MVar ())
forkThread proc = do
handle <- newEmptyMVar
_ <- forkFinally proc (\_ -> putMVar handle ())
return handle
Using this, your code could be changed to something like
-- Fork four threads
threads <- forM [5, 3, 7, 1] (\v -> forkThread (transTest n v))
-- Wait for all of them
mapM_ takeMVar threads
However, that was before I read the (most excellent) book "Parallel and Concurrent Programming in Haskell" by Simon Marlow, which made me aware of the async package. The package provides an abstraction which not only takes care of all these things, so you can write just
-- Runs 'transTest n {5,3,7,1}' in parallel and waits for all threads
_ <- mapConcurrently (transTest n) [5, 3, 7, 1]
...it also takes care of things such as (asynchronous) exceptions.
To practice concurrent programming, I wrote the following (suboptimal) program, which repeatedly calculates the first prime bigger than whatever the user inputs:
import Control.Concurrent
import Control.Concurrent.Chan
import Control.Monad (forever)
primeAtLeast n = -- Some pure code that looks up the first prime at least as big as n
outputPrimeAtLeast n = putStrLn $ show $ (n, primeAtLeast n)
main = do
chan <- newChan
worker <- forkIO $ forever $ readChan chan >>= outputPrimeAtLeast
forever $ (readLn :: (IO Int)) >>= (writeChan chan)
killThread worker
I want to have a worker thread in the background that does the actual calculation and outputs (n, primeAtLeast n) as soon as it's finished.
What it's doing now: As soon as I enter a number n, it immediately outputs (n,, returns the control to the main thread, calculates primeAtLeast n in the background and outputs the second half primeAtLeast n) as soon as it's finished.
So is putStrLn not atomic? Or where is the problem?
Try this:
outputPrimeAtLeast n = let p = primeAtLeast n in p `seq` putStrLn $ show (n, p)
The above forces the computation of the prime before the putStrLn is run.
Further, you may use print instead of putStrLn . show:
outputPrimeAtLeast n = let p = primeAtLeast n in p `seq` print (n, p)
Alternatively, you may use a putStrLn function which forces every single character before starting printing anything.
strictPutStrLn :: Show a => a -> IO ()
strictPutStrLn x = let str = show x in str `listSeq` putStrLn str
listSeq :: [a] -> b -> b
listSeq [] w = w
listSeq (x:xs) w = x `seq` listSeq xs w