Interaction of forkIO/killThread with forkProcess - linux

I've written the code below and noticed that killThread blocks while the thread keeps running. This only happens when I do it inside forkProcess; if I remove the forkProcess, everything works as expected.
Code
{-# LANGUAGE TupleSections #-}
module Main where
import Control.Concurrent
import Control.Monad
import System.Posix.Process
{-# NOINLINE primes #-}
primes :: [Integer]
primes = 2:[x | x <- [3..], all (not . flip isDivisorOf x) (takeWhile (< truncate (sqrt $ fromInteger x :: Double)) primes)]
  where x `isDivisorOf` y = y `rem` x == 0
evaluator :: Show a => [a] -> IO ()
evaluator xs = do
  putStrLn "[Evaluator] Started evaluator."
  forM_ xs $ \x -> putStrLn $ "[Evaluator] Got result: " ++ show x
  putStrLn "[Evaluator] Evaluator exited."
test :: IO ThreadId
test = forkIO (evaluator $ filter ((== 13) . flip rem (79 * 5 * 7 * 3 * 3 * 2 * 3)) primes) -- Just some computation that doesn't finish too fast
main :: IO ()
main = do
  pid <- forkProcess $ do
    a <- test
    threadDelay $ 4000 * 1000
    putStrLn "Canceling ..."
    killThread a
    putStrLn "Canceled"
  void $ getProcessStatus True False pid
Output
$ ghc test.hs -O -fforce-recomp -threaded -eventlog -rtsopts # I also tried with -threaded
$ ./test +RTS -N2 # I also tried without -N
[Evaluator] Started evaluator.
[Evaluator] Got result: 13
[Evaluator] Got result: 149323
[Evaluator] Got result: 447943
[Evaluator] Got result: 597253
[Evaluator] Got result: 746563
[Evaluator] Got result: 1045183
Canceling ...
[Evaluator] Got result: 1194493
[Evaluator] Got result: 1642423
[Evaluator] Got result: 1791733
[Evaluator] Got result: 2090353
[Evaluator] Got result: 2687593
[Evaluator] Got result: 3135523
[Evaluator] Got result: 3284833
[Evaluator] Got result: 4777933
[Evaluator] Got result: 5375173
^C[Evaluator] Got result: 5524483
^C
This is not the usual problem that there is no memory allocation and thus GHC's thread scheduler doesn't run. I verified that by running the program with +RTS -sstderr, which shows that the garbage collector is running very often. I'm running this on linux 64bit.

This bug report notes that forkProcess masks asynchronous exceptions in the child process despite no indication of such in the documentation. The behavior should be fixed in 7.8.1 when it is released.
Of course, if asynchronous exceptions are masked, the throw inside the killThread will block indefinitely. If you simply delete the lines in main containing forkProcess and getProcessStatus, the program works as intended:
module Main where
import Control.Concurrent
import Control.Monad
import System.Posix.Process
{-# NOINLINE primes #-}
primes :: [Integer]
primes = 2:[ x | x <- [3..], all (not . flip isDivisorOf x) (takeWhile (< truncate (sqrt $ fromInteger x :: Double)) primes)]
  where x `isDivisorOf` y = y `rem` x == 0
evaluator :: Show a => [a] -> IO ()
evaluator = mapM_ $ \x ->
  putStrLn $ "[Evaluator] Got result: " ++ show x
test :: IO ThreadId
test = forkIO (evaluator $ filter ((== 13) . flip rem (79 * 5 * 7 * 3 * 3 * 2 * 3)) primes) -- Just some computation that doesn't finish too fast
main :: IO ()
main = do
  a <- test
  threadDelay $ 4000 * 1000
  putStrLn "Canceling ..."
  killThread a
  putStrLn "Canceled"
I build it with ghc --make -threaded async.hs and run with ./async +RTS -N4.
If for some reason you need a separate process, you will have to manually unmask asynchronous exceptions in the child process in GHC 7.6.3.
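One way to do that unmasking is sketched below, assuming forkIOWithUnmask from Control.Concurrent is an acceptable replacement for forkIO in the child. The names worker and forkUnmasked are hypothetical helpers introduced for illustration; worker is only a stand-in for the evaluator thread from the question. The child also prints its own masking state, which may differ between GHC versions.

```haskell
import Control.Concurrent
import Control.Exception (getMaskingState)
import Control.Monad (forever, void)
import System.Posix.Process (forkProcess, getProcessStatus)

-- Hypothetical stand-in for the evaluator thread from the question.
worker :: IO ()
worker = forever $ do
  putStrLn "[Worker] tick"
  threadDelay (500 * 1000)

-- Fork a thread whose action runs with asynchronous exceptions unmasked,
-- whatever the masking state of the forking thread is.
forkUnmasked :: IO () -> IO ThreadId
forkUnmasked action = forkIOWithUnmask $ \unmask -> unmask action

main :: IO ()
main = do
  pid <- forkProcess $ do
    -- Report the masking state the child process starts in (version dependent).
    st <- getMaskingState
    putStrLn ("Child masking state: " ++ show st)
    tid <- forkUnmasked worker
    threadDelay (2 * 1000 * 1000)
    putStrLn "Canceling ..."
    killThread tid
    putStrLn "Canceled"
  -- Wait for the child process to exit.
  void $ getProcessStatus True False pid
```

Because the worker thread is explicitly unmasked, killThread should be able to deliver its exception even if the child process itself starts out masked.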

Related

Sorting in parallel performance

I tried to run some programs on multiple cores and am kind of confused by the results.
By default, the sort in the program below takes 20 seconds; when I run it with +RTS -N2 it takes around 16 seconds, but with +RTS -N4 it takes 21 seconds!
Why is that? And is there an example of a program that gets faster with each extra core? (I had similar results with other programs in tutorials.)
Here's an example program:
import Data.List
import Control.Parallel
import Data.Time.Clock.POSIX
qsort :: Ord a => [a] -> [a]
qsort (x:xs)
  = let a = qsort $ filter (<=x) xs
        b = qsort $ filter (>x) xs
    in b `par` a ++ x:b
qsort [] = []
randomList :: Int -> [Int]
randomList n = take n $ tail (iterate lcg 1)
  where lcg x = (a * x + c) `rem` m
        a = 1664525
        c = 1013904223
        m = 2^32
main :: IO ()
main = do
  let randints = randomList 5000000
  t1 <- getPOSIXTime
  print . sum $ qsort randints
  t2 <- getPOSIXTime
  putStrLn $ "SORT TIME: " ++ show (t2 - t1) ++ "\n"
I can't duplicate your results. (Which is a good thing, since I think I was the one claiming to see a performance improvement with -N2 and -N4 with the code you posted.)
On Linux with GHC 8.8.3, and compiling to a standalone executable with -O2 -threaded, I get the following timings on a 4-core desktop:
$ stack ghc -- --version
Stack has not been tested with GHC versions above 8.6, and using 8.8.3, this may fail
Stack has not been tested with Cabal versions above 2.4, but version 3.0.1.0 was found, this may fail
The Glorious Glasgow Haskell Compilation System, version 8.8.3
$ stack ghc -- -O2 -threaded QuickSort3.hs
Stack has not been tested with GHC versions above 8.6, and using 8.8.3, this may fail
Stack has not been tested with Cabal versions above 2.4, but version 3.0.1.0 was found, this may fail
[1 of 1] Compiling Main ( QuickSort3.hs, QuickSort3.o )
Linking QuickSort3 ...
$ ./QuickSort3 +RTS -N1
10741167410134688
SORT TIME: 7.671760902s
$ ./QuickSort3 +RTS -N2
10741167410134688
SORT TIME: 5.700858877s
$ ./QuickSort3 +RTS -N3
10741167410134688
SORT TIME: 4.88330669s
$ ./QuickSort3 +RTS -N4
10741167410134688
SORT TIME: 4.93364958s
I get similar results with a 16-core Linux laptop and also similar results with a 4-core Windows virtual machine (also using GHC 8.8.3) running on that laptop.
I can think of a few possible explanations for your results.
First, I don't have a tremendously fast desktop machine, so your timings of 20secs seem suspicious. Is it possible you're doing something like:
$ stack runghc QuickSort3.hs +RTS -N4
If so, this passes the +RTS flags to stack, and then runs the Haskell program in single-threaded mode using the slow byte-code interpreter. In my tests, the sort then takes about 30secs no matter what -Nx flag value I pass.
Second, is it possible you're running this on a virtual machine with a limited number of cores (or an extremely old piece of two-core hardware)? As noted, I tried testing under a Windows virtual machine and got similar results to the Linux version with a 4-core virtual machine but quite erratic results with a 2-core virtual machine (e.g., 11.4, 13.0, and 51.3secs for -N1, -N2, and -N4 respectively, so worse performance for more cores in general, and off-the-charts bad performance for 4 cores).
You could try the following simple parallel sums benchmark, which might scale better:
import Data.List
import Control.Parallel
import Data.Time.Clock.POSIX
randomList :: Int -> Int -> [Int]
randomList seed n = take n $ tail (iterate lcg seed)
  where lcg x = (a * x + c) `rem` m
        a = 1664525
        c = 1013904223
        m = 2^32
main :: IO ()
main = do
  t1 <- getPOSIXTime
  let n = 50000000
      a = sum $ randomList 1 n
      b = sum $ randomList 2 n
      c = sum $ randomList 3 n
      d = sum $ randomList 4 n
      e = sum $ randomList 5 n
      f = sum $ randomList 6 n
      g = sum $ randomList 7 n
      h = sum $ randomList 8 n
  print $ a `par` b `par` c `par` d `par` e `par` f `par` g `par` h `par` (a+b+c+d+e+f+g+h)
  t2 <- getPOSIXTime
  putStrLn $ "SORT TIME: " ++ show (t2 - t1) ++ "\n"

How do I recover lazy evaluation of a monadically constructed list, after switching from State to StateT?

With the following code:
(lazy_test.hs)
-- Testing lazy evaluation of monadically constructed lists, using State.
import Control.Monad.State
nMax = 5
foo :: Int -> State [Int] Bool
foo n = do
  modify $ \st -> n : st
  return (n `mod` 2 == 1)
main :: IO ()
main = do
  let ress = for [0..nMax] $ \n -> runState (foo n) []
      sts = map snd $ dropWhile (not . fst) ress
  print $ head sts
for = flip map
I can set nMax to 5, or 50,000,000, and I get approximately the same run time:
nMax = 5:
$ stack ghc lazy_test.hs
[1 of 1] Compiling Main ( lazy_test.hs, lazy_test.o )
Linking lazy_test ...
$ time ./lazy_test
[1]
real 0m0.019s
user 0m0.002s
sys 0m0.006s
nMax = 50,000,000:
$ stack ghc lazy_test.hs
[1 of 1] Compiling Main ( lazy_test.hs, lazy_test.o )
Linking lazy_test ...
$ time ./lazy_test
[1]
real 0m0.020s
user 0m0.002s
sys 0m0.005s
which is as I expect, given my understanding of lazy evaluation mechanics.
However, if I switch from State to StateT:
(lazy_test2.hs)
-- Testing lazy evaluation of monadically constructed lists, using StateT.
import Control.Monad.State
nMax = 5
foo :: Int -> StateT [Int] IO Bool
foo n = do
  modify $ \st -> n : st
  return (n `mod` 2 == 1)
main :: IO ()
main = do
  ress <- forM [0..nMax] $ \n -> runStateT (foo n) []
  let sts = map snd $ dropWhile (not . fst) ress
  print $ head sts
for = flip map
then I see an extreme difference between the respective run times:
nMax = 5:
$ stack ghc lazy_test2.hs
[1 of 1] Compiling Main ( lazy_test2.hs, lazy_test2.o )
Linking lazy_test2 ...
$ time ./lazy_test2
[1]
real 0m0.019s
user 0m0.002s
sys 0m0.004s
nMax = 50,000,000:
$ stack ghc lazy_test2.hs
[1 of 1] Compiling Main ( lazy_test2.hs, lazy_test2.o )
Linking lazy_test2 ...
$ time ./lazy_test2
[1]
real 0m29.758s
user 0m25.488s
sys 0m4.231s
And I'm assuming that's because I'm losing lazy evaluation of the monadically constructed list, when I switch to the StateT-based implementation.
Is that correct?
Can I recover lazy evaluation of a monadically constructed list, while keeping with the StateT-based implementation?
In your example, you're only running one foo action per runState, so your use of State and/or StateT is essentially irrelevant. You can replace the use of foo with the equivalent:
import Control.Monad
nMax = 50000000
main :: IO ()
main = do
  ress <- forM [0..nMax] $ \n -> return (n `mod` 2 == 1, [n])
  let sts = map snd $ dropWhile (not . fst) ress
  print $ head sts
and it behaves the same way.
The issue is the strictness of the IO monad. If you ran this computation in the Identity monad instead:
import Control.Monad
import Data.Functor.Identity
nMax = 50000000
main :: IO ()
main = do
  let ress = runIdentity $ forM [0..nMax] $ \n -> return (n `mod` 2 == 1, [n])
  let sts = map snd $ dropWhile (not . fst) ress
  print $ head sts
then it would run lazily.
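The difference can be made visible with a small experiment. The sketch below uses hypothetical names (firstAndCount, firstLazy) and an IORef counter that is not part of the original code: in IO, forM runs every action before any result is available, while in Identity only the demanded element is computed, so even an infinite input list works.

```haskell
import Control.Monad (forM)
import Data.Functor.Identity (runIdentity)
import Data.IORef (modifyIORef', newIORef, readIORef)

-- Run forM in IO over a short list, counting how many actions executed,
-- and return the first result together with that count.
firstAndCount :: IO (Int, Int)
firstAndCount = do
  counter <- newIORef (0 :: Int)
  xs <- forM [1 .. 5 :: Int] $ \n -> do
    modifyIORef' counter (+ 1)
    return (n * 2)
  count <- readIORef counter
  return (head xs, count)

-- The same traversal in Identity has no effects to sequence, so the
-- result list is produced lazily, even for an infinite input.
firstLazy :: Int
firstLazy = head (runIdentity (forM [1 ..] (\n -> return (n * 2))))

main :: IO ()
main = do
  firstAndCount >>= print -- prints (2,5): all five actions ran
  print firstLazy         -- prints 2
```

Only the first element is ever inspected, yet the IO version has executed all five actions by the time forM returns.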
If you want to run lazily in the IO monad, you need to do it explicitly with unsafeInterleaveIO, so the following would work:
import System.IO.Unsafe
import Control.Monad
nMax = 50000000
main :: IO ()
main = do
  ress <- lazyForM [0..nMax] $ \n -> return (n `mod` 2 == 1, [n])
  let sts = map snd $ dropWhile (not . fst) ress
  print $ head sts
lazyForM :: [a] -> (a -> IO b) -> IO [b]
lazyForM (x:xs) f = do
  y <- f x
  ys <- unsafeInterleaveIO (lazyForM xs f)
  return (y:ys)
lazyForM [] _ = return []
The other answer by K A Buhr explains why State vs StateT is not the pertinent factor (IO is), and also points out how your example is strangely constructed (in that the State(T) part isn't actually used, as each number uses a new state []). But aside from those points, I'm not sure I would say "losing lazy evaluation of the monadically constructed list", because if we understand lazy evaluation as "evaluated only when needed", then foo does indeed need to run on every element of the input list in order to perform all the effects, so lazy evaluation is not being "lost". You are getting what you asked for. (It just so happens that foo doesn't perform any IO, and perhaps someone else can comment on whether it's ever possible for a compiler/GHC to optimize it away on this basis, but you can easily see why GHC does the naive thing here.)
This is a common, well-known problem in Haskell. There are various libraries (best known of which are streaming, pipes, conduit) which solve the problem by giving you streams (basically lists) which are lazy in the effects too. If I recreate your example in a streaming style,
import Data.Function ((&))
import Control.Monad.State
import Streaming
import qualified Streaming.Prelude as S
foo :: Int -> StateT [Int] IO Bool
foo n =
  (n `mod` 2 == 1) <$ modify (n:)
nMax :: Int
nMax = 5000000
main :: IO ()
main = do
  mHead <- S.head_ $ S.each [0..nMax]
                   & S.mapM (flip runStateT [] . foo)
                   & S.dropWhile (not . fst)
  print $ snd <$> mHead
then both versions run practically instantaneously. To make the difference more apparent, imagine that foo also called print "hi". Then the streaming version, being lazy in the effects, would print only twice, whereas your original versions would both print nMax times. Because the streams are lazy in the effects, the whole list doesn't need to be traversed, so the computation can short-circuit and finish early.

Avoiding CAF in Haskell

To avoid a CAF (and the resource sharing that comes with it), I tried converting the value to a function
with a dummy argument, but without success (noCafB).
I've read How to make a CAF not a CAF in Haskell? so I tried noCafC and noCafD. When compiled with
-O0, the functions with a dummy argument were evaluated every time.
However, with -O2, it seems that GHC converts those functions to
CAFs. Is this intended behaviour (GHC's optimization)?
module Main where
import Debug.Trace
cafA :: [Integer]
cafA = trace "hi" (map (+1) $ [1..])
noCafB :: a -> [Integer]
noCafB _ = trace "hi" (map (+1) $ [1..])
noCafC :: a -> [Integer]
noCafC _ = trace "hi" (map (+1) $ [1..])
{-# NOINLINE noCafC #-}
noCafD :: a -> [Integer]
noCafD _ = trace "hi" (map (+1) $ myEnumFrom 0 1)
{-# NOINLINE noCafD #-}
myEnumFrom :: a -> Integer -> [Integer]
myEnumFrom _ n = enumFrom n
{-# NOINLINE myEnumFrom #-}
main :: IO ()
main = do
  putStrLn "cafA"
  print $ (cafA !! 1 + cafA !! 2)
  putStrLn "noCafB"
  print $ (noCafB 0 !! 1 + noCafB 0 !! 2)
  putStrLn "noCafC"
  print $ (noCafC 0 !! 1 + noCafC 0 !! 2)
  putStrLn "noCafD"
  print $ (noCafD 0 !! 1 + noCafD 0 !! 2)
Result with -O2
$ stack ghc -- --version
The Glorious Glasgow Haskell Compilation System, version 7.10.3
$ stack ghc -- -O2 cafTest.hs
[1 of 1] Compiling Main ( cafTest.hs, cafTest.o )
Linking cafTest ...
$ ./cafTest
cafA
hi
7
noCafB
7
noCafC
7
noCafD
hi
7
Result with -O0
$ stack ghc -- -O0 cafTest.hs
[1 of 1] Compiling Main ( cafTest.hs, cafTest.o )
Linking cafTest ...
$ ./cafTest
cafA
hi
7
noCafB
hi
hi
7
noCafC
hi
hi
7
noCafD
hi
hi
7
I've also tried without trace, but the results were the same. Under -O2, I found by inspecting the profiling output that the result of the incInt function is shared. Why does this happen?
incIntOrg :: [Integer]
incIntOrg = map (+1) [1..]
incInt :: a -> [Integer] -- results IS shared. should it be?
incInt _ = map (+1) $ myEnum 0 1
{-# NOINLINE incInt #-}
myEnum :: a -> Integer -> [Integer]
myEnum _ n = enumFrom n
{-# NOINLINE myEnum #-}
main :: IO ()
main = do
  print (incInt 0 !! 9999999)
  print (incInt 0 !! 9999999)
  print (incInt 0 !! 9999999)
Any comments will be appreciated deeply. Thanks.
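A possible way to investigate this is sketched below. It rests on the assumption that two GHC optimizations are responsible: full laziness (which can float the argument-independent expression out of the function, effectively turning it back into a CAF) and common subexpression elimination (which can merge the two identical calls in main). The flags -fno-full-laziness and -fno-cse switch these off for the module. The name noCaf is hypothetical, and whether "hi" is then traced twice may still depend on the GHC version.

```haskell
{-# OPTIONS_GHC -fno-full-laziness -fno-cse #-}
module Main where

import Debug.Trace (trace)

-- Same shape as noCafB from the question: the body does not depend on the
-- dummy argument, which is what makes it a candidate for floating out.
noCaf :: a -> [Integer]
noCaf _ = trace "hi" (map (+1) [1..])
{-# NOINLINE noCaf #-}

main :: IO ()
main =
  -- With sharing, "hi" is traced once; without sharing, once per call.
  print (noCaf () !! 1 + noCaf () !! 2)
```

Compiling this with -O2 and comparing the number of "hi" lines with and without the OPTIONS_GHC pragma should show whether these two transformations explain the sharing.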

Haskell - Couldn't match type [] with IO

I am new to Haskell. Why am I getting the error message
(Couldn't match type '[]' with 'IO') in the following code?
In main I only need the running time of the algorithm, not its result.
I only want to measure the algorithm's time.
qsort1 :: Ord a => [a] -> [a]
qsort1 [] = []
qsort1 (p:xs) = qsort1 lesser ++ [p] ++ qsort1 greater
  where
    lesser = [ y | y <- xs, y < p ]
    greater = [ y | y <- xs, y >= p ]
main = do
  start <- getCurrentTime
  qsort1 (take 1000000 $ randomRs (1, 100000) (mkStdGen 42))
  end <- getCurrentTime
  print (diffUTCTime end start)
Your main function isn't right. Unless qsort1 is an IO action you cannot perform it in an IO monad. Instead you can put it in the let binding:
main = do
  start <- getCurrentTime
  let x = qsort1 (take 1000000 $ randomRs ((1 :: Int), 100000) (mkStdGen 42))
  end <- getCurrentTime
  print (diffUTCTime end start)
Also note that I have explicitly given a type annotation for 1 to avoid some compile errors.
That said, you still cannot measure the total sorting time this way because of lazy evaluation: x is never computed because it is never used in the program. If you run main, it gives you this output, which is definitely wrong:
λ> main
0.000001s
Instead you can use this to calculate the computation:
main = do
  start <- getCurrentTime
  let x = qsort1 (take 1000000 $ randomRs ((1 :: Int), 100000) (mkStdGen 42))
  print x
  end <- getCurrentTime
  print (diffUTCTime end start)
Instead of printing, you can also use the BangPatterns extension to force the computation of qsort1:
{-# LANGUAGE BangPatterns #-}
main = do
  start <- getCurrentTime
  let !x = qsort1 (take 1000000 $ randomRs ((1 :: Int), 100000) (mkStdGen 42))
  end <- getCurrentTime
  print (diffUTCTime end start)
BangPatterns will not lead to full evaluation, as @kosmikus points out. Instead, use a library like criterion, which is made specifically for benchmarking.
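For a quick manual measurement without a benchmarking library, a bang pattern is not enough because it evaluates the list only to its first constructor. A minimal sketch, assuming the deepseq package is available: force from Control.DeepSeq combined with evaluate from Control.Exception evaluates the whole list before the clock is read. The pseudoRandoms helper is a hypothetical stand-in for randomRs so that the example does not need the random package.

```haskell
import Control.DeepSeq (force)
import Control.Exception (evaluate)
import Data.Time.Clock (diffUTCTime, getCurrentTime)

qsort1 :: Ord a => [a] -> [a]
qsort1 [] = []
qsort1 (p:xs) = qsort1 lesser ++ [p] ++ qsort1 greater
  where
    lesser = [ y | y <- xs, y < p ]
    greater = [ y | y <- xs, y >= p ]

-- Hypothetical stand-in for randomRs: a simple linear congruential generator.
pseudoRandoms :: Int -> [Int]
pseudoRandoms n = take n $ tail (iterate lcg 42)
  where lcg x = (1664525 * x + 1013904223) `rem` 4294967296

main :: IO ()
main = do
  -- Fully evaluate the input first so that generating it is not timed.
  input <- evaluate (force (pseudoRandoms 200000))
  start <- getCurrentTime
  -- force evaluates the whole sorted list, not just its first cell.
  sorted <- evaluate (force (qsort1 input))
  end <- getCurrentTime
  putStrLn $ "Sorted " ++ show (length sorted) ++ " elements in " ++ show (diffUTCTime end start)
```

A single wall-clock measurement like this is noisy, so it is only a rough check and not a substitute for a proper benchmark.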
I used method below and it works fine:
main = do
  let arr = take 1000000 $ randomRs ((1 :: Int), 10000000) (mkStdGen 59)
  defaultMain [
    bgroup "qs" [ bench "1" $ nf quickSort arr ]
    ]

Haskell MVar : How to execute shortest job first?

When more than one thread is waiting to write to an MVar, they are served in first-in, first-out order. I want to execute the threads in shortest-job-first order instead.
I have tried to code this using an MVar. Here the job is to calculate a Fibonacci number and write it to an MVar. The 1st thread calculates Fibonacci 30 and the 2nd thread calculates Fibonacci 10. As calculating Fibonacci 10 takes less time than Fibonacci 30, the 2nd thread should execute first. I am not getting the desired result from the following block of code.
How can I implement shortest-job-first scheduling in Haskell (or maybe using Haskell STM)?
Code
module Main
where
import Control.Parallel
import Control.Concurrent
import System.IO
nfib :: Int -> Int
nfib n | n <= 2 = 1
       | otherwise = par n1 (pseq n2 (n1 + n2 ))
  where n1 = nfib (n-1)
        n2 = nfib (n-2)
type MInt = MVar Int
updateMVar :: MInt -> Int -> IO ()
updateMVar n v = do x1 <- readMVar n
                    let y = nfib v
                    x2 <- readMVar n
                    if x1 == x2
                      then do t <- takeMVar n
                              putMVar n y
                      else return()
main :: IO ()
main = do
  n <- newEmptyMVar
  putMVar n 0
  forkIO(updateMVar n 30)
  t <- readMVar n
  putStrLn("n is : " ++ (show t))
  forkIO(updateMVar n 10)
  t <- readMVar n
  putStrLn("n is : " ++ (show t))
Output
n is : 832040
n is : 55
To implement scheduling you need to use MVars and threads together. Start with an empty MVar. Fork the jobs you wish to run in the background. The main thread can then block on each result in turn. The fastest will come first. Like so:
{-# LANGUAGE BangPatterns #-}
import Control.Parallel
import Control.Concurrent
import System.IO
nfib :: Int -> Int
nfib n | n <= 2 = 1
       | otherwise = par n1 (pseq n2 (n1 + n2 ))
  where n1 = nfib (n-1)
        n2 = nfib (n-2)
main :: IO ()
main = do
  result <- newEmptyMVar
  forkIO $ do
    let !x = nfib 40
    putMVar result x
  forkIO $ do
    let !x = nfib 30
    putMVar result x
  t <- takeMVar result
  print $ "Fastest result was: " ++ show t
  t <- takeMVar result
  print $ "Slowest result was: " ++ show t
Note that it is important to use bang patterns to evaluate the fibonacci calls outside of the MVar (don't want to simply return an unevaluated thunk to the main thread).
Compile with the threaded runtime:
$ ghc -o A --make A.hs -threaded -fforce-recomp -rtsopts
[1 of 1] Compiling Main ( A.hs, A.o )
Linking A.exe ...
And run on two cores:
$ ./A.exe +RTS -N2
"Fastest result was: 832040"
"Slowest result was: 102334155"
Productivity is pretty good as well (use +RTS -s to see runtime performance statistics).
Productivity 89.3% of total user, 178.1% of total elapsed
The first thread to finish will have its result printed first. The main thread will then block until the second thread is done.
The main thing is to take advantage of MVar empty/full semantics to block the main thread on each of the children threads.
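The same idea extends to any number of jobs. The sketch below uses hypothetical names (fib, runJobs) and a plain sequential Fibonacci: every job tags its result with its input and puts it into one shared MVar, and the main thread takes exactly as many results as it forked jobs. The results arrive in completion order, which depends on the scheduler and the number of cores, so this is not a true shortest-job-first scheduler.

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM, forM_)

-- Sequential Fibonacci with the same base case as nfib above.
fib :: Int -> Integer
fib n | n <= 2 = 1
      | otherwise = fib (n - 1) + fib (n - 2)

-- Fork one thread per input and collect (input, result) pairs
-- in the order in which the threads finish.
runJobs :: [Int] -> IO [(Int, Integer)]
runJobs inputs = do
  results <- newEmptyMVar
  forM_ inputs $ \n -> forkIO $ do
    -- Evaluate inside the worker so no unevaluated thunk is handed over.
    let !r = fib n
    putMVar results (n, r)
  -- Take one result per forked job; each take blocks until a job finishes.
  forM inputs $ \_ -> takeMVar results

main :: IO ()
main = runJobs [30, 10, 20] >>= mapM_ print
```

With the threaded runtime and several cores, the shorter jobs will usually be printed first, but the order is not guaranteed.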

Resources