threadDelay problem in Haskell (GHC) on Ubuntu Linux

I noticed odd behavior with the threadDelay function in GHC.Conc on some of my machines. The following program:
import Control.Concurrent

main = do print "start"
          threadDelay (1000 * 1000)
          print "done"
takes 1 second to run, as expected. On the other hand, this program:
{-# LANGUAGE BangPatterns #-}
import Control.Concurrent

main = do print "start"
          loop 1000
          print "done"
  where
    loop :: Int -> IO ()
    loop !n =
      if n == 0
        then return ()
        else do threadDelay 1000
                loop (n-1)
takes about 10 seconds to run on two of my machines, though on other machines it takes about 1 second, as expected. (I compiled both of the above programs with the '-threaded' flag.) A ThreadScope screenshot of the slow case shows activity only once every 10 milliseconds; a second screenshot, from one of the machines on which the program takes 1 second total, shows no such pattern.
A similar C program:
#include <unistd.h>
#include <stdio.h>

int main() {
    int i;
    for (i = 1; i < 1000; i++) {
        printf("%i\n", i);
        usleep(1000);
    }
    return 0;
}
does the right thing, i.e. running 'time ./a.out' gives output like:
1
2
...
999
real 0m1.080s
user 0m0.000s
sys 0m0.020s
Has anyone encountered this problem before, and if so, how can it be fixed? I am running GHC 7.2.1 for Linux (x86_64) on all of my machines, under various versions of Ubuntu. It behaves badly on Ubuntu 10.04.2, but fine on 11.04.

threadDelay is not an accurate timer. It promises that your thread will sleep for at least as long as its argument says it should, but it doesn't promise anything more than that. If you want something to happen periodically, you will have to use something else. (I'm not sure what, but possibly Unix's realtime alarm signal would work for you.)
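If you do need a periodic tick without cumulative drift, one workaround is to compute every delay from the wall clock rather than sleeping a fixed amount. Here is a minimal sketch (my illustration, not part of the original answer; the helper name periodic is made up):

import Control.Concurrent (threadDelay)
import Control.Monad (forM_)
import Data.Time

-- Run 'body' n times, aiming for one run per 'interval', computing every
-- delay from the wall clock so that oversleeping does not accumulate.
periodic :: NominalDiffTime -> Int -> IO () -> IO ()
periodic interval n body = do
  start <- getCurrentTime
  forM_ [1 .. n] $ \k -> do
    body
    now <- getCurrentTime
    let target = addUTCTime (interval * fromIntegral k) start
        usecs  = floor (toRational (diffUTCTime target now) * 1000000) :: Int
    if usecs > 0 then threadDelay usecs else return ()

main :: IO ()
main = periodic 0.001 1000 (return ())  -- 1000 ticks, 1 ms apart (best effort)

Each tick is scheduled relative to the original start time, so a late wakeup shortens the next delay instead of pushing every later tick back.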

I suspect you forgot to compile with the '-threaded' option. (I did that once for 6.12.3, and consistently had 30 millisecond thread delays.)

As noted above, threadDelay only makes one guarantee: you'll wait at least as long as you request. Beyond that, Haskell's runtime does not obtain special cooperation from the OS; it is best effort on the OS's part.
It might be worth benchmarking your results for threadDelays. For example:
module Main where

import Control.Concurrent
import Data.Time

time op =
  getCurrentTime >>= \t0 ->
  op >>
  getCurrentTime >>= \tf ->
  return $! diffUTCTime tf t0

main :: IO ()
main =
  let action tm = time (threadDelay tm) >>= putStrLn . show in
  mapM action [2000, 5000, 10000, 20000, 30000, 40000, 50000] >>
  return ()
On my Windows box, this gives me:
0.0156098s
0.0156098s
0.0156098s
0.0312196s
0.0312196s
0.0468294s
0.0624392s
This suggests that the combination of threadDelay and getCurrentTime has a resolution of 15.6 milliseconds. When I loop threadDelay 1000 a thousand times, I end up waiting 15.6 seconds, so that is just the minimum wait for a thread.
On my Ubuntu box (11.04, with kernel 2.6.38-11), I get much greater precision (~100 µs).
You might be able to avoid the timing problem by keeping the program busier, so the OS doesn't context-switch away. Either way, I suggest you not use threadDelay for timing, or at least check the clock and perform any pending operations up to the given instant.
A high-precision sleep via C might work for you, if you are willing to muck with the FFI, but the cost is that you'll need to use bound threads (at least for your timer).
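As an illustration of that last suggestion, here is a minimal sketch (my code, untested scaffolding rather than a vetted timer): it calls C's usleep through the FFI from a bound thread created with forkOS, so the sleep runs on a dedicated OS thread. This needs the '-threaded' runtime:

{-# LANGUAGE ForeignFunctionInterface #-}
import Control.Concurrent (forkOS)
import Control.Concurrent.MVar
import Foreign.C.Types (CInt, CUInt)

-- usleep(3) from the C library; a "safe" call so it doesn't block the RTS.
foreign import ccall safe "unistd.h usleep"
  c_usleep :: CUInt -> IO CInt

main :: IO ()
main = do
  done <- newEmptyMVar
  _ <- forkOS $ do        -- bound thread: runs on its own OS thread
    _ <- c_usleep 1000    -- ~1 ms sleep in C, outside threadDelay's timer
    putMVar done ()
  takeMVar done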

Related

In Haskell, can syncing hardware clock while retrieving the time cause a serious problem?

Let's say I have this:
main = do
  void (forkIO (forever $ threadDelay 10000 >> callCommand "hwclock --systohc"))
  let loop = do
        mytime <- getCurrentTime
        if mytime > fromJust (parseISO8601 "2021-12-10T13:10:09+0000")
          then print "hello"
          else threadDelay 100 >> loop
  loop
Is there any serious issue that can happen here relating to the time potentially being synced at the same time as when I request the time with getCurrentTime?

Specify millisecond-speed of an infinite loop

My main function has one infinite loop and I'd like to execute each loop of it every 100 millisecond. I know it's done by some concurrent or parallel method, but I've never done such things before and have no idea even where to start from. How would you implement such function?
Assuming your loop body takes negligible time, just use threadDelay from Control.Concurrent:
import Control.Concurrent
import Control.Monad (forever)

main = forever $ do
  mainBody
  threadDelay (100 * 1000) -- value in microseconds
Update: To account for the time of your loop body, use this:
import Data.Time.Clock
import Control.Concurrent
import Control.Monad

mainBody :: IO ()
mainBody = putStrLn "hi"

main = forever $ do
  start <- getCurrentTime
  mainBody
  end <- getCurrentTime
  let diff  = diffUTCTime end start
      usecs = floor (toRational diff * 1000000) :: Int
      delay = 100 * 1000 - usecs
  if delay > 0
    then threadDelay delay
    else return ()
Haskell's threads are lightweight, so a quick solution is to fork on each cycle. You end up using the main thread as a manager of worker threads, which ensures that a worker gets spawned every 100 milliseconds.
import Control.Concurrent
import Control.Monad (forever)

main =
  forever $ do
    forkIO $ loopCycle
    threadDelay $ 100 * 10^3
In case you care about exceptions not getting lost, and getting re-raised in the main thread instead, I recommend taking a look at the "slave-thread" package. Actually, I'd recommend using that package instead of forkIO and friends by default, but then I'm the author, so I might be subjective.
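For example, the forking loop above could look like this with that package (a sketch based on my reading of its docs; SlaveThread.fork is assumed to have a forkIO-like signature):

import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import qualified SlaveThread

main :: IO ()
main = forever $ do
  -- unlike forkIO, exceptions in the worker get re-thrown in this thread
  _ <- SlaveThread.fork loopCycle
  threadDelay $ 100 * 10^3

loopCycle :: IO ()
loopCycle = putStrLn "Implement me!"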
Also note that the forking solution above might cause an accumulation of worker threads if loopCycle too often takes longer than 100 milliseconds to execute. To protect against such a scenario, you can implement a strategy in the manager thread which ensures that the number of active workers is limited. Following is how such a strategy could be implemented:
-- From the "SafeSemaphore" package
import qualified Control.Concurrent.SSem as Sem
main =
manager 12 (100 * 10^3) $ putStrLn "Implement me!"
manager :: Int -> Int -> IO () -> IO ()
manager limit delay worker =
do
sem <- Sem.new limit
forever $ do
forkIO $ Sem.withSem sem $ worker
threadDelay delay
You could use sleep to pause the loop at the end of every iteration for 100 milliseconds. https://www.haskell.org/hoogle/?q=sleep
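Note that sleep has only whole-second granularity; its sibling usleep, from the same System.Posix.Unistd module in the unix package, takes microseconds. A sketch of the 100 ms loop with it (my example; the loop body is a stand-in):

import Control.Monad (forever)
import System.Posix.Unistd (usleep)

main :: IO ()
main = forever $ do
  putStrLn "tick"      -- stand-in for the real loop body
  usleep (100 * 1000)  -- 100 ms, expressed in microseconds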

Importing 'C' Delay function into Haskell using FFI

There is a function in the wiringPi C library called delay with the type
void delay(unsigned int howLong);
This function delays execution for howLong milliseconds. I wrote binding code in Haskell to call this function. The Haskell code is as follows:
foreign import ccall "wiringPi.h delay" c_delay :: CUInt -> IO ()
hdelay :: Int -> IO ()
hdelay howlong = c_delay (fromIntegral howlong)
After this, I wrote a simple Haskell program to call this function:
-- after importing the relevant libraries:
main = wiringPiSetup
    >> delay 5000
But the delay does not happen; rather, the executable generated by GHC exits right away.
Could someone tell me what could possibly go wrong here? A small nudge in the right direction would help.
Cheers and Regards.
Please ignore the part in block quote and see the update below - I am preserving the original non-solution because of the comments associated with it.

> You should mark the import as unsafe, since you want the main thread to
> block while the function is executing (see the comment below by @carl).
> By default, an import is safe, not unsafe. So, changing the function
> signature to this should make the main thread block:
>
> foreign import ccall unsafe "wiring.h delay" c_delay :: CUInt -> IO ()
Also, if you plan to write multi-threaded code, the GHC docs on the multi-threaded FFI are very useful. This also seems like a good starter.
Update
The behavior seems to be due to signal interrupt handling (if I recall correctly, this was added in GHC 7.4+ to fix some bugs). More details here:
http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/Signals
Please note the comment on the above page: Signal handling differs between the threaded version of the runtime and the non-threaded version.
Approach 1 - Handle signal interrupt in FFI code:
Toy code which handles the interrupt during the sleep is below. I tested it on Linux 2.6.18 with GHC 7.6.1.
C code:
/** ctest.c **/
#include <unistd.h>
#include <stdio.h>
#include <time.h>

unsigned delay(unsigned sec)
{
    struct timespec req = {0};
    req.tv_sec = sec;
    req.tv_nsec = 0;
    while (nanosleep(&req, &req) == -1) {
        printf("Got interrupt, continuing\n");
        continue;
    }
    return 1;
}
Haskell code:
{-# LANGUAGE ForeignFunctionInterface #-}
-- Filename: Test.hs
module Main (main) where

import Foreign.C.Types

foreign import ccall safe "delay" delay :: CUInt -> IO CUInt

main = do
  putStrLn "Sleeping"
  n <- delay 2
  putStrLn $ "Got return code from sleep: " ++ show n
Now, after compiling with GHC 7.6.1 (command: ghc Test.hs ctest.c), it waits until the sleep finishes, and prints a message every time it gets an interrupt signal during the sleep:
./Test
Sleeping
Got interrupt, continuing
Got interrupt, continuing
Got interrupt, continuing
Got interrupt, continuing
....
....
Got return code from sleep: 1
Approach 2 - Disable SIGVTALRM before calling FFI code, then re-enable it:
I am not sure what the implications of disabling SIGVTALRM are. This is an alternative approach, which disables SIGVTALRM during the FFI call in case you can't alter the FFI code. The FFI code is then not interrupted during the sleep (assuming it is SIGVTALRM that is causing the interrupts).
{-# LANGUAGE ForeignFunctionInterface #-}
-- Test.hs
module Main (main) where

import Foreign.C.Types
import System.Posix.Signals

foreign import ccall safe "delay" delay :: CUInt -> IO CUInt

main = do
  putStrLn "Sleeping"
  -- Block SIGVTALRM temporarily to avoid interrupts while sleeping
  blockSignals $ addSignal sigVTALRM emptySignalSet
  n <- delay 2
  putStrLn $ "Got return code from sleep: " ++ show n
  -- Unblock SIGVTALRM
  unblockSignals $ addSignal sigVTALRM emptySignalSet
  return ()

Can I speed up this Haskell algorithm?

I've got this Haskell file, compiled with ghc -O2 (GHC 7.4.1), and it takes 1.65 sec on my machine:
import Data.Bits

main = do
  print $ length $ filter (\i -> i .&. (shift 1 (i `mod` 4)) /= 0) [0..123456789]
The same algorithm in C, compiled with gcc -O2 (gcc 4.6.3), runs in 0.18 sec.
#include <stdio.h>

void main() {
    int count = 0;
    const int max = 123456789;
    int i;
    for (i = 0; i < max; ++i)
        if ((i & (1 << i % 4)) != 0)
            ++count;
    printf("count: %d\n", count);
}
Update
I thought it might be the Data.Bits stuff going slow, but surprisingly, if I remove the shifting and just do a straight mod, it actually runs slower, at 5.6 seconds!
import Data.Bits

main = do
  print $ length $ filter (\i -> (i `mod` 4) /= 0) [0..123456789]
whereas the equivalent C runs slightly faster at 0.16 sec:
#include <stdio.h>

void main() {
    int count = 0;
    const int max = 123456789;
    int i;
    for (i = 0; i < max; ++i)
        if ((i % 4) != 0)
            ++count;
    printf("count: %d\n", count);
}
The two pieces of code do very different things.
import Data.Bits

main = do
  print $ length $ filter (\i -> i .&. (shift 1 (i `mod` 4)) /= 0) [0..123456789]
creates a list of 123456790 Integers (lazily). For each one, it takes the remainder modulo 4, which involves first a check whether the Integer is small enough to fit in a raw machine integer, then a sign check after the division, since mod returns non-negative results only (though in ghc-7.6.1 there is a primop for that, so using mod is not as much of a brake as it used to be). It then shifts the Integer 1 left the appropriate number of bits, which involves a conversion to "big" Integers and a call to GMP; takes the bitwise and with i, yet another call to GMP; and checks whether the result is 0, which causes another call to GMP or a conversion to a small integer (I'm not sure what GHC does here). Then, if the result is nonzero, a new list cell is created with that Integer in it, to be consumed by length. That's a lot of work, most of it unnecessarily complicated due to the defaulting of unspecified number types to Integer.
The C code
#include <stdio.h>

int main(void) {
    int count = 0;
    const int max = 123456789;
    int i;
    for (i = 0; i < max; ++i)
        if ((i & (1 << i % 4)) != 0)
            ++count;
    printf("count: %d\n", count);
    return 0;
}
(I took the liberty of fixing the return type of main), does much, much less. It takes an int, compares it to another; if smaller, it takes the bitwise and of the first int with 3 (1), shifts the int 1 to the left the appropriate number of bits, takes the bitwise and of that and the first int, and, if the result is nonzero, increments another int, then increments the first. Those are all machine ops, working on raw machine types.
If we translate that code to Haskell,
module Main (main) where

import Data.Bits

maxNum :: Int
maxNum = 123456789

loop :: Int -> Int -> Int
loop acc i
  | i < maxNum  = loop (if i .&. (1 `shiftL` (i .&. 3)) /= 0 then acc + 1 else acc) (i + 1)
  | otherwise   = acc

main :: IO ()
main = print $ loop 0 0
we get a much closer result:
C, gcc -O3:
count: 30864196
real 0m0.180s
user 0m0.178s
sys 0m0.001s
Haskell, ghc -O2:
30864196
real 0m0.247s
user 0m0.243s
sys 0m0.003s
Haskell, ghc -O2 -fllvm:
30864196
real 0m0.144s
user 0m0.140s
sys 0m0.003s
GHC's native code generator isn't a particularly good loop optimiser, so using the llvm backend makes a big difference here, but even the native code generator doesn't do too badly.
Okay, in the code above I did by hand the optimisation of replacing a modulus by a power of two with a bitwise and. GHC's native code generator doesn't do that (yet), so with `rem 4` instead of `.&. 3` the native code generator produces code that takes (here) 1.42 seconds to run, but the llvm backend does perform that optimisation, and produces the same code as the hand-optimised version.
Now, let us turn to gspr's question
> While LLVM didn't have a massive effect on the original code, it really did on the modified (I'd love to learn why...).
Well, the original code used Integer and lists; llvm doesn't know very well what to do with these, so it can't transform that code into loops. The modified code uses Int and the vector package, which rewrites the code to loops, so llvm does know how to optimise that well, and it shows.
(1) Assuming a normal binary computer. That optimisation is done by ordinary C compilers even without any optimisation flag, except on the very rare platforms where a div instruction is faster than a shift.
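The identity behind that footnote is easy to check from Haskell (a throwaway verification of my own):

import Data.Bits ((.&.))

-- For non-negative i, i `mod` 4 equals i .&. 3: a power-of-two modulus
-- is just a bit mask on the low bits.
main :: IO ()
main = print $ all (\i -> i `mod` 4 == i .&. 3) [0 .. 10000 :: Int]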
Few things beat a hand-written loop with a strict accumulator:
{-# LANGUAGE BangPatterns #-}
import Data.Bits

f :: Int -> Int
f n = g 0 0
  where g !i !s | i <= n    = g (i+1) (if i .&. (unsafeShiftL 1 (i `rem` 4)) /= 0 then s+1 else s)
                | otherwise = s

main = print $ f 123456789
In addition to the tricks mentioned so far, this also replaces shift with unsafeShiftL, which doesn't check its argument.
Compiled with -O2 and -fllvm, this is about 13x faster than the original on my machine.
Note: Testing if bit i of x is set can be written more clearly as x `testBit` i. This produces the same assembly as the above.
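For reference, here is the same loop written with testBit (my transcription of that suggestion):

{-# LANGUAGE BangPatterns #-}
import Data.Bits

f :: Int -> Int
f n = g 0 0
  where g !i !s | i <= n    = g (i+1) (if i `testBit` (i `rem` 4) then s+1 else s)
                | otherwise = s

main = print $ f 123456789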
Vector instead of list, fold instead of filter-and-length
Substituting an unboxed vector for the list, and a fold (i.e. incrementing a counter) for the filter-and-length, improves the time significantly for me. Here's what I used:
import qualified Data.Vector.Unboxed as UV
import Data.Bits

foo :: Int
foo = UV.foldl (\s i -> if i .&. (shift 1 (i `rem` 4)) /= 0 then s+1 else s) 0
               (UV.enumFromN 0 123456789)

main = print foo
The original code (with two changes, though: rem instead of mod, as suggested in the comments, and an Int annotation in the signature to avoid Integer) gave:
$ time ./orig
30864196
real 0m2.159s
user 0m2.144s
sys 0m0.008s
The modified code above gave:
$ time ./new
30864196
real 0m1.450s
user 0m1.440s
sys 0m0.004s
LLVM
While LLVM didn't have a massive effect on the original code, it really did on the modified (I'd love to learn why...).
Original (LLVM):
$ time ./orig-llvm
30864196
real 0m2.047s
user 0m2.036s
sys 0m0.008s
Modified (LLVM):
$ time ./new-llvm
30864196
real 0m0.233s
user 0m0.228s
sys 0m0.004s
For comparison, OP's original C code comes in at 0m0.152s user on my system.
This is all GHC 7.4.1, GCC 4.6.3, and vector 0.9.1. LLVM is either 2.9 or 3.0; I have both but can't seem to figure out which one GHC is actually using.
Try this:
import Data.Bits

main = do
  print $ length $ filter (\i -> i .&. (shift 1 (i `rem` 4)) /= 0) [0..123456789::Int]
Without the ::Int, the type defaults to ::Integer.
rem does the same as mod on positive values, and it is the same as % in C. mod, on the other hand, is mathematically correct on negative values, but slower.
int in C is 32 bits wide.
Int in Haskell is either 32 or 64 bits wide, like long in C.
Integer is an arbitrary-precision integer: it has no min/max values, and its memory footprint depends on its value (similar to a string).
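A quick demonstration of both points (my example; the Int output assumes a 64-bit machine):

main :: IO ()
main = do
  print ((-7) `mod` 4)        -- 1: mod takes the sign of the divisor
  print ((-7) `rem` 4)        -- -3: rem takes the sign of the dividend, like C's %
  print (maxBound :: Int)     -- 9223372036854775807 on 64-bit
  print (maxBound + 1 :: Int) -- Int silently wraps around to minBound
  print (2 ^ 70 :: Integer)   -- Integer never wraps; it just grows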

Runtime performance degradation for C FFI Callback when pthreads are enabled

I am curious about the behavior of the GHC runtime with the threaded option when C FFI code calls back into a Haskell function. I wrote code to measure the overhead of a basic function callback (below). While function callback overhead has been discussed before, I am curious about the sharp increase in total time I observed when multi-threading is enabled in the C code, even though the total number of calls into Haskell remains the same. In my test, I called the Haskell function f 5M times using two scenarios (GHC 7.0.4, RHEL, 12-core box; runtime options below, after the code):
Single thread in the C create_threads function: call f 5M times - total time 1.32s
5 threads in the C create_threads function: each thread calls f 1M times, so the total is still 5M calls - total time 7.79s
Code below - the Haskell code as written does the single-threaded C callback test; comments explain how to update it for the 5-thread test:
t.hs:
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Storable as SV
import Control.Monad (mapM, mapM_)
import Foreign.Ptr (Ptr, FunPtr, freeHaskellFunPtr)
import Foreign.C.Types (CInt)

f :: CInt -> ()
f x = ()

-- A "wrapper" import converts a Haskell function into a foreign function pointer
foreign import ccall "wrapper"
  wrap :: (CInt -> ()) -> IO (FunPtr (CInt -> ()))

foreign import ccall safe "mt.h create_threads"
  createThreads :: Ptr (FunPtr (CInt -> ())) -> Ptr CInt -> CInt -> IO ()

main = do
  -- set threads = [1..5], l = 1000000 for multi-threaded FFI callback testing
  let threads = [1..1]
      l = 5000000
      vl = SV.replicate (length threads) (fromIntegral l) -- a vector of l's
  lf <- mapM (\x -> wrap f) threads -- wrap f into a FunPtr, once per thread
  let vf = SV.fromList lf -- vector of FunPtrs to f
  -- Pass the vector of function pointers and the vector of counts to
  -- create_threads, which spawns one pthread per entry; each pthread calls
  -- back f l times, so we can check the overhead.
  SV.unsafeWith vf $ \x ->
    SV.unsafeWith vl $ \y -> createThreads x y (fromIntegral $ SV.length vl)
  SV.mapM_ freeHaskellFunPtr vf
mt.h:
#include <pthread.h>
#include <stdio.h>

typedef void (*FunctionPtr)(int);

/* Struct for passing arguments to a thread */
typedef struct threadArgs {
    int threadId;
    FunctionPtr fn;
    int length;
} threadArgs;

/* This is our thread function. It is like main(), but for a thread. */
void *threadFunc(void *arg);
void create_threads(FunctionPtr*, int*, int);
mt.c:
#include "mt.h"
/* This is our thread function. It is like main(), but for a thread*/
void *threadFunc(void *arg)
{
FunctionPtr fn;
threadArgs args = *(threadArgs*) arg;
int id = args.threadId;
int length = args.length;
fn = args.fn;
int i;
for (i=0; i < length;){
fn(i++); //call haskell function
}
}
void create_threads(FunctionPtr* fp, int* length, int numThreads )
{
pthread_t pth[numThreads]; // this is our thread identifier
threadArgs args[numThreads];
int t;
for (t=0; t < numThreads;){
args[t].threadId = t;
args[t].fn = *(fp + t);
args[t].length = *(length + t);
pthread_create(&pth[t],NULL,threadFunc,&args[t]);
t++;
}
for (t=0; t < numThreads;t++){
pthread_join(pth[t],NULL);
}
printf("All threads terminated\n");
}
Compilation (GHC 7.0.4, gcc 4.4.3 in case it is used by ghc):
$ ghc -O2 t.hs mt.c -lpthread -threaded -rtsopts -optc-O2
Running with 1 thread in create_threads (the code above will do that) - I turned off the parallel GC for testing:
$ ./t +RTS -s -N5 -g1
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.04s ( 1.05s elapsed)
GC time 0.28s ( 0.28s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.32s ( 1.34s elapsed)
%GC time 21.1% (21.2% elapsed)
Running with 5 threads (see the first comment in the main function of t.hs above for how to edit it for 5 threads):
$ ./t +RTS -s -N5 -g1
INIT time 0.00s ( 0.00s elapsed)
MUT time 7.42s ( 2.27s elapsed)
GC time 0.36s ( 0.37s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 7.79s ( 2.63s elapsed)
%GC time 4.7% (13.9% elapsed)
I would appreciate insight into why the performance degrades with multiple pthreads in create_threads. I first suspected the parallel GC, but I turned it off for the test above. The MUT time also goes up sharply with multiple pthreads, given the same runtime options, so it is not just GC.
Also, are there any improvements in GHC 7.4.1 for this kind of scenario?
I don't plan to call back into Haskell from the FFI that often, but it helps to understand the above issue when designing multi-threaded Haskell/C library interaction.
I believe the key question here is: how does the GHC runtime schedule C callbacks into Haskell? Although I don't know for certain, my suspicion is that all C callbacks are handled by the Haskell thread that originally made the foreign call, at least up to ghc-7.2.1 (which I'm using).
This would explain the large slowdown you (and I) see when moving from 1 thread to 5. If the five threads are all calling back into the same Haskell thread, there will be significant contention on that Haskell thread to complete all the callbacks.
In order to test this, I modified your code so that Haskell forks a new thread before calling create_threads, and create_threads only spawns one thread per call. If I'm correct, each OS thread will have a dedicated Haskell thread to perform work, so there should be much less contention. Although this still takes almost twice as long as the single-thread version, it's significantly faster than the original multi-threaded version, which lends some evidence to this theory. The difference is much less if I turn off thread migration with +RTS -qm.
As Daniel Fischer reports different results for ghc-7.2.2, I would expect that version changes how Haskell schedules callbacks. Maybe somebody on the ghc-users list can provide more information on this; I don't see anything likely in the release notes for 7.2.2 or 7.4.1.
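Here is a sketch of the modification described above (my reconstruction, not the answerer's exact code): the Haskell side forks one thread per pthread, and each forked thread calls create_threads with a single worker, so every OS thread's callbacks are serviced by a dedicated Haskell thread. f and the foreign imports mirror t.hs above:

{-# LANGUAGE ForeignFunctionInterface #-}
import qualified Data.Vector.Storable as SV
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import Control.Monad (forM, forM_)
import Foreign.Ptr (Ptr, FunPtr, freeHaskellFunPtr)
import Foreign.C.Types (CInt)

f :: CInt -> ()
f _ = ()

foreign import ccall "wrapper"
  wrap :: (CInt -> ()) -> IO (FunPtr (CInt -> ()))

foreign import ccall safe "mt.h create_threads"
  createThreads :: Ptr (FunPtr (CInt -> ())) -> Ptr CInt -> CInt -> IO ()

main :: IO ()
main = do
  dones <- forM [1 .. 5 :: Int] $ \_ -> do
    done <- newEmptyMVar
    _ <- forkIO $ do
      fp <- wrap f
      SV.unsafeWith (SV.singleton fp) $ \x ->
        SV.unsafeWith (SV.singleton 1000000) $ \y ->
          createThreads x y 1            -- one pthread per call
      freeHaskellFunPtr fp
      putMVar done ()
    return done
  forM_ dones takeMVar                   -- wait for all five workers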
