I want to know how long it takes my program to read a 12.9MB .wav file into memory. The function that reads a file into memory looks as follows:
import qualified Data.ByteString as BS
getSamplesFromFileAsBS :: FilePath -> IO (BS.ByteString)
It takes the name of the file and returns the samples as a ByteString. It also performs some other validity checks on the data and ignores the header information. I read the ByteString of samples into memory using ByteString.hGet.
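For context, a minimal sketch of the general shape such a function might take (this is an assumption on my part, not the actual implementation: it assumes a canonical 44-byte WAV header and omits the validity checks mentioned above):

import qualified Data.ByteString as BS
import System.IO (IOMode (ReadMode), hFileSize, withBinaryFile)

-- Hypothetical sketch: skip a fixed 44-byte canonical WAV header and read
-- the remaining bytes as samples with BS.hGet.  Header validation omitted.
getSamplesFromFileAsBS :: FilePath -> IO BS.ByteString
getSamplesFromFileAsBS fp = withBinaryFile fp ReadMode $ \h -> do
  size    <- hFileSize h
  _header <- BS.hGet h 44
  BS.hGet h (fromIntegral size - 44)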
If I now benchmark this function with a 12.9MB file, using Criterion:
bencher :: FilePath -> IO ()
bencher fp = defaultMain [
    bench "Reading all the samples from a file." $ nfIO (getSamplesFromFileAsBS fp)
  ]
I get the following result:
benchmarking Reading all the samples from a file.
time 3.617 ms (3.520 ms .. 3.730 ms)
0.989 R² (0.981 R² .. 0.994 R²)
mean 3.760 ms (3.662 ms .. 3.875 ms)
std dev 354.0 μs (259.9 μs .. 552.5 μs)
variance introduced by outliers: 62% (severely inflated)
It seems to load 12.9MB into memory in 3.617ms. This doesn't seem realistic since it indicates that my SSD can read 3+GB/s, which is not the case at all. What am I doing wrong?
I decided to try this another (more naive) way, by manually measuring the time difference:
import Control.DeepSeq (deepseq)
import Data.Time.Clock (getCurrentTime, diffUTCTime)

runBenchmarks :: FilePath -> IO ()
runBenchmarks fp = do
  start     <- getCurrentTime
  samplesBS <- getSamplesFromFileAsBS fp
  end       <- samplesBS `deepseq` getCurrentTime
  print (diffUTCTime end start)
This gives me the following result: 0.023105s. This is realistic because it would mean my SSD can read at a speed of around 600MB/s. What is wrong with the Criterion result?
I looked at the visual results of my Criterion benchmark by writing the output to an HTML file. I could clearly see that the first run took around 0.020s, whereas the rest (after caching) took around 0.003s.
So I'm getting these results because of caching: after the first iteration the file is served from the OS page cache, and Criterion's summary reflects the cached reads.
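A quick way to confirm this without Criterion is to time several consecutive reads of the same file; only the first one should pay the disk cost. A rough sketch, reusing getSamplesFromFileAsBS and the manual timing approach above:

import Control.DeepSeq (deepseq)
import Control.Monad (forM_)
import Data.Time.Clock (diffUTCTime, getCurrentTime)

-- Time five consecutive reads of the same file.  The first iteration reads
-- from the SSD; the rest are typically served from the OS page cache.
timeRepeatedReads :: FilePath -> IO ()
timeRepeatedReads fp = forM_ [1 .. 5 :: Int] $ \i -> do
  start <- getCurrentTime
  bs    <- getSamplesFromFileAsBS fp
  end   <- bs `deepseq` getCurrentTime
  putStrLn $ "run " ++ show i ++ ": " ++ show (diffUTCTime end start)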
Related
I was trying to bench parMap vs map with a very simple example:
import Control.Parallel.Strategies
import Criterion.Main

sq x = x^2

a = whnf sum $ map sq [1..1000000]
b = whnf sum $ parMap rseq sq [1..1000000]

main = defaultMain [
    bench "1" a,
    bench "2" b
  ]
My results seem to indicate zero speedup from parMap and I was wondering why this might be?
benchmarking 1
Warning: Couldn't open /dev/urandom
Warning: using system clock for seed instead (quality will be lower)
time 177.7 ms (165.5 ms .. 186.1 ms)
0.997 R² (0.992 R² .. 1.000 R²)
mean 185.1 ms (179.9 ms .. 194.1 ms)
std dev 8.265 ms (602.3 us .. 10.57 ms)
variance introduced by outliers: 14% (moderately inflated)
benchmarking 2
time 182.7 ms (165.4 ms .. 199.5 ms)
0.993 R² (0.976 R² .. 1.000 R²)
mean 189.4 ms (181.1 ms .. 195.3 ms)
std dev 8.242 ms (5.896 ms .. 10.16 ms)
variance introduced by outliers: 14% (moderately inflated)
The problem is that parMap sparks a parallel computation for each individual list element. It doesn't chunk the list at all, as you seem to think from your comments; that would require the parListChunk strategy.
Sparks have non-trivial overhead, and each spark here does nothing more than square a single number, so the useful work is swamped by the bookkeeping. A chunked version is sketched below.
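For illustration, a rough sketch of a chunked benchmark; the chunk size of 10,000 is an arbitrary guess that needs tuning, and you still have to run with +RTS -N to use more than one core:

import Control.Parallel.Strategies (parListChunk, rseq, using)
import Criterion.Main (bench, defaultMain, whnf)

sq :: Int -> Int
sq x = x ^ (2 :: Int)

-- Evaluate the mapped list in parallel chunks of 10,000 elements each,
-- so every spark does a meaningful amount of work.
sumChunked :: [Int] -> Int
sumChunked xs = sum (map sq xs `using` parListChunk 10000 rseq)

main :: IO ()
main = defaultMain
  [ bench "sequential"   $ whnf (sum . map sq) [1 .. 1000000]
  , bench "parListChunk" $ whnf sumChunked     [1 .. 1000000]
  ]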
TL;DR: I'm working on a piece of code which generates a (long) array of numbers. I'm able to generate this array, convert it to a List and then calculate the maximum (using a strict left fold). BUT, I run into memory issues when I try to convert the list to a Sequence prior to calculating the maximum. This is quite counter-intuitive to me.
My question: Why is this happening and what is the correct approach for converting the data to a Sequence structure?
Background:
I'm working on a problem which I've chosen to tackle using three steps (below).
Note: I'm intentionally keeping the problem statement vague so this post doesn't serve as a hint.
Anyway, my proposed approach:
(i) First, generate a long list of integers: the number of factors for each integer from 1 to 100 million (NOT the factors themselves, just how many there are).
(ii) Second, convert this list into a Sequence.
(iii) Lastly, use an efficient sliding-window maximum algorithm to calculate my answer (this step requires deque operations, hence the need for a Sequence).
(Again, the specifics of the problem aren't that relevant as I'm just curious as to why I'm running into this particular issue in the first place.)
What I've done so far
Step 1 was fairly straightforward - see output below (full code is included at the bottom). I just brute-force a sieve using an unboxed array and the accumArray function, nothing fancy. Note: I've used this same algorithm to solve a number of other such problems, so I'm reasonably confident that it's giving the right answer.
For the purposes of showing execution time / memory-usage stats, I've (admittedly arbitrarily) chosen to calculate the maximum element in the resulting array - the idea is simply to use a function which forces construction of all elements of the Array, thereby ensuring that we see meaningful stats for exec time / memory-usage.
The following main function...
main = print $ maximum' $ elems (sieve (10^8))
...results in the following (i.e., it says that the number below 100 million with the most divisors has a total of 768 divisors):
Linking maxdivSO ...
768
33.73s user 70.80s system 99% cpu 1:44.62 total
344,214,504,640 bytes allocated in the heap
58,471,497,320 bytes copied during GC
200,062,352 bytes maximum residency (298 sample(s))
3,413,824 bytes maximum slop
386 MB total memory in use (0 MB lost due to fragmentation)
%GC time 24.7% (30.5% elapsed)
The problem
It seems like we can accomplish the first step without breaking a sweat, since I've allocated a total of 5 GB to my VirtualBox and the above code uses less than 400 MB (for reference, I've seen programs execute successfully and report using 3+ GB of total memory). In other words, it seems like we've accomplished Step 1 with plenty of headroom.
So I'm a bit surprised as to why the following version of the main function fails. We attempt to perform the same calculation of the maximum but after converting the list of integers to a Sequence. The following code...
main = print $ maximum' $ fromList $ elems (sieve (10^8))
...results in the following:
Linking maxdivSO ...
maxdivSO: out of memory (requested 2097152 bytes)
39.48s user 76.35s system 99% cpu 1:56.03 total
My question: Why does the algorithm (as currently written) run out of memory if we try to convert the list to a Sequence? And how might I go about successfully converting this list into a Sequence?
(I'm not one to stubbornly stick to brute-force for these types of problems - but I have a strong suspicion that this particular issue is due to my not being able to reason well about evaluation.)
The code itself:
{-# LANGUAGE NoMonomorphismRestriction #-}

import Data.Word (Word32, Word16)
import Data.Foldable (Foldable, foldl')
import Data.Array.Unboxed (UArray, accumArray, elems)
import Data.Sequence (fromList)

main :: IO ()
main = print $ maximum' $ elems (sieve (10^8))               -- <-- works
--main = print $ maximum' $ fromList $ elems (sieve (10^8))  -- <-- doesn't work

maximum' :: (Foldable t, Ord a, Num a) => t a -> a
maximum' = foldl' (\x acc -> if x > acc then x else acc) 0

sieve :: Int -> UArray Word32 Word16
sieve u' = accumArray (+) 2 (1,u) ( (1,-1) : factors )
  where
    u   = fromIntegral u'
    cap = floor $ sqrt (fromIntegral u) :: Word32
    factors = [ (i*d, j) | d <- [2..cap]
                         , i <- [2..(u `quot` d)]
                         , d <= i
                         , let j = if i == d then 1 else 2
                         ]
I think the reason for this is that getting the first element of a sequence requires the full sequence to be constructed in memory (since the internal representation of the sequence is a tree). In the list case, elems yields the elements lazily.
Rather than turning the full array into a sequence, why not make the sequence only as long as your sliding window?
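For illustration, here is a rough sketch of that idea; the names (windowMaxima, the window size k) are mine, not from the original post. The elements are consumed lazily from the list, and the Seq, used as a deque, never holds more than k entries:

import Data.Sequence (ViewL (..), ViewR (..), viewl, viewr, (|>))
import qualified Data.Sequence as Seq

-- Sliding-window maximum: yield the maximum of every window of k consecutive
-- elements.  The deque holds (index, value) pairs with decreasing values, so
-- its front is always the maximum of the current window.
windowMaxima :: Ord a => Int -> [a] -> [a]
windowMaxima k xs = go Seq.empty (zip [0 ..] xs)
  where
    go _  []              = []
    go dq ((i, x) : rest) =
      let dq' = dropStale (i - k) (dropSmaller x dq |> (i, x))
          ms  = go dq' rest
      in if i >= k - 1 then headValue dq' : ms else ms

    -- Drop entries from the back whose values are <= the new element.
    dropSmaller x dq = case viewr dq of
      d :> (_, v) | v <= x -> dropSmaller x d
      _                    -> dq

    -- Drop entries from the front whose indices have left the window.
    dropStale lo dq = case viewl dq of
      (j, _) :< d | j <= lo -> dropStale lo d
      _                     -> dq

    headValue dq = case viewl dq of
      (_, v) :< _ -> v
      EmptyL      -> error "windowMaxima: empty window"

The output of windowMaxima can then be consumed lazily (for example with a strict fold over elems (sieve (10^8))), so the full 100-million-element Sequence is never built.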
I want to make rhythms with Haskell's printf. The following should produce a repeating rhythm in which one note is twice as long as the other two. (That rhythm is encoded by the list [1,1,2].)
import Control.Concurrent
import Text.Printf
import Control.Monad

main = mapM_ note (cycle [1,1,2])

beat = round (10^6 / 4) -- a quarter of a second; threadDelay counts in microseconds

note :: Int -> IO ()
note n = do
  threadDelay $ beat * n
  printf "\BEL\n"
When I run it the long note sounds roughly three times as long as the others, rather than twice. If I speed it up, by changing the number 4 to a 10, the rhythm is destroyed completely: the notes all have the same length.
Is there a refresh rate to change? Is threadDelay not the service to use if I want precise timing?
Is threadDelay not the service to use if I want precise timing?
No, it isn't; the documentation only guarantees a lower bound:
threadDelay :: Int -> IO ()
Suspends the current thread for a given number of microseconds (GHC only).
There is no guarantee that the thread will be rescheduled promptly when the delay has expired, but the thread will never continue to run earlier than specified.
However, on my machine (Win 8.1 x64, i5-3570K @ 3.4 GHz) the rhythm runs fine. That being said, \BEL isn't really a good way to create a beat:
the \BEL sound depends on the operating system (it sounds dreadful in Windows 8 if played at that frequency),
it isn't clear whether \BEL blocks.
If the latter is the case, you end up with notes of roughly the same length, since every \BEL blocks and the threadDelay is shorter than the actual \BEL sound.
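One rough way to probe the blocking hypothesis is to time the write itself; if emitting \BEL regularly takes a sizeable fraction of a beat, the rhythm will suffer regardless of threadDelay's precision. A sketch, not from the original answer:

import Data.Time.Clock (diffUTCTime, getCurrentTime)
import Text.Printf (printf)

-- Time a single \BEL write.  If this takes a substantial fraction of a beat,
-- the output call itself is eating the rhythm.
main :: IO ()
main = do
  t0 <- getCurrentTime
  printf "\BEL\n"
  t1 <- getCurrentTime
  print (diffUTCTime t1 t0)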
The problem appears to have been print, not threading. Rohan Drape at Haskell-Cafe showed me how to use OSC instead of print. The timing of the following test, which uses OSC, is to my ears indistinguishable from perfect. I had it send instructions to a sine wave oscillator in Max/MSP.
import Control.Concurrent
import Control.Monad
import System.IO
import Sound.OSC

main = do
  hSetBuffering stdout NoBuffering
  mapM_ note (cycle [1,1,2])

withMax = withTransport (openUDP "127.0.0.1" 9000)

beat = 60000 -- 60 ms, measured in µs

note :: Int -> IO ()
note n = do
  withMax (sendMessage (Message "sin0 frq 100" []))
    -- set sine wave 0 to frequency 100
  withMax (sendMessage (Message "sin0 amp 1" []))
    -- set sine wave 0 to amplitude 1
  threadDelay $ beat * n
  withMax (sendMessage (Message "sin0 amp 0" []))
    -- set sine wave 0 to amplitude 0
  threadDelay $ beat * n
Thanks, everyone!
Most likely you will have to rely on OS support or GHC internals. For example, I have used GHC.Event for this purpose via the
registerTimeout :: TimerManager -> Int -> TimeoutCallback -> IO TimeoutKey
function. It is asynchronous and callback-based, though, and it is GHC-specific.
Other options are the timer libraries on Hackage. I'm not sure whether they are all portable or usable on Windows, but most rely on OS support, since precise timing ultimately needs hardware timers.
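A minimal sketch of what that can look like (GHC-only; as far as I know the system timer manager requires the threaded RTS, so compile with -threaded):

import Control.Concurrent (threadDelay)
import GHC.Event (getSystemTimerManager, registerTimeout)

main :: IO ()
main = do
  mgr <- getSystemTimerManager
  -- Schedule a callback to fire roughly 250 ms from now;
  -- registerTimeout takes the delay in microseconds.
  _key <- registerTimeout mgr 250000 (putStrLn "tick")
  -- Keep the main thread alive long enough for the callback to run.
  threadDelay 1000000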
I am getting my feet wet writing concurrent programs in Haskell with GHC for multicore machines. As a first step I decided to write a program that reads and writes concurrently to an IOArray. I had the impression that reads and writes to IOArray involve no synchronization. I'm doing this to establish a baseline to compare with the performance of other data structures that do use appropriate synchronization mechanisms. I ran into some surprising results, namely that in many cases I am not getting any speedup at all. This makes me wonder if there is some low-level synchronization happening in the GHC runtime, for example, synchronization and blocking on evaluation of thunks (i.e. "black holes"). Here are the details...
I wrote a couple of variations on a single program. The main idea is a DirectAddressTable data structure, which is simply a wrapper around an IOArray providing insert and lookup methods:
-- file DirectAddressTable.hs
module DirectAddressTable
  ( DAT
  , newDAT
  , lookupDAT
  , insertDAT
  , getAssocsDAT
  )
  where

import Data.Array.IO
import Data.Array.MArray

newtype DAT = DAT (IOArray Int Char)

-- create a fixed size array; missing keys have value '-'.
newDAT :: Int -> IO DAT
newDAT n = do a <- newArray (0, n - 1) '-'
              return (DAT a)

-- lookup an item.
lookupDAT :: DAT -> Int -> IO (Maybe Char)
lookupDAT (DAT a) i = do c <- readArray a i
                         return (if c == '-' then Nothing else Just c)

-- insert an item
insertDAT :: DAT -> Int -> Char -> IO ()
insertDAT (DAT a) i v = writeArray a i v

-- get all associations (exclude missing items, i.e. those whose value is '-').
getAssocsDAT :: DAT -> IO [(Int,Char)]
getAssocsDAT (DAT a) =
  do assocs <- getAssocs a
     return [ (k,c) | (k,c) <- assocs, c /= '-' ]
I then have a main program that initializes a new table, forks some threads, with each thread writing and reading some fixed number of values to the just initialized table. The overall number of elements to write is fixed. The number of threads to use is taken from a command line argument, and the elements to process are evenly divided among the threads.
-- file DirectTableTest.hs
import DirectAddressTable
import Control.Concurrent
import Control.Parallel
import System.Environment

main =
  do args <- getArgs
     let numThreads = read (args !! 0)
     vs <- sequence (replicate numThreads newEmptyMVar)
     a <- newDAT arraySize
     sequence_ [ forkIO (doLotsOfStuff numThreads i a >>= putMVar v)
               | (i,v) <- zip [1..] vs ]
     sequence_ [ takeMVar v >>= \a -> getAssocsDAT a >>= \xs -> print (last xs)
               | v <- vs ]

doLotsOfStuff :: Int -> Int -> DAT -> IO DAT
doLotsOfStuff numThreads i a =
  do let p j c = (c `seq` insertDAT a j c) >>
                 lookupDAT a j >>= \v ->
                 v `pseq` return ()
     sequence_ [ p j c | (j,c) <- bunchOfKeys i ]
     return a
  where bunchOfKeys i = take numElems $ zip cyclicIndices $ drop i cyclicChars
        numElems      = numberOfElems `div` numThreads
        cyclicIndices = cycle [0..highestIndex]
        cyclicChars   = cycle chars
        chars         = ['a'..'z']

-- Parameters
arraySize :: Int
arraySize = 100

highestIndex  = arraySize - 1
numberOfElems = 10 * 1000 * 1000
I compiled this using ghc 7.2.1 (similar results with 7.0.3) with "ghc --make -rtsopts -threaded -fforce-recomp -O2 DirectTableTest.hs".
Running "time ./DirectTableTest 1 +RTS -N1" takes about 1.4 seconds and running "time ./DirectTableTest 2 +RTS -N2" take about 2.0 seconds! Using one more core than worker threads is a little better, with "time ./DirectTableTest 1 +RTS -N1" takes about 1.4 seconds and running "time ./DirectTableTest 1 +RTS -N2" and "time ./DirectTableTest 2 +RTS -N3" both taking about 1.4 seconds.
Running with the "-N2 -s" option shows that productivity is 95.4% and GC is 4.3%. Looking at a run of the program with ThreadScope I don't see anything too alarming. Each HEC yields once per ms when a GC occurs. Running with 4 cores gives a time of about 1.2 seconds, which is at least a little better than 1 core. More cores doesn't improve over this.
I found that changing the array type used in the implementation of DirectAddressTable from IOArray to IOUArray fixes this problem. With this change, the running time of "time ./DirectTableTest 1 +RTS -N1" is about 1.4 seconds whereas the running "time ./DirectTableTest 2 +RTS -N2" is about 1.0 seconds. Increasing to 4 cores gives a run time of 0.55 seconds. Running with "-s" shows a GC time of %3.9 percent. Under ThreadScope I can see that both threads yield every 0.4 ms, more frequently than in the previous program.
Finally, I tried one more variation. Instead of having the threads work on the same shared array, I had each thread work on its own array. This scales nicely (as you would expect), more or less like the second program, with either IOArray or IOUArray implementing the DirectAddressTable data structure.
I understand why IOUArray might perform better than IOArray, but I don't know why it scales better to multiple threads and cores. Does anyone know why this might be happening or what I can do to find out what is going on? I wonder if this problem could be due to multiple threads blocking while evaluating the same thunk and whether it is related to this: http://hackage.haskell.org/trac/ghc/ticket/3838 .
Running "time ./DirectTableTest 1 +RTS -N1" takes about 1.4 seconds and running "time ./DirectTableTest 2 +RTS -N2" take about 2.0 seconds!
I cannot reproduce your results:
$ time ./so2 1 +RTS -N1
(99,'k')
real 0m0.950s
user 0m0.932s
sys 0m0.016s
tommd@Mavlo:Test$ time ./so2 2 +RTS -N2
(99,'s')
(99,'s')
real 0m0.589s
user 0m1.136s
sys 0m0.024s
And this seems to scale as expected as the number of lightweight threads increases too:
ghc -O2 so2.hs -threaded -rtsopts
[1 of 2] Compiling DirectAddressTable2 ( DirectAddressTable2.hs, DirectAddressTable2.o )
[2 of 2] Compiling Main ( so2.hs, so2.o )
Linking so2 ...
tommd@Mavlo:Test$ time ./so2 4
(99,'n')
(99,'z')
(99,'y')
(99,'y')
real 0m1.538s
user 0m1.320s
sys 0m0.216s
tommd@Mavlo:Test$ time ./so2 4 +RTS -N2
(99,'z')
(99,'x')
(99,'y')
(99,'y')
real 0m0.600s
user 0m1.156s
sys 0m0.020s
Do you actually have 2 CPUs? If you run with more GHC threads (-Nx) than you have available CPUs then your results will be very poor. What I think I'm really asking is: are you sure no other CPU intensive processes are running on your system?
As for the IOUArray (regarding your edit):
I understand why IOUArray might perform better than IOArray, but I don't know why it scales better to multiple threads and cores
An unboxed array will be contiguous and thus benefit much more from caching. Boxed values living in arbitrary locations on the heap could cause a large increase in cache invalidations between the cores.
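For reference, the switch the question describes amounts to something like the sketch below; the remaining operations in the module should work unchanged, since IOUArray supports the same MArray interface for Char:

import Data.Array.IO (IOUArray, newArray, readArray, writeArray)

-- Unboxed variant of the table: the Chars are stored contiguously in one
-- flat buffer instead of as pointers to separate heap objects.
newtype DAT = DAT (IOUArray Int Char)

newDAT :: Int -> IO DAT
newDAT n = do a <- newArray (0, n - 1) '-'
              return (DAT a)

lookupDAT :: DAT -> Int -> IO (Maybe Char)
lookupDAT (DAT a) i = do c <- readArray a i
                         return (if c == '-' then Nothing else Just c)

insertDAT :: DAT -> Int -> Char -> IO ()
insertDAT (DAT a) i v = writeArray a i v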
From what I understand, Haskell has green threads. But how lightweight are they? Is it possible to create 1 million threads?
Or how long would it take to create 100,000 threads?
from here.
import Control.Concurrent
import Control.Monad

n = 100000

main = do
    left  <- newEmptyMVar
    right <- foldM make left [0..n-1]
    putMVar right 0    -- bang!
    x <- takeMVar left -- wait for completion
    print x
  where
    make l n = do
      r <- newEmptyMVar
      forkIO (thread n l r)
      return r

thread :: Int -> MVar Int -> MVar Int -> IO ()
thread _ l r = do
  v <- takeMVar r
  putMVar l $! v+1
On my not-quite-2.5 GHz laptop this takes less than a second.
Set n to 1000000 and it becomes hard to write the rest of this post because the OS is paging like crazy; it was definitely using more than a gigabyte of RAM (I didn't let it finish). If you have enough RAM it would certainly work, in roughly 10x the time of the 100000-thread version.
Well, according to here the default stack size is 1 kB, so I suppose in theory it would be possible to create 1,000,000 threads - the stacks would take up around 1 GB of memory.
Using the benchmark here, http://www.reddit.com/r/programming/comments/a4n7s/stackless_python_outperforms_googles_go/c0ftumi
You can improve the performance on a per-benchmark basis by shrinking the thread stack size to one that fits the benchmark. E.g. 1M threads, with a 512-byte stack per thread, takes 2.7s:
$ time ./A +RTS -s -k0.5k
For this synthetic test case, running on the threaded runtime (OS threads) adds significant overhead; working just with green threads on the non-threaded runtime looks like the preferred option. Note that spawning green threads in Haskell is indeed cheap. I've re-run the above program with n = 1000000 on a MacBook Pro (i7, 8 GB of RAM), using:
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.6.3
Compiled with -threaded and -rtsopts:
$ time ./thr
1000000
real 0m5.974s
user 0m3.748s
sys 0m2.406s
Reducing the stack helps a bit:
$ time ./thr +RTS -k0.5k
1000000
real 0m4.804s
user 0m3.090s
sys 0m1.923s
Then, compiled without -threaded:
$ time ./thr
1000000
real 0m2.861s
user 0m2.283s
sys 0m0.572s
And finally, without -threaded and with reduced stack:
$ time ./thr +RTS -k0.5k
1000000
real 0m2.606s
user 0m2.198s
sys 0m0.404s