Worse performance with forConcurrently than sequential - Haskell

I've written a function that up-samples a file from 48kHz to 192kHz by means of a filter:
upsample :: Coefficients -> FilePath -> IO ()
It takes the filter coefficients and the path of the file to be upsampled, and writes the result to a new file.
I have to up-sample many files so I've written a function to up-sample a full directory in parallel, using forConcurrently_ from Control.Concurrent.Async:
upsampleDirectory :: Directory -> FilePath -> IO ()
upsampleDirectory dir coefPath = do
  files <- getAllFilesFromDirectory dir
  coefs <- loadCoefficients coefPath
  forConcurrently_ files $ upsample coefs
I'm compiling with the -threaded option and running using +RTS -N2. What I see is that up-sampling 2 files sequentially is faster than up-sampling both files in parallel.
Upsampling file1.wav takes 18.863s.
Upsampling file2.wav takes 18.707s.
Upsampling a directory with file1.wav and file2.wav takes 66.250s.
What am I doing wrong?
I've tried to keep this post concise, so ask me if you need more details on some of the functions.

Here are a couple of possibilities. First, make yourself 100% sure you're actually running your program with +RTS -N2 -RTS. I can't tell you how many times I've been benchmarking a parallel program and written:
stack exec myprogram +RTS -N2 -RTS
in place of:
stack exec myprogram -- +RTS -N2 -RTS
and gotten myself hopelessly confused. (The first version runs the stack executable on two processors but the target executable on one!) Maybe add a print =<< getNumCapabilities at the beginning of your main program to be sure.
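For reference, a minimal capability check could look like this (just a sketch; getNumCapabilities comes from Control.Concurrent):
import Control.Concurrent (getNumCapabilities)

main :: IO ()
main = do
  caps <- getNumCapabilities
  putStrLn $ "running on " ++ show caps ++ " capabilities"
  -- ... the rest of your program ...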
After confirming you're running on two processors, then the next most likely issue is that your implementation is not running in constant space and is blowing up the heap. Here's a simple test program I used to try to duplicate your problem. (Feel free to use my awesome upsampling filter yourself!)
module Main where

import Control.Concurrent.Async
import System.Environment
import qualified Data.ByteString as B

upsample :: FilePath -> IO ()
upsample fp = do
  c <- B.readFile fp
  let c' = B.pack $ concatMap (replicate 4) $ B.unpack c
  B.writeFile (fp ++ ".out") c'

upsampleFiles :: [FilePath] -> IO ()
upsampleFiles files =
  forConcurrently_ files upsample

main :: IO ()
main = upsampleFiles =<< getArgs   -- upsample all files given on the command line
When I ran this on a single 70 MB test file, it ran in 14 seconds. When I ran it on two copies in parallel, it ran for more than a minute before it started swapping like mad, and I had to kill it. After switching to:
import qualified Data.ByteString.Lazy as B
it ran in 3.7 seconds on a single file, 7.8 seconds on two copies on a single processor, and 4.0 seconds on two copies on two processors with +RTS -N2.
Make sure you're compiling with optimizations on, profile your program, and make sure it's running in a constant (or at least reasonable) heap space. The above program runs in a constant 100 KB or so of heap. A similar version that uses a strict ByteString for reading and a lazy ByteString for writing reads the whole file into memory: the heap grows to about 70 MB (the size of the file) within a fraction of a second and then stays constant while the file is processed.
No matter how complicated your filter is, if your program is growing gigabytes of heap, the implementation is broken, and you'll need to fix it before you worry about performance, parallel or otherwise.
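A quick way to check both points without a full profiling build is the RTS summary (a sketch; upsample.hs and the arguments are placeholders for your actual module and inputs):
ghc -O2 -threaded -rtsopts upsample.hs
./upsample <your-args> +RTS -N2 -s
The -s summary reports the maximum residency; if that figure is on the order of your audio files (or keeps growing), the implementation is not running in constant space.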

Related

Haskell - Reading entire Lazy ByteString

Context: I have a function defined in a library called toXlsx :: ByteString -> Xlsx (that ByteString is from Data.ByteString.Lazy)
Now, to do certain operations, I've defined several functions that operate on the same file, so I would like to open the file, read it, and convert it to Xlsx once, keeping it in memory to operate on.
Right now I'm reading the file with bs <- Data.ByteString.Lazy.readFile file and at the end doing Data.ByteString.Lazy.length bs `seq` return value.
Is there any way to use this function and keep the file in memory as a whole to reuse it?
Note that the way a lazy bytestring works, the contents of the file won't be read until they are "used", but once they are read, they will remain in memory for any subsequent operations. The only way they will be removed from memory is if they are garbage collected because your program no longer has any way to access them.
For example, if you run the following program on a large file:
import qualified Data.ByteString.Lazy as BL

main = do
  bigFile <- BL.readFile "ubuntu-14.04-desktop-amd64.iso"
  print $ BL.length $ BL.filter (==0) bigFile     -- takes a while
  print $ BL.length $ BL.filter (==255) bigFile   -- runs fast
the first computation will actually read the entire file into memory and it will be kept there for the second computation.
I guess this by itself isn't too convincing, since the operating system will also cache the file into memory, and it ends up being hard to tell the difference in timing between Haskell reading the file from the operating system cache for each computation and keeping it in memory across all computations. But, if you ran some heap profiling on this code, you'd discover that the first operation loads up the entire file into "pinned" bytestrings and that allocation stays constant through subsequent operations.
If your concern is that you want the complete file to be read at the start, even if the first operation doesn't need to read it all, so that there are no subsequent delays as additional parts of the file are read, then your seq-based solution is probably fine. Alternatively, you can read the entire file as a strict bytestring and then convert it using fromStrict -- this operation is instantaneous and doesn't copy any data. (In contrast to toStrict, which is expensive and does copy data.) So this will work:
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL

main = do
  -- read strict
  bigFile <- BS.readFile "whatever.mov"
  -- do strict and lazy operations
  print $ strictOp bigFile
  print $ lazyOp (BL.fromStrict bigFile)
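For completeness, the seq-based approach from the question can be wrapped up in a small helper; this is only a sketch (loadWhole is a made-up name), and it forces the entire file into memory just like the strict read above:
import qualified Data.ByteString.Lazy as BL

-- Read lazily, but force the entire contents into memory before returning,
-- so later operations never go back to the file.
loadWhole :: FilePath -> IO BL.ByteString
loadWhole path = do
  bs <- BL.readFile path
  BL.length bs `seq` return bs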

Read file as bytestring and write this bytestring to a file: issue on a network drive

Consider the following simple Haskell program, which reads a file as a bytestring and writes the file tmp.tmp from this bytestring:
module Main where

import System.Environment
import qualified Data.ByteString.Lazy as B

main :: IO ()
main = do
  [file] <- getArgs
  bs <- B.readFile file
  B.writeFile "tmp.tmp" bs
  putStrLn "done"
It is compiled to an executable named tmptmp.
I have two hard drives on my computer: the C drive and the U drive; the U drive is a network drive, and this network drive is offline.
Now, let's try tmptmp.
When I run it from C, there's no problem; I run it two times below, the first time with a file on C and the second time with a file on U:
C:\HaskellProjects\imagelength> tmptmp LICENSE
done
C:\HaskellProjects\imagelength> tmptmp U:\Data\ztemp\test.xlsx
done
Now I run it from U, with a file on the C drive, no problem:
U:\Data\ztemp> tmptmp C:\HaskellProjects\imagelength\LICENSE
done
The problem occurs when I run it from U with a file on the U drive:
U:\Data\ztemp> tmptmp test.xlsx
tmptmp: tmp.tmp: openBinaryFile: resource busy (file is locked)
If in my program I use strict bytestrings instead of lazy bytestrings (by replacing Data.ByteString.Lazy with Data.ByteString), this problem does not occur anymore.
I'd like to understand that. Any explanation? (I would particularly like to know how to solve this issue while still using lazy bytestrings.)
EDIT
To be perhaps more precise, the problem still occurs with this program:
import System.Environment
import qualified Data.ByteString as SB
import qualified Data.ByteString.Lazy as LB

main :: IO ()
main = do
  [file] <- getArgs
  bs <- LB.readFile file
  SB.writeFile "tmp.tmp" (LB.toStrict bs)
  putStrLn "done"
while the problem disappears with:
bs <- SB.readFile file
LB.writeFile "tmp.tmp" (LB.fromStrict bs)
It looks like the point causing the problem is the laziness of readFile.
As per the most recent Data.ByteString.Lazy docs:
Using lazy I/O functions like readFile or hGetContents means that the order of operations such as closing the file handle is left at the discretion of the RTS.
The example given with the offline network drive presumably leads to the RTS continuing from readFile without closing the file. The docs, which have an almost identical example, say that
When writeFile is executed next, [tmp.tmp] is still open for reading and the RTS takes care to avoid simultaneously opening it for writing, instead returning the error.
As far as I am aware, there is no complete solution to this within Data.ByteString.Lazy itself; both your workaround (using the strict read) and other packages are suggested in the docs. Sometimes reading and writing the same file can work, but you have no guarantee.
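If you want to keep a lazy ByteString at the call sites, one mitigation sometimes suggested (a sketch only, with no guarantee it helps on an offline network drive) is to force the whole read before writing, so the handle opened by the lazy readFile is already finished with:
module Main where

import System.Environment
import qualified Data.ByteString.Lazy as B

main :: IO ()
main = do
  [file] <- getArgs
  bs <- B.readFile file
  -- Forcing the length reads the file to its end, after which the lazy
  -- readFile handle can be closed, before tmp.tmp is opened for writing.
  B.length bs `seq` B.writeFile "tmp.tmp" bs
  putStrLn "done"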

No heap profiling data for module Data.ByteString

I was trying to generate heap memory profile for following naive Haskell code that copies a file:
import System.Environment
import System.IO
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as LB

naiveCopy :: String -> String -> IO ()
naiveCopy from to = do
  putStrLn $ "From: " ++ from
  putStrLn $ "To: " ++ to
  s <- B.readFile from
  B.writeFile to s

main = do
  args <- getArgs
  mapM_ putStrLn args
  naiveCopy (head args) ((head . tail) args)
Command that builds the code with GHC 8.0.1:
ghc -o t -rtsopts -prof -fprof-auto t.hs
Command that collects the profiling data:
./t +RTS -p -h -RTS in/data out/data && hp2ps -e8in -c t.hp
where in/data is a fairly big file (approximately 500 MB) that takes the program about 2 seconds to copy.
The problem is that I couldn't get heap profiling data when using the strict Data.ByteString; there's only a small t.hp file without any sample data, which looks like this:
JOB "t in/data out/data +RTS -p -h"
DATE "Thu Aug 4 20:19 2016"
SAMPLE_UNIT "seconds"
VALUE_UNIT "bytes"
BEGIN_SAMPLE 0.000000
END_SAMPLE 0.000000
BEGIN_SAMPLE 0.943188
END_SAMPLE 0.943188
and the corresponding profile chart is empty (chart omitted).
However, I could get heap profiling data when I switched to the lazy version, Data.ByteString.Lazy (profile chart omitted).
Update: Thanks @ryachza, I added a -i0 parameter to set the sampling interval and tried again. This time I got sample data for the strict ByteString, and it looked reasonable (I was copying a 500 MB file and the memory allocation peak in the profiling chart, omitted here, is about 500 MB):
./t +RTS -p -h -i0 -RTS in/data out/data && hp2ps -e8in -c t.hp
It appears as though the runtime isn't "getting the chance to measure" the heap. If you add -s to your RTS options, it should print some time and allocation information. When I run this, I see that the bytes allocated and the total memory use are very high (the size of the file), but the maximum residency (and the number of samples) is very low, and while the elapsed time is high, the actual "work" time is practically 0.
Adding the RTS option -i0 allowed me to reproducibly visualize the bytestring allocation as PINNED (this is the classification because the byte arrays that bytestring uses internally are allocated in an area in which the GC can't move things). You could experiment with different -h options which associate allocations to different cost centers (for example, -hy should show ARR_WORDS) but it probably wouldn't have much value in this case as the bytestrings are really just "big chunks of raw memory".
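Concretely, a run broken down by closure type (so the pinned byte arrays show up as ARR_WORDS) might look like this, reusing the command from the question; -hy and -i0 are standard RTS profiling flags:
./t +RTS -p -hy -i0 -RTS in/data out/data && hp2ps -e8in -c t.hp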
The references I used to find the RTS options were (clearly I wasn't particular about the GHC version - I can't imagine these flags change frequently):
https://downloads.haskell.org/~ghc/7.0.1/docs/html/users_guide/runtime-control.html
https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/profiling.html
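As an aside, if the goal were a copy whose residency stays small no matter how large the file is, the usual shape is a chunked loop over strict ByteStrings. This is only a sketch (chunkedCopy is not the code from the question):
import System.IO
import qualified Data.ByteString as B

-- Copy a file in 64 KB chunks; heap residency stays around the chunk size
-- instead of the size of the whole file.
chunkedCopy :: FilePath -> FilePath -> IO ()
chunkedCopy from to =
  withFile from ReadMode $ \hIn ->
    withFile to WriteMode $ \hOut ->
      let loop = do
            chunk <- B.hGetSome hIn 65536
            if B.null chunk
              then return ()
              else B.hPut hOut chunk >> loop
      in loop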

Haskell http-conduit web-scraping daemon crashes with out of memory error

I've written a daemon in Haskell that scrapes information from a webpage every 5 minutes.
The daemon originally ran fine for about 50 minutes, but then it unexpectedly died with out of memory (requested 1048576 bytes). Every time I ran it, it died after the same amount of time. When I set it to sleep only 30 seconds instead, it died after 8 minutes.
I realized the code to scrape the website was incredibly memory inefficient (going from about 30M while sleeping to 250M while parsing 9M of html), so I rewrote it so that now it only uses about 15M extra while parsing. Thinking the problem was fixed, I ran the daemon overnight and when I woke up it was actually using less memory than it was that night. I thought I was done, but roughly 20 hours after it had started, it had crashed with the same error.
I started looking into GHC profiling but I wasn't able to get that to work. Next I started messing with RTS options: I tried setting -H64m to make the default heap size larger than what my program was using, and also using -Ksize to shrink the maximum stack size to see if that would make it crash sooner.
Despite every change I've made, the daemon still seems to crash after a constant number of iterations. Making the parsing more memory efficient made this number higher, but it still crashes. This doesn't make sense to me because none of these runs has even come close to using all of my memory, much less swap space. The heap size is supposed to be unlimited by default, shrinking the stack size didn't make a difference, and all my ulimits are either unlimited or significantly higher than what the daemon is using.
In the original code I pinpointed the crash to somewhere in the html parsing, but I haven't done the same for the more memory-efficient version because 20 hours takes so long to run. I don't know if this would even be useful to know, because it doesn't seem like any specific part of the program is broken; it runs successfully for dozens of iterations before crashing.
Out of ideas, I even looked through the ghc source code for this error, and it appears to be a failed call to mmap, which wasn't very helpful to me because I assume that isn't the root of the problem.
(Edit: code rewritten and moved to end of post)
I'm pretty new at Haskell, so I'm hoping this is some quirk of lazy evaluation or something else that has a quick fix. Otherwise, I'm fresh out of ideas.
I'm using GHC version 7.4.2 on FreeBSD 9.1.
Edit:
Replacing the downloading with static html got rid of the problem, so I've narrowed it down to how I'm using http-conduit. I've edited the code above to include my networking code. The Hackage docs mention sharing a manager, so I've done that. They also say that for http you have to explicitly close connections, but I don't think I need to do that for httpLbs.
Here's my code.
import Control.Monad.IO.Class (liftIO)
import qualified Data.Text as T
import qualified Data.ByteString.Lazy as BL
import Text.Regex.PCRE
import Network.HTTP.Conduit

main :: IO ()
main = do
    manager <- newManager def
    daemonLoop manager

daemonLoop :: Manager -> IO ()
daemonLoop manager = do
    rows <- scrapeWebpage manager
    putStrLn $ "number of rows parsed: " ++ (show $ length rows)
    doSleep
    daemonLoop manager

scrapeWebpage :: Manager -> IO [[BL.ByteString]]
scrapeWebpage manager = do
    putStrLn "before makeRequest"
    html <- makeRequest manager
    -- Force evaluation of html.
    putStrLn $ "html length: " ++ (show $ BL.length html)
    putStrLn "after makeRequest"
    -- Breaks ~10M html table into 2d list of bytestrings.
    -- Max memory usage is about 45M, which is about 15M more than when sleeping.
    return $ map tail $ html =~ pattern
  where
    pattern :: BL.ByteString
    pattern = BL.concat $ replicate 12 "<td[^>]*>([^<]+)</td>\\s*"

makeRequest :: Manager -> IO BL.ByteString
makeRequest manager = runResourceT $ do
    defReq <- parseUrl url
    let request = urlEncodedBody params $ defReq
            -- Don't throw errors for bad statuses.
            { checkStatus = \_ _ -> Nothing
            -- 1 minute.
            , responseTimeout = Just 60000000
            }
    response <- httpLbs request manager
    return $ responseBody response
and its output:
before makeRequest
html length: 1555212
after makeRequest
number of rows parsed: 3608
...
before makeRequest
html length: 1555212
after makeRequest
bannerstalkerd: out of memory (requested 2097152 bytes)
Getting rid of the regex computations fixed the problem, but it seems that the error happens after the networking and during the regex, presumably because of something I'm doing wrong with http-conduit. Any ideas?
Also, when I try to compile with profiling enabled I get this error:
Could not find module `Network.HTTP.Conduit'
Perhaps you haven't installed the profiling libraries for package `http-conduit-1.8.9'?
Indeed, I have not installed profiling libraries for http-conduit and I don't know how.
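For what it's worth, with the cabal-install of that era the usual fix for this error was to reinstall the dependency with library profiling enabled, roughly as follows (the exact flags may differ by cabal version); setting library-profiling: True in ~/.cabal/config avoids repeating this for every package:
cabal install --reinstall --enable-library-profiling http-conduit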
So you've found yourself a leak. By tinkering with compiler options and memory settings you can only postpone the moment your program crashes; you cannot eliminate the source of the problem, so no matter what you set there, you will still run out of memory eventually.
I recommend you carefully walk through all the non-pure code, primarily the parts working with resources. Check whether all resources get released correctly. Check whether you have accumulating state, like a growing unbounded channel. And, of course, as wisely suggested by n.m., profile it.
I have a scraper that parses pages without pausing and downloads files, and it does it all concurrently. I've never seen it use any more memory than ~60 MB. I've been compiling it with GHC 7.4.2, GHC 7.6.1 and GHC 7.6.2 and had problems with none of them.
It should be noted that the root of your problem may also be in the libraries you're using.
In my scraper I use http-conduit, http-conduit-browser, HandsomeSoup and HXT.
I ended up solving my own problem. It seems to be a GHC bug on FreeBSD. I submitted a bug report and switched to Linux, and now it's been running flawlessly for the last few days.

Basic I/O performance in Haskell

Another microbenchmark: Why is this "loop" (compiled with ghc -O2 -fllvm, 7.4.1, Linux 64bit 3.2 kernel, redirected to /dev/null)
mapM_ print [1..100000000]
about 5x slower than a simple for loop in plain C using the non-buffered write(2) syscall? I am trying to gather Haskell gotchas.
Even this slow C solution is much faster than Haskell
int i;
char buf[16];
for (i = 0; i <= 100000000; i++) {
    sprintf(buf, "%d\n", i);
    write(1, buf, strlen(buf));
}
Okay, on my box the C code, compiled with gcc -O3, takes about 21.5 seconds to run, and the original Haskell code about 56 seconds. So not a factor of 5, but a bit above 2.5.
The first nontrivial difference is that
mapM_ print [1..100000000]
uses Integers; that's a bit slower because it involves a check upfront and then works with boxed Integers, while the Show instance of Int does the conversion work on unboxed Int#s.
Adding a type signature, so that the Haskell code works on Ints,
mapM_ print [1 :: Int .. 100000000]
brings the time down to 47 seconds, a bit above twice the time the C code takes.
Now, another big difference is that show produces a linked list of Char and doesn't just fill a contiguous buffer of bytes. That is slower too.
Then that linked list of Chars is used to fill a byte buffer that then is written to the stdout handle.
So, the Haskell code does more, and more complicated things than the C code, thus it's not surprising that it takes longer.
Admittedly, it would be desirable to have an easy way to output such things more directly (and hence faster). However, the proper way to handle it is to use a more suitable algorithm (that applies to C too). A simple change to
putStr . unlines $ map show [0 :: Int .. 100000000]
almost halves the time taken, and if one wants it really fast, one uses the faster ByteString I/O and builds the output efficiently as exemplified in applicative's answer.
On my (rather slow and outdated) machine the results are:
$ time haskell-test > haskell-out.txt
real 1m57.497s
user 1m47.759s
sys 0m9.369s
$ time c-test > c-out.txt
real 7m28.792s
user 1m9.072s
sys 6m13.923s
$ diff haskell-out.txt c-out.txt
$
(I have fixed the list so that both C and Haskell start with 0).
Yes, you read that right. Haskell is several times faster than C here. Or rather, normally buffered Haskell output is faster than C using the non-buffered write(2) syscall.
(When measuring output to /dev/null instead of a real disk file, C is about 1.5 times faster, but who cares about /dev/null performance?)
Technical data: Intel E2140 CPU, 2 cores, 1.6 GHz, 1M cache, Gentoo Linux, gcc4.6.1, ghc7.6.1.
The standard Haskell way to hand giant bytestrings over to the operating system is to use a builder monoid.
import Data.ByteString.Lazy.Builder        -- requires bytestring-0.10.x
import Data.ByteString.Lazy.Builder.ASCII  -- omit for bytestring-0.10.2.x
import Data.Monoid
import System.IO

main = hPutBuilder stdout $ build [0..100000000::Int]

build = foldr add_line mempty
  where add_line n b = intDec n <> charUtf8 '\n' <> b
which gives me:
$ time ./printbuilder >> /dev/null
real 0m7.032s
user 0m6.603s
sys 0m0.398s
in contrast to the Haskell approach you used:
$ time ./print >> /dev/null
real 1m0.143s
user 0m58.349s
sys 0m1.032s
That is, it's child's play to do nine times better than mapM_ print, contra Daniel Fischer's surprising defeatism. Everything you need to know is here: http://hackage.haskell.org/packages/archive/bytestring/0.10.2.0/doc/html/Data-ByteString-Builder.html I won't compare it with your C, since my results were much slower than Daniel's and n.m.'s, so I figure something was going wrong.
Edit: Made the imports consistent with all versions of bytestring-0.10.x. It occurred to me that the following might be clearer -- the Builder equivalent of unlines . map show:
main = hPutBuilder stdout $ unlines_ $ map intDec [0..100000000::Int]
  where unlines_ = mconcat . map (<> charUtf8 '\n')
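On newer bytestring versions the builder API lives in Data.ByteString.Builder, so a present-day equivalent would presumably look something like this (a sketch; the binary-mode and block-buffering setup is the usual recommendation for hPutBuilder):
import Data.ByteString.Builder (hPutBuilder, intDec, charUtf8)
import Data.Monoid ((<>))
import System.IO (stdout, hSetBinaryMode, hSetBuffering, BufferMode(BlockBuffering))

main :: IO ()
main = do
    hSetBinaryMode stdout True
    hSetBuffering stdout (BlockBuffering Nothing)
    -- Same builder as above, just assembled with foldMap.
    hPutBuilder stdout $ foldMap line [0 .. 100000000 :: Int]
  where
    line n = intDec n <> charUtf8 '\n'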
