Lazy ByteString : memory exploding in certain cases

Lazy ByteString : memory exploding in certain cases - haskell

Below we have two seemingly functionally equivalent programs. For the first the memory remains constant, whereas for the second the memory explodes (using ghc 7.8.2 & bytestring-0.10.4.0 in Ubuntu 14.04 64-bit):
Non-exploding :
--NoExplode.hs
--ghc -O3 NoExplode.hs
module Main where
import Data.ByteString.Lazy as BL
import Data.ByteString.Lazy.Char8 as BLC
num = 1000000000
bytenull = BLC.pack ""
countDataPoint arg sum
| arg == bytenull = sum
| otherwise = countDataPoint (BL.tail arg) (sum+1)
test1 = BL.last $ BL.take num $ BLC.cycle $ BLC.pack "abc"
test2 = countDataPoint (BL.take num $ BLC.cycle $ BLC.pack "abc") 0
main = do
print test1
print test2
Exploding :
--Explode.hs
--ghc -O3 Explode.hs
module Main where
import Data.ByteString.Lazy as BL
import Data.ByteString.Lazy.Char8 as BLC
num = 1000000000
bytenull = BLC.pack ""
countDataPoint arg sum
| arg == bytenull = sum
| otherwise = countDataPoint (BL.tail arg) (sum+1)
longByteStr = BL.take num $ BLC.cycle $ BLC.pack "abc"
test1 = BL.last $ longByteStr
test2 = countDataPoint (BL.take num $ BLC.cycle $ BLC.pack "abc") 0
main = do
print test1
print test2
Additional details :
The difference is that inExplode.hs I have taken BL.take num $ BLC.cycle $ BLC.pack "abc" out of the definition of test1, and assigned it to its own value longByteStr.
Strangely if we comment out either print test1 or print test2 in Explode.hs (but obviously not both), then the program does not explode.
Is there a reason memory is exploding in Explode.hs and not in NoExplode.hs, and also why the exploding program (Explode.hs) requires both print test1 and print test2 in order to exlode?

Why ghc performs common expression elimination in one case, but not in the other? Who knows. Maybe common expressions where killed by inlining. Basically it depends on internal implementation.
Regarding -ddump-simp, see this SO question: Reading GHC Core
I reproduced it with ghc-7.8.2. It performs common expression elimination. You can check output of -ddump-simpl. So you actually creating one lazy bytestring.
In the first version you create two lazy bytestrings. print test1 forces the first one, but it is garbage collected on the fly because nobody else uses it. The same for print test2 -- it forces the second bytestring, and it is GC'ed on the fly.
In the second version you create one lazy bytestring. print test1 forces it, but it can't be GC'ed because it is needed for print test2. As a result, after the first print you have entire bytestring loaded into memory.
If you remove one print, the bytestring is GC'ed on the fly again. because it is not used anywhere else.
PS. "GC'ed on the fly" means: print takes the first chunk and outputs it to stdout. The chunk becomes available for GC. Then prints takes the second chunk, etc...

Related

String Formatting columns in Haskell without Text.Printf

I am new to Haskell. I am at the last part of a school project. I have to take tuples and print them to an outfile and separate them by a tab column. So (709,4226408), (12965,4226412) and (5,4226016) should have and output of
709 4226408
12965 4226412
5 4226016
What I have been trying to do is this:
genOutput :: (Int, Int) -> String
genOutput (a,b) = (show a) ++ "\t" ++ (show b)
And this gives outputs like:
"709\t4226408"
"12965\t4226412"
"5\t4226016"
There are 3 things wrong with this. 1) Quotes still appear in the output. 2) The \t tab does not actually become a tab space. .Whenever I try to make an actual tab for the "" it just comes out as a " " space. 3) They are not aligned into columns like the above example. I know Text.Printf exists but we are not allowed to import anything other than:
import System.IO
import Data.List
import System.Environment

that's the output you get from GHCi I guess? Try to use putStrLn instead:
Prelude> genOutput (1,42)
"1\t42"
Prelude> putStrLn $ genOutput (1,42)
1 42
Why is that?
If you tell GHCi to evaluate an expression it will do so and (more or less) output it using show - show is designed to work with read and will usually output a value as if you would input it directly into Haskell. For a String that will include escape sequences and the "s
Now using putStrLn it will take the string and print it to stdout as you would expect.
Using print
Another reason could be that you use print to output your value - print is show + putStrLn so it'll show the values first re-introducing the escapes (as GHCi would) - so if you use print change it to putStrLn if you are using Strings

How to make a Haskell program display a preliminary result in response to user input?

I am writing a program in Haskell which repeatedly takes its most recent result and uses this to compute the next result. I want to be able to see the newest result in response to user input, so I tried something like this:
main = mainhelper 0
mainhelper count = do
count <- return (count + 1)
line <- getLine
if null line
then do mainhelper count
else do
putStrLn $ show count
return ()
I was hoping that getLine would return an empty line if the user hasn't entered anything, but this doesn't happen, instead the program does nothing until it receives user input. Is there a way around this?

One simple solution is to fork a thread for the complicated computation and communicate with the main UI thread via MVar. For example:
import Control.Exception
import Control.Monad
import Control.Concurrent
thinkReallyHard x = do
threadDelay 1000000 -- as a proxy for something that's actually difficult
evaluate (x+1)
main = do
v <- newMVar 0
forkIO (forever (modifyMVar_ v thinkReallyHard))
forever (getLine >> readMVar v >>= print)
You may wonder about the role of evaluate in thinkReallyHard. The subtlety there is that MVars are lazy -- they can contain thunks just as easily as computed values. In particular, this means it's easy to accidentally push all the pure computation from the forked thread into the thread that's reading and using the contents of the MVar. The call to evaluate simply forces the forked thread to finish the pure computation before writing to the MVar.

It does return an empty line if you hit enter without entering text -- you just immediately prompt for more input, so it might look like nothing is happening. But if you run the program, hit enter three times, then enter something non-empty, you'll see that the final count reflects the multiple entries.
Here's a modified version of your code that does the same thing, but is slightly more canonical:
main = mainhelper 0
mainhelper count = do
let count' = count + 1
line <- getLine
if null line
then mainhelper count'
else print count'
Rather than count <- return (count + 1), you can write let count' = count + 1 -- this is a pure binding, not something that needs to invoke the IO monad (as you're doing with <- and return). But I used count' instead of count because otherwise that will create a recursive binding. The '-suffixing is a standard idiom for a "modified version" of an identifier.
Next I switched putStrLn . show to print, which is part of the Prelude and does exactly that.
I got rid of the return () because print (and putStrLn) already have the type IO (). This allows you to elide the do statements, as there's now a single IO expression in each branch of the if.
It's not really clear what you're trying to do here that's different from what you are doing -- the code does (in imperative terms) increment a counter every time the user presses enter, and displays the state of the counter every time the user enters some non-empty text.
Here's another version that prints the counter every time, but only increments it when prompted, which may or may not be helpful to you:
main = mainhelper 0
mainhelper count = do
print count
line <- getLine
mainhelper (if null line then count else succ count)
I'm using succ, the successor function, instead of the explicit + 1, which is just a style preference.

How to show progress in Shake?

I am trying to figure out how can i take the progress info from a Progress type (in Development.Shake.Progress) to output it before executing a command. The possible desired output would be:
[1/9] Compiling src/Window/Window.cpp
[2/9] Compiling src/Window/GlfwError.cpp
[3/9] Compiling src/Window/GlfwContext.cpp
[4/9] Compiling src/Util/MemTrack.cpp
...
For now i am simulating this using some IORef that keeps the total (initially set to the sum of the source files) and a count that i increase before executing each build command, but this seems like a hackish solution to me.
On top of that this solution seems to work correctly on clean builds, but misbehaves on partial builds as the sum that displayed is still the total of all the source files.
With access to a Progress data type i will be able to calculate this fraction correctly using its countSkipped, countBuild, and countTodo members (see Progress.hs:53), but i am still not sure how i can i achieve this.
Any help is appreciated.

Values of type Progress are currently only available as an argument to the function stored in shakeProgress. You can obtain the Progress whenever you want with:
{-# LANGUAGE RecordWildCards #-}
import Development.Shake
import Data.IORef
import Data.Monoid
import Control.Monad
main = do
ref <- newIORef $ return mempty
shakeArgs shakeOptions{shakeProgress = writeIORef ref} $ do
want ["test" ++ show i | i <- [1..5]]
"test*" %> \out -> do
Progress{..} <- liftIO $ join $ readIORef ref
putNormal $
"[" ++ show (countBuilt + countSkipped + 1) ++
"/" ++ show (countBuilt + countSkipped + countTodo) ++
"] " ++ out
writeFile' out ""
Here we create an IORef to squirrel away the argument passed to shakeProgress, then retrieve it later when running the rules. Running the above code I see:
[1/5] test5
[2/5] test4
[3/5] test3
[4/5] test2
[5/5] test1
Running at a higher level of parallelism gives less precise results - initially there are only 3 items in todo (Shake increments countTodo as it finds items todo, and spawns items as soon as it knows about any of them), and there are often two rules running at the same index (there is no information about how many are in progress). Given knowledge of your specific rules, you could refine the output, e.g. storing an IORef you increment to ensure the index was monotonic.
The reason this code is somewhat convoluted is that the Progress information was intended to be used for asynchronous progress messages, although your approach seems perfectly valid. It may be worth introducing a getProgress :: Action Progress function for synchronous progress messages.

How to get good performance when writing a list of integers from 1 to 10 million to a file?

question
I want a program that will write a sequence like,
1
...
10000000
to a file. What's the simplest code one can write, and get decent performance? My intuition is that there is some lack-of-buffering problem. My C code runs at 100 MB/s, whereas by reference the Linux command line utility dd runs at 9 GB/s 3 GB/s (sorry for the imprecision, see comments -- I'm more interested in the big picture orders-of-magnitude though).
One would think this would be a solved problem by now ... i.e. any modern compiler would make it immediate to write such programs that perform reasonably well ...
C code
#include <stdio.h>
int main(int argc, char **argv) {
int len = 10000000;
for (int a = 1; a <= len; a++) {
printf ("%d\n", a);
}
return 0;
}
I'm compiling with clang -O3. A performance skeleton which calls putchar('\n') 8 times gets comparable performance.
Haskell code
A naiive Haskell implementation runs at 13 MiB/sec, compiling with ghc -O2 -optc-O3 -optc-ffast-math -fllvm -fforce-recomp -funbox-strict-fields. (I haven't recompiled my libraries with -fllvm, perhaps I need to do that.) Code:
import Control.Monad
main = forM [1..10000000 :: Int] $ \j -> putStrLn (show j)
My best stab with Haskell runs even slower, at 17 MiB/sec. The problem is I can't find a good way to convert Vector's into ByteString's (perhaps there's a solution using iteratees?).
import qualified Data.Vector.Unboxed as V
import Data.Vector.Unboxed (Vector, Unbox, (!))
writeVector :: (Unbox a, Show a) => Vector a -> IO ()
writeVector v = V.mapM_ (System.IO.putStrLn . show) v
main = writeVector (V.generate 10000000 id)
It seems that writing ByteString's is fast, as demonstrated by this code, writing an equivalent number of characters,
import Data.ByteString.Char8 as B
main = B.putStrLn (B.replicate 76000000 '\n')
This gets 1.3 GB/s, which isn't as fast as dd, but obviously much better.

Some completely unscientific benchmarking first:
All programmes have been compiled with the default optimisation level (-O3 for gcc, -O2 for GHC) and run with
time ./prog > outfile
As a baseline, the C programme took 1.07s to produce a ~76MB (78888897 bytes) file, roughly 70MB/s throughput.
The "naive" Haskell programme (forM [1 .. 10000000] $ \j -> putStrLn (show j)) took 8.64s, about 8.8MB/s.
The same with forM_ instead of forM took 5.64s, about 13.5MB/s.
The ByteString version from dflemstr's answer took 9.13s, about 8.3MB/s.
The Text version from dflemstr's answer took 5.64s, about 13.5MB/s.
The Vector version from the question took 5.54s, about 13.7MB/s.
main = mapM_ (C.putStrLn . C.pack . show) $ [1 :: Int .. 10000000], where C is Data.ByteString.Char8, took 4.25s, about 17.9MB/s.
putStr . unlines . map show $ [1 :: Int .. 10000000] took 3.06s, about 24.8MB/s.
A manual loop,
main = putStr $ go 1
where
go :: Int -> String
go i
| i > 10000000 = ""
| otherwise = shows i . showChar '\n' $ go (i+1)
took 2.32s, about 32.75MB/s.
main = putStrLn $ replicate 78888896 'a' took 1.15s, about 66MB/s.
main = C.putStrLn $ C.replicate 78888896 'a' where C is Data.ByteString.Char8, took 0.143s, about 530MB/s, roughly the same figures for lazy ByteStrings.
What can we learn from that?
First, don't use forM or mapM unless you really want to collect the results. Performancewise, that sucks.
Then, ByteString output can be very fast (10.), but if the construction of the ByteString to output is slow (3.), you end up with slower code than the naive String output.
What's so terrible about 3.? Well, all the involved Strings are very short. So you get a list of
Chunk "1234567" Empty
and between any two such, a Chunk "\n" Empty is put, then the resulting list is concatenated, which means all these Emptys are tossed away when a ... (Chunk "1234567" (Chunk "\n" (Chunk "1234568" (...)))) is built. That's a lot of wasteful construct-deconstruct-reconstruct going on. Speed comparable to that of the Text and the fixed "naive" String version can be achieved by packing to strict ByteStrings and using fromChunks (and Data.List.intersperse for the newlines). Better performance, slightly better than 6., can be obtained by eliminating the costly singletons. If you glue the newlines to the Strings, using \k -> shows k "\n" instead of show, the concatenation has to deal with half as many slightly longer ByteStrings, which pays off.
I'm not familiar enough with the internals of either text or vector to offer more than a semi-educated guess concerning the reasons for the observed performance, so I'll leave them out. Suffice it to say that the performance gain is marginal at best compared to the fixed naive String version.
Now, 6. shows that ByteString output is faster than String output, enough that in this case the additional work of packing is more than compensated. However, don't be fooled by that to believe that is always so. If the Strings to pack are long, the packing can take more time than the String output.
But ten million invocations of putStrLn, be it the String or the ByteString version, take a lot of time. It's faster to grab the stdout Handle just once and construct the output String in non-IO code. unlines already does well, but we still suffer from the construction of the list map show [1 .. 10^7]. Unfortunately, the compiler didn't manage to eliminate that (but it eliminated [1 .. 10^7], that's already pretty good). So let's do it ourselves, leading to 8. That's not too terrible, but still takes more than twice as long as the C programme.
One can make a faster Haskell programme by going low-level and directly filling ByteStrings without going through String via show, but I don't know if the C speed is reachable. Anyway, that low-level code isn't very pretty, so I'll spare you what I have, but sometimes one has to get one's hands dirty if speed matters.

Using lazy byte strings gives you some buffering, because the string will be written instantly and more numbers will only be produced as they are needed. This code shows the basic idea (there might be some optimizations that could be made):
import qualified Data.ByteString.Lazy.Char8 as ByteString
main =
ByteString.putStrLn .
ByteString.intercalate (ByteString.singleton '\n') .
map (ByteString.pack . show) $
([1..10000000] :: [Int])
I still use Strings for the numbers here, which leads to horrible slowdowns. If we switch to the text library instead of the bytestring library, we get access to "native" show functions for ints, and can do this:
import Data.Monoid
import Data.List
import Data.Text.Lazy.IO as Text
import Data.Text.Lazy.Builder as Text
import Data.Text.Lazy.Builder.Int as Text
main :: IO ()
main =
Text.putStrLn .
Text.toLazyText .
mconcat .
intersperse (Text.singleton '\n') .
map Text.decimal $
([1..10000000] :: [Int])
I don't know how you are measuring the "speed" of these programs (with the pv tool?) but I imagine that one of these procedures will be the fastest trivial program you can get.

If you are going for maximum performance, then it helps to take a holistic view; i.e., you want to write a function that maps from [Int] to series of system calls that write chunks of memory to a file.
Lazy bytestrings are good representation for a sequence of chunks of memory. Mapping a lazy bytestring to a series of systems calls that write chunks of memory is what L.hPut is doing (assuming an import qualified Data.ByteString.Lazy as L). Hence, we just need a means to efficiently construct the corresponding lazy bytestring. This is what lazy bytestring builders are good at. With the new bytestring builder (here is the API documentation), the following code does the job.
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Lazy.Builder (toLazyByteString, charUtf8)
import Data.ByteString.Lazy.Builder.ASCII (intDec)
import Data.Foldable (foldMap)
import Data.Monoid (mappend)
import System.IO (openFile, IOMode(..))
main :: IO ()
main = do
h <- openFile "/dev/null" WriteMode
L.hPut h $ toLazyByteString $
foldMap ((charUtf8 '\n' `mappend`) . intDec) [1..10000000]
Note that I output to /dev/null to avoid interference by the disk driver. The effort of moving the data to the OS remains the same. On my machine, the above code runs in 0.45 seconds, which is 12 times faster than the 5.4 seconds of your original code. This implies a throughput of 168 MB/s. We can squeeze out an additional 30% speed (220 MB/s) using bounded encodings].
import qualified Data.ByteString.Lazy.Builder.BasicEncoding as E
L.hPut h $ toLazyByteString $
E.encodeListWithB
((\x -> (x, '\n')) E.>$< E.intDec `E.pairB` E.charUtf8)
[1..10000000]
Their syntax looks a bit quirky because a BoundedEncoding a specifies the conversion of a Haskell value of type a to a bounded-length sequence of bytes such that the bound can be computed at compile-time. This allows functions such as E.encodeListWithB to perform some additional optimizations for implementing the actual filling of the buffer. See the the documentation of Data.ByteString.Lazy.Builder.BasicEncoding in the above link to the API documentation (phew, stupid hyperlink limit for new users) for more information.
Here is the source of all my benchmarks.
The conclusion is that we can get very good performance from a declarative solution provided that we understand the cost model of our implementation and use the right datastructures. Whenever constructing a packed sequence of values (e.g., a sequence of bytes represented as a bytestring), then the right datastructure to use is a bytestring Builder.

Command line options picked up by criterion library

I have used the libraries criterion and cmdargs.
When I compile the program completely without cmdargs and run it e.g. ./prog --help then I get some unwanted response from criterion about the possible options and the number of runs etc..
When I compile and run it as below the command line options are first picked up by my code then then read by criterion. Criterion then subsequently reports and error telling me that the option --byte is unknown. I have not seen anything in the criterion documentation how this could be switched off or worked around. Is there a way to clear out the command line options ofter I have read them? Otherwise I would need to use e.g. CPUTime instead of criterion, that is OK to me since I do to really require the loads of extra functionality and data that criterion delivers.
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DeriveDataTypeable #-}
import System.Console.CmdArgs
data Strlen = Strlen {byte :: Int} deriving (Data, Typeable, Show)
strlen = cmdArgsMode $ Strlen {byte = def} &= summary "MessagePack benchmark v0.04"
main = do
n <- cmdArgsRun strlen
let datastring = take (byte n) $ randomRs ('a','z') (mkStdGen 3)
putStrLn "Starting..."
conn <- connect "192.168.35.62" 8081
defaultMain [bench "sendReceive" $ whnfIO (mywl conn datastring)]

Use System.Environment.withArgs. Parse the command line arguments first with cmdArgs, then pass what you haven't used to criterion:
main = do
(flags, remaining) <- parseArgsHowever
act according to flags
withArgs remaining $
defaultMain [ ... ]

Take a look at the criterion source. You should be able to write your own defaultMainWith function that handles args however you want, including ignoring them, or ignoring unknown args, or etc...

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string