Throttling parallel computations - haskell

The Question
I have a finite list of values:
values :: [A]
... and an expensive, but pure, function on those values:
expensiveFunction :: A -> Maybe B
How do I run that function on each value in parallel and only return the first n results that complete with a Just and stop computation of the unfinished results?
takeJustsPar :: (NFData b) => Int -> (a -> Maybe b) -> [a] -> [b]
takeJustsPar maxJusts f as = ???
The Motivation
I know how I would do this using Control.Concurrent, but I wanted to experiment using Haskell's parallelism features. Also, the (scant) literature I could find seems to indicate that Haskell's parallelism features make it cheaper to spawn parallel computations and adapt the workload among the number of capabilities.

I attempted two solutions. The first uses the Par monad (i.e. Control.Monad.Par):
import Control.Monad.Par (Par, NFData)
import Control.Monad.Par.Combinator (parMap)
import Data.Maybe (catMaybes)
import Data.List.Split (chunksOf)
takeJustsPar :: (NFData b) => Int -> Int -> (a -> Maybe b) -> [a] -> Par [b]
takeJustsPar n chunkSize f as = go n (chunksOf chunkSize as) where
go _ [] = return []
go 0 _ = return []
go numNeeded (chunk:chunks) = do
evaluatedChunk <- parMap f chunk
let results = catMaybes evaluatedChunk
numFound = length results
numRemaining = numNeeded - numFound
fmap (results ++) $ go numRemaining chunks
The second attempt used Control.Parallel.Strategies:
import Control.Parallel.Strategies
import Data.List.Split (chunksOf)
chunkPar :: (NFData a) => Int -> Int -> [a] -> [a]
chunkPar innerSize outerSize as
= concat ((chunksOf innerSize as) `using` (parBuffer outerSize rdeepseq))
The latter one ended up being MUCH more composable, since I could just write:
take n $ catMaybes $ chunkPar 1000 10 $ map expensiveFunction xs
... rather than baking in the take and catMaybes behavior into the parallelism strategy.
The latter solution also gives nearly perfect utilization. On the embarrassingly parallel problem I tested it on, it gave 99% utilization for 8 cores. I didn't test the utilization of the Par monad because I was borrowing a colleague's computer and didn't want to waste their time when I was satisfied with the performance of Control.Parallel.Strategies.
So the answer is to use Control.Parallel.Strategies, which gives much more composable behavior and great multi-core utilization.

Related

Efficient bitstreams in Haskell

In an ongoing endeavour to efficiently fiddle with bits (e.g. see this SO question) the newest challenge is the efficient streaming and consumption of bits.
As a first simple task I choose to find the longest sequence of identical bits in a bitstream generated by /dev/urandom. A typical incantation would be head -c 1000000 </dev/urandom | my-exe. The actual goal is to stream bits and decode an Elias gamma code, for example, i.e. codes that are not chunks of bytes or multiples thereof.
For such codes of variable length it is nice to have the take, takeWhile, group, etc. language for list manipulation. Since a BitStream.take would actually consume part of the bistream some monad would probably come into play.
The obvious starting point is the lazy bytestring from Data.ByteString.Lazy.
A. Counting bytes
This very simple Haskell program performs on par with a C program, as is to be expected.
import qualified Data.ByteString.Lazy as BSL
main :: IO ()
main = do
bs <- BSL.getContents
print $ BSL.length bs
B. Adding bytes
Once I start using unpack things should get worse.
main = do
bs <- BSL.getContents
print $ sum $ BSL.unpack bs
Suprisingly, Haskell and C show the almost same performance.
C. Longest sequence of identical bits
As a first nontrivial task the longest sequence of identical bits can be found like this:
module Main where
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Lazy as BSL
import Data.List (group)
import Data.Word8 (Word8)
splitByte :: Word8 -> [Bool]
splitByte w = Prelude.map (\i-> (w `shiftR` i) .&. 1 == 1) [0..7]
bitStream :: BSL.ByteString -> [Bool]
bitStream bs = concat $ map splitByte (BSL.unpack bs)
main :: IO ()
main = do
bs <- BSL.getContents
print $ maximum $ length <$> (group $ bitStream bs)
The lazy bytestring is converted to a list [Word8] and then, using shifts, each Word is split into the bits, resulting in a list [Bool]. This list of lists is then flattened with concat. Having obtained a (lazy) list of Bool, use group to split the list into sequences of identical bits and then map length over it. Finally maximum gives the desired result. Quite simple, but not very fast:
# C
real 0m0.606s
# Haskell
real 0m6.062s
This naive implementation is exactly one order of magnitude slower.
Profiling shows that quite a lot of memory gets allocated (about 3GB for parsing 1MB of input). There is no massive space leak to be observed, though.
From here I start poking around:
There is a bitstream package that promises "Fast, packed, strict bit streams (i.e. list of Bools) with semi-automatic stream fusion.". Unfortunately it is not up-to-date with the current vector package, see here for details.
Next, I investigate streaming. I don't quite see why I should need 'effectful' streaming that brings some monad into play - at least until I start with the reverse of the posed task, i.e. encoding and writing bitstreams to file.
How about just fold-ing over the ByteString? I'd have to introduce state to keep track of consumed bits. That's not quite the nice take, takeWhile, group, etc. language that is desirable.
And now I'm not quite sure where to go.
Update:
I figured out how to do this with streaming and streaming-bytestring. I'm probably not doing this right because the result is catastrophically bad.
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Streaming as BSS
import Data.Word8 (Word8)
import qualified Streaming as S
import Streaming.Prelude (Of, Stream)
import qualified Streaming.Prelude as S
splitByte :: Word8 -> [Bool]
splitByte w = (\i-> (w `shiftR` i) .&. 1 == 1) <$> [0..7]
bitStream :: Monad m => Stream (Of Word8) m () -> Stream (Of Bool) m ()
bitStream s = S.concat $ S.map splitByte s
main :: IO ()
main = do
let bs = BSS.unpack BSS.getContents :: Stream (Of Word8) IO ()
gs = S.group $ bitStream bs :: Stream (Stream (Of Bool) IO) IO ()
maxLen <- S.maximum $ S.mapped S.length gs
print $ S.fst' maxLen
This will test your patience with anything beyond a few thousand bytes of input from stdin. The profiler says it spends an insane amount of time (quadratic in the input size) in Streaming.Internal.>>=.loop and Data.Functor.Of.fmap. I'm not quite sure what the first one is, but the fmap indicates (?) that the juggling of these Of a b isn't doing us any good and because we're in the IO monad it can't be optimised away.
I also have the streaming equivalent of the byte adder here: SumBytesStream.hs, which is slightly slower than the simple lazy ByteString implementation, but still decent. Since streaming-bytestring is proclaimed to be "bytestring io done right" I expected better. I'm probably not doing it right, then.
In any case, all these bit-computations shouldn't be happening in the IO monad. But BSS.getContents forces me into the IO monad because getContents :: MonadIO m => ByteString m () and there's no way out.
Update 2
Following the advice of #dfeuer I used the streaming package at master#HEAD. Here's the result.
longest-seq-c 0m0.747s (C)
longest-seq 0m8.190s (Haskell ByteString)
longest-seq-stream 0m13.946s (Haskell streaming-bytestring)
The O(n^2) problem of Streaming.concat is solved, but we're still not getting closer to the C benchmark.
Update 3
Cirdec's solution produces a performance on par with C. The construct that is used is called "Church encoded lists", see this SO answer or the Haskell Wiki on rank-N types.
Source files:
All the source files can be found on github. The Makefile has all the various targets to run the experiments and the profiling. The default make will just build everything (create a bin/ directory first!) and then make time will do the timing on the longest-seq executables. The C executables get a -c appended to distinguish them.
Intermediate allocations and their corresponding overhead can be removed when operations on streams fuse together. The GHC prelude provides foldr/build fusion for lazy streams in the form of rewrite rules. The general idea is if one function produces a result that looks like a foldr (it has the type (a -> b -> b) -> b -> b applied to (:) and []) and another function consumes a list that looks like a foldr, constructing the intermediate list can be removed.
For your problem I'm going to build something similar, but using strict left folds (foldl') instead of foldr. Instead of using rewrite rules that try to detect when something looks like a foldl, I'll use a data type that forces lists to look like left folds.
-- A list encoded as a strict left fold.
newtype ListS a = ListS {build :: forall b. (b -> a -> b) -> b -> b}
Since I've started by abandoning lists we'll be re-implementing part of the prelude for lists.
Strict left folds can be created from the foldl' functions of both lists and bytestrings.
{-# INLINE fromList #-}
fromList :: [a] -> ListS a
fromList l = ListS (\c z -> foldl' c z l)
{-# INLINE fromBS #-}
fromBS :: BSL.ByteString -> ListS Word8
fromBS l = ListS (\c z -> BSL.foldl' c z l)
The simplest example of using one is to find the length of a list.
{-# INLINE length' #-}
length' :: ListS a -> Int
length' l = build l (\z a -> z+1) 0
We can also map and concatenate left folds.
{-# INLINE map' #-}
-- fmap renamed so it can be inlined
map' f l = ListS (\c z -> build l (\z a -> c z (f a)) z)
{-# INLINE concat' #-}
concat' :: ListS (ListS a) -> ListS a
concat' ll = ListS (\c z -> build ll (\z l -> build l c z) z)
For your problem we need to be able to split a word into bits.
{-# INLINE splitByte #-}
splitByte :: Word8 -> [Bool]
splitByte w = Prelude.map (\i-> (w `shiftR` i) .&. 1 == 1) [0..7]
{-# INLINE splitByte' #-}
splitByte' :: Word8 -> ListS Bool
splitByte' = fromList . splitByte
And a ByteString into bits
{-# INLINE bitStream' #-}
bitStream' :: BSL.ByteString -> ListS Bool
bitStream' = concat' . map' splitByte' . fromBS
To find the longest run we'll keep track of the previous value, the length of the current run, and the length of the longest run. We make the fields strict so that the strictness of the fold prevents chains of thunks from being accumulated in memory. Making a strict data type for a state is an easy way to get control over both its memory representation and when its fields are evaluated.
data LongestRun = LongestRun !Bool !Int !Int
{-# INLINE extendRun #-}
extendRun (LongestRun previous run longest) x = LongestRun x current (max current longest)
where
current = if x == previous then run + 1 else 1
{-# INLINE longestRun #-}
longestRun :: ListS Bool -> Int
longestRun l = longest
where
(LongestRun _ _ longest) = build l extendRun (LongestRun False 0 0)
And we're done
main :: IO ()
main = do
bs <- BSL.getContents
print $ longestRun $ bitStream' bs
This is much faster, but not quite the performance of c.
longest-seq-c 0m00.12s (C)
longest-seq 0m08.65s (Haskell ByteString)
longest-seq-fuse 0m00.81s (Haskell ByteString fused)
The program allocates about 1 Mb to read 1000000 bytes from input.
total alloc = 1,173,104 bytes (excludes profiling overheads)
Updated github code
I found another solution that is on par with C. The Data.Vector.Fusion.Stream.Monadic has a stream implementation based on this Coutts, Leshchinskiy, Stewart 2007 paper. The idea behind it is to use a destroy/unfoldr stream fusion.
Recall that list's unfoldr :: (b -> Maybe (a, b)) -> b -> [a] creates a list by repeatedly applying (unfolding) a step-forward function, starting with an initial value. A Stream is just an unfoldr function with starting state. (The Data.Vector.Fusion.Stream.Monadic library uses GADTs to create constructors for Step that can be pattern-matched conveniently. It could just as well be done without GADTs, I think.)
The central piece of the solution is the mkBitstream :: BSL.ByteString -> Stream Bool function that turns a BytesString into a stream of Bool. Basically, we keep track of the current ByteString, the current byte, and how much of the current byte is still unconsumed. Whenever a byte is used up another byte is chopped off ByteString. When Nothing is left, the stream is Done.
The longestRun function is taken straight from #Cirdec's solution.
Here's the etude:
{-# LANGUAGE CPP #-}
#define PHASE_FUSED [1]
#define PHASE_INNER [0]
#define INLINE_FUSED INLINE PHASE_FUSED
#define INLINE_INNER INLINE PHASE_INNER
module Main where
import Control.Monad.Identity (Identity)
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Lazy as BSL
import Data.Functor.Identity (runIdentity)
import qualified Data.Vector.Fusion.Stream.Monadic as S
import Data.Word8 (Word8)
type Stream a = S.Stream Identity a -- no need for any monad, really
data Step = Step BSL.ByteString !Word8 !Word8 -- could use tuples, but this is faster
mkBitstream :: BSL.ByteString -> Stream Bool
mkBitstream bs' = S.Stream step (Step bs' 0 0) where
{-# INLINE_INNER step #-}
step (Step bs w n) | n==0 = case (BSL.uncons bs) of
Nothing -> return S.Done
Just (w', bs') -> return $
S.Yield (w' .&. 1 == 1) (Step bs' (w' `shiftR` 1) 7)
| otherwise = return $
S.Yield (w .&. 1 == 1) (Step bs (w `shiftR` 1) (n-1))
data LongestRun = LongestRun !Bool !Int !Int
{-# INLINE extendRun #-}
extendRun :: LongestRun -> Bool -> LongestRun
extendRun (LongestRun previous run longest) x = LongestRun x current (max current longest)
where current = if x == previous then run + 1 else 1
{-# INLINE longestRun #-}
longestRun :: Stream Bool -> Int
longestRun s = runIdentity $ do
(LongestRun _ _ longest) <- S.foldl' extendRun (LongestRun False 0 0) s
return longest
main :: IO ()
main = do
bs <- BSL.getContents
print $ longestRun (mkBitstream bs)

Haskell: Replace mapM in a monad transformer stack to achieve lazy evaluation (no space leaks)

It has already been discussed that mapM is inherently not lazy, e.g. here and here. Now I'm struggling with a variation of this problem where the mapM in question is deep inside a monad transformer stack.
Here's a function taken from a concrete, working (but space-leaking) example using LevelDB that I put on gist.github.com:
-- read keys [1..n] from db at DirName and check that the values are correct
doRead :: FilePath -> Int -> IO ()
doRead dirName n = do
success <- runResourceT $ do
db <- open dirName defaultOptions{ cacheSize= 2048 }
let check' = check db def in -- is an Int -> ResourceT IO Bool
and <$> mapM check' [1..n] -- space leak !!!
putStrLn $ if success then "OK" else "Fail"
This function reads the values corresponding to keys [1..n] and checks that they are all correct. The troublesome line inside the ResourceT IO a monad is
and <$> mapM check' [1..n]
One solution would be to use streaming libraries such as pipes, conduit, etc. But these seem rather heavy and I'm not at all sure how to use them in this situation.
Another path I looked into is ListT as suggested here. But the type signatures of ListT.fromFoldable :: [Bool]->ListT Bool and ListT.fold :: (r -> a -> m r) -> r -> t m a -> mr (where m=IO and a,r=Bool) do not match the problem at hand.
What is a 'nice' way to get rid of the space leak?
Update: Note that this problem has nothing to do with monad transformer stacks! Here's a summary of the proposed solutions:
1) Using Streaming:
import Streaming
import qualified Streaming.Prelude as S
S.all_ id (S.mapM check' (S.each [1..n]))
2) Using Control.Monad.foldM:
foldM (\a i-> do {b<-check' i; return $! a && b}) True [1..n]
3) Using Control.Monad.Loops.allM
allM check' [1..n]
I know you mention you don't want to use streaming libraries, but your problem seems pretty easy to solve with streaming without changing the code too much.
import Streaming
import qualified Streaming.Prelude as S
We use each [1..n] instead of [1..n] to get a stream of elements:
each :: (Monad m, Foldable f) => f a -> Stream (Of a) m ()
Stream the elements of a pure, foldable container.
(We could also write something like S.take n $ S.enumFrom 1).
We use S.mapM check' instead of mapM check':
mapM :: Monad m => (a -> m b) -> Stream (Of a) m r -> Stream (Of b) m r
Replace each element of a stream with the result of a monadic action
And then we fold the stream of booleans with S.all_ id:
all_ :: Monad m => (a -> Bool) -> Stream (Of a) m r -> m Bool
Putting it all together:
S.all_ id (S.mapM check' (S.each [1..n]))
Not too different from the code you started with, and without the need for any new operator.
I think what you need is allM from the monad-loops package.
Then it would be just allM check' [1..n]
(Or if you don't want the import it's a pretty small function to copy.)

Haskell speed / memory usage

I'm trying to process some Point Cloud data with Haskell, and it seems to use a LOT of memory. The code I'm using is below, it basically parses the data into a format I can work with. The dataset has 440MB with 10M rows. When I run it with runhaskell, it uses up all the ram in a short time (~3-4gb) and then crashes. If I compile it with -O2 and run it, it goes to 100% cpu and takes a long time to finish (~3 minutes). I should mention that I'm using an i7 cpu with 4GB ram and an SSD, so there should be plenty of resources. How can I improve the performance of this?
{-# LANGUAGE OverloadedStrings #-}
import Prelude hiding (lines, readFile)
import Data.Text.Lazy (Text, splitOn, unpack, lines)
import Data.Text.Lazy.IO (readFile)
import Data.Maybe (fromJust)
import Text.Read (readMaybe)
filename :: FilePath
filename = "sample.txt"
readTextMaybe = readMaybe . unpack
data Classification = Classification
{ id :: Int, description :: Text
} deriving (Show)
data Point = Point
{ x :: Int, y :: Int, z :: Int, classification :: Classification
} deriving (Show)
type PointCloud = [Point]
maybeReadPoint :: Text -> Maybe Point
maybeReadPoint text = parse $ splitOn "," text
where toMaybePoint :: Maybe Int -> Maybe Int -> Maybe Int -> Maybe Int -> Text -> Maybe Point
toMaybePoint (Just x) (Just y) (Just z) (Just cid) cdesc = Just (Point x y z (Classification cid cdesc))
toMaybePoint _ _ _ _ _ = Nothing
parse :: [Text] -> Maybe Point
parse [x, y, z, cid, cdesc] = toMaybePoint (readTextMaybe x) (readTextMaybe y) (readTextMaybe z) (readTextMaybe cid) cdesc
parse _ = Nothing
readPointCloud :: Text -> PointCloud
readPointCloud = map (fromJust . maybeReadPoint) . lines
main = (readFile filename) >>= (putStrLn . show . sum . map x . readPointCloud)
The reason this uses all your memory when compiled without optimization is most likely because sum is defined using foldl. Without the strictness analysis that comes with optimization, that will blow up badly. You can try using this function instead:
sum' :: Num n => [n] -> n
sum' = foldl' (+) 0
The reason this is slow when compiled with optimization seems likely related to the way you parse the input. A cons will be allocated for each character when reading in the input, and again when breaking the input into lines, and probably yet again when splitting on commas. Using a proper parsing library (any of them) will almost certainly help; using one of the streaming ones like pipes or conduit may or may not be best (I'm not sure).
Another issue, not related to performance: fromJust is rather poor form in general, and is a really bad idea when dealing with user input. You should instead mapM over the list in the Maybe monad, which will produce a Maybe [Point] for you.

How can I unpack an arbitrary length list of IO Bool

I'm writing a program that should be able to simulate many instances of trying the martingale betting system with roulette. I would like main to take an argument giving the number of tests to perform, perform the test that many times, and then print the number of wins divided by the total number of tests. My problem is that instead of ending up with a list of Bool that I could filter over to count successes, I have a list of IO Bool and I don't understand how I can filter over that.
Here's the source code:
-- file: Martingale.hs
-- a program to simulate the martingale doubling system
import System.Random (randomR, newStdGen, StdGen)
import System.Environment (getArgs)
red = [1,3,5,7,9,12,14,16,18,19,21,23,25,27,30,32,34,36]
martingale :: IO StdGen -> IO Bool
martingale ioGen = do
gen <- ioGen
return $ martingale' 1 0 gen
martingale' :: Real a => a -> a -> StdGen -> Bool
martingale' bet acc gen
| acc >= 5 = True
| acc <= -100 = False
| otherwise = do
let (randNumber, newGen) = randomR (0,37) gen :: (Int, StdGen)
if randNumber `elem` red
then martingale' 1 (acc + bet) newGen
else martingale' (bet * 2) (acc - bet) newGen
main :: IO ()
main = do
args <- getArgs
let iters = read $ head args
gens = replicate iters newStdGen
results = map martingale gens
--results = map (<-) results
print "THIS IS A STUB"
Like I have in my comments, I basically want to map (<-) over my list of IO Bool, but as I understand it, (<-) isn't actually a function but a keyword. Any help would be greatly appreciated.
map martingale gens will give you something of type [IO Bool]. You can then use sequence to unpack it:
sequence :: Monad m => [m a] -> m [a]
A more natural alternative is to use mapM directly:
mapM :: Monad m => (a -> m b) -> [a] -> m [b]
i.e. you can write
results <- mapM martingale gens
Note - even after doing it this way, your code feels a bit unnatural. I can see some advantages to the structure, in particular because martingale' is a pure function. However having something of type IO StdGen -> IO Bool seems a bit odd.
I can see a couple of ways to improve it:
make martingale' return an IO type itself and push the newStdGen call all the way down into it
make gens use replicateM rather than replicate
You may want to head over to http://codereview.stackexchange.com for more comprehensive feedback.

Efficient hash map container in Haskell?

I want to count unique blocks stored in a file using Haskell.
The block is just consecutive bytes with a length of 512 and the target file has a size of at least 1GB.
This is my initial try.
import Control.Monad
import qualified Data.ByteString.Lazy as LB
import Data.Foldable
import Data.HashMap
import Data.Int
import qualified Data.List as DL
import System.Environment
type DummyDedupe = Map LB.ByteString Int64
toBlocks :: Int64 -> LB.ByteString -> [LB.ByteString]
toBlocks n bs | LB.null bs = []
| otherwise = let (block, rest) = LB.splitAt n bs
in block : toBlocks n rest
dedupeBlocks :: [LB.ByteString] -> DummyDedupe -> DummyDedupe
dedupeBlocks = flip $ DL.foldl' (\acc block -> insertWith (+) block 1 $! acc)
dedupeFile :: FilePath -> DummyDedupe -> IO DummyDedupe
dedupeFile fp dd = LB.readFile fp >>= return . (`dedupeBlocks` dd) . toBlocks 512
main :: IO ()
main = do
dd <- getArgs >>= (`dedupeFile` empty) . head
putStrLn . show . (*512) . size $ dd
putStrLn . show . (*512) . foldl' (+) 0 $ dd
It works, but I got frustrated with its execution time and memory usage. Especilly when I compared with those of C++ and even Python implementation listed below, it was 3~5x slower and consumed 2~3x more memory space.
import os
import os.path
import sys
def dedupeFile(dd, fp):
fd = os.open(fp, os.O_RDONLY)
for block in iter(lambda : os.read(fd, 512), ''):
dd.setdefault(block, 0)
dd[block] = dd[block] + 1
os.close(fd)
return dd
dd = {}
dedupeFile(dd, sys.argv[1])
print(len(dd) * 512)
print(sum(dd.values()) * 512)
I thought it was mainly due to the hashmap implementation, and tried other implementations such as hashmap, hashtables and unordered-containers.
But there wasn't any noticeable difference.
Please help me to improve this program.
I don't think you will be able to beat the performance of python dictionaries. They are actually implemented in c with years of optimizations put into it on the other hand hashmap is new and not that much optimized. So getting 3x performance in my opinion is good enough. You can optimize you haskell code at certain places but still it won't matter much. If you are still adamant about increasing performance I think you should use a highly optimized c library with ffi in your code.
Here are some of the similar discussions
haskell beginners
This may be completely irrelevant depending on your usage, but I am slightly worried about insertWith (+) block 1. If your counts reach high numbers, you will accumulate thunks in the cells of the hash map. It doesn't matter that you used ($!), that only forces the spine -- the values are likely still lazy.
Data.HashMap provides no strict version insertWith' like Data.Map does. But you can implement it:
insertWith' :: (Hashable k, Ord k) => (a -> a -> a) -> k -> a
-> HashMap k a -> HashMap k a
insertWith' f k v m = maybe id seq maybeval m'
where
(maybeval, m') = insertLookupWithKey (const f) k v m
Also, you may want to output (but not input) a list of strict ByteStrings from toBlocks, which will make hashing faster.
That's all I've got -- I'm no performance guru, though.

Resources