Library ghc RULES don't activate - haskell

fgl is a Haskell library for graph manipulation. This library comes with an implementation of its base classes - Data.Graph.Inductive.PatriciaTree - that is supposedly highly tuned for performance. Part of that performance tuning involves ghc RULES pragmas to replace certain generic functions with specialized versions that are much faster.
However, my evidence is that these RULES don't seem to work at all, and I don't understand why not. For people trying to replicate exactly what I see, I've put my test project up at https://github.com/fizbin/GraphOptiTest and am using ghc version 7.10.2.
Here's my test program:
{-# LANGUAGE TupleSections #-}
module Main where

import Control.Exception
import Control.Monad
import Data.Graph.Inductive
import qualified Data.Graph.Inductive.PatriciaTree as Pt
import qualified MyPatriciaTree as MPt

makeGraph :: (DynGraph gr) => Int -> gr () Int
makeGraph n = mkGraph (map (,()) [1 .. n])
                      (concatMap (\x -> map (\y -> (x, y, x*y)) [x .. n]) [1 .. n])

main1 :: IO ()
main1 =
  replicateM_ 200 $ let x = makeGraph 200 :: Pt.Gr () Int
                    in evaluate (length $ show x)

main2 :: IO ()
main2 =
  replicateM_ 200 $ let x = makeGraph 200 :: MPt.Gr () Int
                    in evaluate (length $ show x)

main :: IO ()
main = main1 >> main2
Now, Data.Graph.Inductive.PatriciaTree has this definition for the class function mkGraph:
mkGraph vs es = insEdges es
              . Gr
              . IM.fromList
              . map (second (\l -> (IM.empty,l,IM.empty)))
              $ vs
Where insEdges is a function defined in the module Data.Graph.Inductive.Graph as:
insEdges :: (DynGraph gr) => [LEdge b] -> gr a b -> gr a b
insEdges es g = foldl' (flip insEdge) g es
And Data.Graph.Inductive.PatriciaTree has this to say about insEdge:
{-# RULES
      "insEdge/Data.Graph.Inductive.PatriciaTree" insEdge = fastInsEdge
  #-}

fastInsEdge :: LEdge b -> Gr a b -> Gr a b
fastInsEdge (v, w, l) (Gr g) = g2 `seq` Gr g2
  where
    g1 = IM.adjust addSucc' v g
    g2 = IM.adjust addPred' w g1
    addSucc' (ps, l', ss) = (ps, l', IM.insertWith addLists w [l] ss)
    addPred' (ps, l', ss) = (IM.insertWith addLists v [l] ps, l', ss)
So, in theory, main1 in my test program should be compiled down into something that eventually calls fastInsEdge.
To test this, I compare against a modified version of Data.Graph.Inductive.PatriciaTree that uses this as its definition of the mkGraph method (this is the module MyPatriciaTree imported above and used in main2):
mkGraph vs es = doInsEdges
              . Gr
              . IM.fromList
              . map (second (\l -> (IM.empty,l,IM.empty)))
              $ vs
  where
    doInsEdges g = foldl' (flip fastInsEdge) g es
When I run my test program (after cabal configure --enable-library-profiling --enable-executable-profiling and cabal build GraphOptiTest), though, the main2 method smokes the main1 method. It isn't even close - the profile shows 99.2% of the program's time is spent inside main1. (and changing the program to just run main2 shows that yes, main2 is really fast on its own)
Yes, I do have -O in the ghc-options section of my cabal file.
Trying ghc options like -ddump-rule-firings doesn't really help - all I can see is that these replacement rules aren't firing, but I have no idea why. I don't know how to get the compiler to tell me why it didn't activate the replacement rules.
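For reference, this is the kind of invocation I use to watch rule activity outside of cabal (-ddump-simpl-stats prints a summary count of every rule that fired during simplification, which at least confirms which rules never fire):
ghc -O -ddump-rule-firings -ddump-simpl-stats Main.hs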
Bringing up something discovered by messing around with fgl's source in response to @dfeuer's answer below:
If I add a specialized version of insEdges to Data.Graph.Inductive.PatriciaTree as:
{-# RULES
      "insEdges/Data.Graph.Inductive.PatriciaTree" insEdges = fastInsEdges
  #-}

fastInsEdges :: [LEdge b] -> Gr a b -> Gr a b
fastInsEdges es g = foldl' (flip fastInsEdge) g es
Then both main1 and main2 are now fast. This replacement rule fires; why doesn't the other one? (And no, telling ghc to NOINLINE the function insEdge does no good)
EPILOGUE:
So there's now a bug filed with the fgl package for not tagging their functions that use insEdge and insNode appropriately so that the fast versions will be used. But in my code now I work around this and the workaround may be useful in more situations, so I thought I'd share it. At the top of my code now, I have:
import qualified Data.Graph.Inductive as G
import qualified Data.Graph.Inductive.PatriciaTree as Pt
-- Work around design and implementation performance issues
-- in the Data.Graph.Inductive package.
-- Specifically, the tuned versions of insNode, insEdge, gmap, nmap, and emap
-- for PatriciaTree graphs are exposed only through RULES pragmas, meaning
-- that you only get them when the compiler can specialize the function
-- to that specific instance of G.DynGraph. Therefore, I create my own
-- type class with the functions that have specialized versions and use that
-- type class here; the compiler then can do the specialized RULES
-- replacement on the Pt.Gr instance of my class.
class (G.DynGraph gr) => MyDynGraph gr where
    mkGraph  :: [G.LNode a] -> [G.LEdge b] -> gr a b
    insNodes :: [G.LNode a] -> gr a b -> gr a b
    insEdges :: [G.LEdge b] -> gr a b -> gr a b
    insNode  :: G.LNode a -> gr a b -> gr a b
    insEdge  :: G.LEdge b -> gr a b -> gr a b
    gmap     :: (G.Context a b -> G.Context c d) -> gr a b -> gr c d
    nmap     :: (a -> c) -> gr a b -> gr c b
    emap     :: (b -> c) -> gr a b -> gr a c

instance MyDynGraph Pt.Gr where
    mkGraph nodes edges = insEdges edges $ G.mkGraph nodes []
    insNodes vs g = foldl' (flip G.insNode) g vs
    insEdges es g = foldl' (flip G.insEdge) g es
    insNode = G.insNode
    insEdge = G.insEdge
    gmap = G.gmap
    nmap = G.nmap
    emap = G.emap
(Had I used the nemap function in my code, I would have included that in the class too.) Then, any code of mine that was formerly written in terms of (G.DynGraph gr) => ... is now written in terms of (MyDynGraph gr) => .... The compiler RULES activate for the Pt.Gr instance, and I then get the optimized version of each function.
Essentially, this trades away the compiler's ability to inline any of these functions into the calling code (and to perform whatever other optimizations that would enable) in exchange for always getting the optimized versions, at the cost of an extra pointer indirection at runtime, which is trivial in comparison. Since profiling showed that those other optimizations never yielded anything significant anyway, this was a clear net win in my case.
Many people's code could use SPECIALIZE pragmas aggressively to get the optimized versions everywhere; however, sometimes that isn't possible, and it wasn't possible in the real production code behind this question without refactoring huge chunks of the application. I had a data structure with a member of type (forall gr. G.DynGraph gr => tokType -> gr () (MyEdge c)) - that now uses MyDynGraph for the class constraint, but completely unwinding it to not have forall gr. in the signature would have been a huge effort, and such a signature prevents specialization from working across that boundary.
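For illustration, here is roughly what the SPECIALIZE approach looks like (a hedged sketch; buildMyGraph is a hypothetical helper of mine, not something from fgl, and whether the RULES then fire still depends on the unfoldings being visible, as discussed in the answer below):
-- Hypothetical graph-generic helper, specialized to the PatriciaTree
-- instance so the rewrite rules can match at the concrete type.
buildMyGraph :: (G.DynGraph gr) => [G.LNode ()] -> [G.LEdge Int] -> gr () Int
buildMyGraph ns es = G.insEdges es (G.mkGraph ns [])
{-# SPECIALIZE buildMyGraph :: [G.LNode ()] -> [G.LEdge Int] -> Pt.Gr () Int #-}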

I haven't done any experiments, but here's my guess. The insEdge function is not marked with a (phased) INLINE or NOINLINE, so the inliner is free to inline it whenever it's fully applied. In the definition of insEdges, we see
foldl' (flip insEdge) g es
Inlining foldl' gives
foldr f' id es g
where f' x k z = k $! flip insEdge z x
flip is now fully applied, so we can inline it:
foldr f' id es g
where f' x k z = k $! insEdge x z
Now insEdge is fully applied, so GHC may choose to inline it right then and there, before the rule ever has a chance to fire.
Try adding {-# NOINLINE [0] insEdge #-} right by the definition of insEdge and see what happens. If it works, submit a pull request to fgl.
P.S. In my opinion, this sort of thing should really be done by using class methods with defaults, rather than rewrite rules. Rules are always a bit fussy.
As the comments revealed, the big problem wasn't premature inlining, but rather a failure to specialize insEdge. In particular, Data.Graph.Inductive.Graph does not export an unfolding for insEdges, so it's impossible to specialize it, and the insEdge it calls, to the appropriate type. The ultimate fix was to mark insEdges INLINABLE, but I would still advise marking insEdge NOINLINE [0] out of an abundance of caution.
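In code, the eventual fix looks like this (a sketch of the pragma placement next to the existing definition in Data.Graph.Inductive.Graph, per the discussion above):
insEdges :: (DynGraph gr) => [LEdge b] -> gr a b -> gr a b
insEdges es g = foldl' (flip insEdge) g es
{-# INLINABLE insEdges #-}
Marking insEdges INLINABLE exports its unfolding, so a call site at a concrete graph type can specialize it, and the insEdge inside it can then be rewritten by the rule.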

Related

Efficient bitstreams in Haskell

In an ongoing endeavour to efficiently fiddle with bits (e.g. see this SO question) the newest challenge is the efficient streaming and consumption of bits.
As a first simple task I chose to find the longest sequence of identical bits in a bitstream generated by /dev/urandom. A typical incantation would be head -c 1000000 </dev/urandom | my-exe. The actual goal is to stream bits and decode, for example, an Elias gamma code, i.e. codes that are not chunks of bytes or multiples thereof.
For such codes of variable length it is nice to have the take, takeWhile, group, etc. language for list manipulation. Since a BitStream.take would actually consume part of the bitstream, some monad would probably come into play.
The obvious starting point is the lazy bytestring from Data.ByteString.Lazy.
A. Counting bytes
This very simple Haskell program performs on par with a C program, as is to be expected.
import qualified Data.ByteString.Lazy as BSL

main :: IO ()
main = do
    bs <- BSL.getContents
    print $ BSL.length bs
B. Adding bytes
Once I start using unpack things should get worse.
main = do
    bs <- BSL.getContents
    print $ sum $ BSL.unpack bs
Surprisingly, Haskell and C show almost the same performance.
C. Longest sequence of identical bits
As a first nontrivial task the longest sequence of identical bits can be found like this:
module Main where

import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Lazy as BSL
import Data.List (group)
import Data.Word8 (Word8)

splitByte :: Word8 -> [Bool]
splitByte w = Prelude.map (\i -> (w `shiftR` i) .&. 1 == 1) [0..7]

bitStream :: BSL.ByteString -> [Bool]
bitStream bs = concat $ map splitByte (BSL.unpack bs)

main :: IO ()
main = do
    bs <- BSL.getContents
    print $ maximum $ length <$> (group $ bitStream bs)
The lazy bytestring is converted to a list [Word8] and then, using shifts, each Word8 is split into its bits, resulting in a list [Bool]. This list of lists is then flattened with concat. Having obtained a (lazy) list of Bool, use group to split the list into sequences of identical bits and then map length over it. Finally maximum gives the desired result. Quite simple, but not very fast:
# C
real 0m0.606s
# Haskell
real 0m6.062s
This naive implementation is exactly one order of magnitude slower.
Profiling shows that quite a lot of memory gets allocated (about 3GB for parsing 1MB of input). There is no massive space leak to be observed, though.
From here I start poking around:
There is a bitstream package that promises "Fast, packed, strict bit streams (i.e. list of Bools) with semi-automatic stream fusion.". Unfortunately it is not up-to-date with the current vector package, see here for details.
Next, I investigate streaming. I don't quite see why I should need 'effectful' streaming that brings some monad into play - at least until I start with the reverse of the posed task, i.e. encoding and writing bitstreams to file.
How about just fold-ing over the ByteString? I'd have to introduce state to keep track of consumed bits. That's not quite the nice take, takeWhile, group, etc. language that is desirable.
And now I'm not quite sure where to go.
Update:
I figured out how to do this with streaming and streaming-bytestring. I'm probably not doing this right because the result is catastrophically bad.
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Streaming as BSS
import Data.Word8 (Word8)
import qualified Streaming as S
import Streaming.Prelude (Of, Stream)
import qualified Streaming.Prelude as S

splitByte :: Word8 -> [Bool]
splitByte w = (\i -> (w `shiftR` i) .&. 1 == 1) <$> [0..7]

bitStream :: Monad m => Stream (Of Word8) m () -> Stream (Of Bool) m ()
bitStream s = S.concat $ S.map splitByte s

main :: IO ()
main = do
    let bs = BSS.unpack BSS.getContents :: Stream (Of Word8) IO ()
        gs = S.group $ bitStream bs :: Stream (Stream (Of Bool) IO) IO ()
    maxLen <- S.maximum $ S.mapped S.length gs
    print $ S.fst' maxLen
This will test your patience with anything beyond a few thousand bytes of input from stdin. The profiler says it spends an insane amount of time (quadratic in the input size) in Streaming.Internal.>>=.loop and Data.Functor.Of.fmap. I'm not quite sure what the first one is, but the fmap indicates (?) that the juggling of these Of a b isn't doing us any good and because we're in the IO monad it can't be optimised away.
I also have the streaming equivalent of the byte adder here: SumBytesStream.hs, which is slightly slower than the simple lazy ByteString implementation, but still decent. Since streaming-bytestring is proclaimed to be "bytestring io done right" I expected better. I'm probably not doing it right, then.
In any case, all these bit-computations shouldn't be happening in the IO monad. But BSS.getContents forces me into the IO monad because getContents :: MonadIO m => ByteString m () and there's no way out.
Update 2
Following the advice of @dfeuer I used the streaming package at master#HEAD. Here's the result.
longest-seq-c 0m0.747s (C)
longest-seq 0m8.190s (Haskell ByteString)
longest-seq-stream 0m13.946s (Haskell streaming-bytestring)
The O(n^2) problem of Streaming.concat is solved, but we're still not getting closer to the C benchmark.
Update 3
Cirdec's solution produces performance on par with C. The construct that is used is called "Church encoded lists", see this SO answer or the Haskell Wiki on rank-N types.
Source files:
All the source files can be found on github. The Makefile has all the various targets to run the experiments and the profiling. The default make will just build everything (create a bin/ directory first!) and then make time will do the timing on the longest-seq executables. The C executables get a -c appended to distinguish them.
Intermediate allocations and their corresponding overhead can be removed when operations on streams fuse together. The GHC prelude provides foldr/build fusion for lazy streams in the form of rewrite rules. The general idea is that if one function produces a result in build form (a function of type forall b. (a -> b -> b) -> b -> b, which the producer would otherwise apply to (:) and []) and another function consumes the list with a foldr, then constructing the intermediate list can be removed.
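As an aside, the build side of that idea can be sketched like this (an illustration only, not part of the solution below; GHC.Exts.build is the combinator the rules match on):
import GHC.Exts (build)

-- A producer written in build form: it abstracts over the list
-- constructors instead of applying (:) and [] directly.
enumTo :: Int -> [Int]
enumTo n = build (\cons nil ->
    let loop i | i > n     = nil
               | otherwise = cons i (loop (i + 1))
    in loop 1)

-- A foldr consumer; GHC's foldr/build rule can fuse the two so the
-- intermediate list is never constructed.
total :: Int -> Int
total n = foldr (+) 0 (enumTo n)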
For your problem I'm going to build something similar, but using strict left folds (foldl') instead of foldr. Instead of using rewrite rules that try to detect when something looks like a foldl, I'll use a data type that forces lists to look like left folds.
-- A list encoded as a strict left fold.
newtype ListS a = ListS {build :: forall b. (b -> a -> b) -> b -> b}
Since I've started by abandoning lists we'll be re-implementing part of the prelude for lists.
Strict left folds can be created from the foldl' functions of both lists and bytestrings.
{-# INLINE fromList #-}
fromList :: [a] -> ListS a
fromList l = ListS (\c z -> foldl' c z l)
{-# INLINE fromBS #-}
fromBS :: BSL.ByteString -> ListS Word8
fromBS l = ListS (\c z -> BSL.foldl' c z l)
The simplest example of using one is to find the length of a list.
{-# INLINE length' #-}
length' :: ListS a -> Int
length' l = build l (\z a -> z+1) 0
We can also map and concatenate left folds.
{-# INLINE map' #-}
-- fmap renamed so it can be inlined
map' f l = ListS (\c z -> build l (\z a -> c z (f a)) z)
{-# INLINE concat' #-}
concat' :: ListS (ListS a) -> ListS a
concat' ll = ListS (\c z -> build ll (\z l -> build l c z) z)
For your problem we need to be able to split a word into bits.
{-# INLINE splitByte #-}
splitByte :: Word8 -> [Bool]
splitByte w = Prelude.map (\i-> (w `shiftR` i) .&. 1 == 1) [0..7]
{-# INLINE splitByte' #-}
splitByte' :: Word8 -> ListS Bool
splitByte' = fromList . splitByte
And a ByteString into bits
{-# INLINE bitStream' #-}
bitStream' :: BSL.ByteString -> ListS Bool
bitStream' = concat' . map' splitByte' . fromBS
To find the longest run we'll keep track of the previous value, the length of the current run, and the length of the longest run. We make the fields strict so that the strictness of the fold prevents chains of thunks from being accumulated in memory. Making a strict data type for a state is an easy way to get control over both its memory representation and when its fields are evaluated.
data LongestRun = LongestRun !Bool !Int !Int

{-# INLINE extendRun #-}
extendRun (LongestRun previous run longest) x = LongestRun x current (max current longest)
  where
    current = if x == previous then run + 1 else 1

{-# INLINE longestRun #-}
longestRun :: ListS Bool -> Int
longestRun l = longest
  where
    (LongestRun _ _ longest) = build l extendRun (LongestRun False 0 0)
And we're done
main :: IO ()
main = do
    bs <- BSL.getContents
    print $ longestRun $ bitStream' bs
This is much faster, but not quite the performance of C.
longest-seq-c 0m00.12s (C)
longest-seq 0m08.65s (Haskell ByteString)
longest-seq-fuse 0m00.81s (Haskell ByteString fused)
The program allocates about 1 MB to read 1000000 bytes from input.
total alloc = 1,173,104 bytes (excludes profiling overheads)
Updated github code
I found another solution that is on par with C. The Data.Vector.Fusion.Stream.Monadic module has a stream implementation based on the Coutts, Leshchinskiy, Stewart 2007 paper. The idea behind it is to use destroy/unfoldr stream fusion.
Recall that list's unfoldr :: (b -> Maybe (a, b)) -> b -> [a] creates a list by repeatedly applying (unfolding) a step-forward function, starting with an initial value. A Stream is just an unfoldr function with starting state. (The Data.Vector.Fusion.Stream.Monadic library uses GADTs to create constructors for Step that can be pattern-matched conveniently. It could just as well be done without GADTs, I think.)
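As a toy reminder of the unfoldr pattern (an illustration only, not part of the solution):
import Data.List (unfoldr)

-- The step function either stops (Nothing) or emits an element
-- together with the successor state.
countdown :: Int -> [Int]
countdown = unfoldr (\n -> if n <= 0 then Nothing else Just (n, n - 1))
-- countdown 3 == [3,2,1]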
The central piece of the solution is the mkBitstream :: BSL.ByteString -> Stream Bool function that turns a ByteString into a stream of Bool. Basically, we keep track of the current ByteString, the current byte, and how much of the current byte is still unconsumed. Whenever a byte is used up, another byte is chopped off the ByteString; when uncons returns Nothing, the stream is Done.
The longestRun function is taken straight from @Cirdec's solution.
Here's the etude:
{-# LANGUAGE CPP #-}
#define PHASE_FUSED [1]
#define PHASE_INNER [0]
#define INLINE_FUSED INLINE PHASE_FUSED
#define INLINE_INNER INLINE PHASE_INNER
module Main where

import Control.Monad.Identity (Identity)
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Lazy as BSL
import Data.Functor.Identity (runIdentity)
import qualified Data.Vector.Fusion.Stream.Monadic as S
import Data.Word8 (Word8)

type Stream a = S.Stream Identity a -- no need for any monad, really

data Step = Step BSL.ByteString !Word8 !Word8 -- could use tuples, but this is faster

mkBitstream :: BSL.ByteString -> Stream Bool
mkBitstream bs' = S.Stream step (Step bs' 0 0) where
    {-# INLINE_INNER step #-}
    step (Step bs w n)
        | n == 0    = case BSL.uncons bs of
                        Nothing        -> return S.Done
                        Just (w', bs') -> return $
                          S.Yield (w' .&. 1 == 1) (Step bs' (w' `shiftR` 1) 7)
        | otherwise = return $
                        S.Yield (w .&. 1 == 1) (Step bs (w `shiftR` 1) (n-1))

data LongestRun = LongestRun !Bool !Int !Int

{-# INLINE extendRun #-}
extendRun :: LongestRun -> Bool -> LongestRun
extendRun (LongestRun previous run longest) x = LongestRun x current (max current longest)
  where current = if x == previous then run + 1 else 1

{-# INLINE longestRun #-}
longestRun :: Stream Bool -> Int
longestRun s = runIdentity $ do
    (LongestRun _ _ longest) <- S.foldl' extendRun (LongestRun False 0 0) s
    return longest

main :: IO ()
main = do
    bs <- BSL.getContents
    print $ longestRun (mkBitstream bs)

Nested Parallelism in Repa

The following code produces the (dreaded) "nested parallelism" error with repa-3.4.0.1:
import Control.Monad.Identity (runIdentity, liftM)
import Data.Array.Repa as R
import Data.Array.Repa.Repr.Unboxed
import Data.Vector.Unboxed
import Test.Framework
import Test.Framework.Providers.QuickCheck2
import Test.QuickCheck hiding (generate, output)

main :: IO ()
main = defaultMainWithArgs [prop, prop] ["--maximum-generated-tests=1"]

prop = testProperty "Test" $ property prop_fmap

prop_fmap :: Arr Int -> Bool
prop_fmap x = fmapT id x == x

newtype Arr r = Arr (Array U DIM1 r) deriving (Eq, Show)

instance (Arbitrary r, Unbox r) => Arbitrary (Arr r) where
  arbitrary = replM arbitrary
  shrink    = shrinkNothing

replM :: (Unbox r, Monad mon) => mon r -> mon (Arr r)
replM = let n = 6
        in liftM (Arr . fromUnboxed (Z:.n)) . replicateM n

fmapT :: (Unbox a, Unbox b) => (a -> b) -> Arr a -> Arr b
fmapT f (Arr v) = Arr $ force' $ R.map f $ v

force' :: (Shape sh, Unbox r) => Array D sh r -> Array U sh r
force' = runIdentity . computeP
The exact error is:
Performing nested parallel computation sequentially. You've probably
called the 'compute' or 'copy' function while another instance was
already running. This can happen if the second version was suspended
due to lazy evaluation. Use 'deepSeqArray' to ensure that each array
is fully evaluated before you 'compute' the next one.
I'm compiling with ghc Main -threaded and running with Main +RTS -N2. I've tried using deepSeqArray in the definition of fmapT, but it doesn't help. Since the tests are independent (aside from sequencing the randomness), it's not clear how nested parallelism is even possible in this example.
Interestingly, if I change main to just quickCheck prop_fmap, it successfully completes 100 tests. So it seems there's something going on with test-framework (probably a more general problem with monad sequencing) that is different from QuickCheck.
Any ideas on why I'm getting this error and how I can avoid it while still doing a parallel computation?
Note: using the Identity monad: ref1, ref2
The problem is that test-framework is implicitly multi-threaded (see this for example.)
The solution that works for me is to run defaultMainWithArgs with the option --threads=1.
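In code, that just means passing the extra option through defaultMainWithArgs (a sketch based on the question's main; the Repa computation itself still runs in parallel under +RTS -N2):
main :: IO ()
main = defaultMainWithArgs [prop, prop]
         ["--maximum-generated-tests=1", "--threads=1"]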

What are good Haskell conventions for managing deeply nested bracket patterns?

I am currently working with Haskell bindings to an HDF5 C library. Like many C libraries, this one uses many pointers in its function calls.
The usual "best practice" Haskell functions for allocating and releasing C resources follow the bracket pattern, like alloca, withArray, etc. In using them, I often enter several nested brackets. For instance, here is a small excerpt for HDF5 bindings:
selectHyperslab rID dName = withDataset rID dName $ \dID -> do
  v <- withDataspace 10 $ \dstDS -> do
    srcDS <- c'H5Dget_space dID
    dat <- alloca3 (0, 1, 10) $ \(start, stride, count) -> do
      err <- c'H5Sselect_hyperslab srcDS c'H5S_SELECT_SET start stride count nullPtr
      -- do some work ...
      return value

alloca3 (a, b, c) action =
  alloca $ \aP -> do
    poke aP a
    alloca $ \bP -> do
      poke bP b
      alloca $ \cP -> do
        poke cP c
        action (aP, bP, cP)
In the code above, the nested brackets come from bracket functions I wrote - withDataset, withDataspace, and alloca3 - the last of which exists purely to keep the bracket nesting from going another three levels deep. For C libraries with lots of resource acquisition calls and pointer arguments, coding with the standard bracket primitives can get unmanageable.
So generally, are there any best practices or coding techniques to help reduce the nesting of brackets when needing to allocate and deallocate many resources (such as with C calls)? The only alternative I have found is the ResourceT transformer, which from the tutorial looks like it is designed to make interleaving resource acquire/release possible, and not to simplify the bracket pattern.
Recently I was investigating this problem in Scala. The recurring pattern is (a -> IO r) -> IO r, where a given function is executed within some resource allocation context given a value of type a. And this is just ContT r IO a, which is readily available in Haskell. So we can write:
import Control.Monad
import Control.Monad.Cont
import Control.Monad.IO.Class
import Control.Exception (bracket)
import Foreign.Ptr (Ptr)
import Foreign.Storable (Storable, poke) -- poke is used by allocAndPoke below
import Foreign.Marshal.Alloc (alloca)

allocaC :: Storable a => ContT r IO (Ptr a)
allocaC = ContT alloca

bracketC :: IO a -> (a -> IO b) -> ContT r IO a
bracketC start end = ContT (bracket start end)

bracketC_ :: IO a -> IO b -> ContT r IO a
bracketC_ start end = ContT (bracket start (const end))

-- ...etc...

-- | Example:
main :: IO ()
main = flip runContT return $ do
  bracketC_ (putStrLn "begin1") (putStrLn "end1")
  bracketC_ (putStrLn "begin2") (putStrLn "end2")
  liftIO $ putStrLn "..."
The standard monad/applicative functions allow you to simplify a lot of your code, for example:
allocAndPoke :: (Storable a) => a -> ContT r IO (Ptr a)
allocAndPoke x = allocaC >>= \ptr -> liftIO (poke ptr x) >> return ptr

-- With the monad, alloca3 probably won't be needed; just as an example:
alloca3C (a, b, c) =
  (,,) <$> allocAndPoke a <*> allocAndPoke b <*> allocAndPoke c

allocaManyC :: (Storable a) => [a] -> ContT r IO [Ptr a]
allocaManyC = mapM allocAndPoke
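A hedged usage sketch of the combinators above (assuming peek is also imported from Foreign.Storable; the peeks and arithmetic are purely for demonstration):
example :: IO ()
example = flip runContT return $ do
  (aP, bP, cP) <- alloca3C (0 :: Int, 1 :: Int, 10 :: Int)
  liftIO $ do
    a <- peek aP
    b <- peek bP
    c <- peek cP
    print (a + b + c)  -- 11; all three allocations are released on exit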

Join two consumers into a single consumer that returns multiple values?

I have been experimenting with the new pipes-http package and I had a thought. I have two parsers for a web page, one that returns line items and another a number from elsewhere in the page. When I grab the page, it'd be nice to string these parsers together and get their results at the same time from the same bytestring producer, rather than fetching the page twice or fetching all the html into memory and parsing it twice.
In other words, say you have two Consumers:
c1 :: Consumer a m r1
c2 :: Consumer a m r2
Is it possible to make a function like this:
combineConsumers :: Consumer a m r1 -> Consumer a m r2 -> Consumer a m (r1, r2)
combineConsumers = undefined
I have tried a few things, but I can't figure it out. I understand if it isn't possible, but it would be convenient.
Edit:
I'm sorry; it turns out I was making an assumption about pipes-attoparsec, based on my experience with conduit-attoparsec, that caused me to ask the wrong question. pipes-attoparsec turns an attoparsec parser into a pipes Parser, whereas I had just assumed it would return a pipes Consumer. That means I can't actually turn two attoparsec parsers into consumers that take text and return a result, and then use them with the plain old pipes ecosystem. I'm sorry, but I just don't understand pipes-parse.
Even though it doesn't help me, Arthur's answer is pretty much what I envisioned when I asked the question, and I'll probably end up using his solution in the future. In the meantime I'm just going to use conduit.
If the results are "monoidal", you can use the tee function from the Pipes prelude, in combination with a WriterT.
{-# LANGUAGE OverloadedStrings #-}

import Data.Monoid
import Control.Monad
import Control.Monad.Writer
import Control.Monad.Writer.Class
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.Text as T

textSource :: Producer T.Text IO ()
textSource = yield "foo" >> yield "bar" >> yield "foo" >> yield "nah"

counter :: Monoid w => T.Text
        -> (T.Text -> w)
        -> Consumer T.Text (WriterT w IO) ()
counter word inject = P.filter (==word) >-> P.mapM (tell . inject) >-> P.drain

main :: IO ()
main = do
    result <- runWriterT $ runEffect $
        hoist lift textSource >->
        P.tee (counter "foo" inject1) >-> (counter "bar" inject2)
    putStrLn . show $ result
  where
    inject1 _ = (,) (Sum 1) mempty
    inject2 _ = (,) mempty (Sum 1)
Update: As mentioned in a comment, the real problem I see is that in pipes, parsers aren't Consumers. And how can you run two parsers concurrently if they have different behaviours regarding leftovers? What happens if one of the parsers wants to "un-draw" some text and the other parser doesn't?
One possible solution is to run the parsers in a truly concurrent manner, in different threads. The primitives in the pipes-concurrency package let you "duplicate" a Producer by writing the same data to two different mailboxes. And then each parser can do whatever it wants with its own copy of the producer. Here's an example which also uses the pipes-parse, pipes-attoparsec and async packages:
{-# LANGUAGE OverloadedStrings #-}

import Data.Monoid
import qualified Data.Text as T
import Data.Attoparsec.Text hiding (takeWhile)
import Data.Attoparsec.Combinator
import Control.Applicative
import Control.Monad
import Control.Monad.State.Strict
import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.Attoparsec as P
import qualified Pipes.Concurrent as P
import qualified Control.Concurrent.Async as A

parseChars :: Char -> Parser [Char]
parseChars c = fmap mconcat $
    many (notChar c) *> many1 (some (char c) <* many (notChar c))

textSource :: Producer T.Text IO ()
textSource = yield "foo" >> yield "bar" >> yield "foo" >> yield "nah"

parseConc :: Producer T.Text IO ()
          -> Parser a
          -> Parser b
          -> IO (Either P.ParsingError a, Either P.ParsingError b)
parseConc producer parser1 parser2 = do
    (outbox1, inbox1, seal1) <- P.spawn' P.Unbounded
    (outbox2, inbox2, seal2) <- P.spawn' P.Unbounded
    feeding <- A.async $ runEffect $ producer >-> P.tee (P.toOutput outbox1)
                                              >-> P.toOutput outbox2
    sealing <- A.async $ A.wait feeding >> P.atomically seal1 >> P.atomically seal2
    r <- A.runConcurrently $
        (,) <$> (A.Concurrently $ parseInbox parser1 inbox1)
            <*> (A.Concurrently $ parseInbox parser2 inbox2)
    A.wait sealing
    return r
  where
    parseInbox parser inbox = evalStateT (P.parse parser) (P.fromInput inbox)

main :: IO ()
main = do
    (Right a, Right b) <- parseConc textSource (parseChars 'o') (parseChars 'a')
    putStrLn . show $ (a, b)
The result is:
("oooo","aa")
I'm not sure how much overhead this approach introduces.
I think something is wrong with the way you are going about this, for the reasons Davorak mentions in his remark. But if you really need such a function, you can define it.
import Pipes.Internal
import Pipes.Core

zipConsumers :: Monad m => Consumer a m r -> Consumer a m s -> Consumer a m (r, s)
zipConsumers p q = go (p, q) where
  go (p, q) = case (p, q) of
    (Pure r     , Pure s     ) -> Pure (r, s)
    (M mpr      , ps         ) -> M (do pr <- mpr
                                        return (go (pr, ps)))
    (pr         , M mps      ) -> M (do ps <- mps
                                        return (go (pr, ps)))
    (Request _ f, Request _ g) -> Request () (\a -> go (f a, g a))
    (Request _ f, Pure s     ) -> Request () (\a -> do r <- f a
                                                       return (r, s))
    (Pure r     , Request _ g) -> Request () (\a -> do s <- g a
                                                       return (r, s))
    (Respond x _, _          ) -> closed x
    (_          , Respond y _) -> closed y
If you are 'zipping' consumers without using their return values, only their 'effects', you can just use tee consumer1 >-> consumer2.
The idiomatic solution is to rewrite your Consumers as a Fold or FoldM from the foldl library and then combine them using Applicative style. You can then convert this combined fold to one that works on pipes.
Let's assume that you either have two Folds:
fold1 :: Fold a r1
fold2 :: Fold a r2
... or two FoldMs:
foldM1 :: Monad m => FoldM m a r1
foldM2 :: Monad m => FoldM m a r2
Then you combine these into a single Fold/FoldM using Applicative style:
import Control.Applicative
foldBoth :: Fold a (r1, r2)
foldBoth = (,) <$> fold1 <*> fold2

foldBothM :: Monad m => FoldM m a (r1, r2)
foldBothM = (,) <$> foldM1 <*> foldM2

-- or: foldBoth  = liftA2 (,) fold1 fold2
--     foldBothM = liftA2 (,) foldM1 foldM2
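For a concrete feel, here is the same combination with folds that ship in Control.Foldl (an added example; L.sum and L.length are real folds from the foldl library):
import qualified Control.Foldl as L

-- One applicative combination, one pass over the input, two results.
sumAndLength :: L.Fold Int (Int, Int)
sumAndLength = (,) <$> L.sum <*> L.length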
You can turn either fold into a Pipes.Prelude-style fold or a Parser. Here are the necessary conversion functions:
import Control.Foldl (purely, impurely)
import qualified Pipes.Prelude as Pipes
import qualified Pipes.Parse as Parse

purely Pipes.fold
    :: Monad m => Fold a b -> Producer a m () -> m b

impurely Pipes.foldM
    :: Monad m => FoldM m a b -> Producer a m () -> m b

purely Parse.foldAll
    :: Monad m => Fold a b -> Parser a m b

impurely Parse.foldAllM
    :: Monad m => FoldM m a b -> Parser a m b
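Putting the pieces together, a small end-to-end sketch (the each producer and the fold combination here are illustrative, not from the question):
import qualified Control.Foldl as L
import Pipes (each)
import qualified Pipes.Prelude as Pipes

main :: IO ()
main = do
    -- One pass over the producer computes both results.
    (s, n) <- L.purely Pipes.fold ((,) <$> L.sum <*> L.length) (each [1 .. 10 :: Int])
    print (s, n)  -- (55,10)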
The reason for the purely and impurely functions is so that foldl and pipes can interoperate without either one incurring a dependency on the other. They also allow libraries other than pipes (like conduit) to reuse foldl without a dependency (hint hint, @MichaelSnoyman).
I apologize that this feature is not documented, mainly because it took me a while to figure out how to get pipes and foldl to interoperate in a dependency-free manner, and that was after I wrote the pipes tutorial. I will update the tutorial to point out this trick.
To learn how to use foldl, just read the documentation in the main module. It's a very small and easy-to-learn library.
For what it's worth, in the conduit world, the relevant function is zipSinks. There might be some way to adapt this function to work for pipes, but automatic termination may get in the way.
Consumer forms a Monad, so
combineConsumers = liftM2 (,)
will type check. Unfortunately, the semantics might be unlike what you're expecting: the first consumer will run to completion and then the second.
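A quick sketch of that sequencing (forever keeps the producer's return type polymorphic so the types line up; the comment on the result follows from the semantics just described):
import Control.Monad (forever, liftM2)
import Pipes

main :: IO ()
main = do
    r <- runEffect $ forever (each [1 :: Int, 2, 3]) >-> liftM2 (,) await await
    print r  -- (1,2): the first await runs to completion before the second starts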

Pure pseudo-random generators for Haskell with a nice API?

What are the recommended Haskell packages for pure pseudo-random generators (uniform Doubles)?
I'm interested in a convenient API in the first place, speed would be nice too.
Maybe mwc-random?
I like the mersenne-random-pure64 package. For example you can use it like this to generate an infinite lazy stream of random doubles from a seed value:
import Data.Word (Word64)
import Data.List (unfoldr)
import System.Random.Mersenne.Pure64

randomStream :: (PureMT -> (a, PureMT)) -> PureMT -> [a]
randomStream rndstep g = unfoldr (Just . rndstep) g

toStream :: Word64 -> [Double]
toStream seed = randomStream randomDouble $ pureMT seed

main = print . take 10 $ toStream 42
using System.Random (randoms)
You can get a similar output with the builtin randoms function, which is shorter and more general (Thanks to ehird for pointing that out):
import System.Random (randoms)
import System.Random.Mersenne.Pure64 (pureMT)

main = print . take 10 $ randomdoubles where
    randomdoubles :: [Double]
    randomdoubles = randoms $ pureMT 42
making it an instance of MonadRandom
After reading about MonadRandom I got curious how to get PureMT working as an instance of it. Out of the box it doesn't work, because PureMT doesn't implement RandomGen's split function. One way to make it work is to wrap PureMT in a newtype and write a custom split in a RandomGen instance, for which there exists a default MonadRandom instance.
import Control.Monad.Random
import System.Random.Mersenne.Pure64

getTenRandomDoubles :: Rand MyPureMT [Double]
getTenRandomDoubles = getRandoms >>= return . take 10

main = print $ evalRand getTenRandomDoubles g
  where g = MyPureMT $ pureMT 42

newtype MyPureMT = MyPureMT { unMyPureMT :: PureMT }

myPureMT = MyPureMT . pureMT

instance RandomGen MyPureMT where
  next  = nextMyPureMT
  split = splitMyPureMT

splitMyPureMT :: MyPureMT -> (MyPureMT, MyPureMT)
splitMyPureMT (MyPureMT g) = (myPureMT s, myPureMT s') where
  (s', g'') = randomWord64 g'
  (s , g' ) = randomWord64 g

nextMyPureMT (MyPureMT g) = (s, MyPureMT g') where
  (s, g') = randomInt g
The standard System.Random has a pure interface. I would recommend wrapping it in a State g (for whatever generator g you're using) to avoid threading the state; the state function makes turning functions like next into stateful actions easy:
next :: (RandomGen g) => g -> (Int, g)
state :: (s -> (a, s)) -> State s a
state next :: (RandomGen g) => State g Int
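A minimal sketch of that wrapping (assuming mtl's Control.Monad.State; for Double, random draws uniformly from [0, 1)):
import Control.Monad.State
import System.Random

-- Three uniform Doubles without manually threading the generator.
threeDoubles :: State StdGen [Double]
threeDoubles = replicateM 3 (state random)

main :: IO ()
main = do
    g <- getStdGen
    print (evalState threeDoubles g)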
The MonadRandom package is based on the State g interface with pre-written wrappers for the generator functions; I think it's fairly popular.
Note that you can still run actions using this pure interface on the global RNG. MonadRandom has evalRandIO for this purpose.
I think you could write an (orphan) RandomGen instance to use mwc-random with these.
A particularly nice package with a pure interface that is also suitable for cryptographic applications and yet maintains a high performance is the cprng-aes package.
It provides two interfaces: A deterministic pure one using the type classes from System.Random as well as a strong IO interface using the type classes from the Crypto-API package.
As a side note: I would generally prefer the mersenne-random packages over mwc-random. They use the original Mersenne Twister algorithm and in my benchmarks outperformed mwc-random by a large factor.
