Poor performance with transpose and cumulative sum in Repa - haskell

I have developed a cumulative sum function as defined below in the Haskell library Repa. However, I have run into an issue when combining this function with the transpose operation. All 3 of the following operations take well under a second:
cumsum $ cumsum $ cumsum x
transpose $ transpose $ transpose x
transpose $ cumsum x
However, if I write:
cumsum $ transpose x
performance degrades horrendously. While each individual operation in isolation takes well under a second on a 1920x1080 image, when combined they now take 30+ seconds...
Any ideas on what could be causing this? My gut tells me it has something to do with delayed arrays, not forcing at the right time, etc... But I do not have enough experience to track this down quite yet.
{-# LANGUAGE TypeOperators, FlexibleContexts, TypeFamilies #-}
import Data.Array.Repa as Repa
{-# INLINE indexSlice #-}
indexSlice :: (Shape sh, Elt a) => Int -> Array (sh :. Int) a -> (sh :. Int) -> a
indexSlice from arr (z :. ix) = arr `unsafeIndex` (z :. (ix + from))
{-# INLINE sliceRange #-}
sliceRange :: (Slice sh, Shape sh, Elt a) => Int -> Int -> Array (sh :. Int) a -> Array (sh :. Int) a
sliceRange from to arr = fromFunction (z :. (to - from + 1)) $ indexSlice from arr
where (z :. _) = extent arr
{-# INLINE cumsum' #-}
cumsum' :: (Slice (SliceShape sh), Slice sh, Shape (FullShape sh), Shape (SliceShape sh), Elt a, Num a) =>
Array (FullShape sh :. Int) a -> t -> (sh :. Int) -> a
cumsum' arr f (sh :. outer) = Repa.sumAll $ sliceRange 0 outer $ Repa.slice arr (sh :. All)
{-# INLINE cumsum #-}
cumsum :: (FullShape sh ~ sh, Slice sh, Slice (SliceShape sh), Shape sh, Shape (SliceShape sh), Elt a, Num a) =>
Array (sh :. Int) a -> Array (sh :. Int) a
cumsum arr = Repa.force $ unsafeTraverse arr id $ cumsum' arr

From a library implementor's perspective, the way to debug this is to create a wrapper for the suspect operation, then look at the core code to see if fusion has worked.
-- Main.hs ---------------------------------------------------
import Solver
import Data.Array.Repa.IO.BMP
main
= do Right img <- readImageFromBMP "whatever.bmp"
print $ cumsumBMP img
-- Solver.hs --------------------------------------------------
{-# LANGUAGE TypeOperators, FlexibleContexts, TypeFamilies #-}
module Solver (cumsumBMP) where
import Data.Array.Repa as Repa
import Data.Word
{- all your defs -}
{-# NOINLINE cumsumBMP #-}
cumsumBMP :: Array DIM3 Word8 -> Array DIM3 Word8
cumsumBMP img = cumsum $ transpose img
I've put the "solver" code in a separate module, so we only have to wade through the core code for the definitions we care about.
Compile like:
touch Solver.hs ; ghc -O2 --make Main.hs \
-ddump-simpl -dsuppress-module-prefixes -dsuppress-coercions > dump
Go to the definition of cumsumBMP and search for the letrec keyword. Searching for letrec is a quick way to find the inner loops.
Not too far down I see this: (slightly reformatted)
case gen_a1tr
of _ {
GenManifest vec_a1tv ->
case sh2_a1tc `cast` ... of _ { :. sh3_a1iu sh4_a1iv ->
case ix'_a1t9 `cast` ... of _ { :. sh1'_a1iz sh2'_a1iA ->
case sh3_a1iu `cast` ... of _ { :. sh5_X1n0 sh6_X1n2 ->
case sh1'_a1iz `cast` ... of _ { :. sh1'1_X1n9 sh2'1_X1nb ->
case sh5_X1n0 of _ { :. sh7_X1n8 sh8_X1na ->
...
case sh2'1_X1nb of _ { I# y3_X1nO ->
case sh4_a1iv of _ { I# y4_X1nP ->
case sh2'_a1iA of _ { I# y5_X1nX ->
...
let { x3_a1x6 :: Int# [LclId]
x3_a1x6 =
+#
(*#
(+#
(*#
y1_a1iM
y2_X1nG)
y3_X1nO)
y4_X1nP)
y5_X1nX } in
case >=#
x3_a1x6
0
of ...
Disaster! The x3_a1x6 binding is clearly doing some useful work (multiplications, additions and suchlike) but it's wrapped in a long series of unboxing operations that are also executed for every loop iteration. What's worse is that it's unboxing the length and width (shape) of the array at every iteration, and this information will always be the same. GHC should really float these case expressions out of the loop, but it doesn't yet. This is an instance of Issue #4081 on the GHC trac, which hopefully will be fixed sometime soon.
The work around is to apply deepSeqArray to the incoming array. This places a demand on its value at the top level (outside the loop) which lets GHC know it's ok to move the case matches further up. For a function like cumsumBMP, we also expect the incoming array to already be manifest, so we can add an explicit case match for this:
{-# NOINLINE cumsumBMP #-}
cumsumBMP :: Array DIM3 Word8 -> Array DIM3 Word8
cumsumBMP img#(Array _ [Region RangeAll (GenManifest _)])
= img `deepSeqArray` cumsum $ transpose img
Compiling again, the inner loop now looks much better:
letrec {
$s$wfoldlM'_loop_s2mW [...]
:: Int# -> Word# -> Word# [...]
$s$wfoldlM'_loop_s2mW =
\ (sc_s2mA :: Int#) (sc1_s2mB :: Word#) ->
case <=# sc_s2mA a_s2ji of _ {
False -> sc1_s2mB;
True ->
$s$wfoldlM'_loop_s2mW
(+# sc_s2mA 1)
(narrow8Word#
(plusWord#
sc1_s2mB
(indexWord8Array#
rb3_a2gZ
(+#
rb1_a2gX
(+#
(*#
(+#
(*#
wild19_X1zO
ipv1_X1m5)
sc_s2mA)
ipv2_X1m0)
wild20_X1Ct)))))
}; } in
That's a tight, tail recursive loop that only uses primitive operations. Provided you compile with -fllvm -optlo-O3, there's no reason that won't run as fast as an equivalent C program.
There's a slight hiccup when running it though:
desire:tmp benl$ ./Main
Main: Solver.hs:(50,1)-(51,45): Non-exhaustive patterns in function cumsumBMP
This just reminds us that we need to force the array before calling cumsumBMP.
-- Main.hs ---------------------------------------------------
...
import Data.Array.Repa as Repa
main
= do Right img <- readImageFromBMP "whatever.bmp"
print $ cumsumBMP $ Repa.force img
In summary:
You need to add some deepSeqArray and pattern matching goop to your top level
functions to work around a current infelicity in GHC. This is demonstrated by
the final version of the cumsumBMP function above. If you want GHC HQ to fix
this soon then add yourself as a cc to Issue #4081 on the GHC trac. Repa programs will be much prettier when this is fixed.
You don't need to add the goop to every function. In this example I didn't need to touch indexSlice and friends. The general rule is to add the goop to functions that use force, fold or sumAll. These functions instantiate the actual loops that operate over the array data, that is, they convert a delayed array to a manifest value.
The performance of a piece of Repa code is determined as much by the context in which it's used as the actual code. If you pass your top level functions delayed arrays then they will run very slowly. There is more discussion of this in The Repa Tutorial.
BMP files read with the repa-io library aren't pre-forced, so you need to force them before use. This is probably the wrong default, so I'll change it in the next version.

Related

Efficient bitstreams in Haskell

In an ongoing endeavour to efficiently fiddle with bits (e.g. see this SO question) the newest challenge is the efficient streaming and consumption of bits.
As a first simple task I choose to find the longest sequence of identical bits in a bitstream generated by /dev/urandom. A typical incantation would be head -c 1000000 </dev/urandom | my-exe. The actual goal is to stream bits and decode an Elias gamma code, for example, i.e. codes that are not chunks of bytes or multiples thereof.
For such codes of variable length it is nice to have the take, takeWhile, group, etc. language for list manipulation. Since a BitStream.take would actually consume part of the bistream some monad would probably come into play.
The obvious starting point is the lazy bytestring from Data.ByteString.Lazy.
A. Counting bytes
This very simple Haskell program performs on par with a C program, as is to be expected.
import qualified Data.ByteString.Lazy as BSL
main :: IO ()
main = do
bs <- BSL.getContents
print $ BSL.length bs
B. Adding bytes
Once I start using unpack things should get worse.
main = do
bs <- BSL.getContents
print $ sum $ BSL.unpack bs
Suprisingly, Haskell and C show the almost same performance.
C. Longest sequence of identical bits
As a first nontrivial task the longest sequence of identical bits can be found like this:
module Main where
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Lazy as BSL
import Data.List (group)
import Data.Word8 (Word8)
splitByte :: Word8 -> [Bool]
splitByte w = Prelude.map (\i-> (w `shiftR` i) .&. 1 == 1) [0..7]
bitStream :: BSL.ByteString -> [Bool]
bitStream bs = concat $ map splitByte (BSL.unpack bs)
main :: IO ()
main = do
bs <- BSL.getContents
print $ maximum $ length <$> (group $ bitStream bs)
The lazy bytestring is converted to a list [Word8] and then, using shifts, each Word is split into the bits, resulting in a list [Bool]. This list of lists is then flattened with concat. Having obtained a (lazy) list of Bool, use group to split the list into sequences of identical bits and then map length over it. Finally maximum gives the desired result. Quite simple, but not very fast:
# C
real 0m0.606s
# Haskell
real 0m6.062s
This naive implementation is exactly one order of magnitude slower.
Profiling shows that quite a lot of memory gets allocated (about 3GB for parsing 1MB of input). There is no massive space leak to be observed, though.
From here I start poking around:
There is a bitstream package that promises "Fast, packed, strict bit streams (i.e. list of Bools) with semi-automatic stream fusion.". Unfortunately it is not up-to-date with the current vector package, see here for details.
Next, I investigate streaming. I don't quite see why I should need 'effectful' streaming that brings some monad into play - at least until I start with the reverse of the posed task, i.e. encoding and writing bitstreams to file.
How about just fold-ing over the ByteString? I'd have to introduce state to keep track of consumed bits. That's not quite the nice take, takeWhile, group, etc. language that is desirable.
And now I'm not quite sure where to go.
Update:
I figured out how to do this with streaming and streaming-bytestring. I'm probably not doing this right because the result is catastrophically bad.
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Streaming as BSS
import Data.Word8 (Word8)
import qualified Streaming as S
import Streaming.Prelude (Of, Stream)
import qualified Streaming.Prelude as S
splitByte :: Word8 -> [Bool]
splitByte w = (\i-> (w `shiftR` i) .&. 1 == 1) <$> [0..7]
bitStream :: Monad m => Stream (Of Word8) m () -> Stream (Of Bool) m ()
bitStream s = S.concat $ S.map splitByte s
main :: IO ()
main = do
let bs = BSS.unpack BSS.getContents :: Stream (Of Word8) IO ()
gs = S.group $ bitStream bs :: Stream (Stream (Of Bool) IO) IO ()
maxLen <- S.maximum $ S.mapped S.length gs
print $ S.fst' maxLen
This will test your patience with anything beyond a few thousand bytes of input from stdin. The profiler says it spends an insane amount of time (quadratic in the input size) in Streaming.Internal.>>=.loop and Data.Functor.Of.fmap. I'm not quite sure what the first one is, but the fmap indicates (?) that the juggling of these Of a b isn't doing us any good and because we're in the IO monad it can't be optimised away.
I also have the streaming equivalent of the byte adder here: SumBytesStream.hs, which is slightly slower than the simple lazy ByteString implementation, but still decent. Since streaming-bytestring is proclaimed to be "bytestring io done right" I expected better. I'm probably not doing it right, then.
In any case, all these bit-computations shouldn't be happening in the IO monad. But BSS.getContents forces me into the IO monad because getContents :: MonadIO m => ByteString m () and there's no way out.
Update 2
Following the advice of #dfeuer I used the streaming package at master#HEAD. Here's the result.
longest-seq-c 0m0.747s (C)
longest-seq 0m8.190s (Haskell ByteString)
longest-seq-stream 0m13.946s (Haskell streaming-bytestring)
The O(n^2) problem of Streaming.concat is solved, but we're still not getting closer to the C benchmark.
Update 3
Cirdec's solution produces a performance on par with C. The construct that is used is called "Church encoded lists", see this SO answer or the Haskell Wiki on rank-N types.
Source files:
All the source files can be found on github. The Makefile has all the various targets to run the experiments and the profiling. The default make will just build everything (create a bin/ directory first!) and then make time will do the timing on the longest-seq executables. The C executables get a -c appended to distinguish them.
Intermediate allocations and their corresponding overhead can be removed when operations on streams fuse together. The GHC prelude provides foldr/build fusion for lazy streams in the form of rewrite rules. The general idea is if one function produces a result that looks like a foldr (it has the type (a -> b -> b) -> b -> b applied to (:) and []) and another function consumes a list that looks like a foldr, constructing the intermediate list can be removed.
For your problem I'm going to build something similar, but using strict left folds (foldl') instead of foldr. Instead of using rewrite rules that try to detect when something looks like a foldl, I'll use a data type that forces lists to look like left folds.
-- A list encoded as a strict left fold.
newtype ListS a = ListS {build :: forall b. (b -> a -> b) -> b -> b}
Since I've started by abandoning lists we'll be re-implementing part of the prelude for lists.
Strict left folds can be created from the foldl' functions of both lists and bytestrings.
{-# INLINE fromList #-}
fromList :: [a] -> ListS a
fromList l = ListS (\c z -> foldl' c z l)
{-# INLINE fromBS #-}
fromBS :: BSL.ByteString -> ListS Word8
fromBS l = ListS (\c z -> BSL.foldl' c z l)
The simplest example of using one is to find the length of a list.
{-# INLINE length' #-}
length' :: ListS a -> Int
length' l = build l (\z a -> z+1) 0
We can also map and concatenate left folds.
{-# INLINE map' #-}
-- fmap renamed so it can be inlined
map' f l = ListS (\c z -> build l (\z a -> c z (f a)) z)
{-# INLINE concat' #-}
concat' :: ListS (ListS a) -> ListS a
concat' ll = ListS (\c z -> build ll (\z l -> build l c z) z)
For your problem we need to be able to split a word into bits.
{-# INLINE splitByte #-}
splitByte :: Word8 -> [Bool]
splitByte w = Prelude.map (\i-> (w `shiftR` i) .&. 1 == 1) [0..7]
{-# INLINE splitByte' #-}
splitByte' :: Word8 -> ListS Bool
splitByte' = fromList . splitByte
And a ByteString into bits
{-# INLINE bitStream' #-}
bitStream' :: BSL.ByteString -> ListS Bool
bitStream' = concat' . map' splitByte' . fromBS
To find the longest run we'll keep track of the previous value, the length of the current run, and the length of the longest run. We make the fields strict so that the strictness of the fold prevents chains of thunks from being accumulated in memory. Making a strict data type for a state is an easy way to get control over both its memory representation and when its fields are evaluated.
data LongestRun = LongestRun !Bool !Int !Int
{-# INLINE extendRun #-}
extendRun (LongestRun previous run longest) x = LongestRun x current (max current longest)
where
current = if x == previous then run + 1 else 1
{-# INLINE longestRun #-}
longestRun :: ListS Bool -> Int
longestRun l = longest
where
(LongestRun _ _ longest) = build l extendRun (LongestRun False 0 0)
And we're done
main :: IO ()
main = do
bs <- BSL.getContents
print $ longestRun $ bitStream' bs
This is much faster, but not quite the performance of c.
longest-seq-c 0m00.12s (C)
longest-seq 0m08.65s (Haskell ByteString)
longest-seq-fuse 0m00.81s (Haskell ByteString fused)
The program allocates about 1 Mb to read 1000000 bytes from input.
total alloc = 1,173,104 bytes (excludes profiling overheads)
Updated github code
I found another solution that is on par with C. The Data.Vector.Fusion.Stream.Monadic has a stream implementation based on this Coutts, Leshchinskiy, Stewart 2007 paper. The idea behind it is to use a destroy/unfoldr stream fusion.
Recall that list's unfoldr :: (b -> Maybe (a, b)) -> b -> [a] creates a list by repeatedly applying (unfolding) a step-forward function, starting with an initial value. A Stream is just an unfoldr function with starting state. (The Data.Vector.Fusion.Stream.Monadic library uses GADTs to create constructors for Step that can be pattern-matched conveniently. It could just as well be done without GADTs, I think.)
The central piece of the solution is the mkBitstream :: BSL.ByteString -> Stream Bool function that turns a BytesString into a stream of Bool. Basically, we keep track of the current ByteString, the current byte, and how much of the current byte is still unconsumed. Whenever a byte is used up another byte is chopped off ByteString. When Nothing is left, the stream is Done.
The longestRun function is taken straight from #Cirdec's solution.
Here's the etude:
{-# LANGUAGE CPP #-}
#define PHASE_FUSED [1]
#define PHASE_INNER [0]
#define INLINE_FUSED INLINE PHASE_FUSED
#define INLINE_INNER INLINE PHASE_INNER
module Main where
import Control.Monad.Identity (Identity)
import Data.Bits (shiftR, (.&.))
import qualified Data.ByteString.Lazy as BSL
import Data.Functor.Identity (runIdentity)
import qualified Data.Vector.Fusion.Stream.Monadic as S
import Data.Word8 (Word8)
type Stream a = S.Stream Identity a -- no need for any monad, really
data Step = Step BSL.ByteString !Word8 !Word8 -- could use tuples, but this is faster
mkBitstream :: BSL.ByteString -> Stream Bool
mkBitstream bs' = S.Stream step (Step bs' 0 0) where
{-# INLINE_INNER step #-}
step (Step bs w n) | n==0 = case (BSL.uncons bs) of
Nothing -> return S.Done
Just (w', bs') -> return $
S.Yield (w' .&. 1 == 1) (Step bs' (w' `shiftR` 1) 7)
| otherwise = return $
S.Yield (w .&. 1 == 1) (Step bs (w `shiftR` 1) (n-1))
data LongestRun = LongestRun !Bool !Int !Int
{-# INLINE extendRun #-}
extendRun :: LongestRun -> Bool -> LongestRun
extendRun (LongestRun previous run longest) x = LongestRun x current (max current longest)
where current = if x == previous then run + 1 else 1
{-# INLINE longestRun #-}
longestRun :: Stream Bool -> Int
longestRun s = runIdentity $ do
(LongestRun _ _ longest) <- S.foldl' extendRun (LongestRun False 0 0) s
return longest
main :: IO ()
main = do
bs <- BSL.getContents
print $ longestRun (mkBitstream bs)

Haskell module optimization

I have a problem with Haskell module optimization.
There is Main module.
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Control.DeepSeq
import Formatting
import Formatting.Clock
import System.Clock
import Data.Array
size :: Int
size = 200 :: Int
stdMult :: (Ix a, Ix b, Ix c, Num d) =>
Array (a,b) d -> Array (b,c) d -> Array (a,c) d
stdMult x y = array resultBounds
[((i,j), sum [ x!(i,k) * y!(k,j) | k <- range (lj,uj)])
| i <- range (li,ui),
j <- range (lj',uj') ]
where ((li,lj),(ui,uj)) = bounds x
((li',lj'),(ui',uj')) = bounds y
resultBounds
| (lj,uj)==(li',ui') = ((li,lj'),(ui,uj'))
| otherwise = error "error"
main :: IO ()
main = do
let a = array ((1,1),(size, size)) [((i,j), 2*i-j) |
i <- range (1,size),
j <- range (1,size)]
let b = array ((1,1),(size, size)) [((i,j), 2*i+3*j) |
i <- range (1,size)`,
j <- range (1,size)]
start <- getTime ProcessCPUTime
let
c = stdMult a b
end <- c `deepseq` getTime ProcessCPUTime
fprint (timeSpecs % "\n") start end
return()
When stdMult in Main module, everything works ok. I replace stdMult to another module.
When I don't use ghc optimization, execution time is the same.
When I use ghc options -O3, when stdMult in Main module time execution decreases, but when stdMult in another module, execution time is almost unchanged!
For example, when stdMult in Main I have time 3 seconds, and when stdMult not in Main I have time 30 seconds, for matrix 500x500.
It is very strange!
(You need the clock and formatting packages from Hackage to compile the code.)
I can reproduce the 10x slowdown when stdMult is in a different module. Luckily a fix is easy: in the module where stdMult is defined, add an INLINABLE pragma:
{-# INLINABLE stdMult #-}
It adds the definition to the interface file (.hi) which allows inlining in the modules that uses it, which in turn allows it to be specialized to fast machine Int instead of slow abstract Ix and Num polymorphic code. (If it's in the same module GHC can inline and specialize at will, and things aren't INLINABLE by default because it can cause executable code bloat and slower compilation.)
Alternatively to INLINABLE, you can manually SPECIALIZE to the types you want optimized implementations for. This is a bit more verbose, but should be faster to compile in big projects (it will be specialized once per export, instead of once per import, at a rough guess).
{-# SPECIALIZE stdMult :: Array (Int, Int) Int -> Array (Int, Int) Int -> Array (Int, Int) Int #-}

Library ghc RULES don't activate

fgl is a Haskell library for graph manipulation. This library comes with an implementation of its base classes - Data.Graph.Inductive.PatriciaTree - that is supposedly highly tuned for performance. Part of that performance tuning involves ghc RULES pragmas to replace certain generic functions with specialized versions that are much faster.
However, my evidence is that these RULES don't seem to work at all, and I don't understand why not. For people trying to replicate exactly what I see, I've put my test project up at https://github.com/fizbin/GraphOptiTest and am using ghc version 7.10.2.
Here's my test program:
{-# LANGUAGE TupleSections #-}
module Main where
import Control.Exception
import Control.Monad
import Data.Graph.Inductive
import qualified Data.Graph.Inductive.PatriciaTree as Pt
import qualified MyPatriciaTree as MPt
makeGraph :: (DynGraph gr) => Int -> gr () Int
makeGraph n = mkGraph (map (,()) [1 .. n])
(concatMap (\x -> map (\y -> (x, y, x*y)) [x .. n]) [1 .. n])
main1 :: IO ()
main1 =
replicateM_ 200 $ let x = makeGraph 200 :: Pt.Gr () Int
in evaluate (length $ show x)
main2 :: IO ()
main2 =
replicateM_ 200 $ let x = makeGraph 200 :: MPt.Gr () Int
in evaluate (length $ show x)
main :: IO ()
main = main1 >> main2
Now, Data.Graph.Inductive.PatriciaTree has this definition for the class function mkGraph:
mkGraph vs es = insEdges es
. Gr
. IM.fromList
. map (second (\l -> (IM.empty,l,IM.empty)))
$ vs
Where insEdges is a function defined in the module Data.Graph.Inductive.Graph as:
insEdges :: (DynGraph gr) => [LEdge b] -> gr a b -> gr a b
insEdges es g = foldl' (flip insEdge) g es
And Data.Graph.Inductive.PatriciaTree has this to say about insEdge:
{-# RULES
"insEdge/Data.Graph.Inductive.PatriciaTree" insEdge = fastInsEdge
#-}
fastInsEdge :: LEdge b -> Gr a b -> Gr a b
fastInsEdge (v, w, l) (Gr g) = g2 `seq` Gr g2
where
g1 = IM.adjust addSucc' v g
g2 = IM.adjust addPred' w g1
addSucc' (ps, l', ss) = (ps, l', IM.insertWith addLists w [l] ss)
addPred' (ps, l', ss) = (IM.insertWith addLists v [l] ps, l', ss)
So, in theory, when I run main1 in my test program I should get that compiled down into something that eventually calls fastInsEdge.
To test this, I compare against a modified version of Data.Graph.Inductive.PatriciaTree that uses this as its definition of the mkGraph method: (this is the class MyPatriciaTree used above in main2)
mkGraph vs es = doInsEdges
. Gr
. IM.fromList
. map (second (\l -> (IM.empty,l,IM.empty)))
$ vs
where
doInsEdges g = foldl' (flip fastInsEdge) g es
When I run my test program (after cabal configure --enable-library-profiling --enable-executable-profiling and cabal build GraphOptiTest), though, the main2 method smokes the main1 method. It isn't even close - the profile shows 99.2% of the program's time is spent inside main1. (and changing the program to just run main2 shows that yes, main2 is really fast on its own)
Yes, I do have -O in the ghc-options section of my cabal file.
Trying ghc options like -ddump-rule-firings doesn't really help - all I can see is that these replacement rules aren't firing, but I have no idea why. I don't know how to get the compiler to tell me why it didn't activate the replacement rules.
Bringing up something discovered by messing around with fgl's source in response #dfeuer's answer below:
If I add a specialized version of insEdges to Data.Graph.Inductive.PatriciaTree as:
{-# RULES
"insEdges/Data.Graph.Inductive.PatriciaTree" insEdges = fastInsEdges
#-}
fastInsEdges :: [LEdge b] -> Gr a b -> Gr a b
fastInsEdges es g = foldl' (flip fastInsEdge) g es
Then both main1 and main2 are now fast. This replacement rule fires; why doesn't the other one? (And no, telling ghc to NOINLINE the function insEdge does no good)
EPILOGUE:
So there's now a bug filed with the fgl package for not tagging their functions that use insEdge and insNode appropriately so that the fast versions will be used. But in my code now I work around this and the workaround may be useful in more situations, so I thought I'd share it. At the top of my code now, I have:
import qualified Data.Graph.Inductive as G
import qualified Data.Graph.Inductive.PatriciaTree as Pt
-- Work around design and implementation performance issues
-- in the Data.Graph.Inductive package.
-- Specifically, the tuned versions of insNode, insEdge, gmap, nmap, and emap
-- for PatriciaTree graphs are exposed only through RULES pragmas, meaning
-- that you only get them when the compiler can specialize the function
-- to that specific instance of G.DynGraph. Therefore, I create my own
-- type class with the functions that have specialized versions and use that
-- type class here; the compiler then can do the specialized RULES
-- replacement on the Pt.Gr instance of my class.
class (G.DynGraph gr) => MyDynGraph gr where
mkGraph :: [G.LNode a] -> [G.LEdge b] -> gr a b
insNodes :: [G.LNode a] -> gr a b -> gr a b
insEdges :: [G.LEdge b] -> gr a b -> gr a b
insNode :: G.LNode a -> gr a b -> gr a b
insEdge :: G.LEdge b -> gr a b -> gr a b
gmap :: (G.Context a b -> G.Context c d) -> gr a b -> gr c d
nmap :: (a -> c) -> gr a b -> gr c b
emap :: (b -> c) -> gr a b -> gr a c
instance MyDynGraph Pt.Gr where
mkGraph nodes edges = insEdges edges $ G.mkGraph nodes []
insNodes vs g = foldl' (flip G.insNode) g vs
insEdges es g = foldl' (flip G.insEdge) g es
insNode = G.insNode
insEdge = G.insEdge
gmap = G.gmap
nmap = G.nmap
emap = G.emap
(Had I used the nemap function in my code I would have included that in the class too) Then, any code of mine which was formerly written in terms of (G.DynGraph gr) => ... is now written in terms of (MyDynGraph gr) => .... The compiler RULES activate for the Pt.Gr instance, and I then get the optimized version for each function.
Essentially, this trades away the ability of the compiler to inline any of these functions into the calling code and possibly do other optimizations for always getting the optimized versions. (and the cost of an extra pointer indirection at runtime, but that's trivial in comparison) Since profiling showed that those other optimizations never yielded anything significant anyway, this was a clear net win in my case.
Many people's code could use SPECIALIZE rules aggressively to get the optimized versions everywhere; however, sometimes that isn't possible and it wasn't in the real production code that caused my question without refactoring huge chunks of the application. I had a data structure with a member that has the type (forall gr. G.DynGraph gr => tokType -> gr () (MyEdge c)) - that now uses MyDynGraph for the class constraint, but completely unwinding it to not have forall gr. in the signature would have been a huge effort, and such a signature prevents specialization from working across that boundary.
I haven't done any experiments, but here's my guess. The insEdge function is not marked with a (phased) INLINE or NOINLINE, so the inliner is free to inline it whenever it's fully applied. In the definition of insEdges, we see
foldl' (flip insEdge) g es
Inlining foldl' gives
foldr f' id es g
where f' x k z = k $! flip insEdge z x
flip is now fully applied, so we can inline it:
foldr f' id es g
where f' x k z = k $! insEdge x z
Now insEdge is fully applied, so GHC may choose to inline it right then and there, before the rule ever has a chance to fire.
Try adding {-# NOINLINE [0] insEdge #-} right by the definition of insEdge and see what happens. If it works, submit a pull request to fgl.
P.S. In my opinion, this sort of thing should really be done by using class methods with defaults, rather than rewrite rules. Rules are always a bit fussy.
As the comments revealed, the big problem wasn't premature inlining, but rather a failure to specialize insEdge. In particular, Data.Graph.Inductive.Graph does not export an unfolding for insEdges, so it's impossible to specialize it, and the insEdge it calls, to the appropriate type. The ultimate fix was to mark insEdges INLINABLE, but I would still advise marking insEdge NOINLINE [0] out of an abundance of caution.

Haskell / GHC: {-# SPECIALIZE #-} Causes 'RULE left-hand side too complicated to desugar' Warning

I have a body of code that uses a monad to abstract whether the actual implementation runs inside ST or IO. Removing the extra layer of abstraction and just substituting concrete types gives a huge speedup (~4.5x) due to the inlining and missing typeclass function call overhead. I was thinking of getting some of that performance by using a specialize pragma, but I'm getting a rather meaningless warning from the compiler. I can't make a simple reproduction case as the simple example seems to work, and I don't know what's causing the difference in my actual program.
Basically, my program does this:
{-# LANGUAGE FlexibleInstances, RankNTypes #-}
module STImpl (runAbstractST, MonadAbstractIOST(..), ReaderST) where
import Control.Monad.Reader
import Control.Monad.ST
class Monad m => MonadAbstractIOST m where
addstuff :: Int -> m Int
type ReaderST s = ReaderT (Int) (ST s)
instance MonadAbstractIOST (ReaderST s) where
addstuff a = return . (a +) =<< ask
runAbstractST :: (forall s. ReaderST s a) -> a
runAbstractST f = runST $ runReaderT f 99
and
module Main (main) where
import STImpl
import Control.Monad
{-# SPECIALIZE INLINE useAbstractMonad :: ReaderST s Int #-}
useAbstractMonad :: MonadAbstractIOST m => m Int
useAbstractMonad = foldM (\a b -> a `seq` return . (a +) =<< (addstuff b)) 0 [1..50000000]
main :: IO ()
main = do
let st = runAbstractST useAbstractMonad
putStrLn . show $ st
Now, here everything seems to work fine. But in my program I get
RULE left-hand side too complicated to desugar
let {
$dFunctor :: Functor (RSTSim s)
[LclId]
$dFunctor =
Control.Monad.Trans.Reader.$fFunctorReaderT
# (MonadSim.SimState s)
# (GHC.ST.ST s)
(GHC.ST.$fFunctorST # s) } in
simulate
# (Control.Monad.Trans.Reader.ReaderT
(MonadSim.SimState s) (GHC.ST.ST s))
(MonadSim.$fMonadSimReaderT
# s
$dFunctor
(Control.Monad.Trans.Reader.$fMonadReaderT
# (MonadSim.SimState s)
# (GHC.ST.ST s)
(GHC.ST.$fMonadST # s))
(Control.Monad.Trans.Reader.$fApplicativeReaderT
# (MonadSim.SimState s)
# (GHC.ST.ST s)
$dFunctor
(Control.Applicative.$fApplicativeST0
# s (GHC.ST.$fFunctorST # s))))
I don't understand what 'left-hand side', 'too complicated' and 'desugar' mean ;-)
It seems I have the same problem as described here: http://marc.info/?l=haskell-cafe&m=133242702914511
How do I diagnose this? How do I figure out what's causing the optimization to be disabled in my program?
Thanks!
For what it's worth, on the 7.10 RC1 this error no longer occurs, so it looks like the fix to https://ghc.haskell.org/trac/ghc/ticket/8848 may have helped.

How can Haskell quasiquotation be used for replacing tokens on the Haskell level?

Quasiquotation as described in haskellwiki is shown mostly as useful tool for embedding other languages inside Haskell without messing around with string quotation.
Question is: For Haskell itself, how easy it would be to put existing Haskell code through a quasiquoter for the purpose of just replacing tokens and passing the result over to ghc? Perhaps Template Haskell is key here?
I have looked for code examples and didn't find any. Some EDSLs can benefit from this ability by reducing the size of their combinating operators (e.g. turn 'a .|. b .>>. c' to '[myedsl|a | b >> c]').
You can build quasi-quoters that manipulate Haskell code by, for example, using the haskell-src-meta package. It parses valid Haskell code into an AST, which you can then modify.
In this case, the easiest way to modify the AST is by using Data.Generics to apply a generic transformation to the whole AST that replaces operators with other operators.
We'll begin by building the transformation function for generic Haskell expressions. The data type that represents an expression is Exp in the template-haskell package.
For example, to convert the operator >> to .>>. we'd use a function like
import Language.Haskell.TH (Exp(..), mkName)
replaceOp :: Exp -> Exp
replaceOp (VarE n) | n == mkName ">>" = VarE (mkName ".>>.")
replaceOp e = e
This changes a variable expression (VarE), but cannot do anything to any other kind of expressions.
Now, to walk the whole AST and to replace all occurrences of >> we'll use the functions everywhere and mkT from Data.Generic.
import Data.Generics (everywhere, mkT)
replaceEveryOp :: Exp -> Exp
replaceEveryOp = everywhere (mkT replaceOp)
In order to make several replacements, we can alter the function so that it takes an association list of any operator to replace.
type Replacements = [(String, String)]
replaceOps :: Replacements -> Exp -> Exp
replaceOps reps = everywhere (mkT f) where
f e#(VarE n) = case rep of
Just n' -> VarE (mkName n')
_ -> e
where rep = lookup (show n) reps
f e = e
And by the way, this is a good example of a function that is much nicer to write by using the view patterns language extension.
{-# LANGUAGE ViewPatterns #-}
replaceOps :: Replacements -> Exp -> Exp
replaceOps reps = everywhere (mkT f) where
f (VarE (replace -> Just n')) = VarE (mkName n')
f e = e
replace n = lookup (show n) reps
Now all that's left for us to do is to build the "myedsl" quasi-quoter.
{-# LANGUAGE ViewPatterns #-}
import Data.Generics (everywhere, mkT)
import Language.Haskell.Meta.Parse (parseExp)
import Language.Haskell.TH (Exp(..), mkName, ExpQ)
import Language.Haskell.TH.Quote (QuasiQuoter(..))
type Replacements = [(String, String)]
replacements :: Replacements
replacements =
[ ("||", ".|.")
, (">>", ".>>.")
]
myedls = QuasiQuoter
{ quoteExp = replaceOpsQ
, quotePat = undefined
, quoteType = undefined
, quoteDec = undefined
}
replaceOpsQ :: String -> ExpQ
replaceOpsQ s = case parseExp s of
Right e -> return $ replaceOps replacements e
Left err -> fail err
replaceOps :: Replacements -> Exp -> Exp
replaceOps reps = everywhere (mkT f) where
f (VarE (replace -> Just n')) = VarE (mkName n')
f e = e
replace n = lookup (show n) reps
If you save the above to its own module (e.g. MyEDSL.hs), then you can import it and use the quasi-quoter.
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE QuasiQuotes #-}
import MyEDSL
foo = [myedsl| a || b >> c |]
Note that I've used || instead of | because the latter is not a valid operator in Haskell (since it's the syntactic element used for pattern guards).

Resources