Let's say I pass a small function f to map. Can Haskell inline f with map to produce a small imperative loop? If so, how does Haskell keep track of what function f really is? Can the same be done with Arrow combinators?
If f is passed in as an argument, then no, probably not. If f is the name of a top-level function or a local function, then probably yes.
foobar f = ... map f ...
-- Probably not inlined.
foobar = ... map (\ x -> ...) ...
-- Probably inlined.
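A minimal concrete sketch of the two situations (module and names are made up); compiling with ghc -O2 -ddump-simpl shows which version turns into a tight loop:

module Inlining where

-- f arrives as an unknown function argument: GHC cannot inline it
-- at this call site (unless applyAll itself is inlined at a site
-- where f is known).
applyAll :: (Int -> Int) -> [Int] -> [Int]
applyAll f xs = map f xs

-- The lambda is statically known here, so GHC can inline map and
-- fuse the lambda into the resulting loop.
doubleAll :: [Int] -> [Int]
doubleAll = map (\x -> x * 2)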
That said, I gather that most of the performance difference between inlined and out-of-line code comes not from the inlining itself, but rather from any additional optimisations the inlining subsequently allows.
The only way to be "sure" about these things is to actually write the code, actually compile it, and have a look at the Core that gets generated. And the only way to know if it makes a difference (positive or negative) is to actually benchmark the thing.
The definition of the Haskell language does not require an implementation to inline code, or to perform any other kind of optimization. Any implementation is free to apply whatever optimizations it deems appropriate.
That being said, Haskell is nowadays usually compiled with GHC, which does optimize Haskell code. For inlining, GHC uses some heuristics to decide whether something should be inlined or not. The general advice is to turn optimization on with -O2 and check the output of the compiler. You can see the produced Core with -ddump-simpl, or the assembly code with -ddump-asm. Some other flags can be useful as well.
If you then see that GHC is not inlining stuff you would like to, you can provide a hint to the compiler with {-# INLINE foo #-} and related pragmas.
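The pragma goes next to the definition it names; a minimal sketch with a made-up helper:

-- Ask GHC to inline square at every call site. INLINABLE is the
-- weaker variant: it keeps the unfolding available across module
-- boundaries but leaves the decision to the optimiser.
square :: Int -> Int
square x = x * x
{-# INLINE square #-}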
Be wary of mindlessly applying optimizations, though. Often, programmers spend their time optimizing parts of the program that have a negligible impact on overall performance. To avoid this, it is strongly recommended to profile your code first, so that you know where your program spends most of its time.
Here is an example where GHC does inline a function passed as an argument:
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector as V
plus :: Int -> Int -> Int
plus = (+)
sumVect :: V.Vector Int -> Int
sumVect = V.foldl1 plus
plus is passed as the argument of foldl1, which results in summing a vector of integers. In the Core, plus is inlined and optimized to the unboxed GHC.Prim.+# :: Int# -> Int# -> Int#:
letrec {
  $s$wfoldlM_loop_s759
    :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
  $s$wfoldlM_loop_s759 =
    \ (sc_s758 :: GHC.Prim.Int#) (sc1_s757 :: GHC.Prim.Int#) ->
      case GHC.Prim.tagToEnum# @ Bool (GHC.Prim.>=# sc_s758 ww1_s748)
      of _ {
        False ->
          case GHC.Prim.indexArray#
                 @ Int ww2_s749 (GHC.Prim.+# ww_s747 sc_s758)
          of _ { (# ipv1_X72o #) ->
          case ipv1_X72o of _ { GHC.Types.I# y_a5Kg ->
          $s$wfoldlM_loop_s759
            (GHC.Prim.+# sc_s758 1#) (GHC.Prim.+# sc1_s757 y_a5Kg)
          }
          };
        True -> sc1_s757
      }; }
That's the GHC.Prim.+# sc1_s757 y_a5Kg. You can add simple arithmetic inside the plus function and see this Core expression expand.
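For instance, a hypothetical variant (the exact Core names will differ from run to run):

-- With a little more arithmetic in plus, the inlined loop body
-- should now contain the corresponding unboxed GHC.Prim.*# and
-- GHC.Prim.+# operations instead of a single +#.
plus :: Int -> Int -> Int
plus x y = 2 * x + y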
Related
I was trying to figure out how mVars work, and I came across this bit of code:
-- |Create an 'MVar' which is initially empty.
newEmptyMVar :: IO (MVar a)
newEmptyMVar = IO $ \ s# ->
    case newMVar# s# of
        (# s2#, svar# #) -> (# s2#, MVar svar# #)
Besides being confusingly mutually recursive with newMVar, it's also littered with hashes (#).
Between the two, I can't figure out how it works. I know that this is basically just a pseudo-constructor for MVar, but the rest of the module (most of the library, actually) is full of them, and I can't find anything on them. Googling "Haskell hashes" didn't yield anything relevant.
They're (literally) magic hashes. They mark GHC's primitives: things like unboxed addition, unboxed types, and unboxed tuples. You can enable writing them with
{-# LANGUAGE MagicHash #-}
Now you can import the stubs that let you use them with
import GHC.Exts
unboxed :: Int# -> Int# -> Int#
unboxed a# b# = a# +# b#
boxed :: Int -> Int -> Int
boxed (I# a#) (I# b#) = I# (unboxed a# b#)
This actually is kinda nifty when you think about it: by wrapping the magical and strict primitives like this, we can handle lazy Ints and Chars uniformly at the runtime-system level.
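For reference, this is essentially how GHC itself defines Int in GHC.Types:

-- The I# constructor is the lazy "box" around the raw machine
-- integer; pattern-matching on I# is what forces the value.
data Int = I# Int#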
Because primitives are not boxed, they're segregated at the kind level. This means that Int# doesn't have the kind * like normal types, which also means that something like
kindClash :: Int# -> Int#
kindClash = id -- id expects boxed types
won't compile.
To further elaborate on your code: newMVar# is a call to a compiler primitive in GHC that allocates a new mutable variable, so it's not mutual recursion so much as a thin wrapper over a compiler call. There's also some darkness gathering at the corners of this function, since we're treating IO as a perverse state monad, but let's not look too closely at that. I like my sanity too much.
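The "perverse state monad" remark refers to GHC's internal definition of IO (in GHC.Types), shown here for reference:

-- IO is a newtype around a function threading a state token for the
-- real world; the unboxed tuple (# s', a #) returns the new token
-- together with the result. That's why newEmptyMVar can be written
-- as IO $ \ s# -> ... above.
newtype IO a = IO (State# RealWorld -> (# State# RealWorld, a #))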
I don't use primitives in everyday code, nor should you. They come up when implementing crazy optimized hotspots, or near primitive abstractions like what you're looking at.
I have a program that uses haskell-src-exts, and to improve performance I decided to make some record fields strict. This resulted in much worse performance.
Here's the complete module that I'm changing:
{-# LANGUAGE DeriveDataTypeable, BangPatterns #-}
module Cortex.Hackage.HaskellSrcExts.Language.Haskell.Exts.SrcSpan(
    SrcSpan, srcSpan, srcSpanFilename, srcSpanStartLine,
    srcSpanStartColumn, srcSpanEndLine, srcSpanEndColumn,
    ) where

import Control.DeepSeq
import Data.Data

data SrcSpan = SrcSpanX
    { srcSpanFilename    :: String
    , srcSpanStartLine   :: Int
    , srcSpanStartColumn :: Int
    , srcSpanEndLine     :: Int
    , srcSpanEndColumn   :: Int
    }
    deriving (Eq, Ord, Show, Typeable, Data)

srcSpan :: String -> Int -> Int -> Int -> Int -> SrcSpan
srcSpan fn !sl !sc !el !ec = SrcSpanX fn sl sc el ec

instance NFData SrcSpan where
    rnf (SrcSpanX x1 x2 x3 x4 x5) = rnf x1
Note that the only way to construct a SrcSpan is by using the srcSpan function which is strict in all the Ints.
With this code my program (sorry, I can't share it) runs in 163s.
Now change a single line, e.g.,
, srcSpanStartLine :: !Int
I.e., the srcSpanStartLine field is now marked as strict. My program now takes 198s to run. So making that one field strict increases the running time by about 20%.
How is this possible? The code for the srcSpan function should be the same regardless, since it is already strict. The code for the srcSpanStartLine selector should be a bit simpler, since it no longer has to evaluate the field.
I've experimented with -funbox-strict-fields and -funbox-small-strict-fields on and off. It doesn't make any noticeable difference. I'm using ghc 7.8.3.
Has anyone seen something similar? Any bright ideas what might cause it?
With some more investigation I can answer my own question. The short answer is uniplate.
Slightly longer answer. In one place I used uniplate to get the children of a Pat (the haskell-src-exts type for patterns). The call looked like children p, and the type of this instance of children was Pat SrcSpanInfo -> [Pat SrcSpanInfo]. So it's doing no recursion, just returning the immediate children of a node.
Uniplate uses two very different methods depending on whether there are strict fields in the type you're operating on. Without strict fields it is reasonably fast; with strict fields it switches to using gfoldl and is incredibly slow. And even though my use of uniplate didn't directly involve a strict field, it slowed down.
Conclusion: Beware uniplate if you have a strict field anywhere in sight!
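For reference, a minimal sketch of the kind of call involved (assuming the Data-derived uniplate interface; module names may differ between uniplate versions):

import Data.Generics.Uniplate.Data (children)
import Language.Haskell.Exts (Pat, SrcSpanInfo)

-- Immediate sub-patterns of a pattern, no recursion. With a strict
-- field anywhere in the traversed type, uniplate's Data-based
-- implementation falls back to gfoldl and becomes very slow.
childPats :: Pat SrcSpanInfo -> [Pat SrcSpanInfo]
childPats p = children p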
I would like to know what the most standard way in Haskell is, between these two implementations:

xor :: Bool -> Bool -> Bool
xor True x = not x
xor False x = x

and

xor :: Bool -> Bool -> Bool
xor True = not
xor False = id

The first one states clearly that we want two arguments (most of the time).
The second involves a function call (id) in the second clause, so it should be less efficient, because in the first implementation we can simply return the second argument.
So I tend to think that the first is better and should be the one to pick: it is easier to read and to figure out what it does[1], and it saves a function call.
But I'm a newbie to Haskell; maybe the compiler optimizes this extra call away.
Also, I would like to know whether I can replace both occurrences of False with a wildcard there.
So, what is good practice in Haskell? Maybe another implementation?
[1] Set aside that xor is well-known functionality; let's imagine it is a non-trivial function.
Thanks
For readability, I would try to avoid pattern matching and define the function with a single equation that expresses something interesting about the function to be defined. That's not always possible, but for this example, there are many options:
xor = (/=)
xor a b = a /= b
xor a b = not (a == b)
xor a b = (a && not b) || (not a && b)
xor a b = (a || b) && not (a && b)
xor a b = odd (fromEnum a + fromEnum b)
Of course it depends on the compiler and the options passed to the compiler.
For this particular example, if you compile without optimisations, GHC produces the code as you have written it, so the second version contains a call to id resp. to not. That is slightly less efficient than the first version, which then only contains the call to not:
Xors.xor1 :: GHC.Types.Bool -> GHC.Types.Bool -> GHC.Types.Bool
[GblId, Arity=2]
Xors.xor1 =
  \ (ds_dkm :: GHC.Types.Bool) (x_aeI :: GHC.Types.Bool) ->
    case ds_dkm of _ {
      GHC.Types.False -> x_aeI;
      GHC.Types.True -> GHC.Classes.not x_aeI
    }

Xors.xor2 :: GHC.Types.Bool -> GHC.Types.Bool -> GHC.Types.Bool
[GblId, Arity=1]
Xors.xor2 =
  \ (ds_dki :: GHC.Types.Bool) ->
    case ds_dki of _ {
      GHC.Types.False -> GHC.Base.id @ GHC.Types.Bool;
      GHC.Types.True -> GHC.Classes.not
    }
(the calls are still in the produced assembly, but core is more readable, so I post only that).
But with optimisations, both functions compile to the same core (and thence to the same machine code),
Xors.xor2 =
  \ (ds_dkf :: GHC.Types.Bool) (eta_B1 :: GHC.Types.Bool) ->
    case ds_dkf of _ {
      GHC.Types.False -> eta_B1;
      GHC.Types.True ->
        case eta_B1 of _ {
          GHC.Types.False -> GHC.Types.True;
          GHC.Types.True -> GHC.Types.False
        }
    }
GHC eta-expanded the second version and inlined the calls to id and not; you get pure pattern matching.
Whether the second equation uses False or a wildcard makes no difference in either version, with or without optimisations.
maybe the compiler optimize this extra call.
If you ask it to optimise, in simple cases like this, GHC will eliminate the extra call.
let's imagine it is a non-trivial function.
Here's a possible problem. If the code is non-trivial enough, the compiler may not be able to eliminate all calls introduced by defining the function with not all arguments supplied. GHC is rather good at doing that and inlining calls, though, so you need a fair amount of non-triviality to make GHC fail to eliminate calls to simple functions whose implementation it knows when compiling your code (it can of course never inline calls to functions whose implementation it doesn't know when compiling the module in question).
If it's critical code, always check what code the compiler produces, for GHC, the relevant flags are -ddump-simpl to get the core produced after optimisations, and -ddump-asm to get the produced assembly.
So i tend to think that the first is better and should be the one to pick : easier to read and to figure out what it does
I agree about readability. However, the second one is very much idiomatic Haskell and rather easier to read for experienced programmers: not performing that trivial eta reduction is quite suspicious and might actually distract from the intent. So for an optimised version, I'd rather write it out completely in explicit form:
True `xor` False = True
False `xor` True = True
_ `xor` _ = False
However, if such an alternative is considerably less readable than the most idiomatic one you should consider not replacing it, but adding hints so the compiler can still optimise it to the ideal version. As demonstrated by Daniel Fischer, GHC is quite clever by itself and will often get it right without help; when it doesn't it might help to add some INLINE and/or RULES pragmas. It's not easy to figure out how to do this to get optimal performance, but the same is true for writing fast Haskell98 code.
According to this article,
Enumerations don't count as single-constructor types as far as GHC is concerned, so they don't benefit from unpacking when used as strict constructor fields, or strict function arguments. This is a deficiency in GHC, but it can be worked around.
And instead the use of newtypes is recommended. However, I cannot verify this with the following code:
{-# LANGUAGE MagicHash, BangPatterns #-}
{-# OPTIONS_GHC -O2 -funbox-strict-fields -rtsopts -fllvm -optlc --x86-asm-syntax=intel #-}
module Main(main, f, g) where

import GHC.Base
import Criterion.Main

data D = A | B | C
newtype E = E Int deriving (Eq)

f :: D -> Int#
f z | z `seq` False = 3422#
f z = case z of
        A -> 1234#
        B -> 5678#
        C -> 9012#

g :: E -> Int#
g z | z `seq` False = 7432#
g z = case z of
        E 0 -> 2345#
        E 1 -> 6789#
        E 2 -> 3535#

f' x = I# (f x)
g' x = I# (g x)

main :: IO ()
main = defaultMain [ bench "f" (whnf f' A)
                   , bench "g" (whnf g' (E 0))
                   ]
Looking at the assembly, the tags for each constructor of the enumeration D are actually unpacked and hard-coded directly in the instructions. Furthermore, the function f lacks error-handling code and is more than 10% faster than g. In a more realistic case I have also experienced a slowdown after converting an enumeration to a newtype. Can anyone give me some insight about this? Thanks.
It depends on the use case. For the functions you have, it's expected that the enumeration performs better. Basically, the three constructors of D become Ints resp. Int#s when the strictness analysis allows that, and GHC knows it's statically guaranteed that the argument can only have one of the three values 0#, 1#, 2#, so it need not insert error-handling code for f. For E, that static guarantee of only three possible values isn't given, so error-handling code has to be added for g, and that slows things down significantly. If you change the definition of g so that the last case becomes
E _ -> 3535#
the difference vanishes completely or almost completely (I get a 1% - 2% better benchmark for f still, but I haven't done enough testing to be sure whether that's a real difference or an artifact of benchmarking).
But this is not the use case the wiki page is talking about. What it's talking about is unpacking the constructors into other constructors when the type is a component of other data, e.g.
data FooD = FD !D !D !D
data FooE = FE !E !E !E
Then, if compiled with -funbox-strict-fields, the three Int#s can be unpacked into the constructor of FooE, so you'd basically get the equivalent of
struct FooE {
    long x, y, z;
};
while the fields of FooD have the multi-constructor type D and cannot be unpacked into the constructor FD(1), so that would basically give you
struct FooD {
    long *px, *py, *pz;
};
That can obviously have significant impact.
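If you want this unpacking without the global -funbox-strict-fields flag, the per-field spelling is the UNPACK pragma; a minimal sketch (FooE' is a made-up name to avoid clashing with the definitions above):

-- E is a newtype around the single-constructor Int, so its fields
-- can be unpacked to raw Int#s; the same pragma on a field of the
-- multi-constructor type D would be ignored (with a warning).
data FooE' = FE' {-# UNPACK #-} !E {-# UNPACK #-} !E {-# UNPACK #-} !E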
I'm not sure about the case of single-constructor function arguments. That has obvious advantages for types with contained data, like tuples, but I don't see how that would apply to plain enumerations, where you just have a case and splitting off a worker and a wrapper makes no sense (to me).
Anyway, the worker/wrapper transformation isn't so much a single-constructor thing, constructor specialisation can give the same benefit to types with few constructors. (For how many constructors specialisations would be created depends on the value of -fspec-constr-count.)
(1) That might have changed, but I doubt it. I haven't checked it though, so it's possible the page is out of date.
I would guess that GHC has changed quite a bit since that page was last updated in 2008. Also, you're using the LLVM backend, so that's likely to have some effect on performance as well. GHC can (and will, since you've used -O2) strip any error handling code from f, because it knows statically that f is total. The same cannot be said for g. I would guess that it's the LLVM backend that then unpacks the constructor tags in f, because it can easily see that there is nothing else used by the branching condition. I'm not sure of that, though.
Is there a GHC-specific "unsafe" extension to ask whether two Haskell references point to the same location?
I'm aware this can break referential transparency if not used properly. But there should be little harm (unless I'm missing something) if it is used very carefully, as a means of optimization by short-cutting recursive (or expensive) data traversal, e.g. for implementing an optimized Eq instance:
instance Eq ComplexTree where
    a == b = (a `unsafeSameRef` b) || (a `deepCompare` b)
provided that deepCompare is guaranteed to be true whenever unsafeSameRef decides true (but not necessarily the other way around).
EDIT/PS: Thanks to the answer pointing to System.Mem.StableName, I was able to also find the paper Stretching the storage manager: weak pointers and stable names in Haskell which happens to have addressed this very problem already over 10 years ago...
GHC's System.Mem.StableName solves exactly this problem.
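A minimal sketch of how it could be used for the unsafeSameRef from the question (note that makeStableName does not force its argument, so an unevaluated thunk and its later value may get different stable names):

import System.Mem.StableName (makeStableName)
import System.IO.Unsafe (unsafePerformIO)

-- Equal stable names imply the same heap object; unequal ones carry
-- no information, which matches the one-sided guarantee asked for.
unsafeSameRef :: a -> a -> Bool
unsafeSameRef x y = unsafePerformIO $ do
    nx <- makeStableName x
    ny <- makeStableName y
    return (nx == ny)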
There's a pitfall to be aware of:
Pointer equality can change strictness. I.e., you might get pointer equality saying True when in fact the real equality test would have looped because of, e.g., a circular structure. So pointer equality ruins the semantics (but you knew that).
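For illustration, a constructed example of that pitfall:

-- With ordinary (==) this comparison never terminates; a pointer
-- equality shortcut would answer True immediately. Terminating where
-- the pure code would loop is a change in semantics.
ones :: [Int]
ones = 1 : ones

loops :: Bool
loops = ones == ones  -- diverges with plain (==)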
I think StablePointers might be of help here
http://www.haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/Foreign-StablePtr.html
Perhaps this is the kind of solution you are looking for:
import Foreign.StablePtr (newStablePtr, freeStablePtr)
import System.IO.Unsafe (unsafePerformIO)

unsafeSameRef :: a -> a -> Bool
unsafeSameRef x y = unsafePerformIO $ do
    a <- newStablePtr x
    b <- newStablePtr y
    let z = a == b
    freeStablePtr a
    freeStablePtr b
    return z
There's unpackClosure# in GHC.Prim, with the following type:
unpackClosure# :: a -> (# Addr#,Array# b,ByteArray# #)
Using that you could whip up something like:
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Prim

eq a b = case unpackClosure# a of
    (# a1, a2, a3 #) -> case unpackClosure# b of
        (# b1, b2, b3 #) -> eqAddr# a1 b1
And in the same package, there's the interestingly named reallyUnsafePtrEquality# of type
reallyUnsafePtrEquality# :: a -> a -> Int#
But I'm not sure what the return value of that one is - going by the name it will lead to much gnashing of teeth.
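For what it's worth, a hedged sketch of how it is typically wrapped (going by the primop's documentation, it returns 1# when both arguments are the same heap object and 0# otherwise):

{-# LANGUAGE MagicHash #-}
import GHC.Exts (isTrue#, reallyUnsafePtrEquality#)

-- 1# means "definitely the same heap object"; 0# means "not observed
-- to be the same", which can happen even for a single object, e.g.
-- after the compiler duplicates or moves it. Treat False as unknown.
ptrEq :: a -> a -> Bool
ptrEq x y = isTrue# (reallyUnsafePtrEquality# x y)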