I'm observing the following execution times:
~9s when compiled with stack build
~5s when compiled with stack build --profile
I'd expect non-profiled execution time to be below 1s, which tells me that the profiled time above makes sense; it's the non-profiled time that is abnormally slow.
A few more details:
The program reads a relational algebra-like DSL, applies a series of rule-based transformations, and outputs a translation to SQL. Parsing is done with megaparsec. I/O is String-based and relatively small (~150 KB), so I would exclude I/O as a source of the problem. Transformations involve recursive rewriting rules over an ADT. On a few occasions, I use uglymemo to speed up such recursive rewrites.
Using stack 2.9.1 with LTS 18.28, ghc 8.10.7
(EDIT: upgrading to LTS 20.11, ghc 9.2.5, does not help)
In the cabal file:
ghc-options: -O2 -Wall -fno-warn-unused-do-bind -fno-warn-missing-signatures -fno-warn-name-shadowing -fno-warn-orphans
ghc-prof-options: -O2 -fprof-auto "-with-rtsopts=-p -V0.0001 -hc -s"
Notice that none of the above is new, but I have never observed this behaviour before.
I've seen this related question, but compiling with -fno-state-hack does not help
Compiling with -O1 doesn't help (about the same as -O2), and -O0 is significantly slower, as expected.
The profiling information shows no culprit. The problem only shows up with non-profiled execution.
I realise I'm not giving many details. In particular, I'm not including any code snippet, because it's a large code base and I have no idea which part of it could trigger this behaviour. My point is indeed that I don't know how to narrow it down.
So my question is not "where is the problem?", but rather: what could I do to get closer to the source of the issue, given that the obvious tool (profiling) turns out to be useless in this case?
Though I still don't know the exact reason for the difference between profiled and non-profiled execution, I have now found the culprit in my code.
A function that recurses over the ADT was implemented as follows:
f :: (?env :: Env) => X -> Y
f = memo go
  where
    go :: (?env :: Env) => X -> Y
    go (...) = ...
    go x = f (children x) -- memoised recursion over x's children
Note that:
each recursive call is memoised, as the data structure may contain many duplicates
the function uses an Env parameter, which is used internally but never changes during the whole recursion. Note that it is declared as an implicit parameter.
Profiling info about this function showed many more function calls than expected - three orders of magnitude more! (Still pretty weird that, despite this, profiled execution was reasonably fast.)
Turns out, the implicit parameter was the issue. I don't know the exact reason, but I had read blogs about how evil implicit parameters can be and now I have first-hand proof of that.
Changing this function to use only explicit parameters fixed it completely:
f :: Env -> X -> Y
f env = memo (go env)
  where
    go :: Env -> X -> Y
    go env (...) = ...
    go env x = f env (children x) -- memoised recursion over x's children
EDIT: Thanks to @leftaroundabout for pointing out that my solution above isn't quite correct - it under-uses memoisation (see discussion below). The correct version should be:
f :: Env -> X -> Y
f env = f_env
  where
    f_env = memo $ go env
    go env (...) = ...
    go env x = f_env (children x)
What the implicit parameter seems to have caused is that uglymemo failed to recognise previously stored calls, so I got all the overhead of memoisation and none of the benefit.
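To see the effect in isolation, here is a small self-contained sketch of the same pitfall with uglymemo (Int stands in for my Env/X/Y types, and the numbers are arbitrary; this is an illustration, not my real code). Re-applying memo on every recursive call builds a fresh, empty table each time, whereas binding the memoised function once shares a single table across the whole recursion:

import Data.MemoUgly (memo)  -- from the uglymemo package

-- Re-applies 'memo' at every recursive call: each application of 'fBad env'
-- builds a fresh, empty table, so lookups never hit and the recursion
-- explodes exponentially.
fBad :: Int -> Int -> Int
fBad env = memo (go env)
  where
    go e 0 = e + 1
    go e n = fBad e (n - 1) + fBad e (n - 1)

-- Binds the memoised function once per top-level call: every recursive call
-- goes through the same 'fEnv', so the table is shared and each argument is
-- computed only once.
fGood :: Int -> Int -> Int
fGood env = fEnv
  where
    fEnv = memo go
    go 0 = env + 1
    go n = fEnv (n - 1) + fEnv (n - 1)

main :: IO ()
main = do
  print (fGood 0 22)  -- fast: roughly 22 distinct calls
  print (fBad 0 22)   -- noticeably slower: memoisation never kicks in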
My problem is kind of solved, but if someone can explain the underlying reasons why implicit parameters can cause this behaviour, and why profiled execution would be faster, I'd still be very interested to learn that.
This sounds like a contradiction: "So my question is not "where is the problem?", but rather: what could I do to get closer to the source of the issue"
Run it under a debugger, and just manually pause it during those 9s or 5s. Do this several times. The stack will show you exactly how it's wasting time, whether I/O or not. That's the whole technique.
What have you got to lose?
Related
I'm working on a project right now where I'm dealing with the Prim typeclass and I need to ensure that a particular function I've written is specialized. That is, I need to make sure that when I call it, I get a specialized version of the function in which the Prim dictionaries get inlined into the specialized definition instead of being passed at runtime.
Fortunately, this is a pretty well-understood thing in GHC. You can just write:
{-# SPECIALIZE foo :: ByteArray Int -> Int #-}
foo :: Prim a => ByteArray a -> Int
foo = ...
And in my code, this approach is working fine. But, since typeclasses are open, there can be Prim instances that I don't know about yet when the library is being written. This brings me to the problem at hand.
The GHC user guide's documentation of SPECIALIZE provides two ways to use it. The first is putting SPECIALIZE at the site of the definition, as I did in the example above. The second is putting the SPECIALIZE pragma in another module where the function is imported. For reference, the example the user manual provides is:
module Map( lookup, blah blah ) where
lookup :: Ord key => [(key,a)] -> key -> Maybe a
lookup = ...
{-# INLINABLE lookup #-}
module Client where
import Map( lookup )
data T = T1 | T2 deriving( Eq, Ord )
{-# SPECIALISE lookup :: [(T,a)] -> T -> Maybe a #-}
The problem I'm having is that this is not working in my code. The project is on github, and the relevant lines are:
bench/Main.hs line 24
src/BTree/Compact.hs line 149
To run the benchmark, run these commands:
git submodule init && git submodule update
cabal new-build bench && ./dist-newstyle/build/btree-0.1.0.0/build/bench/bench
When I run the benchmark as is, there is a part of the output that reads:
Off-heap tree, Amount of time taken to build:
0.293197796
If I uncomment line 151 of BTree.Compact, that part of the benchmark runs fifty times faster:
Off-heap tree, Amount of time taken to build:
5.626834e-2
It's worth pointing out that the function in question, modifyWithM, is enormous. Its implementation is over 100 lines, but I do not think this should make a difference. The docs claim:
... mark the definition of f as INLINABLE, so that GHC guarantees to expose an unfolding regardless of how big it is.
So, my understanding is that, if specializing at the definition site works, it should always be possible to instead specialize at the call site. I would appreciate any insights from people who understand this machinery better than I do, and I'm happy to provide more information if something is unclear. Thanks.
EDIT: I've realized that in the git commit I linked to in this post, there is a problem with the benchmark code. It repeatedly inserts the same value. However, even after fixing this, the specialization problem is still happening.
I'm working on this tool where the user can define-and-include in [config files | content text-files | etc] their own "templates" (like mustache etc), and these can reference others, so they can induce a loop. Just when I was about to create a "max-loops" setting, I realized that with runghc the program, after a while, just quits with a farewell message of just <<loop>>. That's actually good enough for me, but it raises a few ponderations:
how does GHC or the runtime actually detect it's stuck in a loop, and how would it distinguish between a wanted long-running operation and an accidental infinite loop? The halting problem is still a thing last I checked..
any (time or iteration) limits that can be custom-set to the compiler or the runtime?
is this runghc-only or does it exist in all final compile outputs?
will any -O (optimization) flags set much later when building releases disable this apparent built-in loop detection?
All stuff I can figure out the hard way of course, but who knows maybe someone already looked into this in more detail.. (hard to google/ddg for "haskell" "<<loop>>" because they strip the angle brackets and then show results for "how to loop in Haskell" etc..)
This is a simple "improvement" of the STG runtime which was implemented in GHC. I'll share what I have understood, but GHC experts can likely provide more useful and accurate information.
GHC compiles to an intermediate language called Core, after having done several optimizations. You can see it using ghc -ddump-simpl ...
Very roughly, in Core, an unevaluated binding (like let x = 1+y+x in f x) creates a thunk. Some memory is allocated somewhere to represent the closure, and x is made to point at it.
When (and if) x is forced by f, then the thunk is evaluated. Here's the improvement: before the evaluation starts, the thunk of x is overwritten with a special value called BLACKHOLE. After x is evaluated (to WHNF) then the black hole is again overwritten with the actual value (so we don't recompute it if e.g. f x = x+x).
If the black hole is ever forced, <<loop>> is triggered. This is actually an IO exception (those can be raised in pure code, too, so this is fine).
Examples:
let x = 1+x in 2*x -- <<loop>>
let g x = g (x+1) in g 0 -- diverges
let h x = h (10-x) in h 0 -- diverges, even if h 0 -> h 10 -> h 0 -> ...
let g0 = g10 ; g10 = g0 in g0 -- <<loop>>
Note that each call of h 0 is considered a distinct thunk, hence no black hole is forced there.
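Since <<loop>> really is delivered as an ordinary exception (NonTermination from Control.Exception), you can even catch it from IO. A minimal sketch, which I would expect to behave this way under runghc or when compiled without optimizations (as noted below, optimizations can change which thunks exist, and hence whether the black hole is ever hit):

import Control.Exception (NonTermination, evaluate, try)

main :: IO ()
main = do
  let x = 1 + x :: Integer          -- a self-referential thunk
  r <- try (evaluate x) :: IO (Either NonTermination Integer)
  case r of
    Left e  -> putStrLn ("caught: " ++ show e)  -- prints: caught: <<loop>>
    Right v -> print v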
The tricky part is that it's not completely trivial to understand which thunks are actually created in Core since GHC can perform several optimizations before emitting Core. Hence, we should regard <<loop>> as a bonus, not as a given / hard guarantee by GHC. New optimizations in the future might replace some <<loop>>s with actual non-termination.
If you want to google something, "GHC, blackhole, STG" should be good keywords.
After writing this article I decided to put my money where my mouth is and started to convert a previous project of mine to use recursion-schemes.
The data structure in question is a lazy kdtree. Please have a look at the implementations with explicit and implicit recursion.
This is mostly a straightforward conversion along the lines of:
data KDTree v a = Node a (KDTree v a) (KDTree v a) | Leaf v a
to
data KDTreeF v a f = NodeF a f f | LeafF v a
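For reference, here is a minimal self-contained sketch of the shape this conversion takes (illustrative only: the real code uses the recursion-schemes package and is specialised to V3 Double, and leafCount is just a stand-in algebra):

{-# LANGUAGE DeriveFunctor #-}

newtype Fix f = Fix { unFix :: f (Fix f) }

data KDTreeF v a f = NodeF a f f | LeafF v a
  deriving (Functor)

type KDTree v a = Fix (KDTreeF v a)

-- a catamorphism folds the fixed-point structure one layer at a time
cata :: Functor f => (f b -> b) -> Fix f -> b
cata alg = alg . fmap (cata alg) . unFix

-- example algebra: count the leaves of a tree
leafCount :: KDTree v a -> Int
leafCount = cata alg
  where
    alg (LeafF _ _)   = 1
    alg (NodeF _ l r) = l + r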
Now after benchmarking the whole shebang I find that the KDTreeF version is about two times slower than the normal version (find the whole run here).
Is it just the additional Fix wrapper that slows me down here? And is there anything I could do against this?
Caveats:
At the moment this is specialized to (V3 Double).
This is cata- after anamorphism application. Hylomorphism isn't suitable for kdtrees.
I use cata (fmap foo algebra) several times. Is this good practice?
I use Edward Kmett's recursion-schemes package.
Edit 1:
Is this related? https://ghc.haskell.org/trac/ghc/wiki/NewtypeWrappers
Is newtype Fix f = Fix (f (Fix f)) not "free"?
Edit 2:
Just did another bunch of benchmarks. This time I tested tree construction and deconstruction. Benchmark here: https://dl.dropboxusercontent.com/u/2359191/2014-05-15-kdtree-bench-03.html
While the Core output indicates that intermediate data structures are not removed completely, and it is not surprising that the linear searches dominate now, the KDTreeF versions are now slightly faster than the KDTrees. It doesn't matter much, though.
I have just implemented the Thing + ThingF + Base instance variant of the tree. And guess what ... this one is amazingly fast.
I was under the impression that this one would be the slowest of all variants. I really should have read my own post ... the line where I write:
there is no trace of the TreeF structure to be found
Let the numbers speak for themselves, kdtreeu is the new variant. The results are not always as clear as for these cases, but in most cases they are at least as fast as the explicit recursion (kdtree in the benchmark).
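For the curious, the variant looks roughly like this (a sketch with placeholder fields, written against the current recursion-schemes API where the class is called Recursive; older versions of the package used different class names):

{-# LANGUAGE DeriveFunctor, TypeFamilies #-}
import Data.Functor.Foldable (Base, Recursive (..), cata)

-- the plain, explicitly recursive tree (the real one stores V3 Double points)
data KDTree v a = Node a (KDTree v a) (KDTree v a) | Leaf v a

-- its base functor
data KDTreeF v a f = NodeF a f f | LeafF v a
  deriving (Functor)

type instance Base (KDTree v a) = KDTreeF v a

instance Recursive (KDTree v a) where
  project (Node x l r) = NodeF x l r
  project (Leaf p x)   = LeafF p x

-- folds such as cata now run directly over the plain KDTree; the KDTreeF
-- layer only appears transiently during the fold, never in the stored data
leafCount :: KDTree v a -> Int
leafCount = cata alg
  where
    alg (LeafF _ _)   = 1
    alg (NodeF _ l r) = l + r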
I wasn't using recursion schemes, but rather my own "hand-rolled" cata, ana, Fix/unFix to do generation of (lists of) and evaluation of programs in a small language in the hope of finding one that matched a list of (input, output) pairs.
In my experience, cata optimized better than direct recursion and gave a speed boost. Also IME, ana prevented stack overflow errors that my naive generator was causing, but that may have centered around generation of the final list.
So, my answer would be that no, they aren't always slower, but I don't see any obvious problems; so they may simply be slower in your case. It's also possible that recursion-schemes itself is just not optimized for speed.
When running hlint over my program it reported an error for
\x -> [x]
and suggested the alternative form
(: [])
What is erroneous about the first form according to hlint, and thus why should I use the (less readable) second option?
Edit
(added hlint explicitly to the question)
My question lies not so much with what the difference between the two is (I do understand both of them) from a lexical point of view. My problem is that I do not understand why hlint marks it as an error. Is there, for example, a difference in laziness? Furthermore, why is the former considered erroneous by hlint, while \x -> Just x raises only a warning?
A common question, to which I've just added an answer in the HLint manual. It says:
Every hint has a severity level:
Error - for example concat (map f x) suggests concatMap f x as an "error" severity hint. From a style point of view, you should always replace a combination of concat and map with concatMap. Note that both expressions are equivalent - HLint is reporting an error in style, not an actual error in the code.
Warning - for example x !! 0 suggests head x as a "warning" severity hint. Typically head is a simpler way of expressing the first element of a list, especially if you are treating the list inductively. However, in the expression f (x !! 4) (x !! 0) (x !! 7), replacing the middle argument with head makes it harder to follow the pattern, and is probably a bad idea. Warning hints are often worthwhile, but should not be applied blindly.
The difference between error and warning is one of personal taste, typically my personal taste. If you already have a well developed sense of Haskell style, you should ignore the difference. If you are a beginner Haskell programmer you may wish to focus on error hints before warning hints.
While the difference is personal taste, sometimes I change my mind. Looking at the two examples in this thread, (:[]) seems a relatively "complex" hint - you are breaking down the syntactic sugar of [x] to x:[], which in some ways is peeling through the abstraction of a list as a generic container, if you never pattern match on it. In contrast \x -> Just x to Just always seems like a good idea. Therefore, in HLint-1.8.43 (just released) I have made the first a warning, and the second an error.
There is no real difference. HLint concerns itself with style issues; ultimately, they are just hints on how to make your code look better.
In general, using a lambda with a constructor or function like that is redundant and makes the code harder to read. As an extreme example, take a constructor like Just: compare Just to \ x -> Just x. These are equivalent but the second version certainly makes things more confusing! As a closer example, most people would choose (+ 1) over \ x -> x + 1.
In your particular case, it's a different story because lists have special syntax. So if you like the \ x -> [x] version better, just keep it. However, once you become used to operator sections, it's likely you'll find the (: []) version as easy--if not easier--to read, so consider using it even now.
I might consider using return or pure for this:
ghci> return 0 :: [Int]
[0]
ghci> import Control.Applicative
ghci> pure 0 :: [Int]
[0]
I needed to include the type annotation (:: [Int]) because I was working in GHCi. In the middle of a bunch of other code you probably wouldn't need it.
I'm trying to understand a performance anomaly being observed when running a program under runhaskell.
The program in question is:
isFactor n = (0 ==) . (mod n)
factors x = filter (isFactor x) [2..x]
main = putStrLn $ show $ sum $ factors 10000000
When I run this, it takes 1.18 seconds.
However, if I redefine isFactor as:
isFactor n f = (0 ==) (mod n f)
then the program takes 17.7 seconds.
That's a huge difference in performance and I would expect the programs to be equivalent. Does anybody know what I'm missing here?
Note: This doesn't happen when compiled under GHC.
Although the functions should be identical, there's a difference in how they're applied. With the first definition, isFactor is fully applied at the call site isFactor x. In the second definition, it isn't, because now isFactor explicitly takes two arguments.
Even minimal optimizations are enough for GHC to see through this and create identical code for both; however, if you compile with -O0 -ddump-simpl you can determine that, with no optimizations, this makes a difference (at least with ghc-7.2.1, YMMV with other versions).
With the first isFactor, GHC creates a single function that is passed as the predicate to "GHC.List.Filter", with the calls to mod 10000000 and (==) inlined. For the second definition, what happens instead is that most of the calls within isFactor are let-bound references to class functions, and are not shared between multiple invocations of isFactor. So there's a lot of dictionary overhead that's completely unnecessary.
This is almost never an issue because even the default compiler settings will optimize it away; however, runhaskell apparently doesn't even do that much. Even so, I have occasionally structured code as someFun x y = \z -> ... because I know that someFun will be partially applied and this was the only way to preserve sharing between calls (i.e. GHC's optimizer wasn't sufficiently clever).
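As a hypothetical illustration of that last pattern (the Set-based example and the names are mine, not from the code above): the work that depends only on the earlier arguments is done before the lambda, so it is performed once when the function is partially applied and then shared by every call of the resulting function.

import qualified Data.Set as Set

-- Building the Set depends only on 'xs', so it happens at most once when
-- 'memberOf xs' is partially applied, and is shared by every subsequent call
-- of the returned predicate, even without optimizations.
memberOf :: [Int] -> (Int -> Bool)
memberOf xs = let s = Set.fromList xs
              in \x -> Set.member x s

main :: IO ()
main = do
  let p = memberOf [1 .. 100000]        -- the Set is built once, lazily
  print (length (filter p [0, 5 .. 1000000]))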
As I understand it, runhaskell does little to no optimisation. It's designed to quickly load and run code. If it did more optimisation, it would take longer for your code to start running. Of course, in this case, the code runs faster with optimisations.
As I understand it, if a compiled version of the code exists, then runhaskell will use it. So if performance matters to you, just make sure you compile with optimisations turned on first. (I think you might even be able to pass switches to runhaskell to turn on optimisations - you'd have to check the documentation...)