Performance of pattern matching in GHC - haskell

I'm writing an "append" function for a data type I've created (which basically deals with "streams"). However, this data type has 12 different constructors, dealing with different types of "stream", for example, infinite, null, fixed length, variable length, already appended etc.
The logic between the input types and output types is a bit complex, but not incredibly so.
I've considered two approaches:
Match against broad categories (perhaps by wrapping in a simpler proxy type) and then match inside those matches OR
Just pattern match against 144 cases (12*12). I could perhaps reduce this to 100 with wildcard matches for particular combinations but that's about it.
I know the second approach is more ugly and difficult to maintain, but disregarding that, will GHC find the second approach easier to optimise? If it can do the second approach with a simple jump table (or perhaps two jump tables) I suspect it will be faster. But if it's doing a linear check it will be far slower.
Does GHC optimise pattern matches (even very big ones) into constant time jump tables?
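To make the two options concrete, here is a sketch with a few hypothetical constructors standing in for the real twelve (the type and the cases are made up purely for illustration):

data Stream a
  = Null
  | Fixed [a]
  | Infinite [a]

-- Option 1: match the left argument first, then the right one inside each branch.
appendNested :: Stream a -> Stream a -> Stream a
appendNested Null          y = y
appendNested (Infinite xs) _ = Infinite xs
appendNested (Fixed xs)    y = case y of
  Null        -> Fixed xs
  Fixed ys    -> Fixed (xs ++ ys)
  Infinite ys -> Infinite (xs ++ ys)

-- Option 2: one flat match over all constructor combinations.
appendFlat :: Stream a -> Stream a -> Stream a
appendFlat x y = case (x, y) of
  (Null,        _          ) -> y
  (Infinite xs, _          ) -> Infinite xs
  (Fixed xs,    Null       ) -> Fixed xs
  (Fixed xs,    Fixed ys   ) -> Fixed (xs ++ ys)
  (Fixed xs,    Infinite ys) -> Infinite (xs ++ ys)

Note that GHC's pattern-match compiler desugars the flat tuple match into a nested tree of single-constructor cases anyway, which is one reason the generated code for the two styles tends to be similar.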

Yes, GHC optimizes such pattern matches. The first seven (I think) constructors get optimized especially well, via pointer tagging. I believe the rest will be handled by a jump table. But 144 cases sounds hard to maintain, and you'll have to watch for code size. Do you really need all those cases?

It's not too hard to write a small Haskell script that writes a huge case-block and a small benchmark for it. For example:
module Main (main) where

mapping = zip ['!'..'z'] (reverse ['!'..'z'])

test_code =
  [
    "module Main where",
    "",
    "tester :: String -> String",
    "tester cs = do",
    "  c <- cs",
    "  case transform c of",
    "    Just c' -> [c']",
    "    Nothing -> [c ]",
    "",
    "input = concat [ [' '..'z'] | x <- [1..10000] ]",
    "",
    "main = print $ length $ tester $ input",
    ""
  ]

code1 =
  test_code ++
  [
    "transform :: Char -> Maybe Char",
    "transform c = lookup c " ++ show mapping
  ]

code2 =
  test_code ++
  [
    "transform :: Char -> Maybe Char",
    "transform c =",
    "  case c of"
  ] ++
  map (\(k, v) -> "    " ++ show k ++ " -> Just " ++ show v) mapping ++
  [
    "    _ -> Nothing"
  ]

main = do
  writeFile "Test1.hs" (unlines code1)
  writeFile "Test2.hs" (unlines code2)
If you run this code, it generates two small Haskell source files: Test1.hs and Test2.hs. The former uses Prelude.lookup to map characters to characters. The latter uses a giant case-block. Both files contain code to apply the mapping to a large list of data and print out the size of the result. (This way avoids I/O, which would otherwise be the dominating factor.) On my system, Test1 takes a few seconds to run, whereas Test2 is pretty much instant.
The over-interested reader may like to try extending this to use Data.Map.lookup and compare the speed.
This proves that pattern-matching is far faster than an O(n) traversal of a list of key/value mappings... which isn't what you asked. But feel free to brew up your own benchmarks. You could try auto-generating a nested-case versus a flat-case and timing the result. My guess is that you won't see much difference, but feel free to try it.

Related

Dealing with tabular data in Haskell

This is an excerpt of a CSV file, file.csv, with some tabular data:
John,23,Paris
Alban,28,London
Klaus,27,Berlin
Hans,29,Stockholm
Julian,25,Paris
Jonathan,26,Lyon
Albert,27,London
The column headers for this file would be
firstName, age, city
This file is loaded in ghci like this
π›Œ> :m + Data.List Data.Function Data.List.Split
π›Œ> contents <- readFile "file.csv"
π›Œ> let t = map (splitOn ",") $ lines contents
π›Œ> mapM print $ take 3 t
["John","23","Paris"]
["Alban","28","London"]
["Klaus","27","Berlin"]
[(),(),()]
Now, if I want to add a birthYear column to those 3 columns, I can do
π›Œ> let getYear str = show $ 2016 - read str
π›Œ> let withYear = map (\(x:xs) -> x : xs ++ [getYear (head xs)]) t
π›Œ> mapM print $ take 3 withYear
["John","23","Paris","France","1993"]
["Alban","28","London","UK","1988"]
["Klaus","27","Berlin","Germany","1989"]
[(),(),()]
This works well but what bothers me is that the getYear function has type String -> String and as such, type checking is pretty much useless here.
I could easily convert t into a list of tuples like ("John", 23, "Paris"), but what if I have not 3 but 300 features (which is not that uncommon in machine learning problems)?
What would be the best way to deal with different column types? Using tuples? Using maps?
In case of a big number of columns, is there a way to make Haskell infer the column's types? For instance, it would detect that column 2 in the above example is Int, and the others are strings?
Concerning column headers, would there be a way that one could simply access the columns by label instead of by index, so that getYear could be something like 2016 - column['age'] (Python example)?
I'm used to Python's Pandas DataFrames, which perform all this stuff automagically, but Haskell looks like it could do a lot of this natively. I'm just not sure how to go about it yet.
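For what it's worth, a minimal sketch of one direction, using a hand-written record type for the three columns shown (the Person type and helpers are made up for the example; it doesn't address header-driven access or the 300-column case):

import Data.List.Split (splitOn)

data Person = Person
  { firstName :: String
  , age       :: Int
  , city      :: String
  } deriving Show

parseRow :: [String] -> Person
parseRow [n, a, c] = Person n (read a) c
parseRow row       = error ("bad row: " ++ show row)

birthYear :: Person -> Int
birthYear p = 2016 - age p

main :: IO ()
main = do
  contents <- readFile "file.csv"
  let people = map (parseRow . splitOn ",") (lines contents)
  mapM_ (\p -> print (p, birthYear p)) (take 3 people)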

Why is building a Haskell String from Data.Text so slow

So I had a Location data type
data Location = Location {
    title       :: String
  , description :: String
  }

instance Show Location where
  show l = title l ++ "\n"
        ++ replicate (length $ title l) '-' ++ "\n"
        ++ description l
Then I changed it to use Data.Text
data Location = Location {
    title       :: Text
  , description :: Text
  }

instance Show Location where
  show l = T.unpack $
        title l <> "\n"
     <> T.replicate (T.length $ title l) "-" <> "\n"
     <> description l
Using criterion, I benchmarked the time taken by show on both the String and Data.Text implementations:
benchmarks = [ bench "show" (whnf show l) ]
  where
    l = Location {
          title       = "My Title"
        , description = "This is the description."
        }
The String implementation took 34ns; the Data.Text implementation was about five times slower, at 170ns.
How do I get Data.Text working as fast as String?
Edit: Silly mistakes
I'm not sure how this happened, but I cannot replicate the original speed difference: now for String and Text I get 28ns and 24ns respectively.
For the more aggressive bench "length.show" (whnf (length . show) l) benchmark, for String and Text, I get 467ns and 3954ns respectively.
If I use a very basic lazy builder, without the replicated dashes
import qualified Data.Text.Lazy.Builder as Bldr

instance Show Location where
  show l = show $
        Bldr.fromText (title l) <> Bldr.singleton '\n'
     -- <> Bldr.fromText (T.replicate (T.length $ title l) "-") <> Bldr.singleton '\n'
     <> Bldr.fromText (description l)
and try the original, ordinary show benchmark, I get 19ns. Now this is buggy, as using show to convert a builder to a String will escape newlines. If I replace it with LT.unpack $ Bldr.toLazyText, where LT is a qualified import of Data.Text.Lazy, then I get 192ns.
I'm testing this on a Mac laptop, and I suspect my timings are getting horribly corrupted by machine noise. Thanks for the guidance.
You can't make it as fast, but you can speed it up some.
Appending
Text is represented as an array. This makes <> rather slow, because a new array has to be allocated and each Text copied into it. You can fix this by converting each piece to a String first, and then concatenating them. I imagine Text probably also offers an efficient way to concatenate multiple texts at once (as a commenter mentions, you can use a lazy builder) but for this purpose that will be slower. Another good option might be the lazy version of Text, which probably supports efficient concatenation.
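For reference, the "concatenate multiple texts at once" function does exist: Data.Text.concat takes a list of Texts and builds the result with a single allocation. A sketch of the Show body written that way (assuming OverloadedStrings for the literals):

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import Data.Text (Text)

-- Build all the pieces first, then pay for one copy with T.concat.
showLocation :: Text -> Text -> String
showLocation title description = T.unpack $ T.concat
  [ title, "\n"
  , T.replicate (T.length title) "-", "\n"
  , description
  ]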
Sharing
In your String-based implementation, the description field doesn't have to be copied at all. It's just shared between the Location and the result of showing that Location. There's no way to accomplish this with the Text version.
In the String case you are not fully evaluating all of the string operations - (++) and replicate.
If you change your benchmark to:
benchmarks = [ bench "show" (whnf (length.show) l) ]
you'll see that the String case takes around 520 ns - approx 10 times longer.
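For completeness, criterion can also force the whole result directly with nf instead of composing with length (this sketch assumes the Location value l from the question, and relies on String's NFData instance):

import Criterion.Main (bench, nf, whnf)

benchmarks = [ bench "show/whnf" (whnf show l)   -- only forces the outermost constructor
             , bench "show/nf"   (nf   show l)   -- forces the entire String
             ]
  where
    l = Location { title = "My Title", description = "This is the description." }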

Is there a (Template) Haskell library that would allow me to print/dump a few local bindings with their respective names?

For instance:
let x = 1 in putStrLn [dump|x, x+1|]
would print something like
x=1, (x+1)=2
And even if there isn't anything like this currently, would it be possible to write something similar?
TL;DR: There is a package, dump, which contains a complete solution.
install it via cabal install dump
and/or
read the source code
Example usage:
{-# LANGUAGE QuasiQuotes #-}
import Debug.Dump

main = print [d|a, a+1, map (+a) [1..3]|]
  where a = 2
which prints:
(a) = 2 (a+1) = 3 (map (+a) [1..3]) = [3,4,5]
by turning this String
"a, a+1, map (+a) [1..3]"
into this expression
( "(a) = " ++ show (a) ++ "\t " ++
"(a+1) = " ++ show (a + 1) ++ "\t " ++
"(map (+a) [1..3]) = " ++ show (map (+ a) [1 .. 3])
)
Background
Basically, I found that there are two ways to solve this problem:
Exp -> String: the bottlenecks here are pretty-printing Haskell source code from an Exp, and the cumbersome syntax at the use site.
String -> Exp: the bottleneck here is parsing Haskell into an Exp.
Exp -> String
I started out with what #kqr put together, and tried to write a parser to turn this
["GHC.Classes.not x_1627412787 = False","x_1627412787 = True","x_1627412787 GHC.Classes.== GHC.Types.True = True"]
into this
["not x = False","x = True","x == True = True"]
But after trying for a day, my parsec-debugging-skills have proven insufficient to date, so instead I went with a simple regular expression:
simplify :: String -> String
simplify s = subRegex (mkRegex "_[0-9]+|([a-zA-Z]+\\.)+") s ""
For most cases, the output is greatly improved.
However, I suspect it will likely remove things it shouldn't.
For example:
$(dump [|(elem 'a' "a.b.c", True)|])
Would likely return:
["elem 'a' \"c\" = True","True = True"]
But this could be solved with proper parsing.
Here is the version that works with the regex-aided simplification: https://github.com/Wizek/kqr-stackoverflow/blob/master/Th.hs
Here is a list of downsides / unresolved issues I've found with the Exp -> String solution:
As far as I know, not using Quasi Quotation requires cumbersome syntax upon usage, like: $(d [|(a, b)|]) -- as opposed to the more succinct [d|a, b|]. If you know a way to simplify this, please do tell!
As far as I know, [||] needs to contain fully valid Haskell, which pretty much necessitates the use of a tuple inside, further exacerbating the syntactic situation. There is some upside to this too, however: at least we don't need to scratch our heads over where to split the expressions, since GHC does that for us.
For some reason, the tuple only seemed to accept Booleans. Weird, I suspect this should be possible to fix somehow.
Pretty-printing Exp nicely is not very straightforward. A more complete solution does require a parser after all.
Printing an AST scrubs the original formatting in favour of a more uniform look. I hoped to preserve the expressions letter-by-letter in the output.
The deal-breaker was the syntactic over-head. I knew I could get to a simpler solution like [d|a, a+1|] because I have seen that API provided in other packages. I was trying to remember where I saw that syntax. What is the name...?
String -> Exp
Quasi Quotation is the name, I remember!
I remembered seeing packages with heredocs and interpolated strings, like:
string = [qq|The quick {"brown"} $f {"jumps " ++ o} the $num ...|]
where f = "fox"; o = "over"; num = 3
Which, as far as I knew, during compile-time, turns into
string = "The quick " ++ "brown" ++ " " ++ $f ++ "jumps " ++ o ++ " the" ++ show num ++ " ..."
where f = "fox"; o = "over"; num = 3
And I thought to myself: if they can do it, I should be able to do it too!
A bit of digging in their source code revealed the QuasiQuoter type.
data QuasiQuoter = QuasiQuoter {quoteExp :: String -> Q Exp}
Bingo, this is what I want! Give me the source code as string! Ideally, I wouldn't mind returning string either, but maybe this will work. At this point I still know quite little about Q Exp.
After all, in theory, I would just need to split the string on commas, map over it, and duplicate each element so that the first part stays a string and the second part becomes Haskell source code, which is passed to show.
Turning this:
[d|a+1|]
into this:
"a+1" ++ " = " ++ show (a+1)
Sounds easy, right?
Well, it turns out that even though GHC is quite obviously capable of parsing Haskell source code, it doesn't expose that function. Or not in any way we know of.
I find it strange that we need a third-party package (thankfully there is at least one, haskell-src-meta) to parse Haskell source code for metaprogramming. It looks to me like an obvious duplication of logic, and a potential source of mismatch -- resulting in bugs.
Reluctantly, I started looking into it. After all, if it is good enough for the interpolated-string folks (those packages did rely on haskell-src-meta) then maybe it will work okay for me too for the time being.
And alas, it does contain the desired function:
Language.Haskell.Meta.Parse.parseExp :: String -> Either String Exp
Language.Haskell.Meta.Parse
From this point it was rather straightforward, except for splitting on commas.
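To sketch that assembly (a rough illustration, not the dump package's actual code): split the quote's contents on commas, parse each piece with haskell-src-meta, and splice together "source = value" fragments, joined with a tab like in the generated expression shown earlier.

{-# LANGUAGE TemplateHaskell #-}
-- The quoter has to live in its own module, separate from its use sites.
module DumpSketch (d) where

import Data.List.Split (splitOn)
import Language.Haskell.Meta.Parse (parseExp)
import Language.Haskell.TH
import Language.Haskell.TH.Quote (QuasiQuoter (..))

d :: QuasiQuoter
d = QuasiQuoter
  { quoteExp  = dumpExp
  , quotePat  = error "dump: expressions only"
  , quoteType = error "dump: expressions only"
  , quoteDec  = error "dump: expressions only"
  }

dumpExp :: String -> Q Exp
dumpExp source = foldr1 joinQ (map pieceQ pieces)
  where
    -- Naive split on every comma; see the caveat about [d|(1, 2), 3|] below.
    pieces = splitOn "," source

    -- Glue the per-expression strings together with a tab.
    joinQ a b = [| $a ++ "\t " ++ $b |]

    -- Turn one source fragment into:  "(src) = " ++ show (src)
    pieceQ src = case parseExp src of
      Left err -> fail ("dump: could not parse " ++ show src ++ ": " ++ err)
      Right e  -> [| $(litE (stringL ("(" ++ src ++ ") = "))) ++ show $(pure e) |]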
Right now, I do a very simple split on all commas, but that doesn't account for this case:
[d|(1, 2), 3|]
Which unfortunately fails. To handle this, I began writing a parsec parser (again), which turned out to be more difficult than anticipated (again). At this point, I am open to suggestions. Maybe you know of a simple parser that handles the different edge-cases? If so, tell me in a comment, please! I plan on resolving this issue with or without parsec.
But for most use-cases, it works.
Update at 2015-06-20
Version 0.2.1 and later correctly parses expressions even if they contain commas inside them. Meaning [d|(1, 2), 3|] and similar expressions are now supported.
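One simple way to do such a top-level split (a sketch of the idea, not necessarily what the package actually does) is to track bracket depth and only split when nothing is open; note that it does not handle commas inside character or string literals:

splitTopLevel :: String -> [String]
splitTopLevel = go (0 :: Int) ""
  where
    go _ acc []           = [reverse acc]
    go 0 acc (',' : rest) = reverse acc : go 0 "" rest
    go d acc (c : rest)
      | c `elem` "([{"    = go (d + 1) (c : acc) rest
      | c `elem` ")]}"    = go (d - 1) (c : acc) rest
      | otherwise         = go d (c : acc) rest

-- splitTopLevel "(1, 2), 3"  ==  ["(1, 2)", " 3"]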
You can
install it via cabal install dump
and/or
read the source code
Conclusion
During the last week I've learnt quite a bit of Template Haskell and QuasiQuotation, cabal sandboxes, publishing a package to hackage, building haddock docs and publishing them, and some things about Haskell too.
It's been fun.
And perhaps most importantly, I now am able to use this tool for debugging and development, the absence of which has been bugging me for some time. Peace at last.
Thank you #kqr, your engagement with my original question and attempt at solving it gave me enough spark and motivation to continue writing up a full solution.
I've actually almost solved the problem now. Not exactly what you imagined, but fairly close. Maybe someone else can use this as a basis for a better version. Either way, with
{-# LANGUAGE TemplateHaskell, LambdaCase #-}
import Language.Haskell.TH

dump :: ExpQ -> ExpQ
dump tuple =
    listE . map dumpExpr . getElems =<< tuple
  where
    getElems = \case { TupE xs -> xs; _ -> error "not a tuple in splice!" }
    dumpExpr exp = [| $(litE (stringL (pprint exp))) ++ " = " ++ show $(return exp) |]
you get the ability to do something like
λ> let x = True
λ> print $(dump [|(not x, x, x == True)|])
["GHC.Classes.not x_1627412787 = False","x_1627412787 = True","x_1627412787 GHC.Classes.== GHC.Types.True = True"]
which is almost what you wanted. As you see, it's a problem that the pprint function includes module prefixes and such, which makes the result... less than ideally readable. I don't yet know of a fix for that, but other than that I think it is fairly usable.
It's a bit syntactically heavy, but that is because it's using the regular [| quote syntax in Haskell. If one wanted to write their own quasiquoter, as you suggest, I'm pretty sure one would also have to re-implement parsing Haskell, which would suck a bit.

Order of execution within monads

I was learning how to use the State monad and I noticed some odd behavior in terms of the order of execution. Removing the distracting bits that involve using the actual state, say I have the following code:
import Control.Monad
import Control.Monad.State
import Debug.Trace
mainAction :: State Int ()
mainAction = do
    traceM "Outside for loop"
    forM_ [0..2] (\i -> do
        traceM $ "i is " ++ show i
        forM_ [0..2] (\j -> do
            traceM $ "j is " ++ show j
            someSubaction i j
          )
      )
Running runState mainAction 1 in ghci produces the following output:
j is 2
j is 1
j is 0
i is 2
j is 2
j is 1
j is 0
i is 1
j is 2
j is 1
j is 0
i is 0
Outside for loop
which seems like the reverse order of execution of what might be expected. I thought that maybe this is a quirk of forM and tried it with sequence which specifically states that it runs its computation sequentially from left to right like so:
mainAction :: State Int ()
mainAction = do
    traceM "Outside for loop"
    sequence $ map handleI [0..2]
    return ()
  where
    handleI i = do
      traceM $ "i is " ++ show i
      sequence $ map (handleJ i) [0..2]
    handleJ i j = do
      traceM $ "j is " ++ show j
      someSubaction i j
However, the sequence version produces the same output. What is the actual logic in terms of the order of execution that is happening here?
Haskell is lazy, which means things are not executed immediately. Things are executed whenever their result is needed – but no sooner. Sometimes code isn't executed at all if its result isn't needed.
If you stick a bunch of trace calls in a pure function, you will see this laziness happening. The first thing that is needed will be executed first, so that's the trace call you see first.
When something says "the computation is run from left to right" what it means is that the result will be the same as if the computation was run from left to right. What actually happens under the hood might be very different.
This is in fact why it's a bad idea to do I/O inside pure functions. As you have discovered, you get "weird" results because the execution order can be pretty much anything that produces the correct result.
Why is this a good idea? When the language doesn't enforce a specific execution order (such as the traditional "top to bottom" order seen in imperative languages) the compiler is free to do a tonne of optimisations, such as for example not executing some code at all because its result isn't needed.
I would recommend you to not think too much about execution order in Haskell. There should be no reason to. Leave that up to the compiler. Think instead about which values you want. Does the function give the correct value? Then it works, regardless of which order it executes things in.
I thought that maybe this is a quirk of forM and tried it with sequence which specifically states that it runs its computation sequentially from left to right like so: [...]
You need to learn to make the following, tricky distinction:
The order of evaluation
The order of effects (a.k.a. "actions")
What forM, sequence and similar functions promise is that the effects will be ordered from left to right. So for example, the following is guaranteed to print characters in the same order that they occur in the string:
putStrLn :: String -> IO ()
putStrLn str = forM_ str putChar >> putChar '\n'
But that doesn't mean that expressions are evaluated in this left-to-right order. The program has to evaluate enough of the expressions to figure out what the next action is, but that often does not require evaluating everything in every expression involved in earlier actions.
Your example uses the State monad, which bottoms out to pure code, so that accentuates the order issues. The only thing that traversal functions such as forM promise in this case is that gets inside the actions mapped to the list elements will see the effect of puts for elements to their left in the list.
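A small, self-contained State example of that guarantee (made up for illustration): each action sees the puts of the actions to its left, even though traces may print in a surprising order.

import Control.Monad (forM)
import Control.Monad.State

numbered :: State Int [Int]
numbered = forM "abc" $ \_ -> do
  n <- get
  put (n + 1)
  pure n

-- evalState numbered 0 == [0,1,2]   -- left-to-right effect order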

Haskell: can I use laziness to "abort early" and gain performance?

I'm writing a Haskell program that reads a wordlist of the English language and a rectangular grid of letters such as:
I T O L
I H W S
N H I S
K T S I
and then finds a Hamiltonian path through the grid from the top-left corner that spells out a sequence of English words, such as:
--> $ runghc unpacking.hs < 4x4grid.txt
I THINK THIS IS SLOW
(If there are multiple solutions, it can just print any one it finds and stop looking.)
The naïve, strict approach is to generate a full path and then try to split it up into words. However, assuming that I'm doing this (and currently I am forcing myself to -- see below) I'm spending a lot of time finding paths like:
IINHHTOL...
IINHHTOW...
IINHHWOL...
These are obviously never going to turn out to be words, looking at the first few letters ("IINH" can't be split into words, and no English word contains "NHH".) So, say, in the above grid, I don't want to look at the many[1] paths that begin with IINHH.
Now, my functions look like this:
paths :: Coord -> Coord -> [[Coord]]
paths (w, h) (1, 1) = [[(1, 1), (1, 2), ... (x, y)], ...]
lexes :: Set String -> String -> [[String]]
lexes englishWordset "ITHINKTHISWILLWORK" = [["I", "THINK", "THIS", ...], ...]
paths just finds all the paths worth considering on a (w, h) grid. lexes finds all the ways to chop a phrase up, and is defined as:
lexes language []     = [[]]
lexes language phrase = let
    splits = tail $ zip (inits phrase) (tails phrase)
  in concat [map (w:) (lexes language p') | (w, p') <- splits,
                                            w `S.member` language]
Given "SAMPLESTRING", it looks at "S", then "SA", then "SAM"... as soon as it finds a valid word, it recurses and tries to "lex" the rest of the string. (First it will recurse on "PLESTRING" and try to make phrases with "SAM", but find no way to chop "plestring" up into words, and fail; then it will find ["SAMPLE", "STRING"].)
Of course, for an invalid string like the ones above, any hope of being "lazy" is lost with this approach: in the example from earlier we still need to search beyond a ridiculous phrase like "ITOLSHINHISIST", because maybe "ITOLSHINHISISTK" (one letter longer) might form a valid single word.
I feel like somehow I could use laziness here to improve performance throughout the entire program: if the first few characters of phrase aren't a prefix of any word, we can bail out entirely, stop evaluating the rest of phrase, and thus the rest of the path.[2] Does this make sense at all? Is there some tree-like data structure that will help me check not for set membership, but set "prefix-ness", thereby making checking validity lazier?
[1] Obviously, for a 4x4 grid there are very few of these, but this argument is about the general case: for bigger grids I could skip hundreds of thousands of paths the moment I see they start with "JX".
[2] phrase is just map (grid M.!) path for some Map Coord Char grid read from the input file.
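As for the "prefix-ness" check above, one sketch that stays with Data.Set instead of a trie uses lookupGE: the smallest word that is >= the prefix starts with that prefix exactly when some word in the set does (an illustration of the membership test only, not a full solution):

import Data.List (isPrefixOf)
import qualified Data.Set as S

hasPrefix :: S.Set String -> String -> Bool
hasPrefix language prefix =
  case S.lookupGE prefix language of
    Just w  -> prefix `isPrefixOf` w
    Nothing -> False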

Resources