Why is building a Haskell String from Data.Text so slow


So I had a location class
data Location = Location {
title :: String
, description :: String
}
instance Show Location where
show l = title l ++ "\n"
++ replicate (length $ title l) '-' ++ "\n"
++ description l
Then I changed it to use Data.Text
data Location = Location {
title :: Text
, description :: Text
}
instance Show Location where
show l = T.unpack $
title l <> "\n"
<> T.replicate (T.length $ title l) "-" <> "\n"
<> description l
Using criterion, I benchmarked the time taken by show on both the String and Data.Text implementations:
benchmarks = [ bench "show" (whnf show l) ]
where l = Location {
title="My Title"
, description = "This is the description."
}
The String implementation took 34ns; the Data.Text implementation was five times slower, at 170ns.
How do I get Data.Text working as fast as String?
Edit: Silly mistakes
I'm not sure how this happened, but I cannot replicate the original speed difference: for String and Text I now get 28ns and 24ns respectively.
For the more aggressive bench "length.show" (whnf (length . show) l) benchmark, I get 467ns for String and 3954ns for Text.
If I use a very basic lazy builder, without the replicated dashes
import qualified Data.Text.Lazy.Builder as Bldr
instance Show Location where
show l = show $
Bldr.fromText (title l) <> Bldr.singleton '\n'
-- <> Bldr.fromText (T.replicate (T.length $ title l) "-") <> Bldr.singleton '\n'
<> Bldr.fromText (description l)
and try the original, ordinary show benchmark, I get 19ns. Now this is buggy, as using show to convert a builder to a String will escape newlines. If I replace it with LT.unpack $ Bldr.toLazyText, where LT is a qualified import of Data.Text.Lazy, then I get 192ns.
I'm testing this on a Mac laptop, and I suspect my timings are getting horribly corrupted by machine noise. Thanks for the guidance.

You can't make it as fast, but you can speed it up some.
Appending
Text is represented as an array. This makes <> rather slow, because each append has to allocate a new array and copy both Text values into it. You can avoid this by converting each piece to a String first, and then concatenating the Strings. Text also offers ways to combine many pieces at once (as a commenter mentions, you can use a lazy builder), but for inputs this small that will be slower. Another option might be the lazy version of Text, which is stored as a list of chunks and so should support cheaper concatenation.
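The "convert each piece to a String first" suggestion could be sketched as below. This is only an illustration of the idea, assuming the text package and the Location record from the question; the sample value is the one used in the benchmark, and no timings are claimed for it.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T

data Location = Location
  { title       :: Text
  , description :: Text
  }

-- Unpack each field on its own and join the pieces with (++),
-- so no intermediate Text has to be allocated by (<>).
instance Show Location where
  show l = T.unpack (title l) ++ "\n"
        ++ replicate (T.length (title l)) '-' ++ "\n"
        ++ T.unpack (description l)

-- Sample value taken from the benchmark in the question.
sample :: Location
sample = Location { title = "My Title", description = "This is the description." }
```

Whether this actually beats the <> version on a given machine is something to measure rather than assume.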
Sharing
In your String-based implementation, the description field doesn't have to be copied at all. It's just shared between the Location and the result of showing that Location. There's no way to accomplish this with the Text version.

In the String case you are not fully evaluating all of the string operations - (++) and replicate.
If you change your benchmark to:
benchmarks = [ bench "show" (whnf (length.show) l) ]
you'll see that the String case takes around 520 ns - approx 10 times longer.
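The difference between evaluating to weak head normal form and walking the whole String can be seen without criterion. The sketch below uses only base; the names partial, whnfIsFine and lengthCrashes are made up for the illustration. (criterion also provides nf, which forces the complete result instead of just the first constructor.)

```haskell
import Control.Exception (ErrorCall, evaluate, try)

-- A String whose first cells are fine but whose tail crashes when forced.
partial :: String
partial = "ab" ++ error "tail was forced"

-- Forcing to weak head normal form only evaluates the first (:) cell,
-- so the error in the tail is never reached.
whnfIsFine :: IO Bool
whnfIsFine = do
  r <- try (evaluate partial) :: IO (Either ErrorCall String)
  return (either (const False) (const True) r)

-- length has to walk the whole list, so it does reach the error.
lengthCrashes :: IO Bool
lengthCrashes = do
  r <- try (evaluate (length partial)) :: IO (Either ErrorCall Int)
  return (either (const True) (const False) r)
```

This is why whnf show measures almost nothing for a lazily built String, while whnf (length . show) measures the whole construction.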

Related

String Formatting columns in Haskell without Text.Printf

I am new to Haskell. I am at the last part of a school project. I have to take tuples and print them to an outfile, separated by a tab column. So (709,4226408), (12965,4226412) and (5,4226016) should have an output of
709 4226408
12965 4226412
5 4226016
What I have been trying to do is this:
genOutput :: (Int, Int) -> String
genOutput (a,b) = (show a) ++ "\t" ++ (show b)
And this gives outputs like:
"709\t4226408"
"12965\t4226412"
"5\t4226016"
There are 3 things wrong with this. 1) Quotes still appear in the output. 2) The \t does not actually become a tab. Whenever I try to put an actual tab inside the "" it just comes out as a " " space. 3) They are not aligned into columns like the above example. I know Text.Printf exists but we are not allowed to import anything other than:
import System.IO
import Data.List
import System.Environment

That's the output you get from GHCi, I guess? Try using putStrLn instead:
Prelude> genOutput (1,42)
"1\t42"
Prelude> putStrLn $ genOutput (1,42)
1 42
Why is that?
If you tell GHCi to evaluate an expression, it will do so and (more or less) output the result using show. show is designed to work with read and will usually render a value the way you would type it into Haskell source. For a String, that includes escape sequences and the surrounding quotes.
putStrLn, by contrast, takes the string and prints it to stdout as you would expect.
Using print
Another reason could be that you use print to output your value. print is show followed by putStrLn, so it shows the value first, reintroducing the escapes (as GHCi would). If you are printing Strings, change print to putStrLn.
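The column alignment asked about in the question can be handled with Prelude functions alone. The following is a minimal sketch; padRight and genAligned are hypothetical helper names, and the column width is passed in by the caller rather than computed from the data.

```haskell
-- Tab-separated version from the question.
genOutput :: (Int, Int) -> String
genOutput (a, b) = show a ++ "\t" ++ show b

-- Hypothetical helper: pad a string with spaces on the right up to width n.
-- Strings that are already n or more characters long are returned unchanged.
padRight :: Int -> String -> String
padRight n s = s ++ replicate (n - length s) ' '

-- Space-padded version: the first column is always w characters wide.
genAligned :: Int -> (Int, Int) -> String
genAligned w (a, b) = padRight w (show a) ++ show b

-- Write the rows with putStr, never print, so no quotes or escapes appear.
printRows :: [(Int, Int)] -> IO ()
printRows = putStr . unlines . map (genAligned 8)
```

To write to a file instead of stdout, hPutStr from System.IO (which the assignment allows) can replace putStr.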

Performance of pattern matching in GHC

I'm writing an "append" function for a data type I've created (which basically deals with "streams"). However, this data type has 12 different constructors, dealing with different types of "stream", for example, infinite, null, fixed length, variable length, already appended etc.
The logic between the input types and output types is a bit complex but not incredibly so.
I've considered two approaches:
Match against broad categories (perhaps by wrapping in a simpler proxy type) and then match inside those matches OR
Just pattern match against 144 cases (12*12). I could perhaps reduce this to 100 with wildcard matches for particular combinations but that's about it.
I know the second approach is more ugly and difficult to maintain, but disregarding that, will GHC find the second approach easier to optimise? If it can do the second approach with a simple jump table (or perhaps two jump tables) I suspect it will be faster. But if it's doing a linear check it will be far slower.
Does GHC optimise pattern matches (even very big ones) into constant time jump tables?

Yes, GHC optimizes such pattern matches. The first seven (I think) constructors get optimized especially well, via pointer tagging. I believe the rest will be handled by a jump table. But 144 cases sounds hard to maintain, and you'll have to watch for code size. Do you really need all those cases?
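To make the "wildcards for particular combinations" idea concrete, here is a small sketch with a made-up three-constructor Stream type (the real type in the question has twelve constructors and different semantics). Wildcards bring the nine possible pairs down to five equations.

```haskell
-- Hypothetical, heavily simplified stand-in for the 12-constructor type.
data Stream
  = Null        -- no elements
  | Fixed Int   -- a known number of elements
  | Infinite    -- never ends
  deriving (Eq, Show)

-- Flat match on both arguments; equations are tried top to bottom.
append :: Stream -> Stream -> Stream
append Null      s         = s
append s         Null      = s
append Infinite  _         = Infinite
append _         Infinite  = Infinite
append (Fixed m) (Fixed n) = Fixed (m + n)
```

How GHC compiles such a match can be inspected with -ddump-simpl, which shows the nested case expressions it generates.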

It's not too hard to write a small Haskell script that writes a huge case-block and a small benchmark for it. For example:
module Main (main) where
mapping = zip ['!'..'z'] (reverse ['!'..'z'])
test_code =
[
"module Main where",
"",
"tester :: String -> String",
"tester cs = do",
" c <- cs",
" case transform c of",
" Just c' -> [c']",
" Nothing -> [c ]",
"",
"input = concat [ [' '..'z'] | x <- [1..10000] ]",
"",
"main = print $ length $ tester $ input",
""
]
code1 =
test_code ++
[
"transform :: Char -> Maybe Char",
"transform c = lookup c " ++ show mapping
]
code2 =
test_code ++
[
"transform :: Char -> Maybe Char",
"transform c =",
" case c of"
] ++
map (\(k, v) -> " " ++ show k ++ " -> Just " ++ show v) mapping ++
[
" _ -> Nothing"
]
main = do
writeFile "Test1.hs" (unlines code1)
writeFile "Test2.hs" (unlines code2)
If you run this code, it generates two small Haskell source files: Test1.hs and Test2.hs. The former uses Prelude.lookup to map characters to characters. The latter uses a giant case-block. Both files contain code to apply the mapping to a large list of data and print out the size of the result. (This way avoids I/O, which would otherwise be the dominating factor.) On my system, Test1 takes a few seconds to run, whereas Test2 is pretty much instant.
The over-interested reader may like to try extending this to use Data.Map.lookup and compare the speed.
This proves that pattern-matching is far faster than an O(n) traversal of a list of key/value mappings... which isn't what you asked. But feel free to brew up your own benchmarks. You could try auto-generating a nested case versus a flat case and timing the result. My guess is that you won't see much difference, but feel free to try it.
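For readers who want to try the Data.Map variant mentioned above, a minimal sketch of the transform function could look like this. It assumes the containers package and reuses the same mapping as the generator script; the name table is made up here.

```haskell
import qualified Data.Map.Strict as Map

-- Same key/value pairs as in the generator script.
mapping :: [(Char, Char)]
mapping = zip ['!'..'z'] (reverse ['!'..'z'])

-- Build the map once; lookups are then O(log n) instead of O(n).
table :: Map.Map Char Char
table = Map.fromList mapping

transform :: Char -> Maybe Char
transform c = Map.lookup c table
```

Dropping this definition into the generated test program in place of the lookup-based one would give a third data point for the comparison.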

Is there a (Template) Haskell library that would allow me to print/dump a few local bindings with their respective names?

For instance:
let x = 1 in putStrLn [dump|x, x+1|]
would print something like
x=1, (x+1)=2
And even if there isn't anything like this currently, would it be possible to write something similar?

TL;DR There is this package which contains a complete solution.
install it via cabal install dump
and/or
read the source code
Example usage:
{-# LANGUAGE QuasiQuotes #-}
import Debug.Dump
main = print [d|a, a+1, map (+a) [1..3]|]
where a = 2
which prints:
(a) = 2 (a+1) = 3 (map (+a) [1..3]) = [3,4,5]
by turning this String
"a, a+1, map (+a) [1..3]"
into this expression
( "(a) = " ++ show (a) ++ "\t " ++
"(a+1) = " ++ show (a + 1) ++ "\t " ++
"(map (+a) [1..3]) = " ++ show (map (+ a) [1 .. 3])
)
Background
Basically, I found that there are two ways to solve this problem:
Exp -> String The bottleneck here is pretty-printing haskell source code from Exp and cumbersome syntax upon usage.
String -> Exp The bottleneck here is parsing haskell to Exp.
Exp -> String
I started out with what @kqr put together, and tried to write a parser to turn this
["GHC.Classes.not x_1627412787 = False","x_1627412787 = True","x_1627412787 GHC.Classes.== GHC.Types.True = True"]
into this
["not x = False","x = True","x == True = True"]
But after trying for a day, my parsec-debugging-skills have proven insufficient to date, so instead I went with a simple regular expression:
simplify :: String -> String
simplify s = subRegex (mkRegex "_[0-9]+|([a-zA-Z]+\\.)+") s ""
For most cases, the output is greatly improved.
However, I suspect this to likely mistakenly remove things it shouldn't.
For example:
$(dump [|(elem 'a' "a.b.c", True)|])
Would likely return:
["elem 'a' \"c\" = True","True = True"]
But this could be solved with proper parsing.
Here is the version that works with the regex-aided simplification: https://github.com/Wizek/kqr-stackoverflow/blob/master/Th.hs
Here is a list of downsides / unresolved issues I've found with the Exp -> String solution:
As far as I know, not using Quasi Quotation requires cumbersome syntax upon usage, like: $(d [|(a, b)|]) -- as opposed to the more succinct [d|a, b|]. If you know a way to simplify this, please do tell!
As far as I know, [||] needs to contain fully valid Haskell, which pretty much necessitates the use of a tuple inside, further exacerbating the syntactic situation. There is some upside to this too, however: at least we don't need to scratch our heads over where to split the expressions, since GHC does that for us.
For some reason, the tuple only seemed to accept Booleans. Weird, I suspect this should be possible to fix somehow.
Pretty pretty-printing Exp is not very straight-forward. A more complete solution does require a parser after all.
Printing an AST scrubs the original formatting for a more uniform look. I had hoped to preserve the expressions letter-by-letter in the output.
The deal-breaker was the syntactic over-head. I knew I could get to a simpler solution like [d|a, a+1|] because I have seen that API provided in other packages. I was trying to remember where I saw that syntax. What is the name...?
String -> Exp
Quasi Quotation is the name, I remember!
I remembered seeing packages with heredocs and interpolated strings, like:
string = [qq|The quick {"brown"} $f {"jumps " ++ o} the $num ...|]
where f = "fox"; o = "over"; num = 3
Which, as far as I knew, during compile-time, turns into
string = "The quick " ++ "brown" ++ " " ++ f ++ " " ++ ("jumps " ++ o) ++ " the " ++ show num ++ " ..."
where f = "fox"; o = "over"; num = 3
And I thought to myself: if they can do it, I should be able to do it too!
A bit of digging in their source code revealed the QuasiQuoter type.
data QuasiQuoter = QuasiQuoter {quoteExp :: String -> Q Exp}
Bingo, this is what I want! Give me the source code as string! Ideally, I wouldn't mind returning string either, but maybe this will work. At this point I still know quite little about Q Exp.
After all, in theory, I would just need to split the string on commas, map over it, duplicate the elements so that first part stays string and the second part becomes Haskell source code, which is passed to show.
Turning this:
[d|a+1|]
into this:
"a+1" ++ " = " ++ show (a+1)
Sounds easy, right?
Well, it turns out that even though GHC is obviously capable of parsing Haskell source code, it doesn't expose that function. Or not in any way we know of.
I find it strange that we need a third-party package (thankfully there is at least one, called haskell-src-meta) to parse Haskell source code for metaprogramming. It looks to me like an obvious duplication of logic and a potential source of mismatch, resulting in bugs.
Reluctantly, I started looking into it. After all, if it is good enough for the interpolated-string folks (those packages do rely on haskell-src-meta) then maybe it will work okay for me too for the time being.
And indeed, it does contain the desired function:
Language.Haskell.Meta.Parse.parseExp :: String -> Either String Exp
Language.Haskell.Meta.Parse
From this point it was rather straightforward, except for splitting on commas.
Right now, I do a very simple split on all commas, but that doesn't account for this case:
[d|(1, 2), 3|]
Which fails, unfortunately. To handle this, I began writing a parsec parser (again), which turned out to be more difficult than anticipated (again). At this point, I am open to suggestions. Maybe you know of a simple parser that handles the different edge cases? If so, tell me in a comment, please! I plan on resolving this issue with or without parsec.
But for most use cases, it works.
Update at 2015-06-20
Version 0.2.1 and later correctly parses expressions even if they contain commas inside them. Meaning [d|(1, 2), 3|] and similar expressions are now supported.
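For illustration, splitting only on commas that sit outside any brackets can be done with a small depth counter. This is a sketch of the general idea using only the Prelude, not the code the dump package actually uses; splitTopLevel is a made-up name, and the sketch deliberately ignores commas and brackets inside string or character literals.

```haskell
-- Split on commas at bracket depth 0, keeping everything else verbatim.
-- Assumption: brackets in the input are balanced and none appear inside
-- string or character literals.
splitTopLevel :: String -> [String]
splitTopLevel = go (0 :: Int) ""
  where
    -- acc holds the current piece in reverse order
    go _ acc [] = [reverse acc]
    go d acc (c : cs)
      | c == ',' && d == 0 = reverse acc : go 0 "" cs
      | c `elem` "([{"     = go (d + 1) (c : acc) cs
      | c `elem` ")]}"     = go (max 0 (d - 1)) (c : acc) cs
      | otherwise          = go d (c : acc) cs
```

Each piece keeps its leading whitespace, so a caller would typically trim the pieces before parsing them as expressions.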
You can
install it via cabal install dump
and/or
read the source code
Conclusion
During the last week I've learnt quite a bit of Template Haskell and QuasiQuotation, cabal sandboxes, publishing a package to hackage, building haddock docs and publishing them, and some things about Haskell too.
It's been fun.
And perhaps most importantly, I now am able to use this tool for debugging and development, the absence of which has been bugging me for some time. Peace at last.
Thank you @kqr, your engagement with my original question and attempt at solving it gave me enough spark and motivation to continue writing up a full solution.

I've actually almost solved the problem now. Not exactly what you imagined, but fairly close. Maybe someone else can use this as a basis for a better version. Either way, with
{-# LANGUAGE TemplateHaskell, LambdaCase #-}
import Language.Haskell.TH
dump :: ExpQ -> ExpQ
dump tuple =
listE . map dumpExpr . getElems =<< tuple
where
getElems = \case { TupE xs -> xs; _ -> error "not a tuple in splice!" }
dumpExpr exp = [| $(litE (stringL (pprint exp))) ++ " = " ++ show $(return exp)|]
you get the ability to do something like
λ> let x = True
λ> print $(dump [|(not x, x, x == True)|])
["GHC.Classes.not x_1627412787 = False","x_1627412787 = True","x_1627412787 GHC.Classes.== GHC.Types.True = True"]
which is almost what you wanted. As you see, it's a problem that the pprint function includes module prefixes and such, which makes the result... less than ideally readable. I don't yet know of a fix for that, but other than that I think it is fairly usable.
It's a bit syntactically heavy, but that is because it's using the regular [| quote syntax in Haskell. If one wanted to write their own quasiquoter, as you suggest, I'm pretty sure one would also have to re-implement parsing Haskell, which would suck a bit.

Haskell: How to organize a group of functions that all take the same arguments

I am writing a program with several functions that take the same arguments. Here is a somewhat contrived example for simplicity:
buildPhotoFileName time word stamp = show word ++ "-" ++ show time ++ show stamp
buildAudioFileName time word = show word ++ "-" ++ show time ++ ".mp3"
buildDirectoryName time word = show word ++ "_" ++ show time
Say I am looping over a resource from IO to get the time and word parameters at runtime. In this loop, I need to join the results of the above functions for further processing so I do this:
let photo = buildPhotoFileName time word stamp
audio = buildAudioFileName time word
dir = buildDirectoryName time word
in ....
This seems like a violation of "Don't Repeat Yourself" principle. If down the road I find I would like to change word to a function taking word, I might make a new binding at the beginning of let expression like so:
let wrd = processWord word
photo = buildPhotoFileName time wrd stamp
audio = buildAudioFileName time wrd
dir = buildDirectoryName time wrd
in ....
and would have to change each time I wrote word to wrd, leading to bugs if I remember to change some function calls, but not the others.
In OOP, I would solve this by putting the above functions in a class whose constructor would take time and word as arguments. The instantiated object would essentially be the three functions curried to time and word. If I wanted to then make sure that the functions receive processWord word instead of word as an "argument", I could call processWord in the constructor.
What is a better way to do this that would be more suited to Functional Programming and Haskell?

Since you say you're ready to create an OO-wrapper-class just for that, I assume you're open to changing your functions. The following is a function producing a tuple of all three results you wanted:
buildFileNames time word stamp =
( show word ++ "-" ++ show time ++ show stamp,
show word ++ "-" ++ show time ++ ".mp3",
show word ++ "_" ++ show time )
You'll be able to use it like so:
let wrd = processWord word
(photo, audio, dir) = buildFileNames time wrd stamp
in ....
And if you don't need any of the results, you can just skip them like so:
let wrd = processWord word
(_, audio, _) = buildFileNames time wrd stamp
in ....
It's worth noting that you don't have to worry about Haskell wasting resources on computing values you don't use, since it's lazy.
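The laziness claim can be checked directly. In the sketch below the stamp argument is undefined, yet asking only for the audio name succeeds, because the photo component that mentions the stamp is never evaluated. The Show-constrained type signature and the name audioOnly are assumptions added for the illustration.

```haskell
-- Same tuple-building function, with an explicit (assumed) signature.
buildFileNames :: (Show t, Show w, Show s) => t -> w -> s -> (String, String, String)
buildFileNames time word stamp =
  ( show word ++ "-" ++ show time ++ show stamp
  , show word ++ "-" ++ show time ++ ".mp3"
  , show word ++ "_" ++ show time
  )

-- Only the second component is demanded, so the undefined stamp
-- is never forced.
audioOnly :: String
audioOnly =
  let (_, audio, _) = buildFileNames (1 :: Int) "w" (undefined :: Int)
  in audio
```

Note that show applied to a String adds quotes, so the word appears quoted in the resulting file names, exactly as it would with the original functions.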

The solution you described from OOP land sounds like a good one in FP land to me. To wit:
data UID = UID
{ _time :: Integer
, _word :: String
}
Including or not including the "stamp" in this record is a design decision that we probably don't have enough information to answer here. One can put this data type in its own module, and define a "smart constructor" and "smart accessors":
uid = UID
time = _time
word = _word
Then hide the real constructor and accessors at the module boundary, e.g. export the UID type, uid smart constructor, and time and word smart accessors, but not the UID constructor or _time and _word accessors.
module UID (UID, uid, time, word) where
If we later discover that the smart constructor should do some processing, we can change the definition of uid:
uid t w = UID t (processWord w)
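Put together, the builder functions would then take a single UID instead of separate time and word arguments. The following is a sketch under assumptions: processWord is a hypothetical stand-in (here it just lowercases), and the module header with its export list is omitted so the snippet loads on its own.

```haskell
import Data.Char (toLower)

data UID = UID
  { _time :: Integer
  , _word :: String
  }

-- Hypothetical processing step; the real one is whatever the program needs.
processWord :: String -> String
processWord = map toLower

-- Smart constructor: every UID goes through processWord exactly once.
uid :: Integer -> String -> UID
uid t w = UID t (processWord w)

buildAudioFileName :: UID -> String
buildAudioFileName u = _word u ++ "-" ++ show (_time u) ++ ".mp3"

buildDirectoryName :: UID -> String
buildDirectoryName u = _word u ++ "_" ++ show (_time u)
```

Because callers can only obtain a UID through uid, no builder can accidentally receive an unprocessed word.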

Building on top of Nikita Volkov's answer, you can use record wild cards for some neat syntax with little repetition:
{-# LANGUAGE RecordWildCards #-}
data FileNames = FileNames { photo :: String, audio :: String, dir :: String }
buildFileNames :: Time -> Word -> Stamp -> FileNames
buildFileNames time word stamp = FileNames
  (show word ++ "-" ++ show time ++ show stamp)
  (show word ++ "-" ++ show time ++ ".mp3")
  (show word ++ "_" ++ show time)
let FileNames {..} = buildFileNames time wrd stamp
in ... photo ... audio ... dir...

Just to give you another example, if you are passing around the same parameters to multiple functions, you can use the Reader monad instead:
import Control.Monad.Reader
runR = flip runReader
type Params = (String, String, String)
buildPhotoFileName :: Reader Params String
buildPhotoFileName = do
(time, word, stamp) <- ask
return $ show word ++ "-" ++ show time ++ show stamp
main = do
runR (time, word, stamp) $ do
photo <- buildPhotoFileName
audio <- buildAudioFileName
dir <- buildDirectoryName
processStuff photo audio dir
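The snippet above leaves the other two builders undefined. A fuller sketch, assuming the mtl package, might look like the following; the Params type is reduced to an Int and a String here, and buildAll is a made-up name for the combined computation.

```haskell
import Control.Monad.Reader (Reader, ask, runReader)

-- Shared environment: (time, word). Assumed types for the illustration.
type Params = (Int, String)

buildAudioFileName :: Reader Params String
buildAudioFileName = do
  (time, word) <- ask
  return (word ++ "-" ++ show time ++ ".mp3")

buildDirectoryName :: Reader Params String
buildDirectoryName = do
  (time, word) <- ask
  return (word ++ "_" ++ show time)

-- Both builders read the same environment; it is supplied only once,
-- at the runReader call.
buildAll :: Reader Params (String, String)
buildAll = (,) <$> buildAudioFileName <*> buildDirectoryName
```

Calling runReader buildAll (time, processWord word) then applies any preprocessing in exactly one place.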

To build on David Wagner's solution, and your OO objectives you should move the buildxxx function or functions to a separate module (NameBuilders?) That would give you complete control.
Even with this approach you should also "wrap" the variables with functions inside the module as David suggested.
You would export the variables and the buildxxx constructor (returning a triplet) or constructors (three separate functions).
You could also simplify by reusing one builder in the others:
buildDirectoryName time word = show word ++ "_" ++ show time
buildPhotoFileName time word stamp = buildDirectoryName time word ++ show stamp
buildAudioFileName time word = buildDirectoryName time word ++ ".mp3"

Could I get help implementing a concept, "When a String changes, its type changes"

One day on #haskell, someone mentioned the concept of how a string's type should change when the string changes. This reminded me of some code I have in my project. It keeps bugging me, and I couldn't articulate why. The reason, I now surmise, is that I am not implementing this concept. Here's the code below, followed by some ideas of how I can begin to change it for the better. What I would like is some input to the effect of "You're on the right track.", "No, way off.", or "Here's this other thing you should be mindful of.".
> processHTML :: String -> [[String]]
> processHTML htmlFILE =
> let parsedHTML = parseTags htmlFILE
> allTagOpens = sections (~== TagOpen "a" [("href","")]) parsedHTML
> taggedTEXT = head $ map (filter isTagOpen) allTagOpens
> allHREFS = map (fromAttrib "href") taggedTEXT
> allPotentials = map (dropWhile (/= '?')) allHREFS
> removedNulls = filter (not . null) allPotentials
> removedQs = map (drop 1) removedNulls
> in map (splitOn "&") removedQs
The idea here is I'm taking raw HTML and filtering out everything I don't want until I get what I do want. Each let binding represents a stage in filtering. This could be the foundation of a data structure, like so:
> data Stage = Stage1 Foo
> | Stage2 Bar
> | Stage3 Baz
Where Foo Bar and Baz are the appropriate datatype; a String, or TagOpen for example, depending on what stage I am at in the filtering process. I could use this data type to get precise information when I add in the error handling code. Plus, it could help me keep track of what is happening when.
Feedback appreciated.

You're on the right track.
First of all, when you're building a long pipeline like this, you may prefer to compose functions directly:
> processHTML :: String -> [[String]]
> processHTML =
> parseTags
> >>> sections (~== TagOpen "a" [("href","")])
> >>> map (filter isTagOpen)
> >>> head
> >>> map (fromAttrib "href")
> >>> map (dropWhile (/= '?'))
> >>> filter (not . null)
> >>> map (drop 1)
> >>> map (splitOn "&")
This uses Control.Category.(>>>), which is just (at least in this case) flipped function composition.
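A self-contained version of the String-only tail of this pipeline, using just base, could be sketched as follows. splitOnChar is a hypothetical stand-in for splitOn (which comes from the split package), and queryParams is a made-up name for the last four stages.

```haskell
import Control.Category ((>>>))

-- Hypothetical replacement for splitOn, restricted to a single separator char.
splitOnChar :: Char -> String -> [String]
splitOnChar sep s = case break (== sep) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOnChar sep rest

-- Stages run top to bottom, in the order they are written.
queryParams :: [String] -> [[String]]
queryParams =
      map (dropWhile (/= '?'))  -- keep everything from the '?' onwards
  >>> filter (not . null)       -- drop hrefs without a query string
  >>> map (drop 1)              -- drop the '?' itself
  >>> map (splitOnChar '&')     -- split into key=value pieces
```

Reading the stages in execution order is the main attraction of (>>>) over (.) for pipelines like this.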
Now for your actual question, it looks like you're using the tagsoup package for parsing tags. This already does some type changing throughout the pipeline: parseTags generates a list of Tag values, some functions operate on them, and then fromAttrib goes back to a String.
Depending on how much work you'll be doing, I might create a newtype:
newtype QueryElement = QE { unQE :: String } deriving (Eq, Show)
> processHTML :: String -> [[QueryElement]]
> processHTML =
> parseTags
> >>> sections (~== TagOpen "a" [("href","")])
> >>> map (filter isTagOpen)
> >>> head
> >>> map (fromAttrib "href")
> >>> map (dropWhile (/= '?'))
> >>> filter (not . null)
> >>> map (drop 1)
> >>> map (splitOn "&" >>> map QE)
Only the last line has changed here, to add the QE newtype tags to each element.
Depending on your use case, you could take a different approach. For example, you may want to add more information to the URI instead of just collecting the query variables. Or you might want to fold over the query items and produce a Map String String directly.
Finally, if you're trying to gain type safety, you usually wouldn't make a sum type such as your Stage. This is because each constructor creates a value of the same type, so the compiler can't do any extra checking. Instead you'd create a separate type for each stage:
data Stage1 = Stage1 Foo
data Stage2 = Stage2 Bar
data Stage3 = Stage3 Baz
doStage1 :: Stage1 -> Stage2
doStage2 :: Stage2 -> Stage3
It's easy to create very fine-grained classes and data structures, but at some point they get out of hand. For example, in your functions allPotentials, removedNulls, and removedQs, you may want to just work on Strings. There isn't a lot of semantic meaning that can be attached to the output of those stages, especially as they're partial steps within a slightly larger process.
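As a rough illustration of one type per stage, the sketch below gives the two steps that do carry some meaning their own newtypes. Href, QueryString, extractQuery and splitQuery are names invented for this example; QE is the newtype from earlier in this answer.

```haskell
newtype Href         = Href String deriving (Eq, Show)
newtype QueryString  = QueryString String deriving (Eq, Show)
newtype QueryElement = QE { unQE :: String } deriving (Eq, Show)

-- Stage 1: an href either has a query part or it does not.
extractQuery :: Href -> Maybe QueryString
extractQuery (Href h) = case dropWhile (/= '?') h of
  []       -> Nothing
  (_ : qs) -> Just (QueryString qs)

-- Stage 2: only a QueryString can be split into elements, so the
-- compiler rejects an attempt to split a raw href.
splitQuery :: QueryString -> [QueryElement]
splitQuery (QueryString s) = map QE (go s)
  where
    go str = case break (== '&') str of
      (chunk, [])       -> [chunk]
      (chunk, _ : rest) -> chunk : go rest
```

The Maybe in extractQuery also replaces the separate "filter out the nulls" step with something the type makes visible.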

This page talks about using types to enforce safety of operations, and causing common errors to show up at compile-time. I'm not sure, but I think this is along the lines of what you're trying to implement.
An example of the problem:
You're running a web application that needs to use a database. It generates an SQL query from the username and password (for example) and sends it off to the database server, gets a response, and presents it to the user. This works great for a while. But then a very rude user types in " OR 1 = 1; -- for the username. Can you imagine sending that string to the following query:
SELECT * FROM users WHERE password = "$" AND username = "$";
Disaster!
The basic solution:
1) create a type for strings that are safe to send to the database server (i.e. GoodSQLString)
2) make sure that all GoodSQLStrings really are safe (perhaps the constructor passes the argument query string through an escaping function)
3) only allow GoodSQLStrings to be sent to the database server from an application
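The three steps above can be sketched with a newtype and a smart constructor. Everything here is a toy under stated assumptions: the escaping rule (doubling quote characters) is only a placeholder, renderQuery is a hypothetical stand-in for the call that talks to the database, and in a real module the GoodSQLString constructor would be left out of the export list. Real applications are usually better served by the parameterized queries their database library provides.

```haskell
-- Step 1: a distinct type for strings that have been made safe.
newtype GoodSQLString = GoodSQLString String deriving (Eq, Show)

-- Step 2: the only intended way to build one runs the escaping function.
-- Toy escaping: double every quote character.
mkGoodSQLString :: String -> GoodSQLString
mkGoodSQLString = GoodSQLString . concatMap escape
  where
    escape '"'  = "\"\""
    escape '\'' = "''"
    escape c    = [c]

-- Step 3: the query builder accepts GoodSQLString only, so passing a raw
-- String is a compile-time type error.
renderQuery :: GoodSQLString -> String
renderQuery (GoodSQLString u) =
  "SELECT * FROM users WHERE username = \"" ++ u ++ "\";"
```

The value of the pattern is in the types: forgetting to escape becomes something the compiler reports instead of something an attacker finds.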
That said, it's hard to say how that translates to your processHTML problem. Perhaps the type signature should be processHTML :: HTML -> [Tags], unless it's meaningful to pass in Strings that are invalid HTML.
