Haskell: How to get "\\0" into "\0"? - string

Haskell has a number of string literals that use the \ escape sequence. Ones such as \n, \t, \NUL.
If I have the string literal:
let s = "Newline: \\n Tab: \\t"
how do I define the function escape :: String -> String that will convert the above string to:
"Newline: \n Tab: \t"
And the same with all other string literal escape sequences.
I'm okay with using Quasi Quoting and Template Haskell, but don't know how to use them to achieve the result. Any pointers?
Update: I just found the Text.ParserCombinators.ReadP module that's included in the Base library. It supports the readLitChar :: ReadS Char function in Data.Char that does what I want, but I don't know how to use the ReadP module. I tried the following and it works:
escape2 [] = []
escape2 xs = case readLitChar xs of
  []       -> []
  [(a, b)] -> a : escape2 b
But this may not be the right way to use the ReadP module. Can anyone provide some pointers?
Another update: Thanks everyone. My final function below. Not bad, I think.
import Text.ParserCombinators.ReadP
import Text.Read.Lex
escape xs
  | [] <- r = []
  | [(a,_)] <- r = a
  where r = readP_to_S (manyTill lexChar eof) xs

You don't need to do anything. When you input the string literal
let s = "Newline: \\n Tab: \\t"
you can check that it is what you want:
Prelude> putStrLn s
Newline: \n Tab: \t
Prelude> length s
19
If you just ask ghci for the value of s you'll get something else,
Prelude> s
"Newline: \\n Tab: \\t"
because ghci displays values with show, which escapes special characters and adds the quotes. If you call show or print yourself you'll get these answers:
Prelude> show s
"\"Newline: \\\\n Tab: \\\\t\""
Prelude> print s
"Newline: \\n Tab: \\t"
This is because show is meant for serializing values, so when you show a string you don't get the original back; you get a serialized string which can be parsed into the original string. The result of show s is what print s displays (print is defined as putStrLn . show). When you evaluate show s in ghci you get an even stranger answer, because ghci applies show a second time to the string that show already produced.
tl;dr - always use putStrLn to see what the value of a string is in ghci.
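The difference between the string and its show form can be checked directly. This is a minimal sketch using only the Prelude; the names s, shown and roundTrip are just for illustration:

```haskell
-- The string from the question: it contains backslash characters,
-- not control characters.
s :: String
s = "Newline: \\n Tab: \\t"

-- show wraps the string in quotes and escapes each backslash.
shown :: String
shown = show s

-- read undoes show, so the round trip gives back the original string.
roundTrip :: String
roundTrip = read shown
```

Here length s is 19 and length shown is 23: two added quotes plus one extra backslash for each of the two backslashes in s.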
Edit: I just realized that maybe you want to convert the literal value
Newline: \n Tab: \t
into the actual control sequences. The easiest way to do this is probably to stick it in quotes and use read:
Prelude> let s' = '"' : s ++ "\""
Prelude> read s' :: String
"Newline: \n Tab: \t"
Prelude> putStrLn (read s')
Newline:
Tab:
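Note that read throws an exception when the quoted text is not a valid string literal, for example when the input contains an unescaped double quote. A hedged sketch of a safer variant, using readMaybe from Text.Read (the name unescapeViaRead is made up for this example):

```haskell
import Text.Read (readMaybe)

-- Wrap the input in double quotes and let the Read instance for String
-- interpret the escape sequences. Returns Nothing when the quoted text
-- is not a valid string literal.
unescapeViaRead :: String -> Maybe String
unescapeViaRead raw = readMaybe ('"' : raw ++ "\"")
```

With this version a malformed input yields Nothing instead of crashing the program.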
Edit 2: an example of using readLitChar, this is very close to Chris's answer except with readLitChar:
strParser :: ReadP String
strParser = do
  str <- many (readS_to_P readLitChar)
  eof
  return str
Then you run it with readP_to_S, which gives you a list of matching parses (there shouldn't be more than one match, however there might not be any match so you should check for an empty list.)
> putStrLn . fst . head $ readP_to_S strParser s
Newline:
Tab:
>
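readLitChar can also be used directly, without ReadP. The sketch below is one possible variant, not the only way to do it: it returns Nothing on an invalid escape instead of silently truncating the result, which is what escape2 in the question does. The name unescape is hypothetical.

```haskell
import Data.Char (readLitChar)

-- Decode one literal character at a time. readLitChar returns an empty
-- list when the input starts with an invalid escape sequence, and that
-- case is reported as Nothing.
unescape :: String -> Maybe String
unescape [] = Just []
unescape xs = case readLitChar xs of
  [(c, rest)] -> (c :) <$> unescape rest
  _           -> Nothing
```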

Asking about QQ and TH means you wish to do this conversion at compile time. For simple String -> Something conversions you can use the OverloadedStrings literal facility in GHC.
EDIT 2 : Using the exposed character lexer in Text.Read.Lex
module UnEscape where
import Data.String(IsString(fromString))
import Text.ParserCombinators.ReadP as P
import Text.Read.Lex as L
newtype UnEscape = UnEscape { unEscape :: String }
instance IsString UnEscape where
  fromString rawString = UnEscape lexed
    where lexer = do s <- P.many L.lexChar
                     eof
                     return s
          lexed = case P.readP_to_S lexer rawString of
                    ((answer,""):_) -> answer
                    _ -> error ("UnEscape could not process "++show rawString)
EDIT 1 : I have now got a better UnEscape instance that uses GHC's read:
instance IsString UnEscape where
  fromString rawString = UnEscape (read (quote rawString))
    where quote s = '"' : s ++ ['"']
For example:
module UnEscape where
import Data.String(IsString(fromString))
newtype UnEscape = UnEscape { unEscape :: String }
instance IsString UnEscape where
  fromString rawString = UnEscape (transform rawString)
    where transform [] = []
          transform ('\\':x:rest) = replace x : transform rest
          transform (y:rest) = y : transform rest
          -- also covers special case of backslash at end
          replace x = case x of
            'n' -> '\n'
            't' -> '\t'
            unrecognized -> unrecognized
The above has to be a separate module from the module that uses unEscape:
{-# LANGUAGE OverloadedStrings #-}
module Main where
import UnEscape(UnEscape(unEscape))
main = do
  let s = "Newline: \\n Tab: \\t"
      t = unEscape "Newline: \\n Tab: \\t"
  print s
  putStrLn s
  print t
  putStrLn t
This produces
shell prompt$ ghci Main.hs
GHCi, version 7.0.3: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
[1 of 2] Compiling UnEscape ( UnEscape.hs, interpreted )
[2 of 2] Compiling Main ( Main.hs, interpreted )
Ok, modules loaded: Main, UnEscape.
*Main> main
"Newline: \\n Tab: \\t"
Newline: \n Tab: \t
"Newline: \n Tab: \t"
Newline:
Tab:

Related

Parsec - improving error message for "between"

I'm learning Parsec. I've got this code:
import Text.Parsec.String (Parser)
import Control.Applicative hiding ((<|>))
import Text.ParserCombinators.Parsec hiding (many)
inBracketsP :: Parser [String]
inBracketsP = (many $ between (char '[') (char ']') (many $ char '.')) <* eof
main :: IO ()
main = putStr $ show $ parse inBracketsP "" "[...][..."
The result is
Left (line 1, column 10):
unexpected end of input
expecting "." or "]"
This message is not useful (adding . won't fix the problem). I'd expect something like ']' expected (only ] fixes the problem).
Is it possible to achieve that easily with Parsec? I've seen the SO question Parsec: error message at specific location, which is inspiring, but I'd prefer to stick to the between combinator, without manual lookahead or other overengineering (kind of), if possible.
You can hide a terminal from being displayed in the expected input list by attaching an empty label to it (parser <?> ""):
inBracketsP :: Parser [String]
inBracketsP = (many $ between (char '[') (char ']') (many $ (char '.' <?> ""))) <* eof
-- >>> main
-- Left (line 1, column 10):
-- unexpected end of input
-- expecting "]"
In megaparsec, there is also a hidden combinator that achieves the same effect.
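For Parsec, the same idea can be wrapped in a small helper. The following is a sketch under assumptions: hidden is a hypothetical helper named after megaparsec's combinator, and expectedItems is only there to inspect which "expecting" entries survive.

```haskell
import Data.List (nub)
import Text.Parsec
import Text.Parsec.Error (Message (..), errorMessages)
import Text.Parsec.String (Parser)

-- Hypothetical helper: attach an empty label so the parser does not
-- show up in the "expecting" list.
hidden :: Parser a -> Parser a
hidden p = p <?> ""

inBrackets :: Parser [String]
inBrackets = many (between (char '[') (char ']') (many (hidden (char '.')))) <* eof

-- Collect the non-empty "expecting" entries of a failed parse.
expectedItems :: String -> [String]
expectedItems input = case parse inBrackets "" input of
  Left err -> nub [s | Expect s <- errorMessages err, not (null s)]
  Right _  -> []
```

For the input "[...][..." only the closing bracket should remain in the list, which matches the error message shown above.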

Why does this parser always fail when the end-of-line sequence is CRLF?

This simple parser is expected to parse messages of the form
key: value\r\nkey: value\r\n\r\nkey: value\r\nkey: value\r\n\r\n
One EOL acts as a field separator, and double EOL acts as a message separator. It works perfectly fine when the EOL separator is \n but parseWith always returns fail when it is \r\n.
parsePair = do
  key <- B8.takeTill (==':')
  _ <- B8.char ':'
  _ <- B8.char ' '
  value <- B8.manyTill B8.anyChar endOfLine
  return (key, value)
parseListPairs = sepBy parsePair endOfLine <* endOfLine
parseMsg = sepBy parseListPairs endOfLine <* endOfLine
I'm assuming you are using these imports:
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Attoparsec.ByteString.Char8 as B8
import Data.Attoparsec.ByteString.Char8
The problem is that the endOfLine inside parsePair's manyTill already consumes the end of line after each pair, so perhaps you really want something like:
parseListPairs = B8.many1 parsePair <* endOfInput
For instance, this works:
ghci> parseOnly parseListPairs "k: v\r\nk2: v2\r\n"
Right [("k","v"),("k2","v2")]
Update:
For parsing multiple messages you can use:
parseListPairs = B8.manyTill parsePair endOfLine
parseMsgs = B8.manyTill parseListPairs endOfInput
ghci> parseOnly parseMsgs "k1: v1\r\nk2: v2\r\n\r\nk3: v3\r\nk4: v4\r\n\r\n"
Right [[("k1","v1"),("k2","v2")],[("k3","v3"),("k4","v4")]]
Problems
Your code isn't self-contained and the actual problem is unclear. However, I suspect your woes are actually caused by how keys are parsed; in particular, something like \r\nk is a valid key, according to your parser:
λ> parseOnly parsePair "\r\nk: v\r\n"
Right ("\r\nk","v")
That needs to be fixed.
Moreover, since one EOL separates (rather than terminates) key-value pairs, an EOL shouldn't be consumed at the end of your parsePair parser.
Another tangential issue: because you use the manyTill combinator instead of ByteString-oriented parsers (such as takeTill), your values have type String instead of ByteString. That's probably not what you want here, because it defeats the purpose of using ByteString in the first place; see the "Performance considerations" section of the attoparsec documentation.
Solution
I suggest the following refactoring:
{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString ( ByteString )
import Data.Attoparsec.ByteString.Char8 ( Parser
                                        , count
                                        , endOfLine
                                        , parseOnly
                                        , sepBy
                                        , string
                                        , takeTill
                                        )
-- convenient type synonyms
type KVPair = (ByteString, ByteString)
type Msg = [KVPair]
pair :: Parser KVPair
pair = do
  k <- key
  _ <- string ": "
  v <- value
  return (k, v)
  where
    key = takeTill (\c -> c == ':' || isEOL c)
    value = takeTill isEOL
    isEOL c = c == '\n' || c == '\r'
-- one EOL separates key-value pairs
msg :: Parser Msg
msg = sepBy pair endOfLine
-- two EOLs separate messages
msgs :: Parser [Msg]
msgs = sepBy msg (count 2 endOfLine)
I have renamed your parsers, for consistency with attoparsec's, none of which have "parse" as a prefix:
parsePair --> pair
parseListPairs --> msg
parseMsg --> msgs
Tests in GHCi
λ> parseOnly pair "\r\nk: v"
Left "string"
Good; you do want a fail, in this case.
λ> parseOnly pair "k: v"
Right ("k","v")
λ> parseOnly msg "k: v\r\nk2: v2\r\n"
Right [("k","v"),("k2","v2")]
λ> parseOnly msgs "k1: v1\r\nk2: v2\r\n\r\nk3: v3\r\nk4: v4"
Right [[("k1","v1"),("k2","v2")],[("k3","v3"),("k4","v4")]]
λ> parseOnly msgs "k: v"
Right [[("k","v")]]

Haskell attoparsec: "Failed reading: satisfyWith"

I want to parse text like "John","Kate","Ruddiger" into list of Strings.
I tried to start with parsing "John", to Name (alias for String) but it already fails with Fail "\"," [","] "Failed reading: satisfyWith".
Question A: Why does this error occur and how can I fix it? (I didn't find call to satisfyWith in attoparsec's source code)
Question B: How can I make the parser to not require a comma after the last name?
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Char8 as P
import qualified Data.ByteString.Char8 as BS
import Control.Applicative(many)
data Name = Name String deriving Show
readName = P.takeWhile (/='"')
entryParser :: Parser Name
entryParser = do
  P.char '"'
  name <- readName
  P.char ','
  return $ Name (BS.unpack name)
someEntry :: IO BS.ByteString
someEntry = do
  return $ BS.pack "\"John\","
main :: IO()
main = do
  someEntry >>= print . parse entryParser
I am using GHC 7.6.3 and attoparsec-0.11.3.4.
Question A: Why does this error occur and how can I fix it? (I didn't find call to satisfyWith in attoparsec's source code)
readName = P.takeWhile (/='"')
takeWhile consumes input as long as the predicate is true. Therefore, after you read the name, the closing " hasn't been consumed. This is easy to see if we remove P.char ',' from the entryParser:
entryParser = P.char '"' >> fmap (Name . BS.unpack) readName
$ runhaskell SO.hs
Done "\"," Name "John"
You need to consume the ":
entryParser :: Parser Name
entryParser = do
  P.char '"'
  name <- readName
  P.char '"' -- <<<<<<<<<<<<<<<<<<<<<<
  P.char ','
  return $ Name (BS.unpack name)
Question B: How can I make the parser to not require a comma after the last name?
Use sepBy.
Now that your questions have been cleared up, let's make things a little bit easier. Don't consume the , at all in entryParser; instead, only take the name:
entryParser = P.char '"' *> fmap ( Name . BS.unpack ) readName <* P.char '"'
In case you don't know (*>) and (<*), they're both from Control.Applicative, and they basically mean "discard whatever is on the asterisk's side".
Now, in order to parse all comma-separated entries, we use sepBy entryParser (P.char ','). However, this will lead to attoparsec returning a Partial:
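To see the two operators in isolation, here is a small sketch that swaps attoparsec for ReadP from base so that it runs without extra packages; quotedName and parseName are hypothetical names:

```haskell
import Text.ParserCombinators.ReadP

-- (*>) drops the result of the opening quote, (<*) drops the result of
-- the closing quote, so only the text in between is returned.
quotedName :: ReadP String
quotedName = char '"' *> munch (/= '"') <* char '"'

-- Succeeds only when the whole input is exactly one quoted name.
parseName :: String -> Maybe String
parseName input = case readP_to_S (quotedName <* eof) input of
  [(name, "")] -> Just name
  _            -> Nothing
```

The same shape carries over to the attoparsec parser, with P.takeWhile in place of munch.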
$ runhaskell SO.hs
Partial _
That's actually a feature of attoparsec you have to keep in mind:
Attoparsec supports incremental input, meaning that you can feed it a bytestring that represents only part of the expected total amount of data to parse. If your parser reaches the end of a fragment of input and could consume more input, it will suspend parsing and return a Partial continuation.
If you do want to use incremental input, use parse and feed. Otherwise use parseOnly. The complete code for your example would be something like
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Char8 as P
import qualified Data.ByteString.Char8 as BS
import Control.Applicative(many, (*>), (<*))
data Name = Name String deriving Show
readName = P.takeWhile (/='"')
entryParser :: Parser Name
entryParser = P.char '"' *> fmap ( Name . BS.unpack ) readName <* P.char '"'
allEntriesParser = sepBy entryParser (P.char ',')
testString = "\"John\",\"Martha\",\"test\""
main = print . parseOnly allEntriesParser $ testString
$ runhaskell SO.hs
Right [Name "John",Name "Martha",Name "test"]
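For the incremental route mentioned above, a rough sketch could look like the following. It assumes a newer attoparsec, where the module is called Data.Attoparsec.ByteString.Char8, and the names quoted, names and incremental are made up for this example. Feeding an empty chunk tells attoparsec that no more input will arrive.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Attoparsec.ByteString.Char8 as P
import qualified Data.ByteString.Char8 as BS

quoted :: P.Parser BS.ByteString
quoted = P.char '"' *> P.takeWhile (/= '"') <* P.char '"'

names :: P.Parser [BS.ByteString]
names = quoted `P.sepBy` P.char ','

-- Supply the input in two chunks, then an empty chunk to signal the
-- end of input so that the Partial result can be resolved.
incremental :: Either String [BS.ByteString]
incremental =
  P.eitherResult (P.parse names "\"John\",\"Ka" `P.feed` "te\"" `P.feed` BS.empty)
```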

Parsec not parsing newline character

I have the following piece of code:
import Text.ParserCombinators.Parsec
import Control.Applicative hiding ((<|>))
import Control.Monad
data Test = Test Integer Integer deriving Show
integer :: Parser Integer
integer = rd <$> many1 digit
  where rd = read :: String -> Integer
testParser :: Parser Test
testParser = do
  a <- integer
  char ','
  b <- integer
  eol
  return $ Test a b
eol :: Parser Char
eol = char '\n'
main = forever $ do putStrLn "Enter the value you need to parse: "
                    input <- getLine
                    parseTest testParser input
But when I actually try to parse my value in ghci, it doesn't work.
ghci > main
Enter the value you need to parse:
34,343\n
parse error at (line 1, column 7):
unexpected "\\"
expecting digit or "\n"
Any ideas on what I'm missing here ?
The problem seems to be that you're expecting a newline, but your text doesn't contain one. Change eol to
eol :: Parser ()
eol = void (char '\n') <|> eof
and it'll work.
"\n" is an escape code used in Haskell (and C, etc.) string and character literals to represent ASCII 0x0A, the character that is used to indicate end-of-line on UNIX and UNIX-like platforms. You don't (normally) use the <\> or <n> keys on your keyboard to put this character in a file, for example; instead, you use the <Enter> key.
On PC-DOS and DOS-like systems, ASCII 0x0D followed by ASCII 0x0A is used for end-of-line and "\r" is the escape code used for ASCII 0x0D.
getLine reads until it finds end-of-line and returns a string containing everything but the end-of-line character. So, in your example, your parser will fail to match. You might fix this by matching end-of-line optionally.
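Matching the end-of-line optionally could be sketched like this. It is one possible arrangement and not necessarily how the original code should be restructured; runTest is a hypothetical helper for trying inputs without going through getLine.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Test = Test Integer Integer deriving (Show, Eq)

integer :: Parser Integer
integer = read <$> many1 digit

-- After the second number, accept either a newline or the end of input,
-- so a line returned by getLine (which has no '\n') still parses.
testParser :: Parser Test
testParser = Test <$> integer <* char ',' <*> integer <* (() <$ newline <|> eof)

-- Hypothetical helper: run the parser and drop the error details.
runTest :: String -> Maybe Test
runTest = either (const Nothing) Just . parse testParser ""
```

Both "34,343" and "34,343\n" are accepted, while the literal characters backslash and n typed at the prompt are still rejected.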

Parsec - error "combinator 'many' is applied to a parser that accepts an empty string"

I'm trying to write a parser using Parsec that will parse literate Haskell files, such as the following:
The classic 'Hello, world' program.
\begin{code}
main = putStrLn "Hello, world"
\end{code}
More text.
I've written the following, sort-of-inspired by the examples in RWH:
import Text.ParserCombinators.Parsec
main
  = do contents <- readFile "hello.lhs"
       let results = parseLiterate contents
       print results
data Element
  = Text String
  | Haskell String
  deriving (Show)
parseLiterate :: String -> Either ParseError [Element]
parseLiterate input
  = parse literateFile "(unknown)" input
literateFile
  = many codeOrProse
codeOrProse
  = code <|> prose
code
  = do eol
       string "\\begin{code}"
       eol
       content <- many anyChar
       eol
       string "\\end{code}"
       eol
       return $ Haskell content
prose
  = do content <- many anyChar
       return $ Text content
eol
  = try (string "\n\r")
  <|> try (string "\r\n")
  <|> string "\n"
  <|> string "\r"
  <?> "end of line"
Which I hoped would result in something along the lines of:
[Text "The classic 'Hello, world' program.", Haskell "main = putStrLn \"Hello, world\"", Text "More text."]
(allowing for whitespace etc).
This compiles fine, but when run, I get the error:
*** Exception: Text.ParserCombinators.Parsec.Prim.many: combinator 'many' is applied to a parser that accepts an empty string
Can anyone shed any light on this, and possibly help with a solution please?
As sth pointed out, many anyChar is the problem, and not just in prose but also in code. The problem with code is that content <- many anyChar will consume everything: the newlines and the \end{code} tag.
So you need some way to tell the prose and the code apart. An easy (but maybe too naive) way to do so is to look for backslashes:
literateFile = many codeOrProse <* eof
code = do string "\\begin{code}"
          content <- many $ noneOf "\\"
          string "\\end{code}"
          return $ Haskell content
prose = do content <- many1 $ noneOf "\\"
           return $ Text content
Now you don't quite have the desired result, because the Haskell part will also contain newlines, but you can filter these out quite easily (given a function filterNewlines, you could say content <- filterNewlines <$> (many $ noneOf "\\")).
Edit
Okay, I think I found a solution (requires the newest Parsec version, because of lookAhead):
import Text.ParserCombinators.Parsec
import Control.Applicative hiding (many, (<|>))
main
  = do contents <- readFile "hello.lhs"
       let results = parseLiterate contents
       print results
data Element
  = Text String
  | Haskell String
  deriving (Show)
parseLiterate :: String -> Either ParseError [Element]
parseLiterate input
  = parse literateFile "" input
literateFile
  = many codeOrProse
codeOrProse = code <|> prose
code = do string "\\begin{code}\n"
          c <- untilP (string "\\end{code}\n")
          string "\\end{code}\n"
          return $ Haskell c
prose = do t <- untilP $ (string "\\begin{code}\n") <|> (eof >> return "")
           return $ Text t
untilP p = do s <- many $ noneOf "\n"
              newline
              s' <- try (lookAhead p >> return "") <|> untilP p
              return $ s ++ s'
untilP p parses a line, then checks whether the beginning of the next line can be successfully parsed by p. If so, it returns the empty string; otherwise it goes on. The lookAhead is needed because otherwise the \begin and \end tags would be consumed and code couldn't recognize them.
I guess it could still be made more concise (i.e. not having to repeat string "\\end{code}\n" inside code).
I haven't tested it, but:
many anyChar can match an empty string
Therefore prose can match an empty string
Therefore codeOrProse can match an empty string
Therefore literateFile can loop forever, matching infinitely many empty strings
Changing prose to match many1 characters might fix this problem.
(I'm not very familiar with Parsec, but how will prose know how many characters it should match? It might consume the whole input, never giving the code parser a second chance to look for the start of a new code segment. Alternatively it might only match one character in each call, making the many/many1 in it useless.)
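The reasoning above can be illustrated with a minimal sketch in which every alternative under many consumes at least one character, so the loop always makes progress. The parser name chunks is hypothetical and the grammar is deliberately simpler than the literate Haskell one.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Each chunk is either a non-empty run of non-backslash characters or a
-- single backslash. Neither alternative can succeed on empty input, so
-- 'many' is safe here.
chunks :: Parser [String]
chunks = many (many1 (noneOf "\\") <|> string "\\") <* eof
```

Replacing many1 with many in the first alternative would bring back the exception from the question, because that alternative could then succeed without consuming anything.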
For reference, here's another version I came up with (slightly expanded to handle other cases):
import Text.ParserCombinators.Parsec
main
  = do contents <- readFile "test.tex"
       let results = parseLiterate contents
       print results
data Element
  = Text String
  | Haskell String
  | Section String
  deriving (Show)
parseLiterate :: String -> Either ParseError [Element]
parseLiterate input
  = parse literateFile "(unknown)" input
literateFile
  = do es <- many elements
       eof
       return es
elements
  = try section
  <|> try quotedBackslash
  <|> try code
  <|> prose
code
  = do string "\\begin{code}"
       c <- anyChar `manyTill` try (string "\\end{code}")
       return $ Haskell c
quotedBackslash
  = do string "\\\\"
       return $ Text "\\\\"
prose
  = do t <- many1 (noneOf "\\")
       return $ Text t
section
  = do string "\\section{"
       content <- many1 (noneOf "}")
       char '}'
       return $ Section content
