Parsec not parsing newline character - haskell

I have the following piece of code:
import Text.ParserCombinators.Parsec
import Control.Applicative hiding ((<|>))
import Control.Monad
data Test = Test Integer Integer deriving Show
integer :: Parser Integer
integer = rd <$> many1 digit
  where rd = read :: String -> Integer
testParser :: Parser Test
testParser = do
  a <- integer
  char ','
  b <- integer
  eol
  return $ Test a b
eol :: Parser Char
eol = char '\n'
main = forever $ do putStrLn "Enter the value you need to parse: "
                    input <- getLine
                    parseTest testParser input
But when I actually try to parse my value in ghci, it doesn't work.
ghci > main
Enter the value you need to parse:
34,343\n
parse error at (line 1, column 7):
unexpected "\\"
expecting digit or "\n"
Any ideas on what I'm missing here ?

The problem seems to be that you're expecting a newline, but your text doesn't contain one. Change eol to
eol :: Parser ()
eol = void (char '\n') <|> eof
and it'll work.

"\n" is an escape code used in Haskell (and C, etc.) string and character literals to represent ASCII 0x0A, the character that is used to indicate end-of-line on UNIX and UNIX-like platforms. You don't (normally) use the <\> or <n> keys on your keyboard to put this character in a file (e.g.) instead you use the <Enter> key.
On PC-DOS and DOS-like systems, ASCII 0x0D followed by ASCII 0x0A is used for end-of-line and "\r" is the escape code used for ASCII 0x0D.
getLine reads until it finds end-of-line and returns a string containing everything but the end-of-line character. So, in your example, your parser will fail to match. You might fix this by matching end-of-line optionally.
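To make that last point concrete, here is a minimal sketch of the parser with an optional end-of-line. It assumes parsec 3's Text.Parsec module layout and feeds the parser fixed strings instead of calling getLine; the applicative style of testParser and the main driver are illustrative choices, not part of the question.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)
import Control.Monad (void)

data Test = Test Integer Integer deriving (Show, Eq)

integer :: Parser Integer
integer = read <$> many1 digit

-- Accept either a real newline character or the end of the input.
eol :: Parser ()
eol = void (char '\n') <|> eof

testParser :: Parser Test
testParser = Test <$> integer <* char ',' <*> integer <* eol

main :: IO ()
main = do
  -- What getLine hands over: the line without its terminator.
  print (parse testParser "" "34,343")
  -- Input that really ends in a newline character.
  print (parse testParser "" "34,343\n")
```

Both calls should succeed with Test 34 343, because eol no longer insists on a newline that getLine has already stripped.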

Related

Parsec negative match

parseIdent :: Parser (String)
parseIdent = do
  x <- lookAhead $ try $ many1 (choice [alphaNum])
  void $ optional endOfLine <|> eof
  case x of
    "macro" -> fail "illegal"
    _ -> pure x
I'm trying to parse an alphanumeric string that only succeeds if it does not match a predetermined value (macro in this case).
However the following is giving me an error of:
*** Exception: Text.ParserCombinators.Parsec.Prim.many: combinator 'many' is applied to a parser that accepts an empty string.
Which does not make sense: how does many1 (choice [alphaNum]) accept an empty string?
This error goes away if I remove the lookAhead $ try. But then it 'fails' with illegal:
...
*** Exception: (line 6, column 36):
unexpected " "
expecting letter or digit or new-line
illegal
Am I going about this correctly? Or is there another technique to implement a negative search?
You almost have it:
import Text.Parsec
import Text.Parsec.Char
import Text.Parsec.String
import Control.Monad
parseIdent :: Parser (String)
parseIdent = try $ do
  x <- many1 alphaNum
  void $ optional endOfLine <|> eof
  case x of
    "macro" -> fail "illegal"
    _ -> pure x
So, why didn't your code work?
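The try is in the wrong spot. The real backtracking happens after you've read the alphanumeric word and checked that it isn't "macro", so try has to wrap the whole block.
lookAhead has no business here. If you end up with the word you wanted, you do want it to be consumed from the input, and try already takes care of resetting the input stream to its previous state when the parser fails. Because lookAhead never consumes input, your original parseIdent could succeed without consuming anything, which is presumably what triggered the 'many' error.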
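Here is a small sketch of why the placement of try matters. It assumes parsec 3; the names ident and keywordOrIdent are made up for illustration, and the end-of-line handling from the question is left out to keep it short.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical helper: an identifier that is not the reserved word "macro".
-- try wraps the whole block, so a rejection rewinds the input.
ident :: Parser String
ident = try $ do
  x <- many1 alphaNum
  if x == "macro" then fail "illegal" else pure x

-- Hypothetical caller: the second branch is only reachable because
-- ident gives back the characters it read before failing.
keywordOrIdent :: Parser (Either String String)
keywordOrIdent = (Right <$> ident) <|> (Left <$> string "macro")

main :: IO ()
main = do
  print (parse keywordOrIdent "" "foo")
  print (parse keywordOrIdent "" "macro")
```

Without the try, ident would fail after consuming "macro", and <|> would not attempt the second branch at all.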

Parsec start-of-row pattern?

I am trying to parse mediawiki text using Parsec. Some of the constructs in mediawiki markup can only occur at the start of rows (such as the header markup ==header level 2==). In regexp I would use an anchor (such as ^) to find the start of a line.
One attempt in GHCi is
Prelude Text.Parsec> parse (char '\n' *> string "==" *> many1 letter <* string "==") "" "\n==hej=="
Right "hej"
but this is not too good since it will fail on the first line of a file. I feel like this should be a solved problem...
What is the most idiomatic "Start of line" parsing in Parsec?
You can use getPosition and sourceColumn in order to find out the column number that the parser is currently looking at. The column number will be 1 if the current position is at the start of a line (such as at the start of input or right after a \n character; Parsec's default position tracking does not reset the column on a lone \r).
There isn't a built-in combinator for this, but you can easily make it:
import Text.Parsec
import Control.Monad (guard)
startOfLine :: Monad m => ParsecT s u m ()
startOfLine = do
  pos <- getPosition
  guard (sourceColumn pos == 1)
Now you can write your header parser as:
header = startOfLine *> string "==" *> many1 letter <* string "=="
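As a quick check of the idea, the sketch below runs the header parser in three situations. It assumes parsec 3 and specialises startOfLine to the plain String parser type; the test inputs are invented for illustration.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)
import Control.Monad (guard)

-- Succeeds without consuming input only when the parser sits in column 1.
startOfLine :: Parser ()
startOfLine = do
  pos <- getPosition
  guard (sourceColumn pos == 1)

header :: Parser String
header = startOfLine *> string "==" *> many1 letter <* string "=="

main :: IO ()
main = do
  -- Header on the very first line of the input.
  print (parse header "" "==hej==")
  -- Header right after a newline.
  print (parse (string "ab\n" *> header) "" "ab\n==hej==")
  -- Header markup in the middle of a line: startOfLine rejects it.
  print (parse (string "ab" *> header) "" "ab==hej==")
```

The first two calls should return "hej", while the third should be a parse error because the column is 3 when startOfLine runs.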
You can probably use many (char '\n') instead of just char '\n'. In parser combinators there's no sense of the start of a line, because they always run at the start of input. The only thing you can do is check manually which symbols your input can start with. Using many (char '\n') ensures that there are only zero or more empty lines before the header == my header ==.

Convert unescaped unicode to utf8 integer

Firstly, I apologize if the terms "unescaped unicode" and "utf8 integer" are not correct; I don't really know what I'm talking about when I'm talking about encoding.
As a concrete example, I would like to convert the string "\\u00b5ABC" to the string "\181ABC" (\u00b5 and \181 correspond to µ). By "string" I mean String or Text.
I know how to achieve this by using a tortuous (and perhaps laughable) way:
import Data.Aeson (decode)
import Data.ByteString.Lazy (packChars)
import Data.Text (Text)
decode (packChars "\"\\u00b5ABC\"") :: Maybe Text
I am ready to bet there exists a more direct way...
Edit
Following @Alec's comment, I provide more context. In the background, there is a JavaScript program that receives a character string and replaces the characters in this string with their Unicode representation \\uxxxx when this representation is between \u007F and \uFFFF.
On the Haskell side, I receive this new string, and I want to replace the \\uxxxx with their corresponding utf8 integer representations.
Here's a nice simple parser written using regex-applicative. First some imports and other nonsense that isn't worth reading:
import Data.Char
import Data.Maybe
import Numeric
import Text.Regex.Applicative
-- no idea why this isn't in Control.Applicative
replicateA :: Applicative f => Int -> f a -> f [a]
replicateA n act = sequenceA (replicate n act)
Now, we want to parse an escaped character. We'll use a regex that matches characters and returns a character, so it's an RE Char Char. Ideally I'd write it this way:
escaped :: RE Char Char
escaped = do
  string "\\u"
  digits <- replicateM 4 (psym isHexDigit)
  return . chr . fst . head . readHex $ digits
The head is safe because we've ensured that readHex will only be passed hex digits, and therefore will succeed. We can almost write it like that, except that RE Char is not a Monad. With newish GHC's you can probably turn on ApplicativeDo and be done with it, but it's not so bad to write in applicative style ourselves anyway and support all GHC's, so let's do that:
escaped :: RE Char Char
escaped
  = chr . fst . head . readHex
  <$> (string "\\u"
       *> replicateA 4 (psym isHexDigit)
      )
Anyway, once we have a regex for decoding a single escaped character, it's easy to produce a regex for decoding all the escaped characters and passing unescaped characters through unchanged: many (escaped <|> anySym). Since this regex will always succeed, we can ignore the Maybe-ness of (=~) hedging its bets about whether an expression will match, and write
decodeHex :: String -> String
decodeHex = fromJust . (=~ many (escaped <|> anySym))
Let's try it in ghci:
> decodeHex "\\u00b5ABC"
"\181ABC"
> decodeHex "\\u00bABC"
"\186BC"
> decodeHex "\\udefg"
"\\udefg"
The advantage of writing our own parser like this instead of relying on something like decode is that we gain control and confidence over exactly which transformations are being done; for example, since we know \u will always be followed by four hex digits, we can only transform it when that happens, in case the original, pre-Javascript text contained \\udefg and we want that to appear in the final output, rather than \3567g; and we don't have to worry that it is trying to de-escape other things that we don't want it to do; and we don't have to "extra-escape" our string before we hand it off, either, as you do with adding the extra quotes around it. And of course, the disadvantage is that we had to engineer it ourselves, and probably have less confidence in its correctness since it hasn't been battle-hardened by a thousand users!
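If pulling in regex-applicative is not an option, the same transformation can be sketched with plain pattern matching from base. This is a swapped-in technique, not the regex approach above: it only assumes that an escape is exactly \u followed by four hex digits, as stated in the question.

```haskell
import Data.Char (chr, isHexDigit)
import Numeric (readHex)

-- Replace every \uXXXX (exactly four hex digits) with the character it
-- denotes; everything else is passed through unchanged.
decodeHex :: String -> String
decodeHex ('\\':'u':rest)
  | (ds, rest') <- splitAt 4 rest
  , length ds == 4
  , all isHexDigit ds
  , [(n, "")] <- readHex ds
  = chr n : decodeHex rest'
decodeHex (c:cs) = c : decodeHex cs
decodeHex []     = []

main :: IO ()
main = mapM_ (print . decodeHex) ["\\u00b5ABC", "\\u00bABC", "\\udefg"]
```

It should agree with the ghci session above on all three inputs, including leaving \udefg untouched.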

Parsec - error "combinator 'many' is applied to a parser that accepts an empty string"

I'm trying to write a parser using Parsec that will parse literate Haskell files, such as the following:
The classic 'Hello, world' program.
\begin{code}
main = putStrLn "Hello, world"
\end{code}
More text.
I've written the following, sort-of-inspired by the examples in RWH:
import Text.ParserCombinators.Parsec
main
  = do contents <- readFile "hello.lhs"
       let results = parseLiterate contents
       print results
data Element
  = Text String
  | Haskell String
  deriving (Show)
parseLiterate :: String -> Either ParseError [Element]
parseLiterate input
  = parse literateFile "(unknown)" input
literateFile
  = many codeOrProse
codeOrProse
  = code <|> prose
code
  = do eol
       string "\\begin{code}"
       eol
       content <- many anyChar
       eol
       string "\\end{code}"
       eol
       return $ Haskell content
prose
  = do content <- many anyChar
       return $ Text content
eol
  = try (string "\n\r")
  <|> try (string "\r\n")
  <|> string "\n"
  <|> string "\r"
  <?> "end of line"
Which I hoped would result in something along the lines of:
[Text "The classic 'Hello, world' program.", Haskell "main = putStrLn \"Hello, world\"", Text "More text."]
(allowing for whitespace etc).
This compiles fine, but when run, I get the error:
*** Exception: Text.ParserCombinators.Parsec.Prim.many: combinator 'many' is applied to a parser that accepts an empty string
Can anyone shed any light on this, and possibly help with a solution please?
As sth pointed out, many anyChar is the problem, and not just in prose but also in code. The problem with code is that content <- many anyChar will consume everything: the newlines and the \end{code} tag.
So, you need to have some way to tell the prose and the code apart. An easy (but maybe too naive) way to do so, is to look for backslashes:
literateFile = many codeOrProse <* eof
code = do string "\\begin{code}"
          content <- many $ noneOf "\\"
          string "\\end{code}"
          return $ Haskell content
prose = do content <- many1 $ noneOf "\\"
           return $ Text content
Now, you don't completely have the desired result, because the Haskell part will also contain newlines, but you can filter these out quite easily (given a function filterNewlines you could say content <- filterNewlines <$> (many $ noneOf "\\")).
Edit
Okay, I think I found a solution (requires the newest Parsec version, because of lookAhead):
import Text.ParserCombinators.Parsec
import Control.Applicative hiding (many, (<|>))
main
  = do contents <- readFile "hello.lhs"
       let results = parseLiterate contents
       print results
data Element
  = Text String
  | Haskell String
  deriving (Show)
parseLiterate :: String -> Either ParseError [Element]
parseLiterate input
  = parse literateFile "" input
literateFile
  = many codeOrProse
codeOrProse = code <|> prose
code = do string "\\begin{code}\n"
          c <- untilP (string "\\end{code}\n")
          string "\\end{code}\n"
          return $ Haskell c
prose = do t <- untilP $ (string "\\begin{code}\n") <|> (eof >> return "")
           return $ Text t
untilP p = do s <- many $ noneOf "\n"
              newline
              s' <- try (lookAhead p >> return "") <|> untilP p
              return $ s ++ s'
untilP p parses a line, then checks if the beginning of the next line can be successfully parsed by p. If so, it returns the empty string, otherwise it goes on. The lookAhead is needed, because otherwise the begin\end-tags would be consumed and code couldn't recognize them.
I guess it could still be made more concise (i.e. not having to repeat string "\\end{code}\n" inside code).
I haven't tested it, but:
many anyChar can match an empty string
Therefore prose can match an empty string
Therefore codeOrProse can match an empty string
Therefore literateFile can loop forever, matching infinitely many empty strings
Changing prose to match many1 characters might fix this problem.
(I'm not very familiar with Parsec, but how will prose know how many characters it should match? It might consume the whole input, never giving the code parser a second chance to look for the start of a new code segment. Alternatively it might only match one character in each call, making the many/many1 in it useless.)
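The chain of reasoning above can be sketched in a few lines. This assumes parsec 3; the names bad and good are invented, and only the well-behaved parser is actually run.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Can succeed without consuming anything, so `many bad` could never
-- terminate; Parsec detects this and raises the runtime error quoted
-- in the question. (Defined only for comparison, never run here.)
bad :: Parser String
bad = many anyChar

-- Must consume at least one character, so `many good` always makes progress.
good :: Parser String
good = many1 (noneOf "\\")

main :: IO ()
main = print (parse (many good <* eof) "" "some prose")
```

Here many good stops cleanly once good fails at the end of the input without having consumed anything.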
For reference, here's another version I came up with (slightly expanded to handle other cases):
import Text.ParserCombinators.Parsec
main
  = do contents <- readFile "test.tex"
       let results = parseLiterate contents
       print results
data Element
  = Text String
  | Haskell String
  | Section String
  deriving (Show)
parseLiterate :: String -> Either ParseError [Element]
parseLiterate input
  = parse literateFile "(unknown)" input
literateFile
  = do es <- many elements
       eof
       return es
elements
  = try section
  <|> try quotedBackslash
  <|> try code
  <|> prose
code
  = do string "\\begin{code}"
       c <- anyChar `manyTill` try (string "\\end{code}")
       return $ Haskell c
quotedBackslash
  = do string "\\\\"
       return $ Text "\\\\"
prose
  = do t <- many1 (noneOf "\\")
       return $ Text t
section
  = do string "\\section{"
       content <- many1 (noneOf "}")
       char '}'
       return $ Section content

Haskell: How to get "\\0" into "\0"?

Haskell has a number of string literals that use the \ escape sequence. Ones such as \n, \t, \NUL.
If I have the string literal:
let s = "Newline: \\n Tab: \\t"
how do I define the function escape :: String -> String that will convert the above string to:
"Newline: \n Tab: \t"
And the same with all other string literal escape sequences.
I'm okay with using Quasi Quoting and Template Haskell, but don't know how to use them to achieve the result. Any pointers?
Update: I just found the Text.ParserCombinators.ReadP module that's included in the Base library. It supports the readLitChar :: ReadS Char function in Data.Char that does what I want, but I don't know how to use the ReadP module. I tried the following and it works:
escape2 [] = []
escape2 xs = case readLitChar xs of
  [] -> []
  [(a, b)] -> a : escape2 b
But this may not be the right way to use the ReadP module. Can anyone provide some pointers?
Another update: Thanks everyone. My final function below. Not bad, I think.
import Text.ParserCombinators.ReadP
import Text.Read.Lex
escape xs
  | [] <- r = []
  | [(a,_)] <- r = a
  where r = readP_to_S (manyTill lexChar eof) xs
You don't need to do anything. When you input the string literal
let s = "Newline: \\n Tab: \\t"
you can check that it is what you want:
Prelude> putStrLn s
Newline: \n Tab: \t
Prelude> length s
19
If you just ask ghci for the value of s you'll get something else,
Prelude> s
"Newline: \\n Tab: \\t"
apparently it's doing some escape formatting behind your back, and it also displays the quotes. If you call show or print you'll get yet other answers:
Prelude> show s
"\"Newline: \\\\n Tab: \\\\t\""
Prelude> print s
"Newline: \\n Tab: \\t"
This is because show is meant for serializing values, so when you show a string you don't get the original back, you instead get a serialized string which can be parsed into the original string. The result of show s is actually displayed by print s (print is defined as putStrLn . show). When you just show s in ghci you get an even stranger answer; here ghci is formatting the characters which are serialized by show.
tl;dr - always use putStrLn to see what the value of a string is in ghci.
Edit: I just realized that maybe you want to convert the literal value
Newline: \n Tab: \t
into the actual control sequences. The easiest way to do this is probably to stick it in quotes and use read:
Prelude> let s' = '"' : s ++ "\""
Prelude> read s' :: String
"Newline: \n Tab: \t"
Prelude> putStrLn (read s')
Newline:
Tab:
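Since read throws an exception on input it cannot parse, a slightly safer variant of the same trick is sketched below using readMaybe from Text.Read in base. The function name unescape is made up for illustration.

```haskell
import Text.Read (readMaybe)

-- Wrap the raw text in quotes and let the Read instance for String
-- interpret the escape sequences. Returns Nothing when the text is not
-- a valid string-literal body (unknown escape, stray quote, ...).
unescape :: String -> Maybe String
unescape s = readMaybe ('"' : s ++ "\"")

main :: IO ()
main = do
  print (unescape "Newline: \\n Tab: \\t")
  print (unescape "bad \\q")
```

The first call should yield the string with a real newline and tab; the second should yield Nothing because \q is not a Haskell escape sequence.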
Edit 2: an example of using readLitChar, this is very close to Chris's answer except with readLitChar:
strParser :: ReadP String
strParser = do
  str <- many (readS_to_P readLitChar)
  eof
  return str
Then you run it with readP_to_S, which gives you a list of matching parses (there shouldn't be more than one match, however there might not be any match so you should check for an empty list.)
> putStrLn . fst . head $ readP_to_S strParser s
Newline:
Tab:
>
Asking about QQ and TH means you wish to do this conversion at compile time. For simple String -> Something conversions you can use the OverloadedStrings literal facility in GHC.
EDIT 2 : Using the exposed character lexer in Text.Read.Lex
module UnEscape where
import Data.String(IsString(fromString))
import Text.ParserCombinators.ReadP as P
import Text.Read.Lex as L
newtype UnEscape = UnEscape { unEscape :: String }
instance IsString UnEscape where
  fromString rawString = UnEscape lexed
    where lexer = do s <- P.many L.lexChar
                     eof
                     return s
          lexed = case P.readP_to_S lexer rawString of
                    ((answer,""):_) -> answer
                    _ -> error ("UnEscape could not process "++show rawString)
EDIT 1 : I have now got a better UnEscape instance that uses GHC's read:
instance IsString UnEscape where
  fromString rawString = UnEscape (read (quote rawString))
    where quote s = '"' : s ++ ['"']
For example:
module UnEscape where
import Data.String(IsString(fromString))
newtype UnEscape = UnEscape { unEscape :: String }
instance IsString UnEscape where
  fromString rawString = UnEscape (transform rawString)
    where transform [] = []
          transform ('\\':x:rest) = replace x : transform rest
          transform (y:rest) = y : transform rest
          -- also covers special case of backslash at end
          replace x = case x of
            'n' -> '\n'
            't' -> '\t'
            unrecognized -> unrecognized
The above has to be a separate module from the module that uses unEscape:
{-# LANGUAGE OverloadedStrings #-}
module Main where
import UnEscape(UnEscape(unEscape))
main = do
  let s = "Newline: \\n Tab: \\t"
      t = unEscape "Newline: \\n Tab: \\t"
  print s
  putStrLn s
  print t
  putStrLn t
This produces
shell prompt$ ghci Main.hs
GHCi, version 7.0.3: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
[1 of 2] Compiling UnEscape ( UnEscape.hs, interpreted )
[2 of 2] Compiling Main ( Main.hs, interpreted )
Ok, modules loaded: Main, UnEscape.
*Main> main
"Newline: \\n Tab: \\t"
Newline: \n Tab: \t
"Newline: \n Tab: \t"
Newline:
Tab:
