I'm writing a parser to parse huge chunks of English text using attoparsec. Everything has been great so far, except for parsing this char "――". I know it is just 2 dashes together "--". The weird thing is, the parser catches it in this code:
wordSeparator :: Parser ()
wordSeparator = many1 (space <|> satisfy (inClass "――?!,:")) >> pure ()
but not in this case:
specialChars = ['――', '?', '!', ',', ':']
wordSeparator :: Parser ()
wordSeparator = many1 (space <|> satisfy (inClass specialChars)) >> pure ()
The reason I'm using the list specialChars is that I have a lot of characters to consider and I apply it in multiple places. For the input "I am ――Walt Whitman._" the output is supposed to be {"I", "am", "Walt", "Whitman."}. I believe the problem is that "――" is not a Char? How do I fix this?
A Char is one character, full stop. ―― is two characters, so it is two Chars. You can fit as many Chars as you want into a String, but you certainly cannot fit two Chars into one Char.
Since satisfy considers individual characters at a time, it probably isn’t what you want if you need to parse a sequence of two characters as a single unit. The inClass function just produces a predicate on characters (inClass partially applied to one argument produces a function of type Char -> Bool), so inClass "――" is the same as inClass ['―', '―'], which is just the same as inClass ['―'] since duplicates are irrelevant. That won’t help you much.
Consider using string instead of, or in combination with, inClass, since it is designed to handle sequences of characters. Note that every branch of <|> has to produce the same result type; string returns the matched sequence while space and satisfy return a single Char, so each branch's result is discarded with void (from Data.Functor) below. Something like this might better suit your needs:
wordSeparator :: Parser ()
wordSeparator = many1 (void space <|> void (string "――") <|> void (satisfy (inClass "?!,:"))) >> pure ()
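If it helps, here is a fuller, hedged sketch of how the separator might be used to split the example input. The names word and words' and the exact character sets are my own choices, not from the question; the trailing "._" stays attached because '.' and '_' are not separators here.

{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Attoparsec.Text
import Data.Char (isSpace)
import Data.Functor (void)
import Data.Text (Text)

-- A word is a maximal run of characters that are neither whitespace
-- nor one of the separator characters.
word :: Parser Text
word = takeWhile1 (\c -> not (isSpace c) && notInClass "―?!,:" c)

wordSeparator :: Parser ()
wordSeparator = many1 (void space <|> void (string "――") <|> void (satisfy (inClass "?!,:"))) >> pure ()

words' :: Parser [Text]
words' = word `sepBy` wordSeparator

-- ghci> parseOnly words' "I am ――Walt Whitman._"
-- Right ["I","am","Walt","Whitman._"]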
I have written a parser in Haskell's parsec library for a lisp-like language, and I would like to improve its error messages, but I am stuck in the following situation:
p :: Parser Integer
p = do
  try (string "number")
  space
  integer

parens :: Parser a -> Parser a
parens p = do
  char '('
  v <- p
  char ')'
  return v

finalParser p = p <|> parens (finalParser p)
So the parser p behaves quite nicely error-wise: it does not commit and consume input until it sees the keyword "number". Once it has successfully parsed the keyword, it does not restore its input on failure. If this parser is combined with other parsers using the <|> operator and it has parsed the keyword, it shows the error message of p and does not try the next parser.
However, parens p no longer has this property.
I would like to write a function parens, as above, that only consumes input if the parser p consumes input.
Here are a few examples of the behavior I am trying to achieve. finalParser should behave as follows:
"(((number 5)))", "number 5", and "(number 5)" should all parse to the integer 5 and consume everything,
"(5)" should fail and consume nothing (not even the parenthesis),
"(number abc)" and "number cde" should fail and consume input.
I would like to write a function parens, as above, that only consumes input if the parser p consumes input.
Rather than that, consider factoring your syntax so you don't even need to backtrack after consuming a parenthesis.
Instead of trying to make this work:
parens p1 <|> parens p2
why not write this instead:
parens (p1 <|> p2)
?
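A minimal sketch of the difference, using two made-up alternatives (numberP and zeroP are my own illustrative parsers, not from the question):

import Text.Parsec
import Text.Parsec.String (Parser)

-- Two hypothetical alternatives that may each appear inside parentheses.
numberP :: Parser Integer
numberP = try (string "number") *> space *> (read <$> many1 digit)

zeroP :: Parser Integer
zeroP = 0 <$ try (string "zero")

parens :: Parser a -> Parser a
parens p = char '(' *> p <* char ')'

-- Unfactored: if numberP fails after '(' has already been consumed, the first
-- alternative fails having consumed input, so an explicit try is needed.
unfactored :: Parser Integer
unfactored = try (parens numberP) <|> parens zeroP

-- Factored: the parenthesis is consumed exactly once, and only the inner
-- alternatives choose between themselves, so no backtracking over '(' occurs.
factored :: Parser Integer
factored = parens (numberP <|> zeroP)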
I am trying to parse mediawiki text using Parsec. Some of the constructs in mediawiki markup can only occur at the start of rows (such as the header markup ==header level 2==). In regexp I would use an anchor (such as ^) to find the start of a line.
One attempt in GHCi is
Prelude Text.Parsec> parse (char '\n' *> string "==" *> many1 letter <* string "==") "" "\n==hej=="
Right "hej"
but this is not too good since it will fail on the first line of a file. I feel like this should be a solved problem...
What is the most idiomatic "Start of line" parsing in Parsec?
You can use getPosition and sourceColumn in order to find out the column number that the parser is currently looking at. The column number will be 1 if the current position is at the start of a line (such as at the start of input or right after a \n character).
There isn't a built-in combinator for this, but you can easily make it:
import Text.Parsec
import Control.Monad (guard)
startOfLine :: Monad m => ParsecT s u m ()
startOfLine = do
  pos <- getPosition
  guard (sourceColumn pos == 1)
Now you can write your header parser as:
header = startOfLine *> string "==" *> many1 letter <* string "=="
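A quick sanity check in ghci (the test inputs are my own):
Prelude Text.Parsec> parse header "" "==hej=="
Right "hej"
Prelude Text.Parsec> parse (char '\n' *> header) "" "\n==hej=="
Right "hej"
Something like parse (string "x " *> header) "" "x ==hej==" fails instead, because the position is already at column 3 when startOfLine runs.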
You can probably use many (char '\n') instead of just char '\n'. Parser combinators have no built-in notion of "start of line": a parser simply runs from wherever the current input position is, so the only thing you can do is check manually which symbols your input may start with. Using many (char '\n') ensures that there are only zero or more empty lines before the header == my header ==.
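A hedged sketch of that approach (the name header' and the test input are mine):
header' :: Parsec String () String
header' = many (char '\n') *> string "==" *> many1 letter <* string "=="

Prelude Text.Parsec> parse header' "" "\n\n==hej=="
Right "hej"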
I'm trying to use parsec to read a C/C++/java source file and do a series of transformations on the entire file. The first phase removes strings and the second phase removes comments. (That's because you might get a /* inside a string.)
So each phase transforms a String into an Either ParseError String, and I want to bind them together (in the sense of Either) to make a pipeline of transformations of the whole file. This seems like a fairly general requirement.
import Text.ParserCombinators.Parsec
commentless, stringless :: Parser String
stringless = fmap concat ( (many (noneOf "\"")) `sepBy` quotedString )
quotedString = (char '"') >> (many quotedChar) >> (char '"')
quotedChar = try (string "\\\"" >> return '"' ) <|> (noneOf "\"")
commentless = fmap concat $ notComment `sepBy` comment
notComment = manyTill anyChar (lookAhead (comment <|> eof))
comment =     (string "//" >> manyTill anyChar newline       >> spaces >> return ())
          <|> (string "/*" >> manyTill anyChar (string "*/") >> spaces >> return ())
main =
  do c <- getContents
     case parse commentless "(stdin)" c of -- THIS WORKS
       -- case parse stringless "(stdin)" c of -- THIS WORKS TOO
       -- case parse (stringless `THISISWHATIWANT` commentless) "(stdin)" c of
       Left e -> do putStrLn "Error parsing input:"
                    print e
       Right r -> print r
So how can I do this? I tried parserBind but it didn't work.
(In case anybody cares why, I'm trying to do a kind of light parse where I just extract what I want but avoid parsing the entire grammar or even knowing whether it's C++ or Java. All I need to extract is the starting and ending line numbers of all classes and functions. So I envisage a bunch of preprocessing phases that just scrub out comments, #defines/ifdefs, template preambles and contents of parentheses (because of the semicolons in for clauses), then I'll parse for snippets preceding {s (or following }s because of typedefs) and stuff those snippets through yet another phase to get the type and name of whatever it is, then recurse to just the second level to get java member functions.)
You need to bind in Either ParseError, not in Parser: move the bind outside of parse and run multiple parses:
parse stringless "(stdin)" input >>= parse commentless "(stdin)"
There is probably a better approach than what you are using, but this will do what you want.
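Plugged into the question's main, that looks roughly like this (a sketch; only the case scrutinee changes):

main :: IO ()
main = do
  c <- getContents
  case parse stringless "(stdin)" c >>= parse commentless "(stdin)" of
    Left e  -> do putStrLn "Error parsing input:"
                  print e
    Right r -> print r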
Firstly, I apologize if the terms "unescaped unicode" and "utf8 integer" are not correct; I don't really know what I'm talking about when I'm talking about encoding.
As a concrete example, I would like to convert the string "\\u00b5ABC" to the string "\181ABC" (\u00b5 and \181 correspond to µ). By "string" I mean String or Text.
I know how to achieve this by using a tortuous (and perhaps laughable) way:
import Data.Aeson (decode)
import Data.ByteString.Lazy (packChars)
import Data.Text (Text)
decode (packChars "\"\\u00b5ABC\"") :: Maybe Text
I am ready to bet there exists a more direct way...
Edit
Following #Alec's comment, I provide more context. In the background, there is a Javascript program that receives a character string and replaces the characters in this string by their unicode representation \\uxxxx when this unicode representation is between \u007F and \uFFFF.
On the Haskell side, I receive this new string, and I want to replace the \\uxxxx with their corresponding utf8 integer representations.
Here's a nice simple parser written using regex-applicative. First some imports and other nonsense that isn't worth reading:
import Data.Char
import Data.Maybe
import Numeric
import Text.Regex.Applicative
-- no idea why this isn't in Control.Applicative
replicateA :: Applicative f => Int -> f a -> f [a]
replicateA n act = sequenceA (replicate n act)
Now, we want to parse an escaped character. We'll use a regex that matches characters and returns a character, so it's an RE Char Char. Ideally I'd write it this way:
escaped :: RE Char Char
escaped = do
  string "\\u"
  digits <- replicateM 4 (psym isHexDigit)
  return . chr . fst . head . readHex $ digits
The head is safe because we've ensured that readHex will only be passed hex digits, and therefore will succeed. We can almost write it like that, except that RE Char is not a Monad. With newish GHC's you can probably turn on ApplicativeDo and be done with it, but it's not so bad to write in applicative style ourselves anyway and support all GHC's, so let's do that:
escaped :: RE Char Char
escaped =
  chr . fst . head . readHex
    <$> (string "\\u" *> replicateA 4 (psym isHexDigit))
Anyway, once we have a regex for decoding a single escaped character, it's easy to produce a regex for decoding all the escaped characters and passing unescaped characters through unchanged: many (escaped <|> anySym). Since this regex will always succeed, we can ignore the Maybe-ness of (=~) hedging its bets about whether an expression will match, and write
decodeHex :: String -> String
decodeHex = fromJust . (=~ many (escaped <|> anySym))
Let's try it in ghci:
> decodeHex "\\u00b5ABC"
"\181ABC"
> decodeHex "\\u00bABC"
"\186BC"
> decodeHex "\\udefg"
"\\udefg"
The advantage of writing our own parser like this, instead of relying on something like decode, is that we gain control and confidence over exactly which transformations are done. For example, since we know \u will always be followed by four hex digits, we only transform it when that is the case; so if the original, pre-Javascript text contained \\udefg and we want that to appear in the final output, it does, rather than becoming \3567g. We also don't have to worry about it de-escaping other things that we don't want it to, and we don't have to "extra-escape" our string before we hand it off, as you do when adding the extra quotes around it for decode. The disadvantage, of course, is that we had to engineer it ourselves, and we probably have less confidence in its correctness, since it hasn't been battle-hardened by a thousand users!
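Since the question mentions Text as well as String, a thin wrapper over decodeHex is enough (the name decodeHexText is mine):

import qualified Data.Text as T

decodeHexText :: T.Text -> T.Text
decodeHexText = T.pack . decodeHex . T.unpack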
I'm writing my first program with Parsec. I want to parse MySQL schema dumps and would like to come up with a nice way to parse strings representing certain keywords in case-insensitive fashion. Here is some code showing the approach I'm using to parse "CREATE" or "create". Is there a better way to do this? An answer that doesn't resort to buildExpressionParser would be best. I'm taking baby steps here.
p_create_t :: GenParser Char st Statement
p_create_t = do
  x  <- string "CREATE" <|> string "create"
  xs <- manyTill anyChar (char ';')
  return $ CreateTable (x ++ xs) [] -- refine later
You can build the case-insensitive parser out of character parsers.
import Data.Char (toLower, toUpper)

-- Match the lowercase or uppercase form of 'c'
caseInsensitiveChar :: Char -> GenParser Char st Char
caseInsensitiveChar c = char (toLower c) <|> char (toUpper c)

-- Match the string 's', accepting either lowercase or uppercase form of each character
caseInsensitiveString :: String -> GenParser Char st String
caseInsensitiveString s = try (mapM caseInsensitiveChar s) <?> "\"" ++ s ++ "\""
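For example, in ghci (the test input is my own):
*Main> parseTest (caseInsensitiveString "create") "CrEaTe TABLE"
"CrEaTe"
Note that the result is the text exactly as it appeared in the input, which is handy if you want to preserve the original case.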
Repeating what I said in a comment, as it was apparently helpful:
The simple sledgehammer solution here is to simply map toLower over the entire input before running the parser, then do all your keyword matching in lowercase.
This presents obvious difficulties if you're parsing something that needs to be case-insensitive in some places and case-sensitive in others, or if you care about preserving case for cosmetic reasons. For example, although HTML tags are case-insensitive, converting an entire webpage to lowercase while parsing it would probably be undesirable. Even when compiling a case-insensitive programming language, converting identifiers could be annoying, as any resulting error messages would not match what the programmer wrote.
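A minimal sketch of that sledgehammer, assuming the whole input can safely be lower-cased (the function name parseLowered is mine):

import Data.Char (toLower)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Lower-case the entire input, then run a parser that matches its
-- keywords in lowercase only.
parseLowered :: Parser a -> String -> Either ParseError a
parseLowered p = parse p "(input)" . map toLower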
No, Parsec cannot do that in a clean way. string is implemented on top of the primitive tokens combinator, which is hard-coded to use the equality test (==). It's a bit simpler to parse a case-insensitive character, but you probably want more.
There is, however, a modern fork of Parsec called Megaparsec, which has built-in solutions for everything you may want:
λ> parseTest (char' 'a') "b"
parse error at line 1, column 1:
unexpected 'b'
expecting 'A' or 'a'
λ> parseTest (string' "foo") "Foo"
"Foo"
λ> parseTest (string' "foo") "FOO"
"FOO"
λ> parseTest (string' "foo") "fo!"
parse error at line 1, column 1:
unexpected "fo!"
expecting "foo"
Note the last error message: it's better than what you can get parsing characters one by one (and especially useful in your particular case). string' is implemented just like Parsec's string but uses case-insensitive comparison between characters. There are also oneOf' and noneOf', which may be helpful in some cases.
Disclosure: I'm one of the authors of Megaparsec.
Instead of mapping the entire input with toLower, consider using caseString from Text.ParserCombinators.Parsec.Rfc2234 (from the hsemail package):
import Text.ParserCombinators.Parsec.Rfc2234 (caseString)

p_create_t :: GenParser Char st Statement
p_create_t = do
  x  <- caseString "create"
  xs <- manyTill anyChar (char ';')
  return $ CreateTable (x ++ xs) [] -- refine later
So now x will be whatever case-variant is present in the input without changing your input.
ps: I know that this is an ancient question, I just thought that I would add this as this question came up while I was searching for a similar problem
There is a package named parsec-extra for this purpose. Install that package and then use its caseInsensitiveString parser.
:m Text.Parsec
:m +Text.Parsec.Extra
*> parseTest (caseInsensitiveString "values") "vaLUES"
"values"
*> parseTest (caseInsensitiveString "values") "VAlues"
"values"
The link to the package is here:
https://hackage.haskell.org/package/parsec-extra