First, I apologize if the terms "unescaped unicode" and "utf8 integer" are not correct; I don't really know what I'm talking about when it comes to encodings.
As a concrete example, I would like to convert the string "\\u00b5ABC" to the string "\181ABC" (\u00b5 and \181 correspond to µ). By "string" I mean String or Text.
I know how to achieve this by using a tortuous (and perhaps laughable) way:
import Data.Aeson (decode)
import Data.ByteString.Lazy.Char8 (pack)
import Data.Text (Text)

decode (pack "\"\\u00b5ABC\"") :: Maybe Text
I am ready to bet there exists a more direct way...
Edit
Following @Alec's comment, I provide more context. In the background, there is a Javascript program that receives a character string and replaces the characters in this string with their Unicode escape \\uxxxx when the code point is between \u007F and \uFFFF.
On the Haskell side, I receive this new string, and I want to replace the \\uxxxx with their corresponding utf8 integer representations.
Here's a nice simple parser written using regex-applicative. First some imports and other nonsense that isn't worth reading:
import Data.Char
import Data.Maybe
import Numeric
import Text.Regex.Applicative
-- no idea why this isn't in Control.Applicative
replicateA :: Applicative f => Int -> f a -> f [a]
replicateA n act = sequenceA (replicate n act)
Now, we want to parse an escaped character. We'll use a regex that matches characters and returns a character, so it's an RE Char Char. Ideally I'd write it this way:
escaped :: RE Char Char
escaped = do
    string "\\u"
    digits <- replicateM 4 (psym isHexDigit)
    return . chr . fst . head . readHex $ digits
The head is safe because we've ensured that readHex will only be passed hex digits, and therefore will succeed. We can almost write it like that, except that RE Char is not a Monad. With newish GHCs you can probably turn on ApplicativeDo and be done with it, but it's not so bad to write it in applicative style ourselves and support all GHCs, so let's do that:
escaped :: RE Char Char
escaped
  = chr . fst . head . readHex
  <$> (string "\\u"
       *> replicateA 4 (psym isHexDigit)
      )
Anyway, once we have a regex for decoding a single escaped character, it's easy to produce a regex for decoding all the escaped characters and passing unescaped characters through unchanged: many (escaped <|> anySym). Since this regex will always succeed, we can ignore the Maybe-ness of (=~) hedging its bets about whether an expression will match, and write
decodeHex :: String -> String
decodeHex = fromJust . (=~ many (escaped <|> anySym))
Let's try it in ghci:
> decodeHex "\\u00b5ABC"
"\181ABC"
> decodeHex "\\u00bABC"
"\186BC"
> decodeHex "\\udefg"
"\\udefg"
The advantage of writing our own parser like this, instead of relying on something like decode, is that we gain control over and confidence in exactly which transformations are being done. For example, since we know \u will always be followed by four hex digits, we only transform it when that is the case; if the original, pre-Javascript text contained \\udefg, it will appear unchanged in the final output rather than as \3567g. We also don't have to worry that it de-escapes other things we don't want it to, and we don't have to "extra-escape" our string before handing it off, as you do when adding the extra quotes around it. And of course, the disadvantage is that we had to engineer it ourselves, and we probably have less confidence in its correctness since it hasn't been battle-hardened by a thousand users!
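Since the question mentions that the string might also be a Text, here is a hedged sketch (the decodeHexText name is my own) that reuses decodeHex by round-tripping through String, which is fine for modestly sized inputs:
import qualified Data.Text as T

decodeHexText :: T.Text -> T.Text
decodeHexText = T.pack . decodeHex . T.unpack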
Related
I'm writing a parser to parse huge chunks of English text using attoparsec. Everything has been great so far, except for parsing this char "――". I know it is just 2 dashes together "--". The weird thing is, the parser catches it in this code:
wordSeparator :: Parser ()
wordSeparator = many1 (space <|> satisfy (inClass "――?!,:")) >> pure ()
but not in this case:
specialChars = ['――', '?', '!', ',', ':']
wordSeparator :: Parser ()
wordSeparator = many1 (space <|> satisfy (inClass specialChars)) >> pure ()
The reason I'm using the list specialChars is that I have a lot of characters to consider and I apply it in multiple cases. For the input "I am ――Walt Whitman._", the output is supposed to be {"I", "am", "Walt", "Whitman."}. I believe it's mostly because "――" is not a Char? How do I fix this?
A Char is one character, full stop. ―― is two characters, so it is two Chars. You can fit as many Chars as you want into a String, but you certainly cannot fit two Chars into one Char.
Since satisfy considers individual characters at a time, it probably isn’t what you want if you need to parse a sequence of two characters as a single unit. The inClass function just produces a predicate on characters (inClass partially applied to one argument produces a function of type Char -> Bool), so inClass "――" is the same as inClass ['―', '―'], which is just the same as inClass ['―'] since duplicates are irrelevant. That won’t help you much.
Consider using string instead of, or in combination with, inClass, since it is designed to handle sequences of characters. For example, something like this might better suit your needs (void, from Control.Monad or Data.Functor, discards the results so that the Char and Text alternatives have the same type):
wordSeparator :: Parser ()
wordSeparator = many1 (void space <|> void (string "――") <|> void (satisfy (inClass "?!,:"))) >> pure ()
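As a hedged sketch of how this could be used to split the example sentence into tokens (the word and wordsP names are my own; this assumes Data.Attoparsec.Text, Data.Text (Text), and OverloadedStrings):
-- A word is a maximal run of characters that are neither spaces nor separator characters.
word :: Parser Text
word = takeWhile1 (notInClass " ―?!,:")

-- Words separated by wordSeparator; for "I am ――Walt Whitman._" this should produce
-- ["I", "am", "Walt", "Whitman._"] (trailing punctuation such as '.' and '_' is kept).
wordsP :: Parser [Text]
wordsP = word `sepBy` wordSeparator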
I am trying to parse mediawiki text using Parsec. Some of the constructs in mediawiki markup can only occur at the start of rows (such as the header markup ==header level 2==). In regexp I would use an anchor (such as ^) to find the start of a line.
One attempt in GHCi is
Prelude Text.Parsec> parse (char '\n' *> string "==" *> many1 letter <* string "==") "" "\n==hej=="
Right "hej"
but this is not too good since it will fail on the first line of a file. I feel like this should be a solved problem...
What is the most idiomatic "Start of line" parsing in Parsec?
You can use getPosition and sourceColumn in order to find out the column number that the parser is currently looking at. The column number will be 1 if the current position is at the start of a line (such as at the start of input or after a \n or \r character).
There isn't a built-in combinator for this, but you can easily make it:
import Text.Parsec
import Control.Monad (guard)
startOfLine :: Monad m => ParsecT s u m ()
startOfLine = do
    pos <- getPosition
    guard (sourceColumn pos == 1)
Now you can write your header parser as:
header = startOfLine *> string "==" *> many1 letter <* string "=="
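As a quick sanity check (a sketch; checkStart and checkAfterNewline are my own names), the header should parse both at the very start of the input and right after a newline, since the source column is 1 in both positions:
checkStart, checkAfterNewline :: Either ParseError String
checkStart        = parse header "" "==hej=="
checkAfterNewline = parse (char '\n' *> header) "" "\n==hej=="
-- both are expected to be Right "hej"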
Probably you can use many (char '\n') instead of just char '\n'. In parser combinators there is no notion of "start of line", because a parser always runs from the start of its remaining input. The only thing you can do is check manually which symbols your input can start with. Using many (char '\n') ensures that there are zero or more empty lines before the header == my header ==.
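A one-line sketch of that suggestion applied to the parser from the question (the header' name is mine); note that this merely skips leading newlines rather than truly anchoring the match to the start of a line:
header' = many (char '\n') *> string "==" *> many1 letter <* string "=="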
I am stuck writing an attoparsec parser to parse what the Uniform Code for Units of Measure calls a <ATOM-SYMBOL>. It's defined to be the longest sequence of characters in a certain class (that class includes all the digits 0-9) which doesn't end with a digit.
So given the input foo27 I want to consume and return foo, for 237bar26 I want to consume and return 237bar, for 19 I want to fail without consuming anything.
I can't figure out how to build this out of takeWhile1 or takeTill or scan but I am probably missing something obvious.
Update:
My best attempt so far manages to exclude sequences that are entirely digits:
atomSymbol :: Parser Text
atomSymbol = do
    r <- core
    if (P.all (inClass "0-9") . T.unpack $ r)
      then fail "Expected an atom symbol but all characters were digits."
      else return r
  where
    core = A.takeWhile1 $ inClass "!#-'*,0-<>-Z\\^-z|~"
I tried changing that to test if the last character was a digit instead of if they all were, but it doesn't seem to backtrack one character at a time.
Update 2:
The whole file is at https://github.com/dmcclean/dimensional-attoparsec/blob/master/src/Numeric/Units/Dimensional/Parsing/Attoparsec.hs. This only builds against the prefixes branch from https://github.com/dmcclean/dimensional.
You should reformulate the problem and treat spans of digits (0-9) and spans of non-digit characters (!#-'*,:-<>-Z\\^-z|~) separately. The syntactic element of interest can then be described as
an optional digit span, followed by
a non-digit span, followed by
zero or more {digit span followed by a non-digit span}.
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Control.Applicative ((<|>), many)
import Data.Char (isDigit)
import Data.Attoparsec.Combinator (option)
import Data.Attoparsec.Text (Parser)
import qualified Data.Attoparsec.Text as A
import Data.Text (Text)
import qualified Data.Text as T

atomSymbol :: Parser Text
atomSymbol = f <$> (option "" digitSpan)
               <*> (nonDigitSpan <|> fail errorMsg)
               <*> many (g <$> digitSpan <*> nonDigitSpan)
  where
    nonDigitSpan = A.takeWhile1 $ A.inClass "!#-'*,:-<>-Z\\^-z|~"
    digitSpan    = A.takeWhile1 isDigit
    f x y xss    = T.concat $ x : y : concat xss
    g x y        = [x, y]
    errorMsg     = "Expected an atom symbol but all characters (if any) were digits."
Tests
[...] given the input foo27 I want to consume and return foo, for 237bar26 I want to consume and return 237bar, for 19 I want to fail without consuming anything.
λ> A.parseOnly atomSymbol "foo27"
Right "foo"
λ> A.parseOnly atomSymbol "237bar26"
Right "237bar"
λ> A.parseOnly atomSymbol "19"
Left "Failed reading: Expected an atom symbol but all characters (if any) were digits."
I'm writing my first program with Parsec. I want to parse MySQL schema dumps and would like to come up with a nice way to parse strings representing certain keywords in case-insensitive fashion. Here is some code showing the approach I'm using to parse "CREATE" or "create". Is there a better way to do this? An answer that doesn't resort to buildExpressionParser would be best. I'm taking baby steps here.
p_create_t :: GenParser Char st Statement
p_create_t = do
    x  <- (string "CREATE" <|> string "create")
    xs <- manyTill anyChar (char ';')
    return $ CreateTable (x ++ xs) [] -- refine later
You can build the case-insensitive parser out of character parsers.
-- Match the lowercase or uppercase form of 'c' (toLower and toUpper come from Data.Char)
caseInsensitiveChar :: Char -> GenParser Char st Char
caseInsensitiveChar c = char (toLower c) <|> char (toUpper c)

-- Match the string 's', accepting either lowercase or uppercase form of each character
caseInsensitiveString :: String -> GenParser Char st String
caseInsensitiveString s = try (mapM caseInsensitiveChar s) <?> "\"" ++ s ++ "\""
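A hedged usage sketch, plugging these combinators into the parser from the question (p_create_ci is my own name; the keyword keeps whatever case the input used):
p_create_ci :: GenParser Char st Statement
p_create_ci = do
    x  <- caseInsensitiveString "create" -- matches "CREATE", "create", "Create", ...
    xs <- manyTill anyChar (char ';')
    return $ CreateTable (x ++ xs) []    -- refine later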
Repeating what I said in a comment, as it was apparently helpful:
The simple sledgehammer solution here is to simply map toLower over the entire input before running the parser, then do all your keyword matching in lowercase.
This presents obvious difficulties if you're parsing something that needs to be case-insensitive in some places and case-sensitive in others, or if you care about preserving case for cosmetic reasons. For example, although HTML tags are case-insensitive, converting an entire webpage to lowercase while parsing it would probably be undesirable. Even when compiling a case-insensitive programming language, converting identifiers could be annoying, as any resulting error messages would not match what the programmer wrote.
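A minimal sketch of the sledgehammer, assuming a Parsec-based setup like the question's (parseLowered is my own name; the wrapped parser can then match keywords in lowercase only):
import Data.Char (toLower)
import Text.Parsec

-- Lowercase the entire input once, then run the real parser on it.
parseLowered :: Parsec String () a -> String -> Either ParseError a
parseLowered p = parse p "" . map toLower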
No, Parsec cannot do that in a clean way. string is implemented on top of the primitive tokens combinator, which is hard-coded to use the equality test (==). It's a bit simpler to parse a case-insensitive character, but you probably want more.
There is, however, a modern fork of Parsec called Megaparsec, which has built-in solutions for everything you may want:
λ> parseTest (char' 'a') "b"
parse error at line 1, column 1:
unexpected 'b'
expecting 'A' or 'a'
λ> parseTest (string' "foo") "Foo"
"Foo"
λ> parseTest (string' "foo") "FOO"
"FOO"
λ> parseTest (string' "foo") "fo!"
parse error at line 1, column 1:
unexpected "fo!"
expecting "foo"
Note the last error message; it's better than what you can get when parsing characters one by one (and especially useful in your particular case). string' is implemented just like Parsec's string, but uses case-insensitive comparison to compare characters. There are also oneOf' and noneOf', which may be helpful in some cases.
Disclosure: I'm one of the authors of Megaparsec.
Instead of mapping the entire input with toLower, consider using caseString from Text.ParserCombinators.Parsec.Rfc2234 (from the hsemail package)
p_create_t :: GenParser Char st Statement
p_create_t = do
    x  <- caseString "create"
    xs <- manyTill anyChar (char ';')
    return $ CreateTable (x ++ xs) [] -- refine later
So now x will be whatever case-variant is present in the input without changing your input.
P.S.: I know that this is an ancient question; I just thought I would add this, as the question came up while I was searching for a similar problem.
There is a package named parsec-extra for this purpose. You need to install the package and then use the caseInsensitiveString parser.
:m Text.Parsec
:m +Text.Parsec.Extra
*> parseTest (caseInsensitiveString "values") "vaLUES"
"values"
*> parseTest (caseInsensitiveString "values") "VAlues"
"values"
Link to package is here:
https://hackage.haskell.org/package/parsec-extra
I always run into the following error when trying to read a ByteString:
Prelude.read: no parse
Here's a sample of code that will cause this error to occur upon rendering in a browser:
factSplice :: SnapletSplice App App
factSplice = do
    mbstr <- getParam "input" -- returns user input as bytestring
    let str = maybe (error "splice") show mbstr
    let n = read str :: Int
    return [X.TextNode $ T.pack $ show $ product [1..n]]
Or perhaps more simply:
simple bs = read (show bs) :: Int
For some reason, after show bs the resulting string includes quotes.
So in order to get around the error I have to remove the quotes then read it.
I use the following function copied from the internet to do so:
sq :: String -> String
sq s@[c] = s
sq ('"':s)  | last s == '"'  = init s
            | otherwise      = s
sq ('\'':s) | last s == '\'' = init s
            | otherwise      = s
sq s = s
Then simple bs = read (sq (show bs)) :: Int works as expected.
Why is this the case?
What is the best way to convert a ByteString to an Int?
The best way to convert a ByteString to an X depends on X. If you have a good conversion from String, going via Data.ByteString.Char8.unpack can be good, provided it's an ASCII ByteString. For UTF-8 encoded ByteStrings, the utf8-string package contains the conversion function toString. For some specific types, like the Int mentioned in the title, special, faster conversions exist, for example Data.ByteString.Char8.readInt and readInteger.
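For instance, a hedged sketch using readInt (bsToInt is my own helper name): readInt returns the parsed Int together with the unconsumed remainder, so you can decide how strict to be about trailing bytes.
import qualified Data.ByteString.Char8 as BC

bsToInt :: BC.ByteString -> Maybe Int
bsToInt bs = case BC.readInt bs of
    Just (n, rest) | BC.null rest -> Just n  -- the whole input was a number
    _                             -> Nothing -- empty input, junk, or trailing bytes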
Show is used to create a String representation of something that is useful for debugging and plain-text serialization. The Show typeclass is not just a fancy way of converting anything into a String. That's why ByteString's Show instance adds quotes around the string: it's arguably easier to read that way when debugging or deserializing a data stream.
You can use the Data.ByteString.Char8.unpack function to convert a ByteString to a String, but note that this unpacks the ByteString byte-per-byte, which messes up high-value Unicode characters or other characters that are stored as more than one byte; if you want to do something other than using read on the result, I'd recommend converting the ByteString to Text instead, which offers more flexibility in this situation. Assuming that your encoding is UTF8 in this case (As should be the default in Snap), you can use the Data.Text.Encoding.decodeUtf8 function for this. To then convert a Text value to a String with correct Unicode symbols, you use Data.Text.unpack.
Once you have a String, you are free to read it as much as you want; alternatively, you can choose to read a Text value directly using the functions in the Data.Text.Read module.
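A hedged sketch of that Text route (bsToIntText is my own name): decode the bytes as UTF-8, then use Data.Text.Read.decimal instead of read, which returns an Either instead of throwing.
import qualified Data.ByteString as B
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Read as TR

bsToIntText :: B.ByteString -> Either String Int
bsToIntText bs = fst <$> TR.decimal (TE.decodeUtf8 bs)
-- note: decimal stops at the first non-digit, so check the leftover Text if you need
-- to reject trailing characters; decodeUtf8 throws on invalid UTF-8 (decodeUtf8'
-- returns an Either instead)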