Parse identifiers that don't end with certain characters in attoparsec - haskell

I am stuck writing an attoparsec parser to parse what the Unified Code for Units of Measure calls an <ATOM-SYMBOL>. It's defined to be the longest sequence of characters in a certain class (a class that includes all the digits 0-9) which doesn't end with a digit.
So given the input foo27 I want to consume and return foo, for 237bar26 I want to consume and return 237bar, for 19 I want to fail without consuming anything.
I can't figure out how to build this out of takeWhile1 or takeTill or scan but I am probably missing something obvious.
Update:
My best attempt so far was that I managed to exclude sequences that are entirely digits
atomSymbol :: Parser Text
atomSymbol = do
  r <- core
  if (P.all (inClass "0-9") . T.unpack $ r)
    then fail "Expected an atom symbol but all characters were digits."
    else return r
  where
    core = A.takeWhile1 $ inClass "!#-'*,0-<>-Z\\^-z|~"
I tried changing that to test if the last character was a digit instead of if they all were, but it doesn't seem to backtrack one character at a time.
Update 2:
The whole file is at https://github.com/dmcclean/dimensional-attoparsec/blob/master/src/Numeric/Units/Dimensional/Parsing/Attoparsec.hs. This only builds against the prefixes branch from https://github.com/dmcclean/dimensional.

You should reformulate the problem and treat spans of digits (0-9) and spans of non-digit characters (!#-'*,:-<>-Z\\^-z|~) separately. The syntactic element of interest can then be described as
an optional digit span, followed by
a non-digit span, followed by
zero or more {digit span followed by a non-digit span}.
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Control.Applicative ((<|>), many)
import Data.Char (isDigit)
import Data.Attoparsec.Combinator (option)
import Data.Attoparsec.Text (Parser)
import qualified Data.Attoparsec.Text as A
import Data.Text (Text)
import qualified Data.Text as T
atomSymbol :: Parser Text
atomSymbol = f <$> (option "" digitSpan)
               <*> (nonDigitSpan <|> fail errorMsg)
               <*> many (g <$> digitSpan <*> nonDigitSpan)
  where
    nonDigitSpan = A.takeWhile1 $ A.inClass "!#-'*,:-<>-Z\\^-z|~"
    digitSpan = A.takeWhile1 isDigit
    f x y xss = T.concat $ x : y : concat xss
    g x y = [x, y]
    errorMsg = "Expected an atom symbol but all characters (if any) were digits."
Tests
[...] given the input foo27 I want to consume and return foo, for 237bar26 I want to consume and return 237bar, for 19 I want to fail without consuming anything.
λ> A.parseOnly atomSymbol "foo26"
Right "foo"
λ> A.parseOnly atomSymbol "237bar26"
Right "237bar"
λ> A.parseOnly atomSymbol "19"
Left "Failed reading: Expected an atom symbol but all characters (if any) were digits."

Related

How to parse multiple lines with megaparsec? many . many runs into space leak

I'd like to parse some very simple text, for example
"abcxyzzzz\nhello\n" into ["abcxyzzzz", "hello"] :: [String].
Not looking for a simpler function to do this (like words) as I need to parse something more complex and I'm just laying the foundations here.
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeSynonymInstances #-}
{-# LANGUAGE FlexibleInstances #-}
module RgParse where
import Data.Text (Text)
import Text.Megaparsec
import Text.Megaparsec.Char
data SimpleData = SimpleData String deriving (Eq, Show, Ord)
data SimpleData' = SimpleData' [String] deriving (Eq, Show, Ord)

instance ShowErrorComponent SimpleData where
  showErrorComponent = show

instance ShowErrorComponent String where
  showErrorComponent = show

simple :: Parsec String Text SimpleData
simple = do
  x <- many (noneOf (Just '\n'))
  pure $ SimpleData x

simple' :: Parsec String Text SimpleData'
simple' = do
  x <- many (many (noneOf (Just '\n')))
  pure $ SimpleData' x
example2 :: Text
example2 = "abcxyzzzz\nhello\n"
main :: IO ()
main = do
  print "Simple:"
  case parse simple "<stdin>" example2 of
    Left bundle -> putStr (errorBundlePretty bundle)
    Right result -> print result
  print "Simple':"
  case parse simple' "<stdin>" example2 of
    Left bundle -> putStr (errorBundlePretty bundle)
    Right result -> print result
  print "done.."
The above unfortunately runs into an infinite loop / space leak upon entering simple' as it outputs the following:
Hello, Haskell!
[]
"Simple:"
SimpleData "abcxyzzzz"
"Simple':"
Using megaparsec-7.0.5 (not the latest 9.x.x).
Is there possibly a simpler approach to getting multiple lines?
Apply many only to a parser that either consumes at least one token (here, one Char) or fails. That's because many works by running its argument repeatedly until it fails; many x can succeed while consuming zero tokens, so many (many x) breaks this requirement.
Note that a line should at least include its terminating newline; that is what allows each repetition to consume something.
oneline :: Parsec String Text String
oneline = many (noneOf (Just '\n')) <* single '\n'

manylines :: Parsec String Text [String]
manylines = many oneline

simple :: Parsec String Text SimpleData
simple = do
  x <- oneline
  pure $ SimpleData x

simple' :: Parsec String Text SimpleData'
simple' = do
  x <- manylines
  pure $ SimpleData' x
A looser requirement for many p is that repeating p must fail after a finite number of iterations (and here p = many x never fails). Under that requirement, p may consume nothing in some steps, but it must then be stateful enough that after some repetitions it eventually consumes something or fails. Still, the approximation above is a pretty good rule of thumb in practice.
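For reference (not part of the original answer), running the fixed simple' on example2 from the question now terminates, since every iteration of many oneline consumes at least the terminating newline:
λ> parse simple' "<stdin>" example2
Right (SimpleData' ["abcxyzzzz","hello"])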

Parsec negative match

parseIdent :: Parser (String)
parseIdent = do
  x <- lookAhead $ try $ many1 (choice [alphaNum])
  void $ optional endOfLine <|> eof
  case x of
    "macro" -> fail "illegal"
    _ -> pure x
I'm trying to parse an alphanumeric string that only succeeds if it does not match a predetermined value (macro in this case).
However the following is giving me an error of:
*** Exception: Text.ParserCombinators.Parsec.Prim.many: combinator 'many' is applied to a parser that accepts an empty string.
This does not make sense to me: how does many1 (choice [alphaNum]) accept an empty string?
This error goes away if I remove the lookAhead $ try, but then it fails with illegal:
...
*** Exception: (line 6, column 36):
unexpected " "
expecting letter or digit or new-line
illegal
Am I going about this correctly? Or is there another technique to implement a negative search?
You almost have it:
import Text.Parsec
import Text.Parsec.Char
import Text.Parsec.String
import Control.Monad

parseIdent :: Parser (String)
parseIdent = try $ do
  x <- many1 alphaNum
  void $ optional endOfLine <|> eof
  case x of
    "macro" -> fail "illegal"
    _ -> pure x
So, why didn't your code work?
The try is in the wrong spot. The real backtracking piece here is backtracking after you've gotten back your alphanumeric word and checked it isn't "macro".
lookAhead has no business here. If you end up with the word you wanted, you do want the word to be consumed from the input; try already takes care of resetting your input stream to its previous state.
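To illustrate why the placement of try matters, here is my own hedged sketch (not part of the original answer; parseIdentOrMacro and the "<macro>" result are made-up names): because parseIdent backtracks without consuming input when it sees "macro", it composes cleanly with a fallback branch that still gets to read the whole word.
-- Hedged sketch, assuming the imports above. If parseIdent rejects "macro",
-- `try` has reset the stream, so the second branch starts from the same spot.
parseIdentOrMacro :: Parser String
parseIdentOrMacro = parseIdent <|> (string "macro" >> pure "<macro>")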

How to ignore arbitrary tokens using parsec?

I wanted to replace sed and awk with Parsec. For example, extract number from strings like unknown structure but containing the number 42 and maybe some other stuff.
I run into "unexpected end of input". I'm looking for the equivalent of the non-greedy .*([0-9]+).*.
module Main where
import Text.Parsec
parser :: Parsec String () Int
parser = do
  _ <- many anyToken
  x <- read <$> many1 digit
  _ <- many anyToken
  return x
main :: IO ()
main = interact (show . parse parser "STDIN")
This can be easily done with my library regex-applicative. It gives you both the combinator interface and the features of regular expressions that you seem to want.
Here's a working version that's closest to your example:
{-# LANGUAGE ApplicativeDo #-}
import Text.Regex.Applicative
import Text.Regex.Applicative.Common (decimal)
parser :: RE Char Int
parser = do
  _ <- few anySym
  x <- decimal
  _ <- many anySym
  return x
main :: IO ()
main = interact (show . match parser)
Here's an even shorter version, using findFirstInfix:
import Text.Regex.Applicative
import Text.Regex.Applicative.Common (decimal)
main :: IO ()
main = interact (show . fmap snd3 . findFirstInfix decimal)
  where snd3 (_, r, _) = r
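For context (my addition, not from the original answer): findFirstInfix returns the text before the match, the match itself, and the text after it, wrapped in Maybe, which is why the snd3 projection is needed. In GHCi:
λ> findFirstInfix decimal "unknown structure but containing the number 42 and maybe some other stuff"
Just ("unknown structure but containing the number ",42," and maybe some other stuff")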
If you want to perform actual tokenization (e.g. skip 93 in foo93bar), then take a look at lexer-applicative, a tokenizer based on regex-applicative.
Replacing sed and awk with parsers is what the replace-megaparsec library is all about. Extract numbers from unstructured strings with the sepCap parser combinator.
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char.Lexer
import Data.Void

parseTest (sepCap (decimal :: Parsec Void String Int))
    $ "unknown structure but containing the number 42 and maybe some other stuff"
[ Left "unknown structure but containing the number "
, Right 42
, Left " and maybe some other stuff"
]
This cannot work, since anyToken accepts and consumes - as its name says - any token, including digits, and you apply it many times. Therefore the attempt to read digits with the second parser must fail; there simply cannot be any tokens left.
Instead, make your first parser accept any character that is not a digit (using isDigit from module Data.Char):
parser :: Parsec String () Int
parser = do
  _ <- many $ satisfy (not . isDigit)
  x <- read <$> many1 digit
  _ <- many anyToken
  return x
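As a quick check (my addition, not from the original answer), running this parser over the example string from the question picks out the first run of digits:
λ> parse parser "STDIN" "unknown structure but containing the number 42 and maybe some other stuff"
Right 42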

Convert unescaped unicode to utf8 integer

Firstly, I apologize if the terms "unescaped unicode" and "utf8 integer" are not correct; I don't really know what I'm talking about when I'm talking about encoding.
As a concrete example, I would like to convert the string "\\u00b5ABC" to the string "\181ABC" (\u00b5 and \181 correspond to µ). By "string" I mean String or Text.
I know how to achieve this by using a tortuous (and perhaps laughable) way:
import Data.Aeson (decode)
import Data.ByteString.Lazy (packChars)
import Data.Text (Text)
decode (packChars "\"\\u00b5ABC\"") :: Maybe Text
I am ready to bet there exists a more direct way...
Edit
Following #Alec's comment, I provide more context. In the background, there is a Javascript program that receives a character string and replaces the characters in this string by their unicode representation \\uxxxx when this unicode representation is between \u007F and \uFFFF.
On the Haskell side, I receive this new string, and I want to replace the \\uxxxx with their corresponding utf8 integer representations.
Here's a nice simple parser written using regex-applicative. First some imports and other nonsense that isn't worth reading:
import Data.Char
import Data.Maybe
import Numeric
import Text.Regex.Applicative
-- no idea why this isn't in Control.Applicative
replicateA :: Applicative f => Int -> f a -> f [a]
replicateA n act = sequenceA (replicate n act)
Now, we want to parse an escaped character. We'll use a regex that matches characters and returns a character, so it's an RE Char Char. Ideally I'd write it this way:
escaped :: RE Char Char
escaped = do
  string "\\u"
  digits <- replicateM 4 (psym isHexDigit)
  return . chr . fst . head . readHex $ digits
The head is safe because we've ensured that readHex will only be passed hex digits, and therefore will succeed. We can almost write it like that, except that RE Char is not a Monad. With newish GHC's you can probably turn on ApplicativeDo and be done with it, but it's not so bad to write in applicative style ourselves anyway and support all GHC's, so let's do that:
escaped :: RE Char Char
escaped
  = chr . fst . head . readHex
  <$> (string "\\u"
       *> replicateA 4 (psym isHexDigit)
      )
Anyway, once we have a regex for decoding a single escaped character, it's easy to produce a regex for decoding all the escaped characters and passing unescaped characters through unchanged: many (escaped <|> anySym). Since this regex will always succeed, we can ignore the Maybe-ness of (=~) hedging its bets about whether an expression will match, and write
decodeHex :: String -> String
decodeHex = fromJust . (=~ many (escaped <|> anySym))
Let's try it in ghci:
> decodeHex "\\u00b5ABC"
"\181ABC"
> decodeHex "\\u00bABC"
"\186BC"
> decodeHex "\\udefg"
"\\udefg"
The advantage of writing our own parser like this instead of relying on something like decode is that we gain control and confidence over exactly which transformations are being done. For example, since we know \u should always be followed by four hex digits, we can transform it only when that happens, in case the original, pre-Javascript text contained \\udefg and we want that to appear in the final output rather than \3567g. We also don't have to worry about it de-escaping other things that we don't want touched, and we don't have to "extra-escape" our string before handing it off, as we did above by adding the extra quotes around it. And of course, the disadvantage is that we had to engineer it ourselves, and probably have less confidence in its correctness since it hasn't been battle-hardened by a thousand users!

Attoparsec: Skipping bracketed terms?

I'm trying to make large TSV files with JSON in the 5th column suitable for import to mongoDB.
In particular I want to change top level and only top level key fields to _id. This is what I have so far, it seems to work but is slow:
{-# LANGUAGE OverloadedStrings #-}
import System.Environment (getArgs)
import Data.Conduit.Binary (sourceFile, sinkFile)
import Data.Conduit
import qualified Data.Conduit.Text as CT
import qualified Data.Conduit.List as CL
import qualified Data.Text as T
import Data.Monoid ((<>))
import Data.Attoparsec.Text as APT
import Control.Applicative
main = do
  (inputFile : outputFile : _) <- getArgs
  runResourceT $ sourceFile inputFile
    $= CT.decode CT.utf8 $= CT.lines $= CL.map jsonify
    $= CT.encode CT.utf8 $$ sinkFile outputFile

jsonify :: T.Text -> T.Text
jsonify = go . T.splitOn "\t"
  where
    go (_ : _ : _ : _ : content : _) = case parseOnly keyTo_id content of
      Right res -> res <> "\n"
      _ -> ""
    go _ = ""

keyTo_id :: Parser T.Text
keyTo_id = skipWhile (/= '{') >> T.snoc <$>
    (T.cons <$> (char '{')
            <*> (T.concat <$> many1 (bracket
                                 <|> (string "\"key\":" >> return "\"_id\":")
                                 <|> APT.takeWhile1 (\x -> x /= '{' && x /= '}' && x /= '"')
                                 <|> T.singleton <$> satisfy (/= '}')
                                    )))
    <*> char '}'

bracket :: Parser T.Text
bracket = T.cons <$> char '{'
                 <*> scan 1 test
  where
    test :: Int -> Char -> Maybe Int
    test 0 _ = Nothing
    test i '}' = Just (i-1)
    test i '{' = Just (i+1)
    test i _ = Just i
According to the profiler 58.7% of the time is spent in bracket, 19.6% in keyTo_id, 17.1% in main.
Surely there's a better way to return bracketed terms unchanged if the brackets match up?
I briefly looked at attoparsec-conduit, but I have no idea how to use that library and can't even tell whether this is the sort of thing it can be used for.
EDIT: Updated the code. The data is from openlibrary.org, e. g. http://openlibrary.org/data/ol_dump_authors_latest.txt.gz
Use the scan function. It allows you to scan over a string while maintaining a state. In your case the state will be a number: the difference between the opening and closing braces that you've encountered so far.
When your state is 0, that means the braces match up inside the current substring.
The trick is that you don't deconstruct and reconstruct the string this way, so it should be faster.
Also, you could gain some performance even with your current algorithm by using lazy Text: the concat function would work more efficiently.
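A minimal sketch of that idea (my own illustration, not code from the answer; the name balancedBraces is made up), assuming the question's imports: a single scan whose state is the current brace depth consumes the whole {...} term, so the matched Text comes back as one slice instead of being reassembled with T.cons and T.concat.
-- Hedged sketch: return a balanced {...} term as a single Text slice.
-- It assumes the braces really are balanced; if the input ends early,
-- the unterminated term is returned as scanned.
balancedBraces :: Parser T.Text
balancedBraces = do
  t <- APT.scan (0 :: Int) step
  if T.null t
    then fail "expected an opening brace"
    else return t
  where
    step 0 '{' = Just 1          -- enter the outermost brace
    step 0 _   = Nothing         -- at depth 0, stop before anything else
    step n '{' = Just (n + 1)
    step n '}' = Just (n - 1)    -- depth returns to 0 after the final '}'
    step n _   = Just n
Used in place of the original bracket (which consumes the '{' separately and then re-attaches it), this keeps each braced term as one slice.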
