Parsec - improving error message for "between"


I'm learning Parsec. I've got this code:
import Text.Parsec.String (Parser)
import Control.Applicative hiding ((<|>))
import Text.ParserCombinators.Parsec hiding (many)
inBracketsP :: Parser [String]
inBracketsP = (many $ between (char '[') (char ']') (many $ char '.')) <* eof
main :: IO ()
main = putStr $ show $ parse inBracketsP "" "[...][..."
The result is
Left (line 1, column 10):
unexpected end of input
expecting "." or "]"
This message is not useful (adding . won't fix the problem). I'd expect something like ']' expected (only ] fixes the problem).
Is it possible to achieve that easily with Parsec? I've seen the SO question Parsec: error message at specific location, which is inspiring, but I'd prefer to stick to the between combinator, without manual lookahead or other overengineering (kind of), if possible.

You can hide a terminal from being displayed in the expected input list by attaching an empty label to it (parser <?> ""):
inBracketsP :: Parser [String]
inBracketsP = (many $ between (char '[') (char ']') (many $ (char '.' <?> ""))) <* eof
-- >>> main
-- Left (line 1, column 10):
-- unexpected end of input
-- expecting "]"
In megaparsec, there is also a hidden combinator that achieves the same effect.
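If the empty label is needed in several places, it can be wrapped in a small helper. The sketch below is only an illustration: hiddenP and errorText are hypothetical names introduced here, not part of Parsec; the helper just applies the `<?> ""` trick from the answer.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical helper (not a Parsec function): hide a parser from the
-- "expecting ..." list by giving it an empty label.
hiddenP :: Parser a -> Parser a
hiddenP p = p <?> ""

inBracketsP :: Parser [String]
inBracketsP = many (between (char '[') (char ']') dots) <* eof
  where
    dots = many (hiddenP (char '.'))

-- Render the parse error as text, or "ok" when parsing succeeds.
errorText :: String -> String
errorText = either show (const "ok") . parse inBracketsP ""
```

In GHCi, `putStrLn (errorText "[...][...")` should end with the line `expecting "]"`, matching the output shown above.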

Related

Parsec negative match

parseIdent :: Parser (String)
parseIdent = do
  x <- lookAhead $ try $ many1 (choice [alphaNum])
  void $ optional endOfLine <|> eof
  case x of
    "macro" -> fail "illegal"
    _ -> pure x
I'm trying to parse an alphanumeric string that only succeeds if it does not match a predetermined value (macro in this case).
However the following is giving me an error of:
*** Exception: Text.ParserCombinators.Parsec.Prim.many: combinator 'many' is applied to a parser that accepts an empty string.
This does not make sense to me: how does many1 (choice [alphaNum]) accept an empty string?
The error goes away if I remove the lookAhead $ try, but then it fails with illegal:
...
*** Exception: (line 6, column 36):
unexpected " "
expecting letter or digit or new-line
illegal
Am I going about this correctly? Or is there another technique to implement a negative search?

You almost have it:
import Text.Parsec
import Text.Parsec.Char
import Text.Parsec.String
import Control.Monad
parseIdent :: Parser (String)
parseIdent = try $ do
  x <- many1 alphaNum
  void $ optional endOfLine <|> eof
  case x of
    "macro" -> fail "illegal"
    _ -> pure x
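So, why didn't your code work?
The try is in the wrong spot. The backtracking that matters here happens after you have read the alphanumeric word and checked that it isn't "macro", so try has to wrap the whole do block.
lookAhead has no business here. If you end up with the word you wanted, you do want the word to be consumed from the input. Because lookAhead never consumes anything, your original parseIdent can succeed without consuming input, which is presumably what makes a surrounding many raise the "accepts an empty string" error. When the check fails, try already takes care of resetting your input stream to its previous state.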
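To see why the position of try matters, here is a minimal sketch in which the rejected word is picked up by another alternative. The Tok type and the names ident, tok and runTok are assumptions made for this illustration; they do not come from the question.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical token type, for illustration only.
data Tok = Macro | Ident String deriving (Eq, Show)

-- An identifier is any alphanumeric word except the reserved word "macro".
-- try wraps the whole block, so a rejected word is "un-consumed".
ident :: Parser String
ident = try $ do
  x <- many1 alphaNum
  if x == "macro"
    then unexpected "reserved word \"macro\""
    else pure x

-- Because ident backtracks on failure, the keyword branch still sees "macro".
tok :: Parser Tok
tok = (Ident <$> ident) <|> (Macro <$ string "macro")

runTok :: String -> Maybe Tok
runTok = either (const Nothing) Just . parse tok ""
```

With this definition `runTok "macro"` gives `Just Macro` and `runTok "macros"` gives `Just (Ident "macros")`. Without the try, the first branch would fail after consuming input and the second branch would never run.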

How to ignore arbitrary tokens using parsec?

I want to replace sed and awk with Parsec, for example to extract the number from strings like "unknown structure but containing the number 42 and maybe some other stuff".
I run into "unexpected end of input". I'm looking for the equivalent of the non-greedy regex .*([0-9]+).*.
module Main where
import Text.Parsec
parser :: Parsec String () Int
parser = do
  _ <- many anyToken
  x <- read <$> many1 digit
  _ <- many anyToken
  return x
main :: IO ()
main = interact (show . parse parser "STDIN")

This can be easily done with my library regex-applicative. It gives you both the combinator interface and the features of regular expressions that you seem to want.
Here's a working version that's closest to your example:
{-# LANGUAGE ApplicativeDo #-}
import Text.Regex.Applicative
import Text.Regex.Applicative.Common (decimal)
parser :: RE Char Int
parser = do
  _ <- few anySym
  x <- decimal
  _ <- many anySym
  return x
main :: IO ()
main = interact (show . match parser)
Here's an even shorter version, using findFirstInfix:
import Text.Regex.Applicative
import Text.Regex.Applicative.Common (decimal)
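main :: IO ()
-- findFirstInfix returns Maybe (prefix, match, suffix), while interact needs a
-- String, so the Maybe is unwrapped and the number shown.
main = interact (maybe "no match" (show . snd3) . findFirstInfix (decimal :: RE Char Int))
  where snd3 (_, r, _) = r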
If you want to perform actual tokenization (e.g. skip 93 in foo93bar), then take a look at lexer-applicative, a tokenizer based on regex-applicative.

Replacing sed and awk with parsers is what the replace-megaparsec library is all about.
Extract numbers from unstructured strings with the sepCap parser combinator.
import Data.Void (Void)
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char.Lexer
parseTest (sepCap (decimal :: Parsec Void String Int))
  $ "unknown structure but containing the number 42 and maybe some other stuff"
[ Left "unknown structure but containing the number "
, Right 42
, Left " and maybe some other stuff"
]

This cannot work, since anyToken accepts and consumes, as its name says, any token, including digits. Because you apply it with many, it consumes everything up to the end of the input, so the attempt to read digits with the second parser must fail: there are no tokens left, hence "unexpected end of input".
Instead, make your first parser accept any character that is not a digit (using isDigit from module Data.Char):
import Data.Char (isDigit)
parser :: Parsec String () Int
parser = do
  _ <- many $ satisfy (not . isDigit)
  x <- read <$> many1 digit
  _ <- many anyToken
  return x
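A closer analogue of the non-greedy .* can also be written with manyTill and lookAhead, which stop skipping as soon as a digit is next. This is only a sketch of that alternative; firstNumber and extract are names made up for the example.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Skip characters lazily until a digit is next, read the digits,
-- then ignore the rest of the input.
firstNumber :: Parser Int
firstNumber = do
  _ <- manyTill anyChar (lookAhead digit)  -- stops without consuming the digit
  n <- many1 digit
  skipMany anyChar
  pure (read n)

-- Hypothetical wrapper: Nothing when the input contains no digits.
extract :: String -> Maybe Int
extract = either (const Nothing) Just . parse firstNumber ""
```

For the sentence from the question, `extract` returns `Just 42`; for `"foo93bar7"` it returns `Just 93`, since only the first run of digits is read.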

Basic Attoparsec Parsing returns only "Right []"

I'm very new to Haskell and I'm trying to parse a map file, just for practice. My code will compile, but it gives me the wrong result. All I get is "Right []" - which I don't understand.
My code is very similar to the tutorial here, but I rewrote it to serve my needs.
My file looks like this (I removed most of the lines to save space here):
#test map 2
0,0:1;
1,0:1;
2,0:1;
3,0:1;
My code:
import Data.Word
import Data.Time
import Data.Attoparsec.Char8
import Control.Applicative
import qualified Data.ByteString as B
-- Types --
data Tile = Tile Int Int Int deriving Show
data MapLine =
  MapLine { tile :: Tile } deriving Show
-- Parsing --
parseTile :: Parser Tile
parseTile = do
  x <- decimal
  char ','
  y <- decimal
  char ':'
  t <- decimal
  char ';'
  return $ Tile x y t
mapLineParser :: Parser MapLine
mapLineParser = do
  t <- parseTile
  return $ MapLine t
fileParser :: Parser [MapLine]
fileParser = many $ mapLineParser <* endOfLine
-- Main --
main :: IO()
--main = B.readFile "map.hexmap" >>= print . parseOnly fileParser
main = do
  print "Parsing map..."
  let x = B.readFile "map.hexmap"
  x >>= print . parseOnly fileParser
  print "Done."
Thanks for the help.

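Your parser "successfully parses" a list of MapLines of length zero: it fails on the first line (#test map 2), which parseTile cannot read, so many returns an empty list, and parseOnly does not require the whole input to be consumed. Remove that line (and make sure your file doesn't start with any non-parsable bytes such as a BOM) and it should work. Alternatively, write a parser for lines starting with a # that ignores the result, then combine it with mapLineParser. Appending endOfInput to fileParser turns this kind of silent early stop into a parse error.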
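The second suggestion could look roughly like the sketch below. It assumes that every line, including the last one, ends with a newline and that a comment always occupies a whole line; the names tileP, commentP, lineP, fileP and parseMap are invented for the example, and the MapLine wrapper is left out for brevity.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (many, (<|>))
import Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString.Char8 as B
import Data.Maybe (catMaybes)

data Tile = Tile Int Int Int deriving (Eq, Show)

-- One tile in the form "x,y:t;".
tileP :: Parser Tile
tileP = Tile <$> decimal <* char ',' <*> decimal <* char ':' <*> decimal <* char ';'

-- A comment runs from '#' to the end of the line; its text is dropped.
commentP :: Parser ()
commentP = char '#' *> skipWhile (/= '\n')

-- A line is either a comment (Nothing) or a tile (Just tile).
lineP :: Parser (Maybe Tile)
lineP = (Nothing <$ commentP) <|> (Just <$> tileP)

-- endOfInput makes leftover, unparsable input an error instead of an empty result.
fileP :: Parser [Tile]
fileP = catMaybes <$> many (lineP <* endOfLine) <* endOfInput

parseMap :: B.ByteString -> Either String [Tile]
parseMap = parseOnly fileP
```

For the first three lines of the sample file, `parseMap "#test map 2\n0,0:1;\n1,0:1;\n"` yields `Right [Tile 0 0 1,Tile 1 0 1]`.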

Haskell attoparsec: "Failed reading: satisfyWith"

I want to parse text like "John","Kate","Ruddiger" into list of Strings.
I tried to start with parsing "John", to Name (alias for String) but it already fails with Fail "\"," [","] "Failed reading: satisfyWith".
Question A: Why does this error occur and how can I fix it? (I didn't find call to satisfyWith in attoparsec's source code)
Question B: How can I make the parser to not require a comma after the last name?
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Char8 as P
import qualified Data.ByteString.Char8 as BS
import Control.Applicative(many)
data Name = Name String deriving Show
readName = P.takeWhile (/='"')
entryParser :: Parser Name
entryParser = do
  P.char '"'
  name <- readName
  P.char ','
  return $ Name (BS.unpack name)
someEntry :: IO BS.ByteString
someEntry = do
  return $ BS.pack "\"John\","
main :: IO()
main = do
  someEntry >>= print . parse entryParser
I am using GHC 7.6.3 and attoparsec-0.11.3.4.

Question A: Why does this error occur and how can I fix it? (I didn't find call to satisfyWith in attoparsec's source code)
readName = P.takeWhile (/='"')
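takeWhile consumes input as long as the predicate is true. Therefore, after you read the name, the closing " hasn't been consumed, and P.char ',' fails on it. (The satisfyWith in the message presumably comes from attoparsec's internals rather than from your code.) This is easy to see if we remove P.char ',' from the entryParser: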
entryParser = P.char '"' >> fmap (Name . BS.unpack) readName
$ runhaskell SO.hs
Done "\"," Name "John"
You need to consume the ":
entryParser :: Parser Name
entryParser = do
  P.char '"'
  name <- readName
  P.char '"' -- <<<<<<<<<<<<<<<<<<<<<<
  P.char ','
  return $ Name (BS.unpack name)
Question B: How can I make the parser to not require a comma after the last name?
Use sepBy.
Now that your questions have been cleared up, let's make things a little bit easier. Don't consume the , at all in entryParser; instead, only take the name:
entryParser = P.char '"' *> fmap ( Name . BS.unpack ) readName <* P.char '"'
In case you don't know (*>) and (<*): they're both from Control.Applicative, and they basically mean "run both parsers, but discard the result on the asterisk's side".
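The behaviour of the two operators is easy to check outside of attoparsec, for instance with Maybe. The names below are made up for this small sketch; the point is that both sides still "run", and only one result is kept.

```haskell
-- Keeps the right-hand result; the left one is discarded.
keepRight :: Maybe String
keepRight = Just "quote" *> Just "name"

-- Keeps the left-hand result; the right one is discarded.
keepLeft :: Maybe String
keepLeft = Just "name" <* Just "quote"

-- A failure on the discarded side still fails the whole expression.
stillFails :: Maybe String
stillFails = Nothing *> Just "name"
```

Here `keepRight` and `keepLeft` are both `Just "name"`, while `stillFails` is `Nothing`, which mirrors how a missing quote makes entryParser fail even though the quote itself is not part of the result.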
Now, in order to parse all comma separated entries, we use sepBy entryParser (P.char ','). However, this will lead into attoparsec returning a Partial:
$ runhaskell SO.hs
Partial _
That's actually a feature of attoparsec you have to keep in mind:
Attoparsec supports incremental input, meaning that you can feed it a bytestring that represents only part of the expected total amount of data to parse. If your parser reaches the end of a fragment of input and could consume more input, it will suspend parsing and return a Partial continuation.
If you do want to use incremental input, use parse and feed. Otherwise use parseOnly. The complete code for your example would be something like
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Char8 as P
import qualified Data.ByteString.Char8 as BS
import Control.Applicative(many, (*>), (<*))
data Name = Name String deriving Show
readName = P.takeWhile (/='"')
entryParser :: Parser Name
entryParser = P.char '"' *> fmap ( Name . BS.unpack ) readName <* P.char '"'
allEntriesParser = sepBy entryParser (P.char ',')
testString = "\"John\",\"Martha\",\"test\""
main = print . parseOnly allEntriesParser $ testString
$ runhaskell SO.hs
Right [Name "John",Name "Martha",Name "test"]
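For completeness, the incremental route with parse and feed might look like the following sketch. It uses the current module name Data.Attoparsec.ByteString.Char8 instead of the older Data.Attoparsec.Char8; nameP, namesP, finish and incremental are names invented for the example, and the way the input is split into chunks is arbitrary.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Attoparsec.ByteString.Char8 as P
import qualified Data.ByteString.Char8 as BS

-- One quoted name, without the surrounding quotes.
nameP :: P.Parser BS.ByteString
nameP = P.char '"' *> P.takeWhile (/= '"') <* P.char '"'

namesP :: P.Parser [BS.ByteString]
namesP = nameP `P.sepBy` P.char ','

-- Feeding an empty chunk tells attoparsec that no more input will arrive.
finish :: P.Result a -> Maybe a
finish r = case P.feed r BS.empty of
  P.Done _ x -> Just x
  _          -> Nothing

-- The input arrives in two pieces, split in the middle of a name.
incremental :: Maybe [BS.ByteString]
incremental = finish (P.feed (P.parse namesP "\"John\",\"Mar") "tha\",\"test\"")
```

Both the first parse and the first feed return a Partial here, because sepBy could still accept another `,`; only the empty chunk in finish produces the final `Done` with the three names.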

Attoparsec: Skipping bracketed terms?

I'm trying to make large TSV files with JSON in the 5th column suitable for import to mongoDB.
In particular I want to change top level and only top level key fields to _id. This is what I have so far, it seems to work but is slow:
{-# LANGUAGE OverloadedStrings #-}
import System.Environment (getArgs)
import Data.Conduit.Binary (sourceFile, sinkFile)
import Data.Conduit
import qualified Data.Conduit.Text as CT
import qualified Data.Conduit.List as CL
import qualified Data.Text as T
import Data.Monoid ((<>))
import Data.Attoparsec.Text as APT
import Control.Applicative
main = do
  (inputFile : outputFile : _) <- getArgs
  runResourceT $ sourceFile inputFile
    $= CT.decode CT.utf8 $= CT.lines $= CL.map jsonify
    $= CT.encode CT.utf8 $$ sinkFile outputFile
jsonify :: T.Text -> T.Text
jsonify = go . T.splitOn "\t"
  where
    go (_ : _ : _ : _ : content : _) = case parseOnly keyTo_id content of
      Right res -> res <> "\n"
      _ -> ""
    go _ = ""
keyTo_id :: Parser T.Text
keyTo_id = skipWhile(/='{') >> T.snoc <$>
  (T.cons <$> (char '{')
    <*> (T.concat <$> many1 ( bracket
      <|> (string "\"key\":" >> return "\"_id\":")
      <|> APT.takeWhile1(\x -> x /= '{' && x /= '}' && x/= '"')
      <|> T.singleton <$> satisfy (/= '}')
    )))
  <*> char '}'
bracket :: Parser T.Text
bracket = T.cons <$> char '{'
  <*> scan 1 test
  where
    test :: Int -> Char -> Maybe Int
    test 0 _ = Nothing
    test i '}'= Just (i-1)
    test i '{' = Just (i+1)
    test i _ = Just i
According to the profiler 58.7% of the time is spent in bracket, 19.6% in keyTo_id, 17.1% in main.
Surely there's a better way to return bracketed terms unchanged if the brackets match up?
I briefly looked at attoparsec-conduit, but I have no idea how to use that library and can't even tell whether this is the sort of thing it can be used for.
EDIT: Updated the code. The data is from openlibrary.org, e. g. http://openlibrary.org/data/ol_dump_authors_latest.txt.gz

Use the scan function. It allows you to scan over a string while maintaining a state. In your case the state will be a number: the difference between the opening and closing braces that you've encountered so far.
When your state is 0, that means that braces match inside the current substring.
The trick is that you don't deconstruct and reconstruct the string this way, so it should be faster.
Also, you could gain some performance even with your current algorithm by using lazy Text, since the concat function would work more efficiently.
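The scanner state does not have to be a bare counter. As a sketch only, the version below also tracks whether the scan is inside a JSON string, so that braces within string values are not counted. The S record and the names step, balanced and grab are assumptions made for this illustration, and whether the extra state pays off on the real data has not been measured.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Attoparsec.Text as A
import qualified Data.Text as T

-- Scanner state: open-brace depth, inside a string?, previous char was a backslash?
data S = S { depth :: !Int, inStr :: !Bool, esc :: !Bool }

step :: S -> Char -> Maybe S
step (S 0 _ _) _ = Nothing                     -- group already closed: stop scanning
step s c
  | esc s     = Just (s { esc = False })       -- escaped character inside a string
  | inStr s   = Just $ case c of
      '\\' -> s { esc = True }
      '"'  -> s { inStr = False }
      _    -> s
  | otherwise = Just $ case c of
      '"' -> s { inStr = True }
      '{' -> s { depth = depth s + 1 }
      '}' -> s { depth = depth s - 1 }
      _   -> s

-- Return one {...} group verbatim, nested groups included.
balanced :: A.Parser T.Text
balanced = T.cons <$> A.char '{' <*> A.scan (S 1 False False) step

grab :: T.Text -> Either String T.Text
grab = A.parseOnly balanced
```

Note that scan itself never fails, so input that ends before the braces are balanced is returned as is; detecting that case would need the final state as well.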
