I want to parse text like "John","Kate","Ruddiger" into list of Strings.
I tried to start with parsing "John", to Name (alias for String) but it already fails with Fail "\"," [","] "Failed reading: satisfyWith".
Question A: Why does this error occur and how can I fix it? (I didn't find call to satisfyWith in attoparsec's source code)
Question B: How can I make the parser to not require a comma after the last name?
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Char8 as P
import qualified Data.ByteString.Char8 as BS
import Control.Applicative(many)
data Name = Name String deriving Show
readName = P.takeWhile (/='"')
entryParser :: Parser Name
entryParser = do
P.char '"'
name <- readName
P.char ','
return $ Name (BS.unpack name)
someEntry :: IO BS.ByteString
someEntry = do
return $ BS.pack "\"John\","
main :: IO()
main = do
someEntry >>= print . parse entryParser
I am using GHC 7.6.3 and attoparsec-0.11.3.4.
Question A: Why does this error occur and how can I fix it? (I didn't find call to satisfyWith in attoparsec's source code)
readName = P.takeWhile (/='"')
takeWhile consumes as long as the predicate is true. Therefor, after you read the name, " hasn't been consumed. This is easy to see if we remove P.char ',' from the entryParser:
entryParser = P.char '"' >> fmap (Name . BS.unpack) readName
$ runhaskell SO.hs
Done "\"," Name "John"
You need to consume the ":
entryParser :: Parser Name
entryParser = do
P.char '"'
name <- readName
P.char '"' -- <<<<<<<<<<<<<<<<<<<<<<
P.char ','
return $ Name (BS.unpack name)
Question B: How can I make the parser to not require a comma after the last name?
Use sepBy.
Now your questions has been cleared up, lets make things a little bit easier. Don't consume the , at all in entryParser, instead, only take the name:
entryParser = P.char '"' *> fmap ( Name . BS.unpack ) readName <* P.char '"'
In case you don't know (*>) and (<*), they're both from Control.Applicative, and they basically mean "discard whatever is on the asterisks side".
Now, in order to parse all comma separated entries, we use sepBy entryParser (P.char ','). However, this will lead into attoparsec returning a Partial:
$ runhaskell SO.hs
Partial _
That's actually a feature of attoparsec you have to keep in mind:
Attoparsec supports incremental input, meaning that you can feed it a bytestring that represents only part of the expected total amount of data to parse. If your parser reaches the end of a fragment of input and could consume more input, it will suspend parsing and return a Partial continuation.
If you do want to use incremental input, use parse and feed. Otherwise use parseOnly. The complete code for your example would be something like
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Char8 as P
import qualified Data.ByteString.Char8 as BS
import Control.Applicative(many, (*>), (<*))
data Name = Name String deriving Show
readName = P.takeWhile (/='"')
entryParser :: Parser Name
entryParser = P.char '"' *> fmap ( Name . BS.unpack ) readName <* P.char '"'
allEntriesParser = sepBy entryParser (P.char ',')
testString = "\"John\",\"Martha\",\"test\""
main = print . parseOnly allEntriesParser $ testString
$ runhaskell SO.hs
Right [Name "John",Name "Martha",Name "test"]
Related
I wanted to replace sed and awk with Parsec. For example, extract number from strings like unknown structure but containing the number 42 and maybe some other stuff.
I run into "unexpected end of input". I'm looking for equivalent of non-greedy .*([0-9]+).*.
module Main where
import Text.Parsec
parser :: Parsec String () Int
parser = do
_ <- many anyToken
x <- read <$> many1 digit
_ <- many anyToken
return x
main :: IO ()
main = interact (show . parse parser "STDIN")
This can be easily done with my library regex-applicative. It gives you both the combinator interface and the features of regular expressions that you seem to want.
Here's a working version that's closest to your example:
{-# LANGUAGE ApplicativeDo #-}
import Text.Regex.Applicative
import Text.Regex.Applicative.Common (decimal)
parser :: RE Char Int
parser = do
_ <- few anySym
x <- decimal
_ <- many anySym
return x
main :: IO ()
main = interact (show . match parser)
Here's an even shorter version, using findFirstInfix:
import Text.Regex.Applicative
import Text.Regex.Applicative.Common (decimal)
main :: IO ()
main = interact (snd3 . findFirstInfix decimal)
where snd3 (_, r, _) = r
If you want to perform actual tokenization (e.g. skip 93 in foo93bar), then take a look at lexer-applicative, a tokenizer based on regex-applicative.
Replacing sed and awk with parsers is what the
replace-megaparsec
library is all about.
Extract numbers from unstructured strings with the
sepCap
parser combinator.
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char.Lexer
parseTest (sepCap (decimal :: Parsec Void String Int))
$ "unknown structure but containing the number 42 and maybe some other stuff"
[ Left "unknown structure but containing the number "
, Right 42
, Left " and maybe some other stuff"
]
This cannot work, since anyToken accepts and consumes - as its names says - any token, including digits. And you apply it many times. Therefore the attempt to read digits with the second parser must fail. There simply cannot be any tokens left.
Instead make your first parser accept any character, that is not a digit (using isDigit from module Data.Char):
parser :: Parsec String () Int
parser = do
_ <- many $ satisfy (not . isDigit)
x <- read <$> many1 digit
_ <- many anyToken
return x
This simple parser is expected to parse messages of the form
key: value\r\nkey: value\r\n\r\nkey: value\r\nkey: value\r\n\r\n
One EOL acts as a field separator, and double EOL acts as a message separator. It works perfectly fine when the EOL separator is \n but parseWith always returns fail when it is \r\n.
parsePair = do
key <- B8.takeTill (==':')
_ <- B8.char ':'
_ <- B8.char ' '
value <- B8.manyTill B8.anyChar endOfLine
return (key, value)
parseListPairs = sepBy parsePair endOfLine <* endOfLine
parseMsg = sepBy parseListPairs endOfLine <* endOfLine
I'm assuming you are using these imports:
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Attoparsec.ByteString.Char8 as B8
import Data.Attoparsec.ByteString.Char8
The problem is that endOfLine consumes the end of line, so perhaps you really want something like:
parseListPairs = B8.many1 parsePair <* endOfInput
For instance, this works:
ghci> parseOnly parseListPairs "k: v\r\nk2: v2\r\n"
Right [("k","v"),("k2","v2")]
Update:
For parsing multiple messages you can use:
parseListPairs = B8.manyTill parsePair endOfLine
parseMsgs = B8.manyTill parseListPairs endOfInput
ghci> test3 = parseOnly parseMsgs "k1: v1\r\nk2: v2\r\n\r\nk3: v3\r\nk4: v4\r\n\r\n"
Right [[("k1","v1"),("k2","v2")],[("k3","v3"),("k4","v4")]]
Problems
Your code isn't self-contained and the actual problem is unclear. However, I suspect your woes are actually caused by how keys are parsed; in particular, something like \r\nk is a valid key, according to your parser:
λ> parseOnly parsePair "\r\nk: v\r\n"
Right ("\r\nk","v")
That needs to be fixed.
Moreover, since one EOL separates (rather than terminates) key-value pairs, an EOL shouldn't be consumed at the end of your parsePair parser.
Another tangential issue: because you use the many1 combinator instead ByteString-oriented parsers (such as takeTill), your values have type String instead of ByteString. That's probably not what you want, here, because it defeats the purpose of using ByteString in the first place.; see Performance considerations.
Solution
I suggest the following refactoring:
{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString ( ByteString )
import Data.Attoparsec.ByteString.Char8 ( Parser
, count
, endOfLine
, parseOnly
, sepBy
, string
, takeTill
)
-- convenient type synonyms
type KVPair = (ByteString, ByteString)
type Msg = [KVPair]
pair :: Parser KVPair
pair = do
k <- key
_ <- string ": "
v <- value
return (k, v)
where
key = takeTill (\c -> c == ':' || isEOL c)
value = takeTill isEOL
isEOL c = c == '\n' || c == '\r'
-- one EOL separates key-value pairs
msg :: Parser Msg
msg = sepBy pair endOfLine
-- two EOLs separate messages
msgs :: Parser [Msg]
msgs = sepBy msg (count 2 endOfLine)
I have renamed your parsers, for consistency with attoparsec's, none of which have "parse" as a prefix:
parsePair --> pair
parseListPairs --> msg
parseMsg --> msgs
Tests in GHCi
λ> parseOnly keyValuePair "\r\nk: v"
Left "string"
Good; you do want a fail, in this case.
λ> parseOnly keyValuePair "k: v"
Right ("k","v")
λ> parseOnly msg "k: v\r\nk2: v2\r\n"
Right [("k","v"),("k2","v2")]
λ> parseOnly msgs "k1: v1\r\nk2: v2\r\n\r\nk3: v3\r\nk4: v4"
Right [[("k1","v1"),("k2","v2")],[("k3","v3"),("k4","v4")]]
λ> parseOnly msgs "k: v"
Right [[("k","v")]]
I'm very new to Haskell and I'm trying to parse a map file, just for practice. My code will compile, but it gives me the wrong result. All I get is "Right []" - which I don't understand.
My code is very similar to the tutorial here, but I rewrote it to serve my needs.
My file looks like this (I removed most of the lines to save space here):
#test map 2
0,0:1;
1,0:1;
2,0:1;
3,0:1;
My code:
import Data.Word
import Data.Time
import Data.Attoparsec.Char8
import Control.Applicative
import qualified Data.ByteString as B
-- Types --
data Tile = Tile Int Int Int deriving Show
data MapLine =
MapLine { tile :: Tile } deriving Show
-- Parsing --
parseTile :: Parser Tile
parseTile = do
x <- decimal
char ','
y <- decimal
char ':'
t <- decimal
char ';'
return $ Tile x y t
mapLineParser :: Parser MapLine
mapLineParser = do
t <- parseTile
return $ MapLine t
fileParser :: Parser [MapLine]
fileParser = many $ mapLineParser <* endOfLine
-- Main --
main :: IO()
--main = B.readFile "map.hexmap" >>= print . parseOnly fileParser
main = do
print "Parsing map..."
let x = B.readFile "map.hexmap"
x >>= print . parseOnly fileParser
print "Done."
Thanks for the help.
Your parser "successfully parses" a list of MapLines of length zero before failing at the first line. Remove that line (and make sure your file doesn't include any non-parsable bytes at the start like a BOM) and it should work. Or write a parser for lines starting with a # that ignores the result, then combine.
I'm trying to make large TSV files with JSON in the 5th column suitable for import to mongoDB.
In particular I want to change top level and only top level key fields to _id. This is what I have so far, it seems to work but is slow:
{-# LANGUAGE OverloadedStrings #-}
import System.Environment (getArgs)
import Data.Conduit.Binary (sourceFile, sinkFile)
import Data.Conduit
import qualified Data.Conduit.Text as CT
import qualified Data.Conduit.List as CL
import qualified Data.Text as T
import Data.Monoid ((<>))
import Data.Attoparsec.Text as APT
import Control.Applicative
main = do
(inputFile : outputFile : _) <- getArgs
runResourceT $ sourceFile inputFile
$= CT.decode CT.utf8 $= CT.lines $= CL.map jsonify
$= CT.encode CT.utf8 $$ sinkFile outputFile
jsonify :: T.Text -> T.Text
jsonify = go . T.splitOn "\t"
where
go (_ : _ : _ : _ : content : _) = case parseOnly keyTo_id content of
Right res -> res <> "\n"
_ -> ""
go _ = ""
keyTo_id :: Parser T.Text
keyTo_id = skipWhile(/='{') >> T.snoc <$>
(T.cons <$> (char '{')
<*> (T.concat <$> many1 ( bracket
<|> (string "\"key\":" >> return "\"_id\":")
<|> APT.takeWhile1(\x -> x /= '{' && x /= '}' && x/= '"')
<|> T.singleton <$> satisfy (/= '}')
)))
<*> char '}'
bracket :: Parser T.Text
bracket = T.cons <$> char '{'
<*> scan 1 test
where
test :: Int -> Char -> Maybe Int
test 0 _ = Nothing
test i '}'= Just (i-1)
test i '{' = Just (i+1)
test i _ = Just i
According to the profiler 58.7% of the time is spent in bracket, 19.6% in keyTo_id, 17.1% in main.
Surely there's a better way to return bracketed terms unchanged if the brackets match up?
I briefly looked at attoparsec-conduit, but I have no idea how to use that library and can't even tell whether this is the sort of thing it can be used for.
EDIT: Updated the code. The data is from openlibrary.org, e. g. http://openlibrary.org/data/ol_dump_authors_latest.txt.gz
Use the scan function. It allows you to scan over a string maintaing a state. In your case the state will be a number — the difference of opening and closing braces that you've encountered so far.
When your state is 0, that means that braces match inside the current substring.
The trick is that you don't deconstruct and reconstruct the string this way, so it should be faster.
Also, you could gain some performance even with your current algorithm by using lazy Text — the concat function would work more efficiently.
Haskell has a number of string literals that use the \ escape sequence. Ones such as \n, \t, \NUL.
If I have the string literal:
let s = "Newline: \\n Tab: \\t"
how do I define the function escape :: String -> String that will convert the above string to:
"Newline: \n Tab: \t"
And the same with all other string literal escape sequences.
I'm okay with using Quasi Quoting and Template Haskell, but don't know how to use them to achieve the result. Any pointers?
Update: I just found the Text.ParserCombinators.ReadP module that's included in the Base library. It supports the readLitChar :: ReadS Char function in Data.Char that does what I want, but I don't know how to use the ReadP module. I tried the following and it works:
escape2 [] = []
escape2 xs = case readLitChar xs of
[] -> []
[(a, b)] -> a : escape2 b
But this may not be the right way to use the ReadP module. Can anyone provide some pointers?
Another update: Thanks everyone. My final function below. Not bad, I think.
import Text.ParserCombinators.ReadP
import Text.Read.Lex
escape xs
| [] <- r = []
| [(a,_)] <- r = a
where r = readP_to_S (manyTill lexChar eof) xs
You don't need to do anything. When you input the string literal
let s = "Newline: \\n Tab: \\t"
you can check that it is what you want:
Prelude> putStrLn s
Newline: \n Tab: \t
Prelude> length s
19
If you just ask ghci for the value of s you'll get something else,
Prelude> s
"Newline: \\n Tab: \\t"
apparently it's doing some escape formatting behind your back, and it also displays the quotes. If you call show or print you'll get yet other answers:
Prelude> show s
"\"Newline: \\\\n Tab: \\\\t\""
Prelude> print s
"Newline: \\n Tab: \\t"
This is because show is meant for serializing values, so when you show a string you don't get the original back, you instead get a serialized string which can be parsed into the original string. The result of show s is actually displayed by print s (print is defined as putStrLn . show). When you just show s in ghci you get an even stranger answer; here ghci is formatting the characters which are serialized by show.
tl;dr - always use putStrLn to see what the value of a string is in ghci.
Edit: I just realized that maybe you want to convert the literal value
Newline: \n Tab: \t
into the actual control sequences. The easiest way to do this is probably to stick it in quotes and use read:
Prelude> let s' = '"' : s ++ "\""
Prelude> read s' :: String
"Newline: \n Tab: \t"
Prelude> putStrLn (read s')
Newline:
Tab:
Edit 2: an example of using readLitChar, this is very close to Chris's answer except with readLitChar:
strParser :: ReadP String
strParser = do
str <- many (readS_to_P readLitChar)
eof
return str
Then you run it with readP_to_S, which gives you a list of matching parses (there shouldn't be more than one match, however there might not be any match so you should check for an empty list.)
> putStrLn . fst . head $ readP_to_S strParser s
Newline:
Tab:
>
Asking about QQ and TH means you wish to do this conversion at compile time. For simple String -> Something conversions you can use the OverloadedString literal facility in GHC.
EDIT 2 : Using the exposed character lexer in Text.Read.Lex
module UnEscape where
import Data.String(IsString(fromString))
import Text.ParserCombinators.ReadP as P
import Text.Read.Lex as L
newtype UnEscape = UnEscape { unEscape :: String }
instance IsString UnEscape where
fromString rawString = UnEscape lexed
where lexer = do s <- P.many L.lexChar
eof
return s
lexed = case P.readP_to_S lexer rawString of
((answer,""):_) -> answer
_ -> error ("UnEscape could not process "++show rawString)
EDIT 1 : I have now got a better UnEscape instance that uses GHC's read:
instance IsString UnEscape where
fromString rawString = UnEscape (read (quote rawString))
where quote s = '"' : s ++ ['"']
For example:
module UnEscape where
import Data.String(IsString(fromString))
newtype UnEscape = UnEscape { unEscape :: String }
instance IsString UnEscape where
fromString rawString = UnEscape (transform rawString)
where transform [] = []
transform ('\\':x:rest) = replace x : transform rest
transform (y:rest) = y : transform rest
-- also covers special case of backslash at end
replace x = case x of
'n' -> '\n'
't' -> '\t'
unrecognized -> unrecognized
The above has to be a separate module from the module that uses unEscape:
{-# LANGUAGE OverloadedStrings #-}
module Main where
import UnEscape(UnEscape(unEscape))
main = do
let s = "Newline: \\n Tab: \\t"
t = unEscape "Newline: \\n Tab: \\t"
print s
putStrLn s
print t
putStrLn t
This produces
shell prompt$ ghci Main.hs
GHCi, version 7.0.3: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
[1 of 2] Compiling UnEscape ( UnEscape.hs, interpreted )
[2 of 2] Compiling Main ( Main.hs, interpreted )
Ok, modules loaded: Main, UnEscape.
*Main> main
"Newline: \\n Tab: \\t"
Newline: \n Tab: \t
"Newline: \n Tab: \t"
Newline:
Tab: