How to ignore arbitrary tokens using parsec?

How to ignore arbitrary tokens using parsec? - haskell

I wanted to replace sed and awk with Parsec. For example, extract number from strings like unknown structure but containing the number 42 and maybe some other stuff.
I run into "unexpected end of input". I'm looking for equivalent of non-greedy .*([0-9]+).*.
module Main where
import Text.Parsec
parser :: Parsec String () Int
parser = do
_ <- many anyToken
x <- read <$> many1 digit
_ <- many anyToken
return x
main :: IO ()
main = interact (show . parse parser "STDIN")

This can be easily done with my library regex-applicative. It gives you both the combinator interface and the features of regular expressions that you seem to want.
Here's a working version that's closest to your example:
{-# LANGUAGE ApplicativeDo #-}
import Text.Regex.Applicative
import Text.Regex.Applicative.Common (decimal)
parser :: RE Char Int
parser = do
_ <- few anySym
x <- decimal
_ <- many anySym
return x
main :: IO ()
main = interact (show . match parser)
Here's an even shorter version, using findFirstInfix:
import Text.Regex.Applicative
import Text.Regex.Applicative.Common (decimal)
main :: IO ()
main = interact (snd3 . findFirstInfix decimal)
where snd3 (_, r, _) = r
If you want to perform actual tokenization (e.g. skip 93 in foo93bar), then take a look at lexer-applicative, a tokenizer based on regex-applicative.

Replacing sed and awk with parsers is what the
replace-megaparsec
library is all about.
Extract numbers from unstructured strings with the
sepCap
parser combinator.
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char.Lexer
parseTest (sepCap (decimal :: Parsec Void String Int))
$ "unknown structure but containing the number 42 and maybe some other stuff"
[ Left "unknown structure but containing the number "
, Right 42
, Left " and maybe some other stuff"
]

This cannot work, since anyToken accepts and consumes - as its names says - any token, including digits. And you apply it many times. Therefore the attempt to read digits with the second parser must fail. There simply cannot be any tokens left.
Instead make your first parser accept any character, that is not a digit (using isDigit from module Data.Char):
parser :: Parsec String () Int
parser = do
_ <- many $ satisfy (not . isDigit)
x <- read <$> many1 digit
_ <- many anyToken
return x

Related

How to parse multiple lines with megaparsec? many . many runs into space leak

I'd like to parse some very simple text for example,
"abcxyzzzz\nhello\n" into ["abcxyzzz", "hello"] :: String.
Not looking for a simpler function to do this (like words) as I need to parse something more complex and I'm just laying the foundations here.
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeSynonymInstances #-}
{-# LANGUAGE FlexibleInstances #-}
module RgParse where
import Data.Text (Text)
import Text.Megaparsec
import Text.Megaparsec.Char
data SimpleData = SimpleData String deriving (Eq, Show, Ord)
data SimpleData' = SimpleData' [String] deriving (Eq, Show, Ord)
instance ShowErrorComponent SimpleData where
showErrorComponent = show
instance ShowErrorComponent String where
showErrorComponent = show
simple :: Parsec String Text SimpleData
simple = do
x <- many (noneOf (Just '\n'))
pure $ SimpleData x
simple' :: Parsec String Text SimpleData'
simple' = do
x <- many (many (noneOf (Just '\n')))
pure $ SimpleData' x
example2 :: Text
example2 = "abcxyzzzz\nhello\n"
main :: IO ()
main = do
print "Simple:"
case parse simple "<stdin>" example2 of
Left bundle -> putStr (errorBundlePretty bundle)
Right result -> print result
print "Simple':"
case parse simple' "<stdin>" example2 of
Left bundle -> putStr (errorBundlePretty bundle)
Right result -> print result
print "done.."
The above unfortunately runs into an infinite loop / space leak upon entering simple' as it outputs the following:
Hello, Haskell!
[]
"Simple:"
SimpleData "abcxyzzzz"
"Simple':"
Using megaparsec-7.0.5 (not the latest 9.x.x).
Is there possibly a simpler approach to getting multiple lines?

Apply many only to a parser that either consumes at least one token (here, one Char) or fails. That's because many works by running its argument until it fails. many x may consume zero tokens, so many (many x) breaks this requirement.
Note that a line should at least involve a terminating newline. That allows that requirement to be fulfilled.
oneline :: Parsec String Text String
oneline = many (noneOf (Just '\n')) <* single '\n'
manylines :: Parsec String Text [String]
manylines = many oneline
simple :: Parsec String Text SimpleData
simple = do
x <- oneline
pure $ SimpleData x
simple' :: Parsec String Text SimpleData'
simple' = do
x <- manylines
pure $ SimpleData' x
A looser requirement for many p is that any repetition of p must fail after a finite number of iterations (and here p = many x never fails), so p might consume nothing in some steps, but then it must be stateful so that after some repetitions it eventually consumes something or fails. But the above approximation is a pretty good rule of thumb in practice.

Parse identifiers that don't end with certain characters in attoparsec

I am stuck writing an attoparsec parser to parse what the Uniform Code for Units of Measure calls a <ATOM-SYMBOL>. It's defined to be the longest sequence of characters in a certain class (that class includes all the digits 0-9) which doesn't end with a digit.
So given the input foo27 I want to consume and return foo, for 237bar26 I want to consume and return 237bar, for 19 I want to fail without consuming anything.
I can't figure out how to build this out of takeWhile1 or takeTill or scan but I am probably missing something obvious.
Update:
My best attempt so far was that I managed to exclude sequences that are entirely digits
atomSymbol :: Parser Text
atomSymbol = do
r <- core
if (P.all (inClass "0-9") . T.unpack $ r)
then fail "Expected an atom symbol but all characters were digits."
else return r
where
core = A.takeWhile1 $ inClass "!#-'*,0-<>-Z\\^-z|~"
I tried changing that to test if the last character was a digit instead of if they all were, but it doesn't seem to backtrack one character at a time.
Update 2:
The whole file is at https://github.com/dmcclean/dimensional-attoparsec/blob/master/src/Numeric/Units/Dimensional/Parsing/Attoparsec.hs. This only builds against the prefixes branch from https://github.com/dmcclean/dimensional.

You should reformulate the problem and treat spans of digits (0-9) and spans of non-digit characters (!#-'*,:-<>-Z\\^-z|~) separately. The syntactic element of interest can then be described as
an optional digit span, followed by
a non-digit span, followed by
zero or more {digit span followed by a non-digit span}.
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Control.Applicative ((<|>), many)
import Data.Char (isDigit)
import Data.Attoparsec.Combinator (option)
import Data.Attoparsec.Text (Parser)
import qualified Data.Attoparsec.Text as A
import Data.Text (Text)
import qualified Data.Text as T
atomSymbol :: Parser Text
atomSymbol = f <$> (option "" digitSpan)
<*> (nonDigitSpan <|> fail errorMsg)
<*> many (g <$> digitSpan <*> nonDigitSpan)
where
nonDigitSpan = A.takeWhile1 $ A.inClass "!#-'*,:-<>-Z\\^-z|~"
digitSpan = A.takeWhile1 isDigit
f x y xss = T.concat $ x : y : concat xss
g x y = [x,y]
errorMsg = "Expected an atom symbol but all characters (if any) were digits."
Tests
[...] given the input foo27 I want to consume and return foo, for 237bar26 I want to consume and return 237bar, for 19 I want to fail without consuming anything.
λ> A.parseOnly atomSymbol "foo26"
Right "foo"
λ> A.parseOnly atomSymbol "237bar26"
Right "237bar"
λ> A.parseOnly atomSymbol "19"
Left "Failed reading: Expected an atom symbol but all characters (if any) were digits."

Basic Attoparsec Parsing returns only "Right []"

I'm very new to Haskell and I'm trying to parse a map file, just for practice. My code will compile, but it gives me the wrong result. All I get is "Right []" - which I don't understand.
My code is very similar to the tutorial here, but I rewrote it to serve my needs.
My file looks like this (I removed most of the lines to save space here):
#test map 2
0,0:1;
1,0:1;
2,0:1;
3,0:1;
My code:
import Data.Word
import Data.Time
import Data.Attoparsec.Char8
import Control.Applicative
import qualified Data.ByteString as B
-- Types --
data Tile = Tile Int Int Int deriving Show
data MapLine =
MapLine { tile :: Tile } deriving Show
-- Parsing --
parseTile :: Parser Tile
parseTile = do
x <- decimal
char ','
y <- decimal
char ':'
t <- decimal
char ';'
return $ Tile x y t
mapLineParser :: Parser MapLine
mapLineParser = do
t <- parseTile
return $ MapLine t
fileParser :: Parser [MapLine]
fileParser = many $ mapLineParser <* endOfLine
-- Main --
main :: IO()
--main = B.readFile "map.hexmap" >>= print . parseOnly fileParser
main = do
print "Parsing map..."
let x = B.readFile "map.hexmap"
x >>= print . parseOnly fileParser
print "Done."
Thanks for the help.

Your parser "successfully parses" a list of MapLines of length zero before failing at the first line. Remove that line (and make sure your file doesn't include any non-parsable bytes at the start like a BOM) and it should work. Or write a parser for lines starting with a # that ignores the result, then combine.

Haskell: using a Vector of Vectors instead of a nested list when parsing a CSV file

I need to parse and process a text file that is a nested list of integer. The file is about 250mb large. This already leads to performace problems my naive solution takes 20GB or more of RAM.
The question is related to another question.
I have written about the memory problems and the suggestion was to use Data.Vector to get rtid of the memory problems.
So the goal is to process a nested list of integers and, say, filter the values so that only values larger than 30 get printed out.
Test file "myfile.tx":
11,22,33,44,55
66,77,88,99,10
Here is my code using Attoparsec, adapted from attoparsec-csv:
{-# Language OverloadedStrings #-}
-- adapted from https://github.com/robinbb/attoparsec-csv
module Text.ParseCSV
(
parseCSV
) where
import Prelude hiding (concat, takeWhile)
import Control.Applicative ((<$>), (<|>), (<*>), (<*), (*>), many)
import Control.Monad (void, liftM)
import Data.Attoparsec.Text
import qualified Data.Text as T (Text, concat, cons, append, pack, lines)
import qualified Data.Text.IO as IO (readFile, putStr)
import qualified Data.ByteString.Char8 as BSCH (readInteger)
lineEnd :: Parser ()
lineEnd =
void (char '\n') <|> void (string "\r\n") <|> void (char '\r')
<?> "end of line"
parserInt :: Parser Integer
parserInt = (signed decimal)
record :: Parser [Integer]
record =
parserInt `sepBy1` char ','
<?> "record"
file :: Parser [[Integer]]
file =
(:) <$> record
<*> manyTill (lineEnd *> record)
(endOfInput <|> lineEnd *> endOfInput)
<?> "file"
parseCSV :: T.Text -> Either String [[Integer]]
parseCSV =
parseOnly file
getValues :: Either String [[Integer]] -> [Integer]
getValues (Right [x]) = x
getValues _ = []
getLines :: FilePath -> IO [T.Text]
getLines = liftM T.lines . IO.readFile
parseAndFilter :: T.Text -> [Integer]
parseAndFilter = ((\x -> filter (>30) x) . getValues . parseCSV)
main = do
list <- getLines "myfile.txt"
putStr $ show $ map parseAndFilter list
But instead of using a list [Integer] I would like to use Data.Vector.
I found a relevant part in the Data.Vector tutorial:
--The simplest way to parse a file of Int or Integer types is with a strict or lazy --ByteString, and the readInt or readInteger functions:
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Vector as U
import System.Environment
main = do
[f] <- getArgs
s <- L.readFile f
print . U.sum . parse $ s
-- Fill a new vector from a file containing a list of numbers.
parse = U.unfoldr step
where
step !s = case L.readInt s of
Nothing -> Nothing
Just (!k, !t) -> Just (k, L.tail t)
However, this is regular, not a nested list of integers.
I tried to adapt my code but it did not work.
How can I change my code to
use a nested Vector (or Vector of Vectors) instead of [Integer] (i.e., while also running the Filter of >30 on the Vector).

There is an important question you don't mention in the posting.... Do you need everything in memory at once. If the processing is local, or if you can summarize all the data up to a point in the file with a few values, you can solve the performance problems by streaming the data through and throwing away all but the current line. This will usually run way faster and allow you to process orders of magnitude larger files. And it usually doesn't even matter (as much) what data structure you use to parse the values.
Here is an example:
import Text.Regex
process::[Int]->String
process = (++"\n") . show . sum --put whatever you want here.
main = interact (concat . map (process . map read . splitRegex (mkRegex ",")) . lines)
The whole program runs lazily, so it processes line by line as the data comes in and frees up the memory for old data (you can check this by typing in data by hand and watch the output come out). There is a performance hit by using the unpacked structures, but this isn't as big a problem as pulling everything into memory.
Many problems that don't seem to fit this criteria at first can be modified to do so (you may have to sort the data first, but there are many performance effective ways to do this).... I rewrote the full online stats system for a gaming company once following this principle, and was able to take a stats crunching time from hours to a couple of minutes (with even more metrics).
Because of its lazy nature, Haskell is a good language to stream data through.

I found a post that there is no easy way to parse with attoparsec to a vector.
See this forum post and thread.
But the good new is that the overhead of Data.Vector.fromList isn't so bad.
Attoparsec seems to be quite fast for parsing.
I keep the whole data in memory and this doesn't seem a speed overhead. It's more flexible, as perhaps later I need to have the whole data in memory, altough currently it is not needed per se for my problem.
Currently the code runs in ~30 seconds and about 1.5GB RAM for a 150MB text file. Now the memory consumption is quite little versus 20GB of before and I only need to focus on improving the speed.
Here are the changes from the code of my question my post, commented out code is using lists, functions with Vector in the type are new (this is not production code or meant to be good code yet):
{-
getValues :: Either String [[Integer]] -> [Integer]
getValues (Right [x]) = x
getValues _ = []
-}
getValues :: Either String [[Integer]] -> Vector Integer
getValues (Right [x]) = V.fromList x
getValues _ = V.fromList [999999,9999999,99999,999999] --- represents an ERROR
getLines :: FilePath -> IO [T.Text]
getLines = liftM T.lines . IO.readFile
{-
parseAndFilter :: T.Text -> [Integer]
parseAndFilter = ((\x -> filter (>30) x) . getValues . parseCSV)
-}
filterLarger :: Vector Integer -> Vector Integer
filterLarger = \x -> V.filter (>37) x
parseVector :: T.Text -> Vector Integer
parseVector = (getValues . parseCSV)
-- mystr = T.pack "3, 6, 7" --, 13, 14, 15, 17, 21, 22, 23, 24, 25, 28, 29, 30, 32, 33, 35, 36"
main = do
list <- getLines "mydata.txt"
--putStr $ show $ parseCSV $ mystr
putStr $ show $ V.map filterLarger $ V.map parseVector $ V.fromList list
--show $ parseOnly parserInt $ T.pack "123"
Thanks to jamshidh and all the comments that pointed me to the right direction.
Here is the final solution. Switching to ByteString and Int in the code, it now runs twice as fast and a bit less memory consumtion (time is now ~14 Seconds).
{-# Language OverloadedStrings #-}
-- adapted from https://github.com/robinbb/attoparsec-csv
module Main
(
parseCSV, main
) where
import Data.Vector as V (Vector, fromList, map, head, filter)
import Prelude hiding (concat, takeWhile)
import Control.Applicative ((<$>), (<|>), (<*>), (<*), (*>), many)
import Control.Monad (void, liftM)
import Data.Attoparsec.Char8
import qualified Data.ByteString.Char8 as B
lineEnd :: Parser ()
lineEnd =
void (char '\n') <|> void (string "\r\n") <|> void (char '\r')
<?> "end of line"
parserInt :: Parser Int
parserInt = skipSpace *> signed decimal
record :: Parser [Int]
record =
parserInt `sepBy1` char ','
<?> "record"
file :: Parser [[Int]]
file =
(:) <$> record
<*> manyTill (lineEnd *> record)
(endOfInput <|> lineEnd *> endOfInput)
<?> "file"
parseCSV :: B.ByteString -> Either String [[Int]]
parseCSV =
parseOnly file
getValues :: Either String [[Int]] -> Vector Int
getValues (Right [x]) = V.fromList x
getValues _ = error "ERROR in getValues function!"
filterLarger :: Vector Int -> Vector Int
filterLarger = \x -> V.filter (>36) x
parseVector :: B.ByteString -> Vector Int
parseVector = (getValues . parseCSV)
-- MAIN
main = do
fContent <- B.readFile "myfile.txt"
putStr $ show $ V.map filterLarger $ V.map parseVector $ V.fromList $ B.lines fContent

Attoparsec: Skipping bracketed terms?

I'm trying to make large TSV files with JSON in the 5th column suitable for import to mongoDB.
In particular I want to change top level and only top level key fields to _id. This is what I have so far, it seems to work but is slow:
{-# LANGUAGE OverloadedStrings #-}
import System.Environment (getArgs)
import Data.Conduit.Binary (sourceFile, sinkFile)
import Data.Conduit
import qualified Data.Conduit.Text as CT
import qualified Data.Conduit.List as CL
import qualified Data.Text as T
import Data.Monoid ((<>))
import Data.Attoparsec.Text as APT
import Control.Applicative
main = do
(inputFile : outputFile : _) <- getArgs
runResourceT $ sourceFile inputFile
$= CT.decode CT.utf8 $= CT.lines $= CL.map jsonify
$= CT.encode CT.utf8 $$ sinkFile outputFile
jsonify :: T.Text -> T.Text
jsonify = go . T.splitOn "\t"
where
go (_ : _ : _ : _ : content : _) = case parseOnly keyTo_id content of
Right res -> res <> "\n"
_ -> ""
go _ = ""
keyTo_id :: Parser T.Text
keyTo_id = skipWhile(/='{') >> T.snoc <$>
(T.cons <$> (char '{')
<*> (T.concat <$> many1 ( bracket
<|> (string "\"key\":" >> return "\"_id\":")
<|> APT.takeWhile1(\x -> x /= '{' && x /= '}' && x/= '"')
<|> T.singleton <$> satisfy (/= '}')
)))
<*> char '}'
bracket :: Parser T.Text
bracket = T.cons <$> char '{'
<*> scan 1 test
where
test :: Int -> Char -> Maybe Int
test 0 _ = Nothing
test i '}'= Just (i-1)
test i '{' = Just (i+1)
test i _ = Just i
According to the profiler 58.7% of the time is spent in bracket, 19.6% in keyTo_id, 17.1% in main.
Surely there's a better way to return bracketed terms unchanged if the brackets match up?
I briefly looked at attoparsec-conduit, but I have no idea how to use that library and can't even tell whether this is the sort of thing it can be used for.
EDIT: Updated the code. The data is from openlibrary.org, e. g. http://openlibrary.org/data/ol_dump_authors_latest.txt.gz

Use the scan function. It allows you to scan over a string maintaing a state. In your case the state will be a number — the difference of opening and closing braces that you've encountered so far.
When your state is 0, that means that braces match inside the current substring.
The trick is that you don't deconstruct and reconstruct the string this way, so it should be faster.
Also, you could gain some performance even with your current algorithm by using lazy Text — the concat function would work more efficiently.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string