Attoparsec - ensure entire contents consumed with sepBy1 - haskell

I would like the below code to return [LoadInt 1,LoadDub 2.5,LoadInt 3], but it fails after parsing [LoadInt 1,LoadDub 2] and facing .5,3. How do I make it so it must parse all the way to the comma for a parse to succeed, and an int parse on 2.5 is a fail?
import qualified Data.Attoparsec.ByteString.Char8 as A
import Data.Attoparsec.ByteString.Char8 (Parser)
import Data.ByteString.Char8 (pack)
import Data.Attoparsec.Combinator
import Control.Applicative ((*>),(<$>),(<|>))
data LoadNum = LoadInt Int | LoadDub Double deriving (Show)
someFunc :: IO ()
someFunc = putStrLn . show $ A.parseOnly (lnParser <* A.endOfInput) (pack testString)
testString :: String
testString = "1,2.5,3"
lnParser :: Parser [LoadNum]
lnParser = (sepBy1' (ld <* A.atEnd) (A.char ','))
double :: Parser Double
double = A.double
int :: Parser Int
int = A.signed A.decimal
ld :: Parser LoadNum
ld = ((LoadInt <$> int ) <|> (LoadDub <$> double))

You could use a tiny bit of lookahead to decide whether you reached the end of a list element. So:
int :: Parser Int
int = do
i <- A.signed A.decimal
next <- A.peekChar
case next of
Nothing -> pure i
Just ',' -> pure i
_ -> fail "nah"

Related

How can I write a pattern quasi quoter in Haskell?

I use quasi quoters to create my smart-constructed data types at compile time. This looks something like:
import qualified Data.Text as T
import Language.Haskell.TH.Quote (QuasiQuoter(..))
import Language.Haskell.TH (Q, Exp, Pat(..), Lit(..))
import Language.Haskell.TH.Syntax (Lift(..))
import qualified Language.Haskell.TH.Syntax as TH
import Instances.TH.Lift () -- th-lift-instances package
newtype NonEmptyText = NonEmptyText Text
textIsWhitespace :: Text -> Bool
textIsWhitespace = T.all (== ' ')
mkNonEmptyText :: Text -> Maybe NonEmptyText
mkNonEmptyText t = if textIsWhitespace t then Nothing else (Just (NonEmptyText t))
compileNonEmptyText :: QuasiQuoter
compileNonEmptyText = QuasiQuoter
{ quoteExp = compileNonEmptyText'
, quotePat = error "NonEmptyText is not supported as a pattern"
, quoteDec = error "NonEmptyText is not supported at top-level"
, quoteType = error "NonEmptyText is not supported as a type"
}
where
compileNonEmptyText' :: String -> Q Exp
compileNonEmptyText' s = case mkNonEmptyText (pack s) of
Nothing -> fail $ "Invalid NonEmptyText: " ++ s
Just txt -> [| txt |]
(I can provide a standalone working example if necessary—I just pulled this example out of a larger codebase)
Essentially, by just deriving Lift for my newtypes, I can place the data type in an expression quasi quoter [| txt |] to implement quoteExp.
But I'm having trouble with quotePat. If I do e.g.:
Just txt -> [p| txt |]
Then I get a warning that the first txt is unused, and the second shadows the first. I'm pretty sure that that pattern is just creating a new name txt rather than splicing in the in-scope txt like the expression quasi quoter did, since when I do:
f :: NonEmptyText -> Bool
f [compileNonEmptyText|test|] = True
f _ = False
everything matches the first statement.
Alright I think I've got it. Starting from the base string s, I can wrap that in StringL and LitP to get a literal string, which because of Text's IsString instance will become a Text. From there I need to apply the NonEmptyText constructor using ConP:
compileNonEmptyTextPattern' :: String -> Q TH.Pat
compileNonEmptyTextPattern' s = case mkNonEmptyText (pack s) of
Nothing -> fail $ "Invalid NonEmptyText: " ++ s
Just (NonEmptyText txt) -> pure $ ConP 'NonEmptyText [(LitP (StringL (T.unpack txt)))]
It's unfortunate that this is so much more verbose than the expression version, though! I wonder if there could be a typeclass for Q Pat like Lift is for Q Exp?

Basic Attoparsec Parsing returns only "Right []"

I'm very new to Haskell and I'm trying to parse a map file, just for practice. My code will compile, but it gives me the wrong result. All I get is "Right []" - which I don't understand.
My code is very similar to the tutorial here, but I rewrote it to serve my needs.
My file looks like this (I removed most of the lines to save space here):
#test map 2
0,0:1;
1,0:1;
2,0:1;
3,0:1;
My code:
import Data.Word
import Data.Time
import Data.Attoparsec.Char8
import Control.Applicative
import qualified Data.ByteString as B
-- Types --
data Tile = Tile Int Int Int deriving Show
data MapLine =
MapLine { tile :: Tile } deriving Show
-- Parsing --
parseTile :: Parser Tile
parseTile = do
x <- decimal
char ','
y <- decimal
char ':'
t <- decimal
char ';'
return $ Tile x y t
mapLineParser :: Parser MapLine
mapLineParser = do
t <- parseTile
return $ MapLine t
fileParser :: Parser [MapLine]
fileParser = many $ mapLineParser <* endOfLine
-- Main --
main :: IO()
--main = B.readFile "map.hexmap" >>= print . parseOnly fileParser
main = do
print "Parsing map..."
let x = B.readFile "map.hexmap"
x >>= print . parseOnly fileParser
print "Done."
Thanks for the help.
Your parser "successfully parses" a list of MapLines of length zero before failing at the first line. Remove that line (and make sure your file doesn't include any non-parsable bytes at the start like a BOM) and it should work. Or write a parser for lines starting with a # that ignores the result, then combine.

how to parse yahoo csv with parsec

how to parse into array such as open[i],high[i],low[i],close[i]
testhaskell.hs:22:5:
Couldn't match type `[]' with `IO'
Expected type: IO a0
Actual type: [a0]
In the return type of a call of `map'
In a stmt of a 'do' block: map (\ line -> sentence line) allLines
In the expression:
do { handle <- openFile
"C:\\Users\\ivan\\Downloads\\0388.HK.csv" ReadMode;
contents <- hGetContents handle;
let allLines = lines contents;
map (\ line -> sentence line) allLines;
.... }
testhaskell.hs:22:19:
Couldn't match expected type `String -> a0'
with actual type `Text.Parsec.Prim.ParsecT
String () Data.Functor.Identity.Identity [String]'
import System.IO
import qualified Data.ByteString.Char8 as BS
import qualified Data.ByteString as Str
import Text.ParserCombinators.Parsec
word :: Parser String
word = many1 letter
sentence :: Parser [String]
sentence = do{ words <- sepBy1 word separator
; oneOf ".?!" <?> "end of sentence"
; return words
}
separator :: Parser ()
separator = skipMany1 (space <|> char ',' <?> "")
main = do
handle <- openFile "C:\\Users\\ivan\\Downloads\\0005.HK.csv" ReadMode
contents <- hGetContents handle
let allLines = lines contents
map (\line -> sentence line) allLines
--putStr contents
hClose handle
update:
module Main where
import qualified Data.ByteString.Char8 as B
import Data.Map ((!))
import Data.Text
import qualified Data.Vector as V
import System.Directory
import Test.Framework (Test, defaultMain, testGroup)
import Test.Framework.Providers.API
import Test.HUnit ((#=?))
import Data.CSV.Conduit
main :: IO ()
main = defaultMain tests
tests :: [Test]
tests = [testGroup "Basic Ops" baseTests]
baseTests :: [Test]
baseTests =
[
testCase "simple parsing works" test_simpleParse
]
test_simpleParse :: IO ()
test_simpleParse = do
(d :: V.Vector (MapRow B.ByteString)) <- readCSVFile csvSettings testFile1
V.mapM_ assertRow d
where
assertRow r = v3 #=? (v1 + v2)
where v1 = readBS $ r ! "Open"
v2 = readBS $ r ! "High"
v3 = readBS $ r ! "Low"
v4 = readBS $ r ! "Close"
csvSettings :: CSVSettings
csvSettings = defCSVSettings { csvQuoteChar = Just '`'}
testFile1 :: FilePath
testFile1 = "C:\\Users\\ivan\\Downloads\\0005.HK.csv"
readBS :: B.ByteString -> Int
readBS = read . B.unpack
testhaskell.hs:52:5: Not in scope: `testCase'
testhaskell.hs:58:9:
Illegal type signature: `V.Vector (MapRow B.ByteString)'
Perhaps you intended to use -XScopedTypeVariables
In a pattern type-signature
I'd strongly recommend you not do this. There are a number of high-quality CSV libraries on Hackage, and rolling your own is a recipe of problems. At FP Complete, we use csv-conduit, though cassava is also a great library. I'd recommend you try out one of them.

Haskell: using a Vector of Vectors instead of a nested list when parsing a CSV file

I need to parse and process a text file that is a nested list of integer. The file is about 250mb large. This already leads to performace problems my naive solution takes 20GB or more of RAM.
The question is related to another question.
I have written about the memory problems and the suggestion was to use Data.Vector to get rtid of the memory problems.
So the goal is to process a nested list of integers and, say, filter the values so that only values larger than 30 get printed out.
Test file "myfile.tx":
11,22,33,44,55
66,77,88,99,10
Here is my code using Attoparsec, adapted from attoparsec-csv:
{-# Language OverloadedStrings #-}
-- adapted from https://github.com/robinbb/attoparsec-csv
module Text.ParseCSV
(
parseCSV
) where
import Prelude hiding (concat, takeWhile)
import Control.Applicative ((<$>), (<|>), (<*>), (<*), (*>), many)
import Control.Monad (void, liftM)
import Data.Attoparsec.Text
import qualified Data.Text as T (Text, concat, cons, append, pack, lines)
import qualified Data.Text.IO as IO (readFile, putStr)
import qualified Data.ByteString.Char8 as BSCH (readInteger)
lineEnd :: Parser ()
lineEnd =
void (char '\n') <|> void (string "\r\n") <|> void (char '\r')
<?> "end of line"
parserInt :: Parser Integer
parserInt = (signed decimal)
record :: Parser [Integer]
record =
parserInt `sepBy1` char ','
<?> "record"
file :: Parser [[Integer]]
file =
(:) <$> record
<*> manyTill (lineEnd *> record)
(endOfInput <|> lineEnd *> endOfInput)
<?> "file"
parseCSV :: T.Text -> Either String [[Integer]]
parseCSV =
parseOnly file
getValues :: Either String [[Integer]] -> [Integer]
getValues (Right [x]) = x
getValues _ = []
getLines :: FilePath -> IO [T.Text]
getLines = liftM T.lines . IO.readFile
parseAndFilter :: T.Text -> [Integer]
parseAndFilter = ((\x -> filter (>30) x) . getValues . parseCSV)
main = do
list <- getLines "myfile.txt"
putStr $ show $ map parseAndFilter list
But instead of using a list [Integer] I would like to use Data.Vector.
I found a relevant part in the Data.Vector tutorial:
--The simplest way to parse a file of Int or Integer types is with a strict or lazy --ByteString, and the readInt or readInteger functions:
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Vector as U
import System.Environment
main = do
[f] <- getArgs
s <- L.readFile f
print . U.sum . parse $ s
-- Fill a new vector from a file containing a list of numbers.
parse = U.unfoldr step
where
step !s = case L.readInt s of
Nothing -> Nothing
Just (!k, !t) -> Just (k, L.tail t)
However, this is regular, not a nested list of integers.
I tried to adapt my code but it did not work.
How can I change my code to
use a nested Vector (or Vector of Vectors) instead of [Integer] (i.e., while also running the Filter of >30 on the Vector).
There is an important question you don't mention in the posting.... Do you need everything in memory at once. If the processing is local, or if you can summarize all the data up to a point in the file with a few values, you can solve the performance problems by streaming the data through and throwing away all but the current line. This will usually run way faster and allow you to process orders of magnitude larger files. And it usually doesn't even matter (as much) what data structure you use to parse the values.
Here is an example:
import Text.Regex
process::[Int]->String
process = (++"\n") . show . sum --put whatever you want here.
main = interact (concat . map (process . map read . splitRegex (mkRegex ",")) . lines)
The whole program runs lazily, so it processes line by line as the data comes in and frees up the memory for old data (you can check this by typing in data by hand and watch the output come out). There is a performance hit by using the unpacked structures, but this isn't as big a problem as pulling everything into memory.
Many problems that don't seem to fit this criteria at first can be modified to do so (you may have to sort the data first, but there are many performance effective ways to do this).... I rewrote the full online stats system for a gaming company once following this principle, and was able to take a stats crunching time from hours to a couple of minutes (with even more metrics).
Because of its lazy nature, Haskell is a good language to stream data through.
I found a post that there is no easy way to parse with attoparsec to a vector.
See this forum post and thread.
But the good new is that the overhead of Data.Vector.fromList isn't so bad.
Attoparsec seems to be quite fast for parsing.
I keep the whole data in memory and this doesn't seem a speed overhead. It's more flexible, as perhaps later I need to have the whole data in memory, altough currently it is not needed per se for my problem.
Currently the code runs in ~30 seconds and about 1.5GB RAM for a 150MB text file. Now the memory consumption is quite little versus 20GB of before and I only need to focus on improving the speed.
Here are the changes from the code of my question my post, commented out code is using lists, functions with Vector in the type are new (this is not production code or meant to be good code yet):
{-
getValues :: Either String [[Integer]] -> [Integer]
getValues (Right [x]) = x
getValues _ = []
-}
getValues :: Either String [[Integer]] -> Vector Integer
getValues (Right [x]) = V.fromList x
getValues _ = V.fromList [999999,9999999,99999,999999] --- represents an ERROR
getLines :: FilePath -> IO [T.Text]
getLines = liftM T.lines . IO.readFile
{-
parseAndFilter :: T.Text -> [Integer]
parseAndFilter = ((\x -> filter (>30) x) . getValues . parseCSV)
-}
filterLarger :: Vector Integer -> Vector Integer
filterLarger = \x -> V.filter (>37) x
parseVector :: T.Text -> Vector Integer
parseVector = (getValues . parseCSV)
-- mystr = T.pack "3, 6, 7" --, 13, 14, 15, 17, 21, 22, 23, 24, 25, 28, 29, 30, 32, 33, 35, 36"
main = do
list <- getLines "mydata.txt"
--putStr $ show $ parseCSV $ mystr
putStr $ show $ V.map filterLarger $ V.map parseVector $ V.fromList list
--show $ parseOnly parserInt $ T.pack "123"
Thanks to jamshidh and all the comments that pointed me to the right direction.
Here is the final solution. Switching to ByteString and Int in the code, it now runs twice as fast and a bit less memory consumtion (time is now ~14 Seconds).
{-# Language OverloadedStrings #-}
-- adapted from https://github.com/robinbb/attoparsec-csv
module Main
(
parseCSV, main
) where
import Data.Vector as V (Vector, fromList, map, head, filter)
import Prelude hiding (concat, takeWhile)
import Control.Applicative ((<$>), (<|>), (<*>), (<*), (*>), many)
import Control.Monad (void, liftM)
import Data.Attoparsec.Char8
import qualified Data.ByteString.Char8 as B
lineEnd :: Parser ()
lineEnd =
void (char '\n') <|> void (string "\r\n") <|> void (char '\r')
<?> "end of line"
parserInt :: Parser Int
parserInt = skipSpace *> signed decimal
record :: Parser [Int]
record =
parserInt `sepBy1` char ','
<?> "record"
file :: Parser [[Int]]
file =
(:) <$> record
<*> manyTill (lineEnd *> record)
(endOfInput <|> lineEnd *> endOfInput)
<?> "file"
parseCSV :: B.ByteString -> Either String [[Int]]
parseCSV =
parseOnly file
getValues :: Either String [[Int]] -> Vector Int
getValues (Right [x]) = V.fromList x
getValues _ = error "ERROR in getValues function!"
filterLarger :: Vector Int -> Vector Int
filterLarger = \x -> V.filter (>36) x
parseVector :: B.ByteString -> Vector Int
parseVector = (getValues . parseCSV)
-- MAIN
main = do
fContent <- B.readFile "myfile.txt"
putStr $ show $ V.map filterLarger $ V.map parseVector $ V.fromList $ B.lines fContent

Data.Text.Text and quasiquoting

I have a parser which parses to an ast which contains Text values. I
am trying to use this parser with quasiquoting, but the implementation
of Data for Text is incomplete. I've attached a smallish test case, when I try to compile Text.hs I get:
Text.hs:17:9:
Exception when trying to run compile-time code:
Data.Text.Text.toConstr
Code: Language.Haskell.TH.Quote.quoteExp expr " test "
Is there a way to get this working?
I read through the discussion here: http://www.haskell.org/pipermail/haskell-cafe/2010-January/072379.html
It seems that no-one has found a proper solution to this issue? Also, I tried the Data instance given there and it didn't work, I have no idea how to fix it (or how to use it since the text package already has a Data instance for Text). I don't really understand a lot of the generics stuff and how it works.
The only solution I have so far is to give up using Text in the ast and go back to using String.
{-# LANGUAGE DeriveDataTypeable #-}
module Syntax where
import Data.Data
import Data.Text
data Expr = Iden Text
| Num Integer
| AntiIden Text
deriving (Eq,Show,Data,Typeable)
---------------------
module Parser where
import Control.Applicative
import Control.Monad.Identity
import qualified Data.Text as T
import Text.Parsec hiding (many, optional, (<|>), string, label)
import Text.Parsec.Language
import qualified Text.Parsec.Token as P
import Text.Parsec.Text ()
import Syntax
parseExpr :: T.Text -> Either ParseError Expr
parseExpr s =
runParser expr () "" s
expr :: ParsecT T.Text () Identity Expr
expr =
whiteSpace >> choice
[do
_ <- char '$'
AntiIden <$> identifier
,Num <$> natural
,Iden <$> identifier
]
identifier :: ParsecT T.Text () Identity T.Text
identifier = T.pack <$> P.identifier lexer
natural :: ParsecT T.Text () Identity Integer
natural = P.natural lexer
lexer :: P.GenTokenParser T.Text () Identity
lexer = P.makeTokenParser langDef
whiteSpace :: ParsecT T.Text () Identity ()
whiteSpace = P.whiteSpace lexer
langDef :: GenLanguageDef T.Text st Identity
langDef = P.LanguageDef
{ P.commentStart = "{-"
, P.commentEnd = "-}"
, P.commentLine = "--"
, P.nestedComments = True
, P.identStart = letter <|> char '_'
, P.identLetter = alphaNum <|> oneOf "_"
, P.opStart = P.opLetter langDef
, P.opLetter = oneOf "+-*/<>="
, P.reservedOpNames= []
, P.reservedNames = []
, P.caseSensitive = False
}
-------------------
module Quasi where
import Language.Haskell.TH.Quote
import Language.Haskell.TH
import Data.Generics
import qualified Data.Text as T
import Syntax
import Parser (parseExpr)
expr :: QuasiQuoter
expr = QuasiQuoter {quoteExp = prs
,quotePat = undefined
,quoteType = undefined
,quoteDec = undefined}
where
prs :: String -> Q Exp
prs s = p s
>>= dataToExpQ (const Nothing
`extQ` antiExpE
)
p s = either (fail . show) return (parseExpr $ T.pack s)
antiExpE :: Expr -> Maybe ExpQ
antiExpE v = fmap varE (antiExp v)
antiExp :: Expr -> Maybe Name
antiExp (AntiIden v) = Just $ mkName $ T.unpack v
antiExp _ = Nothing
----------------------------
-- test.hs:
{-# LANGUAGE QuasiQuotes #-}
import Syntax
import Quasi
test,test1,test2 :: Expr
-- works
test = [expr| 1234 |]
-- works
test1 = let stuff = Num 42
in [expr| $stuff |]
-- doesn't work
test2 = [expr| test |]
main :: IO ()
main = putStrLn $ show test2
Solution: add this function using extQ to the dataToExpQ call:
handleText :: T.Text -> Maybe ExpQ
handleText x =
-- convert the text to a string literal
-- and wrap it with T.pack
Just $ appE (varE 'T.pack) $ litE $ StringL $ T.unpack x
Add an extQ for handleText where handleText explicitly takes Text to an ExpQ, rather than going through generic machinery.
Here's one for Strings, for example, that renders them more efficiently than as explicit cons cells:
handleStr :: String -> Maybe (TH.ExpQ)
handleStr x = Just $ TH.litE $ TH.StringL x

Resources