Using Parsec 3.1, it is possible to parse several types of inputs:
[Char] with Text.Parsec.String
Data.ByteString with Text.Parsec.ByteString
Data.ByteString.Lazy with Text.Parsec.ByteString.Lazy
I don't see anything for the Data.Text module. I want to parse Unicode content without suffering from the String inefficiencies. So I've created the following module based on the Text.Parsec.ByteString module:
{-# LANGUAGE FlexibleInstances, MultiParamTypeClasses #-}
{-# OPTIONS_GHC -fno-warn-orphans #-}
module Text.Parsec.Text
( Parser, GenParser
) where
import Text.Parsec.Prim
import qualified Data.Text as T
instance (Monad m) => Stream T.Text m Char where
    uncons = return . T.uncons
type Parser = Parsec T.Text ()
type GenParser t st = Parsec T.Text st
Does it make sense to do so?
Is this compatible with the rest of the Parsec API?
Additional comments:
I had to add the {-# LANGUAGE NoMonomorphismRestriction #-} pragma in my parser modules to make it work: without it, top-level parsers defined without type signatures are made monomorphic and no longer match the polymorphic Stream instance.
Parsing Text is one thing; building an AST with Text is another. I will also need to pack my Strings before returning them:
module TestText where
import Data.Text as T
import Text.Parsec
import Text.Parsec.Prim
import Text.Parsec.Text
input = T.pack "xxxxxxxxxxxxxxyyyyxxxxxxxxxp"
parser = do
  x1 <- many1 (char 'x')
  y <- many1 (char 'y')
  x2 <- many1 (char 'x')
  return (T.pack x1, T.pack y, T.pack x2)
test = runParser parser () "test" input
Since Parsec 3.1.2, support for Data.Text is built-in!
See http://hackage.haskell.org/package/parsec-3.1.2
If you are stuck with an older version, the code snippets in the other answers are helpful, too.
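With the built-in support you can import Text.Parsec.Text directly instead of defining the instance yourself. A minimal sketch (the digits parser and input are my own illustration):

import Text.Parsec
import Text.Parsec.Text (Parser)
import qualified Data.Text as T

digits :: Parser String
digits = many1 digit

main :: IO ()
main = print (parse digits "example" (T.pack "123abc"))
-- prints: Right "123"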
That looks like exactly what you need to do.
It should be compatible with the rest of Parsec, including the Text.Parsec.Char parsers.
If you're using Cabal to build your program, please put an upper version bound on parsec in your package description, in case the maintainer decides to include that instance in a future version of Parsec (as indeed happened in 3.1.2).
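For example, a sketch of the corresponding build-depends entry (the rest of the .cabal file is your own); since the instance became built-in in 3.1.2, that version is a natural upper bound:

build-depends:
    base   >= 4   && < 5
  , parsec >= 3.1 && < 3.1.2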
I added a function parseFromUtf8File to help read UTF-8 encoded files efficiently. It works flawlessly with umlaut characters. The function's type matches parseFromFile from Text.Parsec.ByteString. This version uses strict ByteStrings.
-- A derivate work from
-- http://stackoverflow.com/questions/4064532/using-parsec-with-data-text
{-# LANGUAGE FlexibleInstances, MultiParamTypeClasses #-}
{-# OPTIONS_GHC -fno-warn-orphans #-}
module Text.Parsec.Text
( Parser, GenParser, parseFromUtf8File
) where
import Text.Parsec.Prim
import qualified Data.Text as T
import qualified Data.ByteString as B
import Data.Text.Encoding
import Text.Parsec.Error
instance (Monad m) => Stream T.Text m Char where
    uncons = return . T.uncons
type Parser = Parsec T.Text ()
type GenParser t st = Parsec T.Text st
-- | @parseFromUtf8File p filePath@ runs parser @p@ on the input read
-- from @filePath@ using 'Data.ByteString.readFile' and decoded from
-- UTF-8. Returns either a 'ParseError' ('Left') or a
-- value of type @a@ ('Right').
--
-- > main = do{ result <- parseFromUtf8File numbers "digits.txt"
-- >          ; case result of
-- >              Left err -> print err
-- >              Right xs -> print (sum xs)
-- >          }
parseFromUtf8File :: Parser a -> String -> IO (Either ParseError a)
parseFromUtf8File p fname = do
  raw <- B.readFile fname
  let input = decodeUtf8 raw
  return (runP p () fname input)
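A minimal usage sketch, assuming the module above is imported (the letters parser and file name are my own illustration):

import Text.Parsec (letter, many1)

firstWord :: IO ()
firstWord = do
  result <- parseFromUtf8File (many1 letter) "input.txt"
  case result of
    Left err -> print err
    Right w  -> putStrLn w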
This function:
eitherDecode :: FromJSON a => ByteString -> Either String a
has the limitation that I can't swap in a decoding implementation other than the one from FromJSON a.
In other words, I'm looking for some way to pass my own ByteString -> Either String a parsing function.
Okay... so it seems I'll have to define my own function for this.
It's defined as:
-- | Like 'decode' but returns an error message when decoding fails.
eitherDecode :: (FromJSON a) => L.ByteString -> Either String a
eitherDecode = eitherFormatError . eitherDecodeWith jsonEOF ifromJSON
Looks like ifromJSON is what I need to modify, which is defined as:
-- | Convert a value from JSON, failing if the types do not match.
ifromJSON :: (FromJSON a) => Value -> IResult a
ifromJSON = iparse parseJSON
Well, eitherFormatError is not exported from Aeson, so it seems like I might be going down the wrong approach.
After a bit of type juggling...
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ScopedTypeVariables #-}
module AesonExtra where
import Data.Aeson
import Data.Aeson.Types
import qualified Data.ByteString.Lazy as L
import qualified Data.HashMap.Strict as HashMap
import Data.Foldable (toList)
import Data.String.Conversions
import Data.Text (Text)
eitherDecodeCustom :: (Value -> Parser a) -> L.ByteString -> Either String a
eitherDecodeCustom f x = do
  xx <- eitherDecode x :: Either String Value
  parseEither f xx
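A usage sketch (the Point type and its fields are made up for illustration):

{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson
import Data.Aeson.Types (Parser)
import AesonExtra (eitherDecodeCustom)

data Point = Point Int Int deriving Show

-- A hand-written parser that is not tied to a FromJSON instance
pointParser :: Value -> Parser Point
pointParser = withObject "Point" $ \o ->
  Point <$> o .: "x" <*> o .: "y"

main :: IO ()
main = print (eitherDecodeCustom pointParser "{\"x\":1,\"y\":2}")
-- prints: Right (Point 1 2)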
Is there any way I can test that a function p :: IsString s => s -> Bool evaluates its input lazily? That is, it only consumes a part of its input when determining its result. And is it possible in such a way that it's compatible with both String and Data.Text.Lazy?
I've looked at the Q&A Unit-testing the undefined evaluated in lazy expression in Haskell, which doesn't cover IsString specifically, and I've found the StrictCheck package on Hackage, but I'm not really sure how it works. Does it apply here?
Problem
I've got a predicate,
p :: IsString s => s -> Bool
and an Hspec test,
{-# LANGUAGE OverloadedStrings #-}
...
import Data.String (fromString)
spec_p :: Spec
spec_p =
  describe "p" $
    it "is lazy" $ p (fromString x) `shouldBe` y
  where
    x = "foo" ++ [undefined]
    y = True
that fails if p ("foo" ++ [undefined]) tries to consume any more than "foo".
This works fine for my String implementation,
import qualified Data.List as L
p :: String -> Bool
p = ("foo" `L.isPrefixOf`)
But it does not work so fine on my Data.Text.Lazy implementation,
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text.Lazy as T
import Data.Text.Lazy (Text)
p :: Text -> Bool
p = ("foo" `T.isPrefixOf`)
because fromString does not convert the lazy String into a lazy Text in a way that preserves the undefined unevaluated. I can test that the lazy version does work by writing a specialized test,
pTest :: Bool
pTest = p (T.fromChunks [ "foo", undefined ]) -- True
but I can't control how fromString chunks.
Attempted solution:
I tried to write my own wrapper to control the chunking of fromString,
import Data.String (IsString (..))
import qualified Data.Text as T
import qualified Data.Text.Lazy as LT

newtype LazyChunkyText = LazyChunkyText LT.Text deriving (Show)

instance IsString LazyChunkyText where
  fromString = LazyChunkyText . LT.fromChunks . map (T.pack . return)
But because fromChunks takes a [T.Text], I need to T.pack each character, meaning my [undefined] gets evaluated.
I'm getting to grips with writing an API in Haskell using Scotty. My files are provided below. My questions are:
In the routes definition, I'm extracting from liftIO whatsTheTime in a do block. This works, but it seems verbose. Is there a nicer syntax?
In the whatsTheTime definition, I'm needing to do fromString. I'd have thought OverloadedStrings would take care of that, but that's not the case. I'd really appreciate it if somebody pointed out why it doesn't work without fromString.
In a stack project, if I need a directive like OverloadedStrings, do I need to include it every file that needs it, or just at the top of the main entrypoint?
Api.hs:
{-# LANGUAGE OverloadedStrings #-}
module Api
( whatsTheTime
) where
import Data.Time (getCurrentTime)
import Web.Scotty
import Data.String
whatsTheTime :: IO (ActionM ())
whatsTheTime = do
  time <- getCurrentTime
  return $ text $ fromString ("The time is now " ++ show time)
Main.hs:
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DeriveGeneric #-}
module Main where
import Api
import Web.Scotty
import Control.Monad.IO.Class
routes = do
get "/" $ do
res <- liftIO whatsTheTime
res
main :: IO ()
main = do
putStrLn "Starting server..."
scotty 3000 routes
(1) This:
do
  res <- liftIO whatsTheTime
  res
Desugars to this:
liftIO whatsTheTime >>= \ res -> res
If you look at the type of \ m -> m >>= id:
(Monad m) => m (m a) -> m a
That’s exactly the type of join (Hoogle), so you can use:
get "/" $ join $ liftIO whatsTheTime
join is a common idiom for “execute this action which returns an action, and also execute the returned action”.
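A quick GHCi illustration of join in other monads:

ghci> import Control.Monad (join)
ghci> join (Just (Just 3))
Just 3
ghci> join [[1,2],[3]]
[1,2,3]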
(2) OverloadedStrings is for overloading of string literals. You have an overloaded literal "The time is now ", but you constrain it to be of type String by using it as an operand of (++) with a String (the result of show time). You can pack the result of show time as Text instead; note that Scotty's text action takes a lazy Text:

import Data.Monoid ((<>))
import qualified Data.Text.Lazy as Text

-- ...

return $ text $ "The time is now " <> Text.pack (show time)
(3) LANGUAGE pragmas operate per file; as @mgsloan notes, you can add OverloadedStrings to the default-extensions: field of your library or executable in your .cabal file.
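For example, a sketch of such a stanza (the rest of the .cabal file is your own):

library
  default-extensions: OverloadedStrings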
How do I compile the following program? Somehow I cannot escape the error "No instance for (PersistBackend IO)".
My aim is to see how to efficiently fill a db-table using io-streams. The type of makeOutputStream is (Maybe a -> IO ()) -> IO (OutputStream a), while insertWord returns m (), which cannot be specialized to IO ().
(Late addition: a workaround was found, but it is not an answer to the question. See below.)
The error msg is:
Words_read2.hs:30:36:
    No instance for (PersistBackend IO)
      arising from a use of `insertWord'
    Possible fix: add an instance declaration for (PersistBackend IO)
    In the first argument of `Streams.makeOutputStream', namely
      `insertWord'
    In a stmt of a 'do' block:
      os <- Streams.makeOutputStream insertWord
    In the expression:
      do { is <- Streams.handleToInputStream h >>= Streams.words;
           os <- Streams.makeOutputStream insertWord;
           Streams.connect is os }
And the code producing this error is:
{-# LANGUAGE GADTs, TypeFamilies, TemplateHaskell, QuasiQuotes, FlexibleInstances, FlexibleContexts, StandaloneDeriving #-}
import qualified Data.ByteString as B
import Data.Maybe
import Control.Monad.IO.Class (MonadIO, liftIO)
import Database.Groundhog.Core
import Database.Groundhog.TH
import Database.Groundhog.Sqlite
import System.IO
import System.IO.Streams.File
import qualified System.IO.Streams as Streams
data Words = Words {word :: String} deriving (Eq, Show)
mkPersist defaultCodegenConfig [groundhog|
definitions:
- entity: Words
|]
insertWord :: (MonadIO m, PersistBackend m) => Maybe B.ByteString -> m ()
insertWord wo = case wo of
  Just ww -> insert_ $ Words ((show . B.unpack) ww)
  Nothing -> return ()
main = do
  withSqliteConn "words2.sqlite" $ runDbConn $ do
    runMigration defaultMigrationLogger $ migrate (undefined :: Words)
    liftIO $ withFile "web2" ReadMode $ \h -> do  -- a link to /usr/share/dict/web2 - a list of words, one per line
      is <- Streams.handleToInputStream h >>= Streams.words
      os <- Streams.makeOutputStream insertWord
      Streams.connect is os
As a workaround, we can do things the other way around: instead of trying to work inside runDbConn, we return a handle to a (pool of) connections and pass it around. The idea comes from the SO answer to the question Making Custom Instances of PersistBackend.
{-# LANGUAGE GADTs, TypeFamilies, TemplateHaskell, QuasiQuotes, FlexibleInstances, FlexibleContexts, StandaloneDeriving #-}
import qualified Data.ByteString as B
import Data.Maybe
import qualified Data.Text as T
import qualified Data.Text.Encoding as T
import Control.Monad.IO.Class -- (MonadIO, liftIO)
import Control.Monad.Trans.Control
import Database.Groundhog.Core
import Database.Groundhog.TH
import Database.Groundhog.Sqlite
import System.IO
import System.IO.Streams.File
import qualified System.IO.Streams as Streams
data Words = Words {word :: T.Text} deriving (Eq, Show)
mkPersist defaultCodegenConfig [groundhog|
definitions:
- entity: Words
|]
main = do
  gh <- withSqlitePool "words5.sqlite" 5 $ \pconn -> return pconn
  runDbConn (runMigration defaultMigrationLogger $ migrate (undefined :: Words)) gh
  withFile "web3" ReadMode $ \h -> do  -- 500 words from /usr/share/dict/web2 - a list of words, one per line
    is <- Streams.handleToInputStream h >>= Streams.words
    os <- Streams.makeOutputStream (iw2db gh)
    Streams.connect is os

iw2db :: (MonadIO m, MonadBaseControl IO m, ConnectionManager cm Sqlite)
      => cm -> Maybe B.ByteString -> m ()
iw2db gh (Just x) = runDbConn (insert_ $ Words (T.decodeUtf8 x)) gh
iw2db gh Nothing  = return ()
Groundhog actions can run only in a monad which is an instance of PersistBackend. IO cannot be made such an instance because, unlike DbPersist, it does not carry connection information.
I like the code in the workaround, but it can be made much faster. Currently each action runs within its own transaction opened by runDbConn. To avoid this, we can take a connection from the pool and begin a single transaction; each action then reuses this connection, avoiding the per-insert transaction overhead. Also, createSqlitePool is nicer than withSqlitePool in this case.
{-# LANGUAGE GADTs, TypeFamilies, TemplateHaskell, QuasiQuotes, FlexibleInstances, FlexibleContexts, StandaloneDeriving #-}
import qualified Data.ByteString as B
import Data.Maybe
import qualified Data.Text as T
import qualified Data.Text.Encoding as T
import Control.Monad.IO.Class -- (MonadIO, liftIO)
import Control.Monad.Trans.Control
import Database.Groundhog.Core
import Database.Groundhog.TH
import Database.Groundhog.Sqlite
import System.IO
import System.IO.Streams.File
import qualified System.IO.Streams as Streams
import Control.Monad.Logger (MonadLogger, NoLoggingT(..))
data Words = Words {word :: T.Text} deriving (Eq, Show)
mkPersist defaultCodegenConfig [groundhog|
definitions:
- entity: Words
|]
main = do
  gh <- createSqlitePool "words5.sqlite" 5
  runDbConn (runMigration defaultMigrationLogger $ migrate (undefined :: Words)) gh
  withFile "/usr/share/dict/words" ReadMode $ \h -> do  -- a list of words, one per line
    is <- Streams.handleToInputStream h >>= Streams.words
    withConn (\conn -> liftIO $ do  -- (conn :: Sqlite) with an opened transaction
      os <- Streams.makeOutputStream (iw2db conn)
      -- It is important to put Streams.connect inside withConn so that it uses the same transaction.
      -- If we put it outside, the transaction will already be closed and Sqlite will automatically
      -- open a new transaction for each insert.
      Streams.connect is os) gh
iw2db :: (MonadIO m, MonadBaseControl IO m, ConnectionManager cm Sqlite)
      => cm -> Maybe B.ByteString -> m ()
iw2db gh (Just x) = runDbConnNoTransaction (insert_ $ Words (T.decodeUtf8 x)) gh
iw2db gh Nothing = return ()
-- Probably this function should go to the Generic module
runDbConnNoTransaction :: (MonadBaseControl IO m, MonadIO m, ConnectionManager cm conn) => DbPersist conn (NoLoggingT m) a -> cm -> m a
runDbConnNoTransaction f cm = runNoLoggingT (withConnNoTransaction (runDbPersist f) cm)
I'm looking to have my Haskell program read settings from an external file, to avoid recompiling for minor changes. Being familiar with YAML, I thought it would be a good choice. Now I have to put the two pieces together. Google hasn't been very helpful so far.
A little example code dealing with reading and deconstructing YAML from a file would be very much appreciated.
If I'm interested in what packages are available, I go to Hackage, look at the complete package list, and then just search-in-page for the keyword. Doing that brings up these choices (along with a few other less compelling ones):
yaml: http://hackage.haskell.org/package/yaml
HsSyck: http://hackage.haskell.org/package/HsSyck
and a wrapper around HsSyck called yaml-light: http://hackage.haskell.org/package/yaml-light
Both yaml and HsSyck look updated relatively recently, and appear to be used by other packages in widespread use. You can see this by checking the reverse deps:
http://packdeps.haskellers.com/reverse/HsSyck
http://packdeps.haskellers.com/reverse/yaml
Of the two, yaml has more deps, but that is because it is part of the yesod ecosystem. One library that depends on HsSyck is yst, which I happen to know is actively maintained, so that indicates to me that HsSyck is fine too.
The next step in making my choice would be to browse through the documentation of both libraries and see which had the more appealing api for my purposes.
Of the two, it looks like HsSyck exposes more structure but not much else, while yaml goes via the JSON encodings provided by aeson. This suggests the former is probably more powerful, while the latter is more convenient.
A simple example:
First you need a test.yml file:
db: /db.sql
limit: 100
Reading YAML in Haskell
{-# LANGUAGE DeriveGeneric #-}
import GHC.Generics
import Data.Yaml
data Config = Config { db :: String
                     , limit :: Int
                     } deriving (Show, Generic)

instance FromJSON Config

main :: IO ()
main = do
  file <- decodeFile "test.yml" :: IO (Maybe Config)
  putStrLn (maybe "Error" show file)
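In newer versions of the yaml package, decodeFile is deprecated in favour of decodeFileEither, which reports the parse error instead of a bare Nothing. A sketch of the same main using it:

main :: IO ()
main = do
  result <- decodeFileEither "test.yml" :: IO (Either ParseException Config)
  case result of
    Left err  -> putStrLn (prettyPrintParseException err)
    Right cfg -> print cfg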
With yamlparse-applicative you can describe your YAML parser in a static analysis-friendly way, so you can get a description of the YAML format from a parser for free. I'm going to use Matthias Braun's example format for this one:
{-# LANGUAGE ApplicativeDo, RecordWildCards, OverloadedStrings #-}
import Data.Yaml
import Data.Aeson.Types (parse)
import YamlParse.Applicative
import Data.Map (Map)
import qualified Data.Text.IO as T
data MyType = MyType
  { stringsToStrings :: Map String String
  , mapOfLists :: Map String [String]
  } deriving Show

parseMyType :: YamlParser MyType
parseMyType = unnamedObjectParser $ do
  stringsToStrings <- requiredField' "strings_to_strings"
  mapOfLists <- requiredField' "map_of_lists"
  pure MyType{..}

main :: IO ()
main = do
  T.putStrLn $ prettyParserDoc parseMyType
  yaml <- decodeFileThrow "config/example.yaml"
  print $ parse (implementParser parseMyType) yaml
Note that main is able to print out a schema before even seeing an instance:
strings_to_strings: # required
  <key>: <string>
map_of_lists: # required
  <key>: - <string>
Success
  (MyType
     { stringsToStrings = fromList
         [ ("key_one","val_one")
         , ("key_two","val_two")
         ]
     , mapOfLists = fromList
         [ ("key_one",["val_one","val_two","val_three"])
         , ("key_two",["val_four","val_five"])
         ]
     })
Here's how to parse specific objects from your YAML file using the yaml library.
Let's parse parts of this file, config/example.yaml:
# A map from strings to strings
strings_to_strings:
  key_one: val_one
  key_two: val_two

# A map from strings to list of strings
map_of_lists:
  key_one:
    - val_one
    - val_two
    - val_three
  key_two:
    - val_four
    - val_five

# We won't parse this
not_for: us
This module parses strings_to_strings and map_of_lists individually and puts them into a custom record, MyType:
{-# LANGUAGE OverloadedStrings, LambdaCase #-}
module YamlTests where

import Data.Yaml ( decodeFileEither
                 , (.:)
                 , parseEither
                 , prettyPrintParseException
                 )
import Data.Map ( Map )
import Control.Applicative ( (<$>) )
import System.FilePath ( FilePath
                       , (</>)
                       )

data MyType = MyType { stringsToStrings :: Map String String
                     , mapOfLists :: Map String [String]
                     } deriving Show

type ErrorMsg = String

readMyType :: FilePath -> IO (Either ErrorMsg MyType)
readMyType file =
  (\case
      (Right yamlObj) -> do
        stringsToStrings <- parseEither (.: "strings_to_strings") yamlObj
        mapOfLists <- parseEither (.: "map_of_lists") yamlObj
        return $ MyType stringsToStrings mapOfLists
      (Left exception) -> Left $ prettyPrintParseException exception
  )
    <$> decodeFileEither file

yamlTest = do
  parsedValue <- readMyType $ "config" </> "example.yaml"
  print parsedValue
Run yamlTest to see the parsing result.
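The output should look roughly like this (values taken from config/example.yaml above):

Right (MyType {stringsToStrings = fromList [("key_one","val_one"),("key_two","val_two")], mapOfLists = fromList [("key_one",["val_one","val_two","val_three"]),("key_two",["val_four","val_five"])]})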