How to use ByteStrings with QuickCheck in DocTest? - haskell

How do I define the Arbitrary instance (as stated here) when using doctest and QuickCheck?
Doctest and Cabal are set up as described here with a separate directory for tests.
The doctest line looks like this:
-- prop> (\s -> (decode . encode) s == s) :: ByteString -> Bool
decode :: ByteString -> ByteString
encode :: ByteString -> ByteString
Where and how do I define the Arbitrary instance, so that doctest can find it?
Note that I would want to define it in the test project.

Try
-- $setup
-- >>> import Control.Applicative
-- >>> import qualified Data.ByteString as ByteString
-- >>> import Test.QuickCheck
-- >>> instance Arbitrary ByteString where arbitrary = ByteString.pack <$> arbitrary
-- >>> instance CoArbitrary ByteString where coarbitrary = coarbitrary . ByteString.unpack
-- |
-- prop> \ s -> (decode . encode) s == s
decode :: ByteString -> ByteString
encode :: ByteString -> ByteString
Named chunks can be used for such definitions. However, each complete definition must be on one line, and doctest will report each use of >>> as a success or failure - so in this case, 6 attempts will be reported, even though only 1 of them is actually a test.

Related

How can you print the type of a haskell expression in main? [duplicate]

I'm looking for a function that does what the GHCi :type command does.
Ideally, it would have a signature something like
getStaticType :: a -> String
a = getStaticType (1+2)
-- a = "(Num t) => t"
b = getStaticType zipWith
-- b = "(a -> b -> c) -> [a] -> [b] -> [c]"
(Note: this has nothing to do with Data.Dynamic. I just want the static type inferred from the compiler. In fact the function wouldn't need a runtime implementation at all, as all calls to it could be inlined as constants at compile time. I'm assuming it exists somewhere, since GHCi can do it)
You can do it like this:
import Data.Typeable
getStaticType :: Typeable a => a -> String
getStaticType = show . typeOf
Note that the type must be an instance of Typeable. You can derive Typeable automatically using the DeriveDataTypeable Haskell language extension and ... deriving (Typeable, ...).
Also note that polymorphic types cannot be identified in this way; you must always call a function with a specific type, so you can never get that polymorphic type information that you get in GHCi with compiled Haskell code.
The way GHCi does it is that it uses the GHC API to analyse an intermediary Haskell abstract syntax tree (AST) that contains type information. GHCi does not have the same restricted environment that your typical compiled Haskell program does; it can do lots of stuff to find out more information about its environment.
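As a minimal sketch of the Typeable approach above (the example values are illustrative and not taken from the question), note that every argument has to be pinned to a monomorphic type before typeOf can describe it:

```haskell
import Data.Typeable (Typeable, typeOf)

-- Render the runtime type representation of a value as a String.
getStaticType :: Typeable a => a -> String
getStaticType = show . typeOf

main :: IO ()
main = do
  -- A numeric literal must be given a concrete type; Int is chosen here.
  putStrLn (getStaticType (1 + 2 :: Int))   -- prints Int
  putStrLn (getStaticType "hello")          -- prints [Char]
  -- zipWith is polymorphic, so it is instantiated by hand before the call.
  putStrLn (getStaticType (zipWith :: (Int -> Bool -> Char) -> [Int] -> [Bool] -> [Char]))
```

The last call reports the instantiated type only; the general signature with type variables is not recoverable this way.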
With TemplateHaskell, you can do it like this; first, create this module:
module TypeOf where
import Control.Monad
import Language.Haskell.TH
import Language.Haskell.TH.Syntax
getStaticType :: Name -> Q Exp
getStaticType = lift <=< fmap pprint . reify
Then, in a different module (very important), you can do the following:
{-# LANGUAGE TemplateHaskell #-}
import TypeOf
main = putStrLn $(getStaticType 'zipWith)
This program outputs:
GHC.List.zipWith :: forall a_0 b_1 c_2 . (a_0 -> b_1 -> c_2) ->
[a_0] -> [b_1] -> [c_2]
You can use a better pretty-printer than the pprint function; take a look at the Language.Haskell.TH.Ppr module.
Try the GHC API, as described at http://www.haskell.org/haskellwiki/GHC/As_a_library
-- Module names follow the GHC 7.x API used on that wiki page.
import GHC
import GHC.Paths (libdir)
import DynFlags
typed targetFile targetModule =
  defaultErrorHandler defaultFatalMessager defaultFlushOut $
    runGhc (Just libdir) $ do
      dflags <- getSessionDynFlags
      let dflags' = xopt_set dflags Opt_ImplicitPrelude
      setSessionDynFlags dflags'
      target <- guessTarget targetFile Nothing
      setTargets [target]
      load LoadAllTargets
      m <- getModSummary $ mkModuleName targetModule
      p <- parseModule m
      t <- typecheckModule p
      -- return the type-annotated bindings of the module
      return $ typecheckedSource t

How to parse a large XML file in Haskell with limited amount of resources?

I want to extract information from a large XML file (around 20G) in Haskell. Since it is a large file, I used the SAX parsing functions from hexpat.
Here is the simple code I tested:
import qualified Data.ByteString.Lazy as L
import Data.List (foldl')
import Data.Text (Text)
import Text.XML.Expat.SAX as Sax
parse :: FilePath -> IO ()
parse path = do
  inputText <- L.readFile path
  let saxEvents = Sax.parse defaultParseOptions inputText :: [SAXEvent Text Text]
  let txt = foldl' processEvent "" saxEvents
  putStrLn txt
After activating profiling in Cabal, it says that parse.saxEvents took 85% of allocated memory. I also used foldr and the result is the same.
If processEvent becomes complex enough, the program crashes with a stack space overflow error.
What am I doing wrong?
You don't say what processEvent is like. In principle, it ought to be unproblematic to use lazy ByteString for a strict left fold over lazily generated input, so I'm not sure what is going wrong in your case. But one ought to use streaming-appropriate types when dealing with gigantic files!
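To illustrate the point about strict left folds, here is a minimal sketch that needs only base. The Event type is a hypothetical stand-in for hexpat's SAXEvent, and the counting step is an assumed example of a processEvent-like function. Note that foldl' forces the accumulator only to weak head normal form, so an accumulator that is a String or a record can still build up unevaluated thunks inside it; a small strict value such as an Int avoids that.

```haskell
import Data.List (foldl')

-- Hypothetical stand-in for hexpat's SAXEvent, so the sketch needs only base.
data Event = Open String | Chars String | Close String

-- A lazily generated event stream: elements are produced on demand.
events :: Int -> [Event]
events n = concatMap item [1 .. n]
  where
    item i = [Open "item", Chars (show i), Close "item"]

-- Strict left fold with a small accumulator (an Int), forced at every step.
countOpens :: [Event] -> Int
countOpens = foldl' step 0
  where
    step acc (Open _) = acc + 1
    step acc _        = acc

main :: IO ()
main = print (countOpens (events 1000000))  -- prints 1000000
```

Because each event is consumed and discarded as soon as the fold has looked at it, the list never has to exist in memory all at once.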
In fact, hexpat does have a 'streaming' interface (just like xml-conduit). It uses the not-too-well-known List library and the rather ugly List class it defines. In principle, the ListT type from the List package should work well. I gave up quickly because of a lack of combinators, and wrote an appropriate instance of the ugly List class for a wrapped version of Pipes.ListT, which I then used to export ordinary Pipes.Producer functions like parseProducer. The trivial manipulations needed for this are appended below as PipesSax.hs.
Once we have parseProducer, we can convert a ByteString or Text Producer into a Producer of SAXEvents with Text or ByteString components. Here are some simple operations. I was using a 238M "input.xml"; the programs never need more than 6 MB of memory, to judge from looking at top.
-- Sax.hs: most of the IO actions use the registryIds pipe defined at the bottom, which is tailored to a giant bit of XML of which this is a valid 1000 fragment: http://sprunge.us/WaQK
{-# LANGUAGE OverloadedStrings #-}
import PipesSax ( parseProducer )
import Data.ByteString ( ByteString )
import Text.XML.Expat.SAX
import Pipes -- cabal install pipes pipes-bytestring
import Pipes.ByteString (toHandle, fromHandle, stdin, stdout )
import qualified Pipes.Prelude as P
import qualified System.IO as IO
import qualified Data.ByteString.Char8 as Char8
sax :: MonadIO m => Producer ByteString m ()
    -> Producer (SAXEvent ByteString ByteString) m ()
sax = parseProducer defaultParseOptions
-- stream xml from stdin, yielding hexpat tagstream to stdout;
main0 :: IO ()
main0 = runEffect $ sax stdin >-> P.print
-- stream the extracted 'IDs' from stdin to stdout
main1 :: IO ()
main1 = runEffect $ sax stdin >-> registryIds >-> stdout
-- write all IDs to a file
main2 =
  IO.withFile "input.xml" IO.ReadMode $ \inp ->
  IO.withFile "output.txt" IO.WriteMode $ \out ->
    runEffect $ sax (fromHandle inp) >-> registryIds >-> toHandle out
-- folds:
-- print number of IDs
main3 = IO.withFile "input.xml" IO.ReadMode $ \inp ->
  do n <- P.length $ sax (fromHandle inp) >-> registryIds
     print n
-- sum the meaningful part of the IDs - a dumb fold for illustration
main4 = IO.withFile "input.xml" IO.ReadMode $ \inp ->
  do let pipeline = sax (fromHandle inp) >-> registryIds >-> P.map readIntId
     n <- P.fold (+) 0 id pipeline
     print n
  where
    readIntId :: ByteString -> Integer
    readIntId = maybe 0 (fromIntegral.fst) . Char8.readInt . Char8.drop 2
-- my xml has tags with attributes that appear via hexpat thus:
-- StartElement "FacilitySite" [("registryId","110007915364")]
-- and the like. This is just an arbitrary demo stream manipulation.
registryIds :: Monad m => Pipe (SAXEvent ByteString ByteString) ByteString m ()
registryIds = do
  e <- await -- we look for a 'SAXEvent'
  case e of -- if it matches, we yield, else we go to the next event
    StartElement "FacilitySite" [("registryId",a)] -> do
      yield a
      yield "\n"
      registryIds
    _ -> registryIds
-- 'library': PipesSax.hs
This just newtypes Pipes.ListT to get the appropriate instances. We don't export anything to do with List or ListT but just use the standard Pipes.Producer concept.
{-#LANGUAGE TypeFamilies, GeneralizedNewtypeDeriving #-}
module PipesSax (parseProducerLocations, parseProducer) where
import Data.ByteString (ByteString)
import Text.XML.Expat.SAX
import Data.List.Class
import Control.Monad
import Control.Applicative
import Pipes
import qualified Pipes.Internal as I
parseProducer
  :: (Monad m, GenericXMLString tag, GenericXMLString text)
  => ParseOptions tag text
  -> Producer ByteString m ()
  -> Producer (SAXEvent tag text) m ()
parseProducer opt = enumerate . enumerate_
                  . parseG opt
                  . Select_ . Select
parseProducerLocations
  :: (Monad m, GenericXMLString tag, GenericXMLString text)
  => ParseOptions tag text
  -> Producer ByteString m ()
  -> Producer (SAXEvent tag text, XMLParseLocation) m ()
parseProducerLocations opt =
  enumerate . enumerate_ . parseLocationsG opt . Select_ . Select
newtype ListT_ m a = Select_ { enumerate_ :: ListT m a }
  deriving (Functor, Monad, MonadPlus, MonadIO
           , Applicative, Alternative, Monoid, MonadTrans)
instance Monad m => List (ListT_ m) where
  type ItemM (ListT_ m) = m
  joinL = Select_ . Select . I.M . liftM (enumerate . enumerate_)
  runList = liftM emend . next . enumerate . enumerate_
    where
      emend (Right (a,q)) = Cons a (Select_ (Select q))
      emend _ = Nil

Finding type signature of a function in Haskell

I have created two functions that basically parse an input, and I need to find their type signatures so that ghc -Wall in the terminal won't give me a warning. This is the code:
import Text.Parsec.Prim
import Text.Parsec.Char
import Text.Parsec.Error
import Text.Parsec.String
import Text.Parsec.Combinator
cToken c = try (many space >> char c >> many space)
sToken s = try (many space >> string s >> many space)
If I write in the terminal:
:t cToken
:t sToken
It gives back:
Prelude CurvySyntax> :t sToken
sToken
  :: Text.Parsec.Prim.Stream s m Char =>
     String -> Text.Parsec.Prim.ParsecT s u m [Char]
Prelude CurvySyntax> :t cToken
cToken
  :: Text.Parsec.Prim.Stream s m Char =>
     Char -> Text.Parsec.Prim.ParsecT s u m [Char]
If I put these types in my code then it can't compile.
What are their types?
Thanks.
GHCi cheats a little bit with imports: it allows you to refer to public modules by their full name at any point in time. For instance
$ ghci
$ [ ... ]
Prelude> :t Data.List.sortBy (Data.Ord.comparing snd)
Data.List.sortBy (Data.Ord.comparing snd)
  :: Ord a => [(a1, a)] -> [(a1, a)]
As you can see here, I was able to refer to sortBy and comparing by their fully qualified module+symbol names. If you try to do the same in a concrete Haskell source file, it will fail unless you also import those modules (qualified). So, GHCi takes some liberties.
Likewise, when you ask for the type of some function it may need to refer to certain types or typeclasses which have not yet been imported. GHCi takes some liberties and just displays those types/classes using fully qualified module+symbol names. In your example these include
Text.Parsec.Prim.Stream
Text.Parsec.Prim.ParsecT
If you were to just copy these to your source file it will complain because you've not yet imported the module Text.Parsec.Prim.
So what's the resolution? Just import it!
import Text.Parsec.Prim
If you add that to your source file, :reload GHCi, and then check the types once more, the new results will reflect the fact that you have access to these types:
Prelude CurvySyntax> :t sToken
sToken :: Stream s m Char => String -> ParsecT s u m [Char]
and this new type can be directly pasted into your source file.
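As a sketch of what the annotated source could look like, assuming the parsers only ever run over String input: the Parser synonym from Text.Parsec.String (already imported in the question) gives a shorter signature than the fully general one. If the general Stream s m Char => ... form is kept instead, GHC in Haskell 2010 mode may also ask for the FlexibleContexts extension, because the constraint mentions the concrete type Char. The sample inputs in main are made up for illustration.

```haskell
import Text.Parsec (char, many, parse, space, string, try)
import Text.Parsec.String (Parser)

-- Parser a is a synonym for Parsec String () a, so no class constraint is needed.
cToken :: Char -> Parser String
cToken c = try (many space >> char c >> many space)

sToken :: String -> Parser String
sToken s = try (many space >> string s >> many space)

main :: IO ()
main = do
  -- Each parser returns the whitespace consumed after the token.
  print (parse (cToken '+') "" " + 1")
  print (parse (sToken "let") "" "let  x")
```

The trade-off is generality: these signatures no longer work for Text or ByteString streams, which the inferred type would allow.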

Memory exploding upon writing a lazy bytestring to file in ghci

The following program does not explode when the executable (compiled via ghc -O0 Explode.hs) is run, but does explode when run in ghci (via either ghci Explode.hs or ghci -fobject-code Explode.hs) :
--Explode.hs
--Does not explode with : ghc -O0 Explode.hs
--Explodes with : ghci Explode.hs
--Explodes with : ghci -fobject-code Explode.hs
module Main (main) where
import Data.Int
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
createStr :: Int64 -> String -> BL.ByteString
createStr num str = BL.take num $ BL.cycle $ BLC.pack str
main = do
  BLC.writeFile "results.txt" $ createStr 100000000 "abc\n"
Why does it explode in ghci and not with ghc -O0 Explode.hs, and how can I stop it from exploding in ghci? The methods I adopted in Memory blowing up for strict sum/strict foldl in ghci don't seem to work here. Thanks.
After inspecting the code of writeFile, it seems that it depends on the hPut function of Data.ByteString.Lazy:
-- | Outputs a 'ByteString' to the specified 'Handle'.
--
hPut :: Handle -> ByteString -> IO ()
hPut h cs = foldrChunks (\c rest -> S.hPut h c >> rest) (return ()) cs
hPut constructs the IO action that will print the lazy bytestring by applying a right fold of sorts over the chunks. The source for the foldrChunks function is:
-- | Consume the chunks of a lazy ByteString with a natural right fold.
foldrChunks :: (S.ByteString -> a -> a) -> a -> ByteString -> a
foldrChunks f z = go
  where go Empty = z
        go (Chunk c cs) = f c (go cs)
Looking at the code, it seems as if the "spine" of the lazy bytestring (but not the actual data in each chunk) will be forced before writing the first byte, because of how (>>) behaves for the IO monad.
In your example, the strict chunks composing your lazy bytestring are very small. This means a whole lot of them will be generated when foldrChunks "forces the spine" of the 100000000 character long lazy bytestring.
If this analysis is correct, then reducing the number of strict chunks by making them bigger would reduce memory usage. This variant of createStr that creates bigger chunks doesn't blow up for me in ghci:
createStr :: Int64 -> String -> BL.ByteString
createStr num str = BL.take num $ BL.cycle $ BLC.pack $ concat $ replicate 1000 $ str
(I'm not sure why the compiled example doesn't blow up.)
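The difference in chunk counts between the two definitions can be checked directly with BL.toChunks. This is only a sketch: createStrBig and chunkCount are names introduced here for illustration, and the byte count is scaled down so the check runs quickly. It measures the number of chunks, not memory use, so it supports the premise of the analysis above rather than confirming its conclusion.

```haskell
import Data.Int (Int64)
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC

-- The question's original definition: cycles one tiny chunk.
createStr :: Int64 -> String -> BL.ByteString
createStr num str = BL.take num $ BL.cycle $ BLC.pack str

-- The variant from this answer (renamed here): packs 1000 copies before cycling.
createStrBig :: Int64 -> String -> BL.ByteString
createStrBig num str = BL.take num $ BL.cycle $ BLC.pack $ concat $ replicate 1000 str

-- Number of strict chunks making up a lazy ByteString.
chunkCount :: BL.ByteString -> Int
chunkCount = length . BL.toChunks

main :: IO ()
main = do
  print (chunkCount (createStr 100000 "abc\n"))
  print (chunkCount (createStrBig 100000 "abc\n"))
```

Both values hold the same 100000 bytes; only the way they are split into strict chunks differs.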

Conduit - Combining multiple Sources/Producers into one

I'm reading from a file using sourceFile, but I also need to introduce randomness into the processing operation. The best approach I believe is to have a producer that is of the type
Producer m (StdGen, ByteString)
where StdGen is used to generate the random number.
I intend for the producer to perform the task of sourceFile, as well as producing a new seed to yield every time it sends data downstream.
My problem is, there doesn't seem to be a source-combiner like zipSink for sinks. Reading through Conduit Overview, it seems to be suggesting that you can embed a Source inside a Conduit, but I'm failing to see how it is done in the example.
Can anyone provide an example in which you fuse two or more IO sources into one single Producer/Source?
EDIT :
An example:
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE RankNTypes #-}
{-# LANGUAGE OverloadedStrings #-}
import System.Random (StdGen(..), split, newStdGen, randomR)
import ClassyPrelude.Conduit as Prelude
import Control.Monad.Trans.Resource (runResourceT, ResourceT(..))
import qualified Data.ByteString as BS
-- generate an infinite source of random number seeds
sourceStdGen :: MonadIO m => Source m StdGen
sourceStdGen = do
  g <- liftIO newStdGen
  loop g
  where
    loop gin = do
      let g' = fst (split gin)
      yield gin
      loop g'
-- combine the sources into one
sourceInput :: (MonadResource m, MonadIO m) => FilePath -> Source m (StdGen, ByteString)
sourceInput fp = getZipSource $ (,)
  <$> ZipSource sourceStdGen
  <*> ZipSource (sourceFile fp)
-- a simple conduit, which generates a random number from the provided StdGen
-- and appends the byte value to the provided ByteString
simpleConduit :: Conduit (StdGen, ByteString) (ResourceT IO) ByteString
simpleConduit = mapC process
process :: (StdGen, ByteString) -> ByteString
process (g, bs) =
  let rnd = fst $ randomR (40,50) g
  in bs ++ pack [rnd]
main :: IO ()
main = do
  runResourceT $ sourceInput "test.txt" $$ simpleConduit =$ sinkFile "output.txt"
So this example takes what's in the input file and writes it to the output file, appending a random ASCII value between 40 and 50 to the end of the file. (Don't ask me why.)
You can use ZipSource for this. In your case, it might look something like:
sourceStdGens :: Source m StdGen
sourceBytes :: Source m ByteString
sourceBoth :: Source m (StdGen, ByteString)
sourceBoth = getZipSource $ (,)
  <$> ZipSource sourceStdGens
  <*> ZipSource sourceBytes
You can do it in the IO monad and then lift the result to a Producer.
do let (i, newSeed) = next currentSeed
   b <- generateByteStringFromRandomNumber i
   return (b, newSeed)
That IO action can be lifted into the appropriate conduit with a simple lift:
-- assuming the above action is named x and takes the current seed as an argument
-- the corresponding producer/source is:
lift $ x currentSeed
