How to convert a ByteString value to a JSVal - haskell

In the module GHCJS.DOM.JSFFI.Generated.CanvasRenderingContext2D there is the function putImageData with the following type:
putImageData ::
  Control.Monad.IO.Class.MonadIO m =>
  CanvasRenderingContext2D
  -> Maybe GHCJS.DOM.Types.ImageData -> Float -> Float -> m ()
The second parameter has the type Maybe GHCJS.DOM.Types.ImageData.
This type is defined in the module GHCJS.DOM.Types as a newtype wrapper around a JSVal value:
newtype ImageData = ImageData {unImageData :: GHCJS.Prim.JSVal}
I have a value of type ByteString that always holds 4 bytes (the RGBA values) per pixel. How do I convert my ByteString value to a GHCJS.Prim.JSVal?

Edit: Looks like my original answer was too GHC-centric. Added an untested fix that might work for GHCJS.
Edit #2: Added my stack.yaml file for the example.
You can use GHCJS.DOM.ImageData.newImageData to construct the ImageData object. It requires the data to be a GHCJS.DOM.Types.Uint8ClampedArray (which is a byte array in RGBA format).
There are conversion functions in GHCJS.Buffer from ByteStrings to Buffers (via fromByteString) and from there to typed arrays (e.g., getUint8Array). They do the conversion directly under GHCJS, and even under plain GHC they use a base64 conversion as an intermediary which should be pretty fast. Unfortunately, the conversion function getUint8ClampedArray isn't included (and for plain GHC, it looks like fromByteString might be broken anyway -- in jsaddle 0.8.3.0, it's calling the wrong JavaScript helper function).
For plain GHC, the following seems to work (the first line is copied from fromByteString with the helper renamed from the apparently incorrect h$newByteArrayBase64String):
uint8ClampedArrayFromByteString :: ByteString -> GHCJSPure (Uint8ClampedArray)
uint8ClampedArrayFromByteString bs = GHCJSPure $ do
  buffer <- SomeBuffer <$> jsg1 "h$newByteArrayFromBase64String"
                            (decodeUtf8 $ B64.encode bs)
  arrbuff <- ghcjsPure (getArrayBuffer (buffer :: MutableBuffer))
  liftDOM (Uint8ClampedArray <$> new (jsg "Uint8ClampedArray") [pToJSVal arrbuff])
Here is an untested GHCJS version that may work. If they fix the above-mentioned jsaddle bug, it should work under plain GHC, too:
uint8ClampedArrayFromByteString :: ByteString -> GHCJSPure (Uint8ClampedArray)
uint8ClampedArrayFromByteString bs = GHCJSPure $ do
  (buffer, _, _) <- ghcjsPure (fromByteString bs)
  buffer' <- thaw buffer
  arrbuff <- ghcjsPure (getArrayBuffer buffer')
  liftDOM (Uint8ClampedArray <$> new (jsg "Uint8ClampedArray") [pToJSVal arrbuff])
I don't have a running GHCJS installation, but here's a complete example tested with JSaddle+Warp under plain GHC, and it seems to work okay (i.e., if you point a browser at localhost:6868, it displays a 3x4 image on the canvas element):
module Main where

import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.Text.Encoding (decodeUtf8)
import qualified Data.ByteString.Base64 as B64 (encode)
import Language.Javascript.JSaddle (js, js1, jss, jsg, jsg1,
                                    new, pToJSVal, GHCJSPure(..), ghcjsPure, JSM,
                                    fromJSVal, toJSVal, Object)
import Language.Javascript.JSaddle.Warp (run)
import JSDOM.Types (liftDOM, Uint8ClampedArray(..), RenderingContext(..))
import JSDOM.ImageData
import JSDOM.HTMLCanvasElement
import JSDOM.CanvasRenderingContext2D
import GHCJS.Buffer (getArrayBuffer, MutableBuffer)
import GHCJS.Buffer.Types (SomeBuffer(..))
import Control.Lens ((^.))

main :: IO ()
main = run 6868 $ do

  let smallImage = BS.pack [0xff,0x00,0x00,0xff, 0xff,0x00,0x00,0xff, 0xff,0x00,0x00,0xff,
                            0x00,0x00,0x00,0xff, 0x00,0xff,0x00,0xff, 0x00,0x00,0x00,0xff,
                            0x00,0x00,0xff,0xff, 0x00,0x00,0xff,0xff, 0x00,0x00,0xff,0xff,
                            0x00,0x00,0xff,0xff, 0x00,0x00,0x00,0xff, 0x00,0x00,0xff,0xff]
  img <- makeImageData 3 4 smallImage

  doc <- jsg "document"
  doc ^. js "body" ^. jss "innerHTML" "<canvas id=c width=10 height=10></canvas>"
  Just canvas <- doc ^. js1 "getElementById" "c" >>= fromJSVal
  Just ctx <- getContext canvas "2d" ([] :: [Object])
  let ctx' = CanvasRenderingContext2D (unRenderingContext ctx)
  putImageData ctx' img 3 4
  return ()

uint8ClampedArrayFromByteString :: ByteString -> GHCJSPure (Uint8ClampedArray)
uint8ClampedArrayFromByteString bs = GHCJSPure $ do
  buffer <- SomeBuffer <$> jsg1 "h$newByteArrayFromBase64String"
                            (decodeUtf8 $ B64.encode bs)
  arrbuff <- ghcjsPure (getArrayBuffer (buffer :: MutableBuffer))
  liftDOM (Uint8ClampedArray <$> new (jsg "Uint8ClampedArray") [pToJSVal arrbuff])

makeImageData :: Int -> Int -> ByteString -> JSM ImageData
makeImageData width height dat
  = do dat' <- ghcjsPure (uint8ClampedArrayFromByteString dat)
       newImageData dat' (fromIntegral width) (Just (fromIntegral height))
To build this, I used the following stack.yaml:
resolver: lts-8.12
extra-deps:
- ghcjs-dom-0.8.0.0
- ghcjs-dom-jsaddle-0.8.0.0
- jsaddle-0.8.3.0
- jsaddle-warp-0.8.3.0
- jsaddle-dom-0.8.0.0
- ref-tf-0.4.0.1

As K.A. Buhr pointed out, after converting the ByteString to a Uint8ClampedArray, you can pass the clamped array to newImageData to get the desired ImageData object.
You can use an inline JavaScript function to generate the Uint8ClampedArray. To pass a ByteString through the JavaScript FFI, use Data.ByteString.useAsCStringLen.
The code below shows how to do this.
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE JavaScriptFFI #-}
{-# LANGUAGE CPP #-}

import Reflex.Dom
import Data.Monoid ((<>))
import Control.Monad.IO.Class (liftIO)
import GHCJS.DOM.ImageData (newImageData)
import GHCJS.DOM.HTMLCanvasElement (getContext)
import GHCJS.DOM.JSFFI.Generated.CanvasRenderingContext2D (putImageData)
import GHCJS.DOM.Types (CanvasRenderingContext2D(..), castToHTMLCanvasElement, Uint8ClampedArray(..))
import Foreign.Ptr (Ptr)
import GHCJS.Types (JSVal)
import GHCJS.Marshal.Pure (pFromJSVal, pToJSVal)
import Data.Map (Map)
import Data.Text as T (Text, pack)
import Data.ByteString as BS (ByteString, pack, useAsCStringLen)

-- Some code and techniques taken from these sites:
-- http://lpaste.net/154691
-- https://www.snip2code.com/Snippet/1032978/Simple-Canvas-Example/

-- import inline Javascript code as Haskell function : jsUint8ClampedArray
foreign import javascript unsafe
  -- Arguments
  --   pixels : Ptr a -- Pointer to a ByteString
  --   len    : JSVal -- Number of pixels
  "(function(){ return new Uint8ClampedArray($1.u8.slice(0, $2)); })()"
  jsUint8ClampedArray :: Ptr a -> JSVal -> IO JSVal

-- takes pointer and length arguments as passed by useAsCStringLen
newUint8ClampedArray :: (Ptr a, Int) -> IO Uint8ClampedArray
newUint8ClampedArray (pixels, len) =
  pFromJSVal <$> jsUint8ClampedArray pixels (pToJSVal len)

canvasAttrs :: Int -> Int -> Map T.Text T.Text
canvasAttrs w h =    ("width"  =: T.pack (show w))
                  <> ("height" =: T.pack (show h))

main = mainWidget $ do
  -- first, generate some test pixels
  let boxWidth = 120
      boxHeight = 30
      boxDataLen = boxWidth*boxHeight*4 -- 4 bytes per pixel

      reds   = take boxDataLen $ concat $ repeat [0xff,0x00,0x00,0xff]
      greens = take boxDataLen $ concat $ repeat [0x00,0xff,0x00,0xff]
      blues  = take boxDataLen $ concat $ repeat [0x00,0x00,0xff,0xff]

      pixels = reds ++ greens ++ blues
      image = BS.pack pixels -- create a ByteString with the pixel data.

  -- create Uint8ClampedArray representation of pixels
  imageArray <- liftIO $ BS.useAsCStringLen image newUint8ClampedArray

  let imageWidth = boxWidth
      imageHeight = (length pixels `div` 4) `div` imageWidth

  -- use Uint8ClampedArray representation of pixels to create ImageData
  imageData <- newImageData (Just imageArray) (fromIntegral imageWidth) (fromIntegral imageHeight)

  -- demonstrate the imageData is what we expect by displaying it.
  (element, _) <- elAttr' "canvas" (canvasAttrs 300 200) $ return ()
  let canvasElement = castToHTMLCanvasElement (_element_raw element)

  elementContext <- getContext canvasElement ("2d" :: String)

  let renderingContext = CanvasRenderingContext2D elementContext
  putImageData renderingContext (Just imageData) 80 20
Here's a link to a repository with the example code: https://github.com/dc25/stackOverflow__how-to-convert-a-bytestring-value-to-a-jsval
Here's a link to a live demo : https://dc25.github.io/stackOverflow__how-to-convert-a-bytestring-value-to-a-jsval/

You can use Hoogle to find a function by its type signature ByteString -> GHCJS.Prim.JSVal: https://www.stackage.org/lts-8.11/hoogle?q=ByteString+-%3E+GHCJS.Prim.JSVal
Which has this in the results:
https://www.stackage.org/haddock/lts-8.11/ghcjs-base-stub-0.1.0.2/GHCJS-Prim.html#v:toJSString
toJSString :: String -> JSVal
So now you just need a function to do ByteString -> String.
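For example, going through Data.ByteString.Char8.unpack would look roughly like this (an untested sketch; note that unpack maps each byte to one Char, which is fine for ASCII/Latin-1 data but is not a UTF-8 decoder):
import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as C8
import GHCJS.Prim (JSVal, toJSString)

-- ByteString -> String -> JSVal; each byte becomes one Char.
byteStringToJSVal :: ByteString -> JSVal
byteStringToJSVal = toJSString . C8.unpack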

Related

In GHC, why is the lazy version of this small program so much faster than the loop-based variant?

These two programs do the same thing, but one runs 10x faster.
This takes approx. 10 seconds on my machine:
import Control.Monad
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
theValueOne=B.singleton 1
main = replicateM_ 100000000 $ B.putStr theValueOne
The second version uses lazy output I/O. It is done in about 1 second (as fast as C):
import qualified Data.ByteString.Lazy as BL
main = BL.putStr $ BL.pack $ replicate 100000000 1
Question: Why is the non-lazy version so slow? More importantly, how can I make it fast? (I've tried recursion, forM, modifying the output buffer using hSetBuffering... Nothing has made a difference)
Note- This is more than just an academic question. The non-lazy version is an extremely simplified version of an executable my company uses in production, which is also slow in the same way. It would be nearly impossible to re-architect the larger program around the analogous lazy solution.
Update: added a possible source of the problem and a solution.
I don't think it has anything to do with lazy I/O. If you rewrite the strict I/O version to write two bytes at once:
theValueOne = B.singleton 1
main = replicateM_ 50000000 $ B.putStr (theValueOne <> theValueOne)
that halves the time. Write ten bytes at once:
theValueOne = B.singleton 1
main = replicateM_ 10000000 $ B.putStr (foldMap id (replicate 10 theValueOne))
and it's already faster than the lazy I/O version.
The issue is that there's a bit of overhead in a B.hPutStr call, much more than the overhead of a C fwrite call, and it's just not a particularly efficient way to write a single byte.
A good chunk of the overhead comes from the fact that Haskell I/O buffers have immutable metadata. Even though the buffer content itself is mutable, the pointers to valid data within the buffer are immutable, and so writing a single byte requires a heap allocation of a new GHC.IO.Buffer.Buffer structure which GHC can't optimize away.
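As an aside (a different route than the hand-rolled buffer below), Data.ByteString.Builder already batches writes into large chunks before they hit the Handle, so a sketch like the following should also avoid the per-byte overhead; untested against the original benchmark:
import qualified Data.ByteString.Builder as B
import System.IO (stdout)

-- Builders are concatenated cheaply and flushed to the handle in large
-- chunks, so 100 million one-byte writes become a handful of buffer flushes.
main :: IO ()
main = B.hPutBuilder stdout (mconcat (replicate 100000000 (B.word8 1)))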
One solution is to use a hand-crafted buffering structure with a mutable pointer. The following works, and it's about twice as fast as the lazy I/O version in the original question.
{-# LANGUAGE RecordWildCards #-}
{-# OPTIONS_GHC -Wall #-}
import Control.Monad
import Data.IORef
import Data.Word
import Foreign.ForeignPtr
import Foreign.Ptr
import Foreign.Storable
import System.IO
data WriteBuffer = WriteBuffer
  { handle :: !Handle
  , capacity :: !Int
  , used :: !(IORef Int)
  , content :: !(ForeignPtr Word8)
  }

newBuffer :: Handle -> IO WriteBuffer
newBuffer h = do
  hSetBinaryMode h True
  hSetBuffering h NoBuffering
  WriteBuffer h cap <$> newIORef 0 <*> mallocForeignPtrBytes cap
  where cap = 4096

flushBuffer :: WriteBuffer -> IO ()
flushBuffer WriteBuffer{..} = do
  n <- readIORef used
  withForeignPtr content $ \p -> hPutBuf handle p n
  writeIORef used 0

writeByte :: Word8 -> WriteBuffer -> IO ()
writeByte w buf@(WriteBuffer{..}) = do
  n <- readIORef used
  withForeignPtr content $ \p -> poke (plusPtr p n) w
  let n' = n + 1
  writeIORef used n'
  when (n' == capacity) $
    flushBuffer buf

main :: IO ()
main = do
  b <- newBuffer stdout
  replicateM_ 100000000 (writeByte 1 b)
  flushBuffer b
Somewhat ironically, converting this to a version using an immutable counter and passing the WriteBuffer as state through foldM doubles the speed again, so it's about 4 times as fast as the lazy I/O version in the original question:
{-# LANGUAGE RecordWildCards #-}
{-# OPTIONS_GHC -Wall #-}
import Control.Monad
import Data.Word
import Foreign.ForeignPtr
import Foreign.Ptr
import Foreign.Storable
import System.IO
data WriteBuffer = WriteBuffer
  { handle :: !Handle
  , capacity :: !Int
  , used :: !Int
  , content :: !(ForeignPtr Word8)
  }

newBuffer :: Handle -> IO WriteBuffer
newBuffer h = do
  hSetBinaryMode h True
  hSetBuffering h NoBuffering
  WriteBuffer h cap 0 <$> mallocForeignPtrBytes cap
  where cap = 4096

flushBuffer :: WriteBuffer -> IO WriteBuffer
flushBuffer buf@WriteBuffer{..} = do
  withForeignPtr content $ \p -> hPutBuf handle p used
  return $ buf { used = 0 }

writeByte :: Word8 -> WriteBuffer -> IO WriteBuffer
writeByte w buf@(WriteBuffer{..}) = do
  withForeignPtr content $ \p -> poke (plusPtr p used) w
  let used' = used + 1
      buf' = buf { used = used' }
  if (used' == capacity)
    then flushBuffer buf'
    else return buf'

main :: IO ()
main = do
  b <- newBuffer stdout
  b' <- foldM (\s _ -> writeByte 1 s) b [(1::Int)..100000000]
  void (flushBuffer b')
The reason this one is so fast seems to be that GHC is able to optimize away the WriteBuffer constructor entirely from the fold and just pass around unboxed pointers and integers in the loop. My guess is that if I modified the mutable version above to avoid boxing and unboxing the integer in the used IORef, it would be similarly fast.
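To illustrate that guess, here is an untested sketch that keeps the write position in raw mutable memory (a Ptr Int from malloc) instead of an IORef, so the counter never needs to be boxed on the hot path; whether GHC really produces a tight loop here would have to be checked against the benchmark:
{-# LANGUAGE RecordWildCards #-}

import Control.Monad
import Data.Word
import Foreign.ForeignPtr
import Foreign.Marshal.Alloc (malloc)
import Foreign.Ptr
import Foreign.Storable
import System.IO

data WriteBuffer = WriteBuffer
  { handle :: !Handle
  , capacity :: !Int
  , used :: !(Ptr Int)              -- unboxed, mutable write position
  , content :: !(ForeignPtr Word8)
  }

newBuffer :: Handle -> IO WriteBuffer
newBuffer h = do
  hSetBinaryMode h True
  hSetBuffering h NoBuffering
  usedPtr <- malloc
  poke usedPtr 0
  WriteBuffer h cap usedPtr <$> mallocForeignPtrBytes cap
  where cap = 4096

flushBuffer :: WriteBuffer -> IO ()
flushBuffer WriteBuffer{..} = do
  n <- peek used
  withForeignPtr content $ \p -> hPutBuf handle p n
  poke used 0

writeByte :: Word8 -> WriteBuffer -> IO ()
writeByte w buf@(WriteBuffer{..}) = do
  n <- peek used
  withForeignPtr content $ \p -> poke (plusPtr p n) w
  poke used (n + 1)
  when (n + 1 == capacity) $
    flushBuffer buf

main :: IO ()
main = do
  b <- newBuffer stdout
  replicateM_ 100000000 (writeByte 1 b)
  flushBuffer b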

Binary file writer adds extra byte

I'm building a Conduit that writes a binary file consisting of a header followed by a Double matrix as a row-ordered list. Here's the code:
import Conduit ((.|), ConduitM, mapC, sinkFileBS, yield)
import Control.Monad.Trans.Except (ExceptT)
import Control.Monad.Trans.Resource (ResourceT)
import Data.ByteString (ByteString)
import Data.ByteString.Conversion (toByteString')
import Data.Serialize.IEEE754 (putFloat64be)
import Data.Serialize.Put (putListOf, runPut)
import Data.Void (Void)
import Numeric.LinearAlgebra.Data ((><), Matrix, toLists)
import System.FilePath (FilePath)
type FileWriter = ResourceT (ExceptT String IO)
matrixSink :: FilePath -> ConduitM (Matrix Double) Void FileWriter ()
matrixSink path = byteBuilder .| sinkFileBS path where
  byteBuilder = do
    yield $ toByteString' "header"
    mapC fromDoubleMatrix

fromDoubleMatrix :: Matrix Double -> ByteString
fromDoubleMatrix matrix = runPut $
  putListOf putFloat64be (concat (toLists matrix))
This almost works. If I test it using
runExceptT . runConduitRes $ yield matrix .| matrixSink "test.dat"
  where matrix = (2 >< 2) [1, 2, 3, 4]
I get the expected file but with an extra byte between the header and the list of doubles. When displayed using show the extra byte looks like this:
"\NUL\NUL\NUL\NUL\NUL\NUL\NUL\t"
Any idea how not to print this byte? Or if it's a canonical separator or something (so that I can ignore it in the reader)?
EDIT: The problem seems to occur in the putListOf construction in fromDoubleMatrix.
putListOf :: Putter a -> Putter [a]
putListOf pa = \l -> do
  putWord64be (fromIntegral (length l))
  mapM_ pa l
putListOf encodes the length of the list before encoding the individual list elements. I think maybe you are dealing with fixed 2x2 matrices so you don't need that length, and you just want:
fromDoubleMatrix :: Matrix Double -> ByteString
fromDoubleMatrix matrix = runPut $
  mapM_ putFloat64be (concat (toLists matrix))
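A quick way to see the difference between the two encodings is to compare their lengths (a small standalone check using the same cereal functions the question already imports):
import qualified Data.ByteString as BS
import Data.Serialize.IEEE754 (putFloat64be)
import Data.Serialize.Put (putListOf, runPut)

main :: IO ()
main = do
  -- putListOf prefixes the payload with the list length as a big-endian
  -- Word64, so a 2x2 matrix encodes to 8 + 4*8 = 40 bytes ...
  print $ BS.length $ runPut $ putListOf putFloat64be [1, 2, 3, 4 :: Double]
  -- ... while mapM_ writes just the four doubles: 32 bytes.
  print $ BS.length $ runPut $ mapM_ putFloat64be [1, 2, 3, 4 :: Double]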

I/O Monad and ByteString to Char conversion?

I'm testing some HTTP requests in Haskell and have the methods below:
import qualified Data.ByteString.Lazy as LAZ
import Language.Haskell.TH.Ppr
import System.IO
import Data.Word (Word8)
request :: IO LAZ.ByteString
request = do
  response <- simpleHttp "https://www.url.com"
  return (response)

exampleFunctionOne :: IO LAZ.ByteString -> IO LAZ.ByteString
exampleFunctionOne bytes = do
  html <- bytes
  let bytesToChars = bytesToString $ LAZ.unpack html
  let x = exampleFunctionTwo bytesToChars
  bytes

exampleFunctionTwo :: [Char] -> [Char]
exampleFunctionTwo chars = --Do stuff...

main = do
  exampleFunctionOne $ request
My questions are:
Is there a more straightforward way to convert the ByteString to [Char]? Currently I'm having to perform two conversions, (ByteString -> [Word8]) and then ([Word8] -> [Char]).
Am I correct in saying the return statement in my request function is simply re-applying the monad context (in this case IO) to the value I've extracted (response <- simpleHttp)? Or does it have an additional purpose?
To answer your first question, note that there's a different "unpack" in Data.ByteString.Lazy.Char8 with the signature you want:
unpack :: ByteString -> String
It's not unusual for people to import both modules:
import qualified Data.ByteString.Lazy as B
import qualified Data.ByteString.Lazy.Char8 as C
and mix and match functions from each.
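Applied to the request in the question, the conversion then collapses to a single unpack (a sketch; assuming simpleHttp comes from Network.HTTP.Conduit):
import qualified Data.ByteString.Lazy.Char8 as C
import Network.HTTP.Conduit (simpleHttp)

-- Fetch the page and convert the lazy ByteString straight to [Char].
requestAsChars :: IO [Char]
requestAsChars = do
  response <- simpleHttp "https://www.url.com"
  return (C.unpack response)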
To answer your second question, yes that's more or less it. For example:
redund = do x <- getLine
            y <- return x
            z <- return y
            u <- return z
            return u
is all equivalent to redund = getLine with a bunch of re-wrapping and extracting of pure values into and out of the IO monad.

Efficient streaming and manipulation of a byte stream in Haskell

While writing a deserialiser for a large (<bloblength><blob>)* encoded binary file I got stuck with the various Haskell produce-transform-consume libraries. So far I'm aware of four streaming libraries:
Data.Conduit: Widely used, has very careful resource management
Pipes: Similar to conduit (Haskell Cast #6 nicely reveals the differences between conduit and pipes)
Data.Binary.Get: Offers useful functions such as getWord32be, but the streaming example is awkward
System.IO.Streams: Seems to be the easiest one to use
Here's a stripped down example of where things go wrong when I try to do Word32 streaming with conduit. A slightly more realistic example would first read a Word32 that determines the blob length and then yield a lazy ByteString of that length (which is then deserialised further).
But here I just try to extract Word32's in streaming fashion from a binary file:
module Main where
-- build-depends: bytestring, conduit, conduit-extra, resourcet, binary
import Control.Monad.Trans.Resource (MonadResource, runResourceT)
import qualified Data.Binary.Get as G
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Lazy as BL
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import Data.Word (Word32)
import System.Environment (getArgs)
-- gets a Word32 from a ByteString.
getWord32 :: C.ByteString -> Word32
getWord32 bs = do
  G.runGet G.getWord32be $ BL.fromStrict bs

-- should read ByteString and return Word32
transform :: (Monad m, MonadResource m) => Conduit BS.ByteString m Word32
transform = do
  mbs <- await
  case mbs of
    Just bs -> do
      case C.null bs of
        False -> do
          yield $ getWord32 bs
          leftover $ BS.drop 4 bs
          transform
        True -> return ()
    Nothing -> return ()

main :: IO ()
main = do
  filename <- fmap (!!0) getArgs -- should check length getArgs
  result <- runResourceT $ (CB.sourceFile filename) $$ transform =$ CL.consume
  print $ length result -- is always 8188 for files larger than 32752 bytes
The output of the program is just the number of Word32's that were read. It turns out the stream terminates after reading the first chunk (about 32KiB). For some reason mbs is never Nothing, so I must check null bs which stops the stream when the chunk is consumed. Clearly, my conduit transform is faulty. I see two routes to a solution:
The await doesn't want to go to the second chunk of the ByteStream, so is there another function that pulls the next chunk? In examples I've seen (e.g. Conduit 101) this is not how it's done
This is just the wrong way to set up transform.
How is this done properly? Is this the right way to go? (Performance does matter.)
Update: Here's a BAD way to do it using Systems.IO.Streams:
module Main where
import Data.Word (Word32)
import System.Environment (getArgs)
import System.IO (IOMode (ReadMode), openFile)
import qualified System.IO.Streams as S
import System.IO.Streams.Binary (binaryInputStream)
import System.IO.Streams.List (outputToList)
main :: IO ()
main = do
  filename : _ <- getArgs
  h <- openFile filename ReadMode
  s <- S.handleToInputStream h
  i <- binaryInputStream s :: IO (S.InputStream Word32)
  r <- outputToList $ S.connect i
  print $ last r
'Bad' means: very demanding in time and space, and it does not handle the Decode exception.
Your immediate problem is caused by how you are using leftover. That function is used to "Provide a single piece of leftover input to be consumed by the next component in the current monadic binding", and so when you give it bs before looping with transform you are effectively throwing away the rest of the bytestring (i.e. what is after bs).
A correct solution based on your code would use the incremental input interface of Data.Binary.Get to replace your yield/leftover combination with something that consumes each chunk fully. A more pragmatic approach, though, is using the binary-conduit package, which provides that in the shape of conduitGet (its source gives a good idea of what a "manual" implementation would look like):
import Data.Conduit.Serialization.Binary
-- etc.
transform :: (Monad m, MonadResource m) => Conduit BS.ByteString m Word32
transform = conduitGet G.getWord32be
One caveat is that this will throw a parse error if the total number of bytes is not a multiple of 4 (i.e. the last Word32 is incomplete). In the unlikely case of that not being what you want, a lazy way out would be simply using \bs -> C.take (4 * (C.length bs `div` 4)) bs on the input bytestring.
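For reference, a hand-rolled version along those lines might look roughly like this sketch (untested), driving Data.Binary.Get's incremental interface directly instead of using binary-conduit; it assumes the question's imports plus Control.Monad (unless), with Decoder's constructors in scope via the qualified G import:
transform :: Monad m => Conduit BS.ByteString m Word32
transform = go (G.runGetIncremental G.getWord32be)
  where
    go (G.Done rest _ w) = do
      yield w                            -- a complete Word32 was parsed
      unless (BS.null rest) (leftover rest)
      transform                          -- start a fresh decoder
    go (G.Partial k) = do
      mbs <- await                       -- next chunk, or Nothing at end of input
      go (k mbs)
    go (G.Fail _ _ _) = return ()        -- trailing bytes that don't form a Word32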
With pipes (and pipes-group and pipes-bytestring) the demo problem reduces to combinators. First we resolve the incoming undifferentiated byte stream into little 4 byte chunks:
chunksOfStrict :: (Monad m) => Int -> Producer ByteString m r -> Producer ByteString m r
chunksOfStrict n = folds mappend mempty id . view (Bytes.chunksOf n)
then we map these to Word32s and (here) count them.
main :: IO ()
main = do
  filename:_ <- getArgs
  IO.withFile filename IO.ReadMode $ \h -> do
    n <- P.length $ chunksOfStrict 4 (Bytes.fromHandle h) >-> P.map getWord32
    print n
This will fail if we have fewer than 4 bytes or otherwise fail to parse, but we can just as well map with
getMaybeWord32 :: ByteString -> Maybe Word32
getMaybeWord32 bs = case G.runGetOrFail G.getWord32be $ BL.fromStrict bs of
  Left r -> Nothing
  Right (_, off, w32) -> Just w32
The following program will then print the parses for the valid 4 byte sequences
main :: IO ()
main = do
  filename:_ <- getArgs
  IO.withFile filename IO.ReadMode $ \h -> do
    runEffect $ chunksOfStrict 4 (Bytes.fromHandle h)
      >-> P.map getMaybeWord32
      >-> P.concat -- here `concat` eliminates maybes
      >-> P.print
There are other ways of dealing with failed parses, of course.
Here, though, is something closer to the program you asked for. It takes a four byte segment from a byte stream (Producer ByteString m r) and reads it as a Word32 if it is long enough; it then takes that many of the incoming bytes and accumulates them into a lazy bytestring, yielding it. It just repeats this until it runs out of bytes. In main below, I print each yielded lazy bytestring that is produced:
module Main (main) where

import Pipes
import qualified Pipes.Prelude as P
import Pipes.Group (folds)
import qualified Pipes.ByteString as Bytes ( splitAt, fromHandle, chunksOf )
import Control.Lens ( view ) -- or Lens.Simple (view) -- or Lens.Micro ((.^))
import qualified System.IO as IO ( IOMode(ReadMode), withFile )
import qualified Data.Binary.Get as G ( runGet, getWord32be )
import Data.ByteString ( ByteString )
import qualified Data.ByteString.Lazy.Char8 as BL
import System.Environment ( getArgs )

splitLazy :: (Monad m, Integral n) =>
   n -> Producer ByteString m r -> m (BL.ByteString, Producer ByteString m r)
splitLazy n bs = do
  (bss, rest) <- P.toListM' $ view (Bytes.splitAt n) bs
  return (BL.fromChunks bss, rest)

measureChunks :: Monad m => Producer ByteString m r -> Producer BL.ByteString m r
measureChunks bs = do
  (lbs, rest) <- lift $ splitLazy 4 bs
  if BL.length lbs /= 4
    then rest >-> P.drain -- in fact it will be empty
    else do
      let w32 = G.runGet G.getWord32be lbs
      (lbs', rest') <- lift $ splitLazy w32 rest
      yield lbs'
      measureChunks rest'

main :: IO ()
main = do
  filename:_ <- getArgs
  IO.withFile filename IO.ReadMode $ \h -> do
    runEffect $ measureChunks (Bytes.fromHandle h) >-> P.print
This is again crude in that it uses runGet not runGetOrFail, but this is easily repaired. The pipes standard procedure would be to stop the stream transformation on a failed parse and return the unparsed bytestream.
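A sketch (untested) of that convention, reusing splitLazy from above: on a missing or short length prefix we stop yielding and hand back the unparsed remainder of the stream (this additionally needs G.runGetOrFail and BL.toStrict in scope):
measureChunksOrStop :: Monad m
                    => Producer ByteString m r
                    -> Producer BL.ByteString m (Producer ByteString m r)
measureChunksOrStop bs = do
  (lbs, rest) <- lift $ splitLazy 4 bs
  case G.runGetOrFail G.getWord32be lbs of
    Left _ ->
      -- put the unconsumed prefix back in front of the remainder and stop
      return (yield (BL.toStrict lbs) >> rest)
    Right (_, _, w32) -> do
      (blob, rest') <- lift $ splitLazy w32 rest
      yield blob
      measureChunksOrStop rest'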
If you were anticipating that the Word32s were for large numbers, so that you did not want to accumulate the corresponding stream of bytes as a lazy bytestring, but say write them to different files without accumulating, we could change the program pretty easily to do that. This would require a sophisticated use of conduit but is the preferred approach with pipes and streaming.
Here's a relatively straightforward solution that I want to throw into the ring. It's a repeated use of splitAt wrapped into a State monad that gives an interface identical to (a subset of) Data.Binary.Get. The resulting [ByteString] is obtained in main with a whileJust over getBlob.
module Main (main) where
import Control.Monad.Loops
import Control.Monad.State
import qualified Data.Binary.Get as G (getWord32be, runGet)
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)
import Data.Word (Word32)
import System.Environment (getArgs)
-- this is going to mimic the Data.Binary.Get.Get Monad
type Get = State BL.ByteString

getWord32be :: Get (Maybe Word32)
getWord32be = state $ \bs -> do
  let (w, rest) = BL.splitAt 4 bs
  case BL.length w of
    4 -> (Just w', rest) where
      w' = G.runGet G.getWord32be w
    _ -> (Nothing, BL.empty)

getLazyByteString :: Int64 -> Get BL.ByteString
getLazyByteString n = state $ \bs -> BL.splitAt n bs

getBlob :: Get (Maybe BL.ByteString)
getBlob = do
  ml <- getWord32be
  case ml of
    Nothing -> return Nothing
    Just l -> do
      blob <- getLazyByteString (fromIntegral l :: Int64)
      return $ Just blob

runGet :: Get a -> BL.ByteString -> a
runGet g bs = fst $ runState g bs

main :: IO ()
main = do
  fname <- head <$> getArgs
  bs <- BL.readFile fname
  let ls = runGet loop bs where
        loop = whileJust getBlob return
  print $ length ls
There's no error handling in getBlob, but it's easy to extend. Time and space complexity is quite good, as long as the resulting list is used carefully. (The python script that creates some random data for consumption by the above is here).

How to parse a large XML file in Haskell with limited amount of resources?

I want to extract information from a large XML file (around 20G) in Haskell. Since it is a large file, I used SAX parsing functions from hexpat.
Here is some simple code I tested:
import qualified Data.ByteString.Lazy as L
import Text.XML.Expat.SAX as Sax
parse :: FilePath -> IO ()
parse path = do
  inputText <- L.readFile path
  let saxEvents = Sax.parse defaultParseOptions inputText :: [SAXEvent Text Text]
  let txt = foldl' processEvent "" saxEvents
  putStrLn txt
After activating profiling in Cabal, it says that parse.saxEvents took 85% of allocated memory. I also used foldr and the result is the same.
If processEvent becomes complex enough, the program crashes with a stack space overflow error.
What am I doing wrong?
You don't say what processEvent is like. In principle, it ought to be unproblematic to use lazy ByteString for a strict left fold over lazily generated input, so I'm not sure what is going wrong in your case. But one ought to use streaming-appropriate types when dealing with gigantic files!
In fact, hexpat does have a 'streaming' interface (just like xml-conduit). It uses the not-too-well-known List library and the rather ugly List class it defines. In principle the ListT type from the List package should work well. I gave up quickly because of a lack of combinators, and wrote an appropriate instance of the ugly List class for a wrapped version of Pipes.ListT, which I then used to export ordinary Pipes.Producer functions like parseProducer. The trivial manipulations needed for this are appended below as PipesSax.hs
Once we have parseProducer we can convert a ByteString or Text Producer into a Producer of SAXEvents with Text or ByteString components. Here are some simple operations. I was using a 238M "input.xml"; the programs never need more than 6 MB of memory, to judge from looking at top.
-- Sax.hs  Most of the IO actions use a registryIds pipe defined at the bottom, which is
-- tailored to a giant bit of xml of which this is a valid 1000-line fragment http://sprunge.us/WaQK

{-#LANGUAGE OverloadedStrings #-}
import PipesSax ( parseProducer )
import Data.ByteString ( ByteString )
import Text.XML.Expat.SAX
import Pipes -- cabal install pipes pipes-bytestring
import Pipes.ByteString (toHandle, fromHandle, stdin, stdout )
import qualified Pipes.Prelude as P
import qualified System.IO as IO
import qualified Data.ByteString.Char8 as Char8

sax :: MonadIO m => Producer ByteString m ()
               -> Producer (SAXEvent ByteString ByteString) m ()
sax = parseProducer defaultParseOptions

-- stream xml from stdin, yielding hexpat tagstream to stdout
main0 :: IO ()
main0 = runEffect $ sax stdin >-> P.print

-- stream the extracted 'IDs' from stdin to stdout
main1 :: IO ()
main1 = runEffect $ sax stdin >-> registryIds >-> stdout

-- write all IDs to a file
main2 =
  IO.withFile "input.xml" IO.ReadMode $ \inp ->
    IO.withFile "output.txt" IO.WriteMode $ \out ->
      runEffect $ sax (fromHandle inp) >-> registryIds >-> toHandle out

-- folds:
-- print number of IDs
main3 = IO.withFile "input.xml" IO.ReadMode $ \inp ->
          do n <- P.length $ sax (fromHandle inp) >-> registryIds
             print n

-- sum the meaningful part of the IDs - a dumb fold for illustration
main4 = IO.withFile "input.xml" IO.ReadMode $ \inp ->
          do let pipeline = sax (fromHandle inp) >-> registryIds >-> P.map readIntId
             n <- P.fold (+) 0 id pipeline
             print n
  where
    readIntId :: ByteString -> Integer
    readIntId = maybe 0 (fromIntegral.fst) . Char8.readInt . Char8.drop 2

-- my xml has tags with attributes that appear via hexpat thus:
-- StartElement "FacilitySite" [("registryId","110007915364")]
-- and the like. This is just an arbitrary demo stream manipulation.
registryIds :: Monad m => Pipe (SAXEvent ByteString ByteString) ByteString m ()
registryIds = do
  e <- await -- we look for a 'SAXEvent'
  case e of  -- if it matches, we yield, else we go to the next event
    StartElement "FacilitySite" [("registryId",a)] -> do yield a
                                                         yield "\n"
                                                         registryIds
    _ -> registryIds
-- 'library': PipesSax.hs
This just newtypes Pipes.ListT to get the appropriate instances. We don't export anything to do with List or ListT but just use the standard Pipes.Producer concept.
{-#LANGUAGE TypeFamilies, GeneralizedNewtypeDeriving #-}
module PipesSax (parseProducerLocations, parseProducer) where
import Data.ByteString (ByteString)
import Text.XML.Expat.SAX
import Data.List.Class
import Control.Monad
import Control.Applicative
import Pipes
import qualified Pipes.Internal as I
parseProducer
  :: (Monad m, GenericXMLString tag, GenericXMLString text)
  => ParseOptions tag text
  -> Producer ByteString m ()
  -> Producer (SAXEvent tag text) m ()
parseProducer opt = enumerate . enumerate_
                    . parseG opt
                    . Select_ . Select

parseProducerLocations
  :: (Monad m, GenericXMLString tag, GenericXMLString text)
  => ParseOptions tag text
  -> Producer ByteString m ()
  -> Producer (SAXEvent tag text, XMLParseLocation) m ()
parseProducerLocations opt =
  enumerate . enumerate_ . parseLocationsG opt . Select_ . Select

newtype ListT_ m a = Select_ { enumerate_ :: ListT m a }
  deriving (Functor, Monad, MonadPlus, MonadIO
           , Applicative, Alternative, Monoid, MonadTrans)

instance Monad m => List (ListT_ m) where
  type ItemM (ListT_ m) = m
  joinL = Select_ . Select . I.M . liftM (enumerate . enumerate_)
  runList = liftM emend . next . enumerate . enumerate_
    where
      emend (Right (a,q)) = Cons a (Select_ (Select q))
      emend _ = Nil
