Binary file writer adds extra byte

Binary file writer adds extra byte - haskell

I'm building a Conduit that writes a binary file consisting of a header followed by a Double matrix as a row-ordered list. Here's the code:
import Conduit ((.|), ConduitM, mapC, sinkFileBS, yield)
import Control.Monad.Trans.Except (ExceptT)
import Control.Monad.Trans.Resource (ResourceT)
import Data.ByteString (ByteString)
import Data.ByteString.Conversion (toByteString')
import Data.Serialize.IEEE754 (putFloat64be)
import Data.Serialize.Put (putListOf, runPut)
import Data.Void (Void)
import Numeric.LinearAlgebra.Data ((><), Matrix, toLists)
import System.FilePath (FilePath)
type FileWriter = ResourceT (ExceptT String IO)
matrixSink :: FilePath -> ConduitM (Matrix Double) Void FileWriter ()
matrixSink path = byteBuilder .| sinkFileBS path where
byteBuilder = do
yield $ toByteString' "header"
mapC fromDoubleMatrix
fromDoubleMatrix :: Matrix Double -> ByteString
fromDoubleMatrix matrix = runPut $
putListOf putFloat64be (concat toLists matrix)
This almost works. If I test it using
runExceptT . runConduitRes $ yield matrix .| matrixSink "test.dat"
where matrix = (2 >< 2) [1, 2, 3, 4]
I get the expected file but with an extra byte between the header and the list of doubles. When displayed using show the extra byte looks like this:
"\NUL\NUL\NUL\NUL\NUL\NUL\NUL\t"
Any idea how not to print this byte? Or if it's a canonical separator or something (so that I can ignore it in the reader)?
EDIT: The problem seems to occur in the putListOf construction in fromDoubleMatrix.

putListOf :: Putter a -> Putter [a]
putListOf pa = \l -> do
putWord64be (fromIntegral (length l))
mapM_ pa l
putListOf encodes the length of the list before encoding the individual list elements. I think maybe you are dealing with fixed 2x2 matrices so you don't need that length, and you just want:
fromDoubleMatrix :: Matrix Double -> ByteString
fromDoubleMatrix matrix = runPut $
mapM_ putFloat64be (concat toLists matrix)

Related

How to convert a ByteString value to a JSVal

In the module GHCJS.DOM.JSFFI.Generated.CanvasRenderingContext2D there is the function putImageData with the following type:
putImageData ::
Control.Monad.IO.Class.MonadIO m =>
CanvasRenderingContext2D
-> Maybe GHCJS.DOM.Types.ImageData -> Float -> Float -> m ()
The second parameter has the type Maybe GHCJS.DOM.Types.ImageData.
This type is defined in the module GHCJS.DOM.Types as a newtype wrapper around a JSVal value:
newtype ImageData = ImageData {unImageData :: GHCJS.Prim.JSVal}
I have a value of type ByteString that has always 4 bytes with the RGBA values of each pixel. How to I convert my ByteString value to a GHCJS.Prim.JSVal?

Edit: Looks like my original answer was too GHC centric. Added an untested fix that might work for GHCJS.
Edit #2: Added my stack.yaml file for the example.
You can use GHCJS.DOM.ImageData.newImageData to construct the ImageData object. It requires the data to be a GHCJS.DOM.Types.Uint8ClampedArray (which is a byte array in RGBA format).
There are conversion functions in GHCJS.Buffer from ByteStrings to Buffers (via fromByteString) and from there to typed arrays (e.g., getUint8Array). They do the conversion directly under GHCJS, and even under plain GHC they use a base64 conversion as an intermediary which should be pretty fast. Unfortunately, the conversion function getUint8ClampedArray isn't included (and for plain GHC, it looks like fromByteString might be broken anyway -- in jsaddle 0.8.3.0, it's calling the wrong JavaScript helper function).
For plain GHC, the following seems to work (the first line is copied from fromByteString with the helper renamed from the apparently incorrect h$newByteArrayBase64String):
uint8ClampedArrayFromByteString :: ByteString -> GHCJSPure (Uint8ClampedArray)
uint8ClampedArrayFromByteString bs = GHCJSPure $ do
buffer <- SomeBuffer <$> jsg1 "h$newByteArrayFromBase64String"
(decodeUtf8 $ B64.encode bs)
arrbuff <- ghcjsPure (getArrayBuffer (buffer :: MutableBuffer))
liftDOM (Uint8ClampedArray <$> new (jsg "Uint8ClampedArray") [pToJSVal arrbuff])
Here is an untested GHCJS version that may work. If they fix the above-mentioned jsaddle bug, it should work under plain GHC, too:
uint8ClampedArrayFromByteString :: ByteString -> GHCJSPure (Uint8ClampedArray)
uint8ClampedArrayFromByteString bs = GHCJSPure $ do
(buffer,_,_) <- ghcjsPure (fromByteString bs)
buffer' <- thaw buffer
arrbuff <- ghcjsPure (getArrayBuffer buffer')
liftDOM (Uint8ClampedArray <$> new (jsg "Uint8ClampedArray") [pToJSVal arrbuff])
I don't have a running GHCJS installation, but here's a complete working example I tested using JSaddle+Warp under plain GHC which seems to work okay (i.e., if you point a browser at localhost:6868, it displays a 3x4 image on the canvas element):
module Main where
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.Text.Encoding (decodeUtf8)
import qualified Data.ByteString.Base64 as B64 (encode)
import Language.Javascript.JSaddle (js, js1, jss, jsg, jsg1,
new, pToJSVal, GHCJSPure(..), ghcjsPure, JSM,
fromJSVal, toJSVal, Object)
import Language.Javascript.JSaddle.Warp (run)
import JSDOM.Types (liftDOM, Uint8ClampedArray(..), RenderingContext(..))
import JSDOM.ImageData
import JSDOM.HTMLCanvasElement
import JSDOM.CanvasRenderingContext2D
import GHCJS.Buffer (getArrayBuffer, MutableBuffer)
import GHCJS.Buffer.Types (SomeBuffer(..))
import Control.Lens ((^.))
main :: IO ()
main = run 6868 $ do
let smallImage = BS.pack [0xff,0x00,0x00,0xff, 0xff,0x00,0x00,0xff, 0xff,0x00,0x00,0xff,
0x00,0x00,0x00,0xff, 0x00,0xff,0x00,0xff, 0x00,0x00,0x00,0xff,
0x00,0x00,0xff,0xff, 0x00,0x00,0xff,0xff, 0x00,0x00,0xff,0xff,
0x00,0x00,0xff,0xff, 0x00,0x00,0x00,0xff, 0x00,0x00,0xff,0xff]
img <- makeImageData 3 4 smallImage
doc <- jsg "document"
doc ^. js "body" ^. jss "innerHTML" "<canvas id=c width=10 height=10></canvas>"
Just canvas <- doc ^. js1 "getElementById" "c" >>= fromJSVal
Just ctx <- getContext canvas "2d" ([] :: [Object])
let ctx' = CanvasRenderingContext2D (unRenderingContext ctx)
putImageData ctx' img 3 4
return ()
uint8ClampedArrayFromByteString :: ByteString -> GHCJSPure (Uint8ClampedArray)
uint8ClampedArrayFromByteString bs = GHCJSPure $ do
buffer <- SomeBuffer <$> jsg1 "h$newByteArrayFromBase64String"
(decodeUtf8 $ B64.encode bs)
arrbuff <- ghcjsPure (getArrayBuffer (buffer :: MutableBuffer))
liftDOM (Uint8ClampedArray <$> new (jsg "Uint8ClampedArray") [pToJSVal arrbuff])
makeImageData :: Int -> Int -> ByteString -> JSM ImageData
makeImageData width height dat
= do dat' <- ghcjsPure (uint8ClampedArrayFromByteString dat)
newImageData dat' (fromIntegral width) (Just (fromIntegral height))
To build this, I used the following stack.yaml:
resolver: lts-8.12
extra-deps:
- ghcjs-dom-0.8.0.0
- ghcjs-dom-jsaddle-0.8.0.0
- jsaddle-0.8.3.0
- jsaddle-warp-0.8.3.0
- jsaddle-dom-0.8.0.0
- ref-tf-0.4.0.1

As K.A. Buhr pointed out, after converting the ByteString to a Uint8ClampedArray, you can pass the clamped array to newImageData to get the desired ImageData object.
You can use an inline Javascript function to generate the Uint8ClampedArray. To pass a ByteString through the Javascript FFI, use Data.ByteString.useAsCStringLen .
The code below shows how to do this.
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE JavaScriptFFI #-}
{-# LANGUAGE CPP #-}
import Reflex.Dom
import Data.Monoid ((<>))
import Control.Monad.IO.Class (liftIO)
import GHCJS.DOM.ImageData (newImageData)
import GHCJS.DOM.HTMLCanvasElement (getContext)
import GHCJS.DOM.JSFFI.Generated.CanvasRenderingContext2D (putImageData)
import GHCJS.DOM.Types (CanvasRenderingContext2D(..), castToHTMLCanvasElement, Uint8ClampedArray(..))
import Foreign.Ptr (Ptr)
import GHCJS.Types (JSVal)
import GHCJS.Marshal.Pure (pFromJSVal, pToJSVal)
import Data.Map (Map)
import Data.Text as T (Text, pack)
import Data.ByteString as BS (ByteString, pack, useAsCStringLen)
-- Some code and techniques taken from these sites:
-- http://lpaste.net/154691
-- https://www.snip2code.com/Snippet/1032978/Simple-Canvas-Example/
-- import inline Javascript code as Haskell function : jsUint8ClampedArray
foreign import javascript unsafe
-- Arguments
-- pixels : Ptr a -- Pointer to a ByteString
-- len : JSVal -- Number of pixels
"(function(){ return new Uint8ClampedArray($1.u8.slice(0, $2)); })()"
jsUint8ClampedArray :: Ptr a -> JSVal -> IO JSVal
-- takes pointer and length arguments as passed by useAsCStringLen
newUint8ClampedArray :: (Ptr a, Int) -> IO Uint8ClampedArray
newUint8ClampedArray (pixels, len) =
pFromJSVal <$> jsUint8ClampedArray pixels (pToJSVal len)
canvasAttrs :: Int -> Int -> Map T.Text T.Text
canvasAttrs w h = ("width" =: T.pack (show w))
<> ("height" =: T.pack (show h))
main = mainWidget $ do
-- first, generate some test pixels
let boxWidth = 120
boxHeight = 30
boxDataLen = boxWidth*boxHeight*4 -- 4 bytes per pixel
reds = take boxDataLen $ concat $ repeat [0xff,0x00,0x00,0xff]
greens = take boxDataLen $ concat $ repeat [0x00,0xff,0x00,0xff]
blues = take boxDataLen $ concat $ repeat [0x00,0x00,0xff,0xff]
pixels = reds ++ greens ++ blues
image = BS.pack pixels -- create a ByteString with the pixel data.
-- create Uint8ClampedArray representation of pixels
imageArray <- liftIO $ BS.useAsCStringLen image newUint8ClampedArray
let imageWidth = boxWidth
imageHeight = (length pixels `div` 4) `div` imageWidth
-- use Uint8ClampedArray representation of pixels to create ImageData
imageData <- newImageData (Just imageArray) (fromIntegral imageWidth) (fromIntegral imageHeight)
-- demonstrate the imageData is what we expect by displaying it.
(element, _) <- elAttr' "canvas" (canvasAttrs 300 200) $ return ()
let canvasElement = castToHTMLCanvasElement(_element_raw element)
elementContext <- getContext canvasElement ("2d" :: String)
let renderingContext = CanvasRenderingContext2D elementContext
putImageData renderingContext (Just imageData) 80 20
Here's a link to a repository with the example code: https://github.com/dc25/stackOverflow__how-to-convert-a-bytestring-value-to-a-jsval
Here's a link to a live demo : https://dc25.github.io/stackOverflow__how-to-convert-a-bytestring-value-to-a-jsval/

You can use hoogle to find a function by it's type signature ByteString -> GHCJS.Prim.JSVal. https://www.stackage.org/lts-8.11/hoogle?q=ByteString+-%3E+GHCJS.Prim.JSVal
Which has this in the results:
https://www.stackage.org/haddock/lts-8.11/ghcjs-base-stub-0.1.0.2/GHCJS-Prim.html#v:toJSString
toJSString :: String -> JSVal
So now you just need a function to do ByteString -> String.

Efficient streaming and manipulation of a byte stream in Haskell

While writing a deserialiser for a large (<bloblength><blob>)* encoded binary file I got stuck with the various Haskell produce-transform-consume libraries. So far I'm aware of four streaming libraries:
Data.Conduit: Widely used, has very careful resource management
Pipes: Similar to conduit (Haskell Cast #6 nicely reveals the differences between conduit and pipes)
Data.Binary.Get: Offers useful functions such as getWord32be, but the streaming example is awkward
System.IO.Streams: Seems to be the easiest one to use
Here's a stripped down example of where things go wrong when I try to do Word32 streaming with conduit. A slightly more realistic example would first read a Word32 that determines the blob length and then yield a lazy ByteString of that length (which is then deserialised further).
But here I just try to extract Word32's in streaming fashion from a binary file:
module Main where
-- build-depends: bytestring, conduit, conduit-extra, resourcet, binary
import Control.Monad.Trans.Resource (MonadResource, runResourceT)
import qualified Data.Binary.Get as G
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Lazy as BL
import Data.Conduit
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import Data.Word (Word32)
import System.Environment (getArgs)
-- gets a Word32 from a ByteString.
getWord32 :: C.ByteString -> Word32
getWord32 bs = do
G.runGet G.getWord32be $ BL.fromStrict bs
-- should read BytesString and return Word32
transform :: (Monad m, MonadResource m) => Conduit BS.ByteString m Word32
transform = do
mbs <- await
case mbs of
Just bs -> do
case C.null bs of
False -> do
yield $ getWord32 bs
leftover $ BS.drop 4 bs
transform
True -> return ()
Nothing -> return ()
main :: IO ()
main = do
filename <- fmap (!!0) getArgs -- should check length getArgs
result <- runResourceT $ (CB.sourceFile filename) $$ transform =$ CL.consume
print $ length result -- is always 8188 for files larger than 32752 bytes
The output of the program is just the number of Word32's that were read. It turns out the stream terminates after reading the first chunk (about 32KiB). For some reason mbs is never Nothing, so I must check null bs which stops the stream when the chunk is consumed. Clearly, my conduit transform is faulty. I see two routes to a solution:
The await doesn't want to go to the second chunk of the ByteStream, so is there another function that pulls the next chunk? In examples I've seen (e.g. Conduit 101) this is not how it's done
This is just the wrong way to set up transform.
How is this done properly? Is this the right way to go? (Performance does matter.)
Update: Here's a BAD way to do it using Systems.IO.Streams:
module Main where
import Data.Word (Word32)
import System.Environment (getArgs)
import System.IO (IOMode (ReadMode), openFile)
import qualified System.IO.Streams as S
import System.IO.Streams.Binary (binaryInputStream)
import System.IO.Streams.List (outputToList)
main :: IO ()
main = do
filename : _ <- getArgs
h <- openFile filename ReadMode
s <- S.handleToInputStream h
i <- binaryInputStream s :: IO (S.InputStream Word32)
r <- outputToList $ S.connect i
print $ last r
'Bad' means: Very demanding in time and space, does not handle Decode exception.

Your immediate problem is caused by how you are using leftover. That function is used to "Provide a single piece of leftover input to be consumed by the next component in the current monadic binding", and so when you give it bs before looping with transform you are effectively throwing away the rest of the bytestring (i.e. what is after bs).
A correct solution based on your code would use the incremental input interface of Data.Binary.Get to replace your yield/leftover combination with something that consumes each chunk fully. A more pragmatic approach, though, is using the binary-conduit package, which provides that in the shape of conduitGet (its source gives a good idea of what a "manual" implementation would look like):
import Data.Conduit.Serialization.Binary
-- etc.
transform :: (Monad m, MonadResource m) => Conduit BS.ByteString m Word32
transform = conduitGet G.getWord32be
One caveat is that this will throw a parse error if the total number of bytes is not a multiple of 4 (i.e. the last Word32 is incomplete). In the unlikely case of that not being what you want, a lazy way out would be simply using \bs -> C.take (4 * truncate (C.length bs / 4)) bs on the input bytestring.

With pipes (and pipes-group and pipes-bytestring) the demo problem reduces to combinators. First we resolve the incoming undifferentiated byte stream into little 4 byte chunks:
chunksOfStrict :: (Monad m) => Int -> Producer ByteString m r -> Producer ByteString m r
chunksOfStrict n = folds mappend mempty id . view (Bytes.chunksOf n)
then we map these to Word32s and (here) count them.
main :: IO ()
main = do
filename:_ <- getArgs
IO.withFile filename IO.ReadMode $ \h -> do
n <- P.length $ chunksOfStrict 4 (Bytes.fromHandle h) >-> P.map getWord32
print n
This will fail if we have less than 4 bytes or otherwise fail to parse but we can as well map with
getMaybeWord32 :: ByteString -> Maybe Word32
getMaybeWord32 bs = case G.runGetOrFail G.getWord32be $ BL.fromStrict bs of
Left r -> Nothing
Right (_, off, w32) -> Just w32
The following program will then print the parses for the valid 4 byte sequences
main :: IO ()
main = do
filename:_ <- getArgs
IO.withFile filename IO.ReadMode $ \h -> do
runEffect $ chunksOfStrict 4 (Bytes.fromHandle h)
>-> P.map getMaybeWord32
>-> P.concat -- here `concat` eliminates maybes
>-> P.print
There are other ways of dealing with failed parses, of course.
Here, though, is something closer to the program you asked for. It takes a four byte segment from a byte stream (Producer ByteString m r) and reads it as a Word32 if it is long enough; it then takes that many of the incoming bytes and accumulates them into a lazy bytestring, yielding it. It just repeats this until it runs out of bytes. In main below, I print each yielded lazy bytestring that is produced:
module Main (main) where
import Pipes
import qualified Pipes.Prelude as P
import Pipes.Group (folds)
import qualified Pipes.ByteString as Bytes ( splitAt, fromHandle, chunksOf )
import Control.Lens ( view ) -- or Lens.Simple (view) -- or Lens.Micro ((.^))
import qualified System.IO as IO ( IOMode(ReadMode), withFile )
import qualified Data.Binary.Get as G ( runGet, getWord32be )
import Data.ByteString ( ByteString )
import qualified Data.ByteString.Lazy.Char8 as BL
import System.Environment ( getArgs )
splitLazy :: (Monad m, Integral n) =>
n -> Producer ByteString m r -> m (BL.ByteString, Producer ByteString m r)
splitLazy n bs = do
(bss, rest) <- P.toListM' $ view (Bytes.splitAt n) bs
return (BL.fromChunks bss, rest)
measureChunks :: Monad m => Producer ByteString m r -> Producer BL.ByteString m r
measureChunks bs = do
(lbs, rest) <- lift $ splitLazy 4 bs
if BL.length lbs /= 4
then rest >-> P.drain -- in fact it will be empty
else do
let w32 = G.runGet G.getWord32be lbs
(lbs', rest') <- lift $ splitLazy w32 bs
yield lbs
measureChunks rest
main :: IO ()
main = do
filename:_ <- getArgs
IO.withFile filename IO.ReadMode $ \h -> do
runEffect $ measureChunks (Bytes.fromHandle h) >-> P.print
This is again crude in that it uses runGet not runGetOrFail, but this is easily repaired. The pipes standard procedure would be to stop the stream transformation on a failed parse and return the unparsed bytestream.
If you were anticipating that the Word32s were for large numbers, so that you did not want to accumulate the corresponding stream of bytes as a lazy bytestring, but say write them to different files without accumulating, we could change the program pretty easily to do that. This would require a sophisticated use of conduit but is the preferred approach with pipes and streaming.

Here's a relatively straightforward solution that I want to throw into the ring. It's a repeated use of splitAt wrapped into a State monad that gives an interface identical to (a subset of) Data.Binary.Get. The resulting [ByteString] is obtained in main with a whileJust over getBlob.
module Main (main) where
import Control.Monad.Loops
import Control.Monad.State
import qualified Data.Binary.Get as G (getWord32be, runGet)
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)
import Data.Word (Word32)
import System.Environment (getArgs)
-- this is going to mimic the Data.Binary.Get.Get Monad
type Get = State BL.ByteString
getWord32be :: Get (Maybe Word32)
getWord32be = state $ \bs -> do
let (w, rest) = BL.splitAt 4 bs
case BL.length w of
4 -> (Just w', rest) where
w' = G.runGet G.getWord32be w
_ -> (Nothing, BL.empty)
getLazyByteString :: Int64 -> Get BL.ByteString
getLazyByteString n = state $ \bs -> BL.splitAt n bs
getBlob :: Get (Maybe BL.ByteString)
getBlob = do
ml <- getWord32be
case ml of
Nothing -> return Nothing
Just l -> do
blob <- getLazyByteString (fromIntegral l :: Int64)
return $ Just blob
runGet :: Get a -> BL.ByteString -> a
runGet g bs = fst $ runState g bs
main :: IO ()
main = do
fname <- head <$> getArgs
bs <- BL.readFile fname
let ls = runGet loop bs where
loop = whileJust getBlob return
print $ length ls
There's no error handling in getBlob, but it's easy to extend. Time and space complexity is quite good, as long as the resulting list is used carefully. (The python script that creates some random data for consumption by the above is here).

Creating a random permutation of 1..N with Data.Vector.Unboxed.Mutable

I want to create a list containing a random permutation of the numbers 1 through N. As I understand it, it is possible to use VUM.swap in the runST, but since I need random numbers as well I figured I might do both in the IO monad.
The code below yields:
Expected type: IO (VU.Vector Int), Actual type: IO (VU.Vector
(VU.Vector a0))
for the return statement.
import qualified Data.Vector.Unboxed as VU
import qualified Data.Vector.Unboxed.Mutable as VUM
import System.Random
randVector :: Int -> IO (VU.Vector Int)
randVector n = do
vector <- VU.unsafeThaw $ VU.enumFromN 1 n
VU.forM_ (VU.fromList [2..VUM.length vector]) $ \i -> do
j <- randomRIO(0, i) :: IO Int
VUM.swap vector i j
return $ VU.unsafeFreeze vector
I'm not quite sure why the return vector is nested. Do I have to use VU.fold1M_ instead?

unsafeFreeze vector already returns IO (VU.Vector Int). Just change the last line to VU.unsafeFreeze vector.
On another note, you should iterate until VUM.length vector - 1, since both [x .. y] and randomRIO use inclusive ranges. Also, you can use plain forM_ here for iteration, since you only care about side effects.
import Control.Monad
import qualified Data.Vector.Unboxed as VU
import qualified Data.Vector.Unboxed.Mutable as VUM
import System.Random
randVector :: Int -> IO (VU.Vector Int)
randVector n = do
vector <- VU.unsafeThaw $ VU.enumFromN 1 n
forM_ [2..VUM.length vector - 1] $ \i -> do
j <- randomRIO(0, i) :: IO Int
VUM.swap vector i j
VU.unsafeFreeze vector
I looked at the generated code, and it seems that with GHC 7.10.3 forM_ compiles to an efficient loop while VU.forM_ retains the intermediate list and is surely significantly slower (which was my expected outcome for forM_, but I was unsure about VU.forM_).

I would try (note update at end):
import Control.Monad
randVector :: Int -> IO (VU.Vector Int)
randVector n = do
vector <- VU.unsafeThaw $ VU.enumFromN 1 n
forM_ [2..VUM.length vector] $ \i -> do
j <- randomRIO(0, i) :: IO Int
VUM.swap vector i j
return $ VU.unsafeFreeze vector
Edit: as #András Kovács pointed out, you don't want the return at the end so the last line should be:
VU.unsafeFreeze vector

Haskell: using a Vector of Vectors instead of a nested list when parsing a CSV file

I need to parse and process a text file that is a nested list of integer. The file is about 250mb large. This already leads to performace problems my naive solution takes 20GB or more of RAM.
The question is related to another question.
I have written about the memory problems and the suggestion was to use Data.Vector to get rtid of the memory problems.
So the goal is to process a nested list of integers and, say, filter the values so that only values larger than 30 get printed out.
Test file "myfile.tx":
11,22,33,44,55
66,77,88,99,10
Here is my code using Attoparsec, adapted from attoparsec-csv:
{-# Language OverloadedStrings #-}
-- adapted from https://github.com/robinbb/attoparsec-csv
module Text.ParseCSV
(
parseCSV
) where
import Prelude hiding (concat, takeWhile)
import Control.Applicative ((<$>), (<|>), (<*>), (<*), (*>), many)
import Control.Monad (void, liftM)
import Data.Attoparsec.Text
import qualified Data.Text as T (Text, concat, cons, append, pack, lines)
import qualified Data.Text.IO as IO (readFile, putStr)
import qualified Data.ByteString.Char8 as BSCH (readInteger)
lineEnd :: Parser ()
lineEnd =
void (char '\n') <|> void (string "\r\n") <|> void (char '\r')
<?> "end of line"
parserInt :: Parser Integer
parserInt = (signed decimal)
record :: Parser [Integer]
record =
parserInt `sepBy1` char ','
<?> "record"
file :: Parser [[Integer]]
file =
(:) <$> record
<*> manyTill (lineEnd *> record)
(endOfInput <|> lineEnd *> endOfInput)
<?> "file"
parseCSV :: T.Text -> Either String [[Integer]]
parseCSV =
parseOnly file
getValues :: Either String [[Integer]] -> [Integer]
getValues (Right [x]) = x
getValues _ = []
getLines :: FilePath -> IO [T.Text]
getLines = liftM T.lines . IO.readFile
parseAndFilter :: T.Text -> [Integer]
parseAndFilter = ((\x -> filter (>30) x) . getValues . parseCSV)
main = do
list <- getLines "myfile.txt"
putStr $ show $ map parseAndFilter list
But instead of using a list [Integer] I would like to use Data.Vector.
I found a relevant part in the Data.Vector tutorial:
--The simplest way to parse a file of Int or Integer types is with a strict or lazy --ByteString, and the readInt or readInteger functions:
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Vector as U
import System.Environment
main = do
[f] <- getArgs
s <- L.readFile f
print . U.sum . parse $ s
-- Fill a new vector from a file containing a list of numbers.
parse = U.unfoldr step
where
step !s = case L.readInt s of
Nothing -> Nothing
Just (!k, !t) -> Just (k, L.tail t)
However, this is regular, not a nested list of integers.
I tried to adapt my code but it did not work.
How can I change my code to
use a nested Vector (or Vector of Vectors) instead of [Integer] (i.e., while also running the Filter of >30 on the Vector).

There is an important question you don't mention in the posting.... Do you need everything in memory at once. If the processing is local, or if you can summarize all the data up to a point in the file with a few values, you can solve the performance problems by streaming the data through and throwing away all but the current line. This will usually run way faster and allow you to process orders of magnitude larger files. And it usually doesn't even matter (as much) what data structure you use to parse the values.
Here is an example:
import Text.Regex
process::[Int]->String
process = (++"\n") . show . sum --put whatever you want here.
main = interact (concat . map (process . map read . splitRegex (mkRegex ",")) . lines)
The whole program runs lazily, so it processes line by line as the data comes in and frees up the memory for old data (you can check this by typing in data by hand and watch the output come out). There is a performance hit by using the unpacked structures, but this isn't as big a problem as pulling everything into memory.
Many problems that don't seem to fit this criteria at first can be modified to do so (you may have to sort the data first, but there are many performance effective ways to do this).... I rewrote the full online stats system for a gaming company once following this principle, and was able to take a stats crunching time from hours to a couple of minutes (with even more metrics).
Because of its lazy nature, Haskell is a good language to stream data through.

I found a post that there is no easy way to parse with attoparsec to a vector.
See this forum post and thread.
But the good new is that the overhead of Data.Vector.fromList isn't so bad.
Attoparsec seems to be quite fast for parsing.
I keep the whole data in memory and this doesn't seem a speed overhead. It's more flexible, as perhaps later I need to have the whole data in memory, altough currently it is not needed per se for my problem.
Currently the code runs in ~30 seconds and about 1.5GB RAM for a 150MB text file. Now the memory consumption is quite little versus 20GB of before and I only need to focus on improving the speed.
Here are the changes from the code of my question my post, commented out code is using lists, functions with Vector in the type are new (this is not production code or meant to be good code yet):
{-
getValues :: Either String [[Integer]] -> [Integer]
getValues (Right [x]) = x
getValues _ = []
-}
getValues :: Either String [[Integer]] -> Vector Integer
getValues (Right [x]) = V.fromList x
getValues _ = V.fromList [999999,9999999,99999,999999] --- represents an ERROR
getLines :: FilePath -> IO [T.Text]
getLines = liftM T.lines . IO.readFile
{-
parseAndFilter :: T.Text -> [Integer]
parseAndFilter = ((\x -> filter (>30) x) . getValues . parseCSV)
-}
filterLarger :: Vector Integer -> Vector Integer
filterLarger = \x -> V.filter (>37) x
parseVector :: T.Text -> Vector Integer
parseVector = (getValues . parseCSV)
-- mystr = T.pack "3, 6, 7" --, 13, 14, 15, 17, 21, 22, 23, 24, 25, 28, 29, 30, 32, 33, 35, 36"
main = do
list <- getLines "mydata.txt"
--putStr $ show $ parseCSV $ mystr
putStr $ show $ V.map filterLarger $ V.map parseVector $ V.fromList list
--show $ parseOnly parserInt $ T.pack "123"
Thanks to jamshidh and all the comments that pointed me to the right direction.
Here is the final solution. Switching to ByteString and Int in the code, it now runs twice as fast and a bit less memory consumtion (time is now ~14 Seconds).
{-# Language OverloadedStrings #-}
-- adapted from https://github.com/robinbb/attoparsec-csv
module Main
(
parseCSV, main
) where
import Data.Vector as V (Vector, fromList, map, head, filter)
import Prelude hiding (concat, takeWhile)
import Control.Applicative ((<$>), (<|>), (<*>), (<*), (*>), many)
import Control.Monad (void, liftM)
import Data.Attoparsec.Char8
import qualified Data.ByteString.Char8 as B
lineEnd :: Parser ()
lineEnd =
void (char '\n') <|> void (string "\r\n") <|> void (char '\r')
<?> "end of line"
parserInt :: Parser Int
parserInt = skipSpace *> signed decimal
record :: Parser [Int]
record =
parserInt `sepBy1` char ','
<?> "record"
file :: Parser [[Int]]
file =
(:) <$> record
<*> manyTill (lineEnd *> record)
(endOfInput <|> lineEnd *> endOfInput)
<?> "file"
parseCSV :: B.ByteString -> Either String [[Int]]
parseCSV =
parseOnly file
getValues :: Either String [[Int]] -> Vector Int
getValues (Right [x]) = V.fromList x
getValues _ = error "ERROR in getValues function!"
filterLarger :: Vector Int -> Vector Int
filterLarger = \x -> V.filter (>36) x
parseVector :: B.ByteString -> Vector Int
parseVector = (getValues . parseCSV)
-- MAIN
main = do
fContent <- B.readFile "myfile.txt"
putStr $ show $ V.map filterLarger $ V.map parseVector $ V.fromList $ B.lines fContent

Why Haskell alloc huge amounts of memory when working with strings?

I've written a program in Haskell which had to load and parse big text file in UTF8. The file represents a dictionary with key:value pairs on each line. In my program I want to have a Data.Map container for fast dictionary search. My file is about 40MB, but after loading it to my program 1.5 GB of RAM is used, and never freed. What did I do wrong? Is the memory usage expected?
Here is a code sample from my program:
module Main where
import Engine
import Codec.Archive.Zip
import Data.IORef
import System.IO
import System.Directory
import qualified System.IO.UTF8 as UTF8
import qualified Data.ByteString.Lazy as B
import qualified Data.ByteString.UTF8 as BsUtf
import qualified Data.Map as Map
import Graphics.UI.Gtk
import Graphics.UI.Gtk.Glade
maybeRead :: Read a => BsUtf.ByteString -> Maybe a
maybeRead s = case reads $ BsUtf.toString s of
[(x, "")] -> Just x
_ -> Nothing
parseToEntries :: [BsUtf.ByteString] -> [(BsUtf.ByteString, Int)]
parseToEntries [] = []
parseToEntries (x:xs) = let (key, svalue) = BsUtf.break (==':') x
value = maybeRead svalue
in case value of
Just x -> [(key, x)] ++ parseToEntries xs
Nothing -> parseToEntries xs
createDict :: BsUtf.ByteString -> IO (Map.Map BsUtf.ByteString Int)
createDict str = do
let entries = parseToEntries $ BsUtf.lines str
dict = Map.fromList entries
return (dict)
main :: IO ()
main = do
currFileName <- newIORef ""
dictZipFile <- B.readFile "data.db"
extractFilesFromArchive [] $ toArchive dictZipFile
dictFile <- UTF8.readFile "dict.txt"
dict <- createDict $ BsUtf.fromString dictFile
...
searchAccent :: Map.Map BsUtf.ByteString Int -> String -> Int
searchAccent dict word = let sword = BsUtf.fromString $ map toLower word
entry = Map.lookup sword dict
in case entry of
Nothing -> -1
Just match -> 0

Quick answer.
Main problem is that System.IO.UTF8.readFile reads file into String.
Supposed bottleneck is here:
dictFile <- UTF8.readFile "dict.txt"
dict <- createDict $ BsUtf.fromString dictFile
When dealing with UTF-8 text it is better to use Data.Text instead of ByteString.
Try something like this:
import qualified Data.Text.Lazy as LT
import qualified Data.Text.Lazy.Encoding as LT
...
dictFile <- B.readFile "dict.txt"
dict <- createDict $ LT.decodeUtf8 dictFile
Another bottleneck is parsing numbers: you are converting ByteString to String and then read it.
It's better to use Data.Text.Lazy.Read:
import qualified Data.Text.Lazy.Read as LT
maybeRead :: LT.Text -> Maybe Int
maybeRead s = case LT.decimal s of
Left _ -> Nothing
Right i -> Just i

The Haskell String type is an indirect (because of laziness) linked list of characters; it is extremely wasteful space-wise. You may wish to try Data.Text (from http://hackage.haskell.org/package/text) instead, for large amounts of text.
(edit now that source is up I see the strings are lazy ByteString instead of String, so this is not relevant. Profiling is the next step.)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string