Having my cereal and parsing it too - haskell

I'm using Data.Serialize.Get and am trying to define the following combinator:
getConsumed :: Get a -> Get (ByteString, a)
which should act like the passed-in Get action, but also return the ByteString that the Get consumed. The use case is that I have a binary structure that I need to both parse and hash, and I don't know the length before parsing it.
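Sketch of the intended use (my illustration, not code I have; Record, parseRecord, and the choice of cryptonite's SHA256 are stand-ins):
import Crypto.Hash (Digest, SHA256, hash)   -- from cryptonite (assumed)
import qualified Data.ByteString as B
import Data.Serialize.Get

-- Stand-in for the real binary structure: a length-prefixed payload.
data Record = Record B.ByteString deriving Show

parseRecord :: Get Record
parseRecord = do
  n <- getWord8
  Record <$> getBytes (fromIntegral n)

-- Parse the structure and hash exactly the bytes it occupied,
-- without knowing that length up front.
getHashedRecord :: Get (Digest SHA256, Record)
getHashedRecord = do
  (bytes, rec) <- getConsumed parseRecord
  return (hash bytes, rec)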
This combinator, despite its simple semantics, is proving surprisingly tricky to implement.
Without delving into the internals of Get, my instinct was to use this monstrosity:
getConsumed :: Get a -> Get (B.ByteString, a)
getConsumed g = do
  (len, r) <- lookAhead $ do
    before <- remaining
    res <- g
    after <- remaining
    return (before - after, res)
  bs <- getBytes len
  return (bs, r)
This uses lookAhead to peek at the number of remaining bytes before and after running the action, returns the action's result, and then consumes that many bytes. It shouldn't duplicate any work, but it occasionally fails with:
*** Exception: GetException "Failed reading: getBytes: negative length requested\nEmpty call stack\n"
so I must be misunderstanding something about cereal somewhere.
Does anyone see what's wrong with my definition of getConsumed or have a better idea for how to implement it?
Edit: Dan Doel points out that remaining can just return the remaining length of a given chunk, which isn't very useful if you cross a chunk boundary. I'm not sure what the point of the action is, in that case, but that explains why my code wasn't working! Now I just need to find a viable alternative.
Edit 2: after thinking about it some more, it seems like the fact that remaining gives me the length of the current chunk can be to my advantage if I feed the Get manually with individual chunks (remaining >>= getBytes) in a loop and keep track of what it's eating as I do it. I haven't managed to get this approach working either yet, but it seems more promising than the original one.
Edit 3: if anyone's curious, here's code from edit 2 above:
getChunk :: Get B.ByteString
getChunk = remaining >>= getBytes

getConsumed :: Get a -> Get (B.ByteString, a)
getConsumed g = do
    (len, res) <- lookAhead $ measure g
    bs <- getBytes len
    return (bs, res)
  where
    measure :: Get a -> Get (Int, a)
    measure g = do
      chunk <- getChunk
      measure' (B.length chunk) (runGetPartial g chunk)

    measure' :: Int -> Result a -> Get (Int, a)
    measure' !n (Fail e) = fail e
    measure' !n (Done r bs) = return (n - B.length bs, r)
    measure' !n (Partial f) = do
      chunk <- getChunk
      measure' (n + B.length chunk) (f chunk)
Unfortunately, it still seems to fail after a while on my sample input with:
*** Exception: GetException "Failed reading: too few bytes\nFrom:\tdemandInput\n\n\nEmpty call stack\n"

EDIT: Another solution, which does no extra computation!
getConsumed :: Get a -> Get (B.ByteString, a)
getConsumed g = do
  (len, r) <- lookAhead $ do
    (res, after) <- lookAhead $ liftM2 (,) g remaining
    total <- remaining
    return (total - after, res)
  bs <- getBytes len
  return (bs, r)
One solution is to call lookAhead twice. The first time makes sure that all necessary chunks are loaded, and the second performs the actual length computation (along with returning the deserialized data).
getConsumed :: Get a -> Get (B.ByteString, a)
getConsumed g = do
  _ <- lookAhead g -- Make sure all necessary chunks are preloaded
  (len, r) <- lookAhead $ do
    before <- remaining
    res <- g
    after <- remaining
    return (before - after, res)
  bs <- getBytes len
  return (bs, r)
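A quick sanity check for either version (my own test, not from the answer); note that with runGet the whole input arrives as a single chunk, so chunk boundaries don't come into play here:
import Data.Word (Word8)
import qualified Data.ByteString as B
import Data.Serialize.Get

-- With one of the getConsumed definitions above in scope:
-- consume one byte and also recover the raw byte that was consumed.
test :: Either String (B.ByteString, Word8)
test = runGet (getConsumed getWord8) (B.pack [7, 1, 2, 3])
-- expected: Right ("\a", 7), i.e. the single consumed byte plus the parsed value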

The Cereal package does not store enough information to simply implement what you want. I expect that your idea of using chunks might work, or perhaps a special runGet. Forking Cereal and using the internals is probably your easiest path.
Writing your own can work; that is what I did when making the protocol-buffers library. My custom Text.ProtocolBuffers.Get library does implement enough machinery to do what you want:
import Text.ProtocolBuffers.Get
import Control.Applicative
import qualified Data.ByteString as B

getConsumed :: Get a -> Get (B.ByteString, a)
getConsumed thing = do
  start <- bytesRead
  (a, stop) <- lookAhead ((,) <$> thing <*> bytesRead)
  bs <- getByteString (fromIntegral (stop - start))
  return (bs, a)
This is straightforward because my library tracks the number of bytes read (exposed as bytesRead). Otherwise the API is quite similar to cereal's.

Related

Haskell streaming - how to merge original stream with result stream

Using Haskell's streaming library, I can easily group a stream and take the sum of each group.
>>> S.print $ mapped S.toList $ S.groupBy (\ x y -> x*y>0) $ each [-1,-2,3,4,5,-6]
[-1,-2]
[3,4,5]
[-6]
>>> S.print $ S.map sum $ mapped S.toList $ S.groupBy (\ x y -> x*y>0) $ each [-1,-2,3,4,5,-6]
-3
12
-6
How can I write a function myfn that generates a stream merging the two above in an order-sensitive way? I.e., I wish to have a result stream of
>>> myfn $ each [-1,-2,3,4,5,-6]
-1:> -2:> -3:> 3:> 4:> 5:> 12:> -6:> -6:> ()
The solution involves making the function argument of mapped both accumulate the list and calculate the sum, in one pass.
That can be done with store I think, but I find the streaming sinks from foldl easier to use. Their Applicative instance lets us build composite Folds from simpler ones:
foo :: Monad m
    => (Int -> Int -> Bool)
    -> Stream (Of Int) m ()
    -> Stream (Of Int) m ()
foo p =
      flip S.for (\(xs, total) -> S.each xs *> S.yield total)
    . mapped (L.purely S.fold $ (,) <$> L.list <*> L.sum)
    . S.groupBy p
Where L.purely, L.list and L.sum are from "foldl".
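As a standalone illustration of that Applicative composition (my own toy example, not from the answer), two Folds fuse into a single pass over the input:
import qualified Control.Foldl as L

-- Collect the elements and their sum in one traversal.
listAndSum :: L.Fold Int ([Int], Int)
listAndSum = (,) <$> L.list <*> L.sum

-- L.fold listAndSum [3,4,5] == ([3,4,5], 12)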
The finishing touch is taking each pair ([Int],Int) coming out of mapped and replacing it with a substream using for.
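To see the for step in isolation, here is a toy example of my own (not from the answer): each (list, total) pair is replaced by a substream of its elements followed by the total.
import qualified Streaming.Prelude as S

demoFor :: IO ()
demoFor = S.print $ S.for (S.each [([1, 2], 3), ([4], 4 :: Int)])
                          (\(xs, total) -> S.each xs *> S.yield total)
-- prints 1, 2, 3, 4, 4 on separate lines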
Putting it to work:
*Main> S.print $ foo (\x y -> x*y>0) $ S.each [-1,-2,3,4,5,-6]
Edit: Come to think of it, the previous solution is flawed. We are only interested in a streamed result, yet we accumulate each individual group in memory using S.toList or L.list before sending it downstream. But what if one group happens to be bigger than the available memory in the machine?
Here's a solution that streams perfectly and is indifferent to the size of each group:
foo :: Monad m
    => (Int -> Int -> Bool)
    -> Stream (Of Int) m ()
    -> Stream (Of Int) m ()
foo p =
      concats
    . S.maps (S.store (\s -> do
        (total :> r) <- L.purely S.fold L.sum s
        S.yield total
        return r))
    . S.groupBy p
What has changed? First, we use maps instead of mapped, because now we want to transform the subgroup streams, instead of returning a result in the base monad.
For each subgroup stream, we use store to perform a summing fold without destroying the stream. Then we take the result of the fold and append it back to the stream, while also taking care of preserving the original return value as required by maps.
The only step left is to rejoin the subgroups using concats.

Haskell IO program that calculates mean and highest value

I am having trouble with this exercise. I have a solution (which is apparently wrong), but I do not understand why:
Write a program that reads integers from standard input, one per line, until a zero or negative value is read, and prints the average and the largest of the values read.
My code:
a6 :: Int -> Float -> Int -> Int -> IO ()
a6 cBigger average2 sum2 cCount = do
  c <- getLine
  let digit = read c :: Int
  let sum = sum2 + digit
  let average = fromIntegral sum2 / fromIntegral cCount
  if (digit <= 0)
    then putStrLn ("Bigger :" ++ show (cBigger) ++ "average " ++ show (cAverage))
    else
      if digit > cBigger
        then a6 digit average sum (cCount+1)
        else a6 cBigger average sum (cCount+1)
As I do not understand much Haskell yet, I am not sure how to fix it.
Although a little off topic, I thought I would comment on separation of concerns and modularity.
Usually, we try to keep the pure parts of the program separate from the impure (IO) parts.
We can read a list of Ints with impure code, and then process it with a pure function to find the max, sum, and length in order to compute the average.
Below, readInts reads Ints from stdin until it reads a non-positive value, returning the positive Ints in a list (in IO). maxSumLength takes the current maximum, sum, and length of the elements processed so far
as a tuple, and the next element to process, and returns a new tuple with the next element folded in. Finally, main reads the list of Ints, and applies a strict left fold (foldl') using maxSumLength and an initial state of (0, 0, 0) to compute the final maximum, sum, and length. It then prints the maximum and the average from the sum and length.
module Main where

import Data.List ( foldl' )

readInts :: IO [Int]
readInts = do
  i <- read <$> getLine
  if i <= 0
    then return []
    else (i:) <$> readInts

maxSumLength :: (Int, Int, Int) -> Int -> (Int, Int, Int)
maxSumLength (m, s, l) x = (max m x, s+x, l+1)

main :: IO ()
main = do
  (m, s, l) <- foldl' maxSumLength (0, 0, 0) <$> readInts
  putStrLn $ "max=" ++ show m ++ ", avg=" ++ show (fromIntegral s / fromIntegral l)
This code is more modular than before. We could reuse readInts in other programs that need a list of Ints. Also, the pure part of the algorithm no longer cares where the list of Ints comes from. However, there is a problem with this code. When written this way, the entire list has to be buffered in memory before the pure code can start processing it, even though the processing code can consume the input as it arrives.
This is where the conduit package can help. The conduit package allows a stream to be produced by an impure Source and connected to a pure Consumer, and allows the pure code to be interleaved with the impure code. The conduit-combinators package provides combinators that allow streams to be treated much like lists (in particular, foldlC allows us to perform a strict left fold over a conduit stream instead of a list).
In the code below, the readInts function is now a Source of Ints that runs in the IO monad. It uses the repeatWhileMC combinator to perform the looping and the termination test. The pure maxSumLength is unchanged; however, in main, rather than use foldl', we use foldlC to fold over the conduit stream.
module Main where

import Conduit

readInts :: Source IO Int
readInts = repeatWhileMC (read <$> getLine) (> 0)

maxSumLength :: (Int, Int, Int) -> Int -> (Int, Int, Int)
maxSumLength (m, s, l) x = (max m x, s+x, l+1)

main :: IO ()
main = do
  (m, s, n) <- runConduit (readInts =$= foldlC maxSumLength (0, 0, 0))
  putStrLn $ "max=" ++ show m ++ ", avg=" ++ show (fromIntegral s / fromIntegral n)
This code will interleave pure maxSumLength with the impure readInts so that the Ints are consumed as they are created, but without sacrificing modularity. The readInts stream can be used in other programs that need a stream of Ints, and the pure code still no longer cares where the Ints are coming from.
While not optimal, your program is almost working. Here are some minor fixes and cleanups. I tried to keep your original code where possible, even if there might be better solutions.
First, you are using cAverage which is not defined. This error can be easily fixed.
The average2 parameter is pointless, since it is unused -- let's remove it.
Some lets can be moved to the branches where we actually use those variables.
We can also perform a minor refactoring and compute the new bigger value using a conditional, instead of using a conditional to perform two different recursive calls. (It would be even better to use the max function, though.)
Consider renaming "bigger" to "biggest", "greatest", or "maximum"; it sounds better to me.
a6 :: Int -> Int -> Int -> IO ()
a6 bigger oldSum count = do
  c <- getLine
  let digit = read c :: Int
  if digit <= 0
    then let average = fromIntegral oldSum / fromIntegral count :: Double
         in putStrLn ("Bigger: " ++ show bigger ++ ", average: " ++ show average)
    else let newBigger = if digit > bigger then digit else bigger
             newSum = oldSum + digit
         in a6 newBigger newSum (count+1)
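Taking the max suggestion one step further (my variant of the code above, same behaviour), the whole else branch collapses into a single recursive call:
a6' :: Int -> Int -> Int -> IO ()
a6' bigger oldSum count = do
  c <- getLine
  let digit = read c :: Int
  if digit <= 0
    then let average = fromIntegral oldSum / fromIntegral count :: Double
         in putStrLn ("Bigger: " ++ show bigger ++ ", average: " ++ show average)
    else a6' (max digit bigger) (oldSum + digit) (count + 1)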

How to generate a list of repeated applications of a function to the previous result of it in IO context

As part of a solution for the problem I'm trying to solve, I need to generate a list of repeated applications of a function to its previous result. This sounds very much like the iterate function, except that iterate has the signature
iterate :: (a -> a) -> a -> [a]
and my function lives inside IO (I need to generate random numbers), so I'd need something more like:
iterate'::(a -> IO a) -> a -> [a]
I have looked on Hoogle, but without much success.
You can actually get a lazy iterate that works on infinite lists if you use the pipes library. The definition is really simple:
import Pipes

iterate' :: (a -> IO a) -> a -> Producer a IO r
iterate' f a = do
    yield a
    a2 <- lift (f a)
    iterate' f a2
For example, let's say that our step function is:
step :: Int -> IO Int
step n = do
    m <- readLn
    return (n + m)
Then applying iterate' to step generates a Producer that lazily prompts the user for input and generates the tally of values read so far:
iterate' step 0 :: Producer Int IO ()
The simplest way to read out the value is to loop over the Producer using for:
main = runEffect $
    for (iterate' step 0) $ \n -> do
        lift (print n)
The program then endlessly loops, requesting user input and displaying the current tally:
>>> main
0
10<Enter>
10
14<Enter>
24
5<Enter>
29
...
Notice how this gets two things correct which the other solutions do not:
It works on infinite lists (you don't need a termination condition)
It produces results immediately. It doesn't wait until you run the action on the entire list to start producing usable values.
However, we can easily filter results just like the other two solutions. For example, let's say I want to stop when the tally is greater than 100. I can just write:
import qualified Pipes.Prelude as P

main = runEffect $
    for (iterate' step 0 >-> P.takeWhile (< 100)) $ \n -> do
        lift (print n)
You can read that as saying: "Loop over the iterated values while they are less than 100. Print the output". Let's try it:
>>> main
0
10<Enter>
10
20<Enter>
30
75<Enter>
>>> -- Done!
In fact, pipes has another helper function for printing out values, so you can simplify the above to a pipeline:
main = runEffect $ iterate' step 0 >-> P.takeWhile (< 100) >-> P.print
This gives a clear view of the flow of information. iterate' produces a never-ending stream of Ints, P.takeWhile filters that stream, and P.print prints all values that reach the end.
If you want to learn more about the pipes library, I encourage you to read the pipes tutorial.
Your function lives in IO, so the signature is rather:
iterate'::(a -> IO a) -> a -> IO [a]
The problem is that the original iterate function returns an infinite list, so if you try to do the same in IO you will get an action that never ends. Maybe you should add a condition to end the iteration.
iterate' action value = do
  result <- action value
  if condition result
    then return []
    else do
      rest <- iterate' action result
      return $ result : rest
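To make the free condition above concrete, one option (my own variant, not the answer's code) is to pass the stopping test in as a parameter; the usage below matches the asker's random-number scenario:
import System.Random (randomRIO)

-- Same recursion as above, but the stopping test is an explicit parameter.
iterateUntil :: (a -> Bool) -> (a -> IO a) -> a -> IO [a]
iterateUntil stop action value = do
  result <- action value
  if stop result
    then return []
    else do
      rest <- iterateUntil stop action result
      return (result : rest)

-- Random walk upward, stopping once the running value exceeds 100.
walk :: IO [Int]
walk = iterateUntil (> 100) (\n -> (n +) <$> randomRIO (1, 10)) 0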
Firstly, your resulting list must be in the IO monad, so iterate' must produce an IO [a] rather than a plain [a].
Iterate can be defined as:
iterate :: (a -> a) -> a -> [a]
iterate f x = x : iterate f (f x)
so we could make an iterateM quite easily
iterateM :: Monad m => (a -> m a) -> m a -> [m a]
iterateM f x = x : iterateM f (x >>= f)
This still needs your seed value to be in the monad to start though, and also gives you a list of monadic things, rather than a monad of listy things.
So, let's change it a bit.
iterateM :: Monad m => (a -> m a) -> a -> m [a]
iterateM f x = sequence $ go f (return x)
  where
    go f x = x : go f (x >>= f)
However, this doesn't work. This is because sequence first runs every action and then returns. (You can see this if you write some safeDivide :: Double -> Double -> Maybe Double, and then try something like fmap (take 10) $ iterateM (flip safeDivide 2) 1000. You'll find it doesn't terminate.) I'm not sure how to fix that, though.
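For reference, the safeDivide alluded to above could look like this (a sketch of the example, not code from the answer):
safeDivide :: Double -> Double -> Maybe Double
safeDivide _ 0 = Nothing
safeDivide x y = Just (x / y)

-- fmap (take 10) $ iterateM (flip safeDivide 2) 1000
-- never finishes: sequence insists on running the entire infinite list of
-- actions before returning anything.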

How does this cyclic recursion provide the desired result?

Consider the following abbreviated code from this excellent blog post:
import System.Random (Random, randomRIO)
newtype Stream m a = Stream { runStream :: m (Maybe (NonEmptyStream m a)) }
type NonEmptyStream m a = (a, Stream m a)
empty :: (Monad m) => Stream m a
empty = Stream $ return Nothing
cons :: (Monad m) => a -> Stream m a -> Stream m a
cons a s = Stream $ return (Just (a, s))
fromList :: (Monad m) => [a] -> NonEmptyStream m a
fromList (x:xs) = (x, foldr cons empty xs)
Not too bad thus far - a monadic, recursive data structure and a way to build one from a list.
Now consider this function that chooses a (uniformly) random element from a stream, using constant memory:
select :: NonEmptyStream IO a -> IO a
select (a, s) = select' (return a) 1 s where
  select' :: IO a -> Int -> Stream IO a -> IO a
  select' a n s = do
    next <- runStream s
    case next of
      Nothing -> a
      Just (a', s') -> select' someA (n + 1) s' where
        someA = do i <- randomRIO (0, n)
                   case i of 0 -> return a'
                             _ -> a
I'm not grasping the mysterious cyclic well of infinity that's going on in the last four lines; the result a' depends on a recursion on someA, which itself could depend on a', but not necessarily.
I get the vibe that the recursive worker is somehow 'accumulating' potential values in the IO a accumulator, but I obviously can't reason about it well enough.
Could anyone provide an explanation as to how this function produces the behaviour that it does?
That code doesn't actually run in constant space, as it composes a bigger and bigger IO a action which delays all the random choices until it's reached the end of the stream. Only when we reach the Nothing -> a case does the action in a actually get run.
For example, try running it on an infinite, constant space stream made by this function:
repeat' :: a -> NonEmptyStream IO a
repeat' x = let xs = (x, Stream $ return (Just xs)) in xs
Obviously, running select on this stream won't terminate, but you should see the memory usage going up as it allocates a lot of thunks for the delayed actions.
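Concretely, the experiment suggested above could be run as (my phrasing of it):
main :: IO ()
main = select (repeat' (42 :: Int)) >>= print
-- never terminates; memory grows as the delayed randomRIO choices pile up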
Here's a slightly re-written version of the code which does the choices as it goes along, so it runs in constant space and should hopefully be more clear as well. Note that I've replaced the IO a argument with a plain a which makes it clear that there are no delayed actions being built up here.
select :: NonEmptyStream IO a -> IO a
select (x, xs) = select' x 1 xs where
  select' :: a -> Int -> Stream IO a -> IO a
  select' current n xs = do
    next <- runStream xs
    case next of
      Nothing -> return current
      Just (x, xs') -> do
        i <- randomRIO (0, n)              -- (1)
        case i of
          0 -> select' x (n+1) xs'         -- (2)
          _ -> select' current (n+1) xs'   -- (3)
As the name implies, current stores the currently selected value at each step. Once we've extracted the next item from the stream, we (1) pick a random number and use this to decide whether to (2) replace our selection with the new item or (3) keep our current selection before recursing on the rest of the stream.
There doesn't seem to be anything "cyclic" going on here. In particular, a' does not depend on someA. The a' is bound by pattern matching on the result of next. It is used by someA, which is in turn used on the right-hand side, but this does not constitute a cycle.
What select' does is to traverse the stream. It maintains two accumulating arguments. The first is a random element from the stream (it's not yet selected and still random, hence IO a). The second is the position in the stream (Int).
The invariant being maintained is that the first accumulator selects an element uniformly from the stream we have seen so far, and that the integer represents the number of elements encountered so far.
Now, if we reach the end of the stream (Nothing), we can return the current random element, and it will be ok.
If we see another element (the Just case), then we recurse by calling select' again. Updating the number of elements to n + 1 is trivial. But how do we update the random element someA? Well, the old random element a chooses between the first n positions of the stream with equal probability. If we choose the new element a' with probability 1 / (n + 1) and use the old one in all other cases, then we have a uniform distribution over the whole stream up to this point again: the new element is kept with probability 1 / (n + 1), and each earlier element survives with probability (1 / n) * (n / (n + 1)) = 1 / (n + 1) as well.

Using lookup with an IO list?

I am getting the contents of a file and transforming it into a list of the form:
[("abc", 123), ("def", 456)]
with readFile, lines, and words.
Right now, I can manage to transform the resulting list into type IO [(String, Int)].
My problem is, when I try to make a function like this:
check x = lookup x theMap
I get this error, which I'm not too sure how to resolve:
Couldn't match expected type `[(a0, b0)]'
with actual type `IO [(String, Int)]'
In the second argument of `lookup', namely `theMap'
theMap is essentially this:
getLines :: String -> IO [String]
getLines = liftM lines . readFile

tuplify [x,y] = (x, read y :: Int)

theMap = do
  list <- getLines "./test.txt"
  let l = map tuplify (map words list)
  return l
And the file contents are:
abc 123
def 456
Can anyone explain what I'm doing wrong and/or show me a better solution? I just started toying around with monads a few hours ago and am running into a few bumps along the way.
Thanks
You will have to "unwrap" theMap from IO. Notice how you're already doing this with getLines by:
do
  list <- getLines
  [...]
  return (some computation on list)
So you could have:
check x = do
  m <- theMap
  return . lookup x $ m
This is, in fact, an antipattern (albeit an illustrative one), and you would be better off using the Functor instance, i.e. check x = fmap (lookup x) theMap
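Spelled out with a type signature and a small main (my additions, assuming the asker's theMap is in scope), the Functor version looks like:
check :: String -> IO (Maybe Int)
check x = fmap (lookup x) theMap

main :: IO ()
main = check "abc" >>= print   -- with the test.txt shown above: Just 123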
