Haskell streaming - how to merge original stream with result stream

Using the Haskell streaming library, I can easily group a stream and take the sum of each group.
>>> S.print $ mapped S.toList $ S.groupBy (\ x y -> x*y>0) $ each [-1,-2,3,4,5,-6]
[-1,-2]
[3,4,5]
[-6]
>>> S.print $ S.map sum $ mapped S.toList $ S.groupBy (\ x y -> x*y>0) $ each [-1,-2,3,4,5,-6]
-3
12
-6
How can I write a function myfn that generates a stream merging the two above in an order-sensitive way? That is, I wish to have a result stream of
>>> myfn $ each [-1,-2,3,4,5,-6]
-1:> -2:> -3:> 3:> 4:> 5:> 12:> -6:> -6:> ()

The solution involves making the function argument of mapped both accumulate the list and calculate the sum in one pass.
That could be done with store, I think, but I find the streaming sinks from the foldl package easier to use. Their Applicative instance lets us build composite Folds from simpler ones:
foo :: Monad m
    => (Int -> Int -> Bool)
    -> Stream (Of Int) m ()
    -> Stream (Of Int) m ()
foo p =
      flip S.for (\(xs, total) -> S.each xs *> S.yield total)
    . mapped (L.purely S.fold $ (,) <$> L.list <*> L.sum)
    . S.groupBy p
Where L.purely, L.list and L.sum are from the foldl package.
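As a quick aside, the composed Fold can be tried on a plain list with foldl's pure runner L.fold; a minimal sketch, separate from the answer's pipeline:
>>> import qualified Control.Foldl as L
>>> L.fold ((,) <$> L.list <*> L.sum) [-1,-2,3]
([-1,-2,3],-3)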
The finishing touch is taking each pair ([Int],Int) coming out of mapped and replacing it with a substream using for.
Putting it to work:
*Main> S.print $ foo (\x y -> x*y>0) $ S.each [-1,-2,3,4,5,-6]
-1
-2
-3
3
4
5
12
-6
-6
Edit: Come to think of it, the previous solution is flawed. We are only interested in a streamed result, yet we accumulate each individual group in memory using S.toList or L.list before sending it downstream. But what if one group happens to be bigger than the available memory in the machine?
Here's a solution that streams perfectly and is indifferent to the size of each group:
foo :: Monad m
    => (Int -> Int -> Bool)
    -> Stream (Of Int) m ()
    -> Stream (Of Int) m ()
foo p =
      concats
    . S.maps (S.store (\s -> do (total :> r) <- L.purely S.fold L.sum s
                                S.yield total
                                return r))
    . S.groupBy p
What has changed? First, we use maps instead of mapped, because now we want to transform the subgroup streams, instead of returning a result in the base monad.
For each subgroup stream, we use store to perform a summing fold without destroying the stream. Then we take the result of the fold and append it back to the stream, while also taking care of preserving the original return value as required by maps.
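As a small illustration of store on its own, here is a sketch of a summing fold applied to a copy of a stream; the nested result shows the outer fold's answer wrapping the inner one:
>>> S.length $ S.store S.sum $ S.each [1..10]
10 :> (55 :> ())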
The only step left is to rejoin the subgroups using concats.

Related

Append a delay to each chunk with the streaming library?

newbie to Streaming and Haskell here.
I've been playing around with the streaming library and I'm particularly interested in understanding the chunks part. E.g.:
S.print $ S.delay 1.0 $ concats $ chunksOf 2 $ S.each [1..10]
Or:
S.print $ concats $ S.maps (S.delay 1.0) $ chunksOf 2 $ S.each [1..10]
Here I can introduce a delay after each element, but what I want is a delay after each chunk, in this case a delay after every second element. I tried this, but it doesn't compile:
S.print $ concats $ S.delay 1.0 $ chunksOf 2 $ S.each [1..10]
How can I achieve this?
What we need is a function that inserts a single delay at the end of a chunk stream; we can then pass that function to maps.
delay doesn't work here because it puts delays between successive yielded values. But we can do it easily using functions from Applicative:
S.print
    $ concats
    $ S.maps (\s -> s <* liftIO (threadDelay 1000000))
    $ chunksOf 2
    $ S.each [1..10]
What is happening here? maps applies a transformation to the "base functor" of the Stream. In a "chunked stream" obtained with chunksOf, that base functor is itself a Stream. Also, the transformation must preserve the return value of the Stream.
Streams can be sequenced with functions like (>>=) :: Stream f m a -> (a -> Stream f m b) -> Stream f m b if the next stream depends on the final result of the previous one, or with functions like (<*) :: Stream f m a -> Stream f m b -> Stream f m a if it doesn't. (<*) preserves the return value of the first Stream, which is what we want in this case.
We do not want to yield any more elements, but only to introduce a delay effect, so we simply liftIO the effect into the Stream monad.
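For reference, here is the snippet above as a self-contained program; the import list is my assumption about a typical setup, not part of the original answer:
import Control.Concurrent (threadDelay)
import Control.Monad.IO.Class (liftIO)
import Streaming (chunksOf, concats, maps)
import qualified Streaming.Prelude as S

main :: IO ()
main =
    S.print
        $ concats
        $ maps (\s -> s <* liftIO (threadDelay 1000000)) -- one delay per chunk of two
        $ chunksOf 2
        $ S.each [1 .. 10 :: Int]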
Another way to insert delays after each yielded value of a Stream is to zip it with an infinite list of delays:
-- Pair each element with a delay action; const keeps the element and
-- discards the unit result of the delay.
delay' :: MonadIO m => Int -> Stream (Of a) m r -> Stream (Of a) m r
delay' micros s = S.zipWith const s (S.repeatM (liftIO (threadDelay micros)))

How to generate a list of repeated applications of a function to the previous result of it in IO context

As part of a solution to the problem I'm trying to solve, I need to generate a list of repeated applications of a function to its previous result. Sounds very much like the iterate function, with the exception that iterate has the signature
iterate :: (a -> a) -> a -> [a]
and my function lives inside of IO (I need to generate random numbers), so I'd need something more like:
iterate' :: (a -> IO a) -> a -> [a]
I have looked on Hoogle, but without much success.
You can actually get a lazy iterate that works on infinite lists if you use the pipes library. The definition is really simple:
import Pipes

iterate' :: (a -> IO a) -> a -> Producer a IO r
iterate' f a = do
    yield a
    a2 <- lift (f a)
    iterate' f a2
For example, let's say that our step function is:
step :: Int -> IO Int
step n = do
    m <- readLn
    return (n + m)
Then applying iterate' to step generates a Producer that lazily prompts the user for input and generates the tally of values read so far:
iterate' step 0 :: Producer Int IO ()
The simplest way to read out the value is to loop over the Producer using for:
main = runEffect $
    for (iterate' step 0) $ \n -> do
        lift (print n)
The program then endlessly loops, requesting user input and displaying the current tally:
>>> main
0
10<Enter>
10
14<Enter>
24
5<Enter>
29
...
Notice how this gets two things correct which the other solutions do not:
It works on infinite lists (you don't need a termination condition)
It produces results immediately. It doesn't wait until you run the action on the entire list to start producing usable values.
However, we can easily filter results just like the other two solutions. For example, let's say I want to stop when the tally is greater than 100. I can just write:
import qualified Pipes.Prelude as P

main = runEffect $
    for (iterate' step 0 >-> P.takeWhile (< 100)) $ \n -> do
        lift (print n)
You can read that as saying: "Loop over the iterated values while they are less than 100. Print the output". Let's try it:
>>> main
0
10<Enter>
10
20<Enter>
30
75<Enter>
>>> -- Done!
In fact, pipes has another helper function for printing out values, so you can simplify the above to a pipeline:
main = runEffect $ iterate' step 0 >-> P.takeWhile (< 100) >-> P.print
This gives a clear view of the flow of information. iterate' produces a never-ending stream of Ints, P.takeWhile filters that stream, and P.print prints all values that reach the end.
If you want to learn more about the pipes library, I encourage you to read the pipes tutorial.
Your function lives in IO, so the signature is rather:
iterate' :: (a -> IO a) -> a -> IO [a]
The problem is that the original iterate function returns an infinite list, so if you try to do the same in IO you will get an action that never ends. Maybe you should add a condition to end the iteration.
-- condition :: a -> Bool decides when to stop iterating
iterate' action value = do
    result <- action value
    if condition result
        then return []
        else do
            rest <- iterate' action result
            return (result : rest)
Firstly, your resulting list must be in the IO monad, so iterate' must produce an IO [a] rather than a plain [a].
iterate can be defined as:
iterate :: (a -> a) -> a -> [a]
iterate f x = x : iterate f (f x)
so we could make an iterateM quite easily:
iterateM :: Monad m => (a -> m a) -> m a -> [m a]
iterateM f x = x : iterateM f (x >>= f)
This still needs your seed value to be in the monad to start, though, and also gives you a list of monadic things rather than a monad of listy things.
So, let's change it a bit.
iterateM :: Monad m => (a -> m a) -> a -> m [a]
iterateM f x = sequence $ go f (return x)
  where
    go f x = x : go f (x >>= f)
However, this doesn't work. This is because sequence first runs every action, and then returns. (You can see this if you write some safeDivide :: Double -> Double -> Maybe Double and then try something like fmap (take 10) $ iterateM (flip safeDivide 2) 1000; you'll find it doesn't terminate.) I'm not sure how to fix that, though.
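For the IO-specific case in the original question, one known workaround (my addition, not part of the answer above) is unsafeInterleaveIO, which suspends each step until its result is demanded. This is lazy IO, with the usual caveats about unpredictable effect ordering:
import System.IO.Unsafe (unsafeInterleaveIO)

-- Sketch: an infinite list of iterated IO results, produced lazily.
-- Each recursive step runs only when the tail of the list is demanded.
iterateIO :: (a -> IO a) -> a -> IO [a]
iterateIO f x = do
    rest <- unsafeInterleaveIO (f x >>= iterateIO f)
    return (x : rest)
The effects then run only as the resulting list is consumed, so taking a finite prefix performs a finite number of actions.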

Interleaving list functions

Let's say I'm given two functions:
f :: [a] -> b
g :: [a] -> c
I want to write a function that is the equivalent of this:
h x = (f x, g x)
But when I do that, for large lists I inevitably run out of memory.
A simple example is the following:
x = [1..100000000::Int]
main = print $ (sum x, product x)
I understand this is the case because the list x is being stored in memory without being garbage collected. It would be better if f and g worked on x, well, "in parallel".
Assuming I can't change f and g, nor want to make a separate copy of x (assume x is expensive to produce), how can I write h without running into out-of-memory issues?
The short answer is that you can't. Since you have no control over f and g, you have no guarantee that the functions process their input sequentially. Such a function may well keep the whole list in memory before producing the final result.
However, if your functions are expressed as folds, the situation is different. This means we know how to apply each step incrementally, so we can combine both folds into a single pass.
There are many resources about this area. For example:
Haskell: Can I perform several folds over the same lazy list without keeping list in memory?
Classic Beautiful folding
More beautiful fold zipping
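In that beautiful-folding style, the foldl package (my suggestion, not named in this answer) captures the pattern directly: its Fold type composes applicatively, and a pair of folds traverses the list once in constant space:
import qualified Control.Foldl as L

-- One pass over the list computes both the sum and the product.
main :: IO ()
main = print $ L.fold ((,) <$> L.sum <*> L.product) [1 .. 100000000 :: Int]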
The pattern of consuming a sequence of values within defined space bounds is solved more generally by pipe-like libraries such as conduit, iteratee, or pipes. For example, in conduit, you could express the combination of computing sums and products as
import Control.Monad.Identity
import Data.Conduit
import Data.Conduit.List (fold, sourceList)
import Data.Conduit.Internal (zipSinks)

product', sum' :: (Monad m, Num a) => Sink a m a
sum' = fold (+) 0
product' = fold (*) 1

main = print . runIdentity $ sourceList (replicate (10^6) 1) $$
                             zipSinks sum' product'
If you can turn your functions into folds, you can then just use them with a scan:
x = [1..100000000::Int]

main = mapM_ print . tail . scanl foo (a0, b0) . takeWhile (not . null)
       . unfoldr (Just . splitAt 1000)   -- adjust the chunk length as needed
       $ x

foo (a, b) chunk = let a2 = f' a (f chunk)
                       b2 = g' b (g chunk)
                   in a2 `seq` b2 `seq` (a2, b2)

-- a0, b0 are the initial accumulator values, e.g. 0 and 1
f  :: [t] -> a      -- e.g. sum
g  :: [t] -> b      -- e.g. (`rem` 10007) . product
f' :: a -> a -> a   -- e.g. (+)
g' :: b -> b -> b   -- e.g. ((`rem` 10007) .) . (*)
We consume the input in chunks for better performance. Compiled with -O2, this should run in constant space. The interim results are printed as an indication of progress.
If you can't turn your function into a fold, this means it has to consume the whole list to produce any output, and this trick doesn't apply.
You can use multiple threads to evaluate f x and g x in parallel.
E.g.
import Control.Parallel (par, pseq)

x :: [Int]
x = [1..10^8]

main = print $ let a = sum x
                   b = product x
               in a `par` b `pseq` (a, b)
It's a nice way to exploit GHC's parallel runtime to prevent a space leak by doing two things at once.
Alternatively, you need to fuse f and g into a single pass.
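For the sum/product example, such a fusion could look like the following sketch; the seq calls keep both accumulators strict so the traversal runs in constant space:
-- Fusing two folds over the same list into one strict pass.
sumProd :: [Int] -> (Int, Int)
sumProd = go 0 1
  where
    go s p []       = (s, p)
    go s p (n : ns) = let s' = s + n
                          p' = p * n
                      in s' `seq` p' `seq` go s' p' ns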

How does this cyclic recursion provide the desired result?

Consider the following abbreviated code from this excellent blog post:
import System.Random (Random, randomRIO)
newtype Stream m a = Stream { runStream :: m (Maybe (NonEmptyStream m a)) }
type NonEmptyStream m a = (a, Stream m a)
empty :: (Monad m) => Stream m a
empty = Stream $ return Nothing
cons :: (Monad m) => a -> Stream m a -> Stream m a
cons a s = Stream $ return (Just (a, s))
fromList :: (Monad m) => [a] -> NonEmptyStream m a
fromList (x:xs) = (x, foldr cons empty xs)
Not too bad thus far - a monadic, recursive data structure and a way to build one from a list.
Now consider this function that chooses a (uniformly) random element from a stream, using constant memory:
select :: NonEmptyStream IO a -> IO a
select (a, s) = select' (return a) 1 s where
    select' :: IO a -> Int -> Stream IO a -> IO a
    select' a n s = do
        next <- runStream s
        case next of
            Nothing -> a
            Just (a', s') -> select' someA (n + 1) s'
                where someA = do i <- randomRIO (0, n)
                                 case i of 0 -> return a'
                                           _ -> a
I'm not grasping the mysterious cyclic well of infinity that's going on in the last four lines; the result a' depends on a recursion on someA, which itself could depend on a', but not necessarily.
I get the vibe that the recursive worker is somehow 'accumulating' potential values in the IO a accumulator, but I obviously can't reason about it well enough.
Could anyone provide an explanation as to how this function produces the behaviour that it does?
That code doesn't actually run in constant space, as it composes a bigger and bigger IO a action which delays all the random choices until it's reached the end of the stream. Only when we reach the Nothing -> a case does the action in a actually get run.
For example, try running it on an infinite, constant space stream made by this function:
repeat' :: a -> NonEmptyStream IO a
repeat' x = let xs = (x, Stream $ return (Just xs)) in xs
Obviously, running select on this stream won't terminate, but you should see the memory usage going up as it allocates a lot of thunks for the delayed actions.
Here's a slightly re-written version of the code which does the choices as it goes along, so it runs in constant space and should hopefully be more clear as well. Note that I've replaced the IO a argument with a plain a which makes it clear that there are no delayed actions being built up here.
select :: NonEmptyStream IO a -> IO a
select (x, xs) = select' x 1 xs where
    select' :: a -> Int -> Stream IO a -> IO a
    select' current n xs = do
        next <- runStream xs
        case next of
            Nothing -> return current
            Just (x, xs') -> do
                i <- randomRIO (0, n)                -- (1)
                case i of
                    0 -> select' x (n + 1) xs'       -- (2)
                    _ -> select' current (n + 1) xs' -- (3)
As the name implies, current stores the currently selected value at each step. Once we've extracted the next item from the stream, we (1) pick a random number and use this to decide whether to (2) replace our selection with the new item or (3) keep our current selection before recursing on the rest of the stream.
There doesn't seem to be anything "cyclic" going on here. In particular, a' does not depend on someA: a' is bound by pattern matching on the result of next. It is used by someA, which is in turn used on the right-hand side, but this does not constitute a cycle.
What select' does is to traverse the stream. It maintains two accumulating arguments. The first is a random element from the stream (it's not yet selected and still random, hence IO a). The second is the position in the stream (Int).
The invariant being maintained is that the first accumulator selects an element uniformly from the stream we have seen so far, and that the integer represents the number of elements encountered so far.
Now, if we reach the end of the stream (Nothing), we can return the current random element, and it will be ok.
If we see another element (the Just case), then we recurse by calling select' again. Updating the number of elements to n + 1 is trivial. But how do we update the random element someA? Well, the old random element a chooses between the first n positions of the stream with equal probability. If we choose the new element a' with probability 1 / (n + 1) and use the old one in all other cases, then we again have a uniform distribution over the whole stream up to this point: each earlier element was selected with probability 1/n and survives with probability n/(n+1), giving 1/n * n/(n+1) = 1/(n+1), the same probability with which the new element is chosen.

Having my cereal and parsing it too

I'm using Data.Serialize.Get and am trying to define the following combinator:
getConsumed :: Get a -> Get (ByteString, a)
which should act like the passed-in Get action, but also return the ByteString that the Get consumed. The use case is that I have a binary structure that I need to both parse and hash, and I don't know the length before parsing it.
This combinator, despite its simple semantics, is proving surprisingly tricky to implement.
Without delving into the internals of Get, my instinct was to use this monstrosity:
getConsumed :: Get a -> Get (B.ByteString, a)
getConsumed g = do
    (len, r) <- lookAhead $ do
        before <- remaining
        res <- g
        after <- remaining
        return (before - after, res)
    bs <- getBytes len
    return (bs, r)
This uses lookAhead to peek at the remaining byte count before and after running the action, returns the result of the action, and then consumes the measured length. It shouldn't duplicate any work, but it occasionally fails with:
*** Exception: GetException "Failed reading: getBytes: negative length requested\nEmpty call stack\n"
so I must be misunderstanding something about cereal somewhere.
Does anyone see what's wrong with my definition of getConsumed, or have a better idea for how to implement it?
Edit: Dan Doel points out that remaining only returns the remaining length of the current chunk, which isn't very useful if you cross a chunk boundary. I'm not sure what the point of the action is in that case, but that explains why my code wasn't working! Now I just need to find a viable alternative.
Edit 2: after thinking about it some more, it seems the fact that remaining gives me the length of the current chunk could work to my advantage, if I feed the Get manually with individual chunks (remaining >>= getBytes) in a loop and keep track of what it's eating as I do it. I haven't managed to get this approach working yet either, but it seems more promising than the original one.
Edit 3: if anyone's curious, here's code from edit 2 above:
getChunk :: Get B.ByteString
getChunk = remaining >>= getBytes

getConsumed :: Get a -> Get (B.ByteString, a)
getConsumed g = do
    (len, res) <- lookAhead $ measure g
    bs <- getBytes len
    return (bs, res)
  where
    measure :: Get a -> Get (Int, a)
    measure g = do
        chunk <- getChunk
        measure' (B.length chunk) (runGetPartial g chunk)

    measure' :: Int -> Result a -> Get (Int, a)
    measure' !n (Fail e)    = fail e
    measure' !n (Done r bs) = return (n - B.length bs, r)
    measure' !n (Partial f) = do
        chunk <- getChunk
        measure' (n + B.length chunk) (f chunk)
Unfortunately, it still seems to fail after a while on my sample input with:
*** Exception: GetException "Failed reading: too few bytes\nFrom:\tdemandInput\n\n\nEmpty call stack\n"
EDIT: Another solution, which does no extra computation!
getConsumed :: Get a -> Get (B.ByteString, a)
getConsumed g = do
    (len, r) <- lookAhead $ do
        (res, after) <- lookAhead $ liftM2 (,) g remaining
        total <- remaining
        return (total - after, res)
    bs <- getBytes len
    return (bs, r)
One solution is to call lookAhead twice. The first time makes sure that all necessary chunks are loaded, and the second performs the actual length computation (along with returning the deserialized data).
getConsumed :: Get a -> Get (B.ByteString, a)
getConsumed g = do
    _ <- lookAhead g -- make sure all necessary chunks are preloaded
    (len, r) <- lookAhead $ do
        before <- remaining
        res <- g
        after <- remaining
        return (before - after, res)
    bs <- getBytes len
    return (bs, r)
The cereal package does not store enough information to implement what you want simply. I expect that your idea of using chunks might work, or perhaps a special runGet; forking cereal and using the internals is probably your easiest path.
Writing your own can work; this is what I did when making the protocol-buffers library. My custom Text.ProtocolBuffers.Get library implements enough machinery to do what you want:
import Text.ProtocolBuffers.Get
import Control.Applicative
import qualified Data.ByteString as B
getConsumed :: Get a -> Get (B.ByteString, a)
getConsumed thing = do
    start <- bytesRead
    (a, stop) <- lookAhead ((,) <$> thing <*> bytesRead)
    bs <- getByteString (fromIntegral (stop - start))
    return (bs, a)
This works because my library tracks the number of bytes read; otherwise, the API is quite similar to cereal's.
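For what it's worth, the binary package's Data.Binary.Get also exports bytesRead and lookAhead, so the same pattern can be written against it; a sketch (this targets binary, not cereal, and was not part of the original answers):
import qualified Data.ByteString as B
import Data.Binary.Get (Get, bytesRead, getByteString, lookAhead)

-- Measure how far the parser advances, then re-read that span as a ByteString.
getConsumed :: Get a -> Get (B.ByteString, a)
getConsumed g = do
    start <- bytesRead
    (a, stop) <- lookAhead ((,) <$> g <*> bytesRead)
    bs <- getByteString (fromIntegral (stop - start))
    return (bs, a)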
