Is it safe to reuse a conduit?

Is it safe to perform multiple actions using the same conduit value? Something like
do
  let sink = sinkSocket sock
  something $$ sink
  somethingElse $$ sink
I recall that in the early versions of conduit there were some dirty hacks that made this unsafe. What's the current status?
(Note that sinkSocket doesn't close the socket.)

That usage is completely safe. The issue in older versions had to do with blurring the line between resumable and non-resumable components. With modern versions (I think since 0.4), the line is very clear between the two.
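For contrast, here is a minimal sketch (mine, not part of the original answer) of the explicitly resumable side of that line, using conduit's connect-and-resume operators:
import Data.Conduit
import qualified Data.Conduit.List as CL

main :: IO ()
main = do
  -- $$+ returns a ResumableSource that carries the source's remaining state
  (resumable, firstThree) <- CL.sourceList [1..10 :: Int] $$+ CL.take 3
  -- $$+- connects the leftovers to a final sink and closes the source
  rest <- resumable $$+- CL.consume
  print (firstThree, rest) -- ([1,2,3],[4,5,6,7,8,9,10])
With plain $$, by contrast, each connection is self-contained, which is why reusing a sink value is safe.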

It might be safe to reuse sinks, in the sense that the semantics of the "used" sink don't change. But you should be aware of another threat: space leaks.
The situation is analogous to lazy lists: you can consume a huge list lazily in constant space, but if you process the list twice it will be kept in memory. The same thing can happen with a recursive monadic expression: if you use it once it runs in constant space, but if you reuse it, the structure of the computation is kept in memory, resulting in a space leak.
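As a plain-list illustration of that analogy (my own sketch, not from the conduit example below):
main :: IO ()
main = do
  let xs = [1 .. 10000000 :: Int]
  print (sum xs)    -- the first pass on its own could run in constant space
  print (length xs) -- the second use forces xs to be retained in full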
Here's a conduit example:
import Data.Conduit
import Data.Conduit.List
import Control.Monad.Trans.Class (lift)
consumeN 0 _ = return ()
consumeN n m = do
  await >>= (lift . m)
  consumeN (n-1) m

main = do
  let sink = consumeN 1000000 (\i -> putStrLn ("Got one: " ++ show i))
  sourceList [1..9000000::Int] $$ sink
  sourceList [1..22000000::Int] $$ sink
This program uses about 150 MB of RAM on my machine, but if you remove the last line, or repeat the definition of sink in both places, you get nice constant space usage.
I agree that this is a contrived example (it was simply the first that came to mind), and it is not very likely to happen with most Sinks; for example, it will not happen with your sinkSocket. (Why is it contrived? Because the control structure of the sink doesn't depend on the values it receives, and that is also exactly why it can leak.) For Sources, however, this would be much more common, and many of the common Sources exhibit this behavior. sourceList is an obvious example, because it would actually keep the source list in memory; but enumFromTo is no different, even though there is no data to keep in memory, only the structure of the monadic computation.
So, all in all, I think it's important to be aware of this.
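If you do run into this, one workaround (my own suggestion, assuming rebuilding the sink is cheap) is to hide the sink behind a function so each use reconstructs the computation instead of sharing one binding; note that GHC's full-laziness optimization can sometimes float the body back out, which is what -fno-full-laziness addresses:
mkSink :: () -> Sink Int IO () -- the dummy () argument defeats sharing
mkSink () = consumeN 1000000 (\i -> putStrLn ("Got one: " ++ show i))

main :: IO ()
main = do
  sourceList [1..9000000 :: Int] $$ mkSink ()
  sourceList [1..22000000 :: Int] $$ mkSink ()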

Related

STM-friendly list as a change log

I need an advice on the data structure to use as an atomic change log.
I'm trying to implement the following algorithm. There is a flow of incoming
changes updating an in-memory map. In Haskell-like pseudocode it is
update :: DataSet -> SomeListOf Change -> Change -> STM (DataSet, SomeListOf Change)
update dataSet existingChanges newChange = do
  ...
  return (dataSet, existingChanges ++ [newChange])
where DataSet is a map (currently the Map from the stm-containers package, https://hackage.haskell.org/package/stm-containers-0.2.10/docs/STMContainers-Map.html). The whole update is called from an arbitrary number of threads. Some of the Changes can be rejected due to domain semantics; I use throwSTM for that, to discard the effect of the transaction. In case of a successful commit, the newChange is added to the list.
There exists separate thread which calls the following function:
flush :: TVar (DataSet, SomeListOf Change) -> IO ()
This function is supposed to take the current snapshot of DataSet together with the list of changes (it has to be a consistent pair) and flush it to the filesystem, i.e.
flush data_ = do
  (dataSet, changes) <- atomically $ readTVar data_
  -- write them both to FS
  -- ...
  atomically $ writeTVar data_ (dataSet, [])
I need an advice about the data structure to use for "SomeListOf Change". I don't want to use [Change] because it is "too ordered" and I'm afraid there will be too many conflicts, which will force the whole transaction to retry. Please correct me, if I'm wrong here.
I cannot use the Set (https://hackage.haskell.org/package/stm-containers-0.2.10/docs/STMContainers-Set.html) because I still need to preserve some order, e.g. the order of transaction commits. I could use TChan for it and it looks like a good match (exactly the order of transaction commits), but I don't know how to implement the "flush" function so that it would give the consistent view of the whole change log together with the DataSet.
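One idea would be to drain the channel inside the same transaction that snapshots the DataSet, using tryReadTChan from the stm package, so that flush sees a consistent pair (a sketch; I'm not sure it's idiomatic):
import Control.Concurrent.STM

-- Read everything currently buffered in the channel, without retrying
-- once it becomes empty.
drainTChan :: TChan a -> STM [a]
drainTChan chan = do
  next <- tryReadTChan chan
  case next of
    Nothing -> return []
    Just x  -> fmap (x :) (drainTChan chan)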
The current implementation is here: https://github.com/lolepezy/rpki-pub-server/blob/add-storage/src/RRDP/Repo.hs, in the functions applyActionsToState and rrdpSyncThread, respectively. It uses TChan and seems to do it in the wrong way.
Thank you in advance.
Update: A reasonable answer seems to be something like this:
type SomeListOf c = TChan [c]

update :: DataSet -> TChan [Change] -> Change -> STM DataSet
update dataSet changeChan newChange = do
  ... -- the elided part accumulates existingChanges
  writeTChan changeChan $ reverse (newChange : existingChanges)
  return dataSet
flush data_ changeChan = do
  (dataSet, changes) <- atomically $ (,) <$> readTVar data_ <*> readTChan changeChan
  -- write them both to FS
  -- ...
But I'm still not sure whether it's a neat solution to pass the whole list as an element of the channel.
I'd probably just go with the list and see how far it takes you performance-wise. Given that, you should consider that both appending to the end of a list and reversing it are O(n) operations, so you should try to avoid them. Maybe you can just prepend the incoming changes, like this:
update dataSet existingChanges newChange = do
  -- ...
  return (dataSet, newChange : existingChanges)
Also, your example for flush has the problem that reading and updating the state is not atomic at all. You must do both within a single atomically call, like so:
flush data_ = do
  (dataSet, changes) <- atomically $ do
    (dataSet', changes') <- readTVar data_
    writeTVar data_ (dataSet', [])
    return (dataSet', changes')
  -- write them both to FS
  -- ...
You could then just write them out in reverse order (because changes now contains the elements from newest to oldest), or reverse once here if it's important to write them out oldest to newest. If that's important, I'd probably go with some data structure that allows O(1) element access, like a good old vector.
When using a fixed-size vector, you would obviously have to deal with the problem that it can become "full", which would mean your writers would have to wait for flush to do its job before adding fresh changes. That's why I'd personally go for the simple list first and see whether it's sufficient or where it needs to be improved.
PS: A deque might be a good fit for your problem as well, but going fixed-size forces you to deal with the fact that your writers can potentially produce changes faster than your reader can flush them out. The deque can grow infinitely, but your RAM probably can't. And the vector has pretty low overhead.
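Putting that advice together, here is a minimal sketch (Change is a stand-in type, and persist is a placeholder for the actual file IO):
import Control.Concurrent.STM

type Change = String -- stand-in for the real Change type
type Log = TVar [Change]

-- Writers prepend in O(1); domain checks can still throwSTM to abort.
record :: Log -> Change -> STM ()
record log_ c = modifyTVar' log_ (c :)

-- The flusher atomically swaps in an empty log, then does the slow file
-- IO outside the transaction, oldest change first.
flush :: Log -> (Change -> IO ()) -> IO ()
flush log_ persist = do
  cs <- atomically $ do
    cs <- readTVar log_
    writeTVar log_ []
    return cs
  mapM_ persist (reverse cs)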
I made a (very simplistic) investigation
https://github.com/lolepezy/rpki-pub-server/tree/add-storage/test/changeLog
imitating exactly the type of load I'm supposedly going to have. I used the same STMContainers.Map for the data set and a plain list for the change log. To track transaction retries I used Debug.Trace.trace, so the total number of lines printed by trace gives the number of transaction attempts, and the number of unique lines gives the number of committed transactions.
The results are here (https://github.com/lolepezy/rpki-pub-server/blob/add-storage/test/changeLog/numbers.txt). The first column is the number of threads, the second is the total number of change sets generated. The third column is the number of trace calls without the change log, and the last one is the number of trace calls with the change log.
Apparently the change log adds some extra retries most of the time, but they are pretty insignificant. So I guess it's fair to say that any data structure would be good enough here: most of the work is in updating the map, and most of the retries happen because of it.

Parallel processing in conduit flow

I really like the concept of conduit/pipes for applying operations to a streaming IO source. I am interested in building tools that work on very large log files. One of the attractions of moving to Haskell from Python/Ruby is the easier way of writing parallel code, but I can't find any documentation of this. How could I set up a conduit flow which reads lines from a file and works on them in parallel (i.e. with 8 cores, it should read eight lines and hand them off to eight different threads to be processed, then collect the results again, etc.), ideally with as little "ceremony" as possible...
Optionally, it could be noted whether or not the lines need to be rejoined in order, if that influences the speed of the process.
I am sure it would be possible to cobble together something myself using ideas from the Parallel Haskell book, but it seems to me that running a pure function in parallel (parMap etc.) in the middle of a conduit workflow should be very easy?
As an example of the "internal parallelism" mentioned by Petr Pudlák in his comment, consider this function (I'm using pipes, but it could be implemented with conduit just as easily):
import Control.Monad
import Control.Lens (view)
import Control.Concurrent.Async (mapConcurrently)
import Pipes
import qualified Pipes.Group as G
import qualified Control.Foldl as L
concProd :: Int -> (a -> IO b) -> Producer a IO r -> Producer b IO r
concProd groupsize action producer =
    L.purely G.folds L.list (view (G.chunksOf groupsize) producer)
    >->
    forever (await >>= liftIO . mapConcurrently action >>= mapM_ yield)
This function takes as parameters a group size, an action we want to run for each value of type a, and a Producer of a values.
It returns a new Producer. Internally, the producer reads a values in batches of groupsize, processes them concurrently, and yields the results one by one.
The code uses Pipes.Group to "partition" the original producer into sub-producers of size groupsize, and then Control.Foldl to "fold" each sub-producer into a list.
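A hypothetical usage, processing numbers in concurrent batches of 4 and printing the doubled results:
import Control.Concurrent (threadDelay)
import qualified Pipes.Prelude as P

main :: IO ()
main = runEffect $ concProd 4 slowDouble (each [1..20 :: Int]) >-> P.print
  where slowDouble x = threadDelay 100000 >> return (2 * x) -- simulates work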
For more sophisticated tasks, you could turn to the asynchronous channels provided by pipes-concurrency or stm-conduit. But these yank you somewhat out of the "single pipeline" worldview of vanilla pipes/conduit.

Concurrency considerations between pipes and non-pipes code

I'm in the process of wrapping a C library for some encoding in a pipes interface, but I've hit upon some design decisions that need to be made.
After the C library is set up, we hold on to an encoder context. With this, we can either encode, or change some parameters (let's call the Haskell interface to this last function tune :: Context -> Int -> IO ()). There are two parts to my question:
The encoding part is easily wrapped up in a Pipe Foo Bar IO (), but I would also like to expose tune. Since simultaneous use of the encoding context must be lock-protected, I would need to take a lock at every iteration in the pipe and protect tune with the same lock. But now I feel I'm forcing hidden locks on the user. Am I barking up the wrong tree here? How is this kind of situation normally resolved in the pipes ecosystem? In my case I expect the pipe that my specific code is part of to always run in its own thread, with tuning happening concurrently, but I don't want to force this point of view upon any users. Other packages in the pipes ecosystem do not seem to force their users like this either.
An encoding context that is no longer used needs to be properly de-initialized. How does one, in the pipes ecosystem, ensure that such things (in this case, performing some IO actions) are taken care of when the pipe is destroyed?
A concrete example would be wrapping a compression library, in which case the above can be:
The compression strength is tunable. We set up the pipe and it runs along merrily. How should one best go about allowing the compression strength setting to be changed while the pipe keeps running, assuming that concurrent access to the compression codec context must be serialized?
The compression library allocated a bunch of memory off the Haskell heap when set up, and we'll need to call some library function to clean this up when the pipe is torn down.
Thanks… this might all be obvious, but I'm quite new to the pipes ecosystem.
Edit: Reading this after posting, I'm quite sure it's the vaguest question I've ever asked here. Ugh! Sorry ;-)
Regarding (1), the general solution is to change your Pipe's type to:
Pipe (Either (Context, Int) Foo) Bar IO ()
In other words, it accepts both Foo inputs and tune requests, which it processes internally.
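For instance, the pipe itself might look like this (a sketch only: Context, Foo, and Bar are your types, tune is your function, and encode here is a hypothetical wrapper around the C encoding call):
import Control.Monad (forever)
import Pipes

encoderPipe :: Context -> Pipe (Either (Context, Int) Foo) Bar IO ()
encoderPipe ctx = forever $ do
  req <- await
  case req of
    Left (c, n) -> liftIO (tune c n) -- a tune request produces no output
    Right foo   -> liftIO (encode ctx foo) >>= yield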
So let's then assume that you have two concurrent Producers corresponding to inputs and tune requests:
producer1 :: Producer Foo IO ()
producer2 :: Producer (Context, Int) IO ()
You can use pipes-concurrency to create a buffer that they both feed into, like this:
example = do
  (output, input) <- spawn Unbounded
  -- input  :: Input  (Either (Context, Int) Foo)
  -- output :: Output (Either (Context, Int) Foo)
  let io1 = runEffect $ producer1 >-> Pipes.Prelude.map Right >-> toOutput output
      io2 = runEffect $ producer2 >-> Pipes.Prelude.map Left  >-> toOutput output
  as <- mapM async [io1, io2]
  runEffect (fromInput input >-> yourPipe >-> someConsumer)
  mapM_ wait as
You can learn more about the pipes-concurrency library by reading this tutorial.
By forcing all tune requests to go through the same single-threaded Pipe you can ensure that you don't accidentally have two concurrent invocations of the tune function.
Regarding (2), there are two ways you can acquire a resource using pipes. The more sophisticated approach is to use the pipes-safe library, which provides a bracket function that you can use within a Pipe, but that is probably overkill for your purpose; it mainly exists for acquiring and releasing multiple resources over the lifetime of a pipe. A simpler solution is just to use the following with idiom to acquire the pipe:
withEncoder :: (Pipe Foo Bar IO () -> IO r) -> IO r
withEncoder k = bracket acquire release $ \resource ->
  k (createPipeFromResource resource)
Then a user would just write:
withEncoder $ \yourPipe ->
  runEffect (someProducer >-> yourPipe >-> someConsumer)
You can optionally use the managed package, which simplifies the types a bit and makes it easier to acquire multiple resources. You can learn more about it from reading this blog post of mine.
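For instance, with managed the above might look like this (a sketch; managed and with are from Control.Monad.Managed):
import Control.Monad.Managed

encoder :: Managed (Pipe Foo Bar IO ())
encoder = managed withEncoder

example :: IO ()
example = with encoder $ \yourPipe ->
  runEffect (someProducer >-> yourPipe >-> someConsumer)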

Can GHCi reorder IO actions within unsafePerformIO blocks?

Can IO actions called within unsafePerformIO be reordered?
I have, effectively, the following IO function:
assembleInsts :: ... -> IO S.ByteString
assembleInsts ... = do
  tmpInputFile <- generateUniqueTmpFile
  writeFile tmpInputFile str
  (ec,out,err) <- readProcessWithExitCode asm_exe [tmpInputFile] ""
  -- asm generates binary output in tmpOutputFile
  removeFile tmpInputFile
  let tmpOutputFile = replaceExtension tmpInputFile "bits" -- the assembler creates this
  bs <- S.readFile tmpOutputFile -- fails due to tmpOutputFile not existing
  removeFile tmpOutputFile
  return bs
where S.ByteString is a strict byte string.
Sadly, I need to call this in a tree of pure code far from the IO monad, but since the assembler behaves as a referentially transparent tool (given unique files), I figured I could make an unsafe interface for the time being.
{-# NOINLINE assembleInstsUnsafe #-}
assembleInstsUnsafe :: ... -> S.ByteString
assembleInstsUnsafe args = unsafePerformIO (assembleInsts args)
In addition, I added the following annotation to the top of the module, as per the documentation's (System.IO.Unsafe's) instructions:
{-# OPTIONS -fno-cse #-}
module Gen.IsaAsm where
(I also tried to add -fnofull-laziness, as per a reference I consulted, but the compiler rejected it; the flag is actually spelled -fno-full-laziness. I don't think that case applies here, though.)
Running in GHCi, it reports the following error:
*** Exception: C:\Users\trbauer\AppData\Local\Temp\tempfile_13516_0.dat: openBinaryFile: does not exist (No such file or directory)
But if I remove removeFile tmpOutputFile, then it magically works. Hence, it seems the removeFile executes before the process has terminated. Is this possible? The bytestring is strict, and I even tried to force the output at one point with:
S.length bs `seq` return ()
before the removeFile.
Is there a way to dump intermediate code to find out what's going on? (Maybe I can trace this with Process Monitor or something.)
Unfortunately, I'd like to clean up within this operation (remove the file).
I think the compiled version might work, but it fails under GHCi (interpreted).
I am using GHC 7.6.3 from the last Haskell Platform.
I know unsafePerformIO is a really big hammer and has other risks associated with it, but it would really limit the complexity of my software change.
This may not be applicable, since it is based on assumptions left unspecified in your question. In particular, this answer is based on the following two assumptions: S, which is unspecified, is Data.ByteString.Lazy, and tmpDatFile, which is undefined, is tmpOutputFile.
import qualified Data.ByteString.Lazy as S
...
let tmpDatFile = tmpOutputFile
Possible Cause
If these assumptions are true, removeFile will run too early, even without the use of unsafePerformIO. The following code
import System.Directory
import qualified Data.ByteString.Lazy as S
assembleInsts = do
  -- prepare a file, like asm might have generated
  let tmpOutputFile = "dataFile.txt"
  writeFile tmpOutputFile "a bit of text"
  -- read the prepared file
  let tmpDatFile = tmpOutputFile
  bs <- S.readFile tmpOutputFile
  removeFile tmpDatFile
  return bs

main = do
  bs <- assembleInsts
  print bs
Results in the error
lazyIOfail.hs: DeleteFile "dataFile.txt": permission denied (The process cannot access the file because it is being used by another process.)
Removing the line removeFile tmpDatFile will make this code execute correctly, just like you describe, but leaving behind the temporary file isn't what is desired.
Possible Solution
Changing the import S to
import qualified Data.ByteString as S
instead results in the correct output,
"a bit of text".
Explanation
The documentation for Data.ByteString.Lazy's readFile states that it will
Read an entire file lazily into a ByteString. The Handle will be held open until EOF is encountered.
Internally, readFile accomplishes this by calling unsafeInterleaveIO, which defers execution of the IO code until the term it returns is evaluated.
hGetContentsN :: Int -> Handle -> IO ByteString
hGetContentsN k h = lazyRead -- TODO close on exceptions
  where
    lazyRead = unsafeInterleaveIO loop
    loop = do
      c <- S.hGetSome h k -- only blocks if there is no data available
      if S.null c
        then hClose h >> return Empty
        else do
          cs <- lazyRead
          return (Chunk c cs)
Nothing inspects the constructor of the bs defined in the example above until it is printed, and printing happens only after removeFile has executed; so no chunks are read from the file (and the file is not closed) before removeFile runs. When removeFile executes, the Handle opened by readFile is therefore still open, and the file can't be removed.
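Here is a tiny standalone demonstration of that deferral (my own example, independent of bytestring):
import System.IO.Unsafe (unsafeInterleaveIO)

main :: IO ()
main = do
  x <- unsafeInterleaveIO (putStrLn "side effect!" >> return (42 :: Int))
  putStrLn "x is bound, but its IO hasn't run yet"
  print x -- "side effect!" is printed only here, when x is demanded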
Even if you are using unsafePerformIO, IO actions should not be reordered. If you want to be sure, you can use the -ddump-simpl flag to see the intermediate Core which GHC produces, or even one of the other -ddump-* flags showing all the compilation intermediate steps down to assembly.
I am aware that this answers what you asked rather than what you actually need, but this way you can at least rule out GHC bugs. It seems unlikely that a GHC bug is involved here, though.
Totally my fault... sorry, everyone. GHC does not reorder IO actions in an IO block under the conditions stated above, as others have mentioned. The assembler was simply failing to assemble the output and create the expected file; I forgot to check its exit code and output stream. I assumed the input was syntactically correct since it was generated, but the assembler rejected it and simply didn't create the file. It gave a valid error code and error diagnostic too, so that was really bad on my part. I may have been using readProcess the first time around, which raises an exception on a non-zero exit, but must have eventually changed this; I think the assembler also had a bug where it didn't correctly indicate a failing exit code in some cases, which is why I changed to readProcessWithExitCode.
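For anyone else who lands here: the fix was simply to inspect the exit code (and stderr) before reading the output file, roughly like this (asmExe and inputFile stand for the values from the question):
import System.Exit (ExitCode(..))
import System.Process (readProcessWithExitCode)

runAssembler :: FilePath -> FilePath -> IO ()
runAssembler asmExe inputFile = do
  (ec, _out, err) <- readProcessWithExitCode asmExe [inputFile] ""
  case ec of
    ExitSuccess   -> return ()
    ExitFailure n ->
      ioError (userError ("assembler failed (" ++ show n ++ "): " ++ err))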
I am still not sure why the error went away when I elided the removeFile.
I thought about deleting the question, but I'm hoping the suggestions above will help others debug similar (more valid) problems as well. I've been burned by the lazy IO issue Cirdec mentioned before, and the -ddump-simpl flag mentioned by chi is good to know as well.

A way to form a 'select' on MVars without polling

I have two MVars (well an MVar and a Chan). I need to pull things out of the Chan and process them until the other MVar is not empty any more. My ideal solution would be something like the UNIX select function where I pass in a list of (presumably empty) MVars and the thread blocks until one of them is full, then it returns the full MVar. Try as I might I can think of no way of doing this beyond repeatedly polling each MVar with isEmptyMVar until I get false. This seems inefficient.
A different thought was to use throwTo, but it interrupts whatever is happening in the thread, and I need to finish processing a job out of the Chan in an atomic fashion.
A final thought as I'm typing is to create a new forkIO for each MVar which tries to read its MVar then fill a newly created MVar with its own instance. The original thread can then block on that MVar. Are Haskell threads cheap enough to go running that many?
Haskell threads are very cheap, so you could solve it that way, but it sounds like STM would be a better fit for your problem. With STM you can do
do var <- atomically (takeTMVar a `orElse` takeTMVar b)
   ... -- do stuff with var
Because of the behavior of retry and orElse, this code tries to get a, then if that fails, get b. If both fail, it blocks until either of them is updated and tries again.
You could even use this to make your own rudimentary version of select:
select :: [TMVar a] -> STM a
select = foldr1 orElse . map takeTMVar
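For example (a sketch; the one-second delay just simulates one producer finishing first):
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM

main :: IO ()
main = do
  a <- newEmptyTMVarIO
  b <- newEmptyTMVarIO
  _ <- forkIO (threadDelay 1000000 >> atomically (putTMVar b "b wins"))
  msg <- atomically (select [a, b]) -- blocks until either TMVar fills
  putStrLn msg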
How about using STM versions, TChan and TVar, with the retry and orElse behavior?
Implementing select is one of STM's nice capabilities. From "Composable Memory Transactions":
Beyond this, we also provide orElse, which allows them to be composed as alternatives, so that the second is run if the first retries (Section 3.4). This ability allows threads to wait for many things at once, like the Unix select system call – except that orElse composes well, whereas select does not.
orElse in RWH.
The STM package
Papers on Haskell's STM
