Parallel processing in conduit flow - haskell

I really like the concept of conduit/pipes for applying operations to a streaming IO source. I am interested in building tools that work on very large log files. One of the attractions of moving to Haskell from Python/Ruby is the easier way of writing parallel code, but I can't find any documentation of this. How could I set up a conduit flow which reads lines from a file and works on them in parallel (i.e., with 8 cores, it should read eight lines, hand them off to eight different threads to be processed, and then collect the results again, etc.), ideally with as little "ceremony" as possible...
Optionally it could be noted whether the lines need to be rejoined in order or not, if that could influence the speed of the process?
I am sure it would be possible to cobble together something myself using ideas from the Parallel Haskell book, but it seems to me that running a pure function in parallel (parmap etc) in the middle of a Conduit workflow should be very easy?

As an example of the "internal parallelism" mentioned by Petr Pudlák in his comment, consider this function (I'm using pipes, but it could be implemented with conduit just as easily):
import Control.Monad
import Control.Lens (view)
import Control.Concurrent.Async (mapConcurrently)
import Pipes
import qualified Pipes.Group as G
import qualified Control.Foldl as L

concProd :: Int -> (a -> IO b) -> Producer a IO r -> Producer b IO r
concProd groupsize action producer =
    -- Split the input into groups and fold each group into a list.
    L.purely G.folds L.list (view (G.chunksOf groupsize) producer)
    >->
    -- Process one list concurrently, then yield its results one by one.
    forever (await >>= liftIO . mapConcurrently action >>= mapM_ yield)
This function takes as parameters a group size, an action we want to run for each value of type a, and a Producer of a values.
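It returns a new Producer. Internally, the producer reads a values in batches of groupsize, processes each batch concurrently, and yields the results one by one. Since mapConcurrently returns its results in the order of its inputs, the output order matches the input order, but each batch takes as long as its slowest element.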
The code uses Pipes.Group to "partition" the original producer into sub-producers of size groupsize, and then Control.Foldl to "fold" each sub-producer into a list.
For more sophisticated tasks, you could turn to the asynchronous channels provided by pipes-concurrency or stm-conduit. But these pull you somewhat out of the "single pipeline" worldview of vanilla pipes/conduits.
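As a usage sketch, the snippet below runs concProd over a small in-memory list of made-up log lines. countWords and the sample lines are hypothetical stand-ins for real per-line work; in practice the producer would read from a file or from standard input (for example Pipes.Prelude.stdinLn). evaluate forces the result inside the worker thread, since laziness could otherwise defer the work to whoever consumes the value. To use several cores, the program has to be built with -threaded -rtsopts and run with +RTS -N.

```haskell
import Control.Concurrent.Async (mapConcurrently)
import Control.Exception (evaluate)
import Control.Lens (view)
import Control.Monad (forever)
import qualified Control.Foldl as L
import Pipes
import qualified Pipes.Group as G
import qualified Pipes.Prelude as P

-- Same function as above, repeated so the sketch is self-contained.
concProd :: Int -> (a -> IO b) -> Producer a IO r -> Producer b IO r
concProd groupsize action producer =
    L.purely G.folds L.list (view (G.chunksOf groupsize) producer)
    >-> forever (await >>= liftIO . mapConcurrently action >>= mapM_ yield)

-- Hypothetical per-line work: count the words on a line.
-- 'evaluate' makes sure the counting happens in the worker thread.
countWords :: String -> IO Int
countWords line = evaluate (length (words line))

main :: IO ()
main = do
    -- Made-up sample input; batches of two lines are processed concurrently.
    let logLines = ["GET / 200 12ms", "POST /login", "GET /missing 404"]
    counts <- P.toListM (concProd 2 countWords (each logLines))
    print counts  -- prints [4,2,3]
```

The results come back in input order because each batch is collected as a list before anything is yielded downstream.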

Related

Using a single Haskell pipe to split HTTP content to two consumers

I can't really figure out whether some of these other questions are similar enough to mine but I couldn't extract a solution out of them so I'm posting. Feel free to indicate to me otherwise.
I have a flow where I need to download a large CSV file, and 1) save it to disk, and 2) process it. I'd like to use Haskell pipes, with the pipes-http and pipes-csv packages to do this.
The obvious way is to have two separate pipes: 1) web -> disk, and then 2) disk -> process. Is it possible to do another topology where the output from the web splits into two consumers, one that saves and the other that processes? I feel that this could be more elegant and possibly more efficient.
If so, how is the splitting done? Splitting of pipes is not mentioned anywhere in the documentation.
The expression "splitting the content between consumers" might be a little misleading; you want to send all bytes to each of two consumers. But Pipes.Prelude.tee turns any consumer into a Pipe, thus
producer >-> tee consumer1 >-> consumer2
feeds the producer to both of the consumers. But the particular case of writing to a file might be simplest with Pipes.Prelude.chain, rather than a consumer. tee and chain allow you to do something with each incoming value, before forwarding it along the pipeline. In this case I just write each successive chunk to a handle, before passing it along:
import Pipes
import Pipes.HTTP
import qualified Pipes.ByteString as PB
import qualified Pipes.Prelude as P
import qualified System.IO as IO
import qualified Data.ByteString as B
main = do
    req <- parseUrl "https://www.example.com"
    m <- newManager tlsManagerSettings
    withHTTP req m $ \resp ->
        IO.withFile "file.txt" IO.WriteMode $ \h ->
            runEffect $ responseBody resp >-> P.chain (B.hPut h) >-> PB.stdout
I ended the pipeline with PB.stdout where you would use pipes-csv materials. Using tee, I could as well have written
runEffect $ responseBody resp >-> P.tee (PB.toHandle h) >-> PB.stdout
for the last line. Where the 'consumers' can be viewed as folds, there is the apparatus of Control.Foldl for combining many folds together - and any number of other devices.
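To illustrate the remark about folds, here is a minimal sketch in which two folds from Control.Foldl are combined applicatively and run over a producer in a single pass. The numeric input is a made-up stand-in for real data; with the download above, the producer would be the response body and the folds whatever summaries you need.

```haskell
import qualified Control.Foldl as L
import Pipes
import qualified Pipes.Prelude as P

main :: IO ()
main = do
    -- Both folds see every element, but the producer is traversed only once.
    (total, count) <-
        L.purely P.fold ((,) <$> L.sum <*> L.length) (each [1, 2, 3, 4 :: Int])
    print (total, count)  -- prints (10,4)
```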

Distributing Haskell on a cluster

I have a piece of code that processes files:
processFiles :: [FilePath] -> (FilePath -> IO ()) -> IO ()
This function spawns an async process that executes an IO action. This IO action must be submitted to a cluster through a job scheduling system (e.g. Slurm).
Because I must use the job scheduling system, it's not possible to use Cloud Haskell to distribute the closure. Instead, the program writes a new Main.hs containing the desired computations, which is copied to the cluster node together with all the modules that Main depends on and then executed remotely with "runhaskell Main.hs [opts]". The async process should then periodically ask the job scheduling system (using threadDelay between checks) whether the job is done.
Is there a way to avoid creating a new Main? Can I serialize the IO action and execute it somehow in the node?
Yep. There is a magical library called packman. It allows you to turn any Haskell thing into data (as long as it does not contain IORefs or related things). Here are the things you would need:
trySerialize :: a -> IO (Serialized a)
deserialize :: Serialized a -> IO a
instance Typeable a => Binary (Serialized a)
Yep, those are the exact types. You can package up your IO actions using trySerialize, use Binary to transfer it to wherever, and then deserialize to get the IO action out, ready for use.
Caveats for packman:
It stores things as thunks. This is probably what you want, so that the node can do the evaluating.
That said, if your thunk is huge, the Binary will probably be huge. Evaluating the thunk before serializing it can fix this.
Like I said, mutable references are a no-no. One thing to watch out for is mutable references hiding inside thunks without you knowing it.
Other than that, this seems like what you want!
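Independently of how the action gets to the node, the polling step described in the question can be sketched with base alone. This is only an illustration: pollUntil and fakeJobDone are hypothetical names, and in a real setup the check would query the job scheduler instead of a counter.

```haskell
import Control.Concurrent (threadDelay)
import Data.IORef (modifyIORef', newIORef, readIORef)

-- Run the check repeatedly, sleeping the given number of microseconds
-- between attempts, until the check reports True.
pollUntil :: Int -> IO Bool -> IO ()
pollUntil intervalMicros isDone = do
    done <- isDone
    if done
        then return ()
        else threadDelay intervalMicros >> pollUntil intervalMicros isDone

main :: IO ()
main = do
    -- Stand-in for asking the scheduler: reports "done" on the third check.
    checks <- newIORef (0 :: Int)
    let fakeJobDone = do
            modifyIORef' checks (+ 1)
            n <- readIORef checks
            return (n >= 3)
    pollUntil 100000 fakeJobDone
    n <- readIORef checks
    putStrLn ("job finished after " ++ show n ++ " checks")
```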

Concurrency considerations between pipes and non-pipes code

I'm in the process of wrapping a C library for some encoding in a pipes interface, but I've hit upon some design decisions that need to be made.
After the C library is set up, we hold on to an encoder context. With this, we can either encode, or change some parameters (let's call the Haskell interface to this last function tune :: Context -> Int -> IO ()). There are two parts to my question:
1. The encoding part is easily wrapped up in a Pipe Foo Bar IO (), but I would also like to expose tune. Since simultaneous use of the encoding context must be lock-protected, I would need to take a lock at every iteration in the pipe, and protect tune by taking the same lock. But now I feel I'm forcing hidden locks on the user. Am I barking up the wrong tree here? How is this kind of situation normally resolved in the pipes ecosystem? In my case I expect the pipe that my specific code is part of to always run in its own thread, with tuning happening concurrently, but I don't want to force this point of view upon any users. Other packages in the pipes ecosystem do not seem to force their users like this either.
2. An encoding context that is no longer used needs to be properly de-initialized. How does one, in the pipes ecosystem, ensure that such things (in this case performing some IO actions) are taken care of when the pipe is destroyed?
A concrete example would be wrapping a compression library, in which case the above can be:
1. The compression strength is tunable. We set up the pipe and it runs along merrily. How should one best go about allowing the compression strength setting to be changed while the pipe keeps running, assuming that concurrent access to the compression codec context must be serialized?
2. The compression library allocated a bunch of memory off the Haskell heap when set up, and we'll need to call some library function to clean this up when the pipe is torn down.
Thanks… this might all be obvious, but I'm quite new to the pipes ecosystem.
Edit: Reading this after posting, I'm quite sure it's the vaguest question I've ever asked here. Ugh! Sorry ;-)
Regarding (1), the general solution is to change your Pipe's type to:
Pipe (Either (Context, Int) Foo) Bar IO ()
In other words, it accepts both Foo inputs and tune requests, which it processes internally.
So let's then assume that you have two concurrent Producers corresponding to inputs and tune requests:
producer1 :: Producer Foo IO ()
producer2 :: Producer (Context, Int) IO ()
You can use pipes-concurrency to create a buffer that they both feed into, like this:
import Control.Concurrent.Async (async, wait)
import Pipes
import Pipes.Concurrent
import qualified Pipes.Prelude

example = do
    (output, input) <- spawn Unbounded
    -- input :: Input (Either (Context, Int) Foo)
    -- output :: Output (Either (Context, Int) Foo)
    let io1 = runEffect $ producer1 >-> Pipes.Prelude.map Right >-> toOutput output
        io2 = runEffect $ producer2 >-> Pipes.Prelude.map Left >-> toOutput output
    as <- mapM async [io1, io2]
    -- Both producers feed the same buffer; the pipe reads from it.
    runEffect (fromInput input >-> yourPipe >-> someConsumer)
    mapM_ wait as
You can learn more about the pipes-concurrency library by reading this tutorial.
By forcing all tune requests to go through the same single-threaded Pipe you can ensure that you don't accidentally have two concurrent invocations of the tune function.
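A minimal sketch of such a pipe, under simplifying assumptions: the "encoder" here is a hypothetical toy that multiplies each input by the current level, and the level is threaded through as a function argument rather than stored in a C context. Left carries a tune request and Right carries a value to encode, so both are handled on the same thread, one at a time.

```haskell
import Pipes
import qualified Pipes.Prelude as P

-- Toy stand-in for the real encoder: Left n changes the level,
-- Right x is "encoded" by multiplying it with the current level.
encoder :: Monad m => Int -> Pipe (Either Int Int) Int m r
encoder level = do
    msg <- await
    case msg of
        Left newLevel -> encoder newLevel
        Right x       -> yield (level * x) >> encoder level

main :: IO ()
main = do
    out <- P.toListM (each [Right 1, Right 2, Left 10, Right 3] >-> encoder 1)
    print out  -- prints [1,2,30]
```

With a real context, the Left branch would call tune and the Right branch would call the encoding function, still without any explicit lock.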
Regarding (2), there are two ways you can acquire a resource using pipes. The more sophisticated approach is to use the pipes-safe library, which provides a bracket function that you can use within a Pipe, but that is probably overkill for your purpose and only exists for acquiring and releasing multiple resources over the lifetime of a pipe. A simpler solution is to use the following with-style idiom to acquire the pipe:
withEncoder :: (Pipe Foo Bar IO () -> IO r) -> IO r
withEncoder k = bracket acquire release $ \resource -> do
    k (createPipeFromResource resource)
Then a user would just write:
withEncoder $ \yourPipe -> do
    runEffect (someProducer >-> yourPipe >-> someConsumer)
You can optionally use the managed package, which simplifies the types a bit and makes it easier to acquire multiple resources. You can learn more about it from reading this blog post of mine.
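For a runnable picture of the with-style idiom, here is a sketch with a made-up resource. Encoder, acquire and release are hypothetical placeholders for the C library's context, initialisation and cleanup; the pipe built from the resource just multiplies its input.

```haskell
import Control.Exception (bracket)
import Pipes
import qualified Pipes.Prelude as P

-- Hypothetical stand-in for the encoder context.
newtype Encoder = Encoder Int

acquire :: IO Encoder
acquire = putStrLn "init encoder" >> return (Encoder 2)

release :: Encoder -> IO ()
release _ = putStrLn "free encoder"

-- The pipe only exists inside the callback, so release always runs
-- afterwards, even if the pipeline throws an exception.
withEncoder :: (Pipe Int Int IO () -> IO r) -> IO r
withEncoder k = bracket acquire release $ \(Encoder factor) ->
    k (P.map (* factor))

main :: IO ()
main = withEncoder $ \yourPipe ->
    runEffect (each [1, 2, 3] >-> yourPipe >-> P.print)
```

Running it prints "init encoder", then 2, 4 and 6, then "free encoder".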

Is it safe to reuse a conduit?

Is it safe to perform multiple actions using the same conduit value? Something like
do
    let sink = sinkSocket sock
    something $$ sink
    somethingElse $$ sink
I recall that in the early versions of conduit there were some dirty hacks that made this unsafe. What's the current status?
(Note that sinkSocket doesn't close the socket.)
That usage is completely safe. The issue in older versions had to do with blurring the line between resumable and non-resumable components. With modern versions (I think since 0.4), the line is very clear between the two.
It might be safe to reuse sinks in the sense that the semantics for the "used" sink doesn't change. But you should be aware of another threat: space leaks.
The situation is analogous to lazy lists: you can consume a huge list lazily in constant space, but if you process the list twice it will be kept in memory. The same thing might happen with a recursive monadic expression: if you use it once it runs in constant space, but if you reuse it the structure of the computation is kept in memory, resulting in a space leak.
Here's an example:
import Data.Conduit
import Data.Conduit.List
import Control.Monad.Trans.Class (lift)

consumeN 0 _ = return ()
consumeN n m = do
    await >>= (lift . m)
    consumeN (n-1) m

main = do
    let sink = consumeN 1000000 (\i -> putStrLn ("Got one: " ++ show i))
    sourceList [1..9000000::Int] $$ sink
    sourceList [1..22000000::Int] $$ sink
This program uses about 150 MB of RAM on my machine, but if you remove the last line or repeat the definition of sink in both places, you get nice constant space usage.
I agree that this is a contrived example (this was the first that came to my mind), and this is not very likely to happen with most Sinks. For example, this will not happen with your sinkSocket. (It is contrived because the control structure of the sink doesn't depend on the values it gets, which is also why it can leak.) But for sources, for example, this would be much more common. (Many of the common Sources exhibit this behavior. sourceList would be an obvious example, because it would actually keep the source list in memory. But enumFromTo is no different: although there is no data to keep in memory, the structure of the monadic computation is retained.)
So, all in all, I think it's important to be aware of this.
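The lazy-list analogy used above can be tried with base alone. This is only a sketch of the sharing effect: how much memory is actually retained depends on the GHC version and optimisation flags.

```haskell
main :: IO ()
main = do
    -- xs is shared by both traversals below.
    let xs = [1 .. 5000000] :: [Int]
    -- On its own, this traversal could run in constant space...
    print (sum xs)
    -- ...but because xs is used again here, the cells produced by the first
    -- traversal stay reachable and cannot be collected as they are consumed.
    print (length xs)
```

Dropping the second traversal, or writing out the enumeration separately for each use, avoids the retention, which mirrors the effect of repeating the definition of sink.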

Safer Handles in Haskell?

I felt a bit insecure when using Haskell Handles. Namely, I'm looking for two features (maybe they are already there and in that case please forgive my ignorance).
When I've obtained a handle (e.g., returned by Network.accept), which is both readable and writable, I wish to convert it into a pair of read-only and write-only handles such that writing to a read-only handle won't type check and vice versa. (Perhaps one can achieve this using phantom types and wrappers around IO functions?)
In a concurrent setting, I found that it is possible for multiple threads to write to the same handle, which gives rise to quite nasty consequences. How could one prevent that through the type system (if possible) or at least get notified of such case via thrown exception during run-time?
Any idea is welcome.
It looks like the safer-file-handles library does what you want. The first part is handled pretty clearly. The concurrency-safety appears to be handled by RegionT from the regions library. I haven't used this at all, but it looks like a pretty common approach.
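The phantom-type idea raised in the question can also be sketched by hand with base. This is not the safer-file-handles API: SafeHandle, splitHandle, readLine and writeLine are hypothetical names, the temporary file only stands in for a socket handle, and the sketch covers the read/write separation only, not the concurrency concern.

```haskell
import System.Directory (removeFile)
import System.IO

-- Phantom tags: they have no values and exist only at the type level.
data ReadOnly
data WriteOnly

-- The mode parameter does not appear on the right-hand side.
newtype SafeHandle mode = SafeHandle Handle

-- Turn one read-write handle into two restricted views of it.
splitHandle :: Handle -> (SafeHandle ReadOnly, SafeHandle WriteOnly)
splitHandle h = (SafeHandle h, SafeHandle h)

readLine :: SafeHandle ReadOnly -> IO String
readLine (SafeHandle h) = hGetLine h

writeLine :: SafeHandle WriteOnly -> String -> IO ()
writeLine (SafeHandle h) s = hPutStrLn h s

main :: IO ()
main = do
    -- openTempFile returns a handle opened in read-write mode.
    (path, h) <- openTempFile "." "handle-demo.txt"
    let (r, w) = splitHandle h
    writeLine w "hello"
    hSeek h AbsoluteSeek 0        -- rewind so the line can be read back
    readLine r >>= putStrLn       -- prints hello
    -- writeLine r "oops"         -- would be rejected by the type checker
    hClose h
    removeFile path
```

For the guarantee to hold, a real module would export the wrapper functions but not the SafeHandle constructor.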
You may want to consider using the network-conduit package. It describes a network application as something that is given two "endpoints": a sink that pushes data into a socket and a source that reads data from the socket:
type Application m = AppData m -> m ()
data AppData m -- ...
appSource :: AppData m -> Source m ByteString
appSink :: AppData m -> Sink ByteString m ()
This cleanly separates the writing and the reading part. Now you can do whatever you like with such a source and a sink, even passing each to a different thread and processing input and output separately. Of course, each of them can only read or write, depending on what endpoint you give to it.
If you want to enforce single-threaded processing, you can restrict yourself to implementing your program components as Conduit ByteString m ByteString. Such a conduit can easily be turned into an Application like this:
asApp :: MonadIO m => Conduit ByteString m ByteString -> Application m
asApp cond ad = appSource ad $= cond $$ appSink ad
But a conduit can only request data using await and write output using yield; otherwise it has no access to any kind of handles and never sees any of its endpoints, so it can't expose or leak them anywhere.

Resources