Distributing Haskell on a cluster

I have a piece of code that processes files:
processFiles :: [FilePath] -> (FilePath -> IO ()) -> IO ()
This function spawns an async process that executes an IO action. This IO action must be submitted to a cluster through a job scheduling system (e.g. Slurm).
Because I must go through the job scheduling system, it's not possible to use Cloud Haskell to distribute the closure. Instead, the program writes a new Main.hs containing the desired computations, which is copied to the cluster node together with all the modules that Main depends on and then executed remotely with "runhaskell Main.hs [opts]". The async process then periodically asks the job scheduling system (using threadDelay between requests) whether the job is done.
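A minimal sketch of that polling loop (checkJobDone is a hypothetical wrapper around the scheduler's status command, e.g. squeue):
import Control.Concurrent (threadDelay)

-- Hypothetical: returns True once the scheduler reports the job finished.
checkJobDone :: String -> IO Bool
checkJobDone jobId = undefined

-- Poll the scheduler every five seconds until the job completes.
waitForJob :: String -> IO ()
waitForJob jobId = do
    done <- checkJobDone jobId
    if done
        then return ()
        else threadDelay (5 * 1000 * 1000) >> waitForJob jobId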
Is there a way to avoid creating a new Main? Can I serialize the IO action and execute it somehow on the node?

Yep. There is a magical library called packman. It allows you to turn any Haskell thing into data (as long as it does not have IORefs or related things in it). Here are the things you would need:
trySerialize :: a -> IO (Serialized a)
deserialize :: Serialized a -> IO a
instance Typeable a => Binary (Serialized a)
Yep, those are the exact types. You can package up your IO actions using trySerialize, use Binary to transfer it to wherever, and then deserialize to get the IO action out, ready for use.
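A minimal sketch of the round trip, assuming packman's GHC.Packing module and the binary package (sendJob, runJob, and the file-based transfer are made up for illustration). Note that packman requires the same compiled executable on both ends, so the node would run a compiled copy of the program rather than runhaskell:
import GHC.Packing (Serialized, trySerialize, deserialize)
import Data.Binary (encodeFile, decodeFile)

-- On the submitting machine: package the IO action as a thunk and
-- write it to a file that is shipped to the node.
sendJob :: IO () -> FilePath -> IO ()
sendJob action path = do
    s <- trySerialize action   -- :: Serialized (IO ())
    encodeFile path s

-- On the cluster node: read the thunk back and run it.
runJob :: FilePath -> IO ()
runJob path = do
    s <- decodeFile path       -- :: Serialized (IO ())
    action <- deserialize s
    action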
Caveats for packman:
It stores things as thunks. This is probably what you want, so that the node can do the evaluating.
That said, if your thunk is huge, the serialized Binary representation will probably be huge too. Evaluating the thunk before serializing can fix this.
Like I said, mutable references are a no-no. One thing to watch out for is them hiding inside thunks without your knowing it.
Other than that, this seems like what you want!

Related

How to modify a state monad?

I use the State monad transformer to manage global state, like this:
data State = State ...
StateT State IO ()
And I use amqp to consume messages from RabbitMQ. The state should be modified according to the messages received. The consuming function has a type like
consumeMsgs :: Channel
            -> Text
            -> Ack
            -> ((Message, Envelope) -> IO ()) -- ^ the callback function
            -> IO ConsumerTag
For now we can ignore all the parameters except the callback function, which I supply and in which the state modification happens.
Because the callback runs in the IO monad, I use this function as follows:
consumeMsgs chan queue Rmq.Ack (flip evalStateT ssss . rmqCallback)
Here ssss is the initial state I pass in, and I find that within a single run of my callback rmqCallback the state is modified correctly. But every time the next callback fires, the global state is back to what it was before consumeMsgs was called, i.e. equal to ssss.
I understand that the State monad just threads an initial state through a computation and has nothing to do with any state outside the monad (am I missing something?), so I fell back on an MVar to hold and modify the state, and that works. I want to know whether there is another way to handle this, maybe another monad?
It looks like you could use Network.AMQP.Lifted.consumeMsgs. StateT s IO is an instance of MonadBaseControl IO, so you could run the whole consumeMsgs call inside a single runStateT.
Yes, the StateT monad transformer is basically nice notation for pure code, so if your API accepts only IO callbacks you have no choice but to use "real" state such as an MVar or IORef.
PS: As the other answer notes, state changes made in Network.AMQP.Lifted.consumeMsgs's callback do not propagate to subsequent callback runs or to the resulting state. I cannot wrap my head around the implementation, but I experimented with liftBaseWith a bit and it really does look that way.
To add a clarification that might be useful for future reference: the accepted answer is not exact. While Network.AMQP.Lifted.consumeMsgs should work with StateT s IO, the RabbitMQ Haskell library actually discards the monadic state after each use. This means that if you do use that instance, you will not see changes made after the initial consumeMsgs call, including changes made by the callback itself. The callback is essentially invoked with the same monadic state every time: the state as it was when the callback was registered.
This means that you can use it to pass global configuration state, but not to keep track of state between callback executions.
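For reference, a minimal sketch of the MVar approach the asker ended up with (MyState, msgCount, and myQueue are invented for illustration):
{-# LANGUAGE OverloadedStrings #-}
import Control.Concurrent.MVar
import Network.AMQP

newtype MyState = MyState { msgCount :: Int }

callback :: MVar MyState -> (Message, Envelope) -> IO ()
callback var (_msg, env) = do
    -- Update the shared state atomically from within the IO callback.
    modifyMVar_ var $ \s -> return s { msgCount = msgCount s + 1 }
    ackEnv env

main :: IO ()
main = do
    var  <- newMVar (MyState 0)
    conn <- openConnection "127.0.0.1" "/" "guest" "guest"
    chan <- openChannel conn
    _    <- consumeMsgs chan "myQueue" Ack (callback var)
    return ()   -- a real program would block here instead of exiting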

Concurrency considerations between pipes and non-pipes code

I'm in the process of wrapping a C library for some encoding in a pipes interface, but I've hit upon some design decisions that need to be made.
After the C library is set up, we hold on to an encoder context. With this, we can either encode, or change some parameters (let's call the Haskell interface to this last function tune :: Context -> Int -> IO ()). There are two parts to my question:
The encoding part is easily wrapped up in a Pipe Foo Bar IO (), but I would also like to expose tune. Since simultaneous use of the encoding context must be lock-protected, I would need to take a lock at every iteration of the pipe and protect tune with the same lock. But now I feel I'm forcing hidden locks on the user. Am I barking up the wrong tree here? How is this kind of situation normally resolved in the pipes ecosystem? In my case I expect the pipe that my specific code is part of to always run in its own thread, with tuning happening concurrently, but I don't want to force this point of view on users. Other packages in the pipes ecosystem do not seem to constrain their users like this either.
An encoding context that is no longer used needs to be properly de-initialized. How does one, in the pipes ecosystem, ensure that such things (in this case, performing some IO actions) are taken care of when the pipe is destroyed?
A concrete example would be wrapping a compression library, in which case the above can be:
The compression strength is tunable. We set up the pipe and it runs along merrily. How should one best go about allowing the compression strength setting to be changed while the pipe keeps running, assuming that concurrent access to the compression codec context must be serialized?
The compression library allocates a bunch of memory off the Haskell heap when set up, and we'll need to call some library function to clean this up when the pipe is torn down.
Thanks… this might all be obvious, but I'm quite new to the pipes ecosystem.
Edit: Reading this after posting, I'm quite sure it's the vaguest question I've ever asked here. Ugh! Sorry ;-)
Regarding (1), the general solution is to change your Pipe's type to:
Pipe (Either (Context, Int) Foo) Bar IO ()
In other words, it accepts both Foo inputs and tune requests, which it processes internally.
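For instance, the pipe's inner loop might dispatch on the two cases like this (encode :: Context -> Foo -> IO Bar is a stand-in for the asker's C wrapper, alongside the tune from the question):
import Control.Monad (forever)
import Pipes

encoder :: Context -> Pipe (Either (Context, Int) Foo) Bar IO ()
encoder ctx = forever $ do
    next <- await
    case next of
        Left (c, n) -> lift (tune c n)                  -- apply a tune request
        Right foo   -> lift (encode ctx foo) >>= yield  -- encode one chunk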
So let's then assume that you have two concurrent Producers corresponding to inputs and tune requests:
producer1 :: Producer Foo IO ()
producer2 :: Producer (Context, Int) IO ()
You can use pipes-concurrency to create a buffer that they both feed into, like this:
example = do
    (output, input) <- spawn Unbounded
    -- input  :: Input  (Either (Context, Int) Foo)
    -- output :: Output (Either (Context, Int) Foo)
    let io1 = runEffect $ producer1 >-> Pipes.Prelude.map Right >-> toOutput output
        io2 = runEffect $ producer2 >-> Pipes.Prelude.map Left  >-> toOutput output
    as <- mapM async [io1, io2]
    runEffect (fromInput input >-> yourPipe >-> someConsumer)
    mapM_ wait as
You can learn more about the pipes-concurrency library by reading this tutorial.
By forcing all tune requests to go through the same single-threaded Pipe you can ensure that you don't accidentally have two concurrent invocations of the tune function.
Regarding (2), there are two ways you can acquire a resource using pipes. The more sophisticated approach is to use the pipes-safe library, which provides a bracket function that you can use within a Pipe, but that is probably overkill for your purpose; it mainly exists for acquiring and releasing multiple resources over the lifetime of a pipe. A simpler solution is just to use the following with idiom to acquire the pipe:
withEncoder :: (Pipe Foo Bar IO () -> IO r) -> IO r
withEncoder k = bracket acquire release $ \resource ->
    k (createPipeFromResource resource)
Then a user would just write:
withEncoder $ \yourPipe ->
    runEffect (someProducer >-> yourPipe >-> someConsumer)
You can optionally use the managed package, which simplifies the types a bit and makes it easier to acquire multiple resources. You can learn more about it from reading this blog post of mine.
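A sketch of the same idea with managed (acquire, release, and createPipeFromResource are the hypothetical functions from above):
import Control.Exception (bracket)
import Control.Monad.Managed

encoderPipe :: Managed (Pipe Foo Bar IO ())
encoderPipe = do
    resource <- managed (bracket acquire release)
    return (createPipeFromResource resource)

main :: IO ()
main = runManaged $ do
    yourPipe <- encoderPipe
    liftIO $ runEffect (someProducer >-> yourPipe >-> someConsumer)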

Safer Handles in Haskell?

I feel a bit insecure when using Haskell Handles. Namely, I'm looking for two features (maybe they are already there, and in that case please forgive my ignorance).
When I've obtained a handle (e.g., returned by Network.accept) that is both readable and writable, I wish to convert it into a pair of read-only and write-only handles, such that writing to a read-only handle won't type check and vice versa. (Perhaps one can achieve this using phantom types and wrappers around the IO functions?)
In a concurrent setting, I found that it is possible for multiple threads to write to the same handle, which gives rise to quite nasty consequences. How could one prevent that through the type system (if possible), or at least get notified of such a case via an exception thrown at run time?
Any idea is welcome.
It looks like the safer-file-handles library does what you want. The first part is handled pretty clearly. The concurrency-safety appears to be handled by RegionT from the regions library. I haven't used this at all, but it looks like a pretty common approach.
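If you want to roll the first part yourself, a minimal sketch of the phantom-type idea from the question might look like this (all names here are invented, not taken from safer-file-handles):
{-# LANGUAGE EmptyDataDecls #-}
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import System.IO (Handle)

data ReadOnly
data WriteOnly

newtype TypedHandle t = TypedHandle Handle

-- Split a read/write handle into two phantom-tagged views.
split :: Handle -> (TypedHandle ReadOnly, TypedHandle WriteOnly)
split h = (TypedHandle h, TypedHandle h)

hGet :: TypedHandle ReadOnly -> Int -> IO ByteString
hGet (TypedHandle h) = BS.hGet h

hPut :: TypedHandle WriteOnly -> ByteString -> IO ()
hPut (TypedHandle h) = BS.hPut h
-- Calling hPut on a TypedHandle ReadOnly is now a type error.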
You may want to consider using the network-conduit package. It describes a network application as something that is given two "endpoints": a sink that pushes data into a socket, and a source that reads data from the socket:
type Application m = AppData m -> m ()
data AppData m -- abstract
appSource :: AppData m -> Source m ByteString
appSink :: AppData m -> Sink ByteString m ()
This cleanly separates the writing and the reading part. Now you can do whatever you like with such a source and a sink, even passing each to a different thread and processing input and output separately. Of course, each of them can only read or write, depending on what endpoint you give to it.
If you want to enforce single-threaded processing, you can restrict yourself to implementing your program components as Conduit ByteString m ByteString. Such a conduit can be easily turned into an Application like
asApp :: MonadIO m => Conduit ByteString m ByteString -> Application m
asApp cond ad = appSource ad $= cond $$ appSink ad
But a conduit can only request data using await and produce output using yield; otherwise it has no access to any kind of handle and never sees its endpoints, so it can't expose or leak them anywhere.
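For instance, a conduit that upper-cases traffic becomes an Application without ever touching the socket (a sketch building on the asApp helper above; the imports are assumptions):
import Data.Char (toUpper)
import qualified Data.ByteString.Char8 as BC
import qualified Data.Conduit.List as CL

upperApp :: MonadIO m => Application m
upperApp = asApp (CL.map (BC.map toUpper))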

Nondeterministically interleaving conduit's Sources

I was hoping to see a nondeterministic interleaving operation for sources, with a type signature like
interleave :: WhateverIOMonadClassItWouldWant m => [(k, Source m a)] -> Source m (k, a)
The use case is that I have a p2p application that maintains open connections to many nodes on the network, and it is mostly just sitting around waiting for messages from any of them. When a message arrives, it doesn't care where it came from, but needs to process the message as soon as possible. In theory this kind of application (at least when used for socket-like sources) could bypass GHC's IO manager entirely and run the select/epoll/etc. calls directly, but I don't particularly care how it's implemented, as long as it works.
Is something like this possible with conduit? A less general but probably more feasible approach might be to write a [(k, Socket)] -> Source m (k, ByteString) function that handles receiving on all the sockets for you.
I noticed the ResumableSource operations in conduit, but they all seem to want to be aware of a particular Sink, which feels like a bit of an abstraction leak, at least for this operation.
The stm-conduit package provides mergeSources, which does something similar, though not identical, to what you're looking for. It's probably a good place to start.
Yes, it is possible.
You can poll a bunch of Sources without blocking by forking threads to do the polling, where in each thread you pair the Source up with a Sink that sends the output to some concurrency channel:
concur :: (WhateverIOMonadClassItWouldWant m) => TChan a -> Sink a m r
... and then you define a Source that reads from that channel:
synchronize :: (WhateverIOMonadClassItWouldWant m) => TChan a -> Source m a
Note that this would be no different from just forking threads to poll the sockets themselves, but being more general it would also be useful to other conduit users who might want to poll things other than sockets using Sources they have defined.
If you combined those capabilities into one function, then the overall API of the call would look something like:
poll :: (WhateverIOMonadClassItWouldWant m) => [Source m a] -> m (Source m a)
... but you can still throw in those ks if you want.
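Putting those pieces together, a sketch of the interleave function from the question (using a TChan and the old-style $$ operator to match the answers above; assumes the conduit and stm packages):
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forever, forM_)
import Control.Monad.IO.Class (liftIO)
import Data.Conduit
import qualified Data.Conduit.List as CL

interleave :: [(k, Source IO a)] -> IO (Source IO (k, a))
interleave sources = do
    chan <- newTChanIO
    -- One polling thread per source; each tags its output with its key.
    forM_ sources $ \(k, src) ->
        forkIO $ src $$ CL.mapM_ (\a -> atomically (writeTChan chan (k, a)))
    -- The merged source just drains the channel.
    return . forever $ liftIO (atomically (readTChan chan)) >>= yield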

How can one implement a forking try-catch in Haskell?

I want to write a function
forkos_try :: IO (Maybe α) -> IO (Maybe α)
that takes a command x. x is an imperative operation that first mutates state and then checks whether that state is messed up or not. (It does not do anything external, since that would require some kind of OS-level sandboxing to revert.)
if x evaluates to Just y, forkos_try returns Just y.
otherwise, forkos_try rolls back state, and returns Nothing.
Internally, it should fork() into threads parent and child, with x running on child.
if x succeeds, child should keep running (returning x's result) and parent should die
otherwise, parent should keep running (returning Nothing) and child should die
Question: What's the way to write something with semantics equivalent to, or more powerful than, forkos_try? N.B. -- the state mutated (by x) is in an external library and cannot be passed between threads. Hence, the semantics of which thread is kept alive are important.
Formally, "keep running" means "execute some continuation rest :: Maybe α -> IO () ". But, that continuation isn't kept anywhere explicit in code.
For my case, I think it will (for the time being) work to write it in a different style, using forkOS (which takes the entire computation the child will run), since I can write an explicit expression for rest. But it troubles me that I can't figure out how to do this with the primitive function forkOS -- one would think it would be general enough to support any specific case (which could appear as a high-level API, like forkos_try).
EDIT -- please see the example code with explicit rest if the problem's still not clear [ http://pastebin.com/nJ1NNdda ].
p.s. I haven't written concurrency code in a while; hopefully my knowledge of POSIX fork() is correct! Thanks in advance.
Things are a lot simpler to reason about if you model state explicitly.
someStateFunc :: s -> Maybe (a, s)

-- inside some other function
case someStateFunc initialState of
    Nothing            -> ... -- it failed; stick with the initial state
    Just (a, newState) -> ... -- it succeeded; do something with
                              -- the result and the new state
With immutable state, "rolling back" is simple: just keep using initialState. And "not rolling back" is also simple: just use newState.
So...I'm assuming from your explanation that this "external library" performs some nontrivial IO effects that are nevertheless restricted to a few knowable and reversible operations (modify a file, an IORef, etc). There is no way to reverse some things (launch the missiles, write to stdout, etc), so I see one of two choices for you here:
clone the world, and run the action in a sandbox. If it succeeds, then go ahead and run the action in the Real World.
clone the world, and run the action in the real world. If it fails, then replace the Real World with the snapshot you took earlier.
Of course, both of these are actually the same approach: fork the world. One world runs the action, one world doesn't. If the action succeeds, then that world continues; otherwise, the other world continues. You are proposing to accomplish this by building upon forkOS, which would clone the entire state of the program, but this would not be sufficient to deal with, for example, file modifications. Allow me to suggest instead an approach that is nearer to the simplicity of immutable state:
tryIO :: IO s -> (s -> IO ()) -> IO (Maybe a) -> IO (Maybe a)
tryIO save restore action = do
    initialState <- save
    result <- action
    case result of
        Nothing -> restore initialState >> return Nothing
        Just x  -> return (Just x)
Here you must provide some data structure s, and a way to save to and restore from said data structure. This allows you the flexibility to perform any cloning you know to be necessary. (e.g. save could copy a certain file to a temporary location, and then restore could copy it back and delete the temporary file. Or save could copy the value of certain IORefs, and then restore could put the value back.) This approach may not be the most efficient, but it's very straightforward.
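A hypothetical usage sketch, saving and restoring an IORef around a fallible action (all names invented for illustration):
import Data.IORef

example :: IORef Int -> IO (Maybe Int)
example ref =
    tryIO (readIORef ref)     -- save: snapshot the current value
          (writeIORef ref)    -- restore: put the snapshot back
          (do modifyIORef ref (+ 1)   -- the action mutates state...
              v <- readIORef ref
              return (if v < 10 then Just v else Nothing))  -- ...then validates it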
