STM-friendly list as a change log - haskell

I need advice on the data structure to use as an atomic change log.
I'm trying to implement the following algorithm. There is a flow of incoming
changes updating an in-memory map. In Haskell-like pseudocode it is
update :: DataSet -> SomeListOf Change -> Change -> STM (DataSet, SomeListOf Change)
update dataSet existingChanges newChange = do
  ...
  return (dataSet, existingChanges ++ [newChange])
where DataSet is a map (currently the Map from the stm-containers package, https://hackage.haskell.org/package/stm-containers-0.2.10/docs/STMContainers-Map.html). The whole update is called from an arbitrary number of threads. Some of the Changes can be rejected due to domain semantics; I use throwSTM for that, to throw away the effect of the transaction. In case of a successful commit the newChange is added to the list.
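For illustration, a minimal sketch of the rejection path (the valid check and the error message are assumptions for the example, not part of the actual code):

import Control.Concurrent.STM (STM, throwSTM)
import Control.Monad (unless)

update dataSet existingChanges newChange = do
  ok <- valid dataSet newChange          -- hypothetical domain check
  -- throwSTM aborts the transaction, discarding all of its effects,
  -- including any updates already made to the map
  unless ok $ throwSTM (userError "change rejected")
  -- ... apply newChange to the DataSet ...
  return (dataSet, existingChanges ++ [newChange])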
There exists a separate thread which calls the following function:
flush :: TVar (DataSet, SomeListOf Change) -> IO ()
This function is supposed to take the current snapshot of the DataSet together with the list of changes (it has to be a consistent pair) and flush it to the filesystem, i.e.
flush data_ = do
  (dataSet, changes) <- atomically $ readTVar data_
  -- write them both to FS
  -- ...
  atomically $ writeTVar data_ (dataSet, [])
I need advice about the data structure to use for SomeListOf Change. I don't want to use [Change] because it is "too ordered" and I'm afraid there will be too many conflicts, forcing whole transactions to retry. Please correct me if I'm wrong here.
I cannot use the Set (https://hackage.haskell.org/package/stm-containers-0.2.10/docs/STMContainers-Set.html) because I still need to preserve some order, e.g. the order of transaction commits. I could use TChan, and it looks like a good match (exactly the order of transaction commits), but I don't know how to implement the flush function so that it would give a consistent view of the whole change log together with the DataSet.
The current implementation is here: https://github.com/lolepezy/rpki-pub-server/blob/add-storage/src/RRDP/Repo.hs, in the functions applyActionsToState and rrdpSyncThread, respectively. It uses TChan and seems to do it the wrong way.
Thank you in advance.
Update: a reasonable answer seems to be something like this:
type SomeListOf c = TChan [c]

update :: DataSet -> TChan [Change] -> [Change] -> Change -> STM DataSet
update dataSet changeChan existingChanges newChange = do
  ...
  writeTChan changeChan $ reverse (newChange : existingChanges)
  return dataSet
flush data_ = do
  (dataSet, changes) <- atomically $ (,) <$> readTVar data_ <*> readTChan changeChan
  -- write them both to FS
  -- ...
But I'm still not sure whether it's a neat solution to pass the whole list as an element of the channel.

I'd probably just go with the list and see how far it takes you performance-wise. That said, you should consider that both appending to the end of a list and reversing it are O(n) operations, so you should try to avoid this. Maybe you can just prepend the incoming changes like this:
update dataSet existingChanges newChange = do
  -- ...
  return (dataSet, newChange : existingChanges)
Also, your example for flush has the problem that reading and updating the state is not atomic at all. You must accomplish this using a single atomically call like so:
flush data_ = do
  (dataSet, changes) <- atomically $ do
    result@(dataSet', _) <- readTVar data_
    writeTVar data_ (dataSet', [])
    return result
-- write them both to FS
-- ...
You could then just write them out in reverse order (because now changes contains the elements from newest to oldest), or reverse here once if it's important to write them out oldest to newest. If that's important, I'd probably go with some data structure which allows O(1) element access, like a good old vector.
When using a fixed-size vector you would obviously have to deal with the problem that it can become "full", which would mean your writers would have to wait for flush to do its job before adding fresh changes. That's why I'd personally go for the simple list first and see if it's sufficient or where it needs to be improved.
PS: A deque might be a good fit for your problem as well, but going fixed-size forces you to deal with the problem that your writers can potentially produce more changes than your reader can flush out. The deque can grow indefinitely, but your RAM probably can't. And the vector has pretty low overhead. (For a fixed-size variant, see the sketch below.)
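For the fixed-size variant, a minimal sketch built on stm's bounded TBQueue (the names ChangeLog, pushChange and drainChanges are invented for illustration, not an existing API):

import Control.Concurrent.STM

-- A bounded change log; create with e.g. atomically (newTBQueue 1024).
type ChangeLog c = TBQueue c

-- Writers retry (block) when the queue is full, which gives exactly the
-- back-pressure described above instead of unbounded growth.
pushChange :: ChangeLog c -> c -> STM ()
pushChange = writeTBQueue

-- Drain everything currently queued, preserving commit order.
drainChanges :: ChangeLog c -> STM [c]
drainChanges q = do
  empty <- isEmptyTBQueue q
  if empty
    then return []
    else (:) <$> readTBQueue q <*> drainChanges q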

I did some (very simplistic) investigation
https://github.com/lolepezy/rpki-pub-server/tree/add-storage/test/changeLog
imitating exactly the type of load I'm supposedly going to have. I used the same STMContainers.Map for the data set and a usual list for the change log. To track the number of transaction retries, I used Debug.Trace.trace: the total number of lines printed by trace gives the number of transaction attempts, and the number of unique lines gives the number of committed transactions.
The result is here (https://github.com/lolepezy/rpki-pub-server/blob/add-storage/test/changeLog/numbers.txt). The first column is the number of threads, the second is the number of change sets generated in total. The third column is the number of trace calls for the case without change log and the last one is the number of trace calls with the change log.
Apparently most of the time the change log adds some extra retries, but it's pretty much insignificant. So I guess it's fair to say that any data structure would be good enough, because most of the work is related to updating the map and most of the retries happen because of it.

Related

How to impurely modify a state associated with an object?

In Haskell, I have a container like:
data Container a = Container { length :: Int, buffer :: Unboxed.Vector (Int,a) }
This container is a flattened tree. Its accessor (!) performs a binary (log(N)) search through the vector in order to find the right bucket where the index is stored.
(!) :: Container a -> Int -> a
container ! index = ... binary search ...
Since consecutive accesses are likely to be in the same bucket, this could be optimized in the following way:
if `index` is in the last accessed bucket, skip the search
The tricky point is the last accessed bucket part. In JavaScript, I'd just impurely modify a hidden variable on the container object.
function read(index, object) {
  var lastBucket = object.__lastBucket;
  // if the last bucket contains index, no need to search
  if (contains(object, lastBucket, index))
    var bucket = lastBucket;
  // if it doesn't
  else {
    // then we search for the bucket
    var bucket = searchBucket(index, object);
    // and impurely annotate it on the object, so the
    // next time we access it we can skip the search
    object.__lastBucket = bucket;
  }
  return object.buffer[bucket].value;
}
Since this is just an optimization and the result is the same independent of the branch taken, I believe it doesn't break referential transparency. How is it possible, in Haskell, to impurely modify a state associated with a runtime value?
I have thought of 2 possible solutions.
1. A global, mutable hashmap linking pointers to lastBucket values, using unsafePerformIO to write to it. But I'd need a way to get the runtime pointer of an object, or at least a unique id of some sort (how?).
2. Add an extra field to Container, lastBucket :: Int, and somehow impurely modify it within (!), considering that field internal (because it obviously breaks referential transparency).
Using solution (1), I managed to get the following design. First, I added a __lastAccessedBucket :: IORef Int field to my datatype, as suggested by #Xicò:
data Container a = Container {
length :: Int,
buffer :: V.Vector (Int,a),
__lastAccessedBucket :: IORef Int }
Then, I had to update the functions that create a new Container in order to create a new IORef using unsafePerformIO:
fromList :: [a] -> Container a
fromList list = unsafePerformIO $ do
    ref <- newIORef 0
    return $ Container (L.length list) buffer ref
  where
    buffer = V.fromList (prepare list)
Finally, I created two new functions: findBucketWithHint, a pure function which searches for the bucket of an index given a guess (i.e., the bucket where you think it might be), and unsafeFindBucket, which replaces the pure findBucket when performance is needed, by always using the last accessed bucket as the hint:
unsafeFindBucket :: Int -> Container a -> Int
unsafeFindBucket findIdx container = unsafePerformIO $ do
  let lastBucketRef = __lastAccessedBucket container
  lastBucket <- readIORef lastBucketRef
  let newBucket = findBucketWithHint lastBucket findIdx container
  writeIORef lastBucketRef newBucket
  return newBucket
With this, unsafeFindBucket is technically a pure function with the same API as the original findBucket function, but it is an order of magnitude faster in some benchmarks. I have no idea how safe this is or where it could cause bugs. Threads are certainly a concern.
(This is more an extended comment than an answer.)
First I'd suggest checking whether this isn't a case of premature optimization. After all, O(log n) isn't that bad.
If this part is indeed performance-critical, your intention is definitely valid. The usual warning for unsafePerformIO is "use it only if you know what you're doing", which you obviously do, and it can help to make things pure and fast at the same time.
Be sure that you follow all the precautions in the docs, in particular setting the proper compiler flags (you might want to use the OPTIONS_GHC pragma).
Also make sure that the IO operation is thread-safe. The easiest way to ensure that is to use IORef together with atomicModifyIORef (see the sketch below).
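For instance, a minimal sketch reusing the Container and findBucketWithHint definitions from the question (whether the unsafePerformIO itself is justified remains the caller's responsibility):

import Data.IORef
import System.IO.Unsafe (unsafePerformIO)

unsafeFindBucket :: Int -> Container a -> Int
unsafeFindBucket findIdx container = unsafePerformIO $ do
  let ref = __lastAccessedBucket container
  hint <- readIORef ref
  let newBucket = findBucketWithHint hint findIdx container
  -- atomicModifyIORef makes the cache update a single atomic step,
  -- so concurrent callers cannot interleave a read-modify-write
  atomicModifyIORef ref (\_ -> (newBucket, ()))
  return newBucket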
The disadvantage of an internal mutable state is that the performance of the cache will deteriorate if it's accessed from multiple threads, if they lookup different elements.
One remedy would be to explicitly thread the updated state instead of using the internal mutable state. This is obviously what you want to avoid, but if your program is using monads, you could just add another monadic layer that'd internally keep the state for you and expose the lookup operation as a monadic action (see the sketch below).
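For example, a sketch under the assumptions above (the Container type and findBucketWithHint from the question; lookupCached is an invented name), carrying the hint in a State monad:

import qualified Data.Vector as V
import Control.Monad.State.Strict (State, get, put, evalState)

lookupCached :: Container a -> Int -> State Int a
lookupCached container idx = do
  hint <- get                                        -- last accessed bucket
  let bucket = findBucketWithHint hint idx container
  put bucket                                         -- remember it for next time
  return (snd (buffer container V.! bucket))

Each consumer runs its own computation, e.g. evalState (mapM (lookupCached c) indices) 0, so threads never share the cache and cannot interfere with each other.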
Finally, you could consider using splay trees instead of the array. You'd still have (amortized) O(log n) complexity, but their big advantage is that by design they move frequently accessed elements near the top. So if you'll be accessing a subset of elements of size k, they'll be soon moved to the top, so the lookup operations will be just O(log k) (constant for a single, repeatedly accessed element). Again, they update the structure on lookups, but you could use the same approach with unsafePerformIO and atomic updates of IORef to keep the outer interface pure.

Reasoning about IORef operation reordering in concurrent programs

The docs say:
In a concurrent program, IORef operations may appear out-of-order to
another thread, depending on the memory model of the underlying
processor architecture...The implementation is required to ensure that
reordering of memory operations cannot cause type-correct code to go
wrong. In particular, when inspecting the value read from an IORef,
the memory writes that created that value must have occurred from the
point of view of the current thread.
Which I'm not even entirely sure how to parse. Edward Yang says
In other words, “We give no guarantees about reordering, except that
you will not have any type-safety violations.” ...
the last sentence remarks that an IORef is not allowed to point to
uninitialized memory
So... it won't break Haskell entirely; not very helpful. The discussion from which the memory model example arose also left me with questions (even Simon Marlow seemed a bit surprised).
Things that seem clear to me from the documentation
Within a thread an atomicModifyIORef "is never observed to take place ahead of any earlier IORef operations, or after any later IORef operations", i.e. we get a partial ordering of: stuff before the atomic mod -> atomic mod -> stuff after. Although the wording "is never observed" here is suggestive of spooky behavior that I haven't anticipated.
A readIORef x might be moved before writeIORef y, at least when there are no data dependencies
Logically I don't see how something like readIORef x >>= writeIORef y could be reordered
What isn't clear to me
Will newIORef False >>= \v-> writeIORef v True >> readIORef v always return True?
In the maybePrint case (from the IORef docs, reproduced below) would a readIORef myRef (along with maybe a seq or something) before readIORef yourRef have forced a barrier to reordering?
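For reference, the example from the Data.IORef docs looks roughly like this (the docs note that both threads may end up in the critical section, because the write and the subsequent read may be reordered):

import Data.IORef
import Control.Monad (unless)
import Control.Concurrent (forkIO, threadDelay)

maybePrint :: IORef Bool -> IORef Bool -> IO ()
maybePrint myRef yourRef = do
  writeIORef myRef True
  yourVal <- readIORef yourRef
  unless yourVal $ putStrLn "critical section"

main :: IO ()
main = do
  r1 <- newIORef False
  r2 <- newIORef False
  _ <- forkIO $ maybePrint r1 r2
  _ <- forkIO $ maybePrint r2 r1
  threadDelay 1000000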
What's the straightforward mental model I should have? Is it something like:
within and from the point of view of an individual thread, the
ordering of IORef operations will appear sane and sequential; but the
compiler may actually reorder operations in such a way that breaks
certain assumptions in a concurrent system; however when a thread does
atomicModifyIORef, no threads will observe operations on that
IORef that appeared above the atomicModifyIORef to happen after,
and vice versa.
...? If not, what's the corrected version of the above?
If your response is "don't use IORef in concurrent code; use TVar" please convince me with specific facts and concrete examples of the kind of things you can't reason about with IORef.
I don't know Haskell concurrency, but I know something about memory models.
Processors can reorder instructions the way they like: loads may go ahead of loads, loads may go ahead of stores, loads of dependent stuff may go ahead of loads of stuff they depend on (a[i] may load the value from array first, then the reference to array a!), stores may be reordered with each other. You simply cannot put a finger on it and say "these two things definitely appear in a particular order, because there is no way they can be reordered". But in order for concurrent algorithms to operate, they need to observe the state of other threads. This is where it is important for thread state to proceed in a particular order. This is achieved by placing barriers between instructions, which guarantee the order of instructions to appear the same to all processors.
Typically (in one of the simplest models), you want two types of ordered instructions: an ordered load that does not go ahead of any other ordered loads or stores, and an ordered store that does not go ahead of any instructions at all, plus a guarantee that all ordered instructions appear in the same order to all processors. This way you can reason about the IRIW kind of problem:
Thread 1: x=1
Thread 2: y=1
Thread 3: r1=x; r2=y
Thread 4: r4=y; r3=x
If all of these operations are ordered loads and ordered stores, then you can conclude the outcome (1,0,0,1)=(r1,r2,r3,r4) is not possible. Indeed, the ordered stores in Threads 1 and 2 must appear in some single order to all threads, and r1=1, r2=0 witnesses that y=1 was executed after x=1. In turn, this means that Thread 4 can never observe r4=1 without also observing r3=1 (r3=x is executed after r4=y): if the ordered stores happen to be executed in that order, observing y==1 implies x==1.
Also, if the loads and stores were not ordered, the processors would usually be allowed to observe the assignments in different orders: one might see x=1 appear before y=1, another might see y=1 appear before x=1, so any combination of values r1, r2, r3, r4 would be permitted.
A sufficient implementation looks like this:
ordered load:
  load x
  load-load   -- barrier stopping later loads from going ahead of preceding loads
  load-store  -- no store is allowed to go ahead of the ordered load

ordered store:
  load-store
  store-store -- the ordered store must appear after all stores
              -- preceding it in program order - serialize all stores
              -- (flush write buffers)
  store x,v
  store-load  -- later loads must not go ahead of the ordered store
              -- preceding them in program order
Of these two, I can see that IORef provides an ordered store (atomicWriteIORef), but I don't see an ordered load (there is no atomicReadIORef), without which you cannot reason about the IRIW problem above. This is not a problem if your target platform is x86, because on that platform all loads are executed in program order, and stores never go ahead of loads (in effect, all loads are ordered loads).
An atomic update (atomicModifyIORef) seems to me an implementation of a so-called CAS loop (a compare-and-set loop, which does not stop until the value is atomically set to b if its current value is a). You can see the atomic modify operation as a fusion of an ordered load and an ordered store, with all those barriers there, executed atomically - no processor is allowed to insert a modification between the load and the store of a CAS.
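To illustrate, a CAS in the above sense can be sketched on top of atomicModifyIORef' (casLoop is an invented name; atomicModifyIORef itself always succeeds, retrying internally):

import Data.IORef

-- Spin until the ref holds `expected`, then atomically replace it with
-- `new`. Each attempt is a single atomic read-modify-write.
casLoop :: Eq a => IORef a -> a -> a -> IO ()
casLoop ref expected new = do
  ok <- atomicModifyIORef' ref $ \cur ->
          if cur == expected then (new, True) else (cur, False)
  if ok then return () else casLoop ref expected new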
Furthermore, writeIORef is cheaper than atomicWriteIORef, so you want to use writeIORef as much as your inter-thread communication protocol permits. Whereas writeIORef x vx >> writeIORef y vy >> atomicWriteIORef z vz >> readIORef t does not guarantee the order in which the writeIORefs appear to other threads with respect to each other, there is a guarantee that they both will appear before the atomicWriteIORef - so, seeing z==vz, you can conclude that at this moment x==vx and y==vy, and that IORef t was loaded after the stores to x, y, z became observable by other threads. This latter point requires readIORef to be an ordered load, which is not provided as far as I can tell, but it will work like an ordered load on x86.
Typically you don't use concrete values of x, y, z when reasoning about the algorithm. Instead, some algorithm-dependent invariants about the assigned values must hold, and can be proven - for example, as in the IRIW case you can guarantee that Thread 4 will never see (0,1)=(r3,r4) if Thread 3 sees (1,0)=(r1,r2), and Thread 3 can take advantage of this: it means something is mutually excluded without acquiring any mutex or lock.
Here is an example (not in Haskell) that will not work if loads are not ordered, or if ordered stores do not flush write buffers (the requirement to make written values visible before the ordered load executes).
Suppose z will show either x, until y is computed, or y, if x has been computed too. Don't ask why; it is not very easy to see outside the larger context - it is a kind of queue - just enjoy the sort of reasoning that is possible.
Thread 1: x=1;
          if (z==0) compareAndSet(z, 0, y == 0 ? x : y);

Thread 2: y=2;
          if (x != 0) while ((tmp=z) != y && !compareAndSet(z, tmp, y));
So, the two threads set x and y, then set z to x or y, depending on whether y or x was computed too. Assume all are initially 0. Translating into loads and stores:
Thread 1: store x,1
          load z
          if ==0 then
            load y
            if ==0 then load x   -- if the loaded y is still 0, load x into tmp
                   else load y   -- otherwise, load y into tmp
            CAS z, 0, tmp        -- CAS whatever was loaded in the previous
                                 -- if-statement; the CAS may fail, but see
                                 -- the explanation below

Thread 2: store y,2
          load x
          if !=0 then
            loop: load z         -- into tmp
                  load y
                  if !=tmp then  -- compare the loaded y to tmp
                    CAS z, tmp, y       -- attempt to CAS z: if it is still tmp, set it to y
                    if ! then goto loop -- if the CAS did not succeed, go to loop
If Thread 1's load z is not an ordered load, then it will be allowed to go ahead of the ordered store (store x). That means that wherever z is loaded to (a register, cache line, stack, ...), the value is one that existed before the value of x could become visible. Looking at that value is useless - you cannot then judge where Thread 2 is up to. For the same reason you've got to have a guarantee that the write buffers were flushed before load z executed - otherwise it will still appear as a load of a value that existed before Thread 2 could see the value of x. This is important, as will become clear below.
If Thread 2's load x or load z are not ordered loads, they may go ahead of store y, and will observe values that were written before y is visible to other threads.
However, observe that if the loads and stores are ordered, then the threads can negotiate who is to set the value of z without contending on z. For example, if Thread 2 observes x==0, there is a guarantee that Thread 1 will definitely execute x=1 later, and will see z==0 after that - so Thread 2 can leave without attempting to set z.
If Thread 1 observes z==0, then it should try to set z to x or y. So first it checks whether y has been set already. If it hasn't been, it will be set in the future, so try to set z to x - the CAS may fail, but only if Thread 2 concurrently set z to y, so there is no need to retry. Similarly, there is no need to retry if Thread 1 observed that y has been set: if the CAS fails, then z has been set by Thread 2 to y. Thus we can see that Thread 1 sets z to x or y in accordance with the requirement, and does not contend on z too much.
On the other hand, Thread 2 can check whether x has been computed already. If not, then it will be Thread 1's job to set z. If Thread 1 has computed x, then Thread 2 needs to set z to y. Here we do need a CAS loop, because a single CAS may fail if Thread 1 is concurrently attempting to set z to x or y.
The important takeaway here is that if "unrelated" loads and stores are not serialized (including flushing of write buffers), no such reasoning is possible. However, once loads and stores are ordered, both threads can figure out the path each of them will take in the future, and that way eliminate contention in half the cases. Most of the time x and y will be computed at significantly different times, so if y is computed before x, it is likely that Thread 2 will not touch z at all. (Typically, "touching z" also possibly means "wake up a thread waiting on a condition variable z", so it is not only a matter of loading something from memory.)
within a thread an atomicModifyIORef "is never observed to take place
ahead of any earlier IORef operations, or after any later IORef
operations" i.e. we get a partial ordering of: stuff above the atomic
mod -> atomic mod -> stuff after. Although, the wording "is never
observed" here is suggestive of spooky behavior that I haven't
anticipated.
"is never observed" is standard language when discussing memory reordering issues. For example, a CPU may issue a speculative read of a memory location earlier than necessary, so long as the value doesn't change between when the read is executed (early) and when the read should have been executed (in program order). That's entirely up to the CPU and cache though, it's never exposed to the programmer (hence language like "is never observed").
A readIORef x might be moved before writeIORef y, at least when there
are no data dependencies
True
Logically I don't see how something like readIORef x >>= writeIORef y
could be reordered
Correct, as that sequence has a data dependency. The value to be written depends upon the value returned from the first read.
For the other questions: newIORef False >>= \v-> writeIORef v True >> readIORef v will always return True (there's no opportunity for other threads to access the ref here).
In the maybePrint example, there's very little you can do to ensure this works reliably in the face of new optimizations added to future GHCs and across various CPU architectures. If you write:
writeIORef myRef True
x <- readIORef myRef
yourVal <- x `seq` readIORef yourRef
Even though GHC 7.6.3 produces correct cmm (and presumably asm, although I didn't check), there's nothing to stop a CPU with a relaxed memory model from moving the readIORef yourRef to before all of the myref/seq stuff. The only 100% reliable way to prevent it is with a memory fence, which GHC doesn't provide. (Edward's blog post does go through some of the other things you can do now, as well as why you may not want to rely on them).
I think your mental model is correct; however, it's important to know that the possible apparent reorderings introduced by concurrent ops can be really unintuitive.
Edit: at the cmm level, the code snippet above looks like this (simplified, pseudocode):
[StackPtr+offset] := True
x := [StackPtr+offset]
if (notEvaluated x) (evaluate x)
yourVal := [StackPtr+offset2]
So there are a couple things that can happen. GHC as it currently stands is unlikely to move the last line any earlier, but I think it could if doing so seemed more optimal. I'm more concerned that, if you compile via LLVM, the LLVM optimizer might replace the second line with the value that was just written, and then the third line might be constant-folded out of existence, which would make it more likely that the read could be moved earlier. And regardless of what GHC does, most CPU memory models allow the CPU itself to move the read earlier absent a memory barrier.
See http://en.wikipedia.org/wiki/Memory_ordering for non-atomic concurrent reads and writes (basically, when you don't use atomics, just look at the memory-ordering model for your target CPU).
Currently GHC can be regarded as not reordering your reads and writes for non-atomic (and imperative) loads and stores. However, GHC Haskell currently doesn't specify any sort of concurrent memory model, so those non-atomic operations will have the ordering semantics of the underlying CPU model, as linked above.
In other words, GHC currently has no formal concurrency memory model, and because optimization algorithms tend to be defined with respect to some model of equivalence, there's no reordering currently in play there.
That is: the only semantic model you can have right now is "the way it's implemented".
Shoot me an email! I'm working on patching up atomics for 7.10; let's try to cook up some semantics!
Edit: some folks who understand this problem better than I do chimed in on the ghc-users thread here: http://www.haskell.org/pipermail/glasgow-haskell-users/2013-December/024473.html
Assume that I'm wrong both in this comment and in anything I said in the ghc-users thread :)

Is it safe to reuse a conduit?

Is it safe to perform multiple actions using the same conduit value? Something like
do
  let sink = sinkSocket sock
  something $$ sink
  somethingElse $$ sink
I recall that in the early versions of conduit there were some dirty hacks that made this unsafe. What's the current status?
(Note that sinkSocket doesn't close the socket.)
That usage is completely safe. The issue in older versions had to do with blurring the line between resumable and non-resumable components. With modern versions (I think since 0.4), the line is very clear between the two.
It might be safe to reuse sinks in the sense that the semantics of the "used" sink don't change. But you should be aware of another threat: space leaks.
The situation is analogous to lazy lists: you can consume a huge list lazily in constant space, but if you process the list twice, it will be kept in memory. The same thing can happen with a recursive monadic expression: if you use it once it runs in constant space, but if you reuse it, the structure of the computation is kept in memory, resulting in a space leak.
Here's an example:
import Data.Conduit
import Data.Conduit.List (sourceList)
import Control.Monad.Trans.Class (lift)

-- Note: the callback receives the result of `await`, i.e. a Maybe value.
consumeN :: Monad m => Int -> (Maybe i -> m ()) -> Sink i m ()
consumeN 0 _ = return ()
consumeN n m = do
    await >>= (lift . m)
    consumeN (n - 1) m

main :: IO ()
main = do
    let sink = consumeN 1000000 (\i -> putStrLn ("Got one: " ++ show i))
    sourceList [1..9000000::Int] $$ sink
    sourceList [1..22000000::Int] $$ sink
This program uses about 150 MB of RAM on my machine, but if you remove the last line or repeat the definition of sink in both places, you get nice constant space usage (see the sketch below).
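That is, a sketch of the constant-space variant just inlines the sink at each use site so the computation structure isn't shared (modulo GHC's full-laziness optimization, which can sometimes re-introduce sharing):

main :: IO ()
main = do
    -- each use site builds a fresh sink, so nothing is retained between runs
    sourceList [1..9000000::Int]  $$ consumeN 1000000 printOne
    sourceList [1..22000000::Int] $$ consumeN 1000000 printOne
  where
    printOne i = putStrLn ("Got one: " ++ show i)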
I agree that this is a contrived example (it was the first that came to my mind), and it's not very likely to happen with most Sinks. For example, it will not happen with your sinkSocket. (Why is this contrived? Because the control structure of the sink doesn't depend on the values it gets. And that is also why it can leak.) But, for example, for sources this would be much more common. (Many of the common Sources exhibit this behavior. sourceList would be an obvious example, because it would actually keep the source list in memory. But enumFromTo is no different, although there is no data to keep in memory, just the structure of the monadic computation.)
So, all in all, I think it's important to be aware of this.

How can one implement a forking try-catch in Haskell?

I want to write a function
forkos_try :: IO (Maybe α) -> IO (Maybe α)
which takes a command x. x is an imperative operation which first mutates state, and then checks whether that state is messed up or not. (It does not do anything external, which would require some kind of OS-level sandboxing to revert the state.)
if x evaluates to Just y, forkos_try returns Just y.
otherwise, forkos_try rolls back state, and returns Nothing.
Internally, it should fork() into threads parent and child, with x running on child.
if x succeeds, child should keep running (returning x's result) and parent should die
otherwise, parent should keep running (returning Nothing) and child should die
Question: what's the way to write something with equivalent, or more powerful, semantics than forkos_try? N.B. the state mutated (by x) is in an external library, and cannot be passed between threads. Hence the semantics of which thread to keep alive is important.
Formally, "keep running" means "execute some continuation rest :: Maybe α -> IO ()". But that continuation isn't kept anywhere explicit in the code.
For my case, I think it will (for the time being) work to write it in a different style, using forkOS (which takes the entire computation the child will run), since I can write an explicit expression for rest. But it troubles me that I can't figure out how to do this with the primitive function forkOS - one would think it would be general enough to support any specific case (which could appear as a high-level API, like forkos_try).
EDIT -- please see the example code with explicit rest if the problem's still not clear [ http://pastebin.com/nJ1NNdda ].
P.S. I haven't written concurrency code in a while; hopefully my knowledge of POSIX fork() is correct! Thanks in advance.
Things are a lot simpler to reason about if you model state explicitly.
someStateFunc :: (s -> Maybe (a, s))

-- inside some other function
case someStateFunc initialState of
  Nothing -> ... -- it failed. stick with the initial state
  Just (a, newState) -> ... -- it succeeded. do something with
                            -- the result and the new state
With immutable state, "rolling back" is simple: just keep using initialState. And "not rolling back" is also simple: just use newState.
So... I'm assuming from your explanation that this "external library" performs some nontrivial IO effects that are nevertheless restricted to a few knowable and reversible operations (modify a file, an IORef, etc.). There is no way to reverse some things (launch the missiles, write to stdout, etc.), so I see one of two choices for you here:
clone the world, and run the action in a sandbox. If it succeeds, then go ahead and run the action in the Real World.
clone the world, and run the action in the real world. If it fails, then replace the Real World with the snapshot you took earlier.
Of course, both of these are actually the same approach: fork the world. One world runs the action, one world doesn't. If the action succeeds, then that world continues; otherwise, the other world continues. You are proposing to accomplish this by building upon forkOS, which would clone the entire state of the program, but this would not be sufficient to deal with, for example, file modifications. Allow me to suggest instead an approach that is nearer to the simplicity of immutable state:
tryIO :: IO s -> (s -> IO ()) -> IO (Maybe a) -> IO (Maybe a)
tryIO save restore action = do
  initialState <- save
  result <- action
  case result of
    Nothing -> restore initialState >> return Nothing
    Just x  -> return (Just x)
Here you must provide some data structure s, and a way to save to and restore from said data structure. This allows you the flexibility to perform any cloning you know to be necessary. (e.g. save could copy a certain file to a temporary location, and then restore could copy it back and delete the temporary file. Or save could copy the value of certain IORefs, and then restore could put the values back.) This approach may not be the most efficient, but it's very straightforward; a usage sketch follows below.
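For instance, a hypothetical usage with an IORef as the reversible state (incrementChecked and the n <= 10 validity check are invented for illustration):

import Data.IORef

incrementChecked :: IORef Int -> IO (Maybe Int)
incrementChecked ref =
  tryIO (readIORef ref)   -- save: snapshot the current value
        (writeIORef ref)  -- restore: put the snapshot back on failure
        (do modifyIORef ref (+ 1)          -- mutate the state...
            n <- readIORef ref
            return (if n <= 10 then Just n -- ...then validate it
                               else Nothing))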

A way to form a 'select' on MVars without polling

I have two MVars (well, an MVar and a Chan). I need to pull things out of the Chan and process them until the other MVar is no longer empty. My ideal solution would be something like the UNIX select function, where I pass in a list of (presumably empty) MVars and the thread blocks until one of them is full; then it returns the full MVar. Try as I might, I can think of no way of doing this beyond repeatedly polling each MVar with isEmptyMVar until I get False. This seems inefficient.
A different thought was to use throwTo, but it interrupts whatever is happening in the thread, and I need to finish processing a job out of the Chan in an atomic fashion.
A final thought as I'm typing is to create a new forkIO for each MVar, each of which tries to read its MVar and then fills a newly created MVar with its own instance. The original thread can then block on that MVar. Are Haskell threads cheap enough to run that many?
Haskell threads are very cheap, so you could solve it that way, but it sounds like STM would be a better fit for your problem. With STM you can do
do var <- atomically (takeTMVar a `orElse` takeTMVar b)
   ... do stuff with var
Because of the behavior of retry and orElse, this code tries to get a, then if that fails, get b. If both fail, it blocks until either of them is updated and tries again.
You could even use this to make your own rudimentary version of select:
select :: [TMVar a] -> STM a
select = foldr1 orElse . map takeTMVar
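A quick usage sketch for the select above (the setup here is hypothetical; it just shows the blocking behavior):

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM

main :: IO ()
main = do
  a <- newEmptyTMVarIO
  b <- newEmptyTMVarIO
  _ <- forkIO $ threadDelay 100000 >> atomically (putTMVar b "from b")
  v <- atomically (select [a, b])  -- blocks until either TMVar is filled
  putStrLn v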
How about using the STM versions, TChan and TVar, with the retry and orElse behavior?
Implementing select is one of STM's nice capabilities. From "Composable Memory Transactions":
Beyond this, we also provide orElse, which allows them to be composed as alternatives, so that the second is run if the first retries (Section 3.4). This ability allows threads to wait for many things at once, like the Unix select system call - except that orElse composes well, whereas select does not.
orElse in RWH.
The STM package
Papers on Haskell's STM
