This is just a hypothetical scenario to illustrate my question. Suppose there are two threads and one TVar shared between them. In one thread there is an atomically block that reads the TVar and takes 10s to complete. In the other thread there is an atomically block that modifies the TVar every second. Will the first atomically block ever complete? Surely it will just keep going back to the beginning, because its log is perpetually in an inconsistent state?
As others have said: in theory there is no guarantee of progress. In practice there is also no guarantee of progress:
import Control.Monad          -- for forever
import Control.Monad.STM
import Control.Concurrent.STM
import Control.Concurrent
import GHC.Conc               -- for unsafeIOToSTM
import System.IO

main = do
    tv <- newTVarIO 0
    forkIO (f tv)
    g tv

f :: TVar Int -> IO ()
f tv = forever $ do
    atomically $ do
        n <- readTVar tv
        writeTVar tv (n + 1)
        unsafeIOToSTM (threadDelay 100000)
    putStr "."
    hFlush stdout

g :: TVar Int -> IO ()
g tv = forever $ do
    atomically $ do
        n <- readTVar tv
        writeTVar tv (n + 1)
        unsafeIOToSTM (threadDelay 1000000)
    putStrLn "Done with long STM"
The above never says "Done with long STM" in my tests.
Obviously, if you think the computation is still going to be valid/pertinent, then you would want to either:
Leave the atomic block, perform the expensive computation, then re-enter the atomic block, confirm the assumptions are still valid, and update the value (see the sketch below). Potentially dangerous, but no more so than most locking strategies.
Memoize the results in the atomic block, so that the still-valid result is no more than a cheap lookup after a retry.
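For the first option, a minimal sketch of the pattern (expensiveUpdate is a hypothetical helper, not a complete recipe) could look like:

import Control.Concurrent.STM
import Control.Exception (evaluate)

-- Hypothetical helper: snapshot, compute outside the transaction,
-- then re-validate the assumption before committing.
expensiveUpdate :: TVar Int -> (Int -> Int) -> IO Bool
expensiveUpdate tv expensive = do
    snapshot <- readTVarIO tv                    -- read outside any transaction
    result   <- evaluate (expensive snapshot)    -- force the long-running work here
    atomically $ do
        current <- readTVar tv
        if current == snapshot                   -- confirm the assumption still holds
            then writeTVar tv result >> return True
            else return False                    -- stale: caller decides what to do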
STM prevents deadlock, but is still vulnerable to starvation. It is possible in a pathological case for the 1s atomic action to always acquire the resource.
However, the chances of this happening are very small -- I don't believe I've ever seen it in practice.
For the semantics, see Composable Memory Transactions, section 6.5 "Progress". STM in Haskell guarantees only that some transaction commits (so the system as a whole makes progress and cannot deadlock); an individual transaction may starve, and in the worst case a long-running transaction is forced to retry forever by shorter conflicting ones.
No, it would work fine. Exactly how the two threads would interact depends on
the retry logic.
For example, let's say you have:
ten tv = do
    n <- readTVar tv
    when (n < 7) retry
    writeTVar tv 0
    -- do something that takes about 10 seconds

one tv = do
    modifyTVar tv (+1)
    -- do something that takes about 1 second
So the "ten" thread will be in retry state until the TVar reaches
the value 7, then it will proceed.
Note that you can't directly control how long these computations will take
inside the STM monad. That would be a side-effect, and side-effects are not
allowed in STM calculations. The only way to communicate with the outside
world is via values passed through transactional memory.
And that means that if the "baton-passing" logic through transactional memory is
correct, the program will work correctly independently of the exact amount
of time any part of it takes. That's part of the guarantee of STM.
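For example, a minimal runnable version of this baton-passing setup (my own wiring of the two transactions above, with the slow work done outside the atomically blocks) might look like:

import Control.Concurrent
import Control.Concurrent.STM
import Control.Monad

main :: IO ()
main = do
    tv <- newTVarIO 0
    -- the "one" thread: a quick transaction, then ~1 second of work outside STM
    void $ forkIO $ forever $ do
        atomically $ modifyTVar tv (+ 1)
        threadDelay 1000000
    -- the "ten" thread: wait for the baton, then ~10 seconds of work outside STM
    forever $ do
        atomically $ do
            n <- readTVar tv
            when (n < 7) retry
            writeTVar tv 0
        putStrLn "ten: got the baton"
        threadDelay 10000000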
Related
I'm running a haskell websocket server using Wai:
application :: MVar ServerState -> Wai.Application
application state = WaiWS.websocketsOr WS.defaultConnectionOptions wsApp staticApp
  where
    wsApp :: WS.ServerApp
    wsApp pendingConn = do
        conn <- WS.acceptRequest pendingConn
        talk conn state
To allow a single client to send asynchronous messages, talk is defined as follows:
talk :: WS.Connection -> MVar ServerState -> IO ()
talk conn state = forever $ do
    msg <- WS.receiveMessage conn
    putStrLn "received message"
    successLock <- newEmptyMVar
    tid <- timeoutAsync successLock $ processMessage conn state msg
    putStrLn "forked thread"
    modifyMVar_ state $ \curState ->
        return $ curState & threads %~ (M.insert mid tid) -- thread bookkeeping
    putStrLn "modified state"
    putMVar successLock ()
    putStrLn "unlocked success"
  where
    mid = serverMessageId msg
    timeoutAsync lock f = forkIO $ do
        timeout S.process_message_timeout onTimeout (onSuccess lock) f
    onSuccess lock = do
        -- block until the first modifyMVar_ above finishes.
        takeMVar lock
        modifyMVar_ state $ \curState ->
            return $ curState & threads %~ (M.delete mid) -- thread cleanup
    onTimeout = ...
Here's the thing: when I bombard this server with many CPU-heavy messages (from a single client), the main thread occasionally hangs at "forked thread".
This is surprising because all the work on messages is (in theory) done in separate threads, so the main (forever) loop should never block.
What's going on here?
[EDIT]
A minimal verifiable example is pretty hard to provide in this case (the work is done in processMessage, which comprises a lot of moving parts, any of which might be the problem). Instead, I'm looking for high-level pointers to things I could investigate.
Here is data from an example run (send the server an expensive request, then a bunch of smaller less-expensive ones):
gc productivity 36%: http://puu.sh/nSxnj/d8bb5995ae.png
event log (using +RTS -ls and -eventlog): http://puu.sh/nSxDy/efe457bee2.eventlog
CPU usage ~300% (for 4 caps) -- this made me think GC might be competing for OS resources; I decreased the number of capabilities to n-1, and this seemed to improve responsiveness
Also, the app has the following properties, which I think are potential causes of the problem:
ratio of GC'd to live data is high; processMessage basically constructs a giant list which is aeson'd and sent back to the user, but not kept in state
many foreign calls are made (due to ZMQ, which iirc makes unsafe foreign calls) on a single request
ThreadScope tells me that lots of heap overflows occur, causing GC requests
I have a relatively simple "copy" program that merely copies all the lines of one file to another. I'm playing around with Haskell's concurrency support with TMQueue and STM so I thought I'd try it like this:
{-# LANGUAGE BangPatterns #-}

module Main where

import Control.Applicative
import Control.Concurrent.Async -- from async
import Control.Concurrent.Chan
import Control.Concurrent.STM (atomically)
import Control.Concurrent.STM.TMQueue -- from stm-chans
import Control.Monad (replicateM, forM_, forever, unless)
import qualified Data.ByteString.Char8 as B
import Data.Function (fix)
import Data.Maybe (catMaybes, maybe)
import System.IO (withFile, IOMode(..), hPutStrLn, hGetLine)
import System.IO.Error (catchIOError)

input  = "data.dat"
output = "out.dat"
batch  = 100 :: Int

consumer :: TMQueue B.ByteString -> IO ()
consumer q = withFile output WriteMode $ \fh -> fix $ \loop -> do
    !items <- catMaybes <$> replicateM batch readitem
    forM_ items $ B.hPutStrLn fh
    unless (length items < batch) loop
  where
    readitem = do
        !item <- atomically $ readTMQueue q
        return item

producer :: TMQueue B.ByteString -> IO ()
producer q = withFile input ReadMode $ \fh ->
    (forever (B.hGetLine fh >>= atomically . writeTMQueue q))
        `catchIOError` const (atomically (closeTMQueue q) >> putStrLn "Done")

main :: IO ()
main = do
    q <- atomically newTMQueue
    thread <- async $ consumer q
    producer q
    wait thread
I can make a little test input file like this
ghc -e 'writeFile "data.dat" (unlines (map show [1..5000000]))'
And build it like this
ghc --make QueueTest.hs -O2 -prof -auto-all -caf-all -threaded -rtsopts -o q
When I run it like so ./q +RTS -s -prof -hc -L60 -N2, it says that "2117 MB total memory in use"! But the input file is only 38 MB!
I am new to profiling, but I have produced graph after graph and cannot pinpoint my mistake.
As the OP points out, by now I may as well write a real answer. Let's start with the memory consumption.
Two useful references are Memory footprint of Haskell data types and http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html. We'll also need to look at the definitions of some of our structures.
-- from http://hackage.haskell.org/package/stm-chans-3.0.0.2/docs/src/Control-Concurrent-STM-TMQueue.html
data TMQueue a = TMQueue
    {-# UNPACK #-} !(TVar Bool)
    {-# UNPACK #-} !(TQueue a)
    deriving Typeable

-- from http://hackage.haskell.org/package/stm-2.4.3/docs/src/Control-Concurrent-STM-TQueue.html
-- | 'TQueue' is an abstract type representing an unbounded FIFO channel.
data TQueue a = TQueue {-# UNPACK #-} !(TVar [a])
                       {-# UNPACK #-} !(TVar [a])
The TQueue implementation uses a standard functional queue with a read end and write end.
Let's set an upper bound on memory usage and assume that we read the entire file into the TMQueue before the consumer does anything. In that case, the write end of our TQueue will contain a list with one element per input line (stored as a bytestring). Each list node will look like
(:) bytestring tail
which takes 3 words (1 per field + 1 for the constructor). Each bytestring is 9 words, so add the two together and there are 12 words of overhead per line, not including the actual data. Your test data is 5 million lines, so that's 60 million words of overhead for the whole file (plus some constants), which on a 64-bit system is about 460MB (assuming I did my math right, always questionable). Add in 40MB for the actual data, and we get values pretty close to what I see on my system.
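As a quick sanity check on that arithmetic (assuming 8-byte words and ignoring the constants):

-- 5 million lines * 12 words of overhead per line * 8 bytes per word
overheadBytes :: Integer
overheadBytes = 5 * 10^6 * 12 * 8

-- ghci> overheadBytes `div` (1024^2)
-- 457        -- roughly 460MB of overhead, before the ~40MB of actual data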
So, why is our memory usage close to this upper bound? I have a theory (investigation left as an exercise!). First, the producer is likely to run a bit faster than the consumer simply because reading is usually faster than writing (I'm using spinning disks, maybe an SSD would be different). Here's the definition of readTQueue:
-- |Read the next value from the 'TQueue'.
readTQueue :: TQueue a -> STM a
readTQueue (TQueue read write) = do
  xs <- readTVar read
  case xs of
    (x:xs') -> do writeTVar read xs'
                  return x
    [] -> do ys <- readTVar write
             case ys of
               [] -> retry
               _  -> case reverse ys of
                       []     -> error "readTQueue"
                       (z:zs) -> do writeTVar write []
                                    writeTVar read zs
                                    return z
First we try to read from the read end, and if that's empty we try to read from the write end, after reversing that list.
What I think is happening is this: when the consumer needs to read from the write end, it needs to traverse the input list within the STM transaction. This takes some time, which will cause it to contend with the producer. As the producer gets further ahead, this list gets longer, causing the read to take yet more time, during which the producer is able to write more values, causing the read to fail. This process repeats until the producer finishes, and only then does the consumer get a chance to process the bulk of the data. Not only does this ruin concurrency, it adds more CPU overhead because the consumer transaction is continually retrying and failing.
So, what about unagi? There are a couple key differences. First, unagi-chan uses arrays internally instead of lists. This reduces the overhead a little. Most of the overhead is from the ByteString pointers, so not much, but a little. Secondly, unagi keeps chunks of arrays. Even if we pessimistically assume that the producer always wins contentions, after the array gets filled it's pushed off the producer's side of the channel. Now the producer is writing to a new array and the consumer reads from the old array. This situation is near-ideal; there's no contention to shared resources, the consumer has good locality of reference, and because the consumer is working on a different chunk of memory there aren't issues with cache coherence. Unlike my theoretical description of the TMQueue, now you're getting concurrent operations, allowing the producer to clear some of the memory usage so it never hits the upper bound.
As an aside, I think the consumer batching is not beneficial. Handles are buffered by the IO subsystem already, so I don't think this gains anything. For me performance improved a little when I changed the consumer to operate line-by-line anyway.
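A line-by-line consumer along those lines might look like this (a sketch reusing the names and imports from the question):

consumer :: TMQueue B.ByteString -> IO ()
consumer q = withFile output WriteMode $ \fh ->
    let loop = do
            mitem <- atomically (readTMQueue q)
            case mitem of
                Nothing   -> return ()                    -- queue closed: done
                Just item -> B.hPutStrLn fh item >> loop
    in loop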
Now, what can you do about this problem? Going from my working hypothesis that TMQueue is suffering from contention problems, and your specified requirements, you'll just need to use another type of queue. Obviously unagi works pretty well. I also tried TMChan, it was about 25% slower than unagi but used 45% less memory, so that could be a good option too. (this isn't too surprising, TMChan has a different structure from TMQueue so it'll have different performance characteristics)
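For reference, a rough sketch of the same copy program on top of unagi-chan (assuming its newChan/readChan/writeChan API; unagi-chan has no close operation, so a Nothing value marks end-of-stream):

import Control.Concurrent.Async (async, wait)
import qualified Control.Concurrent.Chan.Unagi as U
import Control.Monad (forever)
import qualified Data.ByteString.Char8 as B
import System.IO (IOMode (..), withFile)
import System.IO.Error (catchIOError)

main :: IO ()
main = do
    (inC, outC) <- U.newChan
    thread <- async (consumer outC)
    producer inC
    wait thread

producer :: U.InChan (Maybe B.ByteString) -> IO ()
producer inC = withFile "data.dat" ReadMode $ \fh ->
    forever (B.hGetLine fh >>= U.writeChan inC . Just)
        `catchIOError` const (U.writeChan inC Nothing >> putStrLn "Done")

consumer :: U.OutChan (Maybe B.ByteString) -> IO ()
consumer outC = withFile "out.dat" WriteMode $ \fh ->
    let loop = do
            mline <- U.readChan outC
            case mline of
                Nothing   -> return ()                    -- end-of-stream marker
                Just line -> B.hPutStrLn fh line >> loop
    in loop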
You could also try to change your algorithm so that the producer sends multi-line chunks. This would lower the memory overhead from all the ByteStrings.
So, when is it ok to use TMQueue? If the producer and consumer are about the same speed, or the consumer is faster, it should be ok. Also, if processing times are non-uniform, or the producer runs in bursts, you'll probably get good amortized performance. This is pretty much a worst-case situation, and perhaps it should be reported as a bug against stm? I think if the read function were changed to
-- |Read the next value from the 'TQueue'.
readTQueue :: TQueue a -> STM a
readTQueue (TQueue read write) = do
  xs <- readTVar read
  case xs of
    (x:xs') -> do writeTVar read xs'
                  return x
    [] -> do ys <- readTVar write
             case ys of
               [] -> retry
               _  -> do writeTVar write []
                        let (z:zs) = reverse ys
                        writeTVar read zs
                        return z
it would avoid this problem. Now the z and zs bindings should both be evaluated lazily, so the list traversal would happen outside this transaction, allowing the read operation to succeed sometimes under contention. Assuming I'm correct about the issue in the first place, of course (and that this definition is lazy enough). There might be other unexpected downsides though.
I have been doing things with STM and have, among other things, used the TBQueue data structure with great success. A useful feature I've been using it for involves reading from it based on a precondition in a TVar, basically like so:
shouldRead <- readTVar shouldReadVar
if shouldRead
    then do
        a <- readTBQueue queue
        doSomethingWith a
    else doSomethingElse
If we assume that queue is empty and shouldReadVar contains True before executing this block, it will result in readTBQueue calling retry, and the block will be re-executed when shouldReadVar contains False or queue contains an element, whichever happens first.
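For completeness, a self-contained sketch of that pattern (doSomethingWith/doSomethingElse are hypothetical placeholders):

import Control.Concurrent.STM
import Control.Concurrent.STM.TBQueue

step :: TVar Bool -> TBQueue Int -> STM String
step shouldReadVar queue = do
    shouldRead <- readTVar shouldReadVar
    if shouldRead
        then do
            a <- readTBQueue queue        -- retries while the queue is empty...
            return ("got " ++ show a)     -- ...i.e. doSomethingWith a
        else return "skipped"             -- doSomethingElse
-- Any change to shouldReadVar while readTBQueue is retrying re-runs the whole
-- block, so a switch to False falls through to the else branch.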
I am now in need of a synchronous channel data structure, similar to the structure described in this article (Please read it if you want to understand this question), except it needs to be readable with a pre-condition like in the previous example, and possibly compose with other stuff as well.
Let's call this data structure SyncChan with writeSyncChan and readSyncChan operations defined on it.
And here's a possible use case, in (pseudo) code which will not work as written because it mixes STM and IO concepts:
shouldRead <- readTVar shouldReadVar
if shouldRead
    then do
        a <- readSyncChan syncChan
        doSomethingWith a
    else doSomethingElse
Assuming that no other thread is currently blocking on a writeSyncChan call, and shouldReadVar contains True, I want the block to "retry" until either shouldReadVar contains False, or a different thread blocks on a writeSyncChan. In other words: when one thread is retrying on readSyncChan and another thread reaches a writeSyncChan, or vice versa, I want the value to be transferred along the channel. In all other cases, both sides should be in a retry state and thus react to a change in shouldReadVar, so that the read or write can be cancelled.
The naïve approach described in the article linked above, using two (T)MVars, is of course not possible here: because the data structure is synchronous, it cannot be used from within two atomically blocks, since you cannot change one TMVar and then wait for another TMVar to change inside a single atomic context.
Instead, I am looking for a kind of partial atomicity, where I can "commit" a certain part of a transaction and only roll it back when certain variables change, but not others. If I have "msg" and "ack" variables like the first example in the article above, I want to be able to write to the "msg" variable, then wait for either a value to arrive on "ack", or for my other transactional variables to change. If other transactional variables change, the whole atomic block should be retried, and if an "ack" value arrives, the transaction should continue as it was in the previous state. For the reading side, something similar should happen, except I would of course be reading from "msg" and writing to "ack."
Is this possible to do using GHC STM, or do I need to do manual MVar/rollback handling?
This is what you want:
import Control.Concurrent
import Control.Concurrent.STM
import Control.Monad

data SyncChan a = SyncChan (TMVar a) (TMVar ())

newSyncChan :: IO (SyncChan a)
newSyncChan = do
    msg <- newEmptyTMVarIO
    ack <- newEmptyTMVarIO
    return (SyncChan msg ack)

readIf :: SyncChan a -> TVar Bool -> STM (Maybe a)
readIf (SyncChan msg ack) shouldReadVar = do
    b <- readTVar shouldReadVar
    if b
        then do
            a <- takeTMVar msg
            putTMVar ack ()
            return (Just a)
        else return Nothing

write :: SyncChan a -> a -> IO ()
write (SyncChan msg ack) a = do
    atomically $ putTMVar msg a
    atomically $ takeTMVar ack

main = do
    sc <- newSyncChan
    tv <- newTVarIO True
    forkIO $ forever $ forM_ [False, True] $ \b -> do
        threadDelay 2000000
        atomically $ writeTVar tv b
    forkIO $ forM_ [0..] $ \i -> do
        putStrLn "Writing..."
        write sc i
        putStrLn "Write Complete"
        threadDelay 300000
    forever $ do
        putStrLn "Reading..."
        a <- atomically $ readIf sc tv
        print a
        putStrLn "Read Complete"
This gives the behavior you had in mind. While the TVar is True the input and output ends will be synchronized with each other. When the TVar switches to False then the read end freely aborts and returns Nothing.
Is there any way to increase the time interval on the basis of which the RTS decides that a thread has blocked indefinitely in an STM transaction?
Here is my code:
import Control.Concurrent (ThreadId)
import Control.Concurrent.MVar (MVar, newMVar, withMVar)
import Control.Concurrent.STM
import qualified Control.Concurrent.ThreadManager as TM

data ThreadManager = ThreadManager { tmCounter :: TVar Int, tmTM :: MVar TM.ThreadManager }

data Settings = Settings {
    maxThreadsCount :: Int }

createThreadManager :: Settings -> IO ThreadManager
createThreadManager s = do
    counter <- atomically $ newTVar (maxThreadsCount s)
    tm <- TM.make >>= newMVar
    return $ ThreadManager counter tm

forkManaged :: ThreadManager -> IO () -> IO ThreadId
forkManaged tm fn = do
    atomically $ do
        counter <- readTVar $ tmCounter tm
        check $ counter > 0
        writeTVar (tmCounter tm) (counter - 1)
    withMVar (tmTM tm) $ \thrdmgr -> TM.fork thrdmgr $ do
        fn
        atomically $ do
            counter <- readTVar $ tmCounter tm
            writeTVar (tmCounter tm) (counter + 1)
forkManaged makes sure that the number of simultaneously running managed threads does not exceed maxThreadsCount. It works fine until heavy load: under heavy load the RTS throws an exception. I think that under heavy load, with hard concurrent competition for resources, some threads just never get a chance to access the STM context. So I think increasing the time interval after which the RTS decides to throw this exception might solve the problem.
Daniel Wagner is right. The decision is not made with timeouts. The relevant code in the rts is in Schedule.c
See the resurrectThreads function for where the exception is thrown. The comment describes that it is only thrown to threads found to be garbage after GC. ezyang described how this works for MVars: http://blog.ezyang.com/2011/07/blockedindefinitelyonmvar/
[bad speculation concerning check removed when I checked its source and realized that it was just a simple guard/retry and not what was described in an earlier paper -- oops! I now suspect that Daniel Wagner is correct here as well, and the issue is somehow that the counter isn't being incremented.]
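If that suspicion is right, one way to rule it out is to make the increment unconditional by running it in a finally block, so an exception in fn cannot permanently leak a slot. A sketch reusing the question's types (not a confirmed fix):

import Control.Exception (finally)

forkManaged :: ThreadManager -> IO () -> IO ThreadId
forkManaged tm fn = do
    atomically $ do
        counter <- readTVar (tmCounter tm)
        check (counter > 0)
        writeTVar (tmCounter tm) (counter - 1)
    withMVar (tmTM tm) $ \thrdmgr ->
        TM.fork thrdmgr $
            -- the increment now runs even if fn throws
            fn `finally` atomically (modifyTVar' (tmCounter tm) (+ 1))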
I'm toying with Haskell threads, and I'm running into the problem of communicating lazily-evaluated values across a channel. For example, with N worker threads and 1 output thread, the workers communicate unevaluated work and the output thread ends up doing the work for them.
I've read about this problem in various documentation and seen various solutions, but I only found one solution that works and the rest do not. Below is some code in which worker threads start some computation that can take a long time. I start the threads in descending order, so that the first thread should take the longest, and the later threads should finish earlier.
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan -- .Strict
import Control.Concurrent.MVar
import Control.Exception (finally, evaluate)
import Control.Monad (forM_)
import Control.Parallel.Strategies (using, rdeepseq)

main = (>>=) newChan $ (>>=) (newMVar []) . run

run :: Chan (Maybe String) -> MVar [MVar ()] -> IO ()
run logCh statVars = do
    logV <- spawn1 readWriteLoop
    say "START"
    forM_ [18,17..10] $ spawn . busyWork
    await
    writeChan logCh Nothing -- poison the logger
    takeMVar logV
    putStrLn "DONE"
  where
    say mesg = force mesg >>= writeChan logCh . Just
    force s = mapM evaluate s -- works
    -- force s = return $ s `using` rdeepseq -- no difference
    -- force s = return s -- no-op; try this with strict channel
    busyWork = say . show . sum . filter odd . enumFromTo 2 . embiggen
    embiggen i = i*i*i*i*i
    readWriteLoop = readChan logCh >>= writeReadLoop
    writeReadLoop Nothing = return ()
    writeReadLoop (Just mesg) = putStrLn mesg >> readWriteLoop
    spawn1 action = do
        v <- newEmptyMVar
        forkIO $ action `finally` putMVar v ()
        return v
    spawn action = do
        v <- spawn1 action
        modifyMVar statVars $ \vs -> return (v:vs, ())
    await = do
        vs <- modifyMVar statVars $ \vs -> return ([], vs)
        mapM_ takeMVar vs
Using most techniques, the results are reported in the order spawned; that is, the longest-running computation first. I interpret this to mean that the output thread is doing all the work:
-- results in order spawned (longest-running first = broken)
START
892616806655
503999185040
274877906943
144162977343
72313663743
34464808608
15479341055
6484436675
2499999999
DONE
I thought the answer to this would be strict channels, but they didn't work. I understand that WHNF for strings is insufficient because that would just force the outermost constructor (nil or cons for the first character of the string). The rdeepseq is supposed to fully evaluate, but it makes no difference. The only thing I've found that works is to map Control.Exception.evaluate :: a -> IO a over all the characters in the string. (See the force function comments in the code for several different alternatives.) Here's the result with Control.Exception.evaluate:
-- results in order finished (shortest-running first = correct)
START
2499999999
6484436675
15479341055
34464808608
72313663743
144162977343
274877906943
503999185040
892616806655
DONE
So why don't strict channels or rdeepseq produce this result? Are there other techniques? Am I misinterpreting why the first result is broken?
There are two issues going on here.
The reason the first attempt (using an explicit rnf) doesn't work is that, by using return, you've created a thunk that fully evaluates itself when it is evaluated, but the thunk itself has not been evaluated. Notice that the type of evaluate is a -> IO a: the fact that it returns a value in IO means that evaluate can impose ordering:
return (error "foo") >> return 1 == return 1
evaluate (error "foo") >> return 1 == error "foo"
The upshot is that this code:
force s = evaluate $ s `using` rdeepseq
will work (as in, have the same behavior as mapM_ evaluate s).
The case of using strict channels is a little trickier, but I believe this is due to a bug in strict-concurrency. The expensive computation is actually being run on the worker threads, but it's not doing you much good (you can check for this explicitly by hiding some asynchronous exceptions in your strings and seeing which thread the exception surfaces on).
What's the bug? Let's take a look at the code for strict writeChan:
writeChan :: NFData a => Chan a -> a -> IO ()
writeChan (Chan _read write) val = do
  new_hole <- newEmptyMVar
  modifyMVar_ write $ \old_hole -> do
    putMVar old_hole $! ChItem val new_hole
    return new_hole
We see that modifyMVar_ is called on write before we evaluate the thunk. The sequence of operations then is:
writeChan is entered
We takeMVar write (blocking anyone else who wants to write to the channel)
We evaluate the expensive thunk
We put the expensive thunk onto the channel
We putMVar write, unblocking all of the other threads
You don't see this behavior with the evaluate variants, because they perform the evaluation before the lock is acquired.
I’ll send Don mail about this and see if he agrees that this behavior is kind of suboptimal.
Don agrees that this behavior is suboptimal. We're working on a patch.
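For illustration, the fix would roughly amount to forcing the value before taking the write lock, along these lines (reusing the Chan/ChItem constructors from the quoted code; this is not the actual patch):

import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)

writeChan :: NFData a => Chan a -> a -> IO ()
writeChan (Chan _read write) val = do
  val' <- evaluate (force val)       -- deep evaluation happens outside the lock
  new_hole <- newEmptyMVar
  modifyMVar_ write $ \old_hole -> do
    putMVar old_hole (ChItem val' new_hole)
    return new_hole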