How to Properly Terminate a Thread which is Blocking (Lparallel Common Lisp)

In the Lparallel API, the recommended way to terminate all threaded tasks is to stop the kernel with (lparallel:end-kernel). But when a thread is blocking—e.g., with (pop-queue queue1) waiting for an item to appear in the queue—it will still be active when the kernel is stopped. In this case (at least in SBCL) the kernel shutdown occasionally (but not every time) fails with:
debugger invoked on a SB-KERNEL:BOUNDING-INDICES-BAD-ERROR in thread
#<THREAD "lparallel" RUNNING {1002F04973}>:
The bounding indices 1 and NIL are bad for a sequence of length 0.
See also:
The ANSI Standard, Glossary entry for "bounding index designator"
The ANSI Standard, writeup for Issue SUBSEQ-OUT-OF-BOUNDS:IS-AN-ERROR
debugger invoked on a SB-SYS:INTERACTIVE-INTERRUPT in thread
#<THREAD "main thread" RUNNING {10012E0613}>:
Interactive interrupt at #x1001484328.
I’m assuming this has something to do with the blocking thread not terminating correctly. How should a blocking thread be properly terminated before shutting down the kernel? (The API says kill-tasks should only be used in exceptional circumstances, which I’m taking not to apply to this “normal” shutdown circumstance.)

The problem with killing a thread is that it might happen anywhere, while the thread could be in any unknown state.
The only way to safely terminate a thread is to let it shut itself down gracefully, meaning you expect that, during normal operations, there is a way for the thread to know it should stop working. Then you can properly clean up your resources, close databases, free foreign pointers, log all things, ...
The queues you are using have operations that can time out; that is a simple yet safe way to ensure you avoid blocking forever and can exit properly. But that's not the only option (you can use timeouts in addition to what is shown below).
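For example, assuming try-pop-queue accepts a :timeout keyword (in seconds) and returns a second value saying whether an item was actually popped, a worker can poll and re-check instead of blocking forever. A minimal sketch; process-item is a hypothetical handler:
(defun polling-worker (in)
  (lambda ()
    (loop
      (multiple-value-bind (value presentp)
          (try-pop-queue in :timeout 0.5)
        (cond ((not presentp))             ; timed out: loop and re-check
              ((eq value :stop) (return))  ; graceful exit
              (t (process-item value)))))))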
Shared / global flag
When a timeout occurs, or when you receive a message, you check a global boolean variable (or one shared among all interested threads). That's also a simple way to exit, and the flag can be read by multiple threads. This is however concurrent access, so you should use locks or atomic operations (http://www.sbcl.org/manual/#Atomic-Operations); for example, use defglobal and a fixnum type with atomic-incf.
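A minimal sketch of such a flag, assuming SBCL (sb-ext:defglobal and sb-ext:atomic-incf are SBCL extensions; work-once stands in for the thread's real work):
(sb-ext:defglobal *stop-flag* 0)
(declaim (type fixnum *stop-flag*))

(defun request-stop ()
  ;; atomic increment: any non-zero value means "please stop"
  (sb-ext:atomic-incf (symbol-value '*stop-flag*)))

(defun flag-checking-worker ()
  (lambda ()
    (loop until (plusp *stop-flag*)
          do (work-once))))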
Control messages
Send control data in the queues and use it to determine how to shut down gracefully, how to propagate the information down the pipes, or how to restart things. This is safe (just message passing) and allows any kind of control you might want to implement in your threads.
(defpackage :so (:use :cl :bt :lparallel.queue))
(in-package :so)
Let's define two services.
The first one echoes back its input:
(defun echo (in out)
  (lambda ()
    (loop
      for value = (pop-queue in)
      do (push-queue value out)
      until (eq value :stop))))
Notice how it is expected to finish properly when given a :stop input, and how it also propagates the :stop message to its output queue.
The second service performs a modular addition, and also sleeps a bit between requests:
(defun modulo-adder (x m in out)
  (lambda ()
    (loop
      for value = (progn (sleep 0.02)
                         (pop-queue in))
      do (push-queue (typecase value
                       (keyword value)
                       (number (mod (+ x value) m)))
                     out)
      until (eq value :stop))))
Create queues:
(defparameter *q1* (make-queue))
(defparameter *q2* (make-queue))
Create threads:
(progn
  (bt:make-thread (echo *q1* *q2*) :name "echo")
  (bt:make-thread (modulo-adder 5 1024 *q2* *q1*) :name "adder"))
Both threads are connected to each other in a circular fashion, creating an infinite loop of additions. No value is exchanged between the threads yet, and you can see them running, for example with slime-list-threads or any other implementation-provided way; in any case (bt:all-threads) returns a list.
slime-list-threads
10 adder Running
11 echo Running
...
Add an item, now there is an infinite exchange of data between threads:
(push-queue 10 *q1*)
Wait, then stop them both:
(push-queue :stop *q1*)
Both threads stopped gracefully (they no longer appear in the list of threads).
We can inspect what remains in the queues (results vary from one run to another):
(list (try-pop-queue *q1*)
      (try-pop-queue *q2*))
=> (99 NIL)

(list (try-pop-queue *q1*)
      (try-pop-queue *q2*))
=> (:STOP NIL)

(list (try-pop-queue *q1*)
      (try-pop-queue *q2*))
=> (NIL NIL)
Interrupting a thread
You create a service, controlled by messages or a global flag, but then you have a bug and the thread hangs. Instead of killing it and losing everything, you want at least to unwind the thread's stack properly. This is dangerous too, but you can use bt:interrupt-thread to stop a thread wherever it is currently running and make it execute a function.
(define-condition stop () ())

(defun signal-stop ()
  (signal 'stop))

(defun endless ()
  (let ((output *standard-output*))
    (lambda ()
      (print "START" output)
      (unwind-protect (handler-case (loop)
                        (stop ()
                          (print "INTERRUPTED" output)))
        (print "STOP" output)))))
Start it:
(bt:make-thread (endless) :name "loop")
This prints "START" and loops.
Then we interrupt it:
(bt:interrupt-thread (find "loop"
                           (bt:all-threads)
                           :test #'string=
                           :key #'bt:thread-name)
                     #'signal-stop)
The following is printed:
"INTERRUPTED"
"STOP"
Those messages would not be printed if the thread had been killed, but note that you could still end up with corrupted data given how random the interruption point is. Also, it can unblock blocking calls like sleep or pop-queue.
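For instance, here is a sketch of unblocking a thread stuck in pop-queue, reusing signal-stop and *q1* from above (whether an interrupt can unwind a particular blocking call is implementation-dependent):
(defparameter *blocked-thread*
  (bt:make-thread (lambda ()
                    (handler-case (pop-queue *q1*)
                      (stop () :interrupted)))
                  :name "blocked"))

;; The thread is parked inside POP-QUEUE; the interrupt makes it signal
;; STOP from within the blocking call, and the handler unwinds cleanly.
(bt:interrupt-thread *blocked-thread* #'signal-stop)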

Related

How to yield a thread's current continuation from an exception handler

This code is really pushing the limits of my understanding so bear with me.
Previously I implemented coroutines in Racket in the following code:
;; Coroutine definition
(define (make-generator procedure)
  (define last-return values)
  (define last-value #f)
  (define status 'suspended)
  (define (last-continuation _)
    (let ([result (procedure yield)])
      (last-return result)))
  (define (yield value)
    (call/cc (lambda (continuation)
               (set! last-continuation continuation)
               (set! last-value value)
               (set! status 'suspended)
               (last-return value))))
  (lambda args
    (call/cc (lambda (return)
               (set! last-return return)
               (cond ((null? args)
                      (let ()
                        (set! status 'dead)
                        (last-continuation last-value)))
                     ((eq? (car args) 'coroutine?) 'coroutine)
                     ((eq? (car args) 'status?) status)
                     ((eq? (car args) 'dead?) (eq? status 'dead))
                     ((eq? (car args) 'alive?) (not (eq? status 'dead)))
                     ((eq? (car args) 'kill!) (set! status 'dead))
                     (#t (apply last-continuation args)))))))
;; Define a function that will return a suspended coroutine created from
;; given args and body forms
(define-syntax (define-coroutine stx)
  (syntax-case stx ()
    ((_ (name . args) . body)
     #`(define (name . args)
         (make-generator
          (lambda (#,(datum->syntax stx 'yield))
            . body))))))
What I want to do is implement an exception handler (with-handlers) that calls the (yield) function. The idea is that a second thread can send a signal to the thread evaluating the coroutine, forcing it to yield when it's running for too long.
I've tried the following in the args lambda; it successfully returned early, but later evaluations of (my-coroutine 'dead?) reported that the coroutine was in the 'dead state:
(with-handlers
    ([exn:break?
      (lambda (break)
        (yield 'coroutine-timeout))])
  (break-enabled #t) ; register for yield requests from coroutine manager thread
  (last-continuation last-value))))
Alternatively, I've tried the following, but it didn't produce a procedure that can be applied to arguments:
(with-handlers
    ([exn:break?
      (lambda (break)
        (set! last-continuation (exn:break-continuation break))
        (set! last-value 'coroutine-timeout)
        (set! status 'suspended)
        (last-return 'coroutine-timeout))])
  (break-enabled #t) ; register for yield requests from coroutine manager thread
  (last-continuation last-value))))
I'm trying to understand how continuations and exceptions interact/block each other. It seems like I may need to use Parameters somehow?
How can I successfully write a signal handler that will (yield) correctly so that I can resume the coroutine later?
Edit:
I am mixing metaphors here (cooperative and preemptive multithreading). However, my question seems possible to me (from a layman's perspective), as I can evaluate functions defined in my coroutine (including (yield)) from within the exception handler. I'm essentially trying to limit resource starvation in my worker threads, as well as mitigate a certain class of deadlock (where task 1 can only complete after task 2 has run, and there are no free threads for task 2 to run on).
I have written a (go) function for these coroutines that is modeled after Go's goroutines. I assume they achieve their asynchronous behavior on single threads by having cooperative yield checks in the underlying code they control. Perhaps it runs in a VM as you suggested and there are checks, or perhaps their operators have the checks. Whatever the case may be, I'm trying to achieve similar behavior with a different strategy.
As far as "how continuations and exceptions interact/block each other," it's important to know that exceptions are implemented using delimited continuations. In particular, the exception system makes use of continuation barriers. Both of these are introduced in the Racket reference §1.1.12 Prompts, Delimited Continuations, and Barriers:
A continuation barrier is another kind of continuation frame that prohibits certain replacements of the current continuation with another. … A continuation barrier thus prevents “downward jumps” into a continuation that is protected by a barrier. Certain operations install barriers automatically; in particular, when an exception handler is called, a continuation barrier prohibits the continuation of the handler from capturing the continuation past the exception point.
You may also want to see the material on exceptions from later in the evaluation model section and from the control flow section, which cites an academic paper on the subject. The differences between call-with-exception-handler and with-handlers are also relevant to capturing continuations from within exception handlers.
Basically, though, the continuation barrier prevents using exception handlers for continuations that you abort and might later resume: you should use continuation barriers and prompts directly for that.
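For illustration, here is a minimal prompt-based sketch (plain Racket prompts and composable continuations; this is not the full coroutine rewrite, and run/pause are names invented for the example):
(define tag (make-continuation-prompt-tag 'coroutine))

;; Run THUNK under a prompt. If the thunk pauses, the handler returns the
;; paused continuation to the caller instead of a final value.
(define (run thunk)
  (call-with-continuation-prompt thunk tag (lambda (k) k)))

;; Capture the rest of the computation up to the prompt, then abort with it.
(define (pause)
  (call-with-composable-continuation
   (lambda (k) (abort-current-continuation tag k))
   tag))

;; > (define k (run (lambda () (displayln "a") (pause) (displayln "b"))))
;; a
;; > (run (lambda () (k (void))))
;; b
Because abort-current-continuation escapes to the prompt's handler rather than through an exception handler, no continuation barrier intervenes, so the captured continuation can be resumed later.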
More broadly, I would suggest that you look at Racket's substantial existing support for concurrency. Even if you want to implement coroutines as an experiment, they would be useful for inspiration and examples of implementation techniques. Racket comes with derived constructs such as engines ("processes that can be preempted by a timer or other external trigger") and generators, in addition to the fundamental building blocks, green threads and synchronizable events (which are based on the Concurrent ML model).
The gist of your question:
How can I implement an exception handler for coroutines, such that a second thread can send
a signal to a thread evaluating a coroutine, forcing it to yield
when it's running for too long.
And once more:
How can I successfully write a signal handler that will (yield)
correctly so that I can resume the coroutine later?
It seems to me that you are not cleanly separating cooperative and preemptive multitasking, since you seem to want to combine coroutines (cooperative) with time-outs (preemptive). (You also mention threads, but seem to conflate them with coroutines.)
With cooperative multitasking there is no way that you can force anyone else to stop running; hence the moniker "cooperative".
With preemptive multitasking you do not need to yield, because the scheduler will preempt you when your allocated time has run out. The scheduler is also responsible for saving your continuation, but it is not the (scheduler's) current continuation, since the scheduler is wholly separate from the user thread.
Perhaps the closest thing to what you are proposing is simulating preemptive multitasking via polling. At every (simulated) timestep (i.e., a VM instruction), the simulation needs to check whether any interrupts/signals have been received by a running thread and handle them.

go block vs thread in core.async

From http://martintrojer.github.io/clojure/2013/07/07/coreasync-and-blocking-io/ :
To get a bit more concrete let's see what happens when we try to issue
some HTTP GET request using core.async. Let's start with the naive
solution, using blocking IO via clj-http.
(defn blocking-get [url]
  (clj-http.client/get url))

(time
 (def data
   (let [c (chan)
         res (atom [])]
     ;; fetch em all
     (doseq [i (range 10 100)]
       (go (>! c (blocking-get (format "http://fssnip.net/%d" i)))))
     ;; gather results
     (doseq [_ (range 10 100)]
       (swap! res conj (<!! c)))
     @res)))
Here we're trying to fetch 90 code snippets (in parallel) using go
blocks (and blocking IO). This took a long time, and that's because
the go block threads are "hogged" by the long running IO operations.
The situation can be improved by switching the go blocks to normal
threads.
(time
 (def data-thread
   (let [c (chan)
         res (atom [])]
     ;; fetch em all
     (doseq [i (range 10 100)]
       (thread (>!! c (blocking-get (format "http://fssnip.net/%d" i)))))
     ;; gather results
     (doseq [_ (range 10 100)]
       (swap! res conj (<!! c)))
     @res)))
What does it mean that "go block threads are hogged by the long running IO operations"?
Go blocks are intended to be a sort of lightweight cooperative threads; they provide thread-like behaviour with less overhead than full JVM threads by using a few threads in a pool and switching go blocks when they park - for instance, when waiting on a channel using <!. The thread switching cannot work when you call a method in the block that blocks the JVM thread, so you quickly run out of JVM threads. Most standard Java (and Clojure) IO operations will block the current thread while waiting.
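A minimal sketch of the contrast (the channel and printlns are just for illustration):
(require '[clojure.core.async :refer [go thread chan <! <!!]])

(def c (chan))

;; Parking take: <! suspends this go block and frees its pool thread for
;; other go blocks until a value arrives on c.
(go (println "got" (<! c)))

;; Blocking take: <!! pins a whole JVM thread until a value arrives, which
;; is why blocking IO belongs on `thread` rather than in a go block.
(thread (println "got" (<!! c)))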
What does it mean that "go block threads are hogged by the long running IO operations"?
There are a limited number of threads dedicated to serving go blocks*. If you perform a blocking I/O operation on one of those threads, then it cannot be used for any other purpose until that operation completes (unless the thread is interrupted). This is also true for non-go block threads (i.e., threads that are returned from the thread function), but non-go block threads do not come from the limited go block thread pool. So if you do blocking I/O in a go block, you are "hogging" that go block's thread from being used by other go blocks, even though the thread isn't doing any actual work (it's just waiting for the I/O operation).
* That number currently happens to be 42 + the number of processors available to the JVM.

Upper limit for number of jobs in a go block?

Here is the code:
(ns typedclj.async
  (:require [clojure.core.async
             :as a
             :refer [>! <! >!! <!!
                     go chan buffer
                     close! thread
                     alts! alts!! timeout]]
            [clj-http.client :as -cc]))

(time (dorun
       (let [c (chan)]
         (doseq [i (range 10 1e4)]
           (go (>! c i))))))
And I got an error:
Exception in thread "async-dispatch-12" java.lang.AssertionError: Assert failed: No more than 1024 pending puts are allowed on a single channel. Consider using a windowed buffer.
(< (.size puts) impl/MAX-QUEUE-SIZE)
at clojure.core.async.impl.channels.ManyToManyChannel.put_BANG_(channels.clj:150)
at clojure.core.async.impl.ioc_macros$put_BANG_.invoke(ioc_macros.clj:959)
at typedclj.async$eval11807$fn__11816$state_machine__6185__auto____11817$fn__11819.invoke(async.clj:19)
at typedclj.async$eval11807$fn__11816$state_machine__6185__auto____11817.invoke(async.clj:19)
at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:940)
at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:944)
at typedclj.async$eval11807$fn__11816.invoke(async.clj:19)
at clojure.lang.AFn.run(AFn.java:22)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)...
According to http://martintrojer.github.io/clojure/2013/07/07/coreasync-and-blocking-io/
... This will break the 1 job = 1 thread knot, thus this thread
parking will allow us to scale the number of jobs way beyond any
thread limit on the platform (usually around 1000 on the JVM).
core.async gives (blocking) channels and a new (unbounded) thread pool
when using 'thread'. This (in effect) is just some sugar over using
java threads (or clojure futures) and BlockingQueues from
java.util.concurrent. The main feature is go blocks in which threads
can be parked and resumed on the (potentially) blocking calls dealing
with core.async's channels...
Is 1e4 jobs already too many? What is the upper limit then?
I don't usually rant like this so I hope you will forgive me this one transgression:
In a more perfect world every programmer would repeat to themselves "there is no such thing as an unbounded queue" five times before sleeping and first thing upon waking. This mode of thinking requires figuring out how backpressure will be handled in your system, so that when there is a slowdown somewhere in the process, the parts before it have a way to find out about it and slow themselves down in response. In core.async the default backpressure is immediate because the default buffer size is zero. No go block succeeds in putting something onto a chan until someone is ready to consume it.
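You can watch that immediate backpressure at the REPL; a small sketch:
(require '[clojure.core.async :refer [go >! <!! chan]])

(def c (chan))                            ; unbuffered: buffer size zero
(go (>! c :x) (println "put completed"))  ; parks; nothing prints yet
(<!! c)                                   ; the take lets the put finish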
chans look basically like this:
"queue of pending puts" --> buffer --> "queue of pending takes"
The putter and taker queues are intended to allow time for the two processes that are communicating via this pipe to schedule themselves so progress can be made. Without these there would be no room for threads to schedule and deadlocks would happen. They are NOT intended to be used as the buffer; that's what the buffer in the middle is for, and this was the design rationale behind making it the only one with an explicit size. Explicitly set the buffer size for your system by setting the size of the buffer in the chan:
user> (time (dorun
             (let [c (chan 1e6)]
               (doseq [i (range 10 1e4)]
                 (go (>! c i))))))
"Elapsed time: 83.526679 msecs"
nil
In this case I have "calculated" that my system as a whole will be in a good state if there are up to a million waiting jobs. Of course your real-world experiences will differ, and will be very much unique to your situation.
Thanks for your patience,
The limit on unconsumed puts is the size of the channel's buffer plus the size of the pending-puts queue.
The queue size in core.async is limited to 1024, but one should not rely on that.
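If dropping values is acceptable for your workload, the "windowed buffer" suggested by the assertion message bounds memory instead of accumulating pending puts; a sketch:
(require '[clojure.core.async :refer [chan dropping-buffer sliding-buffer]])

(def drop-ch  (chan (dropping-buffer 1024))) ; full => new puts are discarded
(def slide-ch (chan (sliding-buffer 1024)))  ; full => oldest values are discarded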

Clojure - Using agents slows down execution too much

I am writing a benchmark for a program in Clojure. I have n threads accessing a cache at the same time. Each thread will access the cache x times. Each request should be logged inside a file.
To this end I created an agent that holds the path to the file to be written to. When I want to write I send-off a function that writes to the file and simply returns the path. This way my file-writes are race-condition free.
When I execute my code without the agent, it finishes in a few milliseconds. When I use the agent and ask each thread to send-off to the agent each time, my code runs horribly slow. I'm talking minutes.
(defn load-cache-only [usercount cache-size]
  "Test requesting from the cache only."
  ; Create the file to write the benchmark results to.
  (def sink "benchmarks/results/load-cache-only.txt")
  (let [data-agent (agent sink)
        ; Data for our backing store generated at runtime.
        store-data (into {} (map vector
                                 (map (comp keyword str)
                                      (repeat "item")
                                      (range 1 cache-size))
                                 (range 1 cache-size)))
        cache (create-full-cache cache-size store-data)]
    (barrier/run-with-barrier
     (fn [] (load-cache-only-work cache store-data data-agent))
     usercount)))
(defn load-cache-only-work [cache store-data data-agent]
  "For use with 'load-cache-only'. Requests each item in the cache once.
  We time how long it takes for each request to be handled."
  (let [cache-size (count store-data)
        foreachitem (fn [cache-item]
                      (let [before (System/nanoTime)
                            result (cache/retrieve cache cache-item)
                            after (System/nanoTime)
                            diff_ms ((comp str float) (/ (- after before) 1000))]
                        ;(send-off data-agent (fn [filepath]
                        ;  (file/insert-record filepath cache-size diff_ms)
                        ;  filepath))
                        ))]
    (doall (map foreachitem (keys store-data)))))
The (barrier/run-with-barrier) code simply spawns usercount number of threads and starts them at the same time (using an atom). The function I pass is the body of each thread.
The body will simply map over a list named store-data, which is a key-value list (e.g., {:a 1 :b 2}). The length of this list in my code right now is 10. The number of users is 10 as well.
As you can see, the code for the agent send-off is commented out. This makes the code execute normally. However, when I enable the send-offs, even without writing to the file, the execution time is too slow.
Edit:
I made each thread print a dot just before it sends off to the agent.
The dots appear just as fast as without the send-off. So there must be something blocking in the end.
Am I doing something wrong?
You need to call (shutdown-agents) when you're done sending stuff to your agent if you want the JVM to exit in reasonable time.
The underlying problem is that if you don't shut down your agents, the threads backing their thread pool will never be shut down, and they will prevent the JVM from exiting. There's a timeout that will shut down the pool if nothing else is running, but it's fairly lengthy. Calling shutdown-agents as soon as you're done producing actions will resolve this problem.
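A minimal sketch at the end of the benchmark run (await is optional but makes sure the queued file writes have completed; data-agent is the agent from the question):
(await data-agent)  ; block until all actions sent so far have run
(shutdown-agents)   ; release the agent thread pools so the JVM can exit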

Canonical Way to Ensure Only One Instance of a Service Is Running / Starting / Stopping in Clojure?

I'm writing a stateful server in Clojure backed by Neo4j that can serve socket requests, like HTTP. Which means, of course, that I need to be able to start and stop socket servers from within this server. Design-wise, I would want to be able to declare a "service" within this server and start and stop it.
What I'm trying to wrap my mind around in Clojure is how to ensure that starting and stopping these services is thread-safe. This server I'm writing will have NREPL embedded inside it and process incoming requests in a parallel way. Some of these requests will be administrative: start service X, stop service Y. Which opens up the possibility that two start requests come in at the same time.
Starting should synchronously check a "running" flag and a "starting" flag and fail if either is set. In the same transaction, the "starting" flag should be set.
After the "starting" flag is set, the transaction closes. That makes the "starting" flag visible to other transactions.
Then the (start) function actually starts the service.
If (start) succeeds, the "running" flag is set and the "starting" flag is cleared, synchronously.
If (start) fails, the "starting" flag is cleared and the exception is returned.
Stopping needs the same treatment: it checks the "running" flag, and checks and sets its own "stopping" flag.
I'm trying to reason through all possible combinations of (start) and (stop).
Have I missed anything?
Is there a library for this already? If not, what should a library like this look like? I'll open source it and put it on Github.
Edit:
This is what I have so far. There's a hole I can see though. What am I missing?
(ns extenium.db
  (:require [clojure.tools.logging :as log])
  (:import org.neo4j.graphdb.factory.GraphDatabaseFactory))

(def ^:private db-
  (ref {:ref nil
        :running false
        :starting false
        :stopping false}))
(defn stop []
  (dosync
   (if (or (not (:running (ensure db-)))
           (:stopping (ensure db-)))
     (throw (IllegalStateException. "Database already stopped or stopping."))
     (alter db- assoc :stopping true)))
  (try
    (log/info "Stopping database")
    (.shutdown (:ref @db-)) ; deref the ref to read the current state map
    (dosync
     (alter db- assoc :ref nil))
    (log/info "Stopped database")
    (finally
      (dosync
       (alter db- assoc :stopping false)))))
In the try block, I log, then call .shutdown, then log again. If the first log call fails (I/O exceptions can happen), then (:stopping @db-) is set back to false, which unblocks it, and that's fine. .shutdown is a void function from Neo4j, so I don't have to evaluate a return value. If it fails, (:stopping @db-) is set back to false, so that's fine too. Then I set (:ref @db-) to nil. What if that fails? (:stopping @db-) is set to false, but (:ref @db-) is left hanging. So that's a hole. Same with the second log call. What am I missing?
Would this be better if I just used Clojure's locking primitives instead of a ref dance?
This is actually a natural fit for a simple lock:
(locking x
  (do-stuff))
Here x is the object on which to synchronize.
To elaborate: starting and stopping a service is a side effect; side effects should not be initiated from inside a transaction, except possibly as Agent actions. Here, though, locks are exactly what the design calls for. Note that there's nothing wrong with using them in Clojure when they are a good fit for the problem at hand; in fact, I would say locking is the canonical solution here. (See Stuart Halloway's Lancet, introduced in Programming Clojure (1st ed.), for an example of a Clojure library using locks that has seen some widespread use, mostly in Leiningen.)
Update: Adding fail-fast behaviour:
This is still a good fit for a lock, namely a java.util.concurrent.locks.ReentrantLock:
(import java.util.concurrent.locks.ReentrantLock)

(def lock (ReentrantLock.))

(defn start []
  (if (.tryLock lock)
    (try
      (do-stuff)
      (finally (.unlock lock)))
    (do-other-stuff)))
(do-stuff) will be executed if lock acquisition succeeds; otherwise, (do-other-stuff) will happen. The current thread will not block in either case.
This sounds like a good use case for agents; they allow you to serialize changes to a piece of mutable state. The Clojure Agents documentation has a good overview.
You can use the error handler and agent-error methods to handle exceptions and never need to worry about locks or race conditions.
(def service (agent {:status :stopped}))

(defn start-service [{:keys [status] :as curr}]
  (if (= :stopped status)
    (do
      (println "starting service")
      {:status :started})
    (do
      (println "service already running")
      curr)))

;; start the service like this
(send-off service start-service)

;; gets the current status of the service
@service
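For the exception-handling part, a sketch of attaching an error handler so a failed action doesn't silently stall the agent (the handler body is illustrative):
(def service
  (agent {:status :stopped}
         :error-handler (fn [a ex]
                          (println "service action failed:" (.getMessage ex)))))

;; or inspect a failed agent after the fact
(agent-error service)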
