go block vs thread in core.async - multithreading

From http://martintrojer.github.io/clojure/2013/07/07/coreasync-and-blocking-io/ :
To get a bit more concrete let's see what happens when we try to issue
some HTTP GET request using core.async. Let's start with the naive
solution, using blocking IO via clj-http.
(defn blocking-get [url]
  (clj-http.client/get url))
(time
  (def data
    (let [c (chan)
          res (atom [])]
      ;; fetch em all
      (doseq [i (range 10 100)]
        (go (>! c (blocking-get (format "http://fssnip.net/%d" i)))))
      ;; gather results
      (doseq [_ (range 10 100)]
        (swap! res conj (<!! c)))
      @res)))
Here we're trying to fetch 90 code snippets (in parallel) using go
blocks (and blocking IO). This took a long time, and that's because
the go block threads are "hogged" by the long running IO operations.
The situation can be improved by switching the go blocks to normal
threads.
(time
  (def data-thread
    (let [c (chan)
          res (atom [])]
      ;; fetch em all
      (doseq [i (range 10 100)]
        (thread (>!! c (blocking-get (format "http://fssnip.net/%d" i)))))
      ;; gather results
      (doseq [_ (range 10 100)]
        (swap! res conj (<!! c)))
      @res)))
What does it mean that "go block threads are hogged by the long running IO operations"?

Go blocks are intended to be a sort of lightweight cooperative thread: they provide thread-like behaviour with less overhead than full JVM threads by running on a small pool of threads and switching between go blocks whenever one parks, for instance when waiting on a channel with <!. This switching cannot happen when code in the block calls a method that blocks the JVM thread itself, so the pool quickly runs out of free threads. Most standard Java (and Clojure) IO operations block the current thread while waiting.

What does it mean that "go block threads are hogged by the long running IO operations"?
There are a limited number of threads dedicated to serving go blocks*. If you perform a blocking I/O operation on one of those threads, then it cannot be used for any other purpose until that operation completes (unless the thread is interrupted). This is also true for non-go block threads (i.e., threads that are returned from the thread function), but non-go block threads do not come from the limited go block thread pool. So if you do blocking I/O in a go block, you are "hogging" that go block's thread from being used by other go blocks, even though the thread isn't doing any actual work (it's just waiting for the I/O operation).
* That number currently happens to be 42 + the number of processors available to the JVM.
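The difference between parking and blocking described in the answers above can be sketched as follows. This is a minimal illustration, assuming core.async is on the classpath; `slow-fetch` and `fetch-all` are hypothetical names, and `Thread/sleep` stands in for a real blocking IO call.

```clojure
(require '[clojure.core.async :as a])

;; Hypothetical stand-in for a blocking IO call such as an HTTP GET.
(defn slow-fetch [i]
  (Thread/sleep 50)
  (* i i))

(defn fetch-all
  "Runs slow-fetch for each input on its own `thread`, so the blocking call
  never occupies a go-block pool thread, then collects results in a go block."
  [inputs]
  ;; a/thread returns a channel that receives the value of its body.
  (let [chans (mapv (fn [i] (a/thread (slow-fetch i))) inputs)]
    (a/<!! (a/go
             (loop [cs chans, acc []]
               (if (seq cs)
                 ;; <! only parks the go block; the pool thread stays free.
                 (recur (rest cs) (conj acc (a/<! (first cs))))
                 acc))))))

(def results (fetch-all (range 5)))
```

The go block here does no blocking work of its own: it only waits on channels, which is the case go blocks are designed for.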

Related

How to Properly Terminate a Thread which is Blocking (Lparallel Common Lisp)

In the Lparallel API, the recommended way to terminate all threaded tasks is to stop the kernel with (lparallel:end-kernel). But when a thread is blocking—eg, with (pop-queue queue1) waiting for an item to appear in the queue—it will still be active when the kernel is stopped. In this case (at least in SBCL) the kernel shutdown occasionally (but not every time) fails with:
debugger invoked on a SB-KERNEL:BOUNDING-INDICES-BAD-ERROR in thread
#<THREAD "lparallel" RUNNING {1002F04973}>:
The bounding indices 1 and NIL are bad for a sequence of length 0.
See also:
The ANSI Standard, Glossary entry for "bounding index designator"
The ANSI Standard, writeup for Issue SUBSEQ-OUT-OF-BOUNDS:IS-AN-ERROR
debugger invoked on a SB-SYS:INTERACTIVE-INTERRUPT in thread
#<THREAD "main thread" RUNNING {10012E0613}>:
Interactive interrupt at #x1001484328.
I’m assuming this has something to do with the blocking thread not terminating correctly. How should a blocking thread be properly terminated before shutting down the kernel? (The API says kill-tasks should only be used in exceptional circumstances, which I’m taking not to apply to this “normal” shutdown circumstance.)
The problem with killing a thread is that it might happen anywhere, when the thread could be in any unknown state.
The only way to safely terminate a thread is to let it shut itself down gracefully, meaning that during normal operation there is a way for the thread to know it should stop working. Then you can properly clean up your resources, close databases, free foreign pointers, log all things, ...
The queues you are using have operations that can time out, which is a simple yet safe way to ensure you avoid blocking forever and can exit properly. But that's not the only option (you can use timeouts in addition to what is shown below).
 Shared / global flag
When a timeout occurs, or when you receive a message, you check a global boolean variable (or one that is shared among all interested threads). That's also a simple way to exit, and it can be read by multiple threads. This is however a concurrent access, so you should use locks or atomic operations (http://www.sbcl.org/manual/#Atomic-Operations), for example use defglobal and a fixnum type with atomic-incf, etc.
 Control messages
Send control data in the queues and use it to determine how to shut down gracefully, how to propagate the information down the pipes, or how to restart things. This is safe (just message-passing) and allows any kind of control you might want to implement in your thread.
(defpackage :so (:use :cl :bt :lparallel.queue))
(in-package :so)
Let's define two services.
The first one echoes back its input:
(defun echo (in out)
(lambda ()
(loop
for value = (pop-queue in)
do (push-queue value out)
until (eq value :stop))))
Notice how it is expected to finish properly when given a :stop input, and how it also propagates the :stop message to its output queue.
The second thread will perform a modular addition, and also sleeps a bit between requests:
(defun modulo-adder (x m in out)
(lambda ()
(loop
for value = (progn (sleep 0.02)
(pop-queue in))
do (push-queue (typecase value
(keyword value)
(number (mod (+ x value) m)))
out)
until (eq value :stop))))
Create queues:
(defparameter *q1* (make-queue))
(defparameter *q2* (make-queue))
Create threads:
(progn
(bt:make-thread (echo *q1* *q2*) :name "echo")
(bt:make-thread (modulo-adder 5 1024 *q2* *q1*) :name "adder"))
Both threads are connected to each other in a circular fashion, creating an infinite loop of additions. No value is currently exchanged between the threads, and you can see them running, for example, with slime-list-threads or any other implementation-provided way; in any case (bt:all-threads) returns a list.
slime-list-threads
10 adder Running
11 echo Running
...
Add an item, now there is an infinite exchange of data between threads:
(push-queue 10 *q1*)
Wait, then stop them both:
(push-queue :stop *q1*)
Both threads stopped gracefully (they are no longer visible in lists of threads).
We can inspect what remains in the queues (results vary from one test to another):
(list (try-pop-queue *q1*)
(try-pop-queue *q2*))
(99 NIL)
(list (try-pop-queue *q1*)
(try-pop-queue *q2*))
(:STOP NIL)
(list (try-pop-queue *q1*)
(try-pop-queue *q2*))
(NIL NIL)
Interrupting a thread
You create a service, controlled by messages or a global flag, but then you have a bug and the thread hangs. Instead of killing it and losing everything, you want at least to unwind the thread's stack properly. This is dangerous too, but you can use bt:interrupt-thread to stop a thread wherever it is currently running and make it execute a function.
(define-condition stop () ())
(defun signal-stop ()
(signal 'stop))
(defun endless ()
(let ((output *standard-output*))
(lambda ()
(print "START" output)
(unwind-protect (handler-case (loop)
(stop ()
(print "INTERRUPTED" output)))
(print "STOP" output)))))
Start it:
(bt:make-thread (endless) :name "loop")
This prints "START" and loops.
Then we interrupt it:
(bt:interrupt-thread (find "loop"
(bt:all-threads)
:test #'string=
:key #'bt:thread-name)
#'signal-stop)
The following is printed:
"INTERRUPTED"
"STOP"
Those messages would not be printed if the thread was killed, but note that you could still manage to have corrupted data given how random the interruption is. Also, it can unblock blocking calls like sleep or pop-queue.

Correct way to do multithreaded computations in SBCL

Context
I need to do computations using multi-threading. I use SBCL and portability is not a concern. I am aware that bordeaux-threads and lparallel exist but I want to implement something at the relatively low level provided by the specific SBCL threading implementation. I need maximal speed, even at the expense of readability/programming effort.
Example of computation intensive operation
We can define a sufficiently computation-intensive function that will benefit from multi-threading.
(defun intensive-sqrt (x)
"Dummy calculation for intensive algorithm.
Approx 50 ms for 1e6 iterations."
(let ((y x))
(dotimes (it 1000000 t)
(if (> y 1.01d0)
(setf y (sqrt y))
(setf y (* y y y))))
y))
Mapping each computation to a thread and execute
Given a list of argument-lists llarg and a function fun, we want to compute nthreads results and return the list of results res-list. Here is what I came up with using the resources I found (see below).
(defmacro splice-arglist-help (fun arglist)
  "Helper macro.
Splices a list 'arglist' (arg1 arg2 ...) into the function call of 'fun'
Returns (funcall fun arg1 arg2 ...)"
  `(funcall ,fun ,@arglist))
(defun splice-arglist (fun arglist)
(eval `(splice-arglist-help ,fun ,arglist)))
(defun maplist-fun-multi (fun llarg nthreads)
"Maps 'fun' over list of argument lists 'llarg' using multithreading.
Breaks up llarg and feeds it to each thread.
Appends all the result lists at the end."
(let ((thread-list nil)
(res-list nil))
;; Create and run threads
(dotimes (it nthreads t)
(let ((larg-temp (elt llarg it)))
(setf thread-list (append thread-list
(list (sb-thread:make-thread
(lambda ()
(splice-arglist fun larg-temp))))))))
;; Join threads
;; Threads are joined in order, not optimal for speed.
;; Should be joined when finished ?
(dotimes (it (list-length thread-list) t)
(setf res-list (append res-list (list (sb-thread:join-thread (elt thread-list it))))))
res-list))
nthreads does not necessarily match the length of llarg, but I avoid the extra book-keeping just for the example simplicity's sake. I also omitted the various declare used for optimization.
We can test the multi-threading and compare timings using :
(defparameter *test-args-sqrt-long* nil)
(dotimes (it 10000 t)
(push (list (+ 3d0 it)) *test-args-sqrt-long*))
(time (intensive-sqrt 5d0))
(time (maplist-fun-multi #'intensive-sqrt *test-args-sqrt-long* 100))
The number of threads is quite high. I think the optimum would be to use as many threads as the CPU has, but I noticed the performance drop-off is barely noticeable in terms of time/operations. Doing more operations would involve breaking up the input lists into smaller pieces.
The above code outputs, on a 2 cores/4 threads machine :
Evaluation took:
0.029 seconds of real time
0.015625 seconds of total run time (0.015625 user, 0.000000 system)
55.17% CPU
71,972,879 processor cycles
22,151,168 bytes consed
Evaluation took:
1.415 seconds of real time
4.703125 seconds of total run time (4.437500 user, 0.265625 system)
[ Run times consist of 0.205 seconds GC time, and 4.499 seconds non-GC time. ]
332.37% CPU
3,530,632,834 processor cycles
2,215,345,584 bytes consed
What's bugging me
The example I've given works very well and is robust (i.e. results don't get mixed up between threads, and I experience no crashes). The speed gain is also there and the computations do use several cores/threads on the machines I've tested this code on. But there are a few things that I'd like an opinion/help on:
The use of the argument list llarg and larg-temp. Is this really necessary ? Is there any way to avoid manipulating potentially huge lists ?
Threads are joined in the order in which they are stored in the thread-list. I imagine this would not be optimal if operations each took a different time to complete. Is there a way to join each thread when it is finished, instead of waiting ?
The answers should be in the resources I already found, but I find the more advanced stuff hard to grapple with.
Resources found so far
http://www.sbcl.org/manual/#Threading
http://cl-cookbook.sourceforge.net/process.html
https://lispcookbook.github.io/cl-cookbook/process.html
Stylistic issues
The splice-arglist helpers are not needed at all (so I'll also skip details in them). Use apply in your thread function instead:
(lambda ()
(apply fun larg-temp))
You don't need to (and should not) index into a list, because that is O(n) for each lookup—your loops are quadratic. Use dolist for simple side-effective loops, or loop when you have e. g. parallel iteration:
(loop :repeat nthreads
:for args :in llarg
:collect (sb-thread:make-thread (lambda () (apply fun args))))
For going over a list while creating a new list of the same length where each element is calculated from the corresponding element in the source list, use mapcar:
(mapcar #'sb-thread:join-thread threads)
Your function thus becomes:
(defun map-args-parallel (fun arglists nthreads)
(let ((threads (loop :repeat nthreads
:for args :in arglists
:collect (sb-thread:make-thread
(lambda ()
(apply fun args))))))
(mapcar #'sb-thread:join-thread threads)))
Performance
You are right that one usually creates only about as many threads as there are cores available. If you test performance by always creating n threads, then joining them, then going to the next batch, you will indeed see little difference in performance. That is because the inefficiency lies in creating the threads. A thread is about as resource intensive as a process.
What one usually does is to create a thread pool where the threads do not get joined, but instead reused. For that, you need some other mechanism to communicate arguments and results, e. g. channels (e. g. from chanl).
Note however that e. g. lparallel already provides a pmap function, and it does things right. The purpose of such wrapper libraries is not only to give the user (programmer) a nice interface, but also to think really hard about the problems and optimize sensibly. I am quite confident that pmap will be significantly faster than your attempt.

How to execute some Clojure futures in a single thread?

I'd like to create some futures in Clojure and run them all on a specific thread, to make sure they run one-at-a-time. Is this possible?
It's not hard to wrap the Java libraries to do this, but before I do that I want to make sure I'm not missing a Clojure way of doing it. In Java I can do this by implementing FutureTask and submitting those tasks to a single-threaded executor.
Clojure's future macro calls future-call function which uses a dedicated executor service. This means that you have no control to enforce a sequential execution.
On the other hand you can use promise instead of future objects and one future thread to sequentially deliver the results. Promise's API is similar to what futures provide. They have deref and realized? too.
The following code example has the subtasks executed sequentially on a new thread in the background while the immediately returned result of the function contains the promises to the computed values.
(defn start-async-calc []
(let [f1 (promise)
f2 (promise)
f3 (promise)]
(future
(deliver f1 (task-1))
(deliver f2 (task-2))
(deliver f3 (task-3)))
{:task1 f1
:task2 f2
:task3 f3}))
If you want to sequentialize the calls to future you can do it manually like this:
(do @(future 1)
    @(future 2)
    @(future 3))
They may still be called on different threads, but the next one won't be called until the previous one has finished. This is guaranteed by the @ (or deref function): the thread evaluating the do form blocks on the previous future until it completes, and only then spawns the next one.
You can prettify it with a macro like this:
(defmacro sequentialize [& futures]
  `(do ~@(map #(list `deref %) futures)))
user> (let [a (atom 1)]
        (sequentialize
          (future (swap! a #(* 10 %)))
          (future (swap! a #(+ 20 %)))
          (future (swap! a #(- %))))
        @a)
;;=> -30
This does exactly the same as the manual do. Notice that mutations to the atom a happen in order even if some threads run longer:
user> (let [a (atom 1)]
        (sequentialize
          (future (Thread/sleep 100)
                  (swap! a #(* 10 %)))
          (future (Thread/sleep 200)
                  (swap! a #(+ 20 %)))
          (future (swap! a #(- %))))
        @a)
;;=> -30
Manifold provides a way to create a future with a specific executor. It's not part of the core Clojure library, but it's still a high-quality library and probably the best option if you need more flexibility in dealing with futures than the core library provides (without resorting to Java interop).
In addition to the promises mentioned, you can use a delay. Promises have the problem that you can accidentally fail to deliver them and so create a deadlock scenario that's not possible with futures and delays. The difference between a future and a delay is only the thread that the work is executed on. With a future, the work is done in the background, and with a delay the work is done by the first thread that tries to deref it. So if futures are a better fit than promises, you could always do something like:
(def result-1 (delay (long-calculation-1)))
(def result-2 (delay (long-calculation-2)))
(def result-3 (delay (long-calculation-3)))
(defn run-calcs []
  @(future
     @result-1
     @result-2
     @result-3))
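For comparison, the Java-interop route mentioned in the question (a single-threaded executor) might look roughly like the sketch below. It uses only JDK classes; `submit-on`, `single`, and `log` are hypothetical names introduced for illustration, not part of any Clojure API.

```clojure
(import '(java.util.concurrent Executors ExecutorService Callable Future))

;; Hypothetical helper: submit a zero-argument fn to the given executor.
;; The ^Callable hint picks the submit(Callable) overload.
(defn submit-on [^ExecutorService executor f]
  (.submit executor ^Callable f))

;; A single worker thread runs submitted tasks one at a time, in order.
(def single (Executors/newSingleThreadExecutor))

(def log (atom []))

(def futures
  (mapv (fn [i]
          (submit-on single
                     (fn []
                       ;; Later tasks sleep less, yet still finish later.
                       (Thread/sleep (long (- 30 (* 10 i))))
                       (swap! log conj i)
                       i)))
        (range 3)))

;; Block until each task has finished and collect the results.
(def values (mapv (fn [^Future f] (.get f)) futures))

(.shutdown single)
```

Because the executor owns exactly one thread, the order recorded in `log` matches submission order regardless of how long each task takes.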

Upper limit for number of jobs in a go block?

Here is the code:
(ns typedclj.async
(:require [clojure.core.async
:as a
:refer [>! <! >!! <!!
go chan buffer
close! thread
alts! alts!! timeout]]
[clj-http.client :as -cc]))
(time (dorun
(let [c (chan)]
(doseq [i (range 10 1e4)]
(go (>! c i))))))
And I got an error:
Exception in thread "async-dispatch-12" java.lang.AssertionError: Assert failed: No more than 1024 pending puts are allowed on a single channel. Consider using a windowed buffer.
(< (.size puts) impl/MAX-QUEUE-SIZE)
at clojure.core.async.impl.channels.ManyToManyChannel.put_BANG_(channels.clj:150)
at clojure.core.async.impl.ioc_macros$put_BANG_.invoke(ioc_macros.clj:959)
at typedclj.async$eval11807$fn__11816$state_machine__6185__auto____11817$fn__11819.invoke(async.clj:19)
at typedclj.async$eval11807$fn__11816$state_machine__6185__auto____11817.invoke(async.clj:19)
at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:940)
at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:944)
at typedclj.async$eval11807$fn__11816.invoke(async.clj:19)
at clojure.lang.AFn.run(AFn.java:22)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)...
According to http://martintrojer.github.io/clojure/2013/07/07/coreasync-and-blocking-io/
... This will break the 1 job = 1 thread knot, thus this thread
parking will allow us to scale the number of jobs way beyond any
thread limit on the platform (usually around 1000 on the JVM).
core.async gives (blocking) channels and a new (unbounded) thread pool
when using 'thread'. This (in effect) is just some sugar over using
java threads (or clojure futures) and BlockingQueues from
java.util.concurrent. The main feature is go blocks in which threads
can be parked and resumed on the (potentially) blocking calls dealing
with core.async's channels...
Is 1e4 jobs already too many? What is the upper limit then?
I don't usually rant like this so I hope you will forgive me this one transgression:
In a more perfect world every programmer would repeat to themselves "there is no such thing as an unbounded queue" five times before sleeping and first thing upon waking. This mode of thinking requires figuring out how backpressure will be handled in your system, so that when there is a slowdown somewhere in the process the parts before it have a way to find out about it and slow themselves down in response. In core.async the default backpressure is immediate because the default buffer size is zero. No go block succeeds in putting something into a chan until someone is ready to consume it.
chans look basically like this:
"queue of pending puts" --> buffer --> "queue of pending takes"
The putter and taker queues are intended to allow time for the two processes that are communicating via this pipe to schedule themselves so progress can be made. Without these there would be no room for threads to schedule and deadlocks would happen. They are NOT intended to be used as the buffer. That's what the buffer in the middle is for, and this was the design behind making that the only one that has an explicit size. Explicitly set the buffer size for your system by setting the size of the buffer in the chan:
user> (time (dorun
(let [c (chan 1e6)]
(doseq [i (range 10 1e4)]
(go (>! c i))))))
"Elapsed time: 83.526679 msecs"
nil
In this case I have "calculated" that my system as a whole will be in a good state if there are up to a million waiting jobs. Of course your real-world experiences will be different, and very much unique to your situation.
Thanks for your patience,
The limit of unconsumed puts is the size of the channel's buffer plus the size of the pending-puts queue.
The queue size in core.async is limited to 1024 but one should not rely on that.
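An alternative to a very large buffer is to let backpressure do its job: one producer parks whenever the buffer is full, and a consumer drains the channel. The sketch below is one way to arrange that, assuming core.async is on the classpath; `produce-and-consume` is a hypothetical name and the buffer size of 16 is an arbitrary choice.

```clojure
(require '[clojure.core.async :as a])

(defn produce-and-consume
  "One producer go block puts n values onto a small buffered channel while
  one consumer go block sums them. At most one put is pending at a time."
  [n]
  (let [c (a/chan 16)
        ;; Producer: parks on >! whenever the buffer is full.
        _ (a/go-loop [i 0]
            (if (< i n)
              (do (a/>! c i)
                  (recur (inc i)))
              (a/close! c)))
        ;; Consumer: <! yields nil once the channel is closed and drained.
        consumer (a/go-loop [sum 0]
                   (if-some [v (a/<! c)]
                     (recur (+ sum v))
                     sum))]
    (a/<!! consumer)))

(def total (produce-and-consume 10000))
```

Since a single go block issues the puts sequentially, the pending-puts queue never holds more than one entry, so the 1024 limit from the error above is not approached.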

Synchronising threads for multiple readers / single writer in Clojure

I have some non-thread-safe code (a writer to shared data) that can only be called from multiple threads in a serialised manner, but I don't want to block any other thread-safe work (multiple readers) when this code is not being called.
This is essentially a multiple reader / single writer type locking situation where writers need to exclude both readers and other writers.
i.e. I have two functions:
(defn reader-function [] ....) ;; only reads from shared data
(defn writer-function [] ....) ;; writes to shared data
And a number of threads that are running (possibly in a loop) the following:
(do
(reader-function)
...
(writer-function))
If any single thread is executing the writer function, all the other threads must block, i.e. at any one time either:
one thread is executing the writer and all others are blocked
or multiple threads are executing the reader function, and possibly some threads are blocked waiting to execute the writer once all readers have completed
What's the best way to achieve this kind of synchronisation in Clojure?
Put your data in a ref. The data should be a Clojure data structure (not a Java class). Use dosync to create a transaction around the read and write.
Example. Because you split your writer into a separate function, that function must modify a ref with something like an alter. Doing so requires a transaction (dosync). You could rely on writer being called only in a dosync but you can also put a dosync inside the write and rely on nested transactions doing what you want - this makes writer safe to call either in or out of a transaction.
(defn reader [shared]
  (println "I see" @shared))
(defn writer [shared item]
(dosync
(println "Writing to shared")
(alter shared conj item)))
;; combine the read and the write in a transaction
(defn combine [shared item]
(dosync
(reader shared)
(writer shared item)))
;; run a loop that adds n thread-specific items to the ref
(defn test-loop [shared n]
(doseq [i (range n)]
(combine shared (str (System/identityHashCode (Thread/currentThread)) "-" i))
(Thread/sleep 50)))
;; run t threads adding n items in parallel
(defn test-threaded [t n]
(let [shared (ref [])]
(doseq [_ (range t)]
(future (test-loop shared n)))))
Run the test with something like (test-threaded 3 10).
More info here: http://clojure.org/refs
You didn't ask about this case, but it's important to note that anyone can read the shared ref by derefing it at any time. This does not block concurrent writers.
Take a look at java.util.concurrent.locks.ReentrantReadWriteLock. This class allows either multiple readers that do not contend with each other, or one writer at a time.
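A minimal sketch of how that lock could wrap the two functions from the question is shown below. It uses only JDK classes through interop; the `ArrayList` is a hypothetical stand-in for the non-thread-safe shared data, and the function bodies are placeholders.

```clojure
(import '(java.util.concurrent.locks ReentrantReadWriteLock))

(def ^ReentrantReadWriteLock rw-lock (ReentrantReadWriteLock.))

;; Hypothetical stand-in for the non-thread-safe shared data.
(def ^java.util.ArrayList shared (java.util.ArrayList.))

(defn reader-function []
  (let [l (.readLock rw-lock)]
    (.lock l)                       ; many readers may hold this at once
    (try
      (.size shared)
      (finally (.unlock l)))))

(defn writer-function [item]
  (let [l (.writeLock rw-lock)]
    (.lock l)                       ; excludes readers and other writers
    (try
      (.add shared item)
      (finally (.unlock l)))))

;; Four threads each read, then write, 100 times.
(def workers
  (doall (for [t (range 4)]
           (future (dotimes [i 100]
                     (reader-function)
                     (writer-function [t i]))))))

(run! deref workers)                ; wait for all workers to finish
(def final-count (reader-function))
```

Each function releases its lock in a `finally` clause, and the read lock is released before the write lock is requested, since this lock class does not support upgrading a held read lock to a write lock.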

Resources