Here is the code:
(ns typedclj.async
(:require [clojure.core.async
:as a
:refer [>! <! >!! <!!
go chan buffer
close! thread
alts! alts!! timeout]]
[clj-http.client :as -cc]))
(time (dorun
(let [c (chan)]
(doseq [i (range 10 1e4)]
(go (>! c i))))))
And I got an error:
Exception in thread "async-dispatch-12" java.lang.AssertionError: Assert failed: No more than 1024 pending puts are allowed on a single channel. Consider using a windowed buffer.
(< (.size puts) impl/MAX-QUEUE-SIZE)
at clojure.core.async.impl.channels.ManyToManyChannel.put_BANG_(channels.clj:150)
at clojure.core.async.impl.ioc_macros$put_BANG_.invoke(ioc_macros.clj:959)
at typedclj.async$eval11807$fn__11816$state_machine__6185__auto____11817$fn__11819.invoke(async.clj:19)
at typedclj.async$eval11807$fn__11816$state_machine__6185__auto____11817.invoke(async.clj:19)
at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:940)
at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:944)
at typedclj.async$eval11807$fn__11816.invoke(async.clj:19)
at clojure.lang.AFn.run(AFn.java:22)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)...
According to http://martintrojer.github.io/clojure/2013/07/07/coreasync-and-blocking-io/
... This will break the 1 job = 1 thread knot, thus this thread
parking will allow us to scale the number of jobs way beyond any
thread limit on the platform (usually around 1000 on the JVM).
core.async gives (blocking) channels and a new (unbounded) thread pool
when using 'thread'. This (in effect) is just some sugar over using
java threads (or clojure futures) and BlockingQueues from
java.util.concurrent. The main feature is go blocks in which threads
can be parked and resumed on the (potentially) blocking calls dealing
with core.async's channels...
Is 1e4 jobs already too many? What is the upper limit then?
I don't usually rant like this so I hope you will forgive me this one transgression:
In a more prefect world every programmer would repeat to themselves "there is no such thing as an unbounded queue" five times before sleeping and first thing upon waking. This mode of thinking requires firguring out how backpressure will be handled in your system so when there is a slowdown somewhere in the process the parts before that have a way to find out about it and slow themselves down in response. In core.async the default back pressure is immediate because the default buffer size is zero. No go block succeeds in putting something into a chan until someone is ready to consume it.
chans look basically like this:
"queue of pending puts" --> buffer --> "queue of pending takes"
The putter and taker queues are intended to allow time for the two processes that are communicating via this pipe to schedule themselves so progress can be made. Without these there would be no room for threads to schedule and deadlocks would happen. They are NOT intended to be used as the buffer. thats what the buffer in the middle is for, and this was the design behind making that the only one that has a explicit size. explicitly set the buffer size for your system by setting the size of the buffer in the chan:
user> (time (dorun
(let [c (chan 1e6)]
(doseq [i (range 10 1e4)]
(go (>! c i))))))
"Elapsed time: 83.526679 msecs"
nil
In this case I have "calculated" that my system as a whole will be in a good state if there are up to a million waiting jobs. Of course your real world expierences will be different, and very much unique to your situation.
Thanks for your patience,
The limit of unconsumed puts is the size of the channels buffer plus the size of the queue.
The queue size in core.async is limited to 1024 but one should not rely on that.
Related
In the Lparallel API, the recommended way to terminate all threaded tasks is to stop the kernel with (lparallel:end-kernel). But when a thread is blocking—eg, with (pop-queue queue1) waiting for an item to appear in the queue—it will still be active when the kernel is stopped. In this case (at least in SBCL) the kernel shutdown occasionally (but not every time) fails with:
debugger invoked on a SB-KERNEL:BOUNDING-INDICES-BAD-ERROR in thread
#<THREAD "lparallel" RUNNING {1002F04973}>:
The bounding indices 1 and NIL are bad for a sequence of length 0.
See also:
The ANSI Standard, Glossary entry for "bounding index designator"
The ANSI Standard, writeup for Issue SUBSEQ-OUT-OF-BOUNDS:IS-AN-ERROR
debugger invoked on a SB-SYS:INTERACTIVE-INTERRUPT in thread
#<THREAD "main thread" RUNNING {10012E0613}>:
Interactive interrupt at #x1001484328.
I’m assuming this has something to do with the blocking thread not terminating correctly. How should a blocking thread be properly terminated before shutting down the kernel? (The API says kill-tasks should only be used in exceptional circumstances, which I’m taking not to apply to this “normal” shutdown circumstance.)
The problem with killing a thread is that it might happen anywhere, when the thread could be in any unknown state.
The only way to safely terminate a thread it is to let it shutdown itself gracefully, meaning you expect that during normal operations, there is a way for the thread to know it should stop working. Then you can properly clean your resources, close databases, free foreign pointers, log all things, ...
The queues you are using have operations that can timeout, that is a simple yet safe way to ensure you can avoid blocking forever and exit properly. But that's not the only option (you can use them in addition to what is shown below).
Shared / global flag
When a timeout occurs, or when you receive a message, you check a global boolean variable (or one that is shared among all interested threads). That's also a simple way to exit, and it can be read by multiple threads. This is however a concurrent access, so you should use locks or atomic operations (http://www.sbcl.org/manual/#Atomic-Operations), for example use defglobal and a fixnum type with atomic-incf, etc.
Control messages
Send control data in the queues and use them to determine how to shutdown gracefully, and how to propagate the information down the pipes, or how to restart things. This is safe (just message-passing) and allows any kind of control you might want to implement in your thread.
(defpackage :so (:use :cl :bt :lparallel.queue))
(in-package :so)
Let's define two services.
The first one echoes back its input:
(defun echo (in out)
(lambda ()
(loop
for value = (pop-queue in)
do (push-queue value out)
until (eq value :stop))))
Notice how it is expected to finish properly when given a :stop input, and how it also propagates the :stop message to its output queue.
The second thread will perform a modular addition, and also sleeps a bit between requests:
(defun modulo-adder (x m in out)
(lambda ()
(loop
for value = (progn (sleep 0.02)
(pop-queue in))
do (push-queue (typecase value
(keyword value)
(number (mod (+ x value) m)))
out)
until (eq value :stop))))
Create queues:
(defparameter *q1* (make-queue))
(defparameter *q2* (make-queue))
Create threads:
(progn
(bt:make-thread (echo *q1* *q2*) :name "echo")
(bt:make-thread (modulo-adder 5 1024 *q2* *q1*) :name "adder"))
Both threads are connected to each others in a circular fashion, creating an infinite loop of additions. No value is currently exchanged between threads, and you can see them running for example with slime-list-threads or any other implementation-provided way; In any case (bt:all-threads) returns a list.
slime-list-threads
10 adder Running
11 echo Running
...
Add an item, now there is an infinite exchange of data between threads:
(push-queue 10 *q1*)
Wait, then stop them both:
(push-queue :stop *q1*)
Both threads stopped gracefully (they are no more visible in lists of threads).
We can inspect what remains in the queues (result vary from one test to another):
(list (try-pop-queue *q1*)
(try-pop-queue *q2*))
(99 NIL)
(list (try-pop-queue *q1*)
(try-pop-queue *q2*))
(:STOP NIL)
(list (try-pop-queue *q1*)
(try-pop-queue *q2*))
(NIL NIL)
Interrupting a thread
You create a service, controlled by messages or a global flag, but then you have a bug and the thread hangs. Instead of killing it and lose everything, you want at least to unwind the thread stack properly. This is a dangerous too, but you can use bt:interrupt to stop a thread anywhere it is running right now and execute a function.
(define-condition stop () ())
(defun signal-stop ()
(signal 'stop))
(defun endless ()
(let ((output *standard-output*))
(lambda ()
(print "START" output)
(unwind-protect (handler-case (loop)
(stop ()
(print "INTERRUPTED" output)))
(print "STOP" output)))))
Start it:
(bt:make-thread (endless) :name "loop")
This prints "START" and loops.
Then we interrupt it:
(bt:interrupt-thread (find "loop"
(bt:all-threads)
:test #'string=
:key #'bt:thread-name)
#'signal-stop)
The following is printed:
"INTERRUPTED"
"STOP"
Those messages would not be printed if the thread was killed, but note that you could still manage to have corrupted data given how random the interruption is. Also, it can unblock blocking calls like sleep or pop-queue.
I got array with [a-z,A-Z] ASCII numbers like so: my #alphabet = (65..90,97..122);
So main thread functionality is checking each character from alphabet and return string if condition is true.
Simple example :
my #output = ();
for my $ascii(#alphabet){
thread->new(\sub{ return chr($ascii); });
}
I want to run thread on every ASCII number, then put letter from thread function into array in the correct order.
So in out case array #output should be dynamic and contain [a..z,A-Z] after all threads finish their job.
How to check, is all threads is done and keep the order?
You're looking for $thread->join, which waits for a thread to finish. It's documented here, and this SO question may also help.
Since in your case it looks like the work being done in the threads is roughly equal in cost (no thread is going to take a long time more than any other), you can just join each thread in order, like so, to wait for them all to finish:
# Store all the threads for each letter in an array.
my #threads = map { thread->new(\sub{ return chr($_); }) } #alphabet;
my #results = map { $_->join } #threads;
Since, when the first thread returns from join, the others are likely already done and just waiting for "join" to grab their return code, or about to be done, this gets you pretty close to "as fast as possible" parallelism-wise, and, since the threads were created in order, #results is ordered already for free.
Now, if your threads can take variable amounts of time to finish, or if you need to do some time-consuming processing in the "main"/spawning thread before plugging child threads' results into the output data structure, joining them in order might not be so good. In that case, you'll need to somehow either: a) detect thread "exit" events as they happen, or b) poll to see which threads have exited.
You can detect thread "exit" events using signals/notifications sent from the child threads to the main/spawning thread. The easiest/most common way to do that is to use the cond_wait and cond_signal functions from threads::shared. Your main thread would wait for signals from child threads, process their output, and store it into the result array. If you take this approach, you should preallocate your result array to the right size, and provide the output index to your threads (e.g. use a C-style for loop when you create your threads and have them return ($result, $index_to_store) or similar) so you can store results in the right place even if they are out of order.
You can poll which threads are done using the is_joinable thread instance method, or using the threads->list(threads::joinable) and threads->list(threads::running) methods in a loop (hopefully not a busy-waiting one; adding a sleep call--even a subsecond one from Time::HiRes--will save a lot of performance/battery in this case) to detect when things are done and grab their results.
Important Caveat: spawning a huge number of threads to perform a lot of work in parallel, especially if that work is small/quick to complete, can cause performance problems, and it might be better to use a smaller number of threads that each do more than one "piece" of work (e.g. spawn a small number of threads, and each thread uses the threads::shared functions to lock and pop the first item off of a shared array of "work to do" and do it rather than map work to threads as 1:1). There are two main performance problems that arise from a 1:1 mapping:
the overhead (in memory and time) of spawning and joining each thread is much higher than you'd think (benchmark it on threads that don't do anything, just return, to see). If the work you need to do is fast, the overhead of thread management for tons of threads can make it much slower than just managing a few re-usable threads.
If you end up with a lot more threads than there are logical CPU cores and each thread is doing CPU-intensive work, or if each thread is accessing the same resource (e.g. reading from the same disks or the same rows in a database), you hit a performance cliff pretty quickly. Tuning the number of threads to the "resources" underneath (whether those are CPUs or hard drives or whatnot) tends to yield much better throughput than trusting the thread scheduler to switch between many more threads than there are available resources to run them on. The reasons this is slow are, very broadly:
Because the thread scheduler (part of the OS, not the language) can't know enough about what each thread is trying to do, so preemptive scheduling cannot optimize for performance past a certain point, given that limited knowledge.
The OS usually tries to give most threads a reasonably fair shot, so it can't reliably say "let one run to completion and then run the next one" unless you explicitly bake that into the code (since the alternative would be unpredictably starving certain threads for opportunities to run). Basically, switching between "run a slice of thread 1 on resource X" and "run a slice of thread 2 on resource X" doesn't get you anything once you have more threads than resources, and adds some overhead as well.
TL;DR threads don't give you performance increases past a certain point, and after that point they can make performance worse. When you can, reuse a number of threads corresponding to available resources; don't create/destroy individual threads corresponding to tasks that need to be done.
Building on Zac B's answer, you can use the following if you want to reuse threads:
use strict;
use warnings;
use Thread::Pool::Simple qw( );
$| = 1;
my $pool = Thread::Pool::Simple->new(
do => [ sub {
select(undef, undef, undef, (200+int(rand(8))*100)/1000);
return chr($_[0]);
} ],
);
my #alphabet = ( 65..90, 97..122 );
print $pool->remove($_) for map { $pool->add($_) } #alphabet;
print "\n";
The results are returned in order, as soon as they become available.
I'm the author of Parallel::WorkUnit so I'm partial to it. And I thought adding ordered responses was actually a great idea. It does it with forks, not threads, because forks are more widely supported and they often perform better in Perl.
my $wu = Parallel::WorkUnit->new();
for my $ascii(#alphabet){
$wu->async(sub{ return chr($ascii); });
}
#output = $wu->waitall();
If you want to limit the number of simultaneous processes:
my $wu = Parallel::WorkUnit->new(max_children => 5);
for my $ascii(#alphabet){
$wu->queue(sub{ return chr($ascii); });
}
#output = $wu->waitall();
I have a thread that gets new frames, 2 other threads that process the newly gotten image and 1 that prints the output based on the processing threads.
The program cycle goes,
>thread 3, print an output based on the previous outputs of the thread 0 and 1
>thread 0 get new image
>> thread 1, process image for color
>> thread 2, process image for haar cascade
going cyclically 3&0>1&2>3&0>1&2>
'>' indicates join before spawning the next set
How do I pass the opencv Mat between the threads 0 to 1&2?
Also how would I pass the data from threads 1&2 to thread 0?
I would guess a message queue system, how does one implement that?
I suspect I may get some abuse and down-votes for this because global variables are generally frowned upon, however I feel the situation is different in image-processing where:
the code is multi-threaded, and
the data structures (images) are large.
Here I think it is important to avoid expensive copying and "transmission" of data down sockets etc. when it is already available in memory that is shared and visible amongst threads.
So, in concrete terms, I would go for an array, or vector (according to your preference) of say 16 OpenCV Mats that is globally accessible.
Acquire into the first, notify the the next thread when that buffer is full, then acquire into the next. And so on.
As regards the notification, you have several options. The cleanest, and most modern is probably using condition variables letting each processing thread (not the acquiring thread) wait on a condvar for "buffer full". Next is probably POSIX message queues, though if you are porting to macOS later you may regret that as there is no support. Another, easily programmed method is to use sockets but just send a single byte that is the index into the global array of 16 Mats - that way there is no problem with incomplete, multi-byte reads on sockets. The processing threads then just sit in a loop doing blocking reads on the socket to know which buffer to process. You can also define a special index that means "quit".
Check the size of your images in terms of width x height x channels x bytes per channel and get a feel for how much memory a global vector of 16 Mats will need and be sure you have that available before using this strategy - you may be up against the wall with a Raspberry Pi for example.
I'm having internal conflict about using Thread/sleep inside a Clojure go block. It is generally not recommended to have any long-running process inside go blocks, and Stuart Sierra mentions that sleeping in go block is not preferred:
In general, any work which doesn't block, sleep, or do I/O can be safely put in a go block without having a major impact on the throughput of the system.
In my case, I want to listen to the messages on a channel, and group them before notifying user if they come rapidly. Or, if only one message comes along, notify user about it. In my particular use case, it's rather common for several messages to come either in rapid succession or alone, and unlikely for messages to come in regular intervals.
To achieve this, I have a go-loop block that waits input on a channel. When it receives it sleeps a bit (a second or two at most), checks if any other input arrived on a channel in the meantime, and based on that either notifies user about a whole thing or just passes along the first message.
I will not have a bunch of such go blocks in my application, only one. As Clojure always spawns more then one thread to serve go blocks, blocking one thread with sleep shouldn't be a problem in practice. But in theory, I wonder if there's a more elegant way to handle this without tying up a thread like this?
Instead of blocking in Thread/sleep you should rather use clojure.core.async/timeout. There is an example on ClojureDocs very similar to your scenario:
(go-loop [seconds (atom 0)
add-seconds! #(swap! seconds + %)]
(println "Waiting 1 second")
(<! (timeout 1000))
(add-seconds! 1)
(println "Waiting 2 seconds")
(<! (timeout 2000))
(add-seconds! 2)
(println
(format "Waited %s seconds"
#seconds)))
From http://martintrojer.github.io/clojure/2013/07/07/coreasync-and-blocking-io/ :
To get a bit more concrete let's see what happens when we try to issue
some HTTP GET request using core.async. Let's start with the naive
solution, using blocking IO via clj-http.
(defn blocking-get [url]
(clj-http.client/get url))
(time
(def data
(let [c (chan)
res (atom [])]
;; fetch em all
(doseq [i (range 10 100)]
(go (>! c (blocking-get (format "http://fssnip.net/%d" i)))))
;; gather results
(doseq [_ (range 10 100)]
(swap! res conj (<!! c)))
#res
)))
Here we're trying to fetch 90 code snippets (in parallel) using go
blocks (and blocking IO). This took a long time, and that's because
the go block threads are "hogged" by the long running IO operations.
The situation can be improved by switching the go blocks to normal
threads.
(time
(def data-thread
(let [c (chan)
res (atom [])]
;; fetch em all
(doseq [i (range 10 100)]
(thread (>!! c (blocking-get (format "http://fssnip.net/%d" i)))))
;; gather results
(doseq [_ (range 10 100)]
(swap! res conj (<!! c)))
#res
)))
What does it mean that "go block threads are hogged by the long running IO operations"?
Go blocks are intended to be a sort of light-weight cooperative threads; they provide thread-like behaviour with less overhead than full JVM threads by using a few threads in a pool and switching go blocks when they park - for instance, when waiting on a channel using <!. The thread-switching cannot work when you call a method in the block that blocks the JVM thread, so you quickly run out of JVM threads. Most standard Java (and Clojure) IO operations will block the current thread when waiting.
What does it mean that "go block threads are hogged by the long running IO operations"?
There are a limited number of threads dedicated to serving go blocks*. If you perform a blocking I/O operation on one of those threads, then it cannot be used for any other purpose until that operation completes (unless the thread is interrupted). This is also true for non-go block threads (i.e., threads that are returned from the thread function), but non-go block threads do not come from the limited go block thread pool. So if you do blocking I/O in a go block, you are "hogging" that go block's thread from being used by other go blocks, even though the thread isn't doing any actual work (it's just waiting for the I/O operation).
* That number currently happens to be 42 + the number of processors available to the JVM.