I have an application that spins up a number of futures to do prolonged work. It's intermittently failing, and I'm trying to work out why.
The symptom is that the code simply ceases to execute, stopping in a random place. My future-creation code is something like this:
(def future-timeout
  ; 1 hour
  3600000)

(def concurrency 200)

; assumed: chunk-count is an atom counting completed chunks; it is
; referenced below but its definition was missing from the original snippet
(def chunk-count (atom 0))

(defn do-parallel
  [f coll]
  (let [chunks (partition-all concurrency coll)]
    (doseq [chunk chunks]
      (let [futures (doall
                      (map #(future
                              (try
                                (f %)
                                (catch Exception e
                                  (log/error "Unhandled error in do-parallel:" (.getMessage e))
                                  :exception)))
                           chunk))
            results (doall (map #(deref % future-timeout :timeout) futures))
            all-ok  (every? true? results)]
        (when all-ok
          (log/info "Chunk successful."))
        (when-not all-ok
          (log/error "Chunk unsuccessful.")
          (log/warn "Parallel execution results:" results))
        (swap! chunk-count inc)))
    (log/info "Finished batch")))
The concurrency variable controls the size of the batches, and therefore the number of concurrent executions attempted. f returns true on success; on an exception the catch returns :exception, and on a timeout the deref returns :timeout.
I'm doing this instead of pmap because I want to control concurrency: f is a long-running (~10 minutes), network-intensive task, while pmap seems tuned toward smaller, CPU-bound tasks (its parallelism is fixed at roughly the number of cores).
Normally this works fine. But after a few hours it stops:
During the execution of f, the function stops running.
No exception is caught. No timeout occurs.
The loop in do-parallel stops and no more log entries appear.
Other threads, e.g. Kafka Client, keep running.
Any ideas of what might be causing this? Or steps to put in place to help diagnose?
You might want to install an uncaught exception handler, to see whether a stray exception on the executor itself is causing work to stop.
https://github.com/pyr/uncaught has a facility for this, but it's also straightforward to do from the code directly.
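For reference, a minimal sketch of installing one directly (assuming the same log alias as the question's code):
; install a JVM-wide default handler for exceptions no thread catches
(Thread/setDefaultUncaughtExceptionHandler
  (reify Thread$UncaughtExceptionHandler
    (uncaughtException [_ thread ex]
      (log/error ex "Uncaught exception on" (.getName thread)))))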
Claypoole is useful for controlling parallelism.
(cp/pmap (count chunk) f chunk)
will create a temporary threadpool the same size as your chunk and execute all the functions in parallel.
This is just a suggestion for expressing parallelism, not an answer to your question which is about error handling; which I'm curious about also!
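If you'd rather reuse one bounded pool across all the chunks instead of creating a temporary pool per chunk, a sketch along these lines should work (assuming the standard claypoole API; concurrency is the var from the question):
(require '[com.climate.claypoole :as cp])

; one bounded pool for the whole batch, shut down when finished
(cp/with-shutdown! [pool (cp/threadpool concurrency)]
  (doall (cp/pmap pool f coll)))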
Maybe try catch Throwable instead of Exception? I've had issues before where errors slipped through catch Exception because they were Error subclasses rather than Exceptions.
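For example (a sketch reusing the question's names; only the catch clause changes):
; catching Throwable also records Error subclasses such as OutOfMemoryError
(defn run-guarded [f x]
  (future
    (try
      (f x)
      (catch Throwable t
        (log/error "Unhandled error in do-parallel:" (.getMessage t))
        :exception))))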
I think if there's an uncaught exception inside one of the futures, the future captures it and only rethrows it when dereferenced, rather than letting it propagate, so setting the default UncaughtExceptionHandler isn't going to help. Untested - but that's my gut feel.
Do you get the "Chunk unsuccessful" message at least, when it stops? Because if you don't, then that's really weird...
Looking around the implementation of future - it uses a CachedThreadPool underneath, which has no thread limit - so you're probably better off using an ExecutorService directly, or something like claypoole, as the other suggestions indicate.
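For the direct route, a minimal sketch with a fixed-size pool (the names here are illustrative, not from the question):
(import '(java.util.concurrent Callable ExecutorService Executors TimeUnit))

; a fixed pool caps concurrency at `concurrency` threads
(def ^ExecutorService pool (Executors/newFixedThreadPool concurrency))

(defn submit-task
  "Submit (f x) to the bounded pool; returns a java.util.concurrent.Future."
  [f x]
  (.submit pool ^Callable (fn [] (f x))))

; when the whole batch is done:
; (.shutdown pool)
; (.awaitTermination pool 1 TimeUnit/HOURS)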
Related
I need to make an HTTP request which can quite often fail, and I'm not interested in the result at all, whether it worked or not. Also, I don't want to wait for it to return.
So I'd like to wrap that call in a separate thread and make sure the thread won't stick around when something is blocking.
My current approach is something like this:
(defn- call-and-forget [url]
  (let [timeout 250
        combined-timeout (* timeout 2.5)
        f (future
            (try
              (http/delete url
                           {:socket-timeout timeout
                            :conn-timeout timeout})
              (catch Throwable e
                (printf "Could not call %s: %s"
                        url (.getMessage e)))))]
    ; 3-arg deref: wait up to combined-timeout ms, else return nil
    ; (the original 2-arg (deref f combined-timeout) throws an ArityException)
    (deref f combined-timeout nil)
    (when-not (future-done? f)
      (future-cancel f))))
I hereby put this code under the Apache 2.0 license
It uses clj-http to make the call and a future to create another thread. I am aware that this uses a thread from the built-in pool, and of the discussion over in this thread. The amount of complexity added by using my own thread pool, thread factory, executor service, uncaught handler and so on is not really worth it.
Would you agree that the code above is a good, working solution, or do you see a better way?
Looks good. You could also do
(when (= :failed (deref f timeout-ms :failed))
  (future-cancel f))
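One hedged refinement on that pattern: if the wrapped call could ever legitimately return :failed, the sentinel would collide with a real result. A namespaced keyword sidesteps that:
; ::timed-out is namespaced, so no HTTP response can accidentally equal it
(when (= ::timed-out (deref f combined-timeout ::timed-out))
  (future-cancel f))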
Here is the code:
(ns typedclj.async
  (:require [clojure.core.async
             :as a
             :refer [>! <! >!! <!!
                     go chan buffer
                     close! thread
                     alts! alts!! timeout]]
            [clj-http.client :as -cc]))

(time (dorun
        (let [c (chan)]
          (doseq [i (range 10 1e4)]
            (go (>! c i))))))
And I got an error:
Exception in thread "async-dispatch-12" java.lang.AssertionError: Assert failed: No more than 1024 pending puts are allowed on a single channel. Consider using a windowed buffer.
(< (.size puts) impl/MAX-QUEUE-SIZE)
at clojure.core.async.impl.channels.ManyToManyChannel.put_BANG_(channels.clj:150)
at clojure.core.async.impl.ioc_macros$put_BANG_.invoke(ioc_macros.clj:959)
at typedclj.async$eval11807$fn__11816$state_machine__6185__auto____11817$fn__11819.invoke(async.clj:19)
at typedclj.async$eval11807$fn__11816$state_machine__6185__auto____11817.invoke(async.clj:19)
at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:940)
at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:944)
at typedclj.async$eval11807$fn__11816.invoke(async.clj:19)
at clojure.lang.AFn.run(AFn.java:22)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)...
According to http://martintrojer.github.io/clojure/2013/07/07/coreasync-and-blocking-io/
... This will break the 1 job = 1 thread knot, thus this thread parking will allow us to scale the number of jobs way beyond any thread limit on the platform (usually around 1000 on the JVM).
core.async gives (blocking) channels and a new (unbounded) thread pool when using 'thread'. This (in effect) is just some sugar over using java threads (or clojure futures) and BlockingQueues from java.util.concurrent. The main feature is go blocks in which threads can be parked and resumed on the (potentially) blocking calls dealing with core.async's channels...
Is 1e4 jobs already too many? What is the upper limit then?
I don't usually rant like this, so I hope you will forgive me this one transgression:
In a more perfect world, every programmer would repeat to themselves "there is no such thing as an unbounded queue" five times before sleeping and first thing upon waking. This mode of thinking requires figuring out how backpressure will be handled in your system, so that when there is a slowdown somewhere in the process, the parts before it have a way to find out about it and slow themselves down in response. In core.async the default backpressure is immediate, because the default buffer size is zero: no go block succeeds in putting something onto a chan until someone is ready to consume it.
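A tiny demonstration of that rendezvous behaviour (a sketch, assuming core.async is on the classpath):
(require '[clojure.core.async :refer [chan go >! <!!]])

; with an unbuffered chan, the put parks until a matching take arrives
(let [c (chan)]                  ; default buffer size: zero
  (go (>! c :job)
      (println "put completed")) ; not printed yet...
  (Thread/sleep 100)
  (<!! c)                        ; ...the take lets the put complete
  (Thread/sleep 100))            ; now "put completed" appears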
chans look basically like this:
"queue of pending puts" --> buffer --> "queue of pending takes"
The putter and taker queues are intended to allow time for the two processes communicating via this pipe to schedule themselves so progress can be made. Without them there would be no room for threads to schedule, and deadlocks would happen. They are NOT intended to be used as the buffer; that's what the buffer in the middle is for, and that was the design rationale for making it the only one with an explicit size. Explicitly set the buffer size for your system by setting the size of the buffer in the chan:
user> (time (dorun
              (let [c (chan 1e6)]
                (doseq [i (range 10 1e4)]
                  (go (>! c i))))))
"Elapsed time: 83.526679 msecs"
nil
In this case I have "calculated" that my system as a whole will be in a good state if there are up to a million waiting jobs. Of course your real-world experiences will be different, and very much unique to your situation.
Thanks for your patience,
The limit of unconsumed puts is the size of the channel's buffer plus the size of the pending-puts queue.
The queue size in core.async is capped at 1024, but one should not rely on that.
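A sketch of that arithmetic (assuming current core.async internals, where the pending-put queue is capped at 1024): with a buffer of 8, up to 8 + 1024 = 1032 unconsumed puts can be outstanding, and the next one trips the assertion from the question:
(require '[clojure.core.async :refer [chan go >!]])

(let [c (chan 8)]   ; 8 buffered slots + 1024 pending-put slots
  (dotimes [i 1033] ; the 1033rd put exceeds 8 + 1024...
    (go (>! c i)))) ; ...and raises the AssertionError on a dispatch thread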
We have run into a high-CPU-usage situation when one of our EventHandlers broke.
Let's say we have several consumers (EventHandlers) that are configured to run sequentially over the buffer. If the first EventHandler throws an exception, is there a way to halt all the other EventHandlers (and wake them later)?
What we are doing is putting the failing thread to sleep, and afterwards we try to consume the same event again. But we have noticed that the other threads continue running and trying to read from the RingBuffer even when there are no events to read, raising the CPU beyond acceptable levels.
For the moment I'm ruling out the disruptor's WaitStrategy as the cause, because under normal conditions it works as expected. We are using a BlockingWaitStrategy there.
Some more explanations for the sake of the example
INPUT -> [A*] -> [B] -> [C] -> [D]
Where INPUT is the event polled from the RingBuffer and A, B, C and D are the different EventHandlers that are executing sequentially. A* is the consumer throwing an exception.
What we want to achieve is that when consumer A cannot consume an event (e.g. after an exception happens), the onEvent(...) method of that consumer does not exit but stays in a loop with regular sleeps, trying to consume the same event again when it wakes up. In the meantime, all the other consumers should be parked or sleeping until A succeeds.
We are using disruptor version 3.3.0.
I have been googling but haven't found a working solution.
Thanks in advance.
Salva.
A colleague has found out that this issue could be related to a while loop in the waitFor method of BlockingWaitStrategy:
long availableSequence;
while ((availableSequence = dependentSequence.get()) < sequence) {
    barrier.checkAlert();
}
After several tests we came across this possible solution:
var availableSequence: Long = dependentSequence.get()
while (availableSequence < sequence) {
    this.lock.lock()
    this.lock.unlock()
    availableSequence = dependentSequence.get()
}
availableSequence
Basically it makes one thread lock the resource, and with that we momentarily park all the other consumers, avoiding the high CPU usage.
The second point here is the while condition. It fires only when the available sequence (that is, the sequence of the dependent threads) is below the current sequence number, and that only happens when one thread is holding the lock, for example when A throws the exception.
We are still investigating whether this is a valid solution, or whether it can have undesired side effects.
Any thoughts about it are welcome.
I am writing a benchmark for a program in Clojure. I have n threads accessing a cache at the same time. Each thread will access the cache x times. Each request should be logged inside a file.
To this end I created an agent that holds the path of the file to be written to. When I want to write, I send-off a function that writes to the file and simply returns the path. This way my file writes are free of race conditions.
When I execute my code without the agent, it finishes in a few milliseconds. When I use the agent and have each thread send-off to it each time, my code runs horribly slowly. I'm talking minutes.
(defn load-cache-only
  "Test requesting from the cache only."
  [usercount cache-size]
  ; Create the file to write the benchmark results to.
  (def sink "benchmarks/results/load-cache-only.txt")
  (let [data-agent (agent sink)
        ; Data for our backing store, generated at runtime.
        store-data (into {} (map vector
                                 (map (comp keyword str)
                                      (repeat "item")
                                      (range 1 cache-size))
                                 (range 1 cache-size)))
        cache (create-full-cache cache-size store-data)]
    (barrier/run-with-barrier
      (fn [] (load-cache-only-work cache store-data data-agent))
      usercount)))
(defn load-cache-only-work
  "For use with 'load-cache-only'. Requests each item in the cache once.
  We time how long it takes for each request to be handled."
  [cache store-data data-agent]
  (let [cache-size (count store-data)
        foreachitem (fn [cache-item]
                      (let [before (System/nanoTime)
                            result (cache/retrieve cache cache-item)
                            after (System/nanoTime)
                            ; note: nanoseconds / 1000 is microseconds, not ms
                            diff_ms ((comp str float) (/ (- after before) 1000))]
                        ;(send-off data-agent (fn [filepath]
                        ;                       (file/insert-record filepath cache-size diff_ms)
                        ;                       filepath))
                        ))]
    (doall (map foreachitem (keys store-data)))))
The (barrier/run-with-barrier) code simply spawns usercount threads and starts them at the same time (using an atom). The function I pass is the body of each thread.
The body will simply map over store-data, which is a key-value map (e.g., {:a 1, :b 2}). Its length in my code right now is 10, and the number of users is 10 as well.
As you can see, the code for the agent send-off is commented out. This makes the code execute normally. However, when I enable the send-offs, even without actually writing to the file, the execution time is far too slow.
Edit:
I made each thread print a dot just before it sends off to the agent.
The dots appear just as fast as without the send-off, so something must be blocking after that point.
Am I doing something wrong?
You need to call (shutdown-agents) when you're done sending stuff to your agent if you want the JVM to exit in reasonable time.
The underlying problem is that if you don't shut down your agents, the threads backing their threadpools will never be shut down and will prevent the JVM from exiting. There is a timeout that shuts the pool down if nothing else is running, but it's fairly lengthy. Calling shutdown-agents as soon as you're done producing actions resolves this.
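A minimal sketch of where that call belongs (run-benchmark is a hypothetical stand-in for whatever kicks off the sends):
(defn -main [& args]
  (run-benchmark)    ; hypothetical entry point that produces the agent actions
  (shutdown-agents)) ; stop the send/send-off pools so the JVM can exit promptly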
Well... – apparently, nothing! If I try
Prelude Control.Concurrent.Async Data.List> do {_ <- async $ return $! foldl' (+) 0 [0,0.1 .. 1e+8 :: Double]; print "Async is lost!"}
"Async is lost!"
one processor core starts going wild for a while, but the interface stays normal. Evidently the thread is started and simply runs as long as there is something to do.
But (efficiency aside), is that in principle ok, or must Asyncs always be either cancelled or waited for? Does something break because there just isn't a way to read the result anymore? And does the GC properly clean up everything? Will perhaps the thread in fact be stopped, and that just doesn't happen yet when I try it (for lack of memory pressure)? Does the thread even properly "end" at all, simply when the forkIOed action comes to an end?
I'm quite uncertain about this concurrency stuff. Perhaps I'm still thinking too much in a C++ way about this. RAII / deterministic garbage collection certainly make you feel a bit better cared for in such regards...
Internally, an Async is just a Haskell thread that writes to an STM TMVar when finished. A cancel is just sending the Haskell thread a kill signal. In Haskell, you don't need to explicitly kill threads. Even if the Async itself is garbage collected, the thread will still run to its end, and then everything will be properly cleaned up. However, if the Async ends in an exception, then wait will propagate the exception to the waiting thread. If you don't wait, you'll never know that the exception happened.