Guile Scheme parallel forms speedup - multithreading

I am experimenting with the parallel forms of Guile Scheme and I have the following code:
(use-modules (srfi srfi-1)
(ice-9 pretty-print)
(ice-9 receive))
(define (busy-work limit)
(if (> limit 0)
(begin (sqrt (+ (expt limit limit) 1))
(busy-work (- limit 1)))
'done))
(define (busy-work-2 lst)
(cond [(null? lst) 'done]
[else
(expt (car lst) (car lst))
(busy-work-2 (cdr lst))]))
(define (time thunk)
(define starting-time (current-time))
(define res (thunk))
(define ending-time (current-time))
(display "elapsed time: ")
(display (- ending-time starting-time))
(display "s")
(newline)
res)
(define (partition-4 numbers)
(define (loop numbers rc0 rc1 rc2 rc3)
(cond [(null? numbers) (list (reverse rc0)
(reverse rc1)
(reverse rc2)
(reverse rc3))]
[else
(let* ([number (car numbers)]
[residue (remainder number 4)])
(cond [(= residue 0) (loop (cdr numbers)
(cons number rc0)
rc1
rc2
rc3)]
[(= residue 1) (loop (cdr numbers)
rc0
(cons number rc1)
rc2
rc3)]
[(= residue 2) (loop (cdr numbers)
rc0
rc1
(cons number rc2)
rc3)]
[(= residue 3) (loop (cdr numbers)
rc0
rc1
rc2
(cons number rc3))]))]))
(loop numbers '() '() '() '()))
(or in my experimenting repository at https://github.com/ZelphirKaltstahl/guile-scheme-tutorials/blob/5321470f8f3cbbdb7f64d4ed60e4b1eaf8d8f444/parallellism/utils.scm)
The 2 procedures busy-work and busy-work-2 are pure number crunching, with lists of numbers, where no calculation depends on another, as far as I am aware. I know the time measurement might not be completely accurate.
However, consistently I do not get the expected speedup from using more threads (cores, as I can see in core usage in my CPU indicator).
Here are some examples, from which I would expect 2 threads to be twice as quickly done with the task as 1 core and 4 cores to be twice as fast as 2 cores. Well more or less at least, because I am splitting up the lists in a way, which should spread work more or less evenly.
Using 4 cores and parallel
(let ([residue-classes (partition-4 (iota 30000))])
(time
(lambda ()
(parallel (busy-work-2 (car residue-classes))
(busy-work-2 (cadr residue-classes))
(busy-work-2 (caddr residue-classes))
(busy-work-2 (cadddr residue-classes))))))
This finishes in approximately 10s on my machine. Sometimes 9s sometimes 10s.
Using par-map which uses 4 threads (cores)
(let ([residue-classes (partition-4 (iota 30000))])
(time
(lambda ()
(par-map busy-work-2
residue-classes))))
This finishes in approximately 10s on my machine. Sometimes 9s sometimes 10s. Just like with parallel.
Using n-par-map with 4 threads (on my machine)
(let ([residue-classes (partition-4 (iota 30000))])
(time
(lambda ()
(n-par-map (current-processor-count)
busy-work-2
residue-classes))))
Also 10s. Here the manual (https://www.gnu.org/software/guile/manual/html_node/Parallel-Forms.html) says:
Unlike those above, the functions described below take a number of threads as an argument. This makes them inherently non-portable since the specified number of threads may differ from the number of available CPU cores as returned by current-processor-count (see Processes). In addition, these functions create the specified number of threads when they are called and terminate them upon completion, which makes them quite expensive.
Therefore, they should be avoided.
While I find this explanation does not make 100% sense as it is (why would n-par-map not use the same pre-created threads as parallel, if there are sufficient of those? Like 4 as on my machine?), I do not see any great overhead and again I see it finishes in 10s approximately. My guess is, that the thread creation takes so little time, that it is simply not noticed compared to all the computation when number crunching.
Using n-par-map with 2 threads (cores)
(let ([residue-classes (partition-4 (iota 30000))])
(time
(lambda ()
(n-par-map 2
busy-work-2
residue-classes))))
Expectation: Might finish in 20s.
Result: This finishes in 12s.
Now of course I am thinking: "Well there must be some massive overhead in the runs with 4 cores!".
Question: But where does this overhead come from, when I am doing purely number crunching without inter-dependencies of any results? Does it use some shared memory so that memory access becomes a bottle neck?

You are probably using a machine with two physical cores which are hyperthreaded, so that 4 cpus are reported. What it shows is that this workload does not suit hyperthreads.
I get a similar result on a machine with two hyperthreaded physical cores. However with a machine with 4 physical cores I get 9 secs using all 4 cores, and 16 secs using only 2 cores, which is more along the lines of what you were expecting.

Related

Racket AND logical statement or nested if's with struct

I'm trying to just make a simple Racket block of code with an AND or just nested-if logical statement. This is supposed to just specify if variable hours is above 0 and below 24 hours and minutes is above 0 and below 60 minutes within a struct called 24hourClock, if not it should type an else error statement for hours individually or minutes individually. If you can, just let me know if there's a minor syntax error, I'm just now learning struct and Racket in general. Thanks.
;Attempt 1
(struct 24HourClock (hours minutes)
#:transparent
#:guard (if (number? hours)
(error name "24HourClock: Hours must between 0 and 23:24.")
(if (number? minutes)
(error name "24HourClock: Minutes must between 0 and 59:60.")
(values x y))))
;Attempt 2
(struct 24HourClock5 (hours minutes)
#:guard (lambda (hours minutes error_name)
(if (not (and (number? hours) (number? minutes)))
(error error_name "Both fields hours and minutes must be between 0:24 and 0:60")
(values hours minutes))))
The guard argument must be a procedure, as in your second attempt.
It is probably a good idea to define it outside the struct definition, since it is not exactly tiny.
Something like this, perhaps:
(define (valid-hours h)
(and (>= h 0) (< h 24)))
(define (valid-minutes m)
(and (>= m 0) (< m 60)))
(define (valid-time h m name)
(cond
[(not (valid-hours h)) (error name "Bad hours")]
[(not (valid-minutes m)) (error name "Bad minutes")]
[else (values h m)]))
(struct 24HourClock (hours minutes)
#:transparent
#:guard valid-time)

Clojure computing array in parallel

I need some help. I have on array(sequence) in size of 4000 elements. I need to compute each part(1000 elements is one part) by four different functions. On function computing one part at least 10 sec. Another functions computing their parts in 1 sec. I need to do it at list 10 times. So, of course I want to do it in parallel. I'm trying to do it using clojure future, but it steel take computing time like a sequence program. At first I think, that's because when i'm using one array and send data to each function each future thread can get access sequental. How can I do this computing in parallel?
here s a simple example of code:
(defn funct [t n]
(let [ a (future (fun1 (fun_take_i_part_array n 1)))
b (future (fun2 (fun_take_i_part_array n 2)))
c (future (fun3 (fun_take_i_part_array n 3)))
d (future (fun4 (fun_take_i_part_array n 4)))
]
(concatenate #a #b #c #d)
)
)
So I this future-kind program take longest than sequence for 1 sec.
The array access is never sequential unless these are not arrays but lazy sequences. It does not need to be because arrays are immutable - that is the distinguishing feature of Clojure. Your slowdown does not come from that.
Your idea of parallelisation is correct, look at this:
(defn fibonacci [n] (case n 1 1 2 1 (+ (fibonacci (- n 2)) (fibonacci (- n 1)))))
(defn parallel-fib []
(let [s1 (future (fibonacci 34))
s2 (future (fibonacci 35))
s3 (future (fibonacci 36))
s4 (future (fibonacci 33))]
[#s1 #s2 #s3 #s4]))
user=> (time (fibonacci 35))
"Elapsed time: 1514.663902 msecs"
9227465
user=> (time (fibonacci 36))
"Elapsed time: 2403.761528 msecs"
14930352
user=> (time (parallel-fib))
"Elapsed time: 2552.572043 msecs"
[5702887 9227465 14930352 3524578]
It is clear that the parallel executed futures, running on a 4-core machine, yield time close to the most expensive computation and not to the sum of computation times.
Therefore, I see no obvious reasons why your code experiences times closer to sequential execution. The reasons that come to my mind might be:
fun1 (or any other of those funs) could be a parallel function by itself (e.g. using pmap) consuming all of your cores. Then, the CPU would get crammed and no speedup would be observed.
Somehow 'concatenate' is eating a lot of CPU to put the results together. This should not be a major hassle as if you have used standard Clojure 'concat', it produces a lazy sequence which should incur its access cost only after it's being accessed. Not to mention the access cost should be very low.
If the computations inside futures are very simple, no good scaling could be archived either. We know it is not true, though, because you have already mentioned it takes fun1 10 seconds to complete.
It could help if you could send me the whole code. My email: contact [at] spinney.io.

Why does looping over a large amount of data in another thread cause an overactive GC, and prevent some data from being freed?

I'm writing code that takes some lazy results produced by pmap, and draws them onto a BufferedImage. For three days now I've been trying to figure out why the drawing suddenly starts freezing and eventually halts about a 1/3 of the way through.
I've finally narrowed it down to the fact that I'm looping over a large amount of data in another thread.
This is the best MCVE that I've come up with:
(ns mandelbrot-redo.irrelevant.write-image-mcve
(:import [java.awt.image BufferedImage]
(java.util.concurrent Executors Executor)))
(defn lazy-producer [width height]
(for [y (range height)
x (range width)]
[x y (+ x y)]))
; This works fine; finishing after about 5 seconds when width=5000
(defn sync-consumer [results width height]
(time
(doseq [[i result] (map vector (range) results)]
(when (zero? (rem i 1e6))
(println (str (double (/ i (* width height) 0.01)) "%")))
((fn boop [x] x) result)))) ; Data gets consumed here
; This gets to ~30%, then begins being interupted by 1-4 second lags
(defn async-consumer [results width height]
(doto
(Thread. ^Runnable
(fn []
(sync-consumer results width height)
(println "Done...")))
(.start)))
(defn -main []
(let [width 5000
height (int (* width 2/3))]
(-> (lazy-producer width height)
(async-consumer width height))))
When -main is run with sync-consumer, it finishes after a few seconds. With async-consumer however, it gets to about 25%, then begins slowing to a crawl to the point where the last printed percentage is 30%. If I leave it, I get an OOME.
If I use an explicit Thread., or use a local thread pool in async-consumer, it hangs and crashes. If I use future however, it finishes fine, just like sync-consumer.
The only hint I've gotten is that when I run this in VisualVM, I see that I have runaway allocation of Longs when using the async version:
The sync version shows a peak amount of Longs to be about 45mb at once in comparison.
The CPU usage is quite different too:
There's massive GC spikes, but it doesn't seem like the Longs are being disposed of.
I could use future for this, but I've been bitten by its exception swallowing behavior so many times, I'm hesitant.
Why is this happening? Why is running this in a new thread causing the GC to go crazy, while at the same time numbers aren't being freed?
Can anyone explain this behavior?
The sync version seems to be processing through the 16M+ results and will not hold onto the head of the results seq due to locals clearing. This means that as you go, values are created, processed, and GC'ed.
The async one closes over results in the fn and will hold the head, keeping all 16M+ values in memory, likely leading to GC thrashing?
I actually can't reproduce what you describe - both sync and async take about the same time for me as written above. (Clojure 1.9, Java 1.8).
I simplified you example, and get inconsistent results. I suspect that the manual Thread object is somehow being regarded (sometimes) as a daemon thread, so the JVM sometimes exits before it has completed:
(def N 5e3)
(def total-count (* N N))
(def report-fact (int (/ total-count 20)))
(defn lazy-producer []
(for [y (range N)
x (range N)]
[x y (+ x y)]))
(defn sync-consumer [results]
(println "sync-consumer: start")
(time
(doseq [[i result] (map vector (range) results)]
(when (zero? (rem i report-fact))
(println (str (Math/round (/ (* 100 i) total-count)) " %")))))
(println "sync-consumer: stop"))
(defn async-consumer [results]
; (spyx (count results))
(spyx (nth results 99))
(let [thread (Thread. (fn []
(println "thread start")
(sync-consumer results)
(println "thread done")
(flush)))]
; (.setDaemon thread false)
(.start thread)
(println "daemon? " (.isDaemon thread))
thread))
(dotest
(println "test - start")
(let [thread (async-consumer
(lazy-producer))]
(when true
(println "test - sleeping")
(Thread/sleep 5000))
; (.join thread)
)
(println "test - end"))
with results:
----------------------------------
Clojure 1.9.0 Java 10.0.1
----------------------------------
lein test tst.demo.core
test - start
(nth results 99) => [99 0 99]
daemon? false
test - sleeping
thread start
sync-consumer: start
0 %
5 %
10 %
15 %
20 %
25 %
30 %
35 %
40 %
45 %
50 %
55 %
test - end
Ran 2 tests containing 0 assertions.
0 failures, 0 errors.
60 %
lein test 54.58s user 1.37s system 372% cpu 15.028 total
If we uncomment the (.join thread) line, we get a complete run:
~/expr/demo > lein test
----------------------------------
Clojure 1.9.0 Java 10.0.1
----------------------------------
lein test tst.demo.core
test - start
(nth results 99) => [99 0 99]
daemon? false
test - sleeping
thread start
sync-consumer: start
0 %
5 %
10 %
15 %
20 %
25 %
30 %
35 %
40 %
45 %
50 %
55 %
60 %
65 %
70 %
75 %
80 %
85 %
90 %
95 %
"Elapsed time: 9388.313828 msecs"
sync-consumer: stop
thread done
test - end
Ran 2 tests containing 0 assertions.
0 failures, 0 errors.
lein test 72.52s user 1.69s system 374% cpu 19.823 total
It seems to exit early as if Clojure killed off the manual Thread object.
Perhaps you have found an (intermittent) bug.
Thanks to #amalloy and #Alex, I got it working.
I implemented the suggestions by #amalloy in the comments, and both variants work here and in my real case:
; Brittle since "once" is apparently more of an implementation detail of the language
(defn async-consumer [results width height]
(doto
(Thread. ^Runnable
(^:once fn* []
(sync-consumer results width height)
(println "Done...")))
(.start)))
; Arguably less brittle under the assumption that if they replace "once" with another mechanism,
; they'll update "delay".
(defn async-consumer [results width height]
(let [d (delay (sync-consumer results width height))]
(doto
(Thread. ^Runnable
(fn []
#d
(println "Done...")))
(.start))))
I also tried updating to 1.9.0. I thought that might fix it since #Alex says he's on 1.9.0 and can't reproduce this, and there's also this bug fix that seems related. Unfortunately, I didn't notice any difference.
It would be nice if there was an actual, solid mechanism for this. ^:once seems fine, but I don't want to use if just to have it potentially break later, and the use of delay seems like blatant abuse of the object just to make use of its inner (^:once fn* ...).
Oh well, at least it works now. Thanks guys.

When to use non-blocking >! / threads and blocking >!! / goroutines with clojure core.async

I'm writing a an ETL process to read event level data from a product database, transform / aggregate it and write to to an analytics data warehouse. I'm using clojure's core.async library to separate these process into concurrently executing components. Here's what the main part of my code looks like right now
(ns data-staging.main
(:require [clojure.core.async :as async])
(:use [clojure.core.match :only (match)]
[data-staging.map-vecs]
[data-staging.tables])
(:gen-class))
(def submissions (make-table "Submission" "Valid"))
(def photos (make-table "Photo"))
(def videos (make-table "Video"))
(def votes (make-table "Votes"))
;; define channels used for sequential data processing
(def chan-in (async/chan 100))
(def chan-out (async/chan 100))
(defn write-thread [table]
"infinitely loops between reading subsequent 10000 rows from
table and ouputting a vector of the rows(maps)
into 'chan-in'"
(while true
(let [next-rows (get-rows table)]
(async/>!! chan-in next-rows)
(set-max table (:max-id (last next-rows))))))
(defn aggregator []
"takes output from 'chan-in' and aggregates it by coupon_id, date.
then adds / drops any fields that are needed / not needed and inputs
into 'chan-out'"
(while true
(->>
(async/<!! chan-in)
aggregate
(async/>!! chan-out))))
(defn read-thread []
"reads data from chan out and interts into Analytics DB"
(while true
(upsert (async/<!! chan-out))))
(defn -main []
(async/thread (write-thread submissions))
(async/thread (write-thread photos))
(async/thread (write-thread videos))
(async/thread-call aggregator)
(async/thread-call read-thread))
As you can see, I'm putting each os component on to its own thread and using the blocking >!! call on the channels. It feels like using the non-blocking >! calls along with go routines might be better for this use case, especially for the database reads which spend most of their time performing i/o and waiting for new rows in the product db. Is this the case, and if so, what would be the best way to implement it? I'm a little unclear on all the tradeoffs between the two methods and on exactly how to effectively use go routines. Also any other suggestions on how to improve the overall architecture would be much appreciated!
Personally, I think your use of threads here is probably the right call. The magic non-blocking nature of go-blocks comes from "parking," which is a special sort of pseudo-blocking that core.async's state machine uses — but since your database calls genuinely block instead of putting the state machine into a parked state, you'd just be blocking some thread from the core.async thread pool. It does depend on how long your synchronous calls take, so this is the sort of thing where benchmarks can be informative, but I strongly suspect threads are the right approach here.
The one exception is your aggregator function. It looks to me like it could just be folded into the definition of chan-out, as (def chan-out (map< aggregate chan-in)).
For a general overview of go-blocks versus threads, Martin Trojer wrote a good examination of the two approaches and which one is faster in which situation. The Cliff's Notes version is that go-blocks are good for adapting already-asynchronous libraries for use with core.async, while threads are good for making asynchronous processes out of synchronous parts. If your database had a callback-based API, for example, then go-blocks would be a definite win. But since it is synchronous, they are not a good fit.
i think that it would be a better approach to use "go" macros to have non-blocking-threads in this ETL case.
I've written a very simple code to achieve the synchronized sequence of processes implied in Extract Transform and Load tasks
Type on your repl the following code:
(require '[clojure.core.async :as async :refer [<! >! <!! timeout chan alt! go]])
(def output(chan))
(defn extract [origin]
(let [value-extracted (chan)
value-transformed (chan)
value-loaded (chan)]
(go
(<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
(>! value-extracted (str origin " > extracted ")))
(go
(<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
(>! value-transformed (str (<! value-extracted) " > transformed " )))
(go
(<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
(>! value-loaded (str (<! value-transformed) " > loaded " )))
(go
(<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
(>! output [origin (<! value-loaded)]))))
(go
(loop [origins-already-loaded []]
(let [[id message] (<! output)
origins-updated (conj origins-already-loaded id)]
(println message)
(println origins-updated)
(recur origins-updated)
)
))
Type on the repl:
(doseq [example (take 10 (range))] (extract example))
1 > extracted > transformed > loaded
[1]
7 > extracted > transformed > loaded
[1 7]
0 > extracted > transformed > loaded
[1 7 0]
8 > extracted > transformed > loaded
[1 7 0 8]
3 > extracted > transformed > loaded
[1 7 0 8 3]
6 > extracted > transformed > loaded
[1 7 0 8 3 6]
2 > extracted > transformed > loaded
[1 7 0 8 3 6 2]
5 > extracted > transformed > loaded
[1 7 0 8 3 6 2 5]
9 > extracted > transformed > loaded
[1 7 0 8 3 6 2 5 9]
4 > extracted > transformed > loaded
[1 7 0 8 3 6 2 5 9 4]
UPDATE:
the error fixed was to use <!! (timeout (+ 100 (* 100 (rand-int 20))))) inside the removed function "wait-a-while" that was blocking the others no blocking go processes

Racket refresh-now, thread, and yield

I've been writing some simple racket GUI programs to prepare for a class I'm teaching in the fall. I'm having some problems with animation. I'm using a basic canvas, and using the model for animation in which the entire canvas is refreshed each frame, by calling the paint procedure. An example program is below.
My problem is that I have to either run the animation as a separate thread, or call yield after each instance of refresh-now. Why is this? I expected the refresh-now to cause the image to refresh right away, without additional work on my part.
I have read the animation examples on the racket pages, and see that they usually draw directly to the canvas. I understand that since the canvas is double-buffered this works fine ... but for my application it's easier to just have the paint procedure carry the load, since I need a working paint procedure anyway in case of minimizing, etc. (Of course, the yield is not a huge burden, but it would be easier to teach if it were not needed.)
Thanks,
John
#lang racket
; Demonstrate simple animation in Racket
(require racket/gui)
(define min-x 0)
(define min-y 0)
(define max-x 200)
(define max-y 200)
; Three vertexes of the triangle, expressed relative to a starting x and y location.
(define triangle-vertexes [list
(list 10 0)
(list 0 20)
(list 20 20)])
(define triangle-x 20)
(define triangle-y 20)
; Move a triangle by a (delta-x, delta-y) pair
(define (move-triangle adjust)
(set! triangle-x (+ triangle-x (first adjust)))
(set! triangle-y (+ triangle-y (second adjust))))
; Adjust the location of a vertex by adding an (x,y) adjustment to it.
; Could also be defined using map.
(define (triangle-adjust adjust vertex)
(list (+ (first adjust) (first vertex))
(+ (second adjust) (second vertex))))
; Create the paint-callback function.
; It should:
; - draw a triangle at the current location
(define (draw-triangle dc)
(let ((vertex1 (triangle-adjust (list triangle-x triangle-y) (first triangle-vertexes)))
(vertex2 (triangle-adjust (list triangle-x triangle-y) (second triangle-vertexes)))
(vertex3 (triangle-adjust (list triangle-x triangle-y) (third triangle-vertexes))))
(send dc draw-line (first vertex1) (second vertex1) (first vertex2) (second vertex2))
(send dc draw-line (first vertex2) (second vertex2) (first vertex3) (second vertex3))
(send dc draw-line (first vertex3) (second vertex3) (first vertex1) (second vertex1))))
(define frame (new frame%
[label "Animation Example"]
[width 800]
[height 800]))
(define triangle-canvas (new canvas% [parent frame]
[paint-callback
(lambda (canvas dc)
(display "callback called")
(draw-triangle dc))]))
(send frame show #t)
; run a thunk (a procedure of zero arguments) n times
; only useful if thunk has side-effects
(define (loop n thunk)
(cond
((> n 0) (thunk)
(loop (- n 1) thunk))
(else false)))
; Animate the triangle. We have to either run this in a different thread from
; the event loop or yield each time we want something to be drawn.
(define (animate-triangle)
(loop 30
(lambda ()
(move-triangle (list 10 10))
(send triangle-canvas refresh-now)
; (send triangle-canvas flush)
(yield)
; (sleep 0.1)
)))
This isn't an answer to your question about refresh-now, but a better alternative to an explicit thread and loop is the timer% class:
;; This goes after (send frame show #t), replacing loop and animate-triangle
(define timer-counter 0)
(define timer
(new timer%
(interval 100) ;; update every 100 ms
(notify-callback
(lambda ()
(cond [(< timer-counter 30)
(set! timer-counter (add1 timer-counter))
(move-triangle (list 10 10))
(send triangle-canvas refresh)]
[else
(send timer stop)])))))
If you redefine your stopping condition based on the triangle's state, you can get rid of the auxiliary timer-counter; I put it in to mimic the behavior of your original code.
The timer is created in the same eventspace as the frame, and the eventspace has an event-handling thread, which is why you don't have to create your own thread explicitly.
How to Design Programs, 2nd ed has another approach to animation, where the canvas and updates are managed automatically. You just call big-bang with functions to (functionally) update the "state of the world" and render the "state of the world" as an image. Depending on exactly what you're teaching, it may or may not be useful to you.

Resources