Why does looping over a large amount of data in another thread cause an overactive GC, and prevent some data from being freed? - multithreading

I'm writing code that takes some lazy results produced by pmap, and draws them onto a BufferedImage. For three days now I've been trying to figure out why the drawing suddenly starts freezing and eventually halts about a 1/3 of the way through.
I've finally narrowed it down to the fact that I'm looping over a large amount of data in another thread.
This is the best MCVE that I've come up with:
(ns mandelbrot-redo.irrelevant.write-image-mcve
(:import [java.awt.image BufferedImage]
(java.util.concurrent Executors Executor)))
(defn lazy-producer [width height]
(for [y (range height)
x (range width)]
[x y (+ x y)]))
; This works fine; finishing after about 5 seconds when width=5000
(defn sync-consumer [results width height]
(time
(doseq [[i result] (map vector (range) results)]
(when (zero? (rem i 1e6))
(println (str (double (/ i (* width height) 0.01)) "%")))
((fn boop [x] x) result)))) ; Data gets consumed here
; This gets to ~30%, then begins being interupted by 1-4 second lags
(defn async-consumer [results width height]
(doto
(Thread. ^Runnable
(fn []
(sync-consumer results width height)
(println "Done...")))
(.start)))
(defn -main []
(let [width 5000
height (int (* width 2/3))]
(-> (lazy-producer width height)
(async-consumer width height))))
When -main is run with sync-consumer, it finishes after a few seconds. With async-consumer however, it gets to about 25%, then begins slowing to a crawl to the point where the last printed percentage is 30%. If I leave it, I get an OOME.
If I use an explicit Thread., or use a local thread pool in async-consumer, it hangs and crashes. If I use future however, it finishes fine, just like sync-consumer.
The only hint I've gotten is that when I run this in VisualVM, I see that I have runaway allocation of Longs when using the async version:
The sync version shows a peak amount of Longs to be about 45mb at once in comparison.
The CPU usage is quite different too:
There's massive GC spikes, but it doesn't seem like the Longs are being disposed of.
I could use future for this, but I've been bitten by its exception swallowing behavior so many times, I'm hesitant.
Why is this happening? Why is running this in a new thread causing the GC to go crazy, while at the same time numbers aren't being freed?
Can anyone explain this behavior?

The sync version seems to be processing through the 16M+ results and will not hold onto the head of the results seq due to locals clearing. This means that as you go, values are created, processed, and GC'ed.
The async one closes over results in the fn and will hold the head, keeping all 16M+ values in memory, likely leading to GC thrashing?
I actually can't reproduce what you describe - both sync and async take about the same time for me as written above. (Clojure 1.9, Java 1.8).

I simplified you example, and get inconsistent results. I suspect that the manual Thread object is somehow being regarded (sometimes) as a daemon thread, so the JVM sometimes exits before it has completed:
(def N 5e3)
(def total-count (* N N))
(def report-fact (int (/ total-count 20)))
(defn lazy-producer []
(for [y (range N)
x (range N)]
[x y (+ x y)]))
(defn sync-consumer [results]
(println "sync-consumer: start")
(time
(doseq [[i result] (map vector (range) results)]
(when (zero? (rem i report-fact))
(println (str (Math/round (/ (* 100 i) total-count)) " %")))))
(println "sync-consumer: stop"))
(defn async-consumer [results]
; (spyx (count results))
(spyx (nth results 99))
(let [thread (Thread. (fn []
(println "thread start")
(sync-consumer results)
(println "thread done")
(flush)))]
; (.setDaemon thread false)
(.start thread)
(println "daemon? " (.isDaemon thread))
thread))
(dotest
(println "test - start")
(let [thread (async-consumer
(lazy-producer))]
(when true
(println "test - sleeping")
(Thread/sleep 5000))
; (.join thread)
)
(println "test - end"))
with results:
----------------------------------
Clojure 1.9.0 Java 10.0.1
----------------------------------
lein test tst.demo.core
test - start
(nth results 99) => [99 0 99]
daemon? false
test - sleeping
thread start
sync-consumer: start
0 %
5 %
10 %
15 %
20 %
25 %
30 %
35 %
40 %
45 %
50 %
55 %
test - end
Ran 2 tests containing 0 assertions.
0 failures, 0 errors.
60 %
lein test 54.58s user 1.37s system 372% cpu 15.028 total
If we uncomment the (.join thread) line, we get a complete run:
~/expr/demo > lein test
----------------------------------
Clojure 1.9.0 Java 10.0.1
----------------------------------
lein test tst.demo.core
test - start
(nth results 99) => [99 0 99]
daemon? false
test - sleeping
thread start
sync-consumer: start
0 %
5 %
10 %
15 %
20 %
25 %
30 %
35 %
40 %
45 %
50 %
55 %
60 %
65 %
70 %
75 %
80 %
85 %
90 %
95 %
"Elapsed time: 9388.313828 msecs"
sync-consumer: stop
thread done
test - end
Ran 2 tests containing 0 assertions.
0 failures, 0 errors.
lein test 72.52s user 1.69s system 374% cpu 19.823 total
It seems to exit early as if Clojure killed off the manual Thread object.
Perhaps you have found an (intermittent) bug.

Thanks to #amalloy and #Alex, I got it working.
I implemented the suggestions by #amalloy in the comments, and both variants work here and in my real case:
; Brittle since "once" is apparently more of an implementation detail of the language
(defn async-consumer [results width height]
(doto
(Thread. ^Runnable
(^:once fn* []
(sync-consumer results width height)
(println "Done...")))
(.start)))
; Arguably less brittle under the assumption that if they replace "once" with another mechanism,
; they'll update "delay".
(defn async-consumer [results width height]
(let [d (delay (sync-consumer results width height))]
(doto
(Thread. ^Runnable
(fn []
#d
(println "Done...")))
(.start))))
I also tried updating to 1.9.0. I thought that might fix it since #Alex says he's on 1.9.0 and can't reproduce this, and there's also this bug fix that seems related. Unfortunately, I didn't notice any difference.
It would be nice if there was an actual, solid mechanism for this. ^:once seems fine, but I don't want to use if just to have it potentially break later, and the use of delay seems like blatant abuse of the object just to make use of its inner (^:once fn* ...).
Oh well, at least it works now. Thanks guys.

Related

Multi-threaded parallelism performance problem with Fibonacci sequence in Julia (1.3)

I'm trying the multithread function of Julia 1.3 with the following Hardware:
Model Name: MacBook Pro
Processor Name: Intel Core i7
Processor Speed: 2.8 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Hyper-Threading Technology: Enabled
Memory: 16 GB
When running the following script:
function F(n)
if n < 2
return n
else
return F(n-1)+F(n-2)
end
end
#time F(43)
it gives me the following output
2.229305 seconds (2.00 k allocations: 103.924 KiB)
433494437
However when running the following code copied from the Julia page about multithreading
import Base.Threads.#spawn
function fib(n::Int)
if n < 2
return n
end
t = #spawn fib(n - 2)
return fib(n - 1) + fetch(t)
end
fib(43)
what happens is that the utilisation of RAM/CPU jumps from 3.2GB/6% to 15GB/25% without any output (for at least 1 minute, after which i decided to kill the julia session)
What am I doing wrong?
Great question.
This multithreaded implementation of the Fibonacci function is not faster than the single threaded version. That function was only shown in the blog post as a toy example of how the new threading capabilities work, highlighting that it allows for spawning many many threads in different functions and the scheduler will figure out an optimal workload.
The problem is that #spawn has a non-trivial overhead of around 1µs, so if you spawn a thread to do a task that takes less than 1µs, you've probably hurt your performance. The recursive definition of fib(n) has exponential time complexity of order 1.6180^n [1], so when you call fib(43), you spawn something of order 1.6180^43 threads. If each one takes 1µs to spawn, it'll take around 16 minutes just to spawn and schedule the needed threads, and that doesn't even account for the time it takes to do the actual computations and re-merge / sync threads which takes even more time.
Things like this where you spawn a thread for each step of a computation only make sense if each step of the computation takes a long time compared to the #spawn overhead.
Note that there is work going into lessening the overhead of #spawn, but by the very physics of multicore silicon chips I doubt it can ever be fast enough for the above fib implementation.
If you're curious about how we could modify the threaded fib function to actually be beneficial, the easiest thing to do would be to only spawn a fib thread if we think it will take significantly longer than 1µs to run. On my machine (running on 16 physical cores), I get
function F(n)
if n < 2
return n
else
return F(n-1)+F(n-2)
end
end
julia> #btime F(23);
122.920 μs (0 allocations: 0 bytes)
so thats a good two orders of magnitude over the cost of spawning a thread. That seems like a good cutoff to use:
function fib(n::Int)
if n < 2
return n
elseif n > 23
t = #spawn fib(n - 2)
return fib(n - 1) + fetch(t)
else
return fib(n-1) + fib(n-2)
end
end
now, if I follow proper benchmark methodology with BenchmarkTools.jl [2] I find
julia> using BenchmarkTools
julia> #btime fib(43)
971.842 ms (1496518 allocations: 33.64 MiB)
433494437
julia> #btime F(43)
1.866 s (0 allocations: 0 bytes)
433494437
#Anush asks in the comments: This is a factor of 2 speed up using 16 cores it seems. Is it possible to get something closer to a factor of 16 speed up?
Yes it is. The problem with the above function is that the function body is larger than that of F, with lots of conditionals, function / thread spawning and all that. I invite you to compare #code_llvm F(10) #code_llvm fib(10). This means that fib is much harder for julia to optimize. This extra overhead it makes a world of difference for the small n cases.
julia> #btime F(20);
28.844 μs (0 allocations: 0 bytes)
julia> #btime fib(20);
242.208 μs (20 allocations: 320 bytes)
Oh no! all that extra code that never gets touched for n < 23 is slowing us down by an order of magnitude! There's an easy fix though: when n < 23, don't recurse down to fib, instead call the single threaded F.
function fib(n::Int)
if n > 23
t = #spawn fib(n - 2)
return fib(n - 1) + fetch(t)
else
return F(n)
end
end
julia> #btime fib(43)
138.876 ms (185594 allocations: 13.64 MiB)
433494437
which gives a result closer to what we'd expect for so many threads.
[1] https://www.geeksforgeeks.org/time-complexity-recursive-fibonacci-program/
[2] The BenchmarkTools #btime macro from BenchmarkTools.jl will run functions multiple times, skipping the compilation time and average results.
#Anush
As an example of using memoization and multithreading manually
_fib(::Val{1}, _, _) = 1
_fib(::Val{2}, _, _) = 1
import Base.Threads.#spawn
_fib(x::Val{n}, d = zeros(Int, n), channel = Channel{Bool}(1)) where n = begin
# lock the channel
put!(channel, true)
if d[n] != 0
res = d[n]
take!(channel)
else
take!(channel) # unlock channel so I can compute stuff
#t = #spawn _fib(Val(n-2), d, channel)
t1 = _fib(Val(n-2), d, channel)
t2 = _fib(Val(n-1), d, channel)
res = fetch(t1) + fetch(t2)
put!(channel, true) # lock channel
d[n] = res
take!(channel) # unlock channel
end
return res
end
fib(n) = _fib(Val(n), zeros(Int, n), Channel{Bool}(1))
fib(1)
fib(2)
fib(3)
fib(4)
#time fib(43)
using BenchmarkTools
#benchmark fib(43)
But the speed up came from memmiozation and not so much multithreading. Lesson here is that we should think better algorithms before multithreading.

Clojure computing array in parallel

I need some help. I have on array(sequence) in size of 4000 elements. I need to compute each part(1000 elements is one part) by four different functions. On function computing one part at least 10 sec. Another functions computing their parts in 1 sec. I need to do it at list 10 times. So, of course I want to do it in parallel. I'm trying to do it using clojure future, but it steel take computing time like a sequence program. At first I think, that's because when i'm using one array and send data to each function each future thread can get access sequental. How can I do this computing in parallel?
here s a simple example of code:
(defn funct [t n]
(let [ a (future (fun1 (fun_take_i_part_array n 1)))
b (future (fun2 (fun_take_i_part_array n 2)))
c (future (fun3 (fun_take_i_part_array n 3)))
d (future (fun4 (fun_take_i_part_array n 4)))
]
(concatenate #a #b #c #d)
)
)
So I this future-kind program take longest than sequence for 1 sec.
The array access is never sequential unless these are not arrays but lazy sequences. It does not need to be because arrays are immutable - that is the distinguishing feature of Clojure. Your slowdown does not come from that.
Your idea of parallelisation is correct, look at this:
(defn fibonacci [n] (case n 1 1 2 1 (+ (fibonacci (- n 2)) (fibonacci (- n 1)))))
(defn parallel-fib []
(let [s1 (future (fibonacci 34))
s2 (future (fibonacci 35))
s3 (future (fibonacci 36))
s4 (future (fibonacci 33))]
[#s1 #s2 #s3 #s4]))
user=> (time (fibonacci 35))
"Elapsed time: 1514.663902 msecs"
9227465
user=> (time (fibonacci 36))
"Elapsed time: 2403.761528 msecs"
14930352
user=> (time (parallel-fib))
"Elapsed time: 2552.572043 msecs"
[5702887 9227465 14930352 3524578]
It is clear that the parallel executed futures, running on a 4-core machine, yield time close to the most expensive computation and not to the sum of computation times.
Therefore, I see no obvious reasons why your code experiences times closer to sequential execution. The reasons that come to my mind might be:
fun1 (or any other of those funs) could be a parallel function by itself (e.g. using pmap) consuming all of your cores. Then, the CPU would get crammed and no speedup would be observed.
Somehow 'concatenate' is eating a lot of CPU to put the results together. This should not be a major hassle as if you have used standard Clojure 'concat', it produces a lazy sequence which should incur its access cost only after it's being accessed. Not to mention the access cost should be very low.
If the computations inside futures are very simple, no good scaling could be archived either. We know it is not true, though, because you have already mentioned it takes fun1 10 seconds to complete.
It could help if you could send me the whole code. My email: contact [at] spinney.io.

Guile Scheme parallel forms speedup

I am experimenting with the parallel forms of Guile Scheme and I have the following code:
(use-modules (srfi srfi-1)
(ice-9 pretty-print)
(ice-9 receive))
(define (busy-work limit)
(if (> limit 0)
(begin (sqrt (+ (expt limit limit) 1))
(busy-work (- limit 1)))
'done))
(define (busy-work-2 lst)
(cond [(null? lst) 'done]
[else
(expt (car lst) (car lst))
(busy-work-2 (cdr lst))]))
(define (time thunk)
(define starting-time (current-time))
(define res (thunk))
(define ending-time (current-time))
(display "elapsed time: ")
(display (- ending-time starting-time))
(display "s")
(newline)
res)
(define (partition-4 numbers)
(define (loop numbers rc0 rc1 rc2 rc3)
(cond [(null? numbers) (list (reverse rc0)
(reverse rc1)
(reverse rc2)
(reverse rc3))]
[else
(let* ([number (car numbers)]
[residue (remainder number 4)])
(cond [(= residue 0) (loop (cdr numbers)
(cons number rc0)
rc1
rc2
rc3)]
[(= residue 1) (loop (cdr numbers)
rc0
(cons number rc1)
rc2
rc3)]
[(= residue 2) (loop (cdr numbers)
rc0
rc1
(cons number rc2)
rc3)]
[(= residue 3) (loop (cdr numbers)
rc0
rc1
rc2
(cons number rc3))]))]))
(loop numbers '() '() '() '()))
(or in my experimenting repository at https://github.com/ZelphirKaltstahl/guile-scheme-tutorials/blob/5321470f8f3cbbdb7f64d4ed60e4b1eaf8d8f444/parallellism/utils.scm)
The 2 procedures busy-work and busy-work-2 are pure number crunching, with lists of numbers, where no calculation depends on another, as far as I am aware. I know the time measurement might not be completely accurate.
However, consistently I do not get the expected speedup from using more threads (cores, as I can see in core usage in my CPU indicator).
Here are some examples, from which I would expect 2 threads to be twice as quickly done with the task as 1 core and 4 cores to be twice as fast as 2 cores. Well more or less at least, because I am splitting up the lists in a way, which should spread work more or less evenly.
Using 4 cores and parallel
(let ([residue-classes (partition-4 (iota 30000))])
(time
(lambda ()
(parallel (busy-work-2 (car residue-classes))
(busy-work-2 (cadr residue-classes))
(busy-work-2 (caddr residue-classes))
(busy-work-2 (cadddr residue-classes))))))
This finishes in approximately 10s on my machine. Sometimes 9s sometimes 10s.
Using par-map which uses 4 threads (cores)
(let ([residue-classes (partition-4 (iota 30000))])
(time
(lambda ()
(par-map busy-work-2
residue-classes))))
This finishes in approximately 10s on my machine. Sometimes 9s sometimes 10s. Just like with parallel.
Using n-par-map with 4 threads (on my machine)
(let ([residue-classes (partition-4 (iota 30000))])
(time
(lambda ()
(n-par-map (current-processor-count)
busy-work-2
residue-classes))))
Also 10s. Here the manual (https://www.gnu.org/software/guile/manual/html_node/Parallel-Forms.html) says:
Unlike those above, the functions described below take a number of threads as an argument. This makes them inherently non-portable since the specified number of threads may differ from the number of available CPU cores as returned by current-processor-count (see Processes). In addition, these functions create the specified number of threads when they are called and terminate them upon completion, which makes them quite expensive.
Therefore, they should be avoided.
While I find this explanation does not make 100% sense as it is (why would n-par-map not use the same pre-created threads as parallel, if there are sufficient of those? Like 4 as on my machine?), I do not see any great overhead and again I see it finishes in 10s approximately. My guess is, that the thread creation takes so little time, that it is simply not noticed compared to all the computation when number crunching.
Using n-par-map with 2 threads (cores)
(let ([residue-classes (partition-4 (iota 30000))])
(time
(lambda ()
(n-par-map 2
busy-work-2
residue-classes))))
Expectation: Might finish in 20s.
Result: This finishes in 12s.
Now of course I am thinking: "Well there must be some massive overhead in the runs with 4 cores!".
Question: But where does this overhead come from, when I am doing purely number crunching without inter-dependencies of any results? Does it use some shared memory so that memory access becomes a bottle neck?
You are probably using a machine with two physical cores which are hyperthreaded, so that 4 cpus are reported. What it shows is that this workload does not suit hyperthreads.
I get a similar result on a machine with two hyperthreaded physical cores. However with a machine with 4 physical cores I get 9 secs using all 4 cores, and 16 secs using only 2 cores, which is more along the lines of what you were expecting.

When to use non-blocking >! / threads and blocking >!! / goroutines with clojure core.async

I'm writing a an ETL process to read event level data from a product database, transform / aggregate it and write to to an analytics data warehouse. I'm using clojure's core.async library to separate these process into concurrently executing components. Here's what the main part of my code looks like right now
(ns data-staging.main
(:require [clojure.core.async :as async])
(:use [clojure.core.match :only (match)]
[data-staging.map-vecs]
[data-staging.tables])
(:gen-class))
(def submissions (make-table "Submission" "Valid"))
(def photos (make-table "Photo"))
(def videos (make-table "Video"))
(def votes (make-table "Votes"))
;; define channels used for sequential data processing
(def chan-in (async/chan 100))
(def chan-out (async/chan 100))
(defn write-thread [table]
"infinitely loops between reading subsequent 10000 rows from
table and ouputting a vector of the rows(maps)
into 'chan-in'"
(while true
(let [next-rows (get-rows table)]
(async/>!! chan-in next-rows)
(set-max table (:max-id (last next-rows))))))
(defn aggregator []
"takes output from 'chan-in' and aggregates it by coupon_id, date.
then adds / drops any fields that are needed / not needed and inputs
into 'chan-out'"
(while true
(->>
(async/<!! chan-in)
aggregate
(async/>!! chan-out))))
(defn read-thread []
"reads data from chan out and interts into Analytics DB"
(while true
(upsert (async/<!! chan-out))))
(defn -main []
(async/thread (write-thread submissions))
(async/thread (write-thread photos))
(async/thread (write-thread videos))
(async/thread-call aggregator)
(async/thread-call read-thread))
As you can see, I'm putting each os component on to its own thread and using the blocking >!! call on the channels. It feels like using the non-blocking >! calls along with go routines might be better for this use case, especially for the database reads which spend most of their time performing i/o and waiting for new rows in the product db. Is this the case, and if so, what would be the best way to implement it? I'm a little unclear on all the tradeoffs between the two methods and on exactly how to effectively use go routines. Also any other suggestions on how to improve the overall architecture would be much appreciated!
Personally, I think your use of threads here is probably the right call. The magic non-blocking nature of go-blocks comes from "parking," which is a special sort of pseudo-blocking that core.async's state machine uses — but since your database calls genuinely block instead of putting the state machine into a parked state, you'd just be blocking some thread from the core.async thread pool. It does depend on how long your synchronous calls take, so this is the sort of thing where benchmarks can be informative, but I strongly suspect threads are the right approach here.
The one exception is your aggregator function. It looks to me like it could just be folded into the definition of chan-out, as (def chan-out (map< aggregate chan-in)).
For a general overview of go-blocks versus threads, Martin Trojer wrote a good examination of the two approaches and which one is faster in which situation. The Cliff's Notes version is that go-blocks are good for adapting already-asynchronous libraries for use with core.async, while threads are good for making asynchronous processes out of synchronous parts. If your database had a callback-based API, for example, then go-blocks would be a definite win. But since it is synchronous, they are not a good fit.
i think that it would be a better approach to use "go" macros to have non-blocking-threads in this ETL case.
I've written a very simple code to achieve the synchronized sequence of processes implied in Extract Transform and Load tasks
Type on your repl the following code:
(require '[clojure.core.async :as async :refer [<! >! <!! timeout chan alt! go]])
(def output(chan))
(defn extract [origin]
(let [value-extracted (chan)
value-transformed (chan)
value-loaded (chan)]
(go
(<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
(>! value-extracted (str origin " > extracted ")))
(go
(<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
(>! value-transformed (str (<! value-extracted) " > transformed " )))
(go
(<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
(>! value-loaded (str (<! value-transformed) " > loaded " )))
(go
(<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
(>! output [origin (<! value-loaded)]))))
(go
(loop [origins-already-loaded []]
(let [[id message] (<! output)
origins-updated (conj origins-already-loaded id)]
(println message)
(println origins-updated)
(recur origins-updated)
)
))
Type on the repl:
(doseq [example (take 10 (range))] (extract example))
1 > extracted > transformed > loaded
[1]
7 > extracted > transformed > loaded
[1 7]
0 > extracted > transformed > loaded
[1 7 0]
8 > extracted > transformed > loaded
[1 7 0 8]
3 > extracted > transformed > loaded
[1 7 0 8 3]
6 > extracted > transformed > loaded
[1 7 0 8 3 6]
2 > extracted > transformed > loaded
[1 7 0 8 3 6 2]
5 > extracted > transformed > loaded
[1 7 0 8 3 6 2 5]
9 > extracted > transformed > loaded
[1 7 0 8 3 6 2 5 9]
4 > extracted > transformed > loaded
[1 7 0 8 3 6 2 5 9 4]
UPDATE:
the error fixed was to use <!! (timeout (+ 100 (* 100 (rand-int 20))))) inside the removed function "wait-a-while" that was blocking the others no blocking go processes

Racket refresh-now, thread, and yield

I've been writing some simple racket GUI programs to prepare for a class I'm teaching in the fall. I'm having some problems with animation. I'm using a basic canvas, and using the model for animation in which the entire canvas is refreshed each frame, by calling the paint procedure. An example program is below.
My problem is that I have to either run the animation as a separate thread, or call yield after each instance of refresh-now. Why is this? I expected the refresh-now to cause the image to refresh right away, without additional work on my part.
I have read the animation examples on the racket pages, and see that they usually draw directly to the canvas. I understand that since the canvas is double-buffered this works fine ... but for my application it's easier to just have the paint procedure carry the load, since I need a working paint procedure anyway in case of minimizing, etc. (Of course, the yield is not a huge burden, but it would be easier to teach if it were not needed.)
Thanks,
John
#lang racket
; Demonstrate simple animation in Racket
(require racket/gui)
(define min-x 0)
(define min-y 0)
(define max-x 200)
(define max-y 200)
; Three vertexes of the triangle, expressed relative to a starting x and y location.
(define triangle-vertexes [list
(list 10 0)
(list 0 20)
(list 20 20)])
(define triangle-x 20)
(define triangle-y 20)
; Move a triangle by a (delta-x, delta-y) pair
(define (move-triangle adjust)
(set! triangle-x (+ triangle-x (first adjust)))
(set! triangle-y (+ triangle-y (second adjust))))
; Adjust the location of a vertex by adding an (x,y) adjustment to it.
; Could also be defined using map.
(define (triangle-adjust adjust vertex)
(list (+ (first adjust) (first vertex))
(+ (second adjust) (second vertex))))
; Create the paint-callback function.
; It should:
; - draw a triangle at the current location
(define (draw-triangle dc)
(let ((vertex1 (triangle-adjust (list triangle-x triangle-y) (first triangle-vertexes)))
(vertex2 (triangle-adjust (list triangle-x triangle-y) (second triangle-vertexes)))
(vertex3 (triangle-adjust (list triangle-x triangle-y) (third triangle-vertexes))))
(send dc draw-line (first vertex1) (second vertex1) (first vertex2) (second vertex2))
(send dc draw-line (first vertex2) (second vertex2) (first vertex3) (second vertex3))
(send dc draw-line (first vertex3) (second vertex3) (first vertex1) (second vertex1))))
(define frame (new frame%
[label "Animation Example"]
[width 800]
[height 800]))
(define triangle-canvas (new canvas% [parent frame]
[paint-callback
(lambda (canvas dc)
(display "callback called")
(draw-triangle dc))]))
(send frame show #t)
; run a thunk (a procedure of zero arguments) n times
; only useful if thunk has side-effects
(define (loop n thunk)
(cond
((> n 0) (thunk)
(loop (- n 1) thunk))
(else false)))
; Animate the triangle. We have to either run this in a different thread from
; the event loop or yield each time we want something to be drawn.
(define (animate-triangle)
(loop 30
(lambda ()
(move-triangle (list 10 10))
(send triangle-canvas refresh-now)
; (send triangle-canvas flush)
(yield)
; (sleep 0.1)
)))
This isn't an answer to your question about refresh-now, but a better alternative to an explicit thread and loop is the timer% class:
;; This goes after (send frame show #t), replacing loop and animate-triangle
(define timer-counter 0)
(define timer
(new timer%
(interval 100) ;; update every 100 ms
(notify-callback
(lambda ()
(cond [(< timer-counter 30)
(set! timer-counter (add1 timer-counter))
(move-triangle (list 10 10))
(send triangle-canvas refresh)]
[else
(send timer stop)])))))
If you redefine your stopping condition based on the triangle's state, you can get rid of the auxiliary timer-counter; I put it in to mimic the behavior of your original code.
The timer is created in the same eventspace as the frame, and the eventspace has an event-handling thread, which is why you don't have to create your own thread explicitly.
How to Design Programs, 2nd ed has another approach to animation, where the canvas and updates are managed automatically. You just call big-bang with functions to (functionally) update the "state of the world" and render the "state of the world" as an image. Depending on exactly what you're teaching, it may or may not be useful to you.

Resources