In one of Timmy Jose's blog posts at https://z0ltan.wordpress.com/2016/09/02/basic-concurrency-and-parallelism-in-common-lisp-part-3-concurrency-using-bordeaux-and-sbcl-threads/ he gives an example of the wrong way to print to the top level from inside a thread (using Bordeaux Threads as an example, although I am using Lparallel):
(defun print-message-top-level-wrong ()
(bt:make-thread
(lambda ()
(format *standard-output* "Hello from thread!")))
nil)
(print-message-top-level-wrong) -> NIL
The explanation is that "The same code would have run fine if we had not run it in a separate thread. What happens is that each thread has its own stack where the variables are rebound. In this case, even for *standard-output*, which being a global variable, we would assume should be available to all threads, is rebound inside each thread!"
And this is exactly what happens if the function is run in Allegro CL. However, in SBCL the function does print the intended output at the terminal. Does this mean *standard-output* is not being rebound in SBCL? In general, is there a cross-platform way to print to *standard-output* from inside a thread?
In a multithreaded situation printing to the terminal should normally be coordinated to avoid potentially printing from several streams at the same time. But there don't seem to be any functions like atomic-format or atomic-print available. Is there a straightforward way to avoid printing interference when there are multiple threads (assuming that locks/mutexes are too expensive to use for each individual printing operation)?
If you actually have a global binding (a binding in the global environment), it does work for all threads; see the documentation for bt:make-thread. Only dynamic (re-)bindings are thread-local. Implementations differ in how/when they bind those streams; sometimes the binding that is actually in effect for user programs is global, sometimes not.
I like to use some sort of queue or channel to coordinate output where necessary; I have not yet run into situations where the locking overhead was prohibitive.
Maybe you could try something with optimistic locking, but I don't know what has been done for that librarywise (some Lisp implementations do have CAS operations that could be used). This should be orthogonal to the parallelism library used.
EDIT: Just found in the SBCL manual: sb-concurrency has lock-free queues and mailboxes.
My question may seem weird but I think I'm facing an issue with volatile objects.
I have written a library implemented like this (just a scheme, not real content):
(def var1 (volatile! nil))
(def var2 (volatile! nil))
(def do-things [a]
(vreset! var1 a)
(vswap! var2 (inc #var2))
{:a #var1 :b #var2})
So I have global var which are initialized by external values, others that are calculated and I return their content.
i used volatile to have better speed than with atoms and not to redefine everytime a new var for every calculation.
The problem is that this seems to fail in practice because I map do-things to a collection (in another program) with inner sub-calls to this function occasionaly, like (pseudo-code) :
(map
(fn [x]
(let [analysis (do-things x)]
(if blabla
(do-things (f x))
analysis)))) coll)
Will inner conditionnal call spawn another thread under the hood ? It seems yes because somethimes calls work, sometimes not.
is there any other way to do apart from defining volatile inside every do-things body ?
EDIT
Actually the error was another thing but the question is still here : is this an acceptable/safe way to do without any explicit call to multithreading capabilities ?
There are very few constructs in Clojure that create threads on your behalf - generally Clojure can and will run on one or more threads depending on how you structure your program. pmap is a good example that creates and manages a pool of threads to map in parallel. Another is clojure.core.reducers/fold, which uses a fork/join pool, but really that's about it. In all other cases it's up to you to create and manage threads.
Volatiles should only be used with great care and in circumstances where you control the scope of use such that you are guaranteed not to be competing with threads to read and write the same volatile. Volatiles guarantee that writes can be read on another thread, but they do nothing to guarantee atomicity. For that, you must use either atoms (for uncoordinated) or refs and the STM (for coordinated).
I wrote a function to get the old value of an atom while putting a new value, all in one atomic operation:
(defn get-and-reset! [at newval]
"Resets atom to newval and returns the old value. Atomic."
(let [tmp (atom [])]
(swap! at #(do (reset! tmp %) newval))
#tmp))
The documentation says the swap! function shouldn't have side effects because it can be called multiple times. That alone doesn't seem like a problem since tmp never leaves the function and it's the last value that it gets reset! to that matters. The function seems to work but I haven't tested it thoroughly with multiple threads, etc. Is this local side-effect a safe exception to the documentation, or am I missing some other subtle problem?
Yes, that will work with the current implementation of atoms in Clojure, and is (almost) guaranteed to work by contract.
The key here is that atoms are sychronous. Therefore, the inner swap! is guaranteed to complete before the outer swap!. Since tmp is only used locally, from a single thread, the inner swap! is also guaranteed not to conflict with a swap! (of tmp) on another thread.
While the outer swap! (i.e., swap! of at) could conflict with other threads, this swap! will retry when a conflict is detected. Since swap! is synchronous, these retries will occur serially w.r.t. the thread the swap! is invoked on. I suppose it's conceivable this last condition does not necessarily hold. E.g., it would be possible for an implementation of atoms to perform the swap! on a different thread, and issue retries as soon as a conflict is detected (without waiting for previous tries to finish). However, that's not the way atoms are currently implemented, and (in my opinion) doesn't seem like a very likely way to implement atoms.
If this weakness bothers you, you can use compare-and-set! instead:
(defn get-and-reset! [at newval]
"Resets atom to newval and returns the old value. Atomic."
(loop [oldval #at]
(if (compare-and-set! at oldval newval)
;; then (no conflict => return oldval)
oldval
;; else (conflict => retry)
(recur #at))))
atom's cannot do what you are trying to do.
Atoms are only defined for uncoordinated syncronous updates of a single identity. for instance the functions used to update atoms may run many times so whatever you do with that value may happen many time for each value that makes it into the atom.
Agents are often a better choice for this sort of thing because If you send an action to an agent it will run at most once:
"At any point in time, at most one action for each Agent is being executed.
Actions dispatched to an agent from another single agent or thread will occur in the order they were sent"
Another option is to add a watch to the agent or atom and have that watch react to each change after it happens. If you can convince your self that neither of these cases work for you then you may have found one of the cases where coordinated change are actually required and then refs would be the better tool, though this is rare. Usually agents or atoms with watches cover most situations.
I am writing a benchmark for a program in Clojure. I have n threads accessing a cache at the same time. Each thread will access the cache x times. Each request should be logged inside a file.
To this end I created an agent that holds the path to the file to be written to. When I want to write I send-off a function that writes to the file and simply returns the path. This way my file-writes are race-condition free.
When I execute my code without the agent it finished in a few miliseconds. When I use the agent, and ask each thread to send-off to the agent each time my code runs horribly slow. I'm talking minutes.
(defn load-cache-only [usercount cache-size]
"Test requesting from the cache only."
; Create the file to write the benchmark results to.
(def sink "benchmarks/results/load-cache-only.txt")
(let [data-agent (agent sink)
; Data for our backing store generated at runtime.
store-data (into {} (map vector (map (comp keyword str)
(repeat "item")
(range 1 cache-size))
(range 1 cache-size)))
cache (create-full-cache cache-size store-data)]
(barrier/run-with-barrier (fn [] (load-cache-only-work cache store-data data-agent)) usercount)))
(defn load-cache-only-work [cache store-data data-agent]
"For use with 'load-cache-only'. Requests each item in the cache one.
We time how long it takes for each request to be handled."
(let [cache-size (count store-data)
foreachitem (fn [cache-item]
(let [before (System/nanoTime)
result (cache/retrieve cache cache-item)
after (System/nanoTime)
diff_ms ((comp str float) (/ (- after before) 1000))]
;(send-off data-agent (fn [filepath]
;(file/insert-record filepath cache-size diff_ms)
;filepath))
))]
(doall (map foreachitem (keys store-data)))))
The (barrier/run-with-barrier) code simply spawns usercount number of threads and starts them at the same time (using an atom). The function I pass is the body of each thread.
The body willl simply map over a list named store-data, which is a key-value list (e.g., {:a 1 :b 2}. The length of this list in my code right now is 10. The number of users is 10 as well.
As you can see, the code for the agent send-off is commented out. This makes the code execute normally. However, when I enable the send-offs, even without writing to the file, the execution time is too slow.
Edit:
I made each thread, before he sends off to the agent, print a dot.
The dots appear just as fast as without the send-off. So there must be something blocking in the end.
Am I doing something wrong?
You need to call (shutdown-agents) when you're done sending stuff to your agent if you want the JVM to exit in reasonable time.
The underlying problem is that if you don't shutdown your agents, the threads backing its threadpool will never get shut down, and prevent the JVM from exiting. There's a timeout that will shutdown the pool if there's nothing else running, but it's fairly lengthy. Calling shutdown-agents as soon as you're done producing actions will resolve this problem.
Coding in Lua, I have a triply nested loop that goes through 6000 iterations. All 6000 iterations are independent and can easily be parallelized. What threads package for Lua compiles out of the box and gets decent parallel speedups on four or more cores?
Here's what I know so far:
luaproc comes from the core Lua team, but the software bundle on luaforge is old, and the mailing list has reports of it segfaulting. Also, it's not obvious to me how to use the scalar message-passing model to get results ultimately into a parent thread.
Lua Lanes makes interesting claims but seems to be a heavyweight, complex solution. Many messages on the mailing list report trouble getting Lua Lanes to build or work for them. I myself have had trouble getting the underlying "Lua rocks" distribution mechanism to work for me.
LuaThread requires explicit locking and requires that communication between threads be mediated by global variables that are protected by locks. I could imagine worse, but I'd be happier with a higher level of abstraction.
Concurrent Lua provides an attractive message-passing model similar to Erlang, but it says that processes do not share memory. It is not clear whether spawn actually works with any Lua function or whether there are restrictions.
Russ Cox proposed an occasional threading model that works only for C threads. Not useful for me.
I will upvote all answers that report on actual experience with these or any other multithreading package, or any answer that provides new information.
For reference, here is the loop I would like to parallelize:
for tid, tests in pairs(tests) do
local results = { }
matrix[tid] = results
for i, test in pairs(tests) do
if test.valid then
results[i] = { }
local results = results[i]
for sid, bin in pairs(binaries) do
local outcome, witness = run_test(test, bin)
results[sid] = { outcome = outcome, witness = witness }
end
end
end
end
The run_test function is passed in as an argument, so a package can be useful to me only if it can run arbitrary functions in parallel. My goal is enough parallelism to get 100% CPU utilization on 6 to 8 cores.
Norman wrote concerning luaproc:
"it's not obvious to me how to use the scalar message-passing model to get results ultimately into a parent thread"
I had the same problem with a use case I was dealing with. I liked lua proc due to its simple and light implementation, but my use case had C code that was calling lua, which was triggering a co-routine that needed to send/receive messages to interact with other luaproc threads.
To achieve my desired functionality I had to add features to luaproc to allow sending and receiving messages from the parent thread or any other thread not running from the luaproc scheduler. Additionally, my changes allow using luaproc send/receive from coroutines created from luaproc.newproc() created lua states.
I added an additional luaproc.addproc() function to the api which is to be called from any lua state running from a context not controlled by the luaproc scheduler in order to set itself up with luaproc for sending/receiving messages.
I am considering posting the source as a new github project or contacting the developers and seeing if they would like to pull my additions. Suggestions as to how I should make it available to others are welcome.
Check the threads library in torch family. It implements a thread pool model: a few true threads (pthread in linux and windows thread in win32) are created first. Each thread has a lua_State object and a blocking job queue that admits jobs added from the main thread.
Lua objects are copied over from main thread to the job thread. However C objects such as Torch tensors or tds data structures can be passed to job threads via pointers -- this is how limited shared memory is achieved.
This is a perfect example of MapReduce
You can use LuaRings to accomplish your parallelization needs.
Concurrent Lua might seem like the way to go, but as I note in my updates below, it doesn't run things in parallel. The approach I tried was to spawn several processes that execute pickled closures received through the message queue.
Update
Concurrent Lua seems to handle first-class functions and closures without a hitch. See the following example program.
require 'concurrent'
local NUM_WORKERS = 4 -- number of worker threads to use
local NUM_WORKITEMS = 100 -- number of work items for processing
-- calls the received function in the local thread context
function worker(pid)
while true do
-- request new work
concurrent.send(pid, { pid = concurrent.self() })
local msg = concurrent.receive()
-- exit when instructed
if msg.exit then return end
-- otherwise, run the provided function
msg.work()
end
end
-- creates workers, produces all the work and performs shutdown
function tasker()
local pid = concurrent.self()
-- create the worker threads
for i = 1, NUM_WORKERS do concurrent.spawn(worker, pid) end
-- provide work to threads as requests are received
for i = 1, NUM_WORKITEMS do
local msg = concurrent.receive()
-- send the work as a closure
concurrent.send(msg.pid, { work = function() print(i) end, pid = pid })
end
-- shutdown the threads as they complete
for i = 1, NUM_WORKERS do
local msg = concurrent.receive()
concurrent.send(msg.pid, { exit = true })
end
end
-- create the task process
local pid = concurrent.spawn(tasker)
-- run the event loop until all threads terminate
concurrent.loop()
Update 2
Scratch all of that stuff above. Something didn't look right when I was testing this. It turns out that Concurrent Lua isn't concurrent at all. The "processes" are implemented with coroutines and all run cooperatively in the same thread context. That's what we get for not reading carefully!
So, at least I eliminated one of the options I guess. :(
I realize that this is not a works-out-of-the-box solution, but, maybe go old-school and play with forks? (Assuming you're on a POSIX system.)
What I would have done:
Right before your loop, put all tests in a queue, accessible between processes. (A file, a Redis LIST or anything else you like most.)
Also before the loop, spawn several forks with lua-posix (same as the number of cores or even more depending on the nature of tests). In parent fork wait until all children will quit.
In each fork in a loop, get a test from the queue, execute it, put results somewhere. (To a file, to a Redis LIST, anywhere else you like.) If there are no more tests in queue, quit.
In the parent fetch and process all test results as you do now.
This assumes that test parameters and results are serializable. But even if they are not, I think that it should be rather easy to cheat around that.
I've now built a parallel application using luaproc. Here are some misconceptions that kept me from adopting it sooner, and how to work around them.
Once the parallel threads are launched, as far as I can tell there is no way for them to communicate back to the parent. This property was the big block for me. Eventually I realized the way forward: when it's done forking threads, the parent stops and waits. The job that would have been done by the parent should instead be done by a child thread, which should be dedicated to that job. Not a great model, but it works.
Communication between parent and children is very limited. The parent can communicate only scalar values: strings, Booleans, and numbers. If the parent wants to communicate more complex values, like tables and functions, it must code them as strings. Such coding can take place inline in the program, or (especially) functions can be parked into the filesystem and loaded into the child using require.
The children inherit nothing of the parent's environment. In particular, they don't inherit package.path or package.cpath. I had to work around this by the way I wrote the code for the children.
The most convenient way to communicate from parent to child is to define the child as a function, and to have the child capture parental information in its free variables, known in Lua parlances as "upvalues." These free variables may not be global variables, and they must be scalars. Still, it's a decent model. Here's an example:
local function spawner(N, workers)
return function()
local luaproc = require 'luaproc'
for i = 1, N do
luaproc.send('source', i)
end
for i = 1, workers do
luaproc.send('source', nil)
end
end
end
This code is used as, e.g.,
assert(luaproc.newproc(spawner(randoms, workers)))
This call is how values randoms and workers are communicated from parent to child.
The assertion is essential here, as if you forget the rules and accidentally capture a table or a local function, luaproc.newproc will fail.
Once I understood these properties, luaproc did indeed work "out of the box", when downloaded from askyrme on github.
ETA: There is an annoying limitation: in some circumstances, calling fread() in one thread can prevent other threads from being scheduled. In particular, if I run the sequence
local file = io.popen(command, 'r')
local result = file:read '*a'
file:close()
return result
the read operation blocks all other threads. I don't know why this is---I assume it is some nonsense going on within glibc. The workaround I used was to call directly to read(2), which required a little glue code, but this works properly with io.popen and file:close().
There's one other limitation worth noting:
Unlike Tony Hoare's original conception of communicating sequential processing, and unlike most mature, serious implementations of synchronous message passing, luaproc does not allow a receiver to block on multiple channels simultaneously. This limitation is serious, and it rules out many of the design patterns that synchronous message-passing is good at, but it's still find for many simple models of parallelism, especially the "parbegin" sort that I needed to solve for my original problem.