What multithreading package for Lua "just works" as shipped?

Coding in Lua, I have a triply nested loop that goes through 6000 iterations. All 6000 iterations are independent and can easily be parallelized. What threads package for Lua compiles out of the box and gets decent parallel speedups on four or more cores?
Here's what I know so far:
luaproc comes from the core Lua team, but the software bundle on luaforge is old, and the mailing list has reports of it segfaulting. Also, it's not obvious to me how to use the scalar message-passing model to get results ultimately into a parent thread.
Lua Lanes makes interesting claims but seems to be a heavyweight, complex solution. Many messages on the mailing list report trouble getting Lua Lanes to build or work for them. I myself have had trouble getting the underlying "Lua rocks" distribution mechanism to work for me.
LuaThread requires explicit locking and requires that communication between threads be mediated by global variables that are protected by locks. I could imagine worse, but I'd be happier with a higher level of abstraction.
Concurrent Lua provides an attractive message-passing model similar to Erlang, but it says that processes do not share memory. It is not clear whether spawn actually works with any Lua function or whether there are restrictions.
Russ Cox proposed an occasional threading model that works only for C threads. Not useful for me.
I will upvote all answers that report on actual experience with these or any other multithreading package, or any answer that provides new information.
For reference, here is the loop I would like to parallelize:
for tid, tests in pairs(tests) do
  local results = { }
  matrix[tid] = results
  for i, test in pairs(tests) do
    if test.valid then
      results[i] = { }
      local results = results[i]
      for sid, bin in pairs(binaries) do
        local outcome, witness = run_test(test, bin)
        results[sid] = { outcome = outcome, witness = witness }
      end
    end
  end
end
The run_test function is passed in as an argument, so a package can be useful to me only if it can run arbitrary functions in parallel. My goal is enough parallelism to get 100% CPU utilization on 6 to 8 cores.

Norman wrote concerning luaproc:
"it's not obvious to me how to use the scalar message-passing model to get results ultimately into a parent thread"
I had the same problem with a use case I was dealing with. I liked luaproc for its simple, lightweight implementation, but my use case had C code calling Lua, which triggered a coroutine that needed to send and receive messages to interact with other luaproc threads.
To achieve my desired functionality I had to add features to luaproc to allow sending and receiving messages from the parent thread, or from any other thread not running under the luaproc scheduler. Additionally, my changes allow using luaproc send/receive from coroutines created inside Lua states that were themselves created by luaproc.newproc().
I added a luaproc.addproc() function to the API. It is to be called from any Lua state running in a context not controlled by the luaproc scheduler, in order to register that state with luaproc for sending and receiving messages.
I am considering posting the source as a new github project or contacting the developers and seeing if they would like to pull my additions. Suggestions as to how I should make it available to others are welcome.

Check out the threads library in the Torch family. It implements a thread pool model: a few true threads (pthreads on Linux, Windows threads on Win32) are created first. Each thread has a lua_State object and a blocking job queue that accepts jobs added from the main thread.
Lua objects are copied from the main thread to the job thread. However, C objects such as Torch tensors or tds data structures can be passed to job threads via pointers; this is how limited shared memory is achieved.

This is a perfect example of MapReduce
You can use LuaRings to accomplish your parallelization needs.

Concurrent Lua might seem like the way to go, but as I note in my updates below, it doesn't run things in parallel. The approach I tried was to spawn several processes that execute pickled closures received through the message queue.
Update
Concurrent Lua seems to handle first-class functions and closures without a hitch. See the following example program.
require 'concurrent'
local NUM_WORKERS = 4 -- number of worker threads to use
local NUM_WORKITEMS = 100 -- number of work items for processing
-- calls the received function in the local thread context
function worker(pid)
    while true do
        -- request new work
        concurrent.send(pid, { pid = concurrent.self() })
        local msg = concurrent.receive()
        -- exit when instructed
        if msg.exit then return end
        -- otherwise, run the provided function
        msg.work()
    end
end
-- creates workers, produces all the work and performs shutdown
function tasker()
    local pid = concurrent.self()
    -- create the worker threads
    for i = 1, NUM_WORKERS do concurrent.spawn(worker, pid) end
    -- provide work to threads as requests are received
    for i = 1, NUM_WORKITEMS do
        local msg = concurrent.receive()
        -- send the work as a closure
        concurrent.send(msg.pid, { work = function() print(i) end, pid = pid })
    end
    -- shutdown the threads as they complete
    for i = 1, NUM_WORKERS do
        local msg = concurrent.receive()
        concurrent.send(msg.pid, { exit = true })
    end
end
-- create the task process
local pid = concurrent.spawn(tasker)
-- run the event loop until all threads terminate
concurrent.loop()
Update 2
Scratch all of that stuff above. Something didn't look right when I was testing this. It turns out that Concurrent Lua isn't concurrent at all. The "processes" are implemented with coroutines and all run cooperatively in the same thread context. That's what we get for not reading carefully!
So, at least I eliminated one of the options I guess. :(

I realize that this is not a works-out-of-the-box solution, but, maybe go old-school and play with forks? (Assuming you're on a POSIX system.)
What I would have done:
Right before your loop, put all tests in a queue, accessible between processes. (A file, a Redis LIST or anything else you like most.)
Also before the loop, spawn several forks with lua-posix (as many as there are cores, or even more, depending on the nature of the tests). In the parent, wait until all children have quit.
In each fork in a loop, get a test from the queue, execute it, put results somewhere. (To a file, to a Redis LIST, anywhere else you like.) If there are no more tests in queue, quit.
In the parent fetch and process all test results as you do now.
This assumes that test parameters and results are serializable. But even if they are not, I think that it should be rather easy to cheat around that.
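The steps above can be sketched roughly as follows. This is Python rather than Lua, and it makes two simplifying assumptions that are mine, not the answer's: each child takes a fixed stride of the tests instead of pulling from a shared queue, and results come back to the parent over a pipe as JSON instead of through a file or Redis. The names run_forked and run_test are hypothetical. It is POSIX only, since it relies on fork.

```python
import json
import os


def run_forked(tests, run_test, workers=4):
    """Run run_test on every test in forked children; return results in input order."""
    children = []
    for w in range(workers):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: handle tests w, w + workers, w + 2 * workers, ...
            status = 1
            try:
                os.close(read_fd)
                mine = {i: run_test(tests[i]) for i in range(w, len(tests), workers)}
                with os.fdopen(write_fd, "w") as out:
                    json.dump(mine, out)  # results must be JSON-serializable
                status = 0
            finally:
                os._exit(status)  # never fall through into the parent's code
        # Parent: keep only the read end of this child's pipe.
        os.close(write_fd)
        children.append((pid, read_fd))

    results = [None] * len(tests)
    for pid, read_fd in children:
        with os.fdopen(read_fd) as inp:
            part = json.load(inp)  # blocks until the child closes its end
        os.waitpid(pid, 0)
        for index, value in part.items():
            results[int(index)] = value  # JSON turned the integer keys into strings
    return results
```

Because the children are forked copies of the parent, run_test can be any function already defined in the parent; only the test results have to be serializable.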

I've now built a parallel application using luaproc. Here are some misconceptions that kept me from adopting it sooner, and how to work around them.
Once the parallel threads are launched, as far as I can tell there is no way for them to communicate back to the parent. This property was the big block for me. Eventually I realized the way forward: when it's done forking threads, the parent stops and waits. The job that would have been done by the parent should instead be done by a child thread, which should be dedicated to that job. Not a great model, but it works.
Communication between parent and children is very limited. The parent can communicate only scalar values: strings, Booleans, and numbers. If the parent wants to communicate more complex values, like tables and functions, it must code them as strings. Such coding can take place inline in the program, or (especially) functions can be parked into the filesystem and loaded into the child using require.
The children inherit nothing of the parent's environment. In particular, they don't inherit package.path or package.cpath. I had to work around this by the way I wrote the code for the children.
The most convenient way to communicate from parent to child is to define the child as a function, and to have the child capture parental information in its free variables, known in Lua parlance as "upvalues." These free variables may not be global variables, and they must be scalars. Still, it's a decent model. Here's an example:
local function spawner(N, workers)
  return function()
    local luaproc = require 'luaproc'
    for i = 1, N do
      luaproc.send('source', i)
    end
    for i = 1, workers do
      luaproc.send('source', nil)
    end
  end
end
This code is used as, e.g.,
assert(luaproc.newproc(spawner(randoms, workers)))
This call is how values randoms and workers are communicated from parent to child.
The assertion is essential here, as if you forget the rules and accidentally capture a table or a local function, luaproc.newproc will fail.
Once I understood these properties, luaproc did indeed work "out of the box", when downloaded from askyrme on github.
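For readers who want to see the shape of this design outside Lua, here is a rough Python analogue using threads and queues. It is not luaproc: queue.Queue is buffered rather than synchronous, and the names pipeline, source, and sink are made up for illustration. It shows a producer sending N items followed by one end marker per worker, workers passing only strings along, and a dedicated collector thread doing the job the parent would otherwise have done.

```python
import queue
import threading


def pipeline(n_items, n_workers, work):
    source = queue.Queue()  # plays the role of the 'source' channel
    sink = queue.Queue()    # results flow to the dedicated collector
    collected = []

    def spawner():
        for i in range(1, n_items + 1):
            source.put(i)
        for _ in range(n_workers):
            source.put(None)  # one end-of-input marker per worker

    def worker():
        while True:
            item = source.get()
            if item is None:
                sink.put(None)  # tell the collector this worker is finished
                return
            sink.put(str(work(item)))  # only scalars cross a channel, so encode as a string

    def collector():
        finished = 0
        while finished < n_workers:
            message = sink.get()
            if message is None:
                finished += 1
            else:
                collected.append(message)

    threads = [threading.Thread(target=spawner), threading.Thread(target=collector)]
    threads += [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:  # the "parent" just waits, as described above
        t.join()
    return sorted(collected, key=int)  # arrival order is not deterministic
```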
ETA: There is an annoying limitation: in some circumstances, calling fread() in one thread can prevent other threads from being scheduled. In particular, if I run the sequence
local file = io.popen(command, 'r')
local result = file:read '*a'
file:close()
return result
the read operation blocks all other threads. I don't know why this is---I assume it is some nonsense going on within glibc. The workaround I used was to call directly to read(2), which required a little glue code, but this works properly with io.popen and file:close().
There's one other limitation worth noting:
Unlike Tony Hoare's original conception of communicating sequential processes, and unlike most mature, serious implementations of synchronous message passing, luaproc does not allow a receiver to block on multiple channels simultaneously. This limitation is serious, and it rules out many of the design patterns that synchronous message-passing is good at, but it's still fine for many simple models of parallelism, especially the "parbegin" sort that I needed to solve my original problem.

Related

How to control multi-threads synchronization in Perl

I have an array of the ASCII codes for [a-z,A-Z], like so: my @alphabet = (65..90,97..122);
The main thread's job is to check each character from the alphabet and return a string if a condition is true.
Simple example:
my @output = ();
for my $ascii (@alphabet) {
    threads->new(sub { return chr($ascii); });
}
I want to run a thread for every ASCII number, then put the letter returned by each thread function into an array in the correct order.
So in our case the array @output should be dynamic and contain [a-z,A-Z] after all threads finish their job.
How can I check that all threads are done, and keep the order?
You're looking for $thread->join, which waits for a thread to finish. It's documented here, and this SO question may also help.
Since in your case it looks like the work being done in the threads is roughly equal in cost (no thread is going to take a long time more than any other), you can just join each thread in order, like so, to wait for them all to finish:
# Store all the threads for each letter in an array.
# Each code is passed as an argument, so the thread body reads it from $_[0].
my @threads = map { threads->create(sub { return chr($_[0]); }, $_) } @alphabet;
my @results = map { $_->join } @threads;
Since, when the first thread returns from join, the others are likely already done and just waiting for "join" to grab their return code, or about to be done, this gets you pretty close to "as fast as possible" parallelism-wise, and, since the threads were created in order, @results is ordered already for free.
Now, if your threads can take variable amounts of time to finish, or if you need to do some time-consuming processing in the "main"/spawning thread before plugging child threads' results into the output data structure, joining them in order might not be so good. In that case, you'll need to somehow either: a) detect thread "exit" events as they happen, or b) poll to see which threads have exited.
You can detect thread "exit" events using signals/notifications sent from the child threads to the main/spawning thread. The easiest/most common way to do that is to use the cond_wait and cond_signal functions from threads::shared. Your main thread would wait for signals from child threads, process their output, and store it into the result array. If you take this approach, you should preallocate your result array to the right size, and provide the output index to your threads (e.g. use a C-style for loop when you create your threads and have them return ($result, $index_to_store) or similar) so you can store results in the right place even if they are out of order.
You can poll which threads are done using the is_joinable thread instance method, or using the threads->list(threads::joinable) and threads->list(threads::running) methods in a loop (hopefully not a busy-waiting one; adding a sleep call--even a subsecond one from Time::HiRes--will save a lot of performance/battery in this case) to detect when things are done and grab their results.
Important Caveat: spawning a huge number of threads to perform a lot of work in parallel, especially if that work is small/quick to complete, can cause performance problems, and it might be better to use a smaller number of threads that each do more than one "piece" of work (e.g. spawn a small number of threads, and each thread uses the threads::shared functions to lock and pop the first item off of a shared array of "work to do" and do it rather than map work to threads as 1:1). There are two main performance problems that arise from a 1:1 mapping:
The overhead (in memory and time) of spawning and joining each thread is much higher than you'd think (benchmark it on threads that don't do anything, just return, to see). If the work you need to do is fast, the overhead of thread management for tons of threads can make it much slower than just managing a few re-usable threads.
If you end up with a lot more threads than there are logical CPU cores and each thread is doing CPU-intensive work, or if each thread is accessing the same resource (e.g. reading from the same disks or the same rows in a database), you hit a performance cliff pretty quickly. Tuning the number of threads to the "resources" underneath (whether those are CPUs or hard drives or whatnot) tends to yield much better throughput than trusting the thread scheduler to switch between many more threads than there are available resources to run them on. The reasons this is slow are, very broadly:
The thread scheduler (part of the OS, not the language) can't know enough about what each thread is trying to do, so preemptive scheduling cannot optimize for performance past a certain point, given that limited knowledge.
The OS usually tries to give most threads a reasonably fair shot, so it can't reliably say "let one run to completion and then run the next one" unless you explicitly bake that into the code (since the alternative would be unpredictably starving certain threads for opportunities to run). Basically, switching between "run a slice of thread 1 on resource X" and "run a slice of thread 2 on resource X" doesn't get you anything once you have more threads than resources, and adds some overhead as well.
TL;DR threads don't give you performance increases past a certain point, and after that point they can make performance worse. When you can, reuse a number of threads corresponding to available resources; don't create/destroy individual threads corresponding to tasks that need to be done.
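To make the "few reusable threads" idea concrete, here is a minimal sketch in Python rather than Perl. The function name run_pool and its parameters are hypothetical. Each worker locks the shared list of work, pops the first item, and stores its result at the index that came with the item, so the output stays in order even when the work finishes out of order.

```python
import threading
from collections import deque


def run_pool(items, work, n_threads=4):
    todo = deque(enumerate(items))  # shared "work to do", tagged with its output index
    results = [None] * len(items)   # preallocated, so out-of-order stores are safe
    lock = threading.Lock()

    def worker():
        while True:
            with lock:  # lock, then pop the first remaining item
                if not todo:
                    return  # nothing left: this thread is done
                index, item = todo.popleft()
            results[index] = work(item)  # the work itself runs outside the lock

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With the alphabet from the question, run_pool(list(range(65, 91)) + list(range(97, 123)), chr) returns the letters in their original order while only ever creating four threads.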
Building on Zac B's answer, you can use the following if you want to reuse threads:
use strict;
use warnings;
use Thread::Pool::Simple qw( );
$| = 1;
my $pool = Thread::Pool::Simple->new(
    do => [ sub {
        select(undef, undef, undef, (200+int(rand(8))*100)/1000);
        return chr($_[0]);
    } ],
);
my @alphabet = ( 65..90, 97..122 );
print $pool->remove($_) for map { $pool->add($_) } @alphabet;
print "\n";
The results are returned in order, as soon as they become available.
I'm the author of Parallel::WorkUnit so I'm partial to it. And I thought adding ordered responses was actually a great idea. It does it with forks, not threads, because forks are more widely supported and they often perform better in Perl.
my $wu = Parallel::WorkUnit->new();
for my $ascii (@alphabet) {
    $wu->async(sub { return chr($ascii); });
}
@output = $wu->waitall();
If you want to limit the number of simultaneous processes:
my $wu = Parallel::WorkUnit->new(max_children => 5);
for my $ascii (@alphabet) {
    $wu->queue(sub { return chr($ascii); });
}
@output = $wu->waitall();

How to safely use [NSTask waitUntilExit] off the main thread?

I have a multithreaded program that needs to run many executables at once and wait for their results.
I use [nstask waitUntilExit] in an NSOperationQueue that runs it on non-main thread (running NSTask on the main thread is completely out of the question).
My program randomly crashes or runs into assertion failures, and the crash stacks always point to the runloop run by waitUntilExit, which executes various callbacks and handlers, including—IMHO incorrectly—KVO and bindings updating the UI, which causes them to run on non-main thread (It's probably the problem described by Mike Ash)
How can I safely use waitUntilExit?
Is it a problem of waitUntilExit being essentially unusable, or do I need to do something special (apart from explicitly scheduling my callbacks on the main thread) when using KVO and IB bindings to prevent them from being handled on a wrong thread running waitUntilExit?
As Mike Ash points out, you just can't call waitUntilExit on a random runloop. It's convenient, but it doesn't work. You have to include "doesn't work" in your computation of "is this actually convenient?"
You can, however, use terminationHandler in 10.7+. It does not pump the runloop, so shouldn't create this problem. You can recreate waitUntilExit with something along these lines (untested; probably doesn't compile):
dispatch_group_t group = dispatch_group_create();
dispatch_group_enter(group);
// The handler receives the finished task as its argument.
task.terminationHandler = ^(NSTask *finished) { dispatch_group_leave(group); };
[task launch];
dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
// If not using ARC:
dispatch_release(group);
Hard to say without the general context of what you are doing...
In general, you can't update the interface from non-main threads. So if you observe KVO notifications from NSTasks on a non-main thread and update the UI there, that is the bug.
In that case you can fix the situation simply by calling
-[NSObject performSelectorOnMainThread:withObject:waitUntilDone:]
or similar when you want to update the UI.
But to me, a more graceful solution is this:
create a separate NSOperationQueue with maxConcurrentOperationCount = 1 (so a FIFO queue) and write a subclass of NSOperation that executes the NSTask and updates the UI through delegate methods. That way you control the number of tasks executing in the application (or you can stop all of them, and so on).
But the high-level solution for your problem, I think, would be writing a privileged helper tool. With this approach you get two main benefits: your NSTasks will be executed in a separate process, and you will have root privileges for executing your tasks.
I hope my answer covers your problem.

How does NodeJS handle multi-core concurrency?

Currently I am working on a database that is updated by another Java application, but I need a NodeJS application to provide a RESTful API for website use. To maximize the performance of the NodeJS application, it is clustered and running on a multi-core processor.
However, from my understanding, a clustered NodeJS application has its own event loop on each CPU core. If so, does that mean that with the cluster architecture, NodeJS will have to face traditional concurrency issues like in other multi-threading architectures, for example, writing to the same object which is not write-protected? Or even worse, since it is multiple processes running at the same time, not threads within a process blocked by one another...
I have been searching the Internet, but it seems nobody cares about that at all. Can anyone explain the cluster architecture of NodeJS? Thanks very much.
Add on:
Just to clarify, I am using express. It is not like running multiple instances on different ports; it is actually listening on the same port, but has one process on each CPU competing to handle requests...
The typical problem I am wondering about now is: a request to update Object A based on a given Object B (not finished), then another request to update Object A again with a given Object C (finishes before the first request)... then the result would be based on Object B rather than C, because the first request actually finishes after the second one.
This will not be a problem in a real single-threaded application, because the second one will always be executed after the first request...
The core of your question is:
NodeJS will have to face traditional concurrency issues like in other multi-threading architectures, for example, writing to the same object which is not write-protected?
The answer is that that scenario is usually not possible because node.js processes don't share memory. ObjectA, ObjectB and ObjectC in process A are different from ObjectA, ObjectB and ObjectC in process B. And since each process is single-threaded, contention cannot happen. This is the main reason you find that there are no semaphore or mutex modules shipped with node.js. Also, there are no threading modules shipped with node.js.
This also explains why "nobody cares". Because they assume it can't happen.
The problem with node.js clusters is one of caching. Because ObjectA in process A and ObjectA in process B are completely different objects, they will have completely different data. The traditional solution to this is of course not to store dynamic state in your application but to store them in the database instead (or memcache). It's also possible to implement your own cache/data synchronization scheme in your code if you want. That's how database clusters work after all.
Of course node, being a program written in C++, can be easily extended with native C/C++ addons, and there are modules on npm that implement threads, mutexes and shared memory. If you deliberately choose to go against the node.js/JavaScript design philosophy then it is your responsibility to ensure nothing goes wrong.
Additional answer:
a request to update Object A based on a given Object B (not finished), then another request to update Object A again with a given Object C (finishes before the first request)... then the result would be based on Object B rather than C, because the first request actually finishes after the second one.
This will not be a problem in a real single-threaded application, because the second one will always be executed after the first request...
First of all, let me clear up a misconception you're having, namely that this is not a problem for a real single-threaded application. Here's a single-threaded application in pseudocode:
function main () {
    timeout = FOREVER
    readFd = []
    writeFd = []
    databaseSock1 = socket(DATABASE_IP,DATABASE_PORT)
    send(databaseSock1,UPDATE_OBJECT_B)
    databaseSock2 = socket(DATABASE_IP,DATABASE_PORT)
    send(databaseSock2,UPDATE_OBJECT_C)
    push(readFd,databaseSock1)
    push(readFd,databaseSock2)
    while(1) {
        event = select(readFd,writeFd,timeout)
        if (event) {
            for (i=0; i<length(readFd); i++) {
                if (readable(readFd[i])) {
                    data = read(readFd[i])
                    if (data == OBJECT_B_UPDATED) {
                        update(objectA,objectB)
                    }
                    if (data == OBJECT_C_UPDATED) {
                        update(objectA,objectC)
                    }
                }
            }
        }
    }
}
As you can see, there are no threads in the program above, just asynchronous I/O using the select system call. The program above can easily be translated directly into single-threaded C or Java etc. (indeed, something similar to it is at the core of the JavaScript event loop).
However, if the response to UPDATE_OBJECT_C arrives before the response to UPDATE_OBJECT_B the final state would be that objectA is updated based on the value of objectB instead of objectC.
No asynchronous single-threaded program is immune to this in any language and node.js is no exception.
Note however that you don't end up in a corrupted state (though you do end up in an unexpected state). Multithreaded programs are worse off because without locks/semaphores/mutexes the call to update(objectA,objectB) can be interrupted by the call to update(objectA,objectC) and objectA will be corrupted. This is what you don't have to worry about in single-threaded apps and you won't have to worry about it in node.js.
If you need strict temporally sequential updates you still need to either wait for the first update to finish, flag the first update as invalid, or generate an error for the second update. Typically for web apps (like stackoverflow) an error would be returned (for example if you try to submit a comment while someone else has already updated the comments).
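As an illustration of the "flag the first update as invalid" option, here is a small single-threaded sketch using Python's asyncio instead of node.js. The sequence-number scheme and the names apply_update and demo are assumptions made for this example, not something node.js provides. Each request carries the order in which it was issued, and a result that arrives after a newer one has already been applied is dropped.

```python
import asyncio


async def apply_update(state, seq, value, delay):
    await asyncio.sleep(delay)  # stands in for the database round trip
    if seq < state["seq"]:
        return False  # a newer request already finished: this result is stale
    state["seq"] = seq
    state["value"] = value
    return True


async def _run():
    state = {"seq": 0, "value": None}
    # Request 1 (based on B) is issued first but finishes last.
    outcomes = await asyncio.gather(
        apply_update(state, 1, "based on B", 0.05),
        apply_update(state, 2, "based on C", 0.01),
    )
    return state, outcomes


def demo():
    return asyncio.run(_run())
```

Calling demo() leaves the state based on C, with the late result from the first request rejected. Nothing here needs a lock, because only one callback runs at a time; the guard exists purely to handle completion order.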

Scala - best API for doing work inside multiple threads

In Python, I am using a library called futures, which allows me to do my processing work with a pool of N worker processes, in a succinct and crystal-clear way:
schedulerQ = []
for ... in ...:
    workParam = ...  # arguments for call to processingFunction(workParam)
    schedulerQ.append(workParam)
with futures.ProcessPoolExecutor(max_workers=5) as executor:  # 5 CPUs
    for retValue in executor.map(processingFunction, schedulerQ):
        print "Received result", retValue
(The processingFunction is CPU bound, so there is no point for async machinery here - this is about plain old arithmetic calculations)
I am now looking for the closest possible way to do the same thing in Scala. Notice that in Python, to avoid the GIL issues, I was using processes (hence the use of ProcessPoolExecutor instead of ThreadPoolExecutor) - and the library automagically marshals the workParam argument to each process instance executing processingFunction(workParam) - and it marshals the result back to the main process, for the executor's map loop to consume.
Does this apply to Scala and the JVM? My processingFunction can, in principle, be executed from threads too (there's no global state at all) - but I'd be interested to see solutions for both multiprocessing and multithreading.
The key part of the question is whether there is anything in the world of the JVM with as clear an API as the Python futures you see above... I think this is one of the best SMP APIs I've ever seen - prepare a list with the function arguments of all invocations, and then just two lines: create the poolExecutor, and map the processing function, getting back your results as soon as they are produced by the workers. Results start coming in as soon as the first invocation of processingFunction returns and keep coming until they are all done - at which point the for loop ends.
You have way less boilerplate than that using parallel collections in Scala.
myParameters.par.map(x => f(x))
will do the trick if you want the default number of threads (same as number of cores).
If you insist on setting the number of workers, you can like so:
import scala.collection.parallel._
import scala.concurrent.forkjoin._  // on Scala 2.12 and later, use java.util.concurrent.ForkJoinPool instead
val temp = myParameters.par
temp.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(5))
temp.map(x => f(x))
The exact details of return timing are different, but you can put as much machinery as you want into f(x) (i.e. both compute and do something with the result), so this may satisfy your needs.
In general, simply having the results appear as completed is not enough; you then need to process them, maybe fork them, collect them, etc. If you want to do this in general, Akka Streams (follow links from here) are nearing 1.0 and will facilitate the production of complex graphs of parallel processing.
There is both a Futures API that allows you to run work units on a thread pool (docs: http://docs.scala-lang.org/overviews/core/futures.html) and a parallel collections API that you can use to perform parallel operations on collections: http://docs.scala-lang.org/overviews/parallel-collections/overview.html

multithreading: how to process data in a vector, while the vector is being populated?

I have a single-threaded linux app which I would like to make parallel. It reads a data file, creates objects, and places them in a vector. Then it calls a compute-intensive method (.5 second+) on each object. I want to call the method in parallel with object creation. While I've looked at qt and tbb, I am open to other options.
I planned to start the thread(s) while the vector was empty. Each one would call makeSolids (below), which has a while loop that would run until interpDone==true and all objects in the vector have been processed. However, I'm a n00b when it comes to threading, and I've been looking for a ready-made solution.
QtConcurrent::map(Iter begin,Iter end,function()) looks very easy, but I can't use it on a vector that's changing in size, can I? And how would I tell it to wait for more data?
I also looked at intel's tbb, but it looked like my main thread would halt if I used parallel_for or parallel_while. That stinks, since their memory manager was recommended (open cascade's mmgt has poor performance when multithreaded).
/**intended to be called by a thread
\param start the first item to get from the vector
\param incr how many to skip over (4 for 4 threads)
*/
void g2m::makeSolids(uint start, uint incr) {
    uint curr = start;
    while ((!interpDone) || (lineVector.size() > curr)) {
        if (lineVector.size() > curr) {
            if (lineVector[curr]->isMotion()) {
                ((canonMotion*)lineVector[curr])->setSolidMode(SWEPT);
                ((canonMotion*)lineVector[curr])->computeSolid();
            }
            lineVector[curr]->setDispMode(BEST);
            lineVector[curr]->display();
            curr += incr;
        } else {
            uio::sleep(); //wait a little bit for interp
        }
    }
}
EDIT: To summarize, what's the simplest way to process a vector at the same time that the main thread is populating the vector?
Firstly, to benefit from threading you need to find similarly slow tasks for each thread to do. You said your per-object processing takes .5s+; how long does your file reading / object creation take? It could easily be a tenth or a thousandth of that time, in which case your multithreading approach is going to produce negligible benefit. If that's the case (yes, I'll answer your original question soon in case it's not), then think about simultaneously processing multiple objects. Given your processing takes quite a while, the thread creation overhead isn't terribly significant, so you could simply have your main file reading/object creation thread spawn a new thread and direct it at the newly created object. The main thread then continues reading/creating subsequent objects. Once all objects are read/created, and all the processing threads launched, the main thread "joins" (waits for) the worker threads. If this will create too many threads (thousands), then put a limit on how far ahead the main thread is allowed to get: it might read/create 10 objects then join 5, then read/create 10, join 10, read/create 10, join 10 etc. until finished.
Now, if you really want the read/create to be in parallel with the processing, but the processing to be serialised, then you can still use the above approach but join after each object. That's kind of weird if you're designing this with only this approach in mind, but good because you can easily experiment with the object processing parallelism above as well.
Alternatively, you can use a more complex approach that just involves the main thread (that the OS creates when your program starts), and a single worker thread that the main thread must start. They should be coordinated using a mutex (a variable ensuring mutually exclusive, which means non-concurrent, access to data), and a condition variable which allows the worker thread to efficiently block until the main thread has provided more work. The terms mutex and condition variable are the standard terms in the POSIX threading that Linux uses, so they should appear in the documentation of the particular libraries you're interested in. In summary, the worker thread waits until the main read/create thread broadcasts it a wake-up signal indicating another object is ready for processing. You may want to have a counter with the index of the last fully created, ready-for-processing object, so the worker thread can maintain its count of processed objects and move along the ready ones before once again checking the condition variable.
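A minimal sketch of that mutex-plus-condition-variable arrangement, written in Python rather than C++ to keep it short. The names process_while_populating, build, and compute are hypothetical stand-ins for the object constructor and the slow method. The worker keeps its own index of the next object to process and only sleeps when it has caught up with the main thread.

```python
import threading


def process_while_populating(raw_items, build, compute):
    items = []                    # the shared vector, appended to by the main thread
    done = False                  # set once the main thread has created everything
    cond = threading.Condition()  # a mutex and a condition variable in one object
    results = []

    def worker():
        next_index = 0
        while True:
            with cond:
                # Sleep until there is an unprocessed object or input has ended.
                while next_index >= len(items) and not done:
                    cond.wait()
                if next_index >= len(items):
                    return  # input ended and everything has been processed
                item = items[next_index]
            results.append(compute(item))  # slow work happens outside the lock
            next_index += 1

    thread = threading.Thread(target=worker)
    thread.start()
    for raw in raw_items:
        obj = build(raw)          # the fast read/create step
        with cond:
            items.append(obj)
            cond.notify()         # wake the worker: another object is ready
    with cond:
        done = True
        cond.notify()
    thread.join()
    return results
```

Holding the lock only while touching the shared list means the main thread is never blocked for the duration of a slow compute call.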
It's hard to tell if you have been thinking about this problem deeply and there is more than you are letting on, or if you are just over thinking it, or if you are just wary of threading.
Reading the file and creating the objects is fast; the one method is slow. The dependency is each consecutive ctor depends on the outcome of the previous ctor - a little odd - but otherwise there are no data integrity issues so there doesn't seem to be anything that needs to be protected by mutexes and such.
Why is this more complicated than something like this (in crude pseudo-code):
while (! eof)
{
    readfile;
    object O(data);
    push_back(O);
    pthread_create(...., O, makeSolid);
}
while(x < vector.size())
{
    pthread_join();
    x++;
}
If you don't want to loop on the joins in your main then spawn off a thread to wait on them by passing a vector of TIDs.
If the number of created objects/threads is insane, use a thread pool. Or put a counter in the creation loop to limit the number of threads that can be created before running ones are joined.
@Caleb: quite -- perhaps I should have emphasized active threads. The GUI thread should always be considered one.
