Multithreading with MQ

I'm having problems using the MQSeries Perl module in a multi-threading environment. Here's what I have tried:
Create two handles in different threads with $mqMgr = MQSeries::QueueManager->new(). I thought this would give me two different connections to MQ, but instead I got return code 2219 on the second call to MQOPEN(), which probably means I got the same underlying connection to MQ from two separate calls to the new() method.
Declare only one $mqMgr as a global shared variable. But I can't assign a reference to an MQSeries::QueueManager object to $mqMgr; the error is "Type of arg 1 to threads::shared::share must be one of [$@%] (not subroutine entry)".
Declare only one $mqMgr as a plain global variable. Got the same 2219 code.
Tried to pass MQCNO_HANDLE_SHARE_NO_BLOCK into MQSeries::QueueManager->new(), so that a single connection can be shared across threads, but I cannot find a way to pass it in.
My question is: with the Perl module MQSeries,
How (if at all) can I get a separate connection to the MQ queue manager from each thread?
How (if at all) can I share a connection to the MQ queue manager across different threads?
I have looked around with little luck. Any info would be appreciated.
related question:
C++ - MQ RC Code 2219
Update 1: here is an example in which two local MQSeries::QueueManager objects in two threads cause MQ error code 2219.
use threads;
use Thread::Queue;

use MQSeries;
use MQSeries::QueueManager;
use MQSeries::Queue;

# globals
our $jobQ    = Thread::Queue->new();
our $resultQ = Thread::Queue->new();

# ----------------------------------------------------------------------------
# sub routines
# ----------------------------------------------------------------------------
sub worker {
    # fetch work from $jobQ and put result to $resultQ
    # ...
}

sub monitor {
    # fetch result from $resultQ and put it onto another MQ queue
    my $mqQMgr = MQSeries::QueueManager->new( ... );

    # different queue from the one in main
    # this would cause error with MQ code 2219
    my $mqQ = MQSeries::Queue->new( ... );

    while (defined(my $result = $resultQ->dequeue())) {
        # create an MQ message and put it into $mqQ
        my $mqMsg = MQSeries::Message->new();
        $mqQ->put($mqMsg);
    }
}

# main
unless (caller()) {
    # create connection to MQ
    my $mqQMgr = MQSeries::QueueManager->new( ... );
    my $mqQ    = MQSeries::Queue->new( ... );

    # create worker and monitor threads
    my @workers;
    for (1 .. $nThreads) {
        push(@workers, threads->create('worker'));
    }
    my $monitor = threads->create('monitor');

    while (1) {
        my $mqMsg = MQSeries::Message->new();
        my $retCode = $mqQ->get(
            Message       => $mqMsg,
            GetMsgOptions => $someOption,
            Wait          => $sometime
        );
        die("error") if ($retCode == 0);
        next if ($retCode == -1);    # no message

        # now we have some job to do
        $jobQ->enqueue($mqMsg->Data);
    }
}

There is a very real danger, when trying to multithread with modules, that the module is not thread safe. A bunch of things can break messily because of the way threading works - you clone the current process state, and that includes things like file handles, sockets, etc.
If you try to use those in an asynchronous/threaded way, they'll act really weird, because the operations aren't (necessarily) atomic.
So whilst I can't answer your question directly, because I have no experience of the particular module:
Unless you know otherwise, assume you can't share between threads. It might be thread safe, it might not. If it isn't, it might still look fine - until one day you get a horrifically difficult-to-find bug caused by a race condition under concurrent load.
A shared scalar/list is explicitly described in threads::shared as basically safe (and even then, you can still have problems with non-atomicity if you're not locking).
I would suggest therefore that what you need to do is either:
have a 'comms' thread that does all the work related to the module, and make the other threads use IPC to talk to it. Thread::Queue can work nicely for this (see the rough sketch after this list).
treat each thread as entirely separate for the purposes of the module. That includes loading it (with require and import - not use, because that acts earlier) and instantiating. (You might get away with 'loading' the module before threads start, but instantiating does things like creating descriptors, sockets, etc.)
lock stuff when there's any danger of interruption of an atomic operation.
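A rough Perl sketch of that 'comms' thread option (untested; it assumes MQSeries::Message->new accepts a Data argument, and the connection parameters are elided as in your code):
use threads;
use Thread::Queue;

my $outQ = Thread::Queue->new();

# the only thread that ever touches MQSeries
my $comms = threads->create(sub {
    require MQSeries::QueueManager;    # loaded after thread start, not via use
    require MQSeries::Queue;
    require MQSeries::Message;

    my $qmgr = MQSeries::QueueManager->new( ... );
    my $q    = MQSeries::Queue->new( ... );

    while (defined(my $data = $outQ->dequeue())) {
        $q->put(MQSeries::Message->new(Data => $data));
    }
});

# any other thread just enqueues work for it
$outQ->enqueue("some payload");
$outQ->enqueue(undef);    # undef tells the comms thread to finish
$comms->join();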
Much of the above also applies to fork parallelism too - but not in quite the same way, as fork makes "sharing" stuff considerably harder, so you're less likely to trip over it.
Edit:
Looking at the code you've posted, and crossreferencing against the MQSeries source:
There is a BEGIN block that sets up some state in MQSeries at the point at which you use it.
Whilst I can't say for sure that this is your problem, it makes me very wary - because when your threads start, they inherit non-shared copies of whatever that BEGIN block did.
So in light of what I suggested earlier on - I would recommend you try (because I can't say for sure, as I don't have a reference implementation):
require MQSeries;
MQSeries->import;
Put this in your code - in lieu of use - after thread start, i.e. after you create the threads and within the thread subroutine.
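For instance, a minimal sketch of your monitor() sub rewritten along those lines (untested guesswork on my part; worker() would get the same treatment, and the connection parameters stay elided):
sub monitor {
    # load the MQ modules inside the thread, after it has started
    require MQSeries;
    MQSeries->import;
    require MQSeries::QueueManager;
    require MQSeries::Queue;
    require MQSeries::Message;

    my $mqQMgr = MQSeries::QueueManager->new( ... );
    my $mqQ    = MQSeries::Queue->new( ... );

    while (defined(my $result = $resultQ->dequeue())) {
        my $mqMsg = MQSeries::Message->new();
        $mqQ->put($mqMsg);
    }
}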

Related

Concurrent read write

Below is my code, in which two variables are updated in a callback that is called every 50ms. There is also a reader thread that wakes up every 50ms and reads the variables.
After going through this, I guessed that there would be some instances where the reader thread wakes up just as a callback is being handled, and since I am not locking the same mutex while reading and writing, this should result in inconsistent memory being read.
However, when I run it, this scenario never occurs. Is it that I have not run it for long enough, or is there a mistake in my understanding?
recursive_mutex mutex1
recursive_mutex mutex2
var1, var2

// called every 50ms
onUpdateListener() {
    lock(mutex1)
    update var1
    update var2
}

VarReaderThread::readVar() {
    sleep(50)
    while (true) {
        {
            lock(mutex2)
            read var1
            read var2
            sleep(50)
        }
    }
}
A data race occurs when two or more threads access shared data and at least one of the accesses is a write; one or more threads may then observe an inconsistent state.
The code provided has two mutexes, one used by the reader and one by the writer.
That will in no way exclude a thread from reading the shared data (var1, var2 in the example) while another is writing to it - or, more precisely, after the writer has started writing one variable and before it has finished writing the other, which is exactly the intermediate state that counts as inconsistent within the logic and purpose of the code.
A single mutex, taken by both the reader and the writer, is required to make reading and writing mutually exclusive.
As an aside, it's not clear why a recursive mutex has been declared. That is only required if a given thread may need to acquire a mutex it already holds; it's better to design that situation out where possible.
In the corrected version below, explicit unlock steps have been introduced, which may or may not need to be explicit depending on the coding model (required in C; use std::lock_guard in C++, or synchronized or finally in Java, etc.).
A termination condition has also been introduced; while(true) is bad practice unless some other condition terminates the loop gracefully.
More sophisticated models may use a 'shared mutex', which more than one thread may hold in 'read' mode but only one thread may hold in 'write' mode. That may or may not require some signalling to make sure a writer doesn't end up in a live-lock where many readers endlessly block 'write' access.
mutex mutex_v1_v2   // the mutex controlling shared state var1, var2
var1, var2

// called every 50ms
onUpdateListener() {
    lock(mutex_v1_v2)
    update var1
    update var2
    unlock(mutex_v1_v2)
}

VarReaderThread::readVar() {
    sleep(50)
    while (!finished) {
        lock(mutex_v1_v2)
        read var1
        read var2
        unlock(mutex_v1_v2)
        sleep(50)   // very important to sleep while NOT holding the mutex!
    }
}
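For reference, here is roughly what that single-mutex version looks like in actual C++ (a minimal sketch only; the variable types, the values written, and the finished flag are illustrative, not taken from the question):
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

std::mutex mutex_v1_v2;            // guards var1 and var2 together
int var1 = 0, var2 = 0;
std::atomic<bool> finished{false};

void onUpdateListener(int v) {     // called every 50ms by the producer
    std::lock_guard<std::mutex> lock(mutex_v1_v2);
    var1 = v;
    var2 = v * 2;
}

void readVar() {
    while (!finished) {
        {
            std::lock_guard<std::mutex> lock(mutex_v1_v2);
            // read var1 and var2 here; they are always a consistent pair
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(50));  // sleep outside the lock
    }
}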

Implement API to wait for d3d10 commands to finish

I am implementing some functionality which requires an API to wait for d3d10 to finish rendering. Basically, I am trying to implement synchronized access to a shared texture so that I can update the texture after d3d10 has successfully presented to the back buffer. By calling this black-box API I think this can be achieved, and I think it will be something similar to glFinish(). I have read that we can use ID3D10Query queries to implement the synchronization.
D3D10_QUERY_DESC queryDesc;
... // Fill out queryDesc structure
ID3D10Query * pQuery;
pDevice->CreateQuery( &queryDesc, &pQuery );
pQuery->Begin();
... // Issue graphics commands, do whatever
pQuery->End();
UINT64 queryData; // This data type is different depending on the query type
while( S_OK != pQuery->GetData( &queryData, sizeof( UINT64 ), 0 ) )
{
}
Should I put some dummy command between Begin and End? I want to expose this functionality as a public API, something named waitForGraphicsCompletion.
What would a suitable dummy command be here?
If you are trying to synchronize CPU and GPU execution in OpenGL, you would use glFenceSync followed by glClientWaitSync. The equivalents in Direct3D 10 are ID3D10Asynchronous::End and ID3D10Asynchronous::GetData (note that in DX11 the interfaces are slightly different). These let you know when the GPU has finished processing the command buffer up to a particular point, which means previous read/write operations on a resource have completed and the CPU can safely access the resource without additional synchronization.
You are not required to put any commands in the while loop. The command buffer will eventually process your query and GetData will return S_OK (or an error, which you might want to handle). However, this is somewhat wasteful, as the CPU will just spin waiting for the GPU, so if possible you should do some extra useful work within the loop.
Note: if you use D3D10_ASYNC_GETDATA_DONOTFLUSH as the final parameter to GetData (instead of 0), the above would not be the case - there is no guarantee that the command buffer would 'automatically' kick off, and you could end up in an infinite loop (as such, it is not the recommended usage).
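As an illustration (a sketch only, not tested against your setup), a waitForGraphicsCompletion-style helper can be built on an event query, which - if I recall the API correctly - needs no Begin() call and no dummy commands; pDevice is assumed to be a valid ID3D10Device*:
#include <d3d10.h>

void waitForGraphicsCompletion(ID3D10Device* pDevice)
{
    D3D10_QUERY_DESC queryDesc = {};
    queryDesc.Query = D3D10_QUERY_EVENT;    // signalled once all prior commands complete

    ID3D10Query* pQuery = NULL;
    if (FAILED(pDevice->CreateQuery(&queryDesc, &pQuery)))
        return;

    pQuery->End();                          // event queries need only End()

    BOOL done = FALSE;
    // Passing 0 (not DONOTFLUSH) means the command buffer gets flushed,
    // so this loop is guaranteed to terminate eventually.
    while (pQuery->GetData(&done, sizeof(done), 0) != S_OK)
    {
        // spin, or better: do some useful CPU work here
    }

    pQuery->Release();
}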

What multithreading package for Lua "just works" as shipped?

Coding in Lua, I have a triply nested loop that goes through 6000 iterations. All 6000 iterations are independent and can easily be parallelized. What threads package for Lua compiles out of the box and gets decent parallel speedups on four or more cores?
Here's what I know so far:
luaproc comes from the core Lua team, but the software bundle on LuaForge is old, and the mailing list has reports of it segfaulting. Also, it's not obvious to me how to use the scalar message-passing model to ultimately get results back into a parent thread.
Lua Lanes makes interesting claims but seems to be a heavyweight, complex solution. Many messages on the mailing list report trouble getting Lua Lanes to build or work, and I myself have had trouble getting the underlying LuaRocks distribution mechanism to work.
LuaThread requires explicit locking and requires that communication between threads be mediated by global variables that are protected by locks. I could imagine worse, but I'd be happier with a higher level of abstraction.
Concurrent Lua provides an attractive message-passing model similar to Erlang, but it says that processes do not share memory. It is not clear whether spawn actually works with any Lua function or whether there are restrictions.
Russ Cox proposed an occasional threading model that works only for C threads. Not useful for me.
I will upvote all answers that report on actual experience with these or any other multithreading package, or any answer that provides new information.
For reference, here is the loop I would like to parallelize:
for tid, tests in pairs(tests) do
  local results = { }
  matrix[tid] = results
  for i, test in pairs(tests) do
    if test.valid then
      results[i] = { }
      local results = results[i]
      for sid, bin in pairs(binaries) do
        local outcome, witness = run_test(test, bin)
        results[sid] = { outcome = outcome, witness = witness }
      end
    end
  end
end
The run_test function is passed in as an argument, so a package can be useful to me only if it can run arbitrary functions in parallel. My goal is enough parallelism to get 100% CPU utilization on 6 to 8 cores.
Norman wrote concerning luaproc:
"it's not obvious to me how to use the scalar message-passing model to get results ultimately into a parent thread"
I had the same problem with a use case I was dealing with. I liked luaproc due to its simple and light implementation, but my use case had C code calling Lua, which triggered a coroutine that needed to send/receive messages to interact with other luaproc threads.
To achieve my desired functionality I had to add features to luaproc to allow sending and receiving messages from the parent thread, or from any other thread not running under the luaproc scheduler. Additionally, my changes allow using luaproc send/receive from coroutines created in Lua states created by luaproc.newproc().
I added an additional luaproc.addproc() function to the API, which is to be called from any Lua state running in a context not controlled by the luaproc scheduler, in order to register itself with luaproc for sending/receiving messages.
I am considering posting the source as a new github project or contacting the developers and seeing if they would like to pull my additions. Suggestions as to how I should make it available to others are welcome.
Check the threads library in the Torch family. It implements a thread-pool model: a few true threads (pthreads on Linux, Windows threads on Win32) are created first. Each thread has its own lua_State object and a blocking job queue that admits jobs added from the main thread.
Lua objects are copied over from the main thread to the job thread. However, C objects such as Torch tensors or tds data structures can be passed to job threads via pointers - this is how a limited form of shared memory is achieved.
This is a perfect example of MapReduce
You can use LuaRings to accomplish your parallelization needs.
Concurrent Lua might seem like the way to go, but as I note in my updates below, it doesn't run things in parallel. The approach I tried was to spawn several processes that execute pickled closures received through the message queue.
Update
Concurrent Lua seems to handle first-class functions and closures without a hitch. See the following example program.
require 'concurrent'

local NUM_WORKERS = 4      -- number of worker threads to use
local NUM_WORKITEMS = 100  -- number of work items for processing

-- calls the received function in the local thread context
function worker(pid)
  while true do
    -- request new work
    concurrent.send(pid, { pid = concurrent.self() })
    local msg = concurrent.receive()
    -- exit when instructed
    if msg.exit then return end
    -- otherwise, run the provided function
    msg.work()
  end
end

-- creates workers, produces all the work and performs shutdown
function tasker()
  local pid = concurrent.self()
  -- create the worker threads
  for i = 1, NUM_WORKERS do concurrent.spawn(worker, pid) end
  -- provide work to threads as requests are received
  for i = 1, NUM_WORKITEMS do
    local msg = concurrent.receive()
    -- send the work as a closure
    concurrent.send(msg.pid, { work = function() print(i) end, pid = pid })
  end
  -- shutdown the threads as they complete
  for i = 1, NUM_WORKERS do
    local msg = concurrent.receive()
    concurrent.send(msg.pid, { exit = true })
  end
end

-- create the task process
local pid = concurrent.spawn(tasker)
-- run the event loop until all threads terminate
concurrent.loop()
Update 2
Scratch all of that stuff above. Something didn't look right when I was testing this. It turns out that Concurrent Lua isn't concurrent at all. The "processes" are implemented with coroutines and all run cooperatively in the same thread context. That's what we get for not reading carefully!
So, at least I eliminated one of the options I guess. :(
I realize that this is not a works-out-of-the-box solution, but, maybe go old-school and play with forks? (Assuming you're on a POSIX system.)
What I would have done:
Right before your loop, put all tests in a queue, accessible between processes. (A file, a Redis LIST or anything else you like most.)
Also before the loop, spawn several forks with lua-posix (as many as the number of cores, or even more depending on the nature of the tests). In the parent fork, wait until all children have quit.
In each fork in a loop, get a test from the queue, execute it, put results somewhere. (To a file, to a Redis LIST, anywhere else you like.) If there are no more tests in queue, quit.
In the parent fetch and process all test results as you do now.
This assumes that test parameters and results are serializable. But even if they are not, I think that it should be rather easy to cheat around that.
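A minimal sketch of that plan, assuming luaposix (posix.fork / posix.wait) and queue helpers (pop_test, push_result) that you would provide yourself on top of files or Redis:
local posix = require 'posix'

local ncores = 8
for _ = 1, ncores do
  local pid = posix.fork()
  if pid == 0 then                  -- child: drain the shared queue, then exit
    for test in pop_test do
      push_result(run_test(test))
    end
    os.exit(0)
  end
end

for _ = 1, ncores do                -- parent: wait for every child to finish
  posix.wait()
end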
I've now built a parallel application using luaproc. Here are some misconceptions that kept me from adopting it sooner, and how to work around them.
Once the parallel threads are launched, as far as I can tell there is no way for them to communicate back to the parent. This property was the big block for me. Eventually I realized the way forward: when it's done forking threads, the parent stops and waits. The job that would have been done by the parent should instead be done by a child thread, which should be dedicated to that job. Not a great model, but it works.
Communication between parent and children is very limited. The parent can communicate only scalar values: strings, Booleans, and numbers. If the parent wants to communicate more complex values, like tables and functions, it must code them as strings. Such coding can take place inline in the program, or (especially) functions can be parked into the filesystem and loaded into the child using require.
The children inherit nothing of the parent's environment. In particular, they don't inherit package.path or package.cpath. I had to work around this by the way I wrote the code for the children.
The most convenient way to communicate from parent to child is to define the child as a function, and to have the child capture parental information in its free variables, known in Lua parlance as "upvalues." These free variables may not be global variables, and they must be scalars. Still, it's a decent model. Here's an example:
local function spawner(N, workers)
  return function()
    local luaproc = require 'luaproc'
    for i = 1, N do
      luaproc.send('source', i)
    end
    for i = 1, workers do
      luaproc.send('source', nil)
    end
  end
end
This code is used as, e.g.,
assert(luaproc.newproc(spawner(randoms, workers)))
This call is how values randoms and workers are communicated from parent to child.
The assertion is essential here, as if you forget the rules and accidentally capture a table or a local function, luaproc.newproc will fail.
Once I understood these properties, luaproc did indeed work "out of the box", when downloaded from askyrme on github.
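To make the shape of the whole thing concrete, here is a rough sketch of the parent-side driver this implies, assuming the askyrme luaproc API (setnumworkers, newchannel, newproc, wait); worker_body and collector_body stand in for functions you would write yourself:
local luaproc = require 'luaproc'

luaproc.setnumworkers(workers)            -- number of OS-level threads
assert(luaproc.newchannel('source'))      -- work items flow through here
assert(luaproc.newchannel('results'))     -- scalar results flow back here

assert(luaproc.newproc(spawner(randoms, workers)))
for i = 1, workers do
  assert(luaproc.newproc(worker_body))    -- receives from 'source', sends to 'results'
end
assert(luaproc.newproc(collector_body))   -- dedicated child that gathers the results

luaproc.wait()                            -- the parent just stops and waits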
ETA: There is an annoying limitation: in some circumstances, calling fread() in one thread can prevent other threads from being scheduled. In particular, if I run the sequence
local file = io.popen(command, 'r')
local result = file:read '*a'
file:close()
return result
the read operation blocks all other threads. I don't know why this is - I assume it is some nonsense going on within glibc. The workaround I used was to call read(2) directly, which required a little glue code, but it works properly together with io.popen and file:close().
There's one other limitation worth noting:
Unlike Tony Hoare's original conception of communicating sequential processes, and unlike most mature, serious implementations of synchronous message passing, luaproc does not allow a receiver to block on multiple channels simultaneously. This limitation is serious, and it rules out many of the design patterns that synchronous message passing is good at, but it's still fine for many simple models of parallelism, especially the "parbegin" sort that I needed to solve for my original problem.

multithreading: how to process data in a vector, while the vector is being populated?

I have a single-threaded linux app which I would like to make parallel. It reads a data file, creates objects, and places them in a vector. Then it calls a compute-intensive method (.5 second+) on each object. I want to call the method in parallel with object creation. While I've looked at qt and tbb, I am open to other options.
I planned to start the thread(s) while the vector was empty. Each one would call makeSolids (below), which has a while loop that would run until interpDone==true and all objects in the vector have been processed. However, I'm a n00b when it comes to threading, and I've been looking for a ready-made solution.
QtConcurrent::map(Iter begin,Iter end,function()) looks very easy, but I can't use it on a vector that's changing in size, can I? And how would I tell it to wait for more data?
I also looked at intel's tbb, but it looked like my main thread would halt if I used parallel_for or parallel_while. That stinks, since their memory manager was recommended (open cascade's mmgt has poor performance when multithreaded).
/** intended to be called by a thread
    \param start the first item to get from the vector
    \param incr  how many to skip over (4 for 4 threads)
*/
void g2m::makeSolids(uint start, uint incr) {
  uint curr = start;
  while ((!interpDone) || (lineVector.size() > curr)) {
    if (lineVector.size() > curr) {
      if (lineVector[curr]->isMotion()) {
        ((canonMotion*)lineVector[curr])->setSolidMode(SWEPT);
        ((canonMotion*)lineVector[curr])->computeSolid();
      }
      lineVector[curr]->setDispMode(BEST);
      lineVector[curr]->display();
      curr += incr;
    } else {
      uio::sleep();  // wait a little bit for interp
    }
  }
}
EDIT: To summarize, what's the simplest way to process a vector at the same time that the main thread is populating the vector?
Firstly, to benefit from threading you need to find similarly slow tasks for each thread to do. You said your per-object processing takes .5s+; how long does your file reading / object creation take? It could easily be a tenth or a thousandth of that time, in which case your multithreading approach would produce negligible benefit. If that's the case (yes, I'll answer your original question soon in case it's not), then think about processing multiple objects simultaneously. Given that your processing takes quite a while, the thread-creation overhead isn't terribly significant, so you could simply have your main file-reading/object-creation thread spawn a new thread and point it at the newly created object. The main thread then continues reading/creating subsequent objects. Once all objects are read/created and all the processing threads launched, the main thread "joins" (waits for) the worker threads. If this would create too many threads (thousands), put a limit on how far ahead the main thread is allowed to get: it might read/create 10 objects then join 5, then read/create 10, join 10, read/create 10, join 10, etc. until finished.
Now, if you really want the read/create to be in parallel with the processing, but the processing to be serialised, then you can still use the above approach but join after each object. That's kind of weird if you're designing this with only this approach in mind, but good because you can easily experiment with the object processing parallelism above as well.
Alternatively, you can use a more complex approach that involves just the main thread (the one the OS creates when your program starts) and a single worker thread that the main thread must start. They should be coordinated using a mutex (a variable ensuring mutually exclusive - that is, not concurrent - access to data) and a condition variable, which allows the worker thread to block efficiently until the main thread has provided more work. The terms mutex and condition variable are the standard terms in the POSIX threading that Linux uses, so they should appear in the documentation of whichever library you're interested in. In short, the worker thread waits until the main read/create thread broadcasts a wake-up signal indicating another object is ready for processing. You may want a counter holding the index of the last fully created, ready-for-processing object, so the worker thread can maintain its own count of processed objects and work through the ready ones before once again checking the condition variable.
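A rough C++11 sketch of that mutex/condition-variable coordination (Object, moreInput() and createNextObject() are placeholders for your own code, not taken from the question):
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

struct Object { void computeSolid() {} };   // stand-in for the real class
Object* createNextObject();                 // stand-in: read file, build object
bool moreInput();                           // stand-in: more data to read?

std::vector<Object*> lineVector;            // shared between producer and worker
std::mutex mtx;
std::condition_variable cv;
bool interpDone = false;

void producer() {                           // the main read-file / create-object loop
    while (moreInput()) {
        Object* obj = createNextObject();
        {
            std::lock_guard<std::mutex> lock(mtx);
            lineVector.push_back(obj);
        }
        cv.notify_one();                    // wake the worker: another object is ready
    }
    {
        std::lock_guard<std::mutex> lock(mtx);
        interpDone = true;
    }
    cv.notify_all();
}

void worker() {
    size_t next = 0;
    std::unique_lock<std::mutex> lock(mtx);
    while (true) {
        cv.wait(lock, [&] { return interpDone || lineVector.size() > next; });
        if (next >= lineVector.size()) break;   // interpDone and nothing left
        Object* obj = lineVector[next++];
        lock.unlock();                      // do the slow work without holding the lock
        obj->computeSolid();
        lock.lock();
    }
}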
It's hard to tell if you have been thinking about this problem deeply and there is more than you are letting on, or if you are just over thinking it, or if you are just wary of threading.
Reading the file and creating the objects is fast; the one method is slow. The dependency is that each consecutive ctor depends on the outcome of the previous ctor - a little odd - but otherwise there are no data-integrity issues, so there doesn't seem to be anything that needs to be protected by mutexes and such.
Why is this more complicated than something like this (in crude pseudo-code):
while (!eof)
{
    readfile;
    object O(data);
    push_back(O);
    pthread_create(...., O, makeSolid);
}

while (x < vector.size())
{
    pthread_join();
    x++;
}
If you don't want to loop on the joins in your main then spawn off a thread to wait on them by passing a vector of TIDs.
If the number of created objects/threads is insane, use a thread pool, or put a counter in the creation loop to limit the number of threads that can be created before running ones are joined.
@Caleb: quite - perhaps I should have emphasized active threads. The GUI thread should always be considered one.

Interview Question on .NET Threading

Could you describe two methods of synchronizing multi-threaded write access performed on a class member?
Could anyone help me understand what this question is getting at and what the right answer is?
When you change data in C#, something that looks like a single operation may be compiled into several instructions. Take the following class:
public class Number {
    private int a = 0;

    public void Add(int b) {
        a += b;
    }
}
When you build it, you get the following IL code:
IL_0000: nop
IL_0001: ldarg.0
IL_0002: dup
// Pushes the value of the private variable 'a' onto the stack
IL_0003: ldfld int32 Simple.Number::a
// Pushes the value of the argument 'b' onto the stack
IL_0008: ldarg.1
// Adds the top two values of the stack together
IL_0009: add
// Sets 'a' to the value on top of the stack
IL_000a: stfld int32 Simple.Number::a
IL_000f: ret
Now, say you have a Number object and two threads call its Add method like this:
number.Add(2); // Thread 1
number.Add(3); // Thread 2
If you want the result to be 5 (0 + 2 + 3), there's a problem. You don't know when these threads will execute their instructions. Both threads could execute IL_0003 (pushing zero onto the stack) before either executes IL_000a (actually changing the member variable) and you get this:
a = 0 + 2; // Thread 1
a = 0 + 3; // Thread 2
The last thread to finish 'wins' and at the end of the process, a is 2 or 3 instead of 5.
So you have to make sure that one complete set of instructions finishes before the other set. To do that, you can:
1) Lock access to the class member while it's being written, using one of the many .NET synchronization primitives (like lock, Mutex, ReaderWriterLockSlim, etc.) so that only one thread can work on it at a time.
2) Push write operations into a queue and process that queue with a single thread. As Thorarin points out, you still have to synchronize access to the queue if it isn't thread-safe, but it's worth it for complex write operations.
There are other techniques. Some (like Interlocked) are limited to particular data types, and there are even more (like the ones discussed in Non-blocking synchronization and Part 4 of Joseph Albahari's Threading in C#), though they are more complex: approach them with caution.
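As a minimal illustration of option 1 (the lock) applied to the Number class above - a sketch, not the only way to do it; the lock-object name is arbitrary:
public class SynchronizedNumber {
    private readonly object _sync = new object();
    private int a = 0;

    public void Add(int b) {
        // Only one thread at a time can run the read-modify-write sequence,
        // so concurrent Add calls can no longer lose updates.
        lock (_sync) {
            a += b;
        }
    }
}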
In multithreaded applications, there are many situations where simultaneous access to the same data can cause problems. In such cases synchronization is required to guarantee that only one thread has access at any one time.
I imagine they mean using the lock-statement (or SyncLock in VB.NET) vs. using a Monitor.
You might want to read this page for examples and an understanding of the concept. However, if you have no experience with multithreaded application design, that will likely become apparent quickly should your new employer put you to the test. It's a fairly complicated subject, with many possible pitfalls such as deadlock.
There is a decent MSDN page on the subject as well.
There may be other options, depending on the type of member variable and how it is to be changed. Incrementing an integer for example can be done with the Interlocked.Increment method.
As an exercise and demonstration of the problem, try writing an application that starts 5 simultaneous threads, each incrementing a shared counter a million times. The intended end result of the counter would be 5 million, but that is (probably) not what you will end up with :)
Edit: made a quick implementation myself (download). Sample output:
Unsynchronized counter demo:
expected counter = 5000000
actual counter = 4901600
Time taken (ms) = 67
Synchronized counter demo:
expected counter = 5000000
actual counter = 5000000
Time taken (ms) = 287
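For anyone who doesn't want to grab the download, a minimal sketch of that exercise (illustrative only, not the linked implementation) could look like this:
using System;
using System.Threading;

class CounterDemo
{
    static int counter = 0;

    static void Main()
    {
        var threads = new Thread[5];
        for (int i = 0; i < threads.Length; i++)
        {
            threads[i] = new Thread(() =>
            {
                for (int j = 0; j < 1000000; j++)
                {
                    counter++;  // unsynchronized; swap in Interlocked.Increment(ref counter) to fix
                }
            });
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();

        Console.WriteLine("expected counter = 5000000");
        Console.WriteLine("actual counter   = " + counter);
    }
}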
There are a couple of ways, several of which are mentioned previously.
ReaderWriterLockSlim is my preferred method. This gives you database-style locking and allows for upgrading (although the syntax for that was incorrect in MSDN last time I looked, and is very non-obvious).
lock statements. You treat a read like a write and just prevent concurrent access to the variable.
Interlocked operations. These perform an operation on a value type in an atomic step. They can be used for lock-free threading (I really wouldn't recommend this).
Mutexes and Semaphores (haven't used these)
Monitor statements (this is essentially how the lock keyword works)
While I don't mean to denigrate other answers, I would not trust anything that does not use one of these techniques. My apologies if I have forgotten any.
