I'm using OpenCV in a commercial application, and I don't have management approval to purchase TBB licensing, so I built OpenCV with OpenMP as its parallelism framework.
All the machine vision cameras we use as sources of real-time frames have SDKs that fill frame buffers in a circular queue and call user-supplied callbacks to process them concurrently, on threads from the SDKs' own thread pools.
This works fine when OpenMP is out of the picture: I do a bunch of (memoryless) processing on individual frames before serializing them through inter-thread buffers to the stateful processing stage, where frames must be processed in order. If it were just the concurrent frame processing, I wouldn't need OpenMP at all; however, I need to leave it enabled in OpenCV so that the in-order frame processing is accelerated as well.
My concern is how well I can expect OpenMP to work when it's used in the first phase, i.e. in the concurrently executed callbacks on threads created by the camera SDKs. Can I assume the OpenMP runtime is smart enough to use its thread pool efficiently when parallel regions are triggered from multiple externally created threads?
The platform is guaranteed to be x86-64 (VC++15 or GCC).
Situation
If I've understood the question properly, the camera library you're using will spawn a number of threads, and each one of those will call your callback function. Inside your callback you want to use OpenMP to accelerate that processing. The results of that are sent through some interthread channel to a pipeline of threads doing more processing.
If that's wrong, please ignore the rest of this answer!
Rest Of Answer
Using OpenMP in the callbacks would seem to chop the compute load of this part of your application into little pieces for not much benefit. The camera library is already overlapping the processing of frames; using OpenMP there means the processing of frames doesn't actually overlap (while the camera library still uses multiple threads as if it did).
If the processing does still overlap, then logically speaking you haven't got enough cores in your system to keep up with the overall workload anyway (assuming your use of OpenMP maxes out all cores processing a single frame)... I'm assuming that your system is successfully keeping up with the flow of frames, and therefore must have enough grunt to do so.
So I don't think it's really a question of whether OpenMP will be intelligent in its use of its thread pool; the thread pool will be dedicated to processing a single frame, and it will complete before the next frame arrives.
The non-overlapping does mean the latency is lower, which might be what you want. However, you could achieve the same thing if the camera library used a single thread, with your callback using OpenMP (and taking on the responsibility of completing before the next frame arrives). With less thread context switching going on it would even be a tiny bit quicker. So if you can stop the library spawning all those threads (maybe there's a config parameter, an environment variable, or some other part of its API for it), it might be worth it.
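If the SDK's pool size can't be reduced to one, a related mitigation (not from the answer above, just a sketch) is to give each callback thread a proportional slice of the cores. Below is a minimal, runnable illustration, not the camera SDK itself: three std::threads stand in for the SDK's pool, and the 3-thread pool size is an assumption. Note that omp_set_num_threads() affects only parallel regions started by the calling thread:

#include <omp.h>
#include <algorithm>
#include <thread>
#include <vector>

int main()
{
    const unsigned kSdkThreads = 3; // assumed size of the camera SDK's pool
    const unsigned hw = std::max(1u, std::thread::hardware_concurrency());
    const int perCallback = (int)std::max(1u, hw / kSdkThreads);

    std::vector<std::thread> sdkPool;
    for (unsigned i = 0; i < kSdkThreads; ++i)
        sdkPool.emplace_back([perCallback]
        {
            // The number-of-threads setting is per calling thread, so this
            // caps only the teams created from this (simulated) SDK thread.
            omp_set_num_threads(perCallback);
            #pragma omp parallel
            {
                // ... memoryless per-frame processing would go here ...
            }
        });
    for (auto &t : sdkPool) t.join();
}

This keeps the total number of busy OpenMP threads near the core count even while callbacks overlap.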
Here's some example code that explains a workaround I found. LibraryFunction() represents some function I can't modify that already uses OpenMP parallelization, such as something from OpenCV.
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>
#include <omp.h>
using namespace std;

void LibraryFunction(int phase, mutex &mtx)
{
    #pragma omp parallel num_threads(3)
    {
        lock_guard<mutex> l{mtx};
        cerr << "Phase: " << phase << "\tTID: " << this_thread::get_id() << "\tOMP: " << omp_get_thread_num() << endl;
    }
}
The problem of oversubscription with external threads is visible with this:
int main(void)
{
    omp_set_dynamic(thread::hardware_concurrency()); // non-zero: enable dynamic team sizing
    omp_set_nested(0);                               // disable nested parallelism
    vector<std::thread> threads;
    threads.reserve(3);
    mutex mtx;
    // Three external threads stand in for the camera SDK's pool.
    for (int i = 0; i < 3; ++i)
    {
        threads.emplace_back([&]
        {
            this_thread::sleep_for(chrono::milliseconds(200));
            LibraryFunction(1, mtx); // phase 1: concurrent callbacks
        });
    }
    for (auto &t : threads) t.join();
    cerr << endl;
    LibraryFunction(2, mtx); // phase 2: single-threaded pipeline stage
    return EXIT_SUCCESS;
}
The output is:
Phase: 1 TID: 7812 OMP: 0
Phase: 1 TID: 3928 OMP: 0
Phase: 1 TID: 2984 OMP: 0
Phase: 1 TID: 9924 OMP: 1
Phase: 1 TID: 9560 OMP: 2
Phase: 1 TID: 2576 OMP: 1
Phase: 1 TID: 5380 OMP: 2
Phase: 1 TID: 3428 OMP: 1
Phase: 1 TID: 10096 OMP: 2
Phase: 2 TID: 9948 OMP: 0
Phase: 2 TID: 10096 OMP: 1
Phase: 2 TID: 3428 OMP: 2
Phase 1 represents execution of the OpenMPed library code in the camera SDK threads, whereas phase 2 is OpenMPed library code used for later processing pipeline stages, triggered from just one thread. The problem is obvious: the number of OpenMP threads is multiplied by the number of external threads (here 3 × 3 = 9 during phase 1), resulting in oversubscription. Building OpenCV with OpenMP turned off would fix the oversubscription in phase 1, but would on the other hand leave phase 2 with no acceleration at all.
The workaround I found is that wrapping the call to LibraryFunction() during phase 1 in #pragma omp parallel sections {} suppresses the generation of threads within that particular call.
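Concretely, the change to main() above looks like this (a minimal sketch; the suppression presumably works because the region inside LibraryFunction() then counts as a nested parallel region, and nesting is disabled):

threads.emplace_back([&]
{
    this_thread::sleep_for(chrono::milliseconds(200));
    // The enclosing region turns the num_threads(3) region inside
    // LibraryFunction() into a nested one, which runs with one thread.
    #pragma omp parallel sections
    {
        LibraryFunction(1, mtx);
    }
});

Now the result is: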
Phase: 1 TID: 3168 OMP: 0
Phase: 1 TID: 8888 OMP: 0
Phase: 1 TID: 5712 OMP: 0
Phase: 2 TID: 10232 OMP: 0
Phase: 2 TID: 5012 OMP: 1
Phase: 2 TID: 4224 OMP: 2
I haven't tested this with OpenCV yet, but I expect it to work.
Related
I have 10000 tasks that I am trying to schedule with TBB across N threads. 9900 of the tasks take O(1) unit time to execute, whereas the remaining 100 take O(100)-O(1000) time. I want TBB to schedule the 100 longest tasks first, so that efficiency is maximized: if some threads finish faster, they can then run the shorter jobs at the end.
Here is a (hypothetical) example: out of 10000 tasks, one super long task takes 1111 units, the remaining 9999 tasks each take just 1 unit, and I have 10 threads. I want one thread to run the super long task in 1111 units of time, while the other 9 threads run the remaining 9999 tasks at 9999/9 = 1111 tasks each, also in 1111 units of time. That means I am using my threads at 100% efficiency (ignoring any overhead).
What I have is a function which does something like this:
bool run(Worker &worker, size_t numTasks, WorkerData &workerData) {
    xtbbTaskData<Worker, WorkerData> taskData(worker, workerData, numThreads);
    arena.execute([&] {
        tbb::parallel_for(tbb::blocked_range<size_t>(0, numTasks),
                          xtbbTask<Worker, WorkerData>(taskData));
    });
    return true; // report success
}
where I have an xtbb::arena arena created with numThreads. Worker is my worker class with the 10000 tasks, workerData is the class with the data needed to run each task, and xtbbTaskData is a template class that takes the worker and workerData and eventually provides the operator() which calls run on each task in the worker.
What syntax should I use to schedule these tasks such that the longest tasks get scheduled first? Task priorities, task arenas, enqueueing etc. are things I have come across, but I am not finding good examples of how to code this.
Do I need to create multiple arenas? Or multiple workers? Or put the longest tasks at the end of the vector in worker? Or something else?
If someone can point me to examples or templates that are already doing this, that would be great.
A task_group_context represents a group of tasks that can be canceled, or have their priority level set, together. Refer to page 378 of Pro TBB: C++ Parallel Programming with Threading Building Blocks for examples.
Priority can also be defined as an attribute of a task arena; refer to page 494 of the same book for an example.
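As a minimal sketch of the arena-priority approach (assuming a recent oneTBB; the two-arena split and the task bodies are illustrative, not taken from the question's code):

#include <tbb/task_arena.h>
#include <tbb/task_group.h>

int main()
{
    // One arena per priority level; worker threads prefer the high-priority
    // arena while it still has work, so the ~100 long tasks get picked up first.
    tbb::task_arena high(tbb::task_arena::automatic, 1, tbb::task_arena::priority::high);
    tbb::task_arena normal(tbb::task_arena::automatic, 1, tbb::task_arena::priority::normal);

    tbb::task_group longTasks, shortTasks;
    high.execute([&] {
        for (int i = 0; i < 100; ++i)
            longTasks.run([] { /* one O(100)-O(1000) task */ });
    });
    normal.execute([&] {
        for (int i = 0; i < 9900; ++i)
            shortTasks.run([] { /* one O(1) task */ });
    });

    high.execute([&] { longTasks.wait(); });
    normal.execute([&] { shortTasks.wait(); });
    return 0;
}

The question's parallel_for call could keep running unchanged inside the normal arena, with only the ~100 expensive tasks routed to the high-priority one.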
The GC time is too long in my Spark Streaming program. In the GC log, I found that someone called System.gc(). I do not call System.gc() in my code, so the caller must be one of the APIs I use.
I added -XX:+DisableExplicitGC to the JVM options and that fixed the problem. However, I want to know who calls System.gc().
I tried some methods:
Using jstack. But the GC is not frequent, so it is difficult to catch a thread dump of the thread calling the method.
Adding a trigger in JProfiler that takes a thread dump when java.lang.System.gc() is invoked. But it doesn't seem to work.
How can I find out who calls System.gc() in a Spark Streaming program?
You will not catch System.gc with jstack, because during stop-the-world pauses the JVM does not accept connections from Dynamic Attach tools, including jstack, jmap, jcmd and the like.
It's possible to trace System.gc callers with async-profiler:
Start profiling beforehand:
$ profiler.sh start -e java.lang.System.gc <pid>
After one or more System.gc calls have happened, stop profiling and print the stack traces:
$ profiler.sh stop -o traces <pid>
Example output:
--- Execution profile ---
Total samples : 6
Frame buffer usage : 0.0007%
--- 4 calls (66.67%), 4 samples
[ 0] java.lang.System.gc
[ 1] java.nio.Bits.reserveMemory
[ 2] java.nio.DirectByteBuffer.<init>
[ 3] java.nio.ByteBuffer.allocateDirect
[ 4] Allocate.main
--- 2 calls (33.33%), 2 samples
[ 0] java.lang.System.gc
[ 1] sun.misc.GC$Daemon.run
In the above example, System.gc is called 6 times from two places. Both are typical situations in which the JDK internally forces garbage collection.
The first one is java.nio.Bits.reserveMemory. When there is not enough free memory to allocate a new direct ByteBuffer (because of the -XX:MaxDirectMemorySize limit), the JDK forces a full GC to reclaim unreachable direct ByteBuffers.
The second one is the GC Daemon thread, which the Java RMI runtime calls periodically. For example, if you use remote JMX, periodic GC is automatically enabled once per hour. This can be tuned with the -Dsun.rmi.dgc.client.gcInterval system property.
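For instance, to stretch the periodic RMI-triggered GC from the default of one hour to once per day (values are in milliseconds; the spark-submit form below is just an illustration, though the properties themselves are standard JDK ones):

$ spark-submit \
    --conf "spark.driver.extraJavaOptions=-Dsun.rmi.dgc.client.gcInterval=86400000 -Dsun.rmi.dgc.server.gcInterval=86400000" \
    ...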
I have a similar issue to the one in Synchronizing timer hangs with simple setup, but with the Precise Throughput Timer, which is supposed to replace the Synchronizing Timer:
Certain cases might be solved via Synchronizing Timer, however Precise Throughput Timer has a native way to issue requests in packs. This behavior is disabled by default, and it is controlled with "Batched departures" settings
Number of threads in the batch (threads). Specifies the number of samples in a batch. Note the overall number of samples will still be in line with Target Throughput
Delay between threads in the batch (ms). For instance, if set to 42, and the batch size is 3, then threads will depart at x, x+42ms, x+84ms
I'm setting Number of Threads to 10, Ramp-up to 1, and Loop Count to 1.
I'm adding only 1 HTTP Request (response time under 1 second), preceded by a Test Action with a Precise Throughput Timer as a child, with the following setup:
Threads get stuck after 5 threads have succeeded:
EDIT 1
Following @Dimitri T's solution:
I changed Duration to 100 and added the line to the logging configuration, and got 5 errors:
2018-03-12 15:43:42,330 INFO o.a.j.t.JMeterThread: Stopping Thread: org.apache.jorphan.util.JMeterStopThreadException: The thread is scheduled to stop in -99886 ms and the throughput timer generates a delay of 20004077. JMeter (as of 4.0) does not support interrupting of sleeping threads, thus terminating the thread manually.
EDIT 2
Following @Dimitri T's solution, setting "Loop Count" to -1 executed all 10 threads, but if I change Number of threads in the batch from 2 to 5, it executes only 3 threads and then stops:
INFO o.a.j.t.JMeterThread: Stopping Thread: org.apache.jorphan.util.JMeterStopThreadException: The thread is scheduled to stop in -89233 ms and the throughput timer generates a delay of 19999450. JMeter (as of 4.0) does not support interrupting of sleeping threads, thus terminating the thread manually.
Set "Duration (seconds)" in your Thread Group to something non-zero (i.e. to 100)
Depending on what you're trying to achieve you might also want to set "Loop Count" to -1
You can also add the following line to log4j2.xml file:
<Logger name="org.apache.jmeter.timers" level="debug" />
This way you will be able to see what's going on with your timer(s) in the jmeter.log file
I have a CPU-intensive task (looping through some data and evaluating the results). I want to make use of multiple cores for this, but my performance is consistently worse than when using a single core.
I've tried:
Creating multiple processes on different ports with express and sending the tasks to these processes
Using webworker-threads to run the tasks in different threads using the thread pool
I'm measuring the results by counting the total number of iterations I can complete and dividing by the amount of time I spent working on the problem. When using a single core, my results are significantly better.
Some points of interest:
I can identify when I am just using one core and when I am using multiple cores through task manager. I am using the expected number of cores.
I have lots of ram
I've tried running on just 2 or 3 cores
I added nextTicks which doesn't seem to impact anything in this case
The tasks take several seconds each so I don't feel like I'm losing a lot to overhead
Any idea as to what is going on here?
Update for threads: I suspect a bug in webworker-threads
Skipping express for now, I think the issue may have to do with my thread loop. What I'm doing is creating threads and then trying to run them continuously, sending data back and forth between them. Even though both threads are using up CPU, only thread 0 is returning values. My assumption was that tp.any.emit would generally deliver the message to the thread that had been idle the longest, but that does not seem to be the case. My setup looks like this:
Within threadtask.js
thread.on('init', function() {
    thread.emit('ready');
    thread.on('start', function(data) {
        console.log("THREAD " + thread.id + ": execute task");
        //...
        console.log("THREAD " + thread.id + ": emit result");
        thread.emit('result', otherData);
    });
});
main.js
var tp = Threads.createPool(NUM_THREADS);
tp.load(threadtaskjsFilePath);

var readyCount = 0;
tp.on('ready', function() {
    readyCount++;
    if (readyCount == tp.totalThreads()) {
        console.log('MAIN: Sending first start event');
        tp.all.emit('start', JSON.stringify(data));
    }
});

tp.on('result', function(eresult) {
    var result = JSON.parse(eresult);
    console.log('MAIN: result from thread ' + result.threadId);
    //...
    console.log('MAIN: emit start' + result.threadId);
    tp.any.emit('start' + result.threadId, data);
});

tp.all.emit("init", JSON.stringify(data2));
The output of this disaster:
MAIN: Sending first start event
THREAD 0: execute task
THREAD 1: execute task
THREAD 1: emit result
MAIN: result from thread 1
THREAD 0: emit result
THREAD 0: execute task
THREAD 0: emit result
MAIN: result from thread 0
MAIN: result from thread 0
THREAD 0: execute task
THREAD 0: emit result
THREAD 0: execute task
THREAD 0: emit result
MAIN: result from thread 0
MAIN: result from thread 0
THREAD 0: execute task
THREAD 0: emit result
THREAD 0: execute task
THREAD 0: emit result
MAIN: result from thread 0
MAIN: result from thread 0
I also tried another approach, where I would emit to all threads but have each thread listen for a message that only it could answer, e.g. thread.on('start' + thread.id, function() { ... }). This doesn't work: when I then do tp.all.emit('start' + result.threadId, ...) in the result handler, the message doesn't get picked up.
MAIN: Sending first start event
THREAD 0: execute task
THREAD 1: execute task
THREAD 1: emit result
THREAD 0: emit result
Nothing more happens after that.
Update for multiple express servers: I'm getting improvements but smaller than expected
I revisited this solution and had more luck. I think my original measurement may have been flawed. New results:
Single process: 3.3 iterations/second
Main process + 2 servers: 4.2 iterations/second
Main process + 3 servers: 4.9 iterations/second
One thing I find a little odd is that I'm not seeing around 6 iterations/second for 2 servers and 9 for 3. I get that there are some losses for networking but if I increase my task time to be sufficiently high, the network losses should be pretty minor I would think.
You shouldn't be pushing your Node.js processes to run multiple threads for performance improvements. On a quad-core processor, having 1 express process handling general requests and 3 express processes handling the CPU-intensive requests would probably be the most effective setup, which is why I would suggest that you design your express processes to refrain from using Web workers and simply block until they produce a result. This gets you down to a single process with a single thread, as per Node's design, most likely yielding the best results.
I do not know the intricacies of how the Web workers package handles synchronization, how it affects Node.js's I/O thread pools that live in C space, and so on, but I believe you would generally introduce Web workers to manage more blocking tasks at the same time without severely affecting other requests that require no threading or system I/O, or that can otherwise be responded to expediently. That doesn't necessarily mean applying them yields improved performance for the particular tasks being performed. If you run 4 processes with 4 threads each performing I/O, you might be locking yourself into wasting time continuously switching between thread contexts outside the application space.
The MPI standard 3.0 says in Section 5.13 that:
Finally, in multithreaded implementations, one can have more than one, concurrently executing, collective communication call at a process. In these situations, it is the user's responsibility to ensure that the same communicator is not used concurrently by two different collective communication calls at the same process.
I wrote the following program, which compiles but does NOT execute correctly and dumps core:
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int required = MPI_THREAD_MULTIPLE, provided, rank, size, threadID, threadProcRank;
    MPI_Comm comm = MPI_COMM_WORLD;

    MPI_Init_thread(&argc, &argv, required, &provided);
    MPI_Comm_size(comm, &size);
    MPI_Comm_rank(comm, &rank);

    int buffer1[10000] = {0};
    int buffer2[10000] = {0};

    #pragma omp parallel private(threadID, threadProcRank) shared(comm, buffer1)
    {
        threadID = omp_get_thread_num();
        MPI_Comm_rank(comm, &threadProcRank);
        printf("\nMy thread ID is %d and I am in process ranked %d", threadID, threadProcRank);

        if (threadID == 0)
            MPI_Bcast(buffer1, 10000, MPI_INTEGER, 0, comm);
        if (threadID == 1)
            MPI_Bcast(buffer1, 10000, MPI_INTEGER, 0, comm);
    }

    MPI_Finalize();
}
My question is: the two threads in each process, with thread ID 0 and thread ID 1, each post a broadcast call, which in the root process (process 0) can be taken as an MPI_Send(). I am interpreting it as two loops of MPI_Send() whose destinations are the remaining processes. The destination processes also post MPI_Bcast() in thread ID 0 and thread ID 1; these can be taken as two MPI_Recv()s posted by each process in the two threads. Since the MPI_Bcast() calls are identical, there should be no matching problems in receiving the messages sent by process 0 (the root). But still the program does not work. Why? Is it because messages might get mixed up between different collectives on the same communicator, and since MPI (MPICH2) sees this possibility, it simply does not allow two collectives on the same communicator to be pending at the same time?
First of all, you are not checking the value of provided, in which the MPI implementation returns the thread support level it actually provides. The standard allows this level to be lower than the requested one, in which case a correct MPI application would rather do something like:
MPI_Init_thread(&argc, &argv, required, &provided);
if (provided < required)
{
    printf("Error: MPI does not provide the required thread support\n");
    MPI_Abort(MPI_COMM_WORLD, 1);
    exit(1);
}
Second, this line of code is redundant:
MPI_Comm_rank(comm, &threadProcRank);
Threads in MPI do not have separate ranks; only processes have ranks. There was a proposal to bring so-called endpoints into MPI 3.0, which would have allowed a single process to have more than one rank and to bind them to different threads, but it didn't make it into the final version of the standard.
Third, you are using the same buffer variable in both collectives. I guess your intention was to use buffer1 in the call in thread 0 and buffer2 in the call in thread 1. Also, MPI_INTEGER is the datatype that corresponds to INTEGER in Fortran; for the C int type the corresponding MPI datatype is MPI_INT.
Fourth, the interpretation of MPI_BCAST as a loop of MPI_SEND and the corresponding MPI_RECV calls is just that: an interpretation. In reality the implementation is quite different - see here. For example, with smaller messages, where the initial network setup latency is much higher than the physical data transmission time, binary and binomial trees are used in order to minimise the latency of the collective. Larger messages are usually broken into many segments, and a pipeline is then used to pass the segments from the root rank to all the others. Even in the tree distribution case the messages may still be segmented.
The catch is that in practice each collective operation is implemented using messages with the same tag, usually with negative tag values (which application programmers are not allowed to use). That means both MPI_Bcast calls in your case would use the same tags to transmit their messages, and since the ranks are the same and the communicator is the same, the messages would get all mixed up. Hence the requirement that concurrent collectives be run only on separate communicators.
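A minimal sketch of the usual fix, assuming the library really does provide MPI_THREAD_MULTIPLE: give each thread its own communicator via MPI_Comm_dup, so the two concurrent broadcasts cannot interfere (error handling mostly omitted):

#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);

    // One duplicate of MPI_COMM_WORLD per thread: collectives on different
    // communicators may legally proceed concurrently.
    MPI_Comm comms[2];
    MPI_Comm_dup(MPI_COMM_WORLD, &comms[0]);
    MPI_Comm_dup(MPI_COMM_WORLD, &comms[1]);

    int buffer1[10000] = {0}, buffer2[10000] = {0};

    #pragma omp parallel num_threads(2)
    {
        int tid = omp_get_thread_num();
        MPI_Bcast(tid == 0 ? buffer1 : buffer2, 10000, MPI_INT, 0, comms[tid]);
    }

    MPI_Comm_free(&comms[0]);
    MPI_Comm_free(&comms[1]);
    MPI_Finalize();
    return 0;
}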
There are two likely reasons why your program crashes. One is that the MPI library does not actually provide MPI_THREAD_MULTIPLE. The other is that the message gets split into two unevenly sized chunks, e.g. a larger first part and a smaller second part; the interference between both collective calls could then cause the second thread to receive the large first chunk directed at the first thread while waiting for the smaller second chunk. The result would be message truncation, and the abort MPI error handler would get called. That usually does not produce a segfault and core dump, so I would suppose that your MPICH2 is simply not compiled as thread-safe.
This is not MPICH2-specific. Open MPI and other implementations are also prone to the same limitations.