I've reduced a crash to the following toy code:
// DLLwithOMP.cpp : build into a dll *with* /openmp
#include <tchar.h>
extern "C"
{
__declspec(dllexport) void funcOMP()
{
#pragma omp parallel for
for (int i = 0; i < 100; i++)
_tprintf(_T("Please fondle my buttocks\n"));
}
}
_
// ConsoleApplication1.cpp : build into an executable *without* /openmp
#include <windows.h>
#include <stdio.h>
#include <tchar.h>
typedef void(*tDllFunc) ();
int main()
{
HMODULE hDLL = LoadLibrary(_T("DLLwithOMP.dll"));
tDllFunc pDllFunc = (tDllFunc)GetProcAddress(hDLL, "funcOMP");
pDllFunc();
FreeLibrary(hDLL);
// At this point the omp runtime vcomp140[d].dll refcount is zero
// and windows unloads it, but the omp thread team remains active.
// A crash usually ensues.
return 0;
}
Is this an MS bug? Is there some OMP thread-cleanup API I missed (probably not, but maybe)? I don't have other compilers under hand. Do they treat this scenario differently? (again, probably not) Does the OMP standard has anything to say on such a scenario?
I got an answer from Eric Brumer # MS Connect. Re-posting it here in case it is of interest to anyone in the future:
for optimal performance, the openmp threadpool spin waits for about a
second prior to shutting down in case more work becomes available. If
you unload a DLL that's in the process of spin-waiting, it will crash
in the manner you see (most of the time).
You can tell openmp not to spin-wait and the threads will immediately
block after the loop finishes. Just set OMP_WAIT_POLICY=passive in
your environment, or call SetEnvironmentVariable(L"OMP_WAIT_POLICY",
L"passive"); in your function before loading the dll. The default is
"active" which tells the threadpool to spin wait. Use the environment
variable, or just wait a few seconds before calling FreeLibrary.
Related
I have developed a distributed memory MPI application which involves processing of a grid. Now i want to apply shared memory techniques (essentially making it a hybrid - parallel program), with OpenMP, to see if it can become any faster, or more efficient. I'm having a hard time with OpenMP, especially with a nested for loop. My application involves printing the grid to the screen every half a second, but when i parallelize it with OpenMP, execution proceeds 10 times slower, or not at all. The console screen lags and refreshes itself with random / unexpected data. In other words, it is going completely wrong. Take a look at the following function, which does the printing:
void display2dGrid(char** grid, int nrows, int ncolumns, int ngen)
{
//#pragma omp parallel
updateScreen();
int y, x;
//#pragma omp parallel shared(grid) // garbage
//#pragma omp parallel private(y) // garbage output!
//#pragma omp for
for (y = 0; y < nrows; y++) {
//#pragma omp parallel shared(grid) // nothing?
//#pragma omp parallel private(x) // 10 times slower!
for (x = 0; x < ncolumns; x++) {
printf("%c ", grid[y][x]);
}
printf("\n");
}
printf("Gen #%d\n", ngen);
fflush(stdout);
}
(updateScreen() just clears the screen and writes from top left corner again.)
The function is executed by only one process, which makes it a perfect target for thread parallelization. As you can see i have tried many approaches and one is worse than the other. Best case, i get semi proper output every 2 seconds (because it refreshes very slowly). Worst case i get garbage output.
I would appreciate any help. Is there a place where i can find more information to proper parallelize loops with OpenMP? Thanks in advance.
The function is executed by only one process, which makes it a perfect target for thread parallelization.
That is actually not true. The function you are trying to parallelize is a very bad target for parallelization. The calls to printf in your example need to happen in a specific sequential order, or else, you're going to obtain a garbage result as your experienced (since the elements of your grid are going to be printed in an order that means nothing). Actually, your attempts at parallelizing were pretty good, the problem comes from the fact that the function itself is a bad target for parallelization.
Speedup when parallelizing programs comes from the fact that you are distributing workload across multiple cores. In order to be able to do that with maximum efficiency, said workloads need to be independent, or at least share state as little as possible, which is not the case here since the calls to printf need to happen in a specific order.
When you try to parallelize some work that is intrinsically sequential, you lose more time synchronizing your workers (your openmp threads), than you gain by parallizing the work itself (which is why you obtain crap time when your result gets better).
Also, as this answer (https://stackoverflow.com/a/20089967/3909725) suggests, you should not print the content of your grid at each loop (unless you are debugging), but rather perform all of your computations, and then print the content when you have finished doing what your ultimate goal is, since printing is only useful to see the result of the computation, and only slows the process.
An example :
Here is a very basic example of parallizing a program with openmp that achieves speedup. Here a dummy (yet heavy) computation is realized for each value of the i variable. The computations in each loop are completely independent, and the different threads can achieve their computations independently. The calls to printf can be achieved in whatever order since they are just informative.
Original (sequential.c)
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
int i,j;
double x=0;
for(i=0; i < 100; i++)
{
x = 100000 * fabs(cos(i*i));
for(j=0;j<100+i*20000;j++)
x += sqrt(sqrt(543*j)*fabs(sin(j)));
printf("Computed i=%2d [%g]\n",i,x);
}
}
Parallelized version (parallel.c)
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main()
{
int i,j;
double x=0;
#pragma omp parallel for
for(i=0; i < 100; i++)
{
/* Dummy heavy computation */
x = 100000 * fabs(cos(i*i));
#pragma omp parallel for reduction(+: x)
for(j=0;j<100+i*20000;j++)
x += sqrt(sqrt(543*j)*fabs(sin(j)));
printf("Thread %d computed i=%2d [%g]\n",omp_get_thread_num(),i,x);
}
}
A pretty good guide to openmp can be found here : http://bisqwit.iki.fi/story/howto/openmp/
Edited:< Matlab limits TBB but not OpenMP >
My question is different than the one above, it's not duplicated though using the same sample code for illustration. In my case I specified num of threads in tbb initialization instead of using "deferred". Also I'm talking about the strange behavior between TBB in c++ and TBB in mex. The answer to that question only demonstrates thread initialization when running TBB in C++, not in MEX.
I'm trying to boost a Matlab mex file to improve performance. The strange thing I come across when using TBB within mex is that TBB initialization doesn't work as expected.
This C++ program performs 100% cpu usage and has 15 TBB threads when executing it alone:
main.cpp
#include "tbb/parallel_for_each.h"
#include "tbb/task_scheduler_init.h"
#include <iostream>
#include <vector>
#include "mex.h"
struct mytask {
mytask(size_t n)
:_n(n)
{}
void operator()() {
for (long i=0;i<10000000000L;++i) {} // Deliberately run slow
std::cerr << "[" << _n << "]";
}
size_t _n;
};
template <typename T> struct invoker {
void operator()(T& it) const {it();}
};
void mexFunction(/* int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[] */) {
tbb::task_scheduler_init init(15); // 15 threads
std::vector<mytask> tasks;
for (int i=0;i<10000;++i)
tasks.push_back(mytask(i));
tbb::parallel_for_each(tasks.begin(),tasks.end(),invoker<mytask>());
}
int main()
{
mexFunction();
}
Then I modified the code a little bit to make a MEX for matlab:
BuildMEX.mexw64
#include "tbb/parallel_for_each.h"
#include "tbb/task_scheduler_init.h"
#include <iostream>
#include <vector>
#include "mex.h"
struct mytask {
mytask(size_t n)
:_n(n)
{}
void operator()() {
for (long i=0;i<10000000000L;++i) {} // Deliberately run slow
std::cerr << "[" << _n << "]";
}
size_t _n;
};
template <typename T> struct invoker {
void operator()(T& it) const {it();}
};
void mexFunction( int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[] ) {
tbb::task_scheduler_init init(15); // 15 threads
std::vector<mytask> tasks;
for (int i=0;i<10000;++i)
tasks.push_back(mytask(i));
tbb::parallel_for_each(tasks.begin(),tasks.end(),invoker<mytask>());
}
Eventually invoke BuildMEX.mexw64 in Matlab. I compiled(mcc) the following code snippet to Matlab binary "MEXtest.exe" and use vTune to profile its performance(run in MCR). The TBB within the process only initialized 4 tbb threads and the binary only occupies ~50% cpu usage. Why MEX is downgrading overall performance and TBB? How can I seize more cpu usage for mex?
MEXtest.exe
function MEXtest()
BuildMEX();
end
According to the scheduler class description:
This class allows to customize properties of the TBB task pool to some
extent. For example it can limit concurrency level of parallel work
initiated by the given thread. It also can be used to specify stack
size of the TBB worker threads, though this setting is not effective
if the thread pool has already been created.
This is further explained in the initialize() methods called by the constructor:
The number_of_threads is ignored if any other task_scheduler_inits currently exist. A thread may construct multiple
task_scheduler_inits. Doing so does no harm because the underlying
scheduler is reference counted.
(highlighted parts added by me)
I believe that MATLAB already uses Intel TBB internally, and it must have initialized a thread pool at a top level before the MEX-function is ever executed. Thus all task schedulers in your code are going to use the number of threads specified by internal parts of MATLAB, ignoring the value you specified in your code.
By default MATLAB must have initialized the thread pool with a size equal to the number of physical processors (not logicals), which is indicated by the fact that on my quad-core hyper-threaded machine I get:
>> maxNumCompThreads
Warning: maxNumCompThreads will be removed in a future release [...]
ans =
4
OpenMP on the other has no scheduler, and we can control number of threads at runtime by calling the following functions:
#include <omp.h>
..
omp_set_dynamic(1);
omp_set_num_threads(omp_get_num_procs());
or by setting the environment variable:
>> setenv('OMP_NUM_THREADS', '8')
To test this proposed explanation, here is the code I used:
test_tbb.cpp
#ifdef MATLAB_MEX_FILE
#include "mex.h"
#endif
#include <cstdlib>
#include <cstdio>
#include <vector>
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for_each.h"
#include "tbb/spin_mutex.h"
#include "tbb_helpers.hxx"
#define NTASKS 100
#define NLOOPS 400000L
tbb::spin_mutex print_mutex;
struct mytask {
mytask(size_t n) :_n(n) {}
void operator()()
{
// track maximum number of parallel workers run
ConcurrencyProfiler prof;
// burn some CPU cycles!
double x = 1.0 / _n;
for (long i=0; i<NLOOPS; ++i) {
x = sin(x) * 10.0;
while((double) rand() / RAND_MAX < 0.9);
}
{
tbb::spin_mutex::scoped_lock s(print_mutex);
fprintf(stderr, "%f\n", x);
}
}
size_t _n;
};
template <typename T> struct invoker {
void operator()(T& it) const { it(); }
};
void run()
{
// use all 8 logical cores
SetProcessAffinityMask(GetCurrentProcess(), 0xFF);
printf("numTasks = %d\n", NTASKS);
for (int t = tbb::task_scheduler_init::automatic;
t <= 512; t = (t>0) ? t*2 : 1)
{
tbb::task_scheduler_init init(t);
std::vector<mytask> tasks;
for (int i=0; i<NTASKS; ++i) {
tasks.push_back(mytask(i));
}
ConcurrencyProfiler::Reset();
tbb::parallel_for_each(tasks.begin(), tasks.end(), invoker<mytask>());
printf("pool_init(%d) -> %d worker threads\n", t,
ConcurrencyProfiler::GetMaxNumThreads());
}
}
#ifdef MATLAB_MEX_FILE
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
run();
}
#else
int main()
{
run();
return 0;
}
#endif
Here is the code for a simple helper class used to profile concurrency by keeping track of how many workers were invoked from the thread pool. You could always use Intel VTune or any other profiling tool to get the same kind of information:
tbb_helpers.hxx
#ifndef HELPERS_H
#define HELPERS_H
#include "tbb/atomic.h"
class ConcurrencyProfiler
{
public:
ConcurrencyProfiler();
~ConcurrencyProfiler();
static void Reset();
static size_t GetMaxNumThreads();
private:
static void RecordMax();
static tbb::atomic<size_t> cur_count;
static tbb::atomic<size_t> max_count;
};
#endif
tbb_helpers.cxx
#include "tbb_helpers.hxx"
tbb::atomic<size_t> ConcurrencyProfiler::cur_count;
tbb::atomic<size_t> ConcurrencyProfiler::max_count;
ConcurrencyProfiler::ConcurrencyProfiler()
{
++cur_count;
RecordMax();
}
ConcurrencyProfiler::~ConcurrencyProfiler()
{
--cur_count;
}
void ConcurrencyProfiler::Reset()
{
cur_count = max_count = 0;
}
size_t ConcurrencyProfiler::GetMaxNumThreads()
{
return static_cast<size_t>(max_count);
}
// Performs: max_count = max(max_count,cur_count)
// http://www.threadingbuildingblocks.org/
// docs/help/tbb_userguide/Design_Patterns/Compare_and_Swap_Loop.htm
void ConcurrencyProfiler::RecordMax()
{
size_t o;
do {
o = max_count;
if (o >= cur_count) break;
} while(max_count.compare_and_swap(cur_count,o) != o);
}
First I compile the code as a native executable (I am using Intel C++ Composer XE 2013 SP1, with VS2012 Update 4):
C:\> vcvarsall.bat amd64
C:\> iclvars.bat intel64 vs2012
C:\> icl /MD test_tbb.cpp tbb_helpers.cxx tbb.lib
I run the program in the system shell (Windows 8.1). It goes up to 100% CPU utilization and I get the following output:
C:\> test_tbb.exe 2> nul
numTasks = 100
pool_init(-1) -> 8 worker threads // task_scheduler_init::automatic
pool_init(1) -> 1 worker threads
pool_init(2) -> 2 worker threads
pool_init(4) -> 4 worker threads
pool_init(8) -> 8 worker threads
pool_init(16) -> 16 worker threads
pool_init(32) -> 32 worker threads
pool_init(64) -> 64 worker threads
pool_init(128) -> 98 worker threads
pool_init(256) -> 100 worker threads
pool_init(512) -> 98 worker threads
As expected, the thread pool is initialized as large as we asked, and being fully utilized being limited by the number of tasks we created (in the last case we have 512 threads for only 100 parallel tasks!).
Next I compile the code as a MEX-file:
>> mex -I"C:\Program Files (x86)\Intel\Composer XE\tbb\include" ...
-largeArrayDims test_tbb.cpp tbb_helpers.cxx ...
-L"C:\Program Files (x86)\Intel\Composer XE\tbb\lib\intel64\vc11" tbb.lib
Here is the output I get when I run the MEX-function in MATLAB:
>> test_tbb()
numTasks = 100
pool_init(-1) -> 4 worker threads
pool_init(1) -> 4 worker threads
pool_init(2) -> 4 worker threads
pool_init(4) -> 4 worker threads
pool_init(8) -> 4 worker threads
pool_init(16) -> 4 worker threads
pool_init(32) -> 4 worker threads
pool_init(64) -> 4 worker threads
pool_init(128) -> 4 worker threads
pool_init(256) -> 4 worker threads
pool_init(512) -> 4 worker threads
As you can see, no matter what we specify as pool size, the scheduler always spins at most 4 threads to execute the parallel tasks (4 being the number of physical processors on my quad-core machine). This confirms what I stated in the beginning of the post.
Note that I explicitly set the processor affinity mask to use all 8 cores, but since there are only 4 running threads, CPU usage stayed approximately at 50% in this case.
Hope this helps answer the question, and sorry for the long post :)
Assuming you have more than 4 physical cores on your machine, the affinity mask for the MATLAB standalone process is probably limiting the available CPUs. Functions called from an actual MATLAB installation should have the use of all CPUs, but this may not be the case for standalone MATLAB applications generated with the MATLAB Compiler. Try the test again, running the MEX function directly from MATLAB. In any case, you should be able to reset the affinity mask to make all cores available to TBB, but I do not think you this approach will let you coerce TBB to start more threads than you have physical cores.
Background
Since TBB 3.0 update 4, processor affinity settings are referenced to determine the number of available cores, according to a developer blog:
So the only thing that TBB should do instead of asking the system how many CPUs it has, is to retrieve the current process affinity mask, count the number of non-zero bits in it, and voilà, TBB uses no more worker threads than necessary! And this is exactly what TBB 3.0 Update 4 does. Clarifying the statement in the end of my previous blog TBB’s methods tbb::task_scheduler_init::default_num_threads() and tbb::tbb_thread::hardware_concurrency() return not simply the total number of logical CPUs in the system or the current processor group, but rather the number of CPUs available to the process in accordance with its affinity settings.
Similarly, the docs for tbb::default_num_threads indicate this change:
Before TBB 3.0 U4 this method returned the number of logical CPU in the system. Currently on Windows, Linux and FreeBSD it returns the number of logical CPUs available to the current process in accordance with its affinity mask.
The docs for tbb::task_scheduler_init::initialize also suggest that the number of threads is "limited by the processor affinity mask".
Resolution
To check if you are being limited by the affinity mask, Windows .NET functions are available:
numCoresInSystem = 16;
proc = System.Diagnostics.Process.GetCurrentProcess();
dec2bin(proc.ProcessorAffinity.ToInt32,numCoresInSystem)
The output string should have no zeros in any position representing a real (present in the system) core.
You can set the affinity mask in MATLAB or C, as described in the Q&A, Set processor affinity for MATLAB engine (Windows 7). The MATLAB way:
proc = System.Diagnostics.Process.GetCurrentProcess();
proc.ProcessorAffinity = System.IntPtr(int32(2^numCoresInSystem-1));
proc.Refresh()
Or using the Windows API, in a mexFunction, before calling task_scheduler_init:
SetProcessAffinityMask(GetCurrentProcess(),(1 << N) - 1)
For *nix, you can call taskset:
system(sprintf('taskset -p %d %d',2^N - 1,feature('getpid')))
I am trying to use C++ AMP to execute a long running kernel on the GPU. This requires using DirectX to create a device which won't timeout. I am setting the flag but it is still triggering Timeout Detection Recovery. I have a dedicated Radeon HD 7970 in my box without a monitor plugged into it. Is there anything else I need to do to keep Windows 8 from canceling my kernel before it is finished?
#include <amp.h>
#include <amp_math.h>
#include <amp_graphics.h>
#include <d3d11.h>
#include <dxgi.h>
#include <vector>
#include <iostream>
#include <iomanip>
#include "amp_tinymt_rng.h"
#include "timer.h"
#include <assert.h>
#pragma comment(lib, "d3d11")
#pragma comment(lib, "dxgi")
//Inside Main()
unsigned int createDeviceFlags = D3D11_CREATE_DEVICE_DISABLE_GPU_TIMEOUT;
ID3D11Device * pDevice = nullptr;
ID3D11DeviceContext * pContext = nullptr;
D3D_FEATURE_LEVEL targetFeatureLevels = D3D_FEATURE_LEVEL_11_1;
D3D_FEATURE_LEVEL featureLevel;
auto hr = D3D11CreateDevice(pAdapter,
D3D_DRIVER_TYPE_UNKNOWN,
nullptr,
createDeviceFlags,
&targetFeatureLevels,
1,
D3D11_SDK_VERSION,
&pDevice,
&featureLevel,
&pContext);
if (FAILED( hr) ||
( featureLevel != D3D_FEATURE_LEVEL_11_1))
{
std:: wcerr << "Failed to create Direct3D 11 device" << std:: endl;
return 10;
}
accelerator_view noTimeoutAcclView = concurrency::direct3d::create_accelerator_view(pDevice);
wcout << noTimeoutAcclView.accelerator.description;
//Setup kernal here
concurrency::parallel_for_each(noTimeoutAcclView, e_size, [=] (index<1> idx) restrict(amp) {
//Execute kernel here
});
Your snippet looks good, the problem has to be elsewhere. Here are few ideas:
Please double check all parallel_for_each invocations and make sure
they all use accelerator_view with the device that you created with
this snippet (explicitly pass accelerator_view as first argument to parallel_for_each).
If possible reduce the problem size and see if your code runs without
TDR, perhaps something else is causing a TDR (e.g. driver bugs are common cause of TDRs). Once you will know that your algorithm runs correctly for smaller problem you can go back to searching why is TDR triggered for larger problem size.
Disable TDR completely (see MSDN article on TDR registry keys) and see if your large problem
set ever completes, if so go back to first point. This will indicate that your code runs on accelerator_view that has TDR enabled.
Good luck!
I'm at very beginning with OpenMP, i just compiled with gcc -fopenmp openmp_c_helloworld.c the following piece of code:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char *argv[]) {
int th_id, nthreads;
#pragma omp parallel private(th_id)
{
th_id = omp_get_thread_num();
printf("Hello World from thread %d\n", th_id);
#pragma omp barrier
if ( th_id == 0 ) {
nthreads = omp_get_num_threads();
printf("There are %d threads\n",nthreads);
}
}
return EXIT_SUCCESS;
}
I just run the executable on a quad-core Intel CPU with HyperThreading and i obtain the following output:
Hello World from thread 2
Hello World from thread 0
Hello World from thread 3
Hello World from thread 1
There are 4 threads
Technically speaking i have 8 thread available on my CPU and 4 CPU-core, why OpenMP shows me only 4 thread?
To put it simply, I think it's because OpenMP looks for the number of CPU's (cores) rather than the number of processor threads.
See this page: `
Implementation default - usually the number of CPUs on a node, though
it could be dynamic (see next bullet).
Something you could try out is setting the number of threads in your program to be equal to the number of processor threads and see if there's a performance improvement (you'll have to create your own benchmarking program).
In parallel programming, good performance is obtained when the number of worker threads are equal to the number of processor threads. You can keep a thread or two extra for I/O as well.
Why does the time to execute function f1() changes from one run to another in debug mode? Why it's always zero in release mode?
I didn't include stdio.h nor cstdio and the code compiled. How ?
#include <iostream>
#include <ctime>
void f1()
{
for( int i = 0; i < 10000; i++ );
}
int main()
{
clock_t start, finish;
start = clock();
for( int i = 0; i < 100000; i++ ) f1();
finish = clock();
double duration = (double)(finish - start) / CLOCKS_PER_SEC;
printf( "Duration = %6.2f seconds\n", duration);
}
Possible the machine you're running your test code from is too fast. Try increasing the loop count to a really huge number.
Other things to try is to test with the sleep() function.
This should confirm the behavior of your clock() measurements.
I believe the reason you are seeing zero runtime for f1() in release mode is because the compiler is optimizing the function. Since your for loop doesn't have a code block, it can effectively be pulled out during compilation.
I'm guessing that this optimization is not performed in debug mode, which would explain why you see a longer execution time. It varies between runs simply because your OS scheduler (almost certainly) does not guarantee a fixed time slot for processes.
As for why you can use printf() when you have not explicitly included <cstdio>, it's because of the <iostream> include.
From looking my headers at C:\Program Files\Microsoft Visual Studio 10.0\VC\include, I can see that iostream includes istream and ostream, both of which include ios, which includes xlocnum, which includes both cstdlib and cstdio.