OpenMP does not use all threads / CLion / Windows - multithreading

I'm learning the first steps with OpenMP and got stuck a little bit. Why does my code not use all available threads? OMP_NUM_THREADS=6 has been set as an environment variable.
#include <omp.h>
#include <iostream>

int main() {
    int max = omp_get_max_threads();
    std::cout << "Max threads: " << max << std::endl;
    #pragma omp parallel
    {
        int n = omp_get_num_threads();
        int tid = omp_get_thread_num();
        std::cout << "There are threads:" << n << " Hello from thread: " << tid << std::endl;
    } /* end of parallel section */
    std::cout << "Hello from the master thread\n";
    return 0;
}
Output:
Max threads: 6
There are threads:1 Hello from thread: 0
Hello from the master thread
Update: I also tried omp_set_dynamic(0); with no success.
Update: it was solved with a compiler flag:
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /openmp")

Answer:
It was a problem with the compiler flags in CMakeLists.txt; I needed to set this flag so MSVC actually compiles the OpenMP pragmas:
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /openmp")
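A quick way to confirm the flag actually took effect is to test the _OPENMP macro, which conforming compilers define when OpenMP support is enabled. A minimal check (the printed value is the release date of the supported OpenMP spec, e.g. 200203 for MSVC's OpenMP 2.0 support):
#include <iostream>

int main() {
#ifdef _OPENMP
    // _OPENMP holds the yyyymm date of the OpenMP spec the compiler supports.
    std::cout << "OpenMP enabled, _OPENMP = " << _OPENMP << '\n';
#else
    std::cout << "OpenMP support is NOT enabled at compile time\n";
#endif
    return 0;
}
On recent CMake versions, using find_package(OpenMP) and linking against OpenMP::OpenMP_CXX is generally preferred over appending to CMAKE_CXX_FLAGS by hand, since it selects the right flag for each compiler.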

Related

OpenMP core assignment fails

My CentOS 6 VM shows four cores when displaying the content of /proc/cpuinfo, and /sys/devices/system/cpu/online shows 0-3.
I am trying to run the following code on cores 2 and 3 using KMP_AFFINITY="explicit,proclist=[2-3]"
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>

int main(int argc, char *argv[]) {
    int nthreads, tid, cid;
    /* cid made private as well, to avoid a data race on a shared variable */
    #pragma omp parallel private(nthreads, tid, cid)
    {
        tid = omp_get_thread_num();
        cid = sched_getcpu();
        printf("Hello from thread %d on core %d\n", tid, cid);
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    }
    return 0;
}
When compiled with icc (ICC) 16.0.1 20151021, it fails to detect the available cores and executes everything on core 0.
$ OMP_NUM_THREADS=4 ./a.out
OMP: Warning #123: Ignoring invalid OS proc ID 2.
OMP: Warning #123: Ignoring invalid OS proc ID 3.
OMP: Warning #124: No valid OS proc IDs specified - not using affinity.
Hello from thread 0 on core 0
Number of threads = 1
Hello from thread 0 on core 0
Number of threads = 1
Hello from thread 0 on core 0
Number of threads = 1
Hello from thread 0 on core 0
Number of threads = 1
Whereas gcc (GCC) 4.4.7 20120313, with GOMP_CPU_AFFINITY="2-3", executes properly on cores 2 and 3, as set.
I used strace to check what's going on under the hood, and I noticed something strange:
[...]
open("/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 3
read(3, "0-3\n", 8192) = 4
[...]
sched_getaffinity(0, 1048576, { 1 }) = 8
sched_setaffinity(0, 8, { 4521c26fbb1c38c1 }) = -1 EFAULT (Bad address)
[...]
Could this be an error in the OpenMP implementation made by Intel?
I cannot upgrade my compiler to fix it in this case. Is it possible to use the GCC OpenMP library instead of the Intel one when compiling with icc?
Update:
I managed to compile the code with gcc and link it with iomp using the following command:
gcc omp.c -L/opt/intel/compilers_and_libraries_2016/linux/lib/intel64_lin/ -liomp5
The execution outputs no warning, but is still not correct:
$ OMP_NUM_THREADS=4 ./a.out
Hello from thread 0 on core 0
Number of threads = 1
The same sched_setaffinity error as previously shown appears in strace.
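One way to narrow this down is to print the affinity mask the process actually starts with, independent of any OpenMP runtime. A minimal standalone check using glibc's sched_getaffinity (a diagnostic sketch, not a fix):
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    // pid 0 queries the affinity mask of the calling process.
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_getaffinity");
        return 1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set))
            printf("CPU %d is in the affinity mask\n", cpu);
    return 0;
}
If this reports only CPU 0, the mask was restricted before the OpenMP runtime ever parsed KMP_AFFINITY, which would explain the "invalid OS proc ID" warnings.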

VC++: crash when freeing a DLL built with OpenMP

I've reduced a crash to the following toy code:
// DLLwithOMP.cpp : build into a dll *with* /openmp
#include <stdio.h>
#include <tchar.h>

extern "C"
{
    __declspec(dllexport) void funcOMP()
    {
        #pragma omp parallel for
        for (int i = 0; i < 100; i++)
            _tprintf(_T("Please fondle my buttocks\n"));
    }
}
// ConsoleApplication1.cpp : build into an executable *without* /openmp
#include <windows.h>
#include <stdio.h>
#include <tchar.h>

typedef void (*tDllFunc)();

int main()
{
    HMODULE hDLL = LoadLibrary(_T("DLLwithOMP.dll"));
    tDllFunc pDllFunc = (tDllFunc)GetProcAddress(hDLL, "funcOMP");
    pDllFunc();
    FreeLibrary(hDLL);
    // At this point the omp runtime vcomp140[d].dll refcount is zero
    // and windows unloads it, but the omp thread team remains active.
    // A crash usually ensues.
    return 0;
}
Is this an MS bug? Is there some OMP thread-cleanup API I missed (probably not, but maybe)? I don't have other compilers at hand. Do they treat this scenario differently? (again, probably not) Does the OpenMP standard have anything to say on such a scenario?
I got an answer from Eric Brumer @ MS Connect. Re-posting it here in case it is of interest to anyone in the future:
for optimal performance, the openmp threadpool spin waits for about a
second prior to shutting down in case more work becomes available. If
you unload a DLL that's in the process of spin-waiting, it will crash
in the manner you see (most of the time).
You can tell openmp not to spin-wait and the threads will immediately
block after the loop finishes. Just set OMP_WAIT_POLICY=passive in
your environment, or call SetEnvironmentVariable(L"OMP_WAIT_POLICY",
L"passive"); in your function before loading the dll. The default is
"active" which tells the threadpool to spin wait. Use the environment
variable, or just wait a few seconds before calling FreeLibrary.
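Applied to the toy code above, the host program might look like this (a sketch of Eric Brumer's suggestion; the key point is that the environment variable is set before LoadLibrary, so the OpenMP runtime reads it during its initialization):
// ConsoleApplication1.cpp : variant with a passive wait policy
#include <windows.h>
#include <tchar.h>

typedef void (*tDllFunc)();

int main()
{
    // Must happen before LoadLibrary: vcomp reads it when it initializes.
    SetEnvironmentVariableW(L"OMP_WAIT_POLICY", L"passive");
    HMODULE hDLL = LoadLibrary(_T("DLLwithOMP.dll"));
    if (hDLL == NULL) return 1;
    tDllFunc pDllFunc = (tDllFunc)GetProcAddress(hDLL, "funcOMP");
    if (pDllFunc) pDllFunc();
    FreeLibrary(hDLL); // the worker threads are blocked, not spin-waiting
    return 0;
}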

TBB acting strange in Matlab Mex file

Edited: see <Matlab limits TBB but not OpenMP>
My question is different from the one above; it is not a duplicate, though it uses the same sample code for illustration. In my case I specified the number of threads in the TBB initialization instead of using "deferred". Also, I'm asking about the strange difference in behavior between TBB in C++ and TBB in a MEX file; the answer to that question only demonstrates thread initialization when running TBB in C++, not in MEX.
I'm trying to boost the performance of a Matlab MEX file. The strange thing I came across when using TBB within MEX is that the TBB initialization doesn't work as expected.
This C++ program reaches 100% CPU usage and uses 15 TBB threads when executed on its own:
main.cpp
#include "tbb/parallel_for_each.h"
#include "tbb/task_scheduler_init.h"
#include <iostream>
#include <vector>
#include "mex.h"
struct mytask {
mytask(size_t n)
:_n(n)
{}
void operator()() {
for (long i=0;i<10000000000L;++i) {} // Deliberately run slow
std::cerr << "[" << _n << "]";
}
size_t _n;
};
template <typename T> struct invoker {
void operator()(T& it) const {it();}
};
void mexFunction(/* int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[] */) {
tbb::task_scheduler_init init(15); // 15 threads
std::vector<mytask> tasks;
for (int i=0;i<10000;++i)
tasks.push_back(mytask(i));
tbb::parallel_for_each(tasks.begin(),tasks.end(),invoker<mytask>());
}
int main()
{
mexFunction();
}
Then I modified the code a little bit to build a MEX file for Matlab:
BuildMEX.mexw64
#include "tbb/parallel_for_each.h"
#include "tbb/task_scheduler_init.h"
#include <iostream>
#include <vector>
#include "mex.h"
struct mytask {
mytask(size_t n)
:_n(n)
{}
void operator()() {
for (long i=0;i<10000000000L;++i) {} // Deliberately run slow
std::cerr << "[" << _n << "]";
}
size_t _n;
};
template <typename T> struct invoker {
void operator()(T& it) const {it();}
};
void mexFunction( int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[] ) {
tbb::task_scheduler_init init(15); // 15 threads
std::vector<mytask> tasks;
for (int i=0;i<10000;++i)
tasks.push_back(mytask(i));
tbb::parallel_for_each(tasks.begin(),tasks.end(),invoker<mytask>());
}
Eventually I invoke BuildMEX.mexw64 in Matlab. I compiled (mcc) the following code snippet into the Matlab binary "MEXtest.exe" and used VTune to profile its performance (running under the MCR). TBB within that process initialized only 4 threads, and the binary occupies only ~50% CPU usage. Why does MEX downgrade overall performance and TBB? How can I get more CPU usage for the MEX file?
MEXtest.exe
function MEXtest()
BuildMEX();
end
According to the scheduler class description:
This class allows to customize properties of the TBB task pool to some
extent. For example it can limit concurrency level of parallel work
initiated by the given thread. It also can be used to specify stack
size of the TBB worker threads, though this setting is not effective
if the thread pool has already been created.
This is further explained in the initialize() methods called by the constructor:
The number_of_threads is ignored if any other task_scheduler_inits currently exist. A thread may construct multiple
task_scheduler_inits. Doing so does no harm because the underlying
scheduler is reference counted.
(highlighted parts added by me)
I believe that MATLAB already uses Intel TBB internally, and it must have initialized a thread pool at a top level before the MEX-function is ever executed. Thus all task schedulers in your code are going to use the number of threads specified by internal parts of MATLAB, ignoring the value you specified in your code.
By default MATLAB must have initialized the thread pool with a size equal to the number of physical processors (not logical ones), which is indicated by the fact that on my quad-core hyper-threaded machine I get:
>> maxNumCompThreads
Warning: maxNumCompThreads will be removed in a future release [...]
ans =
4
OpenMP, on the other hand, has no scheduler, and we can control the number of threads at runtime by calling the following functions:
#include <omp.h>
..
omp_set_dynamic(1);
omp_set_num_threads(omp_get_num_procs());
or by setting the environment variable:
>> setenv('OMP_NUM_THREADS', '8')
To test this proposed explanation, here is the code I used:
test_tbb.cpp
#ifdef MATLAB_MEX_FILE
#include "mex.h"
#endif
#include <cstdlib>
#include <cstdio>
#include <cmath>      // for sin()
#include <vector>
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for_each.h"
#include "tbb/spin_mutex.h"
#include "tbb_helpers.hxx"

#define NTASKS 100
#define NLOOPS 400000L

tbb::spin_mutex print_mutex;

struct mytask {
    mytask(size_t n) : _n(n) {}
    void operator()()
    {
        // track maximum number of parallel workers run
        ConcurrencyProfiler prof;
        // burn some CPU cycles!
        double x = 1.0 / _n;
        for (long i = 0; i < NLOOPS; ++i) {
            x = sin(x) * 10.0;
            while ((double) rand() / RAND_MAX < 0.9);
        }
        {
            tbb::spin_mutex::scoped_lock s(print_mutex);
            fprintf(stderr, "%f\n", x);
        }
    }
    size_t _n;
};

template <typename T> struct invoker {
    void operator()(T& it) const { it(); }
};

void run()
{
    // use all 8 logical cores
    SetProcessAffinityMask(GetCurrentProcess(), 0xFF);

    printf("numTasks = %d\n", NTASKS);

    for (int t = tbb::task_scheduler_init::automatic;
         t <= 512; t = (t > 0) ? t * 2 : 1)
    {
        tbb::task_scheduler_init init(t);

        std::vector<mytask> tasks;
        for (int i = 0; i < NTASKS; ++i) {
            tasks.push_back(mytask(i));
        }

        ConcurrencyProfiler::Reset();
        tbb::parallel_for_each(tasks.begin(), tasks.end(), invoker<mytask>());

        printf("pool_init(%d) -> %d worker threads\n", t,
               (int)ConcurrencyProfiler::GetMaxNumThreads());
    }
}

#ifdef MATLAB_MEX_FILE
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
    run();
}
#else
int main()
{
    run();
    return 0;
}
#endif
Here is the code for a simple helper class used to profile concurrency by keeping track of how many workers were invoked from the thread pool. You could always use Intel VTune or any other profiling tool to get the same kind of information:
tbb_helpers.hxx
#ifndef HELPERS_H
#define HELPERS_H

#include "tbb/atomic.h"

class ConcurrencyProfiler
{
public:
    ConcurrencyProfiler();
    ~ConcurrencyProfiler();
    static void Reset();
    static size_t GetMaxNumThreads();
private:
    static void RecordMax();
    static tbb::atomic<size_t> cur_count;
    static tbb::atomic<size_t> max_count;
};

#endif
tbb_helpers.cxx
#include "tbb_helpers.hxx"
tbb::atomic<size_t> ConcurrencyProfiler::cur_count;
tbb::atomic<size_t> ConcurrencyProfiler::max_count;
ConcurrencyProfiler::ConcurrencyProfiler()
{
++cur_count;
RecordMax();
}
ConcurrencyProfiler::~ConcurrencyProfiler()
{
--cur_count;
}
void ConcurrencyProfiler::Reset()
{
cur_count = max_count = 0;
}
size_t ConcurrencyProfiler::GetMaxNumThreads()
{
return static_cast<size_t>(max_count);
}
// Performs: max_count = max(max_count,cur_count)
// http://www.threadingbuildingblocks.org/
// docs/help/tbb_userguide/Design_Patterns/Compare_and_Swap_Loop.htm
void ConcurrencyProfiler::RecordMax()
{
size_t o;
do {
o = max_count;
if (o >= cur_count) break;
} while(max_count.compare_and_swap(cur_count,o) != o);
}
First I compile the code as a native executable (I am using Intel C++ Composer XE 2013 SP1, with VS2012 Update 4):
C:\> vcvarsall.bat amd64
C:\> iclvars.bat intel64 vs2012
C:\> icl /MD test_tbb.cpp tbb_helpers.cxx tbb.lib
I ran the program from the system shell (Windows 8.1). It went up to 100% CPU utilization and I got the following output:
C:\> test_tbb.exe 2> nul
numTasks = 100
pool_init(-1) -> 8 worker threads // task_scheduler_init::automatic
pool_init(1) -> 1 worker threads
pool_init(2) -> 2 worker threads
pool_init(4) -> 4 worker threads
pool_init(8) -> 8 worker threads
pool_init(16) -> 16 worker threads
pool_init(32) -> 32 worker threads
pool_init(64) -> 64 worker threads
pool_init(128) -> 98 worker threads
pool_init(256) -> 100 worker threads
pool_init(512) -> 98 worker threads
As expected, the thread pool is initialized as large as we asked, and is fully utilized, limited only by the number of tasks we created (in the last case we have 512 threads for only 100 parallel tasks!).
Next I compile the code as a MEX-file:
>> mex -I"C:\Program Files (x86)\Intel\Composer XE\tbb\include" ...
-largeArrayDims test_tbb.cpp tbb_helpers.cxx ...
-L"C:\Program Files (x86)\Intel\Composer XE\tbb\lib\intel64\vc11" tbb.lib
Here is the output I get when I run the MEX-function in MATLAB:
>> test_tbb()
numTasks = 100
pool_init(-1) -> 4 worker threads
pool_init(1) -> 4 worker threads
pool_init(2) -> 4 worker threads
pool_init(4) -> 4 worker threads
pool_init(8) -> 4 worker threads
pool_init(16) -> 4 worker threads
pool_init(32) -> 4 worker threads
pool_init(64) -> 4 worker threads
pool_init(128) -> 4 worker threads
pool_init(256) -> 4 worker threads
pool_init(512) -> 4 worker threads
As you can see, no matter what pool size we specify, the scheduler always spins up at most 4 threads to execute the parallel tasks (4 being the number of physical processors on my quad-core machine). This confirms what I stated at the beginning of the post.
Note that I explicitly set the processor affinity mask to use all 8 cores, but since there were only 4 running threads, CPU usage stayed at approximately 50% in this case.
Hope this helps answer the question, and sorry for the long post :)
Assuming you have more than 4 physical cores on your machine, the affinity mask for the MATLAB standalone process is probably limiting the available CPUs. Functions called from an actual MATLAB installation should have the use of all CPUs, but this may not be the case for standalone MATLAB applications generated with the MATLAB Compiler. Try the test again, running the MEX function directly from MATLAB. In any case, you should be able to reset the affinity mask to make all cores available to TBB, but I do not think this approach will let you coerce TBB into starting more threads than you have physical cores.
Background
Since TBB 3.0 update 4, processor affinity settings are referenced to determine the number of available cores, according to a developer blog:
So the only thing that TBB should do instead of asking the system how many CPUs it has, is to retrieve the current process affinity mask, count the number of non-zero bits in it, and voilà, TBB uses no more worker threads than necessary! And this is exactly what TBB 3.0 Update 4 does. Clarifying the statement in the end of my previous blog TBB’s methods tbb::task_scheduler_init::default_num_threads() and tbb::tbb_thread::hardware_concurrency() return not simply the total number of logical CPUs in the system or the current processor group, but rather the number of CPUs available to the process in accordance with its affinity settings.
Similarly, the docs for tbb::default_num_threads indicate this change:
Before TBB 3.0 U4 this method returned the number of logical CPU in the system. Currently on Windows, Linux and FreeBSD it returns the number of logical CPUs available to the current process in accordance with its affinity mask.
The docs for tbb::task_scheduler_init::initialize also suggest that the number of threads is "limited by the processor affinity mask".
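A small native test makes this affinity dependence easy to observe (a sketch; on TBB 3.0 U4 or later the reported value should shrink when the process affinity mask is restricted):
#include <cstdio>
#include "tbb/task_scheduler_init.h"

int main()
{
    // Reflects the process affinity mask on TBB >= 3.0 U4,
    // not the raw number of logical CPUs in the system.
    printf("default_num_threads() = %d\n",
           tbb::task_scheduler_init::default_num_threads());
    return 0;
}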
Resolution
To check if you are being limited by the affinity mask, Windows .NET functions are available:
numCoresInSystem = 16;
proc = System.Diagnostics.Process.GetCurrentProcess();
dec2bin(proc.ProcessorAffinity.ToInt32,numCoresInSystem)
The output string should have no zeros in any position representing a real (present in the system) core.
You can set the affinity mask in MATLAB or C, as described in the Q&A, Set processor affinity for MATLAB engine (Windows 7). The MATLAB way:
proc = System.Diagnostics.Process.GetCurrentProcess();
proc.ProcessorAffinity = System.IntPtr(int32(2^numCoresInSystem-1));
proc.Refresh()
Or using the Windows API, in a mexFunction, before calling task_scheduler_init:
SetProcessAffinityMask(GetCurrentProcess(),(1 << N) - 1)
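Expanded into context, that call might sit at the top of the MEX entry point, before the first task_scheduler_init counts the available CPUs (a sketch; N = 8 is a placeholder for the logical CPU count of your machine):
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include "mex.h"
#include "tbb/task_scheduler_init.h"

void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
    const DWORD_PTR N = 8; // placeholder: number of logical CPUs
    // Widen the affinity mask before TBB sizes its thread pool.
    SetProcessAffinityMask(GetCurrentProcess(), ((DWORD_PTR)1 << N) - 1);
    tbb::task_scheduler_init init(tbb::task_scheduler_init::automatic);
    // ... parallel work goes here ...
}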
For *nix, you can call taskset (note that taskset reads the mask as hexadecimal, hence the %x format):
system(sprintf('taskset -p %x %d', 2^N - 1, feature('getpid')))

Segmentation fault using glGetString() with pthreads under Linux

I'm trying to load textures in a background thread to help speed up my application.
The stack we are using is C/C++ on Linux, compiling with gcc. We're using OpenGL, GLUT and GLEW. We have been using libSOIL for texture loading.
Ultimately, launching texture loads with libSOIL fails because it encounters a glGetString() call that causes a segfault. Trying to narrow down the problem, I wrote a very simple OpenGL application that reproduces the behavior. The below code sample shouldn't "do anything," but it also shouldn't segfault. If I knew why it did, I could in theory rework libSOIL so that it would behave in a pthreaded environment.
void *glPthreadTest(void* arg) {
    glGetString(GL_EXTENSIONS); // SIGSEGV
    return NULL;
}

int main(int argc, char **argv) {
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE | GLUT_DEPTH);
    glewInit();
    glGetString(GL_EXTENSIONS); // Does not cause SIGSEGV
    pthread_t id;
    if (pthread_create(&id, NULL, glPthreadTest, (void*)NULL) != 0)
        fprintf(stderr, "pthread_create glPthreadTest failed.\n");
    glutMainLoop();
    return EXIT_SUCCESS;
}
A sample stacktrace for this application from gdb looks like this:
#0 0x00000038492f86e9 in glGetString () from /usr/lib64/nvidia/libGL.so.1
No symbol table info available.
#1 0x0000000000404425 in glPthreadTest (arg=0x0) at sf.cpp:168
No locals.
#2 0x0000003148e07d15 in start_thread (arg=0x7ffff7b36700) at pthread_create.c:308
__res = <optimized out>
pd = 0x7ffff7b36700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737349117696, -5802871742031723458, 1, 211665686528, 140737349117696, 0, 5802854601940796478,
-5829171783283899330}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = 0
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
#3 0x00000031486f246d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:114
No locals.
You'll notice I am using the NVIDIA libGL implementation, but this also occurs identically with the Mesa libGL that Ubuntu uses for Intel HD graphics cards.
Any tips for what might be going wrong, or how to investigate further to see what's happening?
Edit: Here are the #includes and the compile string for my example test:
#include <SOIL.h>
#include <GL/glew.h>
#include <GL/freeglut.h>
#include <GL/freeglut_ext.h>
#include <signal.h>
#include <pthread.h>
#include <cstdio>
g++ -Wall -pedantic -I/usr/include/SOIL -O0 -ggdb -o sf sf.cpp -lSOIL -pthread -lGL -lGLU -lGLEW -lglut -lX11
In order for any OpenGL call to operate properly, it requires an OpenGL context. Contexts are created using a window-system binding call (like wglCreateContext or similar). After creating a context, it needs to be "made current", which means associating the context with the current thread of execution. This is accomplished with another window-system specific call (like wglMakeCurrent for Microsoft Windows, or glXMakeCurrent for X Windows). GLUT abstracts all of that complexity away from you, doing all of those operations when you call glutCreateWindow.
Now, an important rule to know is that only a single OpenGL context can be current to a thread of execution at any one time. So, in the OP's original example, if she/he could make the context current in the Pthread they created, then the context would be lost in the main thread. The way to keep all this consistent is to only use a single context in a single thread. (It's possible to have OpenGL contexts share data, but that's neither exposed by GLUT, nor possible without using the window-system context creation calls).
In your case, it's likely that GLUT doesn't allow access to what you really need (i.e., the OpenGL context), to make it current in the other thread. You'd need to create and manage OpenGL contexts yourself.
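If the end goal is only background texture loading, a common workaround avoids sharing contexts altogether: do the CPU-side image decode on the worker thread and keep every GL call on the main thread that owns the context. A sketch of that split using libSOIL's SOIL_load_image (the thread hand-off is simplified here; a real program would signal completion through a queue or flag, and "texture.png" is just a placeholder):
#include <GL/glew.h>
#include <SOIL.h>
#include <pthread.h>

struct PendingTexture {
    unsigned char* pixels;
    int width, height, channels;
};

// Worker thread: pure CPU work, no GL calls, so no GL context is needed.
void* decodeThread(void* arg)
{
    PendingTexture* p = (PendingTexture*)arg;
    p->pixels = SOIL_load_image("texture.png", &p->width, &p->height,
                                &p->channels, SOIL_LOAD_RGBA);
    return NULL;
}

// Main thread (owns the context): upload once the decode has finished.
GLuint uploadTexture(const PendingTexture* p)
{
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, p->width, p->height,
                 0, GL_RGBA, GL_UNSIGNED_BYTE, p->pixels);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    SOIL_free_image_data(p->pixels);
    return tex;
}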

OpenMP behaviour detecting CPU and thread

I'm at the very beginning with OpenMP; I just compiled the following piece of code with gcc -fopenmp openmp_c_helloworld.c:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int th_id, nthreads;
    #pragma omp parallel private(th_id)
    {
        th_id = omp_get_thread_num();
        printf("Hello World from thread %d\n", th_id);
        #pragma omp barrier
        if (th_id == 0) {
            nthreads = omp_get_num_threads();
            printf("There are %d threads\n", nthreads);
        }
    }
    return EXIT_SUCCESS;
}
I ran the executable on a quad-core Intel CPU with Hyper-Threading and obtained the following output:
Hello World from thread 2
Hello World from thread 0
Hello World from thread 3
Hello World from thread 1
There are 4 threads
Technically speaking, I have 8 hardware threads available on my CPU and 4 CPU cores, so why does OpenMP show me only 4 threads?
To put it simply, I think it's because OpenMP looks at the number of CPUs (cores) rather than the number of processor threads.
See this page:
Implementation default - usually the number of CPUs on a node, though
it could be dynamic (see next bullet).
Something you could try is setting the number of threads in your program equal to the number of processor threads and seeing if there's a performance improvement (you'll have to create your own benchmarking program).
In parallel programming, good performance is obtained when the number of worker threads is equal to the number of processor threads. You can keep a thread or two extra for I/O as well, as sketched below.
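A minimal sketch of that experiment (hard-coding 8 for the quad-core Hyper-Threaded machine in question; adjust to your own hardware):
#include <omp.h>
#include <stdio.h>

int main(void)
{
    // Request one OpenMP thread per hardware thread instead of the default.
    omp_set_num_threads(8); // 8 hardware threads on a quad-core HT CPU
    #pragma omp parallel
    {
        #pragma omp single
        printf("Running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}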
