TBB acting strange in Matlab Mex file - multithreading

Edit: see also Matlab limits TBB but not OpenMP.
My question is different from the one above; it is not a duplicate even though it uses the same sample code for illustration. In my case I specified the number of threads in the TBB initialization instead of using "deferred". Also, I am asking about the different behavior of TBB in plain C++ versus TBB in a MEX file; the answer to that question only demonstrates thread initialization when running TBB from C++, not from MEX.
I'm trying to speed up a Matlab MEX file to improve performance. The strange thing I came across when using TBB within MEX is that the TBB initialization doesn't work as expected.
This C++ program reaches 100% CPU usage and uses 15 TBB threads when executed on its own:
main.cpp
#include "tbb/parallel_for_each.h"
#include "tbb/task_scheduler_init.h"
#include <iostream>
#include <vector>
#include "mex.h"
struct mytask {
mytask(size_t n)
:_n(n)
{}
void operator()() {
for (long i=0;i<10000000000L;++i) {} // Deliberately run slow
std::cerr << "[" << _n << "]";
}
size_t _n;
};
template <typename T> struct invoker {
void operator()(T& it) const {it();}
};
void mexFunction(/* int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[] */) {
tbb::task_scheduler_init init(15); // 15 threads
std::vector<mytask> tasks;
for (int i=0;i<10000;++i)
tasks.push_back(mytask(i));
tbb::parallel_for_each(tasks.begin(),tasks.end(),invoker<mytask>());
}
int main()
{
mexFunction();
}
Then I modified the code a little bit to build it as a MEX file for Matlab:
BuildMEX.mexw64
#include "tbb/parallel_for_each.h"
#include "tbb/task_scheduler_init.h"
#include <iostream>
#include <vector>
#include "mex.h"
struct mytask {
mytask(size_t n)
:_n(n)
{}
void operator()() {
for (long i=0;i<10000000000L;++i) {} // Deliberately run slow
std::cerr << "[" << _n << "]";
}
size_t _n;
};
template <typename T> struct invoker {
void operator()(T& it) const {it();}
};
void mexFunction( int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[] ) {
tbb::task_scheduler_init init(15); // 15 threads
std::vector<mytask> tasks;
for (int i=0;i<10000;++i)
tasks.push_back(mytask(i));
tbb::parallel_for_each(tasks.begin(),tasks.end(),invoker<mytask>());
}
Eventually I invoke BuildMEX.mexw64 in Matlab. I compiled (mcc) the following code snippet into the Matlab binary "MEXtest.exe" and used VTune to profile its performance (run in the MCR). TBB within the process only initialized 4 threads and the binary only occupies ~50% CPU usage. Why is MEX downgrading overall performance and TBB? How can I get more CPU usage for the MEX file?
MEXtest.exe
function MEXtest()
    BuildMEX();
end

According to the task_scheduler_init class description:
This class allows to customize properties of the TBB task pool to some extent. For example it can limit concurrency level of parallel work initiated by the given thread. It also can be used to specify stack size of the TBB worker threads, though this setting is not effective if the thread pool has already been created.
This is further explained in the initialize() methods called by the constructor:
The number_of_threads is ignored if any other task_scheduler_inits currently exist. A thread may construct multiple task_scheduler_inits. Doing so does no harm because the underlying scheduler is reference counted.
(highlighted parts added by me)
I believe that MATLAB already uses Intel TBB internally, and it must have initialized a thread pool at a top level before the MEX-function is ever executed. Thus all task schedulers in your code are going to use the number of threads specified by internal parts of MATLAB, ignoring the value you specified in your code.
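You can reproduce this pinning behavior in plain C++ by nesting two scheduler objects. Here is a minimal sketch (it reuses the ConcurrencyProfiler helper defined further below): the outer init plays the role of MATLAB's internal pool, and the inner init(8), like the one in your MEX-function, is silently ignored.
// sketch.cpp -- the first task_scheduler_init wins; later ones are no-ops
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for_each.h"
#include <cstdio>
#include <vector>
#include "tbb_helpers.hxx"

struct burn {
    void operator()(int) const {
        ConcurrencyProfiler prof;                  // count live workers
        volatile double x = 0.0;
        for (long i = 0; i < 100000000L; ++i) x = x + 1.0;
    }
};

int main()
{
    tbb::task_scheduler_init outer(2);             // stands in for MATLAB's pool
    tbb::task_scheduler_init inner(8);             // stands in for the MEX init: ignored
    std::vector<int> tasks(100);
    ConcurrencyProfiler::Reset();
    tbb::parallel_for_each(tasks.begin(), tasks.end(), burn());
    std::printf("max workers observed: %d\n",      // expect 2, not 8
                (int)ConcurrencyProfiler::GetMaxNumThreads());
    return 0;
}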
By default MATLAB must have initialized the thread pool with a size equal to the number of physical processors (not logicals), which is indicated by the fact that on my quad-core hyper-threaded machine I get:
>> maxNumCompThreads
Warning: maxNumCompThreads will be removed in a future release [...]
ans =
4
OpenMP on the other hand has no scheduler object, and we can control the number of threads at runtime by calling the following functions:
#include <omp.h>
..
omp_set_dynamic(1);
omp_set_num_threads(omp_get_num_procs());
or by setting the environment variable:
>> setenv('OMP_NUM_THREADS', '8')
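For comparison, here is an untested sketch of what an OpenMP-based MEX-function could look like (the thread count of 8 is an assumption for a hyper-threaded quad-core). As the question linked at the top suggests, MATLAB limits TBB but not OpenMP, so the requested team size should be honored here:
// test_omp.cpp -- hypothetical OpenMP counterpart of the TBB MEX-function
#include "mex.h"
#include <omp.h>

void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
    omp_set_dynamic(0);        // do not let the runtime shrink the team
    omp_set_num_threads(8);    // assumption: 8 logical processors
    #pragma omp parallel
    {
        #pragma omp master     // only the MATLAB thread may call the MEX API
        mexPrintf("team size: %d\n", omp_get_num_threads());
    }
}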
To test this proposed explanation, here is the code I used:
test_tbb.cpp
#ifdef MATLAB_MEX_FILE
#include "mex.h"
#endif
#include <cstdlib>
#include <cstdio>
#include <cmath>
#include <vector>
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for_each.h"
#include "tbb/spin_mutex.h"
#include "tbb_helpers.hxx"

#define NTASKS 100
#define NLOOPS 400000L

tbb::spin_mutex print_mutex;

struct mytask {
    mytask(size_t n) : _n(n) {}
    void operator()()
    {
        // track maximum number of parallel workers run
        ConcurrencyProfiler prof;

        // burn some CPU cycles!
        double x = 1.0 / _n;
        for (long i = 0; i < NLOOPS; ++i) {
            x = sin(x) * 10.0;
            while ((double) rand() / RAND_MAX < 0.9);
        }

        {
            tbb::spin_mutex::scoped_lock s(print_mutex);
            fprintf(stderr, "%f\n", x);
        }
    }
    size_t _n;
};

template <typename T> struct invoker {
    void operator()(T& it) const { it(); }
};

void run()
{
    // use all 8 logical cores
    SetProcessAffinityMask(GetCurrentProcess(), 0xFF);

    printf("numTasks = %d\n", NTASKS);

    for (int t = tbb::task_scheduler_init::automatic;
         t <= 512; t = (t > 0) ? t * 2 : 1)
    {
        tbb::task_scheduler_init init(t);

        std::vector<mytask> tasks;
        for (int i = 0; i < NTASKS; ++i) {
            tasks.push_back(mytask(i));
        }

        ConcurrencyProfiler::Reset();
        tbb::parallel_for_each(tasks.begin(), tasks.end(), invoker<mytask>());

        printf("pool_init(%d) -> %d worker threads\n", t,
               (int)ConcurrencyProfiler::GetMaxNumThreads());
    }
}

#ifdef MATLAB_MEX_FILE
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
    run();
}
#else
int main()
{
    run();
    return 0;
}
#endif
Here is the code for a simple helper class used to profile concurrency by keeping track of how many workers were invoked from the thread pool. You could always use Intel VTune or any other profiling tool to get the same kind of information:
tbb_helpers.hxx
#ifndef HELPERS_H
#define HELPERS_H

#include "tbb/atomic.h"

class ConcurrencyProfiler
{
public:
    ConcurrencyProfiler();
    ~ConcurrencyProfiler();
    static void Reset();
    static size_t GetMaxNumThreads();
private:
    static void RecordMax();
    static tbb::atomic<size_t> cur_count;
    static tbb::atomic<size_t> max_count;
};

#endif
tbb_helpers.cxx
#include "tbb_helpers.hxx"
tbb::atomic<size_t> ConcurrencyProfiler::cur_count;
tbb::atomic<size_t> ConcurrencyProfiler::max_count;
ConcurrencyProfiler::ConcurrencyProfiler()
{
++cur_count;
RecordMax();
}
ConcurrencyProfiler::~ConcurrencyProfiler()
{
--cur_count;
}
void ConcurrencyProfiler::Reset()
{
cur_count = max_count = 0;
}
size_t ConcurrencyProfiler::GetMaxNumThreads()
{
return static_cast<size_t>(max_count);
}
// Performs: max_count = max(max_count,cur_count)
// http://www.threadingbuildingblocks.org/
// docs/help/tbb_userguide/Design_Patterns/Compare_and_Swap_Loop.htm
void ConcurrencyProfiler::RecordMax()
{
size_t o;
do {
o = max_count;
if (o >= cur_count) break;
} while(max_count.compare_and_swap(cur_count,o) != o);
}
First I compile the code as a native executable (I am using Intel C++ Composer XE 2013 SP1, with VS2012 Update 4):
C:\> vcvarsall.bat amd64
C:\> iclvars.bat intel64 vs2012
C:\> icl /MD test_tbb.cpp tbb_helpers.cxx tbb.lib
I run the program in the system shell (Windows 8.1). It goes up to 100% CPU utilization and I get the following output:
C:\> test_tbb.exe 2> nul
numTasks = 100
pool_init(-1) -> 8 worker threads // task_scheduler_init::automatic
pool_init(1) -> 1 worker threads
pool_init(2) -> 2 worker threads
pool_init(4) -> 4 worker threads
pool_init(8) -> 8 worker threads
pool_init(16) -> 16 worker threads
pool_init(32) -> 32 worker threads
pool_init(64) -> 64 worker threads
pool_init(128) -> 98 worker threads
pool_init(256) -> 100 worker threads
pool_init(512) -> 98 worker threads
As expected, the thread pool is initialized as large as we asked, and is fully utilized unless limited by the number of tasks we created (in the last case we have 512 threads for only 100 parallel tasks!).
Next I compile the code as a MEX-file:
>> mex -I"C:\Program Files (x86)\Intel\Composer XE\tbb\include" ...
-largeArrayDims test_tbb.cpp tbb_helpers.cxx ...
-L"C:\Program Files (x86)\Intel\Composer XE\tbb\lib\intel64\vc11" tbb.lib
Here is the output I get when I run the MEX-function in MATLAB:
>> test_tbb()
numTasks = 100
pool_init(-1) -> 4 worker threads
pool_init(1) -> 4 worker threads
pool_init(2) -> 4 worker threads
pool_init(4) -> 4 worker threads
pool_init(8) -> 4 worker threads
pool_init(16) -> 4 worker threads
pool_init(32) -> 4 worker threads
pool_init(64) -> 4 worker threads
pool_init(128) -> 4 worker threads
pool_init(256) -> 4 worker threads
pool_init(512) -> 4 worker threads
As you can see, no matter what we specify as the pool size, the scheduler always spins up at most 4 threads to execute the parallel tasks (4 being the number of physical processors on my quad-core machine). This confirms what I stated at the beginning of the post.
Note that I explicitly set the processor affinity mask to use all 8 cores, but since there were only 4 running threads, CPU usage stayed at approximately 50% in this case.
Hope this helps answer the question, and sorry for the long post :)

Assuming you have more than 4 physical cores on your machine, the affinity mask for the MATLAB standalone process is probably limiting the available CPUs. Functions called from an actual MATLAB installation should have the use of all CPUs, but this may not be the case for standalone MATLAB applications generated with the MATLAB Compiler. Try the test again, running the MEX-function directly from MATLAB. In any case, you should be able to reset the affinity mask to make all cores available to TBB, but I do not think this approach will let you coerce TBB into starting more threads than you have physical cores.
Background
Since TBB 3.0 update 4, processor affinity settings are referenced to determine the number of available cores, according to a developer blog:
So the only thing that TBB should do instead of asking the system how many CPUs it has, is to retrieve the current process affinity mask, count the number of non-zero bits in it, and voilà, TBB uses no more worker threads than necessary! And this is exactly what TBB 3.0 Update 4 does. Clarifying the statement in the end of my previous blog TBB’s methods tbb::task_scheduler_init::default_num_threads() and tbb::tbb_thread::hardware_concurrency() return not simply the total number of logical CPUs in the system or the current processor group, but rather the number of CPUs available to the process in accordance with its affinity settings.
Similarly, the docs for tbb::default_num_threads indicate this change:
Before TBB 3.0 U4 this method returned the number of logical CPU in the system. Currently on Windows, Linux and FreeBSD it returns the number of logical CPUs available to the current process in accordance with its affinity mask.
The docs for tbb::task_scheduler_init::initialize also suggest that the number of threads is "limited by the processor affinity mask".
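As a quick sanity check of that behavior, a small sketch like the following (Windows, classic TBB API) should report a smaller default_num_threads() once the affinity mask is restricted; be aware that some TBB versions may cache the value:
// affinity_check.cpp -- sketch: default_num_threads() follows the affinity mask
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <cstdio>
#include "tbb/task_scheduler_init.h"

int main()
{
    std::printf("default_num_threads = %d\n",
                tbb::task_scheduler_init::default_num_threads());
    SetProcessAffinityMask(GetCurrentProcess(), 0x3);  // restrict to 2 CPUs
    std::printf("after mask          = %d\n",          // expect 2 (unless cached)
                tbb::task_scheduler_init::default_num_threads());
    return 0;
}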
Resolution
To check if you are being limited by the affinity mask, Windows .NET functions are available:
numCoresInSystem = 16;
proc = System.Diagnostics.Process.GetCurrentProcess();
dec2bin(proc.ProcessorAffinity.ToInt32,numCoresInSystem)
The output string should have no zeros in any position representing a real (present in the system) core.
You can set the affinity mask in MATLAB or C, as described in the Q&A, Set processor affinity for MATLAB engine (Windows 7). The MATLAB way:
proc = System.Diagnostics.Process.GetCurrentProcess();
proc.ProcessorAffinity = System.IntPtr(int32(2^numCoresInSystem-1));
proc.Refresh()
Or using the Windows API, in a mexFunction, before calling task_scheduler_init:
SetProcessAffinityMask(GetCurrentProcess(),(1 << N) - 1)
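In context, a sketch of a full mexFunction prologue might look like this (N = 8 logical CPUs is an assumption; adjust to your machine, and keep in mind the caveat above that TBB may still cap the pool at the physical-core count):
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include "mex.h"
#include "tbb/task_scheduler_init.h"

void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
    const int N = 8;                       // assumed number of logical CPUs
    SetProcessAffinityMask(GetCurrentProcess(), (1ULL << N) - 1);
    tbb::task_scheduler_init init(N);      // may still be pinned by MATLAB's pool
    // ... parallel work goes here ...
}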
For *nix, you can call taskset:
system(sprintf('taskset -p %d %d',2^N - 1,feature('getpid')))

Related

AMD SMT or Intel HT performance

I don't really understand why processors with doubled logical processors are so much more expensive than processors with single logical processors; as far as I can tell there is no difference between running code with 6 or 12 threads on a 6-core/12-thread CPU.
As monkeys asked, here is a C# example emulating a heavy load on each thread:
static void Main(string[] args)
{
    if (IntPtr.Size != 8)
        throw new Exception("use only x64 code, 2020 is coming...");

    //6 for physical cores, 12 for logical cores
    const int limit_threads = 12;
    const int limit_actions = 256;
    const int limit_loop = 1000 * 1000 * 10;
    const double power = 1.0 / 17.0;

    long result = 0;
    var action = new Action(() =>
    {
        long value = 0;
        for (int i = 0; i < limit_loop; i++)
            value += (long)Math.Pow(i, power);
        Interlocked.Add(ref result, value);
    });

    var actions = Enumerable.Range(0, limit_actions).Select(x => action).ToArray();
    var sw = Stopwatch.StartNew();
    Parallel.Invoke(new ParallelOptions()
    {
        MaxDegreeOfParallelism = limit_threads
    }, actions);
    Console.WriteLine($"done in {sw.Elapsed.TotalSeconds}s\nresult={result}\nlimit_threads={limit_threads}\nlimit_actions={limit_actions}\nlimit_loop={limit_loop}");
}
Results for 6 threads (AMD Ryzen 2600):
done in 13,7074543s
result=5086445312
limit_threads=6
limit_actions=256
limit_loop=10000000
Results for 12 threads (AMD Ryzen 2600):
done in 11,3992756s
result=5086445312
limit_threads=12
limit_actions=256
limit_loop=10000000
Using all logical cores instead of only the physical ones gives about a 10% performance boost, which is almost nothing. What do you say now?
Can someone provide sample code which runs measurably faster when using processor multithreading (AMD SMT or Intel HT) compared to using only physical cores?
TLDR: SMT/HT is a technology that exists to offset the cost of massive multithreading as opposed to speeding up your computation with more cores.
You have misunderstood what SMT/HT does.
"As far as I noticed there is no difference with running code on 6 or 12 threads for 6cores-12threads CPU".
If this is true, then SMT/HT is working.
To understand why, you need to understand modern OS kernels and kernel threads. Today's operating systems use what is called preemptive threading.
The OS kernel divides up each core into time slices called "quanta", and, using interrupts, schedules the various processes in a complicated round-robin fashion.
The part we want to look at is the interrupt. When a CPU core is scheduled to switch to running another thread, we call this process a "context switch". Context switches are expensive, slow processes, as the entire state and flow of the highly pipelined CPU must be stopped, saved, and swapped out for another state (as well as other caches, registers, lookup tables, etc.). According to this answer, context-switch times are measured in microseconds (thousands of clock cycles); and they will only get worse as CPUs become more complicated.
The point of SMT/HT is to cheat, by having each CPU core able to store two states at the same time (imagine having two monitors instead of one; you still only use one at a time, but you are more productive because you don't need to rearrange your windows each time you switch tasks). So SMT/HT processors can context switch much faster than non-SMT/HT processors.
So back to your example. If you turned off SMT on your Ryzen 2600 and then ran the same workload with 12 threads, you would find that it performs significantly slower than with 6 threads.
Also, note that more threads do not automatically make things faster.
I think that varying the price of the processors depending on the availability of the SMT/HT technology is just a matter of marketing strategy.
The hardware is probably the same in every case, but the feature is disabled by the manufacturer on some of them to offer cheap models.
This technology relies on the fact that some micro-operations in a single instruction have to wait for something to be executed; so instead of just waiting, the same core uses its circuits to make some progress on the micro-operations from another thread.
From a coarse point of view, we can perceive the execution of two (or more on certain models) sequences of micro-operations from two different threads executed on a single piece of hardware (except some redundant parts, like registers...).
The efficiency of this technology depends on the problem.
After various tests I noticed that if the problem is compute bound, i.e. the limiting factor is the time needed to compute (add, multiply...), but not memory bound (the data are already available, no need to wait for the memory), then this technology does not provide any benefit.
This is due to the fact that there is no gap to fill in the two sequences of micro-operations, thus the intertwined execution of two threads is not better than two independent serial executions.
In the exact opposite case, when the problem is memory bound but not compute bound, there is no more benefit because both threads have to wait for the data coming from memory.
I only noticed an improvement in performance when the problem is mixed between data access and computation; in this case when one thread is waiting for data, the same core can make some progress in the computations of the other thread, and vice versa.
Edit
Below is an example illustrating these situations; I obtain the following results (quite stable across many runs, dual Xeon E5-2697 v2, Linux 5.3.13).
In this memory bound situation HT does not help.
$ ./prog_ht mem
24 threads running memory_task()
result: 1e+17
duration: 13.0383 seconds
$ ./prog_ht mem ht
48 threads (ht) running memory_task()
result: 1e+17
duration: 13.1096 seconds
In this compute bound situation HT helps (almost 30% gain). (I don't know exactly which hardware details are involved when computing cos, but there must be some latencies which are not due to memory access.)
$ ./prog_ht
24 threads running compute_task()
result: -260.782
duration: 9.76226 seconds
$ ./prog_ht ht
48 threads (ht) running compute_task()
result: -260.782
duration: 7.58181 seconds
In this mixed situation HT helps much more (around 70% gain)
$ ./prog_ht mix
24 threads running mixed_task()
result: -260.782
duration: 60.1602 seconds
$ ./prog_ht mix ht
48 threads (ht) running mixed_task()
result: -260.782
duration: 35.121 seconds
Here is the source code (in C++; I'm not comfortable with C#).
/*
  g++ -std=c++17 -o prog_ht prog_ht.cpp \
      -pedantic -Wall -Wextra -Wconversion \
      -Wno-missing-braces -Wno-sign-conversion \
      -O3 -ffast-math -march=native -fomit-frame-pointer -DNDEBUG \
      -pthread
*/

#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <thread>
#include <chrono>
#include <cstdint>
#include <random>
#include <cmath>
#include <pthread.h>

bool // success
bind_current_thread_to_cpu(int cpu_id)
{
  /* !!!!!!!!!!!!!! WARNING !!!!!!!!!!!!!
     I checked the numbering of the CPUs according to the packages and cores
     on my computer/system (dual Xeon E5-2697 v2, Linux 5.3.13)
       0 to 11 --> different cores of package 1
      12 to 23 --> different cores of package 2
      24 to 35 --> different cores of package 1
      36 to 47 --> different cores of package 2
     Thus using cpu_id from 0 to 23 does not bind more than one thread
     to each single core (no HT).
     Of course using cpu_id from 0 to 47 binds two threads to each single
     core (HT is used).
     This numbering is absolutely NOT guaranteed on any other computer/system,
     thus the relation between thread numbers and cpu_id should be adapted
     accordingly.
  */
  cpu_set_t cpu_set;
  CPU_ZERO(&cpu_set);
  CPU_SET(cpu_id, &cpu_set);
  return !pthread_setaffinity_np(pthread_self(), sizeof(cpu_set), &cpu_set);
}

inline
double // seconds since 1970/01/01 00:00:00 UTC
system_time()
{
  const auto now=std::chrono::system_clock::now().time_since_epoch();
  return 1e-6*double(std::chrono::duration_cast
                     <std::chrono::microseconds>(now).count());
}

constexpr auto count=std::int64_t{20'000'000};
constexpr auto repeat=500;

void
compute_task(int thread_id,
             int thread_count,
             const int *indices,
             const double *inputs,
             double *results)
{
  (void)indices; // not used here
  (void)inputs;  // not used here
  bind_current_thread_to_cpu(thread_id);
  const auto work_begin=count*thread_id/thread_count;
  const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
  auto result=0.0;
  for(auto r=0; r<repeat; ++r)
  {
    for(auto i=work_begin; i<work_end; ++i)
    {
      result+=std::cos(double(i));
    }
  }
  results[thread_id]+=result;
}

void
mixed_task(int thread_id,
           int thread_count,
           const int *indices,
           const double *inputs,
           double *results)
{
  bind_current_thread_to_cpu(thread_id);
  const auto work_begin=count*thread_id/thread_count;
  const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
  auto result=0.0;
  for(auto r=0; r<repeat; ++r)
  {
    for(auto i=work_begin; i<work_end; ++i)
    {
      const auto index=indices[i];
      result+=std::cos(inputs[index]);
    }
  }
  results[thread_id]+=result;
}

void
memory_task(int thread_id,
            int thread_count,
            const int *indices,
            const double *inputs,
            double *results)
{
  bind_current_thread_to_cpu(thread_id);
  const auto work_begin=count*thread_id/thread_count;
  const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
  auto result=0.0;
  for(auto r=0; r<repeat; ++r)
  {
    for(auto i=work_begin; i<work_end; ++i)
    {
      const auto index=indices[i];
      result+=inputs[index];
    }
  }
  results[thread_id]+=result;
}

int
main(int argc,
     char **argv)
{
  //~~~~ analyse command line arguments ~~~~
  const auto args=std::vector<std::string>{argv, argv+argc};
  const auto has_arg=
    [&](const auto &a)
    {
      return std::find(cbegin(args)+1, cend(args), a)!=cend(args);
    };
  const auto use_ht=has_arg("ht");
  const auto thread_count=int(std::thread::hardware_concurrency())
                          /(use_ht ? 1 : 2);
  const auto use_mix=has_arg("mix");
  const auto use_mem=has_arg("mem");
  const auto task=use_mem ? memory_task
                : use_mix ? mixed_task
                : compute_task;
  const auto task_name=use_mem ? "memory_task"
                     : use_mix ? "mixed_task"
                     : "compute_task";
  //~~~~ prepare input/output data ~~~~
  auto results=std::vector<double>(thread_count);
  auto indices=std::vector<int>(count);
  auto inputs=std::vector<double>(count);
  std::generate(begin(indices), end(indices),
                [i=0]() mutable { return i++; });
  std::copy(cbegin(indices), cend(indices), begin(inputs));
  std::shuffle(begin(indices), end(indices), // fight the prefetcher!
               std::default_random_engine{std::random_device{}()});
  //~~~~ launch threads ~~~~
  std::cout << thread_count << " threads" << (use_ht ? " (ht)" : "")
            << " running " << task_name << "()\n";
  auto threads=std::vector<std::thread>(thread_count);
  const auto t0=system_time();
  for(auto i=0; i<thread_count; ++i)
  {
    threads[i]=std::thread{task, i, thread_count,
                           data(indices), data(inputs), data(results)};
  }
  //~~~~ wait for threads ~~~~
  auto result=0.0;
  for(auto i=0; i<thread_count; ++i)
  {
    threads[i].join();
    result+=results[i];
  }
  const auto duration=system_time()-t0;
  std::cout << "result: " << result << '\n';
  std::cout << "duration: " << duration << " seconds\n";
  return 0;
}

VC++: crash when freeing a DLL built with openMP

I've reduced a crash to the following toy code:
// DLLwithOMP.cpp : build into a dll *with* /openmp
#include <tchar.h>

extern "C"
{
    __declspec(dllexport) void funcOMP()
    {
        #pragma omp parallel for
        for (int i = 0; i < 100; i++)
            _tprintf(_T("Please fondle my buttocks\n"));
    }
}
// ConsoleApplication1.cpp : build into an executable *without* /openmp
#include <windows.h>
#include <stdio.h>
#include <tchar.h>

typedef void (*tDllFunc)();

int main()
{
    HMODULE hDLL = LoadLibrary(_T("DLLwithOMP.dll"));
    tDllFunc pDllFunc = (tDllFunc)GetProcAddress(hDLL, "funcOMP");
    pDllFunc();
    FreeLibrary(hDLL);
    // At this point the omp runtime vcomp140[d].dll refcount is zero
    // and windows unloads it, but the omp thread team remains active.
    // A crash usually ensues.
    return 0;
}
Is this an MS bug? Is there some OMP thread-cleanup API I missed (probably not, but maybe)? I don't have other compilers at hand; do they treat this scenario differently? (Again, probably not.) Does the OMP standard have anything to say about such a scenario?
I got an answer from Eric Brumer @ MS Connect. Re-posting it here in case it is of interest to anyone in the future:
for optimal performance, the openmp threadpool spin waits for about a
second prior to shutting down in case more work becomes available. If
you unload a DLL that's in the process of spin-waiting, it will crash
in the manner you see (most of the time).
You can tell openmp not to spin-wait and the threads will immediately
block after the loop finishes. Just set OMP_WAIT_POLICY=passive in
your environment, or call SetEnvironmentVariable(L"OMP_WAIT_POLICY",
L"passive"); in your function before loading the dll. The default is
"active" which tells the threadpool to spin wait. Use the environment
variable, or just wait a few seconds before calling FreeLibrary.
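For reference, here is a minimal sketch of the second workaround applied to the console host from the question; the variable must be set before LoadLibrary pulls in the OpenMP runtime:
// ConsoleApplication1.cpp, patched as suggested above (sketch)
#include <windows.h>
#include <tchar.h>

typedef void (*tDllFunc)();

int main()
{
    // must happen before the DLL (and vcomp140[d].dll) is loaded
    SetEnvironmentVariable(_T("OMP_WAIT_POLICY"), _T("passive"));

    HMODULE hDLL = LoadLibrary(_T("DLLwithOMP.dll"));
    tDllFunc pDllFunc = (tDllFunc)GetProcAddress(hDLL, "funcOMP");
    pDllFunc();
    FreeLibrary(hDLL); // threads now block instead of spin-waiting: no crash expected
    return 0;
}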

OpenMP behaviour detecting CPU and thread

I'm at the very beginning with OpenMP. I just compiled the following piece of code with gcc -fopenmp openmp_c_helloworld.c:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
    int th_id, nthreads;
    #pragma omp parallel private(th_id)
    {
        th_id = omp_get_thread_num();
        printf("Hello World from thread %d\n", th_id);
        #pragma omp barrier
        if ( th_id == 0 ) {
            nthreads = omp_get_num_threads();
            printf("There are %d threads\n", nthreads);
        }
    }
    return EXIT_SUCCESS;
}
I just ran the executable on a quad-core Intel CPU with HyperThreading and obtained the following output:
Hello World from thread 2
Hello World from thread 0
Hello World from thread 3
Hello World from thread 1
There are 4 threads
Technically speaking I have 8 hardware threads available on my CPU and 4 CPU cores, so why does OpenMP show me only 4 threads?
To put it simply, I think it's because OpenMP looks at the number of CPUs (cores) rather than the number of processor threads.
See this page:
Implementation default - usually the number of CPUs on a node, though
it could be dynamic (see next bullet).
Something you could try is setting the number of threads in your program equal to the number of processor threads and seeing if there's a performance improvement; you'll have to create your own benchmarking program, as sketched below.
In parallel programming, good performance is obtained when the number of worker threads is equal to the number of processor threads. You can keep a thread or two extra for I/O as well.
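A minimal sketch of such a benchmark could look like this (the thread count of 8 and the workload are arbitrary assumptions; compile with gcc -fopenmp and compare timings for different thread counts):
#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_num_threads(8);  /* assumed: number of processor threads, not cores */

    double sum = 0.0;
    double t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i < 100000000L; i++)
        sum += 1.0 / (double)i;  /* arbitrary CPU-bound workload */
    printf("sum = %f in %f s with %d threads\n",
           sum, omp_get_wtime() - t0, omp_get_max_threads());
    return 0;
}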

max thread per process in linux

I wrote a simple program to calculate the maximum number of threads that a process can have in Linux (CentOS 5). Here is the code:
#include <iostream>
#include <pthread.h>
#include <unistd.h>
using namespace std;

void* thread(void* i)
{
    sleep(100); // keep the thread alive
    return 0;
}

int main()
{
    pthread_t thrd[400];
    for (int i = 0; i < 400; i++)
    {
        int err = pthread_create(&thrd[i], NULL, thread, (void*)(long)i);
        if (err != 0)
            cout << "thread creation failed: " << i << " error code: " << err << endl;
    }
    return 0;
}
I found that the maximum number of threads is only 300!? What if I need more than that?
I have to mention that pthread_create returns 12 (ENOMEM, out of memory) as the error code.
Thanks in advance.
There is a thread limit in Linux and it can be modified at runtime by writing the desired limit to /proc/sys/kernel/threads-max. The default value is computed from the available system memory. In addition to that limit, there's also another one: /proc/sys/vm/max_map_count, which limits the maximum number of mmapped segments, and at least recent kernels will mmap memory per thread. It should be safe to increase that limit a lot if you hit it.
However, the limit you're hitting is a lack of virtual memory in a 32-bit operating system. Install a 64-bit Linux if your hardware supports it and you'll be fine. I can easily start 30000 threads with a stack size of 8MB. The system has a single Core 2 Duo + 8 GB of system memory (I'm using 5 GB for other stuff at the same time) and it's running 64-bit Ubuntu with kernel 2.6.32. Note that memory overcommit (/proc/sys/vm/overcommit_memory) must be allowed, because otherwise the system would need at least 240 GB of committable memory (sum of real memory and swap space).
If you need lots of threads and cannot use a 64-bit system, your only choice is to minimize the memory usage per thread to conserve virtual memory. Start with requesting as little stack as you can live with.
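For example, here is a sketch of creating a thread with an explicitly small stack; the 64 KiB figure is only an illustration, pick the smallest size your threads can live with:
#include <pthread.h>

void* worker(void* arg) { return arg; }

int main()
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 64 * 1024); /* 64 KiB instead of the 8-10 MB default */

    pthread_t tid;
    pthread_create(&tid, &attr, worker, 0);
    pthread_join(tid, 0);
    pthread_attr_destroy(&attr);
    return 0;
}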
Your system limits may not be allowing you to map the stacks of all the threads you require. Look at /proc/sys/vm/max_map_count, and see this answer. I'm not 100% sure this is your problem, because most people run into problems at much larger thread counts.
I also encountered the same problem when my number of threads crossed some threshold.
It was because of the user-level limit (the number of processes a user can run at a time), set to 1024 in /etc/security/limits.conf.
So check your /etc/security/limits.conf and look for an entry like:
username -/soft/hard -nproc 1024
Change it to some larger value, e.g. 100k (this requires sudo/root privileges), and it should work for you.
To learn more about the security policy, see http://linux.die.net/man/5/limits.conf.
Check the stack size per thread with ulimit; in my case, Red Hat Linux 2.6:
ulimit -a
...
stack size (kbytes, -s) 10240
Each of your threads will get this amount of memory (10MB) assigned for its stack. With a 32-bit program and a maximum address space of 4GB, that is a maximum of only 4096MB / 10MB = 409 threads!!! Minus program code, minus heap space, that probably explains your observed maximum of 300 threads.
You should be able to raise this by compiling a 64-bit application or by setting ulimit -s 8192 or even ulimit -s 4096. But whether this is advisable is another discussion...
You will also run out of memory unless you shrink the default thread stack size. It's 10MB on our version of Linux.
EDIT:
Error code 12 = out of memory, so I think the 1MB stack is still too big for you. Compiled for 32 bit, I can get a 100k stack to give me 30k threads. Beyond 30k threads I get error code 11, which means no more threads allowed. A 1MB stack gives me about 4k threads before error code 12. 10MB gives me 427 threads. 100MB gives me 42 threads. 1GB gives me 4... We have a 64-bit OS with 64GB of RAM. Is your OS 32 bit? When I compile for 64 bit, I can use any stack size I want and get the limit of threads.
Also, I noticed that if I turn the profiling stuff (Tools|Profiling) on in NetBeans and run from the IDE... I can only get 400 threads. Weird. NetBeans also dies if you use up all the threads.
Here is a test app you can run:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <signal.h>
#include <sched.h>

// this prevents the compiler from reordering code over this COMPILER_BARRIER
// (it generates no actual instructions)
#define COMPILER_BARRIER() __asm__ __volatile__ ("" ::: "memory")

sigset_t _fSigSet;
volatile int _cActive = 0;
pthread_t thrd[1000000];

void * thread(void *i)
{
    int nSig, cActive;
    cActive = __sync_fetch_and_add(&_cActive, 1);
    COMPILER_BARRIER(); // make sure the active count is incremented before sigwait
    // sigwait is a handy way to sleep a thread and wake it on command
    sigwait(&_fSigSet, &nSig); // keep the thread alive
    COMPILER_BARRIER(); // make sure the active count is decremented after sigwait
    cActive = __sync_fetch_and_add(&_cActive, -1);
    //printf("%ld(%d) ", (long)i, cActive);
    return 0;
}

int main(int argc, char** argv)
{
    pthread_attr_t attr;
    int cThreadRequest, cThreads, i, err, cActive, cbStack;

    cbStack = (argc > 1) ? atoi(argv[1]) : 0x100000;
    cThreadRequest = (argc > 2) ? atoi(argv[2]) : 30000;

    sigemptyset(&_fSigSet);
    sigaddset(&_fSigSet, SIGUSR1);
    sigaddset(&_fSigSet, SIGSEGV);

    printf("Start\n");

    pthread_attr_init(&attr);
    if ((err = pthread_attr_setstacksize(&attr, cbStack)) != 0)
        printf("pthread_attr_setstacksize failed: err: %d %s\n", err, strerror(err));

    for (i = 0; i < cThreadRequest; i++)
    {
        if ((err = pthread_create(&thrd[i], &attr, thread, (void*)(long)i)) != 0)
        {
            printf("pthread_create failed on thread %d, error code: %d %s\n",
                   i, err, strerror(err));
            break;
        }
    }
    cThreads = i;
    printf("\n");

    // wait for threads to all be created, although we might not wait for
    // all threads to make it through sigwait
    while (1)
    {
        cActive = _cActive;
        if (cActive == cThreads)
            break;
        printf("Waiting A %d/%d,", cActive, cThreads);
        sched_yield();
    }

    // wake em all up so they exit
    for (i = 0; i < cThreads; i++)
        pthread_kill(thrd[i], SIGUSR1);

    // wait for them all to exit, although we might be able to exit before
    // the last thread returns
    while (1)
    {
        cActive = _cActive;
        if (!cActive)
            break;
        printf("Waiting B %d/%d,", cActive, cThreads);
        sched_yield();
    }

    printf("\nDone. Threads requested: %d. Threads created: %d. StackSize=%lfmb\n",
           cThreadRequest, cThreads, (double)cbStack/0x100000);
    return 0;
}

Getting stack traces on Unix systems, automatically

What methods are there for automatically getting a stack trace on Unix systems? I don't mean just getting a core file or attaching interactively with GDB, but having a SIGSEGV handler that dumps a backtrace to a text file.
Bonus points for the following optional features:
Extra information gathering at crash time (eg. config files).
Email a crash info bundle to the developers.
Ability to add this in a dlopened shared library
Not requiring a GUI
FYI,
the suggested solution (using backtrace_symbols in a signal handler) is dangerously broken. DO NOT USE IT.
Yes, backtrace and backtrace_symbols will produce a backtrace and translate it to symbolic names, however:
backtrace_symbols allocates memory using malloc, and you use free to free it. If you're crashing because of memory corruption, your malloc arena is very likely to be corrupt and cause a double fault.
malloc and free protect the malloc arena with a lock internally. You might have faulted in the middle of a malloc/free with the lock taken, which will cause these functions, or anything that calls them, to deadlock.
You use puts, which uses the standard stream, which is also protected by a lock. If you faulted in the middle of a printf, you once again have a deadlock.
On 32-bit platforms (e.g. your normal PC of 2 years ago), the kernel will plant a return address to an internal glibc function instead of your faulting function in your stack, so the single most important piece of information you are interested in - in which function did the program fault - will actually be corrupted on those platforms.
So, the code in the example is the worst kind of wrong - it LOOKS like it's working, but it will really fail you in unexpected ways in production.
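If you must print a trace from inside the handler anyway, a commonly suggested mitigation for the first three points is backtrace_symbols_fd, which writes straight to a file descriptor without calling malloc or the stdio streams. This is a sketch, not a guarantee of full async-signal-safety:
#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

void sig_handler(int sig)
{
    void *array[25];
    int nSize = backtrace(array, 25);
    backtrace_symbols_fd(array, nSize, STDERR_FILENO); // no malloc, no stream locks
    _exit(128 + sig);                                  // async-signal-safe exit
}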
BTW, interested in doing it right? check this out.
Cheers,
Gilad.
If you are on systems where the BSD backtrace functionality is available (Linux, OS X 10.5, the BSDs of course), you can do this programmatically in your signal handler.
For example (backtrace code derived from IBM example):
#include <execinfo.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

void sig_handler(int sig)
{
    void * array[25];
    int nSize = backtrace(array, 25);
    char ** symbols = backtrace_symbols(array, nSize);

    for (int i = 0; i < nSize; i++)
    {
        puts(symbols[i]);
    }
    free(symbols);

    signal(sig, &sig_handler);
}

void h()
{
    kill(0, SIGSEGV);
}

void g()
{
    h();
}

void f()
{
    g();
}

int main(int argc, char ** argv)
{
    signal(SIGSEGV, &sig_handler);
    f();
}
Output:
0 a.out 0x00001f2d sig_handler + 35
1 libSystem.B.dylib 0x95f8f09b _sigtramp + 43
2 ??? 0xffffffff 0x0 + 4294967295
3 a.out 0x00001fb1 h + 26
4 a.out 0x00001fbe g + 11
5 a.out 0x00001fcb f + 11
6 a.out 0x00001ff5 main + 40
7 a.out 0x00001ede start + 54
This doesn't get bonus points for the optional features (except not requiring a GUI), however, it does have the advantage of being very simple, and not requiring any additional libraries or programs.
Here is an example of how to get some more info using a demangler. As you can see this one also logs the stacktrace to file.
#include <iostream>
#include <sstream>
#include <string>
#include <fstream>
#include <cstdlib>
#include <cxxabi.h>
#include <execinfo.h>
#include <signal.h>

void sig_handler(int sig)
{
    std::stringstream stream;
    void * array[25];
    int nSize = backtrace(array, 25);
    char ** symbols = backtrace_symbols(array, nSize);

    for (int i = 0; i < nSize; i++) {
        int status;
        char *realname;
        std::string current = symbols[i];
        size_t start = current.find("(");
        size_t end = current.find("+");
        realname = NULL;
        if (start != std::string::npos && end != std::string::npos) {
            std::string symbol = current.substr(start+1, end-start-1);
            realname = abi::__cxa_demangle(symbol.c_str(), 0, 0, &status);
        }
        if (realname != NULL)
            stream << realname << std::endl;
        else
            stream << symbols[i] << std::endl;
        free(realname);
    }
    free(symbols);

    std::cerr << stream.str();

    std::ofstream file("/tmp/error.log");
    if (file.is_open()) {
        if (file.good())
            file << stream.str();
        file.close();
    }

    signal(sig, &sig_handler);
}
Derek's solution is probably the best, but here's an alternative anyway:
Recent Linux kernel versions allow you to pipe core dumps to a script or program. You could write a script to catch the core dump, collect any extra information you need, and mail everything back.
This is a global setting though, so it'd apply to any crashing program on the system. It will also require root rights to set up.
It can be configured through the /proc/sys/kernel/core_pattern file. Set that to something like ' | /home/myuser/bin/my-core-handler-script'.
The Ubuntu people use this feature as well.
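As an illustration, a hypothetical minimal handler (the path and file names here are made up) would just save the dump it receives on stdin; the extra-information gathering and emailing would hang off the end. Set core_pattern to something like '|/usr/local/bin/core-handler %e %p':
// core-handler.cpp -- hypothetical pipe target for /proc/sys/kernel/core_pattern
#include <cstdio>

int main(int argc, char** argv)
{
    const char* exe = (argc > 1) ? argv[1] : "unknown"; // %e: executable name
    const char* pid = (argc > 2) ? argv[2] : "0";       // %p: process id

    char path[256];
    std::snprintf(path, sizeof path, "/tmp/core.%s.%s", exe, pid);
    std::FILE* out = std::fopen(path, "wb");
    if (!out) return 1;

    // the kernel streams the core dump to our stdin
    char buf[1 << 16];
    size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, stdin)) > 0)
        std::fwrite(buf, 1, n, out);
    std::fclose(out);
    // collect extra info / email the crash bundle here
    return 0;
}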
