C/C++ MPI speedup is not as expected - linux

I am trying to write an MPI application to speed up a math algorithm on a computer cluster, but first I am doing some benchmarking, and the first results are not as good as expected.
The test application scales almost linearly up to 4 cores, but using 5 or 6 cores does not speed it up further. I am testing on an Odroid N2 platform, which has 6 cores; nproc also reports 6 cores available.
Am I missing some kind of configuration? Or is my code not prepared well enough (it is based on one of the basic MPI examples)?
Is there some response time or synchronization overhead that should be taken into account?
Here are some measurements from my MPI-based application. I measured the total calculation time of one function.
1 core: 0.838052 sec
2 cores: 0.438483 sec
3 cores: 0.405501 sec
4 cores: 0.416391 sec
5 cores: 0.514472 sec
6 cores: 0.435128 sec
12 cores (4 cores from each of 3 N2 boards): 0.06867 sec
18 cores (6 cores from each of 3 N2 boards): 0.152759 sec
I did the same benchmark on a Raspberry Pi 4 with 4 cores:
1 core: 1.51 sec
2 cores: 0.75 sec
3 cores: 0.69 sec
4 cores: 0.67 sec
And this is my benchmark application:
int MyFun(int *array, int num_elements, int j)
{
    int result_overall = 0;
    for (int i = 0; i < num_elements; i++)
    {
        result_overall += array[i] / 1000;
    }
    return result_overall;
}

int compute_sum(int* sub_sums, int num_of_cpu)
{
    int sum = 0;
    for (int i = 0; i < num_of_cpu; i++)
    {
        sum += sub_sums[i];
    }
    return sum;
}
// measuring performance from main(): num_elements_per_proc is equal to 604800
if (world_rank == 0)
{
    startTime = std::chrono::high_resolution_clock::now();
}
// Compute the sum of your subset
int sub_sum = 0;
for (int j = 0; j < 1000; j++)
{
    sub_sum += MyFun(sub_intArray, num_elements_per_proc, world_rank);
}
MPI_Allgather(&sub_sum, 1, MPI_INT, sub_sums, 1, MPI_INT, MPI_COMM_WORLD);
int total_sum = compute_sum(sub_sums, num_of_cpu);
if (world_rank == 0)
{
    elapsedTime = std::chrono::high_resolution_clock::now() - startTime;
    timer = elapsedTime.count();
}
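As an aside, the gather-and-sum step can be expressed as a single reduction. A minimal sketch, reusing sub_sum and the communicator from the code above (only rank 0 receives the total; MPI_Allreduce would give it to every rank):

// Let MPI sum the partial results instead of MPI_Allgather + compute_sum().
int total_sum = 0;
MPI_Reduce(&sub_sum, &total_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

For such a tiny message this will not change the scaling picture, but it removes the manual compute_sum() loop.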
I build it with the -O3 optimization level.
UPDATE:
New measurements:
60480 samples, MyFun called 100000 times:
1.47 -> 0.74 -> 0.48 -> 0.36
6048 samples, MyFun called 1000000 times:
1.43 -> 0.7 -> 0.47 -> 0.35
6048 samples, MyFun called 10000000 times:
14.43 -> 7.08 -> 4.72 -> 3.59
UPDATE2:
By the way, when I list the CPU info in Linux, is this normal? The quad-core Cortex-A73 cluster does not show up, and it reports two sockets with 3 cores each.
Looking at the CPU utilization with sar, it seems like all of the cores are utilized.
I created some plots of the speedup. It seems that calculating on float instead of int helps a bit, but cores 5 and 6 do not help much. I also think memory bandwidth is okay. Is this normal behavior when all CPUs are utilized equally on a big.LITTLE architecture?
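One quick way to see which logical CPUs belong to the fast and slow clusters is to read each CPU's reported maximum frequency from sysfs. This is only a sketch and assumes a Linux kernel with cpufreq enabled (the path may differ on other kernels):

#include <fstream>
#include <iostream>
#include <string>

// Print the reported maximum frequency of every logical CPU so the big and
// LITTLE clusters can be told apart.
int main()
{
    for (int cpu = 0; ; ++cpu)
    {
        std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu)
                        + "/cpufreq/cpuinfo_max_freq");
        if (!f)
            break;                       // no more CPUs (or no cpufreq entry)
        long khz = 0;
        f >> khz;
        std::cout << "cpu" << cpu << ": " << khz / 1000 << " MHz max\n";
    }
    return 0;
}

CPUs reporting the same maximum frequency belong to the same cluster, which also helps to interpret the confusing "two sockets with 3 cores each" listing.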

Related

AMD SMT or Intel HT performance

I don't really understand why processors with twice the number of logical processors are so much more expensive than processors with a single logical processor per core. As far as I can see, there is no difference between running code on 6 or 12 threads on a 6-core/12-thread CPU.
As requested in the comments, here is a C# example emulating a heavy load on each thread:
static void Main(string[] args)
{
    if (IntPtr.Size != 8)
        throw new Exception("use only x64 code, 2020 is coming...");
    //6 for physical cores, 12 for logical cores
    const int limit_threads = 12;
    const int limit_actions = 256;
    const int limit_loop = 1000 * 1000 * 10;
    const double power = 1.0 / 17.0;
    long result = 0;
    var action = new Action(() =>
    {
        long value = 0;
        for (int i = 0; i < limit_loop; i++)
            value += (long)Math.Pow(i, power);
        Interlocked.Add(ref result, value);
    });
    var actions = Enumerable.Range(0, limit_actions).Select(x => action).ToArray();
    var sw = Stopwatch.StartNew();
    Parallel.Invoke(new ParallelOptions()
    {
        MaxDegreeOfParallelism = limit_threads
    }, actions);
    Console.WriteLine($"done in {sw.Elapsed.TotalSeconds}s\nresult={result}\nlimit_threads={limit_threads}\nlimit_actions={limit_actions}\nlimit_loop={limit_loop}");
}
Results for 6 threads (AMD Ryzen 2600):
done in 13,7074543s
result=5086445312
limit_threads=6
limit_actions=256
limit_loop=10000000
Results for 12 threads (AMD Ryzen 2600):
done in 11,3992756s
result=5086445312
limit_threads=12
limit_actions=256
limit_loop=10000000
It's about a 10% performance boost from using all logical cores instead of only the physical ones, which is almost nothing. What do you say now?
Can someone provide sample code which runs noticeably faster when using processor multi-threading (AMD SMT or Intel HT) compared to using only the physical cores?
TLDR: SMT/HT is a technology that exists to offset the cost of massive multithreading as opposed to speeding up your computation with more cores.
You have misunderstood what SMT/HT does.
"As far as I noticed there is no difference with running code on 6 or 12 threads for 6cores-12threads CPU".
If this is true, then SMT/HT is working.
To understand why, you need to understand modern OS kernels and Kernel Threads. Today's Operating Systems use what is called Preemptive Threading.
The OS kernel divides up each core's time into slices called "quanta", and uses interrupts to schedule the various processes in a complicated round-robin fashion.
The part we want to look at is the interrupt. When a CPU core is scheduled to switch to running another thread, we call this process a "context switch". Context switches are expensive, slow operations, as the entire state and flow of the highly pipelined CPU must be stopped, saved and swapped out for another state (along with caches, registers, lookup tables etc.). According to this answer, context switch times are measured in microseconds (thousands of clock cycles), and they will only get worse as CPUs become more complicated.
The point of SMT/HT is to cheat, by having each CPU core able to store two states at the same time (imagine having two monitors instead of one; you still only use one at a time, but you are more productive because you don't need to rearrange your windows each time you switch tasks). So SMT/HT processors can context switch much faster than non-SMT/HT processors.
So back to your example. If you turned off SMT on your Ryzen 2600, then ran the same workload with 12 threads, you will find that it performs significantly slower than with 6 threads.
Also, note, more threads does not make things faster.
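If you want to check how many physical cores sit behind the logical CPUs the OS reports, one option is to count distinct (package, core) pairs from the usual Linux sysfs topology files. This is only a Linux-specific sketch:

#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <thread>
#include <utility>

// hardware_concurrency() counts logical CPUs (SMT/HT siblings included); the
// set of distinct (physical_package_id, core_id) pairs counts physical cores.
int main()
{
    std::set<std::pair<int, int>> cores;
    const unsigned logical = std::thread::hardware_concurrency();
    for (unsigned cpu = 0; cpu < logical; ++cpu)
    {
        const std::string base = "/sys/devices/system/cpu/cpu"
                                 + std::to_string(cpu) + "/topology/";
        int pkg = -1, core = -1;
        std::ifstream(base + "physical_package_id") >> pkg;
        std::ifstream(base + "core_id") >> core;
        cores.insert({pkg, core});
    }
    std::cout << logical << " logical CPUs, "
              << cores.size() << " physical cores\n";
    return 0;
}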
I think that varying the price of the processors depending on the availability of the SMT/HT technology is just a matter of marketing strategy.
The hardware is probably the same in every case, but the feature is disabled by the manufacturer on some of them to offer cheap models.
This technology relies on the fact that some micro-operations in a single instruction have to wait for something before being executed; so instead of just waiting, the same core uses its circuits to make some progress on the micro-operations from another thread.
From a coarse point of view, we can perceive the execution of two (or more on certain models) sequences of micro-operations from two different threads executed on a single piece of hardware (except for some duplicated parts, like registers...).
The efficiency of this technology depends on the problem.
After various tests I noticed that if the problem is compute bound, i.e. the limiting factor is the time needed to compute (add, multiply...), but not memory bound (the data are already available, no need to wait for memory), then this technology does not provide any benefit.
This is due to the fact that there is no gap to fill in the two sequences of micro-operations, thus the intertwined execution of two threads is not better than two independent serial executions.
In the exact opposite case, when the problem is memory bound but not compute bound, there is no benefit either, because both threads have to wait for the data coming from memory.
I only noticed an improvement in performance when the problem is mixed between data access and computation; in this case, when one thread is waiting for data, the same core can make some progress on the computations of the other thread, and vice versa.
Edit
Below is an example to illustrate these situations, and I obtain the following results (quite stable when run many times, dual Xeon E5-2697 v2, Linux 5.3.13).
In this memory bound situation HT does not help.
$ ./prog_ht mem
24 threads running memory_task()
result: 1e+17
duration: 13.0383 seconds
$ ./prog_ht mem ht
48 threads (ht) running memory_task()
result: 1e+17
duration: 13.1096 seconds
In this compute-bound situation HT helps (almost 30% gain).
(I don't know exactly what is involved in the hardware when computing cos, but there must be some latencies which are not due to memory access.)
$ ./prog_ht
24 threads running compute_task()
result: -260.782
duration: 9.76226 seconds
$ ./prog_ht ht
48 threads (ht) running compute_task()
result: -260.782
duration: 7.58181 seconds
In this mixed situation HT helps much more (around 70% gain)
$ ./prog_ht mix
24 threads running mixed_task()
result: -260.782
duration: 60.1602 seconds
$ ./prog_ht mix ht
48 threads (ht) running mixed_task()
result: -260.782
duration: 35.121 seconds
Here is the source code (in C++, I'm not comfortable with C#).
/*
g++ -std=c++17 -o prog_ht prog_ht.cpp \
-pedantic -Wall -Wextra -Wconversion \
-Wno-missing-braces -Wno-sign-conversion \
-O3 -ffast-math -march=native -fomit-frame-pointer -DNDEBUG \
-pthread
*/
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <thread>
#include <chrono>
#include <cstdint>
#include <random>
#include <cmath>
#include <pthread.h>
bool // success
bind_current_thread_to_cpu(int cpu_id)
{
/* !!!!!!!!!!!!!! WARNING !!!!!!!!!!!!!
I checked the numbering of the CPUs according to the packages and cores
on my computer/system (dual Xeon E5-2697 v2, Linux 5.3.13)
0 to 11 --> different cores of package 1
12 to 23 --> different cores of package 2
24 to 35 --> different cores of package 1
36 to 47 --> different cores of package 2
Thus using cpu_id from 0 to 23 does not bind more than one thread
to each single core (no HT).
Of course using cpu_id from 0 to 47 binds two threads to each single
core (HT is used).
This numbering is absolutely NOT guaranteed on any other computer/system,
thus the relation between thread numbers and cpu_id should be adapted
accordingly.
*/
cpu_set_t cpu_set;
CPU_ZERO(&cpu_set);
CPU_SET(cpu_id, &cpu_set);
return !pthread_setaffinity_np(pthread_self(), sizeof(cpu_set), &cpu_set);
}
inline
double // seconds since 1970/01/01 00:00:00 UTC
system_time()
{
const auto now=std::chrono::system_clock::now().time_since_epoch();
return 1e-6*double(std::chrono::duration_cast
<std::chrono::microseconds>(now).count());
}
constexpr auto count=std::int64_t{20'000'000};
constexpr auto repeat=500;
void
compute_task(int thread_id,
int thread_count,
const int *indices,
const double *inputs,
double *results)
{
(void)indices; // not used here
(void)inputs; // not used here
bind_current_thread_to_cpu(thread_id);
const auto work_begin=count*thread_id/thread_count;
const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
auto result=0.0;
for(auto r=0; r<repeat; ++r)
{
for(auto i=work_begin; i<work_end; ++i)
{
result+=std::cos(double(i));
}
}
results[thread_id]+=result;
}
void
mixed_task(int thread_id,
int thread_count,
const int *indices,
const double *inputs,
double *results)
{
bind_current_thread_to_cpu(thread_id);
const auto work_begin=count*thread_id/thread_count;
const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
auto result=0.0;
for(auto r=0; r<repeat; ++r)
{
for(auto i=work_begin; i<work_end; ++i)
{
const auto index=indices[i];
result+=std::cos(inputs[index]);
}
}
results[thread_id]+=result;
}
void
memory_task(int thread_id,
int thread_count,
const int *indices,
const double *inputs,
double *results)
{
bind_current_thread_to_cpu(thread_id);
const auto work_begin=count*thread_id/thread_count;
const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
auto result=0.0;
for(auto r=0; r<repeat; ++r)
{
for(auto i=work_begin; i<work_end; ++i)
{
const auto index=indices[i];
result+=inputs[index];
}
}
results[thread_id]+=result;
}
int
main(int argc,
char **argv)
{
//~~~~ analyse command line arguments ~~~~
const auto args=std::vector<std::string>{argv, argv+argc};
const auto has_arg=
[&](const auto &a)
{
return std::find(cbegin(args)+1, cend(args), a)!=cend(args);
};
const auto use_ht=has_arg("ht");
const auto thread_count=int(std::thread::hardware_concurrency())
/(use_ht ? 1 : 2);
const auto use_mix=has_arg("mix");
const auto use_mem=has_arg("mem");
const auto task=use_mem ? memory_task
: use_mix ? mixed_task
: compute_task;
const auto task_name=use_mem ? "memory_task"
: use_mix ? "mixed_task"
: "compute_task";
//~~~~ prepare input/output data ~~~~
auto results=std::vector<double>(thread_count);
auto indices=std::vector<int>(count);
auto inputs=std::vector<double>(count);
std::generate(begin(indices), end(indices),
[i=0]() mutable { return i++; });
std::copy(cbegin(indices), cend(indices), begin(inputs));
std::shuffle(begin(indices), end(indices), // fight the prefetcher!
std::default_random_engine{std::random_device{}()});
//~~~~ launch threads ~~~~
std::cout << thread_count << " threads"<< (use_ht ? " (ht)" : "")
<< " running " << task_name << "()\n";
auto threads=std::vector<std::thread>(thread_count);
const auto t0=system_time();
for(auto i=0; i<thread_count; ++i)
{
threads[i]=std::thread{task, i, thread_count,
data(indices), data(inputs), data(results)};
}
//~~~~ wait for threads ~~~~
auto result=0.0;
for(auto i=0; i<thread_count; ++i)
{
threads[i].join();
result+=results[i];
}
const auto duration=system_time()-t0;
std::cout << "result: " << result << '\n';
std::cout << "duration: " << duration << " seconds\n";
return 0;
}

OpenMP worst performance with more threads (following openMP tutorials)

I'm starting to work with OpenMP, and I am following these tutorials:
OpenMP Tutorials
I'm coding exactly what appears in the video, but instead of better performance with more threads I get worse performance. I don't understand why.
Here's my code:
#include <iostream>
#include <time.h>
#include <omp.h>
using namespace std;

static long num_steps = 100000000;
double step;
#define NUM_THREADS 2

int main()
{
    clock_t t;
    t = clock();
    int i, nthreads; double pi, sum[NUM_THREADS];
    step = 1.0/(double)num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;
        for (i = id, sum[id] = 0.0; i < num_steps; i = i + nthrds)
        {
            x = (i+0.5)*step;
            sum[id] += 4.0/(1.0+x*x);
        }
    }
    for (i = 0, pi = 0.0; i < nthreads; i++) pi += sum[i] * step;
    t = clock() - t;
    cout << "time: " << t << " milliseconds" << endl;
}
As you can see, it's exactly the same as in the video; I only added code to measure the elapsed time.
In the tutorial, the more threads we use, the better the performance.
In my case, that doesn't happen. Here are the timings I got:
1 thread: 433590 milliseconds
2 threads: 1705704 milliseconds
3 threads: 2689001 milliseconds
4 threads: 4221881 milliseconds
Why do I get this behavior?
-- EDIT --
gcc version: gcc 5.5.0
result of lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Model name: Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz
Stepping: 3
CPU MHz: 2594.436
CPU max MHz: 3600,0000
CPU min MHz: 800,0000
BogoMIPS: 5188.41
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7
-- EDIT --
I've tried using omp_get_wtime() instead, like this:
#include <iostream>
#include <time.h>
#include <omp.h>
using namespace std;

static long num_steps = 100000000;
double step;
#define NUM_THREADS 8

int main()
{
    int i, nthreads; double pi, sum[NUM_THREADS];
    step = 1.0/(double)num_steps;
    double start_time = omp_get_wtime();
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;
        for (i = id, sum[id] = 0.0; i < num_steps; i = i + nthrds)
        {
            x = (i+0.5)*step;
            sum[id] += 4.0/(1.0+x*x);
        }
    }
    for (i = 0, pi = 0.0; i < nthreads; i++) pi += sum[i] * step;
    double time = omp_get_wtime() - start_time;
    cout << "time: " << time << " seconds" << endl;
}
The behavior is different, although I have some questions.
Now, if I increase the number of threads by 1 (1 thread, 2 threads, 3, 4, ...), the results are basically the same as before: the performance gets worse. However, if I increase to 64 or 128 threads I do get better performance; the timing decreases from 0.44 s (for 1 thread) to 0.13 s (for 128 threads).
My question is: why don't I have the same behaviour as in the tutorial?
2 threads get better performance than 1,
3 threads get better performance than 2, etc.
Why do I only get better performance with a much bigger number of threads?
Instead of better performance with more threads I get worse ... I don't understand why.
Well, let's make the testing a bit more systematic and repeatable, to see whether this is really the case:
// time: 1535120 milliseconds 1 thread
// time: 200679 milliseconds 1 thread -O2
// time: 191205 milliseconds 1 thread -O3
// time: 184502 milliseconds 2 threads -O3
// time: 189947 milliseconds 3 threads -O3
// time: 202277 milliseconds 4 threads -O3
// time: 182628 milliseconds 5 threads -O3
// time: 192032 milliseconds 6 threads -O3
// time: 185771 milliseconds 7 threads -O3
// time: 187606 milliseconds 16 threads -O3
// time: 187231 milliseconds 32 threads -O3
// time: 186131 milliseconds 64 threads -O3
ref.: a few sample runs on a TiO.RUN platform fast mock-up ... where limited resources impose a certain glass ceiling to hit...
This showed the effects of the { -O2 | -O3 } compilation-mode optimisations much more than the principal degradation with a growing number of threads suggested above.
Next comes the "background" noise of the non-managed code-execution ecosystem, where the O/S will easily skew simplistic performance benchmarking.
If indeed interested in further details, feel free to read about the law of diminishing returns (about real-world compositions of the [SERIAL], resp. [PARALLEL] parts of process scheduling), where Dr. Gene Amdahl formulated the principal rules for
why more threads do not yield way better performance (and where a more contemporary re-formulation of this law explains why more threads may even yield a negative improvement (incur more expensive add-on overheads) compared with a properly tuned peak performance).
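For the curious, the classic form of the law is easy to evaluate. A sketch (the per-worker overhead term o is a simplistic addition of mine to mimic the "negative improvement" case, not part of Amdahl's original formulation; p and o below are made-up example values):

#include <iostream>

// Amdahl's law: speedup of a workload whose parallel fraction is p on N
// workers, optionally charged an add-on overhead o per worker (expressed as
// a fraction of the serial runtime). With o > 0 the curve peaks and then
// degrades as N grows.
static double speedup(double p, int N, double o = 0.0)
{
    return 1.0 / ((1.0 - p) + p / N + o * N);
}

int main()
{
    const double p = 0.95, o = 0.002;
    const int Ns[] = {1, 2, 4, 8, 16, 32, 64};
    for (int N : Ns)
        std::cout << N << " threads: " << speedup(p, N, o) << "x\n";
    return 0;
}

The benchmarking code used for the timings above follows: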
#include <time.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
using namespace std;
static long num_steps = 100000000;
double step;
#define NUM_THREADS 7
int main()
{
clock_t t;
t = clock();
int i, nthreads; double pi, sum[NUM_THREADS];
step = 1.0 / ( double )num_steps;
omp_set_num_threads( NUM_THREADS );
// struct timespec start;
// t = clock(); // _________________________________________ BEST START HERE
// clock_gettime( CLOCK_MONOTONIC, &start ); // ____________ USING MONOTONIC CLOCK
#pragma omp parallel
{
int i,
nthrds = omp_get_num_threads(),
id = omp_get_thread_num();
double x;
if ( id == 0 ) nthreads = nthrds;
for ( i = id, sum[id] = 0.0;
i < num_steps;
i += nthrds
)
{
x = ( i + 0.5 ) * step;
sum[id] += 4.0 / ( 1.0 + x * x );
}
}
// t = clock() - t; // _____________________________________ BEST STOP HERE
// clock_gettime( CLOCK_MONOTONIC, &end ); // ______________ USING MONOTONIC CLOCK
for ( i = 0, pi = 0.0;
i < nthreads;
i++
) pi += sum[i] * step;
t = clock() - t;
// // time: 1535120 milliseconds 1 thread
// // time: 200679 milliseconds 1 thread -O2
// // time: 191205 milliseconds 1 thread -O3
printf( "time: %d milliseconds %d threads\n", // time: 184502 milliseconds 2 threads -O3
t, // time: 189947 milliseconds 3 threads -O3
NUM_THREADS // time: 202277 milliseconds 4 threads -O3
); // time: 182628 milliseconds 5 threads -O3
} // time: 192032 milliseconds 6 threads -O3
// time: 185771 milliseconds 7 threads -O3
The major problem in that version is false sharing. This is explained later in the video you started to watch. You get this when many threads are accessing data that is adjacent in memory (the sum array). The video also explains how to use padding to manually avoid this issue.
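The padding trick mentioned in the video looks roughly like this (a sketch only; num_steps, step and NUM_THREADS are the names from your code, and 64 bytes is a typical, not guaranteed, cache-line size):

#define PAD 8                          // 8 doubles = 64 bytes, one cache line on most x86 CPUs
double sum[NUM_THREADS][PAD];          // each partial sum now lives on its own cache line
#pragma omp parallel
{
    int id = omp_get_thread_num();
    int nthrds = omp_get_num_threads();
    sum[id][0] = 0.0;
    for (long i = id; i < num_steps; i += nthrds)
    {
        double x = (i + 0.5) * step;
        sum[id][0] += 4.0 / (1.0 + x * x);
    }
}
double pi = 0.0;
for (int i = 0; i < NUM_THREADS; i++) pi += sum[i][0] * step;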
That said, the idiomatic solution is to use a reduction and not even bother with the manual work sharing:
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < num_steps; i++)
{
    double x = (i+0.5)*step;
    sum += 4.0/(1.0+x*x);
}
This is also explained in a later video of the series. It is much simpler than what you started with and most likely the most efficient approach.
Although the presenter is certainly competent, the style of these OpenMP tutorial videos is very much bottom-up. I'm not sure that is a good educational approach. In any case, you should probably watch all of the videos to learn how best to use OpenMP in practice.
Why do I only get better performance with a much bigger number of threads?
This is a bit counterintuitive: you very rarely get better performance from using more OpenMP threads than hardware threads, unless this indirectly fixes another issue. In your case, the large number of threads means that the sum array is spread out over a larger region of memory, and false sharing is less likely.

c++ std async : almost no effect to use several cores

This question is related to:
c++ std::async : faster on 4 cores compared to 8 cores
In the previous question, I was wondering why some code would run faster on 4 cores rather than 8 (answer: my cpu had 4 cores and 8 threads)
Now I am discovering that code is more or less the same speed independently of the number of cores used.
I am on Ubuntu 16.06, C++11, Intel® Core™ i7-8550U CPU @ 1.80GHz × 8.
Here is the code for benchmarking computation time against the number of cores used:
#include <math.h>
#include <future>
#include <ctime>
#include <vector>
#include <iostream>

#define NB_JOBS 2000.0
#define MAX_CORES 8

// no special meaning to this function,
// just uses some CPU
static bool _expensive(int nb_jobs){
    for(int job=0; job<nb_jobs; job++){
        float x = 0.6;
        bool b = true;
        double f = 1;
        for(int i=0; i<1000; i++){
            if(!b) f=-1;
            for(double j=1; j<2.0; j+=0.01) x += f * pow(1.0/sin(x), j);
            b = !b;
        }
    }
    return true;
}

static double _duration(int nb_cores){
    std::clock_t begin = clock();
    int nb_jobs_per_core = rint( NB_JOBS / (float)nb_cores );
    std::vector< std::future<bool> > futures;
    for(int i=0; i<nb_cores; i++){
        futures.push_back( std::async(std::launch::async, _expensive, nb_jobs_per_core) );
    }
    for (auto &e: futures) {
        bool foo = e.get();
    }
    std::clock_t end = clock();
    double duration = double(end - begin) / CLOCKS_PER_SEC;
    return duration;
}

int main(){
    for(int nb_cores=1; nb_cores<=MAX_CORES; nb_cores++){
        double duration = _duration(nb_cores);
        std::cout << nb_cores << " threads: " << duration << "\n";
    }
    return 0;
}
Here is the output:
1 threads: 8.55817
2 threads: 8.76621
3 threads: 7.90191
4 threads: 8.4656
5 threads: 10.5494
6 threads: 11.6175
7 threads: 21.697
8 threads: 24.3621
Using more cores seems to have only a marginal impact.
What troubles me is that the CPU has 4 cores, so I was expecting the program to run (around) 4 times faster when using 4 threads. It does not.
Note: htop shows usage of virtual cores as expected by the program, i.e. first one core used at 100%, then 2, ..., and at the end 8.
If I replace:
futures.push_back( std::async(std::launch::async,[...]
by :
futures.push_back( std::async(std::launch::async|std::launch::deferred,[...]
then I get:
1 threads: 8.6459
2 threads: 8.69905
3 threads: 10.7763
4 threads: 11.4505
5 threads: 11.8426
6 threads: 10.4282
7 threads: 9.55181
8 threads: 9.05565
and htop shows only 1 virtual core being used at 100% for the full duration.
Is there anything I am doing wrong?
Note: I tried this on several desktops, all with various specs (number of cores and number of threads), and observed something similar.
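Side note on the timing method: std::clock() reports CPU time accumulated over all threads of the process (at least on Linux), not wall-clock time, so it cannot show a wall-clock speedup. A wall-clock sketch of _duration() using std::chrono::steady_clock (same _expensive and NB_JOBS as above):

#include <chrono>
#include <future>
#include <vector>

// Wall-clock version of _duration(): unlike std::clock(), steady_clock does
// not sum the CPU time of the worker threads.
static double duration_wallclock(int nb_cores) {
    const auto begin = std::chrono::steady_clock::now();
    const int nb_jobs_per_core = static_cast<int>(NB_JOBS / nb_cores);
    std::vector<std::future<bool>> futures;
    for (int i = 0; i < nb_cores; i++) {
        futures.push_back(std::async(std::launch::async, _expensive, nb_jobs_per_core));
    }
    for (auto &e : futures) {
        e.get();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - begin;
    return elapsed.count();
}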

Performance decrease with threaded implementation

I implemented a small program in C to calculate PI using a Monte Carlo method (mainly because of personal interest and training). After having implemented the basic code structure, I added a command-line option allowing to execute the calculations threaded.
I expected major speed-ups, but I was disappointed. The command-line synopsis should be clear. The final number of iterations made to approximate PI is the product of the -iterations and -threads values passed via the command line. Leaving -threads blank defaults it to 1 thread, resulting in execution in the main thread.
The tests below were run with 80 million iterations in total.
On Windows 7 64Bit (Intel Core2Duo Machine):
Compiled using Cygwin GCC 4.5.3: gcc-4 pi.c -o pi.exe -O3
On Ubuntu/Linaro 12.04 (8Core AMD):
Compiled using GCC 4.6.3: gcc pi.c -lm -lpthread -O3 -o pi
Performance
On Windows, the threaded version is a few milliseconds faster than the un-threaded one. I expected better performance, to be honest. On Linux, ew! What the heck? Why does it take 2000% longer? Of course this depends very much on the implementation, so here it is. An excerpt after the command-line argument parsing is done and the calculation is started:
// Begin computation.
clock_t t_start, t_delta;
double pi = 0;

if (args.threads == 1) {
    t_start = clock();
    pi = pi_mc(args.iterations);
    t_delta = clock() - t_start;
}
else {
    pthread_t* threads = malloc(sizeof(pthread_t) * args.threads);
    if (!threads) {
        return alloc_failed();
    }
    struct PIThreadData* values = malloc(sizeof(struct PIThreadData) * args.threads);
    if (!values) {
        free(threads);
        return alloc_failed();
    }

    t_start = clock();
    for (i = 0; i < args.threads; i++) {
        values[i].iterations = args.iterations;
        values[i].out = 0.0;
        pthread_create(threads + i, NULL, pi_mc_threaded, values + i);
    }
    for (i = 0; i < args.threads; i++) {
        pthread_join(threads[i], NULL);
        pi += values[i].out;
    }
    t_delta = clock() - t_start;

    free(threads);
    threads = NULL;
    free(values);
    values = NULL;

    pi /= (double) args.threads;
}
While pi_mc_threaded() is implemented as:
struct PIThreadData {
    int iterations;
    double out;
};

void* pi_mc_threaded(void* ptr) {
    struct PIThreadData* data = ptr;
    data->out = pi_mc(data->iterations);
    return NULL;   /* a pthread start routine must return a void* */
}
You can find the full source code at http://pastebin.com/jptBTgwr.
Question
Why is this? Why this extreme difference on Linux? I expected the amount of time taken to calculate to be at least 3/4 of the original time. It would of course be possible that I have simply made wrong use of the pthread library. A clarification on how to do it correctly in this case would be very nice.
The problem is that in glibc's implementation, rand() calls __random(), and that
long int
__random ()
{
  int32_t retval;
  __libc_lock_lock (lock);
  (void) __random_r (&unsafe_state, &retval);
  __libc_lock_unlock (lock);
  return retval;
}
locks around each call to the function __random_r that does the actual work.
Thus, as soon as you have more than one thread using rand(), you make each thread wait for the other(s) on almost every call to rand(). Directly using random_r() with your own buffers in each thread should be much faster.
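A per-thread generator sketch, in C++ for brevity (the glibc-specific random_r()/initstate_r() route works the same way: one state buffer per thread, no shared lock). The per-thread seed parameter is my addition:

#include <random>

// Variant of pi_mc() in which every thread owns its generator state, so no
// lock is shared between threads. Pass a different seed to each thread.
double pi_mc_local_rng(int iterations, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> uniform01(0.0, 1.0);
    long long inner = 0;
    for (int i = 0; i < iterations; i++) {
        double x = uniform01(gen);
        double y = uniform01(gen);
        if (x * x + y * y <= 1.0)
            inner++;
    }
    return 4.0 * (double)inner / (double)iterations;
}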
Performance and threading is a black art. The answer depends on the specifics of the compiler and libraries used for threading, how well the kernel handles it, etc. Basically, if your libraries for *nix are not efficient at switching, moving objects around etc., threading will in fact be slower. This is one of the reasons a lot of us doing threading work now use the JVM or JVM-like languages. We can trust the JVM runtime's behavior -- its overall speed may vary with platform, but it's consistent on that platform. In addition, you may have some hidden wait/race conditions that you uncovered just due to timing and that may not show up on Windows.
If you are in a position to change your language, consider Scala or D. Scala is the actor-driven-model successor to Java, and D is a successor to C. Both languages show their roots -- if you can write in C, D should be no problem. Both languages, however, implement the actor model. NO MORE THREAD POOLS, NO MORE RACE CONDITIONS ETC!!!!!!
For comparison, I just tried your app on Windows Vista, compiled with Borland C++, and the 2 thread version performed nearly twice as fast as the single thread.
pi.exe -iterations 20000000 -stats -threads 1
3.141167
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 12.511000 sec
Threads: Main
pi.exe -iterations 10000000 -stats -threads 2
3.142397
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 6.584000 sec
Threads: 2
That's compiled against the thread-safe run-time library. Using the single thread library, both versions run at twice their thread-safe speed.
pi.exe -iterations 20000000 -stats -threads 1
3.141167
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 6.458000 sec
Threads: Main
pi.exe -iterations 10000000 -stats -threads 2
3.141314
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 3.978000 sec
Threads: 2
So the 2 thread version is still twice as fast, but the 1 thread version with the single thread library is actually faster than the 2 thread version on the thread-safe library.
Looking at Borland's rand implementation, they use thread local storage for the seed in the thread-safe implementation, so it's not going to have the same negative impact on threaded code as glibc's lock, but the thread-safe implementation will obviously be slower than the single thread implementation.
The bottom line though, is that your compiler's rand implementation is probably the main performance issue in both cases.
Update
I've just tried replacing your rand_01 calls with inline implementations of Borland's rand function using a local variable for the seed, and the results are consistently twice as fast in the 2 thread case.
The updated code looks like this:
#define MULTIPLIER 0x015a4e35L
#define INCREMENT 1

double pi_mc(int iterations) {
    unsigned seed = 1;
    long long inner = 0;
    long long outer = 0;
    int i;
    for (i = 0; i < iterations; i++) {
        seed = MULTIPLIER * seed + INCREMENT;
        double x = ((int)(seed >> 16) & 0x7fff) / (double) RAND_MAX;
        seed = MULTIPLIER * seed + INCREMENT;
        double y = ((int)(seed >> 16) & 0x7fff) / (double) RAND_MAX;
        double d = sqrt(pow(x, 2.0) + pow(y, 2.0));
        if (d <= 1.0) {
            inner++;
        }
        else {
            outer++;
        }
    }
    return ((double) inner / (double) iterations) * 4;
}
I don't know how good that is as rand implementations go, but it's worth at least trying on Linux to see whether it makes a difference to the performance.

Why is clock_gettime so erratic?

Intro
Section Old Question contains the initial question (Further Investigation and Conclusion have been added since).
Skip to the section Further Investigation below for a detailed comparison of the different timing methods (rdtsc, clock_gettime and QueryThreadCycleTime).
I believe the erratic behaviour of CGT can be attributed to either a buggy kernel or a buggy CPU (see section Conclusion).
The code used for testing is at the bottom of this question (see section Appendix).
Apologies for the length.
Old Question
In short: I am using clock_gettime to measure the execution time of many code segments. I am experiencing very inconsistent measurements between separate runs. The method has an extremely high standard deviation when compared to other methods (see Explanation below).
Question: Is there a reason why clock_gettime would give so inconsistent measurements when compared to other methods? Is there an alternative method with the same resolution that accounts for thread idle time?
Explanation: I am trying to profile a number of small parts of C code. The execution time of each of the code segments is not more than a couple of microseconds. In a single run, each of the code segments will execute some hundreds of times, which produces runs × hundreds of measurements.
I also have to measure only the time the thread actually spends executing (which is why rdtsc is not suitable). I also need a high resolution (which is why times is not suitable).
I've tried the following methods:
rdtsc (on Linux and Windows),
clock_gettime (with 'CLOCK_THREAD_CPUTIME_ID'; on Linux), and
QueryThreadCycleTime (on Windows).
Methodology: The analysis was performed over 25 runs. In each run, each code segment is repeated 101 times. Therefore I have 2525 measurements. Then I look at a histogram of the measurements, and also calculate some basic statistics (like the mean, std. dev., median, mode, min, and max).
I do not present how I measured the 'similarity' of the three methods, but this simply involved a basic comparison of the proportion of time spent in each code segment ('proportion' means that the times are normalised). I then look at the pure differences in these proportions. This comparison showed that 'rdtsc', 'QTCT', and 'CGT' all measure the same proportions when averaged over the 25 runs. However, the results below show that 'CGT' has a very large standard deviation. This makes it unusable in my use case.
Results:
A comparison of clock_gettime with rdtsc for the same code segment (25 runs of 101 measurements = 2525 readings):
clock_gettime:
1881 measurements of 11 ns,
595 measurements were (distributed almost normally) between 3369 and 3414 ns,
2 measurements of 11680 ns,
1 measurement of 1506022 ns, and
the rest is between 900 and 5000 ns.
Min: 11 ns
Max: 1506022 ns
Mean: 1471.862 ns
Median: 11 ns
Mode: 11 ns
Stddev: 29991.034
rdtsc (note: no context switches occurred during this run, but if it happens, it usually results in just a single measurement of 30000 ticks or so):
1178 measurements between 274 and 325 ticks,
306 measurements between 326 and 375 ticks,
910 measurements between 376 and 425 ticks,
129 measurements between 426 and 990 ticks,
1 measurement of 1240 ticks, and
1 measurement of 1256 ticks.
Min: 274 ticks
Max: 1256 ticks
Mean: 355.806 ticks
Median: 333 ticks
Mode: 376 ticks
Stddev: 83.896
Discussion:
rdtsc gives very similar results on both Linux and Windows. It has an acceptable standard deviation--it is actually quite consistent/stable. However, it does not account for thread idle time. Therefore, context switches make the measurements erratic (on Windows I have observed this quite often: a code segment with an average of 1000 ticks or so will take ~30000 ticks every now and then--definitely because of pre-emption).
QueryThreadCycleTime gives very consistent measurements--i.e. much lower standard deviation when compared to rdtsc. When no context switches happen, this method is almost identical to rdtsc.
clock_gettime, on the other hand, is producing extremely inconsistent results (not just between runs, but also between measurements). The standard deviations are extreme (when compared to rdtsc).
I hope the statistics are okay. But what could be the reason for such a discrepancy in the measurements between the two methods? Of course, there is caching, CPU/core migration, and other things. But none of this should be responsible for any such differences between 'rdtsc' and 'clock_gettime'. What is going on?
Further Investigation
I have investigated this a bit further. I have done two things:
Measured the overhead of just calling clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t) (see code 1 in Appendix), and
in a plain loop called clock_gettime and stored the readings into an array (see code 2 in Appendix). I measure the delta times (difference in successive measurement times, which should correspond a bit to the overhead of the call of clock_gettime).
I have measured it on two different computers with two different Linux Kernel versions:
CGT:
CPU: Core 2 Duo L9400 @ 1.86GHz
Kernel: Linux 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386
Results:
Estimated clock_gettime overhead: between 690-710 ns
Delta times:
Average: 815.22 ns
Median: 713 ns
Mode: 709 ns
Min: 698 ns
Max: 23359 ns
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
------------------+-----------
697 < x ≤ 800 -> 78111 <-- cached?
800 < x ≤ 1000 -> 16412
1000 < x ≤ 1500 -> 3
1500 < x ≤ 2000 -> 4836 <-- uncached?
2000 < x ≤ 3000 -> 305
3000 < x ≤ 5000 -> 161
5000 < x ≤ 10000 -> 105
10000 < x ≤ 15000 -> 53
15000 < x ≤ 20000 -> 8
20000 < x -> 5
CPU: 4 × Dual Core AMD Opteron Processor 275
Kernel: Linux 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux
Results:
Estimated clock_gettime overhead: between 279-283 ns
Delta times:
Average: 320.00
Median: 1
Mode: 1
Min: 1
Max: 3495529
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
--------------------+-----------
x ≤ 1 -> 86738 <-- cached?
282 < x ≤ 300 -> 13118 <-- uncached?
300 < x ≤ 440 -> 78
2000 < x ≤ 5000 -> 52
5000 < x ≤ 30000 -> 5
3000000 < x -> 8
RDTSC:
Related code rdtsc_delta.c and rdtsc_overhead.c.
CPU: Core 2 Duo L9400 @ 1.86GHz
Kernel: Linux 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386
Results:
Estimated overhead: between 39-42 ticks
Delta times:
Average: 52.46 ticks
Median: 42 ticks
Mode: 42 ticks
Min: 35 ticks
Max: 28700 ticks
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
------------------+-----------
34 < x ≤ 35 -> 16240 <-- cached?
41 < x ≤ 42 -> 63585 <-- uncached? (small difference)
48 < x ≤ 49 -> 19779 <-- uncached?
49 < x ≤ 120 -> 195
3125 < x ≤ 5000 -> 144
5000 < x ≤ 10000 -> 45
10000 < x ≤ 20000 -> 9
20000 < x -> 2
CPU: 4 × Dual Core AMD Opteron Processor 275
Kernel: Linux 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux
Results:
Estimated overhead: between 13.7-17.0 ticks
Delta times:
Average: 35.44 ticks
Median: 16 ticks
Mode: 16 ticks
Min: 14 ticks
Max: 16372 ticks
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
------------------+-----------
13 < x ≤ 14 -> 192
14 < x ≤ 21 -> 78172 <-- cached?
21 < x ≤ 50 -> 10818
50 < x ≤ 103 -> 10624 <-- uncached?
5825 < x ≤ 6500 -> 88
6500 < x ≤ 8000 -> 88
8000 < x ≤ 10000 -> 11
10000 < x ≤ 15000 -> 4
15000 < x ≤ 16372 -> 2
QTCT:
Related code qtct_delta.c and qtct_overhead.c.
CPU: Core 2 6700 @ 2.66GHz
Kernel: Windows 7 64-bit
Results:
Estimated overhead: between 890-940 ticks
Delta times:
Average: 1057.30 ticks
Median: 890 ticks
Mode: 890 ticks
Min: 880 ticks
Max: 29400 ticks
Histogram (left-out ranges have frequencies of 0):
Range | Frequency
------------------+-----------
879 < x ≤ 890 -> 71347 <-- cached?
895 < x ≤ 1469 -> 844
1469 < x ≤ 1600 -> 27613 <-- uncached?
1600 < x ≤ 2000 -> 55
2000 < x ≤ 4000 -> 86
4000 < x ≤ 8000 -> 43
8000 < x ≤ 16000 -> 10
16000 < x -> 1
Conclusion
I believe the answer to my question would be a buggy implementation on my machine (the one with AMD CPUs with an old Linux kernel).
The CGT results of the AMD machine with the old kernel show some extreme readings. If we look at the delta times, we see that the most frequent delta is 1 ns. This means that the call to clock_gettime took less than a nanosecond! Moreover, it also produced a number of extraordinarily large deltas (of more than 3000000 ns)! This seems to be erroneous behaviour. (Maybe unaccounted-for core migrations?)
Remarks:
The overhead of CGT and QTCT is quite big.
It is also difficult to account for their overhead, because CPU caching seems to make quite a big difference.
Maybe sticking to RDTSC, locking the process to one core, and assigning real-time priority is the most accurate way to tell how many cycles a piece of code used...
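A sketch of that "pin to one core and raise the priority" setup on Linux (compile with g++, or add -D_GNU_SOURCE, for the CPU_* macros; SCHED_FIFO normally needs root or CAP_SYS_NICE):

#include <sched.h>
#include <cstdio>

// Pin the calling process to one CPU and switch it to the SCHED_FIFO
// real-time class, so the measured code is neither migrated between cores
// nor preempted by ordinary tasks.
static bool pin_and_boost(int cpu_id, int rt_priority)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu_id, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return false;
    }
    sched_param sp{};
    sp.sched_priority = rt_priority;     // 1..99 for SCHED_FIFO
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        std::perror("sched_setscheduler");
        return false;
    }
    return true;
}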
Appendix
Code 1: clock_gettime_overhead.c
#include <time.h>
#include <stdio.h>
#include <stdint.h>
/* Compiled & executed with:
gcc clock_gettime_overhead.c -O0 -lrt -o clock_gettime_overhead
./clock_gettime_overhead 100000
*/
int main(int argc, char **args) {
struct timespec tstart, tend, dummy;
int n, N;
N = atoi(args[1]);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tstart);
for (n = 0; n < N; ++n) {
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
}
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tend);
printf("Estimated overhead: %lld ns\n",
((int64_t) tend.tv_sec * 1000000000 + (int64_t) tend.tv_nsec
- ((int64_t) tstart.tv_sec * 1000000000
+ (int64_t) tstart.tv_nsec)) / N / 10);
return 0;
}
Code 2: clock_gettime_delta.c
#include <time.h>
#include <stdio.h>
#include <stdint.h>
/* Compiled & executed with:
gcc clock_gettime_delta.c -O0 -lrt -o clock_gettime_delta
./clock_gettime_delta > results
*/
#define N 100000
int main(int argc, char **args) {
struct timespec sample, results[N];
int n;
for (n = 0; n < N; ++n) {
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &sample);
results[n] = sample;
}
printf("%s\t%s\n", "Absolute time", "Delta");
for (n = 1; n < N; ++n) {
printf("%lld\t%lld\n",
(int64_t) results[n].tv_sec * 1000000000 +
(int64_t)results[n].tv_nsec,
(int64_t) results[n].tv_sec * 1000000000 +
(int64_t) results[n].tv_nsec -
((int64_t) results[n-1].tv_sec * 1000000000 +
(int64_t)results[n-1].tv_nsec));
}
return 0;
}
Code 3: rdtsc.h
static uint64_t rdtsc() {
#if defined(__GNUC__)
# if defined(__i386__)
uint64_t x;
__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
return x;
# elif defined(__x86_64__)
uint32_t hi, lo;
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ((uint64_t)lo) | ((uint64_t)hi << 32);
# else
# error Unsupported architecture.
# endif
#elif defined(_MSC_VER)
return __rdtsc();
#else
# error Other compilers not supported...
#endif
}
Code 4: rdtsc_delta.c
#include <stdio.h>
#include <stdint.h>
#include "rdtsc.h"
/* Compiled & executed with:
gcc rdtsc_delta.c -O0 -o rdtsc_delta
./rdtsc_delta > rdtsc_delta_results
Windows:
cl -Od rdtsc_delta.c
rdtsc_delta.exe > windows_rdtsc_delta_results
*/
#define N 100000
int main(int argc, char **args) {
uint64_t results[N];
int n;
for (n = 0; n < N; ++n) {
results[n] = rdtsc();
}
printf("%s\t%s\n", "Absolute time", "Delta");
for (n = 1; n < N; ++n) {
printf("%lld\t%lld\n", results[n], results[n] - results[n-1]);
}
return 0;
}
Code 5: rdtsc_overhead.c
#include <time.h>
#include <stdio.h>
#include <stdint.h>
#include "rdtsc.h"
/* Compiled & executed with:
gcc rdtsc_overhead.c -O0 -lrt -o rdtsc_overhead
./rdtsc_overhead 1000000 > rdtsc_overhead_results
Windows:
cl -Od rdtsc_overhead.c
rdtsc_overhead.exe 1000000 > windows_rdtsc_overhead_results
*/
int main(int argc, char **args) {
uint64_t tstart, tend, dummy;
int n, N;
N = atoi(args[1]);
tstart = rdtsc();
for (n = 0; n < N; ++n) {
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
dummy = rdtsc();
}
tend = rdtsc();
printf("%G\n", (double)(tend - tstart)/N/10);
return 0;
}
Code 6: qtct_delta.c
#include <stdio.h>
#include <stdint.h>
#include <Windows.h>
/* Compiled & executed with:
cl -Od qtct_delta.c
qtct_delta.exe > windows_qtct_delta_results
*/
#define N 100000
int main(int argc, char **args) {
uint64_t ticks, results[N];
int n;
for (n = 0; n < N; ++n) {
QueryThreadCycleTime(GetCurrentThread(), &ticks);
results[n] = ticks;
}
printf("%s\t%s\n", "Absolute time", "Delta");
for (n = 1; n < N; ++n) {
printf("%lld\t%lld\n", results[n], results[n] - results[n-1]);
}
return 0;
}
Code 7: qtct_overhead.c
#include <stdio.h>
#include <stdint.h>
#include <Windows.h>
/* Compiled & executed with:
cl -Od qtct_overhead.c
qtct_overhead.exe 1000000
*/
int main(int argc, char **args) {
uint64_t tstart, tend, ticks;
int n, N;
N = atoi(args[1]);
QueryThreadCycleTime(GetCurrentThread(), &tstart);
for (n = 0; n < N; ++n) {
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
QueryThreadCycleTime(GetCurrentThread(), &ticks);
}
QueryThreadCycleTime(GetCurrentThread(), &tend);
printf("%G\n", (double)(tend - tstart)/N/10);
return 0;
}
Well, as CLOCK_THREAD_CPUTIME_ID is implemented using rdtsc, it will likely suffer from the same problems. The manual page for clock_gettime says:
The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks
are realized on many platforms using timers from the CPUs (TSC on
i386, AR.ITC on Itanium). These registers may differ between CPUs and
as a consequence these clocks may return bogus results if a
process is migrated to another CPU.
Which sounds like it might explain your problems? Maybe you should lock your process to one CPU to get stable results?
When you have a highly skewed distribution that cannot go negative, you're going to see large discrepancies between mean, median, and mode.
The standard deviation is fairly meaningless for such a distribution.
It's usually a good idea to log-transform it.
That will make it "more normal".
