OpenMP worst performance with more threads (following openMP tutorials)

OpenMP worst performance with more threads (following openMP tutorials) - multithreading

I'm starting to work with OpenMP and I follow these tutorials:
OpenMP Tutorials
I'm coding exactly what appears on the video, but instead of a better performance with more threads I get worse. I don't understand why.
Here's my code:
#include <iostream>
#include <time.h>
#include <omp.h>
using namespace std;
static long num_steps = 100000000;
double step;
#define NUM_THREADS 2
int main()
{
clock_t t;
t = clock();
int i, nthreads; double pi, sum[NUM_THREADS];
step = 1.0/(double)num_steps;
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
int i, id, nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if(id == 0) nthreads = nthrds;
for(i=id, sum[id]=0.0; i < num_steps; i = i + nthrds)
{
x = (i+0.5)*step;
sum[id] += 4.0/(1.0+x*x);
}
}
for(i = 0, pi=0.0; i<nthreads; i++) pi += sum[i] * step;
t = clock() - t;
cout << "time: " << t << " miliseconds" << endl;
}
As you can see, it's exactly the same as in the video, I only added a code to measure an elapsed time.
On the tutorial, the more threads we use the better a performance.
In my case, that doesn't happen. Here are the timing I got:
1 thread: 433590 miliseconds
2 threads: 1705704 miliseconds
3 threads: 2689001 miliseconds
4 threads: 4221881 miliseconds
Why do I get this behavior?
-- EDIT --
gcc version: gcc 5.5.0
result of lscpu:
Architechure: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 60
Model name: Intel(R) Core(TM) i7-4720HQ CPU # 2.60Ghz
Stepping: 3
CPU Mhz: 2594.436
CPU max MHz: 3600,0000
CPU min Mhz: 800,0000
BogoMIPS: 5188.41
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7
-- EDIT --
I've tried using omp_get_wtime() instead, like this:
#include <iostream>
#include <time.h>
#include <omp.h>
using namespace std;
static long num_steps = 100000000;
double step;
#define NUM_THREADS 8
int main()
{
int i, nthreads; double pi, sum[NUM_THREADS];
step = 1.0/(double)num_steps;
double start_time = omp_get_wtime();
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
int i, id, nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if(id == 0) nthreads = nthrds;
for(i=id, sum[id]=0.0; i < num_steps; i = i + nthrds)
{
x = (i+0.5)*step;
sum[id] += 4.0/(1.0+x*x);
}
}
for(i = 0, pi=0.0; i<nthreads; i++) pi += sum[i] * step;
double time = omp_get_wtime() - start_time;
cout << "time: " << time << " seconds" << endl;
}
The behavior is different, although I have some questions.
Now, if I increase the number of threads by 1, for example, 1 thread, 2 threads, 3, 4, ..., the results are basically the same as previous, the performance gets worse, although if I increase to 64 threads, or 128 threads I get indeed better performance, the timing decreases from 0.44 [s] (for 1 thread) to 0.13 [s] ( for 128 threads ).
My question is: Why I don't have the same behaviour as in the tutorial?
2 threads get better performance than 1,
3 threads get better performance than 2, etc.
Why do I only get better performance with much bigger amount of threads?

instead of better performances with more threads I get worse ... I don't understand why.
Well,let's make the testing a bit more systematic and repeatable to see if :
// time: 1535120 milliseconds 1 thread
// time: 200679 milliseconds 1 thread -O2
// time: 191205 milliseconds 1 thread -O3
// time: 184502 milliseconds 2 threads -O3
// time: 189947 milliseconds 3 threads -O3
// time: 202277 milliseconds 4 threads -O3
// time: 182628 milliseconds 5 threads -O3
// time: 192032 milliseconds 6 threads -O3
// time: 185771 milliseconds 7 threads -O3
// time: 187606 milliseconds 16 threads -O3
// time: 187231 milliseconds 32 threads -O3
// time: 186131 milliseconds 64 threads -O3
ref.: a few sample runs on a TiO.RUN platform fast mock-up ... where limited resources apply a certain glass-ceiling to hit...
This did show more the effects of { -O2 |-O3 }-compilation-mode optimisation effects, than the above proposed principal degradation for growing number of threads.
Next comes the "background" noise from non-managed code-execution ecosystem, where O/S will easily skew the simplistic performance benchmarking
If indeed interested in further details, feel free to read about a Law of diminishing returns ( about real world compositions of [SERIAL], resp. [PARALLEL] parts of the process-scheduling ), where Dr. Gene AMDAHL has initiated the principal rules,
why more threads do not get way better performance ( and where a bit more contemporary re-formulation of this law explains, why more threads may even get negative improvement ( get more expensive add-on overheads ), than a right-tuned peak performance.
#include <time.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
using namespace std;
static long num_steps = 100000000;
double step;
#define NUM_THREADS 7
int main()
{
clock_t t;
t = clock();
int i, nthreads; double pi, sum[NUM_THREADS];
step = 1.0 / ( double )num_steps;
omp_set_num_threads( NUM_THREADS );
// struct timespec start;
// t = clock(); // _________________________________________ BEST START HERE
// clock_gettime( CLOCK_MONOTONIC, &start ); // ____________ USING MONOTONIC CLOCK
#pragma omp parallel
{
int i,
nthrds = omp_get_num_threads(),
id = omp_get_thread_num();;
double x;
if ( id == 0 ) nthreads = nthrds;
for ( i = id, sum[id] = 0.0;
i < num_steps;
i += nthrds
)
{
x = ( i + 0.5 ) * step;
sum[id] += 4.0 / ( 1.0 + x * x );
}
}
// t = clock() - t; // _____________________________________ BEST STOP HERE
// clock_gettime( CLOCK_MONOTONIC, &end ); // ______________ USING MONOTONIC CLOCK
for ( i = 0, pi = 0.0;
i < nthreads;
i++
) pi += sum[i] * step;
t = clock() - t;
// // time: 1535120 milliseconds 1 thread
// // time: 200679 milliseconds 1 thread -O2
// // time: 191205 milliseconds 1 thread -O3
printf( "time: %d milliseconds %d threads\n", // time: 184502 milliseconds 2 threads -O3
t, // time: 189947 milliseconds 3 threads -O3
NUM_THREADS // time: 202277 milliseconds 4 threads -O3
); // time: 182628 milliseconds 5 threads -O3
} // time: 192032 milliseconds 6 threads -O3
// time: 185771 milliseconds 7 threads -O3

The major problem in that version is false sharing. This is explained later in the video you started to watch. You get this when many threads are accessing data that is adjacent in memory (the sum array). The video also explains how to use padding to manually avoid this issue.
That said, the idiomatic solution is to use a reduction and not even bother with the manual work sharing:
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for(int i=0; i < num_steps; i++)
{
double x = (i+0.5)*step;
sum += 4.0/(1.0+x*x);
}
This is also explained in a later video of the series. It is much simpler than what you started with and most likely the most efficient way.
Although the presenter is certainly competent, the style of these OpenMP tutorial videos is very much bottom up. I'm not sure that is a good educational approach. In any case you should probably watch all of the videos to know how to best use OpenMP it in practice.
Why do I only get better performance with much bigger amount of threads?
This is a bit counterintuitive, you very rarely get better performance from using more OpenMP threads than hardware threads - unless this is indirectly fixing another issue. In your case the large amount of threads means that the sum array is spread out over a larger region in memory and false-sharing is less likely.

Related

AMD SMT or Intel HT performance

I don't really understand why processors with doubled logical processors are much more expensive then single logical processors. As far as I noticed there is no difference with running code on 6 or 12 threads for 6 cores/12 threads CPU.
As monkeys asked, here is C# example emulating heavy load on each thread:
static void Main(string[] args)
{
if (IntPtr.Size != 8)
throw new Exception("use only x64 code, 2020 is coming...");
//6 for physical cores, 12 for logical cores
const int limit_threads = 12;
const int limit_actions = 256;
const int limit_loop = 1000 * 1000 * 10;
const double power = 1.0 / 17.0;
long result = 0;
var action = new Action(() =>
{
long value = 0;
for (int i = 0; i < limit_loop; i++)
value += (long)Math.Pow(i, power);
Interlocked.Add(ref result, value);
});
var actions = Enumerable.Range(0, limit_actions).Select(x => action).ToArray();
var sw = Stopwatch.StartNew();
Parallel.Invoke(new ParallelOptions()
{
MaxDegreeOfParallelism = limit_threads
}, actions);
Console.WriteLine($"done in {sw.Elapsed.TotalSeconds}s\nresult={result}\nlimit_threads={limit_threads}\nlimit_actions={limit_actions}\nlimit_loop={limit_loop}");
}
Results for 6 threads (AMD Ryzen 2600):
done in 13,7074543s
result=5086445312
limit_threads=6
limit_actions=256
limit_loop=10000000
Results for 12 threads (AMD Ryzen 2600):
done in 11,3992756s
result=5086445312
limit_threads=12
limit_actions=256
limit_loop=10000000
It's about 10% performance boost with using all logical cores instead of only physical, which is almost null. What you can say now?
Can someone provide sample code which will be valuable faster with using processor multi-threading (AMD SMT or Intel HT) comparing to using only physical cores?

TLDR: SMT/HT is a technology that exists to offset the cost of massive multithreading as opposed to speeding up your computation with more cores.
You have misunderstood what SMT/HT does.
"As far as I noticed there is no difference with running code on 6 or 12 threads for 6cores-12threads CPU".
If this is true, then SMT/HT is working.
To understand why, you need to understand modern OS kernels and Kernel Threads. Today's Operating Systems use what is called Preemptive Threading.
The OS Kernel divides up each core into time-slices called "Quantum", and using interrupts schedules the various processes in a complicated round robin fashion.
The part we want to look at is the interrupt. When a CPU core is scheduled to switch run another thread, we call this process a "Context Switch". Context Switches are expensive, slow processes, as the entire state and flow of the highly pipelined CPU must be stopped, saved and swapped out for another state (as well as other caches, registers, lookup tables etc). According to this answer, Context Switch times are measured in microseconds (thousands of clock-cycles); and will only get worse as CPUs become more complicated.
The point of SMT/HT is to cheat, by having each CPU core being able to store two states at the same time (imagine having two monitors instead of one, you still only use one at time, but you are more productive because you don't need to rearrange your windows each time you switch tasks). So SMT/HT processors can Context Switch must faster than non-SMT/HT processors.
So back to your example. If you turned off SMT on your Ryzen 2600, then ran the same workload with 12 threads, you will find that it performs significantly slower than with 6 threads.
Also, note, more threads does not make things faster.

I think that varying the price of the processors depending on the availability of the SMT/HT technology is just a matter of marketing strategy.
The hardware is probably the same in every case but the feature is disabled by the manufacturer on some of them to offer cheap models.
This technology relies on the fact that some micro-operations in a single
instruction have to wait for something to be executed; so instead of just waiting,
the same core uses its circuits to make some progress on the micro-operations
from another thread.
On a coarse point of view, we can perceive the execution of two (or more on
certain models) sequences of micro-operations from two different threads executed
on a single piece of hardware (except some redundant parts, like registers...)
The efficiency of this technology depends on the problem.
After various tests I noticed that if the problem is compute bound, ie the
limiting factor is the time needed to compute (add, multiply...), but not
memory bound (the data are already available, no need to wait for the memory),
then this technology does not provide any benefit.
This is due to the fact that there is no gap to fill in the two sequences of
micro-operations, thus the intertwined execution of two threads is not better
than two independent serial executions.
In the exact opposite case, when the problem is memory bound but not
compute bound, there is no more benefit because both threads have to wait
for the data coming from memory.
I only noticed an improvement in performances when the problem is mixed between
data access and computation; in this case when one thread is waiting for data, the
same core can make some progress in the computations of the other thread and
vice-versa.
Edit
Below is given an example to illustrate these situations, and I obtain the
following results (quite stable when run many times,
dual Xeon E5-2697 v2, Linux 5.3.13).
In this memory bound situation HT does not help.
$ ./prog_ht mem
24 threads running memory_task()
result: 1e+17
duration: 13.0383 seconds
$ ./prog_ht mem ht
48 threads (ht) running memory_task()
result: 1e+17
duration: 13.1096 seconds
In this compute bound situation HT helps (almost 30% gain)
(I don't know exactly the details of what is implied in the hardware
when computing cos, but there must be some latencies which are not due
to memory access)
$ ./prog_ht
24 threads running compute_task()
result: -260.782
duration: 9.76226 seconds
$ ./prog_ht ht
48 threads (ht) running compute_task()
result: -260.782
duration: 7.58181 seconds
In this mixed situation HT helps much more (around 70% gain)
$ ./prog_ht mix
24 threads running mixed_task()
result: -260.782
duration: 60.1602 seconds
$ ./prog_ht mix ht
48 threads (ht) running mixed_task()
result: -260.782
duration: 35.121 seconds
Here is the source code (in C++, I'm not confortable with C#)
/*
g++ -std=c++17 -o prog_ht prog_ht.cpp \
-pedantic -Wall -Wextra -Wconversion \
-Wno-missing-braces -Wno-sign-conversion \
-O3 -ffast-math -march=native -fomit-frame-pointer -DNDEBUG \
-pthread
*/
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <thread>
#include <chrono>
#include <cstdint>
#include <random>
#include <cmath>
#include <pthread.h>
bool // success
bind_current_thread_to_cpu(int cpu_id)
{
/* !!!!!!!!!!!!!! WARNING !!!!!!!!!!!!!
I checked the numbering of the CPUs according to the packages and cores
on my computer/system (dual Xeon E5-2697 v2, Linux 5.3.13)
0 to 11 --> different cores of package 1
12 to 23 --> different cores of package 2
24 to 35 --> different cores of package 1
36 to 47 --> different cores of package 2
Thus using cpu_id from 0 to 23 does not bind more than one thread
to each single core (no HT).
Of course using cpu_id from 0 to 47 binds two threads to each single
core (HT is used).
This numbering is absolutely NOT guaranteed on any other computer/system,
thus the relation between thread numbers and cpu_id should be adapted
accordingly.
*/
cpu_set_t cpu_set;
CPU_ZERO(&cpu_set);
CPU_SET(cpu_id, &cpu_set);
return !pthread_setaffinity_np(pthread_self(), sizeof(cpu_set), &cpu_set);
}
inline
double // seconds since 1970/01/01 00:00:00 UTC
system_time()
{
const auto now=std::chrono::system_clock::now().time_since_epoch();
return 1e-6*double(std::chrono::duration_cast
<std::chrono::microseconds>(now).count());
}
constexpr auto count=std::int64_t{20'000'000};
constexpr auto repeat=500;
void
compute_task(int thread_id,
int thread_count,
const int *indices,
const double *inputs,
double *results)
{
(void)indices; // not used here
(void)inputs; // not used here
bind_current_thread_to_cpu(thread_id);
const auto work_begin=count*thread_id/thread_count;
const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
auto result=0.0;
for(auto r=0; r<repeat; ++r)
{
for(auto i=work_begin; i<work_end; ++i)
{
result+=std::cos(double(i));
}
}
results[thread_id]+=result;
}
void
mixed_task(int thread_id,
int thread_count,
const int *indices,
const double *inputs,
double *results)
{
bind_current_thread_to_cpu(thread_id);
const auto work_begin=count*thread_id/thread_count;
const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
auto result=0.0;
for(auto r=0; r<repeat; ++r)
{
for(auto i=work_begin; i<work_end; ++i)
{
const auto index=indices[i];
result+=std::cos(inputs[index]);
}
}
results[thread_id]+=result;
}
void
memory_task(int thread_id,
int thread_count,
const int *indices,
const double *inputs,
double *results)
{
bind_current_thread_to_cpu(thread_id);
const auto work_begin=count*thread_id/thread_count;
const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
auto result=0.0;
for(auto r=0; r<repeat; ++r)
{
for(auto i=work_begin; i<work_end; ++i)
{
const auto index=indices[i];
result+=inputs[index];
}
}
results[thread_id]+=result;
}
int
main(int argc,
char **argv)
{
//~~~~ analyse command line arguments ~~~~
const auto args=std::vector<std::string>{argv, argv+argc};
const auto has_arg=
[&](const auto &a)
{
return std::find(cbegin(args)+1, cend(args), a)!=cend(args);
};
const auto use_ht=has_arg("ht");
const auto thread_count=int(std::thread::hardware_concurrency())
/(use_ht ? 1 : 2);
const auto use_mix=has_arg("mix");
const auto use_mem=has_arg("mem");
const auto task=use_mem ? memory_task
: use_mix ? mixed_task
: compute_task;
const auto task_name=use_mem ? "memory_task"
: use_mix ? "mixed_task"
: "compute_task";
//~~~~ prepare input/output data ~~~~
auto results=std::vector<double>(thread_count);
auto indices=std::vector<int>(count);
auto inputs=std::vector<double>(count);
std::generate(begin(indices), end(indices),
[i=0]() mutable { return i++; });
std::copy(cbegin(indices), cend(indices), begin(inputs));
std::shuffle(begin(indices), end(indices), // fight the prefetcher!
std::default_random_engine{std::random_device{}()});
//~~~~ launch threads ~~~~
std::cout << thread_count << " threads"<< (use_ht ? " (ht)" : "")
<< " running " << task_name << "()\n";
auto threads=std::vector<std::thread>(thread_count);
const auto t0=system_time();
for(auto i=0; i<thread_count; ++i)
{
threads[i]=std::thread{task, i, thread_count,
data(indices), data(inputs), data(results)};
}
//~~~~ wait for threads ~~~~
auto result=0.0;
for(auto i=0; i<thread_count; ++i)
{
threads[i].join();
result+=results[i];
}
const auto duration=system_time()-t0;
std::cout << "result: " << result << '\n';
std::cout << "duration: " << duration << " seconds\n";
return 0;
}

c++ std async : almost no effect to use several cores

This question is related to:
c++ std::async : faster on 4 cores compared to 8 cores
In the previous question, I was wondering why some code would run faster on 4 cores rather than 8 (answer: my cpu had 4 cores and 8 threads)
Now I am discovering that code is more or less the same speed independently of the number of cores used.
I am on ubuntu 16.06. c++11. Intel® Core™ i7-8550U CPU # 1.80GHz × 8
Here code for benchmarking computation time against number of core used
#include <math.h>
#include <future>
#include <ctime>
#include <vector>
#include <iostream>
#define NB_JOBS 2000.0
#define MAX_CORES 8
// no special meaning to this function,
// just uses some CPU
static bool _expensive(int nb_jobs){
for(int job=0;job<nb_jobs;job++){
float x = 0.6;
bool b = true;
double f = 1;
for(int i=0;i<1000;i++){
if(!b) f=-1;
for(double j=1;j<2.0;j+=0.01) x+= f* pow(1.0/sin(x),j);
b = !b;
}
}
return true;
}
static double _duration(int nb_cores){
std::clock_t begin = clock();
int nb_jobs_per_core = rint ( NB_JOBS / (float)nb_cores );
std::vector < std::future<bool> > futures;
for(int i=0;i<nb_cores;i++){
futures.push_back( std::async(std::launch::async,_expensive,nb_jobs_per_core));
}
for (auto &e: futures) {
bool foo = e.get();
}
std::clock_t end = clock();
double duration = double(end - begin) / CLOCKS_PER_SEC;
return duration;
}
int main(){
for(int nb_cores=1 ; nb_cores<=MAX_CORES ; nb_cores++){
double duration = _duration(nb_cores);
std::cout << nb_cores << " threads: " << duration << "\n";
}
return 0;
}
Here the output:
1 threads: 8.55817
2 threads: 8.76621
3 threads: 7.90191
4 threads: 8.4656
5 threads: 10.5494
6 threads: 11.6175
7 threads: 21.697
8 threads: 24.3621
using cores seems to have marginal impacts.
What troubles me is that the CPU has 4 cores. So I was expecting the program to run (around) 4 times faster when using 4 threads. It does not.
Note: "htop" shows usage of virtual cores as expected by the program, i.e. first one core used at 100%, then 2, ..., and at the end 8.
If I replace:
futures.push_back( std::async(std::launch::async,[...]
by :
futures.push_back( std::async(std::launch::async|std::launch::deferred,[...]
then I get:
1 threads: 8.6459
2 threads: 8.69905
3 threads: 10.7763
4 threads: 11.4505
5 threads: 11.8426
6 threads: 10.4282
7 threads: 9.55181
8 threads: 9.05565
and htop shows only 1 virtual core being used 100% during the full duration.
Anything I am doing wrong ?
note: I tried on several desktops, all with various specs (nb of core and nb of threads), and observed something similar.

Why this simple program on shared variable does not scale? (no lock)

I'm new to concurrent programming. I implement a CPU intensive work and measure how much speedup I could gain. However, I cannot get any speedup as I increase #threads.
The program does the following task:
There's a shared counter to count from 1 to 1000001.
Each thread does the following until the counter reaches 1000001:
increments the counter atomically, then
run a loop for 10000 times.
There're 1000001*10000 = 10^10 operations in total to be perform, so I should be able to get good speedup as I increment #threads.
Here's how I implemented it:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdatomic.h>
pthread_t workers[8];
atomic_int counter; // a shared counter
void *runner(void *param);
int main(int argc, char *argv[]) {
if(argc != 2) {
printf("Usage: ./thread thread_num\n");
return 1;
}
int NUM_THREADS = atoi(argv[1]);
pthread_attr_t attr;
counter = 1; // initialize shared counter
pthread_attr_init(&attr);
const clock_t begin_time = clock(); // begin timer
for(int i=0;i<NUM_THREADS;i++)
pthread_create(&workers[i], &attr, runner, NULL);
for(int i=0;i<NUM_THREADS;i++)
pthread_join(workers[i], NULL);
const clock_t end_time = clock(); // end timer
printf("Thread number = %d, execution time = %lf s\n", NUM_THREADS, (double)(end_time - begin_time)/CLOCKS_PER_SEC);
return 0;
}
void *runner(void *param) {
int temp = 0;
while(temp < 1000001) {
temp = atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
for(int i=1;i<10000;i++)
temp%i; // do some CPU intensive work
}
pthread_exit(0);
}
However, as I run my program, I cannot get better performance than sequential execution!!
gcc-4.9 -std=c11 -pthread -o my_program my_program.c
for i in 1 2 3 4 5 6 7 8; do \
./my_program $i; \
done
Thread number = 1, execution time = 19.235998 s
Thread number = 2, execution time = 20.575237 s
Thread number = 3, execution time = 25.161116 s
Thread number = 4, execution time = 28.278671 s
Thread number = 5, execution time = 28.185605 s
Thread number = 6, execution time = 28.050380 s
Thread number = 7, execution time = 28.286925 s
Thread number = 8, execution time = 28.227132 s
I run the program on a 4-core machine.
Does anyone have suggestions to improve the program? Or any clue why I cannot get speedup?

The only work here that can be done in parallel is the loop:
for(int i=0;i<10000;i++)
temp%i; // do some CPU intensive work
gcc, even with the minimal optimisation level, will not emit any code for the temp%i; void expression (disassemble it and see), so this essentially becomes an empty loop, which will execute very fast - the execution time in the case with multiple threads running on different cores will be dominated by the cacheline containing your atomic variable ping-ponging between the different cores.
You need to make this loop actually do a significant amount of work before you'll see a speed-up.

Why does calculation with OpenMP take 100x more time than with a single thread?

I am trying to test Pi calculation problem with OpenMP. I have this code:
#pragma omp parallel private(i, x, y, myid) shared(n) reduction(+:numIn) num_threads(NUM_THREADS)
{
printf("Thread ID is: %d\n", omp_get_thread_num());
myid = omp_get_thread_num();
printf("Thread myid is: %d\n", myid);
for(i = myid*(n/NUM_THREADS); i < (myid+1)*(n/NUM_THREADS); i++) {
//for(i = 0; i < n; i++) {
x = (double)rand()/RAND_MAX;
y = (double)rand()/RAND_MAX;
if (x*x + y*y <= 1) numIn++;
}
printf("Thread ID is: %d\n", omp_get_thread_num());
}
return 4. * numIn / n;
}
When I compile with gcc -fopenmp pi.c -o hello_pi and run it time ./hello_pi for n = 1000000000 I get
real 8m51.595s
user 4m14.004s
sys 60m59.533s
When I run it on with a single thread I get
real 0m20.943s
user 0m20.881s
sys 0m0.000s
Am I missing something? It should be faster with 8 threads. I have 8-core CPU.

Please take a look at the
http://people.sc.fsu.edu/~jburkardt/c_src/openmp/compute_pi.c
This might be a good implementation for pi computing.
It is quite important to know that how your data spread to different threads and how the openmp collect them back. Usually, a bad design (which has data dependencies across threads) running on multiple thread will result in a slower execution than a single thread .

rand() in stdlib.h is not thread-safe. Using it in multi-thread environment causes a race condition on its hidden state variables, thus lead to poor performance.
http://man7.org/linux/man-pages/man3/rand.3.html
In fact the following code work well as an OpenMP demo.
$ gc -fopenmp -o pi pi.c -O3; time ./pi
pi: 3.141672
real 0m4.957s
user 0m39.417s
sys 0m0.005s
code:
#include <stdio.h>
#include <omp.h>
int main()
{
const int n=50000;
const int NUM_THREADS=8;
int numIn=0;
#pragma omp parallel for reduction(+:numIn) num_threads(NUM_THREADS)
for(int i = 0; i < n; i++) {
double x = (double)i/n;
for(int j=0;j<n; j++) {
double y = (double)j/n;
if (x*x + y*y <= 1) numIn++;
}
}
printf("pi: %f\n",4.*numIn/n/n);
return 0;
}

In general I would not compare times without optimization on. Compile with something like
gcc -O3 -Wall -pedantic -fopenmp main.c
The rand() function is not thread safe in Linux (but it's fine with MSVC and I guess mingw32 which uses the same C run-time libraries, MSVCRT, as MSVC). You can use rand_r with a different seed for each thread. See openmp-program-is-slower-than-sequential-one.
In general try to avoid defining the chunk sizes when you parallelize a loop. Just use #pragma omp for schedule(shared). You also don't need to specify that the loop variable in a parallelized loop is private (the variable i in your code).
Try the following code
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main() {
int i, numIn, n;
unsigned int seed;
double x, y, pi;
n = 1000000;
numIn = 0;
#pragma omp parallel private(seed, x, y) reduction(+:numIn)
{
seed = 25234 + 17 * omp_get_thread_num();
#pragma omp for
for (i = 0; i <= n; i++) {
x = (double)rand_r(&seed) / RAND_MAX;
y = (double)rand_r(&seed) / RAND_MAX;
if (x*x + y*y <= 1) numIn++;
}
}
pi = 4.*numIn / n;
printf("asdf pi %f\n", pi);
return 0;
}
You can find a working example of this code here http://coliru.stacked-crooked.com/a/9adf1e856fc2b60d

Accurately Calculating CPU Utilization in Linux using /proc/stat

There are a number of posts and references on how to get CPU Utilization using statistics in /proc/stat. However, most of them use only four of the 7+ CPU stats (user, nice, system, and idle), ignoring the remaining jiffie CPU counts present in Linux 2.6 (iowait, irq, softirq).
As an example, see Determining CPU utilization.
My question is this: Are the iowait/irq/softirq numbers also counted in one of the first four numbers (user/nice/system/idle)? In other words, does the total jiffie count equal the sum of the first four stats? Or, is the total jiffie count equal to the sum of all 7 stats? If the latter is true, then a CPU utilization formula should take all of the numbers into account, like this:
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
long double a[7],b[7],loadavg;
FILE *fp;
for(;;)
{
fp = fopen("/proc/stat","r");
fscanf(fp,"%*s %Lf %Lf %Lf %Lf",&a[0],&a[1],&a[2],&a[3],&a[4],&a[5],&a[6]);
fclose(fp);
sleep(1);
fp = fopen("/proc/stat","r");
fscanf(fp,"%*s %Lf %Lf %Lf %Lf",&b[0],&b[1],&b[2],&b[3],&b[4],&b[5],&b[6]);
fclose(fp);
loadavg = ((b[0]+b[1]+b[2]+b[4]+b[5]+b[6]) - (a[0]+a[1]+a[2]+a[4]+a[5]+a[6]))
/ ((b[0]+b[1]+b[2]+b[3]+b[4]+b[5]+b[6]) - (a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]));
printf("The current CPU utilization is : %Lf\n",loadavg);
}
return(0);
}

I think iowait/irq/softirq are not counted in one of the first 4 numbers. You can see the comment of irqtime_account_process_tick in kernel code for more detail:
(for Linux kernel 4.1.1)
2815 * Tick demultiplexing follows the order
2816 * - pending hardirq update <-- this is irq
2817 * - pending softirq update <-- this is softirq
2818 * - user_time
2819 * - idle_time <-- iowait is included in here, discuss below
2820 * - system time
2821 * - check for guest_time
2822 * - else account as system_time
For the idle time handling, see account_idle_time function:
2772 /*
2773 * Account for idle time.
2774 * #cputime: the cpu time spent in idle wait
2775 */
2776 void account_idle_time(cputime_t cputime)
2777 {
2778 u64 *cpustat = kcpustat_this_cpu->cpustat;
2779 struct rq *rq = this_rq();
2780
2781 if (atomic_read(&rq->nr_iowait) > 0)
2782 cpustat[CPUTIME_IOWAIT] += (__force u64) cputime;
2783 else
2784 cpustat[CPUTIME_IDLE] += (__force u64) cputime;
2785 }
If the cpu is idle AND there is some IO pending, it will count the time in CPUTIME_IOWAIT. Otherwise, it is count in CPUTIME_IDLE.
To conclude, I think the jiffies in irq/softirq should be counted as "busy" for cpu because it was actually handling some IRQ or soft IRQ. On the other hand, the jiffies in "iowait" should be counted as "idle" for cpu because it was not doing something but waiting for a pending IO to happen.

from busybox, its top magic is:
static const char fmt[] ALIGN1 = "cp%*s %llu %llu %llu %llu %llu %llu %llu %llu";
int ret;
if (!fgets(line_buf, LINE_BUF_SIZE, fp) || line_buf[0] != 'c' /* not "cpu" */)
return 0;
ret = sscanf(line_buf, fmt,
&p_jif->usr, &p_jif->nic, &p_jif->sys, &p_jif->idle,
&p_jif->iowait, &p_jif->irq, &p_jif->softirq,
&p_jif->steal);
if (ret >= 4) {
p_jif->total = p_jif->usr + p_jif->nic + p_jif->sys + p_jif->idle
+ p_jif->iowait + p_jif->irq + p_jif->softirq + p_jif->steal;
/* procps 2.x does not count iowait as busy time */
p_jif->busy = p_jif->total - p_jif->idle - p_jif->iowait;
}

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string