Performance decrease with threaded implementation - Linux

I implemented a small program in C to calculate PI using a Monte Carlo method (mainly out of personal interest and for practice). After implementing the basic code structure, I added a command-line option that allows the calculations to be executed in multiple threads.
I expected major speed-ups, but was disappointed. The command-line synopsis should be clear: the final number of iterations used to approximate PI is the product of the -iterations and -threads values passed on the command line. Omitting -threads defaults it to 1, so everything runs in the main thread.
The tests below were run with 80 million iterations in total.
On Windows 7 64Bit (Intel Core2Duo Machine):
Compiled using Cygwin GCC 4.5.3: gcc-4 pi.c -o pi.exe -O3
On Ubuntu/Linaro 12.04 (8Core AMD):
Compiled using GCC 4.6.3: gcc pi.c -lm -lpthread -O3 -o pi
Performance
On Windows, the threaded version is only a few milliseconds faster than the un-threaded one. I expected better performance, to be honest. On Linux, ew! What the heck? Why does it take 2000% longer? Of course this depends heavily on the implementation, so here it is. An excerpt from after the command-line arguments have been parsed and the calculation is started:
// Begin computation.
clock_t t_start, t_delta;
double pi = 0;

if (args.threads == 1) {
    t_start = clock();
    pi = pi_mc(args.iterations);
    t_delta = clock() - t_start;
}
else {
    pthread_t* threads = malloc(sizeof(pthread_t) * args.threads);
    if (!threads) {
        return alloc_failed();
    }
    struct PIThreadData* values = malloc(sizeof(struct PIThreadData) * args.threads);
    if (!values) {
        free(threads);
        return alloc_failed();
    }

    t_start = clock();
    for (i = 0; i < args.threads; i++) {
        values[i].iterations = args.iterations;
        values[i].out = 0.0;
        pthread_create(threads + i, NULL, pi_mc_threaded, values + i);
    }
    for (i = 0; i < args.threads; i++) {
        pthread_join(threads[i], NULL);
        pi += values[i].out;
    }
    t_delta = clock() - t_start;

    free(threads);
    threads = NULL;
    free(values);
    values = NULL;

    pi /= (double) args.threads;
}
While pi_mc_threaded() is implemented as:
struct PIThreadData {
    int iterations;
    double out;
};

void* pi_mc_threaded(void* ptr) {
    struct PIThreadData* data = ptr;
    data->out = pi_mc(data->iterations);
    return NULL; /* result is passed back via data->out */
}
You can find the full source code at http://pastebin.com/jptBTgwr.
Question
Why is this? Why is there such an extreme difference on Linux? I expected the amount of time taken to calculate to be at most 3/4 of the original time. It is of course possible that I simply used the pthread library incorrectly. A clarification on how to do this correctly would be very nice.

The problem is that in glibc's implementation, rand() calls __random(), and that
long int
__random ()
{
  int32_t retval;
  __libc_lock_lock (lock);
  (void) __random_r (&unsafe_state, &retval);
  __libc_lock_unlock (lock);
  return retval;
}
locks around each call to the function __random_r that does the actual work.
Thus, as soon as you have more than one thread using rand(), you make each thread wait for the other(s) on almost every call to rand(). Directly using random_r() with your own buffers in each thread should be much faster.
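For illustration, here is a minimal sketch of that per-thread approach (the names thread_rng, thread_rng_init and thread_rand_01, and the 64-byte state size, are my own choices, not taken from the original program); each thread owns a struct random_data and a state buffer, so no lock is ever shared:

#define _GNU_SOURCE          /* for random_r()/initstate_r() on glibc */
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* Per-thread generator state: give each thread one of these,
   seeded differently, and no call ever touches shared state. */
struct thread_rng {
    struct random_data data;
    char state[64];
};

static void thread_rng_init(struct thread_rng* rng, unsigned int seed) {
    memset(rng, 0, sizeof *rng);                 /* glibc requires a zeroed random_data */
    initstate_r(seed, rng->state, sizeof rng->state, &rng->data);
}

/* Uniform double in [0, 1), using the thread's private state. */
static double thread_rand_01(struct thread_rng* rng) {
    int32_t r;
    random_r(&rng->data, &r);                    /* r is in [0, RAND_MAX] */
    return (double) r / ((double) RAND_MAX + 1.0);
}

Each thread would then call thread_rng_init() once with its own seed and use thread_rand_01() inside its loop instead of rand().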

Performance and threading is a black art. The answer depends on the specifics of the compiler and the libraries used for threading, how well the kernel handles it, etc. Basically, if your libraries for *nix are not efficient at switching, moving objects around, etc., threading will in fact be slower. This is one of the reasons a lot of us doing thread work now use the JVM or JVM-like languages. We can trust the JVM's runtime behavior: its overall speed may vary with the platform, but it's consistent on that platform. In addition, you may have some hidden wait/race conditions that this timing uncovered and that may not show up on Windows.
If you are in a position to change your language, consider Scala or D. Scala is the actor-driven successor to Java, and D is a successor to C. Both languages show their roots: if you can write in C, D should be no problem. Both languages, however, implement the actor model. NO MORE THREAD POOLS, NO MORE RACE CONDITIONS, ETC!

For comparison, I just tried your app on Windows Vista, compiled with Borland C++, and the 2 thread version performed nearly twice as fast as the single thread.
pi.exe -iterations 20000000 -stats -threads 1
3.141167
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 12.511000 sec
Threads: Main
pi.exe -iterations 10000000 -stats -threads 2
3.142397
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 6.584000 sec
Threads: 2
That's compiled against the thread-safe run-time library. Using the single thread library, both versions run at twice their thread-safe speed.
pi.exe -iterations 20000000 -stats -threads 1
3.141167
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 6.458000 sec
Threads: Main
pi.exe -iterations 10000000 -stats -threads 2
3.141314
Number of iterations: 20000000
Method: Monte Carlo
Evaluation time: 3.978000 sec
Threads: 2
So the 2 thread version is still twice as fast, but the 1 thread version with the single thread library is actually faster than the 2 thread version on the thread-safe library.
Looking at Borland's rand implementation, they use thread local storage for the seed in the thread-safe implementation, so it's not going to have the same negative impact on threaded code as glibc's lock, but the thread-safe implementation will obviously be slower than the single thread implementation.
The bottom line though, is that your compiler's rand implementation is probably the main performance issue in both cases.
Update
I've just tried replacing your rand_01 calls with inline implementations of Borland's rand function using a local variable for the seed, and the results are consistently twice as fast in the 2 thread case.
The updated code looks like this:
#define MULTIPLIER 0x015a4e35L
#define INCREMENT 1

double pi_mc(int iterations) {
    unsigned seed = 1;
    long long inner = 0;
    long long outer = 0;
    int i;
    for (i = 0; i < iterations; i++) {
        seed = MULTIPLIER * seed + INCREMENT;
        double x = ((int)(seed >> 16) & 0x7fff) / (double) RAND_MAX;
        seed = MULTIPLIER * seed + INCREMENT;
        double y = ((int)(seed >> 16) & 0x7fff) / (double) RAND_MAX;
        double d = sqrt(pow(x, 2.0) + pow(y, 2.0));
        if (d <= 1.0) {
            inner++;
        }
        else {
            outer++;
        }
    }
    return ((double) inner / (double) iterations) * 4;
}
I don't know how good that is as rand implementations go, but it's worth at least trying on Linux to see whether it makes a difference to the performance.

Related

AMD SMT or Intel HT performance

I don't really understand why processors with doubled logical processors are so much more expensive than processors without them. As far as I can tell, there is no difference between running code on 6 or 12 threads on a 6-core/12-thread CPU.
As monkeys asked, here is a C# example emulating a heavy load on each thread:
static void Main(string[] args)
{
    if (IntPtr.Size != 8)
        throw new Exception("use only x64 code, 2020 is coming...");

    //6 for physical cores, 12 for logical cores
    const int limit_threads = 12;
    const int limit_actions = 256;
    const int limit_loop = 1000 * 1000 * 10;
    const double power = 1.0 / 17.0;

    long result = 0;
    var action = new Action(() =>
    {
        long value = 0;
        for (int i = 0; i < limit_loop; i++)
            value += (long)Math.Pow(i, power);
        Interlocked.Add(ref result, value);
    });

    var actions = Enumerable.Range(0, limit_actions).Select(x => action).ToArray();
    var sw = Stopwatch.StartNew();
    Parallel.Invoke(new ParallelOptions()
    {
        MaxDegreeOfParallelism = limit_threads
    }, actions);
    Console.WriteLine($"done in {sw.Elapsed.TotalSeconds}s\nresult={result}\nlimit_threads={limit_threads}\nlimit_actions={limit_actions}\nlimit_loop={limit_loop}");
}
Results for 6 threads (AMD Ryzen 2600):
done in 13,7074543s
result=5086445312
limit_threads=6
limit_actions=256
limit_loop=10000000
Results for 12 threads (AMD Ryzen 2600):
done in 11,3992756s
result=5086445312
limit_threads=12
limit_actions=256
limit_loop=10000000
That's about a 10% performance boost from using all logical cores instead of only the physical ones, which is almost nothing. What can you say now?
Can someone provide sample code which is noticeably faster when using processor multi-threading (AMD SMT or Intel HT) compared to using only physical cores?
TLDR: SMT/HT is a technology that exists to offset the cost of massive multithreading as opposed to speeding up your computation with more cores.
You have misunderstood what SMT/HT does.
"As far as I noticed there is no difference with running code on 6 or 12 threads for 6cores-12threads CPU".
If this is true, then SMT/HT is working.
To understand why, you need to understand modern OS kernels and Kernel Threads. Today's Operating Systems use what is called Preemptive Threading.
The OS Kernel divides up each core into time-slices called "Quantum", and using interrupts schedules the various processes in a complicated round robin fashion.
The part we want to look at is the interrupt. When a CPU core is scheduled to switch to running another thread, we call this a "context switch". Context switches are expensive, slow processes, as the entire state and flow of the highly pipelined CPU must be stopped, saved and swapped out for another state (as well as caches, registers, lookup tables etc.). According to this answer, context switch times are measured in microseconds (thousands of clock cycles), and will only get worse as CPUs become more complicated.
The point of SMT/HT is to cheat by having each CPU core able to store two states at the same time (imagine having two monitors instead of one: you still only use one at a time, but you are more productive because you don't need to rearrange your windows each time you switch tasks). So SMT/HT processors can context switch much faster than non-SMT/HT processors.
So back to your example. If you turned off SMT on your Ryzen 2600 and then ran the same workload with 12 threads, you would find that it performs significantly slower than with 6 threads.
Also, note that more threads does not make things faster.
I think that varying the price of the processors depending on the availability of the SMT/HT technology is just a matter of marketing strategy.
The hardware is probably the same in every case but the feature is disabled by the manufacturer on some of them to offer cheap models.
This technology relies on the fact that some micro-operations in a single
instruction have to wait for something to be executed; so instead of just waiting,
the same core uses its circuits to make some progress on the micro-operations
from another thread.
From a coarse point of view, we can see the execution of two (or more, on certain models) sequences of micro-operations from two different threads executed on a single piece of hardware (except for some duplicated parts, like the registers...).
The efficiency of this technology depends on the problem.
After various tests I noticed that if the problem is compute bound, i.e. the limiting factor is the time needed to compute (add, multiply...), and not memory bound (the data are already available, no need to wait for memory), then this technology does not provide any benefit.
This is due to the fact that there is no gap to fill in the two sequences of micro-operations, thus the intertwined execution of two threads is not better than two independent serial executions.
In the exact opposite case, when the problem is memory bound but not compute bound, there is no benefit either, because both threads have to wait for the data coming from memory.
I only noticed an improvement in performance when the problem is mixed between data access and computation; in this case, when one thread is waiting for data, the same core can make some progress on the computations of the other thread, and vice-versa.
Edit
Below is an example to illustrate these situations; I obtain the following results (quite stable when run many times, dual Xeon E5-2697 v2, Linux 5.3.13).
In this memory bound situation HT does not help.
$ ./prog_ht mem
24 threads running memory_task()
result: 1e+17
duration: 13.0383 seconds
$ ./prog_ht mem ht
48 threads (ht) running memory_task()
result: 1e+17
duration: 13.1096 seconds
In this compute bound situation HT helps (almost 30% gain).
(I don't know exactly what is involved in the hardware when computing cos, but there must be some latencies which are not due to memory access.)
$ ./prog_ht
24 threads running compute_task()
result: -260.782
duration: 9.76226 seconds
$ ./prog_ht ht
48 threads (ht) running compute_task()
result: -260.782
duration: 7.58181 seconds
In this mixed situation HT helps much more (around 70% gain)
$ ./prog_ht mix
24 threads running mixed_task()
result: -260.782
duration: 60.1602 seconds
$ ./prog_ht mix ht
48 threads (ht) running mixed_task()
result: -260.782
duration: 35.121 seconds
Here is the source code (in C++, I'm not comfortable with C#).
/*
g++ -std=c++17 -o prog_ht prog_ht.cpp \
-pedantic -Wall -Wextra -Wconversion \
-Wno-missing-braces -Wno-sign-conversion \
-O3 -ffast-math -march=native -fomit-frame-pointer -DNDEBUG \
-pthread
*/
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <thread>
#include <chrono>
#include <cstdint>
#include <random>
#include <cmath>
#include <pthread.h>
bool // success
bind_current_thread_to_cpu(int cpu_id)
{
  /* !!!!!!!!!!!!!! WARNING !!!!!!!!!!!!!
     I checked the numbering of the CPUs according to the packages and cores
     on my computer/system (dual Xeon E5-2697 v2, Linux 5.3.13)
       0 to 11  --> different cores of package 1
       12 to 23 --> different cores of package 2
       24 to 35 --> different cores of package 1
       36 to 47 --> different cores of package 2
     Thus using cpu_id from 0 to 23 does not bind more than one thread
     to each single core (no HT).
     Of course using cpu_id from 0 to 47 binds two threads to each single
     core (HT is used).
     This numbering is absolutely NOT guaranteed on any other computer/system,
     thus the relation between thread numbers and cpu_id should be adapted
     accordingly.
  */
  cpu_set_t cpu_set;
  CPU_ZERO(&cpu_set);
  CPU_SET(cpu_id, &cpu_set);
  return !pthread_setaffinity_np(pthread_self(), sizeof(cpu_set), &cpu_set);
}

inline
double // seconds since 1970/01/01 00:00:00 UTC
system_time()
{
  const auto now=std::chrono::system_clock::now().time_since_epoch();
  return 1e-6*double(std::chrono::duration_cast
                     <std::chrono::microseconds>(now).count());
}
constexpr auto count=std::int64_t{20'000'000};
constexpr auto repeat=500;
void
compute_task(int thread_id,
             int thread_count,
             const int *indices,
             const double *inputs,
             double *results)
{
  (void)indices; // not used here
  (void)inputs; // not used here
  bind_current_thread_to_cpu(thread_id);
  const auto work_begin=count*thread_id/thread_count;
  const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
  auto result=0.0;
  for(auto r=0; r<repeat; ++r)
  {
    for(auto i=work_begin; i<work_end; ++i)
    {
      result+=std::cos(double(i));
    }
  }
  results[thread_id]+=result;
}

void
mixed_task(int thread_id,
           int thread_count,
           const int *indices,
           const double *inputs,
           double *results)
{
  bind_current_thread_to_cpu(thread_id);
  const auto work_begin=count*thread_id/thread_count;
  const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
  auto result=0.0;
  for(auto r=0; r<repeat; ++r)
  {
    for(auto i=work_begin; i<work_end; ++i)
    {
      const auto index=indices[i];
      result+=std::cos(inputs[index]);
    }
  }
  results[thread_id]+=result;
}

void
memory_task(int thread_id,
            int thread_count,
            const int *indices,
            const double *inputs,
            double *results)
{
  bind_current_thread_to_cpu(thread_id);
  const auto work_begin=count*thread_id/thread_count;
  const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
  auto result=0.0;
  for(auto r=0; r<repeat; ++r)
  {
    for(auto i=work_begin; i<work_end; ++i)
    {
      const auto index=indices[i];
      result+=inputs[index];
    }
  }
  results[thread_id]+=result;
}
int
main(int argc,
     char **argv)
{
  //~~~~ analyse command line arguments ~~~~
  const auto args=std::vector<std::string>{argv, argv+argc};
  const auto has_arg=
    [&](const auto &a)
    {
      return std::find(cbegin(args)+1, cend(args), a)!=cend(args);
    };
  const auto use_ht=has_arg("ht");
  const auto thread_count=int(std::thread::hardware_concurrency())
                          /(use_ht ? 1 : 2);
  const auto use_mix=has_arg("mix");
  const auto use_mem=has_arg("mem");
  const auto task=use_mem ? memory_task
                : use_mix ? mixed_task
                : compute_task;
  const auto task_name=use_mem ? "memory_task"
                     : use_mix ? "mixed_task"
                     : "compute_task";
  //~~~~ prepare input/output data ~~~~
  auto results=std::vector<double>(thread_count);
  auto indices=std::vector<int>(count);
  auto inputs=std::vector<double>(count);
  std::generate(begin(indices), end(indices),
                [i=0]() mutable { return i++; });
  std::copy(cbegin(indices), cend(indices), begin(inputs));
  std::shuffle(begin(indices), end(indices), // fight the prefetcher!
               std::default_random_engine{std::random_device{}()});
  //~~~~ launch threads ~~~~
  std::cout << thread_count << " threads" << (use_ht ? " (ht)" : "")
            << " running " << task_name << "()\n";
  auto threads=std::vector<std::thread>(thread_count);
  const auto t0=system_time();
  for(auto i=0; i<thread_count; ++i)
  {
    threads[i]=std::thread{task, i, thread_count,
                           data(indices), data(inputs), data(results)};
  }
  //~~~~ wait for threads ~~~~
  auto result=0.0;
  for(auto i=0; i<thread_count; ++i)
  {
    threads[i].join();
    result+=results[i];
  }
  const auto duration=system_time()-t0;
  std::cout << "result: " << result << '\n';
  std::cout << "duration: " << duration << " seconds\n";
  return 0;
}

The differences in the accuracy of the calculations in single / multi-threaded (OpenMP) modes

Can anybody explain/understand the difference in the calculation results between single- and multi-threaded mode?
Here is an example of approx. calculation of pi:
#include <iomanip>
#include <cmath>
#include <cstdio>   // printf
#include <cstdlib>  // system
#include <ctime>    // clock
#include <ppl.h>

const int itera(1000000000);

int main()
{
    printf("PI calculation \nconst int itera = 1000000000\n\n");

    clock_t start, stop;

    //Single thread
    start = clock();
    double summ_single(0);
    for (int n = 1; n < itera; n++)
    {
        summ_single += 6.0 / (static_cast<double>(n) * static_cast<double>(n));
    }
    stop = clock();
    printf("Time single thread %f\n", (double)(stop - start) / 1000.0);

    //Multithread with OMP
    //Activate OMP in Project settings, C++, Language
    start = clock();
    double summ_omp(0);
#pragma omp parallel for reduction(+:summ_omp)
    for (int n = 1; n < itera; n++)
    {
        summ_omp += 6.0 / (static_cast<double>(n) * static_cast<double>(n));
    }
    stop = clock();
    printf("Time OMP parallel %f\n", (double)(stop - start) / 1000.0);

    //Multithread with Concurrency::parallel_for
    start = clock();
    Concurrency::combinable<double> piParts;
    Concurrency::parallel_for(1, itera, [&piParts](int n)
    {
        piParts.local() += 6.0 / (static_cast<double>(n) * static_cast<double>(n));
    });

    double summ_Conparall(0);
    piParts.combine_each([&summ_Conparall](double locali)
    {
        summ_Conparall += locali;
    });
    stop = clock();
    printf("Time Concurrency::parallel_for %f\n", (double)(stop - start) / 1000.0);

    printf("\n");
    printf("pi single = %15.12f\n", std::sqrt(summ_single));
    printf("pi omp    = %15.12f\n", std::sqrt(summ_omp));
    printf("pi comb   = %15.12f\n", std::sqrt(summ_Conparall));
    printf("\n");

    system("PAUSE");
}
And the results:
PI calculation VS2010 Win32
Time single thread 5.330000
Time OMP parallel 1.029000
Time Concurrency::parallel_for 11.103000
pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592651497
PI calculation VS2013 Win32
Time single thread 5.200000
Time OMP parallel 1.291000
Time Concurrency::parallel_for 7.413000
pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592647841
PI calculation VS2010 x64
Time single thread 5.190000
Time OMP parallel 1.036000
Time Concurrency::parallel_for 7.120000
pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592649319
PI calculation VS2013 x64
Time single thread 5.230000
Time OMP parallel 1.029000
Time Concurrency::parallel_for 5.326000
pi single = 3.141592643651
pi omp = 3.141592648425
pi comb = 3.141592648489
The tests were made on AMD and Intel CPUs, Win 7 x64.
What is the reason for the difference between the PI calculations in single-threaded and multicore mode?
Why is the result of the calculation with Concurrency::parallel_for not constant across different builds (compiler, 32/64-bit platform)?
P.S.
Visual Studio Express doesn't support OpenMP.
Floating-point addition is a non-associative operation due to round-off errors, therefore the order of operations matters. Having your parallel program give a different result than the serial version is normal. Understanding and dealing with it is part of the art of writing (portable) parallel code. This is exacerbated between the 32- and 64-bit builds since in 32-bit mode the VS compiler uses x87 instructions, and the x87 FPU does all operations with an internal precision of 80 bits. In 64-bit mode SSE math is used.
In the serial case, one thread computes s1+s2+...+sN, where N is the number of terms in the expansion.
In the OpenMP case there are n partial sums, where n is the number of OpenMP threads. Which terms get into each partial sum depends on the way iterations are distributed among the threads. The default for many OpenMP implementations is static scheduling, which means that thread 0 (the main thread) computes ps0 = s1 + s2 + ... + sN/n; thread 1 computes ps1 = sN/n+1 + sN/n+2 + ... + s2N/n; and so on. In the end the reduction somehow combines those partial sums.
The parallel_for case is very similar to the OpenMP one. The difference is that by default the iterations are distributed in a dynamic fashion - see the documentation for auto_partitioner, therefore each partial sum contains a more or less random selection of terms. This not only gives a slightly different result, but it also gives a slightly different result with each execution, i.e. the result from two consecutive parallel_for's with the same number of threads might differ a bit. If you replace the partitioner with an instance of simple_partitioner and set the chunk size equal to itera / number-of-threads, you should get the same result as in the OpenMP case if the reduction is performed the same way.
You could use Kahan summation and also implement your own reduction using Kahan summation. Then the parallel codes should produce the same (or at least a much more similar) result as the serial one.
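As a reference point, here is a minimal Kahan (compensated) summation sketch in plain C; it is my own illustration rather than code from the question, but the same compensation trick can be applied both to each thread's partial sum and to the final combination of the partial sums:

/* Kahan summation: the variable c accumulates the low-order bits
   that would otherwise be lost when adding a small term to a large sum. */
double kahan_sum(const double* x, int n)
{
    double sum = 0.0;
    double c = 0.0;                 /* running compensation */
    for (int i = 0; i < n; i++) {
        double y = x[i] - c;        /* subtract the previously lost part */
        double t = sum + y;         /* add; low-order bits of y may be lost */
        c = (t - sum) - y;          /* recover what was lost this step */
        sum = t;
    }
    return sum;
}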
I would guess that the parallel reduction OpenMP does is in general more accurate, as the floating-point addition round-off error gets more distributed. In general, floating-point reductions are problematic because of rounding errors (see http://floating-point-gui.de/).
Performing those operations in parallel is a way to improve accuracy by distributing the rounding error. Imagine you are doing a big reduction: at some point the accumulator grows large compared to the other values, and this increases the rounding error for each addition, because the accumulator's range is much larger and it may not be possible to represent the smaller value accurately in that range. However, if multiple accumulators for the same reduction operate in parallel, their magnitudes remain smaller and this kind of error is reduced.
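To illustrate the multiple-accumulator idea in isolation (a sketch of my own, not code from the question): splitting one long sum over several accumulators keeps each accumulator closer in magnitude to the terms added to it, which is essentially what a parallel reduction does.

/* Sum n doubles using k interleaved accumulators (assumes k <= 16).
   Each partial sum stays roughly k times smaller than a single
   running total would, so each addition loses fewer low-order bits. */
double split_sum(const double* x, int n, int k)
{
    double partial[16] = { 0.0 };
    for (int i = 0; i < n; i++)
        partial[i % k] += x[i];

    double total = 0.0;
    for (int j = 0; j < k; j++)
        total += partial[j];
    return total;
}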
So...
In Win32 mode the x87 FPU with 80-bit registers will be used.
In x64 mode SSE2 with double-precision floats (64-bit) will be used; SSE2 appears to be the default in x64 mode.
Theoretically... is it possible that the calculation in Win32 mode will be more precise? :)
http://en.wikipedia.org/wiki/SSE2
So is the better way to buy new CPUs with AVX, or to compile to 32-bit code?...

Parallel processing a prime finder with openMP

I am trying to construct a prime finder for a bit of C practice. I've got the algorithm down and I've done a bunch of optimisations to make it faster; I then decided to try to parallelize it because, hey, why not! It turns out to be harder than I thought. I can either get all threads running the same process (with the same args), or only a single thread will run if I try to supply different args to each. I really have no idea what I'm doing here, but you can see some experimental values I'm using in this code:
// gcc -std=c99 -o multithread multithread.c -fopenmp -lm
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

int pf(unsigned int start, unsigned int limit, unsigned int q);

int main(int argc, char *argv[])
{
    printf("prime finder\n");

    int j, slimits[4] = {1,10000000,20000000,30000000}, elimits[4] = {10000000,20000000,30000000,40000000};
    double startTime = omp_get_wtime();

    #pragma omp parallel shared(slimits, elimits primes)
    {
        #pragma omp for
        for (j = 0; j < 4; j++)
        {
            primes += pf(slimits[j], elimits[j], atoi(argv[2]));
        }
    }

    printf("%d prime numbers found in %.2f seconds.\n\n", primes, omp_get_wtime() - startTime);
    return 0;
}
I haven't included the pf function, as it is quite large, but it works on its own; it returns the number of primes found. I'm sure the issue is somewhere in here.
Any help would be greatly appreciated!
You have made at least one obvious (to me) and serious mistake. You've declared primes shared and allowed all the threads in the program to update it. You have, thereby, programmed a data race. Nothing in OpenMP (nor in C if I recall correctly) guarantees that += will be implemented atomically. You haven't actually specified what the problem with your program is, or what the problems are, but this must surely be one of them.
I'll tell you how to fix this later but I think there is a more serious underlying design problem you should address first. You seem to have decided that you would have 4 threads running and that you should divide the range of integers to test for primality into 4 and pass one chunk to each thread. Sure, you can make that work but it's not a smart approach to using OpenMP. Nor is it a smart approach to dividing the work of primality testing.
A smarter approach to OpenMP program design is to start off by making no assumptions about the number of threads that will be available to the executing program. Design for any number of threads, do not design a program whose behaviour depends on the number of threads it gets at run-time. Use OpenMP's facilities, specifically the schedule clause, to distribute the workload at run time.
Turning to primality testing. Draw, or at least think about, a scatter plot of points (i,t(i)), where i is an integer and t(i) is the time it takes to determine whether or not i is prime. The pattern in this plot is about as difficult to discern as the pattern in the plot of the occurrence of primes in the integers. In other words, the time to determine the primality of an integer is very unpredictable. It does tend to rise as the integers increase (well, excluding large even integers which I'm sure your test doesn't consider anyway).
One implication of this unpredictability is that if you divide a range of integers into N sub-ranges and give one sub-range to each of N threads you are not giving the threads the same amount of work to do. Indeed, in the range of integers 1..m (any m) there is one integer which takes much longer to test than any other integer in the range, and this time is the irreducible minimum that your program will take. A naive distribution of the range will produce a seriously unbalanced workload.
Here's what I think you should do to fix your program.
First, write a function which tests the primality of a single integer. This will be the basic task for your computation. Call this is_prime (a rough sketch of such a function is given after the suggested change below). Next, study the schedule clause for the parallel for construct. OpenMP provides a number of task scheduling options; I won't explain them here, you will find plenty of good documentation online. Finally, study also the reduction clause; this provides the solution to the data race you have programmed.
Applying all this I suggest you change
#pragma omp parallel shared(slimits, elimits primes)
{
    #pragma omp for
    for (j = 0; j < 4; j++)
    {
        primes += pf(slimits[j], elimits[j], atoi(argv[2]));
    }
}
to
#pragma omp parallel shared(slimits, elimits, max_int_to_test)
{
    #pragma omp for reduction(+:primes) schedule (dynamic, 10)
    for (j = 3; j < max_int_to_test; j += 2)
    {
        primes += is_prime(j);
    }
}
With any luck my rudimentary C hasn't screwed up the syntax too much.
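For completeness, here is a rough is_prime sketch using trial division (my own illustration, not code from the question); it returns 1 or 0 so the reduction above can simply sum the results:

#include <math.h>

/* Returns 1 if n is prime, 0 otherwise. Assumes n is an odd integer >= 3,
   which matches the loop above (it starts at 3 and steps by 2). */
int is_prime(unsigned int n)
{
    unsigned int limit = (unsigned int) sqrt((double) n);
    for (unsigned int d = 3; d <= limit; d += 2) {
        if (n % d == 0)
            return 0;
    }
    return 1;
}

Note that the suggested loop never tests 2, so you would add that prime to the count separately.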

how to generate a delay

I'm new to kernel programming and I'm trying to understand some basics of the OS. I am trying to generate a delay using a technique which I've implemented successfully on a 20 MHz microcontroller.
I know this is a totally different environment, as I'm using Linux CentOS on my 2 GHz Core 2 Duo processor.
I've tried the following code, but I'm not getting a delay.
#include <linux/kernel.h>
#include <linux/module.h>

int init_module (void)
{
    unsigned long int i, j, k, l;

    for (l = 0; l < 100; l++)
    {
        for (i = 0; i < 10000; i++)
        {
            for (j = 0; j < 10000; j++)
            {
                for (k = 0; k < 10000; k++);
            }
        }
    }

    printk ("\nhello\n");
    return 0;
}

void cleanup_module (void)
{
    printk ("bye");
}
When I run dmesg after inserting the module, as quickly as I possibly can, the string "hello" is already there. If my calculation is right, the above code should give me at least a 10-second delay.
Why is it not working? Is there anything related to threading? How could a 2 GHz processor execute the above code instantly, without any noticeable delay?
The compiler is optimizing your loop away since it has no side effects.
To actually get a 10 second (non-busy) delay, you can do something like this:
#include <linux/sched.h>
//...
unsigned long to = jiffies + (10 * HZ); /* current time + 10 seconds */

while (time_before(jiffies, to))
{
    schedule();
}
or better yet:
#include <linux/delay.h>
//...
msleep(10 * 1000);
For short delays you may use mdelay, ndelay and udelay.
I suggest you read Linux Device Drivers, 3rd edition, chapter 7.3, which deals with delays, for more information.
To answer the question directly, it's likely that your compiler sees that these loops don't do anything and "optimizes" them away.
As for the technique itself, what it looks like you're trying to do is use all of the processor to create a delay. While this may work, an OS should be designed to maximize useful processor time; this will just waste it.
I understand it's experimental, but just a heads-up.
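If the point is only to see the busy loop actually execute (for experimentation, not as a recommended kernel technique), one option is to make the loop counters volatile so the compiler cannot discard the loops as having no effect. This is a sketch of that idea, not code from the original module, and the loop bounds are arbitrary:

#include <linux/kernel.h>
#include <linux/module.h>

/* Sketch only: volatile forces the compiler to keep every load/store on
   the counters, so the loops are not optimized away. This still busy-waits
   and hogs a CPU; msleep()/mdelay() above remain the proper tools. */
int init_module(void)
{
    volatile unsigned long i, j;

    for (i = 0; i < 100000; i++)
        for (j = 0; j < 100000; j++)
            ; /* spin */

    printk("\nhello\n");
    return 0;
}

void cleanup_module(void)
{
    printk("bye");
}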

Thread-safe random number generation for Monte-Carlo integration

I'm trying to write something which very quickly calculates random numbers and can be used by multiple threads. My current code is:
/* Approximating PI using a Monte-Carlo method. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>

#define N 1000000000 /* As large as possible for increased accuracy */

double random_function(void);

int main(void)
{
    int i = 0;
    double X, Y;
    double count_inside_temp = 0.0, count_inside = 0.0;
    unsigned int th_id = omp_get_thread_num();

    #pragma omp parallel private(i, X, Y) firstprivate(count_inside_temp)
    {
        srand(th_id);
        #pragma omp for schedule(static)
        for (i = 0; i <= N; i++) {
            X = 2.0 * random_function() - 1.0;
            Y = 2.0 * random_function() - 1.0;
            if ((X * X) + (Y * Y) < 1.0) {
                count_inside_temp += 1.0;
            }
        }
        #pragma omp atomic
        count_inside += count_inside_temp;
    }

    printf("Approximation to PI is = %.10lf\n", (count_inside * 4.0) / N);
    return 0;
}

double random_function(void)
{
    return ((double) rand() / (double) RAND_MAX);
}
This works, but from observing a resource manager I know it's not using all the threads. Does rand() work for multithreaded code? And if not, is there a good alternative? Many thanks. Jack
Is rand() thread-safe? Maybe, maybe not:
"The rand() function need not be reentrant. A function that is not required to be reentrant is not required to be thread-safe."
One test, and a good learning exercise, would be to replace the call to rand() with, say, a fixed integer and see what happens.
The way I think of pseudo-random number generators is as a black box which takes an integer as input and returns an integer as output. For any given input the output is always the same, but there is no pattern in the sequence of numbers and the sequence is uniformly distributed over the range of possible outputs. (This model isn't entirely accurate, but it'll do.) The way you use this black box is to choose a starting number (the seed), use the output value in your application, and also feed it back as the input for the next call to the random number generator. There are two common approaches to designing an API:
1. Two functions, one to set the initial seed (e.g. srand(seed)) and one to retrieve the next value from the sequence (e.g. rand()). The state of the PRNG is stored internally in some sort of global variable. Generating a new random number either will not be thread-safe (hard to tell, but the output stream won't be reproducible) or will be slow in multithreaded code (you end up with some serialization around the state value).
2. An interface where the PRNG state is exposed to the application programmer. Here you typically have three functions: init_prng(seed), which returns some opaque representation of the PRNG state; get_prng(state), which returns a random number and changes the state variable; and destroy_prng(state), which just cleans up allocated memory and so on. PRNGs with this type of API should all be thread-safe and run in parallel with no locking (because you are in charge of managing the (now thread-local) state variable).
I generally write in Fortran and use Ladd's implementation of the Mersenne Twister PRNG (that link is worth reading). There are lots of suitable PRNGs in C which expose the state to your control. That kind of PRNG looks good, and using one (with initialization and destroy calls inside the parallel region and private state variables) should give you a decent speedup.
Finally, it's often the case that PRNGs can be made to perform better if you ask for a whole sequence of random numbers in one go (e.g. the compiler can vectorize the PRNG internals). Because of this, libraries often have something like get_prng_array(state) functions which give you back an array full of random numbers, as if you had put get_prng in a loop filling the array elements - they just do it more quickly. This would be a second optimization (and would need an added for loop inside the parallel for loop). Obviously, you don't want to run out of per-thread stack space doing this!
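Following the second approach above, here is a minimal sketch of the question's loop with per-thread generator state, using POSIX rand_r() (the seeds and iteration count are illustrative assumptions, and rand_r() is a fairly weak generator compared to a per-thread Mersenne Twister, but it keeps all state thread-local and lock-free):

#define _POSIX_C_SOURCE 200112L   /* for rand_r() */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 100000000L

int main(void)
{
    double count_inside = 0.0;

    #pragma omp parallel reduction(+:count_inside)
    {
        /* Each thread owns its seed, so no call contends on shared state. */
        unsigned int seed = 1234u + 17u * (unsigned int) omp_get_thread_num();

        #pragma omp for
        for (long i = 0; i < N; i++) {
            double x = 2.0 * ((double) rand_r(&seed) / RAND_MAX) - 1.0;
            double y = 2.0 * ((double) rand_r(&seed) / RAND_MAX) - 1.0;
            if (x * x + y * y < 1.0)
                count_inside += 1.0;
        }
    }

    printf("Approximation to PI is = %.10f\n", (count_inside * 4.0) / N);
    return 0;
}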
