Issue in parallelising inner loop of a nested for in OpenMP - multithreading

I need to parallelize the inner loop of a nested loop with OpenMP. The way I did it is not working well. Each thread should iterate over all of the M points, but only iterate (in the second loop) over its own chunk of coordinates. So I want the first loop to go from 0 to M, and the second one from my_first_coord to my_last_coord. With the code I posted, the program is faster when launched with 4 threads than with 8, so there is some issue. I know one way to do this is by "manually" dividing the coordinates, meaning that each thread gets its own num_of_coords / thread_count (and handling the remainder); I did that with Pthreads. I would prefer to make use of pragmas in OpenMP. I'm sure I'm missing something. Let me show you the code:
#pragma omp parallel
...
for (int i = 0; i < M; i++) { // all threads iterate from 0 to M
    #pragma omp for nowait
    for (int coord = 0; coord < N; coord++) { // each works on its portion of coords
        centroids[points[i].cluster].accumulator.coordinates[coord] += points[i].coordinates[coord];
    }
}
I'm including the Pthreads version too, so that you don't misunderstand what I want to achieve; I want the same behaviour, but with the use of pragmas.
/* M is global,
   first_nn and last_nn are local */
for (long i = 0; i < M; i++)
    for (long coord = first_nn; coord <= last_nn; coord++)
        centroids[points[i].cluster].accumulator.coordinates[coord] += points[i].coordinates[coord];
I hope that it is clear enough. Thank you
Edit:
I'm using gcc 12.2.0. By adding the -O3 flag times have improved.
With larger inputs the difference in speedup between 4 and 8 threads is more significant.

Your comment indicates that you are worried about speedup.
1. How many physical cores does your processor have? Try every thread count from 1 to that number. Do not use hyperthreads.
2. You may find a good speedup for low thread counts, but a leveling-off effect: that is because you have a "streaming" operation, which is limited by bandwidth. Unless you have a very expensive processor, there is not enough bandwidth to keep all cores running fast.
3. You could try setting OMP_PROC_BIND=true, which prevents the OS from migrating your threads. That can improve cache usage.
4. You have some sort of indirect addressing going on with the i variable, so further memory effects related to the TLB may make your parallel code not scale optimally.
But start with point 3 and report.
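For point 1, a small driver along these lines could run the scaling test (a sketch only; run_kernel() is a placeholder for the accumulation loop from the question):
#include <omp.h>
#include <stdio.h>

void run_kernel(void);   /* hypothetical: the loop nest being measured */

int main(void) {
    int max_threads = omp_get_num_procs();   /* upper bound; stop at the physical core count if SMT is on */
    for (int t = 1; t <= max_threads; t++) {
        omp_set_num_threads(t);
        double start = omp_get_wtime();
        run_kernel();
        printf("%d thread(s): %.3f s\n", t, omp_get_wtime() - start);
    }
    return 0;
}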

What is the point of running same code under different threads - openMP?

From: https://bisqwit.iki.fi/story/howto/openmp/
The parallel construct
The parallel construct starts a parallel block. It creates a team
of N threads (where N is determined at runtime, usually from the
number of CPU cores, but may be affected by a few things), all of
which execute the next statement (or the next block, if the statement
is a {…} -enclosure). After the statement, the threads join back into
one.
#pragma omp parallel
{
// Code inside this region runs in parallel.
printf("Hello!\n");
}
I want to understand what is the point of running same code under different threads. In what kind of cases it can be helpful?
By using omp_get_thread_num() you can retrieve the thread ID which enables you to parametrize the so called "same code" with respect to that thread ID.
Take this example:
A is a 1000-dimensional integer array and you need to sum its values using 2 OpenMP threads.
You would design your code something like this:
int A[1000];                  // the 1000-element integer array described above
int A_dim = 1000;
long sum[2] = {0, 0};
#pragma omp parallel num_threads(2)   // the example assumes exactly 2 threads
{
    int threadID = omp_get_thread_num();
    int start = threadID * (A_dim / 2);       // first index this thread sums
    int end = (threadID + 1) * (A_dim / 2);   // one past the last index
    for (int i = start; i < end; i++)
        sum[threadID] += A[i];
}
start is the lower bound from which your thread will start summing (example: thread #0 will start summing from 0, while thread #1 will start summing from 500).
end is much the same as start, but it is the upper bound of the array index the thread will sum up to (example: thread #0 will sum until 500 is reached, covering values from A[0] to A[499], while thread #1 will sum until 1000 is reached, covering values from A[500] to A[999]).
I want to understand what is the point of running same code under different threads. In what kind of cases it can be helpful?
When you are running the same code on different data.
For example, if I want to invert 10 matrices, I might run the matrix inversion code on 10 threads ... to get (ideally) a 10-fold speedup compared to 1 thread and a for loop.
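A rough sketch of that idea in OpenMP (invert() and the matrices array are placeholders, not a real API):
#pragma omp parallel for
for (int m = 0; m < 10; m++)
    invert(matrices[m]);   // each inversion is independent, so the iterations can run on different threads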
The basic idea of OpenMP is to distribute work. For this you need to create some threads.
The parallel construct creates this number of threads. Afterwards you can distribute/share work with other constructs like omp for or omp task.
A possible benefit of this distinction is e.g. when you have to allocate memory for each thread (i.e. thread-local data).
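A minimal sketch of that pattern, assuming some per-element work that needs a scratch buffer (process() and the buffer size are placeholders):
#include <stdlib.h>

double process(double x, double *scratch);   /* hypothetical per-element work routine */

void example(double *data, int n) {
    #pragma omp parallel
    {
        /* allocated inside the parallel region: one buffer per thread (thread-local data) */
        double *scratch = malloc(1024 * sizeof *scratch);
        #pragma omp for
        for (int i = 0; i < n; i++)
            data[i] = process(data[i], scratch);
        free(scratch);
    }
}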
I want to understand what is the point of running same code under different threads. In what kind of cases it can be helpful?
One example: in physics you have a random process (collisions, an initial Maxwellian distribution, etc.) in your code, and you need to run the code many times to get average results; in this case you need to run the same code several times.

Speed-up from multi-threading

I have a highly parallelizable problem. Hundreds of separate problems need to be solved by the same function. The problems each take an average of perhaps 120 ms (0.12 s) on a single core, but there is substantial variation, and some extreme and rare ones may take 10 times as long. Each problem needs memory, but this is allocated ahead of time. The problems do not need disk I/O, and they do not pass back and forth any variables once they are running. They do access different parts (array elements) of the same global struct, though.
I have C++ code, based on someone else's code, that works. (The global array of structs is not shown.) It runs 20 problems (for instance) and then returns. I think 20 is enough to even out the variability on 4 cores. I see the execution time flattening out from about 10 already.
There is a Win32 and an OpenMP version, and they behave almost identically in terms of execution time. I run the program on a 4-core Windows system. I include some OpenMP code below since it is shorter. (I changed names etc. to make it more generic and I may have made mistakes -- it won't compile stand-alone.)
The speed-up over the single-threaded version flattens out at about a factor of 2.3. So if it takes 230 seconds single-threaded, it takes 100 s multi-threaded. I am surprised that the speed-up is not a lot closer to 4, the number of cores.
Am I right to be disappointed?
Is there anything I can do to get closer to my theoretical expectation?
int split_bigtask(Inputs * inputs, Outputs * results)
{
    for (int k = 0; k < MAXNO; k++)
        results->solved[k].value = 0;
    #pragma omp parallel shared(inputs, results)
    {
        #pragma omp for schedule(dynamic)
        for (int k = 0; k < inputs->no; k++)
        {
            // res is declared inside the loop so each iteration (and thread) has its own copy
            int res = bigtask(inputs->values[k],
                              results->solved[k],
                              omp_get_thread_num());
        }
    }
    return TRUE;
}
I assume that there is no synchronization done within bigtask() (obvious, but I'd still check it first). Beyond that, two things could be going on:
1. You run into a "dirty cache" problem: if you manipulate data that is close together (e.g. in the same cache line!) from multiple cores, each manipulation marks the cache line as dirty, which means the processor needs to signal this to all other processors, which in turn involves synchronization again...
2. You create too many threads (allocating a thread is quite an overhead, so creating one thread per core is a lot more efficient than creating several threads per core).
I personally would assume that you have case 1 (the "Big Global Array").
Solutions to the problem (if it is indeed case 1):
Write the results to a local array which is merged into the "Big Global Array" by the main thread after the work is done (see the sketch below)
Split the global array into several smaller arrays (and give each thread one of these arrays)
Ensure that the records within the structure align on cache-line boundaries (this is a bit of a hack, since cache-line boundaries may change for future processors)
You may want to try to create a local copy of the array for each thread (at least for the results)
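A minimal sketch of the first suggestion, assuming each task produces a single double (for brevity the merge happens in a critical section rather than in the main thread, and bigtask_value() is a hypothetical variant of bigtask() that returns its result):
#pragma omp parallel
{
    double local[MAXNO] = {0};                         /* private results for this thread */
    #pragma omp for schedule(dynamic)
    for (int k = 0; k < inputs->no; k++)
        local[k] = bigtask_value(inputs->values[k]);   /* hypothetical */
    #pragma omp critical                               /* one merge per thread instead of one shared write per task */
    for (int k = 0; k < inputs->no; k++)
        if (local[k] != 0) results->solved[k].value = local[k];
}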

Parallel, but slower

I'm using the Monte Carlo method to calculate pi and to do a basic experiment with parallel programming and OpenMP.
The problem is that 1 thread with x iterations always runs faster than n threads with x iterations. Can anyone tell me why?
For example, the code is run like this: "a.out 1 1000000", where 1 is the number of threads and 1000000 the number of iterations.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <iomanip>
#include <math.h>
using namespace std;

int main (int argc, char *argv[]) {
    double arrow_area_circle, pi;
    float xp, yp;
    int i, iterations;
    double pitg = atan(1.0)*4.0; // reference value of pi, for the error

    cout << "Number processors: " << omp_get_num_procs() << endl;

    // Number of divisions
    iterations = atoi(argv[2]);
    arrow_area_circle = 0.0;

    #pragma omp parallel num_threads(atoi(argv[1]))
    {
        srandom(omp_get_thread_num()); // per-thread seed, but the C library generator state is shared
        #pragma omp for private(xp, yp) reduction(+:arrow_area_circle) //*,/,-,+
        for (i = 0; i < iterations; i++) {
            xp = rand()/(float)RAND_MAX;
            yp = rand()/(float)RAND_MAX;
            if (pow(xp,2.0)+pow(yp,2.0) <= 1) arrow_area_circle++;
        }
    }

    pi = 4*arrow_area_circle / iterations;
    cout << setprecision(18) << "PI = " << pi << endl << endl;
    cout << setprecision(18) << "Erro = " << pitg-pi << endl << endl;
    return 0;
}
A CPU intensive task like this will be slower if you do the work in more threads than there are CPU's in the system. If you are running it on a single CPU system, you will definitely see a slowdown with more than one thread. This is due to the OS having to switch between the various threads - this is pure overhead. You should ideally have the same number of threads as cores for a task like this.
Another issue is that arrow_area_circle is shared between threads. If you have a thread running on each core, incrementing arrow_area_circle will invalidate the copy in the caches of the other cores, causing them to have to refetch. arrow_area_circle++, which should take a couple of cycles, will take dozens or hundreds of cycles. Try creating an arrow_area_circle per thread and combining them at the end.
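A minimal sketch of that suggestion, applied to the loop from the question (this is essentially what the reduction(+:...) clause in the question already asks for under the hood):
double arrow_area_circle = 0.0;
#pragma omp parallel
{
    double my_count = 0.0;                       /* private to this thread: no cache-line bouncing */
    float xp, yp;
    #pragma omp for
    for (int i = 0; i < iterations; i++) {
        xp = rand() / (float)RAND_MAX;           /* rand() itself still has shared state, see the next answer */
        yp = rand() / (float)RAND_MAX;
        if (xp * xp + yp * yp <= 1.0f) my_count++;
    }
    #pragma omp atomic
    arrow_area_circle += my_count;               /* combine once per thread at the end */
}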
EDIT: Joe Duffy just posted a blog entry on the cost of sharing data between threads.
It looks like you are using some kind of auto-parallelizing compiler. I am going to assume you have more than 1 core/CPU in your system (as that would be too obvious -- and no, hyperthreading on a Pentium 4 doesn't count as having two cores, regardless of what Intel's marketing would have you believe). There are two problems that I see. The first is trivial and probably not your problem:
If the variable arrow_area_circle is shared between your processes, then the act of executing arrow_area_circle++ will cause an interlocking instruction to be used to synchronize the value in a way that is atomically sound. You should increment a "private" variable, then add that value just once at the end to the common arrow_area_circle variable instead of incrementing arrow_area_circle in your inner loop.
The rand() function, to operate soundly, must internally execute with a critical section. The reason is that its internal state/seed is a static shared variable; if it were not, it would be possible for two different processes to get the same output from rand() with unusually high probability, just because they were calling rand() at nearly the same time. That means rand() runs slowly, and especially so as more threads/processes are calling it at the same time. Unlike the arrow_area_circle variable (which just needs an atomic increment), a true critical section has to be invoked by rand() because its state update is more complicated. To work around this, you should obtain the source code for your own random number generator and use it with a private seed or state. The source code for the standard rand() implementation in most compilers is widely available.
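One way to follow that advice without copying a rand() implementation is POSIX's rand_r(), which keeps its state in a caller-supplied variable, so each thread can own its own state. A sketch, reusing the variable names from the question:
#pragma omp parallel
{
    unsigned int seed = 1u + omp_get_thread_num();     /* private per-thread RNG state */
    #pragma omp for private(xp, yp) reduction(+:arrow_area_circle)
    for (int i = 0; i < iterations; i++) {
        xp = rand_r(&seed) / (float)RAND_MAX;          /* no shared state, no internal locking */
        yp = rand_r(&seed) / (float)RAND_MAX;
        if (xp * xp + yp * yp <= 1.0f) arrow_area_circle++;
    }
}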
I'd also like to point out that you are using the pow(,) function to perform the same thing as x * x. The latter is about 300 times faster than the former. Though this point is irrelevant to the question you are asking. :)
Context switching.
rand() is a blocking function; it has a critical section inside.
Just to stress that you have to be really careful using random numbers in a parallel setting. In fact you should use something like SPRNG
Whatever you do, make sure that each thread isn't using the same random numbers.

Can I easily write a program to make use of Intel's quad-core or i7 chip if only 1 thread is used?

I wonder: if my program has only 1 thread, can I write it so that the quad core or i7 can actually make use of the different cores? Usually when I write programs on a quad-core computer, the CPU usage will only go to about 25%, and the work seems to be divided among the 4 cores, as the Task Manager shows. (The programs I write are usually Ruby, Python, or PHP, so they may not be very optimized.)
Update: what if I write it in C or C++ instead, and
for (i = 0; i < 100000000; i++) {
a = i * 2;
b = i + 1;
if (a == ... || b == ...) { ... }
}
and then use the highest level of optimization with the compiler. Can the compiler make the multiplication happen on one core and the addition happen on a different core, and therefore make 2 cores work at the same time? Isn't that a fairly easy optimization to use 2 cores?
No. You need to use threads to execute multiple paths concurrently on multiple CPU's (be they real or virtual)... execution of one thread is inherently bound to one CPU as this maintains the "happens before" relationship between statements, which is central to how programs work.
First, unless multiple threads are created in the program, then there is only a single thread of execution in that program.
Seeing 25% of CPU resources being used for the program is an indication that a single core out of four is being utilized at 100%, but all other cores are not being used. If all cores were used, then it would be theoretically possible for the process to hog 100% of the CPU resources.
As a side note, the graphs shown in Task Manager in Windows show the CPU utilization of all processes running at the time, not only that of one process.
Secondly, the code you present could be split into code which can execute on two separate threads in order to execute on two cores. I am guessing that you want to show that a and b are independent of each other, and they only depend on i. With that type of situation, separating the inside of the for loop like the following could allow multi-threaded operation which could lead to increased performance:
// Process this in one thread:
for (int i = 0; i < 1000; i++) {
a = i * 2;
}
// Process this in another thread:
for (int i = 0; i < 1000; i++) {
b = i + 1;
}
However, what becomes tricky is if there needs to be a time when the results from the two separate threads need to be evaluated, such as seems to be implied by the if statement later on:
for (i = 0; i < 1000; i++) {
// manipulate "a" and "b"
if (a == ... || b == ...) { ... }
}
This would require that the a and b values which reside in separate threads (which are executing on separate processors) to be looked up, which is a serious headache.
There is no real good guarantee that the i values of the two threads are the same at the same time (after all, multiplication and addition will probably take different amounts of time to execute), and that means that one thread may need to wait for another for the i values to get in sync before comparing the a and b that correspond to the dependent value i. Or, do we make a third thread for value comparison and synchronization of the two threads? In either case, the complexity is starting to build up very quickly, so I think we can agree that we're starting to see a serious mess arising -- sharing states between threads can be very tricky.
Therefore, the code example you provide is only partially parallelizable without much effort, however, as soon as there is a need to compare the two variables, separating the two operations becomes very difficult very quickly.
A couple of rules of thumb when it comes to concurrent programming:
When there are tasks which can be broken down into parts which involve processing of data that is completely independent of other data and its results (states), then parallelizing can be very easy.
For example, two functions which calculates a value from an input (in pseudocode):
f(x) = { return 2x }
g(x) = { return x+1 }
These two functions don't rely on each other, so they can be executed in parallel without any pain. Also, as there are no states to share or handle between calculations, even if there were multiple values of x that needed to be calculated, those can be split up further:
x = [1, 2, 3, 4]
foreach t in x:
runInThread(f(t))
foreach t in x:
runInThread(g(t))
Now, in this example, we can have 8 separate threads performing calculations. Not having side effects can be a very good thing for concurrent programming.
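A C sketch of that splitting with OpenMP tasks (f and g implement the two pseudocode functions above; with four inputs and two independent functions there are eight independent pieces of work):
int f(int x) { return 2 * x; }
int g(int x) { return x + 1; }

void run(void) {
    int x[4] = {1, 2, 3, 4};
    int fr[4], gr[4];                       /* results of f and g for each input */
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < 4; i++) {
        #pragma omp task shared(fr, x)
        fr[i] = f(x[i]);                    /* independent task */
        #pragma omp task shared(gr, x)
        gr[i] = g(x[i]);                    /* independent task */
    }
    /* all tasks complete at the implicit barrier ending the parallel region */
}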
However, as soon as there is dependency on data and results from other calculations (which also means there are side effects), parallelization becomes extremely difficult. In many cases, these types of problems will have to be performed in serial as they await results from other calculations to be returned.
Perhaps the question comes down to, why can't compilers figure out parts that can be automatically parallelized and perform those optimizations? I'm not an expert on compilers so I can't say, but there is an article on automatic parallelization at Wikipedia which may have some information.
I know Intel chips very well.
In your code, "if (a == ... || b == ...)" is a barrier; otherwise the processor cores will execute all of the code in parallel, regardless of what kind of optimization the compiler has done. This only requires that the compiler is not a very "stupid" one. It means that the hardware has this capability itself, not the software. So threaded programming or OpenMP is not necessary in such cases, though they will help improve parallel computing. Note that this doesn't mean Hyper-Threading, just normal multi-core processor functionality.
Please google "processor pipeline multi port parallel" to learn more.
Here I'd like to give a classic example which could be executed in parallel by multi-core/multi-channel IMC platforms (e.g. the Intel Nehalem family, such as Core i7), with no extra software optimization needed.
char buffer0[64];
char buffer1[64];
char buffer2[64];
char buffer[192];
int i;
for (i = 0; i < 64; i++) {
*(buffer + i) = *(buffer0 + i);
*(buffer + 64 + i) = *(buffer1 + i);
*(buffer + 128 + i) = *(buffer2 + i);
}
Why? 3 reasons.
1. Core i7 has a triple-channel IMC; its bus width is 192 bits, 64 bits per channel, and the memory address space is interleaved among the channels on a per-cache-line basis. Cache-line length is 64 bytes. So basically buffer0 is on channel 0, buffer1 will be on channel 1 and buffer2 on channel 2; while buffer[192] is interleaved among the 3 channels evenly, 64 bytes per channel. The IMC supports loading or storing data from or to multiple channels concurrently. That's a multi-channel MC burst with maximum throughput. In the following description, I'll only say 64 bytes per channel, i.e. with BL x8 (Burst Length 8, 8 x 8 = 64 bytes = one cache line) per channel.
2. buffer0..2 and buffer are contiguous in the memory space (on a specific page, both virtually and physically; stack memory). When run, buffer0, 1, 2 and buffer are loaded/fetched into the processor cache, 6 cache lines in total. So after the execution of the "for(){}" code above starts, accessing memory is not necessary at all because all the data are in the cache, the L3 cache, a non-core part which is shared by all cores. We'll not talk about L1/L2 here. In this case every core could pick the data up and compute it independently; the only requirement is that the OS supports MP and task stealing is allowed, i.e. runtime scheduling and affinity sharing.
3. There are no dependencies among buffer0, 1, 2 and buffer, so there are no execution stalls or barriers; e.g. executing *(buffer + 64 + i) = *(buffer1 + i) does not need to wait for the execution of *(buffer + i) = *(buffer0 + i) to finish.
The most important and difficult point, though, is "task stealing, runtime scheduling and affinity sharing": for a given task, there is only one task execution context, and it should be shared by all cores to perform parallel execution. Anyone who can understand this point is among the top experts in the world. I'm looking for such an expert to co-work on my open source project and be responsible for parallel computing and work related to the latest HPC architectures.
Note that in the example code above you could also use some SIMD instructions, such as movntdq/movntdqa, which bypass the processor cache and write to memory directly. That is a very good idea too when performing software-level optimization, though accessing memory is extremely expensive; for example, accessing the cache (L1) may need only 1 cycle, but accessing memory needs about 142 cycles on older x86 chips.
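For reference, a sketch of that non-temporal store idea with the SSE2 intrinsic that maps to movntdq (this is an illustration, not part of the original code; it assumes 16-byte-aligned buffers and a size that is a multiple of 16):
#include <emmintrin.h>

void copy_bypassing_cache(char *dst, const char *src, int n) {
    for (int i = 0; i < n; i += 16) {
        __m128i v = _mm_load_si128((const __m128i *)(src + i));  /* aligned load */
        _mm_stream_si128((__m128i *)(dst + i), v);               /* movntdq: store bypassing the cache */
    }
    _mm_sfence();   /* make the streaming stores visible to other cores */
}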
Please visit http://effocore.googlecode.com and http://effogpled.googlecode.com to know the details.
Implicit parallelism is probably what you are looking for.
If your application code is single-threaded multiple processors/cores will only be used if:
the libraries you use are using multiple threads (perhaps hiding this usage behind a simple interface)
your application spawns other processes to perform some part of its operation
Ruby, Python and PHP applications can all be written to use multiple threads, however.
A single-threaded program will only use one core. The operating system might well decide to shift the program between cores from time to time - according to some rules to balance the load, etc. So you will see only 25% usage overall and all four cores working - but only one at once.
The only way to use multiple cores without using multithreading is to use multiple programs.
In your example above, one program could handle 0-24999999, the next 25000000-49999999, and so on. Set all four of them off at the same time, and they will use all four cores.
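A rough POSIX sketch of that multi-process split (error handling omitted; the loop body is a placeholder):
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    const long total = 100000000, nprocs = 4, chunk = total / nprocs;
    for (long p = 0; p < nprocs; p++) {
        if (fork() == 0) {                          /* child process: handle one quarter of the range */
            for (long i = p * chunk; i < (p + 1) * chunk; i++) {
                /* per-iteration work would go here */
            }
            _exit(0);
        }
    }
    while (wait(NULL) > 0)                          /* parent waits for all four children */
        ;
    return 0;
}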
Usually you would be better off writing a (single) multithreaded program.
With C/C++ you can use OpenMP. It's C code with pragmas like
#pragma omp parallel for
for(..) {
...
}
to say that this for will run in parallel.
This is one easy way to parallelize something, but at some time you will have to understand how parallel programs execute and will be exposed to parallel programming bugs.
If you want to parallelize the choice of the "i"s for which your statement if (a == ... || b == ...) evaluates to "true", then you can do this with PLINQ (in .NET 4.0):
//note the "AsParallel"; that's it, multicore support.
var query = from i in Enumerable.Range(0, 100000000).AsParallel()
where (i % 2 == 1 && i >= 10) //your condition
select i;
//while iterating, the query is evaluated in parallel!
//Result will probably never be in order (eg. 13, 11, 17, 15, 19..)
foreach (var selected in query)
{
//not parallel here!
}
If, instead, you want to parallelize operations, you will be able to do:
Parallel.For(0, 100000000, i =>
{
if (i > 10) //your condition here
DoWork(i); //Thread-safe operation
});
Since you are talking about 'task manager', you appear to be running on Windows. However, if you are running a webserver on there (for Ruby or PHP with fcgi or Apache pre-forking, and to a lesser extent other Apache workers), with multiple processes, then they would tend to spread out across the cores.
If only a single program without threading is running, then, no, no significant advantage will come from that - you're only running one thing at a time, other than OS-driven background processes.

What can make a program run slower when using more threads?

This question is about the same program I previously asked about. To recap, I have a program with a loop structure like this:
for (int i1 = 0; i1 < N; i1++)
    for (int i2 = 0; i2 < N; i2++)
        for (int i3 = 0; i3 < N; i3++)
            for (int i4 = 0; i4 < N; i4++)
                histogram[bin_index(i1, i2, i3, i4)] += 1;
bin_index is a completely deterministic function of its arguments which, for purposes of this question, does not use or change any shared state - in other words, it is manifestly reentrant.
I first wrote this program to use a single thread. Then I converted it to use multiple threads, such that thread n runs all iterations of the outer loop where i1 % nthreads == n. So the function that runs in each thread looks like
for (int i1 = n; i1 < N; i1 += nthreads)
    for (int i2 = 0; i2 < N; i2++)
        for (int i3 = 0; i3 < N; i3++)
            for (int i4 = 0; i4 < N; i4++)
                thread_local_histogram[bin_index(i1, i2, i3, i4)] += 1;
and all the thread_local_histograms are added up in the main thread at the end.
Here's the strange thing: when I run the program with just 1 thread for some particular size of the calculation, it takes about 6 seconds. When I run it with 2 or 3 threads, doing exactly the same calculation, it takes about 9 seconds. Why is that? I would expect that using 2 threads would be faster than 1 thread since I have a dual-core CPU. The program does not use any mutexes or other synchronization primitives so two threads should be able to run in parallel.
For reference: typical output from time (this is on Linux) for one thread:
real 0m5.968s
user 0m5.856s
sys 0m0.064s
and two threads:
real 0m9.128s
user 0m10.129s
sys 0m6.576s
The code is at http://static.ellipsix.net/ext-tmp/distintegral.ccs
P.S. I know there are libraries designed for exactly this kind of thing that probably could have better performance, but that's what my last question was about so I don't need to hear those suggestions again. (Plus I wanted to use pthreads as a learning experience.)
To avoid further comments on this: When I wrote my reply, the questioner hadn't posted a link to his source yet, so I could not tailor my reply to his specific issues. I was only answering the general question of what "can" cause such an issue; I never said that this will necessarily apply to his case. When he posted a link to his source, I wrote another reply that focuses exactly on his very issue (which is caused by the use of the random() function, as I explain in my other reply). However, since the question of this post is still "What can make a program run slower when using more threads?" and not "What makes my very specific application run slower?", I've seen no need to change my rather general reply either (general question -> general response, specific question -> specific response).
1) Cache Poisoning
All threads access the same array, which is a block of memory. Each core has its own cache to speed up memory access. Since they don't just read from the array but also change the content, the content is changed actually in the cache only, not in real memory (at least not immediately). The problem is that the other thread on the other core may have overlapping parts of memory cached. If now core 1 changes the value in the cache, it must tell core 2 that this value has just changed. It does so by invalidating the cache content on core 2 and core 2 needs to re-read the data from memory, which slows processing down. Cache poisoning can only happen on multi-core or multi-CPU machines. If you just have one CPU with one core this is no problem. So to find out if that is your issue or not, just disable one core (most OSes will allow you to do that) and repeat the test. If it is now almost equally fast, that was your problem.
2) Preventing Memory Bursts
Memory is read fastest if read sequentially in bursts, just like when files are read from HD. Addressing a certain point in memory is actually awfully slow (just like the "seek time" on a HD), even if your PC has the best memory on the market. However, once this point has been addressed, sequential reads are fast. The first addressing goes by sending a row index and a column index and always having waiting times in between before the first data can be accessed. Once this data is there, the CPU starts bursting. While the data is still on the way it sends already the request for the next burst. As long as it is keeping up the burst (by always sending "Next line please" requests), the RAM will continue to pump out data as fast as it can (and this is actually quite fast!). Bursting only works if data is read sequentially and only if the memory addresses grow upwards (AFAIK you cannot burst from high to low addresses). If now two threads run at the same time and both keep reading/writing memory, however both from completely different memory addresses, each time thread 2 needs to read/write data, it must interrupt a possible burst of thread 1 and the other way round. This issue gets worse if you have even more threads and this issue is also an issue on a system that has only one single-core CPU.
BTW, running more threads than you have cores will never make your process any faster (as you mentioned 3 threads); it will rather slow it down (thread context switches have side effects that reduce processing throughput) - that is, unless you run more threads because some threads are sleeping or blocking on certain events and thus cannot actively process any data. In that case it may make sense to run more threads than you have cores.
Everything I said so far in my other reply holds still true on general, as your question was what "can"... however now that I've seen your actual code, my first bet would be that your usage of the random() function slows everything down. Why?
See, random keeps a global variable in memory that stores the last random value calculated there. Each time you call random() (and you are calling it twice within a single function) it reads the value of this global variable, performs a calculation (that is not so fast; random() alone is a slow function) and writes the result back there before returning it. This global variable is not per thread, it is shared among all threads. So what I wrote regarding cache poisoning applies here all the time (even if you avoided it for the array by having separated arrays per thread; this was very clever of you!). This value is constantly invalidated in the cache of either core and must be re-fetched from memory. However if you only have a single thread, nothing like that happens, this variable never leaves cache after it has been initially read, since it's permanently accessed again and again and again.
Further, to make things even worse, glibc has a thread-safe version of random() - I just verified that by looking at the source. While this seems to be a good idea in practice, it means that each random() call will cause a mutex to be locked, memory to be accessed, and a mutex to be unlocked. Thus two threads calling random() at exactly the same moment will cause one thread to be blocked for a couple of CPU cycles. This is implementation specific, though, as AFAIK it is not required that random() is thread safe. Most standard lib functions are not required to be thread-safe, since the C standard is not even aware of the concept of threads in the first place. When they are not calling it at the same moment, the mutex will have no influence on speed (as even a single-threaded app must lock/unlock the mutex), but then cache poisoning will apply again.
You could pre-build an array with random numbers for every thread, containing as many random number as each thread needs. Create it in the main thread before spawning the threads and add a reference to it to the structure pointer you hand over to every thread. Then get the random numbers from there.
Or just implement your own random number generator if you don't need the "best" random numbers on the planet, that works with per-thread memory for holding its state - that one might be even faster than the system's built-in generator.
If a Linux-only solution works for you, you can use random_r. It allows you to pass the state with every call. Just use a unique state object per thread. However, this function is a glibc extension; it is most likely not supported by other platforms (it is neither part of the C standards nor of the POSIX standards AFAIK - this function does not exist on Mac OS X, for example, and it may not exist on Solaris or FreeBSD either).
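A sketch of how random_r might be wrapped with one state object per thread (glibc-specific, per the caveats above; the buffer size is just an example):
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

/* each thread owns its own rdata and statebuf */
void thread_rng_init(struct random_data *rdata, char *statebuf, size_t len, unsigned int seed) {
    memset(rdata, 0, sizeof *rdata);    /* the state struct must be zeroed before initstate_r */
    initstate_r(seed, statebuf, len, rdata);
}

int32_t thread_rng_next(struct random_data *rdata) {
    int32_t value;
    random_r(rdata, &value);            /* uses only the caller-supplied state, no shared globals */
    return value;
}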
Creating your own random number generator is actually not that hard. If you need real random numbers, you shouldn't use random() in the first place; random() only creates pseudo-random numbers (numbers that look random, but are predictable if you know the generator's internal state). Here's the code for one that produces good uint32 random numbers:
static uint32_t getRandom(uint32_t * m_z, uint32_t * m_w)
{
*m_z = 36969 * (*m_z & 65535) + (*m_z >> 16);
*m_w = 18000 * (*m_w & 65535) + (*m_w >> 16);
return (*m_z << 16) + *m_w;
}
It's important to "seed" m_z and m_w in a proper way somehow, otherwise the results are not random at all. The seed value itself should already be random, but here you could use the system random number generator.
uint32_t m_z = random();
uint32_t m_w = random();
uint32_t nextRandom;
for (...) {
nextRandom = getRandom(&m_z, &m_w);
// ...
}
This way every thread only needs to call random() twice and then uses your own generator. BTW, if you need double randoms (that are between 0 and 1), the function above can be easily wrapped for that:
static double getRandomDouble(uint32_t * m_z, uint32_t * m_w)
{
// The magic number below is 1/(2^32 + 2).
// The result is strictly between 0 and 1.
return (getRandom(m_z, m_w) + 1) * 2.328306435454494e-10;
}
Try to make this change in your code and let me know how the benchmark results are :-)
You are seeing cache line bouncing. I'm really surprised that you don't get wrong results, due to race conditions on the histogram buckets.
One possibility is that the time taken to create the threads exceeds the savings gained by using threads. I would think that N is not very large, if the elapsed time is only 6 seconds for a O(n^4) operation.
There's also no guarantee that multiple threads will run on different cores or CPUs. I'm not sure what the default thread affinity is with Linux - it may be that both threads run on a single core which would negate the benefits of a CPU-intensive piece of code such as this.
This article details default thread affinity and how to change your code to ensure threads run on specific cores.
Even though threads don't access the same elements of the array at the same time, the whole array may sit in a few memory pages. When one core/processor writes to that page, it has to invalidate its cache for all other processors.
Avoid having many threads working over the same memory space. Allocate separate data for each thread to work upon, then join them together when the calculation finishes.
Off the top of my head:
Context switches
Resource contention
CPU contention (if they aren't getting split to multiple CPUs).
Cache thrashing
David,
Are you sure you run a kernel that supports multiple processors? If only one processor is utilized in your system, spawning additional CPU-intensive threads will slow down your program.
And, are you sure support for threads in your system actually utilizes multiple processors? Does top, for example, show that both cores in your processor are utilized when you run your program?
