How to ensure thread safe in iocp receive?

How to ensure thread safe in iocp receive? - multithreading

DWORD bytes;
ULONG_PTR key; ChatOverlappedData* ol;
if (!GetQueuedCompletionStatus(hComp_, &bytes, &key, (LPOVERLAPPED*)&ol, 0)) {
return false;
}
int type = ol->getNetType();
if (type == net::kAction_Accept) {
onAccept(ol, bytes, key);
} else if (type == net::kAction_Recv) {
onRecv(ol, bytes, key);
} else if (type == net::kAction_Send) {
}
return true;
Consider following scenario,
client alice sent two command to the server, which is made up of three data package, p1 p2 p3. The first two package form the first command c1, the third package form the second command c2. In the function onRecv, the server need to push the data packages to some kind of command buffer to form complete commands.
But suppose there are three threads t1 t2 t3, each thread get a data package(p1, p2, p3) from GetQueuedCompletionStatus,
Since windows is a preemptive operating system, thread t2 could run before t1, t3. The result command buffer would be p2->p1->p3 or p2->p3->p1.
How to ensure the thread safe for the action of pushing the data package to the command buffer?

The simplest solution is to only attempt one overlapped I/O request per socket in each direction. So post a single overlapped read operation, and when it completes, post another one when you finish processing the first one.
Posting more than one such operation is extremely complex because even though the completions will be posted in order, the threads processing the completions may execute out of order and you have to do some painful and complicated tracking.
The benefit from posting multiple overlapped operations in the same direction for the same connection is very, very small. It's almost never sufficient to justify the additional complexity. For servers that handle large numbers of connections, it's usually not worth doing at all because the extra memory consumption (or use of smaller buffer sizes) can actually make performance worse.
The main benefits of IOCP are more efficient discovery of which connections need work and the efficient assignment of that work to a pool of threads. This is what makes the difference between servers that max out at 800 connections and servers that can handle 10,000 connections without breaking a sweat.

Related

How to control multi-threads synchronization in Perl

I got array with [a-z,A-Z] ASCII numbers like so: my #alphabet = (65..90,97..122);
So main thread functionality is checking each character from alphabet and return string if condition is true.
Simple example :
my #output = ();
for my $ascii(#alphabet){
thread->new(\sub{ return chr($ascii); });
}
I want to run thread on every ASCII number, then put letter from thread function into array in the correct order.
So in out case array #output should be dynamic and contain [a..z,A-Z] after all threads finish their job.
How to check, is all threads is done and keep the order?

You're looking for $thread->join, which waits for a thread to finish. It's documented here, and this SO question may also help.
Since in your case it looks like the work being done in the threads is roughly equal in cost (no thread is going to take a long time more than any other), you can just join each thread in order, like so, to wait for them all to finish:
# Store all the threads for each letter in an array.
my #threads = map { thread->new(\sub{ return chr($_); }) } #alphabet;
my #results = map { $_->join } #threads;
Since, when the first thread returns from join, the others are likely already done and just waiting for "join" to grab their return code, or about to be done, this gets you pretty close to "as fast as possible" parallelism-wise, and, since the threads were created in order, #results is ordered already for free.
Now, if your threads can take variable amounts of time to finish, or if you need to do some time-consuming processing in the "main"/spawning thread before plugging child threads' results into the output data structure, joining them in order might not be so good. In that case, you'll need to somehow either: a) detect thread "exit" events as they happen, or b) poll to see which threads have exited.
You can detect thread "exit" events using signals/notifications sent from the child threads to the main/spawning thread. The easiest/most common way to do that is to use the cond_wait and cond_signal functions from threads::shared. Your main thread would wait for signals from child threads, process their output, and store it into the result array. If you take this approach, you should preallocate your result array to the right size, and provide the output index to your threads (e.g. use a C-style for loop when you create your threads and have them return ($result, $index_to_store) or similar) so you can store results in the right place even if they are out of order.
You can poll which threads are done using the is_joinable thread instance method, or using the threads->list(threads::joinable) and threads->list(threads::running) methods in a loop (hopefully not a busy-waiting one; adding a sleep call--even a subsecond one from Time::HiRes--will save a lot of performance/battery in this case) to detect when things are done and grab their results.
Important Caveat: spawning a huge number of threads to perform a lot of work in parallel, especially if that work is small/quick to complete, can cause performance problems, and it might be better to use a smaller number of threads that each do more than one "piece" of work (e.g. spawn a small number of threads, and each thread uses the threads::shared functions to lock and pop the first item off of a shared array of "work to do" and do it rather than map work to threads as 1:1). There are two main performance problems that arise from a 1:1 mapping:
the overhead (in memory and time) of spawning and joining each thread is much higher than you'd think (benchmark it on threads that don't do anything, just return, to see). If the work you need to do is fast, the overhead of thread management for tons of threads can make it much slower than just managing a few re-usable threads.
If you end up with a lot more threads than there are logical CPU cores and each thread is doing CPU-intensive work, or if each thread is accessing the same resource (e.g. reading from the same disks or the same rows in a database), you hit a performance cliff pretty quickly. Tuning the number of threads to the "resources" underneath (whether those are CPUs or hard drives or whatnot) tends to yield much better throughput than trusting the thread scheduler to switch between many more threads than there are available resources to run them on. The reasons this is slow are, very broadly:
Because the thread scheduler (part of the OS, not the language) can't know enough about what each thread is trying to do, so preemptive scheduling cannot optimize for performance past a certain point, given that limited knowledge.
The OS usually tries to give most threads a reasonably fair shot, so it can't reliably say "let one run to completion and then run the next one" unless you explicitly bake that into the code (since the alternative would be unpredictably starving certain threads for opportunities to run). Basically, switching between "run a slice of thread 1 on resource X" and "run a slice of thread 2 on resource X" doesn't get you anything once you have more threads than resources, and adds some overhead as well.
TL;DR threads don't give you performance increases past a certain point, and after that point they can make performance worse. When you can, reuse a number of threads corresponding to available resources; don't create/destroy individual threads corresponding to tasks that need to be done.

Building on Zac B's answer, you can use the following if you want to reuse threads:
use strict;
use warnings;
use Thread::Pool::Simple qw( );
$| = 1;
my $pool = Thread::Pool::Simple->new(
do => [ sub {
select(undef, undef, undef, (200+int(rand(8))*100)/1000);
return chr($_[0]);
} ],
);
my #alphabet = ( 65..90, 97..122 );
print $pool->remove($_) for map { $pool->add($_) } #alphabet;
print "\n";
The results are returned in order, as soon as they become available.

I'm the author of Parallel::WorkUnit so I'm partial to it. And I thought adding ordered responses was actually a great idea. It does it with forks, not threads, because forks are more widely supported and they often perform better in Perl.
my $wu = Parallel::WorkUnit->new();
for my $ascii(#alphabet){
$wu->async(sub{ return chr($ascii); });
}
#output = $wu->waitall();
If you want to limit the number of simultaneous processes:
my $wu = Parallel::WorkUnit->new(max_children => 5);
for my $ascii(#alphabet){
$wu->queue(sub{ return chr($ascii); });
}
#output = $wu->waitall();

Effects of swapping buffers on concurrent access

Consider an application with two threads, Producer and Consumer.
Both threads are running approximately equally frequent, multiple times in a second.
Both threads access the same memory region, where Producer writes to the memory, and Consumer reads the current chunk of data and does something with it, without invalidating the data.
A classical approach is this one:
int[] sharedData;
//Called frequently by thread Producer
void WriteValues(int[] data)
{
lock(sharedData)
{
Array.Copy(data, sharedData, LENGTH);
}
}
//Called frequently by thread Consumer
void WriteValues()
{
int[] data;
lock(sharedData)
{
Array.Copy(sharedData, data, LENGTH);
}
DoSomething(data);
}
If we assume that the Array.Copy takes time, this code would run slow, since Producer always has to wait for Consumer during copying and vice versa.
An approach to this problem would be to create two buffers, one which is accessed by the Consumer, and one which is written to by the Producer, and swap the buffers, as soon as writing has finished.
int[] frontBuffer;
int[] backBuffer;
//Called frequently by thread Producer
void WriteValues(int[] data)
{
lock(backBuffer)
{
Array.Copy(data, backBuffer, LENGTH);
int[] temp = frontBuffer;
frontBuffer = backBuffer;
backBuffer = temp;
}
}
//Called frequently by thread Consumer
void WriteValues()
{
int[] data;
int[] currentFrontBuffer = frontBuffer;
lock(currentForntBuffer)
{
Array.Copy(currentFrontBuffer , data, LENGTH);
}
DoSomething(currentForntBuffer );
}
Now, my questions:
Is locking, as shown in the 2nd example, safe? Or does the change of references introduce problems?
Will the code in the 2nd example execute faster than the code in the 1st example?
Are there any better methods to efficiently solve the problem described above?
Could there be a way to solve this problem without locks? (Even if I think it is impossible)
Note: this is no classical producer/consumer problem: It is possible for Consumer to read the values multiple times before Producer writes it again - the old data stays valid until Producer writes new data.

Is locking, as shown in the 2nd example, safe? Or does the change of references introduce problems?
As far as I can tell, because reference assignment is atomic, this may be safe but not ideal. Because the WriteValues() method reads from frontBuffer without a lock or memory barrier forcing a cache refresh, there no guarantee that the variable will ever be updated with new values from main memory. There is then a potential to continuously read the stale, cached values of that instance from the local register or CPU cache. I'm unsure of whether the compiler/JIT might infer a cache refresh anyway based on the local variable, maybe somebody with more specific knowledge can speak to this area.
Even if the values aren't stale, you may also run into more contention than you would like. For example...
Thread A calls WriteValues()
Thread A takes a lock on the instance in frontBuffer and starts copying.
Thread B calls WriteValues(int[])
Thread B writes its data, moves the currently locked frontBuffer instance into backBuffer.
Thread B calls WriteValues(int[])
Thread B waits on the lock for backBuffer because Thread A still has it.
Will the code in the 2nd example execute faster than the code in the 1st example?
I suggest that you profile it and find out. X being faster than Y only matters if Y is too slow for your particular needs, and you are the only one who knows what those are.
Are there any better methods to efficiently solve the problem described above?
Yes. If you are using .Net 4 and above, there is a BlockingCollection type in System.Collections.Concurrent that models the Producer/Consumer pattern well. If you consistently read more than you write, or have multiple readers to very few writers, you may also want to consider the ReaderWriterLockSlim class. As a general rule of thumb, you should do as little within a lock as you can, which will also help to alleviate your time issue.
Could there be a way to solve this problem without locks? (Even if I think it is impossible)
You might be able to, but I wouldn't suggest trying that unless you are extremely familiar with multi-threading, cache coherency, and potential compiler/JIT optimizations. Locking will most likely be fine for your situation and it will be much easier for you (and others reading your code) to reason about and maintain.

multithreading: how to process data in a vector, while the vector is being populated?

I have a single-threaded linux app which I would like to make parallel. It reads a data file, creates objects, and places them in a vector. Then it calls a compute-intensive method (.5 second+) on each object. I want to call the method in parallel with object creation. While I've looked at qt and tbb, I am open to other options.
I planned to start the thread(s) while the vector was empty. Each one would call makeSolids (below), which has a while loop that would run until interpDone==true and all objects in the vector have been processed. However, I'm a n00b when it comes to threading, and I've been looking for a ready-made solution.
QtConcurrent::map(Iter begin,Iter end,function()) looks very easy, but I can't use it on a vector that's changing in size, can I? And how would I tell it to wait for more data?
I also looked at intel's tbb, but it looked like my main thread would halt if I used parallel_for or parallel_while. That stinks, since their memory manager was recommended (open cascade's mmgt has poor performance when multithreaded).
/**intended to be called by a thread
\param start the first item to get from the vector
\param skip how many to skip over (4 for 4 threads)
*/
void g2m::makeSolids(uint start, uint incr) {
uint curr = start;
while ((!interpDone) || (lineVector.size() > curr)) {
if (lineVector.size() > curr) {
if (lineVector[curr]->isMotion()) {
((canonMotion*)lineVector[curr])->setSolidMode(SWEPT);
((canonMotion*)lineVector[curr])->computeSolid();
}
lineVector[curr]->setDispMode(BEST);
lineVector[curr]->display();
curr += incr;
} else {
uio::sleep(); //wait a little bit for interp
}
}
}
EDIT: To summarize, what's the simplest way to process a vector at the same time that the main thread is populating the vector?

Firstly, to benefit from threading you need to find similarly slow tasks for each thread to do. You said your per-object processing takes .5s+, how long does your file reading / object creation take? It could easily be a tenth or a thousandth of that time, in which case your multithreading approach is going to produce neglegible benefit. If that's the case, (yes, I'll answer your original question soon incase it's not) then think about simultaneously processing multiple objects. Given your processing takes quite a while, the thread creation overhead isn't terribly significant, so you could simply have your main file reading/object creation thread spawn a new thread and direct it at the newly created object. The main thread then continues reading/creating subsequent objects. Once all objects are read/created, and all the processing threads launched, the main thread "joins" (waits for) the worker threads. If this will create too many threads (thousands), then put a limit on how far ahead the main thread is allowed to get: it might read/create 10 objects then join 5, then read/create 10, join 10, read/create 10, join 10 etc. until finished.
Now, if you really want the read/create to be in parallel with the processing, but the processing to be serialised, then you can still use the above approach but join after each object. That's kind of weird if you're designing this with only this approach in mind, but good because you can easily experiment with the object processing parallelism above as well.
Alternatively, you can use a more complex approach that just involves the main thread (that the OS creates when your program starts), and a single worker thread that the main thread must start. They should be coordinated using a mutex (a variable ensuring mutually-exclusive, which means not-concurrent, access to data), and a condition variable which allows the worker thread to efficiently block until the main thread has provided more work. The terms - mutex and condition variable - are the standard terms in the POSIX threading that Linux uses, so should be used in the explanation of the particular libraries you're interested in. Summarily, the worker thread waits until the main read/create thread broadcasts it a wake-up signal indicating another object is ready for processing. You may want to have a counter with index of the last fully created, ready-for-processing object, so the worker thread can maintain it's count of processed objects and move along the ready ones before once again checking the condition variable.

It's hard to tell if you have been thinking about this problem deeply and there is more than you are letting on, or if you are just over thinking it, or if you are just wary of threading.
Reading the file and creating the objects is fast; the one method is slow. The dependency is each consecutive ctor depends on the outcome of the previous ctor - a little odd - but otherwise there are no data integrity issues so there doesn't seem to be anything that needs to be protected by mutexes and such.
Why is this more complicated than something like this (in crude pseudo-code):
while (! eof)
{
readfile;
object O(data);
push_back(O);
pthread_create(...., O, makeSolid);
}
while(x < vector.size())
{
pthread_join();
x++;
}
If you don't want to loop on the joins in your main then spawn off a thread to wait on them by passing a vector of TIDs.
If the number of created objects/threads is insane, use a thread pool. Or put a counter is the creation loop to limit the number of threads that can be created before running ones are joined.

#Caleb: quite -- perhaps I should have emphasized active threads. The GUI thread should always be considered one.

Primitive synchronization primitives -- safe?

On constrained devices, I often find myself "faking" locks between 2 threads with 2 bools. Each is only read by one thread, and only written by the other. Here's what I mean:
bool quitted = false, paused = false;
bool should_quit = false, should_pause = false;
void downloader_thread() {
quitted = false;
while(!should_quit) {
fill_buffer(bfr);
if(should_pause) {
is_paused = true;
while(should_pause) sleep(50);
is_paused = false;
}
}
quitted = true;
}
void ui_thread() {
// new Thread(downloader_thread).start();
// ...
should_pause = true;
while(!is_paused) sleep(50);
// resize buffer or something else non-thread-safe
should_pause = false;
}
Of course on a PC I wouldn't do this, but on constrained devices, it seems reading a bool value would be much quicker than obtaining a lock. Of course I trade off for slower recovery (see "sleep(50)") when a change to the buffer is needed.
The question -- is it completely thread-safe? Or are there hidden gotchas I need to be aware of when faking locks like this? Or should I not do this at all?

Using bool values to communicate between threads can work as you intend, but there are indeed two hidden gotchas as explained in this blog post by Vitaliy Liptchinsky:
Cache Coherency
A CPU does not always fetch memory values from RAM. Fast memory caches on the die are one of the tricks used by CPU designers to work around the Von Neumann bottleneck. On some multi-cpu or multi-core architectures (like Intel's Itanium) these CPU caches are not shared or automatically kept in sync. In other words, your threads may be seeing different values for the same memory address if they run on different CPU's.
To avoid this you need to declare your variables as volatile (C++, C#, java), or do explicit volatile read/writes, or make use of locking mechanisms.
Compiler Optimizations
The compiler or JITter may perform optimizations which are not safe if multiple threads are involved. See the linked blog post for an example. Again, you must make use of the volatile keyword or other mechanisms to inform you compiler.

Unless you understand the memory architecture of your device in detail, as well as the code generated by your compiler, this code is not safe.
Just because it seems that it would work, doesn't mean that it will. "Constrained" devices, like the unconstrained type, are getting more and more powerful. I wouldn't bet against finding a dual-core CPU in a cell phone, for instance. That means I wouldn't bet that the above code would work.

Concerning the sleep call, you could always just do sleep(0) or the equivalent call that pauses your thread letting the next in line a turn.
Concerning the rest, this is thread safe if you know the implementation details of your device.

Answering the questions.
Is this completely thread safe? I would answer no this is not thread safe and I would just not do this at all. Without knowing the details of our device and compiler, if this is C++, the compiler is free to reorder and optimize things away as it sees fit. e.g. you wrote:
is_paused = true;
while(should_pause) sleep(50);
is_paused = false;
but the compiler may choose to reorder this into something like this:
sleep(50);
is_paused = false;
this probably won't work even a single core device as others have said.
Rather than taking a lock, you may try to do better to just do less on the UI thread rather than yield in the middle of processing UI messages. If you think that you have spent too much time on the UI thread then find a way to cleanly exit and register an asynchronous call back.
If you call sleep on a UI thread (or try to acquire a lock or do anyting that may block) you open the door to hangs and glitchy UIs. A 50ms sleep is enough for a user to notice. And if you try to acquire a lock or do any other blocking operation (like I/O) you need to deal with the reality of waiting for an indeterminate amount of time to get the I/O which tends to translate from glitch to hang.

This code is unsafe under almost all circumstances. On multi-core processors you will not have cache coherency between cores because bool reads and writes are not atomic operations. This means each core is not guarenteed to have the same value in the cache or even from memory if the cache from the last write hasn't been flushed.
However, even on resource constrained single core devices this is not safe because you do not have control over the scheduler. Here is an example, for simplicty I'm going to pretend these are the only two threads on the device.
When the ui_thread runs, the following lines of code could be run in the same timeslice.
// new Thread(downloader_thread).start();
// ...
should_pause = true;
The downloader_thread runs next and in it's time slice the following lines are executed:
quitted = false;
while(!should_quit)
{
fill_buffer(bfr);
The scheduler prempts the downloader_thread before fill_buffer returns and then activates the ui_thread which runs.
while(!is_paused) sleep(50);
// resize buffer or something else non-thread-safe
should_pause = false;
The resize buffer operation is done while the downloader_thread is in the process of filling the buffer. This means the buffer is corrupted and you'll likely crash soon. It won't happen everytime, but the fact that you are filling the buffer before you set is_paused to true makes it more likely to happen, but even if you switched the order of those two operations on the downloader_thread you would still have a race condition, but you'd likely deadlock instead of corrupting the buffer.
Incidentally, this is a type of spinlock, it just doesn't work. Spinlock's aren't very for wait times that are likely to span to many time slices cause the spin the processor. Your implmentation does sleep which is a bit nicer but the scheduler still has to run your thread and thread context switches aren't cheap. If you are waiting on a critical section or semaphore, the scheduler doesn't active your thread again till the resource has become free.
You might be able to get away with this in some form on a specific platform/architecture, but it is really easy to make a mistake that is very hard to track down.

What can make a program run slower when using more threads?

This question is about the same program I previously asked about. To recap, I have a program with a loop structure like this:
for (int i1 = 0; i1 < N; i1++)
for (int i2 = 0; i2 < N; i2++)
for (int i3 = 0; i3 < N; i3++)
for (int i4 = 0; i4 < N; i4++)
histogram[bin_index(i1, i2, i3, i4)] += 1;
bin_index is a completely deterministic function of its arguments which, for purposes of this question, does not use or change any shared state - in other words, it is manifestly reentrant.
I first wrote this program to use a single thread. Then I converted it to use multiple threads, such that thread n runs all iterations of the outer loop where i1 % nthreads == n. So the function that runs in each thread looks like
for (int i1 = n; i1 < N; i1 += nthreads)
for (int i2 = 0; i2 < N; i2++)
for (int i3 = 0; i3 < N; i3++)
for (int i4 = 0; i4 < N; i4++)
thread_local_histogram[bin_index(i1, i2, i3, i4)] += 1;
and all the thread_local_histograms are added up in the main thread at the end.
Here's the strange thing: when I run the program with just 1 thread for some particular size of the calculation, it takes about 6 seconds. When I run it with 2 or 3 threads, doing exactly the same calculation, it takes about 9 seconds. Why is that? I would expect that using 2 threads would be faster than 1 thread since I have a dual-core CPU. The program does not use any mutexes or other synchronization primitives so two threads should be able to run in parallel.
For reference: typical output from time (this is on Linux) for one thread:
real 0m5.968s
user 0m5.856s
sys 0m0.064s
and two threads:
real 0m9.128s
user 0m10.129s
sys 0m6.576s
The code is at http://static.ellipsix.net/ext-tmp/distintegral.ccs
P.S. I know there are libraries designed for exactly this kind of thing that probably could have better performance, but that's what my last question was about so I don't need to hear those suggestions again. (Plus I wanted to use pthreads as a learning experience.)

To avoid further comments on this: When I wrote my reply, the questioner hasn't posted a link to his source yet, so I could not tailor my reply to his specific issues. I was only answering the general question what "can" cause such an issue, I never said that this will necessarily apply to his case. When he posted a link to his source, I wrote another reply, that is exactly only focusing on his very issue (which is caused by the use of the random() function as I explained in my other reply). However, since the question of this post is still "What can make a program run slower when using more threads?" and not "What makes my very specific application run slower?", I've seen no need to change my rather general reply either (general question -> general response, specific question -> specific response).
1) Cache Poisoning
All threads access the same array, which is a block of memory. Each core has its own cache to speed up memory access. Since they don't just read from the array but also change the content, the content is changed actually in the cache only, not in real memory (at least not immediately). The problem is that the other thread on the other core may have overlapping parts of memory cached. If now core 1 changes the value in the cache, it must tell core 2 that this value has just changed. It does so by invalidating the cache content on core 2 and core 2 needs to re-read the data from memory, which slows processing down. Cache poisoning can only happen on multi-core or multi-CPU machines. If you just have one CPU with one core this is no problem. So to find out if that is your issue or not, just disable one core (most OSes will allow you to do that) and repeat the test. If it is now almost equally fast, that was your problem.
2) Preventing Memory Bursts
Memory is read fastest if read sequentially in bursts, just like when files are read from HD. Addressing a certain point in memory is actually awfully slow (just like the "seek time" on a HD), even if your PC has the best memory on the market. However, once this point has been addressed, sequential reads are fast. The first addressing goes by sending a row index and a column index and always having waiting times in between before the first data can be accessed. Once this data is there, the CPU starts bursting. While the data is still on the way it sends already the request for the next burst. As long as it is keeping up the burst (by always sending "Next line please" requests), the RAM will continue to pump out data as fast as it can (and this is actually quite fast!). Bursting only works if data is read sequentially and only if the memory addresses grow upwards (AFAIK you cannot burst from high to low addresses). If now two threads run at the same time and both keep reading/writing memory, however both from completely different memory addresses, each time thread 2 needs to read/write data, it must interrupt a possible burst of thread 1 and the other way round. This issue gets worse if you have even more threads and this issue is also an issue on a system that has only one single-core CPU.
BTW running more threads than you have cores will never make your process any faster (as you mentioned 3 threads), it will rather slow it down (thread context switches have side effects that reduce processing throughput) - that is unlike you run more threads because some threads are sleeping or blocking on certain events and thus cannot actively process any data. In that case it may make sense to run more threads than you have cores.

Everything I said so far in my other reply holds still true on general, as your question was what "can"... however now that I've seen your actual code, my first bet would be that your usage of the random() function slows everything down. Why?
See, random keeps a global variable in memory that stores the last random value calculated there. Each time you call random() (and you are calling it twice within a single function) it reads the value of this global variable, performs a calculation (that is not so fast; random() alone is a slow function) and writes the result back there before returning it. This global variable is not per thread, it is shared among all threads. So what I wrote regarding cache poisoning applies here all the time (even if you avoided it for the array by having separated arrays per thread; this was very clever of you!). This value is constantly invalidated in the cache of either core and must be re-fetched from memory. However if you only have a single thread, nothing like that happens, this variable never leaves cache after it has been initially read, since it's permanently accessed again and again and again.
Further to make things even worse, glibc has a thread-safe version of random() - I just verified that by looking at the source. While this seems to be a good idea in practice, it means that each random() call will cause a mutex to be locked, memory to be accessed, and a mutex to be unlocked. Thus two threads calling random exactly the same moment will cause one thread to be blocked for a couple of CPU cycles. This is implementation specific, though, as AFAIK it is not required that random() is thread safe. Most standard lib functions are not required to be thread-safe, since the C standard is not even aware of the concept of threads in the first place. When they are not calling it the same moment, the mutex will have no influence on speed (as even a single threaded app must lock/unlock the mutex), but then cache poisoning will apply again.
You could pre-build an array with random numbers for every thread, containing as many random number as each thread needs. Create it in the main thread before spawning the threads and add a reference to it to the structure pointer you hand over to every thread. Then get the random numbers from there.
Or just implement your own random number generator if you don't need the "best" random numbers on the planet, that works with per-thread memory for holding its state - that one might be even faster than the system's built-in generator.
If a Linux only solution works for you, you can use random_r. It allows you to pass the state with every call. Just use a unique state object per thread. However this function is a glibc extension, it is most likely not supported by other platforms (neither part of the C standards nor of the POSIX standards AFAIK - this function does not exist on Mac OS X for example, it may neither exist in Solaris or FreeBSD).
Creating an own random number generator is actually not that hard. If you need real random numbers, you shouldn't use random() in the first place. Random only creates pseudo-random numbers (numbers that look random, but are predictable if you know the generator's internal state). Here's the code for one that produces good uint32 random numbers:
static uint32_t getRandom(uint32_t * m_z, uint32_t * m_w)
{
*m_z = 36969 * (*m_z & 65535) + (*m_z >> 16);
*m_w = 18000 * (*m_w & 65535) + (*m_w >> 16);
return (*m_z << 16) + *m_w;
}
It's important to "seed" m_z and m_w in a proper way somehow, otherwise the results are not random at all. The seed value itself should already be random, but here you could use the system random number generator.
uint32_t m_z = random();
uint32_t m_w = random();
uint32_t nextRandom;
for (...) {
nextRandom = getRandom(&m_z, &m_w);
// ...
}
This way every thread only needs to call random() twice and then uses your own generator. BTW, if you need double randoms (that are between 0 and 1), the function above can be easily wrapped for that:
static double getRandomDouble(uint32_t * m_z, uint32_t * m_w)
{
// The magic number below is 1/(2^32 + 2).
// The result is strictly between 0 and 1.
return (getRandom(m_z, m_w) + 1) * 2.328306435454494e-10;
}
Try to make this change in your code and let me know how the benchmark results are :-)

You are seeing cache line bouncing. I'm really surprised that you don't get wrong results, due to race conditions on the histogram buckets.

One possibility is that the time taken to create the threads exceeds the savings gained by using threads. I would think that N is not very large, if the elapsed time is only 6 seconds for a O(n^4) operation.
There's also no guarantee that multiple threads will run on different cores or CPUs. I'm not sure what the default thread affinity is with Linux - it may be that both threads run on a single core which would negate the benefits of a CPU-intensive piece of code such as this.
This article details default thread affinity and how to change your code to ensure threads run on specific cores.

Even though threads don't access the same elements of the array at the same, the whole array may sit in a few memory pages. When one core/processor writes to that page, it has to invalidate its cache for all other processors.
Avoid having many threads working over the same memory space. Allocate separate data for each thread to work upon, then join them together when the calculation finishes.

Off the top of my head:
Context switches
Resource contention
CPU contention (if they aren't getting split to multiple CPUs).
Cache thrashing

David,
Are you sure you run a kernel that supports multiple processors? If only one processor is utilized in your system, spawning additional CPU-intensive threads will slow down your program.
And, are you sure support for threads in your system actually utilizes multiple processors? Does top, for example, show that both cores in your processor utilized when you run your program?

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string