Is CGAL 2D Regularized Boolean Set-Operations lib thread safe?

I am currently using the library mentioned in the title, see
CGAL 2D-reg-bool-set-op-pol
The library provides types for polygons and polygon sets, which are internally represented as so-called arrangements.
My question is: How far is this library thread safe, that is, fit for parallel computation on its objects?
There could be several levels in which thread safety is guaranteed:
1) If I take an object from a library like an arrangement
Polygon_set_2 S;
I might be able to execute
Polygon_2 P;
S.join(P);
and
Polygon_2 Q;
S.join(Q);
in two different concurrent execution units/threads in parallel without harm and get the right result, as if I had done everything sequentially. That would be the highest degree of thread safety/possible parallelism.
2) In fact, a much lesser degree would be enough for me. In that case S and P would be members of a class C, so that two class instances have different S and P instances. Then I would like to compute (say) S.join(P) in parallel for a list of instances of the class C, say, by calling a suitable member function of C with std::async.
Just to be complete, I insert here a bit of actual code from my project which gives more flesh to these terse descriptions.
// the following typedefs are more or less standard from the
// CGAL library examples.
typedef CGAL::Exact_predicates_exact_constructions_kernel Kernel;
typedef Kernel::Point_2 Point_2;
typedef Kernel::Circle_2 Circle_2;
typedef Kernel::Line_2 Line_2;
typedef CGAL::Gps_circle_segment_traits_2<Kernel> Traits_2;
typedef CGAL::General_polygon_set_2<Traits_2> Polygon_set_2;
typedef Traits_2::General_polygon_2 Polygon_2;
typedef Traits_2::General_polygon_with_holes_2 Polygon_with_holes_2;
typedef Traits_2::Curve_2 Curve_2;
typedef Traits_2::X_monotone_curve_2 X_monotone_curve_2;
typedef Traits_2::Point_2 Point_2t;
typedef Traits_2::CoordNT coordnt;
typedef CGAL::Arrangement_2<Traits_2> Arrangement_2;
typedef Arrangement_2::Face_handle Face_handle;
// the following type is not copied from the CGAL library example code but
// introduced by me
typedef std::vector<Polygon_with_holes_2> pwh_vec_t;
// the following is an excerpt of my full GerberLayer class,
// that retains only data members which are used in the join()
// member function. This data is therefore local to the class instance.
class GerberLayer
{
public:
GerberLayer();
~GerberLayer();
void join();
pwh_vec_t raw_poly_lis;
pwh_vec_t joined_poly_lis;
Polygon_set_2 Saux;
annotate_vec_t annotate_lis;
polar_vec_t polar_lis;
};
//
// it is not necessary to understand the working of the function
// I deleted all debug and timing output etc. It is just to "showcase" some typical
// operations from the CGAL reg set boolean ops for polygons library from
// Efi Fogel et al.
//
void GerberLayer::join()
{
Saux.clear();
auto it_annbase = annotate_lis.begin();
annotate_vec_t::iterator itann = annotate_lis.begin();
bool first_block = true;
int cnt = 0;
while (itann != annotate_lis.end()) {
gpolarity akt_polar = itann->polar;
auto itnext = std::find_if(itann, annotate_lis.end(),
[=](auto a) {return a.polar != akt_polar;});
Polygon_set_2 Sblock;
if (first_block) {
if (akt_polar == Dark) {
Saux.join(raw_poly_lis.begin() + (itann - it_annbase),
raw_poly_lis.begin() + (itnext - it_annbase));
}
first_block = false;
} else {
if (akt_polar == Dark) {
Saux.join(raw_poly_lis.begin() + (itann - it_annbase),
raw_poly_lis.begin() + (itnext - it_annbase));
} else {
Polygon_set_2 Saux1;
Saux1.join(raw_poly_lis.begin() + (itann - it_annbase),
raw_poly_lis.begin() + (itnext - it_annbase));
Saux.complement();
pwh_vec_t auxlis;
Saux1.polygons_with_holes(std::back_inserter(auxlis));
Saux.join(auxlis.begin(), auxlis.end());
Saux.complement();
}
}
itann = itnext;
}
ende:
joined_poly_lis.clear();
annotate_lis.clear();
Saux.polygons_with_holes (std::back_inserter (joined_poly_lis));
}
int join_wrapper(GerberLayer* p_layer)
{
p_layer->join();
return 0;
}
// here the parallelism (of the "embarrassing kind") occurs:
// for every GerberLayer a dedicated task is started, which calls
// the above GerberLayer::join() function
void Window::do_unify()
{
std::vector<std::future<int>> fivec;
for(int i = 0; i < gerber_layer_manager.num_layers(); ++i) {
GerberLayer* p_layer = gerber_layer_manager.at(i);
fivec.push_back(std::async(join_wrapper, p_layer));
}
int sz = wait_for_all(fivec); // written by me, not shown
}
One might think that 2) must be trivially possible, since only "different" instances of polygons and arrangements are in play. But the library works with arbitrary-precision points (Point_2t in my code above), and it is imaginable that, for some implementation reason or other, all points are inserted into a list that is static to the class Point_2t, so that identical points are represented only once in this list. In that case there would be no such thing as "independent instances of Point_2t", and consequently none of "Polygon_2" or "Polygon_set_2" either, and one could say farewell to thread safety.
I tried to resolve this question by googling (not by analyzing the library code, I have to admit) and would hope for an authoritative answer (hopefully positive as this primitive parallelism would greatly speed up my code).
Addendum:
1)
I implemented this already and made a test run with nothing exceptional occurring and visually plausible results, but of course this proves nothing.
2) The same question for the CGAL 2D-Arrangement-package from the same authors.
Thanks in advance!
P.S.: I am using CGAL 4.7 from the packages supplied with Ubuntu 16.04 (Xenial). A newer version on Ubuntu 18.04 gave me errors so I decided to stay with 4.7. Should a version newer than 4.7 be thread-safe, but not 4.7, of course I will try to use that newer version.
Incidentally I could not find out if the libcgal***.so libraries as supplied by Ubuntu 16.04 are thread safe as described in the documentation. Especially I found no reference to the Macro-Variable CGAL_HAS_THREADS that is mentioned in the "thread-safety" part of the docs, when I looked through the build-logs of the Xenial cgal package on launchpad.

Indeed, there are several levels of thread safety.
The 2D Regularized Boolean Set-Operations package depends on the 2D Arrangement package, and both packages depend on a kernel. For most operations the EPEC kernel is required.
Both packages are thread-safe, except for the rational-arc traits (Arr_rational_function_traits_2).
However, the EPEC kernel is not thread-safe yet when sharing number-type objects among threads. So, if you, for example, construct different arrangements in different threads, from different input sets of curves, respectively, you are safe.
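As an illustration of the safe case described above (a separate arrangement per thread, each built from that thread's own input), here is a minimal standard-C++ sketch. `Layer` and `join_all` are hypothetical names, and plain integers stand in for the CGAL types so that the sketch compiles without CGAL. It says nothing about CGAL internals; it only shows a layout in which no object is shared between tasks.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a class such as GerberLayer: every instance owns
// its input and its result, so two instances share no objects.
struct Layer {
    std::vector<int> raw;  // stands in for the per-layer input polygons
    long joined = 0;       // stands in for the per-layer polygon-set result

    // Touches only members of *this.
    void join() { joined = std::accumulate(raw.begin(), raw.end(), 0L); }
};

// Starts one task per layer and waits for all of them.
// Returns the number of tasks that were run.
inline std::size_t join_all(std::vector<Layer>& layers) {
    std::vector<std::future<void>> pending;
    pending.reserve(layers.size());
    for (Layer& layer : layers) {
        // std::launch::async requests a real concurrent task rather than a
        // deferred one; each task gets a pointer to a distinct Layer.
        Layer* p = &layer;
        pending.push_back(std::async(std::launch::async, [p] { p->join(); }));
    }
    assert(pending.size() == layers.size());
    for (auto& f : pending) f.get();  // get() also rethrows task exceptions
    return pending.size();
}
```

With real CGAL types the same layout applies only as long as no number-type objects (and hence no points or curves) are shared between the layers handled by different tasks.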

Related

Misaligned pointer use with std::shared_ptr<NSDate> dereference

I am working in a legacy codebase with a large amount of Objective-C++ written using manual retain/release. Memory is managed using lots of C++ std::shared_ptr<NSMyCoolObjectiveCPointer>, with a suitable deleter passed in on construction that calls release on the contained object. This seems to work great; however, when enabling UBSan, it complains about misaligned pointers, usually when dereferencing the shared_ptrs to do some work.
I've searched for clues and/or solutions, but it's difficult to find technical discussion of the ins and outs of Objective-C object pointers, and even more difficult to find any discussion about Objective-C++, so here I am.
Here is a full Objective-C++ program that demonstrates my problem. When I run this on my Macbook with UBSan, I get a misaligned pointer issue in shared_ptr::operator*:
#import <Foundation/Foundation.h>
#import <memory>
class DateImpl {
public:
DateImpl(NSDate* date) : _date{[date retain], [](NSDate* date) { [date release]; }} {}
NSString* description() const { return [&*_date description]; }
private:
std::shared_ptr<NSDate> _date;
};
int main(int argc, const char * argv[]) {
@autoreleasepool {
DateImpl date{[NSDate distantPast]};
NSLog(@"%@", date.description());
return 0;
}
}
I get this in the call to DateImpl::description:
runtime error: reference binding to misaligned address 0xe2b7fda734fc266f for type 'std::__1::shared_ptr<NSDate>::element_type' (aka 'NSDate'), which requires 8 byte alignment
0xe2b7fda734fc266f: note: pointer points here
<memory cannot be printed>
I suspect that there is something awry with the usage of &* to "cast" the shared_ptr<NSDate> to an NSDate*. I think I could probably work around this issue by using .get() on the shared_ptr instead, but I am genuinely curious about what is going on. Thanks for any feedback or hints!

There were some red herrings here: shared_ptr, manual retain/release, etc. But I ended up discovering that even this very simple code (with ARC enabled) causes the ubsan hit:
#import <Foundation/Foundation.h>
int main(int argc, const char * argv[]) {
@autoreleasepool {
NSDate& d = *[NSDate distantPast];
NSLog(@"%@", &d);
}
return 0;
}
It seems to simply be an issue with [NSDate distantPast] (and, incidentally, [NSDate distantFuture], but not, for instance, [NSDate date]). I conclude that these must be singleton objects allocated sketchily/misaligned-ly somewhere in the depths of Foundation, and when you dereference them it causes a misaligned pointer read.
(Note that it does not happen when the code is simply NSLog(@"%@", &*[NSDate distantPast]). I assume this is because the compiler simply collapses &* on a raw pointer into a no-op. It doesn't for the shared_ptr case in the original question because shared_ptr overloads operator*. Given this, I believe there is no easy way to make this happen in pure Objective-C, since you can't separate the & operation from the * operation, like you can when C++ references are involved [by storing the temporary result of * in an NSDate&].)

You are not supposed to ever use a "bare" NSDate type. Objective-C objects should always be used with a pointer-to-object type (e.g. NSDate *), and you are never supposed to get the "type behind the pointer".
In particular, on 64-bit platforms, Objective-C object pointers can sometimes not be valid pointers, but rather be "tagged pointers" which store the "value" of the object in certain bits of the pointer, rather than as an actual allocated object. You must always let the Objective-C runtime machinery deal with Objective-C object pointers. Dereferencing it as a regular C/C++ pointer can lead to undefined behavior.
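The value printed in the UBSan report from the question can be checked directly. The sketch below (plain C++, with a hypothetical helper `is_aligned`) only confirms that the reported value is not a multiple of 8, which is consistent with it not being the address of an ordinarily allocated object. It makes no assumption about how the runtime encodes such pointers.

```cpp
#include <cassert>
#include <cstdint>

// True if `address` is a multiple of `alignment` (which must be a power of two).
constexpr bool is_aligned(std::uint64_t address, std::uint64_t alignment) {
    return (address & (alignment - 1)) == 0;
}

// Value copied from the UBSan message in the question.
constexpr std::uint64_t reported_address = 0xe2b7fda734fc266fULL;

// The low three bits are all set, so the value cannot be 8-byte aligned.
static_assert(!is_aligned(reported_address, 8),
              "reported value is not 8-byte aligned");
```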

Identifying bug in linux kernel module

I am marking Michael's answer as accepted, as he was the first. Thank you to osgx and employee of the month for additional information and assistance.
I am attempting to identify a bug in a producer/consumer kernel module. This is a problem given to me for a course at university. My teaching assistant was not able to figure it out, and my professor said it was okay if I uploaded it online (he doesn't think Stack can figure it out!).
I have included the module, the makefile, and the Kbuild.
Running the program does not guarantee the bug will present itself.
I thought the issue was on line 30 since it is possible for a thread to rush to line 36, and starve the other threads. My professor said that is not what he is looking for.
Unrelated question: What is the purpose of line 40? It seems out of place to me, but my professor said it serves a purpose.
My professor said the bug is very subtle. The bug is not deadlock.
My approach was to identify critical sections and shared variables, but I'm stumped. I am not familiar with tracing (as a method of debugging), and was told that while it may help it is not necessary to identify the issue.
File: final.c
#include <linux/completion.h>
#include <linux/init.h>
#include <linux/kthread.h>
#include <linux/module.h>
static int actor_kthread(void *);
static int writer_kthread(void *);
static DECLARE_COMPLETION(episode_cv);
static DEFINE_SPINLOCK(lock);
static int episodes_written;
static const int MAX_EPISODES = 21;
static bool show_over;
static struct task_info {
struct task_struct *task;
const char *name;
int (*threadfn) (void *);
} task_info[] = {
{.name = "Liz", .threadfn = writer_kthread},
{.name = "Tracy", .threadfn = actor_kthread},
{.name = "Jenna", .threadfn = actor_kthread},
{.name = "Josh", .threadfn = actor_kthread},
};
static int actor_kthread(void *data) {
struct task_info *actor_info = (struct task_info *)data;
spin_lock(&lock);
while (!show_over) {
spin_unlock(&lock);
wait_for_completion_interruptible(&episode_cv); //Line 30
spin_lock(&lock);
while (episodes_written) {
pr_info("%s is in a skit\n", actor_info->name);
episodes_written--;
}
reinit_completion(&episode_cv); // Line 36
}
pr_info("%s is done for the season\n", actor_info->name);
complete(&episode_cv); //Why do we need this line?
actor_info->task = NULL;
spin_unlock(&lock);
return 0;
}
static int writer_kthread(void *data) {
struct task_info *writer_info = (struct task_info *)data;
size_t ep_num;
spin_lock(&lock);
for (ep_num = 0; ep_num < MAX_EPISODES && !show_over; ep_num++) {
spin_unlock(&lock);
/* spend some time writing the next episode */
schedule_timeout_interruptible(2 * HZ);
spin_lock(&lock);
episodes_written++;
complete_all(&episode_cv);
}
pr_info("%s wrote the last episode for the season\n", writer_info->name);
show_over = true;
complete_all(&episode_cv);
writer_info->task = NULL;
spin_unlock(&lock);
return 0;
}
static int __init tgs_init(void) {
size_t i;
for (i = 0; i < ARRAY_SIZE(task_info); i++) {
struct task_info *info = &task_info[i];
info->task = kthread_run(info->threadfn, info, info->name);
}
return 0;
}
static void __exit tgs_exit(void) {
size_t i;
spin_lock(&lock);
show_over = true;
spin_unlock(&lock);
for (i = 0; i < ARRAY_SIZE(task_info); i++)
if (task_info[i].task)
kthread_stop(task_info[i].task);
}
module_init(tgs_init);
module_exit(tgs_exit);
MODULE_DESCRIPTION("CS421 Final");
MODULE_LICENSE("GPL");
File: Kbuild
obj-m := final.o
File: Makefile
# Basic Makefile to pull in kernel's KBuild to build an out-of-tree
# kernel module
KDIR ?= /lib/modules/$(shell uname -r)/build
all: modules
clean modules:

When cleaning up in tgs_exit() the function executes the following without holding the spinlock:
if (task_info[i].task)
kthread_stop(task_info[i].task);
It's possible for a thread that's ending to set its task_info[i].task to NULL between the check and the call to kthread_stop().

I'm quite confused here.
You claim this is a question from an upcoming exam and it was released by the person delivering the course. Why would they do that? Then you say that the TA failed to solve the problem. If the TA can't do it, who can expect students to pass?
(professor) doesn't think Stack can figure it out
If the claim is that the level on this website is bad I definitely agree. But still, claiming it is below a level to be expected from a random university is a stretch. If there is no claim of the sort, I once more ask how are students expected to do it. What if the problem gets solved?
The code itself is imho unsuitable for teaching as it deviates too much from common idioms.
Another answer here noted one side effect of the actual problem. Namely, it was stated that the loop in tgs_exit can race with threads exiting on their own and test the ->task pointer to be non-NULL, while it becomes NULL just afterwards. The discussion whether this can result in a kthread_stop(NULL) call is not really relevant.
Either a kernel thread exiting on its own will clear everything up OR kthread_stop (and maybe something else) is necessary to do it.
If the former is true, the code suffers from a possible use-after-free. After tgs_exit tests the pointer, the target thread could have exited, maybe prior to the kthread_stop call or maybe just as it was executed. Either way, it is possible that the passed pointer is stale, as the area was already freed by the thread which was exiting.
If the latter is true, the code suffers from resource leaks due to insufficient cleanup - there are no kthread_stop calls if tgs_exit is executed after all threads exit.
The kthread_* api allows threads to just exit, hence effects are as described in the first variant.
For the sake of argument let's say the code is compiled in into the kernel (as opposed to being loaded as a module). Say the exit func is called on shutdown.
There is a design problem that there are 2 exit mechanisms and it transforms into a bug as they are not coordinated. A possible solution for this case would set a flag for writers to stop and would wait for a writer counter to drop to 0.
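The "stop flag plus counter" idea can be sketched in userspace C++. This is an analogy only, not kernel code: `WorkerGroup` and all of its members are hypothetical names, and `std::thread` stands in for kernel threads. The point is that the shutdown path never inspects per-thread pointers; it sets one flag and waits until the count of live workers reaches zero.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

class WorkerGroup {
public:
    // Starts n workers; each runs at most `rounds` iterations on its own.
    WorkerGroup(int n, int rounds) : active_(n) {
        assert(n >= 0);
        for (int i = 0; i < n; ++i)
            threads_.emplace_back([this, rounds] { worker(rounds); });
    }

    // The single, coordinated exit path: request a stop, then wait until every
    // worker has reported that it is done, whether it stopped early or not.
    void shutdown() {
        stop_.store(true);
        {
            std::unique_lock<std::mutex> guard(m_);
            all_done_.wait(guard, [this] { return active_ == 0; });
        }
        for (auto& t : threads_) t.join();
        threads_.clear();
    }

    int active() {
        std::lock_guard<std::mutex> guard(m_);
        return active_;
    }

    ~WorkerGroup() {
        if (!threads_.empty()) shutdown();
    }

private:
    void worker(int rounds) {
        for (int i = 0; i < rounds && !stop_.load(); ++i)
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        // Report completion under the lock so shutdown() cannot miss it.
        std::lock_guard<std::mutex> guard(m_);
        if (--active_ == 0) all_done_.notify_all();
    }

    std::atomic<bool> stop_{false};
    std::mutex m_;
    std::condition_variable all_done_;
    int active_;
    std::vector<std::thread> threads_;
};
```

In userspace the `join()` calls alone would be enough; the counter is kept because it is the part that corresponds to the suggestion above.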
The fact that the code is in a module makes the problem more acute: unless you kthread_stop, you can't tell if the target thread is gone. In particular "actor" threads do:
actor_info->task = NULL;
So the thread is skipped in the exit handler, which can now finish and let the kernel unload the module itself...
spin_unlock(&lock);
return 0;
... but this code (located in the module!) possibly was not executed yet.
This would not have happened if the code followed the idiom of always using kthread_stop.
Another issue is that writers wake everyone up (the so-called "thundering herd problem"), as opposed to at most one actor.
Perhaps the bug one is supposed to find is that each episode has at most one actor? Maybe that the module can exit when there are episodes written but not acted out yet?
The code is extremely weird and if you were shown a reasonable implementation of a thread-safe queue in userspace, you should see how what's presented here does not fit. For instance, why does it block instantly without checking for episodes?
Also, a fun fact: the locking around the write to show_over plays no role in correctness.
There are more issues and it is quite likely I missed some. As it is, I think the question is of poor quality. It does not look like anything real-world.

Debugging in threading building Blocks

I would like to program in threading building blocks with tasks. But how does one do the debugging in practice?
In general the print method is a solid technique for debugging programs.
In my experience with MPI parallelization, the right way to do logging is for each thread to print its debugging information to its own file (say "debug_irank", with irank the rank in MPI_COMM_WORLD) so that logical errors can be found.
How can something similar be achieved with TBB? It is not clear how to access the thread number in the thread pool as this is obviously something internal to tbb.
Alternatively, one could add an additional index specifying the rank when a task is generated but this makes the code rather complicated since the whole program has to take care of that.

First, get the program working with 1 thread. To do this, construct a task_scheduler_init as the first thing in main, like this:
#include "tbb/tbb.h"
int main() {
tbb::task_scheduler_init init(1);
...
}
Be sure to compile with the macro TBB_USE_DEBUG set to 1 so that TBB's checking will be enabled.
If the single-threaded version works, but the multi-threaded version does not, consider using Intel Inspector to spot race conditions. Be sure to compile with TBB_USE_THREADING_TOOLS so that Inspector gets enough information.
Otherwise, I usually first start by adding assertions, because the machine can check assertions much faster than I can read log messages. If I am really puzzled about why an assertion is failing, I use printfs and task ids (not thread ids). Easiest way to create a task id is to allocate one by post-incrementing a tbb::atomic<size_t> and storing the result in the task.
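A minimal sketch of such a task id, assuming `std::atomic` is an acceptable substitute for `tbb::atomic` (it is used here so that the sketch compiles without TBB). `next_task_id` and `TracedTask` are hypothetical names, not part of TBB.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdio>

// Process-wide counter. fetch_add hands out each value exactly once, even
// when many threads create tasks at the same time.
inline std::size_t next_task_id() {
    static std::atomic<std::size_t> counter{0};
    return counter.fetch_add(1, std::memory_order_relaxed);
}

// Hypothetical task type: the id is fixed at construction and travels with
// the task, independent of which worker thread ends up running it.
struct TracedTask {
    const std::size_t id = next_task_id();

    void run() const {
        std::fprintf(stderr, "task %zu: running\n", id);
    }
};
```

Because the id belongs to the task rather than to the thread, log lines from one logical unit of work stay grouped even if the scheduler moves work between threads.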
If I'm having a really bad day and the printfs are changing program behavior so that the error does not show up, I use "delayed printfs". Stuff the printf arguments in a circular buffer, and run printf on the records later after the failure is detected. Typically for the buffer, I use an array of structs containing the format string and a few word-size values, and make the array size a power of two. Then an atomic increment and mask suffices to allocate slots. E.g., something like this:
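#include <stdint.h> // intptr_t
const size_t bufSize = 1024; // must be a power of two so that masking wraps correctly
struct record {
const char* format;
void *arg0, *arg1;
};
tbb::atomic<size_t> head;
record buf[bufSize];
// Record a format string and two pointer-sized values for later printing.
void recf(const char* fmt, void* a, void* b) {
record* r = &buf[head++ & (bufSize-1)]; // atomic increment, then mask to wrap
r->format = fmt;
r->arg0 = a;
r->arg1 = b;
}
// Overload for integer arguments; widen through intptr_t before the pointer cast.
void recf(const char* fmt, int a, int b) {
record* r = &buf[head++ & (bufSize-1)];
r->format = fmt;
r->arg0 = (void*)(intptr_t)a;
r->arg1 = (void*)(intptr_t)b;
}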
The two recf routines record the format and the values. The casting is somewhat abusive, but on most architectures you can print the record correctly in practice with printf(r->format, r->arg0, r->arg1) even if the 2nd overload of recf created the record.
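For the "print later" half, here is a hedged sketch of one possible variant in standard C++. It is not the code above: it uses `std::atomic` instead of `tbb::atomic` and stores `long` values instead of casting through `void*`, so every format string must use `%ld` for its two arguments. The namespace `delayed_log` and the names `rec` and `dump` are hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <string>

namespace delayed_log {

constexpr std::size_t kSize = 1024;
static_assert((kSize & (kSize - 1)) == 0, "kSize must be a power of two");

struct Record {
    const char* format = nullptr;  // must outlive the buffer, e.g. a string literal
    long arg0 = 0;
    long arg1 = 0;
};

inline std::atomic<std::size_t>& head() {
    static std::atomic<std::size_t> h{0};
    return h;
}

inline Record* buffer() {
    static Record buf[kSize];
    return buf;
}

// Cheap enough for hot paths: one atomic increment plus three plain stores.
inline void rec(const char* format, long a, long b) {
    Record& r = buffer()[head().fetch_add(1) & (kSize - 1)];
    r.format = format;
    r.arg0 = a;
    r.arg1 = b;
}

// Call after the failure has been detected and the writers are quiet.
// Formats the records that are still in the buffer, oldest first.
inline std::string dump() {
    std::string out;
    const std::size_t end = head().load();
    const std::size_t begin = end > kSize ? end - kSize : 0;
    char line[256];
    for (std::size_t i = begin; i < end; ++i) {
        const Record& r = buffer()[i & (kSize - 1)];
        if (r.format == nullptr) continue;  // slot never written
        std::snprintf(line, sizeof line, r.format, r.arg0, r.arg1);
        out += line;
    }
    return out;
}

}  // namespace delayed_log
```

A call such as `delayed_log::rec("task %ld took item %ld\n", id, item)` costs almost nothing at the call site, and `delayed_log::dump()` does the formatting once the program has already failed. If writers are still running during `dump()`, individual records may be torn, so it should only be treated as a debugging aid.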

Is there any kill_proc() replacement for proprietary Linux kernel drivers?

I'm in the process of porting 4 proprietary (read: non-GPL) Linux kernel drivers (that I didn't write) from RHEL 5.x to RHEL 6.x (2.6.32 kernel). The drivers all use kill_proc() for signalling the user-space "session", but this function has been removed from the more recent kernels (somewhere between 2.6.18 and 2.6.32). I've seen this question asked many times here and elsewhere and I've searched fairly extensively, but of the many suggested solutions, none work, due to either the functions no longer being exported or requiring a GPL-only function (see below). Does anyone know of a solution that could work for a proprietary driver?
given: kill_proc(pid, sig, 1);
The simplest solution I found was to use: kill_proc_info(sig, SEND_SIG_PRIV, pid); however kill_proc_info is no longer exported so it can't be used.
kill_pid_info() has been suggested (this is called by kill_proc_info() after setting an rcu_read_lock()). kill_pid_info() requires a struct pid*, so I could use: kill_pid_info(sig, SEND_SIG_PRIV, find_vpid(pid)); however, find_vpid() is exported for GPL use only and this is a proprietary driver. Is there another way to get the struct pid*?
kill_pid_info() also sets up an rcu_read_lock() and then calls group_send_sig_info(). Unfortunately, group_send_sig_info() is not exported, and it also requires a struct task_struct*, but the required find_task_by_vpid() function is not exported either.
Another suggestion was kill_pid(), but this also requires a struct pid*, and again, the function find_vpid() is only exported for GPL.
There were also suggestions for send_sig() and send_sig_info(), but these also require a struct task_struct*, and again, find_task_by_pid() is not exported, and pid_task() requires that (GPLd) find_vpid() to get a struct pid*. Also, these functions don't set an rcu_read_lock() and they also pass a FALSE value for the group flag (whereas kill_proc ended up using a TRUE value), so there could be some subtle differences.
That's all that I could find. Does anyone have a suggestion that will work for my case? Thanks in advance.

Since there have been no responses to my question, I've been
reading much of the kernel code and I think I've found a
solution.
It seems that the only exported function that provides the
same semantics as kill_proc() is kill_pid(). We can't use
the GPL find_vpid() function to get the needed struct pid*,
but if we can get the struct task_struct*, then we can get
the struct pid* from there as:
task->pids[PIDTYPE_PID].pid
Since find_task_by_vpid() is no longer exported, it seems
the only way to find the task is to go through the entire
task list looking for it. So, the proposed solution is:
int my_kill_proc(pid_t pid, int sig) {
int error = -ESRCH; /* default return value */
struct task_struct* p;
struct task_struct* t = NULL;
struct pid* pspid;
rcu_read_lock();
p = &init_task; /* start at init */
do {
if (p->pid == pid) { /* does the pid (not tgid) match? */
t = p;
break;
}
p = next_task(p); /* "this isn't the task you're looking for" */
} while (p != &init_task); /* stop when we get back to init */
if (t != NULL) {
pspid = t->pids[PIDTYPE_PID].pid;
if (pspid != NULL) error = kill_pid(pspid,sig,1);
}
rcu_read_unlock();
return error;
}
I know it will take a lot more time to search the whole task list rather
than using the hash tables, but it's all I've got. Some concerns/questions
that I have:
Is the rcu_read_lock() sufficient for this? Would
it be better to use something like preempt_disable() instead?
Can the struct task_struct ever NOT have a PIDTYPE_PID entry
in the pids array? And if so, is checking for NULL sufficient?
I'm new to working with the kernel, are there any other
suggestions to make this better?

Native mutex implementation

So, in one of my moments of illumination, I started to wonder how on earth Windows/Linux implement the mutex. I've implemented this synchronizer in a hundred different ways on many different architectures, but I never thought about how it is really implemented in a big OS. For example, in the ARM world I built some of my synchronizers by disabling interrupts, but I always thought that wasn't a really good way to do it.
I tried to "swim" through the Linux kernel, but I couldn't find anything that satisfied my curiosity. I'm not an expert in threading, but I have a solid grasp of the basic and intermediate concepts.
So does anyone know how a mutex is implemented?

A quick look at code apparently from one Linux distribution seems to indicate that it is implemented using an interlocked compare and exchange. So, in some sense, the OS isn't really implementing it since the interlocked operation is probably handled at the hardware level.
Edit: As Hans points out, the interlocked exchange does the compare and exchange in an atomic manner. Here is documentation for the Windows version. For fun, I just wrote a small test to show a really simple example of creating a mutex like that. This is a simple acquire and release test.
#include <windows.h>
#include <assert.h>
#include <stdio.h>
struct homebrew {
LONG *mutex;
int *shared;
int mine;
};
#define NUM_THREADS 10
#define NUM_ACQUIRES 100000
DWORD WINAPI SomeThread( LPVOID lpParam )
{
struct homebrew *test = (struct homebrew*)lpParam;
while ( test->mine < NUM_ACQUIRES ) {
// Test and set the mutex. If it currently has value 0, then it
// is free. Setting 1 means it is owned. This interlocked function does
// the test and set as an atomic operation
if ( 0 == InterlockedCompareExchange( test->mutex, 1, 0 )) {
// this thread now owns the mutex. Increment the shared variable
// without an atomic increment (relying on mutex ownership to protect it)
(*test->shared)++;
test->mine++;
// Release the mutex (4 byte aligned assignment is atomic)
*test->mutex = 0;
}
}
return 0;
}
int main( int argc, char* argv[] )
{
LONG mymutex = 0; // zero means free (not owned)
int shared = 0;
HANDLE threads[NUM_THREADS];
struct homebrew test[NUM_THREADS];
int i;
// Initialize each thread's structure. All share the same mutex and a shared
// counter
for ( i = 0; i < NUM_THREADS; i++ ) {
test[i].mine = 0; test[i].shared = &shared; test[i].mutex = &mymutex;
}
// create the threads and then wait for all to finish
for ( i = 0; i < NUM_THREADS; i++ )
threads[i] = CreateThread(NULL, 0, SomeThread, &test[i], 0, NULL);
for ( i = 0; i < NUM_THREADS; i++ )
WaitForSingleObject( threads[i], INFINITE );
// Verify all increments occurred atomically
printf( "shared = %d (%s)\n", shared,
shared == NUM_THREADS * NUM_ACQUIRES ? "correct" : "wrong" );
for ( i = 0; i < NUM_THREADS; i++ ) {
if ( test[i].mine != NUM_ACQUIRES ) {
printf( "Thread %d cheated. Only %d acquires.\n", i, test[i].mine );
}
}
}
If I comment out the call to InterlockedCompareExchange and just let all threads run the increments in a free-for-all fashion, the results are wrong most of the time. Running it 10 times, for example, without the interlocked compare call:
shared = 748694 (wrong)
shared = 811522 (wrong)
shared = 796155 (wrong)
shared = 825947 (wrong)
shared = 1000000 (correct)
shared = 795036 (wrong)
shared = 801810 (wrong)
shared = 790812 (wrong)
shared = 724753 (wrong)
shared = 849444 (wrong)
The curious thing is that one run produced the correct total even without the lock. That might be because there is no "everyone start now" synchronization; maybe all threads started and finished in order in that case. But when I have the InterlockedCompareExchange call in place, it runs without failure (or at least it ran 100 times without failure ... that doesn't prove I didn't write a subtle bug into the example).

Here is the discussion from the people who implemented it. It is very interesting, as it shows the tradeoffs.
Several posts are from Linus Torvalds, of course.

In earlier days pre-POSIX etc I used to implement synchronization by using a native mode word (e.g. 16 or 32 bit word) and the Test And Set instruction lurking on every serious processor. This instruction guarantees to test the value of a word and set it in one atomic instruction. This provides the basis for a spinlock and from that a hierarchy of synchronization functions could be built. The simplest is of course just a spinlock which performs a busy wait, not an option for more than transitory sync'ing, then a spinlock which drops the process time slice at each iteration for a lower system impact. Notional concepts like Semaphores, Mutexes, Monitors etc can be built by getting into the kernel scheduling code.
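A minimal sketch of the second variant described above (a test-and-set spinlock that gives up its time slice on each failed attempt), written in standard C++ rather than with a raw machine instruction. `YieldingSpinlock` and `count_with_spinlock` are hypothetical names; `std::atomic_flag::test_and_set` is the portable counterpart of the Test And Set instruction.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class YieldingSpinlock {
public:
    void lock() {
        // test_and_set returns the previous value: true means another thread
        // already holds the lock, so give up the time slice and retry.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }

    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Demo helper: `threads` threads each increment a plain (non-atomic) counter
// `per_thread` times while holding the lock.
inline long count_with_spinlock(int threads, int per_thread) {
    assert(threads >= 0 && per_thread >= 0);
    YieldingSpinlock lock;
    long shared = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &shared, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                lock.lock();
                ++shared;  // protected only by the spinlock
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return shared;
}
```

Semaphores, mutexes and similar higher-level primitives additionally need a way to put waiting threads to sleep, which is where the scheduler comes in; this sketch covers only the spinning part.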
As I recall the prime usage was to implement message queues to permit multiple clients to access a database server. Another was a very early real time car race result and timing system on a quite primitive 16 bit machine and OS.
These days I use Pthreads and Semaphores and Windows Events/Mutexes (mutices?) etc. and don't give a thought as to how they work, although I must admit that having been down in the engine room does give one an intuitive feel for better and more efficient multiprocessing.

In the Windows world:
Before Windows Vista, the mutex was implemented with a compare-exchange (CAS) that changes the state of the mutex from Empty to BeingUsed. For any other thread that then waits on the mutex, the CAS will obviously fail, and that thread must be added to the mutex's queue for later notification. The queue operations (add/remove/check) were protected by a common lock in the Windows kernel.
After Windows XP, the mutex started to use a spin lock for performance reasons, making it a self-sufficient object.
In the Unix world I didn't dig much further, but it is probably very similar to Windows 7.
Finally, for kernels that run on a single processor, the best way is to disable interrupts when entering the critical section and re-enable them when exiting.
