I've been working on the parser for a programming language that requires multithreading support. While investigating what the backend of my compiler should be, I noticed that I cannot find much information on multithreading for things like CIL, LLVM IR, gcc RTL, or JVM bytecode. I can find some references on how to make such code thread safe, but nothing on how to, say, create or fork threads. I can of course use signals or something to interface directly with the operating system, but that's nonportable and error-prone.
Is it the case that there's simply no portable way for managing threads in these low-level languages? Should I compile to a high(er)-level language like C instead?
In JVM byte code you can use any Java library, including those that work with threads. The conventional way of creating a thread would be:
new Thread() {
    @Override
    public void run() {
        // code
    }
}.start();
This code is written in Java. The corresponding JVM byte code can be seen by using javap:
0: new #2 // class Main$1
3: dup
4: invokespecial #3 // Method Main$1."<init>":()V
7: invokevirtual #4 // Method Main$1.start:()V
10: return
And Main$1 is a class:
final class Main$1 extends java/lang/Thread {
// compiled from: Intf.java
OUTERCLASS Main main ([Ljava/lang/String;)V
// access flags 0x8
static INNERCLASS Main$1 null null
// access flags 0x0
<init>()V
L0
LINENUMBER 7 L0
ALOAD 0
INVOKESPECIAL java/lang/Thread.<init> ()V
RETURN
L1
LOCALVARIABLE this LMain$1; L0 L1 0
MAXSTACK = 1
MAXLOCALS = 1
// access flags 0x1
public run()V
L0
LINENUMBER 11 L0
RETURN
L1
LOCALVARIABLE this LMain$1; L0 L1 0
MAXSTACK = 0
MAXLOCALS = 1
}
This code is perfectly portable.
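And if you would rather not target the JVM, compiling to C++ gives you threads that are just as portable: std::thread has been part of the standard library since C++11 and maps onto the native threading API on every major platform. A minimal sketch of what generated code could call into:
#include <iostream>
#include <thread>

int main() {
    // std::thread wraps the platform's native thread (pthreads, Win32, ...)
    std::thread worker([] {
        std::cout << "hello from a worker thread\n";
    });
    worker.join(); // wait for the worker to finish
}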
In this document, a QMutex is used to protect "number" from being modified by multiple threads at the same time.
I have code in which a thread is instructed to do different work according to a flag set by another thread.
//In thread1
if(flag)
dowork1;
else
dowork2;
//In thread2
void setflag(bool f)
{
flag=f;
}
I want to know if a QMutex is needed to protect flag, i.e.,
//In thread1
mutex.lock();
if(flag)
{
mutex.unlock();
dowork1;
}
else
{
mutex.unlock();
dowork2;
}
//In thread2
void setflag(bool f)
{
mutex.lock();
flag=f;
mutex.unlock();
}
The code differs from the document in that flag is accessed (read/written) by a single statement in both threads, and only one thread modifies the value of flag.
PS:
I always see the example in multithreaded programming tutorials where one thread does "count++", the other thread does "count--", and the tutorials say you should use a mutex to protect the variable "count". I can't get the point of using a mutex. Does it mean the execution of the single statement "count++" or "count--" can be interrupted in the middle and produce unexpected results? What unexpected results can occur?
Does it mean the execution of the single statement "count++" or "count--" can be interrupted in the middle and produce unexpected results? What unexpected results can occur?
Just answering this part: yes, the execution can be interrupted in the middle of a statement.
Let's imagine a simple case:
class A {
void foo(){
++a;
}
int a = 0;
};
The single statement ++a is translated in assembly to
mov eax, DWORD PTR [rdi]
add eax, 1
mov DWORD PTR [rdi], eax
which can be seen as
eax = a;
eax += 1;
a = eax;
If foo() is called on the same instance of A in two different threads (be it on a single core or multiple cores), you cannot predict the result of the program.
It can behave nicely:
thread 1 > eax = a // eax in thread 1 is equal to 0
thread 1 > eax += 1 // eax in thread 1 is equal to 1
thread 1 > a = eax // a is set to 1
thread 2 > eax = a // eax in thread 2 is equal to 1
thread 2 > eax += 1 // eax in thread 2 is equal to 2
thread 2 > a = eax // a is set to 2
or not:
thread 1 > eax = a // eax in thread 1 is equal to 0
thread 2 > eax = a // eax in thread 2 is equal to 0
thread 2 > eax += 1 // eax in thread 2 is equal to 1
thread 2 > a = eax // a is set to 1
thread 1 > eax += 1 // eax in thread 1 is equal to 1
thread 1 > a = eax // a is set to 1
In a well-defined program, N calls to foo() should result in a == N.
But calling foo() on the same instance of A from multiple threads creates undefined behavior. There is no way to know the value of a after N calls to foo().
It will depend on how you compiled your program, what optimization flags were used, which compiler was used, what the load on your CPU was, the number of cores in your CPU, ...
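To see the race concretely, here is a minimal sketch (not part of the original answer) that increments a plain int and a std::atomic<int> side by side; the plain counter frequently ends up below the expected total, while the atomic one never does:
#include <atomic>
#include <iostream>
#include <thread>

int plain_count = 0;               // unsynchronized: data race, undefined behavior
std::atomic<int> atomic_count{0};  // atomic read-modify-write: well-defined

int main() {
    auto work = [] {
        for (int i = 0; i < 100000; ++i) {
            ++plain_count;   // compiles to load/add/store, can interleave badly
            ++atomic_count;  // a single indivisible increment
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    // Typically prints something like "143728 200000".
    std::cout << plain_count << " " << atomic_count << "\n";
}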
NB
class A {
public:
bool check() const { return a == b; }
int get_a() const { return a; }
int get_b() const { return b; }
void foo(){
++a;
++b;
}
private:
int a = 0;
int b = 0;
};
Now we have a class that, for an external observer, keeps a and b equal at all times.
The optimizer could optimize this class into:
class A {
public:
bool check() const { return true; }
int get_a() const { return a; }
int get_b() const { return b; }
void foo(){
++a;
++b;
}
private:
int a = 0;
int b = 0;
};
because it does not change the observable behavior of the program.
However, if you invoke undefined behavior by calling foo() on the same instance of A from multiple threads, you could end up with a == 3, b == 2, and check() still returning true. Your code has lost its meaning; the program is not doing what it is supposed to and could be doing just about anything.
From here you can imagine more complex cases. For example, if A manages network connections, you could end up sending the data for client #10 to client #6. If your program is running in a factory, you could end up activating the wrong tool.
If you want the definition of undefined behavior, you can look here: https://en.cppreference.com/w/cpp/language/ub
and in the C++ standard.
For a better understanding of UB, you can look for CppCon talks on the topic.
For any standard object (including bool) that is accessed from multiple threads, where at least one of the threads may modify the object's state, you need to protect access to that object with a mutex; otherwise you will invoke undefined behavior.
As a practical matter, for a bool that undefined behavior probably won't come in the form of a crash, but more likely in the form of thread B sometimes not "seeing" changes made to the bool by thread A, due to caching and/or optimization issues (e.g. the optimizer "knows" that the bool can't change during a function call, so it doesn't bother checking it more than once).
If you don't want to guard your accesses with a mutex, the other option is to change flag from a bool to a std::atomic<bool>; the std::atomic<bool> type has exactly the semantics you are looking for, i.e. it can be read and/or written from any thread without invoking undefined behavior.
Look here for an explanation: Do I have to use atomic<bool> for "exit" bool variable?
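As a minimal sketch of that suggestion, the flag from the question could be rewritten like this (dowork1/dowork2 stand in for the real work):
#include <atomic>

std::atomic<bool> flag{false}; // replaces the plain bool

// In thread1: the load is now well-defined even while thread2 writes
void thread1_work() {
    if (flag.load()) {
        // dowork1;
    } else {
        // dowork2;
    }
}

// In thread2: the store is now well-defined
void setflag(bool f) {
    flag.store(f);
}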
To synchronize access to flag you can make it a std::atomic<bool>.
Or you can use a QReadWriteLock together with a QReadLocker and a QWriteLocker. Compared to using a QMutex, this gives you the advantage that you do not need to worry about the call to QMutex::unlock() if you use exceptions or early return statements.
Alternatively you can use a QMutexLocker if the QReadWriteLock does not match your use case.
QReadWriteLock lock;
...
//In thread1
{
QReadLocker readLocker(&lock);
if(flag)
dowork1;
else
dowork2;
}
...
//In thread2
void setflag(bool f)
{
QWriteLocker writeLocker(&lock);
flag=f;
}
Keeping your program expressing its intent (i.e. accessing shared vars under locks) is a big win for program maintenance and clarity. You need some pretty good reasons to abandon that clarity for obscure approaches like atomics and deliberately engineered race conditions.
Good reasons include having measured your program spending too much time toggling the mutex. In any decent implementation, the difference between an uncontended mutex and an atomic is minute -- the mutex lock and unlock typically employ an optimistic compare-and-swap, returning quickly. If your vendor doesn't provide a decent implementation, you might bring that up with them.
In your example, dowork1 and dowork2 are invoked with the mutex locked, so the mutex isn't just protecting flag but also serializing these functions. If that is just an artifact of how you posed the question, then race conditions (variants of the atomics travesty) are less scary.
In your PS (dup of comment above):
Yes, count++ is best thought of as:
mov $_count, %r1
ld (%r1), %r0
add $1, %r0, %r2
st %r2,(%r1)
Even on machines with a native atomic increment instruction (x86, 68k, 370, dinosaurs), the compiler might not use it consistently.
So, if two threads do count--; and count++; at close to the same time, the result could be -1, 0, or 1 (ignoring the language weenies who say your house might burn down).
barriers:
if CPU0 executes:
store $1 to b
store $2 to c
and CPU1 executes:
load barrier -- discard speculatively read values.
load b to r0
load c to r1
Then CPU1 could read r0,r1 as: (0,0), (1,0), (1,2), (0,2).
This is because the observable order of the memory writes is weak; the processor may make them visible in an arbitrary fashion.
So, we change CPU0 to execute:
store $1 to b
store barrier -- stop storing until all previous stores are visible
store $2 to c
Then, if CPU1 saw that r1 (c) was 2, then r0 (b) has to be 1. The store barrier enforces that.
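In portable C++ the same store-barrier/load-barrier pairing is normally spelled with release/acquire orderings on atomics. A minimal sketch of the b/c example above:
#include <atomic>

std::atomic<int> b{0}, c{0};

// CPU0
void cpu0() {
    b.store(1, std::memory_order_relaxed);
    c.store(2, std::memory_order_release); // plays the role of the store barrier
}

// CPU1
void cpu1() {
    if (c.load(std::memory_order_acquire) == 2) { // plays the role of the load barrier
        // The release/acquire pair guarantees this reads 1.
        int r0 = b.load(std::memory_order_relaxed);
        (void)r0;
    }
}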
To me, it seems handier to use a mutex here.
In general, not using a mutex when sharing references can lead to problems.
The only downside of using a mutex here seems to be that you will slightly decrease performance, because your threads have to wait for each other.
What kind of errors could happen?
As somebody in the comments said, it is a different situation if you share a fundamental data type (e.g. int, bool, float) or an object reference. I added a Qt code example that demonstrates two possible problems when NOT using a mutex. Problem #3 is a fundamental one and is described in detail in Benjamin T's nice answer.
main.cpp
#include <QCoreApplication>
#include <QThread>
#include <QtDebug>
#include <QTimer>
#include "countingthread.h"
int main(int argc, char *argv[])
{
QCoreApplication a(argc, argv);
int amountThread = 3;
int counter = 0;
QString *s = new QString("foo");
QMutex *mutex = new QMutex();
//we construct a lot of threads
QList<CountingThread*> threadList;
//we create all threads
for(int i=0;i<amountThread;i++)
{
CountingThread *t = new CountingThread();
#ifdef TEST_ATOMIC_VAR_SHARE
t->addCounterdRef(&counter);
#endif
#ifdef TEST_OBJECT_VAR_SHARE
t->addStringRef(s);
//we add a mutex, which is shared for reading and writing
//only used with the TEST_OBJECT_SHARE_FIX define uncommented
t->addMutexRef(mutex);
#endif
//t->moveToThread(t);
threadList.append(t);
}
//we start all with low prio, otherwise we produce something like a fork bomb
for(int i=0;i<amountThread;i++)
threadList.at(i)->start(QThread::Priority::LowPriority);
return a.exec();
}
countingthread.h
#ifndef COUNTINGTHREAD_H
#define COUNTINGTHREAD_H
#include <QThread>
#include <QtDebug>
#include <QTimer>
#include <QMutex>
//atomic var is shared
//#define TEST_ATOMIC_VAR_SHARE
//more complex object var is shared
#define TEST_OBJECT_VAR_SHARE
// we add the fix
#define TEST_OBJECT_SHARE_FIX
class CountingThread : public QThread
{
Q_OBJECT
int *m_counter;
QString *m_string;
QMutex *m_locker;
public :
void addCounterdRef(int *r);
void addStringRef(QString *s);
void addMutexRef(QMutex *m);
void run() override;
};
#endif // COUNTINGTHREAD_H
countingthread.cpp
#include "countingthread.h"
void CountingThread::run()
{
//forever
while(1)
{
#ifdef TEST_ATOMIC_VAR_SHARE
//first use of counter
int counterUse1Copy= (*m_counter);
//some other operations, here sleep 10 ms
this->msleep(10);
//we will retry to use a second time
int counterUse2Copy= (*m_counter);
if(counterUse1Copy != counterUse2Copy)
qDebug()<<this->thread()->currentThreadId()<<" problem #1 found, counter not like we expect";
//we increment afterwards our counter
(*m_counter) +=1; //this works for fundamental types, like float, int, ...
#endif
#ifdef TEST_OBJECT_VAR_SHARE
#ifdef TEST_OBJECT_SHARE_FIX
m_locker->lock();
#endif
m_string->replace("#","-");
//this will crash here (problem #2) with a
//segmentation fault, which is not handled by try/catch
m_string->append("foomaster");
m_string->append("#");
if(m_string->length()>10000)
qDebug()<<this->thread()->currentThreadId()<<" string is: " << m_string;
#ifdef TEST_OBJECT_SHARE_FIX
m_locker->unlock();
#endif
#endif
}//end forever
}
void CountingThread::addCounterdRef(int *r)
{
m_counter = r;
qDebug()<<this->thread()->currentThreadId()<<" add counter with value: " << *m_counter << " and address : "<< m_counter ;
}
void CountingThread::addStringRef(QString *s)
{
m_string = s;
qDebug()<<this->thread()->currentThreadId()<<" add string with value: " << *m_string << " and address : "<< m_string ;
}
void CountingThread::addMutexRef(QMutex *m)
{
m_locker = m;
}
If you follow the code you are able to perform two tests.
If you uncomment TEST_ATOMIC_VAR_SHARE and comment TEST_OBJECT_VAR_SHARE in countingthread.h
you will see
problem #1: if you use your variable multiple times in your thread, it can be changed in the background by another thread. Contrary to my expectation, there was no app crash or weird exception in my build environment during execution using an int counter.
If you uncomment TEST_OBJECT_VAR_SHARE and comment TEST_OBJECT_SHARE_FIX and comment TEST_ATOMIC_VAR_SHARE in countingthread.h
you will see
problem #2: you get a segmentation fault, which cannot be handled via try/catch. This happens because multiple threads are using string functions to edit the same object.
If you uncomment TEST_OBJECT_SHARE_FIX too, you see the correct handling via a mutex.
problem #3: see the answer from Benjamin T.
What is a mutex:
I really like the chicken explanation which vallabh suggested.
I also found a good explanation here.
I was debugging a fairly simple program written in D that seemed to have a random chance of receiving a SEGV signal.
Upon further inspection I observed that using different compilers and build modes yielded different results.
Results of my tests:
DMD Debug = works 99% of the time
DMD Release = 50/50
LDC Debug = 50/50
LDC Release = 50/50
Because the binary from the default compiler (DMD) crashed only once, I couldn't really debug it, and release mode didn't help either due to the lack of debug symbols.
Building the binary with LDC in debug mode let me test it with gdb and valgrind; to summarize what I gathered:
Relevant information from valgrind:
Invalid read of size 4 # ctor in file video.d line 46
Access not within mapped region at address 0x0 # ctor in file video.d line
GDB doesn't give me any more insight: three stack frames, of which only the 0th is of interest. The backtrace of frame 0 shows file video.d line 46, which is a break statement. So what now?
This is the snippet of code producing the segfault:
module video;
import ffmpeg.libavformat.avformat;
import ffmpeg.libavcodec.avcodec;
import ffmpeg.libavutil.avutil;
class Foo
{
private
{
AVFormatContext* _format_ctx;
AVStream* _stream_video;
AVStream* _stream_audio;
}
...
public this(const(string) path)
{
import std.string : toStringz;
_format_ctx = null;
enforce(avformat_open_input(&_format_ctx, path.toStringz, null, null) == 0);
scope (failure) avformat_close_input(&_format_ctx);
enforce(avformat_find_stream_info(_format_ctx, null) == 0);
debug av_dump_format(_format_ctx, 0, path.toStringz, 0);
foreach (i; 0 .. _format_ctx.nb_streams)
{
AVStream* stream = _format_ctx.streams[i];
if (stream == null)
continue;
enforce (stream.codecpar != null);
switch (stream.codecpar.codec_type)
{
case AVMediaType.AVMEDIA_TYPE_VIDEO:
_stream_video = stream;
break;
case AVMediaType.AVMEDIA_TYPE_AUDIO:
_stream_audio = stream;
break;
default:
stream.discard = AVDiscard.AVDISCARD_ALL;
break; // Magic line 46
}
}
}
}
// Might contain spelling errors, had to write it by hand.
So does anyone have an idea what causes this behaviour, or more precisely how to go about fixing it?
Try checking the validity of _stream_audio:
default:
enforce( _stream_audio, new Exception( "_stream_audio is null" ))
.discard = AVDiscard.AVDISCARD_ALL;
break; // Magic line 46
You are not heeding the warning in the toStringz documentation:
“Important Note: When passing a char* to a C function, and the C function keeps it around for any reason, make sure that you keep a reference to it in your D code. Otherwise, it may become invalid during a garbage collection cycle and cause a nasty bug when the C code tries to use it.”
This may not be the cause of your problem, but the way you use toStringz is risky.
I read somewhere that whenever a method is called by "invokevirtual",
the object reference is fetched from the top of the stack, followed by the arguments.
I need to somehow print the object reference. Is that possible?
So, I'm not going to do it for you, because the actual code is annoying and tedious and if you're really genuinely interested you should learn how to do it yourself. But I will attempt to be helpful and provide you with some direction.
Firstly, you're going to want to read the ASM tutorials here.
The byte code format I'm going to write below comes from ASMifier because it's much clearer. I'm going to completely ignore javap because it's even more pedantic and detailed, but if you want to know what it is actually showing you, then you should read about the Java ClassFile format.
Actually, you should do that first anyway, just to make sure that your background knowledge is somewhat filled out.
So, here's the nutshell of what you're going to want to do. You're going to want to write a ClassVisitor that looks for instances of the INVOKEVIRTUAL opcode.
invokevirtual pops the values from the stack in reverse order, so the last parameter comes first and the object you're invoking against comes last. The #38 you are referring to is not the object; it's a reference to the constant pool, which contains a method name and method descriptor pair used as metadata because the JVM is typesafe.
Let's assume you have this code:
package sample;
public class JavaSimpleHelloWorld {
public static void main(String[] args) {
System.out.println("Hello World");
}
}
If you run ASMifier against it, you'll get something like this for just the main method (cutting the context down for brevity):
public static main([Ljava/lang/String;)V
L0
LINENUMBER 6 L0
GETSTATIC java/lang/System.out : Ljava/io/PrintStream;
LDC "Hello World"
INVOKEVIRTUAL java/io/PrintStream.println (Ljava/lang/String;)V
L1
LINENUMBER 7 L1
RETURN
L2
LOCALVARIABLE args [Ljava/lang/String; L0 L2 0
MAXSTACK = 2
MAXLOCALS = 1
So, you implement some sort of static dump method (public static void dump(Object o)) and write a class visitor that reorganizes your byte code.
You can use the method descriptor to figure out how far back in the preceding stack-push instructions (ALOAD, LDC, ...) you need to insert the DUP/INVOKESTATIC to print the method's object target. For example, the method descriptor for System.out.println is (Ljava/lang/String;)V, which means the method takes a String and returns void. So you need to go one back in the stack to find the object target. Your bytecode would, in turn, look like this:
public static main([Ljava/lang/String;)V
L0
GETSTATIC java/lang/System.out : Ljava/io/PrintStream;
DUP
INVOKESTATIC my/staticutil/ClassThatDumps.dump (Ljava/lang/Object;)V
LDC "Hello World"
INVOKEVIRTUAL java/io/PrintStream.println (Ljava/lang/String;)V
RETURN
L1
LOCALVARIABLE args [Ljava/lang/String; L0 L1 0
MAXSTACK = 2
MAXLOCALS = 1
Happy byte code twiddling.
I'm using Hex-Rays's IDA Pro to decompile a binary. I have this switch:
case 0x35:
CField::OnDesc_MAYB(v6, a6);
break;
case 0x36:
(*(void (__thiscall **)(_DWORD, _DWORD))(*(_DWORD *)(a1 - 8) + 28))(a1 - 8, a6);
break;
case 0x3A:
CField::OnWarnMessage(v6, a6);
break;
If you look at case 0x36, there is a statement I can't understand. Usually I just point at the function and decompile it using the F5 shortcut; however, I don't understand what this statement means. How can I decompile it to view its code?
Thanks.
case 0x36 is invoking a virtual function, or at least what Hex-Rays believes to be a virtual function. Consider the following pseudo C++ code (reinterpret_cast excluded for brevity, etc.), which deconstructs that one line.
// in VC++, 'this' is usually passed via ECX register
typedef void (__thiscall* member_function_t)(_DWORD this_ptr, _DWORD arg_0);
// a1's declaration wasn't included in your post, so I'm making an assumption here
byte* a1 = address_of_some_child_object;
// It would appear a1 is a pointer to an object which has multiple vftables (due to multiple inheritance/interfaces)
byte*** base_object = (byte***)(a1 - 8);
// Dereference the pointer at a1[-8] to get the base's vftable pointer (constant list of function pointers for the class's virtual funcs)
// a1[0] would probably be the child/interface's vftable pointer
byte** base_object_vftable = *base_object;
// 28 bytes / sizeof(void*) = index 7, i.e. the 8th virtual function in the vftable
byte* base_object_member_function = base_object_vftable[7];
auto member_function = (member_function_t)base_object_member_function;
// case 0x36 simplified using a __thiscall function pointer
member_function((_DWORD)base_object, a6);
Deconstructed from:
(
*(
void (__thiscall **)(_DWORD, _DWORD)
)
(*
(_DWORD *)(a1 - 8) + 28
)
)
(a1 - 8, a6);
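For comparison, here is a minimal C++ sketch of what such a call looks like at the source level (the class and names are made up); the compiler lowers the virtual call to exactly the vftable-load-and-call pattern deconstructed above:
#include <iostream>

struct Base {
    virtual void f(int x) { std::cout << "Base::f(" << x << ")\n"; }
};

void call(Base* p, int x) {
    // The compiler implements this line as:
    //   1. load the vftable pointer stored at the start of *p
    //   2. load the function pointer at f's byte offset within the vftable
    //   3. call it, passing p as the hidden 'this' argument
    p->f(x);
}

int main() {
    Base b;
    call(&b, 42);
}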
If you're unfamiliar with the __thiscall calling convention, or with how virtual functions are typically implemented in C++, you should probably read up on them before trying to reverse engineer programs that use them.
You could start with these breakdowns:
vftable - what is this?
Reversing Microsoft Visual C++ Part II: Classes, Methods and RTTI
So in my illumination days, I started to think about how the hell Windows/Linux implement the mutex. I've implemented this synchronizer in 100... different ways, on many different architectures, but I never thought about how it is really implemented in a big-ass OS. For example, in the ARM world I made some of my synchronizers by disabling interrupts, but I always thought that wasn't a really good way to do it.
I tried to "swim" through the Linux kernel, but, just as I thought, I couldn't find anything that satisfies my curiosity. I'm not an expert in threading, but I have a solid grasp of all the basic and intermediate concepts of it.
So does anyone know how a mutex is implemented?
A quick look at code apparently from one Linux distribution seems to indicate that it is implemented using an interlocked compare and exchange. So, in some sense, the OS isn't really implementing it since the interlocked operation is probably handled at the hardware level.
Edit: As Hans points out, the interlocked exchange does the compare and exchange in an atomic manner. Here is documentation for the Windows version. For fun, I just wrote a small test to show a really simple example of creating a mutex like that. This is a simple acquire and release test.
#include <windows.h>
#include <assert.h>
#include <stdio.h>
struct homebrew {
LONG *mutex;
int *shared;
int mine;
};
#define NUM_THREADS 10
#define NUM_ACQUIRES 100000
DWORD WINAPI SomeThread( LPVOID lpParam )
{
struct homebrew *test = (struct homebrew*)lpParam;
while ( test->mine < NUM_ACQUIRES ) {
// Test and set the mutex. If it currently has value 0, then it
// is free. Setting 1 means it is owned. This interlocked function does
// the test and set as an atomic operation
if ( 0 == InterlockedCompareExchange( test->mutex, 1, 0 )) {
// this thread now owns the mutex. Increment the shared variable
// without an atomic increment (relying on mutex ownership to protect it)
(*test->shared)++;
test->mine++;
// Release the mutex (4 byte aligned assignment is atomic)
*test->mutex = 0;
}
}
return 0;
}
int main( int argc, char* argv[] )
{
LONG mymutex = 0; // zero means the mutex is free
int shared = 0;
HANDLE threads[NUM_THREADS];
struct homebrew test[NUM_THREADS];
int i;
// Initialize each thread's structure. All share the same mutex and a shared
// counter
for ( i = 0; i < NUM_THREADS; i++ ) {
test[i].mine = 0; test[i].shared = &shared; test[i].mutex = &mymutex;
}
// create the threads and then wait for all to finish
for ( i = 0; i < NUM_THREADS; i++ )
threads[i] = CreateThread(NULL, 0, SomeThread, &test[i], 0, NULL);
for ( i = 0; i < NUM_THREADS; i++ )
WaitForSingleObject( threads[i], INFINITE );
// Verify all increments occurred atomically
printf( "shared = %d (%s)\n", shared,
shared == NUM_THREADS * NUM_ACQUIRES ? "correct" : "wrong" );
for ( i = 0; i < NUM_THREADS; i++ ) {
if ( test[i].mine != NUM_ACQUIRES ) {
printf( "Thread %d cheated. Only %d acquires.\n", i, test[i].mine );
}
}
}
If I comment out the InterlockedCompareExchange call and just let all threads run the increments in a free-for-all fashion, then the results do show failures. Running it 10 times, for example, without the interlocked compare call:
shared = 748694 (wrong)
shared = 811522 (wrong)
shared = 796155 (wrong)
shared = 825947 (wrong)
shared = 1000000 (correct)
shared = 795036 (wrong)
shared = 801810 (wrong)
shared = 790812 (wrong)
shared = 724753 (wrong)
shared = 849444 (wrong)
The curious thing is that one run showed no incorrect result. That might be because there is no "everyone start now" synchronization; maybe all threads started and finished in order in that case. But when I have the InterlockedCompareExchange call in place, it runs without failure (or at least it ran 100 times without failure ... that doesn't prove I didn't write a subtle bug into the example).
Here is the discussion from the people who implemented it ... very interesting, as it shows the tradeoffs ...
Several posts from Linus T ... of course
In the earlier, pre-POSIX days, I used to implement synchronization using a native machine word (e.g. a 16- or 32-bit word) and the Test And Set instruction lurking on every serious processor. This instruction guarantees to test the value of a word and set it in one atomic operation. This provides the basis for a spinlock, and from that a hierarchy of synchronization functions can be built. The simplest is of course just a spinlock that performs a busy wait, not an option for more than transitory syncing; then a spinlock that drops the process time slice at each iteration, for lower system impact. Notional concepts like semaphores, mutexes, monitors, etc. can be built by getting into the kernel scheduling code.
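The modern, portable rendition of that Test And Set spinlock is std::atomic_flag; a minimal sketch:
#include <atomic>

class Spinlock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // test_and_set is the atomic test-and-set described above:
        // spin until the previous value was 'clear', i.e. until we win the lock
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // busy wait (a friendlier version would yield the time slice here)
        }
    }
    void unlock() {
        flag_.clear(std::memory_order_release);
    }
};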
As I recall, the prime usage was to implement message queues to permit multiple clients to access a database server. Another was a very early real-time car race result and timing system on a quite primitive 16-bit machine and OS.
These days I use pthreads and semaphores and Windows Events/Mutexes (mutices?) etc. and don't give a thought to how they work, although I must admit that having been down in the engine room does give one an intuitive feel for better and more efficient multiprocessing.
In the Windows world:
Before Windows Vista, the mutex was implemented with a compare-exchange to change the state of the mutex from Empty to BeingUsed; for the other threads that entered a wait on the mutex, the CAS would obviously fail, and they had to be added to the mutex's queue for further notification. Those queue operations (add/remove/check) were protected by a common lock in the Windows kernel.
After Windows XP, the mutex started to use a spin lock for performance reasons, becoming a self-sufficient object.
In the Unix world I didn't get much further, but it is probably very similar to Windows 7.
Finally, for kernels that run on a single processor, the best way is to disable interrupts when entering the critical section and re-enable them when exiting.
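As a sketch of that last point, assuming hypothetical disable_interrupts()/enable_interrupts() primitives (real kernels use architecture-specific instructions such as CLI/STI on x86 or CPSID/CPSIE on ARM), a uniprocessor critical section looks like this:
// Hypothetical primitives, stubbed so the sketch compiles; a real kernel
// would execute the architecture's interrupt-masking instructions here.
inline void disable_interrupts() { /* e.g. asm("cli") on x86 */ }
inline void enable_interrupts()  { /* e.g. asm("sti") on x86 */ }

int shared_counter = 0;

// On a single-processor kernel nothing can preempt this code while
// interrupts are off, so the critical section is effectively atomic.
void increment_shared() {
    disable_interrupts();
    ++shared_counter;   // critical section
    enable_interrupts();
}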