How efficient is locking an unlocked mutex? What is the cost of a mutex? - multithreading

In a low-level language (C, C++ or whatever): I have the choice between having a bunch of mutexes (like what pthread gives me, or whatever the native system library provides) or a single one for an object.
How efficient is it to lock a mutex? I.e. how many assembler instructions are there likely to be, and how much time do they take (in the case that the mutex is unlocked)?
How much does a mutex cost? Is it a problem to have really a lot of mutexes? Or can I just throw as many mutex variables into my code as I have int variables, and it doesn't really matter?
(I am not sure how much difference there is between different hardware. If there is any, I would also like to know about it. But mostly, I am interested in common hardware.)
The point is, by using many mutexes which each cover only a part of the object instead of a single mutex for the whole object, I could avoid many blocks. And I am wondering how far I should go with this. I.e. should I try to avoid every possible block, no matter how much more complicated and how many more mutexes this means?
WebKit's blog post (2016) about locking is very relevant to this question, and explains the differences between a spinlock, adaptive lock, futex, etc.

I have the choice between having a bunch of mutexes or a single one for an object.
If you have many threads and access to the object happens often, then multiple locks would increase parallelism, at the cost of maintainability, since more locking means more debugging of the locking.
How efficient is it to lock a mutex? I.e. how many assembler instructions are there likely to be, and how much time do they take (in the case that the mutex is unlocked)?
The precise assembler instructions are the smallest part of a mutex's overhead; the memory/cache coherency guarantees are the main overhead. And the less often a particular lock is taken, the better.
A mutex is made of two major parts (oversimplifying): (1) a flag indicating whether the mutex is locked or not, and (2) a wait queue.
Changing the flag is just a few instructions and is normally done without a system call. If the mutex is already locked, a syscall will happen to add the calling thread to the wait queue and start the waiting. Unlocking, if the wait queue is empty, is cheap, but otherwise needs a syscall to wake up one of the waiting threads. (On some systems cheap/fast syscalls are used to implement the mutexes; they become slow (normal) system calls only in case of contention.)
Locking an unlocked mutex is really cheap. Unlocking a mutex without contention is cheap too.
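To make the flag/wait-queue split concrete, here is a minimal, Linux-only sketch of such a lock built from an atomic flag plus the futex system call; the fast_mutex name and the simplified policy (always issuing a wake on unlock) are my own, not how glibc actually implements pthread or std mutexes:

// Minimal sketch of a futex-based lock (Linux only, illustrative - not production code).
#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

struct fast_mutex {
    std::atomic<int> state{0};   // 0 = unlocked, 1 = locked

    void lock() {
        int expected = 0;
        // Fast path: a single atomic compare-exchange, no system call.
        while (!state.compare_exchange_strong(expected, 1)) {
            // Slow path: ask the kernel to put us to sleep until the value changes.
            syscall(SYS_futex, &state, FUTEX_WAIT_PRIVATE, 1, nullptr, nullptr, 0);
            expected = 0;
        }
    }

    void unlock() {
        state.store(0);
        // Wake one waiter, if any (a real mutex only does this when waiters exist).
        syscall(SYS_futex, &state, FUTEX_WAKE_PRIVATE, 1, nullptr, nullptr, 0);
    }
};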
How much does a mutex cost? Is it a problem to have really a lot of mutexes? Or can I just throw as many mutex variables into my code as I have int variables, and it doesn't really matter?
You can throw as many mutex variables into your code as you wish. You are only limited by the amount of memory your application can allocate.
Summary. User-space locks (and mutexes in particular) are cheap and not subject to any system limit. But too many of them spells a nightmare for debugging. Simple table:
1. Fewer locks means more contention (slow syscalls, CPU stalls) and less parallelism. (*)
2. Fewer locks means fewer problems debugging multi-threading problems.
3. More locks means less contention and higher parallelism.
4. More locks means more chances of running into undebuggable deadlocks.
A balanced locking scheme for the application should be found and maintained, generally balancing #2 and #3.
(*) The problem with having fewer, very often locked mutexes is that if you have too much locking in your application, it causes too much inter-CPU/core traffic to flush the mutex memory from the data caches of the other CPUs to guarantee cache coherency. The cache flushes are like light-weight interrupts and are handled by CPUs transparently - but they do introduce so-called stalls (search for "stall").
And the stalls are what makes locking code run slowly, often without any apparent indication of why the application is slow. (Some architectures provide inter-CPU/core traffic stats, some do not.)
To avoid the problem, people generally resort to a large number of locks to decrease the probability of lock contention and to avoid the stalls. That is the reason why cheap user-space locking, not subject to system limits, exists.
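As a concrete illustration of the granularity trade-off discussed in this answer, here is a small hypothetical sketch contrasting one mutex for a whole object with one mutex per independently updated field (the counters_coarse/counters_fine names and fields are invented for the example):

#include <mutex>

// Coarse-grained: every access to any field serializes on the same mutex.
struct counters_coarse {
    std::mutex m;
    long hits = 0;
    long misses = 0;

    void add_hit()  { std::lock_guard<std::mutex> g(m); ++hits; }
    void add_miss() { std::lock_guard<std::mutex> g(m); ++misses; }
};

// Fine-grained: threads touching only 'hits' never contend with threads touching 'misses'.
struct counters_fine {
    std::mutex m_hits, m_misses;
    long hits = 0;
    long misses = 0;

    void add_hit()  { std::lock_guard<std::mutex> g(m_hits); ++hits; }
    void add_miss() { std::lock_guard<std::mutex> g(m_misses); ++misses; }
};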

I wanted to know the same thing, so I measured it.
On my box (AMD FX(tm)-8150 Eight-Core Processor at 3.612361 GHz),
locking and unlocking an unlocked mutex that is in its own cache line and is already cached, takes 47 clocks (13 ns).
Due to synchronization between two cores (I used CPU #0 and #1),
I could only call a lock/unlock pair once every 102 ns on two threads,
so once every 51 ns, from which one can conclude that it takes roughly 38 ns to recover after a thread does an unlock before the next thread can lock it again.
The program that I used to investigate this can be found here:
https://github.com/CarloWood/ai-statefultask-testsuite/blob/b69b112e2e91d35b56a39f41809d3e3de2f9e4b8/src/mutex_test.cxx
Note that it has a few hardcoded values specific for my box (xrange, yrange and rdtsc overhead), so you probably have to experiment with it before it will work for you.
The graph it produces in that state (the image is not reproduced here) shows the result of benchmark runs on the following code:
uint64_t do_Ndec(int thread, int loop_count)
{
  uint64_t start;
  uint64_t end;
  int __d0;

  // Read the TSC immediately before and after the lock/unlock pair.
  asm volatile ("rdtsc\n\tshl $32, %%rdx\n\tor %%rdx, %0" : "=a" (start) : : "%rdx");
  mutex.lock();
  mutex.unlock();
  asm volatile ("rdtsc\n\tshl $32, %%rdx\n\tor %%rdx, %0" : "=a" (end) : : "%rdx");
  // Delay loop; one count shorter for thread 1 than for thread 0.
  asm volatile ("\n1:\n\tdecl %%ecx\n\tjnz 1b" : "=c" (__d0) : "c" (loop_count - thread) : "cc");

  return end - start;
}
The two rdtsc calls measure the number of clocks that it takes to lock and unlock `mutex' (with an overhead of 39 clocks for the rdtsc calls on my box). The third asm is a delay loop. The size of the delay loop is 1 count smaller for thread 1 than it is for thread 0, so thread 1 is slightly faster.
The above function is called in a tight loop of size 100,000. Despite the function being slightly faster for thread 1, both loops synchronize because of the call to the mutex. This is visible in the graph from the fact that the number of clocks measured for the lock/unlock pair is slightly larger for thread 1, to account for the shorter delay in the loop below it.
In the above graph the bottom right point is a measurement with a delay loop_count of 150, and then, following the points at the bottom towards the left, the loop_count is reduced by one for each measurement. When it becomes 77 the function is called every 102 ns in both threads. If loop_count is subsequently reduced even further, it is no longer possible to synchronize the threads and the mutex starts to be actually locked most of the time, resulting in an increased number of clocks that it takes to do the lock/unlock. Also the average time of the function call increases because of this; so the plot points now go up and towards the right again.
From this we can conclude that locking and unlocking a mutex every 50 ns is not a problem on my box.
All in all, my conclusion is that the answer to the OP's question is that adding more mutexes is better as long as that results in less contention.
Try to hold mutexes for as short a time as possible. The only reason to put them - say - outside a loop would be if that loop loops faster than once every 100 ns (or rather, the number of threads that want to run that loop at the same time times 50 ns), or when 13 ns times the loop size is more delay than the delay you get from contention.
EDIT: I have become a lot more knowledgeable on the subject now and have started to doubt the conclusion that I presented here. First of all, CPU 0 and 1 turn out to be hyper-threaded; even though AMD claims to have 8 real cores, there is certainly something very fishy, because the delays between two other cores are much larger (i.e., 0 and 1 form a pair, as do 2 and 3, 4 and 5, and 6 and 7). Secondly, std::mutex is implemented in a way that it spins for a bit before actually doing system calls when it fails to immediately obtain the lock on a mutex (which no doubt will be extremely slow). So what I have measured here is the absolute most ideal situation, and in practice locking and unlocking might take drastically more time per lock/unlock.
Bottom line, a mutex is implemented with atomics. To synchronize atomics between cores an internal bus must be locked, which freezes the corresponding cache line for several hundred clock cycles. In the case that a lock cannot be obtained, a system call has to be performed to put the thread to sleep; that is obviously extremely slow (system calls are on the order of 10 microseconds). Normally that is not really a problem because that thread has to sleep anyway - but it could be a problem with high contention, where a thread can't obtain the lock for the time that it normally spins, and so does the system call, but CAN take the lock shortly thereafter. For example, if several threads lock and unlock a mutex in a tight loop and each keeps the lock for 1 microsecond or so, then they might be slowed down enormously by the fact that they are constantly put to sleep and woken up again. Also, once a thread sleeps and another thread has to wake it up, that thread has to do a system call and is delayed ~10 microseconds; this delay thus happens while unlocking a mutex when another thread is waiting for that mutex in the kernel (after spinning took too long).
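To make the spin-then-sleep behaviour concrete, here is a small, hedged sketch of a "spin a bit before blocking" wrapper around std::mutex; the SPIN_LIMIT constant and the spin_then_lock name are made up for illustration, and a real std::mutex implementation may already do something similar internally:

#include <mutex>

// Illustrative only: try the cheap, non-blocking path a few times before
// falling back to the (potentially sleeping) blocking lock.
constexpr int SPIN_LIMIT = 100;

void spin_then_lock(std::mutex& m)
{
    for (int i = 0; i < SPIN_LIMIT; ++i)
        if (m.try_lock())
            return;          // got the lock without blocking
    m.lock();                // give up spinning; may enter the kernel and sleep
}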

This depends on what you actually call a "mutex", the OS mode, etc.
At minimum it's the cost of an interlocked memory operation. It's a relatively heavy operation (compared to other primitive assembler instructions).
However, it can be very much higher. If what you call a "mutex" is a kernel object (i.e. an object managed by the OS) and you run in user mode, every operation on it leads to a kernel-mode transition, which is very heavy.
For example, on an Intel Core Duo processor running Windows XP:
Interlocked operation: takes about 40 CPU cycles.
Kernel-mode call (i.e. system call): about 2000 CPU cycles.
If this is the case, you may consider using critical sections. They are a hybrid of a kernel mutex and interlocked memory access.
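A hedged sketch of the three cost levels mentioned above, using Win32 calls (the cycle numbers quoted above are rough measurements, not guarantees, and the init/example function names are mine):

#include <windows.h>

LONG counter = 0;
CRITICAL_SECTION cs;     // hybrid: uses atomics, enters the kernel only on contention
HANDLE kernel_mutex;     // kernel object: every wait/release is a system call

void init()
{
    InitializeCriticalSection(&cs);
    kernel_mutex = CreateMutex(nullptr, FALSE, nullptr);
}

void example()
{
    // 1. Interlocked operation: a single atomic instruction (tens of cycles).
    InterlockedIncrement(&counter);

    // 2. Critical section: cheap when uncontended.
    EnterCriticalSection(&cs);
    // ... protected work ...
    LeaveCriticalSection(&cs);

    // 3. Kernel mutex: always a kernel-mode transition (thousands of cycles).
    WaitForSingleObject(kernel_mutex, INFINITE);
    // ... protected work ...
    ReleaseMutex(kernel_mutex);
}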

I'm completely new to pthreads and mutexes, but I can confirm from experimentation that the cost of locking/unlocking a mutex is almost zilch when there is no contention, but when there is contention, the cost of blocking is extremely high. I ran simple code with a thread pool in which the task was just to compute a sum in a global variable protected by a mutex lock:
y = exp(-j*0.0001);
pthread_mutex_lock(&lock);
x += y;
pthread_mutex_unlock(&lock);
With one thread, the program sums 10,000,000 values virtually instantaneously (less than one second); with two threads (on a MacBook with 4 cores), the same program takes 39 seconds.
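For reference, here is a self-contained reconstruction of that kind of experiment; the way the 10,000,000 iterations are split across the threads and all the names are my own guesses, not the original program:

// Reconstruction of the experiment: many tiny critical sections under heavy contention.
// Build with: g++ -O2 -pthread contention.cpp
#include <cmath>
#include <cstdio>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static double x = 0.0;
static const long N = 10000000;
static const int nthreads = 2;

static void* worker(void*)
{
    for (long j = 0; j < N / nthreads; ++j) {
        double y = std::exp(-j * 0.0001);
        pthread_mutex_lock(&lock);
        x += y;
        pthread_mutex_unlock(&lock);
    }
    return nullptr;
}

int main()
{
    pthread_t t[nthreads];
    for (int i = 0; i < nthreads; ++i)
        pthread_create(&t[i], nullptr, worker, nullptr);
    for (int i = 0; i < nthreads; ++i)
        pthread_join(t[i], nullptr);
    std::printf("x = %f\n", x);
}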

The cost will vary depending on the implementation but you should keep in mind two things:
the cost will most likely be minimal, since it's both a fairly primitive operation and it will be optimised as much as possible due to its usage pattern (it is used a lot).
it doesn't matter how expensive it is since you need to use it if you want safe multi-threaded operation. If you need it, then you need it.
On single processor systems, you can generally just disable interrupts long enough to atomically change data. Multi-processor systems can use a test-and-set strategy.
In both those cases, the instructions are relatively efficient.
As to whether you should provide a single mutex for a massive data structure, or have many mutexes, one for each section of it, that's a balancing act.
By having a single mutex, you have a higher risk of contention between multiple threads. You can reduce this risk by having a mutex per section but you don't want to get into a situation where a thread has to lock 180 mutexes to do its job :-)

I just measured it on my Windows 10 system.
This is testing Single Threaded code with no contention at all.
Compiler: Visual Studio 2019, x64 release, with loop overhead subtracted from measurements.
Using std::mutex takes about 74 machine cycles, while using a native Win32 CRITICAL_SECTION takes about 53 machine cycles.
So unless 100 machine cycles is a significant amount of time compared to the code itself, the mutexes aren't going to be the source of a performance problem.
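A rough, portable way to repeat such a measurement is sketched below using std::chrono; this is my own sketch, not the code behind the numbers above, it does not subtract loop overhead, and the clock is far coarser than reading the TSC, so it only yields a ballpark average per uncontended lock/unlock pair:

#include <chrono>
#include <cstdio>
#include <mutex>

int main()
{
    std::mutex m;
    const long iterations = 10000000;

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        m.lock();        // uncontended: single-threaded test
        m.unlock();
    }
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("average uncontended lock+unlock: %.1f ns\n", ns / iterations);
}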

Related

Multithreading on multiple core/processors

I get the idea that if locking and unlocking a mutex is an atomic operation, it can protect the critical section of code in the case of a single-processor architecture.
Any thread, which would be scheduled first, would be able to "lock" the mutex in a single machine code operation.
But how are mutexes any good when the threads are running on multiple cores? (Where different threads could be running simultaneously on different "cores".)
I can't seem to grasp the idea of how a multithreaded program can work without any deadlocks or race conditions on multiple cores.
The general answer:
Mutexes are an operating system concept. An operating system offering mutexes has to ensure that these mutexes work correctly on all hardware that this operating system wants to support. If implementing a mutex is not possible on specific hardware, the operating system cannot offer mutexes on that hardware. If the operating system requires the existence of mutexes to work correctly, it cannot support that hardware at all. How the operating system implements mutexes for specific hardware is unsurprisingly very hardware dependent and varies a lot between operating systems and their supported hardware.
The detailed answer:
Most general purpose CPUs offer atomic operations. These operations are designed to be atomic across all CPU cores within a system, whether these cores are part of a single or multiple individual CPUs.
With as few as two atomic operations, atomic_or and atomic_and, it is possible to implement a lock. E.g. think of
int atomic_or ( int * addr, int val )
It atomically calculates *addr = *addr | val and returns the old value of *addr prior to performing the calculation. If *lock == 0 and multiple threads call atomic_or(lock, 1), then only one of them will get 0 as the result: only the first thread to perform that operation. All other threads get 1 as the result. The one thread that got 0 is the winner; it has the lock. All other threads register for an event and go to sleep.
The winner thread now has exclusive access to the section following the atomic_or; it can perform the desired work, and once it is done, it just clears the lock again (atomic_and(lock, 0)) and generates a system event that the lock is now available again.
The system will then wake up one, some, or all of the threads that registered for this event before going to sleep, and the race for the lock starts all over. Either one of the woken-up threads will win the race, or possibly none of them, as another thread may have been even faster and grabbed the lock between the atomic_and and the other threads being woken up; but that is okay and still correct, as it is still only one thread having access. All threads that failed to obtain the lock go back to sleep.
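A minimal sketch of that idea with C++ atomics, where fetch_or and fetch_and play the roles of atomic_or and atomic_and; the "register for an event and go to sleep" part is replaced by a bare retry loop here, so this is really a spinlock rather than a full mutex:

#include <atomic>

std::atomic<int> lock_word{0};

void acquire()
{
    // fetch_or returns the previous value: 0 means we are the winner.
    while (lock_word.fetch_or(1) != 0) {
        // Lost the race: a real mutex would register for an event and sleep here.
    }
}

void release()
{
    // Clear the lock bit; a real mutex would also wake one of the sleeping waiters.
    lock_word.fetch_and(0);
}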
Of course, the actual implementations of modern systems are often much more complicated than that; they may take things like thread priorities into account (high-priority threads may be preferred in the lock race) or might ensure that every thread waiting for a mutex will eventually also get it (precautions exist that prevent a thread from always losing the lock race). Also, mutexes can be recursive, in which case the system ensures that the same thread can obtain the same mutex multiple times without deadlocking, and this requires some additional bookkeeping.
Probably needless to say, but atomic operations are more expensive, as they require the cores within a system to synchronize their work, and this slows their processing throughput. They may be somewhat expensive if all cores run on a single CPU, but they may be very expensive if there are multiple CPUs, as the synchronization must take place over the bus system that connects the CPUs with each other, and this bus system usually does not operate at CPU speed.
On the other hand, using mutexes will always slow down processing to begin with, as providing exclusive access to a resource has to slow down processing whenever multiple threads require access at the same time to continue their work. So for implementing mutexes this extra cost is irrelevant. Actually, if you can implement a function in a thread-safe way using just atomic operations instead of full-featured mutexes, you will quite often have a noticeable speed benefit, despite these operations being more expensive than normal operations.
Threads are managed by the operating system, which, among other things, is responsible for scheduling threads to cores, so it can also avoid scheduling a specific thread onto a core.
A mutex is an operating-system concept. You're basically asking the OS to block a thread until some other thread tells the OS it's ok.
On modern operating systems, threads are an abstraction over the physical hardware. A programmer targets the thread as an abstraction for code execution. There is no separate abstraction for working on a hardware core available. The operating system is responsible for mapping threads to physical cores.
A mutex is a data structure that lives in system memory. Any thread that has access can read that memory position, regardless of what thread or core it is running in. It doesn't matter whether your code is executing on core 1 or 20, it still has the ability to read the current state of the lock.
In other words, regardless of the number of threads or cores, there is only shared system memory for them to act on.

Is synchronization for a variable change cheaper than for something else?

In a multi-threading environment, isn’t it the case that every operation on the RAM must be synchronized?
Let’s say I have a variable which is a pointer to another memory address:
foo 12345678
Now, if one thread sets that variable to another memory address (let’s say 89ABCDEF) while another thread reads the variable, couldn’t it be that the reading thread gets total trash from the variable if access isn’t synchronized (on some system level)?
foo 12345678   (before)
    89ABCDEF   (new data)
    •••••      (writing thread progress)
foo 89ABC678   (resulting memory content: partly new, partly old)
Since I have never seen this happen, I assume that there is some system-level synchronization when writing variables. I assume that this is why it is called an ‘atomic’ operation. As I found here, this problem is actually a real topic and not something I made up.
On the other hand, I read everywhere that synchronizing has a significant impact on performance. (Aside from threads that must wait because they cannot enter the lock; I mean just the action of locking and unlocking.) Like here:
synchronized adds a significant overhead to the methods […]. These operations are quite expensive […] it has an extreme impact on the program performance. […] the expensive synchronized operations that cause the code to be so terribly slow.
How does this go together? Why is locking for changing a variable unnoticeably fast, but locking for anything else so expensive? Or is it equally expensive, and should there be a big warning sign when using - let’s say - long and double, because they always implicitly require synchronization?
Concerning your first point, when a processor writes some data to memory, this data is always properly written and cannot be "trashed" by other writes from threads, processes, the OS, etc. It is not a matter of synchronization; it is just required to ensure proper hardware behaviour.
Synchronization is a software concept that requires hardware support. Assume that you just want to acquire a lock. It is supposed to be free when at 0 and locked when at 1.
The basic method to do that is
got_the_lock = 0
while (!got_the_lock)
    fetch lock value from memory
    set lock value in memory to 1
    got_the_lock = (fetched value from memory == 0)
done
print "I got the lock!!"
The problem is that if other threads do the same thing at the same time and read the lock value before it has been set to 1, several threads may think they got the lock.
To avoid that, one needs atomic memory access. An atomic access is typically a read-modify-write cycle on a datum in memory that cannot be interrupted and that forbids access to this location until completion. So not all accesses are atomic, only specific read-modify-write operations, and they are realized thanks to specific processor support (see the test-and-set or fetch-and-add instructions, for instance). Most accesses do not need it and can be regular accesses. Atomic access is mostly used to synchronize threads, to ensure that only one thread is in a critical section.
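For contrast with the broken pseudocode above, here is a sketch of a correct test-and-set based lock using C++'s std::atomic_flag, whose test_and_set maps to exactly such an atomic read-modify-write instruction (the get_lock/release_lock names are illustrative):

#include <atomic>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void get_lock()
{
    // test_and_set atomically reads the old value and sets the flag to 1;
    // only the thread that observed 0 enters the critical section.
    while (lock_flag.test_and_set(std::memory_order_acquire))
        ;   // spin until the holder clears the flag
}

void release_lock()
{
    lock_flag.clear(std::memory_order_release);
}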
So why are atomic accesses expensive? There are several reasons.
The first one is that one must ensure a proper ordering of instructions. You probably know that instruction execution order may be different from program order, provided the semantics of the program are respected. This is heavily exploited to improve performance: compilers reorder instructions, processors execute them out of order, write-back caches write data to memory in any order, and memory write buffers do the same thing. This reordering can lead to improper behavior.
1 while (x--) ; // random and silly loop
2 f(y);
3 while(test_and_set(important_lock)) ; //spinlock to get a lock
4 g(z);
Obviously instruction 1 is not constraining and 2 can be executed before (and probably 1 will be removed by an optimizing compiler). But if 4 is executed before 3, the behavior will not be as expected.
To avoid that, an atomic access flushes the instruction and memory buffers, which requires tens of cycles (see memory barrier).
Without pipelining, you pay the full latency of the operation: read data from memory, modify it and write it back. This latency always happens, but for regular memory accesses you can do other work during that time that largely hides the latency.
An atomic access requires at least 100-200 cycles on modern processors and is accordingly extremely expensive.
How does this go together? Why is locking for changing a variable unnoticeably fast, but locking for anything else so expensive? Or is it equally expensive, and should there be a big warning sign when using - let’s say - long and double, because they always implicitly require synchronization?
Regular memory accesses are not atomic. Only specific synchronization instructions are expensive.
Synchronization always has a cost involved. And the cost increases with contention, due to threads waking up and fighting for the lock, with only one getting it and the rest going back to sleep, resulting in a lot of context switches.
However, such contention can be kept to a minimum by using synchronization at a much more granular level, as in a CAS (compare-and-swap) operation by the CPU, or a memory barrier to read a volatile variable. A far better option is to avoid synchronization altogether without compromising safety.
Consider the following code:
synchronized(this) {
// a DB call
}
This block of code will take several seconds to execute, as it is doing IO, and therefore runs a high chance of creating contention among other threads wanting to execute the same block. The time duration is enough to build up a massive queue of waiting threads in a busy system.
This is the reason non-blocking algorithms like the Treiber stack and the Michael-Scott queue exist. They do their tasks (which we'd otherwise do using a much larger synchronized block) with the minimum amount of synchronization.
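For example, a push onto a Treiber stack replaces a lock around the whole operation with a single CAS on the head pointer. A hedged sketch follows; a complete implementation also needs a pop and must deal with memory reclamation and the ABA problem:

#include <atomic>

struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

void push(int v)
{
    Node* n = new Node{v, head.load()};
    // Retry the CAS until no other thread has changed 'head' in between;
    // compare_exchange_weak refreshes n->next with the current head on failure.
    while (!head.compare_exchange_weak(n->next, n))
        ;
}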
isn’t it the case that every operation on the RAM must be synchronized?
No. Most of the "operations on RAM" will target memory locations that are only used by one thread. For example, in most programming languages, none of a thread's function arguments or local variables will be shared with other threads; and often, a thread will use heap objects that it does not share with any other thread.
You need synchronization when two or more threads communicate with one another through shared variables. There are two parts to it:
mutual exclusion
You may need to prevent "race conditions." If some thread T updates a data structure, it may have to put the structure into a temporary, invalid state before the update is complete. You can use mutual exclusion (i.e., mutexes/semaphores/locks/critical sections) to ensure that no other thread U can see the data structure when it is in that temporary, invalid state.
cache consistency
On a computer with more than one CPU, each processor typically has its own memory cache. So, when two different threads running on two different processors both access the same data, they may each be looking at their own, separately cached copy. Thus, when thread T updates that shared data structure, it is important to ensure that all of the variables it updated make it into thread U's cache before thread U is allowed to see any of them.
It would totally defeat the purpose of the separate caches if every write by one processor invalidated every other processor's cache, so there typically are special hardware instructions to do that only when it's needed, and typical mutex/lock implementations execute those instructions on entering or leaving a protected block of code.
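To make the "updates must reach thread U's cache before U sees the flag" point concrete, here is a small illustrative sketch using C++ release/acquire atomics, which provide essentially the same publication guarantee that a mutex unlock/lock pair provides for the data it protects (the names payload and ready are invented):

#include <atomic>

int payload = 0;                    // the shared data structure
std::atomic<bool> ready{false};     // plays the role of the lock/flag

void thread_T()                     // the updating thread
{
    payload = 42;                                    // update the data
    ready.store(true, std::memory_order_release);    // "unlock": publish the update
}

void thread_U()                     // the reading thread
{
    while (!ready.load(std::memory_order_acquire))   // "lock": wait until published
        ;
    // Here payload is guaranteed to be 42: the acquire load synchronizes with
    // the release store, just as locking a mutex does with unlocking it.
    int seen = payload;
    (void)seen;
}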

Why is the throughput of the MCS lock poor when the number of threads is greater than the number of logical CPUs

Why is the throughput of the MCS lock poor when the number of threads is greater than the number of logical CPUs?
Could it be because of increased contention for places on the CPUs, which leads to a lot of threads being pre-empted?
I am not 100% sure on this, but the Microsoft documentation gives this definition of the Sleep() function:
After the sleep interval has passed, the thread is ready to run. If you specify 0 milliseconds, the thread will relinquish the remainder of its time slice but remain ready. Note that a ready thread is not guaranteed to run immediately. Consequently, the thread may not run until some time after the sleep interval elapses.
In my experience, if I use an MCS lock to, let's say, update a data structure, and the number of threads I run it on is 16, the drop-off (excluding the massive drop-off from 1 to 2 threads) from 8 to 16 threads (assuming you are just doubling the number of threads) is quite large. Throughput drops to about a third after one thread and then slowly decreases as the number of threads being used approaches the number of CPUs. Obviously, if you are using a lock, the more threads that are trying to acquire the lock, the more cache coherency work the CPUs have to do.
If you use any atomic instructions (again, assuming you are), the more threads you add, the slower this will become.
"I don't think the problem is that atomic operations will take longer themselves; the real problem might be that an atomic operation might block bus operations on other processors (even if they perform non-atomic operations)."
This was taken from another Stack Overflow member discussing a similar issue. Couple that with the fact that a thread may or may not sleep, even with the use of Sleep(), and may or may not wake immediately, and this could cause a serious loss in throughput. You also have the increased bus traffic to deal with...

linux thread synchronization

I am new to Linux and Linux threads. I have spent some time googling to try to understand the differences between all the functions available for thread synchronization. I still have some questions.
I have found all of these different types of synchronizations, each with a number of functions for locking, unlocking, testing the lock, etc.
gcc atomic operations
futexes
mutexes
spinlocks
seqlocks
rculocks
conditions
semaphores
My current (but probably flawed) understanding is this:
Semaphores are process-wide, involve the filesystem (virtually, I assume), and are probably the slowest.
Futexes might be the base locking mechanism used by mutexes, spinlocks, seqlocks, and rculocks. Futexes might be faster than the locking mechanisms that are based on them.
Spinlocks don't block and thus avoid context switches. However, they avoid the context switch at the expense of consuming all the cycles on a CPU until the lock is released (spinning). They should only be used on multi-processor systems, for obvious reasons. Never sleep while holding a spinlock.
The seqlock just tells you, when you have finished your work, whether a writer changed the data the work was based on. You have to go back and repeat the work in this case.
Atomic operations are the fastest synchronization calls, and are probably used in all the above locking mechanisms. You do not want to use atomic operations on all the fields in your shared data. You want to use a lock (mutex, futex, spin, seq, rcu) or a single atomic operation on a lock flag when you are accessing multiple data fields.
My questions go like this:
Am I right so far with my assumptions?
Does anyone know the CPU cycle cost of the various options? I am adding parallelism to the app so we can get better wall-time response at the expense of running fewer app instances per box. Performance is the utmost consideration. I don't want to consume CPU with context switching, spinning, or lots of extra CPU cycles to read and write shared memory. I am absolutely concerned with the number of CPU cycles consumed.
Which (if any) of the locks prevent interruption of a thread by the scheduler or interrupts... or am I just an idiot and all synchronization mechanisms do this? What kinds of interruption are prevented? Can I block all threads, or just threads on the locking thread's CPU? This question stems from my fear of interrupting a thread holding a lock for a very commonly used function. I expect that the scheduler might schedule any number of other workers who will likely run into this function and then block because it was locked. A lot of context switching would be wasted until the thread with the lock gets rescheduled and finishes. I can re-write this function to minimize lock time, but still it is so commonly called I would like to use a lock that prevents interruption... across all processors.
I am writing user code...so I get software interrupts, not hardware ones...right? I should stay away from any functions (spin/seq locks) that have the word "irq" in them.
Which locks are for writing kernel or driver code and which are meant for user mode?
Does anyone think using an atomic operation to have multiple threads move through a linked list is nuts? I am thinking of atomically changing the current-item pointer to the next item in the list. If the attempt works, then the thread can safely use the data the current item pointed to before it was moved. Other threads would now be moved along the list.
futexes? Any reason to use them instead of mutexes?
Is there a better way than using a condition to sleep a thread when there is no work?
When using gcc atomic ops, specifically test_and_set, can I get a performance increase by doing a non-atomic test first and then using test_and_set to confirm? I know this will be case specific, so here is the case. There is a large collection of work items, say thousands. Each work item has a flag that is initialized to 0. When a thread has exclusive access to the work item, the flag will be one. There will be lots of worker threads. Any time a thread is looking for work, it can non-atomically test for 1. If it reads a 1, we know for certain that the work is unavailable. If it reads a zero, it needs to perform the atomic test_and_set to confirm. So if the atomic test_and_set is 500 CPU cycles because it is disabling pipelining, causing CPUs to communicate and L2 caches to flush/fill... and a simple test is 1 cycle... then as long as I had a better ratio than 500 to 1 when it came to stumbling upon already completed work items, this would be a win. A sketch of this pattern is shown below.
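That "cheap plain read first, atomic read-modify-write only to confirm" pattern is usually called test-and-test-and-set. A hedged sketch using the GCC atomic builtins the question refers to (the work_item struct and try_claim function are invented for illustration):

// Test-and-test-and-set claim of a work item using GCC's atomic builtins.
struct work_item {
    char claimed;   // 0 = free, 1 = taken or in progress
    // ... payload ...
};

// Returns true if the calling thread now owns the item.
bool try_claim(work_item* w)
{
    // Cheap check first: if someone already owns it, skip the expensive part.
    if (__atomic_load_n(&w->claimed, __ATOMIC_RELAXED) != 0)
        return false;

    // Confirm with the real atomic read-modify-write; only one thread sees 0 -> 1.
    return !__atomic_test_and_set(&w->claimed, __ATOMIC_ACQUIRE);
}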
I hope to use mutexes or spinlocks sparingly to protect sections of code that I want only one thread on the SYSTEM (not just the CPU) to access at a time. I hope to sparingly use gcc atomic ops to select work and minimize use of mutexes and spinlocks. For instance: a flag in a work item can be checked to see if a thread has worked it (0 = no, 1 = yes or in progress). A simple test_and_set tells the thread if it has work or needs to move on. I hope to use conditions to wake up threads when there is work.
Thanks!
Application code should probably use the POSIX thread functions. I assume you have man pages, so type:
man pthread_mutex_init
man pthread_rwlock_init
man pthread_spin_init
Read up on them and the functions that operate on them to figure out what you need.
If you're doing kernel mode programming then it's a different story. You'll need to have a feel for what you are doing, how long it takes, and what context it gets called in to have any idea what you need to use.
Thanks to all who answered. We resorted to using gcc atomic operations to synchronize all of our threads. The atomic ops were about 2x slower than setting a value without synchronization, but magnitudes faster than locking a mutex, changing the value, and then unlocking the mutex (this becomes super slow when you start having threads bang into the locks...). We only use pthread_create, attr, cancel, and kill. We use pthread_kill to signal threads to wake up that we put to sleep. This method is 40x faster than cond_wait. So basically... use pthread mutexes if you have time to waste.
In addition, you should check out these books:
Pthreads Programming: A POSIX Standard for Better Multiprocessing
and
Programming with POSIX(R) Threads
Regarding question #8:
Is there a better way than using a condition to sleep a thread when there is no work?
Yes, I think that the best approach, instead of using sleep, is using functions like sem_post() and sem_wait() from "semaphore.h".
regards
A note on futexes - they are more descriptively called fast userspace mutexes. With a futex, the kernel is involved only when arbitration is required, which is what provides the speed up and savings.
Implementing a futex can be extremely tricky (PDF); debugging them can lead to madness. Unless you really, really, really need the speed, it's usually best to use the pthread mutex implementation.
Synchronization is never exactly easy, but trying to implement your own in userspace makes it inordinately difficult.

Critical Sections that Spin on Posix?

The Windows API provides critical sections in which a waiting thread will spin a limited amount of times before context switching, but only on a multiprocessor system. These are implemented using InitializeCriticalSectionAndSpinCount. (See http://msdn.microsoft.com/en-us/library/ms682530.aspx.) This is efficient when you have a critical section that will often only be locked for a short period of time and therefore contention should not immediately trigger a context switch. Two related questions:
For a high-level, cross-platform threading library or an implementation of a synchronized block, is having a small amount of spinning before triggering a context switch a good default?
What, if anything, is the equivalent to InitializeCriticalSectionAndSpinCount on other OS's, especially Posix?
Edit: Of course no spin count will be optimal for all cases. I'm only interested in whether using a nonzero spin count would be a better default than not using one.
My opinion is that the optimal "spin count" for best application performance is too hardware-dependent for it to be an important part of a cross-platform API, and you should probably just use mutexes (in POSIX, pthread_mutex_init / destroy / lock / trylock) or spin-locks (pthread_spin_init / destroy / lock / trylock). Rationale follows.
What's the point of the spin count? Basically, if the lock owner is running simultaneously with the thread attempting to acquire the lock, then the lock owner might release the lock quickly enough that the EnterCriticalSection caller could avoid giving up CPU control in acquiring the lock, improving that thread's performance, and avoiding context switch overhead. Two things:
1: obviously this relies on the lock owner running in parallel to the thread attempting to acquire the lock. This is impossible on a single execution core, which is almost certainly why Microsoft treats the count as 0 in such environments. Even with multiple cores, it's quite possible that the lock owner is not running when another thread attempts to acquire the lock, and in such cases the optimal spin count (for that attempt) is still 0.
2: with simultaneous execution, the optimal spin count is still hardware dependent. Different processors will take different amounts of time to perform similar operations. They have different instruction sets (the ARM I work with most doesn't have an integer divide instruction), different cache sizes, the OS will have different pages in memory... Decrementing the spin count may take a different amount of time on a load-store architecture than on an architecture in which arithmetic instructions can access memory directly. Even on the same processor, the same task will take different amounts of time, depending on (at least) the contents and organization of the memory cache.
If the optimal spin count with simultaneous execution is infinite, then the pthread_spin_* functions should do what you're after. If it is not, then use the pthread_mutex_* functions.
For a high-level, cross-platform threading library or an implementation of a synchronized block, is having a small amount of spinning before triggering a context switch a good default?
One would think so. Many moons ago, Solaris 2.x implemented adaptive locks, which did exactly this - spin for a while if the mutex is held by a thread executing on another CPU, or block otherwise.
Obviously, it makes no sense to spin on single-CPU systems.
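On Linux with glibc there is a rough, non-portable analogue: the adaptive mutex type, which spins briefly on SMP machines before blocking. A hedged sketch follows; note that PTHREAD_MUTEX_ADAPTIVE_NP is a glibc extension and, unlike InitializeCriticalSectionAndSpinCount, does not let you choose the spin count:

#define _GNU_SOURCE 1        // for PTHREAD_MUTEX_ADAPTIVE_NP
#include <pthread.h>

pthread_mutex_t lock;

void init_adaptive_lock()
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    // Spin for a short while on contention before sleeping (multiprocessor only).
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    pthread_mutex_init(&lock, &attr);
    pthread_mutexattr_destroy(&attr);
}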

Resources