How does x86 pause instruction work in spinlock *and* can it be used in other scenarios?

The pause instruction is commonly used in the spin loop that tests a spinlock, when some other thread owns the spinlock, to mitigate the tight loop. It's said to be equivalent to some NOP instructions. Could somebody tell me how exactly it works for spinlock optimization? It seems to me that even NOP instructions are a waste of CPU time. Will they decrease CPU usage?
Another question: could I use the pause instruction for other, similar purposes? For example, I have a busy thread which keeps scanning some places (e.g. a queue) to retrieve new nodes; however, sometimes the queue is empty and the thread is just wasting CPU time. Putting the thread to sleep and waking it up from another thread may be an option, but the thread is critical, so I don't want to make it sleep.
Could the pause instruction serve my purpose of reducing CPU usage? Currently the thread uses 100% of a physical core.

PAUSE notifies the CPU that this is a spinlock wait loop so memory and cache accesses may be optimized. See also pause instruction in x86 for some more details about avoiding the memory-order mis-speculation when leaving the spin-loop.
PAUSE may actually stop the CPU for some time to save power. Older CPUs decode it as REP NOP, so you don't have to check whether it's supported. Older CPUs will simply do nothing (NOP) as fast as possible.
See also https://software.intel.com/en-us/articles/benefitting-power-and-performance-sleep-loops
Update: I don't think it's a good idea to use PAUSE in queue checking unless you are going to make your queue spinlock-like (and there is no obvious way to do it).
Spinning for a very long time is still very bad, even with PAUSE.
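For the short waits where PAUSE does help, the canonical pattern is a spin-wait on a synchronization variable. A minimal sketch (the flag name is hypothetical; _mm_pause() is the compiler intrinsic that emits PAUSE):
#include <atomic>
#include <immintrin.h>  // _mm_pause

std::atomic<int> lock_flag{1};  // hypothetical flag: 1 = held, 0 = free

// Spin until another thread releases the flag. PAUSE hints "spin-wait
// loop" to the CPU: it avoids the memory-order mis-speculation penalty
// on loop exit and throttles the loop's speculative memory requests.
void wait_until_free()
{
    while (lock_flag.load(std::memory_order_acquire) != 0)
        _mm_pause();
}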

A processor suffers a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops.
An additional function of the PAUSE instruction is to reduce the power consumed by Intel processors.
[source: Intel manual]

Pause-based spin-wait loops
As I understood from your questions, the waits in your case are known in advance to be very long. In this case, spin-wait loops are not recommended at all. But if you are using a spin-loop that keeps checking a value from memory (e.g. a byte-sized synchronization variable), use PAUSE. See Section 11.4.2 "Synchronization for Short Periods" of the Intel 64 and IA-32 Architectures Optimization Reference Manual.
You wrote that you have a "thread which keeps scanning some places (e.g. a queue) to retrieve new nodes".
In such a case (i.e. the long wait), Intel recommends using the synchronization API functions of your operating system. For example, you can create an event that is signaled when a new node appears in the queue, and just wait for this event using WaitForSingleObject(Handle, INFINITE). The queue will trigger this event whenever a new node appears.
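For illustration, the Win32 event pattern could look like the sketch below (function and variable names are made up; error handling omitted):
#include <windows.h>

HANDLE g_queueEvent = CreateEvent(nullptr, FALSE, FALSE, nullptr);  // auto-reset event

void producer_enqueue(/* node */)
{
    // ... push the node onto the queue ...
    SetEvent(g_queueEvent);  // signal: a new node is available
}

void consumer_loop()
{
    for (;;) {
        WaitForSingleObject(g_queueEvent, INFINITE);  // sleep until signaled, 0% CPU
        // ... drain the queue ...
    }
}
With an auto-reset event, the wait consumes no CPU at all while the queue is empty.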
According to the Intel Optimization Reference Manual, Section 2.3.4 "Pause Latency in Skylake Client Microarchitecture",
The PAUSE instruction is typically used with software threads
executing on two logical processors located in the same processor
core, waiting for a lock to be released. Such short wait loops tend to
last between tens and a few hundreds of cycles, so performance-wise it
is better to wait while occupying the CPU than yielding to the OS.
By "tens and a few hundreds of cycles" of the above quote I understand from 20 to 500 CPU cycles.
500 CPU cycles on a 4500 MHz Intel Core i7 7700K processor (released on January 2017, based on Kaby-Lake-S microarchitecture) is 0.0000001 seconds, i.e. 1/10000000th of a second: the CPU can make 10 million times per second this 500-CPU-cycles loop.
This 500 cycle limit recommended by Intel is theoretical, and all depends on particular use case, i.e. on the logic of the code that needs to be synchronized by spin-wait loops. Some scenarios like FastMM4-AVX memory manger for Delphi work better with the value of 5000, according to the benchmarks. Even though, these benchmarks do not always reflect real-world scenario, and real program use cases should be measured.
As you see, this PAUSE-based spin-wait loop is for really short periods of time.
On the other hand, each call to an API function like Sleep() experiences the expensive cost of a context switch, which can be 10000+ cycles; it also suffers the cost of ring 3 to ring 0 transitions, which can be 1000+ cycles.
If more threads are running than processor cores (multiplied by the hyper-threading factor, if present) are available, and a thread gets switched out in the middle of a critical section, waiting for that critical section from another thread may really take long, at least 10000+ cycles, so the PAUSE-based spin-wait loop will be futile.
In addition to the relevant chapters of the Intel Optimization Reference Manual, please see the following articles for more information:
https://software.intel.com/en-us/articles/long-duration-spin-wait-loops-on-hyper-threading-technology-enabled-intel-processors
https://software.intel.com/en-us/articles/benefitting-power-and-performance-sleep-loops
When the wait loop is expected to last for thousands of cycles or more, it is
preferable to yield to the operating system by calling one of the OS synchronization API functions, such as WaitForSingleObject or SwitchToThread on Windows OS.
As a conclusion: in your scenario, the PAUSE-based spin-wait loop won't be the best choice, since your waiting time is long while the spin-wait loop is intended for very short loops.
The PAUSE instruction takes about 140 CPU cycles on processors based on the Skylake microarchitecture or later. For example, that is about 35.10 ns on the Intel Core i7-6700K CPU (4 GHz) released in August 2015, or 49.47 ns on the Intel Core i7-1165G7 mobile CPU released in September 2020. On earlier processors (prior to Skylake), like those based on the Haswell microarchitecture, it takes about 9 cycles: 2.81 ns on the Intel Core i5-4430 (3 GHz) released in June 2013. So, for long waits, it's better to relinquish control to other threads using the OS synchronization API functions than to occupy the CPU with a PAUSE loop, regardless of the microarchitecture.
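These per-PAUSE figures are easy to re-measure on your own machine; a rough sketch (loop overhead is included, so treat the result as an upper bound, and note that the TSC ticks at the reference frequency, not the current core clock):
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>  // __rdtsc, _mm_pause

int main()
{
    const int N = 1000000;
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < N; ++i)
        _mm_pause();
    uint64_t t1 = __rdtsc();
    printf("~%.1f TSC ticks per pause\n", (double)(t1 - t0) / N);
}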
Test, Test-and-Set
Please note that spin-wait loops also have to be implemented properly. Intel recommends the so-called "test, test-and-set" technique (see Section 11.4.3 "Optimization with Spin-Locks" of the Intel 64 and IA-32 Architectures Optimization Reference Manual) to determine the availability of the synchronization variable. According to this technique, the first "test" is done via a normal (non-locking) memory load to prevent excessive bus locking during the spin-wait loop; if the variable is available upon the non-locking memory load of the first step ("test"), you proceed to the second step ("test-and-set"), which is done via the bus-locking atomic xchg instruction.
But be aware that this two-step approach of using "test" before "test-and-set" can increase the cost for the uncontended case compared to a single-step "test-and-set". The initial read-only access might only get the cache line in the Shared state, so an atomic operation like test-and-set (xchg) or compare-and-swap (cmpxchg) still needs a "Read For Ownership" (RFO) operation to get exclusive ownership of the cache line. This operation is issued by a processor trying to write into a cache line that is in the Shared state.
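A minimal "test, test-and-set" spinlock sketch in C++ (with PAUSE in the read-only inner loop), to make the two steps concrete:
#include <atomic>
#include <immintrin.h>  // _mm_pause

struct TtasLock {
    std::atomic<int> state{0};  // 0 = free, 1 = held

    void lock() {
        for (;;) {
            // Step 1, "test": plain load, no bus lock; the line can stay Shared.
            while (state.load(std::memory_order_relaxed) != 0)
                _mm_pause();
            // Step 2, "test-and-set": locked xchg; this triggers the RFO.
            if (state.exchange(1, std::memory_order_acquire) == 0)
                return;
        }
    }
    void unlock() { state.store(0, std::memory_order_release); }
};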
Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock?
atomic operation cost

The PAUSE instruction also appears to be used in hyper-threading processors to mitigate performance impact on other hyper threads, presumably by relinquishing more CPU time to them.
The following Intel article outlines this, and not surprisingly recommends avoiding busy wait loops on such processors: https://software.intel.com/en-us/articles/long-duration-spin-wait-loops-on-hyper-threading-technology-enabled-intel-processors

Related

Does UMWAIT make the process do REP NOP or context switch immediately?

Does calling UMWAIT make the process do REP NOP (i.e. keep using its hardware thread, not getting evicted, but use less power by not issuing uops to the processor back-end) until its scheduled time is over?
Or does it make the process be evicted right away through a context switch?
Yes, umwait (the user-mode version of mwait with a limit on how deep a sleep it can do) is basically like pause (encoded as rep nop, which is how it executes on ancient CPUs that don't recognize it as a pause instruction).
It doesn't make a yield() system-call or otherwise trap to the OS. Same for mwait in kernel mode; it sleeps the CPU core, not traps. Kernels use it to put the CPU into a C-state until the next interrupt. (I think it was originally designed for actually waiting for memory writes from another core, but now one of the primary purposes is an API that includes a sleep level, unlike hlt, so it's how CPUs expose deep sleep levels. The waiting for memory use-case is still supported, too.)
If it just trapped so the OS could context switch, it wouldn't need to exist. int or syscall instructions already exist. Or in a kernel, a simple call to schedule() would potentially context-switch.
UMWAIT will put the core into C0.2/C0.1 state to save power. ... if the other SMT thread is active, most of backend/frontend will be active to C0.0, and if the other SMT thread is not active, then it will probably go into C1~ state.
Yeah, if the other logical core is still active, the physical core should keep running. (And maybe switch back to "single-threaded" mode, allowing the other logical core to use the full ROB and store buffer, and similarly un-partitioning any other statically-partitioned resources. Check perf stat -e cpu_clk_unhalted.one_thread_active against the case where the other thread is fully idle.)
I don't know the details on what sleep levels real microarchitectures actually have in practice, and how the on-paper levels of sleep map to them. It might be a more shallow sleep if regular C1 doesn't have a low enough wake-up latency, since some OSes would definitely want to stop user-space from doing anything too high latency to meet realtime guarantees it wants to provide.
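For completeness, the user-space pattern looks roughly like this (a sketch; requires a WAITPKG-capable CPU, and GCC/Clang need -mwaitpkg):
#include <cstdint>
#include <immintrin.h>  // _umonitor, _umwait

// Wait in user space, without a syscall, until *addr changes from
// old_value or the TSC deadline passes. ctrl = 0 requests the deeper
// C0.2 state, ctrl = 1 the lighter (faster-wakeup) C0.1 state.
void wait_for_write(volatile uint32_t *addr, uint32_t old_value, uint64_t tsc_deadline)
{
    while (*addr == old_value) {
        _umonitor((void *)addr);   // arm the monitor on this cache line
        if (*addr != old_value)    // re-check to avoid a lost wakeup
            break;
        _umwait(0, tsc_deadline);  // wakes on write, deadline, or interrupt
    }
}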

Why can spinlocks become a performance issue in multithreaded programs?

I know what spinlocks are and that they use busy waiting. But why can it become a performance problem in multithreaded programs on a multicore processor?
And what can be done about it?
Your first issue is that a situation in which a spinlock-protected section becomes contended is usually one where there are more threads ready for execution than you have cores available. That means each thread wasting time in a spinlock is potentially starving another thread which would have had something useful to do.
Then there is the cost of the spinlock itself. You are burning through your budget of memory transactions, and that budget is actually shared between processor cores. Effectively, this can result in slowing down the operations within the critical sections.
A good example of that would be the memory allocator in the Windows kernel, in versions between 1703 and 1803. On systems with more than 16 threads, once 50% total CPU utilization was exceeded, a spinlock in that path went out of control and would start eating up 90% of the CPU time. Time spent inside the critical section increased over tenfold due to the competing threads burning the memory bandwidth.
The naive solution is to use nano-sleeps in between spin cycles in order to at least reduce the performance burnt on the locks themselves. But that's pretty bad as well, as the cores still remain blocked, not doing any real work.
Try to yield in the spin locks instead? That just turns out even slower, and you end up with a minimum delay proportional to the scheduling rate of the operating system. At a rate of 1 ms (Windows realtime mode, active when any process requests it), 5 ms (Linux default 200 Hz scheduler), or 10 ms (Windows default mode), that's a huge delay to introduce into execution. And if you happen to hit the critical section again right away, it was wasteful, as you added the overhead of a context switch without any gain.
Ultimately, use operating system primitives for critical sections. The common approach is to use atomic operations to probe if any contention has occurred, and when it has, only then to involve the operating system.
Either way, the operating system below has better means to resolve the contention, mostly in the form of wait lists. Meaning threads waiting on a semaphore only wake up exactly when they are allowed to resume, and are guaranteed to hold the corresponding lock. When leaving the contended region, the thread owning the lock checks via lightweight means whether there had been any contention, and only if so notifies the OS to resume operation on the other threads.
Not that you should actually reinvent the wheel though...
In Windows, that's already how Slim Reader/Writer Locks are implemented.
If you use a plain std::mutex or alike, you will usually already end up with such mechanism under the hood.
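As a usage sketch, the Windows SRW lock gives you exactly this behavior: an atomic probe in user space, with kernel involvement only under contention (the function body below is a placeholder):
#include <windows.h>

SRWLOCK g_lock = SRWLOCK_INIT;

void critical_work()
{
    AcquireSRWLockExclusive(&g_lock);  // user-space atomic on the fast path
    // ... short critical section ...
    ReleaseSRWLockExclusive(&g_lock);  // kernel wake-up only if someone waits
}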
"Old" literature (10-15 years) will still warn you not to use OS primitives for scheduling, but that's seriously outdated and does not reflect the improvements made on the OS side. What used to be 10ms+ delay for every context switch is essentially down to being barely measurable nowadays.

How to start two CPU cores to run instructions at the same time?

For example, in X86, 2 CPU cores are running different software threads.
At a moment, these 2 threads need to run on their CPU cores at the same time.
Is there a way to sync-up these 2 CPU cores/threads, or something like this to make them start to run at (almost) the same time (at instruction level)?
Use a shared variable to communicate a rdtsc based deadline between the two threads. E.g., set a deadline of say the current rdtsc value plus 10,000.
Then have both threads spin on rdtsc, waiting until the gap between the current rdtsc value and the deadline is less than a threshold value T (T = 100 should be fine). Finally, use the final gap value (that is, the deadline rdtsc value minus the last-read rdtsc value) to jump into a sequence of dependent add instructions such that the number of add instructions is equal to the gap (see the sketch after the notes below).
This final step compensates for the fact that each chip will generally not be "in phase" with respect to their rdtsc spin loop. E.g., assuming a 30-cycle back-to-back throughput for rdtsc readings, one chip may get readings of 890, 920, 950 etc, while the other may read 880, 910, 940 so there will be a 10 or 20 cycle error if rdtsc alone is used. Using the add slide compensation, if the deadline was 1,000, and with a threshold of 100, the first thread would trigger at rdtsc == 920 and execute 80 additions, while the second would trigger at rdtsc == 910 and execute 90 additions. In principle both cores are then approximately synced up.
Some notes:
The above assumes CPU frequency equal to the nominal rdtsc frequency - if that's not the case you'll have to apply a compensation factor based on the nominal-to-true frequency ratio when calculating where to jump into the add slide.
Don't expect your CPUs to stay synced for long: anything like an interrupt, a variable-latency operation like a cache miss, or a lot of other things can make them get out of sync.
You want all your payload code, and the addition slide to be hot in the icache of each core, or else they are very likely to get out of sync immediately. You can warm up the icache by doing one or more dummy runs through this code prior to the sync.
You want T to be large enough that the gap is always positive, so somewhat larger than the back-to-back rdtsc latency, but not so large as to increase the chance of events like interrupts during the add slide.
You can check the effectiveness of the "sync" by issuing a rdtsc or rdtscp at various points in the "payload" code following the sync up and seeing how close the recorded values are across threads.
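A minimal sketch of the scheme (assuming an invariant TSC at nominal frequency; a loop of dependent adds stands in for the jump into an add slide, so it is only approximate):
#include <cstdint>
#include <x86intrin.h>  // __rdtsc

void sync_to_deadline(uint64_t deadline)  // 'deadline' is the shared rdtsc target
{
    const int64_t T = 100;  // trigger threshold, in TSC ticks
    int64_t gap;
    do {
        gap = (int64_t)(deadline - __rdtsc());  // remaining ticks (signed!)
    } while (gap > T);
    uint64_t x = 0;
    for (int64_t i = 0; i < gap; ++i)          // crude stand-in for the add slide
        asm volatile("add $1, %0" : "+r"(x));  // dependent ~1-cycle adds
    // ... payload starts here, at roughly the same cycle on both threads ...
}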
A totally different option would be to use Intel TSX: transactional extensions. Organize for the two threads that want to coordinate to both read a shared line inside a transactional region and then spin, and have a third thread to write to the shared line. This will cause an abort on both of the waiting threads. Depending on the inter-core topology, the two waiting threads may receive the invalidation and hence the subsequent TSX abort at nearly the same time. Call the code you want to run "in sync" from the abort handler.
Depending on your definition of "(almost) the same time", this is a very hard problem microarchitecturally.
Even the definition of "Run" isn't specific enough if you care about timing down to the cycle. Do you mean issue from the front-end into the out-of-order back-end? Execute? (dispatch to an execution unit? or complete execution successfully without needing a replay?) Or retire?
I'd tend to go with Execute1 because that's when an instruction like rdtsc samples the timestamp counter. Thus it's the one you can actually record the timing of and then compare later.
footnote 1: on the correct path, not in the shadow of a mis-speculation, unless you're also ok with executions that don't reach retirement.
But if the two cores have different ROB / RS states when the instruction you care about executes, they won't continue in lock-step. (There are extremely few in-order x86-64 CPUs, like some pre-Silvermont Atoms, and early Xeon Phi: Knight's Corner. The x86-64 CPUs of today are all out-of-order, and outside of low-power Silvermont-family are aggressively so with large ROB + scheduler.)
x86 asm tricks:
I haven't used it, but x86 asm monitor / mwait to have both CPUs monitor and wait for a write to a given memory location could work. I don't know how synchronized the wakeup is. I'd guess that the less deep the sleep, the less variable the latency.
Early wake-up from an interrupt coming before a write is always possible. Unless you disable interrupts, you aren't going to be able to make this happen 100% of the time; hopefully you just need to make it happen with some reasonable chance of success, and be able to tell after the fact whether you achieved it.
(On very recent low-power Intel CPUs (Tremont), a user-space-usable version of these is available: umonitor / umwait. But in a kernel you can probably just use monitor/mwait.)
If umonitor/umwait are available, that means you have the WAITPKG CPU feature which also includes tpause: like pause but wait until a given TSC timestamp.
On modern x86 CPUs, the TSC is synchronized between all cores by hardware, so using the same wake-up time for multiple cores makes this trivial.
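With WAITPKG that could be as simple as the sketch below (_tpause is the compiler intrinsic for the tpause instruction; -mwaitpkg required):
#include <cstdint>
#include <immintrin.h>  // _tpause
#include <x86intrin.h>  // __rdtsc

void wake_at(uint64_t tsc_deadline)  // the same deadline, shared by all threads
{
    _tpause(0, tsc_deadline);  // 0 = deeper C0.2 state, 1 = lighter C0.1
    // execution resumes here at (approximately) the shared TSC deadline
}
E.g. each thread computes uint64_t deadline = start_tsc + 100000; from a shared start_tsc and calls wake_at(deadline).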
Otherwise you could spin-wait on a rdtsc deadline and probably get within ~25 cycles at worst on Skylake.
rdtsc has a throughput of one per 25 cycles on Skylake (https://agner.org/optimize/), so you expect each thread to be on average 12.5 cycles late leaving the spin-wait loop, +-12.5. I'm assuming the branch-mispredict cost for both threads is the same. These are core clock cycles, not the reference cycles that rdtsc counts. RDTSC typically ticks close to the max non-turbo clock. See How to get the CPU cycle count in x86_64 from C++? for more about RDTSC from C.
See How much delay is generated by this assembly code in linux for an asm function that spins on rdtsc waiting for a deadline. You could write this in C easily enough.
Staying in sync after initial start:
On a many-core Xeon where each core can change frequency independently, you'll need to fix the CPU frequency to something, probably max non-turbo would be a good choice. Otherwise with cores at different clock speeds, they'll obviously de-sync right away.
On a desktop you might want to do this anyway, in case pausing the clock to change CPU frequency throws things off.
Any difference in branch mispredicts, cache misses, or even different initial states of ROB/RS could lead to major desync.
More importantly, interrupts are huge and take a very long time compared to running 1 more instruction in an already-running task. And it can even lead to the scheduler doing a context switch to another thread. Or a CPU migration for the task, obviously costing a lot of cycles.

Who schedules threads?

I have a question about scheduling threads. On the one hand, I learned that threads are scheduled and treated as processes in Linux, meaning they get scheduled like any other process using the conventional methods (for example, the Completely Fair Scheduler in Linux).
On the other hand, I also know that the CPU might switch between threads using methods like Switch-on-Event or fine-grained multithreading. For example, on a cache-miss event the CPU switches threads. But what if the scheduler doesn't want to switch the thread? How do they agree on one action?
I'm really confused between the two: who schedules a thread, the OS or the CPU?
Thanks a lot :)
The answer is both.
What happens is really fairly simple: on a CPU that supports multiple threads per core (e.g., an Intel with Hyperthreading) the CPU appears to the OS as having some number of virtual cores. For example, an Intel i7 has 4 actual cores, but looks to the OS like 8 cores.
The OS schedules 8 threads onto those 8 (virtual) cores. When it's time to do a task switch, the OS's scheduler looks through the threads and finds the 8 that are...the most eligible to run (taking into account things like thread priority, time since they last ran, etc.)
The CPU has only 4 real cores, but those cores support executing multiple instructions simultaneously (and out of order, in the absence of dependencies). Incoming instructions get decoded and thrown into a "pool". Each clock cycle, the CPU tries to find some instructions in that pool that don't depend on the results of some previous instruction.
With multiple threads per core, what happens is basically that each actual core has two input streams to put into the "pool" of instructions it might be able to execute in a given clock cycle. Each clock cycle it still looks for instructions from that pool that don't depend on the results of previous instructions that haven't finished executing yet. If it finds some, it puts them into the execution units and executes them. The only major change is that each instruction now needs some sort of tag attached to designate which "virtual core" will be used to store results into--that is, each of the two threads has (for example) its own set of registers, and instructions from each thread have to write to the registers for that virtual core.
It is possible, however, for a CPU to support some degree of thread priority so that (for example) if the pool of available instructions includes some instructions from both input threads (or all N input threads, if there are more than two) it will prefer to choose instructions from one thread over instructions from another thread in any given clock cycle. This can be absolute, so it runs thread A as fast as possible, and thread B only with cycles A can't use, or it can be a "milder" preference, such as attempting to maintain a 2:1 ratio of instructions executed (or, of course, essentially any other ratio preferred).
Of course, there are other ways of setting up priorities as well (such as partitioning execution resources), but the general idea remains the same.
An OS that's aware of shared cores like this can also modify its scheduling to suit, such as scheduling only one thread on a pair of cores if that thread has higher priority.
The OS handles scheduling and dispatching of ready threads (those that require CPU) onto cores, managing CPU execution in a similar fashion as it manages other resources. A cache miss is no reason to swap out a thread. A page fault, where the desired page is not loaded into RAM at all, may cause a thread to be blocked until the page gets loaded from disk. The memory-management hardware does that by generating a hardware interrupt to an OS driver that handles the page fault.

How efficient is locking an unlocked mutex? What is the cost of a mutex?

In a low-level language (C, C++ or whatever): I have the choice between having a bunch of mutexes (like what pthread gives me, or whatever the native system library provides) or a single one for an object.
How efficient is it to lock a mutex? I.e. how many assembler instructions are there likely to be, and how much time do they take (in the case that the mutex is unlocked)?
How much does a mutex cost? Is it a problem to have really a lot of mutexes? Or can I just throw as much mutex variables in my code as I have int variables and it doesn't really matter?
(I am not sure how much differences there are between different hardware. If there is, I would also like to know about them. But mostly, I am interested about common hardware.)
The point is, by using many mutexes which each cover only a part of the object, instead of a single mutex for the whole object, I could avoid a lot of blocking. And I am wondering how far I should go with this. I.e. should I try to avoid any possible blocking as far as possible, no matter how much more complicated and how many more mutexes this means?
WebKit's blog post (2016) about locking is very relevant to this question, and explains the differences between a spinlock, adaptive lock, futex, etc.
I have the choice in between either having a bunch of mutexes or a single one for an object.
If you have many threads and the access to the object happens often, then multiple locks would increase parallelism. At the cost of maintainability, since more locking means more debugging of the locking.
How efficient is it to lock a mutex? I.e. how many assembler instructions are there likely to be, and how much time do they take (in the case that the mutex is unlocked)?
The precise assembler instructions are the least of the overhead of a mutex - the memory/cache-coherency guarantees are the main overhead. And the less often a particular lock is taken, the better.
A mutex is made of two major parts (oversimplifying): (1) a flag indicating whether the mutex is locked or not, and (2) a wait queue.
Changing the flag is just a few instructions and is normally done without a system call. If the mutex is locked, a syscall will happen to add the calling thread to the wait queue and start the waiting. Unlocking, if the wait queue is empty, is cheap, but otherwise needs a syscall to wake up one of the waiting processes. (On some systems cheap/fast syscalls are used to implement the mutexes; they become slow (normal) system calls only in case of contention.)
Locking an unlocked mutex is really cheap. Unlocking a mutex without contention is cheap too.
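To make the two parts concrete, here is a heavily simplified Linux sketch of that design: an atomic flag for the fast path, plus a futex as the kernel wait queue. This is not production quality; a real mutex also tracks a "locked with waiters" state so the uncontended unlock can skip the wake syscall:
#include <atomic>
#include <cstdint>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

struct FutexMutex {
    std::atomic<uint32_t> state{0};  // 0 = free, 1 = locked

    void lock() {
        uint32_t expected = 0;
        // Fast path: one atomic instruction, no system call.
        while (!state.compare_exchange_strong(expected, 1, std::memory_order_acquire)) {
            // Slow path: sleep in the kernel while the word still reads 1.
            syscall(SYS_futex, &state, FUTEX_WAIT, 1, nullptr, nullptr, 0);
            expected = 0;
        }
    }
    void unlock() {
        state.store(0, std::memory_order_release);
        // Wake one waiter, if any (always a syscall in this naive version).
        syscall(SYS_futex, &state, FUTEX_WAKE, 1, nullptr, nullptr, 0);
    }
};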
How much does a mutex cost? Is it a problem to have really a lot of mutexes? Or can I just throw as much mutex variables in my code as I have int variables and it doesn't really matter?
You can throw as many mutex variables into your code as you wish. You are only limited by the amount of memory your application can allocate.
Summary. User-space locks (and mutexes in particular) are cheap and not subject to any system limit. But too many of them spells a nightmare for debugging. A simple table:
1. Fewer locks means more contention (slow syscalls, CPU stalls) and less parallelism.
2. Fewer locks means fewer problems debugging multi-threading issues.
3. More locks means less contention and higher parallelism.
4. More locks means more chances of running into undebuggable deadlocks.
A balanced locking scheme for the application should be found and maintained, generally balancing #2 and #3.
(*) The problem with having fewer, very often locked mutexes is that if you have too much locking in your application, it causes too much inter-CPU/core traffic to flush the mutex memory from the data cache of other CPUs to guarantee cache coherency. The cache flushes are like light-weight interrupts and are handled by the CPUs transparently - but they do introduce so-called stalls (search for "stall").
And the stalls are what makes the locking code run slowly, often without any apparent indication of why the application is slow. (Some architectures provide the inter-CPU/core traffic stats, some don't.)
To avoid the problem, people generally resort to a large number of locks to decrease the probability of lock contention and to avoid the stalls. That is the reason why cheap user-space locking, not subject to system limits, exists.
I wanted to know the same thing, so I measured it.
On my box (AMD FX(tm)-8150 Eight-Core Processor at 3.612361 GHz), locking and unlocking an unlocked mutex that is in its own cache line and is already cached takes 47 clocks (13 ns). Due to synchronization between two cores (I used CPU #0 and #1), I could only call a lock/unlock pair once every 102 ns on two threads, so once every 51 ns, from which one can conclude that it takes roughly 38 ns to recover after a thread does an unlock before the next thread can lock it again.
The program that I used to investigate this can be found here:
https://github.com/CarloWood/ai-statefultask-testsuite/blob/b69b112e2e91d35b56a39f41809d3e3de2f9e4b8/src/mutex_test.cxx
Note that it has a few hardcoded values specific for my box (xrange, yrange and rdtsc overhead), so you probably have to experiment with it before it will work for you.
The graph it produces in that state (not reproduced here) shows the result of benchmark runs on the following code:
uint64_t do_Ndec(int thread, int loop_count)
{
  uint64_t start;
  uint64_t end;
  int __d0;

  // Read the TSC into 'start' (rdtsc puts the low half in eax, the high half in edx).
  asm volatile ("rdtsc\n\tshl $32, %%rdx\n\tor %%rdx, %0" : "=a" (start) : : "%rdx");
  mutex.lock();
  mutex.unlock();
  // Read the TSC again into 'end'.
  asm volatile ("rdtsc\n\tshl $32, %%rdx\n\tor %%rdx, %0" : "=a" (end) : : "%rdx");
  // Delay loop; one count shorter for thread 1 than for thread 0.
  asm volatile ("\n1:\n\tdecl %%ecx\n\tjnz 1b" : "=c" (__d0) : "c" (loop_count - thread) : "cc");
  return end - start;
}
The two rdtsc calls measure the number of clocks that it takes to lock and unlock `mutex' (with an overhead of 39 clocks for the rdtsc calls on my box). The third asm is a delay loop. The size of the delay loop is 1 count smaller for thread 1 than it is for thread 0, so thread 1 is slightly faster.
The above function is called in a tight loop of size 100,000. Although the function is slightly faster for thread 1, both loops synchronize because of the call to the mutex. This is visible in the graph from the fact that the number of clocks measured for the lock/unlock pair is slightly larger for thread 1, to account for the shorter delay in the loop below it.
In the above graph the bottom right point is a measurement with a delay loop_count of 150, and then following the points at the bottom, towards the left, the loop_count is reduced by one each measurement. When it becomes 77 the function is called every 102 ns in both threads. If subsequently loop_count is reduced even further it is no longer possible to synchronize the threads and the mutex starts to be actually locked most of the time, resulting in an increased amount of clocks that it takes to do the lock/unlock. Also the average time of the function call increases because of this; so the plot points now go up and towards the right again.
From this we can conclude that locking and unlocking a mutex every 50 ns is not a problem on my box.
All in all, my conclusion is that the answer to the OP's question is that adding more mutexes is better, as long as that results in less contention.
Try to hold mutexes for as short a time as possible. The only reason to put them, say, outside a loop would be if that loop loops faster than once every 100 ns (or rather, the number of threads that want to run that loop at the same time, times 50 ns), or when 13 ns times the loop size is more delay than the delay you get from contention.
EDIT: I have become a lot more knowledgeable on the subject now and have started to doubt the conclusion that I presented here. First of all, CPU 0 and 1 turn out to be hyper-threaded; even though AMD claims to have 8 real cores, there is certainly something very fishy, because the delays between two other cores are much larger (i.e., 0 and 1 form a pair, as do 2 and 3, 4 and 5, and 6 and 7). Secondly, std::mutex is implemented in such a way that it spins for a bit before actually doing system calls when it fails to immediately obtain the lock on a mutex (which no doubt will be extremely slow). So what I have measured here is the absolutely most ideal situation, and in practice locking and unlocking might take drastically more time per lock/unlock.
Bottom line, a mutex is implemented with atomics. To synchronize atomics between cores, an internal bus must be locked, which freezes the corresponding cache line for several hundred clock cycles. In the case that a lock cannot be obtained, a system call has to be performed to put the thread to sleep; that is obviously extremely slow (system calls are on the order of 10 microseconds). Normally that is not really a problem because the thread has to sleep anyway, but it could be a problem with high contention, where a thread can't obtain the lock for the time that it normally spins, and so does the system call, but CAN take the lock shortly thereafter. For example, if several threads lock and unlock a mutex in a tight loop and each keeps the lock for 1 microsecond or so, then they might be slowed down enormously by the fact that they are constantly put to sleep and woken up again. Also, once a thread sleeps and another thread has to wake it up, that thread has to do a system call and is delayed ~10 microseconds; this delay thus happens while unlocking a mutex when another thread is waiting for that mutex in the kernel (after spinning took too long).
This depends on what you actually call a "mutex", the OS mode, etc.
At minimum, it's the cost of an interlocked memory operation. It's a relatively heavy operation (compared to other primitive assembler commands).
However, it can be very much higher. If what you call a "mutex" is a kernel object (i.e. an object managed by the OS) and you run in user mode, every operation on it leads to a kernel-mode transaction, which is very heavy.
For example, on an Intel Core Duo processor under Windows XP:
Interlocked operation: about 40 CPU cycles.
Kernel-mode call (i.e. system call): about 2000 CPU cycles.
If this is the case - you may consider using critical sections. It's a hybrid of a kernel mutex and interlocked memory access.
I'm completely new to pthreads and mutexes, but I can confirm from experimentation that the cost of locking/unlocking a mutex is almost zilch when there is no contention, but when there is contention, the cost of blocking is extremely high. I ran a simple program with a thread pool in which the task was just to compute a sum in a global variable protected by a mutex lock:
y = exp(-j*0.0001);           // some work computed outside the lock
pthread_mutex_lock(&lock);    // protect the shared accumulator x
x += y;
pthread_mutex_unlock(&lock);
With one thread, the program sums 10,000,000 values virtually instantaneously (less than one second); with two threads (on a MacBook with 4 cores), the same program takes 39 seconds.
The cost will vary depending on the implementation but you should keep in mind two things:
the cost will most likely be minimal, since it's both a fairly primitive operation and it will be optimised as much as possible due to its usage pattern (it is used a lot).
it doesn't matter how expensive it is since you need to use it if you want safe multi-threaded operation. If you need it, then you need it.
On single processor systems, you can generally just disable interrupts long enough to atomically change data. Multi-processor systems can use a test-and-set strategy.
In both those cases, the instructions are relatively efficient.
As to whether you should provide a single mutex for a massive data structure, or have many mutexes, one for each section of it, that's a balancing act.
By having a single mutex, you have a higher risk of contention between multiple threads. You can reduce this risk by having a mutex per section but you don't want to get into a situation where a thread has to lock 180 mutexes to do its job :-)
I just measured it on my Windows 10 system.
This is testing Single Threaded code with no contention at all.
Compiler: Visual Studio 2019, x64 release, with loop overhead subtracted from measurements.
Using std::mutex takes about 74 machine cycles, while using a native Win32 CRITICAL_SECTION takes about 53 machine cycles.
So unless 100 machine cycles is a significant amount of time compared to the code itself, the mutexes aren't going to be the source of a performance problem.
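For reference, a minimal version of such a measurement (single-threaded, no contention; loop overhead is not subtracted here, so treat the result as an upper bound):
#include <chrono>
#include <cstdio>
#include <mutex>

int main()
{
    std::mutex m;
    const int N = 10000000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        m.lock();    // uncontended fast path
        m.unlock();
    }
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / N;
    printf("%.1f ns per lock/unlock pair\n", ns);
}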
