I'm using CentOS Linux release 7.3.1611 on an Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz.
During tests of my userspace application, I have noticed that clock_gettime(CLOCK_MONOTONIC, &ts) may take up to 5-6 microseconds instead of ~23 nanoseconds on average. It may happen only once per 10,000 consecutive calls, but it does happen.
If the vDSO were not being used, this could be explained. However, the vDSO is used for every clock_gettime call (I checked with strace).
It doesn't matter whether the corresponding thread is affined to a particular CPU core, or whether that core is isolated from the OS. In other words, the test app can run on an exclusive CPU core and the lag may still appear!
I'm measuring latency by comparing the results of two consecutive clock_gettime calls, like this:
#include <time.h>

#define NANO_SECONDS_IN_SEC 1000000000ULL

unsigned long long __gettimeLatencyNs() {
    struct timespec t1_ts;
    struct timespec t2_ts;
    clock_gettime(CLOCK_MONOTONIC, &t1_ts);
    clock_gettime(CLOCK_MONOTONIC, &t2_ts);
    return ((t2_ts.tv_sec - t1_ts.tv_sec) * NANO_SECONDS_IN_SEC
            + t2_ts.tv_nsec - t1_ts.tv_nsec);
}
Could anyone share some ideas about what could be going wrong here?
Let's look at the source code for clock_gettime:
/* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
notrace static int __always_inline do_realtime(struct timespec *ts)
{
    unsigned long seq;
    u64 ns;
    int mode;

    do {
        seq = gtod_read_begin(gtod);
        mode = gtod->vclock_mode;
        ts->tv_sec = gtod->wall_time_sec;
        ns = gtod->wall_time_snsec;
        ns += vgetsns(&mode);
        ns >>= gtod->shift;
    } while (unlikely(gtod_read_retry(gtod, seq)));

    ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
    ts->tv_nsec = ns;
    return mode;
}
What we see here is that the code runs inside a loop. This loop is annotated with an unlikely condition. The condition has to do with the fact that this code reads shared memory that is sometimes updated, and while it is being updated, the code needs to wait for the update to complete.
The most likely answer to your question, then, is that every so often you catch clock_gettime while the corresponding kernel code is updating its structures. When that happens, the code runs significantly slower.
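To make that concrete, here is a minimal user-space sketch of the same seqlock-style read pattern (illustrative only - the TimeData structure and its fields are invented, and the plain reads of sec/nsec are technically a data race that the kernel's version handles with its own primitives):

#include <atomic>
#include <cstdint>

// Invented structure standing in for the kernel's gtod data.
struct TimeData {
    std::atomic<uint32_t> seq{0};   // odd while a writer is mid-update
    uint64_t sec = 0;
    uint64_t nsec = 0;
};

// Reader side: equivalent in spirit to gtod_read_begin()/gtod_read_retry().
void read_time(const TimeData &td, uint64_t &sec, uint64_t &nsec)
{
    uint32_t s;
    do {
        // Wait for an even sequence number (no writer active), then snapshot.
        do {
            s = td.seq.load(std::memory_order_acquire);
        } while (s & 1);
        sec = td.sec;
        nsec = td.nsec;
        // If the writer bumped the counter while we were reading, retry.
    } while (td.seq.load(std::memory_order_acquire) != s);
}

A reader that hits the update window spins until the writer finishes, which is exactly the slow case described above.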
I don't think it's the logic of the clock_gettime call itself that is periodically taking longer, but rather that your timing loop is periodically being interrupted, and this extra time shows up as an extra-long interval.
That is, any type of timing loop is subject to being interrupted by external events, such as interrupts. For example, except with a very specific tickless kernel configuration (not the default), your application will be interrupted periodically by the scheduler interrupt, which will do a bit of processing to see if another process should run. Even if no other process ultimately ends up running, this could easily account for a few microseconds.
In addition, the hardware may temporarily pause for a variety of reasons, such as frequency transitions that occur when other cores enter or leave the idle state. I have measured these transitions at around 8 microseconds, close to the value you report. During these pauses, the CPU isn't executing instructions, but the TSC keeps running, so it shows up as an extra-long interval.
Beyond that, there are a ton of reasons why you might experience outlier timings, as well as ways you could narrow down the possible causes if it interests you.
Finally, the other answer suggests that clock_gettime itself may block while the kernel updates the data structure. While that's certainly possible, I think it's less likely than the other reasons. You could copy and paste the VDSO code, modify it to record whether any blocking actually happened, and call that to see if your pauses correlate with blocking. I would guess not.
Related
My main thread creates 8 worker threads (on a machine with a 4 core, 8 thread CPU), and then waits for them to complete with pthread_join(). The threads all exit successfully, and the pthread_join() successfully completes. However, I log the times that the threads exit and the time that pthread_join() completes for the last thread; the threads all exit essentially simultaneously (not surprising -- they are servicing a queue of work to be done), and the pthread_join() sometimes takes quite a long time to complete -- I have seen times in excess of 15 minutes after the last worker thread has exited!
More information: The worker threads are all set at the highest allowable round-robin scheduling priority (SCHED_RR); I have tried setting the main thread (waiting on the pthread_join()s) to the same thing and have also tried setting it to the highest SCHED_FIFO priority (where so far I have only seen it take as long as 27 seconds to complete; more testing is needed). My test is very CPU and memory intensive and takes about 90 -- 100 minutes to complete; during that time it is generally using all 8 threads at close to 100% capacity, and fairly quickly gets to where it is using about 90% of the 256 GB of RAM. This is running on a Linux (Fedora) OS at run level 3 (so no graphics or Window Manager -- essentially just a terminal -- because at the usual run level 5, a process using that much memory gets killed by the system).
An earlier version that took closer to 4 hours to complete (I have since made some performance improvements...) and in which I did not bother explicitly setting the priority of the main thread once took over an hour and 20 minutes for the pthread_join() to complete. I mention it because I don't really think that the main thread priority should be much of an issue -- there is essentially nothing else happening on the machine, it is not even on the network.
As I mentioned, all the threads complete with EXIT_SUCCESS. And in lighter weight tests, where the processing is over in seconds, I see no such delay. And so I am left suspecting that this is a scheduler issue. I know very little about the scheduler, but informally the impression I have is that here is this thread that has been waiting on a pthread_join() for well over an hour; perhaps the scheduler eventually shuffles it off to a queue of "very unlikely to require any processing time" tasks, and only checks it rarely.
Okay, eventually it completes. But ultimately, to get my work done, I have to run about 1000 of these, and some are likely to take a great deal longer than the 90 minutes or so that the case I have been testing takes. So I have to worry that the pthread_join() in those cases might delay even longer, and with 1000 iterations, those delays are going to add up to real time...
Thanks in advance for any suggestions.
In response to Nate's excellent questions and suggestions:
I have used top to spy on the process when it is in this state; all I can report is that it is using minimal CPU (maybe an occasional 2%, compared to the usual 700 - 800% that top reports for 8 threads running flat out, modulo some contention for locked resources). I am aware that top has all kinds of options I haven't investigated, and will look into how to run it to display information about the state of the main thread. (I see: I can use the -H option, and look in the S column... will do.) It is definitely not a matter of all the memory being swapped out -- my code is very careful to stay below the limit of physical memory, and does some disk I/O of its own to save and restore information that can't fit in memory. As a result little to no virtual memory is in use at any time.
I don't like my theory about the scheduler either... It's just the best I have been able to come up with so far...
As far as how I am determining when things happen: The exiting code does:
time_t now;
time(&now);
printf("Thread exiting, %s", ctime(&now));
pthread_exit((void *)EXIT_SUCCESS);  /* pthread_exit takes a void * */
and then the main thread does:
for (int i = 0; i < WORKER_THREADS; i++)
{
    pthread_join(threads[i], NULL);
}
time(&now);
printf("Last worker thread has exited, %s", ctime(&now));
I like the idea of printing something each time pthread_join() returns, to see if we're waiting for the first thread to complete, the last thread to complete, or one in the middle, and will make that change.
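For what it's worth, a minimal version of that change (using now, threads[] and WORKER_THREADS from the snippets above) could be:

for (int i = 0; i < WORKER_THREADS; i++)
{
    pthread_join(threads[i], NULL);
    time(&now);
    printf("pthread_join() returned for thread %d, %s", i, ctime(&now));
}

That should show whether the delay is spent waiting on the first join or only on the last one.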
A couple of other potentially relevant facts that have occurred to me since my original posting: I am using the GMP (GNU Multiprecision Arithmetic) library, which I can't really imagine matters; and I am also using a 3rd party (open source) library to create "canonical graphs," and that library, in order to be used in a multithreaded environment, does use some thread_local storage. I will have to dig into the particulars; still, it doesn't seem like cleaning that up should take any appreciable amount of time, especially without also using an appreciable amount of CPU.
For example, on x86, two CPU cores are running different software threads.
At some moment, these two threads need to run on their CPU cores at the same time.
Is there a way to sync up these two CPU cores/threads, or something like that, to make them start to run at (almost) the same time (at the instruction level)?
Use a shared variable to communicate a rdtsc based deadline between the two threads. E.g., set a deadline of say the current rdtsc value plus 10,000.
Then have both threads spin on rdtsc, waiting until the gap between the current rdtsc value and the deadline is less than a threshold value T (T = 100 should be fine). Finally, use the final gap value (that is, the deadline rdtsc value minus the last-read rdtsc value) to jump into a sequence of dependent add instructions such that the number of add instructions is equal to the gap.
This final step compensates for the fact that each chip will generally not be "in phase" with respect to their rdtsc spin loop. E.g., assuming a 30-cycle back-to-back throughput for rdtsc readings, one chip may get readings of 890, 920, 950 etc, while the other may read 880, 910, 940 so there will be a 10 or 20 cycle error if rdtsc alone is used. Using the add slide compensation, if the deadline was 1,000, and with a threshold of 100, the first thread would trigger at rdtsc == 920 and execute 80 additions, while the second would trigger at rdtsc == 910 and execute 90 additions. In principle both cores are then approximately synced up.
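A rough sketch of that scheme, using the __rdtsc() intrinsic (the deadline/threshold values are illustrative, and a dependent-add loop stands in for a true jump-into-an-add-slide, so it is less precise):

#include <x86intrin.h>   // __rdtsc()
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> deadline{0};   // shared rdtsc-based deadline, set by one thread

void start_together()
{
    const int64_t T = 100;           // threshold in TSC ticks (tune per machine)
    const uint64_t d = deadline.load();

    // Spin until we are within T ticks of the deadline.
    int64_t gap;
    do {
        gap = (int64_t)(d - __rdtsc());
    } while (gap > T);

    // Compensate for the remaining gap with dependent additions. A real
    // implementation would jump into an "add slide" of single-cycle adds;
    // this loop is a cruder stand-in with some per-iteration overhead.
    volatile int64_t x = 0;
    for (int64_t i = 0; i < gap; i++)
        x = x + 1;

    // ... payload that should now start at (almost) the same time on both threads ...
}

One thread (or a coordinator) would do something like deadline.store(__rdtsc() + 10000); before both threads call start_together().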
Some notes:
The above assumes a CPU frequency equal to the nominal rdtsc frequency - if that's not the case, you'll have to apply a compensation factor based on the nominal-to-true frequency ratio when calculating where to jump into the add slide.
Don't expect your CPUs to stay synced for long: anything like an interrupt, a variable-latency operation like a cache miss, or a lot of other things can make them get out of sync.
You want all your payload code, and the addition slide to be hot in the icache of each core, or else they are very likely to get out of sync immediately. You can warm up the icache by doing one or more dummy runs through this code prior to the sync.
You want T to be large enough that the gap is always positive, so somewhat larger than the back-to-back rdtsc latency, but not so large as to increase the chance of events like interrupts during the add slide.
You can check the effectiveness of the "sync" by issuing a rdtsc or rdtscp at various points in the "payload" code following the sync up and seeing how close the recorded values are across threads.
A totally different option would be to use Intel TSX: transactional extensions. Organize for the two threads that want to coordinate to both read a shared line inside a transactional region and then spin, and have a third thread to write to the shared line. This will cause an abort on both of the waiting threads. Depending on the inter-core topology, the two waiting threads may receive the invalidation and hence the subsequent TSX abort at nearly the same time. Call the code you want to run "in sync" from the abort handler.
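A hedged sketch of that TSX variant, using the RTM intrinsics from <immintrin.h> (requires a CPU with TSX enabled and -mrtm; run_payload is a placeholder, and a real version would need to handle spurious aborts and provide a fallback):

#include <immintrin.h>   // _xbegin, _xend, _XBEGIN_STARTED

inline void run_payload() { /* hypothetical "in sync" work */ }

alignas(64) volatile int shared_line = 0;  // cache line both waiters read

// Called by each of the two threads that want to start together.
void wait_for_release()
{
    if (_xbegin() == _XBEGIN_STARTED) {
        while (shared_line == 0) { }       // pull the line into the read set and spin
        _xend();                           // normally never reached
    }
    // Typically we get here via an abort caused by the third thread's write,
    // which both waiters should observe at nearly the same time. (Spurious
    // aborts - interrupts, capacity - also land here; a real version would
    // inspect the abort status and retry or fall back.)
    run_payload();
}

// Called by the releasing (third) thread.
void trigger_release()
{
    shared_line = 1;                       // invalidates the line, aborting both waiters
}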
Depending on your definition of "(almost) the same time", this is a very hard problem microarchitecturally.
Even the definition of "Run" isn't specific enough if you care about timing down to the cycle. Do you mean issue from the front-end into the out-of-order back-end? Execute? (dispatch to an execution unit? or complete execution successfully without needing a replay?) Or retire?
I'd tend to go with Execute1 because that's when an instruction like rdtsc samples the timestamp counter. That is, it's the one whose timing you can actually record and then compare later.
footnote 1: on the correct path, not in the shadow of a mis-speculation, unless you're also ok with executions that don't reach retirement.
But if the two cores have different ROB / RS states when the instruction you care about executes, they won't continue in lock-step. (There are extremely few in-order x86-64 CPUs, like some pre-Silvermont Atoms, and early Xeon Phi: Knight's Corner. The x86-64 CPUs of today are all out-of-order, and outside of low-power Silvermont-family are aggressively so with large ROB + scheduler.)
x86 asm tricks:
I haven't used it, but x86 asm monitor / mwait to have both CPUs monitor and wait for a write to a given memory location could work. I don't know how synchronized the wakeup is. I'd guess that the less deep the sleep, the less variable the latency.
Early wake-up from an interrupt coming before a write is always possible. Unless you disable interrupts, you aren't going to be able to make this happen 100% of the time; hopefully you just need to make it happen with some reasonable chance of success, and be able to tell after the fact whether you achieved it.
(On very recent low-power Intel CPUs (Tremont), a user-space-usable version of these is available: umonitor / umwait. But in the kernel you can probably just use monitor/mwait.)
If umonitor/umwait are available, that means you have the WAITPKG CPU feature which also includes tpause: like pause but wait until a given TSC timestamp.
On modern x86 CPUs, the TSC is synchronized between all cores by hardware, so using the same wake-up time for multiple cores makes this trivial.
Otherwise you could spin-wait on a rdtsc deadline and probably get within ~25 cycles at worst on Skylake.
rdtsc has a throughput of one per 25 cycles on Skylake (https://agner.org/optimize/), so you'd expect each thread to be on average 12.5 cycles late leaving the spin-wait loop, +-12.5. I'm assuming the branch-mispredict cost for both threads is the same. These are core clock cycles, not the reference cycles that rdtsc counts. RDTSC typically ticks close to the max non-turbo clock. See How to get the CPU cycle count in x86_64 from C++? for more about RDTSC from C.
See How much delay is generated by this assembly code in linux for an asm function that spins on rdtsc waiting for a deadline. You could write this in C easily enough.
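A minimal C version of that spin (tsc_deadline is assumed to be a TSC value both threads agreed on beforehand, e.g. via a shared variable):

#include <x86intrin.h>   /* __rdtsc() */
#include <stdint.h>

/* Spin until the TSC reaches the given deadline. Returns the TSC value
 * observed on exit so the caller can see how late it actually was. */
static inline uint64_t spin_until_tsc(uint64_t tsc_deadline)
{
    uint64_t now;
    do {
        now = __rdtsc();
    } while ((int64_t)(tsc_deadline - now) > 0);
    return now;
}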
Staying in sync after initial start:
On a many-core Xeon where each core can change frequency independently, you'll need to fix the CPU frequency to something, probably max non-turbo would be a good choice. Otherwise with cores at different clock speeds, they'll obviously de-sync right away.
On a desktop you might want to do this anyway, in case pausing the clock to change CPU frequency throws things off.
Any difference in branch mispredicts, cache misses, or even different initial states of ROB/RS could lead to major desync.
More importantly, interrupts are huge and take a very long time compared to running 1 more instruction in an already-running task. And it can even lead to the scheduler doing a context switch to another thread. Or a CPU migration for the task, obviously costing a lot of cycles.
When a process is set to run with an initial time slice of 10, for example, something in the hardware should know this initial timeslice and decrement it, and when the time slice reaches 0, an interrupt should be fired!
In the FreeBSD kernel, I understand that hardclock and softclock do this accounting task. But my question is: does this decrementing of the clock happen in parallel with the execution of the process?
I'll use the PIT as an example here, because it's the simplest timing mechanism (and has been around for quite a while).
Also, this answer is fairly x86-specific and OS-agnostic; I don't know enough about the internals of FreeBSD and Linux to answer for them specifically. Someone else might be more capable of that.
Essentially, the timeslice is "decremented" parallel to the execution of the process as the timer creates an IRQ for each "tick" (note that timers such as the HPET can do 'one-shot' mode, which fires an IRQ after a specific delay, which can be used for scheduling as well). Once the timeslice decrements to zero, the scheduler is notified and a task switch occurs. All this happens "at the same time" as your process: the IRQ jumps in, runs some code, then lets your process keep going until the timeslice runs out.
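Conceptually, the tick path looks something like the following sketch (this is illustrative pseudo-kernel C, not FreeBSD's or Linux's actual code; the helper functions are hypothetical stand-ins):

/* Conceptual sketch only - the helpers below are hypothetical. */
struct task {
    int timeslice;                   /* remaining ticks for the current task */
};

struct task *current_task(void);     /* hypothetical: task running on this CPU */
void schedule(void);                 /* hypothetical: pick another runnable task */
void acknowledge_timer_irq(void);    /* hypothetical: tell the timer we're done */

/* Invoked by the timer IRQ while the user process "runs in parallel". */
void timer_interrupt(void)
{
    struct task *cur = current_task();

    if (cur->timeslice > 0)
        cur->timeslice--;            /* the accounting happens inside the IRQ */
    if (cur->timeslice == 0)
        schedule();                  /* timeslice used up: switch tasks */

    acknowledge_timer_irq();
}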
It should be noted that, generally speaking, you don't see a process running to the end of its timeslice, as task switches can occur as the direct result of a system call (for example, a read from disk that blocks, or even a write to a terminal).
This was simpler in the misty past: a clock chip -- a discrete device on the motherboard -- would be configured to fire interrupts periodically at a rate of X Hz. Every time this "timer interrupt" went off, execution of the current program would be suspended (just like any other interrupt) and the kernel's scheduler code would decrement its timeslice. When the timeslice got all the way to zero, the kernel would take the CPU away from the program and give it to another one. The clock chip, being separate from the CPU, obviously runs in parallel with the execution of the program, but the kernel's bookkeeping work has to interrupt the program (this is the misty past we're talking about, so there is only one CPU, so kernel code and user code cannot run simultaneously).
Nowadays, the clock is not a discrete device, it's part of the CPU, and it can be programmed to do all sorts of clever things. Most importantly it can be programmed to fire one interrupt after N microseconds, where N can be quite large; this allows the kernel to idle the CPU for a very long time (in computer terms; maybe, like, a whole second) if there's nothing constructive for it to do, saving power. Meanwhile, it's hard to find a single-core CPU anymore, kernels do all sorts of clever tricks to push their bookkeeping work off to CPUs that don't have anything better to do, and timeslice accounting has gotten a whole lot more complicated. Linux currently uses the "Completely Fair Scheduler" which doesn't even really have a concept of "time slices". I don't know what FreeBSD's got, but I would be surprised if it was simple.
So the short answer to your question is "mostly in parallel, more so now than in the past, but it's not remotely as simple as a countdown timer anymore".
In a low-level language (C, C++ or whatever): I have the choice between having a bunch of mutexes (like what pthread gives me, or whatever the native system library provides) or a single one for an object.
How efficient is it to lock a mutex? I.e., how many assembler instructions are there likely to be, and how much time do they take (in the case that the mutex is unlocked)?
How much does a mutex cost? Is it a problem to have a lot of mutexes? Or can I just throw as many mutex variables into my code as I have int variables, and it doesn't really matter?
(I am not sure how much difference there is between different hardware. If there is, I would also like to know about it. But mostly, I am interested in common hardware.)
The point is that by using many mutexes which each cover only a part of the object, instead of a single mutex for the whole object, I could avoid many blocks. And I am wondering how far I should go with this. I.e., should I try to avoid every possible block as far as possible, no matter how much more complicated and how many more mutexes that means?
WebKit's blog post (2016) about locking is very relevant to this question, and explains the differences between a spinlock, adaptive lock, futex, etc.
I have the choice between having a bunch of mutexes or a single one for an object.
If you have many threads and access to the object happens often, then multiple locks will increase parallelism - at the cost of maintainability, since more locking means more debugging of the locking.
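For example, a minimal lock-striping sketch (the class and section count are made up for illustration), where each part of the object has its own mutex so threads touching different sections don't contend:

#include <array>
#include <cstddef>
#include <mutex>

// Hypothetical object whose state is partitioned into independent sections,
// each guarded by its own mutex instead of one big lock.
class StripedCounters {
    static const std::size_t kSections = 16;
    struct Section {
        std::mutex m;
        long value = 0;
    };
    std::array<Section, kSections> sections_;

public:
    void add(std::size_t key, long delta) {
        Section &s = sections_[key % kSections];
        std::lock_guard<std::mutex> g(s.m);   // only this section is locked
        s.value += delta;
    }

    long total() {                            // must take every lock; slower path
        long sum = 0;
        for (Section &s : sections_) {
            std::lock_guard<std::mutex> g(s.m);
            sum += s.value;
        }
        return sum;
    }
};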
How efficient is it to lock a mutex? I.e., how many assembler instructions are there likely to be, and how much time do they take (in the case that the mutex is unlocked)?
The precise assembler instructions are the least of a mutex's overhead - the memory/cache coherency guarantees are the main overhead. And the less often a particular lock is taken, the better.
A mutex is made of two major parts (oversimplifying): (1) a flag indicating whether the mutex is locked or not, and (2) a wait queue.
Changing the flag takes just a few instructions and is normally done without a system call. If the mutex is locked, a syscall happens to add the calling thread to the wait queue and start waiting. Unlocking, if the wait queue is empty, is cheap, but otherwise it needs a syscall to wake up one of the waiting processes. (On some systems, cheap/fast syscalls are used to implement mutexes; they become slow (normal) system calls only in case of contention.)
Locking an uncontended mutex is really cheap. Unlocking a mutex without contention is cheap too.
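To make that two-part structure concrete, here is a rough Linux-flavoured sketch of a futex-based lock (illustrative only - this is not glibc's actual pthread_mutex implementation; the three-state encoding follows the classic "Futexes Are Tricky" paper):

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static std::atomic<int> state{0};   // 0 = unlocked, 1 = locked, 2 = locked + waiters

static long futex_call(int op, int val)
{
    // Relies on std::atomic<int> having the same layout as int (true on Linux/glibc).
    return syscall(SYS_futex, reinterpret_cast<int *>(&state), op, val,
                   nullptr, nullptr, 0);
}

void lock()
{
    int c = 0;
    // Fast path: an uncontended acquire is one atomic CAS, no system call.
    if (state.compare_exchange_strong(c, 1))
        return;
    // Slow path: mark the lock contended (2), then sleep in the kernel's wait queue.
    if (c != 2)
        c = state.exchange(2);
    while (c != 0) {
        futex_call(FUTEX_WAIT, 2);   // sleeps only if the value is still 2
        c = state.exchange(2);
    }
}

void unlock()
{
    // Fast path: if nobody was marked as waiting (old value was 1), we're done.
    if (state.exchange(0) == 2)
        futex_call(FUTEX_WAKE, 1);   // slow path: wake one sleeper
}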
How much does a mutex cost? Is it a problem to have a lot of mutexes? Or can I just throw as many mutex variables into my code as I have int variables, and it doesn't really matter?
You can throw as many mutex variables into your code as you wish. You are only limited by the amount of memory your application can allocate.
Summary. User-space locks (and mutexes in particular) are cheap and not subject to any system limit. But too many of them spells a nightmare for debugging. A simple table:
1. Fewer locks means more contention (slow syscalls, CPU stalls) and less parallelism. (*)
2. Fewer locks means fewer problems when debugging multi-threading issues.
3. More locks means less contention and higher parallelism.
4. More locks means more chances of running into undebuggable deadlocks.
A balanced locking scheme for the application should be found and maintained, generally balancing #2 and #3.
(*) The problem with having fewer, very frequently locked mutexes is that if you have too much locking in your application, it causes too much inter-CPU/core traffic to flush the mutex memory from the data caches of other CPUs in order to guarantee cache coherency. The cache flushes are like lightweight interrupts and are handled by the CPUs transparently - but they do introduce so-called stalls (search for "stall").
And the stalls are what makes the locking code run slowly, often without any apparent indication of why the application is slow. (Some architectures provide inter-CPU/core traffic stats, some don't.)
To avoid the problem, people generally resort to a large number of locks to decrease the probability of lock contention and to avoid the stalls. That is the reason why cheap user-space locking, not subject to any system limit, exists.
I wanted to know the same thing, so I measured it.
On my box (AMD FX(tm)-8150 Eight-Core Processor at 3.612361 GHz),
locking and unlocking an unlocked mutex that is in its own cache line and is already cached, takes 47 clocks (13 ns).
Due to synchronization between two cores (I used CPU #0 and #1),
I could only call a lock/unlock pair once every 102 ns on two threads,
so once every 51 ns, from which one can conclude that it takes roughly 38 ns to recover after a thread does an unlock before the next thread can lock it again.
The program that I used to investigate this can be found here:
https://github.com/CarloWood/ai-statefultask-testsuite/blob/b69b112e2e91d35b56a39f41809d3e3de2f9e4b8/src/mutex_test.cxx
Note that it has a few hardcoded values specific for my box (xrange, yrange and rdtsc overhead), so you probably have to experiment with it before it will work for you.
The graph it produces in that state is described below (the image itself is not reproduced here).
This shows the result of benchmark runs on the following code:
#include <cstdint>
#include <mutex>

std::mutex mutex;

uint64_t do_Ndec(int thread, int loop_count)
{
    uint64_t start;
    uint64_t end;
    int __d0;

    asm volatile ("rdtsc\n\tshl $32, %%rdx\n\tor %%rdx, %0" : "=a" (start) : : "%rdx");
    mutex.lock();
    mutex.unlock();
    asm volatile ("rdtsc\n\tshl $32, %%rdx\n\tor %%rdx, %0" : "=a" (end) : : "%rdx");
    // Delay loop: loop_count - thread dependent decrements.
    asm volatile ("\n1:\n\tdecl %%ecx\n\tjnz 1b" : "=c" (__d0) : "c" (loop_count - thread) : "cc");
    return end - start;
}
The two rdtsc calls measure the number of clocks that it takes to lock and unlock `mutex' (with an overhead of 39 clocks for the rdtsc calls on my box). The third asm is a delay loop. The size of the delay loop is 1 count smaller for thread 1 than it is for thread 0, so thread 1 is slightly faster.
The above function is called in a tight loop of size 100,000. Even though the function is slightly faster for thread 1, both loops synchronize because of the calls to the mutex. This is visible in the graph from the fact that the number of clocks measured for the lock/unlock pair is slightly larger for thread 1, to account for the shorter delay in the loop below it.
In the above graph the bottom right point is a measurement with a delay loop_count of 150, and then following the points at the bottom, towards the left, the loop_count is reduced by one each measurement. When it becomes 77 the function is called every 102 ns in both threads. If subsequently loop_count is reduced even further it is no longer possible to synchronize the threads and the mutex starts to be actually locked most of the time, resulting in an increased amount of clocks that it takes to do the lock/unlock. Also the average time of the function call increases because of this; so the plot points now go up and towards the right again.
From this we can conclude that locking and unlocking a mutex every 50 ns is not a problem on my box.
All in all, my conclusion is that the answer to the OP's question is that adding more mutexes is better as long as that results in less contention.
Try to hold mutexes for as short a time as possible. The only reason to put them - say - outside a loop would be if that loop loops faster than once every 100 ns (or rather, the number of threads that want to run that loop at the same time, times 50 ns), or when 13 ns times the loop size is more delay than the delay you get from contention.
EDIT: I have become a lot more knowledgeable on the subject now and am starting to doubt the conclusion that I presented here. First of all, CPU 0 and 1 turn out to be hyper-threaded; even though AMD claims to have 8 real cores, there is certainly something very fishy, because the delays between two other cores are much larger (i.e., 0 and 1 form a pair, as do 2 and 3, 4 and 5, and 6 and 7). Secondly, std::mutex is implemented in such a way that it spins for a bit before actually doing system calls when it fails to immediately obtain the lock on a mutex (which no doubt will be extremely slow). So what I have measured here is the absolute most ideal situation, and in practice locking and unlocking might take drastically more time per lock/unlock.
Bottom line, a mutex is implemented with atomics. To synchronize atomics between cores, an internal bus must be locked, which freezes the corresponding cache line for several hundred clock cycles. In the case that a lock can not be obtained, a system call has to be performed to put the thread to sleep; that is obviously extremely slow (system calls are on the order of 10 microseconds). Normally that is not really a problem because that thread has to sleep anyway -- but it could be a problem with high contention, where a thread can't obtain the lock for the time that it normally spins, and so does the system call, but CAN take the lock shortly thereafter. For example, if several threads lock and unlock a mutex in a tight loop and each keeps the lock for 1 microsecond or so, then they might be slowed down enormously by the fact that they are constantly put to sleep and woken up again. Also, once a thread sleeps and another thread has to wake it up, that thread has to do a system call and is delayed ~10 microseconds; this delay thus happens while unlocking a mutex when another thread is waiting for that mutex in the kernel (after spinning took too long).
This depends on what you actually call a "mutex", the OS mode, etc.
At minimum it's the cost of an interlocked memory operation. It's a relatively heavy operation (compared to other primitive assembler instructions).
However, it can be very much more expensive. If what you call a "mutex" is a kernel object (i.e., an object managed by the OS) and you run in user mode, every operation on it leads to a kernel-mode transition, which is very heavy.
For example, on an Intel Core Duo processor running Windows XP:
Interlocked operation: takes about 40 CPU cycles.
Kernel-mode call (i.e. system call): about 2000 CPU cycles.
If this is the case, you may consider using critical sections. They're a hybrid of a kernel mutex and interlocked memory access.
I'm completely new to pthreads and mutex, but I can confirm from experimentation that the cost of locking/unlocking a mutex is almost zilch when there is no contention, but when there is contention, the cost of blocking is extremely high. I ran a simple code with a thread pool in which the task was just to compute a sum in a global variable protected by a mutex lock:
y = exp(-j*0.0001);
pthread_mutex_lock(&lock);
x += y ;
pthread_mutex_unlock(&lock);
With one thread, the program sums 10,000,000 values virtually instantaneously (less than one second); with two threads (on a MacBook with 4 cores), the same program takes 39 seconds.
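For reference, a self-contained version of that experiment might look roughly like this (a reconstruction, not the exact program; thread count and iteration count are illustrative; compile with -pthread -lm):

#include <math.h>
#include <pthread.h>
#include <stdio.h>

#define N 10000000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static double x = 0.0;

static void *worker(void *arg)
{
    long iters = (long)arg;
    for (long j = 0; j < iters; j++) {
        double y = exp(-j * 0.0001);
        pthread_mutex_lock(&lock);    /* heavily contended when several threads run this */
        x += y;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    const int nthreads = 2;           /* compare against nthreads = 1 */
    pthread_t t[nthreads];
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, worker, (void *)(long)(N / nthreads));
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    printf("x = %f\n", x);
    return 0;
}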
The cost will vary depending on the implementation but you should keep in mind two things:
the cost will most likely be minimal since it's both a fairly primitive operation and it will be optimised as much as possible due to its use pattern (it is used a lot).
it doesn't matter how expensive it is since you need to use it if you want safe multi-threaded operation. If you need it, then you need it.
On single processor systems, you can generally just disable interrupts long enough to atomically change data. Multi-processor systems can use a test-and-set strategy.
In both those cases, the instructions are relatively efficient.
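A test-and-set lock in its simplest form is just an atomic flag that threads spin on; a minimal sketch (not a replacement for a real mutex, since it never sleeps and has no fairness):

#include <atomic>

// Minimal test-and-set spinlock, just to illustrate the primitive
// that multi-processor systems build on.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ } }
    void unlock() { flag.clear(std::memory_order_release); }
};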
As to whether you should provide a single mutex for a massive data structure, or have many mutexes, one for each section of it, that's a balancing act.
By having a single mutex, you have a higher risk of contention between multiple threads. You can reduce this risk by having a mutex per section but you don't want to get into a situation where a thread has to lock 180 mutexes to do its job :-)
I just measured it on my Windows 10 system.
This is testing Single Threaded code with no contention at all.
Compiler: Visual Studio 2019, x64 release, with loop overhead subtracted from measurements.
Using std::mutex takes about 74 machine cycles, while using a native Win32 CRITICAL_SECTION takes about 53 machine cycles.
So unless 100 machine cycles is a significant amount of time compared to the code itself, the mutexes aren't going to be the source of a performance problem.
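If you want to reproduce that kind of single-threaded, uncontended measurement, a rough sketch (numbers will differ by machine and compiler, and this version doesn't subtract the loop overhead):

#include <chrono>
#include <cstdio>
#include <mutex>

int main()
{
    std::mutex m;
    const long iters = 10000000;

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; i++) {
        m.lock();     // never contended: single-threaded
        m.unlock();
    }
    auto t1 = std::chrono::steady_clock::now();

    double ns_per_pair =
        std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
    std::printf("uncontended lock+unlock: %.1f ns\n", ns_per_pair);
    // For a fairer figure, also time an empty loop and subtract it,
    // as the measurement above does.
    return 0;
}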
Right now I am loading a file and then using gettimeofday and tracking the CPU time with tv_usec.
My results vary: I get values in the 250s to 280s, but sometimes in the 300s or 500s. I tried adding usleep and sleep(0) and sleep(1) with no success. The time still varies vastly. I thought sleep(1) (seconds in Linux, not the Windows Sleep in ms) would have solved it. How can I keep track of time in a more consistent way for testing? Maybe I should wait until I have much larger test data and more complex code before starting measurements?
The currently recommended interface for high-rez time on Linux (and POSIX in general) is clock_gettime. See the man page.
clock_gettime(CLOCK_REALTIME, struct timespec *tp) // for wall-clock time
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, struct timespec *tp) // for CPU time
But read the man page. Note that you need to link with -lrt, because POSIX says so, I guess. Maybe to avoid symbol conflicts in -lc, for old programs that defined their own clock_gettime? But dynamic libs use weak symbols...
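For example, a minimal wall-clock measurement with CLOCK_MONOTONIC (swap in CLOCK_PROCESS_CPUTIME_ID if you want CPU time) might look like this:

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    /* ... the code being timed, e.g. loading the file ... */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_us = (end.tv_sec - start.tv_sec) * 1e6
                      + (end.tv_nsec - start.tv_nsec) / 1e3;
    printf("elapsed: %.3f us\n", elapsed_us);
    return 0;
}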
The best sleep function is nanosleep. It doesn't mess around with signals or any crap like usleep. It is defined to just sleep, and not have any other side effects. And it tells you if you woke up early (e.g. from signals), so you don't necessarily have to call another time function.
Anyway, you're going to have a hard time testing one rep of something that short that involves a system call. There's a huge amount of opportunity for variation. E.g., the scheduler may decide that some other work needs doing (unlikely if your process just started; you won't have used up your timeslice yet). CPU cache misses (L2 and TLB) are also easily possible.
If you have a multi-core machine and a single-threaded benchmark for the code you're optimizing, you can give it realtime priority pinned to one of your cores. Make sure you choose the core that isn't handling interrupts, or your keyboard (and everything else) will be locked out until it's done. Use taskset (for pinning to one CPU) and chrt (for setting realtime prio).
See this mail I sent to gmp-devel with this trick:
http://gmplib.org/list-archives/gmp-devel/2008-March/000789.html
Oh yeah, for the most precise timing, you can use rdtsc yourself (on x86/amd64). If you don't have any other syscalls in what you're benching, it's not a bad idea. Grab a benchmarking framework to put your function into. GMP has a pretty decent one. It's maybe not set up well for benchmarking functions that aren't in GMP and called mpn_whatever, though. I don't remember, and it's worth a look.
Are you trying to measure how long it takes to load a file? Usually if you're performance testing some bit of code that is already pretty fast (sub-second), then you will want to repeat the same code a number of times (say a thousand or a million), time the whole lot, then divide the total time by the number of iterations.
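A rough sketch of that repeat-and-divide approach (the repetition count and the stub function are placeholders for your own code):

#include <stdio.h>
#include <time.h>

/* Stand-in for the code being measured (e.g. your file-loading routine). */
static void operation_under_test(void)
{
}

int main(void)
{
    const int reps = 100000;    /* illustrative repetition count */
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++)
        operation_under_test();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double total_ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("average per call: %.1f ns\n", total_ns / reps);
    return 0;
}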
Having said that, I'm not quite sure what you're using sleep() for. Can you post an example of what you intend to do?
I would recommend putting that code in a for loop. Run it over 1000 or 10000 iterations. There are problems with this if you're only doing a few instructions, but it should help.
Larger data sets also help of course.
sleep is going to deschedule your thread from the CPU. It is not an accurate or precise way to measure time.