Creating a friendly timed busy loop for a hyperthread

Creating a friendly timed busy loop for a hyperthread - multithreading

Imagine I want to have one main thread and a helper thread run as the two hyperthreads on the same physical core (probably by forcing their affinity to approximately ensure this).
The main thread will be doing important high IPC, CPU-bound work. The helper thread should do nothing other than periodically updating a shared timestamp value that the the main thread will periodically read. The update frequency is configurable, but could be as fast as 100 MHz or more. Such fast updates more or less rule out a sleep-based approach, since blocking sleeps are too slow to sleep/wake on a 10 nanosecond (100 MHz) period.
So I want a busy wait. However, the busy wait should be as friendly as possible to the main thread: use as few execution resources as possible, and so add as little overhead as possible to the main thread.
I guess the idea would be a long-latency instruction that doesn't use many resources, like pause and that also has a fixed-and-known latency. That would let us calibrate the "sleep" period so no clock read is even needed (if want to update with period P we just issue P/L of these instructions for a calibrated busy-sleep. Well pause doesn't meet that latter criterion, as its latency varies a lot1.
A second option would be to use a long-latency instruction even if the latency is unknown, and after every instruction do a rdtsc or some other clock reading method (clock_gettime, etc) to see how long we actually slept. Seems like it might slow down the main thread a lot though.
Any better options?
1 Also pause has some specific semantics around preventing speculative memory accesses which may or may not be beneficial to this sibling thread scenario, since I'm not in a spin-wait loop really.

Some random musing on the subject.
So you want to have a time stamp on a 100 MHz sample, that means that on a 4GHz cpu you have 40 cycles between each call.
The timer thread busily reads the real time clock (RTDSC???) but can't use the save method with cpuid as that takes 100 cycles. The old real time clock has a latency of around 25(and a throughput of 1/25), there might be a slightly newer, slightly more accurate with slightly more latency timer (32 cycles).
start:
read time (25 cycles)
tmp = time - last (1 cycle)
if tmp < sample length goto start
last += cycles between samples
sample = time
goto start
In a perfect world the branch predictor will guess right every time, in reality it will mispredict randomly adding 5-14 cycles to the loops 26 cycles due to variance in the read time cycles.
When the sample is written the other thread will have its instructions cancelled from the first speculative loads from this cache line (remember to align to 64 byte for the sample position so no other data is affected). And the load of the sample time stamp starts over after a delay of ~5-14 cycles depending on where the instructions come from, the loop buffer, micro-ops cache or I-cache.
So a mimimum of 5->14 cycles / 40 cycles performance will be lost, in addition to half the cpu being used by the other thread.
On the other hand reading the real time clock in the main thread would cost ...
~1/4 cycle, the latency will most likely be covered by other instructions. But then you can't vary the frequency. The long latency of 25 cycles could be a problem unless some other long latency instructions precede it.
Using a CAS instruction (lock exch???) might partly solve the problem as the loads then shouldn't cause a reissue of the instruction, but instead results in a delay on all following reads and writes.

Related

Does having more threads lead to smaller time slices?

I'm trying to understand why having too many threads can reduce CPU usage due to the increased overhead of context switching. An explanation that sounded plausible to me is that increasing # of threads also increases the frequency of context switches, meaning we end up spending more time context switching and less time doing useful work. Is this correct? Do individual time slices get compressed (with more context switches in between) as we have more threads to schedule?

Generally no. The primary mechanism for lower overhead is that if the scheduler picks the same thread to run on a core for two timeslices in a row, there is no context-switch overhead of stale caches and an FP save/restore.
A "tickless" kernel might set a timer farther in the future if there aren't any other tasks to schedule, instead of the traditional design of having a timer interrupt every 1 or 10 milliseconds where it always calls a scheduler function. (And if there aren't any waiting tasks, it can trivially decide to keep running this one.)

Why is multi threading not faster on single core?

This question is not a duplicate of any question related to why multithreading is not faster on single-core, read the rest to figure out what I actually want to know
As far as I know, multithreading is only faster on a CPU with multiple cores, since each thread can run in parallel. However, as my understanding of how preemption and multithreading on single-core works, it should also be faster. The image below can describe what I mean better. Consider that our app is a simple loop that takes exactly 4 seconds to execute. In this example, the time slice is constant, but, I don't think it makes any difference because, in the end, all threads with the same priority will get equal time by the scheduler. The first timeline is single-threaded, but the second one has 4 threads. The cycle also means when the preemption ends and the scheduler goes back to the queue of threads from start. I/O has also been removed since that just adds complexity and even if it changes the results, let's assume I'm talking about some code that does not require any sort of I/O.
The red threads are threads related to my process, and others (black) are the ones for other processes and apps
There are a couple of questions here:
Why isn't it faster? What's wrong with my timeline?
What's that cycle point called?
Since the time slice is not fixed, does that means the Cycle time is fixed, or the time slice gets calculated and the cycle will be as much time required to spend the calculated time slice on each thread?
Is the slice time based on time or instruction? I mean, is it like 0.1 sec for each thread or like 10 instructions for each thread?
The CPU utilization is based on CPU time, so why isn't it always on 100% because when a thread's time reaches, it moves to the next thread, and if a thread stops on I/O, it does not wait but executes the next one, so the CPU always tries to find a thread to execute and minimalize the time spent IDLE. Is the time for I/O so significant that more than 50% of CPU time is spent doing nothing because all threads are waiting for something, mostly I/O and the CPU time is elapsed waiting for a thread to become in a ready state?
Note: This timeline is simplified, the time spent on I/O, thread creation, etc. is not calculated and it's assumed that other threads do not finish before the end of the timeline and have the same priority/nice value as our process

How to start two CPU cores to run instructions at the same time?

For example, in X86, 2 CPU cores are running different software threads.
At a moment, these 2 threads need to run on their CPU cores at the same time.
Is there a way to sync-up these 2 CPU cores/threads, or something like this to make them start to run at (almost) the same time (at instruction level)?

Use a shared variable to communicate a rdtsc based deadline between the two threads. E.g., set a deadline of say the current rdtsc value plus 10,000.
Then have both threads spin on rdtsc waiting until the gap between the current rdtsc value and the threshold is less than a threshold value T (T = 100 should be fine). Finally, use the final gap value (that is, the deadline rdtsc value minus last read rdtsc value) to jump into a sequence of dependent add instructions such that the number of add instructions is equal to the gap.
This final step compensates for the fact that each chip will generally not be "in phase" with respect to their rdtsc spin loop. E.g., assuming a 30-cycle back-to-back throughput for rdtsc readings, one chip may get readings of 890, 920, 950 etc, while the other may read 880, 910, 940 so there will be a 10 or 20 cycle error if rdtsc alone is used. Using the add slide compensation, if the deadline was 1,000, and with a threshold of 100, the first thread would trigger at rdtsc == 920 and execute 80 additions, while the second would trigger at rdtsc == 910 and execute 90 additions. In principle both cores are then approximately synced up.
Some notes:
The above assumes CPU frequency equal to the nominal rdtsc frequency - if that's not the case you'll have to apply a compensation factor based on the nominal to true frequency ration when calculating where to jump into the add slide.
Don't expect your CPUs to say synced for long: anything like an interrupt, a variable latency operation like a cache miss, or a lot of other things can make them get out of sync.
You want all your payload code, and the addition slide to be hot in the icache of each core, or else they are very likely to get out of sync immediately. You can warm up the icache by doing one or more dummy runs through this code prior to the sync.
You want T to be large enough that the gap is always positive, so somewhat larger than the back-to-back rdtsc latency, but no so large as to increase the chance of events like interrupts during the add slide.
You can check the effectiveness of the "sync" by issuing a rdtsc or rdtscp at various points in the "payload" code following the sync up and seeing how close the recorded values are across threads.
A totally different option would be to use Intel TSX: transactional extensions. Organize for the two threads that want to coordinate to both read a shared line inside a transactional region and then spin, and have a third thread to write to the shared line. This will cause an abort on both of the waiting threads. Depending on the inter-core topology, the two waiting threads may receive the invalidation and hence the subsequent TSX abort at nearly the same time. Call the code you want to run "in sync" from the abort handler.

Depending on your definition of "(almost) the same time", this is a very hard problem microarchitecturally.
Even the definition of "Run" isn't specific enough if you care about timing down to the cycle. Do you mean issue from the front-end into the out-of-order back-end? Execute? (dispatch to an execution unit? or complete execution successfully without needing a replay?) Or retire?
I'd tend to go with Execute1 because that's when an instruction like rdtsc samples the timestamp counter. This it's the one you can actually record the timing of and then compare later.
footnote 1: on the correct path, not in the shadow of a mis-speculation, unless you're also ok with executions that don't reach retirement.
But if the two cores have different ROB / RS states when the instruction you care about executes, they won't continue in lock-step. (There are extremely few in-order x86-64 CPUs, like some pre-Silvermont Atoms, and early Xeon Phi: Knight's Corner. The x86-64 CPUs of today are all out-of-order, and outside of low-power Silvermont-family are aggressively so with large ROB + scheduler.)
x86 asm tricks:
I haven't used it, but x86 asm monitor / mwait to have both CPUs monitor and wait for a write to a given memory location could work. I don't know how synchronized the wakeup is. I'd guess that the less deep the sleep, the less variable the latency.
Early wake-up from an interrupt coming before a write is always possible. Unless you disable interrupts, you aren't going to be able to make this happen 100% of the time; hopefully you just need to make it happen with some reasonable chance of success, and be able to tell after the fact whether you achieved it.
(On very recent low-power Intel CPUs (Tremont), a user-space-usable version of these are available: umonitor / umwait. But in kernel you can probably just use monitor/mwait)
If umonitor/umwait are available, that means you have the WAITPKG CPU feature which also includes tpause: like pause but wait until a given TSC timestamp.
On modern x86 CPUs, the TSC is synchronized between all cores by hardware, so using the same wake-up time for multiple cores makes this trivial.
Otherwise you could spin-wait on a rdtsc deadline and probably get within ~25 cycles at worst on Skylake.
rdtsc has one per 25 cycle throughput on Skylake (https://agner.org/optimize/) so you expect each thread to be on average 12.5 cycles late leaving the spin-wait loop, +-12.5. I'm assuming the branch-mispredict cost for both threads is the same. These are core clock cycles, not the reference cycles that rdtsc counts. RDTSC typically ticks close to the max non-turbo clock. See How to get the CPU cycle count in x86_64 from C++? for more about RDTSC from C.
See How much delay is generated by this assembly code in linux for an asm function that spins on rdtsc waiting for a deadline. You could write this in C easily enough.
Staying in sync after initial start:
On a many-core Xeon where each core can change frequency independently, you'll need to fix the CPU frequency to something, probably max non-turbo would be a good choice. Otherwise with cores at different clock speeds, they'll obviously de-sync right away.
On a desktop you might want to do this anyway, in case pausing the clock to change CPU frequency throws things off.
Any difference in branch mispredicts, cache misses, or even different initial states of ROB/RS could lead to major desync.
More importantly, interrupts are huge and take a very long time compared to running 1 more instruction in an already-running task. And it can even lead to the scheduler doing a context switch to another thread. Or a CPU migration for the task, obviously costing a lot of cycles.

Why is the throughput of the MCS lock poor when the number of threads is greater than the number of logical cpus

Why is the throughput of the mcs lock poor when the number of threads is greater than the number of logical cpus.
Could it be because of increased contention for a places on cpu which leads to a lot of threads being pre-empted?

I am not 100% on this, but the Microsoft library gives this definition of the Sleep() function:
After the sleep interval has passed, the thread is ready to run. If you specify 0 >milliseconds, the thread will relinquish the remainder of its time slice but remain >ready. Note that a ready thread is not guaranteed to run immediately. Consequently, the >thread may not run until some time after the sleep interval elapses.
In my experience, if I use an MCS lock to, lets say, update a data structure and the number of threads I run it on is 16 the drop off (excluding the massive drop off from 1 - 2 threads) from 8 to 16 threads (assuming you are just doubling the number of threads) is quite large. Throughput drops to about a third after one thread and then slowly decreased to as the number of threads being used approaches the number of CPUs. Obviously if you are using a lock the more threads that are trying to acquire the lock you will have more cache cache coherency work for the CPU's to do.
If you use any atomic instructions (again assuming you are) the more threads you add the slower this will become.
"I don't think the problem is that atomic operations will take longer themselves; the real problem might be that an atomic operation might block bus operations on other processors (even if they perform non-atomic operations)."
This was taken from another member of stackoverflow about similar issue. Couple that with the fact that a thread may or may not sleep, even with the use of Sleep(), and may or may not wake immediately this could cause a serious loss in throughput. You also have the increased bus traffic to deal with...

Algorithm to optimize # threads used in a calculation

I'm performing an operation, lets call it CalculateSomeData. CalculateSomeData operates in successive "generations", numbered 1..x. The number of generations in the entire run is fixed by the input parameters to CalculateSomeData and is known a priori. A single generation takes anywhere from 30 minutes to 2 hours to complete. Some of that variability is due to the input parameters and that cannot be controlled. However, a portion of that variability is due to things like hardware capacities, CPU load from other processes, network bandwidth load, etc. One parameter that can be controlled per-generation is the number of threads that CalculateSomeData uses. Right now that's fixed and likely non-optimal. I'd like to track the time each generation takes and then have some algorithm by which I tweak the number of threads so that each successive generation improves upon the prior generation's calculation time (minimizing time). What approach should I use? How applicable are genetic algorithms? Intuition tells me that the range is going to be fairly tight - maybe 1 to 16 threads on a dual quad-core processor machine.
any pointers, pseudocode, etc. are much appreciated.

How about an evolutionary algorithm.
Start with a guess. 1 thread per CPU core seems good, but depends on the task at hand.
Measure the average time for each task in the generation. Compare it to the time taken by the previous generation. (Assume effectively infinite time and 0 threads for generation 0).
If the most recent generation tasks averaged a better time than the one before, continue to change the number of threads in the same direction as you did last step (so if the last generation had more threads than the previous thread, then add a thread for the new generation, but if it had fewer, then use one fewer (obviously with a lower limit of 1 thread).
If the most recent generation tasks took longer, on average, than the previous generation, then change the number of threads in the opposite direction (so if increasing the number of threads resulted in worse time, use one fewer thread next time).
As long as the optimal number of threads isn't too close to 1, then you'll probably end up oscillating between 3 values that are all reasonably close to optimal. You may want to explicitly detect this case and lock yourself into the central value, if you have a large number of generations to deal with.

If the calculations are completely CPU bound the number of threads should be equal to the number of cores on the machine. That way you minimize the number of context switches.
If your calculations involve I/O, network, synchronization or something else that blocks execution you must find the limiting resource and measure the utilization. You need to monitor the utilization and slowly add more threads until the utilization gets close to 100%. You should have as few threads as possible to saturate your limiting resource.

You should divide up your generations into lots of small tasks and put them in a queue. Spawn one thread per core and have each thread grab a task to do, run it to completion, and repeat.
You want lots more tasks than cores to make sure that you don't end up with just one task running at the end of the generation and all other threads idle. This is what is likely to happen if you set #tasks = #threads = #cores as Albin suggests (unless you can ensure that all tasks take precisely the same amount of time).
You also probably don't want more threads than cores. Context switching isn't terribly expensive, but the larger cache footprint that comes with having more than #cores tasks simultaneously active could hurt you (unless your tasks use very little memory).

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string