How to start two CPU cores to run instructions at the same time? - linux

For example, in X86, 2 CPU cores are running different software threads.
At a moment, these 2 threads need to run on their CPU cores at the same time.
Is there a way to sync-up these 2 CPU cores/threads, or something like this to make them start to run at (almost) the same time (at instruction level)?

Use a shared variable to communicate a rdtsc based deadline between the two threads. E.g., set a deadline of say the current rdtsc value plus 10,000.
Then have both threads spin on rdtsc, waiting until the gap between the deadline and the current rdtsc value is less than a threshold T (T = 100 should be fine). Finally, use the final gap value (that is, the deadline rdtsc value minus the last read rdtsc value) to jump into a sequence of dependent add instructions such that the number of add instructions is equal to the gap.
This final step compensates for the fact that each chip will generally not be "in phase" with respect to their rdtsc spin loop. E.g., assuming a 30-cycle back-to-back throughput for rdtsc readings, one chip may get readings of 890, 920, 950 etc, while the other may read 880, 910, 940 so there will be a 10 or 20 cycle error if rdtsc alone is used. Using the add slide compensation, if the deadline was 1,000, and with a threshold of 100, the first thread would trigger at rdtsc == 920 and execute 80 additions, while the second would trigger at rdtsc == 910 and execute 90 additions. In principle both cores are then approximately synced up.
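A minimal sketch of this scheme in C with GNU inline asm (the helper name sync_to_deadline and the 10,000-cycle lead time are made up for illustration), assuming the core clock ticks at the TSC frequency:

    #include <x86intrin.h>   /* __rdtsc */
    #include <stdint.h>

    /* Spin until `deadline` (a TSC value shared between the threads), then burn
     * the residual gap with a chain of dependent 1-cycle adds, approximating
     * the unrolled "add slide" described above. */
    static inline void sync_to_deadline(uint64_t deadline, int64_t threshold)
    {
        uint64_t now;
        do {
            now = __rdtsc();                          /* coarse spin, readings ~25-30 cycles apart */
        } while ((int64_t)(deadline - now) > threshold);

        uint64_t x = 0;
        for (int64_t gap = (int64_t)(deadline - now); gap > 0; gap--)
            asm volatile("add $1, %0" : "+r"(x));     /* ~1 cycle per dependent add */
    }

    /* Usage: one thread publishes deadline = __rdtsc() + 10000 in a shared
     * variable; every participating thread then calls sync_to_deadline(deadline, 100). */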
Some notes:
The above assumes the CPU frequency is equal to the nominal rdtsc frequency - if that's not the case, you'll have to apply a compensation factor based on the nominal-to-true frequency ratio when calculating where to jump into the add slide.
Don't expect your CPUs to stay synced for long: anything like an interrupt, a variable-latency operation like a cache miss, or plenty of other things can make them get out of sync.
You want all your payload code, and the addition slide to be hot in the icache of each core, or else they are very likely to get out of sync immediately. You can warm up the icache by doing one or more dummy runs through this code prior to the sync.
You want T to be large enough that the gap is always positive, so somewhat larger than the back-to-back rdtsc latency, but not so large as to increase the chance of events like interrupts occurring during the add slide.
You can check the effectiveness of the "sync" by issuing a rdtsc or rdtscp at various points in the "payload" code following the sync up and seeing how close the recorded values are across threads.
A totally different option would be to use Intel TSX: transactional extensions. Arrange for the two threads that want to coordinate to both read a shared cache line inside a transactional region and then spin, and have a third thread write to that shared line. This will cause an abort on both of the waiting threads. Depending on the inter-core topology, the two waiting threads may receive the invalidation, and hence the subsequent TSX abort, at nearly the same time. Call the code you want to run "in sync" from the abort handler.
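A rough sketch of that TSX idea using the RTM intrinsics (compile with -mrtm; note that TSX is fused off or disabled by microcode on many recent CPUs, and a real version would check CPUID and handle the transaction failing to start at all):

    #include <immintrin.h>   /* _xbegin, _xend, _XBEGIN_STARTED */

    static volatile int start_flag __attribute__((aligned(64)));  /* written by the third thread */

    static void wait_for_start(void (*payload)(void))
    {
        if (_xbegin() == _XBEGIN_STARTED) {
            while (start_flag == 0)
                ;                 /* spin inside the transaction; the starter's write
                                     to this line (or any interrupt) aborts us */
            _xend();              /* only reached if the flag was already set */
        }
        payload();                /* "abort handler": run the code you want in sync */
    }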

Depending on your definition of "(almost) the same time", this is a very hard problem microarchitecturally.
Even the definition of "Run" isn't specific enough if you care about timing down to the cycle. Do you mean issue from the front-end into the out-of-order back-end? Execute? (dispatch to an execution unit? or complete execution successfully without needing a replay?) Or retire?
I'd tend to go with Execute1, because that's when an instruction like rdtsc samples the timestamp counter. Thus it's the one whose timing you can actually record and then compare later.
footnote 1: on the correct path, not in the shadow of a mis-speculation, unless you're also ok with executions that don't reach retirement.
But if the two cores have different ROB / RS states when the instruction you care about executes, they won't continue in lock-step. (There are extremely few in-order x86-64 CPUs, like some pre-Silvermont Atoms, and early Xeon Phi: Knight's Corner. The x86-64 CPUs of today are all out-of-order, and outside of low-power Silvermont-family are aggressively so with large ROB + scheduler.)
x86 asm tricks:
I haven't used it, but x86 asm monitor / mwait to have both CPUs monitor and wait for a write to a given memory location could work. I don't know how synchronized the wakeup is. I'd guess that the less deep the sleep, the less variable the latency.
Early wake-up from an interrupt coming before a write is always possible. Unless you disable interrupts, you aren't going to be able to make this happen 100% of the time; hopefully you just need to make it happen with some reasonable chance of success, and be able to tell after the fact whether you achieved it.
(On very recent low-power Intel CPUs (Tremont), user-space-usable versions of these are available: umonitor / umwait. But in the kernel you can probably just use monitor/mwait.)
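For reference, a user-space sketch with the WAITPKG intrinsics (compile with -mwaitpkg; the flag name go and the timeout value are arbitrary):

    #include <x86intrin.h>   /* _umonitor, _umwait, __rdtsc */

    static volatile int go __attribute__((aligned(64)));

    static void wait_for_store(void)
    {
        while (!go) {
            _umonitor((void *)&go);               /* arm address monitoring on the line */
            if (!go)                              /* re-check to close the wake-up race */
                _umwait(1, __rdtsc() + 100000);   /* C0.1 (lighter sleep), TSC-based timeout */
        }
    }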
If umonitor/umwait are available, that means you have the WAITPKG CPU feature which also includes tpause: like pause but wait until a given TSC timestamp.
On modern x86 CPUs, the TSC is synchronized between all cores by hardware, so using the same wake-up time for multiple cores makes this trivial.
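Sketch of a tpause-based rendezvous, assuming WAITPKG (compile with -mwaitpkg) and a deadline published in a shared variable:

    #include <x86intrin.h>   /* _tpause, __rdtsc */

    static inline void wait_until_tsc(unsigned long long deadline)
    {
        while (__rdtsc() < deadline)
            _tpause(1, deadline);   /* C0.1 state; may wake early (interrupts,
                                       implementation limits), hence the loop */
    }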
Otherwise you could spin-wait on a rdtsc deadline and probably get within ~25 cycles at worst on Skylake.
rdtsc has a throughput of one per ~25 cycles on Skylake (https://agner.org/optimize/), so you expect each thread to be on average 12.5 cycles late leaving the spin-wait loop, +-12.5. I'm assuming the branch-mispredict cost is the same for both threads. These are core clock cycles, not the reference cycles that rdtsc counts; RDTSC typically ticks close to the max non-turbo clock. See How to get the CPU cycle count in x86_64 from C++? for more about RDTSC from C.
See How much delay is generated by this assembly code in linux for an asm function that spins on rdtsc waiting for a deadline. You could write this in C easily enough.
Staying in sync after initial start:
On a many-core Xeon where each core can change frequency independently, you'll need to fix the CPU frequency to something, probably max non-turbo would be a good choice. Otherwise with cores at different clock speeds, they'll obviously de-sync right away.
On a desktop you might want to do this anyway, in case pausing the clock to change CPU frequency throws things off.
Any difference in branch mispredicts, cache misses, or even different initial states of ROB/RS could lead to major desync.
More importantly, interrupts are huge and take a very long time compared to running 1 more instruction in an already-running task. And it can even lead to the scheduler doing a context switch to another thread. Or a CPU migration for the task, obviously costing a lot of cycles.

Related

Creating a friendly timed busy loop for a hyperthread

Imagine I want to have one main thread and a helper thread run as the two hyperthreads on the same physical core (probably by forcing their affinity to approximately ensure this).
The main thread will be doing important high-IPC, CPU-bound work. The helper thread should do nothing other than periodically updating a shared timestamp value that the main thread will periodically read. The update frequency is configurable, but could be as fast as 100 MHz or more. Such fast updates more or less rule out a sleep-based approach, since blocking sleeps are too slow to sleep/wake on a 10 nanosecond (100 MHz) period.
So I want a busy wait. However, the busy wait should be as friendly as possible to the main thread: use as few execution resources as possible, and so add as little overhead as possible to the main thread.
I guess the idea would be a long-latency instruction that doesn't use many resources, like pause, and that also has a fixed and known latency. That would let us calibrate the "sleep" period so no clock read is even needed (if we want to update with period P, we just issue P/L of these instructions for a calibrated busy-sleep). Well, pause doesn't meet that latter criterion, as its latency varies a lot1.
A second option would be to use a long-latency instruction even if the latency is unknown, and after every instruction do a rdtsc or some other clock reading method (clock_gettime, etc) to see how long we actually slept. Seems like it might slow down the main thread a lot though.
Any better options?
1 Also pause has some specific semantics around preventing speculative memory accesses which may or may not be beneficial to this sibling thread scenario, since I'm not in a spin-wait loop really.
Some random musing on the subject.
So you want to take a timestamp at a 100 MHz sample rate; that means that on a 4 GHz CPU you have 40 cycles between samples.
The timer thread busily reads the timestamp counter (rdtsc?), but can't use the serialized variant with cpuid, as that takes ~100 cycles. Plain rdtsc has a latency of around 25 cycles (and a throughput of one per 25); there might be a slightly newer, slightly more accurate timer with slightly more latency (rdtscp, ~32 cycles).
    // Shared state (set up elsewhere): `last` and `period` are TSC values,
    // `sample` is the timestamp the main thread reads.
    for (;;) {
        uint64_t now = __rdtsc();        // read time (~25 cycles)
        if (now - last < period)         // time - last (1 cycle)
            continue;                    // not due yet: spin again
        last += period;                  // cycles between samples
        sample = now;                    // publish the sample
    }
In a perfect world the branch predictor will guess right every time; in reality it will mispredict randomly, adding 5-14 cycles to the loop's ~26 cycles, due to the variance in the rdtsc read time.
When the sample is written, the other thread will have its speculative loads from this cache line cancelled (remember to align the sample to a 64-byte boundary so no other data is affected). The load of the sample timestamp then starts over after a delay of ~5-14 cycles, depending on where the instructions come from: the loop buffer, the micro-op cache, or the I-cache.
So a minimum of 5-14 cycles out of every 40 will be lost, in addition to half the core being used by the other hyperthread.
On the other hand, reading the timestamp counter in the main thread itself would cost ~1/4 cycle on average, and the latency would most likely be covered by other instructions. But then you can't vary the frequency, and the long latency of 25 cycles could be a problem unless some other long-latency instructions precede it.
Using a CAS-style atomic instruction (lock xchg / cmpxchg?) might partly solve the problem, as the loads then shouldn't cause a re-issue of the instruction, but would instead result in a delay on all following reads and writes.

Reserve a processor for only one process (with already the max priority)

I have used this piece of code to try to set the -same- high priority while executing a program:
    cpu_set_t cmask;
    CPU_ZERO(&cmask);                     /* clear the set before adding a CPU */
    CPU_SET(CPU_NUM, &cmask);
    /* note: pthread_setaffinity_np() takes a pthread_t (e.g. pthread_self()),
       and returns an errno value on failure, not -1 */
    if (pthread_setaffinity_np(pid, sizeof(cmask), &cmask) != 0) {
        LOG_ERROR("Could not set cpu affinity to core %d", CPU_NUM); goto exit_err;
    }
    errno = 0;
    setpriority(PRIO_PROCESS, 0, -19);    /* nice -19; needs root or CAP_SYS_NICE */
The purpose of the program is to perform a computation on fixed-size chunks of input (80 bytes each).
But when executing the program, the time elapsed for this computation varies from 30% to 150%.
When plotting the computation time values, I was expecting a -quite- smooth graph where the deviation would be something like 10%-15%, but instead there is more than 40% !!!
So I would like to ask whether the CPU is interleaving the execution of my program with another one, and if so, whether I could force the CPU to run ONLY this specific program.
Thanks in advance !
P.S. I haven't found a thread that could answer to my question yet...
The most relevant is :) :
Linux reserve a processor for a group of processes (dynamically)
To try and reduce jitter some of the things you can do are:
Ensure you've turned off CPU frequency scaling.
Set scheduling policy to SCHED_FIFO for that program.
Try and pin your process to a single processor if you have more than one (pinning and SCHED_FIFO are sketched just after this list).
Try and run as few other processes at the same time while you're measuring your program.
Don't trigger sources of time related non-determinism (e.g. disk I/O).
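A minimal sketch of the pinning and SCHED_FIFO items above (CPU number 2 and priority 50 are arbitrary example values):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);                              /* pin to CPU 2 */
        if (sched_setaffinity(0, sizeof(set), &set))
            perror("sched_setaffinity");

        struct sched_param sp = { .sched_priority = 50 };
        if (sched_setscheduler(0, SCHED_FIFO, &sp))    /* needs root or CAP_SYS_NICE */
            perror("sched_setscheduler");

        mlockall(MCL_CURRENT | MCL_FUTURE);            /* avoid page-fault jitter */
        /* ... run the measured computation here ... */
        return 0;
    }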
It is probably useful to skim through How to build a Linux RT application, because accurate measurement is in the same domain - it's possible to be more extreme though:
Ensure your program doesn't use dynamic memory allocations.
Use a realtime Linux kernel.
Prevent Linux from scheduling non-specific userspace programs on a given CPU.
Even disable timer ticks on a given CPU (CONFIG_TASK_ISOLATION).
Modern desktop/server processors are so complicated that trying to precisely measure a single program's execution time with low variance is extremely hard. Things like the various caches and pipeline starting states can perturb execution times in any number of ways so there are always going to be limits.

How NOHZ=ON affects do_timer() in Linux kernel?

In a simple experiment I set NOHZ=OFF and used printk() to print how often the do_timer() function gets called. It gets called every 10 ms on my machine.
However if NOHZ=ON then there is a lot of jitter in the way do_timer() gets called. Most of the times it does get called every 10 ms but there are times when it completely misses the deadlines.
I have researched about both do_timer() and NOHZ. do_timer() is the function responsible for updating jiffies value and is also responsible for the round robin scheduling of the processes.
NOHZ feature switches off the hi-res timers on the system.
What I am unable to understand is how can hi-res timers affect the do_timer()? Even if hi-res hardware is in sleep state the persistent clock is more than capable to execute do_timer() every 10 ms. Secondly if do_timer() is not executing when it should, that means some processes are not getting their timeshare when they should ideally be getting it. A lot of googling does show that for many people many applications start working much better when NOHZ=OFF.
To make long story short, how does NOHZ=ON affect do_timer()?
Why does do_timer() miss its deadlines?
First, let's understand what a tickless kernel is (NOHZ=on or CONFIG_NO_HZ set) and what the motivation was for introducing it into the Linux kernel in 2.6.17.
From http://www.lesswatts.org/projects/tickless/index.php,
Traditionally, the Linux kernel used a periodic timer for each CPU.
This timer did a variety of things, such as process accounting,
scheduler load balancing, and maintaining per-CPU timer events. Older
Linux kernels used a timer with a frequency of 100Hz (100 timer events
per second or one event every 10ms), while newer kernels use 250Hz
(250 events per second or one event every 4ms) or 1000Hz (1000 events
per second or one event every 1ms).
This periodic timer event is often called "the timer tick". The timer
tick is simple in its design, but has a significant drawback: the
timer tick happens periodically, irrespective of the processor state,
whether it's idle or busy. If the processor is idle, it has to wake up
from its power saving sleep state every 1, 4, or 10 milliseconds. This
costs quite a bit of energy, consuming battery life in laptops and
causing unnecessary power consumption in servers.
With "tickless idle", the Linux kernel has eliminated this periodic
timer tick when the CPU is idle. This allows the CPU to remain in
power saving states for a longer period of time, reducing the overall
system power consumption.
So reducing power consumption was one of the main motivations of the tickless kernel. But, as it goes, performance often takes a hit alongside decreased power consumption. For desktop computers, performance is of utmost concern, and hence you see that for most of them NOHZ=OFF works pretty well.
In Ingo Molnar's own words
The tickless kernel feature (CONFIG_NO_HZ) enables 'on-demand' timer
interrupts: if there is no timer to be expired for say 1.5 seconds
when the system goes idle, then the system will stay totally idle for
1.5 seconds. This should bring cooler CPUs and power savings: on our (x86) testboxes we have measured the effective IRQ rate to go from HZ
to 1-2 timer interrupts per second.
Now, let's try to answer your queries:
What I am unable to understand is how can hi-res timers affect the
do_timer ?
If a system supports high-res timers, timer interrupts can occur more frequently than the usual 10 ms on most systems, i.e. these timers try to make the system more responsive by leveraging the hardware's capabilities and firing timer interrupts even faster, say every 100 us. With the NOHZ option these timers are quieted down, and hence do_timer() executes less often.
Even if hi-res hardware is in sleep state the persistent clock is more
than capable to execute do_timer every 10ms
Yes, it is capable. But the intention of NOHZ is exactly the opposite: to prevent frequent timer interrupts!
Secondly if do_timer is not executing when it should that means some
processes are not getting their timeshare when they should ideally be
getting it
As caf noted in the comments, NOHZ does not cause processes to get scheduled less often, because it only kicks in when the CPU is idle - in other words, when no processes are schedulable. Only the process accounting stuff will be done at a delayed time.
Why does do_timer() miss its deadlines?
As elaborated above, this is the intended design of NOHZ.
I suggest you go through the tick-sched.c kernel sources as a starting point. Search for CONFIG_NO_HZ and try to understand the new functionality added for the NOHZ feature.
Here is one test performed to measure the Impact of a Tickless Kernel

How does x86 pause instruction work in spinlock *and* can it be used in other scenarios?

The pause instruction is commonly used in the loop that tests a spinlock, when some other thread owns the spinlock, to mitigate the tight loop. It's said that it is equivalent to some NOP instructions. Could somebody tell me how exactly it works for spinlock optimization? It seems to me that even NOP instructions are a waste of CPU time. Will they decrease CPU usage?
Another question is whether I could use the pause instruction for other similar purposes. For example, I have a busy thread which keeps scanning some place (e.g. a queue) to retrieve new nodes; however, sometimes the queue is empty and the thread is just wasting CPU time. Putting the thread to sleep and waking it up from other threads may be an option, but the thread is critical, so I don't want to make it sleep.
Could the pause instruction work for my purpose of mitigating the CPU usage? Currently it uses 100% of a physical core.
PAUSE notifies the CPU that this is a spinlock wait loop so memory and cache accesses may be optimized. See also pause instruction in x86 for some more details about avoiding the memory-order mis-speculation when leaving the spin-loop.
PAUSE may actually stop the CPU for some time to save power. Older CPUs decode it as REP NOP, so you don't have to check if it's supported. Older CPUs will simply do nothing (NOP) as fast as possible.
See also https://software.intel.com/en-us/articles/benefitting-power-and-performance-sleep-loops
Update: I don't think it's a good idea to use PAUSE in queue checking unless you are going to make your queue spinlock-like (and there is no obvious way to do it).
Spinning for a very long time is still very bad, even with PAUSE.
A processor suffers a severe performance penalty when exiting
the loop because it detects a possible memory order violation. The PAUSE instruction
provides a hint to the processor that the code sequence is a spin-wait loop. The
processor uses this hint to avoid the memory order violation in most situations,
which greatly improves processor performance. For this reason, it is recommended
that a PAUSE instruction be placed in all spin-wait loops.
An additional function of the PAUSE instruction is to reduce the power consumed by Intel processors.
[source: Intel manual]
Pause-based spin-wait loops
As I understood from your questions, the waits in your case are known in advance to be very long. In this case, spin-wait loops are not recommended at all. But if you are using a spin-loop that keeps checking a value from memory (e.g. a byte-sized synchronization variable), use PAUSE. See Section 11.4.2 "Synchronization for Short Periods" of the Intel 64 and IA-32 Architectures Optimization Reference Manual.
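A minimal spin-wait of that short-period kind, using C11 atomics and the pause intrinsic (the flag is assumed to be set by another thread):

    #include <immintrin.h>   /* _mm_pause */
    #include <stdatomic.h>

    static void spin_wait_short(atomic_int *flag)
    {
        while (atomic_load_explicit(flag, memory_order_acquire) == 0)
            _mm_pause();     /* spin-wait hint: saves power and avoids the
                                memory-order mis-speculation penalty on exit */
    }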
You wrote that you have a "thread which keeps scanning some places (e.g. a queue) to retrieve new nodes".
In such a case (i.e. a long wait), Intel recommends using the synchronization API functions of your operating system. For example, you can create an event when a new node appears in the queue, and just wait for this event using WaitForSingleObject(Handle, INFINITE). The queue will signal this event whenever a new node appears.
According to the Intel Optimization Reference Manual, Section 2.3.4 "Pause Latency in Skylake Client Microarchitecture",
The PAUSE instruction is typically used with software threads
executing on two logical processors located in the same processor
core, waiting for a lock to be released. Such short wait loops tend to
last between tens and a few hundreds of cycles, so performance-wise it
is better to wait while occupying the CPU than yielding to the OS.
By "tens and a few hundreds of cycles" of the above quote I understand from 20 to 500 CPU cycles.
500 CPU cycles on a 4500 MHz Intel Core i7-7700K processor (released in January 2017, based on the Kaby Lake-S microarchitecture) is about 0.11 microseconds, i.e. roughly 1/10,000,000th of a second: the CPU can run this 500-cycle loop roughly 9 million times per second.
This 500-cycle limit recommended by Intel is theoretical, and everything depends on the particular use case, i.e. on the logic of the code that needs to be synchronized by spin-wait loops. Some scenarios, like the FastMM4-AVX memory manager for Delphi, work better with a value of 5000, according to its benchmarks. Even so, such benchmarks do not always reflect real-world scenarios, and real program use cases should be measured.
As you see, this PAUSE-based spin-wait loop is for really short periods of time.
On the other hand, each call to an API function like Sleep() experiences the expensive cost of a context switch, which can be 10000+ cycles; it also suffers the cost of ring 3 to ring 0 transitions, which can be 1000+ cycles.
If there are more threads than processor cores available (multiplied by the hyperthreading factor, if present), and a thread gets switched out in the middle of a critical section, waiting for that critical section from another thread may take a really long time, at least 10000+ cycles, so the PAUSE-based spin-wait loop will be futile.
In addition to the relevant chapters of the Intel Optimization Reference Manual, please see the following articles for more information:
https://software.intel.com/en-us/articles/long-duration-spin-wait-loops-on-hyper-threading-technology-enabled-intel-processors
https://software.intel.com/en-us/articles/benefitting-power-and-performance-sleep-loops
When the wait loop is expected to last for thousands of cycles or more, it is
preferable to yield to the operating system by calling one of the OS synchronization API functions, such as WaitForSingleObject or SwitchToThread on Windows OS.
As a conclusion: in your scenario, the PAUSE-based spin-wait loop won't be the best choice, since your waiting time is long while the spin-wait loop is intended for very short loops.
The PAUSE instruction takes about 140 CPU cycles on processors based on the Skylake microarchitecture or later. For example, that is about 35.10 ns on an Intel Core i7-6700K CPU (4 GHz) released in August 2015, or 49.47 ns on an Intel Core i7-1165G7 CPU for mobile devices released in September 2020. On earlier processors (prior to Skylake), such as those based on the Haswell microarchitecture, it takes about 9 cycles, which is 2.81 ns on an Intel Core i5-4430 (3 GHz) released in June 2013. So, for long loops, it's better to relinquish control to other threads using the OS synchronization API functions than to occupy the CPU with a PAUSE loop, regardless of the microarchitecture.
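A rough way to check the PAUSE cost on your own CPU is to time a large batch with rdtsc (a sketch; note it counts TSC reference cycles, not core cycles, so pin the thread and relate the result to a known core frequency):

    #include <x86intrin.h>   /* __rdtsc, _mm_pause */
    #include <stdio.h>

    int main(void)
    {
        enum { N = 1000000 };
        unsigned long long t0 = __rdtsc();
        for (int i = 0; i < N; i++)
            _mm_pause();
        unsigned long long t1 = __rdtsc();
        printf("~%.1f reference cycles per pause\n", (double)(t1 - t0) / N);
        return 0;
    }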
Test, Test-and-Set
Please note that the spin-wait loops have also to be implemented properly. Intel recommends the so-called "test, test-and-set" technique (see Section 11.4.3 "Optimization with Spin-Locks" of the Intel 64 and IA-32 Architectures Optimization Reference Manual) to determine the availability of the synchronization variable. According to this technique, the first "test" is done via the normal (non-locking) memory load to prevent excessive bus locking during the spin-wait loop; if the variable is available upon the non-locking memory load of the first step ("test"), proceed to the second step ("test-and-set") which is done via the bus-locking atomic xchg instruction.
But be aware that this two-step approach of using "test" before "test-and-set" can increase the cost for the uncontended case compared to a single-step "test-and-set". The initial read-only access might only get the cache line in the Shared state, so an atomic operation like test-and-set (xchg) or compare-and-swap (cmpxchg) still needs a "Read For Ownership" (RFO) operation to get exclusive ownership of the cache line. That operation is issued by a processor trying to write into a cache line that is in the Shared state.
Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock?
atomic operation cost
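A minimal sketch of the "test, test-and-set" pattern above with pause, using GCC/Clang __atomic builtins (0 = free, 1 = held):

    #include <immintrin.h>   /* _mm_pause */

    static void ttas_lock(volatile int *lock)
    {
        for (;;) {
            while (__atomic_load_n(lock, __ATOMIC_RELAXED))      /* "test": plain load, no RFO */
                _mm_pause();
            if (!__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE)) /* "test-and-set": locked xchg */
                return;
        }
    }

    static void ttas_unlock(volatile int *lock)
    {
        __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
    }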
The PAUSE instruction also appears to be used in hyper-threading processors to mitigate performance impact on other hyper threads, presumably by relinquishing more CPU time to them.
The following Intel article outlines this, and not surprisingly recommends avoiding busy wait loops on such processors: https://software.intel.com/en-us/articles/long-duration-spin-wait-loops-on-hyper-threading-technology-enabled-intel-processors

Algorithm to optimize # threads used in a calculation

I'm performing an operation, let's call it CalculateSomeData. CalculateSomeData operates in successive "generations", numbered 1..x. The number of generations in the entire run is fixed by the input parameters to CalculateSomeData and is known a priori. A single generation takes anywhere from 30 minutes to 2 hours to complete. Some of that variability is due to the input parameters and cannot be controlled. However, a portion of that variability is due to things like hardware capacities, CPU load from other processes, network bandwidth load, etc. One parameter that can be controlled per-generation is the number of threads that CalculateSomeData uses. Right now that's fixed and likely non-optimal. I'd like to track the time each generation takes and then have some algorithm by which I tweak the number of threads so that each successive generation improves upon the prior generation's calculation time (minimizing time). What approach should I use? How applicable are genetic algorithms? Intuition tells me that the range is going to be fairly tight - maybe 1 to 16 threads on a dual quad-core processor machine.
any pointers, pseudocode, etc. are much appreciated.
How about an evolutionary algorithm.
Start with a guess. 1 thread per CPU core seems good, but depends on the task at hand.
Measure the average time for each task in the generation. Compare it to the time taken by the previous generation. (Assume effectively infinite time and 0 threads for generation 0).
If the most recent generation's tasks averaged a better time than the one before, continue to change the number of threads in the same direction as you did last step (so if the last generation used more threads than the one before it, add a thread for the new generation; but if it used fewer, use one fewer, obviously with a lower limit of 1 thread).
If the most recent generation tasks took longer, on average, than the previous generation, then change the number of threads in the opposite direction (so if increasing the number of threads resulted in worse time, use one fewer thread next time).
As long as the optimal number of threads isn't too close to 1, then you'll probably end up oscillating between 3 values that are all reasonably close to optimal. You may want to explicitly detect this case and lock yourself into the central value, if you have a large number of generations to deal with.
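A minimal sketch of this hill-climbing rule (next_thread_count is a made-up helper; call it with each generation's measured time to get the thread count for the next one):

    #include <math.h>   /* INFINITY */

    static int threads = 8;              /* initial guess, e.g. one per core */
    static int direction = +1;           /* +1 = last change added a thread  */
    static double prev_time = INFINITY;  /* "generation 0": infinite time    */

    int next_thread_count(double gen_time)
    {
        if (gen_time > prev_time)        /* got worse: reverse direction */
            direction = -direction;
        prev_time = gen_time;
        threads += direction;
        if (threads < 1)                 /* lower limit of one thread */
            threads = 1;
        return threads;
    }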
If the calculations are completely CPU bound the number of threads should be equal to the number of cores on the machine. That way you minimize the number of context switches.
If your calculations involve I/O, network, synchronization or something else that blocks execution you must find the limiting resource and measure the utilization. You need to monitor the utilization and slowly add more threads until the utilization gets close to 100%. You should have as few threads as possible to saturate your limiting resource.
You should divide up your generations into lots of small tasks and put them in a queue. Spawn one thread per core and have each thread grab a task to do, run it to completion, and repeat.
You want lots more tasks than cores to make sure that you don't end up with just one task running at the end of the generation and all other threads idle. This is what is likely to happen if you set #tasks = #threads = #cores as Albin suggests (unless you can ensure that all tasks take precisely the same amount of time).
You also probably don't want more threads than cores. Context switching isn't terribly expensive, but the larger cache footprint that comes with having more than #cores tasks simultaneously active could hurt you (unless your tasks use very little memory).
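A minimal sketch of the one-worker-per-core task queue described above (Task, run_task and the counts are placeholders; compile with -pthread):

    #include <pthread.h>
    #include <unistd.h>

    #define NTASKS 1024

    typedef struct { int id; } Task;          /* placeholder task descriptor */

    static Task tasks[NTASKS];
    static int next_task;                     /* index of the next unclaimed task */
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

    static void run_task(Task *t) { (void)t;  /* the real per-task work goes here */ }

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&qlock);
            int i = next_task < NTASKS ? next_task++ : -1;   /* claim one task */
            pthread_mutex_unlock(&qlock);
            if (i < 0)
                break;                        /* queue drained: worker exits */
            run_task(&tasks[i]);
        }
        return NULL;
    }

    void run_generation(void)
    {
        long ncores = sysconf(_SC_NPROCESSORS_ONLN);  /* one worker per core */
        pthread_t tid[64];
        if (ncores > 64) ncores = 64;
        next_task = 0;
        for (long i = 0; i < ncores; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (long i = 0; i < ncores; i++)
            pthread_join(tid[i], NULL);
    }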
