In sched_fair.c there is:
unsigned int sysctl_sched_latency = 5000000ULL; // 5 ms
unsigned int sysctl_sched_min_granularity = 1000000ULL; // 1 ms
I understand that the Linux fair timeslice varies depending on nr_running and the relative weight of the fair task, but from studying the code, the main idea seems to be keeping the timeslice between 1 and 5 ms. Please correct me if I have this wrong; I must be missing something, but I just cannot figure out what.
Also, HZ (the number of system ticks per second, i.e. the number of timer interrupts per second) is normally 100 or 200 on ARM machines (and most non-desktop machines too), which gives a tick period of 5 to 10 ms.
The timeslice is put into action by starting rq->hrtick_timer in set_next_entity() every time a fair task is scheduled to run, and by invoking resched_task() in the timeout callback hrtick(). This timer is simply one of the queued timers processed by the timer IRQ handler on every tick, timer_tick()...run_local_timers(). There seems to be no other hidden secret.
Then how can we get a timeslice shorter than 5 ms? Please help me understand this. Thank you very much!
As stated in Robert Love's Linux Kernel Development, the only way to get a shorter timeslice is to increase the number of running processes (or to have processes with lower priority than others).
Increasing the number of running processes creates a need for a shorter timeslice in order to guarantee the target latency (but the timeslice is bounded below by the minimum granularity). However, there is no guarantee that a process will be preempted exactly within the given timeslice, because time accounting is driven by timer interrupts.
Increasing the value of HZ makes timer interrupts happen more frequently, which makes time accounting more precise, so rescheduling may occur more frequently.
The vruntime variable stores the virtual runtime of a process, which is its actual runtime normalized by the number of runnable processes. On an ideal multitasking system the vruntime of all processes would be identical: all tasks would have received an equal, fair share of the processor.
Typically the timeslice is the target latency divided by the number of running processes. But as the number of running processes approaches infinity, the timeslice approaches 0. Since this would eventually result in unacceptable switching costs, CFS imposes a floor on the timeslice assigned to each process. This floor is called the minimum granularity. So the timeslice is a value between sysctl_sched_min_granularity and sysctl_sched_latency. (See sched_slice().)
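As a rough illustration of that calculation (my own sketch, ignoring per-task weights; the real sched_slice() also scales the slice by the entity's load weight), using the constants quoted at the top:

#include <stdint.h>

#define SCHED_LATENCY_NS         5000000ULL  /* sysctl_sched_latency: 5 ms         */
#define SCHED_MIN_GRANULARITY_NS 1000000ULL  /* sysctl_sched_min_granularity: 1 ms */

/* Simplified, unweighted view of the CFS timeslice: divide the target latency
 * among the runnable tasks, but never go below the minimum granularity. */
static uint64_t approx_timeslice_ns(unsigned int nr_running)
{
    uint64_t slice = SCHED_LATENCY_NS / (nr_running ? nr_running : 1);

    if (slice < SCHED_MIN_GRANULARITY_NS)
        slice = SCHED_MIN_GRANULARITY_NS;
    return slice;
}

With two runnable tasks this gives 2.5 ms; with five it is exactly 1 ms, and beyond that it stays pinned at the 1 ms floor, which is how timeslices shorter than 5 ms arise.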
The vruntime variable is managed by update_curr(). update_curr() is invoked periodically by the system timer and also whenever a process becomes runnable or blocks, becoming unrunnable.
To drive preemption between tasks, hrtick() calls task_tick_fair() on each timer interrupt, which in turn calls entity_tick(). entity_tick() calls update_curr() to update the process's vruntime and then calls check_preempt_tick(). check_preempt_tick() checks whether the current runtime is greater than the ideal runtime (the timeslice); if so, it calls resched_task(), which sets the TIF_NEED_RESCHED flag.
When TIF_NEED_RESCHED is set, schedule() gets called on the nearest possible occasion.
So, with an increasing value of HZ, timer interrupts happen more frequently, giving more precise time accounting and allowing the scheduler to reschedule tasks more promptly.
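A toy model of that tick-driven path (my own simplification, not the kernel source):

#include <stdbool.h>
#include <stdint.h>

/* Every timer tick adds the tick period to the running task's time and flags
 * it for rescheduling once it has used up its ideal runtime (timeslice). */
struct task_model {
    uint64_t runtime_ns;        /* time consumed since it was picked, as in update_curr() */
    uint64_t ideal_runtime_ns;  /* "timeslice" derived from the target latency            */
    bool     need_resched;      /* models TIF_NEED_RESCHED                                */
};

static void tick_model(struct task_model *t, uint64_t tick_period_ns)
{
    t->runtime_ns += tick_period_ns;          /* update_curr()        */
    if (t->runtime_ns > t->ideal_runtime_ns)  /* check_preempt_tick() */
        t->need_resched = true;               /* resched_task()       */
}

With HZ=100 (a 10 ms tick) a 1 ms ideal runtime is only noticed at the next tick, so the task actually runs for roughly 10 ms; with HZ=1000 the overshoot shrinks to about 1 ms, which is the point made above.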
Related
I'm trying to understand why having too many threads can reduce CPU usage due to the increased overhead of context switching. An explanation that sounded plausible to me is that increasing # of threads also increases the frequency of context switches, meaning we end up spending more time context switching and less time doing useful work. Is this correct? Do individual time slices get compressed (with more context switches in between) as we have more threads to schedule?
Generally no. The primary mechanism for lower overhead is that if the scheduler picks the same thread to run on a core for two timeslices in a row, there is no context-switch overhead from stale caches or an FP save/restore.
A "tickless" kernel might set a timer farther in the future if there aren't any other tasks to schedule, instead of the traditional design of having a timer interrupt every 1 or 10 milliseconds where it always calls a scheduler function. (And if there aren't any waiting tasks, it can trivially decide to keep running this one.)
For example, in X86, 2 CPU cores are running different software threads.
At a moment, these 2 threads need to run on their CPU cores at the same time.
Is there a way to sync-up these 2 CPU cores/threads, or something like this to make them start to run at (almost) the same time (at instruction level)?
Use a shared variable to communicate an rdtsc-based deadline between the two threads. E.g., set a deadline of, say, the current rdtsc value plus 10,000.
Then have both threads spin on rdtsc, waiting until the gap between the current rdtsc value and the deadline is less than a threshold value T (T = 100 should be fine). Finally, use the final gap value (that is, the deadline rdtsc value minus the last read rdtsc value) to jump into a sequence of dependent add instructions such that the number of add instructions is equal to the gap.
This final step compensates for the fact that each chip will generally not be "in phase" with respect to their rdtsc spin loop. E.g., assuming a 30-cycle back-to-back throughput for rdtsc readings, one chip may get readings of 890, 920, 950 etc, while the other may read 880, 910, 940 so there will be a 10 or 20 cycle error if rdtsc alone is used. Using the add slide compensation, if the deadline was 1,000, and with a threshold of 100, the first thread would trigger at rdtsc == 920 and execute 80 additions, while the second would trigger at rdtsc == 910 and execute 90 additions. In principle both cores are then approximately synced up.
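A rough C sketch of that spin-plus-add-slide scheme (my own illustration of the idea above, untested; __rdtsc() is the usual compiler intrinsic, and the compensation loop stands in for an unrolled add slide):

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

#define T 100            /* spin until we are within T TSC ticks of the deadline */

/* Both threads call this with the same deadline (roughly "current TSC +
 * 10,000", published through a shared variable). A real implementation would
 * jump into an unrolled chain of dependent add instructions at an offset
 * equal to the gap; the loop below stands in for that slide. */
static inline void sync_to_deadline(uint64_t deadline)
{
    uint64_t now, acc = 0;

    do {                                              /* spin on rdtsc */
        now = __rdtsc();
    } while ((int64_t)(deadline - now) > (int64_t)T);

    for (int64_t gap = (int64_t)(deadline - now); gap > 0; gap--) {
        acc += 1;                                     /* dependent ~1-cycle add */
        __asm__ volatile("" : "+r"(acc));             /* keep the chain from being optimized away */
    }

    /* ...payload that should start at (almost) the same time on both cores... */
}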
Some notes:
The above assumes the CPU frequency is equal to the nominal rdtsc frequency; if that's not the case, you'll have to apply a compensation factor based on the nominal-to-true frequency ratio when calculating where to jump into the add slide.
Don't expect your CPUs to stay synced for long: anything like an interrupt, a variable-latency operation like a cache miss, or a lot of other things can make them get out of sync.
You want all your payload code, and the addition slide to be hot in the icache of each core, or else they are very likely to get out of sync immediately. You can warm up the icache by doing one or more dummy runs through this code prior to the sync.
You want T to be large enough that the gap is always positive, so somewhat larger than the back-to-back rdtsc latency, but not so large as to increase the chance of events like interrupts occurring during the add slide.
You can check the effectiveness of the "sync" by issuing a rdtsc or rdtscp at various points in the "payload" code following the sync up and seeing how close the recorded values are across threads.
A totally different option would be to use Intel TSX: transactional extensions. Arrange for the two threads that want to coordinate to both read a shared line inside a transactional region and then spin, and have a third thread write to the shared line. This will cause an abort on both of the waiting threads. Depending on the inter-core topology, the two waiting threads may receive the invalidation, and hence the subsequent TSX abort, at nearly the same time. Call the code you want to run "in sync" from the abort handler.
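A sketch of that TSX approach using the RTM intrinsics (my own illustration, untested; it needs a CPU with RTM enabled, which many recent parts no longer offer, plus -mrtm when compiling; the names go_line and payload are made up for this sketch):

#include <immintrin.h>   /* _xbegin, _xend, _XBEGIN_STARTED */
#include <stdint.h>

volatile uint64_t go_line;   /* shared line; a third thread writes it to release the waiters */

static void wait_for_go_then(void (*payload)(void))
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        uint64_t v = go_line;        /* pull the shared line into the transaction's read set */
        while (go_line == v)         /* spin inside the transaction...                       */
            ;                        /* ...until the writer's store aborts it                */
        _xend();                     /* not normally reached                                 */
    }
    /* Abort path: the writer's store (or some other abort cause) lands both
     * waiters here at nearly the same time; run the synchronized code. */
    payload();
}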
Depending on your definition of "(almost) the same time", this is a very hard problem microarchitecturally.
Even the definition of "Run" isn't specific enough if you care about timing down to the cycle. Do you mean issue from the front-end into the out-of-order back-end? Execute? (dispatch to an execution unit? or complete execution successfully without needing a replay?) Or retire?
I'd tend to go with Execute1 because that's when an instruction like rdtsc samples the timestamp counter. Thus it's the one you can actually record the timing of and then compare later.
footnote 1: on the correct path, not in the shadow of a mis-speculation, unless you're also ok with executions that don't reach retirement.
But if the two cores have different ROB / RS states when the instruction you care about executes, they won't continue in lock-step. (There are extremely few in-order x86-64 CPUs, like some pre-Silvermont Atoms, and early Xeon Phi: Knight's Corner. The x86-64 CPUs of today are all out-of-order, and outside of low-power Silvermont-family are aggressively so with large ROB + scheduler.)
x86 asm tricks:
I haven't used it, but x86 asm monitor / mwait to have both CPUs monitor and wait for a write to a given memory location could work. I don't know how synchronized the wakeup is. I'd guess that the less deep the sleep, the less variable the latency.
Early wake-up from an interrupt coming before a write is always possible. Unless you disable interrupts, you aren't going to be able to make this happen 100% of the time; hopefully you just need to make it happen with some reasonable chance of success, and be able to tell after the fact whether you achieved it.
(On very recent low-power Intel CPUs (Tremont), user-space-usable versions of these are available: umonitor / umwait. But in a kernel you can probably just use monitor/mwait.)
If umonitor/umwait are available, that means you have the WAITPKG CPU feature which also includes tpause: like pause but wait until a given TSC timestamp.
On modern x86 CPUs, the TSC is synchronized between all cores by hardware, so using the same wake-up time for multiple cores makes this trivial.
Otherwise you could spin-wait on a rdtsc deadline and probably get within ~25 cycles at worst on Skylake.
rdtsc has one-per-25-cycle throughput on Skylake (https://agner.org/optimize/), so you expect each thread to be on average 12.5 cycles late leaving the spin-wait loop, +-12.5. I'm assuming the branch-mispredict cost is the same for both threads. These are core clock cycles, not the reference cycles that rdtsc counts. RDTSC typically ticks close to the max non-turbo clock. See How to get the CPU cycle count in x86_64 from C++? for more about RDTSC from C.
See How much delay is generated by this assembly code in linux for an asm function that spins on rdtsc waiting for a deadline. You could write this in C easily enough.
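For example, a minimal C version of such a spin-wait might look like this (a sketch, assuming __rdtsc() from x86intrin.h):

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

/* Spin until the TSC reaches deadline_tsc. The signed comparison also handles
 * the case where the deadline has already passed. */
static inline void spin_until_tsc(uint64_t deadline_tsc)
{
    while ((int64_t)(deadline_tsc - __rdtsc()) > 0)
        ;   /* pure rdtsc spin; exit precision is limited by rdtsc throughput (~25 cycles) */
}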
Staying in sync after initial start:
On a many-core Xeon where each core can change frequency independently, you'll need to fix the CPU frequency to something, probably max non-turbo would be a good choice. Otherwise with cores at different clock speeds, they'll obviously de-sync right away.
On a desktop you might want to do this anyway, in case pausing the clock to change CPU frequency throws things off.
Any difference in branch mispredicts, cache misses, or even different initial states of ROB/RS could lead to major desync.
More importantly, interrupts are huge and take a very long time compared to running 1 more instruction in an already-running task. And it can even lead to the scheduler doing a context switch to another thread. Or a CPU migration for the task, obviously costing a lot of cycles.
I have read that the Linux kernel contains many scheduler classes, each having its own priority. To select a new process to run, the process scheduler iterates from the highest priority class to the lowest priority class. If a runnable process is found in a class, the highest priority process is selected to run from that class.
Extract from Linux kernel development by Robert Love:
The main entry point into the process scheduler is the function schedule(), defined in kernel/sched.c. This is the function that the rest of the kernel uses to invoke the process scheduler, deciding which process to run and then running it. schedule() is generic with respect to scheduler classes. That is, it finds the highest priority scheduler class with a runnable process and asks it what to run next. Given that, it should be no surprise that schedule() is simple. The only important part of the function—which is otherwise too uninteresting to reproduce here—is its invocation of pick_next_task(), also defined in kernel/sched.c. The pick_next_task() function goes through each scheduler class, starting with the highest priority, and selects the highest priority process in the highest priority class.
Let's imagine the following scenario. There are some processes waiting in lower priority classes and processes are being added to higher priority classes continuously. Won't the processes in lower priority classes starve?
The Linux kernel implements the Completely Fair Scheduler (CFS) algorithm, which is based on a virtual clock.
Each scheduling entity has a sched_entity structure associated with it, a snapshot of which looks like:
struct sched_entity {
    ...
    u64 exec_start;
    u64 sum_exec_runtime;
    u64 vruntime;
    u64 prev_sum_exec_runtime;
    ...
};
The above four fields are used to track the runtime of a process; using them along with some other methods (notably update_curr(), where they are updated), the virtual clock is implemented.
When a process is assigned to a CPU, exec_start is updated to the current time, and the consumed CPU time is recorded in sum_exec_runtime. When the process is taken off the CPU, the sum_exec_runtime value is preserved in prev_sum_exec_runtime. sum_exec_runtime is accumulated cumulatively (meaning it grows monotonically).
vruntime stores the amount of time that has elapsed on virtual clock during process execution.
How is vruntime calculated?
Ignoring all the complex calculations, the core concept of how it is calculated is:
vruntime += delta_exec_weighted;
delta_exec_weighted = delta_exec * (NICE_0_LOAD/load.weight);
Here delta_exec is the time difference between the process being assigned to the CPU and being taken off it, whereas load.weight is the weight of the process, which depends on its priority (nice value). Usually an increase of 1 in a process's nice value means it gets 10 percent less CPU time, resulting in less weight.
Process with nice value 0: weight = 1024
Process re-niced to value 1: weight = 1024 / 1.25 ≈ 820
Points drawn from above
So vruntime increases when a process gets the CPU.
And vruntime increases more slowly for higher priority processes than for lower priority processes.
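As a toy calculation of that weighting (my own example; the weights are the kernel's nice-to-weight values, where nice 0 is 1024 and nice 5 is 335):

#include <stdint.h>
#include <stdio.h>

#define NICE_0_LOAD 1024ULL

/* Weighted vruntime increment as described above: the same wall-clock
 * delta_exec advances the virtual clock more slowly for heavier (higher
 * priority) tasks and faster for lighter (lower priority) ones. */
static uint64_t delta_exec_weighted(uint64_t delta_exec_ns, uint64_t weight)
{
    return delta_exec_ns * NICE_0_LOAD / weight;
}

int main(void)
{
    uint64_t delta = 10000000;  /* both tasks ran for 10 ms of real time */

    printf("nice 0 (weight 1024): vruntime += %llu ns\n",
           (unsigned long long)delta_exec_weighted(delta, 1024));  /* 10 ms    */
    printf("nice 5 (weight  335): vruntime += %llu ns\n",
           (unsigned long long)delta_exec_weighted(delta, 335));   /* ~30.6 ms */
    return 0;
}

After the same 10 ms of real execution, the nice 5 task's vruntime has advanced about three times as far, so it will be picked again later than the nice 0 task.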
The runqueue is maintained as a red-black tree, and each runqueue has a min_vruntime variable associated with it that holds the smallest vruntime among all the processes in the runqueue. (min_vruntime can only increase, not decrease, as processes are scheduled.)
The key for a node in the red-black tree is process->vruntime - min_vruntime.
When the scheduler is invoked, the kernel basically picks the task which has the smallest key (the leftmost node) and assigns it the CPU.
Elements with a smaller key will be placed more to the left, and thus be scheduled more quickly.
When a process is running, its vruntime will steadily increase, so it will finally move rightwards in the red-black tree.
Because vruntime increases more slowly for more important processes, they will also move rightwards more slowly, so their chance of being scheduled is greater than that of a less important process, just as required.
If a process sleeps, its vruntime will remain unchanged. Because the per-queue min_vruntime increases in the meantime, the sleeping process will be placed more to the left after waking up, because its key (mentioned above) has become smaller.
Therefore there is no chance of starvation: a lower priority process, if deprived of the CPU, will have the smallest vruntime and hence the smallest key, so it moves quickly to the left of the tree and is therefore scheduled.
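A minimal sketch of that selection rule (illustration only; the kernel keeps the entities in a red-black tree with a cached leftmost node rather than scanning linearly):

#include <stddef.h>
#include <stdint.h>

struct entity {
    uint64_t vruntime;
};

/* Pick the entity with the smallest key (vruntime - min_vruntime), i.e. the
 * one that would be the leftmost node in CFS's red-black tree. */
static struct entity *pick_next(struct entity *tasks, size_t n, uint64_t min_vruntime)
{
    struct entity *best = NULL;

    for (size_t i = 0; i < n; i++) {
        uint64_t key = tasks[i].vruntime - min_vruntime;
        if (!best || key < best->vruntime - min_vruntime)
            best = &tasks[i];
    }
    return best;
}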
It would indeed starve.
There are many ways of dealing with such a scenario.
Aging: the longer a process has been in the system, the more its priority is increased.
Scheduling algorithms give every process a time quantum in which to use the CPU. The time quantum varies; usually, interactive processes are given a smaller time quantum, as they spend more time doing I/O, while time-consuming/computational processes are given a bigger time quantum.
After a process runs its time quantum, it is put in an expired queue until there are no active processes in the system.
Then, the expired queue becomes the active queue and vice versa.
These are two ways of preventing starvation.
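A bare-bones sketch of the second idea, the active/expired queue swap (illustration only; the queue type and helpers are made up for this sketch):

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fixed-size FIFO; names are invented for this sketch. */
struct queue {
    int    tasks[64];
    size_t head, tail;
};

static bool queue_empty(struct queue *q)       { return q->head == q->tail; }
static void queue_push(struct queue *q, int t) { q->tasks[q->tail++ % 64] = t; }
static int  queue_pop(struct queue *q)         { return q->tasks[q->head++ % 64]; }

/* Run one round: each task uses its quantum, then moves to the expired queue.
 * When the active queue drains, the two queues swap roles, so every task,
 * including low-priority ones, eventually gets the CPU again. */
static void schedule_round(struct queue **active, struct queue **expired)
{
    while (!queue_empty(*active)) {
        int task = queue_pop(*active);
        /* ... run `task` for its time quantum ... */
        queue_push(*expired, task);
    }
    struct queue *tmp = *active;   /* swap: expired becomes active */
    *active  = *expired;
    *expired = tmp;
}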
I've heard that if a thread does not consume the entire time-slice allocated by the OS's thread scheduler, the remainder is wasted: e.g. if the time-slice is 10 ms and the thread ends before 5 ms, the remaining 5 ms are lost.
So if you have a lot of small, fast tasks that always take less time than the originally allocated time-slice, the waste can be significant system-wide.
If this is true I guess with standard workloads the impact is negligible and will be a concern only with specific use-cases like servers running a single type of tasks.
Do you confirm this?
Have you more information?
I've heard that if a thread does not consume the entire time-slice
allocated by the OS's thread-scheduler the remainder is wasted
I don't think that's the case. For Linux, a running task goes into the terminated state when it exits, thus freeing the processor.
... but if the OS's scheduler only "wakes up" at fixed times (e.g. with a
frequency of 10ms/100 times per second)
The scheduler is invoked whenever a task needs to be scheduled. That happens when the time allocated to the running task has expired (which doesn't necessarily mean a fixed frequency), but also on I/O, events, exit, and other scenarios.
In a simple experiment I set NOHZ=OFF and used printk() to print how often the do_timer() function gets called. It gets called every 10 ms on my machine.
However, if NOHZ=ON then there is a lot of jitter in the way do_timer() gets called. Most of the time it does get called every 10 ms, but there are times when it completely misses its deadlines.
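For reference, the instrumentation for such an experiment can be as simple as a rate-limited printk of the gap between consecutive calls (a sketch of the idea, not the exact code used; the placement inside do_timer() is an assumption):

#include <linux/ktime.h>
#include <linux/printk.h>
#include <linux/types.h>

/* Sketch only: call this from do_timer() to log the interval between
 * consecutive ticks. ktime_get_ns() and printk_ratelimited() are standard
 * kernel APIs; everything else here is illustrative. */
static void log_tick_interval(void)
{
    static u64 last_ns;
    u64 now_ns = ktime_get_ns();

    if (last_ns)
        printk_ratelimited(KERN_INFO "tick after %llu us\n",
                           (unsigned long long)((now_ns - last_ns) / 1000));
    last_ns = now_ns;
}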
I have researched both do_timer() and NOHZ. do_timer() is the function responsible for updating the jiffies value and is also responsible for the round-robin scheduling of processes.
The NOHZ feature switches off the hi-res timers on the system.
What I am unable to understand is how hi-res timers can affect do_timer(). Even if the hi-res hardware is in a sleep state, the persistent clock is more than capable of executing do_timer() every 10 ms. Secondly, if do_timer() is not executing when it should, that means some processes are not getting their timeshare when they ideally should be getting it. A lot of googling does show that for many people, many applications start working much better with NOHZ=OFF.
To make long story short, how does NOHZ=ON affect do_timer()?
Why does do_timer() miss its deadlines?
First let's understand what a tickless kernel is (NOHZ=On or CONFIG_NO_HZ set) and what the motivation was for introducing it into the Linux kernel from 2.6.17.
From http://www.lesswatts.org/projects/tickless/index.php,
Traditionally, the Linux kernel used a periodic timer for each CPU.
This timer did a variety of things, such as process accounting,
scheduler load balancing, and maintaining per-CPU timer events. Older
Linux kernels used a timer with a frequency of 100Hz (100 timer events
per second or one event every 10ms), while newer kernels use 250Hz
(250 events per second or one event every 4ms) or 1000Hz (1000 events
per second or one event every 1ms).
This periodic timer event is often called "the timer tick". The timer
tick is simple in its design, but has a significant drawback: the
timer tick happens periodically, irrespective of the processor state,
whether it's idle or busy. If the processor is idle, it has to wake up
from its power saving sleep state every 1, 4, or 10 milliseconds. This
costs quite a bit of energy, consuming battery life in laptops and
causing unnecessary power consumption in servers.
With "tickless idle", the Linux kernel has eliminated this periodic
timer tick when the CPU is idle. This allows the CPU to remain in
power saving states for a longer period of time, reducing the overall
system power consumption.
So reducing power consumption was one of the main motivations of the tickless kernel. But as it usually goes, performance takes a hit with decreased power consumption. For desktop computers, performance is of utmost concern, and hence you see that for most of them NOHZ=OFF works pretty well.
In Ingo Molnar's own words
The tickless kernel feature (CONFIG_NO_HZ) enables 'on-demand' timer
interrupts: if there is no timer to be expired for say 1.5 seconds
when the system goes idle, then the system will stay totally idle for
1.5 seconds. This should bring cooler CPUs and power savings: on our (x86) testboxes we have measured the effective IRQ rate to go from HZ
to 1-2 timer interrupts per second.
Now, let's try to answer your queries:
What I am unable to understand is how can hi-res timers affect the
do_timer ?
If a system supports high-res timers, timer interrupts can occur more frequently than the usual 10 ms on most systems; i.e. these timers try to make the system more responsive by leveraging the system's capabilities and firing timer interrupts even faster, say every 100 us. With the NOHZ option these timers are quieted down, and hence do_timer() executes less often.
Even if hi-res hardware is in sleep state the persistent clock is more
than capable to execute do_timer every 10ms
Yes, it is capable. But the intention of NOHZ is exactly the opposite: to prevent frequent timer interrupts!
Secondly if do_timer is not executing when it should that means some
processes are not getting their timeshare when they should ideally be
getting it
As caf noted in the comments, NOHZ does not cause processes to get scheduled less often, because it only kicks in when the CPU is idle - in other words, when no processes are schedulable. Only the process accounting stuff will be done at a delayed time.
Why does do_timer() miss its deadlines?
As elaborated above, it is the intended design of NOHZ.
I suggest you go through the tick-sched.c kernel sources as a starting point. Search for CONFIG_NO_HZ and try to understand the new functionality added for the NOHZ feature.
Here is one test performed to measure the Impact of a Tickless Kernel