Multi threaded BIOS - multithreading

I would like to know why the BIOS is single-threaded even though we have 4 or 8 cores. The latest UEFI technology even allows GUI utilities. Is there any specific reason for not implementing a multi-threaded BIOS?

The simple answer is: Diminishing Returns
On most PCs, the BIOS/UEFI boot sequence takes only ~5 seconds (not counting HDD spin-up latency). To most people, that is fast enough. (If you want faster, put your PC to sleep instead of turning it off.)
Keep in mind that many of the tasks done in the BIOS cannot be parallelized. The memory controller has to be initialized first. The PCI/PCIe buses must be enumerated before you can check any of the subsequent devices (USB, SATA, video, etc.). You can't boot until your disks have spun up.
There are a few initialization items that are time-consuming, and could be done in parallel.
IDE/SATA - Usually takes a while due to mechanical disk latencies.
USB - Some USB devices need hundreds of milliseconds after power is applied to come to life.
Video (and any other third-party BIOS extensions) - It takes a while to communicate with the displays and sync up.
Those tasks could be done in parallel, which might speed up your PC's boot time. Keep in mind that to get there, you need to write a kernel and task scheduler. In a legacy BIOS (pure x86 assembler), this would not be pretty. In UEFI (which is mostly C source), this is a little more feasible. However, it still requires a non-trivial engineering effort for a minor gain (maybe 1-2 seconds of boot time).
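To make the diminishing-returns argument concrete, here is a rough back-of-the-envelope sketch in Python. The stage names and durations are invented for illustration and are not measurements of any real firmware. It only shows that the stages which must run serially bound the gain: with perfect parallelism, the parallelizable stages cost as much as the slowest one.

```python
# Hypothetical stage durations in seconds. Illustrative numbers only,
# not measurements of any real BIOS/UEFI implementation.
SERIAL_STAGES = {"memory_init": 1.0, "pci_enumeration": 0.5}
PARALLEL_CANDIDATES = {"sata": 1.5, "usb": 0.5, "video": 1.0}


def boot_time_serial(serial, candidates):
    """Everything runs one after another, as in a single-threaded BIOS."""
    return sum(serial.values()) + sum(candidates.values())


def boot_time_parallel(serial, candidates):
    """Serial stages first, then all candidates at once (ideal case)."""
    return sum(serial.values()) + max(candidates.values(), default=0.0)


if __name__ == "__main__":
    before = boot_time_serial(SERIAL_STAGES, PARALLEL_CANDIDATES)
    after = boot_time_parallel(SERIAL_STAGES, PARALLEL_CANDIDATES)
    print(f"serial: {before:.1f}s, parallel: {after:.1f}s, saved: {before - after:.1f}s")
```

With these made-up numbers the saving is on the order of a second or two, which is the scale of gain the answer describes.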
Phoenix has tried to introduce a multi-threaded BIOS initialization before. As far as I know, it never took off.

Because there is no need. The BIOS does not do heavy computations. It does some coordination and then exits (forever).

UEFI does not describe any multiprocessing functionality. However, the PI specification (also produced by the UEFI Forum) does, and EDK2 provides the EFI_MP_SERVICES_PROTOCOL (currently for IA32/X64 only).
It is not exactly pthreads, but it does let you schedule tasks to run on Application Processors while the Bootstrap Processor keeps providing the single-threaded UEFI instance.
The interface for the DXE phase is described in Volume 3 of the v1.5 PI specification, section MP Services Protocol (13.4).
Functionality available during PEI is described in Volume 2, EFI MP Services PPI (8.3.9).

Related

Vtune: Accuracy of Intel sampling drivers when vtune measurement run on a machine running other tasks

I have a latest-generation Coffee Lake machine that is primarily used as a storage server. The average load on each of its 4 cores is around 5-10% when running the storage server alone.
I want to run VTune measurements of a workload on this machine using the Intel sampling drivers. However, I'm doubtful whether the measurements will be accurate given that the storage server application is running concurrently.
But as Intel's documentation suggests, the sampling drivers are installed into the Linux kernel, so is it really the case that the measurements will be inaccurate if run concurrently with other applications? In other words, how exactly do the Intel sampling drivers work? Are they able to distinguish between the workload process and other processes running on the system?

If VTune works like the Linux perf_events subsystem that perf and PAPI use, it basically saves/restores the HW event counter registers on context switch, along with the regular register state. So events like instructions and uops_retired should be unaffected, and any effects on other events will reflect actual impacts, like extra cache misses.
(The basic mechanism for HW performance events is that each logical core has its own programmable perf counters that increment every time some microarchitectural event happens. If one overflows, it raises an interrupt so the driver can collect the count. For perf record style functionality, perf or VTune would program them to count down so that they trigger an interrupt regularly, and sample the saved user-space RIP at that point. This produces some funky effects on a superscalar out-of-order CPU, like "blaming" the instruction waiting for data rather than the cache-miss load itself. But the key point is that the inside-the-core events are totally per-core. The uncore / L3 cache events count activity on shared resources like the L3 cache, so they are more easily disturbed by system load.)
Another point is that if you are running something on a CPU core, Linux isn't going to want to schedule other tasks there. So your background load will tend to avoid whichever core your test is running on, leaving it able to use 100% of a single core without a lot of context switches. (Although network / disk interrupts might still be handled on that core.)
So yes, you should be able to fairly accurately measure what's actually happening in your process while it runs on a system that's not totally idle. That might be a bit different from what would happen if it were run on a fully idle system, but probably not much different. Especially if it's single-threaded, or you can limit it to fewer than all of your cores, so there's at least one left for the OS to schedule other tasks onto.
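As a minimal sketch of "limit it to fewer than all of your cores", the following Python uses the Linux-only os.sched_getaffinity / os.sched_setaffinity calls. The helper names and the default of reserving one CPU are assumptions made for illustration; they are not part of VTune.

```python
import os


def cpus_for_workload(allowed, reserve=1):
    """Pick the CPUs to pin the workload to, leaving `reserve` CPUs free."""
    ordered = sorted(allowed)
    if len(ordered) <= reserve:
        return set(ordered)  # too few CPUs to set any aside
    return set(ordered[: len(ordered) - reserve])


def pin_current_process(reserve=1):
    """Restrict the calling process (pid 0 means 'self') on Linux."""
    allowed = os.sched_getaffinity(0)
    os.sched_setaffinity(0, cpus_for_workload(allowed, reserve))
    return os.sched_getaffinity(0)
```

Calling pin_current_process() at the start of the measured program (or launching it under an equivalent affinity tool) leaves the highest-numbered allowed CPU free for the background load.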

Programmatically disable CPU core

There is a well-known way to disable logical CPUs in Linux: echo 0 > /sys/devices/system/cpu/cpu<number>/online. This way, you are only telling the OS to ignore the given CPU (<number>).
My question goes further: is it possible not only to ignore it, but to physically turn it off programmatically? I want that CPU to receive no power at all, so that its energy consumption is zero.
I know that it is (sometimes) possible to disable cores from the BIOS, but I want to know whether it is possible to do it from within a program.

When you do echo 0 > /sys/devices/system/cpu/cpu<number>/online, what happens next depends on the particular CPU. On ARM embedded systems the kernel will typically disable the clock that drives the particular core PLL, so you effectively get what you want.
On Intel x86 systems, you can only disable interrupts and execute the hlt instruction (which is what the Linux kernel does). This effectively puts the CPU into a power-saving state until it is woken up by another CPU at the user's request. If you have a laptop, you can verify that the power draw indeed goes down when you disable the core by reading /sys/class/power_supply/BAT{0,1}/current_now (or uevent for all values, such as voltage) or by using the "powertop" utility.
For example, here's the call chain for disabling a CPU core in the Linux kernel on Intel CPUs.
https://github.com/torvalds/linux/blob/master/drivers/cpufreq/intel_pstate.c
arch/x86/kernel/smp.c: smp_ops.play_dead = native_play_dead,
arch/x86/kernel/smpboot.c : native_play_dead() -> play_dead_common() -> local_irq_disable()
Before that, CPUFREQ also sets the CPU to the lowest power consumption level before disabling it though this does not seem to be strictly necessary.
intel_pstate_stop_cpu -> intel_cpufreq_stop_cpu -> intel_pstate_set_min_pstate -> intel_pstate_set_pstate -> wrmsrl_on_cpu(cpu->cpu, MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
On Intel x86 there does not seem to be an official way to disable the actual clocks and voltage regulators. Even if there were, it would be specific to the motherboard, so your best bet might be to look into firmware such as coreboot.
Hmm, I realized I have no idea about Intel except looking into kernel sources.
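To check what the echo command actually changed, the kernel exposes the set of online CPUs as a list such as 0-3,5 in /sys/devices/system/cpu/online. The following is a small sketch of reading it from Python; the function names are made up for illustration.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0-3,5' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty file or stray comma
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def online_cpus(sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Return the logical CPUs the kernel currently reports as online."""
    return parse_cpu_list(Path(sysfs_cpu_dir, "online").read_text())
```

This only reports whether the scheduler may use a CPU; it says nothing about whether the core is actually power-gated.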

In Windows 10 this became possible with the new power management settings CPMINCORES and CPMAXCORES.
Powercfg -setacvalueindex scheme_current sub_processor CPMAXCORES 50
Powercfg -setacvalueindex scheme_current sub_processor CPMINCORES 25
Powercfg -setactive scheme_current
Here 50% of the cores are designated for deep sleep, and 25% are forbidden from being parked. This works very well for numeric simulations that need an increased clock rate (a 15% boost on Intel).
You cannot choose which cores to park, but the Windows 10 kernel checks which cores are "preferred" (more power efficient) on Intel's Comet Lake and newer, and starts by parking the ones that are not preferred.
Parking is not strict, so under high load the kernel can still use these cores, at very low load.
I mention this just in case you are looking for alternatives.

The closest you can get is to use a cpufreq governor. Make Linux exclude the CPU, and a power-saving governor will ensure that the core runs at its minimum frequency.

You can also isolate cpus from the scheduler at kernel boot time.
Add isolcpus=0,1,2 to the kernel boot parameters.
https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html

I know this is an old question but one way to disable the CPU is via grub config.
Add maxcpus to the end of GRUB_CMDLINE_LINUX in /etc/default/grub (assuming you are using a standard Linux distribution; on an appliance the location of the grub config may be different), e.g.:
GRUB_CMDLINE_LINUX=".......Current config here maxcpus=2"
Then regenerate your grub config by running
grub2-mkconfig -o /boot/grub2/grub.cfg (or grub-mkconfig -o /boot/grub2/grub.cfg depending on your installation). Some distros may require nr_cpus instead of maxcpus.
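After rebooting, one way to confirm that the parameter reached the kernel is to look at /proc/cmdline. Here is a small sketch; the function names are hypothetical, and the set of parameters it looks for (maxcpus, nr_cpus, isolcpus) is just the ones mentioned on this page.

```python
def cpu_limit_params(cmdline):
    """Extract CPU-limiting parameters from a kernel command line string."""
    wanted = ("maxcpus", "nr_cpus", "isolcpus")
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in wanted:
            found[key] = value
    return found


def read_cmdline(path="/proc/cmdline"):
    """Return the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()
```

For example, cpu_limit_params(read_cmdline()) returns an empty dict if none of these parameters were passed at boot.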
Just some extra info:
If you are running a server with multiple physical CPUs, disabling one CPU will most likely also disable the memory attached to that CPU, which may affect the performance of the server.
Disabling the CPU this way will not prevent your type 1 hypervisor from accessing the CPU (this is based on the Xen hypervisor; I believe it applies to VMware as well, and confirmation from anyone would be great). Depending on your VirtualBox setup, it may restrict the number of CPUs you can allocate to VMs unless you are running paravirtualization.
I am unsure, however, whether you will see any power savings. Most servers and even desktops these days already manage power well, putting to sleep any device not needed for the current load. My concern is that by reducing the number of CPUs (cores) you just move the load to the remaining ones. Given the need to schedule processor time, instructions potentially being queued, and fewer cores being available for interrupts (e.g. network traffic), it may even have a negative effect on power consumption.

AFAIK there is no system call, library function, or even ioctl implementation available for this as of now. So apart from creating a new module or system call, there are two ways I can think of:
Using inline assembly, asm(<assembly code>);, where the assembly code is architecture-specific code that modifies the CPU flag.
Using the system() function in C (man 3 system), assuming you just want to do it from C.
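As an alternative to shelling out with system(), a program can write to the sysfs control file directly. The sketch below does that in Python; the function name and the sysfs_cpu_dir parameter are assumptions added so the code can be tried against a throwaway directory. On a real system the write needs root, and the online file may not exist for every CPU.

```python
import tempfile
from pathlib import Path


def set_cpu_online(cpu, online, sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Write '1' or '0' to cpu<N>/online and return the path written."""
    control = Path(sysfs_cpu_dir) / f"cpu{cpu}" / "online"
    control.write_text("1" if online else "0")
    return control


if __name__ == "__main__":
    # Dry run against a temporary directory that mimics the sysfs layout,
    # so nothing on the real system is changed.
    with tempfile.TemporaryDirectory() as fake_sysfs:
        (Path(fake_sysfs) / "cpu1").mkdir()
        path = set_cpu_online(1, False, sysfs_cpu_dir=fake_sysfs)
        print(path.name, "now contains", path.read_text())
```

This has the same effect as the echo command from the question: it asks the OS to stop using the core, and does not by itself cut power to it.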

Direct Cpu Threads or OpenCL

I have searched the various questions (and the web) but did not find a satisfactory answer.
I am curious whether to use threads to load the CPU cores directly or to use an OpenCL implementation. Is OpenCL just there to make multi-processor/multi-core code more portable, meaning the code can be ported to either GPU or CPU, or is OpenCL faster and more efficient? I am aware that GPUs have more processing units, but that is not the question. Which is it: direct multithreading in code, or OpenCL?
Sorry, I have another question...
If the IGP shares PCI lines with the discrete graphics card and its drivers cannot be loaded under Windows 7, I have to assume that it will not be available, even if you only want to use the processing cores of the integrated GPU. Is this correct, or is there a way to access the IGP without drivers?

EDIT: As @Yann Vernier pointed out in the comment section, I haven't been strict enough with the terms I used. In this post I use the term thread as a synonym for workitem; I'm not referring to CPU threads.
I can’t really compare OCL with other technologies that use the different cores of a CPU, as I have only used OCL so far.
However, I can offer some input about OCL, especially since I don’t really agree with ScottD.
First of all, even though an OCL kernel developed to run on a GPU will also run on a CPU, that doesn’t mean it will be efficient. The reason is simply that OCL doesn’t work the same way on CPU and GPU. For a good understanding of how it differs, see chapter 6 of “Heterogeneous Computing with OpenCL”. To summarize: while the GPU launches a bunch of threads within a given workgroup at the same time, the CPU executes the threads of a workgroup one after another on a core. See also section 3.4 of the standard on the two different programming models supported by OCL. This can explain why an OCL kernel could be less efficient on a CPU than “classic” code: because it was designed for a GPU. Whether a developer targets the CPU or the GPU is not a matter of “serious work”; it simply depends on which programming model best suits your needs. Also, the fact that OCL supports CPUs as well is nice, since it can degrade gracefully on a computer not equipped with a proper GPU (though such a computer must be hard to find).
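The execution model described above can be pictured with a small pure-Python emulation. This is only a conceptual sketch under simplifying assumptions (1-D range, no barriers, no local memory, zero global offset); it is not how any particular OpenCL runtime is implemented, and run_ndrange and vector_add are names invented here.

```python
def run_ndrange(kernel, global_size, local_size):
    """Serially emulate a 1-D NDRange, one work-group at a time."""
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    for group_id in range(global_size // local_size):
        # A CPU runtime could hand each work-group to a different core;
        # inside a group the work-items run as a plain sequential loop.
        for local_id in range(local_size):
            global_id = group_id * local_size + local_id
            kernel(global_id, local_id, group_id)


def vector_add(a, b, local_size=4):
    """Element-wise add written in 'one work-item per element' style."""
    out = [0] * len(a)

    def kernel(global_id, local_id, group_id):
        out[global_id] = a[global_id] + b[global_id]

    run_ndrange(kernel, global_size=len(a), local_size=local_size)
    return out
```

On a GPU the inner loop would instead be spread across many lanes at once, which is why a kernel tuned for one device type can behave quite differently on the other.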
Regarding the AMD platform, I’ve noticed some problems with the CPU device as well, on a laptop with an ATI GPU. I observed low performance on some of my code, and crashes too. The reason was that the processor was an Intel. The AMD platform will report a CPU device as available even if it is an Intel CPU; however, it won’t be able to use it as efficiently as it should. When I ran the exact same code targeting the CPU after installing (and using) the Intel platform, all the issues were gone. That’s another possible reason for poor performance.
Regarding the iGPU: it does not share PCIe lines; it is on the CPU die (at least on Intel), and yes, you need the driver to use it. I assume that you tried to install the driver and got a message like “your computer does not meet the minimum requirement…” or something similar. I guess it depends on the computer, but in my case I have a desktop equipped with an NVIDIA card and an i7 CPU (which has an HD4000 GPU). In order to use the iGPU I first had to enable it in the BIOS, which allowed me to install the driver. Of course, only one of the two GPUs is used by the display at a time (depending on the BIOS setting), but I can access both with OCL.

In recent experiments using the Intel OpenCL tools, we found that OpenCL performance was very similar to CUDA and to intrinsics-based AVX code compiled with gcc and icc. That is much better than in earlier experiments (some years ago), where we saw OpenCL perform worse.

Why doesn't Linux use the hardware context switch via the TSS?

I read the following statement:
The x86 architecture includes a specific segment type called the Task State Segment (TSS), to store hardware contexts. Although Linux doesn't use hardware context switches, it is nonetheless forced to set up a TSS for each distinct CPU in the system.
I am wondering:
Why doesn't Linux use the hardware support for context switch?
Isn't the hardware approach much faster than the software approach?
Is there any OS which does take advantage of the hardware context switch? Does windows use it?
Lastly, and as always, thanks for your patience and replies.
-----------Added--------------
http://wiki.osdev.org/Context_Switching has some explanation.
People as confused as me could take a look at it. 8^)

The x86 TSS is very slow for hardware multitasking and offers almost no benefits when compared to software task switching. (In fact, I think doing it manually beats the TSS a lot of times)
The TSS is also known for being annoying and tedious to work with, and it is not portable, even to x86-64. Linux aims to work on multiple architectures, so they probably opted for software task switching because it can be written in a machine-independent way. Also, software task switching gives much more control over what can be done and is generally easier to set up than the TSS.
I believe Windows 3.1 used the TSS, but at least the NT >5 kernel does not. I do not know of any Unix-like OS that uses the TSS.
Do note that the TSS is mandatory. What OSs do, though, is create a single TSS entry (per processor), and every time they need to switch tasks, they just update this single TSS. The only fields in the TSS used with software task switching are ESP0 and SS0. These are used to get to ring 0 from ring 3 code for interrupts. Without a TSS, there would be no known ring 0 stack, which would of course lead to a GPF and eventually a triple fault.
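To show where ESP0 and SS0 sit, here is a sketch of the 32-bit TSS layout modelled with Python's ctypes. It is a simplified model: every slot is treated as a 32-bit field (the 16-bit selectors really occupy the low half of their slot), and the stack address and selector value in the usage note are made-up examples.

```python
import ctypes


class Tss32(ctypes.LittleEndianStructure):
    """Simplified 32-bit x86 TSS: 26 dword slots, 104 bytes in total."""
    _fields_ = [(name, ctypes.c_uint32) for name in (
        "link", "esp0", "ss0", "esp1", "ss1", "esp2", "ss2", "cr3",
        "eip", "eflags", "eax", "ecx", "edx", "ebx", "esp", "ebp",
        "esi", "edi", "es", "cs", "ss", "ds", "fs", "gs", "ldtr",
        "iopb",  # high 16 bits hold the I/O permission bitmap offset
    )]


def set_kernel_stack(tss, stack_top, kernel_ss):
    """The only update a software task switcher needs to make per switch."""
    tss.esp0 = stack_top   # ring 0 stack pointer loaded on a ring 3 -> 0 entry
    tss.ss0 = kernel_ss    # ring 0 stack segment selector
```

For example, set_kernel_stack(Tss32(), 0xC0100000, 0x10) touches only bytes 4 to 11 of the structure; all the register slots that hardware task switching would fill stay unused.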

Linux used to use HW-based switching, in the pre-1.3 timeframe iirc. I believe sw-based context switching turned out to be faster, and it is more flexible.
Another reason may have been minimizing arch-specific code. The first port of Linux to a non-x86 architecture was Alpha. Alpha didn't have TSS, so more code could be shared if all archs used SW switching. (Just a guess.) Unfortunately the kernel changelogs for the 1.2-1.3 kernel period are not well-preserved, so I can't be more specific.

Linux doesn't use a segmented memory model, so this segmentation specific feature isn't used.
x86 CPUs have many different kinds of hardware support for context switching, so the distinction isn't hardware vs software, but more how does an OS use the various hardware features available. It isn't necessary to use them all.
Linux is so efficiency focussed that you can bet that someone has profiled every option that is possible, and that the options currently used are the best available compromise.

Disabling Multithreading during runtime

I am wondering whether Intel's processors provide instructions in their instruction set to turn the multithreading or hyperthreading capability on and off. Basically, I want to know whether an operating system can control these features via instructions somehow.
Thank you so much
Mareike

Most operating systems have a facility for changing a process' CPU affinity, thereby restricting it to a single physical or virtual core. But multithreading is a program architecture, not a CPU facility.

I think that what you are trying to ask is, "Is there a way to prevent the OS from utilizing hyperthreading and/or multiple cores?"
The answer is, definitely. This isn't governed by a single instruction, and indeed it's not like you can just write a device driver that would automagically disable all of that hardware. Most of this depends on how the kernel configures the interrupt controllers at boot time.
When a machine is first started, there is a designated processor that is used for bootstrapping. It is the responsibility of the OS to configure the multiprocessor hardware accordingly. On PC platforms this would involve reading information about the multiprocessor configuration from in-memory tables provided by the boot firmware. This data would likely conform to either the ACPI or the Intel multiprocessor specifications. The kernel then uses that data to configure the APIC hardware accordingly.

Multithreading and multitasking are not special instructions or modes in the CPU. They're just fancy ways people who write operating systems use interrupts. There is a hardware timer, basically a counter being incremented by a clocking signal, that triggers an interrupt when it overflows. The exact interrupt is platform specific. In the olden days this timer was actually a separate chip/circuit on the motherboard that was simply attached to one of the CPU's interrupt pins. Modern CPUs have this timer built in. So, to turn off multithreading and multitasking the OS can simply disable the interrupt signal.
Alternatively, since it's the OS's job to actually schedule processes/threads, the OS can simply decide to ignore all threads and not run them.
Hyperthreading is a different thing. It sort of allows the OS to see a second virtual CPU that it can execute code on. Never had to deal with the thing directly so I'm not sure how to turn it off (or even if it is possible).
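On Linux, the "second virtual CPU" is visible through sysfs: each logical CPU has a topology/thread_siblings_list file, and logical CPUs that share a physical core report the same list. The following sketch groups them; the function names are invented for illustration, and it only inspects the topology without changing anything.

```python
from collections import defaultdict
from pathlib import Path


def group_siblings(siblings_by_cpu):
    """Group logical CPUs whose thread_siblings_list contents are identical."""
    groups = defaultdict(list)
    for cpu, siblings in siblings_by_cpu.items():
        groups[siblings.strip()].append(cpu)
    return sorted(sorted(group) for group in groups.values())


def read_siblings(sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Map each logical CPU number to its raw thread_siblings_list text."""
    result = {}
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    for path in Path(sysfs_cpu_dir).glob(pattern):
        cpu = int(path.parent.parent.name[len("cpu"):])
        result[cpu] = path.read_text()
    return result
```

If group_siblings(read_siblings()) returns groups with more than one CPU each, the OS is seeing several hardware threads per core.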

There is no x86 instruction that disables HyperThreading or additional cores. However, there are BIOS settings that can turn off these features. Because they are set in the BIOS, changing them requires a reboot, and it is generally beyond OS control. There is a Windows boot option that limits the number of active cores, but HyperThreading can be turned on or off only by the BIOS. Intel's current HyperThreading implementation doesn't allow turning it on and off dynamically (and that won't be easy to implement in the near future).
I have assumed that 'multithreading' in your question means 'hardware multithreading', which is technically identical to HyperThreading. However, if you really meant software-level multithreading (i.e., multitasking), then it's a totally different question. Turning it off is (almost) impossible on modern operating systems, since they support multitasking by default, and the question doesn't really make sense. It could make sense if you want to run MS-DOS (in x86 real mode, where only a single task can run).
P.S. Please note that 'multithreading' can be either hardware or software. I also agree with the other answers, e.g. about processor/thread affinity.
