Why doesn't Linux use the hardware context switch via the TSS? - linux

I read the following statement:
The x86 architecture includes a
specific segment type called the Task
State Segment (TSS), to store hardware
contexts. Although Linux doesn't use
hardware context switches, it is
nonetheless forced to set up a TSS for
each distinct CPU in the system.
I am wondering:
Why doesn't Linux use the hardware support for context switch?
Isn't the hardware approach much faster than the software approach?
Is there any OS which does take advantage of the hardware context switch? Does windows use it?
At last and as always, thanks for your patience and reply.
-----------Added--------------
http://wiki.osdev.org/Context_Switching got some explanation.
People as confused as me could take a look at it. 8^)

The x86 TSS is very slow for hardware multitasking and offers almost no benefits when compared to software task switching. (In fact, I think doing it manually beats the TSS a lot of times)
The TSS is known also for being annoying and tedious to work with and it is not portable, even to x86-64. Linux aims at working on multiple architectures so they probably opted to use software task switching because it can be written in a machine independent way. Also, Software task switching provides a lot more power over what can be done and is generally easier to setup than the TSS is.
I believe Windows 3.1 used the TSS, but at least the NT >5 kernel does not. I do not know of any Unix-like OS that uses the TSS.
Do note that the TSS is mandatory. The thing that OSs do though is create a single TSS entry(per processor) and everytime they need to switch tasks, they just change out this single TSS. And also the only fields used in the TSS by software task switching is ESP0 and SS0. This is used to get to ring 0 from ring 3 code for interrupts. Without a TSS, there would be no known Ring 0 stack which would of course lead to a GPF and eventually triple fault.

Linux used to use HW-based switching, in the pre-1.3 timeframe iirc. I believe sw-based context switching turned out to be faster, and it is more flexible.
Another reason may have been minimizing arch-specific code. The first port of Linux to a non-x86 architecture was Alpha. Alpha didn't have TSS, so more code could be shared if all archs used SW switching. (Just a guess.) Unfortunately the kernel changelogs for the 1.2-1.3 kernel period are not well-preserved, so I can't be more specific.

Linux doesn't use a segmented memory model, so this segmentation specific feature isn't used.
x86 CPUs have many different kinds of hardware support for context switching, so the distinction isn't hardware vs software, but more how does an OS use the various hardware features available. It isn't necessary to use them all.
Linux is so efficiency focussed that you can bet that someone has profiled every option that is possible, and that the options currently used are the best available compromise.

Related

Efficiency of monolithic kernels

I know that monolithic kernel runs all services in itself. And I searched over the internet to find why every article say its efficient. I couldn't find the reason. Why does it tend to be efficient than other kernels?
Before we start, let us summersize what a kernel is.
In computing, a kernel, which is a fundamental part of modern operating systems(OSs), is a program that allocates all your systems' resources and manages the
input/output (I/O) requests, acting as an interface between the software and the hardware.
The kernel performs its tasks in the kernel space, whereas every other tasks are done in the user space. This separation ensures that the kernel data and
the user data do not interfere, hence maintaining stability.
Kernel types:
A monolithic kernel has all its components running in the kernel space. This aproach was designed to provide robust performance.
A micro kernel has only its crucial parts running in the kernel space while the rest runs in the user space. It was designed to be faster and more modular for easy maintanence.
A hybrid kernel is a kernel that attempts to combine the aspects of both monolithic and micro kernels. It has a structure similar to that of a micro kernel, but implimented in the manner of a monolithic kernel.
Now coming to the main topic: Why is a monolithic kernel is more efficient?
Because of their design, monolithic kernels have better performance, hence provide rich and more powerful hardware access. Nowadays they also consist of dynamically loaded and unloaded module, which provide the efficiency of modularity.
Hoped you liked my answer and found it useful
=)
Microkernels offer the bare essentials to get a system operating. Microkernel systems have small kernelspaces and large userspaces.
Monolithic kernels, however, contain much more. Monolithic systems have large kernelspaces. For instance, one difference is the placement of device drivers. Monolithic kernels contain drivers (modules) and place them in kernelspace while microkernels lack drivers. In such systems, the device drivers are offered in another way and placed in the userspace. This means microkernel system still have drivers, but they are not part of the kernel. In other words, the drivers exist in another part of the operating system.
Also, in the modern day approach to monolithic architecture, the kernel consists of different modules which can be dynamically loaded and un-loaded. This modular approach allows easy extension of OS's capabilities. With this approach, maintainability of kernel became very easy as only the concerned module needs to be loaded and unloaded every time there is a change or bug fix in a particular module. So, there is no need to bring down and recompile the whole kernel for a smallest bit of change.
So,as per modern day changes, the monolithic kernels have took a grip over their limitations and are evolving better. Hence, everybody seems to praise it for its improved efficiency!

what is the difference between kmemleak and kmemcheck? and How to enable these tools on Android operating systems?

Is there any special usage/advantage over each other (kmemleak and kmemcheck) ? Can I enable these tools on Android operating system (not Linux OS) please guide me how.
Ref: https://www.kernel.org/doc/Documentation/kmemcheck.txt
https://www.kernel.org/doc/Documentation/kmemleak.txt
Kmemleak and Kmemcheck perform different tasks, none is better than the other.
1.
Kmemleak checks if some memory blocks were allocated by the kernel but were not freed (that is, checks for memory leaks in the kernel, hence the name). The performance overhead is usually acceptable.
2.
Kmemcheck checks if some kernel code accesses uninitialized memory. Example: the kernel code allocates a structure, does not fill it with values and then reads something from that structure. Kmemcheck should detect that.
Kmemcheck does not check for memory leaks, by the way.
Kmemcheck often slows down the system so much that graphical environments are impossible to use. The boot process may also become very slow (and may fail).
3.
If I am not mistaken, Kmemleak works at least for x86 and ARM. Kmemcheck is x86 only.
4.
Unfortunately, I cannot say how to enable Kmemleak on Android, I only used it on desktop Linux systems.
5.
Depending on what you are trying to accomplish, there could be tools that suit your needs better. For example, the Linux kernel has a variety of debugging features that can be enabled and built in. Again, I have no experience with Android kernel in this regard.

Direct Cpu Threads or OpenCL

I have search the various questions (and web) but did not find any satisfactory answer.
I am curious about whether to use threads to directly load the cores of the CPU or use an OpenCL implementation. Is OpenCl just there to make multi processors/cores just more portable, meaning porting the code to either GPU or CPU or is OpenCL faster and more efficient? I am aware that GPU's have more processing units but that is not the question. Is it indirect multi threading in code or using OpneCL?
Sorry I have another question...
If the IGP shares PCI lines with the Descrete Graphics Card and its drivers can not be loaded under Windows 7, I have to assume that it will not be available, even if you want to use the processing cores of the integrated GPU only. Is this correct or is there a way to access the IGP without drivers.
EDIT: As #Yann Vernier point out in the comment section, I haven't be strict enough with the terms I used. So in this post I use the term thread as a synonym of workitem. I'm not refering to the CPU threads.
I can’t really compare OCL with any other technologies that will allow using the different cores of a CPU as I only used OCL so far.
However I might bring some input about OCL especially that I don’t really agree with ScottD.
First of all, even though an OCL kernel developed to run on a GPU will run as well on a CPU it doesn’t mean that it’ll be efficient. The reason is simply that OCL doesn’t work the same way on CPU and GPU. To have a good understanding of how it differs, see the chap 6 of “heterogeneous computing with opencl”. To summary, while the GPU will launch a bunch of threads within a given workgroup at the same time, the CPU will execute on a core one thread after another within the same workgroup. See as well the point 3.4 of the standard about the two different types of programming models supported by OCL. This can explain why an OCL kernel could be less efficient on a CPU than a “classic” code: because it was design for a GPU. Whether a developer will target the CPU or the GPU is not a problem of “serious work” but is simply dependent of the type of programming model that suits best your need. Also, the fact that OCL support CPU as well is nice since it can degrade gracefully on computer not equipped with a proper GPU (though it must be hard to find such computer).
Regarding the AMD platform I’ve noticed some problem with the CPU as well on a laptop with an ATI. I observed low performance on some of my code and crashes as well. But the reason was due to the fact that the processor was an Intel. The AMD platform will declare to have a CPU device available even if it is an Intel CPU. However it won’t be able to use it as efficiently as it should. When I run the exact same code targeting the CPU but after installing (and using) the Intel platform all the issues were gone. That’s another possible reason for poor performance.
Regarding the iGPU, it does not share PCIe lines, it is on the CPU die (at least of Intel) and yes you need the driver to use it. I assume that you tried to install the driver and got a message like” your computer does not meet the minimum requirement…” or something similar. I guess it depends on the computer, but in my case, I have a desktop equipped with a NVIDIA and an i7 CPU (it has an HD4000 GPU). In order to use the iGPU I had first to enable it in the BIOS, which allowed me to install the driver. Of Course only one of the two GPU is used by the display at a time (depending on the BIOS setting), but I can access both with OCL.
In recent experiments using the Intel opencl tools we experienced that the opencl performance was very similar to CUDA and intrincics based AVX code on gcc and icc -- way better than earlier experiments (some years ago) where we saw opencl perform worse.

What is Intel microcode?

From what I've read it's used to fix bugs in the CPU without modifying the BIOS.
From my basic knowledge of Assembly I know that assembly instructions are split into microcodes internally by the CPU and executed accordingly. But intel somehow gives access to make some updates while the system is up and running.
Anyone has more info on them? Is there any documentation regarding what can it be done with microcodes and how can they be used?
EDIT:
I've read the wikipedia article: didn't figure out how can I write some on my own, and what uses it would have.
In older times, microcode was heavily used in CPU: every single instruction was split into microcode. This enabled relatively complex instruction sets in modest CPU (consider that a Motorola 68000, with its many operand modes and eight 32-bit registers, fits in 40000 transistors, whereas a single-core modern x86 will have more than a hundred millions). This is not true anymore. For performance reasons, most instructions are now "hardwired": their interpretation is performed by inflexible circuitry, outside of any microcode.
In a recent x86, it is plausible that some complex instructions such as fsin (which computes the sine function on a floating point value) are implemented with microcode, but simple instructions (including integer multiplication with imul) are not. This limits what can be achieved with custom microcode.
That being said, microcode format is not only very specific to the specific processor model (e.g. microcode for a Pentium III and a Pentium IV cannot be freely exchanged with eachother -- and, of course, using Intel microcode for an AMD processor is out of the question), but it is also a severely protected secret. Intel has published the method by which an operating system or a motherboard BIOS may update the microcode (it must be done after each hard reset; the update is kept in volatile RAM) but the microcode contents are undocumented. The Intel® 64 and IA-32 Architectures Software Developer’s Manual (volume 3a) describes the update procedure (section 9.11 "microcode update facilities") but states that the actual microcode is "encrypted" and clock-full of checksums. The wording is vague enough that just about any kind of cryptographic protection may be hidden, but the bottom-line is that it is not currently possible, for people other than Intel, to write and try some custom microcode.
If the "encryption" does not include a digital (asymmetric) signature and/or if the people at Intel botched the protection system somehow, then it may be conceivable that some remarkable reverse-engineering effort could potentially enable one to produce such microcode, but, given the probably limited applicability (since most instructions are hardwired), chances are that this would not buy much, as far as programming power is concerned.
Think loosely about a virtual machine or simulator where say for example qemu-arm can simulate an arm processor on an x86 host, ideally the software running on the simulated arm has no idea that it isnt a real arm. Take this idea to the level where the whole chip is designed such that it always looks like you are an x86, the software never knows there is some programmable items inside the chip. And that some other processor inside is somewhat designed for the purpose of implementing/simulating an x86. Supposedly the popular AMD 29000 product line just went away because the hardware team and perhaps processor/core became the guts of an early x86 clone. Transmeta, where Linus worked, had a vliw processor that was made to be a low power x86. In that case the translation layer was not (as much of) a secret. Vliw, very long instruction word, RISC taken to the extreme, is the kind of thing you build for this kind of task.
No it is not as much of an emulation layer as I am implying, there isnt some linux running there with a qemu program inside each chip. It is somewhere between hardwired where there is no software/microcode in the middle and a full blow emulation. The programmable bits may be like an fpga, programmable gates, or it may be software or programmable state machines, meaning not-programmable gates, just what runs on the gates is programmable.
Your non-x86, non-big iron type processors. Take ARM for example, are hardwired, no microcode. Microcontrollers, PIC, MSP430, AVR, assume these are not microcoded. Basically do not assume all processors are microcoded, few if any processor families are. It is just that the ones we deal with in PCs have been and may still be, so it may feel like they all are.
As fun as it may sound to play with this microcode, it is likely very specific to the processor family, and you likely will never gain access to how it works unless you work for Intel or AMD, each of which likely have their own internals. So you would need to get a job at one of the two, then work your way through the trenches to become one of what is likely an elite team that does this work. And once you get that far your career is trapped, your skills may be limited to one job at one company. You might have more fun programming individual gpus on a video card, something that is documented or at least has tools, something you can do today without spending 10 years at AMD or Intel to possibly get nowhere.

Disabling Multithreading during runtime

I am wondering if Intel's processor provides instructions in their instruction set
to turn on and off the multithreading or hyperthreading capability? Basically, I wanna
know if an Operating System can control these feature via instructions somehow?
Thank you so much
Mareike
Most operating systems have a facility for changing a process' CPU affinity, thereby restricting it to a single physical or virtual core. But multithreading is a program architecture, not a CPU facility.
I think that what you are trying to ask is, "Is there a way to prevent the OS from utilizing hyperthreading and/or multiple cores?"
The answer is, definitely. This isn't governed by a single instruction, and indeed it's not like you can just write a device driver that would automagically disable all of that hardware. Most of this depends on how the kernel configures the interrupt controllers at boot time.
When a machine is first started, there is a designated processor that is used for bootstrapping. It is the responsibility of the OS to configure the multiprocessor hardware accordingly. On PC platforms this would involve reading information about the multiprocessor configuration from in-memory tables provided by the boot firmware. This data would likely conform to either the ACPI or the Intel multiprocessor specifications. The kernel then uses that date to configure the APIC hardware accordingly.
Multithreading and multitasking are not special instructions or modes in the CPU. They're just fancy ways people who write operating systems use interrupts. There is a hardware timer, basically a counter being incremented by a clocking signal, that triggers an interrupt when it overflows. The exact interrupt is platform specific. In the olden days this timer is actually a separate chip/circuit on the motherboard that is simply attached to one of the CPU's interrupt pin. Modern CPUs have this timer built in. So, to turn off multithreading and multitasking the OS can simply disable the interrupt signal.
Alternatively, since it's the OS's job to actually schedule processes/threads, the OS can simply decide to ignore all threads and not run them.
Hyperthreading is a different thing. It sort of allows the OS to see a second virtual CPU that it can execute code on. Never had to deal with the thing directly so I'm not sure how to turn it off (or even if it is possible).
There is no x86 instruction that disables HyperThreading or additional cores. But, there is BIOS settings that can turn off these features. Because it can be set in BIOS, it requires rebooting, and generally it's beyond OS control. There is Windows booting option that limits the number of active core, but HyperThreading can be turn on/off only by BIOS. Current Intel's HyperThreading implementation doesn't allow dynamic turn on and off (and it won't be easily implemented in a near time).
I have assumed 'multithreading' in your question as 'hardware multithreading' which is technically identical to HyperThreading. However, if you really intended software-level multithreading (i.e., multitasking), then it's totally different question. It is (almost) impossible for modern operating systems since they are by default supports multitasking. And, this question actually doesn't make sense. It can make sense if you want to run MS-DOS (as real mode of x86, where a single task can be done).
p.s. Please note that 'multithreading' can be either hardware or software. Also I agree all others' answers such as processor/thread affinity.

Resources