Does hyper threading change the binary code sequence of a compiled program?
If we had a compiled binary code of say:
10100011100011100101010010011111100011111110010111
If hyper threading were enabled, what would a thread represent?
Would it be just some section of this binary code?
How does the operating system allocate time intervals for these threads?
For parallelism:
Would the compiled binary code be any different?
How do the cores handle this binary sequence?
Do they just execute sections of the code on different cores?
How does the operating system allocate parallel tasks? Is there any specific structure?
Most programs are compiled to be run by a particular operating system, on any member of a broad family of processors (e.g., on Windows, for the x86-64 family). Within a given CPU family, there may be CPUs with different numbers of cores, and there may be cores with or without hyperthreading.
None of that changes the binary code. The same program typically will run on any member of the CPU family, without any changes.
The operating system in which the program runs may or may not be differently configured for different members of the processor family.
Sometimes, a program can be compiled to exploit the features of a specific CPU, but programs compiled in that way are unsuitable for distribution to different sites and/or PCs.
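To make the point concrete, here is a minimal sketch (in Python, purely for illustration) of how one unchanged program can adapt at run time instead of being compiled differently. The function name `pick_worker_count` and its `cap` parameter are hypothetical; `os.cpu_count()` is the standard-library call that reports the number of logical CPUs, and each hyperthread counts as one logical CPU.

```python
import os

def pick_worker_count(cap=None):
    """Return how many worker threads to start on this machine.

    The program text is identical on every machine; only the value
    discovered at run time differs. 'cap' is a hypothetical upper
    bound a caller might want to impose.
    """
    logical = os.cpu_count() or 1  # cpu_count() can return None
    if cap is not None:
        logical = min(logical, cap)
    return max(1, logical)

if __name__ == "__main__":
    print("logical CPUs seen by the OS:", os.cpu_count())
    print("workers this run would start:", pick_worker_count())
```

On a dual-core machine with hyperthreading the OS would typically report four logical CPUs, and on a single-core machine without it, one. The code itself does not change.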
If we had a compiled binary code of say: 101000111000... If hyper threading were enabled, what would a thread represent? Would it be just some section of this binary code?
That's an unanswerable question. You can learn more about what "binary code" means by reading an introductory book on computer architecture. E.g., https://www.amazon.com/Elements-Computing-Systems-Building-Principles/dp/0262640686/
I have two physical, "identical" Red Hat Linux servers. I ran a small program on both of them. My problem: the CPU usage of my program differs between the two servers. I am not a Linux expert. I am wondering what could lead to that performance difference?
I wrote the program in C++ and in Java to see whether the inconsistency comes from the programming language chosen. The program itself does a little bit of integer calculation over time to consume a constant amount of CPU time. Both versions of the program show the same percentage difference in CPU usage.
The environmental factors I have already considered and ruled out:
identical server type
identical processor (both have two sockets, single core)
both Intel Hyper-Threading-Technology enabled
identical clock speed
identical OS version (Red Hat Enterprise Linux Server release 5.9)
identical Java version, Java RE, JVM
Intel Demand based Switching can be ignored since the measurement tool uses the default value of clock speed for CPU capacity
processor affinity can be excluded as well, I think: I ran multiple measurement series and always got exactly the same CPU usage values.
Is there perhaps a C library or something similar that affects the CPU usage of C++ and Java programs and needs to be updated separately from the OS version? Or could there be a different thread scheduler?
A variety of things can differ even between "identical" systems. Different compilers, or different versions of the same compiler, may have been used to build various libraries; for example, the ability of Intel compilers to optimize improves continuously from generation to generation. Other differences can come from airflow: one machine may run hotter than the other and occasionally drop its clock frequency. A whole host of other issues can cause identical systems to behave differently.
Here's my recommendation: Create an OS image and use that same image for both systems. Disconnect both from any network. Run a compute-bound workload (which yours is). Bind your app to a specific core. Verify that the exit air temperatures are well within specification. Disable any turbo capability. If there are still differences, do a memory speed check.
Also, use a more sophisticated profiling and analysis tool such as Intel VTune. You can dig into actual cycles, measure cache misses, branch mispredictions, etc. These should also be identical on both machines. If they aren't, the analysis should give you an idea of where the problem lies.
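As a rough sketch of the "bind your app to a certain core" step: on Linux, a process can pin itself with the scheduler affinity calls. The example below uses Python's `os.sched_setaffinity` and `os.sched_getaffinity`, which exist only on Linux and only in Python 3.3 or later (so this is an assumption that would not hold for the stock Python on an old RHEL 5 install). The helper name `pin_to_core` is hypothetical.

```python
import os

def pin_to_core(core):
    """Pin the calling process to a single logical CPU.

    Returns the resulting affinity set, or None on platforms that do
    not expose the Linux affinity calls. If 'core' is not in the set
    of CPUs this process may use, the lowest allowed CPU is used.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # not Linux, or Python too old
    allowed = os.sched_getaffinity(0)  # 0 means "this process"
    if core not in allowed:
        core = min(allowed)
    os.sched_setaffinity(0, {core})
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    print("now restricted to:", pin_to_core(0))
```

For a compiled C++ binary, the same effect is usually obtained from the shell with `taskset -c 0 ./program`, where `./program` stands in for the actual executable.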
I want to know: when you have a multi-core processor, does anything detect independent chunks of code, create a new thread, and assign it to an idle CPU?
I know that newer CPUs try to do that, but I don't know enough about the compiler and the OS.
So can someone please explain to me how much of the work is done by the compiler, the OS, and the CPU?
Thanks
Update:
Sorry if my question was so vague. Let's assume we have code that translates to the instruction sequence below.
add $1,$2,$3
nand $3,$2,$4
lw $2,$5,imm
10000 more instructions
......
add $65,$66,$67
add $67,$65,$68
10000 more instructions
and I assume the instructions in the lower set are independent of those in the upper set.
Now the lower set of instructions can be sent to another core for execution instead of waiting for the upper set to finish.
My question is: who tries to identify these independent blocks and send them to different cores?
thanks again.
In ye olde days, threads were managed by libraries and, in some cases, by the compiler as well. Ada compilers of 1980s vintage were a good example of that.
Now, in the days of multi-processors and operating systems that support them, threads tend to be managed by the operating system. The operating system picks the processor to use.
Few programming languages have real threading support (e.g., Ada), so compilers tend to have little involvement in multithreading.
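In practice this means the programmer, not the compiler or the CPU, marks the independent blocks by putting them in separate threads, and the OS then decides which core runs each one. A minimal sketch, assuming two independent blocks of work (the names `block_a`, `block_b` and `run_blocks` are hypothetical; note that in CPython the global interpreter lock limits how much pure-Python work truly overlaps, so this shows the structure rather than a speedup):

```python
import threading

def block_a(results):
    # "Upper" block: does not depend on block_b in any way.
    results["a"] = sum(range(1, 1001))

def block_b(results):
    # "Lower" block: independent of block_a.
    results["b"] = sum(i * i for i in range(10))

def run_blocks():
    """Start one thread per independent block and wait for both."""
    results = {}
    threads = [threading.Thread(target=f, args=(results,))
               for f in (block_a, block_b)]
    for t in threads:
        t.start()   # from here on, the OS scheduler places each thread
    for t in threads:
        t.join()    # wait until both blocks have finished
    return results

if __name__ == "__main__":
    print(run_blocks())
```

Nothing in the program names a core. The only thing the code expresses is that the two blocks may run at the same time.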
This is a semi-theoretical question.
Can I specify the virtualization mode for memory (pure segmentation/segmentation+paging/just paging) while compiling for Windows (e.g., MSVS12 C++) and for Linux (e.g. g++)?
I have read all MSVS linker+compiler options, and found no point of control in there.
For g++, the manual is too complex to answer such a question easily.
The source of this question is this - link
I know from theory and practice that this should either be possible or be restricted by OS policy at some level, because the Core i7 supports all three modes I mentioned above.
Practical background:
The piece of code that created lots of data is here, function Init; it exhausted my memory when I wanted to have more than 2-3G primes on the heap.
Intel x86 CPUs always use some form of segmentation that can't be turned off. In 64-bit mode, segmentation is limited, but it's still there. Paging is required for both Windows and Linux to work on Intel CPUs (though Linux doesn't use paging on certain other CPU architectures). Paging is also required to enable 64-bit mode on Intel CPUs.
So in other words, on Windows and Linux the OS always uses segmentation and paging, and so do any applications run on them, though this is largely transparent. It's not possible for a program to be "compiled+linked for 'segmentation without paging'", as you said in the answer you linked. Maybe the book you referenced is referring to ancient 16-bit versions of Windows (3.1 or earlier), which could be run in a mode that supported 80286 CPUs, which didn't have paging. Even then, that normally made no difference in how you compiled and linked your applications.
What you are describing is not a function of a compiler, or even a linker.
When you run your program, you get the memory model that is already running on the system. Your compiled code does not care about the underlying memory model.
However, your program itself can change the memory model IF it starts running in an unprotected processor mode.
I have searched the various questions (and the web) but did not find any satisfactory answer.
I am curious whether to use threads to load the CPU cores directly or to use an OpenCL implementation. Is OpenCL just there to make multi-processor/multi-core code more portable, meaning the code can be ported to either GPU or CPU, or is OpenCL faster and more efficient? I am aware that GPUs have more processing units, but that is not the question. Should it be direct multithreading in code, or OpenCL?
Sorry I have another question...
If the IGP shares PCI lines with the discrete graphics card and its drivers cannot be loaded under Windows 7, I have to assume that it will not be available, even if you only want to use the processing cores of the integrated GPU. Is this correct, or is there a way to access the IGP without drivers?
EDIT: As #Yann Vernier pointed out in the comment section, I haven't been strict enough with the terms I used. So in this post I use the term thread as a synonym of workitem. I'm not referring to CPU threads.
I can’t really compare OCL with any other technology that allows using the different cores of a CPU, as I have only used OCL so far.
However, I can offer some input about OCL, especially since I don’t really agree with ScottD.
First of all, even though an OCL kernel developed to run on a GPU will also run on a CPU, that doesn’t mean it will be efficient. The reason is simply that OCL doesn’t work the same way on a CPU and on a GPU. For a good understanding of how they differ, see chapter 6 of “Heterogeneous Computing with OpenCL”. To summarize: while the GPU launches a bunch of threads within a given workgroup at the same time, a CPU core executes the threads of a workgroup one after another. See also section 3.4 of the standard about the two different programming models supported by OCL. This can explain why an OCL kernel can be less efficient on a CPU than “classic” code: it was designed for a GPU. Whether a developer targets the CPU or the GPU is not a matter of “serious work”; it simply depends on which programming model best suits the need. Also, the fact that OCL supports CPUs is nice, since a program can degrade gracefully on a computer not equipped with a proper GPU (though it must be hard to find such a computer).
Regarding the AMD platform, I’ve also noticed some problems with the CPU device on a laptop with an ATI GPU. I observed low performance in some of my code, and crashes as well. The cause was that the processor was an Intel. The AMD platform will declare a CPU device as available even if it is an Intel CPU, but it won’t be able to use it as efficiently as it should. When I ran the exact same code targeting the CPU after installing (and using) the Intel platform, all the issues were gone. That’s another possible reason for poor performance.
Regarding the iGPU: it does not share PCIe lines, it is on the CPU die (at least on Intel), and yes, you need the driver to use it. I assume that you tried to install the driver and got a message like “your computer does not meet the minimum requirement…” or something similar. I guess it depends on the computer, but in my case I have a desktop equipped with an NVIDIA card and an i7 CPU (which has an HD4000 GPU). In order to use the iGPU, I first had to enable it in the BIOS, which allowed me to install the driver. Of course, only one of the two GPUs is used by the display at a time (depending on the BIOS setting), but I can access both with OCL.
In recent experiments using the Intel OpenCL tools, we found that OpenCL performance was very similar to CUDA and to intrinsics-based AVX code on gcc and icc. That is far better than earlier experiments (some years ago), where we saw OpenCL perform worse.
I read the following statement:
The x86 architecture includes a specific segment type called the Task State Segment (TSS), to store hardware contexts. Although Linux doesn't use hardware context switches, it is nonetheless forced to set up a TSS for each distinct CPU in the system.
I am wondering:
Why doesn't Linux use the hardware support for context switch?
Isn't the hardware approach much faster than the software approach?
Is there any OS which does take advantage of the hardware context switch? Does Windows use it?
At last and as always, thanks for your patience and reply.
-----------Added--------------
http://wiki.osdev.org/Context_Switching got some explanation.
People as confused as me could take a look at it. 8^)
The x86 TSS is very slow for hardware multitasking and offers almost no benefits when compared to software task switching. (In fact, I think doing it manually beats the TSS a lot of times)
The TSS is also known for being annoying and tedious to work with, and it is not portable, even to x86-64. Linux aims to work on multiple architectures, so they probably opted for software task switching because it can be written in a machine-independent way. Software task switching also gives a lot more control over what can be done and is generally easier to set up than the TSS.
I believe Windows 3.1 used the TSS, but at least the NT >5 kernel does not. I do not know of any Unix-like OS that uses the TSS.
Do note that the TSS is mandatory. What OSs do, though, is create a single TSS entry (per processor), and every time they need to switch tasks, they just update this single TSS. The only fields of the TSS used with software task switching are ESP0 and SS0. These are used to get from ring 3 code to ring 0 for interrupts. Without a TSS, there would be no known ring 0 stack, which would of course lead to a GPF and eventually a triple fault.
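The bookkeeping described above can be sketched as a toy model. This is not kernel code: real switches are done in assembly and C and save far more state. The names `Task`, `Cpu` and `switch_to` and the register set are all hypothetical, chosen only to show that the kernel saves and restores registers itself and touches just one field of the per-CPU TSS (SS0 is left out because it normally stays constant).

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    kernel_stack_top: int  # where this task's ring 0 stack starts
    saved_regs: dict = field(default_factory=dict)

@dataclass
class Cpu:
    regs: dict = field(default_factory=lambda: {"esp": 0, "eip": 0})
    tss_esp0: int = 0      # the single per-CPU TSS field this model updates

def switch_to(cpu, prev, nxt):
    """Software context switch: save, restore, then patch the one TSS."""
    prev.saved_regs = dict(cpu.regs)     # save outgoing task's registers
    cpu.regs = dict(nxt.saved_regs)      # restore incoming task's registers
    cpu.tss_esp0 = nxt.kernel_stack_top  # next interrupt uses nxt's stack

if __name__ == "__main__":
    cpu = Cpu(regs={"esp": 0x1000, "eip": 0x400})
    a = Task("a", kernel_stack_top=0xA000)
    b = Task("b", kernel_stack_top=0xB000,
             saved_regs={"esp": 0x2000, "eip": 0x800})
    switch_to(cpu, a, b)
    print(cpu)
```

The hardware mechanism would instead give every task its own TSS and let a far jump or call reload the whole register set. The model shows why the software route only needs one TSS per processor.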
Linux used to use HW-based switching, in the pre-1.3 timeframe iirc. I believe sw-based context switching turned out to be faster, and it is more flexible.
Another reason may have been minimizing arch-specific code. The first port of Linux to a non-x86 architecture was Alpha. Alpha didn't have TSS, so more code could be shared if all archs used SW switching. (Just a guess.) Unfortunately the kernel changelogs for the 1.2-1.3 kernel period are not well-preserved, so I can't be more specific.
Linux doesn't use a segmented memory model, so this segmentation specific feature isn't used.
x86 CPUs have many different kinds of hardware support for context switching, so the distinction isn't hardware vs software, but more how does an OS use the various hardware features available. It isn't necessary to use them all.
Linux is so efficiency focussed that you can bet that someone has profiled every option that is possible, and that the options currently used are the best available compromise.