Environmental performance parameters for applications on Linux

I have two physical, "identical" Red Hat Linux servers. I ran a small program on both of them. My problem: the CPU usage of my program differs between the two servers. I am not a Linux expert, and I am wondering what could lead to that performance difference.
I wrote the program in both C++ and Java to see whether the inconsistency comes from the programming language chosen. The program itself does a little bit of integer calculation over time to consume a constant amount of CPU time. Both program versions show the same percentage difference in CPU usage.
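For reference, here is a minimal sketch in C (not my actual program) of this kind of load generator: it does integer work for a fixed slice of each interval and sleeps for the rest, so it consumes a roughly constant share of one CPU.

    #include <stdint.h>
    #include <time.h>
    #include <unistd.h>

    static double now_ms(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
    }

    int main(void) {
        const double period_ms = 100.0;   /* length of one duty cycle */
        const double busy_ms   = 25.0;    /* target ~25% of one core */
        volatile uint64_t sink = 0;       /* volatile so the work isn't optimized away */

        for (;;) {
            double start = now_ms();
            while (now_ms() - start < busy_ms)
                sink = sink * 3 + 1;      /* arbitrary integer work */
            usleep((useconds_t)((period_ms - busy_ms) * 1000.0));
        }
        return 0;
    }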
The environmental factors I have already considered and ruled out:
identical server type
identical processors (both servers have two single-core sockets)
Intel Hyper-Threading Technology enabled on both
identical clock speed
identical OS version (Red Hat Enterprise Linux Server release 5.9)
identical Java version, Java RE, JVM
Intel Demand Based Switching can be ignored, since the measurement tool uses the default clock speed as the CPU capacity
processor affinity can be excluded as well, I think: I ran multiple measurement series and always got exactly the same CPU usage values
Is there perhaps a C library or something similar that affects the CPU usage of C++ and Java programs and is updated separately from the OS version itself? Or could the two machines be using different thread schedulers?

There are a variety of things that can differ even between "identical" systems: different compilers used to build the various libraries, as well as different versions of those compilers. For example, Intel compilers continuously improve their ability to optimize from generation to generation. Differences can also arise from airflow: one machine running hotter than the other can occasionally drop its frequency. There is a whole host of other issues that can cause identical systems to run differently.
Here's my recommendation: create an OS image and use that same image for both systems. Disconnect both from any network. Run a compute-bound workload (which yours is). Bind your app to a specific core. Verify that the exit air temperatures are well within specification. Disable any turbo capability. If there are still differences, do a memory speed check.
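For the core binding, taskset -c 0 ./yourapp works from the shell; here is a minimal programmatic sketch in C of what I mean (Linux-specific, the benchmark itself is elided):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                          /* allow core 0 only */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... run the compute-bound workload here ... */
        return 0;
    }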
Also, use a more sophisticated profiling and analysis tool such as Intel VTune. You can dig into actual cycles and measure cache misses, branch mispredicts, etc. These should also be identical; if they aren't, the analysis should give you an idea of where the problem lies.

Related

Control over memory virtualization on Linux and Windows

This is a semi-theoretical question.
Can I specify the virtualization mode for memory (pure segmentation/segmentation+paging/just paging) while compiling for Windows (e.g., MSVS12 C++) and for Linux (e.g. g++)?
I have read through all the MSVS linker and compiler options and found no point of control there.
For g++, the manual is rather too complex for such a question.
The source of this question is this - link
I know from theory and practice that this should either be possible or be restricted by OS policy at some level, because the Core i7 supports all three modes I mentioned above.
Practical background:
The piece of code that created lots of data is here, function Init - and it exhausted my memory when I wanted to keep more than 2-3G of primes on the heap.
Intel x86 CPUs always use some form of segmentation that can't be turned off. In 64-bit mode, code segmentation is limited, but it's still there. Paging is required for both Windows and Linux to work on Intel CPUs (though Linux doesn't use paging on certain other CPU architectures). Paging is also required to enable 64-bit mode on Intel CPUs.
So in other words, on Windows and Linux the OS always uses segmentation and paging, and so does any application run on them, though this is largely transparent. It's not possible to be "compiled+linked for 'segmentation without paging'" as you said in the answer you linked. Maybe the book you referenced is referring to ancient 16-bit versions of Windows (3.1 or earlier), which could run in a mode that supported 80286 CPUs, which didn't have paging. Even then, though, that normally didn't make any difference in how you compiled and linked your applications.
What you are describing is not a function of a compiler, or even a linker.
When you run your program, you get the memory model that is already in use on the system. Your compiled code does not care about the underlying memory model.
However, your program itself can change the memory model IF it starts running in an unprotected processor mode.

Direct CPU Threads or OpenCL

I have searched the various questions (and the web) but did not find a satisfactory answer.
I am curious about whether to use threads to directly load the cores of the CPU or to use an OpenCL implementation. Is OpenCL just there to make multi-processors/cores more portable, meaning the code can be ported to either GPU or CPU, or is OpenCL faster and more efficient? I am aware that GPUs have more processing units, but that is not the question. Is it direct multithreading in code, or using OpenCL?
Sorry I have another question...
If the IGP shares PCIe lanes with the discrete graphics card and its drivers cannot be loaded under Windows 7, I have to assume that it will not be available, even if you only want to use the processing cores of the integrated GPU. Is this correct, or is there a way to access the IGP without drivers?
EDIT: As @Yann Vernier points out in the comment section, I haven't been strict enough with the terms I used. In this post I use the term thread as a synonym for work-item; I'm not referring to CPU threads.
I can't really compare OCL with any other technology that allows using the different cores of a CPU, as I have only used OCL so far.
However, I can offer some input about OCL, especially since I don't really agree with ScottD.
First of all, even though an OCL kernel developed to run on a GPU will also run on a CPU, that doesn't mean it will be efficient. The reason is simply that OCL doesn't work the same way on a CPU as on a GPU. To get a good understanding of how it differs, see chapter 6 of "Heterogeneous Computing with OpenCL". To summarize, while the GPU will launch a bunch of threads within a given workgroup at the same time, the CPU executes the threads of a workgroup one after another on a core. See also section 3.4 of the standard about the two different types of programming models supported by OCL. This can explain why an OCL kernel can be less efficient on a CPU than "classic" code: because it was designed for a GPU. Whether a developer targets the CPU or the GPU is not a question of "serious work" but simply depends on which programming model best suits the need. Also, the fact that OCL supports CPUs as well is nice, since the code can degrade gracefully on a computer not equipped with a proper GPU (though such computers must be getting hard to find).
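To make the work-item point concrete, here is a trivial, made-up kernel in OpenCL C (a C dialect); each work-item handles one element, and the same kernel is scheduled very differently by a GPU runtime and a CPU runtime:

    /* Each work-item processes one element. On a GPU the work-items of a
     * work-group execute together; a CPU runtime typically runs them one
     * after another on a core, which is why a kernel tuned for a GPU is
     * not automatically efficient on a CPU. */
    __kernel void scale(__global const float *in, __global float *out, float factor)
    {
        size_t i = get_global_id(0);
        out[i] = in[i] * factor;
    }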
Regarding the AMD platform, I've noticed some problems with the CPU as well, on a laptop with an ATI GPU. I observed low performance on some of my code, and crashes as well. But the reason was that the processor was an Intel one: the AMD platform will declare a CPU device to be available even if it is an Intel CPU, but it won't be able to use it as efficiently as it should. When I ran the exact same code targeting the CPU after installing (and using) the Intel platform, all the issues were gone. That's another possible reason for poor performance.
Regarding the iGPU, it does not share PCIe lanes; it is on the CPU die (at least for Intel), and yes, you need the driver to use it. I assume that you tried to install the driver and got a message like "your computer does not meet the minimum requirements..." or something similar. I guess it depends on the computer, but in my case I have a desktop equipped with an NVIDIA card and an i7 CPU (which has an HD4000 GPU). In order to use the iGPU I first had to enable it in the BIOS, which then allowed me to install the driver. Of course, only one of the two GPUs is used by the display at a time (depending on the BIOS setting), but I can access both with OCL.
In recent experiments using the Intel OpenCL tools, we found that OpenCL performance was very similar to CUDA and intrinsics-based AVX code on gcc and icc -- way better than earlier experiments (some years ago) where we saw OpenCL perform worse.

Why doesn't Linux use the hardware context switch via the TSS?

I read the following statement:
The x86 architecture includes a specific segment type called the Task State Segment (TSS), to store hardware contexts. Although Linux doesn't use hardware context switches, it is nonetheless forced to set up a TSS for each distinct CPU in the system.
I am wondering:
Why doesn't Linux use the hardware support for context switch?
Isn't the hardware approach much faster than the software approach?
Is there any OS which does take advantage of the hardware context switch? Does windows use it?
Lastly, and as always, thanks for your patience and replies.
-----------Added--------------
http://wiki.osdev.org/Context_Switching got some explanation.
People as confused as me could take a look at it. 8^)
The x86 TSS is very slow for hardware multitasking and offers almost no benefit compared to software task switching. (In fact, I think doing it manually beats the TSS a lot of the time.)
The TSS is also known for being annoying and tedious to work with, and it is not portable, even to x86-64. Linux aims to work on multiple architectures, so they probably opted for software task switching because it can be written in a machine-independent way. Also, software task switching provides a lot more power over what can be done and is generally easier to set up than the TSS is.
I believe Windows 3.1 used the TSS, but at least the NT >5 kernel does not. I do not know of any Unix-like OS that uses the TSS.
Do note that the TSS is mandatory. What OSes do, though, is create a single TSS entry (per processor), and every time they need to switch tasks, they just change out this single TSS. Also, the only fields of the TSS used by software task switching are ESP0 and SS0, which tell the CPU how to get to ring 0 from ring 3 code on an interrupt. Without a TSS, there would be no known ring 0 stack, which would of course lead to a GPF and eventually a triple fault.
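To illustrate, here is a hedged sketch in C of the 32-bit TSS layout (field names are mine, not from any particular kernel), showing the only fields a software-switching kernel actually touches:

    #include <stdint.h>

    /* Partial sketch of the 32-bit x86 TSS. A kernel that switches tasks in
     * software keeps one of these per CPU and only ever rewrites esp0/ss0:
     * the stack the CPU loads when an interrupt or syscall moves from
     * ring 3 to ring 0. */
    struct tss32 {
        uint32_t prev_task_link;  /* used only by hardware task switching */
        uint32_t esp0;            /* ring-0 stack pointer, updated on task switch */
        uint32_t ss0;             /* ring-0 stack segment selector */
        /* ... the remaining fields (cr3, eip, eflags, general-purpose and
         * segment registers, I/O permission bitmap base) exist for hardware
         * task switching and are left alone ... */
    };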
Linux used to use HW-based switching, in the pre-1.3 timeframe iirc. I believe sw-based context switching turned out to be faster, and it is more flexible.
Another reason may have been minimizing arch-specific code. The first port of Linux to a non-x86 architecture was Alpha. Alpha didn't have TSS, so more code could be shared if all archs used SW switching. (Just a guess.) Unfortunately the kernel changelogs for the 1.2-1.3 kernel period are not well-preserved, so I can't be more specific.
Linux doesn't use a segmented memory model, so this segmentation specific feature isn't used.
x86 CPUs have many different kinds of hardware support for context switching, so the distinction isn't hardware vs software, but more how does an OS use the various hardware features available. It isn't necessary to use them all.
Linux is so efficiency focussed that you can bet that someone has profiled every option that is possible, and that the options currently used are the best available compromise.

Highly concurrent multi-threaded application requires hardware

I am looking for hardware that must run about 256 computationally intensive real-time concurrent tasks in 24-hour mode (one multi-threaded C application). Each task takes about 40-50 MFLOPs, so all the tasks together require about 10 GFLOPs. CPU-RAM speed is insignificant. All tasks must be managed by a Linux kernel (32-bit, with SMP).
I am looking for a single-mainboard solution with one multi-core CPU (if such a CPU exists). If such a CPU doesn't exist, then I need a multi-socket mainboard solution (with multiple CPUs).
Can you please recommend a professional CPU/mainboard solution that will satisfy these requirements? It is also very important that there are no issues with the Linux kernel (2.6.25). No virtualization, no need for huge RAM or CPU cache. I would also prefer Intel architecture and well-proven stability. I still have doubts that this is feasible at all.
Thank you in advance.
UPDATE:
I think I have found a right answer here and here.
UltraSPARC T2 has 8 cores with 8 threads each. Integrated high-bandwidth memory and IO. The T5140 carries two of them for 128 hardware threads.
The theoretical max raw performance of the 8 floating point units is 11 Giga flops per second (GFlops/s). A huge advantage over other implementations however is that 64 threads can share the units and thus we can achieve an extremely high percentage of theoretical peak. Our experiments have achieved nearly 90% of the 11 Gflop/s. - (http://blogs.oracle.com/deniss/entry/floating_point_performance_on_the)
Rent some Amazon EC2 nodes.
Updated: How about PS3s, then? NASA uses them for their simulation engines.
Maybe use CPU+GPU's in commercial servers?
Build it around FPGAs: nowadays, some variants include processors that can run Linux.
Even though you've given us the specs you think you need, we might be able to help you out better if you tell us what the application is intended to accomplish, and how it was implemented.
There may be a better way to split the work up or deal with it rather than your current solution.
Not Intel architecture but these run linux and have 64 cores on a single die.
TILEPro64
Get a bunch of four- or eight-core machines and split the processing across the machines using some sort of grid or clustering software. Maybe have a look at Beowulf.
As you mentioned, 10 GFLOPs isn't exactly to be sneezed at, so a single machine will be expensive. There's also the problem of what you do when the machine breaks: you're unlikely to have a second machine of similar spec available. If you build a cluster using commodity hardware, you're a little more resilient and it's easier to find replacement machines.
MFLOPS and GFLOPS are very poor indicators of how well a program can run on any given CPU. These days, cache footprint is much more important; perhaps branch prediction accuracy as well.
There's almost no way to gauge performance of a given application on different architectures without actually giving it a spin. And even then, you may not get a good idea if you were unlucky enough to unknowingly build with compiler options that ruined your cache footprint, or used a bad threading library, or any of a hundred other things.
I see you'd prefer Intel, but if you need one chip, I will again suggest the Cell processor - its theoretical peak performance is around 25 GFlops, and kernel 2.6.25 already had support for it.
You could try a pre-slim PlayStation 3 for experimenting (that would cost you little) or get yourself a server-based solution at around US$8K - you will have to rewrite and fine-tune your threads to take advantage of the SPU co-processors there, but you could achieve your computational needs without breaking a sweat with a single Cell (1 PPC core + 8 SPUs).
NB: with a PlayStation 3, you'd have only 6 available co-processors - but you don't seem to be on a tight budget with this project -
So you could at least try IBM's Cell developer kit, which offers an emulator, to see if you can code your solution to run on it.
There are commercially available Cell products, both as stand-alone servers in blade form factor and as PCI Express add-on boards for PC workstations, from
Mercury Computer Systems:
http://www.mc.com/microsites/cell/products.aspx?id=6986
Mercury does not list any prices on the site, but the pricing seems to be around the previously mentioned US$8,000.00 for these PCI Express cards.
A PlayStation 3 console can be purchased for about US$300.00 - and would allow you to prototype your application and check whether it is up to the needed performance. (I myself got one and have Fedora 9 running on it, although I did that as a hobbyist and have not, so far, used it for any calculations. I also put together a 12-machine PlayStation 3 cluster for molecular simulations at the local university. The application they ran did not take advantage of the multimedia SPUs while I was in touch with them. But even so, clocked at 3.2 GHz they performed better than standard, similarly priced PCs, even considering that PS3s are priced 5x higher around here.)

Is there any advantage for developing on a 64 bit OS?

I'm not sure I understand it properly: does a 64 bit OS run/compile code faster than a 32 bit OS on the same system?
We're using 64 bit OSs where I am and it seems to only cause compatibility issues with legacy and proprietary software. (We're running Ubuntu 9.04 Jaunty amd64)
I will restrict this answer to x86-32 (IA-32) vs x86-64 (AMD64), as I believe that's the question you're actually asking.
At the processor level, there are a few advantages. First and most obvious is the expansion of per-process virtual memory to a much wider range of 48 bits. (64 is allowed by the architecture but not required, if memory serves.) That enables applications to use a lot more of the system's memory, and opens up a lot of space for things like memory-mapped files that operate on virtual memory which isn't backed by real memory. It also gives the OS a lot more space to work in, as it doesn't have to share your 4 GB limit for its data. In short, applications and the OS can make better use of your machine's resources.
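As a rough illustration (a hypothetical example, assuming x86-64 Linux and default overcommit settings), a 64-bit process can reserve far more virtual address space than the machine has RAM; pages are only backed when they are actually touched:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = (size_t)1 << 40;   /* 1 TiB of address space, not of RAM */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");             /* a 32-bit process cannot do this */
            return 1;
        }
        printf("reserved %zu bytes of virtual memory at %p\n", len, p);
        munmap(p, len);
        return 0;
    }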
Additionally, the AMD64 architecture addresses one of the biggest problems of IA-32, which is the utter lack of registers. In fact it doubles the available registers, which is a huge win for some types of code. (Actually it's a win for almost ANY code, but some applications suffer from the increased memory cost of 64 bits and it evens out.)
On the Windows side, MS has taken it as an opportunity to break a whole bunch of historical compatibility problems. It's not a clean break from the old world, but it's a start. I don't believe Linux suffers from the same problems to begin with, and I don't have much perspective to offer on its 64-bit advantages.
As a general rule, developing--or using--a 64-bit operating system, in any context, will be slower than the same 32-bit operating system. Because all pointers are suddenly twice as large, you are far more likely to blow the cache, and can fit less data in RAM. That slows down your application considerably. You normally would only use 64-bit systems when your applications need to address more than 2 to 3 GB of data simultaneously--something very common in scientific computing and some database situations, but otherwise extremely rare. This is why Apple does not advocate unconditionally compiling PowerPC applications in 64-bit mode, for example: the cost due to cache-misses and lack of memory are high enough that going 64-bit only makes sense when you truly can take advantage of the 64-bit space.
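A small, hypothetical C example of that pointer-growth cost: the same structure takes twice the space (and therefore twice the cache) when built for 64-bit.

    #include <stdio.h>

    /* A linked-list node: one pointer plus one int. */
    struct node {
        struct node *next;   /* 4 bytes on IA-32, 8 bytes on x86-64 */
        int value;           /* 4 bytes either way */
    };

    int main(void) {
        /* Typically prints 8 when compiled with -m32 and 16 with -m64
         * (12 bytes of data rounded up to the 8-byte pointer alignment). */
        printf("sizeof(struct node) = %zu\n", sizeof(struct node));
        return 0;
    }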
But x86 v. AMD64, which is what you're really asking about (since you're discussing Ubuntu), is a very special beast. AMD64 not only extends all pointers to 64-bit; it fixes many, many deficiencies in the x86 architecture, doubling the number of GPRs, simplifying the instructions to be more friendly to modern CPU designs, and more. Because of this, on AMD64 platforms only, you will frequently see a substantial performance boost by going to 64-bit.
There is one other area where, in software development, it makes sense to go to 64-bit: you need to run lots of VMs. Running a couple of VMs can easily blow you past the 3 GB memory barrier of the operating system, making using them very painful. (It will work thanks to a technology called PAE, or Physical Address Extension, which Intel introduced to bridge the gap between 32-bit and 64-bit systems, but the result is slow, painful to work with as a developer, and not very well supported on Windows.) Going to a 64-bit OS can provide tremendous benefits.
(As the commentators note, this answer is somewhat generic, some of these points do not apply to intel/amd chips.)
The answer is: it varies, for a few reasons:
With larger-width instructions, you're going to get more expressiveness (either a greater variety of instructions or a greater capacity to encode data into those instructions directly), which can mean a reduced number of instructions flowing through the machine, which is generally a win: so ++64bit here.
But sometimes larger instructions might take more cycles to decode and execute, because they may be more complex. So a possible --64bit here.
Also, you need to transfer these instructions to and from the CPU: 64 bit instructions are twice as big as 32 bit instructions, which means more traffic to and from memory and the caches. CPUs are structured to ameliorate a lot of this cost, but it is a slight --64bit here.
More registers are usually available in wider instruction sets, which means less data traffic to and from the stack and/or memory. So ++64bit here.
And as everyone's no doubt going to mention, you have the ability to address more memory.
(Nearly forgot this one) the native "long" or "int" size may go up, depending on architecture, meaning data structures based on these get larger. Larger = more memory to move around, which means more possible waiting on data moving: --64bit if you're not careful.
Depending on your architecture, a lot of other concerns may apply too. You can rest assured that the processor and compiler vendors are working their butts off to reduce the "--"s above and increase the "++"s.
I have this 5GByte database that needs converting. On a 64-bit system, I just put all data in collections. In the 32-bit system, I had to think about the order in which to load and convert. The problem is not run-time, it is engineering time. Switching to 64 bit saves weeks of development time.
The compatibility issues: that's no bug, that's a feature. It shows you who has written clean software.
There are also some security advantages to using 64-bit operating systems. There have been some buffer overflow exploits that circumvent address space layout randomization by brute force. On a 64-bit OS, there are simply too many addresses for this kind of attack to be successful.
It will speed up compilation if your compile process is memory-bound and you use your 64bit OS to increase the amount of memory usable by your system.
I expect it to be slightly slower; I had that experience with FC10. I don't have concrete reasons, but it is definitely not the sizeof(pointer) issue. (*)
My own hunch is that it simply is a matter of less optimized drivers or tweaked chipsets.
Also, NTFS-3g behaved oddly under 64-bit, while it worked under 32-bit (same distro, same kernel, same partition; it just "hung" in some circumstances).
(*) Most compiling is disk-bound, not CPU-bound. Moreover, there are other improvements in the x86_64 architecture that cancel out that factor (better PIC, more registers, SSE2 on by default, 686 cmov on by default) - unless your app does nothing but randomly move small blocks around.
