Control over memory virtualization on Linux and Windows

This is a semi-theoretical question.
Can I specify the memory virtualization mode (pure segmentation / segmentation+paging / paging only) when compiling for Windows (e.g., MSVS12 C++) and for Linux (e.g., g++)?
I have read through all the MSVS compiler and linker options and found no way to control this.
For g++, the manual is far too complex to answer such a question from.
The source of this question is this - link
I know from theory and practice that this should either be possible or be restricted by OS policy at some level, because the Core i7 supports all three modes mentioned above.
Practical background:
The piece of code that creates lots of data is here, in the function Init; it exhausted my memory when I wanted to keep more than 2-3 GB of primes on the heap.

Intel x86 CPUs always use some form of segmentation that can't be turned off. In 64-bit mode, code segmentation is limited, but it's still there. Paging is required for both Windows and Linux to work on Intel CPUs (though Linux doesn't use paging on certain other CPU architectures), and paging is also required to enable 64-bit mode on Intel CPUs.
So in other words, on Windows and Linux the OS always uses segmentation and paging, and so does any application run on them, though this is largely transparent. It's not possible for code to be "compiled+linked for 'segmentation without paging'" as you said in the answer you linked. Maybe the book you referenced is referring to ancient 16-bit versions of Windows (3.1 or earlier), which could run in a mode that supported 80286 CPUs, which didn't have paging. Even then, though, that normally made no difference in how you compiled and linked your applications.
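As a small illustration that segmentation is still present even in 64-bit mode, the segment selector registers can be read from user space. This is a minimal sketch, assuming an x86-64 GCC or Clang toolchain on Linux (the inline assembly is not portable):

    #include <cstdint>
    #include <cstdio>

    int main() {
        std::uint16_t cs = 0, ss = 0;
        // The selector registers are still loaded in 64-bit mode, even
        // though most segment bases and limits are ignored there.
        asm volatile("mov %%cs, %0" : "=r"(cs));
        asm volatile("mov %%ss, %0" : "=r"(ss));
        std::printf("CS selector: 0x%04x, SS selector: 0x%04x\n", cs, ss);
        return 0;
    }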

What you are describing is not a function of the compiler, or even of the linker.
When you run your program, you get the memory model that is already in use on the system; your compiled code does not care about the underlying memory model.
However, your program itself can change the memory model IF it starts running in an unprotected processor mode.

Related

Environmental performance parameters for applications on Linux

I have two physical, "identical" Linux RedHat servers. I ran a small program on both of them, and the CPU usage of the program differs between the two servers. I am not a Linux expert and am wondering what could lead to that performance difference.
I wrote the program in both C++ and Java to see whether the inconsistency comes from the chosen programming language. The program itself does a bit of integer calculation over time to consume a constant amount of CPU time. Both versions show the same percentage difference in CPU usage.
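For reference, a load generator of the kind described might look like the following minimal C++ sketch (a hypothetical reconstruction, not the asker's actual code): it performs a fixed amount of integer work per 100 ms slice, then sleeps for the remainder of the slice, so CPU usage stays roughly constant.

    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <thread>

    int main() {
        volatile std::uint64_t acc = 0;
        for (int slice = 0; slice < 600; ++slice) {      // ~60 s total
            auto deadline = std::chrono::steady_clock::now()
                          + std::chrono::milliseconds(100);
            for (std::uint64_t i = 0; i < 10000000; ++i) // fixed work per slice
                acc = acc + i * 31 + 7;
            std::this_thread::sleep_until(deadline);     // idle out the slice
        }
        std::cout << acc << '\n';  // keep the work from being optimized away
        return 0;
    }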
The environmental variables I have already considered and can exclude:
identical server type
identical processor (both have two sockets, single core)
Intel Hyper-Threading Technology enabled on both
identical clock speed
identical OS version (Red Hat Enterprise Linux Server release 5.9)
identical Java version, Java RE, JVM
Intel Demand Based Switching can be ignored, since the measurement tool uses the default clock speed for CPU capacity
processor affinity can be excluded as well, I think: I ran multiple measurement series and always got exactly the same CPU usage values
Is there perhaps a C library or something similar that affects the CPU usage of C++ and Java programs and is updated separately from the OS itself? Or could the two machines have different thread schedulers?
There are a variety of things that can differ even for "identical" systems: different compilers (and different compiler versions) used to build the various libraries, for example; there are continuous improvements from generation to generation in the ability of Intel compilers to optimize. Other differences can occur due to airflow: one machine running hotter than the other will occasionally drop in frequency. A whole host of other issues can cause identical systems to run differently.
Here's my recommendation: create an OS image and use that same image for both systems. Disconnect both from any network. Run compute-bound (which you are). Bind your app to a specific core (see the sketch below). Verify that the exit air temperatures are well within specification. Disable any turbo capability. If there are still differences, do a memory speed check.
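To pin the process to a core on Linux you can either launch it under taskset or set the affinity programmatically. A minimal sketch using the Linux-specific sched_setaffinity call (the benchmark body is left as a placeholder):

    #include <sched.h>
    #include <cstdio>

    int main() {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);  // run only on core 0
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            std::perror("sched_setaffinity");
            return 1;
        }
        // ... run the benchmark workload here ...
        return 0;
    }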
Also, use a more sophisticated profiling and analysis tool such as Intel VTune. You can dig into actual cycles, measure cache misses, branch mispredicts, etc. Those should also be identical; if they aren't, the analysis should give you an idea of where the problem lies.

What is the difference between kmemleak and kmemcheck, and how can these tools be enabled on Android?

Does either tool (kmemleak or kmemcheck) have a particular use or advantage over the other? Can I enable these tools on the Android operating system (rather than a desktop Linux OS)? Please guide me on how to do so.
Ref: https://www.kernel.org/doc/Documentation/kmemcheck.txt
https://www.kernel.org/doc/Documentation/kmemleak.txt
Kmemleak and kmemcheck perform different tasks; neither is better than the other.
1. Kmemleak checks whether memory blocks were allocated by the kernel but never freed (that is, it checks for memory leaks in the kernel, hence the name). The performance overhead is usually acceptable. See the sketch after this list for how its debugfs interface can be driven.
2. Kmemcheck checks whether kernel code accesses uninitialized memory. Example: the kernel allocates a structure, does not fill it with values, and then reads something from that structure; kmemcheck should detect that. Kmemcheck does not check for memory leaks, by the way. It often slows the system down so much that graphical environments are unusable, and the boot process may become very slow (and may even fail).
3. If I am not mistaken, kmemleak works at least on x86 and ARM; kmemcheck is x86 only.
4. Unfortunately, I cannot say how to enable kmemleak on Android; I have only used it on desktop Linux systems.
5. Depending on what you are trying to accomplish, there may be tools that suit your needs better. For example, the Linux kernel has a variety of debugging features that can be enabled at build time. Again, I have no experience with the Android kernel in this regard.
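As an illustration of point 1, kmemleak's debugfs interface can be driven from user space. A minimal sketch, assuming a kernel built with CONFIG_DEBUG_KMEMLEAK, debugfs mounted at /sys/kernel/debug, and root privileges:

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        {
            std::ofstream ctl("/sys/kernel/debug/kmemleak");
            if (!ctl) {
                std::cerr << "kmemleak interface not available\n";
                return 1;
            }
            ctl << "scan\n";  // request an immediate leak scan
        }
        // Read back any "unreferenced object" reports.
        std::ifstream report("/sys/kernel/debug/kmemleak");
        std::string line;
        while (std::getline(report, line))
            std::cout << line << '\n';
        return 0;
    }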

Direct CPU Threads or OpenCL

I have searched the various questions (and the web) but did not find a satisfactory answer.
I am curious about whether to use threads to directly load the CPU cores or to use an OpenCL implementation. Is OpenCL just there to make multi-processor/multi-core code more portable, meaning the same code can be ported to either GPU or CPU, or is OpenCL faster and more efficient? I am aware that GPUs have more processing units, but that is not the question. Is it direct multithreading in code, or using OpenCL?
Sorry, I have another question...
If the IGP shares PCI lanes with the discrete graphics card and its drivers cannot be loaded under Windows 7, I have to assume that it will not be available, even if you only want to use the processing cores of the integrated GPU. Is this correct, or is there a way to access the IGP without drivers?
EDIT: As @Yann Vernier points out in the comment section, I wasn't strict enough with the terms I used. In this post I use the term "thread" as a synonym for "work-item"; I am not referring to CPU threads.
I can't really compare OCL with other technologies for using the different cores of a CPU, as I have only used OCL so far.
However, I can offer some input about OCL, especially since I don't entirely agree with ScottD.
First of all, even though an OCL kernel developed to run on a GPU will also run on a CPU, that doesn't mean it will run efficiently. The reason is simply that OCL doesn't work the same way on a CPU as on a GPU. For a good understanding of how they differ, see chapter 6 of "Heterogeneous Computing with OpenCL". To summarize: while a GPU launches a bunch of threads within a given work-group at the same time, a CPU executes the threads of a work-group one after another on a single core. See also section 3.4 of the standard, which describes the two different programming models OCL supports. This can explain why an OCL kernel may be less efficient on a CPU than "classic" code: because it was designed for a GPU. Whether a developer targets the CPU or the GPU is not a question of "serious work"; it simply depends on which programming model best suits the need. Also, the fact that OCL supports the CPU is nice, since code can degrade gracefully on a computer without a proper GPU (though it must be hard to find such a computer these days).
Regarding the AMD platform, I've noticed problems with the CPU as well, on a laptop with an ATI GPU: I observed low performance and crashes in some of my code. The reason was that the processor was an Intel one. The AMD platform declares a CPU device as available even when the CPU is from Intel, but it won't be able to use it as efficiently as it should. When I ran the exact same code targeting the CPU after installing (and using) the Intel platform, all the issues were gone. That's another possible cause of poor performance.
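To check which vendor platform actually exposes your CPU, you can enumerate the available platforms and devices. A minimal sketch using the standard OpenCL C API (link with -lOpenCL; error checking omitted for brevity):

    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    int main() {
        cl_uint nplat = 0;
        clGetPlatformIDs(0, nullptr, &nplat);
        std::vector<cl_platform_id> plats(nplat);
        clGetPlatformIDs(nplat, plats.data(), nullptr);
        for (cl_platform_id p : plats) {
            char pname[256] = {0};
            clGetPlatformInfo(p, CL_PLATFORM_NAME, sizeof(pname), pname, nullptr);
            std::printf("Platform: %s\n", pname);
            cl_uint ndev = 0;
            clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &ndev);
            std::vector<cl_device_id> devs(ndev);
            clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, ndev, devs.data(), nullptr);
            for (cl_device_id d : devs) {
                char dname[256] = {0};
                cl_device_type type = 0;
                clGetDeviceInfo(d, CL_DEVICE_NAME, sizeof(dname), dname, nullptr);
                clGetDeviceInfo(d, CL_DEVICE_TYPE, sizeof(type), &type, nullptr);
                std::printf("  %s device: %s\n",
                            (type & CL_DEVICE_TYPE_GPU) ? "GPU" : "CPU/other",
                            dname);
            }
        }
        return 0;
    }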
Regarding the iGPU: it does not share PCIe lanes, it is on the CPU die (at least for Intel), and yes, you need the driver to use it. I assume that you tried to install the driver and got a message like "your computer does not meet the minimum requirements..." or something similar. I guess it depends on the computer, but in my case I have a desktop equipped with an NVIDIA card and an i7 CPU (which has an HD 4000 GPU). In order to use the iGPU I first had to enable it in the BIOS, which then allowed me to install the driver. Of course, only one of the two GPUs drives the display at a time (depending on the BIOS setting), but I can access both with OCL.
In recent experiments using the Intel OpenCL tools, we found OpenCL performance very similar to that of CUDA and of intrinsics-based AVX code under gcc and icc; much better than in earlier experiments (some years ago) where we saw OpenCL perform worse.

What is the difference between x64 and IA-64?

I was on Microsoft's website and noticed two different installers, one for x64 and one for IA-64. Reference: Installing the .NET Framework 4.5, 4.5.1
My understanding is that IA-64 is a subclass of x64, so I'm curious why it would have a separate installer.
x64 is used as shorthand for the 64-bit extensions of the "classical" x86 architecture; almost every "normal" PC produced in recent years has a processor based on this architecture.
AMD invented these extensions as AMD64; Intel was more or less forced to implement them, calling them first IA-32e, then EM64T and finally Intel 64 (the AMD and Intel extensions aren't exactly the same, but they are almost identical).
Many people also call this x86-64, which provides a vendor-independent name and stresses the fact that it's the 64-bit evolution of the x86 architecture. All the "regular" PCs sold with "64-bit processors" run on the x86-64 architecture.
IA-64 (Intel Architecture 64) is an almost completely unrelated 64-bit architecture (also known as Itanium), developed by Intel initially for high-end servers. It was said that Itanium could have been a replacement for the x86 architecture, but it didn't have much success (for various reasons), so it's unlikely that you'll ever need the IA-64 installers.
For more information, you may have a look at the Wikipedia articles on x86-64 and Itanium.
IA-64 is the Intel Itanium Architecture. This is a Very Long Instruction Word (VLIW) processor instruction set.
x86-64 is the usual 64-bit architecture used by the processors in today's laptops and desktops. These are dynamically scheduled processors.
The main difference between the two is this:
With VLIW, the compiler resolves the dependencies between instructions and schedules them appropriately; the processor merely executes them.
With a dynamic processor, the compiler just schedules the instructions without worrying about dependencies; the processor takes care of the dependencies, reorders the instructions, and executes them appropriately.
VLIW code is dependent on each chip's internal architecture, and the compiler needs to know that information. The advantage is that the compiler can extract much more parallelism than a dynamic processor can.
For dynamic processors, the code is independent of each chip's internal architecture; it just needs to follow the instruction set, so code compiled on one machine runs on other machines very easily. The disadvantage is that only limited parallelism can be exploited, and the internal logic and design are far more complex and intricate than in a VLIW design.
Nevertheless, dynamic processors are what consumers (individuals) mostly use today, since they can run code compiled or generated on any machine; VLIW processors have been used by servers and enterprises because of the parallelism they can extract.
They are different.
IA-64 is Itanium, an architecture for servers.
x64 is what 64-bit Intel Core and AMD CPUs implement.
x64 is short for x86-64, which is an extension of the x86 instruction set.
IA-64 is the Itanium 64-bit architecture (by Intel).
IA-64 is for computers running Intel Itanium 64-bit processors. They do not support running 32-bit applications the way x64 processors do. A special version of Windows is needed to run on these processors, hence the two different installers.
They have different instruction sets; that is the key point.
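Because a binary is built for exactly one of these architectures (which is why Microsoft ships separate installers), the target can be checked at compile time through predefined macros. A small sketch using the common GCC/Clang and MSVC macro names:

    #include <iostream>

    int main() {
    #if defined(__x86_64__) || defined(_M_X64)
        std::cout << "built for x86-64 (x64)\n";
    #elif defined(__ia64__) || defined(_M_IA64)
        std::cout << "built for IA-64 (Itanium)\n";
    #elif defined(__i386__) || defined(_M_IX86)
        std::cout << "built for 32-bit x86\n";
    #else
        std::cout << "built for some other architecture\n";
    #endif
        return 0;
    }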

What is the Maximum Java Heap Space for SuSE Linux

This question is related to Java Refuses To Start - Could Not Reserve Enough Space for Object Heap and should be easy enough to figure out. However, my searches haven't yielded anything useful.
Essentially, we have two 32-bit OSes (RedHat and SuSE) on different machines with the same hardware. Both use the same JVM and execute the same command line. RedHat works perfectly fine, but SuSE reports that there isn't enough memory.
We just need to know whether this is a limitation of the version of SuSE we're using or whether it's something else.
'cat /proc/version' gives us:
Linux version 2.6.5-7.244-bigsmp (geeko@buildhost) (gcc version 3.3.3 (SuSE Linux)) #1 SMP Mon Dec 12 18:32:25 UTC 2005
'uname -a' gives us the following on BOTH types of machines:
UTC 2005 i686 i686 i386 GNU/Linux
The JVM memory limit is related to the largest free contiguous block available, not to the total amount of free memory. The limit varies from about 1.4 GB to a bit over 2.0 GB and depends on where your operating system puts various things in memory. I don't know the particulars of where RedHat or SuSE load things into memory, but it could be that SuSE maps some library to an address in the middle of the address space, where RedHat might map it at the end (speculating).
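To see this effect directly, you can probe the largest single allocation a process can obtain on each machine. A minimal sketch (build and run it as a 32-bit binary, e.g. with g++ -m32, so it measures the 32-bit address space the JVM sees):

    #include <cstdio>
    #include <cstdlib>

    int main() {
        // Binary-search the largest malloc() that succeeds; in a 32-bit
        // process this approximates the largest free contiguous region
        // of address space, which is what bounds the JVM heap (-Xmx).
        std::size_t lo = 0;
        std::size_t hi = ~static_cast<std::size_t>(0);
        while (hi - lo > (1u << 20)) {          // stop at 1 MiB resolution
            std::size_t mid = lo + (hi - lo) / 2;
            void* p = std::malloc(mid);
            if (p) { std::free(p); lo = mid; }  // mid bytes fit contiguously
            else   { hi = mid; }
        }
        std::printf("largest single allocation: ~%zu MiB\n", lo >> 20);
        return 0;
    }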
And remember that your actual memory usage in Java is more than what you specify for -Xmx. The other memory settings also add to the size of your process beyond the heap (permgen, for example), so it could also be that the perm space on SuSE has a larger default than on RedHat.
Also, depending on the memory allocation profile of your application, you might get away with a smaller heap size and different garbage collection options. There are some details here (http://java.sun.com/performance/reference/whitepapers/tuning.html) and in other places. For example, if you allocate a lot of small, temporary blocks, you'll want different GC settings than if you have a lot of big, long-lived objects.
Regarding the linked question: why not just use RedHat? That might be a simplistic solution, but I guarantee it will fix your problem faster than delving deep into the arcane world of Java tuning and OS memory management :P
Firstly, you are crazy to be running a 32-bit OS when you have this much address space pressure. Migrate to a 64-bit JVM on 64-bit Linux. How much time have you wasted already trying to diagnose this problem, which you must have suspected from the outset would go away with the larger address space of a 64-bit system?
Secondly, it's well known that out of all the Linux vendors Red Hat has the most kernel engineers on staff and makes some serious tweaks for the kernels in their RHEL products. These often include patches for large workloads like yours (well, it's a large workload for a 32-bit system, it's nothing special on 64-bit). So there's some chance the reason ultimately is that RHEL has other customers doing the same crazy stuff as you and you're benefiting from work they did to support those customers.
Finally though, since I suspect you're going to insist on trying to find a way to do this on 32-bit SuSE I will point out that Linux offers a variety of address space trade-offs on 32-bit x86, and it's possible (but not certain) that your SuSE systems just have a different trade-off selected. If you can bring up the configuration of the running kernels (often in /boot/config....) then you can compare settings like HIGHMEM.
The conventional option until a few years ago was a 2:2 split, that is, userspace is limited to 2 GiB of address space; it is easy to program for and has decent efficiency, but in this scenario you obviously can't have your requested heap, since it would leave no space for the program text, stack, etc. More recently the trend has been a 3:1 split (similar to the Windows /3GB switch), which expands the userspace address space at the cost of cramming the OS kernel itself into less space, which can cause its own problems. This might work, but it would be very cramped, so I wouldn't be surprised if it didn't work for your jobs either. Finally, newer Linux kernels also offer an option giving you a full 4 GiB of 32-bit userspace, which might be enough to make your jobs run reliably, at a significant performance cost, since userspace and kernel addresses then obviously can't coexist.
To try this you'd need a new kernel. You may be able to just install one provided by SuSE (see if they offer others to choose from, e.g. a "PAE" option) or you may have to compile your own, in which case it probably invalidates your support contract.
But really, you should just go with option 1, switch to a 64-bit JVM and put your feet up.
