Linux memory/disk behavior during ELF execution - linux

I have a series of executables running on a cluster that read in some small config files at execution start, and then do a lot of processing for several hours, and then write out some data and exit. Our sysadmin is trying to tell me that our fileserver is really slow because his analysis is showing that the cluster nodes are spending all their time using NFS disk I/O reading in ELF executables for execution, long after they've been spawned (note: our executables are only a few MB in size). This doesn't sound right to me, as I was under the impression that the dynamic linker loaded the entire executable into memory at runtime and then operated out of memory. I know that the kernel leaves an open file descriptor to the executable while it's running, but I didn't think it was continually reading from it.
My question is, is my understanding of how executables are loaded flawed? I find it hard to believe that the kernel is constantly doing file reads on the executable to fetch instructions, as this would be terribly slow (even with caching) because branch predictions are hardly reliable, so you'd be spending forever reading the executable from disk if your binary performed frequent jumps.

I was under the impression that the dynamic linker loaded the entire executable into memory at runtime and then operated out of memory.
Your impression is incorrect.
First, a minor inaccuracy: while the dynamic linker is responsible for loading shared libraries, the main executable itself is loaded by the kernel before dynamic loader is started.
Second, most current systems use demand paging. The files are mmaped, but the code isn't actually loaded into memory until that code is accessed (i.e. tries to execute). If you never execute some parts of the program, these parts are never loaded into memory at all.
I find it hard to believe that the kernel is constantly doing file reads on the executable to fetch instructions
It doesn't constantly do that. It typically loads the code into memory and the code stays there.
It is possible for the code to be discarded from memory (which would require reloading it again if it executes again) on a system that doesn't have enough memory (this is called thrashing).
because branch predictions are hardly reliable,
Branch prediction
has approximately nothing to do with your problem, and
is exceedingly good on modern CPUs.

Related

Does executing a binary on a RAMDisk reload the executable into memory?

Let's say I have two copies of the same 10MB binary executable, A and B.
If I have plenty of available memory and run ./A, my understanding is that A will be loaded into memory and run from there. This will take around 10MB of RAM to accomplish.
If I have plenty of available memory, create a RAMDisk, copy B to the RAMDisk, and run ./B from the RAMDisk, my understanding is that B will be (re)loaded into memory and run from there. This will take around 10MB of RAM for the executable, plus the memory in use for the RAMDisk.
Is this correct? Is a RAMDisk smart enough to say "oh, I already have that binary executable in memory, let's just run it in place?" Even if it was, wouldn't the loader have to do its magic to run the thing?
I'm using QNX and running ELF without COFF binaries, but I would appreciate answers for any *Nix system.
I would really expect it to be loaded, typical ELF binaries are really not an "execute in place" format.
There are things you need to do, like relocating any position-independent code and of course dynamic library loading, which the file system on the RAM disk knows nothing about.

How does chroot affect dynamic libraries memory use?

Although there is another question with similar topic, it does not cover the memory use by the shared libraries in chrooted jails.
Let's say we have a few similar chroots. To be more specific, exactly the same sets of binary files and shared libraries which are actually hard links to the master copies to conserve the disk space (to prevent the potential possibility of a files alteration the file system is mounted read only).
How is the memory use affected in such a setup?
As described in the chroot system call:
This call changes an ingredient in the pathname resolution process and does nothing else.
So, the shared library will be loaded in the same way as if it were outside the chroot jail (share read only pages, duplicate data, etc.)
http://man7.org/linux/man-pages/man2/chroot.2.html
Because hardlinks share the same underlying inode, the kernel treats them as the same item when it comes to caching/mapping.
You'll see filesystem cache savings by using hardlinks, as well as disk-space savings.
The biggest issue I'd have with this is that if someone manages so subvert the read-only nature of one of the chroot environments, then they could subvert all of them by making modifications to any of the hardlinked files.
When I set this up, I copied the shared libraries per chroot instead of linking to a read-only mount. With separate files, the text segments were not shared. It's likely that the same inode will map to the same read-only text segment, but this may vary with available memory management hardware and similar architectural details.
Try this experiment on your system: write a small program that makes some minimal use of a large shared library. Run twenty or thirty chroot jails as you describe, each with a running copy of the program. Check overall memory usage before & during running, and dissect one instance to get a good text/data segment breakdown. If memory use increases by the full size of the map for each instance, the segments are not shared. Conversely, if memory use goes up by a fraction of the map, the segments are shared.

Executing a ELF binary without creating a local file?

Is it possible to execute a binary file without copying it to hard drive?
I know /lib/ld-linux.so.2 can load arbitrary binaries, but that still requires a locally stored file, my thought was to allocate a memory region, duplicate the contents to memory and execute it.
So is that possible?
my thought was to allocate a memory region, duplicate the contents to memory and execute it.
For a statically-linked a.out, you can do just that.
For a dynamically-linked one, you'll need something like dlopen_phdr.
It is possible but very difficult. I worked on a similar problem years ago here. (Caution! That code is incomplete and contains serious bugs.)
The hard part is to make sure that your process's address space requirements do not conflict with the binary's or the dynamic linker's (e.g., /lib/ld-linux.so.2). If you control both programs' memory layout (because you linked them) you can avoid that problem. (My code does not assume such control, and it takes pains to move itself out of the way if necessary.)
If there are no address space conflicts, it is a matter of allocating memory for the exe's PT_LOAD segments (and the linker's, if any), aligning and copying the segments in, protecting the read-only segments, allocating stack, initializing stack and registers the way the kernel does, and jumping to the exe's or linker's entry address.

How many copies of program/class gets loaded into memory when multiple users accessing it at the same time

We are trying to setup Eclipse in a shared environment, i.e., it will be installed on a server and each user connects to it using VNC. There are different reasons for sharing Eclipse, one being proper integration with ClearCase.
We identified that Eclipse is using large amounts of memory. We are wondering whether the Eclipse(JVM?) loads each class once per user/session or whether there is any sort of sharing of objects that are already loaded into memory?
This makes me think about a basic question in general. How many copies of a program gets loaded into memory when two or more users are accessing the host at the same time.
Is it one per user or a single copy is shared between users?
Two questions here:
1) How many copies of a program gets loaded into memory when two or
more users are using it at the same time?
2) How does the above holds in the world of Java/JVM?
Linux allows for sharing binary code between running processes, i.e. the segments that hold executable parts of a program are mapped into virtual memory space of each running copy. Then each process gets its own data parts (stack, heap, etc.).
The issue with Java, or almost any other interpreted language, is that run-time, the JVM, treats byte-code as data, loading it into heap. The fact that Java is half-compiled and half interpreted is irrelevant here. This results in a situation where the JVM executable itself is eligible for code sharing by the OS, but your application Java code is not.
In general, a single copy of a program (i.e. text segment) is loaded into RAM and shared by all instances, so the exact same read-only memory mapped physical pages (though possibly/probably mapped to different addresses in different address spaces, but it's still the same memory). Data is usually private to each process, i.e. each program's data lives in separate pages RAM (though it can be shared).
BUT
The problem is that the actual program here is only the Java runtime interpreter, or the JIT compiler. Eclipse, like all Java programs, is rather data than a program (which however is interpreted as a program). That data is either loaded into the private address space and interpreted by the JVM or turned into an executable by the JIT compiler, resulting in a (temporary) executable binary, which is launched. This means, in principle, each Java program runs as a separate copy, using separate RAM.
Now, you might of course be lucky, and the JVM might load the data as a shared mapping, in this case the bytecode would occupy the same identical RAM in all instances. However, whether that's the case is something only the author of the JVM could tell, and it's not something you can rely on in general.
Also, depending on how clever the JIT is, it might cache that binary for some time and reuse it for identical Java programs, which would be very advantageous, not only because it saves the compilation. All instances launched from the same executable image share the same memory, so this would be just what you want.
It is even likely that this is done -- at least to some extent -- on your JIT compiler, because compiling is rather expensive and it's a common optimization.

Why is the startup of an App on linux slower when using shared libs?

On the embedded device I'm working on, the startup time is an important issue. The whole application consists of several executables that use a set of libraries. Because space in FLASH memory is limited we'd like to use shared libraries.
The application workes as usual when compiled and linked with shared libraries and the amount of FLASH memory is reduced as expected.
The difference to the version that is linked to static libs is that the startup time of the application is about 20s longer and I have no idea why.
The application runs on an ARM9 CPU at 180 MHz with Linux 2.6.17 OS,
16 MB FLASH (JFFS File System) and 32 MB RAM.
Bacause shared libraries have to be linked to at runtime, usually by dlopen() or something similar. There's no such step for static libraries.
Edit: some more detail. dlopen has to perform the following tasks.
Find the shared library
Load it into memory
Recursively load all dependencies (and their dependencies....)
Resolve all symbols
This requires quite a lot of IO operations to accomplish.
In a statically linked program all of the above is done at compile time, not runtime. Therefore it's much faster to load a statically linked program.
In your case, the difference is exaggerated by the relatively slow hardware your code has to run on.
This is a fine example of the classic tradeoff of speed and space.
You can statically link all your executables so that they are faster but then they will take more space
OR
You can have shared libraries that take less space but also more time to load.
So decide what you want to sacrifice.
There are many factors for this difference (OS, compiler e.t.c) but a good list of reasons can be found here. Basically shared libraries were created for space reasons and much of the "magic" involved to make them work takes a performance hit.
(As a historical note the original Netscape navigator on Linux/Unix was a statically linked big fat executable).
This may help others with similar problems:
The reason why startup took so long in my case was, that the default setting of the GCC is to export all symbols inside of a library.
A big improvement is to set a compiler setting "-fvisibility=hidden".
All symbols that the lib has to export have to be augmented with the statement
__attribute__ ((visibility("default")))
see gcc wiki
and the very fine article how to write shared libraries
Ok, I have learned now that the usage of shared libraries has it's disadvatages concerning speed. I found this article about dynamic linking and loading enlighting. The loading process seems to be much lengthier than I have expected.
Interesting.. typically loading time for a shared library is unnoticeable from a fat app that is statically linked. So I can only surmise that the system is either very slow to load a library from flash memory, or the library that is loaded is being checked in some way (eg .NET apps run a checksum for all loaded dlls, reducing startup time considerably in some cases). It could be that the shared libraries are being loaded as-needed, and unloaded afterwards which could indicate a configuration problem.
So, sorry I can't help say why, but I think its an issue with your ARM device/OS. Have you tried instrumenting the startup code, or statically linking with 1 of the most commonly-used libraries to see if that makes a large difference. Also put the shared libs in the same directory as the app to reduce the time it takes to search the FS for the lib.
One option which seems obvious to me, is to statically link the several programs all into a single binary. That way you continue to share as much code as possible (probably more than before), but you will also avoid the overhead of the dynamic linker AND save the space of having the dynamic linker on the system at all.
It's pretty easy to combine several executables into the same one, you normally just examine argv and decide which routine to call based on that.

Resources