How does library size affects an application's load time and memory foot print? - linux

Some other people in the office are discussing about reducing the memory footprint and load time by cutting out non-essential parts of an internal library into separate libraries, and load them on-demand.
I did some Googling tonight and learned that ld.so loads libraries with lazy binding, and both ld.so and dlopen loads the library with mmap. This seems to mean that those none-essential parts of the library are already loaded on-demand, i.e. as long as the functions are not used / data pages in the library are not read,
It doesn't take space and
It doesn't take time to load and allocate those pages.
So I thought they just need to make sure they don't actively touch all the non-essential components.
Is this true? Is this true (practically) for all posix systems?

Any library that directly links into your program or is a direct dependency still gets mapped and loaded at run time. To see what libraries will be open, you can use ldd some_program. Some of these will be indirect dependencies that will also be loaded. To see only what the direct depencies are you can use objdump -x my_program | grep NEEDED. Not only does every library have to be read from disk (can be relatively slow), but all the necessary symbols (sometimes many) from them must be mapped.
It may be that one dependency that isn't necessarily needed at startup, is pulling in the majority of your total library size. If you can split apart different capabilities into functional units that can be loaded as modules, it can be straight forward to reduce startup time and to some extent the memory footprint.
The easiest example I can come up with is an image viewer application where the GUI libraries are linked into the binary, but the actual loading of images is done by modules. (feh would be be an example of this using Imlib2 or optionally gdk-pixbuf for the image loading). This allows the main GUI to start up without ever loading libpng, libjpeg, libXPM, libcairo, librsvg, etc... This reduces the startup time because none of those need to be loaded, and may never be loaded, thus reducing the memory footprint.
In C this is commonly done with dlopen(), dlsym() and dlclose() (embedded systems using a flat binary format can use a specially linked module) but most toolkits will have their own module implementation (for instance Gtk+ uses glib's cross-platform module loading).
It doesn't have to be that cut and dry. You could just as easily split your initial GUI into parts and sequentially dlopen() them as an alternative to having a separate "splash screen". I don't know of any example of a toolkit that allows this but one could even open an X11 window with title and dimensions, then load the GUI and associated libraries into that window. You could load different binary format parsers depending on the file's extension/magic value or load different binary format generators depending on what format the user decides to Save_As.
Another alternative is to statically link library dependencies so that only the required symbols get loaded (as long as there are no licensing issues with doing so). Why load up 100+ widgets, when you only use 10? This works best with libraries that are built with static linking in mind such as musl-libc (glibc is notoriously useless for static linking). Compiling with the -flto compiler flag (or -combine -fwhole-program on older versions of gcc) along with -ffunction-sections and -fdata-sections and linking with --gc-sections typically helps (I normally see 15-50% size reduction) This is also a good option if you can't rely on the shared system libraries to be sane.

Related

How can process share shared library with load-time relocation?

When shared library was not complied as PIC, it still can be linked with the executable thorugh load-time relocation.
If I understand correctly, dynamic loader would look for entries listed in relocation table, and modifies them according to the memory mapping. That is, the code of shared library was adapted for the current process during load time.
My question is, how can another process uses the same shared library at the same time, dose the loader guarantee the memory mapping of the two processes are consistent? Or the library cannot be shared, and the OS would just load another copy of shared library into the memory?
When shared library was not complied as PIC, it still can be linked with the executable thorugh load-time relocation.
This statement needs qualifiers: this is only possible on some platforms (e.g. ELF32 ix86), but not others (e.g. ELF64 x86_64 with default medium memory model).
dynamic loader would look for entries listed in relocation table, and modifies them according to the memory mapping. That is, the code of shared library was adapted for the current process during load time.
Correct. Note that any memory pages that the loader had to update become unshared.
how can another process uses the same shared library at the same time
The other process will have its own copies of the loader-updated pages, but will share any unmodified pages with the first process.
dose the loader guarantee the memory mapping of the two processes are consistent?
No.
Or the library cannot be shared, and the OS would just load another copy of shared library into the memory?
Not quite: depending on how many pages need to be modified by the loader, some sharing can still happen.
P.S. When you build the shared library with -fPIC, the number of pages that need updating by the loader is minimized (all the places to be updated are grouped together in the .got section, instead of having these places spread throughout the .text section).

Rebuilding & install Shared library does not impact on process which already loaded that library

I have a question regarding shared library used by multiple processes.
I have a shared library libfoo.so which is used by two different processes, process1 and process2.
The first process (process1) is in the Running state and libfoo.so is loaded in memory. I made some modifications in libfoo.so code, rebuilt and installed, then started process2. The new process2 has loaded the newly-installed library libfoo.so.
But process1 is still running with older libfoo.so. If I restart process1, then it loads the newly-installed libfoo.so as expected.
If the operating system has a single copy of shared library, then why does installing a new shared library not affect currently running processes?
If the operating system has a single copy of shared library, then why does installing a new shared library not affect currently running processes?
First, your entire notion of a single copy is somewhat flawed.
Let's talk about an ELF shared library (the concepts apply to other kinds of libraries as well, though the details differ).
An ELF shared library will usually have at least two loadable segments: a read-only one, and a writable one. The first one contains read-only data, program code (often called .text), relocation sections, etc. The second segment contains initialized, but writable data (often called .data).
When two processes run and use the same libfoo.so, there will be at least three pages of memory used by libfoo.so: at least one page to "cover" the read-only segment (that page will be shared between the two runing processes), and at least one separate page in each process to "cover" the writable segment.
As you can see from this, a single copy of the shared library on disk is also replicated into multiple copies in RAM while the library is used by a running program.
Second, we need to talk about how you update libfoo.so. You could do it in one of two ways:
you could do this:
rm -f libfoo.so; gcc -shared -o libfoo.so foo.o, or
you could do this: gcc -shared -o libfoo.so foo.o.
In the first case, you are not affecting any process that has mmaped libfoo.so at all: the old data for libfoo.so will remain on disk, but not visible to any process which doesn't have it already opened or mmaped. Note also that if libfoo.so is 1GB in size, your disk usage will have gone up by 1GB (both old and new copy are still taking disk space).
In the second case, you are updating libfoo.so in place (this is not recommended, for reasons that will become obvious shortly). The inode number for libfoo.so will stay the same, and the old data will be gone. Your disk usage will stay constant (assuming new libfoo.so is about the same size as the old one).
This will affect any running process, but maybe not in a way you expect. The most likely outcome is that your running process will crash.
Why would it? Think of the library as a book, which has table of contents. During the initial library loading, table of contents will be loaded into RAM, and modified (because shared library can be loaded at arbitrary location in memory). If you now update the book (the library on disk) in such a way that e.g. chapter 3 becomes 3 pages longer, then the table of contents will no longer be valid (at least for chapters 4 through end). Any attempt to follow the pointers in table of contents will land you not at the start of the chapter you are seeking, but in the middle of a chapter. So you would call a function, and land in the middle of a different function. The most likely outcome of this is a crash.
The picture is even more complicated by demand paging. You may have paged in some of the chapters, but not others. So you may not discover that your process is in fact hosed immediately after update. If your library is small, you may not discover this at all.
P.S. Some operating systems prohibit the second form of update: opening a library for writing fails with ETXTBSY if the library is currently being used by some process. Linux does that for some file systems, but not all.

Dynamically loaded library remains loaded despite dlclose

Today I'm looking for some enlightenment of the deep magic inside the dynamic loader. I'm debugging/troubleshooting a plugin system for a C++ application running on Linux. It loads plugins via dlopen (RTLD_NOW | RTLS_LOCAL) and releases them using dlclose. Nothing extraordinary - one would think.
However, I noticed that some plugins remain loaded even after dlclose is successfully called*. I concluded this after looking at the running process' memory map using pmap. Some libs get immediately removed from process memory and others keep lingering around apparently indefinitely.
Continuing on, the dlopen man page states:
The function dlclose() decrements the reference count on the dynamic
library handle handle. If the reference count drops to zero and no
other loaded libraries use symbols in it, then the dynamic library is
unloaded.
That means the problem boils down to these two possibilities; Either the reference counts aren't zero, or other loaded libs are using symbols from some - but not all - of the plugins.
I'm pretty sure (although not 100%) that the reference counts are zero. The application's plugin manager handles all plugins exactly the same. It also makes sure that a plugin doesn't get loaded multiple times. IMO loading & unloading should therefore behave the same for all plugins.
That leaves the second possibility: other loaded libs are using symbols from the plugins. Another typical case of 'That should never happen'. Although it's certainly possible. We're using gcc and default visibility and as far as I've seen nothing is stripped, so tons of symbols are being exported. Actually that worries me much more as these plugins should be independent.
Here then are my open questions at this point:
Are my conclusions so far correct?
Do you know a way to verifydlopen's reference count?
If there are internal symbols of my plugins (accidentally) used by other libs, is there a way to track down who is using which symbols?
My machine is:
Linux 3.13.0-43-generic #72-Ubuntu SMP Mon Dec 8 19:35:44 UTC 2014 i686 i686 i686 GNU/Linux
* I should mention that all of the loading and unloading happens in the main thread, so there should be no multi threading issue here.
other loaded libs are using symbols from the plugins
If other libs are not linked to that shared library at link time, referring to symbols of a shared library does not prevent that shared library from being unloaded.
To debug the run-time linker set environment variable LD_DEBUG to all, e.g. LD_DEBUG=all ./my_app. See man ld.so for details.

Which libraries appear in /proc/$PID/pmaps?

On Linux you can inspect /proc/$PID/pmaps to see the libraries loaded by a particular program, and a program can open /proc/self/pmaps to examine the libraries it itself has loaded.
I know pmaps will only contain dynamic libraries, and obviously the kernel can't predict which libraries we might dlopen at a later point, so I expect those aren't included in /proc/self/maps. But I'm unsure of a few other other scenarios:
Are libraries that have been linked at build time but we haven't called any function in yet included? My understanding is the Linux delays linking symbols until the first time they are used, so I'm not sure if they'll show up.
Does pmaps contain all the libraries used recursively? E.g. if I look at each library in pmaps and run ldd on it, and then run ldd on those, ad nauseum, I shouldn't find any new libraries that weren't in the original pmaps? I tried this on a couple binaries and it appears to be so but maybe I'm getting lucky.
Are libraries that have been linked at build time but we haven't called any function in yet included?
Yes: the runtime loader will mmap every library that your executable directly depends on, before your program starts running.
You can find the list of such libraries by running
readelf -d a.out | grep NEEDED
Does pmaps contain all the libraries used recursively?
Yes: if a library that you directly depend on itself depends on some other library, the runtime loader will mmap the recursive dependencies as well.
My understanding is the Linux delays linking symbols until the first time they are used
That is mosty correct for function symbols, but false for data symbols, which can't be resolved lazily.
Also, whether the symbols are resolved lazily or not depends on LD_BIND_NOW environment variable, and on an equivalent setting in the executable dynamic section, controlled by -znow linker flag.
None of that changes the mmap pciture though; if you have a DT_NEEDED entry for foo.so in your dynamic section, then foo.so will be mmaped (and will show in /proc/self/*map*) independent of lazy or non-lazy resolution.
/proc/$pid/maps is not only going to list libraries that are loaded, but also ALL other mapped memory segments.
Read this thread and the article in there:
Understanding Linux /proc/id/maps

Why is the startup of an App on linux slower when using shared libs?

On the embedded device I'm working on, the startup time is an important issue. The whole application consists of several executables that use a set of libraries. Because space in FLASH memory is limited we'd like to use shared libraries.
The application workes as usual when compiled and linked with shared libraries and the amount of FLASH memory is reduced as expected.
The difference to the version that is linked to static libs is that the startup time of the application is about 20s longer and I have no idea why.
The application runs on an ARM9 CPU at 180 MHz with Linux 2.6.17 OS,
16 MB FLASH (JFFS File System) and 32 MB RAM.
Bacause shared libraries have to be linked to at runtime, usually by dlopen() or something similar. There's no such step for static libraries.
Edit: some more detail. dlopen has to perform the following tasks.
Find the shared library
Load it into memory
Recursively load all dependencies (and their dependencies....)
Resolve all symbols
This requires quite a lot of IO operations to accomplish.
In a statically linked program all of the above is done at compile time, not runtime. Therefore it's much faster to load a statically linked program.
In your case, the difference is exaggerated by the relatively slow hardware your code has to run on.
This is a fine example of the classic tradeoff of speed and space.
You can statically link all your executables so that they are faster but then they will take more space
OR
You can have shared libraries that take less space but also more time to load.
So decide what you want to sacrifice.
There are many factors for this difference (OS, compiler e.t.c) but a good list of reasons can be found here. Basically shared libraries were created for space reasons and much of the "magic" involved to make them work takes a performance hit.
(As a historical note the original Netscape navigator on Linux/Unix was a statically linked big fat executable).
This may help others with similar problems:
The reason why startup took so long in my case was, that the default setting of the GCC is to export all symbols inside of a library.
A big improvement is to set a compiler setting "-fvisibility=hidden".
All symbols that the lib has to export have to be augmented with the statement
__attribute__ ((visibility("default")))
see gcc wiki
and the very fine article how to write shared libraries
Ok, I have learned now that the usage of shared libraries has it's disadvatages concerning speed. I found this article about dynamic linking and loading enlighting. The loading process seems to be much lengthier than I have expected.
Interesting.. typically loading time for a shared library is unnoticeable from a fat app that is statically linked. So I can only surmise that the system is either very slow to load a library from flash memory, or the library that is loaded is being checked in some way (eg .NET apps run a checksum for all loaded dlls, reducing startup time considerably in some cases). It could be that the shared libraries are being loaded as-needed, and unloaded afterwards which could indicate a configuration problem.
So, sorry I can't help say why, but I think its an issue with your ARM device/OS. Have you tried instrumenting the startup code, or statically linking with 1 of the most commonly-used libraries to see if that makes a large difference. Also put the shared libs in the same directory as the app to reduce the time it takes to search the FS for the lib.
One option which seems obvious to me, is to statically link the several programs all into a single binary. That way you continue to share as much code as possible (probably more than before), but you will also avoid the overhead of the dynamic linker AND save the space of having the dynamic linker on the system at all.
It's pretty easy to combine several executables into the same one, you normally just examine argv and decide which routine to call based on that.

Resources