Why is the startup of an App on linux slower when using shared libs? - linux

On the embedded device I'm working on, the startup time is an important issue. The whole application consists of several executables that use a set of libraries. Because space in FLASH memory is limited we'd like to use shared libraries.
The application workes as usual when compiled and linked with shared libraries and the amount of FLASH memory is reduced as expected.
The difference to the version that is linked to static libs is that the startup time of the application is about 20s longer and I have no idea why.
The application runs on an ARM9 CPU at 180 MHz with Linux 2.6.17 OS,
16 MB FLASH (JFFS File System) and 32 MB RAM.

Bacause shared libraries have to be linked to at runtime, usually by dlopen() or something similar. There's no such step for static libraries.
Edit: some more detail. dlopen has to perform the following tasks.
Find the shared library
Load it into memory
Recursively load all dependencies (and their dependencies....)
Resolve all symbols
This requires quite a lot of IO operations to accomplish.
In a statically linked program all of the above is done at compile time, not runtime. Therefore it's much faster to load a statically linked program.
In your case, the difference is exaggerated by the relatively slow hardware your code has to run on.

This is a fine example of the classic tradeoff of speed and space.
You can statically link all your executables so that they are faster but then they will take more space
OR
You can have shared libraries that take less space but also more time to load.
So decide what you want to sacrifice.
There are many factors for this difference (OS, compiler e.t.c) but a good list of reasons can be found here. Basically shared libraries were created for space reasons and much of the "magic" involved to make them work takes a performance hit.
(As a historical note the original Netscape navigator on Linux/Unix was a statically linked big fat executable).

This may help others with similar problems:
The reason why startup took so long in my case was, that the default setting of the GCC is to export all symbols inside of a library.
A big improvement is to set a compiler setting "-fvisibility=hidden".
All symbols that the lib has to export have to be augmented with the statement
__attribute__ ((visibility("default")))
see gcc wiki
and the very fine article how to write shared libraries

Ok, I have learned now that the usage of shared libraries has it's disadvatages concerning speed. I found this article about dynamic linking and loading enlighting. The loading process seems to be much lengthier than I have expected.

Interesting.. typically loading time for a shared library is unnoticeable from a fat app that is statically linked. So I can only surmise that the system is either very slow to load a library from flash memory, or the library that is loaded is being checked in some way (eg .NET apps run a checksum for all loaded dlls, reducing startup time considerably in some cases). It could be that the shared libraries are being loaded as-needed, and unloaded afterwards which could indicate a configuration problem.
So, sorry I can't help say why, but I think its an issue with your ARM device/OS. Have you tried instrumenting the startup code, or statically linking with 1 of the most commonly-used libraries to see if that makes a large difference. Also put the shared libs in the same directory as the app to reduce the time it takes to search the FS for the lib.

One option which seems obvious to me, is to statically link the several programs all into a single binary. That way you continue to share as much code as possible (probably more than before), but you will also avoid the overhead of the dynamic linker AND save the space of having the dynamic linker on the system at all.
It's pretty easy to combine several executables into the same one, you normally just examine argv and decide which routine to call based on that.

Related

What is determing which `malloc` will be called for an injected code?

I'm using frida to hook various functions of the Firefox web browser running atop of Windows. One of the symbols I hooked was mozglue::malloc() which calls for the jemalloc allocator.
In the process address-space there are three malloc() symbols:
In msvcrt.lib (static linking)
In ucrtbase.dll for dynamic linking
The already mentioned in mozglue.dll
I was expecting that all memory allocations that made by the Firefox processes will be allocated by the mozglue::malloc() and of course this truely happens.
I didn't expect that memory allocations that made by the frida JS agent which was injected to the target process will also be allocated using jemalloc, and honestly I still can't figure out why.
frida couldn't possibly know that there is a mozglue::malloc() symbol when its first attaching to a process, from the frida point of view there is a simple call for malloc(), so how and why this call is redirected from the default CRT symbols to the Mozila dll? This probably have something to do with the PE design, but I can't put the finger on it...
Thank for any help / insight / answer
mozglue::malloc shouldn't even be the only function called inside Firefox, because some system functions called by Firefox will use system malloc.
I see only one explanation: Firefox replaces original malloc calls in memory with its own version of malloc. Having a check at the source code supports this idea: https://searchfox.org/mozilla-central/source/memory/build/replace_malloc.h

Dynamically loaded library remains loaded despite dlclose

Today I'm looking for some enlightenment of the deep magic inside the dynamic loader. I'm debugging/troubleshooting a plugin system for a C++ application running on Linux. It loads plugins via dlopen (RTLD_NOW | RTLS_LOCAL) and releases them using dlclose. Nothing extraordinary - one would think.
However, I noticed that some plugins remain loaded even after dlclose is successfully called*. I concluded this after looking at the running process' memory map using pmap. Some libs get immediately removed from process memory and others keep lingering around apparently indefinitely.
Continuing on, the dlopen man page states:
The function dlclose() decrements the reference count on the dynamic
library handle handle. If the reference count drops to zero and no
other loaded libraries use symbols in it, then the dynamic library is
unloaded.
That means the problem boils down to these two possibilities; Either the reference counts aren't zero, or other loaded libs are using symbols from some - but not all - of the plugins.
I'm pretty sure (although not 100%) that the reference counts are zero. The application's plugin manager handles all plugins exactly the same. It also makes sure that a plugin doesn't get loaded multiple times. IMO loading & unloading should therefore behave the same for all plugins.
That leaves the second possibility: other loaded libs are using symbols from the plugins. Another typical case of 'That should never happen'. Although it's certainly possible. We're using gcc and default visibility and as far as I've seen nothing is stripped, so tons of symbols are being exported. Actually that worries me much more as these plugins should be independent.
Here then are my open questions at this point:
Are my conclusions so far correct?
Do you know a way to verifydlopen's reference count?
If there are internal symbols of my plugins (accidentally) used by other libs, is there a way to track down who is using which symbols?
My machine is:
Linux 3.13.0-43-generic #72-Ubuntu SMP Mon Dec 8 19:35:44 UTC 2014 i686 i686 i686 GNU/Linux
* I should mention that all of the loading and unloading happens in the main thread, so there should be no multi threading issue here.
other loaded libs are using symbols from the plugins
If other libs are not linked to that shared library at link time, referring to symbols of a shared library does not prevent that shared library from being unloaded.
To debug the run-time linker set environment variable LD_DEBUG to all, e.g. LD_DEBUG=all ./my_app. See man ld.so for details.

How does library size affects an application's load time and memory foot print?

Some other people in the office are discussing about reducing the memory footprint and load time by cutting out non-essential parts of an internal library into separate libraries, and load them on-demand.
I did some Googling tonight and learned that ld.so loads libraries with lazy binding, and both ld.so and dlopen loads the library with mmap. This seems to mean that those none-essential parts of the library are already loaded on-demand, i.e. as long as the functions are not used / data pages in the library are not read,
It doesn't take space and
It doesn't take time to load and allocate those pages.
So I thought they just need to make sure they don't actively touch all the non-essential components.
Is this true? Is this true (practically) for all posix systems?
Any library that directly links into your program or is a direct dependency still gets mapped and loaded at run time. To see what libraries will be open, you can use ldd some_program. Some of these will be indirect dependencies that will also be loaded. To see only what the direct depencies are you can use objdump -x my_program | grep NEEDED. Not only does every library have to be read from disk (can be relatively slow), but all the necessary symbols (sometimes many) from them must be mapped.
It may be that one dependency that isn't necessarily needed at startup, is pulling in the majority of your total library size. If you can split apart different capabilities into functional units that can be loaded as modules, it can be straight forward to reduce startup time and to some extent the memory footprint.
The easiest example I can come up with is an image viewer application where the GUI libraries are linked into the binary, but the actual loading of images is done by modules. (feh would be be an example of this using Imlib2 or optionally gdk-pixbuf for the image loading). This allows the main GUI to start up without ever loading libpng, libjpeg, libXPM, libcairo, librsvg, etc... This reduces the startup time because none of those need to be loaded, and may never be loaded, thus reducing the memory footprint.
In C this is commonly done with dlopen(), dlsym() and dlclose() (embedded systems using a flat binary format can use a specially linked module) but most toolkits will have their own module implementation (for instance Gtk+ uses glib's cross-platform module loading).
It doesn't have to be that cut and dry. You could just as easily split your initial GUI into parts and sequentially dlopen() them as an alternative to having a separate "splash screen". I don't know of any example of a toolkit that allows this but one could even open an X11 window with title and dimensions, then load the GUI and associated libraries into that window. You could load different binary format parsers depending on the file's extension/magic value or load different binary format generators depending on what format the user decides to Save_As.
Another alternative is to statically link library dependencies so that only the required symbols get loaded (as long as there are no licensing issues with doing so). Why load up 100+ widgets, when you only use 10? This works best with libraries that are built with static linking in mind such as musl-libc (glibc is notoriously useless for static linking). Compiling with the -flto compiler flag (or -combine -fwhole-program on older versions of gcc) along with -ffunction-sections and -fdata-sections and linking with --gc-sections typically helps (I normally see 15-50% size reduction) This is also a good option if you can't rely on the shared system libraries to be sane.

Allocating a data page in linux with NX bit turned off

I would like to generate some machine code in my program and then run it. One way to do it would be to write out a .so file and then load it in the program but that seems too expensive.
IS there a way in linux for me to write out the code in my data pages and then set my function ointer there and just call it? I've seen something similar on windows where you can allocate a page with the NX protection turned off for that page, but I can't find a similar OS call for linux.
The mmap(2) (with munmap(2)) and mprotect(2) syscalls are the elementary operations to do that. Recall that syscalls are elementary operations from the point of view of an application. You want PROT_EXEC
You could just strace any dynamically linked executable to get a clue about how you might call them, since the dynamic linker ld.so is using them.
Generating a shared object might be less expensive than you imagine. Actually, generating C code, running the compiler, then dlopen-ing the resulting shared object has some sense, even when you work interactively. My MELT domain specific language (to extend GCC) is doing this. Recall that you can do a big lot of dlopen-s without issues.
If you want to generate machine code in memory, you could use GNU lightning (quick generation of slow machine code), libjit from dotgnu (generate less bad machine code), LuaJit, asmjit (x86 or amd64 specific), LLVM (slowly generate optimized machine code). BTW, the SBCL Common Lisp implementation is dynamically compiling to memory and produces good machine code at runtime (and there is also all the JIT for JVMs doing that).

Do (statically linked) DLLs use a different heap than the main program?

I'm new to Windows programming and I've just "lost" two hours hunting a bug which everyone seems aware of: you cannot create an object on the heap in a DLL and destroy it in another DLL (or in the main program).
I'm almost sure that on Linux/Unix this is NOT the case (if it is, please say it, but I'm pretty sure I did that thousands of times without problems...).
At this point I have a couple of questions:
1) Do statically linked DLLs use a different heap than the main program?
2) Is the statically linked DLL mapped in the same process space of the main program? (I'm quite sure the answer here is a big YES otherwise it wouldn't make sense passing pointers from a function in the main program to a function in a DLL).
I'm talking about plain/regular DLL, not COM/ATL services
EDIT: By "statically linked" I mean that I don't use LoadLibrary to load the DLL but I link with the stub library
DLLs / exes will need to link to an implementation of C run time libraries.
In case of C Windows Runtime libraries, you have the option to specify, if you wish to link to the following:
Single-threaded C Run time library (Support for single threaded libraries have been discontinued now)
Multi-threaded DLL / Multi-threaded Debug DLL
Static Run time libraries.
Few More (You can check the link)
Each one of them will be referring to a different heap, so you are not allowed pass address obtained from heap of one runtime library to other.
Now, it depends on which C run time library the DLL which you are talking about has been linked to. Suppose let's say, the DLL which you are using has been linked to static C run time library and your application code (containing the main function) has linked to multi-threaded C Runtime DLL, then if you pass a pointer to memory allocated in the DLL to your main program and try to free it there or vice-versa, it can lead to undefined behaviour. So, the basic root cause are the C runtime libraries. Please choose them carefully.
Please find more info on the C run time libraries supported here & here
A quote from MSDN:
Caution Do not mix static and dynamic versions of the run-time libraries. Having more than one copy of the run-time libraries in a process can cause problems, because static data in one copy is not shared with the other copy. The linker prevents you from linking with both static and dynamic versions within one .exe file, but you can still end up with two (or more) copies of the run-time libraries. For example, a dynamic-link library linked with the static (non-DLL) versions of the run-time libraries can cause problems when used with an .exe file that was linked with the dynamic (DLL) version of the run-time libraries. (You should also avoid mixing the debug and non-debug versions of the libraries in one process.)
Let’s first understand heap allocation and stack on Windows OS wrt our applications/DLLs. Traditionally, the operating system and run-time libraries come with an implementation of the heap.
At the beginning of a process, the OS creates a default heap called Process heap. The Process heap is used for allocating blocks if no other heap is used.
Language run times also can create separate heaps within a process. (For example, C run time creates a heap of its own.)
Besides these dedicated heaps, the application program or one of the many loaded dynamic-link libraries (DLLs) may create and use separate heaps, called private heaps
These heap sits on top of the operating system's Virtual Memory Manager in all virtual memory systems.
Let’s discuss more about CRT and associated heaps:
C/C++ Run-time (CRT) allocator: Provides malloc() and free() as well as new and delete operators.
The CRT creates such an extra heap for all its allocations (the handle of this CRT heap is stored internally in the CRT library in a global variable called _crtheap) as part of its initialization.
CRT creates its own private heap, which resides on top of the Windows heap.
The Windows heap is a thin layer surrounding the Windows run-time allocator(NTDLL).
Windows run-time allocator interacts with Virtual Memory Allocator, which reserves and commits pages used by the OS.
Your DLL and exe link to multithreaded static CRT libraries. Each DLL and exe you create has a its own heap, i.e. _crtheap. The allocations and de-allocations has to happen from respective heap. That a dynamically allocated from DLL, cannot be de-allocated from executable and vice-versa.
What you can do? Compile our code in DLL and exe’s using /MD or /MDd to use the multithread-specific and DLL-specific version of the run-time library. Hence both DLL and exe are linked to the same C run time library and hence one _crtheap. Allocations are always paired with de-allocations within a single module.
If I have an application that compiles as an .exe and I want to use a library I can either statically link that library from a .lib file or dynamically linked that library from a .dll file.
Each linked module (ie. each .exe or .dll) will be linked to an implementation of the C or C++ run times. The run times themselves are a library that can be statically or dynamically linked to and come in different threading configurations.
By saying statically linked dlls are you describing a set up where an application .exe dynamically links to a library .dll and that library internally statically links to the runtime? I will assume that this is what you mean.
Also worth noting is that every module (.exe or .dll) has its own scope for statics i.e. a global static in an .exe will not be the same instance as a global static with the same name in a .dll.
In the general case therefore it cannot be assumed that lines of code running inside different modules are using the same implementation of the runtime, furthermore they will not be using the same instance of any static state.
Therefore certain rules need to be obeyed when dealing with objects or pointers that cross module boundaries. Allocations and deallocations must be occur in the same module for any given address. Otherwise the heaps will not match and behaviour will not be defined.
COM solves this this using reference counting, objects delete themselves when the reference count reaches zero. This is a common pattern used to solve the matched location problem.
Other problems can exist, for instance windows defines certain actions e.g. how allocation failures are handled on a per thread basis, not on a per module basis. This means that code running in module A on a thread setup by module B can also run into unexpected behaviour.

Resources