Do (statically linked) DLLs use a different heap than the main program? - linux

I'm new to Windows programming and I've just "lost" two hours hunting a bug which everyone seems aware of: you cannot create an object on the heap in a DLL and destroy it in another DLL (or in the main program).
I'm almost sure that on Linux/Unix this is NOT the case (if it is, please say it, but I'm pretty sure I did that thousands of times without problems...).
At this point I have a couple of questions:
1) Do statically linked DLLs use a different heap than the main program?
2) Is the statically linked DLL mapped into the same process address space as the main program? (I'm quite sure the answer here is a big YES, otherwise it wouldn't make sense to pass pointers from a function in the main program to a function in a DLL.)
I'm talking about plain/regular DLLs, not COM/ATL services.
EDIT: By "statically linked" I mean that I don't use LoadLibrary to load the DLL but link against the import (stub) library.

DLLs and EXEs each need to link to an implementation of the C run-time libraries.
With the C Windows run-time libraries, you have the option to link against any of the following:
Single-threaded C run-time library (support for single-threaded libraries has since been discontinued)
Multi-threaded DLL / multi-threaded debug DLL
Static run-time libraries
A few more (you can check the link)
Each of them refers to a different heap, so you are not allowed to pass an address obtained from the heap of one run-time library to another.
Now, it depends on which C run-time library the DLL you are talking about has been linked against. Suppose the DLL you are using has been linked against the static C run-time library, while your application code (containing the main function) has been linked against the multi-threaded C run-time DLL. If you pass a pointer to memory allocated in the DLL to your main program and try to free it there, or vice versa, it can lead to undefined behaviour. So the basic root cause is the C run-time libraries. Please choose them carefully.
You can find more info on the supported C run-time libraries here & here.
A quote from MSDN:
Caution Do not mix static and dynamic versions of the run-time libraries. Having more than one copy of the run-time libraries in a process can cause problems, because static data in one copy is not shared with the other copy. The linker prevents you from linking with both static and dynamic versions within one .exe file, but you can still end up with two (or more) copies of the run-time libraries. For example, a dynamic-link library linked with the static (non-DLL) versions of the run-time libraries can cause problems when used with an .exe file that was linked with the dynamic (DLL) version of the run-time libraries. (You should also avoid mixing the debug and non-debug versions of the libraries in one process.)
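To make the failure concrete, here is a minimal sketch of the classic mistake and the usual fix (the Widget type, the WIDGET_API macro, and the file names are my own illustration, not from the question): the module that allocates an object should also be the module that frees it.

// widget.h -- shared between the DLL and the EXE (hypothetical names)
#ifdef WIDGET_EXPORTS
#define WIDGET_API __declspec(dllexport)
#else
#define WIDGET_API __declspec(dllimport)
#endif

struct Widget { int value; };

extern "C" WIDGET_API Widget* CreateWidget();         // allocates inside the DLL
extern "C" WIDGET_API void DestroyWidget(Widget* w);  // frees inside the DLL

// widget.cpp -- compiled into the DLL (includes widget.h, WIDGET_EXPORTS defined)
Widget* CreateWidget() { return new Widget{42}; }
void DestroyWidget(Widget* w) { delete w; }           // same CRT heap as the new above

// main.cpp -- compiled into the EXE
int main() {
    Widget* w = CreateWidget();
    // delete w;       // WRONG if the EXE and DLL use different CRTs: frees on the wrong heap
    DestroyWidget(w);  // OK: allocation and deallocation happen in the same module
}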

Let's first understand heap allocation and the stack on Windows with respect to our applications/DLLs. Traditionally, the operating system and run-time libraries come with an implementation of the heap.
At the beginning of a process, the OS creates a default heap called the process heap. The process heap is used for allocating blocks if no other heap is used.
Language run times can also create separate heaps within a process. (For example, the C run time creates a heap of its own.)
Besides these dedicated heaps, the application program or any of the many loaded dynamic-link libraries (DLLs) may create and use separate heaps, called private heaps.
All of these heaps sit on top of the operating system's Virtual Memory Manager in all virtual memory systems.
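As a quick illustration of a private heap, here is a minimal sketch using the Win32 heap APIs (this example is mine, not the answer's):

#include <windows.h>
#include <cstdio>

int main() {
    // Create a growable private heap (initial size 0, maximum size 0 = growable).
    HANDLE heap = HeapCreate(0, 0, 0);
    if (!heap) return 1;

    // Allocate and free from that specific heap -- never mix heaps.
    void* p = HeapAlloc(heap, HEAP_ZERO_MEMORY, 256);
    std::printf("allocated %p from private heap %p\n", p, (void*)heap);
    HeapFree(heap, 0, p);

    HeapDestroy(heap);  // releases the heap and everything still allocated in it
    return 0;
}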
Let's discuss more about the CRT and its associated heaps:
C/C++ run-time (CRT) allocator: provides malloc() and free() as well as the new and delete operators.
The CRT creates such an extra heap for all its allocations as part of its initialization (the handle of this CRT heap is stored internally in the CRT library, in a global variable called _crtheap).
The CRT creates its own private heap, which resides on top of the Windows heap.
The Windows heap is a thin layer surrounding the Windows run-time allocator (NTDLL).
The Windows run-time allocator interacts with the Virtual Memory Allocator, which reserves and commits pages used by the OS.
Your DLL and EXE link to multithreaded static CRT libraries. Each DLL and EXE you create then has its own heap, i.e. its own _crtheap. Allocations and de-allocations have to happen from the respective heap: memory dynamically allocated in the DLL cannot be de-allocated from the executable, and vice versa.
What can you do? Compile the code in both the DLL and the EXE using /MD or /MDd, to use the multithread-specific and DLL-specific version of the run-time library. Then both the DLL and the EXE are linked against the same C run-time library, hence one _crtheap, and allocations are always paired with de-allocations within a single run-time module (the shared CRT DLL).

If I have an application that compiles to an .exe and I want to use a library, I can either statically link that library from a .lib file or dynamically link it from a .dll file.
Each linked module (i.e. each .exe or .dll) will be linked to an implementation of the C or C++ run times. The run times themselves are libraries that can be statically or dynamically linked and that come in different threading configurations.
By saying "statically linked DLLs", are you describing a setup where an application .exe dynamically links to a library .dll, and that library internally statically links to the runtime? I will assume that this is what you mean.
Also worth noting is that every module (.exe or .dll) has its own scope for statics, i.e. a global static in an .exe will not be the same instance as a global static with the same name in a .dll.
In the general case, therefore, it cannot be assumed that lines of code running inside different modules are using the same implementation of the runtime; furthermore, they will not be using the same instance of any static state.
Therefore certain rules need to be obeyed when dealing with objects or pointers that cross module boundaries. Allocations and deallocations must occur in the same module for any given address; otherwise the heaps will not match and the behaviour is undefined.
COM solves this using reference counting: objects delete themselves when their reference count reaches zero. This is a common pattern used to solve the matched-allocation problem.
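A minimal sketch of that self-deleting, reference-counted pattern (simplified; not real COM):

// The object frees itself in the module that created it, so callers in
// other modules never invoke delete directly.
class RefCounted {
public:
    void AddRef() { ++m_refs; }
    void Release() {
        // When the count hits zero, delete runs here, inside the module
        // whose operator new created the object.
        if (--m_refs == 0) delete this;
    }
protected:
    virtual ~RefCounted() = default;  // protected: callers cannot delete directly
private:
    unsigned m_refs = 1;              // the creator holds the first reference
};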
Other problems can exist. For instance, Windows defines certain behaviours, e.g. how allocation failures are handled, on a per-thread basis, not on a per-module basis. This means that code running in module A on a thread set up by module B can also run into unexpected behaviour.

Related

Link to the same DLL twice - Implicit and explicit at the same time

The project I'm working on loads the same library twice:
with LoadLibrary
statically, with a .lib file and __declspec(dllimport/dllexport).
What is happening in this case? Do these two "loadings" use the same heap, or share anything else? E.g. is it the same as, or similar to, calling LoadLibrary twice?
My general problem is that I'm having stack corruption problems when calling DLL methods from the EXE via the second approach, and I'm wondering if the problem could be caused by the first loading. All projects use the same RT, alignment and so on.
By "statically loads the DLL with a lib file and __declspec(dllimport/dllexport)" I assume you mean that you compiled your executable with the .lib as a dependency, and at runtime the .dll is automatically loaded by the EXE (at the beginning). Here's a fragment from the FreeLibrary (surprisingly) MSDN page:
The system maintains a per-process reference count for each loaded module. A module that was loaded at process initialization due to load-time dynamic linking has a reference count of one. The reference count for a module is incremented each time the module is loaded by a call to LoadLibrary. The reference count is also incremented by a call to LoadLibraryEx unless the module is being loaded for the first time and is being loaded as a data or image file.
So in other words, the .dll gets loaded at application startup (because you linked against it) and LoadLibrary just increments its reference count. For more info you could also check DllMain, or this DLL guide.
There's absolutely no reason to use both approaches for the same .dll in the same application.
The 2nd approach is the preferred one if the .dll comes with a .h file (that holds the declarations of the functions exported by the library, needed at compile time) and a .lib file (that instructs the linker to add references from the .dll file into the executable).
The 1st approach, on the other hand, is the only way if you only have the .dll file and you somehow have the signatures of the functions it exports. In that case you must define pointers to those functions in your app and initialize them using GetProcAddress. There are cases when this approach is preferred, for example when the functionality in the .dll is needed only in a corner case of the program flow; then there's no point in linking against the .lib file and loading the .dll at app startup if, let's say, in 99% of the cases it won't be required. Also, a major advantage of this approach: if the .dll is somehow deleted, then only the functionality related to it won't work (LoadLibrary will fail), while with the other approach the application won't start at all.
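A minimal sketch of the 1st approach (the library name "mylib.dll" and the function "Compute" are placeholders, not names from the question):

#include <windows.h>
#include <cstdio>

typedef int (__cdecl *ComputeFn)(int);  // must match the DLL's real signature

int main() {
    HMODULE lib = LoadLibraryA("mylib.dll");
    if (!lib) { std::printf("LoadLibrary failed\n"); return 1; }

    auto compute = reinterpret_cast<ComputeFn>(GetProcAddress(lib, "Compute"));
    if (compute) std::printf("Compute(2) = %d\n", compute(2));

    FreeLibrary(lib);  // decrements the per-process reference count
    return 0;
}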
Now, without details I can't get to the bottom of the specific problem you're running into. You say that calling a function "normally" (through its declaration in the .h file) fails, while calling it (with the same arguments) through a function pointer succeeds? What's the stack error message?
Note: from my experience, a typical reason for stack corruption in scenarios like this one is a calling convention mismatch between the caller and the callee (stdcall vs cdecl or vice versa). Also, mixing Debug and Release in one process can introduce problems.
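To illustrate the calling-convention pitfall (a sketch with a hypothetical Add function): on x86, a stdcall callee pops its own arguments while a cdecl caller pops them too, so a mismatched declaration corrupts the stack.

// In the DLL -- the actual definition:
extern "C" __declspec(dllexport) int __stdcall Add(int a, int b) { return a + b; }

// In the EXE -- the buggy declaration (convention omitted or wrong):
//   extern "C" __declspec(dllimport) int __cdecl Add(int a, int b);
// Calling through it cleans up the stack twice -> corruption.

// The correct declaration, matching the DLL:
extern "C" __declspec(dllimport) int __stdcall Add(int a, int b);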

Is it possible to force a range of virtual addresses?

I have an Ada program that was written for a specific (embedded, multi-processor, 32-bit) architecture. I'm attempting to use this same code in a simulation on 64-bit RHEL as a shared object (since there are multiple versions and I have a requirement to choose a version at runtime).
The problem I'm having is that there are several places in the code where the people who wrote it (not me...) have used Unchecked_Conversions to convert System.Addresses to 32-bit integers. Not only that, but there are multiple routines with hard-coded memory addresses. I can make minor changes to this code, but completely porting it to x86_64 isn't really an option. There are routines that handle interrupts, CPU task scheduling, etc.
This code has run fine in the past when it was statically-linked into a previous version of the simulation (consisting of Fortran/C/C++). Now, however, the main executable starts, then loads a shared object based on some inputs. This shared object then checks some other inputs and loads the appropriate Ada shared object.
Looking through the code, it's apparent that it should work fine if I can keep the logical memory addresses between 0 and 2,147,483,647 (32-bit signed int). Is there a way to either force the shared object loader to leave space in the lower ranges for the Ada code, or perhaps make the Ada code "think" that its addresses are between 0 and 2,147,483,647?
Is there a way to either force the shared object loader to leave space in the lower ranges for the Ada code
The good news is that the loader will leave the lower ranges untouched.
The bad news is that it will not load any shared object there. There is no interface you could use to influence placement of shared objects.
That said, dlopen from memory (which we implemented in our private fork of glibc) would allow you to do that. But that's not available publicly.
Your other possible options are:
if you can fit the entire process into the 32-bit address space, your solution is trivial: just build everything with -m32.
use prelink to relocate the library to the desired address. Since that address should almost always be available, the loader is very likely to load the library exactly there.
link the loader with a custom mmap implementation, which detects the library of interest through some kind of side channel and does the mmap syscall with MAP_32BIT set, or
run the program in a ptrace sandbox. Such a sandbox can again intercept the mmap syscall and or in MAP_32BIT when desirable. A sketch of what MAP_32BIT does is shown after this list.
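For reference, a minimal Linux sketch of the MAP_32BIT flag itself (x86-64 only; this shows the flag's effect, not the interception machinery):

#include <sys/mman.h>
#include <cstdio>

int main() {
    // MAP_32BIT asks the kernel to place the mapping in the low 2 GiB
    // of the address space (Linux on x86-64).
    void* p = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    if (p == MAP_FAILED) { std::perror("mmap"); return 1; }

    std::printf("mapped at %p -- the address fits in 32 bits\n", p);
    munmap(p, 4096);
    return 0;
}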
or perhaps make the Ada code "think" that it's addresses are between 0 and 2,147,483,647?
I don't see how that's possible. If the library stores an address of a function or a global in a 32-bit memory location, then loads that address and dereferences it ... it's going to get a 32-bit truncated address and a SIGSEGV on dereference.

DLL Functions and pointers

I am absolutely new to DLLs, but not to C++. For a project I need to implement some functions in a DLL. My question: can I pass pointers from my main project to functions inside DLLs without worrying about anything? I find it strange, because the addresses in my main project are relative, so an address passed to the DLL should mean something else. Is there a trick anywhere?
Thanks.
Any modules (EXE, DLL, etc.) in the same address space can use the same pointer.
When the OS executes the EXE, it loads the EXE into virtual memory at the address specified in the EXE header, or at a randomized address if the EXE supports relocation and ASLR (address space layout randomization) is enabled. The dependencies (all of the DLLs that the EXE uses) are then loaded into the same address space.
The thing to worry about when you pass a pointer between modules is allocation and deallocation. You MUST use the corresponding function to deallocate memory: if you use HeapAlloc, you must use HeapFree with the same heap to free the memory.
In Visual C++, the real problem is malloc and free (new and delete too). Imagine module A allocates a block of memory with malloc and is compiled with Visual C++ 8.0, while module B is compiled with Visual C++ 9.0. In this situation, you cannot use free in module B to release a block of memory allocated by malloc in module A, since they are not the same corresponding functions.
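A small sketch of the "same corresponding function" rule, using the process default heap:

#include <windows.h>

int main() {
    // Allocate from the process default heap...
    void* p = HeapAlloc(GetProcessHeap(), 0, 128);

    // ...and release it with HeapFree on the SAME heap handle.
    if (p) HeapFree(GetProcessHeap(), 0, p);

    // WRONG (don't do this): free(p) or delete p would hand the pointer
    // to a different allocator, which is undefined behaviour.
    return 0;
}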

How tcmalloc gets linked to the main program

I want to know how malloc gets linked to the main program. Basically, I have a program which uses several static and dynamic libraries. I am including all of these in my makefile using the options "-llibName1 -llibName2".
The documentation of TCMalloc says that we can override malloc simply by setting "LD_PRELOAD=/usr/lib64/libtcmalloc.so". I am not able to understand how tcmalloc gets called for all these static and dynamic libraries. Also, how does tcmalloc get linked to the STL libraries and the new/delete operators of C++?
Can anyone please give any insights on this?
"LD_PRELOAD=/usr/lib64/libtcmalloc.so" directs the loader to use libtcmalloc.so before any other shared library when resolving symbols external to your program, and because libtcmalloc.so defines a symbol named "malloc", that is the version your program will use.
If you omit the LD_PRELOAD line, glibc.so (or whatever C library you have on your system) will be the first shared library to define a symbol named "malloc".
Note also that if you link against a static library which defines a symbol named "malloc" (and uses proper arguments, etc), or another shared library is loaded that defines a symbol named "malloc", your program will attempt to use that version of malloc.
That's the general idea anyway; the actual goings-on are quite interesting, and I will have to direct you to http://en.wikipedia.org/wiki/Dynamic_linker as a starting point for more information.
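To see the interposition mechanism in action, here is a minimal sketch of a preloadable malloc shim (my own illustration, not tcmalloc's code) that forwards to the next malloc in the lookup order:

// shim.cpp -- build: g++ -shared -fPIC shim.cpp -o libshim.so -ldl
// run:              LD_PRELOAD=./libshim.so ./your_program
#include <dlfcn.h>
#include <cstddef>
#include <cstdio>

extern "C" void* malloc(std::size_t size) {
    // Find the next "malloc" in the search order (normally glibc's).
    static auto real_malloc =
        reinterpret_cast<void* (*)(std::size_t)>(dlsym(RTLD_NEXT, "malloc"));
    void* p = real_malloc(size);
    // Caveat: fprintf may itself allocate; real interposers guard against
    // such re-entrancy. This only sketches the symbol-lookup mechanism.
    std::fprintf(stderr, "malloc(%zu) = %p\n", size, p);
    return p;
}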

Why is the startup of an app on Linux slower when using shared libs?

On the embedded device I'm working on, the startup time is an important issue. The whole application consists of several executables that use a set of libraries. Because space in FLASH memory is limited we'd like to use shared libraries.
The application works as usual when compiled and linked with shared libraries, and the amount of FLASH memory used is reduced as expected.
The difference from the version that is linked against static libs is that the startup time of the application is about 20 s longer, and I have no idea why.
The application runs on an ARM9 CPU at 180 MHz with a Linux 2.6.17 OS, 16 MB of FLASH (JFFS file system) and 32 MB of RAM.
Because shared libraries have to be linked at runtime, usually by dlopen() or something similar. There's no such step for static libraries.
Edit: some more detail. dlopen has to perform the following tasks.
Find the shared library
Load it into memory
Recursively load all dependencies (and their dependencies....)
Resolve all symbols
This requires quite a lot of I/O operations to accomplish.
In a statically linked program, all of the above is done at link time, not runtime. Therefore it's much faster to load a statically linked program.
In your case, the difference is exaggerated by the relatively slow hardware your code has to run on.
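A minimal dlopen/dlsym sketch of those runtime steps (library and symbol names are placeholders):

#include <dlfcn.h>
#include <cstdio>

int main() {
    // Find, load, and relocate the library and all of its dependencies.
    void* lib = dlopen("libfoo.so", RTLD_NOW);  // RTLD_NOW: resolve every symbol up front
    if (!lib) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

    // Resolve one symbol; with RTLD_LAZY this cost would be deferred per call.
    auto init = reinterpret_cast<int (*)()>(dlsym(lib, "foo_init"));
    if (init) init();

    dlclose(lib);
    return 0;
}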
This is a fine example of the classic tradeoff of speed and space.
You can statically link all your executables so that they are faster, but then they will take more space
OR
You can have shared libraries that take less space but also more time to load.
So decide what you want to sacrifice.
There are many factors in this difference (OS, compiler, etc.), but a good list of reasons can be found here. Basically, shared libraries were created for space reasons, and much of the "magic" involved in making them work comes with a performance hit.
(As a historical note the original Netscape navigator on Linux/Unix was a statically linked big fat executable).
This may help others with similar problems:
The reason why startup took so long in my case was that the default setting of GCC is to export all symbols in a library.
A big improvement is to compile with the setting "-fvisibility=hidden".
All symbols that the lib has to export have to be annotated with the attribute
__attribute__ ((visibility("default")))
See the GCC wiki
and the very fine article "How to Write Shared Libraries".
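A common way to apply this in a header (a sketch; the MYLIB_API macro name is just a convention):

// mylib.h -- build the library with: g++ -shared -fPIC -fvisibility=hidden ...
// Only symbols marked MYLIB_API remain visible; everything else is hidden,
// which shrinks the dynamic symbol table and speeds up symbol resolution.
#if defined(__GNUC__)
#define MYLIB_API __attribute__((visibility("default")))
#else
#define MYLIB_API
#endif

MYLIB_API int mylib_public_entry(int x);  // exported

int internal_helper(int x);               // hidden by -fvisibility=hidden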
OK, I have learned now that the usage of shared libraries has its disadvantages concerning speed. I found this article about dynamic linking and loading enlightening. The loading process seems to be much lengthier than I had expected.
Interesting. Typically the loading time for a shared library is unnoticeable compared to a fat app that is statically linked. So I can only surmise that the system is either very slow to load a library from flash memory, or the library that is loaded is being checked in some way (e.g. .NET apps run a checksum for all loaded DLLs, increasing startup time considerably in some cases). It could be that the shared libraries are being loaded as needed and unloaded afterwards, which could indicate a configuration problem.
So, sorry I can't say why, but I think it's an issue with your ARM device/OS. Have you tried instrumenting the startup code, or statically linking with one of the most commonly used libraries to see if that makes a large difference? Also, put the shared libs in the same directory as the app to reduce the time it takes to search the file system for them.
One option which seems obvious to me is to statically link the several programs into a single binary. That way you continue to share as much code as possible (probably more than before), but you also avoid the overhead of the dynamic linker AND save the space of having the dynamic linker on the system at all.
It's pretty easy to combine several executables into the same one: you normally just examine argv[0] and decide which routine to call based on that.
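A sketch of that argv[0] dispatch (busybox-style; the tool names are hypothetical):

#include <cstdio>
#include <cstring>

static int tool_a_main(int, char**) { std::puts("running tool_a"); return 0; }
static int tool_b_main(int, char**) { std::puts("running tool_b"); return 0; }

int main(int argc, char** argv) {
    // The combined binary is invoked through differently named symlinks;
    // argv[0] tells us which personality was requested.
    const char* name = std::strrchr(argv[0], '/');
    name = name ? name + 1 : argv[0];

    if (std::strcmp(name, "tool_a") == 0) return tool_a_main(argc, argv);
    if (std::strcmp(name, "tool_b") == 0) return tool_b_main(argc, argv);

    std::fprintf(stderr, "unknown applet: %s\n", name);
    return 1;
}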
