Which code in LLVM IR runs before "main()"? - linux

Does anyone know the general rule for exactly which LLVM IR code will be executed before main?
When using Clang++ 3.6, it seems that global class variables have their constructors called via a function in the ".text.startup" section of the object file. For example:
define internal void @__cxx_global_var_init() section ".text.startup" {
call void @_ZN7MyClassC2Ev(%class.MyClass* @M)
ret void
}
From this example, I'd guess that I should be looking for exactly those IR function definitions that specify section ".text.startup".
I have two reasons to suspect my theory is correct:
I don't see anything else in my LLVM IR file (.ll) suggesting that the global object constructors should be run first, if we assume that LLVM isn't sniffing for C++ -specific function names like "__cxx_global_var_init". So section ".text.startup" is the only obvious means of saying that code should run before main(). But even if that's correct, we've identified a sufficient condition for causing a function to run before main(), but haven't shown that it's the only way in LLVM IR to cause a function to run before main().
The GNU linker will, in some cases, use the first instruction in the .text section as the program entry point. This article on Raspberry Pi programming describes placing the .text.startup content first in the program's .text section, as a means of causing the .text.startup code to run first.
Unfortunately I'm not finding much else to support my theory:
When I grep the LLVM 3.6 source code for the string ".startup", I only find it in the Clang-specific parts of the LLVM code. For my theory to be correct, I would expect to have found that string in other parts of the LLVM code as well; in particular, parts outside of the C++ front-end.
This article on data initialization in C++ seems to hint at ".text.startup" having a special role, but it doesn't come right out and say that the Linux program loader actually looks for a section of that name. Even if it did, I'd be surprised to find a potentially Linux-specific section name carrying special meaning in platform-neutral LLVM IR.
The Linux 3.13.0 source code doesn't seem to contain the string ".startup", suggesting to me that the program loader isn't sniffing for a section with the name ".text.startup".

The answer is pretty simple: LLVM does not execute anything behind the scenes. It is the job of the C runtime (CRT) to perform all necessary preparation before running main(). This includes, but is not limited to, running static constructors and similar things. The runtime is usually informed about these objects via the addresses of the constructors being emitted into special sections (e.g. .init_array or .ctors). See e.g. http://wiki.osdev.org/Calling_Global_Constructors for more information.
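To make the mechanism concrete, here is a minimal sketch, assuming GCC or Clang on Linux. As far as I know, Clang lists such initializers in the `@llvm.global_ctors` array in the IR, and the backend lowers that array to `.init_array` (or `.ctors`) entries that the C runtime walks before calling main(). The names `startup_log` and `run_before_main` are hypothetical and only used for illustration; `__attribute__((constructor))` is a compiler extension, not standard C++.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: a log that is safe to use during static initialization,
// because a function-local static is constructed on first use.
std::vector<std::string>& startup_log() {
    static std::vector<std::string> log;
    return log;
}

// A class with a non-trivial constructor, standing in for MyClass in the question.
struct MyClass {
    MyClass() { startup_log().push_back("MyClass constructor"); }
};

// The dynamic initialization of this global is what Clang wraps in
// __cxx_global_var_init and registers for the runtime to call.
MyClass M;

// GCC/Clang extension: registers a plain function in the same startup list.
__attribute__((constructor))
void run_before_main() {
    startup_log().push_back("attribute constructor");
}
```

If main() inspects `startup_log()`, it should already hold both entries. The relative order of the two entries is something I would not rely on.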

Related

Why are some foreign functions statically linked while others are dynamically linked?

I'm working on a program that needs to manipulate git repositories. I've decided to use libgit2. Unfortunately, the haskell bindings for it are several years out of date and lack several functions that I require. Because of this I've decided to write the portions that use libgit2 in C and call them through the FFI. For demonstration purposes one of them is called git_update_repo.
git_update_repo works perfectly when used in a pure C program, however when it's called from haskell an assertion fails indicating that the libgit2 global init function, git_libgit2_init, hasn't been called. But, git_libgit2_init is called by git_update_repo. And if I use gdb I can see that git_libgit2_init is indeed called and reports that the initialization has been successful.
I've used nm to examine the executables and found something interesting. In a pure C executable, all the libgit2 functions are dynamically linked (as expected). However, in my haskell executable, git_libgit2_init is dynamically linked, while the rest of the libgit2 functions are statically linked. I'm certain that this mismatch is the cause of my issue.
So why do certain functions get linked dynamically and others statically? How can I change this?
The relevant settings in my .cabal file are
cc-options: -g
c-sources:
src/git-bindings.c
extra-libraries:
git2

How to retrieve the type of architecture (linux versus Windows) within my fortran code

How can I retrieve the type of architecture (linux versus Windows) in my fortran code? Is there some sort of intrinsic function or subroutine that gives this information? Then I would like to use a switch like this every time I have a system call:
if (trim(adjustl(Arch))=='Linux') then
resul = system('ls > output.txt')
elseif (trim(adjustl(Arch))=='Windows') then
resul = system('dir > output.txt')
else
write(*,*) 'architecture not supported'
stop
endif
thanks
A.
The Fortran 2003 standard introduced the GET_ENVIRONMENT_VARIABLE intrinsic subroutine. A simple form of call would be
call GET_ENVIRONMENT_VARIABLE (NAME, VALUE)
which will return the value of the variable called NAME in VALUE. The routine has other optional arguments, your favourite reference documentation will explain all. This rather assumes that you can find an environment variable to tell you what the executing platform is.
If your compiler doesn't yet implement this standard approach it is extremely likely to have a non-standard approach; a routine called getenv used to be available on more than one of the Fortran compilers I've used in the recent past.
The 2008 standard introduced a standard function COMPILER_OPTIONS which will return a string containing the compilation options used for the program, if, that is, the compiler supports this sort of thing. This seems to be less widely implemented so far than GET_ENVIRONMENT_VARIABLE; as ever, consult your compiler documentation set for details and availability. If it is available it may also be useful to you.
You may also be interested in the 2008-introduced subroutine EXECUTE_COMMAND_LINE which is the standard replacement for the widely-implemented but non-standard system routine that you use in your snippet. This is already available in a number of current Fortran compilers.
There is no intrinsic function in Fortran for this. A common workaround is to use conditional compilation (through makefile or compiler supported macros) such as here. If you really insist on this kind of solution, you might consider making an external function, e.g., in C. However, since your code is built for a fixed platform (Windows/Linux, not both), the first solution is preferable.
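The external-function idea can be sketched roughly as follows. This is only an illustration: it relies on the predefined macros `_WIN32` and `__linux__`, which the common Windows and Linux toolchains set at compile time. The function names `platform_name` and `list_command` are hypothetical. The functions use C linkage so that Fortran code could, in principle, bind to them through ISO_C_BINDING.

```cpp
// The platform is fixed at compile time, so the preprocessor picks the branch.
// _WIN32 is predefined by Windows toolchains, __linux__ by Linux ones.
extern "C" const char* platform_name() {
#if defined(_WIN32)
    return "Windows";
#elif defined(__linux__)
    return "Linux";
#else
    return "unknown";
#endif
}

// Hypothetical companion: the directory-listing command from the question,
// or an empty string when the platform is not supported.
extern "C" const char* list_command() {
#if defined(_WIN32)
    return "dir > output.txt";
#elif defined(__linux__)
    return "ls > output.txt";
#else
    return "";
#endif
}
```

Because the choice is made by the preprocessor, this is really the conditional-compilation solution again, just moved into one small translation unit.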

code segment referenced again with second plugin crashes

I'd like to understand the dynamic linker/loader behaviour on a Linux box in the problematic case I am working on.
Our code that crashes is loaded as a plugin (dlopen(libwrapper.so, RTLD_GLOBAL)). libwrapper.so is just a thin layer that loads other plugins that do the real job. These plugins can be named P1 and P2; each of them depends on a common library called F (all of this is very much simplified).
The wrapper (libwrapper.so) was introduced to allow loading Pn without RTLD_GLOBAL, since setting that flag leads to obvious linkage problems when loading the Pns (they have the same API). RTLD_DEEPBIND is not an option since the target platform is too old and does not support it.
To our surprise, the problem manifests in the F library at the load time of P2 (when P1 is already loaded and initialized, with F as its implicit dependency). At the time P2 is explicitly loaded (dlopen(libP2.so, RTLD_LOCAL | RTLD_NOW)), the dynamic linker reports no problems, but calling code within F to instantiate some type instances defined in F (again) leads to segmentation faults in various places (if one is skipped or commented out, it crashes in another place; I therefore didn't spend time investigating the code pattern that might be troublesome, since a more general problem or misunderstanding is suspected). There are no inlined functions used, code is linked with -Wl,-E, visibility is default, GCC is 3.4.4. The F code is very stable and has been used within standalone apps or as part of plugins in the past.
I thought to link F as static library to workaround any problem there might be with the dynamic linker, but result is the same.
My view on the topic:
linking F as dynamic library leads dynamic linker to "know" F is referenced second time loading P2 and just increments the reference counter and does not call static initializers (which is ok), but does relocations (again, and this seems to be problematic).
linking F as static library leads dynamic linker to load F code as statically linked part of P2 (P2F) and does relocations within P2F. However, "somehow" common symbols from F gets messed up with P1F code instance.
Assumption about the workaround to make the code at least work:
link P1 ... Pn in a single shared library (single plugin), whether F is shared / static doesn't matter. This way any relocation is done only once.
I'd appreciate any feedback: is my view on the topic wrong, too simplified, or missing an important part? Is this some known GCC / binutils bug from the past?
My view on the topic:
Your view on the topic is wrong; but there is no way to prove that to you.
Write a minimal test case that simulates what your system does, and still crashes in a similar way. Update your question with actual broken code; then we can tell you exactly what the problem is.
There is also a very good chance that in reducing the problem to the minimal example, you'll discover what the problem is yourself.
Either way you'll understand the problem, and will learn something new.

Is there a compiled* programming language with dynamic, maybe even weak typing?

I wondered if there is a programming language which compiles to machine code/binary (not bytecode then executed by a VM, that's something completely different when considering typing) that features dynamic and/or weak typing, e.g:
Think of a compiled language where:
Variables don't need to be declared
Variables can be created during runtime
Functions can return values of different types
Questions:
Is there such a programming language?
(Why) not?
I think that a dynamically yet strongly typed, compiled language would really make sense, but is it possible?
I believe Lisp fits that description.
http://en.wikipedia.org/wiki/Common_Lisp
Yes, it is possible. See Julia. It is a dynamic language (you can write programs without types) but it never runs on a VM. It compiles the program to native code at runtime (JIT compilation).
Objective-C might have some of the properties you seek. Classes can be opened and altered in runtime, and you can send any kind of message to an object, whether it usually responds to it or not. In that way, you can implement duck typing, much like in Ruby. The type id, roughly equivalent to a void*, can be endowed with interfaces that specify a contract that the (otherwise unknown) type will adhere to.
C# 4.0 has many, if not all of these characteristics. If you really want native machine code, you can compile the bytecode down to machine code using a utility.
In particular, the use of the dynamic keyword allows objects and their members to be bound dynamically at runtime.
Check out Anders Hejlsberg's video, The Future of C#, for a primer:
http://channel9.msdn.com/pdc2008/TL16/
Objective-C has many of the features you mention: it compiles to machine code and is effectively dynamically typed with respect to object instances. The id type can store any class instance and Objective-C uses message passing instead of member function calls. Methods can be created/added at runtime. The Objective-C runtime can also synthesize class instance variables at runtime, but local variables still need to be declared (just as in C).
C# 4.0 has many of these features, except that it is compiled to IL (bytecode) and interpreted using a virtual machine (the CLR). This brings up an interesting point, however: if bytecode is just-in-time compiled to machine code, does that count? If so, it opens the door to not only any of the .NET languages, but also Python (see PyPy or Unladen Swallow or IronPython) and Ruby (see MacRuby or IronRuby) and many other dynamically typed languages, not to mention many Lisp variants.
In a similar vein to Lisp, there is Factor, a concatenative* language with no variables by default, dynamic typing, and a flexible object system. Factor code can be run in the interactive interpreter, or compiled to a native executable using its deploy function.
* point-free functional stack-based
VB 6 has most of that
I don't know of any language that has exactly those capabilities. I can think of two that have a significant subset, though:
D has type inference, garbage collection, and powerful metaprogramming facilities, yet compiles to efficient machine code. It does not have dynamic typing, however.
C# can be compiled directly to machine code via the mono project. C# has a similar feature set to D, but again without dynamic typing.
Python compiled to C probably meets these criteria.
Write in Python.
Compile Python to Executable. See Process to convert simple Python script into Windows executable. Also see Writing code translator from Python to C?
Elixir does this. The flexibility of dynamic variable typing helps with doing hot-code updates (for which Erlang was designed). Files are compiled to run on the BEAM, the Erlang/Elixir VM.
C/C++ both indirectly support dynamic typing using void*. C++ example:
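#include <cstdlib>
#include <new>
#include <string>
int main() {
    void* x = std::malloc(sizeof(int));
    *(int*)x = 5;
    std::free(x);  // release the int before x is reused
    x = std::malloc(sizeof(std::string));
    new (x) std::string("Hello world");  // construct the string in the raw memory
    ((std::string*)x)->~basic_string();  // destroy it before freeing the memory
    std::free(x);
    return 0;
}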
In C++17, std::any can be used as well:
#include <string>
#include <any>
int main() {
std::any x = 5;
x = std::string("Hello world");
return 0;
}
Of course, duck typing is rarely used or needed in C/C++, and both of these options have issues (void* is unsafe, and std::any carries type-erasure overhead and may allocate on the heap).
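Storing a value in std::any is only half of the story; to use it you have to ask which type it currently holds. A small sketch, assuming C++17 (the helper name `describe` is hypothetical):

```cpp
#include <any>
#include <string>

// Inspect the dynamic type held by a std::any and render it as text.
// The pointer form of std::any_cast returns nullptr on a type mismatch
// instead of throwing std::bad_any_cast.
std::string describe(const std::any& value) {
    if (!value.has_value()) {
        return "empty";
    }
    if (const int* i = std::any_cast<int>(&value)) {
        return "int: " + std::to_string(*i);
    }
    if (const std::string* s = std::any_cast<std::string>(&value)) {
        return "string: " + *s;
    }
    return "unknown type";
}
```

For example, `describe(std::any(5))` yields "int: 5", and after assigning a std::string to the same variable the second branch is taken. Every type you want to handle has to be listed explicitly, which is a big part of why this feels clumsier than real dynamic typing.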
Another example of what you may be looking for is the V8 engine for JavaScript. It is a JIT compiler, meaning the source code is compiled to bytecode and then machine code at runtime, although this is hidden from the user.

for a function in binary without source code, is there any way to get the number of parameters

I don't have the source code but have the binary. With command "nm binary_name" I could know the functions inside the binary.
Can I know how many parameters a function has? Under solaris, is there anyway to do that?
e.g, if the function is: func1(a int,b int,c int), then there are 3 parameters.
Thanks
Daniel
No. Neil Butterworth's suggestion to examine the function signature is a good one for C++ (since the parameters are often encoded into the function so the linker can tell the difference between "int x(int)" and "int x(float)" for example) but, for C, you're going to have to get your hands dirty and disassemble the function, taking particular note of how the stack frames are built and used in your environment.
Keep in mind that SPARC has a rotating window stack rather than a regular grow-down stack. You're really going to have to delve deep into the way the CPU works. If you're talking about Solaris for Intel, the rotating stack is not there, of course.
Assuming this is C code, then no, there is not: the compiler/linker elides that information. If it is C++ code, it is just possible that the mangled name of the function is retained and includes the parameters in encoded form.
At the lowest level, if you emulate the function running on the machine, then it will read some information either from registers or the stack which it has not written. If you compare these reads to the ABI of the platform ( You don't say whether it's Sparc Solaris or Intel Solaris ) then some of them should correspond to the registers/stack locations of the parameters of the function. Of course, there's no guarantee that a function will read all its parameters.
For Solaris, elfdump might give more information than nm ( a quick google for elfdump signature indicates support was requested and added, but you'd need to check what version you've got )
IDA Pro (http://www.hex-rays.com/idapro/) is a disassembler which is pretty clever at inferring the parameters of a function from object code;
maybe there is also symbolic information you can use; e.g. on Win32 the symbol _function@8 reveals that 8 bytes (2 parameters) are passed
one can also demangle C++ names to get the parameters and types
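For the demangling route, here is a minimal sketch. It assumes the binary was built with a compiler that uses the Itanium C++ ABI mangling scheme (GCC and Clang do; other compilers may use a different scheme). It calls `abi::__cxa_demangle` from `<cxxabi.h>`, which the GCC and Clang runtimes provide. The helper name `demangle` is hypothetical.

```cpp
#include <cxxabi.h>
#include <cstdlib>
#include <cstring>
#include <string>

// Returns the demangled form of an Itanium-ABI mangled symbol name.
// Names that do not start with "_Z" (e.g. plain C symbols) or that fail
// to demangle are returned unchanged.
std::string demangle(const char* mangled) {
    if (std::strncmp(mangled, "_Z", 2) != 0) {
        return mangled;
    }
    int status = 0;
    char* text = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || text == nullptr) {
        return mangled;
    }
    std::string result(text);
    std::free(text);  // the buffer returned by __cxa_demangle is malloc-allocated
    return result;
}
```

With this, the symbol `_Z1xi` comes back as `x(int)` and `_Z1xf` as `x(float)`, so the parameter list can be read directly from the name. A C symbol such as `func1` carries no such information and is returned as-is.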
