Visual C++ project, Wintel-32. I have a C file that's compiled to object then linked, pretty vanilla setup. Debug configuration.
When I examine the object file with dumpbin /symbols, it tells me that my object file has numerous code ("COMDAT") sections - one per function, it seems. They are all named .text, and the linker merges them into one large .text section in the final executable.
Function-level linking is disabled in the project settings, so I'm not even sure why COMDATs are being generated in the first place.
But I've noticed in the debugger that those OBJ-level sections (functions) are not laid out contiguously in the executable. Between them there's padding - several dozen bytes of int 3 instructions, obviously dead space where control is never supposed to go. The function boundaries are all aligned to 16 bytes, but there's more going on: if this were plain 16-byte alignment, the padding would be much smaller in most cases. It's typically around 20-40 bytes, but I've seen outliers - 11 bytes of padding here, 73 there.
This has nothing to do with the linker's /ALIGN option - that one deals with sections proper, and its default is 4K, which is definitely not what we have here.
Why this padding? And what's the algorithm for its size (definitely not mere alignment)?
If you have Edit & Continue turned on for the project, the padding you're seeing is introduced so the compiler and linker can patch the executable image rather than having to rebuild and relink it.
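One way to confirm this (my suggestion, not part of the original answer): rebuild with the plain program-database debug format instead of the Edit and Continue one and compare the dumpbin/disassembly output - the padding should shrink to ordinary 16-byte alignment. With the command-line compiler that is roughly:

cl /c /Zi foo.c

where foo.c stands in for your source file; /ZI (rather than /Zi) is the switch that enables Edit and Continue. In the IDE the same choice is the project's "Debug Information Format" setting under C/C++.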
I have a simple test program that loads an xmm register with the movdqu instruction, accessing data across a page boundary (OS = Linux). If the following page is mapped, this works just fine. If it's not mapped, then I get a SIGSEGV, which is probably expected.
However, this diminishes the usefulness of unaligned loads quite a bit. Additionally, SSE4.2 instructions (like pcmpistri), which allow unaligned memory references, appear to exhibit this behavior as well.
That's all fine -- except that I've found plenty of strcmp implementations using pcmpistri that don't seem to address this issue at all, and I've been able to contrive trivial test cases that make those implementations fail, while a trivial byte-at-a-time strcmp implementation works just fine with the same data layout.
One more note -- it appears that the GNU C library implementation for 64-bit Linux has a __strcmp_sse42 variant that uses the pcmpistri instruction in a safer manner. The implementation of this strcmp is fairly complex, but it appears to be carefully trying to avoid the page boundary issue. I'm not sure if that's due to the issue I describe above, or whether it's just a side effect of trying to get better performance by aligning the data.
Anyway, the question I have is primarily: where can I find out more about this issue? I've typed "movdqu crossing page boundary" and every variant of that I can think of into Google, but haven't come across anything particularly useful. If anyone can point me to further info on this, it would be greatly appreciated.
First, any algorithm which tries to access an unmapped address will cause a SegFault. If a non-AVX code flow used a 4-byte load to access the last byte of a page and the first 3 bytes of "the next page", which happened not to be mapped, then it would also cause a SegFault, no? I believe that the "issue" is that the AVX(1/2/3) registers are so much bigger than "typical" that algorithms which were unsafe (but got away with it) get caught when they are trivially extended to the larger registers.
Aligned loads (MOVDQA) can never have this problem since they don't cross any boundaries of their own size or greater. Unaligned loads CAN have this problem (as you've noted) and "often" do. The reason for this is that the instruction is defined to load the full size of the target register. You need to look at the operand types in the instruction definitions quite carefully. It doesn't matter how much of the data you are interested in. It matters what the instruction is defined to do.
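Here is a minimal repro sketch of the problem (my own illustration, not from the original post), assuming Linux, GCC or Clang, and an SSE2-capable x86 target; the mprotect'ed second page stands in for "the next page isn't mapped":

#include <emmintrin.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long pg = sysconf(_SC_PAGESIZE);

    /* Map two pages, then make the second one inaccessible. */
    char *p = mmap(NULL, 2 * pg, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    mprotect(p + pg, pg, PROT_NONE);

    /* A short string ending right at the page boundary. */
    char *s = p + pg - 8;
    memcpy(s, "abcdefg", 8);                    /* 7 chars + NUL, all readable */

    printf("strlen sees %zu bytes\n", strlen(s));   /* fine */

    /* movdqu always loads a full 16 bytes, 8 of them from the
       PROT_NONE page, so this faults even though the string is intact. */
    __m128i v = _mm_loadu_si128((const __m128i *)s);

    char out[17] = { 0 };
    _mm_storeu_si128((__m128i *)out, v);
    printf("movdqu read: %s\n", out);           /* never reached */
    return 0;
}

Compiled and run, the first printf succeeds and the process then dies with SIGSEGV on the unaligned 16-byte load.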
However...
AVX1 (Sandybridge) added a "masked move" capability which is slower than a movdqa or movdqu but will not (architecturally) access the unmapped page so long as the mask is not enabled for the portion of the access which would have fallen in that page. This is meant to address the issue. In general, moving forward, it appears that masked portions (See AVX512) of loads/stores will not cause access violations on IA either.
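As a hedged sketch of that masked-move capability (my own example, assuming GCC or Clang with -mavx; not code from the answer): _mm256_maskload_ps compiles to vmaskmovps, and lanes whose mask bit is clear are architecturally not accessed, so they may point at an inaccessible page without faulting.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    /* Only four valid floats; imagine the bytes after them live in an
       unmapped page. */
    float data[4] = { 1.0f, 2.0f, 3.0f, 4.0f };

    /* Enable the first 4 of 8 lanes (the high bit of each mask element
       controls the corresponding lane). */
    __m256i mask = _mm256_setr_epi32(-1, -1, -1, -1, 0, 0, 0, 0);

    /* Loads 4 floats; the masked-off lanes are not touched and come
       back as zero. */
    __m256 v = _mm256_maskload_ps(data, mask);

    float out[8];
    _mm256_storeu_ps(out, v);
    printf("%g %g %g %g | %g %g %g %g\n",
           out[0], out[1], out[2], out[3],
           out[4], out[5], out[6], out[7]);
    return 0;
}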
(It is a bummer about PCMPxSTRx behavior. Perhaps you could add 15 bytes of padding to your "string" objects?)
Facing a similar problem with a library I was writing, I got some information from a very helpful contributor.
The core of the idea is to align the 16-byte reads to the end of the string, then handle the leftover bytes at the beginning. This works because the end of the string must live in an accessible page, and you are guaranteed that the 16-byte truncated starting address must also live in an accessible page.
Since we never read past the string we cannot potentially stray into a protected page.
To handle the initial set of bytes, I chose to use the PCMPxSTRM functions, which return the bitmask of matching bytes. Then it's simply a matter of shifting the result to ignore any mask bits that occur before the true beginning of the string.
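For illustration, here is a simplified sketch of the underlying safety trick (my own code, not the contributor's: it rounds the start down to a 16-byte boundary and uses plain SSE2 compares instead of PCMPxSTRM, and __builtin_ctz is GCC/Clang-specific). The key property is the same: an aligned 16-byte load can never cross a page boundary, and the match bits for bytes before the true start of the string are simply shifted away.

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

size_t sse2_strlen(const char *s)
{
    /* Round the address down to a 16-byte boundary; the aligned load
       below stays within one page no matter where s points. */
    const char *base = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned offset = (unsigned)(s - base);

    __m128i zero = _mm_setzero_si128();
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)base), zero));

    /* Discard match bits for the bytes that precede the real string. */
    mask >>= offset;
    if (mask)
        return (size_t)__builtin_ctz(mask);

    /* All further loads are aligned, so none of them can fault as long
       as the string's own bytes (including the NUL) are mapped. */
    for (base += 16; ; base += 16) {
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)base), zero));
        if (mask)
            return (size_t)(base - s) + (size_t)__builtin_ctz(mask);
    }
}

This is presumably also why the __strcmp_sse42 variant mentioned in the question goes to such lengths to keep its loads aligned.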
I'm new to binary and assembly, and I'm curious about how to directly edit binary executables. I tried to remove an instruction from a binary file (based on the disassembly provided by objdump), but after doing that the "executable" no longer seems to be in an executable format (segmentation fault when running; gdb cannot recognize it). I heard that this is due to an instruction alignment issue. (Is it?)
So, is it possible to add/remove single x86 instructions directly in Linux executables? If so, how? Thanks in advance.
If you remove a chunk of a binary file without adjusting the file headers accordingly, it will become invalid.
Fortunately, you can replace instructions with NOP without actually removing them. File size remains the same, and if there is no checksum or signature (or if it's not actually checked), there is nothing more to do.
There is no universal way to insert instructions, but generally you overwrite the original code with a JMP to another location, where you reproduce what the original code did, do whatever you wanted to add, then JMP back. Finding room for your new code might be impossible without changing the size of the binary, so I would instead patch the code after the executable is loaded (perhaps using a special LD_PRELOADed library).
Yes. Just replace it with a NOP instruction (0x90) - or with multiple NOPs if the instruction spans multiple bytes. This is an old trick.
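If you want to script the patch, a minimal sketch (my own illustration) is below: it overwrites a run of bytes at a given file offset with 0x90. Note that the offset is a file offset, not the virtual address objdump prints; readelf -l shows how program headers map virtual addresses to file offsets.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <file> <file-offset> <length>\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "r+b");
    if (!f) { perror("fopen"); return 1; }

    long offset = strtol(argv[2], NULL, 0);   /* accepts 0x... too */
    long length = strtol(argv[3], NULL, 0);

    /* Overwrite the instruction's bytes with one-byte NOPs (0x90). */
    if (fseek(f, offset, SEEK_SET) != 0) { perror("fseek"); return 1; }
    for (long i = 0; i < length; i++)
        fputc(0x90, f);

    fclose(f);
    return 0;
}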
Could someone please explain what exactly this O_LARGEFILE option does to support opening of large files.
And can there be any side effects of compiling with the -D_FILE_OFFSET_BITS=64 flag? In other words, when compiling with this option, is there anything we have to make sure of?
Use _FILE_OFFSET_BITS in preference to O_LARGEFILE. These are used on 32-bit systems to allow opening files so large that they exceed the range of a 32-bit file offset.
No, you don't have to do anything special. If you are on 64-bit Linux it makes no difference anyway.
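As a small sketch of what the macro changes (my own example; bigfile.dat and the build line are purely illustrative): on a 32-bit build, -D_FILE_OFFSET_BITS=64 makes off_t 64 bits wide, so plain open()/lseek() work past 2 GiB without O_LARGEFILE ever appearing in the source.

/* build, illustratively: gcc -m32 -D_FILE_OFFSET_BITS=64 big.c -o big */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    /* 8 with the macro on a 32-bit build; always 8 on 64-bit Linux. */
    printf("sizeof(off_t) = %zu\n", sizeof(off_t));

    int fd = open("bigfile.dat", O_RDONLY);   /* hypothetical large file */
    if (fd < 0) { perror("open"); return 1; }

    off_t end = lseek(fd, 0, SEEK_END);       /* valid beyond 2 GiB */
    printf("file size: %lld bytes\n", (long long)end);

    close(fd);
    return 0;
}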
From man 2 open:
O_LARGEFILE
(LFS) Allow files whose sizes cannot be represented in an off_t (but can be represented in an off64_t) to be opened. The _LARGEFILE64_SOURCE macro must be defined in order to obtain this definition. Setting the _FILE_OFFSET_BITS feature test macro to 64 (rather than using O_LARGEFILE) is the preferred method of accessing large files on 32-bit systems (see feature_test_macros(7)).
Edit: (ie. RTM :P)
I have a stripped ld.so that I want to replace with the unstripped version (so that valgrind works). I have ensured that I have the same version of glibc and the same cross compiler.
I have compiled the shared object, calling 'file' on it shows that it is compiled correctly (the only difference with the original being the 'unstripped' and being about 15% bigger). Unfortunately, it then causes a kernel panic (unable to init) on start up. Stripping the newly compiled .so, readelf-ing it and diff-ing it with the original, shows that there were extra symbols in the new version of the .so . All of the old symbols were still present, so what I don't understand is why the kernel panics with those extra symbols there.
I would expect the extra symbols to have no effect on the kernel start-up, as they should never be called, so why do I get a kernel panic?
NB: To be clear - I will still need to investigate why there are extra symbols, but my question is about why these unused symbols cause problems.
The kernel (assuming Linux) doesn't depend on or use ld.so in any way, shape or form. The reason it panics is most likely that it can't exec any of the user-level programs (such as /bin/init and /bin/sh), which do use ld.so.
As to why your init doesn't like your new ld.so, it's hard to tell. One common mistake is to try to replace ld.so with the contents of /usr/lib/debug/ld-X.Y.so. While that file looks like it's not much different from the original /lib/ld-X.Y.so, it is in fact very different, and can't be used to replace the original, only to debug the original (/usr/lib/debug/ld-X.Y.so usually contains only debug sections, but none of code and data sections of /lib/ld-X.Y.so, and so attempts to run it usually cause immediate SIGSEGV).
Perhaps you can set up a chroot, mimicking your embedded environment, and run /bin/ls in it? The error (or a core dump) this will produce will likely tell you what's wrong with your ld.so.
How do you account for the difference between the size in bytes of a compiled ELF file as reported by wc (relatively large) and size (sum total of sections in file - relatively small) under Linux?
Edit: For example, compile a very simple C++ program using g++ and run 'wc myexe' and 'size myexe' and wc may return, for example 500B, whilst size may return a total of 100 bytes for all sections.
Edit II: I understand that the two commands do different things; sorry, I should have said that I'm not looking for the answer 'because they're different'. I want to know what exactly accounts for the difference. Why should the wc byte count be so much larger than the total size of the sections, which, after all, comprise the executable part of the file?
wc just counts the number of bytes in a file (and words and lines). It works on any normal files, not just object and executable files.
size parses the headers of an object or executable file and shows information about the segments that the author of size thought would be useful, back when size was written (hint - long before Linux was born!).
readelf and some of the other binutils programs read and parse the more modern ELF format files and show more info, including segments that size doesn't know about.
If you really want to understand what is going on under the hood, you can write your own readelf-like program, starting with /usr/include/elf.h, and parse the files for yourself. :)
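For instance, a minimal sketch of such a program might look like the following (my own illustration, assuming a 64-bit native-endian ELF file and only rudimentary error handling): it reads the ELF header, loads the section header table, and prints every section's name and size - which, together with the ELF header and the program/section header tables, roughly accounts for the byte count wc reports (NOBITS sections such as .bss take no file space).

#include <elf.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <elf-file>\n", argv[0]); return 1; }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    Elf64_Ehdr eh;
    if (fread(&eh, sizeof eh, 1, f) != 1) { perror("fread"); return 1; }

    /* Section headers live at e_shoff; their names are indices into the
       section header string table (section e_shstrndx). */
    Elf64_Shdr *sh = malloc(eh.e_shnum * sizeof *sh);
    fseek(f, (long)eh.e_shoff, SEEK_SET);
    fread(sh, sizeof *sh, eh.e_shnum, f);

    char *shstr = malloc(sh[eh.e_shstrndx].sh_size);
    fseek(f, (long)sh[eh.e_shstrndx].sh_offset, SEEK_SET);
    fread(shstr, 1, sh[eh.e_shstrndx].sh_size, f);

    for (int i = 0; i < eh.e_shnum; i++)
        printf("%-24s %8llu bytes\n",
               shstr + sh[i].sh_name,
               (unsigned long long)sh[i].sh_size);

    free(shstr);
    free(sh);
    fclose(f);
    return 0;
}

Run on a typical hello-world binary, this lists far more than text, data and bss: .comment, .note.* sections, .symtab, .strtab, relocation and debug sections, and so on.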
Correct me if I'm wrong, but don't wc and size do different things? wc returns the number of characters, words and lines in a file, but size returns the bytes in each section. That would then account for the difference.
My random guess would be the ELF headers. I'm only putting that random guess here so people stop saying wc and size do different things, and point people who know the real answer in the right direction.
I just tested both, and apparently size only shows the size of the code, data and bss sections.
There are generally lots of other sections (plus headers), which you see with:
readelf --sections file
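(If your binutils supports it, the SysV output format of size also lists every section rather than just text, data and bss, which makes the comparison with wc easier:

size --format=SysV file

is the long spelling; size -A file is the short one.)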
[Edit]
Among other things, here is what takes space in your file:
Constructors and destructors,
the dynamic symbol table, which lists symbols that will be resolved at runtime,
and the .init and .fini sections, which I believe contain runtime initialization and finalization code.
[Edit]
For more information on the ELF format, read:
http://www.iecc.com/linker/ (chapter 3)
http://www.sco.com/developers/gabi/2003-12-17/contents.html