Does appending arbitrary data to an ELF file violate the ELF spec? - linux

I would like to add some information to an ELF file, but it ideally needs to be done in a way that a program can easily read this information without understanding ELF or using tools outside a normal standard language library. I was thinking of simply appending this data to the end of the ELF file (with some sort of sentinel to indicate the start of the data so the reading program can just seek backward to the sentinel), but I wanted to make sure this doesn't violate the ELF spec first. I'm not interested in whether a particular loader works fine with such appended data; I want to know if the ELF spec itself guarantees anything so that I can know different ELF-compliant loaders will be happy with it.
I see that questions like this have been asked before, but they either assume that appending is OK or got no direct answer:
Accessing data appended to an ELF binary
Add source code to elf file
As far as I can tell, the ELF spec is here:
http://www.muppetlabs.com/~breadbox/software/ELF.txt
I couldn't determine with a few searches whether the property I want is unambiguously allowed by that spec.

The specification does not really say anything about it, so one could argue that trailing data is undefined behavior. On the other hand, the ELF specification is rather clear about its expectations: “sections and segments have no specified order. Only the ELF header has a fixed position in the file.” That gives sufficient room to embed data one way or another, either inside a section or without one (in which case it is simply unreferenced data).
This "data freedom" has been exploited since at least the end of the 1980s; consider "self-extracting archives", where a generic unpacking stub operates on a trailing data portion.
In fact, you can find such an implicit feature even in non-executable data formats, such as RIFF and PNG. Not all formats allow this, of course; in particular those where data is defined to run until EOF rather than for a fixed length stored in some header. (Consider ZIP: appending data is not possible, but prepending is, which is what lets EXE-ZIPs be read both by unmodified unzip programs and by the operating system.)
There is just one drawback to using unreferenced data like this: a tool that reads and re-writes the file may silently drop it.
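As a concrete illustration of the question's sentinel idea, here is a minimal sketch in C. The trailer layout ([ELF image][payload][8-byte little-endian length][magic]) and the magic string are invented for this example and are not part of any spec; the writer side simply concatenates the payload, the length field, and the magic onto the end of the file, and the reader only needs fseek/fread, no ELF knowledge.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TRAILER_MAGIC "MYTRAILER1"              /* hypothetical sentinel */
#define MAGIC_LEN (sizeof(TRAILER_MAGIC) - 1)

/* Returns a malloc'ed copy of the payload, or NULL if no trailer is present.
   Expected layout at end of file: [payload][8-byte LE length][magic]. */
static unsigned char *read_trailer(const char *path, uint64_t *out_len)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;

    unsigned char tail[MAGIC_LEN + 8];
    if (fseek(f, -(long)sizeof tail, SEEK_END) != 0 ||
        fread(tail, 1, sizeof tail, f) != sizeof tail ||
        memcmp(tail + 8, TRAILER_MAGIC, MAGIC_LEN) != 0) {
        fclose(f);
        return NULL;                            /* no trailer appended */
    }

    uint64_t len = 0;
    for (int i = 0; i < 8; i++)                 /* decode the length field */
        len |= (uint64_t)tail[i] << (8 * i);

    unsigned char *buf = malloc(len);
    if (buf &&
        (fseek(f, -(long)(sizeof tail + len), SEEK_END) != 0 ||
         fread(buf, 1, len, f) != len)) {
        free(buf);
        buf = NULL;                             /* truncated or unreadable */
    }
    fclose(f);
    if (buf)
        *out_len = len;
    return buf;
}

Using a fixed-size trailer with the magic at the very end avoids having to scan backward for the sentinel; the loader never looks at any of this because nothing in the ELF headers references it.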

It might be OK to add extra data to ELF files (since you can add new segments and new sections to ELF, e.g. with objcopy --add-section), but you then need to have (or improve) tools that work on your "improved" ELFs, and that may be a significant burden. And don't forget to document very well (if possible, in a freely accessible document) what you are doing.

Related

Linux ELF binary: recover the files that a binary reads/writes

I need to get the files that a binary uses.
I can view the dependencies of an ELF binary in its dynamic section, but can I also get the configuration files of my binary?
For example, if a binary reads /etc/host, I want to see /etc/host listed in a section of my ELF file.
I do not see that in the documentation:
https://refspecs.linuxfoundation.org/LSB_1.1.0/gLSB/specialsections.html
I need to get files that some binary executable uses.
You can't get (all of) them. A file path used by some executable could be computed at runtime (and that is very often the case, just think of the cat(1) program). Solving that problem (of reliably computing all the files used by a program) in general could be proved equivalent to the Halting problem.
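To make that point concrete, here is a trivial made-up example in C where the opened path only exists at runtime; no static tool can enumerate it, because only the "/etc/%s" template is present in the binary.

#include <stdio.h>

int main(int argc, char **argv)
{
    char path[256];
    if (argc < 2)
        return 1;
    /* the full path is computed at runtime from argv[1];
       only the "/etc/%s" template appears as a string in the binary */
    snprintf(path, sizeof path, "/etc/%s", argv[1]);
    FILE *f = fopen(path, "r");
    if (f) {
        puts("opened");
        fclose(f);
    }
    return 0;
}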
However, in practice, the strings(1) utility might help you guess some of the files (statically) referred to by an executable.
You could also use strace(1) to see (dynamically) which files are open(2)-ed during a particular execution.
Also read the documentation of your executable carefully. If it is free software, study its source code as well.

ELF, Build-ID, is there a utility to recompute it?

I came across this useful feature in ELF binaries -- the Build ID. "It ... is (normally) the SHA1 hash over all code sections in the ELF image." One can read it with a GNU utility:
$ readelf -n /bin/bash
...
Displaying notes found at file offset 0x00000274 with length 0x00000024:
Owner Data size Description
GNU 0x00000014 NT_GNU_BUILD_ID (unique build ID bitstring)
Build ID: 54967822da027467f21e65a1eac7576dec7dd821
I wonder if there is an easy way to recompute the Build ID yourself, for example to check that it isn't corrupted.
So, I've got an answer from Mark. Since it is up-to-date information, I am posting it here. But basically you guys are right: there is no tool for recomputing the Build-ID, and the intention of the Build-ID is not (1) identification of the file contents, and not even (2) identification of the executable (code) part of it, but (3) capturing the "semantic meaning" of a build, which is the hard bit to formalize. (Numbers are for self-reference.)
Quote from the email:
-- "Is there a user tool recomputing the build-id from the file itself, to
check if it's not corrupted/compromised somehow etc?"
If you have time, maybe you could post an answer there?
Sorry, I don't have a stackoverflow account.
But the answer is: No, there is no such tool because the precise way a
build-id is calculated isn't specified. It just has to be universally
unique. Even the precise length of the build-id isn't specified. There
are various ways using different hashing algorithms a build-id could be
calculated to get a universally unique value. And not all data might
(still be) in the ELF file to recalculate it even if you knew how it was
created originally.
Apparently, the intentions of Build-ID changed
since the Fedora Feature page was written about
it.
And people's opinions diverge on what it is now.
Maybe in your answer you could include status of Build-ID and what it is
now as well?
I think things weren't very precisely formulated. If a tool changes the
build that creates the ELF file so that it isn't a "semantically
identical" binary anymore then it should get a new (recalculated)
build-id. But if a tool changes something about the file that still
results in a "semantically identical" binary then the build-id stays the
same.
What isn't precisely defined is what "semantically identical binary"
means. The intention is that it captures everything that a build was
made from. So if the source files used to generate a binary are
different then you expect different build-ids, even if the binary code
produced might happen to be the same.
This is why when calculating the build-id of a file through a hash
algorithm you use not just the (allocated) code sections, but also the
debuginfo sections (which will contain references to the source file
names).
But if you then for example strip the debuginfo out (and put it into a
separate file) then that doesn't change the build-id (the file was still
created from the same build).
This is also why, even if you knew the precise hashing algorithm used to
calculate the build-id, you might not be able to recalculate the
build-id. Because you might be missing some of the original data used in
the hashing algorithm to calculate the build-id.
Feel free to share this answer with others.
Cheers,
Mark
Also, for people interested in debuginfo (Linux performance & tracing, anyone?), he mentioned a couple of projects for managing it on Fedora:
https://fedoraproject.org/wiki/Changes/ParallelInstallableDebuginfo
https://fedoraproject.org/wiki/Changes/SubpackageAndSourceDebuginfo
The build ID is not a hash of the program, but rather a unique identifier for the build, and is to be considered just an opaque "unique blob". At some point it used to be defined as a hash of the timestamp and absolute file path, but that's not a guarantee of stability either.
I wonder if there is an easy way to recompute Build ID yourself?
No, there isn't, by design.
The page you linked to itself links to the original description of what the build-id is and what it's usable for. That page says:
But I'd like to specify it explicitly as being a unique identifier good
only for matching, not any kind of checksum that can be verified against
the contents.
(There are external general means for content verification, and I don't
think debuginfo association needs to do that.)
An additional complication is that the linker can take any of:
--build-id
--build-id=sha1
--build-id=md5
--build-id=0xhexstring
So the build ID is not necessarily a SHA1 sum to begin with.
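While you cannot recompute the build-id, you can read it programmatically and match it against the readelf -n output. Below is a sketch (glibc-specific, using dl_iterate_phdr; it assumes the usual 4-byte note padding and does minimal error handling) that prints the NT_GNU_BUILD_ID note of every loaded object.

#include <elf.h>
#include <link.h>
#include <stdio.h>
#include <string.h>

static int print_build_id(struct dl_phdr_info *info, size_t size, void *data)
{
    for (int i = 0; i < info->dlpi_phnum; i++) {
        if (info->dlpi_phdr[i].p_type != PT_NOTE)
            continue;
        /* walk the notes inside this PT_NOTE segment */
        const char *p   = (const char *)(info->dlpi_addr + info->dlpi_phdr[i].p_vaddr);
        const char *end = p + info->dlpi_phdr[i].p_memsz;
        while (p + sizeof(ElfW(Nhdr)) <= end) {
            const ElfW(Nhdr) *nh = (const ElfW(Nhdr) *)p;
            const char *name = p + sizeof(*nh);
            const unsigned char *desc =
                (const unsigned char *)(name + ((nh->n_namesz + 3) & ~3u));
            if (nh->n_type == NT_GNU_BUILD_ID &&
                nh->n_namesz == 4 && memcmp(name, "GNU", 4) == 0) {
                printf("%s: build-id ",
                       info->dlpi_name[0] ? info->dlpi_name : "(main executable)");
                for (unsigned j = 0; j < nh->n_descsz; j++)
                    printf("%02x", desc[j]);
                printf("\n");
            }
            p = (const char *)desc + ((nh->n_descsz + 3) & ~3u);
        }
    }
    return 0;
}

int main(void)
{
    dl_iterate_phdr(print_build_id, NULL);  /* visit every loaded object */
    return 0;
}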

x64 Portable Executable section order

Can the 3 essential sections: .data (resources), .rdata (imports), and .text (instructions) in the Portable Executable (.exe) file format be in any order as long as the 'Address of Entry Point' field points to the .text section? It seems like having the instructions (.text) be first is a big pain in the butt since you have to calculate the imports and resources sections to actually WRITE the instructions section...
This is what I'm going off of: https://i.imgur.com/LIImg.jpg
What about for run-time performance?
As already answered by Hans, the linker is free to arrange sections in any order it sees fit. The only exception is grouped sections such as .text$A and .text$B, whose contributions are merged in lexicographical order of the suffix following the $.
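As a small illustration of that $-suffix grouping, here is an MSVC-specific sketch (the .mydata section name is made up); the linker collects .mydata$a, .mydata$m and .mydata$z into a single .mydata section, ordered by suffix regardless of declaration order.

/* build with cl.exe; #pragma section and __declspec(allocate) are MSVC extensions */
#include <stdio.h>

#pragma section(".mydata$a", read)
#pragma section(".mydata$m", read)
#pragma section(".mydata$z", read)

__declspec(allocate(".mydata$z")) static const int last   = 3;
__declspec(allocate(".mydata$a")) static const int first  = 1;
__declspec(allocate(".mydata$m")) static const int middle = 2;

int main(void)
{
    /* addresses come out in suffix order (first < middle < last),
       not in the order the variables were declared */
    printf("%p %p %p\n",
           (const void *)&first, (const void *)&middle, (const void *)&last);
    return 0;
}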
The order in which the sections are written by the linker is not of great significance to how easy it is to produce the final binary, either. Typically, the binary file isn't written sequentially as the sections are computed; rather, the section contents are produced in buffers, and the references between code and data are kept symbolic (in a relocatable format) until the sections are written to the final executable.
The part of the question relating to performance has more to do with how the image loader in Windows works, rather than the linker. Because the loader does not need the sections in any particular order, there is no additional overhead (e.g. related to sorting) when unpacking the sections into the memory view of the image file. Relocations and matching between import and export tables are done in any case, and the amount of work is decided by other factors. Hence, the order decided by the linker does not in itself affect the loading time.
For normal Windows API or Native binaries (not CLR), the section names are not important either--only the characteristics of each section, which decide e.g. the access rights of the memory mapped pages in the image (whether they are read-only, writable, executable, etc.). For example, the import table may be placed in a section named .idata rather than .rdata, or the section may be named something completely different.
The format of a PE file is described in detail by the pecoff.doc document (direct link to a Word 2003 file). What you are asking about is covered in chapter 4, which describes the Section Table. The most relevant detail:
The number of entries in the Section Table is given by the NumberOfSections field in the file header. Entries in the Section Table are numbered starting from one. The code and data memory section entries are in the order chosen by the linker.
So no, this is not cast in stone, sections can appear in any order.
It seems like having the instructions (.text) be first is a big pain
As hinted at by the pecoff language, this is a linker implementation detail. And for Microsoft's linker, and probably most other linkers, it is not actually a big pain. The linker's first and foremost job is to generate the executable code, and there tends to be a lot of it. Not all of the code is used, only what is needed to resolve the dependencies; that is a very common scenario, a static C runtime library being the classic example. Your program does not call every possible runtime function, so the linker only links in what is needed.
Relocations and imports are a minor detail by comparison; there are just not nearly as many of them. So it is a lot more efficient to generate the code first, keep track of the relocations and imports required to match that code in memory, and write them to the PE file later.
Your assumption that it is "better" the other way around is not accurate. To a linker anyway.
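If you want to see the order your linker actually chose, the Section Table is easy to walk. Below is a rough C sketch; the struct layouts are transcribed from the PE/COFF spec rather than taken from <windows.h> (so it also builds on non-Windows hosts), and it assumes a little-endian host and a well-formed file.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#pragma pack(push, 1)
typedef struct {            /* COFF file header (follows the "PE\0\0" signature) */
    uint16_t Machine;
    uint16_t NumberOfSections;
    uint32_t TimeDateStamp;
    uint32_t PointerToSymbolTable;
    uint32_t NumberOfSymbols;
    uint16_t SizeOfOptionalHeader;
    uint16_t Characteristics;
} CoffHeader;

typedef struct {            /* one Section Table entry */
    char     Name[8];
    uint32_t VirtualSize;
    uint32_t VirtualAddress;
    uint32_t SizeOfRawData;
    uint32_t PointerToRawData;
    uint32_t PointerToRelocations;
    uint32_t PointerToLinenumbers;
    uint16_t NumberOfRelocations;
    uint16_t NumberOfLinenumbers;
    uint32_t Characteristics;
} SectionHeader;
#pragma pack(pop)

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s file.exe\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    uint32_t e_lfanew;
    fseek(f, 0x3c, SEEK_SET);               /* e_lfanew lives at offset 0x3c */
    fread(&e_lfanew, 4, 1, f);

    char sig[4];
    fseek(f, e_lfanew, SEEK_SET);
    fread(sig, 1, 4, f);
    if (memcmp(sig, "PE\0\0", 4) != 0) { fprintf(stderr, "not a PE file\n"); return 1; }

    CoffHeader coff;
    fread(&coff, sizeof coff, 1, f);
    fseek(f, coff.SizeOfOptionalHeader, SEEK_CUR);   /* skip the optional header */

    for (int i = 0; i < coff.NumberOfSections; i++) {
        SectionHeader sh;
        fread(&sh, sizeof sh, 1, f);
        printf("%2d  %-8.8s  RVA 0x%08x  raw 0x%08x\n",
               i + 1, sh.Name, sh.VirtualAddress, sh.PointerToRawData);
    }
    fclose(f);
    return 0;
}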

How to modify an ELF file in a way that changes data length of parts of the file?

I'm trying to modify the executable contents of my own ELF files to see if this is possible. I have written a program that reads and parses ELF files, searches for the code that it should update, changes it, then writes it back after updating the sh_size field in the section header.
However, this doesn't work. If I simply exchange some bytes with other bytes, it works; but if I change the size, it fails. I'm aware that some sh_offsets are immediately adjacent to each other; however, this shouldn't matter when I'm reducing the size of the executable code.
Of course, there might be a bug in my program (or more than one), but I've already painstakingly gone through it.
Instead of asking for help with debugging my program I'm just wondering, is there anything else than the sh_size field I need to update in order to make this work (when reducing the size)? Is there anything that would make changing the length fail other than that field?
Edit:
It seems that Andy Ross was perfectly correct. Even in this very simple program I have come across some indirect addressing in __libc_start_main that I cannot trivially modify to update the offset it will reach.
I was curious though, what would be the best approach to still trying to get as far as possible with this problem? I know I cannot solve this in every case, but for some simple programs, it should be possible to update what is required to make it run? Should I try writing my own virtual machine or try developing a "debugger" that would replace each suspected problem instruction with INT 3? Any ideas?
The text segment is likely internally linked with relative offsets. So one function might be trying to jump to, say, "current address plus 194 bytes". If you move things around such that the jump target is now 190 bytes, you will obviously break things. The same is true of constant data on some architectures (e.g. x86-64 but not i686). There is no simple way short of a complete disassembly to know where the internal references are, and in fact it's computationally undecidable to find them all (i.e. trying to figure out all possible jump targets of a runtime-computed branch is the Halting Problem).
Basically, this isn't solvable in the general case, so if you have an ELF binary from someone else you're trying to patch, you'll need to try other techniques. But with (great!) care it's possible to produce a library where all internal references go through the GOT/PLT which can be sliced up and relinked like this. What are you trying to accomplish?
is there anything else than the sh_size field I need to update in order to make this work
It sounds like you are patching a fully-linked binary (ET_EXEC or ET_DYN). Please note that .sh_size is not used for anything after the static link is done. You can strip the entire section table, and the binary will continue to work fine. What matters at runtime are the segments in the ELF, not sections.
ELF stands for Executable and Linking Format, and the executable and linking views form the "dual nature" of ELF. Sections are used at (static) link time and are combined into segments, which are what is used at execution time (i.e. at runtime, including dynamic linking).
Of course you haven't told us exactly what your patching strategy is when you are shrinking your binary, and in what way the result is broken. It is very likely that Andy Ross's answer is the real cause of your breakage.
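To convince yourself that the section table really is irrelevant at runtime, you can blank out the ELF header's references to it in a copy of a fully linked binary and watch the copy still execute. A rough sketch (64-bit ELF assumed, minimal error handling; run it on a copy, never on the original):

#include <elf.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s elf-copy\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);
    Elf64_Ehdr *eh = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
    if (eh == MAP_FAILED) { perror("mmap"); return 1; }
    if (memcmp(eh->e_ident, ELFMAG, SELFMAG) != 0) {
        fprintf(stderr, "not an ELF file\n");
        return 1;
    }

    /* The loader only consults the program headers (e_phoff/e_phnum);
       the section header table is a link-time artifact. */
    eh->e_shoff    = 0;
    eh->e_shnum    = 0;
    eh->e_shstrndx = 0;

    munmap(eh, st.st_size);
    close(fd);
    return 0;
}

Afterwards, readelf -S on the copy reports no section headers, yet the program still runs, which is exactly the point above about sections versus segments.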

Simple way to reorder ELF file sections

I'm looking for a simple way to reorder the ELF file sections. I've got a sequence of custom sections that I would like all to be aligned in a certain order.
The only way I've found to do this is with a linker script. However, the documentation indicates that specifying a custom linker script overrides the default. The default linker script has a lot of content in it that I don't want to have to duplicate in my custom script just to keep three sections together in a certain order. Hard-coding the linker behavior like that does not seem very flexible.
Why do I want to do this? I have a section of data that I need to know the run-time memory location of (beginning and end). So I've created two additional sections and put sentinel variables in them. I then want to use the memory locations of those variables to know the extent of the unknown section in memory.
.markerA
int markerA;
.targetSection
... Lots of variables ...
.markerB
int markerB;
In the above example, I would know that the data in .targetSection is between the address of markerA and markerB.
Is there another way to accomplish this? Are there libraries that would let me read in the currently executing ELF image and determine section location and size?
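One GNU-ld-specific alternative worth noting (it is not covered in the answers below): if the custom section's name is a valid C identifier (so targetSection rather than .targetSection), the linker automatically defines __start_<name> and __stop_<name> symbols bracketing the section, so no marker variables and no custom linker script are needed. A sketch:

#include <stdio.h>

/* a few variables placed in the custom section */
__attribute__((section("targetSection"), used)) int a = 1;
__attribute__((section("targetSection"), used)) int b = 2;
__attribute__((section("targetSection"), used)) int c = 3;

/* defined automatically by GNU ld; no definition exists in any source file */
extern int __start_targetSection;
extern int __stop_targetSection;

int main(void)
{
    char *begin = (char *)&__start_targetSection;
    char *end   = (char *)&__stop_targetSection;
    printf("targetSection: %p .. %p (%ld bytes)\n",
           (void *)begin, (void *)end, (long)(end - begin));
    return 0;
}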
You can obtain the addresses of loaded sections by analyzing the ELF file format. Details may be found e.g. in
Tool Interface Standard (TIS)
Portable Formats Specification,
version 1.2
(http://refspecs.freestandards.org/elf/elf.pdf)
For a quick impression of which information is available, it is worth taking a look at readelf:
readelf -S <filename>
returns a list of all sections contained in <filename>.
The sections that get mapped into memory are typically of type PROGBITS (more precisely, those with the A/alloc flag set).
The address you are looking for is displayed in the Addr column.
To obtain the actual memory location you have to add the load address of your executable / shared object.
There are a few ways to determine the load adress of your executable/shared object:
you may parse /proc/[pid]/maps (the first column contains the load address). [pid] is the process id
if you know a function contained in your file, you can take its address (or obtain one with dlsym) and pass it to dladdr, which fills in a Dl_info struct containing the requested load address, as sketched below
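A sketch of the dladdr route from the last bullet (glibc/BSD extension; here the function address is taken directly with &, so dlsym is not even needed; link with -ldl on older glibc):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

static void probe(void) {}          /* any function in this object works */

int main(void)
{
    Dl_info info;
    if (dladdr((void *)&probe, &info) != 0) {
        printf("object:       %s\n", info.dli_fname);
        printf("load address: %p\n", info.dli_fbase);  /* add this to Addr */
    }
    return 0;
}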
To get some ELF information, the libelf library may be a helpful companion (I discovered it only after studying the above-mentioned TIS document, so I have only taken a short look at it and don't know the deeper details).
I hope this sketch of a possible solution will help.
You may consider using GCC initializers (constructor functions) to register the variables that would otherwise go into a separate section, keeping pointers to all of them in an array. I recommend initializers because this approach is independent of the file format.
You may look at ELFIO library. It contains WriteObj and Writer examples. By using the library, you will be able to create/modify ELF binary file programmatically.
I'm afraid overriding the default linker script is the simplest solution.
Since you are worried that this might not be flexible (though I don't think the default linker script changes that often), you could write a script that generates a linker script from the host system's default one ("ld --verbose" prints it) and inserts your special sections into it.

Resources