I'm writing a small parser, and after some trial and error it seems the file's byte order is big endian (which I was told is uncommon, but it does occur).
I don't think the original developers recorded anything about endianness, since the byte order may depend only on the hardware that wrote the file. Please correct me if this is wrong (is it possible that the developers specified the endianness in the C code?).
So I can't see how to parse those files when there is no reliable way to determine the byte order of, say, an Int32 number. I've read this similar post, but that's for a system that both writes and reads the binary files, so you can just use a system-endianness reader.
In my case, the code parses instrument output gathered and written in binary by potentially any type of computer with any OS (though I guess, again, that endianness depends on the system architecture and not the OS).
Do you have any ideas or pointers on how to deal with this problem?
Wikipedia was very informative, but as far as I read it's just general information.
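One way to approach the problem described above, sketched under assumptions the question does not state: always read with an explicit byte order rather than the host's, and, if the format has a field whose value is known to be small and positive (a version, a record count), try both orders and keep the one that gives a plausible value. The function names and the `max_plausible` bound are hypothetical, and the heuristic can be ambiguous.

```python
import struct


def read_int32(buf, offset, byte_order):
    """Read a signed 32-bit integer with an explicit byte order.

    byte_order is ">" (big endian) or "<" (little endian), so the result
    does not depend on the machine running the parser.
    """
    return struct.unpack_from(byte_order + "i", buf, offset)[0]


def guess_byte_order(buf, offset=0, max_plausible=1_000_000):
    """Guess the byte order from a field expected to be small and positive.

    Returns ">" or "<", or None when both or neither reading is plausible.
    """
    big = read_int32(buf, offset, ">")
    little = read_int32(buf, offset, "<")
    big_ok = 0 < big <= max_plausible
    little_ok = 0 < little <= max_plausible
    if big_ok and not little_ok:
        return ">"
    if little_ok and not big_ok:
        return "<"
    return None


# A field holding 258, written by a big-endian producer.
sample = struct.pack(">i", 258)
```

Here `guess_byte_order(sample)` picks big endian because the little-endian reading of the same four bytes is implausibly large. If the format has no such field, the byte order has to come from documentation or from the user.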
For learning purposes, I'm implementing git from scratch.
In the process I have reached the git index, and here is the thing.
Git stores timing and other information about each index entry in 32-bit integers; for example, each index entry has a ctime field that tracks the last time a file was changed.
Here is my confusion: I am working in Rust with the libc crate, and the interfaces it provides for obtaining this information return 64-bit values.
This raises a couple of questions.
First, is there a reason why the interfaces of libc and git are different? Why does libc work with 64-bit values and git with 32-bit values? Is it really like that, or am I missing something? Is there some more fundamental concept that I don't see?
And mainly, how should I approach this? How problematic would a lossy conversion be? Is it a bad idea?
The format in the git documentation specifies that these values are stored in 32 bits, so I have to work with 32-bit values. Does that mean there is no way out other than a lossy conversion?
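To make the conversion in question concrete, here is a minimal sketch (in Python rather than Rust, for brevity) of packing a 64-bit timestamp into 32-bit fields. It assumes the fields are unsigned and big endian, and that out-of-range values are handled by keeping only the low 32 bits; that truncation policy is an assumption of this sketch, and the function names are hypothetical.

```python
import struct


def fits_in_u32(value):
    """Return True if value can be stored in 32 bits without loss."""
    return 0 <= value <= 0xFFFFFFFF


def pack_index_time(seconds, nanoseconds):
    """Pack a timestamp as two big-endian unsigned 32-bit fields.

    Values wider than 32 bits are truncated to their low 32 bits
    (assumed policy for this sketch).
    """
    return struct.pack(">II", seconds & 0xFFFFFFFF, nanoseconds & 0xFFFFFFFF)
```

For present-day timestamps `fits_in_u32` holds, so the conversion loses nothing; it only becomes lossy for values beyond the unsigned 32-bit range.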
I came across this useful feature in ELF binaries -- Build ID. "It ... is (normally) the SHA1 hash over all code sections in the ELF image." One can read it with the GNU readelf utility:
$ readelf -n /bin/bash
...
Displaying notes found at file offset 0x00000274 with length 0x00000024:
  Owner                 Data size       Description
  GNU                  0x00000014       NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: 54967822da027467f21e65a1eac7576dec7dd821
And I wonder if there is an easy way to recompute Build ID yourself? To check if it isn't corrupted etc.
So, I've got an answer from Mark. Since it is up-to-date information, I'm posting it here. But basically you guys are right. Indeed, there is no tool for computing the Build-ID, and the intention of the Build-ID is not (1) identification of the file contents, nor even (2) identification of the executable (code) part of it, but (3) capturing the "semantic meaning" of a build, which is the hard bit to formalize. (Numbers are for self-reference.)
Quote from the email:
-- "Is there a user tool recomputing the build-id from the file itself, to
check if it's not corrupted/compromised somehow etc?"
If you have time, maybe you could post an answer there?
Sorry, I don't have a stackoverflow account.
But the answer is: No, there is no such tool because the precise way a
build-id is calculated isn't specified. It just has to be universally
unique. Even the precise length of the build-id isn't specified. There
are various ways using different hashing algorithms a build-id could be
calculated to get a universally unique value. And not all data might
(still be) in the ELF file to recalculate it even if you knew how it was
created originally.
Apparently, the intentions of Build-ID changed since the Fedora Feature
page was written about it.
And people's opinions diverge on what it is now.
Maybe in your answer you could include status of Build-ID and what it is
now as well?
I think things weren't very precisely formulated. If a tool changes the
build that creates the ELF file so that it isn't a "semantically
identical" binary anymore then it should get a new (recalculated)
build-id. But if a tool changes something about the file that still
results in a "semantically identical" binary then the build-id stays the
same.
What isn't precisely defined is what "semantically identical binary"
means. The intention is that it captures everything that a build was
made from. So if the source files used to generate a binary are
different then you expect different build-ids, even if the binary code
produced might happen to be the same.
This is why when calculating the build-id of a file through a hash
algorithm you use not just the (allocated) code sections, but also the
debuginfo sections (which will contain references to the source file
names).
But if you then for example strip the debuginfo out (and put it into a
separate file) then that doesn't change the build-id (the file was still
created from the same build).
This is also why, even if you knew the precise hashing algorithm used to
calculate the build-id, you might not be able to recalculate the
build-id. Because you might be missing some of the original data used in
the hashing algorithm to calculate the build-id.
Feel free to share this answer with others.
Cheers,
Mark
Also, for people interested in debuginfo (Linux performance & tracing, anyone?), he mentioned a couple of projects for managing it on Fedora:
https://fedoraproject.org/wiki/Changes/ParallelInstallableDebuginfo
https://fedoraproject.org/wiki/Changes/SubpackageAndSourceDebuginfo
The build ID is not a hash of the program, but rather a unique identifier for the build, and is to be considered just a "unique blob" — at least at some point it used to be defined as a hash of timestamp and absolute file path, but that's not a guarantee of stability either.
I wonder if there is an easy way to recompute Build ID yourself?
No, there isn't, by design.
The page you linked to itself links to the original description of what build-id is and what it's usable for. That page says:
But I'd like to specify it explicitly as being a unique identifier good
only for matching, not any kind of checksum that can be verified against
the contents.
(There are external general means for content verification, and I don't
think debuginfo association needs to do that.)
Additional complications are: the linker can take any of:
--build-id
--build-id=sha1
--build-id=md5
--build-id=0xhexstring
So the build id is not necessarily an sha1 sum to begin with.
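Since the ID is only an opaque blob for matching, the most a reader can do is extract it. A minimal sketch of that, assuming raw note data with 4-byte aligned fields (namesz, descsz, type, then name and descriptor) and little-endian byte order by default; the function names are hypothetical:

```python
import struct

NT_GNU_BUILD_ID = 3  # note type that readelf reports as NT_GNU_BUILD_ID


def align4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def find_build_id(notes, byte_order="<"):
    """Return the build ID bytes from raw note data, or None if absent."""
    offset = 0
    while offset + 12 <= len(notes):
        namesz, descsz, note_type = struct.unpack_from(
            byte_order + "III", notes, offset)
        offset += 12
        name = notes[offset:offset + namesz]
        offset += align4(namesz)
        desc = notes[offset:offset + descsz]
        offset += align4(descsz)
        # The owner is "GNU" plus a terminating NUL byte.
        if name == b"GNU\x00" and note_type == NT_GNU_BUILD_ID:
            return desc
    return None
```

The length of the returned bytes (20 for the SHA-1 style, 16 for the MD5 style) is at best a hint about how the ID was generated, since `--build-id=0xhexstring` allows arbitrary content.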
I am working on an obfuscated binary as part of a crackme challenge. It contains a sequence of push, pop and nop instructions that repeats thousands of times. Functionally, these chunks have no effect on the program, but they make generating CFGs, and reversing in general, very hard.
There are solutions on how to change the instructions to nop so that I can remove them. But in my case, I would like to completely strip off those instructions, so that I can get a better view of the CFG. If instructions are stripped off, I understand that the memory offsets must be modified too. As far as I could see, there were no tools available to achieve this directly.
I am using the IDA Pro evaluation version. I am open to solutions using other reverse engineering frameworks too. A scriptable solution is preferable.
I went through a similar question, but the proposed solution is not applicable in my case.
I would like to completely strip off those instructions ... I understand that the memory offsets must be modified too ...
In general, this is practically impossible:
If the binary exports any dynamic symbols, you would have to update the .dynsym (these are probably the offsets you are thinking of).
You would have to find every statically-assigned function pointer, and update it with the new address, but there is no effective way to find such pointers.
Computed GOTOs and switch statements create tables of code addresses (jump tables) even when no function pointers are present in the program source.
As Peter Cordes pointed out, it's possible to write programs that compute the delta between two assembly labels, and use such deltas (small immediate values directly encoded into instructions) to control program flow.
It's possible that your target program is free from all of the above complications, but spending much effort on a technique that only works for that one program seems wasteful.
I'm trying to modify the executable contents of my own ELF files to see if this is possible. I have written a program that reads and parses ELF files, searches for the code that it should update, changes it, then writes it back after updating the sh_size field in the section header.
However, this doesn't work. If I simply exchange some bytes for other bytes, it works. However, if I change the size, it fails. I'm aware that some sh_offsets are immediately adjacent to each other; however, this shouldn't matter when I'm reducing the size of the executable code.
Of course, there might be a bug in my program (or more than one), but I've already painstakingly gone through it.
Instead of asking for help with debugging my program, I'm just wondering: is there anything else than the sh_size field I need to update in order to make this work (when reducing the size)? Is there anything other than that field that would make changing the length fail?
Edit:
It seems that Andy Ross was perfectly correct. Even in this very simple program I have come across some indirect addressing in __libc_start_main that I cannot trivially modify to update the offset it will reach.
I was curious, though: what would be the best approach for getting as far as possible with this problem? I know I cannot solve this in every case, but for some simple programs it should be possible to update what is required to make it run. Should I try writing my own virtual machine, or try developing a "debugger" that would replace each suspected problem instruction with INT 3? Any ideas?
The text segment is likely internally linked with relative offsets. So one function might be trying to jump to, say, "current address plus 194 bytes". If you move things around such that the jump target is now 190 bytes away, you will obviously break things. The same is true of constant data on some architectures (e.g. x86-64 but not i686). There is no simple way short of a complete disassembly to know where the internal references are, and in fact it's computationally undecidable to find them all (i.e. figuring out all possible jump targets of a runtime-computed branch is equivalent to the Halting Problem).
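The arithmetic behind the "194 bytes becomes 190 bytes" example can be sketched as follows. This assumes an x86-style relative jump whose displacement is measured from the end of the instruction, and that both the jump and its target lie outside the deleted bytes; the function names and addresses are made up for illustration.

```python
def jump_target(instr_addr, instr_len, displacement):
    """Address a relative jump lands on (relative to the next instruction)."""
    return instr_addr + instr_len + displacement


def relocate(addr, cut_start, cut_len):
    """Map an old address to its new address after deleting cut_len bytes."""
    if addr >= cut_start + cut_len:
        return addr - cut_len
    if addr >= cut_start:
        return cut_start  # addresses inside the cut collapse to its start
    return addr


def fixed_displacement(instr_addr, instr_len, displacement, cut_start, cut_len):
    """Displacement the jump needs after the bytes have been deleted."""
    old_next = instr_addr + instr_len
    old_target = jump_target(instr_addr, instr_len, displacement)
    return (relocate(old_target, cut_start, cut_len)
            - relocate(old_next, cut_start, cut_len))
```

For a 5-byte jump at 0x1000 with displacement 194, deleting 4 bytes at 0x1010 (between the jump and its target) gives a required displacement of 190. The hard part is not this arithmetic but finding every such reference in the first place.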
Basically, this isn't solvable in the general case, so if you have an ELF binary from someone else you're trying to patch, you'll need to try other techniques. But with (great!) care it's possible to produce a library where all internal references go through the GOT/PLT which can be sliced up and relinked like this. What are you trying to accomplish?
is there anything else than the sh_size field I need to update in order to make this work
It sounds like you are patching a fully-linked binary (ET_EXEC or ET_DYN). Please note that .sh_size is not used for anything after the static link is done. You can strip the entire section table, and the binary will continue to work fine. What matters at runtime are the segments in the ELF, not sections.
ELF stands for Executable and Linking Format, and "executable" and "linking" form the dual nature of ELF. Sections are used at (static) link time and are combined into segments, which are used at execution time (aka runtime, aka dynamic linking time).
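As a rough illustration of this dual view, the ELF header points at two separate tables: the program header table (segments) and the section header table (sections). The sketch below reads both locations from a header; it assumes a 64-bit little-endian ELF file, and the function name is hypothetical.

```python
import struct

# ELF64 header layout, little endian: e_ident, e_type, e_machine, e_version,
# e_entry, e_phoff, e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum,
# e_shentsize, e_shnum, e_shstrndx.
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")


def read_tables(header_bytes):
    """Return (offset, count) for the segment and section tables."""
    (ident, _type, _machine, _version, _entry, e_phoff, e_shoff, _flags,
     _ehsize, _phentsize, e_phnum, _shentsize, e_shnum,
     _shstrndx) = ELF64_HEADER.unpack_from(header_bytes)
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if ident[4] != 2 or ident[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")
    return {
        "segments": (e_phoff, e_phnum),  # used at execution time
        "sections": (e_shoff, e_shnum),  # used at (static) link time
    }
```

A stripped section table would show up here as a section offset and count of zero, while the segment entries stay in place.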
Of course you haven't told us exactly what your patching strategy is when you are shrinking your binary, and in what way the result is broken. It is very likely that Andy Ross's answer is the real cause of your breakage.
What are the advantages of strings for serialization?
What's wrong with binary files for serialization?
String data is human-readable, which is wonderful for troubleshooting. Binary data is not.
String data is easily parsed by systems on any platform. Binary data is not always portable, so if you need to pass data between a Windows and a Linux box, or maybe an IBM mainframe, string data is simpler.
String data is also general enough to include XML, which brings even more features to the table.
String data is easier to encrypt and decrypt, particularly if you're going cross-platform.
However, there are disadvantages to using string data as well.
It usually results in larger amounts of raw bits - larger files, more traffic on the network, etc.
It's human-readable, which is not so wonderful if you need to protect it from being used in other systems or from prying eyes. (Although encryption helps with the prying eyes.)
Strings are simply more portable, and more forward and backward compatible. In binary formats, everything relies on offsets, known sizes, and expected fields. They're difficult to write parsers for, because basically you need to support every known "version" of the binary format. With text, however (especially something flexible like XML), it's easy to find the fields you're looking for, and it's even easier to debug when something goes wrong (human-readable makes everything better).
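A small sketch of the trade-off discussed in these answers, using JSON as the text format and a fixed binary layout for comparison. The record and its layout (a 32-bit integer followed by a 32-bit float, little endian) are made up for illustration.

```python
import json
import struct

record = {"id": 7, "temperature": 21.5}

# Text: self-describing and readable, but larger.
as_text = json.dumps(record).encode("utf-8")

# Binary: compact, but the reader must already know the field order,
# sizes and byte order.
as_binary = struct.pack("<if", record["id"], record["temperature"])

# Reading the binary form with the wrong byte order silently gives garbage.
wrong_id = struct.unpack(">if", as_binary)[0]
```

The text form round-trips through `json.loads` without any out-of-band knowledge, while the 8-byte binary form is smaller but decodes to the wrong `id` if the reader assumes big endian.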