If I run "cksum filename" on two different Linux systems with different hardware specs, I get different checksum values for the same file.
Can anyone tell me the reason behind this?
The "filename" is a binary file generated in one system and copied to other system.
The algorithm employed by cksum is specified by POSIX. All POSIX-compliant systems (including GNU/Linux) should compute the same value for the same file -- that's the whole point. If you get different values on different systems, then either the program is buggy or the files (at least cksum's view of them) are not, in fact, the same. I wouldn't bet on the program being buggy.
Do note, however, that there are likely to be other hashing and checksumming programs on both systems (e.g. md5sum or sum). The sums computed by each of these programs are likely to differ, but each should be consistent from system to system. They could be a useful alternative for you, and/or they could give you a second opinion of whether the files really are the same.
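For reference, the checksum is pure arithmetic over the byte stream, which is why the hardware can't affect it. Below is a minimal sketch of the POSIX cksum CRC as I recall it (polynomial 0x04C11DB7, with the file length mixed in at the end and the result complemented); treat it as illustrative and verify against the standard or coreutils before relying on it.

#include <stdint.h>
#include <stdio.h>

/* Sketch of the POSIX cksum CRC, from memory: CRC-32 with polynomial
 * 0x04C11DB7 over the file's bytes, then over the file length
 * (least-significant byte first), with the final value complemented. */
static uint32_t crctab[256];

static void init_crctab(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i << 24;
        for (int k = 0; k < 8; k++)
            c = (c & 0x80000000u) ? (c << 1) ^ 0x04C11DB7u : (c << 1);
        crctab[i] = c;
    }
}

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    FILE *fp = fopen(argv[1], "rb");
    if (!fp) { perror("fopen"); return 1; }

    init_crctab();
    uint32_t crc = 0;
    unsigned long long len = 0, n;
    int ch;
    while ((ch = getc(fp)) != EOF) {
        len++;
        crc = (crc << 8) ^ crctab[((crc >> 24) ^ (uint32_t)ch) & 0xFF];
    }
    for (n = len; n; n >>= 8)   /* mix in the file length */
        crc = (crc << 8) ^ crctab[((crc >> 24) ^ (uint32_t)(n & 0xFF)) & 0xFF];
    crc = ~crc;

    printf("%u %llu %s\n", (unsigned)crc, len, argv[1]);
    fclose(fp);
    return 0;
}

Nothing in there depends on the machine it runs on, only on the bytes it reads, so identical files must give identical sums.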
I'm writing a small parser, and after some trial and error it seems the file byte order is big-endian (which I was told isn't common, but there it is).
I don't think the original devs included anything about endianness, since the byte order may depend only on the hardware that wrote the file. Please correct me if that's flawed (is it possible for the developers to specify the endianness in the C code?).
So I can't really see how I would parse those files when there is no actual way to determine the byte order, say, for an Int32 number. I've read this similar post, but that covers a system that both writes and reads the binary files, so you can just use a reader with the system's endianness.
In my case, the code parses instrument output gathered and written in binary by potentially any kind of computer with any OS (though again, I guess endianness depends on the system architecture, not the OS).
Do you have any idea/pointers on how to deal with this problem?
Wikipedia was very informative, but as far as I've read it's just general information.
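One common way out, if the format itself fixes the byte order (or lets you detect it, e.g. from a magic number), is never to read multi-byte integers directly into native types, but to assemble them byte by byte; the host's endianness then never enters into it. A minimal sketch, with helper names of my own invention:

#include <stdint.h>
#include <stdio.h>

/* Assemble a 32-bit integer from 4 bytes with an explicit byte order,
 * independent of whatever CPU this code happens to run on. */
static uint32_t read_u32_be(const unsigned char *p)   /* big-endian file */
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

static uint32_t read_u32_le(const unsigned char *p)   /* little-endian file */
{
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}

int main(void)
{
    unsigned char buf[4] = { 0x00, 0x00, 0x01, 0x02 };
    /* Same 4 bytes, two different values depending on the declared byte order. */
    printf("be=%u le=%u\n", read_u32_be(buf), read_u32_le(buf));
    return 0;
}

If the writer really can be either endianness and the format records nothing about it, your only options are a heuristic (e.g. checking whether a known field decodes to a plausible value both ways) or fixing the writer to emit a marker.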
I came across this useful feature in ELF binaries -- Build ID. "It ... is (normally) the SHA1 hash over all code sections in the ELF image." One can read it with the GNU readelf utility:
$ readelf -n /bin/bash
...
Displaying notes found at file offset 0x00000274 with length 0x00000024:
  Owner                 Data size       Description
  GNU                   0x00000014      NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: 54967822da027467f21e65a1eac7576dec7dd821
And I wonder if there is an easy way to recompute the Build ID yourself, to check that it isn't corrupted, etc.?
So, I've got an answer from Mark. Since it is up-to-date information, I'm posting it here. But basically you guys are right: there is indeed no tool for computing the Build-ID, and the intention of the Build-ID is not (1) identification of the file contents, nor even (2) identification of the executable (code) part of it, but rather (3) capturing the "semantic meaning" of a build, which is the hard bit to formalize. (The numbers are for self-reference.)
Quote from the email:
-- "Is there a user tool recomputing the build-id from the file itself, to
check if it's not corrupted/compromised somehow etc?"
If you have time, maybe you could post an answer there?
Sorry, I don't have a stackoverflow account.
But the answer is: No, there is no such tool because the precise way a
build-id is calculated isn't specified. It just has to be universally
unique. Even the precise length of the build-id isn't specified. There
are various ways using different hashing algorithms a build-id could be
calculated to get a universally unique value. And not all data might
(still be) in the ELF file to recalculate it even if you knew how it was
created originally.
Apparently, the intentions of Build-ID changed
since the Fedora Feature page was written about
it.
And people's opinions diverge on what it is now.
Maybe in your answer you could include status of Build-ID and what it is
now as well?
I think things weren't very precisely formulated. If a tool changes the
build that creates the ELF file so that it isn't a "semantically
identical" binary anymore then it should get a new (recalculated)
build-id. But if a tool changes something about the file that still
results in a "semantically identical" binary then the build-id stays the
same.
What isn't precisely defined is what "semantically identical binary"
means. The intention is that it captures everything that a build was
made from. So if the source files used to generate a binary are
different then you expect different build-ids, even if the binary code
produced might happen to be the same.
This is why when calculating the build-id of a file through a hash
algorithm you use not just the (allocated) code sections, but also the
debuginfo sections (which will contain references to the source file
names).
But if you then for example strip the debuginfo out (and put it into a
separate file) then that doesn't change the build-id (the file was still
created from the same build).
This is also why, even if you knew the precise hashing algorithm used to
calculate the build-id, you might not be able to recalculate the
build-id. Because you might be missing some of the original data used in
the hashing algorithm to calculate the build-id.
Feel free to share this answer with others.
Cheers,
Mark
Also, for people interested in debuginfo (Linux performance & tracing, anyone?), he mentioned a couple of projects for managing it on Fedora:
https://fedoraproject.org/wiki/Changes/ParallelInstallableDebuginfo
https://fedoraproject.org/wiki/Changes/SubpackageAndSourceDebuginfo
The build ID is not a hash of the program, but rather a unique identifier for the build, and is to be considered just a "unique blob" — at least at some point it used to be defined as a hash of timestamp and absolute file path, but that's not a guarantee of stability either.
I wonder if there is an easy way to recompute Build ID yourself?
No, there isn't, by design.
The page you linked to itself links to the original description of what build-id is and what it's usable for. That page says:
But I'd like to specify it explicitly as being a unique identifier good
only for matching, not any kind of checksum that can be verified against
the contents.
(There are external general means for content verification, and I don't
think debuginfo association needs to do that.)
An additional complication is that the linker can take any of:
--build-id
--build-id=sha1
--build-id=md5
--build-id=0xhexstring
So the build ID is not necessarily a SHA1 sum to begin with.
I need to execute a compiled program which hardcodes various filesystem paths, with different values for those paths. For practical reasons, adjusting the source code of the program and recompiling it is not an option. Additionally, it is not acceptable to replace the hardcoded files with symlinks, or to change the hardcoded files in any other way.
I can only think of two solutions: LD_PRELOAD hooks and patching the binary. The former seems easier and more reliable. Is there any better solution, or perhaps some existing software aiming to solve this problem?
P.S. I am aware that I'm speaking of horrible hacks. The software in question, with its hardcoded paths, is widely distributed on Linux distributions, but it appears completely unmaintained, and I don't see any chance of getting a patch in, let alone having it hit distros, in a timeframe I find acceptable.
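For what it's worth, the LD_PRELOAD route usually looks roughly like the sketch below: a shared object that interposes on open() and rewrites the hardcoded path before calling the real function. The paths here are placeholders, and a real shim would also have to cover open64(), fopen(), stat() and whichever other calls the program actually makes (strace will tell you).

#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical remapping: redirect one hardcoded path to another.
 * Build with: gcc -shared -fPIC -o shim.so shim.c -ldl
 * Run with:   LD_PRELOAD=./shim.so ./the-program                  */
static const char *remap(const char *path)
{
    if (path && strcmp(path, "/etc/hardcoded.conf") == 0)
        return "/home/me/alternate.conf";
    return path;
}

int open(const char *path, int flags, ...)
{
    static int (*real_open)(const char *, int, ...);
    if (!real_open)
        real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

    mode_t mode = 0;
    if (flags & O_CREAT) {              /* open() only takes a mode with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }
    return real_open(remap(path), flags, mode);
}

Note that this only works for dynamically linked programs that go through libc; statically linked binaries or direct syscalls would force you back to binary patching.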
I'm rewriting a build that produces a number of things (shared/static libraries, jars, executables, etc.). The question came up of whether there's a way to verify that the results are functionally equivalent without doing a full top-to-bottom test of the resulting software.
However, that is proving to be more difficult to do than I anticipated.
As an example, I expected that two objects produced from the same source (Sun Studio C++ compiler) with the same command-line parameters would have the same MD5 hash, but that isn't the case. I can build the file, rename it, build again, and they have different hashes.
With that said ... is there a way to do a quick check to verify that two files produced from separate build architectures of the same source tree (e.g., two shared objects) are functionally equivalent?
edit: I am sorry, I neglected to mention this is for a debug build ... when debugging flags aren't used the binaries are identical, but they've been using debugging flags by default for so many years that their stuff breaks when you remove them (part of the reason I'm rewriting the build is to take that particular 'feature' out so we can get some proper testing going).
Windows DLLs have a link timestamp (TimeDateStamp) as part of PE image.
Looking at linker options, I don't see an option to suppress that. So re-linking a DLL (or an EXE) will always produce a different binary.
You could write a tool to zero out these timestamps (always at a fixed offset from file start), and compare MD5s afterwards. But you'll likely discover lots of other differences as well. In particular, any program that uses __DATE__ or __TIME__ builtins will give you trouble.
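A rough sketch of such a tool, assuming a well-formed PE file: the offset of the "PE\0\0" signature lives at file offset 0x3C, and the COFF TimeDateStamp sits 8 bytes past that signature. (I believe debug directories and export tables carry timestamps of their own, so this alone may not be enough.)

#include <stdint.h>
#include <stdio.h>

/* Sketch: zero the COFF TimeDateStamp in a PE file in place so two
 * relinked binaries can be compared byte for byte.  Assumes a valid
 * PE file and does almost no error checking. */
int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s file.dll\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "r+b");
    if (!f) { perror("fopen"); return 1; }

    unsigned char b[4];
    fseek(f, 0x3C, SEEK_SET);                 /* e_lfanew: offset of "PE\0\0" */
    fread(b, 1, 4, f);
    uint32_t pe = b[0] | (uint32_t)b[1] << 8 | (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;

    unsigned char zero[4] = { 0, 0, 0, 0 };
    fseek(f, (long)(pe + 8), SEEK_SET);       /* COFF header TimeDateStamp */
    fwrite(zero, 1, 4, f);

    fclose(f);
    return 0;
}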
We've had to work quite hard to achieve bit-identical rebuilds (using GNU toolchain). It's possible (at least for open-source tools, on Linux), but not easy (as you've discovered).
I forgot about this question; I'm revisiting so I can give the answer I came up with.
objcopy can be used to produce a new binary file in different formats. It's been a few years since I worked on this, so the specifics escape me, but here's what I recall:
objcopy can strip various things out (debug info, symbol information, etc), but even after stripping stuff out I was still seeing different hashes between objects.
In the end I found I could convert it from ELF to other formats. I ended up dumping it to another format (I think I chose SREC) that consistently provided the same MD5 for objects built at different times with identical source/flags.
I'm betting I could have done this a better way with objcopy (or perhaps another binutils tool), but it was good enough to satisfy our concerns.
Does anybody know of one (i.e. a filesystem that allows hard links to directories)? Preferably with a Linux implementation?
Alternatively, does anybody know how much effort it would take to add it to any open-source implementation? (I mean: maybe it's enough to change an if statement, or maybe I'd have to go carefully through the whole filesystem implementation adding tests; do you have a sense of that?)
thanks....
HFS+ allows directory hard links as of OS X 10.5. Since OS X 10.6 only Time Machine can create them, and HFS+ does some sanity checking to ensure they do not introduce cycles.
However, Linux will not read them. Beyond individual filesystems, this could also be enforced at the VFS layer. Even if there are no cycles, some userspace tools rely on there being no directory hard links (e.g. a GNU find optimisation that lets it skip many directories; it can be disabled with -noleaf).
Technically, nothing keeps you from opening /dev/sda with a hex editor and creating one. However, everything else in your system will fall apart if you do.
The best explanation I could find is this quote from jta:
User-added hardlinks to directories are forbidden because they break the directed acyclic graph structure of the filesystem (which is an ASSERT in Unixiana, roughly), and because they confuse the hell out of file-tree-walkers (a term Multicians will recognize at sight, but Unix geeks can probably figure out without problems too).