I am using the cmp command on an x86 processor and it works properly (the binary files are generated using gcc), but when I use it on an ARM Cortex-A9 it does not give the proper output (the binaries are generated using a cross gcc).
When I compare the board-specific binaries on the x86 machine using cmp, it produces the proper output.
x86 machine:
say I have two files, a.bin and b.bin (they should be identical when compared using cmp)
cmp a.bin b.bin
and the result is correct.
ARM Cortex-A9:
a.bin, b.bin
cmp a.bin b.bin
These should also be identical, but here cmp reports a mismatch.
Any clue, please?
Your question isn't very clear and is a little vague, so I'll take a stab in the dark and assume that you're asking why the same source code compiles to different files.
Although a compiled program (assuming no UB or portability issues) will be functionally the same no matter what compiler is used, the program on the binary level won't necessarily be.
Different optimization levels will generate different files, for example. The compiler may embed build dates into the file, and different compilers will arrange the code differently.
These are all reasons why you may be getting different outputs for the 'same' program.
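If you want to pinpoint where two builds actually differ, something along these lines can help (an illustrative sketch; it assumes GNU coreutils and binutils are available):
cmp -l a.bin b.bin | head -20
strings a.bin | grep -i -E 'gcc|date'
cmp -l lists the byte offsets and values that differ, and strings often reveals embedded toolchain versions or build dates that explain the mismatch.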
I am looking for any references or guidance on how to interpret the addresses in the backtrace of a core file.
For example, a typical backtrace would look like:
(gdb) where
#0 [0x000012345] in func1 + 123 (a.out + 0xABC)
#1 [0x000034345] in func2 + 567 (a.out + 0xea7)
..
I can compile with -g and get exact line numbers. But the executable in the production environment would not have been compiled with -g, in which case I get a stack like the one above.
I would like to know what the addresses and offsets ([0x000012345], +123, and 0xABC) represent, how to interpret them, and how to map them to line numbers in the source code.
I'd appreciate any help.
On most architectures, it is simply not possible to get good data out of core files without debugging information. Even simple things like stack unwinding will not work. An expert, familiar with the processor architecture and the ABI, will be able to determine many things from it, but it is a cumbersome, manual process.
For the rest of the answer, I assume you are using GCC. If you compile your executable with optimization and -g, GCC will create production-ready binaries with debugging information. Unless there are compiler bugs (which are rare), -g will not affect the generated code. The debugging information consists of data tables on the side (and this causes problems for debugging because the machine code and the original code may be very far apart in aspects such as the execution order of individual statements).
If you do not want to distribute production binaries with debugging information, you should still compile with -g, but remove the debugging information with strip before distributing the binaries, while retaining the unstripped binaries for your own records. When later analyzing a core file, you use these unstripped binaries, rather than the stripped ones you gave away. The in-memory offsets you quote are the same between stripped and unstripped binaries, which is why this works.
For more advanced usage, you could also separate the debugging information using tools such as eu-strip -f, but this does not seem necessary in your case.
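A minimal sketch of that workflow, with placeholder file names and GNU binutils flags:
gcc -O2 -g -o prog prog.c
objcopy --only-keep-debug prog prog.debug
strip prog
addr2line -e prog.debug -f -C 0xABC
The last step maps an address back to a function name and file:line. For position-independent executables, the module-relative offset from the backtrace (the 0xABC in "a.out + 0xABC") is usually the value to feed to addr2line.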
I've decided to learn assembler through online tutorials.
I've come across this one that uses the NASM compiler, which most other tutorials seem to use as well:
http://www.tutorialspoint.com/assembly_programming/index.htm
I've also come across this YouTube series, "Assembly primer for hackers":
https://www.youtube.com/watch?v=K0g-twyhmQ4&list=PLue5IPmkmZ-P1pDbF3vSQtuNquX0SZHpB
This one uses what the guy describes as the 'generic Linux compiler' (or words to that effect).
The commands for compiling go something like this:
as -o file.o file.s
Where file.s is the assembly source code. Followed by:
ld -o file file.o
Where file is then the executable.
Each of the tutorials uses a different syntax (e.g. a register in the latter tutorial is always preceded by %; NB there also appear to be differences in the syntax that run deeper than this). Are these syntaxes decided by the individual compiler?
I was also initially confused when I tried to compile code from the NASM tutorial with the latter method. I was always under the impression that the instruction set had to depend on the CPU, and that it therefore shouldn't matter which compiler I use. I've since concluded that the differences are merely ones of syntax, but is that correct?
I'm running a Linux computer, by the way, on kernel 4.1.6.
My main question is really which syntax do I use? Is it just a matter of choice? Is one more widely used than the other? Thanks for any help.
Each of the tutorials uses a different syntax (e.g. a register in the
latter tutorial is always preceded by %; NB there also appear to be
differences in the syntax that run deeper than this). Are these
syntaxes decided by the individual compiler?
Yes, different assemblers (= assembly language compilers) may use different assembly-language syntax even though they produce code for the same processor and platform.
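For example, here is the same instruction written for NASM (Intel syntax) and for GNU as (AT&T syntax):
mov eax, 5        ; NASM: destination first, bare register names
movl $5, %eax     # GNU as: source first, % on registers, $ on immediates, operand size suffixed to the mnemonic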
My main question is really which syntax do I use? Is it just a matter
of choice? Is one more widely used than the other?
One assembler, like NASM, might target a wide range of processors and platforms; in that case you would benefit from learning its syntax if you need to work with several processors or platforms.
In other cases it might be better to stick with the assembler of some prominent vendor, because it is widely used and you can find more example code for it on the net, which might help you with your development.
Last but not least, you might simply prefer a particular assembler because you like its features or syntax.
If you're on a Windows system, Microsoft's MASM (ML.EXE, or ML64.EXE for 64-bit) uses a syntax that is virtually the same as Intel's. MASM is included with the free Visual Studio Express editions, although you usually have to create a custom build step to invoke the assembler in a VS project. VS Express includes a good source-level debugger.
If you're on a Linux type system, then you'll probably use AT&T syntax, which I assume ended up that way since it was a conversion of some generic assembler. I don't know which assembler(s) to recommend for Linux.
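Note that GNU as is not locked into AT&T notation: putting the directive .intel_syntax noprefix at the top of a source file (or compiling with gcc -masm=intel) switches it to Intel syntax. For example:
.intel_syntax noprefix
mov eax, 5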
While attempting to solve linear programming problems using GLPK's GLPSOL, we've come upon a snag: in very specific cases, the results from glpsol executables created with different compilers differ.
The situation is that we have a problem with several valid solutions. To put it simply, we have a table where each row (X) can be assigned only one column (Y), and vice versa. As such, all combinations that assign unique column/row pairs are valid.
Example, for a 2x2 table, these are valid:
{(X0,Y0),(X1,Y1)} {(X0,Y1),(X1,Y0)}
Now, the original glpsol binary we used under Windows returned the results in order, something like this:
{(X0,Y0),(X1,Y1)...(Xn,Yn)}
We noticed an issue with the Linux binary, in that it returned the solution in a different order, something like this:
{(X0,Y0),(Xn,Y1),(X1,Y2) ....}
Note that the order is not random, every execution follows the same pattern.
After much investigation I discovered that the issue lies in which compiler was used to create each binary. In our example above, the Windows binary was compiled using Visual C++, while the Linux binary used GCC.
I've verified this by recompiling the Windows binary using GCC, resulting in the same pattern. Compiling with Borland results in a different pattern.
So the question is, mainly, why is this happening?
I'm guessing it might be the result of how each compiler optimizes the binary, but I'm not sure. My objective is to obtain the same results we had with the original executable (the one compiled with Visual C++) on both Windows and Linux, and I suspect cross-compiling with the Visual C++ toolchain won't be an option.
Note: I managed to determine the compiler used by each binary by opening them as text and locating text strings within the executable referencing Visual C++ and GNU GCC respectively.
Thanks!
Versions of the solver built with different compilers can take different paths during the optimization process, which can result in the behavior you observe. Things that can affect this include: differences in floating-point semantics (possibly caused by -ffast-math); different implementations of sort (qsort is normally not a stable sort), as mentioned by Ben Voigt; and different implementations of random number generators in the standard libraries.
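To illustrate the sort point: the C standard gives qsort no stability guarantee, so elements that compare equal may come back in any order, and different C libraries can legitimately order them differently. A minimal sketch:

#include <stdio.h>
#include <stdlib.h>

struct row { int cost; char name; };

static int by_cost(const void *a, const void *b) {
    const struct row *x = a, *y = b;
    return (x->cost > y->cost) - (x->cost < y->cost);  /* ties compare equal */
}

int main(void) {
    struct row t[] = { {1,'A'}, {1,'B'}, {1,'C'} };  /* all costs tie */
    qsort(t, 3, sizeof t[0], by_cost);
    for (int i = 0; i < 3; i++) putchar(t[i].name);  /* A/B/C order is unspecified */
    putchar('\n');
    return 0;
}

If the solver breaks ties like this while pivoting, two correct builds can walk different paths to equally optimal solutions.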
If both solutions are optimal, I wouldn't be too concerned about this.
I'm re-writing a build that produces a number of things (shared/static libraries, jars, executables, etc). The question came up whether there's a way to verify that the results are functionally equivalent without doing a full top-to-bottom test of the resulting software.
However, that is proving to be more difficult to do than I anticipated.
As an example, I expected that two objects produced from the same source (Sun Studio C++ compiler) with the same command-line parameters would have the same MD5 hash, but that isn't the case. I can build the file, rename it, build again, and the two have different hashes.
With that said, is there a way to do a quick check to verify that two files produced from separate build architectures of the same source tree (e.g., two shared objects) are functionally equivalent?
Edit: I'm sorry, I neglected to mention that this is for a debug build. When debugging flags aren't used the binaries are identical, but they've been building with debugging flags by default for so many years that their stuff breaks when you remove them (part of the reason I'm re-writing the build is to take that particular 'feature' out of the build so we can get some proper testing going).
Windows DLLs have a link timestamp (TimeDateStamp) as part of PE image.
Looking at linker options, I don't see an option to suppress that. So re-linking a DLL (or an EXE) will always produce a different binary.
You could write a tool to zero out these timestamps (the TimeDateStamp sits at a fixed offset from the PE header, whose own file offset is stored at 0x3C), and compare MD5s afterwards. But you'll likely discover lots of other differences as well. In particular, any program that uses the __DATE__ or __TIME__ macros will give you trouble.
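For the timestamp part, a minimal sketch of such a zeroing tool (hypothetical; assumes a little-endian host and a valid PE file, with no error checking):

#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv) {
    FILE *f = fopen(argv[1], "r+b");
    uint32_t e_lfanew = 0, zero = 0;
    fseek(f, 0x3C, SEEK_SET);            /* e_lfanew: file offset of the "PE\0\0" signature */
    fread(&e_lfanew, 4, 1, f);
    fseek(f, e_lfanew + 8, SEEK_SET);    /* TimeDateStamp field of the COFF file header */
    fwrite(&zero, 4, 1, f);
    fclose(f);
    return 0;
}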
We've had to work quite hard to achieve bit-identical rebuilds (using GNU toolchain). It's possible (at least for open-source tools, on Linux), but not easy (as you've discovered).
I forgot about this question; I'm revisiting so I can give the answer I came up with.
objcopy can be used to produce a new binary file in different formats. It's been a few years since I worked on this, so the specifics escape me, but here's what I recall:
objcopy can strip various things out (debug info, symbol information, etc), but even after stripping stuff out I was still seeing different hashes between objects.
In the end I found I could convert it from ELF to other formats. I ended up dumping it to another format (I think I chose SREC) that consistently provided the same MD5 for objects built at different times with identical source/flags.
I'm betting I could have done this a better way with objcopy (or perhaps another binutils tool), but it was good enough to satisfy our concerns.
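Reconstructing from memory, the normalization looked roughly like this (the exact flags may have differed):
objcopy --strip-debug file1.o file1.stripped.o
objcopy -O srec file1.stripped.o file1.srec
objcopy --strip-debug file2.o file2.stripped.o
objcopy -O srec file2.stripped.o file2.srec
md5sum file1.srec file2.srec
If the conversion worked, the two hashes should match for equivalent objects.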
I responded to another question about developing for the iPhone in non-Objective-C languages, and I made the assertion that using, say, C# to write for the iPhone would strike an Apple reviewer wrong. I was speaking largely about UI elements differing between the ObjC and C# libraries in question, but a commenter made an interesting point, leading me to this question:
Is it possible to determine the language a program is written in, solely from its binary? If there are such methods, what are they?
Let's assume for the purposes of the question:
That from an interaction standpoint (console behavior, any GUI appearance, etc.) the two are identical.
That performance isn't a reliable indicator of language (no comparing, say, Java to C).
That you don't have an interpreter or something between you and the language - just raw executable binary.
Bonus points if you're language-agnostic as possible.
Short answer: YES
Long answer:
If you look at a binary, you can find the names of the libraries that have been linked in. Opening cmd.exe in TextPad easily finds the following at hex offset 0x270: msvcrt.dll, KERNEL32.dll, NTDLL.DLL, USER32.dll, etc. msvcrt is the Microsoft 'C' runtime support functions. KERNEL32, NTDLL, and USER32.dll are OS specific libraries which tell you either the target platform, or the platform on which it was built, depending on how well the cross-platform development environment segregates the two.
Setting aside those clues, most any C/C++ compiler will have to insert the names of the functions into the binary; there is a list of all functions (or entry points) stored in a table. C++ 'mangles' the function names to encode the arguments and their types in order to support overloaded methods. It is possible to obfuscate the function names, but they would still exist. The function signatures would include the number and types of the arguments, which can be used to trace into the system or internal calls used in the program. At offset 0x4190 is "SetThreadUILanguage", which can be searched for to find out a lot about the development environment. I found the entry-point table at offset 0x1ED8A. I could easily see names like printf, exit, and scanf, along with __p__fmode, __p__commode, and __initenv.
Any executable for the x86 processor will have a data segment containing any static text that was included in the program. Back to cmd.exe (offset 0x42C8): there is the text "S.o.f.t.w.a.r.e..P.o.l.i.c.i.e.s..M.i.c.r.o.s.o.f.t..W.i.n.d.o.w.s..S.y.s.t.e.m.". The string takes twice as many bytes as normally necessary because it was stored using double-wide (UTF-16) characters, probably for internationalization. Error codes and messages are a prime source here.
At offset 0xB1B0 is "p.u.s.h.d", followed by mkdir, rmdir, chdir, md, rd, and cd; I left out the unprintable characters for readability. Those are all built-in cmd.exe commands.
For other programs, I've sometimes been able to find the path from which a program was compiled.
So, yes, it is possible to determine the source language from the binary.
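On a Unix-like system the same kind of inspection can be scripted; an illustrative sketch ("program" is a placeholder):
strings program | grep -i -E 'gcc|glibc|\.dll'
nm -C program | head
strings surfaces embedded library and toolchain names, and nm -C demangles symbols, so C++ name mangling (or the lack of it) becomes obvious.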
I'm not a compiler hacker (someday, I hope), but I figure that you may be able to find telltale signs in a binary file that would indicate what compiler generated it and some of the compiler options used, such as the level of optimization specified.
Strictly speaking, however, what you're asking is impossible. It could be that somebody sat down with a pen and paper and worked out the binary codes corresponding to the program that they wanted to write, and then typed that stuff out in a hex editor. Basically, they'd be programming in assembly without the assembler tool. Similarly, you may never be able to tell with certainty whether a native binary was written in straight assembler or in C with inline assembly.
As for virtual machine environments such as JVM and .NET, you should be able to identify the VM by the byte codes in the binary executable, I would expect. However you may not be able to tell what the source language was, such as C# versus Visual Basic, unless there are particular compiler quirks that tip you off.
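For instance, JVM bytecode is easy to recognize by its magic number alone (the file name here is hypothetical):
xxd MyApp.class | head -1
JVM class files begin with the magic bytes CA FE BA BE, and .NET assemblies are PE files with a CLR header, so the container format usually gives the VM away even before you look at the bytecode itself.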
What about these tools:
PE Detective
PEiD
Both are PE identifiers. OK, they're both for Windows, but that's what I was dealing with when I landed here.
I expect you could if you disassemble the binary, or at least you may be able to identify the compiler, as not all compilers will use the same code for printf, for example, so Objective-C and GNU C should differ here.
You have excluded all byte-code languages, so this issue is going to be less common than expected.
First, run the what command on some binaries and look at the output. CVS (and SVN) identifiers are scattered throughout the binary image, and most of those are from libraries.
Also, there's often a "map" to the various library functions. That's a big hint, also.
When the libraries are linked into the executable, there is often a map that's included in the binary file with names and offsets. It's part of creating "position independent code". You can't simply "hard-link" the various object files together. You need a map and you have to do some lookups when loading the binary into memory.
Finally, the start-up module for C, C++ (and I imagine C#) is unique to that compiler's default set of libraries.
Well, C is initially converted to ASM, so you could write all C code in ASM.
No, the binary code is language-agnostic. Different compilers could even take the same source code and generate different binaries. That's why you don't see general-purpose decompilers that will work on arbitrary binaries.
The 'strings' command can be used to get some hints as to which language was used (for instance, I just ran it on the stripped binary of a C application I wrote, and the first entries it found were the libraries linked by the executable).