How does decompiling work?

I have heard the term "decompiling" used a few times before, and I am starting to get very curious about how it works.
I have a very general idea of how it works: reverse engineering an application to see what functions it uses. But I don't know much beyond that.
I have also heard the term "disassembler"; what is the difference between a disassembler and a decompiler?
So to sum up my question(s): What exactly is involved in the process of decompiling something? How is it usually done? How complicated or easy a process is it? Can it produce the exact code? And what is the difference between a decompiler and a disassembler?

Ilfak Guilfanov, the author of the Hex-Rays Decompiler, gave a talk about the internal workings of his decompiler at a conference; here is the white paper and a presentation. It gives a nice overview of the difficulties involved in building a decompiler and how to make it all work.
Apart from that, there are some quite old papers, e.g. the classic PhD thesis of Cristina Cifuentes.
As for the complexity, all the "decompiling" stuff depends on the language and runtime of the binary. For example, decompiling .NET and Java is considered "done", as there are free decompilers available with a very high success rate (they produce the original source). But that is due to the very specific nature of the virtual machines these runtimes use.
As for truly compiled languages, like C, C++, Obj-C, Delphi, Pascal, ... the task gets much more complicated. Read the above papers for details.
what is the difference between a disassembler and a decompiler?
When you have a binary program (an executable, a DLL library, ...), it consists of processor instructions. The language of these instructions is called assembly (or assembler). In a binary, these instructions are encoded in binary form, so that the processor can execute them directly. A disassembler takes this binary code and translates it into a text representation. This translation is usually 1-to-1, meaning one instruction is shown as one line of text. This task is complex but straightforward: the program just needs to know all the different instructions and how they are represented in the binary.
On the other hand, a decompiler does a much harder task. It takes either the binary code or the disassembler output (which is basically the same, because it's 1-to-1) and produces high-level code. Let me show you an example. Say we have this C function:
int twotimes(int a) {
return a * 2;
}
When you compile it, the compiler first generates an assembly file for that function; it might look something like this:
_twotimes:
SHL EAX, 1
RET
(the first line is just a label, not a real instruction; SHL does a shift-left operation, which is a quick way to multiply by two; RET means that the function is done). In the resulting binary, it looks like this:
08 6A CF 45 37 1A
(I made that up; those are not real binary instructions). Now you know that a disassembler takes you from the binary form to the assembly form. A decompiler takes you from the assembly form to C code (or some other higher-level language).
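To make the round trip concrete, here is a plausible (made-up) decompiler rendering of those two instructions. A decompiler has to invent a function name and reconstruct an expression, so the output is equivalent to, but not identical with, the original source:

// hypothetical decompiler output for the function above: the original
// name and the "a * 2" expression are gone, only the behavior survives
int sub_401000(int a) {
    return a << 1;   // SHL EAX, 1 reads back as a shift, not a multiply
}

This also answers the "can it produce the exact code?" part of the question: in general, no. Names, comments, and the exact shape of expressions are simply not stored in the binary.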

Decompiling is essentially the reverse of compiling. That is - taking the object code (binary) and trying to recreate the source code from it.
Decompilation depends on artefacts being left in the object code which can be used to ascertain the structure of the source code.
With C/C++ there isn't much left to help the decompilation process, so it's very difficult. However, with Java, C#, and other languages that target virtual machines, decompilation can be easier, because the language leaves many more hints within the object code.

Removing assembly instructions from ELF

I am working on an obfuscated binary as part of a crackme challenge. It has a sequence of push, pop and nop instructions (which repeats thousands of times). Functionally, these chunks have no effect on the program, but they make generating CFGs and the process of reversing very hard.
There are solutions for changing the instructions to nop so that they can be neutralized. But in my case, I would like to strip those instructions out completely, so that I get a better view of the CFG. If instructions are stripped out, I understand that the memory offsets must be adjusted too. As far as I can tell, there are no tools available to achieve this directly.
I am using the IDA Pro evaluation version, but I am open to solutions using other reverse engineering frameworks too. Preferably, it should be scriptable.
I went through a similar question but, the proposed solution is not applicable in my case.
I would like to completely strip off those instructions ... I understand that the memory offsets must be modified too ...
In general, this is practically impossible:
If the binary exports any dynamic symbols, you would have to update the .dynsym (these are probably the offsets you are thinking of).
You would have to find every statically-assigned function pointer, and update it with the new address, but there is no effective way to find such pointers.
Computed GOTOs and switch statements create function pointer tables even when none are present in the program source (see the sketch below).
As Peter Cordes pointed out, it's possible to write programs that use the delta between two assembly labels (a small immediate value encoded directly into an instruction) to control program flow.
It's possible that your target program is free from all of the above complications, but spending much effort on a technique that only works for that one program seems wasteful.
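To see why the switch point above matters, consider the C function below. At higher optimization levels, compilers typically lower a dense switch like this into an indirect jump through a table of code addresses (whether they do is compiler- and flag-dependent), and none of those addresses appear anywhere in the source; deleting instructions would silently invalidate every entry. This is an illustrative sketch, not taken from the crackme in question:

/* dispatch.c -- try `gcc -O2 -S dispatch.c`: many compilers emit a
   table of code addresses plus an indirect jump for this switch */
int dispatch(int op, int x) {
    switch (op) {
    case 0: return x + 1;
    case 1: return x - 1;
    case 2: return x * 2;
    case 3: return x / 2;
    case 4: return -x;
    case 5: return x * x;
    default: return x;
    }
}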

How do functional language compilers work? [duplicate]

I've heard of the idea of bootstrapping a language, that is, writing a compiler/interpreter for the language in itself. I was wondering how this could be accomplished and looked around a bit, and saw someone say that it could only be done by either
writing an initial compiler in a different language.
hand-coding an initial compiler in Assembly, which seems like a special case of the first
To me, neither of these seem to actually be bootstrapping a language in the sense that they both require outside support. Is there a way to actually write a compiler in its own language?
Is there a way to actually write a compiler in its own language?
You have to have some existing language to write your new compiler in. If you were writing a new, say, C++ compiler, you would just write it in C++ and compile it with an existing compiler first. On the other hand, if you were creating a compiler for a new language, let's call it Yazzleof, you would need to write the new compiler in another language first. Generally, this would be another programming language, but it doesn't have to be. It can be assembly, or if necessary, machine code.
If you were going to bootstrap a compiler for Yazzleof, you generally wouldn't write a compiler for the full language initially. Instead you would write a compiler for Yazzle-lite, the smallest possible subset of Yazzleof (well, a pretty small subset at least). Then in Yazzle-lite, you would write a compiler for the full language. (Obviously this can occur iteratively instead of in one jump.) Because Yazzle-lite is a proper subset of Yazzleof, you now have a compiler which can compile itself.
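As a toy illustration of that first step, here is a sketch of a "stage one" compiler written in an existing language (C) for a made-up language whose only programs are sums of single digits, e.g. "1+2+3". Everything in the code is invented for this example; it constant-folds the sum and emits C source, which an existing C compiler can then build, the same pattern a real bootstrap follows at much larger scale:

/* stage1.c -- a toy stage-one compiler, written in an existing
   language (C), for a made-up language of single-digit sums */
#include <stdio.h>

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s EXPR\n", argv[0]);
        return 1;
    }
    int total = 0;
    for (const char *p = argv[1]; *p; p++)
        if (*p >= '0' && *p <= '9')
            total += *p - '0';      /* the '+' separators are skipped */
    /* emit a C program that prints the compile-time result */
    printf("#include <stdio.h>\n");
    printf("int main(void) { printf(\"%d\\n\"); return 0; }\n", total);
    return 0;
}

Compiling stage1.c with an existing C compiler, running ./stage1 "1+2+3" > out.c, and compiling out.c yields a program that prints 6. Rewriting stage1 in the toy language itself (once the language grows enough to express it) would complete the bootstrap.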
There is a really good writeup about bootstrapping a compiler from the lowest possible level (which on a modern machine is basically a hex editor), titled Bootstrapping a simple compiler from nothing. It can be found at https://web.archive.org/web/20061108010907/http://www.rano.org/bcompiler.html.
The explanation you've read is correct. There's a discussion of this in Compilers: Principles, Techniques, and Tools (the Dragon Book):
Write a compiler C1 for language X in language Y
Use the compiler C1 to write compiler C2 for language X in language X
Now C2 is a fully self-hosting environment.
The way I've heard of is to write an extremely limited compiler in another language, then use that to compile a more complicated version, written in the new language. This second version can then be used to compile itself, and the next version. Each time it is compiled the last version is used.
This is the definition of bootstrapping:
the process of a simple system activating a more complicated system that serves the same purpose.
EDIT: The Wikipedia article on compiler bootstrapping covers the concept better than I can.
A super interesting discussion of this is in Unix co-creator Ken Thompson's Turing Award lecture.
He starts off with:
What I am about to describe is one of many "chicken and egg" problems that arise when compilers are written in their own language. In this case, I will use a specific example from the C compiler.
and proceeds to show how he wrote a version of the Unix C compiler that would always allow him to log in without a password, because the C compiler would recognize the login program and add in special code.
The second pattern is aimed at the C compiler. The replacement code is a Stage I self-reproducing program that inserts both Trojan horses into the compiler. This requires a learning phase as in the Stage II example. First we compile the modified source with the normal C compiler to produce a bugged binary. We install this binary as the official C. We can now remove the bugs from the source of the compiler and the new binary will reinsert the bugs whenever it is compiled. Of course, the login command will remain bugged with no trace in source anywhere.
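The "self-reproducing program" the quote leans on is what's usually called a quine: a program that prints its own source. Here is a classic C sketch of the idea (comments omitted, since they too would have to reproduce themselves); Thompson's actual code was different and considerably more subtle:

#include <stdio.h>
const char *s = "#include <stdio.h>%cconst char *s = %c%s%c;%cint main(void) { printf(s, 10, 34, s, 34, 10, 10); return 0; }%c";
int main(void) { printf(s, 10, 34, s, 34, 10, 10); return 0; }

The string holds a template of the whole program; printf fills in the newlines, the quotes, and the string itself, so the output is byte-for-byte the source. The trojaned compiler uses the same trick to keep reinserting its own bugged code.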
Check out podcast Software Engineering Radio episode 61 (2007-07-06) which discusses GCC compiler internals, as well as the GCC bootstrapping process.
Donald E. Knuth actually built WEB by writing the compiler in it, and then hand-compiled it to assembly or machine code.
As I understand it, the first Lisp interpreter was bootstrapped by hand-compiling the constructor functions and the token reader. The rest of the interpreter was then read in from source.
You can check for yourself by reading the original McCarthy paper, Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I.
Every example of bootstrapping a language I can think of (C, PyPy) was done after there was a working compiler. You have to start somewhere, and reimplementing a language in itself requires writing a compiler in another language first.
How else would it work? I don't think it's even conceptually possible to do otherwise.
Another alternative is to create a bytecode machine for your language (or use an existing one if its features aren't very unusual) and write a compiler to bytecode, either in the bytecode itself, or in your desired language using another intermediate, such as a parser toolkit which outputs the AST as XML, then compiling the XML to bytecode using XSLT (or another pattern-matching language and tree-based representation). It doesn't remove the dependency on another language, but it could mean that more of the bootstrapping work ends up in the final system.
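For concreteness, a minimal sketch in C of such a bytecode machine, with a five-opcode instruction set invented for this example (a real VM would need jumps, calls, bounds checks, and so on):

/* vm.c -- a toy stack-based bytecode interpreter */
#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_MUL, OP_PRINT, OP_HALT };

static void run(const int *code) {
    int stack[64], sp = 0;                 /* operand stack */
    for (int pc = 0;;) {
        switch (code[pc++]) {
        case OP_PUSH:  stack[sp++] = code[pc++]; break;
        case OP_ADD:   sp--; stack[sp-1] += stack[sp]; break;
        case OP_MUL:   sp--; stack[sp-1] *= stack[sp]; break;
        case OP_PRINT: printf("%d\n", stack[--sp]); break;
        case OP_HALT:  return;
        }
    }
}

int main(void) {
    /* bytecode for: print (2 + 3) * 4 */
    int prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD,
                   OP_PUSH, 4, OP_MUL, OP_PRINT, OP_HALT };
    run(prog);
    return 0;
}

A compiler targeting this machine only has to emit arrays of integers, which is far easier than emitting native machine code, and the VM itself stays portable.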
It's the computer science version of the chicken-and-egg paradox. I can't think of a way not to write the initial compiler in assembler or some other language. If it could have been done, I suspect Lisp would have done it.
Actually, I think Lisp almost qualifies. Check out its Wikipedia entry. According to the article, the Lisp eval function was implemented in machine code on an IBM 704, with a complete compiler (written in Lisp itself) coming into being in 1962 at MIT.
Some bootstrapped compilers or systems keep both the source form and the object form in their repository:
OCaml is a language which has both a bytecode interpreter (i.e. a compiler to OCaml bytecode) and a native compiler (to x86-64, ARM, etc. assembly). Its svn repository contains both the source code (files */*.{ml,mli}) and the bytecode (file boot/ocamlc) form of the compiler. So when you build it, it first uses the bytecode (of a previous version of the compiler) to compile itself. Later, the freshly compiled bytecode is able to compile the native compiler. So the OCaml svn repository contains both *.ml[i] source files and the boot/ocamlc bytecode file.
The Rust compiler downloads (using wget, so you need a working Internet connection) a previous version of its binary to compile itself.
MELT is a Lisp-like language to customize and extend GCC. It is translated to C++ code by a bootstrapped translator. The generated C++ code of the translator is distributed, so the svn repository contains both *.melt source files and melt/generated/*.cc "object" files of the translator.
J. Pitrat's CAIA artificial intelligence system is entirely self-generating. It is available as a collection of thousands of generated [A-Z]*.c files (plus a generated dx.h header file) together with a collection of thousands of _[0-9]* data files.
Several Scheme compilers are also bootstrapped: Scheme48, Chicken Scheme, ...

Producing executables within Linux (in relation to implementing a compiler)

For my university final-year dissertation, I am going to implement a compiler for a skeletal form of the C programming language, then extend it until it resembles something a little more like Java, with array bounds checking, type checking and so forth.
I am relatively competent at much of the theory that relates to compiler construction, and have experience programming in MIPS assembly language, so I do understand a little of what it is to write extremely low-level code.
My main concern is that I am likely to be able to get all the way to the point where I need to produce the actual machine-code output, but then not understand enough about how machine code is executed from the perspective of the operating system running it.
So, my actual question is basically: "does anyone know the best place to read up about writing assembly to run on an Intel x86-64 processor under Linux?"
The main gap in my knowledge is how the machine code is actually run in practice. Is it run directly on the processor, making "syscall"s (or the x86 equivalent) when it needs services provided by the kernel, or is the assembly language somehow an encapsulated description that tells the kernel how to execute the instructions (in a manner similar to an interpreted language such as Java)?
Any help you can provide would be greatly appreciated.
This document explains how you can implement a foreign function interface to interact with other code: http://www.x86-64.org/documentation/abi.pdf
Firstly, for the machine code start here: http://www.intel.com/products/processor/manuals/
Next, I assume your question about how the machine code is run is really about how the OS loads the exe into memory and calls main(). These links may help:
Linkers and loaders:
http://www.linuxjournal.com/article/6463
ELF file format:
http://en.wikipedia.org/wiki/Executable_and_Linkable_Format and
http://www.linuxjournal.com/article/1060
Your machine code will go into the .text section of the executable.
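As for the "is it run directly?" part of the question: yes. On Linux the instructions execute natively on the processor, and the program requests kernel services with the syscall instruction. A minimal sketch in C (x86-64 Linux, GCC-style inline assembly) that invokes the write system call directly, bypassing the C library:

/* rawwrite.c -- x86-64 Linux only.  The code runs natively until
   `syscall` transfers control to the kernel and back. */
static long raw_write(long fd, const void *buf, unsigned long len) {
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)                           /* result in rax */
                      : "a"(1), "D"(fd), "S"(buf), "d"(len) /* 1 = SYS_write */
                      : "rcx", "r11", "memory");            /* clobbered by syscall */
    return ret;
}

int main(void) {
    raw_write(1, "hello, kernel\n", 14);
    return 0;
}

Your compiler's generated code would do exactly this (usually via libc wrappers): ordinary instructions run unmediated, and the kernel is only involved at syscall boundaries, page faults, and interrupts.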
Finally, best of luck. Your project is similar to my final year project, except I targeted the JVM and compiled a subset of Visual Basic!

Which languages are dynamically typed and compiled (and which are statically typed and interpreted)?

In my reading on dynamic and static typing, I keep coming up against the assumption that statically typed languages are compiled, while dynamically typed languages are interpreted. I know that in general this is true, but I'm interested in the exceptions.
I'd really like someone to not only give some examples of these exceptions, but try to explain why it was decided that these languages should work in this way.
Here's a list of a few interesting systems. It is not exhaustive!
Dynamically typed and compiled
The Gambit Scheme compiler, Chez Scheme, Will Clinger's Larceny Scheme compiler, the Bigloo Scheme compiler, and probably many others.
Why?
Lots of people really like Scheme. Programs as data, good macro system, 35 years of development, big community. But they want performance. Hence, a number of good native-code compilers—Chez Scheme is even a successful commercial product (interpreted bytecodes are free; native codes you pay for).
The LuaJIT just-in-time compiler for Lua.
Why?
To show it could be done. And then, people started to like getting 3x speedup on their Lua programs. Lua is in a lot of games, where performance matters, plus it's creeping into other products too. 70% of the code in Adobe Lightroom is Lua.
The iconc Icon-to-C compiler.
Why?
The fifty people who used it loved Icon. Totally unusual evaluation model, the most innovative (and in my opinion, best) string-processing system ever designed. But that evaluation model was really expensive, especially on late-1980s computers. By compiling Icon to C, the Icon Project made it possible for big Icon programs to run in fewer hours.
Conclusion: people first develop an attachment to a dynamically typed language, and probably a significant code base. Eventually, the community spits out a native-code compiler so that you can get better performance and solve bigger problems.
Statically Typed and Interpreted
This category is less common, but...
Objective Caml. Dialect of ML, vehicle for lots of innovative experiments in language design.
Why?
Very portable system and very fast compilation times. People like both properties, so the new language-design ideas are disseminated widely.
Moscow ML. Standard ML with a few extra features of the modules system.
Why?
Portable, fast compilation times, easy to make an interactive read/eval/print loop. Became a popular teaching compiler.
C-Terp. An old product, I think maybe from Gimpel Software. Saber C—a product I don't think you can buy any more.
Why?
Debugging. Especially, debugging on 1980s hardware under MS-DOS. For very little resources, you could get really good help debugging C code on very limited hardware (think: 4.77MHz processor with an 8-bit bus, 640K of RAM fully loaded). Nearly impossible to get a good visual debugger for native-compiled code, but with the interpreter, fairly easy.
UCSD Pascal—the system that made "P-code" a household word.
Why?
Teachers liked Niklaus Wirth's language design, and the compiler could run on very small machines. Wirth's clean design and the UCSD P-system made an unbeatable combination, and Pascal was the standard teaching language of the 1970s. Younger people may find it hard to appreciate that in the 1970s there was no debate over what language to teach in the first course. Today I know of programs using C, C++, Haskell, Java, ML, and Scheme. In the 1970s it was always Pascal, and the UCSD P-system was a big reason why.
In case you are wondering, P stood for portable.
Summary: Interpreting a statically typed language is a great way to get an implementation into everybody's hands quickly. (It also had advantages for debugging on Bronze Age hardware.)
Objective-C is compiled and supports dynamic typing (certainly when calling methods via [target doSomething] syntax). That is, you can send any message to a target (using ordinary language syntax, without programming against a reflection API), receive only a warning at compile time that it might not be handled, and receive an exception at runtime only if the target doesn't respond to that selector (which is like a method signature). You can also ask any object (and they can all be of static type id if your code doesn't know any better or doesn't care) whether it respondsToSelector: in order to probe its capabilities.
Java (a statically typed language) is compiled to JVM bytecode, which was interpreted on older versions of the JVM, whereas it now uses Just In Time (JIT) compilation, meaning machine code is generated at runtime. I also believe ML and its dialects can be interpreted, and ML is definitely statically typed.
Python is a dynamic language that has compilers.
See this SO question - Python - why compile?, for instance.
In general, compiling makes the program run much faster.
Actionscript has dynamic typing and compiles to bytecode.
And it even compiles right down to native machine code if you want to release a Flash app on the iPhone.

Determine source language from a binary?

I responded to another question about developing for the iPhone in non-Objective-C languages, and I made the assertion that using, say, C# to write for the iPhone would strike an Apple reviewer as wrong. I was speaking largely about UI elements differing between the ObjC and C# libraries in question, but a commenter made an interesting point, leading me to this question:
Is it possible to determine the language a program is written in, solely from its binary? If there are such methods, what are they?
Let's assume for the purposes of the question:
That from an interaction standpoint (console behavior, any GUI appearance, etc.) the two are identical.
That performance isn't a reliable indicator of language (no comparing, say, Java to C).
That you don't have an interpreter or something between you and the language - just raw executable binary.
Bonus points if you're language-agnostic as possible.
Short answer: YES
Long answer:
If you look at a binary, you can find the names of the libraries that have been linked in. Opening cmd.exe in TextPad easily finds the following at hex offset 0x270: msvcrt.dll, KERNEL32.dll, NTDLL.DLL, USER32.dll, etc. msvcrt contains the Microsoft C runtime support functions. KERNEL32, NTDLL, and USER32.dll are OS-specific libraries which tell you either the target platform or the platform on which it was built, depending on how well the cross-platform development environment segregates the two.
Setting aside those clues, almost any C/C++ compiler will have to insert the names of the functions into the binary; there is a list of all functions (or entry points) stored in a table. C++ "mangles" the function names to encode the arguments and their types, in order to support overloaded methods. It is possible to obfuscate the function names, but they would still exist. The function signatures include the number and types of the arguments, which can be used to trace the system or internal calls used in the program. At offset 0x4190 is "SetThreadUILanguage", which can be searched for to find out a lot about the development environment. I found the entry-point table at offset 0x1ED8A. I could easily see names like printf, exit, and scanf, along with __p__fmode, __p__commode, and __initenv.
Any executable for the x86 processor will have a data segment containing any static text that was included in the program. Back in cmd.exe (offset 0x42C8) is the text "S.o.f.t.w.a.r.e..P.o.l.i.c.i.e.s..M.i.c.r.o.s.o.f.t..W.i.n.d.o.w.s..S.y.s.t.e.m.". The string takes twice as many characters as normally necessary because it was stored using double-wide characters, probably for internationalization. Error codes or messages are a prime source here.
At offset 0xB1B0 is "p.u.s.h.d", followed by mkdir, rmdir, chdir, md, rd, and cd; I left out the unprintable characters for readability. Those are all command arguments to cmd.exe.
For other programs, I've sometimes been able to find the path from which a program was compiled.
So, yes, it is possible to determine the source language from the binary.
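Most of the clues above (library names, mangled function names, static text) can be dug out with nothing more than a scan for runs of printable bytes. As an illustration, here is a bare-bones version of the Unix strings tool, sketched in C:

/* ministrings.c -- print every run of 4+ printable ASCII bytes in a file */
#include <stdio.h>
#include <ctype.h>

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror(argv[1]); return 1; }
    char buf[4096];
    int n = 0, c;
    while ((c = fgetc(f)) != EOF) {
        if (isprint(c)) {
            if (n < (int)sizeof buf - 1)
                buf[n++] = (char)c;
        } else {
            if (n >= 4) { buf[n] = '\0'; puts(buf); }  /* flush a run */
            n = 0;
        }
    }
    if (n >= 4) { buf[n] = '\0'; puts(buf); }
    fclose(f);
    return 0;
}

Run over an executable, it surfaces exactly the kind of linked-library names and runtime helpers the answer above relies on.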
I'm not a compiler hacker (someday, I hope), but I figure that you may be able to find telltale signs in a binary file that would indicate what compiler generated it and some of the compiler options used, such as the level of optimization specified.
Strictly speaking, however, what you're asking is impossible. It could be that somebody sat down with a pen and paper and worked out the binary codes corresponding to the program that they wanted to write, and then typed that stuff out in a hex editor. Basically, they'd be programming in assembly without the assembler tool. Similarly, you may never be able to tell with certainty whether a native binary was written in straight assembler or in C with inline assembly.
As for virtual machine environments such as JVM and .NET, you should be able to identify the VM by the byte codes in the binary executable, I would expect. However you may not be able to tell what the source language was, such as C# versus Visual Basic, unless there are particular compiler quirks that tip you off.
What about these tools:
PE Detective
PEiD
Both are PE identifiers. OK, they're both for Windows, but that's what I was dealing with when I landed here.
I expect you could, if you disassemble the binary, or at least you may be able to identify the compiler, as not all compilers will emit the same code for printf, for example; so Objective-C and GNU C should differ here.
You have excluded all bytecode languages, so this issue is going to be less common than expected.
First, run what on some binaries and look at the output. CVS (and SVN) identifiers are scattered throughout the binary image. And most of those are from libraries.
Also, there's often a "map" of the various library functions. That's a big hint, too.
When the libraries are linked into the executable, there is often a map included in the binary file with names and offsets. It's part of creating "position independent code". You can't simply "hard-link" the various object files together; you need a map, and you have to do some lookups when loading the binary into memory.
Finally, the start-up module for C, C++ (and I imagine C#) is unique to that compiler's default set of libraries.
Well, C is initially converted to ASM, so you could write all C code in ASM.
No, the bytecode is language-agnostic. Different compilers could even take the same source code and generate different binaries. That's why you don't see general-purpose decompilers that will work on arbitrary binaries.
The strings command can be used to get some hints as to what language was used (for instance, I just ran it on the stripped binary of a C application I wrote, and the first entries it finds are the libraries linked by the executable).
