I've been learning how to write exploits using stack-based buffer overflows and, the one thing I cannot comprehend is just how the code (I believe "Machine Code") is interpreted and used. What I am talking about is the "/x3b/x09..." used in the actual injection of arbitrary code. I would like some clarification on how a simple "Hello World" program can be turned into what looks like hexadecimal, and that if the usage of this as a payload would be platform-specific. Any clarification would be greatly appreciated, thanks.
It's interpreted by the CPU. That is the binary machine code that the CPU actually processes, just like the code that's created when you compile a program in C or C++.
Yes, it's platform-specific, just like compilers have to generate code for a specific platform.
Related
I was searching for writing an emulator, and its techniques. But following paragraph made me wondered, I think I couldn't figure out which scenario can be present, if you write a self-modifying code to be static-recompilation emulated.
In this technique, you take a program written in the emulated code and attempt to translate it into the assembly code of your computer. The result will be a usual executable file which you can run on your computer without any special tools. While static recompilation sounds very nice, it is not always possible. For example, you cannot statically recompile self-modifying code as there is no way to tell what it will become without running it. To avoid such situations, you may try combining static recompiler with an interpreter or a dynamic recompiler.
Here is what I was reading, and this line made me wondered:
For example, you cannot statically recompile self-modifying code as there is no way to tell what it will become without running it
A good explanation with examples will be so instructive, thanks.
Edit: By the way, I know the meaning of self-modifying, I just wonder what problems and where will we get problems after statically-recompilation, which thing will make our self-modifying code broken.
Self-modifying code heavily relies on the instruction set encoding of the original CPU. For example, it could flip some bits in a specific memory location to turn one instruction into another. With static recompilation, flipping those same bits will have an entirely different effect since the instructions are encoded completely differently for the host CPU.
I have heard the term "decompiling" used a few times before, and I am starting to get very curious about how it works.
I have a very general idea of how it works; reverse engineering an application to see what functions it uses, but I don't know much beyond that.
I have also heard the term "disassembler", what is the difference between a disassembler and a decompiler?
So to sum up my question(s): What exactly is involved in the process of decompiling something? How is it usually done? How complicated/easy of a processes is it? can it produce the exact code? And what is the difference between a decompiler, and a disassembler?
Ilfak Guilfanov, the author of Hex-Rays Decompiler, gave a speech about the internal working of his decompiler at some con, and here is the white paper and a presentation. This describes a nice overview in what are all the difficulties in building a decompiler and how to make it all work.
Apart from that, there are some quite old papers, e.g. the classical PhD thesis of Cristina Cifuentes.
As for the complexity, all the "decompiling" stuff depends on the language and runtime of the binary. For example decompiling .NET and Java is considered "done", as there are available free decompilers, that have a very high succeed ratio (they produce the original source). But that is caused by the very specific nature of the virtual machines that these runtimes use.
As for truly compiled languages, like C, C++, Obj-C, Delphi, Pascal, ... the task get much more complicated. Read the above papers for details.
what is the difference between a disassembler and a decompiler?
When you have a binary program (executable, DLL library, ...), it consists of processor instructions. The language of these instructions is called assembly (or assembler). In a binary, these instructions are binary encoded, so that the processor can directly execute them. A disassembler takes this binary code and translates it into a text representation. This translation is usually 1-to-1, meaning one instruction is shown as one line of text. This task is complex, but straightforward, the program just needs to know all the different instructions and how they are represented in a binary.
On the other hand, a decompiler does a much harder task. It takes either the binary code or the disassembler output (which is basically the same, because it's 1-to-1) and produces high-level code. Let me show you an example. Say we have this C function:
int twotimes(int a) {
return a * 2;
}
When you compile it, the compiler first generates an assembly file for that function, it might look something like this:
_twotimes:
SHL EAX, 1
RET
(the first line is just a label and not a real instruction, SHL does a shift-left operation, which does a quick multiply by two, RET means that the function is done). In the result binary, it looks like this:
08 6A CF 45 37 1A
(I made that up, not real binary instructions). Now you know, that a disassembler takes you from the binary form to the assembly form. A decompiler takes you from the assembly form to the C code (or some other higher-level language).
Decompiling is essentially the reverse of compiling. That is - taking the object code (binary) and trying to recreate the source code from it.
Decompilation depends on artefacts being left in the object code which can be used to ascertain the structure of the source code.
With C/C++ there isn't much left to help the decompilation process so it's very difficult. However with Java and C# and other languages which target virtual machines, it can be easier to decompile because the language leaves many more hints within the object code.
For my university, final-year dissertation, I am going to implement a compiler for a skeletal form of the C programming language, then go about extending it until it resembles something a little more like Java with array bounds checking, type-checking and so forth.
I am relatively competent at much of the theory that relates to compiler construction, and have experience programming in MIPS assembly language, so I do understand a little of what it is to write extremely low-level code.
My main concern is that I am likely to be able to get all the way to the point where I need to produce the actual machine-code output, but then not understand enough about how machine code is executed from the perspective of the operating system running it.
So, my actual question is basically, "does anyone know the best place to read up about writing assembly to run on an intel x86-64 processor under linux?"
The main gap in my knowledge is how the machine code is actually run in practise. Is it run directly on the processor, making "syscall"s (or the x86 equivalent) when it needs services provided by the kernel, or is the assembly language somehow an encapsulated description that tells the kernel how to execute the instructions (in a manner similar to an interpreted language such as Java)?
Any help you can provide would be greatly appreciated.
This document explains how you can implement a foreign function interface to interact with other code: http://www.x86-64.org/documentation/abi.pdf
Firstly, for the machine code start here: http://www.intel.com/products/processor/manuals/
Next, I assume your question about how the machine code is run is really about how the OS loads the exe into memory and calls main()? These links may help
Linkers and loaders:
http://www.linuxjournal.com/article/6463
ELF file format:
http://en.wikipedia.org/wiki/Executable_and_Linkable_Format and
http://www.linuxjournal.com/article/1060
Your machine code will go into the .text section of the executable
Finally, best of luck. Your project is similar to my final year project, except I targeted the JVM and compiled a subset of Visual Basic!
I know that its possible to read from a .txt file and then convert various parts of that into string, char, and int values, but is it possible to take a string and use it as real code in the program?
Code:
string codeblock1="cout<<This is a test;";
string codeblock2="int array[5]={0,6,6,3,5};}";
int i;
cin>>i;
if(i)
{
execute(codeblock1);
}
else
{
execute(codeblock2);
}
Where execute is a function that converts from text to actual code (I don't know if there actually is a function called execute, I'm using it for the purpose of my example).
In C++ there's no simple way to do this. This feature is available in higher-level languages like Python, Lisp, Ruby and Perl (usually with some variation of an eval function). However, even in these languages this practice is frowned upon, because it can result in very unreadable code.
It's important you ask yourself (and perhaps tell us) why you want to do it?
Or do you only want to know if it's possible? If so, it is, though in a hairy way. You can write a C++ source file (generate whatever you want into it, as long as it's valid C++), then compile it and link to your code. All of this can be done automatically, of course, as long as a compiler is available to you in runtime (and you just execute it with system). I know someone who did this for some heavy optimization once. It's not pretty, but can be made to work.
You can create a function and parse whatever strings you like and create a data structure from it. This is known as a parse tree. Subsequently you can examine your parse tree and generate the necessary dynamic structures to perform the logic therin. The parse tree is subsequently converted into a runtime representation that is executed.
All compilers do exactly this. They take your code and they produce machine code based on this. In your particular case you want a language to write code for itself. Normally this is done in the context of a code generator and it is part of a larger build process. If you write a program to parse your language (consider flex and bison for this operation) that generates code you can achieve the results you desire.
Many scripting languages offer this sort of feature, going all the way back to eval in LISP - but C and C++ don't expose the compiler at runtime.
There's nothing in the spec that stops you from creating and executing some arbitrary machine language, like so:
char code[] = { 0x2f, 0x3c, 0x17, 0x43 }; // some machine code of some sort
typedef void (FuncType*)(); // define a function pointer type
FuncType func = (FuncType)code; // take the address of the code
func(); // and jump to it!
but most environments will crash if you try this, for security reasons. (Many viruses work by convincing ordinary programs to do something like this.)
In a normal environment, one thing you could do is create a complete program as text, then invoke the compiler to compile it and invoke the resulting executable.
If you want to run code in your own memory space, you could invoke the compiler to build you a DLL (or .so, depending on your platform) and then link in the DLL and jump into it.
First, I wanted to say, that I never implemented something like that myself and I may be way off, however, did you try CodeDomProvider class in System.CodeDom.Compiler namespace? I have a feeling the classes in System.CodeDom can provide you with the functionality you are looking for.
Of course, it will all be .NET code, not any other platform
Go here for sample
Yes, you just have to build a compiler (and possibly a linker) and you're there.
Several languages such as Python can be embedded into C/C++ so that may be an option.
It's kind of sort of possible, but not with just straight C/C++. You'll need some layer underneath such as LLVM.
Check out c-repl and ccons
One way that you could do this is with Boost Python. You wouldn't be using C++ at that point, but it's a good way of allowing the user to use a scripting language to interact with the existing program. I know it's not exactly what you want, but perhaps it might help.
Sounds like you're trying to create "C++Script", which doesn't exist as far as I know. C++ is a compiled language. This means it always must be compiled to native bytecode before being executed. You could wrap the code as a function, run it through a compiler, then execute the resulting DLL dynamically, but you're not going to get access to anything a compiled DLL wouldn't normally get.
You'd be better off trying to do this in Java, JavaScript, VBScript, or .NET, which are at one stage or another interpreted languages. Most of these languages either have an eval or execute function for just that, or can just be included as text.
Of course executing blocks of code isn't the safest idea - it will leave you vulnerable to all kinds of data execution attacks.
My recommendation would be to create a scripting language that serves the purposes of your application. This would give the user a limited set of instructions for security reasons, and allow you to interact with the existing program much more dynamically than a compiled external block.
Not easily, because C++ is a compiled language. Several people have pointed round-about ways to make it work - either execute the compiler, or incorporate a compiler or interpreter into your program. If you want to go the interpreter route, you can save yourself a lot of work by using an existing open source project, such as Lua
I remember a professor once saying that interpreted code was about 10 times slower than compiled. What's the speed difference between interpreted and bytecode? (assuming that the bytecode isn't JIT compiled)
I ask because some folks have been kicking around the idea of compiling vim script into bytecode and I just wonder what kind of performance boost that will get.
When you compile things down to bytecode, you have the opportunity to first perform a bunch of expensive high-level optimizations. You design the byte-code to be very easily compiled to machine code and run all the optimizations and flow analysis ahead of time.
The speed-increase is thus potentially quite substantial - not only do you skip the whole lexing/parsing stages at runtime, but you also have more opportunity to apply optimizations and generate better machine code.
You could see a pretty good boost. However, there are a lot of factors. You can't just say that compiled code is always about 10 times faster than interpreted code, or that bytecode is n times faster than interpreted code.
Factors include the complexity and verbosity of the language for example. If a keyword in the language is several characters, and the bytecode is one, it should be quite a bit faster to load the bytecode, and jump to the routine that handles that bytecode, than it is to read the keyword string, then figure out where to go. But, if you're interpreting one of the exotic languages that has a one-byte keyword, the difference might be less noticeable.
I've seen this performance boost in practice, so it might worth it for you. Besides, it's fun to write such a thing, gives you a feel for how language interpreters and compilers work, and that will make you a better coder.
Are there actually any mainstream "interpreters" these days that don't actually compile their code? (Either to bytecode or something similar.)
For instance, when you use use a Perl program directly from its source code, the first thing it does is compile the source into a syntax tree, which it then optimizes and uses to execute the program. In normal situations the time spent compiling is tiny compared to the time actually running the program.
Sticking to this example, obviously Perl cannot be faster than well-optimized C code, as it is written in C. In practice, for most things you would normally do with Perl (like text processing), it will be as fast as you could reasonably code it in C, and orders of magnitude easier to write. On the other hand, I certainly wouldn't try to write a high performance math routine directly in Perl.
Also, a lot of "classic" interpreters also include the lex/parse phase along with execution.
For example, consider executing a Python script. When you do that, you have all the costs associated with converting the program text in to the internal interpreter data structures, which are then executed.
Now contrast that with executing a compiled Python script, a .pyc file. Here, the lex and parse phase is done, and you have just the runtime of the inner interpreter.
But if you consider, say, a classic BASIC interpreter, these typically never store the raw text, rather they store a tokenized form and recreate the program text when you do "LIST". Here the byte code is much cruder (you don't really have a virtual machine here), but your execution gets to skip some of the text processing. That's all done when you enter the line and hit ENTER.
It is according to your virtual machine. Some of your faster virtual machines(JVM) are approaching the speed of C code. So how fast is your interpreted code running compared to C?
Don't think that if you convert your interpreted code into ByteCode it will run as fast a Java(near C speeds), there has been years of performance boosting going on, but you should see significant speed boost.
Emacs has been ported into bytecode with increased performance. Might be worth a look to you.
I've never noticed a Vim script that was slow enough to notice. Assuming a script primarily calls built-in, native-code, operations (regexes, block operations, etc) that are implemented in the editor's core, even a 10x speed-up of the 'glue logic' in scripting would be insignificant.
Still, profiling is the only way to be really sure.