A tool for understanding assembler programs? - linux

I have an assembler program that I am trying to understand (I'm actually profiling it).
Is there a Linux tool or editor that would give it a bit of structure for me?
Showing where the loops and jumps are would be enough.
A handy description of what each instruction does would be great.

If you are looking for something that resembles OllyDbg but for Linux, you might try edb.

Since you are really reversing a high-level language for profiling, you can do a couple of things to help the process. First, enable and preserve debugging information in the C++ compiler and linker (don't let it strip the executable). In g++, the -g command-line flag does this. Second, many C++ compilers have a flag to output the intermediate assembly source code rather than emitting the object code (which is used by the linker). In g++, the -S flag enables this.
The assembly from the compiler can then be compared with the assembly from the oprofile disassembly.
I'm not very familiar with decompilers, but two, including one mentioned in another SO post, are Boomerang and REC; both target C rather than C++.
I hope that helps.

There's an Asm plugin for Eclipse you could use.
However, usually IDEs aren't employed for assembly programming. I'm not sure how much more understanding you will gain by easily spotting where the jumps and loops are - you have to read through everything anyway.

Have a look at this...
http://sourceforge.net/projects/asmplugin/
It's a plugin for Eclipse...

Not really, but you can always consult instruction references, use syntax highlighting (Vim has an asm syntax), and step through the code in a debugger if nothing prevents you from running it. For already assembled code, this might be interesting: LIDA

This applies only to PowerPC CPUs.
The Demono project aims to recover algorithms from binary code (currently only for PPC); it is not a full decompiler. The project is still under construction, but some examples already work.
The site has an online service that generates a C-like description of functions from assembler:
online service for decompiling PPC asm
To use it, perform the following steps:
Disassemble the binary code (for PowerPC) into assembler text (with IDA)
Paste the assembler text into the "Ассемблер" ("Assembler") field
Press the "Восстановить" ("Restore") button
Look at the "Восстановленный алгоритм" ("Restored algorithm") form
For example, if your assembler text is:
funct(){
0x00000002: cmpw r3, r4
0x00000003: ble label_1
0x00000004: mr r6, r3
0x00000005: b label_2
0x00000006: label_1:
0x00000007: mr r6, r4
0x00000008: label_2:
0x00000009: cmpw r6, r5
0x0000000A: ble label_3
0x0000000B: mr r7, r6
0x0000000C: b label_4
0x0000000D: label_3:
0x0000000E: mr r7, r5
0x0000000F: label_4:
0x00000010: mr r3, r7
0x00000011: blr
}
then the online service restores the following description of the function's algorithm:
funct(arg_w_0, arg_w_1, arg_w_2){
if(arg_w_0>arg_w_1?true?false){
arg_w_0=arg_w_1;
}else{
}
if(arg_w_0>arg_w_2?true?false){
arg_w_0=arg_w_2;
}else{
}
return (arg_w_0);
}
So funct() computes the maximum of three numbers: max3().
This service may help you understand how assembler instructions work.
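As a rough cross-check of the example, the control flow of the PowerPC listing can be transcribed by hand into Python. This is only a sketch: the name max3 comes from the answer, while the parameter names a, b, c (standing for r3, r4, r5) and the locals r6, r7 are chosen here for illustration.

```python
def max3(a, b, c):
    """Hand transcription of the PowerPC listing; a, b, c stand for r3, r4, r5."""
    # cmpw r3, r4 / ble label_1: the branch is taken when r3 <= r4
    if a <= b:
        r6 = b       # label_1: mr r6, r4
    else:
        r6 = a       # mr r6, r3
    # cmpw r6, r5 / ble label_3: the branch is taken when r6 <= r5
    if r6 <= c:
        r7 = c       # label_3: mr r7, r5
    else:
        r7 = r6      # mr r7, r6
    return r7        # mr r3, r7 / blr
```

Each branch keeps the larger of the two compared values, which is why the listing as a whole returns the largest of its three arguments.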

Related

Is there a way to use NASM syntax for inline assembly?

I really dislike the GNU Assembler syntax, and I have some existing code written in NASM syntax that would be quite painful and time-consuming to port.
Is it possible to make the global_asm!() macro use NASM as the assembler or possibly make GAS use NASM syntax?
You might be able to change it but it seems as if GAS is the only viable option. In Directives Support:
'Inline assembly supports a subset of the directives supported by both GNU AS and LLVM's internal assembler, given as follows. The result of using other directives is assembler-specific (and may cause an error, or may be accepted as-is).'
Additionally, the documentation states: 'Currently, all supported targets follow the assembly code syntax used by LLVM's internal assembler which usually corresponds to that of the GNU assembler (GAS). On x86, the .intel_syntax noprefix mode of GAS is used by default.'
This might be helpful as well https://github.com/Amanieu/rfcs/blob/inline-asm/text/0000-inline-asm.md

ARM assembly "retne" instruction

I am currently in the process of understanding what it takes for the Linux kernel to boot. I was browsing through the Linux kernel source tree, in particular for the ARM architecture, until I stumbled upon this assembly instruction retne lr in arch/arm/kernel/hyp-stub.S
Conceptually, it's easily understood that the instruction is supposed to return to the address stored in the link register if the Z-flag is 0. What I am looking for is where this ARM assembly instruction is actually documented.
I searched in the ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition section A8.8 and could not find the description of the instruction.
Grepping the sources and seeing if it was an ARM specific GNU AS extension did not turn up anything in particular.
A Google search with the queries "arm assembly ret instruction", "arm return instruction" and similar did not turn up anything useful either. Surely I must be looking in the wrong places or missing something.
Any clarification will be much appreciated.
The architectural assembly language is one thing, real world code is another. Once assembler pseudo-ops and macros come into play, a familiarity with both the toolchain and the codebase in question helps a lot. Linux is particularly nasty as much of the assembly source contains multiple layers of both assembler macros and CPP macros. If you know what to look for, and follow the header trail to arch/arm/include/asm/assembler.h, you eventually find this complicated beast:
.irp c,,eq,ne,cs,cc,mi,pl,vs,vc,hi,ls,ge,lt,gt,le,hs,lo
.macro ret\c, reg
#if __LINUX_ARM_ARCH__ < 6
mov\c pc, \reg
#else
.ifeqs "\reg", "lr"
bx\c \reg
.else
mov\c pc, \reg
.endif
#endif
.endm
.endr
The purpose of this is to emit the architecturally-preferred return instruction for the benefit of microarchitectures with a return stack, whilst allowing the same code to still compile for older architectures.
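To make the macro's decision easier to follow, here is a small Python model of which instruction text it selects. It is a simplified sketch and not kernel code: the function name expand_ret and its parameters are hypothetical, and linux_arm_arch stands in for the __LINUX_ARM_ARCH__ preprocessor value.

```python
def expand_ret(cond, reg, linux_arm_arch):
    """Return the instruction text the ret<cond> macro would emit (simplified model).

    cond is a condition suffix such as "ne" (or "" for unconditional),
    reg is the register operand, linux_arm_arch models __LINUX_ARM_ARCH__.
    """
    if linux_arm_arch < 6:
        # Older architectures: always fall back to a plain move into pc.
        return f"mov{cond} pc, {reg}"
    if reg == "lr":
        # Returning through the link register: use the preferred bx form.
        return f"bx{cond} {reg}"
    return f"mov{cond} pc, {reg}"
```

Under this model, expand_ret("ne", "lr", 7) gives "bxne lr", so retne lr is not an architectural instruction at all but a macro that expands to a conditional bx.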

CMP command not working properly

I am using the cmp command on an x86 processor and it works properly (the binary files are generated using gcc),
but on an ARM Cortex-A9 it does not give the proper output (the binaries are generated using a cross gcc).
When I compare the board-specific binaries on the x86 machine using cmp, it produces the proper output.
x86 machine:
Say I have two files, a.bin and b.bin (they should be identical when compared using cmp):
cmp a.bin b.bin
and the result is as expected.
ARM Cortex-A9:
a.bin, b.bin
cmp a.bin b.bin
Here, too, they should be identical,
but cmp reports a mismatch.
Any clue, please?
Your question isn't very clear and is a little vague so I'll take a stab in the dark and assume that you're asking why the same source code compiles to different files.
Although a compiled program (assuming no UB or portability issues) will be functionally the same no matter what compiler is used, the program on the binary level won't necessarily be.
Different optimization levels will generate different files for example. The compiler may embed build dates into the file. Different compilers will arrange the code differently.
These are all reasons why you may be getting different outputs for the 'same' program.
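When two builds unexpectedly differ, it helps to know where they diverge (a header, an embedded date, the code itself). The sketch below finds the first differing byte, roughly the way cmp does; the function names first_difference and compare_files are made up for this illustration, and the offsets are 0-based, whereas cmp itself counts bytes from 1.

```python
def first_difference(a: bytes, b: bytes):
    """Return the 0-based offset of the first differing byte, or None if equal."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        # One input is a strict prefix of the other.
        return min(len(a), len(b))
    return None


def compare_files(path_a, path_b):
    """Read both files in binary mode and report where they first differ."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return first_difference(fa.read(), fb.read())
```

Calling compare_files("a.bin", "b.bin") on both machines and comparing the reported offsets would show whether the files really differ or whether the two cmp invocations are looking at different files.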

Reading integers from keyboard in Assembly (Linux IA-32 x86 gcc gas)

I'd like to know how to read integers from keyboard in assembly. I'm using Linux/x86 IA-32 architecture and GCC/GAS (GNU Assembler). The examples I found so far are for NASM or some other Windows/DOS related compiler.
I heard that it has something to do with the "int 16h" interrupt, but I don't know how it works (does it need parameters? Does the result go to %eax or one of its sub-registers [AX, AH, AL]?).
Thanks in advance,
Flayshon.
:D
Simple answer is that you don't read integers from the keyboard, you read characters from the keyboard. You don't print integers to the screen, either - you print characters. You will need routines to convert "ascii-to-integer" and "integer-to-ascii". You can "just call scanf" for the one, and "just call printf" for the other. "scanf" works okay if the user is well-behaved and confines input to characters representing decimal digits, but it's difficult to get rid of any "junk" entered! "printf" isn't too bad.
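The two conversion routines mentioned above can be sketched in Python to show the arithmetic an assembly version would have to perform. The names ascii_to_int and int_to_ascii are chosen here for illustration; the sketch handles only decimal digits with an optional leading minus sign and rejects any other "junk".

```python
def ascii_to_int(text: str) -> int:
    """Convert decimal digit characters (optional leading '-') to an integer."""
    negative = text.startswith("-")
    digits = text[1:] if negative else text
    if not digits:
        raise ValueError("no digits")
    value = 0
    for ch in digits:
        if not "0" <= ch <= "9":
            raise ValueError(f"not a decimal digit: {ch!r}")
        # Shift the accumulated value one decimal place and add the new digit.
        value = value * 10 + (ord(ch) - ord("0"))
    return -value if negative else value


def int_to_ascii(value: int) -> str:
    """Convert an integer to its decimal digit characters."""
    if value == 0:
        return "0"
    negative = value < 0
    value = abs(value)
    chars = []
    while value > 0:
        # Repeated division by 10 yields the digits from least significant up.
        value, digit = divmod(value, 10)
        chars.append(chr(ord("0") + digit))
    if negative:
        chars.append("-")
    return "".join(reversed(chars))
```

In assembly, the same loops become a multiply-and-add per input character and a divide-by-ten per output digit, with the digits written into a buffer from the end backwards.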
Although I'm a Nasm user (it works fine for Linux - not really "Windows/dos related"), I might have routines in (G)as syntax lying around. I'll see if I can find 'em if you can't figure it out.
As Brian points out, int 16h is a BIOS interrupt - 16-bit code - and is not useful in Linux.
Best,
Frank
In 2012, I don't recommend coding an entire program in assembly. Code only the most critical parts in it (if you absolutely want some assembly code). Compilers optimize better than humans. So use C or C++ for low-level software, and a higher-level language (e.g. OCaml) otherwise.
On Linux, you need to understand the role of the Linux kernel and of system calls, which are documented in section 2 of the man pages. You probably want at least read(2) and write(2) (if you only handle stdin and stdout, which should already have been opened by the parent process, e.g. a shell), and you probably need many other syscalls (e.g. open(2) and close(2)). Don't forget to do your own buffering (for efficiency).
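The read(2)/write(2) pattern can be tried out from Python, whose os.read and os.write are thin wrappers over those system calls. The sketch below is an illustration under assumptions, not a translation of any particular assembly program: the helper names read_line, write_all and echo_doubled are hypothetical, and anything read past the first newline is simply dropped for brevity.

```python
import os


def read_line(fd: int, bufsize: int = 4096) -> bytes:
    """Read from fd until a newline or end of input, one read(2) per chunk."""
    chunks = []
    while True:
        chunk = os.read(fd, bufsize)
        if not chunk:          # read returned 0 bytes: end of input
            break
        chunks.append(chunk)
        if b"\n" in chunk:
            break
    return b"".join(chunks)


def write_all(fd: int, data: bytes) -> None:
    """write(2) may write fewer bytes than requested, so loop until done."""
    while data:
        written = os.write(fd, data)
        data = data[written:]


def echo_doubled(in_fd: int = 0, out_fd: int = 1) -> None:
    """Read one integer from in_fd and write twice its value to out_fd."""
    line = read_line(in_fd)
    number = int(line.split(b"\n", 1)[0].strip())
    write_all(out_fd, str(number * 2).encode("ascii") + b"\n")
```

With the default descriptors 0 and 1, echo_doubled() reads from stdin and writes to stdout, which is the case the answer describes; an assembly version would issue the same two system calls and do the digit conversion itself.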
I strongly recommend learning the Linux system interfaces by reading a good book such as Advanced Unix Programming.
How system calls are done at the machine level in assembly is documented in the Linux Assembly Howto (at least for x86 Linux in 32 bits).
If your goal is to "obtain" a program, I would agree entirely with Basile. If your goal is to "learn assembly language", these other languages aren't really going to help. If your goal is to learn the nitty-gritty details of the hardware, you probably want assembly language, but Linux (or any other "protected mode" OS) isolates us from the hardware, so you might want to use clunky old DOS or even "write your own OS". Flayshon doesn't actually say what his goal is, but since he's asking here, he's probably interested in assembly language...
Some of us have a mental illness that makes us think it's "fun" to write in assembly language. Humor us!
Best,
Frank

How does decompiling work?

I have heard the term "decompiling" used a few times before, and I am starting to get very curious about how it works.
I have a very general idea of how it works; reverse engineering an application to see what functions it uses, but I don't know much beyond that.
I have also heard the term "disassembler", what is the difference between a disassembler and a decompiler?
So to sum up my question(s): What exactly is involved in the process of decompiling something? How is it usually done? How complicated or easy a process is it? Can it produce the exact code? And what is the difference between a decompiler and a disassembler?
Ilfak Guilfanov, the author of the Hex-Rays Decompiler, gave a talk about the internal workings of his decompiler at a conference, and here is the white paper and a presentation. They give a nice overview of the difficulties in building a decompiler and of how to make it all work.
Apart from that, there are some quite old papers, e.g. the classical PhD thesis of Cristina Cifuentes.
As for the complexity, all the "decompiling" stuff depends on the language and runtime of the binary. For example, decompiling .NET and Java is considered "done", as there are free decompilers available that have a very high success rate (they produce something very close to the original source). But that is due to the very specific nature of the virtual machines that these runtimes use.
As for truly compiled languages, like C, C++, Obj-C, Delphi, Pascal, ... the task gets much more complicated. Read the above papers for details.
what is the difference between a disassembler and a decompiler?
When you have a binary program (executable, DLL library, ...), it consists of processor instructions. The language of these instructions is called assembly (or assembler). In a binary, these instructions are binary encoded, so that the processor can directly execute them. A disassembler takes this binary code and translates it into a text representation. This translation is usually 1-to-1, meaning one instruction is shown as one line of text. This task is complex, but straightforward, the program just needs to know all the different instructions and how they are represented in a binary.
On the other hand, a decompiler does a much harder task. It takes either the binary code or the disassembler output (which is basically the same, because it's 1-to-1) and produces high-level code. Let me show you an example. Say we have this C function:
int twotimes(int a) {
return a * 2;
}
When you compile it, the compiler first generates an assembly file for that function. It might look something like this:
_twotimes:
SHL EAX, 1
RET
(The first line is just a label and not a real instruction; SHL performs a shift-left operation, which is a quick way to multiply by two; RET means that the function is done.) In the resulting binary, it looks like this:
08 6A CF 45 37 1A
(I made that up; these are not real binary instructions.) Now you know that a disassembler takes you from the binary form to the assembly form. A decompiler takes you from the assembly form to the C code (or some other higher-level language).
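The 1-to-1 nature of disassembly can be shown with a toy lookup table. Unlike the made-up bytes above, the sketch uses the actual 32-bit x86 encodings of the two instructions (D1 E0 for SHL EAX, 1 and C3 for RET). The table and the function name toy_disassemble are illustrative only, and a real disassembler decodes opcode fields rather than matching whole instructions.

```python
# Byte sequence -> assembly text, for just the two instructions in the example.
TOY_TABLE = {
    b"\xd1\xe0": "SHL EAX, 1",
    b"\xc3": "RET",
}


def toy_disassemble(code: bytes):
    """Translate a byte string into one line of assembly text per instruction."""
    lines = []
    pos = 0
    while pos < len(code):
        for encoding, text in TOY_TABLE.items():
            if code.startswith(encoding, pos):
                lines.append(text)
                pos += len(encoding)
                break
        else:
            raise ValueError(f"unknown byte 0x{code[pos]:02x} at offset {pos}")
    return lines
```

For the three bytes D1 E0 C3, this yields the two lines of the assembly listing. Getting from those two lines back to "return a * 2;" is the decompiler's job and has no such table-driven solution.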
Decompiling is essentially the reverse of compiling. That is - taking the object code (binary) and trying to recreate the source code from it.
Decompilation depends on artefacts being left in the object code which can be used to ascertain the structure of the source code.
With C/C++ there isn't much left to help the decompilation process so it's very difficult. However with Java and C# and other languages which target virtual machines, it can be easier to decompile because the language leaves many more hints within the object code.
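Python's own bytecode gives a feel for the kind of hints a virtual-machine target keeps. This is only an analogy to the Java and C# cases, not a statement about their file formats: the compiled code object of a function still carries the function's name and its parameter names, which a natively compiled and stripped C binary would not.

```python
def twotimes(a):
    return a * 2


def surviving_names(func):
    """Return the function name and parameter names stored in the bytecode object."""
    code = func.__code__
    return code.co_name, code.co_varnames[:code.co_argcount]
```

Here surviving_names(twotimes) returns the original identifiers, which is the sort of information a decompiler for a VM language can use to reconstruct readable source.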

Resources