Why isn't a constant multiply realized by a mul instruction? [duplicate] - linux

This question already has an answer here:
x86 Multiplication with 3: IMUL vs SHL + ADD
(1 answer)
Closed 1 year ago.
Let's consider the following function:
#include <stdint.h>
uint64_t foo(uint64_t x) { return x * 3; }
If I were to write it, I'd do
.global foo
.text
foo:
imul %rax, %rdi, $0x3
ret
On the other hand, the compiler generates two additions, with -O0:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 89 7d f8 mov %rdi,-0x8(%rbp)
8: 48 8b 55 f8 mov -0x8(%rbp),%rdx
c: 48 89 d0 mov %rdx,%rax
f: 48 01 c0 add %rax,%rax
12: 48 01 d0 add %rdx,%rax
15: 5d pop %rbp
16: c3 retq
or lea with -O2:
0000000000000000 <foo>:
0: 48 8d 04 7f lea (%rdi,%rdi,2),%rax
4: c3 retq
Why? Since every assembly instruction equals one processor clock tick, my version should run within 2 CPU clock cycles (since it has two instructions), in the -O0 we need 4 cycles for performing addition, because it could be rewritten to
mov %rdi,%rax
add %rax,%rax
add %rdi,%rax
retq
and the lea should take two cycles either.

You're targeting a processor with dedicated address-calculation units. It's likely to be faster to compute small multiplications in the address calculator than in a general-purpose arithmetic/logic unit (ALU).
Also, depending on your processor model, the ALU may be shared with other code, either due to hyperthreading or by speculative or out-of-order execution within the same thread. Your compiler is making a good estimate of how best to utilise the available resources to give a good throughput of execution without stalling.
The idea that "every assembly instruction equals one processor clock tick" (or even a fixed number of cycles) has only ever been true on the very simplest of processors.

Related

Stack canaries can be disabled by compiler?

Who is responsible for inserting the stack canaries in the stack? Is it the OS?
If yes, how can the gcc compiler disable them by using the -fno-stack-protector option? Or it is only a flag created using that option and added to the binary to tell the OS to not insert canaries in the stack where the binary is loaded at runtime?
EDIT: one more question
Who checks the value of the canaries if they were changed over the execution?
Again if inserted by the compiler, how can be checked by the OS? If inserted by the OS how can it be disabled by the compiler (main question)?
Who is responsible for inserting the stack canaries in the stack?
The compiler. The code for creating and checking stack canaries is a subset of the code generated by the compiler from the program source code.
For GCC:
-fstack-protector
Emit extra code to check for buffer overflows, such as stack smashing attacks. This is done by adding a guard variable to functions with vulnerable objects. This includes functions that call alloca, and functions with buffers larger than 8 bytes. The guards are initialized when a function is entered and then checked when the function exits. If a guard check fails, an error message is printed and the program exits.
The aforementioned "guard variable" is commonly referred to as a canary:
The basic idea behind stack protection is to push a "canary" (a randomly chosen integer) on the stack just after the function return pointer has been pushed. The canary value is then checked before the function returns; if it has changed, the program will abort. Generally, stack buffer overflow (aka "stack smashing") attacks will have to change the value of the canary as they write beyond the end of the buffer before they can get to the return pointer. Since the value of the canary is unknown to the attacker, it cannot be replaced by the attack. Thus, the stack protection allows the program to abort when that happens rather than return to wherever the attacker wanted it to go.1
Example program:
Source code:
int test(int i) {
return i;
}
int main(void) {
int x;
int i = 10;
x = test(i);
return x;
}
Function from binary compiled without -fstack-protector-all:
$ objdump -dj .text test | grep -A7 "<test>:"
00000000004004ed <test>:
4004ed: 55 push %rbp
4004ee: 48 89 e5 mov %rsp,%rbp
4004f1: 89 7d fc mov %edi,-0x4(%rbp)
4004f4: 8b 45 fc mov -0x4(%rbp),%eax
4004f7: 5d pop %rbp
4004f8: c3 retq
Function from binary compiled with -fstack-protector-all:
$ objdump -dj .text protected_test | grep -A20 "<test>:"
000000000040055d <test>:
40055d: 55 push %rbp
40055e: 48 89 e5 mov %rsp,%rbp
400561: 48 83 ec 20 sub $0x20,%rsp
400565: 89 7d ec mov %edi,-0x14(%rbp)
400568: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax <- get guard variable value
40056f: 00 00
400571: 48 89 45 f8 mov %rax,-0x8(%rbp) <- save guard variable on stack
400575: 31 c0 xor %eax,%eax
400577: 8b 45 ec mov -0x14(%rbp),%eax
40057a: 48 8b 55 f8 mov -0x8(%rbp),%rdx <- move it to register
40057e: 64 48 33 14 25 28 00 xor %fs:0x28,%rdx <- check it against original
400585: 00 00
400587: 74 05 je 40058e <test+0x31>
400589: e8 b2 fe ff ff callq 400440 <__stack_chk_fail#plt>
40058e: c9 leaveq
40058f: c3 retq
1. "Strong" stack protection for GCC

calling C functions shellcode

I have the following dump taken from gdb
00000000004006f6 <win>:
4006f6: 55 push rbp
4006f7: 48 89 e5 mov rbp,rsp
4006fa: bf 98 08 40 00 mov edi,0x400898
4006ff: e8 8c fe ff ff call 400590 <system#plt>
400704: 5d pop rbp
400705: c3 ret
Usually this C function is never called however I need to write some shellcode thats less then 10 bytes to run it or get the value displayed. Here is the source of the function;
void win(){
system("/bin/cat ./flag.txt");
}
I'm still a novice at both assembly and C, so any help is appreciated.
In order to run function win() you must do write push <function-win-address> ret in shellcode.
In your case that will be:
\x68\xf6\x06\x40\xc3
\x68 is push
\xf6\x06\x40 is the function address
\xc3 is ret
mov eax, (win addr)
call eax
objdump opcodes after

Position Independent Code pointing to wrong address

I have a small example program written in NASM(2.11.08) targeting the macho64 architecture. I'm running OSX 10.10.3:
bits 64
section .data
msg1 db 'Message One', 10, 0
msg1len equ $-msg1
msg2 db 'Message Two', 10, 0
msg2len equ $-msg2
section .text
global _main
extern _printf
_main:
sub rsp, 8 ; align
lea rdi, [rel msg1]
xor rax, rax
call _printf
lea rdi, [rel msg2]
xor rax, rax
call _printf
add rsp, 8
ret
I'm compiling and linking using the following command line:
/usr/local/bin/nasm -f macho64 test2.s
ld -macosx_version_min 10.10.0 -lSystem -o test2 test2.o
When I do an object dump on the test2 executable, this is the relevant snippet(I can post more if I'm wrong!):
0000000000001fb7 <_main>:
1fb7: 48 83 ec 08 sub $0x8,%rsp
1fbb: 48 8d 3d 56 01 00 00 lea 0x156(%rip),%rdi # 2118 <msg2+0xf3>
1fc2: 48 31 c0 xor %rax,%rax
1fc5: e8 14 00 00 00 callq 1fde <_printf$stub>
1fca: 48 8d 3d 54 00 00 00 lea 0x54(%rip),%rdi # 2025 <msg2>
1fd1: 48 31 c0 xor %rax,%rax
1fd4: e8 05 00 00 00 callq 1fde <_printf$stub>
1fd9: 48 83 c4 08 add $0x8,%rsp
1fdd: c3 retq
...
0000000000002018 <msg1>:
0000000000002025 <msg2>:
And, finally, the output:
$ ./test2
Message Two
$
My question is, what happened to msg1?
I'm assuming msg1 isn't printed because 0x14f(%rip) is not the correct address (just nulls).
Why is lea edi, [rel msg2] pointing to the correct address, while lea edi, [rel msg1] is pointing past msg2, into NULLs?
It looks like the 0x14f(%rip) offset is exactly 0x100 beyond where msg1 lies in memory (this is true throughout many tests of this problem).
What am I missing here?
Edit: Whichever message (msg1 or msg2) appears last in the .data section is the only message that gets printed.
IDK about the Mach-o ABI, but if it's the same as the SystemV x86-64 ABI GNU/Linux uses, then I think your problem is that you need to clear eax to tell a varargs function like printf that there are zero FP.
Also, lea rdi, [rel msg1] would be a much better choice. As it stands, your code is only position-independent within the low 32bits of virtual address space, because you're truncating the pointers to 32bits.
It appears NASM has a bug. This same problem came up again: NASM 2 lines of db (initialized data) seemingly not working. There, the OP confirmed that the data was present, but labels were wrong, and is hopefully reporting it upstream.

Why does the Solaris assembler generate different machine code than the GNU assembler here?

I wrote this little assembly file for amd64. What the code does is not important for this question.
.globl fib
fib: mov %edi,%ecx
xor %eax,%eax
jrcxz 1f
lea 1(%rax),%ebx
0: add %rbx,%rax
xchg %rax,%rbx
loop 0b
1: ret
Then I proceeded to assemble and then disassemble this on both Solaris and Linux.
Solaris
$ as -o y.o -xarch=amd64 -V y.s
as: Sun Compiler Common 12.1 SunOS_i386 Patch 141858-04 2009/12/08
$ dis y.o
disassembly for y.o
section .text
0x0: 8b cf movl %edi,%ecx
0x2: 33 c0 xorl %eax,%eax
0x4: e3 0a jcxz +0xa <0x10>
0x6: 8d 58 01 leal 0x1(%rax),%ebx
0x9: 48 03 c3 addq %rbx,%rax
0xc: 48 93 xchgq %rbx,%rax
0xe: e2 f9 loop -0x7 <0x9>
0x10: c3 ret
Linux
$ as --64 -o y.o -V y.s
GNU assembler version 2.22.90 (x86_64-linux-gnu) using BFD version (GNU Binutils for Ubuntu) 2.22.90.20120924
$ objdump -d y.o
y.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <fib>:
0: 89 f9 mov %edi,%ecx
2: 31 c0 xor %eax,%eax
4: e3 0a jrcxz 10 <fib+0x10>
6: 8d 58 01 lea 0x1(%rax),%ebx
9: 48 01 d8 add %rbx,%rax
c: 48 93 xchg %rax,%rbx
e: e2 f9 loop 9 <fib+0x9>
10: c3 retq
How comes the generated machine code is different? Sun as generates 8b cf for mov %edi,%ecx while gas generates 89 f9 for the very same instruction. Is this because of the various ways to encode the same instruction under x86 or do these two encodings really have a particular difference?
Some x86 instructions have multiple encodings that do the same thing. In particular, any instruction that acts on two registers can have the registers swapped and the direction bit in the instruction reversed.
Which one a given assembler/compiler picks simply depends on what the tool authors chose.
You've not specified the operand size for the mov, xor and add operations. This creates some ambiguity. The GNU assembler manual, i386 Mnemonics, mentions this:
If no suffix is specified by an instruction then as tries to fill in the missing suffix based on the destination register operand (the last one by convention). [ ... ] . Note that this is incompatible with the AT&T Unix assembler which assumes that a missing mnemonic suffix implies long operand size.
This implies the GNU assembler chooses differently - it'll pick the opcode with the R/M byte specifying the target operand (because the destination size is known/implied) while the AT&T one chooses the opcode where the R/M byte specifies the source operand (because the operand size is implied).
I've done that experiment though and given explicit operand sizes in your assembly source, and it doesn't change the GNU assembler output. There is, though, the other part of the documentation above,
Different encoding options can be specified via optional mnemonic suffix. `.s' suffix swaps 2 register operands in encoding when moving from one register to another.
which one can use; the following sourcecode, with GNU as, creates me the opcodes you got from Solaris as:
.globl fib
fib: movl.s %edi,%ecx
xorl.s %eax,%eax
jrcxz 1f
leal 1(%rax),%ebx
0: addq.s %rbx,%rax
xchgq %rax,%rbx
loop 0b
1: ret

Can MSVC _penter and _pexit hooks be disabled on a per function basis?

There are compiler options in MSVC to enable the automatic generation of instrumentation calls on entering and exiting functions. These hooks are called _penter() and _pexit(). The options to the compiler are:
/Gh Enable _penter Hook Function
/GH Enable _pexit Hook Function
Is there a pragma or some sort of function declaration that will turn off the instrumentation on a per function basis? I know that using __declspec(naked) functions will not be instrumented but this isn't always a very practical option. I'm using MSVC both on PC and on a non-X86 platform and the non-X86 platform is a pain to manually write epilog/prolog in assembler (not to mention it messes up the debugger stack tracing).
If this in only on a per file (compiler option) basis, I think I will have to split out the special functions into a separate file to turn the option off but it'd be much easier if I could just control it on a per file basis.
The fallback plan if this can't be done is to just move the functions to their own CPP translation unit and compile separately without the options.
I don't see any way to do this. Given that you would have to locate and handle every affected function anyway, perhaps moving them into their own module(s) is not such a big deal.
Asker is aware, but worth writing out the disqualified approach for future reference. /Gh and /GH do not instrument naked functions. You can declare the function you want to opt-out for as naked and manually supply the standard prolog/epilog, as shown below,
void instrumented_fn(void *p)
{
/* Function body */
}
__declspec(naked) void uninstrumented_fn(void *p)
{
__asm
{
/* prolog */
push ebp
mov ebp, esp
sub esp, __LOCAL_SIZE
}
/* Function body */
__asm
{
/* epilog */
mov esp, ebp
pop ebp
ret
}
}
An example instrumented function disassembly, showing calls to penter and pexit,
537b0: e8 7c d9 ff ff call 0x51131
537b5: 55 push %ebp
537b6: 8b ec mov %esp,%ebp
537b8: 83 ec 40 sub $0x40,%esp
537bb: 53 push %ebx
537bc: 56 push %esi
537bd: 57 push %edi
537be: 90 nop
537bf: 90 nop
537c0: 90 nop
537c1: 5f pop %edi
537c2: 5e pop %esi
537c3: 5b pop %ebx
537c4: 8b e5 mov %ebp,%esp
537c6: 5d pop %ebp
537c7: e8 01 d9 ff ff call 0x510cd
537cc: c3 ret
The equivalent uninstrumented function disassembly (naked body plus standard prolog/epilog)
51730: 55 push %ebp
51731: 8b ec mov %esp,%ebp
51733: 83 ec 40 sub $0x40,%esp
51736: 90 nop
51737: 90 nop
51738: 90 nop
51739: 8b e5 mov %ebp,%esp
5173b: 5d pop %ebp
5173c: c3 ret

Resources