Fails to compile [duplicate] - linux

I have a function foo written in assembly and compiled with yasm and GCC on Linux (Ubuntu) 64-bit. It simply prints a message to stdout using puts(), here is how it looks:
bits 64
extern puts
global foo
section .data
message:
db 'foo() called', 0
section .text
foo:
push rbp
mov rbp, rsp
lea rdi, [rel message]
call puts
pop rbp
ret
It is called by a C program compiled with GCC:
extern void foo();
int main() {
foo();
return 0;
}
Build commands:
yasm -f elf64 foo_64_unix.asm
gcc -c foo_main.c -o foo_main.o
gcc foo_64_unix.o foo_main.o -o foo
./foo
Here is the problem:
When running the program it prints an error message and immediately segfaults during the call to puts:
./foo: Symbol `puts' causes overflow in R_X86_64_PC32 relocation
Segmentation fault
After disassembling with objdump I see that the call is made with the wrong address:
0000000000000660 <foo>:
660: 90 nop
661: 55 push %rbp
662: 48 89 e5 mov %rsp,%rbp
665: 48 8d 3d a4 09 20 00 lea 0x2009a4(%rip),%rdi
66c: e8 00 00 00 00 callq 671 <foo+0x11> <-- here
671: 5d pop %rbp
672: c3 retq
(671 is the address of the next instruction, not address of puts)
However, if I rewrite the same code in C the call is done differently:
645: e8 c6 fe ff ff callq 510 <puts#plt>
i.e. it references puts from the PLT.
Is it possible to tell yasm to generate similar code?

TL:DR: 3 options:
Build a non-PIE executable (gcc -no-pie -fno-pie call-lib.c libcall.o) so the linker will generate a PLT entry for you transparently when you write call puts.
call puts wrt ..plt like gcc -fPIE would do.
call [rel puts wrt ..got] like gcc -fno-plt would do.
The latter two will work in PIE executables or shared libraries. The 3rd way, wrt ..got, is slightly more efficient.
Your gcc is building PIE executables by default (32-bit absolute addresses no longer allowed in x86-64 Linux?).
I'm not sure why, but when doing so the linker doesn't automatically resolve call puts to call puts#plt. There is still a puts PLT entry generated, but the call doesn't go there.
At runtime, the dynamic linker tries to resolve puts directly to the libc symbol of that name and fixup the call rel32. But the symbol is more than +-2^31 away, so we get a warning about overflow of the R_X86_64_PC32 relocation. The low 32 bits of the target address are correct, but the upper bits aren't. (Thus your call jumps to a bad address).
Your code works for me if I build with gcc -no-pie -fno-pie call-lib.c libcall.o. The -no-pie is the critical part: it's the linker option. Your YASM command doesn't have to change.
When making a traditional position-dependent executable, the linker turns the puts symbol for the call target into puts#plt for you, because we're linking a dynamic executable (instead of statically linking libc with gcc -static -fno-pie, in which case the call could go directly to the libc function.)
Anyway, this is why gcc emits call puts#plt (GAS syntax) when compiling with -fpie (the default on your desktop, but not the default on https://godbolt.org/), but just call puts when compiling with -fno-pie.
See What does #plt mean here? for more about the PLT, and also Sorry state of dynamic libraries on Linux from a few years ago. (The modern gcc -fno-plt is like one of the ideas in that blog post.)
BTW, a more accurate/specific prototype would let gcc avoid zeroing EAX before calling foo:
extern void foo(); in C means extern void foo(...);
You could declare it as extern void foo(void);, which is what () means in C++. C++ doesn't allow function declarations that leave the args unspecified.
asm improvements
You can also put message in section .rodata (read-only data, linked as part of the text segment).
You don't need a stack frame, just something to align the stack by 16 before a call. A dummy push rax will do it.
Or we can tail-call puts by jumping to it instead of calling it, with the same stack position as on entry to this function. This works with or without PIE. Just replace call with jmp, as long as RSP is pointing at your own return address.
If you want to make PIE executables (or shared libraries), you have two options
call puts wrt ..plt - explicitly call through the PLT.
call [rel puts wrt ..got] - explicitly do an indirect call through the GOT entry, like gcc's -fno-plt style of code-gen. (Using a RIP-relative addressing mode to reach the GOT, hence the rel keyword).
WRT = With Respect To. The NASM manual documents wrt ..plt, and see also section 7.9.3: special symbols and WRT.
Normally you would use default rel at the top of your file so you can actually use call [puts wrt ..got] and still get a RIP-relative addressing mode. You can't use a 32-bit absolute addressing mode in PIE or PIC code.
call [puts wrt ..got] assembles to a memory-indirect call using the function pointer that dynamic linking stored in the GOT. (Early-binding, not lazy dynamic linking.)
NASM documents ..got for getting the address of variables in section 9.2.3. Functions in (other) libraries are identical: you get a pointer from the GOT instead of calling directly, because the offset isn't a link-time constant and might not fit in 32-bits.
YASM also accepts call [puts wrt ..GOTPCREL], like AT&T syntax call *puts#GOTPCREL(%rip), but NASM does not.
; don't use BITS 64. You *want* an error if you try to assemble this into a 32-bit .o
default rel ; RIP-relative addressing instead of 32-bit absolute by default; makes the [rel ...] optional
section .rodata ; .rodata is best for constants, not .data
message:
db 'foo() called', 0
section .text
global foo
foo:
sub rsp, 8 ; align the stack by 16
; PIE with PLT
lea rdi, [rel message] ; needed for PIE
call puts WRT ..plt ; tailcall puts
;or
; PIE with -fno-plt style code, skips the PLT indirection
lea rdi, [rel message]
call [rel puts wrt ..got]
;or
; non-PIE
mov edi, message ; more efficient, but only works in non-PIE / non-PIC
call puts ; linker will rewrite it into call puts#plt
add rsp,8 ; restore the stack, undoing the add
ret
In a position-dependent Linux executable, you can use mov edi, message instead of a RIP-relative LEA. It's smaller code-size and can run on more execution ports on most CPUs. (Fun fact: MacOS always puts the "image base" outside the low 4GiB so this optimization isn't possible there.)
In a non-PIE executable, you also might as well use call puts or jmp puts and let the linker sort it out, unless you want more efficient no-plt style dynamic linking. But if you do choose to statically link libc, I think this is the only way you'll get a direct jmp to the libc function.
(I think the possibility of static linking for non-PIE is why ld is willing to generate PLT stubs automatically for non-PIE, but not for PIE or shared libraries. It requires you to say what you mean when linking ELF shared objects.)
If you did use call puts in a PIE (call rel32), it could only work if you statically linked a position-independent implementation of puts into your PIE, so the entire thing was one executable that would get loaded at a random address at runtime (by the usual dynamic-linker mechanism), but simply didn't have a dependency on libc.so.6
Linker "relaxing" calls when the target is present at static-link time
GAS call *bar#GOTPCREL(%rip) uses R_X86_64_GOTPCRELX (relaxable)
NASM call [rel bar wrt ..got] uses R_X86_64_GOTPCREL (not relaxable)
This is less of a problem with hand-written asm; you can just use call bar when you know the symbol will be present in another .o (rather than .so) that you're going to link. But C compilers don't know the difference between library functions and other user functions you declare with prototypes (unless you use stuff like gcc -fvisibility=hidden https://gcc.gnu.org/wiki/Visibility or attributes / pragmas).
Still, you might want to write asm source that the linker can optimize if you statically link a library, but AFAIK you can't do that with NASM. You can export a symbol as hidden (visible at static-link time, but not for dynamic linking in the final .so) with global bar:function hidden, but that's in the source file defining the function, not files accessing it.
global bar
bar:
mov eax,231
syscall
call bar wrt ..plt
call [rel bar wrt ..got]
extern bar
The 2nd file, after assembling with nasm -felf64 and disassembling with objdump -drwc -Mintel to see the relocations:
0000000000000000 <.text>:
0: e8 00 00 00 00 call 0x5 1: R_X86_64_PLT32 bar-0x4
5: ff 15 00 00 00 00 call QWORD PTR [rip+0x0] # 0xb 7: R_X86_64_GOTPCREL bar-0x4
After linking with ld (GNU Binutils) 2.35.1 - ld bar.o bar2.o -o bar
0000000000401000 <_start>:
401000: e8 0b 00 00 00 call 401010 <bar>
401005: ff 15 ed 1f 00 00 call QWORD PTR [rip+0x1fed] # 402ff8 <.got>
40100b: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
0000000000401010 <bar>:
401010: b8 e7 00 00 00 mov eax,0xe7
401015: 0f 05 syscall
Note that the PLT form got relaxed to just a direct call bar, PLT eliminated. But the ff 15 call [rel mem] was not relaxed to an e8 rel32
With GAS:
_start:
call bar#plt
call *bar#GOTPCREL(%rip)
gcc -c foo.s && disas foo.o
0000000000000000 <_start>:
0: e8 00 00 00 00 call 5 <_start+0x5> 1: R_X86_64_PLT32 bar-0x4
5: ff 15 00 00 00 00 call QWORD PTR [rip+0x0] # b <_start+0xb> 7: R_X86_64_GOTPCRELX bar-0x4
Note the X at the end of R_X86_64_GOTPCRELX.
ld bar2.o foo.o -o bar && disas bar:
0000000000401000 <bar>:
401000: b8 e7 00 00 00 mov eax,0xe7
401005: 0f 05 syscall
0000000000401007 <_start>:
401007: e8 f4 ff ff ff call 401000 <bar>
40100c: 67 e8 ee ff ff ff addr32 call 401000 <bar>
Both calls got relaxed to a direct e8 call rel32 straight to the target address. The extra byte in indirect call is filled with a 67 address-size prefix (which has no effect on call rel32), padding the instruction to the same length. (Because it's too late to re-assemble and re-compute all relative branches within functions, and alignment and so on.)
That would happen for call *puts#GOTPCREL(%rip) if you statically linked libc, with gcc -static.

The 0xe8 opcode is followed by a signed offset to be applied to the PC (which has advanced to the next instruction by that time) to compute the branch target. Hence objdump is interpreting the branch target as 0x671.
YASM is rendering zeros because it has likely put a relocation on that offset, which is how it asks the loader to populate the correct offset for puts during loading. The loader is encountering an overflow when computing the relocation, which may indicate that puts is at a further offset from your call than can be represented in a 32-bit signed offset. Hence the loader fails to fix this instruction, and you get a crash.
66c: e8 00 00 00 00 shows the unpopulated address. If you look in your relocation table, you should see a relocation on 0x66d. It is not uncommon for the assembler to populate addresses/offsets with relocations as all zeros.
This page suggests that YASM has a WRT directive that can control use of .got, .plt, etc.
Per S9.2.5 on the NASM documentation, it looks like you can use CALL puts WRT ..plt (presuming YASM has the same syntax).

Related

Why gcc generates a PLT when it is apparently not needed?

Consider this code:
int foo();
int main() {
foo();
while(1){}
}
int foo() is implemented in a shared object.
Compiling this code with gcc -o main main.c -lfoo -nostdlib -m32 -O2 -e main --no-pic -L./shared gives the following diasm:
$ objdump -d ./main
./main: file format elf32-i386
Disassembly of section .plt:
00000240 <.plt>:
240: ff b3 04 00 00 00 pushl 0x4(%ebx)
246: ff a3 08 00 00 00 jmp *0x8(%ebx)
24c: 00 00 add %al,(%eax)
...
00000250 <foo#plt>:
250: ff a3 0c 00 00 00 jmp *0xc(%ebx)
256: 68 00 00 00 00 push $0x0
25b: e9 e0 ff ff ff jmp 240 <.plt>
Disassembly of section .text:
00000260 <main>:
260: 8d 4c 24 04 lea 0x4(%esp),%ecx
264: 83 e4 f0 and $0xfffffff0,%esp
267: ff 71 fc pushl -0x4(%ecx)
26a: 55 push %ebp
26b: 89 e5 mov %esp,%ebp
26d: 51 push %ecx
26e: 83 ec 04 sub $0x4,%esp
271: e8 fc ff ff ff call 272 <main+0x12>
276: eb fe jmp 276 <main+0x16>
With the following relocations:
$ objdump -R ./main
./main: file format elf32-i386
DYNAMIC RELOCATION RECORDS
OFFSET TYPE VALUE
00000272 R_386_PC32 foo
00001ffc R_386_JUMP_SLOT foo
Note that:
The code was compiled with --no-pic, so it is not PIC
The call to foo(), in the .text section (main function), is not going through the PLT. Instead, it is just a simple R_386_PC32 relocation that I assume it will directly be relocated to the address of the foo function at load time. It makes sense to me since the code is not PIC, so there is no need to add an extra indirection through the PLT.
Even without being used, the PLT is still being generated. An entry for foo is present there and we even have a R_386_JUMP_SLOT relocation to set up the foo entry in the GOT at load time (which the PLT points to).
My question is simple: I don't see the PLT being used anywhere in the code and I also don't see it being necessary here, so why does gcc creates it?
--no-pic isn't like -no-pie, it seems to be a synonym for -fno-pic or -fno-pie affecting code-gen but not linking. Assuming your distro's GCC defaults to making a PIE, you are making a PIE so there's no conversion of the call to foo#plt.
I get a linker warning /tmp/ccyRsNtd.o: warning: relocation against 'getpid##GLIBC_2.0' in read-only section '.text.startup' / warning: creating DT_TEXTREL in a PIE. (But the executable does run, unlike if it were 64-bit where call rel32 isn't relocatable to the whole address space.)
And yeah, there is an unused PLT entry built by ld for some reason, but the way you're linking is totally nonstandard.
The normal reason for building a PLT is:
ld when linking a non-PIE will convert call foo into call foo#plt instead of including text relocations at every callsite that would require runtime fixups every time the program loads.
Use -fno-plt to get more efficient asm, especially for 64-bit mode where even PIE code can efficiently reference the GOT directly.
To make a simpler example, I used a function in libc (getpid) instead of a custom library. Compiling normally with gcc -fno-pie -no-pie -m32 -O2 foo.c, I get 5-byte e8 d5 ff ff ff call rel32: call 8049040 <getpid#plt>.
But adding -fno-plt to that, I get 6-byte ff 15 f4 bf 04 08 call [disp32] - call DWORD PTR ds:0x804bff4. No PLT involved, just the GOT entry referenced with an absolute address.
No runtime relocation needed; this page of the .text section can stay "clean" as a file-backed private mapping of the executable. (Runtime relocation would dirty it, making it backed only by swap space if the kernel wanted to evict that page.)
Also, it uses a "normal" GOT entry which needs early binding. This does work even with -nostdlib -lc and the ill-advised -e main instead of calling it _start like a normal person. Since it's a dynamically linked executable, the dynamic linker does run before your entry point and set up the GOT.

Is ASM just a MACRO for ML, does it have standarized directives? What about GAS?

I took some courses were MIPS and x86 assembly were taught.
For MIPS we used the MARS simulator.
For x86 we wrote code to templated .S files which they basically ignored the details of, so we basically wrote the body of subroutines...
Both shared some keywords:
.data
.text
Other we did not use in MARS:
.globl
.section
Now I need to learn by my self ARM and I don't know really how to start coding.
What I understand is that in MARS we wrote code that was compiled to ML and sent direcly to the simulated CPU's memmory as if it were a Microcontroller (which by the way I did with a Microchip PIC). With x86 however we run through a host OS, which is Linux, and needed the compiler to add "header" and assign VM to the program thus not trivial.
Probably the the fact is that they taught us ASM more like a MACRO to ML and not as a language that had directives and features, which are low level, that needs to be compiled.
Googling I found GAS which might be what we did with x86...
So the questions are:
Is GAS what is used to write ASM in *NIX systems?
Are those above mentioned keyword inherent from ASM or are they from GAS?
You don't need to make it that complicated
int main ( void )
{
return(555);
}
gcc -O2 -c so.c -o so.o
objdump -D so.o
0000000000000000 <main>:
0: b8 2b 02 00 00 mov $0x22b,%eax
5: c3 retq
Or you could look at the assembly generated, I prefer to disassemble, so now I can
.globl main
main:
mov $0x22b,%eax
retq
as so.s -o so.o
gcc so.o -o so
./so
And of course nothing comes out but
so.s
.globl fun
fun:
mov $0x22b,%eax
retq
so.c
#include <stdio.h>
int fun ( void );
int main ( void )
{
printf("%u\n",fun());
return(0);
}
as so.s -o fun.o
gcc so.c fun.o -o so
./so
555
And of course you can then complicate it as much as you like beyond that.
gcc outputs gnu assembler so
int fun ( void )
{
return(333);
}
gcc -O2 -save-temps -c so.c -o so.o
cat so.s
.file "so.c"
.section .text.unlikely,"ax",#progbits
.LCOLDB0:
.text
.LHOTB0:
.p2align 4,,15
.globl fun
.type fun, #function
fun:
.LFB0:
.cfi_startproc
movl $333, %eax
ret
.cfi_endproc
.LFE0:
.size fun, .-fun
.section .text.unlikely
.LCOLDE0:
.text
.LHOTE0:
.ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609"
.section .note.GNU-stack,"",#progbits
And although they often have an excess of directives (useful for debuggers and other things but not always used/required), you can use this fact to help to some extent to learn gnu assembler for this target (x86-64), but you of course need the documentation from the processor vendor (Intel in this case). Understanding that the syntax in that document is not necessarily the syntax used by any particular toolchain that you have or will use, you have to be multi-lingual there but you see what the instructions are and what they do and their limits, etc.
MARS and other similar environments are quite useful for teaching and are often designed for that reason leaving out a lot of the traps that you can fall into. The goal being to learn the instruction set by playing with a simulator and get your feet wet in assembly language. I am not a fan of an assembly interface, for educational purposes I think the student should generate/see the machine code, and perhaps within that sim you can, I have only used it for SO questions, I use real or simulated MIPS processors if I want to play with MIPS.
Assembly language is specific to the tool not the target, assume that each assembler for any target has its own assembly language and if there happens to be overlap then so be it.
global fun
fun:
mov eax, 333
ret
nasm so.s -felf64 -o so.o
gcc so.c so.o -o so
./so
333
There is the well known Intel vs AT&T thing but those are not syntaxes those are source destination swapping from the Intel standard. nasm doesn't like .globl, try it it likes global without the dot.
.globl fun
fun:
movl %eax, $333
ret
so.s:1: error: attempt to define a local label before any non-local labels
so.s:1: error: parser: instruction expected
so.s:3: error: parser: instruction expected
globl fun
fun:
movl %eax, $333
ret
nasm so.s -felf64 -o so.o
so.s:1: error: parser: instruction expected
so.s:3: error: parser: instruction expected
globl fun <-- note this is line 1
fun:
mov %eax, $333 <--- this is line 3
ret
nasm so.s -felf64 -o so.o
so.s:1: error: parser: instruction expected
so.s:3: error: expression syntax error
globl fun
fun:
mov eax, 333
ret
nasm so.s -felf64 -o so.o
so.s:1: error: parser: instruction expected
global fun
fun:
mov eax, 333
ret
And nasm is happy
as so.s -o so.o
so.s: Assembler messages:
so.s:1: Error: no such instruction: `global fun'
so.s:3: Error: too many memory references for `mov'
.global fun
fun:
mov 333, eax
ret
so.s: Assembler messages:
so.s:3: Error: too many memory references for `mov'
.global fun
fun:
mov $333, eax
ret
so.s: Assembler messages:
so.s:3: Error: no instruction mnemonic suffix given and no register operands; can't size instruction
.global fun
fun:
movl $333, eax
ret
and as is happy BUT, this is broken it thinks eax is a label to be filled in later
0000000000000000 <fun>:
0: c7 04 25 00 00 00 00 movl $0x14d,0x0
7: 4d 01 00 00
b: c3 retq
.global fun
fun:
movl $333, %eax
ret
0000000000000000 <fun>:
0: b8 4d 01 00 00 mov $0x14d,%eax
5: c3 retq
.global fun
fun:
movl $333, %eax
retq
0000000000000000 <fun>:
0: b8 4d 01 00 00 mov $0x14d,%eax
5: c3 retq
.global fun
fun:
mov $333, %eax
retq
0000000000000000 <fun>:
0: b8 4d 01 00 00 mov $0x14d,%eax
5: c3 retq
nasm:
global fun
fun:
mov eax, 333
ret
0000000000000000 <fun>:
0: b8 4d 01 00 00 mov $0x14d,%eax
5: c3 retq
Same machine code, different assembly language in more ways than just reversing the source and destination (I used objdump to disassemble so that is why you see that syntax).
gas takes .globl or .global. Since the size of the mov is obvious due to the eax register which is 32 bits the suffix isn't needed movl or mov apparently work with the binutils I have. Likewise ret vs retq produced the same instruction.
The joys of assembly language especially with a painful target like x86 (the last instruction set you want to learn there is a list of more useful/better ones).
But you can see that assembly language can/does differ for the same target the same instructions based on the tool used. And something like MARS starts to make even more sense for that use case.
You won't go wrong learning the gcc/binutils (gnu) tools as you can use them on Windows, Mac, Linux, BSD, etc and all but the system calls and possibly binary file formats are going to be the same experience (okay linker scripts, OS specific stuff will differ).
Depending on the target there may be other good choices too. nasm is popular for the folks that learned Intel syntax from the old days and I suppose others, as well as code that may have been laying about for a while that gas pukes on you might have half a chance with nasm.
And one or the other or both have command line options for the Intel vs ATT source/destination swapping.

How to use ld to make a dynamic library? [duplicate]

I have a function foo written in assembly and compiled with yasm and GCC on Linux (Ubuntu) 64-bit. It simply prints a message to stdout using puts(), here is how it looks:
bits 64
extern puts
global foo
section .data
message:
db 'foo() called', 0
section .text
foo:
push rbp
mov rbp, rsp
lea rdi, [rel message]
call puts
pop rbp
ret
It is called by a C program compiled with GCC:
extern void foo();
int main() {
foo();
return 0;
}
Build commands:
yasm -f elf64 foo_64_unix.asm
gcc -c foo_main.c -o foo_main.o
gcc foo_64_unix.o foo_main.o -o foo
./foo
Here is the problem:
When running the program it prints an error message and immediately segfaults during the call to puts:
./foo: Symbol `puts' causes overflow in R_X86_64_PC32 relocation
Segmentation fault
After disassembling with objdump I see that the call is made with the wrong address:
0000000000000660 <foo>:
660: 90 nop
661: 55 push %rbp
662: 48 89 e5 mov %rsp,%rbp
665: 48 8d 3d a4 09 20 00 lea 0x2009a4(%rip),%rdi
66c: e8 00 00 00 00 callq 671 <foo+0x11> <-- here
671: 5d pop %rbp
672: c3 retq
(671 is the address of the next instruction, not address of puts)
However, if I rewrite the same code in C the call is done differently:
645: e8 c6 fe ff ff callq 510 <puts#plt>
i.e. it references puts from the PLT.
Is it possible to tell yasm to generate similar code?
TL:DR: 3 options:
Build a non-PIE executable (gcc -no-pie -fno-pie call-lib.c libcall.o) so the linker will generate a PLT entry for you transparently when you write call puts.
call puts wrt ..plt like gcc -fPIE would do.
call [rel puts wrt ..got] like gcc -fno-plt would do.
The latter two will work in PIE executables or shared libraries. The 3rd way, wrt ..got, is slightly more efficient.
Your gcc is building PIE executables by default (32-bit absolute addresses no longer allowed in x86-64 Linux?).
I'm not sure why, but when doing so the linker doesn't automatically resolve call puts to call puts#plt. There is still a puts PLT entry generated, but the call doesn't go there.
At runtime, the dynamic linker tries to resolve puts directly to the libc symbol of that name and fixup the call rel32. But the symbol is more than +-2^31 away, so we get a warning about overflow of the R_X86_64_PC32 relocation. The low 32 bits of the target address are correct, but the upper bits aren't. (Thus your call jumps to a bad address).
Your code works for me if I build with gcc -no-pie -fno-pie call-lib.c libcall.o. The -no-pie is the critical part: it's the linker option. Your YASM command doesn't have to change.
When making a traditional position-dependent executable, the linker turns the puts symbol for the call target into puts#plt for you, because we're linking a dynamic executable (instead of statically linking libc with gcc -static -fno-pie, in which case the call could go directly to the libc function.)
Anyway, this is why gcc emits call puts#plt (GAS syntax) when compiling with -fpie (the default on your desktop, but not the default on https://godbolt.org/), but just call puts when compiling with -fno-pie.
See What does #plt mean here? for more about the PLT, and also Sorry state of dynamic libraries on Linux from a few years ago. (The modern gcc -fno-plt is like one of the ideas in that blog post.)
BTW, a more accurate/specific prototype would let gcc avoid zeroing EAX before calling foo:
extern void foo(); in C means extern void foo(...);
You could declare it as extern void foo(void);, which is what () means in C++. C++ doesn't allow function declarations that leave the args unspecified.
asm improvements
You can also put message in section .rodata (read-only data, linked as part of the text segment).
You don't need a stack frame, just something to align the stack by 16 before a call. A dummy push rax will do it.
Or we can tail-call puts by jumping to it instead of calling it, with the same stack position as on entry to this function. This works with or without PIE. Just replace call with jmp, as long as RSP is pointing at your own return address.
If you want to make PIE executables (or shared libraries), you have two options
call puts wrt ..plt - explicitly call through the PLT.
call [rel puts wrt ..got] - explicitly do an indirect call through the GOT entry, like gcc's -fno-plt style of code-gen. (Using a RIP-relative addressing mode to reach the GOT, hence the rel keyword).
WRT = With Respect To. The NASM manual documents wrt ..plt, and see also section 7.9.3: special symbols and WRT.
Normally you would use default rel at the top of your file so you can actually use call [puts wrt ..got] and still get a RIP-relative addressing mode. You can't use a 32-bit absolute addressing mode in PIE or PIC code.
call [puts wrt ..got] assembles to a memory-indirect call using the function pointer that dynamic linking stored in the GOT. (Early-binding, not lazy dynamic linking.)
NASM documents ..got for getting the address of variables in section 9.2.3. Functions in (other) libraries are identical: you get a pointer from the GOT instead of calling directly, because the offset isn't a link-time constant and might not fit in 32-bits.
YASM also accepts call [puts wrt ..GOTPCREL], like AT&T syntax call *puts#GOTPCREL(%rip), but NASM does not.
; don't use BITS 64. You *want* an error if you try to assemble this into a 32-bit .o
default rel ; RIP-relative addressing instead of 32-bit absolute by default; makes the [rel ...] optional
section .rodata ; .rodata is best for constants, not .data
message:
db 'foo() called', 0
section .text
global foo
foo:
sub rsp, 8 ; align the stack by 16
; PIE with PLT
lea rdi, [rel message] ; needed for PIE
call puts WRT ..plt ; tailcall puts
;or
; PIE with -fno-plt style code, skips the PLT indirection
lea rdi, [rel message]
call [rel puts wrt ..got]
;or
; non-PIE
mov edi, message ; more efficient, but only works in non-PIE / non-PIC
call puts ; linker will rewrite it into call puts#plt
add rsp,8 ; restore the stack, undoing the add
ret
In a position-dependent Linux executable, you can use mov edi, message instead of a RIP-relative LEA. It's smaller code-size and can run on more execution ports on most CPUs. (Fun fact: MacOS always puts the "image base" outside the low 4GiB so this optimization isn't possible there.)
In a non-PIE executable, you also might as well use call puts or jmp puts and let the linker sort it out, unless you want more efficient no-plt style dynamic linking. But if you do choose to statically link libc, I think this is the only way you'll get a direct jmp to the libc function.
(I think the possibility of static linking for non-PIE is why ld is willing to generate PLT stubs automatically for non-PIE, but not for PIE or shared libraries. It requires you to say what you mean when linking ELF shared objects.)
If you did use call puts in a PIE (call rel32), it could only work if you statically linked a position-independent implementation of puts into your PIE, so the entire thing was one executable that would get loaded at a random address at runtime (by the usual dynamic-linker mechanism), but simply didn't have a dependency on libc.so.6
Linker "relaxing" calls when the target is present at static-link time
GAS call *bar#GOTPCREL(%rip) uses R_X86_64_GOTPCRELX (relaxable)
NASM call [rel bar wrt ..got] uses R_X86_64_GOTPCREL (not relaxable)
This is less of a problem with hand-written asm; you can just use call bar when you know the symbol will be present in another .o (rather than .so) that you're going to link. But C compilers don't know the difference between library functions and other user functions you declare with prototypes (unless you use stuff like gcc -fvisibility=hidden https://gcc.gnu.org/wiki/Visibility or attributes / pragmas).
Still, you might want to write asm source that the linker can optimize if you statically link a library, but AFAIK you can't do that with NASM. You can export a symbol as hidden (visible at static-link time, but not for dynamic linking in the final .so) with global bar:function hidden, but that's in the source file defining the function, not files accessing it.
global bar
bar:
mov eax,231
syscall
call bar wrt ..plt
call [rel bar wrt ..got]
extern bar
The 2nd file, after assembling with nasm -felf64 and disassembling with objdump -drwc -Mintel to see the relocations:
0000000000000000 <.text>:
0: e8 00 00 00 00 call 0x5 1: R_X86_64_PLT32 bar-0x4
5: ff 15 00 00 00 00 call QWORD PTR [rip+0x0] # 0xb 7: R_X86_64_GOTPCREL bar-0x4
After linking with ld (GNU Binutils) 2.35.1 - ld bar.o bar2.o -o bar
0000000000401000 <_start>:
401000: e8 0b 00 00 00 call 401010 <bar>
401005: ff 15 ed 1f 00 00 call QWORD PTR [rip+0x1fed] # 402ff8 <.got>
40100b: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
0000000000401010 <bar>:
401010: b8 e7 00 00 00 mov eax,0xe7
401015: 0f 05 syscall
Note that the PLT form got relaxed to just a direct call bar, PLT eliminated. But the ff 15 call [rel mem] was not relaxed to an e8 rel32
With GAS:
_start:
call bar#plt
call *bar#GOTPCREL(%rip)
gcc -c foo.s && disas foo.o
0000000000000000 <_start>:
0: e8 00 00 00 00 call 5 <_start+0x5> 1: R_X86_64_PLT32 bar-0x4
5: ff 15 00 00 00 00 call QWORD PTR [rip+0x0] # b <_start+0xb> 7: R_X86_64_GOTPCRELX bar-0x4
Note the X at the end of R_X86_64_GOTPCRELX.
ld bar2.o foo.o -o bar && disas bar:
0000000000401000 <bar>:
401000: b8 e7 00 00 00 mov eax,0xe7
401005: 0f 05 syscall
0000000000401007 <_start>:
401007: e8 f4 ff ff ff call 401000 <bar>
40100c: 67 e8 ee ff ff ff addr32 call 401000 <bar>
Both calls got relaxed to a direct e8 call rel32 straight to the target address. The extra byte in indirect call is filled with a 67 address-size prefix (which has no effect on call rel32), padding the instruction to the same length. (Because it's too late to re-assemble and re-compute all relative branches within functions, and alignment and so on.)
That would happen for call *puts#GOTPCREL(%rip) if you statically linked libc, with gcc -static.
The 0xe8 opcode is followed by a signed offset to be applied to the PC (which has advanced to the next instruction by that time) to compute the branch target. Hence objdump is interpreting the branch target as 0x671.
YASM is rendering zeros because it has likely put a relocation on that offset, which is how it asks the loader to populate the correct offset for puts during loading. The loader is encountering an overflow when computing the relocation, which may indicate that puts is at a further offset from your call than can be represented in a 32-bit signed offset. Hence the loader fails to fix this instruction, and you get a crash.
66c: e8 00 00 00 00 shows the unpopulated address. If you look in your relocation table, you should see a relocation on 0x66d. It is not uncommon for the assembler to populate addresses/offsets with relocations as all zeros.
This page suggests that YASM has a WRT directive that can control use of .got, .plt, etc.
Per S9.2.5 on the NASM documentation, it looks like you can use CALL puts WRT ..plt (presuming YASM has the same syntax).

How to access a C global variable through GOT in GAS assembly on x86-64 Linux?

My problem
I am trying to write a shared library(not an executable, so please do not tell me to use -no-pie) with assembly and C in separate files(not inline assembly).
And I would like to access a C global variable through Global Offset Table in assembly code, because the function called might be defined in any other shared libraries.
I know the PLT/GOT stuff but I do not know for sure how to tell the compiler to correctly generate relocation information for the linker(what is the syntax), and how to tell the linker to actually relocate my code with that information(what is the linker options).
My code compiles with a linking error
/bin/ld: tracer.o: relocation R_X86_64_PC32 against
/bin/ld: final link failed: bad value
Furthermore, it would be better if someone could share some detailed documentation on the GAS assembly about relocation. For example, an exhaustive list on how to interpolate between C and assembly with GNU assembler.
Source Code
Compile the C and assembly code and link the into ONE shared library.
# Makefile
liba.so: tracer2.S target2.c
gcc -shared -g -o liba.so tracer2.S target2.c
// target2.c
// NOTE: This is a variable, not a function.
int (*read_original)(int fd, void *data, unsigned long size) = 0;
// tracer2.S
.text
// external symbol declarition
.global read_original
read:
lea read_original(%rip), %rax
mov (%rax), %rax
jmp *%rax
Expectation and Result
I expect the linker to happily link my object files but it says
g++ -shared -g -o liba.so tracer2.o target2.c -ldl
/bin/ld: tracer.o: relocation R_X86_64_PC32 against
/bin/ld: final link failed: bad value
collect2: error: ld returned 1 exit status
make: *** [Makefile:2: liba.so] Error 1
and commenting out the line
// lea read_original(%rip), %rax
makes the error disappear.
Solution.
lea read_original#GOTPCREL(%rip), %rax
The keyword GOTPCREL will tell the compiler this is a PC-relative relocation to GOT table. The linker will calculate the offset from current rip to the target GOT table entry.
You can verify with
$ objdump -d liba.so
10e9: 48 8d 05 f8 2e 00 00 lea 0x2ef8(%rip),%rax # 3fe8 <read_original##Base-0x40>
10f0: 48 8b 00 mov (%rax),%rax
10f3: ff e0 jmpq *%rax
Thanks to Peter.
Some information that might be related or not
1. I can call a C function with
call read#plt
objdump shows it calls into the correct PLT entry.
$ objdump -d liba.so
...
0000000000001109 <read1>:
1109: e8 22 ff ff ff callq 1030 <read#plt>
110e: ff e0 jmpq *%rax
2. I can lea a PLT entry address correctly
0xffffff23 is -0xdd, 0x1109 - 0xdd = 102c
0000000000001020 <.plt>:
1020: ff 35 e2 2f 00 00 pushq 0x2fe2(%rip) # 4008 <_GLOBAL_OFFSET_TABLE_+0x8>
1026: ff 25 e4 2f 00 00 jmpq *0x2fe4(%rip) # 4010 <_GLOBAL_OFFSET_TABLE_+0x10>
102c: 0f 1f 40 00 nopl 0x0(%rax)
0000000000001030 <read#plt>:
1030: ff 25 e2 2f 00 00 jmpq *0x2fe2(%rip) # 4018 <read#GLIBC_2.2.5>
1036: 68 00 00 00 00 pushq $0x0
103b: e9 e0 ff ff ff jmpq 1020 <.plt>
0000000000001109 <read1>:
1109: 48 8d 04 25 23 ff ff lea 0xffffffffffffff23,%rax
1110: ff
1111: ff e0 jmpq *%rax
Environment
Arch Linux 20190809
$ uname -a
Linux alex-arch 5.2.6-arch1-1-ARCH #1 SMP PREEMPT Sun Aug 4 14:58:49 UTC 2019 x86_64 GNU/Linux
$ gcc -v
Using built-in specs.
COLLECT_GCC=/bin/gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --enable-multilib --disable-werror --enable-checking=release --enable-default-pie --enable-default-ssp --enable-cet=auto
Thread model: posix
gcc version 9.1.0 (GCC)
$ ld --version
GNU ld (GNU Binutils) 2.32
Copyright (C) 2019 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or (at your option) a later version.
This program has absolutely no warranty.
Apparently the linker enforces global vs. hidden visibility for symbols in ELF shared objects, not allowing "back door" access to symbols that participate in symbol-interposition (and thus can potentially be more than 2GB away.)
To access it directly from other code in the same shared object with normal RIP-relative addressing, make the symbol hidden by setting its ELF visibility as such. (See also https://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/ and Ulrich Drepper's How to Write Shared Libraries)
__attribute__ ((visibility("hidden")))
int (*read_original)(int fd, void *data, unsigned long size) = 0;
Then gcc -save-temps tracer2.S target2.c -shared -fPIC compiles/assembles + links a shared library. GCC also has options like -fvisibility=hidden that makes that the default, requiring explicit attributes on symbols you do want to export for dynamic linking. That's a very good idea if you have any globals that you use inside your library, to get the compiler to emit efficient code for using them. It also protects you from global name-clashes with other libraries. The GCC manuals strongly recommends it.
It also works with g++; C++ name mangling only applies to function names, not variables (including function-pointers). But generally don't compile .c files with a C++ compiler.
If you do want to support symbol interposition, you need to use the GOT; obviously you can just look at how the compiler does it:
int glob; // with default visibility = default
int foo() { return glob; }
compiles to this asm with GCC -O3 -fPIC (without any visibility options, so global symbols are fully globally visible: exported from shared objects and participating in symbol interposition).
foo:
movq glob#GOTPCREL(%rip), %rax
movl (%rax), %eax
ret
Obviously this is less efficient than mov glob(%rip), %eax so prefer keeping your global vars scoped to the library (hidden), not truly global.
There are tricks you can do with weak aliases to let you export a symbol that only this library defines, and access that definition efficiently via a "hidden" alias.

Can't call C standard library function on 64-bit Linux from assembly (yasm) code

I have a function foo written in assembly and compiled with yasm and GCC on Linux (Ubuntu) 64-bit. It simply prints a message to stdout using puts(), here is how it looks:
bits 64
extern puts
global foo
section .data
message:
db 'foo() called', 0
section .text
foo:
push rbp
mov rbp, rsp
lea rdi, [rel message]
call puts
pop rbp
ret
It is called by a C program compiled with GCC:
extern void foo();
int main() {
foo();
return 0;
}
Build commands:
yasm -f elf64 foo_64_unix.asm
gcc -c foo_main.c -o foo_main.o
gcc foo_64_unix.o foo_main.o -o foo
./foo
Here is the problem:
When running the program it prints an error message and immediately segfaults during the call to puts:
./foo: Symbol `puts' causes overflow in R_X86_64_PC32 relocation
Segmentation fault
After disassembling with objdump I see that the call is made with the wrong address:
0000000000000660 <foo>:
660: 90 nop
661: 55 push %rbp
662: 48 89 e5 mov %rsp,%rbp
665: 48 8d 3d a4 09 20 00 lea 0x2009a4(%rip),%rdi
66c: e8 00 00 00 00 callq 671 <foo+0x11> <-- here
671: 5d pop %rbp
672: c3 retq
(671 is the address of the next instruction, not address of puts)
However, if I rewrite the same code in C the call is done differently:
645: e8 c6 fe ff ff callq 510 <puts#plt>
i.e. it references puts from the PLT.
Is it possible to tell yasm to generate similar code?
TL:DR: 3 options:
Build a non-PIE executable (gcc -no-pie -fno-pie call-lib.c libcall.o) so the linker will generate a PLT entry for you transparently when you write call puts.
call puts wrt ..plt like gcc -fPIE would do.
call [rel puts wrt ..got] like gcc -fno-plt would do.
The latter two will work in PIE executables or shared libraries. The 3rd way, wrt ..got, is slightly more efficient.
Your gcc is building PIE executables by default (32-bit absolute addresses no longer allowed in x86-64 Linux?).
I'm not sure why, but when doing so the linker doesn't automatically resolve call puts to call puts#plt. There is still a puts PLT entry generated, but the call doesn't go there.
At runtime, the dynamic linker tries to resolve puts directly to the libc symbol of that name and fixup the call rel32. But the symbol is more than +-2^31 away, so we get a warning about overflow of the R_X86_64_PC32 relocation. The low 32 bits of the target address are correct, but the upper bits aren't. (Thus your call jumps to a bad address).
Your code works for me if I build with gcc -no-pie -fno-pie call-lib.c libcall.o. The -no-pie is the critical part: it's the linker option. Your YASM command doesn't have to change.
When making a traditional position-dependent executable, the linker turns the puts symbol for the call target into puts#plt for you, because we're linking a dynamic executable (instead of statically linking libc with gcc -static -fno-pie, in which case the call could go directly to the libc function.)
Anyway, this is why gcc emits call puts#plt (GAS syntax) when compiling with -fpie (the default on your desktop, but not the default on https://godbolt.org/), but just call puts when compiling with -fno-pie.
See What does #plt mean here? for more about the PLT, and also Sorry state of dynamic libraries on Linux from a few years ago. (The modern gcc -fno-plt is like one of the ideas in that blog post.)
BTW, a more accurate/specific prototype would let gcc avoid zeroing EAX before calling foo:
extern void foo(); in C means extern void foo(...);
You could declare it as extern void foo(void);, which is what () means in C++. C++ doesn't allow function declarations that leave the args unspecified.
asm improvements
You can also put message in section .rodata (read-only data, linked as part of the text segment).
You don't need a stack frame, just something to align the stack by 16 before a call. A dummy push rax will do it.
Or we can tail-call puts by jumping to it instead of calling it, with the same stack position as on entry to this function. This works with or without PIE. Just replace call with jmp, as long as RSP is pointing at your own return address.
If you want to make PIE executables (or shared libraries), you have two options
call puts wrt ..plt - explicitly call through the PLT.
call [rel puts wrt ..got] - explicitly do an indirect call through the GOT entry, like gcc's -fno-plt style of code-gen. (Using a RIP-relative addressing mode to reach the GOT, hence the rel keyword).
WRT = With Respect To. The NASM manual documents wrt ..plt, and see also section 7.9.3: special symbols and WRT.
Normally you would use default rel at the top of your file so you can actually use call [puts wrt ..got] and still get a RIP-relative addressing mode. You can't use a 32-bit absolute addressing mode in PIE or PIC code.
call [puts wrt ..got] assembles to a memory-indirect call using the function pointer that dynamic linking stored in the GOT. (Early-binding, not lazy dynamic linking.)
NASM documents ..got for getting the address of variables in section 9.2.3. Functions in (other) libraries are identical: you get a pointer from the GOT instead of calling directly, because the offset isn't a link-time constant and might not fit in 32-bits.
YASM also accepts call [puts wrt ..GOTPCREL], like AT&T syntax call *puts#GOTPCREL(%rip), but NASM does not.
; don't use BITS 64. You *want* an error if you try to assemble this into a 32-bit .o
default rel ; RIP-relative addressing instead of 32-bit absolute by default; makes the [rel ...] optional
section .rodata ; .rodata is best for constants, not .data
message:
db 'foo() called', 0
section .text
global foo
foo:
sub rsp, 8 ; align the stack by 16
; PIE with PLT
lea rdi, [rel message] ; needed for PIE
call puts WRT ..plt ; tailcall puts
;or
; PIE with -fno-plt style code, skips the PLT indirection
lea rdi, [rel message]
call [rel puts wrt ..got]
;or
; non-PIE
mov edi, message ; more efficient, but only works in non-PIE / non-PIC
call puts ; linker will rewrite it into call puts#plt
add rsp,8 ; restore the stack, undoing the add
ret
In a position-dependent Linux executable, you can use mov edi, message instead of a RIP-relative LEA. It's smaller code-size and can run on more execution ports on most CPUs. (Fun fact: MacOS always puts the "image base" outside the low 4GiB so this optimization isn't possible there.)
In a non-PIE executable, you also might as well use call puts or jmp puts and let the linker sort it out, unless you want more efficient no-plt style dynamic linking. But if you do choose to statically link libc, I think this is the only way you'll get a direct jmp to the libc function.
(I think the possibility of static linking for non-PIE is why ld is willing to generate PLT stubs automatically for non-PIE, but not for PIE or shared libraries. It requires you to say what you mean when linking ELF shared objects.)
If you did use call puts in a PIE (call rel32), it could only work if you statically linked a position-independent implementation of puts into your PIE, so the entire thing was one executable that would get loaded at a random address at runtime (by the usual dynamic-linker mechanism), but simply didn't have a dependency on libc.so.6
Linker "relaxing" calls when the target is present at static-link time
GAS call *bar#GOTPCREL(%rip) uses R_X86_64_GOTPCRELX (relaxable)
NASM call [rel bar wrt ..got] uses R_X86_64_GOTPCREL (not relaxable)
This is less of a problem with hand-written asm; you can just use call bar when you know the symbol will be present in another .o (rather than .so) that you're going to link. But C compilers don't know the difference between library functions and other user functions you declare with prototypes (unless you use stuff like gcc -fvisibility=hidden https://gcc.gnu.org/wiki/Visibility or attributes / pragmas).
Still, you might want to write asm source that the linker can optimize if you statically link a library, but AFAIK you can't do that with NASM. You can export a symbol as hidden (visible at static-link time, but not for dynamic linking in the final .so) with global bar:function hidden, but that's in the source file defining the function, not files accessing it.
global bar
bar:
mov eax,231
syscall
call bar wrt ..plt
call [rel bar wrt ..got]
extern bar
The 2nd file, after assembling with nasm -felf64 and disassembling with objdump -drwc -Mintel to see the relocations:
0000000000000000 <.text>:
0: e8 00 00 00 00 call 0x5 1: R_X86_64_PLT32 bar-0x4
5: ff 15 00 00 00 00 call QWORD PTR [rip+0x0] # 0xb 7: R_X86_64_GOTPCREL bar-0x4
After linking with ld (GNU Binutils) 2.35.1 - ld bar.o bar2.o -o bar
0000000000401000 <_start>:
401000: e8 0b 00 00 00 call 401010 <bar>
401005: ff 15 ed 1f 00 00 call QWORD PTR [rip+0x1fed] # 402ff8 <.got>
40100b: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
0000000000401010 <bar>:
401010: b8 e7 00 00 00 mov eax,0xe7
401015: 0f 05 syscall
Note that the PLT form got relaxed to just a direct call bar, PLT eliminated. But the ff 15 call [rel mem] was not relaxed to an e8 rel32
With GAS:
_start:
call bar#plt
call *bar#GOTPCREL(%rip)
gcc -c foo.s && disas foo.o
0000000000000000 <_start>:
0: e8 00 00 00 00 call 5 <_start+0x5> 1: R_X86_64_PLT32 bar-0x4
5: ff 15 00 00 00 00 call QWORD PTR [rip+0x0] # b <_start+0xb> 7: R_X86_64_GOTPCRELX bar-0x4
Note the X at the end of R_X86_64_GOTPCRELX.
ld bar2.o foo.o -o bar && disas bar:
0000000000401000 <bar>:
401000: b8 e7 00 00 00 mov eax,0xe7
401005: 0f 05 syscall
0000000000401007 <_start>:
401007: e8 f4 ff ff ff call 401000 <bar>
40100c: 67 e8 ee ff ff ff addr32 call 401000 <bar>
Both calls got relaxed to a direct e8 call rel32 straight to the target address. The extra byte in indirect call is filled with a 67 address-size prefix (which has no effect on call rel32), padding the instruction to the same length. (Because it's too late to re-assemble and re-compute all relative branches within functions, and alignment and so on.)
That would happen for call *puts#GOTPCREL(%rip) if you statically linked libc, with gcc -static.
The 0xe8 opcode is followed by a signed offset to be applied to the PC (which has advanced to the next instruction by that time) to compute the branch target. Hence objdump is interpreting the branch target as 0x671.
YASM is rendering zeros because it has likely put a relocation on that offset, which is how it asks the loader to populate the correct offset for puts during loading. The loader is encountering an overflow when computing the relocation, which may indicate that puts is at a further offset from your call than can be represented in a 32-bit signed offset. Hence the loader fails to fix this instruction, and you get a crash.
66c: e8 00 00 00 00 shows the unpopulated address. If you look in your relocation table, you should see a relocation on 0x66d. It is not uncommon for the assembler to populate addresses/offsets with relocations as all zeros.
This page suggests that YASM has a WRT directive that can control use of .got, .plt, etc.
Per S9.2.5 on the NASM documentation, it looks like you can use CALL puts WRT ..plt (presuming YASM has the same syntax).

Resources