NASM ASSEMBLY - Print "Hello World" - string

I've created a string and treated it as an array, looping through each index and moving the byte into the AL register so it can be printed to the VGA display. The problem is that it prints as many characters as the string has, but the characters are gibberish. Can you please help me figure out what the problem is in the code? It will be highly appreciated.
org 0
bits 16
section .text
global _start
_start:
mov si, msg
loop:
inc si
mov ah, 0x0e
mov al, [si]
or al, al
jz end
mov bh, 0x00
int 0x10
jmp loop
end:
jmp .done
.done:
jmp $
msg db 'Hello, world!',0xa
len equ $ - msg
TIMES 510 - ($ - $$) db 0
DW 0xAA55
bootloader code
ORG 0x7c00
BITS 16
boot:
mov ah, 0x02
mov al, 0x01
mov ch, 0x00
mov cl, 0x02
mov dh, 0x00
mov dl, 0x00
mov bx, 0x1000
mov es, bx
int 0x13
jmp 0x1000:0x00
times 510 - ($ - $$) db 0
dw 0xAA55

The bootloader
Before tackling the kernel code, let's look at the bootloader that brings the kernel in memory.
You have written a very minimalistic version of a bootloader, one that omits much of the usual stuff like setting up segment registers, but thanks to its reduced nature that's not really a problem.
What could be a problem is that you wrote mov dl, 0x00, hardcoding a zero to select the first floppy as your bootdisk. No problem if this is indeed the case, but it would be much better to just use whatever value the BIOS preloaded the DL register with. That's the ID for the disk that holds your bootloader and kernel.
What is a problem is that you load the kernel to the segmented address 0x1000:0x1000 and then later jump to the segmented address 0x1000:0x0000 which is 4096 bytes short of the kernel. You got lucky that the kernel code did run in the end, thanks to the memory between these two addresses most probably being filled with zero-bytes that (two by two) translate into the instruction add [bx+si], al. Because you omitted setting up the DS segment register, we don't know what unlucky byte got overwritten so many times. Let's hope it was not an important byte...
mov bx, 0x1000
mov es, bx
xor bx, bx ; <== You forgot to write this instruction!
int 0x13
jmp 0x1000:0x0000
What is a problem is that you ignore the possibility of encountering troubles when loading a sector from the disk. At the very least you should inspect the carry flag that the BIOS.ReadSector function 02h reports and if the flag is set you could abort cleanly. A more sophisticated approach would also retry a limited number of times, say 3 times.
ORG 0x7C00
BITS 16
; IN (dl)
mov dh, 0x00 ; DL is bootdrive
mov cx, 0x0002
mov bx, 0x1000
mov es, bx
xor bx, bx
mov ax, 0x0201 ; BIOS.ReadSector
int 0x13 ; -> AH CF
jc ERR
jmp 0x1000:0x0000
ERR:
cli
hlt
jmp ERR
times 510 - ($ - $$) db 0
dw 0xAA55
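The retry approach mentioned above could be sketched as follows. This is only a minimal sketch: it assumes the kernel occupies cylinder 0, head 0, sector 2 of the boot drive and that the BIOS preserves BP across int 0x13; the counter register and the label names are my own choice. Resetting the disk system (function 00h) between attempts is the customary step for floppies, whose motor may not be up to speed on the first try.

```nasm
        ORG  0x7C00
        BITS 16

; IN (dl) = boot drive, as preloaded by the BIOS
        mov  bp, 3              ; retry counter (assumed untouched by int 0x13)
        mov  bx, 0x1000
        mov  es, bx
        xor  bx, bx             ; ES:BX = 0x1000:0x0000, the load address
Retry:
        mov  dh, 0x00           ; head 0
        mov  cx, 0x0002         ; cylinder 0, sector 2
        mov  ax, 0x0201         ; BIOS.ReadSector, 1 sector
        int  0x13               ; -> AH CF
        jnc  Loaded             ; carry clear: the sector is in memory
        xor  ah, ah             ; BIOS.ResetDiskSystem
        int  0x13
        dec  bp
        jnz  Retry              ; try again until the counter runs out
Halt:
        cli
        hlt
        jmp  Halt
Loaded:
        jmp  0x1000:0x0000

        times 510 - ($ - $$) db 0
        dw   0xAA55
```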
The kernel
After the jmp 0x1000:0x0000 instruction has brought you to the first instruction of your kernel, the CS code segment register holds the value 0x1000. None of the other segment registers changed, and since you did not set up any of them in the bootloader, we still don't know what any of them contain. However, in order to retrieve the bytes from the message at msg with the mov al, [si] instruction, we need a correct value for the DS data segment register. In accordance with the ORG 0 directive, the correct value is the one we already have in CS. Just two 1-byte instructions are needed: push cs followed by pop ds.
There's more to be said about the kernel code:
The printing loop uses a pre-increment on the pointer in the SI register. Because of this the first character of the string will not get displayed. You could compensate for this via mov si, msg - 1.
The printing loop processes a zero-terminated string. You don't need to prepare that len equate. What you do need is an explicit zero byte that terminates the string. You should not rely on that large number of zero bytes that times produced. In some future version of the code there might be no zero byte at all!
You (think you) have included a newline (0xa) in the string. For the BIOS.Teletype function 0Eh, this is merely a linefeed that moves down on the screen. To obtain a newline, you need to include both carriage return (13) and linefeed (10).
There's no reason for your kernel code to have the bootsector signature bytes at offset 510. Depending on how you get this code to the disk, it might be necessary to pad the code up to (a multiple of) 512, so keep times 512 - ($ - $$) db 0.
The kernel:
ORG 0
BITS 16
section .text
global _start
_start:
push cs
pop ds
mov si, msg
mov bx, 0x0007 ; DisplayPage=0, GraphicsColor=7 (White)
jmp BeginLoop
PrintLoop:
mov ah, 0x0E ; BIOS.Teletype
int 0x10
BeginLoop:
mov al, [si]
inc si
test al, al
jnz PrintLoop
cli
hlt
jmp $-2
msg db 'Hello, world!', 13, 10, 0
TIMES 512 - ($ - $$) db 0

Related

Compact shellcode to print a 0-terminated string pointed-to by a register, given puts or printf at known absolute addresses?

Background: I am a beginner trying to understand how to golf assembly, in particular to solve an online challenge.
EDIT: clarification: I want to print the value at the memory address of RDX. So “SUPER SECRET!”
Create some shellcode that can output the value of register RDX in <= 11 bytes. Null bytes are not allowed.
The program is compiled with the c standard library, so I have access to the puts / printf statement. It’s running on x86 amd64.
$rax : 0x0000000000010000 → 0x0000000ac343db31
$rdx : 0x0000555555559480 → "SUPER SECRET!"
gef➤ info address puts
Symbol "puts" is at 0x7ffff7e3c5a0 in a file compiled without debugging.
gef➤ info address printf
Symbol "printf" is at 0x7ffff7e19e10 in a file compiled without debugging.
Here is my attempt (intel syntax)
xor ebx, ebx ; zero the ebx register
inc ebx ; set the ebx register to 1 (STDOUT)
xchg ecx, edx ; set the ECX register to RDX
mov edx, 0xff ; set the length to 255
mov eax, 0x4 ; set the syscall to print
int 0x80 ; interrupt
My attempt is 17 bytes and includes null bytes, which aren't allowed. What other ways can I lower the byte count? Is there a way to call puts / printf while still saving bytes?
FULL DETAILS:
I am not quite sure what is useful information and what isn't.
File details:
ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=5810a6deb6546900ba259a5fef69e1415501b0e6, not stripped
Source code:
void main() {
char* flag = get_flag(); // I don't get access to the function details
char* shellcode = (char*) mmap((void*) 0x1337,12, 0, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
mprotect(shellcode, 12, PROT_READ | PROT_WRITE | PROT_EXEC);
fgets(shellcode, 12, stdin);
((void (*)(char*))shellcode)(flag);
}
Disassembly of main:
gef➤ disass main
Dump of assembler code for function main:
0x00005555555551de <+0>: push rbp
0x00005555555551df <+1>: mov rbp,rsp
=> 0x00005555555551e2 <+4>: sub rsp,0x10
0x00005555555551e6 <+8>: mov eax,0x0
0x00005555555551eb <+13>: call 0x555555555185 <get_flag>
0x00005555555551f0 <+18>: mov QWORD PTR [rbp-0x8],rax
0x00005555555551f4 <+22>: mov r9d,0x0
0x00005555555551fa <+28>: mov r8d,0xffffffff
0x0000555555555200 <+34>: mov ecx,0x22
0x0000555555555205 <+39>: mov edx,0x0
0x000055555555520a <+44>: mov esi,0xc
0x000055555555520f <+49>: mov edi,0x1337
0x0000555555555214 <+54>: call 0x555555555030 <mmap@plt>
0x0000555555555219 <+59>: mov QWORD PTR [rbp-0x10],rax
0x000055555555521d <+63>: mov rax,QWORD PTR [rbp-0x10]
0x0000555555555221 <+67>: mov edx,0x7
0x0000555555555226 <+72>: mov esi,0xc
0x000055555555522b <+77>: mov rdi,rax
0x000055555555522e <+80>: call 0x555555555060 <mprotect@plt>
0x0000555555555233 <+85>: mov rdx,QWORD PTR [rip+0x2e26] # 0x555555558060 <stdin@@GLIBC_2.2.5>
0x000055555555523a <+92>: mov rax,QWORD PTR [rbp-0x10]
0x000055555555523e <+96>: mov esi,0xc
0x0000555555555243 <+101>: mov rdi,rax
0x0000555555555246 <+104>: call 0x555555555040 <fgets@plt>
0x000055555555524b <+109>: mov rax,QWORD PTR [rbp-0x10]
0x000055555555524f <+113>: mov rdx,QWORD PTR [rbp-0x8]
0x0000555555555253 <+117>: mov rdi,rdx
0x0000555555555256 <+120>: call rax
0x0000555555555258 <+122>: nop
0x0000555555555259 <+123>: leave
0x000055555555525a <+124>: ret
Register state right before shellcode is executed:
$rax : 0x0000000000010000 → "EXPLOIT\n"
$rbx : 0x0000555555555260 → <__libc_csu_init+0> push r15
$rcx : 0x000055555555a4e8 → 0x0000000000000000
$rdx : 0x0000555555559480 → "SUPER SECRET!"
$rsp : 0x00007fffffffd940 → 0x0000000000010000 → "EXPLOIT\n"
$rbp : 0x00007fffffffd950 → 0x0000000000000000
$rsi : 0x4f4c5058
$rdi : 0x00007ffff7fa34d0 → 0x0000000000000000
$rip : 0x0000555555555253 → <main+117> mov rdi, rdx
$r8 : 0x0000000000010000 → "EXPLOIT\n"
$r9 : 0x7c
$r10 : 0x000055555555448f → "mprotect"
$r11 : 0x246
$r12 : 0x00005555555550a0 → <_start+0> xor ebp, ebp
$r13 : 0x00007fffffffda40 → 0x0000000000000001
$r14 : 0x0
$r15 : 0x0
(This register state is a snapshot at the assembly line below)
●→ 0x555555555253 <main+117> mov rdi, rdx
0x555555555256 <main+120> call rax
Since I already spilled the beans and "spoiled" the answer to the online challenge in comments, I might as well write it up. 2 key tricks:
Create 0x7ffff7e3c5a0 (&puts) in a register with lea reg, [reg + disp32], using the known value of RDI which is within the +-2^31 range of a disp32. (Or use RBP as a starting point, but not RSP: that would need a SIB byte in the addressing mode).
This is a generalization of the code-golf trick of using lea edi, [rax+1] to create small constants from other small constants (especially 0) in 3 bytes, with code that runs less slowly than push imm8 / pop reg.
The disp32 is large enough to not have any zero bytes; you have a couple registers to choose from in case one had been too close.
Copy a 64-bit register in 2 bytes with push reg / pop reg, instead of 3-byte mov rdi, rdx (REX + opcode + modrm). No savings if either push needs a REX prefix (for R8..R15), and actually costs bytes if both are "non-legacy" registers.
See other answers on Tips for golfing in x86/x64 machine code on codegolf.SE for more.
bits 64
lea rsi, [rdi - 0x166f30]
;; add rbp, imm32 ; alternative, but that would mess up a call-preserved register so we might crash on return.
push rdx
pop rdi ; copy RDX to first arg, x86-64 SysV calling convention
jmp rsi ; tailcall puts
This is exactly 11 bytes, and I don't see a way for it to be smaller. add r64, imm32 is also 7 bytes, same as LEA. (Or 6 bytes if the register is RAX, but even the xchg rax, rdi short form would cost 2 bytes to get it there, and the RAX value is still the fgets return value, which is the small mmap buffer address.)
The puts function pointer doesn't fit in 32 bits, so we need a REX prefix on any instruction that puts it into a register. Otherwise we could just mov reg, imm32 (5 bytes) with the absolute address, not deriving it from another register.
$ nasm -fbin -o exploit.bin -l /dev/stdout exploit.asm
1 bits 64
2 00000000 488DB7D090E9FF lea rsi, [rdi - 0x166f30]
3 ;; add rbp, imm32 ; we can avoid messing up any call-preserved registers
4 00000007 52 push rdx
5 00000008 5F pop rdi ; copy to first arg
6 00000009 FFE6 jmp rsi ; tailcall
$ ll exploit.bin
-rw-r--r-- 1 peter peter 11 Apr 24 04:09 exploit.bin
$ ./a.out < exploit.bin # would work if the addresses in my build matched yours
My build of your incomplete .c uses different addresses on my machine, but it does reach this code (at address 0x10000, mmap_min_addr, which mmap picks after the amusing choice of 0x1337 as a hint address, which isn't even page aligned but doesn't result in EINVAL on current Linux.)
Since we only tailcall puts with correct stack alignment and don't modify any call-preserved registers, this should successfully return to main.
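The displacement used in the lea is simply the difference between two addresses shown in the question. As a sketch, assuming RDI really holds the value from the debugger snapshot when the lea runs (the two equ names are mine, not from the challenge), NASM can do the subtraction itself:

```nasm
bits 64

PUTS_ADDR equ 0x7ffff7e3c5a0    ; from "info address puts" in the question
RDI_SNAP  equ 0x7ffff7fa34d0    ; RDI in the register dump (assumed still valid)

        lea  rsi, [rdi + (PUTS_ADDR - RDI_SNAP)]  ; same as [rdi - 0x166f30]
        push rdx
        pop  rdi                ; string pointer becomes the first argument
        jmp  rsi                ; tailcall puts
```

Assembling this with nasm -fbin should give the same 11 bytes as the listing above, since the expression folds to the constant -0x166f30 at assembly time.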
Note that 0 bytes (ASCII NUL, not NULL) would actually work in shellcode for this test program, if not for the requirement that forbids it.
The input is read using fgets (apparently to simulate a gets() overflow).
fgets actually can read a 0 aka '\0'; the only critical character is 0xa aka '\n' newline. See Is it possible to read null characters correctly using fgets or gets_s?
Often buffer overflows exploit a strcpy or something else that stops on a 0 byte, but fgets only stops on EOF or newline. (Or the buffer size, a feature gets is missing, hence its deprecation and removal from even the ISO C standard library! It's literally impossible to use safely unless you control the input data). So yes, it's totally normal to forbid zero bytes.
BTW, your int 0x80 attempt is not viable: What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? - you can't use the 32-bit ABI to pass 64-bit pointers to write, and the string you want to output is not in the low 32 bits of virtual address space.
Of course, with the 64-bit syscall ABI, you're fine if you can hardcode the length.
push rdx
pop rsi
shr eax, 16 ; fun 3-byte way to turn 0x10000 into 1, __NR_write 64-bit, instead of just push 1 / pop
mov edi, eax ; STDOUT_FD = __NR_write
lea edx, [rax + 13 - 1] ; 3 bytes. RDX = 13 = string length
; or mov dl, 0xff ; 2 bytes leaving garbage in rest of RDX
syscall
But this is 12 bytes, as well as hard-coding the length of the string (which was supposed to be part of the secret?).
mov dl, 0xff could make sure the length was at least 255, and actually much more in this case, if you don't mind getting reams of garbage after the string you want, until write hits an unmapped page and returns early. That would save a byte, making this 11.
(Fun fact, Linux write does not return an error when it's successfully written some bytes; instead it returns how many it did write. If you try again with buf + write_len, you would get a -EFAULT return value for passing a bad pointer to write.)
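Put together, the 11-byte write variant described above might look like this. It is only a sketch: it assumes RAX is 0x10000 on entry, as in the question's register dump, and it will print a lot of garbage after the string. The bytes in the comments are the standard encodings of these instructions; none of them is 0x00 or 0x0a.

```nasm
bits 64
        push rdx                ; 52        string pointer ...
        pop  rsi                ; 5E        ... becomes the buf argument
        shr  eax, 16            ; C1 E8 10  0x10000 -> 1 = __NR_write
        mov  edi, eax           ; 89 C7     fd 1 = stdout
        mov  dl, 0xff           ; B2 FF     length >= 255, upper RDX bits are left over from the pointer
        syscall                 ; 0F 05
```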

Boot sector printing the wrong thing when I %include my print function

I am having an issue with some assembly code. I am trying to print out a string using a function from a different assembly file. But it doesn't output the string but instead an "S ". How do I fix this? I would like to add that I use the NASM assembler.
code:
string.asm
print_string:
pusha
mov ah, 0x0e
loop:
mov al, [bx]
cmp al, 0
je return
int 0x10
inc bx
jmp loop
return:
popa
ret
boot_sector.asm -
[org 0x7c00]
%include "string.asm"
mov bx, [my_string]
call print_string
my_string:
db 'hello world', 0
times 510 - ($ - $$) db 0
dw 0xaa55
Execution of a boot sector begins with the first byte. In this case, the first instruction is the top of your function, because you put it first.
The code assembles exactly the same as if you had included it manually before assembling. So your boot sector is really:
[org 0x7c00]
print_string:
pusha
...
ret
mov bx, [my_string] ; BX = load first 2 bytes of my_string.
; should have been
; mov bx, my_string ; BX = address of my_string. mov bx, imm16
call print_string
It should be pretty obvious why that doesn't work, and you would have noticed this if you single-stepped your code with the debugger built-in to BOCHS (or any other way of debugging a boot sector). Even just looking at disassembly might have clued you in.
Solution: put the %include after your other code, and avoid having execution fall into it. e.g. put this after the call:
cli ; disable interrupts
hlt ; halt until the next interrupt. (except for NMI)
(If NMI is possible, you can put the hlt inside an infinite loop with jmp.)
This is not your only bug. As @MichaelPetch points out, you were loading 2 bytes from the string instead of putting its address into BX.
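With both fixes applied, the boot sector could look roughly like this. The print routine is pasted inline so the listing is self-contained; a %include "string.asm" at that spot would be equivalent. Three things here are my own assumptions rather than part of the answer: setting DS to 0 so that the addresses match org 0x7c00, clearing the direction flag, and copying the pointer from BX to SI inside the routine so that BH can hold a valid display page for the teletype call. The label names are mine too.

```nasm
[org 0x7c00]
[bits 16]
        xor  ax, ax
        mov  ds, ax             ; (assumption) DS = 0 to match org 0x7c00
        cld                     ; lodsb must count upwards
        mov  bx, my_string      ; BX = address of the string, not its contents
        call print_string
        cli                     ; disable interrupts
hang:
        hlt                     ; halt; loop in case an NMI wakes us up
        jmp  hang

; print_string: IN (bx) = address of a zero-terminated string
print_string:
        pusha
        mov  si, bx             ; pointer moves to SI ...
        mov  bx, 0x0007         ; ... so BH = display page 0, BL = color 7
        mov  ah, 0x0e           ; BIOS teletype
.next:
        lodsb                   ; AL = [DS:SI], SI = SI + 1
        test al, al
        jz   .done
        int  0x10
        jmp  .next
.done:
        popa
        ret

my_string:
        db 'hello world', 0

        times 510 - ($ - $$) db 0
        dw 0xaa55
```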

How can i copy an array in nasm x86 assembly for Linux, porting 16-bit DOS code?

I have to write a program which copies one array into another array, using x86 assembler.
The original code is written in MS-DOS TASM for the 8086 processor, but I want to port it to Linux NASM for the i386 processor.
The code in TASM is this:
.MODEL SMALL
.DATA
TABLE_A DB 10, 5, 1
TABLE_B DB 0, 0, 0
.CODE
MOV AX, SEG TABLE_B
MOV DS, AX
MOV SI, 0
LOOP:
MOV AL, TABLE_A[SI]
MOV TABLE_B[SI], AL
INC SI
CMP SI, 2
JBE LOOP
MOV AH, 4Ch
INT 21h
END
I'm trying to rewrite this in NASM, but I can't work out how to address the correct array position, the way the TABLE_A[SI] operand does.
How can I do it?
The final code in nasm is this
section .text
global _start
cpu 386
_start:
MOV ESI, TABLE_A
MOV EDI, TABLE_B
MOV CX, 3
COPY_LOOP:
MOV AL, [ESI]
MOV [EDI], AL
INC SI
INC DI
LOOP COPY_LOOP
MOV AX,1
INT 80h
section .data
TABLE_A DB 10, 5, 1
TABLE_B DB 0, 0, 0
How could I do it?
(question from comments on self-answer)
Well, first you read the instruction reference guide to understand what the instruction does, and then you can use it if it fits your purpose. This is the important step: keep re-reading the instruction details every so often to verify that an instruction modifies registers and flags the way you expect, especially if the debugger shows a change of CPU state you didn't expect.
As you are on Linux, the ds/es segment registers are very likely already set to reasonable values (covering the .data section), so after setting eSi to the Source address, eDi to the Destination address, and eCx to the Count, you replace the whole COPY_LOOP: loop with just rep movsb, and then exit through int 80h (eax=1). (Notice the emphasized letters in the register names; Intel picked those intentionally to make them easy to recall.)
BTW, I just noticed a few bugs in your code:
inc si/di should be inc esi/edi, because you use esi/edi to address memory. If you were copying an array across a 64k memory boundary, inc si would wrap around at it.
Set ecx to 3: in 32b mode the loop instruction uses the whole 32b ecx, not only the 16b part cx. If the code ahead of the copy left a large number in ecx, with some of the upper 16 bits set, your loop would copy many more bytes than just 3.
Ahead of calling int 80h you must likewise set the whole eax to the function number, otherwise you risk having garbage from previous code in the upper 16 bits of eax, which requests an invalid function.
So after applying these your code may look like this:
section .text
global _start
cpu 386
_start:
MOV ESI, TABLE_A
MOV EDI, TABLE_B
MOV ECX, 3
REP MOVSB ; copy ECX bytes from DS:ESI to ES:EDI
MOV EAX,1 ; call sys_exit, again FIXED to EAX!
INT 80h
section .data
TABLE_A DB 10, 5, 1
TABLE_B DB 0, 0, 0
If you did read the docs about registers, you should already understand the difference between eax and ax. On Linux you are in 32b mode (when you link the binary as a 32b ELF; nowadays 64b may be the default on a 64b system, and that differs a bit from 32b mode), so by default use the 32b register variants. Use the 16b/8b variant only when you really want it for a particular reason, and make sure no later code works with the full 32b register while you have set only part of it (as loop, rep movsb and int 80h do).
Also it makes the code usually faster, as using 16b ax in 32b mode requires additional opcode byte ahead of instruction, for example mov eax,ebx is 2 bytes opcode 89 D8, mov ax,bx is 3 bytes opcode 66 89 D8.
In response to marc
I tried this form, without successful result:
MOV SI, 0
MOV AX, 0
LOOP:
MOV AX, [TABLE_A + SI]
MOV [TABLE_B + SI], AX
INC SI
CMP SI, 2
JBE LOOP
Use pointers (SI, DI) to the arrays and CX as counter :
MOV SI, Table_A ;POINTER TO TABLE_A.
MOV DI, Table_B ;POINTER TO TABLE_B.
MOV CX, 3 ;ARRAY LENGTH.
REPEAT:
MOV AL, [SI]
MOV [DI], AL
INC SI
INC DI
LOOP REPEAT ;CX-1. IF CX>0 JUMP TO REPEAT.
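For the indexed form the question asks about (the counterpart of TASM's TABLE_A[SI]), NASM wants the array label and the index register written inside the brackets. A minimal 32-bit Linux sketch, assuming it is built with nasm -f elf32 and linked with ld -m elf_i386; the TABLE_LEN name and the local label are mine:

```nasm
section .text
global _start

_start:
        xor  esi, esi               ; ESI = index 0
.copy:
        mov  al, [TABLE_A + esi]    ; one byte, so AL and not AX
        mov  [TABLE_B + esi], al
        inc  esi
        cmp  esi, TABLE_LEN
        jb   .copy                  ; repeat while index < length

        mov  eax, 1                 ; sys_exit
        xor  ebx, ebx               ; exit status 0
        int  80h

section .data
TABLE_A   db 10, 5, 1
TABLE_LEN equ $ - TABLE_A           ; 3, computed by the assembler
TABLE_B   db 0, 0, 0
```

Note that the attempt quoted above moved AX, which copies two bytes per step and so writes one byte past the end of TABLE_B on the last iteration.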

How to use ORG addresses > 0xFFFF?

I am trying to write a simple bootloader in assembler.
The bootloader copies sector 2 from a floppy to address 0x5000 (segment 0x500, offset 0x0), jumps to the segment and prints a message.
However, when I change the segment address to 0x1000, the message does not get printed anymore. I suspect the org 0x10000 instruction has a problem, which might be related to segmentation. I tried org 0x1000:0 too, but the message won't be printed.
Here is my bootloader code, which gets written to the first sector of the floppy:
[BITS 16]
org 0x7C00
start:
mov ah, 0x02 ; Read sectors from drive
mov al, 1 ; Read 1 sector
mov ch, 0 ; Cylinder 0
mov cl, 2 ; Sector 2
mov dh, 0 ; Head 0
mov bx, sect2dest;
mov es, bx
mov bx, 0x0
int 0x13
jmp sect2dest:0;
data:
sect2dest equ 0x500
The magic identifier in the end is written by a custom linking script, so don't worry about that.
Here is my sector two, which should print a message:
[BITS 16]
org 0x5000
sect2:
mov ah, 0x13
mov al, 1
mov bl, 0x17
mov cx, msg_len
mov dh, 0
mov dl, 0
mov bh, 0
mov bp, 0
mov es, bp
mov bp, msg
int 0x10
jmp $
msg db 13,10,"Hello, World!"
msg_len equ $ - msg
As mentioned above, when I try writing sector 2 to any address larger than 0xFFFF, the message doesn't get printed.
Consider that bp is 16 bits wide, so if you use an ORG of 10000h no offset will fit in it.
I was expecting the assembler to raise a warning, but a quick test showed otherwise.
Remember also that it is generally best to avoid challenging the BIOS. Though I don't know how it is actually handled, I would avoid printing a string that straddles two segments.
Since you are putting zero in es, make sure that the ORG is at most 10000h-[msg_len], so that the whole string is reachable within es.
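One way to load above 0xFFFF is to keep the offsets small with org 0 and put the segment into ES instead of zero. The sketch below assumes the bootloader is changed to sect2dest equ 0x1000, so that it jumps to 0x1000:0. It is an alternative to the 10000h-[msg_len] limit, not something the text above spells out.

```nasm
[BITS 16]
org 0                       ; offsets are relative to segment 0x1000

sect2:
        push cs
        pop  es             ; ES = CS = 0x1000, so ES:BP can reach msg
        mov  ah, 0x13       ; BIOS write string
        mov  al, 1          ; attribute in BL, move the cursor
        mov  bx, 0x0017     ; BH = page 0, BL = attribute 0x17
        mov  cx, msg_len
        xor  dx, dx         ; row 0, column 0
        mov  bp, msg        ; a 16-bit offset, which now fits
        int  0x10
        jmp  $

msg     db 13,10,"Hello, World!"
msg_len equ $ - msg
```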

Print register value to console

I want to print the value in %RCX directly to the console, let's say an ASCII value. I've searched through some wise books and tutorials, but all use buffers to pass anything. Is it possible to print anything without creating special buffer for that purpose?
Let's say I am here (all these answers are far too complicated for me and use a different syntax):
movq $5, %rax
...???(print %rax)
Output on console:
>5
in example, to print buffer i use code:
SYSWRITE = 4
STDOUT = 1
EXIT_SUCCESS = 0
.text
buff: .ascii "Anything to print\n"
buff_len = . - buff
movq $SYSWRITE, %eax
mov $STDOUT, %ebx
mov $buff, %ecx
mov $buff_len, %edx
NO C CODE OR DIFFERENT ASS SYNTAX ALLOWED!!!
In order to print a register (in hex or decimal representation), the routine (write to stdout, stderr, etc.) expects ASCII characters. Just writing a register will cause the routine to try to display the ASCII equivalent of the value in the register. You may get lucky sometimes if each of the bytes in the register happens to fall into the printable character range.
You will need to convert it via routines that convert to decimal or hex. Here is an example of converting a 64-bit register to its hex representation (using Intel syntax with NASM):
section .rodata
hex_xlat: db "0123456789abcdef"
section .text
; Called with RDI is the register to convert and
; RSI for the buffer to fill
;
register_to_hex:
push rsi ; Save for return
xor eax,eax
mov ecx, 16 ; looper
lea rdx, [rel hex_xlat] ; position-independent code can't index a static array directly
ALIGN 16
.loop:
rol rdi, 4 ; dil now has high bit nibble
mov al, dil ; capture low nibble
and al, 0x0f
mov al, byte [rdx+rax] ; look up the ASCII encoding for the hex digit
; rax is an 'index' with range 0x0 - 0xf.
; The upper bytes of rax are still zero from xor
mov byte [rsi], al ; store in print buffer
inc rsi ; position next pointer
dec ecx
jnz .loop
.exit:
pop rax ; Get original buffer pointer
ret
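To see output on the console, the conversion loop can be combined with a write system call. The following is a minimal stand-alone sketch for x86-64 Linux, assuming a build with nasm -felf64 followed by ld; the constant, the hex_digits name and the stack buffer layout are my own choices for illustration.

```nasm
default rel
global _start

section .rodata
hex_digits: db "0123456789abcdef"

section .text
_start:
        mov  rdi, 0x0123456789abcdef ; example value to print
        sub  rsp, 32                 ; stack buffer: 16 digits + newline
        mov  rsi, rsp
        lea  rdx, [hex_digits]
        mov  ecx, 16                 ; 16 nibbles in a 64-bit register
.digit:
        rol  rdi, 4                  ; bring the next high nibble to the bottom
        mov  eax, edi
        and  eax, 0x0f               ; EAX = nibble value 0..15
        mov  al, [rdx + rax]         ; look up its ASCII digit
        mov  [rsi], al
        inc  rsi
        dec  ecx
        jnz  .digit
        mov  byte [rsi], 10          ; trailing newline

        mov  eax, 1                  ; __NR_write
        mov  edi, 1                  ; stdout
        mov  rsi, rsp                ; start of the buffer
        mov  edx, 17                 ; 16 digits + newline
        syscall

        xor  edi, edi                ; exit status 0
        mov  eax, 60                 ; __NR_exit
        syscall
```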
This answer is an addendum to the answer given by Frank, and utilizes the mechanism used there to do the conversion.
You mention the register %RCX in your question. This suggests you are looking at 64-bit code and that your environment is likely GCC/GAS (GNU Assembler) based since % is usually the AT&T style prefix for registers.
With that in mind I've created a quick and dirty macro that can be used inline anywhere you need to print a 64-bit register, 64-bit memory operand, or a 32-bit immediate value in GNU Assembly. This version was a proof of concept and could be amended to support 64 bit immediate values. All the registers that are used are preserved, and the code will also account for the Linux 64-bit System V ABI red zone.
The code below is commented to point out what is occurring at each step.
printmac.inc:
.macro memreg_to_hex src # Macro takes one input
# src = memory operand, register,
# or 32 bit constant to print
# Define the translation table only once for the current object
.ifndef MEMREG_TO_HEX_NOT_FIRST
.set MEMREG_TO_HEX_NOT_FIRST, 1
.PushSection .rodata
hex_xlat: .ascii "0123456789abcdef"
.PopSection
.endif
add $-128,%rsp # Avoid 128 byte red zone
push %rsi # Save all registers that will be used
push %rdi
push %rdx
push %rcx
push %rbx
push %rax
push %r11 # R11 is destroyed by SYSCALL
mov \src, %rdi # Move src value to RDI for processing
# Output buffer on stack at RSP-16 to RSP-1
lea -16(%rsp),%rsi # RSI = output buffer on stack
lea hex_xlat(%rip), %rdx # RDX = translation buffer address
xor %eax,%eax # RAX = Index into translation array
mov $16,%ecx # 16 nibbles to print
.align 16
1:
rol $4,%rdi # rotate high nibble to low nibble
mov %dil,%al # dil now has previous high nibble
and $0xf,%al # mask off all but low nibble
mov (%rdx,%rax,1),%al # Lookup in translation table
mov %al,(%rsi) # Store in output buffer
inc %rsi # Update output buffer address
dec %ecx
jne 1b # Loop until counter is 0
mov $1,%eax # Syscall 1 = sys_write
mov %eax,%edi # EDI = 1 = STDOUT
mov $16,%edx # EDX = Number of chars to print
sub %rdx,%rsi # RSI = beginning of output buffer
syscall
pop %r11 # Restore all registers used
pop %rax
pop %rbx
pop %rcx
pop %rdx
pop %rdi
pop %rsi
sub $-128,%rsp # Restore stack
.endm
printtest.s
.include "printmac.inc"
.global main
.text
main:
mov $0x123456789abcdef,%rcx
memreg_to_hex %rcx # Print the 64-bit value 0x123456789abcdef
memreg_to_hex %rsp # Print address containing ret pointer
memreg_to_hex (%rsp) # Print return pointer
memreg_to_hex $0x402 # Doesn't support 64-bit immediates
# but can print anything that fits a DWORD
retq
This can be compiled and linked with:
gcc -m64 printtest.s -o printtest
The macro doesn't print an end of line character so the output of the test program looks like:
0123456789abcdef00007fff5283d74000007f5c4a080a500000000000000402
The memory addresses will be different.
Since the macros are inlined, each time you invoke the macro the entire code will be emitted. The code is space inefficient. The bulk of the code could be moved to an object file you can include at link time. Then a stub macro could wrap a CALL to the main printing function.
The code doesn't use printf because at some point I thought I saw a comment that you couldn't use the C library. If that's not the case this can be simplified greatly by calling printf to format the output to print a 64-bit hexadecimal value.
Just for fun, here are a couple other sequences for storing a hex string from a register. Printing the buffer is not the interesting part, IMO; copy that part from Michael's excellent answer if needed.
I tested some of these. I've included a main that calls one of these functions and then uses printf("%s\n%lx\n", result, test_value); to make it easy to spot problems.
Test main():
extern printf
global main
main:
push rbx
mov rdi, 0x1230ff56dcba9911
mov rbx, rdi
sub rsp, 32
mov rsi, rsp
mov byte [rsi+16], 0
call register_to_hex_ssse3
mov rdx, rbx
mov edi, fmt
mov rsi, rsp
xor eax,eax
call printf
add rsp, 32
pop rbx
ret
section .rodata
fmt: db `%s\n%lx\n`, 0 ; YASM doesn't support `string with escapes`, so this only assembles with NASM.
; NASM needs
; %use smartalign
; ALIGNMODE p6, 32
; or similar, to stop it using braindead repeated single-byte NOPs for ALIGN
SSSE3 pshufb for the LUT
This version doesn't need a loop, but the code size is much larger than the rotate-loop versions because SSE instructions are longer.
section .rodata
ALIGN 16
hex_digits:
hex_xlat: db "0123456789abcdef"
section .text
;; rdi = val rsi = buffer
ALIGN 16
global register_to_hex_ssse3
register_to_hex_ssse3: ;;;; 0x39 bytes of code
;; use PSHUFB to do 16 nibble->ASCII LUT lookups in parallel
movaps xmm5, [rel hex_digits]
;; x86 is little-endian, but we want the hex digit for the high nibble to be the first character in the string
;; so reverse the bytes, and later unpack nibbles like [ LO HI ... LO HI ]
bswap rdi
movq xmm1, rdi
;; generate a constant on the fly, rather than loading
;; this is a bit silly: we already load the LUT, might as well load another 16B from the same cache line, a memory operand for PAND since we manage to only use it once
pcmpeqw xmm4,xmm4
psrlw xmm4, 12
packuswb xmm4,xmm4 ; [ 0x0f 0x0f 0x0f ... ] mask for low-nibble of each byte
movdqa xmm0, xmm1 ; xmm0 = low nibbles at the bottom of each byte
psrlw xmm1, 4 ; xmm1 = high nibbles at the bottom of each byte (with garbage from next byte)
punpcklbw xmm1, xmm0 ; unpacked nibbles (with garbage in the high 4b of some bytes)
pand xmm1, xmm4 ; mask off the garbage bits because pshufb reacts to the MSB of each element. Delaying until after interleaving the hi and lo nibbles means we only need one
pshufb xmm5, xmm1 ; xmm5 = the hex digit for the corresponding nibble in xmm1
movups [rsi], xmm5
ret
AVX2: you can do two integers at once, with something like
int64x2_to_hex_avx2: ; (const char buf[32], uint64_t first, uint64_t second)
bswap rsi ; We could replace the two bswaps with one 256b vpshufb, but that would require a mask
vmovq xmm1, rsi
bswap rdx
vpinsrq xmm1, xmm1, rdx, 1
vpmovzxbw ymm1, xmm1 ; upper lane = rdx, lower lane = rsi, with each byte zero-extended to a word element
vpsllw ymm1, ymm1, 12 ; shift the high nibbles out, leaving the low nibbles at the top of each word
vpor ymm0, ymm0, ymm1 ; merge while hi and lo elements both need the same shift
vpsrlw ymm1, ymm1, 4 ; low nibbles in elems 1, 3, 5, ...
; high nibbles in elems 0, 2, 4, ...
pshufb / store ymm0 / ret
Using pmovzx and shifts to avoid pand is a win compared to generating the constant on the fly, I think, but probably not otherwise. It takes 2 extra shifts and a por. It's an option for the 16B non-AVX version, but it's SSE4.1.
Optimized for code-size (fits in 32 (0x20) bytes)
(Derived from Frank's loop)
Using cmov instead of the LUT to handle 0-9 vs. a-f might take fewer than 16B of extra code size. That might be fun: edits welcome.
The ways to get a nibble from the bottom of rsi into an otherwise-zeroed rax include:
mov al, sil (3B (REX required for sil)) / and al, 0x0f (2B special encoding for and al, imm8).
mov eax, esi (2B) / and eax, 0x0f (3B): same size and doesn't require an xor beforehand to zero the upper bytes of rax.
This would be smaller if the args were reversed, so the dest buffer was already in rdi. stosb is a tiny instruction (but slower than mov [rdi], al / inc rdi), so it actually saved overall bytes to use xchg rdi, rsi to set up for it. Changing the function signature could save 5 bytes: void reg_to_hex(char buf[16], uint64_t val) would save two bytes from not having to return buf in rax, and 3 bytes from dropping the xchg. The caller will probably use 16B of stack, and having the caller do a mov rdx, rsp instead of mov rdx, rax before calling another function / syscall on the buffer doesn't save anything.
The next function is probably going to ALIGN 16, though, so shrinking the function to even smaller than 32B isn't as useful as getting it inside half a cache-line.
Absolute addressing for the LUT (hex_xlat) would save a few bytes
(use mov al, byte [hex_xlat + rax] instead of needing the lea).
global register_to_hex_size
register_to_hex_size:
push rsi ; pushing/popping return value (instead of mov rax, rsi) frees up rax for stosb
xchg rdi, rsi ; allows stosb. Better: remove this and change the function signature
mov cl, 16 ; 3B shorter than mov ecx, 16
lea rdx, [rel hex_xlat]
;ALIGN 16
.loop:
rol rsi, 4
mov eax, esi ; mov al, sil to allow 2B AND AL,0xf requires a 2B xor eax,eax
and eax, 0x0f
mov al, byte [rdx+rax]
stosb
;; loop .loop ; setting up ecx instead of cl takes more bytes than loop saves
dec cl
jne .loop
pop rax ; get the return value back off the stack
ret
Using xlat costs 2B (to save/restore rbx), but saves 3B, for a net savings of 1B. It's a 3-uop instruction, with 7c latency, one per 2c throughput (Intel Skylake). The latency and throughput aren't a problem here, since each iteration is a separate dependency chain, and there's too much overhead for this to run at one clock per iteration anyway. So the main problem is that it's 3 uops, making it less uop-cache-friendly. With xlat, the loop becomes 10 uops instead of 8 (using stosb), so that sucks.
112: 89 f0 mov eax,esi
114: 24 0f and al,0xf
116: d7 xlat BYTE PTR ds:[rbx]
117: aa stos BYTE PTR es:[rdi],al
vs.
f1: 89 f0 mov eax,esi
f3: 83 e0 0f and eax,0xf
f6: 8a 04 02 mov al,BYTE PTR [rdx+rax*1]
f9: aa stos BYTE PTR es:[rdi],al
Interestingly, this still has no partial-register stalls, because we never read a wide register after writing only part of it. mov eax, esi is write-only, so it cleans up the partial-reg-ness from the load into al. So there would be no advantage to using movzx eax, byte [rdx+rax]. Even when we return to the caller, the pop rax doesn't leave the caller susceptible to partial-reg problems.
(If we don't bother returning the input pointer in rax, then the caller could have a problem. Except in that case it shouldn't be reading rax at all. Usually it only matters if you call with call-preserved registers in a partial-reg state, because the called function might push them. Or more obviously, with arg-passing / return-value registers.)
Efficient version (uop-cache friendly)
Looping backwards didn't turn out to save any instructions or bytes, but I've included this version because it's more different from the version in Frank's answer.
ALIGN 16
global register_to_hex_countdown
register_to_hex_countdown:
;;; work backwards in the buffer, starting with the least-significant nibble as the last char
mov rax, rsi ; return value, and loop bound
add rsi, 15 ; last char of the buffer
lea rcx, [rel hex_xlat] ; position-independent code
ALIGN 16
.loop:
mov edx, edi
and edx, 0x0f ; isolate low nibble
mov dl, byte [rcx+rdx] ; look up the ascii encoding for the hex digit
; rdx is an 'index' with range 0x0 - 0xf
; non-PIC version: mov dl, [hex_digits + rdx]
mov byte [rsi], dl
shr rdi, 4
dec rsi
cmp rsi, rax
jae .loop ; rsi counts backwards down to its initial value
ret
The whole thing is only 12 insns (11 uops with macro-fusion, or 12 including the NOP for alignment). Some CPUs can fuse cmp/jcc but not dec/jcc (e.g. AMD, and Nehalem)
Another option for looping backwards was mov ecx, 15, and store with mov [rsi+rcx], dl, but two-register addressing modes can't micro-fuse. Still, that would only bring the loop up to 8 uops, so it would be fine.
Instead of always storing 16 digits, this version could use rdi becoming zero as the loop condition to avoid printing leading zeros. i.e.
add rsi, 16
...
.loop:
...
dec rsi
mov byte [rsi], dl
shr rdi, 4
jnz .loop
; lea rax, [rsi+1] ; correction not needed because of adjustments to how rsi is managed
mov rax, rsi
ret
Printing from rax to the end of the buffer gives just the significant digits of the integer.
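Filled in as a complete function, the leading-zero-skipping variant might look like the sketch below. The function name and the hex_lut label are mine; the calling convention follows the other versions (rdi = value, rsi = 16-byte buffer). A value of 0 still produces the single digit "0", because the loop body runs once before the shift result is tested.

```nasm
default rel

section .rodata
hex_lut: db "0123456789abcdef"

section .text
global register_to_hex_trimmed

;; rdi = value, rsi = 16-byte buffer
;; returns rax = pointer to the first significant digit;
;; the digits run from there to buffer+16
register_to_hex_trimmed:
        add  rsi, 16                ; one past the end of the buffer
        lea  rcx, [hex_lut]
.loop:
        mov  edx, edi
        and  edx, 0x0f              ; isolate the low nibble
        mov  dl, [rcx + rdx]        ; look up its ASCII digit
        dec  rsi                    ; step back, then store
        mov  [rsi], dl
        shr  rdi, 4                 ; sets ZF once no digits remain
        jnz  .loop
        mov  rax, rsi
        ret
```

A caller would then pass rax as the buffer pointer and buffer+16 minus rax as the length to write.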

Resources