I have written assembly code to print the numbers from 1 to 9, but it only prints 1; nothing else is printed and I get only one output. That suggests the loop is not running either. I can't figure out what is wrong with my code.
section .bss
lena equ 1024
outbuff resb lena
section .data
section .text
global _start
_start:
nop
mov cx,0
incre:
inc cx
add cx,30h
mov [outbuff],cx
cmp cx,39h
jg done
cmp cx,39h
jl print
print:
mov rax,1 ;sys_write
mov rdi,1
mov rsi,outbuff
mov rdx,lena
syscall
jmp incre
done:
mov rax,60 ;sys_exit
mov rdi,0
syscall
My OS is 64-bit Linux. This code is built with NASM using the following commands: nasm -f elf64 -g -o num.o num.asm and ld -o num num.o
Answer rewritten after some experimentation.
There are two errors in your code, and a few inefficiencies.
First, you add 0x30 to the number (to turn it from the number 1 into the ASCII character 1). However, you do that addition inside the loop. As a result, on the first iteration cx is 0x31, on the second 0x62 ("b"), on the third 0x93 (an invalid UTF-8 sequence), and so on.
Just initialize cx to 0x30 and remove the add from inside the loop.
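But there's another problem. RCX is clobbered during system calls: the syscall instruction overwrites RCX with the return address (and R11 with the saved flags). Replacing cx with r12, which the kernel preserves, causes the program to work.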
In addition to that, you pass the buffer's length to write, but it only has one character. The program so far:
section .bss
lena equ 1024
outbuff resb lena
section .data
section .text
global _start
_start:
nop
mov r12,30h
incre:
inc r12
mov [outbuff],r12
cmp r12,39h
jg done
cmp r12,39h
jl print
print:
mov rax,1 ;sys_write
mov rdi,1
mov rsi,outbuff
mov rdx,1
syscall
jmp incre
done:
mov rax,60 ;sys_exit
mov rdi,0
syscall
Even now, the code is inefficient. You have two compares on the same condition, and one of them branches to the very next instruction.
Your code would also be faster and smaller if you moved the loop's exit condition to the end of the loop. Also, cx is a 16-bit register and r12 is a 64-bit register, but we actually only need 8 bits. Using larger registers than needed means our immediates take up extra space in memory and the cache. We therefore switch to the 8-bit variant of r12. After these changes, we get:
section .bss
lena equ 1024
outbuff resb lena
section .data
section .text
global _start
_start:
nop
mov r12b,30h
incre:
inc r12b
mov [outbuff],r12b
mov rax,1 ;sys_write
mov rdi,1
mov rsi,outbuff
mov rdx,1
syscall
cmp r12b,39h
jl incre
mov rax,60 ;sys_exit
mov rdi,0
syscall
There's still a lot more you can do. For example, you call the write system call 9 times instead of filling the buffer and then calling it once (despite having allocated a 1024-byte buffer). It will probably also be faster to initialize r12 with zero (xor r12, r12) and then add 0x30 (not relevant for the 8-bit version of the register).
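For readers who find a higher-level model easier to follow, here is a minimal C sketch of the "fill the buffer, then write once" idea. The function names fill_digits and print_digits are made up for this illustration; the logic is a model of the suggestion, not a translation of the assembly above.

```c
#include <stddef.h>
#include <unistd.h>

/* Fill buf with the ASCII digits '1'..'9' and return the byte count.
   The caller must provide at least 9 bytes (an assumption of this sketch). */
size_t fill_digits(char *buf)
{
    size_t n = 0;
    for (char c = '1'; c <= '9'; ++c)
        buf[n++] = c;
    return n;
}

/* One write for the whole buffer instead of one system call per digit. */
ssize_t print_digits(void)
{
    char buf[16];
    size_t len = fill_digits(buf);
    return write(STDOUT_FILENO, buf, len);
}
```

In assembly the same shape would be a loop that stores into outbuff through an advancing pointer, followed by a single sys_write whose length is the number of bytes stored.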
I want to print the value in %RCX directly to the console, let's say an ASCII value. I've searched through some wise books and tutorials, but all use buffers to pass anything. Is it possible to print anything without creating special buffer for that purpose?
Let's say I am here (all the answers I found are far too complicated for me and use a different syntax):
movq $5, %rax
...???(print %rax)
Output on console:
>5
For example, to print a buffer I use this code:
SYSWRITE = 4
STDOUT = 1
EXIT_SUCCESS = 0
.text
buff: .ascii "Anything to print\n"
buff_len = . - buff
movq $SYSWRITE, %eax
mov $STDOUT, %ebx
mov $buff, %ecx
mov $buff_len, %edx
No C code or different assembly syntax allowed!
To print a register (in hex or decimal), the output routine (write to stdout, stderr, etc.) expects ASCII characters. Writing the register's raw bytes makes the terminal try to display whatever characters those byte values happen to encode. You may get lucky if every byte in the register happens to fall in the printable range.
You will need to convert the value with a routine that produces decimal or hex digits. Here is an example that converts a 64-bit register to its hex representation (Intel syntax, NASM):
section .rodata
hex_xlat: db "0123456789abcdef"
section .text
; Called with RDI is the register to convert and
; RSI for the buffer to fill
;
register_to_hex:
push rsi ; Save for return
xor eax,eax
mov ecx, 16 ; looper
lea rdx, [rel hex_xlat] ; position-independent code can't index a static array directly
ALIGN 16
.loop:
rol rdi, 4 ; dil now has high bit nibble
mov al, dil ; capture low nibble
and al, 0x0f
mov al, byte [rdx+rax] ; look up the ASCII encoding for the hex digit
; rax is an 'index' with range 0x0 - 0xf.
; The upper bytes of rax are still zero from xor
mov byte [rsi], al ; store in print buffer
inc rsi ; position next pointer
dec ecx
jnz .loop
.exit:
pop rax ; Get original buffer pointer
ret
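As a cross-check of what the loop computes, here is a small C model of the same rotate-and-look-up idea. It is only a sketch for comparing results: the name register_to_hex_c is invented here, and the rotate is spelled out with shifts because C has no rotate operator.

```c
#include <stdint.h>

/* Convert val to exactly 16 lowercase hex digits in buf (no terminator),
   most significant digit first. Returns buf, like the routine above. */
char *register_to_hex_c(uint64_t val, char *buf)
{
    static const char hex_xlat[] = "0123456789abcdef";
    for (int i = 0; i < 16; ++i) {
        val = (val << 4) | (val >> 60);   /* rol val, 4: high nibble is now lowest */
        buf[i] = hex_xlat[val & 0x0f];    /* look up the digit for that nibble */
    }
    return buf;
}
```

After 16 rotations of 4 bits each, val is back to its original value, which is why the assembly version can rotate rdi in place.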
This answer is an addendum to the answer given by Frank, and utilizes the mechanism used there to do the conversion.
You mention the register %RCX in your question. This suggests you are looking at 64-bit code and that your environment is likely GCC/GAS (GNU Assembler) based since % is usually the AT&T style prefix for registers.
With that in mind I've created a quick and dirty macro that can be used inline anywhere you need to print a 64-bit register, 64-bit memory operand, or a 32-bit immediate value in GNU Assembly. This version was a proof of concept and could be amended to support 64 bit immediate values. All the registers that are used are preserved, and the code will also account for the Linux 64-bit System V ABI red zone.
The code below is commented to point out what is occurring at each step.
printmac.inc:
.macro memreg_to_hex src # Macro takes one input
# src = memory operand, register,
# or 32 bit constant to print
# Define the translation table only once for the current object
.ifndef MEMREG_TO_HEX_NOT_FIRST
.set MEMREG_TO_HEX_NOT_FIRST, 1
.PushSection .rodata
hex_xlat: .ascii "0123456789abcdef"
.PopSection
.endif
add $-128,%rsp # Avoid 128 byte red zone
push %rsi # Save all registers that will be used
push %rdi
push %rdx
push %rcx
push %rbx
push %rax
push %r11 # R11 is destroyed by SYSCALL
mov \src, %rdi # Move src value to RDI for processing
# Output buffer on stack at RSP-16 to RSP-1
lea -16(%rsp),%rsi # RSI = output buffer on stack
lea hex_xlat(%rip), %rdx # RDX = translation buffer address
xor %eax,%eax # RAX = Index into translation array
mov $16,%ecx # 16 nibbles to print
.align 16
1:
rol $4,%rdi # rotate high nibble to low nibble
mov %dil,%al # dil now has previous high nibble
and $0xf,%al # mask off all but low nibble
mov (%rdx,%rax,1),%al # Lookup in translation table
mov %al,(%rsi) # Store in output buffer
inc %rsi # Update output buffer address
dec %ecx
jne 1b # Loop until counter is 0
mov $1,%eax # Syscall 1 = sys_write
mov %eax,%edi # EDI = 1 = STDOUT
mov $16,%edx # EDX = Number of chars to print
sub %rdx,%rsi # RSI = beginning of output buffer
syscall
pop %r11 # Restore all registers used
pop %rax
pop %rbx
pop %rcx
pop %rdx
pop %rdi
pop %rsi
sub $-128,%rsp # Restore stack
.endm
printtest.s:
.include "printmac.inc"
.global main
.text
main:
mov $0x123456789abcdef,%rcx
memreg_to_hex %rcx # Print the 64-bit value 0x123456789abcdef
memreg_to_hex %rsp # Print address containing ret pointer
memreg_to_hex (%rsp) # Print return pointer
memreg_to_hex $0x402 # Doesn't support 64-bit immediates
# but can print anything that fits a DWORD
retq
This can be compiled and linked with:
gcc -m64 printtest.s -o printtest
The macro doesn't print an end of line character so the output of the test program looks like:
0123456789abcdef00007fff5283d74000007f5c4a080a500000000000000402
The memory addresses will be different.
Since the macros are inlined, each time you invoke the macro the entire code will be emitted. The code is space inefficient. The bulk of the code could be moved to an object file you can include at link time. Then a stub macro could wrap a CALL to the main printing function.
The code doesn't use printf because at some point I thought I saw a comment that you couldn't use the C library. If that's not the case this can be simplified greatly by calling printf to format the output to print a 64-bit hexadecimal value.
Just for fun, here are a couple other sequences for storing a hex string from a register. Printing the buffer is not the interesting part, IMO; copy that part from Michael's excellent answer if needed.
I tested some of these. I've included a main that calls one of these functions and then uses printf("%s\n%lx\n", result, test_value); to make it easy to spot problems.
Test main():
extern printf
global main
main:
push rbx
mov rdi, 0x1230ff56dcba9911
mov rbx, rdi
sub rsp, 32
mov rsi, rsp
mov byte [rsi+16], 0
call register_to_hex_ssse3
mov rdx, rbx
mov edi, fmt
mov rsi, rsp
xor eax,eax
call printf
add rsp, 32
pop rbx
ret
section .rodata
fmt: db `%s\n%lx\n`, 0 ; YASM doesn't support `string with escapes`, so this only assembles with NASM.
; NASM needs
; %use smartalign
; ALIGNMODE p6, 32
; or similar, to stop it using braindead repeated single-byte NOPs for ALIGN
SSSE3 pshufb for the LUT
This version doesn't need a loop, but the code size is much larger than the rotate-loop versions because SSE instructions are longer.
section .rodata
ALIGN 16
hex_digits:
hex_xlat: db "0123456789abcdef"
section .text
;; rdi = val rsi = buffer
ALIGN 16
global register_to_hex_ssse3
register_to_hex_ssse3: ;;;; 0x39 bytes of code
;; use PSHUFB to do 16 nibble->ASCII LUT lookups in parallel
movaps xmm5, [rel hex_digits]
;; x86 is little-endian, but we want the hex digit for the high nibble to be the first character in the string
;; so reverse the bytes, and later unpack nibbles like [ LO HI ... LO HI ]
bswap rdi
movq xmm1, rdi
;; generate a constant on the fly, rather than loading
;; this is a bit silly: we already load the LUT, so we might as well load another 16B constant from the same cache line and use it as a memory operand for PAND, since we only use it once
pcmpeqw xmm4,xmm4
psrlw xmm4, 12
packuswb xmm4,xmm4 ; [ 0x0f 0x0f 0x0f ... ] mask for low-nibble of each byte
movdqa xmm0, xmm1 ; xmm0 = low nibbles at the bottom of each byte
psrlw xmm1, 4 ; xmm1 = high nibbles at the bottom of each byte (with garbage from next byte)
punpcklbw xmm1, xmm0 ; unpacked nibbles (with garbage in the high 4b of some bytes)
pand xmm1, xmm4 ; mask off the garbage bits because pshufb reacts to the MSB of each element. Delaying until after interleaving the hi and lo nibbles means we only need one PAND
pshufb xmm5, xmm1 ; xmm5 = the hex digit for the corresponding nibble in xmm1
movups [rsi], xmm5
ret
AVX2: you can do two integers at once, with something like
int64x2_to_hex_avx2: ; (const char buf[32], uint64_t first, uint64_t second)
bswap rsi ; We could replace the two bswaps with one 256b vpshufb, but that would require a mask
vmovq xmm1, rsi
bswap rdx
vpinsrq xmm1, xmm1, rdx, 1
vpmovzxbw ymm1, xmm1 ; upper lane = rdx, lower lane = rsi, with each byte zero-extended to a word element
vpsllw ymm1, ymm1, 12 ; shift the high nibbles out, leaving the low nibbles at the top of each word
vpor ymm0, ymm0, ymm1 ; merge while hi and lo elements both need the same shift
vpsrlw ymm1, ymm1, 4 ; low nibbles in elems 1, 3, 5, ...
; high nibbles in elems 0, 2, 4, ...
pshufb / store ymm0 / ret
Using pmovzx and shifts to avoid pand is a win compared to generating the constant on the fly, I think, but probably not otherwise. It takes 2 extra shifts and a por. It's an option for the 16B non-AVX version, but it's SSE4.1.
Optimized for code-size (fits in 32 (0x20) bytes)
(Derived from Frank's loop)
Using cmov instead of the LUT to handle 0-9 vs. a-f might take fewer than 16B of extra code size. That might be fun: edits welcome.
The ways to get a nibble from the bottom of rsi into an otherwise-zeroed rax include:
mov al, sil (3B (REX required for sil)) / and al, 0x0f (2B special encoding for and al, imm8).
mov eax, esi (2B) / and eax, 0x0f (3B): same size and doesn't require an xor beforehand to zero the upper bytes of rax.
This would be smaller if the args were reversed, so the dest buffer was already in rdi. stosb is a tiny instruction (but slower than mov [rdi], al / inc rdi), so it actually saved bytes overall to use xchg rdi, rsi to set up for it. Changing the function signature could save 5 bytes: void reg_to_hex(char buf[16], uint64_t val) would save two bytes from not having to return buf in rax, and 3 bytes from dropping the xchg. The caller will probably use 16B of stack, and having the caller do a mov rdx, rsp instead of mov rdx, rax before calling another function / syscall on the buffer doesn't save anything.
The next function is probably going to ALIGN 16, though, so shrinking the function to even smaller than 32B isn't as useful as getting it inside half a cache-line.
Absolute addressing for the LUT (hex_xlat) would save a few bytes
(use mov al, byte [hex_xlat + rax] instead of needing the lea).
global register_to_hex_size
register_to_hex_size:
push rsi ; pushing/popping return value (instead of mov rax, rsi) frees up rax for stosb
xchg rdi, rsi ; allows stosb. Better: remove this and change the function signature
mov cl, 16 ; 3B shorter than mov ecx, 16
lea rdx, [rel hex_xlat]
;ALIGN 16
.loop:
rol rsi, 4
mov eax, esi ; mov al, sil to allow 2B AND AL,0xf requires a 2B xor eax,eax
and eax, 0x0f
mov al, byte [rdx+rax]
stosb
;; loop .loop ; setting up ecx instead of cl takes more bytes than loop saves
dec cl
jne .loop
pop rax ; get the return value back off the stack
ret
Using xlat costs 2B (to save/restore rbx), but saves 3B, for a net savings of 1B. It's a 3-uop instruction, with 7c latency, one per 2c throughput (Intel Skylake). The latency and throughput aren't a problem here, since each iteration is a separate dependency chain, and there's too much overhead for this to run at one clock per iteration anyway. So the main problem is that it's 3 uops, making it less uop-cache-friendly. With xlat, the loop becomes 10 uops instead of 8 (using stosb), so that sucks.
112: 89 f0 mov eax,esi
114: 24 0f and al,0xf
116: d7 xlat BYTE PTR ds:[rbx]
117: aa stos BYTE PTR es:[rdi],al
vs.
f1: 89 f0 mov eax,esi
f3: 83 e0 0f and eax,0xf
f6: 8a 04 02 mov al,BYTE PTR [rdx+rax*1]
f9: aa stos BYTE PTR es:[rdi],al
Interestingly, this still has no partial-register stalls, because we never read a wide register after writing only part of it. mov eax, esi is write-only, so it cleans up the partial-reg-ness from the load into al. So there would be no advantage to using movzx eax, byte [rdx+rax]. Even when we return to the caller, the pop rax doesn't leave the caller susceptible to partial-reg problems.
(If we don't bother returning the input pointer in rax, then the caller could have a problem. Except in that case it shouldn't be reading rax at all. Usually it only matters if you call with call-preserved registers in a partial-reg state, because the called function might push them. Or more obviously, with arg-passing / return-value registers.)
Efficient version (uop-cache friendly)
Looping backwards didn't turn out to save any instructions or bytes, but I've included this version because it's more different from the version in Frank's answer.
ALIGN 16
global register_to_hex_countdown
register_to_hex_countdown:
;;; work backwards in the buffer, starting with the least-significant nibble as the last char
mov rax, rsi ; return value, and loop bound
add rsi, 15 ; last char of the buffer
lea rcx, [rel hex_xlat] ; position-independent code
ALIGN 16
.loop:
mov edx, edi
and edx, 0x0f ; isolate low nibble
mov dl, byte [rcx+rdx] ; look up the ascii encoding for the hex digit
; rdx is an 'index' with range 0x0 - 0xf
; non-PIC version: mov dl, [hex_digits + rdx]
mov byte [rsi], dl
shr rdi, 4
dec rsi
cmp rsi, rax
jae .loop ; rsi counts backwards down to its initial value
ret
The whole thing is only 12 insns (11 uops with macro-fusion, or 12 including the NOP for alignment). Some CPUs can fuse cmp/jcc but not dec/jcc (e.g. AMD, and Nehalem)
Another option for looping backwards was mov ecx, 15, and store with mov [rsi+rcx], dl, but two-register addressing modes can't micro-fuse. Still, that would only bring the loop up to 8 uops, so it would be fine.
Instead of always storing 16 digits, this version could use rdi becoming zero as the loop condition to avoid printing leading zeros. i.e.
add rsi, 16
...
.loop:
...
dec rsi
mov byte [rsi], dl
shr rdi, 4
jnz .loop
; lea rax, [rsi+1] ; correction not needed because of adjustments to how rsi is managed
mov rax, rsi
ret
Printing from rax to the end of the buffer gives just the significant digits of the integer.
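The leading-zero-skipping variant can be modelled in C as follows. This is a sketch under the same conventions as the fragment above (a 16-byte buffer filled from the end); the name hex_trimmed is hypothetical.

```c
#include <stdint.h>

/* Write the hex digits of val right-aligned into buf[0..15], without
   leading zeros. Returns a pointer to the first significant digit; the
   digits run from there to buf + 16. val == 0 yields a single '0'. */
char *hex_trimmed(uint64_t val, char buf[16])
{
    static const char hex_digits[] = "0123456789abcdef";
    char *p = buf + 16;                 /* one past the last char */
    do {
        *--p = hex_digits[val & 0x0f];  /* store the low nibble's digit */
        val >>= 4;                      /* shr rdi, 4 */
    } while (val != 0);                 /* same exit test as jnz .loop */
    return p;
}
```

The do/while form mirrors the assembly: the store happens before the test, so a zero input still produces one digit.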
I have a problem. I tried to build a loop in assembly (NASM, Linux). The loop should "cout" the numbers 0 to 10, but it doesn't work and I don't know why. Here is the code:
section .text
global _start
_start:
xor esi,esi
_ccout:
cmp esi,10
jnl _end
inc esi
mov eax,4
mov ebx,1
mov ecx,esi
mov edx,2
int 80h
jmp _ccout
_end:
mov eax,1
int 80h
section .data
Well, the loop is working, but you aren't using the syscall correctly. There are some magic numbers involved here, so let's get that out of the way first:
4 is the syscall number for write
1 is the file descriptor for the standard output
So far, so good. write requires a file descriptor, the address of a buffer and the length of that buffer or the part of it that it's supposed to write to the file descriptor. So, the way this is supposed to look is similar to
mov eax,4 ; write syscall
mov ebx,1 ; stdout
mov ecx,somewhere_in_memory ; buffer
mov edx,1 ; one byte at a time
Compare that to your code:
mov eax,4
mov ebx,1
mov ecx,esi ; <-- particularly here
mov edx,2
int 80h
What you are doing there (apart from passing the wrong length) is passing the contents of esi to write as the memory address from which to read the bytes it should write to stdout. This doesn't crash, because the kernel rejects the unmapped address and write simply returns an error (-EFAULT), but nothing useful is printed.
To solve this, you will need a location in memory to hold the character. Moreover, since write works on characters, not numbers, you'll have to do the formatting yourself by adding '0' (which is 48 in ASCII). All in all, it could look something like this:
section .data
text db 0 ; text is a byte in memory
section .text
global _start
_start:
xor esi,esi
_ccout:
cmp esi,10
jnl _end
inc esi
lea eax,['0'+esi] ; print '0' + esi. lea == load effective address
mov [text],al ; is useful here even though we're not really working on addresses
mov eax,4 ; write
mov ebx,1 ; to fd 1 (stdout)
mov ecx,text ; from address text
mov edx,1 ; 1 byte
int 80h
jmp _ccout
_end:
mov [text],byte 10 ; 10 == newline
mov eax,4 ; write that
mov ebx,1 ; like before.
mov ecx,text
mov edx,1
int 80h
mov eax,1
mov ebx,0
int 80h
The output, 123456789:, is probably not exactly what you want (the trailing ':' is '0' + 10, the character after '9' in ASCII), but you should be able to take it from here. Exercise for the reader and all that.
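One way to approach that exercise is to format numbers with more than one digit by repeated division by 10. The following C sketch shows the idea; format_decimal is a name invented for this example, and it assumes unsigned int is 32 bits wide (at most 10 decimal digits).

```c
#include <stddef.h>

/* Write the decimal digits of n into buf and return how many bytes were
   written (no terminator). buf needs room for 10 bytes, the longest
   32-bit value (assumption: unsigned int is 32 bits). */
size_t format_decimal(unsigned int n, char *buf)
{
    char tmp[10];
    size_t len = 0;
    do {
        tmp[len++] = (char)('0' + n % 10);  /* least significant digit first */
        n /= 10;
    } while (n != 0);
    for (size_t i = 0; i < len; ++i)        /* reverse into the output buffer */
        buf[i] = tmp[len - 1 - i];
    return len;
}
```

In assembly, div with a divisor of 10 yields the quotient and the remainder in one instruction, and the returned length is what would go into edx for the write call.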
I am trying to understand the short jmp instruction. I have a very simple program, compiled with nasm:
SECTION .data
bsh: db "/bin/sh",0
arr: dq bsh,0
SECTION .text
global main
main:
jmp short 0x20
mov edx, 0
mov rsi, arr
mov rdi, bsh
mov rax, 0x3b
syscall
mov ebx, 0
mov eax, 0x3c
syscall
Disassembled, the code looks like this in gdb (disassemble main):
0x00000000004000b0 <+0>: jmp 0x4000d1 <main+33>
0x00000000004000b2 <+2>: mov $0x0,%edx
0x00000000004000b7 <+7>: movabs $0x6000e8,%rsi
0x00000000004000c1 <+17>: movabs $0x6000e0,%rdi
0x00000000004000cb <+27>: mov $0x3b,%eax
0x00000000004000d0 <+32>: syscall
0x00000000004000d2 <+34>: mov $0x0,%ebx
0x00000000004000d7 <+39>: mov $0x3c,%eax
0x00000000004000dc <+44>: syscall
I'm trying to jump to 0x4000d2. 34 - 2 = 32 = 0x20. 0x4000d2 - 0x4000b2 = 0x20. No matter what I assemble, nasm always seems to code the jump address as an offset from one byte past the start of the jump instruction. Why is jmp short 0x20 assembling wrong? (not to mention that jmp 0x20 had a different result, and was a 5 byte instruction instead of a 2 byte instruction)
I'm also reading about smashing the stack for fun and profit. Aleph1 wants to jump from jmp to call and then from call to popl. This is the code he uses:
jmp 0x26 # 2 bytes
popl %esi # 1 byte
movl %esi,0x8(%esi) # 3 bytes
movb $0x0,0x7(%esi) # 4 bytes
movl $0x0,0xc(%esi) # 7 bytes
movl $0xb,%eax # 5 bytes
movl %esi,%ebx # 2 bytes
leal 0x8(%esi),%ecx # 3 bytes
leal 0xc(%esi),%edx # 3 bytes
int $0x80 # 2 bytes
movl $0x1, %eax # 5 bytes
movl $0x0, %ebx # 5 bytes
int $0x80 # 2 bytes
call -0x2b # 5 bytes
.string \"/bin/sh\" # 8 bytes
Adding up the bytes from popl %esi to call -0x2b I get 42. Shouldn't the first instruction then be jmp 0x2a? And subtracting bytes from the end of the call instruction to the beginning of popl %esi I get -47. Shouldn't the call be call -0x2f? When he actually creates a C file and puts his assembly in an __asm__ block, he uses the offsets I calculated, but not in this code, which comes before that. What changed?
And while I'm here, couldn't he have just accessed eip and used that to get the relative offset of the string in memory?
With Intel syntax, this should be:
jmp short $+022h ;jump from current location ($ == 4000b0) to 4000d2
Note that a long (5-byte) jump written with the same $+022h syntax would still end up jumping to 4000d2, because the assembler would generate a correspondingly smaller offset field (01dh, since the offset is measured from the end of the instruction). This type of usage is rare, with the most common exception being jmp short $+2 used in legacy code to generate very short delays between I/O device accesses.
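The arithmetic behind the encoded offset can be written down as a small C helper. This is an illustrative sketch, not part of any assembler: jmp_short_rel8 is a hypothetical name, and it only models the 2-byte short form (opcode EB followed by one signed displacement byte).

```c
#include <stdint.h>

/* Displacement byte for a 2-byte "jmp short" located at jmp_addr.
   The CPU adds the displacement to the address of the NEXT instruction,
   i.e. jmp_addr + 2. Returns 0 and stores the byte on success, or -1 if
   the target is outside the rel8 range (-128..127). */
int jmp_short_rel8(uint64_t jmp_addr, uint64_t target, int8_t *rel8)
{
    int64_t disp = (int64_t)(target - (jmp_addr + 2));
    if (disp < -128 || disp > 127)
        return -1;
    *rel8 = (int8_t)disp;
    return 0;
}
```

With the addresses from the question, a jmp at 0x4000b0 targeting 0x4000d2 gives a displacement of 0x20, which is the byte that jmp short $+022h assembles to.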
I want to make a system call in assembly. The problem is that I can't mov ecx, rsp: rsp is a 64-bit register and ecx is a 32-bit register. I want to pass the buffer address as a parameter of this syscall. What can I do? Thanks.
section .data
s0: db "Largest basic function number supported:%s\n",0
s0len: equ $-s0
section .text
global main
extern write
main:
sub rsp, 16
xor eax, eax
cpuid
mov [rsp], ebx
mov [rsp+4], edx
mov [rsp+8], ecx
mov [rsp+12], word 0x0
mov eax, 4
mov ebx, 1
mov ecx, rsp
mov edx, 4
int 80h
mov eax, 4
mov ebx, 1
mov ecx, s0
mov edx, s0len
int 80h
mov eax, 1
int 80h
To make a system call in 64-bit Linux, place the system call number in rax, and its arguments, in order, in rdi, rsi, rdx, r10, r8, and r9, then invoke syscall.
Note that 64-bit call numbers are different from 32-bit call numbers.
Here is an example in GAS syntax. NASM syntax for putting an address in a register is lea rsi, [rel message] using a RIP-relative LEA.
.global _start
.text
_start:
# write(1, message, 13)
mov $1, %rax # system call 1 is write
mov $1, %rdi # file handle 1 is stdout
lea message(%rip), %rsi # address of string to output
mov $13, %rdx # number of bytes
syscall
# exit(0)
mov $60, %rax # system call 60 is exit
xor %rdi, %rdi # return code 0
syscall
.section .rodata # read-only data section
message:
.ascii "Hello, World\n"
See also What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?
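For comparison, the same call can be made from C through glibc's generic syscall() wrapper, which performs the register setup described above. This is a minimal sketch assuming Linux with glibc; raw_write is a name chosen for the example.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_write */
#include <unistd.h>        /* syscall() */

/* Invoke the write system call through the generic syscall() wrapper.
   On x86-64 the wrapper places SYS_write in rax and the arguments in
   rdi, rsi and rdx, then executes the syscall instruction. Unlike the
   raw instruction, it returns -1 and sets errno on failure. */
long raw_write(int fd, const void *buf, size_t len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

Calling raw_write(1, "Hello, World\n", 13) corresponds to the write(1, message, 13) sequence in the GAS example.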