Let's say I want to define an initialized string variable before running my assembly program (in section .data). The variable I chose to create is called Digits, and it is a string that contains all the hexadecimal symbols.
Digits: db "0123456789ABCDEF"
I defined the variable with db, which means "define byte". Does this mean that the Digits variable is 8 bits long? This doesn't make sense to me because:
Each character in the string is an ASCII character, therefore I will need 2 bytes for each character. In total, I would need 32 bytes for the whole string!
So what does it mean when I define the variable as byte? Word? Double word? I don't see the difference. Given my misunderstanding, it seems redundant to state the data type for the string.
PS: This question didn't help me to understand.
NASM answer, MASM is totally different
One of the answers on the linked question has a quote from the NASM manual's examples which does answer your question. As requested, I'll expand on it for all three cases (and correct the lower-case vs. upper-case ASCII encoding error!):
db 'ABCDE' ; 0x41 0x42 0x43 0x44 0x45 (5 bytes)
dw 'ABCDE' ; 0x41 0x42 0x43 0x44 0x45 0x00 (6 bytes, 3 words)
dd 'ABCDE' ; 0x41 0x42 0x43 0x44 0x45 0x00 0x00 0x00 (8 bytes, 2 doublewords)
dq 'ABCDE' ; 0x41 0x42 0x43 0x44 0x45 0x00 0x00 0x00 (8 bytes, 1 quadword)
So the difference is that it pads out to a multiple of the element size with zeros when you use dd or dw instead of db.
According to @Jose's comment, some assemblers may use a different byte order for dd or dw string constants. In NASM syntax, the string is always stored in memory in the same order it appears in the quoted constant.
You can assemble this with NASM (e.g. into the default flat binary output) and use hexdump -C or something to confirm the byte ordering and amount of padding.
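As a quick sanity check, the padding rule can also be modelled in a few lines of Python. This is only a sketch of the rule described above, not NASM's implementation, and nasm_string is a hypothetical helper name.

```python
def nasm_string(text: str, elem_size: int) -> bytes:
    """Model how one quoted string is laid out by db/dw/dd/dq.

    elem_size is 1, 2, 4 or 8. The characters keep their source order and
    the result is zero-padded up to a multiple of elem_size.
    """
    data = text.encode("ascii")
    padding = -len(data) % elem_size   # bytes missing to the next multiple
    return data + b"\x00" * padding


for name, size in (("db", 1), ("dw", 2), ("dd", 4), ("dq", 8)):
    laid_out = nasm_string("ABCDE", size)
    print(name, laid_out.hex(" "), f"({len(laid_out)} bytes)")
```

The byte counts it prints can be compared with the hexdump of the real assembler output.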
Note that this padding to the element size applies to each comma-separated element. So the seemingly-innocent dd '%lf', 10, 0 actually assembles like this:
;dd '%lf', 10, 0
db '%lf',0, 10,0,0,0, 0,0,0,0 ;; equivalent with db
Note the 0 before the newline; if you pass a pointer to this to printf, the C string is just "%lf", terminated by the first 0 byte.
(A write system call, or the fwrite function with an explicit length, would print the whole thing, including the 0 bytes, because those functions work on binary data, not C implicit-length strings.)
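The per-element padding can be sketched the same way. The model below assumes only quoted ASCII strings and non-negative integers as elements; nasm_data is a hypothetical helper, not part of any tool.

```python
def nasm_data(elem_size: int, *items) -> bytes:
    """Model a comma-separated db/dw/dd/dq list: every element is
    padded (strings) or widened (integers) to elem_size on its own."""
    out = bytearray()
    for item in items:
        if isinstance(item, str):
            raw = item.encode("ascii")
            raw += b"\x00" * (-len(raw) % elem_size)   # pad this string only
        else:
            raw = item.to_bytes(elem_size, "little")   # little-endian integer
        out += raw
    return bytes(out)


blob = nasm_data(4, "%lf", 10, 0)        # models: dd '%lf', 10, 0
c_string = blob.split(b"\x00", 1)[0]     # what a C string function would see
print(blob)
print(c_string)
```

With element size 1 (db) the same list produces just the five bytes '%', 'l', 'f', 10, 0, which is what a printf format string needs.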
Also note that in NASM, you can do stuff like mov dword [rdi], "abc" to store "abc\0" to memory. i.e. multi-character literals work as numeric literals in any context in NASM.
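To see why the store ends up as "abc\0", here is a small Python sketch of the numeric value such a literal stands for, with the first character in the lowest byte. The variable names are illustrative only.

```python
# "abc" used as a number: first character goes into the least significant byte.
value = int.from_bytes(b"abc", "little")
print(hex(value))

# A dword store writes that number back out in little-endian order,
# so memory receives the characters in source order plus one zero byte.
stored = value.to_bytes(4, "little")
print(stored)
```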
MASM is very different
See When using the MOV mnemonic to load/copy a string to a memory register in MASM, are the characters stored in reverse order? for more. Even in a dd "abcd", MASM breaks your strings, reversing the byte order inside chunks compared to source order.
I want to clarify something:
example: db 'ABCDE';
This reserves 5 bytes in total, each containing a letter.
ex2: db 1 ;
reserves a byte that contains 1
ex3: db "cool";
reserves 4 bytes and each byte contains a letter
ex4: db "cool", 1, 3;
reserves 3 bytes?
Answer: ex4 is 6 bytes (4 for the letters of "cool", plus one byte each for 1 and 3).
For each character in the string "0123456789ABCDEF" you need just one byte. So, the string will occupy 16 bytes in the memory.
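A short Python check of that count. It also shows that each byte prints as two hexadecimal digits, which is possibly where the "2 bytes per character" idea comes from (that is a guess, not something stated in the question). hex_digit is a hypothetical helper showing the usual use of such a table.

```python
digits = "0123456789ABCDEF".encode("ascii")

print(len(digits))           # number of bytes the string occupies
print(digits[:4].hex(" "))   # each byte shown as two hex digits


def hex_digit(nibble: int) -> str:
    """Look up the character for a 4-bit value, as a Digits table would be used."""
    return chr(digits[nibble & 0xF])


print(hex_digit(10))
```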
In case of this declaration:
vark db 1
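you can do this (assemblers that track a variable's declared size, such as MASM, accept it as written; NASM requires an explicit size, e.g. mov byte [vark],128):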
mov [vark],128
and cannot:
mov [vark],1024
but in this case:
vark dw 1
you can.
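The range limits behind this can be checked with a small Python sketch; fits is a hypothetical helper that treats values as unsigned.

```python
def fits(value: int, size_bytes: int) -> bool:
    """Return True if an unsigned value can be stored in size_bytes bytes."""
    try:
        value.to_bytes(size_bytes, "little")
        return True
    except OverflowError:
        return False


print(fits(128, 1))    # byte-sized variable
print(fits(1024, 1))   # too large for one byte
print(fits(1024, 2))   # fine in a word
```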
Related
Suppose I have the following declared:
section .bss
buffer resb 1
And these instructions follow in section .text:
mov al, 5 ; mov-immediate
mov [buffer], al ; store
mov bl, [buffer] ; load
mov cl, buffer ; mov-immediate?
Am I correct in understanding that bl will contain the value 5, and cl will contain the memory address of the variable buffer?
I am confused about the differences between
moving an immediate into a register,
moving a register into an immediate (what goes in, the data or the address?) and
moving an immediate into a register without the brackets
For example, mov cl, buffer vs mov cl, [buffer]
UPDATE: After reading the responses, I suppose the following summary is accurate:
mov edi, array puts the memory address of the zeroth array index in edi. i.e. the label address.
mov byte [edi], 3 puts the VALUE 3 into the zeroth index of the array
after add edi, 3, edi now contains the memory address of the 3rd index of the array
mov al, [array] loads the DATA at the zeroth index into al.
mov al, [array+3] loads the DATA at the third index into al.
mov [al], [array] is invalid because x86 can't encode 2 explicit memory operands, and because al is only 8 bits and can't be used even in a 16-bit addressing mode. See "Referencing the contents of a memory location (x86 addressing modes)".
mov array, 3 is invalid, because you can't say "Hey, I don't like the offset at which array is stored, so I'll call it 3". An immediate can only be a source operand.
mov byte [array], 3 puts the value 3 into the zeroth index (first byte) of the array. The byte specifier is needed to avoid ambiguity between byte/word/dword for instructions with memory, immediate operands; without it, the instruction is an assemble-time error (ambiguous operand size).
Please mention if any of these is false. (editor's note: I fixed syntax errors / ambiguities so the valid ones actually are valid NASM syntax. And linked other Q&As for details)
The square brackets essentially work like a dereference operator (e.g., like * in C).
So, something like
mov REG, x
moves the value of x into REG, whereas
mov REG, [x]
moves the value of the memory location where x points to into REG. Note that if x is a label, its value is the address of that label.
As for your question:
Am I correct in understanding that bl will contain the value 5, and cl
will contain the memory address of the variable buffer?
Yes, you are correct. But beware that, since CL is only 8 bits wide, it will only contain the least significant byte of the address of buffer.
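The four instructions from the question can be modelled in Python to make the difference between an address and the value stored there concrete. The address 0x0804A010 is a made-up example, not something the assembler or linker guarantees.

```python
BUFFER = 0x0804A010          # hypothetical address assigned to the label
memory = {BUFFER: 0}         # sparse model of RAM: address -> byte value

al = 5                       # mov al, 5         (immediate into register)
memory[BUFFER] = al          # mov [buffer], al  (store to memory)
bl = memory[BUFFER]          # mov bl, [buffer]  (load from memory)
cl = BUFFER & 0xFF           # mov cl, buffer    (only the low 8 bits fit)
edi = BUFFER & 0xFFFFFFFF    # mov edi, buffer   (a 32-bit register holds it all)

print(bl, hex(cl), hex(edi))
```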
Indeed, your thinking is correct. That is, bl will contain 5 and cl the memory address of buffer (in fact, the label buffer is a memory address itself).
Now, let me explain the differences between the operations you mentioned:
Moving an immediate into a register can be done using mov reg,imm. What may be confusing is that labels, e.g. buffer, are themselves immediate values that contain an address.
You cannot really move a register into an immediate, since immediate values are constants, like 2 or FF1Ah. What you can do is move a register to the place the constant points to. You can do it like mov [const], reg.
You can also use indirect addressing like mov reg2,[reg1], provided reg1 points to a valid location, and it will transfer the value pointed to by reg1 to reg2.
So, mov cl, buffer will move the address of buffer to cl (which may or may not give the correct address, since cl is only one byte long), whereas mov cl, [buffer] will get the actual value.
Summary
When you use [a], you refer to the value at the place a points to. For example, if a is F5B1, then [a] refers to the value stored at address F5B1 in RAM.
Labels are addresses, i.e. values like F5B1.
Values stored in registers do not have to be referenced as [reg], because registers do not have addresses. In fact, registers can be thought of as immediate values.
You are getting the idea. However, there are a few details worth bearing in mind:
Addresses can be, and usually are, larger than what 8 bits can hold (cl is 8-bit, cx is 16-bit, ecx is 32-bit, rcx is 64-bit). So cl is unlikely to equal the address of the variable buffer. It will only hold the least significant 8 bits of the address.
If there are interrupt routines or threads that can preempt the above code and/or access buffer, the value in bl may differ from 5. Broken interrupt routines may actually affect any register when they fail to preserve register values.
Every instruction that writes an immediate value to a RAM location (or calculates with one) has to specify how many bytes it accesses. The assembler cannot know whether we want to access a byte, a word, or a doubleword, especially when the immediate value is small enough to fit any of them, as the following instructions show.
array db 0FFh, 0FFh, 0FFh, 0FFh
mov byte [array], 3
results:
array db 03h, 0FFh, 0FFh, 0FFh
....
mov word [array], 3
results:
array db 03h, 00h, 0FFh, 0FFh
....
mov dword [array], 3
results:
array db 03h, 00h, 00h, 00h
Dirk
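The three results above can be reproduced with Python's struct module, which writes little-endian integers of a chosen width into a buffer. store is a hypothetical helper; the format codes "<B", "<H" and "<I" stand for byte, word and dword.

```python
import struct


def store(fmt: str) -> bytearray:
    """Start from four 0FFh bytes and write the value 3 with the given width."""
    array = bytearray([0xFF, 0xFF, 0xFF, 0xFF])
    struct.pack_into(fmt, array, 0, 3)
    return array


print(store("<B").hex(" "))   # models: mov byte  [array], 3
print(store("<H").hex(" "))   # models: mov word  [array], 3
print(store("<I").hex(" "))   # models: mov dword [array], 3
```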
I'm trying to encode a binary file into base64.
However, I'm stuck at a few steps, and I'm also not sure if this is the right way to think about it; see the comments in the code below:
SECTION .bss ; Section containing uninitialized data
BUFFLEN equ 6 ; We read the file 6 bytes at a time
Buff: resb BUFFLEN ; Text buffer itself
SECTION .data ; Section containing initialised data
B64Str: db "000000"
B64LEN equ $-B64Str
Base64: db "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
SECTION .text ; Section containing code
global _start ; Linker needs this to find the entry point!
_start:
nop ; This no-op keeps gdb happy...
; Read a buffer full of text from stdin:
Read:
mov eax,3 ; Specify sys_read call
mov ebx,0 ; Specify File Descriptor 0: Standard Input
mov ecx,Buff ; Pass offset of the buffer to read to
mov edx,BUFFLEN ; Pass number of bytes to read at one pass
int 80h ; Call sys_read to fill the buffer
mov ebp,eax ; Save # of bytes read from file for later
cmp eax,0 ; If eax=0, sys_read reached EOF on stdin
je Done ; Jump If Equal (to 0, from compare)
; Set up the registers for the process buffer step:
mov esi,Buff ; Place address of file buffer into esi
mov edi,B64Str ; Place address of line string into edi
xor ecx,ecx ; Clear line string pointer to 0
;;;;;;
; GET 6 bits from input
;;;;;;
;;;;;;
; Convert to B64 char
;;;;;;
;;;;;;
; Print the char
;;;;;;
;;;;;;
; proceed to the next 6 bits
;;;;;;
; All done! Let's end this party:
Done:
mov eax,1 ; Code for Exit Syscall
mov ebx,0 ; Return a code of zero
int 80H ; Make kernel call
So, in words, it should do this:
1) Hex value :
7C AA 78
2) Binary value :
0111 1100 1010 1010 0111 1000
3) Groups in 6 bits :
011111 001010 101001 111000
4) Convert to numbers :
31 10 41 56
5) Each number is a letter, number or symbol :
31 = f
10 = K
41 = p
56 = 4
So, final output is : fKp4
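The worked example can be double-checked with a short Python sketch that follows the same five steps and then compares the result with the standard library's encoder. The variable names are illustrative.

```python
import base64

BASE64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

data = bytes([0x7C, 0xAA, 0x78])                 # step 1: the three input bytes
bits = int.from_bytes(data, "big")               # step 2: one 24-bit number
groups = [(bits >> shift) & 0x3F                 # steps 3-4: four 6-bit values
          for shift in (18, 12, 6, 0)]
encoded = "".join(BASE64[g] for g in groups)     # step 5: table lookup

print(groups)
print(encoded)
print(base64.b64encode(data).decode("ascii"))    # reference result
```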
So, my question is:
How do I get the 6 bits, and how do I convert those bits into a char?
EDIT after a few years:
Recently somebody ran into this example, and while discussing how it works and how to convert it to x64 for 64-bit Linux, I turned it into a fully working example; the source is available here: https://gist.github.com/ped7g/c96a7eec86f9b090d0f33ba36af056c1
There are two main ways to implement it: either a generic loop capable of picking any 6 bits, or fixed code dealing with 24 bits (3 bytes) of input (which produces exactly 4 base64 characters and ends at a byte boundary, so you can read the next 24 bits from offset +3).
Let's say esi points to the source binary data, which is padded with enough zeroes to make reading beyond the end of the input buffer safe (+3 bytes in the worst case).
And edi points to an output buffer (of at least ((input_length+2)/3*4) bytes, maybe with some padding, as B64 requires for the ending sequence).
; convert 3 bytes of input into four B64 characters of output
mov eax,[esi] ; read 3 bytes of input
; (reads actually 4B, 1 will be ignored)
add esi,3 ; advance pointer to next input chunk
bswap eax ; first input byte as MSB of eax
shr eax,8 ; throw away the 1 junk byte (LSB after bswap)
; produce 4 base64 characters backward (last group of 6b is converted first)
; (to make the logic of 6b group extraction simple: "shr eax,6" + "and 0x3F")
mov edx,eax ; get copy of last 6 bits
shr eax,6 ; throw away 6bits being processed already
and edx,0x3F ; keep only last 6 bits
mov bh,[Base64+edx] ; convert 0-63 value into B64 character (4th)
mov edx,eax ; get copy of next 6 bits
shr eax,6 ; throw away 6bits being processed already
and edx,0x3F ; keep only last 6 bits
mov bl,[Base64+edx] ; convert 0-63 value into B64 character (3rd)
shl ebx,16 ; make room in ebx for next character (4+3 in upper 16b)
mov edx,eax ; get copy of next 6 bits
shr eax,6 ; throw away 6bits being processed already
and edx,0x3F ; keep only last 6 bits
mov bh,[Base64+edx] ; convert 0-63 value into B64 character (2nd)
; here eax contains exactly only 6 bits (zero extended to 32b)
mov bl,[Base64+eax] ; convert 0-63 value into B64 character (1st)
mov [edi],ebx ; store four B64 characters as output
add edi,4 ; advance output pointer
After the last group of 3B input you must overwrite the end of the last output with the proper number of '=' to fix the fake zeroes that were output. I.e. 1B of input (needs 8 bits, 2x B64 chars) => output ends with '=='; 2B of input (needs 16b, 3x B64 chars) => ends with '='; 3B of input => full 24 bits used => 4 valid B64 chars.
If you don't want to read the whole file into memory and produce the whole output buffer in memory, you can make the in/out buffers of limited length, e.g. only 900B input -> 1200B output, and process the input in 900B blocks. Or you can use a 3B -> 4B in/out buffer and remove the pointer advancing completely (or even the esi/edi usage, using fixed memory), as you will then have to load/store the in/out data separately for every iteration.
Disclaimer: this code is written to be straightforward, not performant. As you asked how to extract 6 bits and how to convert the value into a character, I guess staying with basic x86 asm instructions is best.
I'm not even sure how to make it perform better without profiling the code for bottlenecks and experimenting with other variants. The partial register usage (bh, bl vs ebx) will surely be costly, so there is very likely a better solution (or maybe even some SIMD-optimized version for larger input blocks).
And I didn't debug that code, I just wrote it here in the answer, so proceed with caution and check in a debugger how/if it works.
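For checking the assembly in a debugger, it can help to have a reference that mirrors the same register steps. The Python sketch below follows the load, bswap, shift and mask sequence one to one; encode_triplet is a hypothetical name, and the buffer is assumed to have at least 4 readable bytes at the given offset, like the padded input described above.

```python
BASE64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def encode_triplet(buf: bytes, offset: int = 0) -> bytes:
    """Convert 3 input bytes into 4 base64 characters, step by step."""
    eax = int.from_bytes(buf[offset:offset + 4], "little")   # mov eax,[esi]
    eax = int.from_bytes(eax.to_bytes(4, "little"), "big")   # bswap eax
    eax >>= 8                                                # shr eax,8
    out = bytearray(4)
    for pos in (3, 2, 1, 0):              # 4th character first, 1st last
        out[pos] = ord(BASE64[eax & 0x3F])   # and edx,0x3F + table lookup
        eax >>= 6                            # shr eax,6
    return bytes(out)


print(encode_triplet(b"\x7c\xaa\x78\x00"))
print(encode_triplet(b"Man\xff"))         # the 4th (junk) byte is ignored
```

The value of eax after each modelled instruction can be compared with the register contents shown by the debugger.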
My program is supposed to encode binary files into base64.
Everything works fine until EOF. I have trouble adding '=' at the end of my output string.
This should happen only when the last chunk of bytes is read. It should fill the empty space. Here is my code to detect when I have to add one or two '='.
Read:
mov eax,3 ; Specify sys_read call
mov ebx,0 ; Specify File Descriptor 0: Standard Input
mov ecx,Bytes ; Pass offset of the buffer to read to
mov edx,BYTESLEN ; Pass number of bytes to read at one pass
int 80h ; Call sys_read to fill the buffer
mov ebp,eax ; Save # of bytes read from file for later
cmp rax,1 ; If EAX=1, only 1 byte was read
je MissingTwoByte ; Jump If Equal (to 1, from compare)
cmp rax,2 ; If EAX=2, only 2 bytes were read
je MissingOneByte ; Jump If Equal (to 2, from compare)
cmp eax,0 ; If EAX=0, sys_read reached EOF on stdin
je Done ; Jump If Equal (to 0, from compare)
So in my MissingOneByte and MissingTwoByte routines, I should add my '=' to Bytes, right? How can I achieve that?
In my previous answer, the code was supposed to always consume 3 bytes, padded with zeroes, and the result was to be fixed/patched afterwards!
I.e. for the single input byte 0x44, Bytes needs to be set to 44 00 00 (the first 44 is set by sys_read, the other two need to be cleared by code). You will get the wrong conversion result RAAA, and then you need to patch it to the correct RA==.
I.e.
SECTION .bss
BYTESLEN equ 3 ; 3 bytes of real buffer are needed
Bytes: resb BYTESLEN + 5; real buffer +5 padding (total 8B)
B64output: resb 4+4 ; 4 bytes are real output buffer
; +4 bytes are padding (total 8B)
SECTION .text
;...
Read:
mov eax,3 ; Specify sys_read call
xor ebx,ebx ; Specify File Descriptor 0: Standard Input
mov ecx,Bytes ; Pass offset of the buffer to read to
mov edx,BYTESLEN ; Pass number of bytes to read at one pass
int 80h ; Call sys_read to fill the buffer
test eax,eax
jl ReadingError ; OS has problem, system "errno" is set
mov ebp,eax ; Save # of bytes read from file for later
jz Done ; 0 bytes read, no more input
; eax = 1, 2, 3
mov [ecx + eax],ebx ; clear padding bytes
; ^^ this is a bit nasty EBX reuse, works only for STDIN (0)
; for any file handle use fixed zero: mov word [ecx+eax],0
call ConvertBytesToB64Output ; convert to Base64 output
; overwrite last two/one/none characters based on how many input
; bytes were read (B64output+3+1 = B64output+4 => beyond 4 chars)
mov word [B64output + ebp + 1], '=='
;TODO store B64output where you wish
cmp ebp,3
je Read ; if 3 bytes were read, loop again
; 1 or 2 bytes will continue with "Done:"
Done:
; ...
ReadingError:
; ...
ConvertBytesToB64Output:
; ...
ret
Written again to be short and simple, without caring much about performance.
The trick that keeps the instructions simple is to have enough padding at the end of the buffers, so you don't need to worry about overwriting memory beyond them. Then you can write two '=' after each output and just position them at the desired place (either overwriting the last two characters, or the last character, or writing them completely outside the output, into the padding area).
Without that, probably a lot of if (length == 1/2/3) {...} else {...} would creep into the code to guard the memory writes, so that only the output buffer is overwritten and nothing more.
So make sure you understand what I did and how it works, and add enough padding to your own buffers.
Also... !disclaimer!: I actually don't know how many = should be at the end of base64 output, and when... that's up to the OP to study in the base64 definition. I'm just showing how to fix the wrong output of the 3B->4B conversion when it takes a shorter input padded with zeroes. Hmm, according to an online BASE64 generator it actually works like my code... (input%3) => 0 has no =, 1 has two =, and 2 has a single =.
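The patching trick can be tried out in Python before stepping through the assembly. In this sketch base64.b64encode merely stands in for ConvertBytesToB64Output applied to the zero-padded chunk, and encode_tail is a hypothetical name; the 8-byte bytearray plays the role of B64output with its padding.

```python
import base64


def encode_tail(chunk: bytes) -> bytes:
    """Encode a final chunk of 1..3 bytes with the 'write == after it' trick."""
    n = len(chunk)                              # ebp: number of bytes read
    padded = chunk + b"\x00" * (3 - n)          # clear the missing input bytes
    out = bytearray(8)                          # 4 output bytes + 4 padding
    out[0:4] = base64.b64encode(padded)         # 3B -> 4B conversion
    out[n + 1:n + 3] = b"=="                    # mov word [B64output+ebp+1],'=='
    return bytes(out[:4])                       # only 4 bytes are real output


for chunk in (b"\x44", b"\x44\x45", b"\x44\x45\x46"):
    print(chunk.hex(" "), "->", encode_tail(chunk).decode("ascii"))
```

For 3 input bytes the two '=' land entirely in the padding area, which is why no length check is needed around the write.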
I am developing a simple 16-bit Real Mode kernel in Assembly, as a DOS clone kernel. I am currently trying to read user input as a string by taking every character they type and appending it to an array. The function should return the array (which I have yet to implement) as a string. However, my code below does not work. Why? Thanks in advance. I am using the NASM assembler, if it makes a difference.
section .bss:
INPUT_STRING: resb 4
section .text:
....
scans:
mov ah, 0x00
.repeat:
int 16h
cmp al, 0
je .done
cmp al, 0x0D
; jump straight to returning the array, implement later
mov [INPUT_STRING+al], 1
add al, 4
jmp .repeat
.done:
ret
You can't easily grow your array dynamically in assembly. INPUT_STRING: resb 4 reserves 4 bytes (the maximum input is then 3 characters + the 13 <CR> char). You can write chars only into memory at addresses INPUT_STRING+0, INPUT_STRING+1, INPUT_STRING+2 and INPUT_STRING+3 (unless you want to overwrite memory which you may happen to use for something else).
Then add al,4 advances your pointer in al by 4 bytes, i.e. completely past the reserved memory after the first iteration. On top of that, al is modified by the BIOS to return a value in it, and you need a 16-bit register to hold a memory address offset in real mode, while al is only 8 bits.
This is the general, simple principle by which fixed-size "arrays" may be implemented in ASM (you can of course use a more complicated design if you wish; only your code limits what you do with your CPU and memory): you reserve N*data_type_size bytes and write values there at offsets +0*data_type_size, +1*data_type_size, +2*data_type_size, ... In the case of ASCII characters each character is 1 byte long, so the offsets of the "array" elements are simply 0, 1, 2, ...
Also, your code has to reset AH to zero before every int 16h, because the interrupt overwrites AH with the keyboard scancode. And you should check for the maximum input size if you have a fixed-size input array.
A simple, very basic and crude example (proper command line input should also handle special keys like backspace, etc.):
In the data, the global fixed-size buffer (256 bytes) is reserved like this:
INPUT_BUFFER_MAX_LEN equ 255
; reserve 255 bytes for input and +1 byte for nul terminator
INPUT_STRING: resb INPUT_BUFFER_MAX_LEN+1
And the code to store user input into it, checking for max length input.
...
scans:
mov di,INPUT_STRING ; pointer of input buffer
lea si,[di+INPUT_BUFFER_MAX_LEN] ; pointer beyond buffer
.repeat:
; check if input buffer is full
cmp di,si
jae .done ; full input buffer, end input
; wait for key press from user, using BIOS int 16h,ah=0
xor ah,ah ; ah = 0
int 0x16
; AL = ASCII char or zero for extended key, AH = scancode
test al,al
jz .done ; any extended key will end input
cmp al,13
je .done ; "Enter" key will end input (not stored in buffer)
; store the ASCII character to input buffer
mov [di],al ; expects ds = data segment
inc di ; make the pointer to point to next char
jmp .repeat ; read more chars
.done:
; store nul-terminator at the end of user input
mov [di],byte 0
ret
After ret, the memory at address INPUT_STRING will contain the ASCII characters typed by the user. For example, if the user types Abc123<enter>, the memory at address INPUT_STRING will look like this (bytes in hexadecimal): 41 62 63 31 32 33 00 ?? ?? whatever was there before ... ?? ??, i.e. six ASCII characters and the nul terminator at the seventh (+6 offset) position. This suffices as a "C string" for common C functions like printf and similar (it's the same memory structure/logic that the C language uses for its "strings").
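The control flow of that loop can be modelled in Python to check the edge cases (Enter, extended key, full buffer) without booting anything. In this sketch each item of keys stands for the AL value returned by one int 16h call, and running out of keys is treated like an extended key, which is an assumption of the model. Unlike real .bss memory in such a kernel, the Python buffer starts out zeroed.

```python
INPUT_BUFFER_MAX_LEN = 255


def scans(keys):
    """Model of the input loop: returns (buffer, number of stored chars)."""
    buf = bytearray(INPUT_BUFFER_MAX_LEN + 1)   # resb INPUT_BUFFER_MAX_LEN+1
    keys = iter(keys)
    di = 0                                      # offset of the next free byte
    while di < INPUT_BUFFER_MAX_LEN:            # cmp di,si / jae .done
        al = next(keys, 0)                      # int 0x16 with ah = 0
        if al == 0 or al == 13:                 # extended key or Enter
            break
        buf[di] = al                            # mov [di],al
        di += 1                                 # inc di
    buf[di] = 0                                 # nul terminator
    return buf, di


buf, length = scans(b"Abc123\r")
print(length, buf[:length + 1].hex(" "))
```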