NASM Linux x64 | Encode binary to base64 - linux

I'm trying to encode a binary file into base64.
However, I'm stuck on a few steps and I'm not sure this is the right way to think about it; see the comments in the code below:
SECTION .bss ; Section containing uninitialized data
BUFFLEN equ 6 ; We read the file 6 bytes at a time
Buff: resb BUFFLEN ; Text buffer itself
SECTION .data ; Section containing initialised data
B64Str: db "000000"
B64LEN equ $-B64Str
Base64: db "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
SECTION .text ; Section containing code
global _start ; Linker needs this to find the entry point!
_start:
nop ; This no-op keeps gdb happy...
; Read a buffer full of text from stdin:
Read:
mov eax,3 ; Specify sys_read call
mov ebx,0 ; Specify File Descriptor 0: Standard Input
mov ecx,Buff ; Pass offset of the buffer to read to
mov edx,BUFFLEN ; Pass number of bytes to read at one pass
int 80h ; Call sys_read to fill the buffer
mov ebp,eax ; Save # of bytes read from file for later
cmp eax,0 ; If eax=0, sys_read reached EOF on stdin
je Done ; Jump If Equal (to 0, from compare)
; Set up the registers for the process buffer step:
mov esi,Buff ; Place address of file buffer into esi
mov edi,B64Str ; Place address of line string into edi
xor ecx,ecx ; Clear line string pointer to 0
; TODO: get 6 bits from input
; TODO: convert them to a B64 char
; TODO: print the char
; TODO: proceed to the next 6 bits
; All done! Let's end this party:
Done:
mov eax,1 ; Code for Exit Syscall
mov ebx,0 ; Return a code of zero
int 80H ; Make kernel call
So, in words, it should do this:
1) Hex value :
7C AA 78
2) Binary value :
0111 1100 1010 1010 0111 1000
3) Groups in 6 bits :
011111 001010 101001 111000
4) Convert to numbers :
31 10 41 56
5) Each number is a letter, number or symbol :
31 = f
10 = K
41 = p
56 = 4
So, final output is : fKp4
So, my questions are:
How do I get the 6 bits, and how do I convert those bits to a char?

EDIT after a few years:
Recently somebody ran into this example, and while discussing how it works and how to convert it to x64 for 64-bit Linux, I turned it into a fully working example; the source is available here: https://gist.github.com/ped7g/c96a7eec86f9b090d0f33ba36af056c1
There are two major ways to implement it: either a generic loop capable of picking any 6 bits, or fixed code dealing with 24 bits (3 bytes) of input at a time (this produces exactly 4 base64 characters and ends on a byte boundary, so you can read the next 24 bits from offset +3).
Let's say esi points to the source binary data, which is padded with enough zeroes to make reading beyond the end of the input buffer safe (+3 bytes in the worst case).
And edi points to an output buffer of at least ((input_length+2)/3*4) bytes, which includes room for the '=' padding that B64 requires at the end of the sequence.
; convert 3 bytes of input into four B64 characters of output
mov eax,[esi] ; read 3 bytes of input
; (reads actually 4B, 1 will be ignored)
add esi,3 ; advance pointer to next input chunk
bswap eax ; first input byte as MSB of eax
shr eax,8 ; throw away the 1 junk byte (LSB after bswap)
; produce 4 base64 characters backward (last group of 6b is converted first)
; (to keep the 6b group extraction simple: "shr eax,6" + "and edx,0x3F")
mov edx,eax ; get copy of last 6 bits
shr eax,6 ; throw away 6bits being processed already
and edx,0x3F ; keep only last 6 bits
mov bh,[Base64+edx] ; convert 0-63 value into B64 character (4th)
mov edx,eax ; get copy of next 6 bits
shr eax,6 ; throw away 6bits being processed already
and edx,0x3F ; keep only last 6 bits
mov bl,[Base64+edx] ; convert 0-63 value into B64 character (3rd)
shl ebx,16 ; make room in ebx for the next characters (4th+3rd now in upper 16b)
mov edx,eax ; get copy of next 6 bits
shr eax,6 ; throw away 6bits being processed already
and edx,0x3F ; keep only last 6 bits
mov bh,[Base64+edx] ; convert 0-63 value into B64 character (2nd)
; here eax contains exactly only 6 bits (zero extended to 32b)
mov bl,[Base64+eax] ; convert 0-63 value into B64 character (1st)
mov [edi],ebx ; store four B64 characters as output
add edi,4 ; advance output pointer
After the last group of 3 input bytes you must overwrite the end of the output with the proper number of '=' to fix the fake characters produced from the zero padding. I.e. 1B of input (8 bits, needs 2 B64 chars) => output ends with '=='; 2B of input (16 bits, needs 3 B64 chars) => output ends with '='; 3B of input => all 24 bits used => 4 valid B64 chars.
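The same 3-byte-to-4-character step, including the '=' patching, can be sketched in C to check the asm against known results. This is only an illustrative sketch: the names b64_encode and B64_TABLE are made up here, and it packs the bytes with shifts instead of bswap.

```c
#include <stddef.h>

static const char B64_TABLE[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode n bytes of `in` into `out`; `out` needs (n + 2) / 3 * 4 + 1 bytes.
   Returns the number of characters written (without the terminating 0). */
size_t b64_encode(const unsigned char *in, size_t n, char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i += 3) {
        size_t left = n - i;              /* real bytes left in this group */
        /* pack up to 3 bytes into a 24-bit value, missing bytes are zero */
        unsigned long v = (unsigned long)in[i] << 16;
        if (left > 1) v |= (unsigned long)in[i + 1] << 8;
        if (left > 2) v |= in[i + 2];
        /* same extraction as the asm: mask with 0x3F, shift right by 6 */
        out[o + 3] = B64_TABLE[v & 0x3F]; v >>= 6;
        out[o + 2] = B64_TABLE[v & 0x3F]; v >>= 6;
        out[o + 1] = B64_TABLE[v & 0x3F]; v >>= 6;
        out[o + 0] = B64_TABLE[v];
        /* patch the fake 'A' characters that came from the zero padding */
        if (left < 3) out[o + 3] = '=';
        if (left < 2) out[o + 2] = '=';
        o += 4;
    }
    out[o] = '\0';
    return o;
}
```

With the bytes 7C AA 78 from the question this yields fKp4, and a single byte 0x44 yields RA==.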
If you don't want to read the whole file into memory and build the whole output in memory, you can use in/out buffers of limited size, e.g. 900B of input -> 1200B of output, and process the input in 900B blocks. Or you can use a 3B -> 4B in/out buffer and remove the pointer advancing completely (or even drop esi/edi and use fixed memory addresses), since you will have to load/store the in/out data separately on every iteration anyway.
Disclaimer: this code is written to be straightforward, not fast, as you asked how to extract 6 bits and how to convert the value into a character, so I guess staying with basic x86 asm instructions is best.
I'm not even sure how to make it faster without profiling the code for bottlenecks and experimenting with other variants. The partial register usage (bh, bl vs ebx) will surely be costly, so there is very likely a better solution (maybe even a SIMD-optimized version for larger input blocks).
And I didn't debug that code, I just wrote it here in the answer, so proceed with caution and check in a debugger how/if it works.

Related

How do I edit an 8 character string with a NASM Assembler based on user input? [duplicate]

Ok, so I'm fairly new to assembly; in fact, I'm very new to assembly. I wrote a piece of code which is simply meant to take numerical input from the user, multiply it by 10, and report the result to the user via the program's exit status (by typing echo $? in the terminal).
The problem is that it is not giving the correct number: 4x10 showed as 144. So I figured the input was probably read as a character rather than an integer. My question here is, how do I convert the character input to an integer so that it can be used in arithmetic calculations?
It would be great if someone could answer keeping in mind that I'm a beginner :)
Also, how can I convert said integer back to a character?
section .data
section .bss
input resb 4
section .text
global _start
_start:
mov eax, 3
mov ebx, 0
mov ecx, input
mov edx, 4
int 0x80
mov ebx, 10
imul ebx, ecx
mov eax, 1
int 0x80
Here's a couple of functions for converting strings to integers, and vice versa:
; Input:
; ESI = pointer to the string to convert
; ECX = number of digits in the string (must be > 0)
; Output:
; EAX = integer value
string_to_int:
xor ebx,ebx ; clear ebx
.next_digit:
movzx eax,byte[esi]
inc esi
sub al,'0' ; convert from ASCII to number
imul ebx,10
add ebx,eax ; ebx = ebx*10 + eax
loop .next_digit ; while (--ecx)
mov eax,ebx
ret
; Input:
; EAX = integer value to convert
; ESI = pointer to buffer to store the string in (must have room for at least 11 bytes:
;       up to 10 digits of a 32-bit value plus the terminator)
; Output:
; EAX = pointer to the first character of the generated string
int_to_string:
add esi,10 ; point at the last byte of the buffer
mov byte [esi],STRING_TERMINATOR
mov ebx,10
.next_digit:
xor edx,edx ; Clear edx prior to dividing edx:eax by ebx
div ebx ; eax /= 10
add dl,'0' ; Convert the remainder to ASCII
dec esi ; store characters in reverse order
mov [esi],dl
test eax,eax
jnz .next_digit ; Repeat until eax==0
mov eax,esi
ret
And this is how you'd use them:
STRING_TERMINATOR equ 0
lea esi,[thestring]
mov ecx,4
call string_to_int
; EAX now contains 1234
; Convert it back to a string
lea esi,[buffer]
call int_to_string
; You now have a string pointer in EAX, which
; you can use with the sys_write system call
thestring: db "1234",0
buffer: resb 11
Note that I don't do much error checking in these routines (like checking if there are characters outside of the range '0' - '9'). Nor do the routines handle signed numbers. So if you need those things you'll have to add them yourself.
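For comparison, the backward-filling idea of int_to_string can be sketched in C. The function name uint_to_string is hypothetical; the sketch asks for an 11-byte buffer because an unsigned 32-bit value can have up to 10 decimal digits plus the terminator.

```c
#include <stdint.h>

/* Hypothetical C counterpart of int_to_string: fills `buf` from the end.
   `buf` must hold 11 bytes: up to 10 digits of a 32-bit value plus the 0. */
char *uint_to_string(uint32_t value, char buf[11])
{
    char *p = buf + 10;
    *p = '\0';                              /* terminator goes in last place */
    do {
        *--p = (char)('0' + value % 10);    /* remainder -> ASCII digit */
        value /= 10;
    } while (value != 0);                   /* repeat until nothing is left */
    return p;                               /* first character of the result */
}
```

As in the asm, the returned pointer is usually somewhere inside the buffer, not at its start, because short numbers do not fill it.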
The basic algorithm for string->integer is: total = total*10 + digit, starting from the MSD. (e.g. with digit = *p++ - '0' for an ASCII string of digits). So the left-most / Most-Significant / first digit (in memory, and in reading order) gets multiplied by 10 N times, where N is the total number of digits after it.
Doing it this way is generally more efficient than multiplying each digit by the right power of 10 before adding. That would need 2 multiplies; one to grow a power of 10, and another to apply it to the digit. (Or a table look-up with ascending powers of 10).
Of course, for efficiency you might use SSSE3 pmaddubsw and SSE2 pmaddwd to multiply digits by their place-value in parallel: see Is there a fast way to convert a string of 8 ASCII decimal digits into a binary number? and arbitrary-length How to implement atoi using SIMD?. But the latter probably isn't a win when numbers are typically short. A scalar loop is efficient when most numbers are only a couple digits long.
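The scalar total = total*10 + digit loop can be written in C as a minimal sketch. The name string_to_uint is made up for illustration; like the asm version that takes a digit count in ECX, it does no validation and no overflow checking.

```c
#include <stddef.h>

/* Sketch of total = total*10 + digit over `len` ASCII digits,
   most-significant digit first. No validation, no overflow checks. */
unsigned string_to_uint(const char *s, size_t len)
{
    unsigned total = 0;
    for (size_t i = 0; i < len; i++)
        total = total * 10 + (unsigned)(s[i] - '0');
    return total;
}
```

For "1234" the running total goes 1, 12, 123, 1234, so the first digit ends up multiplied by 10 three times, once for each digit after it.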
Adding on to @Michael's answer, it may be useful to have the string->int function stop at the first non-digit, instead of at a fixed length. This will catch problems like your string including a newline from when the user pressed return, as well as not turning 12xy34 into a very large number. (Treat it as 12, like C's atoi function). The stop character can also be the terminating 0 in a C implicit-length string.
I've also made some improvements:
Don't use the slow loop instruction unless you're optimizing for code-size. Just forget it exists and use dec / jnz in cases where counting down to zero is still what you want to do, instead of comparing a pointer or something else.
Two LEA instructions are significantly better than imul + add: lower latency.
Accumulate the result in EAX, where we want to return it anyway. (If you inline this instead of calling it, use whatever register you want the result in.)
I changed the registers so it follows the x86-64 System V ABI (First arg in RDI, return in EAX).
Porting to 32-bit: This doesn't depend on 64-bitness at all; it can be ported to 32-bit by just using 32-bit registers. (i.e. replace rdi with edi, rcx with ecx, and rax with eax). Beware of C calling-convention differences between 32 and 64-bit, e.g. EDI is call-preserved and args are usually passed on the stack. But if your caller is asm, you can pass an arg in EDI.
; args: pointer in RDI to ASCII decimal digits, terminated by a non-digit
; clobbers: ECX
; returns: EAX = atoi(RDI) (base 10 unsigned)
; RDI = pointer to first non-digit
global base10string_to_int
base10string_to_int:
movzx eax, byte [rdi] ; start with the first digit
sub eax, '0' ; convert from ASCII to number
cmp al, 9 ; check that it's a decimal digit [0..9]
jbe .loop_entry ; too low -> wraps to high value, fails unsigned compare check
; else: bad first digit: return 0
xor eax,eax
ret
; rotate the loop so we can put the JCC at the bottom where it belongs
; but still check the digit before messing up our total
.next_digit: ; do {
lea eax, [rax*4 + rax] ; total *= 5
lea eax, [rax*2 + rcx] ; total = (total*5)*2 + digit
; imul eax, 10 / add eax, ecx
.loop_entry:
inc rdi
movzx ecx, byte [rdi]
sub ecx, '0'
cmp ecx, 9
jbe .next_digit ; } while( digit <= 9 )
ret ; return with total in eax
This stops converting on the first non-digit character. Often this will be the 0 byte that terminates an implicit-length string. You could check after the loop that it was a string-end, not some other non-digit character, by checking ecx == -'0' (which still holds the str[i] - '0' integer "digit" value that was out of range), if you want to detect trailing garbage.
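The stop-at-first-non-digit behaviour, and the check for trailing garbage, can be sketched in C. The name parse_uint and the end out-parameter are assumptions made for this illustration (similar in spirit to strtoul's end pointer); this is not a translation of the asm instruction by instruction.

```c
/* Hypothetical C counterpart of base10string_to_int.
   Returns the value of the leading decimal digits (0 if there are none)
   and stores a pointer to the first non-digit in *end. */
unsigned parse_uint(const char *s, const char **end)
{
    unsigned total = 0;
    unsigned digit;
    /* unsigned subtraction: anything below '0' wraps to a large value,
       so one compare rejects both too-low and too-high characters */
    while ((digit = (unsigned)(unsigned char)*s - '0') <= 9) {
        total = total * 10 + digit;
        s++;
    }
    *end = s;
    return total;
}
```

After the call, `*end == '\0'` means the digits ran up to the end of the string; anything else (a newline, the x in 12xy34) is trailing garbage that the caller can report.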
If your input is an explicit-length string, you'd need to use a loop counter instead of checking a terminator (like @Michael's answer), because the next byte in memory might be another digit. Or it might be in an unmapped page.
Making the first iteration special and handling it before jumping into the main part of the loop is called loop peeling. Peeling the first iteration allows us to optimize it specially, because we know total=0 so there's no need to multiply anything by 10. It's like starting with sum = array[0]; i=1 instead of sum=0, i=0;.
To get nice loop structure (with the conditional branch at the bottom), I used the trick of jumping into the middle of the loop for the first iteration. This didn't even take an extra jmp because I was already branching in the peeled first iteration. Reordering a loop so an if()break in the middle becomes a loop branch at the bottom is called loop rotation, and can involve peeling the first part of the first iteration and the 2nd part of the last iteration.
The simple way to solve the problem of exiting the loop on a non-digit would be to have a jcc in the loop body, like an if() break; statement in C before the total = total*10 + digit. But then I'd need a jmp and have 2 total branch instructions in the loop, meaning more overhead.
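The two loop shapes can be put side by side in C as a rough sketch. Both function names are invented for this illustration, and the goto exists only to mirror the jump into the middle of the loop; an optimizing C compiler will normally do this rotation on its own.

```c
/* Straightforward shape: test in the middle, jump back at the bottom. */
unsigned parse_break_in_middle(const char *s)
{
    unsigned total = 0;
    for (;;) {
        unsigned digit = (unsigned)(unsigned char)*s++ - '0';
        if (digit > 9)
            break;                  /* conditional branch inside the body */
        total = total * 10 + digit;
    }                               /* plus an unconditional jump to the top */
    return total;
}

/* Peeled and rotated shape, mirroring the asm: the first digit is handled
   before the loop, and the only loop branch is the condition at the bottom. */
unsigned parse_rotated(const char *s)
{
    unsigned digit = (unsigned)(unsigned char)*s - '0';
    if (digit > 9)
        return 0;                   /* bad first character */
    unsigned total = digit;         /* peeled: total was 0, nothing to multiply */
    goto loop_entry;                /* enter the loop in the middle */
    do {
        total = total * 10 + digit;
loop_entry:
        digit = (unsigned)(unsigned char)*++s - '0';
    } while (digit <= 9);
    return total;
}
```

Both functions return the same values; the difference is only in how many branch instructions a literal translation of each loop body would contain.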
If I didn't need the sub ecx, '0' result for the loop condition, I could have used lea eax, [rax*2 + rcx - '0'] to do it as part of the LEA as well. But that would have made the LEA latency 3 cycles instead of 1, on Sandybridge-family CPUs. (3-component LEA vs. 2 or less.) The two LEAs form a loop-carried dependency chain on eax (total), so (especially for large numbers) it would not be worth it on Intel. On CPUs where base + scaled-index is no faster than base + scaled-index + disp8 (Bulldozer-family / Ryzen), then sure, if you have an explicit length as your loop condition and don't want to check the digits at all.
I used movzx to load with zero extension in the first place, instead of doing that after converting the digit from ASCII to integer. (It has to be done at some point to add into 32-bit EAX). Often code that manipulates ASCII digits uses byte operand-size, like mov cl, [rdi]. But that would create a false dependency on the old value of RCX on most CPUs.
sub al,'0' saves 1 byte over sub eax,'0', but causes a partial-register stall on Nehalem/Core2 and even worse on PIII. Fine on all other CPU families, even Sandybridge: it's a RMW of AL, so it doesn't rename the partial reg separately from EAX. But cmp al, 9 doesn't cause a problem, because reading a byte register is always fine. It saves a byte (special encoding with no ModRM byte), so I used that at the top of the function.
For more optimization stuff, see http://agner.org/optimize, and other links in the x86 tag wiki.
The tag wiki also has beginner links, including an FAQ section with links to integer->string functions, and other common beginner questions.
Related:
How do I print an integer in Assembly Level Programming without printf from the c library? is the reverse of this question, integer -> base10string.
Is there a fast way to convert a string of 8 ASCII decimal digits into a binary number? highly optimized SSSE3 pmaddubsw / pmaddwd for 8-digit integers.
How to implement atoi using SIMD? using a shuffle to handle variable-length
Conversion of huge decimal numbers (128bit) formatted as ASCII to binary (hex) handles long strings, e.g. a 128-bit integer that takes 4x 32-bit registers. (It's not very efficient, and might be better to convert in multiple chunks and then do extended-precision multiplies by 1e9 or something.)
Convert from ascii to integer in AT&T Assembly Inefficient AT&T version of this.

Nasm Linux x64-86 | Add bits at the end of file for correct base 64 encoding

My program is supposed to encode binaries into base 64.
Everything works fine until the EOF. I have trouble adding '=' at the end of my output string.
This should happen only when the last chunk of bytes is read; it should fill the empty space. Here is my code to detect when I have to add one or two '='.
Read:
mov eax,3 ; Specify sys_read call
mov ebx,0 ; Specify File Descriptor 0: Standard Input
mov ecx,Bytes ; Pass offset of the buffer to read to
mov edx,BYTESLEN ; Pass number of bytes to read at one pass
int 80h ; Call sys_read to fill the buffer
mov ebp,eax ; Save # of bytes read from file for later
cmp rax,1 ; If EAX=1, only one byte was read
je MissingTwoByte ; Jump If Equal (to 1, from compare)
cmp rax,2 ; If EAX=2, only two bytes were read
je MissingOneByte ; Jump If Equal (to 2, from compare)
cmp eax,0 ; If EAX=0, sys_read reached EOF on stdin
je Done ; Jump If Equal (to 0, from compare)
So in my MissingOneByte and MissingTwoByte code, I should add my '=' to Bytes, right? How can I achieve that?
In my previous answer, that code was supposed to always eat 3 bytes, padded with zeroes, and to fix/patch the result afterwards!
I.e. for a single input byte 0x44, Bytes needs to be set to 44 00 00 (the first 44 is set by sys_read, the other two need to be cleared by code). You will get the wrong conversion result RAAA, which you then need to patch to the correct RA==.
For example:
SECTION .bss
BYTESLEN equ 3 ; 3 bytes of real buffer are needed
Bytes: resb BYTESLEN + 5; real buffer +5 padding (total 8B)
B64output: resb 4+4 ; 4 bytes are real output buffer
; +4 bytes are padding (total 8B)
SECTION .text
;...
Read:
mov eax,3 ; Specify sys_read call
xor ebx,ebx ; Specify File Descriptor 0: Standard Input
mov ecx,Bytes ; Pass offset of the buffer to read to
mov edx,BYTESLEN ; Pass number of bytes to read at one pass
int 80h ; Call sys_read to fill the buffer
test eax,eax
jl ReadingError ; OS has problem, system "errno" is set
mov ebp,eax ; Save # of bytes read from file for later
jz Done ; 0 bytes read, no more input
; eax = 1, 2, 3
mov [ecx + eax],ebx ; clear padding bytes
; ^^ this is a bit nasty EBX reuse, works only for STDIN (0)
; for any file handle use fixed zero: mov word [ecx+eax],0
call ConvertBytesToB64Output ; convert to Base64 output
; overwrite last two/one/none characters based on how many input
; bytes were read (B64output+3+1 = B64output+4 => beyond 4 chars)
mov word [B64output + ebp + 1], '=='
;TODO store B64output where you wish
cmp ebp,3
je Read ; if 3 bytes were read, loop again
; 1 or 2 bytes will continue with "Done:"
Done:
; ...
ReadingError:
; ...
ConvertBytesToB64Output:
; ...
ret
Again written to be short and simple, without caring much about performance.
The trick that keeps the instructions simple is to have enough padding at the end of the buffers, so you don't need to worry about overwriting memory beyond them. Then you can write two '=' after every output and just position them at the desired place (overwriting the last two characters, overwriting the last character, or landing completely outside the output in the padding area).
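The padding trick can be sketched in C as well. The names encode_last_group and B64_CHARS are hypothetical; the 8-byte output buffer matches the B64output layout above (4 real characters + 4 bytes of slack), so the two-byte '==' store never needs a bounds check.

```c
#include <stdint.h>
#include <string.h>

static const char B64_CHARS[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encodes the first n (1..3) bytes of `in` and pads with '='.
   `out` must have 8 bytes: 4 output characters + 4 bytes of slack. */
void encode_last_group(const unsigned char *in, size_t n, char out[8])
{
    unsigned char padded[3] = {0, 0, 0};
    memcpy(padded, in, n);                  /* missing bytes stay zero */
    uint32_t v = (uint32_t)padded[0] << 16 | (uint32_t)padded[1] << 8 | padded[2];
    out[0] = B64_CHARS[(v >> 18) & 0x3F];
    out[1] = B64_CHARS[(v >> 12) & 0x3F];
    out[2] = B64_CHARS[(v >> 6) & 0x3F];
    out[3] = B64_CHARS[v & 0x3F];
    /* always store two '=' at offset n + 1:
       n == 1 overwrites out[2..3], n == 2 overwrites out[3] and spills one
       byte into the slack, n == 3 lands entirely in the slack */
    memcpy(out + n + 1, "==", 2);
}
```

Only the first 4 bytes of out are real output; whatever ends up in the slack bytes is ignored.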
Without that, a lot of if (length == 1/2/3) {...} else {...} would probably creep into the code to guard the memory writes, so that only the output buffer is overwritten and nothing more.
So make sure you understand what I did and how it works, and add enough padding to your own buffers.
Also... !disclaimer!: I actually don't know how many = should be at the end of base64 output, and when... that's up to the OP to study in the base64 definition. I'm just showing how to fix the wrong output of the 3B->4B conversion when it takes shorter input padded with zeroes. Hmm, according to an online BASE64 generator it actually works like my code: (input%3) => 0 has no =, 1 has two =, and 2 has a single =.
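The (input%3) rule can be written down as two small C helpers. Both names are made up for this sketch; the formulas simply restate the rule above and the 4-characters-per-3-bytes ratio.

```c
#include <stddef.h>

/* Number of '=' characters at the end of the output for n input bytes:
   n % 3 == 0 -> 0, n % 3 == 1 -> 2, n % 3 == 2 -> 1. */
size_t b64_padding(size_t n)
{
    return (3 - n % 3) % 3;
}

/* Total output length including padding:
   4 characters for every started group of 3 input bytes. */
size_t b64_encoded_len(size_t n)
{
    return (n + 2) / 3 * 4;
}
```

For example, 900 input bytes give 1200 output characters with no padding, while 4 input bytes give 8 characters, the last two of which are '='.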

How do I add chars to an array of 32-bits in Assembly?

I am developing a simple 16-bit Real Mode kernel in Assembly, as a DOS clone kernel. I am currently trying to read user input as a string by taking every character the user types and appending it to an array. The function should return the array (which I have yet to implement) as a string. However, the following code does not work. Why? Thanks in advance. I am using the NASM assembler, if it makes a difference.
section .bss:
INPUT_STRING: resb 4
section .text:
....
scans:
mov ah, 0x00
.repeat:
int 16h
cmp al, 0
je .done
cmp al, 0x0D
; jump straight to returning the array, implement later
mov [INPUT_STRING+al], 1
add al, 4
jmp .repeat
.done:
ret
You can't easily grow an array dynamically in assembly; INPUT_STRING: resb 4 reserves 4 bytes (so the maximum input is 3 characters plus the <CR> character, code 13). Then add al,4 advances your "pointer" in al by 4 bytes, i.e. completely off the reserved memory after the first iteration (not to mention that al is overwritten by the BIOS, which returns the key in it, and that you need a 16-bit register to hold a memory offset in real mode, while al is only 8 bits). You can write chars only to the addresses INPUT_STRING+0, INPUT_STRING+1, INPUT_STRING+2 and INPUT_STRING+3 (unless you want to overwrite memory which you may happen to use for something else).
That's the general simple principle of how fixed-size "arrays" can be implemented in ASM (you can of course use a more complicated design if you wish; only your code limits what you do with your CPU and memory): you reserve N*data_type_size bytes and write values at offsets +0*data_type_size, +1*data_type_size, +2*data_type_size, ... In the case of ASCII characters each character is 1 byte long, so the offsets of the "array" elements are simply 0, 1, 2, ...
Also, in your code you have to reset AH to zero every time before int 16h, because the interrupt overwrites AH with the keyboard scancode. And you should check for the maximum input size if you have a fixed-size input array.
Here is a simple, very basic and crude example (proper command-line input should also handle special keys like backspace, etc.):
In the data, the global fixed-size buffer (256 bytes) is reserved like this:
INPUT_BUFFER_MAX_LEN equ 255
; reserve 255 bytes for input and +1 byte for nul terminator
INPUT_STRING: resb INPUT_BUFFER_MAX_LEN+1
And the code to store user input into it, checking for max length input.
...
scans:
mov di,INPUT_STRING ; pointer of input buffer
lea si,[di+INPUT_BUFFER_MAX_LEN] ; pointer beyond buffer
.repeat:
; check if input buffer is full
cmp di,si
jae .done ; full input buffer, end input
; wait for key press from user, using BIOS int 16h,ah=0
xor ah,ah ; ah = 0
int 0x16
; AL = ASCII char or zero for extended key, AH = scancode
test al,al
jz .done ; any extended key will end input
cmp al,13
je .done ; "Enter" key will end input (not stored in buffer)
; store the ASCII character to input buffer
mov [di],al ; expects ds = data segment
inc di ; make the pointer to point to next char
jmp .repeat ; read more chars
.done:
; store nul-terminator at the end of user input
mov [di],byte 0
ret
After ret, the memory at address INPUT_STRING will contain the bytes of the ASCII characters the user entered. For example, if the user types Abc123<enter>, the memory at address INPUT_STRING will look like this (bytes in hexadecimal): 41 62 63 31 32 33 00 ?? ?? whatever was there before ... ?? ??, i.e. six ASCII characters and the nul terminator at the seventh (+6 offset) position. This suffices as a "C string" for common C functions like printf and similar (it's the same memory structure/logic that the C language uses for its "strings").
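To make the memory layout above concrete, here is a small sketch in C (used here only so the result can be checked on any machine; it is not a translation of the BIOS code). The names input_string and simulate_input are hypothetical, and the keys array stands in for the successive AL values that int 16h would return.

```c
#include <stddef.h>
#include <string.h>

#define INPUT_BUFFER_MAX_LEN 255

/* 255 bytes for input plus 1 byte for the nul terminator */
char input_string[INPUT_BUFFER_MAX_LEN + 1];

/* Hypothetical stand-in for the BIOS loop: keys[] plays the role of the
   AL values returned by successive int 16h calls.
   Returns the number of characters stored. */
size_t simulate_input(const char *keys, size_t nkeys)
{
    size_t len = 0;
    for (size_t i = 0; i < nkeys && len < INPUT_BUFFER_MAX_LEN; i++) {
        char al = keys[i];
        if (al == 0 || al == 13)    /* extended key or Enter ends input */
            break;
        input_string[len++] = al;   /* store at offset +0, +1, +2, ... */
    }
    input_string[len] = '\0';       /* terminator at offset +len */
    return len;
}
```

Calling simulate_input("Abc123\r", 7) stores six characters followed by the terminator, so input_string can then be passed to strcmp or printf like any other C string.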

NASM: How to create/handle basic bmp file using intel 64 bit assembly?

How do I create a simple BMP file filled with a single color, using Intel 64-bit assembly and the NASM assembler?
The steps involved in such an operation are:
1) Create the BMP file header with fixed values (explanation of specific fields below)
2) Create a buffer which contains enough space - three bytes per pixel (one color = red + green + blue)
3) Open/create the file
4) Fill the buffer
5) Write the header to the file
6) Write the buffer to the file
7) Close the file
8) Exit the program
Ad. 2: This is a bit more tricky. If the number of bytes per row is not divisible by 4, each row has to be padded up to the next multiple of 4; this program fills the padding bytes with 0xFF. Here I purposely created a 201x201 picture. In this example we have 3*201=603 bytes per row, meaning that we need one additional byte per row. Because of this, the size required for the picture buffer is 604*201=121404.
The source code that answers the question:
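The padding arithmetic above can be checked with a short C sketch. The function names bmp_row_stride and bmp_pixel_bytes are made up for illustration, and the code assumes 24 bits per pixel as in this answer.

```c
#include <stdint.h>

/* Bytes per row for a 24-bit BMP: 3 bytes per pixel,
   rounded up to the next multiple of 4. */
uint32_t bmp_row_stride(uint32_t width)
{
    return (3u * width + 3u) & ~3u;
}

/* Total size of the pixel data: one padded row per image row. */
uint32_t bmp_pixel_bytes(uint32_t width, uint32_t height)
{
    return bmp_row_stride(width) * height;
}
```

For a 201x201 picture, bmp_row_stride(201) gives 604 and bmp_pixel_bytes(201, 201) gives 121404, matching the numbers used here. A width of 200 needs no padding, since 600 is already a multiple of 4.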
section .text
global _start ;must be declared for linker (ld)
_start: ;tell linker entry point
;#######################################################################
;### This program creates empty bmp file - 64 bit version ##############
;#######################################################################
;### main ##############################################################
;#######################################################################
; open file
mov rax,85 ;system call number - open/create file
mov rdi,msg ;file name
;flags
mov rsi,111111111b ;mode
syscall ;call kernel
; save file descriptor
mov r8, rax
; write headline to file
mov rax, 1 ;system call number - write
mov rdi, r8 ;load file desc
mov rsi, bmpheadline ;load address of buffer to write
mov rdx, 54 ;load number of bytes
syscall ;call kernel
mov rbx, 201 ;LOOPY counter
mov rdx, empty_space ;load address of buffer (space allocated for picture pixels)
LOOPY:
mov rcx, 201 ;LOOPX counter
LOOPX:
mov byte [rdx+0], 0x00 ;BLUE
mov byte [rdx+1], 0xFF ;GREEN
mov byte [rdx+2], 0xFF ;RED
dec rcx ;decrease counter_x
add rdx, 3 ;move address pointer by 3 bytes (1 pixel = 3 bytes, which we just have written)
cmp rcx, 0 ;check if counter is 0
jne LOOPX ;if not jump to LOOPX
dec rbx ;decrease counter_y
mov byte [rdx], 0xFF ;additional byte per row
inc rdx ;increase address
cmp rbx, 0 ;check if counter is 0
jne LOOPY ;if not jump to LOOPY
; write content to file
mov rax, 1 ;system call number - write
mov rdi, r8 ;load file desc
mov rsi, empty_space ;load address of buffer to write
mov rdx, 121404 ;load number of bytes
syscall ;call kernel
; close file
mov rax, 3 ;system call number - close
mov rdi, r8 ;load file desc
syscall ;call kernel
; exit program
mov rax,60 ;system call number - exit
syscall ;call kernel
section .data
msg db 'filename.bmp',0x00 ;name of out file, 0x00 = end of string
bmpheadline db 0x42,0x4D,0x72,0xDA,0x01,0x00,0x00,0x00,0x00,0x00,0x36,0x00,0x00,0x00,0x28,0x00,0x00,0x00,0xC9,0x00,0x00,0x00,0xC9,0x00,0x00,0x00,0x01,0x00,0x18,0x00,0x00,0x00,0x00,0x00,0x3C,0xDA,0x01,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00
section .bss ;this section is responsible for preallocated block of memory of fixed size
empty_space: resb 121404 ;preallocation of 121404 bytes
Here is the explanation of the BMP header (more under this link: http://www.dragonwins.com/domains/getteched/bmp/bmpfileformat.htm )
;### File Header - 14 bytes
;#######################################################################
;### bfType, 2 bytes, The characters "BM"
;### 0x42,0x4D = "B","M"
;###
;### bfSize, 4 bytes, The size of the file in bytes
;### 0x72,0xDA,0x01,0x00 => 0x00,0x01,0xDA,0x72 = 0x1DA72 = 121458 bytes
;### 121458 = 54 + (201 * 3 + 1) * 201
;###
;### Comment:
;### We want to create file 201x201, that means 201 rows and 201 columns
;### meaning each row will take 201*3 = 603 bytes
;###
;### According to BMP file specification each such row must be adjusted
;### so its size is divisible by 4, this gives us plus 1 byte for each
;### row.
;###
;###
;### bfReserved1, 2 bytes, Unused - must be zero
;### 0x00,0x00
;###
;### bfReserved2, 2 bytes, Unused - must be zero
;### 0x00,0x00
;###
;### bfOffBits, 4 bytes, Offset to start of Pixel Data
;### 0x36,0x00,0x00,0x00 = 54 bytes
;###
;### Image Header - 40 bytes
;#######################################################################
;### biSize 4 Header Size - Must be at least 40
;### 0x28,0x00,0x00,0x00 = 40
;###
;### biWidth 4 Image width in pixels
;### 0xC9,0x00,0x00,0x00 = 201
;###
;### biHeight 4 Image height in pixels
;### 0xC9,0x00,0x00,0x00 = 201
;###
;### biPlanes 2 Must be 1
;### 0x01,0x00
;###
;### biBitCount 2 Bits per pixel - 1, 4, 8, 16, 24, or 32
;### 0x18,0x00 = 24
;###
;### biCompression 4 Compression type (0 = uncompressed)
;### 0x00,0x00,0x00,0x00
;###
;### biSizeImage 4 Image Size - may be zero for uncompressed images
;### 0x3C,0xDA,0x01,0x00 => 0x00,0x01,0xDA,0x3C = 121404 bytes
;###
;### biXPelsPerMeter 4 Preferred resolution in pixels per meter
;### 0x00,0x00,0x00,0x00
;###
;### biYPelsPerMeter 4 Preferred resolution in pixels per meter
;### 0x00,0x00,0x00,0x00
;###
;### biClrUsed 4 Number Color Map entries that are actually used
;### 0x00,0x00,0x00,0x00
;###
;### biClrImportant 4 Number of significant colors
;### 0x00,0x00,0x00,0x00
;###
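As a cross-check of the field list above, the following C sketch builds the same 54-byte header from a width and height instead of hard-coded bytes. The helper names (put_le16, put_le32, bmp_make_header) are hypothetical, and the sketch assumes an uncompressed 24-bit image, as in this answer.

```c
#include <stdint.h>
#include <string.h>

/* Store a value in little-endian byte order, as BMP requires. */
void put_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)(v >> 8);
}

void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)((v >> 8) & 0xFF);
    p[2] = (uint8_t)((v >> 16) & 0xFF);
    p[3] = (uint8_t)(v >> 24);
}

/* Fill the 14-byte file header plus the 40-byte image header. */
void bmp_make_header(uint8_t hdr[54], uint32_t width, uint32_t height)
{
    uint32_t stride = (3u * width + 3u) & ~3u;   /* padded row size */
    uint32_t image_size = stride * height;

    memset(hdr, 0, 54);                  /* reserved/unused fields stay zero */
    hdr[0] = 'B';                        /* bfType */
    hdr[1] = 'M';
    put_le32(hdr + 2, 54 + image_size);  /* bfSize */
    put_le32(hdr + 10, 54);              /* bfOffBits */
    put_le32(hdr + 14, 40);              /* biSize */
    put_le32(hdr + 18, width);           /* biWidth */
    put_le32(hdr + 22, height);          /* biHeight */
    put_le16(hdr + 26, 1);               /* biPlanes */
    put_le16(hdr + 28, 24);              /* biBitCount */
    put_le32(hdr + 34, image_size);      /* biSizeImage */
}
```

For width = height = 201 this produces the same bytes as the bmpheadline data, including 0x72,0xDA,0x01,0x00 for bfSize and 0x3C,0xDA,0x01,0x00 for biSizeImage.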
Here's an improved version of rbraun's answer. This should really be a Q&A over on codereview.SE >.<
I decided to post a separate answer instead of an edit, but feel free to copy any of this back into that answer if you want. I've tested this for a few different row/column sizes, and it works.
I improved the comments, as well as optimizing a bit. Comments like "call kernel" are too obvious to bother writing; that's just noise. I changed the comments on the system calls to more clearly say what was going on. e.g. it looks like you're calling sys_open, but you're actually using sys_creat. That means there is no flags arg, even though you mention it in a comment.
I also parameterized the BMP header and the loops so they work for any assemble-time value of BMPcols and BMProws with no extra overhead at run-time. If the row width is a multiple of 4B without padding, the padding store and increment instructions are left out altogether.
For very large buffers, it would make a lot of sense to use multiple write() calls on a buffer that ends at the end of a line, so you can reuse it. e.g. any multiple of lcm(4096, row_bytes) would be good, since it holds a whole number of rows. Around 128kiB is maybe a good size, because L2 cache size in Intel CPUs since Nehalem is 256kiB, so the data can hopefully stay hot in L2 while the kernel memcpys it into the pagecache repeatedly. You definitely want the buffer to be significantly smaller than last-level cache size.
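A simpler variant of the chunking idea above, sketched in C: instead of a multiple of lcm(4096, row_bytes), pick the largest whole number of rows that fits in a target size such as 128 KiB, so every chunk still ends at a row boundary. This drops the page-alignment part of the suggestion, and the names rows_per_chunk and chunk_bytes are hypothetical.

```c
#include <stddef.h>

/* Largest whole number of rows whose bytes fit in target_bytes
   (at least one row, even if a single row is larger than the target). */
size_t rows_per_chunk(size_t row_bytes, size_t target_bytes)
{
    size_t rows = target_bytes / row_bytes;
    return rows ? rows : 1;
}

/* Size of one chunk in bytes: always a multiple of row_bytes,
   so the buffer ends at the end of a row and can be reused. */
size_t chunk_bytes(size_t row_bytes, size_t target_bytes)
{
    return rows_per_chunk(row_bytes, target_bytes) * row_bytes;
}
```

With the 6060-byte rows of the 2019-pixel-wide image used below and a 128 KiB target, this gives 21 rows (127260 bytes) per write() call.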
Changes from original:
Fixed file-creation mode: don't set the execute bits, just read/write. Use octal like a normal person.
Improve comments, as discussed above: be more explicit about what system calls we're making. Avoid re-stating what's already clear from the asm instructions.
Demonstrate RIP-relative addressing for static objects
Put static constant data in .rodata. We don't need a .data section/segment at all.
Used 32-bit operand size where possible, especially for putting small constants in registers. (And note that mov-immediate is not really a "load").
Improved loop idiom: dec / jnz with no separate CMP.
Parameterized on BMProws / BMPcols, and defined assemble-time constants for various sizes instead of hard-coding. The assembler can do the math for you, so take advantage of it.
Define the BMP header with separately named dd items, instead of a no-longer-meaningful block of bytes with db.
Make only one write() system call: copy the BMP header into the buffer first. A 54 byte memcpy is much faster than an extra syscall.
Save some instructions by not repeating the setup of args for system calls when they're already there.
Merged the three byte stores for pixel components into one dword store. These stores overlap, but that's fine.
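The overlapping-store idea from the last item can be illustrated in C. This is only a sketch under the assumption that the buffer has one spare byte after the last pixel; fill_pixels_overlapping is a made-up name, and memcpy of 4 bytes stands in for the dword store.

```c
#include <stdint.h>
#include <string.h>

/* Fill npixels 24-bit pixels using one 4-byte store per pixel.
   buf must have room for 3*npixels + 1 bytes, because the last
   store writes one extra byte past the final pixel. */
void fill_pixels_overlapping(uint8_t *buf, size_t npixels,
                             uint8_t blue, uint8_t green, uint8_t red)
{
    /* The 4th byte is overwritten by the blue byte of the next pixel. */
    const uint8_t px[4] = { blue, green, red, 0 };
    for (size_t i = 0; i < npixels; i++)
        memcpy(buf + 3 * i, px, 4);   /* advance by 3, store 4 */
}
```

Each store's extra byte is overwritten by the next pixel's blue component, so the pixel bytes end up identical to what three separate byte stores would produce. Only the final store leaves a stray byte, which is why the buffer needs the extra byte of slack.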
DEFAULT REL ; default to RIP-relative addressing for static data
;#######################################################################
;### This program creates empty bmp file - 64 bit version ##############
section .rodata ; read-only data is the right place for these, not .data
BMPcols equ 2019
BMProws equ 2011
; 3 bytes per pixel, with each row padded to a multiple of 4B
BMPpixbytes equ BMProws * ((3 * BMPcols + 3) & ~0x3) ; round the row size in bytes (not the pixel count) up to a multiple of 4
;; TODO: rewrite this header with separate db and dd directives for the different fields. Preferably in terms of assembler-constant width and height
ALIGN 16 ; for efficient rep movs
bmpheader:
;; BMP is a little-endian format, so we can use dd and stuff directly instead of encoding the bytes ourselves
bfType: dw "BM"
bfSize: dd BMPpixbytes + bmpheader_len ; size of file in bytes
dd 0 ; reserved
bfOffBits: dd bmpheader_len ; yes we can refer to stuff that's defined later.
biSize: dd 40 ; header size, min = 40
biWidth: dd BMPcols
biHeight: dd BMProws
biPlanes: dw 1 ; must be 1
biBitCount: dw 24 ; bits per pixel: 1, 4, 8, 16, 24, or 32
biCompression: dd 0 ; uncompressed = 0
biSizeImage: dd BMPpixbytes ; Image Size - may be zero for uncompressed images
biXPelsPerMeter: dd 0 ; Preferred resolution in pixels per meter
biYPelsPerMeter: dd 0 ; Preferred resolution in pixels per meter
biClrUsed: dd 0 ; Number Color Map entries that are actually used
biClrImportant: dd 0 ; Number of significant colors
bmpheader_len equ $ - bmpheader ; Let the assembler calculate this for us. Should be 54. `$` is the current position
; output filename is hard-coded. Checking argc / argv is left as an exercise for the reader.
; Of course it would be even easier to be more Unixy and just always write to stdout, so the user could redirect
fname db 'filename.bmp',0x00 ;name of out file, 0x00 = end of string
section .bss ;this section is responsible for fixed size preallocated blocks
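bmpbuf: resb bmpheader_len + BMPpixbytes ; static buffer big enough to hold the whole file (including header).
bmpbuf_len equ $ - bmpbuf
resb 4 ; slack, not written to the file: the overlapping dword stores can run up to 3 bytes past the last row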
section .text
global _start ;make the symbol externally visible
_start: ;The linker looks for this symbol to set the entry point
;#######################################################################
;### main ##############################################################
; creat(fname, 0666)
mov eax,85 ; SYS_creat from /usr/include/x86_64-linux-gnu/asm/unistd_64.h
;mov edi, fname ;file name string. Static data is always in the low 2G, so you can use 32bit immediates.
lea rdi, [fname] ; file name, PIC version. We don't need [rel fname] since we used DEFAULT REL.
; (Ubuntu 16.10 defaults to enabling position-independent executables that can use ASLR, but doesn't require it the way OS X does.)
;creat doesn't take flags. It's equivalent to open(path, O_CREAT|O_WRONLY|O_TRUNC, mode).
mov esi, 666o ;mode in octal, to be masked by the user's umask
syscall ; eax = fd or -ERRNO
test eax,eax ; error checking on system calls.
js .handle_error ; We don't print anything, so run under strace to see what happened.
;;; memcpy the BMP header to the start of our buffer.
;;; SSE loads/stores would probably be more efficient for such a small copy
mov edi, bmpbuf
mov esi, bmpheader
;Alternative: rep movsd or movsq may be faster.
;mov ecx, bmpheader_len/4 + 1 ; It's not a multiple of 4, but copy extra bytes because MOVSD is faster
mov ecx, bmpheader_len
rep movsb
; edi now points to the first byte after the header, where pixels should be stored
; mov edi, bmpbuf+bmpheader_len might let out-of-order execution get started on the rest while rep movsb was still running, but IDK.
;######### main loop
mov ebx, BMProws
.LOOPY: ; do{
mov ecx, BMPcols ; Note the use of a macro later to decide whether we need padding at the end of each row or not, so arbitrary widths should work.
.LOOPX: ; do{
mov dword [rdi], (0xFF <<16) | (0xFF <<8) | 0x00 ;RED=FF, GREEN=FF, BLUE=00
; stores one extra byte, but we overlap it with the next store
add rdi, 3 ;move address pointer by 3 bytes (1 pixel = 3 bytes, which we just have written)
dec ecx
jne .LOOPX ; } while(--x != 0)
; end of inner loop
%if ((BMPcols * 3) % 4) != 0
; Pad the row to a multiple of 4B
mov dword [rdi], 0xFFFFFFFF ; might only need a byte or word store, but another dword store that we overlap is fine as long as it doesn't go off the end of the buffer
add rdi, 4 - (BMPcols * 3) % 4 ; advance to a 4B boundary
%endif
dec ebx
jne .LOOPY ; } while(--y != 0)
;##### Write out the buffer to the file
; fd is still where we left it in RAX.
; write and close calls both take it as the first arg,
; and the SYSCALL ABI only clobbers RAX, RCX, and R11, so we can just put it in EDI once.
mov edi, eax ; fd
; write content to file: write(fd, bmpbuf, bmpbuf_len)
mov eax, 1 ;SYS_write
lea rsi, [bmpbuf] ;buffer.
; We already have enough info in registers that reloading this stuff as immediate constants isn't necessary, but it's much more readable and probably at least as efficient anyway.
mov edx, bmpbuf_len
syscall
; close file
mov eax, 3 ;SYS_close
; fd is still in edi
syscall
.handle_error:
; exit program
mov rax,60 ;system call number - exit
syscall
I used RIP-relative LEA sometimes, and absolute addressing (mov r32, imm32) sometimes for the static data. This is silly; really I should have just picked one and used it everywhere. (And if I picked absolute non-PIC so I know the address is definitely in the low 31 bits of virtual address space, take advantage of that everywhere with stuff like add edi,3 instead of RDI.)
See my comments on the original answer for more optimization suggestions. I didn't implement anything more than the most basic thing of combining the three byte-stores into one dword store. Unrolling so you can use wider stores would help a lot, but this is left as an exercise for the reader.

Resources