Understanding input and output in ASM - linux

Can anyone please help me understand what exactly is happening here?
I am new to assembly language and have written the simple code below.
(I am developing on Linux.)
What I want to do is accept an integer from the user and just display what the user has entered.
.section .data
number:
.long 0
.section .text
.globl _start
_start:
movl $3, %eax #Read system call
movl $0, %ebx #file descriptor (STDIN)
movl $number, %ecx #the address to which data is to be read into.
movl $4, %edx #number of bytes to be read
int $0x80
#the entered number is stored in %ebx. it can be viewed using "echo $? "
movl number , %ebx
movl $1, %eax
int $0x80
But I am not getting the expected result. Instead I get the ASCII code of whatever character I type.
For example:
input - 2
output - 50
input - 1
output - 49
input - a
output - 97 ... and so on.
What is wrong? What changes should I make to get the desired result? What basic concept did I miss?

Input arrives as bytes in the system's native codepage, so a typed numeral arrives as its ASCII code. If you want to convert the numeral's ASCII code into its corresponding number, then you need to do two things:
Bounds check it. The value must be between '0' and '9' inclusive, otherwise it wasn't a numeral.
Subtract '0'. This way '0' becomes 0, '5' becomes 5, etc.
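The two steps above can be sketched in a few lines. This is only a Python model of the logic (the original program is assembly); the function name ascii_digit_to_int is made up for illustration.

```python
def ascii_digit_to_int(byte_value: int) -> int:
    """Convert one ASCII digit code (e.g. 50 for '2') to its numeric value."""
    # Step 1, bounds check: anything outside '0'..'9' is not a numeral.
    if not (ord("0") <= byte_value <= ord("9")):
        raise ValueError(f"byte {byte_value} is not an ASCII digit")
    # Step 2, subtract '0' (48) so that '0' -> 0, '5' -> 5, and so on.
    return byte_value - ord("0")


print(ascii_digit_to_int(50))  # 2, for the byte read when the user types "2"
```

In assembly the same idea is a compare against '0' and '9' followed by a subtraction of $48 (the code of '0') from the byte that was read.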

Write it in C then use your compiler to generate the assembly.
Then fix up the assembly to make it look like you wrote it.

Well, I have figured out an answer to my own question. It may be useful for others who have just started to learn ASM and run into the same problem of interpretation that I had.
First of all, let's look at what is happening in the code:
.section .data
number:
.long 0
.section .text
.globl _start
_start:
movl $3, %eax #Read system call
movl $0, %ebx #file descriptor (STDIN)
movl $number, %ecx #the address to which data is to be read into.
movl $4, %edx #number of bytes to be read
int $0x80
Up to this point the program waits for the user to type something and then reads it.
But what is actually read is the ASCII code of each character typed, one byte per character.
So when the user enters '2', the byte that is read is 0x32 (110010 in binary,
or 50 in decimal).
The read system call copies those bytes into the buffer whose address was passed in %ecx, here 'number'.
What is true for input is true for output. When bytes in memory are written to the screen, the terminal displays the character whose ASCII code each byte holds.
(Considering only the ASCII standard for the moment.)
thus,
INPUT------> ASCII CONVERSION--------> converted data (bytes)
STORED BYTES-------->ASCII CONVERSION--------->Output Data
So now the question arises: why is the output of this program (actually the contents of %ebx)
not as expected?
In our program we are not displaying anything on the screen. The exit system call takes its status from %ebx, and echo $? simply prints that status as a decimal number; here it is 110010 (50). If you want echo $? to display 2, then you have to put 2 (i.e. 00000010) in %ebx.
The shell is "interpreting" the exit status as an integer, not as a character.
That is what makes the difference!
The other thing has to do with the "little endian" Intel architecture.
When you enter 4523, why do we get "52" as output and not 53, 50, etc.?
read stores the bytes in the order they arrive, so '4' goes to the lowest address of 'number'.
On a little-endian machine the byte at the lowest address is the least significant byte of the 32-bit value.
If you run the same program on a big-endian machine you may get a different result.
And the most important thing: the exit status keeps only the least significant byte of %ebx,
so that byte is all that "echo $?" can show.
That is why we get the output 52, i.e. '4'.
This is a tricky concept to understand and explain. You have to spend some time thinking it over. Once you get it, you feel great.
Thank you.
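The byte-order argument above can be checked with a short model. This is a Python sketch rather than assembly; the variable names buffer, ebx and exit_status simply mirror the discussion.

```python
# Bytes as read() stores them for the input "4523": first character at the lowest address.
buffer = b"4523"

# movl number, %ebx on a little-endian CPU: the lowest address is the least significant byte.
ebx = int.from_bytes(buffer, byteorder="little")

# The exit status visible through "echo $?" keeps only the low 8 bits.
exit_status = ebx & 0xFF

print(hex(ebx))      # 0x33323534
print(exit_status)   # 52, the ASCII code of '4'

# For comparison, a big-endian load would leave the last character in the low byte.
print(int.from_bytes(buffer, byteorder="big") & 0xFF)  # 51, the ASCII code of '3'
```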

Related

Assistance on x86 Assembler running on Linux

I am currently learning a bit of Assembler on Linux and I need your advice.
Here is the small program:
.section .data
zahlen:
.float 12,44,5,143,223,55,0
.section .text
.global _start
_start:
movl $0, %edi
movl zahlen (,%edi,4), %eax
movl %eax, %ebx
begin_loop:
cmpl $0, %eax
je prog_end
incl %edi
movl zahlen (,%edi,4), %eax
cmpl %ebx, %eax
jle begin_loop
movl %eax, %ebx
jmp begin_loop
prog_end:
movl $1, %eax
int $0x80
The program seems to assemble and run fine.
But some of its behavior is unclear to me:
If I check the return value, which should be the highest number in register %ebx, with the command "echo $?", it always returns 0. I expect the value 223.
Any idea why this happens?
I checked with DDD and gdb after building with debugging information, and I saw that the program runs the correct steps.
But if I want to examine a register with a command such as "i r eax", it only shows me what I believe is an address, not the value. Same in DDD. I see only the registers rax, rbx and so on.
Here I need some advice to get on the right track.
Any help appreciated.
Thanks
The "main" registers eax, ebx, ecx, edx, etc. are all designed to work with integers only. A float is a shorthand term that typically refers to a very specific data format (namely, the IEEE-754 binary32 standard), for which your CPU has dedicated registers and hardware to work with. As you saw, you are allowed to load them into integer registers as-is, but the value isn't going to convert itself like it would in a high-level, dynamically-typed language. Your code loaded the raw bit pattern instead, which likely is not at all what you intended.
This is because assembly has no type safety or runtime type-checking. The CPU has no knowledge of what type you declared your data as in your program. So when loading from memory into eax the CPU assumes that the data is a 32-bit integer, even if you declared it in your source code as something else.
If you're curious as to what a float actually looks like you can check this out: Floating Point to Hex Calculator
Switching from .float to .long solved the problem; that was my own mistake. Also, assembling and linking as 32-bit shows the right registers in the debugger.
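To see why the exit status was 0 rather than 223, it helps to look at the raw bit patterns that the integer instructions compared. The sketch below is a Python model using the standard struct module; float_bits is a helper name invented here.

```python
import struct

values = [12.0, 44.0, 5.0, 143.0, 223.0, 55.0]


def float_bits(value: float) -> int:
    """Return the IEEE-754 binary32 bit pattern of value as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


patterns = [float_bits(v) for v in values]
for v, p in zip(values, patterns):
    print(f"{v:6.1f} -> 0x{p:08X}")

# The loop in the question keeps the largest pattern in %ebx ...
largest = max(patterns)
# ... and the exit status is only the low byte of %ebx.
exit_status = largest & 0xFF
print(hex(largest), exit_status)  # 0x435f0000 0
```

Every one of these small whole numbers has zeros in the low bits of its mantissa, so the low byte of each pattern is 0, and that is what "echo $?" reports.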

read and write to file assembly

I have an inputfile.txt which looks like this: 3 4 2 0 8 1 5 3
I'm trying to write inside an outputfile.txt each character of inputfile incremented by 1.
So inside outputfile.txt I should see 4 5 3 1 9 2 6 4.
I have tried to write this piece of code but I have several doubts.
.section .data
buff_size: .long 18
.section .bss
.lcomm buff, 18
.section .text # declaring our .text segment
.globl _start # telling where program execution should start
_start:
popl %eax # Get the number of arguments
popl %ebx # Get the program name
popl %ebx # Get the first actual argument - file to read
# open the file
movl $5, %eax # open
movl $0, %ecx # read-only mode
int $0x80
# read the file
movl $0, %esi
movl %eax, %ebx # file_descriptor
analyzecharacter: #here I want to read a single character
movl $3, %eax
movl $buff, %edi
leal (%esi,%edi,1), %ecx
movl $1, %edx
int $0x80
add $1, %esi #this point is not clear to me, what I'd like to do is to increment the index of the buffer in order to be positioned on the next cell of buffer array, I've added 1 but I think is not correct
cmp $8, %esi # if I've read all 8 characters then I'll exit
je exit
openoutputfile:
popl %ebx # Get the second actual argument - file to write
movl $5, %eax # open
movl $2, %ecx # read/write mode (O_RDWR)
int $0x80
writeinoutputfile:
#increment by 1 and write the character to STDOUT
movl %eax, %ebx # file_descriptor
movl $4, %eax
leal (%esi,%edi,1), %ecx
add $1, %ecx #increment by 1
movl $1, %edx
int $0x80
jmp analyzecharacter
exit:
movl $1, %eax
movl $0, %ebx
int $0x80
I have 2 problems/doubts:
1- my first doubt is about this instruction: add $1, %esi. Is this the right way to move through buffer array?
2- The second doubt is: When I analyze each character should I always invoke openoutputfile label? I think that in this way I'm reopening the file and the previous content is overwritten.
Indeed if I run the program I see only a single character \00 (a garbage character, caused by the value of %esi in this instruction I guess: leal (%esi,%edi,1), %ecx ).
I hope my problems are clear, I'm pretty new to assembly and I've spent several hours on this.
FYI:
I'm using the GNU assembler (GAS) with AT&T syntax.
Moreover, I'm on 64-bit Ubuntu with an Intel CPU.
So, how I would do the code...
Thinking about it, I'm so used to Intel syntax, that I'm unable to write AT&T source from my head on the web without bugs (and I'm too lazy to actually do the real thing and debug it), so I will try to avoid writing instructions completely and just describe the process, to let you fill up the instructions.
So let's decide you want to do it char by char, version 1 of my source:
start:
; verify the command line has enough parameters, if not jump to exitToOs
; open both input and output files at the start of the code
processingLoop:
; read single char
; if no char was read (EOF?), jmp finishProcessing
; process it
; write it
jmp processingLoop
finishProcessing:
; close both input and output files
exitToOs:
; exit back to OS
Now "run" it in your mind and verify that all the major branch points make sense and handle the major corner cases correctly.
Make sure you understand how the code will work, where it will loop, and where and why it will break out of the loop.
Make sure there's no infinite loop and no leaking of resources.
After going through my checklist, I see one subtle problem with this design: it does not rigorously check for file system errors, such as failing to open either of the files or failing to write the character (but your source doesn't check either). Otherwise I think it should work well.
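The version 1 outline can be tried out before committing to instructions. The following is a Python sketch of the same control flow, using os.open, os.read and os.write because they map closely onto the Linux system calls; the names copy_incremented and process_byte are invented here, and the choice to increment only the digits '0' to '8' is an assumption about what "incremented by 1" should mean for spaces and for '9'.

```python
import os
import tempfile


def process_byte(value: int) -> int:
    # Assumption: only ASCII digits '0'..'8' are incremented; spaces,
    # newlines and '9' pass through unchanged.
    if ord("0") <= value <= ord("8"):
        return value + 1
    return value


def copy_incremented(input_path: str, output_path: str) -> None:
    # Open both files once, before the loop.
    in_fd = os.open(input_path, os.O_RDONLY)
    try:
        out_fd = os.open(output_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o666)
        try:
            while True:
                chunk = os.read(in_fd, 1)   # read a single byte
                if not chunk:               # zero bytes read means end of file
                    break
                os.write(out_fd, bytes([process_byte(chunk[0])]))
        finally:
            os.close(out_fd)                # close the output file
    finally:
        os.close(in_fd)                     # close the input file


with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "inputfile.txt")
    dst = os.path.join(workdir, "outputfile.txt")
    with open(src, "wb") as f:
        f.write(b"3 4 2 0 8 1 5 3")
    copy_incremented(src, dst)
    with open(dst, "rb") as f:
        print(f.read())  # b'4 5 3 1 9 2 6 4'
```

The shape is the same as the outline: both files are opened exactly once, the loop ends when the read returns zero bytes, and both descriptors are closed on every path.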
So let's extend it in version 2 with real instructions (32-bit AT&T syntax for GAS; I have not stepped through it in a debugger, so treat it as a starting point and check it yourself):
# 32-bit code; on a 64-bit system assemble with "as --32" and link with "ld -m elf_i386"
.section .text
.globl _start
_start:
# verify the command line has enough parameters, if not jump to exitToOs
popl %eax # Get the number of arguments
cmpl $3, %eax # "./binary fileinput fileoutput" should give 3 here. Debug!
jne exitToOs
# open both input and output files at the start of the code
movl $5, %eax # open
popl %ebx # Get the program name
# open input file first
popl %ebx # Get the first actual argument - file to read
movl $0, %ecx # read-only mode
int $0x80
testl %eax, %eax # int $0x80 returns a negative errno value on failure
js exitToOs
movl %eax, varInputHandle # store input file handle to memory
# open output file, make it writable, create if not exists
movl $5, %eax # open
popl %ebx # Get the second actual argument - file to write
# a leading 0 makes these two constants octal
movl $0101, %ecx # O_CREAT (0100) + O_WRONLY (01)
movl $0666, %edx # permissions for out file as rw-rw-rw-
int $0x80
testl %eax, %eax # valid file handle?
js exitToOs
movl %eax, varOutputHandle # store output file handle to memory
processingLoop:
# read single char to varBuffer
movl $3, %eax
movl varInputHandle, %ebx
movl $varBuffer, %ecx
movl $1, %edx
int $0x80
# if no char was read (EOF), jmp finishProcessing
cmpl $0, %eax
je finishProcessing # looks like total success, finish cleanly
jl exitToOs # read error, exit code stays 1
# TODO process it (this increments every byte, spaces included)
incb varBuffer # you wanted this IIRC?
# write it
movl $4, %eax
movl varOutputHandle, %ebx # file_descriptor
movl $varBuffer, %ecx # still set from char read, so just for readability
movl $1, %edx # this one is still set from char read too
int $0x80
# done, go for the next char
jmp processingLoop
finishProcessing:
movl $0, varExitCode # everything went OK, set exit code to 0
exitToOs:
# close both input and output files, if any of them is opened
movl varOutputHandle, %ebx # file_descriptor
call closeFile
movl varInputHandle, %ebx
call closeFile
# exit back to OS
movl $1, %eax
movl varExitCode, %ebx
int $0x80
closeFile:
cmpl $-1, %ebx
je closeFileDone # file not opened, just ret
movl $6, %eax # sys_close
int $0x80
# returns 0 when OK, or a negative errno in case of error, but no handling here
closeFileDone:
ret
.section .data
varExitCode: .long 1 # default value for exit code is "1" (some error)
varInputHandle: .long -1 # default = invalid handle
varOutputHandle: .long -1 # default = invalid handle
varBuffer: .byte 0 # (single byte buffer)
Whoa, I actually wrote it fully? (Of course it still needs a pass through the assembler and a debugger.)
But I mean, the comments from version 1 were already so detailed that each one required only a handful of ASM instructions, so it was not that difficult (although I now see that I submitted the first answer 53 minutes ago, so this was about one hour of work for me, including googling and a few errands elsewhere).
And I absolutely don't get why a human would want to use AT&T syntax, which is so ridiculously verbose. I can easily understand why GCC uses it; for compilers it is perfectly fine.
But maybe you should check out NASM, which is "human" oriented (as little syntactic sugar as possible, with the focus on the instructions). The major problem (or advantage, in my opinion) with NASM is Intel syntax, e.g. MOV eax,ebx copies the value of ebx into eax. That operand order is Intel's doing: they took the LD syntax of other microprocessor manufacturers, ignored the LD = load meaning, and renamed it to MOV = move so as not to blatantly copy the instruction set.
Then again, I have absolutely no idea why ADD $1,%eax is the correct way in AT&T (instead of the eax,1 order), and I don't even want to know, but it doesn't make any sense to me (the reversed MOV makes at least some sense due to the LD origins of Intel's MOV syntax).
OTOH I can relate to cmp $number,%reg, since I started to use "yoda" formatting in C++ to avoid changing a variable by accident inside an if (compare if (0 = variable) with if (variable = 0), both with the typo = instead of the intended ==; the "yoda" one will not compile even with warnings off).
But ... oh.. this is my last AT&T ASM answer for this week, it annoys the hell out of me. (I know this is personal preference, but all those additional $ and % annoy me just as much as the reversed order.)
Please, I spent a serious amount of time writing this. Try to spend serious time studying it and trying to understand it. If confused, ask in the comments, but it would be a pitiful waste of our time if you completely missed the point and did not learn anything useful from this. :) So keep at it.
Final note: search hard for a debugger and find one that suits you well (probably a visual one, like the old "TD" from Borland in the DOS days, would be super nice for a newcomer). To improve quickly it is absolutely essential that you can step over the code instruction by instruction and watch how the registers and memory contents change. Really, if you were able to debug your own code, you would soon realize you are reading the second character from the wrong file handle in %ebx... (at least I hope so).
Just to clear up 1) early: add $1, %esi has the same effect on %esi as inc %esi (the only difference is that inc leaves the carry flag untouched).
While you are learning assembler, I would go for the inc variant, so you don't forget that it exists and get used to it. Back in 286-586 times it was also faster to execute; today add is often used instead because of the complexity of the microarchitecture (μops), where inc is a tiny fraction more complicated for the CPU (translating it back to add μops, I guess). You shouldn't worry about this while learning the basics; aim for "human" readability of the source and do not try any performance tricks yet.
Is it the right way?
Well, you should first decide whether you want to parse it per character (or rather per byte, as a character nowadays is often a UTF-8 sequence of 1 to 4 bytes) OR to process it with buffers.
Your mix of the two makes it easy to introduce additional mistakes in the code.
From a quick look I see:
you read only a single byte per syscall, yet you store it at a new place in buffer+counter (why? Just use a single-byte buffer if you work per byte)
when the counter is 8, you exit (not processing the 8th read byte at all)
you lose your input file descriptor forever after opening the output file the first time, because of popl %ebx (leaking file handles is very bad)
then the second char is read from the output file (reusing the file handle from the write)
then you popl %ebx again, but there is no third parameter on the command line, i.e. you fetch undefined memory from the stack
indeed you reopen the output file each time, so unless it's in append mode, it will overwrite the content.
Those are probably all the major blunders, but there are so many of them that I would suggest you start over from scratch.
I will try to write a quick version of my own in the next answer (as this one is getting a bit long) to show you how I would do it. But first please try (hard) to find all the points I highlighted above and understand how your code works. If you fully understand what your instructions do, and why they really cause the errors I described, you will have a much easier time designing and debugging your next code. So the more of these points you find and fully understand, the better for you.
"BTW notes":
I never did linux asm programming (I'm now itching to do something after reading about your effort), but from some wiki about system calls I read:
All registers are preserved during the syscall.
Except return value in eax of course.
Keep this in mind, it may save you some hassle with repeating register setup before call, if you group syscalls appropriately.

x86 Assembly (Linux) stdin read syscall returning 1 with no data input

When I compare $0 against %eax for error checking, and enter no input when prompted, the error message does not display. However, when I compare $1 against %eax and enter no input, the error message displays. I'm aware that a read syscall returns the amount of bytes read into %eax, although I'm unsure as to why it returns a byte was read when no input is given, the man pages also don't give me any indication to why this is the case. Is stdin input null-terminated or is it something else?
movl $3, %eax
movl $0, %ebx
movl $BUFFER, %ecx
movl $BUFFER_SIZE, %edx
int $0x80
cmpl $0, %eax
jle input_error
If cmpl $0 is changed to cmpl $1 and no input is given, it jumps to input_error; with cmpl $0, program flow proceeds when no input is given.
You can't call sys_read with these parameters and expect that there will be no input.
sys_read is a blocking call, so it blocks until some input appears. In your case, I think, you press ENTER, and sys_read then returns with 1 byte in the buffer (LF, ASCII code 0Ah) and eax=1. Stdin input is not null-terminated; a return value of 0 only occurs at end of file, e.g. when you press Ctrl+D on an empty line.
P.S. It is not my job, but better use FASM or at least NASM. This way you will get much more help in your ASM programming. GAS has really terrible syntax. :)
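The behavior of read described in this answer can be reproduced without a keyboard. The sketch below is a Python model in which a pipe stands in for the terminal (that substitution is an assumption made only for this illustration); os.read wraps the same read system call.

```python
import os

# A pipe stands in for the terminal in this sketch.
read_end, write_end = os.pipe()

# Pressing ENTER on an empty line delivers exactly one byte: LF (0x0A).
os.write(write_end, b"\n")
data = os.read(read_end, 64)
print(len(data), data)      # 1 b'\n'

# Closing the write end models end of input (Ctrl+D on an empty line).
os.close(write_end)
at_eof = os.read(read_end, 64)
print(len(at_eof))          # 0

os.close(read_end)
```

So a check for "the user typed nothing" has to compare the count against 1 (only the newline), while a count of 0 means end of input.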

Confused about AT&T Assembly Syntax for addressing modes vs. jmp and far jmp

In AT&T assembly syntax, literal values must be prefixed with a $ sign.
But in memory addressing, literal values do not have a $ sign.
For example:
mov %eax, -100(%eax)
and
jmp 100
jmp $100, $100
are different.
My question is: why is the $ prefix so confusing?
jmp 100 is a jump to absolute address 100, just like jmp my_label is a jump to the code at my_label. EIP = 100 or EIP = the address of my_label.
(jmp 100 assembles to a jmp rel32 with a R_386_PC32 relocation, asking the linker to fill in the right relative offset from the jmp instruction's own address to the absolute target.)
So in AT&T syntax, you can think of jmp x as sort of like an LEA into EIP.
Or another way to think of it is that code-fetch starts from the specified memory location. Requiring a $ for an immediate wouldn't really make sense, because the machine encoding for direct near jumps uses a relative displacement, not absolute. (http://felixcloutier.com/x86/JMP.html).
Also, indirect jumps use a different syntax (jmp *%eax register indirect or jmp *(%edi, %ecx, 4) memory indirect), so a distinction between immediate vs. memory isn't needed.
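To make the relative encoding concrete, here is a Python sketch of the arithmetic behind a direct near jump: opcode E9 followed by a 32-bit displacement measured from the end of the instruction. The function name encode_jmp_rel32 and the load address 0x08048000 are hypothetical, chosen only for illustration.

```python
import struct


def encode_jmp_rel32(instruction_address: int, target: int) -> bytes:
    """Encode a direct near jump (opcode E9) placed at instruction_address."""
    instruction_length = 5  # 1 opcode byte + 4 displacement bytes
    # The displacement is relative to the address of the *next* instruction.
    displacement = target - (instruction_address + instruction_length)
    return b"\xE9" + struct.pack("<i", displacement)


# Hypothetical placement: the jmp sits at 0x08048000 and targets absolute address 100.
code = encode_jmp_rel32(0x08048000, 100)
print(code.hex())  # e95f80fbf7
```

The source still names the absolute target (jmp 100); the subtraction is what the relocation asks the linker to perform once the final address of the instruction is known.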
But far jump is a different story.
jmp ptr16:32 and jmp m16:32 are both available in 32-bit mode, so you do need to distinguish between ljmp *(%edi) vs. ljmp $100, $100.
Direct far jump (jmp far ptr16:32) does take an absolute segment:offset encoded into the instruction, just like add $123, %eax takes an immediate encoded into the instruction.
Question: why is the $ prefix so confusing?
The $ prefix means "use the value as is" (an immediate).
example:
movl $5, %eax #copy value 5 to eax
addl $10,%eax # add 10 + 5 and store result in eax
$5 and $10 are values (constants) and are not taken from any external source such as a register or memory.
In memory addressing, specifically "direct addressing mode", we want to use the value stored at a particular memory location.
example:
movl 20, %eax
The above gets the value stored at memory address 20.
In practice memory addresses are plain numbers (0x00000000 to 0xffffffff in 32-bit mode), and hard-coding them in instructions is impractical. So we assign a symbol to the location.
Example:
.section .data
mydata:
.long 4
.section .text
.globl _start
_start:
movl mydata, %eax
In the above code, mydata is a symbolic name for the memory location where the value 4 is stored.
I hope the above clears your confusion.
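The difference between the two operand forms can be modelled with a toy memory. This is a Python sketch only: the dictionary contents and the address 20 assigned to mydata are made-up values, and the helper names are invented.

```python
# Toy model: memory maps an address to the value stored there (values are made up).
memory = {20: 4}
symbols = {"mydata": 20}   # the assembler and linker map a symbol to an address


def load_immediate(value: int) -> int:
    """movl $value, %eax : the operand itself is the data."""
    return value


def load_direct(address: int) -> int:
    """movl address, %eax : the operand is an address; fetch what is stored there."""
    return memory[address]


print(load_immediate(20))                  # movl $20, %eax      -> 20
print(load_direct(20))                     # movl 20, %eax       -> 4
print(load_direct(symbols["mydata"]))      # movl mydata, %eax   -> 4
print(load_immediate(symbols["mydata"]))   # movl $mydata, %eax  -> 20 (the address)
```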

Conditional move problem

Code fragment from Assembly exercise (GNU Assembler, Linux 32 bit)
.data
more:
.asciz "more\n"
.text
...
movl $more, %eax # this is compiled
cmova more, %eax # this is compiled
cmova $more, %eax # this is not compiled
Error: suffix or operands invalid for `cmova'
I can place the string's address in %eax using movl, but the cmova line with $more does not assemble. I need the source operand to be $more and not more, so that I can use it for printing. In the end this value goes into the %ecx register for Linux system call 4 (write).
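The assembler is correct! The CMOVcc instructions are more limited than MOV: they can only move 16/32/64-bit values from memory into a register, or from one register to another. They don't support immediate (or 8-bit register) operands. To move the address conditionally, first load it into a spare register with movl $more, %edx and then use cmova %edx, %eax.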
(Reference: http://www.intel.com/Assets/PDF/manual/253666.pdf - from the set of manuals available at http://www.intel.com/products/processor/manuals/index.htm .)
