Limit on the number of characters read by sys_read - Linux

I was playing around with some assembly programming and wrote some code to read 4096 bytes from stdin using the syscall sys_read. However, it reads only around 120 bytes from stdin.
Why does this happen? Is there any system level setting that I can change in order to read more bytes in one go? Is there any other way I can get around this limitation and force the program or sys_read to read in more bytes?

stdin may be line-buffered; do you happen to have a line feed at that position?
In general, however, read is allowed to return less than what you ask for. The solution is to read in a loop until you have all the bytes you need.
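A minimal sketch of that loop in C (the helper name read_full is my own, not a standard call; the same structure applies to a raw sys_read loop in assembly):

#include <unistd.h>
#include <errno.h>

/* Keep calling read() until 'count' bytes arrive, EOF, or a real error.
   Returns the number of bytes actually read, or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t count)
{
    size_t total = 0;
    while (total < count) {
        ssize_t n = read(fd, (char *)buf + total, count - total);
        if (n == 0)                /* EOF */
            break;
        if (n < 0) {
            if (errno == EINTR)    /* interrupted before any data: retry */
                continue;
            return -1;
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}

A caller asking for 4096 bytes would then check whether read_full returned fewer than 4096 (EOF) or -1 (error).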

Related

Why does the sys_read system call end when it detects a new line?

I'm a beginner in assembly (using NASM). I'm learning assembly through a college course.
I'm trying to understand the behavior of the sys_read Linux system call when it's invoked. Specifically, sys_read stops when it reads a newline or line feed. According to what I've been taught, this is true, and this online tutorial article affirms the same claim:
When sys_read detects a linefeed, control returns to the program and the user's input is located at the memory address you passed in ECX.
I checked the Linux programmer's manual for the sys_read call (via "man 2 read"). It doesn't mention this behavior at all, even though it seems like it should:
read() attempts to read up to count bytes from file descriptor fd
into the buffer starting at buf.
On files that support seeking, the read operation commences at the
file offset, and the file offset is incremented by the number of bytes
read. If the file offset is at or past the end of file, no bytes are
read, and read() returns zero.
If count is zero, read() may detect the errors described below. In
the absence of any errors, or if read() does not check for errors, a
read() with a count of 0 returns zero and has no other effects.
If count is greater than SSIZE_MAX, the result is unspecified.
So my question really is, why does this behavior happen? Is it specified somewhere in the Linux kernel, or is it a consequence of something else?
It's because you're reading from a POSIX tty in canonical mode (where backspace works before you press return to "submit" the line; that's all handled by the kernel's tty driver). Look up POSIX tty semantics / stty / ioctl. If you ran ./a.out < input.txt, you wouldn't see this behaviour.
Note that read() on a TTY will return without a newline if you hit control-d (the EOF tty control-sequence).
Assuming that read() reads whole lines is ok for a toy program, but don't start assuming that in anything that needs to be robust, even if you've checked that you're reading from a TTY. I forget what happens if the user pastes multiple lines of text into a terminal emulator. Quite probably they all end up in a single read() buffer.
See also my answer on a question about small read()s leaving unread data on the terminal: if you type more characters on one line than the read() buffer size, you'll need at least one more read system call to clear out the input.
As you noted, the read(2) libc function is just a thin wrapper around sys_read. The answer to this question really has nothing to do with assembly language, and is the same for systems programming in C (or any other language).
Further reading:
stty(1) man page: where you can change which control character does what.
The TTY demystified: some history, and some diagrams showing how xterm, the kernel, and the process reading from the tty all interact. And stuff about session management, and signals.
https://en.wikipedia.org/wiki/POSIX_terminal_interface#Canonical_mode_processing and related parts of that article.
This is not an attribute of the read() system call, but rather a property of termios, the terminal driver. In the default configuration, termios buffers incoming characters (i.e. what you type) until you press Enter, after which the entire line is sent to the program reading from the terminal. This is for convenience so you can edit the line before sending it off.
As Peter Cordes already said, this behaviour is not present when reading from other kinds of files (like regular files) and can be turned off by configuring termios.
What the tutorial says is garbage, please disregard it.
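If you want to see the difference for yourself, here is a minimal C sketch of "configuring termios" (standard tcgetattr/tcsetattr calls; the particular flag choices below are just one reasonable configuration, not the only one):

#include <stdio.h>
#include <termios.h>
#include <unistd.h>

int main(void)
{
    struct termios saved, raw;
    if (tcgetattr(STDIN_FILENO, &saved) < 0) { perror("tcgetattr"); return 1; }
    raw = saved;
    raw.c_lflag &= ~(ICANON | ECHO);   /* no line buffering, no echo */
    raw.c_cc[VMIN] = 1;                /* read() returns after 1 byte */
    raw.c_cc[VTIME] = 0;               /* no inter-byte timeout */
    if (tcsetattr(STDIN_FILENO, TCSANOW, &raw) < 0) { perror("tcsetattr"); return 1; }

    char c;
    while (read(STDIN_FILENO, &c, 1) == 1 && c != 'q')   /* 'q' quits */
        printf("got 0x%02x\n", (unsigned char)c);

    tcsetattr(STDIN_FILENO, TCSANOW, &saved);  /* restore the terminal */
    return 0;
}

With ICANON cleared, each keypress comes back from read() immediately, with no line editing by the tty driver.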

Can posix read() receive less than requested 4 bytes from a pipe?

A program from the answer https://stackoverflow.com/a/1586277/6362199 uses the system call read() to receive exactly 4 bytes from a pipe. It assumes that the function read() returns -1, 0 or 4. Can the read() function return 1, 2 or 3 for example if it was interrupted by a signal?
In the man page read(2) there is:
On success, the number of bytes read is returned (zero indicates
end of file), and the file position is advanced by this number. It
is not an error if this number is smaller than the number of bytes
requested; this may happen for example because fewer bytes are
actually available right now (maybe because we were close to
end-of-file, or because we are reading from a pipe, or from a
terminal), or because read() was interrupted by a signal.
Does this mean that the read() function can be interrupted during receiving such a small amount of data as 4 bytes? Should the source code from this answer be corrected?
In the man page pipe(7) there is:
POSIX.1-2001 says that write(2)s of less than PIPE_BUF bytes must be atomic: the output data is written to the pipe as a contiguous sequence.
but there is nothing similar about read().
If the write is atomic, that means the entire content is already present in the buffer when the read happens, so the only way to have an incomplete read is if the kernel thread decides to yield before it's finished, which wouldn't happen here.
In general you can rely on small write()s on pipes on the same system mapping to identical read()s. 4 bytes is unquestionably far smaller than any buffer would ever be, so it will definitely be atomic.
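That said, if you'd rather not rely on atomicity at all, making the reader defensive costs only a few lines. A sketch (the helper name read_u32 is mine, not from the linked answer):

#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Read exactly sizeof *value bytes from fd, tolerating short reads
   and EINTR. Returns 0 on success, -1 on EOF or error. */
static int read_u32(int fd, uint32_t *value)
{
    char *p = (char *)value;
    size_t left = sizeof *value;
    while (left > 0) {
        ssize_t n = read(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted: retry */
            return -1;          /* real error */
        }
        if (n == 0)
            return -1;          /* unexpected EOF */
        p += n;
        left -= (size_t)n;
    }
    return 0;
}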

Linux read operations requesting duplicate bytes?

This is a bit of a strange question. I'm writing a FUSE module using the go-fuse library; at the moment I have a "fake" file with a size of 6000 bytes, which outputs some unrelated data for all read requests. My read function looks like this:
func (f *MyFile) Read(buf []byte, off int64) (fuse.ReadResult, fuse.Status) {
    log.Printf("Reading into buffer of len %d from %d\n", len(buf), off)
    FillBuffer(buf, uint64(off), f.secret)
    return fuse.ReadResultData(buf), fuse.OK
}
As you can see I'm outputting a log on every read containing the range of the read request. The weird thing is that when I cat the file I get the following:
2013/09/13 21:09:03 Reading into buffer of len 4096 from 0
2013/09/13 21:09:03 Reading into buffer of len 8192 from 0
So cat is apparently reading the first 4096 bytes of data, discarding it, then reading 8192 bytes, which encompasses all the data and so succeeds. I've tried with other programs too, including hexdump and vim, and they all do the same thing. Interestingly, if I do a head -c 3000 dir/fakefile it still does the two reads, even though the latter one is completely unnecessary. Does anyone have any insights into why this might be happening?
I suggest you strace your cat process to see for yourself. On my system, cat reads by 64K chunks, and does a final read() to make sure it read the whole file. That last read() is necessary to distinguish between reading a "chunk-sized file" and a bigger file; i.e. it makes sure there is nothing left to read, as the file size could have changed between the fstat() and the read() system calls.
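If you want to reproduce the observation, something along these lines will show each read call cat makes (dir/fakefile is the path from the question; the -e filter just cuts the output down to reads):
strace -e trace=read cat dir/fakefile > /dev/null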
Is your "fake file" size being returned correctly to FUSE by stat/fstat() system calls?

Is there any good reason to use a 500-byte buffer for a simple upper-case converter?

I am reading Programming from the Ground Up.
In chapter 5, the program uses a 500-byte buffer to convert characters that are each one byte long.
Shouldn't it have to use a double loop?
loop 1: read from the file, 500 bytes at a time.
loop 2: process the 500 bytes, perhaps one byte at a time.
I think this makes the program a little more complicated.
If I used a one-byte buffer for the conversion, I would need nothing but a single loop:
loop 1: read one byte and process it.
Is there any good reason to use a 500-byte buffer for a simple upper-case converter?
My development environment is x86, Linux, assembly, AT&T syntax.
The only reason to consider doing it 500 (or more) bytes at a time is that it may reduce the number of calls into the library and/or operating-system services you're using for I/O. I suggest you try it both ways and measure the performance difference for yourself. Say your two versions are compiled to executables named uppercase.byte_by_byte and uppercase.500_byte_blocks; you can get a report on the CPU and elapsed time for each by typing the following at the shell prompt:
time uppercase.byte_by_byte < input > output
time uppercase.500_byte_blocks < input > output
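For reference, here is a hedged C sketch of the double-loop version the question describes (C rather than AT&T assembly just to keep it short; the structure maps one-to-one onto a sys_read/sys_write loop):

#include <ctype.h>
#include <unistd.h>

#define BUF_SIZE 500

int main(void)
{
    char buf[BUF_SIZE];
    ssize_t n;

    /* Outer loop: one read() per 500-byte chunk instead of one per byte. */
    while ((n = read(STDIN_FILENO, buf, BUF_SIZE)) > 0) {
        /* Inner loop: process the chunk one byte at a time. */
        for (ssize_t i = 0; i < n; i++)
            buf[i] = toupper((unsigned char)buf[i]);
        if (write(STDOUT_FILENO, buf, n) != n)
            return 1;
    }
    return n < 0 ? 1 : 0;
}

The inner loop costs almost nothing; the point of the buffer is that the expensive part, the system call, happens once per 500 bytes instead of once per byte.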

How can I read a whole line of input in Assembly?

The only subroutine I know of capable of reading a user's alphabetical input is read_char, but I want to be able to read the user's whole input of characters, no matter how long it is.
I have a vague notion that I have to make room in memory to store the whole input, or something like that? I'm really lost, as I'm not certain whether assembly has an equivalent of the way you read strings in C++.
Thanks in advance.
Well, you should have a limit when reading input from the user, otherwise your program might not work properly anymore (see buffer overflow for more information), so making room for the input and ensuring the input won't exceed the buffer is very important.
Now, to get a string you have to call a DOS interrupt, giving it a pointer to your buffer and some other parameters. It will read until a carriage return is met.
But I think your prof wants you to read using his read_char, so (since this is homework) I'll just give you a hint: you have to do a loop and read chars until..
