What happens after read() is called for a Linux socket?

What actually happens after calling read:
n = read(fd, buf, try_read_size);
here fd is a TCP socket descriptor. buf is the buffer. try_read_size is the number of bytes that the program tries to read.
I guess this eventually invokes a system call to the kernel. But could anyone provide some details, say, the source code implementation in glibc or the kernel source?

From a high-level perspective, this is what happens:
A wrapper function provided by glibc is called
The wrapper function puts the parameters passed on the stack into registers and sets the syscall number in the register dedicated for that purpose (e.g. EAX on x86)
The wrapper function executes a trap or equivalent instruction (e.g. SYSENTER)
The CPU switches to ring0, and the trap handler is invoked
The trap handler checks the syscall number for validity and looks it up in a jump table to kernel functions
The respective kernel function checks whether arguments are valid (e.g. the range buf to buf+try_read_size refers to accessible memory pages, fd is really a file descriptor). If something is amiss, a negative error code (e.g. -EFAULT) is generated, the cpu is switched back to user mode and the call returns to the wrapper.
Another function is called depending on the file descriptor's type (in your case a socket, but one could read from a block device or a proc entry or something more exotic)
The socket's input buffer is checked:
- If there is some data in the buffer, min(available, try_read_size) bytes are copied to buf, the amount is written to the return code register (EAX on x86), the CPU is switched back to user mode and the call returns to the wrapper.
- If the input buffer is empty:
  - If the connection has been closed, zero is written to the return code register, the CPU is switched back to user mode and the call returns to the wrapper.
  - If the connection has not been closed:
    - If the socket is non-blocking, the negative error code -EAGAIN is written to the return code register, the CPU is switched back to user mode and the call returns to the wrapper.
    - If the socket is blocking, the process is suspended until data arrives or the connection is closed.
The wrapper function checks whether the return value is negative (error).
If positive or zero, it returns the value.
If negative, it sets errno to the negated value (a positive error is reported) and returns -1

Related

Is it possible to hook a function call with kprobes?

According to https://docs.kernel.org/trace/kprobes.html it is possible to set the instruction pointer within a kprobe's pre_handler function.
Since kprobes can probe into a running kernel code, it can change the register set, including instruction pointer. This operation requires maximum care, such as keeping the stack frame, recovering the execution path etc. Since it operates on a running kernel and needs deep knowledge of computer architecture and concurrent computing, you can easily shoot your foot.
If you change the instruction pointer (and set up other related registers) in pre_handler, you must return !0 so that kprobes stops single stepping and just returns to the given address. This also means post_handler should not be called anymore.
A similar question was asked here: https://linux-kernel.vger.kernel.narkive.com/et7AyFPm/kprobe-pre-handler-change-return-ip. It appears that if the current kprobe is "cleaned up" and the pre_handler sets the new instruction pointer and then returns 1, you can divert execution into a function other than the intended one.
I may be doing things wrong, but here is my kprobes pre_handler function:
int handler_pre(struct kprobe *kp, struct pt_regs *regs)
{
    regs->ip = (unsigned long)mock_function;
    reset_current_kprobe();
    preempt_enable_no_resched();
    return 1;
}
First off, when I compile my module I get the error:
WARNING: "per_cpu__current_kprobe" undefined!
If I try to add the line:
EXPORT_PER_CPU_SYMBOL(current_kprobe);
after the kprobe definition, I still get the undefined warning above. Removing the reset_current_kprobe() call removes the compiler warning and allows me to insert the module but, as you may have guessed, it completely crashes the kernel. Since the kernel crashes, I am unable to figure out what is going wrong.
My understanding is that kprobes replace the first instruction at a probed address with a breakpoint instruction which triggers the pre_handler. So by the time the pre_handler is reached, a stack frame for the intended function shouldn't have been created. In my mind this removes the possibility that I could be somehow messing up the stack but I could be completely wrong.
Does anyone have any insight as to how I could go about fixing this issue or what I am doing wrong?

Close call does not release underlying resources for the device

A bit of context: Linux 3.10.40, a multi-threaded application, the main thread waiting for user input (keyboard), other threads waiting (epoll_wait()) for events. No specific priority for either the application or its child threads, no binding to a specific core.
I have a problem when I try to close the device /dev/ttyGS from my application in user space. close() returns 0 and the file descriptor is indeed removed from the process's fd list, but the underlying tty port is not released (because the gs_close() callback is not called).
It "only" happens when I test the following scenario: unloading my driver while /dev/ttyGS is still open.
However, if I close /dev/ttyGS during the "normal" application exit path, i.e. do the tear down sequence (including the close(fd) call) and exit the application, then unload the driver (in the shell) I am not facing this issue.
From my (main thread) application:
// during application initialization
fd = open("/dev/ttyGS0", O_NONBLOCK | O_NOCTTY)
fd1 = epoll_create(....);
epoll_ctl(fd1, EPOLL_CTL_ADD, fd, &evt);
fd2 = epoll_create(....)
....
// then during application life
system("rmmod mydriver");
mydriver_exit
// some code ....
eventfd_signal
// some code ....
wait_event_interruptible
// Then from my event thread of my application
exit epoll_wait(fd2)
// some code ....
epoll_ctl(fd1, EPOLL_CTL_DEL, fd, NULL);
close(fd)
// .... some code within the kernel fs subsystem
fput(filp);
if (atomic_long_dec_and_test(&file->f_count)) {
    // some code ....
    if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
        if (!task_work_add(task, &file->f_u.fu_rcuhead, true))
            return;
        // some code ....
        schedule_work(&delayed_fput_work);
        spin_unlock_irqrestore(&delayed_fput_lock, flags);
    }
}
// return from syscall
// some code ....
write(some_sysfs_attribute)
// some code ....
wake_up_interruptible
// return from syscall
// some code ....
go_back_to_epoll_wait(fd2)
// etc...
Is it correct to call close() from a child thread when the open() was performed in another (the main) thread of my application? I guess so...
The problem I have here is that file->f_count is greater than 1, so the if branch is not taken and therefore the work, which would eventually trigger tty_release() and thus the gs_close() callback, is not scheduled.
I grepped for the f_count increment locations in the fs subsystem and, from the results I got, apart from open, they were in the locking subpart (i.e. fs/lockd).
So I was wondering whether some lock (taken as part of the close() call) could have a grasp on the file (increasing its reference count) during the close, which could prevent the work (and thus the callback) from being scheduled.
From what I know, file descriptors are shared between all the threads of a process, and looking in /proc/<my_app_pid>/fd and /proc/<my_app_child_pid>/fd I indeed see the same fds.
Still, if I am not mistaken, the fd table is shared between all the threads (within the same process), which I guess should involve some kind of lock, which might explain the problem.
The thing is that I don't really know the fs subsystem (neither its architecture nor its source code). I have tried to read the source, but although some parts of it are understandable, others are less so (or rather trickier, especially without a good overview). I am struggling a bit to identify what could have a grasp on the reference count.
Any idea of what the problem could be?

Advantage of kprobes over kretprobes

Both kprobes and kretprobes allow you to put a probe on a particular instruction address in the kernel.
If you register a kprobe, the pre_handler gets executed before the actual function and post_handler after the actual function
With kretprobes, you can get the entry_handler to execute before the actual function and the ret_handler to execute after it; the ret_handler also has access to the return value of the function call.
So, what is the advantage of using kprobes over kretprobes, given that kretprobes have the features of kprobes plus the return value of the function?
A kprobe can be placed on any instruction, not only at the start of a kernel function (if kprobes are allowed in the given kernel code, of course).
The handlers of a kprobe run before and after the instruction.
Kretprobes only make sense for probing function entries and exits. The handlers of a kretprobe run on entry to a function and at its exit, rather than before and after some instruction, like kprobe handlers do.
Besides, if you don't need to run your code at the function exit, kprobes might be a better choice than kretprobes for probing functions (although Ftrace might be even better). Kretprobes meddle with the return address of the function on the stack to get the handler executed. If the function crashes or dumps the backtrace for some other reason, the backtrace may include the addresses of kretprobe internals rather than the real return addresses, which may be confusing.
https://www.kernel.org/doc/Documentation/kprobes.txt

Can dup2 really return EINTR?

In the spec and two implementations:
According to POSIX, dup2() may return EINTR.
The linux man pages list it as permitted.
The FreeBSD man pages indicate it is never returned. Is this a bug, since FreeBSD's close() implementation can return EINTR (at least for TCP linger, if nothing else)?
In reality, can Linux return EINTR for dup2()? Presumably if so, it would be because close() decided to wait and a signal arrived (TCP linger or dodgy file system drivers that try to sync when closing).
In reality, does FreeBSD guarantee not to return EINTR for dup2()? In that case, it must be that it doesn't bother waiting for any outstanding operations on the old fd and simply unlinks the fd.
What does POSIX dup2() mean when it refers to "closing" (not in italics) rather than referencing the actual close() function? Are we to understand it is just talking about "closing" informally (unlinking the file descriptor), or is it attempting to say that the effect should be as if the close() function were first called and then dup2() were called, atomically?
If fildes2 is already a valid open file descriptor, it shall be closed first, unless fildes is equal to fildes2 in which case dup2() shall return fildes2 without closing it.
If dup2() does have to close, wait, then atomically dup, it's going to be a nightmare for implementors! It's much worse than the EINTR with close() fiasco. Cowardly POSIX doesn't even say if the dup took place in the case of EINTR...
Here's the relevant information from the C/POSIX library documentation with respect to the standard Linux implementation:
If OLD and NEW are different numbers, and OLD is a valid
descriptor number, then `dup2' is equivalent to:

    close (NEW);
    fcntl (OLD, F_DUPFD, NEW)

However, `dup2' does this atomically; there is no instant in the
middle of calling `dup2' at which NEW is closed and not yet a
duplicate of OLD.
It lists the possible error values returned by dup and dup2 as EBADF, EINVAL, and EMFILE, and no others. The documentation states that all functions that can return EINTR are listed as such, which indicates that these don't. Note that these are implemented via fcntl, not a call to close.
8 years later this still seems to be undocumented.
I looked at the Linux sources and my conclusion is that dup2() cannot return EINTR in a current version of Linux.
In particular, the function do_dup2 in fs/file.c ignores the return value of filp_close, which is what can cause close to return EINTR in some cases (see fs/open.c and fs/file.c).
The way dup2 works is it first makes the atomic file descriptor update, and then waits for any flushing that needs to happen on close. Any errors happening on flush are simply ignored.

When does the write() system call write all of the requested buffer versus just doing a partial write?

If I am counting on my write() system call to write say e.g., 100 bytes, I always put that write() call in a loop that checks to see if the length that gets returned is what I expected to send and, if not, it bumps the buffer pointer and decreases the length by the amount that was written.
So once again I just did this, but now that there's StackOverflow, I can ask you all if people know when my writes will write ALL that I ask for versus give me back a partial write?
Additional comments: X-Istence's reply reminded me that I should have noted that the file descriptor was blocking (i.e., not non-blocking). I think he is suggesting that the only way a write() on a blocking file descriptor will not write all the specified data is when the write() is interrupted by a signal. This seems to make at least intuitive sense to me...
write() may return a partial write, especially for operations on sockets or when internal buffers are full. So a good way is to do the following:
while (size > 0 && (res = write(fd, buf, size)) != size) {
    if (res < 0 && errno == EINTR)
        continue;
    if (res < 0) {
        // real error processing
        break;
    }
    size -= res;
    buf += res;
}
Never rely on what usually happens...
Note: in case of full disk you would get ENOSPC not partial write.
You need to check errno to see if your call got interrupted, or why write() returned early, and why it only wrote a certain number of bytes.
From man 2 write
When using non-blocking I/O on objects such as sockets that are subject to flow control, write() and writev() may write fewer bytes than requested; the return value must be noted, and the remainder of the operation should be retried when possible.
Basically, unless you are writing to a non-blocking socket, the only other time this will happen is if you get interrupted by a signal.
[EINTR] A signal interrupted the write before it could be completed.
See the Errors section in the man page for more information on what can be returned, and when it will be returned. From there you need to figure out if the error is severe enough to log an error and quit, or if you can continue the operation at hand!
This is all discussed in the book: Advanced Unix Programming by Marc J. Rochkind, I have written countless programs with the help of this book, and would suggest it while programming for a UNIX like OS.
Writes shouldn't have any reason to write a partial buffer, AFAIK. Possible reasons I could think of for a partial write are running out of disk space, writing past the end of a block device, or writing to a char device or some other special device.
However, the plan to retry writes blindly is probably not such a good one - check errno to see whether you should be retrying first.