Problem handling file I/O with libevent2 - linux

I have worked with libevent2 for some time, but usually I used it to handle network I/O (sockets). Now I need to read many different files, so I wanted to use it for that as well. I wrote this code:
int file = open(filename, O_RDONLY);
struct event *ev_file_read = event_new(ev_base, file, EV_READ | EV_PERSIST, read_file, NULL);
if (event_add(ev_file_read, NULL))
    error("adding file event");
Unfortunately it doesn't work. I get this message when trying to add the event:
[warn] Epoll ADD(1) on fd 7 failed. Old events were 0; read change was 1 (add); write change was 0 (none): Operation not permitted
adding file event: Operation not permitted
The file exists and is readable and writable.
Does anyone have an idea how to handle file I/O using libevent? I also thought about buffered events, but the API only provides bufferevent_socket_new(), which doesn't apply here.
Thanks in advance.

I needed libevent to read many files with different priorities. The problem was in epoll, not in libevent: epoll doesn't support regular files.
To solve it, I forced libevent not to use epoll:
struct event_config *cfg = event_config_new();
event_config_avoid_method(cfg, "epoll");
ev_base = event_base_new_with_config(cfg);
event_config_free(cfg);
The next method on the preference list was poll, which fully supports regular files, just as I wanted.
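For reference, a minimal sketch of what the read_file callback can look like with this setup (the buffer size and the cleanup policy are just illustrative; needs <event2/event.h> and <unistd.h>):

static void read_file(evutil_socket_t fd, short events, void *arg)
{
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof buf);
    if (n > 0) {
        /* ... process the n bytes just read ... */
    } else {
        /* EOF (n == 0) or error (n < 0): delete the event (e.g. a struct
         * event pointer passed in via arg) and close the descriptor. */
        close(fd);
    }
}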
Thank you all for the answers.

It makes no sense to register regular file descriptors with libevent: file descriptors associated with regular files always report as ready for reading, ready for writing, and for error conditions with select() and friends.

If you want to do async disk I/O, you may want to check the aio_* family (see man 3 aio_read). It's POSIX.1-2001 and available on Linux and BSD (at least).
For integrating AIO operations with libevent, see the libevent aio patch and a related Stack Overflow post that mention using signalfd(2) to route the AIO completion signals to a file descriptor that can be used with the various fd polling implementations (and therefore, implicitly, with the libevent loop).
EDIT: libevent also has signal handling support (I totally forgot about that), so you can try to register and handle the AIO signals directly with/from the libevent loop. I'd personally go and try the libevent patch first, if your development rules allow it.
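For illustration, here is a minimal, simplified sketch of the signalfd(2) route (this is not the patch mentioned above; the file name, signal number and buffer size are arbitrary, error handling is trimmed, and it links with -levent -lrt):

#include <aio.h>
#include <event2/event.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/signalfd.h>
#include <unistd.h>

static char buf[4096];
static struct aiocb cb;

static void on_aio_signal(evutil_socket_t sfd, short what, void *arg)
{
    struct signalfd_siginfo si;
    read(sfd, &si, sizeof si);             /* drain the queued signal */
    if (aio_error(&cb) == 0) {
        ssize_t n = aio_return(&cb);       /* bytes actually read */
        (void)n;                           /* ... process buf[0..n) ... */
    }
}

int main(void)
{
    struct event_base *base = event_base_new();

    /* Block SIGRTMIN so it is delivered through the signalfd instead of a handler. */
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGRTMIN);
    sigprocmask(SIG_BLOCK, &mask, NULL);
    int sfd = signalfd(-1, &mask, SFD_NONBLOCK);

    /* The signalfd is pollable, so libevent can watch it like any socket. */
    struct event *ev = event_new(base, sfd, EV_READ | EV_PERSIST, on_aio_signal, NULL);
    event_add(ev, NULL);

    /* Queue an asynchronous read that raises SIGRTMIN on completion. */
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = open("somefile", O_RDONLY);      /* example file name */
    cb.aio_buf = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
    cb.aio_sigevent.sigev_signo = SIGRTMIN;
    aio_read(&cb);

    event_base_dispatch(base);
    return 0;
}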

Related

Group multiple file descriptors to one "virtual" file descriptor for exporting an FD over an API

If a subsystem has event handling capabilities, then it is common in the Unix/Linux world to add an API call to that subsystem to allow for exposing a file descriptor so that said event handling can be integrated into existing mainloops that use something like poll() or select(). For example, in Wayland, there's wl_display_get_fd(). If that FD shows activity, wl_display_read_events() and friends can be called.
This works trivially if that subsystem internally has exactly one FD. But what if there are multiple FDs that need to be watched for events?
I only see two solutions:
1. Expose all FDs. However, I am not aware of any API that does that.
2. Expose some sort of "virtual" FD that is in some way coupled to the internal, "real" FDs. Once a real FD receives data and is marked as readable, then so is that virtual FD. Once a real FD can be written to, then the virtual FD is automatically marked as writable, etc.
#2 sounds cleaner to me. Is it possible to do that? Or are there better ways to deal with this?
If you're specifically working with Linux, then you can use the epoll mechanism. You first create an epoll instance with
int fd;
fd = epoll_create(1); // The argument is legacy and doesn't matter. It just has to be positive.
After that, you can add the file descriptors that you care about.
struct epoll_event ev = { .events = EPOLLIN };   // the events you care about
ev.data.fd = some_file_descriptor;               // handed back to you by epoll_wait

if ( epoll_ctl(fd, EPOLL_CTL_ADD, some_file_descriptor, &ev) != 0 ) {
    // handle error
}
The event argument must not be NULL for EPOLL_CTL_ADD. Its data field can carry anything you want passed back to you later when one of your file descriptors becomes ready. Check the man page for the specifics.
You can inquire about any ready descriptors using epoll_wait or epoll_pwait.
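To make option #2 from the question concrete, here is a minimal sketch (the function and variable names are mine): the epoll instance itself becomes the single "virtual" FD that the API hands out, and a dispatch helper drains whatever is ready. An epoll FD is itself pollable, so the caller can put it into poll(), select(), or even another epoll set.

#include <sys/epoll.h>
#include <unistd.h>

/* real_fd_a / real_fd_b stand in for the subsystem's internal descriptors. */
int make_virtual_fd(int real_fd_a, int real_fd_b)
{
    int epfd = epoll_create1(0);        /* modern replacement for epoll_create */
    if (epfd < 0)
        return -1;

    struct epoll_event ev = { .events = EPOLLIN };

    ev.data.fd = real_fd_a;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, real_fd_a, &ev) != 0)
        goto fail;

    ev.data.fd = real_fd_b;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, real_fd_b, &ev) != 0)
        goto fail;

    return epfd;                        /* expose this as the API's "get_fd()" result */
fail:
    close(epfd);
    return -1;
}

/* Called by the user's mainloop when the virtual fd polls readable. */
void dispatch_ready(int epfd)
{
    struct epoll_event events[8];
    int n = epoll_wait(epfd, events, 8, 0);   /* timeout 0: just drain what is ready */
    for (int i = 0; i < n; i++) {
        /* events[i].data.fd identifies which real fd is ready */
    }
}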

Can dup2 really return EINTR?

In the spec and two implementations:
According to POSIX, dup2() may return EINTR.
The Linux man pages list it as permitted.
The FreeBSD man pages indicate it is never returned. Is this a bug, since its close() implementation can return EINTR (at least for TCP linger, if nothing else)?
In reality, can Linux return EINTR for dup2()? Presumably if so, it would be because close() decided to wait and a signal arrived (TCP linger or dodgy file system drivers that try to sync when closing).
In reality, does FreeBSD guarantee not to return EINTR for dup2()? In that case, it must be that it doesn't bother waiting for any outstanding operations on the old fd and simply unlinks the fd.
What does POSIX dup2() mean when it refers to "closing" (not in italics) rather than referencing the actual close() function? Are we to understand it is just talking about "closing" informally (unlinking the file descriptor), or is it saying that the effect should be as if close() were first called and then dup2() were performed atomically?
If fildes2 is already a valid open file descriptor, it shall be closed first, unless fildes is equal to fildes2 in which case dup2() shall return fildes2 without closing it.
If dup2() does have to close, wait, then atomically dup, it's going to be a nightmare for implementors! It's much worse than the EINTR with close() fiasco. Cowardly POSIX doesn't even say if the dup took place in the case of EINTR...
Here's the relevant information from the C/POSIX library documentation with respect to the standard Linux implementation:
If OLD and NEW are different numbers, and OLD is a valid
descriptor number, then `dup2' is equivalent to:
close (NEW);
fcntl (OLD, F_DUPFD, NEW)
However, `dup2' does this atomically; there is no instant in the
middle of calling `dup2' at which NEW is closed and not yet a
duplicate of OLD.
It lists the possible error values returned by dup and dup2 as EBADF, EINVAL, and EMFILE, and no others. The documentation states that all functions that can return EINTR are listed as such, which indicates that these don't. Note that these are implemented via fcntl, not a call to close.
8 years later this still seems to be undocumented.
I looked at the linux sources and my conclusion is that dup2 can't return EINTR in a current version of Linux.
In particular, the function do_dup2 in fs/file.c ignores the return value of filp_close, which is what can cause close to return EINTR in some cases (see fs/open.c and fs/file.c).
The way dup2 works is it first makes the atomic file descriptor update, and then waits for any flushing that needs to happen on close. Any errors happening on flush are simply ignored.
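If you want to be defensive regardless (e.g. for portability to systems where this is less clear), a common pattern is a small retry wrapper; this is just a sketch, and per the above it should be unnecessary on current Linux:

#include <errno.h>
#include <unistd.h>

/* Sketch: retry dup2() on EINTR (and on Linux's transient EBUSY, which the
 * man page documents as a race with open()/dup()). Repeating the call with
 * the same arguments is harmless even if the first attempt already took
 * effect, at least in a single-threaded context. */
int dup2_retry(int oldfd, int newfd)
{
    int r;
    do {
        r = dup2(oldfd, newfd);
    } while (r < 0 && (errno == EINTR || errno == EBUSY));
    return r;
}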

Can AIO run without creating thread?

I would like AIO to signal my program when a read operation completes. According to this page, such a notification can be received either via a signal sent by the kernel or via a thread started to run a user function; either behavior is selected by setting the right value of sigev_notify.
I gave it a try and soon discovered that even when set to receive the notification by signal, another thread was created.
(gdb) info threads
Id Target Id Frame
2 Thread 0x7ffff7ff9700 (LWP 6347) "xnotify" 0x00007ffff7147e50 in gettimeofday () from /lib64/libc.so.6
* 1 Thread 0x7ffff7fc3720 (LWP 6344) "xnotify" 0x0000000000401834 in update (this=0x7fffffffdc00)
The doc also states: "The implementation of these functions can be done using support in the kernel (if available) or using an implementation based on threads at userlevel."
I would like to have no thread at all, is this possible?
I checked on my kernel, and that looks okay:
qdii#localhost /home/qdii $ grep -i aio /usr/src/linux/.config
CONFIG_AIO=y
Is it possible to run aio without any (userland) thread at all (apart from the main one, of course)?
EDIT:
I dug deeper into it. librt seems to provide a collection of AIO functions. Looking through the glibc sources exposed something fishy: inside rt/aio_read.c there is a function stub:
int aio_read (struct aiocb *aiocbp)
{
__set_errno (ENOSYS);
return -1;
}
stub_warning (aio_read)
I found a first relevant implementation in the subdirectory sysdeps/pthread, which directly calls __aio_enqueue_request(..., LIO_READ), which in turn creates pthreads. But since I was wondering why there would be a stub in that case, I thought the stub might be backed by the Linux kernel itself, with the pthread implementation acting as some sort of fallback code.
Grepping for aio_read through my /usr/src/linux directory gives a lot of results, which I'm trying to understand now.
I found out that there are actually two very different AIO facilities: one is part of glibc, lives in librt, and performs asynchronous access by using pthreads. The other is the Linux kernel's own AIO interface (the io_* syscalls), which is implemented in the kernel itself and can complete operations asynchronously without creating user-level threads.
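To illustrate the kernel-side interface, here is a minimal sketch using the raw io_* syscalls (glibc does not wrap them, so syscall(2) is used directly; "data.bin" is just an example file, and O_DIRECT with an aligned buffer is used because kernel AIO is only truly asynchronous for direct I/O on most filesystems):

#define _GNU_SOURCE            /* for O_DIRECT and syscall() */
#include <fcntl.h>
#include <linux/aio_abi.h>     /* aio_context_t, struct iocb, struct io_event */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    aio_context_t ctx = 0;
    if (syscall(SYS_io_setup, 8, &ctx) < 0) { perror("io_setup"); return 1; }

    int fd = open("data.bin", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;   /* O_DIRECT wants alignment */

    struct iocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes     = fd;
    cb.aio_lio_opcode = IOCB_CMD_PREAD;            /* asynchronous pread() */
    cb.aio_buf        = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes     = 4096;
    cb.aio_offset     = 0;

    struct iocb *list[1] = { &cb };
    if (syscall(SYS_io_submit, ctx, 1, list) != 1) { perror("io_submit"); return 1; }

    /* Completion is collected here by the submitting thread itself;
     * no user-level helper thread is created. */
    struct io_event ev;
    if (syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL) == 1)
        printf("read %lld bytes\n", (long long)ev.res);

    syscall(SYS_io_destroy, ctx);
    return 0;
}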

Question about epoll and splice

My application is going to send a huge amount of data over the network, so I decided (since I'm using Linux) to use epoll and splice. Here's how I see it (pseudocode):
epoll_ctl (file_fd, EPOLL_CTL_ADD);               // waiting for EPOLLIN event
while (1)
{
    epoll_wait (tmp_structure);
    if (tmp_structure->fd == file_descriptor)
    {
        epoll_ctl (file_fd, EPOLL_CTL_DEL);
        epoll_ctl (tcp_socket_fd, EPOLL_CTL_ADD); // wait for EPOLLOUT event
    }
    if (tmp_structure->fd == tcp_socket_descriptor)
    {
        splice (file_fd, tcp_socket_fd);
        epoll_ctl (tcp_socket_fd, EPOLL_CTL_DEL);
        epoll_ctl (file_fd, EPOLL_CTL_ADD);       // waiting for EPOLLIN event
    }
}
I assume that my application will open up to 2000 TCP sockets. I want to ask you about two things:
1. There will be quite a lot of epoll_ctl calls; won't that be slow when I have so many sockets?
2. The file descriptor has to become readable first, and there will be some interval before the socket becomes writable. Can I be sure that at the moment the socket becomes writable the file descriptor is still readable (to avoid a blocking call)?
1st question
You can use edge-triggered rather than level-triggered polling, so you do not have to delete the socket each time.
You can also use EPOLLONESHOT, which disarms the descriptor automatically after one event instead of requiring you to remove it.
File descriptor has to become readable first and there will be some interval before socket will become writable.
What kind of file descriptor? If it is a file on a file system, you can't use select/poll or similar tools for this purpose: the file will always be reported readable or writable regardless of the state of the disk and cache. If you need to do things asynchronously you may use the aio_* API, but generally you can just read from and write to the file and assume it is effectively non-blocking.
If it is a TCP socket, then it will be writable most of the time. It is better to use non-blocking calls and add the socket to epoll only when you get EWOULDBLOCK.
Consider using the EPOLLET flag. It is made for exactly this case. With this flag you can run the event loop properly without deregistering (or changing the mode of) file descriptors after their first registration in epoll. :) Enjoy!
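As an illustration of the edge-triggered approach, here is a minimal sketch (epfd and sock_fd are placeholders, and the socket is assumed to already be non-blocking):

#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register once with EPOLLET; no further epoll_ctl calls are needed for the
 * lifetime of the socket. */
int register_socket(int epfd, int sock_fd)
{
    struct epoll_event ev;
    ev.events  = EPOLLIN | EPOLLOUT | EPOLLET;   /* edge-triggered */
    ev.data.fd = sock_fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, sock_fd, &ev);
}

/* On an EPOLLOUT edge, write until EAGAIN; the next edge arrives once the
 * kernel socket buffer has room again. Returns bytes written, or -1. */
ssize_t drain_writes(int sock_fd, const char *p, size_t left)
{
    size_t done = 0;
    while (left > 0) {
        ssize_t n = write(sock_fd, p + done, left);
        if (n > 0) {
            done += (size_t)n;
            left -= (size_t)n;
        } else if (n < 0 && errno == EINTR) {
            continue;                            /* interrupted: retry */
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            break;                               /* buffer full: wait for the next edge */
        } else {
            return -1;                           /* real error */
        }
    }
    return (ssize_t)done;
}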

When does the write() system call write all of the requested buffer versus just doing a partial write?

If I am counting on my write() system call to write, say, 100 bytes, I always put that write() call in a loop that checks whether the length returned is what I expected to send and, if not, bumps the buffer pointer and decreases the length by the amount that was written.
So once again I just did this, but now that there's Stack Overflow, I can ask you all: do you know when my writes will write ALL that I ask for versus giving me back a partial write?
Additional comments: X-Istence's reply reminded me that I should have noted that the file descriptor was blocking (i.e., not non-blocking). I think he is suggesting that the only way a write() on a blocking file descriptor will not write all the specified data is when the write() is interrupted by a signal. This seems to make at least intuitive sense to me...
write() may perform a partial write, especially on sockets or when internal buffers are full. So a good approach is the following:
ssize_t res;
while (size > 0) {
    res = write(fd, buf, size);
    if (res < 0 && errno == EINTR)
        continue;               // interrupted before anything was written: retry
    if (res < 0) {
        // real error processing
        break;
    }
    size -= res;                // partial write: advance and keep going
    buf  += res;
}
Never rely on what usually happens...
Note: in the case of a full disk you would get ENOSPC, not a partial write.
You need to check errno to see if your call got interrupted, or why write() returned early, and why it only wrote a certain number of bytes.
From man 2 write
When using non-blocking I/O on objects such as sockets that are subject to flow control, write() and writev() may write fewer bytes than requested; the return value must be noted, and the remainder of the operation should be retried when possible.
Basically, unless you are writing to a non-blocking socket, the only other time this will happen is if you get interrupted by a signal.
[EINTR] A signal interrupted the write before it could be completed.
See the Errors section in the man page for more information on what can be returned, and when it will be returned. From there you need to figure out if the error is severe enough to log an error and quit, or if you can continue the operation at hand!
This is all discussed in the book: Advanced Unix Programming by Marc J. Rochkind, I have written countless programs with the help of this book, and would suggest it while programming for a UNIX like OS.
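For the non-blocking case quoted from the man page above, here is a minimal sketch (the function name is mine) of "retrying when possible" by waiting for the descriptor to become writable again:

#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Write the whole buffer to a non-blocking fd, sleeping in poll() whenever
 * the kernel buffer is full. Returns 0 on success, -1 on error. */
int write_all_nonblocking(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n > 0) {
            buf += n;
            len -= (size_t)n;
        } else if (n < 0 && errno == EINTR) {
            continue;                              /* interrupted: just retry */
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
                return -1;                         /* wait until writable again */
        } else {
            return -1;                             /* real error */
        }
    }
    return 0;
}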
Writes shouldn't have any reason to write a partial buffer, AFAIK. The possible reasons I can think of for a partial write are running out of disk space, writing past the end of a block device, or writing to a char device or some other special device.
However, the plan to retry writes blindly is probably not such a good one - check errno first to see whether you should be retrying at all.
