Using select()/poll() in a device driver - Linux

I have a driver which handles several TCP connections.
Is there a way to do something similar to the user-space select()/poll()/epoll() APIs in the kernel, given a list of struct sock's?
Thanks

You may want to write your own custom sk_buff handler that calls your kernel_select(), which tries to lock the semaphore and does a blocking wait while the socket is open.
Not sure if you have already gone through this link: Simulate effect of select() and poll() in kernel socket programming

On the kernel side it's easy to avoid the sys_epoll() interface outright. After all, you have direct access to kernel objects - no need to jump through hoops.
Each file object, sockets included, "overrides" the poll method in its file_operations "vtable". You can simply loop over all your sockets, calling ->poll() on each of them, and yield periodically or when there's no data available.
If the sockets are fairly high traffic, you won't need anything more than this.
A note on the API:
The poll() method takes a poll_table * argument; however, if you do not intend to wait on it, its callback can safely be initialized to NULL:
poll_table pt;
init_poll_funcptr(&pt, NULL);
...
// struct socket *sk;
...
unsigned event_mask = sk->ops->poll(sk->file, sk, &pt);
If you do want to wait, just play around with the callback set into poll_table by init_poll_funcptr().
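For illustration, here is a rough, untested sketch of such a loop running in a kernel thread; my_socks, MY_NSOCKS and handle_readable() are placeholder names for an array of already-connected struct socket pointers and a per-socket handler:
#include <linux/poll.h>
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/net.h>
#include <linux/types.h>

#define MY_NSOCKS 8
static struct socket *my_socks[MY_NSOCKS];   /* filled in elsewhere (placeholder) */

static void handle_readable(struct socket *sock)
{
        /* placeholder: pull data off the socket, e.g. via kernel_recvmsg() */
}

static int my_poll_thread(void *data)
{
        poll_table pt;

        init_poll_funcptr(&pt, NULL);        /* NULL callback: query only, never sleep */

        while (!kthread_should_stop()) {
                bool ready = false;
                int i;

                for (i = 0; i < MY_NSOCKS; i++) {
                        struct socket *sock = my_socks[i];
                        unsigned int mask;

                        if (!sock)
                                continue;
                        mask = sock->ops->poll(sock->file, sock, &pt);
                        if (mask & (POLLIN | POLLRDNORM)) {
                                handle_readable(sock);
                                ready = true;
                        }
                }

                if (!ready)
                        msleep_interruptible(10);   /* yield when nothing is ready */
        }
        return 0;
}
Replacing the NULL callback with a real poll_queue_proc would let the thread sleep on the sockets' wait queues instead of polling on a timer.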

Related

How are multiple simultaneous requests handled in Node.js when the response is async?

I can imagine a situation where 100 requests come to a single Node.js server. Each of them requires some DB interaction, which is implemented with natively async code - using the task queue or at least the microtask queue (e.g. the DB driver interface is promisified).
How does Node.js return a response once the request handler has stopped being synchronous? What happens to the connection from the API/web client where these 100 requests originated?
This feature is available at the OS level and is called (funnily enough) asynchronous I/O or non-blocking I/O (Windows also calls/called it overlapped I/O).
At the lowest level, in C (C#/Swift), the operating system provides an API to keep track of requests and responses. There are various APIs available depending on the OS you're on, and Node.js uses libuv to automatically select the best available API at compile time. But for the sake of understanding how an asynchronous API works, let's look at the one that is available on all platforms: the select() system call.
The select() function looks something like this:
int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);
The fd_set data structure is a set/list of file descriptors that you are interested in watching for I/O activity. And remember, in POSIX sockets are also file descriptors. The way you use this API is as follows:
// Example in C:
// Say you just sent a request to a MySQL database and also sent an HTTP
// request to Google Maps. You are waiting for data to come from both.
// Instead of calling `read()`, which would block the thread, you add
// the sockets to the read set:
fd_set readfds;
FD_ZERO(&readfds);
FD_SET(mysql_socket, &readfds);
FD_SET(maps_socket, &readfds);
// Now you have nothing else to do, so you are free to wait for network
// I/O. Great, call select. Note that nfds is the highest-numbered fd
// plus one, not the number of fds:
int maxfd = mysql_socket > maps_socket ? mysql_socket : maps_socket;
select(maxfd + 1, &readfds, NULL, NULL, NULL);
// Select is a blocking call. Yes, non-blocking I/O involves calling a
// blocking function. It sounds ironic, but the key difference is that
// we are not blocking waiting for each individual I/O activity -
// we are waiting for ALL of them at once.
// At some point select returns. This is where we check which request
// matches the response:
if (FD_ISSET(mysql_socket, &readfds))
    mysql_handler_callback();
if (FD_ISSET(maps_socket, &readfds))
    maps_handler_callback();
// ...then go back to the beginning of the loop.
So basically the answer to your question is: we check a data structure to see which socket/file just triggered I/O activity, and execute the appropriate code.
You can no doubt easily spot how to generalize this pattern: instead of manually setting and checking the file descriptors, you can keep all pending async requests and their callbacks in a list or array and loop through it before and after the select(), as sketched below. This is in fact what Node.js (and JavaScript in general) does. And it is this collection of callbacks/file descriptors that is sometimes called the event queue - it is not a queue per se, just a collection of things you are waiting to execute.
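For concreteness, here is a hedged sketch of that generalization in plain C (this is not Node's or libuv's actual code; struct pending, run_event_loop() and the callbacks are made-up names):
#include <sys/select.h>

struct pending {
    int fd;                     /* descriptor we are waiting on */
    void (*cb)(int fd);         /* callback to run when it becomes readable */
};

static struct pending pending[64];
static int npending;

static void run_event_loop(void)
{
    while (npending > 0) {
        fd_set readfds;
        int maxfd = -1;
        int i;

        FD_ZERO(&readfds);
        for (i = 0; i < npending; i++) {        /* loop before select() */
            FD_SET(pending[i].fd, &readfds);
            if (pending[i].fd > maxfd)
                maxfd = pending[i].fd;
        }

        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) <= 0)
            continue;

        for (i = 0; i < npending; i++)          /* loop after select() */
            if (FD_ISSET(pending[i].fd, &readfds))
                pending[i].cb(pending[i].fd);   /* dispatch to the callback */
    }
}
A real event loop also has to cope with callbacks that add or remove entries while the loop is running; that bookkeeping is omitted here.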
The select() function also has a timeout parameter at the end, which can be used to implement setTimeout() and setInterval() and, in browsers, to process GUI events, so that we can run code while waiting for I/O. Remember, select is blocking - we can only run other code after select returns. With careful management of timers we can calculate the appropriate value to pass as the timeout to select, for example:
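Here is a small illustrative sketch of how such a timeout could be derived from the next pending timer; now_ms() and next_timer_deadline_ms() are hypothetical helpers returning milliseconds:
#include <sys/select.h>

long now_ms(void);                  /* hypothetical: current time in ms */
long next_timer_deadline_ms(void);  /* hypothetical: when the next setTimeout fires */

void wait_for_io_or_timer(int maxfd, fd_set *readfds)
{
    struct timeval tv;
    long ms = next_timer_deadline_ms() - now_ms();

    if (ms < 0)
        ms = 0;                     /* a timer is already due: just poll */
    tv.tv_sec  = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;

    /* Returns either when I/O is ready or when it is time to run the next timer. */
    select(maxfd + 1, readfds, NULL, NULL, &tv);
}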
The fd_set data structure is not actually a linked list. In older implementations it is a bitfield; more modern implementations can improve on the bitfield as long as they comply with the API. This partly explains why there are so many competing async APIs like poll, epoll, kqueue etc. They were created to overcome the limitations of select. Different APIs keep track of the file descriptors differently: some use linked lists, some hash tables; some cater for scalability (being able to listen to tens of thousands of sockets), some cater for speed, and most try to do both better than the others. Whatever they use, in the end what stores the requests is just a data structure that keeps track of file descriptors.

Group multiple file descriptors into one "virtual" file descriptor for exporting an FD over an API

If a subsystem has event handling capabilities, then it is common in the Unix/Linux world to add an API call to that subsystem to allow for exposing a file descriptor so that said event handling can be integrated into existing mainloops that use something like poll() or select(). For example, in Wayland, there's wl_display_get_fd(). If that FD shows activity, wl_display_read_events() and friends can be called.
This works trivially if that subsystem internally has exactly one FD. But what if there are multiple FDs that need to be watched for events?
I only see two solutions:
Expose all FDs. However, I am not aware of any API that does that.
Expose some sort of "virtual" FD that is in some way coupled to the internal, "real" FDs. Once a real FD receives data and is marked as readable, then so is that virtual FD. Once a real FD can be written to, then the virtual FD is automatically marked as writable etc.
#2 sounds cleaner to me. Is it possible to do that? Or are there better ways to deal with this?
If you're specifically working with Linux, then you can use the epoll mechanism. You first create an epoll instance with
int fd;
fd = epoll_create(1); // The argument is legacy and doesn't matter. It just has to be positive.
After that, you can add the file descriptors that you care about.
struct epoll_event ev;
ev.events  = EPOLLIN;              // events you want to be notified about
ev.data.fd = some_file_descriptor; // handed back to you by epoll_wait() later
if ( epoll_ctl(fd, EPOLL_CTL_ADD, some_file_descriptor, &ev) != 0 ) {
    // handle error
}
That last argument carries the events you are interested in, plus data that you want passed back to you later when one of your file descriptors becomes ready. Check the man page for the specifics.
You can inquire about any ready descriptors using epoll_wait or epoll_pwait.
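Putting those steps together, a minimal sketch (descriptor names sock_a and sock_b are placeholders assumed to be open already) might look like this:
#include <sys/epoll.h>

int make_group_fd(int sock_a, int sock_b)
{
    struct epoll_event ev;
    int epfd = epoll_create(1);        /* argument is ignored but must be > 0 */

    if (epfd < 0)
        return -1;

    ev.events  = EPOLLIN;              /* readability on the real fds */
    ev.data.fd = sock_a;               /* handed back by epoll_wait() later */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sock_a, &ev) != 0)
        return -1;

    ev.data.fd = sock_b;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sock_b, &ev) != 0)
        return -1;

    /* epfd is the single "virtual" descriptor you can expose to callers. */
    return epfd;
}
Because the returned epfd is itself pollable, it can be exposed much like wl_display_get_fd() exposes Wayland's descriptor; when it reports readable, call epoll_wait() on it to learn which real descriptor fired.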

Asynchronous Sockets in Linux - Polling vs. Callback

In deciding to implement asynchronous sockets in my simple server (Linux), I have run into a problem. I was going to continually poll(), and do some cleanup and caching between calls. Now this seems wasteful, so I did more digging and found a way to possibly implement some callbacks on I/O.
Would I incur a performance penalty, and more importantly would it work, if I created a socket with O_NONBLOCK, used the SIOCSPGRP ioctl() to have SIGIO sent on I/O, and used sigaction() to define a callback function for I/O?
In addition, can I define different functions for different sockets?
"I was going to continually poll(), and do some cleanup and caching between calls. Now this seems wasteful"
Wasteful how? Did you actually try and implement this?
You have your fd list. You call poll or (better) epoll() with the list. When it triggers, you walk the fd list and deal with each one appropriately. You need to cache incoming and outgoing data, so each fd needs some kind of struct. When I've done this, I've used a hash table for the fd structs (generating a key from the fd), but you are probably fine, at least initially, just using a fixed length array and checking in case the OS issues you a weirdly high fd (nb, I have never seen that happen and I've squinted thru more logs than I can count). The structs hold pointers to incoming and outgoing buffers, perhaps a state variable, eg:
struct connection {
    int fd;                   // mandatory for the hash table version
    unsigned char *dataOut;
    unsigned char *dataIn;
    int state;                // probably from an enum
};

struct connection connected[1000];   // your array, or...
...probably a linked list is actually best for the fds; I had an unrelated requirement for the hash table.
Start there and refine stepwise. I think you are just trying to find an easy way out - one that you may pay for later by making other things harder ;) $0.02.
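For concreteness, a rough sketch of the dispatch loop described above, assuming level-triggered epoll, the connected[] array indexed by fd, and placeholder handlers:
#include <sys/epoll.h>

#define MAX_EVENTS 64

void handle_readable(struct connection *c);   /* placeholder: fill c->dataIn */
void handle_writable(struct connection *c);   /* placeholder: drain c->dataOut */

void event_loop(int epfd)
{
    struct epoll_event events[MAX_EVENTS];
    int n, i;

    for (;;) {
        n = epoll_wait(epfd, events, MAX_EVENTS, -1);

        for (i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            struct connection *c = &connected[fd];   /* fd doubles as the array index */

            if (events[i].events & EPOLLIN)
                handle_readable(c);
            if (events[i].events & EPOLLOUT)
                handle_writable(c);
        }
    }
}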

epoll_create cleanup?

I'm using epoll_create to wait on a socket.
What is the life cycle of the returned resource tied to? Is there something like an epoll_destroy, or is it tied to the socket's close or destroy call?
Can I re-use the result of epoll_create if I close my socket and open a new one? Or should I just call epoll_create again and forget about the previous result?
epoll_create(2) returns a file descriptor, so you just use close(2) on it when done.
Then, the idea of I/O multiplexing, often called asynchronous I/O, is to wait for multiple events and handle them one at a time. That means you generally need only one polling file descriptor.
The epoll(7) manual page contains a basic example of the suggested API usage.
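A small sketch of that life cycle, assuming a hypothetical sock_fd() helper that opens a socket: a single epoll instance outlives individual sockets and is released with close() at the end.
#include <sys/epoll.h>
#include <unistd.h>

int sock_fd(void);                          /* hypothetical: opens and returns a socket */

int main(void)
{
    int epfd = epoll_create(1);             /* one polling fd for the whole program */

    int s = sock_fd();
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = s };
    epoll_ctl(epfd, EPOLL_CTL_ADD, s, &ev);
    /* ... epoll_wait(), read(), ... */
    close(s);                               /* closing the only fd referring to that socket
                                               also removes it from the epoll set */

    s = sock_fd();                          /* a new socket can reuse the same epoll instance */
    ev.data.fd = s;
    epoll_ctl(epfd, EPOLL_CTL_ADD, s, &ev);
    /* ... */

    close(epfd);                            /* finally release the epoll instance itself */
    return 0;
}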

Question about epoll and splice

My application is going to send huge amount of data over network, so I decided (because I'm using Linux) to use epoll and splice. Here's how I see it (pseudocode):
epoll_ctl (file_fd, EPOLL_CTL_ADD);                 // waiting for EPOLLIN event
while (1)
{
    epoll_wait (tmp_structure);
    if (tmp_structure->fd == file_descriptor)
    {
        epoll_ctl (file_fd, EPOLL_CTL_DEL);
        epoll_ctl (tcp_socket_fd, EPOLL_CTL_ADD);   // wait for EPOLLOUT event
    }
    if (tmp_structure->fd == tcp_socket_descriptor)
    {
        splice (file_fd, tcp_socket_fd);
        epoll_ctl (tcp_socket_fd, EPOLL_CTL_DEL);
        epoll_ctl (file_fd, EPOLL_CTL_ADD);         // waiting for EPOLLIN event
    }
}
I assume that my application will open up to 2000 TCP sockets. I want to ask you about two things:
1. There will be quite a lot of epoll_ctl calls; won't it be slow when I have so many sockets?
2. The file descriptor has to become readable first, and there will be some interval before the socket becomes writable. Can I be sure that, at the moment the socket becomes writable, the file descriptor is still readable (to avoid a blocking call)?
1st question:
You can use edge-triggered rather than level-triggered polling, so you do not have to delete and re-add the socket each time.
Alternatively, you can use EPOLLONESHOT, which avoids removing the socket: the descriptor is simply disabled after one event and re-armed with EPOLL_CTL_MOD.
"The file descriptor has to become readable first and there will be some interval before the socket becomes writable."
What kind of file descriptor? If it is a regular file on a file system, you can't use select/poll or similar tools for this purpose: the file will always be reported readable or writable regardless of the state of the disk and cache. If you need to do file I/O asynchronously you may use the aio_* API, but generally you can just read from and write to the file and assume it is non-blocking.
If it is a TCP socket, it will be writable most of the time. It is better to use non-blocking calls and add the socket to epoll only when you get EWOULDBLOCK.
Consider using the EPOLLET flag; it is meant for exactly this case. With it you can run your event loop properly without deregistering (or changing the mode of) file descriptors after their first registration in epoll. :) enjoy!
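For example, a minimal sketch of the registration this suggests; epfd and tcp_socket_fd are assumed to already exist:
struct epoll_event ev;
ev.events  = EPOLLOUT | EPOLLET;    // edge-triggered writability
ev.data.fd = tcp_socket_fd;
if (epoll_ctl(epfd, EPOLL_CTL_ADD, tcp_socket_fd, &ev) != 0) {
    // handle error
}
// In the loop: write (or splice) until you get EWOULDBLOCK, then go back to
// epoll_wait(); the next EPOLLOUT edge wakes you when there is room again.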
