Socket ReceiveTimeout on Linux

I am writing a synchronous client. Part of it is a Connection object which is responsible for the actual sending and receiving of the data. The entire library is written using the Boost ASIO ip::tcp::socket class.
I have a test in which the client calls a method on the server (which sleeps for 2 seconds) with a timeout of 1 second. My code detects that execution took longer than the requested time, but the call does not return when the timeout expires; instead, it returns only after the full 2 seconds.
I have narrowed down the problem to the receive method:
void Connection::receive(const mutable_buffers_1& buffers, const DurationType& timeout)
{
    // to make sure it isn't 0 by mistake
    auto actualTimeout = std::max(DurationType(milliseconds(1)), timeout);
    SocketReceiveTimeoutOption timeoutOption(actualTimeout);
    error_code ec;
    _socket.set_option(timeoutOption, ec);
    RPC_LOG(TRACE) << "Setting timeout " << actualTimeout << " returned: " << ec.message();

    RPC_LOG(TRACE) << "Receiving...";
    if (_socket.receive(buffers, MSG_WAITALL, ec) != buffer_size(buffers))
    {
        throw RpcCommunicationError("Did not receive the expected number of bytes from connection");
    }
    RPC_LOG(TRACE) << "Received! With error code: " << ec.message();
}
DurationType is just a convenience typedef:
typedef boost::chrono::system_clock ClockType;
typedef ClockType::time_point::duration DurationType;
SocketReceiveTimeoutOption is an option implemented for sockets:
template <int Name>
class SocketTimeoutOption
{
public:
#ifdef BSII_WINDOWS
    SocketTimeoutOption(const DurationType& timeout) : _value(static_cast<DWORD>(boost::chrono::duration_cast<boost::chrono::milliseconds>(timeout).count())) {}
#else
    SocketTimeoutOption(const DurationType& timeout) : _value(Utils::toTimeval(timeout)) {}
#endif

    // Get the level of the socket option.
    template <typename Protocol>
    int level(const Protocol&) const
    {
        return SOL_SOCKET;
    }

    // Get the name of the socket option.
    template <typename Protocol>
    int name(const Protocol&) const
    {
        return Name;
    }

    // Get the address of the timeout data.
    template <typename Protocol>
    void* data(const Protocol&)
    {
        return &_value;
    }

    // Get the address of the timeout data.
    template <typename Protocol>
    const void* data(const Protocol&) const
    {
        return &_value;
    }

    // Get the size of the timeout data.
    template <typename Protocol>
    std::size_t size(const Protocol&) const
    {
        return sizeof(_value);
    }

private:
#ifdef BSII_WINDOWS
    DWORD _value;
#else
    timeval _value;
#endif
};
typedef SocketTimeoutOption<SO_RCVTIMEO> SocketReceiveTimeoutOption;
typedef SocketTimeoutOption<SO_SNDTIMEO> SocketSendTimeoutOption;
And finally
namespace Utils
{
    inline timeval toTimeval(const DurationType& duration)
    {
        timeval val;
        auto seconds = boost::chrono::duration_cast<boost::chrono::seconds>(duration); // TODO: make sure this is truncated down in case there's fractional seconds
        val.tv_sec = static_cast<long>(seconds.count());
        auto micro = boost::chrono::duration_cast<boost::chrono::microseconds>(duration - seconds);
        val.tv_usec = static_cast<long>(micro.count());
        return val;
    }
}
The problem is that even though I specify a 1s timeout, the receive method still takes the entire 2 seconds. Here's the log:
2014-09-14 10:27:53.348383 | trace | 0x007f24e50ae7c0 | Setting timeout 999917107 nanoseconds returned: Success
2014-09-14 10:27:53.348422 | trace | 0x007f24e50ae7c0 | Receiving...
2014-09-14 10:27:55.349152 | trace | 0x007f24e50ae7c0 | Received! With error code: Success
As you can see, setting the timeout worked, but still the receive method took 2 seconds.
The same code works just fine on Windows.

socket::receive() will block until either:
One or more bytes of data has been received successfully
An error occurs that would prevent data from being received
For blocking synchronous operations, if the underlying OS operation returns with a non-critical error, such as one indicating that the operation would block or should be tried again, then Boost.Asio will block in poll() waiting for the file descriptor to become ready. The blocking call to poll() is not affected by the SO_RCVTIMEO socket option. Once the file descriptor is ready, Boost.Asio will reattempt the operation.
Thus, the scenario in the original question occurs as follows:
Time | Client | Server
-----+----------------------------------------+-------------------------------
| socket.connect(...); | acceptor.accept(...);
0.00 | socket.set_option(timeout(second(1))); | sleep(seconds(2));
0.01 | socket.receive(...); |
0.02 | |-- recv(...); |
1.02 | | // timeout, errno = EAGAIN |
1.03 | |-- poll(socket); |
2.00 | | // data available, poll unblocks | socket.write(...);
2.01 | `-- recv(...);// success |
To get the desired timeout behavior, either:
Use the pattern presented in the official Boost.Asio timeout examples.
Invoke the OS call directly. However, be cautious, as other operations may indirectly affect this approach. For example, if an asynchronous operation is initiated on the socket, then the socket will be set to non-blocking. This will cause the recv() function to return immediately on a non-blocking socket regardless of the SO_RCVTIMEO socket option.
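For reference, below is a condensed sketch of the first option, modeled on the official blocking_tcp_client.cpp timeout example. It is only a sketch under assumptions: it reuses _socket from the question, assumes the io_service is dedicated to this socket (no other pending work), and converts the question's DurationType to a posix_time duration.

void Connection::receiveWithDeadline(const mutable_buffers_1& buffers, const DurationType& timeout)
{
    boost::asio::io_service& io = _socket.get_io_service();

    // Arm a deadline: when it expires, close the socket so the pending
    // read completes immediately with an error.
    boost::asio::deadline_timer deadline(io);
    deadline.expires_from_now(boost::posix_time::milliseconds(
        boost::chrono::duration_cast<boost::chrono::milliseconds>(timeout).count()));
    deadline.async_wait([this](const error_code& ec)
    {
        if (ec != boost::asio::error::operation_aborted)
            _socket.close();
    });

    // Start the read, then block by pumping the io_service until the
    // read handler has run - the "blocking" trick from the official example.
    error_code read_ec = boost::asio::error::would_block;
    boost::asio::async_read(_socket, buffers,
        [&read_ec](const error_code& ec, std::size_t) { read_ec = ec; });
    do
        io.run_one();
    while (read_ec == boost::asio::error::would_block);

    // Note: the cancelled timer's handler stays queued here; the official
    // example keeps the io_service running continuously instead.
    deadline.cancel();
    if (read_ec)
        throw RpcCommunicationError("Receive failed or timed out");
}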

Related

What's the exact Linux equivalent of Windows' WaitOnAddress()?

Using shared memory obtained with the shmget() system call, the aim of my C++ program is to fetch a bid price from the Internet through a server written in Rust, so that each time the value changes I perform a financial transaction.
Server pseudocode
Shared_struct.price = new_price
Client pseudocode
Infinite_loop_label:
Wait until memory address pointed by Shared_struct.price changes.
Launch_transaction(Shared_struct.price*1.13)
Goto Infinite_loop
Since launching a transaction involves paying transaction fees, I want to create a transaction only once per buy-price change.
Using a semaphore or a futex I can do the reverse, i.e. wait for a variable to reach a specific value, but how do I wait until a variable is no longer equal to its current value?
Whereas on Windows I can do something like this on the address of the shared segment:
ULONG g_TargetValue; // global, accessible to all processes
ULONG CapturedValue;
ULONG UndesiredValue;

UndesiredValue = 0;
CapturedValue = g_TargetValue;
while (CapturedValue == UndesiredValue) {
    WaitOnAddress(&g_TargetValue, &UndesiredValue, sizeof(ULONG), INFINITE);
    CapturedValue = g_TargetValue;
}
Is there a way to do this on Linux? Or a straight equivalent?
You can use a futex. (I assume "var" lives in the shared memory segment.)
/* Client */
int prv;
while (1) {
    prv = var;
    int ret = futex(&var, FUTEX_WAIT, prv, NULL, NULL, 0);
    /* Spurious wake-up */
    if (!ret && var == prv) continue;
    doTransaction();
}
/* Server */
int prv = NOT_CACHED;
while (1) {
    var = updateVar();
    if (var != prv || prv == NOT_CACHED)
        futex(&var, FUTEX_WAKE, 1, NULL, NULL, 0);
    prv = var;
}
It requires the server side to call futex as well to notify client(s).
Note that the same holds true for WaitOnAddress.
According to MSDN:
Any thread within the same process that changes the value at the address on which threads are waiting should call WakeByAddressSingle to wake a single waiting thread or WakeByAddressAll to wake all waiting threads.
(Added)
A more high-level synchronization primitive for this problem is a condition variable, which on Linux is itself implemented on top of futex. A sketch follows.
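A minimal sketch of that approach, assuming a process-shared pthread mutex/condvar pair stored in the shared segment (all names below are illustrative, and launch_transaction stands in for the pseudocode's hypothetical call):

#include <pthread.h>

struct shared_data {
    pthread_mutex_t lock;    /* init once with a PTHREAD_PROCESS_SHARED attribute */
    pthread_cond_t  changed; /* likewise */
    double          price;
};

/* Client: wait until price differs from the last value seen. */
void client_loop(struct shared_data *shm)
{
    pthread_mutex_lock(&shm->lock);
    double last = shm->price;
    for (;;) {
        while (shm->price == last)  /* also absorbs spurious wake-ups */
            pthread_cond_wait(&shm->changed, &shm->lock);
        last = shm->price;
        pthread_mutex_unlock(&shm->lock);
        launch_transaction(last * 1.13);
        pthread_mutex_lock(&shm->lock);
    }
}

/* Server: publish a new price and wake the waiters. */
void publish(struct shared_data *shm, double new_price)
{
    pthread_mutex_lock(&shm->lock);
    shm->price = new_price;
    pthread_cond_broadcast(&shm->changed);
    pthread_mutex_unlock(&shm->lock);
}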

How does epoll's EPOLLEXCLUSIVE mode interact with level-triggering?

Suppose the following series of events occurs:
We set up a listening socket
Thread A blocks waiting for the listening socket to become readable, using EPOLLIN | EPOLLEXCLUSIVE
Thread B also blocks waiting for the listening socket to become readable, also using EPOLLIN | EPOLLEXCLUSIVE
An incoming connection arrives at the listening socket, making the socket readable, and the kernel elects to wake up thread A.
But, before the thread actually wakes up and calls accept, a second incoming connection arrives at the listening socket.
Here, the socket is already readable, so the second connection doesn't change that. This is level-triggered epoll, so according to the normal rules, the second connection can be treated as a no-op, and the second thread doesn't need to be awoken. ...Of course, not waking up the second thread would kind of defeat the whole purpose of EPOLLEXCLUSIVE? But my trust in API designers doing the right thing is not as strong as it once was, and I can't find anything in the documentation to rule this out.
Questions
a) Is the above scenario possible, where two connections arrive but only one thread is woken? Or is it guaranteed that every distinct incoming connection on a listening socket will wake another thread?
b) Is there a general rule to predict how EPOLLEXCLUSIVE and level-triggered epoll interact?
c) What about EPOLLIN | EPOLLEXCLUSIVE and EPOLLOUT | EPOLLEXCLUSIVE for byte-stream fds, like a connected TCP socket or a pipe? E.g., what happens if more data arrives while a pipe is already readable?
Edited (original answer is after the code used for testing)
To make sure things are clear, I'll go over EPOLLEXCLUSIVE as it relates to edge-triggered events (EPOLLET) as well as level-triggered events, to show how these affect the expected behavior.
As you well know:
Edge Triggered: Once you set EPOLLET, events are triggered only if they change the state of the fd - meaning that only the first event is triggered and no new events will get triggered until that event is fully handled.
This design is explicitly meant to prevent epoll_wait from returning due to an event that is in the process of being handled (i.e., when new data arrives while the EPOLLIN was already raised but read hadn't been called or not all of the data was read).
The edge-triggered event rule is simple: all same-type (i.e., EPOLLIN) events are merged until all available data has been processed.
In the case of a listening socket, the EPOLLIN event won't be triggered again until all existing listen "backlog" sockets have been accepted using accept.
In the case of a byte stream, new events won't be triggered until all the available bytes have been read from the stream (the buffer has been emptied).
Level Triggered: On the other hand, level triggered events will behave closer to how legacy select (or poll) operates, allowing epoll to be used with older code.
The event-merger rule is more complex: events of the same type are only merged if no one is waiting for an event (no one is waiting for epoll_wait to return), or if multiple events happen before epoll_wait can return... otherwise any event causes epoll_wait to return.
In the case of a listening socket, the EPOLLIN event will be triggered every time a client connects... unless no one is waiting for epoll_wait to return, in which case the next call for epoll_wait will return immediately and all the EPOLLIN events that occurred during that time will have been merged into a single event.
In the case of a byte stream, new events will be triggered every time new data comes in... unless, of course, no one is waiting for epoll_wait to return, in which case the next call will return immediately, covering all the data that arrived in the meantime (even if it arrived in different chunks / events).
Exclusive return: The EPOLLEXCLUSIVE flag is used to prevent the "thundering herd" behavior, so only a single epoll_wait caller is woken up for each fd wake-up event.
As I pointed out before, for edge-triggered states, an fd wake-up event is a change in the fd state. So all EPOLLIN events will be raised until all data was read (the listening socket's backlog was emptied).
On the other hand, for level triggered events, each EPOLLIN will invoke a wake up event. If no one is waiting, these events will be merged.
Following the example in your question:
For level triggered events: every time a client connects, a single thread will return from epoll_wait... BUT, if two more clients were to connect while both threads were busy accepting the first two clients, these EPOLLIN events would merge into a single event and the next call to epoll_wait will return immediately with that merged event.
In the context of the example given in the question, thread B is expected to "wake up" due to epoll_wait returning.
In this case, both threads will "race" towards accept.
However, this doesn't defeat the EPOLLEXCLUSIVE directive or intent.
The EPOLLEXCLUSIVE directive is meant to prevent the "thundering herd" phenomenon. In this case, two threads are racing to accept two connections. Each thread can (presumably) call accept safely, with no errors. If three threads were used, the third would keep on sleeping.
If the EPOLLEXCLUSIVE weren't used, all the epoll_wait threads would have been woken up whenever a connection was available, meaning that as soon as the first connection arrived, both threads would have been racing to accept a single connection (resulting in a possible error for one of them).
For edge triggered events: only one thread is expected to receive the "wake up" call. That thread is expected to accept all waiting connections (empty the listen "backlog"). No more EPOLLIN events will be raised for that socket until the backlog is emptied.
The same applies to readable sockets and pipes. The thread that was woken up is expected to deal with all the readable data. This prevents other waiting threads from attempting to read the data concurrently and running into file-lock race conditions.
I would recommend (and this is what I do) setting the listening socket to non-blocking mode and calling accept in a loop until an EAGAIN (or EWOULDBLOCK) error is raised, indicating that the backlog is empty. There is no way to avoid the risk of events being merged. The same is true for reading from a socket.
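A minimal sketch of that loop (assuming the listening fd was already set to non-blocking; handle_connection is a placeholder):

for (;;) {
    int connfd = accept(server_fd, NULL, NULL);
    if (connfd < 0) {
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;             /* backlog is empty - go back to epoll_wait */
        perror("accept");      /* a real error */
        break;
    }
    handle_connection(connfd); /* placeholder for the actual work */
}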
Testing this with code:
I wrote a simple test, with some sleep commands and blocking sockets. Client sockets are initiated only after both threads start waiting for epoll.
Client thread initiation is delayed, so client 1 and client 2 start a second apart.
Once a server thread is woken up, it will sleep for a second (allowing the second client to do its thing) before calling accept. Maybe the servers should sleep a little longer, but it seems close enough to manage the scheduler without resorting to condition variables.
Here are the results of my test code (which might be a mess, I'm not the best person for test design)...
On Ubuntu 16.10, which supports EPOLLEXCLUSIVE, the test results show that the listening threads are woken up one after the other, in response to the clients. In the example in the question, thread B is woken up.
Test address: <null>:8000
Server thread 2 woke up with 1 events
Server thread 2 will sleep for a second, to let things happen.
client number 1 connected
Server thread 1 woke up with 1 events
Server thread 1 will sleep for a second, to let things happen.
client number 2 connected
Server thread 2 accepted a connection and saying hello.
client 1: Hello World - from server thread 2.
Server thread 1 accepted a connection and saying hello.
client 2: Hello World - from server thread 1.
For comparison, on Ubuntu 16.04 (without EPOLLEXCLUSIVE support) both threads are woken up for the first connection. Since I use blocking sockets, the second thread hangs on accept until client # 2 connects.
main.c:178:2: warning: #warning EPOLLEXCLUSIVE undeclared, test is futile [-Wcpp]
#warning EPOLLEXCLUSIVE undeclared, test is futile
^
Test address: <null>:8000
Server thread 1 woke up with 1 events
Server thread 1 will sleep for a second, to let things happen.
Server thread 2 woke up with 1 events
Server thread 2 will sleep for a second, to let things happen.
client number 1 connected
Server thread 1 accepted a connection and saying hello.
client 1: Hello World - from server thread 1.
client number 2 connected
Server thread 2 accepted a connection and saying hello.
client 2: Hello World - from server thread 2.
For one more comparison, the results for level triggered kqueue show that both threads are awoken for the first connection. Since I use blocking sockets, the second thread hangs on accept until client # 2 connects.
Test address: <null>:8000
client number 1 connected
Server thread 2 woke up with 1 events
Server thread 1 woke up with 1 events
Server thread 2 will sleep for a second, to let things happen.
Server thread 1 will sleep for a second, to let things happen.
Server thread 2 accepted a connection and saying hello.
client 1: Hello World - from server thread 2.
client number 2 connected
Server thread 1 accepted a connection and saying hello.
client 2: Hello World - from server thread 1.
My test code was (sorry for the lack of comments and the messy code, I wasn't writing for future maintenance):
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#define ADD_EPOLL_OPTION 0 // define as EPOLLET or 0
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <netdb.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>
#if !defined(__linux__) && !defined(__CYGWIN__)
#include <sys/event.h>
#define reactor_epoll 0
#else
#define reactor_epoll 1
#include <sys/epoll.h>
#include <sys/timerfd.h>
#endif
int sock_listen(const char *address, const char *port);
void *listen_thread(void *arg);
void *client_thread(void *arg);
int server_fd;
char const *address = NULL;
char const *port = "8000";
int main(int argc, char const *argv[]) {
if (argc == 2) {
port = argv[1];
} else if (argc == 3) {
port = argv[2];
address = argv[1];
}
fprintf(stderr, "Test address: %s:%s\n", address ? address : "<null>", port);
server_fd = sock_listen(address, port);
/* code */
pthread_t threads[4];
for (size_t i = 0; i < 2; i++) {
if (pthread_create(threads + i, NULL, listen_thread, (void *)i))
perror("couldn't initiate server thread"), exit(-1);
}
for (size_t i = 2; i < 4; i++) {
sleep(1);
if (pthread_create(threads + i, NULL, client_thread, (void *)i))
perror("couldn't initiate client thread"), exit(-1);
}
// join only server threads.
for (size_t i = 0; i < 2; i++) {
pthread_join(threads[i], NULL);
}
close(server_fd);
sleep(1);
return 0;
}
/**
Sets a socket to non-blocking state.
*/
inline int sock_set_non_block(int fd) // Thanks to Bjorn Reese
{
/* If they have O_NONBLOCK, use the Posix way to do it */
#if defined(O_NONBLOCK)
/* Fixme: O_NONBLOCK is defined but broken on SunOS 4.1.x and AIX 3.2.5. */
int flags;
if (-1 == (flags = fcntl(fd, F_GETFL, 0)))
flags = 0;
// printf("flags initial value was %d\n", flags);
return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
#else
/* Otherwise, use the old way of doing it */
static int flags = 1;
return ioctl(fd, FIOBIO, &flags);
#endif
}
/* open a listening socket */
int sock_listen(const char *address, const char *port) {
int srvfd;
// setup the address
struct addrinfo hints;
struct addrinfo *servinfo; // will point to the results
memset(&hints, 0, sizeof hints); // make sure the struct is empty
hints.ai_family = AF_UNSPEC; // don't care IPv4 or IPv6
hints.ai_socktype = SOCK_STREAM; // TCP stream sockets
hints.ai_flags = AI_PASSIVE; // fill in my IP for me
if (getaddrinfo(address, port, &hints, &servinfo)) {
perror("addr err");
return -1;
}
// get the file descriptor
srvfd =
socket(servinfo->ai_family, servinfo->ai_socktype, servinfo->ai_protocol);
if (srvfd <= 0) {
perror("socket err");
freeaddrinfo(servinfo);
return -1;
}
// // keep the server socket blocking for the test.
// // make sure the socket is non-blocking
// if (sock_set_non_block(srvfd) < 0) {
// perror("couldn't set socket as non blocking! ");
// freeaddrinfo(servinfo);
// close(srvfd);
// return -1;
// }
// avoid the "address taken"
{
int optval = 1;
setsockopt(srvfd, SOL_SOCKET, SO_REUSEADDR, &optval, sizeof(optval));
}
// bind the address to the socket
{
int bound = 0;
for (struct addrinfo *p = servinfo; p != NULL; p = p->ai_next) {
if (!bind(srvfd, p->ai_addr, p->ai_addrlen))
bound = 1;
}
if (!bound) {
// perror("bind err");
freeaddrinfo(servinfo);
close(srvfd);
return -1;
}
}
freeaddrinfo(servinfo);
// listen in
if (listen(srvfd, SOMAXCONN) < 0) {
perror("couldn't start listening");
close(srvfd);
return -1;
}
return srvfd;
}
/* will start listening, sleep for 5 seconds, then accept all the backlog and
 * finish */
void *listen_thread(void *arg) {
int epoll_fd;
ssize_t event_count;
#if reactor_epoll
#ifndef EPOLLEXCLUSIVE
#warning EPOLLEXCLUSIVE undeclared, test is futile
#define EPOLLEXCLUSIVE 0
#endif
// create the epoll wait fd
epoll_fd = epoll_create1(0);
if (epoll_fd < 0)
perror("couldn't create epoll fd"), exit(1);
// add the server fd to the epoll watchlist
{
struct epoll_event chevent = {0};
chevent.data.ptr = (void *)((uintptr_t)server_fd);
chevent.events =
EPOLLOUT | EPOLLIN | EPOLLERR | EPOLLEXCLUSIVE | ADD_EPOLL_OPTION;
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, server_fd, &chevent);
}
// wait with epoll
struct epoll_event events[10];
event_count = epoll_wait(epoll_fd, events, 10, 5000);
#else
// testing on BSD, use kqueue
epoll_fd = kqueue();
if (epoll_fd < 0)
perror("couldn't create kqueue fd"), exit(1);
// add the server fd to the kqueue watchlist
{
struct kevent chevent[2];
EV_SET(chevent, server_fd, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0,
(void *)((uintptr_t)server_fd));
EV_SET(chevent + 1, server_fd, EVFILT_WRITE, EV_ADD | EV_ENABLE, 0, 0,
(void *)((uintptr_t)server_fd));
kevent(epoll_fd, chevent, 2, NULL, 0, NULL);
}
// wait with kqueue
static struct timespec reactor_timeout = {.tv_sec = 5, .tv_nsec = 0};
struct kevent events[10];
event_count = kevent(epoll_fd, NULL, 0, events, 10, &reactor_timeout);
#endif
close(epoll_fd);
if (event_count <= 0) {
fprintf(stderr, "Server thread %lu wakeup no events / error\n",
(size_t)arg + 1);
perror("errno ");
return NULL;
}
fprintf(stderr, "Server thread %lu woke up with %lu events\n",
(size_t)arg + 1, event_count);
fprintf(stderr,
"Server thread %lu will sleep for a second, to let things happen.\n",
(size_t)arg + 1);
sleep(1);
int connfd;
struct sockaddr_storage client_addr;
socklen_t client_addrlen = sizeof client_addr;
/* accept up all connections. we're non-blocking, -1 == no more connections */
if ((connfd = accept(server_fd, (struct sockaddr *)&client_addr,
&client_addrlen)) >= 0) {
fprintf(stderr,
"Server thread %lu accepted a connection and saying hello.\n",
(size_t)arg + 1);
if (write(connfd, arg ? "Hello World - from server thread 2."
: "Hello World - from server thread 1.",
35) < 35)
perror("server write failed");
close(connfd);
} else {
fprintf(stderr, "Server thread %lu failed to accept a connection",
(size_t)arg + 1);
perror(": ");
}
return NULL;
}
void *client_thread(void *arg) {
int fd;
// setup the address
struct addrinfo hints;
struct addrinfo *addrinfo; // will point to the results
memset(&hints, 0, sizeof hints); // make sure the struct is empty
hints.ai_family = AF_UNSPEC; // don't care IPv4 or IPv6
hints.ai_socktype = SOCK_STREAM; // TCP stream sockets
hints.ai_flags = AI_PASSIVE; // fill in my IP for me
if (getaddrinfo(address, port, &hints, &addrinfo)) {
perror("client couldn't initiate address");
return NULL;
}
// get the file descriptor
fd =
socket(addrinfo->ai_family, addrinfo->ai_socktype, addrinfo->ai_protocol);
if (fd <= 0) {
perror("client couldn't create socket");
freeaddrinfo(addrinfo);
return NULL;
}
// // // Leave the socket blocking for the test.
// // make sure the socket is non-blocking
// if (sock_set_non_block(fd) < 0) {
// freeaddrinfo(addrinfo);
// close(fd);
// return -1;
// }
if (connect(fd, addrinfo->ai_addr, addrinfo->ai_addrlen) < 0 &&
errno != EINPROGRESS) {
fprintf(stderr, "client number %lu FAILED\n", (size_t)arg - 1);
perror("client connect failure");
close(fd);
freeaddrinfo(addrinfo);
return NULL;
}
freeaddrinfo(addrinfo);
fprintf(stderr, "client number %lu connected\n", (size_t)arg - 1);
char buffer[128];
if (read(fd, buffer, 35) < 35) {
perror("client: read error");
close(fd);
} else {
buffer[35] = 0;
fprintf(stderr, "client %lu: %s\n", (size_t)arg - 1, buffer);
close(fd);
}
return NULL;
}
P.S.
As a final recommendation, I would consider having no more than a single thread and a single epoll fd per process. This way the "thundering herd" is a non-issue and EPOLLEXCLUSIVE (which is still very new and isn't widely supported) can be disregarded... the only "thundering herd" this still exposes is for the limited number of shared sockets, where the race condition might actually be good for load balancing.
Original Answer
I'm not sure I understand the confusion, so I'll go over EPOLLET and EPOLLEXCLUSIVE to show their combined expected behavior.
As you well know:
Once you set EPOLLET (edge triggered), events are triggered on fd state changes rather than fd events.
This design is explicitly meant to prevent epoll_wait from returning due to an event that is in the process of being handled (i.e., when new data arrives while the EPOLLIN was already raised but read hadn't been called or not all of the data was read).
In the case of a listening socket, the EPOLLIN event won't be triggered again until all existing listen "backlog" sockets have been accepted using accept.
The EPOLLEXCLUSIVE flag is used to prevent the "thundering herd" behavior, so only a single epoll_wait caller is woken up for each fd wake-up event.
As I pointed out before, for edge-triggered states, an fd wake-up event is a change in the fd state. So all EPOLLIN events will be raised until all data was read (the listening socket's backlog was emptied).
When merging these behaviors, and following the example in your question, only one thread is expected to receive the "wake up" call. That thread is expected to accept all waiting connections (empty the listen "backlog") or no more EPOLLIN events will be raised for that socket.
The same applies to readable sockets and pipes. The thread that was woken up is expected to deal with all the readable data. This prevents other waiting threads from attempting to read the data concurrently and running into file-lock race conditions.
I would recommend that you consider avoiding the edge triggered events if you mean to call accept only once for each epoll_wait wake-up event. Regardless of using EPOLLEXCLUSIVE, you run the risk of not emptying the existing "backlog", so that no new wake-up events will be raised.
Alternatively, I would recommend (and this is what I do) setting the listening socket to non-blocking mode and calling accept in a loop until an EAGAIN (or EWOULDBLOCK) error is raised, indicating that the backlog is empty.
EDIT 1: Level Triggered Events
It seems, as Nathaniel pointed out in the comment, that I totally misunderstood the question... I guess I'm used to EPOLLET being the misunderstood element.
So, what happens with normal, level-triggered, events (NOT EPOLLET)?
Well... the expected behavior is the exact mirror image (opposite) of edge triggered events.
For listening sockets, epoll_wait is expected to return whenever a new connection is available, whether accept was called after a previous event or not.
Events are only "merged" if no-one is waiting with epoll_wait... in which case the next call for epoll_wait will return immediately.
In the context of the example given in the question, thread B is expected to "wake up" due to epoll_wait returning.
In this case, both threads will "race" towards accept.
However, this doesn't defeat the EPOLLEXCLUSIVE directive or intent.
The EPOLLEXCLUSIVE directive is meant to prevent the "thundering herd" phenomenon. In this case, two threads are racing to accept two connections. Each thread can (presumably) call accept safely, with no errors. If three threads were used, the third would keep on sleeping.
If the EPOLLEXCLUSIVE weren't used, all the epoll_wait threads would have been woken up whenever a connection was available, meaning that as soon as the first connection arrived, both threads would have been racing to accept a single connection (resulting in a possible error for one of them).
This is only a partial answer, but Jason Baron (the author of the EPOLLEXCLUSIVE patch) just responded to an email I sent him, confirming that when using EPOLLEXCLUSIVE in level-triggered mode he does think it's possible that two connections will arrive but only one thread will be woken (thread B keeps sleeping). So when using EPOLLEXCLUSIVE you have to use the same kinds of defensive programming as for edge-triggered epoll, regardless of whether you set EPOLLET.

Linux kernel module to read out GPS device via USB

I'm writing a Linux kernel module to read out a GPS device (a u-blox NEO-7) via USB by using the book Linux Device Drivers.
I can already probe and read out data from the device successfully. But there is a problem when reading from the device with multiple applications simultaneously (I used "cat /dev/ublox" to read indefinitely). When the active/reading application is cancelled via "Ctrl + C", the next read attempt from the other application fails (specifically, the call to usb_submit_urb(...) returns -EINVAL).
I use following ideas for my implementation:
The kernel module methods should be re-entrant. Therefore, I use a mutex to protect critical sections, e.g. to allow only one reader at a time.
To save resources, I reuse the struct urb for different reading requests (see an explanation)
Device-specific data like USB endpoint address and so on is held in a device-specific struct called ublox_device.
After submitting the USB read request, the calling process is sent to sleep until the asynchronous complete handler is called.
I verified that the ideas are implemented correctly: I ran two instances of "cat /dev/ublox" simultaneously and got the correct output (only one instance accessed the critical read section at a time). Reusing the "struct urb" also works: both instances read out data alternately.
The problem only occurs if the currently active instance is cancelled via "Ctrl + C". I can solve the problem by not reusing the "struct urb" but I would like to avoid that. I.e. by allocating a new "struct urb" for each read request via usb_alloc_urb(...) (usually it is allocated once when probing the USB device).
My code follows the USB skeleton driver from Greg Kroah-Hartman who also reuse the "struct urb" for different reading requests.
Maybe someone has a clue what's going wrong here.
The complete code can be found on pastebin. Here is a small excerpt of the read method and the USB request complete handler.
static ssize_t ublox_read(struct file *file, char *buffer, size_t count, loff_t *pos)
{
    struct ublox_device *ublox_device = file->private_data;
    ...
    return_value = mutex_lock_interruptible(&ublox_device->bulk_in_mutex);
    if (return_value < 0)
        return -EINTR;
    ...
retry:
    usb_fill_bulk_urb(...);
    ublox_device->read_in_progress = 1;
    /* Next call fails if active application is cancelled via "Ctrl + C" */
    return_value = usb_submit_urb(ublox_device->bulk_in_urb, GFP_KERNEL);
    if (return_value) {
        printk(KERN_ERR "usb_submit_urb(...) failed!\n");
        ublox_device->read_in_progress = 0;
        goto exit;
    }
    /* Go to sleep until read operation has finished */
    return_value = wait_event_interruptible(ublox_device->bulk_in_wait_queue, (!ublox_device->read_in_progress));
    if (return_value < 0)
        goto exit;
    ...
exit:
    mutex_unlock(&ublox_device->bulk_in_mutex);
    return return_value;
}
static void ublox_read_bulk_callback(struct urb *urb)
{
    struct ublox_device *ublox_device = urb->context;
    int status = urb->status;

    /* Evaluate status... */
    ...
    ublox_device->transferred_bytes = urb->actual_length;
    ublox_device->read_in_progress = 0;
    wake_up_interruptible(&ublox_device->bulk_in_wait_queue);
}
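A plausible explanation for the -EINVAL (my assumption; I have not verified it against the full driver): when "Ctrl + C" delivers a signal, wait_event_interruptible() returns -ERESTARTSYS while the URB is still in flight, so the next reader resubmits a URB that was never reaped. If that is the cause, one way to keep reusing the URB would be to reap it on the interrupted path, e.g.:

/* Sketch: in ublox_read(), kill the in-flight URB if the sleep was
 * interrupted, so a later submit sees a clean, idle URB. */
return_value = wait_event_interruptible(ublox_device->bulk_in_wait_queue,
                                        (!ublox_device->read_in_progress));
if (return_value < 0) {
    usb_kill_urb(ublox_device->bulk_in_urb); /* cancels and waits for the completion handler */
    ublox_device->read_in_progress = 0;
    goto exit;
}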
Now, I allocate a new struct urb for each read request. This avoids the problem with the messed up struct urb after an active read request is cancelled by the calling application. The allocated struct is freed in the complete handler.
I will come back to LKML when I optimize my code. For now, it is okay to allocate a new struct urb for each single read request. The complete code of the kernel module is on pastebin.
static ssize_t ublox_read(struct file *file, char *buffer, size_t count, loff_t *pos)
{
    struct ublox_device *ublox_device = file->private_data;
    ...
retry:
    ublox_device->bulk_in_urb = usb_alloc_urb(0, GFP_KERNEL);
    ...
    usb_fill_bulk_urb(...);
    ...
    return_value = usb_submit_urb(ublox_device->bulk_in_urb, GFP_KERNEL);
    ...
}

static void ublox_read_bulk_callback(struct urb *urb)
{
    struct ublox_device *ublox_device = urb->context;
    ...
    usb_free_urb(ublox_device->bulk_in_urb);
    ...
}

IOCP multithreaded server and reference counted class

I work on an IOCP server (overlapped I/O, 4 threads, CreateIoCompletionPort, GetQueuedCompletionStatus, WSASend, etc.). My goal is to send a single reference-counted buffer to all connected sockets. (I followed Len Holgate's suggestion from the post "WSAsend to all connected socket in multithreaded iocp server".) After the buffer has been sent to all connected clients, it should be deleted.
This is the class with the buffer to be sent:
class refbuf
{
private:
    int m_nLength;
    int m_wsk;
    char *m_pnData; // buffer to send
    mutable int mRefCount;
public:
    ...
    void grab() const
    {
        ++mRefCount;
    }
    void release() const
    {
        if (mRefCount > 0)
            --mRefCount;
        if (mRefCount == 0) { delete (refbuf *)this; }
    }
    ...
    char* bufadr() { return m_pnData; }
};
Sending the buffer to all sockets:
refbuf *refb = new refbuf(4);
...
EnterCriticalSection(&g_CriticalSection);
pTmp1 = g_pCtxtList; // start of linked list with sockets
while (pTmp1)
{
    pTmp2 = pTmp1->pCtxtBack;
    ovl = TakeOvl();                  // ovl - struct containing WSAOVERLAPPED
    ovl->wsabuf.buf = refb->bufadr(); // address of m_pnData from refbuf
    ovl->rcb = refb;                  // when GQCS gets a notification, rcb is used to decrease mRefCount
    ovl->wsabuf.len = 4;
    refb->grab();                     // mRefCount++
    WSASend(pTmp1->Socket, &(ovl->wsabuf), 1, &dwSendNumBytes, 0, &(ovl->Overlapped), NULL);
    pTmp1 = pTmp2;
}
LeaveCriticalSection(&g_CriticalSection);
And one of the four threads:
GetQueuedCompletionStatus(hIOCP, &dwIoSize, (PDWORD_PTR)&lpPerSocketContext, (LPOVERLAPPED *)&lpOverlapped, INFINITE);
...
lpIOContext = (PPER_IO_CONTEXT)lpOverlapped;
lpIOContext->rcb->release(); // mRefCount--; if mRefCount reaches 0, the object is deleted
I checked this with 5 connected clients and it seems to work: when GQCS has received all notifications, mRefCount reaches 0 and the delete is executed.
And my questions: is this approach appropriate? What if there are, for example, 100 or more clients? Is the situation avoided where one thread deletes the object while another is still using it? How do I implement an atomic reference count in this scenario? Thanks in advance.
Obvious issues; in order of importance...
Your refbuf class doesn't use thread-safe ref-count manipulation. Use InterlockedIncrement() and friends; see the sketch after this list.
I assume that TakeOvl() obtains a new OVERLAPPED and WSABUF structure per operation.
Your naming could be better: why grab() rather than AddRef()? What does TakeOvl() take from? Those Tmp variables are something, and the least important something is that they're 'temporary', so name them after a more important something. Go read Code Complete.
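A minimal sketch of the first point, assuming mRefCount is changed to a mutable volatile LONG member (this is one way to do it, not the only one):

void grab() const
{
    InterlockedIncrement(&mRefCount);
}
void release() const
{
    // InterlockedDecrement returns the new value atomically, so exactly
    // one thread observes 0 and performs the delete.
    if (InterlockedDecrement(&mRefCount) == 0)
        delete (refbuf *)this;
}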

There's no sleep()/wait for mutex in node.js, so how to deal with large IO tasks?

I have a large array of filenames I need to check, but I also need to respond to network clients. The easiest way is to perform:
for (var i = 0; i < array.length; i++) {
    fs.readFile(array[i], function (err, data) { ... });
}
but the array can be of any length, say 100,000, so it's not a good idea to perform 100,000 reads at once; on the other hand, fs.readFileSync() can take too long. Also, launching the next fs.readFile() in the callback, like this:
var Idx = 0;
function checkFile() {
    fs.readFile(array[Idx], function (err, data) {
        Idx++;
        if (Idx < array.length) {
            checkFile();
        } else {
            Idx = 0;
            setTimeout(checkFile, 10000); // start checking files again in ten seconds
        }
    });
}
is also not the best option, because array[] is constantly updated by network clients - some items are deleted, new ones added, and so on.
What is the best way to accomplish such a task in node.js?
You should stick to your first solution (fs.readFile). For file I/O, node.js uses a thread pool. The reason is that most Unix kernels don't provide efficient asynchronous APIs for the file system. Even if you start 10,000 reads concurrently, only a few reads will actually run while the rest wait in a queue.
In order to make this answer more interesting, I browsed through node's code again to make sure that things hadn't changed.
Long story short: file I/O uses blocking system calls performed by a thread pool with at most 4 concurrent threads.
The important code is in libeio, which is abstracted by libuv. All I/O code is wrapped by macros which queue requests. For example:
eio_req *eio_read (int fd, void *buf, size_t length, off_t offset, int pri, eio_cb cb, void *data, eio_channel *channel)
{
    REQ (EIO_READ); req->int1 = fd; req->offs = offset; req->size = length; req->ptr2 = buf; SEND;
}
REQ prepares the request and SEND queues it. We eventually end up in etp_maybe_start_thread:
static unsigned int started, idle, wanted = 4;
(...)
static void
etp_maybe_start_thread (void)
{
    if (ecb_expect_true (etp_nthreads () >= wanted))
        return;
(...)
The queue keeps 4 threads running to process the requests. When our read request is finally executed, eio simply uses the blocking read from unistd.h:
case EIO_READ:
    ALLOC (req->size);
    req->result = req->offs >= 0
                ? pread (req->int1, req->ptr2, req->size, req->offs)
                : read (req->int1, req->ptr2, req->size);
    break;
