Are epoll events being watched when not epoll_waiting - linux

I'm rather new to event-based programming. I'm experimenting with epoll's edge-mode, which apparently only signals files which have become ready for read/write (as opposed to level-mode, which signals all ready files, regardless of whether they were already ready or just became ready).
What's not clear to me, is: in edge-mode, am I informed of readiness events that happen while I'm not epoll_waiting ? What about events on one-shot files that haven't been rearmed yet ?
To illustrate why I'm asking that, consider the following scenario:
have 10 non-blocking sockets connected
configure epoll_ctl to react when the sockets are ready for read, in edge-mode + oneshot : EPOLLET | EPOLLONESHOT | EPOLLIN
epoll_wait for something to happen (reports max 10 events)
linux wakes my process and reports sockets #1 and #2 are ready
I read and process data from socket #1 (until EAGAIN)
I read and process data from socket #2 (until EAGAIN)
While I'm doing that, a socket S receives data
I processed all events, so I rearm the triggered files with epoll_ctl in EPOLL_CTL_MOD mode, because of oneshot
my loop goes back to epoll_waiting the next batch of events
Ok, so will the last epoll_wait always be notified of the readiness of socket S? Even if S is #1 (i.e. it's not rearmed)?

I'm experimenting with epoll's edge-mode which apparently only signals
files which have become ready for read/write (as opposed to level-mode
which signals all ready files, regardless of whether they were
already ready, or just became ready)
First, you need an accurate mental model of how the system works, and your view of epoll(7) is not quite accurate.
The difference between edge-triggered and level-triggered is the definition of what exactly makes an event. The former generates one event for each action that has been subscribed on the file descriptor; once you consume the event, it is gone - even if you didn't consume all the data that generated such an event. OTOH, the latter keeps generating the same event over and over until you consume all the data that generated the event.
Here's an example that puts these concepts in action, blatantly stolen from man 7 epoll:
1. The file descriptor that represents the read side of a pipe (rfd) is registered on the epoll instance.
2. A pipe writer writes 2 kB of data on the write side of the pipe.
3. A call to epoll_wait(2) is done that will return rfd as a ready file descriptor.
4. The pipe reader reads 1 kB of data from rfd.
5. A call to epoll_wait(2) is done.
If the rfd file descriptor has been added to the epoll interface using
the EPOLLET (edge-triggered) flag, the call to epoll_wait(2) done in
step 5 will probably hang despite the available data still present in
the file input buffer; meanwhile the remote peer might be expecting a
response based on the data it already sent. The reason for this is
that edge-triggered mode delivers events only when changes occur on
the monitored file descriptor. So, in step 5 the caller might end up
waiting for some data that is already present inside the input buffer.
In the above example, an event on rfd will be generated because of the
write done in 2 and the event is consumed in 3. Since the read
operation done in 4 does not consume the whole buffer data, the call
to epoll_wait(2) done in step 5 might block indefinitely.
In short, the fundamental difference is in the definition of "event": edge-triggered treats events as a single unit that you consume once; level-triggered defines the consumption of an event as being equivalent to consuming all of the data belonging to that event.
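The five man-page steps above can be replayed as a small sketch. It assumes Linux and uses Python's select.epoll wrapper; the function name replay_pipe_steps is made up for illustration, and a zero timeout is used in step 5 so the call returns empty instead of hanging.

```python
import os
import select


def replay_pipe_steps(flags):
    """Run steps 1-5 on a pipe; return (ready in step 3, ready in step 5)."""
    rfd, wfd = os.pipe()
    ep = select.epoll()
    try:
        ep.register(rfd, flags)        # step 1: register the read side
        os.write(wfd, b"x" * 2048)     # step 2: writer sends 2 kB
        first = ep.poll(0)             # step 3: rfd is reported as ready
        os.read(rfd, 1024)             # step 4: reader consumes only 1 kB
        second = ep.poll(0)            # step 5: poll again, without blocking
        return bool(first), bool(second)
    finally:
        ep.close()
        os.close(rfd)
        os.close(wfd)
```

Calling it with select.EPOLLIN (level-triggered) reports rfd in both polls, while select.EPOLLIN | select.EPOLLET reports it only in the first one, even though 1 kB is still sitting in the pipe.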
Now, with that out of the way, let's address your specific questions.
in edge-mode, am I informed of readiness events that happen while I'm
not epoll_waiting
Yes, you are. Internally, the kernel queues up the interesting events that happened on each file descriptor. They are returned on the next call to epoll_wait(2), so you can rest assured that you won't lose events. Well, maybe not exactly on the next call if there are other events pending and the events buffer passed to epoll_wait(2) can't accommodate them all, but the point is, eventually these events will be reported.
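A rough way to check this yourself, assuming Linux and Python's select.epoll wrapper: data is sent on several sockets while nobody is inside epoll_wait, and the events are then collected with a deliberately small events buffer. The helper name collect_in_batches and its parameters are hypothetical.

```python
import select
import socket


def collect_in_batches(n_sockets=3, batch=2):
    """Return the number of events delivered by each successive poll call."""
    pairs = [socket.socketpair() for _ in range(n_sockets)]
    ep = select.epoll()
    try:
        for _, rx in pairs:
            rx.setblocking(False)
            ep.register(rx.fileno(), select.EPOLLIN | select.EPOLLET)
        # No thread is waiting on the epoll instance while this data arrives.
        for tx, _ in pairs:
            tx.sendall(b"ping")
        sizes = []
        while True:
            events = ep.poll(0, batch)   # at most `batch` events per call
            if not events:
                break
            sizes.append(len(events))
        return sizes
    finally:
        ep.close()
        for tx, rx in pairs:
            tx.close()
            rx.close()
```

All three readiness events show up even though they occurred before the first poll, and they are spread over more than one call because each call accepts at most two.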
What about events on one-shot files that haven't been rearmed yet?
Again, you never lose events. If the file descriptor hasn't been rearmed yet, should any interesting event arise, it is simply queued in memory until the file descriptor is rearmed. Once it is rearmed, any pending events - including those that happened before the descriptor was rearmed - will be reported in the next call to epoll_wait(2) (again, maybe not exactly the next one, but they will be reported). In other words, EPOLLONESHOT does not disable event monitoring, it simply disables event notification temporarily.
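The one-shot case can be sketched the same way, again assuming Linux and Python's select.epoll wrapper. The function name oneshot_rearm is illustrative only; ep.modify() is Python's wrapper for epoll_ctl with EPOLL_CTL_MOD.

```python
import select
import socket


def oneshot_rearm():
    """Return event counts: first poll, poll while disarmed, poll after rearm."""
    tx, rx = socket.socketpair()
    rx.setblocking(False)
    mask = select.EPOLLIN | select.EPOLLET | select.EPOLLONESHOT
    ep = select.epoll()
    try:
        ep.register(rx.fileno(), mask)
        tx.sendall(b"first")
        armed = ep.poll(0)             # reported once, then the fd is disarmed
        rx.recv(4096)                  # consume the data that caused the event
        tx.sendall(b"second")          # arrives while the fd is still disarmed
        disarmed = ep.poll(0)          # nothing is reported yet
        ep.modify(rx.fileno(), mask)   # rearm (EPOLL_CTL_MOD)
        rearmed = ep.poll(0)           # the pending readiness is reported now
        return len(armed), len(disarmed), len(rearmed)
    finally:
        ep.close()
        tx.close()
        rx.close()
```

The second message is not reported while the descriptor is disarmed, but it is reported on the first poll after rearming.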
Ok, so will the last epoll_wait always be notified of the readiness of
socket S? Even if S is #1 (i.e. it's not rearmed)?
Given what I said above, by now it should be pretty clear: yes, it will. You won't lose any event. epoll offers strong guarantees, it's awesome. It's also thread-safe and you can wait on the same epoll fd in different threads and update event subscription concurrently. epoll is very powerful, and it is well worth taking the time to learn it!

Related

Is there a function for determining how many bytes are left to read on a unix domain socket?

The aim is to interact with an OpenEthereum server using json-rpc.
The problem is that once connected, I need to react only when receiving data, as the aim is to subscribe to an event, so I need the recv() function to be blocking.
But in that case, if I ask to read more from the buffer than what the server sent, the request will block.
The OpenEthereum server separates its requests with a linefeed \n character, but I don't know how this can help.
I know about simply waiting for recv() to time out. But I'm using C++ and IPC to get better latency than my competitors on arbitrage. This also means I need as few context switches as possible.
How to efficiently read a message whose length cannot be determined in advance?
Is there a function for determining how many bytes are left to read on a unix domain socket?
No - just keep doing non-blocking reads until one returns EAGAIN or EWOULDBLOCK.
There may be a platform-specific ioctl or fcntl - but you haven't named a platform, and it's neither portable nor necessary.
How to efficiently read a message whose length cannot be determined in advance?
Just do a non-blocking read into a buffer large enough to contain the largest message you might receive.
I need to react only when receiving data as the aim is to subscribe to an event so I need the recv() function to be blocking
You're confusing two things.
How to be notified when the socket becomes readable:
by using select or poll to wait until the socket is readable. Just read their manuals, that's their most common use case.
How to read everything available to read without blocking indefinitely:
by doing non-blocking reads until EWOULDBLOCK or EAGAIN is returned.
There is logically a third step, for stream-based protocols like this, which is correctly managing buffers in case of partial messages. Oh, and actually parsing the messages, but I assume you have a JSON library already.
This is entirely normal, basic UNIX I/O design. It is not an exotic optimization.
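As a rough illustration of those steps (wait for readability, read until the socket would block, keep any partial message for later), here is a sketch in Python rather than C++. The function name read_messages is hypothetical, and the newline framing is taken from the question.

```python
import select
import socket


def read_messages(sock, pending=b""):
    """Wait until sock is readable, drain it, and split complete lines.

    Returns (complete_lines, leftover_bytes, eof). sock must be non-blocking.
    """
    select.select([sock], [], [])        # sleep until there is something to read
    eof = False
    while True:
        try:
            chunk = sock.recv(4096)
        except BlockingIOError:          # EAGAIN / EWOULDBLOCK: nothing left
            break
        if not chunk:                    # peer closed its sending side
            eof = True
            break
        pending += chunk
    *lines, leftover = pending.split(b"\n")
    return lines, leftover, eof
```

The leftover bytes are passed back in on the next call, so a message that arrives in two pieces is only handed to the JSON parser once its terminating \n has been seen.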

TCP close() vs shutdown() in Linux OS

I know there are already a lot of similar questions on Stack Overflow, but none of the answers seems convincing. Basically I'm trying to understand under what circumstances I need to use one over the other, or use both.
I would also like to understand whether close() and shutdown() with SHUT_RDWR are the same.
Closing TCP connections has gathered so much confusion that we can rightfully say either this aspect of TCP has been poorly designed, or is lacking somewhere in documentation.
Short answer
To do it the proper way, you should use all 3: shutdown(SHUT_WR), shutdown(SHUT_RD) and close(), in this order. No, shutdown(SHUT_RDWR) and close() are not the same. Read their documentation carefully and questions on SO and articles about it, you need to read more of them for an overview.
Longer answer
The first thing to clarify is what you aim for when closing a connection. Presumably you use TCP for a higher-level protocol (request-response, steady stream of data etc.). Once you decide to "close" (terminate) the connection, all you had to send/receive, you sent and received (otherwise you would not decide to terminate) - so what more do you want? I'm trying to outline what you may want at the time of termination:
1. to know that all data sent in either direction reached the peer
2. if there are any errors (in transmitting the data in process of being sent when you decided to terminate, as well as after that, and in doing the termination itself - which also requires data being sent/received), the application is informed
3. optionally, some applications want to be non-blocking up to and including the termination
Unfortunately TCP doesn't make these features easily available, and the user needs to understand what's under the hood and how the system calls interact with what's under the hood. A key sentence is in the recv manpage:
When a stream socket peer has performed an orderly shutdown, the
return value will be 0 (the traditional "end-of-file" return).
What the manpage means here is that an orderly shutdown is done by one end (A) choosing to call shutdown(SHUT_WR), which causes a FIN packet to be sent to the peer (B), and this packet takes the form of a 0 return code from recv inside B. (Note: the FIN packet, being an implementation aspect, is not mentioned by the manpage). The "EOF", as the manpage calls it, means there will be no more transmission from A to B, but application B can, and should, continue to send what it was in the process of sending, and even send some more, potentially (A is still receiving). When that sending is done (shortly), B should itself call shutdown(SHUT_WR) to close the other half of the duplex. Now app A receives EOF and all transmission has ceased. The two apps are OK to call shutdown(SHUT_RD) to close their sockets for reading and then close() to free system resources associated with the socket (TODO I haven't found clear documentation that says the 2 calls to shutdown(SHUT_RD) are sending the ACKs in the termination sequence FIN --> ACK, FIN --> ACK, but this seems logical).
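The half-close part of this sequence can be tried out with a short sketch. As an assumption made to keep it self-contained, a connected AF_UNIX stream pair from socket.socketpair() stands in for a TCP connection: no FIN packets are involved there, but recv() returning an empty result after the peer's shutdown(SHUT_WR) behaves the same way. The function name orderly_shutdown is made up.

```python
import socket


def orderly_shutdown():
    """Return what each side receives during a two-way orderly shutdown."""
    a, b = socket.socketpair()       # stand-in for a connected TCP socket pair
    seen = []
    a.sendall(b"request")
    a.shutdown(socket.SHUT_WR)       # A: nothing more will flow from A to B
    seen.append(b.recv(4096))        # B first gets the data sent before the EOF
    seen.append(b.recv(4096))        # then the 0-length "end-of-file" read
    b.sendall(b"reply")              # B may still send: A is still receiving
    b.shutdown(socket.SHUT_WR)       # B closes the other half of the duplex
    seen.append(a.recv(4096))        # A gets the reply
    seen.append(a.recv(4096))        # and then its own EOF
    a.close()                        # release the descriptors
    b.close()
    return seen
```

The EOF is only delivered after everything sent before it, which is what lets B conclude that it has received all of A's data.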
Onwards to our aims: for (1) and (2), the application must somehow wait for the shutdown sequence to happen and observe its outcome. Notice how, if we follow the small protocol above, it is clear to both apps that the termination initiator (A) has sent everything to B. This is because B received EOF (and EOF is received only after everything else). A also received EOF, which is issued in reply to its own EOF, so A knows B received everything (there is a caveat here - the termination protocol must have a convention of who initiates the termination, so that both peers do not do so at once). However, the reverse is not true. After B calls shutdown(SHUT_WR), there is nothing coming back at app level to tell B that A received all data sent, plus the FIN (the A->B transmission had ceased!). Correct me if I'm wrong, but I believe at this stage B is in state "LAST_ACK", and when the final ACK arrives (step #4 of the 4-way handshake), the stack concludes the close, but the application is not informed unless it had set SO_LINGER with a long-enough timeout. SO_LINGER "ON" instructs the shutdown call to block (be performed in the foreground), hence the shutdown call itself will do the waiting.
In conclusion what I recommend is to configure SO_LINGER ON with a long timeout, which causes it to block and hence return any errors. What is not entirely clear is whether it is shutdown(SHUT_WR) or shutdown(SHUT_RD) which blocks in expectation of the LAST_ACK, but that is of less importance as we need to call both.
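For reference, turning SO_LINGER on looks roughly like this. The sketch only sets and reads back the option; it does not demonstrate the blocking behaviour itself. The helper names enable_linger and get_linger are hypothetical, and the two-int layout is the Linux struct linger (l_onoff, l_linger).

```python
import socket
import struct

LINGER_FORMAT = "ii"   # struct linger { int l_onoff; int l_linger; }


def enable_linger(sock, seconds):
    """Turn SO_LINGER on with the given timeout in seconds."""
    value = struct.pack(LINGER_FORMAT, 1, seconds)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, value)


def get_linger(sock):
    """Return (enabled, timeout_seconds) as currently set on the socket."""
    size = struct.calcsize(LINGER_FORMAT)
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, size)
    onoff, seconds = struct.unpack(LINGER_FORMAT, raw)
    return bool(onoff), seconds
```

The option has to be set before the call that is supposed to wait, for example right after the connection is established.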
Blocking on shutdown is problematic for requirement #3 above where e.g. you have a single-threaded design that serves all connections. Using SO_LINGER may block all connections on the termination of one of them. I see 3 routes to address the problem:
1. Shutdown with LINGER, from a different thread. This will of course complicate the design.
2. Linger in the background and either:
2A. "Promote" FIN and FIN2 to app-level messages which you can read and hence wait for. This basically moves the problem that TCP was meant to solve one level higher, which I consider hack-ish, also because the ensuing shutdown calls may still end in limbo.
2B. Try to find a lower-level facility such as the SIOCOUTQ ioctl, which queries the number of unACKed bytes in the network stack. The caveats are many: this is Linux-specific, we are not sure whether it applies to FIN ACKs (to know whether closing is fully done), and you'd need to poll it periodically, which is complicated.
Overall I'm leaning towards option 1.
I tried to write a comprehensive summary of the issue, corrections/additions welcome.
TCP sockets are bidirectional - you send and receive over the one socket. close() stops communication in both directions. shutdown() provides another parameter that allows you to specify which direction you might want to stop using.
Another difference (between close() and shutdown(rw)) is that close() will keep the socket open if another process is using it, while shutdown() shuts down the socket irrespective of other processes.
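This difference can be observed without a second process. In the sketch below, dup() creates extra descriptors for the same socket, standing in for descriptors that another process would hold after fork(); that substitution is an assumption made to keep the example self-contained, and the function name is made up.

```python
import socket


def close_vs_shutdown():
    """Return (data after close, send failed after shutdown, final read)."""
    a, b = socket.socketpair()
    other = a.dup()                  # second descriptor for the same socket
    a.close()                        # drops one reference; the socket stays open
    other.sendall(b"still open")
    after_close = b.recv(4096)

    third = other.dup()              # yet another holder of the same socket
    third.shutdown(socket.SHUT_RDWR) # affects the socket itself, for all holders
    try:
        other.sendall(b"too late")
        send_failed = False
    except OSError:                  # typically BrokenPipeError (EPIPE)
        send_failed = True
    final_read = b.recv(4096)        # the peer sees end-of-file
    for s in (other, third, b):
        s.close()
    return after_close, send_failed, final_read
```

After close() on one descriptor the connection keeps working through the other, whereas shutdown() through any descriptor ends it for every holder.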
shutdown() is often used by clients to provide framing - to indicate the end of their request, e.g. an echo service might buffer up what it receives until the client shutdown()s their send side, which tells the server that the client has finished, and the server then replies; the client can receive the reply because it has only shutdown() writing, not reading, through its socket.
close() closes both the sending and the receiving end of the socket. If you want to close only the sending part and not the receiving part, or vice versa, use shutdown().
close(): closes both the sending and the receiving end.
shutdown(): closes only the sending end, only the receiving end, or both, depending on its argument:
SHUT_RD (shut down the reading end, i.e. receiving)
SHUT_WR (shut down the writing end, i.e. sending)
SHUT_RDWR (shut down both)

Synchronize Threads with WINAPI

I would like to synchronize threads with WINAPI calls only but I have no success.
The situation is to LOG activities with time and date as soon as my WNDPROC gets a message.
The problem is that my WNDPROC needs to write to the log, and it will get out of hand since writing to a file takes time. I tried to enter a critical section as soon as WNDPROC starts and leave it as soon as writing to the log is finished, but no luck. How can I make them wait for each other?
Don't wait - queue.
A Windows message is so small that copying the entire message into a producer-consumer queue is a reasonable approach. You could write your own queue class, or you could use the PostThreadMessage() API to copy and queue the received messages to a logger thread:
http://msdn.microsoft.com/en-gb/library/windows/desktop/ms644946%28v=vs.85%29.aspx
The snag with PTM() is that only the message data gets copied and queued up - no time/date. The time/date would have to be added in the logger thread when it gets the message copy. Check your requirements to see if this is acceptable. If not, you will have to use a different 'message' struct that has members for both the Windows message and date/time.
Queueing insulates the UI thread from the, possibly lengthy, disk logging write operation and allows extra flexibility to incorporate lazy-writes and other such optimizations, if required.
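The queueing idea is not specific to WINAPI, so here is a language-neutral sketch of it in Python, using the second variant described above: a record type that carries both the message and the time it was received. All names (LogRecord, QueuedLogger, post, sink) are hypothetical.

```python
import queue
import threading
import time
from dataclasses import dataclass


@dataclass
class LogRecord:
    timestamp: float    # taken when the message is received, not when written
    message_id: int
    detail: str


class QueuedLogger:
    """Hands records to a background thread so the caller never waits on I/O."""

    def __init__(self, sink):
        self._queue = queue.Queue()
        self._sink = sink            # callable doing the slow write
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def post(self, message_id, detail=""):
        # Cheap for the caller: stamp the time, copy the data, return at once.
        self._queue.put(LogRecord(time.time(), message_id, detail))

    def close(self):
        self._queue.put(None)        # sentinel: stop after draining the queue
        self._thread.join()

    def _run(self):
        while True:
            record = self._queue.get()
            if record is None:
                break
            self._sink(record)
```

The window procedure would only call post(); the sink, which does the actual file write, runs on the logger thread and can batch or delay writes without holding up message handling.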

Linux Device Driver - How to unblock reading thread when closing file?

I am attempting to implement a character device driver for Linux and am having trouble. In short, data written to the device is buffered for reading. When no data is available, the call to read blocks via 'wait_event_interruptible'. Data received by the write handler calls 'wake_up_interruptible'. The release handler also calls 'wake_up_interruptible' to unblock the reader but sets a flag to indicate the driver is releasing.
From user space I have an executable that opens the driver via 'open' and then starts another thread. The main thread proceeds to call 'read'. As intended, no data is available for reading and the call blocks. The other thread sleeps for one second (providing sufficient time for the main thread to read and block), calls 'close' and then calls 'close' again. The first call returns '0' while the second returns '-1' (as expected). However, my driver's release handler is never called and I cannot understand how to unblock my reading thread without explicitly sending it a signal or writing some data to the device. My understanding is that when the last handle to the driver closes that its release handler is invoked. I am trying to implement what I believe is standard user space behavior- a thread blocked reading from a file will become unblocked and receive an end-of-file return value when asynchronously closed.
Do I have the correct understanding of read/close at the file level in user space? Do I have the correct device driver understanding? Am I missing something else? I looked through 'Linux Device Drivers 3rd Edition' and couldn't quite find an answer to this question. I have also searched Google but cannot seem to find the answer either. Any help you can provide is appreciated. My kernel version is 3.0.15.
Unfortunately the read syscall keeps a reference on the file itself and not the file descriptor. So closing the file descriptor will not abort the read.
In all cases you must be careful about race conditions between unblocking and closing: you don't want the thread (or another one) to re-enter the syscall in between ;)

Is there any way to emulate epoll_wait with kqueue/kevent?

I have a list of a bunch of file descriptors that I have created kevents for, and I'm trying to figure out if there's any way to get the number of them that are ready for read or write access.
Is there any way to get a list of "ready" file descriptors, like what epoll_wait provides?
Events that occurred are placed into the eventlist buffer passed to the kevent call. So making this buffer sufficiently large will give you the list you are seeking. The return value of the kevent call will tell you how many events are in the eventlist buffer.
If using a large buffer is not feasible for some reason, you can always do a loop calling kevent with a zero timeout and a smaller buffer, until you get zero events in the eventlist.
To give a little more context...
One of the expected scenarios with kevent() is that you will thread pool calls to it. If you had 3 thread pools all asking for 4 events, the OS would like to be able to pool and dispatch the actual events as it sees fit.
If 7 events are available, the OS might want to dispatch to 2 threads, or it might want to dispatch to all 3 threads if it believed it had empty cores and less overhead.
I'm not saying your scenario is invalid at all; just that the system is more or less designed to keep that information away from you so it doesn't get into scenarios of saying 'well, 12 descriptors are ready. Oh, hrm, I just told you that but 3 of them got surfaced before you had a chance to do anything'.
Grrr pretty much nailed the scenario. You register/deregister your descriptors once and the relevant descriptor will be provided back to you with the event when the event triggers.
