epoll - is EPOLLET prone to race conditions? - linux

1. Process B epolls on the pipe (EPOLLIN|EPOLLET).
2. Process A writes 1KiB to the pipe.
3. Process B wakes up.
4. Process B reads 1KiB from the pipe.
5. Process A writes 1KiB to the pipe.
6. Process B epolls on the pipe.
The state of the pipe does not change during epoll, but has changed since the last read. Will process B wake up again?

My understanding from the FAQ (Q9) in http://linux.die.net/man/4/epoll is that you will get another event in step 6 (assuming that you can guarantee that step 5 really happens after step 4 and the pipe is empty after step 4).
Having said that, you might get more events than guaranteed (but you have to be careful only to rely on documented behavior) - see http://cmeerw.org/blog/753.html#753 and http://cmeerw.org/blog/750.html#750
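For what it's worth, the sequence in the question can be replayed inside a single process. This is a minimal Python sketch, assuming Linux (select.epoll does not exist elsewhere). The function name edge_triggered_demo is made up for illustration, and one process plays both A and B so that the ordering of the steps is guaranteed.

```python
import os
import select

def edge_triggered_demo():
    """Replay the six steps on one pipe; return the events from both epoll waits."""
    r, w = os.pipe()
    os.set_blocking(r, False)            # edge-triggered fds are normally non-blocking
    ep = select.epoll()
    ep.register(r, select.EPOLLIN | select.EPOLLET)   # step 1: B watches the pipe
    try:
        os.write(w, b"a" * 1024)         # step 2: A writes 1 KiB
        first = ep.poll(1.0)             # step 3: B wakes up
        os.read(r, 1024)                 # step 4: B reads the 1 KiB, pipe is now empty
        os.write(w, b"b" * 1024)         # step 5: A writes again before B polls
        second = ep.poll(1.0)            # step 6: B polls again
        return first, second
    finally:
        ep.close()
        os.close(r)
        os.close(w)
```

On Linux both waits should report one EPOLLIN event for the read end, since the second write lands on an empty pipe and so counts as a new edge.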

While it's true that the kernel wakes you up in step 6, that is not what the manual page documents. The use case you describe does not conform to how EPOLLET is supposed to be used.
According to the documentation, step 6 should be "read from the FD". The only time you are supposed to poll on the FD is after you tried to read and got EAGAIN.
See also: What is the use case for EPOLLET?
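The "read until EAGAIN, then poll" rule can be sketched as follows. This is a Python illustration rather than anything from the man page. The helper name drain and its chunk parameter are hypothetical, and the fd is assumed to be non-blocking already. In Python, EAGAIN surfaces as BlockingIOError.

```python
import os

def drain(fd, chunk=4096):
    """Read a non-blocking fd until it would block (EAGAIN) or reaches EOF.

    Returns (data, eof). Only go back to epoll_wait when eof is False.
    """
    data = bytearray()
    while True:
        try:
            piece = os.read(fd, chunk)
        except BlockingIOError:          # EAGAIN: nothing left, safe to poll again
            return bytes(data), False
        if not piece:                    # empty read: the writer closed its end
            return bytes(data), True
        data += piece
```

With edge-triggered epoll, each wakeup would be followed by one call to drain before waiting again, so no readable data is left behind without a pending event.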

Related

Canonical way to "broadcast" data to multiple processes in linux?

I've got an application that needs to send a stream of data from one process to multiple readers, each of which needs to see its own copy of the stream. This is reasonably high-rate (100MB/s is not uncommon), so I'd like to avoid duplication if possible. In my ideal world, linux would have named pipes that supported multiple readers, with a fast path for the common single-reader case.
I'd like something that provides some measure of namespace isolation (e.g. broadcasting on 127.0.0.1 is open to any process, I believe...). Unix domain sockets don't support broadcast, and UDP is "unreliable" anyway (the server will drop packets instead of blocking in my case). I suppose I could create a shared-memory segment and store the common buffers there, but that feels like reinventing the wheel. Is there a canonical way to do this in linux?
"I suppose I could create a shared-memory segment and store the common buffers there, but that feels like reinventing the wheel. Is there a canonical way to do this in linux?"
The short answer: No
The long answer: Yes [and you're on the right track]
I've had to do this before [for even higher speeds], so I had to research this. The following is what I came up with.
In the main process, create a pool of shared buffers [use SysV shm or a shared anonymous mmap (MAP_SHARED | MAP_ANONYMOUS), as you choose; a MAP_PRIVATE mapping would become copy-on-write after fork and not be shared]. Assign ID numbers to them (e.g. 1,2,3,...). Now there is a mapping from bufid to buffer memory address. To make this accessible to the child processes, do this before you fork them. The children inherit the shared memory mappings, so this takes little extra work.
Now fork the children. Give them each a unique process id. You can just incrementally start with a number: 2,3,4,... [main is 1] or just use regular pids.
Open up a SysV msg channel (msgget et al.). Again, if you do this in the main process before the fork, they are available to the children [IIRC].
Now here's how it works:
main finds an unused buffer and fills it. For each child, main sends an IPC message via msgsnd (on the single common IPC channel) where the message payload [mtext] is the bufid number. Each message has the standard header's mtype field set to the destination child's pid.
After doing this, main remembers the buffer as "in flight" and not yet reusable.
Each child does a msgrcv with the mtype set to its pid. It then extracts the bufid from mtext and processes the buffer. When it's done, it sends an IPC message [again on the same channel] with mtype set to main's pid with an mtext of the bufid it just processed.
main's loop does a non-blocking msgrcv, noting all "release" messages for a given bufid. When all children have released the buffer, it's put back on the buffer "free queue". In main's service loop, it may fill new buffers and send more messages as appropriate [intersperse with the waits].
The child then does an msgrcv and the cycle repeats.
So, we're using [large] shared memory buffers and short [a few bytes] bufid descriptor IPC messages.
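The "in flight" / "free queue" bookkeeping that main does can be sketched on its own. This is a minimal in-process Python illustration, not the original code. The class BufferPool and its method names are hypothetical, and the shared memory and the msgsnd/msgrcv transport are deliberately left out (Python's standard library has no SysV message queue binding).

```python
class BufferPool:
    """Tracks which bufids are free and which readers still hold each in-flight one."""

    def __init__(self, num_buffers, reader_ids):
        self.free = list(range(1, num_buffers + 1))   # bufids 1, 2, 3, ...
        self.readers = frozenset(reader_ids)          # e.g. the children's pids
        self.in_flight = {}                           # bufid -> readers yet to release

    def acquire(self):
        """Take a free bufid and mark it in flight for every reader (None if no buffer is free)."""
        if not self.free:
            return None
        bufid = self.free.pop(0)
        self.in_flight[bufid] = set(self.readers)
        return bufid

    def release(self, bufid, reader_id):
        """Record one reader's release message; True once the buffer is reusable."""
        pending = self.in_flight[bufid]
        pending.discard(reader_id)
        if pending:
            return False                 # some reader is still using the buffer
        del self.in_flight[bufid]
        self.free.append(bufid)          # back on the free queue
        return True
```

In the scheme described above, main would call acquire() before filling a buffer and sending its bufid to each child, and call release() for every "done" message it receives.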
Okay, so the question you may be asking: "Why SysV IPC for the comm channel?" [vs. multiple pipes or sockets].
You already know that a shared buffer avoids sending multiple copies of your data.
So, that's the way to go. But, why not send the above bufid messages across sockets or pipes [or shared queues, condition variables, mutexes, etc]?
The answer is speed and the wakeup characteristics of the target process.
For a highly realtime response, when main sends out the bufid messages, you want the target process [if it's been sleeping] to wake up immediately and start processing the buffer.
I examined the linux kernel source and the only mechanism that has that characteristic is SysV IPC. All others have a [scheduling] lag.
When process A does msgsnd on a channel that process B has done msgrcv on, three things will happen:
process B will be marked runnable by the scheduler.
[IIRC] B will be moved to the front of its scheduling queue
Also, more importantly, this then causes an immediate reschedule of all processes.
B will start right away [as opposed to next timer interrupt or when some other process just happens to sleep]. On a single core machine, A will be put to sleep and B will run in its stead.
Caveat: All my research was done a few years back before the CFS scheduler, but, I believe the above should still hold. Also, I was using the RT scheduler, which may be a possible option if CFS doesn't work as intended.
UPDATE:
Looking at the POSIX message queue source, I think that the same immediate-wakeup behavior you discussed with the System V queues is going on, which gives the added benefit of POSIX compatibility.
The timing semantics are possible [and desirable] so I wouldn't be surprised. But, SysV is actually more standard and ubiquitous than POSIX mqueues. And, there are some semantic differences [See below].
For timing, you can build a unit test program [just using msgs] with nsec timestamps. I used TSC stamps, but clock_gettime(CLOCK_REALTIME,...) might also work. Stamp the departure time and the arrival/wakeup time to see the latency. Compare both SysV and mq.
With either SysV or mq you may need to bump up the max # of msgs, max msg size, max # of queues via /proc/*. The default values are relatively small. If you don't, you may find tasks blocked waiting for a msg but master can't send one [is blocked] due to a msg queue maximum parameter being exceeded. I actually had such a bug, so I changed my code to bump up these values [it was running as root] during startup. So, you may need to do this as an RC boot script (or whatever the [atrocious ;-)] systemd equivalent is)
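Before raising anything, it can help to see what the current limits are. The sketch below is in Python and assumes that the /proc entries meant here are the usual Linux sysctl files under /proc/sys/kernel (SysV) and /proc/sys/fs/mqueue (POSIX mq). The answer above does not name them, so treat the paths as my assumption. The name read_queue_limits is hypothetical.

```python
from pathlib import Path

# Assumed locations of the message-queue limits on Linux.
LIMIT_FILES = {
    "sysv_max_msg_size": "/proc/sys/kernel/msgmax",
    "sysv_max_queue_bytes": "/proc/sys/kernel/msgmnb",
    "sysv_max_queues": "/proc/sys/kernel/msgmni",
    "mq_max_msgs": "/proc/sys/fs/mqueue/msg_max",
    "mq_max_msg_size": "/proc/sys/fs/mqueue/msgsize_max",
    "mq_max_queues": "/proc/sys/fs/mqueue/queues_max",
}

def read_queue_limits():
    """Return each limit as an int, or None where the file is missing or unreadable."""
    limits = {}
    for name, path in LIMIT_FILES.items():
        try:
            limits[name] = int(Path(path).read_text().split()[0])
        except (OSError, ValueError, IndexError):
            limits[name] = None
    return limits
```

Writing new values to the same files (as root) is how the limits would be raised at startup.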
I looked at using mq to replace SysV in my own code. It didn't have the same semantics for a many-to-one return-to-free-pool msg. In my original answer, I had forgotten to mention that two msg queues are needed: master-to-children (e.g. work-to-do) and children-to-master (e.g. returning a now available buffer).
I had several different types of buffers (e.g. compressed video, compressed audio, uncompressed video, uncompressed audio) that had varying types and struct descriptors.
Also, multiple different buffer queues as these buffers got passed from thread to thread [different processing stages].
With SysV you can use a single msg queue for multiple buffer lists/queues, the buffer list ID is the msg mtype. A child msgrcv waits with mtype set to the ID value. The master waits on the return-to-free msg queue with mtype of 0.
mq* requires a separate mqd_t for each ID because it doesn't allow a wait on a msg subtype.
msgrcv allows IPC_NOWAIT on each call, but to get the same effect with mq_receive you have to open the queue with O_NONBLOCK or use the timed version. This gets used during the "shutdown" or "restart" phase (e.g. send a msg to children that no more data will arrive and they should terminate [or reconfigure, etc.]). The IPC_NOWAIT is handy for "draining" a queue during program startup [to get rid of stale messages from a prior invocation] or drain stale messages from a prior configuration during operation.
So, instead of just two SysV msg queues to handle an arbitrary number of buffer lists, you'll need a separate mqd_t for each buffer list/type.

Are Unix/Linux pipes producer or consumer driven?

Suppose I have this:
A | B | C
How does the pipeline work? Does A produce data only when B requests it? Does A continually produce data and then block if B can't currently accept it? What's C's role? I realized that a system I'm designing is conceptually very similar to these pipelines -- I'd like to draw upon the existing paradigm rather than inventing something novel that only works half as well.
Pipes in Unix have a buffer, so even if the right side process (RSP) does not consume any data, the left side process (LSP) is able to produce a few kilobytes before blocking.
Then, if the buffer gets full, the LSP is eventually blocked. When the RSP reads data it frees part or all of the buffer space and the LSP resumes the operation.
If instead of 2 processes you have 3, the situation is more or less the same: a faster producer is blocked by a slower consumer. And obviously, a faster consumer is blocked by a slower producer if the pipe gets empty: just think of an interactive shell, waiting for the slowest producer of all: the user.
For example the following command:
$ yes | cat | more
Since more blocks when the screen is full, until the user presses a key, the cat process will fill its output buffer and stall, then the yes process will fill its buffer and also stall. Everything waits for the user to continue, as it should.
PS: An interesting question is what happens when the more process ends. The right side of that pipe is closed, so the cat process will get a SIGPIPE signal (if it ever writes to the pipe again, and it will) and will die. The same will happen to the yes process. All processes die, as they should.
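The "reader has gone away" case is easy to reproduce. One caveat for this Python sketch: the Python interpreter ignores SIGPIPE by default, so instead of being killed the writer sees the underlying EPIPE error as a BrokenPipeError. A C program with default signal handling would get SIGPIPE as described above. The function name is made up for illustration.

```python
import errno
import os

def write_after_reader_closed():
    """Write to a pipe whose read end is closed; return the errno that results."""
    r, w = os.pipe()
    os.close(r)                          # the right side of the pipe goes away
    try:
        os.write(w, b"y\n")              # what yes/cat would do next
    except BrokenPipeError as exc:       # EPIPE, because SIGPIPE is ignored here
        return exc.errno
    finally:
        os.close(w)
    return None
```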
A has a pipe to B, and B has a pipe to C. Each pipe has a buffer; B and C block if they try to read, and there isn't any input available (end-of-stream counts as input). A and B block if they have output to write, but the pipe's buffer is full.
All three processes run concurrently, using as much CPU as they can. The OS blocks them in the read/write system call as necessary if the pipe buffer is exhausted/full respectively.
So, they're driven by both the consumer and the producer; that is, the rate is the minimum of the consuming rate and the producing rate. If the consumer is faster, the performance is driven by the producer, and vice versa.
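The buffer-full behaviour can be observed directly. In this Python sketch the write end is made non-blocking, so "would block" shows up as BlockingIOError instead of actually stalling the process. The names fill_pipe and backpressure_demo are hypothetical, and the capacity you get depends on the system.

```python
import os

def fill_pipe(w, chunk=4096):
    """Write to a non-blocking pipe until it is full; return the bytes accepted."""
    total = 0
    while True:
        try:
            total += os.write(w, b"x" * chunk)
        except BlockingIOError:          # buffer full: a blocking writer would stall here
            return total

def backpressure_demo():
    """Fill a pipe, let the consumer drain it, then show the producer can resume."""
    r, w = os.pipe()
    os.set_blocking(w, False)
    try:
        capacity = fill_pipe(w)          # producer runs ahead until the buffer is full
        remaining = capacity
        while remaining:                 # consumer reads everything back out
            remaining -= len(os.read(r, remaining))
        refilled = fill_pipe(w)          # space was freed, so writes succeed again
        return capacity, refilled
    finally:
        os.close(r)
        os.close(w)
```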

perl: thread termination blocked by user input

I have written a program that can terminate in two ways: either the user enters a string, say "kill", or a specific thread sends SIGINT.
In the terminator thread I have this statement (to catch "kill"):
$a = <>;
followed by a 'return;'
I also have an appropriate signal handler (for INT) at the top, which does:
print "signal received\n";
threads->exit();
But in the case of automatic termination (that is, SIGINT is sent from another thread), the print statement's output doesn't appear until I press a key, no matter how long I wait. I suppose <> is blocking.
Could someone please tell me how I can provide some sort of input to <> in the automatic termination case, so that I see the results immediately?
Thanks.
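You can't do what you're trying to do, the way you're trying to do it. While the thread is blocked reading from <>, the handler does not get to run until the read returns. If the signal is sent with threads->kill, it is emulated by Perl rather than delivered by the OS, and the threads documentation says it is only acted on after the current operation completes, so it will not interrupt a blocking I/O call. (This is not the kernel's uninterruptible wait state; a read from a terminal is normally an interruptible sleep.)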
To do what you're trying to do, you would probably need to make use of something like IO::Select and the can_read function. You can test which filehandles are ready for IO, in a polling loop - this polling loop is interruptible by kill signals.
Alternatively, instead of using a filehandle read, you can use Term::ReadKey, which will allow you to trap a keypress in a nonblocking fashion.
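The polling-loop idea looks roughly like this. The sketch is in Python rather than Perl, with select.select standing in for IO::Select's can_read and a threading.Event standing in for the termination signal. The function name read_or_stop and the 0.1 second interval are arbitrary choices for illustration.

```python
import os
import select
import threading

def read_or_stop(fd, stop, interval=0.1):
    """Return bytes read from fd, or None if `stop` is set before input arrives."""
    while not stop.is_set():
        # Wait at most `interval` seconds, so the stop flag is rechecked regularly.
        ready, _, _ = select.select([fd], [], [], interval)
        if ready:
            return os.read(fd, 4096)
    return None
```

The reading thread never sits in a blocking read, so another thread can end it promptly by calling stop.set().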

flock locking order?

I'm using a simple test script from
http://www.tuxradar.com/practicalphp/8/11/0
like this:
<?php
$fp = fopen("foo.txt", "w");
if (flock($fp, LOCK_EX)) {
    print "Got lock!\n";
    sleep(10);
    flock($fp, LOCK_UN);
}
I opened 5 shells and executed the script in each, one after the other.
Each script blocks until the lock is freed and then continues once it is released.
I'm not really interested in the PHP side, but my question is:
Does anyone know the order in which flock() is acquired?
e.g.
t0: process 1 locks
t1: process 2 try_lock < blocking
t2: process 3 try_lock < blocking
t3: process 1 releases lock
t4: ?? which process gets the lock?
Is there a simple deterministic order, like a queue, or does the kernel 'just' pick one by "more advanced rules"?
If there are multiple processes waiting for an exclusive lock, it's not specified which one succeeds in acquiring it first. Don't rely on any particular ordering.
Having said that, the current kernel code wakes them in the order they blocked. This comment is in fs/locks.c:
/* Insert waiter into blocker's block list.
* We use a circular list so that processes can be easily woken up in
* the order they blocked. The documentation doesn't require this but
* it seems like the reasonable thing to do.
*/
If you want to have a set of processes run in order, don't use flock(). Use SysV semaphores (semget() / semop()).
Create a semaphore set that contains one semaphore for each process after the first, and initialise them all to 0 (SysV semaphore values cannot be negative). Every process after the first does a semop() on its own semaphore with a sem_op value of -1 - this will block it until the semaphore's value is at least 1. After the first process is complete, it should do a semop() on the second process's semaphore with a sem_op value of 1 - this will wake the second process. After the second process is complete, it should do a semop() on the third process's semaphore with a sem_op value of 1, and so on.
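The chain-of-semaphores idea can be tried out without any SysV calls. In this Python sketch, threads stand in for the processes and threading.Semaphore stands in for the SysV semaphores, so it shows the ordering logic only, not the semget()/semop() API. It assumes each gate starts at 0, the waiter decrements it and the predecessor increments it. The name run_in_order is hypothetical.

```python
import threading

def run_in_order(num_workers):
    """Run workers 0..n-1 strictly in order, one gate semaphore per worker."""
    # gates[0] is never waited on: the first worker runs immediately.
    gates = [threading.Semaphore(0) for _ in range(num_workers)]
    order = []

    def worker(i):
        if i > 0:
            gates[i].acquire()           # block until the predecessor is done
        order.append(i)                  # the "work" of this worker
        if i + 1 < num_workers:
            gates[i + 1].release()       # wake the next worker in line

    # Start the threads in reverse to show that start order does not matter.
    threads = [threading.Thread(target=worker, args=(i,))
               for i in reversed(range(num_workers))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```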

When does the write() system call write all of the requested buffer versus just doing a partial write?

If I am counting on my write() system call to write, say, 100 bytes, I always put that write() call in a loop that checks whether the length that gets returned is what I expected to send and, if not, bumps the buffer pointer and decreases the length by the amount that was written.
So once again I just did this, but now that there's StackOverflow, I can ask you all if people know when my writes will write ALL that I ask for versus give me back a partial write?
Additional comments: X-Istence's reply reminded me that I should have noted that the file descriptor was blocking (i.e., not non-blocking). I think he is suggesting that the only way a write() on a blocking file descriptor will not write all the specified data is when the write() is interrupted by a signal. This seems to make at least intuitive sense to me...
write() may return a partial write, especially on sockets or when internal buffers are full. So a good way to handle it is the following:
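/* Needs <unistd.h> and <errno.h>; fd, buf (char *) and size (size_t) come from the caller. */
while (size > 0) {
    ssize_t res = write(fd, buf, size);
    if (res < 0 && errno == EINTR)
        continue;               /* interrupted by a signal: retry */
    if (res < 0) {
        /* real error processing */
        break;
    }
    size -= res;                /* partial write: skip what was already sent */
    buf += res;
}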
Never rely on what usually happens...
Note: in the case of a full disk you would get ENOSPC, not a partial write.
You need to check errno to see if your call got interrupted, or why write() returned early, and why it only wrote a certain number of bytes.
From man 2 write
When using non-blocking I/O on objects such as sockets that are subject to flow control, write() and writev() may write fewer bytes than requested; the return value must be noted, and the remainder of the operation should be retried when possible.
Basically, unless you are writing to a non-blocking socket, the only other time this will happen is if you get interrupted by a signal.
[EINTR] A signal interrupted the write before it could be completed.
See the Errors section in the man page for more information on what can be returned, and when it will be returned. From there you need to figure out if the error is severe enough to log an error and quit, or if you can continue the operation at hand!
This is all discussed in the book Advanced Unix Programming by Marc J. Rochkind. I have written countless programs with the help of this book, and would suggest it to anyone programming for a UNIX-like OS.
Writes shouldn't have any reason to ever write a partial buffer afaik. Possible reasons I can think of for a partial write are running out of disk space, writing past the end of a block device, or writing to a char device / some other sort of device.
However, the plan to retry writes blindly is probably not such a good one - check errno to see whether you should be retrying first.
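A partial write is easy to provoke on a non-blocking pipe. This Python sketch shows one such short write, plus a retry loop for blocking descriptors. The names write_all and partial_write_demo are made up, and the 2 MiB size is just assumed to be larger than the pipe's buffer. Since Python 3.5, os.write retries EINTR by itself, so the loop only has to deal with short counts.

```python
import os

def write_all(fd, data):
    """Call write() on a blocking fd until every byte of `data` has been accepted."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)     # may accept fewer bytes than offered
        view = view[written:]            # advance past the part already written

def partial_write_demo(size=2 * 1024 * 1024):
    """Return how many of `size` bytes a single non-blocking pipe write accepts."""
    r, w = os.pipe()
    os.set_blocking(w, False)
    try:
        return os.write(w, bytes(size))  # the pipe buffer fills first: short count
    finally:
        os.close(r)
        os.close(w)
```

On a typical Linux system partial_write_demo() returns the pipe's buffer size rather than the full 2 MiB. That is the situation the man page excerpt above describes for non-blocking I/O.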

Resources