Suppose I have this:
A | B | C
How does the pipeline work? Does A produce data only when B requests it? Does A continually produce data and then block if B can't currently accept it? What's C's role? I realized that a system I'm designing is conceptually very similar to these pipelines -- I'd like to draw upon the existing paradigm rather than inventing something novel that only works half as well.
Pipes in Unix have a buffer, so even if the right side process (RSP) does not consume any data, the left side process (LSP) is able to produce a few kilobytes before blocking.
Then, if the buffer gets full, the LSP eventually blocks. When the RSP reads data, it frees part or all of the buffer space and the LSP resumes writing.
If instead of 2 processes you have 3, the situation is more or less the same: a faster producer is blocked by a slower consumer. And obviously, a faster consumer is blocked by a slower producer if the pipe gets empty: just think of an interactive shell, waiting for the slowest producer of all: the user.
For example, take the following command:
$ yes | cat | more
Since more blocks when the screen is full, until the user presses a key, the cat process will fill its output buffer and stall, and then the yes process will fill its buffer and stall too. Everything waits for the user to continue, as it should.
PS: An interesting question is: what happens when the more process ends? Well, the read side of that pipe is closed, so the cat process will get a SIGPIPE signal (if it ever writes to the pipe again, and it will) and die. The same happens to the yes process. All processes die, as they should.
A has a pipe to B, and B has a pipe to C. Each pipe has a buffer; B and C block if they try to read, and there isn't any input available (end-of-stream counts as input). A and B block if they have output to write, but the pipe's buffer is full.
All three processes run concurrently, using as much CPU as they can. The OS blocks them in the read/write system call as necessary if the pipe buffer is exhausted/full respectively.
So, they're driven by both the consumer and the producer; that is, the rate is the minimum of the consuming rate and the producing rate. If the consumer is faster, performance is driven by the producer, and vice versa.
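To make the blocking concrete, here is a minimal C++ sketch (POSIX calls; sizes and sleep intervals are illustrative): the parent writes as fast as it can and the kernel suspends it once the pipe buffer fills, while a deliberately slow child drains it.

    // Parent floods the pipe; after ~64 KiB (typical Linux pipe buffer)
    // its write() calls start blocking until the slow child reads.
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>

    int main() {
        int fds[2];
        if (pipe(fds) != 0) return 1;

        if (fork() == 0) {              // child: slow consumer
            close(fds[1]);
            char buf[1024];
            while (read(fds[0], buf, sizeof buf) > 0)
                usleep(100000);         // simulate a slow reader
            return 0;
        }

        close(fds[0]);                  // parent: fast producer
        char chunk[1024];
        memset(chunk, 'x', sizeof chunk);
        for (int i = 0; i < 256; ++i) { // 256 KiB total: writes will stall
            write(fds[1], chunk, sizeof chunk);  // blocks when buffer is full
            fprintf(stderr, "wrote chunk %d\n", i);
        }
        close(fds[1]);                  // EOF for the child
        wait(nullptr);
    }

Running it, the first few dozen "wrote chunk" lines appear immediately (the buffer absorbs them); after that the parent is visibly rate-limited by the child, which is exactly the min-of-both-rates behavior described above.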
Related
I have a thread that gets new frames, 2 other threads that process the newly acquired image, and 1 that prints the output based on the processing threads.
The program cycle goes,
>thread 3, print an output based on the previous outputs of threads 1 and 2
>thread 0 get new image
>> thread 1, process image for color
>> thread 2, process image for haar cascade
going cyclically 3&0>1&2>3&0>1&2>
'>' indicates join before spawning the next set
How do I pass the OpenCV Mat from thread 0 to threads 1&2?
Also, how would I pass the data from threads 1&2 to thread 3?
I would guess a message queue system; how does one implement that?
I suspect I may get some abuse and down-votes for this because global variables are generally frowned upon; however, I feel the situation is different in image processing, where:
the code is multi-threaded, and
the data structures (images) are large.
Here I think it is important to avoid expensive copying and "transmission" of data down sockets etc. when it is already available in memory that is shared and visible amongst threads.
So, in concrete terms, I would go for an array, or vector (according to your preference) of say 16 OpenCV Mats that is globally accessible.
Acquire into the first, notify the next thread when that buffer is full, then acquire into the next. And so on.
As regards the notification, you have several options. The cleanest and most modern is probably using condition variables, letting each processing thread (not the acquiring thread) wait on a condvar for "buffer full". Next is probably POSIX message queues, though if you are porting to macOS later you may regret that as there is no support. Another, easily programmed method is to use sockets but just send a single byte that is the index into the global array of 16 Mats - that way there is no problem with incomplete, multi-byte reads on sockets. The processing threads then just sit in a loop doing blocking reads on the socket to know which buffer to process. You can also define a special index that means "quit".
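As an illustration of the condition-variable option, here is a minimal C++ sketch assuming a global ring of 16 cv::Mats; all names are illustrative, and for brevity the processors simply grab the most recently filled slot (a fuller version would track per-slot full/free flags).

    #include <opencv2/core.hpp>
    #include <condition_variable>
    #include <mutex>
    #include <array>

    constexpr int kSlots = 16;
    std::array<cv::Mat, kSlots> g_ring;   // shared image buffers
    std::mutex g_mtx;
    std::condition_variable g_cv;
    int g_latest = -1;                    // index of last filled slot, -1 = none

    // Acquiring thread: fill the next slot and notify the processors.
    void publish(const cv::Mat& frame) {
        std::lock_guard<std::mutex> lk(g_mtx);
        int next = (g_latest + 1) % kSlots;
        frame.copyTo(g_ring[next]);       // one copy into shared storage
        g_latest = next;
        g_cv.notify_all();                // wake both processing threads
    }

    // Processing thread (color or haar): wait for a new slot, work in place.
    void process_loop() {
        int seen = -1;
        for (;;) {
            std::unique_lock<std::mutex> lk(g_mtx);
            g_cv.wait(lk, [&] { return g_latest != seen; });
            int idx = g_latest;
            seen = idx;
            lk.unlock();
            cv::Mat& img = g_ring[idx];   // no copy: process the shared buffer
            // ... color / haar-cascade work on img ...
        }
    }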
Check the size of your images in terms of width x height x channels x bytes per channel and get a feel for how much memory a global vector of 16 Mats will need and be sure you have that available before using this strategy - you may be up against the wall with a Raspberry Pi for example.
I've got an application that needs to send a stream of data from one process to multiple readers, each of which needs to see its own copy of the stream. This is reasonably high-rate (100MB/s is not uncommon), so I'd like to avoid duplication if possible. In my ideal world, linux would have named pipes that supported multiple readers, with a fast path for the common single-reader case.
I'd like something that provides some measure of namespace isolation (eg: broadcasting on 127.0.0.1 is open to any process I believe...). Unix domain sockets don't support broadcast, and UDP is "unreliable" anyways (server will drop packets instead of blocking in my case). I supposed I could create a shared-memory segment and store the common buffers there, but that feels like reinventing the wheel. Is there a canonical way to do this in linux?
The short answer: No
The long answer: Yes [and you're on the right track]
I've had to do this before [for even higher speeds], so I had to research this. The following is what I came up with.
In the main process, create a pool of shared buffers [use SysV shm or private mmap as you choose]. Assign ID numbers to them (e.g. 1,2,3,...). Now there is a mapping from bufid to buffer memory address. To make this accessible to child processes, do this before you fork them. The children then inherit the shared memory mappings, so there is not much work to do.
Now fork the children. Give them each a unique process id. You can just incrementally start with a number: 2,3,4,... [main is 1] or just use regular pids.
Open up a SysV msg channel (msgget et al.). Again, if you do this in the main process before the fork, it is available to the children [IIRC].
Now here's how it works:
main finds an unused buffer and fills it. For each child, main sends an IPC message via msgsnd (on the single common IPC channel) where the message payload [mtext] is the bufid number. Each message has the standard header's mtype field set to the destination child's pid.
After doing this, main remembers the buffer as "in flight" and not yet reusable.
Each child does a msgrcv with the mtype set to its pid. It then extracts the bufid from mtext and processes the buffer. When it's done, it sends an IPC message [again on the same channel] with mtype set to main's pid with an mtext of the bufid it just processed.
main's loop does a non-blocking msgrcv, noting all "release" messages for a given bufid. When all children have released the buffer, it is put back on the buffer "free queue". In its service loop, main may fill new buffers and send more messages as appropriate [interspersed with the waits].
The child then does an msgrcv and the cycle repeats.
So, we're using [large] shared memory buffers and short [a few bytes] bufid descriptor IPC messages.
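A compressed C++ sketch of that message flow, assuming a single common queue created with msgget() before the fork (the UPDATE below notes that a separate children-to-master queue is cleaner); pids and bufids are illustrative:

    #include <sys/msg.h>
    #include <sys/ipc.h>
    #include <unistd.h>

    struct bufmsg {
        long mtype;      // destination: child pid (work) or main's pid (release)
        int  bufid;      // index into the shared buffer pool
    };

    // main: hand buffer `bufid` to child `child_pid`
    void send_work(int qid, long child_pid, int bufid) {
        bufmsg m{child_pid, bufid};
        msgsnd(qid, &m, sizeof m.bufid, 0);
    }

    // child: block until a message addressed to us arrives, return the bufid
    int recv_work(int qid) {
        bufmsg m;
        msgrcv(qid, &m, sizeof m.bufid, getpid(), 0);   // mtype == our pid
        return m.bufid;
    }

    // child: release the buffer back to main (mtype = main's id, e.g. 1)
    void release(int qid, long main_pid, int bufid) {
        bufmsg m{main_pid, bufid};
        msgsnd(qid, &m, sizeof m.bufid, 0);
    }

    // main: non-blocking check for released buffers; returns -1 if none pending
    int poll_release(int qid, long main_pid) {
        bufmsg m;
        if (msgrcv(qid, &m, sizeof m.bufid, main_pid, IPC_NOWAIT) == -1)
            return -1;
        return m.bufid;
    }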
Okay, so the question you may be asking: "Why SysV IPC for the comm channel?" [vs. multiple pipes or sockets].
You already know that a shared buffer avoids sending multiple copies of your data.
So, that's the way to go. But, why not send the above bufid messages across sockets or pipes [or shared queues, condition variables, mutexes, etc]?
The answer is speed and the wakeup characteristics of the target process.
For a highly realtime response, when main sends out the bufid messages, you want the target process [if it's been sleeping] to wake up immediately and start processing the buffer.
I examined the linux kernel source and the only mechanism that has that characteristic is SysV IPC. All others have a [scheduling] lag.
When process A does msgsnd on a channel that process B has done msgrcv on, three things will happen:
process B will be marked runnable by the scheduler.
[IIRC] B will be moved to the front of its scheduling queue
Also, more importantly, this then causes an immediate reschedule of all processes.
B will start right away [as opposed to next timer interrupt or when some other process just happens to sleep]. On a single core machine, A will be put to sleep and B will run in its stead.
Caveat: All my research was done a few years back before the CFS scheduler, but, I believe the above should still hold. Also, I was using the RT scheduler, which may be a possible option if CFS doesn't work as intended.
UPDATE:
Looking at the POSIX message queue source, I think that the same immediate-wakeup behavior you discussed with the System V queues is going on, which gives the added benefit of POSIX compatibility.
The timing semantics are possible [and desirable] so I wouldn't be surprised. But, SysV is actually more standard and ubiquitous than POSIX mqueues. And, there are some semantic differences [See below].
For timing, you can build a unit test program [just using msgs] with nsec timestamps. I used TSC stamps, but clock_gettime(CLOCK_REALTIME,...) might also work. Stamp the departure time and the arrival/wakeup time to see. Compare both SysV and mq.
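A minimal sketch of such a probe (one-queue SysV setup, names illustrative): stamp the departure time into the message payload and compare it against the wakeup time on the receiving side.

    #include <sys/msg.h>
    #include <ctime>
    #include <cstdio>

    struct stampmsg {
        long mtype;
        struct timespec sent;    // departure time, filled by the sender
    };

    void send_stamped(int qid) {
        stampmsg m{1, {}};
        clock_gettime(CLOCK_REALTIME, &m.sent);
        msgsnd(qid, &m, sizeof m.sent, 0);
    }

    void recv_and_report(int qid) {
        stampmsg m;
        msgrcv(qid, &m, sizeof m.sent, 1, 0);           // blocks until sent
        struct timespec now;
        clock_gettime(CLOCK_REALTIME, &now);
        long long ns = (now.tv_sec - m.sent.tv_sec) * 1000000000LL
                     + (now.tv_nsec - m.sent.tv_nsec);
        printf("send-to-wakeup latency: %lld ns\n", ns);
    }

The analogous mq version would use mq_send/mq_receive with the same payload, so the two can be compared directly.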
With either SysV or mq you may need to bump up the max # of msgs, max msg size, and max # of queues via /proc/*; the default values are relatively small. If you don't, you may find tasks blocked waiting for a msg while the master can't send one [is itself blocked] because a msg queue maximum parameter has been exceeded. I actually had such a bug, so I changed my code to bump up these values [it was running as root] during startup. So you may need to do this in an RC boot script (or whatever the [atrocious ;-)] systemd equivalent is).
I looked at using mq to replace SysV in my own code. It didn't have the same semantics for a many-to-one return-to-free-pool msg. In my original answer, I had forgotten to mention that two msg queues are needed: master-to-children (e.g. work-to-do) and children-to-master (e.g. returning a now available buffer).
I had several different types of buffers (e.g. compressed video, compressed audio, uncompressed video, uncompressed audio) that had varying types and struct descriptors.
Also, multiple different buffer queues as these buffers got passed from thread to thread [different processing stages].
With SysV you can use a single msg queue for multiple buffer lists/queues, the buffer list ID is the msg mtype. A child msgrcv waits with mtype set to the ID value. The master waits on the return-to-free msg queue with mtype of 0.
mq* requires a separate mqd_t for each ID because it doesn't allow a wait on a msg subtype.
msgrcv allows IPC_NOWAIT on each call, but to get the same effect with mq_receive you have to open the queue with O_NONBLOCK or use the timed version. This gets used during the "shutdown" or "restart" phase (e.g. send a msg to children that no more data will arrive and they should terminate [or reconfigure, etc.]). The IPC_NOWAIT is handy for "draining" a queue during program startup [to get rid of stale messages from a prior invocation] or drain stale messages from a prior configuration during operation.
So, instead of just two SysV msg queues to handle an arbitrary number of buffer lists, you'll need a separate mqd_t for each buffer list/type.
Process B epolls on the pipe (EPOLLIN|EPOLLET).
Process A writes 1KiB in pipe.
Process B wakes up.
Process B reads 1KiB from the pipe.
Process A writes 1KiB in pipe.
Process B epolls on the pipe.
The state of the pipe does not change during epoll, but has changed since the last read. Will process B wake up again?
My understanding from the FAQ (Q9) in http://linux.die.net/man/4/epoll is that you will get another event in step 6 (assuming that you can guarantee that step 5 really happens after step 4 and the pipe is empty after step 4).
Having said that, you might get more events than guaranteed (but you have to be careful only to rely on documented behavior) - see http://cmeerw.org/blog/753.html#753 and http://cmeerw.org/blog/750.html#750
While it's true that the kernel wakes up on step 6, that is not what's documented by the manual page. The use case you provide does not conform to how EPOLLET is supposed to be used.
According to the documentation, step 6 should be "read from the FD". The only time you are supposed to poll the FD is after you tried to read and got EAGAIN.
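A sketch of that documented pattern, assuming pipefd was registered with EPOLLIN|EPOLLET and set to O_NONBLOCK: after each wakeup, read until EAGAIN before calling epoll_wait again.

    #include <sys/epoll.h>
    #include <unistd.h>
    #include <cerrno>

    void drain(int epfd, int pipefd) {
        epoll_event ev;
        for (;;) {
            int n = epoll_wait(epfd, &ev, 1, -1);
            if (n <= 0) continue;
            char buf[4096];
            for (;;) {
                ssize_t r = read(pipefd, buf, sizeof buf);
                if (r > 0) {
                    // ... consume r bytes ...
                } else if (r == -1 && errno == EAGAIN) {
                    break;           // pipe drained: safe to wait again
                } else {
                    return;          // EOF or error
                }
            }
        }
    }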
See also: What is the use case for EPOLLET?
I've found many questions and answers about pipes on Linux, but almost all discuss the reader side.
For a process that shall be ready to deliver data to a named pipe as soon as the data is available and a reading process is connected, is there a way to, in a non-blocking fashion:
wait (poll(2)) for reader to open the pipe,
wait in a loop (again poll(2)) for signal that writing to the pipe will not block, and
when such a signal is received, check how many bytes may be written to the pipe without blocking
I understand how to do (2.), but I wasn't able to find consistent answers for (1.) and (3.).
EDIT: I was looking for (something like) FIONWRITE for pipes, but Linux does not have FIONWRITE (for pipes) (?)
EDIT2: The intended main loop for the writer (kind of pseudo code, target language is C/C++):
forever {
    poll(can_read_command, can_write_to_the_fifo)
    if (can_read_command) {
        read and parse command
        update internal status
        continue
    }
    if (can_write_to_the_fifo) {
        length = min(data_available, space_for_nonblocking_write)
        write(output_fifo, buffer, length)
        update internal status
        continue
    }
}
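For illustration, a hedged C++ sketch of (1.) and (2.): Linux reports ENXIO when opening a FIFO for writing with no reader present, so a retry loop is one common workaround for (1.); for (3.), since Linux lacks FIONWRITE for pipes (as the EDIT notes), a typical fallback is to write at most PIPE_BUF bytes per POLLOUT wakeup and tolerate short writes. All names are illustrative.

    #include <fcntl.h>
    #include <poll.h>
    #include <unistd.h>
    #include <cerrno>
    #include <limits.h>

    int open_fifo_writer(const char* path) {
        for (;;) {
            int fd = open(path, O_WRONLY | O_NONBLOCK);
            if (fd >= 0) return fd;          // a reader has the FIFO open
            if (errno != ENXIO) return -1;   // ENXIO: no reader yet
            usleep(100000);                  // crude wait-for-reader loop
        }
    }

    ssize_t write_some(int fd, const char* buf, size_t len) {
        pollfd p{fd, POLLOUT, 0};
        if (poll(&p, 1, -1) <= 0) return -1; // wait until writable
        size_t chunk = len < PIPE_BUF ? len : PIPE_BUF;
        ssize_t n = write(fd, buf, chunk);   // may still be short or EAGAIN
        return (n == -1 && errno == EAGAIN) ? 0 : n;
    }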
I'm considering a multi-threaded architecture for a processing pipeline. My main processing module has an input queue, from which it receives data packets. It then performs transformations on these packets (decryption, etc.) and places them into an output queue.
The threading comes in where many input packets can have their contents transformed independently from one another.
However, the punchline is that the output queue must have the same ordering as the input queue (i.e., the first pulled off the input queue must be the first pushed onto the output queue, regardless of whether its transformations finished first.)
Naturally, there will be some kind of synchronisation at the output queue, so my question is: what would be the best way of ensuring that this ordering is maintained?
Have a single thread read the input queue, post a placeholder on the output queue, and then hand the item over to a worker thread to process. When the data is ready the worker thread updates the placeholder. When the thread that needs the value from the output queue reads the placeholder it can then block until the associated data is ready.
Because only a single thread reads the input queue, and this thread immediately puts the placeholder on the output queue, the order in the output queue is the same as that in the input. The worker threads can be numerous, and can do the transformations in any order.
On platforms that support futures, they are ideal as the placeholder. On other systems you can use an event, monitor or condition variable.
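In C++, the same placeholder idea can be sketched with std::future (here std::async stands in for the worker pool, and locking is minimal for brevity):

    #include <future>
    #include <mutex>
    #include <queue>

    struct Packet { /* ... */ };
    Packet transform(Packet p) { /* decrypt etc. */ return p; }

    std::queue<std::future<Packet>> output_queue;  // order fixed at enqueue time
    std::mutex q_mtx;

    // Single dispatcher thread: reserve the output slot, let a worker run.
    void dispatch(Packet p) {
        auto fut = std::async(std::launch::async,
                              [p] { return transform(p); });
        std::lock_guard<std::mutex> lk(q_mtx);
        output_queue.push(std::move(fut));         // placeholder keeps ordering
    }

    // Consumer: pop the oldest placeholder; get() blocks until it is ready.
    Packet next_output() {
        std::unique_lock<std::mutex> lk(q_mtx);
        std::future<Packet> f = std::move(output_queue.front());
        output_queue.pop();
        lk.unlock();
        return f.get();
    }

Because the future is enqueued before the worker starts, output order matches input order no matter which transformation finishes first.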
With the following assumptions:
there should be one input queue, one output queue and one working queue;
there should be only one input queue listener;
each output message should contain a wait handle and a pointer to the worker/output data;
there may be an arbitrary number of worker threads.
I would consider the following flow:
Input queue listener does these steps:
extracts input message;
creates output message:
initializes worker data struct
resets the wait handle
enqueues the pointer to the output message into the working queue
enqueues the pointer to the output message into the output queue
Worker thread does the following:
waits on the working queue to extract a pointer to an output message from it;
processes the message based on the given data and sets the event when done.
The consumer does the following:
waits on the output queue to extract a pointer to an output message from it;
waits on the wait handle until the output data is ready;
does something with the data.
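A compact C++ sketch of the wait handle in this flow (names illustrative): a condition variable plus a done flag per output message.

    #include <condition_variable>
    #include <mutex>

    struct OutMsg {
        std::mutex m;
        std::condition_variable ready;   // the wait handle
        bool done = false;
        /* worker/output data ... */

        void set() {                     // worker: mark the data ready
            std::lock_guard<std::mutex> lk(m);
            done = true;
            ready.notify_one();
        }
        void wait() {                    // consumer: block until data is ready
            std::unique_lock<std::mutex> lk(m);
            ready.wait(lk, [this] { return done; });
        }
    };

The listener would allocate one such message per input item (e.g. via std::shared_ptr), enqueue a pointer to it into both the working and output queues, and the ordering then follows exactly as described.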
That's going to be implementation-specific. One general solution is to number the input items and preserve the numbering so you can later sort the output items. This could be done once the output queue is filled, or it could be done as part of filling it. In other words, you could insert them into their proper position and only allow the queue to be read when the next available item is sequential.
edit
I'm going to sketch out a basic scheme, trying to keep it simple by using the appropriate primitives:
Instead of queueing a Packet into the input queue, we create a future value around it and enqueue that into both the input and output queues. In C#, you could write it like this:
var future = new Lazy<Packet>(delegate() { return Process(packet); }, LazyThreadSafetyMode.ExecutionAndPublication);
A thread from the pool of workers dequeues a future from the input queue and executes future.Value, which causes the delegate to run just in time and returns once the delegate is done processing the packet.
One or more consumers dequeues a future from the output queue. Whenever they need the value of the packet, they call future.Value, which returns immediately if a worker thread has already called the delegate.
Simple, but works.
If you are using a windowed approach (known number of elements), use an array for the output queue; for example, if it is media streaming and you discard packets which haven't been processed quickly enough.
Otherwise, use a priority queue (special kind of heap, often implemented based on a fixed size array) for the output items.
You need to add to each data packet a sequence number, or any datum on which you can sort the items. A priority queue is a tree-like structure which maintains the ordering of items on insert/pop.
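A C++ sketch of that resequencing step (synchronisation omitted for brevity; names illustrative): a min-heap on the sequence number that only releases the next expected item.

    #include <queue>
    #include <vector>

    struct Item {
        unsigned long seq;
        /* payload ... */
    };
    struct BySeq {
        bool operator()(const Item& a, const Item& b) const {
            return a.seq > b.seq;        // smallest seq on top
        }
    };

    std::priority_queue<Item, std::vector<Item>, BySeq> pending;
    unsigned long next_seq = 0;

    // Called as workers finish, in any order.
    void put(Item it) { pending.push(std::move(it)); }

    // Fills `out` only when the next item in sequence is ready.
    bool try_get(Item& out) {
        if (pending.empty() || pending.top().seq != next_seq) return false;
        out = pending.top();
        pending.pop();
        ++next_seq;
        return true;
    }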