I've got an application that needs to send a stream of data from one process to multiple readers, each of which needs to see its own copy of the stream. This is reasonably high-rate (100 MB/s is not uncommon), so I'd like to avoid duplication if possible. In my ideal world, Linux would have named pipes that supported multiple readers, with a fast path for the common single-reader case.
I'd like something that provides some measure of namespace isolation (e.g. broadcasting on 127.0.0.1 is open to any process, I believe). Unix domain sockets don't support broadcast, and UDP is "unreliable" anyway (the server will drop packets instead of blocking in my case). I suppose I could create a shared-memory segment and store the common buffers there, but that feels like reinventing the wheel. Is there a canonical way to do this in Linux?
The short answer: No
The long answer: Yes [and you're on the right track]
I've had to do this before [for even higher speeds], so I had to research this. The following is what I came up with.
In the main process, create a pool of shared buffers [use SysV shm or an anonymous shared mmap (MAP_SHARED | MAP_ANONYMOUS), as you choose; a MAP_PRIVATE mapping will not work, because after fork each process would see its own copy-on-write copy]. Assign ID numbers to them (e.g. 1,2,3,...). Now there is a mapping from bufid to buffer memory address. To make this accessible to child processes, do this before you fork them. The children inherit the shared memory mappings, so this takes very little work.
Now fork the children. Give them each a unique process id. You can just incrementally start with a number: 2,3,4,... [main is 1] or just use regular pids.
Open up a SysV msg channel (msgget et al.). Again, if you do this in the main process before the fork, it is available to the children [IIRC].
Now here's how it works:
main finds an unused buffer and fills it. For each child, main sends an IPC message via msgsnd (on the single common IPC channel) where the message payload [mtext] is the bufid number. Each message has the standard header's mtype field set to the destination child's pid.
After doing this, main remembers the buffer as "in flight" and not yet reusable.
Each child does a msgrcv with the mtype set to its pid. It then extracts the bufid from mtext and processes the buffer. When it's done, it sends an IPC message [again on the same channel] with mtype set to main's pid with an mtext of the bufid it just processed.
main's loop does a non-blocking msgrcv, noting all "release" messages for a given bufid. When all children have released the buffer, it's put back on the buffer "free queue". In main's service loop, it may fill new buffers and send more messages as appropriate [interspersed with the waits].
The child then does an msgrcv and the cycle repeats.
So, we're using [large] shared memory buffers and short [a few bytes] bufid descriptor IPC messages.
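The steps above can be sketched roughly as follows. This is only a minimal illustration, not the code I actually ran: the pool size, buffer size, child IDs and the `-1` "quit" bufid are all made up for the example, it uses an anonymous shared mmap rather than SysV shm, and main simply blocks until every child has released a buffer before filling the next one instead of keeping a free queue.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes MAP_ANONYMOUS under strict -std modes */
#endif
#include <assert.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/msg.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sizes and IDs, chosen only for this sketch. */
enum { NBUF = 4, BUFSZ = 4096, NCHILD = 2, MAIN_ID = 1 };

struct bufmsg {
    long mtype;   /* destination: MAIN_ID, or a child ID (2, 3, ...) */
    int  bufid;   /* payload: index into the pool, or -1 for "quit" */
};

static int send_bufid(int q, long dest, int bufid)
{
    struct bufmsg m = { .mtype = dest, .bufid = bufid };
    return msgsnd(q, &m, sizeof m.bufid, 0);
}

static void child_loop(int q, long my_id, char (*pool)[BUFSZ])
{
    struct bufmsg m;
    for (;;) {
        /* Sleep until main addresses a message to this child. */
        if (msgrcv(q, &m, sizeof m.bufid, my_id, 0) < 0) _exit(1);
        if (m.bufid < 0) _exit(0);
        /* "Process" the buffer in place: check the fill byte main wrote. */
        if (pool[m.bufid][0] != 'A' + m.bufid) _exit(2);
        send_bufid(q, MAIN_ID, m.bufid);   /* release it back to main */
    }
}

/* Returns the number of release messages main received, or -1 on error. */
int run_broadcast_demo(void)
{
    char (*pool)[BUFSZ] = mmap(NULL, (size_t)NBUF * BUFSZ, PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (pool == MAP_FAILED) return -1;
    int q = msgget(IPC_PRIVATE, 0600);
    if (q < 0) { munmap(pool, (size_t)NBUF * BUFSZ); return -1; }

    pid_t pids[NCHILD];
    int nchild = 0, released = 0, ok = 1;
    while (nchild < NCHILD && ok) {
        pid_t pid = fork();
        if (pid == 0) child_loop(q, MAIN_ID + 1 + nchild, pool); /* never returns */
        if (pid < 0) ok = 0; else pids[nchild++] = pid;
    }
    for (int bufid = 0; bufid < NBUF && ok; bufid++) {
        memset(pool[bufid], 'A' + bufid, BUFSZ);          /* fill the buffer once */
        for (int c = 0; c < NCHILD; c++)                  /* one tiny msg per child */
            if (send_bufid(q, MAIN_ID + 1 + c, bufid) < 0) ok = 0;
        for (int pending = NCHILD; pending > 0 && ok; pending--) {  /* in flight */
            struct bufmsg m;
            if (msgrcv(q, &m, sizeof m.bufid, MAIN_ID, 0) < 0) { ok = 0; break; }
            assert(m.bufid == bufid);
            released++;
        }
    }
    for (int c = 0; c < nchild && ok; c++)
        send_bufid(q, MAIN_ID + 1 + c, -1);               /* tell children to quit */
    if (!ok) msgctl(q, IPC_RMID, NULL);   /* wakes blocked children with an error */
    for (int c = 0; c < nchild; c++) waitpid(pids[c], NULL, 0);
    if (ok) msgctl(q, IPC_RMID, NULL);
    munmap(pool, (size_t)NBUF * BUFSZ);
    return ok ? released : -1;
}
```

Calling `run_broadcast_demo()` from a `main` should report one release per child per buffer. The payload data is written once into the shared pool; only the small bufid messages are duplicated per child.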
Okay, so the question you may be asking: "Why SysV IPC for the comm channel?" [vs. multiple pipes or sockets].
You already know that a shared buffer avoids sending multiple copies of your data.
So, that's the way to go. But, why not send the above bufid messages across sockets or pipes [or shared queues, condition variables, mutexes, etc]?
The answer is speed and the wakeup characteristics of the target process.
For a highly realtime response, when main sends out the bufid messages, you want the target process [if it's been sleeping] to wake up immediately and start processing the buffer.
I examined the linux kernel source and the only mechanism that has that characteristic is SysV IPC. All others have a [scheduling] lag.
When process A does msgsnd on a channel that process B has done msgrcv on, three things will happen:
process B will be marked runnable by the scheduler.
[IIRC] B will be moved to the front of its scheduling queue
Also, more importantly, this then causes an immediate reschedule of all processes.
B will start right away [as opposed to next timer interrupt or when some other process just happens to sleep]. On a single core machine, A will be put to sleep and B will run in its stead.
Caveat: All my research was done a few years back before the CFS scheduler, but, I believe the above should still hold. Also, I was using the RT scheduler, which may be a possible option if CFS doesn't work as intended.
UPDATE:
Looking at the POSIX message queue source, I think that the same immediate-wakeup behavior you discussed with the System V queues is going on, which gives the added benefit of POSIX compatibility.
The timing semantics are possible [and desirable] so I wouldn't be surprised. But, SysV is actually more standard and ubiquitous than POSIX mqueues. And, there are some semantic differences [See below].
For timing, you can build a unit test program [just using msgs] with nsec timestamps. I used TSC stamps, but clock_gettime(CLOCK_REALTIME,...) might also work. Stamp the departure time and the arrival/wakeup time to see the lag. Compare both SysV and mq.
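A timing test along those lines might look like the sketch below. It is an assumption-laden illustration, not my original test program: it covers only the SysV side, takes a single sample, uses CLOCK_MONOTONIC instead of CLOCK_REALTIME (so clock adjustments cannot skew the result), and the mtype values 1 and 2 and the 10 ms settle delay are arbitrary choices.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes clock_gettime/usleep under strict -std modes */
#endif
#include <assert.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

struct stampmsg {
    long mtype;
    char mtext[sizeof(long long)];   /* carries one nanosecond timestamp */
};

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Returns the send-to-wakeup latency in nanoseconds, or -1 on error. */
long long measure_wakeup_ns(void)
{
    int q = msgget(IPC_PRIVATE, 0600);
    if (q < 0) return -1;
    pid_t pid = fork();
    if (pid < 0) { msgctl(q, IPC_RMID, NULL); return -1; }

    struct stampmsg m;
    long long t, delta = -1;
    if (pid == 0) {                               /* receiver */
        if (msgrcv(q, &m, sizeof m.mtext, 1, 0) < 0) _exit(1);
        t = now_ns();                             /* arrival/wakeup stamp */
        memcpy(&delta, m.mtext, sizeof delta);    /* departure stamp from sender */
        delta = t - delta;
        m.mtype = 2;                              /* report the result back */
        memcpy(m.mtext, &delta, sizeof delta);
        msgsnd(q, &m, sizeof m.mtext, 0);
        _exit(0);
    }
    usleep(10000);              /* give the receiver time to block in msgrcv */
    t = now_ns();                                 /* departure stamp */
    m.mtype = 1;
    memcpy(m.mtext, &t, sizeof t);
    if (msgsnd(q, &m, sizeof m.mtext, 0) == 0 &&
        msgrcv(q, &m, sizeof m.mtext, 2, 0) >= 0)
        memcpy(&delta, m.mtext, sizeof delta);
    msgctl(q, IPC_RMID, NULL);  /* also unblocks the child if the send failed */
    waitpid(pid, NULL, 0);
    return delta;
}
```

The departure stamp is taken just before msgsnd, so the number includes the cost of the send itself. For a meaningful comparison you would run this many times, look at the distribution rather than one sample, and write the equivalent test for mq.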
With either SysV or mq you may need to bump up the max # of msgs, max msg size, and max # of queues via /proc/sys/kernel/msg* (SysV) or /proc/sys/fs/mqueue/* (mq). The default values are relatively small. If you don't, you may find tasks blocked waiting for a msg while the master can't send one [it is blocked too] because a msg queue maximum has been exceeded. I actually had such a bug, so I changed my code to bump up these values [it was running as root] during startup. Otherwise, you may need to do this in an RC boot script (or whatever the [atrocious ;-)] systemd equivalent is).
I looked at using mq to replace SysV in my own code. It didn't have the same semantics for a many-to-one return-to-free-pool msg. In my original answer, I had forgotten to mention that two msg queues are needed: master-to-children (e.g. work-to-do) and children-to-master (e.g. returning a now available buffer).
I had several different types of buffers (e.g. compressed video, compressed audio, uncompressed video, uncompressed audio) that had varying types and struct descriptors.
Also, multiple different buffer queues as these buffers got passed from thread to thread [different processing stages].
With SysV you can use a single msg queue for multiple buffer lists/queues, the buffer list ID is the msg mtype. A child msgrcv waits with mtype set to the ID value. The master waits on the return-to-free msg queue with mtype of 0.
mq* requires a separate mqd_t for each ID because it doesn't allow a wait on a msg subtype.
msgrcv allows IPC_NOWAIT on each call, but to get the same effect with mq_receive you have to open the queue with O_NONBLOCK or use the timed version. This gets used during the "shutdown" or "restart" phase (e.g. send a msg to children that no more data will arrive and they should terminate [or reconfigure, etc.]). The IPC_NOWAIT is handy for "draining" a queue during program startup [to get rid of stale messages from a prior invocation] or drain stale messages from a prior configuration during operation.
So, instead of just two SysV msg queues to handle an arbitrary number of buffer lists, you'll need a separate mqd_t for each buffer list/type.
Say I have a swap chain consisting of n images and I allow k "frames in flight". I ensure correct synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a set of semaphores imageAvailableSemaphore and renderFinishedSemaphore and a fence imageInFlight like it is done in this tutorial:
imageAvailableSemaphores.resize(MAX_FRAMES_IN_FLIGHT);
renderFinishedSemaphores.resize(MAX_FRAMES_IN_FLIGHT);
inFlightFences.resize(MAX_FRAMES_IN_FLIGHT);
The fences are needed to ensure that we don't use the semaphores again before the GPU has completed consuming the corresponding image. So, this fence needs to be specified in vkQueueSubmit.
On the other hand, I'm creating command buffers independently of the "frames in flight". They are "one-time submit" command buffers. Hence, once submitted, I add them to a "to-be-deleted" list. I need to know when the GPU has finished execution of the command buffers in this list.
But I cannot specify another fence in vkQueueSubmit. How can I solve this problem?
I allow k "frames in flight"
Well, that's your answer. Each thread that is going to contribute command buffers for a "frame" should have some multiple of k command buffers. They should use them in a ring-buffer fashion. These command buffers should be created from a transient-allocating pool. When they pick the least-recently used CB from the ring buffer, they should reset it before recording into it.
You ensure that no thread tries to reset a CB that is still in use by not starting any work for the next frame until the kth frame in the past has completed (using a fence).
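The bookkeeping for that ring can be sketched without any Vulkan calls. This is a simplified model under assumptions of my own: one command buffer per thread per frame, k = 2, frame numbers starting at 1, and hypothetical helper names. The Vulkan calls it stands in for are only mentioned in comments.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical configuration: K frames in flight, one CB per frame. */
enum { K = 2, CBS_PER_FRAME = 1, RING = K * CBS_PER_FRAME };

struct cb_ring {
    uint64_t last_used[RING];  /* frame that last recorded into each slot, 0 = never */
};

/* Frame whose fence must have signaled before work on `frame` may start
 * (the kth frame in the past), or 0 if there is nothing to wait for yet. */
uint64_t frame_to_wait_for(uint64_t frame)
{
    return frame > K ? frame - K : 0;
}

/* Picks the least-recently-used slot for `frame`. `completed` is the highest
 * frame number whose fence is known to have signaled. */
int acquire_slot(struct cb_ring *r, uint64_t frame, uint64_t completed)
{
    int slot = (int)((frame - 1) % RING);
    /* The slot's previous user must have finished on the GPU before a reset. */
    assert(r->last_used[slot] <= completed);
    /* Real code would call vkResetCommandBuffer here, then start recording. */
    r->last_used[slot] = frame;
    return slot;
}

/* Simulates `frames` frames and returns the slot used by the last one. */
int simulate_frames(uint64_t frames)
{
    struct cb_ring r = { {0} };
    uint64_t completed = 0;
    int slot = -1;
    for (uint64_t f = 1; f <= frames; f++) {
        uint64_t w = frame_to_wait_for(f);
        if (w > completed) completed = w;   /* stands in for vkWaitForFences */
        slot = acquire_slot(&r, f, completed);
    }
    return slot;
}
```

Because every thread waits on the same "frame minus k" fence before touching its ring, no per-command-buffer fence is needed.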
If for some reason you absolutely cannot tell your threads what k is up front, you're still going to have to tell them something. When you start work on the thread, you need to tell them how many frames are still in flight. This allows them to check the size of their ring buffer against this number of frames. If the ring buffer holds more elements than there are frames in flight, then the oldest CB in the ring buffer is not in use. Otherwise, the thread will have to allocate a new CB from the pool and shove that into the ring buffer.
You can use Timeline Semaphores for this. You can read in depth about them here: https://www.khronos.org/blog/vulkan-timeline-semaphores
Instead of being signaled or not signaled, timeline semaphores carry a specific value. The function of interest to you is vkGetSemaphoreCounterValue, which allows you to read the value of the semaphore without blocking.
To create a timeline semaphore you simply set the pNext value of your VkSemaphoreCreateInfo to a VkSemaphoreTypeCreateInfo looking like the following.
VkSemaphoreTypeCreateInfo timelineCreateInfo{VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO};
timelineCreateInfo.semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE;
timelineCreateInfo.initialValue = 0;
The pNext value of your VkSubmitInfo needs to be set to a VkTimelineSemaphoreSubmitInfo.
VkTimelineSemaphoreSubmitInfo timelineInfo{VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO};
timelineInfo.signalSemaphoreValueCount = 1;
timelineInfo.pSignalSemaphoreValues = &signalValue;
After the command buffer is done the value of the semaphore will be whatever value you set signalValue to. After that you can query for the value with:
uint64_t value;
vkGetSemaphoreCounterValue(device, semaphore, &value);
So, assuming you set signalValue to 1, value here will be either 0 (the initial value we gave the semaphore in VkSemaphoreTypeCreateInfo) while the command buffer is still executing, or 1 once it has finished. Once it reads 1, you can safely delete your one-time command buffer.
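For the "to-be-deleted" list from the question, one way to use this is to record, next to each submitted command buffer, the value its submit will signal, and to sweep the list with the counter value you just read. The sketch below shows only that bookkeeping. The struct, the function name and the integer stand-in for the command buffer handle are hypothetical, and it assumes every submit signals a strictly larger value than the one before.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pending_cb {
    uint64_t signal_value;  /* value the submit signals on the timeline semaphore */
    int      cb;            /* stand-in for a VkCommandBuffer handle */
};

/* Compacts `list` in place, keeping only entries whose submission has not
 * completed yet. `completed` is the value read with vkGetSemaphoreCounterValue.
 * Returns the number of entries that remain. */
size_t reclaim_finished(struct pending_cb *list, size_t n, uint64_t completed)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (list[i].signal_value <= completed) {
            /* Real code would call vkFreeCommandBuffers for list[i].cb here. */
            continue;
        }
        list[kept++] = list[i];
    }
    return kept;
}
```

With three pending entries signaling 1, 2 and 3 and a counter value of 2, the first two are reclaimed and one entry remains.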
Note: Timeline semaphores are meant as a partial replacement for fences and binary semaphores and should be the main synchronization primitive you use. As far as I know, the only functions that still require binary semaphores are the swapchain ones, vkAcquireNextImageKHR and vkQueuePresentKHR.
I have a thread that gets new frames, 2 other threads that process the newly gotten image and 1 that prints the output based on the processing threads.
The program cycle goes,
>thread 3, print an output based on the previous outputs of the thread 0 and 1
>thread 0 get new image
>> thread 1, process image for color
>> thread 2, process image for haar cascade
going cyclically 3&0>1&2>3&0>1&2>
'>' indicates join before spawning the next set
How do I pass the opencv Mat between the threads 0 to 1&2?
Also how would I pass the data from threads 1&2 to thread 0?
I would guess a message queue system, how does one implement that?
I suspect I may get some abuse and down-votes for this because global variables are generally frowned upon, however I feel the situation is different in image-processing where:
the code is multi-threaded, and
the data structures (images) are large.
Here I think it is important to avoid expensive copying and "transmission" of data down sockets etc. when it is already available in memory that is shared and visible amongst threads.
So, in concrete terms, I would go for an array, or vector (according to your preference) of say 16 OpenCV Mats that is globally accessible.
Acquire into the first, notify the next thread when that buffer is full, then acquire into the next. And so on.
As regards the notification, you have several options. The cleanest, and most modern is probably using condition variables letting each processing thread (not the acquiring thread) wait on a condvar for "buffer full". Next is probably POSIX message queues, though if you are porting to macOS later you may regret that as there is no support. Another, easily programmed method is to use sockets but just send a single byte that is the index into the global array of 16 Mats - that way there is no problem with incomplete, multi-byte reads on sockets. The processing threads then just sit in a loop doing blocking reads on the socket to know which buffer to process. You can also define a special index that means "quit".
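The single-byte socket variant could look roughly like this. It is a sketch in plain C with no OpenCV involved: the function names and the value 255 for "quit" are assumptions, and the demo runs producer and consumer in one thread purely to show the protocol.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* QUIT_INDEX is a hypothetical sentinel outside the valid range 0..15. */
enum { NUM_BUFFERS = 16, QUIT_INDEX = 255 };

/* Producer side: announce that buffer `index` is full. Returns 0 on success. */
int notify_buffer(int fd, unsigned char index)
{
    return write(fd, &index, 1) == 1 ? 0 : -1;
}

/* Consumer side: block until an index arrives. Returns it, or -1 on error/EOF. */
int wait_for_buffer(int fd)
{
    unsigned char index;
    return read(fd, &index, 1) == 1 ? index : -1;
}

/* Single-threaded walk-through: returns how many buffers the consumer handled. */
int notify_demo(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return -1;
    for (unsigned char i = 0; i < 3; i++)
        notify_buffer(sv[0], i);           /* buffers 0, 1, 2 are full */
    notify_buffer(sv[0], QUIT_INDEX);      /* then ask the consumer to stop */

    int handled = 0, index;
    while ((index = wait_for_buffer(sv[1])) >= 0 && index != QUIT_INDEX) {
        assert(index < NUM_BUFFERS);       /* real code would process Mat `index` */
        handled++;
    }
    close(sv[0]);
    close(sv[1]);
    return handled;
}
```

In the real program each processing thread would own the reading end of its own socketpair and call `wait_for_buffer` in a loop, while the acquiring thread calls `notify_buffer` once per processing thread.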
Check the size of your images in terms of width x height x channels x bytes per channel and get a feel for how much memory a global vector of 16 Mats will need and be sure you have that available before using this strategy - you may be up against the wall with a Raspberry Pi for example.
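As a quick worked example of that arithmetic, here is a small helper. The function name is made up, and the estimate ignores Mat headers and any row padding.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for `count` images of the given geometry (pixel data only). */
size_t pool_bytes(size_t width, size_t height, size_t channels,
                  size_t bytes_per_channel, size_t count)
{
    return width * height * channels * bytes_per_channel * count;
}
```

For 16 buffers of 1920 x 1080, 3 channels, 1 byte per channel, `pool_bytes(1920, 1080, 3, 1, 16)` gives 99,532,800 bytes, which is about 95 MiB.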
To my current understanding, after calling MPI_Send, the calling thread should block until the variable is received, so my code below shouldn't work. However, I tried sending several variables in a row and receiving them gradually while doing operations on them and this still worked... See below. Can someone clarify step by step what is going on here?
matlab code: (because I am using a matlab mex wrapper for MPI functions)
%send
if mpirank==0
    %arguments to MPI_Send_variable are (variable, destination, tag)
    MPI_Send_variable(x,0,'A_22') %thread 0 should block here!
    MPI_Send_variable(y,0,'A_12')
    MPI_Send_variable(z,1,'A_11')
    MPI_Send_variable(w,1,'A_21')
end
%receive
if mpirank==0
    %arguments to MPI_Recv_variable are (source, tag)
    a=MPI_Recv_variable(0,'A_12')*MPI_Recv_variable(0,'A_22');
end
if mpirank==1
    c=MPI_Recv_variable(0,'A_21')*MPI_Recv_variable(0,'A_22');
end
MPI_SEND is a blocking call only in the sense that it blocks until it is safe for the user to use the buffer provided to it. The important text to read here is in Section 3.4:
The send call described in Section 3.2.1 uses the standard communication mode. In this mode, it is up to MPI to decide whether outgoing messages will be buffered. MPI may buffer outgoing messages. In such a case, the send call may complete before a matching receive is invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages, for performance reasons. In this case, the send call will not complete until a matching receive has been posted, and the data has been moved to the receiver.
I highlighted the part that you're running up against in bold there. If your message is sufficiently small (and there are sufficiently few of them), MPI will copy your send buffers to an internal buffer and keep track of things internally until the message has been received remotely. There's no guarantee that when MPI_SEND is done, the message has been received.
On the other hand, if you do want to know that the message was actually received, you can use MPI_SSEND. That function will synchronize (hence the extra S) both sides before allowing them to return from the MPI_SSEND and the matching receive call on the other end.
In a correct MPI program, you cannot do a blocking send to yourself without first posting a nonblocking receive. So a correct version of your program would look something like this:
Irecv(..., &req1);
Irecv(..., &req2);
Send(... to self ...);
Send(.... to self ...);
Wait(&req1, ...);
/* do work */
Wait(&req2, ...);
/* do more work */
Your code is technically incorrect, but the reason it is working correctly is because the MPI implementation is using internal buffers to buffer your send data before it is transmitted to the receiver (or matched to the later receive operation in the case of self sends). An MPI implementation is not required to have such buffers (generally called "eager buffers"), but most implementations do.
Since the data you are sending is small, the eager buffers are generally sufficient to buffer them temporarily. If you send large enough data, the MPI implementation will not have enough eager buffer space and your program will deadlock. Try sending, for example, 10 MB instead of a double in your program to notice the deadlock.
I assume that there is just an MPI_Send() behind MPI_Send_variable() and an MPI_Recv() behind MPI_Recv_variable().
How can a process ever receive a message that it sent to itself if both the send and receive operations are blocking? Either the send to self or the receive from self is non-blocking, or you will get a deadlock and sending to self is effectively forbidden.
Following the answer of #Greginozemtsev to "Is the behavior of MPI communication of a rank with itself well-defined?", the MPI standard states that sending to self and receiving from self are allowed. I guess this implies that it is non-blocking in this particular case.
In MPI 3.0, in section 3.2.4 Blocking Receive here, page 59, the words have not changed since MPI 1.1 :
Source = destination is allowed, that is, a process can send a message to itself.
(However, it is unsafe to do so with the blocking send
and receive operations described above, since this may lead to deadlock.
See Section 3.5.)
I read section 3.5, but it's not clear enough for me...
I guess that the parentheses are there to tell us that talking to oneself is not good practice, at least for MPI communications!
Suppose I have this:
A | B | C
How does the pipeline work? Does A produce data only when B requests it? Does A continually produce data and then block if B can't currently accept it? What's C's role? I realized that a system I'm designing is conceptually very similar to these pipelines -- I'd like to draw upon the existing paradigm rather than inventing something novel that only works half as well.
Pipes in Unix have a buffer, so even if the right side process (RSP) does not consume any data, the left side process (LSP) is able to produce a few kilobytes before blocking.
Then, if the buffer gets full, the LSP is eventually blocked. When the RSP reads data it frees part or all of the buffer space and the LSP resumes the operation.
If instead of 2 processes you have 3, the situation is more or less the same: a faster producer is blocked by a slower consumer. And obviously, a faster consumer is blocked by a slower producer if the pipe gets empty: just think of an interactive shell, waiting for the slowest producer of all: the user.
For example the following command:
$ yes | cat | more
Since more blocks when the screen is full, until the user presses a key, the cat process will fill its output buffer and stall, then the yes process will fill its buffer and also stall. Everything waiting for the user to continue, as it should be.
PS: An interesting question is what happens when the more process ends. The right side of that pipe is closed, so the cat process will get a SIGPIPE signal (if it ever writes to the pipe again, and it will) and will die. The same will then happen to the yes process. All processes die, as it should be.
A has a pipe to B, and B has a pipe to C. Each pipe has a buffer; B and C block if they try to read, and there isn't any input available (end-of-stream counts as input). A and B block if they have output to write, but the pipe's buffer is full.
All three processes run concurrently, using as much CPU as they can. The OS blocks them in the read/write system call as necessary if the pipe buffer is empty/full respectively.
So, they're driven by both the consumer and the producer; that is, the overall rate is the minimum of the consuming rate and the producing rate. If the consumer is faster, the performance is driven by the producer, and vice versa.
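To see how much a producer can write before it would block, you can fill a pipe that nobody reads from. The sketch below does that with a non-blocking write end; the function name and the 512-byte chunk size are arbitrary choices for the example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Returns how many bytes fit into an unread pipe, or -1 on error. */
long pipe_capacity(void)
{
    int fd[2];
    if (pipe(fd) < 0) return -1;

    /* Make the write end non-blocking so a full buffer yields EAGAIN
     * instead of putting this process to sleep. */
    int flags = fcntl(fd[1], F_GETFL);
    if (flags < 0 || fcntl(fd[1], F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    char chunk[512];
    memset(chunk, 'x', sizeof chunk);
    long total = 0;
    for (;;) {
        ssize_t n = write(fd[1], chunk, sizeof chunk);
        if (n < 0) break;          /* buffer full (or a real error) */
        total += n;
    }
    int full = (errno == EAGAIN || errno == EWOULDBLOCK);
    close(fd[0]);
    close(fd[1]);
    return full ? total : -1;
}
```

On current Linux kernels the default is typically 64 KiB, but the exact figure depends on the system. With a blocking write end, the point where this loop stops is exactly where the producer would be put to sleep until the consumer reads.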
I can find many examples regarding wait_queue_head.
It works like a signal: you create a wait_queue_head, and someone
can sleep on it until someone else wakes it up.
But I cannot find a good example of using wait_queue itself, which is supposedly closely related to it.
Could someone give an example, or explain what goes on under the hood?
From Linux Device Drivers:
The wait_queue_head_t type is a fairly simple structure, defined in
<linux/wait.h>. It contains only a lock variable and a linked list
of sleeping processes. The individual data items in the list are of
type wait_queue_t, and the list is the generic list defined in
<linux/list.h>.
Normally the wait_queue_t structures are allocated on the stack by
functions like interruptible_sleep_on; the structures end up in the
stack because they are simply declared as automatic variables in the
relevant functions. In general, the programmer need not deal with
them.
Take a look at the "A Deeper Look at Wait Queues" part:
Some advanced applications, however, can require dealing with
wait_queue_t variables directly. For these, it's worth a quick look at
what actually goes on inside a function like interruptible_sleep_on.
The following is a simplified version of the implementation of
interruptible_sleep_on to put a process to sleep:
void simplified_sleep_on(wait_queue_head_t *queue)
{
    wait_queue_t wait;
    init_waitqueue_entry(&wait, current);
    current->state = TASK_INTERRUPTIBLE;
    add_wait_queue(queue, &wait);
    schedule();
    remove_wait_queue (queue, &wait);
}
The code here creates a new wait_queue_t variable (wait, which gets
allocated on the stack) and initializes it. The state of the task is
set to TASK_INTERRUPTIBLE, meaning that it is in an interruptible
sleep. The wait queue entry is then added to the queue (the
wait_queue_head_t * argument). Then schedule is called, which
relinquishes the processor to somebody else. schedule returns only
when somebody else has woken up the process and set its state to
TASK_RUNNING. At that point, the wait queue entry is removed from the
queue, and the sleep is done.
The internals of the data structures involved in wait queues:
Update:
for the users who think the image is my own - here is one more time the link to the Linux Device Drivers where the image is taken from
A wait queue is simply a list of processes and a lock.
wait_queue_head_t represents the queue as a whole. It is the head of the waiting queue.
wait_queue_t represents the item of the list - a single process waiting in the queue.