What is the point of using a copy of an exported fence or semaphore in Vulkan?

I am trying to synchronize access to external memory across two processes with Vulkan on Android. My implementation reports that it only supports
VK_EXTERNAL_FENCE_HANDLE_TYPE_SYNC_FD_BIT
which can be exported as a copy, not as a reference.
This is what I am doing in process A (exporting the fd):
CALL_VK(vkQueueSubmit(queue, 1, &submit_info, fence));
int fd;
VkFenceGetFdInfoKHR getFdInfo{};
getFdInfo.sType = VK_STRUCTURE_TYPE_FENCE_GET_FD_INFO_KHR;
getFdInfo.handleType = VK_EXTERNAL_FENCE_HANDLE_TYPE_SYNC_FD_BIT;
getFdInfo.fence = fence;
CALL_VK(vkGetFenceFdKHR(device.device_, &getFdInfo, &fd));
and then this in process B (importing a copy of the payload from the fd):
VkImportFenceFdInfoKHR importFenceFdInfo{};
importFenceFdInfo.sType = VK_STRUCTURE_TYPE_IMPORT_FENCE_FD_INFO_KHR;
importFenceFdInfo.handleType = VK_EXTERNAL_FENCE_HANDLE_TYPE_SYNC_FD_BIT;
importFenceFdInfo.fd = fd;
importFenceFdInfo.fence = fence;
importFenceFdInfo.flags = VK_FENCE_IMPORT_TEMPORARY_BIT;
CALL_VK(vkImportFenceFdKHR(device.device_, &importFenceFdInfo));
The problem is that if the fence I get back in process B is still unsignalled (it was pending after vkQueueSubmit), then, since this is a copy and not a reference, it will never become signalled, and I will never know when the GPU work has finished. I get a freeze in the following call in process B:
VkResult result = vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
So what is the point of all this?
I was expecting to see the state of the fence updated in process B once it is signalled by the GPU.

When you perform a fence import operation, that operation has a "transference" associated with it. Reference transference means that the fence you imported into is able to directly see the state of the fence payload at all times. If it waits on the fence to be signaled, then it will be woken up when that happens.
Copy transference means that it sees the state of the fence payload only at the moment the transfer happens. Any subsequent changes in the fence's payload are not available.
If it is your intent to have some process wait until a fence in a different process has completed, it cannot do this by performing a copy transference only once. It basically has to poll the other process through importing the fence, with each import seeing the payload state at the time of the polling.
And of course, you need to build in some system for the other process to know that the listening process has seen the signaling of the fence. Otherwise, the signaling process might reset the fence before the listening process has seen the signal.
Note that if an implementation restricts you to temporary, copy importation only, that represents a hardware limitation: polling (along with listener protection) is the best the system allows, so Vulkan forces you to do it manually.
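To make the polling idea concrete, here is a rough sketch of what process B's loop could look like. This is not a drop-in solution: requestFreshSyncFd() is a hypothetical helper that asks process A to re-export the fence and send a new sync fd over some IPC channel, and device/fence are the objects from the question.
VkResult status = VK_NOT_READY;
while (status == VK_NOT_READY) {
    int fd = requestFreshSyncFd();   // hypothetical; blocks until process A responds

    VkImportFenceFdInfoKHR importInfo{};
    importInfo.sType = VK_STRUCTURE_TYPE_IMPORT_FENCE_FD_INFO_KHR;
    importInfo.handleType = VK_EXTERNAL_FENCE_HANDLE_TYPE_SYNC_FD_BIT;
    importInfo.fence = fence;
    importInfo.fd = fd;              // ownership of the fd passes to the implementation
    importInfo.flags = VK_FENCE_IMPORT_TEMPORARY_BIT;
    vkImportFenceFdKHR(device, &importInfo);

    status = vkGetFenceStatus(device, fence);   // reflects the payload copied at import time
    // if still VK_NOT_READY, back off briefly before asking process A again
}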

It turns out my problem was in how I transferred the fd to the other process. I was simply using a piece of shared memory where I stored the fd and read it back. But this is incorrect: the fd has to be installed into the receiving process's table of open files, and it seems the only "approved" way of doing that is to send it in a socket message with an SCM_RIGHTS control message.
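For reference, a minimal sketch of passing an fd over a Unix domain socket with SCM_RIGHTS; sock is assumed to be an already-connected AF_UNIX socket, and error handling is reduced to the bare minimum.
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

// send one fd; at least one byte of ordinary payload must accompany the control message
int send_fd(int sock, int fd) {
    char data = 0;
    struct iovec iov = { &data, 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    memset(ctrl, 0, sizeof(ctrl));

    struct msghdr msg;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

// receive one fd; the kernel installs a new descriptor in this process's fd table
int recv_fd(int sock) {
    char data;
    struct iovec iov = { &data, 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    memset(ctrl, 0, sizeof(ctrl));

    struct msghdr msg;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);

    if (recvmsg(sock, &msg, 0) < 0)
        return -1;

    int fd = -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS)
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}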

Related

In Vulkan (or any other modern graphics API), should fences be waited on per queue submission or per frame?

I am trying to set up my renderer so that rendering always renders into a texture, and then I present whichever texture I like as long as its format is swapchain compatible. This means I need to deal with one graphics queue (I don't have compute yet) that renders the scene, UI, etc.; one transfer queue that copies the rendered image into the swapchain; and one present queue for presenting the swapchain. This is the use-case I am trying to tackle at the moment, but I will have more use-cases like this (e.g. compute queues) as my renderer matures.
Here is pseudocode for what I am trying to achieve. I added some of my own assumptions here as well:
// wait for fences per frame
waitForFences(fences[currentFrame]);
resetFences(fences[currentFrame]);
// 1. Rendering (queue = Graphics)
commandBuffer.begin();
renderEverything();
commandBuffer.end();
QueueSubmitInfo renderSubmit{};
renderSubmit.commandBuffer = commandBuffer;
// Nothing to wait for
renderSubmit.waitSemaphores = nullptr;
// Signal that rendering is complete
renderSubmit.signalSemaphores = { renderSemaphores[currentFrame] };
// Do not signal the fence yet
queueSubmit(renderSubmit, nullptr);
// 2. Transferring to swapchain (queue = Transfer)
// acquire the image that we want to copy into
// and signal that it is available
swapchain.acquireNextImage(imageAvailableSemaphore[currentFrame]);
commandBuffer.begin();
copyTexture(textureToPresent, swapchain.getAvailableImage());
commandBuffer.end();
QueueSubmitInfo transferSubmit{};
transferSubmit.commandBuffer = commandBuffer;
// Wait for swapchain image to be available
// and rendering to be complete
transferSubmit.waitSemaphores = { renderSemaphores[currentFrame], imageAvailableSemaphore[currentFrame] };
// Signal another semaphore that swapchain
// is ready to be used
transferSubmit.signalSemaphores = { readyForPresenting[currentFrame] };
// Now, signal the fence since this is the end of frame
queueSubmit(transferSubmit, fences[currentFrame]);
// 3. Presenting (queue = Present)
PresentQueueSubmitInfo presentSubmit{};
// Wait until the swapchain is ready to be presented
// Basically, waits until the image is copied to swapchain
presentSubmit.waitSemaphores = { readyForPresenting[currentFrame] };
presentQueueSubmit(presentSubmit);
My understanding is that fences are needed to make sure that the CPU waits until the GPU has finished executing the previously submitted command buffers.
When dealing with multiple queues, is it enough to make the CPU wait only for the frame and synchronize different queues with semaphores (pseudocode above is based on this)? Or should each queue wait for a fence separately?
To get into technical details, what will happen if two command buffers are submitted to the same queue without any semaphores? Pseudocode:
// first submissions
commandBufferOne.begin();
doSomething();
commandBufferOne.end();
SubmitInfo firstSubmit{};
firstSubmit.commandBuffer = commandBufferOne;
queueSubmit(firstSubmit, nullptr);
// second submission
commandBufferTwo.begin();
doSomethingElse();
commandBufferTwo.end();
SubmitInfo secondSubmit{};
secondSubmit.commandBuffer = commandBufferTwo;
queueSubmit(secondSubmit, nullptr);
Will the second submission overwrite the first one, or will the first submission be executed before the second one (FIFO order) since it was submitted first?
This entire organizational scheme seems dubious.
Even ignoring the fact that the Vulkan specification does not require GPUs to offer separate queues for all of these things, you're spreading a series of operations across asynchronous execution, despite the fact that these operations are inherently sequential. You cannot copy from an image to the swapchain until the image has been rendered, and you cannot present the swapchain image until the copy has completed.
So there is basically no advantage to putting these things into their own queues. Just do all of them on the same queue (with one submit and one vkQueuePresentKHR), using appropriate execution and memory dependencies between the operations. This means there's only one thing to wait on: the single submission.
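For illustration, here is a rough sketch of what a single-queue, single-submit frame could look like. It is not a complete implementation; the names (cmd, queue, swapchain, renderedImage, imageAvailable, renderFinished, frameFence) are assumed to already exist and are only illustrative, and the swapchain-image layout transitions are elided.
// One queue, one submit, one present per frame (sketch; error handling omitted).
vkWaitForFences(device, 1, &frameFence, VK_TRUE, UINT64_MAX);
vkResetFences(device, 1, &frameFence);

uint32_t imageIndex;
vkAcquireNextImageKHR(device, swapchain, UINT64_MAX, imageAvailable, VK_NULL_HANDLE, &imageIndex);

VkCommandBufferBeginInfo begin{VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO};
vkBeginCommandBuffer(cmd, &begin);

// ... render the scene into renderedImage ...

// Execution + memory dependency: rendering must finish before the copy reads the image.
VkImageMemoryBarrier toTransferSrc{VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER};
toTransferSrc.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
toTransferSrc.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
toTransferSrc.oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
toTransferSrc.newLayout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;
toTransferSrc.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
toTransferSrc.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
toTransferSrc.image = renderedImage;
toTransferSrc.subresourceRange = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1};
vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
                     VK_PIPELINE_STAGE_TRANSFER_BIT, 0, 0, nullptr, 0, nullptr, 1, &toTransferSrc);

// ... transition the acquired swapchain image to TRANSFER_DST_OPTIMAL, vkCmdCopyImage,
//     then transition it to PRESENT_SRC_KHR with another barrier ...

vkEndCommandBuffer(cmd);

VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_TRANSFER_BIT;
VkSubmitInfo submit{VK_STRUCTURE_TYPE_SUBMIT_INFO};
submit.waitSemaphoreCount = 1;
submit.pWaitSemaphores = &imageAvailable;     // swapchain image is ready to be written
submit.pWaitDstStageMask = &waitStage;
submit.commandBufferCount = 1;
submit.pCommandBuffers = &cmd;
submit.signalSemaphoreCount = 1;
submit.pSignalSemaphores = &renderFinished;   // consumed by the present below
vkQueueSubmit(queue, 1, &submit, frameFence); // one fence covers the whole frame

VkPresentInfoKHR present{VK_STRUCTURE_TYPE_PRESENT_INFO_KHR};
present.waitSemaphoreCount = 1;
present.pWaitSemaphores = &renderFinished;
present.swapchainCount = 1;
present.pSwapchains = &swapchain;
present.pImageIndices = &imageIndex;
vkQueuePresentKHR(queue, &present);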
Plus, submit operations are really expensive; doing two submits instead of one submit containing both pieces of work is only a good thing if the submissions are being done on different CPU threads that can work concurrently. But binary semaphores stop that from working. You cannot submit a batch that waits for semaphore A until you have submitted a batch that signals semaphore A. This means that the batch signaling must either be earlier in the same submit command or must have been submitted in a prior submit command. Which means if you put those submits on different threads, you have to use a mutex or something to ensure that the signaling submit happens-before the waiting submit.¹
So you don't get any asynchronous execution of the queue submit operation. So neither the CPU nor the GPU will asynchronously execute any of this.
¹ Timeline semaphores don't have this problem.
As for the particulars of your technical question: if operation A is dependent on operation B, and you synchronize with A, you have also synchronized with B. Since your transfer operation waits on a signal from the graphics queue, waiting on the transfer operation will also wait on the graphics commands from before that signal.

Using fences to clean up command buffers and synchronize swap chain images at the same time

Say I have a swap chain consisting of n images and I allow k "frames in flight". I ensure correct synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a set of semaphores imageAvailableSemaphore and renderFinishedSemaphore and a fence imageInFlight like it is done in this tutorial:
imageAvailableSemaphores.resize(MAX_FRAMES_IN_FLIGHT);
renderFinishedSemaphores.resize(MAX_FRAMES_IN_FLIGHT);
inFlightFences.resize(MAX_FRAMES_IN_FLIGHT);
The fences are needed to ensure that we don't use the semaphores again before the GPU has completed consuming the corresponding image. So, this fence needs to be specified in vkQueueSubmit.
On the other hand, I'm creating command buffers independently of the "frames in flight". They are "one-time submit" command buffers. Hence, once submitted, I add them to a "to-be-deleted" list. I need to know when the GPU has finished execution of the command buffers in this list.
But I cannot specify another fence in vkQueueSubmit. How can I solve this problem?
I allow k "frames in flight"
Well, that's your answer. Each thread that is going to contribute command buffers for a "frame" should have some multiple of k command buffers. They should use them in a ring-buffer fashion. These command buffers should be created from a transient-allocating pool. When they pick the least-recently used CB from the ring buffer, they should reset it before recording into it.
You ensure that no thread tries to reset a CB that is still in use by not starting any work for the next frame until the kth frame in the past has completed (using a fence).
If for some reason you absolutely cannot tell your threads what k is up front, you're still going to have to tell them something. When you start work on a thread, you need to tell it how many frames are still in flight. This allows it to check the size of its ring buffer against that number of frames. If the number of frames still in flight is less than the number of elements in the ring buffer, then the oldest CB in the ring buffer is no longer in use. Otherwise, the thread will have to allocate a new CB from the pool and push it into the ring buffer.
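For concreteness, a minimal sketch of such a per-thread ring of command buffers. The names are illustrative, and it assumes the pool was created with VK_COMMAND_POOL_CREATE_TRANSIENT_BIT | VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT (otherwise, reset the whole pool instead of individual buffers).
#include <vector>
#include <vulkan/vulkan.h>

struct CommandBufferRing {
    std::vector<VkCommandBuffer> buffers;   // k (or a multiple of k) pre-allocated CBs
    size_t next = 0;

    // Call this only after the fence of the k-th previous frame has been waited on,
    // so the least-recently-used buffer is guaranteed to be idle.
    VkCommandBuffer acquire() {
        VkCommandBuffer cb = buffers[next];
        next = (next + 1) % buffers.size();
        vkResetCommandBuffer(cb, VK_COMMAND_BUFFER_RESET_RELEASE_RESOURCES_BIT);
        return cb;
    }
};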
You can use Timeline Semaphores for this. You can read in depth about them here: https://www.khronos.org/blog/vulkan-timeline-semaphores
Instead of being signaled or not signaled, timeline semaphores carry a specific value. The function of interest for you is vkGetSemaphoreCounterValue, which allows you to read the value of the semaphore without blocking.
To create a timeline semaphore you simply set the pNext value of your VkSemaphoreCreateInfo to a VkSemaphoreTypeCreateInfo looking like the following.
VkSemaphoreTypeCreateInfo timelineCreateInfo{VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO};
timelineCreateInfo.semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE;
timelineCreateInfo.initialValue = 0;

VkSemaphoreCreateInfo createInfo{VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO};
createInfo.pNext = &timelineCreateInfo;
vkCreateSemaphore(device, &createInfo, nullptr, &semaphore);
The pNext value of your VkSubmitInfo needs to be set to a VkTimelineSemaphoreSubmitInfo.
VkTimelineSemaphoreSubmitInfo timelineInfo{VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO};
timelineInfo.signalSemaphoreValueCount = 1;
timelineInfo.pSignalSemaphoreValues = &signalValue;

submitInfo.pNext = &timelineInfo;   // submitInfo is the VkSubmitInfo used in vkQueueSubmit
After the command buffer is done, the value of the semaphore will be whatever value you set signalValue to. After that you can query for the value with:
uint64_t value;
vkGetSemaphoreCounterValue(device, semaphore, &value);
So, assuming you set signalValue to 1, the value here will be either 1 or 0 (0 being the initial value we gave the semaphore in VkSemaphoreTypeCreateInfo). After that you can safely delete your one-time command buffer.
Note: timeline semaphores are actually meant as a semi-replacement for fences and binary semaphores and should be the main synchronization primitive you use. I think the only functions that still require a binary semaphore are vkAcquireNextImageKHR and vkQueuePresentKHR.
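If you do want a blocking wait instead of polling, timeline semaphores also support a host-side wait on a value via vkWaitSemaphores (core in Vulkan 1.2; vkWaitSemaphoresKHR with the extension). A minimal sketch, reusing the semaphore and signalValue from above:
VkSemaphoreWaitInfo waitInfo{VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO};
waitInfo.semaphoreCount = 1;
waitInfo.pSemaphores = &semaphore;
waitInfo.pValues = &signalValue;   // blocks until the semaphore reaches this value
vkWaitSemaphores(device, &waitInfo, UINT64_MAX);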

Canonical way to "broadcast" data to multiple processes in Linux?

I've got an application that needs to send a stream of data from one process to multiple readers, each of which needs to see its own copy of the stream. This is reasonably high-rate (100MB/s is not uncommon), so I'd like to avoid duplication if possible. In my ideal world, linux would have named pipes that supported multiple readers, with a fast path for the common single-reader case.
I'd like something that provides some measure of namespace isolation (e.g. broadcasting on 127.0.0.1 is open to any process, I believe...). Unix domain sockets don't support broadcast, and UDP is "unreliable" anyways (the server will drop packets instead of blocking in my case). I suppose I could create a shared-memory segment and store the common buffers there, but that feels like reinventing the wheel. Is there a canonical way to do this in Linux?
I suppose I could create a shared-memory segment and store the common buffers there, but that feels like reinventing the wheel. Is there a canonical way to do this in Linux?
The short answer: No
The long answer: Yes [and you're on the right track]
I've had to do this before [for even higher speeds], so I had to research this. The following is what I came up with.
In the main process, create a pool of shared buffers [use SysV shm or a private mmap, as you choose]. Assign ID numbers to them (e.g. 1, 2, 3, ...). Now there is a mapping from bufid to buffer memory address. To make this accessible to the child processes, do this before you fork them. The children then inherit the shared memory mappings, so there is not much extra work.
Now fork the children. Give them each a unique process id. You can just incrementally start with a number: 2,3,4,... [main is 1] or just use regular pids.
Open up a SysV msg channel (msgget et al.). Again, if you do this in the main process before the fork, it is available to the children [IIRC].
Now here's how it works:
main finds an unused buffer and fills it. For each child, main sends an IPC message via msgsnd (on the single common IPC channel) where the message payload [mtext] is the bufid number. Each message has the standard header's mtype field set to the destination child's pid.
After doing this, main remembers the buffer as "in flight" and not yet reusable.
Each child does a msgrcv with the mtype set to its pid. It then extracts the bufid from mtext and processes the buffer. When it's done, it sends an IPC message [again on the same channel] with mtype set to main's pid with an mtext of the bufid it just processed.
main's loop does a non-blocking msgrcv, noting all "release" messages for a given bufid. When all children have released the buffer, it's put back on the buffer "free queue". In main's service loop, it may fill new buffers and send more messages as appropriate [interspersed with the waits].
The child then does an msgrcv and the cycle repeats.
So, we're using [large] shared memory buffers and short [a few bytes] bufid descriptor IPC messages.
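As a minimal sketch of that message flow (names and IDs are illustrative; error handling and the shared-buffer setup are omitted):
#include <sys/ipc.h>
#include <sys/msg.h>

struct bufmsg {
    long mtype;    // destination: a child's id, or main's id for a "release"
    int  bufid;    // index into the pool of shared buffers
};

// main: hand buffer bufid to the child with id child_id
void send_buffer(int msqid, long child_id, int bufid) {
    struct bufmsg m = { child_id, bufid };
    msgsnd(msqid, &m, sizeof(m.bufid), 0);
}

// child: wait for a buffer addressed to me, process it, then release it back to main
void child_loop(int msqid, long my_id, long main_id) {
    struct bufmsg m;
    while (msgrcv(msqid, &m, sizeof(m.bufid), my_id, 0) >= 0) {
        // ... process shared buffer m.bufid ...
        struct bufmsg release = { main_id, m.bufid };
        msgsnd(msqid, &release, sizeof(release.bufid), 0);
    }
}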
Okay, so the question you may be asking: "Why SysV IPC for the comm channel?" [vs. multiple pipes or sockets].
You already know that a shared buffer avoids sending multiple copies of your data.
So, that's the way to go. But, why not send the above bufid messages across sockets or pipes [or shared queues, condition variables, mutexes, etc]?
The answer is speed and the wakeup characteristics of the target process.
For a highly realtime response, when main sends out the bufid messages, you want the target process [if it's been sleeping] to wake up immediately and start processing the buffer.
I examined the linux kernel source and the only mechanism that has that characteristic is SysV IPC. All others have a [scheduling] lag.
When process A does msgsnd on a channel that process B has done msgrcv on, three things will happen:
process B will be marked runnable by the scheduler.
[IIRC] B will be moved to the front of its scheduling queue
Also, more importantly, this then causes an immediate reschedule of all processes.
B will start right away [as opposed to next timer interrupt or when some other process just happens to sleep]. On a single core machine, A will be put to sleep and B will run in its stead.
Caveat: All my research was done a few years back before the CFS scheduler, but, I believe the above should still hold. Also, I was using the RT scheduler, which may be a possible option if CFS doesn't work as intended.
UPDATE:
Looking at the POSIX message queue source, I think that the same immediate-wakeup behavior you discussed with the System V queues is going on, which gives the added benefit of POSIX compatibility.
The timing semantics are possible [and desirable] so I wouldn't be surprised. But, SysV is actually more standard and ubiquitous than POSIX mqueues. And, there are some semantic differences [See below].
For timing, you can build a unit test program [just using msgs] with nsec timestamps. I used TSC stamps, but clock_gettime(CLOCK_REALTIME,...) might also work. Stamp the departure time and the arrival/wakeup time to see, and compare both SysV and mq.
With either SysV or mq you may need to bump up the max # of msgs, max msg size, max # of queues via /proc/*. The default values are relatively small. If you don't, you may find tasks blocked waiting for a msg but master can't send one [is blocked] due to a msg queue maximum parameter being exceeded. I actually had such a bug, so I changed my code to bump up these values [it was running as root] during startup. So, you may need to do this as an RC boot script (or whatever the [atrocious ;-)] systemd equivalent is)
I looked at using mq to replace SysV in my own code. It didn't have the same semantics for a many-to-one return-to-free-pool msg. In my original answer, I had forgotten to mention that two msg queues are needed: master-to-children (e.g. work-to-do) and children-to-master (e.g. returning a now available buffer).
I had several different types of buffers (e.g. compressed video, compressed audio, uncompressed video, uncompressed audio) that had varying types and struct descriptors.
Also, multiple different buffer queues as these buffers got passed from thread to thread [different processing stages].
With SysV you can use a single msg queue for multiple buffer lists/queues, the buffer list ID is the msg mtype. A child msgrcv waits with mtype set to the ID value. The master waits on the return-to-free msg queue with mtype of 0.
mq* requires a separate mqd_t for each ID because it doesn't allow a wait on a msg subtype.
msgrcv allows IPC_NOWAIT on each call, but to get the same effect with mq_receive you have to open the queue with O_NONBLOCK or use the timed version. This gets used during the "shutdown" or "restart" phase (e.g. send a msg to children that no more data will arrive and they should terminate [or reconfigure, etc.]). The IPC_NOWAIT is handy for "draining" a queue during program startup [to get rid of stale messages from a prior invocation] or drain stale messages from a prior configuration during operation.
So, instead of just two SysV msg queues to handle an arbitrary number of buffer lists, you'll need a separate mqd_t for each buffer list/type.

OpenGL: glClientWaitSync on separate thread

I am using glMapBufferRange with the GL_MAP_UNSYNCHRONIZED_BIT to map a buffer object. I then pass the returned pointer to a worker thread to compute the new vertices asynchronously. The object is double-buffered, so I can render one buffer while the other is written to. Using GL_MAP_UNSYNCHRONIZED_BIT gives me significantly better performance (mainly because glUnmapBuffer returns sooner), but I am getting some visual artifacts (despite the double buffering), so I assume either the GPU starts rendering while the DMA upload is still in progress, or the worker thread starts writing to the vertices too early.
If I understand glFenceSync, glWaitSync and glClientWaitSync correctly, then I am supposed to address these issues in the following way:
A: avoid having the GPU render the buffer object before the DMA upload has completed:
directly after glUnmapBuffer, call on the main thread
GLsync uploadSync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush();
glWaitSync(uploadSync, 0, GL_TIMEOUT_IGNORED);
B: avoid writing to the buffer from the worker thread before the GPU has finished rendering it:
directly after glDrawElements, call on the main thread
GLsync renderSync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
and on the worker thread, right before starting to write data to the pointer that has previously been returned from glMapBufferRange
glClientWaitSync(renderSync,0,100000000);
...start writing to the mapped pointer
1: Is my approach to the explicit syncing correct?
2: How can I handle the second case? I want to wait in the worker thread (I don't want to make my main thread stall), but I cannot issue glCommands from the worker thread. Is there another way to check if the GLsync has been signalled other than the gl call?
What you could do is create an OpenGL context in the worker thread, and then share it with the main thread. Next:
Run on the main thread:
GLsync renderSync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush();
then
Run on the worker thread:
glClientWaitSync(renderSync,0,100000000);
The glFlush on the main thread is important, since otherwise you could have an infinite wait. See also the OpenGL docs:
4.1.2 Signaling
Footnote 3: The simple flushing behavior defined by SYNC_FLUSH_COMMANDS_BIT will not help when waiting for a fence command issued in another context’s command stream to complete. Applications which block on a fence sync object must take additional steps to assure that the context from which the corresponding fence command was issued has flushed that command to the graphics pipeline.

Simultaneous Read/Write on a file by two threads (Mutexes aren't helping)

I want to use one thread to get fields of packets by using the tshark utility (via a system() call), whose output is then redirected to a file. This same file needs to be read by another thread simultaneously, so that it can make runtime decisions based on the fields observed in the file.
The problem I am currently having is that even though the first thread is writing to the file, the second thread is unable to read it (it reads NULL from the file). I am not sure why it's behaving this way. I thought it might be due to simultaneous access to the same file. I thought of using mutex locks, but that would block the reading thread, since the first thread only ends when the program terminates.
Any ideas on how to go about it?
If you are using that file for interprocess communication, you could use named pipes or message queues instead. They are much easier to use and don't require explicit synchronization, because one thread writes and the other one reads when data is available.
Edit: For inter-thread communication you can simply use shared variables and a condition variable to signal when some data has been produced (a producer-consumer pattern). Something like:
// thread 1
while(1)
{
// read packet
// write packet to global variable
// signal thread 2
// wait for confirmation of reading
}
// thread 2
while(1)
{
// wait for signal from thread 1
// read from global variable
// signal thread 1 that the data has been read
}
The signal parts can be implemented with condition variables: pthread_cond_t.
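As a minimal sketch of that producer-consumer hand-off with pthreads (names are illustrative; a real version would also need a way to stop the loops):
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static char packet[2048];
static bool packet_ready = false;

// thread 1: produce packet fields
void *producer(void *arg) {
    for (;;) {
        pthread_mutex_lock(&lock);
        while (packet_ready)                 // wait until thread 2 has consumed the last one
            pthread_cond_wait(&cond, &lock);
        // ... fill packet with the next set of fields ...
        packet_ready = true;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

// thread 2: consume packet fields and make decisions
void *consumer(void *arg) {
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!packet_ready)                // wait until thread 1 has produced something
            pthread_cond_wait(&cond, &lock);
        // ... act on packet ...
        packet_ready = false;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}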
