I am using the Vulkan graphics API (via BGFX) to render, and I have been measuring how much wall-clock time my calls take.
What I do not understand is that vkAcquireNextImageKHR() is always fast and never blocks, even though I disable the timeout and use a semaphore to wait for presentation.
Presentation is locked to a 60 Hz display rate, and I do see my main loop run at 16.6 or 33.3 ms per iteration.
Shouldn't the wait for this display rate show up in the duration of the vkAcquireNextImageKHR() call?
The profiler measures this call at around 0.2 ms, never a substantial part of a frame.
VkResult result = vkAcquireNextImageKHR(
    m_device,
    m_swapchain,
    UINT64_MAX,               // timeout disabled: wait indefinitely for an image
    renderWait,               // semaphore, signaled once the image is actually usable
    VK_NULL_HANDLE,           // no fence
    &m_backBufferColorIdx
);
Target hardware is a handheld console.
The whole purpose of Vulkan is to alleviate CPU bottlenecks. Making the CPU stop until the GPU is ready for something would be the opposite of that. Especially if the CPU itself isn't actually going to use the result of this operation.
As such, all the vkAcquireNextImageKHR function does is let you know which image will be made available to you next. This is the minimum that needs to happen in order for you to be able to use that image (for example, by building command buffers that reference the image in some way). However, that image is not yet available to you.
This is why this function requires you to provide a semaphore and/or a fence: so that the process which consumes the image can wait for the image to be made available.
If the process which consumes the image is just a bunch of commands in a command buffer (i.e., something you submit with vkQueueSubmit), you can simply have that batch of work wait on the semaphore given to the acquire operation. That means all of the waiting happens on the GPU. Where it belongs.
The fence is there if you (for some reason) want the CPU to be able to wait until the acquire is done. But Vulkan, as an explicit, low-level API, forces you to explicitly say that this is what you want (and it almost never is what you want).
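A minimal sketch of that pattern (my illustration, not code from the question): the submit that renders to the acquired image waits on the acquire semaphore on the GPU, while the CPU never blocks. Names such as cmdBuf, renderDone, and m_graphicsQueue are placeholders; renderWait is the semaphore passed to vkAcquireNextImageKHR above.

// requires <vulkan/vulkan.h>
VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;

VkSubmitInfo submitInfo = {};
submitInfo.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submitInfo.waitSemaphoreCount   = 1;
submitInfo.pWaitSemaphores      = &renderWait;   // signaled by the acquire operation
submitInfo.pWaitDstStageMask    = &waitStage;    // only block color-attachment writes
submitInfo.commandBufferCount   = 1;
submitInfo.pCommandBuffers      = &cmdBuf;       // work that touches the acquired image
submitInfo.signalSemaphoreCount = 1;
submitInfo.pSignalSemaphores    = &renderDone;   // later waited on by vkQueuePresentKHR

vkQueueSubmit(m_graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);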
Related
I switched to Vulkan from OpenGL to use its multi-threading improvements.
In OpenGL, I was able to load objects (buffers, textures, etc.) into the scene dynamically while rendering by using a waiting system: I loaded all app-side data in a thread, and when it was ready, just before rendering a frame in the main thread, I sent everything to video memory. That worked fine.
With Vulkan, I know I can call some functions from multiple threads without provoking the well-known segfault from OpenGL. But this doesn't work with vkQueueSubmit(); I already know, I tried the naive way. To me, it seems logical that you cannot touch a queue from multiple threads at once.
I came up with some ideas, but I don't know which ones are good or bad.
First, I could go the OpenGL way: prepare everything I can on the CPU/app side, then just before rendering a frame, submit the buffers (with the transfer queue) to video memory. But I feel there is no real improvement over the OpenGL approach...
Second, I could try to use the synchronization mechanisms to send buffers from one thread and render from another. But I keep reading that there are a lot of ways to slow everything down by causing needless locks or by using semaphores and fences incorrectly.
So my question is basically: which path should I pick to solve this problem? How can I load a buffer dynamically from another thread while the main thread is rendering, without hurting performance too much? How can Vulkan help?
If you want to stream resources for immediate use (i.e. the main render cannot proceed without them), then you're pretty much going to either block the main thread waiting, or have it spin doing something visually interesting (e.g. an animated loading screen) waiting for the resources to load.
If you want to stream resources while the app is doing real rendering, then the main trick is to load resources asynchronously in the background and only switch to using them in the main thread once they are already loaded. If the main thread ever ends up actually blocked on a semaphore, then you've probably already started dropping frames, so your "engine" design needs to ensure that never happens. A lot of games use simple low-detail proxy objects as stand-ins while the high-detail version is loading in the background.
None of this is particularly related to the graphics API - both GL and Vulkan need the same macro-scale behavior. Vulkan API features don't particularly help because the major bottlenecks which cause problems here are storage/network/CPU which have nothing to do with the graphics part of the problem.
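As a sketch of the "switch only once loaded" idea above (my own illustration, not code from the answer): a loader thread fills in a resource and publishes it with an atomic flag, while the render thread keeps drawing a proxy until the flag is set. StreamedMesh and the thread functions are placeholders.

#include <atomic>

struct StreamedMesh {
    // GPU handles created by the loader thread would live here (VBO, IBO, ...)
    std::atomic<bool> ready{false};
};

void loaderThread(StreamedMesh& mesh) {
    // 1. read the file, create and fill staging buffers,
    // 2. record + submit the transfer work and wait for its fence on *this* thread,
    mesh.ready.store(true, std::memory_order_release);   // 3. publish the result
}

void renderThread(StreamedMesh& mesh /*, low-detail proxy */) {
    if (mesh.ready.load(std::memory_order_acquire)) {
        // record draw calls that use the streamed mesh
    } else {
        // keep drawing the proxy; never block the frame on the loader
    }
}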
I decided to trust threads!
At first it seems to work, though I get a lot of:
[MESSAGE:Validation Error: [ UNASSIGNED-Threading-MultipleThreads ] Object 0: handle = 0x56414228bad8, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0x141cb623 | THREADING ERROR : vkQueueSubmit(): object of type VkQueue is simultaneously used in thread 0x7f6b977fe640 and thread 0x7f6bc2bcb740]
But it works!
So, the basic idea is to have a thread for loading objects while the engine is drawing. This thread takes care of creating the UBO for the location of the object; then, when the geometry is loaded from RAM, it creates the VBO and IBO (I've left materials with images/UBOs on hold for now), then creates the graphics pipeline (with layout, descriptor layout, and shaders compiled with GLSLang on the fly; the next idea is to reuse pipelines for similar needs), and finally sets a flag to say the object is ready to use. On the other side, my main thread keeps rendering and picks up new objects when they show up as ready.
I think it works because I have a forgiving video card (GTX 1070) with multiple queues set up: one for graphics and another for transfer.
I'm pretty sure this will crash or go poorly on a GPU with a single queue, and that is probably why the validation layers give me these messages.
I'm currently working on a DirectX rendering environment and encounter undefined behaviour on Intel cards.
I'm currently using trivial synchronization, so that the CPU always waits for the GPU to finish after every ExecuteCommandLists().
Everything works perfectly fine, drawing, copying my resources etc.
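For context, a sketch of the "trivial synchronization" described above (my reconstruction assuming D3D12, given the ExecuteCommandLists() and command-queue usage; not the poster's actual code): flush the queue with a fence after every submission.

#include <windows.h>
#include <d3d12.h>

// Block the CPU until the GPU has finished everything submitted so far.
void WaitForGpu(ID3D12CommandQueue* queue, ID3D12Fence* fence,
                UINT64& fenceValue, HANDLE fenceEvent)
{
    const UINT64 signalValue = ++fenceValue;
    queue->Signal(fence, signalValue);                 // GPU writes the value when done
    if (fence->GetCompletedValue() < signalValue) {
        fence->SetEventOnCompletion(signalValue, fenceEvent);
        WaitForSingleObject(fenceEvent, INFINITE);     // CPU blocks until the GPU catches up
    }
}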
Since I work with different "passes" (draw passes, memory transfer passes, clear passes), I also have a "present" pass. There I have one command list with just two tasks, using a resource barrier to transition the render target from the render-target state to the present state. I use the same command queue that is used for the swap chain and where all the drawing happens. Previous tasks were fully executed and waited for.
When I try to execute these two tasks, it produces a device removal on Intel cards without any further explanation. On cards from other vendors it works perfectly fine.
It definitely has something to do with the command queue: even when I remove the resource barrier calls, it still crashes. If I create a new command queue, the call goes through but it still crashes, this time with a useful debug message saying that I need to use the same queue as for the swap chain.
What is going wrong here, and how can I get more information about the device removal?
Thanks for your help!
I want to see the intrinsic difference between a thread and a long-running go block in Clojure. In particular, I want to figure out which one I should use in my context.
I understand that if one creates a go block, it is scheduled to run on a so-called thread pool, whose default size is 8, whereas thread creates a new thread.
In my case, there is an input stream that takes values from somewhere, and each value is taken as an input. Some calculations are performed and the result is put onto a result channel. In short, we have an input and an output channel, and the calculation is done in a loop. To achieve concurrency, I have two choices: either use a go block or use thread.
I wonder what the intrinsic difference between these two is. (We may assume there is no I/O during the calculations.) The sample code looks like the following:
(go-loop []
  (when-let [input (<! input-stream)]
    ... ; calculations here
    (>! result-chan result)
    (recur)))                ; stop looping once input-stream is closed

(thread
  (loop []
    (when-let [input (<!! input-stream)]
      ... ; calculations here
      (put! result-chan result)
      (recur))))             ; stop looping once input-stream is closed
I realize the number of threads that can run simultaneously is exactly the number of CPU cores. In that case, do go blocks and threads show no difference once I am creating more than 8 threads or go blocks?
I could try to simulate the difference in performance on my own laptop, but the production environment is quite different from the simulated one, so I could draw no conclusions.
By the way, the calculation is not heavy; if the inputs are not too large, 8,000 loop iterations can run in one second.
Another consideration is whether go-block vs thread will have an impact on GC performance.
There are a few things to note here.
Firstly, the pool that threads are created on via clojure.core.async/thread is what is known as a cached thread pool: although it will re-use recently used threads, it is essentially unbounded, which of course means it could potentially hog a lot of system resources if left unchecked.
But given that what you're doing inside each asynchronous process is very lightweight, threads seem a little like overkill to me. It's also important to take into account the quantity of items you expect to hit the input stream; if this number is large, you could potentially overwhelm core.async's thread pool for go blocks, to the point where work is waiting for a pool thread to become available.
You also didn't mention precisely where you're getting the input values from. Are the inputs some fixed data set that remains constant at the start of the program, or are inputs continuously fed into the input stream from some source over time?
If it's the former, then I would suggest you lean towards transducers, and I would argue that a CSP model isn't a good fit for your problem, since you aren't modelling communication between separate components of your program; rather, you're just processing data in parallel.
If it's the latter then I presume you have some other process that's listening to the result channel and doing something important with those results, in which case I would say your usage of go-blocks is perfectly acceptable.
I want to share my thoughts about how to keep memory barriers in sync in multi-threaded rendering. Please let me know if my understanding of Vulkan memory barriers is wrong or whether my current plan makes sense. I don't have anyone at work to discuss this with, so I'll ask here for help.
For resources in Vulkan, when I set memory barriers for them between draw calls, I need to set both srcAccessMask and dstAccessMask. This is simple for single-threaded rendering, but for multi-threaded rendering it gets complicated. dstAccessMask is not a problem, since we always know what the resource is going to be used for. But for srcAccessMask, when one command buffer tries to read the current access mask of some resource, other command buffers might be changing it to something else. My current idea for solving this is:
Each resource keeps its own state. I only update that state right before submitting command buffers to the queue, as described below. Each command buffer maintains a tracking record of how the resource state changed inside it. This way, within a single command buffer the access state of each resource is clear; the only open question is the beginning state of the resource for each command buffer.
When submitting multiple command buffers to execute, since the order of the command buffers is now fixed, I check the tracking record of each resource across all command buffers, update the resource's state based on its end state in each command buffer, and use that to fill in the beginning state of the same resource in each command buffer's tracking record.
Then I either insert a new command buffer with an extra memory barrier to transition the resource to the correct state for the first command buffer, or insert memory barriers into the previous command buffer for the remaining command buffers. When all of this is done, I can finally submit the command buffers together as a batch.
Does this make sense to you? Are there better solutions? Or do we even need to solve this "synchronization" issue of access state per resource at all?
Thank you for your time.
What you're talking about only makes sense in a world where none of these rendering operations have even the slightest idea what's going on elsewhere. Where the consumer of an image has no idea how the data in the image got there. Which probably means that it doesn't really know what that image means conceptually.
Vulkan is a low-level API. The idea is that you can connect the high-level concepts of your rendering system directly to Vulkan. So at a high level, you know that resource X has meaning Y and in this frame will have its data generated from operation Z. Not because of something stored in resource X but because it is resource X; that's what resource X is for. So both the operation generating it and the operation consuming it know what's going on and how it got there.
For example, if you're doing deferred rendering and SSAO, then your SSAO renderpass knows that the texture containing the depth buffer had its values generated by rendering. The depth buffer doesn't need something stored in it to say that; that's simply the nature of your rendering. It's hard-coded to work that way.
Most of your resource dependencies are (or ought to be) that way.
If you're doing some render-to-texture operation via the framebuffer, then the consumer probably doesn't even need to know about the dependency. You can just set an appropriate external dependency for the renderpass and the subpass that generates it. And you probably know why you did the render-to-texture op, and you probably know where it's going. If you're doing RTT for reflection, you know that the destination will be some kind of shader stage texture fetch. And if you don't know how it's going to be used, then you can just be safe and set all of the destination stage bits.
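A hedged sketch of that external-dependency idea (values are illustrative, not from the answer): the render-to-texture renderpass itself declares that its color output must be visible to a later shader read, so the consumer needs no barrier of its own.

// requires <vulkan/vulkan.h>
VkSubpassDependency dep = {};
dep.srcSubpass      = 0;                                      // the RTT subpass
dep.dstSubpass      = VK_SUBPASS_EXTERNAL;                    // whatever comes after the pass
dep.srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dep.dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;  // or all stages if unsure
dep.srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
dep.dstAccessMask   = VK_ACCESS_SHADER_READ_BIT;
dep.dependencyFlags = 0;
// dep then goes into VkRenderPassCreateInfo::pDependencies when creating the renderpass.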
What you're talking about makes some degree of sense if you're dealing with streamed objects, where objects are popping into and out of memory with some regularity. But even then, that's not really a property of each individual resource.
When you load a streamed chunk, you upload its data by generating command buffer(s) and submitting them. And here's where we have an implementation-specific divergence. Your best bet for performance is to execute these CBs on a queue dedicated for transfer operations. But since Vulkan doesn't guarantee all implementations have those, you need to be able to deliver those transfer CBs to the main rendering queue.
So you need a way to communicate to rendering threads when they can expect to start being able to use the resources. But even that doesn't need to be on a per-resource basis; they can be told "stuff from block X is available", and then they can start using it.
Furthermore, that implementation divergence becomes important. See, if it's done on another queue, a barrier isn't the right synchronization primitive. Your rendering CBs now have to have their submitted batches wait on a semaphore. And that semaphore should handle all of the synchronization needs of the memory (i.e., the destination bits being everything). So in implementations where the transfer CBs are executed on the same queue as your rendering CBs, you may as well save yourself some trouble and issue a single barrier at the end of the transfer CB that makes all of the given resources available to all stages.
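A minimal sketch of that single catch-all barrier (my illustration; transferCmdBuf is a placeholder): all transfer writes are made available to everything that follows, rather than tracking each resource individually.

// requires <vulkan/vulkan.h>
VkMemoryBarrier barrier = {};
barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_MEMORY_WRITE_BIT;

vkCmdPipelineBarrier(
    transferCmdBuf,                       // the transfer command buffer, recorded elsewhere
    VK_PIPELINE_STAGE_TRANSFER_BIT,       // after all transfer writes...
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,   // ...before anything that follows
    0,
    1, &barrier,
    0, nullptr,
    0, nullptr);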
So as previously stated, this kind of automated system is only useful if you have no real control over the structure of rendering. This would principally be true if you're writing some kind of middleware, where the higher-level code defines the structure of rendering. However, if that's the case, Vulkan probably isn't the right tool for that job.
As a student project, we're building a robot which should run through a defined course and pick up a wooden cube. The core of it is a single-board computer running Debian on an ARM9 at 250 MHz, so there is more than enough processing power for the controller. Additionally, it does some image processing (not while driving, only when it stops); that's why we don't use a simple microcontroller without an OS.
The whole thing works quite well at the moment, but there is one problem: the main control loop executes without any delays and achieves cycle frequencies of over 1 kHz. That is more than enough; 100 Hz would suffice. But every now and then there is a single cycle which takes 100 ms or more, which may greatly disturb the controller.
I suspect that some other tasks cause this delay, since the scheduler may detect that they haven't had any CPU time for an extended period.
What I'd like most is the following: a short sleep of maybe 5 ms in the controller's main loop which really only takes 5 ms but gives some processor time to the rest of the system. I can include a delay of e.g. 500 µs using nanosleep, but this always takes more than 10 ms to execute, so it is not really an alternative. I'd prefer something like a voluntary sleep that gives waiting tasks the opportunity to run, but returns as quickly as possible.
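For reference, a sketch of the kind of call described above (my reconstruction, not the original code); the poster observes that the actual wake-up arrives more than 10 ms after the requested 500 µs:

#include <ctime>

struct timespec ts;
ts.tv_sec  = 0;
ts.tv_nsec = 500 * 1000;   // request a 500 µs sleep
nanosleep(&ts, nullptr);   // on this system, returns >10 ms later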
The system is otherwise unloaded, so there is nothing that could really need a lot of processing for a long time.
Is there any way to do this in userspace, i.e. without having to resort to things like RTAI?
Thanks,
Philipp
I suggest that you stick to real-time interfaces when doing motor control; I've seen a 1000 kg truck slam into a wall during a demo because the OS decided to think about something else for once... :-)
If you want to stay away from RTAI (but you shouldn't), a (perhaps) quick fix is to slap an Arduino board on for the actual driving, and keep the Linux board for the high-level processing.
To fix the "wall problem", implement a watchdog in the driver board that stops the thing if no command has arrived for a while...
A real-time problem needs a real-time OS.
Because a real-time OS is predictable. You can set task priorities such that the ones that need to produce results at specified times don't get preempted by tasks that need processing power but are not time constrained.
OK, I've found a solution which makes things better, although not perfect. There is another thread which explains the sched_setscheduler() function. My init code now looks like this:
#include <iostream>     // cerr, endl
#include <sched.h>      // sched_setscheduler, sched_get_priority_max, SCHED_FIFO
#include <sys/mman.h>   // mlockall

using std::cerr;
using std::endl;

// Lock memory to reduce the probability of high latency due to swapping
int rtres = mlockall(MCL_CURRENT | MCL_FUTURE);
if (rtres) {
    cerr << "WARNING: mlockall() failed: " << rtres << endl;
}

// Set a real-time scheduler policy to get more time-deterministic behaviour
struct sched_param rtparams;
rtparams.sched_priority = sched_get_priority_max(SCHED_FIFO);
rtres = sched_setscheduler(0, SCHED_FIFO, &rtparams);
if (rtres) {
    cerr << "WARNING: sched_setscheduler() failed: " << rtres << endl;
}
Additionally, I've removed the short sleeps from the main loop. The process now eats all available processing power and is (obviously) unresponsive to anything from the outside world (such as console keystrokes), but this is OK for the task at hand.
The main loop statistics show that most iterations take less than a millisecond to complete, but there are still a few (every 1000th or so) which need approximately 100 ms. That is still not good, but at least there are no longer delays that are even longer.
Since it is "only" a student project, we can live with that and note it as a candidate for further improvement.
Anyway, thanks for the suggestions. Next time I'll know better how to cope with real-time requirements and will take an RT OS from the beginning.
Regards,
Philipp
We are running 100 Hz loops on an ARM7 board at work, using a standard Linux kernel with the RT patch, which allows (almost) all locks in the kernel to be preempted. Combining this with a high-priority thread gives us the necessary performance in both kernel and user space.
As the only things you need to do are apply the patch and configure the kernel for full preemption, it's pretty easy to use as well; there is no need to change anything in the software architecture, although I'm not familiar enough with Debian to say whether the patch will apply cleanly.
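A hedged sketch of the "high-priority thread" part (my illustration, not the poster's code): give just the control thread a real-time priority, the per-thread analogue of the sched_setscheduler() call shown earlier. Running this typically requires root or an appropriate rtprio limit.

#include <pthread.h>
#include <sched.h>

// Promote the calling thread to SCHED_FIFO with the given priority (e.g. 80).
// Returns 0 on success, otherwise the error code from pthread_setschedparam().
int makeCurrentThreadRealtime(int priority)
{
    sched_param param = {};
    param.sched_priority = priority;
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
}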