Difference between Vulkan queue families - graphics

I'm fairly new to Vulkan and am playing around with the API. I have a function that prints my queue families, and currently I have two:
One that supports graphics/transfer/compute, with a max of 16 queues,
and one that supports sparse/transfer with a max of 2 queues.
Say I want to create two queues, one for graphics and one for transfer only. My understanding is that a transfer queue created out of the first family is effectively identical to one created out of the second family, as long as I only use transfer operations. AKA, I can pretty much ignore the second family as long as I do not use sparse memory operations.
Is this understanding correct, or am I missing something? Is there some reason I would prefer to create my graphics and transfer queues from separate families?

Among the properties of a queue family is minImageTransferGranularity. This is a limitation on the X/Y/Z regions of image data that can be copied using that queue family. So if this value is 8x8x8, then the offset and extent of image copies must be aligned on 8-pixel boundaries for all image copy operations on queues of that family.
So no, you cannot assume that a transfer-only queue family can always be used in place of a more capable queue. You always have to check.
At the same time, dedicated transfer queue families tend to represent specialized hardware specifically intended for doing transfer operations. So they may be using more efficient data pathways than transfer operations on other queues.
Generally speaking, if a piece of hardware offers a queue that only does transfer work, and you're doing enough transfer work that you're considering using a dedicated queue for it, you should use that queue family for doing transfer work (so long as the granularity works out for you).
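For illustration, a minimal sketch of how one might enumerate queue families, prefer a dedicated transfer-only family, and check its minImageTransferGranularity (physicalDevice is assumed to be a valid VkPhysicalDevice chosen elsewhere):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>
#include <cstdio>
#include <vector>

// Sketch: pick a dedicated transfer-only queue family if one exists,
// otherwise fall back to the graphics-capable family.
uint32_t pickTransferQueueFamily(VkPhysicalDevice physicalDevice)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, families.data());

    uint32_t fallback = UINT32_MAX;
    for (uint32_t i = 0; i < count; ++i) {
        const VkQueueFamilyProperties& f = families[i];
        const bool hasTransfer = f.queueFlags & VK_QUEUE_TRANSFER_BIT;
        const bool hasGraphics = f.queueFlags & VK_QUEUE_GRAPHICS_BIT;
        const bool hasCompute  = f.queueFlags & VK_QUEUE_COMPUTE_BIT;

        if (hasTransfer && !hasGraphics && !hasCompute) {
            // Dedicated transfer family: check the copy granularity before
            // committing to it. (1,1,1) means no extra restriction;
            // (0,0,0) means only whole mip levels can be transferred.
            VkExtent3D g = f.minImageTransferGranularity;
            std::printf("family %u: transfer-only, granularity %ux%ux%u\n",
                        i, g.width, g.height, g.depth);
            return i;
        }
        if (hasGraphics && fallback == UINT32_MAX)
            fallback = i;  // graphics-capable families always support transfer
    }
    return fallback;
}
```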

Related

Multiple producer One Consumer Scenario with Signaling in a Real-Time Operating System (RTOS)

I'm developing a real-time system using Mbed OS (an RTOS for the ARM architecture). I'm not a software engineer, and I want to know whether the following solution is practical and how to improve it.
As shown in the figure, the elements of the software are as follows:
Three different classes (ClassA, ...) describe low-level peripherals for gathering data from three different modules; their instances are passed by reference to three different threads (Thread a, ...).
Using three queues (queueA, ...), I'm sending data to Thread d, which gathers the data from the other three threads and combines it into a string in the desired format (synthesis).
The combined data is queued to Thread e, and if certain conditions (occurring in the first three threads) are satisfied, the data is also sent to Thread g.
Now the questions are:
The first three threads gather data at different update rates; how do I synchronize them in Thread d?
What is the best signaling solution to notify the other threads (an event or a signal?!)
Is the mentioned architecture practical?
Thanks.
The first three threads gather data at different update rates; how do I synchronize them in Thread d?
Do they need synchronisation? Thread d will presumably synthesise its output based on the most recent data from all three queues. You might simply update the current data for A/B/C as each item arrives, and whenever any of them arrives generate the combined output from the current data. If the data must be guaranteed to be "fresh", you could timestamp data on arrival and only use it if all three timestamps are recent enough. If you must gather fresh data from all three, you could maintain a flag for each source, set it on arrival, and generate the combined output when all three flags are set, simultaneously clearing the flags for the next set of data. How you do this really depends on the needs of the application, and your abstract description does not suggest a specific solution.
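As a rough sketch of the flag approach (plain C++ rather than Mbed-specific APIs, with hypothetical SampleA/SampleB/SampleC payload types), the shared state and the combining thread might look like this:

```cpp
#include <condition_variable>
#include <mutex>
#include <string>

// Hypothetical payload types for the three sources.
struct SampleA { int value; };
struct SampleB { int value; };
struct SampleC { int value; };

struct Combiner {
    std::mutex m;
    std::condition_variable cv;
    SampleA a{}; SampleB b{}; SampleC c{};
    bool haveA = false, haveB = false, haveC = false;

    // Called from threads a, b and c when new data arrives.
    void putA(const SampleA& s) { std::lock_guard<std::mutex> lk(m); a = s; haveA = true; cv.notify_one(); }
    void putB(const SampleB& s) { std::lock_guard<std::mutex> lk(m); b = s; haveB = true; cv.notify_one(); }
    void putC(const SampleC& s) { std::lock_guard<std::mutex> lk(m); c = s; haveC = true; cv.notify_one(); }

    // Called from thread d: blocks until one fresh sample from each source
    // has arrived, then clears the flags for the next round.
    std::string synthesize() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return haveA && haveB && haveC; });
        haveA = haveB = haveC = false;
        return std::to_string(a.value) + "," + std::to_string(b.value) + "," + std::to_string(c.value);
    }
};
```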
What is the best signaling solution to notify the other threads (an event or a signal?!)
Message queues are blocking IPC, so if you wait on a single queue, the arrival of data is itself the signalling. I am not familiar with Mbed OS specifically, but most RTOSs only allow blocking on a single queue. You might combine queues A, B and C into one and include a data-source identifier in the message - that might be simpler. There are often good reasons, however, for separate queues; to wait for data from several queues, you might use a semaphore or task event flag that is given whenever A, B or C places data in its output queue. Thread d then waits on the semaphore/event and polls all three queues with zero timeout until all three are empty before returning to wait.
You have the same issue with thread E having two input queues.
Is the mentioned architecture practical?
Seems plausible - the abstract nature of your description does not allow me to determine whether it is workable or appropriate in your specific application, but it is not insane at least.

What is a real-world example of a priority queue?

I'm writing a lock-free C library, and I'm going to implement a priority queue. However, the goal of my library is not completeness of the data structures; I just want to implement some typical ones and then write a micro-benchmark to show that the lock-free ones perform better in some special cases than the lock-based ones. So I want to know whether there are some typical applications in which the priority queue plays an important role (open-source projects are best). Then I can use them as a benchmark.
A few to list:
1. Dijkstra's shortest-path algorithm (see the sketch after this list)
2. Prim's algorithm
3. Huffman coding for data compression
4. Heap sort
5. Load balancing on servers
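As a concrete instance of item 1, here is a minimal sketch of Dijkstra's algorithm driven by std::priority_queue over a hypothetical adjacency-list graph; it is the kind of workload a priority-queue benchmark could be built around:

```cpp
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// Dijkstra's shortest paths from `source` over a weighted adjacency list.
// adj[u] holds (neighbour, edge weight) pairs. The priority queue always
// yields the unsettled vertex with the smallest tentative distance.
std::vector<long long> dijkstra(const std::vector<std::vector<std::pair<int, int>>>& adj, int source)
{
    const long long INF = std::numeric_limits<long long>::max();
    std::vector<long long> dist(adj.size(), INF);
    using Entry = std::pair<long long, int>;                 // (distance, vertex)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;

    dist[source] = 0;
    pq.push({0, source});
    while (!pq.empty()) {
        auto [d, u] = pq.top();
        pq.pop();
        if (d > dist[u]) continue;                           // stale entry, skip
        for (auto [v, w] : adj[u]) {
            if (d + w < dist[v]) {
                dist[v] = d + w;
                pq.push({dist[v], v});
            }
        }
    }
    return dist;
}
```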
Various applications are pointed out in:
https://www.cdn.geeksforgeeks.org/applications-priority-queue/
Also, the Wikipedia article itself has an extensive list of applications, and parameters against which you can benchmark your comparison (refer to the section "Summary of running times"):
https://en.wikipedia.org/wiki/Priority_queue
Priority queues are different from queues in the sense that they do not act on the FIFO principle.
...The elements of the priority queue are ordered according to their natural ordering, or by a Comparator provided at queue construction time...
One real-world example is a priority scheduling algorithm, where each job is assigned a priority and the job with the highest priority is scheduled first.
The most common uses for priority queues that I see in real life are:
1) Prioritized work queues: when a thread is ready for more work, it picks the highest-priority available task from a priority queue. This is a great application for lock-free queues (a minimal lock-based baseline is sketched after this list).
2) Finding the closest restaurants/hotels/bathrooms/whatever to a given location. The algorithm to retrieve these from pretty much any spatial data structure uses a priority queue.
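For reference, a minimal lock-based version of the prioritized work queue in 1), the obvious baseline to benchmark a lock-free implementation against (the Task type here is hypothetical):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <vector>

// Hypothetical unit of work: higher `priority` is served first.
struct Task {
    int priority;
    std::function<void()> run;
};
struct ByPriority {
    bool operator()(const Task& a, const Task& b) const { return a.priority < b.priority; }
};

// Lock-based priority work queue.
class WorkQueue {
public:
    void push(Task t) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(t)); }
        cv_.notify_one();
    }
    Task pop() {                       // blocks until work is available
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        Task t = q_.top();             // highest-priority task
        q_.pop();
        return t;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::priority_queue<Task, std::vector<Task>, ByPriority> q_;
};
```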
When you build a product, you break things down into smaller chunks (stories), then assign a priority to each, then pick one up, work on it and close it.
JIRA stories are a relatable example of priority queues.

Must I use secondary command buffers when using multi-threaded rendering? How about a single queue vs. multiple queues?

I'm confused about using Primary command buffers and Secondary command buffers.
From this NVIDIA sample, I understand that 're-use cmd' means a primary command buffer and 're-use obj-level cmd' means secondary command buffers (one per object). Is that right?
The performance comparison shows that 're-use cmd' is better (faster) than 're-use obj-level cmd'. So I concluded that using only primary command buffers is better than using secondary command buffers, but all the samples seem to use secondary command buffers for multi-threaded rendering.
Must I use secondary command buffers when using multi-threaded rendering?
If I should support multiple queues, do I have to use all of them?
(multi-thread -> generate command buffers -> submit to multiple queues)
Or is using a single queue fine?
Must I use secondary command buffers when using multi-threaded rendering?
No. Strictly speaking, I don't see why you would have to. Of course, secondaries may make your life easier in a more complex program. You may run into architectural problems using only big primary command buffers in an advanced rendering engine.
The first method is faster because it is the most simplistic/trivial approach. You build the big primary command buffer once per scene (the driver must love that, having all the information in one place, well before it is needed) and then only submit it each frame (roughly zero CPU load per frame; see the sketch below). But just try, e.g., to draw a (general) dynamic scene with this method -- you will have a hard time.
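For what it's worth, the per-frame cost of that first method is essentially just the submit call, along the lines of this sketch (all handles are assumed to be created and valid elsewhere):

```cpp
#include <vulkan/vulkan.h>

// Per-frame work when the whole scene lives in one prebuilt primary command
// buffer: no recording at all, just a submit.
void submitPrebuiltScene(VkQueue graphicsQueue, VkCommandBuffer sceneCmd,
                         VkSemaphore waitSem, VkSemaphore signalSem, VkFence fence)
{
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;

    VkSubmitInfo submit = {};
    submit.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit.waitSemaphoreCount   = 1;
    submit.pWaitSemaphores      = &waitSem;
    submit.pWaitDstStageMask    = &waitStage;
    submit.commandBufferCount   = 1;
    submit.pCommandBuffers      = &sceneCmd;
    submit.signalSemaphoreCount = 1;
    submit.pSignalSemaphores    = &signalSem;

    vkQueueSubmit(graphicsQueue, 1, &submit, fence);
}
```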
If I should support multiple queues, do I have to use all of them? (multi-thread -> generate command buffers -> submit to multiple queues) Or is using a single queue fine?
Q: Should I try to use as many queues as possible?
TL;DR -- if you have to ask, one queue is probably fine. If you have some work that is highly independent (i.e. without the need for excessive synchronization), use multiple queues. (Writing to different Images is a good sign of independence.)
Of course old wisdom applies: Don't make assumptions about performance. You have to measure.
First, you should not take whatever document that comes from as presenting the only options for using Vulkan. There are many ways to use Vulkan, and those are just a few that NVIDIA was looking at in whatever presentation that comes from.
So I concluded that using only primary command buffers is better than using secondary command buffers, but all the samples seem to use secondary command buffers for multi-threaded rendering.
Case in point: those samples aren't doing the same thing as what NVIDIA is doing.
The two NVIDIA methods are:
Generate a single completely static and unchanging command buffer. It presumably contains everything that can ever be rendered ever. Presumably, you use memory (indirect rendering, UBO data, etc) to control where they appear, how to draw more than one, or to prevent them from being drawn at all.
Generate a single completely static and unchanging command buffer for each object. Presumably, you use memory to control where they appear. You then execute these per-object command buffers from a primary command buffer each frame.
Neither of these is threaded at all.
When people talk about threading rendering, what they're talking about is threading the creation of command buffers in the render loop. #2 doesn't create secondary command buffers during the render loop; they're static, one per object.
A typical threaded rendering system creates secondary command buffers that contain multiple objects and that will be executed from the primary command buffer later in that frame. Each thread records some number of objects into its own command buffer. That's what those threading samples are doing.
So you're comparing apples to oranges here.
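For reference, the threaded pattern described above boils down to something like the following sketch (heavily abridged; all handles are assumed to be created elsewhere, and each secondary command buffer comes from its own thread's command pool):

```cpp
#include <vulkan/vulkan.h>

// Each worker thread records its slice of the scene into a secondary
// command buffer, inside the render pass the primary will begin.
void recordSecondary(VkCommandBuffer secondaryCmd,
                     VkRenderPass renderPass,
                     VkFramebuffer framebuffer)
{
    VkCommandBufferInheritanceInfo inherit = {};
    inherit.sType       = VK_STRUCTURE_TYPE_COMMAND_BUFFER_INHERITANCE_INFO;
    inherit.renderPass  = renderPass;
    inherit.subpass     = 0;
    inherit.framebuffer = framebuffer;   // may be VK_NULL_HANDLE if not known yet

    VkCommandBufferBeginInfo begin = {};
    begin.sType            = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    begin.flags            = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT |
                             VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT;
    begin.pInheritanceInfo = &inherit;

    vkBeginCommandBuffer(secondaryCmd, &begin);
    // ... bind pipeline/descriptors and issue vkCmdDraw* calls for this
    //     thread's share of the objects ...
    vkEndCommandBuffer(secondaryCmd);
}

// One thread then stitches the secondaries together in the frame's primary
// command buffer inside the render pass.
void recordPrimary(VkCommandBuffer primaryCmd,
                   const VkRenderPassBeginInfo& renderPassBegin,
                   uint32_t secondaryCount,
                   const VkCommandBuffer* secondaries)
{
    VkCommandBufferBeginInfo begin = {};
    begin.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    begin.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;

    vkBeginCommandBuffer(primaryCmd, &begin);
    vkCmdBeginRenderPass(primaryCmd, &renderPassBegin,
                         VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS);
    vkCmdExecuteCommands(primaryCmd, secondaryCount, secondaries);
    vkCmdEndRenderPass(primaryCmd);
    vkEndCommandBuffer(primaryCmd);
}
```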
If I should support multiple queues, do I have to use all of them? (multi-thread -> generate command buffers -> submit to multiple queues) Or is using a single queue fine?
Use whatever you feel you need. Parallel queue operations tend to be for things like memory transfers or complex compute operations that are going to do work for the next frame. Rendering tends to be something you need to happen in a specific order. And inter-queue dependencies for separate rendering operations can be very difficult to manage.
Also, remember that Vulkan runs on lots of hardware. In particular, it only requires that hardware provide a single queue. So even if you do some multi-queue programming, you still need a path to support single-queue systems.

Algorithm to optimally distribute multiple writer threads on multiple physical disks

I have a logical store which has multiple physical disks assigned to it
STORE
X:\
Y:\
Z:\
...
I also have a pool of threads that write data (size unknown) to the STORE. Is there an algorithm which I can use (load balancing, scheduling... etc.) to help me determine on which physical disk I should write?
Factors to take under consideration:
Free disk space available.
Disk utilization (proper distribution of threads across physical disks).
Free space % on all disks should be more or less the same.
Notes:
Each thread has its own data to process, so a single thread can sleep if its data is not available.
Disks are not necessarily the same size.
One or more disks could be taken offline.
One or more disks could be added to the STORE.
UPDATE:
I should've explained the objective of these threads better in my question: these threads read from different data sources/streams and write immediately to the disk(s); buffering the streams in memory is not much of an option because their size tends to grow huge quickly.
Whatever you go with is going to require some tuning. What I describe below is a simple and effective starting point that might very well fit your needs.
First, I doubt that you actually need three threads to handle writing to three disk drives. The amount of processing required to orchestrate this is actually quite small.
As a first cut, you could do a simple round-robin scheduling with one thread and asynchronous writes. That is, you just have a circular queue that you fill with [X, Y, Z]. When a request comes in, you take a disk from the front of the queue and initiate an asynchronous write to that drive.
When the next request comes in, you again take the first item from the queue and issue an asynchronous write.
When an asynchronous write completes, the disk to which the data was written is added to the end of the queue.
If a drive is taken offline, it's removed from the queue. If a drive is added to the store, you make a new entry for it in the queue.
An obvious problem with the above is what to do if you get more concurrent write requests than you have drives. Using the technique I described above, the thread would have to block until there is a drive available. If you have to support bursts of activity, you could easily create a request queue into which requests are written (with their associated data). The thread doing the orchestration, then, would read an item from the queue, get a disk drive from the drive queue, and start the asynchronous write.
Note that with this setup, no drive can be doing more than a single write at a time. That's usually not a problem because the drive hardware typically can't handle multiple concurrent writes anyway.
Keeping free space percentage relatively the same across drives might not be much harder. You could keep track of the free space percentage on each drive easily enough, and rather than using a FIFO queue for the drives, use a priority queue so that you always write to the drive that has the highest free space percentage. That will work well as long as your average write size isn't a huge percentage of a drive's free space.
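A rough sketch of that priority-queue variant, with the orchestration on one thread and the actual write stubbed out (the Drive type, writeToDrive and drainRequests names are made up for illustration):

```cpp
#include <future>
#include <queue>
#include <string>
#include <vector>

// Hypothetical description of one drive in the STORE.
struct Drive {
    std::string root;          // e.g. "X:\\"
    double freePercent;        // refreshed after every completed write
};
struct ByFreeSpace {
    bool operator()(const Drive& a, const Drive& b) const {
        return a.freePercent < b.freePercent;   // top() = drive with the most free space
    }
};

// Placeholder: write `data` to a file under drive.root, then re-query the
// filesystem for the drive's new free-space percentage.
Drive writeToDrive(Drive drive, std::vector<char> data)
{
    (void)data;   // real I/O goes here
    return drive;
}

void drainRequests(std::queue<std::vector<char>>& requests,
                   std::priority_queue<Drive, std::vector<Drive>, ByFreeSpace>& drives)
{
    std::vector<std::future<Drive>> inFlight;
    while (!requests.empty()) {
        if (!drives.empty()) {
            Drive d = drives.top();
            drives.pop();
            // Issue the write asynchronously; the orchestrating thread never blocks on I/O.
            inFlight.push_back(std::async(std::launch::async,
                                          writeToDrive, d, std::move(requests.front())));
            requests.pop();
        } else {
            // All drives busy: wait for one write to finish, then return that
            // drive to the pool with its updated free-space figure.
            inFlight.front().wait();
            drives.push(inFlight.front().get());
            inFlight.erase(inFlight.begin());
        }
    }
    for (auto& f : inFlight) drives.push(f.get());   // collect the stragglers
}
```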
Edit
Note that I said asynchronous writes. So you can have as many concurrent writes as you have drives. Those writes are running concurrently and will notify on an I/O completion port when done. There's no need for multiple threads.
As for the priority queue, there are plenty of those to choose from, although finding a good concurrent priority queue is a bit more work. In the past I've just used locks to synchronize access to my own priority queue implementation. I guess I should formalize that at some point.
You could play with what I describe above by, for example, adding two or more entries in the queue for each drive. More for faster drives, fewer for slower drives. It's unclear how well that would work, but it's probably worth trying. If those "drives" are high performance network storage devices, they might actually be able to handle multiple concurrent writes better than a typical local disk drive can. But at some point you'll have to buffer writes because your computer can almost certainly create data much faster than your drives can write. The key is making your buffer large enough to handle the normal bursts of data, and also robust enough to block the program briefly if the buffer fills up.

Real World Examples of read-write in concurrent software

I'm looking for real world examples of needing read and write access to the same value in concurrent systems.
In my opinion, many semaphores or locks are present because there's no known alternative (to the implementer), but do you know of any patterns where mutexes seem to be a requirement?
In a way I'm asking for candidates for the standard set of HARD problems for concurrent software in the real world.
What kind of locks are used depends on how the data is being accessed by multiple threads. If you can fine tune the use case, you can sometimes eliminate the need for exclusive locks completely.
An exclusive lock is needed only if your use case requires that the shared data must be 100% exact all the time. This is the default that most developers start with because that's how we think about data normally.
However, if what you are using the data for can tolerate some "looseness", there are several techniques to share data between threads without the use of exclusive locks on every access.
For example, if you have a linked list of data and if your use of that linked list would not be upset by seeing the same node multiple times in a list traversal and would not be upset if it did not see an insert immediately after the insert (or similar artifacts), you can perform list inserts and deletes using atomic pointer exchange without the need for a full-stop mutex lock around the insert or delete operation.
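As a small illustration of the atomic-pointer-exchange idea, here is an insert at the head of a singly linked list using a compare-and-swap loop; note that lock-free deletion is considerably harder in practice (ABA problem, memory reclamation) and is not addressed by this sketch:

```cpp
#include <atomic>

struct Node {
    int   value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

// Lock-free insert at the head of the list: publish the new node with a
// compare-and-swap, retrying if another thread got in first.
void push_front(int value)
{
    Node* node = new Node{value, head.load(std::memory_order_relaxed)};
    while (!head.compare_exchange_weak(node->next, node,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
        // compare_exchange_weak updated node->next to the current head;
        // just try again.
    }
}
```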
Another example: if you have an array or list object that is mostly read from by threads and only occasionally updated by a master thread, you could implement lock-free updates by maintaining two copies of the list: one that is "live" that other threads can read from and another that is "offline" that you can write to in the privacy of your own thread. To perform an update, you copy the contents of the "live" list into the "offline" list, perform the update to the offline list, and then swap the offline list pointer into the live list pointer using an atomic pointer exchange. You will then need some mechanism to let the readers "drain" from the now offline list. In a garbage collected system, you can just release the reference to the offline list - when the last consumer is finished with it, it will be GC'd. In a non-GC system, you could use reference counting to keep track of how many readers are still using the list. For this example, having only one thread designated as the list updater would be ideal. If multiple updaters are needed, you will need to put a lock around the update operation, but only to serialize updaters - no lock and no performance impact on readers of the list.
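A sketch of that read-mostly double-buffering idea, using a reference-counted pointer to stand in for the GC or reference-counting machinery described above (std::atomic&lt;std::shared_ptr&lt;...&gt;&gt; requires C++20; older code can use the std::atomic_load/std::atomic_store overloads for shared_ptr instead):

```cpp
#include <atomic>
#include <memory>
#include <vector>

// "Live" list that reader threads consult; updated by a single writer thread.
std::atomic<std::shared_ptr<const std::vector<int>>> live{
    std::make_shared<const std::vector<int>>()};

// Readers: grab a snapshot; the reference count keeps it alive even if the
// writer swaps in a new version mid-read.
int sum_snapshot()
{
    std::shared_ptr<const std::vector<int>> snap = live.load();
    int sum = 0;
    for (int v : *snap) sum += v;
    return sum;
}

// Single writer: copy the live list "offline", modify the copy, then publish
// it with one atomic pointer swap. The old list is freed automatically when
// the last reader drops its reference.
void append_value(int value)
{
    std::shared_ptr<const std::vector<int>> old = live.load();
    auto next = std::make_shared<std::vector<int>>(*old);  // offline copy
    next->push_back(value);                                // private update
    live.store(std::shared_ptr<const std::vector<int>>(std::move(next)));
}
```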
All the lock-free resource sharing techniques I'm aware of require the use of atomic swaps (aka InterlockedExchange). This usually translates into a specific instruction in the CPU and/or a hardware bus lock (lock prefix on a read or write opcode in x86 assembler) for a very brief period of time. On multiproc systems, atomic swaps may force a cache invalidation on the other processors (this was the case on dual proc Pentium II) but I don't think this is as much of a problem on current multicore chips. Even with these performance caveats, lock-free runs much faster than taking a full-stop kernel event object. Just making a call into a kernel API function takes several hundred clock cycles (to switch to kernel mode).
Examples of real-world scenarios:
Producer/consumer workflows. A web service receives HTTP requests for data and places each request into an internal queue; a worker thread pulls the work item from the queue and performs the work. The queue is read/write and has to be thread-safe.
Data shared between threads with a change of ownership. Thread 1 allocates an object, tosses it to thread 2 for processing, and never wants to see it again. Thread 2 is responsible for disposing of the object. The memory management system (malloc/free) must be thread-safe.
File system. This is almost always an OS service and already fully thread-safe, but it's worth including in the list.
Reference counting. Release the resource when the number of references drops to zero. The increment/decrement/test operations must be thread-safe. These can usually be implemented using atomic primitives instead of full-stop kernel mutex locks.
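For the reference-counting case, a minimal sketch using atomic primitives only, with no kernel objects involved:

```cpp
#include <atomic>

// Intrusive reference count: addRef/release are safe to call from any thread
// without a mutex. The acquire/release ordering ensures the destructor sees
// all writes made while the object was still referenced.
class RefCounted {
public:
    void addRef() { refs_.fetch_add(1, std::memory_order_relaxed); }
    void release() {
        if (refs_.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete this;                 // last reference dropped
    }
protected:
    virtual ~RefCounted() = default;
private:
    std::atomic<int> refs_{1};           // creator starts with one reference
};
```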
Most real-world concurrent software has some form of synchronization requirement at some level. Often, better-written software will take great pains to reduce the amount of locking required, but it is still required at some point.
For example, I often run simulations where we have some form of aggregation operation occurring. Typically, there are ways to prevent locking during the simulation phase itself (i.e., use of thread-local state data, etc.), but the actual aggregation portion typically requires some form of lock at the end.
Luckily, this becomes a lock per thread, not per unit of work. In my case, this is significant, since I'm typically doing operations on hundreds of thousands or millions of units of work, but most of the time it's occurring on systems with 4-16 PEs, which means I'm usually restricted to a similar number of units of execution. By using this type of mechanism, you're still locking, but you're locking among tens of elements instead of potentially millions.

Resources