After reading the OpenCL 1.1 standard I still can't grasp whether an in-order command queue guarantees memory visibility for any pair of commands (not only kernels) according to their enqueue order.
OpenCL standard 1.1 section 5.11 states:
If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a
command-queue is not set, the commands enqueued to a command-queue
execute in order. For example, if an application calls clEnqueueNDRangeKernel to execute kernel A followed by a
clEnqueueNDRangeKernel to execute kernel B, the application can
assume that kernel A finishes first and then kernel B is executed. If
the memory objects output by kernel A are inputs to kernel B then
kernel B will see the correct data in memory objects produced by execution of kernel A.
What about clEnqueueWriteBuffer (non-blocking) and clEnqueueNDRangeKernel enqueued after, which uses that buffer contents?
AFAIK, 'finishes execution' does not imply that corresponding writes are visible (due to relaxed consistency). For example, section 5.10 states specifically:
The clEnqueueBarrier command ensures that all queued commands in
command_queue have finished execution before the next batch of
commands can begin execution. The clEnqueueBarrier command is a
synchronization point.
In other words, should I rely on the other 'synchronization point'-related rules (events, etc.), or do I get memory synchronization out of the box for all the commands in an in-order queue?
What about clEnqueueWriteBuffer (non-blocking) and
clEnqueueNDRangeKernel enqueued after, which uses that buffer
contents?
Since it is an in-order queue, the write is performed first and the kernel runs only after the write finishes, even if the write call is non-blocking.
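To make that concrete, here is a minimal host-side sketch (assuming an already-created in-order queue, buffer and kernel; the names and sizes are placeholders, not taken from the question):

#include <CL/cl.h>

/* Sketch only: queue is an in-order queue; buf, kernel and host_data
 * are assumed to have been created/allocated elsewhere. */
void write_then_run(cl_command_queue queue, cl_mem buf, cl_kernel kernel,
                    const void *host_data, size_t size, size_t global_size)
{
    /* Non-blocking write: the call returns immediately on the host... */
    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, size, host_data,
                         0, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    /* ...but on an in-order queue this kernel cannot start before the
     * write completes, so it sees the new buffer contents. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);

    clFinish(queue);   /* needed only to synchronize with the host
                          (and to keep host_data valid until the copy is done) */
}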
clEnqueueBarrier is a device-side synchronization command and is intended to work with out-of-order queues. When you use clFinish(), you make the API wait longer because of the communication between host and device. An enqueued barrier is much faster synchronization, but on the device side only. When you need to synchronize a queue with another queue and still need a similar sync point, you should use clEnqueueWaitForEvents just after (or before) the barrier, or simply use only the event waiting (for an in-order queue).
In OpenCL 1.2, clEnqueueWaitForEvents and clEnqueueBarrier were merged into clEnqueueBarrierWithWaitList, which lets you both barrier an out-of-order queue and synchronize it with other queues or even host-raised events.
If there is only a single in-order queue, you don't need a barrier, and when you need to synchronize with the host, you can use clFinish or an event-based synchronization command.
or I get memory synchronization out-of-the-box for all the commands in
an in-order queue?
For enqueue-type commands only, yes. Enqueue (1 write + 1 compute + 1 read) operations 128 times in an in-order queue and they will all run one after another, completing a 128-step simulation (after they are issued by a flush/finish command). Commands don't have to follow a specific pattern for this implicit synchronization; anything like 1 write + 2 reads + 2 kernels + 5 writes + 1 read + 1 kernel + 15 reads runs one after another (2 kernels = 1 kernel + 1 kernel).
For non-enqueue-type commands such as clSetKernelArg, you have to use a synchronization point, or call them before enqueuing any commands.
You can also use enqueued commands themselves as an inter-queue sync point through their event wait list parameter, and use the event parameter to get a completion event to be waited on in another queue (signaling), but that is still not a barrier for an out-of-order queue.
If a buffer is used by two kernels that are in different queues and both write to that buffer, there must be synchronization between the queues unless they write to different locations. So you can have 20 kernels, each working on 1/20th of a buffer, run in parallel using multiple queues and finally synchronize all queues only at the end using a wait list. If a kernel uses or alters another kernel's region concurrently, it is undefined behaviour. A similar process can be done for map/unmap too.
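A hedged sketch of that inter-queue signaling via an event (queue and kernel names are made up; both queues are assumed to share the same context and device):

#include <CL/cl.h>

/* kernel_b on queue_b consumes what kernel_a on queue_a produced. */
void cross_queue_sync(cl_command_queue queue_a, cl_command_queue queue_b,
                      cl_kernel kernel_a, cl_kernel kernel_b, size_t gws)
{
    cl_event a_done;

    /* Ask the producing command for a completion event. */
    clEnqueueNDRangeKernel(queue_a, kernel_a, 1, NULL, &gws, NULL,
                           0, NULL, &a_done);
    clFlush(queue_a);   /* make sure the producer is actually issued */

    /* The consuming command on the other queue waits on that event. */
    clEnqueueNDRangeKernel(queue_b, kernel_b, 1, NULL, &gws, NULL,
                           1, &a_done, NULL);

    clReleaseEvent(a_done);
}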
in-order vs out-of-order example:
r: read, w: write, c: compute
<------------clFinish----------------------->
in-order queue....: rwrwrwcwccwccwrcwccccccwrcwcwrwcwrwccccwrwcrw
out-of-order queue: <--r--><-r-><-c-><-----c-----><-----r-----><w>
<---w-------><-------r-----><------c----->
<---r---><-----c--------------------><--c->
<---w--->
<---c-----> <----w------>
and another out-of-order queue with a barrier in the middle:
<---r---><--w---> | <-----w---->
<---c------> | <----c---> <----w--->
<---c--------> | <------r------->
<----w------> | <----c------->
where read/write operations before the barrier are forced to wait until all commands reach the same barrier; then all remaining ones continue concurrently.
The last example shows that memory visibility from the "host side" can be acquired with a barrier or clFinish. But a barrier doesn't inform the host that it has finished, so you need to query events about the queue. clFinish blocks until all commands are finished, so you don't need to query anything. Both will make the host see the most up-to-date memory.
Your question is about memory visibility for commands of an in-order queue, so you don't need a synchronization point for them to see each other's most up-to-date values.
Each kernel execution is also a synchronization point between its work-groups, so work-groups can't see other groups' data until the kernel finishes; all data is prepared and becomes visible at the end of kernel execution, so the next kernel can use it immediately.
I haven't tried reading data concurrently from device to host without any synchronization points, but it may work for some devices that don't cache any data. Even integrated GPUs have their dedicated L3 caches, so it would need at least a barrier command once in a while to let the host read some updated (but possibly partially re-updated, in-flight) data. Event-based synchronization is faster than clFinish and gives the host correct memory data. A barrier is also faster than clFinish, but usable only for device-side sync points.
If I understand correctly,
Sync Point                         Memory visibility
---------------------------------  ----------------------------------
in-kernel fence                    same work-item (and wavefront?)
in-kernel local memory barrier     local memory in same work-group
in-kernel global memory barrier    global memory in same work-group
in-kernel atomics                  only other atomics in same kernel
enqueued kernel/command            next kernel/command in same queue
enqueued barrier                   following commands in same device
enqueued event wait                host
clFinish                           host
https://www.khronos.org/registry/OpenCL/sdk/1.1/docs/man/xhtml/clEnqueueMapBuffer.html
If the buffer object is created with CL_MEM_USE_HOST_PTR set in
mem_flags, the host_ptr specified in clCreateBuffer is guaranteed to
contain the latest bits in the region being mapped when the
clEnqueueMapBuffer command has completed; and the pointer value
returned by clEnqueueMapBuffer will be derived from the host_ptr
specified when the buffer object is created.
and
https://www.khronos.org/registry/OpenCL/sdk/1.1/docs/man/xhtml/clEnqueueWriteBuffer.html
All commands that use this buffer object or a memory object (buffer or
image) created from this buffer object have finished execution before
the read command begins execution.
so it doesn't say anything like a barrier or sync. Completion is just enough.
From the spec:
In-order Execution: Commands are launched in the order they appear in the command-queue and complete in order. In other words, a prior
command on the queue completes before the following command begins.
This serializes the execution order of commands in a queue.
In the case of in-order queues, all commands in a queue execute in order, so no extra synchronization is required.
Related
Code within a single thread has certain memory guarantees, such as read after write (i.e. writing some value to a memory location, then reading it back should give the value you wrote).
What happens to such memory guarantees if a thread is rescheduled to execute on a different CPU core? Say a thread writes 10 to memory location X, then gets rescheduled to a different core. That core's L1 cache might have a different value for X (from another thread that was executing on that core previously), so now a read of X wouldn't return 10 as the thread expects. Is there some L1 cache synchronization that occurs when a thread is scheduled on a different core?
All that is required in this case is that the writes performed while on the first processor become globally visible before the process begins executing on the second processor. In the Intel 64 architecture this is accomplished by including one or more instructions with memory fence semantics in the code that the OS uses to transfer the process from one core to another. An example from the Linux kernel:
/*
 * Make previous memory operations globally visible before
 * sending the IPI through x2apic wrmsr. We need a serializing instruction or
 * mfence for this.
 */
static inline void x2apic_wrmsr_fence(void)
{
        asm volatile("mfence" : : : "memory");
}
This ensures that the stores from the original core are globally visible before execution of the inter-processor interrupt that will start the thread running on the new core.
Reference: Sections 8.2 and 8.3 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384-071, October 2019).
TL;DR: It depends on the architecture and the OS. On x86, this type of read-after-write hazard is mostly not an issue that has to be considered at the software level, except for weakly-ordered WC stores, which require a store fence to be executed in software on the same logical core before the thread is migrated.
Usually the thread migration operation includes at least one memory store. Consider an architecture with the following property:
The memory model is such that memory stores may not become globally observable in program order. This Wikipedia article has a not-accurate-but-good-enough table that shows examples of architectures that have this property (see the row "Stores can be reordered after stores ").
The ordering hazard you mentioned may be possible on such an architecture because even if the thread migration operation completes, it doesn't necessarily mean that all the stores that the thread has performed are globally observable. On architectures with strict sequential store ordering, this hazard cannot occur.
On a completely hypothetical architecture where it's possible to migrate a thread without doing a single memory store (e.g., by directly transferring the thread's context to another core), the hazard can occur even if all stores are sequential on an architecture with the following property:
There is a "window of vulnerability" between the time when a store retires and when it becomes globally observable. This can happen, for example, due to the presence of store buffers and/or MSHRs. Most modern processors have this property.
So even with sequential store ordering, it may be possible that the thread running on the new core may not see the last N stores.
Note that on a machine with in-order retirement, the window of vulnerability is a necessary but insufficient condition for a memory model that supports stores that may not be sequential.
Usually a thread is rescheduled to run on a different core using one of the following two methods:
A hardware interrupt, such as a timer interrupt, occurs that ultimately causes the thread to be rescheduled on a different logical core.
The thread itself performs a system call, such as sched_setaffinity, that ultimately causes it to run on a different core.
The question is at which point does the system guarantee that retired stores become globally observable? On Intel and AMD x86 processors, hardware interrupts are fully serializing events, so all user-mode stores (including cacheable and uncacheable) are guaranteed to be globally observable before the interrupt handler is executed, in which the thread may be rescheduled to run a different logical core.
On Intel and AMD x86 processors, there are multiple ways to perform system calls (i.e., change the privilege level), including INT, SYSCALL, SYSENTER, and far CALL. None of them guarantee that all previous stores become globally observable. Therefore, the OS is supposed to do this explicitly when scheduling a thread on a different core by executing a store fence operation. This is done as part of saving the thread context (architectural user-mode registers) to memory and adding the thread to the queue associated with the other core. These operations involve at least one store that is subject to the sequential ordering guarantee. When the scheduler runs on the target core, it would see that the full register and memory architectural state of the thread (at the point of the last retired instruction) is available on that core.
On x86, if the thread uses stores of type WC, which do not guarantee sequential ordering, the OS may not guarantee in this case that it will make these stores globally observable. The x86 spec explicitly states that in order to make WC stores globally observable, a store fence has to be used (either in the thread on the same core or, much simpler, in the OS). An OS generally should do this, as mentioned in @JohnDMcCalpin's answer. Otherwise, if the OS doesn't provide the program-order guarantee to software threads, then the user-mode programmer may need to take this into account. One way would be the following:
Save a copy of the current CPU mask and pin the thread to the current core (or any single core).
Execute the weakly-ordered stores.
Execute a store fence.
Restore the CPU mask.
This temporarily disables migration to ensure that the store fence is executed on the same core as the weakly-ordered stores. After executing the store fence, the thread can safely migrate without possibly violating program order.
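On Linux/x86 that recipe might look roughly like the following (a sketch, not production code; the weakly-ordered stores themselves are left as a comment):

#define _GNU_SOURCE
#include <sched.h>
#include <xmmintrin.h>   /* _mm_sfence */

int wc_stores_then_fence(void)
{
    cpu_set_t old_mask, pin_mask;

    /* 1. Save the current CPU mask and pin the thread to one core. */
    if (sched_getaffinity(0, sizeof(old_mask), &old_mask) != 0)
        return -1;
    CPU_ZERO(&pin_mask);
    CPU_SET(sched_getcpu(), &pin_mask);
    if (sched_setaffinity(0, sizeof(pin_mask), &pin_mask) != 0)
        return -1;

    /* 2. Execute the weakly-ordered (e.g. non-temporal/WC) stores here. */

    /* 3. Store fence, executed on the same core as the stores. */
    _mm_sfence();

    /* 4. Restore the CPU mask; migration is safe again. */
    return sched_setaffinity(0, sizeof(old_mask), &old_mask);
}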
Note that user-mode sleep instructions, such as UMWAIT, cannot cause the thread to be rescheduled on a different core because the OS does not take control in this case.
Thread Migration in the Linux Kernel
The code snippet from @JohnDMcCalpin's answer falls on the path to send an inter-processor interrupt, which is achieved using a WRMSR instruction to an APIC register. An IPI may be sent for many reasons, for example to perform a TLB shootdown operation. In this case, it's important to ensure that the updated paging structures are globally observable before invalidating the TLB entries on the other cores. That's why x2apic_wrmsr_fence may be needed; it is invoked just before sending an IPI.
That said, I don't think thread migration requires sending an IPI. Essentially, a thread is migrated by removing it from some data structure that is associated with one core and adding it to the one associated with the target core. A thread may be migrated for numerous reasons, such as when the affinity changes or when the scheduler decides to rebalance the load. As can be seen in the Linux source code, all paths of thread migration end up executing the following:
stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg)
where arg holds the task to be migrated and the destination core identifier. migration_cpu_stop is the function that does the actual migration. However, the task to be migrated may be currently running or waiting in some runqueue to run on the source core (i.e., the core on which the task is currently scheduled). It's required to stop the task before migrating it. This is achieved by adding the call to the function migration_cpu_stop to the queue of the stopper task associated with the source core. stop_one_cpu then sets the stopper task as ready for execution. The stopper task has the highest priority, so on the next timer interrupt on the source core (which could be the same as the current core), one of the tasks with the highest priority will be selected to run. Eventually, the stopper task will run and execute migration_cpu_stop, which in turn performs the migration. Since this process involves a hardware interrupt, all stores of the target task are guaranteed to be globally observable.
There appears to be a bug in x2apic_wrmsr_fence
The purpose of x2apic_wrmsr_fence is to make all previous stores globally observable before sending the IPI. As discussed in this thread, SFENCE is not sufficient here. To see why, consider the following sequence:
store
sfence
wrmsr
The store fence here can order the preceding store operation, but not the MSR write. The WRMSR instruction doesn't have any serializing properties when writing to an APIC register in x2APIC mode. This is mentioned in the Intel SDM volume 3 Section 10.12.3:
To allow for efficient access to the APIC registers in x2APIC mode,
the serializing semantics of WRMSR are relaxed when writing to the
APIC registers.
The problem here is that MFENCE is also not guaranteed to order the later WRMSR with respect to previous stores. On Intel processors, it's documented to order only memory operations; only on AMD processors is it guaranteed to be fully serializing. So to make it work on Intel processors, there needs to be an LFENCE after the MFENCE (SFENCE is not ordered with LFENCE, so MFENCE must be used even though we don't need to order loads). In fact, Section 10.12.3 mentions this.
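If that reading is right, a fixed sequence for Intel would look something like this illustrative sketch (not the actual kernel code; the WRMSR itself must of course run in kernel context):

/* Illustrative only: on Intel, order prior stores before a later WRMSR
 * by following MFENCE with LFENCE, as Section 10.12.3 suggests. */
static inline void wrmsr_fence_intel_sketch(void)
{
    asm volatile("mfence; lfence" : : : "memory");
    /* ... wrmsr to the x2APIC ICR would follow here ... */
}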
If a platform is going to support moving a thread from one core to another, whatever code does that moving must respect whatever guarantees a thread is allowed to rely on. If a thread is allowed to rely on the guarantee that a read after a write will see the updated value, then whatever code migrates a thread from one core to another must ensure that guarantee is preserved.
Everything else is platform specific. If a platform has an L1 cache then hardware must make that cache fully coherent or some form of invalidation or flushing will be necessary. On most typical modern processors, hardware makes the cache only partially coherent because reads can also be prefetched and writes can be posted. On x86 CPUs, special hardware magic solves the prefetch problem (the prefetch is invalidated if the L1 cache line is invalidated). I believe the OS and/or scheduler has to specifically flush posted writes, but I'm not entirely sure and it may vary based on the exact CPU.
The CPU goes to great cost to ensure that a read will always see a previous write in the same instruction stream. For an OS to remove this guarantee and require all user-space code to work without it would be a complete non-starter, since user-space code has no way to know where in its code it might get migrated.
Adding my two bits here. At first glance, a barrier seems like overkill (answers above).
Consider this logic: when a thread wants to write to a cacheline, HW cache coherence kicks in and we need to invalidate all other copies of the cacheline that are present with other cores in the system; the write doesn't proceed without the invalidations. When a thread is re-scheduled to a different core then, it will have to fetch the cacheline from the L1-cache that has write permission thereby maintaining read-after-write sequential behavior.
The problem with this logic is that invalidations from cores aren't applied immediately, hence it is possible to read a stale value after being rescheduled (the read to the new L1-cache somehow beats the pending invalidation present in a queue with that core). This is ok for different threads because they are allowed to slip and slide, but with the same thread a barrier becomes essential.
The golang application is a tool that receives files by invoking a C library, saves them to disk, and reports the transfer state to a monitoring service over HTTP.
After a few transfers, I found there were about 70+ threads in existence with only a few goroutines.
I checked the C and Go source code; no thread or goroutine leak was found.
I used "dlv" to debug the application; here is the stack of one such thread:
(dlv) bt
0 0x000000000046df03 in runtime.futex
at /home/vagrant/resource/go/src/runtime/sys_linux_amd64.s:388
1 0x0000000000437e92 in runtime.futexsleep
at /home/vagrant/resource/go/src/runtime/os_linux.go:45
2 0x000000000041e042 in runtime.notesleep
at /home/vagrant/resource/go/src/runtime/lock_futex.go:145
3 0x000000000044036d in runtime.stopm
at /home/vagrant/resource/go/src/runtime/proc.go:1594
4 0x0000000000441178 in runtime.findrunnable
at /home/vagrant/resource/go/src/runtime/proc.go:2021
5 0x0000000000441cec in runtime.schedule
at /home/vagrant/resource/go/src/runtime/proc.go:2120
6 0x0000000000442063 in runtime.park_m
at /home/vagrant/resource/go/src/runtime/proc.go:2183
7 0x0000000000469f1b in runtime.mcall
at /home/vagrant/resource/go/src/runtime/asm_amd64.s:240
I don't know where these threads come from; maybe they are a thread pool of the Go runtime?
Could anyone take a look at this? Thank you very much!
The problem
The golang application is a tool that receives files by invoking a C
library, saves them to disk, and reports the transfer state to a monitoring
service over HTTP.
After a few transfers, I found there were about 70+ threads in existence
with only a few goroutines.
The cause
Each call to C (via cgo, or syscall on Windows, etc.) is not really
different from performing an OS system call as far as the Go scheduler
is concerned.
What happens is this:
When a goroutine is being executed, it runs on an OS thread
(this is sort of obvious, I fathom).
When it performs a syscall or calls C, that goroutine blocks
(stops executing Go code).
The Go runtime scheduler watches the goroutines which got blocked,
and if, after at least a single "scheduler tick" (which currently — in
Go 1.8 and 1.9 — is 20 µs) has passed, the goroutine is still blocked
and there are other runnable goroutines,
the scheduler creates another OS thread to let the other goroutines
continue execution.
This behaviour might appear to be counter-intuitive at first
but without it, on, say, a two-CPU machine, you would need to just call
two syscalls (such as reading or writing a file) in parallel from
any two goroutines to block the rest of the active goroutines
from doing their work.
In other words, the scheduler tries to keep Go's promise
of always having up to GOMAXPROCS goroutines running
if there are goroutines which want to run, and GOMAXPROCS
is set to the number of CPUs (cores) of the machine.
So, what happens is that if you have a reasonably high churn of C calls which complete slower than that single scheduler tick, you'll have a growing pool
of allocated OS threads.
Note that this is not bad in itself: sure, you'll be allocating resources
(on a typical commodity OS each thread has some 8 MiB of stack allocated
plus some bookkeeping data structures internal to the OS) but they are
not wasted: these threads will get reused as soon as they are needed.
Say, your next burst of such C calls will reuse the allocated threads.
The solution
Still, if you'd like to prevent that from happening, the common approach
is to reasonably serialize your C calls.
A typical approach to this is to have a single "worker" goroutine
which receives "tasks" — in the form of values of some type, usually
a custom type created by you — over a channel and sends the results of
their execution over another channel.
The input channel may be buffered — effectively turning it into a queue.
If you'd still want to parallelize that work, you can have a pool of
worker goroutines — all reading the single input channel and writing to
the single output channel.
But note that if those C calls spend most of their time doing disk I/O
and the files they read/write are located on the filesystem which
is backed by a single medium, you usually won't gain much with
parallelizing unless that medium is blazingly fast — such as SSD or
in-memory (RAM) disk.
So consider all the options and think through your design.
In my application it is imperative that "state" and "graphics" are processed in separate threads. So for example, the "state" thread is only concerned with updating object positions, and the "graphics" thread is only concerned with graphically outputting the current state.
For simplicity, let's say that the entirety of the state data is contained within a single VkBuffer. The "state" thread creates a Compute Pipeline with a Storage Buffer backed by the VkBuffer, and periodically vkCmdDispatchs to update the VkBuffer.
Concurrently, the "graphics" thread creates a Graphics Pipeline with a Uniform Buffer backed by the same VkBuffer, and periodically draws/vkQueuePresentKHRs.
Obviously there must be some sort of synchronization mechanism to prevent the "graphics" thread from reading from the VkBuffer whilst the "state" thread is writing to it.
The only idea I have is to use a host mutex from vkQueueSubmit to vkWaitForFences in both threads.
I want to know, is there perhaps some other method that is more efficient or is this considered to be OK?
Try using semaphores. They are used to synchronize operations solely on the GPU, which is much more optimal than waiting in the app and submitting work after previous work is fully processed.
When You submit work You can provide a semaphore which gets signaled when this work is finished. When You submit another work You can provide the same semaphore on which the second batch should wait. Processing of the second batch will start automatically when the semaphore gets signaled (this semaphore is also automatically unsignaled and can be reused).
(I think there are some constraints on using semaphores, associated with queues. I will update the answer later when I confirm this but they should be sufficient for Your purposes.
[EDIT] There are constraints on using semaphores but it shouldn't affect You - when You use a semaphore as a wait semaphore during submission, no other queue can wait on the same semaphore.)
There are also events in Vulkan which can be used for similar purposes but their use is a little bit more complicated.
If You really need to synchronize the GPU and Your application, use fences. They are signaled in a similar way as semaphores. But You can check their state on the app side, and You need to manually unsignal them before You can use them again.
[EDIT]
I've added an image that more or less shows what I think You should do. One thread calculates state and with each submission adds a semaphore to the top of the list (or a ring buffer, as @NicolasBolas wrote). This semaphore gets signaled when the submission is finished (it is provided in pSignalSemaphores during the "compute" batch submission).
The second thread renders Your scene. It manages its own list of semaphores, similarly to the compute thread. But when You want to render things, You need to be sure that the compute thread has finished its calculations. That's why You need to take the latest "compute" semaphore and wait on it (provide it in pWaitSemaphores during the "render" batch submission). When You submit rendering commands, the compute thread can't start and modify the data because it may influence the results of the rendering. So the compute thread also needs to wait until the most recent rendering is done. That's why the compute thread also needs to provide a wait semaphore (the most recent "rendering" semaphore).
You just need to synchronize submissions. The rendering thread cannot submit while the compute thread submits commands, and vice versa. That's why adding semaphores to the lists (and taking semaphores from the lists) should be synchronized. But this has nothing to do with Vulkan. Probably some mutex will be helpful (for example a C++-ish std::lock_guard<std::mutex>). But this synchronization is a problem only when You have a single buffer.
Another thing is what to do with old semaphores from both lists. You cannot directly check what is their state and You cannot directly unsignal them. The state of semaphores can be checked by using additional fences provided with each submission. You don't wait on them but from time to time check if a given fence is signaled and, if it is, You can destroy old semaphore (as You cannot unsignal it from the application) or You can make an empty submission, with no command buffers, and use that semaphore as a wait semaphore. This way the semaphore will be unsignaled and You can reuse it. But I don't know which solution is more optimal: destroying old and creating new semaphores, or unsignaling them with empty submissions.
When You have a single buffer, a one-element list/ring is probably enough. But a more optimal solution would have some kind of a ping-pong set of buffers: You read data from one buffer but store results in another, and in the next step You swap them. That's why, in the image above, the lists of semaphores (rings) may have more elements depending on Your setup. The more independent buffers and semaphores in the lists (of course up to some reasonable count), the better performance You will get, as You reduce time wasted on waiting. But this complicates Your code and it may also increase lag (the rendering thread gets data that is a bit older than the data currently processed by the compute thread). So You may need to balance performance, code complexity and rendering lag.
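As a rough sketch of the compute-side submission described above (names such as computeDone/renderDone and the one-semaphore-deep "ring" are assumptions for illustration, not the only way to do it):

#include <vulkan/vulkan.h>

/* Compute thread: wait on the latest "render" semaphore (except the very
 * first time), signal computeDone, and give the host a fence to poll. */
void submit_compute(VkQueue computeQueue, VkCommandBuffer computeCmd,
                    VkSemaphore renderDone, VkSemaphore computeDone,
                    VkFence computeFence, int firstSubmit)
{
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;

    VkSubmitInfo submit = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
    submit.waitSemaphoreCount   = firstSubmit ? 0 : 1;
    submit.pWaitSemaphores      = firstSubmit ? NULL : &renderDone;
    submit.pWaitDstStageMask    = firstSubmit ? NULL : &waitStage;
    submit.commandBufferCount   = 1;
    submit.pCommandBuffers      = &computeCmd;
    submit.signalSemaphoreCount = 1;
    submit.pSignalSemaphores    = &computeDone;   /* rendering waits on this */

    vkQueueSubmit(computeQueue, 1, &submit, computeFence);
}

The render submission would mirror this: wait on computeDone, signal renderDone.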
How you do this depends on two factors:
Whether you want to dispatch the compute operation on the same queue as its corresponding graphics operation.
The ratio of compute operations to their corresponding graphics operations.
#2 is the most important part.
Even though they are generated in separate threads, there must be at least some idea that the graphics operation is being fed by a particular compute operation (otherwise, how would the graphics thread know where the data is to read from?). So, how do you do that?
At the end of the day, that part has nothing to do with Vulkan. You need to use some inter-thread communication mechanism to allow the graphics thread to ask, "which compute task's data should I be using?"
Typically, this would be done by having the compute thread add every compute operation it does to some kind of circular buffer (thread-safe of course. And non-locking). When the graphics thread goes to decide where to read its data from, it asks the circular buffer for the most recently added compute operation.
In addition to the "where to read its data from" information, this would also provide the graphics thread with an appropriate Vulkan synchronization primitive to use to synchronize its command buffer(s) with the compute operation's CB.
If the compute and graphics operations are being dispatched on the same queue, then this is pretty simple. There doesn't have to actually be a synchronization primitive. So long as the graphics CBs are issued after the compute CBs in the batch, all the graphics CBs need is to have a vkCmdPipelineBarrier at the front which waits on all memory operations from the compute stage.
srcStageMask would be STAGE_COMPUTE_SHADER_BIT, with dstStageMask being, well, pretty much everything (you could narrow it down, but it won't matter, since at the very least your vertex shader stage will need to be there).
You would need a single VkMemoryBarrier in the pipeline barrier. Its srcAccessMask would be SHADER_WRITE_BIT, while the dstAccessMask would be however you intend to read it. If the compute operations wrote some vertex data, you need VERTEX_ATTRIBUTE_READ_BIT. If they wrote some uniform buffer data, you need UNIFORM_READ_BIT. And so on.
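Recorded into the graphics command buffer, that barrier could look roughly like this (a sketch; the exact dstStageMask/dstAccessMask depend on how you actually read the data):

#include <vulkan/vulkan.h>

/* Same-queue case: compute shader writes, the graphics pipeline reads. */
void record_compute_to_graphics_barrier(VkCommandBuffer graphicsCmd)
{
    VkMemoryBarrier barrier = { VK_STRUCTURE_TYPE_MEMORY_BARRIER };
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT |
                            VK_ACCESS_UNIFORM_READ_BIT;

    vkCmdPipelineBarrier(graphicsCmd,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  /* srcStageMask */
                         VK_PIPELINE_STAGE_ALL_GRAPHICS_BIT,    /* dstStageMask */
                         0,                                     /* dependencyFlags */
                         1, &barrier,                           /* memory barrier */
                         0, NULL,                               /* no buffer barriers */
                         0, NULL);                              /* no image barriers */
}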
If you're dispatching these operations on separate queues, that's where you need an actual synchronization object.
There are several problems:
You cannot detect if a Vulkan semaphore has been signaled by user code. Nor can you set a semaphore to the unsignaled state by user code. Nor can you reasonably submit a batch that has a semaphore in it that is currently signaled and nobody's waiting on it. You can do the latter, but it won't do the right thing.
In short, you can never submit a batch that signals a semaphore unless you are certain that some process is going to wait for it.
You cannot issue a batch that waits on a semaphore, unless a batch that signals it is "pending execution". That is, your graphics thread cannot vkQueueSubmit its batch until it is certain that the compute queue has submitted its signaling batch.
So what you have to do is this. When the graphics thread goes to get its compute data, it must send a signal to the compute thread to add a semaphore to its next submit call. When the graphics thread submits its graphics operation, it then waits on that semaphore.
But to ensure proper ordering, the graphics thread cannot submit its operation until the compute thread has submitted the semaphore signaling operation. That requires a CPU-synchronization operation of some form. It could be as simple as the graphics thread polling an atomic variable set by the compute thread.
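That CPU-side hand-off could be as small as a single atomic flag, for example (an illustrative C11 sketch; names are invented):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool compute_submit_done = false;

/* Compute thread: right after vkQueueSubmit of the signaling batch. */
void compute_announce_submit(void)
{
    atomic_store_explicit(&compute_submit_done, true, memory_order_release);
}

/* Graphics thread: before submitting the batch that waits on the semaphore. */
void graphics_wait_for_compute_submit(void)
{
    while (!atomic_load_explicit(&compute_submit_done, memory_order_acquire))
        ;   /* poll; real code might sleep or use a condition variable */
}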
I'm writing a Linux device driver and am pretty new at this so I'm learning quickly how NOT to do things. I'm currently using a couple of mutexes to prevent some functions from concurrently reading from the device and running into deadlocks on resume from suspend. My problem is two-fold:
1) Interrupt handler schedules a workqueue to read from the FIFO of the device and process data. FIFO needs to be read uninterrupted by other reads so I have placed a mutex (A) lock/unlock in the read, write functions.
2) Device configuration function is a sequence of read and writes using the same read, write functions as above that must be done uninterrupted by other reads or writes so I have placed a mutex (B) lock/unlock in the config functions. Device configuration functions are called by SYSFS nodes.
The issue appears to be when the system resumes from suspend, an interrupt triggers the FIFO call and at near the same time higher layers write to the SYSFS nodes to set configuration parameters and the system seems to deadlock during configuration sequence. Is my issue just that I'm using mutex which sleeps where I should be using a spinlock? Or am I going about this the wrong way?
Get an interrupt.
Ack/disable interrupt in interrupt handler.
Start work queue.
Do high-priority processing, like get data off device and onto queue.
Enable device interrupt and go process lower-priority data in work queue.
Two different mutexes here can't work, because the locking order may be A->B [i.e., trying to acquire mutex B while holding mutex A] on one path while another path is B->A, which is a textbook deadly embrace.
The solution is to restructure your processing into high-priority work (a very limited task) that hands data to the lower-priority work.
If the test for busy/available is more than just a yes/no test, use a condition variable to guard the complex testing that must be done.
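A userspace pthread sketch of that shape (the kernel version would use a spinlock in the IRQ path and a wait queue, but the structure is the same; all names are illustrative):

#include <pthread.h>
#include <stdbool.h>

/* One lock, one condition variable, one hand-off flag. */
static pthread_mutex_t lock       = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  data_ready = PTHREAD_COND_INITIALIZER;
static bool            have_data  = false;

/* High-priority side (e.g. the FIFO-drain work): publish and signal. */
void producer_publish(void)
{
    pthread_mutex_lock(&lock);
    /* ... copy FIFO contents into a shared buffer ... */
    have_data = true;
    pthread_cond_signal(&data_ready);
    pthread_mutex_unlock(&lock);
}

/* Low-priority side (e.g. the configuration path): the complex
 * busy/available test lives inside the while loop, under one lock. */
void consumer_wait_and_process(void)
{
    pthread_mutex_lock(&lock);
    while (!have_data)
        pthread_cond_wait(&data_ready, &lock);
    have_data = false;
    /* ... perform the read/write sequence ... */
    pthread_mutex_unlock(&lock);
}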
I understand how programs in machine code can load values from memory in to registers, perform jumps, or store values in registers to memory, but I don't understand how this works for multiple processes. A process is allocated memory on the fly, so must it use relative addressing? Is this done automatically (meaning there are assembly instructions that perform relative jumps, etc.), or does the program have to "manually" add the correct offset to every memory position it addresses.
I have another question regarding multitasking that is somewhat related. How does the OS, which isn't running, stop a thread and move on to the next. Is this done with timed interrupts? If so, then how can the values in registers be preserved for a thread. Are they saved to memory before control is given to a different thread? Or, rather than timed interrupts, does the thread simply choose a good time to give up control. In the case of timed interrupts, what happens if a thread is given processor time and it doesn't need it. Does it have to waste it, can it call the interrupt manually, or does it alert the OS that it doesn't need much time?
Edit: Or are executables edited before being run to compensate for the correct offsets?
That's not how it works. All modern operating systems virtualize the available memory, giving every process the illusion that it has 2 gigabytes of memory (or more) and doesn't have to share it with anybody. The key component in a machine that does this is the MMU, nowadays built into the processor itself. Another core feature of this virtualization is that it isolates processes: one misbehaving process cannot bring another one down with it.
Yes, a clock-tick interrupt is used to interrupt the currently running code. Processor state is simply saved on the stack. The operating system scheduler then checks if any other thread is ready to run and has a high enough priority to get first in line. Some extra code ensures that everybody gets a fair share. Then it is just a matter of setting the MMU to resume execution on the other thread. If no thread is ready to run, the CPU is halted with the HALT instruction, to be woken again by the next clock interrupt.
This is the ten-thousand-foot view; it is well covered in any book about operating system design.
A process is allocated memory on the fly, so must it use relative addressing?
No, it can use relative or absolute addressing depending on what it is trying to address.
At least historically, the various different addressing modes were more about local versus remote memory. Relative addressing was for memory addresses close to the current address while absolute was more expensive but could address anything. With modern virtual memory systems, these distinctions may be no longer necessary.
A process is allocated memory on the fly, so must it use relative addressing? Is this done automatically (meaning there are assembly instructions that perform relative jumps, etc.), or does the program have to "manually" add the correct offset to every memory position it addresses.
I'm not sure about this one. This is normally taken care of by the compiler. Again, modern virtual memory systems make this complexity unnecessary.
Are they saved to memory before control is given to a different thread?
Yes. Typically all of the state (registers, etc.) is stored in a process control block (PCB). A new context is loaded, the registers and other context are loaded from the new PCB, and execution begins in the new context. The PCB can be stored on the stack or in kernel memory, or it can utilize processor-specific operations to optimize this process.
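For a feel of what "all of the state" means, here is an illustrative-only fragment of what such a PCB might contain on x86-64 (field names are invented for this sketch):

#include <stdint.h>

struct pcb_sketch {
    uint64_t gpr[16];   /* general-purpose registers */
    uint64_t rip;       /* instruction pointer: where to resume */
    uint64_t rsp;       /* stack pointer */
    uint64_t rflags;    /* flags register */
    uint64_t cr3;       /* page-table base: restores the address space */
    /* ... FPU/SIMD state, kernel stack pointer, scheduling info ... */
};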
Or, rather than timed interrupts, does the thread simply choose a good time to give up control.
The thread can yield control -- put itself back at the end of the run queue. It can also wait for some IO or sleep. Thread libraries then put the thread in wait queues and switch to another context. When the IO is ready or the sleep expires, the thread is put back into the run queue. The same happens with mutex locks. It waits for the lock in a wait queue. Once the lock is available, the thread is put back into the run queue.
In the case of timed interrupts, what happens if a thread is given processor time and it doesn't need it. Does it have to waste it, can it call the interrupt manually, or does it alert the OS that it doesn't need much time?
Either the thread can run (perform CPU instructions) or it is waiting -- either on IO or a sleep. It can ask to yield but typically it is doing so by [again] sleeping or waiting on IO.
I probably walked into this question quite late, but then, it may be of use to some other programmers. First - the theory.
The modern-day operating system virtualizes memory, and to do so it maintains, within its system memory area, a series of page pointers. Each page is of a fixed size (usually 4K), and when any program asks for memory, it is allocated memory addresses that are virtualized through the memory page pointers. This approximates the behaviour of the "segment" registers in the prior generation of processors.
Now when the scheduler decides to get another process running, it may or may not keep the previous process in memory. If it keeps it in memory, then all that the scheduler does is save the entire register snapshot (now including the YMM registers; this bit was a complex issue earlier as there was no single instruction that saved the entire context: read up on XSAVE), and this has a fixed format (available in the Intel SW manual). This is stored in the memory space of the scheduler itself, along with the information on the memory pages that were being used.
If, however, the scheduler needs to "dump" the current process context that is about to go to sleep to the hard disk - this situation usually arises when the process that is waking up needs an extraordinary amount of memory - then the scheduler writes the memory pages to disk blocks (the pagefile - a reserved area of the disk - also the source of the "old grandmother wisdom" that the pagefile must be equal to the size of real memory) and preserves the memory page pointer addresses as offsets in the pagefile. When the process wakes up, the scheduler reads the offset addresses from the pagefile, allocates real memory, populates the memory page pointers, and then loads the contents from the disk blocks.
Now, to answer your specific questions :
1. Do you need to use only relative addressing, or can you use absolute?
Ans. You may use either - whatever you perceive to be absolute is also relative, as the memory page pointer relativizes that address invisibly. There is no really absolute memory address anywhere (including the I/O device memories) except in the kernel of the operating system itself. To test this, you may disassemble any .EXE program to see that the entry point is always CALL 0010, which implies that each thread gets a different "0010" from which to start its execution.
2. How do threads get their time slices, and what if one surrenders its unused slice?
Ans. Threads usually get a slice - modern systems have 20 ms as the usual standard, though this is sometimes changed in special-purpose builds for servers that do not have many hardware interrupts to deal with - in order of their position on the process queue. A thread usually surrenders its slice by calling the function sleep(), which is a formal (and very nice) way to surrender the remaining part of its time slice. Most libraries implementing asynchronous reads, or interrupt actions, call sleep() internally, but in many instances top-level programs also call sleep() - e.g. to create a time gap. An invocation of sleep will certainly change the process context - the CPU is not actually given the liberty to sleep using NOP.
The other method is to wait for an IO to complete, and this is handled differently. A program asking for an IO operation will cede its time slice, and the process scheduler flags this thread as being in the "WAITING FOR AN IO" state; this thread will not be given a time slice until its intended IO is completed or timed out. This feature helps programmers as they do not have to explicitly write a sleep_until_IO() kind of interface.
Trust this sets you going further in your explorations.