How to synchronize `vkMapMemory`? - multithreading

vkMapMemory states:
vkMapMemory does not check whether the device memory is currently in use before returning the host-accessible pointer. The application must guarantee that any previously submitted command that writes to this range has completed before the host reads from or writes to that range, and that any previously submitted command that reads from that range has completed before the host writes to that region
It links to this site which sadly doesn't seem to exist yet. I am wondering how I would synchronize this?
Basically I need to worry about two things
Only 1 thread accesses the same range at the same time
The Gpu isn't currently trying to read the range
The only real way I see to synchronize this is with a thread-safe list: every time you want to read from or write to that buffer, you add the memory range you are accessing to the list.
That means whenever you want to access the buffer you need to lock that list and search for the range that you are trying to access.
Is that how you would synchronize vkMapMemory or are there other ways to do this?

The only time that the gpu will try to access the mapped memory is when a command buffer accessing that memory has been submitted. That memory will be in use until the associated vkFence has been signaled.
A fully general solution would be to track every memory access by the gpu and surround each CPU mapped memory access with a begin/end pair that will wait on the appropriate fences and call flush/invalidate as needed. This is a lot of state tracking and a plethora of potentially blocking calls.
However, for persistent mesh/texture data you only need to write to a staging buffer and then copy it to a device-local, non-host-visible buffer. You shouldn't need to do this often, so a single fence tracking whether a copy from the staging buffer is in flight is enough. For data that only needs to survive a single frame (per-object transforms), you can use a ring buffer, and the same applies to reading back GPU occlusion-test or compute results.
I hope you can see the pattern emerging: use just a few mapped ring buffers, be very conscious about when they are used by the gpu, and then you just need to keep a small array of vkFence+offset+size entries per ring buffer to ensure no data hazard occurs.
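The vkFence+offset+size bookkeeping described above can be sketched without any real Vulkan calls. In this sketch (all names are made up for illustration), VkFence is replaced by a plain int fence id and fence_wait() is a stub standing in for vkWaitForFences(), so only the tracking logic is modeled:

```c
/* Sketch of per-ring-buffer fence+offset+size tracking.
 * "fence" is a stand-in for VkFence; fence_wait() stubs vkWaitForFences(). */
#include <stdbool.h>
#include <stddef.h>

#define MAX_REGIONS 16

typedef struct {
    int    fence;   /* stand-in for VkFence */
    size_t offset;
    size_t size;
    bool   in_use;
} Region;

typedef struct {
    size_t capacity;
    size_t head;    /* next free byte */
    Region regions[MAX_REGIONS];
} RingBuffer;

/* Stub: in real code this would call vkWaitForFences(). */
static void fence_wait(int fence) { (void)fence; }

/* Returns the offset of a sub-range the CPU may safely write, first
 * waiting on any fence whose recorded GPU use overlaps the range. */
static size_t ring_alloc(RingBuffer *rb, size_t size, int fence)
{
    if (rb->head + size > rb->capacity)
        rb->head = 0;                       /* wrap around */

    /* Wait for every recorded GPU use overlapping [head, head+size). */
    for (int i = 0; i < MAX_REGIONS; i++) {
        Region *r = &rb->regions[i];
        if (r->in_use &&
            r->offset < rb->head + size &&
            rb->head < r->offset + r->size) {
            fence_wait(r->fence);
            r->in_use = false;
        }
    }

    /* Record the new GPU use in a free slot. */
    for (int i = 0; i < MAX_REGIONS; i++) {
        if (!rb->regions[i].in_use) {
            rb->regions[i] = (Region){ fence, rb->head, size, true };
            break;
        }
    }
    size_t off = rb->head;
    rb->head += size;
    return off;
}
```

In a real renderer the fences passed in would be the ones signaled by the queue submissions that read from the ring buffer.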

Related

Can CUDA unified memory be written to by another CPU thread?

I am writing a program that retrieves images from a camera and processes them with CUDA. In order to gain the best performance, I'm passing a CUDA unified memory buffer to the image acquisition library, which writes to the buffer in another thread.
This causes all sorts of weird results where the program hangs in library code that I do not have access to. If I use a normal memory buffer and then copy to CUDA, the problem is fixed. So I became suspicious that writing from another thread might not be allowed, and google as I did, I could not find a definitive answer.
So is accessing the unified memory buffer from another CPU thread allowed or not?
There should be no problem writing to a unified memory buffer from multiple threads.
However, keep in mind the restrictions imposed when the concurrentManagedAccess device property is not true. In that case, when you have a managed buffer, and you launch a kernel, no CPU/host thread access of any kind is allowed, to that buffer, or any other managed buffer, until you perform a cudaDeviceSynchronize() after the kernel call.
In a multithreaded environment, this might take some explicit effort to enforce.
I think this is similar to this question, if that is also your posting. Note that the TX2 should have this property set to false.
Note that this general rule in the non-concurrent case can be modified through careful use of streams. However the restrictions still apply to buffers attached to streams that have a kernel launched in them (or buffers not explicitly attached to any stream): when the property mentioned above is false, access by any CPU thread is not possible.
The motivation for this behavior is roughly as follows. The CUDA runtime does not know the relationship between managed buffers, regardless of where those buffers were created. A buffer created in one thread could easily have objects in it with embedded pointers, and there is nothing to prevent or restrict those pointers from pointing to data in another managed buffer. Even a buffer that was created later. Even a buffer that was created in another thread.
The safe assumption is that any linkages could be possible, and therefore, without any other negotiation, the managed memory subsystem in the CUDA runtime must move all managed buffers to the GPU when a kernel is launched. This makes all managed buffers, without exception, inaccessible to CPU threads (any thread, anywhere).
In the normal program flow, access is restored at the next occurrence of a cudaDeviceSynchronize() call. Once the CPU thread that issues that call completes the call and moves on, managed buffers are once again visible to (all) CPU threads. Another kernel launch (anywhere) repeats the process and interrupts the accessibility. To repeat, this is the mechanism that is in effect when the concurrentManagedAccess property on the GPU is not true, and this behavior can be somewhat modified via the aforementioned stream attach mechanism.
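The "explicit effort to enforce" this in a multithreaded program can be sketched as a guard that every host thread goes through before touching a managed buffer. Here kernel_launch() and device_sync() are stand-ins for a real kernel launch and cudaDeviceSynchronize() (assumed names, not the CUDA API), so only the enforcement logic is shown:

```c
/* Sketch: between kernel launch and device sync, NO host thread may
 * touch any managed buffer (non-concurrentManagedAccess behavior).
 * kernel_launch()/device_sync() are stubs for the real CUDA calls. */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t managed_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  managed_cv   = PTHREAD_COND_INITIALIZER;
static bool kernel_in_flight = false;

/* Stand-in for launching a kernel: managed memory becomes off-limits. */
static void kernel_launch(void)
{
    pthread_mutex_lock(&managed_lock);
    kernel_in_flight = true;
    pthread_mutex_unlock(&managed_lock);
}

/* Stand-in for cudaDeviceSynchronize(): host access is legal again. */
static void device_sync(void)
{
    pthread_mutex_lock(&managed_lock);
    kernel_in_flight = false;
    pthread_cond_broadcast(&managed_cv);
    pthread_mutex_unlock(&managed_lock);
}

/* Every host-thread write to a managed buffer goes through this guard,
 * blocking while a kernel is in flight. */
static void managed_write(int *buf, int idx, int value)
{
    pthread_mutex_lock(&managed_lock);
    while (kernel_in_flight)
        pthread_cond_wait(&managed_cv, &managed_lock);
    buf[idx] = value;
    pthread_mutex_unlock(&managed_lock);
}
```

An image-acquisition callback writing into a managed buffer would call managed_write() (or take the same lock), so it naturally blocks for the duration of any kernel.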

Single write - single read big memory buffer sharing without locks

Let's suppose I have a big memory buffer used as a framebuffer, which is constantly written to by a thread (or even multiple threads, with the guarantee that no two threads write the same byte concurrently). These writes are nondeterministic in time, scattered through the codebase, and cannot be blocked.
I have another single thread which periodically reads out (copies) the whole buffer for generating a display frame. This read should not be blocked either. Tearing is not a problem in my case. In other words, my only goal is that every change done by the writer thread(s) should eventually appear in the reading thread. The ordering, or some delay (negligible compared to a display refresh rate), does not matter.
Reading and writing the same memory location concurrently is a data race, which results in undefined behavior in C++11, and this article lists some really dreadful examples where the optimizing compiler generates code for a memory read that alters the memory contents in the presence of a data race.
Still, I need some solution that does not completely redesign this legacy code. Any advice counts that is safe from a practical standpoint, regardless of whether it is theoretically correct. I am also open to not-fully-portable solutions.
Aside from that I have a data race, I can easily force the visibility of the buffer changes in the reading thread by establishing a synchronizes-with relation between the threads (acquire-release an atomic guard variable, used for nothing else), or by adding platform-specific memory fence calls to key points in the writer thread(s).
My ideas to target the data race:
1. Use assembly for the reading thread. I would like to avoid that.
2. Make the memory buffer volatile, preventing the compiler from performing the nasty optimizations described in the referenced article.
3. Put the reading thread's code in a separate compilation unit and compile it with -O0.
4. Leave everything as is and cross my fingers (as I currently do not notice issues) :)
What is the safest from the list above? Do you see a better solution?
FYI, the target platform is ARM (with multiple cores) and x86 (for testing).
(This question concretizes a previous one that was a little too generic.)
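The acquire-release guard variable mentioned in the question can be sketched as follows. Note the plain concurrent accesses to the buffer are still formally a data race; the release/acquire pair only forces visibility of prior writes, which is exactly the practical (not theoretically clean) trade-off being asked about:

```c
/* Sketch: atomic guard used for nothing but a synchronizes-with edge
 * between writer(s) and the periodic reader. Tearing is accepted. */
#include <stdatomic.h>
#include <string.h>

#define FB_SIZE 4096

static unsigned char framebuffer[FB_SIZE];
static atomic_uint frame_guard;   /* the guard variable */

/* Writer thread(s): scribble into the buffer, then bump the guard with
 * release semantics so the plain writes become visible to an acquirer. */
void writer_update(int offset, unsigned char value)
{
    framebuffer[offset] = value;
    atomic_fetch_add_explicit(&frame_guard, 1, memory_order_release);
}

/* Reader thread: acquire the guard, then snapshot the whole buffer.
 * Everything written before the matching release is guaranteed visible. */
void reader_snapshot(unsigned char *dst)
{
    (void)atomic_load_explicit(&frame_guard, memory_order_acquire);
    memcpy(dst, framebuffer, FB_SIZE);
}
```

On x86 the release increment is nearly free; on ARM it emits the barrier that makes the writes propagate, which matches the "platform-specific memory fence" alternative in the question.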

Is it safe to use O_DIRECT without O_SYNC?

I have a linux application that streams data to files on a directly-attached SAS storage array. It fills large buffers, writes them in O_DIRECT mode, then recycles the buffers (i.e. fills them again etc.). I do not need to use O_SYNC for data integrity, because I can live with data loss on crashes, delayed writing etc. I'm primarily interested in high throughput and I seem to get better performance without O_SYNC. However, I am wondering if it is safe: if O_DIRECT is used but not O_SYNC, when exactly does the write() system call return?
If the write() returns after the DMA to the storage array's cache has been completed, then my application is safe to fill the buffer again. The array itself is in write-back mode: it will write to disk eventually, which is acceptable to me.
If the write returns immediately after the DMA has been initiated (but not yet completed), then my application is not safe, because it would overwrite the buffer while the DMA is still in progress. Obviously I don't want to write corrupted data; but in this case there is also no way that I know to figure out when the DMA for a buffer has been completed and it is safe to refill.
(There are actually several parallel threads, each one with its pool of buffers, although this may not really matter for the question above.)
When the write call returns you can reuse the buffer without any danger. You don't know that the write has made it to disk, but you indicated that was not an issue for you.
One supporting reference is at http://www.makelinux.net/ldd3/chp-15-sect-3, which states:
For example, the use of direct I/O requires that the write system call operate synchronously; otherwise the application does not know when it can reuse its I/O buffer.

Linux splice() + kernel AIO when writing to disk

With kernel AIO and O_DIRECT|O_SYNC, there is no copying into kernel buffers and it is possible to get fine grained notification when data is actually flushed to disk. However, it requires data to be held in user space buffers for io_prep_pwrite().
With splice(), it is possible to move data directly to disk from kernel-space buffers (pipes) without ever having to copy it around. However, splice() returns immediately after data is queued and does not wait for the actual writes to the disk.
The goal is to move data from sockets to disk without copying it around while getting confirmation that it has been flushed out. How to combine both previous approaches?
By combining splice() with O_SYNC, I expect splice() to block and one has to use multiple threads to mask latency. Alternatively, one could use asynchronous io_prep_fsync()/io_prep_fdsync(), but this waits for all data to be flushed, not for a specific write. Neither is perfect.
What would be required is a combination of splice() with kernel AIO, allowing zero copy and asynchronous confirmation of writes, such that a single event driven thread can move data from sockets to the disk and get confirmations when required, but this doesn't seem to be supported. Is there a good workaround / alternative approach?
To get a confirmation of the writes, you can't use splice().
There's aio stuff in userspace, but if you were doing it in the kernel it might come down to finding out which bios (block I/O structures) are generated and waiting for those:
Block I/O structure:
http://www.makelinux.net/books/lkd2/ch13lev1sec3
If you want to use AIO, you will need to use io_getevents():
http://man7.org/linux/man-pages/man2/io_getevents.2.html
Here are some examples on how to perform AIO:
http://www.fsl.cs.sunysb.edu/~vass/linux-aio.txt
If you do it from userspace and use msync, it's still somewhat up in the air whether the data is actually on spinning rust yet.
msync() docs:
http://man7.org/linux/man-pages/man2/msync.2.html
You might have to soften your expectations in order to make the system more robust, because it can be very expensive to be certain that writes are physically on disk.
The 'highest' typical standard for write assurance in the face of something like power removal is a journal recording every operation that modifies the storage. The journal itself is append-only, and you can see whether entries are complete when you play it back. The very last journal entry may not be complete, so something may still potentially be lost.
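The kernel-AIO half of the answer (io_submit plus io_getevents) can be sketched with the raw syscalls, so no libaio is needed. For brevity this opens the file without O_DIRECT|O_SYNC (which the question would add, together with an aligned buffer), so it only shows the submit/reap mechanics:

```c
/* Sketch: submit one asynchronous pwrite via the kernel-AIO syscalls
 * and block in io_getevents() until its completion event arrives. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

long aio_write_and_wait(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    aio_context_t ctx = 0;
    if (syscall(SYS_io_setup, 1, &ctx) < 0) { close(fd); return -1; }

    struct iocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_lio_opcode = IOCB_CMD_PWRITE;
    cb.aio_fildes = fd;
    cb.aio_buf = (uintptr_t)data;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;

    struct iocb *cbs[1] = { &cb };
    if (syscall(SYS_io_submit, ctx, 1, cbs) != 1) {
        syscall(SYS_io_destroy, ctx);
        close(fd);
        return -1;
    }

    /* Block until the write completes; ev.res holds bytes written. */
    struct io_event ev;
    long got = syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL);

    syscall(SYS_io_destroy, ctx);
    close(fd);
    return got == 1 ? (long)ev.res : -1;
}
```

With O_DIRECT|O_SYNC the completion event would be the fine-grained "flushed to disk" confirmation the question asks for, one event per write rather than one fsync for everything.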

how to mmap() a queue?

I have the following problem:
I have created a queue. The addition of elements (via malloc) is done by the main() function, and I have created a thread which will process the elements/data and free them.
This is a continuous process, and it will continue till I kill the process.
Now, if I kill the process the data in the queue will be lost, so I was thinking about implementing mmap() for it, so that the queue is also stored in a regular file, and when I restart the process the data will be reloaded into memory for further processing by the thread...
Since I am malloc'ing and free'ing memory, I suppose the mmapped file size will grow or shrink continuously.
Now, is this possible to implement, or should I consider other options?
Thanks.
EDIT1: Can I use lseek() or ftruncate() to resize the file?
You can certainly put a queue (or any other data structure) into mmap()ed memory instead of heap memory, but you will run in to several problems which you must overcome:
You will have to do all of the memory management in the memory block corresponding to the mmap()ed file yourself. Unless your queue data structure is one monolithic block of memory, it probably has nodes and pointers which can get created, removed, and relocated. With heap memory you can delegate the task of allocating and freeing small blocks of memory to malloc() and free(), including reusing memory that has been freed for use by new nodes in your data structure. In your mmap()ed file you will have to do all of this yourself.
You won't be able to use pointers inside your mmap()ed block of memory, only offsets from the start of the block. This is because if the block is detached and reattached in another process it probably won't be at the same memory address. Since data structure access is done using pointers, you will have the overhead of constantly transforming offsets to pointers and back by adding or subtracting the mmap() block's base address.
If you want to be able to recover by reattaching the block in another process after the first process has been killed, you will have to be prepared for the case where the first process was killed in the middle of a critical section while one or more invariants of the data structure were temporarily violated. In other words, you might be reattaching a corrupt or inconsistent data structure. In order to support this fully correctly you will have to be very careful with the kinds of mutations you perform on your data structure.
All in all, my recommendation to you is that it's not worth it. You should use a fast, efficient, and easy-to-use data structure in heap memory, and once in a while save a serialized snapshot to a regular file. If you have to recover, recover from the last known good snapshot.
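If you do go the mmap() route, the offset-only layout from point 2 can be sketched like this: a tiny FIFO of fixed-size records kept entirely inside one mmap()ed file, using array indexes instead of pointers so it survives being detached and reattached (names and the fixed capacity are illustrative assumptions):

```c
/* Sketch: a queue living entirely inside an mmap()ed file, addressed
 * by indexes (offsets) rather than pointers, so a new process can
 * reattach it after the old one was killed. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define QCAP 64

typedef struct {            /* lives at offset 0 of the file */
    size_t head, tail;      /* indexes, never raw pointers   */
    int    items[QCAP];
} FileQueue;

FileQueue *queue_attach(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return NULL;
    if (ftruncate(fd, sizeof(FileQueue)) < 0) { close(fd); return NULL; }
    void *p = mmap(NULL, sizeof(FileQueue), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);              /* the mapping keeps the file accessible */
    return p == MAP_FAILED ? NULL : (FileQueue *)p;
}

int queue_push(FileQueue *q, int v)
{
    if (q->tail - q->head == QCAP) return -1;   /* full  */
    q->items[q->tail % QCAP] = v;
    q->tail++;
    return 0;
}

int queue_pop(FileQueue *q, int *out)
{
    if (q->head == q->tail) return -1;          /* empty */
    *out = q->items[q->head % QCAP];
    q->head++;
    return 0;
}

void queue_detach(FileQueue *q)
{
    munmap(q, sizeof(FileQueue));   /* contents persist in the file */
}
```

Note this sketch sidesteps the hard problems from the answer by using fixed-size slots (no malloc/free inside the file) and it is still vulnerable to a kill between the item write and the tail increment, which is exactly the corruption point 3 warns about.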
