Vulkan: Concurrent host-writes and device reads to separate parts of same VkMemory - multithreading

To transfer my static data into the GPU, I'm thinking of having a single staging VkMemory object (ballpark 64MB), and using it as a rotating queue. However, I have multiple threads producing content (eg: rendering glyphs, loading files, procedural) and I'd like it if they could upload their data entirely by themselves (i.e. write plus submit Vulkan transfer commands).
I'm intending to keep the entire staging VkMemory permanently mapped (if this is dumb please say so) at least during loading (but perhaps longer if I want to stream data).
To achieve the above, once a thread's data is fully written/flushed to staging I'd like it to be able to immediately submit GPU transfer commands.
However, that means the GPU will be reading from one part of the VkMemory while other threads may be writing/flushing to it.
AFAIK I will also need to use image memory barriers for the transition from VK_IMAGE_LAYOUT_PREINITIALIZED to VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL.
I couldn't find anything on the spec explicitly saying this was legal or illegal, only that care should be taken to ensure synchronization. However, I didn't find enough detail for me to be sure one way or the other.
NOTE: The staging queue will need to ensure transfers have been completed before overwriting anything - I intend to keep a complimentary queue of VkFences for this.
Questions:
Is this OK?
Do I need to align each separate object to a page boundary? Or something else.
Am I correct in assuming that the image memory barrier (above) won't require the device to write to staging memory.

yes the spec talks about the region being read from and written to must be synced.
if the memory is not coherent then you must align the blocks being read from or written to to NonCoherentAtomSize
source: Vulkan spec under the note after the declaration of vkMapMemory
vkMapMemory does not check whether the device memory is currently in
use before returning the host-accessible pointer. The application must
guarantee that any previously submitted command that writes to this
range has completed before the host reads from or writes to that
range, and that any previously submitted command that reads from that
range has completed before the host writes to that region (see here
for details on fulfilling such a guarantee). If the device memory was
allocated without the VK_MEMORY_PROPERTY_HOST_COHERENT_BIT set, these
guarantees must be made for an extended range: the application must
round down the start of the range to the nearest multiple of
VkPhysicalDeviceLimits::nonCoherentAtomSize, and round the end of the
range up to the nearest multiple of
VkPhysicalDeviceLimits::nonCoherentAtomSize.
a layout transition may write to the memory however barriers will do their own syncing with regards to previous and subsequent memory accesses.

Related

If I private-`mmap` a file and read it, then another process writes to the same file, will another read at the same location return the same value?

(Context: I'm trying to establish which sequences of mmap operations are safe from the "memory safety" point of view, i.e. what assumptions I can make about mmaped memory without risking security bugs as a consequence of undefined behaviour, or miscompiles due to compilers making incorrect assumptions about how memory could behave. I'm currently working on Linux but am hoping to port the program to other operating systems in the future, so although I'm primarily interested in Linux, answers about how other operating systems behave would also be appreciated.)
Suppose I map a portion into file into memory using mmap with MAP_PRIVATE. Now, assuming that the file doesn't change while I have it mapped, if I access part of the returned memory, I'll be given information from the file at that offset; and (because I used MAP_PRIVATE) if I write to the returned memory, my writes will persist in my process's memory but will have no effect on the underlying file.
However, I'm interested in what will happen if the file does change while I have it mapped (because some other process also has the file open and is writing to it). There are several cases that I know the answers to already:
If I map the file with MAP_SHARED, then if any other process writes to the file via a shared mmap, my own process's memory will also be updated. (This is the intended behaviour of MAP_SHARED, as one of its intended purposes is for shared-memory concurrency.) It's less clear what will happen if another process writes to the file via other means, but I'm not interested in that case.
If the following sequence of events occurs:
I map the file with MAP_PRIVATE;
A portion of the file I haven't accessed yet is written by another process;
I read that portion of the file via my mapping;
then, at least on Linux, the read might return either the old value or the new value:
It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.
— man 2 mmap on Linux
(This case – which is not the case I'm asking about – is covered in this existing StackOverflow question.)
I also checked the POSIX definition of mmap, but (unless I missed it) it doesn't seem to cover this case at all, leaving it unclear whether all POSIX systems would act the same way.
Linux's behaviour makes sense here: at the time of the access, the kernel might have already mapped the requested part of the file into memory, in which case it doesn't want to change the portion that's already there, but it might need to load it from disk, in which case it will see any new value that may have been written to the file since it was opened. So there are performance reasons to use the new value in some cases and the old value in other cases.
If the following sequence of events occurs:
I map the file with MAP_PRIVATE;
I write to a memory address within the file mapping;
Another process changes that part of the file;
then although I don't know this for certain, I think it's very likely that the rule is that the memory address in question continues to reflect the old value, that was written by our process. The reason is that the kernel needs to maintain two copies of that part of the file anyway: the values as seen by our process (which, because it used MAP_PRIVATE, can write to its view of the file without changing the underlying file), and the values that are actually in the file on disk. Writes by other processes obviously need to change the second copy here, so it would be bizarre to also change the first copy; doing so would make the interface less usable and also come at a performance cost, and would have no advantages.
There is one sequence of events, though, where I don't know what happens (and for which the behaviour is hard to determine experimentally, given the number of possible factors that might be relevant):
I map the file with MAP_PRIVATE;
I read some portion of the file via the mapping, without writing;
Another process changes part of the file that I just read;
I read the same portion of the file via the mapping, again.
In this situation, am I guaranteed to read the same data twice? Or is it possible to read the old data the first time and the new data the second time?

Linux PCIe DMA Driver (Xilinx XDMA)

I am currently working with the Xilinx XDMA driver (see here for source code: XDMA Source), and am attempting to get it to run (before you ask: I have contacted my technical support point of contact and the Xilinx forum is riddled with people having the same issue). However, I may have found a snag in Xilinx's code that might be a deal breaker for me. I am hoping there is something that I'm not considering.
First off, there are two primary modes of the driver, AXI-Memory Mapped (AXI-MM) and AXI-Streaming (AXI-ST). For my particular application, I require AXI-ST, since data will continuously be flowing from the device.
The driver is written to take advantage of scatter-gather lists. In AXI-MM mode, this works because reads are rather random events (i.e., there isn't a flow of data out of the device, instead the userspace application simply requests data when it needs to). As such, the DMA transfer is built up, the data is transfered, and the transfer is then torn down. This is a combination of get_user_pages(), pci_map_sg(), and pci_unmap_sg().
For AXI-ST, things get weird, and the source code is far from orthodox. The driver allocates a circular buffer where the data is meant to continuously flow into. This buffer is generally sized to be somewhat large (mine is set on the order of 32MB), since you want to be able to handle transient events where the userspace application forgot about the driver and can then later work off the incoming data.
Here's where things get wonky... the circular buffer is allocated using vmalloc32() and the pages from that allocation are mapped in the same way as the userspace buffer is in AXI-MM mode (i.e., using the pci_map_sg() interface). As a result, because the circular buffer is shared between the device and CPU, every read() call requires me to call pci_dma_sync_sg_for_cpu() and pci_dma_sync_sg_for_device(), which absolutely destroys my performance (I can not keep up with the device!), since this works on the entire buffer. Funny enough, Xilinx never included these sync calls in their code, so I first knew I had a problem when I edited their test script to attempt more than one DMA transfer before exiting and the resulting data buffer was corrupted.
As a result, I'm wondering how I can fix this. I've considered rewriting the code to build up my own buffer allocated using pci_alloc_consistent()/dma_alloc_coherent(), but this is easier said than done. Namely, the code is architected to assume using scatter-gather lists everywhere (there appears to be a strange, proprietary mapping between the scatter-gather list and the memory descriptors that the FPGA understands).
Are there any other API calls I should be made aware of? Can I use the "single" variants (i.e., pci dma_sync_single_for_cpu()) via some translation mechanism to not sync the entire buffer? Alternatively, is there perhaps some function that can make the circular buffer allocated with vmalloc() coherent?
Alright, I figured it out.
Basically, my assumptions and/or understanding of the kernel documentation regarding the sync API were totally incorrect. Namely, I was wrong on two key assumptions:
If the buffer is never written to by the CPU, you don't need to sync for the device. Removing this call doubled my read() throughput.
You don't need to sync the entire scatterlist. Instead, now in my read() call, I figure out what pages are going to be affected by the copy_to_user() call (i.e., what is going to be copied out of the circular buffer) and only sync those pages that I care about. Basically, I can call something like pci_dma_sync_sg_for_cpu(lro->pci_dev, &transfer->sgm->sgl[sgl_index], pages_to_sync, DMA_FROM_DEVICE) where sgl_index is where I figured the copy will start and pages_to_sync is how large the data is in number of pages.
With the above two changes my code now meets my throughput requirements.
I think XDMA was originally written for x86, in which case the sync functions do nothing.
It does not seem likely that you can use the single sync variants unless you modify the circular buffer. Replacing the circular buffer with a list of buffers to send seems like a good idea to me. You pre-allocate a number of such buffers and have a list of buffers to send and a free list for your app to reuse.
If you're using a Zynq FPGA, you could connect the DMA engine to the ACP port so that FPGA memory access will be coherent. Alternatively, you can map the memory regions as uncached/buffered instead of cached.
Finally, in my FPGA applications, I map the control registers and buffers into the application process and only implement mmap() and poll() in the driver, to give apps more flexibility in how they do DMA. I generally implement my own DMA engines.
Pete, I am the original developer of the driver code (before the X of XMDA came into place).
The ringbuffer was always an unorthodox thing and indeed meant for cache-coherent systems and disabled by default. It's initial purpose was to get rid of the DMA (re)start latency; even with full asynchronous I/O support (even with zero-latency descriptor chaining in some cases) we had use cases where this could not be guaranteed, and where a true hardware ringbuffer/cyclic/loop mode was required.
There is no equivalent to a ringbuffer API in Linux, so it's open-coded a bit.
I am happy to re-think the IP/driver design.
Can you share your fix?

Writing to a remote file: When does write() really return?

I have a client node writing a file to a hard disk that is on another node (I am writing to a parallel fs actually).
What I want to understand is:
When I write() (or pwrite()), when exactly does the write call return?
I see three possibilities:
write returns immediately after queueing the I/O operation on the client side:
In this case, write can return before data has actually left the client node (If you are writing to a local hard drive, then the write call employs delayed writes, where data is simply queued up for writing. But does this also happen when you are writing to a remote hard disk?). I wrote a testcase in which I write a large matrix (1GByte) to file. Without fsync, it showed very high bandwidth values, whereas with fsync, results looked more realistic. So looks like it could be using delayed writes.
write returns after the data has been transferred to the server buffer:
Now data is on the server, but resides in a buffer in its main memory, but not yet permanently stored away on the hard drive. In this case, I/O time should be dominated by the time to transfer the data over the network.
write returns after data has been actually stored on the hard drive:
Which I am sure does not happen by default (unless you write really large files which causes your RAM to get filled and ultimately get flushed out and so on...).
Additionally, what I would like to be sure about is:
Can a situation occur where the program terminates without any data actually having left the client node, such that network parameters like latency, bandwidth, and the hard drive bandwidth do not feature in the program's execution time at all? Consider we do not do an fsync or something similar.
EDIT: I am using the pvfs2 parallel file system
Option 3. is of course simple, and safe. However, a production quality POSIX compatible parallel file system with performance good enough that anyone actually cares to use it, will typically use option 1 combined with some more or less involved mechanism to avoid conflicts when e.g. several clients cache the same file.
As the saying goes, "There are only two hard things in Computer Science: cache invalidation and naming things and off-by-one errors".
If the filesystem is supposed to be POSIX compatible, you need to go and learn POSIX fs semantics, and look up how the fs supports these while getting good performance (alternatively, which parts of POSIX semantics it skips, a la NFS). What makes this, err, interesting is that the POSIX fs semantics harks back to the 1970's with little to no though of how to support network filesystems.
I don't know about pvfs2 specifically, but typically in order to conform to POSIX and provide decent performance, option 1 can be used together with some kind of cache coherency protocol (which e.g. Lustre does). For fsync(), the data must then actually be transferred to the server and committed to stable storage on the server (disks or battery-backed write cache) before fsync() returns. And of course, the client has some limit on the amount of dirty pages, after which it will block further write()'s to the file until some have been transferred to the server.
You can get any of your three options. It depends on the flags you provide to the open call. It depends on how the filesystem was mounted locally. It also depends on how the remote server is configured.
The following are all taken from Linux. Solaris and others may differ.
Some important open flags are O_SYNC, O_DIRECT, O_DSYNC, O_RSYNC.
Some important mount flags for NFS are ac, noac, cto, nocto, lookupcache, sync, async.
Some important flags for exporting NFS are sync, async, no_wdelay. And of course the mount flags of the filesystem that NFS is exporting are important as well. For example, if you were exporting XFS or EXT4 from Linux and for some reason you used the nobarrier flag, a power loss on the server side would almost certainly result in lost data.

Possible to implement journaling with a single fsync per commit?

Let's say you're building a journaling/write-ahead-logging storage system. Can you simply implement this by (for each transaction) appending the data (with write(2)), appending a commit marker, and then fsync-ing?
The scenario to consider is if you do a large set of writes to this log then fsync it, and there's a failure during the fsync. Are the inode direct/indirect block pointers flushed only after all data blocks are flushed, or are there no guarantees that blocks are being flushed in order? If the latter, then during recovery, if you see a commit marker at the end of the file, you can't trust that the data between it and the previous commit marker is meaningful. Thus you have to rely on another mechanism (involving at least another fsync) to determine what extent of the log file is consistent (e.g., writing/fsyncing the data, then writing/fsyncing the commit marker).
If it makes a difference, mainly wondering about ext3/ext4 as the context.
Note that linux's and mac os's fsync and fdatasync are incorrect by default. Windows is correct by default, but can emulate linux for benchmarking purposes.
Also, fdatasync issues multiple disk writes if you append to the end of a file, since it needs to update the file inode with the new length. If you want to have one write per commit, your best bet is to pre-allocate log space, store a CRC of the log entries in the commit marker, and issue a single fdatasync() at commit. That way, no matter how much the OS / hardware reorder behind your back, you can find a prefix of the log that actually hit disk.
If you want to use the log for durable commits or write ahead, things get harder, since you need to make sure that fsync actually works. Under Linux, you'll want to disable the disk write cache with hdparm, or mount the partition with barrier set to true. [Edit: I stand corrected, barrier doesn't seem to give the correct semantics. SATA and SCSI introduce a number of primitives, such as write barriers and native command queuing, that make it possible for operating systems to export primitives that enable write-ahead logging. From what I can tell from manpages and online, Linux only exposes these to filesystem developers, not to userspace.]
Paradoxically, disabling the disk write cache sometimes leads to better performance, since you get more control over write scheduling in user space; if the disk queues up a bunch of synchronous write requests, you end up exposing strange latency spikes to the application. Disabling write cache prevents this from happening.
Finally, real systems use group commit, and do < 1 sync write per commit with concurrent workloads.
There's no guarantee on the order in which blocks are flushed to disk. These days even the drive itself can re-order blocks on their way to the platters.
If you want to enforce ordering, you need to at least fdatasync() between the writes that you want ordered. All a sync promises is that when it returns, everything written before the sync has hit storage.

Limiting the File System Usage Programmatically in Linux

I was assigned to write a system call for Linux kernel, which oddly determines (and reduces) users´ maximum transfer amount per minute (for file operations). This system call will be called lim_fs_usage and will take a parameter for maximum number of bytes all users can access in a minute. For short, I am going to determine bandwidth of all filesystem operations in Linux. The project also asks for choosing appropriate method for distribution of this restricted resource (file access) among the users but I think this
won´t be a big problem.
I did a long long search and scan but could not find a method for managing file system access programmatically. I thought of mapping (mmap())hard drive to memory and manage memory operations but this turned to be useless. I also tried to find an API for virtual file system in order to monitor and limit it but I could not find one. Any ideas, please... Any help is greatly appreciated. Thank you in advance...
I wonder if you could do this as an IO scheduler implementation.
The main difficulty of doing IO bandwidth limitation under Linux is, by the time it reaches anywhere near the device, the kernel has probably long since forgotten who caused it.
Likewise, you can get on some very tricky ground in determining who is responsible for a given piece of IO:
If a binary is demand-loaded, who owns the IO doing that?
A mapped section of memory (demand-loaded executable or otherwise) might be kicked out of memory because someone else used too much ram, thus causing the kernel to choose to evict those pages, which places an unfair burden on the quota of the other user to then page it back in
IO operations can be combined, and might come from different users
A write operation might cause an IO sooner or later depending on how the kernel schedules it; a later schedule may mean that fewer IOs need to be done in the long run, as another write gets done to the same block in the interim; writing to an already dirty block in cache does not make it any dirtier.
If you understand all these and more caveats, and still want to, I imagine doing it as an IO scheduler is the way to go.
IO schedulers are pluggable under Linux (2.6) and can be changed dynamically - the kernel waits for all IO on the device (IO scheduler is switchable per block device) to end and then switches to the new one.
Since it's urgent I'll give you an idea out of the top of my head without doing any research on the feasibility -- what about inserting a hook to monitor system calls that deal with file system access?
You might end up writing specialised kernel modules to handle the various filesystems (ext3, ext4, etc) but as a proof-of-concept you can start with one. Do not forget that root has reserved blocks in memory, process space and disk for his own operations.
Managing memory operations does not sound related to what you're trying to do (but perhaps I am mistaken here).
After a long period of thinking and searching, I decided to use the ¨hooking¨ method proposed. I am thinking of creating a new system call which initializes and manages a global variable like hdd_ bandwith _limit. This variable will be used in Read() and Write() system calls´ modified implementation (instead of ¨count¨ variable). Then I will decide distribution of this resource which is the real issue. Probably I will find out how many users are using the system for a certain moment and divide this resource equally. Will be a Round-Robin-like distribution. But still, I am open to suggestions on this distribution issue. Will it be a SJF or FCFS or Round-Robin? Synchronization is another issue. How can I know a user´s job is short or long? Or whether he is done with the operation or not?

Resources