I have an existing Linux application that I'd like to accelerate with CUDA. The application streams data in from another process, applies some signal processing operations to that data, and streams the data out to a follow-on process ("process" here in the operating-system sense).
The IPC method for streaming the data is dictated by the framework in which my application runs. Namely, named shared memory blocks are used as circular buffers between each process: when my application has data available at its output, it writes it to the shared memory block in a circular fashion.
For maximum throughput of the CUDA version of my application, I would like to overlap all of the following:
The copying of the N-th block of input data to the device.
The processing of the (N-1)-th block of input data by the device.
The copying of the (N-2)-th block of output data from the device to the host.
From what I understand, to achieve overlap with all of these operations, I must use pinned memory on the host. However, my input/output shared memory buffers are allocated by the framework, which is outside of my control; I can't make it allocate pinned memory. I could do copies to intermediate buffers, but this increases the memory footprint of my application, which is undesirable. Ideally, I'd like to pin the existing shared memory buffers instead.
Is there a way that, given an arbitrary block of virtual memory, I can pin it such that it is suitable for use with asynchronous overlapped CUDA memory copies? Based on its man page, the mlock() function sounds like it might do what I want. Are there any pitfalls to be found here?
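For reference, here is a rough sketch of the overlap pattern described above, under the assumption that the framework-allocated buffers could be page-locked after the fact (the CUDA runtime exposes cudaHostRegister() for registering existing host memory); the kernel, buffer layout, and error handling are placeholders, and circular-buffer wrap-around is omitted:

#include <cuda_runtime.h>
#include <stddef.h>

void pipeline(char *shm_in, char *shm_out, size_t block_bytes, int nblocks)
{
    /* Pin the framework-allocated buffers so asynchronous copies can overlap
       (assumes cudaHostRegister() accepts these mappings). */
    cudaHostRegister(shm_in,  block_bytes * (size_t)nblocks, cudaHostRegisterDefault);
    cudaHostRegister(shm_out, block_bytes * (size_t)nblocks, cudaHostRegisterDefault);

    cudaStream_t stream[3];
    char *d_in, *d_out;
    cudaMalloc((void **)&d_in,  block_bytes * 3);
    cudaMalloc((void **)&d_out, block_bytes * 3);
    for (int i = 0; i < 3; ++i)
        cudaStreamCreate(&stream[i]);

    for (int n = 0; n < nblocks; ++n) {
        cudaStream_t s = stream[n % 3];
        char *d_i = d_in  + (n % 3) * block_bytes;
        char *d_o = d_out + (n % 3) * block_bytes;

        /* Copy-in, process, and copy-out for block n are queued on one stream;
           work queued on the other two streams can overlap with it. */
        cudaMemcpyAsync(d_i, shm_in + (size_t)n * block_bytes, block_bytes,
                        cudaMemcpyHostToDevice, s);
        /* process<<<grid, threads, 0, s>>>(d_i, d_o);   hypothetical kernel */
        cudaMemcpyAsync(shm_out + (size_t)n * block_bytes, d_o, block_bytes,
                        cudaMemcpyDeviceToHost, s);
    }
    for (int i = 0; i < 3; ++i)
        cudaStreamSynchronize(stream[i]);

    cudaHostUnregister(shm_in);
    cudaHostUnregister(shm_out);
}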
Related
I access a file on disk using memory-mapped I/O (the mmap call on Linux).
Is it possible to pass this virtual memory buffer to OpenCL using CL_MEM_USE_HOST_PTR (for reading only)? And could this result in performance gains?
I want to avoid copying an entire file into host memory, and instead let the OpenCL kernel control which parts of the file get loaded/buffered by the operating system.
I think this should work - you shouldn't end up with errors, crashes, or incorrect results; whether or not it brings performance gains probably depends on hardware, driver/CL implementation, and access patterns. I would not be surprised if it didn't make much of a difference in many cases. I could imagine the GPU driver prefaulting and wiring down all the pages in order to map it into the GPU's address space.
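As a rough sketch of what the question describes (the file path and sizes are placeholders, error handling is omitted, and whether the driver actually avoids a copy is implementation-dependent):

#include <CL/cl.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file and hand the mapping to OpenCL without an intermediate host copy. */
cl_mem wrap_file(cl_context ctx, const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* Read-only, demand-paged mapping of the file. */
    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                                  /* the mapping stays valid */

    cl_int err;
    /* CL_MEM_USE_HOST_PTR asks the implementation to use this memory directly;
       whether it copies, pins, or maps it is up to the driver. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                (size_t)st.st_size, p, &err);
    *len_out = (size_t)st.st_size;
    return buf;
}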
I have a multi-threaded program which is supposed to run on 6 GPU devices.
I want to create 6 streams on each device to reuse during the lifetime of my program (36 in total).
I'm using cudaStreamCreate(), cublasCreate(), and cublasSetStream() to create each stream and handle.
I also use a GPU memory monitor to see the memory usage for each handle.
However, when I look at the GPU memory usage on each device, it grows only on the first stream creation and doesn't change for the rest of the streams I create.
As far as I know there isn't any limitation on the number of streams I can use.
But I can't figure out why the memory usage of the handles and the streams doesn't show up in the GPU memory usage.
All the streams you create reside within a single context on a given device, so there is no context-related overhead from creating additional streams after the first one. The streams themselves are lightweight and are (mostly) a host-side scheduler abstraction. As you have observed, they don't in themselves consume much (if any) device memory.
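A minimal sketch of the setup the question describes, assuming 6 devices with 6 streams and cuBLAS handles each (counts are placeholders and error handling is omitted):

#include <cublas_v2.h>
#include <cuda_runtime.h>

#define NDEV    6   /* number of GPUs, as in the question */
#define NSTREAM 6   /* streams (and handles) per device   */

cudaStream_t   streams[NDEV][NSTREAM];
cublasHandle_t handles[NDEV][NSTREAM];

void setup(void)
{
    for (int d = 0; d < NDEV; ++d) {
        /* The first runtime call on a device creates its context; that is the
           one-time allocation visible in a device memory monitor. */
        cudaSetDevice(d);
        for (int s = 0; s < NSTREAM; ++s) {
            cudaStreamCreate(&streams[d][s]);          /* lightweight */
            cublasCreate(&handles[d][s]);
            cublasSetStream(handles[d][s], streams[d][s]);
        }
    }
}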
I have stumbled more than once across the terms "non-coherent" and "coherent" memory in tech papers related to graphics programming. I have been searching for a simple and clear explanation, but have found mostly 'hardcore' papers of this type. I would be glad to receive a layman's-style answer on what coherent memory actually is on GPU architectures and how it compares to other (presumably non-coherent) memory types.
Memory is memory. But different things can access that memory. The GPU can access memory, the CPU can access memory, maybe other hardware bits, whatever.
A particular thing has "coherent" access to memory if changes made by others to that memory are visible to the reader. Now, you might think this is foolishness. After all, if the memory has been changed, how could someone possibly be unable to see it?
Simply put, caches.
It turns out that changing memory is expensive. So we do everything possible to avoid changing memory unless we absolutely have to. When you write a single byte from the CPU to a pointer in memory, the CPU doesn't write that byte yet. Or at least, not to memory. It writes it to a local copy of that memory called a "cache."
The reason for this is that, generally speaking, applications do not write (or read) single bytes. They are more likely to write (and read) lots of bytes, in small chunks. So if you're going to perform an expensive operation like a memory load or store, you should load or store a large chunk of memory. So you store all of the changes you're going to make to a chunk of memory in a cache, then make a single write of that cached chunk to actual memory at some point in the future.
But if you have two separate devices that use the same memory, you need some way to be certain that writes one device makes are visible to other devices. Most GPUs can't read the CPU cache. And most CPU languages don't have language-level support to say "hey, that stuff I wrote to memory? I really mean for you to write it to memory now." So you usually need something to ensure visibility of changes.
In Vulkan, memory which is labeled by VK_MEMORY_PROPERTY_HOST_COHERENT_BIT means that, if you read/write that memory (via a mapped pointer, since that's the only way Vulkan lets you directly write to memory), you don't need to use the functions vkInvalidateMappedMemoryRanges/vkFlushMappedMemoryRanges to make sure the CPU/GPU can see those changes. The visibility of any changes is guaranteed in both directions. If that flag isn't available on the memory, then you must use the aforementioned functions to ensure the coherency of the specific regions of data you want to access.
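For illustration, a minimal sketch of the non-coherent case: a CPU write through a mapped pointer followed by an explicit flush (the device, memory object, offsets, and alignment handling are placeholders):

#include <string.h>
#include <vulkan/vulkan.h>

/* Write to a mapped, NON-coherent allocation and make the write visible to the
   device (offset/size should respect nonCoherentAtomSize; omitted here). */
void write_and_flush(VkDevice device, VkDeviceMemory memory,
                     VkDeviceSize offset, VkDeviceSize size, const void *src)
{
    void *ptr;
    vkMapMemory(device, memory, offset, size, 0, &ptr);
    memcpy(ptr, src, (size_t)size);

    /* Without VK_MEMORY_PROPERTY_HOST_COHERENT_BIT the write may still sit in a
       CPU cache; flushing the range guarantees the device can see it. Reading
       device writes needs vkInvalidateMappedMemoryRanges() instead. */
    VkMappedMemoryRange range = {
        .sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
        .memory = memory,
        .offset = offset,
        .size   = size,
    };
    vkFlushMappedMemoryRanges(device, 1, &range);
    vkUnmapMemory(device, memory);
}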
With coherent memory, one of two things is going on in terms of hardware. Either CPU access to the memory is not cached in any of the CPU's caches, or the GPU has direct access to the CPU's caches (perhaps due to being on the same die as the CPU(s)). You can usually tell that the latter is happening, because on-die GPU implementations of Vulkan don't bother to offer non-coherent memory options.
If memory is coherent then all threads accessing that memory must agree on the state of the memory at all times, e.g.: if thread 0 reads memory location A and thread 1 reads the same location at the same time, both threads should always read the same value.
But if memory is not coherent, then threads 0 and 1 might read back different values. Thread 0 could think that location A contains a 1, while thread 1 thinks that location contains a 2. The different threads would have an incoherent view of the memory.
Coherence is hard to achieve with a high number of cores. Often every core must be aware of memory accesses from all other cores. So if you have 4 cores in a quad-core CPU, coherence is not that hard to achieve as every core must be informed about the memory access addresses of 3 other cores, but in a GPU with 16 cores, every core must be made aware of the memory accesses of 15 other cores. The cores exchange data about the content of their caches using so-called "cache coherence protocols".
This is why GPUs often only support limited forms of coherency. If some memory locations are read only or are only accessed by a single thread, then no coherence is required. If caches are small and coherence is not always required but only at specific instructions of the program, then it is possible to achieve correct behavior of the program using cache flushes before or after specific memory accesses.
If your hardware offers both coherent and non-coherent memory types, then you can expect that non-coherent memory will be faster, but if you try to run parallel algorithms that rely on coherence using this memory, they will fail in really weird ways.
I want to share/transfer data between two processes locally or over the network. General IPC mechanisms such as shared memory and message queues can be used to transfer data, but these mechanisms involve multiple copies.
I came across the zero-copy mechanism, which reduces the copying overhead on the CPU. Linux supports this using sendfile and splice, but these APIs are not in POSIX.
How can I achieve zero copy using only POSIX APIs?
Shared memory between two processes is zero-copy if you keep the shared data in the shared memory. Otherwise there has to be a copy somewhere (e.g. into and out of shared mem). You can reduce this to one copy if one of the processes keeps the shared data in shared memory, and the other process just reads it from there.
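As a sketch of that approach using only POSIX calls (the object name and size are placeholders; synchronization and error handling are omitted):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/example_ring"
#define SHM_SIZE (1 << 20)

/* Producer: create and size the shared object, map it, write records in place. */
void *producer_map(void)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, SHM_SIZE);
    void *p = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                         /* the mapping survives the close */
    return p;                          /* write data directly here: zero copy */
}

/* Consumer: map the same object and read the records in place (no copy). */
void *consumer_map(void)
{
    int fd = shm_open(SHM_NAME, O_RDWR, 0600);
    void *p = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p;
}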
The Linux man pages for sendfile(2) and vmsplice(2) don't mention a POSIX alternative, so I doubt there is one. To send data between processes with only one copy, set up a pipe between them and use vmsplice to put pages into the pipe with zero-copy. On the receiving end, I think just use read(2) to get the pages out of the pipe.
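A rough sketch of that pipe-based path (Linux-specific, not POSIX; the pipe is assumed to be set up already and error handling is omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/* Sender: gift an existing user-space buffer to the pipe without a payload
   copy from user space into the kernel. */
ssize_t send_buffer(int pipe_wr_fd, void *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    return vmsplice(pipe_wr_fd, &iov, 1, 0);
}

/* Receiver: an ordinary read() drains the pipe into its own buffer
   (this is the one remaining copy). */
ssize_t recv_buffer(int pipe_rd_fd, void *buf, size_t len)
{
    return read(pipe_rd_fd, buf, len);
}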
Over the network, zero-copy is even harder. Why no zero-copy networking in linux kernel? has some comments and answers. The receive side would be hard to implement on top of the usual socket API, unless it only worked when a thread was blocked on read(2) on the socket. Otherwise, how would it know where in the process's virtual memory to put the packet?
What is the difference between "zero-copy networking" and "kernel bypass"? Are they two phrases meaning the same thing, or different? Is kernel bypass a technique used within "zero copy networking" and this is the relationship?
TL;DR - They are different concepts, but it is quite likely that zero copy is supported within kernel bypass API/framework.
User Bypass
This mode of communicating should also be considered. It may be possible to do DMA-to-DMA transactions which do not involve the CPU at all. The idea is to use splice() or similar functions to bypass user space entirely. Note that with splice(), the entire data stream does not need to bypass user space: headers can be read in user space and the data streamed directly to disk. The most common drawback of this is that splice() doesn't do checksum offloading.
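A rough sketch of that splice() idea, e.g. relaying a file to a socket through a pipe without the payload ever entering user space (the descriptors are assumed to be open; error handling is minimal):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

ssize_t relay(int file_fd, int sock_fd, size_t len)
{
    int p[2];
    ssize_t total = 0;
    if (pipe(p) < 0)
        return -1;

    while (total < (ssize_t)len) {
        /* file -> pipe: pages are referenced/moved, not copied to user space */
        ssize_t in = splice(file_fd, NULL, p[1], NULL, len - total,
                            SPLICE_F_MOVE | SPLICE_F_MORE);
        if (in <= 0)
            break;
        /* pipe -> socket */
        ssize_t out = splice(p[0], NULL, sock_fd, NULL, in,
                             SPLICE_F_MOVE | SPLICE_F_MORE);
        if (out <= 0)
            break;
        total += out;
    }
    close(p[0]);
    close(p[1]);
    return total;
}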
Zero copy
The zero-copy concept is only that the network buffers are fixed in place and are not moved around. In many cases, this is not really beneficial. Most modern network hardware supports scatter-gather, also known as buffer descriptors, etc. The idea is that the network hardware understands physical pointers. A buffer descriptor (sketched in code after this list) typically consists of:
Data pointer
Length
Next buffer descriptor
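A hypothetical sketch of such a descriptor; real NIC descriptor layouts are hardware-specific:

#include <stdint.h>

struct buf_desc {
    uint64_t data;   /* physical address of this data fragment            */
    uint32_t len;    /* length of the fragment in bytes                   */
    uint64_t next;   /* physical address of the next descriptor, 0 = end  */
};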
The benefit is that the network headers do not need to exist side-by-side with the payload: the IP, TCP, and application headers can reside physically separate from the application data.
If a controller doesn't support this, then the TCP/IP headers must precede the user data so that they can be filled in before sending to the network controller.
Zero copy also implies some kernel/user MMU setup so that pages are shared.
Kernel Bypass
Of course, you can bypass the kernel. This is what pcap and other sniffer software has been doing for some time. However, pcap does not prevent the normal kernel processing; the concept is simply similar to what a kernel-bypass framework would allow, i.e., deliver packets directly to user space, where header processing would happen.
However, it is difficult to see a case where user space will have a definite win unless it is tied to the particular hardware. Some network controllers may have scatter-gather support in the controller and others may not.
There are various incarnations of kernel interfaces to accomplish kernel bypass. A difficulty is what happens with the received data and how the data for transmission is produced. Often this involves other devices, and so there are many solutions.
To put this together...
Are they two phrases meaning the same thing, or different?
They are different, as the above hopefully explains.
Is kernel bypass a technique used within "zero copy networking" and this is the relationship?
It is the opposite. Kernel bypass can use zero copy and most likely will support it, as the buffers are completely under the control of the application. Also, there is no memory sharing between the kernel and user space (meaning no need for MMU shared pages and whatever cache/TLB effects that may cause). So if you are using kernel bypass, it will often be advantageous to support zero copy; the two may therefore seem the same at first.
If scatter-gather DMA is available (as in most modern controllers), either user space or the kernel can use it. Zero copy is not as useful in this case.
References:
Technical reference on OnLoad, a high-bandwidth kernel-bypass system.
PF_RING as of 2.6.32, if configured
Linux kernel network buffer management by David Miller. This gives an idea of how the protocol headers/trailers are managed in the kernel.
Zero-copy networking
You're doing zero-copy networking when you never copy the data between user space and kernel space (I mean memory space). For example:
C language
recv(fd, buffer, BUFFER_SIZE, 0);
By default the data are copied:
The kernel gets the data from the network stack
The kernel copies this data to the buffer, which is in the user-space.
With the zero-copy method, the data are not copied and arrive in user space directly from the network stack.
Kernel Bypass
Kernel bypass is when you manage the network stack and hardware yourself, in user space. It is hard, but you gain a lot of performance (there is zero copy, since all the data stay in user space). This link could be interesting if you want more information.
ZERO-COPY:
When transmitting and receiving packets, all packet data must be copied from user-space buffers to kernel-space buffers when transmitting, and vice versa when receiving. A zero-copy driver avoids this by having user space and the driver share packet buffer memory directly.
Instead of having the transmit and receive descriptors point to buffers in kernel space that later require a copy, a region of memory in user space is allocated and mapped to a given region of physical memory, shared between the kernel buffers and the user-space buffers; each descriptor buffer then points to its corresponding place in the newly allocated memory.
Other examples of kernel bypass and zero copy are DPDK and RDMA. When an application uses DPDK, it is bypassing the kernel TCP/IP stack. The application creates the Ethernet frames and the NIC grabs those frames with DMA directly from user-space memory, so it is zero copy because there is no copy from user space to kernel space. Applications can do similar things with RDMA. The application writes to queue pairs that the NIC directly accesses and transmits. RDMA's libibverbs is used inside the kernel as well, so when iSER uses RDMA it is not kernel bypass, but it is zero copy.
http://dpdk.org/
https://www.openfabrics.org/index.php/openfabrics-software.html