How can I achieve zero copy mechanism in POSIX? - linux

I want to share/transfer data between two processes, locally or over the network. General IPC mechanisms such as shared memory and message queues can be used to transfer data, but these mechanisms involve multiple copies.
I came across the zero-copy mechanism, which reduces the copying overhead on the CPU. Linux supports it with sendfile and splice, but these APIs are not in POSIX.
How can I achieve zero copy using only POSIX APIs?

Shared memory between two processes is zero-copy if you keep the shared data in the shared memory. Otherwise there has to be a copy somewhere (e.g. into and out of shared mem). You can reduce this to one copy if one of the processes keeps the shared data in shared memory, and the other process just reads it from there.
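For example, here is a minimal POSIX-only sketch of that idea: the producer builds its data directly inside a POSIX shared memory object, and a reader that opens and maps the same object sees the bytes without any copy being made. The name "/my_shm" and the size are only illustrative (link with -lrt on older glibc):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = shm_open("/my_shm", O_CREAT | O_RDWR, 0600);   /* POSIX shared memory object */
    if (fd == -1) { perror("shm_open"); return 1; }

    size_t len = 1 << 20;
    if (ftruncate(fd, (off_t)len) == -1) { perror("ftruncate"); return 1; }

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello from the producer");   /* data is produced in place, never copied */

    munmap(p, len);
    close(fd);
    return 0;
}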
The Linux man pages for sendfile(2) and vmsplice(2) don't mention a POSIX alternative, so I doubt there is one. To send data between processes with only one copy, set up a pipe between them and use vmsplice to put pages into the pipe with zero-copy. On the receiving end, I think just use read(2) to get the pages out of the pipe.
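A minimal sketch of that pipe + vmsplice approach, assuming the two processes are related via fork() (vmsplice is Linux-specific, not POSIX; the buffer is page-aligned so whole pages can be handed to the pipe):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int pfd[2];
    if (pipe(pfd) == -1) { perror("pipe"); return 1; }

    long pagesize = sysconf(_SC_PAGESIZE);
    char *buf;
    if (posix_memalign((void **)&buf, (size_t)pagesize, (size_t)pagesize) != 0) return 1;
    memset(buf, 'A', (size_t)pagesize);

    pid_t pid = fork();
    if (pid == 0) {                         /* receiver */
        close(pfd[1]);
        char in[65536];
        ssize_t n = read(pfd[0], in, sizeof in);   /* one copy: pipe -> user buffer */
        printf("receiver got %zd bytes\n", n);
        return 0;
    }

    close(pfd[0]);                          /* sender */
    struct iovec iov = { .iov_base = buf, .iov_len = (size_t)pagesize };
    if (vmsplice(pfd[1], &iov, 1, 0) == -1) perror("vmsplice");   /* zero-copy into the pipe */
    close(pfd[1]);
    wait(NULL);
    return 0;
}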
Over the network, zero-copy is even harder. Why no zero-copy networking in linux kernel? has some comments and answers. The receive side would be hard to implement on top of the usual socket API, unless it only worked when a thread was blocked on read(2) on the socket. Otherwise, how would it know where in the process's virtual memory to put the packet?

Related

Can I manually pin memory for use with CUDA?

I have an existing Linux application that I'd like to accelerate with CUDA. The application streams data in from another process, applies some signal processing operations to that data, and streams the data out to a follow-on process ("process" here in the operating-system sense).
The IPC method for streaming the data is dictated by the framework in which my application runs. Namely, named shared memory blocks are used as circular buffers between each process: when my application has data available at its output, it writes it to the shared memory block in a circular fashion.
For maximum throughput of the CUDA version of my application, I would like to overlap all of the following:
The copying of the N-th block of input data to the device.
The processing of the N-1-th block of input data by the device.
The copying of the N-2-th block of output data from the device to the host.
From what I understand, to achieve overlap with all of these operations, I must use pinned memory on the host. However, my input/output shared memory buffers are allocated by the framework, which is outside of my control; I can't make it allocate pinned memory. I could do copies to intermediate buffers, but this increases the memory footprint of my application, which is undesirable. Ideally, I'd like to pin the existing shared memory buffers instead.
Is there a way that, given an arbitrary block of virtual memory, I can pin it such that it is suitable for use with asynchronous overlapped CUDA memory copies? Based on its man page, the mlock() function sounds like it might do what I want. Are there any pitfalls to be found here?

"zero copy networking" vs "kernel bypass"?

What is the difference between "zero-copy networking" and "kernel bypass"? Are they two phrases meaning the same thing, or different? Is kernel bypass a technique used within "zero copy networking" and this is the relationship?
TL;DR - They are different concepts, but it is quite likely that zero copy is supported within a kernel-bypass API/framework.
User Bypass
This mode of communicating should also be considered. It may be possible to do DMA-to-DMA transactions which do not involve the CPU at all. The idea is to use splice() or similar functions to avoid user space entirely. Note that with splice(), the entire data stream does not need to bypass user space: headers can be read in user space and the data streamed directly to disk. The most common downfall of this is that splice() doesn't do checksum offloading.
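A hedged sketch of that header-in-user-space, data-to-disk pattern; sock_fd and file_fd are assumed to be already-open descriptors, and splice() needs a pipe as the intermediate buffer:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Stream payload_len bytes from a socket to a file without passing the
   payload through user space. Returns bytes streamed, or -1 on error. */
ssize_t stream_payload(int sock_fd, int file_fd, size_t payload_len)
{
    int p[2];
    if (pipe(p) == -1)
        return -1;

    size_t left = payload_len;
    while (left > 0) {
        ssize_t in = splice(sock_fd, NULL, p[1], NULL, left, SPLICE_F_MOVE);          /* socket -> pipe */
        if (in <= 0)
            break;
        ssize_t out = splice(p[0], NULL, file_fd, NULL, (size_t)in, SPLICE_F_MOVE);   /* pipe -> file */
        if (out <= 0)
            break;
        left -= (size_t)out;
    }
    close(p[0]);
    close(p[1]);
    return (ssize_t)(payload_len - left);
}

The application would first read() just the protocol headers from sock_fd in user space, then call something like this for the bulk data.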
Zero copy
The zero-copy concept is only that the network buffers are fixed in place and are not moved around. In many cases, this is not really beneficial. Most modern network hardware supports scatter-gather, also known as buffer descriptors, etc. The idea is that the network hardware understands physical pointers. A buffer descriptor typically consists of (see the sketch after this list):
Data pointer
Length
Next buffer descriptor
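As a rough illustration only (real NIC descriptor layouts are hardware-specific), the shape of such a descriptor might look like:

#include <stdint.h>

struct buf_desc {
    uint64_t         data_ptr;   /* physical address of this fragment */
    uint32_t         length;     /* length of this fragment in bytes  */
    struct buf_desc *next;       /* next descriptor in the chain, or NULL; in real hardware
                                    this is usually another physical address, not a C pointer */
};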
The benefit is that the network headers do not need to exist side by side; the IP, TCP, and application headers can reside physically separate from the application data.
If a controller doesn't support this, then the TCP/IP headers must precede the user data so that they can be filled in before sending to the network controller.
Zero copy also implies some kernel-user MMU setup so that pages are shared.
Kernel Bypass
Of course, you can bypass the kernel. This is what pcap and other sniffer software have been doing for some time. However, pcap does not prevent the normal kernel processing; the concept is nonetheless similar to what a kernel-bypass framework would allow, i.e., deliver packets directly to user space, where header processing would happen.
However, it is difficult to see a case where user space will have a definite win unless it is tied to the particular hardware. Some network controllers may have scatter gather supported in the controller and others may not.
There are various incarnations of kernel interfaces to accomplish kernel bypass. A difficulty is what happens with the received data and how to produce the data for transmission. Often this involves other devices, and so there are many solutions.
To put this together...
Are they two phrases meaning the same thing, or different?
They are different as above hopefully explains.
Is kernel bypass a technique used within "zero copy networking" and this is the relationship?
It is the opposite. Kernel bypass can use zero copy and most likely will support it, as the buffers are completely under the control of the application. Also, there is no memory sharing between the kernel and user space (meaning no need for MMU shared pages and whatever cache/TLB effects that may cause). So if you are using kernel bypass, it will often be advantageous to support zero copy as well; that is why the two may seem the same at first.
If scatter-gather DMA is available (most modern controllers), either user space or the kernel can use it. Zero copy is not as useful in this case.
Reference:
Technical reference on OnLoad, a high-bandwidth kernel-bypass system.
PF Ring as of 2.6.32, if configured
Linux kernel network buffer management by David Miller. This gives an idea of how the protocols headers/trailers are managed in the kernel.
Zero-copy networking
You're doing zero-copy networking when you never copy the data between user space and kernel space (I mean memory space). For example:
C language
recv(fd, buffer, BUFFER_SIZE, 0);
By default the data are copied:
The kernel gets the data from the network stack
The kernel copies this data to the buffer, which is in the user-space.
With zero-copy method, the data are not copied and come to the user-space directly from the network stack.
Kernel Bypass
Kernel bypass is when you manage the network stack and the hardware yourself, in user space. It is hard, but you gain a lot of performance (and there is zero copy, since all the data stay in user space).
ZERO-COPY:
When transmitting and receiving packets, all packet data must normally be copied from user-space buffers to kernel-space buffers for transmitting, and vice versa for receiving. A zero-copy driver avoids this by having user space and the driver share packet buffer memory directly.
Instead of having the transmit and receive descriptors point to buffers in kernel space, which would later have to be copied, a region of memory in user space is allocated and mapped to a given region of physical memory, to be shared between the kernel buffers and the user-space buffers; each descriptor buffer then points to its corresponding place in the newly allocated memory.
Other examples of kernel bypass and zero copy are DPDK and RDMA. When an application uses DPDK, it bypasses the kernel TCP/IP stack: the application creates the Ethernet frames and the NIC grabs those frames with DMA directly from user-space memory, so it is zero copy because there is no copy from user space to kernel space. Applications can do similar things with RDMA: the application writes to queue pairs that the NIC accesses directly and transmits. RDMA (libibverbs) is used inside the kernel as well, so when iSER uses RDMA it is not kernel bypass, but it is zero copy.
http://dpdk.org/
https://www.openfabrics.org/index.php/openfabrics-software.html

Is fork() copy-on-write a stable exposed behavior that can be used to implement read-only shared memory?

The man page on fork() states that it does not copy data pages; it maps them into the child process and sets a copy-on-write flag. Is that behavior:
consistent between flavors of Linux?
considered an implementation detail and therefore likely to change?
I'm wondering if I can use fork() as a means to get a shared read-only memory block on the cheap. If the memory is physically copied, it would be rather expensive - there's a lot of forking going on, and the data area is big enough - but I'm hoping not...
Linux running on machines without a MMU (memory management unit) will copy all process memory on fork().
However, those systems are usually very small and embedded and you probably don't have to worry about them.
Many services, such as Apache with its fork model, use this initialize-then-fork() approach to share initialized data structures.
You should be aware that if you are using languages like Perl and Python that use reference-counted variables, or C++ shared_ptr, this model will not work: as the reference counts are adjusted up and down, the memory becomes unshared and gets copied.
This causes huge amounts of memory usage in Perl daemons like SpamAssassin that attempt to use an initialize and fork model.
Yes, you can certainly rely on it on MMU Linux kernels, which covers almost everything.
However, the page size isn't the same everywhere.
It is possible to explicitly make a shared memory area for forked processes, by using mmap() to create an anonymous map - one which is not backed by a physical file. On fork, this area will always remain shared (provided the child doesn't unmap it, or map something else in at the same address). You can mprotect it to be readonly if you want.
Memory allocated with (for example) malloc can easily end up sharing a page with something that isn't readonly, which means it gets copied anyway when another structure is modified. This includes internal structures used by the malloc implementation. So you might want to mmap a specific area for this purpose and allocate from that.
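A minimal sketch of that approach: an anonymous MAP_SHARED region created before fork(), filled in, then made read-only with mprotect():

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    size_t len = 1 << 20;                                   /* 1 MiB shared area */
    char *shared = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(shared, "initialized before fork");              /* fill in before forking */
    mprotect(shared, len, PROT_READ);                       /* read-only from here on */

    if (fork() == 0) {
        printf("child sees: %s\n", shared);                 /* same pages, never copied */
        return 0;
    }
    wait(NULL);
    return 0;
}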
Can you rely on the fact that all Linux flavors do it this way? No. But you can rely on the fact that any which don't will be using an even faster method.
Therefore you should use the feature and rely on it and revisit your decision if you get a performance problem.
The success of this approach depends on how well you stick to your self-imposed "read-only" limitation. Both parent and child have to obey this stricture, else the memory gets copied.
This may not be the catastrophe you're envisioning, however. The kernel can copy as little as a single page (typically 4 KB) to implement CoW semantics. A typical Linux server will use something more complex, some sort of slab allocator, so the copied region could be much larger.
The main point is that this is decoupled from your program's conception of its memory use. If you malloc() 1 GB of RAM, fork off a child, and the child changes just the first byte of that memory block, the entire 1 GB block isn't copied. Perhaps as little as one page is copied, up to the slab size containing that first byte.
Yes
All the Linux distros use the same kernel, albeit with slightly different versions and releases of it.
It's unlikely that another underlying fork(2) implementation will be faster any time soon, so it's a safe bet that copy-on-write will continue to be the mechanism. Perhaps it won't be forever, but for years, definitely.
Certainly some major software systems (for example, Phusion Passenger) use fork(2) in the same way that you want to, so you would not be the only one taking advantage of CoW.

Avoid copying of data between user and kernel space and vice-versa

I am developing an active messaging protocol for parallel computation that replaces TCP/IP. My goal is to decrease the latency of a packet. Since the environment is a LAN, I can replace TCP/IP with a simpler protocol to reduce packet latency. I am not writing any device driver; I am just trying to replace the TCP/IP stack with something simpler. Now I want to avoid copying a packet's data from user space to kernel space and vice versa. I have heard of mmap(). Is it the best way to do this? If yes, it would be nice if you could give links to some examples. I am a Linux newbie and I really appreciate your help. Thank you...
Thanks,
Bala
You should use UDP; it is already pretty fast. At least it was fast enough for W32/SQLSlammer to spread through the whole Internet.
About your initial question, see the (vm)splice and tee Linux system calls.
From the manpage:
The three system calls splice(2), vmsplice(2), and tee(2) provide userspace programs with full control over an arbitrary kernel buffer, implemented within the kernel using the same type of buffer that is used for a pipe. In overview, these system calls perform the following tasks:
splice(2) moves data from the buffer to an arbitrary file descriptor, or vice versa, or from one buffer to another.
tee(2) "copies" the data from one buffer to another.
vmsplice(2) "copies" data from user space into the buffer.
Though we talk of copying, actual copies are generally avoided. The kernel does this by implementing a pipe buffer as a set of reference-counted pointers to pages of kernel memory. The kernel creates "copies" of pages in a buffer by creating new pointers (for the output buffer) referring to the pages, and increasing the reference counts for the pages: only pointers are copied, not the pages of the buffer.
Since the environment is a LAN, i can replace TCP/IP with simpler protocol to reduce the packet latency
Generally, even on a LAN, UDP packets tend to be lost; they will also be lost if the client does not have enough time to consume them...
So no, do not replace TCP with something else (UDP). If you need reliable delivery, TCP will be the fastest, because everything connected to acknowledgments and retransmission is done in kernel space.
Generally, in the normal case, there are no latency drawbacks to using TCP (of course, do not forget the TCP_NODELAY option).
About sharing the memory: actually, all the memory you allocate is created with mmap, so the kernel will need to copy it somehow in any case when it creates a packet for the driver.
If you are talking about reducing copying, that is usually done for files/sockets: sendfile() indeed prevents copying data between kernel and user space. But I assume you do not need to send files.
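For completeness, a minimal sketch of the sendfile() pattern mentioned above, assuming file_fd and sock_fd are already-open descriptors:

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Push a whole file to a socket inside the kernel, with no user-space buffer. */
ssize_t send_whole_file(int sock_fd, int file_fd)
{
    struct stat st;
    if (fstat(file_fd, &st) == -1)
        return -1;

    off_t offset = 0;
    size_t remaining = (size_t)st.st_size;
    while (remaining > 0) {
        ssize_t n = sendfile(sock_fd, file_fd, &offset, remaining);
        if (n <= 0)
            return -1;
        remaining -= (size_t)n;
    }
    return (ssize_t)st.st_size;
}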

What is the ideal & fastest way to communicate between kernel and user space?

I know that information exchange can happen via the following interfaces between the kernel and user-space programs:
system calls
ioctls
/proc & /sys
netlink
I want to find out
If I have missed any other interface?
Which one of them is the fastest way to exchange large amounts of data?
(and if there is any document/mail/explanation supporting such a claim that I can refer to)
Which one is the recommended way to communicate? (I think it's netlink, but I would still love to hear opinions)
The fastest way to exchange a vast amount of data is memory mapping. The mmap call can be used on a device file, and the corresponding kernel driver can then decide to map kernel memory into the user address space. A good example of this is the Video For Linux drivers, and I suppose the frame buffer driver works the same way. For a good explanation of how the V4L2 driver works, you have:
The lwn.net article about streaming I/O
The V4L2 spec itself
You can't beat memory mapping for large amounts of data, because there is no memcpy-like operation involved; the underlying physical memory is effectively shared between kernel and userspace. Of course, like in all shared memory mechanisms, you have to provide some synchronisation so that kernel and userspace don't think they have ownership at the same time.
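The user-space side of that pattern is small; this sketch maps a hypothetical device file (the name and size are illustrative, and the real driver's mmap handler decides which kernel pages back the range):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/mydevice", O_RDWR);                 /* illustrative device name */
    if (fd == -1) { perror("open"); return 1; }

    size_t len = 1 << 20;                                   /* size agreed with the driver */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Reads and writes through buf touch the driver's buffer directly: no copy. */

    munmap(buf, len);
    close(fd);
    return 0;
}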
Shared memory between kernel and userspace is doable.
http://kerneltrap.org/node/14326
For instructions/examples.
You can also use a named pipe, which is pretty fast.
All this really depends on what data you are sharing, whether it is concurrently accessed, and how the data is structured. Calls may be enough for simple data.
Linux kernel /proc FIFO/pipe
Might also help
good luck
You may also consider relay (formerly relayfs):
"Basically relayfs is just a bunch of per-cpu kernel buffers that can be efficiently written into from kernel code. These buffers are represented as files which can be mmap'ed and directly read from in user space. The purpose of this setup is to provide the simplest possible mechanism allowing potentially large amounts of data to be logged in the kernel and 'relayed' to user space."
http://relayfs.sourceforge.net/
You can obviously do shared memory with copy_from_user etc.; you can easily set up a character device driver (basically all you have to do is fill in a file_operations structure), but this is by far not the fastest way.
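For reference, a hedged sketch of that character-device route (kernel side, trimmed to the essentials; the names are illustrative). Note the explicit copy_to_user()/copy_from_user() copies, which is exactly why this is not the fastest way:

#include <linux/fs.h>
#include <linux/module.h>
#include <linux/uaccess.h>

static int major;
static char kbuf[4096];

static ssize_t demo_read(struct file *f, char __user *ubuf, size_t len, loff_t *off)
{
    size_t n = len < sizeof(kbuf) ? len : sizeof(kbuf);
    if (copy_to_user(ubuf, kbuf, n))        /* kernel -> user copy */
        return -EFAULT;
    return n;
}

static ssize_t demo_write(struct file *f, const char __user *ubuf, size_t len, loff_t *off)
{
    size_t n = len < sizeof(kbuf) ? len : sizeof(kbuf);
    if (copy_from_user(kbuf, ubuf, n))      /* user -> kernel copy */
        return -EFAULT;
    return n;
}

static const struct file_operations demo_fops = {
    .owner = THIS_MODULE,
    .read  = demo_read,
    .write = demo_write,
};

static int __init demo_init(void)
{
    major = register_chrdev(0, "demo_chan", &demo_fops);
    return major < 0 ? major : 0;
}

static void __exit demo_exit(void)
{
    unregister_chrdev(major, "demo_chan");
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");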
I have no benchmarks, but system calls on modern systems should be the fastest. My reasoning is that this is what has been optimized the most. It used to be that to get from user to kernel mode one had to raise an interrupt, which would go to the interrupt table (an array), locate the interrupt handler (0x80), and then switch to kernel mode. This was really slow, and then came the sysenter instruction, which basically makes this transition really fast. Without going into details, sysenter loads CS:EIP directly from machine registers, and the switch is quite fast.
Shared memory, on the contrary, requires writing to and reading from memory, which is far more expensive than reading from a register.
Here is a possible compilation of all the possible interfaces, although in some ways they overlap one another (e.g., sockets and system calls both effectively use system calls):
Procfs
Sysfs
Configfs
Debugfs
Sysctl
devfs (e.g., character devices)
TCP/UDP Sockets
Netlink Sockets
Ioctl
Kernel System Calls
Signals
Mmap
As for shared memory, I've found that even with NUMA, two threads running on two different cores and communicating through shared memory still require reads and writes that go through the L3 cache, which, if you are lucky (both cores on one socket), is about 2x slower than a syscall, and if not on one socket, is about 5x or more slower than a syscall; I think the syscall's hardware mechanism helps here.
