Understanding sendfile() and splice() - linux

sendfile() can be used to transmit data from a "file" descriptor to a "socket" descriptor in order to get data from machine A to machine B. Is it possible to get the data at the receiving end from the "socket" descriptor into a file with similar zero-copy semantics? I think sendfile() doesn't help here, because sendfile() needs the source of the data to be in the page/buffer cache. Is my understanding correct? Can splice() help in this situation?

You're correct about the limitation of sendfile for this. And yes, splice can help, but it's not trivial: splice requires that at least one of the source or target file descriptors be a pipe. So you can't directly splice from a socket to a plain file descriptor.
Conceptually, what you can do to make it work is:
set up your inbound socket fd and your output file fd as you would normally
create a pipe with pipe(2)
in a loop:
splice from the socket into the write end of the pipe
splice from the read end of the pipe into the file
Repeat the last two steps until all the data has been read.
Zero-Copy in Linux with sendfile() and splice() has an implementation of this technique.
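A minimal sketch of that loop, assuming sock_fd is the connected socket and file_fd an open output file; error handling and EINTR/EAGAIN retries are abbreviated:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Copy up to len bytes from a connected socket to a file, using a pipe
 * as the intermediary that splice() requires. */
static int splice_socket_to_file(int sock_fd, int file_fd, size_t len)
{
    int pipefd[2];
    if (pipe(pipefd) < 0)
        return -1;

    while (len > 0) {
        /* Socket -> write end of the pipe. */
        ssize_t n = splice(sock_fd, NULL, pipefd[1], NULL, len,
                           SPLICE_F_MOVE | SPLICE_F_MORE);
        if (n <= 0)
            break;                 /* 0 = peer closed, < 0 = error */

        /* Read end of the pipe -> file. */
        ssize_t left = n;
        while (left > 0) {
            ssize_t m = splice(pipefd[0], NULL, file_fd, NULL, left,
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (m <= 0)
                goto out;
            left -= m;
        }
        len -= n;
    }
out:
    close(pipefd[0]);
    close(pipefd[1]);
    return 0;
}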

Related

How can I read the routing table information from Kernel space?

I would like to be able to read the routing table from kernel space.
In user space, this information is readable in /proc/net/route, but I don't know how to read the same information from kernel space.
I don't want to modify it, only read it.
Any ideas?
To fetch the routing table, you would need to send a message of type RTM_GETROUTE to the kernel using a socket of the AF_NETLINK family — this is the rtnetlink(7) interface.
For convenience, rather than sending messages over a socket yourself, you can use the libnetlink(3) library and call rtnl_wilddump_request(struct rtnl_handle *rth, int family, int type) with RTM_GETROUTE as the type.
For an even simpler cross-platform abstraction, you could use the libdnet library, which has a function int route_get(route_t *r, struct route_entry *entry).
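If you go the raw rtnetlink(7) route from user space, the dump request itself is short. A hedged sketch, with the reply parsing (the NLMSG_*/RTA_* macros) left out:

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Ask the kernel to dump the IPv4 routing table over netlink. */
int dump_routes(void)
{
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    if (fd < 0)
        return -1;

    struct {
        struct nlmsghdr nlh;
        struct rtmsg    rtm;
    } req;

    memset(&req, 0, sizeof(req));
    req.nlh.nlmsg_len   = NLMSG_LENGTH(sizeof(struct rtmsg));
    req.nlh.nlmsg_type  = RTM_GETROUTE;
    req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
    req.rtm.rtm_family  = AF_INET;

    if (send(fd, &req, req.nlh.nlmsg_len, 0) < 0) {
        close(fd);
        return -1;
    }

    /* ... recv() the multipart reply and walk it with NLMSG_OK()/NLMSG_NEXT(),
     * pulling out RTA_DST / RTA_GATEWAY / RTA_OIF attributes ... */

    close(fd);
    return 0;
}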
You can find out where in the kernel source tree this file is created in the /proc pseudo-filesystem by searching for the "route" keyword, or for the create_proc_* family of calls. Take a look at how such files are created for your kernel version in the source.
I suspect it's located somewhere in net/ipv4/route.c

read() in linux for event file

I'm writing a program to track mouse movements in Linux. I read in another post that this can be done using the read() system call to read the eventX file associated with the mouse. I was earlier reading from a serial port file, and I used read() for that too: I passed in a character array and got back the serial characters. But that doesn't seem to work in the mouse's case. The lines:
struct input_event ie;
read(fd, &ie, sizeof(struct input_event));
are used to read it. Here ie is a struct. But I used to pass in a char buffer in the serial port case. So my question is: how do I know what struct/buffer to pass? I found the answer for the above two lines of code by googling, but if I want to read some other file, how would I know what struct/buffer to use? Please help me.
Thank you.
The input subsystem in Linux uses a standardized format to deliver its messages. It is actually quite simple:
You open the relevant input file, usually /dev/input/event<n>, using the open() system call.
You read input events from that file, using the read() function, as you noted in your question.
Every event read from that file has a well-known structure: struct input_event. You don't need to know the exact layout of the structure; the compiler takes care of that. Just include the relevant header file: #include <linux/input.h>.
What you do want to know are the fields of this structure that are useful, and what they mean. I recommend reading the official documentation as well as the input.h header itself.
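A minimal sketch of an evdev reader; the device path below is hypothetical, so pick the right /dev/input/event<n> for your mouse (e.g. from /proc/bus/input/devices):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/input.h>

int main(void)
{
    /* Hypothetical device node; substitute the event node for your mouse. */
    int fd = open("/dev/input/event2", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct input_event ie;
    while (read(fd, &ie, sizeof(ie)) == sizeof(ie)) {
        /* EV_REL events carry relative motion; code tells you the axis. */
        if (ie.type == EV_REL && (ie.code == REL_X || ie.code == REL_Y))
            printf("%s movement: %d\n", ie.code == REL_X ? "X" : "Y", ie.value);
    }

    close(fd);
    return 0;
}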

Minimizing copies when writing large data to a socket

I am writing an application server that processes images (large data). I am trying to minimize copies when sending image data back to clients. The processed images I need to send to clients are in buffers obtained from jemalloc. The ways I have thought of for sending the data back to the client are:
1) Simple write call.
// Allocate buffer buf.
// Store image data in this buffer.
write(socket, buf, len);
2) I obtain the buffer through mmap instead of jemalloc, though I presume jemalloc already creates the buffer using mmap. I then make a simple call to write.
buf = mmap(file, len); // Imagine proper options.
// Store image data in this buffer.
write(socket, buf, len);
3) I obtain a buffer through mmap like before. I then use sendfile to send the data:
buf = mmap(in_fd, len); // Imagine proper options.
// Store image data in this buffer.
int rc;
rc = sendfile(out_fd, in_fd, &offset, count);
// Deal with rc.
It seems like (1) and (2) will probably do the same thing, given that jemalloc probably allocates memory through mmap in the first place. I am not sure about (3) though. Will it really lead to any benefits? Figure 4 in this article on Linux zero-copy methods suggests that a further copy can be prevented by using sendfile:
no data is copied into the socket buffer. Instead, only descriptors
with information about the whereabouts and length of the data are
appended to the socket buffer. The DMA engine passes data directly
from the kernel buffer to the protocol engine, thus eliminating the
remaining final copy.
This seems like a win if everything works out. I don't know if my mmapped buffer counts as a kernel buffer though. Also, I don't know when it is safe to re-use this buffer. Since the fd and length are the only things appended to the socket buffer, I assume that the kernel actually writes the data to the socket asynchronously. If it does, what does the return from sendfile signify? How would I know when to re-use this buffer?
So my questions are:
What is the fastest way to write large buffers (images in my case) to a socket? The images are held in memory.
Is it a good idea to call sendfile on a mmapped file? If yes, what are the gotchas? Does this even lead to any wins?
It seems like my suspicions were correct. I got my information from this article. Quoting from it:
Also these network write system calls, including sendfile, might and
in many cases do return before the data sent over TCP by the method
call has been acknowledged. These methods return as soon as all data
is written into the socket buffers (sk buff) and is pushed to the TCP
write queue, the TCP engine can manage alone from that point on. In
other words at the time sendfile returns the last TCP send window is
not actually sent to the remote host but queued. In cases where
scatter-gather DMA is supported there is no seperate buffer which
holds these bytes, rather the buffers(sk buffs) just hold pointers to
the pages of OS buffer cache, where the contents of file is located.
This might lead to a race condition if we modify the content of the
file corresponding to the data in the last TCP send window as soon as
sendfile is returned. As a result TCP engine may send newly written
data to the remote host instead of what we originally intended to
send.
Provided the buffer from a mmapped file is even considered "DMA-able", it seems like there is no way to know when it is safe to re-use it without an explicit acknowledgement (over the network) from the actual client. I might have to stick to simple write calls and incur the extra copy. There is a paper (also from the article) with more details.
Edit: This article on the splice call also shows the problems. Quoting it:
Be aware, when splicing data from a mmap'ed buffer to a network
socket, it is not possible to say when all data has been sent. Even if
splice() returns, the network stack may not have sent all data yet. So
reusing the buffer may overwrite unsent data.
For cases 1 and 2 - does the operation you marked as // Store image data in this buffer require any conversion, or is it just a plain copy from memory to buf?
If it's just a plain copy, you can use write directly on the pointer obtained from jemalloc.
Assuming that img is a pointer obtained from jemalloc and size is the size of your image, just run the following code:
int result;
int sent = 0;
while (sent < size) {
    result = write(socket, (char *)img + sent, size - sent);
    if (result < 0) {
        /* error handling here */
        break;
    }
    sent += result;
}
This works correctly for blocking I/O (the default behavior). If you need to write the data in a non-blocking manner, you should be able to rework the code on your own, but now you have the idea.
For case 3 - sendfile is for sending data from one descriptor to another. That means you can, for example, send data from a file directly to a TCP socket without allocating any additional buffer. So, if the image you want to send to a client is in a file, just go for sendfile. If you have it in memory (because you processed it somehow, or just generated it), use the approach I mentioned earlier.
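For the file-on-disk case, a minimal sketch of the sendfile path, assuming sock_fd is a connected socket and file_fd an open regular file; sendfile advances the offset itself, so short transfers are simply retried:

#include <unistd.h>
#include <sys/stat.h>
#include <sys/sendfile.h>

/* Send an entire regular file to a connected socket. */
static int send_whole_file(int sock_fd, int file_fd)
{
    struct stat st;
    if (fstat(file_fd, &st) < 0)
        return -1;

    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t n = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
        if (n <= 0)
            return -1;   /* sendfile updates offset on success */
    }
    return 0;
}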

Linux socket buffered data size

Is there any simple function to check how much data is buffered but unread? FD_ISSET only indicates the presence of data in the buffer. Is it possible to avoid creating a second buffer in the program, for greater control over the buffer?
You could use recv() with the MSG_PEEK and MSG_DONTWAIT flags, but there's no firm guarantee that there aren't more bytes available than recv() returned in that case.
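A quick sketch of the peek approach, assuming sock_fd is a connected socket; note it can only report up to the size of the buffer you pass in:

#include <sys/socket.h>

char peek_buf[4096];
/* Peek at queued bytes without consuming them; returns -1 with errno set to
 * EAGAIN/EWOULDBLOCK if nothing is buffered right now. */
ssize_t queued = recv(sock_fd, peek_buf, sizeof(peek_buf), MSG_PEEK | MSG_DONTWAIT);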
Using a buffer within your program is the normal and accepted way to solve the problem.

socket descriptor vs file descriptor

read(2) and write(2) work on both socket descriptors and file descriptors. In the case of a file descriptor, the lookup goes from the user file descriptor table to the file table and finally to the inode table, where the file type (regular file/char/block) is checked and the read is done accordingly. In the case of a character special file, the function pointers are obtained from the character device switch based on the major number of the file, and the appropriate read/write routines registered for the device are called.
Similarly, the appropriate read/write routine for a block special file is called by getting the function pointers from the block device switch.
Could you please let me know what exactly happens when read/write is called on a socket descriptor? If read/write works on socket descriptors, why can't we use open instead of socket to get the descriptor?
As far as I know, the in-memory state of the file descriptor contains a flag identifying the file-system type of the fd, and the kernel invokes the corresponding handler function depending on that type. You can see this in read_write.c in the Linux kernel source.
To put it briefly, the kernel does the following:
In read_write.c, there is a file_system_wrapper function that calls the corresponding handler function depending on the fd's file type (ext2/ext3/socket/...).
In socket.c, there is a socket_type_wrapper function that calls the corresponding socket handler function depending on the socket's type (IPv4, IPv6, ATM, others).
In socket_ipv4.c, there is a protocol_type wrapper function that calls the corresponding protocol handler function depending on the protocol type (UDP/TCP).
In tcp_ipv4.c, there is tcp_sendmsg, and this function is called when you write to an fd of the TCP-over-IPv4 type.
Hope this is clear,
thanks,
Houcheng
Socket descriptors are associated with file structures too, but the set of file_operations functions for those structures differs from the usual one, so initialization and use of those descriptors are different. The read and write parts of the kernel-level interface just happen to be exactly equivalent.
read and write are valid for some types of sockets in some states; this all depends on the various structs which are passed around inside the kernel.
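To see the equivalence in practice, here is a tiny sketch; sock_fd is assumed to be a connected TCP socket, on which read()/write() behave like recv()/send() with flags of 0:

#include <unistd.h>
#include <sys/socket.h>

char buf[256];
/* Equivalent to recv(sock_fd, buf, sizeof(buf), 0) on a connected socket. */
ssize_t n = read(sock_fd, buf, sizeof(buf));
/* Equivalent to send(sock_fd, buf, n, 0). */
ssize_t m = write(sock_fd, buf, n);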
In principle, open() could create a socket descriptor, but the BSD sockets API was never defined that way.
There are some other (somewhat Linux-specific) types of file descriptor which are obtained from system calls other than open(), for example epoll_create() or timerfd_create(). These work the same way.
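For example, a descriptor from timerfd_create() was never open()ed, yet plain read() works on it; a minimal sketch:

#include <time.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/timerfd.h>

int main(void)
{
    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);

    /* Arm a one-shot timer that expires after one second. */
    struct itimerspec its = { .it_value = { .tv_sec = 1 } };
    timerfd_settime(tfd, 0, &its, NULL);

    uint64_t expirations;
    /* Blocks until the timer fires, then reports the expiration count. */
    read(tfd, &expirations, sizeof(expirations));

    close(tfd);
    return 0;
}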
