Minimizing copies when writing large data to a socket - linux

I am writing an application server that processes images (large data). I am trying to minimize copies when sending image data back to clients. The processed images I need to send to clients are in buffers obtained from jemalloc. The ways I have thought of to send the data back to the client are:
1) Simple write call.
// Allocate buffer buf.
// Store image data in this buffer.
write(socket, buf, len);
2) I obtain the buffer through mmap instead of jemalloc, though I presume jemalloc already creates the buffer using mmap. I then make a simple call to write.
buf = mmap(file, len); // Imagine proper options.
// Store image data in this buffer.
write(socket, buf, len);
3) I obtain a buffer through mmap like before. I then use sendfile to send the data:
buf = mmap(in_fd, len); // Imagine proper options.
// Store image data in this buffer.
int rc;
rc = sendfile(out_fd, in_fd, &offset, count);
// Deal with rc.
It seems like (1) and (2) will probably do the same thing given jemalloc probably allocates memory through mmap in the first place. I am not sure about (3) though. Will this really lead to any benefits? Figure 4 in this article on Linux zero-copy methods suggests that a further copy can be prevented using sendfile:
no data is copied into the socket buffer. Instead, only descriptors
with information about the whereabouts and length of the data are
appended to the socket buffer. The DMA engine passes data directly
from the kernel buffer to the protocol engine, thus eliminating the
remaining final copy.
This seems like a win if everything works out. I don't know if my mmapped buffer counts as a kernel buffer though. Also I don't know when it is safe to re-use this buffer. Since the fd and length are the only things appended to the socket buffer, I assume that the kernel actually writes this data to the socket asynchronously. If it does, what does the return from sendfile signify? How would I know when to re-use this buffer?
So my questions are:
What is the fastest way to write large buffers (images in my case) to a socket? The images are held in memory.
Is it a good idea to call sendfile on a mmapped file? If yes, what are the gotchas? Does this even lead to any wins?

It seems like my suspicions were correct. I got my information from this article. Quoting from it:
Also these network write system calls, including sendfile, might and
in many cases do return before the data sent over TCP by the method
call has been acknowledged. These methods return as soon as all data
is written into the socket buffers (sk buff) and is pushed to the TCP
write queue, the TCP engine can manage alone from that point on. In
other words at the time sendfile returns the last TCP send window is
not actually sent to the remote host but queued. In cases where
scatter-gather DMA is supported there is no separate buffer which
holds these bytes, rather the buffers(sk buffs) just hold pointers to
the pages of OS buffer cache, where the contents of file is located.
This might lead to a race condition if we modify the content of the
file corresponding to the data in the last TCP send window as soon as
sendfile is returned. As a result TCP engine may send newly written
data to the remote host instead of what we originally intended to
send.
Provided the buffer from a mmapped file is even considered "DMA-able", it seems like there is no way to know when it is safe to re-use it without an explicit acknowledgement (over the network) from the actual client. I might have to stick to simple write calls and incur the extra copy. There is a paper (also from the article) with more details.
Edit: This article on the splice call also shows the problems. Quoting it:
Be aware, when splicing data from a mmap'ed buffer to a network
socket, it is not possible to say when all data has been sent. Even if
splice() returns, the network stack may not have sent all data yet. So
reusing the buffer may overwrite unsent data.

For cases 1 and 2 - does the operation you marked as // Store image data in this buffer require any conversion? Is it just a plain copy from memory to buf?
If it's just plain copy, you can use write directly on the pointer obtained from jemalloc.
Assuming that img is a pointer obtained from jemalloc and size is the size of your image, just run the following code:
ssize_t result;
size_t sent = 0;
while (sent < size) {
    result = write(socket, (const char *)img + sent, size - sent);
    if (result < 0) {
        /* error handling here (check errno; EINTR usually just means "retry") */
        break;
    }
    sent += (size_t)result;
}
This works correctly for blocking I/O (the default behavior). If you need to write the data in a non-blocking manner, you should be able to rework the code on your own, but now you have the idea.
For case 3 - sendfile is for sending data from one descriptor to another. That means you can, for example, send data from a file directly to a TCP socket without allocating any additional buffer. So, if the image you want to send to a client is in a file, just go for sendfile. If you have it in memory (because you processed it somehow, or just generated it), use the approach I mentioned earlier.
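For the file-backed case, a minimal sketch of the sendfile loop, assuming in_fd is an open file holding the finished image and sock is the connected TCP socket (the names and error handling are only illustrative):

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send the whole file over the socket without copying it through user space.
   sendfile() may transfer fewer bytes than requested, so loop until done. */
int send_whole_file(int sock, int in_fd)
{
    struct stat st;
    if (fstat(in_fd, &st) < 0)
        return -1;

    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t n = sendfile(sock, in_fd, &offset, st.st_size - offset);
        if (n < 0)
            return -1;   /* check errno; EINTR/EAGAIN may just mean "retry" */
        if (n == 0)
            break;       /* unexpected end of file */
    }
    return 0;
}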

Related

netperf socket size vs buffer set for send/recv calls?

While I was trying to implement benchmark testware using netperf I happened to read its manual, where I got this query:
In the TCP_STREAM-specific test there are options -s and -S to specify the local (netperf client) and remote (netperf server) socket buffer sizes respectively. Is that a regular BSD socket buffer size? There is also an option to specify the local send message size -m and the remote receive message size -M; is this the total message size after all TCP/IP encapsulation? Can anybody throw some light on this? It would be great if you could illustrate with a use case why we need these separate parameters, as the BSD socket buffer size appears to be the upper boundary here.
The socket buffer sizes (set via -s and -S) will control how much data may be outstanding on the connection at one time, by affecting either the receiver's advertised window (which will be based on SO_RCVBUF) or how much data the sender can hold waiting for acknowledgement (which will be based on SO_SNDBUF).
The send and receive message sizes (-m and -M) control how much data is presented in any one "send" (-m) or requested in any one "recv" (-M) call.
As TCP is a streaming protocol, it is perfectly legal/possible to make a send call with a number of bytes larger than the socket buffer(s). When the socket is blocking (as netperf uses), it simply means the send call will remain there until the last of its bytes have been put into the send socket buffer. On the receive side, one can ask for more than a socket buffer's worth of data in a single receive, but the semantics are such that the call will return with however many bytes happen to be there at the time if there are any, and will return with however many bytes arrive if the socket buffer was empty at the time of the call (again because netperf uses blocking sockets/calls).
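To relate those options to ordinary sockets code: -s and -S roughly correspond to setting SO_SNDBUF/SO_RCVBUF on the two endpoints, and -m is simply the byte count handed to each send() call. A small sketch under those assumptions (function names are illustrative):

#include <sys/types.h>
#include <sys/socket.h>

/* Roughly what netperf's -s/-S do: request socket buffer sizes.
   The kernel may round or cap the values (Linux also doubles them internally). */
void size_socket_buffers(int sock, int sndbuf, int rcvbuf)
{
    setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));  /* ~ -s */
    setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));  /* ~ -S */
}

/* Roughly what -m controls: the count passed to a single send() call.
   It may legally exceed the socket buffer; a blocking send() simply waits
   until the last of its bytes has been queued. */
ssize_t send_one_message(int sock, const void *msg, size_t msg_size /* ~ -m */)
{
    return send(sock, msg, msg_size, 0);
}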

What does _serial_.bufferUntil(byte) do, and how does it synergize with serialEvent?

I am having trouble understanding this. Upon searching I found
Sets a specific byte to buffer until before calling serialEvent()
This is from this link from the Processing Website
serialEvent is the function that the user defines, passing in the Serial port as its parameter, if I'm not mistaken.
But I have seen bufferUntil('\n') when nothing is being sent to the serial port, so what is this doing? And what does "before calling serialEvent()" mean? This is put in setup, so how could it be called each time before a function? I have also seen arguments like lf, so what is happening here, and how does it synergize with that serialEvent() function?
Thanks for the help, cheers!
With bufferUntil(lf) you set up your serial port to listen (write data to its buffer) until it gets a certain character (lf, which in the example you linked is the line feed character).
As you've noticed bufferUntil(lf) won't actually read any data. To read the data the port received you need to define an interrupt function where you call readString:
void serialEvent(Serial port) {
    inString = port.readString();
}
This function will be called (interrupting the normal flow of your program, hence the name) automatically as soon as the serial port receives the character you defined with bufferUntil(lf); in the linked example that is the line feed character. After reading data from the port's buffer your program will return to wherever it was interrupted.
EDIT: What is a buffer? The buffer is either a software (a variable hidden in the library that you're using) or hardware (a bank of memory on the serial port chip) place where you store data coming to the port (this one is the reception buffer, but there is also a transmission buffer for the info you send out through the port).
Think of it as a bucket for bits or bytes. In an analogy with water flow coming out of the tap, you can open your tap and place a glass under it if you want to have a glass of water. But it might be that you want to drink your water later, so you can place a bucket (water buffer) to store water for you. In this case, the bufferUntil(lf) statement would be the action of placing the bucket and you can think of the serialEvent as the action of taking water from the bucket (the fact that you are using the bucket allows you to keep doing your errands around the house but at some point the bucket will overflow unless you either close the tap or start emptying it, and to do that you have to interrupt your normal flow of activities).
Why do we need buffers? Well, you could be polling (listening on the port from the main task of your software) continuously, but then your code would be very inefficient. With buffers you're allowed to do other things (calculating stuff, reading data from sensors or whatever) and you'll only check the port when you're sure (because your routine was interrupted) that the data you want is there. In this case, the data you want is indicated by the character you used as an argument in the bufferUntil(lf) function.
I hope I did not overstretch the analogies.

NodeJS Request Pipe buffer size

How can I set up the maximum buffer size on a NodeJS Request Pipe? I'm trying to use AWS Lambda to download from a source and pipe upload to a destination like in the code below:
request(source).pipe(request(destination))
This code works fine, but if the file size is bigger than the AWS Lambda memory size, it crashes. If I increase the memory, it works, so I know it is not the timeout or the link, but only the memory allocation. Initially I didn't want to increase the number, but even if I use the maximum it is still 1.5GB, and I'm expecting to transfer files bigger than that.
Is there a global variable for NodeJS on AWS Lambda for this? Or any other suggestion?
Two things to consider:
Do not use request(source).pipe(request(destination)) with or within a promise (async/await). For some reason it leaks memory when done with promises.
"However, STREAMING THE RESPONSE (e.g. .pipe(...)) is DISCOURAGED because Request-Promise would grow the memory footprint for large requests unnecessarily high. Use the original Request library for that. You can use both libraries in the same project." Source: https://www.npmjs.com/package/request-promise
To control how much memory the pipe uses: Set the highWaterMark for BOTH ends of the pipe. I REPEAT: BOTH ENDS OF THE PIPE. This will force the pipe to let only so much data into the pipe and out of the pipe, and thus limits its occupation in memory. (But does not limit how fast data moves through the pipe...see Bonus)
request.get(sourceUrl, {highWaterMark: 1024000, encoding: null}).pipe(request(destinationUrl, {highWaterMark: 1024000}));
1024000 is in bytes and is approximately 1MB.
Source for highWaterMark background:
"Because Duplex and Transform streams are both Readable and Writable, each maintains two separate internal buffers used for reading and writing, allowing each side to operate independently of the other while maintaining an appropriate and efficient flow of data. For example, net.Socket instances are Duplex streams whose Readable side allows consumption of data received from the socket and whose Writable side allows writing data to the socket. Because data may be written to the socket at a faster or slower rate than data is received, it is important for each side to operate (and buffer) independently of the other." <- last sentence here is the important part.
https://nodejs.org/api/stream.html#stream_readable_pipe_destination_options
Bonus: If you want to throttle how fast data passes through the pipe, check something like this out: https://www.npmjs.com/package/stream-throttle
const throttle = require('stream-throttle');
let th = new throttle.Throttle({rate: 10240000}); // if you don't want to transfer data faster than 10MB/sec
request.get(sourceUrl, {highWaterMark: 1024000, encoding: null}).pipe(th).pipe(request(destinationUrl, {highWaterMark: 1024000}));

linux write(): does it try to write as many bytes as possible?

If I use write in this way: write(fd, buf, 10000000 /* 10MB */), where fd is a socket and uses blocking I/O, will the kernel try to write as many bytes as possible so that only one call is enough? Or do I have to call write several times according to its return value? If that happens, does it mean something is wrong with fd?
Edit: Thanks for all the answers. Furthermore, if I put fd into poll and it returns successfully with POLLOUT, does that mean a call to write cannot block and will write all the data, unless something is wrong with fd?
In blocking mode, write(2) will only return once the specified number of bytes have been written. If it cannot write yet, it will wait.
In non-blocking (O_NONBLOCK) mode it will not wait; it returns right away. If it can write everything, that is a success; otherwise it writes as much as it can and returns that count, or returns -1 and sets errno accordingly. Then you have to check errno: if it is EWOULDBLOCK or EAGAIN you have to invoke the same write again (a sketch of this loop follows after the manual excerpts below).
From manual of write(2)
The number of bytes written may be less than count if, for example, there is insufficient space on the underlying physical medium, or the RLIMIT_FSIZE resource limit is encountered (see setrlimit(2)), or the call was interrupted by a signal handler after having written less than count bytes. (See also pipe(7).)
So yes, there can be something wrong with fd.
Also note this
A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee that space has successfully been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.
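A minimal sketch of the non-blocking write-and-retry loop described above, assuming sock is a connected socket already switched to O_NONBLOCK (names are illustrative). Note that POLLOUT only means some space is free in the send buffer, not that the whole remaining buffer will fit, so short writes are normal:

#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Write the whole buffer to a non-blocking socket, waiting with poll()
   whenever the send buffer is full. */
int write_all_nonblocking(int sock, const char *buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = write(sock, buf + sent, len - sent);
        if (n > 0) {
            sent += (size_t)n;
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            struct pollfd pfd = { .fd = sock, .events = POLLOUT };
            if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
                return -1;
        } else if (n < 0 && errno == EINTR) {
            continue;        /* interrupted before anything was written; retry */
        } else {
            return -1;       /* real error */
        }
    }
    return 0;
}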
/etc/sysctl.conf is used in Linux to set parameters for the TCP protocol, which is what I assume you mean by a socket. There may be a lot of parameters there, but when you dig through them, basically there is a limit to the amount of data the TCP buffers can hold at one time.
So if you try to write 10 MB of data in one go, write() may accept less; it returns a ssize_t value telling you how many bytes were actually written. Always check the return value of the write() system call. If the system allowed all 10 MB, write() would return that value.
The value is
net.core.wmem_max = [some number]
If you change some number to a value large enough to allow 10MB you can write that much. DON'T do that! You could cause other problems. Research settings before you do anything. Changing settings can decrease performance. Be careful.
http://linux.die.net/man/7/tcp
has basic C information for TCP settings. Also check out /proc/sys/net on your box.
One other point - TCP is a two-way door, so just because you can send a zillion bytes at one time does not mean the other side can read it or even handle it. Your socket may just block for a while, and possibly your write() return value may be less than you hoped for.
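If you want to see what the kernel actually gave a particular socket (requests made via setsockopt() are limited by net.core.wmem_max, and Linux doubles the requested value internally), a quick check might look like this; the helper name is just illustrative:

#include <stdio.h>
#include <sys/socket.h>

/* Print the effective send buffer size for a socket. */
void print_sndbuf(int sock)
{
    int sndbuf = 0;
    socklen_t len = sizeof(sndbuf);
    if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) == 0)
        printf("SO_SNDBUF is %d bytes\n", sndbuf);
}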

Linux socket buffered data size

Are there any simple functions to check how much data is buffered but unread? FD_ISSET only indicates the presence of data in the buffer. Is it possible to avoid creating a second buffer in the program, for greater control of the buffer?
You could use recv() with the MSG_PEEK and MSG_DONTWAIT flags, but there's no firm guarantee that there aren't more bytes available than recv() returned in that case.
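A minimal sketch of that peek approach, assuming sock is a connected stream socket (the 64 KB scratch size is arbitrary):

#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Peek at what is currently readable without consuming it.
   Returns the number of bytes that could be read right now, or -1 on error.
   More data may already be queued beyond the scratch buffer, and recv()
   returning 0 here means the peer closed the connection. */
ssize_t peek_buffered(int sock)
{
    char scratch[64 * 1024];   /* arbitrary upper bound on the peek */
    ssize_t n = recv(sock, scratch, sizeof(scratch), MSG_PEEK | MSG_DONTWAIT);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;              /* nothing buffered at the moment */
    return n;
}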
Using a buffer within your program is the normal and accepted way to solve the problem.
