NodeJS Request Pipe buffer size

How can I set up the maximum buffer size on a NodeJS Request Pipe? I'm trying to use AWS Lambda to download from a source and pipe upload to a destination like in the code below:
request(source).pipe(request(destination))
This code works fine, but if the file size is bigger than the AWS Lambda memory size, it crashes. If I increase the memory it works, so I know it is not the timeout or the link, only the memory allocation. Initially I didn't want to increase the number, but even at the maximum it is still 1.5GB, and I'm expecting to transfer files bigger than that.
Is there a global variable for NodeJS on AWS Lambda for this? Or any other suggestion?

Two things to consider:
Do not use request(source).pipe(request(destination)) with or within a promise (async/await). For some reason it leaks memory when used with promises.
"However, STREAMING THE RESPONSE (e.g. .pipe(...)) is DISCOURAGED because Request-Promise would grow the memory footprint for large requests unnecessarily high. Use the original Request library for that. You can use both libraries in the same project." Source: https://www.npmjs.com/package/request-promise
To control how much memory the pipe uses: set the highWaterMark for BOTH ends of the pipe. I REPEAT: BOTH ENDS OF THE PIPE. This forces the pipe to let only so much data into and out of the pipe, and thus limits how much of it sits in memory. (It does not limit how fast data moves through the pipe...see Bonus)
request.get(sourceUrl, { highWaterMark: 1024000, encoding: null }).pipe(request(destinationUrl, { highWaterMark: 1024000 }));
1024000 is in bytes and is approximately 1MB.
Source for highWaterMark background:
"Because Duplex and Transform streams are both Readable and Writable, each maintains two separate internal buffers used for reading and writing, allowing each side to operate independently of the other while maintaining an appropriate and efficient flow of data. For example, net.Socket instances are Duplex streams whose Readable side allows consumption of data received from the socket and whose Writable side allows writing data to the socket. Because data may be written to the socket at a faster or slower rate than data is received, it is important for each side to operate (and buffer) independently of the other." <- last sentence here is the important part.
https://nodejs.org/api/stream.html#stream_readable_pipe_destination_options
Bonus: If you want to throttle how fast data passes through the pipe, check something like this out: https://www.npmjs.com/package/stream-throttle
const throttle = require('stream-throttle');
let th = new throttle.Throttle({rate: 10240000}); // if you don't want to transfer data faster than 10 MB/sec
request.get(sourceUrl, { highWaterMark: 1024000, encoding: null }).pipe(th).pipe(request(destinationUrl, { highWaterMark: 1024000 }));

Related

Node uploaded image save - stream vs buffer

I am working on image upload and don't know how to properly deal with storing the received file. It would be nice to analyze the file first to check whether it is really an image or someone just changed the extension. Luckily, I use the sharp package, which has exactly such a feature. I currently work with two approaches.
Buffering approach
I can parse the multipart form as a buffer and easily decide whether to save the file or not.
const metadata = await sharp(buffer).metadata();
if (metadata) {
  saveImage(buffer);
} else {
  throw new Error('It is not an image');
}
Streaming approach
I can parse the multipart form as a readable stream. First I need to pipe the readable stream to a writable one and store the file on disk. Afterwards, I need to create a readable stream from the saved file again and verify whether it is really an image. Otherwise, revert everything.
// save uploaded file to file system with stream
readableStream.pipe(createWriteStream('./uploaded-file.jpg'));

// verify whether it is an image
createReadStream('./uploaded-file.jpg').pipe(
  sharp().metadata((err, metadata) => {
    if (!metadata) {
      revertAll();
      throw new Error('It is not an image');
    }
  })
);
It was my intention to avoid using a buffer because, as far as I know, it needs to hold the whole file in RAM. But on the other hand, the approach using streams seems really clunky.
Can someone help me understand how these two approaches differ in terms of performance and resource usage? Or is there a better approach to deal with such a situation?
In buffer mode, all the data coming from a resource is collected into a buffer, think of it as a data pool, until the operation is completed; it is then passed back to the caller as one single blob of data. Buffers in V8 are limited in size. You cannot allocate more than a few gigabytes of data, so you may hit a wall way before running out of physical memory if you need to read a big file.
On the other hand, streams allow us to process the data as soon as it arrives from the resource, without storing it all in memory first. Streams can therefore be more efficient in terms of both space (memory usage) and time (clock time of the computation).
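As a rough illustration of the difference, here is a minimal sketch (the file path is a placeholder) comparing the two modes of reading the same file:
const fs = require('fs');

// Buffer mode: the whole file must fit in memory before we can touch it.
const whole = fs.readFileSync('./uploaded-file.jpg');
console.log('buffered bytes:', whole.length);

// Stream mode: data arrives chunk by chunk (64 KB per chunk by default),
// so memory usage stays roughly constant regardless of file size.
let total = 0;
fs.createReadStream('./uploaded-file.jpg')
  .on('data', (chunk) => { total += chunk.length; })
  .on('end', () => console.log('streamed bytes:', total));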

How does a buffer work in Node.js?

I'm new to Node.js and trying to broadcast streaming video, but I have no idea how to do this. I want to know how buffering works in a Node.js application.
Buffers are instances of the Buffer class in node, which is designed to handle raw binary data. Each buffer corresponds to some raw memory allocated outside V8. Buffers act somewhat like arrays of integers, but aren't resizable and have a whole bunch of methods specifically for binary data. In addition, the "integers" in a buffer each represent a byte and so are limited to values from 0 to 255 (2^8 - 1), inclusive.
More about buffers here.
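To get a concrete feel for that description, here is a small sketch (the string and values are arbitrary) showing how raw bytes appear in a Buffer:
// Create a buffer from a string and look at its raw bytes.
const buf = Buffer.from('node', 'utf8');
console.log(buf);        // <Buffer 6e 6f 64 65>
console.log(buf[0]);     // 110, the byte value for 'n', always in the 0-255 range
console.log(buf.length); // 4, and fixed: buffers are not resizable
// Writing a value above 255 is truncated to a single byte.
buf[0] = 300;
console.log(buf[0]);     // 44 (300 mod 256)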
Looks something like this:
Data is processed in streams rather than as one whole chunk of data at a time. The chunks are collected in a buffer and, once the buffer is full, they are passed on from one point to another (to the client requesting the data).
It is something like streaming movies online: we don't have to wait for all of the data to arrive, but receive it in chunks and start using it even before the rest has arrived. This video is simple and helpful.
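To make that concrete for video, here is a minimal sketch (the file name and port are hypothetical) of serving a video by streaming it chunk by chunk instead of buffering the whole file first:
const fs = require('fs');
const http = require('http');

http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'video/mp4' });
  // pipe() reads the file in chunks and writes each chunk to the response
  // as soon as it is available, applying backpressure automatically so
  // memory use stays bounded.
  fs.createReadStream('./movie.mp4').pipe(res);
}).listen(3000);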

piping node.js object streams to multiple destinations is producing bizarre results -- why?

When piping one transform stream to two other transform streams, occasionally I'm getting a few of the objects from one destination stream appearing in place of the proper objects in the other destination stream. In a stream of 90,000 objects, in about 1 out of 3 runs, about 10 objects starting around sequence number 10,000 are from the wrong stream (the start position and number of anomalous objects vary). What in the world could account for such bizarre results?
The setup:
sourceStream.pipe(processingStream1).pipe(check1);
processingStream1.pipe(check2).pipe(destinationStream1);
processingStream1.pipe(processingStream2).pipe(destinationStream2);
The sourceStream is a transform stream fed by a file read. The two destination streams are transform streams leading to file writes. Both the file read and file write are through the fs streaming API. All the streams rely on node.js automatic backpressure in piping.
Occasionally objects from processingStream2 are leaking into destinationStream1, as described above.
The checking streams (check1 a sink, check2 a passthrough) show the anomalous objects exist in the stream through check2 but not in the stream into check1.
The file reads and writes are of text (csv) files. I'm using Node.js version 8.6 on Windows 7 (though deserved, please don't throw rocks at me for the latter).
Suggestions on how to better isolate the problem are also welcome. The anomaly is structured enough that it doesn't seem like a generic memory leak, but it is not consistent enough to be a code error. I'm mystified.
Ugh! processingStream2 modifies the object in the stream coming through it (actually, it modifies a property of a sub-object). Apparently you can't count on the order of the pipes to control the order of changes to the streamed objects. Very occasionally, after the source objects are sent through processingStream2, the input object to processingStream2 goes into processingStream1 via Node internals, probably as part of some optimization under the hood.
Lesson learned: don't change the input streamed object when piping to multiple destinations, even if you think you're making the change downstream. May you never have to learn this lesson the hard way!
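A minimal sketch of the safer pattern (the Transform subclass and the field names are made up for illustration): emit a copy instead of mutating the object that the other pipe still holds a reference to.
const { Transform } = require('stream');

// Instead of mutating chunk (which destinationStream1 may still see),
// emit a shallow copy carrying the changed property.
const processingStream2 = new Transform({
  objectMode: true,
  transform(chunk, encoding, callback) {
    const copy = { ...chunk, meta: { ...chunk.meta, processed: true } };
    callback(null, copy);
  }
});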

Is the Node.js stream pipe symmetric?

I'm building a server which transfers files from endpoint A to endpoint B.
I'm wondering whether the Node.js stream pipe is symmetric.
If I do the following: request.get(A).pipe(request.put(B));, does it upload as fast as it downloads?
I'm asking this question because my server has an asymmetric connection (it downloads faster than it uploads), and I'm trying to avoid memory consumption.
According to node's documentation on stream#pipe
pipe will switch the read stream to flowing mode - it will read only when the write stream has finished consuming previous packets.
readable.pipe() method attaches a Writable stream to the readable, causing it to switch automatically into flowing mode and push all of its data to the attached Writable. The flow of data will be automatically managed so that the destination Writable stream is not overwhelmed by a faster Readable stream.
So your transfer may be asymmetrical, due to different send/download speed - the difference may be buffered in Node's memory - Buffering of streams
Buffering#
Both Writable and Readable streams will store data in an internal
buffer that can be retrieved using writable._writableState.getBuffer()
or readable._readableState.buffer, respectively.
The amount of data potentially buffered depends on the highWaterMark
option passed into the streams constructor. For normal streams, the
highWaterMark option specifies a total number of bytes. For streams
operating in object mode, the highWaterMark specifies a total number
of objects.
Data is buffered in Readable streams when the implementation calls
stream.push(chunk). If the consumer of the Stream does not call
stream.read(), the data will sit in the internal queue until it is
consumed.
Once the total size of the internal read buffer reaches the threshold
specified by highWaterMark, the stream will temporarily stop reading
data from the underlying resource until the data currently buffered
can be consumed (that is, the stream will stop calling the internal
readable._read() method that is used to fill the read buffer).
Data is buffered in Writable streams when the writable.write(chunk)
method is called repeatedly. While the total size of the internal
write buffer is below the threshold set by highWaterMark, calls to
writable.write() will return true. Once the size of the internal
buffer reaches or exceeds the highWaterMark, false will be returned.
A key goal of the stream API, and in particular the stream.pipe()
method, is to limit the buffering of data to acceptable levels such
that sources and destinations of differing speeds will not overwhelm
the available memory.
Because Duplex and Transform streams are both Readable and Writable,
each maintain two separate internal buffers used for reading and
writing, allowing each side to operate independently of the other
while maintaining an appropriate and efficient flow of data. For
example, net.Socket instances are Duplex streams whose Readable side
allows consumption of data received from the socket and whose Writable
side allows writing data to the socket. Because data may be written to
the socket at a faster or slower rate than data is received, it is
important for each side to operate (and buffer) independently of the other.
I recommend that you look at this question here, where the topic is elaborated further.
If you run the following sample
const http = require('http');
http.request({ method: 'GET', host: 'somehost.com', path: '/cat-picture.jpg' }, (response) => {
  console.log(response);
}).end();
you can explore the underlying sockets - on my system they all have the highWaterMark: 16384 property. So, if I understand the documentation and the above-mentioned questions correctly, in your case about 16KB may be buffered in the faster GET socket at the Node.js level - what happens below that is probably highly dependent on your system/network configuration.
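As a quick way to check those limits yourself, here is a small sketch (the host is a placeholder; the readableHighWaterMark/writableHighWaterMark properties need Node 9.3 or newer) that prints the high-water marks involved:
const http = require('http');

http.get({ host: 'example.com', path: '/' }, (response) => {
  // How much the response (readable side) will buffer before it stops
  // pulling data from the socket.
  console.log('response readableHighWaterMark:', response.readableHighWaterMark);
  // How much the underlying socket (writable side) will buffer before
  // write() starts returning false.
  console.log('socket writableHighWaterMark:', response.socket.writableHighWaterMark);
  response.resume(); // discard the body so the connection can close
});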

Minimizing copies when writing large data to a socket

I am writing an application server that processes images (large data). I am trying to minimize copies when sending image data back to clients. The processed images I need to send to clients are in buffers obtained from jemalloc. The ways I have thought of sending the data back to the client is:
1) Simple write call.
// Allocate buffer buf.
// Store image data in this buffer.
write(socket, buf, len);
2) I obtain the buffer through mmap instead of jemalloc, though I presume jemalloc already creates the buffer using mmap. I then make a simple call to write.
buf = mmap(file, len); // Imagine proper options.
// Store image data in this buffer.
write(socket, buf, len);
3) I obtain a buffer through mmap like before. I then use sendfile to send the data:
buf = mmap(in_fd, len); // Imagine proper options.
// Store image data in this buffer.
int rc;
rc = sendfile(out_fd, in_fd, &offset, count);
// Deal with rc.
It seems like (1) and (2) will probably do the same thing, given that jemalloc probably allocates memory through mmap in the first place. I am not sure about (3), though. Will it really lead to any benefit? Figure 4 in this article on Linux zero-copy methods suggests that a further copy can be prevented by using sendfile:
no data is copied into the socket buffer. Instead, only descriptors
with information about the whereabouts and length of the data are
appended to the socket buffer. The DMA engine passes data directly
from the kernel buffer to the protocol engine, thus eliminating the
remaining final copy.
This seems like a win if everything works out. I don't know if my mmapped buffer counts as a kernel buffer, though. Also, I don't know when it is safe to re-use this buffer. Since the fd and length are the only things appended to the socket buffer, I assume that the kernel actually writes this data to the socket asynchronously. If it does, what does the return from sendfile signify? How would I know when to re-use this buffer?
So my questions are:
What is the fastest way to write large buffers (images in my case) to a socket? The images are held in memory.
Is it a good idea to call sendfile on a mmapped file? If yes, what are the gotchas? Does this even lead to any wins?
It seems like my suspicions were correct. I got my information from this article. Quoting from it:
Also these network write system calls, including sendfile, might and
in many cases do return before the data sent over TCP by the method
call has been acknowledged. These methods return as soon as all data
is written into the socket buffers (sk buff) and is pushed to the TCP
write queue, the TCP engine can manage alone from that point on. In
other words at the time sendfile returns the last TCP send window is
not actually sent to the remote host but queued. In cases where
scatter-gather DMA is supported there is no separate buffer which
holds these bytes, rather the buffers(sk buffs) just hold pointers to
the pages of OS buffer cache, where the contents of file is located.
This might lead to a race condition if we modify the content of the
file corresponding to the data in the last TCP send window as soon as
sendfile is returned. As a result TCP engine may send newly written
data to the remote host instead of what we originally intended to
send.
Provided the buffer from an mmapped file is even considered "DMA-able", it seems like there is no way to know when it is safe to re-use it without an explicit acknowledgement (over the network) from the actual client. I might have to stick to simple write calls and incur the extra copy. There is a paper (also from the article) with more details.
Edit: This article on the splice call also shows the problems. Quoting it:
Be aware, when splicing data from a mmap'ed buffer to a network
socket, it is not possible to say when all data has been sent. Even if
splice() returns, the network stack may not have sent all data yet. So
reusing the buffer may overwrite unsent data.
For cases 1 and 2 - does the operation you marked as // Store image data in this buffer require any conversion, or is it just a plain copy from memory to buf?
If it's just a plain copy, you can use write directly on the pointer obtained from jemalloc.
Assuming that img is a pointer obtained from jemalloc and size is the size of your image, just run the following code:
int result;
int sent = 0;
while (sent < size) {
    result = write(socket, img + sent, size - sent);
    if (result < 0) {
        /* error handling here */
        break;
    }
    sent += result;
}
This works correctly for blocking I/O (the default behavior). If you need to write the data in a non-blocking manner, you should be able to rework the code on your own, but now you have the idea.
For case 3 - sendfile is for sending data from one descriptor to another. That means you can, for example, send data from a file directly to a TCP socket without allocating any additional buffer. So, if the image you want to send to a client is in a file, just go for sendfile. If you have it in memory (because you processed it somehow, or just generated it), use the approach I mentioned earlier.
