Slow Buffer.concat - node.js

When I read a 16MB file in 64KB pieces and call Buffer.concat on each piece, the concatenation turns out to be incredibly slow: it takes a whole 4 seconds to get through the lot.
Is there a better way to concatenate buffers in Node.js?
Node.js version used: 7.10.0, under Windows 10 (both are 64-bit).
This question came up while researching the following issue: https://github.com/brianc/node-postgres/issues/1286, which affects a large audience.
The PostgreSQL driver reads large bytea columns in 64KB chunks and then concatenates them. We found that calling Buffer.concat is the culprit behind a huge loss of performance in such cases.

Rather than concatenating every time (which creates a new buffer each time), just keep an array of all of your buffers and concat at the end.
Buffer.concat() can take a whole list of buffers. Then it's done in one operation. https://nodejs.org/api/buffer.html#buffer_class_method_buffer_concat_list_totallength
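For example, a minimal sketch of that pattern (assuming a readable stream named stream; the names are illustrative):

const chunks = [];
let totalLength = 0;

stream.on('data', (chunk) => {
  // just remember the chunk; nothing is copied here
  chunks.push(chunk);
  totalLength += chunk.length;
});

stream.on('end', () => {
  // one allocation and one copy for the whole payload
  const result = Buffer.concat(chunks, totalLength);
  // ... use result
});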

If you read from a file and know the size of that file, then you can pre-allocate the final buffer. Then each time you get a chunk of data, you simply copy it into that large 16MB buffer.
// use the "unsafe" version to avoid zero-filling 16MB for nothing
const buf = Buffer.allocUnsafe(file_size);
let pos = 0;

file.on('data', (chunk) => {
  // copy the chunk into the pre-allocated buffer at the current offset
  chunk.copy(buf, pos);
  pos += chunk.length;
});

file.on('end', () => {
  if (pos !== file_size) throw new Error('Oops! Something went wrong.');
});
The main difference with @Brad's code sample is that you're going to use 16MB + the size of one chunk (roughly) instead of 32MB + the size of one chunk.
Also, each chunk has a header, various pointers, etc., so with concatenation you are likely to end up using 33MB or even 34MB... that's a lot more RAM. The amount of RAM copied is otherwise the same. That being said, Node may start reading the next chunk while you copy the current one, which would make the copies transparent. When everything is concatenated in one go in the 'end' event, you have to wait for the concat() to complete while doing nothing else in parallel.
The same applies if you are receiving an HTTP POST and reading it: you get a Content-Length header, so you also know the length in that case and can pre-allocate the entire buffer before reading the data.
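A minimal sketch of that idea for an HTTP server (the port is arbitrary, error handling is omitted, and it assumes the client actually sends Content-Length rather than chunked encoding):

const http = require('http');

http.createServer((req, res) => {
  const size = parseInt(req.headers['content-length'], 10);
  // pre-allocate the whole body once; allocUnsafe skips the zero-fill
  const body = Buffer.allocUnsafe(size);
  let pos = 0;

  req.on('data', (chunk) => {
    chunk.copy(body, pos);
    pos += chunk.length;
  });

  req.on('end', () => {
    // body now holds the complete POST payload
    res.end('received ' + pos + ' bytes');
  });
}).listen(8080);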

Related

What is a quick way to check if file contents are null?

I have a rather large file (32 GB) which is an image of an SD card, created using dd.
I suspected that the file is empty (i.e. filled with the null byte \x00) starting from a certain point.
I checked this using python in the following way (where f is an open file handle with the cursor at the last position I could find data at):
for i in xrange(512):
    if set(f.read(64*1048576)) != set(['\x00']):
        print i
        break
This worked well (in fact it revealed some data at the very end of the image), but took >9 minutes.
Has anyone got a better way to do this? There must be a much faster way, I'm sure, but I cannot think of one.
Looking at a guide about memory buffers in Python here, I suspected that the comparison itself was the issue. In most dynamically typed languages, memory copies are not very obvious, despite being a killer for performance.
In this case, as Oded R. established, reading into a buffer and comparing the result with a previously prepared null-filled one is much more efficient.
size = 512
data = bytearray(size)
cmp = bytearray(size)
And when reading:
f = open(FILENAME, 'rb')
f.readinto(data)        # reuses the same bytearray on every read
if data != cmp:         # plain bytearray comparison, no string conversion
    print 'found non-null data'
Two things that need to be taken into account are:
The compared buffers should be of equal size, but comparing bigger buffers should be faster up to some point (I would expect memory fragmentation to be the main limit).
The last block read may not be the same size; reading the file into the prepared buffer keeps the trailing zeros where we want them.
Here the comparison of the two buffers is quick, there is no attempt to cast the bytes to a string (which we don't need), and since we reuse the same memory all the time, the garbage collector won't have much work to do either... :)

NodeJS Request Pipe buffer size

How can I set up the maximum buffer size on a NodeJS Request Pipe? I'm trying to use AWS Lambda to download from a source and pipe upload to a destination like in the code below:
request(source).pipe(request(destination))
This code works fine, but if the file size is bigger than the AWS Lambda memory size, it crashes. If I increase the memory it works, so I know it's not the timeout or the link, only the memory allocation. I'd rather not increase the number, and even if I use the maximum it's still 1.5GB, while I'm expecting to transfer files bigger than that.
Is there a global variable for NodeJS on AWS Lambda for this? Or any other suggestion?
Two things to consider:
Do not use request(source).pipe(request(destination)) with or within a promise (async/await). For some reason it leaks memory when done with promises.
"However, STREAMING THE RESPONSE (e.g. .pipe(...)) is DISCOURAGED because Request-Promise would grow the memory footprint for large requests unnecessarily high. Use the original Request library for that. You can use both libraries in the same project." Source: https://www.npmjs.com/package/request-promise
To control how much memory the pipe uses: Set the highWaterMark for BOTH ends of the pipe. I REPEAT: BOTH ENDS OF THE PIPE. This will force the pipe to let only so much data into the pipe and out of the pipe, and thus limits its occupation in memory. (But does not limit how fast data moves through the pipe...see Bonus)
request.get(sourceUrl, {highWaterMark: 1024000, encoding: null}).pipe(request(destinationUrl, {highWaterMark: 1024000}));
1024000 is in bytes and is approximately 1MB.
Source for highWaterMark background:
"Because Duplex and Transform streams are both Readable and Writable, each maintains two separate internal buffers used for reading and writing, allowing each side to operate independently of the other while maintaining an appropriate and efficient flow of data. For example, net.Socket instances are Duplex streams whose Readable side allows consumption of data received from the socket and whose Writable side allows writing data to the socket. Because data may be written to the socket at a faster or slower rate than data is received, it is important for each side to operate (and buffer) independently of the other." <- last sentence here is the important part.
https://nodejs.org/api/stream.html#stream_readable_pipe_destination_options
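The same both-ends idea can be shown with core fs streams alone; a minimal sketch (the file names and the 1MB value are illustrative):

const fs = require('fs');

// each side of the pipe buffers at most about 1MB internally
const source = fs.createReadStream('big-input.bin', { highWaterMark: 1024 * 1024 });
const destination = fs.createWriteStream('big-output.bin', { highWaterMark: 1024 * 1024 });

source.pipe(destination);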
Bonus: If you want to throttle how fast data passes through the pipe, check something like this out: https://www.npmjs.com/package/stream-throttle
const throttle = require('stream-throttle');
let th = new throttle.Throttle({rate: 10240000}); // if you don't want to transfer data faster than 10MB/sec
request.get(sourceUrl, {highWaterMark: 1024000, encoding: null}).pipe(th).pipe(request(destinationUrl, {highWaterMark: 1024000}));

Does node's readFile() use a single Buffer with all the file size allocated?

I'm quite new to Node and filesystem stream concerns. I wanted to know whether the readFile function reads the file stats, gets the size, and creates a single Buffer with the whole file size allocated. In other words: I know it loads the entire file, but does it do so by internally splitting the file across several buffers, or does it use only a single big Buffer? Depending on the method used, the memory usage/leak implications are different.
Found the answer here, in chapter 9.3:
http://book.mixu.net/node/ch9.html
As expected, readFile uses 1 full Buffer. From the link above, this is the execution of readFile:
// Fully buffered access
[100 Mb file]
-> 1. [allocate 100 Mb buffer]
-> 2. [read and return 100 Mb buffer]
So if you use readFile(), your app will need enough memory for the whole file size at once.
To break that memory into chunks, use read() or createReadStream().
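For comparison, a minimal sketch of the chunked alternative (the file name is illustrative); each 'data' event hands you a Buffer of at most highWaterMark bytes instead of one file-sized Buffer:

const fs = require('fs');

const stream = fs.createReadStream('big-file.bin', { highWaterMark: 64 * 1024 });
let total = 0;

stream.on('data', (chunk) => {
  // chunk is a Buffer of up to 64KB; process it, then let it be garbage collected
  total += chunk.length;
});

stream.on('end', () => {
  console.log('read ' + total + ' bytes without holding the whole file in memory');
});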

Optimal buffer size with Node.js?

I have a situation where I need to take a stream and chunk it up into Buffers. I plan to write an object transform stream which takes regular input data, and outputs Buffer objects (where the buffers are all the same size). That is, if my chunker transform is configured at 8KB, and 4KB is written to it, it will wait until an additional 4KB is written before outputting an 8KB Buffer instance.
I can choose the size of the buffer, as long as it is in the ballpark of 8KB to 32KB. Is there an optimal size to pick? The reason I'm curious is that the Node.js documentation speaks of using SlowBuffer to back a Buffer, and allocating a minimum of 8KB:
In order to avoid the overhead of allocating many C++ Buffer objects for small blocks of memory in the lifetime of a server, Node allocates memory in 8Kb (8192 byte) chunks. If a buffer is smaller than this size, then it will be backed by a parent SlowBuffer object. If it is larger than this, then Node will allocate a SlowBuffer slab for it directly.
Does this imply that 8KB is an efficient size, and that if I used 12KB, there would be two 8KB SlowBuffers allocated? Or does it just mean that the smallest efficient size is 8KB? What about simply using multiples of 8KB? Or, does it not matter at all?
Basically it's saying that if your Buffer is less than 8KB, it'll try to fit it into a pre-allocated 8KB chunk of memory. It'll keep putting Buffers in that 8KB chunk until one doesn't fit, then it'll allocate a new 8KB chunk. If the Buffer is larger than 8KB, it'll get its own memory allocation.
You can actually see what's happening by looking at the node source for buffer here:
if (this.length <= (Buffer.poolSize >>> 1) && this.length > 0) {
  if (this.length > poolSize - poolOffset)
    createPool();
  this.parent = sliceOnto(allocPool,
                          this,
                          poolOffset,
                          poolOffset + this.length);
  poolOffset += this.length;
} else {
  alloc(this, this.length);
}
Looking at that, it actually looks like it'll only put the Buffer into a pre-allocated chunk if it's less than or equal to 4KB (Buffer.poolSize >>> 1, which is 4096 when Buffer.poolSize = 8 * 1024).
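If you want to observe the pooling on your own Node version (the exact cutoff has moved around between releases, so treat the sizes as illustrative), the byteLength of a buffer's backing ArrayBuffer reveals whether it came out of the shared pool:

const small = Buffer.allocUnsafe(1024);   // well below the pool cutoff
const large = Buffer.allocUnsafe(16384);  // well above Buffer.poolSize

console.log(Buffer.poolSize);             // 8192 by default
console.log(small.buffer.byteLength);     // 8192 -> a slice of the shared pool
console.log(large.buffer.byteLength);     // 16384 -> its own allocation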
As for an optimum size to pick in your situation, I think it depends on what you end up using it for. But, in general, if you want a chunk less than or equal to 8KB, I'd pick something less than or equal to 4KB that fits evenly into that 8KB pre-allocation (4KB, 2KB, 1KB, etc.). Otherwise, chunk sizes greater than 8KB shouldn't make too much of a difference.
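As for the chunker described in the question, here is a minimal sketch of such a fixed-size Transform stream using the standard _transform/_flush hooks (the class name and the 8KB default are illustrative, not an established API):

const { Transform } = require('stream');

class Chunker extends Transform {
  constructor(chunkSize = 8 * 1024) {
    super();
    this.chunkSize = chunkSize;
    this.pending = [];       // pieces received but not yet emitted
    this.pendingLength = 0;
  }

  _transform(chunk, encoding, callback) {
    this.pending.push(chunk);
    this.pendingLength += chunk.length;

    // emit as many full-size chunks as we currently have data for
    while (this.pendingLength >= this.chunkSize) {
      const all = Buffer.concat(this.pending, this.pendingLength);
      this.push(all.slice(0, this.chunkSize));
      this.pending = [all.slice(this.chunkSize)];
      this.pendingLength = this.pending[0].length;
    }
    callback();
  }

  _flush(callback) {
    // whatever is left at the end goes out as a final, shorter chunk
    if (this.pendingLength > 0) {
      this.push(Buffer.concat(this.pending, this.pendingLength));
    }
    callback();
  }
}

// usage: source.pipe(new Chunker(8 * 1024)).pipe(destination);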

Buffer or string or array for adding chunks with json

I'm downloading varying sizes of json data from a provider. The sizes can vary from a couple of hundred bytes to tens of MB.
I got into trouble with a string (i.e. stringVar += chunk). I'm not sure, but I suspect my crashes have to do with quite large strings (15 MB).
In the end I need the JSON data. My temporary solution is to use a string up to 1MB and then "flush" it to a buffer. I didn't want to use a buffer from the start, as it would have to be grown (i.e. copied to a larger buffer) quite often when downloads are small.
Which solution is the best for concatenating downloaded chunks and then parsing to json?
1.
var dataAsAString = '';
..
dataAsAString += chunk;
..
JSON.parse(dataAsAString);
2.
var dataAsAnArray = [];
..
dataAsAnArray.push(chunk);
..
concatenate
JSON.parse..
3.
var buffer = new Buffer(initialSize)
..
buffer.write(chunk)
..
copy buffer to larger buffer when needed
..
JSON.parse(buffer.toString());
I don't know why you are appending the chunk in a cumulative manner.
If you can keep the necessary metadata around for the entire duration of the data processing, then you could use a loop and just process each chunk on its own. The chunk data should be declared inside the loop; then after every iteration the chunk variable goes out of scope and the memory used doesn't grow continuously.
while ((chunk = receiveChunkedData()) != null) {
  JSON.parse(chunk);
}
I have now moved to streams instead of accumulating buffers. Streams are really awesome.
If someone comes here looking for a way to accumulate buffer chunks speedily, I thought I'd share my find.
Substack has a module that keeps all the chunks separate without reallocating memory, and then lets you treat them as one contiguous buffer when you need to.
https://github.com/substack/node-buffers
I think node-stream-buffer can solve your problem.
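For reference, option 2 from the question looks roughly like this end to end when downloading with a core module (the URL is a placeholder): collect the Buffer chunks, concatenate once, then parse.

const https = require('https');

https.get('https://example.com/data.json', (res) => {
  const chunks = [];

  res.on('data', (chunk) => chunks.push(chunk));

  res.on('end', () => {
    // a single allocation and copy, then one parse of the complete payload
    const body = Buffer.concat(chunks).toString('utf8');
    console.log(JSON.parse(body));
  });
});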
