I have a situation where I need to take a stream and chunk it up into Buffers. I plan to write an object transform stream which takes regular input data, and outputs Buffer objects (where the buffers are all the same size). That is, if my chunker transform is configured at 8KB, and 4KB is written to it, it will wait until an additional 4KB is written before outputting an 8KB Buffer instance.
I can choose the size of the buffer, as long as it is in the ballpark of 8KB to 32KB. Is there an optimal size to pick? The reason I'm curious is that the Node.js documentation speaks of using SlowBuffer to back a Buffer, and allocating a minimum of 8KB:
In order to avoid the overhead of allocating many C++ Buffer objects for small blocks of memory in the lifetime of a server, Node allocates memory in 8Kb (8192 byte) chunks. If a buffer is smaller than this size, then it will be backed by a parent SlowBuffer object. If it is larger than this, then Node will allocate a SlowBuffer slab for it directly.
Does this imply that 8KB is an efficient size, and that if I used 12KB, there would be two 8KB SlowBuffers allocated? Or does it just mean that the smallest efficient size is 8KB? What about simply using multiples of 8KB? Or, does it not matter at all?
Basically it's saying that if your Buffer is less than 8KB, Node will try to fit it into a pre-allocated 8KB chunk of memory. It'll keep putting Buffers into that 8KB chunk until one doesn't fit, then it'll allocate a new 8KB chunk. If the Buffer is larger than 8KB, it gets its own memory allocation.
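You can also observe this pooling from user code. A minimal check, assuming a reasonably recent Node where small Buffer.allocUnsafe() allocations share the pool (the exact internals differ from the older source quoted below):

// Small "unsafe" allocations are carved out of the shared pool, so the
// underlying ArrayBuffer is the whole slab, not just the bytes requested.
const small = Buffer.allocUnsafe(100);
console.log(Buffer.poolSize);         // 8192 by default
console.log(small.buffer.byteLength); // 8192 -- backed by the pool

// A large allocation gets its own backing memory instead.
const large = Buffer.allocUnsafe(64 * 1024);
console.log(large.buffer.byteLength); // 65536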
You can actually see what's happening by looking at the node source for buffer here:
if (this.length <= (Buffer.poolSize >>> 1) && this.length > 0) {
  if (this.length > poolSize - poolOffset)
    createPool();
  this.parent = sliceOnto(allocPool,
                          this,
                          poolOffset,
                          poolOffset + this.length);
  poolOffset += this.length;
} else {
  alloc(this, this.length);
}
Looking at that, it appears it will only put the Buffer into a pre-allocated chunk if it's less than or equal to 4KB (Buffer.poolSize >>> 1, which is 4096 when Buffer.poolSize = 8 * 1024).
As for an optimal size to pick in your situation, it depends on what you end up using it for. But in general, if you want a chunk less than or equal to 8KB, I'd pick something less than or equal to 4KB that divides evenly into that 8KB pre-allocation (4KB, 2KB, 1KB, etc.). Otherwise, chunk sizes greater than 8KB shouldn't make much of a difference.
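For reference, here is a minimal sketch of the kind of fixed-size chunking Transform described in the question (the class name and default size are made up for illustration; error handling is omitted):

const { Transform } = require('stream');

// Buffers incoming data and emits fixed-size Buffer chunks.
class Chunker extends Transform {
  constructor(chunkSize = 4 * 1024) {
    super();
    this.chunkSize = chunkSize;
    this.pending = Buffer.alloc(0);
  }

  _transform(chunk, encoding, callback) {
    this.pending = Buffer.concat([this.pending, chunk]);
    while (this.pending.length >= this.chunkSize) {
      this.push(this.pending.slice(0, this.chunkSize));
      this.pending = this.pending.slice(this.chunkSize);
    }
    callback();
  }

  _flush(callback) {
    // Emit whatever is left over (a final, smaller chunk).
    if (this.pending.length > 0) this.push(this.pending);
    callback();
  }
}

Using a 4KB chunk size like this keeps each emitted Buffer within the pre-allocation threshold discussed above.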
Related
What is the best tcp send buffer size? For example I want to send a big file (10-100MB) and I set buffer size to 4Kb, but what is the best buffer size for that?
I want to send a big file (10-100MB) and I set buffer size to 4Kb, but what is the best buffer size for that?
Certainly not 4Kb. At least 32-48Kb, or 64Kb or more if you can afford it. In general it should be at least equal to the bandwidth-delay product of the network path, so that you 'fill the pipe' and make maximum use of the available bandwidth.
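As a rough worked example (the link speed and round-trip time here are assumptions, not measurements):

// Bandwidth-delay product: the number of bytes "in flight" needed to keep the pipe full.
const bandwidthBitsPerSec = 100e6; // assume a 100 Mbit/s path
const rttSeconds = 0.05;           // assume a 50 ms round-trip time
const bdpBytes = (bandwidthBitsPerSec / 8) * rttSeconds;
console.log(`BDP ~ ${Math.round(bdpBytes / 1024)} KB`); // ~610 KB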
If you're in control of the other end you should also set its socket receive buffer to a similar size.
When I read a 16MB file in pieces of 64Kb and call Buffer.concat on each piece, the latter proves to be incredibly slow: it takes a whole 4 seconds to go through the lot.
Is there a better way to concatenate a buffer in Node.js?
Node.js version used: 7.10.0, under Windows 10 (both are 64-bit).
This question is asked while researching the following issue: https://github.com/brianc/node-postgres/issues/1286, which affects a large audience.
The PostgreSQL driver reads large bytea columns in chunks of 64Kb, and then concatenates them. We found out that calling Buffer.concat is the culprit behind a huge loss of performance in such examples.
Rather than concatenating every time (which creates a new buffer each time), just keep an array of all of your buffers and concat at the end.
Buffer.concat() can take a whole list of buffers. Then it's done in one operation. https://nodejs.org/api/buffer.html#buffer_class_method_buffer_concat_list_totallength
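A minimal sketch of that approach, assuming a readable stream named stream (names are illustrative):

const chunks = [];
let totalLength = 0;

stream.on('data', (chunk) => {
  chunks.push(chunk);          // just remember the chunk
  totalLength += chunk.length; // track the total for the final concat
});

stream.on('end', () => {
  // One allocation and one copy pass, instead of one per chunk.
  const result = Buffer.concat(chunks, totalLength);
  // ... use result ...
});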
If you read from a file and know the size of that file, then you can pre-allocate the final buffer. Then each time you get a chunk of data, you can simply write it to that large 16Mb buffer.
// use the "unsafe" version to avoid zero-filling 16MB for nothing
let buf = Buffer.allocUnsafe(file_size);
let pos = 0;

file.on('data', (chunk) => {
  chunk.copy(buf, pos); // copy the chunk straight into the big buffer
  pos += chunk.length;
});

file.on('end', () => {
  if (pos !== file_size) throw new Error('Oops! Something went wrong.');
});
The main difference with @Brad's code sample is that you're going to use roughly 16MB + the size of one chunk, instead of 32MB + the size of one chunk.
Also, each chunk carries a header, various pointers, etc., so you could easily end up using 33MB or even 34MB... that's a lot more RAM. The amount of RAM copied is otherwise the same. That being said, it could be that Node starts reading the next chunk while you copy, which would make the copying effectively transparent. When it's all done in one large concat() in the 'end' event, you have to wait for the concat() to complete while doing nothing else in parallel.
In case you are receiving an HTTP POST and reading it, remember that you get a Content-Length header, so you also have the length in that case and can pre-allocate the entire buffer before reading the data.
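A sketch of that idea for an HTTP POST, assuming the client actually sends a Content-Length header (the port number and the missing validation are just for the example):

const http = require('http');

http.createServer((req, res) => {
  // Pre-allocate the whole body from the declared length.
  const length = parseInt(req.headers['content-length'], 10) || 0;
  const body = Buffer.allocUnsafe(length);
  let pos = 0;

  req.on('data', (chunk) => {
    chunk.copy(body, pos);
    pos += chunk.length;
  });

  req.on('end', () => {
    // body now holds the full payload (if the header was accurate)
    res.end(`received ${pos} bytes`);
  });
}).listen(8080);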
I'm quite new to Node and to filesystem stream concerns. I wanted to know whether the readFile function reads the file stats, gets the size, and creates a single Buffer with the whole file size allocated. Or in other words: I know it loads the entire file, but does it do so by internally splitting the file across multiple buffers, or does it use only a single big Buffer? Depending on the method used, it has different memory usage/leak implications.
Found the answer here, in chapter 9.3:
http://book.mixu.net/node/ch9.html
As expected, readFile uses 1 full Buffer. From the link above, this is the execution of readFile:
// Fully buffered access
[100 Mb file]
-> 1. [allocate 100 Mb buffer]
-> 2. [read and return 100 Mb buffer]
So if you use readFile(), your app will need enough memory to hold the entire file at once.
To process the file in chunks instead, use read() or createReadStream().
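For example, a streamed read might look like this (the file name and the 64KB highWaterMark are just example values):

const fs = require('fs');

// Read the file in chunks of at most 64KB instead of buffering it whole.
const stream = fs.createReadStream('./big-file.bin', { highWaterMark: 64 * 1024 });

stream.on('data', (chunk) => {
  console.log('got', chunk.length, 'bytes'); // handle each chunk here
});

stream.on('end', () => console.log('done'));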
I'm downloading varying sizes of json data from a provider. The sizes can vary from a couple of hundred bytes to tens of MB.
Got into trouble with a string (i.e. stringVar += chunk). I'm not sure, but I suspect my crashes have to do with quite large strings (15 MB).
In the end I need the JSON data. My temporary solution is to use a string up to 1MB and then "flush" it to a buffer. I didn't want to use a buffer from the start, as it would have to be grown (i.e. copied to a larger buffer) quite often when downloads are small.
Which solution is the best for concatenating downloaded chunks and then parsing to json?
1.
var dataAsAString = '';
..
dataAsAString += chunk;
..
JSON.parse(dataAsAString);
2.
var dataAsAnArray = [];
..
dataAsAnArray.push(chunk);
..
concatenate
JSON.parse..
3.
var buffer = new Buffer(initialSize)
..
buffer.write(chunk)
..
copy buffer to larger buffer when needed
..
JSON.parse(buffer.toString());
Michael
I don't know why you are appending the chunk in a cumulative manner.
If you could store the necessary metadata for the entire duration of the data processing, then you could use a loop and just process each chunk as it arrives. If the chunk data is declared in the loop, then after every iteration the chunk variable goes out of scope and the memory used wouldn't grow continuously.
while ((chunk = receiveChunkedData()) != null) {
  JSON.parse(chunk);
}
I have now moved to streams instead of accumulating buffers. Streams are really awesome.
If someone comes here looking for a way to accumulate buffer chunks quickly, I thought I'd share my find...
Substack has a module for keeping all the chunks separate without reallocating memory, and then treating them as one contiguous buffer when you need to.
https://github.com/substack/node-buffers
I think node-stream-buffer can solve your problem.
The Linux kernel API has a __bread method:
__bread(struct block_device *bdev, sector_t block, unsigned size)
which returns a buffer_head pointer whose data field contains size bytes of data. However, I noticed that reading beyond size bytes still gave me valid data, up to PAGE_SIZE bytes. This got me wondering whether I can presume that the buffer_head returned by __bread always contains PAGE_SIZE bytes of valid data, even if the size argument passed to it is smaller.
Or maybe it was just a coincidence.
__bread performs a read I/O through the given block interface, but depending on the backing store, you get different results.
For hard drives, the block device will fetch data in sector-sized units. Usually this is either 512 bytes or 4K. If the sector size is 512 bytes and you ask for 256 bytes, you'll still be able to access the last part of the sector, so you may effectively fetch up to the sector size. However, this is not always true: with memory-backed devices, you may only be able to access the 256 bytes, since the data is not served up by the block layer, but by the VSL.
In short, no. You should not rely on this behavior, as it depends on which block device is backing the storage and may also change with the block layer implementation.