Node.js readStream: what value should I use for highWaterMark? - node.js

I have a copy function that uses createReadStream and createWriteStream to give me progress events during a file copy:
createReadStream(source, { highWaterMark })
  .pipe(
    progress({ length: stats.size }).on('progress', (event) =>
      subscriber.next({ ...event, type: 'fileStreamProgress', stats: { source, stats } })
    )
  )
  .pipe(createWriteStream(destination, force ? undefined : { flags: 'wx' }))
  .once('error', (err) => subscriber.error(err))
  .once('finish', () => subscriber.complete());
I don't really know what highWaterMark is. My rough understanding is that it is the chunk size. If I had infinite memory, surely I should just set it to infinity? Or does a "chunk" only get sent to the next item in the pipeline once it has been read in full? I do see that copy speed increases as this number gets larger, but at a certain point it starts to get slower again.
I will mostly be copying video files that are about 20-30 GB, and I'm wondering what highWaterMark value will give me the best performance.
Thanks
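Not an answer from the thread, but one way to settle this empirically: a minimal sketch that times the same copy with a few candidate highWaterMark values (64 KiB is the default for fs streams; the helper name and the candidate sizes are assumptions for illustration, and it needs Node 15+ for 'stream/promises'):
// Benchmark sketch: plain fs streams, no progress events, destination is
// overwritten on every run. OS caching will skew repeated runs, so treat
// the numbers as rough guidance only.
const { createReadStream, createWriteStream } = require('fs');
const { pipeline } = require('stream/promises');

function copyWithHighWaterMark(source, destination, highWaterMark) {
  return pipeline(
    createReadStream(source, { highWaterMark }),
    createWriteStream(destination, { highWaterMark })
  );
}

async function benchmark(source, destination) {
  // 64 KiB is the fs-stream default; larger values trade memory for fewer syscalls.
  for (const hwm of [64 * 1024, 1024 * 1024, 4 * 1024 * 1024, 16 * 1024 * 1024]) {
    const start = Date.now();
    await copyWithHighWaterMark(source, destination, hwm);
    console.log(`highWaterMark ${hwm}: ${Date.now() - start} ms`);
  }
}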

Related

How to effectively turn high resolution images into a video with ffmpeg?

I have 24 frames (frame-%d.png)
I want to turn them into a video that will be 1 second long
That means that each frame should play for 1/24 seconds
I'm trying to figure out the correct settings in order to achieve that:
await new Promise((resolve, reject) => {
  ffmpeg()
    .on('end', () => {
      setTimeout(() => {
        console.log('done')
        resolve()
      }, 100)
    })
    .on('error', (err) => {
      reject(err)
    })
    .input('/my-huge-frames/frame-%d.png')
    .inputFPS(1/24)
    .output('/my-huge-video.mp4')
    .outputFPS(24)
    .noAudio()
    .run()
})
Are my inputFPS(1/24) & outputFPS(24) settings correct?
Each frame-%d.png is huge: 32400 px × 32400 px (~720 MB). Will ffmpeg be able to generate such a video, and if so, will the video be playable? If not, what is the maximum resolution each frame-%d.png should have instead?
Since the process will be quite heavy, I believe using the command line could be more appropriate. In that case, what is the equivalent of the above JS code on the command line (as in ffmpeg -framerate etc.)?
Your output image size is too large for most common video codecs:
h.264: 2048×2048
h.265: 8192×4320
AV1: 7680×4320
You may be able to do raw RGB or raw YUV, but that is going to be huge: roughly 1.5 GB per frame for YUV 4:2:0.
What are you planning to play this on? I know of some dome theaters that are theoretically able to run something like 15 simultaneous 4K feeds, but those are processed beforehand.
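To the command-line part of the question: a minimal sketch of a roughly equivalent invocation, spawned from Node so it stays in JavaScript. The codec choice (libx264), the scale filter and the paths are assumptions rather than anything from the thread; -framerate 24 on the input is what makes each frame last 1/24 s:
// Sketch only: spawns ffmpeg directly instead of going through fluent-ffmpeg.
const { spawn } = require('child_process');

const args = [
  '-framerate', '24',                   // each input frame is shown for 1/24 s
  '-i', '/my-huge-frames/frame-%d.png',
  '-vf', 'scale=3840:-2',               // downscale: 32400x32400 exceeds codec limits
  '-c:v', 'libx264',
  '-pix_fmt', 'yuv420p',                // widely playable pixel format
  '-r', '24',
  '-an',                                // no audio
  '/my-huge-video.mp4',
];

spawn('ffmpeg', args, { stdio: 'inherit' })
  .on('close', (code) => console.log(`ffmpeg exited with code ${code}`));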

Pass large array of objects to RabbitMQ exchange

I receive a large array of objects from an external source (more than 10,000 objects), and I then pass it to an exchange in order to notify other microservices about new entries to handle.
this._rmqClient.publishToExchange({
  exchange: 'my-exchange',
  exchangeOptions: {
    type: 'fanout',
    durable: true,
  },
  data: myData, // [object1, object2, object3, ...]
  pattern: 'myPattern',
})
The problem is that it's bad practice to push such a large message to an exchange, and I'd like to resolve this. I've read articles and Stack Overflow posts looking for code examples or information about streaming the data, but with no success.
The only way I've found is to divide my large array into chunks and publish each one to the exchange in a for loop. Is that good practice? How do I determine how long each chunk should be (how many objects)? Or is there another approach?
It really depends on the object size; that's something you will have to figure out yourself. Take your 10k objects and calculate an average size (put them as JSON into a file and take fileSize / 10,000). A request body of around 50-100 kB is probably a reasonable target, but that is still up to you.
Start with 50 objects per chunk and run tests. Check the time taken, bandwidth, and whatever else makes sense, varying the chunk size between 1 and 5000, and test, test, test. At some point you will get a feeling for which number is good to take.
Here's some example code that loops through the elements in chunks:
// send function for showcasing the idea
// (assumes it is called in a context where this._rmqClient is available)
function send(data) {
  return this._rmqClient.publishToExchange({
    exchange: 'my-exchange',
    exchangeOptions: {
      type: 'fanout',
      durable: true,
    },
    data: data,
    pattern: 'myPattern',
  })
}

// this sends the chunks one by one
async function sendLargeDataPacket(data, chunkSize) {
  // work on a copy so the caller's array is not mutated
  const mutated = [...data]
  // send full chunks for as long as possible
  while (mutated.length >= chunkSize) {
    // send a packet of chunkSize length
    await send(mutated.splice(0, chunkSize))
  }
  // send the remaining elements, if there are any
  if (mutated.length > 0) {
    await send(mutated)
  }
}
And you would call it like this:
// that's your 10k+ items array
var myData = [/**...**/]
// let's start with 50, but try out other numbers too
const chunkSize = 50
sendLargeDataPacket(myData, chunkSize).then(() => console.log('done')).catch(console.error)
This approach sends one packet after the other, so it may take some time since nothing is done in parallel. I do not know your requirements, but I can help you write a parallel approach if you need one.
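A minimal sketch of that parallel variant, reusing the hypothetical send(data) helper above (same this-binding caveat) and publishing a bounded number of chunks per batch:
// Sketch only: publishes `concurrency` chunks at a time with Promise.all.
async function sendLargeDataPacketParallel(data, chunkSize, concurrency = 5) {
  // split the array into chunks first
  const chunks = []
  for (let i = 0; i < data.length; i += chunkSize) {
    chunks.push(data.slice(i, i + chunkSize))
  }
  // publish `concurrency` chunks in parallel, batch by batch
  for (let i = 0; i < chunks.length; i += concurrency) {
    await Promise.all(chunks.slice(i, i + concurrency).map((chunk) => send(chunk)))
  }
}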

How can I limit the size of WriteStream buffer in NodeJS?

I'm using a WriteStream in NodeJS to write several GB of data, and I've identified the write loop as eating up ~2 GB of virtual memory during runtime (which is then GC'd about 30 seconds after the loop finishes). I'm wondering how I can limit the size of the buffer Node uses when writing the stream, so that Node doesn't use up so much memory during that part of the code.
I've reduced it to this trivial loop:
let ofd = fs.openSync(fn, 'w')
let ws = fs.createWriteStream('', { fd: ofd })
:
while { /*..write ~4GB of binary formatted 32bit floats and uint32s...*/ }
:
:
ws.end()
The stream.write function returns a boolean value indicating whether the internal buffer is full. The buffer size is controlled by the highWaterMark option. However, this option is a threshold rather than a hard limit, which means you can still call stream.write even when the internal buffer is full, and memory usage will keep growing if your code looks like this:
while (foo) {
  ws.write(bar);
}
To solve this, you have to handle a false return value from ws.write and wait until the stream's drain event is emitted, as in the following example:
async function write() {
  while (foo) {
    if (!ws.write(bar)) {
      await new Promise(resolve => ws.once('drain', resolve));
    }
  }
}
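To cap the buffer from the question's setup more directly, highWaterMark can also be passed to createWriteStream. A minimal sketch combining that with the drain handling above (the 1 MiB threshold and the writeChunk helper are illustrative; events.once needs Node 11.13+):
// Sketch: explicit highWaterMark plus back-pressure handling.
const fs = require('fs');
const { once } = require('events');

const ws = fs.createWriteStream(fn, { highWaterMark: 1024 * 1024 }); // ~1 MiB threshold

async function writeChunk(chunk) {
  if (!ws.write(chunk)) {
    // the internal buffer is above highWaterMark; wait until it has drained
    await once(ws, 'drain');
  }
}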

DynamoDB PutItem using all heap memory - NodeJS

I have a csv with over a million lines, and I want to import all of them into DynamoDB. I'm able to loop through the csv just fine; however, when I try to call DynamoDB PutItem on these lines, I run out of heap memory after about 18k calls.
I don't understand why this memory is being used or how I can get around this issue. Here is my code:
// requires assumed by this snippet (not shown in the question):
const { createReadStream } = require('fs');
const { createInterface } = require('readline');
const { once } = require('events');
const parse = require('csv-parse/lib/sync'); // presumably csv-parse's synchronous API
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();

let insertIntoDynamoDB = async () => {
  const file = './file.csv';
  let index = 0;
  const readLine = createInterface({
    input: createReadStream(file),
    crlfDelay: Infinity
  });
  readLine.on('line', async (line) => {
    let record = parse(`${line}`, {
      delimiter: ',',
      skip_empty_lines: true,
      skip_lines_with_empty_values: false
    });
    await dynamodb.putItem({
      Item: {
        "Id": {
          S: record[0][2]
        },
        "newId": {
          S: record[0][0]
        }
      },
      TableName: "My-Table-Name"
    }).promise();
    index++;
    if (index % 1000 === 0) {
      console.log(index);
    }
  });
  // halts until all lines have been processed
  await once(readLine, 'close');
  console.log('FINAL: ' + index);
}
If I comment out the DynamoDB call, I can loop through the file just fine and read every line. Where is this memory usage coming from? My DynamoDB write throughput is 500, and adjusting this value has no effect.
For anyone trudging through the internet trying to find out why DynamoDB is consuming all the heap memory, there is a GitHub bug report here: https://github.com/aws/aws-sdk-js/issues/1777#issuecomment-339398912
Basically, the AWS SDK only has 50 sockets for making HTTP requests; if all sockets are consumed, events are queued until a socket becomes available. When processing millions of requests, these sockets get consumed immediately, and then the queue builds up until it blows up the heap.
So, then how do you get around this?
Increase heap size
Increase number of sockets
Control how many "events" you are queueing
Options 1 and 2 are the easy way out, but they do not scale. They might work for your scenario if you are doing a one-off thing, but if you are trying to build a robust solution, then you will want to go with number 3.
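For completeness, option 2 usually means raising the SDK's socket limit. A minimal sketch for aws-sdk v2 (the maxSockets value of 500 is just an example, not something the answer prescribes):
// Sketch for option 2: raise the aws-sdk v2 socket limit.
const AWS = require('aws-sdk');
const https = require('https');

AWS.config.update({
  httpOptions: {
    agent: new https.Agent({
      keepAlive: true,
      maxSockets: 500, // the SDK default is 50
    }),
  },
});
// DynamoDB clients created after this update share the larger agent pool.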
To do number 3, I determine the max heap size and divide it by how large I think an "event" will be in memory. For example: I assume an updateItem event for DynamoDB would be 100,000 bytes. My heap size was 4 GB, so 4,000,000,000 B / 100,000 B = 40,000 events. However, I only take 50% of that many events, to leave room on the heap for anything else the Node application might be doing. This percentage can be lowered or increased depending on your preference. Once I have the number of events, I read a line from the csv and consume an event; when the event has completed, I release the event back into the pool. If there are no events available, I pause the input stream of the csv until an event becomes available (a sketch of this pooling approach follows below).
Now I can upload millions of entries to DynamoDB without any worry of blowing up the heap.
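A minimal sketch of that pooling idea, reusing readLine, parse and dynamodb from the snippet above; the maxInFlight value is illustrative (the answer's own arithmetic would put it around 20,000 for a 4 GB heap):
// Sketch: cap the number of in-flight putItem calls and pause the csv reader
// when the cap is reached. Note that readline may still flush a few
// already-buffered 'line' events after pause().
const maxInFlight = 20000; // e.g. 50% of (heap size / estimated event size)
let inFlight = 0;

readLine.on('line', (line) => {
  const record = parse(line, { delimiter: ',', skip_empty_lines: true });
  inFlight++;
  if (inFlight >= maxInFlight) {
    readLine.pause(); // stop reading new lines until some requests complete
  }
  dynamodb.putItem({
    Item: { Id: { S: record[0][2] }, newId: { S: record[0][0] } },
    TableName: 'My-Table-Name',
  }).promise()
    .catch(console.error)
    .finally(() => {
      inFlight--;
      if (inFlight < maxInFlight) {
        readLine.resume(); // room in the pool again
      }
    });
});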

Node.js readStream for end of large files

I want to occasionally send the last 2kB of my large log file (>100MB) in an email notification. Right now, I am trying the following:
var endLogBytes = fs.statSync(logFilePath).size;
var endOfLogfile = fs.createReadStream(logFilePath, { start: endLogBytes - 2000, end: endLogBytes - 1, autoClose: true, encoding: 'utf8' });
endOfLogfile.on('data', function(chunk) {
  sendEmailFunction(chunk);
});
Since I just rebooted, my log files are only ~2MB, but as they get larger I am wondering:
1) Does it take a long time to read out the data (Does Node go through the entire file until it gets to the Bytes I want OR does Node jump to the Bytes that I want?)
2) How much memory is consumed?
3) When is the memory space freed up? How do I free the memory space?
You should not use a ReadStream in that case; since it is a stream, it has to (I suppose) grind through all the preceding data before it gets to the last two kilobytes.
So I would just do fs.open and then fs.read with the descriptor of the opened file, like this:
fs.open(logFilePath, 'r', function(e, fd) {
  if (e)
    throw e; // or do whatever you usually do in such situations
  var endOfLogfile = Buffer.alloc(2048); // new Buffer(size) is deprecated
  // fs.read(fd, buffer, offset, length, position, callback):
  // read 2048 bytes from position endLogBytes - 2048 into the buffer at offset 0
  fs.read(fd, endOfLogfile, 0, 2048, endLogBytes - 2048, function(e, bytesRead, data) {
    if (e)
      throw e;
    // don't forget to data.toString('ascii|utf8|you_name_it')
    sendEmailFunction(data.toString('ascii'));
  });
});
UPDATE:
It seems the current implementation of ReadStream is smart enough to read only the required amount of data. See: https://github.com/joyent/node/blob/v0.10.29/lib/fs.js#L1550. It uses fs.open and fs.read under the hood, so you can use ReadStream without worry.
Anyway, I would go with fs open/read, because it is more explicit, the C way, better style and so on.
About memory and freeing it up: you will need at least 2 MB of memory for the data buffer plus some overhead. I don't think there is a way to tell exactly how much overhead it will take; just test it on your target OS and Node version. You can use this module for profiling: https://www.npmjs.org/package/webkit-devtools-agent.
Memory will be freed when you no longer use the buffer with the data and the GC decides it is a good time to collect some garbage. GC is non-deterministic (i.e. unpredictable); you should not try to predict its behaviour or force it to do garbage collection.
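Not from the original answer, but a minimal modern sketch of the same open/read idea using fs.promises (Node 10+); the readTail name and the 2048-byte default are illustrative:
// Sketch: read the last 2 kB of a file without streaming through it.
const fsp = require('fs').promises;

async function readTail(path, bytes = 2048) {
  const handle = await fsp.open(path, 'r');
  try {
    const { size } = await handle.stat();
    const length = Math.min(bytes, size);
    const buffer = Buffer.alloc(length);
    // read `length` bytes starting `length` bytes before the end of the file
    await handle.read(buffer, 0, length, size - length);
    return buffer.toString('utf8');
  } finally {
    await handle.close();
  }
}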
