Node.js readStream for end of large files

I want to occasionally send the last 2kB of my large log file (>100MB) in an email notification. Right now, I am trying the following:
var endLogBytes = fs.statSync(logFilePath).size;
var endOfLogfile = fs.createReadStream(logFilePath, {start: endLogBytes - 2000, end: endLogBytes - 1, autoClose: true, encoding: 'utf8'});
endOfLogfile.on('data', function(chunk) {
    sendEmailFunction(chunk);
});
Since I just rebooted, my log files are only ~2MB, but as they get larger I am wondering:
1) Does it take a long time to read out the data? (Does Node go through the entire file until it gets to the bytes I want, or does it jump straight to the bytes that I want?)
2) How much memory is consumed?
3) When is the memory space freed up? How do I free the memory space?

You should not use a ReadStream in that case; because it is a stream, it has to (I suppose) grind through all the preceding data before it gets to the last two kilobytes.
So I would just use fs.open and then fs.read with the descriptor of the opened file, like this:
fs.open(logFilePath, 'r', function(e, fd) {
    if (e)
        throw e; // or do whatever you usually do in this kind of situation
    var endLogBytes = fs.statSync(logFilePath).size; // as in the question
    var endOfLogfile = Buffer.alloc(2048); // new Buffer(2048) is deprecated
    // fs.read(fd, buffer, offset, length, position, callback):
    // fill the buffer from offset 0 with 2048 bytes read from position endLogBytes - 2048
    fs.read(fd, endOfLogfile, 0, 2048, endLogBytes - 2048, function(e, bytesRead, data) {
        if (e)
            throw e;
        // don't forget data.toString('ascii|utf8|you_name_it')
        sendEmailFunction(data.toString('ascii'));
    });
});
UPDATE:
It seems the current implementation of ReadStream is smart enough to read only the required amount of data. See: https://github.com/joyent/node/blob/v0.10.29/lib/fs.js#L1550. It uses fs.open and fs.read under the hood, so you can use ReadStream without worry.
Anyway, I would go with fs.open/fs.read, because it is more explicit, the C way, better style and so on.
About memory and freeing it up: you will need a couple of kilobytes for the data buffer plus some overhead. I don't think there is a way to tell exactly how much overhead it will take; just test it on your target OS and Node version. You can use this module for profiling: https://www.npmjs.org/package/webkit-devtools-agent.
Memory will be freed when you no longer reference the buffer and the GC decides that it is a good time to collect some garbage. The GC is non-deterministic (i.e. unpredictable); you should not try to predict its behaviour or force it in any way to do garbage collection.
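For completeness, here is a minimal sketch of the tail-read using the ReadStream approach, accumulating the chunks and sending the whole tail once the stream ends (logFilePath and sendEmailFunction are the names from the question):
var fs = require('fs');

var endLogBytes = fs.statSync(logFilePath).size;
var endOfLogfile = fs.createReadStream(logFilePath, {
    start: Math.max(0, endLogBytes - 2000), // guard against files shorter than 2kB
    encoding: 'utf8',
    autoClose: true
});

var tail = '';
endOfLogfile.on('data', function(chunk) {
    tail += chunk; // 'data' may fire more than once, so accumulate
});
endOfLogfile.on('end', function() {
    sendEmailFunction(tail);
});
endOfLogfile.on('error', function(err) {
    console.error('Could not read log tail:', err);
});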

Related

How can I limit the size of WriteStream buffer in NodeJS?

I'm using a WriteStream in NodeJS to write several GB of data, and I've identified the write loop as eating up ~2GB of virtual memory during runtime (which is then GC'd about 30 seconds after the loop finishes). I'm wondering how I can limit the size of the buffer Node is using when writing the stream, so that Node doesn't use up so much memory during that part of the code.
I've reduced it to this trivial loop:
let ofd = fs.openSync(fn, 'w')
let ws = fs.createWriteStream('', { fd: ofd })
:
while { /*..write ~4GB of binary formatted 32bit floats and uint32s...*/ }
:
:
ws.end()
The stream.write function returns a boolean value which indicates whether the internal buffer is full. The buffer size is controlled by the option highWaterMark. However, this option is a threshold rather than a hard limit, which means you can still call stream.write even when the internal buffer is full, and memory will keep being consumed if you write code like this:
while (foo) {
    ws.write(bar);
}
In order to solve this issue, you have to handle the false return value from ws.write and wait until the drain event of the stream is emitted, as in the following example.
async function write() {
    while (foo) {
        if (!ws.write(bar)) {
            await new Promise(resolve => ws.once('drain', resolve));
        }
    }
}

WriteStream nodejs out of memory

I'm trying to create a 20 MB file, but it throws an out-of-memory error. I set max-old-space-size to 2 GB, but it still fails. Can someone explain to me why writing a 20 MB stream consumes so much memory?
I have 2.3 GB of free memory.
let size = 20 * 1024 * 1024; //20MB
for (let i = 0; i < size; i++) {
    writeStream.write('A');
}
writeStream.end();
As mentioned in the Node documentation, a Writable stores data in an internal buffer. The amount of data that can be buffered depends on the highWaterMark option passed into the stream's constructor.
As long as the size of the buffered data is below highWaterMark, calls to Writable.write(chunk) will return true. Once the buffered data exceeds the limit specified by highWaterMark, it returns false. This is when you should stop writing more data to the Writable and wait for the drain event, which indicates that it is now appropriate to resume writing data.
Your program crashes because it keeps writing even when the internal buffer has exceeded highWaterMark.
Check the docs about Event:'drain'. It includes an example program.
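Applied to the loop in the question, a minimal sketch of such a drain-aware loop could look like this (the output path and function name are just placeholders):
const fs = require('fs');

const writeStream = fs.createWriteStream('./output.txt');
const size = 20 * 1024 * 1024; // 20MB

async function writeChars() {
    for (let i = 0; i < size; i++) {
        // write() returns false once the internal buffer reaches highWaterMark
        if (!writeStream.write('A')) {
            // wait for the buffer to flush to disk before writing more
            await new Promise(resolve => writeStream.once('drain', resolve));
        }
    }
    writeStream.end();
}

writeChars();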
This looks like a nice use case for Readable.pipe(Writable)
You can create a generator function that yields a character, and then create a Readable from that generator using Readable.from(). Then pipe the output of the Readable to a Writable file stream.
The reason why it's beneficial to use pipe here is that:
A key goal of the stream API, particularly the stream.pipe() method,
is to limit the buffering of data to acceptable levels such that
sources and destinations of differing speeds will not overwhelm the
available memory. link
and
The flow of data will be automatically managed so that the destination
Writable stream is not overwhelmed by a faster Readable stream. link
const { Readable } = require('stream');
const fs = require('fs');

const size = 20 * 1024 * 1024; //20MB

function* generator(numberOfChars) {
    while (numberOfChars--) {
        yield 'A';
    }
}

const writeStream = fs.createWriteStream('./output.txt');
const readable = Readable.from(generator(size));
readable.pipe(writeStream);

DynamoDB PutItem using all heap memory - NodeJS

I have a csv with over a million lines, and I want to import all of them into DynamoDB. I'm able to loop through the csv just fine; however, when I try to call DynamoDB PutItem on these lines, I run out of heap memory after about 18k calls.
I don't understand why this memory is being used or how I can get around this issue. Here is my code:
let insertIntoDynamoDB = async () => {
    const file = './file.csv';
    let index = 0;
    const readLine = createInterface({
        input: createReadStream(file),
        crlfDelay: Infinity
    });
    readLine.on('line', async (line) => {
        let record = parse(`${line}`, {
            delimiter: ',',
            skip_empty_lines: true,
            skip_lines_with_empty_values: false
        });
        await dynamodb.putItem({
            Item: {
                "Id": {
                    S: record[0][2]
                },
                "newId": {
                    S: record[0][0]
                }
            },
            TableName: "My-Table-Name"
        }).promise();
        index++;
        if (index % 1000 === 0) {
            console.log(index);
        }
    });
    // halts process until all lines have been processed
    await once(readLine, 'close');
    console.log('FINAL: ' + index);
}
If I comment out the DynamoDB call, I can loop through the file just fine and read every line. Where is this memory usage coming from? My DynamoDB write throughput is at 500, and adjusting this value has no effect.
For anyone who is trudging through the internet trying to find out why DynamoDB is consuming all the heap memory, there is a GitHub bug report here: https://github.com/aws/aws-sdk-js/issues/1777#issuecomment-339398912
Basically, the AWS SDK only has 50 sockets to make HTTP requests with; if all sockets are consumed, events are queued until a socket becomes available. When processing millions of requests, these sockets get consumed immediately, and then the queue builds up until it blows up the heap.
So, then how do you get around this?
Increase heap size
Increase number of sockets
Control how many "events" you are queueing
Options 1 and 2 are the easy way out, but they do not scale. They might work for your scenario if you are doing a one-off job, but if you are trying to build a robust solution, you will want to go with number 3.
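For reference, option 2 in the AWS SDK for JavaScript v2 looks roughly like this (500 is an arbitrary placeholder to tune for your workload):
const AWS = require('aws-sdk');
const https = require('https');

// Give the SDK a larger socket pool than the default 50
AWS.config.update({
    httpOptions: {
        agent: new https.Agent({ keepAlive: true, maxSockets: 500 })
    }
});

const dynamodb = new AWS.DynamoDB();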
To do number 3, I determine the max heap size and divide it by how large I think an "event" will be in memory. For example: I assume an updateItem event for DynamoDB would be 100,000 bytes. My heap size was 4 GB, so 4,000,000,000 B / 100,000 B = 40,000 events. However, I only take 50% of this many events, to leave room on the heap for other work the Node application might be doing. This percentage can be lowered or increased depending on your preference. Once I have the number of events, I read a line from the csv and consume an event; when the event has completed, I release the event back into the pool. If there are no events available, I pause the input stream to the csv until an event becomes available.
Now I can upload millions of entries to dynamodb without any worry of blowing up the heap.
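A rough sketch of that idea, using nothing more than an in-flight counter plus readline's pause()/resume() (readLine is the createInterface(...) from the question; the limit and the item shape are placeholders):
const MAX_IN_FLIGHT = 20000; // e.g. 50% of (heap size / estimated bytes per event)
let inFlight = 0;

readLine.on('line', async (line) => {
    inFlight++;
    if (inFlight >= MAX_IN_FLIGHT) {
        readLine.pause(); // stop consuming the csv until pending requests drain
    }
    try {
        await dynamodb.putItem({
            Item: { /* ...attributes built from line... */ },
            TableName: "My-Table-Name"
        }).promise();
    } catch (err) {
        console.error('putItem failed:', err);
    } finally {
        inFlight--;
        if (inFlight < MAX_IN_FLIGHT) {
            readLine.resume();
        }
    }
});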

Nodejs: How can I optimize writing many files?

I'm working in a Node environment on Windows. My code is receiving 30 Buffer objects (~500-900kb each) each second, and I need to save this data to the file system as quickly as possible, without engaging in any work that blocks the receipt of the following Buffer (i.e. the goal is to save the data from every buffer, for ~30-45 minutes). For what it's worth, the data is sequential depth frames from a Kinect sensor.
My question is: What is the most performant way to write files in Node?
Here's pseudocode:
let num = 0
async function writeFile(filename, data) {
    fs.writeFileSync(filename, data)
}
// This fires 30 times/sec and runs for 30-45 min
dataSender.on('gotData', function(data){
    let filename = 'file-' + num++
    // Do anything with data here to optimize write?
    writeFile(filename, data)
})
fs.writeFileSync seems much faster than fs.writeFile, which is why I'm using that above. But are there any other ways to operate on the data or write to file that could speed up each save?
First off, you never want to use fs.writeFileSync() when handling real-time requests, because it blocks the entire node.js event loop until the file write is done.
OK, given that you are writing each block of data to a different file, you want to allow multiple disk writes to be in process at the same time, but not unlimited disk writes. So it's still appropriate to use a queue, but this time the queue doesn't just have one write in process at a time, it has some number of writes in process at the same time:
const fs = require('fs');
const EventEmitter = require('events');

class Queue extends EventEmitter {
    constructor(basePath, baseIndex, concurrent = 5) {
        super();
        this.basePath = basePath;
        this.q = [];
        this.paused = false;
        this.inFlightCntr = 0;
        this.fileCntr = baseIndex;
        this.maxConcurrent = concurrent;
    }

    // add item to the queue and write (if not already writing)
    add(data) {
        this.q.push(data);
        this.write();
    }

    // write next block from the queue (if not already writing)
    write() {
        while (!this.paused && this.q.length && this.inFlightCntr < this.maxConcurrent) {
            this.inFlightCntr++;
            let buf = this.q.shift();
            try {
                fs.writeFile(this.basePath + this.fileCntr++, buf, err => {
                    this.inFlightCntr--;
                    if (err) {
                        this.err(err);
                    } else {
                        // write more data
                        this.write();
                    }
                });
            } catch(e) {
                this.err(e);
            }
        }
    }

    err(e) {
        this.pause();
        this.emit('error', e);
    }

    pause() {
        this.paused = true;
    }

    resume() {
        this.paused = false;
        this.write();
    }
}
let q = new Queue("file-", 0, 5);

// This fires 30 times/sec and runs for 30-45 min
dataSender.on('gotData', function(data){
    q.add(data);
});

q.on('error', function(e) {
    // got some sort of write error here
    console.log(e);
});
Things to consider:
Experiment with the concurrent value you pass to the Queue constructor. Start with a value of 5, then see whether raising it any higher gives you better or worse performance. The node.js file I/O subsystem uses a thread pool to implement asynchronous disk writes, so there is a maximum number of concurrent writes it will allow; cranking the concurrent number up really high probably does not make things go faster.
You can experiment with increasing the size of the disk I/O thread pool by setting the UV_THREADPOOL_SIZE environment variable before you start your node.js app.
Your biggest friend here is disk write speed. So, make sure you have a fast disk with a good disk controller. A fast SSD on a fast bus would be best.
If you can spread the writes out across multiple actual physical disks, that will likely also increase write throughput (more disk heads at work).
This is a prior answer based on the initial interpretation of the question (before editing that changed it).
Since it appears you need to do your disk writes in order (all to the same file), I'd suggest that you either use a write stream and let the stream object serialize and cache the data for you, or create a queue yourself like this:
const fs = require('fs');
const EventEmitter = require('events');

class Queue extends EventEmitter {
    // takes an already opened file handle
    constructor(fileHandle) {
        super();
        this.f = fileHandle;
        this.q = [];
        this.nowWriting = false;
        this.paused = false;
    }

    // add item to the queue and write (if not already writing)
    add(data) {
        this.q.push(data);
        this.write();
    }

    // write next block from the queue (if not already writing)
    write() {
        if (!this.nowWriting && !this.paused && this.q.length) {
            this.nowWriting = true;
            let buf = this.q.shift();
            fs.write(this.f, buf, (err, bytesWritten) => {
                this.nowWriting = false;
                if (err) {
                    this.pause();
                    this.emit('error', err);
                } else {
                    // write next block
                    this.write();
                }
            });
        }
    }

    pause() {
        this.paused = true;
    }

    resume() {
        this.paused = false;
        this.write();
    }
}
// pass an already opened file handle (file descriptor from fs.open)
let q = new Queue(fileHandle);

// This fires 30 times/sec and runs for 30-45 min
dataSender.on('gotData', function(data){
    q.add(data);
});

q.on('error', function(err) {
    // got disk write error here
});
You could use a writeStream instead of this custom Queue class, but the problem with that is that the writeStream may fill up and then you'd have to have a separate buffer as a place to put the data anyway. Using your own custom queue like above takes care of both issues at once.
Other Scalability/Performance Comments
Because you appear to be writing the data serially to the same file, your disk writing won't benefit from clustering or running multiple operations in parallel because they basically have to be serialized.
If your node.js server has other things to do besides just doing these writes, there might be a slight advantage (would have to be verified with testing) to creating a second node.js process and doing all the disk writing in that other process. Your main node.js process would receive the data and then pass it to the child process that would maintain the queue and do the writing.
Another thing you could experiment with is coalescing writes. When you have more than one item in the queue, you could combine them together into a single write. If the writes are already sizable, this probably doesn't make much difference, but if the writes were small this could make a big difference (combining lots of small disk writes into one larger write is usually more efficient).
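A minimal sketch of coalescing, assuming the queued items are Buffers and fd is the already-open file descriptor used by the Queue above:
const fs = require('fs');

// Combine everything currently queued into one Buffer and issue a single write.
function flushQueue(fd, queue, callback) {
    if (queue.length === 0) return callback(null, 0);
    // splice(0) empties the queue array and returns its contents
    const combined = Buffer.concat(queue.splice(0));
    fs.write(fd, combined, 0, combined.length, null, callback);
}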
Your biggest friend here is disk write speed. So, make sure you have a fast disk with a good disk controller. A fast SSD would be best.
I have written a service that does this extensively and the best thing you can do is to pipe the input data directly to the file (if you have an input stream as well).
A simple example where you download a file in such a way:
const http = require('http')
const fs = require('fs')

const ostream = fs.createWriteStream('./output')

http.get('http://nodejs.org/dist/index.json', (res) => {
    res.pipe(ostream)
})
.on('error', (e) => {
    console.error(`Got error: ${e.message}`);
})
So in this example no intermediate copy of the whole file is involved. As the file is read in chunks from the remote HTTP server, it is written to the file on disk. This is much more efficient than downloading the whole file from the server, saving it in memory, and then writing it to a file on disk.
Streams are a basis of many operations in Node.js so you should study those as well.
One other thing that you should investigate, depending on your scenario, is UV_THREADPOOL_SIZE, as I/O operations use the libuv thread pool, which is set to 4 by default, and you might fill that up if you do a lot of writing.
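For example (assuming a bash-like shell), the pool size is typically set in the environment before Node starts:
// Launch with a larger libuv thread pool, e.g.:
//   UV_THREADPOOL_SIZE=16 node app.js
// Inside the process you can only read back the environment variable itself; the default is 4.
console.log('UV_THREADPOOL_SIZE =', process.env.UV_THREADPOOL_SIZE || '4 (default)');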

Reading file in segments of X number of lines

I have a file with a lot of entries (10+ million), each representing a partial document that is being saved to a mongo database (based on some criteria, non-trivial).
To avoid overloading the database (which is doing other operations at the same time), I wish to read in chunks of X lines, wait for them to finish, read the next X lines, etc.
Is there any way to use any of the fs callback mechanisms to also "halt" progress at a certain point, without blocking the entire program? From what I can tell they all run from start to finish with no way of stopping, unless you stop reading the file entirely.
The issue is that because of the file size, memory also becomes a problem, and because of the time the updates take, a LOT of the data will be held in memory, exceeding the 1 GB limit and causing the program to crash. Secondly, as I said, I don't want to queue 1 million updates and completely stress the mongo database.
Any and all suggestions welcome.
UPDATE: Final solution using line-reader (available via npm) below, in pseudo-code.
var lineReader = require('line-reader');
var filename = <wherever you get it from>;

lineReader.eachLine(filename, function(line, last, cb) {
    //
    // Do work here, line contains the line data
    // last is true if it's the last line in the file
    //
    function checkProcessed(callback) {
        if (doneProcessing()) { // Implement doneProcessing to check whether whatever you are doing is done
            callback();
        }
        else {
            setTimeout(function() { checkProcessed(callback) }, 100); // Adjust timeout according to the expected time to process one line
        }
    }
    checkProcessed(cb);
});
This is implemented to make sure doneProcessing() returns true before attempting to work on more lines - this means you can effectively throttle whatever you are doing.
I don't use MongoDB and I'm not an expert in using Lazy, but I think something like the below might work or give you some ideas (note that I have not tested this code).
var fs = require('fs'),
    lazy = require('lazy');

var readStream = fs.createReadStream('yourfile.txt');

var file = lazy(readStream)
    .lines                      // ask to read the stream line by line
    .take(100)                  // and read 100 lines at a time
    .join(function(oneHundredLines) {
        readStream.pause();     // pause reading the stream
        writeToMongoDB(oneHundredLines, function(err) {
            // error checking goes here
            // resume the stream 1 second after MongoDB finishes saving
            setTimeout(readStream.resume, 1000);
        });
    });
