Async write file million times cause out of memory - node.js

Below is code:
var fs = require('fs')
for(let i=0;i<6551200;i++){
fs.appendFile('file',i,function(err){
})
}
When I run this code, after a few seconds, it show:
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
and yet nothing in file!
my qusetion is :
why is no byte in file?
where cause out of memory?
how to async write file in for loop no mater how large the write times?
thanks advance.

Bottom line here is that fs.appendFile() is an asynchronous call and you simply are not "awaiting" that call to complete on each loop iteration. This has a number of consequences, including but not limited to:
The callbacks keep getting allocated before they are resolved, which results in the "heap out of memory" eventually being reached.
You are contesting with a file handle, since you the function you are employing is actually opening/writing/closing the file given, and if you don't wait for each turn to do so, then you're simply going to clash.
So the simple solution here is to "wait", and some modern syntax sugar makes that easy:
const fs = require('mz/fs');
const x = 6551200;
(async function() {
try {
const fd = await fs.open('file','w');
for (let i = 0; i < x; i++) {
await fs.write(fd, `${i}\n`);
}
await fs.close(fd);
} catch(e) {
console.error(e)
} finally {
process.exit();
}
})()
That will of course take a while, but it's not going to "blow up" your system whilst it does it's work.
The very first simplified thing is to just get hold of the mz library, which already wraps common nodejs libraries with modernized versions of each function supporting promises. This will help clean up the syntax a lot as opposed to using callbacks.
The next thing to realize is what was mentioned about that fs.appendFile() in how it is "opening/writing/closing" all in one call. That's not great, so what you would typically do is simply open and then write the bytes in a loop, and when that is complete you can actually close the file handle.
That "sugar" comes in modern versions, and though "possible" with plain promise chaining, it's still not really that manageable. So if you don't actually have a nodejs environment that supports that async/await sugar or the tools to "transpile" such code, then you might alternately consider using the asyncjs libary with plain callbacks:
const Async = require('async');
const fs = require('fs');
const x = 6551200;
let i = 0;
fs.open('file','w',(err,fd) => {
if (err) throw err;
Async.whilst(
() => i < x,
callback => fs.write(fd,`${i}\n`,err => {
i++;
callback(err)
}),
err => {
if (err) throw err;
fs.closeSync(fd);
process.exit();
}
);
});
The same base principle applies as we are "waiting" for each callback to complete before continuing. the whilst() helper here allows iteration until the test condition is met, and of course does not do the next iteration until data is passed to the callback of the iterator itself.
There are other ways to approach this, but those are probably the two most sane for a "large loop" of iterations. Common approaches such as "chaining" via .reduce() are really more suited to a "reasonable" sized array of data you already have, and building arrays of such sizes here has inherent problems of it's own.
For instance, the following "works" ( on my machine at least ) but it really consumes a lot of resources to do it:
const fs = require('mz/fs');
const x = 6551200;
fs.open('file','w')
.then( fd =>
[ ...Array(x)].reduce(
(p,e,i) => p.then( () => fs.write(fd,`${i}\n`) )
, Promise.resolve()
)
.then(() => fs.close(fd))
)
.catch(e => console.error(e) )
.then(() => process.exit());
So that's really not that practical to essentially build such a large chain in memory and then allow it to resolve. You could put some "governance" on this, but the main two approaches as shown are a lot more straightforward.
For that case then you either have the async/await sugar available as it is within current LTS versions of Node ( LTS 8.x ), or I would stick with the other tried and true "async helpers" for callbacks where you were restricted to a version without that support
You can of course "promisify" any function with the last few releases of nodejs right "out of the box" as it where, as Promise has been a global thing for some time:
const fs = require('fs');
await new Promise((resolve, reject) => fs.open('file','w',(err,fd) => {
if (err) reject(err);
resolve(fd);
});
So there really is no need to import libraries just to do that, but the mz library given as example here does all of that for you. So it's really up to personal preferences on bringing in additional dependencies.

Javascript is a single threaded language, which means your code can execute one function at the time. So when you execute an async function, it will be "queued" in the stack to be executed next.
so in your code, you are sending 6551200 calls to the stack, which would of course crash your app before starting working "appendFile" on any of them.
you can achieve what you want by splitting your loop into smaller loops, use async and await functions, or iterators.
if what you are trying to achieve is as simple as your code, you can use the following:
const fs = require("fs");
function SomeTask(i=0){
fs.appendFile('file',i,function(err){
//err in the write function
if(err) console.log("Error", err);
//check if you want to continue (loop)
if(i<6551200) return SomeTask(i);
//on finish
console.log("done");
});
}
SomeTask();
In the above code, you write a single line, and when that is done, you call the next one.
This function is just for basic usage, it needs a refactor and use of Javascript Iterators for advanced usage check out Iterators and generators on MDN web docs

1 - The file is empty because none of the fs.append calls have ever finished, the Node.JS process broken before.
2 - The Node.JS heap memory is limited and stores the callback until it returns, not only the "i" variable.
3 - You could try to use promises to do that.
"use strict";
const Bluebird = require('bluebird');
const fs = Bluebird.promisifyAll(require('fs'));
let promisses = [];
for (let i = 0; i < 6551200; i++){
promisses.push(fs.appendFileAsync('file', i + '\n'));
}
Bluebird.all(promisses)
.then(data => {
console.log(data, 'End.');
})
.catch(e => console.error(e));
But no logic can avoid heap memory error for a loop this big. You could increase Node.JS Heep Memory or, the reasonable way, take chunks of data for interval:
'use strict';
const fs = require('fs');
let total = 6551200;
let interval = setInterval(() => {
fs.appendFile('file', total + '\n', () => {});
total--;
if (total < 1) {
clearInterval(interval);
}
}, 1);

Related

Async/Await doesn't await

Building a node.js CLI application. Users should choose some tasks to run and based on that, tasks should work and then spinners (using ora package) should show success and stop spin.
The issue here is spinner succeed while tasks are still going on. Which means it doesn't wait.
Tried using typical Async/Await as to have an async function and await each function under condition. Didn't work.
Tried using promise.all. Didn't work.
Tried using waterfall. Same.
Here's the code of the task runner, I create an array of functions and pass it to waterfall (Async-waterfall package) or promise.all() method.
const runner = async () => {
let tasks = [];
spinner.start('Running tasks');
if (syncOptions.includes('taskOne')) {
tasks.push(taskOne);
}
if (syncOptions.includes('taskTwo')) {
tasks.push(taskTwo);
}
if (syncOptions.includes('taskThree')) {
tasks.push(taskThree);
}
if (syncOptions.includes('taskFour')) {
tasks.push(taskFour);
}
// Option One
waterfall(tasks, () => {
spinner.succeed('Done');
});
// Option Two
Promise.all(tasks).then(() => {
spinner.succeed('Done');
});
};
Here's an example of one of the functions:
const os = require('os');
const fs = require('fs');
const homedir = os.homedir();
const outputDir = `${homedir}/output`;
const file = `${homedir}/.file`;
const targetFile = `${outputDir}/.file`;
module.exports = async () => {
await fs.writeFileSync(targetFile, fs.readFileSync(file));
};
I tried searching concepts. Talked to the best 5 people I know who can write JS properly. No clue.. What am I doing wrong ?
You don't show us all your code, but the first warning sign is that it doesn't appear you are actually running taskOne(), taskTwo(), etc...
You are pushing what look like functions into an array with code like:
tasks.push(taskFour);
And, then attempting to do:
Promise.all(tasks).then(...)
That won't do anything useful because the tasks themselves are never executed. To use Promise.all(), you need to pass it an array of promises, not an array of functions.
So, you would use:
tasks.push(taskFour());
and then:
Promise.all(tasks).then(...);
And, all this assumes that taskOne(), taskTwo(), etc... are function that return a promise that resolves/rejects when their asynchronous operation is complete.
In addition, you also need to either await Promise.all(...) or return Promise.all() so that the caller will be able to know when they are all done. Since this is the last line of your function, I'd generally just use return Promise.all(...) and this will let the caller get the resolved results from all the tasks (if those are relevant).
Also, this doesn't make much sense:
module.exports = async () => {
await fs.writeFileSync(targetFile, fs.readFileSync(file));
};
You're using two synchronous file operations. They are not asynchronous and do not use promises so there's no reason to put them in an async function or to use await with them. You're mixing two models incorrectly. If you want them to be synchronous, then you can just do this:
module.exports = () => {
fs.writeFileSync(targetFile, fs.readFileSync(file));
};
If you want them to be asynchronous and return a promise, then you can do this:
module.exports = async () => {
return fs.promises.writeFile(targetFile, await fs.promises.readFile(file));
};
Your implementation was attempting to be half and half. Pick one architecture or the other (synchronous or asynchronous) and be consistent in the implementation.
FYI, the fs module now has multiple versions of fs.copyFile() so you could also use that and let it do the copying for you. If this file was large, copyFile() would likely use less memory in doing so.
As for your use of waterfall(), it is probably not necessary here and waterfall uses a very different calling model than Promise.all() so you certainly can't use the same model with Promise.all() as you do with waterfall(). Also, waterfall() runs your functions in sequence (one after the other) and you pass it an array of functions that have their own calling convention.
So, assuming that taskOne, taskTwo, etc... are functions that return a promise that resolve/reject when their asynchronous operations are done, then you would do this:
const runner = () => {
let tasks = [];
spinner.start('Running tasks');
if (syncOptions.includes('taskOne')) {
tasks.push(taskOne());
}
if (syncOptions.includes('taskTwo')) {
tasks.push(taskTwo());
}
if (syncOptions.includes('taskThree')) {
tasks.push(taskThree());
}
if (syncOptions.includes('taskFour')) {
tasks.push(taskFour());
}
return Promise.all(tasks).then(() => {
spinner.succeed('Done');
});
};
This would run the tasks in parallel.
If you want to run the tasks in sequence (one after the other), then you would do this:
const runner = async () => {
spinner.start('Running tasks');
if (syncOptions.includes('taskOne')) {
await taskOne();
}
if (syncOptions.includes('taskTwo')) {
await taskTwo();
}
if (syncOptions.includes('taskThree')) {
await taskThree();
}
if (syncOptions.includes('taskFour')) {
await taskFour();
}
spinner.succeed('Done');
};

Possible to make an event handler wait until async / Promise-based code is done?

I am using the excellent Papa Parse library in nodejs mode, to stream a large (500 MB) CSV file of over 1 million rows, into a slow persistence API, that can only take one request at a time. The persistence API is based on Promises, but from Papa Parse, I receive each parsed CSV row in a synchronous event like so: parseStream.on("data", row => { ... }
The challenge I am facing is that Papa Parse dumps its CSV rows from the stream so fast that my slow persistence API can't keep up. Because Papa is synchronous and my API is Promise-based, I can't just call await doDirtyWork(row) in the on event handler, because sync and async code doesn't mix.
Or can they mix and I just don't know how?
My question is, can I make Papa's event handler wait for my API call to finish? Kind of doing the persistence API request directly in the on("data") event, making the on() function linger around somehow until the dirty API work is done?
The solution I have so far is not much better than using Papa's non-streaming mode, in terms of memory footprint. I actually need to queue up the torrent of on("data") events, in form of generator function iterations. I could have also queued up promise factories in an array and work it off in a loop. Any which way, I end up saving almost the entire CSV file as huge collection of future Promises (promise factories) in memory, until my slow API calls have worked all the way through.
async importCSV(filePath) {
let parsedNum = 0, processedNum = 0;
async function* gen() {
let pf = yield;
do {
pf = yield await pf();
} while (typeof pf === "function");
};
var g = gen();
g.next();
await new Promise((resolve, reject) => {
try {
const dataStream = fs.createReadStream(filePath);
const parseStream = Papa.parse(Papa.NODE_STREAM_INPUT, {delimiter: ",", header: false});
dataStream.pipe(parseStream);
parseStream.on("data", row => {
// Received a CSV row from Papa.parse()
try {
console.log("PA#", parsedNum, ": parsed", row.filter((e, i) => i <= 2 ? e : undefined)
);
parsedNum++;
// Simulate some really slow async/await dirty work here, for example
// send requests to a one-at-a-time persistence API
g.next(() => { // don't execute now, call in sequence via the generator above
return new Promise((res, rej) => {
console.log(
"DW#", processedNum, ": dirty work START",
row.filter((e, i) => i <= 2 ? e : undefined)
);
setTimeout(() => {
console.log(
"DW#", processedNum, ": dirty work STOP ",
row.filter((e, i) => i <= 2 ? e : undefined)
);
processedNum++;
res();
}, 1000)
})
});
} catch (err) {
console.log(err.stack);
reject(err);
}
});
parseStream.on("finish", () => {
console.log(`Parsed ${parsedNum} rows`);
resolve();
});
} catch (err) {
console.log(err.stack);
reject(err);
}
});
while(!(await g.next()).done);
}
So why the rush Papa? Why not allow me to work down the file a bit slower -- the data in the original CSV file isn't gonna run away, we have hours to finish the streaming, why hammer me with on("data") events that I can't seem to slow down?
So what I really need is for Papa to become more of a grandpa, and minimize or eliminate any queuing or buffering of CSV rows. Ideally I would be able to completely sync Papa's parsing events with the speed (or lack thereof) of my API. So if it weren't for the dogma that async code can't make sync code "sleep", I would ideally send each CSV row to the API inside the Papa event, and only then return control to Papa.
Suggestions? Some kind of "loose coupling" of the event handler with the slowness of my async API is fine too. I don't mind if a few hundred rows get queued up. But when tens of thousands pile up, I will run out of heap fast.
Why hammer me with on("data") events that I can't seem to slow down?
You can, you just were not asking papa to stop. You can do this by calling stream.pause(), then later stream.resume() to make use of Node stream's builtin back-pressure.
However, there's a much nicer API to use than dealing with this on your own in callback-based code: use the stream as an async iterator! When you await in the body of a for await loop, the generator has to pause as well. So you can write
async importCSV(filePath) {
let parsedNum = 0;
const dataStream = fs.createReadStream(filePath);
const parseStream = Papa.parse(Papa.NODE_STREAM_INPUT, {delimiter: ",", header: false});
dataStream.pipe(parseStream);
for await (const row of parseStream) {
// Received a CSV row from Papa.parse()
const data = row.filter((e, i) => i <= 2 ? e : undefined);
console.log("PA#", parsedNum, ": parsed", data);
parsedNum++;
await dirtyWork(data);
}
console.log(`Parsed ${parsedNum} rows`);
}
importCSV('sample.csv').catch(console.error);
let processedNum = 0;
function dirtyWork(data) {
// Simulate some really slow async/await dirty work here,
// for example send requests to a one-at-a-time persistence API
return new Promise((res, rej) => {
console.log("DW#", processedNum, ": dirty work START", data)
setTimeout(() => {
console.log("DW#", processedNum, ": dirty work STOP ", data);
processedNum++;
res();
}, 1000);
});
}
Async code in JavaScript can sometimes be a little hard to grok. It's important to remember how Node operates handles concurrency.
The node process is single-threaded, but it uses a concept called an event loop. The consequence of this is that async code and callbacks are essentially equivalent representations of the same thing.
Of course, you need an async function to use await, but your callback from Papa Parse can be an async function:
parse.on("data", async row => {
await sync(row)
})
Once the await operation completes, the arrow function ends, and all references to row will be eliminated, so the garbage collector can successfully collect row, releasing that memory.
The effect this has is concurrently executing sync every time a row is parsed, so if you can only sync one record at a time, then I would recommend wrapping the sync function in a debouncer.

what is the right way to fork a loop in node.js

So i have created server which collect data and write it into db in never ending loop.
server.listen(3001, () => {
doFullScan();
});
async function doFullScan() {
while (true) {
await collectAllData();
}
}
collectAllData() is a method which check for available projects, loop through each project collect some data and write it into db.
async function collectAllData() {
//doing soemhting
const projectNames = ['array with projects name'];
//this loop takes too much of time
for(let project in projectNames){
await collectProjectData(project);
}
//doing something
}
The problem is that whole loop is taking too much time. So i would like to speed it up by multithreading loop and use all of my computer cores on it.
How should i do it?
There is cluster library with examples on https://nodejs.org/docs/latest/api/cluster.html but i don't want to create new servers. I want to spawn childrens, which will do a task and exit after they have done its job.
So there is const { fork } = require('child_process'); but I'm not exactly sure how to make each fork run only collectProjectData() method.
You can do it natively without any third party libraries.
Right now, your for...loop is running each one after the other.
Option 1
Use Promise.all and .map
await Promise.all(projectNames.map(async(projectName) => {
await collectProjectData(projectName);
});
Note, if you use .map, it will kick off all of them, all at the same time, which might be too much if projectNames continue to grow.
This is the complete opposite of what yours is doing currently.
Option 2
There is a middle way...running batches in sequence, but items inside each batch asynchronously.
const chunk = (a, l) => a.length === 0 ? [] : [a.slice(0, l)].concat(chunk(a.slice(l), l));
const batchSize = 10;
const projectNames = ['array with projects name'];
let projectNamesInChunks = chunk(projectNames, batchSize);
for(let chunk of projectNamesInChunks){
await Promise.all(chunk.map(async(projectName) => {
await collectProjectData(projectName);
});
}
I recommend using Promise.map
http://bluebirdjs.com/docs/api/promise.map.html
that way you can control the level of concurrency as you wish like this:
await Promise.map(projectNames, collectProjectData, {concurrency: 3})

Which functions could work as synchronous in node.js?

For example, I am writing a random generator with crypto.randomBytes(...) along with another async functions. To avoiding fall in callback hell, I though I could use the sync function of crypto.randomBytes. My doubt is if I do that my node program will stop each time I execute the code?. Then I thought if there are a list of async functions which their time to run is very short, these could work as synchronous function, then developing with this list of functions would be easy.
Using the mz module you can make crypto.randomBytes() return a promise. Using await (available in Node 7.x using the --harmony flag) you can use it like this:
let crypto = require('mz/crypto');
async function x() {
let bytes = await crypto.randomBytes(4);
console.log(bytes);
}
x();
The above is nonblocking even though it looks like it's blocking.
For a better demonstration consider this example:
function timeout(time) {
return new Promise(res => setTimeout(res, time));
}
async function x() {
for (let i = 0; i < 10; i++) {
console.log('x', i);
await timeout(2000);
}
}
async function y() {
for (let i = 0; i < 10; i++) {
console.log('y', i);
await timeout(3000);
}
}
x();
y();
And note that those two functions take a lot of time to execute but they don't block each other.
Run it with Node 7.x using:
node --harmony script-name.js
Or with Node 8.x with:
node script-name.js
I show you those examples to demonstrate that it's not a choice of async with callback hell and sync with nice code. You can actually run async code in a very elegant manner using the new async function and await operator available in ES2017 - it's good to read about it because not a lot of people know about those features.
They're asynchronous, learn to deal with it.
Promises now, and in the future ES2017's await and async will make your life a lot easier.
Bluebirds promisifyAll is extremely useful when dealing with any standard Node.js callback API. It adds functions tagged with Async that return a promise instead of requiring a callback.
const Promise = require('bluebird')
const crypto = Promise.promisifyAll(require('crypto'))
function randomString() {
return crypto.randomBytesAsync(4).then(bytes => {
console.log('got bytes', bytes)
return bytes.toString('hex')
})
}
randomString()
.then(string => console.log('string is', string))
.catch(error => console.error(error))

Should loops be avoided in Node.JS or is there a special way to handle them?

Loops are blocking. They seem indifferent to the idea of Node.JS. How to handle the flow where a for loop or a while loop seems to be the best option.
For example, if I want to print a table of a random number upto number * 1000, I would want to use the for loop. Is there a special way to handle this in Node.JS?
Loops are not per se bad, but it depends on the situation. In most cases you will need to do some async stuff inside loops though.
So my personal preference is to not use loops at all but instead go with the functional counterparts (forEach/map/reduce/filter). This way my code base stays consistent (and a sync loop is easily changed to an async one if needed).
const myArr = [1, 2, 3];
// sync loops
myArr.forEach(syncLogFunction);
console.log('after sync loop');
function syncLogFunction(entry) {
console.log('sync loop', entry);
}
// now we want to change that into an async operation:
Promise.all(myArr.map(asyncLogFunction))
.then(() => console.log('after async loop'));
function asyncLogFunction(entry) {
console.log('async loop', entry);
return new Promise(resolve => setTimeout(resolve, 100));
}
Notice how easily you can change between sync and async versions, the structure stays almost the same.
Hope this helps a bit.
If you are doing loops on data in memory (for example, you want to go through an array and add a prop to all objects), loops will work normally, but if you need to do something inside the loop like save values to a DB, you will run into some issues.
I realize this isn't exactly the answer, but it's a suggestion that may help someone. I found one of the easiest ways to deal with this issue is using rate limiter with a forEach (I don't like really promises). This also gives the added benefit of having the option to process things in parallel, but only move on when everything is done:
https://github.com/jhurliman/node-rate-limiter
var RateLimiter = require('limiter').RateLimiter;
var limiter = new RateLimiter(1, 5);
exports.saveFile = function (myArray, next) {
var completed = 0;
var totalFiles = myArray.length;
myArray.forEach(function (item) {
limiter.removeTokens(1, function () {
//call some async function
saveAndLog(item, function (err, result) {
//check for errors
completed++;
if (completed == totalFiles) {
//call next function
exports.process();
}
});
});
});
};

Resources