Why is fs.readFileSync() faster than await fsPromises.readFile()? - node.js

Here is the test code (in an express environment just because that's what I happen to be messing around with):
const fs = require('fs-extra');
const fsPromises = fs.promises;
const express = require('express');
const app = express();

const speedtest = async function (req, res, next) {
  const useFsPromises = (req.params.promises == 'true');
  const jsonFileName = './json/big-file.json';
  const hrstart = process.hrtime();
  if (useFsPromises) {
    await fsPromises.readFile(jsonFileName);
  } else {
    fs.readFileSync(jsonFileName);
  }
  res.send(`time taken to read: ${process.hrtime(hrstart)[1]/1000000} ms`);
};

app.get('/speedtest/:promises', speedtest);
The big-file.json file is around 16 MB. Using node 12.18.4.
Typical results (they vary quite a bit around these values, but the following are representative):
https://dev.mydomain.com/speedtest/false
time taken to read: 3.948152 ms
https://dev.mydomain.com/speedtest/true
time taken to read: 61.865763 ms
UPDATE to include two more variants... plain fs.readFile() and also a promisified version of this:
const fs = require('fs-extra');
const fsPromises = fs.promises;
const util = require('util');
const readFile = util.promisify(fs.readFile);
const express = require('express');
const app = express();

const speedtest = async function (req, res, next) {
  const type = req.params.type;
  const jsonFileName = './json/big-file.json';
  const hrstart = process.hrtime();
  if (type == 'readFileFsPromises') {
    await fsPromises.readFile(jsonFileName);
  } else if (type == 'readFileSync') {
    fs.readFileSync(jsonFileName);
  } else if (type == 'readFileAsync') {
    return fs.readFile(jsonFileName, function (err, jsondata) {
      res.send(`time taken to read: ${process.hrtime(hrstart)[1]/1000000} ms`);
    });
  } else if (type == 'readFilePromisified') {
    await readFile(jsonFileName);
  }
  res.send(`time taken to read: ${process.hrtime(hrstart)[1]/1000000} ms`);
};

app.get('/speedtest/:type', speedtest);
I am finding that fsPromises.readFile() is the slowest, while the others are much faster and all roughly the same in terms of read time. I should add that in a different example (which I can't fully verify, so I'm not sure what was going on) the time difference was vastly bigger than reported here. It seems to me, at present, that fsPromises.readFile() should simply be avoided, since there are other async/promise options.

After stepping through each implementation in the debugger (fs.readFileSync and fs.promises.readFile), I can confirm that the synchronous version reads the entire file in one large chunk (the size of the file), whereas fs.promises.readFile() reads 16,384 bytes at a time in a loop, with an await on each read. This is going to make fs.promises.readFile() go back to the event loop multiple times before it can read the entire file. Besides giving other things a chance to run, it's extra overhead to go back to the event loop on every cycle through a for loop. There's also memory management overhead because fs.promises.readFile() allocates a series of Buffer objects and then combines them all at the end, whereas fs.readFileSync() allocates one large Buffer object at the beginning and just reads the entire file into that one Buffer.
So, the synchronous version, which is allowed to hog the entire CPU, is just faster from a pure time to completion point of view (it's significantly less efficient from a CPU cycles used point of view in a multi-user server because it blocks the event loop from doing anything else during the read). The asynchronous version is reading in smaller chunks, probably to avoid blocking the event loop too much so other things can effectively interleave and run while fs.promises.readFile() is doing its thing.
For a project I worked on a while ago, I wrote my own simple asynchronous version of readFile() that reads the entire file at once and it was significantly faster than the built-in implementation. I was not concerned about event loop blockage in that particular project so I did not investigate if that's an issue.
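As a rough illustration only (this is not the actual code from that project, just a sketch of the idea), a one-shot asynchronous read might look something like this:
const fs = require('fs');

// Sketch: read the whole file with a single read() call instead of 16 KB chunks
function readFileWhole(path) {
  return new Promise((resolve, reject) => {
    fs.open(path, 'r', (err, fd) => {
      if (err) return reject(err);
      fs.fstat(fd, (err, stats) => {
        if (err) return fs.close(fd, () => reject(err));
        const buf = Buffer.allocUnsafe(stats.size);
        // One read for the entire file, so we only bounce off the event loop once
        fs.read(fd, buf, 0, stats.size, 0, (err, bytesRead) => {
          fs.close(fd, closeErr => {
            if (err || closeErr) return reject(err || closeErr);
            resolve(buf.slice(0, bytesRead));
          });
        });
      });
    });
  });
}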
In addition, fs.readFile() reads the file in 524,288 byte chunks (much larger chunks than fs.promises.readFile()) and does not use await, using just plain callbacks. It is apparently just coded more optimally than the promise implementation. I don't know why they rewrote the function in a slower way for the fs.promises.readFile() implementation. For now, it appears that wrapping fs.readFile() with a promise would be faster.

Related

Is there a memory leak potential if I just use res.once('finish', ... with express?

Let's say I have the following express middleware:
const responseTime = (_, res, next) => {
  const startAt = process.hrtime();
  res.once('finish', () => {
    // calculate time taken for request
    const [seconds, nanoseconds] = process.hrtime(startAt);
    const milliseconds = seconds * 1e3 + nanoseconds * 1e-6;
    res.locals['responseTime'] = milliseconds;
  });
  next();
};
Is this safe? What I'm thinking is that this callback may never fire if the client closes the connection early, so .once never calls removeListener under the hood; that's the concern.
But... then again, if the entire response object is thrown away, shouldn't all this be GC'd away? I did try this out with a debugger attached and couldn't see any leaks, but maybe I'm being dumb and not seeing it?
The only reason I'm second-guessing myself is that doing explicit res.removeListener cleanup is fairly commonplace if you search around https://grep.app/search?q=res.removeListener%28&filter[lang][0]=TypeScript&filter[lang][1]=JavaScript
specific example: https://github.com/instructure/canvas-rce-api/blob/master/app/middleware/stats.js (but I've seen this pattern copied and pasted in a bunch of codebases)
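For reference, here's a sketch of the explicit-cleanup variant I'm weighing this against (the handler names are mine, just to illustrate the pattern, not code from any of those codebases):
const responseTimeWithCleanup = (_, res, next) => {
  const startAt = process.hrtime();
  const onFinish = () => {
    cleanup();
    const [seconds, nanoseconds] = process.hrtime(startAt);
    res.locals['responseTime'] = seconds * 1e3 + nanoseconds * 1e-6;
  };
  const onClose = () => cleanup(); // client went away before the response finished
  const cleanup = () => {
    res.removeListener('finish', onFinish);
    res.removeListener('close', onClose);
  };
  res.once('finish', onFinish);
  res.once('close', onClose);
  next();
};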
thanks!

Stream large JSON from REST API using NodeJS/ExpressJS

I have to return a large JSON, resulting from a query to MongoDB, from a REST API server built using ExpressJS. This JSON has to be converted into .csv so the client can directly save the resulting CSV file. I know that the best solution is to use NodeJS streams and pipe. Could anyone suggest a working example? Thanks.
Typically, parsing JSON in Node is fairly simple. In the past I would do something like the following.
const fs = require('fs');
const rawdata = fs.readFileSync('file.json');
const data = JSON.parse(rawdata);
Or even simpler with a require statement like this
const data = require('./file.json');
Both of these work great with small or even moderately sized files, but what if you need to parse a really large JSON file, one with millions of lines? Reading the entire file into memory is no longer a great option.
Because of this I needed a way to “stream” the JSON and process it as it went. There is a nice module named ‘stream-json’ that does exactly what I wanted.
With stream-json, we can use the NodeJS file stream to process our large data file in chunks.
const StreamArray = require('stream-json/streamers/StreamArray');
const fs = require('fs');

const jsonStream = StreamArray.withParser();

// internal Node readable stream option, pipe to stream-json to convert it for us
fs.createReadStream('file.json').pipe(jsonStream.input);

// You'll get JSON objects here
// key is the array index
jsonStream.on('data', ({ key, value }) => {
  console.log(key, value);
});

jsonStream.on('end', () => {
  console.log('All done');
});
Now our data can be processed without running out of memory. However, in the use case I was working on, I had an asynchronous process inside the stream. Because of this, I was still consuming huge amounts of memory, since this just queued up a very large number of unresolved promises that had to be kept in memory until they completed.
To solve this I had to also use a custom Writable stream like this.
const StreamArray = require('stream-json/streamers/StreamArray');
const { Writable } = require('stream');
const fs = require('fs');

const fileStream = fs.createReadStream('file.json');
const jsonStream = StreamArray.withParser();

const processingStream = new Writable({
  write({ key, value }, encoding, callback) {
    // some async operations
    setTimeout(() => {
      console.log(key, value);
      // Runs one at a time; the callback signals that this chunk is done
      callback();
    }, 1000);
  },
  // Don't skip this, as we need to operate with objects, not buffers
  objectMode: true
});

// Pipe the streams as follows
fileStream.pipe(jsonStream.input);
jsonStream.pipe(processingStream);

// So we're waiting for the 'finish' event when everything is done
processingStream.on('finish', () => console.log('All done'));
The Writable stream also allows each asynchronous process to complete and its promise to resolve before continuing on to the next, thus avoiding the memory backup.
This Stack Overflow question is where I got the examples for this post:
Parse large JSON file in Nodejs and handle each object independently
Also note another thing I learned in this process: if you want to start Node with more than the default amount of memory, you can use the following command.
node --max-old-space-size=4096 file.js
Node.js limits its heap size by default (the exact limit depends on the Node version and platform); the --max-old-space-size flag raises that limit and can be used to avoid the memory limits within Node. The command above would give Node 4 GB of heap to use.

Await promisified fs.writeFile vs fs.writeFileSync

Are there any advantages to one of these options?
1.
const fs = require('fs')
const testFunc1 = async () => {
  fs.writeFileSync('text.txt', 'hello world')
}
2.
const fs = require('fs')
const util = require('util')
const writeFilePromisified = util.promisify(fs.writeFile)
const testFunc2 = async () => {
  await writeFilePromisified('text.txt', 'hello world')
}
I am aware of the difference between writeFile and writeFileSync. The question is whether there are any differences between the promises returned by testFunc1 and testFunc2. So is it the same to call
testFunc1().then(...) // or await testFunc1()
or
testFunc2().then(...) // or await testFunc2()
Both of these promises will be fulfilled when the file write is done.
fs already contains a promisified API that doesn't need promisify:
const fsPromises = require("fs/promises");
await fsPromises.writeFile(file, data[, options])
The asynchronous promise-based version requires that you use it as part of a promise-based control flow, while the synchronous version doesn't impose this requirement.
Asynchronous readFile/writeFile is non-blocking, while synchronous readFileSync/writeFileSync is blocking but allows a job to complete faster. This may be noticeable during intensive IO operations.
To illustrate the difference between the two promise-returning functions:
const fs = require('fs')
const util = require('util')

const testFunc1 = async () => {
  fs.writeFileSync('text.txt', 'hello world')
  console.log('file write done with writeFileSync')
}

const writeFilePromisified = util.promisify(fs.writeFile)

const testFunc2 = async () => {
  await writeFilePromisified('text.txt', 'hello world')
  console.log('file write done with promisified writeFile')
}

console.log('start test1')
testFunc1().then(() => {
  console.log('promise 1 is fulfilled')
})
console.log('start test2')
testFunc2().then(() => {
  console.log('promise 2 is fulfilled')
})
console.log('stop')
Output would be:
start test1
file write done with writeFileSync
start test2
stop
promise 1 is fulfilled
file write done with promisified writeFile
promise 2 is fulfilled
So, as estus said, testFunc1 blocks the execution of the main thread; testFunc2 does not.
fs.readFile takes a callback function, which means it will not block the execution of your script.
fs.readFileSync, however, does not take a callback, which means that the execution of your script will be paused until the operation is finished.
Using promisify is one way to solve this issue. For small files it won't make a difference, but for large files you might want to replace fs.readFileSync with a promisified fs.readFile so you won't block execution.
Hope that helps.
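For example, a minimal sketch of that promisified approach (loadConfig and config.json are just illustrative names, not from any particular codebase):
const util = require('util');
const fs = require('fs');
const readFile = util.promisify(fs.readFile);

async function loadConfig() {
  // Does not block the event loop while the file is being read
  const data = await readFile('config.json', 'utf8');
  return JSON.parse(data);
}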
The difference between writeFile() and writeFileSync() is as explained by others that writeFileSync() "blocks". So, what is the difference between "blocking" and "not blocking"?
I wrote a small test where I compared writeFile() and writeFileSync() in terms of speed. I measured the time it took for writeFileSync() to return a result vs. the time it took from calling writeFile() until its callback-argument got called. To my (initial) surprise there was no noticeable difference between the two. writeFile() did not seem any faster than writeFileSync(). So why should I use it when it seems to complicate my program-flow?
But then I measured the time writeFile() took if I didn't wait for its callback-argument to get called. There was a big difference in speed, maybe 10x.
Having a big or no difference thus depends on the rest of your program. If you make a call to writeFileSync() that takes a long time, your program can do nothing else while it is waiting for writeFileSync() to return. If you do writeFile() and wait for the callback to complete before doing anything else, the outcome is no different. But writeFile() is much faster if you don't (need to) wait for its callback to get called.
This would have a big difference if you write a server of some kind that calls writeFileSync(). The server could not service any new requests while it is waiting for a writeFileSync() to complete. And if most requests caused a call to writeFileSync() that could have a big impact on the performance of your server.
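To make that concrete, here is a rough sketch of the kind of measurement described above (file names are made up and the absolute numbers will vary by machine and disk):
const fs = require('fs');
const data = 'x'.repeat(1024 * 1024); // 1 MB of dummy data

// Sequential, blocking writes
console.time('writeFileSync x10');
for (let i = 0; i < 10; i++) fs.writeFileSync(`sync-${i}.txt`, data);
console.timeEnd('writeFileSync x10');

// Fire off all async writes without waiting for each callback
console.time('writeFile x10');
let pending = 10;
for (let i = 0; i < 10; i++) {
  fs.writeFile(`async-${i}.txt`, data, err => {
    if (err) throw err;
    if (--pending === 0) console.timeEnd('writeFile x10');
  });
}
// The loop itself returns almost immediately; the writes overlap
// in libuv's thread pool and finish in the background.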
fs.writeFileSync(file, data, options)
fs.writeFile(file, data, options, callback)
The asynchronous function has a callback function as its last parameter, which indicates the completion of the asynchronous operation.
The ‘fs’ module implements file I/O operations. Methods in the fs module can be synchronous as well as asynchronous.
Developers prefer asynchronous methods over synchronous methods because they never block a program during its execution, whereas the latter do. Blocking the main thread is considered malpractice in Node.js, so synchronous functions should only be used for debugging or when no other options are available.
fs.writeFileSync() is a synchronous method and creates a new file if the specified file does not exist, while fs.writeFile() is an asynchronous method.

Async write file million times cause out of memory

Below is code:
var fs = require('fs')
for (let i = 0; i < 6551200; i++) {
  fs.appendFile('file', i, function (err) {
  })
}
When I run this code, after a few seconds, it shows:
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
and yet there is nothing in the file!
My questions are:
Why are there no bytes in the file?
What causes the out-of-memory error?
How can I write to a file asynchronously in a for loop, no matter how many writes there are?
Thanks in advance.
Bottom line here is that fs.appendFile() is an asynchronous call and you simply are not "awaiting" that call to complete on each loop iteration. This has a number of consequences, including but not limited to:
The callbacks keep getting allocated before they are resolved, which results in the "heap out of memory" eventually being reached.
You are contending for a file handle, since the function you are employing actually opens/writes/closes the given file, and if you don't wait for each call to complete, those operations are simply going to clash.
So the simple solution here is to "wait", and some modern syntax sugar makes that easy:
const fs = require('mz/fs');

const x = 6551200;

(async function() {
  try {
    const fd = await fs.open('file', 'w');
    for (let i = 0; i < x; i++) {
      await fs.write(fd, `${i}\n`);
    }
    await fs.close(fd);
  } catch (e) {
    console.error(e)
  } finally {
    process.exit();
  }
})()
That will of course take a while, but it's not going to "blow up" your system while it does its work.
The very first simplification is to get hold of the mz library, which already wraps common Node.js modules with modernized versions of each function that support promises. This will help clean up the syntax a lot compared to using callbacks.
The next thing to realize is what was mentioned about fs.appendFile(): it is "opening/writing/closing" all in one call. That's not great, so what you would typically do is open the file once, write the bytes in a loop, and when that is complete actually close the file handle.
That "sugar" comes in modern versions, and though "possible" with plain promise chaining, it's still not really that manageable. So if you don't actually have a Node.js environment that supports that async/await sugar, or the tools to "transpile" such code, then you might alternatively consider using the asyncjs library with plain callbacks:
const Async = require('async');
const fs = require('fs');

const x = 6551200;
let i = 0;

fs.open('file', 'w', (err, fd) => {
  if (err) throw err;
  Async.whilst(
    () => i < x,
    callback => fs.write(fd, `${i}\n`, err => {
      i++;
      callback(err)
    }),
    err => {
      if (err) throw err;
      fs.closeSync(fd);
      process.exit();
    }
  );
});
The same base principle applies: we are "waiting" for each callback to complete before continuing. The whilst() helper here allows iteration until the test condition is met, and of course does not start the next iteration until data is passed to the callback of the iterator itself.
There are other ways to approach this, but those are probably the two most sane for a "large loop" of iterations. Common approaches such as "chaining" via .reduce() are really more suited to a reasonably sized array of data you already have, and building arrays of this size has inherent problems of its own.
For instance, the following "works" ( on my machine at least ) but it really consumes a lot of resources to do it:
const fs = require('mz/fs');

const x = 6551200;

fs.open('file', 'w')
  .then(fd =>
    [...Array(x)].reduce(
      (p, e, i) => p.then(() => fs.write(fd, `${i}\n`)),
      Promise.resolve()
    )
    .then(() => fs.close(fd))
  )
  .catch(e => console.error(e))
  .then(() => process.exit());
So it's really not that practical to essentially build such a large chain in memory and then allow it to resolve. You could put some "governance" on this (a rough sketch follows below), but the two main approaches as shown are a lot more straightforward.
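For what it's worth, one possible form of that "governance", sketched here with the same mz wrapper (the batch size is arbitrary and just illustrative), is to build and await the writes in fixed-size batches so only a bounded amount of work is ever pending:
const fs = require('mz/fs');

const x = 6551200;
const batchSize = 10000; // arbitrary; tune to taste

(async function () {
  const fd = await fs.open('file', 'w');
  for (let start = 0; start < x; start += batchSize) {
    // Build one string per batch and issue a single write for it,
    // so we never hold millions of pending promises at once
    let chunk = '';
    for (let i = start; i < Math.min(start + batchSize, x); i++) {
      chunk += `${i}\n`;
    }
    await fs.write(fd, chunk);
  }
  await fs.close(fd);
})().catch(e => console.error(e));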
For that case, either you have the async/await sugar available, as you do within the current LTS version of Node (LTS 8.x), or I would stick with the other tried and true "async helpers" for callbacks where you are restricted to a version without that support.
You can of course "promisify" any function with the last few releases of Node.js right "out of the box", as it were, since Promise has been a global for some time:
const fs = require('fs');

await new Promise((resolve, reject) =>
  fs.open('file', 'w', (err, fd) => {
    if (err) return reject(err);
    resolve(fd);
  })
);
So there really is no need to import libraries just to do that, but the mz library given as example here does all of that for you. So it's really up to personal preferences on bringing in additional dependencies.
JavaScript is a single-threaded language, which means your code can execute only one function at a time. So when you execute an async function, its work gets "queued" to be executed later.
So in your code, you are queuing 6,551,200 appendFile calls, which of course crashes your app before any of them get a chance to finish.
You can achieve what you want by splitting your loop into smaller loops, using async and await, or using iterators.
If what you are trying to achieve is as simple as your code, you can use the following:
const fs = require("fs");
function SomeTask(i=0){
fs.appendFile('file',i,function(err){
//err in the write function
if(err) console.log("Error", err);
//check if you want to continue (loop)
if(i<6551200) return SomeTask(i);
//on finish
console.log("done");
});
}
SomeTask();
In the above code, you write a single line, and when that is done, you call the next one.
This function is just for basic usage and needs a refactor. For advanced usage with JavaScript iterators, check out "Iterators and generators" on the MDN web docs.
1 - The file is empty because none of the fs.appendFile calls ever finished; the Node.js process crashed first.
2 - The Node.js heap memory is limited, and it has to store each callback until it returns, not just the "i" variable.
3 - You could try using promises to do that:
"use strict";
const Bluebird = require('bluebird');
const fs = Bluebird.promisifyAll(require('fs'));
let promisses = [];
for (let i = 0; i < 6551200; i++){
promisses.push(fs.appendFileAsync('file', i + '\n'));
}
Bluebird.all(promisses)
.then(data => {
console.log(data, 'End.');
})
.catch(e => console.error(e));
But no logic can avoid a heap memory error for a loop this big. You could increase the Node.js heap memory or, more reasonably, process the data in chunks on an interval:
'use strict';

const fs = require('fs');

let total = 6551200;
let interval = setInterval(() => {
  fs.appendFile('file', total + '\n', () => {});
  total--;
  if (total < 1) {
    clearInterval(interval);
  }
}, 1);

Does the new way to use streams (streams2) in node create blocking?

I realize that Node is non-blocking; however, I also realize that because Node has only one thread, putting a three-second while loop in the middle of your event loop will cause blocking. I.e.:
var start = new Date();
console.log('Test 1');

function sleep(time, words) {
  while (new Date().getTime() < start.getTime() + time);
  console.log(words);
}

sleep(3000, 'Test 2'); // This will block
console.log('Test 3'); // Logs Test 1, Test 2, Test 3
Many of the examples I have seen dealing with the new "Streams2" interface look like they would cause this same blocking. For instance this one, borrowed from here:
var crypto = require('crypto');
var fs = require('fs');

var readStream = fs.createReadStream('myfile.txt');
var hash = crypto.createHash('sha1');

readStream
  .on('readable', function () {
    var chunk;
    while (null !== (chunk = readStream.read())) {
      hash.update(chunk); // DOESN'T this cause blocking?
    }
  })
  .on('end', function () {
    console.log(hash.digest('hex'));
  });
If I am following right, the readStream will emit the readable event when there is data in the buffer. So it seems that once the readable event is emitted, the entire event loop would be stopped until readStream.read() returns null. This seems less desirable than the old way (which would not block). Can somebody please tell me why I am wrong? Thanks.
You don't have to read until the internal stream buffer is empty. You could just read once if you wanted and then read another chunk some time later.
readStream.read() itself is not blocking, but hash.update(chunk) is (for a brief amount of time) because the hashing is done on the main thread (there is a github issue about adding an async interface that would execute crypto functions in the thread pool though).
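For example, here is a rough sketch of that "read a chunk, come back later" idea, using the same readStream and hash from your code (setImmediate is just one way to yield back to the event loop between chunks):
// Sketch: drain the buffer one chunk per turn of the event loop
// instead of in a tight while loop
readStream.on('readable', function processChunk() {
  var chunk = readStream.read();
  if (chunk === null) return;   // buffer drained; wait for the next 'readable'
  hash.update(chunk);
  setImmediate(processChunk);   // pick up the next chunk on a later tick
});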
Also, you can simplify your code by using the crypto stream interface:
var crypto = require('crypto'),
    fs = require('fs');

var readStream = fs.createReadStream('myfile.txt'),
    hasher = crypto.createHash('sha1');

readStream.pipe(hasher).on('readable', function() {
  // the hash stream automatically pushes the digest
  // to the readable side once the writable side is ended
  console.log(this.read());
}).setEncoding('hex');
All JS code is single-threaded, so a loop will block, but you are misunderstanding how long that loop will run for. Calling .read() takes a readable item from the stream, just like a 'data' handler would be called with the item. It will stop executing and unblock as soon as there are no items left. 'readable' is triggered whenever there is data, and then the handler empties the buffer and waits for another 'readable'. So where your first while loop relies on the time being updated, which could take an unbounded amount of time, the other loop is basically doing:
while (items.length > 0) items.pop()
which is pretty much the minimum amount of work you need to do to process items from a stream.