I'm trying to save a million data points to a JSON file in my Node.js application. The main idea is to save a grid of 1000x1000 pixels as an array with the x,y position and a color id, so each pixel has a coordinate and a color. My current code to generate an example is below.
I have a function that generates the data, and I save it using fs.writeFile().
const fs = require('fs');
//resetPos('test.json');
function resetPos(path) {
let data = [];
for (let y = 1; y <= 1000; y++){
data.push([]);
}
data.forEach(function(e, i){
for (let x = 1; x <= 1000; x++) {
e.push({
"x": x,
"y": i,
"color": "#cccccc"
});
}
});
fs.writeFile(path, JSON.stringify(data), function(err){
if(err) throw err;
});
console.log(data);
}
let global_data = fs.readFileSync("test.json");
console.log(global_data[0]);
When I read the file, it shows "91". I've tried using .toJSON() and .toString(), but neither gave me what I want. I'm looking to get an x,y coordinate as data[y][x].
Basically, there are two ways to read a .json file in Node.js.
Your first option is to use the require function, which you normally use to import .js files, but you can use it for .json files as well:
const packageJson = require('./package.json');
The advantage of this approach is that you get back the contents of the .json file as a JavaScript object or array, already parsed. The disadvantage of this approach is that you can not reload the file within the current process if something has changed, since require caches the contents of any loaded file, and you are always given back the cached value.
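The caching behaviour can be seen with a small sketch. The scratch directory and the settings.json file name are made up for this demonstration; the last step shows one way to force a reload by deleting the entry from require.cache.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical scratch file, created only for this demonstration.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'require-cache-'));
const file = path.join(dir, 'settings.json');

fs.writeFileSync(file, JSON.stringify({ version: 1 }));
const first = require(file); // parsed and cached

fs.writeFileSync(file, JSON.stringify({ version: 2 }));
const second = require(file); // served from the cache, the change is not seen

console.log(first.version, second.version, first === second); // 1 1 true

// Dropping the cache entry makes the next require() read the file again.
delete require.cache[require.resolve(file)];
const third = require(file);
console.log(third.version); // 2
```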
Your second option is exactly the opposite: It allows you to reload things, but it requires you to parse the file on your own. For that, you use the fs module's readFile function:
const fs = require('fs');
fs.readFile('./package.json', { encoding: 'utf8' }, (err, data) => {
const packageJson = JSON.parse(data);
// ...
});
If you want to use async and await, you can wrap readFile with the util.promisify function of Node.js. The code then reads like synchronous code but stays asynchronous:
const fs = require('fs'),
{ promisify } = require('util');
const readFile = promisify(fs.readFile);
(async () => {
const data = await readFile('./package.json', { encoding: 'utf8' });
const packageJson = JSON.parse(data);
// ...
})();
Apart from that, there is also the fs.readFileSync function, which works synchronously and blocks the event loop while the file is read, so you should stay away from it for performance reasons.
For writing a JSON file, your only option is to use the writeFile function of the fs module, which works basically the same as its readFile counterpart:
const fs = require('fs');
const packageJson = { /* ... */ };
const data = JSON.stringify(packageJson);
fs.writeFile('./package.json', data, { encoding: 'utf8' }, err => {
// ...
});
Again, you can use the util.promisify function as mentioned above to make things work with async and await, or you could use the synchronous fs.writeFileSync function (which, again, I would not recommend).
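Applied to the grid from the question, a round trip might look like the sketch below. The names makeGrid, saveGrid and loadGrid are made up for illustration, both coordinates are zero-based here (an assumption, the question counts x from 1), and fs.promises needs a reasonably recent Node version. It also shows where the "91" comes from: without an encoding you get a Buffer, and its first byte is the character code of "[".

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Build a size x size grid so that grid[y][x] is the pixel at (x, y).
function makeGrid(size, color) {
  const grid = [];
  for (let y = 0; y < size; y++) {
    const row = [];
    for (let x = 0; x < size; x++) {
      row.push({ x, y, color });
    }
    grid.push(row);
  }
  return grid;
}

async function saveGrid(file, grid) {
  await fs.promises.writeFile(file, JSON.stringify(grid), 'utf8');
}

async function loadGrid(file) {
  const text = await fs.promises.readFile(file, 'utf8');
  return JSON.parse(text);
}

// Usage with a small grid; 1000 works the same way, the file is just larger.
(async () => {
  const file = path.join(os.tmpdir(), 'grid-example.json');
  await saveGrid(file, makeGrid(4, '#cccccc'));

  const raw = await fs.promises.readFile(file); // no encoding: a Buffer
  console.log(raw[0]); // 91, the character code of "["

  const grid = await loadGrid(file);
  console.log(grid[2][3]); // { x: 3, y: 2, color: '#cccccc' }
})();
```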
You can generally read in a JSON file with the require method of Node.
Example:
const myJSON = require('./file.json');
myJSON will hold the JSON data from file.json as a normal JavaScript object.
You can write a JSON file yourself, using:
const fs = require('fs');
fs.writeFile('file.json', json, 'utf8', callback);
I am trying to read files with Node and then create an object with the information that I extract from those files. I use the fs and path libs. I defined an empty object outside the code that reads the files. Inside that code (where I use callbacks), the object gets its values modified, but outside it remains empty. Can someone help me?
Here is my code:
const path = require("path");
const fs = require("fs");
const dirPath = path.join(__dirname, "query");
let Query = {};
fs.readdir(dirPath, (error, files) => {
if (error) {
return console.log(error);
}
files.forEach((file) => {
const loaded = require(path.join(dirPath, file));
Object.entries(loaded).forEach(([key, value]) => {
Query[key] = value;
});
});
});
module.exports = {
Query,
};
When I run console.log(Query) just above module.exports, I get {}. However, running the same console.log at the end of the fs.readdir callback prints the correct object.
I am not sure if the problem is non-blocking I/O or if I defined the object the wrong way.
fs.readdir is an asynchronous function, which means the rest of the script keeps running without waiting for the readdir callback to finish.
Hence, the object Query will still print {}, because it hasn't changed yet.
You want to continue your program only after reading the directory.
One way to do this is to use readdirSync, the synchronous counterpart of readdir. The program then continues only once it is done.
Your code using readdirSync:
const path = require("path");
const fs = require("fs");
const dirPath = path.join(__dirname, "query");
let Query = {};
fs.readdirSync(dirPath).forEach((file) => {
const loaded = require(path.join(dirPath, file));
Object.entries(loaded).forEach(([key, value]) => {
Query[key] = value;
});
});
console.log(Query); // Should print the desired object
module.exports = {
Query,
};
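If blocking at startup is not acceptable, another option is to export a function that returns a promise and let the caller wait for it. This is only a sketch: loadQueries is a made-up name, and fs.promises assumes a reasonably recent Node version.

```javascript
const fs = require('fs');
const path = require('path');

// Read every file in dirPath with require() and merge the exported keys.
// Returns a promise, so callers decide when to wait for the result.
async function loadQueries(dirPath) {
  const files = await fs.promises.readdir(dirPath);
  const query = {};
  for (const file of files) {
    Object.assign(query, require(path.join(dirPath, file)));
  }
  return query;
}

module.exports = { loadQueries };
```

A caller would then write something like loadQueries(path.join(__dirname, "query")).then((Query) => { /* use Query here */ }).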
I am trying to create a node module that has some helper functions for searching through reference data that I have in a CSV file. I've used the csv-parser module for loading it into objects, and this API seems to be for use with an asynchronous stream reader / pipe situation. I don't want the helper functions in this module to be available to any other modules before this reference data has had a chance to load.
I've tried using a Promise, but in order to get it to work, I've had to expose that promise and the initialization function to the calling module(s), which is not ideal.
// refdata.js
const fs = require('fs');
const csv = require('csv-parser');
var theData = new Array();
function initRefdata() {
return new Promise(function(resolve, reject) {
fs.createReadStream('refdata.csv')
.pipe(csv())
.on('data', function(data) { theData.push(data); })
.on('end', resolve);
});
}
function helperFunction1(arg) { ... }
module.exports.initRefdata = initRefdata;
module.exports.helperFunction1 = helperFunction1;
// main.js
var refdata = require('./refdata.js');
function doWork() {
console.log(refdata.helperFunction1('Italy'));
}
refdata.initRefdata().then(doWork);
This works for this one use of the reference data module, but it is frustrating that I cannot use an initialization function completely internally to refdata.js. When I do, the asynchronous call to the stream pipe is not complete before I start using the helper functions, which need all the data before they can be useful. Also, I do not want to re-load all the CSV data each time it is needed.
With the comment from @Deepal I was able to come up with:
// refdata.js
const fs = require('fs');
const csv = require('csv-parser');
var theData = new Array();
function initRefdata() {
return new Promise(function(resolve, reject) {
fs.createReadStream('refdata.csv')
.pipe(csv())
.on('data', function(data) { theData.push(data); })
.on('end', resolve);
});
}
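function helperFunction1(arg) {
// Pass a function to then(); calling nestedHelper(arg) inline would run it
// before the data has loaded.
if (theData.length == 0) {
return initRefdata().then(function() { return nestedHelper(arg); });
}
return Promise.resolve(nestedHelper(arg));
function nestedHelper(arg) { ... }
}
module.exports.helperFunction1 = helperFunction1;
// main.js
var refdata = require('./refdata.js');
function doWork() {
// helperFunction1 returns a promise for the result in both branches
refdata.helperFunction1('Italy').then(console.log);
}
doWork();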
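One thing to watch with the length check: if a helper is called again while the first load is still running, the file is read a second time and the rows end up duplicated. A possible variation keeps the loading promise itself and shares it. The sketch below is self-contained, so it uses a naive comma-splitting reader as a stand-in for csv-parser (no quoting support), and createRefdata, findByCountry and the country column are made-up names.

```javascript
const fs = require('fs');
const readline = require('readline');

// Naive CSV reader used as a stand-in for csv-parser: the first line is the
// header, fields are split on commas, and quoting is NOT handled.
async function readCsv(file) {
  const lines = readline.createInterface({
    input: fs.createReadStream(file),
    crlfDelay: Infinity,
  });
  let header = null;
  const rows = [];
  for await (const line of lines) {
    if (line === '') continue;
    const cells = line.split(',');
    if (header === null) {
      header = cells;
    } else {
      const row = {};
      header.forEach((name, i) => { row[name] = cells[i]; });
      rows.push(row);
    }
  }
  return rows;
}

// Hypothetical module wrapper around one reference file.
function createRefdata(file) {
  let loading = null; // the one shared promise for the parsed rows

  function load() {
    if (loading === null) {
      loading = readCsv(file);
    }
    return loading;
  }

  // Every helper awaits load(), so the file is read at most once even when
  // several helpers are called before the first read has finished.
  async function findByCountry(country) {
    const rows = await load();
    return rows.filter((row) => row.country === country);
  }

  return { findByCountry };
}
```

Callers still get a promise back from each helper, but they no longer need to know about an initialization function.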
We long-term Python and PHP coders have a tidy bit of synchronous code (sample below). Most of the functions have asynchronous counterparts. We really want to 'get' the power of Javascript and Node, and believe this is an ideal case of where asynchronous node.js would speed things up and blow our socks off.
What is the textbook way to refactor the following to utilize asynchronous Node? Async/await and Promise.all? How? (Using Node 8.4.0. Backwards compatibility is not a concern.)
var fs = require('fs');
var crypto = require('crypto');
// This could list over 10,000 files of various size
const fileList = ['file1', 'file2', 'file3'];
const getCdate = file => fs.statSync(file).ctime; // Has async method
const getFSize = file => fs.statSync(file).size; // Has async method
// Can be async through file streams (see resources below)
const getMd5 = (file) => {
const fileData = fs.readFileSync(file);
const hash = crypto.createHash('md5');
hash.update(fileData);
return hash.digest('hex');
};
let filesObj = fileList.map(file => [file, {
datetime: getCdate(file),
filesize: getFSize(file),
md5hash: getMd5(file),
}]);
console.log(filesObj);
Notes:
We need to keep the functions modular and re-usable.
There are more functions getting things for filesObj than listed here
Most functions can be re-written to be async, some can not.
Ideally we need to keep the original order of fileList.
Ideally we want to use latest Node and JS features -- not rely on external modules.
Assorted file stream methods for getting md5 asynchronously:
Obtaining the hash of a file using the stream capabilities of crypto module (ie: without hash.update and hash.digest)
How to calculate md5 hash of a file using javascript
There are a variety of different ways you could handle this code asynchronously. You could use the node async library to handle all of the callbacks more elegantly. If you don't want to dive into promises then that's the "easy" option. I put easy in quotes because promises are actually easier if you understand them well enough. The async lib is helpful but it still leaves much to be desired in the way of error propagation, and there is a lot of boilerplate code you'll have to wrap all your calls in.
The better way is to use promises. Async/await is still pretty new. It's not even supported in Node 7 (not sure about Node 8) without a preprocessor like Babel or TypeScript. Also, async/await uses promises under the hood anyway.
Here is how I would do it using promises, even included a file stats cache for maximum performance:
const fs = require('fs');
const crypto = require('crypto');
const Promise = require('bluebird');
const fileList = ['file1', 'file2', 'file3'];
// Use Bluebird's Promise.promisifyAll utility to turn all of fs'
// async functions into promise returning versions of them.
// The new promise-enabled methods will have the same name but with
// a suffix of "Async". Ex: fs.stat will be fs.statAsync.
Promise.promisifyAll(fs);
// Create a cache to store the file if we're planning to get multiple
// stats from it.
let cache = {
fileName: null,
fileStats: null
};
const getFileStats = (fileName, prop) => {
if (cache.fileName === fileName) {
return cache.fileStats[prop];
}
// Return a promise that eventually resolves to the data we're after
// but also stores fileStats in our cache for future calls.
return fs.statAsync(fileName).then(fileStats => {
cache.fileName = fileName;
cache.fileStats = fileStats;
return fileStats[prop];
})
};
const getMd5Hash = file => {
// Return a promise that eventually resolves to the hash we're after.
return fs.readFileAsync(file).then(fileData => {
const hash = crypto.createHash('md5');
hash.update(fileData);
return hash.digest('hex');
});
};
// Create a promise that immediately resolves with our fileList array.
// Use Bluebird's Promise.map utility. Works very similar to Array.map
// except it expects all array entries to be promises that will
// eventually be resolved to the data we want.
let results = Promise.resolve(fileList).map(fileName => {
return Promise.all([
// This first gets a promise that starts resolving file stats
// asynchronously. When the promise resolves it will store file
// stats in a cache and then return the stats value we're after.
// Note that the final return is not a promise, but returning raw
// values from promise handlers implicitly does
// Promise.resolve(rawValue)
getFileStats(fileName, 'ctime'),
// This asks for a second property of the same file. If the stats
// are already in the cache it returns the raw value, otherwise it
// returns another promise. Either way Promise.all accepts it,
// because raw values are implicitly wrapped in
// Promise.resolve(rawValue) to fit the promise paradigm.
getFileStats(fileName, 'size'),
// First returns a promise that begins resolving the file data for
// our file. A promise handler in the function will then perform
// the operations we need to do on the file data in order to get
// the hash. The raw hash value is returned in the end and
// implicitly wrapped in Promise.resolve as well.
getMd5Hash(fileName)
])
// .spread is a bluebird shortcut that replaces .then. If the value
// being resolved is an array (which it is because Promise.all will
// resolve an array containing the results in the same order as we
// listed the calls in the input array) then .spread will spread the
// values in that array out and pass them in as individual function
// parameters.
.spread((dateTime, fileSize, md5Hash) => [fileName, { dateTime, fileSize, md5Hash }]);
}).catch(error => {
// Any errors returned by any of the Async functions in this promise
// chain will be propagated here.
console.log(error);
});
Here's the code again but without comments to make it easier to look at:
const fs = require('fs');
const crypto = require('crypto');
const Promise = require('bluebird');
const fileList = ['file1', 'file2', 'file3'];
Promise.promisifyAll(fs);
let cache = {
fileName: null,
fileStats: null
};
const getFileStats = (fileName, prop) => {
if (cache.fileName === fileName) {
return cache.fileStats[prop];
}
return fs.statAsync(fileName).then(fileStats => {
cache.fileName = fileName;
cache.fileStats = fileStats;
return fileStats[prop];
})
};
const getMd5Hash = file => {
return fs.readFileAsync(file).then(fileData => {
const hash = crypto.createHash('md5');
hash.update(fileData);
return hash.digest('hex');
});
};
let results = Promise.resolve(fileList).map(fileName => {
return Promise.all([
getFileStats(fileName, 'ctime'),
getFileStats(fileName, 'size'),
getMd5Hash(fileName)
]).spread((dateTime, fileSize, md5Hash) => [fileName, { dateTime, fileSize, md5Hash }]);
}).catch(console.log);
In the end, results will resolve to an array like the one below, which should match the results of your original code but should perform much better in a benchmark:
[
['file1', { dateTime: 'data here', fileSize: 'data here', md5Hash: 'data here' }],
['file2', { dateTime: 'data here', fileSize: 'data here', md5Hash: 'data here' }],
['file3', { dateTime: 'data here', fileSize: 'data here', md5Hash: 'data here' }]
]
Apologies in advance for any typos. Didn't have the time or ability to actually run any of this. I looked over it quite extensively though.
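One caveat about the single-entry cache: it is only filled once the stat call has resolved, so two getFileStats calls issued back to back for the same file may both reach the file system. A possible variation, sketched here with Node's own util.promisify instead of Bluebird, stores the pending promise per file name. The names statCache, getStats and getFileStat are made up for this sketch.

```javascript
const fs = require('fs');
const { promisify } = require('util');

const stat = promisify(fs.stat);

// Map from file name to the pending (or settled) promise for its fs.Stats.
const statCache = new Map();

function getStats(fileName) {
  if (!statCache.has(fileName)) {
    statCache.set(fileName, stat(fileName));
  }
  return statCache.get(fileName);
}

// Always returns a promise, whether or not the stats were requested before.
function getFileStat(fileName, prop) {
  return getStats(fileName).then((stats) => stats[prop]);
}

// Both calls share a single fs.stat() request. process.execPath (the node
// binary) is used only because it is a path that certainly exists.
Promise.all([
  getFileStat(process.execPath, 'ctime'),
  getFileStat(process.execPath, 'size'),
]).then(([ctime, size]) => {
  console.log(ctime, size);
});
```

Because the map is never cleared, this suits a one-off batch job better than a long-running process where files change.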
After discovering that async/await is in node as of 7.6 I decided to play with it a bit last night. It seems most useful for recursive async tasks that don't need to be done in parallel, or for nested async tasks that you might wish you could write synchronously. For what you needed here there isn't any mind-blowing way to use async/await that I can see but there are a few places where the code would read more cleanly. Here's the code again but with a few little async/await conveniences.
const fs = require('fs');
const crypto = require('crypto');
const Promise = require('bluebird');
const fileList = ['file1', 'file2', 'file3'];
Promise.promisifyAll(fs);
let cache = {
fileName: null,
fileStats: null
};
async function getFileStats (fileName, prop) {
if (cache.fileName === fileName) {
return cache.fileStats[prop];
}
let fileStats = await fs.statAsync(fileName);
cache.fileName = fileName;
cache.fileStats = fileStats;
return fileStats[prop];
};
async function getMd5Hash (file) {
let fileData = await fs.readFileAsync(file);
const hash = crypto.createHash('md5');
hash.update(fileData);
return hash.digest('hex');
};
let results = Promise.resolve(fileList).map(fileName => {
return Promise.all([
getFileStats(fileName, 'ctime'),
getFileStats(fileName, 'size'),
getMd5Hash(fileName)
]).spread((dateTime, fileSize, md5Hash) => [fileName, { dateTime, fileSize, md5Hash }]);
}).catch(console.log);
I would make getCdate, getFSize, and getMd5 all asynchronous and promise-returning, then wrap them in another promise-returning function, here called statFile.
function statFile(file) {
return Promise.all([
getCdate(file),
getFSize(file),
getMd5(file)
// Promise.all resolves with one array, so destructure it in the handler
]).then(([datetime, filesize, md5hash]) => ({datetime, filesize, md5hash}))
.catch(/*handle error*/);
}
}
Then you could change your mapping function to
const promises = fileList.map(statFile);
Then it's simple to use Promise.all:
Promise.all(promises)
.then(filesObj => { /* do something */ })
.catch(err => { /* handle error */ });
This leaves things modular, doesn't require async/await, allows you to plug extra functions into statFile, and preserves your file order.
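For reference, here is a dependency-free sketch of the same shape using only Node core modules. It relies on util.promisify, which is available in Node 8. The names describeFile and describeFiles are made up, and the md5 is computed by streaming the file, as the resources linked in the question suggest.

```javascript
const fs = require('fs');
const crypto = require('crypto');
const { promisify } = require('util');

const stat = promisify(fs.stat);

// Hash a file by streaming it, so large files are never fully in memory.
function getMd5(file) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('md5');
    fs.createReadStream(file)
      .on('error', reject)
      .on('data', (chunk) => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')));
  });
}

// One stat() call serves both the date and the size.
async function describeFile(file) {
  const [stats, md5hash] = await Promise.all([stat(file), getMd5(file)]);
  return [file, { datetime: stats.ctime, filesize: stats.size, md5hash }];
}

// Promise.all resolves in input order, regardless of completion order.
function describeFiles(fileList) {
  return Promise.all(fileList.map(describeFile));
}
```

Usage would be describeFiles(['file1', 'file2', 'file3']).then(console.log).catch(console.error). Note that this starts every file at once; with a list of 10,000 files you may want to limit how many are open at the same time.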
I need to build a function for processing large CSV files for use in a bluebird.map() call. Given the potential sizes of the file, I'd like to use streaming.
This function should accept a stream (a CSV file) and a function (that processes the chunks from the stream) and return a promise when the file is read to end (resolved) or errors (rejected).
So, I start with:
'use strict';
var _ = require('lodash');
var promise = require('bluebird');
var csv = require('csv');
var stream = require('stream');
var pgp = require('pg-promise')({promiseLib: promise});
api.parsers.processCsvStream = function(passedStream, processor) {
var parser = csv.parse(passedStream, {trim: true});
passedStream.pipe(parser);
// use readable or data event?
parser.on('readable', function() {
// call processor, which may be async
// how do I throttle the amount of promises generated
});
var db = pgp(api.config.mailroom.fileMakerDbConfig);
return new Promise(function(resolve, reject) {
parser.on('end', resolve);
parser.on('error', reject);
});
}
Now, I have two inter-related issues:
I need to throttle the actual amount of data being processed, so as to not create memory pressures.
The function passed as the processor param is going to often be async, such as saving the contents of the file to the db via a library that is promise-based (right now: pg-promise). As such, it will create a promise in memory and move on, repeatedly.
The pg-promise library has functions to manage this, like page(), but I'm not able to wrap my head around how to mix stream event handlers with these promise methods. Right now, I return a promise in the readable handler after each read(), which means I create a huge number of promised database operations and eventually fault out because I hit a process memory limit.
Does anyone have a working example of this that I can use as a jumping point?
UPDATE: Probably more than one way to skin the cat, but this works:
'use strict';
var _ = require('lodash');
var promise = require('bluebird');
var csv = require('csv');
var stream = require('stream');
var pgp = require('pg-promise')({promiseLib: promise});
api.parsers.processCsvStream = function(passedStream, processor) {
// some checks trimmed out for example
var db = pgp(api.config.mailroom.fileMakerDbConfig);
var parser = csv.parse(passedStream, {trim: true});
passedStream.pipe(parser);
var readDataFromStream = function(index, data, delay) {
var records = [];
var record;
do {
record = parser.read();
if(record != null)
records.push(record);
} while(record != null && (records.length < api.config.mailroom.fileParserConcurrency))
parser.pause();
if(records.length)
return records;
};
var processData = function(index, data, delay) {
console.log('processData(' + index + ') > data: ', data);
parser.resume();
};
parser.on('readable', function() {
db.task(function(tsk) {
this.page(readDataFromStream, processData);
});
});
return new Promise(function(resolve, reject) {
parser.on('end', resolve);
parser.on('error', reject);
});
}
Does anyone see a potential problem with this approach?
You might want to look at promise-streams
var ps = require('promise-streams');
passedStream
.pipe(csv.parse({trim: true}))
.pipe(ps.map({concurrent: 4}, row => processRowDataWhichMightBeAsyncAndReturnPromise(row)))
.wait().then(_ => {
console.log("All done!");
});
Works with backpressure and everything.
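For comparison, a similar effect can be sketched without an extra library, assuming a Node version where readable streams are async iterable. The function name processInBatches is made up, and the in-memory stream stands in for the CSV parser. The loop only asks for more records after the current batch has settled, so at most limit processor promises exist at a time.

```javascript
const { Readable } = require('stream');

// Consume `rows` (a readable stream or any async iterable of parsed records)
// and run `processor` on at most `limit` records at a time.
async function processInBatches(rows, limit, processor) {
  let batch = [];
  let count = 0;
  for await (const row of rows) {
    batch.push(row);
    if (batch.length === limit) {
      // The loop does not pull more records until this batch settles.
      await Promise.all(batch.map((r) => processor(r)));
      count += batch.length;
      batch = [];
    }
  }
  if (batch.length > 0) {
    await Promise.all(batch.map((r) => processor(r)));
    count += batch.length;
  }
  return count;
}

// Demo with an in-memory object stream standing in for the CSV parser.
const demoRows = Readable.from([['2'], ['3'], ['5'], ['7'], ['11']]);
const saveRow = (row) =>
  new Promise((resolve) => setTimeout(() => resolve(row[0]), 5));

processInBatches(demoRows, 2, saveRow).then((n) => console.log('rows:', n));
// rows: 5
```

Batching is simpler than a sliding window but slightly less efficient, because each batch waits for its slowest record.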
Find below a complete application that correctly executes the same kind of task as you want: It reads a file as a stream, parses it as a CSV and inserts each row into the database.
const fs = require('fs');
const promise = require('bluebird');
const csv = require('csv-parse');
const pgp = require('pg-promise')({promiseLib: promise});
const cn = "postgres://postgres:password@localhost:5432/test_db";
const rs = fs.createReadStream('primes.csv');
const db = pgp(cn);
function receiver(_, data) {
function source(index) {
if (index < data.length) {
// here we insert just the first column value that contains a prime number;
return this.none('insert into primes values($1)', data[index][0]);
}
}
return this.sequence(source);
}
db.task(t => {
return pgp.spex.stream.read.call(t, rs.pipe(csv()), receiver);
})
.then(data => {
console.log('DATA:', data);
})
.catch(error => {
console.log('ERROR:', error);
});
Note the only things I changed: I used the csv-parse library instead of csv, as a better alternative, and I added the stream.read method from the spex library, which properly serves a Readable stream for use with promises.
I found a slightly better way of doing the same thing, with more control. This is a minimal skeleton with precise parallelism control. With parallel set to 1, all records are processed in sequence without having the entire file in memory; we can increase parallel for faster processing.
const csv = require('csv');
const csvParser = require('csv-parser')
const fs = require('fs');
const readStream = fs.createReadStream('IN');
const writeStream = fs.createWriteStream('OUT');
const transform = csv.transform({ parallel: 1 }, (record, done) => {
asyncTask(...) // return Promise
.then(result => {
// ... do something when success
return done(null, record);
}, (err) => {
// ... do something when error
return done(null, record);
})
}
);
readStream
.pipe(csvParser())
.pipe(transform)
.pipe(csv.stringify())
.pipe(writeStream);
This allows doing an async task for each record.
To return a promise instead, we can create a promise up front and settle it when the stream finishes.
.on('end',function() {
//do something with csvData
console.log(csvData);
});
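The "promise that settles when the stream finishes" part can also be sketched with Node's own stream.pipeline wrapped in util.promisify. This is a stand-in rather than the csv package: a plain object-mode Transform handles one record at a time, much like parallel: 1, and the in-memory source and sink replace the file streams. The names asyncTransform and run are made up. Unlike the skeleton, this version rejects on the first failed record instead of passing the record through.

```javascript
const { Readable, Transform, Writable, pipeline } = require('stream');
const { promisify } = require('util');

const pipelineAsync = promisify(pipeline);

// Object-mode transform that waits for an async task per record.
function asyncTransform(task) {
  return new Transform({
    objectMode: true,
    transform(record, _encoding, done) {
      task(record).then(
        (result) => done(null, result),
        (err) => done(err)
      );
    },
  });
}

// Resolves once every record has been written, rejects on the first error.
function run(source, task, sink) {
  return pipelineAsync(source, asyncTransform(task), sink);
}

// Demo: in-memory source and sink stand in for the file streams.
const collected = [];
const sink = new Writable({
  objectMode: true,
  write(record, _encoding, done) {
    collected.push(record);
    done();
  },
});
const upper = async (record) => record.toUpperCase();

run(Readable.from(['a', 'b', 'c']), upper, sink).then(() => {
  console.log(collected); // [ 'A', 'B', 'C' ]
});
```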
So, to put it another way, you don't want streaming but some kind of data chunks? ;-)
Do you know https://github.com/substack/stream-handbook?
I think the simplest approach without changing your architecture would be some kind of promise pool. e.g. https://github.com/timdp/es6-promise-pool
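To give an idea of what a promise pool does, here is a minimal hand-rolled sketch. It is not the API of es6-promise-pool; runPool is a made-up name, and it assumes the records are already available as an array (for example one chunk read from the stream).

```javascript
// Run worker(item) for every item, keeping at most `limit` calls in flight.
// Results are stored by index, so they come back in input order.
async function runPool(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each lane picks up the next unclaimed item until none are left.
  async function lane() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const lanes = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    lanes.push(lane());
  }
  await Promise.all(lanes);
  return results;
}

// Demo: double four numbers, two at a time.
const double = (n) =>
  new Promise((resolve) => setTimeout(() => resolve(n * 2), 5));

runPool([1, 2, 3, 4], 2, double).then((out) => console.log(out));
// [ 2, 4, 6, 8 ]
```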
There are some untar libraries, but I cannot get them working.
My idea would be something like
untar(bufferStreamOrFilePath).extractToDirectory("/path", function(err){})
Is something like this available?
Just an update on this answer, instead of node-tar, consider using tar-fs which yields a significant performance boost, as well as a neater interface.
var fs = require('fs');
var tar = require('tar-fs');
var tarFile = 'my-other-tarball.tar';
var target = './my-other-directory';
// extracting a directory
fs.createReadStream(tarFile).pipe(tar.extract(target));
The tar-stream module is a pretty good one:
var tar = require('tar-stream')
var fs = require('fs')
var extract = tar.extract();
extract.on('entry', function(header, stream, callback) {
// make directories or files depending on the header here...
// call callback() when you're done with this entry
});
fs.createReadStream("something.tar").pipe(extract)
extract.on('finish', function() {
console.log('done!')
});
A function to extract a base64 encoded tar fully in memory, with the assumption that all the files in the tar are utf-8 encoded text files.
const tar=require('tar');
let Duplex = require('stream').Duplex;
function bufferToStream(buffer) {
let stream = new Duplex();
stream.push(buffer);
stream.push(null);
return stream;
}
module.exports=function(base64EncodedTar){
return new Promise(function(resolve, reject){
const buffer = Buffer.from(base64EncodedTar, "base64");
let files={};
try{
bufferToStream(buffer).pipe(new tar.Parse())
.on('entry', entry => {
let file={
path:entry.path,
content:""
};
files[entry.path]=file;
entry.on('data', function (tarFileData) {
file.content += tarFileData.toString('utf-8');
});
// resolve(entry);
}).on('close', function(){
resolve(files);
});
} catch(e){
reject(e);
}
})
};
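As a side note, the buffer-to-stream helper can also be written with a PassThrough stream, which avoids constructing a bare Duplex. This is only an alternative sketch of that one helper, not a change to the tar parsing itself.

```javascript
const { PassThrough } = require('stream');

// Wrap an in-memory Buffer in a readable stream.
function bufferToStream(buffer) {
  const stream = new PassThrough();
  stream.end(buffer); // write the buffer, then signal end of input
  return stream;
}

// Demo: read the bytes back out again.
const chunks = [];
bufferToStream(Buffer.from('hello tar'))
  .on('data', (chunk) => chunks.push(chunk))
  .on('end', () => console.log(Buffer.concat(chunks).toString('utf8')));
// hello tar
```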
Expanding on the tar-stream answer, that seemed to be the simplest to use in the browser with jest-tested code. Comparing the node libraries based on my trying to implement a project:
tar: Fairly easy to use on files, or on buffers as in Gudlaugur Egilsson's answer. It had some annoying webpack polyfill issues that I had trouble with when putting it into a React app.
js-untar: This was pretty annoying to set up for jest testing because it uses web workers and blob URLs, which jest does not directly support. I didn't proceed to getting it working in the browser, though it may work fine there. In order to get jest tests working, I had to use jsdom-worker-fix, and it was very slow in that environment. (It may be faster in-browser.)
tar-stream combined with gunzip-maybe seems to be fairly performant in browser, and doesn't have any issues with being used in jest tests. Worked fine on multi-hundred megabyte tarballs I tried.
This code extracts a tar or tar.gz stream:
var tar = require("tar-stream");
const gunzip = require("gunzip-maybe");
exports.unTar = (tarballStream) => {
const results = {};
return new Promise((resolve, reject) => {
var extract = tar.extract();
extract.on("entry", async function (header, stream, next) {
const chunks = [];
for await (let chunk of stream) {
chunks.push(chunk);
}
results[header.name] = Buffer.concat(chunks);
next();
});
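extract.on("finish", function () {
resolve(results);
});
// Reject if the extractor reports an error, instead of leaving the promise pending.
extract.on("error", reject);
tarballStream.pipe(gunzip()).pipe(extract);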
});
};
Then in browser, you can use readable-web-to-node-stream to process a tarball fetched in the browser.
const { ReadableWebToNodeStream } = require("readable-web-to-node-stream");
const response = await fetch(url, headers);
const extracted = await unTar(new ReadableWebToNodeStream(response.body));