Concatenating files using streams - node.js

My app generates several files in parallel that I finally have to concatenate before serving the result to the user.
Some of these can be big (>10M).
To achieve this, I have created 2 routines: one for handling the set of files to process, and one for appending the content of a file to a destination file. Everything works fine on small/medium-size files, but on large ones, the content is truncated...
function copyFileContent(ws,rs){
let errorDetected=false;
return new Promise((resolve,reject)=>{
if(rs && ws){
ws.on('error',(err)=>{
errorDetected=true;
return reject();
});
ws.on('close',()=>{
if(!errorDetected){
resolve(); // -> need to wait for large files...
} // Else reject has already occured...
});
rs.pipe(ws);
} else reject();
})
}
async function mergeFiles(dest, files){
for(var i=0;i<files.length;i++){
let w=fs.createWriteStream(dest,{flags: 'a', encoding: 'utf-8'});
let r=fs.createReadStream(files[i]);
await copyFileContent(w,r);
}
}
After some investigation, I double-checked that the close event on the writeStream (ws) was not the consequence of an error (which could explain the truncated content).
I finally added some delay by replacing the 'resolve()' statement with setTimeout(()=>{resolve()},3000);
And, as expected, allowing more time for the system (OS = Windows 10) to actually write to the file fixes the issue. But I don't understand why! What is happening under the hood? How can I avoid this behaviour? I need to make sure that when the 'close' event occurs, the file is indeed fully populated.
Can anyone help me find my bug or my misunderstanding of streams?
Thx
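For comparison, here is a minimal sketch of the same sequential append written with pipeline from stream/promises, which resolves only once the destination stream has finished. It is not a confirmed fix for the truncation described above: it assumes every source file is already completely written before mergeFiles is called, and if the parallel generators are still writing, any copy will see partial content.

```javascript
const fs = require('fs');
const { pipeline } = require('stream/promises');

// Append each source file to dest, strictly one after the other.
async function mergeFiles(dest, files) {
  for (const file of files) {
    // A fresh append-mode write stream per file, as in the question.
    // pipeline() rejects on an error from either stream and destroys both.
    await pipeline(
      fs.createReadStream(file),
      fs.createWriteStream(dest, { flags: 'a' })
    );
  }
}
```

Errors from either side surface as a rejected promise, so the errorDetected flag from the original routine is not needed in this form.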

Related

fs.writeFile corrupting json files

I have a program that writes a lot of files very quickly, and I noticed that sometimes there will be extra brackets or text in the json files.
Here is how the program works:
There is an array of emojis with some more information, and if that emoji doesn't already have a file for itself, it creates a new one. If there already is an existing file of that name, it will edit it.
Code to write file:
function writeToFile(fileName, file){
return new Promise(function(resolve, reject) {
fs.writeFile(fileName, JSON.stringify(file, null, 2), 'utf8', function(err) {
if (err) reject(err);
else resolve();
});
});
}
I have tried using fs and graceful-fs and both have had this issue every couple hundred files, with no visible patterns.
Examples of messed-up json:
...
],
"trade_times": []
}
]
}ade_times": []
}
]
}
That second "ade_times" shouldn't be there, and I have no idea why it is appearing.
other times it just looks like this:
{
...
}}
with extra closing brackets for no reason.
Not sure if this is an issue with my code, with fs, or something with my pc. If you need any more information I can provide that (more code, node.js version, etc).
Thank you for your time :)
I don't know why your code is buggy.
But, I have created an alternate writeToFile function for you:
const fs = require('fs');
// Async functions always return a promise, which keeps this more concise.
async function writeToFile(filename, file) {
let fj = JSON.stringify(file, undefined, 2)
// If fs.writeFileSync throws, the promise returned by this async function is rejected with that error
fs.writeFileSync(filename, fj, {
encoding: "utf8"
})
return // This line is the equivalent of resolve(); it is optional too.
}
Hope this answer works for you 😊
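If the stray text comes from two asynchronous writes to the same file overlapping (an assumption, the question does not confirm it), another option is to keep the writes asynchronous but queue them per file name. A minimal sketch follows; the pendingWrites map is a hypothetical helper, not part of fs:

```javascript
const fs = require('fs');

// Hypothetical helper: the last pending write for each file name.
const pendingWrites = new Map();

function writeToFile(fileName, file) {
  const json = JSON.stringify(file, null, 2);
  const previous = pendingWrites.get(fileName) || Promise.resolve();
  // Start this write only after the previous write to the same file has settled.
  const next = previous
    .catch(() => {}) // a failed earlier write should not block later ones
    .then(() => fs.promises.writeFile(fileName, json, 'utf8'));
  pendingWrites.set(fileName, next);
  return next;
}
```

Writes to different files still run concurrently; only writes to the same name are serialized, so the last call determines the final content.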

Is this terminal-log the consequence of the Node JS asynchronous nature?

I haven't found anything specific about this, it isn't really a problem but I would like to understand better what is going on here.
Basically, I'm testing some simple NodeJS code, like this:
//Summary : Open a file , write to the file, delete the file.
let fs = require('fs');
fs.open('mynewfile.txt' , 'w' , function(err,file){
if(err) throw err;
console.log('Created file!')
})
fs.appendFile('mynewfile.txt' , 'Depois de ter criado este ficheiro com o fs.open, acrescentei-lhe data com o fs.appendFile' , function(err){
if (err) throw err;
console.log('Added text to the file.')
})
fs.unlink('mynewfile.txt', function(err){
if (err) throw err;
console.log('File deleted!')
})
console.log(__dirname);
I thought this code would be executed in the order it was written, from top to bottom, but when I look at the terminal I'm not sure that was the case, because this is what I get:
$ node FileSystem.js
C:\Users\Simon\OneDrive\Desktop\Portfolio\Learning Projects\NodeJS_Tutorial
Created file!
File deleted!
Added text to the file.
//Expected order would be: Create file, add text to file , delete file , log dirname.
Despite what the terminal might make you think, when I look at my folder in the end, the code order still seems to have been followed somehow, because the file was deleted and nothing is left in the directory.
So I was wondering: why doesn't the terminal log in the same order that the code is written, from top to bottom?
Would this be the result of NodeJS asynchronous nature or is it something else ?
The code is (in principle) executed from top to bottom, as you say. But fs.open, fs.appendFile, and fs.unlink are asynchronous. That is, they are started in that particular order, but there is no guarantee whatsoever in which order they finish, and thus you can't guarantee in which order the callbacks are executed. If you run the code multiple times, there is a good chance that you will encounter different execution orders ...
If you need a specific order, you have two different options
You call the later operation only in the callback of the prior, ie something like below
fs.open('mynewfile.txt' , 'w' , function(err,file){
if(err) throw err;
console.log('Created file!')
fs.appendFile('mynewfile.txt' , '...' , function(err){
if (err) throw err;
console.log('Added text to the file.')
fs.unlink('mynewfile.txt', function(err){
if (err) throw err;
console.log('File deleted!')
})
})
})
You see, that code gets quite ugly and hard to read with all that increasing nesting ...
You switch to the promised based approach
let fs = require('fs').promises;
fs.open("myfile.txt", "w")
.then(file=> {
return fs.appendFile("myfile.txt", "...");
})
.then(res => {
return fs.unlink("myfile.txt");
})
.catch(e => {
console.log(e);
})
With the promise-version of the operations, you can also use async/await
async function doit() {
let file = await fs.open('myfile.txt', 'w');
await fs.appendFile('myfile.txt', '...');
await fs.unlink('myfile.txt');
}
For all three possibilities, you probably need to close the file before you can unlink it.
For more details please read about Promises, async/await and the Execution Stack in Javascript
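As a rough sketch of the async/await variant with the file closed before it is unlinked: the function name createAppendDelete and the appended text are made up for illustration, while handle.close() is the close method of the FileHandle that fs.promises.open returns.

```javascript
const fs = require('fs').promises;

// Create a file, append to it, then delete it, strictly in that order.
async function createAppendDelete(fileName) {
  const handle = await fs.open(fileName, 'w');
  console.log('Created file!');
  await handle.close(); // release the handle before working with the path again
  await fs.appendFile(fileName, 'some text');
  console.log('Added text to the file.');
  await fs.unlink(fileName);
  console.log('File deleted!');
}
```

Because every step is awaited, the three messages are logged in the order in which the operations are written.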
It's a combination of 2 things:
The asynchronous nature of Node.js, as you correctly assume
Being able to unlink an open file
What likely happened is this:
The file was opened and created at the same time (open with flag w)
The file was opened a second time for appending (fs.appendFile)
The file was unlinked
Data was appended to the file (while it was already unlinked) and the file was closed
When data was being appended, the file still existed on disk as an inode, but had zero hard links (references) to it. It still takes up space then, but the OS checks the reference count when closing and frees up the space if the count has fallen to zero.
People sometimes run into a similar situation with daemons such as HTTP servers that employ log rotation: if something goes wrong when switching over logs, the old log file may be unlinked but not closed, so it's never cleaned up and it takes space forever (until you reboot or restart the process).
Note that the ordering of operations that you're observing is random, and it is possible that they would be re-ordered. Don't rely on it.
You could write this as (untested):
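let fs = require('fs').promises; // promise-based API, so the calls can be awaited
const main = async () => {
const file = await fs.open('mynewfile.txt' , 'w');
await file.close(); // close the handle opened above
await fs.appendFile('mynewfile.txt' , 'content');
await fs.unlink('mynewfile.txt');
};
main()
.then(() => console.log('success'))
.catch(console.error);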
or within another async function:
const someOtherFn = async () => {
try{
await main();
} catch(e) {
// handle any rejection to your liking
}
}
(The catch block is not mandatory. You can opt to just let errors propagate to the top. It's just to showcase how async / await lets asynchronous code read as if it were synchronous code, without running into callback hell.)

Difficulty processing CSV file, browser timeout

I was asked to import a csv file from a server daily and parse the respective header to the appropriate fields in mongoose.
My first idea was to make it to run automatically with a scheduler using the cron module.
const CronJob = require('cron').CronJob;
const fs = require("fs");
const csv = require("fast-csv")
new CronJob('30 2 * * *', async function() {
await parseCSV();
this.stop();
}, function() {
this.start()
}, true);
Next, the parseCSV() function code is as follow:
(I have simplified some of the data)
function parseCSV() {
let buffer = [];
let stream = fs.createReadStream("data.csv");
csv.fromStream(stream, {headers:
[
"lot", "order", "cwotdt"
]
, trim:true})
.on("data", async (data) =>{
let data = { "order": data.order, "lot": data.lot, "date": data.cwotdt};
// Only add product that fulfill the following condition
if (data.cwotdt !== "000000"){
let product = {"order": data.order, "lot": data.lot}
// Check whether product exist in database or not
await db.Product.find(product, function(err, foundProduct){
if(foundProduct && foundProduct.length !== 0){
console.log("Product exists")
} else{
buffer.push(product);
console.log("Product not exists")
}
})
}
})
.on("end", function(){
db.Product.find({}, function(err, productAvailable){
// Check whether database exists or not
if(productAvailable.length !== 0){
// console.log("Database Exists");
// Add subsequent onward
db.Product.insertMany(buffer)
buffer = [];
} else{
// Add first time
db.Product.insertMany(buffer)
buffer = [];
}
})
});
}
This is not a problem when the csv file has only a few rows, but at just 2k rows I ran into trouble. The culprit is the check inside the data event handler: it has to query the database for every single row to see whether it already contains that data.
The reason I'm doing this is that the csv file will have new data added to it, and I need to add all the data the first time if the database is empty, or otherwise look at every single row and add only the new data to mongoose.
The 1st approach (as in the code) was to use async/await to make sure that all the data had been read before proceeding to the end event handler. This helps, but from time to time I see (with mongoose.set("debug", true);) some data being queried twice, and I have no idea why.
The 2nd approach was not to use async/await. This has a downside: since the data was not fully queried, execution proceeded straight to the end event handler, and insertMany only inserted the data that had already been pushed into the buffer.
If I stick with the current approach, it works, but the query takes 1 to 2 minutes, and even longer as the database keeps growing. During those few minutes of querying, the event queue is blocked, so requests sent to the server time out.
I used stream.pause() and stream.resume() before this code, but I couldn't get it to work, as it just jumps straight to the end event handler first. This causes the buffer to be empty every single time, since the end event handler runs before the data event handler.
I can't remember the links that I used, but the fundamentals came from this:
Import CSV Using Mongoose Schema
I saw these threads:
Insert a large csv file, 200'000 rows+, into MongoDB in NodeJS
Can't populate big chunk of data to mongodb using Node.js
They seem similar to what I need, but they are a bit too complicated for me to understand. It seems like they use sockets or a child process, maybe? Furthermore, I still need to check conditions before adding to the buffer.
Anyone care to guide me on this?
Edit: await is removed from console.log as it is not asynchronous
Forking a child process approach:
When the web service gets a request with a csv data file, save it somewhere in the app
Fork a child process -> child process example
Pass the file URL to the child process to run the insert checks
When the child process finishes processing the csv file, delete the file
Like what Joe said, indexing the DB would speed up the processing time by a lot when there are lots(millions) of tuples.
If you create an index on order and lot. The query should be very fast.
db.Product.createIndex( { order: 1, lot: 1 } )
Note: This is a compound index and may not be the ideal solution. Index strategies
Also, your await on console.log is weird. That may be causing your timing issues. console.log is not async. Additionally the function is not marked async
// removing await from console.log
let product = {"order": data.order, "lot": data.lot}
// Check whether product exist in database or not
await db.Product.find(product, function(err, foundProduct){
if(foundProduct && foundProduct.length !== 0){
console.log("Product exists")
} else{
buffer.push(product);
console.log("Product not exists")
}
})
I would try with removing the await on console.log (that may be a red herring if console.log is for stackoverflow and hiding the actual async method.) However, be sure to mark the function with async if that is the case.
Lastly, if the problem still exists. I may look into a 2 tiered approach.
Insert all lines from the CSV file into a mongo collection.
Process that mongo collection after the CSV has been parsed. Removing the CSV from the equation.
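A related idea, not taken from the answers above, is to replace the per-row lookup with a single in-memory comparison: load the existing (order, lot) pairs once, then filter the parsed rows against them. The sketch below uses plain arrays in place of the CSV rows and the query result, and the names productKey and selectNewProducts are hypothetical:

```javascript
// Build a lookup key from the two fields that identify a product.
function productKey(p) {
  return `${p.order}|${p.lot}`;
}

// Return the rows that pass the date filter and are not stored yet.
function selectNewProducts(rows, existingProducts) {
  const seen = new Set(existingProducts.map(productKey));
  const fresh = [];
  for (const row of rows) {
    if (row.cwotdt === '000000') continue; // same condition as in the question
    const key = productKey(row);
    if (seen.has(key)) continue; // already in the database
    seen.add(key); // also drops duplicates inside the CSV itself
    fresh.push({ order: row.order, lot: row.lot });
  }
  return fresh;
}
```

The returned array could then be handed to a single insertMany call in the end handler. This assumes the existing keys fit comfortably in memory; for very large collections, the indexed lookup or the two-tiered approach above may be the better fit.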

How to use the Node.js forEach function with an event listener

I am not sure where I am going wrong but I think that the event listener is getting invoked multiple times and parsing the files multiple times.
I have five files in the directory and they are getting parsed. However, the pdf file at array index 0 gets parsed once, the next one twice, and the third one three times.
I want each file in the directory to be parsed once, and a text file to be created from the data extracted from the pdf.
The idea is to parse the pdf, get the content as text, and convert the text into json in a specific format.
To keep it simple, the plan is to complete one task first, then use the output from the code below to perform the next task.
I hope someone can help, point out where I am going wrong, and explain my mistake a bit so that I understand it. (I'm new to JS and Node.)
Regards,
Jai
Using the module from here:
https://github.com/modesty/pdf2json
var fs = require('fs')
PDFParser = require('C:/Users/Administrator/node_modules/pdf2json/PDFParser')
var pdfParser = new PDFParser(this, 1)
fs.readdir('C:/Users/Administrator/Desktop/Project/Input/',function(err,pdffiles){
//console.log(pdffiles)
pdffiles.forEach(function(pdffile){
console.log(pdffile)
pdfParser.once("pdfParser_dataReady",function(){
fs.writeFile('C:/Users/Administrator/Desktop/Project/Jsonoutput/'+pdffile, pdfParser.getRawTextContent())
pdfParser.loadPDF('C:/Users/Administrator/Desktop/Project/Input/'+pdffile)
})
})
})
As mentioned in the comments, I'm just contributing workaround ideas so that the OP can temporarily resolve this issue.
Assuming performance is not an issue, you should be able to parse the pdf files asynchronously in a sequential manner. That is, only parse the next file when the previous one is done.
Unfortunately, I have never used the npm module PDFParser before, so I could not really try the code below. It may require some minor tweaks to work, but syntactically it should be fine, as it was written using an IDE.
Example:
var fs = require('fs');
var PDFParser = require('C:/Users/Administrator/node_modules/pdf2json/PDFParser');
// Parse the files one at a time; call done(err, message) when finished.
var parseFile = function(files, done) {
var pdfFile = files.pop();
if (pdfFile) {
// Use a fresh parser per file so listeners do not accumulate.
var pdfParser = new PDFParser();
pdfParser.on("pdfParser_dataError", errData => { return done(errData); });
pdfParser.on("pdfParser_dataReady", pdfData => {
fs.writeFile('C:/Users/Administrator/Desktop/Project/Jsonoutput/' + pdfFile, JSON.stringify(pdfData), writeErr => {
if (writeErr) { return done(writeErr); }
// Move on to the next file only after this one has been written.
parseFile(files, done);
});
});
pdfParser.loadPDF('C:/Users/Administrator/Desktop/Project/Input/' + pdfFile);
}
else {
return done(null, "All pdf files parsed.")
}
};
fs.readdir('C:/Users/Administrator/Desktop/Project/Input/',function(err,pdffiles){
if (err) { return console.error(err); }
parseFile(pdffiles, (err, message) => {
if (err) { console.error(err); }
else { console.log(message); }
})
});
In the code above, I have moved the parsing logic into a separate function called parseFile. It first does an array.pop operation to take the next file from the queue; if there is none left, it invokes the callback function done, otherwise it starts parsing that file.
When parsing is done, it recursively calls the parseFile function until the last file is parsed.
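The same one-at-a-time idea can also be sketched with promises and a for...of loop instead of recursion. To keep the example runnable on its own, FakeParser is a made-up stand-in that only mimics the two event names used above; it is not the pdf2json API, and parseOne and parseAll are hypothetical names.

```javascript
const { EventEmitter } = require('events');

// Hypothetical stand-in for a PDF parser; emits the event names used above.
class FakeParser extends EventEmitter {
  loadPDF(file) {
    setImmediate(() => this.emit('pdfParser_dataReady', { file }));
  }
}

// Wrap a single parse in a promise, with a fresh parser per file.
function parseOne(file, createParser) {
  return new Promise((resolve, reject) => {
    const parser = createParser();
    parser.once('pdfParser_dataError', reject);
    parser.once('pdfParser_dataReady', resolve);
    parser.loadPDF(file);
  });
}

// Parse the files strictly one after the other and collect the results.
async function parseAll(files, createParser) {
  const results = [];
  for (const file of files) {
    results.push(await parseOne(file, createParser));
  }
  return results;
}
```

Creating one parser per file means each listener is registered exactly once, which avoids the repeated parsing described in the question.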

Complex sequencing of promises - nested

After a lot of googling I have not been able to confirm the correct approach to this problem. The following code runs as expected but I have a grave feeling that I am not approaching this in the correct way, and I am setting myself up for problems.
The following code is initiated by the main app.js file and is passed a location to start loading XML files from and processing into a mongoDB
exports.processProfiles = function(path) {
var deferrer = q.defer();
q(dataService.deleteProfiles()) // simple mongodb call to empty the Profiles collection
.then(function(deleteResult) {
return loadFilenames(path); // method to load all filenames in the given path using fs
})
.then(function(filenames) {
// now we have all the file names lets load and save
filenames.forEach(function(filename) {
// Here is where i think the problem is!
// kick off another promise chain for the dynamically sized array of files to process
q(loadFileContent(path, filename)) // first we load the data in the file
.then(function(inboundFile) {
// then parse XML structure to my new shiny JSON structure
// and ask Mongo to store it for me
return dataService.createProfile(processProfileXML(filename, inboundFile));
})
.done(function(result) {
console.log(result);
})
});
})
.catch(function(err) {
deferrer.reject('Unable to Process Profile records : ' + err);
})
.done(function() {
deferrer.resolve('Profile Processing Completed');
});
return deferrer.promise;
}
Whilst this code works these are my main concerns but cannot solve them on my own after a few hours of Google and reading.
1) Is this blocking? From the console output it is difficult to tell whether this is running asynchronously as I want it to. I think it is, but advice on whether I am doing something fundamentally wrong would be great.
2) Is having a nested promise a bad idea? Should I be linking it to the outer promise? I have tried, but could not get anything to compile or run.
I haven't used Q in a really long time, but I think what you need to do is let it know you're about to hand back an array of promises that all need to be satisfied before moving on.
Additionally, as you're waiting for multiple promises in one section of code, rather than nesting further, hand the 'set' of promises back up so the chain continues once they're all satisfied.
q(dataService.deleteProfiles()) // simple mongodb call to empty the Profiles collection
.then(function (deleteResult) {
return loadFilenames(path); // method to load all filenames in the given path using fs
})
.then(function (filenames) {
// q.all waits until every per-file promise has been fulfilled
return q.all(
filenames.map(function (filename) {
return q(loadFileContent(path, filename)).then(function (inboundFile) { /* Do stuff with your file contents */ });
})
);
})
.then(function (resultsOfLoadFileContentsPromises) {
console.log('I did stuff with all the things');
})
.catch(function(err) {});
What you have is not 'blocking'. But really what you're doing with promises is moving things into a new 'block'ing section. The more blocks you have, the more async-ish your code will appear. If nothing else is running apart from this promise, it will still appear procedural.
But inner promises must still resolve before the parent promises resolve thereafter.
Inner promises like the ones you have aren't inherently bad. Personally, I would break them out into separate files to make them easier to reason about, but I wouldn't call that 'bad' unless there's no need for the inner promise to exist. Where possible (as in your example here), I've adjusted the code so that the next set of promises is handed back up, and a new section deals with the data once it has arrived.
(I'm not great with Q though, this code will probably require a little further tweaking).
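For reference, the same shape can be sketched with native promises and async/await instead of Q. The three helper functions below are stubs standing in for the question's loadFilenames, loadFileContent and dataService.createProfile, and their return values are invented purely so that the example runs on its own.

```javascript
// Hypothetical stand-ins for the question's helpers.
const loadFilenames = async (dir) => ['a.xml', 'b.xml'];
const loadFileContent = async (dir, name) => `<profile>${name}</profile>`;
const createProfile = async (profile) => ({ saved: profile.name });

// Load and store every file; the returned promise settles only after all of them.
async function processProfiles(dir) {
  const filenames = await loadFilenames(dir);
  return Promise.all(
    filenames.map(async (name) => {
      const content = await loadFileContent(dir, name);
      return createProfile({ name, content });
    })
  );
}
```

Because the inner promises are returned to Promise.all, the caller sees completion only once every file has been processed, and a failure in any file rejects the promise returned by processProfiles.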

Resources