I'm trying to read large files. I'm currently following the Node.js documentation on how to read large files, but when I read a somewhat large file (~1.1 MB, ~20k lines), my Electron app freezes for about 6 minutes before it finishes loading all the lines.
Here's my current code:
var fileContents = document.getElementById("fileContents")
//first clear out the existing text
fileContents.innerHTML = ""
if (fs.existsSync(pathToFile)) {
    const fileLine = readline.createInterface({
        input: fs.createReadStream(pathToFile)
    })
    fileLine.on('line', (line) => {
        fileContents.innerHTML += line + "\n"
    })
} else {
    fileContents.innerHTML += fileNotFound + "\n"
    console.log('Could not find file!!')
}
And the tag I'm targeting is an <xmp> tag.
What are some ways that people have displayed large files?
Streams can often be useful for high performance as they allow you to process one line at a time without loading the whole file into memory.
In this case however, you are loading each line and then concatenating onto your existing string (fileContents.innerHTML) with +=. All that concatenating is likely to be slower than just loading the whole contents of the file as one string. Worse still, you are outputting HTML every time you read in a line. So with 20k lines you are asking the rendering engine to render HTML 20,000 times!
Instead, try reading in the file as one string, and outputting the HTML just once.
fs.readFile(pathToFile, 'utf8', (err, data) => { // 'utf8' so data arrives as a string rather than a Buffer
    if (err) throw err;
    fileContents.innerHTML = data;
});
The problem with fs.readFile() is that it won't work at all for truly large files, for instance 600 MB: for those you still need a stream.
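For a file that is big enough to need a stream but is still going to be displayed in the page, one option is to combine the two ideas above: stream the lines, but buffer them in memory and touch the DOM only once. A rough, untested sketch, reusing the fs, readline, pathToFile and fileContents names from the question (textContent is used so the file is not parsed as HTML):
// Stream the file so memory use stays bounded, but collect the lines in an
// array and update the DOM a single time when the stream closes.
const lines = [];
const rl = readline.createInterface({
    input: fs.createReadStream(pathToFile)
});
rl.on('line', (line) => {
    lines.push(line); // no DOM work per line
});
rl.on('close', () => {
    // one render instead of 20,000
    fileContents.textContent = lines.join("\n");
});
For truly huge files you would still want to render only a window of lines at a time (virtual scrolling) rather than the whole file at once.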
I'm writing a genomics app called AminoSee using Node and Electron. When I started trying to ingest files bigger than 2 GB, I had to switch to a streaming architecture, as my program was trying to load the entire file into memory. Since I only scan the file, that is clearly ludicrous. Here is the core of my processor, from the CLI app at:
https://github.com/tomachinz/AminoSee/blob/master/aminosee-cli.js
// es is the event-stream package: var es = require('event-stream')
try {
    var readStream = fs.createReadStream(filename)
        .pipe(es.split())
        .pipe(es.mapSync(function(line) {
            readStream.pause(); // curious to test performance of removing
            streamLineNr++;
            processLine(line); // process line here and call readStream.resume() when ready
            readStream.resume();
        })
        .on('error', function(err) {
            error('While reading file: ' + filename, err.reason);
            error(err)
        })
        .on('end', function() {
            log("Stream ending");
        })
        .on('close', function() {
            log("Stream closed");
            setImmediate(() => { // after a 2 GB file give the CPU 1 cycle breather!
                calcUpdate();
                saveDocuments();
            });
        }));
} catch (e) {
    error("ERROR: " + e)
}
I used setImmediate a lot, as my program would get quite far ahead of itself before I learnt about callbacks and promises! It was a great time to learn about race conditions, that's for sure. It still has a million bugs and would make a good learning project.
Related
I am attempting to download a zip file, extract the contents, and push them into a database. Unfortunately, my stream never seems to complete, so I never get the opportunity to clean up and end the process.
I have stripped the code down to the minimum to reproduce the error.
let debugmode = false;
fs.createReadStream(zPath)
    .pipe(unzip.Parse())
    .pipe(Stream.Transform({
        objectMode: true,
        transform: async function(entry, e, done) {
            console.log('Item: ' + debugmode++ + ' of 819080');
            let buff = await entry.buffer();
            await entry.autodrain().promise()
            done();
        }
    }))
    .on('finish', () => {
        console.log('DONE');
    });
The log shows the last couple of items, but never prints the word DONE.
Item: 819075
Item: 819076
Item: 819077
Item: 819078
Item: 819079
Item: 819080
Is there something I have done incorrectly? Is there something I can do to monitor for the end of file and kill the stream?
Extra Info
In the actual code, there is also a transform that reports progress based on bytes processed. There are a few bytes processed after this item.
I am using unzipper to do the extraction.
The zip file is a publicly accessible SEC submissions.zip. I have no problem with companies.zip. (I'm trying to find their linkable page)
I download the zip in full before processing.
Out of frustration, I have implemented a Dead Man's Switch.
let deadman = null;
await new Promise((resolve) => {
    fs.createReadStream(zPath)
        .pipe(unzip.Parse())
        .pipe(Stream.Transform({
            objectMode: true,
            transform: async function(entry, e, done) {
                clearTimeout(deadman);
                deadman = setTimeout(resolve, 60000);
                // still do all the other stuff (buffer / autodrain the entry as above)
                done();
            }
        }))
        .on('finish', () => {
            clearTimeout(deadman);
            console.log('DONE');
            resolve();
        });
});
Now, every time it processes an entry, it has 60 seconds to complete processing. If it fails to complete processing in 60 seconds, it is assumed to have died and the promise is resolved. The timer is restarted every time an item is processed (the stream demonstrates it is still alive).
While I consider this a workaround rather than a solution, the script is intended to run as a single process, so it can be terminated after the run (to clean up the memory).
I wrote a simple utility to convert a somewhat weird json file (multiple objects not in an array) to csv for some system testing purposes. The read and transformation themselves are fine, and the resulting string is logged to the console correctly, but sometimes the resulting csv file is missing the first data line (it shows header, 1 blank line, then rest of data). I'm using read and write streams, without any provisions for backpressure. I don't think the problem is backpressure, since only the 1st line gets skipped, but I could be wrong. Any ideas?
const fs = require('fs');
const readline = require('readline');
const JSONbig = require('json-bigint');

// Create read interface to stream each line
const readInterface = readline.createInterface({
    input: fs.createReadStream('./confirm.json'),
    // output: process.stdout,
    console: false
});

const writeHeader = fs.createWriteStream('./confirm.csv');
const header = "ACTION_TYPE,PROCESS_PICK,TYPE_FLAG,APP_ID,FACILITY_ID,CONTAINER_ID,USER_ID,CONFIRM_DATE_TS,PICK_QTY,REMAINING_QTY,PICK_STATUS,ASSIGNMENT_ID,LOCATION_ID,ITEM_ID,CLUSTER_ID,TOTAL_QTY,TOTAL_ITEMS,WAVE_NBR,QA_FLAG,WORK_DIRECTIVE_ID\n";
writeHeader.write(header);

// Create write interface to save each csv line
const writeDetail = fs.createWriteStream('./confirm.csv', {
    flags: 'a'
});

readInterface.on('line', function(line) {
    let task = JSONbig.parse(line);
    task.businessData.MESSAGE.RECORD[0].DETAIL.REG_DETAIL.forEach(element => {
        let csv = "I,PTB,0,VCO,PR9999999011,,cpicker1,2020121000000," + element.QUANTITYTOPICK.toString() + ",0,COMPLETED," +
            task.businessData.MESSAGE.RECORD[0].ASSIGNMENTNUMBER.toString() + "," + element.LOCATIONNUMBER.toString() + "," +
            element.ITEMNUMBER.toString() + ",,,," +
            task.businessData.MESSAGE.RECORD[0].WAVE.toString() + ",N," + element.CARTONNUMBER.toString() + "\n";
        console.log(csv);
        try {
            writeDetail.write(csv);
        } catch (err) {
            console.error(err);
        }
    });
});
Edit: Based on the feedback below, I consolidated the write streams into one (the missing line was still happening, but it's better coding anyway). I also added a try block around the JSON parse. Ran the code several times over different files, and no missing line. Maybe the write was happening before the parse was done? In any case, it seems my problem is resolved for the moment. I'll have to research how to properly handle backpressure later. Thanks for the help.
The code you show here opens two separate write streams on the same file and then writes to both of them without any timing coordination between them. That will clearly conflict.
You open one here:
const writeHeader = fs.createWriteStream('./confirm.csv');
const header = "ACTION_TYPE,PROCESS_PICK,TYPE_FLAG,APP_ID,FACILITY_ID,CONTAINER_ID,USER_ID,CONFIRM_DATE_TS,PICK_QTY,REMAINING_QTY,PICK_STATUS,ASSIGNMENT_ID,LOCATION_ID,ITEM_ID,CLUSTER_ID,TOTAL_QTY,TOTAL_ITEMS,WAVE_NBR,QA_FLAG,WORK_DIRECTIVE_ID\n";
writeHeader.write(header);
And, you open one here:
// Create write interface to save each csv line
const writeDetail = fs.createWriteStream('./confirm.csv', {
    flags: 'a'
});
And then you write to the second one in your loop. Those clearly conflict. The write from the first is probably not complete when you open the second, and it may not have been flushed to disk yet either. The second stream opens for append, but doesn't pick up an accurate file position for appending, because the first write hasn't yet completed.
This code doesn't show any reason for using separate write streams at all, so the cleanest fix is to use one write stream, which will serialize the writes correctly. Otherwise, you have to wait for the first write stream to finish and close before opening the second one.
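A minimal sketch of that single-stream version, using the same file and header as above; buildCsvRows() is a hypothetical helper standing in for the JSONbig.parse and string-building already shown in the question:
const out = fs.createWriteStream('./confirm.csv');
out.write(header); // the header goes through the same stream...

readInterface.on('line', function(line) {
    // ...as every detail row, so the writes are serialized in order.
    // buildCsvRows(line) is assumed to return the CSV text for one input line.
    out.write(buildCsvRows(line));
});

readInterface.on('close', function() {
    out.end(); // close the file once the reader is done
});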
Also, your .forEach() loop needs backpressure support, since you're repeatedly calling .write() and, at some data size, you will hit backpressure. I agree that backpressure is not likely the cause of the issue you are asking about, but it is something else you need to fix when rapidly writing in a loop.
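One way to respect backpressure, sketched here as an untested helper: check the return value of .write() and wait for the 'drain' event when it returns false, then write rows from a for...of loop with await instead of .forEach():
// Resolves immediately if the stream can accept more data,
// otherwise waits for the 'drain' event before resolving.
function writeRow(stream, row) {
    return new Promise(function(resolve) {
        if (stream.write(row)) {
            resolve();
        } else {
            stream.once('drain', resolve);
        }
    });
}

// Usage, instead of calling .write() directly in a loop:
// for (const row of rows) {
//     await writeRow(out, row);
// }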
I am playing with Node.js and I have created a simple script that uploads files from a directory to a server:
var request = require('request');
var file = require('file');
var fs = require('fs');
var path = require('path');

VERSION = '0.1'

CONFIG_FILE = path.join(__dirname, 'etc', 'sender.conf.json');
var config = JSON.parse(
    fs.readFileSync(CONFIG_FILE).toString()
);

var DATA_DIR = __dirname
config['data_dir'].forEach(function(dir) {
    DATA_DIR = path.join(DATA_DIR, dir)
});
console.log('sending data from root directory: ' + DATA_DIR);

file.walk(
    DATA_DIR,
    function(err, dir_path, dirs, files) {
        if(err) {
            return console.error(err);
        }
        sendFiles(dir_path, files);
    }
);

function sendFiles(dir_path, files)
{
    files
        .filter(function(file) {
            return file.substr(-5) === '.meta';
        })
        .forEach(function(file) {
            var name = path.basename(file.slice(0, -5));
            sendFile(dir_path, name);
        })
    ;
}

function sendFile(dir_path, name)
{
    console.log("reading file start: " + dir_path + "/" + name);
    fs.readFile(
        path.join(dir_path, name + '.meta'),
        function(err, raw_meta) {
            if(err) {
                return console.error(err);
            }
            console.log("reading file done: " + dir_path + "/" + name);
            sendData(
                name,
                JSON.parse(raw_meta),
                fs.createReadStream(path.join(dir_path, name + '.data'))
            );
        }
    );
    console.log("reading file async: " + dir_path + "/" + name);
}

function sendData(name, meta, data_stream)
{
    meta['source'] = config['data_source'];
    var req = request.post(
        config['sink_url'],
        function(err, res, body) {
            if(err) {
                console.log(err);
            }
            else {
                console.log(name);
                console.log(meta);
                console.log(body);
            }
        }
    );
    var form = req.form();
    form.append(
        'meta',
        JSON.stringify(meta),
        {
            contentType: 'application/x-www-form-urlencoded'
        }
    );
    form.append(
        'data',
        data_stream
    );
}
It works fine when run with only a few files. But when I run it on a directory with lots of files, it chokes. This is because it keeps creating a huge number of tasks for reading files, but never gets around to actually doing the reading (because there are too many files). This can be observed in the output:
sending data from root directory: .../data
reading file start: .../data/ac/ad/acigisu-adruire-sabeveab-ozaniaru-fugeef-wemathu-lubesoraf-lojoepe
reading file async: .../data/ac/ad/acigisu-adruire-sabeveab-ozaniaru-fugeef-wemathu-lubesoraf-lojoepe
reading file start: .../data/ac/ab/acodug-abueba-alizacod-ugvut-nucom
reading file async: .../data/ac/ab/acodug-abueba-alizacod-ugvut-nucom
reading file start: .../data/ac/as/acigisu-asetufvub-liwi-ru-mitdawej-vekof
reading file async: .../data/ac/as/acigisu-asetufvub-liwi-ru-mitdawej-vekof
reading file start: .../data/ac/av/ace-avhad-bop-rujan-pehwopa
reading file async: .../data/ac/av/ace-avhad-bop-rujan-pehwopa
...
For each file, the console output "reading file start" is produced immediately before the call to fs.readFile, and "reading file async" is produced immediately after the async read has been scheduled. But there is no "reading file done" message even when I let it run for a long time, which means that the read of any file has probably never even been executed (those files are on the order of hundreds of bytes, so once scheduled, those reads would probably finish in a single go).
This leads me to the following thought process. Async calls in Node.js are done because the event loop itself is single-threaded and we do not want to block it. However, once this requirement is satisfied, does it make any sense to nest further async calls into async calls that are themselves nested in async calls, etc.? Would it serve any particular purpose? Moreover, wouldn't it actually pessimize the code, due to scheduling overhead that is not really needed and could be completely avoided if the complete handling of a single file consisted of synchronous calls only?
Given the thought process above, my course of action would be to use the solution from this question:
asynchronously push the names of all files to an async.queue
limit the number of parallel tasks by setting queue.concurrency
provide a file-upload handler that is completely synchronous, i.e. it synchronously reads the contents of the file and, after that is finished, synchronously sends the POST request to the server (a rough sketch follows below)
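A rough sketch of what that might look like with the async package (https://caolan.github.io/async/), assuming the upload work is reworked into a worker function that calls a completion callback; uploadOneFile here is illustrative, not from the original script:
var async = require('async');

// at most 4 uploads are in flight at any time
var uploadQueue = async.queue(function(task, done) {
    // hypothetical reworked handler: read the .meta/.data pair and POST it,
    // then call done() so the queue starts the next task
    uploadOneFile(task.dir_path, task.name, done);
}, 4);

// in sendFiles(), push work onto the queue instead of calling sendFile() directly:
// uploadQueue.push({ dir_path: dir_path, name: name });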
This is my very first attempt to use Node.js and/or JavaScript, so it is quite possible I am completely wrong (note that, e.g., the sync-request package makes it very clear that synchronous calls are not desirable, which contradicts my thought process above - the question is why). Any comments on the validity of the above thought process, as well as on the viability of the proposed solution and possible alternatives to it, would be very much appreciated.
== Update ==
There is a very good article explaining all this in great detail directly in the documentation of Node.js.
As for the particular problem at hand, it is indeed in the choice of the file-system-walker module. The solution is to use e.g. walk instead of file:
@@ -4,7 +4,7 @@
 var request = require('request');
-var file = require('file');
+var walk = require('walk');
 var fs = require('fs');
 var path = require('path');
@@ -24,13 +24,19 @@ config['data_dir'].forEach(function(dir) {
 console.log('sending data from root directory: ' + DATA_DIR);
-file.walk(
-    DATA_DIR,
-    function(err, dir_path, dirs, files) {
-        if(err) {
-            return console.error(err);
-        }
-        sendFiles(dir_path, files);
+var walker = walk.walk(DATA_DIR)
+walker.on(
+    'files',
+    function(dir_path, files, next) {
+        sendFiles(dir_path, files.map(function(stats) { return stats.name; }));
+        next();
+    }
+);
+walker.on(
+    'errors',
+    function(dir_path, node_stats, next) {
+        console.error('file walker:', node_stats);
+        next();
     }
 );
== Original Post ==
After a bit more study, I will attempt to answer my own question. This answer is still only a partial solution (a more complete answer from someone who has actual experience with Node.js would be very much appreciated).
The short answer to the main question above is that it is indeed not only desirable, but also almost always necessary, to schedule more asynchronous functions from already asynchronous functions. The long explanation follows.
This is because of how Node.js scheduling works: "Everything runs on a different thread except our code." There are two very important comments in the discussion below the linked blog post:
"Javascript always finishes the currently executing function first. An event will never interrupt a function." [Twitchard]
"Also note it won't just finish the current function, it will run to completion of all synchronous functions and I believe anything queued with process.nextTick... before the request callback is handled." [Tim Oxley]
There is also a note mentioning this in the documentation of process.nextTick: "The next tick queue is completely drained on each pass of the event loop before additional I/O is processed. As a result, recursively setting nextTick callbacks will block any I/O from happening, just like a while(true); loop."
So, to summarize, all the code of the script itself runs on a single thread and a single thread only. The asynchronous callbacks scheduled to run are executed on that very same single thread, and only after the whole current next-tick queue has been drained. Asynchronous callbacks provide the only points at which some other function can be scheduled to run. If the file-upload handler did not schedule any additional asynchronous tasks, as described in the question, its execution would block everything else until the whole file-upload handler had finished. That is not desirable.
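A tiny demonstration (not from the original post) of that last quote: while process.nextTick keeps re-queueing itself, the fs.readFile callback below never gets a chance to run.
const fs = require('fs');

fs.readFile(__filename, function() {
    console.log('read finished'); // never printed while the nextTick loop runs
});

(function spin() {
    process.nextTick(spin); // the next-tick queue is drained before any I/O,
                            // so the I/O callback above is starved forever
})();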
This also explains why the actual reading of the input file never occurs ("recursively setting nextTick callbacks will block any I/O from happening" - see above). It would eventually occur after the tasks for the whole traversed directory hierarchy had been scheduled. However, without further study, I am not able to answer how to limit the number of file-upload tasks scheduled (effectively the size of the task queue) and block the scheduling loop until some of those tasks have been processed (some room on the task queue has been freed). Hence this answer is still incomplete.
I'm trying to use the ssh2-sftp library to read/write a file in Node. When I do an sftp.get on a larger CSV file (but not too large -- like 2 MB only) on an sftp site and then read data from the returned stream, the call hangs after the 14th stream.on("data") call. I've tested this with a few different sample files, and the code works fine on smaller files. But if a CSV file is big enough to get past that 14th call, it just hangs, and it's like it can't read any more even though there's more there to read. And the stream.on("close") handler never gets called either.
Obviously this is pretty weird behavior. Was hoping maybe somebody has run into something similar using this library and had some guidance.
If it helps at all, here is some code
sftp.get(currentFileName).then((readStream) => {
    var counter = 0;
    readStream.on("data", function(d) {
        counter++;
        console.log("counter = " + counter);
    });
    readStream.on("close", function(err) {
        if (err) {
            console.error("Problem with read stream for file " + currentFileName + ", error = ", err);
        }
        //done reading the individual file, all went well
        else {
            console.log("done reading the stream");
        }
    });
    readStream.on('error', function(e) {
        console.error("Error retrieving file: " + e);
    })
    readStream.resume();
});
And after the 14th call into readStream.on("data"), it just freezes up, with maybe half the file read.
The issue turned out to be that ssh2-sftp seems to be using an outdated version of the underlying ssh2 library. Switching from ssh2-sftp to the most recent (0.5.2) version of ssh2 and using that library directly fixed the issue (which might have been this one: https://github.com/mscdex/ssh2/issues/450).
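For anyone hitting the same thing, a rough sketch of reading the same file with ssh2 directly (0.5.2-era API, where the module exports the Client constructor; host and credentials are placeholders):
var Client = require('ssh2'); // newer versions export { Client } instead
var conn = new Client();

conn.on('ready', function() {
    conn.sftp(function(err, sftp) {
        if (err) throw err;
        var readStream = sftp.createReadStream(currentFileName);
        readStream.on('data', function(d) { /* same counting/handling as above */ });
        readStream.on('close', function() {
            console.log('done reading the stream');
            conn.end();
        });
        readStream.on('error', function(e) {
            console.error('Error retrieving file: ' + e);
        });
    });
}).connect({ host: 'sftp.example.com', username: 'user', password: 'secret' });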
I have a file with a lot of entries (10+ million), each representing a partial document that is being saved to a mongo database (based on some criteria, non-trivial).
To avoid overloading the database (which is doing other operations at the same time), I wish to read in chunks of X lines, wait for them to finish, read the next X lines, etc.
Is there any way to use any of the fs callback mechanisms to "halt" progress at a certain point, without blocking the entire program? From what I can tell, they will all run from start to finish with no way of stopping them, unless you stop reading the file entirely.
The issue is that, because of the file size, memory also becomes a problem, and because of the time the updates take, a LOT of the data will be held in memory, exceeding the 1 GB limit and causing the program to crash. Secondly, as I said, I don't want to queue 1 million updates and completely stress the mongo database.
Any and all suggestions welcome.
UPDATE: Final solution using line-reader (available via npm) below, in pseudo-code.
var lineReader = require('line-reader');
var filename = <wherever you get it from>;
lineReader(filename, function(line, last, cb) {
    //
    // Do work here, line contains the line data
    // last is true if it's the last line in the file
    //
    function checkProcessed(callback) {
        if (doneProcessing()) { // Implement doneProcessing to check whether whatever you are doing is done
            callback();
        }
        else {
            setTimeout(function() { checkProcessed(callback) }, 100); // Adjust timeout according to expected time to process one line
        }
    }
    checkProcessed(cb);
});
This is implemented to make sure doneProcessing() returns true before attempting to work on more lines - this means you can effectively throttle whatever you are doing.
I don't use MongoDB and I'm not an expert in using Lazy, but I think something like below might work or give you some ideas. (note that I have not tested this code)
var fs = require('fs'),
    lazy = require('lazy');

var readStream = fs.createReadStream('yourfile.txt');

var file = lazy(readStream)
    .lines                          // ask to read the stream line by line
    .take(100)                      // and read 100 lines at a time.
    .join(function(onehundredlines) {
        readStream.pause();         // pause reading the stream
        writeToMongoDB(onehundredlines, function(err) {
            // error checking goes here
            // resume the stream 1 second after MongoDB finishes saving.
            setTimeout(readStream.resume, 1000);
        });
    });