Why would node.js not close a file after writing with writeFileSync?

(OS=Windows) I was having issues with power failures while processing very large files. If the power went out, my log file would be all zeros (NUL NUL NUL...). So, I wrote a couple of lines to save the progress every 50K lines, so that in the event of a power failure I can restart from the last successful line.
toggle = toggle ^ 1
fs.writeFileSync(`progress${toggle}.txt`, progress)
Toggle alternates between 1 and 0. This way, if the power happens to go out while progress1.txt is being saved, I will still have progress0.txt. And vice-versa. I tested this and it works correctly. It toggles between the two files and saves progress. Then, after the power went out, I checked and BOTH progress files were all NULs. Not just one script, but all the scripts had their progress files overwritten with NULs.
According to my research, this happens when a computer is suddenly shut off while writing to a file. But how is it possible that both files could be open at the same time? In order for progress1 to be open, writeFileSync would have to save and close progress0 and then process another 50K lines, since everything is being done sync style. There should be no way for both to be open at the same time.
Thank you.
EDIT: I was asked to show how writeFileSync is being called. Here is the basic program structure:
var module = {
    finish_cb: null,
    toggle: 0,
    progress: 0,
    start_batch: function () {
        this.toggle = this.toggle ^ 1
        fs.writeFileSync(`progress${this.toggle}.txt`, String(this.progress))
    },
    process_batch: function () {
        while (STILL PROCESSING) {
            // DO BATCH THINGS
            // THIS WILL TAKE 10-20 seconds to complete
        }
        this.progress++
        this.finish_cb()
    }
}
module.finish_cb = finish
module.start_batch()
module.process_batch()

function finish () {
    if (ALL DONE) {
        console.log('All done!')
    } else {
        module.start_batch()
        module.process_batch()
    }
}
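For what it's worth, writeFileSync returning does not guarantee the bytes have reached the physical disk; the OS may still be holding them in its write cache when the power dies. A minimal sketch of a more durable checkpoint (write to a temp file, fsync it, then rename it over the real file; the names here are only illustrative):
const fs = require('fs')

// Sketch only: write the progress to a temp file, force it to disk,
// then rename it over the real file. Names are examples, not from the post.
function saveProgressDurably (file, progress) {
    const tmp = file + '.tmp'
    const fd = fs.openSync(tmp, 'w')
    fs.writeSync(fd, String(progress))
    fs.fsyncSync(fd)         // ask the OS to flush its cache to the physical disk
    fs.closeSync(fd)
    fs.renameSync(tmp, file) // replaces the old file in one step
}

saveProgressDurably('progress0.txt', 12345)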

Related

NodeJS unzipper stream never finishes

I am attempting to download a zip file, extract the contents, and push them into a database. Unfortunately, my stream never seems to complete, so I never get the opportunity to do cleanup and end the process.
I have stripped the code down to the minimum to reproduce the error.
const fs = require('fs');
const Stream = require('stream');
const unzip = require('unzipper');

let debugmode = false;
fs.createReadStream(zPath)
    .pipe(unzip.Parse())
    .pipe(Stream.Transform({
        objectMode: true,
        transform: async function (entry, e, done) {
            console.log('Item: ' + debugmode++ + ' of 819080');
            let buff = await entry.buffer();
            await entry.autodrain().promise()
            done();
        }
    }))
    .on('finish', () => {
        console.log('DONE');
    });
The log shows the last couple of items, but never prints the word DONE.
Item: 819075
Item: 819076
Item: 819077
Item: 819078
Item: 819079
Item: 819080
Is there something I have done incorrectly? Is there something I can do to monitor for the end of file and kill the stream?
Extra Info
In the actual code, there is also a transform that reports progress based on bytes processed. There are a few bytes processed after this item.
I am using unzipper to do the extract
The zip file is a publicly accessible SEC submissions.zip. I have no problem with companies.zip. (I'm trying to find their linkable page)
I download the zip in full before processing.
Out of frustration, I have implemented a Dead Man's Switch.
let deadman = null;
await new Promise((resolve) => {
    fs.createReadStream(zPath)
        .pipe(unzip.Parse())
        .pipe(Stream.Transform({
            objectMode: true,
            transform: async function (entry, e, done) {
                clearTimeout(deadman);
                deadman = setTimeout(resolve, 60000);
                // still do all the other stuff
                done();
            }
        }))
        .on('finish', () => {
            clearTimeout(deadman);
            console.log('DONE');
            resolve();
        });
});
Now, every entry has 60 seconds to finish processing. If it fails to complete in 60 seconds, the stream is assumed to have died and the promise is resolved. The timer is restarted every time an item is processed (each processed item is proof that the stream is still alive).
While I do not consider this a solution, just a workaround, the script is intended to run as a single process, so it can be terminated after the run (to clean up the memory).
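Node's built-in stream.pipeline utility calls a single callback when the whole chain either finishes or errors, which can surface an error that plain .pipe() silently swallows and that would keep 'finish' from ever firing. A rough sketch using the same unzipper setup as above (not tested against the SEC zip):
const fs = require('fs');
const { pipeline, Transform } = require('stream');
const unzip = require('unzipper');

pipeline(
    fs.createReadStream(zPath),
    unzip.Parse(),
    new Transform({
        objectMode: true,
        transform (entry, _enc, done) {
            // same per-entry work as in the question
            entry.buffer()
                .then(() => done())
                .catch(done);
        }
    }),
    (err) => {
        // called exactly once, with the error that ended the chain or null on success
        if (err) console.error('Pipeline failed:', err);
        else console.log('DONE');
    }
);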

Node.js halts when console window is scrolled

If you run the following script in Node.js under Windows (at least 8)
const init = +new Date;
setInterval(() => {
    console.log(+new Date - init);
}, 1000);
and drag the thumb of the console window's scroll bar, the output of the script looks similar to
1001
2003 // long drag here
12368 // its result
13370
14372
It looks like Node.js's event loop halts during the scroll. The same thing should happen to asynchronous actions inside the http package. Thus, leaving a visible terminal window is dangerous to a running server.
How do I change the code to avoid such behavior?
Node.js itself is not halted while you scroll or select text; only the functions that send data to stdout are blocked.
In your server, you can send the log data to a file instead, and that way your server will not halt.
For example, see this code:
const init = +new Date;
var str = ''
setInterval(() => {
    var x = (+new Date - init).toString();
    str += x + '\n'
}, 1000);
setTimeout(function () {
    console.log(str)
}, 5000)
I selected text during the first 5 seconds, and this was the result:
C:\me>node a
1002
2002
3002
4003
You can see that there is no 'pause'.
As you can see, the setInterval callback wasn't halted, because there is no console.log inside it.
Now, when you use an output file for logging, you can view the live log using tail -f. This will show you each new line in the output file.
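A minimal sketch of that file-logging idea (server.log is just an example name):
const fs = require('fs');

const init = +new Date;
setInterval(() => {
    // fs.appendFile is asynchronous, so a paused or slow terminal never blocks it
    fs.appendFile('server.log', (+new Date - init) + '\n', (err) => {
        if (err) throw err;
    });
}, 1000);
You can then follow the file live with tail -f server.log.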
Your console is actually pausing when you scroll or click in the console window, because it enters select mode. Have a look at the title bar while it's paused; it will likely say Select.
To prevent this behavior, edit the properties of the command prompt and untick "QuickEdit Mode".
There are two pieces of information in the Node documentation that may give some clues about the reason for that behaviour:
an excerpt from Console:
Warning: The global console object's methods are neither consistently synchronous like the browser APIs they resemble, nor are they consistently asynchronous like all other Node.js streams. See the note on process I/O for more information.
an excerpt from A note on process I/O:
Warning: Synchronous writes block the event loop until the write has completed. This can be near instantaneous in the case of output to a file, but under high system load, pipes that are not being read at the receiving end, or with slow terminals or file systems, it's possible for the event loop to be blocked often enough and long enough to have severe negative performance impacts.
And it seems that a partial solution can be built using the method you already proposed:
const fs = require('fs');
const init = +new Date;
setInterval(() => {
    fs.write(1, String(+new Date - init) + '\n', null, 'utf8', () => {});
}, 1000);
It still blocks the UI if you start a selection, but it doesn't stop processing:
2296
3300 // long pause here when selection was started
4313 // all those lines printed at the same time after selection was aborted
5315
6316
7326
8331
9336
10346
11356
12366
13372
If you'd like to make your console.log and console.error always asynchronous on all platforms, you can do this by using fs.write to fd 1 (stdout) or fd 2 (stderr).
const fs = require('fs')
const util = require('util')

// used by console.log
process.stdout.write = function write (str) {
    fs.write(1, str, () => {})
}

// used by console.error
process.stderr.write = function write (str) {
    fs.write(2, str, () => {})
}

NodeJS + Electron - Optimizing Displaying Large Files

I'm trying to read large files. Currently, I'm following the Node.js documentation on how to read large files, but when I read a somewhat large file (~1.1 MB, ~20k lines), my Electron app freezes up for about 6 minutes and then finally finishes loading all the lines.
Here's my current code
const fs = require('fs')
const readline = require('readline')

var fileContents = document.getElementById("fileContents")
// first clear out the existing text
fileContents.innerHTML = ""

if (fs.existsSync(pathToFile)) {
    const fileLine = readline.createInterface({
        input: fs.createReadStream(pathToFile)
    })
    fileLine.on('line', (line) => {
        fileContents.innerHTML += line + "\n"
    })
} else {
    fileContents.innerHTML += fileNotFound + "\n"
    console.log('Could not find file!!')
}
And the tag I'm targeting is a <xmp> tag.
What are some ways that people have displayed large files?
Streams can often be useful for high performance as they allow you to process one line at a time without loading the whole file into memory.
In this case, however, you are loading each line and then concatenating it onto your existing string (fileContents.innerHTML) with +=. All that concatenation is likely to be slower than just loading the whole contents of the file as one string. Worse still, you are outputting HTML every time you read in a line, so with 20k lines you are asking the rendering engine to render the HTML 20,000 times!
Instead, try reading in the file as one string, and outputting the HTML just once.
fs.readFile(pathToFile, (err, data) => {
    if (err) throw err;
    fileContents.innerHTML = data;
});
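If the file is too large to read comfortably in one go but you still want a single DOM update, a possible middle ground (a sketch, reusing pathToFile and fileContents from the question) is to keep the line-by-line read, buffer the lines in a plain string, and assign innerHTML only once when the reader closes:
const fs = require('fs')
const readline = require('readline')

const fileLine = readline.createInterface({
    input: fs.createReadStream(pathToFile)
})

let buffered = ''
fileLine.on('line', (line) => {
    buffered += line + '\n' // cheap string work, no rendering yet
})
fileLine.on('close', () => {
    fileContents.innerHTML = buffered // render once, after the last line
})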
The problem with fs.readFile() is that you just won't be able to open really large files, for instance 600 MB; you need to use a stream anyway for very big files.
I'm writing a genomics app called AminoSee using Node and Electron. When I started trying to ingest files bigger than 2 GB, I had to switch to a streaming architecture, as my program was trying to load the entire file into memory. Since I only scan the file, that is clearly ludicrous. Here is the core of my processor, from the CLI app at:
sourced: https://github.com/tomachinz/AminoSee/blob/master/aminosee-cli.js
try {
    var readStream = fs.createReadStream(filename)
        .pipe(es.split())
        .pipe(es.mapSync(function (line) {
            readStream.pause(); // curious to test performance of removing
            streamLineNr++;
            processLine(line); // process line here and call readStream.resume() when ready
            readStream.resume();
        })
        .on('error', function (err) {
            error('While reading file: ' + filename, err.reason);
            error(err)
        })
        .on('end', function () {
            log("Stream ending");
        })
        .on('close', function () {
            log("Stream closed");
            setImmediate(() => { // after a 2 GB file give the CPU 1 cycle breather!
                calcUpdate();
                saveDocuments();
            });
        }));
} catch (e) {
    error("ERROR:" + e)
}
I used setImmediate a lot, as my program would get quite far ahead of itself before I learnt about callbacks and promises. It was a great time to learn about race conditions, that's for sure. It still has a million bugs and would make a good learning project.

How to forcibly keep a Node.js process from terminating?

TL;DR
What is the best way to forcibly keep a Node.js process running, i.e., keep its event loop from running empty and hence keep the process from terminating? The best solution I could come up with was this:
const SOME_HUGE_INTERVAL = 1 << 30;
setInterval(() => {}, SOME_HUGE_INTERVAL);
This will keep an interval running without causing too much disturbance, provided the interval period is long enough.
Is there a better way to do it?
Long version of the question
I have a Node.js script using Edge.js to register a callback function so that it can be called from inside a DLL in .NET. This function will be called once per second, sending a simple sequence number that should be printed to the console.
The Edge.js part is fine; everything is working. My only problem is that my Node.js process executes its script and then runs out of events to process. With its event loop empty, it just terminates, ignoring the fact that it should have kept running in order to receive callbacks from the DLL.
My Node.js script:
var edge = require('edge');

var foo = edge.func({
    assemblyFile: 'cs.dll',
    typeName: 'cs.MyClass',
    methodName: 'Foo'
});

// The callback function that will be called from C# code:
function callback(sequence) {
    console.info('Sequence:', sequence);
}

// Register for a callback:
foo({ callback: callback }, true);

// My hack to keep the process alive:
setInterval(function () {}, 60000);
My C# code (the DLL):
public class MyClass
{
    Func<object, Task<object>> Callback;

    void Bar()
    {
        int sequence = 1;
        while (true)
        {
            Callback(sequence++);
            Thread.Sleep(1000);
        }
    }

    public async Task<object> Foo(dynamic input)
    {
        // Receives the callback function that will be used:
        Callback = (Func<object, Task<object>>)input.callback;

        // Starts a new thread that will call back periodically:
        (new Thread(Bar)).Start();

        return new object { };
    }
}
The only solution I could come up with was to register a timer with a long interval that calls an empty function, just to keep the scheduler busy and keep the event loop from running empty, so that the process keeps running forever.
Is there any way to do this better than I did? I.e., keep the process running without having to use this kind of "hack"?
The simplest, least intrusive solution
I honestly think my approach is the least intrusive one:
setInterval(() => {}, 1 << 30);
This will set a harmless interval that fires approximately once every 12 days (1 << 30 is 1,073,741,824 ms, roughly 12.4 days), effectively doing nothing, but keeping the process running.
Originally, my solution used Number.POSITIVE_INFINITY as the period, so the timer would never actually fire, but this behavior was changed by the API and now it doesn't accept anything greater than 2147483647 (i.e., 2 ** 31 - 1); see the Node.js timers documentation.
Comments on other solutions
For reference, here are the other two answers given so far:
Joe's (deleted since then, but perfectly valid):
require('net').createServer().listen();
This will create a "bogus listener", as he called it. A minor downside is that we'd allocate a port just for that.
Jacob's:
process.stdin.resume();
Or the equivalent:
process.stdin.on("data", () => {});
This puts stdin into "old" mode, a deprecated feature that is still present in Node.js for compatibility with scripts written prior to Node.js v0.10.
I'd advise against it. Not only is it deprecated, it also unnecessarily messes with stdin.
Use "old" Streams mode to listen for a standard input that will never come:
// Start reading from stdin so we don't exit.
process.stdin.resume();
Here is an IIFE based on the accepted answer:
(function keepProcessRunning() {
    setTimeout(keepProcessRunning, 1 << 30);
})();
and here is a conditional exit:
let flag = true;
(function keepProcessRunning() {
    setTimeout(() => flag && keepProcessRunning(), 1000);
})();
You could use setTimeout(function() {}, 2147483647); to keep your script alive without much overhead (as noted above, setTimeout does not accept a delay greater than 2147483647).
Spin up a nice REPL; Node does the same thing itself when it isn't given a script to run anyway:
import("repl").then(repl =>
    repl.start({ prompt: "\x1b[31m" + process.versions.node + ": \x1b[0m" }));
I'll throw another hack into the mix. Here's how to do it with Promise:
new Promise(_ => null);
Throw that at the bottom of your .js file and it should run forever.

Reading file in segments of X number of lines

I have a file with a lot of entries (10+ million), each representing a partial document that needs to be saved to a MongoDB database (based on some non-trivial criteria).
To avoid overloading the database (which is doing other operations at the same time), I wish to read in chunks of X lines, wait for them to finish, read the next X lines, etc.
Is there any way to use any of the fs callback mechanisms to also "halt" progress at a certain point, without blocking the entire program? From what I can tell, they will all run from start to finish with no way of stopping them, unless you stop reading the file entirely.
The issue is that, because of the file size, memory also becomes a problem: because of the time the updates take, a LOT of the data would be held in memory, exceeding the 1 GB limit and causing the program to crash. Secondly, as I said, I don't want to queue a million updates and completely stress the MongoDB database.
Any and all suggestions welcome.
UPDATE: Final solution using line-reader (available via npm) below, in pseudo-code.
var lineReader = require('line-reader');
var filename = <wherever you get it from>;

lineReader(filename, function (line, last, cb) {
    //
    // Do work here, line contains the line data
    // last is true if it's the last line in the file
    //
    function checkProcessed(callback) {
        if (doneProcessing()) { // Implement doneProcessing to check whether whatever you are doing is done
            callback();
        } else {
            setTimeout(function () { checkProcessed(callback) }, 100); // Adjust timeout according to expected time to process one line
        }
    }
    checkProcessed(cb);
});
This is implemented to make sure doneProcessing() returns true before attempting to work on more lines - this means you can effectively throttle whatever you are doing.
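doneProcessing() is left unimplemented in the pseudo-code above. One possible implementation (hypothetical names; saveDocument stands in for whatever actually performs the Mongo write) is to count in-flight writes:
let inFlight = 0;

function queueSave(doc) {
    inFlight++;
    saveDocument(doc, function () {
        inFlight--; // one write has finished
    });
}

function doneProcessing() {
    // "done" here simply means no writes are currently outstanding
    return inFlight === 0;
}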
I don't use MongoDB and I'm not an expert in using Lazy, but I think something like below might work or give you some ideas. (note that I have not tested this code)
var fs = require('fs'),
    lazy = require('lazy');

var readStream = fs.createReadStream('yourfile.txt');

var file = lazy(readStream)
    .lines      // ask to read the stream line by line
    .take(100)  // and read 100 lines at a time.
    .join(function (onehundredlines) {
        readStream.pause(); // pause reading the stream
        writeToMongoDB(onehundredlines, function (err) {
            // error checking goes here
            // resume the stream 1 second after MongoDB finishes saving.
            setTimeout(readStream.resume, 1000);
        });
    });
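For completeness, here is a sketch of the same batching idea using only Node's built-in readline, where saveBatch is a hypothetical function that resolves once MongoDB has saved the batch of lines:
const fs = require('fs');
const readline = require('readline');

async function processInBatches(filename, batchSize) {
    const rl = readline.createInterface({
        input: fs.createReadStream(filename)
    });
    let batch = [];
    for await (const line of rl) { // the loop does not pull more lines until the await below finishes
        batch.push(line);
        if (batch.length >= batchSize) {
            await saveBatch(batch); // hypothetical: resolves when Mongo has saved the batch
            batch = [];
        }
    }
    if (batch.length > 0) await saveBatch(batch); // flush the remainder
}

processInBatches('yourfile.txt', 100);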
