Exclusive access to variable in mapLimit return - node.js

I'm reasonably new to programming in nodejs but not to programming (C/C++/Python/Shaders) and I have a question about exclusive access to a global variable when e.g. async.mapLimit return its callbacks
example
var myGlobalCounter = 0;
function executeDownload(item, callback){
exec('./ascriptthatdownload.sh ' + item, function (error, stdout, stderr) {
// Here do I have exclusive access to myGlobalCounter
// so that I could do this or update lets say a UI component?
myGlobalCounter++
console.log('Downloads ready:', myGlobalCounter);
});
}
function downloadSomeFiles() {
var listOfFiles = [];
// create download links
async.mapLimit(listOfFiles, 4, executeDownload, function(err, results){
});
}
I can get this to work but I don't know if this is safe enough? Other suggestions also appreciated. In C/C++ I would have used a mutex to guard against simultaneous access to myGlobalCounter.
Edit:I want to be able to safely count myGlobalCounter by 1 each time a download is ready and then pass it on either in console.log or to another component

The kudos goes to #CertainPerformance explanation in his comment (that I can't accept as an answer)
Yes, Javascript is single-threaded - a synchronous block of code will
run to the end before any other callback can run, look up the event
loop. You almost never have to worry about shared mutable state in JS

Related

Can this code cause a race condition in socket io?

I am very new to node js and socket io. Can this code lead to a race condition on counter variable. Should I use a locking library for safely updating the counter variable.
"use strict";
module.exports = function (opts) {
var module = {};
var io = opts.io;
var counter = 0;
io.on('connection', function (socket) {
socket.on("inc", function (msg) {
counter += 1;
});
socket.on("dec" , function (msg) {
counter -= 1;
});
});
return module;
};
No, there is no race condition here. Javascript in node.js is single threaded and event driven so only one socket.io event handler is ever executing at a time. This is one of the nice programming simplifications that come from the single threaded model. It runs a given thread of execution to completion and then and only then does it grab the next event from the event queue and run it.
Hopefully you do realize that the same counter variable is accessed by all socket.io connections. While this isn't a race condition, it means that there's only one counter that all socket.io connections are capable of modifying.
If you wanted a per-connection counter (separeate counter for each connection), then you could define the counter variable inside the io.on('connection', ....) handler.
The race conditions you do have to watch out for in node.js are when you make an async call and then continue the rest of your coding logic in the async callback. While the async operation is underway, other node.js code can run and can change publicly accessible variables you may be using. That is not the case in your counter example, but it does occur with lots of other types of node.js programming.
For example, this could be an issue:
var flag = false;
function doSomething() {
// set flag indicating we are in a fs.readFile() operation
flag = true;
fs.readFile("somefile.txt", function(err, data) {
// do something with data
// clear flag
flag = false;
});
}
In this case, immediately after we call fs.readFile(), we are returning control back to the node.js. It is free at that time to run other operations. If another operation could also run this code, then it will tromp on the value of flag and we'd have a concurrency issue.
So, you have to be aware that anytime you make an async operation and then the rest of your logic continues in the callback for the async operation that other code can run and any shared variables can be accessed at that time. You either need to make a local copy of shared data or you need to provide appropriate protections for shared data.
In this particular case, the flag could be incremented and decremented rather than simply set to true or false and it would probably serve the desired purpose of keeping track of whether this file is current being read or not.
Shorter answer:
"Race condition" is when you execute a series of ordered asynchronous functions and because of their async nature they won't finish processing in their original order.
In your code, you are executing a series of ordered synchronous process (increasing or decreasing the counter), So they finish instantly after they start, resulting in ordered output. So no racing here!

When calling Edge.js from C#, how do you hook stdout and stderr?

Background
I am working on a C# program which currently runs Node via Process.Start(). I am capturing the stdout and stderr from this child process and redirecting it for my own reasons. I am looking into replacing the invocation of Node.exe with a call to Edge.js instead. In order to be able to do this I must be able to reliably capture stdout and stderr from the Javascript running within Edge, and get the messages back into my C# application.
Approach 1
I'll describe this approach for completeness in case anybody recommends it :)
If the Edge process terminates, it is fairly easy to deal with this by simply declaring a msgs array and overwriting process.stdout.write and process.stderr.write with new functions that accumulate messages on that array, then at the end, simply return the msgs array. Example:
var msgs = [];
process.stdout.write = function (string) {
msgs.push({ stream: 'o', message : string });
};
process.stderr.write = function (string) {
msgs.push({ stream: 'e', message: string });
};
// Return to caller.
var result = { messages: msgs; ...other stuff... };
callback(null, result);
Obviously this only works if the Edge code terminates, and msgs may grow large in the worst case. However, it is likely to perform well because only one marshalling call is necessary to get all the messages back.
Approach 2
This is a little harder to explain. Instead of accumulating messages, we "hook" stdout and stderr using a delegate we send in from C#. In the C#, we create an object that we will pass into Edge, and that object has a property called stdoutHook:
dynamic payload = new ExpandoObject();
payload.stdoutHook = GetStdoutHook();
public Func<object, Task<object>> GetStdoutHook()
{
Func<object, Task<object>> hook = (message) =>
{
TheLogger.LogMessage((message as string).Trim());
return Task.FromResult<object>(null);
};
return hook;
}
I could really get away with an Action, but Edge appears to require the Func<object, Task<object>>, it won't proxy the function otherwise. Then, in the Javascript, we can detect that function and use it like this:
var func = Edge.Func(#"
return function(payload, callback) {
if (typeof (payload.stdoutHook) === 'function') {
process.stdout.write = payload.stdoutHook;
}
// do lots of stuff while stdout and stderr are hooked...
var what = require('whatever');
what.futz();
// terminate.
callback(null, result);
}");
dynamic result = func(payload).Result;
Questions
Q1. Both of these techniques seem to work, but is there a better way of doing this, something built-in to Edge perhaps that I have missed? Both solutions are invasive - they require some shim code to wrap the actual work that is to be done in Edge. This is not the end of the world, but it would be better if there was a non-invasive method.
Q2. In approach 2, where I have to return a task here
return Task.FromResult<object>(null);
it feels wrong to be returning an already completed "null task". But is there another way of writing this?
Q3. Do I need to be more rigorous in the Javascript code when hooking stdout and stderr? I note in double-edge.js there is this code, frankly I am not sure what is happening here, but it is quite a bit more complex than my crude overwriting of process.stdout.write :-)
// Fix #176 for GUI applications on Windows
try {
var stdout = process.stdout;
}
catch (e) {
// This is a Windows GUI application without stdout and stderr defined.
// Define process.stdout and process.stderr so that all output is discarded.
(function () {
var stream = require('stream');
var NullStream = function (o) {
stream.Writable.call(this);
this._write = function (c, e, cb) { cb && cb(); };
}
require('util').inherits(NullStream, stream.Writable);
var nullStream = new NullStream();
process.__defineGetter__('stdout', function () { return nullStream; });
process.__defineGetter__('stderr', function () { return nullStream; });
})();
}
Q1: There isn't anything built into Edge that would make capturing stdout or stderr of Node.js code automatic when calling Node from CLR. At some point I thought of writing an extension of Edge that would make marshaling Streams across CLR/V8 boundary easy. Under the hood it would be very similar to your Approach 2. It could be done as a standalone module on top of Edge.
Q2: Returning a completed task is very appropriate in this case. Your function has captured the Node.js output, processed it, and has in fact "completed" in that sense. Returning a task completed with Null is really a moral equivalent of returning from an Action.
Q3: The code you are pointing to is only relevant in Windows GUI applications, not Console applications. If you are writing a Console application, simply overriding write should suffice at the level of the Node.js code you pass to Edge.js. Note that the signature of write in Node allows an optional encoding parameter to be passed in. You seem to ignore it both in Approach 1 and 2. In particular in Approach 2 I would suggest wrapping the JavaScript proxy to C# callback into a JavaScript function that normalizes the parameters before assigning it to process.stdout.write. Otherwise Edge.js code may assume that the encoding parameter passed to a write call is a callback function which would follow the Edge.js calling convention.

NodeJS fs API: Detect Asynchronous Completion

I have a NodeJS application which uses the fs API to read files from a directory tree. I'm using the fs-walk module to walk the tree. For every sub directory encountered, the same function executes again to handle it. (I don't think this is recursion; rather, the same function is bound to an event which is fired each time a directory is handled.) Files are handled by a different function, which does stuff to them.
I'd like to execute arbitrary code once all files have been read without using synchronous or blocking code. I couldn't find any way to keep track of the number of files in a directory (to count down, for instance), nor could I find any attribute in fs.stat to indicate that the entire operation has completed.
Had anyone found a way to do this yet? I could find nothing in the node docs or on stack overflow.
After reviewing the fs-walk library a little closer, it looks like the third argument to the walk() method is actually a final callback. Internally they are using the async library, specifically async.whilst() and async.waterfall() methods which will execute the final callback when everything is complete.
I think the intention of the library creator is for that final callback to be executed when all async actions are completed. If that isn't working, you may want to file an issue in Github for it:
According to the code, you should be able to do:
var walk = require('fs-walk';
walk('/some/dir', someFileOrDirHandler, function(err) {
// This should be a final callback, if the first argument is present,
// then there was an error
if (err) {
/* handle it */
return;
}
// Getting here indicates success
});
As a compromise in performance, I ended up doing a total file count using a recursive function that accessed the file system synchronously. Using the total, I then accessed all the files asynchronously, decrementing the total each time. Once the total reached zero, I executed a function to handle all of the completed data.
var countAllFiles = new Promise(function (resolve, reject) {
var total = 0,
count = function (path) {
var contents = fs.readdirSync(path), file, name;
for (file in contents) {
if (!contents.hasOwnProperty(file)) continue;
name = path + '/' + contents[file];
if (fs.statSync(name).isDirectory())
count(name);
else
++total;
}
};
count('/path/to/tree/');
resolve(total);
}).then(function (total) {
walk.dirs('/path/to/tree/', handlerFunction, errorHandler);
// for every file, decrement total. Then, if it's zero, execute the code that
// depends on all the read/write operations being complete
});

How to populate mongoose with a large data set

I'm attempting to load a store catalog into MongoDb (2.2.2) using Node.js (0.8.18) and Mongoose (3.5.4) -- all on Windows 7 64bit. The data set contains roughly 12,500 records. Each data record is a JSON string.
My latest attempt looks like this:
var fs = require('fs');
var odir = process.cwd() + '/file_data/output_data/';
var mongoose = require('mongoose');
var Catalog = require('./models').Catalog;
var conn = mongoose.connect('mongodb://127.0.0.1:27017/sc_store');
exports.main = function(callback){
var catalogArray = fs.readFileSync(odir + 'pc-out.json','utf8').split('\n');
var i = 0;
Catalog.remove({}, function(err){
while(i < catalogArray.length){
new Catalog(JSON.parse(catalogArray[i])).save(function(err, doc){
if(err){
console.log(err);
} else {
i++;
}
});
if(i === catalogArray.length -1) return callback('database populated');
}
});
};
I have had a lot of problems trying to populate the database. Under previous scenarios (and this one), node pegs the processor and eventually runs out of memory. Note that in this scenario, I'm trying to allow Mongoose to save a record, and then iterate to the next record once the record saves.
But the iterator inside of the Mongoose save function never gets incremented. In addition, it never throws any errors. But if I put the iterator (i) outside of the asynchronous call to Mongoose, it will work, provided the number of records that I try to load are not too big (I have successfully loaded 2,000 this way).
So my questions are: Why isn't the iterator inside of the Mongoose save call ever incremented? And, more importantly, what is the best way to load a large data set into MongoDb using Mongoose?
Rob
i is your index to where you're pulling input data from in catalogArray, but you're also trying to use it to keep track of how many have been saved which isn't possible. Try tracking them separately like this:
var i = 0;
var saved = 0;
Catalog.remove({}, function(err){
while(i < catalogArray.length){
new Catalog(JSON.parse(catalogArray[i])).save(function(err, doc){
saved++;
if(err){
console.log(err);
} else {
if(saved === catalogArray.length) {
return callback('database populated');
}
}
});
i++;
}
});
UPDATE
If you want to add tighter flow control to the process, you can use the async module's forEachLimit function to limit the number of outstanding save operations to whatever you specify. For example, to limit it to one outstanding save at a time:
Catalog.remove({}, function(err){
async.forEachLimit(catalogArray, 1, function (catalog, cb) {
new Catalog(JSON.parse(catalog)).save(function (err, doc) {
if (err) {
console.log(err);
}
cb(err);
});
}, function (err) {
callback('database populated');
});
}
Rob,
The short answer:
You created an infinite loop. You're thinking synchronously and with blocking, Javascript functions asynchronously and without blocking. What you are trying to do is like trying to directly turn the feeling of hunger into a sandwich. You can't. The closest thing is you use the feeling of hunger to motivate you to go to the kitchen and make it. Don't try to make Javascript block. It won't work. Now, learn async.forEachLimit. It will work for what you want to do here.
You should probably review asynchronous design patterns and understand what it means on a deeper level. Callbacks are not simply an alternative to return values. They are fundamentally different in how and when they are executed. Here is a good primer: http://cs.brown.edu/courses/csci1680/f12/handouts/async.pdf
The long answer:
There is an underlying problem here, and that is your lack of understanding of what non-blocking IO and asynchronous means. Im not sure if you are breaking into node development, or this is just a one-off project, but if you do plan to continue using node (or any asynchronous language) then it is worth the time to understand the difference between synchronous and asynchronous design patterns, and what motivations there are for them. So, that is why you have a logic error putting the loop invariant increment inside an asynchronous callback which is creating an infinite loop.
In non-computer science, that means that your increment to i will never occur. The reason is because Javascript executes a single block of code to completion before any asynchronous callbacks are called. So in your code, your loop will run over and over, without i ever incrementing. And, in the background, you are storing the same document in mongo over and over. Each iteration of the loop starts sending document with index 0 to mongo, the callback can't fire until your loop ends, and all other code outside the loop runs to completion. So, the callback queues up. But, your loop runs again since i++ is never executed (remember, the callback is queued until your code finishes), inserting record 0 again, queueing another callback to execute AFTER your loop is complete. This goes on and on until your memory is filled with callbacks waiting to inform your infinite loop that document 0 has been inserted millions of times.
In general, there is no way to make Javascript block without doing something really really bad. For example, something paramount to setting your kitchen on fire to fry some eggs for that sandwich I talked about in the "short answer".
My advice is to take advantage of libs like async. https://github.com/caolan/async JohnnyHK mentioned it here, and he was correct for doing so.

Nodejs asynchronous confusion

I can't seem to grasp how to maintain async control flow with NodeJs. All of the nesting makes the code very hard to read in my opinion. I'm a novice, so I'm probably missing the big picture.
What is wrong with simply coding something like this...
function first() {
var object = {
aProperty: 'stuff',
anArray: ['html', 'html'];
};
second(object);
}
function second(object) {
for (var i = 0; i < object.anArray.length; i++) {
third(object.anArray[i]);
};
}
function third(html) {
// Parse html
}
first();
The "big picture" is that any I/O is non-blocking and is performed asynchronously in your JavaScript; so if you do any database lookups, read data from a socket (e.g. in an HTTP server), read or write files to the disk, etc., you have to use asynchronous code. This is necessary as the event loop is a single thread, and if I/O wasn't non-blocking, your program would pause while performing it.
You can structure your code such that there is less nesting; for example:
var fs = require('fs');
var mysql = require('some_mysql_library');
fs.readFile('/my/file.txt', 'utf8', processFile);
function processFile(err, data) {
mysql.query("INSERT INTO tbl SET txt = '" + data + "'", doneWithSql);
}
function doneWithSql(err, results) {
if(err) {
console.log("There was a problem with your query");
} else {
console.log("The query was successful.");
}
}
There are also flow control libraries like async (my personal choice) to help avoid lots of nested callbacks.
You may be interested in this screencast I created on the subject.
As #BrandonTilley said, I/O is asynchronous, so you need callbacks in Node.js to handle them. This is why Node.js can do so much with just a single thread (it's not actually doing more in a single thread, but rather than having the thread wait around for the data, it just starts processing the next task and when the I/O comes back, then it'll jump back to that task with the callback function you gave it).
But, nested callbacks can be taken care of with a good library like the venerable async or my new little library: queue-flow. They handle the callback issues and let you keep your code un-nested and looking very similar to blocking, synchronous code. :)

Resources