Nodejs + mikeal/Request module, how to close request or increase MaxSockets

I have a Nodejs app that's designed to perform simple end-to-end testing of a large web application. This app uses the mikeal/Request and Cheerio modules to navigate, request, traverse and inspect web pages across the application.
We are refactoring some tests, and are hitting a problem when multiple request functions are called in series. I believe this may be due to the Node.js process hitting the MaxSockets limit, but am not entirely sure.
Some code...
var request = require('request');
var cheerio = require('cheerio');
var async = require('async');

var getPages_FromMenuLinks = function() {
  var pageUrl = 'http://www.example.com/app';
  async.waterfall([
    function topPageRequest(cb1) {
      var menuLinks = [];
      request(pageUrl, function(err, resp, page) {
        var $ = cheerio.load(page);
        $('div[class*="sub-menu"]').each(function (i, elem) {
          menuLinks.push($(this).find('a').attr('href'));
        });
        cb1(null, menuLinks);
      });
    }, function subMenuRequests(menuLinks, cb2) {
      async.eachSeries(menuLinks, function(link, callback) {
        request(link, function(err, resp, page) {
          var $ = cheerio.load(page);
          // do some quick validation testing of elements on the expected page
          callback();
        });
      }, function() { cb2(null); });
    }
  ], function () { });
};

module.exports = getPages_FromMenuLinks;
Now, if I run this Node script, it runs through the first topPageRequest and starts the subMenuRequests, but then freezes after completing the request for the third sub-menu item.
It seems that I might be hitting a Max-Sockets limit, either in Node or on my machine (?) -- I'm testing this on a standard Windows 8 machine, running Node v0.10.26.
I've tried using request({pool:{maxSockets:25}, url:link}, function(err, resp..., but it does not seem to make any difference.
It also seems there's a way to abort the request object, if I first instantiate it (as found here). But I have no idea how I would "parse" the page, similar to what's happening in the above code. In other words, from the solution found in the link...
var theRequest = request({ ... });
theRequest.pipe(parser);
theRequest.abort();
..., how would I re-write my code to pipe and "parse" the request?

You can easily make thousands of requests at the same time (e.g. from a single for loop) and they will be queued and will terminate automatically one by one, once a particular request is served.
I think by default there are 5 sockets per domain, and in your case this limit should be more than enough.
It is highly probable that your server does not handle your requests properly (e.g. on error they are not terminated and hang indefinitely).
There are three steps you can take to find out what is going on:
check that you are sending a proper request -- as @mattyice observed, there are some bugs in your code.
investigate the server code and the way your requests are handled there -- to me it seems that the server does not serve/terminate them in the first place.
try setting a timeout on the request (see the sketch below). 5000ms should be a reasonable amount of time to wait; on timeout the request will be aborted with an appropriate error code.
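A minimal sketch of that last step, assuming the request module's timeout option (in milliseconds) and reusing the link and callback variables from the question's eachSeries loop:
var request = require('request');

request({ url: link, timeout: 5000, pool: { maxSockets: 25 } }, function (err, resp, page) {
  if (err) {
    // a hung request surfaces here as ETIMEDOUT (or ESOCKETTIMEDOUT)
    // instead of stalling the series forever
    return callback(err);
  }
  // ... validate the page as before ...
  callback();
});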
One piece of advice: I would recommend using a more suitable, easier-to-use and more accurate tool for this kind of testing, e.g. PhantomJS.

Related

Using redis as a cache as a REST API user (in order to save API requests)

I am an API user and I have only a limited number of requests available for a high-traffic website (~1k concurrent visitors). In order to save API requests I would like to cache the responses for specific requests which are unlikely to change.
However I want to refresh this redis key (the API response) at least every 15 seconds. I wonder what the best approach for this would be?
My ideas:
I thought the TTL field would be handy for this scenario: just set a TTL of 15s for the key. When I query the key and it's not present, I would request it again via the API. The problem: since this is a high-traffic website, I would expect around 20-30 incoming requests before I get a response from the API, which would lead to 20-30 requests to the API within a few ms. So I would need to "pause" all incoming requests until there is an API response.
My second idea was to refresh the key every 15s. I could set up a background task which runs every 15s, or I could check in my controller on each page request whether the key needs a refresh. I would prefer the latter, but then I would need to track the age of the redis key myself, which seems expensive and isn't a built-in feature?
What would you suggest for this use case?
My controller code:
function players(req, res, next) {
  redisClient.getAsync('leaderboard:players').then((playersLeaderboard) => {
    if (!playersLeaderboard) {
      // We need to get a fresh copy of the playersLeaderboard
    }
    res.set('Cache-Control', 's-maxage=10, max-age=10')
    res.render('leaderboards/players', {playersLeaderboard: playersLeaderboard})
  }).catch((err) => {
    logger.error(err)
  })
}
Simply fetch and cache the data when the node.js server starts, and then set an interval of 15 seconds to fetch fresh data and update the cache. Avoid using the TTL for this use case.
function fetchResultsFromApi(cb) {
  apiFunc((err, result) => {
    // do some error handling
    // cache result in redis without ttl
    cb();
  });
}

fetchResultsFromApi(() => {
  app.listen(port);
  setInterval(() => {
    fetchResultsFromApi(() => {});
  }, 15000);
});
Pros:
Very simple to implement
No queuing of client request required
Super fast response times
Cons:
The cache update might not execute/complete exactly every 15 seconds; it might drift by a few milliseconds here and there. I assume that won't make much difference for what you are doing, and you can always shorten the interval so the cache is updated before the 15 seconds are up (see the sketch below).
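If a slow API call overlapping the next interval tick ever becomes a concern, a recursive setTimeout that re-arms only after the fetch completes is a common variation (a sketch, not part of the answer above):
function refreshLoop() {
  fetchResultsFromApi(() => {
    // re-arm only once the previous fetch has finished,
    // so slow API calls can never pile up
    setTimeout(refreshLoop, 15000);
  });
}
refreshLoop();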
I guess this is more of an architecture question than the typical "help, my code doesn't work" kind.
Let me paraphrase your requirements.
Q: I would like to cache the responses of some HTTP requests which are unlikely to change and I would like these cached responses to be refreshed every 15 seconds. Is it possible?
A: Yes it is, and you're going to be thankful that JavaScript is single-threaded, because it makes this quite straightforward.
Some fundamental knowledge here: NodeJS is an event-driven framework, which means that at any point in time it executes only one piece of code, all the way until it is done.
If an async call is encountered along the way, it fires it off and adds an event to the event loop that says "call back when a response is received". When the current code routine is finished, it pops the next event from the queue and runs it.
Based on this, we can achieve the goal by building a function that fires off only one async call to update the cached response each time it expires. If an async call is already in flight, we just park the callers' callback functions in a queue, so that we don't make multiple async calls to fetch the same new result.
I'm not familiar with the async module, so I have provided a pseudo-code example using promises instead.
Pseudo code:
var fetch_queue = [];
var cached_result = {
  "cached_result_1": {
    "result": "test",
    "expiry": 1501477638 // epoch time 15s in the future
  }
};

var get_cached_result = function(lookup_key) {
  if (cached_result.hasOwnProperty(lookup_key)) {
    if (!result_expired(cached_result[lookup_key].expiry)) {
      // Not expired, safe to use the cached result
      return new Promise(function (resolve) {
        resolve(cached_result[lookup_key].result);
      });
    }
    else {
      // Expired, fetch a fresh one
      return update_result();
    }
  }
};

var update_result = function() {
  if (fetch_queue.length === 0) {
    // No other request is retrieving an updated result.
    return new Promise(function (resolve, reject) {
      // call your API to get the result.
      // When done call:
      resolve("Your result");
      // Inform the parked requests that an updated response is ready.
      fetch_queue.forEach(function(pending) {
        pending.resolve("Your result");
      });
      fetch_queue = []; // everyone waiting has been served
      // Compute the new expiry epoch time and update cached_result
    });
  }
  else {
    // A fetch is already in flight: park a promise in the queue
    return new Promise(function(resolve, reject) {
      fetch_queue.push({
        resolve: resolve,
        reject: reject
      });
    });
  }
};

get_cached_result("cached_result_1").then(function(result) {
  // reply with the result
});
Note: as the label suggests, this is pseudo code rather than a working solution, but the concept is there.
Something worth noting: setInterval is one way to go, but it doesn't guarantee that the function will be called exactly at the 15-second mark. The API only ensures that something will happen no sooner than the expected time.
The proposed solution, by contrast, ensures that as soon as the cached result has expired, the very next caller looking it up makes the request, and the callers that follow wait for that initial request to return.

How do I make HTTP requests inside a loop in NodeJS

I'm writing a command line script in Node (because I know JS and suck at Bash + I need jQuery for navigating through DOM)… right now I'm reading an input file and I iterate over each line.
How do I go about making one HTTP request (GET) per line so that I can load the resulting string with jQuery and extract the information I need from each page?
I've tried using the NPM httpsync package… so I could make one blocking GET call per line of my input file, but it doesn't support HTTPS, and of course the service I'm hitting only supports HTTPS.
Thanks!
A good way to handle a large number of jobs in a controlled manner is an async queue.
I also recommend you look at request for making HTTP requests and cheerio for dealing with the HTML you get.
Putting these together, you get something like:
var async = require('async');
var request = require('request');
var cheerio = require('cheerio');

var q = async.queue(function (task, done) {
  request(task.url, function(err, res, body) {
    if (err) return done(err);
    if (res.statusCode != 200) return done(res.statusCode);
    var $ = cheerio.load(body);
    // ...
    done();
  });
}, 5);
Then add all your URLs to the queue:
q.push({ url: 'https://www.example.com/some/url' });
// ...
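For the original use case of one GET per line of an input file, the URLs can be fed into the queue with the core readline module. A sketch, where the filename urls.txt is an assumption and the q.drain assignment is the older async API:
var fs = require('fs');
var readline = require('readline');

var rl = readline.createInterface({ input: fs.createReadStream('urls.txt') });
rl.on('line', function (line) {
  if (line.trim()) q.push({ url: line.trim() });
});

// fires once every queued task has been processed
q.drain = function () {
  console.log('all URLs processed');
};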
I would most likely use the async library's eachLimit function. That will allow you to throttle the number of active connections, as well as giving you a callback for when all the operations are done.
async.eachLimit(urls, function(url, done) {
  request(url, function(err, res, body) {
    // do something
    done();
  });
}, 5, function(err) {
  // do something
  console.log('all done!');
});
I was worried about making a million simultaneous requests without some kind of throttle limiting the number of concurrent connections, but it seems Node is throttling me "out of the box" to around 5-6 concurrent connections.
This is perfect, as it lets me keep my code a lot simpler while also fully leveraging the inherent asynchrony of Node.
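That out-of-the-box limit is almost certainly http.globalAgent.maxSockets, which defaulted to 5 sockets per host up to Node 0.10 (it became Infinity in 0.12). If more concurrency is ever needed, it can be raised globally; a sketch:
var http = require('http');
var https = require('https');

// Node <= 0.10 defaults to 5 sockets per host; raise as needed
http.globalAgent.maxSockets = 20;
https.globalAgent.maxSockets = 20;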

Best way for Node JS to wait on start up for initialisation from database etc

I know Node is non-blocking etc., but I don't know how to solve this issue without blocking.
You start the server
node app.js
but you need some config etc. from a database or MongoDB before you deal with incoming requests, so you need to wait for the db response before you start taking requests.
I could use nimble, but then you have to wrap the routes etc. all in a second execution block, which is nasty.
What's the best way?
Node is indeed non-blocking, but that doesn't mean you need to start accepting requests right away! Take a look at the classic HTTP server example:
var http = require('http');
var server = http.createServer(function (req, res) {
  // ... logic to handle requests ...
});
server.listen(8000);
You can do anything you like before calling server.listen, including whatever configuration tasks you need. Assuming those tasks are asynchronous, you can start the server in the callback:
var http = require('http');
var server = http.createServer(function (req, res) {
  // ... logic to handle requests ...
});

// Set up your mongo DB and grab a collection, and then...
myMongoCollection.find().toArray(function(err, results) {
  // Do something with results here...
  // Then start the server
  server.listen(8000);
});
It is OK to block for things that are necessary. Don't go asynchronous just for the sake of it!
In this case, since the DB is crucial to your app even running, blocking until it is ready is appropriate (and probably saves you a lot of hassle handling calls that don't have a DB connection yet).
You could also postpone starting your app server (in a callback, promise, etc.) until the call to start the DB completes; see the sketch below. Since nothing else is happening until the app is initialised (from what I can tell in the question), it wouldn't matter either way, because you're not stealing that single thread from anything else!
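A promise-flavoured sketch of that idea, assuming the mongodb driver (which returns a promise when no callback is given) and a local connection string:
var http = require('http');
var MongoClient = require('mongodb').MongoClient;

var server = http.createServer(function (req, res) {
  // ... logic that assumes the db is ready ...
});

MongoClient.connect('mongodb://localhost:27017/mydb').then(function (db) {
  // stash db somewhere your handlers can reach it, then start listening
  server.listen(8000);
}).catch(function (err) {
  console.error('failed to initialise:', err);
  process.exit(1);
});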
Based on the significance of server.listen's role in the sequence, I used nimble and did the following...
In the first block I get the stuff from the db (Elasticsearch in this case) and do some manipulation of it that is required to build the routes in the second block; then in the last block I start the server.
You could use nimble to do some other pre-init tasks too, and just run a parallel block inside the first serial block.
var chans = [];
flow.series([
  function (cb) {
    esClient.search({
      ...
    }).then(function (resp) {
      var channels = resp.hits.hits;
      channels.forEach(function(chan) { chans.push(chan.urlSlug); });
      chans = chans.join('|');
      cb();
    });
  },
  function (cb) {
    app.get('/:type('+types+')/[a-z\-]+/[a-z0-9\-]+-:id/', itemRt.feature); // http://localhost:3000/how-to/apple/apple-tv-set-up-tips-3502783/
    app.get('/:urlSlug('+types+')/', listingRt.category);
    app.get('/:urlSlug('+chans+')/', listingRt.channel);
    cb();
  },
  function (cb) {
    server.listen(app.get('port'), function() {
      console.log('Express server listening on port ' + app.get('port'));
      cb();
    });
  }
]);

wait for async to complete before return

Mongoose async code:
userSchema.static('alreadyExists', function(name) {
  var isPresent;
  this.count({ alias: name }, function(err, count) {
    isPresent = !!count;
  });
  console.log('Value of flag ' + isPresent);
  return isPresent;
});
I know isPresent is returned before this.count's async callback runs, so its value is undefined. But how do I wait for the callback to change the value of isPresent and then safely return it?
What effect does
(function(){ asynccalls(); asynccall(); })();
have on the async flow?
What happens with var foo = asynccall(), or (function(){})()?
Will the above two make the return wait?
Can process.nextTick() help?
I know there are a lot of questions like these, but none of them explain the problem of returning before async completion.
There is no way to do that. You need to change the signature of your function to take a callback rather than returning a value.
Making IO async is one of the main motivations of Node.js, and waiting for an async call to complete defeats the purpose.
If you give me more context on what you are trying to achieve, I can give you pointers on how to implement it with callbacks.
Edit: You need something like the following:
userSchema.static('alreadyExists', function (name, callback) {
  this.count({ alias: name }, function (err, count) {
    callback(err, err ? null : !!count);
    console.log('Value of flag ' + !!count);
  });
});
Then, you can use it like:
User.alreadyExists('username', function (err, exists) {
  if (err) {
    // Handle error
    return;
  }
  if (exists) {
    // Pick another username.
  } else {
    // Continue with this username.
  }
});
Had the same problem. I wanted my mocha tests to run very fast (as they originally did), but at the same time to have an anti-DoS layer present and operational in my app. Running the tests as they originally were was crazy fast, and the ddos module I'm using started responding with a Too Many Requests error, making the tests fail. I didn't want to disable it just for test purposes (I actually wanted automated tests to verify the Too Many Requests cases as well).
I had one place used by all the tests that prepared client for HTTPS requests (filled with proper headers, authenticated, with cookies, etc.). It looked like this more or less:
var agent = thiz.getAgent();
thiz.log('preReq for user ' + thiz.username);
thiz.log('preReq for ' + req.url + ' for agent ' + agent.mochaname);
if (thiz.headers) {
  Object.keys(thiz.headers).map(function(header) {
    thiz.log('preReq header ' + header);
    req.set(header, thiz.headers[header]);
  });
}
agent.attachCookies(req);
So I wanted to inject a sleep there that would kick in every fifth time this client was asked to perform a request -- that way the entire suite would still run quickly, and every fifth request would wait long enough for the ddos module to consider my requests unpunishable by the Too Many Requests error.
I searched most of the entries here about async and other libs or practices. All of them required switching to callbacks -- which meant I would have had to rewrite a couple of hundred test cases.
Finally I gave up on any elegant solution and fell back to the one that worked for me: adding a for loop that repeatedly checks the status of a non-existing file. It keeps the process busy long enough that I could calibrate it to last around 6500 ms.
var fs = require('fs');

for (var i = 0; i < 200000; ++i) {
  try {
    fs.statSync('/path' + i);
  } catch (err) {
    // expected: the file never exists; the failed stat is just busy-work
  }
}
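For what it's worth, on Node 0.12+ the same event-loop-blocking pause can be had deterministically, without per-machine calibration, by shelling out to sleep (an alternative to the loop above, not what the author used; assumes a platform with a sleep binary):
var execSync = require('child_process').execSync;

// blocks the event loop for ~6.5 seconds, like the stat loop above
execSync('sleep 6.5');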

Meteor client synchronous server database calls

I am building an application in Meteor that relies on real time updates from the database. The way Meteor has laid out the examples is to have the database call under the Template call. I've found that when dealing with medium sized datasets this becomes impractical. I am trying to move the request to the server, and have the results passed back to the client.
I have looked at similar questions on SO but have found no immediate answers.
Here is my server side function:
Meteor.methods({
  "getTest": function() {
    var res = Data.find({}, { sort: { time: -1 }, limit: 10 });
    var r = res.fetch();
    return r;
  }
});
And client side:
Template.matches._matches = function() {
  var res = {};
  Meteor.call("getTest", function (error, result) {
    res = result;
  });
  return res;
};
I have tried variations of the above code - returning in the callback function as one example. As far as I can tell, having a callback makes the function asynchronous, so it cannot be called onload (synchronously) and has to be invoked from the client.
I would like to pass all database queries server side to lighten the front end load. Is this possible in Meteor?
Thanks
The way to do this is to use subscriptions instead of remote method calls. See the counts-by-room example in the docs. So, for every database call you have a collection that exists client-side only. The server then decides the records in the collection using set and unset.
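A minimal sketch of that approach; the publication name 'latestData' is an assumption, and returning a cursor from Meteor.publish is the common shorthand for the lower-level set/unset protocol the answer mentions:
// server
Meteor.publish('latestData', function () {
  return Data.find({}, { sort: { time: -1 }, limit: 10 });
});

// client
Meteor.subscribe('latestData');

Template.matches._matches = function () {
  // reactive: re-runs automatically whenever the published records change
  return Data.find({}, { sort: { time: -1 } }).fetch();
};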
