How to get a count of the current open sockets in Node? - node.js

I am using the request module to crawl a list of URLs and would like to
limit the number of open sockets to 2:
var request = require('request');

var req = request.defaults({
  forever: true,
  pool: {maxSockets: 1}
});

req(options, function(error, response, body) {
  // ... code ...
  done();
});
However, when looping over an array of URLs and issuing a new request for each one, this does not seem to work.
Is there a way to get the current number of open sockets so I can test it?

I believe that maxSockets maps to http.Agent.maxSockets, which limits the number of concurrent requests to the same origin (host:port).
This comment, from the developer of request, suggests the same:
actually, pooling controls the agent passed to core. each agent holds all hosts and throttles the maxSockets per host
In other words, you can't use it to limit the number of concurrent requests in general. For that, you need to use an external solution, for instance using limiter or async.queue.
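If it helps, here is a minimal sketch of the async.queue approach; the urls array and the response handling are illustrative, not taken from the question:

var async = require('async');
var request = require('request');

// async.queue runs at most `concurrency` workers at once, regardless of host,
// so this caps the total number of in-flight requests at 2.
var queue = async.queue(function (url, done) {
  request({url: url, forever: true}, function (error, response, body) {
    // ... handle error/response/body here ...
    done(error);
  });
}, 2);

// property-assignment style used by the async versions of that era
queue.drain = function () {
  console.log('all URLs processed');
};

urls.forEach(function (url) {
  queue.push(url);
});

As for the literal question of counting open sockets: an http.Agent exposes a sockets object keyed by host:port, so something like Object.keys(require('http').globalAgent.sockets) shows which origins currently have open sockets, and the length of each entry is the number of sockets open to that origin.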

Related

Throttling event-driven Nodejs HTTP requests

I have a Node net.Server that listens to a legacy system on a TCP socket. When a message is received, it sends an http request to another http server. Simplified, it looks like this:
var request = require('request-promise');
...
socket.on('readable', function () {
  var msg = parse(socket.read());
  var postOptions = {
    uri: 'http://example.com/go',
    method: 'POST',
    json: msg,
    headers: {
      'Content-Type': 'application/json'
    }
  };
  request(postOptions);
});
The problem is that the socket is readable about 1000 times per second. The requests then overload the http server. Almost immediately, we get multiple-second response times.
In running Apache benchmark, it's clear that the http server can handle well over 1000 requests per second in under 100ms response time - if we limit the number of concurrent requests to about 100.
So my question is: what is the best way to limit the number of outstanding concurrent requests when using the request-promise library (and, by extension, request and core http.request), when each request is fired separately from within an event callback?
Request's documentation says:
Note that if you are sending multiple requests in a loop and creating multiple new pool objects, maxSockets will not work as intended. To work around this, either use request.defaults with your pool options or create the pool object with the maxSockets property outside of the loop.
I'm pretty sure that this paragraph is telling me the answer to my problem, but I can't make sense of it. I've tried using defaults to limit the number of open sockets:
var rp = require('request-promise');
var request = rp.defaults({pool: {maxSockets: 50}});
This doesn't help. My only thought at the moment is to manually manage a queue, but I expect that would be unnecessary if I only knew the conventional way to do it.
Well, you need to throttle your requests, right? I have worked around this in two ways, but let me show you one pattern I always use: I often combine throttle-exec and a Promise to make a wrapper around request. You can install it with npm install throttle-exec and use Promise natively or via a third-party library. Here is my gist for this wrapper: https://gist.github.com/ans-4175/d7faec67dc6374803bbc
How do you use it? It's simple, just like an ordinary request call.
var Request = require("./Request");

Request({
  url: url_endpoint,
  json: param,
  method: 'POST'
})
.then(function(result) {
  console.log(result);
})
.catch(reject);
Tell me how it goes after you implement it. Either way, I have another wrapper :)
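If you would rather not pull in a wrapper module, a minimal hand-rolled limiter is another option. This is just a sketch (not the linked gist); makeLimitedRequest and the limit of 100 are illustrative, and it assumes a Promise implementation is available (native or polyfill):

var rp = require('request-promise');

// Hypothetical sketch: keep at most `limit` requests in flight, queue the rest.
function makeLimitedRequest(limit) {
  var active = 0;
  var pending = [];

  function next() {
    if (active >= limit || pending.length === 0) return;
    active++;
    var task = pending.shift();
    rp(task.options)
      .then(task.resolve, task.reject)
      .then(function () {
        active--;
        next(); // start the next queued request, if any
      });
  }

  return function limitedRequest(options) {
    return new Promise(function (resolve, reject) {
      pending.push({options: options, resolve: resolve, reject: reject});
      next();
    });
  };
}

// Usage in the event handler from the question: at most 100 outstanding requests.
var limitedRequest = makeLimitedRequest(100);
socket.on('readable', function () {
  var msg = parse(socket.read());
  limitedRequest({uri: 'http://example.com/go', method: 'POST', json: msg});
});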

nodejs, control over concurrent connections

I am using the Node.js async module for concurrent connections. My backend server can only handle 1000 connections at a time, so I use async.mapLimit to limit them. Each job inside async.mapLimit itself makes multiple connections. When the same request (the one that runs async.mapLimit) is sent from multiple browsers at the same time, I get an EMFILE error on the server side:
([Error: connect EMFILE] code: 'EMFILE', errno: 'EMFILE', syscall: 'connect'),
My code somewhat looks like this:
async.mapLimit(jobList, 200, jobCallback, function(error, data) {
});

function jobCallback(job, callback) {
  /* Make multiple connections to the backend server; this number is
     dynamic, and here I also use async.mapLimit */
}
Now I want to implement some wrapper function on top of this mapLimit (or anything else) so that, irrespective of the number of parallel requests or client calls, the number of concurrent connections stays limited. It may be slower, but I don't mind.
How can I achieve this?
I am using the restler library. I have tried setting
proto.globalAgent.maxSockets = 1000
to allow 1000 concurrent connections at a time, but it does not seem to work.
Please advise.
-M-
You will have to handle throttling yourself, as that async call has no way of knowing whether calls from other users are also counting toward the 1000-connection limit.
In REST services, a server typically sends an HTTP 429 response when such a limit is hit, allowing your app to recognize the bottleneck and trigger a throttling mechanism.
A common way to do that is exponential backoff:
https://developers.google.com/api-client-library/java/google-http-java-client/backoff
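For illustration only, a backoff wrapper around a single backend call might look like the sketch below; doRequest, maxRetries and the delay values are hypothetical placeholders, not part of restler or async:

// Retry with exponential backoff when the backend signals overload
// (e.g. HTTP 429) or the connection fails (e.g. EMFILE / ECONNREFUSED).
function requestWithBackoff(doRequest, maxRetries, callback) {
  var attempt = 0;
  function tryOnce() {
    doRequest(function (err, response) {
      var overloaded = err || (response && response.statusCode === 429);
      if (overloaded && attempt < maxRetries) {
        // 100ms, 200ms, 400ms, ... plus a little jitter
        var delay = Math.pow(2, attempt) * 100 + Math.floor(Math.random() * 100);
        attempt++;
        setTimeout(tryOnce, delay);
      } else {
        callback(err, response);
      }
    });
  }
  tryOnce();
}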
I use the following line of code to manage the limit globally:
require('events').EventEmitter.prototype._maxListeners = 1000;
Thanks

Fetching external resources in parallel in node - good practice?

I have a setup where a node server acts as a proxy server to serve images.
For example, for an image "test1.jpg", the exact same image can be fetched from 3 external sources, let's say:
a. www.abc.com/test1.jpg
b. www.def.com/test1.jpg
c. www.ghi.com/test1.jpg
When the nodejs server gets a request for "test1.jpg" it first gets a list of external URLs from a DB. Now amongst these external resources, at least one is always behind a CDN and is "expected" to respond faster and hence is a preferred source for the image.
My question is: out of the two methods below, which is the correct way to achieve this (or is there another method)?
Fire HTTP requests (using mikeal's request client module) for all the URLs at the same time. Take their promise objects and, whichever source responds first, send that image back to the user (it can be any of the three sources, not necessarily the preferred source behind the CDN, but that doesn't matter since the image is exactly the same). The disadvantage I see is that for every image we hit 3 sources; also, the promises for the other HTTP requests can still get fulfilled after the response from the first successful source has been sent out.
Fire HTTP requests one at a time, starting with the most preferred source, wait for it to fail (i.e. a 404 on the image) and then proceed to the next preferred source. We make fewer HTTP requests, but there is more wait time for the user.
Some pseudo code
Method 1

while (imagePreferences.length > 0) {
  var url = imagePreferences.shift(); // take the next URL off the list
  getImage(url).then(function (imageResp) {
    sendImage(imageResp);
  }, function (err) {
    console.log(err);
  });
}

Method 2

if (imageUrls.length > 0) {
  var url = imageUrls.shift(); // most preferred source first
  getImage(url).then(function (imageResp) {
    sendImageResp(imageResp);
  }, function (err) {
    getNextImage(); // recurse over this with the next URL
  });
}
This is just pseudo code. I am new to nodejs. Any help/views would be appreciated.
I prefer the 1st option; CDNs are designed to handle massive numbers of requests, and your code is perfectly fine sending HTTP requests to multiple sources in parallel.
In case you want to stop the other requests after successfully receiving the first image, you can use async.detect: https://github.com/caolan/async#detect
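For example, here is a sketch of option 1 with a guard so that only the first successful response is sent back; getImage and sendImage are the placeholders from the question's pseudo code:

// Fire requests to all sources in parallel; whichever resolves first wins.
// The other promises may still settle later, which is harmless.
var sent = false;
imageUrls.forEach(function (url) {
  getImage(url).then(function (imageResp) {
    if (!sent) {
      sent = true;
      sendImage(imageResp);
    }
  }, function (err) {
    console.log('failed to fetch ' + url + ':', err);
  });
});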

How to use Request js (Node js Module) pools

Can someone explain how to use the request.js pool hash?
The github notes say this about pools:
pool - A hash object containing the agents for these requests. If omitted this
request will use the global pool which is set to node's default maxSockets.
pool.maxSockets - Integer containing the maximum amount of sockets in the pool.
I have this code for writing to a CouchDB instance (note the question marks). Basically, any user who connects to my Node server will write to the DB independently of the others:
var request = require('request');

request({
  //pool:, // ??????????????????
  'pool.maxSockets' : 100, // ??????????????????
  'method' : 'PUT',
  'timeout' : 4000,
  'strictSSL' : true,
  'auth' : {
    'username' : myUsername,
    'password' : myPassword
  },
  'headers' : {
    'Content-Type': 'application/json;charset=utf-8',
    'Content-Length': myData.length
  },
  'json' : myData,
  'url': myURL
}, function (error, response, body) {
  if (error == null) {
    log('Success: ' + body);
  } else {
    log('Error: ' + error);
  }
});
What's best for high throughput/performance?
What are the drawbacks of a high 'maxSockets' number?
How do I create a separate pool to use instead of the global pool, and why would I want a separate pool in the first place?
The pool option in request uses an agent, which is the same as http.Agent from the standard http library. See the documentation for http.Agent and the agent option of http.request.
Usage
var http = require('http');
var request = require('request');

var pool = new http.Agent(); // your pool/agent
http.request({hostname: 'localhost', port: 80, path: '/', agent: pool});
request({url: 'http://www.google.com', pool: pool});
If you are curious to know what that agent object looks like, you can log it to the console:
{ domain: null,
  _events: { free: [Function] },
  _maxListeners: 10,
  options: {},
  requests: {},
  sockets: {},
  maxSockets: 5,
  createConnection: [Function] }
The maxSockets property determines how many concurrent sockets the agent can have open per host; it is present on an agent by default with a value of 5. Typically you would set it before making any requests. Passing pool.maxSockets explicitly overrides the maxSockets property of the pool, and this option only makes sense when the pool option is also passed.
So there are different ways to use it (sketched below):
Don't pass an agent option at all: it will be undefined and http.globalAgent will be used. This is the default case.
Pass it as false to disable pooling altogether.
Provide your own agent, as in the example above.
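A small sketch of those three cases using core http.request directly (localhost and the value 20 are placeholders):

var http = require('http');

function onResponse(res) {
  res.resume(); // drain the response so the socket can be reused or closed
}

// 1. Default: no agent option, the request uses the shared http.globalAgent.
http.request({hostname: 'localhost', port: 80, path: '/'}, onResponse).end();

// 2. agent: false opts out of pooling; a one-off agent is used for this request only.
http.request({hostname: 'localhost', port: 80, path: '/', agent: false}, onResponse).end();

// 3. Provide your own agent with an explicit socket cap.
var myAgent = new http.Agent();
myAgent.maxSockets = 20;
http.request({hostname: 'localhost', port: 80, path: '/', agent: myAgent}, onResponse).end();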
Answering your questions in reverse.
A pool is meant to keep a certain number of sockets for the program to use. Firstly, the sockets are reused across different requests, which reduces the overhead of creating new sockets. Secondly, it uses fewer sockets for the requests, but uses them consistently; it will not take up all the sockets available. Thirdly, it maintains a queue of requests, so some waiting time is implied.
The pool acts as both a cache and a throttle. The throttle effect is more visible when you have more requests and fewer sockets. When using the global pool, it may limit the functioning of two different clients, with no guarantees on waiting time; having a separate pool for each will be fairer to both (think of the case where one makes more requests than the other).
The maxSockets property gives the maximum concurrency possible, so increasing it increases overall throughput/performance. The drawback is that the throttle effect is reduced and you cannot control peak overhead; setting it to a very large number is like having no pooling at all, and you would start getting errors such as "socket not available". It also cannot exceed the maximum limit set by the OS.
So what is best for high throughput/performance? There is a physical limit on throughput; once you reach it, response time will increase with the number of connections. You can keep increasing maxSockets until that point, but beyond it, increasing it further will not help.
You should also take a look at the forever-agent module, which is a wrapper around http.Agent.
Generally, the pool is a hash object that contains a number of http agents, one per host:port; it tries to reuse sockets created by "keep-alive" connections. For example, if you perform several requests to www.domain1.com:80 and www.domain2.com:80, and a response does not contain the header Connection: close, the socket is put back in the pool and handed to pending requests.
If no pending request needs the pooled socket, it is destroyed.
maxSockets means the maximum number of concurrent sockets for a single host:port; the default value is 5. I would suggest choosing this value based on your scenario:
For the hot sites that your requests visit frequently, you'd better create a separate pool so that new requests can pick up idle sockets very quickly. The point is to reduce the number of pending requests to certain sites by increasing the maxSockets value of a pool. Note that it doesn't matter if you set a very high maxSockets when the connection is managed by the origin server via the response header Connection: close.
For the sites that your requests hardly ever visit, use pool: false to disable pooling.
You can specify a separate pool for your requests like this:
// create a separate socket pool with 10 concurrent sockets as its max value.
var separateReqPool = {maxSockets: 10};
var request = require('request');

request({url: 'http://localhost:8080/', pool: separateReqPool}, function (e, resp) {
});
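Note that the pool object should be created once, outside any loop, and reused for every request (this is exactly the workaround quoted from the request docs in an earlier question); a sketch, where urls is an illustrative array:

var request = require('request');

// One shared pool: every request below draws from the same 10-socket limit.
var separateReqPool = {maxSockets: 10};

urls.forEach(function (url) {
  request({url: url, pool: separateReqPool}, function (e, resp, body) {
    // ... handle the response ...
  });
});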

“Proxying” a lot of HTTP requests with Node.js + Express 2

I'm writing a proxy in Node.js + Express 2. The proxy should:
decrypt POST payload and issue HTTP request to server based on result;
encrypt reply from server and send it back to client.
The encryption-related part works fine. The problem I'm facing is timeouts: the proxy should process requests in less than 15 seconds, and most of them are actually under 500 ms.
The problem appears when I increase the number of parallel requests. Most requests complete OK, but some fail after 15 seconds plus a couple of milliseconds. ab -n5000 -c300 works fine, but with a concurrency of 500 some requests fail with a timeout.
I can only speculate, but it seems that the problem is the order of callback execution. Is it possible that the requests that come in first hang until ETIMEDOUT because Node is busy with the later ones, which are still being processed in time, under 500 ms?
P.S.: There is no problem with the remote server. I'm using request to interact with it.
Update:
Here is the way things work, with some code:
function queryRemote(req, res) {
  var options = {}; // built based on req object (URI, body, authorization, etc.)
  request(options, function(err, httpResponse, body) {
    return err ? send500(req, res)
               : res.end(encrypt(body));
  });
}

app.use(myBodyParser); // reads hex string in payload
                       // and calls next() on 'end' event

app.post('/', [checkHeaders,   // check Content-Type and Authorization headers
               authUser,       // query DB and call next()
               parseRequest],  // decrypt payload, parse JSON, call next()
  function(req, res) {
    req.socket.setTimeout(TIMEOUT);
    queryRemote(req, res);
  });
My problem is the following: when ab issues, let's say, 20 POSTs to /, the Express route handler gets called something like thousands of times. That doesn't always happen; sometimes 20 and only 20 requests are processed in a timely fashion.
Of course, ab is not the problem. I'm 100% sure that only 20 requests are sent by ab, but the route handler gets called multiple times.
I can't find a reason for this behaviour, any advice?
The timeouts were caused by using http.globalAgent, which by default can process only up to 5 concurrent requests to one host:port (which isn't enough in my case).
Thousands of requests (instead of tens) were indeed sent by ab (a fact confirmed with Wireshark under OS X; I cannot reproduce this under Ubuntu inside Parallels).
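For reference, raising the limit on the shared global agent is a one-liner; the value 500 is only an example:

// Allow more concurrent sockets per host:port on http.globalAgent
// (the old default of 5 is what caused the timeouts described above).
require('http').globalAgent.maxSockets = 500;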
You can have a look at the node-http-proxy module and how it handles connections. Make sure you don't buffer any data and that everything works by streaming.

You should also try to see where the time is spent for those long requests. Try instrumenting parts of your code with console.time and console.timeEnd and see what takes the most time. If the time is mostly spent in JavaScript, you should try to profile it. Basically you can use the V8 profiler by adding the --prof option to your node command, which produces a v8.log that can be processed with a V8 tool found in node-source-dir/deps/v8/tools. It only works if you have built the d8 shell via SCons (scons d8). You can have a look at this article to help you further in getting this working.

You can also use node-webkit-agent, which uses the WebKit developer tools to show the profiler results. You can also have a look at my fork with a bit of sugar.

If that didn't work, you can try profiling with dtrace (this only works on illumos-based systems like SmartOS).
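As a rough illustration of the console.time suggestion, instrumenting the queryRemote function from the question might look like this (the label scheme is arbitrary):

var requestCounter = 0;

function queryRemote(req, res) {
  var options = {}; // built based on req object, as before
  var label = 'remote-request #' + (++requestCounter);
  console.time(label);
  request(options, function (err, httpResponse, body) {
    console.timeEnd(label); // prints how long this upstream call took
    return err ? send500(req, res)
               : res.end(encrypt(body));
  });
}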

Resources