Node.js: Handling outgoing HTTP request from node - node.js

I've been wrestling with this problem for a while but could not find a good solution for this, so came here for help.
I have a node.js server; on receiving a request from a client, the server will contact 3rd party backend to grab some data, and return it back to the client.
The server to 3rd party backend communication involved multiple calls back and forth, and it typically takes ~3 seconds to finish this process.
If I fire off, say, 50 concurrent requests from test tools like JMeter, the performance degradation becomes severe very quickly, even causing timeouts for some of the later served calls.
Initially I started looking into asyncblock, but since it was running on fiber I wasn't seeing a big improvements in performance, so I started looking into threads.
The only mature module I could find was thread-a-gogo, but I also recently found out that you cannot use the required external modules(like crypto, for example) within threads spawned by TAGG.
Given that there are proxy products built with node.js I believe there is an efficient way to do this but I can't really think of other approaches if threads cannot use external modules.
Any advice would be appreciated.
I can't reveal the full detail due to NDA but here's the basic concept of what I'm doing. I'd like to send below logic to a separate thread.
asyncblock(flow){
var result1 = flow.sync(externalRequest1(flow.callback());
if result1 contains success message
var result2 = flow.sync(externalRequest2(flow.callback());
if result2 contains success message
process result2 and return to client
}

First thing to check and be certain: are you using the node.js core HTTP Agent? If so, you are subject to maxSockets limit of 5 connections to the same server. Read the hyperquest README rant for details.
Secondly, be aware the remote side may impose abuse limitations as well, so check to see if there are issues there or if, for example, overall performance would be better if you used a single pool of 10 connections instead of opening unlimited number of simultaneous connections to the upstream server.

Related

NodeJS Polling per User Structure best practice

My project is a full stack application where a web client subscribes to an unready object. When the subscription is triggered, the backend will run an observation loop to that unready object until it becomes ready. When that happens it sends a message to the frontend through socketIO (suggestions are welcome, I'm not quite sure if it's the best method). My question is how do I construct the observation loop.
My frontend basically subscribes to the backend, and gets a return 200 and will connect to the server per Websocket (socketIO) if it got subscribed correctly, or an error 4XX code if there was something that went wrong. On the backend, when the user subscribes, it should start for that user, a "thread" (I know Nodejs doesn't support threads, it's just for the mental image) that polls an information from an api every 10 or so seconds.
I do that, because the API that I poll from does not support WebHooks, so I need to observe the API response until it's at the state that I want it (this part I already got cleared).
What I'm asking, is there a third party library that actually is meant for those kinds of tasks? Should I use worker threads or simple setTimeouts abstracted by Classes? The response will be sent over SocketIO, that part I already got working as well, it's just the method I'm using im not quite sure how to build.
I'm also open to use another fitting programming language that makes solving this case easier. I'm not in a hurry.
A polling network request (which it sounds like this is) is non-blocking and asynchronous so it doesn't really take much of your nodejs CPU unless you're doing some heavy-weight computation of the result.
So, a single nodejs thread can make a lot of network requests (for your polling and for sending data over socket.io connection) without adding WorkerThreads or clustering. This is something that nodejs is very, very good at.
I'm not aware of any third party library specifically for this as you have to custom code looking at the results of the network request anyway and that's most of the coding. There are a bunch of libraries for making http requests of other servers from nodejs listed here. My favorite in that list is got(), but you can look at the choices and decide what you like.
As for making the repeated requests, I would probably just use either repeated setTimeout() calls or a setInterval() call.
You don't say whether you have to make separate requests for every single client that is subscribed to something or whether you can somehow combine all clients watching the same resource so that you use the same polling interval for all of them. If you can do the latter, that would certainly be more efficient.
If, as you scale, you run into scaling issues, you can then move the polling code to one or more child processes or WorkerThreads and then just communicate back to the main thread via messaging when you have found a new state that needs to be sent to the client. But, I would not anticipate you would need to code that extra step until you reach larger scale. As with most scaling things, you would need to code up the more basic option (which should scale well by itself) and then measure and benchmark and see where any bottlenecks are and modify the architecture based on data, not speculation. Far too often, the architecture is over-designed and over-implemented based on where people think the bottlenecks might be rather than where they actually turn out to be. Not only does this make the development take longer and end up with more complicated implementation than required, but it can target development at the wrong part of the problem. Profile, measure, then decide.

Node.js design approach. Server polling periodically from clients

I'm trying to learn Node.js and adequate design approaches.
I've implemented a little API server (using express) that fetches a set of data from several remote sites, according to client requests that use the API.
This process can take some time (several fecth / await), so I want the user to know how is his request doing. I've read about socket.io / websockets but maybe that's somewhat an overkill solution for this case.
So what I did is:
For each client request, a requestID is generated and returned to the client.
With that ID, the client can query the API (via another endpoint) to know his request status at any time.
Using setTimeout() on the client page and some DOM manipulation, I can update and display the current request status every X, like a polling approach.
Although the solution works fine, even with several clients connecting concurrently, maybe there's a better solution?. Are there any caveats I'm not considering?
TL;DR The approach you're using is just fine, although it may not scale very well. Websockets are a different approach to solve the same problem, but again, may not scale very well.
You've identified what are basically the only two options for real-time (or close to it) updates on a web site:
polling the server - the client requests information periodically
using Websockets - the server can push updates to the client when something happens
There are a couple of things to consider.
How important are "real time" updates? If the user can wait several seconds (or longer), then go with polling.
What sort of load can the server handle? If load is a concern, then Websockets might be the way to go.
That last question is really the crux of the issue. If you're expecting a few or a few dozen clients to use this functionality, then either solution will work just fine.
If you're expecting thousands or more to be connecting, then polling starts to become a concern, because now we're talking about many repeated requests to the server. Of course, if the interval is longer, the load will be lower.
It is my understanding that the overhead for Websockets is lower, but still can be a concern when you're talking about large numbers of clients. Again, a lot of clients means the server is managing a lot of open connections.
The way large services handle this is to design their applications in such a way that they can be distributed over many identical servers and which server you connect to is managed by a load balancer. This is true for either polling or Websockets.

Faster HTTP scraping per POST request?

I'm writing an API that returns an array of redirects for any given page:
router.post('/trace', function(req,res){
if(!req.body.link)
return res.status(405).send(""); //error: no link provided!
console.log("\tapi/trace()", req.body.link);
var redirects = [];
function exit(goodbye){
if(goodbye)
console.log(goodbye);
res.status(200).send(JSON.stringify(redirects)); //end
}
function getRedirect(link){
request({ url: link, followRedirect: false }, function (err, response, body) {
if(err)
exit(err);
else if(response.headers.location){
redirects.push(response.headers.location);
getRedirect(response.headers.location);
}
else
exit(); //all done!
});
}
getRedirect(req.body.link);
});
and here is the corresponding browser request:
$.post('/api/trace', { link: l }, cb);
a page will make about 1000 post request very quickly and then waits a very long time to get each request back.
The problem is the response to the nth request is very slow. individual request takes about half a second, but as best I cant tell the express server is processing each link sequentially. I want the server to make all the requests and respond as it receives a response.
Am I correct in assuming express POST router is running processes sequentially? How do I get it to blast all requests and pass the responses as it gets them?
My question is why is it so slow / is POST an async process on a "out of the box" express server?
You may be surprised to find out that this is probably first a browser issue, not a node.js issue.
A browser will have a max number of simultaneous requests it will allow your Javascript ajax to make to same host which will vary slightly from one browser to the next, but is around 6. So, if you're making 1000 requests, then only around 6 are being sent at at time. The rest go in a queue in the browser waiting for prior requests to finish. So, your node server likely isn't getting 1000 simultaneous requests. You should be able to confirm this by logging incoming requests in your node.js app. You will probably see a long delay before it receives the 1000th request (because it's queued by the browser).
Here's a run-down of how many simultanous requests to a given host each of the browser supported (as of a couple years ago): Max parallel http connections in a browser?.
My first recommendation would be to package up an array of requests to make from the client to the server (perhaps 50 at a time) and then send that in one request. That will give your node.js server plenty to chew on and won't run afoul of the browser's connection limit to the same host.
As for the node.js server, it depends a lot on what you're doing. If most of what you're doing in the node.js server is just networking and not a lot of processing that requires CPU cycles, then node.js is very efficient at handling lots and lots of simultaneous requests. If you start engaging a bunch of CPU (processing or preparing results), then you make benefit from either adding worker processes or using node.js clustering. In your case, you may want to use worker processes. You can examine your CPU load when your node.js server is processing a bunch of work and see if the one CPU that node.js is using is anywhere near 100% or not. If it isn't, then you don't need more node.js processes. If it is, then you do need to spread the work over more node.js processes to go faster.
In your specific case, it looks like you're really only doing networking to collect 302 redirect responses. Your single node.js process should be able to handle a lot of those requests very efficiently so probably the issue is just that your client is being throttled by the browser.
If you want to send a lot of requests to the server (so it can get to work on as many as feasible), but want to get results back immediately as they become available, that's a little more work.
One scheme that could work is to open a webSocket or socket.io connection. You can then send a giant array of URLs that you want the server to check for you in one message over the socket.io connection. Then, as the server gets a result, it can send back each individual result (tagged with the URL that it corresponds to). That way, you can somewhat get the best of both worlds with the server crunching on a long list of URLs, but able to send back individual responses as soon as it gets them.
Note, you will probably find that there is an upper limit to how many outbound http requests you may want to run at the same time from your node.js server too. While modern versions of node.js don't throttle you like the browser does, you probably also don't want your node.js server attempting to run 10,000 simultaneous requests because you may exhaust some sort of network resource pool. So, once you get past the client bottleneck, you will want to test your server at different levels of simultaneous requests open to see where it performs best. This is both to optimize its performance, but also to protect your server against attempting to overextend its use of networking or memory resources and get into error conditions.

Building Websites only on NodeJs and Express blocking requests over http

I have a question regarding the examples out there when using Nodejs, Express and Jade for templates.
All the examples show how to build some sort of a user administrative interface where you can add user profiles, delete them and manage them.
Those are considered beginner's guides to NodeJs. My question is around the fact that if I have have 10 users concurrently accessing the same interface and doing the same operations, surely NodeJs will block the requests for the other users as they are running on the same port.
So let's say I am pulling out a list of users which may be something like 10000. Yes I can do paging, but that is not the point. While I am getting the list from the server another 4 users want to access the application. They have to wait for my process to end. That is my question - how can one avoid that using NodeJS & Express?
I am on this issue for a couple of months! I currently have something in place that does the following:
Run the main processing of stuff on a port
Run a Socket.io process on a different port
Use a sticky session
The idea is that I do a request (like getting a list of items), and immediately respond with some request reference but without the requested items, thus releasing the port.
In the background "asynchronously" I then do the process of getting the items. Upon which when completed, I do an http request from one node to the socket node port node SENDING the items through.
When that is done I then perform a socket.io emit WITH the data and the initial request reference so that the correct user gets the message.
On the client side I have an event listening for the socket which then completes the ajax request by populating the list.
I have SOME success in doing this! It actually works to a degree! I have an issue online which complicates matters due to ip addresses, and socket.io playing funny.
I also have multiple workers using clustering. I use it in the following manner:
I create a master worker
I spawn workers
I take any connection request and pass it to the relevant worker.
I do that for the main node request as well as for the socket requests. Like I said I use 2 ports!
As you can see I have had a lot of work done on this and I am not getting a proper solution!
My question is this - have I gone all around the world 10 times only to have missed something simple? This sounds way to complicated to achieve a non-blocking nodejs only website.
I asked myself - surely all these tutorials would have not missed on something as important as this! But they did!
I have researched, read, and tested a lot of code - this is my very first time I ask anything on stackoverflow!
Thank you for any assistance.
P.S. One example of the same approach is this: I request a report using jasper, I pass parameters, and with the "delayed ajax response" approach as described above I simply release the port, and in the background a very intensive report is being generated (and this can be very intensive process as a lot of calculations are being performed)..! I really don't see a better approach - any help will be super appreciated!
Thank you for taking the time to read!
I'm sorry to say it, but yes, you have been going around the world 10 times only to have been missing something simple.
It's obvious that your previous knowledge/experience with webservers are from a blocking point of view, and if this was the case, your concerns had been valid.
Node.js is a framework focused around using a single thread to execute code, which means if it does any blocking operations, no one else would be able to get anything done.
There are some operations that can do this in node, like reading/writing to disk. However, most node operations will be asynchronous.
I believe you are familiar with the term, so I won't go into details. What asynchronous operations allows node to do, is to keep this single thread idle as much as possible. By idle I mean open for other work. If your code is fully asynchronous, then handling 4 concurrent users (or even 400) shouldn't be a problem, even for a single thread.
Now, in regards to your initial problem of ports: Once a request is received on a given port, node.js execute whatever code you have written for it, until it encounters an asynchronous operation as soon as that happens, it is available to to pick up more requests on the same port.
The second problem you inquire about, is the database operation. In this case, node-js would send the query to the database (which takes no time at all) and the database does that actual execution of the query. In the meantime, node is free to do whatever it wants, until the database is finished, and lets node know there is a result to fetch.
You can recognize async operations by their structure: my_function(..., ..., callback). Function that uses a callback function, is in most cases asynch.
So bottom line: Don't worry about the problems around blocking IO, as you will hardly encounter any in node. Use a single port if you want (By creating multiple child processes, you can even have multiple node instances on the same port).
Hope this explains it good enough. If you have any further questions, let me know :)

Node.js App with API endpoints which take 20sec+ :: Connections Left Open :: How to Optimize?

I have a Node.js RESTful API returning JSON data. One of the API calls can (and frequently does) take 10 - 20 seconds to finish. This long RTT is due to connecting to external APIs, like DiffBot, MailChimp, Facebook, Twitter, etc. I wish I could make the API call shorter, but I cannot.
Of course, I've implemented the node code in a nice async way, but the problem is that the client's inbound connection (to the node app) is alive while it waits for the server to finish, and thus might be killing my performance. In fact, I'm currently guessing that this may explain my long-running timeout issue in node.
I've already increased maxSockets to a huge number...
require('http').globalAgent.maxSockets = 9999;
For the sake of interest, I'm printing out the active sockets each time a new connection is made (here's the code).
Which gives me output like this:
SOCKETS: {} { 'graph.facebook.com:443': 5, 'api.instagram.com:443': 1 }
Nothing too enlightening there. The max connections I ever see is around 20 or so, total, across all hosts. But this doesn't really tell me anything about incoming connections, or how to optimize them so that my server does not choke when there are many of them alive at once (which I suspect it is).
You should optimize your architecture, not just the code.
First, I would change the way the client/server interact with each other. The server should end the request upon recept and notify the client once all the tasks for that request are truly complete.
There are different ways to achieve that. For example, the client can query the stats of the request using AJAX (poll) every X seconds. Another example would be to use WebSocket.
If you're going with this approach, look into Socket.IO. It supports many transports with the same API, if WebSocket is available, it would use that, otherwise, it would fall back to other transports such as Flash Socket, long-polling, etc.
Second, you shouldn't use one process to do all this work. You should use a queue (preferably a messaging system that supports queues), then, run workers (separate processes) to do the "heavy lifting".
Personally, I use AMQP due to its features and portability (it's an open-standard) but feel free to use any other queue system with a persistant backend.
That way, if one or more process(es) crash(es) and you use the right queue, you wouldn't lose any data (such as the API tasks you mentioned).
Hope it helps.

Resources