I have a question that nobody seems to help with. How will this be handled in a production mode with thousands of requests at the same time?
I did a simple test case:
module.exports = {
index: function (req, res){
if (req.param('foo') == 'bar'){
async.series([
function(callback){
for (k=0; k <= 50000; k++){
console.log('did something stupid a few times');
}
callback();
}
], function(){
return res.json(null);
});
}else{
return res.view('homepage', {
});
}
}
};
Now if I go to http://localhost:1337/?foo=bar it will obviously wait a while before it responds. So if I now open a different session (other browser or incognito, and go to http://localhost:1337/ I am expecting a result immediately. Instead it is waiting for the other request to finish and only then it will let this request go through.
Therefore it is not asynchronous and it is a huge problem if I have as much as 2 ppl at the same time operating this app. I mean this app will have drop downs coming from databases, html files being served etc...
My question is this: how does one handle such an issue??? I hear the word "promises vs callbacks" - is this some sort of a solution to this?
I know about clustering, but that only separates the requests over the amount of cpu's, ultimately you will fix it by at most allowing 8 people at the same time without being blocked. It won;t handle 100 requests at the same time...
P.S. That test was to simplify the example, but think of someone uploading a file, a web service that goes to a different server, a point of sales payment terminal waiting for a user to input the pin, someone downloading a file from the app, etc...
nodejs is event driven and runs your Javascript as single threaded. So, as long as your code from the first request is sitting in that for loop, nodejs can't do anything else and won't get to the next event in the event queue so thus your second request has to wait for the first one to finish.
Now, if you used a true async operation such as setTimeout() instead of your big for loop, then nodejs could service other events while the first request was waiting for the setTimeout().
The general rule in nodejs is to avoid doing anything that takes a ton of CPU in your main nodejs app. If you are stuck with something CPU-intensive, then you're best to either run clusters (as many as CPUs you have) or move the CPU-intensive work to some sort of worker queue that is served by different processes and let the OS time slice those other processes while the main nodejs process stays free and ready to service new incoming requests.
My question is this: how does one handle such an issue??? I hear the word "promises vs callbacks" - is this some sort of a solution to this?
I know about clustering, but that only separates the requests over the amount of cpu's, ultimately you will fix it by at most allowing 8 people at the same time without being blocked. It won;t handle 100 requests at the same time...
Most of the time, a server process spends most of the time of a request doing things that are asynchronous in nodejs (reading files, talking to other servers, doing database operations, etc...) where the actual work is done outside the nodejs process. When that is the case, nodejs does not block and is free to work on other requests while the async operations from other requests are underway. The little bit of CPU time coordinating these operations can be helped further by clustering though it's probably worth testing a single process first to see if clustering is really needed.
P.S. That test was to simplify the example, but think of someone uploading a file, a web service that goes to a different server, a point of sales payment terminal waiting for a user to input the pin, someone downloading a file from the app, etc...
All the operations you mentioned here can be done truly asynchronously so they won't block your nodejs app like your for loop does so basically the for loop isn't a good simulation of any of this. You need to use a real async operation to simulate it. Real async operations do their work outside of the main nodejs thread and then just post an event to the event queue when they are done, allowing nodejs to do other things while the async operations are doing their work. That's the key.
Related
Backstory
I recently completed my Node.js skill assessment on LinkedIn, which is basically a small test. But i noticed a question with a weird answer, which i can not understand.
Question
How does it affect the performance of a web application when an execution path contains a CPU-heavy operation, such as calculating a long Fibonacci sequence?
Right answer
The current thread will block until the executon is completed and the operating system will spawn new threads to handle incoming requests. This can exhaust the number of allowed threads (255) and degrade performance over time.
Problem
I always thought that node is not creating a new threads unless you're using something like child-process or workers. I tried to create a simple express server with one endpoint, that is blocking execution forever:
app.get('/', (req, res) => {
console.log('recieved request');
while (true) {}
});
And when i tried to send multiple requests, only one log message has appeared, which means that new thread is not created to handle new request.
So, is this answer just incorrect (even though it seems as a correct one), or am i misunderstanding something?
I have created a single endpoint in Node.js.
Following is the end-point:
app.post('/processMyRequests',function(req,res){
switch(req.body.functionality) {
case "functionalityName1":
jsFileName1.functionA(req,res);
break;
case "functionalityName2":
jsFileName2.functionB(req,res);
break;
default:
res.send("Sorry for that");
break;
}
});
In each of these functions, calls to APIs are done, then the data is processed, and finally response is sent back.
My questions:
Since Node.js as a default handles requests asynchronously, can we have a single route for all the responses?
Will concurrency be an issue i.e. when parallel hits are happening into the single route will Node.js stall or slow down?
If the answer to question (2) is YES, how will it change when I have separate routes i.e if the same amount of requests come into a specific route then it is going to be the same issue right?
Would be happy if someone could share real-time use cases. Thanks
You technically can have a single route for all the responses, but it's considered "better-practice" to create endpoints which are compact, clear in what the intended function/purpose is, and not too complex; in your example, there could be many possible branches of code that the route could take. This requires unique logic for each branch, which adds to the complexity of your endpoints, and takes away from the clarity of the code. Imagine that when an error occurs, you now have to debug potentially multiple different files and different branches of your endpoint, when you could have created a separate endpoint for each unique "branch".
As your application grows in both size, and complexity, you are going to want an easy way to manage your API. Putting lots of stuff into one endpoint is going to be a nightmare for you to maintain, and debug.
It may be useful for you to look at some tutorials/docs about how to design and implement an API, here is a good article from Scotch.io
Example for question one:
// GET multiple records
app.get('/functionality1',function(req,res){
//Unique logic for functionality
});
// GET a record by an 'id' field
app.get('/functionality1/:id',function(req,res){
//Unique logic for functionality
});
// POST a new record
app.post('/functionality1',function(req,res){
//Unique logic for functionality
});
// PUT (update) a record
app.put('/functionality1',function(req,res){
//Unique logic for functionality
});
// DELETE a record
app.delete('/functionality1',function(req,res){
//Unique logic for functionality
});
app.get('/functionality2',function(req,res){
//Unique logic for functionality
});
...
This gives you a much clearer idea of what is happening for each endpoint, versus having to digest a lot of technically unrelated logic in a single API endpoint. Summing it up, it's better to have endpoints which are clear and concise in their purpose, and scope.
It really depends on how the logic is implemented; obviously Node.js is single-threaded. This means it can only process 1 "stream" of code at a time (no true concurrency or parallelism). However, Node gets around this through its event-loop. The problem that you could see depends on if you wrote asynchronous (non-blocking) code, or synchronous (blocking) code. In Node it's almost always better and recommended to write non-blocking code. This helps to prevent blocking the event loop, meaning your node app can do other things while, for example waiting for a file to finish being read, an API call to finish, or a promise to resolve. Writing blocking code will result in your application bottle-necking/"hanging", which is perceived by your end-users as higher-latency
Having multiple routes, or a single route isn't going to resolve this problem. It's more about how you are utilizing (or not utilizing) the event loop. It's extremely important to use asynchronous code as much as possible.
One thing that you can do if you absolutely must use synchronous code (this is actually a good approach to leverage regardless of code synchronicity)is to implement a microservice architecture, where a service can process your blocking (or resource-intensive) code off of your API Node service. This frees up your API service to handle requests as rapidly as possible, and leave the heavy lifting to other services.
Another possibility is to leverage clustering. This gives you the ability to run node as if it were multi-threaded, by spawning "worker" processes, which are identical to your master process, with the difference in that they are able to process work individually. This type of approach is extremely useful if you expect that you will have a very busy API service.
Some extremely helpful resources:
Node.js Express Best Practices
A GREAT video explaining the event-loop
Parallelism vs. Concurrency in Node.js
Node.js Clustering
API Design
I want this kind of structure;
express backend gets a request and runs a function, this function will get data from different apis and saves it to db. Because this could takes minutes i want it to run parallel while my web server continues to processing requests.
i want this because of this scenario:
user has dashboard, after logs in app starts to collect data from apis and preparing the dashboard for user, at that time user can navigate through site even can close the browser but the function has to run until it finishes fetching data.Once it finishes, all data will be saved db and the dashboard is ready for user.
how can i do this using child_process or any kind of structure in nodejs?
Since what you're describing is all async I/O (networking or disk) and is not CPU intensive, you don't need multiple child processes in order to effectively serve multiple requests. This is the beauty of node.js. With async I/O, node.js can be working on many different requests over the same period of time.
Let's supposed part of your process is downloading an image. Your node.js code sends a request to fetch an image. That request is sent off via TCP. Immediately, there is nothing else to do on that request. It's winging it's way to the destination server and the destination server is preparing the response. While all that is going on, your node.js server is completely free to pull other events from it's event queue and start working on other requests. Those other requests do something similar (they start async operations and then wait for events to happen sometime later).
Your server might get 10 different async operations started and "in flight" before the first one actually starts getting a response. When a response starts coming in, the system puts an event into the node.js event queue. When node.js has a moment between other requests, it pulls the next event out of the event queue and processes it. If the processing has further async operations (like saving it to disk), the whole async and event-driven process starts over again as node.js requests a write to disk and node.js is again free to serve other events. In this manner, events are pulled from the event queue one at a time as they become available and lots of different operations can all get worked on in the idle time between async operations (of which there is a lot).
The only thing that upsets the apple cart and ruins the ability of node.js to juggle lots of different things all at once is an operation that takes a lot of CPU cycles (like say some unusually heavy duty crypto). If you had something like that, it would "hog" too much of the CPU and the CPU couldn't be effectively shared among lots of other operations. If that were the case, then you would want to move the CPU-intensive operations to a group of child processes. But, just doing async I/O (disk, networking, other hardware ports, etc...) does not hog the CPU - in fact it barely uses much node.js CPU.
So, the next question is often "how do I know if I have too much stuff that uses the CPU". The only way to really know is to just code your server properly using async I/O and then measure its performance under load and see how things go. If you're doing async things appropriately and the CPU still spikes to 100%, then you have too much CPU load and you'll want to either use generic clustering or move specific CPU-heavy operations to a group of child processes.
I am currently developing a node js app with a REST API that exposes data from a mongo db.
The application needs to update some data every 5 minutes by calling an external service (could take more than one minute to get the new data).
I decided to isolate this task into a child_process but I am not sure about what should I need put in this child process :
Only the function to be executed. The schedule is managed by the main process.
Having a independent process that auto-refresh data every 5 minute and send a message to main process every time the refresh is done.
I don't really know if there is a big cost to start a new child process every 5 minutes or if I should use only one long time running child process or if I am overthinking the problem ^^
EDIT - Inforamtion the update task
the update task can take up than one minute but it consists in many smaller tasks (gathering information from many external providers) than run asynchronously do many I don't even need a child process ?
Thanks !
Node.js has an event-driven architecture capable of handling asynchronous calls hence it is unlike your typical C++ program where you will go with a multi-threaded/process architecture.
For your use-case I'm thinking maybe you can make use of the setInterval to repeatedly perform an operation which you can define more tiny async calls through using some sort of promises framework like bluebirdJS?
For more information see:
setInterval: https://developer.mozilla.org/en-US/docs/Web/API/WindowTimers/setInterval
setInterval()
Repeatedly calls a function or executes a code snippet, with a fixed
time delay between each call. Returns an intervalID.
Sample code:
setInterval(function() {
console.log("I was executed");
}, MILLISECONDS_IN_FIVE_MINUTE);
Promises:
http://bluebirdjs.com/docs/features.html
Sample code:
new Promise(function(resolve, reject) {
updateExternalService(data)
.then(function(response) {
return this.parseExtResp(response);
})
.then(function(parsedResp) {
return this.refreshData(parsedResp);
})
.then(function(returnCode) {
console.log("yay updated external data source and refreshed");
return resolve();
})
.catch(function(error) {
// Handle error
console.log("oops something went wrong ->" + error.message);
return reject();
});
}
It does not matter the total clock time that it takes to get data from an external service as long as you are using asynchronous requests. What matters is how much CPU you are using in doing so. If the majority of the time is waiting for the external service to respond or to send the data, then your node.js server is just sitting idle most of the time and you probably do not need a child process.
Because node.js is asynchronous, it can happily have many open requests that are "in flight" that it is waiting for responses to and that takes very little system resources.
Because node.js is single threaded, it is CPU usage that typically drives the need for a child process. If it takes 5 minutes to get a response from an external service, but only 50ms of actual CPU time to process that request and do something with it, then you probably don't need a child process.
If it were me, I would separate out the code for communicating with the external service into a module of its own, but I would not add the complexity of a child process until you actually have some data that such a change is needed.
I don't really know if there is a big cost to start a new child
process every 5 minutes or if I should use only one long time running
child process or if I am overthinking the problem
There is definitely some cost to starting up a new child process. It's not huge, but if you're going to be doing it every 5 minutes and it doesn't take a huge amount of memory, then it's probably better to just start up the child process once, have it manage the scheduling of communicating with the external service entirely upon it's own and then it can communicate back results to your other node.js process as needed. This makes the 2nd node process much more self-contained and the only point of interaction between the two processes is to communicate an update. This separation of function and responsibility is generally considered a good thing. In a multi-developer project, you could more easily have different developers working on each app.
It depends on how cohesion between your app and the auto refresh task.
If the auto refresh task can running standalone, without interaction with your app, then it better to start your task as a new process. Use child_process directly is not a good idea, spawn/monitor/respawn child process is tricky, you can use crontab or pm2 to manage it.
If auto refresh task depends on your app, you can use child_process directly, send message to it for schedule. But first try to break this dependency, this will simplify your app, easy to deployment and maintain separately. Child process is long running or one shot is not a question until you have hundreds of such task running on one machine.
I have been just learning redis and node.js There are two questions I have for which I couldn't find any satisfying answer.
My first question is about reusing redis clients within the node.js. I have found this question and answer: How to reuse redis connection in socket.io? , but it didn't satisfy me enough.
Now, if I create the redis client within the connection event, it will be spawned for each connection. So, if I have 20k concurrent users, there will be 20k redis clients.
If I put it outside of the connection event, it will be spawned only once.
The answer is saying that he creates three clients for each function, outside of the connection event.
However, from what I know MySQL that when writing an application which spawns child processes and runs in parallel, you need to create your MySQL client within the function in which you are creating child instances. If you create it outside of it, MySQL will give an error of "MySQL server has gone away" as child processes will try to use the same connection. It should be created for each child processes separately.
So, even if you create three different redis clients for each function, if you have 30k concurrent users who send 2k messages concurrently, you should run into the same problem, right? So, every "user" should have their own redis client within the connection event. Am I right? If not, how node.js or redis handles concurrent requests, differently than MySQL? If it has its own mechanism and creates something like child processes within the redis client, why we need to create three different redis clients then? One should be enough.
I hope the question was clear.
-- UPDATE --
I have found an answer for the following question. http://howtonode.org/control-flow
No need to answer but my first question is still valid.
-- UPDATE --
My second question is this. I am also not that good at JS and Node.js. So, from what I know, if you need to wait for an event, you need to encapsulate the second function within the first function. (I don't know the terminology yet). Let me give an example;
socket.on('startGame', function() {
getUser();
socket.get('game', function (gameErr, gameId) {
socket.get('channel', function (channelErr, channel) {
console.log(user);
client.get('games:' + channel + '::' + gameId + ':owner', function (err, owner) { //games:channel.32:game.14
if(owner === user.uid) {
//do something
}
});
}
});
});
So, if I am learning it correctly, I need to run every function within the function if I need to wait I/O answer. Otherwise, node.js's non-blocking mechanism will allow the first function to run, in this case it will get the result in parallel, but the second function might not have the result if it takes time to get. So, if you are getting a result from redis for example, and you will use the result within the second function, you have to encapsulate it within the redis get function. Otherwise second function will run without getting the result.
So, in this case, if I need to run 7 different functions and the 8. function will need the result of all of them, do I need to write them like this, recursively? Or am I missing something.
I hope this was clear too.
Thanks a lot,
So, every "user" should have their own redis client within the connection event.
Am I right?
Actually, you are not :)
The thing is that node.js is very unlike, for example, PHP. node.js does not spawn child processes on new connections, which is one of the main reasons it can easily handle large amounts of concurrent connections, including long-lived connections (Comet, Websockets, etc.). node.js processes events sequentially using an event queue within one single process. If you want to use several processes to take advantage of multi-core servers or multiple servers, you will have to do it manually (how to do so is beyond the scope of this question, though).
Therefore, it is a perfectly valid strategy to use one single Redis (or MySQL) connection to serve a large quantity of clients. This avoids the overhead of instantiating and terminating a database connection for each client request.
So, every "user" should have their own redis client within the
connection event. Am I right?
You shouldn't make a new Redis client for each connected user, that's not the proper way to do it. Instead just create 2-3 clients max and use them.
For more information checkout this question:
How to reuse redis connection in socket.io?
As for the first question:
The "right answer" might make you think you are good with one Connection.
In reality, whenever you are doing something that is waiting on an IO, a timer, etc, you are actually making node run the waiting method on the queue. Hence, if you use only 1 single connection, you will actually limit the performance of the thread you working on ( a single CPU) to the speed of redis - which is probably a few hundreds of callbacks per second (non-redis waiting callbacks will still go on) - while this is not poor performance, there's no reason to create this kind of limitation. It is recommended to create a few (5-10) connections to avoid this issue in it's entire. This number goes up for slower databases, e.g. MySQL, but is dependant on the type of queries and the code specifics.
Do note, that you should run a few workers on your server, per the number of CPUs you have, for best performance.
In regards to the 2nd Question:
It is a much better practice, to name the functions, one after the other, and use the names in the code rather than defining it as you go. In some situations, it will reduce memory consumption.