Redis and Node.js and Socket.io Questions - node.js

I have been just learning redis and node.js There are two questions I have for which I couldn't find any satisfying answer.
My first question is about reusing redis clients within the node.js. I have found this question and answer: How to reuse redis connection in socket.io? , but it didn't satisfy me enough.
Now, if I create the redis client within the connection event, it will be spawned for each connection. So, if I have 20k concurrent users, there will be 20k redis clients.
If I put it outside of the connection event, it will be spawned only once.
The answer is saying that he creates three clients for each function, outside of the connection event.
However, from what I know MySQL that when writing an application which spawns child processes and runs in parallel, you need to create your MySQL client within the function in which you are creating child instances. If you create it outside of it, MySQL will give an error of "MySQL server has gone away" as child processes will try to use the same connection. It should be created for each child processes separately.
So, even if you create three different redis clients for each function, if you have 30k concurrent users who send 2k messages concurrently, you should run into the same problem, right? So, every "user" should have their own redis client within the connection event. Am I right? If not, how node.js or redis handles concurrent requests, differently than MySQL? If it has its own mechanism and creates something like child processes within the redis client, why we need to create three different redis clients then? One should be enough.
I hope the question was clear.
-- UPDATE --
I have found an answer for the following question. http://howtonode.org/control-flow
No need to answer but my first question is still valid.
-- UPDATE --
My second question is this. I am also not that good at JS and Node.js. So, from what I know, if you need to wait for an event, you need to encapsulate the second function within the first function. (I don't know the terminology yet). Let me give an example;
socket.on('startGame', function() {
getUser();
socket.get('game', function (gameErr, gameId) {
socket.get('channel', function (channelErr, channel) {
console.log(user);
client.get('games:' + channel + '::' + gameId + ':owner', function (err, owner) { //games:channel.32:game.14
if(owner === user.uid) {
//do something
}
});
}
});
});
So, if I am learning it correctly, I need to run every function within the function if I need to wait I/O answer. Otherwise, node.js's non-blocking mechanism will allow the first function to run, in this case it will get the result in parallel, but the second function might not have the result if it takes time to get. So, if you are getting a result from redis for example, and you will use the result within the second function, you have to encapsulate it within the redis get function. Otherwise second function will run without getting the result.
So, in this case, if I need to run 7 different functions and the 8. function will need the result of all of them, do I need to write them like this, recursively? Or am I missing something.
I hope this was clear too.
Thanks a lot,

So, every "user" should have their own redis client within the connection event.
Am I right?
Actually, you are not :)
The thing is that node.js is very unlike, for example, PHP. node.js does not spawn child processes on new connections, which is one of the main reasons it can easily handle large amounts of concurrent connections, including long-lived connections (Comet, Websockets, etc.). node.js processes events sequentially using an event queue within one single process. If you want to use several processes to take advantage of multi-core servers or multiple servers, you will have to do it manually (how to do so is beyond the scope of this question, though).
Therefore, it is a perfectly valid strategy to use one single Redis (or MySQL) connection to serve a large quantity of clients. This avoids the overhead of instantiating and terminating a database connection for each client request.

So, every "user" should have their own redis client within the
connection event. Am I right?
You shouldn't make a new Redis client for each connected user, that's not the proper way to do it. Instead just create 2-3 clients max and use them.
For more information checkout this question:
How to reuse redis connection in socket.io?

As for the first question:
The "right answer" might make you think you are good with one Connection.
In reality, whenever you are doing something that is waiting on an IO, a timer, etc, you are actually making node run the waiting method on the queue. Hence, if you use only 1 single connection, you will actually limit the performance of the thread you working on ( a single CPU) to the speed of redis - which is probably a few hundreds of callbacks per second (non-redis waiting callbacks will still go on) - while this is not poor performance, there's no reason to create this kind of limitation. It is recommended to create a few (5-10) connections to avoid this issue in it's entire. This number goes up for slower databases, e.g. MySQL, but is dependant on the type of queries and the code specifics.
Do note, that you should run a few workers on your server, per the number of CPUs you have, for best performance.
In regards to the 2nd Question:
It is a much better practice, to name the functions, one after the other, and use the names in the code rather than defining it as you go. In some situations, it will reduce memory consumption.

Related

If Redis is single Threaded, how can it be so fast?

I'm currently trying to understand some basic implementation things of Redis. I know that redis is single-threaded and I have already stumbled upon the following Question: Redis is single-threaded, then how does it do concurrent I/O?
But I still think I didn't understood it right. Afaik Redis uses the reactor pattern using one single thread. So If I understood this right, there is a watcher (which handles FDs/Incoming/outgoing connections) who delegates the work to be done to it's registered event handlers. They do the actual work and set eg. their responses as event to the watcher, who transfers the response back to the clients. But what happens if a request (R1) of a client takes lets say about 1 minute. Another Client creates another (fast) request (R2). Then - since redis is single threaded - R2 cannot be delegated to the right handler until R1 is finished, right? In a multithreade environment you could just start each handler in a single thread, so the "main" Thread is just accepting and responding to io connections and all other work is carried out in own threads.
If it really just queues the io handling and handler logic, it could never be as fast it is. What am I missing here?
You're not missing anything, besides perhaps the fact that most operations in Redis complete in less than a ~millisecond~ couple of microseconds. Long running operations indeed block the server during their execution.
Let’s say if there were 10,000 users doing live data pulling with 10 seconds each on hmget, and on the other side, server were broadcasting using hmset, redis can only issue the set at the last available queue.
Redis is only good for queuing and handle limited processing like inserting lazy last login info, but not for live info broadcasting, in this case, memcached will be the right choice. Redis is single threaded, like FIFO.

port blocking for multiple user requests

I have a question that nobody seems to help with. How will this be handled in a production mode with thousands of requests at the same time?
I did a simple test case:
module.exports = {
index: function (req, res){
if (req.param('foo') == 'bar'){
async.series([
function(callback){
for (k=0; k <= 50000; k++){
console.log('did something stupid a few times');
}
callback();
}
], function(){
return res.json(null);
});
}else{
return res.view('homepage', {
});
}
}
};
Now if I go to http://localhost:1337/?foo=bar it will obviously wait a while before it responds. So if I now open a different session (other browser or incognito, and go to http://localhost:1337/ I am expecting a result immediately. Instead it is waiting for the other request to finish and only then it will let this request go through.
Therefore it is not asynchronous and it is a huge problem if I have as much as 2 ppl at the same time operating this app. I mean this app will have drop downs coming from databases, html files being served etc...
My question is this: how does one handle such an issue??? I hear the word "promises vs callbacks" - is this some sort of a solution to this?
I know about clustering, but that only separates the requests over the amount of cpu's, ultimately you will fix it by at most allowing 8 people at the same time without being blocked. It won;t handle 100 requests at the same time...
P.S. That test was to simplify the example, but think of someone uploading a file, a web service that goes to a different server, a point of sales payment terminal waiting for a user to input the pin, someone downloading a file from the app, etc...
nodejs is event driven and runs your Javascript as single threaded. So, as long as your code from the first request is sitting in that for loop, nodejs can't do anything else and won't get to the next event in the event queue so thus your second request has to wait for the first one to finish.
Now, if you used a true async operation such as setTimeout() instead of your big for loop, then nodejs could service other events while the first request was waiting for the setTimeout().
The general rule in nodejs is to avoid doing anything that takes a ton of CPU in your main nodejs app. If you are stuck with something CPU-intensive, then you're best to either run clusters (as many as CPUs you have) or move the CPU-intensive work to some sort of worker queue that is served by different processes and let the OS time slice those other processes while the main nodejs process stays free and ready to service new incoming requests.
My question is this: how does one handle such an issue??? I hear the word "promises vs callbacks" - is this some sort of a solution to this?
I know about clustering, but that only separates the requests over the amount of cpu's, ultimately you will fix it by at most allowing 8 people at the same time without being blocked. It won;t handle 100 requests at the same time...
Most of the time, a server process spends most of the time of a request doing things that are asynchronous in nodejs (reading files, talking to other servers, doing database operations, etc...) where the actual work is done outside the nodejs process. When that is the case, nodejs does not block and is free to work on other requests while the async operations from other requests are underway. The little bit of CPU time coordinating these operations can be helped further by clustering though it's probably worth testing a single process first to see if clustering is really needed.
P.S. That test was to simplify the example, but think of someone uploading a file, a web service that goes to a different server, a point of sales payment terminal waiting for a user to input the pin, someone downloading a file from the app, etc...
All the operations you mentioned here can be done truly asynchronously so they won't block your nodejs app like your for loop does so basically the for loop isn't a good simulation of any of this. You need to use a real async operation to simulate it. Real async operations do their work outside of the main nodejs thread and then just post an event to the event queue when they are done, allowing nodejs to do other things while the async operations are doing their work. That's the key.

When is blocking code acceptable in node.js?

I know that blocking code is discouraged in node.js because it is single-threaded. My question is asking whether or not blocking code is acceptable in certain circumstances.
For example, if I was running an Express webserver that requires a MongoDB connection, would it be acceptable to block the event loop until the database connection was established? This is assuming that all pages served by Express require a database query (which would fail if MongoDB was not initialized).
Another example would be an application that requires the contents of a configuration file before being initializing. Is there any benefit in using fs.readFile over fs.readFileSync in this case?
Is there a way to work around this? Is wrapping all the code in a callback or promise the best way to go? How would that be different from using blocking code in the above examples?
It is really up to you to decide what is acceptable. And you would do that by determining what the consequences of blocking would be ... on a case-by-case basis. That analysis would take into account:
how often it occurs,
how long the event loop is likely to be blocked, and
the impact that blocking in that context will have on usability1.
Obviously, there are ways to avoid blocking, but these tend to add complexity to your application. Really, you need to decide ... on a case-by-case basis ... whether that added complexity is warranted.
Bottom line: >>you<< need to decide what is acceptable based on your understanding of your application and your users.
1 - For example, in a game it would be more acceptable to block the UI while switching "levels" than during active play. Or for a general web service, "once off" blocking while a config file is loaded or a DB connection is established during webserver startup is more acceptable that if this happened on every request.
From my experience most tasks should be handled in a callback or by returning a promise. You DO NOT want to block code in a Node application. That's what makes it so nice! Mostly with MongoDB it will crash before it has a chance to connect if there is no connection. It won't' really have an effect on an API call because your server will be dead!
Source: I'm a developer at a bootcamp that teaches MEAN stack.
Your two examples are completely different. The distinction actually answers the question in and of itself.
Grabbing data from a database is dependent on being connected to that database. Any code that is dependent upon that data is then dependent upon that connection. These things have to happen serially for the app to function and be meaningful.
On the other hand, readFileSync will block ALL code, not just code that is reliant on it. You could start reading a csv file while simultaneously establishing a database connection. Once both are done, you could add that csv data to the database.

How can I "break up" a long running server side function in a Meteor app?

I have, as part of a meteor application, a server side that gets POST messages of information to feed to the web client via inserts/updates to a Collection. So far so good. However, sometimes these updates can be rather large (50K records a go, every 5 seconds). I was having a hard time keeping up to this until I started using batch-insert package and then low-level batch.find.update() and batch.execute() from Mongo.
However, there is still a good amount of processing going on even with 50K records (it does some calculations, analytics, etc). I would LOVE to be able to "thread" that logic so the main event loop can continue along. However, I am not sure there is a real easy way to create "real" threads for this within Meteor. So baring that, I would like to know the best / proper way of at least "batching" the work so that every N (say 1K or so) records I can release the event loop back to process other events (like some client side DDP messages and the like). Then do another 1K records, etc. until however many records as I need are done.
I am THINKING the solution lies within using Fibers/Futures -- which appear to be the Meteor way -- but I am not positive that is correct or the low level ideas like "setTimeout()" and/or "setImmediate()" are more appropriate.
TIA!
Meteor is not a one size fits all tool. I think you should decouple your meteor application from your batch processing. Set up a separate meteor instance, or better yet set up a pure node.js server to handle these requests and batch processes. It would look like this:
Create a node.js instance that connects to the same mongo database using the mongodb plugin (https://www.npmjs.com/package/mongodb).
Use express if you're using node.js to handle the post requests (https://www.npmjs.com/package/express).
Do the batch processing/inserts/updates in this instance.
The updates in mongo will be reflected in meteor very quickly. I had a similar situation and used a node server to do some batch data collection and then pass it into a cassandra database. I then used pig latin to run some batch operations on that data, and then inserted it into mongo. My meteor application would reactively display the new data pretty much instantaneously.
You can call this.unblock() inside a server method to allow the code to run in the background, and immediately return from the method. See example below.
Meteor.methods({
longMethod: function() {
this.unblock();
Meteor._sleepForMs(1000 * 60 * 60);
}
});

Implementing general purpose long polling

I've been trying to implement a simple long polling service for use in my own projects and maybe release it as a SAAS if I succeed. These are the two approaches I've tried so far, both using Node.js (polling PostgreSQL in the back).
1. Periodically check all the clients in the same interval
Every new connection is pushed onto a queue of connections, which is being walked through in an interval.
var queue = [];
function acceptConnection(req, res) {
res.setTimeout(5000);
queue.push({ req: req, res: res });
}
function checkAll() {
queue.forEach(function(client) {
// respond if there is something new for the client
});
}
// this could be replaced with a timeout after all the clients are served
setInterval(checkAll, 500);
2. Check each client at a separate interval
Every client gets his own ticker which checks for new data
function acceptConnection(req, res) {
// something which periodically checks data for the client
// and responds if there is anything new
new Ticker(req, res);
}
While this keeps the minimum latency for each client lower, it also introduces overhead by setting a lot of timeouts.
Conclusion
Both of these approaches solve the problem quite easily, but I don't feel that this will scale up easily to something like 10 million open connections, especially since I'm polling the database on every check for every client.
I thought about doing this without the database and just immediately broadcast new messages to all open connections, but that will fail if a client's connection dies for a few seconds while the broadcast is happening, because it is not persistent. Which means I basically need to be able to look up messages in history when the client polls for the first time.
I guess one step up here would be to have a data source where I can subscribe to new data coming in (CouchDB change notifications?), but maybe I'm missing something in the big picture here?
What is the usual approach for doing highly scalable long polling? I'm not specifically bound to Node.js, I'd actually prefer any other suggestion with a reasoning why.
Not sure if this answers your question, but I like the approach of PushPin (+ explanation of concepts).
I love the idea (using reverse proxy and communicating with return codes + delayed REST return requests), but I do have reservations about the implementation. I might be underestimating the problem, but is seems to me that the technologies used are a bit on an overkill. Not sure if I will use it or not yet, would prefer a more lightweight solution, but I find the concept phenomenal.
Would love to hear what you used eventually.
Since you mentioned scalability, I have to get a little bit theoretical, as the only practical measure is load testing. Therefore, all I can offer is advice.
Generally speaking, once-per anything is bad for scalability. Especially once-per-connection or once-per-request since that makes part of your app proportional to the amount of traffic. Node.js removed the thread-per-connection dependency with its single-threaded asynchronous I/O model. Of course, you can't completely eliminate having something per-connection, like a request and response object and a socket.
I suggest avoiding anything that opens a database connection for every HTTP connection. This is what connections pools are for.
As for choosing between your two options above, I would personally go for the second choice because it keeps each connection isolated. The first option uses a loop over connections, which means actual execution time per connection. It's probably not a big deal given that I/O is asynchronous, but given a choice between an iteration-per-connection and the mere existence of an object-per-connection, I would prefer to just have an object. Then I have less to worry about when suddenly there are 10,000 connections.
The C10K problem seems like a good reference for this, though this is really personal judgement to be honest.
http://www.kegel.com/c10k.html
http://en.wikipedia.org/wiki/C10k_problem

Resources