Implementing general purpose long polling - node.js

I've been trying to implement a simple long polling service for use in my own projects and maybe release it as a SAAS if I succeed. These are the two approaches I've tried so far, both using Node.js (polling PostgreSQL in the back).
1. Periodically check all the clients in the same interval
Every new connection is pushed onto a queue of connections, which is walked through at a fixed interval.
var queue = [];
function acceptConnection(req, res) {
    res.setTimeout(5000);
    queue.push({ req: req, res: res });
}
function checkAll() {
    queue.forEach(function(client) {
        // respond if there is something new for the client
    });
}
// this could be replaced with a timeout after all the clients are served
setInterval(checkAll, 500);
2. Check each client at a separate interval
Every client gets its own ticker, which checks for new data.
function acceptConnection(req, res) {
    // something which periodically checks data for the client
    // and responds if there is anything new
    new Ticker(req, res);
}
While this keeps the minimum latency for each client lower, it also introduces overhead by setting a lot of timeouts.
Conclusion
Both of these approaches solve the problem quite easily, but I don't feel that this will scale up easily to something like 10 million open connections, especially since I'm polling the database on every check for every client.
I thought about doing this without the database and just immediately broadcast new messages to all open connections, but that will fail if a client's connection dies for a few seconds while the broadcast is happening, because it is not persistent. Which means I basically need to be able to look up messages in history when the client polls for the first time.
I guess one step up here would be to have a data source where I can subscribe to new data coming in (CouchDB change notifications?), but maybe I'm missing something in the big picture here?
What is the usual approach for doing highly scalable long polling? I'm not specifically bound to Node.js, I'd actually prefer any other suggestion with a reasoning why.

Not sure if this answers your question, but I like the approach of PushPin (+ explanation of concepts).
I love the idea (using a reverse proxy and communicating with return codes + delayed REST return requests), but I do have reservations about the implementation. I might be underestimating the problem, but it seems to me that the technologies used are a bit of an overkill. I'm not sure yet whether I will use it; I would prefer a more lightweight solution, but I find the concept phenomenal.
Would love to hear what you used eventually.

Since you mentioned scalability, I have to get a little bit theoretical, as the only practical measure is load testing. Therefore, all I can offer is advice.
Generally speaking, once-per anything is bad for scalability. Especially once-per-connection or once-per-request since that makes part of your app proportional to the amount of traffic. Node.js removed the thread-per-connection dependency with its single-threaded asynchronous I/O model. Of course, you can't completely eliminate having something per-connection, like a request and response object and a socket.
I suggest avoiding anything that opens a database connection for every HTTP connection. This is what connection pools are for.
As for choosing between your two options above, I would personally go for the second choice because it keeps each connection isolated. The first option uses a loop over connections, which means actual execution time per connection. It's probably not a big deal given that I/O is asynchronous, but given a choice between an iteration-per-connection and the mere existence of an object-per-connection, I would prefer to just have an object. Then I have less to worry about when suddenly there are 10,000 connections.
The C10K problem seems like a good reference for this, though this is really personal judgement to be honest.
http://www.kegel.com/c10k.html
http://en.wikipedia.org/wiki/C10k_problem
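To make the "subscribe instead of poll" idea from the question concrete, here is a minimal in-process sketch, not a production design. It keeps a message history so a reconnecting client can catch up, and it wakes waiting requests when a message arrives instead of checking every client on a timer. The names MessageLog, append, since and waitFor are made up for this illustration, and the in-memory array is only a stand-in for whatever persistent store you actually use.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: an append-only log with "wake me when something new arrives".
class MessageLog {
  constructor() {
    this.messages = [];               // history, so late clients can catch up
    this.emitter = new EventEmitter();
    this.emitter.setMaxListeners(0);  // many requests may be waiting at once
  }

  // Store a message with a sequential id (1, 2, 3, ...) and wake all waiters.
  append(body) {
    const message = { id: this.messages.length + 1, body };
    this.messages.push(message);
    this.emitter.emit('message');
    return message;
  }

  // Everything with an id greater than lastSeenId.
  since(lastSeenId) {
    return this.messages.slice(lastSeenId);
  }

  // Resolve immediately if there is unseen history, otherwise wait for the
  // next message or resolve with an empty array after timeoutMs.
  waitFor(lastSeenId, timeoutMs) {
    const pending = this.since(lastSeenId);
    if (pending.length > 0) return Promise.resolve(pending);

    return new Promise((resolve) => {
      const onMessage = () => {
        clearTimeout(timer);
        resolve(this.since(lastSeenId));
      };
      const timer = setTimeout(() => {
        this.emitter.removeListener('message', onMessage);
        resolve([]);
      }, timeoutMs);
      this.emitter.once('message', onMessage);
    });
  }
}

// Usage inside an HTTP handler (the req/res wiring is up to you):
//   const messages = await log.waitFor(lastSeenId, 25000);
//   res.end(JSON.stringify(messages));
```

With this shape there is still one object per open connection (the pending promise), but no per-connection timer loop and no database query per check. Across several servers the same role would have to be played by a shared notification source, which this sketch does not cover.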

Related

NodeJS Performance - Multiple routes vs Single routes

I have created a single endpoint in Node.js.
Following is the end-point:
app.post('/processMyRequests', function(req, res) {
    switch (req.body.functionality) {
        case "functionalityName1":
            jsFileName1.functionA(req, res);
            break;
        case "functionalityName2":
            jsFileName2.functionB(req, res);
            break;
        default:
            res.send("Sorry for that");
            break;
    }
});
In each of these functions, calls to APIs are done, then the data is processed, and finally response is sent back.
My questions:
1. Since Node.js handles requests asynchronously by default, can we have a single route for all the responses?
2. Will concurrency be an issue, i.e. when parallel hits arrive at the single route, will Node.js stall or slow down?
3. If the answer to question (2) is YES, how will it change when I have separate routes? If the same number of requests comes into a specific route, it is going to be the same issue, right?
Would be happy if someone could share real-time use cases. Thanks
You technically can have a single route for all the responses, but it's considered "better-practice" to create endpoints which are compact, clear in what the intended function/purpose is, and not too complex; in your example, there could be many possible branches of code that the route could take. This requires unique logic for each branch, which adds to the complexity of your endpoints, and takes away from the clarity of the code. Imagine that when an error occurs, you now have to debug potentially multiple different files and different branches of your endpoint, when you could have created a separate endpoint for each unique "branch".
As your application grows in both size, and complexity, you are going to want an easy way to manage your API. Putting lots of stuff into one endpoint is going to be a nightmare for you to maintain, and debug.
It may be useful for you to look at some tutorials/docs about how to design and implement an API, here is a good article from Scotch.io
Example for question one:
// GET multiple records
app.get('/functionality1', function(req, res) {
    // Unique logic for functionality
});
// GET a record by an 'id' field
app.get('/functionality1/:id', function(req, res) {
    // Unique logic for functionality
});
// POST a new record
app.post('/functionality1', function(req, res) {
    // Unique logic for functionality
});
// PUT (update) a record
app.put('/functionality1', function(req, res) {
    // Unique logic for functionality
});
// DELETE a record
app.delete('/functionality1', function(req, res) {
    // Unique logic for functionality
});
app.get('/functionality2', function(req, res) {
    // Unique logic for functionality
});
...
This gives you a much clearer idea of what is happening for each endpoint, versus having to digest a lot of technically unrelated logic in a single API endpoint. Summing it up, it's better to have endpoints which are clear and concise in their purpose, and scope.
It really depends on how the logic is implemented; obviously Node.js is single-threaded. This means it can only process one "stream" of your JavaScript code at a time (no true parallelism within a single process). However, Node gets around this through its event loop. The problem that you could see depends on whether you wrote asynchronous (non-blocking) code or synchronous (blocking) code. In Node it's almost always better, and recommended, to write non-blocking code. This helps to prevent blocking the event loop, meaning your Node app can do other things while, for example, waiting for a file to finish being read, an API call to finish, or a promise to resolve. Writing blocking code will result in your application bottlenecking or "hanging", which your end users perceive as higher latency.
Having multiple routes, or a single route isn't going to resolve this problem. It's more about how you are utilizing (or not utilizing) the event loop. It's extremely important to use asynchronous code as much as possible.
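A rough illustration of the difference, under made-up names (sumBlocking and sumInChunks are not from any library): the first function holds the event loop for the whole computation, while the second does the same work in slices and yields between them with setImmediate so other requests can be served in the gaps.

```javascript
// Blocking version: nothing else runs until the loop finishes.
function sumBlocking(n) {
  let total = 0;
  for (let i = 1; i <= n; i++) total += i;
  return total;
}

// Chunked version: same result, but the event loop gets a turn
// after every chunkSize iterations.
function sumInChunks(n, chunkSize = 100000) {
  return new Promise((resolve) => {
    let total = 0;
    let i = 1;
    function runChunk() {
      const end = Math.min(i + chunkSize - 1, n);
      for (; i <= end; i++) total += i;
      if (i > n) {
        resolve(total);
      } else {
        setImmediate(runChunk); // let pending I/O callbacks run first
      }
    }
    runChunk();
  });
}

// Usage:
//   sumInChunks(50000000).then((total) => console.log(total));
```

Chunking only spreads CPU work out; it does not make it faster. For genuinely heavy computation, moving the work off the API process is usually the better option.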
One thing that you can do if you absolutely must use synchronous code (this is actually a good approach to leverage regardless of code synchronicity) is to implement a microservice architecture, where a separate service processes your blocking (or resource-intensive) code off of your API Node service. This frees up your API service to handle requests as rapidly as possible and leaves the heavy lifting to other services.
Another possibility is to leverage clustering. This gives you the ability to run Node as if it were multi-threaded by spawning "worker" processes, which run the same code as your master process, with the difference that they are able to process work individually. This type of approach is extremely useful if you expect to have a very busy API service.
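A minimal sketch of that clustering setup using Node's built-in cluster module. The function name startClustered, the port and the response text are placeholders for illustration; a real service would also decide how to restart workers that exit.

```javascript
const cluster = require('cluster');
const http = require('http');
const os = require('os');

// Hypothetical entry point: fork one worker per CPU core, each running
// its own HTTP server on the same port.
function startClustered(port, workerCount = os.cpus().length) {
  // cluster.isPrimary exists on newer Node versions; older ones only have isMaster.
  const isPrimary =
    cluster.isPrimary !== undefined ? cluster.isPrimary : cluster.isMaster;

  if (isPrimary) {
    for (let i = 0; i < workerCount; i++) {
      cluster.fork();
    }
    cluster.on('exit', (worker, code) => {
      console.log(`worker ${worker.process.pid} exited with code ${code}`);
    });
  } else {
    // Each worker is a separate process with its own event loop.
    http
      .createServer((req, res) => {
        res.end(`handled by process ${process.pid}\n`);
      })
      .listen(port);
  }
}

// Usage (call once at the bottom of your entry file):
//   startClustered(3000);
```

Because workers are separate processes, they do not share in-memory state, so anything like sessions or counters has to live in an external store.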
Some extremely helpful resources:
Node.js Express Best Practices
A GREAT video explaining the event-loop
Parallelism vs. Concurrency in Node.js
Node.js Clustering
API Design

Web Api - Mutex Per User

I have an asp.net core Web Api application.
In my application I have a Web API method, and I want to prevent multiple requests from the same user from entering it simultaneously. I don't mind requests from different users running simultaneously.
I am not sure how to create the lock or where to put it. I thought about creating some kind of dictionary that contains the user id and locking on the item, but I don't think I'm getting it right. Also, what will happen if there is more than one server behind a load balancer?
Example:
Let's assume each registered user can run 10 long tasks each month. I need to check for each user whether he has exceeded his monthly limit. If the user sends many simultaneous requests to the server, he might be allowed to perform more than 10 operations. I understand that I need to put a lock on the method, but I do want to allow other users to perform this action simultaneously.
What you're asking for is fundamentally not how the Internet works. The HTTP and underlying IP protocols are stateless, meaning each request is supposed to run independent of any knowledge of what has occurred previously (or concurrently, as the case may be). If you're worried about excessive load, your best bet is to implement rate limiting/throttling tied to authentication. That way, once a user burns through their allotted requests, they're cut off. This will then have a natural side-effect of making the developers programming against your API more cautious about sending excessive requests.
Just to be a bit more thorough here, the chief problem with the approach you're suggesting is that I know of no way it can be practically implemented. You can use something like SemaphoreSlim to create a lock, but that needs to be static so that the same instance is used for each request. Being static is going to limit your ability to use a dictionary of them, which is what you'll need for this. It can technically be done, I suppose, but you'd have to use a ConcurrentDictionary and even then, there's no guarantee of single-thread additions. So, concurrent requests for the same user could load concurrent semaphores into it, which defeats the entire point. I suppose you could front-load the dictionary with a semaphore for each user from the start, but that could become a huge waste of resources, depending on your user base. Long and short, it's one of those things where when you're finding a solution this darn difficult, it's a good sign you're likely trying to do something you shouldn't be doing.
EDIT
After reading your example, I think this really just boils down to an issue of trying to handle the work within the request pipeline. When there's some long-running task to be completed or just some heavy work to be done, the first step should always be to pass it off to a background service. This allows you to return a response quickly. Web servers have a limited amount of threads to handle requests with, and you want to service the request and return a response as quickly as possible to keep from exhausting your threadpool.
You can use a library like Hangfire to handle your background work or you can implement an IHostedService as described here to queue work on. Once you have your background service ready, you would then just immediately hand off to that any time you get a request to this endpoint, and return a 202 Accepted response with a URL the client can hit to check the status. That solves your immediate issue of not wanting to allow a ton of requests to this long-running job to bring your API down. It's now essentially doing nothing more than just telling something else to do it and then returning immediately.
For the actual background work you'd be queuing, there, you can check the user's allowance and if they have exceeded 10 requests (your rate limit), you fail the job immediately, without doing anything. If not, then you can actually start the work.
If you like, you can also enable webhook support to notify the client when the job completes. You simply allow the client to set a callback URL that you should notify on completion, and then when you've finished the work in the background task, you hit that callback. It's on the client to handle things on their end to decide what happens when the callback is hit. They might for instance decide to use SignalR to send out a message to their own users/clients.
EDIT #2
I actually got a little intrigued by this. While I still think it's better for you to offload the work to a background process, I was able to create a solution using SemaphoreSlim. Essentially you just gate every request through the semaphore, where you'll check the current user's remaining requests. This does mean that other users must wait for this check to complete, but then you can release the semaphore and actually do the work. That way, at least, you're not blocking other users during the actual long-running job.
First, add a field to whatever class you're doing this in:
private static readonly SemaphoreSlim _semaphore = new SemaphoreSlim(1, 1);
Then, in the method that's actually being called:
await _semaphore.WaitAsync();
// get remaining requests for user
if (remaining > 0)
{
    // decrement remaining requests for user (this must be done before this next line)
    _semaphore.Release();
    // now do the work
}
else
{
    _semaphore.Release();
    // handle user out of requests (return error, etc.)
}
This is essentially a bottleneck. To do the appropriate check and decrementing, only one thread can go through the semaphore at a time. That means if your API gets slammed, requests will queue up and may take a while to complete. However, since this is probably just going to be something like a SELECT query followed by an UPDATE query, it shouldn't take that long for the semaphore to release. You should definitely do some load testing and watch it, though, if you're going to go this route.
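For comparison, the per-user variant of this idea (one queue per user id rather than one global gate) can be sketched as follows. This is written in JavaScript to match the other examples on this page, not ASP.NET Core code, and KeyedLock, tryStartTask and the remaining map are all invented for the illustration. It only serializes requests inside a single process, so it does not address the load-balancer case from the question.

```javascript
// Hypothetical helper: runs tasks for the same key one after another,
// while tasks for different keys proceed independently.
class KeyedLock {
  constructor() {
    this.tails = new Map(); // key -> promise for the last queued task
  }

  run(key, task) {
    const previous = this.tails.get(key) || Promise.resolve();
    // Start only after the previous task for this key has settled,
    // regardless of whether it succeeded or failed.
    const current = previous.catch(() => {}).then(() => task());
    this.tails.set(key, current);
    // Drop the entry once this key is idle so the map does not grow forever.
    const cleanup = () => {
      if (this.tails.get(key) === current) this.tails.delete(key);
    };
    current.then(cleanup, cleanup);
    return current;
  }
}

const locks = new KeyedLock();
const remaining = new Map([['alice', 1]]); // stand-in for a quota table

// Check and decrement the user's quota as one serialized step.
function tryStartTask(userId) {
  return locks.run(userId, async () => {
    const left = remaining.get(userId) || 0;
    // Simulated asynchronous quota lookup; this gap is where the race
    // would occur without the per-user lock.
    await new Promise((resolve) => setTimeout(resolve, 10));
    if (left <= 0) return false;
    remaining.set(userId, left - 1);
    return true;
  });
}

// Usage:
//   Promise.all([tryStartTask('alice'), tryStartTask('alice')])
//     .then((results) => console.log(results));
```

With several servers, the check-and-decrement would need to be made atomic in the shared database itself rather than in process memory.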

Node.js GPS device tracking performance considerations

Using Node.js as a TCP server, I am going to manage a relatively large number of GPS devices (~3000 devices). As a first step I am just going to store the incoming data in a database, but even in this phase I foresee some performance issues that bother me, and I'd like to catch them before they bite me.
1 - Looking at similar servers written in languages like Java or Ruby, I see code like the following:
java
Thread serverThread = new Thread(() -> {
    System.out.println("Listening to server port 9000");
    while (true) {
        try {
            Socket socket = serverSocket.accept();
            ...
ruby
require 'socket'
server = TCPServer.new ("127.0.0.1",8080)
loop do
  Thread.start(server.accept) do |client|
    ...
It seems they give a separate thread to every device (socket) that connects to the TCP server. As Node.js is single-threaded and asynchronous, should I be concerned about incoming connections, or will something like the following simple approach handle a large number of simultaneous connections?
net.createServer(function(device) {
    device.on('data', function(data) {
        // parse data
        // store in database
    });
});
2 - Should I limit database connections using a connection pool? As the database is also queried from the other side for GIS and monitoring, how large should the pool be?
3 - How could I benefit from caching (for example, using Redis) in such a system?
It would be great if someone could shed some light on these thoughts. I would also gladly hear about any other performance considerations you have experienced or are aware of when implementing such systems. Thanks.
Choosing among the options you have listed I would say NodeJS is actually a better option for your use case because it does not use one thread per connection like the other two options. Threads are normally a finite resource on a given machine. Java and Ruby do have 'evented' servers though and these are worth looking at if you want an apples to apples comparison.
I think you need to say more about the database you intend to use if you want advice on connection pooling. However reusing connections if they are costly to setup would be a good thing to do. It is probably a good idea to have the facility to configure the minimum and maximum size of the pool. Ultimately the correct size to use is a matter of testing.
I think the benefit of caching in this system would be minimal as you are mostly writing data. If the data is valuable you will want to write it to disk rather than memory. On the other hand, if you have clients that are reading the data collected perhaps caching their reads in something like Redis might be a good idea.
I'm sure you're aware, but this sounds like you're trying to prematurely optimize your application here.
1- Node being event-driven and non-blocking makes it a perfect candidate for holding a large number of open socket connections, with no need for forking per connection. As always though, make sure your application is properly clustered. I was able to hold ~100k open TCP sockets on a dirt cheap laptop. If the number of devices you need to support ever grows beyond that, just scale accordingly.
2- I saw you were planning on using postgres. Pools are always a good thing.
3- Caching is useful for 'hot' data: stuff that gets queried a lot, so having it in memory or inside Redis (in-memory storage) makes these data lookups faster and removes strain on the system. In your case, if you just need to get certain chunks of data, for analytics or for more casual use, I would recommend Spark or Solr as opposed to a plain caching layer. It's also going to be much cheaper and easier to maintain.
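For point 1, here is a small sketch of what the single-process TCP server from the question might look like once the "parse data" step is filled in. It assumes, purely for illustration, that devices send newline-terminated ASCII records; real GPS trackers use a variety of protocols. The names createLineParser, createTrackingServer and saveRecord are hypothetical.

```javascript
const net = require('net');

// TCP delivers a byte stream, so one 'data' event may contain half a record
// or several records. This buffers chunks and emits only complete lines.
function createLineParser(onLine) {
  let buffered = '';
  return function handleChunk(chunk) {
    buffered += chunk.toString('utf8'); // assumes ASCII-compatible payloads
    let newlineIndex;
    while ((newlineIndex = buffered.indexOf('\n')) !== -1) {
      const line = buffered.slice(0, newlineIndex).trim();
      buffered = buffered.slice(newlineIndex + 1);
      if (line.length > 0) onLine(line);
    }
  };
}

// One parser (a small closure) per device; no thread per connection.
function createTrackingServer(saveRecord) {
  return net.createServer((device) => {
    const parse = createLineParser((line) => {
      saveRecord(device.remoteAddress, line);
    });
    device.on('data', parse);
    device.on('error', () => device.destroy());
  });
}

// Usage (saveRecord would write to your database, ideally through a pool):
//   createTrackingServer((address, line) => console.log(address, line))
//     .listen(9000);
```

The per-connection cost here is a socket object plus a short string buffer, which is why a few thousand devices is a modest load for one Node process.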

When is blocking code acceptable in node.js?

I know that blocking code is discouraged in node.js because it is single-threaded. My question is asking whether or not blocking code is acceptable in certain circumstances.
For example, if I was running an Express webserver that requires a MongoDB connection, would it be acceptable to block the event loop until the database connection was established? This is assuming that all pages served by Express require a database query (which would fail if MongoDB was not initialized).
Another example would be an application that requires the contents of a configuration file before initializing. Is there any benefit in using fs.readFile over fs.readFileSync in this case?
Is there a way to work around this? Is wrapping all the code in a callback or promise the best way to go? How would that be different from using blocking code in the above examples?
It is really up to you to decide what is acceptable. And you would do that by determining what the consequences of blocking would be ... on a case-by-case basis. That analysis would take into account:
how often it occurs,
how long the event loop is likely to be blocked, and
the impact that blocking in that context will have on usability [1].
Obviously, there are ways to avoid blocking, but these tend to add complexity to your application. Really, you need to decide ... on a case-by-case basis ... whether that added complexity is warranted.
Bottom line: >>you<< need to decide what is acceptable based on your understanding of your application and your users.
[1] For example, in a game it would be more acceptable to block the UI while switching "levels" than during active play. Or for a general web service, "once off" blocking while a config file is loaded or a DB connection is established during webserver startup is more acceptable than if this happened on every request.
From my experience most tasks should be handled in a callback or by returning a promise. You DO NOT want to block code in a Node application. That's what makes it so nice! Mostly with MongoDB it will crash before it has a chance to connect if there is no connection. It won't really have an effect on an API call because your server will be dead!
Source: I'm a developer at a bootcamp that teaches MEAN stack.
Your two examples are completely different. The distinction actually answers the question in and of itself.
Grabbing data from a database is dependent on being connected to that database. Any code that is dependent upon that data is then dependent upon that connection. These things have to happen serially for the app to function and be meaningful.
On the other hand, readFileSync will block ALL code, not just code that is reliant on it. You could start reading a csv file while simultaneously establishing a database connection. Once both are done, you could add that csv data to the database.
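A sketch of that last point: reading a file and establishing a connection at the same time, then continuing once both are done. connectToDatabase is a hypothetical stand-in (a timer that pretends to be a driver's connect call), and the config file is assumed to be JSON.

```javascript
const fs = require('fs');

// Hypothetical stand-in for a real driver's connect call.
function connectToDatabase() {
  return new Promise((resolve) => {
    setTimeout(() => resolve({ connected: true }), 50);
  });
}

// Start both operations immediately and wait for both to finish.
// Neither one blocks the event loop while it is in flight.
async function startUp(configPath) {
  const [configText, db] = await Promise.all([
    fs.promises.readFile(configPath, 'utf8'),
    connectToDatabase(),
  ]);
  return { config: JSON.parse(configText), db };
}

// Usage:
//   startUp('./config.json').then(({ config, db }) => {
//     // start the HTTP server only now, when both are ready
//   });
```

The code that depends on both results still runs strictly after them, so the serial dependency is preserved; only the waiting overlaps.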

Redis and Node.js and Socket.io Questions

I have just been learning redis and node.js. There are two questions I have for which I couldn't find any satisfying answer.
My first question is about reusing redis clients within the node.js. I have found this question and answer: How to reuse redis connection in socket.io? , but it didn't satisfy me enough.
Now, if I create the redis client within the connection event, it will be spawned for each connection. So, if I have 20k concurrent users, there will be 20k redis clients.
If I put it outside of the connection event, it will be spawned only once.
The answer is saying that he creates three clients for each function, outside of the connection event.
However, from what I know of MySQL, when writing an application that spawns child processes and runs in parallel, you need to create your MySQL client within the function in which you are creating the child instances. If you create it outside of it, MySQL will give a "MySQL server has gone away" error, as the child processes will try to use the same connection. It should be created separately for each child process.
So, even if you create three different redis clients for each function, if you have 30k concurrent users who send 2k messages concurrently, you should run into the same problem, right? So, every "user" should have their own redis client within the connection event. Am I right? If not, how do node.js and redis handle concurrent requests differently than MySQL? If it has its own mechanism and creates something like child processes within the redis client, why do we need to create three different redis clients then? One should be enough.
I hope the question was clear.
-- UPDATE --
I have found an answer for the following question. http://howtonode.org/control-flow
No need to answer but my first question is still valid.
-- UPDATE --
My second question is this. I am also not that good at JS and Node.js. So, from what I know, if you need to wait for an event, you need to encapsulate the second function within the first function. (I don't know the terminology yet). Let me give an example;
socket.on('startGame', function() {
    getUser();
    socket.get('game', function (gameErr, gameId) {
        socket.get('channel', function (channelErr, channel) {
            console.log(user);
            client.get('games:' + channel + '::' + gameId + ':owner', function (err, owner) { //games:channel.32:game.14
                if(owner === user.uid) {
                    //do something
                }
            });
        });
    });
});
So, if I am learning it correctly, I need to nest every function inside the previous one whenever I need to wait for an I/O answer. Otherwise, node.js's non-blocking mechanism will let the first function start fetching its result in the background, and the second function might run before that result has arrived. So, if you are getting a result from redis, for example, and you will use the result within the second function, you have to nest it inside the redis get callback. Otherwise the second function will run without the result.
So, in this case, if I need to run 7 different functions and the 8th function needs the results of all of them, do I need to write them like this, nested inside each other? Or am I missing something?
I hope this was clear too.
Thanks a lot,
So, every "user" should have their own redis client within the connection event.
Am I right?
Actually, you are not :)
The thing is that node.js is very unlike, for example, PHP. node.js does not spawn child processes on new connections, which is one of the main reasons it can easily handle large amounts of concurrent connections, including long-lived connections (Comet, Websockets, etc.). node.js processes events sequentially using an event queue within one single process. If you want to use several processes to take advantage of multi-core servers or multiple servers, you will have to do it manually (how to do so is beyond the scope of this question, though).
Therefore, it is a perfectly valid strategy to use one single Redis (or MySQL) connection to serve a large quantity of clients. This avoids the overhead of instantiating and terminating a database connection for each client request.
So, every "user" should have their own redis client within the
connection event. Am I right?
You shouldn't make a new Redis client for each connected user, that's not the proper way to do it. Instead just create 2-3 clients max and use them.
For more information checkout this question:
How to reuse redis connection in socket.io?
As for the first question:
The "right answer" might make you think you are good with one Connection.
In reality, whenever you are doing something that is waiting on I/O, a timer, etc., you are actually making node run the waiting method on the queue. Hence, if you use only a single connection, you will actually limit the performance of the thread you are working on (a single CPU) to the speed of redis - which is probably a few hundred callbacks per second (non-redis waiting callbacks will still go on). While this is not poor performance, there's no reason to create this kind of limitation. It is recommended to create a few (5-10) connections to avoid this issue entirely. This number goes up for slower databases, e.g. MySQL, but is dependent on the type of queries and the code specifics.
Do note, that you should run a few workers on your server, per the number of CPUs you have, for best performance.
In regards to the 2nd Question:
It is a much better practice, to name the functions, one after the other, and use the names in the code rather than defining it as you go. In some situations, it will reduce memory consumption.
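As a sketch of what named functions look like in practice, here is the startGame flow from the question rewritten with promises. fetchValue is a hypothetical stand-in for an asynchronous lookup such as a redis GET, and the other function names are invented. Promise.all covers the "8th function needs the results of all the others" case without nesting.

```javascript
// Hypothetical stand-in for an asynchronous lookup such as a redis GET.
function fetchValue(key) {
  return new Promise((resolve) => {
    setTimeout(() => resolve('value-of-' + key), 5);
  });
}

// These lookups do not depend on each other, so start them all at once
// and continue when every one of them has finished.
function loadGameContext() {
  return Promise.all([
    fetchValue('game'),
    fetchValue('channel'),
    fetchValue('user'),
  ]);
}

// This step needs the earlier results, so it runs after them.
function findOwner([gameId, channel, user]) {
  const ownerKey = 'games:' + channel + '::' + gameId + ':owner';
  return fetchValue(ownerKey).then((owner) => ({ owner, user }));
}

// The top-level flow reads as a flat list of named steps.
function startGame() {
  return loadGameContext().then(findOwner);
}

// Usage:
//   startGame().then(({ owner, user }) => {
//     if (owner === user) { /* do something */ }
//   });
```

Each step is defined once rather than re-created as an anonymous closure on every call, and the order of execution is visible from startGame alone.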

Resources