Node.js GPS device tracking performance considerations - node.js

Using Node.js as a TCP server, I am going to manage a relatively large number of GPS devices (~3000 devices). As a first step I am just going to store the incoming data in a database, but even in this phase I envision some performance issues that bother me, and I'd like to catch them before they bite me.
1 - Looking at similar servers written in languages like Java or Ruby, I see code like the following:
java
Thread serverThread = new Thread(() -> {
    System.out.println("Listening to server port 9000");
    while (true) {
        try {
            Socket socket = serverSocket.accept();
            ...
ruby
require 'socket'
server = TCPServer.new("127.0.0.1", 8080)
loop do
  Thread.start(server.accept) do |client|
    ...
It seems they give a separate thread to every device (socket) that connects to the TCP server. As Node.js is single-threaded and asynchronous, should I be concerned about incoming connections, or will something like the following simple approach handle a large number of simultaneous connections?
net.createServer(function(device) {
    device.on('data', function(data) {
        // parse data
        // store in database
    });
});
2 - Should I limit database connections using a connection pool? As the database is also queried from the other side for GIS and monitoring, how large should the pool be?
3 - How could I benefit from caching (for example using Redis) in such a system?
It would be great if someone could shed some light on these thoughts. I would also gladly hear about any other performance issues you have experienced or are aware of in implementing such systems. Thanks.

Of the options you have listed, I would say Node.js is actually the better choice for your use case because it does not use one thread per connection like the other two. Threads are a finite resource on a given machine. Java and Ruby do have 'evented' servers though, and these are worth looking at if you want an apples-to-apples comparison.
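As a rough illustration of the evented approach, here is a minimal sketch of such a server. It assumes devices send newline-terminated text records (an assumption; many trackers use binary framing) and takes a hypothetical `store` callback in place of the database write. TCP delivers a byte stream, so one 'data' event may carry half a record or several records, which is why the sketch buffers per socket.

```javascript
const net = require('net');

// Splits a stream of text chunks into newline-terminated records.
// Assumption: each device sends one text record per line.
function createLineSplitter(onRecord) {
  let pending = '';
  return function feed(chunk) {
    pending += String(chunk);
    let newline;
    while ((newline = pending.indexOf('\n')) !== -1) {
      const record = pending.slice(0, newline).trim();
      pending = pending.slice(newline + 1);
      if (record.length > 0) onRecord(record);
    }
  };
}

// `store` is a hypothetical callback standing in for the database write.
function createTrackerServer(store) {
  return net.createServer(function (device) {
    device.setEncoding('utf8'); // chunks arrive as strings
    device.on('data', createLineSplitter(store)); // one buffer per socket
    // An 'error' event with no listener would crash the whole process.
    device.on('error', function (err) {
      console.error('device socket error:', err.message);
    });
  });
}
```

It could be started with something like `createTrackerServer(saveRecord).listen(9000)`, where `saveRecord` and the port are placeholders.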
I think you need to say more about the database you intend to use if you want advice on connection pooling. However, reusing connections is a good idea if they are costly to set up. It is probably also worth being able to configure the minimum and maximum size of the pool. Ultimately the correct size is a matter of testing.
I think the benefit of caching in this system would be minimal as you are mostly writing data. If the data is valuable you will want to write it to disk rather than memory. On the other hand, if you have clients that are reading the data collected perhaps caching their reads in something like Redis might be a good idea.
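To make the read-side caching idea concrete, here is a small sketch that keeps only the newest position per device. It uses an in-process `Map` as a stand-in for Redis, and the `deviceId` and `timestamp` field names are assumptions made for the example.

```javascript
// In-process stand-in for a cache such as Redis: newest fix per device only.
function createLatestPositionCache() {
  const latest = new Map();
  return {
    // Call on the write path, after the record has been stored durably.
    update(deviceId, fix) {
      const current = latest.get(deviceId);
      // Ignore records that arrive out of order.
      if (!current || fix.timestamp >= current.timestamp) {
        latest.set(deviceId, fix);
      }
    },
    // Monitoring readers call this instead of querying the database.
    get(deviceId) {
      return latest.get(deviceId) || null;
    },
  };
}
```

The full history still goes to the database; the cache only answers "where is device X now?" without touching it.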

I'm sure you're aware, but this sounds like you're trying to prematurely optimize your application here.
1- Node being event-driven and non-blocking makes it a perfect candidate for holding a large number of open socket connections; there is no need to fork per connection. As always though, make sure your application is properly clustered. I was able to hold ~100k open TCP sockets on a dirt cheap laptop. If the number of devices you need to support ever grows beyond that, just scale accordingly.
2- I saw you were planning on using Postgres. Pools are always a good thing.
3- Caching is useful for 'hot' data: stuff that gets queried a lot, so that having it in memory or inside Redis (in-memory storage) makes lookups faster and removes strain on the system. In your case, if you just need to get certain chunks of data, for analytics or for more casual use, I would recommend Spark or Solr as opposed to a plain caching layer. It's also going to be much cheaper and easier to maintain.

Related

Websockets: listen multiple connections simultaneously?

I am working on a project which goal is to receive and store real time data from financial exchanges, using websockets. I have some very general questions about the technology.
Suppose that I have two websocket connections open, receiving real time data from two different servers. How do I make sure not to miss any messages? I have learned a bit of asynchronous programming (python asyncio) but it does not seem to solve the problem: when I listen to one connection, I cannot listen to the other one at the same time, right?
I can think of two solutions: the first one would require that the servers use a buffer system to send their data, but I do not think this is the case (Binance, Bitfinex...). The second solution I see is to listen to each websocket using a different core. If my laptop has 8 cores I can listen to 8 connections and be sure not to miss any messages. I guess I can then scale up by using a cloud service.
Is that correct or am I missing something? Many thanks.
when I listen to one connection, I cannot listen to the other one at the same time, right?
Wrong.
When using an evented programming design, you will be using an IO "reactor" that adds IO related events to the event loop.
This allows your code to react to events from a number of connections.
It's true that the code reacts to the events in sequence, but as long as your code doesn't "block", these events could be handled swiftly and efficiently.
Blocking code should be avoided and big / complicated tasks should be fragmented into a number of "events". There should be no point at which your code is "blocking" (waiting) on an IO read or write.
This will allow your code to handle all the connections without significant delays.
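The idea can be sketched in a few lines. The example is in Node.js rather than asyncio, and the two `EventEmitter` objects are stand-ins for real WebSocket connections, so treat it as an illustration of the pattern only. With real sockets, the operating system and the runtime buffer incoming data while a handler is running, so a short handler on one connection does not cause messages on the other to be dropped.

```javascript
const { EventEmitter } = require('events');

// Two stand-ins for open connections; real code would use sockets here.
const feedA = new EventEmitter();
const feedB = new EventEmitter();
const received = [];

function listen(name, feed) {
  feed.on('message', function (msg) {
    // Keep handlers short: while one runs, no other event is processed.
    received.push(name + ':' + msg);
  });
}

// Both listeners are registered at once; neither "waits" on its feed.
listen('A', feedA);
listen('B', feedB);

// Messages arrive in any order; each handler runs to completion in turn.
feedA.emit('message', 'tick 1');
feedB.emit('message', 'tick 1');
feedA.emit('message', 'tick 2');

console.log(received); // the three messages, in arrival order
```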
...the first one would require that the servers use a buffer system to send their data...
Many evented frameworks use an internal buffer that streams to the IO when "ready" events are raised. For example, look up the drain event in node.js (or the on_ready in facil.io).
This is a convenience feature rather than a requirement.
The event loop might as well add an "on ready" event and assume your code will handle buffering after partial write calls return EAGAIN / EWOULDBLOCK.
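In Node.js the buffering described above shows up as the return value of `write()` together with the 'drain' event. A minimal sketch, assuming any standard `Writable` stream (a socket, for example); the helper name `writeAll` is made up for the example:

```javascript
const { once } = require('events');

// Writes chunks in order, pausing whenever the stream's buffer is full.
async function writeAll(writable, chunks) {
  let waits = 0;
  for (const chunk of chunks) {
    // write() returns false once buffered data reaches highWaterMark.
    if (!writable.write(chunk)) {
      // Resume only after the buffer has been flushed.
      await once(writable, 'drain');
      waits += 1;
    }
  }
  return waits; // how many times we had to wait for 'drain'
}
```

Ignoring the return value of `write()` still works, but the unsent data then accumulates in memory instead of slowing the producer down.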
The second solution I see is to listen each websocket using a different core.
No need. A single thread on a single core with an evented design should support thousands (and tens of thousands) of concurrent clients with reasonable loads (per-client load is a significant performance factor).
Attaching TCP/IP connections to a specific core can (sometimes) improve performance, but this is a many-to-one relationship. If we had to dedicate a CPU core per connection, then server prices would shoot through the roof.

When is blocking code acceptable in node.js?

I know that blocking code is discouraged in node.js because it is single-threaded. My question is asking whether or not blocking code is acceptable in certain circumstances.
For example, if I was running an Express webserver that requires a MongoDB connection, would it be acceptable to block the event loop until the database connection was established? This is assuming that all pages served by Express require a database query (which would fail if MongoDB was not initialized).
Another example would be an application that requires the contents of a configuration file before initializing. Is there any benefit in using fs.readFile over fs.readFileSync in this case?
Is there a way to work around this? Is wrapping all the code in a callback or promise the best way to go? How would that be different from using blocking code in the above examples?
It is really up to you to decide what is acceptable. And you would do that by determining what the consequences of blocking would be ... on a case-by-case basis. That analysis would take into account:
how often it occurs,
how long the event loop is likely to be blocked, and
the impact that blocking in that context will have on usability [1].
Obviously, there are ways to avoid blocking, but these tend to add complexity to your application. Really, you need to decide ... on a case-by-case basis ... whether that added complexity is warranted.
Bottom line: >>you<< need to decide what is acceptable based on your understanding of your application and your users.
1 - For example, in a game it would be more acceptable to block the UI while switching "levels" than during active play. Or for a general web service, "once off" blocking while a config file is loaded or a DB connection is established during webserver startup is more acceptable than if this happened on every request.
From my experience most tasks should be handled in a callback or by returning a promise. You DO NOT want to block code in a Node application. That's what makes it so nice! With MongoDB, the app will mostly crash before it has a chance to serve anything if there is no connection. It won't really have an effect on an API call because your server will be dead!
Source: I'm a developer at a bootcamp that teaches MEAN stack.
Your two examples are completely different. The distinction actually answers the question in and of itself.
Grabbing data from a database is dependent on being connected to that database. Any code that is dependent upon that data is then dependent upon that connection. These things have to happen serially for the app to function and be meaningful.
On the other hand, readFileSync will block ALL code, not just code that is reliant on it. You could start reading a csv file while simultaneously establishing a database connection. Once both are done, you could add that csv data to the database.
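A minimal sketch of overlapping the two startup tasks follows. It assumes the file is JSON and takes a hypothetical `connectToDatabase` function that returns a promise; neither detail comes from the question.

```javascript
const fs = require('fs');

// `connectToDatabase` is a hypothetical function returning a promise.
async function start(configPath, connectToDatabase) {
  // Both operations begin immediately and overlap; neither blocks the loop.
  const [configText, db] = await Promise.all([
    fs.promises.readFile(configPath, 'utf8'),
    connectToDatabase(),
  ]);
  // Code that needs both results runs only after both have finished.
  return { config: JSON.parse(configText), db: db };
}
```

With `fs.readFileSync` the connection attempt could not start until the file had been read. At startup that cost is usually small, which is why blocking there is easier to justify than on every request.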

What should I limit my POST per second rate to?

I'm building out an API using Hapi.js. Some of my code is pushing small amounts of data to the API. The issue seems to be that the pusher code is swamping the API and I'm getting ECONNRESET errors -- which means messages are getting lost. I'm planning on installing a rate-limiter in the pusher code, probably node-rate-limiter.
The question is, what should I set that limit to? I want to max out performance for this app, so I could easily be attempting to send in thousands of messages per hour. The data just gets dumped into redis, so I doubt the code in the API will be an issue but I still need to get an idea of what kind of message rate Hapi is comfortable with. Do I need to just start with something reasonable and see how it goes? Maybe 1 message per 10 milliseconds?
var Hapi = require('hapi');
var server = new Hapi.Server();
server.connection({
    port: config.port,
    routes: {
        cors: {
            origin: ['*']
        }
    }
});
server.route({method: 'POST', path: '/update/{id}', ...})
There is no generic answer to how many requests per second you can process. It depends upon many things in your configuration and code such as:
Type and performance of server hardware
The amount of CPU time an average request uses
Whether your requests are CPU or disk bound. If disk bound, then it depends a lot on your database and disk performance.
Whether you implement clustering to use multiple cores (if CPU bound)
Whether you're on shared infrastructure or not
The max number of incoming connections your server is configured for
So, there is no absolute answer here that works for everyone. If you don't have some sort of design problem that is artificially limiting your concurrency, then the best way to discover what your server can actually handle is to build a test engine and test it. Find where and how it fails and either fix those issues to extend the scalability further or implement protections to avoid hitting that limit.
Note: When a public API makes rate limiting choices, it is typically done on a per-client basis and the limit is set to a value a little above what a reasonable client would be doing. This is mostly to allow fair use of the server by many clients, so that no single client consumes too much of the overall resource. If issuing thousands of small requests from a single client is not considered "good practice" in using your API, then you can just pick a number that is much smaller than that for a per-client limit.
Note: You may also want to make it easier for clients by having your API let them upload multiple messages in one API request rather than lots of API requests.
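On the pusher side, the batching suggestion could look roughly like this. The sketch assumes the API has an endpoint that accepts an array of messages, and `send` is a hypothetical function that performs one POST per batch.

```javascript
// Collects messages and hands them to `send` in groups of `maxBatch`.
// `send` is a hypothetical function that does one POST with an array body.
function createBatcher(maxBatch, send) {
  let pending = [];

  function flush() {
    if (pending.length === 0) return;
    const batch = pending;
    pending = [];
    send(batch);
  }

  return {
    add(message) {
      pending.push(message);
      if (pending.length >= maxBatch) flush();
    },
    // Call on a timer or at shutdown so a partial batch is not stranded.
    flush: flush,
  };
}
```

With a batch size of 100, a thousand messages become ten requests, which reduces the number of connections the server has to accept.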

Node.js performance optimization involving HTTP calls

I have a Node.js application which opens a file, scans each line and makes a REST call that involves Couchbase for each line. The average number of lines in a file is about 12 to 13 million. Currently without any special settings my app can completely process ~1 million records in ~24 minutes. I went through a lot of questions, articles, and Node docs but couldn't find out any information about following:
Where's the setting that says node can open X number of http connections / sockets concurrently ? and can I change it?
I had to regulate the file processing because the file reading is much faster than the REST call so after a while there are too many open REST requests and it clogs the system and it goes out of memory... so now I read 1000 lines wait for the REST calls to finish for those and then resume it ( i am doing it using pause and resume methods on stream) Is there a better alternative to this?
What all possible optimizations can I perform so that it becomes faster than this. I know the gc related config that prevents from frequent halts in the app.
Is using "cluster" module recommended? Does it work seamlessly?
Background: We have an existing java application that does exactly same by spawning 100 threads and it is able to achieve slightly better throughput than the current node counterpart. But I want to try node since the two operations in question (reading a file and making a REST call for each line) seem like perfect situation for node app since they both can be async in node where as Java app makes blocking calls for these...
Any help would be greatly appreciated...
Generally you should break your questions on Stack Overflow into pieces. Since your questions are all getting at the same thing, I will answer them. First, let me start with the bottom:
We have an existing java application that does exactly same by spawning 100 threads ... But I want to try node since the two operations in question ... seem like perfect situation for node app since they both can be async in node where as Java app makes blocking calls for these.
Asynchronous calls and blocking calls are just tools to help you control flow and workload. Your Java app is using 100 threads, and therefore has the potential of 100 things at a time. Your Node.js app may have the potential of doing 1,000 things at a time but some operations will be done in JavaScript on a single thread and other IO work will pull from a thread pool. In any case, none of this matters if the backend system you're calling can only handle 20 things at a time. If your system is 100% utilized, changing the way you do your work certainly won't speed it up.
In short, making something asynchronous is not a tool for speed, it is a tool for managing the workload.
Where's the setting that says node can open X number of http connections / sockets concurrently ? and can I change it?
Node.js' HTTP client automatically uses an agent, which lets you use keep-alive connections. It also means that you won't flood a single host unless you write code to do so. http.globalAgent.maxSockets = 1000 is what you want, as mentioned in the documentation: http://nodejs.org/api/http.html#http_agent_maxsockets
I had to regulate the file processing because the file reading is much faster than the REST call so after a while there are too many open REST requests and it clogs the system and it goes out of memory... so now I read 1000 lines wait for the REST calls to finish for those and then resume it ( i am doing it using pause and resume methods on stream) Is there a better alternative to this?
Don't use .on('data') for your stream, use .on('readable'). Only read from the stream when you're ready. I also suggest using a transform stream to read by lines.
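One way to read only when ready is sketched below. It uses the built-in `readline` module with async iteration as a simpler stand-in for 'readable' events plus a transform stream, and caps the number of calls in flight. `handleLine` is a hypothetical async function that makes the REST call for one line. `readline` does some internal buffering, so this bounds outstanding requests rather than memory in a strict sense, and error handling is left out.

```javascript
const readline = require('readline');

// Reads `input` line by line, keeping at most `maxInFlight` calls running.
// `handleLine` is a hypothetical async function (e.g. one REST call per line).
async function processLines(input, handleLine, maxInFlight = 100) {
  const rl = readline.createInterface({ input: input, crlfDelay: Infinity });
  const inFlight = new Set();
  let processed = 0;

  for await (const line of rl) {
    const task = handleLine(line)
      .then(() => { processed += 1; })
      .finally(() => inFlight.delete(task));
    inFlight.add(task);
    // At the cap: stop pulling lines until one call has finished.
    if (inFlight.size >= maxInFlight) {
      await Promise.race(inFlight);
    }
  }

  await Promise.all(inFlight); // wait for the tail of the file
  return processed;
}
```

A call might look like `processLines(fs.createReadStream('records.txt'), postRecord, 200)`, where the file name, `postRecord`, and the cap of 200 are placeholders. Unlike fixed batches of 1000, a slot is refilled as soon as any single call completes.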
What all possible optimizations can I perform so that it becomes faster than this. I know the gc related config that prevents from frequent halts in the app.
This is impossible to answer without detailed analysis of your code. Read more about Node.js and how its internals work. If you spend some time on this, the optimizations that are right for you will become clear.
Is using "cluster" module recommended? Does it work seamlessly?
This is only needed if you are unable to fully utilize your hardware. It isn't clear what you mean by "seamlessly", but each process is its own process as far as the OS is concerned, so it isn't something I would call "seamless".
By default, node uses a socket pool (http.Agent) for all http requests. In Node versions before 0.12 the default global limit was 5 concurrent connections per host (these are re-used for keepalive connections however); from 0.12 onward the default maxSockets is Infinity. There are a few ways to set or bypass this limit:
Create your own http.Agent and specify it in your http requests:
var agent = new http.Agent({maxSockets: 1000});
http.request({
    // ...
    agent: agent
}, function(res) { });
Change the global/default http.Agent limit:
http.globalAgent.maxSockets = 1000;
Disable pooling/connection re-use entirely for a request:
http.request({
    // ...
    agent: false
}, function(res) { });

Implementing general purpose long polling

I've been trying to implement a simple long polling service for use in my own projects and maybe release it as a SAAS if I succeed. These are the two approaches I've tried so far, both using Node.js (polling PostgreSQL in the back).
1. Periodically check all the clients in the same interval
Every new connection is pushed onto a queue of connections, which is being walked through in an interval.
var queue = [];
function acceptConnection(req, res) {
    res.setTimeout(5000);
    queue.push({ req: req, res: res });
}
function checkAll() {
    queue.forEach(function(client) {
        // respond if there is something new for the client
    });
}
// this could be replaced with a timeout after all the clients are served
setInterval(checkAll, 500);
2. Check each client at a separate interval
Every client gets his own ticker which checks for new data
function acceptConnection(req, res) {
    // something which periodically checks data for the client
    // and responds if there is anything new
    new Ticker(req, res);
}
While this keeps the minimum latency for each client lower, it also introduces overhead by setting a lot of timeouts.
Conclusion
Both of these approaches solve the problem quite easily, but I don't feel that this will scale up easily to something like 10 million open connections, especially since I'm polling the database on every check for every client.
I thought about doing this without the database and just immediately broadcast new messages to all open connections, but that will fail if a client's connection dies for a few seconds while the broadcast is happening, because it is not persistent. Which means I basically need to be able to look up messages in history when the client polls for the first time.
I guess one step up here would be to have a data source where I can subscribe to new data coming in (CouchDB change notifications?), but maybe I'm missing something in the big picture here?
What is the usual approach for doing highly scalable long polling? I'm not specifically bound to Node.js, I'd actually prefer any other suggestion with a reasoning why.
Not sure if this answers your question, but I like the approach of PushPin (+ explanation of concepts).
I love the idea (using reverse proxy and communicating with return codes + delayed REST return requests), but I do have reservations about the implementation. I might be underestimating the problem, but it seems to me that the technologies used are a bit of an overkill. Not sure if I will use it or not yet, would prefer a more lightweight solution, but I find the concept phenomenal.
Would love to hear what you used eventually.
Since you mentioned scalability, I have to get a little bit theoretical, as the only practical measure is load testing. Therefore, all I can offer is advice.
Generally speaking, once-per anything is bad for scalability. Especially once-per-connection or once-per-request since that makes part of your app proportional to the amount of traffic. Node.js removed the thread-per-connection dependency with its single-threaded asynchronous I/O model. Of course, you can't completely eliminate having something per-connection, like a request and response object and a socket.
I suggest avoiding anything that opens a database connection for every HTTP connection. This is what connections pools are for.
As for choosing between your two options above, I would personally go for the second choice because it keeps each connection isolated. The first option uses a loop over connections, which means actual execution time per connection. It's probably not a big deal given that I/O is asynchronous, but given a choice between an iteration-per-connection and the mere existence of an object-per-connection, I would prefer to just have an object. Then I have less to worry about when suddenly there are 10,000 connections.
The C10K problem seems like a good reference for this, though this is really personal judgement to be honest.
http://www.kegel.com/c10k.html
http://en.wikipedia.org/wiki/C10k_problem
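One way to avoid both polling loops is to wake waiting clients when a message is published, and to answer from history when a client reconnects with the id of the last message it saw. The sketch below keeps everything in memory as a stand-in for a persistent store with change notifications; all names are illustrative and the history array is never trimmed.

```javascript
// In-memory message log with waiters. Illustrative only: a real service
// would back `messages` with a persistent store and trim old entries.
function createHub() {
  const messages = []; // { id, body }, ids increase monotonically
  let waiters = [];
  let nextId = 1;

  function since(lastSeenId) {
    return messages.filter((m) => m.id > lastSeenId);
  }

  return {
    publish(body) {
      messages.push({ id: nextId++, body: body });
      const toWake = waiters;
      waiters = [];
      toWake.forEach((w) => w.wake()); // work happens per message, not per tick
    },
    // Resolves at once if the client missed messages; otherwise when one
    // arrives, or with an empty array after timeoutMs.
    poll(lastSeenId, timeoutMs) {
      const missed = since(lastSeenId);
      if (missed.length > 0) return Promise.resolve(missed);
      return new Promise((resolve) => {
        const waiter = {
          wake() {
            clearTimeout(timer);
            resolve(since(lastSeenId));
          },
        };
        const timer = setTimeout(() => {
          waiters = waiters.filter((w) => w !== waiter);
          resolve([]);
        }, timeoutMs);
        waiters.push(waiter);
      });
    },
  };
}
```

An HTTP handler would call `hub.poll(lastSeenId, 5000)` and write the result to the response when the promise resolves, so an idle connection costs one object and one timer rather than a database query every interval.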

Resources