Websockets severe delay/latency - node.js

We're using WebSockets (specifically uws.js on Node.js) to run a multiplayer quiz. The server runs on an AWS t2.micro in eu-west-2 (London), but recently we've been seeing incredibly high latency for some players, though only on an intermittent basis.
By latency, what I'm actually measuring is the time from sending a broadcast message (using uws's built-in pub/sub) to the player's device telling the server it has safely received it. The message we're tracking tells the player's device to move on to the next phase of the quiz, so it's pretty critical to the workings of the application. Most of the time, for players in the UK, this is somewhere around 30-60 ms, but every now and then we're seeing delays of up to 17 s.
Recently, we had a group on the other side of the world from our server do a quiz, and even though there were only 10 or so players (so the server's definitely not being overloaded), roughly half that group intermittently hit these very, very high latency spikes, where it would take 12, 17, 22, or even 39(!!) seconds for their device to acknowledge receiving the message. Even though this is a slow-paced quiz, that's still an incredible amount of latency, and something that has a real detrimental effect.
My question is: how do I tell what's causing this, and how do I fix it? My guess would be that it's something to do with TCP's in-order delivery, combined with some perhaps dodgy internet connections, as one of the players yesterday seemed to receive nothing for 39 seconds, then three messages all in a row, backed up. To me that suggests packet loss causing head-of-line blocking, but I don't know where to even begin when trying to resolve it. I also haven't yet figured out how to reproduce it (I've never seen it happen when I've been playing myself), which makes things even harder.
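One way to start diagnosing this is to instrument the acknowledgements you already get back, so you can see exactly which clients spike and when. A minimal sketch, independent of uws.js itself; the names (`trackSend`, `recordAck`, the id scheme) are illustrative, not part of any library:

```javascript
// Minimal sketch: tag each broadcast with an id and a timestamp, then log the
// per-client delay when the ack comes back, so slow clients can be identified.
class LatencyTracker {
  constructor(warnMs = 5000) {
    this.pending = new Map(); // `${clientId}:${msgId}` -> send time (ms)
    this.warnMs = warnMs;
  }
  trackSend(clientId, msgId, now = Date.now()) {
    this.pending.set(`${clientId}:${msgId}`, now);
  }
  recordAck(clientId, msgId, now = Date.now()) {
    const key = `${clientId}:${msgId}`;
    const sentAt = this.pending.get(key);
    if (sentAt === undefined) return null; // unknown or duplicate ack
    this.pending.delete(key);
    const delayMs = now - sentAt;
    if (delayMs > this.warnMs) {
      console.warn(`client ${clientId} acked msg ${msgId} after ${delayMs} ms`);
    }
    return delayMs;
  }
}
```

Logging these per-client delays alongside timestamps makes it much easier to correlate a 39-second spike with a particular network, device, or time of day.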

TCP routing issues are unlikely to cause extreme delays of 17+ seconds. Are you sure there is no "store and forward" queuing system buffering your messages on a server, or perhaps a cloud pub/sub queue?
Another important check: the t2.micro is just about the cheapest VM you can boot on AWS, with the least-assured networking QoS. There are no throughput and no jitter assurances on its network performance.
You may wish to review:
EC2 network benchmarking guide from AWS including MTU parameters
Available instance bandwidth for EC2
t2.micro does not have any minimum baseline assured bandwidth for example.

Related

Extremely slow Apollo gateway / server responses

We're facing some challenging performance issues with some queries taking an extremely long time to respond, often in excess of 20s and frequently timing out at 60s. These extreme timings cannot be reproduced in local dev environments, even when using load testing tools like graphql-bench and autocannon. These same queries can also complete in ~500ms.
We have a federated graph setup with an instance of Apollo Gateway (running ~0.24) and 2 instances of Apollo Server (running 3.x) providing subgraphs. All 3 use NestJS 7 on top of Apollo Server + Express. Our resolvers use TypeORM to interface with a reasonably large and complex MSSQL database. The database is hugely over-provisioned and its resource monitors never get over 10% RAM & CPU.
The API servers are hosted on AWS ECS and are generously resourced. Like the DB server, their CPU and RAM usage averages out at 10-20%, although the CPU does spike to >100% sometimes.
Tracing in Apollo Studio appears to show some high-level resolvers "hanging" or "waiting" for long periods of time, i.e. 27 s of a 30 s response.
We do have some N+1 issues, and we've made some limited use of Dataloader in places to resolve them, although we could definitely use it more. I'm not convinced the N+1 issues are responsible for the extreme timings though, due to the wild range and inconsistency of those timings.
Does this scenario sound familiar to anyone? Does anyone have any ideas or pointers on how we might go about tracking down the root issue and debugging it? We're a bit stumped tbh, especially as the issues seem to be impossible to repro locally.
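For context on the N+1 point above, the core idea behind Dataloader is small enough to sketch by hand: collect every key requested during one tick of the event loop, then issue a single batched fetch instead of one query per key. This is a hand-rolled illustration of that pattern, not the `dataloader` package's actual implementation:

```javascript
// Hand-rolled sketch of the batching idea behind Dataloader: collect all keys
// requested in one tick of the event loop, then issue one batched fetch.
// `batchFn` receives the keys and must resolve to values in the same order.
class TinyLoader {
  constructor(batchFn) {
    this.batchFn = batchFn;
    this.queue = []; // { key, resolve, reject }
    this.scheduled = false;
  }
  load(key) {
    return new Promise((resolve, reject) => {
      this.queue.push({ key, resolve, reject });
      if (!this.scheduled) {
        this.scheduled = true;
        process.nextTick(() => this.flush());
      }
    });
  }
  async flush() {
    const batch = this.queue;
    this.queue = [];
    this.scheduled = false;
    try {
      const values = await this.batchFn(batch.map((item) => item.key));
      batch.forEach((item, i) => item.resolve(values[i]));
    } catch (err) {
      batch.forEach((item) => item.reject(err));
    }
  }
}
```

If the extreme timings were pure N+1, you'd expect them to scale fairly predictably with result-set size; the wild variance described above does suggest something else (connection-pool exhaustion, event-loop blocking) is worth ruling out too.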

NodeJS Monitoring Website (Worker Threads?/Multi Process?)

I am doing a small project: an application that will monitor some servers.
It will be based on telnet port checks, pings, and libraries that connect directly to databases (MSSQL, Oracle, MySQL) to check their status.
I wonder what the most effective solution for this would be. Currently, with around 30 servers, it works quite smoothly: around 2.5 s to check the status of all of them (running async). However, I'm worried that it might get worse in the future with more servers, hence thinking about alternatives like Worker Threads, or some multi-processing? Any ideas? Everything happens on an internal network, so I don't expect huge latency.
Thank you in advance.
Have you ever tried the PM2 cluster mode:
https://pm2.keymetrics.io/docs/usage/cluster-mode/
The telnet stuff is TCP, which Node.js does very well using OS-level networking events. The connections to databases can vary. In the case of Oracle, you'll likely be using node-oracledb. Those are SQL*Net connections that rely on the OCI libraries and Node.js' thread pool. The thread pool defaults to four threads, but you can grow it to up to 128 per Node.js process. See this doc for info:
https://oracle.github.io/node-oracledb/doc/api.html#-143-connections-threads-and-parallelism
Having said all that, other than increasing the size of the thread pool, I wouldn't recommend you make any changes. Why fight fires before they're burning? No need to over-engineer things. You're getting acceptable performance given the current number of servers you have.
How many servers do you plan to add in, say, 5 years? What's the difference in timing if you run the status checks for half of the servers vs all of them? Perhaps you could use that kind of data to make an educated guess as to where things would go.
As you add new ones, keep track of the total time to check the status. Is it slipping? If so, look into where the time is being spent and write the solution that will help.

How to catch/record a "Burst" in HAProxy and/or NodeJS traffic

We have a real-time service, which gets binary messages from different sources (internal and external), then using a couple of NodeJS instances and one HAProxy instance, configured to route TCP traffic, we deliver them to our end-users and different services who consume the messages. HAProxy version is 1.8.14, NodeJS is 6.14.3, both hosted on a CentOS 7 machine.
Now we've got a complex problem with some "bursts" on the outbound interface of the HAProxy instance. We're not sure whether the burst is real (e.g. some messages get stuck in Node and then the network gets flooded with messages) or whether the problem is some kind of misconfiguration or an indirect effect of some other service (the latter two are more likely, as we sometimes get these bursts around midnight, when we have minimal to zero load).
The issue is annoying right now, but it might become critical, as it floods our outbound traffic and our real-time services then experience lag or small periods of downtime during working hours.
My question is: how can we track and record the nature or the content of these messages with minimum overhead? I've been reading through the HAProxy docs for a way to monitor this, which can be achieved using a Unix socket, but we're worried about a couple of things:
How much is the overhead of using this socket?
Can we track what is going on in the servers using this socket, or does it only give us stats?
Is there a way to "catch/echo" the contents of these messages, or find out some information about them, with minimum overhead?
Please let me know if you have any questions regarding this problem.

I'm not sure how to correctly configure my server setup

This is kind of a multi-tiered question. My end goal is to establish the best way to set up my server, which will host a website as well as a service (using Socket.io) for an iOS (and eventually an Android) app. Both the app service and the website are going to be written in Node.js, as I need high concurrency and scaling for the app server, and I figured that while I'm at it I may as well do the website in Node too, because it wouldn't be much different in terms of performance from something like Apache (from my understanding).
Also, the website has a lower priority than the app service; the app service should receive significantly higher traffic than the website (though in the long run this may change). Money isn't my greatest priority here, but it is a limiting factor. I feel that having a service with 99.9% uptime (as 100% uptime appears to be virtually impossible in the long run) is more important than saving money at the cost of more downtime.
Firstly, I understand that having one Node process per CPU core is the best way to fully utilise a multi-core CPU, and that running more than one per core is inefficient because the CPU has to context-switch between the processes. Why is it, then, that whenever I see code using Node's built-in cluster module, the master process creates a number of workers equal to the number of cores? That would mean 9 processes on an 8-core machine (1 master process and 8 worker processes). Is it because the master process is usually only there to restart worker processes if they crash, and therefore does so little that it doesn't matter that it shares a CPU core with another Node process?
If that's the case, I'm planning to have the workers provide the app service and have the master process manage the workers but also host a webpage providing statistical information on the server's state and all other relevant information (number of clients connected, worker restart count, error logs, etc.). Is this a bad idea? Would it be better to run this webpage on a separate worker and leave the master process to just manage the workers?
So overall I want the following elements: a service to handle requests from the app (my main source of traffic); a website (fairly simple, a couple of pages and a registration form); an SQL database to store user information; a webpage (probably locally hosted on the server machine) which only I can access that shows information about the server (users connected, worker restarts, server logs, other useful information, etc.); and apparently nginx would be a good idea where I'm handling multiple Node processes accepting connections from the app. After doing research I've also found that it would probably be best to host on a VPS initially. At first, when the traffic to the app service will most likely be fairly low, I could run all of those elements on one VPS. Or would it be better to run them on separate VPSs, except for the website and the server-status webpage, which I could run on the same one? That way, if there is a hardware failure and something goes down, not everything does, and I could run 2 instances of the app service on 2 different VPSs, so if one goes down the other is still functioning. Would this just be overkill? I doubt I'll need multiple app service instances to support the traffic load for a while, but it would help reduce the apparent downtime for users.
Maybe this all depends on what I value more and have the time to do: a more complex server setup that costs more and is maybe a little unnecessary, but guarantees a consistent and reliable service, or a cheaper and simpler setup that may succumb to downtime due to coding errors and server hardware issues.
It's also worth noting I've never had any real experience with production-level servers, so in some ways I've jumped in at the deep end a little with this. I feel like I've come a long way in the past half a year and I'm getting a fairly good grasp of what I need to do; I could just do with some advice from someone with experience who has an idea of what roadblocks I may come across and whether I'm causing myself unnecessary problems with this kind of setup.
Any advice is greatly appreciated, thanks for taking the time to read my question.

Which resources should one monitor on a Linux server running a web-server or database

When running any kind of server under load there are several resources that one would like to monitor to make sure that the server is healthy. This is specifically true when testing the system under load.
Some examples for this would be CPU utilization, memory usage, and perhaps disk space.
What other resources should I be monitoring, and what tools are available to do so?
As many as you can afford to, and can then graph/understand/look at the results. Monitoring resources is useful for not only capacity planning, but anomaly detection, and anomaly detection significantly helps your ability to detect security events.
You have a decent start with your basic graphs. I'd want to also monitor the number of threads, number of connections, network I/O, disk I/O, page faults (arguably this is related to memory usage), context switches.
I really like munin for graphing things related to hosts.
I use Zabbix extensively in production, which comes with a stack of useful defaults. Some examples of the sorts of things we've configured it to monitor:
Network usage
CPU usage (% user,system,nice times)
Load averages (1m, 5m, 15m)
RAM usage (real, swap, shm)
Disc throughput
Active connections (by port number)
Number of processes (by process type)
Ping time from remote location
Time to SSL certificate expiry
MySQL internals (query cache usage, num temporary tables in RAM and on disc, etc)
Anything you can monitor with Zabbix, you can also attach triggers to - so it can restart failed services; or page you to alert about problems.
Collect the data now, before performance becomes an issue. When it does, you'll be glad of the historical baselines, and the fact you'll be able to show what date and time problems started happening for when you need to hunt down and punish exactly which developer made bad changes :)
I ended up using dstat which is vmstat's nicer looking cousin.
This will show most everything you need to know about a machine's health,
including:
CPU
Disk
Memory
Network
Swap
"df -h" to make sure that no partition runs full which can lead to all kinds of funky problems, watching the syslog is of course also useful, for that I recommend installing "logwatch" (Logwatch Website) on your server which sends you an email if weird things start showing up in your syslog.
Cacti is a good web-based monitoring/graphing solution. Very complete, very easy to use, with a large userbase including many large Enterprise-level installations.
If you want more 'alerting' and less 'graphing', check out nagios.
As for 'what to monitor', you want to monitor systems at both the system and application level, so yes: network/memory/disk i/o, interrupts and such over the system level. The application level gets more specific, so a webserver might measure hits/second, errors/second (non-200 responses), etc and a database might measure queries/second, average query fulfillment time, etc.
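For a Node web server, the application-level counters mentioned above (hits/second, non-200 errors/second) take only a few lines to collect. A sketch with illustrative names; a real setup would push the flushed snapshots to whatever graphing tool is in use:

```javascript
// Sketch: per-interval hit and error counters for a web server.
// record() is called once per response; flush() once per interval,
// e.g. from setInterval(() => report(counters.flush()), 1000).
class RateCounters {
  constructor() { this.hits = 0; this.errors = 0; }
  record(statusCode) {
    this.hits++;
    if (statusCode !== 200) this.errors++; // count non-200 responses as errors
  }
  flush() {
    const snapshot = { hits: this.hits, errors: this.errors };
    this.hits = 0;
    this.errors = 0;
    return snapshot;
  }
}
```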
Beware the aforementioned slow query log in MySQL. It should only be used when trying to figure out why some queries are slow, as it has the side effect of slowing ALL your queries while it's enabled. :P It's intended for debugging, not logging.
Think 'passive monitoring' whenever possible. For instance, sniff the network traffic rather than monitor it from your server -- have another machine watch the packets fly back and forth and record statistics about them.
(By the way, that's one of my favorites -- if you watch connections being established and note when they end, you can find a lot of data about slow queries or slow anything else, without putting any load on the server you care about.)
In addition to top and auth.log, I often look at mtop, and enable MySQL's slow query log and watch mysqldumpslow.
I also use Nagios to monitor CPU, Memory, and logged in users (on a VPS or dedicated server). That last lets me know when someone other than me has logged in.
The network, of course :) Use MRTG to get some nice bandwidth graphs; they're just pretty most of the time... until a spammer finds a hole in your security and the traffic suddenly increases.
Nagios is good for alerting, as mentioned, and is easy to get set up. You can then use the MRTG plugin to get alerts for your network traffic too.
I also recommend ntop as it shows where your network traffic is going.
A good link to get you going with Munin and Monit: link text
I typically watch top and tail -f /var/log/auth.log.