Find the source of waves of latency from Redis using Node.js

I am investigating some latency issues on my server, and I've narrowed it down but not enough to solve it. I'm hoping someone with more experience with Redis or Node.js can help.
Within a function that is called a few thousand times per minute, scaling up and down with web traffic, I send a GET command to my Redis client to check if a process is complete. I've noticed increased latency for my web requests, and it appears as though the Redis GET command is taking up the bulk of my server time. That surprised me, as I always thought Redis was wicked fast all the time. And if I look at Redis's "time spent" info, it says everything is under 700 microseconds.
That didn't jibe with what I was seeing from my transaction monitoring setup, so I added some logging to my code:
const start = Date.now();
client.get(`poll:${submittedId}`, (err, res) => {
  console.log(`${Date.now() - start}`);
  // other stuff
});
Now my logs print the number of milliseconds spent on each Redis GET. I watch that for a while, and see a surprising pattern.
Most of the time, there are lots of 1s and an occasional number in the 10s or sometimes 100s. Then, periodically, all the gets across the server slow down, reaching up to several seconds for each get to complete. Then after a while the numbers curve back down and things are running smoothly again.
What could be causing this sort of behaviour?
Things I've tried to investigate:
Like I mentioned, I've combed through redis's performance data, as presented on Heroku's redis dashboard, and it doesn't have any complaints or latency spikes.
I confirmed that all these requests are coming from a small number of connections, because I read that opening and closing too many can cause issues.
I looked into connection pooling, thinking maybe the transactions are being queued and causing a backlog, but the internet seems to say this isn't necessary for Redis and Node.
Really appreciate anyone taking the time to advise on this!

It sounds like Redis is blocked in a BGSAVE. Check whether
your Redis used memory is large.
Use the LASTSAVE command to confirm when the background saves run and how long they take.
Turn off AOF's appendfsync always policy if it is enabled.
Use SLOWLOG to check whether some other command is blocking the server.
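A rough way to check the last two points from the same Node process, assuming the callback-style node_redis client the question already uses; LASTSAVE and SLOWLOG are standard Redis commands, and the logging around them is only illustrative:
// Log when the last successful RDB save finished; if the slow GETs line up
// with these timestamps, BGSAVE (fork + disk I/O) is a likely suspect.
client.lastsave((err, ts) => {
  if (err) return console.error(err);
  console.log(`last RDB save finished at ${new Date(ts * 1000).toISOString()}`);
});

// SLOWLOG GET lists recently logged slow commands; send_command covers
// commands that have no dedicated helper method on the client.
client.send_command('slowlog', ['get', '10'], (err, entries) => {
  if (err) return console.error(err);
  console.log(entries);
});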

Related

Apart from memory and CPU leaks, what reasons might cause a Node.js server to go down?

I have a Node.js (Express.js) server acting as a BFF for my React.js website. I use Node.js for SSR, proxying some requests, and caching some pages in Redis. Recently I found that my server goes down from time to time; uptime is roughly two days. After a restart everything is OK, then response time grows hour by hour. I have resource monitoring on this server, and I can see that it doesn't have problems with RAM or CPU: it uses about 30% of RAM and 20% of CPU.
Unfortunately, it's a big production site and I can't make a minimal reproducible example, because I don't know where the cause of the error is :(
Apart from memory and CPU leaks, what reasons might cause a Node.js server to go down?
I need at least a direction to search in.
UPDATE1:
"went down" - its when kubernetes kills container due 3 failed life checks (GET request to a root / of website)
My site don't use any BD connection but call lots of 3rd party API's. About 6 API requests due one GET/ request from browser
UPDATE2:
Thanks for your answers, guys.
To understand what happens inside my GET / requests, I added OpenTelemetry to my server. In the long-running and timed-out GET / requests I saw long API calls with very large tcp.connect and tls.connect spans.
I think this happens due to a lack of connections, or something along those lines. I think Mostafa Nazari is right.
I'll create a patch and apply it within the next couple of days, and then report back whether the problem is gone.
I solved the problem.
It really was a lack of connections. I added connection reuse for node-fetch via keepAlive and a lot of caching to save connections, and it works.
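For reference, a minimal sketch of that kind of fix, assuming node-fetch v2 and hypothetical endpoints; the agents are created once and reused for every outgoing call, so TCP/TLS handshakes are not repeated per request:
const http = require('http');
const https = require('https');
const fetch = require('node-fetch');

// One keep-alive agent per protocol, shared by all outgoing API calls.
const httpAgent = new http.Agent({ keepAlive: true, maxSockets: 50 });
const httpsAgent = new https.Agent({ keepAlive: true, maxSockets: 50 });
const agent = (parsedUrl) => (parsedUrl.protocol === 'http:' ? httpAgent : httpsAgent);

async function callApi(url) {
  const res = await fetch(url, { agent }); // node-fetch v2 accepts a function that picks the agent
  if (!res.ok) throw new Error(`API responded with ${res.status}`);
  return res.json();
}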
Thanks for all your answers. They are all right, but the most helpful thing was adding OpenTelemetry to my server to understand what exactly happens inside a request.
For other people with this problem, I strongly recommend adding telemetry to your project as a first step.
https://opentelemetry.io/
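A minimal setup sketch for anyone who wants to try the same thing, assuming the current OpenTelemetry Node packages (@opentelemetry/sdk-node and @opentelemetry/auto-instrumentations-node); the console exporter is just for a quick look, in practice you would export to a collector:
// tracing.js - load this before the rest of the app starts.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { ConsoleSpanExporter } = require('@opentelemetry/sdk-trace-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  traceExporter: new ConsoleSpanExporter(),
  // Auto-instruments http/https, express, net, dns, etc., which is where
  // spans like tcp.connect and tls.connect come from.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();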
PS: I can't mark two replies as the answer. Joe's is the most detailed and Mostafa Nazari's is the most relevant to my problem; they both could be the "best answer".
Thanks for the help, guys.
Gradual growth of response time suggests some kind of leak.
If CPU and memory consumption are excluded, other potentially limiting resources include:
File descriptors - when your server forgets to close files. Monitor the number of entries in /proc/<pid>/fd/* to confirm this (see the sketch after this list). See what those files are, and find which code misbehaves.
Directory listings - even a temporary directory holding a lot of files will take some time to scan, and if your application is not removing some temporary files and lists them - you will be in trouble quickly.
Zombie processes - just monitor the total number of processes on the server.
Firewall rules (some Docker network magic may in theory cause this on the host system) - monitor the length of the output of "iptables -L" or "iptables-save", or the equivalent on modern kernels. Rare condition.
Memory fragmentation - this may happen in languages with garbage collection, but often leaves traces like "Cannot allocate memory" in the logs. Rare condition, hard to fix. Export some health metrics and make your k8s restart your pod preemptively.
Application bugs/implementation problems. This really depends on the internal logic - what is going on inside the app. There may be some data structure that gets filled with data as time goes by in some tricky way, becoming O(N) instead of O(1). Really hard to track down, unless you have managed to reproduce the condition in a lab/test environment.
API calls from the frontend shift to shorter but more CPU-hungry ones. Monitor the distribution of API call types over time.
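A small sketch of that file-descriptor check from inside the Node process, assuming Linux, where /proc/self/fd lists the current process's open descriptors (the interval and log format are arbitrary):
const fs = require('fs');

// Count this process's open file descriptors once a minute. A steadily
// rising count points at a descriptor leak (unclosed files or sockets).
setInterval(() => {
  fs.readdir('/proc/self/fd', (err, fds) => {
    if (err) return console.error('cannot read /proc/self/fd:', err);
    console.log(`open file descriptors: ${fds.length}`);
  });
}, 60000);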
Here are some of the many possibilities of why your server may go down:
Memory leaks - The server may eventually fail if the Node.js application is leaking memory, as you stated in your post above. This may occur if the application keeps adding new objects to memory without appropriately cleaning up.
Unhandled exceptions - The server may crash if an exception is thrown in the application code and is not caught. To avoid this, ensure that all exceptions are handled properly.
Third-party libraries - If the application uses any third-party libraries, the server may experience problems as a result. Before using them, consider examining their resource usage, versions, and updates.
Network connection - The server's network connection may have issues if it is sending a lot of queries to third-party APIs or if the connection is unstable. Verify that the server is handling connections, timeouts, and retries appropriately.
Connection to the database - Even though your server doesn't use any DB connections, it's a good idea to look for any stale database connections that could be problematic.
High volumes of traffic - The server may experience performance issues if it is receiving a lot of traffic. Make sure the server is set up appropriately to handle heavy traffic, making use of load balancing, caching, and other speed-enhancement methods. Cloudflare is always a good option ;)
Concurrent requests - Performance problems may arise if the server is managing a lot of concurrent requests. Check that the server is set up correctly to handle several requests at once, using tools like a connection pool, a thread pool, or other concurrency-management strategies.
(Credit goes to my System Analysis and Design course slides)
With any incoming/outgoing web request, 2 file descriptors are acquired. As there is a limit on the number of FDs, the OS does not let a new socket be opened once that limit is reached, and this situation causes a "Timeout Error" on clients. You can easily check the number of open FDs with sudo ls -la /proc/_PID_/fd/ | tail -n +4 | wc -l where _PID_ is the Node.js PID. If this value keeps rising, you have a connection leak issue.
I guess you need to do the following to prevent a connection leak:
make sure you are closing the HTTP connection of outgoing API calls (it depends on how you open them; some libraries manage this and you just need to configure them)
cache your outgoing API calls (if possible) to reduce the number of calls
for your outgoing API calls, use a connection pool; this manages the number of open HTTP connections, reuses already-opened connections, and so on
review your code so that you can serve a request faster than now, for example by making your API calls more parallel instead of awaited or nested (see the sketch below); anything you do to make your response faster helps prevent this situation
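A minimal illustration of that last point; the endpoints and the fetchJson helper are hypothetical, the shape of the change is what matters:
// Sequential: each await pays the full latency of the previous call
// before the next connection is even opened.
async function loadSequential(fetchJson) {
  const user = await fetchJson('https://api.example.com/user');
  const prices = await fetchJson('https://api.example.com/prices');
  return { user, prices };
}

// Parallel: independent calls run concurrently, so connections are held
// for a shorter total time per incoming request.
async function loadParallel(fetchJson) {
  const [user, prices] = await Promise.all([
    fetchJson('https://api.example.com/user'),
    fetchJson('https://api.example.com/prices'),
  ]);
  return { user, prices };
}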

MongoDB NodeJS driver pooling connection (Question)

I've just set up a full Node.js bot using MongoDB. This Discord server has roughly 24k people spamming the bot left and right with commands, and therefore I've used
(Info blurred out, due to it containing the username, password, and IPs)
"url": "mongodb://XXXX:XXXX#XXX.XX.XXX.XX.XXX:25000/?authSource=admin?maxPoolSize=500&poolSize=300&autoReconnect=true",
This is my URI, and as you can see I've allowed a fairly large pool size.
Normally my application (before I enabled pooling) would hit 300-600 connections on average, due to having multiple instances of "MongoDB.Connect(uri)" etc. around in the code, as well as a massive number of db.close() calls at the end of collections.
I've cleaned up the entire thing, and I only create one instance with MongoClient.Connect() and then pass that connection around the code (as a bypasser).
After that, I made sure to remove everything that would close the db (db.close();).
I've started it up, and everything still seems responsive - so there are no database/Mongo errors.
However, looking through MongoDB Compass, my connection count is stable at around 29. Which is good, obviously, but since I enabled a pool of 300, shouldn't this be higher?
This is what my mongod.cfg looks like:
Is there something I have missed, or is it all behaving as it should?
Each client connects to each server once or twice for monitoring. If you create a client that performs a single operation against a 4.4 replica set, then while that operation is running you have 7 open connections.
By reusing clients you can have a dramatic reduction in the number of total connections.
Additionally a further reduction is expected since each of your operations can complete faster (it doesn't have to wait for server discovery).
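A minimal sketch of that client-reuse pattern, assuming a recent official mongodb Node.js driver (4.x or later); the connection string, pool size, and database name are placeholders:
const { MongoClient } = require('mongodb');

// One client for the whole process; the driver maintains the pool internally.
const client = new MongoClient('mongodb://localhost:27017', { maxPoolSize: 300 });
const ready = client.connect(); // start connecting once, at startup

// Command handlers all await the same shared client instead of creating
// (and closing) their own MongoClient per command.
async function getDb() {
  await ready;
  return client.db('bot');
}
The driver only opens pool connections as concurrent operations actually need them, which is why the observed connection count can sit far below maxPoolSize.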

Node.JS WebSocket High Memory Usage

We currently have a production Node.js application that has been underperforming for a while. The application is a live bidding platform and also runs timed auctions. The actual system running live sales is perfect and works as required. However, we have noticed issues while running our timed sales (where items in a sale have timers and finish incrementally, and if someone bids within the last set time, the timer is extended by X seconds).
The issue I have found occurs during the period when a timed sale is finishing (which can go on for hours), when lots have 60 seconds between each other and get extensions if users bid in the last 10 seconds. We were able to connect via the devtools, and I have done heap memory exports to see what is going on, but all I can see is that all indications point to writable streams and buffers. So my question is: what am I doing wrong? See below a screenshot of a heap memory export:
As you can see from the above, there is a lot of memory being used; specifically, it was using 1473MB of physical RAM. We saw this rise very quickly (within 30 minutes), and each increment seemed to be larger than the last. When it hit 3.5GB it was incrementing at around 120MB each second, and as it got higher, around 5GB, it was incrementing at 500MB per second; it got to around 6GB and then the worker crashed (it has a max heap size of 8GB), and then we were a process down.
So let me tell you about the platform. It is, as I said earlier, a bidding platform. The platform uses Node (v11.3.0) and is clustered using the built-in cluster library. It spawns 4 workers and has the main process (so 5 altogether). The system accepts bids, checks other bids, calculates who is winning, and pushes updates to the connected clients via Redis PUB/SUB, which is then broadcast to that worker's connected users.
All data is stored within Redis, and MySQL is used to refresh data into Redis, as Redis has performed 10x faster than MySQL was able to.
The way this works is: on connection, a small session is created against the connection; this is then used to authenticate the user (via a message sent from the client). All message events are sent to a handler which routes them to the correct command; these commands are all defined as async functions and run asynchronously.
This has no issue on a small scale, but we had over 250 connections and were seeing the above behaviour, and we are unsure where to find a fix. We noticed when opening the top object, it was connected to buffer.js and stream_writable.js as well. I can also see all references are connected to system / JSArrayBufferData and all refer back to these; there are lots of objects, and we are unable to fix this issue.
We think one of the following:
We log to file using append mode: we log lots of information to the console and to a file using fs.writeFile in append mode (see the sketch after this list). We did some research and saw that writing to the console can be a cause of this kind of behaviour.
It is the "get lots" function, which outputs all the lots for that page (currently set to 50) every time an item finishes; so when a timer ends it asks for a full page load of all the items on that page, instead of adding new lots in.
There is something else happening here that we are unaware of; maybe an external library we are using is not removing a reference.
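For clarity, a rough illustration of that logging pattern (not the actual code; the file name and format are made up). Each log call both prints to the console and queues an independent append-mode write, so a burst of activity means many writes in flight at once:
const fs = require('fs');

function logLine(message) {
  const line = `${new Date().toISOString()} ${message}\n`;
  console.log(line.trim());                          // every event also hits the console
  fs.writeFile('auction.log', line, { flag: 'a' }, (err) => {
    if (err) console.error('log write failed', err); // each call opens, appends, closes
  });
}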
I have listed the libraries of interest that we require here:
"bluebird": "^3.5.1", (For promisifying the redis library)
"colors": "^1.2.5", (Used on every console.log (we call logs for everything that happens this can be around 50 every few seconds.)
"nodejs-websocket": "^1.7.1", (Our websocket library)
"redis": "^2.8.0", (Our redis client)
Anyway, if there is anything painstakingly obvious I would love to hear it, as everything I have followed online and in other Stack Overflow questions does not relate closely enough to the issue we are facing.

Using Fleck Websocket for 10k simultaneous connections

I'm implementing a websocket-secure (wss://) service for an online game where all users will be connected to the service as long as they are playing the game. This will use a high number of simultaneous connections, although the traffic won't be a big problem, as the service is used for chat, storage and notifications... not for real-time data synchronization.
I wanted to use Alchemy-Websockets, but it doesn't support TLS (wss://), so I have to look for another service like Fleck (or other).
Alchemy has been tested with a high number of simultaneous connections, but I didn't find similar tests for Fleck, so I need to get some real info from users of Fleck.
I know that Fleck is non-blocking and uses async calls, but I need some real info, because it might be abusing threads, the garbage collector, or some other aspect that won't be visible with a lower number of connections.
I will use C# for the client as well, so I need neither hybiXX compatibility nor a fallback; I just need scalability and TLS support.
I finally added Mono support to WebSocketListener.
Check here how to run WebSocketListener in Mono.
10K connections is no small thing. WebSocketListener is asynchronous and it scales well. I have done tests with 10K connections and it should be fine.
My tests show that WebSocketListener is almost as fast and scalable as the Microsoft one, and it performs better than Fleck, Alchemy and others.
I ran a test on a Windows machine with a Core2Duo E8400 processor and 4 GB of RAM.
The results were not encouraging, as it started delaying handshakes after it reached ~1000 connections, i.e. it would take about one minute to accept a new connection.
These results improved when I used XSockets, which reached 8000 simultaneous connections before the same thing happened.
I tried to test on a Linux VPS with Mono, but I don't have enough experience with Linux administration, and a few system settings related to TCP etc. needed to be changed in order to allow a high number of concurrent connections, so I could only reach ~1000 on the default settings; after that the app crashed (in both the Fleck test and the XSockets test).
On the other hand, I tested node.js, and it seemed simpler to manage a very high number of connections, as node didn't crash when it reached the limits of TCP.
All the tests were echo tests: the server sends the same message back to the client who sent it and to one random other connected client, and each connected client sends a random ~30-character text message to the server at a random interval between 0 and 30 seconds.
I know my tests are not generic enough, and I encourage anyone to run their own tests instead, but I just wanted to share my experience.
When we decided to try Fleck, we implemented a wrapper for the Fleck server and a JavaScript client API so that we could send acknowledgment messages back to the server. We wanted to test the performance of the server - message delivery time, percentage of lost messages, etc. The results were pretty impressive for us, and currently we are using Fleck in our production environment.
We have 4000-5000 concurrent connections during peak hours. On average 40 messages are sent per second. The acknowledged-message ratio (acknowledged messages / total sent messages) never drops below 0.994. Average round trip for messages is around 150 milliseconds (the duration between the server sending the message and receiving its ack). Finally, we did not have any memory-related problems due to the Fleck server, even after heavy usage.

Node js avoid pyramid of doom and memory increases at the same time

I am writing a socket.io based server and I'm trying to avoid the pyramid of doom and to keep the memory low.
I wrote this client - http://jsfiddle.net/QUDXU/1/ - which I run with node client-cluster 1000, so 1000 connections that are making continuous requests.
For the server side I tried 3 different solutions. The results in terms of RAM used by the server, after letting everything run for an hour, are:
Simple callbacks - http://jsfiddle.net/DcWmJ/ - 112MB
Q module - http://jsfiddle.net/hhsja/1/ - 850MB and increasing
Async module - http://jsfiddle.net/SgemT/ - 1.2GB and increasing
The server and clients are on different machines. (Softlayer cloud instances). Node 0.10.12 and Socket.io 0.9.16
Why is this happening? How can I keep the memory low and use some kind of library which allows to keep the code readable?
Option 1. You can use the cluster module and gracefully kill your workers from time to time (make sure you disconnect() first). You can check process.memoryUsage().rss > 130000000 in the master and kill the workers when they exceed 130MB, for example :)
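One way that could be wired up, as a rough sketch (the threshold, interval, and worker count are arbitrary, and a reasonably recent Node version is assumed). Since process.memoryUsage() only measures the calling process, here the workers report their own RSS to the master, which recycles any worker over the limit:
const cluster = require('cluster');

if (cluster.isMaster) {
  for (let i = 0; i < 4; i++) cluster.fork();

  cluster.on('message', (worker, msg) => {
    if (msg && msg.type === 'rss' && msg.value > 130 * 1024 * 1024) {
      worker.disconnect();                    // stop accepting new work first
      setTimeout(() => worker.kill(), 5000);  // then kill once in-flight work drains
      cluster.fork();                         // replace the recycled worker
    }
  });
} else {
  // Worker: report resident set size to the master every 30 seconds.
  setInterval(() => {
    process.send({ type: 'rss', value: process.memoryUsage().rss });
  }, 30000);
  // ... start the socket.io server here
}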
Option 2. NodeJS has the habit of using memory and rarely doing rigorous cleanups. As V8 approaches the maximum memory limit, GC calls become more aggressive. So you could lower the maximum heap a node process can take up by running node --max-old-space-size=<MB> (the flag takes the limit in megabytes). I do this when running node on embedded devices (often with less than 64 MB of RAM available).
Option 3. If you really want to keep the memory low, use weak references where possible (anywhere except in long-running calls): https://github.com/TooTallNate/node-weak . This way, the objects will get garbage collected sooner. Extensive tests to make sure everything works are needed, though. GL if you use this one :)
It seems like the problem was in the client script, not the server one. I ran 1000 processes, each of them emitting messages to the server every second. I think the server was getting very busy resolving all of those requests and thus using all of that memory. I rewrote the client side, spawning a number of processes proportional to the number of processors, each of them connecting multiple times, like this:
client = io.connect(selectedEnvironment, { 'force new connection': true, 'reconnect': false });
Notice the 'force new connection' flag, which allows connecting multiple clients using the same instance of socket.io-client.
The part that actually solved my problem was how the requests were made: each client makes its next request one second after receiving the acknowledgement of the previous one, not every second regardless.
Connecting 1000 clients makes my server use ~100MB RSS. I also used async on the server script, which seems very elegant and easier to understand than Q.
The bad part is that I've been running the server for about 2-3 days and the memory has risen to 250MB RSS. Why, I don't know.
