MongoNetworkError Connection Timed Out - node.js

I have a MongoDB server on an EC2 instance. My Meteor app is hosted on Heroku and is connected to said server. We've had about 2 months of uptime, and just yesterday things crapped out, causing the app to crash.
Logs show:

Exception while polling query {"collectionName":"Foo","selector":{"barId":"9hcnn7vreGbM9dKSH"},"options":{"transform":null}} { MongoNetworkError: connection 5 to IP:27017 timed out
    at TLSSocket.<anonymous> (/app/.meteor/heroku_build/app/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/connection/connection.js:259:7)
    at Promise.asyncApply (packages/mongo/mongo_driver.js:1042:14)
This then repeats for what seems like an endless number of lines, and I can see the same thing happening for several other queries. It seems like clients had several connections trying to query data, and the logs are showing the failure for all of them?
Restarting the Heroku dynos seems to have resolved things. I also checked the mongod.log file. I'm seeing "msg":"Slow query" on some lines, but other than that, nothing stands out (or rather, I'm not sure what to look for).
Never had this issue before. Sounds like it could just be an anomaly with the connection, or maybe the DB being bogged down? Any insights? Thanks!
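For context, a "connection ... timed out" MongoNetworkError like the one above is typically what the Node MongoDB driver raises when a socket spends longer than its socket timeout on an operation. Below is a minimal sketch of the standard driver timeout options involved, assuming they are set through the connection string (in a Meteor app on Heroku that would usually be the MONGO_URL config var). The host, credentials, database, and timeout values are placeholders, not this deployment's actual configuration.

const { MongoClient } = require('mongodb');

// Hedged sketch: standard MongoDB connection-string timeout options.
// Host, credentials, and values are illustrative only.
const uri = 'mongodb://user:pass@example-host:27017/app'
  + '?connectTimeoutMS=10000'   // time allowed to establish a connection
  + '&socketTimeoutMS=360000';  // time a socket may sit on an operation before
                                // the driver reports "connection ... timed out"

async function ping() {
  const client = await MongoClient.connect(uri); // driver 3.x+: resolves to a MongoClient
  try {
    const doc = await client.db().collection('Foo')
      .findOne({ barId: '9hcnn7vreGbM9dKSH' });
    console.log(doc);
  } finally {
    await client.close();
  }
}

ping().catch(console.error);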

Related

MongoDB NodeJS driver connection pooling (Question)

I've just set up a full NodeJS bot, using MongoDB. This Discord server has roughly 24k people spamming the bot left and right with commands, and therefore I've used
(Info blurred out, since it contains the username, password, and IPs)
"url": "mongodb://XXXX:XXXX#XXX.XX.XXX.XX.XXX:25000/?authSource=admin?maxPoolSize=500&poolSize=300&autoReconnect=true",
This is my URI, and as you can see I've allowed a fairly large pool size.
Normally my application (before I enabled pooling) would have hit 300-600 connections on average, due to having multiple instances of "MongoDB.Connect(uri)" etc. around in the code, as well as a massive number of db.close() calls at the end of collections.
I've cleaned up the entire thing, and I now call MongoClient.Connect() only once and then pass that single connection around in the code.
After that, I made sure to wipe everything that would close the db (db.close();).
I've started it up, and everything still seems responsive - so there are no database/mongo errors.
However, looking through MongoDB Compass, my connection count is stable at around 29. Which is good, obviously, but since I enabled a pool size of 300, shouldn't this be higher?
This is what my mongod.cfg looks like:
Is there something I have missed, or is it all behaving as it should?
Each client connects to each server once or twice for monitoring. If you create a client that performs a single operation, then while that operation is running against a 4.4 replica set you have 7 open connections (with a typical three-node set: two monitoring connections per node, plus the one carrying the operation).
By reusing clients, you get a dramatic reduction in the total number of connections.
Additionally, a further reduction is expected, since each of your operations can complete faster (it doesn't have to wait for server discovery).
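A minimal sketch of that pattern, with one shared client reused by every command handler; the host, credentials, pool size, database, and collection names are placeholders, not the bot's actual code.

const { MongoClient } = require('mongodb');

// One client (one pool, one set of monitoring connections) for the whole
// process, instead of MongoClient.connect() and db.close() per command.
let clientPromise = null;

function getClient() {
  if (!clientPromise) {
    clientPromise = MongoClient.connect(
      'mongodb://user:pass@example-host:25000/?authSource=admin&maxPoolSize=300'
    );
  }
  return clientPromise;
}

// Every bot command reuses the same client; nothing closes the connection
// after an individual operation.
async function handleCommand(userId) {
  const client = await getClient();
  return client.db('bot').collection('profiles').findOne({ userId });
}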

Excessive open connections to Mongos instances

We're moving from a single replica set to shards and are experiencing some issues. We have 3 mongos instances, 3 config servers, and 15 data nodes (5 shards with 3 replicas). We're seeing really poor query performance and looking at the mongos instances I'm seeing something like 25k open connections per instance!
For example, I'm seeing log lines like
[listener] connection accepted from 10.10.36.122:35098 #521622 (23858 connections now open)
and
[conn498875] end connection 10.10.36.122:41520 (23695 connections now open)
For reference, we have another nearly identical environment that we have not yet moved to sharding which is showing ~250 total open connections.
The application code is using the nodejs driver and is using a connection url that looks something like
mongodb://mongos0.some.internal.domain:27017,mongos1.some.internal.domain:27017,mongos2.some.internal.domain:27017
I'm at a bit of a loss for how to track this issue down. Is this not the correct way to connect to mongos?
EDIT (7/7/18)
After some experimenting, I found that we were using a connectTimeoutMS of 180000 (3 minutes). Removing this value resolved the issue. However, it's still not clear why this configuration works with a standalone replica set, but causes issues when sharding. Can anyone explain what's going on here?
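For reference, a sketch of where an option like that is typically passed with the Node driver. The mongos hostnames are the ones from the question; everything else is illustrative and not the original application code.

const { MongoClient } = require('mongodb');

// Hedged sketch: passing connectTimeoutMS when connecting through the mongos
// routers. Omitting the option falls back to the driver default, which is
// what resolved the connection build-up described above.
const uri =
  'mongodb://mongos0.some.internal.domain:27017,' +
  'mongos1.some.internal.domain:27017,' +
  'mongos2.some.internal.domain:27017';

const client = new MongoClient(uri, {
  connectTimeoutMS: 180000, // the 3-minute value the poster later removed
});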

Connection to Redis cache fails after restart - Azure

We are using the following code to connect to our caches (in-memory and Redis):
settings
    .WithSystemRuntimeCacheHandle()
    .WithExpiration(CacheManager.Core.ExpirationMode.Absolute, defaultExpiryTime)
    .And
    .WithRedisConfiguration(CacheManagerRedisConfigurationKey, connectionString)
    .WithMaxRetries(3)
    .WithRetryTimeout(100)
    .WithJsonSerializer()
    .WithRedisBackplane(CacheManagerRedisConfigurationKey)
    .WithRedisCacheHandle(CacheManagerRedisConfigurationKey, true)
    .WithExpiration(CacheManager.Core.ExpirationMode.Absolute, defaultExpiryTime);
It works fine, but sometimes the machine is restarted (automatically by Azure, where we host it), and after the restart the connection to Redis fails with the following exception:
Connection to '{connection string}' failed.
    at CacheManager.Core.BaseCacheManager`1..ctor(String name, ICacheManagerConfiguration configuration)
    at CacheManager.Core.BaseCacheManager`1..ctor(ICacheManagerConfiguration configuration)
    at CacheManager.Core.CacheFactory.Build[TCacheValue](String cacheName, Action`1 settings)
    at CacheManager.Core.CacheFactory.Build(Action`1 settings)
According to the Azure Redis FAQ (https://learn.microsoft.com/en-us/azure/redis-cache/cache-faq), under "Why was my client disconnected from the cache?", this might happen after a redeploy.
The questions are:
Is there any mechanism to restore the connection after a redeploy?
Is there anything wrong with the way we initialize the connection? (We are sure the connection string is OK.)
Most clients (including StackExchange.Redis) usually connect / re-connect automatically after a connection break. However, your connect timeout setting needs to be large enough for the re-connect to happen successfully. Remember, you only connect once, so it's alright to give the system enough time to be able to reconnect. A higher connect timeout is especially useful when a burst of connections or re-connections after a blip causes CPU to spike, and some connections might otherwise not complete in time.
In this case, I see RetryTimeout as 100. If this is the connection timeout, check whether it is in milliseconds. 100 milliseconds is too low; you might want to make it more like 10 seconds (remember, it's a one-time thing, so you want to give it time to be able to connect).

OpenShift HAProxy scaling is just not working

I've been trying to get OpenShift's HAProxy scaling working with my NodeJS Express 4 app (it's essentially a REST API), but I haven't had much luck.
I'm using loader.io's stress testing tools with a mere 100 users/minute (ramping up from 0), as I'm sure NodeJS/Express should at least be able to handle that. Granted, this does generate roughly 10-20k requests in 60 seconds, but still.
What happens after the requests start pounding the server is that I can see the CPU go up, memory stays pretty solid, and HAProxy's log file lets me know that it's about to scale up.
It never does. HAProxy crashes before it can scale, then I lose the SSH connection to the OpenShift host. It comes back after a while, though.
At one point I did see that it was hitting the default 128 connection limit, then trying to spin up another gear, but since the requests kept coming in, I'm guessing it just couldn't handle it?
At first I thought that it was due to using a small gear, as I was running 'top' and saw that the CPU load spiked through the roof and I eventually disconnected.
I deleted the app and switched to small.highcpu gears (which cost money per hour).
Still crashes when it's supposed to scale up (with less than 100 concurrent users).
The small.highcpu gear does do something different though: after it restarts, it adds a new gear, but it does NOT scale down (even though all traffic has stopped), so I have to scale down manually.
If I leave the second gear up and try to stress test again with 100 users within 1 minute, HAProxy still goes down (memory usage and CPU seem to be OK) and I lose the SSH connection shortly afterwards. Also, this time it does NOT come up by itself. I also receive the following error in my NodeJS app:
{ [Error: socket hang up] code: 'ECONNRESET' }
{ [Error: socket hang up] code: 'ECONNRESET', sslError: undefined }
If I manually restart HAProxy after this (I kinda have to since it's not coming up), I can see that the local-gear is down, while the second gear is up, meaning that my NodeJS app crashed on the first gear, but stayed online on the second gear.
Is this really intended behaviour? Should I be doing something differently when dealing with NodeJS and HAProxy?
I really can't justify paying for a service such as this if I can't even handle 100 users/minute, since I'm certain that I will eventually peak far beyond 100.
UPDATE: Here's a loader.io graph/report, which kinda shows when HAProxy is giving up:
http://ldr.io/1tV2iwj
UPDATE 2: I tried using Blitz instead of loader.io, just to be certain on when HAProxy goes crazy. Blitz ended up with 12k hits, 26k errors and 4k timeouts.
Additionally, HAProxy went down and seemed like it would never come back up. This time I decided to wait, and after a few minutes, the local-gear DID come back up. It didn't bring up any additional gears, though.
Here's also what HAProxy was telling me when the Blitz test happened (before it crashed and I disconnected):
==> app-root/logs/haproxy_ctld.log <==
I, [2014-10-13T07:14:48.857616 #74934] INFO -- : add-gear - capacity: 143.75% gear_count: 1 sessions: 23 up_thresh: 90.0%
==> app-root/logs/haproxy.log <==
[WARNING] 285/071506 (74918) : Server express/local-gear is DOWN, reason: Layer7 timeout, check duration: 10002ms. 0 active and 0 backup servers left. 128 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 285/071506 (74918) : proxy 'express' has no server available!
[WARNING] 285/071511 (74918) : Server express/local-gear is DOWN for maintenance.
UPDATE 3: Tried again with Blitz, this time HAProxy/NodeJS didn't come back up, but instead got stuck on the following line (I can still SSH in):
DEBUG: Sending SIGTERM to child...
There's not much of a pattern here, except that HAProxy isn't doing what it's supposed to be doing: scaling.
I'm fairly confident that it's not my NodeJS app at fault here, as it's not reporting any errors (to the log file or to New Relic).
Your gear is running out of memory, and thus all of your processes are being killed. (That's why you are also getting kicked out of your SSH session.) When that happens, it can potentially put the HAProxy configuration in a bad state, and if it does not automatically repair itself on a restart, I would consider that to be a bug.

Catching auth error on redis / heroku / node.js

I'm running a redis / node.js server and had a
[Error: Auth error: Error: ERR max number of clients reached]
My current setup is that I have a connection manager that adds connections until the maximum number of concurrent connections for my Heroku app (256, or 128 per dyno) is reached. Once that limit is hit, it just delivers an already existing connection. It's ultra fast and it's working.
However, last night I got this error and I'm not able to reproduce it. It may be a rare error, but I'm not sleeping well knowing it's out there, because once the error is thrown, my app is no longer reachable.
So my questions would be:
Is that kind of connection manager a good idea?
Would it be a better idea to have that manager wait for 'idle' to be called and then close the connection, meaning I'd have to reestablish a connection every time a request kicks in (this is what I wanted to avoid)?
How can I stop my app from going down? Should I just flush the connection pool whenever an error occurs?
What are your general strategies for handling multiple concurrent connections with a given maximum?
In case somebody is reading along:
The error was caused by a messed up redis 0.8.x that I deployed to live:
https://github.com/mranney/node_redis/issues/251
I was smart enough to remove the failed connections from the connection pool, but forgot to call '.quit()' on them, so each failed connection was out there in the wild but still counted as a connection.
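A minimal sketch of that fix: when a pooled connection errors out, evict it from the pool and also call .quit() so the server-side slot is actually released. The pool structure and names below are illustrative, not the original connection manager.

const redis = require('redis');

const MAX_CONNECTIONS = 128; // per-dyno limit mentioned in the question
const pool = [];

function getConnection() {
  if (pool.length < MAX_CONNECTIONS) {
    const client = redis.createClient(process.env.REDIS_URL);
    client.on('error', (err) => {
      // Drop the broken connection from the pool...
      const i = pool.indexOf(client);
      if (i !== -1) pool.splice(i, 1);
      // ...and close it; otherwise it keeps counting against the server's
      // max-clients limit even though it is never handed out again.
      client.quit();
      console.error('removed failed redis connection from pool:', err);
    });
    pool.push(client);
    return client;
  }
  // Pool is full: hand back an existing connection instead of opening a new one.
  return pool[Math.floor(Math.random() * pool.length)];
}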
