We are using the following code to connect to our caches (in-memory and Redis):
settings
.WithSystemRuntimeCacheHandle()
.WithExpiration(CacheManager.Core.ExpirationMode.Absolute, defaultExpiryTime)
.And
.WithRedisConfiguration(CacheManagerRedisConfigurationKey, connectionString)
.WithMaxRetries(3)
.WithRetryTimeout(100)
.WithJsonSerializer()
.WithRedisBackplane(CacheManagerRedisConfigurationKey)
.WithRedisCacheHandle(CacheManagerRedisConfigurationKey, true)
.WithExpiration(CacheManager.Core.ExpirationMode.Absolute, defaultExpiryTime);
It works fine, but sometimes the machine is restarted (automatically by Azure, where we host it), and after the restart the connection to Redis fails with the following exception:
Connection to '{connection string}' failed.
at CacheManager.Core.BaseCacheManager`1..ctor(String name, ICacheManagerConfiguration configuration)
at CacheManager.Core.BaseCacheManager`1..ctor(ICacheManagerConfiguration configuration)
at CacheManager.Core.CacheFactory.Build[TCacheValue](String cacheName, Action`1 settings)
at CacheManager.Core.CacheFactory.Build(Action`1 settings)
According to the Azure Redis Cache FAQ (https://learn.microsoft.com/en-us/azure/redis-cache/cache-faq), section "Why was my client disconnected from the cache?", this can happen after a redeploy.
The questions are:
Is there any mechanism to restore the connection after a redeploy?
Is anything wrong with the way we initialize the connection?
We are sure the connection string is OK.
Most clients (including StackExchange.Redis) usually connect / re-connect automatically after a connection break. However, your connect timeout setting needs to be large enough for the re-connect to happen successfully. Remember, you only connect once, so it's alright to give the system enough time to reconnect. A higher connect timeout is especially useful when a burst of connections or re-connections after a blip causes CPU to spike, and some connections might otherwise not complete in time.
In this case, I see RetryTimeout set to 100. If this is the connection timeout, check whether it is in milliseconds. 100 milliseconds is too low; you might want to make it more like 10 seconds (remember it's a one-time thing, so you want to give it time to connect).
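For illustration, here is a minimal sketch of the same configuration with more generous timeouts, assuming an Azure Cache for Redis / StackExchange.Redis style connection string. The host and password are placeholders, and CacheManagerRedisConfigurationKey and defaultExpiryTime are the names from the question, assumed to be defined elsewhere:

// Sketch only: larger retry timeout plus connectTimeout / abortConnect options
// so StackExchange.Redis has enough time to (re)connect after a restart.
var connectionString =
    "mycache.redis.cache.windows.net:6380,password=<access key>,ssl=True," +
    "abortConnect=false,connectTimeout=10000";      // 10 s to connect

var cache = CacheManager.Core.CacheFactory.Build(settings => settings
    .WithSystemRuntimeCacheHandle()
        .WithExpiration(CacheManager.Core.ExpirationMode.Absolute, defaultExpiryTime)
    .And
    .WithRedisConfiguration(CacheManagerRedisConfigurationKey, connectionString)
    .WithMaxRetries(3)
    .WithRetryTimeout(10000)                        // milliseconds; was 100
    .WithJsonSerializer()
    .WithRedisBackplane(CacheManagerRedisConfigurationKey)
    .WithRedisCacheHandle(CacheManagerRedisConfigurationKey, true)
        .WithExpiration(CacheManager.Core.ExpirationMode.Absolute, defaultExpiryTime));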
During heavy load, my Consumption-plan Azure Functions are timing out with the following errors:
1. System.InvalidOperationException : An exception has been raised that is likely due to a transient failure. Consider enabling transient error resiliency by adding 'EnableRetryOnFailure()' to the 'UseSqlServer' call. ---> System.Data.SqlClient.SqlException : The client was unable to establish a connection because of an error during connection initialization process before login. Possible causes include the following: the client tried to connect to an unsupported version of SQL Server; the server was too busy to accept new connections; or there was a resource limitation (insufficient memory or maximum allowed connections) on the server. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.) ---> System.ComponentModel.Win32Exception : An existing connection was forcibly closed by the remote host.
2. System.InvalidOperationException : Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.
We are using an Azure SQL Database P4 with 500 DTUs. My initial thought was that it might be failing due to a shortage of available worker threads, but worker usage is well within the limit, peaking at 12%.
We know that some of our LINQ queries are slow and do not perform well, but fixing them would require business logic changes.
Is there any solution on the Azure infrastructure side, or any logs I can look into?
We had the first problem a couple of months ago; adding EnableRetryOnFailure() to the database configuration resolved that first issue. Sample code is given below:
var optionsBuilder = new DbContextOptionsBuilder<KasDbContext>();
optionsBuilder.UseSqlServer(getConnectionString(), options =>
{
    // Retry transient SQL failures with the configured count and delay.
    options.EnableRetryOnFailure(
        maxRetryCount: Constants.MaxRetryCountOnDbTransientFailure,
        maxRetryDelay: TimeSpan.FromSeconds(Constants.MaxDelaySecondsOnDbTransientFailure),
        errorNumbersToAdd: null);
});
return new KasDbContext(optionsBuilder.Options);
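As a side note, here is a rough sketch of the same retry policy when the context is registered through dependency injection (assuming an ASP.NET Core / Functions host with a services container; KasDbContext, getConnectionString, and the Constants values come from the snippet above, the rest is illustrative):

// Sketch: same EnableRetryOnFailure policy, registered via AddDbContext.
services.AddDbContext<KasDbContext>(options =>
    options.UseSqlServer(getConnectionString(), sql =>
        sql.EnableRetryOnFailure(
            maxRetryCount: Constants.MaxRetryCountOnDbTransientFailure,
            maxRetryDelay: TimeSpan.FromSeconds(Constants.MaxDelaySecondsOnDbTransientFailure),
            errorNumbersToAdd: null)));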
We have a setup with several RESTful APIs on the same VM in Azure.
The websites run on Kestrel behind IIS.
They are protected by an Azure Application Gateway with the firewall enabled.
We now have requests that run for at least 20 minutes.
The requests run their full length uninterrupted in Kestrel (visible in the logs), but the sender either gets "socket hang up" after exactly 5 minutes or waits forever, even when the request has finished in Kestrel. The request continues in Kestrel even if the connection to the sender was interrupted.
What I have done:
Wrote a small example application that responds after a set number of seconds, to rule out our websites being the problem.
Ran the request inside the VM (to localhost): no problems, the response was received.
Ran the request within Azure from one VM to another: the request ran forever.
Ran the request from outside Azure: the request terminates after 5 minutes with "socket hang up".
Checked the configured timeouts: Kestrel: 50 min, IIS: 4000 s, Application Gateway HTTP settings: 3600 s.
Requests were tested with Postman.
Is there another request or connection timeout hidden somewhere in Azure?
We now have requests that would run for at least 20 minutes.
This is a horrible architecture and it should be rewritten to be async. Don't take this personally, it is what it is. Consider returning a 202 Accepted with a Location header to poll for the result.
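To illustrate that suggestion, here is a minimal ASP.NET Core sketch of the accept-then-poll pattern. The ReportsController, IJobStore, and ReportRequest names are hypothetical and only show the shape of the API; they are not taken from the question:

using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/reports")]
public class ReportsController : ControllerBase
{
    private readonly IJobStore _jobs;   // hypothetical store for long-running jobs

    public ReportsController(IJobStore jobs) => _jobs = jobs;

    // Start the long-running work and return 202 immediately.
    // AcceptedAtAction also sets the Location header that clients poll.
    [HttpPost]
    public IActionResult Start(ReportRequest request)
    {
        var jobId = _jobs.Enqueue(request);   // work continues in the background
        return AcceptedAtAction(nameof(Status), new { id = jobId }, null);
    }

    // Clients poll this endpoint until the job is complete.
    [HttpGet("{id}/status")]
    public IActionResult Status(string id)
    {
        var job = _jobs.Find(id);
        if (job == null) return NotFound();
        if (job.IsComplete) return Ok(job.Result);
        return Accepted();
    }
}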
You're most probably hitting the Azure SNAT layer idle timeout. You can change it under the Configuration blade for the Public IP.
So I ran into something like this a little while back:
For us the issue was probably the timeout, as the other answer suggests, but the solution (instead of increasing the timeout) was to add PgBouncer in front of our Postgres database to manage the connections and make sure a new one is started before the timeout fires.
I'm not sure what your backend connection looks like, but something similar (a backend DB proxy) could work to give you more ability to tune connection / reconnection on your side.
We were running AKS (Azure Kubernetes Service), but all Azure public IPs obey the same rules that cause issues like this one.
While it isn't a full answer, I know there are also two types of public IP addresses; the "Basic" one doesn't have the same configurability as "Standard", so it could be something related to the difference between Basic and Standard public IPs / load balancers.
I feel like this question would have been asked before, but I can't find one. Pardon me if this is a repeat.
I'm building a service on Node.js hosted in Heroku and using MongoDB hosted by Compose. Under heavy load, the latency is most likely to come from the database, as there is nothing very CPU-heavy in the service layer. Thus, when MongoDB is overloaded, I want to return an HTTP 503 promptly instead of waiting for a timeout.
I'm also using Redis, and the Redis client has a feature where you can check the number of queued commands (redisClient.command_queue.length). With this feature, I can know right away if Redis is backed up. Is there something similar for MongoDB?
The best option I have found so far is polling the server for status via this command, but (1) I'm hoping for something client side, as there could be spikes within the polling interval that cause problems, and (2) I'm not actually sure what part of the status response I want to act on. That second part brings me to a follow up question...
I don't fully understand how the MongoDB client works with the server. Is one connection shared per client instance (and in my case, per process)? Are queries and writes queued locally or on the server? Or is one connection opened for each query/write until the database's connection pool is exhausted? If the latter is the case, it seems like I might want to keep an eye on the open connections. Does the MongoDB server return such information at other times, besides when polled for status?
Thanks!
MongoDB connection pool workflow:
Every MongoClient instance has a built-in connection pool. The client opens sockets on demand to support the number of concurrent MongoDB operations your application requires. There is no thread-affinity for sockets.
The client instance opens one additional socket per server in your MongoDB topology for monitoring the server's state.
The size of each connection pool is capped at maxPoolSize, which defaults to 100.
When a thread in your application begins an operation on MongoDB, if all other sockets are in use and the pool has reached its maximum, the thread pauses, waiting for a socket to be returned to the pool by another thread.
You can increase maxPoolSize:
client = MongoClient(host, port, maxPoolSize=200)
By default, any number of threads are allowed to wait for sockets to become available, and they can wait any length of time. Override waitQueueMultiple to cap the number of waiting threads. E.g., to keep the number of waiters less than or equal to 500:
client = MongoClient(host, port, maxPoolSize=50, waitQueueMultiple=10)
Once the pool reaches its max size, additional threads are allowed to wait indefinitely for sockets to become available, unless you set waitQueueTimeoutMS:
client = MongoClient(host, port, waitQueueTimeoutMS=100)
Reference for connection pooling:
http://blog.mongolab.com/2013/11/deep-dive-into-connection-pooling/
I get the following error:
Connection timeout. No heartbeat received.
This happens when accessing my Meteor app (http://127.0.0.1:3000). The application has been moved over to a new PC with the same code base; the server runs fine with no errors, and I can access MongoDB. What would cause the above error?
The problem seems to occur when the collection is larger; however, I have it running on another computer that loads the collections instantaneously. The SockJS connection takes over a minute and grows in size before finally failing.
Meteor's DDP implements SockJS heartbeats, which are used for long polling. This is probably due to the DDP heartbeat's default timeout of 15 s; heartbeats exist to prevent connections from being closed by proxies (which can be worse). If you access a large amount of data and the operation blocks long enough (in your case, about a minute), DDP will time out and then try to reconnect. This can go on forever and the process may never complete.
You can try disconnecting and reconnecting within a short amount of time, before DDP closes the connection, and dividing the database access into shorter continuous chunks that you pick up again on each iteration, then see if the problem persists:
// Pseudocode: split the work into chunks, disconnecting and reconnecting
// between iterations so DDP is not blocked long enough to time out.
while (cursorCount <= data) {
  Meteor.onConnection(dbOp);
  Meteor.setTimeout(Meteor.disconnect, 1500); // adjust timeout here
  Meteor.reconnect();
  cursorCount++;
}

function dbOp(cursorCount) {
  // database operation here
  // pick up the operation at cursorCount where the last disconnect() left off
}
However, all live updating will stop while disconnected, but explicitly reconnecting after each shorter chunk might make up for the smaller blocking.
See the discussion of this issue on the Google group and on Meteor Hackpad.
I'm running a Redis / Node.js server and hit the following error:
[Error: Auth error: Error: ERR max number of clients reached]
My current setup is that I have a connection manager that adds connections until the maximum number of concurrent connections for my Heroku app (256, or 128 per dyno) is reached. Once it is, it simply hands out an already existing connection. It's ultra fast and it works.
However, last night I got this error and I'm not able to reproduce it. It may be a rare error, but I'm not sleeping well knowing it's out there, because once the error is thrown, my app is no longer reachable.
So my questions would be:
Is that kind of connection manager a good idea?
Would it be better to use that manager to wait for 'idle' to be emitted and then close the connection, meaning I would have to re-establish a connection every time a request comes in (which is what I wanted to avoid)?
How can I stop my app from going down? Should I just flush the connection pool whenever an error occurs?
What are your general strategies for handling multiple concurrent connections with a given maximum?
In case somebody is reading along:
The error was caused by a messed-up redis 0.8.x client that I deployed to live:
https://github.com/mranney/node_redis/issues/251
I was smart enough to remove the failed connections from the connection pool, but forgot to call .quit() on them, so each connection was still out there in the wild and still counted as a client connection.