Consumption Plan Azure Functions Transient Failures

Consumption Plan Azure Functions Transient Failures - azure

During heavy load my consumption azure functions are timing out with the following errors-
1.System.InvalidOperationException : An exception has been raised that is likely due to a transient failure. Consider enabling transient error resiliency by adding 'EnableRetryOnFailure()' to the 'UseSqlServer' call. ---> System.Data.SqlClient.SqlException : The client was unable to establish a connection because of an error during connection initialization process before login. Possible causes include the following: the client tried to connect to an unsupported version of SQL Server; the server was too busy to accept new connections; or there was a resource limitation (insufficient memory or maximum allowed connections) on the server. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.) ---> System.ComponentModel.Win32Exception : An existing connection was forcibly closed by the remote host.
2.System.InvalidOperationException : Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.
We are using Azure SQL database P4 with 500 DTUs. My initial thought was that due to less available worker threads it might be failing. But, they are well within limit with max at 12%.
We know that some of out LINQ queries are slow and are not performing well, but that would require business logic change.
Is there any solution on Azure Infrastructure side or any logs I can look into to?

We had first problem couple of month ago, we just added EnableRetryOnFailure() on database configuration, it resolved the first issue. Sample code given below
var optionsBuilder = new DbContextOptionsBuilder<KasDbContext>();
optionsBuilder.UseSqlServer(getConnectionString(), options =>
{
options.EnableRetryOnFailure(maxRetryCount: Constants.MaxRetryCountOnDbTransientFailure, maxRetryDelay: TimeSpan.FromSeconds(Constants.MaxDelaySecondsOnDbTransientFailure), errorNumbersToAdd: null);
});
return new DbContext(optionsBuilder.Options);

Related

Number of active connections on the server reached to max

I am working with mongodb and nodejs. I have mongodb hosted on Atlas.
My backend had been working perfectly but now it is sometimes getting stuck and when I see the analytics on mongodb atlas it shows maximum number of active connections reached to 100.
Can someone please explain why this is happening? Can I reboot the connections and make it 0?
#Stennie I have used mongoose to connect to database
Here is my configuration file
const mongooseOptions = {
useNewUrlParser: true,
autoReconnect: true,
poolSize: 25,
connectTimeoutMS: 30000,
socketTimeoutMS: 30000
}
exports.register = (server, options, next) => {
defaults = Hoek.applyToDefaults(defaults, options)
if (Mongoose.connection.readyState) {
return next()
}
if (!Mongoose.connection.readyState) {
server.log(`${process.env.NOED_ENV} server connecting to ${defaults.url} ${defaults.url}`)
return Mongoose.connect(defaults.url, mongooseOptions).then(() => {
return next() // call the next item in hapi bootstrap
})
}
}

Assuming your backend is deployed on lambda since serverless tag.
Each invocation will leave a container idle to prevent cold start, or use an existing one if available. You are leaving the connection open to reuse it between invocation, like advertised in best practices.
With a poolSize of 25 (?) and 100 max connections, you should limit your function concurrency to 4.
Reserve concurrency to prevent your function from using all the available concurrency in the region, or from overloading downstream resources.
More reading: https://www.mongodb.com/blog/post/optimizing-aws-lambda-performance-with-mongodb-atlas-and-nodejs

You could try couple of things:
In a serverless environment, as already suggested by #Gabriel Bleu, why have such a high connectionLimit. Serverless environment keeps spawning new containers and stopping as per requests. If multiple instances spawn concurrently, it would exhaust the MongoDB server limit very quickly.
The concept of connectionPool is, x number of connections are established every time from every node (instance). But that does not mean all the connections are automatically released after querying. After completing ALL the DB operation, you should release each connection individual after use: mongoose.connection.close();
Note: Mongoose connection close will close all the connections of connection pool. So ideally, this should be run just before returning the response.
Why are you setting explicity autoReconnect to true. MongoDB driver internally reconnects whenever the connection is lost and certainly is not recommended for short lifespan instances such as serverless containers.
If you are running in cluster mode, to optimize for performance, change the serverUri to replica set URL format: MONGODB_URI=mongodb://<username>:<password>#<hostOne>,<hostTwo>,<hostThree>...&ssl=true&authSource=admin.

There are so many factors affecting the max connection limit. You have mongoDB hosted on Atlas and as you mentioned the backend is lamda means you have a serverless environment.
Serverless environments spawn new container on the new connection and destroy a connection when it's no longer being used. The peak connection shows that there are so many new instances being initialized or so many concurrent requests from the user connection. The best practice is to terminate database connection once it's no longer needed. You can terminate the connection
mongoose.connection.close(); as you have used mongoose. It will release the connection from the connection pool. Rather exhausting the concurrent connection limit, you should release connection once it's idle.
Your configuration forces the database driver to reconnect after the connection is dropped by the database. You are explicitly setting the autoReconnect as true so the driver will quickly instantiate connection request once the connection is dropped. That may affect the concurrent connection limit. You should avoid setting it explicitly.
cluster mode can optimize the requests according to the load, you can change the server uri to the replica of database. it may help to migrate the load.
There is a small initial startup cost of approximately 5 to 10 seconds when the Lambda function is invoked for the first time and the MongoDB client in your AWS Lambda function connects to MongoDB. Connections to a mongos for a sharded cluster are faster than connecting to a replica set. Subsequent connections will be significantly faster for the duration of the lifecycle of the Lambda function. so Each invocation will leave a container idle to prevent cold start or cold boot, or use an existing one if available.
Atlas sets the limit for concurrent incoming connections to a cluster based on the cluster tier. If you try to connect when you are at this limit, MongoDB displays an error stating “connection refused because too many open connections”. You can close any open connections to your cluster not currently in use. scaling down to a higher tier to support more concurrent connections. as mentioned in best practice you may restart the application. To prevent this issue in the future, consider utilizing the maxPoolSize connection string option to limit the number of connections in the connection pool.
Final Solution to this issue is Upgrading to a larger Atlas cluster tier which allows a greater number of connections. if your user base is too large for your current cluster tier.

Connection to Redis cache fails after restart - Azure

We are using following code to connect to our caches (in-memory and Redis):
settings
.WithSystemRuntimeCacheHandle()
.WithExpiration(CacheManager.Core.ExpirationMode.Absolute, defaultExpiryTime)
.And
.WithRedisConfiguration(CacheManagerRedisConfigurationKey, connectionString)
.WithMaxRetries(3)
.WithRetryTimeout(100)
.WithJsonSerializer()
.WithRedisBackplane(CacheManagerRedisConfigurationKey)
.WithRedisCacheHandle(CacheManagerRedisConfigurationKey, true)
.WithExpiration(CacheManager.Core.ExpirationMode.Absolute, defaultExpiryTime);
It works fine, but sometimes machine is restarted (automatically by Azure where we host it) and after the restart connection to Redis fails with following exception:
Connection to '{connection string}' failed.
at CacheManager.Core.BaseCacheManager`1..ctor(String name, ICacheManagerConfiguration configuration)
at CacheManager.Core.BaseCacheManager`1..ctor(ICacheManagerConfiguration configuration)
at CacheManager.Core.CacheFactory.Build[TCacheValue](String cacheName, Action`1 settings)
at CacheManager.Core.CacheFactory.Build(Action`1 settings)
According to Redis FAQ (https://learn.microsoft.com/en-us/azure/redis-cache/cache-faq) part: "Why was my client disconnected from the cache?" it might happen after redeploy.
The question is
is there any mechanism to restore the connection after redeploy
is anything wrong in way we initialize the connection
We are sure the connection string is OK

Most clients (including StackExchange.Redis) usually connect / re-connect automatically after a connection break. However, your connect timeout setting needs to be large enough for the re-connect to happen successfully. Remember, you only connect once, so it's alright to give the system enough time to be able to reconnect. Higher connect timeout is especially useful when you have a burst of connections or re-connections after a blip causing CPU to spike and some connections might not happen in time.
In this case, I see RetryTimeout as 100. If this is the Connection timeout, check if this is in milliseconds. 100 milliseconds is too low. You might want to make this more like 10 seconds (remember it's a one time thing, so you want to give it time to be able to connect).

Azure WebJob DB Connection Error Only on some instances

I have two Azure WebJobs. The first takes an incoming message that tells it to grab a PDF and break it into individual page images and then queue another message for the second WebJob to process the individual pages. It worked fine on our QC instance, but when we tried to move to production I started getting strange errors on the second job, but not consistently. The first job runs and breaks the file into page images. That is working fine. I have confirmed that every page image gets created and every page message gets queued. However, for the second job, only some of the messages are getting processed correctly. The remaining show this error in the WebJob diagnostics:
Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Exception while executing function: Functions.ProcessBatchPage ---> System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: SQL Network Interfaces, error: 52 - Unable to locate a Local Database Runtime installation. Verify that SQL Server Express is properly installed and that the Local Database Runtime feature is enabled.) ---> System.ComponentModel.Win32Exception: The system cannot find the file specified
But what's weird is that this error mentions the Local Database Runtime and SQL Server Express and I am not references either anywhere in my code. The system points at an Azure SQL DB. The job is ADO.Net and I have hardcoded the connection string to try to eliminate any issues with configuration based connection strings. But what's weird is that it only happens to a certain portion of the messages. The others process perfectly.
Lastly, I ran the job in debug locally (still pointing at the real queue and DB on Azure) and got the same problem. But the job outputs a console line with the job ID as the first line of the code. For those jobs that process successfully, I see this writeline. For those that fail, I never see anything. It's almost like the job is not really starting up correctly. (the failed jobs also have a really short run time 50-100ms)

I had the same issue with some jobs and I've came accross these articles to find a solution:
Transient Fault Handling (Building Real-World Cloud Apps with Azure)
Connection Resiliency / Retry Logic (EF6 onwards)
From theses articles :
Causes of transient failures :
In the cloud environment you’ll find that failed and dropped database connections happen periodically. That’s partly because you’re going through more load balancers compared to the on-premises environment where your web server and database server have a direct physical connection. Also, sometimes when you’re dependent on a multi-tenant service you’ll see calls to the service get slower or time out because someone else who uses the service is hitting it heavily. In other cases you might be the user who is hitting the service too frequently, and the service deliberately throttles you – denies connections – in order to prevent you from adversely affecting other tenants of the service.
Use smart retry/back-off logic to mitigate the effect of transient failures:
The Microsoft Patterns & Practices group has a Transient Fault Handling Application Block that does everything for you if you’re using ADO.NET for SQL Database access (not through Entity Framework). You just set a policy for retries – how many times to retry a query or command and how long to wait between tries – and wrap your SQL code in a using block :
public void HandleTransients()
{
var connStr = "some database";
var _policy = RetryPolicy.Create < SqlAzureTransientErrorDetectionStrategy(
retryCount: 3,
retryInterval: TimeSpan.FromSeconds(5));
using (var conn = new ReliableSqlConnection(connStr, _policy))
{
// Do SQL stuff here.
}
}
When you use the Entity Framework you typically aren’t working directly with SQL connections, so you can’t use this Patterns and Practices package, but Entity Framework 6 builds this kind of retry logic right into the framework. In a similar way you specify the retry strategy, and then EF uses that strategy whenever it accesses the database.
To use this feature in the Fix It app, all we have to do is add a class that derives from DbConfiguration and turn on the retry logic.
// EF follows a Code based Configuration model and will look for a class that
// derives from DbConfiguration for executing any Connection Resiliency strategies
public class EFConfiguration : DbConfiguration
{
public EFConfiguration()
{
AddExecutionStrategy(() => new SqlAzureExecutionStrategy());
}
}

Is there a limit on the number of sessions for Azure Web SQL Database?

We are using the Azure SQL Database (Web Edition) for a MVC3 ASP.NET/EF5 application.
Is there a limit to the number of sessions that this SQL Database setup supports? I am just wondering whether any delays that we are getting is due to some form of queuing or pooling. Currently we have about 5 concurrent users.
Thanks.

The SQL Azure Web edition database should support a high number of concurrent users - we've had applications running that issue thousands of queries per minute against Web databases.
Throttling
SQL Azure does implement database throttling to maintain performance for all users of the platform. If throttling has been applied to the current operation you'll receive error 40501. The link I've provided also shows you how to determine why throttling is being applied. If you receive this error you can treat it as a transient error and wait before retrying.
It doesn't sound like your connections are being throttled, because you mention only 5 concurrent users and talk about delays, whereas the throttling error would occur pretty quickly.
Transient error handling
If you're getting connection timeouts etc you need to handle them as transient errors. Transient errors are timeouts or dropped connections, as well as error codes 10054, 10053, 40501 (throttling as described above) and 40197 (usually because an upgrade or failover operation is in progress).
You should ensure you implement retry logic to handle transient errors.
Query performance
If you're executing long running queries you can check which ones are slow by logging into the database management URL:
https://<database-id>.database.windows.net/#$database=<database-name>
Log in and click "Query Performance" - take a look at the longest running queries at the top.

Catching auth error on redis / heroko / node.js

I'm running a redis / node.js server and had a
[Error: Auth error: Error: ERR max number of clients reached]
My current setup is, that I have a connection manager, that adds connections until the maximum number of concurrent connections for my heroku app (256, or 128 per dyno) is reached. If so, it just delivers an already existing connection. It's ultra fast and it's working.
However, yesterday night I got this error and I'm not able to reproduce it. It may be a rare error and I'm not sleeping well, knowing it's out there. Because: Once the error is thrown, my app is no longer reachable.
So my questions would be:
is that kind of a connection manager a good idea?
would it be a better idea to use that manager to wait for 'idle' to be called and the close the connection, meaning that I had to reestablish a connection everytime a requests kicks in (this is what I wanted to avoid)
how can I stop my app from going down? Should i just flush the connection pool whenever an error occurs?
What are your general strategies for handling multiple concurrent connections with a given maximum?

In case somebody is reading along:
The error was caused by a messed up redis 0.8.x that I deployed to live:
https://github.com/mranney/node_redis/issues/251
I was smart enough to remove the failed connections from the connection pool but forgot to call '.quit()' on it, hence the connection was out there in the wild but still a connection.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string