Azure Functions static SqlConnection - right way to scale?

I'm using Azure Functions with queue triggers for part of our workload. The function in question queries the database, and this creates problems with scaling: the large number of concurrent function instances hitting the db means the maximum allowed number of Azure SQL Database connections is reached constantly.
This article https://learn.microsoft.com/en-us/azure/azure-functions/manage-connections lists HttpClient as one of the resources that should be made static.
Should database access also be made static, with a static SqlConnection, to resolve this issue? Or would keeping a permanent connection object cause other problems?

Should database access also be made static with static SqlConnection
Definitely not. Each function invocation should open a new SqlConnection, with the same connection string, in a using block, as sketched below. It's not really clear how many concurrent function invocations the runtime will make to a single instance of your application, but if it's more than one, a singleton SqlConnection is a bad thing.
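A minimal sketch of that pattern (the queue name, app setting name, and query are placeholders):

using System;
using System.Data.SqlClient;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class ProcessOrder
{
    [FunctionName("ProcessOrder")]
    public static async Task Run([QueueTrigger("orders")] string message)
    {
        // A new SqlConnection per invocation. Open() is cheap here because
        // ADO.NET connection pooling normally hands back an already-open
        // physical connection for the same connection string.
        using (var con = new SqlConnection(
            Environment.GetEnvironmentVariable("SqlConnectionString")))
        {
            await con.OpenAsync();
            using (var cmd = new SqlCommand(
                "UPDATE dbo.Orders SET Processed = 1 WHERE Id = @id", con))
            {
                cmd.Parameters.AddWithValue("@id", message);
                await cmd.ExecuteNonQueryAsync();
            }
        } // Dispose() returns the connection to the pool rather than closing the socket.
    }
}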
I wonder exactly which limit you're hitting in SQL Database: the connection limit or the concurrent request limit? In either case I'm a bit surprised (I'm not a Functions expert) that you get that many concurrent function invocations, so there might be something else going on, like leaking SqlConnections.
But reading the Functions docs, my guess is that the Functions runtime scales by launching multiple instances of your function app. Your .NET app could scale in a single process, but that's apparently not the way Functions works. Each instance of your function app has its own connection pool for SQL Server, and by default each pool can hold up to 100 connections.
Perhaps if you sharply limit the Max Pool Size in your connection string, you won't have so many connections open. When you hit the Max Pool Size, new calls to SqlConnection.Open() will block for up to 30 seconds waiting for a pooled SqlConnection to become available. So this not only limits the connection use of each instance of your application, it also throttles your throughput under load.
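For example (server, database, and credential values are illustrative):

// Hypothetical connection string capping this instance's pool at 20 connections.
var connectionString =
    "Server=tcp:myserver.database.windows.net,1433;Database=mydb;" +
    "User ID=app;Password=<secret>;Max Pool Size=20;";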

You can use the configuration settings in host.json to control the level of concurrency your functions execute at per instance, and the max scale-out setting to control how many instances you scale out to. Together these let you control the total load placed on your database; an illustrative configuration is sketched below.
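For a queue trigger, per-instance concurrency is governed by the queue batch settings in host.json (v2 schema shown; the values are illustrative), while total scale-out can be capped with the WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT app setting:

{
  "version": "2.0",
  "extensions": {
    "queues": {
      "batchSize": 8,
      "newBatchThreshold": 4
    }
  }
}

Roughly, each instance processes at most batchSize + newBatchThreshold messages concurrently, so this bounds the connections each instance needs.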

For future readers, the documentation has been updated with some information about the SQL connection stating:
Your function code may use the .NET Framework Data Provider for SQL Server (SqlClient) to make connections to a SQL relational database. This is also the underlying provider for data frameworks that rely on ADO.NET, such as Entity Framework. Unlike HttpClient and DocumentClient connections, ADO.NET implements connection pooling by default. However, because you can still run out of connections, you should optimize connections to the database. For more information, see SQL Server Connection Pooling (ADO.NET).
So, as David Browne already mentioned, you shouldn't make your SqlConnection static.

Related

Connecting from AWS Lambda to MongoDB

I'm working on a NodeJS project using what seems to be a pretty common AWS setup. My API Gateway receives a call and triggers lambda A, which then triggers other lambdas, say B or C, depending on the params passed from API Gateway.
Lambda A needs to access MongoDB, and to avoid the hassle of running MongoDB myself I decided to use mLab. At the moment, Lambda A accesses MongoDB using the NodeJS driver.
Now, to avoid starting a connection on every Lambda A execution, I keep a connection pool in Lambda A's code, outside of the handler, which allows me to reuse connections when Lambda A is invoked multiple times.
This seems to work fine.
However, I'm not sure how to deal with connections when Lambda A is invoking Lambda B and Lambda B needs to access mLab's MongoDB database.
Is it possible to pass the connection pool somehow, or would Lambda B have to keep its own connection pool?
I was thinking of using mLab's Data API, which exposes most of the operations of the MongoDB driver, so I could use HTTP calls, e.g. GET and POST, to run commands against the database. It seems similar to RESTHeart.
I'm leaning towards option 2, but mLab's Data API documentation clearly states to avoid the REST API unless you cannot connect using the MongoDB driver directly:
The first method—the one we strongly recommend whenever possible for added performance and functionality—is to connect using one of the available MongoDB drivers. You do not need to use our API if you use the driver. The second method, documented in this article, is to connect via mLab's RESTful Data API. Use this method only if you cannot connect using a MongoDB driver.
Given all this, what would be the best approach: 1, 2, or some other option I should consider?
Unfortunately you won't be able to 'share' a mongo connection across lambdas, because ultimately there's a 'physical' socket behind the connection which is specific to that instance.
I think both of your solutions are good depending on usage.
If you tend to have steady average concurrency on both lambda A and B across an hour period (a rule-of-thumb figure for how long AWS keeps a lambda instance alive), then having them each own a static connection is a good solution, because chances are that a request will reach an already-started and connected lambda. I would also guess that Node drivers for 'vanilla' mongo are more mature than those for the RESTful Data API.
However, if you get spiky or uneven load, then you might use the RESTful Data API. This is because you'll be centralising the responsibility for managing the number of open connections to a single point, which under these conditions means you're less likely to open unneeded connections, or to use all of your current capacity and have to wait for a new connection to be established.
Ultimately it's a game of probabilistic load balancing: either you 'pool' all your connections in a central place (the Data API) and become less affected by the usage of a single function, at the expense of greater latency on individual operations, or you pool at the function level but are more exposed to cold starts opening connections under uneven concurrency.
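As a sketch of the static-connection option, here is a handler that connects once per container (shown in C# with the official MongoDB .NET driver for concreteness; the Node version is structurally identical, with the client created outside the handler; all names are placeholders):

using System;
using System.Threading.Tasks;
using Amazon.Lambda.Core;
using MongoDB.Bson;
using MongoDB.Driver;

public class LambdaA
{
    // Created once per container and reused across warm invocations;
    // the driver maintains its own connection pool behind this client.
    private static readonly MongoClient Client =
        new MongoClient(Environment.GetEnvironmentVariable("MONGO_URI"));

    public async Task<long> Handler(string input, ILambdaContext context)
    {
        var users = Client.GetDatabase("mydb")
                          .GetCollection<BsonDocument>("users");
        return await users.CountDocumentsAsync(FilterDefinition<BsonDocument>.Empty);
    }
}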

How to share database connection between different lambda functions

I went through some articles about taking advantage of Lambda's container reuse to share things like database connections between multiple invocations. However, what if I have multiple Lambda functions accessing the database, and I want them to share the same connection? These functions call each other: for example, an API Gateway call triggers the authenticator Lambda function, which then calls the insert-user function. Both of these functions make calls to the database. Is it possible for them to share the same connection?
I'm using NodeJS but I can use a different language if it would support that.
You can't share connections between instances. Concurrent invocations do not use the same instance.
You can, however, share connections between invocations (which might be executed on the same container/instance). There you have to check whether your connection is still open, in which case you can reuse it; otherwise, open a new one, as sketched below.
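A minimal sketch of that check, using SqlConnection as a stand-in for whatever client your database driver provides (treat it as pseudocode for other drivers; the setting name is a placeholder):

using System;
using System.Data;
using System.Data.SqlClient;

public static class SharedConnection
{
    private static SqlConnection _con;

    public static SqlConnection Get()
    {
        // Reuse the connection left over from a previous invocation on this
        // instance; reopen if the container is fresh or the server dropped it.
        if (_con == null || _con.State != ConnectionState.Open)
        {
            _con?.Dispose();
            _con = new SqlConnection(Environment.GetEnvironmentVariable("DB_CONN"));
            _con.Open();
        }
        return _con;
    }
}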
If you are worried about too many connections to your db, just close the connections when you exit your lambda and instantiate new ones every time. You may also need to think about concurrency if that is a problem. A few weeks ago AWS added the ability to control concurrency on a per-function basis, which is neat.
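That per-function limit is set with reserved concurrency, e.g. via the AWS CLI (function name and value are placeholders):

aws lambda put-function-concurrency \
    --function-name my-db-writer \
    --reserved-concurrent-executions 10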

reuse mongodb connection and close it

I'm using the Node native client 1.4 in my application, and I found something in the documentation a little confusing:
A Connection Pool is a cache of database connections maintained by the driver so that connections can be re-used when new connections to the database are required. To reduce the number of connection pools created by your application, we recommend calling MongoClient.connect once and reusing the database variable returned by the callback:
Several questions come in mind when reading this:
Does it mean the db object also maintains the failover feature provided by replica sets? I thought that was the job of MongoClient (not sure about this, but the C# driver documentation does say MongoClient maintains the replica set state).
If I'm reusing the db object, when should I invoke db.close()? I see db.close() in every example, but shouldn't we keep the connection open if we want to reuse it?
EDIT:
As this is a topic about reuse, I'd also like to know how we can share the db object across different functions/objects.
As the project grows bigger, I don't want to nest all the functions/objects in one big closure, but I also don't want to pass it to all the functions/objects.
What's a more elegant way to share it among the application?
The concept of "connection pooling" for database connections has been around for some time. It really is a common-sense approach: establishing a connection to a database every time you wish to issue a query is very costly, and you don't want the additional overhead involved.
So the general principle is that you have an object handle (the db reference in this case) that checks which "pooled" connection it can use, and, if the current "pool" is fully utilized, creates another connection (or a few more) up to the pool limit in order to service the request.
The MongoClient class itself is just a constructor or "factory" type class whose purpose is to establish the connections, and indeed the connection pool, and return a handle to the database for later usage. So it is actually the connections created here that are managed for things such as replica set failover, or possibly choosing another router instance from the available instances, and generally handling the connections.
As such, the general practice in "long lived" applications is that the "handle" is either globally available or retrievable from an instance manager, giving access to the available connections. This avoids the need to "establish" a new connection elsewhere in your code, which, as already stated, is a costly operation.
You mention the "example" code that appears in many driver manuals, which often or always calls db.close. But these are just examples, not long-running applications, and as such they tend to be "cycle complete": they show the "initialization", the "usage" of various methods, and finally the "cleanup" as the application exits.
Good application or ODM-type implementations will typically have a way to set up connections, share the pool, and then gracefully clean up when the application finally exits. You might write your code just like the "manual page" examples for small scripts, but for a larger, long-running application you are probably going to implement that kind of cleanup code yourself.
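A minimal sketch of such a globally available handle (in C# with the .NET driver for illustration, where no explicit close is needed; the Node equivalent is a module that connects once and exports the db object, calling db.close() only on application shutdown; names are placeholders):

using System;
using MongoDB.Driver;

// One client, and therefore one connection pool, per process; every
// function/object that needs database access goes through this handle.
public static class Db
{
    private static readonly Lazy<MongoClient> Client =
        new Lazy<MongoClient>(() =>
            new MongoClient(Environment.GetEnvironmentVariable("MONGO_URI")));

    public static IMongoDatabase Get(string name) => Client.Value.GetDatabase(name);
}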

Connecting to the new Azure Caching (DataCache, DataCacheFactory, & Connection Pooling)

The Windows Azure Caching documentation says:
"If possible, store and reuse the same DataCacheFactory object to conserve memory and optimize performance."
Has anyone seen any metrics or any quantification of how expensive this is?
One argument is that
"MaxConnectionsToServer setting... determines the number of channels per DataCacheFactory that are opened to the cache cluster."
So if MaxConnectionsToServer = 1 and DataCacheFactory is a singleton in your app, then you've effectively synchronized all requests to your web server!
However, there is a lot of indication that DataCacheFactory should be a singleton (i.e. put in Application_OnStart).
This is critical, and I can't believe it is not in the Microsoft documentation. Is the DataCacheFactory treated the same in AppFabric, Azure Shared Caching, and Azure Caching? I have a difficult time believing that Microsoft designed caching in a way that requires a singleton factory object. That would be like requiring anyone who uses SqlConnection to keep a singleton SqlConnectionFactory object in their application.
So, considering a relatively average web app (For example, 1,000s of requests per hour, ~ 100 objects in cache, the average request accesses 5 cached objects):
By default (and recommendation) how many Factory objects should there be at one time?
How long does it take to create a DataCacheFactory reference?
How long does it take to create a DataCache reference?
Should there be only 1 DataCacheFactory object per app and only 1 DataCache reference per request?
EDIT (answers in progress):
(1/2). Let Azure connection pooling handle the Factory objects
(3). Still testing...
(4). Still trying to figure out if I should re-use DataCache references
How about that: Microsoft did document best practices, and it does involve connection pooling! Although it was not easy to find (at least for me).
It appears that the answer is simply not to use the DataCacheFactory object with the newer Azure Caching, and instead to create the DataCache object directly:
"There are also new overloads to the DataCache constructor that make it simpler to create a cache client. In the past, it was always necessary to create a DataCacheFactory object that returns the target cache. Now it is possible to create the cache with the DataCache constructor directly. The following example creates a client to the default cache from the default section of the configuration file."
DataCache cache = new DataCache();
And to use connection pooling:
"With the latest Windows Azure SDK, connection pooling is enabled by default when you define your cache settings in the application or web configuration files. Because of this default behavior, it is important to set the size of the connection pool correctly. The connection pool size is configured with the maxConnectionsToServer attribute on the dataCacheClient element."
I wish Microsoft gave some guidance on how to configure maxConnectionsToServer correctly, but that can be determined through testing. The automatic connection pooling in the new Azure Caching is pretty cool :)
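For reference, a hypothetical client section with the pool capped at four connections (element and attribute names per the quoted documentation; the identifier is a placeholder):

<dataCacheClients>
  <dataCacheClient name="default" maxConnectionsToServer="4">
    <autoDiscover isEnabled="true" identifier="mycache.cache.windows.net" />
  </dataCacheClient>
</dataCacheClients>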
I'm assuming you're referring to the Shared Caching Service (previously known as Azure AppFabric Cache).
There is no cost per individual connection. However, when you purchase a Cache account, you're paying not only for the size of the cache but also for a particular number of connections.
The smallest cache account allows 10 connections, while the most expensive one allows 160 concurrent connections. Thus, if you're concerned that you may run out of connections given the size of your account, it may be prudent to be careful about how many connections you open from your app.
More details: http://msdn.microsoft.com/en-us/library/windowsazure/hh697522.aspx

How to manage db connections on server?

I have a severe problem with the database connection in my web application. Since I use a single database connection for the whole application, obtained from a singleton Database class, the database rolls back transactions when two users perform concurrent db operations.
This is the static method I use:
All threads/servlets call static Database.doSomething(...) methods, which in turn call the method below.
private static Connection con; // shared by all threads/servlets

private static /* synchronized */ Connection getConnection(final boolean autoCommit) throws SQLException {
    if (con == null) {
        con = new MyRegistrationBean().getConnection();
    }
    con.setAutoCommit(true); // TODO: should pass the autoCommit parameter through
    return con;
}
What's the recommended way to manage this database connection (or connections) so that I don't run into the same problem?
Keeping a Connection open forever is a very bad idea. It doesn't have an endless lifetime: your application may crash whenever the DB times out the connection and closes it. Best practice is to acquire and close the Connection, Statement and ResultSet in the shortest possible scope, to avoid resource leaks and the application crashes caused by leaks and timeouts.
Since connecting to the DB is an expensive task, you should use a connection pool to improve connection performance. A decent application server / servlet container usually already provides a connection pool feature in the form of a JNDI DataSource. Consult its documentation for details on how to create one; in the case of Tomcat, for example, you can find it here.
Even when using a connection pool, you still have to write proper JDBC code: acquire and close all the resources in the shortest possible scope. The connection pool will in turn worry about actually closing the connection or just releasing it back to the pool for further reuse.
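For example, with a JNDI DataSource and try-with-resources (the lookup name, table, and columns are placeholders):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.sql.DataSource;

public class UserDao {
    private final DataSource dataSource;

    public UserDao() throws NamingException {
        // The pool itself lives in the container; we only look up its handle.
        dataSource = (DataSource) new InitialContext()
                .lookup("java:comp/env/jdbc/MyDb");
    }

    public String findName(long id) throws SQLException {
        String sql = "SELECT name FROM users WHERE id = ?";
        // try-with-resources closes the ResultSet, Statement and Connection
        // in all cases; "closing" the connection returns it to the pool.
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("name") : null;
            }
        }
    }
}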
You may get some more insights from this article on how to do the JDBC basics the proper way. As a completely different alternative, learn EJB and JPA; they abstract away all the JDBC boilerplate into one-liners.
Hope this helps.
See also:
Is it safe to use a static java.sql.Connection instance in a multithreaded system?
Am I Using JDBC Connection Pooling?
How should I connect to JDBC database / datasource in a servlet based application?
When is it necessary or convenient to use Spring or EJB3 or all of them together?
I don't have much experience with PostgreSQL, but all the web applications I've worked on have used a single connection per set of actions on a page, closing and disposing of it when finished.
This allows the server to pool connections and avoids problems such as the one you are experiencing.
The singleton should be the JNDI connection pool itself; the Database class with getConnection(), query methods et al. should NOT be a singleton, though it can be static if you prefer.
This way the pool exists indefinitely, available to all users, while query blocks use dataSource.getConnection() to draw a connection from the pool, execute the query, and then close the statement, result set, and connection (returning it to the pool).
Also, JNDI lookup is quite expensive, so it makes sense to use a singleton in this case.
