Azure: keeping connection in static variable to reuse it. Good strategy?

I am dealing with an Azure Function that connects to a DB (Java); I suppose something quite common.
The function may have cold or warm starts; mine should be warm most of the time (it is called every 5 minutes). The connection is stored in a pool (JdbcPooledConnectionSource) held in a static variable, so theoretically the connection should be reused on every warm start, gaining efficiency.
Is it a good strategy, or could it cause problems? For example, a physical connection gets broken but its reference is still on the heap: when that reference is later used to make a query, an exception may occur.
To avoid calls on a broken connection, I could use a non-static variable to store the connection. This should be safer, but less efficient, because the connection would have to be re-created on every call.
Which strategy is best? I suppose there are many functions that do the same thing (connecting to a DB), so surely somebody more experienced than me with Azure knows the best strategy or the common errors.

I am writing this answer because I found an error in how I was using the ConnectionSource: I was executing the query without releasing the connection:
ConnectionSource cs = getConnectionSource();
DatabaseConnection dbc = cs.getReadOnlyConnection("my_table");
dbc.executeStatement("Select count(*) from my_table;", JdbcDatabaseConnection.DEFAULT_RESULT_FLAGS);
and the connection was never released, so it was never returned to the pool to be reused (or discarded if broken). Now I have added the following call (ideally inside a finally block, so it runs even if the query throws), and it works as expected:
cs.releaseConnection(dbc);

Related

Can I cache a single value in Azure Functions without any negative effects?

I have an Azure Function on a timer that activates every minute, which calls an API which returns an integer value. I store this value in SQL.
I then have another Azure Function that can be queried by the user to retrieve the integer value. This query could in theory be as high as hundreds or thousands of times per second.
Rather than have the second Azure Function query SQL every single time it gets a request, I would like it to cache the value in memory. If the cache were perfect there would be no need for SQL at all, but because Functions can scale up, and also seem to lose their cache periodically, there has to be some persistent storage.
Is it just a case of a static variable within the function to cache the value, and another with the last date retrieved? Or is there another type of caching that I can use within the function?
I understand there are solutions such as Redis but they seem pretty overkill to spin up just for a single integer value. I'm also not even sure if Azure SQL itself would cache the value when it's requested.
My question is, would a static variable work (if it's null/reset then we'd just do a quick SQL query to get the value) and actually persist? Or does an alternative like redis or similar exist that wouldn't be overkill for this application? And finally, is there actually any harm (performance problems) in hammering SQL over and over to retrieve a single value (i.e. is it clever enough to cache already so there's not a significant performance hit vs. querying a variable in memory)?
It really depends. If you understand the limitations of using an in-memory cache in an Azure Function, and your business case is fine with those limitations, you should use it.
The main thing is that you can't invalidate the cache across instances.
So, for example, if your number changes, the cached copy may not be usable for you. You will have cases where a container for your Azure Function is running with an old value. The same user could get different values on each request, because there is no telling which instance they will hit and what that instance has cached.
If your number is something that is set only once and doesn't change, you don't have this issue.
Another important thing is that you still make quite a few requests just to populate the cache. Every new container has to cache the value for itself, while a centralized cache would do it only once. This can be fine for something small, but if the thing you're caching takes a significant amount of time to produce, or if the resources of the backing service are very limited, a centralized cache would be a lot more efficient.
No matter what, caching at the Azure Function level still reduces load, and there's no reason to make requests when you don't have to.
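If you do go the in-memory route, the usual pattern is exactly the static/module-level variable you describe, plus a timestamp. A minimal sketch, assuming a Node.js Function; the 60-second TTL and the loadFromSql helper are illustrative assumptions, not something from your setup:

// Module-level state survives warm invocations on the same instance.
let cachedValue = null;
let cachedAt = 0;
const TTL_MS = 60 * 1000; // refresh at most once a minute per instance

async function getValue() {
    const now = Date.now();
    if (cachedValue === null || now - cachedAt > TTL_MS) {
        cachedValue = await loadFromSql(); // hypothetical helper that runs the SQL query
        cachedAt = now;
    }
    return cachedValue;
}

module.exports = async function (context, req) {
    context.res = { body: { value: await getValue() } };
};

Each scaled-out instance keeps its own copy of cachedValue, which is exactly the staleness limitation described above.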
To answer your SQL question: yes, most likely SQL Server will cache the value too, but your Azure Function still needs to establish a connection to SQL Server, make the request, and close the connection.
Azure Functions best practices state that functions should be stateless and that your state information should live with the data. I think Redis is still a better option than SQL.

Azure Functions static SqlConnection - right way to scale?

I'm using Azure Functions with queue triggers for part of our workload. The specific function queries the database, and this creates problems with scaling, since the large number of concurrent function instances pinging the db means the maximum allowed number of Azure DB connections is being hit constantly.
This article https://learn.microsoft.com/en-us/azure/azure-functions/manage-connections lists HttpClient as one of those resources that should be made static.
Should database access also be made static with static SqlConnection to resolve this issue, or would that cause some other problems by keeping the constant connection object?
Should database access also be made static with static SqlConnection
Definitely not. Each function invocation should open a new SqlConnection, with the same connection string, in a using block. It's not really clear how many concurrent Function Invocations the runtime will make to a single instance of your application. But if it's more than 1, then a singleton SqlConnection is a bad thing.
I wonder exactly which limit you're hitting in SQL Database, the connection limit or the concurrent request limit? In either case I'm a bit surprised (not a Functions expert) that you get that many concurrent function invocations, so there might be something else going on. Like you're leaking SqlConnections.
But reading the Functions docs, my guess is that the Functions runtime is scaling by launching multiple instances of your function app. Your .NET app could scale in a single process, but that's apparently not the way Functions works. Each instance of your Functions app has its own connection pool for SQL Server, and by default each pool can hold up to 100 connections.
Perhaps if you sharply limit the Max Pool Size in your connection string, you won't have so many connections open. When you hit the Max Pool Size, new calls to SqlConnection.Open() will block for up to 30 seconds waiting for a pooled SqlConnection to become available. So this not only limits the connection use for each instance of your application, it also throttles your throughput under load.
You can use the configuration settings in host.json to control the level of concurrency your functions execute at per instance and the max scaleout setting to control how many instances you scale out to. This will let you control the total amount of load put on your database.
For future readers, the documentation has been updated with some information about the SQL connection stating:
Your function code may use the .NET Framework Data Provider for SQL Server (SqlClient) to make connections to a SQL relational database. This is also the underlying provider for data frameworks that rely on ADO.NET, such as Entity Framework. Unlike HttpClient and DocumentClient connections, ADO.NET implements connection pooling by default. However, because you can still run out of connections, you should optimize connections to the database. For more information, see SQL Server Connection Pooling (ADO.NET).
So, as David Browne already mentioned, you shouldn't make your SqlConnection static.

When is blocking code acceptable in node.js?

I know that blocking code is discouraged in node.js because it is single-threaded. My question is asking whether or not blocking code is acceptable in certain circumstances.
For example, if I was running an Express webserver that requires a MongoDB connection, would it be acceptable to block the event loop until the database connection was established? This is assuming that all pages served by Express require a database query (which would fail if MongoDB was not initialized).
Another example would be an application that requires the contents of a configuration file before it initializes. Is there any benefit in using fs.readFile over fs.readFileSync in this case?
Is there a way to work around this? Is wrapping all the code in a callback or promise the best way to go? How would that be different from using blocking code in the above examples?
It is really up to you to decide what is acceptable. And you would do that by determining what the consequences of blocking would be ... on a case-by-case basis. That analysis would take into account:
how often it occurs,
how long the event loop is likely to be blocked, and
the impact that blocking in that context will have on usability [1].
Obviously, there are ways to avoid blocking, but these tend to add complexity to your application. Really, you need to decide ... on a case-by-case basis ... whether that added complexity is warranted.
Bottom line: >>you<< need to decide what is acceptable based on your understanding of your application and your users.
[1] For example, in a game it would be more acceptable to block the UI while switching "levels" than during active play. Or, for a general web service, "once-off" blocking while a config file is loaded or a DB connection is established during webserver startup is more acceptable than if this happened on every request.
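As a concrete sketch of that "once-off blocking at startup" case (assuming Express and a 3.x-style MongoDB driver; the config file name, its fields, and the route are placeholders):

const fs = require('fs');
const express = require('express');
const { MongoClient } = require('mongodb');

// Blocking here is a one-time cost paid before the server accepts any requests.
const config = JSON.parse(fs.readFileSync('config.json', 'utf8'));

const app = express();

MongoClient.connect(config.mongoUrl, function (err, client) {
    if (err) throw err; // no point serving requests without the database
    const db = client.db(config.dbName);

    app.get('/count', function (req, res, next) {
        // Per-request work stays asynchronous.
        db.collection('items').countDocuments({}, function (err, n) {
            if (err) return next(err);
            res.json({ count: n });
        });
    });

    // Only start listening once the connection is established.
    app.listen(config.port || 3000);
});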
From my experience, most tasks should be handled in a callback or by returning a promise. You DO NOT want to block code in a Node application; that's what makes it so nice! Mostly, with MongoDB, the app will crash at startup if there is no connection, so it won't really have an effect on an API call because your server will be dead!
Source: I'm a developer at a bootcamp that teaches MEAN stack.
Your two examples are completely different. The distinction actually answers the question in and of itself.
Grabbing data from a database is dependent on being connected to that database. Any code that is dependent upon that data is then dependent upon that connection. These things have to happen serially for the app to function and be meaningful.
On the other hand, readFileSync will block ALL code, not just the code that relies on the file's contents. With the asynchronous fs.readFile you could, for example, start reading a CSV file while simultaneously establishing a database connection; once both are done, you could add that CSV data to the database.
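A rough sketch of that parallelism, assuming the promise-based fs API and a 3.x+ MongoDB driver; the file name, database name, and row format are made up for illustration:

const fs = require('fs').promises;
const { MongoClient } = require('mongodb');

async function importCsv() {
    // Kick off both operations at once; neither blocks the event loop.
    const [csvText, client] = await Promise.all([
        fs.readFile('data.csv', 'utf8'),
        MongoClient.connect('mongodb://localhost:27017'),
    ]);

    const rows = csvText.trim().split('\n').map(function (line) {
        const [name, value] = line.split(',');
        return { name: name, value: Number(value) };
    });

    // Only at this point do we need both results together.
    await client.db('demo').collection('items').insertMany(rows);
    await client.close();
}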

reuse mongodb connection and close it

I'm using the Node native client 1.4 in my application and I found something in the document a little bit confusing:
A Connection Pool is a cache of database connections maintained by the driver so that connections can be re-used when new connections to the database are required. To reduce the number of connection pools created by your application, we recommend calling MongoClient.connect once and reusing the database variable returned by the callback:
Several questions come in mind when reading this:
Does it mean the db object also maintains the fail-over feature provided by a replica set? I thought that should be the job of MongoClient (not sure about this, but the C# driver documentation does say MongoClient handles the replica set logic).
If I'm reusing the db object, when should I invoke the db.close() function? I saw the db.close() in every example. But shouldn't we keep it open if we want to reuse it?
EDIT:
As it's a topic about reusing, I'd also want to know how we can share the db in different functions/objects?
As the project grows bigger, I don't want to nest all the functions/objects in one big closure, but I also don't want to pass it to all the functions/objects.
What's a more elegant way to share it among the application?
The concept of "connection pooling" for database connections has been around for some time. It really is a common-sense approach: establishing a connection to a database every time you wish to issue a query is very costly, and you don't want to keep paying that overhead.
So the general principle is that you have an object handle (the db reference in this case) that checks which "pooled" connection it can use and, if the current "pool" is fully utilized, creates another connection (or a few more) up to the pool limit in order to service the request.
The MongoClient class itself is just a constructor or "factory"-type class whose purpose is to establish the connections (and indeed the connection pool) and return a handle to the database for later use. So it is actually the connections created here that are managed, for things such as replica set fail-over or choosing another router instance from those available, and that generally handle the connection logic.
As such, the general practice in "long-lived" applications is that the "handle" is either globally available or can be retrieved from an instance manager, to give access to the available connections. This avoids the need to "establish" a new connection elsewhere in your code, which, as already stated, is a costly operation.
You mention the "example" code that is present in many such driver manuals, which often (or always) calls db.close. But these are just examples, not long-running applications, and as such they tend to be "cycle complete": they show the "initialization", the "usage" of various methods, and finally the "cleanup" as the application exits.
Good application or ODM-type implementations will typically have a way to set up connections, share the pool, and then clean up gracefully when the application finally exits. You might write your code just like the "manual page" examples for small scripts, but for a larger, long-running application you will probably implement code to "clean up" your connections as your actual application exits.
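To address the EDIT about sharing, here is a rough sketch of the "globally available handle" idea, using the callback-style API of the driver version you mention; the file names and error handling are illustrative:

// db.js -- connect once, then hand the same db object to every module
var MongoClient = require('mongodb').MongoClient;

var db = null;

exports.connect = function (url, callback) {
    MongoClient.connect(url, function (err, database) {
        if (err) return callback(err);
        db = database;
        callback(null, db);
    });
};

exports.get = function () {
    if (!db) throw new Error('call connect() first');
    return db;
};

// elsewhere, after connect() has been called once at startup:
// var db = require('./db').get();
// db.collection('users').find().toArray(function (err, docs) { /* ... */ });

Because Node caches modules, every require('./db') returns the same instance, so the pool is shared without nesting everything in one closure or passing the handle around.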

Implementing general purpose long polling

I've been trying to implement a simple long-polling service for use in my own projects, and maybe release it as a SaaS if I succeed. These are the two approaches I've tried so far, both using Node.js (polling PostgreSQL on the back end).
1. Periodically check all the clients in the same interval
Every new connection is pushed onto a queue of connections, which is walked through on an interval.
var queue = [];

function acceptConnection(req, res) {
    res.setTimeout(5000);
    queue.push({ req: req, res: res });
}

function checkAll() {
    queue.forEach(function (client) {
        // respond if there is something new for the client
    });
}

// this could be replaced with a timeout after all the clients are served
setInterval(checkAll, 500);
2. Check each client at a separate interval
Every client gets its own ticker, which checks for new data:
function acceptConnection(req, res) {
    // something which periodically checks data for the client
    // and responds if there is anything new
    new Ticker(req, res);
}
While this keeps the minimum latency for each client lower, it also introduces overhead by setting a lot of timeouts.
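For reference, a minimal sketch of what such a Ticker might look like (the class itself, its 500 ms poll interval, and the 5-second timeout are assumptions for illustration, not part of the original code):

function Ticker(req, res) {
    var self = this;
    this.res = res;
    this.interval = setInterval(function () {
        self.check();
    }, 500);
    // make sure the timer is cleared when the client goes away or times out
    res.setTimeout(5000, function () { self.stop(); });
    req.on('close', function () { self.stop(); });
}

Ticker.prototype.check = function () {
    // poll the data source here; if there is new data for this client:
    // this.res.end(JSON.stringify(data)); this.stop();
};

Ticker.prototype.stop = function () {
    clearInterval(this.interval);
};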
Conclusion
Both of these approaches solve the problem quite easily, but I don't feel that this will scale up easily to something like 10 million open connections, especially since I'm polling the database on every check for every client.
I thought about doing this without the database, just immediately broadcasting new messages to all open connections, but that will fail if a client's connection dies for a few seconds while the broadcast is happening, because the message is not persisted anywhere. This means I basically need to be able to look up messages in history when the client polls for the first time.
I guess one step up here would be to have a data source where I can subscribe to new data coming in (CouchDB change notifications?), but maybe I'm missing something in the big picture here?
What is the usual approach for doing highly scalable long polling? I'm not specifically bound to Node.js, I'd actually prefer any other suggestion with a reasoning why.
Not sure if this answers your question, but I like the approach of PushPin (+ explanation of concepts).
I love the idea (using a reverse proxy and communicating with return codes + delayed REST return requests), but I do have reservations about the implementation. I might be underestimating the problem, but it seems to me that the technologies used are a bit of an overkill. I'm not sure yet whether I will use it, and I would prefer a more lightweight solution, but I find the concept phenomenal.
Would love to hear what you used eventually.
Since you mentioned scalability, I have to get a little bit theoretical, as the only practical measure is load testing. Therefore, all I can offer is advice.
Generally speaking, once-per anything is bad for scalability. Especially once-per-connection or once-per-request since that makes part of your app proportional to the amount of traffic. Node.js removed the thread-per-connection dependency with its single-threaded asynchronous I/O model. Of course, you can't completely eliminate having something per-connection, like a request and response object and a socket.
I suggest avoiding anything that opens a database connection for every HTTP connection. This is what connection pools are for.
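For example, with the PostgreSQL you mention polling, the node-postgres pool would be created once for the whole process rather than per request. A rough sketch (the table, column, and pool size are illustrative):

var pg = require('pg');

// one pool for the whole process, created at startup
var pool = new pg.Pool({ max: 10 });

function checkForNews(clientId, callback) {
    // each check borrows a pooled connection and returns it right away
    pool.query('SELECT * FROM messages WHERE client_id = $1', [clientId], function (err, result) {
        if (err) return callback(err);
        callback(null, result.rows);
    });
}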
As for choosing between your two options above, I would personally go for the second choice because it keeps each connection isolated. The first option uses a loop over connections, which means actual execution time per connection. It's probably not a big deal given that I/O is asynchronous, but given a choice between an iteration-per-connection and the mere existence of an object-per-connection, I would prefer to just have an object. Then I have less to worry about when suddenly there are 10,000 connections.
The C10K problem seems like a good reference for this, though this is really personal judgement to be honest.
http://www.kegel.com/c10k.html
http://en.wikipedia.org/wiki/C10k_problem
