Why are database connection pools better than a single connection? - multithreading

I'm currently working on writing a multithreaded application that will need to access a database in order to serve requests. I see many people saying that using a pool of many persistent database connections is the way to go for this type of application, but I'm trying to wrap my head around why exactly this is the case.
Keep in mind that I'm designing this application in Erlang, so I'll be using threads/processes/workers a lot.
So let's compare two situations:
You have a single thread that owns a single database connection. All your client-handling-threads talk to this thread in order to make database queries.
You have a pool of threads, each with their own database connection. When a client-handling-thread wants to access the database, it gets one of these threads from the pool, and uses that to query the DB.
In the first case, I see many people saying that it is bad because having one thread handling all database-related queries will in turn cause a bottleneck. But my confusion is the following: Wouldn't the bottleneck in that single thread actually be the database itself? If all that the thread is doing is querying the database through its connection handle, isn't waiting for the DB to respond to requests the main source of latency? How will throwing more connections/threads at this problem solve it?

The database probably has well-developed multithreading abilities. Using a connection pool allows you to:
Make use of the DB's multithreading / load-balancing ability
Avoid the overhead of setting up and tearing down connections over and over
When the database is serving multiple connections, it can make its own decisions on how to prioritize requests. Imagine this scenario:
1. User A requests a set of records from Table A with 100,000 rows
2. User B requests a set of records from Table B with 50 rows
3. User C updates Table A
If multiple connections are used, the DB can take advantage of the fact that (1) and (2) can occur concurrently, and User B gets his 50 records without having to wait for User A to get all 100,000 of his. Only User C has to wait for User A to finish.
Also, setting up and tearing down TCP connections is a relatively expensive task. Using a pool allows one user to release the resource without tearing down the TCP connection, so the next user doesn't have to wait for a new connection. Your single-threaded approach wouldn't benefit from this aspect of connection-pooling, though.
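To make the contrast concrete, here is a minimal sketch in TypeScript using the node-postgres (pg) library, purely for illustration; the connection string and table names are placeholders, and the same idea applies to an Erlang worker pool.

import { Client, Pool } from 'pg';

async function compare() {
  // Scenario 1: a single shared connection. node-postgres serializes all
  // queries issued on one Client, so concurrent callers queue behind each other.
  const single = new Client({ connectionString: 'postgres://localhost/mydb' }); // placeholder
  await single.connect();
  await Promise.all([
    single.query('SELECT * FROM table_a'), // runs first
    single.query('SELECT * FROM table_b'), // waits for the previous query to finish
  ]);
  await single.end();

  // Scenario 2: a pool. Each query checks out its own connection, so the
  // database is free to work on both requests concurrently.
  const pool = new Pool({ connectionString: 'postgres://localhost/mydb', max: 10 });
  await Promise.all([
    pool.query('SELECT * FROM table_a'), // these two can run in parallel
    pool.query('SELECT * FROM table_b'),
  ]);
  await pool.end();
}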

Related

How does Postgres handle more requests than connections

While going through the Postgres architecture, one of the things mentioned was that Postgres has a connection limit of 500 (which can be modified). And to fetch any data from Postgres, we first need to make a connection to it. So what happens if 10k simultaneous requests come to the DB? How do the requests map to the connection limit, since we have a limit of 500? Do we need to increase the limit, create more instances of Postgres, or is concurrency in play?
If there are 10000 concurrent statements running on a single database, any hardware will be overloaded. You just cannot do that.
Even 500 is way too many concurrent requests, so that value is too high for max_connections (or for the number of concurrent active sessions to be precise).
The good thing is that you don't have to do that. You use a connection pool that acts as a proxy between the application and the database. If your database statements are sufficiently short, you can easily handle thousands of concurrent application users with a few dozen database connections. This protects the database from getting overloaded and avoids opening database connections frequently, which is expensive.
If you try to open more database connections than max_connections allows, you will get an error message. If more processes request a database connection from the pool than the limit allows, some sessions will hang and wait until a connection is available. Yet another point for using a connection pool!
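To illustrate, here is a hedged sketch in TypeScript with node-postgres; the connection string, table and numbers are made up. At most max connections are ever open against the database, and the remaining callers simply wait in the pool's internal queue.

import { Pool } from 'pg';

// A couple of dozen connections serving thousands of short statements.
// Requests beyond the pool size wait for a free connection instead of
// pushing the database toward its max_connections limit.
const pool = new Pool({
  connectionString: 'postgres://localhost/mydb', // placeholder
  max: 20,                                       // far below max_connections
});

async function handleRequest(userId: number) {
  // pool.query() checks out a connection, runs the statement, and returns
  // the connection to the pool automatically.
  const res = await pool.query('SELECT * FROM accounts WHERE user_id = $1', [userId]);
  return res.rows;
}

async function main() {
  // 10,000 "concurrent" application requests, but never more than 20
  // database connections in use at any moment.
  await Promise.all(Array.from({ length: 10_000 }, (_, i) => handleRequest(i)));
}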

How many sessions will be created using a single pool?

I am using Knex version 0.21.15 from npm. My pooling parameter is pool: {min: 3, max: 300}.
Oracle is my database server.
Is pool a pool count or a session count?
If it is a pool count, how many sessions can be created using a single pool?
If I run one non-transactional query 10 times using a knex connection, how many sessions will be created?
And when will the created sessions be cleared from Oracle?
Is there any parameter available to remove idle sessions from Oracle?
Please suggest if there is.
WARNING: a pool.max value of 300 is far too large. You really don't want the database administrator running your Oracle server to distrust you: that can make your work life much more difficult. And such a large max pool size can bring the Oracle server to its knees.
It's a paradox: often you can get better throughput from a database application by reducing the pool size. That's because many concurrent queries can clog the database system.
The pool object here governs how many connections may be in the pool at once. Each connection is a so-called serially reusable resource. That is, when some part of your nodejs program needs to run a query or series of queries, it grabs a connection from the pool. If no connection is already available in the pool, the pooling stuff in knex opens a new one.
If the number of open connections is already at the pool.max value, the pooling stuff makes that part of your nodejs program wait until some other part of the program finishes using a connection in the pool.
When your part of the nodejs program finishes its queries, it releases the connection back to the pool to be reused when some other part of the program needs it.
This is almost absurdly complex. Why bother? Because it's expensive to open connections and much cheaper to re-use them.
Now to your questions:
Is pool a pool count or a session count?
It is a pair of limits (min / max) on the count of connections (sessions) open within the pool at one time.
If it is a pool count, how many sessions can be created using a single pool?
Up to the pool.max value.
If I run one non-transactional query 10 times using a knex connection, how many sessions will be created?
It depends on concurrency. If your tenth query starts before the first one completes, you may use ten connections from the pool. But you will most likely use fewer than that.
And when will the created sessions be cleared from Oracle?
As mentioned, the pool keeps up to pool.max connections open. That's why 300 is too many.
Is there any parameter available to remove idle sessions from Oracle?
This operation is called "evicting" connections from the pool. knex does not support this. Oracle itself may drop idle connections after a timeout. Ask your DBA about that.
In the meantime, use the knex defaults of pool: {min: 2, max: 10} unless and until you really understand pooling and the required concurrency of your application. max:300 would only be justified under very special circumstances.
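For reference, a knex setup along those lines might look like the sketch below; the connection details and table name are placeholders and it assumes the oracledb client.

import knex from 'knex';

const db = knex({
  client: 'oracledb',
  connection: {
    user: 'app_user',                      // placeholder
    password: 'secret',                    // placeholder
    connectString: 'dbhost:1521/ORCLPDB1', // placeholder
  },
  pool: { min: 2, max: 10 },               // the knex defaults: start here
});

async function example() {
  // Each query transparently acquires a connection from the pool and
  // releases it when the query finishes, so you never hold more than
  // pool.max Oracle sessions at once.
  return db('some_table').select('*').limit(10);
}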

How to avoid the database being hit hard when the API gets bursts of calls?

I have an API which allows other microservices to call on to check whether a particular product exists in the inventory. The API takes in only one parameter which is the ID of the product.
The API is served through API Gateway in Lambda and it simply queries against a Postgres RDS to check for the product ID. If it finds the product, it returns the information about the product in the response. If it doesn't, it just returns an empty response. The SQL is basically this:
SELECT * FROM inventory where expired = false and product_id = request.productId;
However, the problem is that many services are calling this particular API very heavily to check the existence of products. Not only that, the calls often come in bursts. I assume those services loop through a list of product IDs and check for their existence individually, hence the burst.
The number of concurrent calls on the API has resulted in it making many queries to the database. The rate can burst beyond 30 queries per second, and there can be a few hundred thousand requests to fulfil. The queries are mostly the same, except for the product ID in the where clause. The column has been indexed and the query takes an average of only 5-8 ms to complete. Still, the connection to the database occasionally times out when the rate gets too high.
I'm using Sequelize as my ORM and the error I get when it times out is SequelizeConnectionAcquireTimeoutError. There is a good chance that the burst rate was too high and it maxed out the pool too.
Some options I have considered:
1. Using a cache layer. But I have noticed that, most of the time, 90% of the product IDs in the requests are not repeated. This would mean that 90% of the time it would be a cache miss and it will still query against the database.
2. Auto scale up the database. But because the calls are bursty and I don't know when they may come, the autoscaling won't complete in time to avoid the timeouts. Moreover, the query is a very simple select statement and the CPU of the RDS instance hardly crosses 80% during the bursts, so I doubt scaling it would do much either.
What other techniques can I use to keep the database from being hit hard when the API gets burst calls that are mostly unique and difficult to cache?
Use a cache populated at boot time
You can load all the necessary columns into an in-memory data store (Redis). Every update in the database (via a cron job) would then be reflected in the cached data.
Problems: memory usage and the overhead of keeping the cache up to date
Limit DB calls
Create a buffer for IDs. Store n IDs and then make one query for all of them, or empty the buffer every m seconds (a sketch of this idea follows this list).
Problems: added client response time; an extra step to distribute the query results
Change your database
Use a NoSQL database for this data. According to this article and this one, I think choosing a NoSQL database is a better idea.
Problems: multiple data stores
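A rough sketch of the "limit DB calls" buffering idea above, in TypeScript with node-postgres; the batch size, flush interval, pool size and column names are assumptions rather than tuned values.

import { Pool } from 'pg';

const pool = new Pool({ max: 10 });  // placeholder pool configuration
const BATCH_SIZE = 100;              // flush when this many IDs are buffered
const FLUSH_MS = 50;                 // ...or after this many milliseconds

type Waiter = { id: string; resolve: (row: unknown) => void; reject: (e: Error) => void };
let buffer: Waiter[] = [];
let timer: NodeJS.Timeout | null = null;

async function flush() {
  const batch = buffer;
  buffer = [];
  if (timer) {
    clearTimeout(timer);
    timer = null;
  }
  if (batch.length === 0) return;
  try {
    // One query answers the whole batch instead of one query per ID.
    const ids = batch.map((w) => w.id);
    const res = await pool.query(
      'SELECT * FROM inventory WHERE expired = false AND product_id = ANY($1)',
      [ids]
    );
    const byId = new Map<string, unknown>();
    for (const row of res.rows) byId.set(row.product_id, row);
    for (const w of batch) w.resolve(byId.get(w.id) ?? null);
  } catch (err) {
    for (const w of batch) w.reject(err as Error);
  }
}

// Callers await lookup(id); each call joins the current batch.
export function lookup(id: string): Promise<unknown> {
  return new Promise((resolve, reject) => {
    buffer.push({ id, resolve, reject });
    if (buffer.length >= BATCH_SIZE) flush();
    else if (!timer) timer = setTimeout(flush, FLUSH_MS);
  });
}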
Start with a covering index to handle your query. You might create an index like this for your table:
CREATE INDEX inv_lkup ON inventory (product_id, expired) INCLUDE (col, col, col);
Mention all the columns in your SELECT in the index, either in the main list of indexed columns or in the INCLUDE clause. Then the DBMS can satisfy your query completely from the index. It's faster.
You could start using AWS Lambda throttling to handle this problem. But for that to work, the consumers of your API will need to retry when they get 429 responses. That might be super-inconvenient.
Sorry to say, you may need to stop using lambda. Ordinary web servers have good stuff in them to manage burst workload.
They have an incoming connection (TCP/IP listen) queue. Each new request coming in lands in that queue, where it waits until the server software accepts the connection. When the server is busy, requests wait in that queue; under high load they wait a bit longer. In Node.js's case, if you use clustering there's just one of these incoming connection queues, and all the processes in the cluster use it.
The server software you run (to handle your API) has a pool of connections to your DBMS. That pool has a maximum number of connections in it. As your server software handles each request, it acquires a connection from the pool. If no connection is immediately available, the request handling pauses until one is, then proceeds. This too smooths out the requests to the DBMS. (Be aware that each process in a Node.js cluster has its own pool.)
Paradoxically, a smaller DBMS connection pool can improve overall performance, by avoiding too many concurrent SELECTs (or other queries) on the DBMS.
This kind of server configuration can be scaled out: a load balancer will do. So will a server with more cores and more nodejs cluster processes. An elastic load balancer can also add new server VMs when necessary.
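On the pool side, a hedged Sequelize configuration in that spirit might look like the following; the connection URL and numbers are placeholders, not tuned recommendations. With a bounded pool, a burst shows up as requests queueing for up to the acquire timeout rather than as a flood of new connections to RDS.

import { Sequelize } from 'sequelize';

const sequelize = new Sequelize('postgres://user:pass@rds-host:5432/inventory', { // placeholder URL
  pool: {
    max: 10,        // deliberately small: see the pool-size paradox above
    min: 0,
    acquire: 30000, // ms to wait for a pooled connection before throwing
                    // SequelizeConnectionAcquireTimeoutError
    idle: 10000,    // ms before an idle connection is released
  },
  logging: false,
});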

Reusing a MongoDB connection, and when to close it

I'm using the Node native client 1.4 in my application and I found something in the document a little bit confusing:
A Connection Pool is a cache of database connections maintained by the driver so that connections can be re-used when new connections to the database are required. To reduce the number of connection pools created by your application, we recommend calling MongoClient.connect once and reusing the database variable returned by the callback:
Several questions come in mind when reading this:
Does it mean the db object also handles the failover behaviour provided by a replica set? I thought that should be the work of MongoClient (not sure about this, but the C# driver documentation does say MongoClient handles the replica set).
If I'm reusing the db object, when should I invoke the db.close() function? I see db.close() in every example, but shouldn't we keep it open if we want to reuse it?
EDIT:
As it's a topic about reusing, I'd also want to know how we can share the db in different functions/objects?
As the project grows bigger, I don't want to nest all the functions/objects in one big closure, but I also don't want to pass it to all the functions/objects.
What's a more elegant way to share it among the application?
The concept of "connection pooling" for database connections has been around for some time. It really is a common-sense approach: when you consider it, establishing a connection to the database every time you wish to issue a query is very costly, and you don't want that additional overhead.
So the general principle is that you have an object handle (the db reference in this case) that checks which "pooled" connection it can use and, if the current "pool" is fully utilized, creates another connection (or a few more) up to the pool limit in order to service the request.
The MongoClient class itself is just a constructor or "factory" type class whose purpose is to establish the connections, and indeed the connection pool, and return a handle to the database for later use. So it is actually the connections created here that are managed for things such as replica set failover, or choosing another router instance from those available, and generally handling the connections.
As such, the general practice in "long lived" applications is that the "handle" is either globally available or can be retrieved from an instance manager to give access to the available connections. This avoids the need to "establish" a new connection elsewhere in your code, which, as already stated, is a costly operation.
You mention the "example" code that is present in many driver manuals, which often or always calls db.close. But these are just examples and not intended as long-running applications, and as such those examples tend to be "cycle complete": they show the "initialization", the "usage" of the various methods, and finally the "cleanup" as the application exits.
Good application or ODM type implementations will typically have a way to setup connections, share the pool and then gracefully cleanup when the application finally exits. You might write your code just like "manual page" examples for small scripts, but for a larger long running application you are probably going to implement code to "clean up" your connections as your actual application exits.
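As a concrete illustration of the "connect once, share the handle" advice, here is a sketch written against the old 1.x callback API mentioned in the question; the module layout and names are made up, not part of the driver.

// db.ts: a tiny module that owns the single connection handle for the app
const { MongoClient } = require('mongodb');

let db: any = null;

export function connect(url: string, done: (err: Error | null, db?: any) => void) {
  if (db) return done(null, db);            // already connected: reuse the handle
  MongoClient.connect(url, (err: Error | null, database: any) => {
    if (err) return done(err);
    db = database;                          // cache it; the driver pools connections underneath
    done(null, db);
  });
}

export function getDb(): any {
  if (!db) throw new Error('connect() has not been called yet');
  return db;
}

Any other module can then import this one and call getDb() instead of opening its own connection, and db.close() is only invoked once, when the application shuts down.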

JDBC: Can I share a connection in a multithreading app, and enjoy nice transactions?

It seems like the classical way to handle transactions with JDBC is to set auto-commit to false. This creates a new transaction, and each call to commit marks the beginning of the next transaction.
On multithreading app, I understand that it is common practice to open a new connection for each thread.
I am writing a RMI based multi-client server application, so that basically my server is seamlessly spawning one thread for each new connection.
To handle transactions correctly, should I go and create a new connection for each of those threads?
Isn't the cost of such an architecture prohibitive?
Yes, in general you need to create a new connection for each thread. You don't have control over how the operating system timeslices execution of threads (notwithstanding defining your own critical sections), so you could inadvertently have multiple threads trying to send data down that one pipe.
Note the same applies to any network communications. If you had two threads trying to share one socket with an HTTP connection, for instance.
Thread 1 makes a request
Thread 2 makes a request
Thread 1 reads bytes from the socket, unwittingly reading the response from thread 2's request
If you wrapped all your transactions in critical sections, and therefore lock out any other threads for an entire begin/commit cycle, then you might be able to share a database connection between threads. But I wouldn't do that even then, unless you really have innate knowledge of the JDBC protocol.
If most of your threads have infrequent need for database connections (or no need at all), you might be able to designate one thread to do your database work, and have other threads queue their requests to that one thread. That would reduce the overhead of so many connections. But you'll have to figure out how to manage connections per thread in your environment (or ask another specific question about that on StackOverflow).
update: To answer your question in the comment, most database brands don't support multiple concurrent transactions on a single connection (InterBase/Firebird is the only exception I know of).
It'd be nice to have a separate transaction object, and to be able to start and commit multiple transactions per connection. But vendors simply don't support it.
Likewise, standard vendor-independent APIs like JDBC and ODBC make the same assumption, that transaction state is merely a property of the connection object.
It's uncommon practice to open a new connection for each thread.
Usually you use a connection pool library like c3p0.
If you are in an application server, or using Hibernate for example, look at the documentation and you will find how to configure the connection pool.
The same connection object can be used to create multiple statement objects, and these statement objects can then be used by different threads concurrently. Most modern DBs interfaced by JDBC can do that. JDBC is thus able to make use of concurrent cursors as follows. PostgreSQL is no exception here; see for example:
http://doc.postgresintl.com/jdbc/ch10.html
This allows connection pooling where the connection is only used for a short time, namely to create the statement object, and is returned to the pool right after that. This short-time pooling is only recommended when the JDBC driver also parallelizes statement operations; otherwise normal connection pooling might show better results. Either way, the thread can continue to work with the statement object and close it later, but not the connection.
1. Thread 1 opens a statement
2. Thread 2 opens a statement
3. Thread 1 does something; Thread 2 does something
4. ...
5. Thread 1 closes its statement
6. Thread 2 closes its statement
The above only works in auto-commit mode. If transactions are needed, there is still no need to tie a transaction to a thread. You can simply partition the pooling along transactions and use the same approach as above. This is needed not because of some socket connection limitation, but because JDBC equates the session ID with the transaction ID.
If I remember correctly, there are APIs and products around with a less simplistic design, where the session ID and the transaction ID are not equated. With such APIs you could write your server with one single database connection object, even when it performs transactions. I will need to check and report back on what these APIs and products are.
