Entity Framework Core stress testing is slow - multithreading

I built a .NET Core 2.1 application with EF Core.
I use transactions with the read-uncommitted isolation level.
I built an async API and wrote a simple async EF query (it gets 5 fields of the first user, with no references to other tables).
When I send a single request, the query takes very little time.
When I stress test with 10 threads (ramp-up: 5s, loop forever) in JMeter, the query time stays the same.
However, when I stress test the API with JMeter using 100 threads (ramp-up: 20s, loop forever), some queries are fast, some take a long time (maybe 5s, 10s, 25s, ...), and others throw a connection timeout exception.
What should I do?
Issue resolved: after a few days of investigating, I tried the solution below and it is working well, so I will share it in this post. If you have other solutions that increase performance, please tell me about them.
Creating database connections is an expensive process that takes time. You can specify a minimum pool of connections to be created and kept open for the lifetime of the application; these are then reused for each database call.
Use the "Read Uncommitted" transaction isolation level.
Use the same database connection for multiple operations within one request.
Make all APIs and methods async, and make sure not to mix async with sync calls.
Thanks all!

First, with JMeter, run your test in non-GUI mode to make sure you don't get skewed results, and follow best practices; see:
https://www.ubik-ingenierie.com/blog/jmeter_performance_tuning_tips/
Once you have confirmed the issues are real, check several things:
No N+1 select issue (loops of queries)
Granularity of retrieved data: are you retrieving too much data?
Performance of the SQL queries issued, by looking at the DB
Pool size
See some interesting blogs:
http://www.progware.org/Blog/post/Slow-Performance-Is-it-the-Entity-Framework-or-you.aspx
https://www.thereformedprogrammer.net/entity-framework-core-performance-tuning-a-worked-example/
https://medium.com/@hoagsie/youre-all-doing-entity-framework-wrong-ea0c40e20502

Related

Maintain a distributed incremental counter in Azure cosmos DB

I am fairly new to Cosmos DB and was trying to understand the increment operation that the Azure Cosmos DB Java SDK provides for patching a document.
I have a requirement to maintain an incremental counter in one of the documents in the container. The document looks like this:
{"counter": 1}
Now, from my application, I want to increment this counter by a value of 1 every time an action happens. For this I am using CosmosPatchOperations, adding an increment like this: cosmosPatch.increment("/counter", 1), which works fine.
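For reference, the question concerns the Java V4 SDK, but the same partial-document patch exists in the JavaScript SDK; here is a minimal TypeScript sketch of the increment (the connection-string variable, database/container names, and the partition-key choice are all assumptions):
import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient(process.env.COSMOS_CONNECTION_STRING!); // assumed env var
const container = client.database("mydb").container("counters"); // hypothetical names

async function incrementCounter(): Promise<number | undefined> {
  // Server-side partial-document patch: atomically adds 1 to /counter,
  // the equivalent of cosmosPatch.increment("/counter", 1) in the Java SDK.
  const { resource } = await container
    .item("randomId", "randomId") // assumes the partition key is the id
    .patch([{ op: "incr", path: "/counter", value: 1 }]);
  return resource?.counter; // the updated value returned by the service
}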
Now, this application can have multiple instances running, all of them talking to the same document in the Cosmos container. So App1 and App2 could both trigger an increment at the same time. The SDK method returns the updated document, and I need to use that updated value.
My question is: does Cosmos DB employ some locking mechanism to make sure both patches happen one after another, and in that case, what updated value would I get in App1 and App2 (the SDK method returns the updated document)? Will it be 2 in one of them and 3 in the other?
Couchbase supports such a counter at the cluster level, as explained here, and it has been working perfectly for me without any concurrency issues. I am now migrating to Cosmos DB and have been struggling to find out how this can be achieved.
Update 1:
I decided to test this. I set up the Cosmos emulator on my local Mac and created a DB and container with RUs automatically scaling from 1 to 10K. Then, in this container, I added a document like this:
{
"id": "randomId",
"counter": 0
}
After this I created a simple API whose only responsibility is to increment the counter by 1 every time it is invoked. Then I used Locust to invoke this API repeatedly to mimic a small load-test scenario.
Initially the test ran fine, with each invocation receiving the counter as expected (in an incremental manner). On increasing the load, I saw some errors, namely RequestTimeoutException with status code 408. The other requests were still working fine and getting the correct counter value. I do not understand what caused the RequestTimeout exceptions here. The stack trace hints at something to do with concurrency, but I am not able to get my head around it. Here's the stack trace:
Update 2:
The test run in Update 1 was done on my local machine, and I realised I might have had resource constraints locally that led to those errors. I decided to test this in a pre-prod environment with an actual Cosmos DB instance instead of the emulator.
Test configuration:
Cosmos DB container with RUs set to automatically scale from 400 to 4000
2 instances of the application sharing the load
A Locust script to put load on the application
Findings:
Up until ~170 TPS, everything ran smoothly. Beyond that, I noticed errors belonging to 2 different buckets:
"Request rate is large. More Request Units may be needed, so no changes were made. Please retry this request later. Learn more: http://aka.ms/cosmosdb-error-429", with status code 429. I am not sure how 170-odd patch operations exhausted 4000 RUs, but that's a different discussion altogether.
"Conflicting request to resource has been attempted. Retry to avoid conflicts.", with status code 449.
This error suggests that Cosmos DB doesn't serialize concurrent requests itself. I want to understand whether it maintains an internal queue to handle some requests, or doesn't handle concurrent writes at all.
PATCH is no different from other operations. Fundamentally, Cosmos DB implements Optimistic Concurrency Control (OCC), unlike relational databases, which have pessimistic locking mechanisms. OCC allows you to prevent lost updates and to keep your data correct. It can be implemented using the etag of a document: each document within Azure Cosmos DB has an _etag property.
In your scenario, yes, it will return 2 in one of them and 3 in the other, given that both succeed, because the SDK has a retry mechanism, which is explained here. Also have a look at this sample.
If your Azure Cosmos DB account is configured with multiple write regions, conflicts and conflict resolution policies are applicable at the document level, with Last Write Wins (LWW) being the default conflict resolution policy.
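To make the OCC point concrete, here is a hedged TypeScript sketch using the JavaScript SDK (the same pattern applies in Java V4): read the document, modify it, and replace it only if its _etag is unchanged, retrying on a 412 Precondition Failed response. Names and the partition-key choice are assumptions:
import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient(process.env.COSMOS_CONNECTION_STRING!); // assumed env var
const container = client.database("mydb").container("counters"); // hypothetical names

async function incrementWithOcc(id: string): Promise<void> {
  for (;;) {
    const { resource: doc } = await container.item(id, id).read();
    if (!doc) throw new Error("document not found");
    doc.counter += 1;
    try {
      await container.item(id, id).replace(doc, {
        // The replace succeeds only if nobody changed the document since our read.
        accessCondition: { type: "IfMatch", condition: doc._etag },
      });
      return;
    } catch (err: any) {
      if (err.code !== 412) throw err; // 412 = etag mismatch, so re-read and retry
    }
  }
}
Note that the incr patch operation is applied atomically on the server, so it avoids this read-modify-write loop; the etag dance is for general updates.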

Is there a way to read a database link in Cosmos DB Java V4 API?

For example, reading "dbs/colls/document" instead of getting a container, then calling read on the container.
I've been having an issue where the first readItem on a container (after calling database.getContainer(x)) is extremely slow (like 1 second or longer) and was thinking using a database link could be faster.
I'm guessing a read after getting the container is slow because it doesn't make a service call until I call read.
Is there a way I can have this preloaded when reading in a database?
I have an application with a read(collectionName, key) method, and my approach was to use getContainer(collectionName) and then call read on that, but this method needs to be fast.
As discussed, the best practice is to keep an instance of your container alive between requests and call readItem on each request. This should resolve the primary issue.
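The question is about the Java V4 SDK; purely as an illustration of "keep the container alive", here is a TypeScript sketch with the JavaScript SDK of a read(collectionName, key) method that builds the client once and caches container references (database name and partition-key choice are made up):
import { Container, CosmosClient } from "@azure/cosmos";

const client = new CosmosClient(process.env.COSMOS_CONNECTION_STRING!); // created once at startup
const containers = new Map<string, Container>();

function getContainer(collectionName: string): Container {
  // database().container() is a cheap local call, but caching the reference
  // keeps all request-time work down to the readItem itself.
  let c = containers.get(collectionName);
  if (!c) {
    c = client.database("mydb").container(collectionName); // "mydb" is hypothetical
    containers.set(collectionName, c);
  }
  return c;
}

export async function read(collectionName: string, key: string) {
  // Assumes the partition key equals the id; adjust for your schema.
  const { resource } = await getContainer(collectionName).item(key, key).read();
  return resource;
}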
As for the secondary concern, the "high latency every 50 requests or so": this is a known issue; however, it should only occur in the first minute or so of operation. If you can tolerate the initial slow requests, the solution is to wait for performance to stabilize. How long do you have to run your app before you no longer see these high-latency requests?
FYI, if latency is a concern, run your client application in a geographically colocated Azure VM. A good rule of thumb is also to allocate enough client CPU cores that CPU utilization stays below 40% or 50%.

How to avoid database from being hit hard when API is getting bursted?

I have an API which allows other microservices to call on to check whether a particular product exists in the inventory. The API takes in only one parameter which is the ID of the product.
The API is served through API Gateway in Lambda and it simply queries against a Postgres RDS to check for the product ID. If it finds the product, it returns the information about the product in the response. If it doesn't, it just returns an empty response. The SQL is basically this:
SELECT * FROM inventory where expired = false and product_id = request.productId;
However, the problem is that many services call this particular API very heavily to check the existence of products. Not only that, the calls often come in bursts. I assume those services loop through a list of product IDs and check for their existence individually, hence the bursts.
The number of concurrent calls to the API has resulted in many queries to the database. The rate can burst beyond 30 queries per second, and there can be a few hundred thousand requests to fulfil. The queries are mostly the same, except for the product ID in the WHERE clause. The column has been indexed, and the query takes an average of only 5-8 ms to complete. Still, connections to the database occasionally time out when the rate gets too high.
I'm using Sequelize as my ORM, and the error I get when it times out is SequelizeConnectionAcquireTimeoutError. There is a good chance the burst rate was too high and it maxed out the pool too.
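For what it's worth, here is a minimal sketch of tuning the Sequelize pool so bursts queue for longer instead of failing with SequelizeConnectionAcquireTimeoutError; the values are illustrative, and note that on Lambda each concurrent instance holds its own pool:
import { Sequelize } from "sequelize";

const sequelize = new Sequelize(process.env.DATABASE_URL!, { // assumed env var
  dialect: "postgres",
  pool: {
    max: 20,        // upper bound on connections held by this instance
    min: 2,         // keep a couple of connections warm between bursts
    acquire: 30000, // wait up to 30s for a free connection before throwing
    idle: 10000,    // release connections idle for more than 10s
  },
});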
Some options I have considered:
Using a cache layer. But I have noticed that, most of the time, 90% of the product IDs in the requests are not repeated. This would mean that 90% of the time it would be a cache miss and the query would still hit the database.
Auto-scaling the database. But because the calls are bursty and I don't know when they may come, the autoscaling won't complete in time to avoid the timeouts. Moreover, the query is a very simple SELECT statement, and the CPU of the RDS instance hardly crosses 80% during the bursts, so I doubt scaling would do much either.
What other techniques can I use to stop the database from being hit hard when the API gets burst calls that are mostly unique and difficult to cache?
Use a cache from boot time
You can load all the necessary columns into an in-memory data store (Redis) at startup, and have every database update (via a cron job) refresh the cached data; a sketch follows below.
Problems: memory overhead; the cost of keeping the cache updated.
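A minimal sketch of that idea, assuming the inventory table from the question, the pg and redis packages, and made-up key names and refresh timing:
import { Pool } from "pg";
import { createClient } from "redis";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const redis = createClient({ url: process.env.REDIS_URL });

async function warmCache(): Promise<void> {
  const { rows } = await pool.query(
    "SELECT product_id, name, expired FROM inventory WHERE expired = false" // columns assumed
  );
  for (const row of rows) {
    // One key per product; the API then checks Redis before touching Postgres.
    await redis.set(`product:${row.product_id}`, JSON.stringify(row));
  }
}

async function main() {
  await redis.connect();
  await warmCache();              // load everything at boot
  setInterval(warmCache, 60_000); // periodic refresh, standing in for the cron job
}

main();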
Limit DB calls
Create a buffer for IDs: store n IDs and then make one query for all of them, or empty the buffer every m seconds (see the sketch below).
Problems: extra client response time; extra processing to split up the query result.
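A sketch of that buffering, assuming node-postgres and the query from the question; the names, batch size, and flush interval are illustrative:
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

type Waiter = { id: string; resolve: (row: unknown) => void };
let buffer: Waiter[] = [];

function flush(): void {
  const batch = buffer;
  buffer = [];
  if (batch.length === 0) return;
  pool
    .query(
      "SELECT * FROM inventory WHERE expired = false AND product_id = ANY($1)",
      [batch.map(w => w.id)]
    )
    .then(({ rows }) => {
      // Hand each caller its own row (or null for a miss).
      const byId = new Map(rows.map(r => [r.product_id, r]));
      for (const w of batch) w.resolve(byId.get(w.id) ?? null);
    })
    .catch(() => batch.forEach(w => w.resolve(null))); // simplistic error path
}

setInterval(flush, 50); // empty the buffer every m milliseconds (here 50)

export function findProduct(id: string): Promise<unknown> {
  return new Promise(resolve => {
    buffer.push({ id, resolve });
    if (buffer.length >= 100) flush(); // or flush once n IDs have accumulated
  });
}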
Change your database
Use a NoSQL database for this data. According to this article and this one, I think choosing a NoSQL database is a better idea.
Problems: multiple data stores.
Start with a covering index to handle your query. You might create an index like this for your table:
CREATE INDEX inv_lkup ON inventory (product_id, expired) INCLUDE (col, col, col);
Mention all the columns from your SELECT in the index, either in the main list of indexed columns or in the INCLUDE clause. Then the DBMS can satisfy your query entirely from the index, which is faster.
You could start using AWS Lambda throttling to handle this problem. But for that to work, the consumers of your API would need to retry when they get 429 responses, which might be super inconvenient.
Sorry to say, but you may need to stop using Lambda. Ordinary web servers have good machinery in them for managing burst workloads.
They have an incoming connection (TCP/IP listen) queue. Each new request lands in that queue, where it waits until the server software accepts the connection. When the server is busy, requests wait in that queue; under high load they wait a bit longer. In Node.js's case, if you use clustering, there's just one of these incoming connection queues, and all the processes in the cluster share it.
The server software you run (to handle your API) has a pool of connections to your DBMS. That pool has a maximum number of connections in it. As your server software handles each request, it awaits a connection from the pool. If no connection is immediately available, the request handling pauses until one is, then the request is handled. This, too, smooths out the requests to the DBMS. (Be aware that each process in a Node.js cluster has its own pool.)
Paradoxically, a smaller DBMS connection pool can improve overall performance, by avoiding too many concurrent SELECTs (or other queries) on the DBMS.
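As a sketch of this queueing behavior with node-postgres: with max: 10, an 11th concurrent query simply waits in the pool's internal queue until a connection is released, so the DBMS never sees more than 10 at once (connection details and limits assumed):
import { Pool } from "pg";

const pool = new Pool({
  connectionString: process.env.DATABASE_URL, // assumed env var
  max: 10,                       // deliberately small; smooths out bursts
  connectionTimeoutMillis: 5000, // give up on acquiring a connection after 5s
});

export async function findProduct(productId: string) {
  // pool.query acquires a connection, runs the query, and releases it.
  const { rows } = await pool.query(
    "SELECT * FROM inventory WHERE expired = false AND product_id = $1",
    [productId]
  );
  return rows[0] ?? null;
}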
This kind of server configuration can be scaled out: a load balancer will do, as will a server with more cores and more Node.js cluster processes. An elastic load balancer can also add new server VMs when necessary.

How can I instrument and log my KnexJS transactions?

I have a serious problem in production causing the application to become unresponsive and output the following error:
Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
A running hypothesis is that some operations are holding onto long-running Knex transactions; enough of them to reach the pool size, basically.
Is there a way to query the KnexJS API for how many pool connections are in use at any one time? Unfortunately, since KnexJS holds connections open up to the max pool setting from the config, it can be hard to know how many are actually in use. From the Postgres end, it seems like KnexJS idles on all of its connections when they are not in use.
Is there a good way to instrument Knex transaction and transacting calls with some kind of middleware or hook? Another useful thing would be to log the call stack of any transaction (or any transaction longer than, say, 7 seconds). One challenge is that I have calls to Knex transaction and transacting throughout my project, so maybe it's a long shot.
Any advice is greatly appreciated.
System Information
KnexJS version: 0.12.6 (we will update in the next month)
Database + version: Postgres 9.6
OS: Heroku Linux (Ubuntu?)
The easiest way to see what's happening at the connection pool level is to run Knex with the DEBUG=knex:* environment variable set, which prints quite a lot of debug info about what's happening inside Knex. Those logs show, for example, when connections are fetched from the pool and returned to it, as well as every query that is run.
There are a couple of global events that you can use to hook into every query, but there isn't one for hooking into transactions. Here is a related question where I wrote some example code showing how to measure transaction durations with query hooks: Tracking DB querying time - Bookshelf/knex. It probably leaks some memory, so it's not a production-ready solution, but for your debugging purposes it might be helpful.
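For per-query timing without touching call sites, here is a rough sketch using Knex's global query events; it relies on the 'query'/'query-response' events and the __knexQueryUid field, which exist in Knex versions newer than the 0.12.6 mentioned above:
import knex from "knex";

const db = knex({ client: "pg", connection: process.env.DATABASE_URL });

const startTimes = new Map<string, number>();

db.on("query", (query: { __knexQueryUid: string }) => {
  startTimes.set(query.__knexQueryUid, Date.now());
});

db.on("query-response", (_response: unknown, query: { __knexQueryUid: string; sql: string }) => {
  const started = startTimes.get(query.__knexQueryUid);
  if (started === undefined) return;
  startTimes.delete(query.__knexQueryUid);
  const ms = Date.now() - started;
  if (ms > 7000) console.warn(`Slow query (${ms} ms): ${query.sql}`); // the 7s threshold from the question
});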

Connection pool using pg-promise

I'm using Node.js and PostgreSQL and trying to be as efficient as possible in the connection implementation.
I saw that pg-promise is built on top of node-postgres and node-postgres uses pg-pool to manage pooling.
I also read that "more than 100 clients at a time is a very bad thing" (node-postgres).
I'm using pg-promise and wanted to know:
What is the recommended poolSize for a very big load of data?
What happens if poolSize = 100 and the application gets 101 requests simultaneously (or even more)?
Does Postgres handle the ordering and make the 101st request wait until it can run it?
I'm the author of pg-promise.
I'm using Node js and Postgresql and trying to be most efficient in the connections implementation.
There are several levels of optimization for database communications. The most important of them is to minimize the number of queries per HTTP request, because IO is expensive, and so is the connection pool.
If you have to execute more than one query per HTTP request, always use tasks, via method task.
If your task requires a transaction, execute it as a transaction, via method tx.
If you need to do multiple inserts or updates, always use multi-row operations. See Multi-row insert with pg-promise and PostgreSQL multi-row updates in Node.js.
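A rough TypeScript sketch of those three points together, with the connection variable and table names assumed for illustration:
import pgPromise from "pg-promise";

const pgp = pgPromise();
const db = pgp(process.env.DATABASE_URL!); // assumed env var

// Several queries in one HTTP request: share a single connection via a task.
async function getUserWithOrders(userId: number) {
  return db.task(async t => {
    const user = await t.one("SELECT * FROM users WHERE id = $1", [userId]);
    const orders = await t.any("SELECT * FROM orders WHERE user_id = $1", [userId]);
    return { user, orders };
  });
}

// Writes that must succeed or fail together: a transaction, using one
// multi-row INSERT instead of a loop of single-row inserts.
const cs = new pgp.helpers.ColumnSet(["user_id", "total"], { table: "orders" });

async function addOrders(rows: { user_id: number; total: number }[]) {
  return db.tx(async t => {
    await t.none(pgp.helpers.insert(rows, cs));
  });
}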
I saw that pg-promise is built on top of node-postgres and node-postgres uses pg-pool to manage pooling.
node-postgres started using pg-pool from version 6.x, while pg-promise remains on version 5.x which uses the internal connection pool implementation. Here's the reason why.
I also read that "more than 100 clients at a time is a very bad thing"
My long practice in this area suggests: if you cannot fit your service into a pool of 20 connections, you will not be saved by going for more connections; you will need to fix your implementation instead. Also, by going over 20 you start putting additional strain on the CPU, which translates into further slowdown.
what is the recommended poolSize for a very big load of data.
The size of the data has nothing to do with the size of the pool. You typically use just one connection for a single download or upload, no matter how large it is. If your implementation is wrong and you end up using more than one connection, you need to fix it if you want your app to be scalable.
what happens if poolSize = 100 and the application gets 101 request simultaneously
It will wait for the next available connection.
See also:
Chaining Queries
Performance Boost
what happens if poolSize = 100 and the application gets 101 request simultaneously (or even more)? Does Postgres handles the order and makes the 101 request wait until it can run it?
Right, the request will be queued. But it's not handled by Postgres itself; it's handled by your app (pg-pool). Whenever you run out of free connections, the app waits for a connection to be released, and then the next pending request is performed. That's what pools are for.
what is the recommended poolSize for a very big load of data.
It really depends on many factors, and no one can tell you the exact number. Why not test your app under a huge load, see how it performs in practice, and find the bottlenecks?
Also, I find the node-postgres documentation quite confusing and misleading on this matter:
Once you get >100 simultaneous requests your web server will attempt to open 100 connections to the PostgreSQL backend and 💥 you'll run out of memory on the PostgreSQL server, your database will become unresponsive, your app will seem to hang, and everything will break. Boooo!
https://github.com/brianc/node-postgres
That's not quite true. If you reach the connection limit on the Postgres side, you simply won't be able to establish a new connection until a previous one is closed. Nothing will break if you handle this situation in your node app.
