EntityFramework performance - query-performance

Why is EF (6) so slow for a very simple query of a few rows?
My test database has only one table, and I read all of its rows (say 80). With EF this takes about 2500 ms (SQL Express on the local machine), while a SqlDataReader running the exact same SQL query needs only about 150 ms.
Increasing the number of rows doesn't make EF much slower, so this looks like a one-time overhead (running the same query again is blazing fast).
Is there some setting that can persist the compiled model or query translation, or something like that, so the first query doesn't pay this cost?
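For context, the first query against an EF 6 context pays a one-time cost for building the in-memory model and compiling the mapping views, which matches the symptoms above. A common mitigation is to warm the context up once at application start; a minimal sketch, assuming a context class named MyContext with a DbSet property MyEntities (both placeholder names):
using System.Linq;

// Call WarmUp() once at application startup (e.g. in Application_Start)
// so the first real request doesn't pay the model-building cost.
static void WarmUp()
{
    using (var ctx = new MyContext())
    {
        // Builds the EF model and runs the database initializer if needed.
        ctx.Database.Initialize(force: false);

        // Touch a DbSet so a first trivial query is compiled and executed too.
        ctx.MyEntities.Any();
    }
}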

Related

Slow bulk insert to Azure database

We run multiple databases in an Azure elastic pool, and one of our larger imports is taking longer than we are used to. During these imports we scaled to 6 cores as a test, and all databases are allowed to use all cores.
On our local environment the import does about 100k records per second; the same dataset on Azure does about 1k per second (from our VM) to 4k per second (from a dev laptop).
During this insert, the database uses only 14% log IO, 5% CPU and 0% data IO.
When setting up a new database at P2 under the DTU model we see the same behaviour, so we are not even hitting the limits of the database.
The table contains about 36 columns, all of which are required.
We have tried BulkInsert in the following way, using different batch sizes:
BulkConfig b = new BulkConfig();
b.BatchSize = 100000;
await dbcontext.BulkInsertAsync(entities, b);
We have also tried standard Entity Framework AddRange calls with smaller batches, and even hand-written SqlBulkCopy methods, all with no luck.
Now the question is mainly: is this a software issue? Are we running into limits of our Azure DB? Do we need to change the way we do bulk imports?
Edit:
We attempted the import using the TempDB setting in BulkInsert, but this does not increase performance either; log IO is still at 14%.
One option is to iterate through the dataset on the application layer, invoking a stored procedure for each row that performs an INSERT/UPDATE based on the existence of a record with a certain key. If the number of records to upsert is limited, this strategy may work well; otherwise, round trips and log writes will have a major influence on speed.
To minimise round trips and log writes and increase throughput, use a bulk-insert approach such as the SqlBulkCopy class in ADO.NET to upload the full dataset to Azure SQL Database, and then execute all the INSERT/UPDATE (or MERGE) operations in a single set-based batch. Overall execution time may drop from hours to minutes or seconds with this method (a sketch follows below).
Here is a discussion of the same scenario: Optimize Azure SQL Database Bulk Upsert scenarios - link.
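As an illustration of the staged approach above, here is a minimal sketch, assuming a DataTable named dataTable already holds the rows to import and using placeholder object names (dbo.Staging_MyTable, dbo.MyTable, BusinessKey, ColA) that you would replace with your 36-column schema:
using System.Data;
using System.Data.SqlClient;
using System.Threading.Tasks;

// Bulk-load into a staging table, then upsert everything in one set-based MERGE.
static async Task BulkUpsertAsync(string connectionString, DataTable dataTable)
{
    using (var conn = new SqlConnection(connectionString))
    {
        await conn.OpenAsync();

        // 1. Stream all rows into the staging table (minimal round trips and log writes).
        using (var bulk = new SqlBulkCopy(conn))
        {
            bulk.DestinationTableName = "dbo.Staging_MyTable";
            bulk.BatchSize = 10000;
            bulk.BulkCopyTimeout = 0;   // no timeout for large loads
            await bulk.WriteToServerAsync(dataTable);
        }

        // 2. One MERGE instead of a round trip per row.
        const string mergeSql = @"
            MERGE dbo.MyTable AS target
            USING dbo.Staging_MyTable AS source
               ON target.BusinessKey = source.BusinessKey
            WHEN MATCHED THEN
                UPDATE SET target.ColA = source.ColA
            WHEN NOT MATCHED THEN
                INSERT (BusinessKey, ColA) VALUES (source.BusinessKey, source.ColA);";

        using (var merge = new SqlCommand(mergeSql, conn) { CommandTimeout = 0 })
        {
            await merge.ExecuteNonQueryAsync();
        }
    }
}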

How to best stage large amounts of data with Hibernate/JPA?

How can I efficiently stage large amounts of data for migration into our database using Hibernate? Performance when dealing with >25K records of 100+ columns each is not ideal.
Let me explain:
Background
I'm working for a large company that operates around the world. I've been tasked with leading a team (at least for backend) to create a full stack application that allows for various levels of management to perform their tasks. The current tech stack for backend is Java, Spring Boot, Hibernate, and PostgreSQL. Management would like to upload Excel files to our application and have our application parse them so we can refresh the data in our database.
Unfortunately, these files range from 25K to 50K records. We're aware that these Excel files are generated using SQL queries from Excel. However, we are not permitted to access the database with this data directly. The security is very tight and will not permit us access to any APIs, DB calls, etc. to work around Excel. Due to memory constraints and scalability concerns, we're using SAX parsing to keep a low footprint. Once we parse the Excel files, we're mapping them to a Hibernate entity that represents a staging table. Then we're migrating data from it to our other tables.
Currently, staging 25K records and migrating all the data to our other tables takes 15 minutes, which is unacceptable in the eyes of management, especially since this will need to be done on a daily basis.
Things I've tried
Enabling batch processing in Hibernate by following Vlad's answer here. This knocked maybe 20 seconds off the overall time for staging.
Rewriting criteria and other queries for fetching data.
Reducing the amount of data to process (most fields are required, so the amount can't be reduced much).
Indexing important columns in both the staging and destination tables. I'm doing the indexing as part of schema generation.
Optimizing the parts of the code that clean parsed data of imperfections.
I cannot post code due to an NDA.
Summary of Constraints
This app needs strong support for generating reports on related data (one of the reasons we went with an RDBMS; the data also fits well into a relational model).
Must maintain a complete audit history of all records (currently using Hibernate Envers).
We have to get any new dependency/library approved by the company's cybersecurity team, which can cost days of lost production while we wait for approval, so requesting new dependencies for the project is not ideal.
There is no way of working around the Excel files at this time. An API call or a simple database query would be nice, but that's not an option for us for security reasons.
Scalability is a growing concern. Another team on this project has to parse an Excel file of 50K rows with 100+ columns, and all of this is only data for the USA. The project owner has said the company eventually wants to expand this app's management capabilities abroad.
My Thoughts
Purely regarding the staging issue, I think it's best to get rid of the Hibernate entities responsible for staging. I'll rewrite the migration of staged data into our live tables as SQL stored procedures and, despite it being vendor-specific (to my knowledge, anyway), use Postgres' COPY command to do the heavy lifting for the large numbers of rows. I can rewrite the parser to direct data to a CSV or other delimited file instead. The only issue then is how to migrate the data into tables that use Hibernate sequences and generators. I haven't figured out how to synchronize Hibernate's sequences after a manual update to the database like that; it likes to throw errors about duplicate primary keys until it comes across an ID in the sequence that's not used. But I feel that's another question entirely.
Edit 1:
I should clarify: the 15 minutes is the total time for staging plus migration. Staging the 25K records alone takes around 1 minute 30 seconds, which also isn't ideal. I've run session metrics a few times and get roughly the following numbers for Spring Data persisting the 25K records:
2451000 nanoseconds spent acquiring 1 JDBC connection;
0 nanoseconds spent releasing 0 JDBC connections;
96970800 nanoseconds spent preparing 24851 JDBC statements;
9534006000 nanoseconds spent executing 24849 JDBC statements;
21666942900 nanoseconds spent executing 830 JDBC batches;
23513568700 nanoseconds spent executing 2 flushes (flushing a total of 49696 entities and 0 collections)
211588700 nanoseconds spent executing 1 partial-flushes (flushing a total of 24848 entities and 24848 collections)
For this specific case, I'm staging the roughly 25K entities and then using a stored procedure to move only employee data from staging to live tables (a small fraction of what makes up the 15 total minutes). That procedure seems to run instantly. But there's other data that we have to determine via joins, group by statements, etc., which appear to be costly. I'm just not sure why it's taking Spring Data so long to persist that many records when it would take pure SQL significantly less.

Azure SQL Database update performance

We're migrating some databases from an Azure VM running SQL Server to Azure SQL. The current VM is a Standard DS12 v2 with two 1TB SSDs attached.
We are using an elastic pool at the P1 performance level. We're early days in this, so nothing else is really running in the pool.
At any rate, we are doing an ETL process that involves a handful of ~20M row tables. We bulk load these tables and then update some attributes to help with the rest of the process.
For example, I am currently running the following update:
UPDATE A
SET A.CompanyId = B.Id
FROM etl.TRANSACTIONS AS A
LEFT OUTER JOIN dbo.Company AS B
ON A.CO_ID = B.ERPCode
TRANSACTIONS is ~ 20M rows; Company is fewer than 50.
I'm already 30 minutes into running this update, which is far beyond what will be acceptable. The usage meter on the pool is hovering around 40%.
For reference, our Azure VM runs this in about 2 minutes.
I loaded this table via bulk copy, and this update has already taken longer than loading the entire table did.
Any suggestions on speeding up this (and other) updates?
We are using an elastic pool at the P1 performance level.
I'm not sure how that translates to your VM's performance level, or what criteria you are using to compare the two.
Since no execution plan was provided, I would recommend the steps below.
1. Check whether there is any wait type while the update is running:
select
session_id,
start_time,
command,
db_name(ec.database_id) as dbname,
blocking_session_id,
wait_type,
last_wait_type,
wait_time,
cpu_time,
logical_reads,
reads,
writes,
((database_transaction_log_bytes_used +database_transaction_log_bytes_reserved)/1024)/1024 as logusageMB,
txt.text,
pln.query_plan
from sys.dm_exec_requests ec
cross apply
sys.dm_exec_sql_text(ec.sql_handle) txt
outer apply
sys.dm_exec_query_plan(ec.plan_handle) pln
left join
sys.dm_tran_database_transactions trn
on trn.transaction_id=ec.transaction_id
The wait type gives you a lot of information that you can use to troubleshoot.
2. You can also watch, in parallel, what is happening with the query. In the session running the update, enable profiling first:
set statistics profile on
-- then run your update query in this session
Then run the query below in a separate window:
select
session_id, physical_operator_name,
row_count, actual_read_row_count, estimate_row_count, estimated_read_row_count,
rebind_count,
rewind_count,
scan_count,
logical_read_count,
physical_read_count
from
sys.dm_exec_query_profiles
where session_id = <your session id>;
As per your question, there doesn't seem to be an issue with DTU, so I don't see much of a problem on that front.
Slow performance solved in one case:
I have recently had severe problems with slow Azure updates that made the database nearly unusable. It was updating only 1000 rows per second, so 1M rows took 1000 seconds. I believe this is due to logging in Azure, but I haven't done enough research to be certain. Opening an MS support incident went nowhere. I finally solved the issue using two techniques:
Copy the data to a temporary table and make the updates in the temp table. In the case above, try copying the rows to a temp table and back again after the updates; there is little or no logging in that case.
Copying back was still slow (I had a few hundred thousand rows), so I created a clustered index on that table; the update duration dropped by a factor of 4-5 (see the sketch after this answer).
I am using an S1 (20 DTU) database. It is still about 5 times slower than a dedicated instance, but that is fantastic performance for the price.
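As a sketch of the temp-table technique described above (the key column TransactionId is an assumption, not from the question; everything runs on one connection so the #temp table stays in scope):
using System.Data.SqlClient;

// Compute the new values into a temp table, index it, then apply one set-based update.
static void UpdateViaTempTable(string connectionString)
{
    var steps = new[]
    {
        // 1. Copy the key plus the looked-up CompanyId into tempdb.
        @"SELECT A.TransactionId, B.Id AS CompanyId
          INTO   #work
          FROM   etl.TRANSACTIONS AS A
          LEFT OUTER JOIN dbo.Company AS B ON A.CO_ID = B.ERPCode;",
        // 2. A clustered index makes the join back much cheaper.
        "CREATE CLUSTERED INDEX IX_work ON #work (TransactionId);",
        // 3. Apply the computed values in one set-based update.
        @"UPDATE A
          SET    A.CompanyId = W.CompanyId
          FROM   etl.TRANSACTIONS AS A
          JOIN   #work AS W ON W.TransactionId = A.TransactionId;"
    };

    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        foreach (var sql in steps)
            using (var cmd = new SqlCommand(sql, conn) { CommandTimeout = 0 })
                cmd.ExecuteNonQuery();
    }
}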
The real answer to this issue is that Azure SQL Database will spill to tempdb much sooner than you would expect if you are used to a well-provisioned VM or physical machine.
You can tell this is happening by capturing the actual execution plan and looking for the warning icon on the affected operator; its tooltip will complain about the spill.
At any rate, if you see this, it is likely that you're trying to do too much in the statement.
The Microsoft support person suggested updating the statistics, but this did not change the situation for us.
What does seem to work is the traditional advice to break the statement up into smaller batches.
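A minimal batching sketch of that advice, assuming etl.TRANSACTIONS has an integer key column (called TransactionId here; that name is an assumption, not from the question), so each chunk commits as a small transaction instead of one 20M-row statement:
using System;
using System.Data.SqlClient;

// Walk the key range in fixed-size chunks and run the update per chunk.
static void UpdateInBatches(string connectionString)
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();

        long minId, maxId;
        using (var cmd = new SqlCommand(
            "SELECT MIN(TransactionId), MAX(TransactionId) FROM etl.TRANSACTIONS;", conn))
        using (var reader = cmd.ExecuteReader())
        {
            reader.Read();
            if (reader.IsDBNull(0)) return;               // empty table, nothing to do
            minId = Convert.ToInt64(reader.GetValue(0));
            maxId = Convert.ToInt64(reader.GetValue(1));
        }

        const long batchSize = 50000;
        for (long from = minId; from <= maxId; from += batchSize)
        {
            using (var cmd = new SqlCommand(@"
                UPDATE A
                SET    A.CompanyId = B.Id
                FROM   etl.TRANSACTIONS AS A
                LEFT OUTER JOIN dbo.Company AS B ON A.CO_ID = B.ERPCode
                WHERE  A.TransactionId >= @from AND A.TransactionId < @to;", conn))
            {
                cmd.Parameters.AddWithValue("@from", from);
                cmd.Parameters.AddWithValue("@to", from + batchSize);
                cmd.CommandTimeout = 0;
                cmd.ExecuteNonQuery();
            }
        }
    }
}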

Cassandra multi row selection

I have heard somewhere that multi-row selection in Cassandra is bad because each row selection runs a new query, so, for example, fetching 1000 rows at once would be the same as running 1000 separate queries at once. Is that true?
And if it is, how bad would it be to keep selecting around 50 rows each time a page is loaded? Say I have 1000 page views in a single minute; would that severely slow Cassandra down or not?
P.S. I'm using PHPCassa for my project.
Yes, running a query for 1000 rows is the same as running 1000 queries (if you use the recommended RandomPartitioner). However, I wouldn't be overly concerned by this. In Cassandra, querying for a row by its key is a very common, very fast operation.
As to your second question, it's difficult to tell ahead of time; build it and test it. Note that Cassandra does use in-memory caching, so if you query the same rows repeatedly they will be cached.
We are using Playorm for Cassandra, which has a "findAll" pattern that provides support for fetching all rows quickly. Visit
https://github.com/deanhiller/playorm/wiki/Support-for-retrieving-many-entities-in-parallel for more details.
1) I have debugged the Cassandra code base a little, and from what I observed, Cassandra provides the multiget() functionality for querying multiple rows at the same time, which is also exposed in phpcassa.
2) multiget() is optimized to handle batch requests and saves network hops (fetching 1k rows one by one would otherwise take 1k round trips, so it definitely saves the time of 999 round trips).
3) More about multiget() in phpcassa: php cassa multiget()

Subqueries in EF code first 4.1

I have created a WCF Data Service over a fairly simple EF 4.1 code first model. With each request I must provide a clientid to maintain segregation of data in my multi-tenant DB. I am seeing horrible performance, and after running a SQL Server trace I see that all of the parameterized queries use subqueries, like so:
select top 100    -- top 100 because of paging
    colA,
    colB,
    colC
from (select colA, colB, colC
      from table
      where clientid = 12345) as t
order by .....
Is there any way to tweak this so that it skips the subquery for the select? It seems ridiculously unneeded and slows down the performance by a surprising order of magnitude.
Thanks.
Is there any way to tweak this so that it skips the subquery for the select?
No, not unless you are going to rewrite the whole EF provider for MS SQL Server (or whichever database you are using).
It seems ridiculously unneeded and slows down the performance by a surprising order of magnitude.
Did you actually investigate the source of the performance problem? The query you showed should be optimized by the query optimizer on the database server; the wrapping subquery itself should not have any significant performance impact.
Make sure you have correctly configured indexes and up-to-date statistics.
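For the index part, EF 4.1 code first has no built-in way to declare indexes, so one option is to create them with raw SQL from a custom initializer. A minimal sketch, assuming a context class named MyServiceContext and placeholder table/column names matching the trace above:
using System.Data.Entity;

// Creates the multi-tenant filter index when the database is (re)created.
public class MyServiceInitializer : DropCreateDatabaseIfModelChanges<MyServiceContext>
{
    protected override void Seed(MyServiceContext context)
    {
        // An index on the tenant key (covering the selected columns) lets the
        // paged query seek on clientid instead of scanning the whole table.
        context.Database.ExecuteSqlCommand(
            "CREATE NONCLUSTERED INDEX IX_MyTable_ClientId " +
            "ON dbo.MyTable (clientid) INCLUDE (colA, colB, colC)");

        base.Seed(context);
    }
}

// Register it once at startup:
// Database.SetInitializer(new MyServiceInitializer());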
