Slow bulk insert to Azure database - azure

We are running an elastic pool in Azure running multiple databases, when running 1 of our larger imports this seems to take longer than we are used to. During these imports we ran at 6 cores as a test. All databases are allowed to use all cores.
On our local enviroment, it inserts about 100k records per second, however, the same dataset on Azure does about 1k per second (our vm) to 4k per second (dev laptop).
During this insert, the database only uses 14% log IO, 5% CPU and 0% DataIO.
When setting up a new database using DTU model in P2 we are noticing the same experience. So we are not even hitting the limits of the database
The table contains about 36 columns which are all required.
We have tried this using BulkInsert in the following way using different batchsizes
BulkConfig b = new BulkConfig();
b.BatchSize = 100000;
await dbcontext.BulkInsertAsync(entities, b);
As well as using standard EntityFramework addranges using smaller batches. We even went as far as using the manually written SqlBulkCopy methods, however all with no dice.
Now the question is mainly, is this a software issue? Are we running into issues in our AzureDB? Do we need to change the way we do Bulk imports?
Edit:
Attempted to run the import using the TempDB Setting in BulkInsert, however this also does not increase performance. LogIO is still at 14%.

Iterate through the dataset on the application layer, invoking a
stored procedure for each row that will perform an INSERT/UPDATE
action based on the existence of a record with a certain key. If the
number of records to upsert is limited, this strategy may work well;
otherwise, roundtrips and log writes will have a major influence on
speed.
To minimise roundtrips and log writes and increase throughput, use
bulk insert approaches like the SqlBulkCopy class in ADO.NET to
upload the full dataset to Azure SQL Database and then execute all
the INSERT/UPDATE (or MERGE) operations in a single batch. Overall
execution times may be reduced from hours to minutes/seconds using
this method.
Here, is a discussion related to same scenario: Optimize Azure SQL Database Bulk Upsert scenarios - link.

Related

Parallel read and write to postgres database slows down application (backend)

I have a backend in nestjs using typeorm and postgres. This backend saves and reads data frequently from the database. In this database we are dealing with row counts of 10k + at times that needs to get updated and saved or created.
In this particular case where I need some brain juice I have a table (lets call it table a)
the backend fetches data from table a every few seconds
the content in table A needs to get updated frequently (properties and values overwritten). I am doing this updating task from a several application backend solely for this use-case.
Example case
Table A holds 100K records
update-service splits these 100K records into chunks of 5 and parallell updates 25K records each. While doing so, the main application that retrieves data from the backend slows down.
What is the best way to have performant read and write in parallel? I am assuming the slow down comes from locks (main backend retrieves data while update service tries to update) but I am not sure as I have not that much experience working with databases.
Don't assume, assert.
While you experiencing bad performance, check how the operating system's resources are doing; in this case, mostly CPU and disk. If one of them is maxed out, you know what is going on, and you either have to reduce the degree of parallelism or make the system stronger.
It is also interesting to look at wait events in PostgreSQL:
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event;
That will show I/O related events if you are running out of disk bandwidth, but it will also show database-internal contention that you can potentially hit with very high degrees of parallelism.

How to best stage large amounts of data with Hibernate/JPA?

How can I best stage large amounts of data for migration into our database using Hibernate efficiently? Performance when dealing with >25K records that are 100+ columns are not ideal.
Let me explain:
Background
I'm working for a large company that operates around the world. I've been tasked with leading a team (at least for backend) to create a full stack application that allows for various levels of management to perform their tasks. The current tech stack for backend is Java, Spring Boot, Hibernate, and PostgreSQL. Management would like to upload Excel files to our application and have our application parse them so we can refresh the data in our database.
Unfortunately, these files range from 25K to 50K records. We're aware that these Excel files are generated using SQL queries from Excel. However, we are not permitted to access the database with this data directly. The security is very tight and will not permit us access to any APIs, DB calls, etc. to work around Excel. Due to memory constraints and scalability concerns, we're using SAX parsing to keep a low footprint. Once we parse the Excel files, we're mapping them to a Hibernate entity that represents a staging table. Then we're migrating data from it to our other tables.
Currently to stage 25K records and migrate all the data to our other tables takes 15 minutes, which is unacceptable in the eyes of management. Especially, since this will need to be done on a daily basis.
Things I've tried
Enabling batch processing in Hibernate by following Vlad's answer here. This knocked maybe 20 seconds off the overall time for staging.
Rewriting criteria and other queries for fetching data.
Reducing amount of data to process (most fields are required so the amount can't be too heavily reduced).
Indexing important columns in both the staging and destination tables. I'm doing the indexing as part of schema generation.
Optimize parts of code that clean parsed data of imperfections.
I cannot post code due to NDA
Summary of Constraints
This app needs strong support for generating reports on related data (one of the reasons we went with RDBMS. Also, the data fits well into a relational model).
Must maintain a complete audit history of all records (currently using Hibernate Envers).
We have to approve any new dependency/library through the company's cybersecurity team. This can result in days of lost production while we wait for approval. It's not ideal to request new dependencies for the project.
There are no ways of working around the Excel files at this time. An API call or simple database query would be nice, but that's not an option to us for security reasons.
Scalability is a growing concern. Another team under this project has to parse an Excel file of 50K rows with 100 rows. All of this is only data for the USA. The project owner has said the company eventually wants to expand this app's management capabilities abroad.
My Thoughts
Purely regarding the staging issue, I think it's best to get rid of the Hibernate entities responsible for staging. I'll rewrite the migration of staged data into our live tables in SQL using stored procedures. Despite it being vendor-specific (to my knowledge, anyway) I'll use Postgres' COPY command to do the heavy lifting with the large amounts of rows. I can rewrite the parser to direct data to a CSV or other delimited file instead. The only issue I have then is how to migrate the data to tables that use Hibernate sequences and generators. I haven't figured out how to synchronize Hibernate's sequences after a manual update to the database like that. It likes the throw errors about duplicate primary keys until it comes across an ID in the sequence that's not used. But I feel that's another question entirely.
Edit 1:
I should clarify. The 15 minutes is the total time for all of staging. This includes staging and migration. Just the staging of the 25K records takes around 1:30, which also isn't ideal. I've run session metrics a few times and get around the following numbers for Spring Data persisting the 25K records:
2451000 nanoseconds spent acquiring 1 JDBC connection;
0 nanoseconds spent releasing 0 JDBC connections;
96970800 nanoseconds spent preparing 24851 JDBC statements;
9534006000 nanoseconds spent executing 24849 JDBC statements;
21666942900 nanoseconds spent executing 830 JDBC statements;
23513568700 nanoseconds spent executing 2 flushes (flushing a total of 49696 entities and 0 collections)
211588700 nanoseconds spent executing 1 partial-flushes (flushing a total of 24848 entities and 24848 collections)
For this specific case, I'm staging the roughly 25K entities and then using a stored procedure to move only employee data from staging to live tables (a small fraction of what makes up the 15 total minutes). That procedure seems to run instantly. But there's other data that we have to determine via joins, group by statements, etc., which appear to be costly. I'm just not sure why it's taking Spring Data so long to persist that many records when it would take pure SQL significantly less.

Azure SQL virtual machine performance - Inserts very slow

I'm trying out different pricing tiers on SQL Server.
Im inserting 4000 rows distributed over 4 tables in 10 seconds
My problem: I don't any performance improvements from a small D2S_V3 to D8S_V3
My application need to insert many rows (bulking is not an option), and this kind of performance is not acceptable
I wonder why I dont see improvements.
So my noob question: Do I need to configure something to see improvements? My naive thinking says I should some difference :-)
What am I doing wrong?
Without knowing much about your schema, it looks like you are storage bound or network bound.
Storage:
Try to mount the database to the local (temporary disk) and see if you notice any difference, if it is faster then your bottleneck is the mounted disk.
Network bound:
Where is the client that's inserting these transaction? On same machine? On Azure?
I suggest you setup a client in the same region and do the tests.
inserting 4000 rows distributed over 4 tables in 10 seconds.I don't any performance improvements from a small D2S_V3 to D8S_V3
I would approach this problem using wait stats approach rather than throwing hardware first with out knowing problem..
For example,running below insert
insert into #t
select orderid from orders o
join
Customers c
on c.custid=o.custid
showed me below wait stats
Wait WaitType="SOS_SCHEDULER_YIELD" WaitTimeMs="1" WaitCount="167"
Wait WaitType="PAGEIOLATCH_SH" WaitTimeMs="12" WaitCount="3" />
Wait WaitType="MEMORY_ALLOCATION_EXT" WaitTimeMs="21" WaitCount="4975" />
most of the time, the query spent time on
PAGEIOLATCH_SH:
getting data from disk into memory
MEMORY_ALLOCATION_EXT :allocating memory for the query to run
based on this i will try to troubleshoot by seeing if i have memory pressure on my system,since this query is trying to get data from disk.
This is just one example,but hopefully this will give you an idea..
Further i will try to see if select is returing data fast
Performance can be directly linked to your hardware or configuration, but it's more likely that it has to do with the structures and the queries. Take a look at the execution plan for the INSERT operation to see how it is being resolved by the optimizer. Also, capture the query metrics using extended events to see how many resources are being used by the operation. These are more likely to lead to a resolution on why the query is performing slowly and enable you to scale the hardware to best serve the query.

Azure SQL Database update performance

We're migrating some databases from an Azure VM running SQL Server to Azure SQL. The current VM is a Standard DS12 v2 with two 1TB SSDs attached.
We are using an elastic pool at the P1 performance level. We're early days in this, so nothing else is really running in the pool.
At any rate, we are doing an ETL process that involves a handful of ~20M row tables. We bulk load these tables and then update some attributes to help with the rest of the process.
For example, I am currently running the following update:
UPDATE A
SET A.CompanyId = B.Id
FROM etl.TRANSACTIONS AS A
LEFT OUTER JOIN dbo.Company AS B
ON A.CO_ID = B.ERPCode
TRANSACTIONS is ~ 20M rows; Company is fewer than 50.
I'm already 30 minutes into running this update which is far beyond what will be acceptable. The usage meter on the Pool is hovering around 40%.
For reference, our Azure VM runs this in about 2 minutes.
I load this table via the bulk copy and this update is already beyond what it took to load the entire table.
Any suggestions on speeding up this (and other) updates?
We are using an elastic pool at the P1 performance level.
Not sure ,how this translates your VM performance levels and what criteria you are using to compare both
I would recommend below steps ,since there is no execution plan provided ..
1.Check if there is any wait type ,while the update is running
select
session_id,
start_time,
command,
db_name(ec.database_id) as dbname,
blocking_session_id,
wait_type,
last_wait_type,
wait_time,
cpu_time,
logical_reads,
reads,
writes,
((database_transaction_log_bytes_used +database_transaction_log_bytes_reserved)/1024)/1024 as logusageMB,
txt.text,
pln.query_plan
from sys.dm_exec_requests ec
cross apply
sys.dm_exec_sql_text(ec.sql_handle) txt
outer apply
sys.dm_exec_query_plan(ec.plan_handle) pln
left join
sys.dm_tran_database_transactions trn
on trn.transaction_id=ec.transaction_id
the wait type,provides you lot of info,which can be used to troubleshoot..
2.You can also use below query to see in parallel ,what is happening with the query
set statistics profile on
your update query
then run below query in a seperate window
select
session_id,physical_operator_name,
row_count,actual_read_row_count,estimate_row_count,estimated_read_row_count,
rebind_count,
rewind_count,
scan_count,
logical_read_count,
physical_read_count,
logical_read_count
from
sys.dm_exec_query_profiles
where session_id=your sessionid;
as per your question,there don't seems to be an issue with DTU.So i dont see much issue on that front..
Slow performance solved in one case:
I have recently had severe problems with slow Azure updates that made it nearly unusable. It was updating only 1000 rows in 1 second. So 1M rows was 1000 seconds. I believe this is due to logging in Azure, but I haven't done enough research to be certain. Opening a MS support incident went nowhere. I finally solved the issue using two techniques:
Copy the data to a temporary table and make updates in the temp table. So in the above case, try copying the 50 rows to a temp table & then back again after updates. No/Minimal logging in this case.
My copying back was still slow (I had a few 100K rows), and I create a clustered index on that table. Update duration dropped by a factor of 4-5.
I am using a S1-20DTU database. It is still about 5 times slower than a dedicated instance, but that is fantastic performance for the price.
The real answer to this issue is that SQL Azure will spill to the tempdb much faster than you would expect if you are used to using a well provisioned VM or physical machine.
You can tell that this is happening by recording the actual execution query plan. Look for the warning icon:
The popup will complain about the spill:
At any rate, if you see this, it is likely that you're trying to do too much in the statement.
The Microsoft support person suggested updating the statistics, but this did not change the situation for us.
What seems to be working is the traditional advice to break the inserts up into smaller batches.

Azure Table Storage transaction limitations

I'm running performance tests against ATS and its behaving a bit weird when using multiple virtual machines against the same table / storage account.
The entire pipeline is non blocking (await/async) and using TPL for concurrent and parallel execution.
First of all its very strange that with this setup i'm only getting about 1200 insertions. This is running on a L VM box, that is 4 cores + 800mbps.
I'm inserting 100.000 rows with unique PK and unique RK, that should leverage the ultimate distribution.
Even more deterministic behavior is the following.
When I run 1 VM i get about 1200 insertions per second.
When I run 3 VM i get about 730 on each insertions per second.
Its quite humors to read the blog post where they are specifying their targets.
https://azure.microsoft.com/en-gb/blog/windows-azures-flat-network-storage-and-2012-scalability-targets/
Single Table Partition– a table partition are all of the entities in a table with the same partition key value, and usually tables have many partitions. The throughput target for a single table partition is:
Up to 2,000 entities per second
Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning, can process up to the 20,000 entities/second, which is the overall account target described above.
What shall I do to be able to utilize the 20k per second, and how would it be possible to execute more than 1,2k per VM?
--
Update:
I've now also tried using 3 storage accounts for each individual node and is still getting the performance / throttling behavior. Which i can't find a logical reason for.
--
Update 2:
I've optimized the code further and now i'm possible to execute about 1550.
--
Update 3:
I've now also tried in US West. The performance is worse there. About 33% lower.
--
Update 4:
I tried executing the code from a XL machine. Which is 8 cores instead of 4 and the double amount of memory and bandwidth and got a 2% increase in performance so clearly this problem is not on my side..
A few comments:
You mention that you are using unique PK/RK to get ultimate
distribution, but you have to keep in mind that the PK balancing is
not immediate. When you first create a table, the entire table will
be served by 1 partition server. So if you are doing inserts across
several different PKs, they will still be going to one partition
server and be bottlenecked by the scalability target for a single
partition. The partition master will only start splitting your
partitions among multiple partition servers after it has identified hot
partition servers. In your <2 minute test you will not see the
benefit of multiple partiton servers or PKs. The throughput in the
article is targeted towards a well distributed PK scheme with
frequently accessed data, causing the data to be divided amongst
multiple partition servers.
The size of your VM is not the issue as
you are not blocked on CPU, Memory, or Bandwidth. You can achieve
full storage performance from a small VM size.
Check out
http://research.microsoft.com/en-us/downloads/5c8189b9-53aa-4d6a-a086-013d927e15a7/default.aspx.
I just now did a quick test using that tool from a WebRole VM in the
same datacenter as my storage account and I acheived, from a single
instance of the tool on a single VM, ~2800 items per second upload
and ~7300 items per second download. This is using 1024 byte
entities, 10 threads, and 100 batch size. I don't know how efficient this tool is or if it disables Nagles Algorithm as I was unable to get great results (I got ~1000/second) using a batch size of 1, but at least with the 100 batch size it shows that you can achieve high items/second. This was done in US West.
Are you using Storage client library 1.7 (Microsoft.Azure.StorageClient.dll) or 2.0 (Microsoft.Azure.Storage.dll)? The 2.0 library has some performance improvements and should yield better results.
I suspect this may have to do with TCP Nagle.
See this MSDN article and this blog post.
In essence, TCP Nagle is a protocol-level optimization that batches up small requests. Since you are sending lots of small requests this is likely to negatively affect your performance.
You can disable TCP Nagle by executing this code when starting your application
ServicePointManager.UseNagleAlgorithm = false;
Are the compute instances and storage account in the same affinity group? Affinity groups ensure that network proximity between the services is optimal and should result in lower latency at the network level.
You can find affinity group configuration under the network tab.
I would tend to believe that the maximum throughput is for an optimized load. For example, I bet you that you can achieve higher performance using Batch requests than individual requests you are doing now. And of course, if you use GUIDs for your PK, you can't Batch in your current test.
So what if you changed your test to batch insert entities in groups of 100 (maximum per batch), still using GUIDs, but for which 100 entities would have the same PK?

Resources