Moving to Azure Synapse from Azure SQL Database

We are currently ingesting data from application databases into an Azure SQL Database. The size is around 600 GB now (mainly distributed across only 3 fact tables; the rest of the tables are master data and are quite small) and it's running on 40 vCores (we use it a lot for reporting, so we need a high number of vCores).
Some difficulties I'm currently facing:
Data copy from source to sink usually takes a really long time. The approach we are using is to delete all records for this month, and then copy this month's data over from the application db. Writing to the sink usually takes a lot of time as well (due to the indexes on the fact table, I believe). A rough batched sketch of that delete step is shown after this list.
High data I/O whenever someone pulls a big query.
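For illustration, the monthly delete could be expressed as a batched loop, roughly like the sketch below (dbo.FactSales and TransactionDate are placeholder names, not the real schema), so the delete doesn't hold one huge transaction and bloat the log before the reload:

DECLARE @deleted INT = 1;
WHILE @deleted > 0
BEGIN
    -- Delete the current month's rows in chunks of 100k.
    DELETE TOP (100000) FROM dbo.FactSales
    WHERE TransactionDate >= DATEFROMPARTS(YEAR(GETDATE()), MONTH(GETDATE()), 1);

    SET @deleted = @@ROWCOUNT;
END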
Hoping someone can shed some light on how to make this setup work faster.
Thanks!

Though I would need some more info, I think I can share some pointers.
When you mention a big query taking a long time, have you checked the query and made sure that indexes exist on the required columns and that they (and their statistics) are updated regularly? It sounds like you have 3 tables with a lot of data; are queries slow on all three? (If I were you, I would divide the problem into smaller issues and investigate each one of them.) A quick statistics check is sketched below.
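As a hedged sketch (dbo.FactSales is a placeholder table name), one way to check how fresh the statistics on a big fact table are:

-- When were the statistics on this table last updated?
SELECT s.name AS stats_name,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('dbo.FactSales');

-- Refresh them if they are stale (or rely on auto-update statistics).
UPDATE STATISTICS dbo.FactSales;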
On the copy part: before you copy you have to select the data, so the query performance mentioned above matters there too. How are you copying the data? Since I see the ADF tag, I am assuming it's ADF. Are you copying the data sequentially, i.e. copy BigTable1, then BigTable2, then BigTable3? You could explore the possibility of copying them in parallel. I am not sure how you have implemented the logic in ADF, but three copy activities that are not chained one after the other will do the trick.
In each copy activity you also have the option to set the degree of parallelism and the batch count; I would suggest taking a look at those.
Performance problems are very difficult to help with unless you have access to the data :) Let me know how it goes.
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance-features#parallel-copy
Thanks
Himanshu

Related

How to speed up spark sql filter queries if the where clause is already fixed?

In my case, the data resides in Spark tables which are created by calling the createOrReplaceTempView API on a dataframe. Once the table is created, several queries are going to run on top of it. Most of the time, the WHERE clause filters on one particular column, and the name of that column is already known. I would like to know if some sort of optimization can be done to improve the performance of these filter queries.
I tried exploring the approach of indexing, but it turns out Spark does not support indexing a particular column.
Have you looked at the SPARK UI to see where most of your time is being consumed? Is it really the query where most of the time is spent? Usually reading the data from disk is where most of the time is spent. Learn to read the SPARK UI and find where the real bottleneck is. The SQL tab is a really great way to start figuring things out.
Here are some tricks to make Spark run faster that apply to most jobs:
Can you reframe the problem? Was the data you are using in a format that helps you solve the query? Can you change how it's written to change the problem? (Could you start "pre-chewing" the data before you even query it to have it stored in the best format to help you solve the issue you want to solve?) Most performance gains come from changing the parameters of the problem to make them easier/faster to solve.
What format is the incoming data stored in? Are you using Parquet/ORC? They have a great disk-space/compression payoff that makes them worth using, and they can also enable file-level filtering to speed up reads. Is there transformation work that you can push upstream to help make the query do less work? Could you write the data via a partition scheme that would aid lookups? (A rough sketch of this is shown after this list.)
How many files is your input? Can you consolidate files to maximize read throughput? Reading/listing a lot of small input files slows down the processing of data.
If the tempView query is of similar size every time, you could look at tweaking the partition count so that files are smaller but approximately the size of your HDFS block size (assuming you are using HDFS). With HDFS you have to read an entire block whether you use all of the data or not. Try to fit this to some multiple of your executors so that they finish together rather than straggling. This is hard to get perfect, but you can make decent strides toward a good ratio.
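To make the partitioning point concrete, here is a rough Spark SQL sketch (table and column names such as events_temp_view and customer_id are invented): write the data once as Parquet partitioned by the column you always filter on, then query that table so Spark can prune partitions instead of scanning everything.

-- Persist the temp view's data as Parquet, partitioned by the known filter column.
CREATE TABLE events_by_customer
USING PARQUET
PARTITIONED BY (customer_id)
AS SELECT * FROM events_temp_view;

-- Filters on the partition column now only read the matching directories.
SELECT * FROM events_by_customer WHERE customer_id = 42;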
There is no need to optimize filter conditions with Spark; Spark is already smart enough to optimize the conditions in the WHERE clause and fetch the minimum rows first. The best you can do, I guess, is persist your temp view if you are querying the same view again and again.
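A minimal Spark SQL sketch of that persisting suggestion (events_temp_view and customer_id are made-up names):

-- Materialize the temp view in memory so repeated filter queries reuse it.
CACHE TABLE events_temp_view;
SELECT * FROM events_temp_view WHERE customer_id = 42;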

Copy Data pipeline on Azure Data Factory from SQL Server to Blob Storage

I'm trying to move some data from Azure SQL Server Database to Azure Blob Storage with the "Copy Data" pipeline in Azure Data Factory. In particular, I'm using the "Use query" option with the ?AdfDynamicRangePartitionCondition hook, as suggested by Microsoft's pattern here, in the Source tab of the pipeline, and the copy operation is parallelized by the presence of a partition key used in the query itself.
The source on SQL Server Database consists of two views with ~300k and ~3M rows, respectively.
Additionally, the views have the same query structure, e.g. (pseudo-code)
with
v as (
    select hashbytes('SHA2_256', field1) as [Key1],
           hashbytes('SHA2_256', field2) as [Key2]
    from [Table]
)
select *
from v
and so do the tables that are queried by the views. On top of this, the views query the same number of partitions with a roughly equally distributed number of rows.
The unexpected behavior (most likely due to my lack of experience) is that the copy operation takes much longer for the view that queries fewer rows. In fact, the copy operation with ~300k rows shows a throughput of ~800 KB/s, whereas the one with ~3M rows shows a throughput of ~15 MB/s (!). Lastly, the write to blob storage is pretty fast in both cases, as opposed to the read from the source.
I don't expect anyone to provide an actual solution, as the information provided is limited, but I'd like some hints on what could be affecting the copy performance so badly in the case where the view queries far fewer rows (roughly an order of magnitude fewer), taking into account that the tables under the views have a comparable number of fields and the same data types: both tables contain int, datetime, and varchar columns.
Thanks in advance for any heads up.
To whoever might stumble upon the same issue: I managed to find out, rather empirically, that the bottleneck was the presence of several key-hash computations in the view on the SQL DB. Once I removed these (they are now calculated later, on Azure Synapse Analytics, the data warehouse), I observed a massive performance boost in the copy operation.
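A hedged sketch of what that change amounts to (the view name is a placeholder; field1, field2 and [Table] come from the pseudo-code above): the view exposes only the raw columns, and the HASHBYTES keys are computed after the copy, on the Synapse side.

-- Keep the source view free of hash computations; hash later in Synapse.
CREATE OR ALTER VIEW dbo.v_export_details
AS
SELECT field1, field2
FROM [Table];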
When there is a copy activity performance issue in ADF and the root cause is not obvious (as opposed to, say, a fast source with a throttled sink, where we know why), here's how I would go about it:
Start with the Integration Runtime (IR) (docs). This might be a job concurrency issue, a network throughput issue, or just an undersized VM (in the case of a self-hosted IR). Something like >80% of all issues in my prod ETL are caused by IRs, in one way or another.
Replicate the copy activity behavior on both source and sink: query the views from your local machine (ideally, from a VM in the same environment as your IR), write the flat files to blob, etc. I'm assuming you've done that already, but having another observation rarely hurts. (A tiny timing sketch follows after these steps.)
Test various configurations of the copy activity. Changing isolationLevel, partitionOption, parallelCopies and enableStaging would be my first steps here. This won't fix the root cause of your issue, obviously, but it can point you toward where to dig further.
Try searching the documentation (the doc provided by @Leon is a good start). This should arguably have been step #1; however, I find the ADF documentation somewhat lacking.
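For the "replicate on source" step, a minimal way to time the raw read yourself from SSMS or sqlcmd (the view name is a placeholder for the slow ~300k-row view):

SET STATISTICS TIME ON;
SET STATISTICS IO ON;
SELECT * FROM dbo.v_export_details;  -- pulls the full view, roughly what the copy activity's source does
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;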
N.B. this is based on my personal experience with Data Factory.
Providing a specific solution in this case is, indeed, quite hard.

Azure Data factory and Data flow taking too much time to process data from staging to Database

So I have one Data Factory pipeline which runs every day. It selects around 80M records from an on-premises Oracle database and moves them to a Parquet file, which takes around 2 hours, and I want to speed up this process, as well as the data flow that inserts and updates the data in the database.
(screenshot: Parquet file settings)
The next step takes the Parquet file and calls the data flow, which upserts the data into the database, but this also takes too much time.
(screenshot: data flow settings)
Also, let me know which compute type to use for the data flow: Memory Optimized, Compute Optimized, or General Purpose.
After the round robin update:
(screenshot: sink time)
Can you open the monitoring detailed execution plan for the data flow? Click on each stage in your data flow and look to see where the bulk of the time is being spent. You should see on the top of the view how much time was spent setting-up the compute environment, how much time was taken to read your source, and also check the total write time on your sinks.
I have some examples of how to view and optimize this here.
Well, I would surmise that 45 min to stuff 85M rows into a SQL DB is not horrible. You can break the task down into chunks and see what's taking the longest time to complete. Do you have access to Databricks? I do a lot of pre-processing with Databricks, and I have found Spark to be super-super-fast!! If you can pre-process in Databricks and push everything into your SQL world, you may have an optimal solution there.
As per the documentation - https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#partitioning-on-sink - can you try modifying your partition settings under the Optimize tab of your sink?
I faced a similar issue with the default partitioning setting, where the data load was taking 30+ minutes for just 1M records; after changing the partition strategy to round robin with 5 partitions (in my case), the load happens in less than a minute.
Try experimenting with both the source partition settings (https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#partitioning-on-source) and the sink partition settings to come up with the optimum strategy. That should improve the data load time.

Azure SQL Database update performance

We're migrating some databases from an Azure VM running SQL Server to Azure SQL. The current VM is a Standard DS12 v2 with two 1TB SSDs attached.
We are using an elastic pool at the P1 performance level. We're early days in this, so nothing else is really running in the pool.
At any rate, we are doing an ETL process that involves a handful of ~20M row tables. We bulk load these tables and then update some attributes to help with the rest of the process.
For example, I am currently running the following update:
UPDATE A
SET A.CompanyId = B.Id
FROM etl.TRANSACTIONS AS A
LEFT OUTER JOIN dbo.Company AS B
ON A.CO_ID = B.ERPCode
TRANSACTIONS is ~ 20M rows; Company is fewer than 50.
I'm already 30 minutes into running this update which is far beyond what will be acceptable. The usage meter on the Pool is hovering around 40%.
For reference, our Azure VM runs this in about 2 minutes.
I load this table via the bulk copy and this update is already beyond what it took to load the entire table.
Any suggestions on speeding up this (and other) updates?
We are using an elastic pool at the P1 performance level.
I'm not sure how that translates to your VM's performance level, or what criteria you are using to compare the two.
Since no execution plan was provided, I would recommend the steps below.
1. Check if there is any wait type while the update is running:
select
session_id,
start_time,
command,
db_name(ec.database_id) as dbname,
blocking_session_id,
wait_type,
last_wait_type,
wait_time,
cpu_time,
logical_reads,
reads,
writes,
((database_transaction_log_bytes_used +database_transaction_log_bytes_reserved)/1024)/1024 as logusageMB,
txt.text,
pln.query_plan
from sys.dm_exec_requests ec
cross apply
sys.dm_exec_sql_text(ec.sql_handle) txt
outer apply
sys.dm_exec_query_plan(ec.plan_handle) pln
left join
sys.dm_tran_database_transactions trn
on trn.transaction_id=ec.transaction_id
The wait type gives you a lot of info which can be used to troubleshoot.
2. You can also use the query below to see, in parallel, what is happening with the update:
set statistics profile on;
-- run your update query in this session
Then run the query below in a separate window:
select
    session_id, physical_operator_name,
    row_count, actual_read_row_count, estimate_row_count, estimated_read_row_count,
    rebind_count,
    rewind_count,
    scan_count,
    logical_read_count,
    physical_read_count
from sys.dm_exec_query_profiles
where session_id = <your session id>;
As per your question, there doesn't seem to be an issue with DTU, so I don't see much of a problem on that front.
Slow performance solved in one case:
I have recently had severe problems with slow Azure updates that made it nearly unusable. It was updating only 1000 rows in 1 second. So 1M rows was 1000 seconds. I believe this is due to logging in Azure, but I haven't done enough research to be certain. Opening a MS support incident went nowhere. I finally solved the issue using two techniques:
Copy the data to a temporary table and make the updates in the temp table. So in the above case, try copying the data to a temp table and then back again after the updates; there is no/minimal logging in this case. (A rough sketch of this pattern is shown at the end of this answer.)
My copy back was still slow (I had a few 100K rows), so I created a clustered index on the temp table; the update duration dropped by a factor of 4-5.
I am using a S1-20DTU database. It is still about 5 times slower than a dedicated instance, but that is fantastic performance for the price.
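A rough sketch of that temp-table pattern, applied to the tables from the question and assuming etl.TRANSACTIONS has no identity or computed columns (so it can be copied straight back):

-- Stage the rows in tempdb, do the update there, then copy them back.
SELECT *
INTO #trx
FROM etl.TRANSACTIONS;

UPDATE T
SET T.CompanyId = B.Id
FROM #trx AS T
LEFT OUTER JOIN dbo.Company AS B
    ON T.CO_ID = B.ERPCode;

-- A clustered index on the temp table speeds up the copy back.
CREATE CLUSTERED INDEX IX_trx_co ON #trx (CO_ID);

TRUNCATE TABLE etl.TRANSACTIONS;
INSERT INTO etl.TRANSACTIONS
SELECT * FROM #trx;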
The real answer to this issue is that SQL Azure will spill to the tempdb much faster than you would expect if you are used to using a well provisioned VM or physical machine.
You can tell that this is happening by recording the actual execution query plan. Look for the warning icon on the operator:
(screenshot: plan operator warning icon)
The popup will complain about the spill:
(screenshot: tempdb spill warning)
At any rate, if you see this, it is likely that you're trying to do too much in the statement.
The Microsoft support person suggested updating the statistics, but this did not change the situation for us.
What seems to be working is the traditional advice to break the inserts up into smaller batches.
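As a hedged illustration of that batching advice, applied to the update from the question and assuming CompanyId is NULL on the freshly bulk-loaded rows (so each pass only touches unprocessed rows):

DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    -- Inner join here: rows with no match in Company simply stay NULL,
    -- which is what the original LEFT JOIN version would leave them as anyway.
    UPDATE TOP (100000) A
    SET A.CompanyId = B.Id
    FROM etl.TRANSACTIONS AS A
    INNER JOIN dbo.Company AS B
        ON A.CO_ID = B.ERPCode
    WHERE A.CompanyId IS NULL;

    SET @rows = @@ROWCOUNT;
END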

Azure SQL DW data loads taking long time

I am trying to load the data from my external tables into SQL DW internal tables. The data is stored in a compressed format in blob storage, and the external tables point to that blob storage location.
I have around 24 files, roughly 22 GB in total, and I am trying to load the data from the external table into an internal table on DWU 300 with a largerc resource class service/user account.
My INSERT INTO statement (which is very straightforward) has been running for more than 10 hours:
insert into Trxdata.Details_data select * from Trxdata.Stage_External_Table_details_data;
I also tried the statement below; that has also been running for more than 10 hours.
CREATE TABLE Trxdata.Details_data12
WITH
(
DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT *
FROM Trxdata.Stage_External_Table_details_data
;
I can see both SQL statements running with ACTIVE status in sys.dm_pdw_exec_requests (I was wondering whether it might be a concurrency slot issue, i.e. the query hadn't got a concurrency slot to run, but that's not the case),
and I was hoping that increasing/scaling up the DWU might improve the performance, but looking at the DWU usage in portal.azure.com I am not convinced I should increase the DWU, because the usage chart shows <50 DWU for the last 12 hours.
(screenshot: DWU usage chart)
So I am trying to understand: how can I find out what is taking such a long time, and how can I improve the performance of my data load?
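One hedged way to see where the time goes while the load is running is the step-level DMV (the request_id value below is a placeholder you would take from the first query):

-- Find the running load request.
SELECT request_id, status, command, total_elapsed_time
FROM sys.dm_pdw_exec_requests
WHERE status = 'Running';

-- Break that request down into its steps to see which one dominates.
SELECT request_id, step_index, operation_type, status, row_count, total_elapsed_time
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID1234'  -- placeholder: use the request_id returned above
ORDER BY step_index;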
I suspect your problem lies with the file(s) being compressed. Many Azure documents state that you will only get one reader per compressed file. As a test, I would suggest you decompress your data and try a load, to see if decompressing plus loading is faster than the 10 hours you are currently seeing for the compressed data. I have also had better luck with several files rather than one large file, if that is an option for your system.
Please have a look at the below blog from SQL CAT on data loading optimizations.
https://blogs.msdn.microsoft.com/sqlcat/2016/02/06/azure-sql-data-warehouse-loading-patterns-and-strategies/
Based on the info provided, a couple things to consider are:
1) Locality of the blob files compared to the DW instance. Make sure they are in the same region.
2) Clustered columnstore is on by default. If you are loading 22 GB of data, a HEAP load may perform better (though it also depends on the row count). So:
CREATE TABLE Trxdata.Details_data12
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM Trxdata.Stage_External_Table_details_data ;
If the problem still persists, please file a support ticket:
https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-get-started-create-support-ticket/
You mention that the data is in a compressed format. How many compressed files does the data reside in? For compressed files, you'll achieve more parallelism and thus better performance when the data is spread across many files. Having the data in multiple files is not needed for uncompressed files in order to achieve better performance, so another way to test if this is your performance issue is to un-compress your files.
