How to do Incremental loading without comparing with whole data? - apache-spark

I was trying to do an incremental load from my on-prem data lake to Azure Data Lake Gen2:
select
    ac_id, mbr_id, act_id, actdttm,
    cretm, rsltyid, hsid, cdag, cdcts
from df2_hs2_lakeprd_ACTV_table
where cdcts > <last modified date>
Only a few records are updated or added daily, but my source table is very large. When I run the above query, the ADF copy activity takes a very long time to load. I think the filter condition is being checked against every record in the source table, which is why it is slow.
Is there any way I can query so that only the updated records are loaded directly from the source? Also, my source table is partitioned by date; will the partition column help the load run faster?
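For context, this is the kind of filter I am thinking of, shown as a minimal sketch; part_dt is a stand-in for my actual partition column and the literals are placeholders:

-- Hypothetical sketch: restrict the scan to recent partitions first,
-- then apply the change-tracking filter.
select
    ac_id, mbr_id, act_id, actdttm,
    cretm, rsltyid, hsid, cdag, cdcts
from df2_hs2_lakeprd_ACTV_table
where part_dt >= '2021-05-01'            -- assumed partition column: lets the engine prune partitions
  and cdcts   >  '2021-05-01 02:15:00'   -- watermark: last modified date already loaded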

Why not have the incremental files land in two folders:
A. incremental_yyyy_mm_dd_hh_min_seconds
B. Datalake
Always read from the incremental folder; that way you end up reading only the delta, or at worst a very small number of excess records. Once an incremental folder has been read, record the status that it has been processed.
So the Datalake folder will always hold the full snapshot.
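A rough sketch of what reading an incremental folder and folding it into the snapshot could look like, assuming the files are Parquet and the snapshot is a Delta (or otherwise MERGE-capable) table; the path, table names, and key column are placeholders:

-- Hypothetical Spark SQL sketch; folder path, table names and the key
-- column (ac_id) are assumptions, not from the original answer.

-- Expose one unprocessed incremental folder as a view.
create temporary view incr_batch
using parquet
options (path 'abfss://lake@account.dfs.core.windows.net/incremental_2021_05_01_01_30_00/');

-- Fold the delta into the full snapshot kept under the Datalake folder
-- (assumes the snapshot is a Delta table so MERGE is available).
merge into datalake_snapshot as t
using incr_batch as s
  on t.ac_id = s.ac_id
when matched then update set *
when not matched then insert *;

-- Afterwards, record (e.g. in a small control table) that this
-- incremental folder has been processed.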

Related

Azure Data Factory Truncate prescript taking too long as data size increases

I am using Azure Data Factory to load data from Azure SQL Server into a Snowflake data warehouse. I am using a truncate-table script in the pre-script section of the sink to completely remove the data from the destination and then insert all of the data from the source table. It's fast when the data size is small, but once the data gets really big the entire copy activity takes hours to complete the sync. What other alternatives can I use to copy data from my source to the destination?
I thought about using upsert instead of truncate-and-insert, but since it would check each record I assumed it would be slower.
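For reference, the kind of upsert I have in mind would look roughly like this (a sketch only; table and column names are placeholders, and the data would first have to land in a staging table):

-- Hypothetical sketch (Snowflake SQL): upsert from a staging table that the
-- copy activity loads first. dim_orders, stg_orders and order_id are placeholders.
merge into dim_orders t
using stg_orders s
  on t.order_id = s.order_id
when matched then update set
    amount     = s.amount,
    updated_at = s.updated_at
when not matched then insert (order_id, amount, updated_at)
    values (s.order_id, s.amount, s.updated_at);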

Need to do an incremental load using ADF. Source is a csv from ADLS and Sink is Azure SQL

I am trying to do an incremental data load to Azure SQL from CSV files in ADLS through ADF. The problem I am facing is that Azure SQL generates the primary key column (ID) when the data is inserted into Azure SQL, but when the pipeline is re-triggered the data gets duplicated. So how do I handle these duplicates? Only the incremental load should be applied each time, but since the primary key column is generated by SQL there are duplicates on every run. Please help!
You can consider comparing the source and sink data first, excluding the primary key column, and then filtering the rows that have been modified and taking them to the sink table.
In the video below I created a hash over a few columns from the source and the sink and compared them to identify the changed data. In the same way, you can identify the changed data first and then load only that into the sink table.
https://www.youtube.com/watch?v=i2PkwNqxj1E
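The same comparison, sketched in Spark SQL rather than inside a data flow (all names are placeholders; I assume a single key column id and three non-key columns):

-- Hypothetical sketch: hash the non-key columns on both sides and keep only
-- the source rows whose hash is new or different.
with src as (
    select s.*,
           sha2(concat_ws('||', col_a, col_b, col_c), 256) as row_hash
    from source_table s
),
snk as (
    select id,
           sha2(concat_ws('||', col_a, col_b, col_c), 256) as row_hash
    from sink_table
)
select src.*
from src
left join snk
  on src.id = snk.id
where snk.id is null                  -- row not in the sink yet
   or src.row_hash <> snk.row_hash    -- row exists but has changed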

In Azure, is there a way to maintain a current external/polybase table while also maintaining history for rolling back transactions?

I have a Delta table in Azure Databricks that stores history for every change that occurs. I also have a PolyBase external table for users to read from in Azure SQL Data Warehouse/Azure Synapse. But when an update or delete is necessary, I have to vacuum the Delta table so that the PolyBase table reflects only the latest version; otherwise it sees a copy from every previous version. That vacuum, by its nature, deletes the history, so I can no longer roll back.
I'm thinking my only choice is to manually keep a row-change table with columns = [schema_name, table_name, primary_key, old_value, new_value] so that we can reapply the changes in reverse if necessary. Is there anything more elegant I can do?
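Roughly, the change-log table I have in mind would look like this (a sketch only; the typed values are simplified to strings, and the timestamp column is an extra I would add so changes can be replayed in order):

-- Hypothetical sketch of the row-change log (T-SQL-style DDL).
create table row_change_log (
    schema_name  varchar(128)  not null,
    table_name   varchar(128)  not null,
    primary_key  varchar(256)  not null,
    old_value    varchar(4000),
    new_value    varchar(4000),
    changed_at   datetime2     not null  -- extra column, not in the list above
);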

Azure Data Factory DataFlow Filter is taking a lot of time

I have an ADF pipeline which executes a Data Flow.
The Data Flow has:
a Source, table A, with around 1 million rows;
a Filter with a query that selects only yesterday's records from the source table;
an Alter Row setting which uses upsert;
a Sink, which is the archival table into which the records are upserted.
This whole pipeline takes around 2 hours or so, which is not acceptable given that only around 3,000 records are actually transferred/upserted.
The core count is 16. I tried partitioning with round robin and 20 partitions.
A similar archival flow doesn't take more than 15 minutes for another table with around 100K records.
I thought of creating a source that would select only yesterday's records, but in the dataset we can only select a table.
Please suggest if I am missing anything to optimize it.
The table selected in the dataset really doesn't matter. Whichever activity you use to access that dataset can be toggled to use a query instead of the whole table, so that you can pass in a value and select only yesterday's data from the database.
Of course, if you have the ability to create a stored procedure on the source, you could also do that.
When migrating really large sets of data, you'll get much better performance by using a Copy activity to stage the data in an Azure Storage blob before using another Copy activity to pull from that blob into the sink. But for what you're describing here, that doesn't seem necessary.
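As a sketch, the source query could look something like this, with the dates normally supplied by a pipeline parameter or expression rather than hard-coded (table and column names are placeholders):

-- Hypothetical sketch of the source query used instead of the whole table.
-- 'record_date' is an assumed column; the literals stand in for yesterday's date.
select *
from source_table_a
where record_date >= '2021-05-01'
  and record_date <  '2021-05-02';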

How can I copy dynamic data from on prem sqlserver to azure datawarehouse

I have created a linked service that takes the data from on-prem and stores it in Azure Blob storage, but my data is dynamic. How can I build a pipeline that takes the updated table into the blob and then transfers that blob into the Azure data warehouse? I need this in such a way that all my tables stay in real-time sync with the Azure data warehouse.
What you are probably looking for is incrementally loading data into your data warehouse.
The procedure described below is documented here. It assumes you have periodic snapshots of your whole source table in blob storage.
You need to elect a column to track changes in your table.
If you are only appending and never changing existing rows, the primary key will do the job.
However, if you have to cope with changes to existing rows, you need a way to track those changes (for instance with a column named "timestamp-of-last-update", or any better, more succinct name).
Note: if you don't have such a column, you will not be able to track changes and therefore will not be able to load data incrementally.
For a given snapshot, we are interested in the rows added or updated in the source table. This content is called the delta associated with the snapshot. Once the delta is computed, it can be upserted into your table with a Copy activity that invokes a stored procedure. Here you can find details on how this is done.
Assuming the values of the elected column only grow as rows are added/updated in the source table, you need to keep track of its maximum value across the snapshots. This tracked value is called the watermark. This page describes a way to persist the watermark in SQL Server.
Finally, you need to be able to compute the delta for a given snapshot from the last stored watermark. The basic idea is to select the rows where the elected column is greater than the stored watermark. You can do so using SQL Server (as described in the referenced documentation), or you can use Hive on HDInsight to do this filtering.
Do not forget to update the watermark with the maximum value of the elected column once the delta is upserted into your data warehouse.
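Putting the pieces together, a sketch of the watermark bookkeeping in SQL (table, column, and value names are placeholders and only loosely follow the linked documentation):

-- Hypothetical sketch of the watermark pattern (T-SQL-style).

-- 1. One row per source table holding the last processed watermark.
create table watermark_table (
    table_name      varchar(255) not null,
    watermark_value datetime2    not null
);

-- 2. Compute the delta: rows changed since the stored watermark.
--    In ADF this is typically the source query of a Copy activity.
select *
from source_table
where last_update_ts > (select watermark_value
                        from watermark_table
                        where table_name = 'source_table');

-- 3. After the delta has been upserted into the warehouse, move the
--    watermark forward to the maximum value just processed.
update watermark_table
set watermark_value = (select max(last_update_ts) from source_table)
where table_name = 'source_table';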
