Delta Lake Vacuum - delta-lake

In Delta Lake, is it possible to vacuum only specific rows from the data? The client has a requirement to vacuum a customer's data when that customer is deactivated, in an effort to delete the customer's entire data set. They do not want to keep history.

You need to delete the customer's data and then run VACUUM to remove the old files that contained the original customer data.
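For example, a minimal sketch in Delta SQL, assuming a table named customer_table with a customer_id column (both illustrative names):

-- Delete the deactivated customer's rows from the Delta table.
DELETE FROM customer_table WHERE customer_id = 'C123';

-- VACUUM only removes files older than the retention period (default 7 days).
-- To purge the old files immediately, lower the retention and disable the
-- safety check; note this also removes time-travel history for the table.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM customer_table RETAIN 0 HOURS;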

Related

How to do Incremental loading without comparing with whole data?

I was trying to do an incremental load from my on-prem data lake to Azure Data Lake Gen2.
select
    ac_id, mbr_id, act_id, actdttm,
    cretm, rsltyid, hsid, cdag, cdcts
from df2_hs2_lakeprd_ACTV_table
where cdcts > last modified date
Very few records are updated or added daily, but my source table is very large. When I run the above query, the ADF copy activity takes a very long time to load. I think the filter condition is being checked against all records in the source table, which is why it takes so long.
Is there any way I can query so that it directly loads only the updated records from the source? Also, my source table is partitioned by date; will the partition column help the load run faster?
Why not do the following:
Have the incremental files land in two folders: A. incremental_yyyy_mm_dd_hh_min_seconds and B. Datalake.
Always read from the incremental folder; this way you read only the delta, or at most a very small number of excess records. Once an incremental folder has been read, record that it has been processed (a sketch of the approach follows below).
The Datalake folder will then always hold the full snapshot.
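If you have Spark available (for example on Databricks), a minimal sketch of applying one incremental folder to the Datalake snapshot with a Delta MERGE; the paths and the ac_id join key are illustrative, and this assumes the incremental files are Parquet and the Datalake folder is a Delta table:

-- Read only the latest incremental folder instead of scanning the whole source.
CREATE OR REPLACE TEMP VIEW latest_delta AS
SELECT * FROM parquet.`/mnt/adls/incremental_2021_05_01_08_00_00/`;

-- Merge the delta into the full snapshot kept in the Datalake folder.
MERGE INTO delta.`/mnt/adls/datalake/` AS t
USING latest_delta AS s
ON t.ac_id = s.ac_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

After the merge succeeds, mark that incremental folder as processed so it is not read again.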

In Azure, is there a way to maintain a current external/polybase table while also maintaining history for rolling back transactions?

I have a Delta table in Azure Databricks that stores history for every change that occurs. I also have a PolyBase external table for users to read from in an Azure SQL Data Warehouse/Azure Synapse. But when an update or delete is necessary, I have to vacuum the table so that the PolyBase table reflects only the latest version; otherwise it holds a copy from every previous version. That vacuum, by nature, deletes the history, so I can no longer roll back.
I'm thinking my only choice is to manually keep a row change table with columns = [schema_name, table_name, primary_key, old_value, new_value] so that we can reapply the changes in reverse if necessary. Is there anything more elegant I can do?
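For reference, a minimal sketch of the row change table proposed above, written as plain T-SQL with illustrative names and types (adapt to Synapse/SQL DW limitations as needed):

-- Illustrative schema for the proposed row change log.
CREATE TABLE dbo.row_change_log (
    schema_name  NVARCHAR(128)  NOT NULL,
    table_name   NVARCHAR(128)  NOT NULL,
    primary_key  NVARCHAR(200)  NOT NULL,
    old_value    NVARCHAR(MAX)  NULL,
    new_value    NVARCHAR(MAX)  NULL,
    changed_at   DATETIME2      NOT NULL
);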

LastUpdatedDate in ODS

I am migrating data from a SAP HANA view to an ODS using Azure Data Factory. From there, another third-party company moves the data to a Salesforce database. Currently, when I migrate, we do a truncate and load in the sink.
There is no column in the source that shows the date or last-updated date when new rows are added in SAP HANA.
Do we need to have the date in the source, or is there another way we can write it in the ODS? The ODS must carry a last-updated date, or something similar, to denote when a row was inserted or changed after the initial load, so that they can track changes when loading into the Salesforce database.
Truncate and load a staging table, then run a stored procedure that MERGEs into your target table, marking inserted and updated rows with the current SYSDATETIME(). Alternatively, MERGE from the staging table into a temporal table, or into a table with Change Tracking enabled, to track the changes automatically.
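A minimal sketch of such a MERGE in T-SQL, assuming a staging table dbo.stg_customer, a target dbo.ods_customer, a key column customer_id and one payload column customer_name (all illustrative names):

-- Upsert from staging into the target, stamping changed rows with the current time.
MERGE dbo.ods_customer AS tgt
USING dbo.stg_customer AS src
    ON tgt.customer_id = src.customer_id
WHEN MATCHED AND tgt.customer_name <> src.customer_name THEN
    UPDATE SET tgt.customer_name   = src.customer_name,
               tgt.last_updated_at = SYSDATETIME()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (customer_id, customer_name, last_updated_at)
    VALUES (src.customer_id, src.customer_name, SYSDATETIME());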

How can I copy dynamic data from on-prem SQL Server to Azure data warehouse

I have created a linked service that takes the data from on-prem and stores it in Azure Blob storage, but my data is dynamic. How can I build a pipeline that copies the updated table into the blob and then transfers that blob into the Azure data warehouse? I need this in such a way that all my tables stay in real-time sync with the Azure data warehouse.
What you are probably looking for is incrementally loading data into your data warehouse.
The procedure described below is documented here. It assumes you have periodic snapshots of your whole source table in blob storage.
You need to elect a column to track changes in your table.
If you are only appending and never changing existing rows, the primary key will do the job.
However, if you have to cope with changes in existing rows, you need a way to track those changes (for instance, with a column named "timestamp-of-last-update", or any better, more succinct name).
Note: if you don't have such a column, you will not be able to track changes and therefore will not be able to load data incrementally.
For a given snapshot, we are interested in the rows added or updated in the source table. This content is called the delta associated with the snapshot. Once the delta is computed, it can be upserted into your table with a Copy Activity that invokes a stored procedure. Here you can find details on how this is done.
Assuming the values of the elected column only grow as rows are added/updated in the source table, you need to keep track of its maximum value across the snapshots. This tracked value is called the watermark. This page describes a way to persist the watermark in SQL Server.
Finally, you need to be able to compute the delta for a given snapshot, given the last stored watermark. The basic idea is to select the rows where the elected column is greater than the stored watermark. You can do so using SQL Server (as described in the referenced documentation), or you can use Hive on HDInsight to do this filtering.
Do not forget to update the watermark with the maximum value of the elected column once the delta is upserted into your data warehouse.
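A minimal sketch of the delta computation and watermark update in SQL Server, assuming a control table dbo.watermarktable and a tracking column LastModifyTime (illustrative names, loosely following the referenced documentation):

-- 1. Compute the delta: rows changed since the stored watermark.
SELECT *
FROM   dbo.source_table
WHERE  LastModifyTime > (SELECT WatermarkValue
                         FROM   dbo.watermarktable
                         WHERE  TableName = 'source_table');

-- 2. After the delta has been upserted into the warehouse, advance the watermark.
UPDATE dbo.watermarktable
SET    WatermarkValue = (SELECT MAX(LastModifyTime) FROM dbo.source_table)
WHERE  TableName = 'source_table';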

Use offset of WindowStart in Azure Data Factory

I wish to incrementally copy data from an Azure table to an Azure blob. I have created linked services, datasets, and pipelines. I wish to copy data from the table to the blob every hour. The table has a timestamp column. I want to transfer data from the table to the blob in such a way that data added to the table between 7 am and 8 am is pushed to the blob in the activity window starting at 8 am. In other words, I don't want to miss any data flowing into the table.
I have changed the query used to extract data from the Azure table.
"azureTableSourceQuery": "$$Text.Format('PartitionKey gt \\'{0:yyyyMMddHH} \\' and PartitionKey le \\'{1:yyyyMMddHH}\\'', Time.AddHours(WindowStart, -2), Time.AddHours(WindowEnd, -2))"
This query gets the data that was added to the table two hours back, so I won't miss any data.
