LastUpdatedDate in ODS - Azure

I am migrating data from an SAP HANA view to an ODS using Azure Data Factory. From there, a third-party company moves the data into a Salesforce database. Currently, the migration does a truncate and load in the sink.
There is no column in the source that shows the date, or last updated date, when new rows are added in SAP HANA.
Do we need to have such a date in the source, or is there some other way we can write it in the ODS?
The ODS must show a last updated date, or something similar, to denote when a row has been inserted or changed after the initial load, so that the third party can track changes when loading into the Salesforce database.

Truncate and load a staging table, then run a stored procedure to MERGE into your target table, marking inserted and updated rows with the current SYSDATETIME(). Alternatively, MERGE from the staging table into a temporal table, or into a table with Change Tracking enabled, to track the changes automatically.
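A minimal sketch of that approach, assuming a staging table dbo.Staging_Source loaded by the truncate-and-load copy, a target dbo.Target_ODS keyed on Id, and a LastUpdatedDate column added to the target (all object and column names here are placeholders):
-- Hypothetical objects: dbo.Staging_Source is truncate-and-loaded by ADF each run;
-- dbo.Target_ODS has the same business columns plus LastUpdatedDate DATETIME2.
MERGE dbo.Target_ODS AS tgt
USING dbo.Staging_Source AS src
    ON tgt.Id = src.Id
WHEN MATCHED AND tgt.SomeColumn <> src.SomeColumn   -- stamp only rows that actually changed
    THEN UPDATE SET
        tgt.SomeColumn      = src.SomeColumn,
        tgt.LastUpdatedDate = SYSDATETIME()
WHEN NOT MATCHED BY TARGET
    THEN INSERT (Id, SomeColumn, LastUpdatedDate)
         VALUES (src.Id, src.SomeColumn, SYSDATETIME());
The downstream load to Salesforce can then filter on LastUpdatedDate to pick up only rows inserted or changed since its previous run.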

Related

Insert Excel sheet data into an existing BigQuery table

We have an existing table on BigQuery that gets updated via a scheduler that checks an FTP server and uploads the newly added data into it.
The issue is that a few days were dropped from the FTP, and now I need to upload that data manually into the table.
Ideally, I don't want to create another table, upload the data into it, and then union the two tables; I was looking for a solution that would insert the sheet data into the main table right away.

External Table in Databricks is showing only future date data

I have a Delta table in Databricks and the data is available in ADLS. The data is partitioned by a date column; from 01-06-2022 onwards the data is available in Parquet format in ADLS, but when I query the table in Databricks I can see only the most recently loaded date's data each day; older data is not displayed. Every day the data is overwritten to the table path, partitioned by the date column.
df.write.format('delta').mode('overwrite').save('{}/{}'.format(DELTALAKE_PATH, table))
Using Overwrite mode will delete past data and add new data. This is the reason for your issue.
df.write.format('delta').mode('append').save('{}/{}'.format(DELTALAKE_PATH, table))
Using append mode will append new data beneath the existing data. This will keep your existing data and when you execute a query, it will return past records as well.
You need to use append mode in place of overwrite mode.
Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only to queries where existing rows in the Result Table are not expected to change.
Reference - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts

Azure Data flow - presql script, dynamic content

I want to run a pre-SQL script in the sink of the Data flow. I want to delete existing records for a particular year.
This particular year will come from the source Excel file.
Say I have a file for 2021 data and loaded that data into the DB. When I rerun the pipeline for the same Excel file, I want to delete the 2021-related records in the DB and insert fresh data. This table may contain multiple years of data, so every time a new file arrives for a particular year, I want to delete the respective records and reload the new data.
I can read the year value from a source file column, and I can keep it as a derived column. How can I write a pre-SQL script to delete?
delete from <table> where year = <sourcefile.year>
How can I do this in a data flow?
Please help!
You can use a temporary table in sink 1 and delete the records in sink 2.
Please follow the demonstration below:
This is my SQL table with some sample data.
Create two SQL sinks for it from the same source using a new branch.
In the first sink, provide a dataset with the edit table name option for the temporary table.
In that sink, check Recreate table.
In sink 2, use your SQL table with the below SQL script.
DELETE FROM exceltable WHERE year IN (SELECT year FROM dbo.temp1);
DROP TABLE dbo.temp1;
In the settings of the Data flow, set the write order of the sinks.
The temporary table sink should execute first.
This is my result in the SQL table after deleting the records.
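Putting the two sinks together in SQL terms (the helper table's column list below is illustrative; only the year column matters for the delete):
-- Sink 1 (write order 1, Recreate table checked): the data flow drops and recreates
-- dbo.temp1 from the incoming Excel rows, e.g. shaped roughly like
--     CREATE TABLE dbo.temp1 (id INT, name VARCHAR(100), year INT);
-- so after it runs, dbo.temp1 holds the year(s) present in the current file.
-- Sink 2 (write order 2) then runs its SQL script before inserting the fresh rows:
DELETE FROM exceltable WHERE year IN (SELECT year FROM dbo.temp1);
DROP TABLE dbo.temp1;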

How to create an incremental load with Salesforce as source in Azure Data Factory?

Is there any way we can fetch the max of the last modified date from the last processed file and store it in a config table?
From Supported data stores and formats you can see that Salesforce, Salesforce Service Cloud, and Marketing Cloud are supported.
You have to perform the following steps:
Prepare the data store to store the watermark value (a minimal table sketch follows this list).
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
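For step 1, a minimal watermark (control) table could look like the sketch below; the table and column names mirror the pattern in the linked tutorial and are only a suggestion:
-- Control table holding one high-watermark value per source table/object.
CREATE TABLE dbo.watermarktable (
    TableName      VARCHAR(255) NOT NULL,
    WatermarkValue DATETIME     NOT NULL
);
-- Seed it once with an initial (old) watermark per source.
INSERT INTO dbo.watermarktable (TableName, WatermarkValue)
VALUES ('data_source_table', '2010-01-01 00:00:00');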
Follow this to set up a Linked Service with Salesforce in Azure Data Factory.
When copying data from Salesforce, you can use either a SOQL query or a SQL query. Note that these two have different syntax and functionality support; do not mix them. You are advised to use a SOQL query, which is natively supported by Salesforce.
Process for incremental loading, or delta loading, of data through a watermark:
In this case, you define a watermark in your source database. A watermark is a column that has the last-updated time stamp or an incrementing key. The delta loading solution loads the changed data between an old watermark and a new watermark. The workflow for this approach is as follows:
ADF will scan all the files in the source store, apply the file filter by their LastModifiedDate, and copy only the new and updated files since the last run to the destination store.
For capabilities, prerequisites, and Salesforce request limits, refer to Copy data from and to Salesforce by using Azure Data Factory.
Refer to the doc Delta copy from a database with a control table. This article describes a template that's available to incrementally load new or updated rows from a database table to Azure by using an external control table that stores a high-watermark value.
This template requires that the schema of the source database contains a timestamp column or incrementing key to identify new or updated rows.
The template contains four activities:
Lookup retrieves the old high-watermark value, which is stored in an external control table.
Another Lookup activity retrieves the current high-watermark value from the source database.
Copy copies only the changes from the source database to the destination store. The query that identifies the changes in the source database is similar to SELECT * FROM Data_Source_Table WHERE TIMESTAMP_Column > 'last high-watermark' AND TIMESTAMP_Column <= 'current high-watermark' (spelled out after this list).
StoredProcedure writes the current high-watermark value to an external control table for delta copy next time.
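Spelled out against such a control table, the Copy activity's source query is roughly the sketch below (Data_Source_Table, TIMESTAMP_Column, and dbo.watermarktable are placeholders; in the template the two boundary values actually come from the Lookup activities rather than sub-queries):
-- Rows changed since the last run: newer than the stored (old) watermark,
-- but no newer than the current (new) watermark captured at the start of this run.
SELECT *
FROM   Data_Source_Table
WHERE  TIMESTAMP_Column >  (SELECT WatermarkValue
                            FROM   dbo.watermarktable
                            WHERE  TableName = 'Data_Source_Table')
  AND  TIMESTAMP_Column <= (SELECT MAX(TIMESTAMP_Column) FROM Data_Source_Table);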
Go to the Delta copy from Database template. Create a new connection to the source database that you want to copy data from.
Create connections to the external control table and stored procedure that you created, and select Use this template.
Choose the available pipeline.
For Stored procedure name, choose [dbo].[update_watermark]. Select Import parameter, and then select Add dynamic content.
Click Add dynamic content and type in the query below. This will get the maximum date in your watermark column, which we can use for the delta slice.
You can use this query to fetch the max of the last modified date from the last processed file:
select MAX(LastModifytime) as NewWatermarkvalue from data_source_table
or
For files only, you can use Incrementally copy new and changed files based on LastModifiedDate by using the Copy Data tool.
Reference: ADF Incremental loading with configuration stored in a table
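Finally, a plausible body for the [dbo].[update_watermark] procedure referenced above, following the pattern in the linked tutorial (parameter names and the control-table name are assumptions):
-- Persist the new high-watermark so the next run only picks up later changes.
CREATE PROCEDURE dbo.update_watermark
    @LastModifiedtime DATETIME,
    @TableName        VARCHAR(255)
AS
BEGIN
    UPDATE dbo.watermarktable
    SET    WatermarkValue = @LastModifiedtime
    WHERE  TableName = @TableName;
END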

Append new data to an External Data table in Excel

If I have a table in Excel, populated via an external data connection, how can I refresh the data in such a way as to insert new rows for new data, but keep the old rows as well?
For example, my table currently contains the previous month's data.
Unfortunately, the database that I'm working with only holds the current month's data, so if I refresh, I'll only get February 2011's data back. The end result I want is a table containing both the old rows and the newly refreshed rows.
Are there any built-in Excel options that I'm missing (similar to "External Data Properties"->"Insert entire rows for new data, clear unused cells") or should I go the programmatic route and save the old data in a temp table, etc?
Since Excel external data is based on a query of the external source, Refresh will update the table to whatever is currently in that source. I think you will need to code a routine to append the external-data rows to another sheet.
