Output updated rows following Spark DML statements - apache-spark

When updating data in a SQL Server database, the records affected by update, insert, delete and merge statements can be retrieved by adding an output clause.
This is particularly useful when there is a merge statement that retains some parts of the old record within the new, merged version (such as a PreviousVersion or PreviousDate type column).
output allows that data to be carried forward into another process, because it returns the merged version of each record without querying the target table again. Only the newly arrived data, including the updates produced by the merge, then needs further processing, without a subsequent select on the target table (e.g. filtering on an UpdatedDate type column) or a join from the new data onto the updated target table.
Having looked through the documentation for Spark, I can't see any way of replicating this output clause behaviour without an additional read of or join onto the target table. Is there a way of outputting only the updated records from a merge statement and if not, what is the best way to achieve something similar?
An example of this logic would be something like:
New Data
ID | Start      | End
1  | 2022-01-01 | 2022-08-01
Target Table
ID | Start      | End        | PreviousEnd
1  | 2022-01-01 | 2022-07-01 | 2022-06-01
(many more data rows)
Merge Logic (pseudo)
when matched then update set
    target.PreviousEnd = target.End,  -- keep the old End
    target.End = source.End           -- take the new End from the source
output the updated rows
Merge Output (just one data row)
ID | Start      | End        | PreviousEnd
1  | 2022-01-01 | 2022-08-01 | 2022-07-01
From this point the output row can then be used, as a simple example, to add an additional month of time (End - PreviousEnd) to a summary held somewhere else, without having to query the larger target table a second time.
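For context, the merge half of this logic maps onto Delta Lake's MERGE API in Spark, but that API has no OUTPUT-style clause. Below is a minimal PySpark sketch of the pseudo logic above; the names target_table and new_data are assumptions for illustration only, not anything from the original post.

# Sketch of the merge pseudo-logic against a Delta table in PySpark.
# Assumes Delta Lake is installed and that "target_table" / "new_data" exist;
# note there is no equivalent of SQL Server's OUTPUT clause on this API.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-sketch").getOrCreate()

target = DeltaTable.forName(spark, "target_table")  # hypothetical Delta target
source = spark.table("new_data")                    # hypothetical view of the new rows

(target.alias("t")
 .merge(source.alias("s"), "t.ID = s.ID")
 .whenMatchedUpdate(set={
     "PreviousEnd": "t.End",  # assignments see the pre-update row, so this keeps the old End
     "End": "s.End",          # and this takes the new End from the source
 })
 .execute())

# The merge call itself does not return the updated records, so capturing them still
# requires a separate step (e.g. a filtered read of the target, or Delta's change
# data feed where that feature is available).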

Related

Upsert Option in ADF Copy Activity

With the "upsert option" , should I expect to see "0" as "Rows Written" in a copy activity result summary?
My situation is this: The source and sink table columns are not exactly the same but the Key columns to tell it how to know the write behavior are correct.
I have tested and made sure that it does actually do insert or update based on the data I give to it BUT what I don't understand is if I make ZERO changes and just keep running the pipeline , why does it not show "zero" in the Rows Written summary?
The main reason why rowsWritten is not shown as 0, even when the source and destination have the same data, is:
Upsert inserts data when a key column value is absent from the target table and updates the values of the other columns whenever the key column value is found in the target table.
Hence, it modifies all matched records irrespective of whether the data has changed. As with SQL MERGE, there is no way to tell the copy activity to ignore a row when an identical row already exists in the target table.
So, even when key_column matches, it is going to update the values of the rest of the columns, and the row is therefore counted as written. The following is an example of 2 cases.
Case 1 - the rows of source and sink are the same.
The rows present:
id,gname
1,Ana
2,Ceb
3,Topias
4,Jerax
6,Miracle
Case 2 - inserting completely new rows.
The rows present in source (where the sink data is as above):
id,gname
8,Sumail
9,ATF

Delete data between date range in Data Flow

I have a Data Flow that reads from Parquet files, does some filtering and then loads into a Delta Lake. The data flow will run multiple times and I don't want duplicate data in my Delta Lake. To safeguard against this, I thought to implement a delete-insert mechanism: find the minimum and maximum date of the incoming data and delete all the data in the destination (Delta) that falls within this range. Once deleted, all filtered incoming data would be inserted into the Delta Lake.
From the documentation, I saw that I need to add row-level policies in an Alter Row Tx to mark particular rows for deletion. I added the Delete-If condition as between(toDate(date, 'MM/dd/yyyy'), toDate("2021-12-22T01:49:57", 'MM/dd/yyyy'), toDate("2021-12-23T01:49:57", 'MM/dd/yyyy')), where date is a column in the incoming data.
However, in the data preview of the Alter Row Tx, all the rows are marked for insertion and none for deletion, even though there definitely are records that fall within that range.
I suspect that the Delete-If condition does not work the way I want it to. In that case, how do I implement deletion between a date range in a Data Flow with Delta as the destination?
You need to tell ADF what to do with the other portions of the timestamp (it's not a date type yet). Try this:
toString(toTimestamp('2021-12-22T01:49:57', 'yyyy-MM-dd\'T\'HH:mm:ss'), 'MM/dd/yyyy')
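For comparison only (not the ADF expression fix above), the delete-then-insert safeguard described in the question can also be written directly against the Delta table in PySpark. This is a sketch under the assumption that both the incoming DataFrame and the Delta table have a column named date, and that the path shown is purely illustrative.

# Sketch: delete the destination rows that fall inside the incoming date range,
# then append the incoming rows, so reruns cannot create duplicates.
# Assumes `spark` and the filtered `incoming` DataFrame already exist.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

bounds = incoming.agg(F.min("date").alias("lo"), F.max("date").alias("hi")).first()

delta_path = "/mnt/delta/target"          # hypothetical destination path
target = DeltaTable.forPath(spark, delta_path)

# Remove everything already stored for the incoming date range...
target.delete((F.col("date") >= F.lit(bounds["lo"])) &
              (F.col("date") <= F.lit(bounds["hi"])))

# ...then insert the filtered incoming data.
incoming.write.format("delta").mode("append").save(delta_path)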

Remove Duplicates based on latest date in power query

I have a dataset that I am loading into my sheet via Power Query and wish to transform the data a little to my liking before loading it in.
To give a little more context, I have some IDs and I would like the older rows to be removed and only the rows with the newer date for each ID to be loaded in.
A solution is described at https://exceleratorbi.com.au/remove-duplicates-keep-last-record-power-query/
("Remove Duplicates and Keep the Last Record with Power Query").
In short, sort by date in a buffered table and then remove duplicate ids.
Another way would be to group by id and take the MAX date, but it depends on the data size.
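The question is about Power Query, but for comparison the same keep-the-newest-row-per-id idea looks like this in PySpark (a sketch assuming an input DataFrame df with columns id and date):

# Keep only the most recent row per id by ranking rows within each id by date.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("id").orderBy(F.col("date").desc())

latest = (df.withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)   # row_number() == 1 is the newest date per id
            .drop("rn"))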

How does Apache spark structured streaming 2.3.0 let the sink know that a new row is an update of an existing row?

How does spark structured streaming let the sink know that a new row is an update of an existing row when run in an update mode? Does it look at all the values of all columns of the new row and an existing row for an equality match or does it compute some sort of hash?
Reading the documentation, we see some interesting information about update mode (bold formatting added by me):
Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (available since Spark 2.1.1). Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it will be equivalent to Append mode.
So, to use update mode there needs to be some kind of aggregation, otherwise all data will simply be appended to the end of the result table. In turn, to use aggregation the data needs to use one or more columns as a key. Since a key is needed, it is easy to know whether a row has been updated or not: simply compare the values with the previous iteration of the table (the key tells you which row to compare with). In aggregations that contain a groupBy, the columns being grouped on are the keys.
Simple aggregations that return a single value will not require a key. However, since only a single value is returned, it will be updated whenever that value changes. An example here could be taking the sum of a column (without a groupBy).
The documentation contains a picture that gives a good understanding of this, see the "Model of the Quick Example" from the link above.
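As an illustration of the above, here is a minimal PySpark structured streaming sketch: the groupBy column acts as the key, and update mode re-emits only the rows whose aggregate changed in the last trigger. The rate source and console sink are just convenient stand-ins.

# Minimal update-mode aggregation: only rows whose count changed since the
# previous trigger are written to the sink.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("update-mode-sketch").getOrCreate()

# The rate source continuously emits (timestamp, value) rows.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# The groupBy column ("bucket") is the key that identifies which row an update belongs to.
counts = stream.groupBy((F.col("value") % 3).alias("bucket")).count()

query = (counts.writeStream
         .outputMode("update")   # emit only changed rows each trigger
         .format("console")
         .start())

query.awaitTermination()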

Excel Power Query - Incremental Load from Query and adding date

I am trying to do something quite simple which I am failing to understand.
Take the output from a query, add a date/time stamp, and write it into an Excel table.
Run the logic again and you get the same output, but the generated date/time has moved forward.
Query 1 -- from SQL, which yields 2 columns: category and count.
I am taking this and adding a generated date to it using DateTime.LocalNow().
Query 2 -- target table.
How can I construct a query which appends to an existing table and doesn't require me to load the result into a new table?
I have seen this blog.oraylis.de post and I can't make it work, since the DateTime.LocalNow() call runs for both source and target and I end up with the same datetime throughout the query.
I think I am missing something obvious.
EDIT:
= Table.Combine({SOURCE_DATA, TARGET_DATA})
This loads into a third, new table and doesn't take that third table into account when loading, so you just end up with a new version of the first two tables with a new timestamp.
These steps should work:
create a query Q1 based on the SQL statement, add your timestamp using DateTime.LocalNow() and load this into an Excel table (execute the query)
create a new query Q2 based on this new Excel table (just like that, no transforms)
modify the first query Q1 by adding the Table.Combine with Q2 as the last step
So, in other words, Q2 loads the existing data from the Excel table into which Q1 writes. The Excel table is always rewritten completely, but since the existing data is preserved you get the effect of new data being appended to the table; a Python sketch of the same pattern follows below. Hope this helps.
Good luck, Hilmar
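A Python analogue of the self-referencing pattern above, purely for illustration (this is not Power Query M; the file name and column names are assumptions): the previously accumulated table plays the role of Q2, and each run stamps the fresh query result and combines it with what is already stored.

# Sketch of the accumulate-with-timestamp pattern using pandas.
import pandas as pd
from datetime import datetime

ACCUMULATED = "accumulated.csv"  # stands in for the Excel target table that Q1 loads into

# Stand-in for the SQL query result (category, count), stamped like DateTime.LocalNow().
new_rows = pd.DataFrame({"category": ["a", "b"], "count": [10, 20]})
new_rows["loaded_at"] = datetime.now()

try:
    existing = pd.read_csv(ACCUMULATED, parse_dates=["loaded_at"])  # the "Q2" read of existing data
except FileNotFoundError:
    existing = new_rows.iloc[0:0]                                   # first run: nothing accumulated yet

# Equivalent of Table.Combine({SOURCE_DATA, TARGET_DATA}) as the last step of Q1.
combined = pd.concat([existing, new_rows], ignore_index=True)
combined.to_csv(ACCUMULATED, index=False)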

Resources