Delete data between date range in Data Flow - Azure

I have a Data Flow that reads from Parquet files, does some filtering and then loads into a Delta Lake. The data flow will run multiple times, and I don't want duplicate data in my Delta Lake. To safeguard against this, I thought I would implement a delete-insert mechanism: find the minimum and maximum date of the incoming data and delete all the data in the destination (Delta) that falls within this range. Once that is deleted, all the filtered incoming data would be inserted into the Delta Lake.
From the documentation, I saw that I need to add row-level policies in an Alter Row transformation to mark particular rows for deletion. I added a Delete-If condition of between(toDate(date, 'MM/dd/yyyy'), toDate("2021-12-22T01:49:57", 'MM/dd/yyyy'), toDate("2021-12-23T01:49:57", 'MM/dd/yyyy')), where date is a column in the incoming data.
However, in the data preview of the Alter Row transformation, all the rows are marked for insertion and zero for deletion, when there definitely are records that fall within that range.
I suspect that the Delete-If condition does not work the way I want it to. In that case, how do I implement deletion between a date range in a Data Flow with Delta as the destination?

You need to tell ADF what to do with the other portions of the timestamp (it's not a date type yet). Try this:
toString(toTimestamp('2021-12-22T01:49:57', 'yyyy-MM-dd\'T\'HH:mm:ss'), 'MM/dd/yyyy')
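Applied to the original Delete-If condition, a minimal sketch (assuming, as in the question, that the incoming date column is a string in MM/dd/yyyy format) would be:

```
between(
    toDate(date, 'MM/dd/yyyy'),
    toDate(toString(toTimestamp('2021-12-22T01:49:57', 'yyyy-MM-dd\'T\'HH:mm:ss'), 'MM/dd/yyyy'), 'MM/dd/yyyy'),
    toDate(toString(toTimestamp('2021-12-23T01:49:57', 'yyyy-MM-dd\'T\'HH:mm:ss'), 'MM/dd/yyyy'), 'MM/dd/yyyy')
)
```

The key point is that the literal T in the timestamp string has to be escaped (\'T\') in the format passed to toTimestamp; only then can the value be reduced to a date for the between comparison.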

Related

Upsert Option in ADF Copy Activity

With the "upsert option" , should I expect to see "0" as "Rows Written" in a copy activity result summary?
My situation is this: The source and sink table columns are not exactly the same but the Key columns to tell it how to know the write behavior are correct.
I have tested and made sure that it does actually do insert or update based on the data I give to it BUT what I don't understand is if I make ZERO changes and just keep running the pipeline , why does it not show "zero" in the Rows Written summary?
The main reason why rowsWritten is not shown as 0, even when the source and destination have the same data, is this:
Upsert inserts a row when the key column value is absent from the target table, and updates the values of the other columns whenever the key column value is found in the target table.
Hence, it modifies all matched records irrespective of whether the data changed. As with a plain SQL MERGE, there is no way to tell the copy activity that if an entire row already exists in the target table, that case should be ignored.
So, even when key_column matches, it is going to update the values of the rest of the columns, and each such row is counted as a row written. The following is an example of two cases.
Case 1: the rows of source and sink are the same. The rows present in both:
id,gname
1,Ana
2,Ceb
3,Topias
4,Jerax
6,Miracle
Every key matches, so all five rows are updated in place and counted as rows written, even though no value actually changed.
Case 2: inserting completely new rows. The rows present in the source (where the sink data is as above):
id,gname
8,Sumail
9,ATF
No key matches, so both rows are inserted and counted as rows written.
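For comparison, here is a rough T-SQL sketch of the behaviour described above, using hypothetical source and target tables with the id/gname columns from the example. The plain WHEN MATCHED branch updates every matched row, which is effectively what the copy activity's upsert does:

```sql
MERGE INTO target AS t
USING source AS s
    ON t.id = s.id
WHEN MATCHED THEN
    -- Runs for every matched key, even if gname is unchanged,
    -- so each matched row counts as a row written.
    UPDATE SET t.gname = s.gname
WHEN NOT MATCHED BY TARGET THEN
    INSERT (id, gname) VALUES (s.id, s.gname);
```

In hand-written SQL you could add a predicate such as WHEN MATCHED AND t.gname <> s.gname to skip identical rows; the copy activity's upsert option exposes no equivalent setting.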

Output updated rows following Spark DML statements

When updating data in a SQL Server database, the affected records of UPDATE, INSERT, DELETE and MERGE statements can be retrieved by adding an OUTPUT clause.
This is particularly useful when there is a merge statement that retains some parts of the old record within the new, merged version (such as a PreviousVersion or PreviousDate type column).
OUTPUT allows that data to be carried forward into another process, as it returns the merged version of the record without having to query the target table again. This makes it possible to further process only the newly arrived data, including the updates produced by the merge, without having to execute a subsequent select on the target table (e.g. filtering on an UpdatedDate type column) or a join from the new data onto the updated target table. A sketch of the pattern is below.
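In T-SQL the pattern looks something like this sketch (the hypothetical NewData/TargetTable names mirror the example tables further down; inserted exposes the post-merge values):

```sql
MERGE INTO TargetTable AS t
USING NewData AS s
    ON t.ID = s.ID
WHEN MATCHED THEN
    -- Both right-hand sides read the pre-update values,
    -- so PreviousEnd receives the old End.
    UPDATE SET t.PreviousEnd = t.[End],
               t.[End]       = s.[End]
OUTPUT inserted.ID, inserted.Start, inserted.[End], inserted.PreviousEnd;
```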
Having looked through the documentation for Spark, I can't see any way of replicating this OUTPUT clause behaviour without an additional read of, or join onto, the target table. Is there a way of outputting only the updated records from a merge statement, and if not, what is the best way to achieve something similar?
An example of this logic would be something like:
New Data

| ID | Start      | End        |
|----|------------|------------|
| 1  | 2022-01-01 | 2022-08-01 |

Target Table

| ID | Start      | End        | PreviousEnd |
|----|------------|------------|-------------|
| 1  | 2022-01-01 | 2022-07-01 | 2022-06-01  |

(...plus many more data rows)
Merge Logic (pseudo)

when matched then update set
    target.End = source.End,
    target.PreviousEnd = target.End
output updated
Merge Output (just one data row)

| ID | Start      | End        | PreviousEnd |
|----|------------|------------|-------------|
| 1  | 2022-01-01 | 2022-08-01 | 2022-07-01  |
From this point, the output row can be used to (as an easy example) add the additional month of time (End - PreviousEnd) to a summary held somewhere else, without having to query into the larger target table a second time.
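In SQL Server itself, that last step could look like the following sketch: the same merge as above, but with its output captured into a table variable and applied to a hypothetical Summary table, so the large target table is never read again:

```sql
DECLARE @changes TABLE (ID int, [End] date, PreviousEnd date);

MERGE INTO TargetTable AS t
USING NewData AS s
    ON t.ID = s.ID
WHEN MATCHED THEN
    UPDATE SET t.PreviousEnd = t.[End],
               t.[End]       = s.[End]
OUTPUT inserted.ID, inserted.[End], inserted.PreviousEnd
INTO @changes (ID, [End], PreviousEnd);

-- Add the newly covered interval (End - PreviousEnd) to the summary.
UPDATE su
SET su.TotalDays = su.TotalDays + DATEDIFF(day, c.PreviousEnd, c.[End])
FROM Summary AS su
JOIN @changes AS c ON c.ID = su.ID;
```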

Incremental load without a date or primary key column using Azure Data Factory

I have a source, let's say a SQL Server or an Oracle database, and I want to pull the table data into an Azure SQL database. The problem is that I don't have a date column that tracks when data is inserted, nor a primary key column. So, is there any other way to perform this operation?
One way of doing it semi-incrementally is to partition the table by a fairly stable column in the source table; you can then use a mapping data flow to compare the partitions (this can be done with row counts, aggregations, hash values, etc.). On each load you store the comparison output in the partition metadata somewhere, to be able to compare against it the next time you load. That way you can reload only the partitions that changed since your last load. A sketch of the comparison query is shown below.
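A minimal T-SQL sketch of the comparison step, assuming a hypothetical, fairly stable PartitionCol column; you would persist one such result set per load and diff it against the previous run to find changed partitions:

```sql
SELECT PartitionCol,
       COUNT(*)                         AS RowCnt,
       CHECKSUM_AGG(BINARY_CHECKSUM(*)) AS PartitionChecksum
FROM dbo.SourceTable
GROUP BY PartitionCol;
```

The checksum functions are cheap but can, in rare cases, collide and miss a change; HASHBYTES over explicitly concatenated columns is a stronger (if slower) alternative.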

Criteria matching in cstore columnar storage

In columnar storage, analytical queries are generally faster than in a row store. But what happens when the records to be included in a query are filtered by criteria, as in:
select sum(A.a) from A where A.b > 100 and A.c <= 10;
How does columnar storage handle filtering when the columns are stored separately? Also, how does it apply joins across various tables?
cstore_fdw uses block range filters (min/max values) for each column block. It first checks that a block's range is compatible with the filter before reading the column data. Therefore, if the distribution of your data along the filtered column allows whole blocks to be skipped, you get significant performance gains.
Regarding joins, cstore_fdw does not perform any join operation itself. It composes rows of data and forwards them to the PostgreSQL engine for further processing. That further processing might be anything: aggregation, window function processing, or a join operation.
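For illustration, a minimal cstore_fdw setup for the table in the question might look like the sketch below (the column types are assumptions). Loading the data pre-sorted on the filtered column tightens each block's min/max range and therefore improves block skipping:

```sql
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

-- Hypothetical definition of the table A from the question.
CREATE FOREIGN TABLE a (
    a bigint,
    b int,
    c int
) SERVER cstore_server
  OPTIONS (compression 'pglz');

-- The per-block min/max filters let this query skip any block
-- whose range of b or c cannot satisfy the predicates.
SELECT sum(a) FROM a WHERE b > 100 AND c <= 10;
```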

How to invert a merge query in power query

I have a single-column table of customer account numbers and a main table containing 400,000 records pulled from an Access database. I want to remove all records from the main table where the customer account number can be found in the single-column table.
The merge query capability in Power Query allows me to return only the records where there is a match on the customer list (in addition to a variety of other variations on this theme), but I would like to know whether there is a way to invert this, so that I return all records where the customer number does not appear in that list.
I have achieved this already by using the List.Contains function, adding a custom column to identify the rows to exclude and then filtering them out, but I think this is severely impacting the performance of my workbook. Refreshing the table that initially has 400,000 rows takes a very long time, and all queries that depend on this table then also take a long time to refresh.
Thank you
If you do a Left Anti Join of your table with the single-column table, this will give you your table filtered to only the rows that do not have a match in that single column. For example:
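A minimal M sketch, assuming your main query is named MainTable, the single-column exclusion query is named CustomerList, and both use a column named AccountNumber:

```
let
    Source = MainTable,
    AntiJoined = Table.NestedJoin(
        Source, {"AccountNumber"},
        CustomerList, {"AccountNumber"},
        "Matches", JoinKind.LeftAnti
    ),
    // The join adds a nested-table column that is empty by definition
    // of the anti join, so it can be dropped straight away.
    Result = Table.RemoveColumns(AntiJoined, {"Matches"})
in
    Result
```

Because the anti join is a single native step, it should refresh considerably faster than adding a List.Contains custom column and filtering on it.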
