How can I efficiently prevent duplicated rows in my facts table? - azure

I have built a Data Factory pipeline that ETLs data from a Data Lake into a data warehouse. I chose SCD type 1 for my dimensions.
My pipeline contains the following activities:
[Stored Procedure] Clear staging tables;
[Stored Procedure] Get the timestamp of the last successful update;
[U-SQL] Extract the dimension data from the filtered files (the ones that have been modified since the last successful update) in Azure Data Lake, transform it and output it in a csv file;
[Copy Data] Load the csv into a SQL datawarehouse staging dimension table;
[Stored Procedure] Merge the data from the staging table into the production table;
[U-SQL] Extract the fact data from the files (the ones that have been modified since the last successful update) in Azure Data Lake, transform it and output it in a csv file;
[Copy Data] Load the csv into a SQL datawarehouse fact table;
[Stored Procedure] Update the timestamp of the successful update.
The problem with this pipeline is that I end up with duplicated fact entries in my warehouse if I run the pipeline twice.
Question
How can I efficiently prevent duplicated rows in my facts table, considering all the unsupported features in Azure SQL Data Warehouse?
Update
I have read more about the indexing (and the statistics) of a warehouse and how it must be rebuilt after an update.
Considering that, the simplest thing I could think of was to apply the same principle to the facts as the one I am using for the dimensions: load all the new facts into a staging table, then use the fact table's key to insert only the facts that do not already exist (the facts can't be updated right now).
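For example, something along these lines; it is only a sketch, and the table and column names (FactSales, StageFactSales, bk, col1, col2) are placeholders:
-- Append only the staged facts whose business key is not already in the fact table.
insert into FactSales (bk, col1, col2)
select s.bk,
       s.col1,
       s.col2
from StageFactSales s
where not exists (
    select 1
    from FactSales f
    where f.bk = s.bk
);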

Do the lifting in Azure SQL Data Warehouse ... your performance will improve dramatically, and your problem will go away.
How many rows are in your filtered files? If it is in the millions to tens of millions, I think you can probably avoid the filter at the data lake stage. The performance of Polybase + SQL should overcome the additional data volume.
If you can avoid the filter, use this logic and throw away the U-SQL processing:
Ingest the files into a staging table with a suitable hash distribution (a Polybase sketch of this step is shown after the merge query below)
Take the latest version of each row (suitable for SCD1)
Merge stage to fact using a query like this:
BK = Business Key column/s. COLn = non-key columns
-- Get the latest row for each business key to eliminate duplicates.
-- Note: the row_number filter has to live in a derived table because a
-- window function can't be referenced in the WHERE clause of the same query.
create table stage2
with (heap, distribution = hash(bk))
as
select bk, col1, col2
from (
    select bk,
           col1,
           col2,
           row_number() over (partition by bk order by [timestamp] desc) as rownum
    from stage
) latest
where rownum = 1;

-- Merge the stage into a copy of the dimension.
create table dimension_copy
with (heap, distribution = replicate)
as
select s.bk,
       s.col1,
       s.col2
from stage2 s
where not exists (
    select 1
    from dimension d
    where d.bk = s.bk
)
union
select d.bk,
       case when s.bk is null then d.col1 else s.col1 end,
       case when s.bk is null then d.col2 else s.col2 end
from dimension d
left outer join stage2 s
    on s.bk = d.bk;

-- Switch the merged copy with the original.
alter table dimension_copy switch to dimension with (truncate_target = on);

-- Force distribution of the replicated table across nodes.
select top 1 * from dimension;
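For step 1, the Polybase ingest into the staging table could look roughly like this; the external data source, file format, lake path and column list are assumptions for the sketch rather than part of the original answer:
-- External table over the CSV files produced in the lake (assumed names).
create external table ext_stage_facts (
    bk          bigint,
    col1        varchar(100),
    col2        decimal(18, 2),
    [timestamp] datetime2
)
with (
    location    = '/facts/',          -- assumed folder in the data lake
    data_source = MyDataLakeSource,   -- assumed external data source
    file_format = CsvFileFormat      -- assumed external file format
);

-- Land the data in a hash-distributed staging table.
create table stage
with (heap, distribution = hash(bk))
as
select * from ext_stage_facts;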

Related

Delete data between date range in Data Flow

I have a Data Flow that reads from Parquet files, does some filtering, and then loads into a Delta Lake. The data flow will run multiple times and I don't want duplicate data in my Delta Lake. To safeguard against this, I thought of implementing a delete-insert mechanism: find the minimum and maximum date of the incoming data and delete all the data in the destination (Delta) that falls within this range. Once deleted, all filtered incoming data would be inserted into the Delta Lake.
From the documentation, I saw that I need to add row-level policies in an Alter Row transformation to mark a particular row for deletion. I added the Delete-If condition as between(toDate(date, 'MM/dd/yyyy'), toDate("2021-12-22T01:49:57", 'MM/dd/yyyy'), toDate("2021-12-23T01:49:57", 'MM/dd/yyyy')), where date is a column in the incoming data.
However, in the data preview of the Alter Row transformation, all the rows are marked for insertion and 0 for deletion, when there definitely are records that fall within that range.
I suspect that the Delete-If condition does not work the way I want it to. In that case, how do I implement deletion within a date range in a Data Flow with Delta as the destination?
You need to tell ADF what to do with the other portions of the timestamp (it's not a date type yet). Also, the literal T has to stay inside the quoted format string rather than terminate it, so wrap the format in double quotes. Try this:
toString(toTimestamp('2021-12-22T01:49:57', "yyyy-MM-dd'T'HH:mm:ss"), 'MM/dd/yyyy')

Populate missing data points in azure dataflow

We are working on building an ETL pipeline using Azure data flows.
Our requirement is to fill in the missing data points (adding rows as required), with their data copied from the previous available data point (when sorted on the key columns).
Example -
If the input data is :
The output should be like this:
The rows highlighted in green have values copied from the previous available key columns (Name, year and period).
Any idea how I can achieve this in Azure Data Flow?
You can use the mapLoop function to generate the years + quarters in one column, then a flatten transformation to turn that into a table of years + quarters, and then left outer join that table to the original table.
The resulting table will have nulls for the missing quarters; then use the fill-down technique to fill in the values (this only works for small data). A rough SQL equivalent of the same logic is sketched below.
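For illustration only, here is roughly the same generate-the-grid-and-fill-down logic expressed in T-SQL rather than in a data flow; the table and column names (src, Name, Yr, Qtr, Val) are assumed:
-- Build every Name x Year x Quarter combination, then take the actual value
-- when it exists and otherwise fall back to the most recent earlier point.
with grid as (
    select n.Name, y.Yr, q.Qtr
    from (select distinct Name from src) n
    cross join (select distinct Yr from src) y
    cross join (values (1), (2), (3), (4)) q(Qtr)
)
select g.Name,
       g.Yr,
       g.Qtr,
       coalesce(s.Val, prev.Val) as Val        -- fill down when the point is missing
from grid g
left join src s
    on s.Name = g.Name and s.Yr = g.Yr and s.Qtr = g.Qtr
outer apply (
    select top 1 p.Val
    from src p
    where p.Name = g.Name
      and (p.Yr < g.Yr or (p.Yr = g.Yr and p.Qtr < g.Qtr))
    order by p.Yr desc, p.Qtr desc
) prev;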

Incremental load without date or primary key column using azure data factory

I have a source, let's say a SQL DB or an Oracle database, and I want to pull the table data into an Azure SQL database. But the problem is I don't have a date column that indicates when data is inserted, nor a primary key column. So is there any other way to perform this operation?
One way of doing it semi-incrementally is to partition the table by a fairly stable column in the source table; then you can use a mapping data flow to compare the partitions (this can be done with row counts, aggregations, hashbytes, etc.). On each load you store the comparison output in the partition metadata somewhere, so you can compare it again the next time you load. That way you can reload only the partitions that changed since your last load.
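As a sketch of the comparison step for a SQL Server source (the table name dbo.my_table and partition column region are assumptions), a per-partition fingerprint could be computed like this and stored for the next run:
-- One row per partition: row count plus an order-independent checksum of all columns.
select region,
       count_big(*)              as row_cnt,
       checksum_agg(checksum(*)) as partition_checksum
from dbo.my_table
group by region;
CHECKSUM is cheap but can miss some changes; hashing the concatenated columns with HASHBYTES is stricter at a higher cost.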

implementing scd type 2 as a generalized stored procedure in azure data warehouse for all dimensions?

I am new to Azure and I am working on Azure Data Warehouse. I have loaded a few dimensions and staging tables. I want to implement SCD type 2 as a generalised procedure for all the updates, using hashbytes. As we know, ADW doesn't support MERGE, so I am trying to implement this with normal insert and update statements and startdate/enddate columns. But the schemas of the dimension tables are not exactly the same as those of the staging tables; there are a few columns that are not considered.
Initially I thought I would pass in the staging and dimension tables as parameters, fetch the schema from the sys objects, create a temp table, load the necessary columns from stage, compute a hashbyte, and compare the hashbytes between the temp and dimension tables. But is this a good approach?
PS: One more problem is that sometimes the column names are mapped differently, like branchid vs branch_id. How do I fetch the columns in these cases? Note that this is just one case, and it could happen in many tables as well.
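For reference, a minimal sketch of the hash-compare pattern described above for a single dimension; the names (stg.customer, dbo.dim_customer, key customer_id, attributes col1/col2) are assumptions, and a generalised procedure would generate these statements dynamically from sys.columns. Note that older Azure SQL DW releases also restricted ANSI joins in UPDATE/DELETE, so on that service the update may need rewriting (or a CTAS-based swap like the accepted answer above):
-- Expire the current version of rows whose attributes changed.
update d
set d.enddate = getdate()
from dbo.dim_customer d
join stg.customer s
    on s.customer_id = d.customer_id
where d.enddate is null
  and hashbytes('SHA2_256', concat(s.col1, '|', s.col2))
      <> hashbytes('SHA2_256', concat(d.col1, '|', d.col2));

-- Insert new and changed rows as the current version.
insert into dbo.dim_customer (customer_id, col1, col2, startdate, enddate)
select s.customer_id, s.col1, s.col2, getdate(), null
from stg.customer s
left join dbo.dim_customer d
    on d.customer_id = s.customer_id
   and d.enddate is null
where d.customer_id is null
   or hashbytes('SHA2_256', concat(s.col1, '|', s.col2))
      <> hashbytes('SHA2_256', concat(d.col1, '|', d.col2));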

Criteria matching in cstore columnar storage

In columnar storage, analytical queries will generally be faster than in a row store. But what if the records to be included in a query are filtered by a criterion?
select sum(A.a) from A where A.b > 100 and A.c <= 10;
How does columnar storage handle filtering when the columns are stored separately? Also, how does it apply joins across various tables?
cstore_fdw uses block range filters for each column block. It first checks whether the block's value range is compatible with the filter before reading the column data. Therefore, if the data distribution along the filtered column allows whole blocks to be skipped, you get significant performance gains.
Regarding joins, cstore_fdw does not perform any operation itself. It composes rows of data and forwards them to the Postgres engine for further processing, which might be anything like aggregation, window function processing, or a join operation.
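For context, a cstore_fdw table is declared as a foreign table, and the block and stripe sizes that those min/max filters operate on are per-table options; the setup below mirrors the columns of the query above, with illustrative option values:
-- One-time setup of the extension and server.
create extension cstore_fdw;
create server cstore_server foreign data wrapper cstore_fdw;

-- Columnar table; each block carries min/max metadata used to skip blocks on filters.
create foreign table a (
    a int,
    b int,
    c int
)
server cstore_server
options (
    compression 'pglz',          -- per-block compression
    block_row_count '10000',     -- rows per block, the unit of skipping
    stripe_row_count '150000'    -- rows per stripe
);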
