Azure Data Factory: merge two files before inserting into DB

We have two files stored in Blob Storage as shown below: one is ^-delimited and the other is a comma-separated txt file.
File1 fields are like:
ItemId^Name^c1^type^count^code^Avail^status^Ready
File2 fields are like:
ItemId,Num,c2
Here the first column in both files is the key, and based on it I need to insert them into one table in the Azure DB using Azure Data Factory. Can anyone suggest how this can be done in ADF? Should we merge the two files into one file before inserting into the database?
The Azure DB columns are:
ItemId Name c1 type count code Avail status Ready Num c2
So it should look like this:
Item1 ABC(S) 1234 Toy 10 N N/A POOL N/A 19 EM
Item2 DEF(S) 5678 toy 7 X N/A POOL N/A 6 MP
I was referring to this question: Merging two or more files from a storage account based on a column using Azure Data Factory, but couldn't understand whether we can merge the two files before inserting into the DB.

You can use the two files to create two datasets, use a Join transformation to join them together, and simply sink to the SQL table in a data flow.
Here an inner join is used; you can adapt it to the join type you prefer.
In the data preview you can see that the join successfully merged the two files/data sources.
Adjust the field mapping in the Sink if needed.
Here is the arrow-separated.csv I used:
ItemId^Name^c1^type^count^code^Avail^status^Ready
Item1^ABC(S)^1234^Toy^10^N^N/A^POOL^N/A
Item2^DEF(S)^5678^toy^7^X^N/A^POOL^N/A
Here is the comma-separated.csv I used:
ItemId,Num,c2
Item1,19,EM
Item2,6,MP
Result in DB: the merged rows land in the Azure SQL table as expected.
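If you would rather land both files in staging tables first and do the merge in SQL instead of a Data Flow join, a minimal sketch of the equivalent logic, assuming hypothetical staging tables stg_file1 and stg_file2 and a target table dbo.Items:

-- Hypothetical staging tables, loaded 1:1 from the two files:
--   stg_file1(ItemId, Name, c1, type, count, code, Avail, status, Ready)
--   stg_file2(ItemId, Num, c2)
INSERT INTO dbo.Items (ItemId, Name, c1, type, [count], code, Avail, status, Ready, Num, c2)
SELECT f1.ItemId, f1.Name, f1.c1, f1.type, f1.[count], f1.code,
       f1.Avail, f1.status, f1.Ready, f2.Num, f2.c2
FROM stg_file1 f1
INNER JOIN stg_file2 f2
    ON f2.ItemId = f1.ItemId;   -- ItemId is the key shared by both files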

Related

Transforming CSV column names in Azure Synapse during implicit copy

I am making a data pipeline in Azure Synapse. I want to copy a 500 GB CSV file from a Blob container and convert it into an Azure Data Lake Storage Gen2 table. Before I copy it into the table, I want to make some changes to the data using a Data Flow block, to change some column names and apply other transformations.
Is it possible to copy the data and make the transformations implicitly, without a staging Parquet store?
If yes, how do I make the transformations implicitly? Ex: remove dashes ("-") from all column names?
You can use rule-based mapping in the Select transformation to remove the hyphen symbol from all the column names.
Matching condition: true()
Output column name expression: replace($$, '-','')
Select output:
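For comparison, if the data were staged instead, the same rename could be expressed explicitly in SQL with column aliases; a small sketch with hypothetical hyphenated column names and table name:

-- Hypothetical source columns [order-id] and [customer-name] in dbo.staged_csv.
SELECT [order-id]      AS orderid,
       [customer-name] AS customername
FROM dbo.staged_csv;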

Daily incremental copying from Amazon S3 data into Amazon Redshift

I have an RDS database whose snapshot is taken every day and kept in an S3 bucket. I copy the RDS snapshot data from S3 to an Amazon Redshift database daily. I can use COPY to copy the tables, but instead of copying the whole table, I want to copy only the rows which were added since the last snapshot was taken (incremental copying).
For example, in RDS there is a table named "user" which looks like this on 25-05-2021:
id | username
1 | john
2 | cathy
When I run the data loader for the first time on 26-05-2021, it will copy these two rows into the Redshift table with the same name.
Now on 26-05-2021, the table in RDS looks like this:
id | username
1 | john
2 | cathy
3 | ola
4 | mike
When I run the data loader on 27-05-2021, instead of copying all four rows, I want to copy/take only the rows that have been newly added (id = 3 and id = 4), as I already have the other rows.
What is the best way of doing this incremental loading?
The COPY command will always load the entire table. However, you could create an External Table using Redshift Spectrum that accesses the files without loading them into Redshift. Then, you could construct a query that does an INSERT where the ID is greater than the last ID used in the Redshift table.
Perhaps I should explain it a bit more simply...
The table existing_table in Redshift already has rows up to id = 2.
CREATE EXTERNAL TABLE in_data to point at the files in S3 containing the data.
Then use INSERT INTO existing_table SELECT * FROM in_data WHERE id > (SELECT MAX(id) FROM existing_table);
In theory, this should only load the new rows into the table.
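A minimal sketch of that approach, assuming an external schema named spectrum (created beforehand with CREATE EXTERNAL SCHEMA against the Glue Data Catalog), comma-delimited snapshot files under s3://my-bucket/rds-snapshots/user/ (a made-up location), and a target table public."user":

-- External table pointing at the snapshot files in S3 (not loaded into Redshift).
CREATE EXTERNAL TABLE spectrum.in_data (
    id       INTEGER,
    username VARCHAR(100)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/rds-snapshots/user/';

-- Insert only the rows that are newer than what is already loaded.
INSERT INTO public."user"
SELECT id, username
FROM spectrum.in_data
WHERE id > (SELECT MAX(id) FROM public."user");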

Populate missing data points in azure dataflow

We are working on building an ETL pipeline using Azure Data Flows.
Our requirement is to fill in the missing data points (adding rows as required), with the data for each added row copied from the previous available data point (when sorted on the key columns).
Example -
If the input data is:
The output should be like this:
The rows highlighted in green have values copied from the previous available data point for the key columns (Name, year and period).
Any idea how I can achieve this in Azure Data Flow?
You can use the mapLoop() function to generate the years + quarters in one column, then a flatten transformation to get a table of years + quarters, and then left outer join that table to the original table.
You will have the resulting table with nulls for the missing quarters. Then use the fill-down technique to fill in the values (this only works for small data).
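The same spine-plus-fill-down logic, expressed as a rough SQL sketch purely for illustration, assuming a hypothetical table facts(name, yr, period, value) where some name/year/period combinations are missing:

-- 1) Build the full spine of quarters per name/year.
WITH periods AS (
    SELECT 1 AS period UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
),
spine AS (
    SELECT k.name, k.yr, p.period
    FROM (SELECT DISTINCT name, yr FROM facts) k
    CROSS JOIN periods p
)
-- 2) Left join the data to the spine; missing rows take their value
--    from the previous available data point ("fill down").
SELECT s.name, s.yr, s.period,
       COALESCE(f.value,
                (SELECT TOP 1 f2.value
                 FROM facts f2
                 WHERE f2.name = s.name
                   AND (f2.yr < s.yr OR (f2.yr = s.yr AND f2.period < s.period))
                 ORDER BY f2.yr DESC, f2.period DESC)) AS value
FROM spine s
LEFT JOIN facts f
    ON f.name = s.name AND f.yr = s.yr AND f.period = s.period;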

Azure Data Factory DYNAMICALLY partition a csv/txt file based on rowcount

I am using an Azure data flow to transform delimited files (csv/txt) to json. But I want to separate the output files dynamically based on a max row count of 5,000, because I will not know the row count every time. So if I have a csv file with 10,000 rows, the pipeline will output two equal json files, file1.json and file2.json. What is the best way to actually get the row count of my sources, and the correct number n of partitions based on that row count, within Azure Data Factory?
One way to achieve this is to use the mod or % operator.
To start with, set a surrogate key on the CSV file or use any sequential key in the data.
Add an aggregate step with a group-by clause that is your key % row count.
Set the aggregate function to collect().
Your output should now be an array of rows with the expected count in each.
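As an illustration of the bucketing idea (not the Data Flow implementation itself), a SQL sketch that assigns each row of a hypothetical table src, with a sequential id column, to a bucket of at most 5,000 rows:

-- Each bucket_id groups 5,000 consecutive rows (integer division);
-- a modulo instead would distribute rows round-robin across buckets.
SELECT src.*,
       (ROW_NUMBER() OVER (ORDER BY id) - 1) / 5000 AS bucket_id
FROM src;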
We can't specify a row count at which to split the csv file. The closest workaround is to specify the partitioning of the sink.
For example, I have a csv file containing 700 rows of data. I successfully copied it to two equal json files.
My source csv data in Blob storage:
Sink settings: each partition outputs a new file, json1.json and json2.json:
Optimize:
Partition operation: Set partition
Partition type: Dynamic partition
Number of partitions: 2 (split the csv data into 2 partitions)
Stored ranges in columns: id (split based on the id column)
Run the data flow and the csv file will be split into two json files, each containing 350 rows of data.
For your situation, with a 10,000-row csv file the pipeline will output two equal json files (each containing 5,000 rows of data).

How can I efficiently prevent duplicated rows in my facts table?

I have built a Data Factory pipeline which ETLs the data from a Data Lake into a data warehouse. I chose SCD type 1 for my dimensions.
My pipeline contains the following activities:
[Stored Procedure] Clear staging tables;
[Stored Procedure] Get the timestamp of the last successful update;
[U-SQL] Extract the dimension data from the filtered files (the ones that have been modified since the last successful update) in Azure Data Lake, transform it and output it in a csv file;
[Copy Data] Load the csv into a SQL datawarehouse staging dimension table;
[Stored Procedure] Merge the data from the staging table into the production table;
[U-SQL] Extract the fact data from the files (the ones that have been modified since the last successful update) in Azure Data Lake, transform it and output it in a csv file;
[Copy Data] Load the csv into a SQL datawarehouse fact table;
[Stored Procedure] Update the timestamp of the successful update.
The problem with this pipeline is that I end up with duplicated fact entries in my warehouse if I run the pipeline twice.
Question
How can I efficiently prevent duplicated rows in my facts table, considering all the unsupported features in Azure SQL Data Warehouse?
Update
I have read another piece of information regarding the indexing (and the statistics) of a warehouse and how they must be rebuilt after an update.
Considering that, the simplest thing I thought of was to apply the same principle to the facts as the one I am using for the dimensions. I can load all the new facts into a staging table, and then use an index on the fact table to include only the facts that do not exist (the facts can't be updated right now).
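A minimal sketch of that staging idea, with hypothetical table and column names (dbo.FactSales as the fact table, dbo.StageFact as the staging table, BK as the business key):

-- Insert only the staged fact rows whose business key is not
-- already present in the fact table.
INSERT INTO dbo.FactSales (BK, Col1, Col2)
SELECT s.BK, s.Col1, s.Col2
FROM dbo.StageFact s
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.FactSales f
    WHERE f.BK = s.BK
);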
Do the lifting in Azure SQL Data Warehouse ... your performance will improve dramatically, and your problem will go away.
How many rows are in your filtered files? If it is in the millions to tens of millions, I think you can probably avoid the filter at the data lake stage. The performance of Polybase + SQL should overcome the additional data volume.
If you can avoid the filter, use this logic and throw away the U-SQL processing:
Ingest files to staging table with suitable hash distribution
Take the latest version of each row (suitable for SCD1)
Merge stage to fact using a query like this:
BK = Business Key column/s. COLn = non-key columns
-- Get the latest row for each business key to eliminate duplicates.
create table stage2 with (heap, distribution = hash(bk)) as
select bk, col1, col2
from (
    select bk,
           col1,
           col2,
           row_number() over (partition by bk order by timestamp desc) as rownum
    from stage
) s
where rownum = 1;
-- Merge the stage into a copy of the dimension
create table dimension_copy with (heap, distribution = replicate) as
select s.bk,
       s.col1,
       s.col2
from stage2 s
where not exists (
    select 1
    from schema.dimension d
    where d.bk = s.bk)
union
select d.bk,
       case when s.bk is null then d.col1 else s.col1 end,
       case when s.bk is null then d.col2 else s.col2 end
from dimension d
left outer join stage2 s on s.bk = d.bk;

-- Switch the merged copy with the original
alter table dimension_copy switch to dimension with (truncate_target = on);

-- Force distribution of the replicated table across nodes
select top 1 * from dimension;
