Split a file by transaction date through ADF - Azure

Using ADF, we unloaded data from an on-premises SQL Server to a data lake folder as a single Parquet file for the full load.
For delta loads, we write each day's data to the current day's folder in a yyyy/mm/dd structure going forward.
I also want the full-load file split into folders by transaction day. For example, the full-load file contains 3 years of data, and I want it split by transaction day into separate folders, such as 2019/01/01, 2019/01/02, ... 2020/01/01, instead of a single file.
Is there a way to achieve this in ADF, or can we produce this folder structure for the full load during the unload itself?

Hi @Kumar AK. After some exploration, I found an answer: I think we need to use a data flow in Azure Data Factory to achieve this.
My source file is a CSV file that contains a transaction_date column.
Set this CSV as the source in the data flow.
In the DerivedColumn1 activity, we can generate a new column, FolderName, from the transaction_date column. FolderName will be used as the folder structure.
In the sink1 activity, select "Name file as column data" as the File name option, and select the FolderName column as the Column data.
That's all. The rows of the CSV file will be split into files in different folders, which the debug run confirmed.
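As a rough illustration of what the derived column and sink do together, the following Python sketch groups CSV rows by a folder name derived from transaction_date. It is not how the data flow is implemented; the function names, the sample data, and the assumption that dates arrive as ISO strings (yyyy-MM-dd) are all hypothetical.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime


def folder_name(transaction_date: str) -> str:
    """Map an ISO date string (assumed format) to a yyyy/MM/dd folder path."""
    return datetime.strptime(transaction_date, "%Y-%m-%d").strftime("%Y/%m/%d")


def split_by_folder(csv_text: str) -> dict:
    """Group CSV rows by the folder derived from their transaction_date."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[folder_name(row["transaction_date"])].append(row)
    return dict(groups)


# Hypothetical sample data standing in for the full-load file.
sample = "id,transaction_date\n1,2019-01-01\n2,2019-01-02\n3,2019-01-01\n"
groups = split_by_folder(sample)
for folder, rows in sorted(groups.items()):
    print(folder, [r["id"] for r in rows])
```

Each key of the resulting dictionary plays the role of the FolderName column value, and each list holds the rows that would land in that folder.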

Related

Create a folder based on date (YYYY-MM) using Data Factory?

I have a few sets of monthly files dropping into my data lake folder, and I want to copy them to a different folder in the data lake. While copying the data to the target folder, I want to create a folder in the format YYYY-MM (e.g. 2022-11) and copy the files into it.
The next month I will get a new set of data, which I want to copy to the 2022-12 folder, and so on.
I want to run the pipeline every month because we get a monthly load of data.
Since you want to copy only the new files with ADF every month, this can be done in two ways.
The first is to use a storage event trigger.
Demo:
I created pipeline parameters for the new file names, as below.
Next, create a storage event trigger and pass @triggerBody().fileName to the pipeline parameters.
Parameters:
Here I have used two parameters for clarity. If you prefer, you can do it with a single pipeline parameter.
Source dataset with dataset parameter for filename:
Sink dataset with dataset parameter for Folder name and filename:
Copy source:
Copy sink:
Expression for the folder name: @formatDateTime(utcnow(),'yyyy-MM')
The file was copied to the required folder successfully when I uploaded it to the source folder.
So, every time a new file is uploaded to your folder, it gets copied to the required folder. If you don't want the file to remain in the source after the copy, use a Delete activity after the Copy activity to delete the source file.
NOTE: Make sure you publish all the changes before triggering the pipeline.
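To make the folder-name expression concrete, here is a small Python sketch of the equivalent logic: format the current UTC time as yyyy-MM and prepend it to the incoming file name. The helper name sink_path and the sample file name are hypothetical; in the pipeline this is done by the dataset parameters and the expression above.

```python
from datetime import datetime, timezone


def sink_path(file_name: str, now: datetime) -> str:
    """Build '<yyyy-MM>/<file_name>', mirroring formatDateTime(utcnow(),'yyyy-MM')."""
    folder = now.strftime("%Y-%m")
    return f"{folder}/{file_name}"


# Fixed timestamp for a reproducible example.
print(sink_path("sales.csv", datetime(2022, 11, 15, tzinfo=timezone.utc)))
# Current UTC time, as the pipeline would use at run time.
print(sink_path("sales.csv", datetime.now(timezone.utc)))
```

Because the folder is derived from the run time rather than from the file, a file uploaded late (for example on the first day of the next month) would land in the next month's folder.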
The second method uses a Get Metadata activity, a ForEach activity, and a Copy activity inside the ForEach.
Use a schedule trigger that runs every month.
First, use Get Metadata (with another source dataset whose path points only to the folder) to get the child items. In the activity's Filter by last modified setting, give the start date of your month in UTC (use the dynamic content utcnow() and formatDateTime() to get the correct format).
You will then get an array of child items whose last modified date falls in this month. Pass this array to the ForEach, and inside the ForEach use a Copy activity.
In the Copy activity source, use a dataset parameter for the file name (same as above) and set it to @item().name.
In the Copy activity sink, use two dataset parameters, one for the folder name and another for the file name.
For the folder name, use the same dynamic content for the yyyy-MM format as above, and for the file name use @item().name.
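The second method can be sketched in Python as a filter followed by a loop. This is only an illustration of the logic, not ADF code: in the real pipeline the last-modified filter is applied by the Get Metadata activity settings, so the lastModified field on each item below is a hypothetical stand-in, as are the function names and sample items.

```python
from datetime import datetime, timezone


def month_start(now: datetime) -> datetime:
    """Return midnight UTC on the first day of the month containing `now`."""
    return now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


def plan_copies(items: list, now: datetime) -> list:
    """Return (source name, sink path) pairs for files modified this month."""
    start = month_start(now)
    folder = now.strftime("%Y-%m")
    return [
        (item["name"], f"{folder}/{item['name']}")
        for item in items
        if item["type"] == "File" and item["lastModified"] >= start
    ]


utc = timezone.utc
# Hypothetical folder listing; lastModified is an assumed field for illustration.
items = [
    {"name": "a.csv", "type": "File", "lastModified": datetime(2022, 11, 3, tzinfo=utc)},
    {"name": "old.csv", "type": "File", "lastModified": datetime(2022, 10, 28, tzinfo=utc)},
    {"name": "archive", "type": "Folder", "lastModified": datetime(2022, 11, 5, tzinfo=utc)},
]
plan = plan_copies(items, datetime(2022, 11, 20, tzinfo=utc))
print(plan)
```

Each pair in the plan corresponds to one iteration of the ForEach: the first element is what @item().name supplies to the source, and the second is the folder and file name supplied to the sink.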

Add file name to Copy activity in Azure Data Factory

I want to copy data from a CSV file (source) on Blob storage to an Azure SQL Database table (sink) with a regular Copy activity, but I also want to copy the file name alongside every entry in the table. I am new to ADF, so the solution is probably easy, but I have not been able to find the answer in the documentation or on the internet so far.
My mapping currently looks like this (I have created an output table with a file name column, but this data is not defined as a column in the CSV file, so I need to extract it from the metadata and map it to the column):
At first, I thought I could put dynamic content in the mapping and solve the problem that way. But there is no option to use dynamic content in each individual box, so I do not know how to implement the solution. My next thought was to use a Pre-copy script, but I have not seen how I could use it for this purpose. What is the best way to solve this issue?
In the Mapping columns of the Copy activity, you cannot add the dynamic content of metadata.
First, give the source CSV dataset to a Get Metadata activity, then connect it to the Copy activity as below.
You can then add the file name column through the Additional columns setting in the Copy activity source itself, using the dynamic content of the Get Metadata activity and the same source CSV dataset.
@activity('Get Metadata1').output.itemName
If you are sure about the data types of your data, there is no need to go to the mapping; you can execute your pipeline.
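The effect of the Additional columns setting can be pictured with a short Python sketch: every row read from the CSV gets one extra field holding the file name. The function name, the column name file_name, and the sample rows are hypothetical and only illustrate the outcome, not how the Copy activity works internally.

```python
import csv
import io


def add_file_name_column(csv_text: str, file_name: str, column: str = "file_name") -> list:
    """Return the CSV rows as dicts, each extended with the source file name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{**row, column: file_name} for row in reader]


# Hypothetical contents standing in for samplecsv.csv.
rows = add_file_name_column("id,value\n1,a\n2,b\n", "samplecsv.csv")
for row in rows:
    print(row)
```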
Here I am copying the contents of the samplecsv.csv file to a SQL table named output.
My output for your reference:

Add column to CSV File from another CSV File (Azure Data Factory)

For example:
Persons.csv
name, last_name
-----------------------
jack, jack_lastName
luc, luc_lastname
FileExample.csv
id
243
123
Result:
name, last_name, exampleId
-------------------------------
jack, jack_lastName, 243
luc, luc_lastname, 123
I want to add any number of columns from another data source and then insert the final result into a file or a database table.
I have tried many approaches, but I can't get it to work.
You can try the Merge files option in an Azure Data Factory pipeline to merge the two CSV files.
Select the Copy data activity and, in the source, use the wildcard entry *.csv to pick up the CSV files in storage (configure a linked service from the storage account to ADF as part of this).
Then create an output CSV in the same container if required; in my case, I merged the files and stored the result under the name examplemerge.csv.
Check the First row as header option.
Validate and debug.
You should then see the merged file in the output folder.
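For intuition about what merging files produces, here is a minimal Python sketch that concatenates the rows of several CSV texts under one header. It assumes every input has the same header row, and the function name and sample inputs are hypothetical; it is not a description of how ADF performs the merge.

```python
import csv
import io


def merge_csv_texts(texts: list) -> str:
    """Concatenate CSV texts that share a header, writing the header only once."""
    out = io.StringIO()
    writer = None
    for text in texts:
        reader = csv.reader(io.StringIO(text))
        header = next(reader, None)
        if header is None:
            continue  # skip empty inputs
        if writer is None:
            writer = csv.writer(out, lineterminator="\n")
            writer.writerow(header)
        writer.writerows(reader)
    return out.getvalue()


merged = merge_csv_texts([
    "name,last_name\njack,jack_lastName\n",
    "name,last_name\nluc,luc_lastname\n",
])
print(merged)
```

Merging in this sense stacks rows from the inputs; it does not place columns from different files side by side.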
You can check this reference vlog, Merge Multiple CSV files to single CSV, for more details, and also this vlog on Load Multiple CSV Files to a Table in Azure Data Factory if required.
But if you want to join the files, there must be a common column to join on.
Also check this Q&A thread: Azure Data Factory merge 2 csv files with different schema

Create list of files in Azure Storage and send it to sql table using ADF

I need to copy the file names of Excel files stored as blobs in my Azure Storage and put these names into a SQL Server table using ADF. The file path would also do as the name. The hardest part is that the dataset that takes all the files from one specific folder requires me to select a sheet name, and the sheet names differ for each file, so it returns an error. Is there a way to create a collective dataset without specifying the sheet name?
So, if I understand your question correctly, you are looking for a way to write all Excel file names to a SQL database using ADF.
You can use the generic Get Metadata activity with a binary dataset as the source. Select Child items as the field to retrieve. This will retrieve all files in the folder. Then add a filter to select only the Excel file types.
Hope that this gets you on the right track.
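The filter step can be sketched in Python as follows. The child items are modeled as dictionaries with name and type fields; the function name, the sample listing, and the choice of .xlsx and .xls as the Excel extensions are assumptions for illustration.

```python
def excel_file_names(child_items: list) -> list:
    """Keep only the names of files whose extension looks like an Excel file."""
    extensions = (".xlsx", ".xls")  # assumed set of Excel extensions
    return [
        item["name"]
        for item in child_items
        if item["type"] == "File" and item["name"].lower().endswith(extensions)
    ]


# Hypothetical folder listing.
child_items = [
    {"name": "report.xlsx", "type": "File"},
    {"name": "notes.txt", "type": "File"},
    {"name": "Budget.XLS", "type": "File"},
    {"name": "archive", "type": "Folder"},
]
names = excel_file_names(child_items)
print(names)
```

Since only names are read and no workbook is opened, no sheet name is needed at any point.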

Add a date column in Azure Data Factory

I am wondering if it is possible to add a date column to each file uploaded.
For example, a CSV is produced each month. I want to add, say, "December 2020" to each row, and then for the next month's upload add "January 2021" to every row in the CSV file, before copying this into a SQL database.
E.g. for the file name "Latest Rating December 2020", I would want 'December 2020' as a column with the same value for all rows. The naming convention will be the same for each month's upload.
Thanks
I've created a test to add a column to the csv file.
The result is as follows:
We can get file name via Child Items in Get MetaData activity.
The dataset is to the container in ADLS.
Then we can declare a variable FileName to store the file name via the expression @activity('Get Metadata1').output.childItems[0].name.
We can use an additional column in the Copy activity, with the expression @concat(split(variables('FileName'),' ')[2],' ',split(variables('FileName'),' ')[3]) to get the value we need. Note that each pair of single quotes contains a space.
In the dataset, we need to enter the dynamic content @variables('FileName') to specify which file is to be copied.
The sink is the same as source in Copy activity.
Then we can run debug to confirm it.
I think we can also copy directly into a SQL table by setting the sink to a SQL table.
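The split-and-concat expression above can be checked with a short Python sketch. It assumes the file name is exactly as quoted in the question, with four space-separated tokens; the function name is hypothetical. If the stored name carries an extension such as .csv, the last token would include it, so the sketch strips the extension first.

```python
import os


def period_from_file_name(file_name: str) -> str:
    """Join the third and fourth space-separated tokens, e.g. 'December 2020'."""
    stem = os.path.splitext(file_name)[0]  # drop an extension such as .csv, if any
    parts = stem.split(" ")
    return parts[2] + " " + parts[3]


print(period_from_file_name("Latest Rating December 2020"))
print(period_from_file_name("Latest Rating January 2021.csv"))
```

The index positions 2 and 3 depend on the naming convention staying fixed; a name with a different number of words before the month would need different indexes.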
