I have an Azure Data Factory requirement. There are 50 csv files and each file is named like Product, Department, Employee, Sales, etc. Each of these files has a unique number of columns. In Azure SQL Database, I have 50 tables like Product, Department, Employee, Sales, etc. The columns of each table match with its corresponding file. Every day, I receive a new set of files in an Azure Data Lake Storage Gen 2 folder at 11 PM CST. At 12:05 AM CST, each of these files should be loaded into its respective table.
There should be only one pipeline or there can be 2 pipelines where the parent pipeline collects the metadata of the file and supplies it to the child pipeline which does the data load. It should find the files with the timestamp of the previous day and then it should loop through these files and load them into its respective target table, one by one. Can someone briefly explain the Activities and Transformations I need to use to fulfil this requirement.
I am new to ADF. I haven't tried anything so far.
Each of these files has a unique number of columns. In Azure SQL Database, I have 50 tables like Product, Department, Employee, Sales, etc. The columns of each table match with its corresponding file.
As you have same columns for both source and target and same names for files and tables. The below process will work for you if you have same schema for both.
First Use Get Meta data activity for the source folder to get the files list.
To get latest uploaded files, use Filter by last modified option in the Get meta data. This option only supports UTC time format and CST equals UTC-6. Give the Start time and End time as per your requirement by cross checking both timezones. Use appropriate Date time functions for it.
For sample I have given like below.
which will give the result array like this.
Give this ChildItems array #activity('Get Metadata1').output.childItems to a ForEach activity. Inside ForEach use a copy activity to copy each file iteration wise.
Create another Source dataset and create a dataset parameter (Here sourcefiename) and give it like below.
Give this to copy activity source and assign #item().name for the parameter value.
In sink, create a Database dataset with two dataset parameters schema and table_name. Use these like below.
from #item().name extract the file name other '.csv' text by using split and give that to the above parameter.
#split(item().name,'.c')[0]
Now, schedule this pipeline as per your time zone using schedule trigger.
Related
I have few set of monthly files dropping in my data lake folder and I want to copy them to a different folder in the data lake and while copying the data to the target data lake folder, I want to create a folder in the format YYYY-MM (Ex: 2022-11) and I want to copy the files inside this folder.
And again in the next month I will get new set of data and I want to copy them to (2022-12) folder and so on.
I want to run the pipeline every month because we will get monthly load of data.
As you want to copy only the new files using the ADF every month,
This can be done in two ways.
First will be using a Storage event trigger.
Demo:
I have created pipeline parameters like below for new file names.
Next create a storage event trigger and give the #triggerBody().fileName for the pipeline parameters.
Parameters:
Here I have used two parameters for better understanding. If you want, you can do it with single pipeline parameter also.
Source dataset with dataset parameter for filename:
Sink dataset with dataset parameter for Folder name and filename:
Copy source:
Copy sink:
Expression for foldername: #formatDateTime(utcnow(),'yyyy-MM')
File copied to required folder successfully when I uploaded to source folder.
So, every time a new file uploaded to your folder it gets copied to the required folder. If you don't want the file to be exist after copy, use delete activity to delete source file after copy activity.
NOTE: Make sure you publish all the changes before triggering the pipeline.
Second method can be using Get Meta data activity and ForEach and copy activity inside ForEach.
Use schedule trigger for this for every month.
First use Get Meta data(use another source dataset and give the path only till folder) to get the child Items and in the filter by Last modified of Meta data activity give your month starting date in UTC(use dynamic content utcnow() and FormatDatetime() for correct format).
Now you will get all the list of child Items array which have last modified date as this month. Give this array to ForEach and inside ForEach use copy activity.
In copy activity source, use dataset parameter for file name (same as above) and give #item().name.
In copy activity sink, use two dataset parameters, one for Folder name and another for file name.
In Folder name give the same dynamic content for yyyy-MM format as above and for file name give as #item().name.
I'm trying to create a dynamic copy activity which uses a GetMededata and a filter activity within data factory to do incremental loads.
The file path I'm trying to copy looks like :
2022/08/01
It goes from year folder - month folder - to days of the month folder and each day folder containing several files:
I need get metadata and filter activity to read the file, however the output directory on the filter activity foes not have the desired path.
The variable I've currently is set is dynamically expressed
#concat(variables('v_BlobSourceDirectory'), variables('v_Set_date'))
filter activity
#activity('Get Source').output.childItems
get metadata activity
#and(equals(item().type,'Folder'),endswith(item().name,variables('v_Set_Date')))
I have reproduced ADLSGen2 file path 2022/09/21 format added sample files.
Using Metadata activity all the child items has been pulled and given has input to foreach activity.
Added Dynamic values for month and days.
Every childitem is iterated through foreach activity.
for each activity output is given to copy.
**Result:**Every childitem is copied and stored in Dynamicoutput1
By using ADF we unloaded data from on-premise sql server to datalake folder in single parquet for full load.
Then in delta load we are keeping in current day's folder yyyy/mm/dd structur going forward.
But i want full load file also separate it by their respective transaction day's folder.ex: in full load file we have 3 years data. i want data split it by transaction day wise in each separate folder. like 2019/01/01..2019/01/02 ..2020/01/01 instead of single file.
is there way to achieve this in ADF or while unloading itself can we get this folder structure for full load?
Hi#Kumar AK After a period of exploration, I found the answer. I think we need to use Azure data flow to achieve that.
My source file is a csv file, which contains transaction_date column.
Set this csv as the source in data flow.
In DerivedColumn1 activity, we can generate a new column FolderName via column transaction_date. FolderName will be used as a folder structure.
In sink1 activity, select Name file as column data as File Name option, select FolderName column as Column data.
That's all. These rows of the csv file will be split into files in different folders. The debug result is as follows, :
I have a requirement where in I have a source file containing the Table Name(s) in Mapping Data Flow. Based on the Table Name in the file - there needs to be a dynamic query where column metadata, along with some other properties is retrieved from the data dictionary tables and inserted into a different sink table. The table name from the file would be used as a where condition filter.
Since there can be multiple tables listed in the input file (lets assume its a csv with only one column containing the table names), if we decide to use a cache sink for the file :
Is it possible to use the results of that cached sink in the Source transformation query in the same mapping data flow - as a lookup (from where the column metadata is being retrieved) and if Yes, how
What would be the best way to restrict data from the metadata table query based on this table name
Though of alternatively achieving this with a pipeline using For Each passing the table name as parameter to data flow, but in this case if there are 100 tables in the file, there would be 100 iterations and 100 times the cluster would need to be spun up. Please advise if this is wronf or there are better ways to achieve this
You would need to use option 3. Loop through the table names and pass each in as a parameter to the data flow to set the table name in the dataset.
ADF handles the cluster creation and teardown. All you have to worry about is whether you want to execute each sequentially or in parallel and how many. There are concurrency limits in ADF, so you should consider a batch count of 20 if you run in parallel.
I am wondering if it is possible to add a date column to each file uploaded.
For example each month a CSV is produced. I am wanted to add for example "December 2020" to each row and then for the next months upload add "January 2021" to every row in the CSV file. Before copying this into a SQL database.
e.g. file name "Latest Rating December 2020" I would want the 'December 2020' as a column and be the same value for all rows. The naming convention will be the same for each months upload.
Thanks
I've created a test to add a column to the csv file.
The result is as follows:
We can get file name via Child Items in Get MetaData activity.
The dataset is to the container in ADLS.
Then we can declare a variable FileName to store the file name via the expression #activity('Get Metadata1').output.childItems[0].name.
3.We can use additional column in Copy activity, and use the expression #concat(split(variables('FileName'),' ')[2],' ',split(variables('FileName'),' ')[3]) to
get the value we need. Note that the single quote contains a space.
In the dataset, we need key in a dynamic content #variables('FileName') to specify which file to be copied.
The sink is the same as source in Copy activity.
Then we can run debug to confirm it.
Here I think we also can copy into SQL table directly, when we set the sink to a sql table.