I am wondering if it is possible to add a date column to each file uploaded.
For example, each month a CSV is produced. I want to add, for example, "December 2020" to each row, and then for the next month's upload add "January 2021" to every row in the CSV file, before copying it into a SQL database.
E.g. for the file name "Latest Rating December 2020", I would want 'December 2020' as a column with the same value for all rows. The naming convention will be the same for each month's upload.
Thanks
I've created a test to add a column to the csv file.
The result is as follows:
1. We can get the file name via Child Items in the Get Metadata activity. The dataset points to the container in ADLS.
2. Then we can declare a variable FileName to store the file name via the expression @activity('Get Metadata1').output.childItems[0].name.
3. We can use an additional column in the Copy activity, with the expression @concat(split(variables('FileName'),' ')[2],' ',split(variables('FileName'),' ')[3]) to get the value we need. Note that the single quotes contain a space.
4. In the source dataset of the Copy activity, we need to key in the dynamic content @variables('FileName') to specify which file is to be copied.
The sink is the same as the source in the Copy activity. Then we can run a debug run to confirm it.
I think we could also copy into a SQL table directly, by setting the sink to a SQL table.
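As a worked illustration of the expression in step 3 (assuming the incoming file is named "Latest Rating December 2020.csv"; the exact name is an assumption): splitting on a space gives ['Latest','Rating','December','2020.csv'], so indexes [2] and [3] yield 'December' and '2020.csv'. If the extension should not end up in the column value, a variant that strips it first could look like:
@concat(split(replace(variables('FileName'),'.csv',''),' ')[2],' ',split(replace(variables('FileName'),'.csv',''),' ')[3])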
I have an Azure Data Factory requirement. There are 50 csv files and each file is named like Product, Department, Employee, Sales, etc. Each of these files has a unique number of columns. In Azure SQL Database, I have 50 tables like Product, Department, Employee, Sales, etc. The columns of each table match with its corresponding file. Every day, I receive a new set of files in an Azure Data Lake Storage Gen 2 folder at 11 PM CST. At 12:05 AM CST, each of these files should be loaded into its respective table.
There should be only one pipeline, or there can be two pipelines where the parent pipeline collects the metadata of the files and supplies it to the child pipeline, which does the data load. It should find the files with the previous day's timestamp, then loop through these files and load them into their respective target tables, one by one. Can someone briefly explain the Activities and Transformations I need to use to fulfil this requirement?
I am new to ADF. I haven't tried anything so far.
Each of these files has a unique number of columns. In Azure SQL Database, I have 50 tables like Product, Department, Employee, Sales, etc. The columns of each table match with its corresponding file.
As you have the same columns for both source and target and the same names for files and tables, the process below will work for you as long as the schemas match.
First, use a Get Metadata activity on the source folder to get the list of files.
To get the latest uploaded files, use the Filter by last modified option in Get Metadata. This option only supports UTC, and CST is UTC-6. Give the Start time and End time as per your requirement by cross-checking both time zones, using the appropriate date-time functions.
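For instance, a sketch using standard ADF date functions (adjust the offsets to match your exact CST cut-off):
Start time: @formatDateTime(addDays(startOfDay(utcnow()),-1),'yyyy-MM-ddTHH:mm:ssZ')
End time: @formatDateTime(utcnow(),'yyyy-MM-ddTHH:mm:ssZ')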
For a sample, I gave values like the below, which gives a result array like this.
Give this childItems array @activity('Get Metadata1').output.childItems to a ForEach activity. Inside the ForEach, use a Copy activity to copy each file per iteration.
Create another source dataset, create a dataset parameter (here, sourcefilename), and use it like below.
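In that dataset's file path, the file name field would then typically reference the parameter in the usual way (a sketch):
@dataset().sourcefilename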
Give this dataset to the Copy activity source and assign @item().name as the parameter value.
In the sink, create a database dataset with two dataset parameters, schema and table_name, and use them like below.
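For example, the table setting of the sink dataset would typically be built from these parameters (a sketch):
Schema: @dataset().schema
Table: @dataset().table_name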
From @item().name, extract the file name without the '.csv' text by using split, and give that to the above table_name parameter.
@split(item().name,'.csv')[0]
Now, schedule this pipeline as per your time zone using a schedule trigger.
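For reference, the recurrence of such a schedule trigger for a daily 12:05 AM CST run might look roughly like this (a sketch; the startTime value is a placeholder):
"recurrence": { "frequency": "Day", "interval": 1, "startTime": "2023-01-01T00:05:00", "timeZone": "Central Standard Time" }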
I have a few sets of monthly files dropping into my data lake folder, and I want to copy them to a different folder in the data lake. While copying the data to the target data lake folder, I want to create a folder in the format YYYY-MM (e.g. 2022-11) and copy the files inside this folder.
Then in the next month I will get a new set of data and I want to copy it to the 2022-12 folder, and so on.
I want to run the pipeline every month because we will get a monthly load of data.
As you want to copy only the new files using ADF every month, this can be done in two ways.
The first is to use a Storage event trigger.
Demo:
I have created pipeline parameters like below for new file names.
Next, create a storage event trigger and give @triggerBody().fileName for the pipeline parameters.
Parameters:
Here I have used two parameters for better understanding. If you want, you can do it with a single pipeline parameter also.
Source dataset with dataset parameter for filename:
Sink dataset with dataset parameter for Folder name and filename:
Copy source:
Copy sink:
Expression for foldername: @formatDateTime(utcnow(),'yyyy-MM')
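In the sink dataset, the folder and file parameters would then typically be combined in the file path like this (a sketch; 'target' and the parameter names are placeholders):
target / @dataset().foldername / @dataset().filename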
The file was copied to the required folder successfully when I uploaded it to the source folder.
So, every time a new file is uploaded to your folder, it gets copied to the required folder. If you don't want the file to remain in the source after the copy, use a Delete activity to delete the source file after the Copy activity.
NOTE: Make sure you publish all the changes before triggering the pipeline.
The second method uses a Get Metadata activity and a ForEach with a Copy activity inside the ForEach.
Use a schedule trigger for this, running every month.
First, use Get Metadata (with another source dataset whose path goes only up to the folder) to get the child items, and in the Filter by last modified of the Get Metadata activity give your month's starting date in UTC (use the dynamic content functions utcnow() and formatDateTime() to get the correct format).
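For example, the start of the window could be set with something like (a sketch):
@formatDateTime(startOfMonth(utcnow()),'yyyy-MM-ddTHH:mm:ssZ')
and the end of the window with @utcnow().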
Now you will get the array of child items whose last modified date falls in this month. Give this array to a ForEach, and inside the ForEach use a Copy activity.
In the Copy activity source, use a dataset parameter for the file name (same as above) and give @item().name.
In the Copy activity sink, use two dataset parameters, one for the folder name and another for the file name.
For the folder name, give the same dynamic content for the yyyy-MM format as above, and for the file name give @item().name.
I need to pick a timestamp from a column 'created on' in a CSV file in ADLS. Later I want to query Azure SQL DB, like delete from table where created on = 'timestamp', in ADF. Please help on how this could be achieved.
Here I reproduced this to fetch a selected row from the CSV in ADLS.
Create a Linked service and Dataset for the source file.
Read the data with a Lookup activity from the source path.
A ForEach activity iterates over the values from the output of the Lookup: @activity('Lookup1').output.value
Inside the ForEach activity, use an Append Variable activity to append the value from each ForEach item to an array variable.
The array index is then used to pick the required value.
Use a Script activity to run the query against the database.
DELETE FROM dbo.test_table WHERE Created_on = @{variables('Date_COL3')[4]}
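If Created_on is a string or datetime column, the interpolated value will usually need quoting inside the script (a sketch; adjust to the column's actual type):
DELETE FROM dbo.test_table WHERE Created_on = '@{variables('Date_COL3')[4]}';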
I have metadata in my Azure SQL DB / CSV file as below, which has the old column names and datatypes and the new column names.
I want to rename and change the data type of oldfieldname based on that metadata in ADF.
The idea is to store the metadata file in cache and use it in a lookup, but I am not able to do it in the data flow expression builder. Any idea which transformation to use or how I should do it?
I have reproduced the above and was able to change the column names and datatypes as below.
This is the sample CSV file I have taken from blob storage, which has the metadata of the table.
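For illustration, such a metadata file might contain columns like these (the values shown are hypothetical):
OldName,NewName,Newtype
empid,employee_id,varchar(20)
empname,employee_name,varchar(100)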
In your case, take care with the new data types, because if we don't give the correct types, it will generate an error due to the data already inside the table.
Create a dataset, give it to the Lookup, and do not check the First row only option.
This is my sample SQL table:
Give the Lookup output array to a ForEach.
Inside the ForEach, use a Script activity to execute the script that changes the column name and datatype.
Script:
EXEC sp_rename 'mytable2.@{item().OldName}', '@{item().NewName}', 'COLUMN';
ALTER TABLE mytable2
ALTER COLUMN @{item().NewName} @{item().Newtype};
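For a metadata row with, say, OldName = empid, NewName = employee_id and Newtype = varchar(20) (hypothetical values), the script would resolve to:
EXEC sp_rename 'mytable2.empid', 'employee_id', 'COLUMN';
ALTER TABLE mytable2 ALTER COLUMN employee_id varchar(20);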
Execute the pipeline, and below is my SQL table with the changes.
Using ADF, we unloaded data from an on-premises SQL Server to a data lake folder as a single parquet file for the full load.
Then, for the delta load, we are keeping data in the current day's folder in a yyyy/mm/dd structure going forward.
But I want the full load file also separated into the respective transaction day's folders. For example, the full load file has 3 years of data; I want that data split by transaction day into separate folders, like 2019/01/01, 2019/01/02, 2020/01/01, instead of a single file.
Is there a way to achieve this in ADF, or can we get this folder structure for the full load while unloading itself?
Hi @Kumar AK, after a period of exploration, I found the answer. I think we need to use an Azure data flow to achieve that.
My source file is a csv file, which contains a transaction_date column.
Set this csv as the source in the data flow.
In the DerivedColumn1 activity, we can generate a new column FolderName from the transaction_date column. FolderName will be used as the folder structure.
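For example, assuming transaction_date is (or can be converted to) a date, the derived column expression could be something like:
toString(toDate(transaction_date),'yyyy/MM/dd')
(or just toString(transaction_date,'yyyy/MM/dd') if the column is already a date type).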
In the sink1 activity, select the Name file as column data option for File name, and select the FolderName column as the column data.
That's all. The rows of the csv file will be split into files in different folders. The debug result is as follows: