Azure Data Factory Dynamic Folder Path

I'm trying to create a dynamic copy activity that uses a Get Metadata activity and a Filter activity within Data Factory to do incremental loads.
The file path I'm trying to copy looks like:
2022/08/01
The hierarchy goes from a year folder to a month folder to a day-of-month folder, and each day folder contains several files.
I need the Get Metadata and Filter activities to read the files; however, the output directory on the Filter activity does not have the desired path.
The variable I currently have set is expressed dynamically:
@concat(variables('v_BlobSourceDirectory'), variables('v_Set_date'))
filter activity
@activity('Get Source').output.childItems
get metadata activity
@and(equals(item().type,'Folder'),endswith(item().name,variables('v_Set_Date')))

I reproduced this with an ADLS Gen2 file path in the 2022/09/21 format and added sample files.
Using the Get Metadata activity, all the child items are pulled and given as input to a ForEach activity.
Dynamic values were added for the month and day.
Every child item is iterated by the ForEach activity.
The ForEach activity output is given to the Copy activity.
**Result:** Every child item is copied and stored in Dynamicoutput1
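The Filter condition from the question can be sketched in Python to see what it selects. This is only an illustration, not the pipeline itself: the sample childItems list and the values standing in for v_BlobSourceDirectory and v_Set_date are hypothetical. As far as I understand, childItems carries only the name and type of the immediate children, not their full paths, so the full path has to be rebuilt by concatenating the parent directory with the matched name.

```python
from typing import Dict, List


def filter_folders(child_items: List[Dict[str, str]], suffix: str) -> List[Dict[str, str]]:
    """Mimic @and(equals(item().type,'Folder'), endswith(item().name, suffix))."""
    return [
        item
        for item in child_items
        if item["type"] == "Folder" and item["name"].endswith(suffix)
    ]


# Hypothetical childItems output when Get Metadata points at the month folder.
child_items = [
    {"name": "01", "type": "Folder"},
    {"name": "02", "type": "Folder"},
    {"name": "readme.txt", "type": "File"},
]

source_directory = "2022/08/"  # assumed value of v_BlobSourceDirectory
set_date = "01"                # assumed value of v_Set_date

matches = filter_folders(child_items, set_date)

# The filter output holds names only, so the path is rebuilt by concatenation.
paths = [source_directory + item["name"] for item in matches]
print(paths)
```

If the suffix variable held a full path such as 2022/08/01, the endswith test against a bare folder name would not match anything, which may explain a missing path in the Filter output.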

Related

Azure Datafactory Pipeline query

I have an Azure Data Factory requirement. There are 50 csv files and each file is named like Product, Department, Employee, Sales, etc. Each of these files has a unique number of columns. In Azure SQL Database, I have 50 tables like Product, Department, Employee, Sales, etc. The columns of each table match with its corresponding file. Every day, I receive a new set of files in an Azure Data Lake Storage Gen 2 folder at 11 PM CST. At 12:05 AM CST, each of these files should be loaded into its respective table.
There should be only one pipeline, or there can be two pipelines where the parent pipeline collects the metadata of the files and supplies it to the child pipeline, which does the data load. It should find the files with the timestamp of the previous day, then loop through these files and load them into their respective target tables, one by one. Can someone briefly explain the activities and transformations I need to use to fulfil this requirement?
I am new to ADF. I haven't tried anything so far.
Each of these files has a unique number of columns. In Azure SQL Database, I have 50 tables like Product, Department, Employee, Sales, etc. The columns of each table match with its corresponding file.
Since the source and target have the same columns, and the files and tables have the same names, the process below will work for you as long as the schema is the same on both sides.
First, use a Get Metadata activity on the source folder to get the list of files.
To get the most recently uploaded files, use the Filter by last modified option in the Get Metadata activity. This option only supports UTC, and CST is UTC-6. Set the start time and end time according to your requirement, converting between the two time zones, and use the appropriate date-time functions for this.
As a sample, I have set it as shown below,
which gives a result array like this.
Give this childItems array, @activity('Get Metadata1').output.childItems, to a ForEach activity. Inside the ForEach, use a Copy activity to copy each file, one per iteration.
Create another source dataset with a dataset parameter (here sourcefiename) and set it as shown below.
Use this dataset as the Copy activity source and assign @item().name as the parameter value.
In the sink, create a database dataset with two dataset parameters, schema and table_name. Use these as shown below.
From @item().name, extract the file name without the '.csv' extension using split, and give that to the parameter above.
@split(item().name,'.c')[0]
Now, schedule this pipeline as per your time zone using schedule trigger.
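The two calculations in the steps above (the UTC window for the previous day in CST, and the table name derived from the file name) can be checked with a short Python sketch. It assumes the fixed UTC-6 offset stated in the answer and therefore ignores daylight saving; the run time, file names and timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset taken from the answer (CST = UTC-6); daylight saving is ignored.
CST = timezone(timedelta(hours=-6))


def previous_day_window_utc(now_utc: datetime):
    """Return (start, end) in UTC covering the previous calendar day in CST."""
    today_cst = now_utc.astimezone(CST).replace(hour=0, minute=0, second=0, microsecond=0)
    start_cst = today_cst - timedelta(days=1)
    return start_cst.astimezone(timezone.utc), today_cst.astimezone(timezone.utc)


def table_name(file_name: str) -> str:
    """Mimic @split(item().name,'.c')[0]."""
    return file_name.split(".c")[0]


# Hypothetical run at 12:05 AM CST on 2022-11-02, which is 06:05 UTC.
now_utc = datetime(2022, 11, 2, 6, 5, tzinfo=timezone.utc)
start, end = previous_day_window_utc(now_utc)

# Hypothetical files with their last modified times in UTC.
files = [
    ("Product.csv", datetime(2022, 11, 2, 5, 0, tzinfo=timezone.utc)),  # 11 PM CST on Nov 1
    ("Sales.csv", datetime(2022, 10, 31, 5, 0, tzinfo=timezone.utc)),   # an older drop
]

# Keep only files modified inside the window and pair each with its target table.
loads = [(name, table_name(name)) for name, modified in files if start <= modified < end]
print(start.isoformat(), end.isoformat(), loads)
```

Note that splitting on '.c' cuts at the first occurrence of that text, so a file name that contains '.c' before the extension would be truncated early; splitting on '.csv' may be a safer choice if such names can occur.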

Create a folder based on date (YYYY-MM) using Data Factory?

I have a few sets of monthly files dropping into my data lake folder, and I want to copy them to a different folder in the data lake. While copying the data to the target folder, I want to create a folder in the format YYYY-MM (e.g. 2022-11) and copy the files into this folder.
The next month I will get a new set of data, which I want to copy to the 2022-12 folder, and so on.
I want to run the pipeline every month because we will get a monthly load of data.
Since you want to copy only the new files each month with ADF, this can be done in two ways.
The first uses a storage event trigger.
Demo:
I have created pipeline parameters like below for new file names.
Next, create a storage event trigger and give @triggerBody().fileName as the value for the pipeline parameters.
Parameters:
Here I have used two parameters for better understanding. If you want, you can do it with single pipeline parameter also.
Source dataset with dataset parameter for filename:
Sink dataset with dataset parameter for Folder name and filename:
Copy source:
Copy sink:
Expression for foldername: @formatDateTime(utcnow(),'yyyy-MM')
The file was copied to the required folder successfully when I uploaded it to the source folder.
So every time a new file is uploaded to your folder, it gets copied to the required folder. If you don't want the file to remain after the copy, use a Delete activity to delete the source file after the Copy activity.
NOTE: Make sure you publish all the changes before triggering the pipeline.
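What the foldername expression produces can be illustrated with a small Python sketch. The container name and file name below are hypothetical, and strftime('%Y-%m') is used here as the equivalent of the 'yyyy-MM' format string.

```python
from datetime import datetime, timezone


def month_folder(now_utc: datetime) -> str:
    """Python equivalent of @formatDateTime(utcnow(),'yyyy-MM')."""
    return now_utc.strftime("%Y-%m")


def sink_path(container: str, file_name: str, now_utc: datetime) -> str:
    """Build the target path <container>/<yyyy-MM>/<file name>."""
    return f"{container}/{month_folder(now_utc)}/{file_name}"


# Hypothetical trigger time and names, for illustration only.
trigger_time = datetime(2022, 11, 15, 9, 30, tzinfo=timezone.utc)
path = sink_path("target", "sales.csv", trigger_time)
print(path)
```

Because the folder name is derived from the UTC time at which the pipeline runs, a file uploaded close to a month boundary in local time could land in the neighbouring month's folder.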
The second method uses a Get Metadata activity with a ForEach activity and a Copy activity inside the ForEach.
Use a schedule trigger that runs every month for this.
First, use a Get Metadata activity (with another source dataset whose path stops at the folder) to get the child items, and in its Filter by last modified setting give the start of the current month in UTC (use dynamic content with utcnow() and formatDateTime() to get the correct format).
You will then get an array of child items whose last modified date falls in this month. Give this array to the ForEach, and inside the ForEach use a Copy activity.
In the Copy activity source, use a dataset parameter for the file name (same as above) and give it @item().name.
In the Copy activity sink, use two dataset parameters, one for the folder name and another for the file name.
For the folder name, give the same dynamic content for the yyyy-MM format as above, and for the file name give @item().name.
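The selection logic of this second method can be sketched in Python as follows. The file names and timestamps are invented for illustration, and the sketch only models the decision of which files to copy and where; in ADF the month start would come from dynamic content (the expression language also has a startOfMonth function that may be usable for this).

```python
from datetime import datetime, timezone
from typing import List, Tuple


def month_start_utc(now_utc: datetime) -> datetime:
    """First instant of the current month, used as the last modified start time."""
    return now_utc.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


def plan_copies(child_items: List[Tuple[str, datetime]], now_utc: datetime) -> List[Tuple[str, str]]:
    """Return (source file name, sink path) for files modified this month."""
    start = month_start_utc(now_utc)
    folder = now_utc.strftime("%Y-%m")  # same value as formatDateTime(utcnow(),'yyyy-MM')
    return [(name, f"{folder}/{name}") for name, modified in child_items if modified >= start]


# Hypothetical folder listing: (file name, last modified in UTC).
child_items = [
    ("a.csv", datetime(2022, 12, 3, 8, 0, tzinfo=timezone.utc)),
    ("b.csv", datetime(2022, 11, 28, 8, 0, tzinfo=timezone.utc)),
]
now_utc = datetime(2022, 12, 5, 0, 0, tzinfo=timezone.utc)

plan = plan_copies(child_items, now_utc)
print(plan)
```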

Azure Data Factory Copy Activity New Last Modified Column from Metadata

I am copying many files into one with ADF Copy Activity but I want to add a column and grab the Blob's Last modified date on the Metadata like the $$FILEPATH.
Is there an easy way to do that as I only see System Variables related to pipeline details etc.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-system-variables
Since the requirement is to add a column to each file where this column value is the lastModified date of that blob, we can iterate through each file, add column to it which has the current blob's lastModified date, copy it into a staging folder.
From this staging folder, you can use a final Copy activity to merge all the files in the folder into a single file in the final destination folder.
Look at the following demonstration. The following are my files in ADLS storage.
I used Get Metadata to get the name of files in this container (final and output1 folders are created in later stages, so they won't affect the process).
Using the returned file names as items (@activity('Get Metadata1').output.childItems) in the ForEach activity, I obtained the lastModified of each file using another Get Metadata activity inside the ForEach.
The dataset of this Get Metadata2 is configured as shown below:
I then copied these files into the output1 folder, adding an additional column with the following dynamic content (lastModified from Get Metadata2):
@activity('Get Metadata2').output.lastModified
Now you can use a final copy data activity after this foreach to merge these files into a single file into the final folder.
The following is the final output for reference:
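The overall effect of the two stages (add a lastModified column per file, then merge) can be sketched in Python on in-memory CSV text. This only illustrates the data transformation and is not how the Copy activity works internally; the sample contents, timestamps and the column name lastModified are assumptions.

```python
import csv
import io
from typing import List, Tuple


def add_last_modified(csv_text: str, last_modified: str, column: str = "lastModified") -> List[List[str]]:
    """Return the rows of one file with an extra column holding its last modified value."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    return [header + [column]] + [row + [last_modified] for row in data]


def merge(files: List[Tuple[str, str]]) -> List[List[str]]:
    """Merge (csv_text, last_modified) pairs into one table with a single header row."""
    merged: List[List[str]] = []
    for index, (text, last_modified) in enumerate(files):
        rows = add_last_modified(text, last_modified)
        # Keep the header from the first file only.
        merged.extend(rows if index == 0 else rows[1:])
    return merged


# Hypothetical staged files: (content, last modified of the blob).
files = [
    ("id,name\n1,a\n", "2022-10-01T10:00:00Z"),
    ("id,name\n2,b\n", "2022-10-02T11:30:00Z"),
]

merged = merge(files)
for row in merged:
    print(",".join(row))
```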

How to copy particular files from sFTP source location if the files are not already present in sFTP sink location in Azure Data Factory

I want to filter the source folder for files whose names start with 'File'.
Then I want to check whether those files are already present in the sink folder.
If a file is not present, copy it; otherwise skip it.
Picture 1 -This is the initial picture which contains files in source and sink
Picture 2 - This is the desired output where only those files are copied which were not present in Sink (except junk files)
Picture 3 - This is how I tried. There are IF & copyData activity in ForEach, But I am getting error in copyData activity.
I have reproduced this in my local environment as shown below.
Get sink files list where filename starts with ‘file’ using Get Metadata activity.
The output of Get Metadata1:
Create an array variable to store the list of sink files.
Convert the output of the Get Metadata activity (Get Sink Files) to an array by using a ForEach activity and appending each file name to the array variable.
@activity('Get Sink Files').output.childItems
Add append variable activity inside ForEach activity.
Now get the list of source files using another Get Metadata activity in the pipeline.
The output of Get metadata2:
Connect the output of the second Get Metadata activity (Get Source Files) and the first ForEach activity to a second ForEach activity (ForEach2).
@activity('Get Source Files').output.childItems
Add an If Condition activity inside ForEach2. Add an expression to check whether the current item (each source file) is contained in the array variable.
@contains(variables('sink_files_list'),item().name)
When the condition is false, add a Copy activity to copy the source file to the sink.
Source:
Sink:
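The decision made by the two ForEach loops and the If Condition can be summarised in a short Python sketch. The file names are hypothetical, and the prefix check is written out explicitly here, whereas in the pipeline above the name filter is applied when listing the files.

```python
from typing import Iterable, List


def files_to_copy(source_files: Iterable[str], sink_files: Iterable[str], prefix: str = "File") -> List[str]:
    """Return source files that start with the prefix and are not yet in the sink.

    Mirrors the If Condition @contains(variables('sink_files_list'), item().name):
    a file is copied only when that check is false.
    """
    existing = set(sink_files)  # plays the role of the sink_files_list array variable
    return [name for name in source_files if name.startswith(prefix) and name not in existing]


# Hypothetical folder listings.
source = ["File1.txt", "File2.txt", "File3.txt", "junk.txt"]
sink = ["File1.txt"]

pending = files_to_copy(source, sink)
print(pending)
```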

Regex Additional Column in Copy Activity Azure Data Factory

I am trying to parse the $$FILEPATH value in the "Additional columns" section of the Copy Activity.
The filepaths have a format of: time_period=202105/part-12345.parquet . I would like just the "202105" portion of the filepath. I cannot hardcode it because there are other time_period folders.
I've tried this (from the link below): @{substring($$FILEPATH, add(indexOf($$FILEPATH, '='),1),sub(indexOf($$FILEPATH, '/'),6))} but I get an error saying Unrecognized expression: $$FILEPATH
The only other things I can think of are using: 1) Get Metadata Activity + For each Activity or 2) possibly trying to do this in DataFlow
$$FILEPATH is the reserved variable that stores the file path. You cannot use it inside a dynamic expression.
Instead, create a variable to store the folder name as required and then pass it dynamically in an additional column.
Below is what I have tried.
As your folder name is not static, get the folder names using the Get Metadata activity.
Get Metadata Output:
Pass the output to the ForEach activity to loop all the folders.
Add a variable at the pipeline level to store the folder name.
In the ForEach activity, add a Set variable activity to extract the date part from the folder name and assign the value to the variable.
@substring(item().name, add(indexof(item().name, '='),1), sub(length(item().name), add(indexof(item().name, '='),1)))
Output of Set variable:
In source dataset, parameterize the path/filename to pass them dynamically.
Add copy data activity after set variable and select the source dataset.
a) Pass the current item name of the ForEach activity as a file path. Here I hardcoded the filename as *.parquet to copy all files from that path (this works only when all files have the same structure).
b) Under Additional Column, add a new column, give a name to a new column, and under value, select to Add dynamic content and add the existing variable.
Add Sink dataset in Copy data Sink. I have added the Azure SQL table as my sink.
In Mapping, add filename (new column) to the mappings.
When you run the pipeline, the ForEach activity runs once for each item returned by the Get Metadata activity.
Output:
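As a cross-check of the Set variable expression used in the ForEach step above, the same start and length arithmetic can be written in Python. The folder name is the one from the question; the function name is made up for this sketch.

```python
def extract_period(folder_name: str) -> str:
    """Mirror substring(name, indexof(name,'=') + 1, length(name) - (indexof(name,'=') + 1))."""
    start = folder_name.index("=") + 1      # first character after '='
    length = len(folder_name) - start       # everything up to the end of the name
    return folder_name[start:start + length]


period = extract_period("time_period=202105")
print(period)
```

The expression works on the folder name returned by Get Metadata, which has no trailing '/part-...' segment, so taking everything after '=' is enough.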
