Azure Data Factory Copy Activity New Last Modified Column from Metadata - azure

I am copying many files into one with the ADF Copy activity, but I want to add a column that holds the blob's Last Modified date from its metadata, similar to $$FILEPATH.
Is there an easy way to do that? I only see system variables related to pipeline details, etc.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-system-variables

Since the requirement is to add a column to each file whose value is that blob's lastModified date, you can iterate through the files, add a column holding the current blob's lastModified date to each one, and copy it into a staging folder.
From this staging folder, use a final Copy activity to merge all the files into a single file in the final destination folder.
Look at the following demonstration. The following are my files in ADLS storage.
I used Get Metadata to get the name of files in this container (final and output1 folders are created in later stages, so they won't affect the process).
Using the returned file names as items (@activity('Get Metadata1').output.childItems) in the ForEach activity, I obtained the lastModified of each file using another Get Metadata activity inside the ForEach.
The dataset of this Get Metadata2 is configured as shown below:
Then I copied these files into the output1 folder, adding an additional column with the following dynamic content (lastModified from Get Metadata2):
@activity('Get Metadata2').output.lastModified
Now you can use a final copy data activity after this foreach to merge these files into a single file into the final folder.
The following is the final output for reference:
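As a rough sketch of the additional-column setting used inside the ForEach, the fragment below builds the relevant part of the Copy activity source as a Python dictionary and prints it as JSON. The column name `LastModified` and the source type `DelimitedTextSource` are assumptions for illustration; only the activity name `Get Metadata2` comes from the steps above. Compare it with the JSON your own pipeline generates rather than treating it as definitive.

```python
import json

# Hypothetical column name; use whatever the sink schema expects.
COLUMN_NAME = "LastModified"


def build_copy_source(metadata_activity: str) -> dict:
    """Return a Copy activity 'source' fragment with one additional column."""
    # Dynamic content that reads lastModified from the inner Get Metadata activity.
    expression = f"@activity('{metadata_activity}').output.lastModified"
    return {
        "type": "DelimitedTextSource",  # assumption: the files are delimited text
        "additionalColumns": [
            {
                "name": COLUMN_NAME,
                "value": {"value": expression, "type": "Expression"},
            }
        ],
    }


copy_source = build_copy_source("Get Metadata2")
print(json.dumps(copy_source, indent=2))
```

Because the expression is evaluated once per ForEach iteration, every row copied from a given file receives that file's lastModified value.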

Related

Create a folder based on date (YYYY-MM) using Data Factory?

I have a few sets of monthly files dropping into my data lake folder, and I want to copy them to a different folder in the data lake. While copying the data to the target folder, I want to create a folder named in the format YYYY-MM (e.g. 2022-11) and copy the files into it.
And again in the next month I will get new set of data and I want to copy them to (2022-12) folder and so on.
I want to run the pipeline every month because we will get monthly load of data.
Since you want to copy only the new files with ADF every month, this can be done in two ways.
First will be using a Storage event trigger.
Demo:
I have created pipeline parameters like below for new file names.
Next, create a storage event trigger and pass @triggerBody().fileName to the pipeline parameters.
Parameters:
Here I have used two parameters for better understanding. If you want, you can do it with single pipeline parameter also.
Source dataset with dataset parameter for filename:
Sink dataset with dataset parameter for Folder name and filename:
Copy source:
Copy sink:
Expression for the folder name: @formatDateTime(utcnow(),'yyyy-MM')
File copied to required folder successfully when I uploaded to source folder.
So, every time a new file is uploaded to your folder, it gets copied to the required folder. If you don't want the file to remain after the copy, use a Delete activity to delete the source file after the Copy activity.
NOTE: Make sure you publish all the changes before triggering the pipeline.
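To make the folder-name expression concrete, here is a minimal Python sketch of what `formatDateTime(utcnow(),'yyyy-MM')` evaluates to and how the sink path is formed from it. The function names and the file name `sales.csv` are hypothetical; this only mirrors the expression and is not ADF code.

```python
from datetime import datetime, timezone


def month_folder(now: datetime) -> str:
    """Mirror formatDateTime(utcnow(), 'yyyy-MM'): four-digit year, two-digit month."""
    return now.strftime("%Y-%m")


def sink_path(file_name: str, now: datetime) -> str:
    """Combine the month folder with the file name passed in by the trigger."""
    return f"{month_folder(now)}/{file_name}"


# Fixed timestamp so the example is reproducible; a pipeline run would use the current UTC time.
run_time = datetime(2022, 11, 15, 8, 30, tzinfo=timezone.utc)
print(sink_path("sales.csv", run_time))  # 2022-11/sales.csv
```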
The second method uses a Get Metadata activity, a ForEach, and a Copy activity inside the ForEach.
Use a schedule trigger that runs every month for this.
First, use Get Metadata (with another source dataset whose path points only to the folder) to get the child items, and in the Get Metadata activity's filter by last modified, give the start date of your month in UTC (use the dynamic content utcnow() and formatDateTime() to get the correct format).
You will now get an array of child items whose last modified date falls in this month. Pass this array to the ForEach, and inside the ForEach use a Copy activity.
In the Copy activity source, use a dataset parameter for the file name (same as above) and give it @item().name.
In the Copy activity sink, use two dataset parameters, one for the folder name and another for the file name.
For the folder name, give the same dynamic content for the yyyy-MM format as above, and for the file name give @item().name.
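The selection logic of this second method can be sketched in plain Python as follows. The file listing, its timestamps, and the function names are hypothetical stand-ins for what Get Metadata and ForEach handle inside the pipeline; the sketch only shows the month-start comparison and the resulting sink paths.

```python
from datetime import datetime, timezone


def month_start(now: datetime) -> datetime:
    """Start of the current month in UTC, used as the 'last modified' lower bound."""
    return now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


def plan_copies(files, now: datetime):
    """Return (source name, sink path) pairs for files modified this month."""
    start = month_start(now)
    folder = now.strftime("%Y-%m")  # same yyyy-MM folder name as in the first method
    return [(name, f"{folder}/{name}") for name, modified in files if modified >= start]


# Hypothetical listing: (file name, last modified timestamp).
listing = [
    ("nov.csv", datetime(2022, 11, 30, 23, 0, tzinfo=timezone.utc)),
    ("dec.csv", datetime(2022, 12, 2, 9, 0, tzinfo=timezone.utc)),
]
run_time = datetime(2022, 12, 5, tzinfo=timezone.utc)
print(plan_copies(listing, run_time))  # [('dec.csv', '2022-12/dec.csv')]
```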

Azure Data Factory Copy Files

I want to copy files from one folder to another folder in data lake using ADF pipelines.
Ex: a/b/c/d to a/b
Here a, b, c, and d are folders. I don't want to copy the c and d folders; I need to copy the files inside them to the 'b' folder.
I created a pipeline using Get Metadata and ForEach, with a Copy activity inside the ForEach. But the files are copied along with their folders, and I have not managed to drop the folders.
I reproduced your scenario; follow the steps below:
In my demo container I have nested folders a/b/c/d, and under the d folder I have 3 files, as shown below.
To copy the files from folder to folder, I used a Get Metadata activity to get the list of files in the folder.
Dataset for Get Metadata:
Get Metadata settings:
Then I added a ForEach activity and passed it the output of the Get Metadata activity:
@activity('Get Metadata1').output.childItems
Then I created a Copy activity inside the ForEach activity and created a source dataset with a filename parameter.
In the file name, I gave the dynamic value @dataset().filename
In the Copy activity source, I gave the dataset property filename the dynamic value @item().name
Then I created a sink dataset that points only to the a/b directories and passed it to the sink.
Output
The files are copied under the b folder without copying the c and d folders.
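The effect of pointing the sink dataset at a/b and passing only the file name can be pictured with a short Python sketch. The file names are hypothetical, and the function only computes target paths; it does not copy anything.

```python
import posixpath


def flatten_targets(source_paths, sink_dir):
    """Map each source file to sink_dir/<file name>, dropping intermediate folders."""
    return {path: posixpath.join(sink_dir, posixpath.basename(path)) for path in source_paths}


# Hypothetical files sitting under the nested d folder.
sources = ["a/b/c/d/file1.csv", "a/b/c/d/file2.csv", "a/b/c/d/file3.csv"]
for src, dst in flatten_targets(sources, "a/b").items():
    print(f"{src} -> {dst}")
```

Only the last path segment survives, which is why the c and d folders never appear in the sink.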

Azure Data Factory Dynamic Folder Path

I'm trying to create a dynamic copy activity that uses a Get Metadata and a Filter activity within Data Factory to do incremental loads.
The file path I'm trying to copy looks like :
2022/08/01
The structure goes from a year folder to a month folder to a day-of-month folder, and each day folder contains several files:
I need the Get Metadata and Filter activities to read the files; however, the output directory on the Filter activity does not have the desired path.
The variable I currently have set is expressed dynamically as:
@concat(variables('v_BlobSourceDirectory'), variables('v_Set_date'))
filter activity
@activity('Get Source').output.childItems
get metadata activity
@and(equals(item().type,'Folder'),endswith(item().name,variables('v_Set_Date')))
I reproduced this with an ADLS Gen2 file path in the 2022/09/21 format and added sample files.
Using the Get Metadata activity, all the child items are pulled and given as input to the ForEach activity.
I added dynamic values for the month and day.
Every child item is iterated over by the ForEach activity.
The ForEach activity output is given to the Copy activity.
**Result:** Every child item is copied and stored in Dynamicoutput1
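For readers unsure what the two expressions in the question compute, here is a small Python sketch of the `concat` and of the Filter condition. The sample child items and the variable values are hypothetical; they only illustrate how `equals(item().type,'Folder')` and `endswith(item().name, ...)` combine.

```python
def source_directory(base: str, set_date: str) -> str:
    """Mirror concat(variables('v_BlobSourceDirectory'), variables('v_Set_date'))."""
    return base + set_date


def filter_folders(child_items, set_date: str):
    """Mirror the Filter condition: keep folders whose name ends with set_date."""
    return [
        item
        for item in child_items
        if item["type"] == "Folder" and item["name"].endswith(set_date)
    ]


# Hypothetical Get Metadata output for a month folder.
items = [
    {"name": "01", "type": "Folder"},
    {"name": "02", "type": "Folder"},
    {"name": "01", "type": "File"},
]
print(source_directory("2022/08/", "01"))  # 2022/08/01
print(filter_folders(items, "01"))         # [{'name': '01', 'type': 'Folder'}]
```

Note that concat adds no separator, so the base directory must already end with a slash for the combined path to be valid.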

How to copy particular files from sFTP source location if the files are not already present in sFTP sink location in Azure Data Factory

I want to filter the source folder for files whose names start with 'File'.
Then I want to check if those files are already present in sink folder.
If not present then copy else skip.
Picture 1 -This is the initial picture which contains files in source and sink
Picture 2 - This is the desired output where only those files are copied which were not present in Sink (except junk files)
Picture 3 - This is how I tried. There are IF & copyData activity in ForEach, But I am getting error in copyData activity.
I have reproduced this in my local environment as shown below.
Get the list of sink files whose names start with ‘file’ using a Get Metadata activity.
The output of Get Metadata1:
Create an array variable to store the list of sink files.
Convert the Get Metadata activity (Get sink files) output to an array by using a ForEach activity and appending each file name to the array variable.
@activity('Get Sink Files').output.childItems
Add an Append variable activity inside the ForEach activity.
Now get the list of source files using another Get Metadata activity in the pipeline.
The output of Get metadata2:
Connect the output of Get Metadata activity2 (Get Source files) and the first ForEach activity to another ForEach activity (ForEach2).
@activity('Get Source Files').output.childItems
Add an If Condition activity inside the ForEach2 activity. Add an expression to check whether the current item (each source file) is contained in the array variable.
@contains(variables('sink_files_list'),item().name)
When the condition is false, add a Copy activity to copy the source file to the sink.
Source:
Sink:
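The overall decision logic amounts to a prefix filter followed by a membership check. The Python sketch below shows that logic only; the file names are hypothetical, and in the pipeline the same work is done by the two Get Metadata activities, the array variable, and the If Condition.

```python
def files_to_copy(source_items, sink_items, prefix="File"):
    """Return source file names that start with prefix and are not yet in the sink."""
    # Equivalent of the ForEach + Append variable step that builds sink_files_list.
    sink_files_list = [item["name"] for item in sink_items]
    return [
        item["name"]
        for item in source_items
        # Equivalent of the False branch of contains(variables('sink_files_list'), item().name).
        if item["name"].startswith(prefix) and item["name"] not in sink_files_list
    ]


# Hypothetical childItems arrays for the source and sink folders.
source = [{"name": "File1.txt"}, {"name": "File2.txt"}, {"name": "Junk.txt"}]
sink = [{"name": "File1.txt"}]
print(files_to_copy(source, sink))  # ['File2.txt']
```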

Regex Additional Column in Copy Activity Azure Data Factory

I am trying to parse the $$FILEPATH value in the "Additional columns" section of the Copy Activity.
The filepaths have a format of: time_period=202105/part-12345.parquet . I would like just the "202105" portion of the filepath. I cannot hardcode it because there are other time_period folders.
I've tried this (from the link below): @{substring($$FILEPATH, add(indexOf($$FILEPATH, '='),1),sub(indexOf($$FILEPATH, '/'),6))} but I get an error saying Unrecognized expression: $$FILEPATH
The only other things I can think of are using: 1) Get Metadata Activity + For each Activity or 2) possibly trying to do this in DataFlow
$$FILEPATH is the reserved variable that stores the file path. You cannot use $$FILEPATH inside a dynamic expression.
Instead, create a variable to store the required part of the folder name and then pass it dynamically in an additional column.
Below is what I have tried.
As your folder name is not static, get the folder names using the Get Metadata activity.
Get Metadata Output:
Pass the output to the ForEach activity to loop all the folders.
Add a variable at the pipeline level to store the folder name.
In the ForEach activity, add a Set variable activity to extract the date part from the folder name and assign it to the variable.
@substring(item().name, add(indexof(item().name, '='),1), sub(length(item().name), add(indexof(item().name, '='),1)))
Output of Set variable:
In source dataset, parameterize the path/filename to pass them dynamically.
Add copy data activity after set variable and select the source dataset.
a) Pass the current item name of the ForEach activity as a file path. Here I hardcoded the filename as *.parquet to copy all files from that path (this works only when all files have the same structure).
b) Under Additional Column, add a new column, give a name to a new column, and under value, select to Add dynamic content and add the existing variable.
Add Sink dataset in Copy data Sink. I have added the Azure SQL table as my sink.
In Mapping, add filename (new column) to the mappings.
When you run the pipeline, the ForEach activity runs once for each item returned by the Get Metadata activity.
Output:
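The substring expression in the Set variable step can be checked against a sample folder name with a short Python sketch. The helper names are hypothetical; `adf_substring` mimics the (text, start index, length) argument order of the ADF `substring` function using a Python slice.

```python
def adf_substring(text: str, start: int, length: int) -> str:
    """Mimic substring(text, startIndex, length) with a Python slice."""
    return text[start:start + length]


def extract_period(folder_name: str) -> str:
    """Return everything after the first '=' in the folder name."""
    # add(indexof(name, '='), 1); note str.index raises ValueError when '=' is missing.
    start = folder_name.index("=") + 1
    # sub(length(name), start)
    length = len(folder_name) - start
    return adf_substring(folder_name, start, length)


print(extract_period("time_period=202105"))  # 202105
```

The length argument is the number of characters remaining after the '=' sign, so the extraction keeps working if the value after it is longer or shorter than six characters.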
