Azure Data Factory Copy Files

I want to copy files from one folder to another folder in the data lake using ADF pipelines.
Ex: a/b/c/d to a/b
Here a, b, c, d are folders. I don't want to copy the c and d folders themselves; I only want to copy the files inside them into the 'b' folder.
I created a pipeline using Get Metadata and ForEach, with a Copy activity inside the ForEach. But the files are being copied along with their folder structure; I'm failing to drop the folders.

I reproduced your scenario; follow the steps below.
In my demo container I have nested folders a/b/c/d, and under the d folder I have 3 files, as shown below.
To copy the files from one folder to another, I used a Get Metadata activity to get the list of files in the source folder.
Dataset for Get Metadata:
Get Metadata settings:
Then I added a ForEach activity and passed the Get Metadata activity's output to it:
@activity('Get Metadata1').output.childItems
Then I created a Copy activity inside the ForEach activity and created a source dataset with a filename parameter.
In the dataset's file name I gave the dynamic value @dataset().filename.
In the Copy activity source I gave the dataset property filename the dynamic value @item().name.
Then I created a sink dataset pointing to the a/b directory only
and passed it to the sink.
Output:
The files were copied under the b folder without copying the c and d folders.
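For reference, a minimal sketch of what the pipeline JSON could look like. The dataset names SourceFolderDS, SourceFileDS and SinkFolderDS, and the use of Binary datasets, are assumptions for illustration, not the names from the screenshots:

{
    "name": "CopyFilesToParentFolder",
    "properties": {
        "activities": [
            {
                "name": "Get Metadata1",
                "type": "GetMetadata",
                "typeProperties": {
                    "dataset": { "referenceName": "SourceFolderDS", "type": "DatasetReference" },
                    "fieldList": [ "childItems" ]
                }
            },
            {
                "name": "ForEach1",
                "type": "ForEach",
                "dependsOn": [ { "activity": "Get Metadata1", "dependencyConditions": [ "Succeeded" ] } ],
                "typeProperties": {
                    "items": { "value": "@activity('Get Metadata1').output.childItems", "type": "Expression" },
                    "activities": [
                        {
                            "name": "Copy file to b",
                            "type": "Copy",
                            "inputs": [
                                {
                                    "referenceName": "SourceFileDS",
                                    "type": "DatasetReference",
                                    "parameters": { "filename": { "value": "@item().name", "type": "Expression" } }
                                }
                            ],
                            "outputs": [ { "referenceName": "SinkFolderDS", "type": "DatasetReference" } ],
                            "typeProperties": {
                                "source": { "type": "BinarySource" },
                                "sink": { "type": "BinarySink" }
                            }
                        }
                    ]
                }
            }
        ]
    }
}

Here SourceFileDS points at a/b/c/d with its file name set to @dataset().filename, and SinkFolderDS points at a/b with no file name, so every file lands directly under b regardless of how deep it sat in the source.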

Related

Reading folders with Date format (YYYY-MM) using Azure Data Factory

I have a few folders inside the data lake (Example: Test1 container) that are created every month in the format YYYY-MM (Example: 2022-11), and inside each folder I have a set of data files. I want to copy these data files to different folders in the data lake.
In the next month a new folder is created in the same container (2022-12, then 2023-01, and so on), and I want to copy the files inside these folders every month to a different data lake folder.
How to achieve this?
The solution is described in this thread: Create a folder based on date (YYYY-MM) using Data Factory?
Follow the Sink Dataset section and the Copy Sink section, remove the sinkfilename parameter from the dataset, and use this dataset as the source in the copy activity.
It worked for me.
Alternative approach for reading folders with the date format (YYYY-MM):
I reproduced the same in my environment with a copy activity.
Open the sink dataset and create a parameter with Name: Folder.
Go to Connection and add this dynamic content: @dataset().folder
For the parameter value you can add this dynamic content:
@concat(formatDateTime(utcnow(), 'yyyy/MM'))
Or
@concat(formatDateTime(utcnow(), 'yyyy'), '/', formatDateTime(utcnow(), 'MM'))
The pipeline executed successfully and gave the output:
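For reference, the parameterized sink dataset could look roughly like this. The dataset and linked service names, the container, and the DelimitedText format are assumptions for illustration:

{
    "name": "SinkDS",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": { "referenceName": "AzureDataLakeStorage1", "type": "LinkedServiceReference" },
        "parameters": { "folder": { "type": "string" } },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "test1",
                "folderPath": { "value": "@dataset().folder", "type": "Expression" }
            }
        }
    }
}

In the copy activity sink, the folder parameter is then given one of the two date expressions above, so each run writes into a yyyy/MM folder.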

Create a folder based on date (YYYY-MM) using Data Factory?

I have a few sets of monthly files dropping into my data lake folder, and I want to copy them to a different folder in the data lake. While copying the data to the target data lake folder, I want to create a folder in the format YYYY-MM (Ex: 2022-11) and copy the files into that folder.
In the next month I will get a new set of data and I want to copy it to the 2022-12 folder, and so on.
I want to run the pipeline every month because we get a monthly load of data.
As you want to copy only the new files using ADF every month, this can be done in two ways.
The first is to use a storage event trigger.
Demo:
I created pipeline parameters like below for the new file names.
Next, create a storage event trigger and give @triggerBody().fileName for the pipeline parameters.
Parameters:
Here I used two parameters for better understanding; if you want, you can do it with a single pipeline parameter as well.
Source dataset with dataset parameter for filename:
Sink dataset with dataset parameter for Folder name and filename:
Copy source:
Copy sink:
Expression for the folder name: @formatDateTime(utcnow(),'yyyy-MM')
The file was copied to the required folder successfully when I uploaded it to the source folder.
So, every time a new file is uploaded to your folder, it gets copied to the required folder. If you don't want the source file to remain after the copy, use a Delete activity after the Copy activity to delete it.
NOTE: Make sure you publish all the changes before triggering the pipeline.
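For reference, the storage event trigger definition could look roughly like the following. The scope placeholders, the container/folder path, and the pipeline and parameter names are assumptions for illustration:

{
    "name": "NewFileTrigger",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/test1/blobs/source/",
            "ignoreEmptyBlobs": true,
            "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>",
            "events": [ "Microsoft.Storage.BlobCreated" ]
        },
        "pipelines": [
            {
                "pipelineReference": { "referenceName": "CopyMonthlyFiles", "type": "PipelineReference" },
                "parameters": {
                    "sourcefilename": "@triggerBody().fileName",
                    "sinkfilename": "@triggerBody().fileName"
                }
            }
        ]
    }
}

Each time a blob is created under the watched path, the trigger fires the pipeline and passes the new file name to both parameters.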
The second method is to use a Get Metadata activity and a ForEach with a Copy activity inside it.
Use a schedule trigger that runs every month.
First use Get Metadata (with another source dataset whose path goes only up to the folder) to get the childItems, and in the Filter by last modified setting of the Get Metadata activity give your month's start date in UTC (use the dynamic content functions utcnow() and formatDateTime() for the correct format).
Now you will get the childItems array containing only the files whose last modified date falls in the current month. Pass this array to the ForEach and use a Copy activity inside it.
In the copy activity source, use a dataset parameter for the file name (same as above) and give @item().name.
In the copy activity sink, use two dataset parameters, one for the folder name and another for the file name.
For the folder name give the same yyyy-MM dynamic content as above, and for the file name give @item().name.
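A rough sketch of the Get Metadata activity for this second method is below. The dataset name, the AzureBlobFSReadSettings store type, and the exact timestamp format are assumptions; the Filter by last modified value surfaces in the JSON under storeSettings:

{
    "name": "Get Metadata1",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": { "referenceName": "SourceFolderDS", "type": "DatasetReference" },
        "fieldList": [ "childItems" ],
        "storeSettings": {
            "type": "AzureBlobFSReadSettings",
            "modifiedDatetimeStart": {
                "value": "@formatDateTime(startOfMonth(utcnow()), 'yyyy-MM-ddTHH:mm:ssZ')",
                "type": "Expression"
            }
        }
    }
}

Inside the ForEach, the copy sink's folder and file name parameters are then set to @formatDateTime(utcnow(),'yyyy-MM') and @item().name as described above.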

How to compare the file names that are inside a folder (Data Lake) using ADF

I have a list of files inside a data lake folder and a list of file names stored in a .CSV file.
My requirement is to compare the file names in the data lake folder with the file names in the .CSV file; if the file names match, I want to copy those files, and if they don't match, I want to send an email listing the files missing from the data lake.
I have used a Get Metadata activity (childItems) to get the list of files in the data lake folder, and I'm stuck here. Now I want to compare these file names with the file names stored in the .CSV file and do the further operations.
Kindly help.
A Get Metadata activity is added, and a dataset is created for the data lake. childItems is selected in the field list for the output.
The output of the Get Metadata activity is passed to a ForEach activity:
@activity('Get Metadata1').output.childItems
Inside the ForEach, a Lookup activity is added that reads the CSV file containing the list of file names.
An If Condition activity is added with the expression:
@contains(string(activity('Lookup1').output.value), item().name)
In the True case, a Copy activity is added to copy the matched file into the SQL database.
Edited: to copy from one location to another location in the data lake instead, follow the two steps below.
1. A source dataset is taken, and in the file path the file name is given as @{item().name}.
2. In the sink dataset the file path is given in the same way. This dynamically creates the same file name as in the source.
In the False case, an Append Variable activity is added, and all the values that do not match the lookup are appended to a variable of type array.
For sending the email, refer to the MS document How to send email - Azure Data Factory & Azure Synapse | Microsoft Learn.
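Putting those pieces together, the ForEach body could look roughly like this. The dataset names FileNamesCsvDS, SourceFileDS and SqlSinkDS, the missing_files array variable, and the activity names are assumptions; the variable also has to be declared on the pipeline as type Array:

{
    "name": "ForEach1",
    "type": "ForEach",
    "typeProperties": {
        "items": { "value": "@activity('Get Metadata1').output.childItems", "type": "Expression" },
        "activities": [
            {
                "name": "Lookup1",
                "type": "Lookup",
                "typeProperties": {
                    "source": { "type": "DelimitedTextSource" },
                    "dataset": { "referenceName": "FileNamesCsvDS", "type": "DatasetReference" },
                    "firstRowOnly": false
                }
            },
            {
                "name": "If Condition1",
                "type": "IfCondition",
                "dependsOn": [ { "activity": "Lookup1", "dependencyConditions": [ "Succeeded" ] } ],
                "typeProperties": {
                    "expression": {
                        "value": "@contains(string(activity('Lookup1').output.value), item().name)",
                        "type": "Expression"
                    },
                    "ifTrueActivities": [
                        {
                            "name": "Copy matched file",
                            "type": "Copy",
                            "inputs": [
                                {
                                    "referenceName": "SourceFileDS",
                                    "type": "DatasetReference",
                                    "parameters": { "filename": { "value": "@item().name", "type": "Expression" } }
                                }
                            ],
                            "outputs": [ { "referenceName": "SqlSinkDS", "type": "DatasetReference" } ],
                            "typeProperties": {
                                "source": { "type": "DelimitedTextSource" },
                                "sink": { "type": "AzureSqlSink" }
                            }
                        }
                    ],
                    "ifFalseActivities": [
                        {
                            "name": "Append missing file",
                            "type": "AppendVariable",
                            "typeProperties": {
                                "variableName": "missing_files",
                                "value": { "value": "@item().name", "type": "Expression" }
                            }
                        }
                    ]
                }
            }
        ]
    }
}

After the ForEach, the missing_files variable holds the names that were not found in the CSV and can be passed to the email step.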

Azure Data Factory Copy Activity New Last Modified Column from Metadata

I am copying many files into one with the ADF Copy activity, but I want to add a column that grabs the blob's last modified date from the metadata, like $$FILEPATH.
Is there an easy way to do that? I only see system variables related to pipeline details, etc.:
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-system-variables
Since the requirement is to add a column to each file whose value is the lastModified date of that blob, we can iterate through the files, add to each one a column holding the current blob's lastModified date, and copy it into a staging folder.
From this staging folder, you can use a final copy activity to merge all the files in the folder into a single file in the final destination folder.
Look at the following demonstration. These are my files in ADLS storage.
I used Get Metadata to get the names of the files in this container (the final and output1 folders are created in later stages, so they won't affect the process).
Using the returned file names as items (@activity('Get Metadata1').output.childItems) in the ForEach activity, I obtained the lastModified of each file using another Get Metadata activity inside the ForEach.
The dataset of this Get Metadata2 is configured as shown below:
Now I copied these files into the output1 folder, adding an additional column with the following dynamic content (lastModified from Get Metadata2):
@activity('Get Metadata2').output.lastModified
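In the pipeline JSON, that extra column sits on the copy activity's source under additionalColumns. A rough sketch is shown below; the LastModified column name and the DelimitedText format are assumptions, and the inputs/outputs dataset references are omitted:

{
    "name": "Copy with lastModified column",
    "type": "Copy",
    "typeProperties": {
        "source": {
            "type": "DelimitedTextSource",
            "additionalColumns": [
                {
                    "name": "LastModified",
                    "value": {
                        "value": "@activity('Get Metadata2').output.lastModified",
                        "type": "Expression"
                    }
                }
            ]
        },
        "sink": { "type": "DelimitedTextSink" }
    }
}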
Now you can use a final copy data activity after this ForEach to merge these files into a single file in the final folder.
The following is the final output for reference:

How to copy particular files from sFTP source location if the files are not already present in sFTP sink location in Azure Data Factory

I want to filter the source folder for files whose names start with 'File'.
Then I want to check whether those files are already present in the sink folder.
If they are not present, copy them; otherwise skip them.
Picture 1 - The initial state, showing the files in the source and the sink.
Picture 2 - The desired output, where only the files that were not present in the sink are copied (except junk files).
Picture 3 - This is how I tried it. There are If and Copy Data activities in the ForEach, but I am getting an error in the Copy Data activity.
I have reproduced this in my local environment as shown below.
Get the list of sink files whose names start with 'file' using a Get Metadata activity.
The output of Get Metadata1:
Create an array variable to store the list of sink files.
Convert the Get Metadata activity (Get Sink Files) output to an array by using a ForEach activity and appending each file name to the array variable:
@activity('Get Sink Files').output.childItems
Add an Append Variable activity inside the ForEach activity.
Now get the list of source files using another Get Metadata activity in the pipeline.
The output of Get Metadata2:
Connect the Get Metadata2 (Get Source Files) output and the first ForEach activity to a second ForEach activity (ForEach2):
@activity('Get Source Files').output.childItems
Add an If Condition activity inside the ForEach2 activity, with an expression that checks whether the current item (each source file) is contained in the array variable:
@contains(variables('sink_files_list'), item().name)
When false, add a Copy activity to copy the source file to the sink.
Source:
Sink:
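Putting the second half together, ForEach2 could look roughly like this. The dataset names SftpSourceFileDS and SftpSinkFileDS, the activity names, and the use of Binary datasets are assumptions; sink_files_list is the array variable filled by the first ForEach:

{
    "name": "ForEach2",
    "type": "ForEach",
    "dependsOn": [ { "activity": "ForEach1", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
        "items": { "value": "@activity('Get Source Files').output.childItems", "type": "Expression" },
        "activities": [
            {
                "name": "If file already in sink",
                "type": "IfCondition",
                "typeProperties": {
                    "expression": {
                        "value": "@contains(variables('sink_files_list'), item().name)",
                        "type": "Expression"
                    },
                    "ifFalseActivities": [
                        {
                            "name": "Copy new file",
                            "type": "Copy",
                            "inputs": [
                                {
                                    "referenceName": "SftpSourceFileDS",
                                    "type": "DatasetReference",
                                    "parameters": { "filename": { "value": "@item().name", "type": "Expression" } }
                                }
                            ],
                            "outputs": [
                                {
                                    "referenceName": "SftpSinkFileDS",
                                    "type": "DatasetReference",
                                    "parameters": { "filename": { "value": "@item().name", "type": "Expression" } }
                                }
                            ],
                            "typeProperties": {
                                "source": { "type": "BinarySource" },
                                "sink": { "type": "BinarySink" }
                            }
                        }
                    ]
                }
            }
        ]
    }
}

Since only the ifFalseActivities branch is populated, files that are already present in the sink are simply skipped.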

Resources