I have a few sets of monthly files dropping into my data lake folder, and I want to copy them to a different folder in the data lake. While copying the data to the target folder, I want to create a folder named in the format YYYY-MM (e.g., 2022-11) and copy the files into that folder.
Next month I will get a new set of data, which I want to copy to the 2022-12 folder, and so on.
I want to run the pipeline every month because we get a monthly load of data.
Since you want to copy only the new files with ADF every month, this can be done in two ways.
The first uses a storage event trigger.
Demo:
I created pipeline parameters for the new file names, as shown below.
Next, create a storage event trigger and give @triggerBody().fileName as the value for the pipeline parameters.
Parameters:
Here I have used two parameters for better understanding. If you want, you can do it with a single pipeline parameter as well.
Source dataset with dataset parameter for filename:
Sink dataset with dataset parameter for Folder name and filename:
Copy source:
Copy sink:
Expression for foldername: @formatDateTime(utcnow(),'yyyy-MM')
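To sanity-check what the folder name expression produces, here is a minimal Python sketch of the same logic. It is not ADF code: `sink_path` is a hypothetical helper, and `strftime("%Y-%m")` is assumed to correspond to the `'yyyy-MM'` format string.

```python
from datetime import datetime, timezone
from typing import Optional


def sink_path(file_name: str, now: Optional[datetime] = None) -> str:
    """Build '<yyyy-MM>/<file_name>', mirroring formatDateTime(utcnow(),'yyyy-MM')."""
    # Default to the current UTC time, like utcnow() in the pipeline expression.
    now = now or datetime.now(timezone.utc)
    folder = now.strftime("%Y-%m")
    return f"{folder}/{file_name}"


# A file arriving on 5 November 2022 (UTC) lands in the 2022-11 folder.
print(sink_path("sales.csv", datetime(2022, 11, 5, tzinfo=timezone.utc)))
```

Because the folder name comes from the run time rather than from the file, a file uploaded just after midnight UTC on the first of the month goes into the new month's folder.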
The file was copied to the required folder successfully when I uploaded it to the source folder.
So, every time a new file is uploaded to your folder, it gets copied to the required folder. If you don't want the file to remain after the copy, use a Delete activity to delete the source file after the Copy activity.
NOTE: Make sure you publish all the changes before triggering the pipeline.
The second method uses a Get Metadata activity and a ForEach activity with a Copy activity inside it.
Use a schedule trigger that runs every month for this.
First use Get Metadata (with another source dataset whose path stops at the folder) to get the child items, and in the activity's Filter by last modified setting give your month's starting date in UTC (use dynamic content with utcnow() and formatDateTime() to get the correct format).
Now you will get the childItems array containing the files whose last modified date falls in this month. Give this array to the ForEach, and inside the ForEach use a Copy activity.
In the Copy activity source, use a dataset parameter for the file name (same as above) and give @item().name.
In the Copy activity sink, use two dataset parameters, one for the folder name and another for the file name.
For the folder name give the same dynamic content for the yyyy-MM format as above, and for the file name give @item().name.
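The selection logic of this second method can be sketched in Python as follows. This is only an illustration under stated assumptions: in ADF the last-modified filtering is done by the Get Metadata activity itself, whereas here each file is a hypothetical dict carrying its own `lastModified` value, and `month_start` and `plan_copies` are made-up helper names.

```python
from datetime import datetime, timezone


def month_start(now: datetime) -> datetime:
    """First instant of the current month, in the same timezone as `now`."""
    return now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


def plan_copies(files, now):
    """Return (source name, sink path) pairs for files modified this month.

    `files` is a list of dicts with 'name' and 'lastModified' (aware datetimes).
    """
    start = month_start(now)
    folder = now.strftime("%Y-%m")  # same idea as the yyyy-MM folder expression
    return [
        (f["name"], f"{folder}/{f['name']}")
        for f in files
        if f["lastModified"] >= start
    ]


utc = timezone.utc
files = [
    {"name": "old.csv", "lastModified": datetime(2022, 11, 30, 23, 0, tzinfo=utc)},
    {"name": "new.csv", "lastModified": datetime(2022, 12, 1, 0, 0, tzinfo=utc)},
]
print(plan_copies(files, datetime(2022, 12, 15, tzinfo=utc)))
```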
Related
I have a few folders inside the data lake (example: Test1 container) that are created every month in the format YYYY-MM (example: 2022-11), and inside each folder I have a set of data files. I want to copy these data files to different folders in the data lake.
Next month a new folder is created in the same data lake (example: Test1 container) named 2022-12, and the list goes on: 2023-01, etc. I want to copy the files inside these folders every month to a different data lake folder.
How to achieve this?
The solution is described in this thread: "Create a folder based on date (YYYY-MM) using Data Factory?"
Follow the Sink Dataset section and the Copy Sink section. Remove the parameter sinkfilename from the dataset, and use this dataset as the source in the Copy activity.
It worked for me.
Alternative approach, for reading folders with a date format of YYYY-MM:
I reproduced the same in my environment with a Copy activity.
Open the sink dataset and create a parameter with Name: Folder.
Go to Connection and add this dynamic content: @dataset().folder
You can add this dynamic content:
@concat(formatDateTime(utcnow(), 'yyyy/MM'))
Or
@concat(formatDateTime(utcnow(), 'yyyy'), '/', formatDateTime(utcnow(), 'MM'))
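As a quick check that the two forms give the same folder path, here is a small Python sketch. The function names are hypothetical, and the `strftime` codes are assumed to match the `'yyyy'` and `'MM'` format strings.

```python
from datetime import datetime, timezone


def folder_single_format(now: datetime) -> str:
    """One format string containing the slash, like formatDateTime(..., 'yyyy/MM')."""
    return now.strftime("%Y/%m")


def folder_concat(now: datetime) -> str:
    """Year and month formatted separately and joined, like the concat() variant."""
    return now.strftime("%Y") + "/" + now.strftime("%m")


now = datetime(2022, 11, 5, tzinfo=timezone.utc)
print(folder_single_format(now), folder_concat(now))
```

Note that both variants produce a nested year/month path; for a single folder named like 2022-11, the format string `'yyyy-MM'` would be used instead.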
Pipeline successfully executed and got the output:
I am copying many files into one with ADF Copy Activity but I want to add a column and grab the Blob's Last modified date on the Metadata like the $$FILEPATH.
Is there an easy way to do that as I only see System Variables related to pipeline details etc.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-system-variables
Since the requirement is to add a column to each file where this column value is the lastModified date of that blob, we can iterate through each file, add column to it which has the current blob's lastModified date, copy it into a staging folder.
From this staging folder, you can use final copy activity to where you merge all the files in this folder to a single file in the final destination folder.
Look at the following demonstration. The following are my files in ADLS storage.
I used Get Metadata to get the name of files in this container (final and output1 folders are created in later stages, so they won't affect the process).
Using the returned file names as items (@activity('Get Metadata1').output.childItems) in the ForEach activity, I obtained the lastModified of each file using another Get Metadata activity inside the ForEach.
The dataset of this Get Metadata2 is configured as shown below:
Now I have copied these files into the output1 folder, adding an additional column for which I gave the following dynamic content (lastModified from Get Metadata2):
@activity('Get Metadata2').output.lastModified
Now you can use a final copy data activity after this foreach to merge these files into a single file into the final folder.
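The overall flow (stamp each file's rows with that file's lastModified, then merge) can be sketched in Python like this. It is only a sketch under assumptions: the files are in-memory CSV strings sharing one header, and `add_column` and `merge` are hypothetical helpers, not ADF features.

```python
import csv
import io


def add_column(csv_text: str, column: str, value: str):
    """Return rows of `csv_text` with one extra column holding `value`."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [header + [column]] + [row + [value] for row in data]


def merge(files):
    """Merge (csv_text, last_modified) pairs into one table with a single header."""
    merged = []
    for text, last_modified in files:
        rows = add_column(text, "lastModified", last_modified)
        if not rows:
            continue
        if not merged:
            merged.append(rows[0])  # keep the header only once
        merged.extend(rows[1:])
    return merged


files = [
    ("id,name\n1,a\n", "2022-11-01T10:00:00Z"),
    ("id,name\n2,b\n", "2022-11-02T09:30:00Z"),
]
for row in merge(files):
    print(row)
```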
The following is the final output for reference:
I'm trying to create a dynamic Copy activity that uses a Get Metadata and a Filter activity within Data Factory to do incremental loads.
The file path I'm trying to copy looks like :
2022/08/01
It goes from year folder - month folder - to days of the month folder and each day folder containing several files:
I need the Get Metadata and Filter activities to read the files; however, the output directory on the Filter activity does not have the desired path.
The variable I currently have set uses this dynamic expression:
@concat(variables('v_BlobSourceDirectory'), variables('v_Set_date'))
filter activity
@activity('Get Source').output.childItems
get metadata activity
@and(equals(item().type,'Folder'),endswith(item().name,variables('v_Set_Date')))
I reproduced this with an ADLS Gen2 file path in the 2022/09/21 format and added sample files.
Using the Get Metadata activity, all the child items are pulled and given as input to the ForEach activity.
I added dynamic values for the month and day.
Every child item is iterated through by the ForEach activity.
The ForEach activity output is given to the Copy activity.
**Result:** Every child item is copied and stored in Dynamicoutput1
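For reference, the Filter condition from the question can be sketched in Python as below. The sample child items are made up, and `filter_child_items` is a hypothetical helper. The sketch assumes that each child item carries only its own name and type (not a full path) and lowercases both strings because ADF's endswith is documented as not case-sensitive.

```python
def filter_child_items(child_items, set_date: str):
    """Keep folders whose name ends with `set_date`, like the Filter condition."""
    suffix = set_date.lower()
    return [
        item
        for item in child_items
        if item["type"] == "Folder" and item["name"].lower().endswith(suffix)
    ]


# Hypothetical output of Get Metadata on a month folder such as 2022/08.
child_items = [
    {"name": "01", "type": "Folder"},
    {"name": "02", "type": "Folder"},
    {"name": "readme.txt", "type": "File"},
]

print(filter_child_items(child_items, "01"))
print(filter_child_items(child_items, "2022/08/01"))
```

Under these assumptions, a variable value that contains the whole year/month/day path matches nothing, because each name is a single path segment, so it may be worth checking what v_Set_Date actually holds at run time.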
I want to copy data from a CSV file (Source) on Blob storage to Azure SQL Database table (Sink) via regular Copy activity but I want to copy also file name alongside every entry into the table. I am new to ADF so the solution is probably easy but I have not been able to find the answer in the documentation and neither on the internet so far.
My mapping currently looks like this (I have created a table for output with the file name column but this data is not explicitly defined at the column level at the CSV file therefore I need to extract it from the metadata and pair it to the column):
At first, I thought I could put dynamic content in there and solve the problem that way. But there is no option to use dynamic content in each individual box, so I do not know how to implement the solution. My next thought was to use a Pre-copy script, but I have not seen how I could use it for this purpose. What is the best way to solve this issue?
In the Mapping columns of the Copy activity, you cannot add dynamic content from Get Metadata.
First give the source CSV dataset to a Get Metadata activity, then connect it to the Copy activity like below.
You can add the file name column through Additional columns in the Copy activity source itself, giving the dynamic content of the Get Metadata activity after selecting the same source CSV dataset.
@activity('Get Metadata1').output.itemName
If you are sure about the data types of your data then no need to go to the mapping, you can execute your pipeline.
Here I am copying the contents of samplecsv.csv file to SQL table named output.
My output for your reference:
I am trying to parse the $$FILEPATH value in the "Additional columns" section of the Copy Activity.
The filepaths have a format of: time_period=202105/part-12345.parquet . I would like just the "202105" portion of the filepath. I cannot hardcode it because there are other time_period folders.
I've tried this (from the link below): @{substring($$FILEPATH, add(indexOf($$FILEPATH, '='),1),sub(indexOf($$FILEPATH, '/'),6))} but I get an error saying Unrecognized expression: $$FILEPATH
The only other things I can think of are using: 1) Get Metadata Activity + For each Activity or 2) possibly trying to do this in DataFlow
$$FILEPATH is the reserved variable that stores the file path. You cannot use a dynamic expression with $$FILEPATH.
You have to create a variable to store the folder name as required and then pass it dynamically in an additional column.
Below is what I have tried.
As your folder name is not static, get the folder names using the Get Metadata activity.
Get Metadata Output:
Pass the output to the ForEach activity to loop all the folders.
Add a variable at the pipeline level to store the folder name.
In the ForEach activity, add the set variable activity to extract the date part from the folder name and add the value to the variable.
@substring(item().name, add(indexof(item().name, '='),1), sub(length(item().name), add(indexof(item().name, '='),1)))
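To see how the start index and length in this expression work out, here is a minimal Python sketch of the same arithmetic. `extract_after_equals` is a hypothetical name, and `str.find` is used on the assumption that it behaves like indexof (returning -1 when the character is absent).

```python
def extract_after_equals(name: str) -> str:
    """Return the part of `name` after the first '=', as the substring() call does."""
    start = name.find("=") + 1   # add(indexof(name, '='), 1)
    length = len(name) - start   # sub(length(name), start)
    return name[start:start + length]


print(extract_after_equals("time_period=202105"))
```

For a folder named time_period=202105, the start index is 12 and the length is 6, which yields 202105. If the name contains no '=', the sketch returns the whole name unchanged.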
Output of Set variable:
In source dataset, parameterize the path/filename to pass them dynamically.
Add copy data activity after set variable and select the source dataset.
a) Pass the current item name of the ForEach activity as a file path. Here I hardcoded the filename as *.parquet to copy all files from that path (this works only when all files have the same structure).
b) Under Additional Column, add a new column, give a name to a new column, and under value, select to Add dynamic content and add the existing variable.
Add Sink dataset in Copy data Sink. I have added the Azure SQL table as my sink.
In Mapping, add filename (new column) to the mappings.
When you run the pipeline, the ForEach activity runs once for each item returned by the Get Metadata activity.
Output: