I am trying to parse the $$FILEPATH value in the "Additional columns" section of the Copy Activity.
The filepaths have a format of time_period=202105/part-12345.parquet. I would like just the "202105" portion of the filepath. I cannot hardcode it because there are other time_period folders.
I've tried this (from the link below): @{substring($$FILEPATH, add(indexOf($$FILEPATH, '='),1), sub(indexOf($$FILEPATH, '/'),6))} but I get an error saying "Unrecognized expression: $$FILEPATH".
The only other things I can think of are using: 1) Get Metadata Activity + For each Activity or 2) possibly trying to do this in DataFlow
$$FILEPATH is the reserved variable to store the file path. You cannot use a dynamic expression with $$FILEPATH.
You have to create a variable to store the folder name as required and then pass it dynamically in an additional column.
Below is what I have tried.
As your folder name is not static, get the folder names using the Get Metadata activity.
Get Metadata Output:
Pass the output to the ForEach activity to loop all the folders.
Add a variable at the pipeline level to store the folder name.
In the ForEach activity, add a Set Variable activity to extract the date part from the folder name and assign it to the variable.
@substring(item().name, add(indexof(item().name, '='),1), sub(length(item().name), add(indexof(item().name, '='),1)))
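To see why this expression works, here is a quick trace against the example folder name time_period=202105: indexof returns 11 (the position of '='), so the start index is 12; the length of the name is 18, so the substring length is 18 - 12 = 6; substring('time_period=202105', 12, 6) therefore returns 202105.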
Output of Set variable:
In source dataset, parameterize the path/filename to pass them dynamically.
Add copy data activity after set variable and select the source dataset.
a) Pass the current item name of the ForEach activity as a file path. Here I hardcoded the filename as *.parquet to copy all files from that path (this works only when all files have the same structure).
b) Under Additional columns, add a new column, give it a name, and under Value, select Add dynamic content and add the existing variable.
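The dynamic content for the additional column is just a reference to that pipeline variable, for example (the name folder_name is only illustrative; use whatever you called the variable created earlier):
@variables('folder_name')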
Add Sink dataset in Copy data Sink. I have added the Azure SQL table as my sink.
In Mapping, add filename (new column) to the mappings.
When you run the pipeline, the ForEach activity runs once for each item returned by the Get Metadata activity.
Output:
Related
How to copy the data from append variable activity to a csv file using Azure Data Factory
I have array of file names stored in append variable activity. I want to store all these files names inside a .CSV file in the data lake location.
For more info, refer to this.
how to compare the file names that are inside a folder (Datalake) using ADF
In this repro, variable V1 (Array type) is taken with values as in the below image.
New variable v2 of String type is taken and its value is given as @join(variables('V1'),decodeUriComponent('%0A'))
This step is done to join all the strings of the array using \n (line feed).
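For example, if V1 holds ["file1.csv","file2.csv","file3.csv"], then @join(variables('V1'),decodeUriComponent('%0A')) evaluates to the single string file1.csv\nfile2.csv\nfile3.csv, i.e. one file name per line, which is what ends up in the CSV.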
Then a Copy activity is added and a dummy source dataset with one row is used.
In Source, +New is selected and the value is given as dynamic content @variables('v2').
Sink dataset is created for CSV file.
In Mapping, Import schemas is clicked and all columns other than col1 are deleted.
Then the pipeline is debugged, and the values are loaded into the CSV file.
Edited
Variable v2 stores all the missing file names (appended in the False activities of the If Condition).
After for-each, Set variable is added and variable v3 (string type) is set as
@join(variables('v2'),decodeUriComponent('%0A'))
Then, in the Copy activity source, a +New additional column is added with the variable v3 as its value.
I have a set of monthly files dropping into my data lake folder, and I want to copy them to a different folder in the data lake. While copying the data to the target data lake folder, I want to create a folder in the format YYYY-MM (e.g. 2022-11) and copy the files into that folder.
Again next month I will get a new set of data and want to copy it to the 2022-12 folder, and so on.
I want to run the pipeline every month because we get a monthly load of data.
As you want to copy only the new files every month using ADF, this can be done in two ways.
The first is using a Storage event trigger.
Demo:
I have created pipeline parameters like below for new file names.
Next create a storage event trigger and give @triggerBody().fileName for the pipeline parameters.
Parameters:
Here I have used two parameters for better understanding. If you want, you can do it with single pipeline parameter also.
Source dataset with dataset parameter for filename:
Sink dataset with dataset parameter for Folder name and filename:
Copy source:
Copy sink:
Expression for foldername: @formatDateTime(utcnow(),'yyyy-MM')
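For example, a pipeline run in November 2022 evaluates @formatDateTime(utcnow(),'yyyy-MM') to 2022-11, so the files land in a 2022-11 folder; a run in December resolves to 2022-12, and so on.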
The file was copied to the required folder successfully when I uploaded it to the source folder.
So, every time a new file is uploaded to your folder, it gets copied to the required folder. If you don't want the file to remain in the source after the copy, use a Delete activity after the Copy activity to delete it.
NOTE: Make sure you publish all the changes before triggering the pipeline.
The second method uses a Get Metadata activity with a ForEach activity and a Copy activity inside the ForEach.
Use a schedule trigger to run this every month.
First use Get Metadata (with another source dataset whose path goes only up to the folder) to get the child items, and in the Filter by last modified setting of the Get Metadata activity give your month's starting date in UTC (use dynamic content with utcnow() and formatDateTime() to get the correct format).
Now you will get the childItems array containing only the files whose last modified date falls in this month. Give this array to a ForEach activity and use a Copy activity inside it.
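One possible expression for that month starting date (a sketch only; startOfMonth() is another built-in date function that returns midnight UTC on the first day of the current month, and you can wrap it in formatDateTime() if the field needs a specific format):
@startOfMonth(utcnow())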
In the copy activity source, use a dataset parameter for the file name (same as above) and give @item().name.
In copy activity sink, use two dataset parameters, one for Folder name and another for file name.
In Folder name give the same dynamic content for the yyyy-MM format as above, and for file name give @item().name.
I want to filter the source folder for files whose names start with 'File'.
Then I want to check if those files are already present in sink folder.
If not present then copy else skip.
Picture 1 -This is the initial picture which contains files in source and sink
Picture 2 - This is the desired output where only those files are copied which were not present in Sink (except junk files)
Picture 3 - This is how I tried. There are If Condition and Copy Data activities in the ForEach, but I am getting an error in the Copy Data activity.
I have reproduced this in my local environment as shown below.
Get the list of sink files whose names start with 'file' using a Get Metadata activity.
The output of Get Metadata1:
Create an array variable to store the list of sink files.
Loop over the Get Metadata (Get Sink Files) output with a ForEach activity and append each filename to the array variable.
@activity('Get Sink Files').output.childItems
Add append variable activity inside ForEach activity.
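In that Append variable activity, select the array variable (sink_files_list in this example) and set its value to the name of the current child item, typically:
@item().name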
Now get the list of source files using another Get Metadata activity in the pipeline.
The output of Get metadata2:
Connect the second Get Metadata activity (Get Source Files) and the first ForEach activity to a second ForEach activity, and pass the source file list to it as its items.
@activity('Get Source Files').output.childItems
Add an If Condition activity inside the second ForEach activity, with an expression that checks whether the current item (each source file) is contained in the array variable.
@contains(variables('sink_files_list'),item().name)
When false, add a Copy activity to copy the source file to the sink.
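For example, if the sink already contains File1.csv and File2.csv, then for the source item File3.csv the expression @contains(variables('sink_files_list'),item().name) returns false, so the False activities run and File3.csv is copied; for File1.csv it returns true and the copy is skipped.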
Source:
Sink:
Within Azure ADF, I have an Excel file with 4 tabs named Sheet1-Sheet4.
I would like to loop through the Excel file, creating a CSV per tab.
I have created a SheetsNames parameter in the pipeline with a default value of ["Sheet1","Sheet2","Sheet3","Sheet4"]
How do I use this with the copy task to loop with the tabs?
Please try this:
Create a SheetsNames parameter in the pipeline with a default value of ["Sheet1","Sheet2","Sheet3","Sheet4"].
Add a For Each activity and type @pipeline().parameters.SheetsNames in the Items option.
Within the For Each activity, add a copy activity.
Create Source dataset and create a parameter named sheetName with empty default value.
Navigate to the Connection settings of the Source dataset and check Edit in the Sheet name option. Then type @dataset().sheetName in it.
Navigate to the Source settings of the Copy data activity and pass @item() to the sheetName parameter.
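With this setup, the first ForEach iteration passes Sheet1 as @item(), so the dataset parameter sheetName resolves to Sheet1 and the Copy activity reads that tab; the remaining iterations read Sheet2, Sheet3 and Sheet4 in turn (or in parallel, depending on the ForEach settings).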
Create a Sink dataset; its settings are similar to the Source dataset.
Run the pipeline and get this result:
There are two pipelines Master and child.
In the child pipeline there is a ForEach activity which takes files as input and processes them in parallel.
For instance, there are 4 files, of which 2 are successfully processed and their data loaded into a table. Then the 3rd file fails and the 4th file is processed successfully. Now, when I retrigger the Master pipeline I only want the 3rd file to be processed, not all 4 files.
How can we achieve this?
I have tried below.
To move/delete the file once the processing is completed
But as per the requirement, I should not move/delete the file. Could someone please assist.
I created a test and successfully achieved that. My overall idea is: use a Lookup activity to extract the array of already copied file names from the SQL table, and then do a Filter operation against the source file names array. If a file name already exists in the SQL table, the copy for that file is not performed. This requires us to add the file name to the SQL table in the Copy activity via Additional columns.
My SQL table looks as follows:
I declared 3 variables: arr1 (Array type) stores the source file names, filterArray (Array type) stores the array of copied file names from the SQL table, and arrItem (String type) holds the current source file name inside the ForEach.
At the Lookup activity, we can use the query select distinct FileName from [dbo].[emp] to get the array of copied file names from the SQL table.
Assign the value to the variable filterArray.
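The Set variable value for filterArray would look something like the following (the activity name Lookup1 is only a placeholder for whatever your Lookup is called, and First row only must be unchecked so the full result set comes back as an array):
@activity('Lookup1').output.value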
I set the default value ["emp.csv","emp2.csv","emp3.csv","emp4.csv"] as source file names to the variable arr1.
At the ForEach activity, we iterate over the variable arr1.
Inside the ForEach activity, assign the value @item() to the variable arrItem.
Then do the Filter operation. Items: @variables('filterArray'), Condition: @contains(item().FileName, variables('arrItem')). The item() here represents each element in the filterArray array.
At the If Condition activity, use @empty(activity('Filter1').output.Value) to determine whether this file has already been copied.
In the True activities, key in the dynamic content @item(); this represents the name of the file to be copied.
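To trace it with illustrative values: suppose the SQL table already holds emp.csv but not emp3.csv. For arrItem = 'emp3.csv' the Filter finds no matching row, its output Value is an empty array, @empty(activity('Filter1').output.Value) is true, and the True activities copy the file; for arrItem = 'emp.csv' the Filter returns that row, empty() is false, and the copy is skipped.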
That's all.