ADF Copy task with Excel in a ForEach loop

Within Azure ADF, I have an Excel file with 4 tabs named Sheet1-Sheet4.
I would like to loop through the Excel file, creating a CSV per tab.
I have created a SheetsNames parameter in the pipeline with a default value of ["Sheet1","Sheet2","Sheet3","Sheet4"].
How do I use this with the Copy activity to loop through the tabs?

Please try this:
Create a SheetsNames parameter in the pipeline with a default value of ["Sheet1","Sheet2","Sheet3","Sheet4"].
Add a ForEach activity and type @pipeline().parameters.SheetsNames in the Items option.
Within the ForEach activity, add a Copy activity.
Create a Source dataset and create a parameter named sheetName with an empty default value.
Navigate to the Connection settings of the Source dataset and check Edit in the Sheet name option. Then type @dataset().sheetName in it.
Navigate to the Source settings of the Copy data activity and pass @item() to sheetName.
Create a Sink dataset; its settings are similar to the Source dataset's.
Run the pipeline; each sheet is written out as its own CSV.
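For reference, here is a minimal JSON sketch of the ForEach wiring, assuming the Source and Sink datasets are named ExcelSource and CsvSink (both names are placeholders, not from the original pipeline):

    {
        "name": "ForEachSheet",
        "type": "ForEach",
        "typeProperties": {
            "items": { "value": "@pipeline().parameters.SheetsNames", "type": "Expression" },
            "activities": [
                {
                    "name": "CopySheetToCsv",
                    "type": "Copy",
                    "inputs": [ {
                        "referenceName": "ExcelSource",
                        "type": "DatasetReference",
                        "parameters": { "sheetName": { "value": "@item()", "type": "Expression" } }
                    } ],
                    "outputs": [ {
                        "referenceName": "CsvSink",
                        "type": "DatasetReference",
                        "parameters": { "sheetName": { "value": "@item()", "type": "Expression" } }
                    } ],
                    "typeProperties": {
                        "source": { "type": "ExcelSource" },
                        "sink": { "type": "DelimitedTextSink" }
                    }
                }
            ]
        }
    }

The ExcelSource dataset's Sheet name property holds @dataset().sheetName, and CsvSink can reuse the same parameter in its file name (for example @concat(dataset().sheetName,'.csv')) so each tab lands in its own CSV.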

Related

Create a folder based on date (YYYY-MM) using Data Factory?

I have a few sets of monthly files dropping into my data lake folder, and I want to copy them to a different folder in the data lake. While copying the data to the target folder, I want to create a folder in the format YYYY-MM (e.g. 2022-11) and copy the files inside it.
Next month I will get a new set of data and I want to copy it to the 2022-12 folder, and so on.
I want to run the pipeline every month because we get a monthly load of data.
As you want to copy only the new files using ADF every month, this can be done in two ways.
The first way uses a Storage event trigger.
Demo:
I have created pipeline parameters like below for the new file names.
Next, create a storage event trigger and give @triggerBody().fileName for the pipeline parameters.
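As a sketch, the trigger definition in JSON would look something like this, assuming the pipeline is named MonthlyCopy and its parameter sourceFileName (both names are placeholders):

    {
        "name": "NewFileTrigger",
        "properties": {
            "type": "BlobEventsTrigger",
            "typeProperties": {
                "blobPathBeginsWith": "/source/blobs/",
                "events": [ "Microsoft.Storage.BlobCreated" ]
            },
            "pipelines": [ {
                "pipelineReference": { "referenceName": "MonthlyCopy", "type": "PipelineReference" },
                "parameters": { "sourceFileName": "@triggerBody().fileName" }
            } ]
        }
    }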
Parameters:
Here I have used two parameters for better understanding. If you want, you can do it with a single pipeline parameter as well.
Source dataset with dataset parameter for filename:
Sink dataset with dataset parameter for Folder name and filename:
Copy source:
Copy sink:
Expression for the folder name: @formatDateTime(utcnow(),'yyyy-MM'). For example, a run on 15 November 2022 produces 2022-11.
The file was copied to the required folder successfully when I uploaded it to the source folder.
So, every time a new file is uploaded to your folder, it gets copied to the required folder. If you don't want the file to remain after the copy, use a Delete activity to delete the source file after the Copy activity.
NOTE: Make sure you publish all the changes before triggering the pipeline.
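To tie the sink side together, here is a sketch of how the Copy activity might pass values to the parameterized sink dataset (the dataset and parameter names are placeholders):

    "outputs": [ {
        "referenceName": "SinkFolder",
        "type": "DatasetReference",
        "parameters": {
            "folderName": { "value": "@formatDateTime(utcnow(),'yyyy-MM')", "type": "Expression" },
            "fileName": { "value": "@pipeline().parameters.sourceFileName", "type": "Expression" }
        }
    } ]

In the sink dataset's connection, the file path then references @dataset().folderName and @dataset().fileName, so the YYYY-MM folder is created on the fly at copy time.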
The second method uses a Get Metadata activity and a ForEach with a copy activity inside the ForEach.
Use a schedule trigger for this, running every month.
First use Get Metadata (use another source dataset and give the path only up to the folder) to get the child items, and in the Filter by last modified setting of the Get Metadata activity give your month's starting date in UTC (use dynamic content with utcnow() and formatDateTime() for the correct format; a JSON sketch follows below).
Now you will get the list of child items whose last modified date falls in this month. Give this array to ForEach, and inside the ForEach use a copy activity.
In the copy activity source, use a dataset parameter for the file name (same as above) and give @item().name.
In the copy activity sink, use two dataset parameters, one for the folder name and another for the file name.
In the folder name give the same dynamic content for the yyyy-MM format as above, and for the file name give @item().name.
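A sketch of the Get Metadata configuration for this second method, assuming an ADLS Gen2 source (the dataset name and the startOfMonth() shortcut for the month's starting date are my assumptions):

    {
        "name": "GetNewFiles",
        "type": "GetMetadata",
        "typeProperties": {
            "dataset": { "referenceName": "SourceFolder", "type": "DatasetReference" },
            "fieldList": [ "childItems" ],
            "storeSettings": {
                "type": "AzureBlobFSReadSettings",
                "modifiedDatetimeStart": {
                    "value": "@formatDateTime(startOfMonth(utcnow()),'yyyy-MM-ddTHH:mm:ssZ')",
                    "type": "Expression"
                }
            }
        }
    }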

Add file name to Copy activity in Azure Data Factory

I want to copy data from a CSV file (Source) on Blob storage to an Azure SQL Database table (Sink) via a regular Copy activity, but I also want to copy the file name alongside every entry into the table. I am new to ADF, so the solution is probably easy, but I have not been able to find the answer in the documentation or on the internet so far.
My mapping currently looks like this (I have created an output table with a file name column, but this data is not explicitly defined at the column level in the CSV file, therefore I need to extract it from the metadata and pair it with the column):
At first, I thought I would put dynamic content in there and solve the problem that way. But there is no option to use dynamic content in each individual box, so I do not know how to implement the solution. My next thought was to use a pre-copy script, but I have not seen how I could use it for this purpose. What is the best way to solve this issue?
You cannot add dynamic content from Get Metadata in the mapping columns of the Copy activity.
First give the source CSV dataset to the Get Metadata activity, then join it with the Copy activity like below.
You can add the file name column via Additional columns in the Copy activity source itself, by giving the dynamic content of the Get Metadata activity (after giving it the same source CSV dataset):
@activity('Get Metadata1').output.itemName
If you are sure about the data types of your data, there is no need to go into the mapping; you can execute your pipeline.
Here I am copying the contents of the samplecsv.csv file to a SQL table named output.
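As a sketch, the relevant piece of the Copy activity source JSON would look something like this (the activity name Get Metadata1 and column name FileName are assumptions):

    "source": {
        "type": "DelimitedTextSource",
        "additionalColumns": [ {
            "name": "FileName",
            "value": {
                "value": "@activity('Get Metadata1').output.itemName",
                "type": "Expression"
            }
        } ]
    }

The FileName value is appended to every row read from the file, so each record lands in the SQL table alongside its source file name.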
My output for your reference:

Regex Additional Column in Copy Activity Azure Data Factory

I am trying to parse the $$FILEPATH value in the "Additional columns" section of the Copy Activity.
The filepaths have a format of: time_period=202105/part-12345.parquet. I would like just the "202105" portion of the filepath. I cannot hardcode it because there are other time_period folders.
I've tried this (from the link below): @{substring($$FILEPATH, add(indexOf($$FILEPATH, '='),1), sub(indexOf($$FILEPATH, '/'),6))} but I get an error saying Unrecognized expression: $$FILEPATH
The only other things I can think of are using: 1) Get Metadata Activity + For each Activity or 2) possibly trying to do this in DataFlow
$$FILEPATH is the reserved variable that stores the file path. You cannot use dynamic expressions with $$FILEPATH.
You have to create a variable to store the folder name as required and then pass it dynamically in an additional column.
Below is what I have tried.
As your folder name is not static, get the folder names using the Get Metadata activity.
Get Metadata Output:
Pass the output to the ForEach activity to loop over all the folders.
Add a variable at the pipeline level to store the folder name.
In the ForEach activity, add the set variable activity to extract the date part from the folder name and add the value to the variable.
@substring(item().name, add(indexof(item().name, '='),1), sub(length(item().name), add(indexof(item().name, '='),1)))
For example, with item().name equal to time_period=202105, indexof returns 11, so the substring starts at position 12 and runs for 18 minus 12, i.e. 6 characters, yielding 202105.
Output of Set variable:
In the source dataset, parameterize the path/filename to pass them dynamically.
Add a copy data activity after the set variable activity and select the source dataset.
a) Pass the current item name of the ForEach activity as the file path. Here I hardcoded the filename as *.parquet to copy all files from that path (this works only when all files have the same structure).
b) Under Additional columns, add a new column, give it a name, and under Value select Add dynamic content and add the existing variable.
Add a Sink dataset in the Copy data sink. I have added an Azure SQL table as my sink.
In Mapping, add the filename (new column) to the mappings.
When you run the pipeline, the ForEach activity runs once for each item returned by the Get Metadata activity. A condensed sketch of the key pieces is below.
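Here are the two key pieces inside the ForEach as JSON (the variable name folderDate and column name time_period are placeholders):

    {
        "name": "SetFolderDate",
        "type": "SetVariable",
        "typeProperties": {
            "variableName": "folderDate",
            "value": {
                "value": "@substring(item().name, add(indexof(item().name, '='),1), sub(length(item().name), add(indexof(item().name, '='),1)))",
                "type": "Expression"
            }
        }
    }

and, in the Copy activity source that follows it:

    "additionalColumns": [ {
        "name": "time_period",
        "value": { "value": "@variables('folderDate')", "type": "Expression" }
    } ]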
Output:

Azure Data Factory Copy Data Activity Mapping When Using Triggers

I am creating a pipeline using ADF to copy the data in an XML file to a SQL database. I want this pipeline to be triggered when the XML file is uploaded to Blob Storage. Therefore, here I will be using a parameter with the input dataset.
Now, in the Copy Data activity that I am using, I want to be able to define the mappings. This is usually quite easy when the path to the file is given, however, in this situation, where a parameter is being used, how can I do this?
From what I have gathered, the mappings can be defined as a JSON schema and assigned to the activity, but is there perhaps an easier way to do this? Maybe by uploading a demo file from which the schema can be imported?
When you want to load an XML file into a SQL DB, you are using a hierarchical-source-to-tabular-sink method.
When copying data from a hierarchical source to a tabular sink, the copy activity supports the following capabilities:
Extract data from objects and arrays.
Cross apply multiple objects with the same pattern from an array, in which case one source object is converted into multiple records in the tabular result.
You can define such mapping on Data Factory authoring UI:
On copy activity -> mapping tab, click Import schemas button to import both source and sink schemas. As Data Factory samples the top few objects when importing schema, if any field doesn't show up, you can add it to the correct layer in the hierarchy - hover on an existing field name and choose to add a node, an object, or an array.
Select the array from which you want to iterate and extract data. It will be auto-populated as the Collection reference. Note that only a single array is supported for such an operation.
Map the needed fields to the sink. Data Factory automatically determines the corresponding JSON paths for the hierarchical side.
Note: For records where the array marked as collection reference is empty and the check box is selected, the entire record is skipped.
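In JSON terms, the mapping you build this way becomes a TabularTranslator on the Copy activity. A sketch with assumed element and column names:

    "translator": {
        "type": "TabularTranslator",
        "mappings": [
            { "source": { "path": "['Name']" }, "sink": { "name": "OrderName" } },
            { "source": { "path": "['Qty']" }, "sink": { "name": "Quantity", "type": "Int32" } },
            { "source": { "path": "$['Root']['Customer']" }, "sink": { "name": "CustomerName" } }
        ],
        "collectionReference": "$['Root']['Orders']['Order']"
    }

Fields inside the array named by collectionReference use paths relative to the array element, while fields outside it (CustomerName here) use absolute paths starting with $.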
Here I am using a sample XML file at source
If you notice, here I have used a dataset parameter to which I will assign the file name value obtained from the trigger, and I have placed it in the file name field of the file path property in the dataset connection.
Next I have created a pipeline parameter to hold the input obtained from the trigger before assigning it to the dataset parameter.
Create storage event trigger
Click continue and you will find a preview of all the files that match the trigger conditions.
When you move to the next page, if you have created pipeline parameters (which we have), you will see them there.
Fill in the values as per your need. See the available system variables here: Storage event trigger scope.
Now, let's move to the copy data activity; there you will find the dataset parameter. Assign the pipeline parameter value to it.
Now move to the sink tab in the copy activity. Since you want the source schema to be carried into the sink, the best way is to select Auto create table.
For this you have to make appropriate changes in the sink dataset: for the table, choose Edit and manually enter a name for a table that does not already exist on your server, i.e. a new table will be created with this name in the SQL server mentioned in the sink. Make sure you clear all schema, as you will be getting the source schema in the copy activity.
Back in the mapping tab of the copy activity, click Import schema and select the fields you want to copy to the table. Additionally, you can specify the data types; setting the Collection reference is necessary.
Refer: Parameterize mapping
You can also switch to the Advanced editor, in which case you can directly see and edit the fields' JSON paths. If you choose to add a new mapping in this view, specify the JSON path.
So when a file is created in the storage, a blob-created event is triggered and the pipeline runs.
You can see the new table dbo.NewTable created under ktestsql, and it has the data from the XML as rows.

How to process only the failed files in a ForEach activity when the master pipeline is retriggered in ADF

There are two pipelines, Master and child.
In the child pipeline there is a ForEach activity which takes files as input and processes them in parallel.
For instance, there are 4 files, of which 2 are successfully processed and their data loaded into a table. Then the 3rd file fails and the 4th file is processed successfully. Now, when I retrigger the Master pipeline, I only want the 3rd file to be processed, not all 4 files.
How can we achieve this?
I have tried the following: moving/deleting each file once its processing is completed.
But as per the requirement, I should not move/delete the files. Could someone please assist?
I created a test and successfully achieved this. My overall idea: use the Lookup activity to extract the copied file names array from the SQL table, then do a Filter operation against the source file names array. If a file name already exists in the SQL table, the copy activity is not performed for that file. This requires us to add the file name to the SQL table in the Copy activity via Additional columns.
In my sql table, it looks like as follows:
I declared 3 variables: arr1, an Array-type variable that stores the source file names; filterArray, an Array-type variable that stores the copied file names from the SQL table; and arrItem, which holds the current file name.
At the Lookup activity, we can use the query select distinct FileName from [dbo].[emp] to get the copied file names from the SQL table.
Assign the value to the variable filterArray.
I set the default value ["emp.csv","emp2.csv","emp3.csv","emp4.csv"] as the source file names for the variable arr1.
At the ForEach activity, we can iterate over the variable arr1.
Inside the ForEach activity, assign the value @item() to the variable arrItem.
Then do the Filter operation. Items: @variables('filterArray'), Condition: @contains(item().FileName, variables('arrItem')). The item() here represents each element in the filterArray array.
At the If Condition activity, use @empty(activity('Filter1').output.Value) to determine whether this file has already been copied.
In the True activities, key in the dynamic content @item(); this represents the name of the file to be copied.
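A sketch of the Filter and If Condition pair as they would appear inside the ForEach (activity names are the defaults; the Copy activity's source and sink configuration is omitted here):

    {
        "name": "Filter1",
        "type": "Filter",
        "typeProperties": {
            "items": { "value": "@variables('filterArray')", "type": "Expression" },
            "condition": { "value": "@contains(item().FileName, variables('arrItem'))", "type": "Expression" }
        }
    },
    {
        "name": "If Condition1",
        "type": "IfCondition",
        "dependsOn": [ { "activity": "Filter1", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": {
            "expression": { "value": "@empty(activity('Filter1').output.Value)", "type": "Expression" },
            "ifTrueActivities": [ { "name": "CopyNewFile", "type": "Copy" } ]
        }
    }

If the Filter returns an empty result, the file name is not yet in the SQL table, so the True branch copies it; otherwise the copy is skipped.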
That's all.
