I'm using Azure Data Factory to copy data from a source folder (Azure Blob) to a SQL Server table. The source folder contains multiple subfolders, each named after a year, and inside each subfolder are the Excel spreadsheets with the data. I want to iterate through the folders, pick up each folder name, and insert it into a column in the table, so that every row read from the files in a folder carries the name of the folder it came from, like this:
Data 1 | Data 2 | Year
-------+--------+-----
A      | abc    | 2020
B      | def    | 2020
C      | ghi    | 2021
D      | jkl    | 2022
E      | lmn    | 2023
My pipeline is like this:
I have a Get Metadata activity called Get Metadata1 pointing to the folders, followed by a ForEach to iterate through the folders with two activities inside: a Set variable activity that sets a variable named FolderYear to @item().name (to capture the folder name), and a Copy activity that adds an additional column named Year to the dataset using that variable.
I'm trying to map the additional Year column to a column in the table, but when I debug the pipeline, the following error appears:
{
  "errorCode": "2200",
  "message": "Mixed properties are used to reference 'source' columns/fields in copy activity mapping. Please only choose one of the three properties 'name', 'path' and 'ordinal'. The problematic mapping setting is 'name': 'Year', 'path': '', 'ordinal': ''.",
  "failureType": "UserError",
  "target": "Copy data1",
  "details": []
}
Is it possible to insert the folder name I'm currently iterating over into a database column?
I have made the same test and copied the data (including the folder name) into a SQL table successfully.
I have two folders in the container, and each folder contains one csv file for testing.
The previous settings are the same as yours.
Inside the ForEach activity, I use Additional columns to add the folder name to the data source.
After copying into the SQL table, the results are as follows:
Update:
My file structure is as follows:
You can use the expression @concat('FolderA/FolderB/', item().name):
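As a rough Python sketch (the function name is illustrative), the expression simply prefixes the current ForEach item's name with the fixed parent path:

```python
# Python equivalent of the ADF expression
# @concat('FolderA/FolderB/', item().name), where item_name stands in
# for the current ForEach item's name (e.g. a year folder).
def folder_path(item_name: str) -> str:
    return "FolderA/FolderB/" + item_name

result = folder_path("2020")
print(result)  # FolderA/FolderB/2020
```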
Related
I have a set of monthly files dropping into my data lake folder, and I want to copy them to a different folder in the data lake. While copying to the target folder, I want to create a folder in the format YYYY-MM (e.g. 2022-11) and copy the files into it.
Next month I will get a new set of data and want to copy it into the 2022-12 folder, and so on.
I want to run the pipeline every month because we get a monthly load of data.
Since you want to copy only the new files with ADF every month, this can be done in two ways.
The first is using a storage event trigger.
Demo:
I have created pipeline parameters like below for new file names.
Next, create a storage event trigger and give @triggerBody().fileName for the pipeline parameters.
Parameters:
Here I have used two parameters for clarity. If you want, you can do it with a single pipeline parameter as well.
Source dataset with dataset parameter for filename:
Sink dataset with dataset parameter for Folder name and filename:
Copy source:
Copy sink:
Expression for the folder name: @formatDateTime(utcnow(),'yyyy-MM')
The file was copied to the required folder successfully when I uploaded it to the source folder.
So, every time a new file is uploaded to your folder it gets copied to the required folder. If you don't want the source file to remain after the copy, use a Delete activity after the Copy activity to delete it.
NOTE: Make sure you publish all the changes before triggering the pipeline.
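The folder-name expression can be sketched in Python (illustrative only; strftime stands in for ADF's formatDateTime):

```python
from datetime import datetime, timezone

# Python equivalent of the sink folder-name expression
# @formatDateTime(utcnow(), 'yyyy-MM'): the current UTC month as "YYYY-MM".
def month_folder(now: datetime) -> str:
    return now.strftime("%Y-%m")

example = month_folder(datetime(2022, 11, 5, tzinfo=timezone.utc))
print(example)  # 2022-11
```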
The second method uses a Get Metadata activity and a ForEach with a Copy activity inside it.
Use a schedule trigger for this, running every month.
First use Get Metadata (with another source dataset whose path goes only up to the folder) to get the child items, and in the Filter by last modified setting of the Get Metadata activity give your month's start date in UTC (use dynamic content with utcnow() and formatDateTime() for the correct format).
You will now get the array of child items whose last modified date falls in this month. Pass this array to the ForEach, and inside the ForEach use a Copy activity.
In the copy activity source, use a dataset parameter for the file name (same as above) and give @item().name.
In the copy activity sink, use two dataset parameters, one for the folder name and another for the file name.
For the folder name, give the same yyyy-MM dynamic content as above, and for the file name give @item().name.
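The filtering logic of this second method can be sketched in Python; the dictionary fields mirror the Get Metadata childItems output, but the helper names are illustrative:

```python
from datetime import datetime, timezone

# Sketch of the second method: keep only child items whose
# lastModified date falls on or after the start of the current month
# (what the "Filter by last modified" setting does in Get Metadata).
def month_start(now: datetime) -> datetime:
    return now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

def new_files(child_items, now):
    start = month_start(now)
    return [i["name"] for i in child_items if i["lastModified"] >= start]

files = [
    {"name": "a.csv", "lastModified": datetime(2022, 11, 3, tzinfo=timezone.utc)},
    {"name": "b.csv", "lastModified": datetime(2022, 10, 28, tzinfo=timezone.utc)},
]
picked = new_files(files, datetime(2022, 11, 15, tzinfo=timezone.utc))
print(picked)  # ['a.csv']
```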
I am copying many files into one with the ADF Copy activity, but I want to add a column that grabs the blob's Last modified date from the metadata, like $$FILEPATH.
Is there an easy way to do that? I only see system variables related to pipeline details, etc.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-system-variables
Since the requirement is to add a column to each file whose value is the lastModified date of that blob, we can iterate through each file, add a column holding the current blob's lastModified date, and copy it into a staging folder.
From this staging folder, you can use a final copy activity to merge all the files in the folder into a single file in the final destination folder.
Look at the following demonstration. The following are my files in ADLS storage.
I used Get Metadata to get the names of the files in this container (the final and output1 folders are created in later stages, so they don't affect the process).
Using the returned file names as items (@activity('Get Metadata1').output.childItems) in the ForEach activity, I obtained the lastModified of each file using another Get Metadata activity inside the ForEach.
The dataset of this Get Metadata2 is configured as shown below:
Now I have copied these files into the output1 folder, adding an additional column with the following dynamic content (lastModified from Get Metadata2):
@activity('Get Metadata2').output.lastModified
Now you can use a final copy data activity after this foreach to merge these files into a single file into the final folder.
The following is the final output for reference:
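The two-stage approach above can be sketched in Python (file contents and dates are made up for illustration):

```python
import csv
import io

# Stage 1: add a lastModified column to each file's rows
# (the per-file copy into the staging folder).
def stage(rows, last_modified):
    return [{**row, "lastModified": last_modified} for row in rows]

staged = stage([{"id": "1"}], "2022-05-01T10:00:00Z") + \
         stage([{"id": "2"}], "2022-05-02T11:30:00Z")

# Stage 2: merge all staged rows into a single CSV
# (the final copy activity with merge behavior).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "lastModified"])
writer.writeheader()
writer.writerows(staged)
merged = buf.getvalue()
print(merged)
```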
I'm trying to create a dynamic copy activity which uses Get Metadata and a Filter activity within Data Factory to do incremental loads.
The file path I'm trying to copy looks like:
2022/08/01
It goes from a year folder to a month folder to day-of-the-month folders, and each day folder contains several files.
I need the Get Metadata and Filter activities to read the files; however, the output directory on the Filter activity does not have the desired path.
The variable I currently have is set with the dynamic expression:
@concat(variables('v_BlobSourceDirectory'), variables('v_Set_Date'))
Filter activity items:
@activity('Get Source').output.childItems
Filter activity condition:
@and(equals(item().type,'Folder'), endswith(item().name, variables('v_Set_Date')))
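A rough Python equivalent of that filter condition (the item shape matches the Get Metadata childItems output; helper names are illustrative):

```python
# Python equivalent of the Filter activity condition
# @and(equals(item().type, 'Folder'),
#      endswith(item().name, variables('v_Set_Date'))).
def keep(item, set_date):
    return item["type"] == "Folder" and item["name"].endswith(set_date)

child_items = [
    {"name": "01", "type": "Folder"},
    {"name": "notes.txt", "type": "File"},
]
kept = [i["name"] for i in child_items if keep(i, "01")]
print(kept)  # ['01']
```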
I have reproduced the ADLS Gen2 file path in 2022/09/21 format and added sample files.
Using the Get Metadata activity, all the child items have been pulled and given as input to the ForEach activity.
Added dynamic values for month and days.
Every child item is iterated through the ForEach activity.
The ForEach activity output is given to the Copy activity.
Result: every child item is copied and stored in Dynamicoutput1.
I am trying to parse the $$FILEPATH value in the "Additional columns" section of the Copy Activity.
The file paths have a format of time_period=202105/part-12345.parquet. I would like just the "202105" portion of the file path. I cannot hardcode it because there are other time_period folders.
I've tried this (from the link below): @{substring($$FILEPATH, add(indexOf($$FILEPATH, '='),1), sub(indexOf($$FILEPATH, '/'),6))} but I get an error saying Unrecognized expression: $$FILEPATH.
The only other things I can think of are: 1) Get Metadata activity + ForEach activity, or 2) possibly trying to do this in a Data Flow.
$$FILEPATH is a reserved variable that stores the file path; you cannot use $$FILEPATH inside a dynamic expression.
You have to create a variable to store the folder name as required and then pass it dynamically in an additional column.
Below is what I have tried.
As your folder name is not static, get the folder names using the Get Metadata activity.
Get Metadata Output:
Pass the output to the ForEach activity to loop all the folders.
Add a variable at the pipeline level to store the folder name.
In the ForEach activity, add a Set variable activity to extract the date part from the folder name and assign it to the variable:
@substring(item().name, add(indexof(item().name, '='),1), sub(length(item().name), add(indexof(item().name, '='),1)))
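That expression takes everything after the '=' in the folder name; a rough Python sketch (the function name is illustrative):

```python
# Python equivalent of the Set variable expression: substring starting
# one past '=' for (length - (index of '=' + 1)) characters,
# i.e. everything after the '='.
def extract_after_equals(name: str) -> str:
    start = name.index("=") + 1
    return name[start:]

date_part = extract_after_equals("time_period=202105")
print(date_part)  # 202105
```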
Output of Set variable:
In source dataset, parameterize the path/filename to pass them dynamically.
Add copy data activity after set variable and select the source dataset.
a) Pass the current item name of the ForEach activity as a file path. Here I hardcoded the filename as *.parquet to copy all files from that path (this works only when all files have the same structure).
b) Under Additional columns, add a new column, give it a name, and under value select Add dynamic content and add the existing variable.
Add Sink dataset in Copy data Sink. I have added the Azure SQL table as my sink.
In Mapping, add filename (new column) to the mappings.
When you run the pipeline, the ForEach activity runs once for each item returned by the Get Metadata activity.
Output:
So basically my issue is this: I will use Get Metadata to get the names of the files from a source folder in the storage account in Azure. I need to parse each name and insert the data into the respective table; example below.
The file name would be in this format: customer_GUID_TypeOfData_other information.csv
i.e. 1c56d6s4s33s4_Sales_09112021.csv
156468a5s5s54_Inventory_08022021.csv
So these are two different customers and two different types of information.
The tables in SQL will be named exactly that, without the date: 156468a5s5s54_Inventory or 1c56d6s4s33s4_Sales.
How can I copy the data from the CSV to the respective table depending on the file name? I will also need to insert or update existing rows in the destination table, based on a unique identifier in the file dataset, using Azure Data Factory.
Get the file name using the Get Metadata activity and copy data from the CSV to the Azure SQL table using a Data flow activity with upsert enabled.
Input blob files:
Step1:
• Create a DelimitedText source dataset. Create a parameter for the filename to pass it dynamically.
• Create an Azure SQL Database sink dataset and create a parameter to pass the table name dynamically.
Source dataset:
Sink dataset:
Step2:
• Connect the source dataset to the Get Metadata activity and pass *.csv as the file name to get a list of all file names from the blob folder.
Output of Get Metadata:
Step3:
• Connect the output of the Get Metadata activity to a ForEach loop to load all the incoming source files/data to the sink.
• Add an expression to the items to get the child items from the previous activity.
@activity('Get Metadata1').output.childItems
Step4:
• Add a Data flow activity inside the ForEach loop.
• Connect the source to the source dataset.
Dataflow Source:
Step5:
• Connect Sink to Sink dataset.
• Enable Allow upsert to update the record if it already exists, based on a unique key column.
Step6:
• Add an AlterRow transformation between source and sink to add the condition for the upsert.
• Upsert when the unique key column is not null, i.e. the row is found.
Upsert if: isNull(id)==false()
Step7:
• In the ForEach loop's Data flow settings, add expressions for the source filename and sink table name dynamically.
Src_file_name: @item().name
• As we are extracting the sink table name from the source file name, split the file name on the underscore "_" and then combine the first two strings to eliminate the date part.
Sink_tbname: @concat(split(item().name, '_')[0],'_',split(item().name, '_')[1])
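A rough Python sketch of that table-name derivation (the function name is illustrative):

```python
# Python equivalent of the Sink_tbname expression
# @concat(split(item().name, '_')[0], '_', split(item().name, '_')[1]):
# join the first two underscore-separated parts of the file name,
# dropping the trailing date and extension.
def sink_table_name(file_name: str) -> str:
    parts = file_name.split("_")
    return parts[0] + "_" + parts[1]

table = sink_table_name("1c56d6s4s33s4_Sales_09112021.csv")
print(table)  # 1c56d6s4s33s4_Sales
```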
Step8:
When the pipeline runs, you can see the loop execute once per source file in the blob folder and load data into the respective tables based on the file names.