Azure Data Factory - Event Triggers on Files In Multiple Folders

We are invoking an ADF pipeline using an event-based trigger.
Is there a way to trigger this pipeline only when a file arrives in both of these child folders?
e.g.
ParentFolder
-- ChildFolder1
-- ChildFolder2
Now we would like to trigger our pipelines only if a new file arrives in both of these folders, i.e. ChildFolder1 and ChildFolder2.

There is no out-of-the-box approach for this. I can think of the below alternatives.
First Approach
You can set an event trigger on ChildFolder2.
You can then use a Lookup activity or a Get Metadata activity that checks for the file by name in ChildFolder1, i.e. whether the file has already been created there.
If you would like to check after some time, say a delay of 10-15 minutes, you could make use of the Wait activity.
Now, if the file exists, you continue with the rest of the pipeline's execution. If the file has not been created in ChildFolder1, you end the pipeline with no activity carried out.
The pipeline will eventually be triggered when the file is created in ChildFolder2, and the execution flow branches on an If Condition activity based on the existence of the file in ChildFolder1 (an example expression follows).
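For example, the If Condition expression can simply check the 'exists' output of that Get Metadata activity (the activity name below is illustrative, and 'Exists' must be selected in its field list):

@activity('Check ChildFolder1 file').output.exists

Put the remaining activities in the True branch and leave the False branch empty, so the run ends quietly whenever the ChildFolder1 file is not there yet.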
Second Approach
If you don't have the file name and would like to pick up files dynamically based on when they were created:
In the same way as above, you could set an event trigger on ChildFolder2.
During pipeline execution you then filter the files in ChildFolder1 based on their timestamps relative to the pipeline start. This is slightly tricky (see the sketch below).
You do a Get Metadata on ChildFolder1 and filter the results using a ForEach and an If Condition (see 'get the latest added file in a folder [Azure Data Factory]').
If any such file exists, execute the rest of the activities; otherwise end the pipeline execution.
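A rough sketch of that filter, with illustrative activity names: an outer Get Metadata lists ChildFolder1 with the Child items field, a ForEach iterates over the result, an inner Get Metadata (whose dataset takes @item().name as the file name) returns each file's Last modified, and an If Condition compares it against the pipeline trigger time, here treating anything modified in the previous 15 minutes as new:

ForEach items: @activity('List ChildFolder1').output.childItems
If Condition expression: @greater(ticks(activity('Get file details').output.lastModified), ticks(addMinutes(pipeline().TriggerTime, -15)))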

Related

Azure Synapse Analytics - deleting pipeline Folder

I am new to Synapse and I have to make a pipeline that will delete files from folders in a hierarchy like the attached image (expected hierarchy). The red half circles mark the files I would like to delete, for example files older than 2 months.
For now I have made a pipeline for a single folder, and using a ForEach loop I can get to the files and delete the corresponding ones, and it works. Since I have about 60-70 folders and even more files, I wanted to go a level higher up and make one pipeline that executes over each folder. And with this there is a problem: when I use a Get Metadata activity on the top folder and a ForEach loop to take the folder names, I cannot access the files inside the folders, only the folders themselves. Could someone help me solve this?
Image: deleting pipeline for a single folder using a ForEach loop
We can achieve this using nested ForEach activities with the help of an Execute Pipeline activity. As mentioned, Get Metadata with wildcards returns all files but no folders, and the Delete activity is unable to recognize wildcard folder paths (Folder/*).
I have created a similar folder structure for the demo. In my pipeline, I have first created an array parameter req_files (sample1.csv and sample2.csv) with the names of the required files.
Note: If you want to do this dynamically, you can use an Append Variable activity to build the required file names (file09/22 and file08/22).
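As an illustration of that note (these expressions are an assumption about how such names could be built, not part of the original setup), two Append Variable values for the current and previous month could be:

@concat('file', formatDateTime(utcNow(), 'MM/yy'))
@concat('file', formatDateTime(addToTime(utcNow(), -1, 'Month'), 'MM/yy'))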
I used one Get Metadata activity to get the folder names (which are inside the root folder). I am iterating through the output of this Get Metadata in my ForEach activity (items value is @activity('root folder contents').output.childItems).
Inside my ForEach, I used another Get Metadata activity on each of the sub folders (to get its file contents).
Now I have the folder name and the list of files inside it. I am going to use Execute Pipeline to implement the nested ForEach. Create 3 parameters in a new pipeline called delete_pipeline (where I perform the delete): current_folder, folder_files and files_needed.
Pass the following dynamic content for each of them from the parent pipeline (a JSON sketch of the Execute Pipeline activity follows the list).
current_folder: @item().name
folder_files: @activity('sub folder contents').output.childItems
files_needed: @pipeline().parameters.req_files
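For reference, the Execute Pipeline activity inside the parent ForEach would look roughly like this in JSON (a sketch only; the activity name is illustrative, the rest uses the names above, and the exact shape generated by the authoring UI may differ slightly):

{
    "name": "Execute delete_pipeline",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": { "referenceName": "delete_pipeline", "type": "PipelineReference" },
        "waitOnCompletion": true,
        "parameters": {
            "current_folder": "@item().name",
            "folder_files": "@activity('sub folder contents').output.childItems",
            "files_needed": "@pipeline().parameters.req_files"
        }
    }
}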
Now in delete_pipeline, I have a ForEach loop to loop through the list of files we are passing (items value is @pipeline().parameters.folder_files).
Inside this ForEach, I am using an If Condition activity. This is because I want to delete the files which are not in my req_files parameter (the array from the parent pipeline which we passed to the files_needed parameter in delete_pipeline). The condition for the If Condition activity is the following:
@contains(pipeline().parameters.files_needed, item().name)
We need to delete a file only when it is not present in req_files (files_needed). So, when the condition is false, we perform the delete.
I have created 2 parameters, file_namepath_of_file_to_delete and file_name_to_delete, in the dataset I am using for the Delete activity, with the following dynamic content (the dataset-side wiring is sketched after the list):
file_namepath_of_file_to_delete: Folder/@{pipeline().parameters.current_folder}
file_name_to_delete: @item().name
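Inside that dataset, the connection's folder and file fields would then reference these parameters (this wiring is implied rather than shown above):

Directory: @dataset().file_namepath_of_file_to_delete
File name: @dataset().file_name_to_delete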
When I run the pipeline, it keeps the required files and deletes the rest. The following are output images for reference.
Debug output: https://i.imgur.com/E6GNVHW.png
My folder after I run the pipeline: https://i.imgur.com/bqN00Dw.png

How to ignore files with an invalid schema in Azure Data Factory?

I am new to Azure Data Factory. Currently, I am trying to copy some data from files in blob storage to Azure Data Explorer.
The pipeline reads multiple files from a specific Cosmos root directory and copies the data to Azure Data Explorer. Currently, the pipeline successfully handles the copy activities. I now want to add a validation component so that the pipeline only reads files with the correct schema and ignores the rest of the files.
The schema is as follows: Name, Type, Value, Date.
The pipeline itself should not fail/stop if the files are of invalid type or schema, it should just skip them and continue on with the rest of the files.
What component can I use for this validation?
Thanks!
You can use a ForEach activity to achieve this requirement. The ForEach activity (in this case) is used to copy each source file into the destination one at a time. In case the schema doesn't match, that copy activity fails.
But when a copy activity inside a ForEach fails, the pipeline execution does not fail/stop. It throws an error for the particular copy activity that failed, but continues executing the copy activity for the remaining files.
Look at the following demonstration for clear understanding.
The following are the files I am using as samples. The file sample_3.csv does not match the schema of my destination table.
I used a Get Metadata activity to get the names of the above files using the child items field list. I passed these values to the ForEach as @activity('Get Metadata1').output.childItems.
Inside the ForEach, I have a Copy activity along with a Wait activity (the Wait executes only if the Copy activity fails; a sketch of this wiring follows).
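A rough sketch of that wiring ('Get Metadata1' is the activity mentioned above; the Copy and Wait names are whatever you used):

ForEach items: @activity('Get Metadata1').output.childItems
Copy activity source file name (passed to a parameterized dataset): @item().name
Wait activity dependsOn: the Copy activity, with dependencyConditions: ["Failed"]

Because the Wait runs only on the Failed path and itself succeeds, the failed Copy is treated as handled, which is why the overall run is reported as succeeded in the debug output below.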
When I execute this pipeline, it completes successfully without any errors. The following is the debug output.
The third iteration of the copy activity fails because sample_3.csv has an incorrect schema, and yet the pipeline run completes successfully without failing.
UPDATE:
Mapping does not need to be specified in the Copy activity.

How to rerun the Get Metadata activity if no childItems are found, and usage of timeout in Azure Data Factory

I have a Get Metadata activity which helps me find files based on a regex. When the pattern is matched I am able to retrieve the child items, but when it is not matched it simply returns an empty childItems list, and the timeout is not working as expected. I want to rerun the Get Metadata activity if no child items are found, with a timeout of a maximum of 2 days for the search.
Available file names in blob:
SampleStores_multi
SampleStores_single
Stores.txt
Input Filename: Sales_*
Could you please help me with the solution?
You can create a storage event trigger for your pipeline and provide the wildcard file path, so the pipeline is triggered when the required file is uploaded to that folder path.
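A minimal sketch of such a storage event trigger for the Sales_* pattern (the trigger name, container, pipeline name and scope values are placeholders):

{
    "name": "SalesFileTrigger",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/samplecontainer/blobs/Sales_",
            "ignoreEmptyBlobs": true,
            "events": [ "Microsoft.Storage.BlobCreated" ],
            "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
        },
        "pipelines": [
            { "pipelineReference": { "referenceName": "<your-pipeline>", "type": "PipelineReference" } }
        ]
    }
}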
Alternatively, you can loop the Get Metadata activity inside an Until loop:
• Create a variable that acts as a flag and is set when the file is found.
• Run the Until loop until the flag is enabled, setting the loop's timeout to the maximum search window (2 days), as sketched below.
• Inside the loop, check whether a matching file is present in the child items of the Get Metadata activity output and update the flag using a Set Variable activity.
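A rough sketch of that loop (the variable and activity names are illustrative):

Pipeline variable: file_found (Boolean, default false)
Until expression: @variables('file_found')  (set the Until activity's Timeout to 2.00:00:00 for the 2-day limit)
Inside the Until loop:
  Get Metadata on the folder with the Child items field
  Filter activity - items: @activity('Get Metadata1').output.childItems, condition: @startsWith(item().name, 'Sales_')
  Set Variable file_found - value: @greater(length(activity('Filter matching files').output.Value), 0)
  Optionally a Wait activity, so the folder is not polled back-to-back for two days.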

Azure Data Factory: How to copy specific files based on file's create date?

I failed with the below test case:
Create new folderA on SourceContainer and insert fileA
Trigger Pipelines to copy it to TargetContainer successfully
Delete folderA/fileA from TargetContainer
Trigger Pipelines with specific start and end time (try to not copy all folders/files again)
Cannot see folderA/fileA on TargetContainer
Within the Copy activity's source settings, you can specify the date range of the files you need to copy (the 'Filter by last modified' start time and end time):
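In JSON terms, that source-side filter on the Copy activity looks roughly like this (a sketch assuming a delimited text source against blob storage; the dates are placeholders and could instead be bound to pipeline parameters such as @pipeline().parameters.windowStart; note the filter is on each blob's last-modified time, which for newly created files is effectively the creation time):

"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": true,
        "modifiedDatetimeStart": "2022-09-01T00:00:00Z",
        "modifiedDatetimeEnd": "2022-09-02T00:00:00Z"
    }
}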
And with respect to triggering a pipeline when a file is created, you can use an event trigger to achieve that.

Filter recent files in Logic Apps' SFTP when files are added/modified trigger

I have this Logic App that connects to an SFTP server and it's triggered by the "files are added or modified" trigger. It's set to run every 10 minutes, looking for new/modified files and copying them to an Azure storage account.
The problem is that this SFTP server path is set to overwrite a set of files every X minutes (I have no control over this) and so, pretty often the Logic App overlaps with the update process of these files and downloads files that are still being written. The result is corrupted files.
Is there a way to add a filter to the 'When files are added or modified (properties only)' trigger so that it only takes into consideration files with a modified date that is at least 1 minute old?
That way, files that are currently being written won't be added to the list of files to download. The next run of the Logic App would then fetch these ignored files, and so on.
UPDATE
I've found a Trigger Conditions option in the trigger's settings, but I can't find any documentation about it.
Based on testing the trigger "When files are added or modified", it seems we cannot add a filter in the trigger itself that keeps only the records modified at least 1 minute ago. We can only get the files' LastModified datetimes, loop over them, and use an "If" condition to judge whether we should download each one.
Update:
The expression in the screenshot is:
sub(ticks(utcNow()), ticks(triggerBody()?['LastModified']))
Update - workaround:
Is it possible to add a "Delay" action when the last modified time is less than 1 minute old? For example, if the file was modified less than 60 seconds ago, use "Delay" to wait 5 minutes until the overwrite operation completes, then do the download.
I checked the sample @equals(triggers().code, 'InternalServerError'); it actually uses the condition functions from the logical comparison functions, so the key point is to make sure the property you want to filter on is in the trigger or triggerBody, or you will get the below error.
So I changed the expression to something like @greater(triggerBody().LastModified, '2020-04-20T11:23:00Z'); with this, a file modified before 2020-04-20T11:23:00Z does not trigger the flow.
You could also use other functions such as less, greaterOrEquals, etc. from the logical comparison functions.
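Putting those pieces together, a trigger condition that only lets files through once they are at least one minute old could look like this (assuming LastModified is available in the trigger body, as in the expression above; 600000000 ticks equals 60 seconds, since one tick is 100 nanoseconds):

@greater(sub(ticks(utcNow()), ticks(triggerBody()?['LastModified'])), 600000000)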
