Copy specific files from a container using a wildcard filename in an ADF pipeline - Azure

I'm trying to copy specific files from a container in a Storage Account using an ADF pipeline. Let's say the container has the following files:
aa_aaa_01_yyyymmdd.csv
aa_abb_01_yyyymmdd.csv
aa_aaa_02_yyyymmdd.csv
aa_aaa_03_yyyymmdd.csv
aa_abb_02_yyyymmdd.csv
ab_abc_01_yyyymmdd.csv
My pipeline has to copy all the files beginning with 'aa_aaa_'. I tried using the * wildcard when creating the source dataset - like "aa_aaa_*.csv" - but it didn't work; the validation fails.
Please help. Thanks

You can use the prefix option in the copy activity source settings, as shown below.
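For reference, a minimal sketch of what the Copy activity could look like in pipeline JSON, assuming a delimited-text source dataset on Azure Blob Storage (the dataset names are placeholders). The prefix setting picks up blobs whose names start with aa_aaa_; alternatively, you can set wildcardFileName to aa_aaa_*.csv in the same storeSettings block - wildcards are meant to go in the copy activity's source settings rather than the dataset's file path, which is likely why the dataset validation fails.

{
    "name": "Copy aa_aaa files",
    "type": "Copy",
    "inputs": [ { "referenceName": "SourceCsvDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "SinkCsvDataset", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": {
            "type": "DelimitedTextSource",
            "storeSettings": {
                "type": "AzureBlobStorageReadSettings",
                "recursive": false,
                "prefix": "aa_aaa_"
            },
            "formatSettings": { "type": "DelimitedTextReadSettings" }
        },
        "sink": {
            "type": "DelimitedTextSink",
            "storeSettings": { "type": "AzureBlobStorageWriteSettings" },
            "formatSettings": { "type": "DelimitedTextWriteSettings", "fileExtension": ".csv" }
        }
    }
}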

Related

Azure Synapse Analytics - deleting pipeline Folder

I am new to Synapse and I have to make a pipeline that will delete files from folders in a hierarchy like the attached image (expected hierarchy). The red half circles mark the files I would like to delete, for example files older than 2 months.
For now I have made a pipeline for a single folder, and using the ForEach loop I can get to the files and delete the corresponding ones. It works, but since I have about 60-70 folders and even more files, I wanted to go a level higher up and make a pipeline that executes for each folder. And with this there is a problem: when I use a Get Metadata activity on the top folder and a ForEach loop to take the folder names, I can only access the folders, not the files inside them. Could someone help me with how to solve this?
deleting pipeline for a single folder using a for each loop
We can achieve this using nested ForEach activities with the help of the Execute Pipeline activity. As mentioned, Get Metadata with wildcards returns all files without folders, and the Delete activity is unable to recognize wildcard folder paths (Folder/*).
I have created a similar folder structure for a demo. In my pipeline, I have first created an array parameter req_files (sample1.csv and sample2.csv) with the names of the required files.
Note: If you want to do this dynamically, you can use an Append variable activity to build the required file names (file09/22 and file08/22).
I used one Get Metadata activity to get the folder names (which are inside the root folder). I am iterating through the output of this Get Metadata activity in my ForEach activity (the items value is @activity('root folder contents').output.childItems).
Inside my ForEach, I used another Get Metadata activity to loop through each of the sub folders (to get their file contents), as sketched below.
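As a rough sketch (placeholder dataset and parameter names, not the exact JSON from the original post), the outer ForEach with the inner Get Metadata could look like this:

{
    "name": "For each folder",
    "type": "ForEach",
    "dependsOn": [ { "activity": "root folder contents", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
        "items": { "value": "@activity('root folder contents').output.childItems", "type": "Expression" },
        "activities": [
            {
                "name": "sub folder contents",
                "type": "GetMetadata",
                "typeProperties": {
                    "dataset": {
                        "referenceName": "folder_dataset",
                        "type": "DatasetReference",
                        "parameters": { "folder_name": { "value": "@item().name", "type": "Expression" } }
                    },
                    "fieldList": [ "childItems" ],
                    "storeSettings": { "type": "AzureBlobStorageReadSettings" }
                }
            }
        ]
    }
}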
Now I have the folder name and the list of files inside it. I am going to use an Execute Pipeline activity to implement the nested ForEach. Create 3 parameters in a new pipeline called delete_pipeline (where I perform the delete): current_folder, folder_files and files_needed.
Pass the following dynamic content for each of them from the parent pipeline (a sketch of the Execute Pipeline activity follows the list).
current_folder: @item().name
folder_files: @activity('sub folder contents').output.childItems
files_needed: @pipeline().parameters.req_files
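A sketch of how the Execute Pipeline activity (placed inside the ForEach, after the inner Get Metadata) might pass these values; the activity and pipeline names match the ones used above, everything else is a placeholder:

{
    "name": "Execute delete_pipeline",
    "type": "ExecutePipeline",
    "dependsOn": [ { "activity": "sub folder contents", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
        "pipeline": { "referenceName": "delete_pipeline", "type": "PipelineReference" },
        "waitOnCompletion": true,
        "parameters": {
            "current_folder": { "value": "@item().name", "type": "Expression" },
            "folder_files": { "value": "@activity('sub folder contents').output.childItems", "type": "Expression" },
            "files_needed": { "value": "@pipeline().parameters.req_files", "type": "Expression" }
        }
    }
}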
Now in delete_pipeline, I have a ForEach loop to loop through the list of files we are passing (the items value is @pipeline().parameters.folder_files).
Inside this ForEach, I am using an If Condition activity. This is because I want to delete files which are not in my req_files parameter (the array from the parent pipeline which we passed to the files_needed parameter in delete_pipeline). The condition for the If Condition activity will be as follows:
@contains(pipeline().parameters.files_needed, item().name)
We need to delete the file only when it is not present in req_files (files_needed). So, when the condition is false, we perform delete.
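Sketched in pipeline JSON (activity names are placeholders), the If Condition with the Delete activity in its false branch could look roughly like this; the dataset parameters used by the Delete activity are described next:

{
    "name": "If file is needed",
    "type": "IfCondition",
    "typeProperties": {
        "expression": { "value": "@contains(pipeline().parameters.files_needed, item().name)", "type": "Expression" },
        "ifFalseActivities": [
            {
                "name": "Delete unwanted file",
                "type": "Delete",
                "typeProperties": {
                    "dataset": { "referenceName": "delete_dataset", "type": "DatasetReference" },
                    "enableLogging": false,
                    "storeSettings": { "type": "AzureBlobStorageReadSettings", "recursive": false }
                }
            }
        ]
    }
}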
I have created 2 parameters, file_namepath_of_file_to_delete and file_name_to_delete, in the dataset I am using for the Delete activity, with the following dynamic content.
file_namepath_of_file_to_delete: Folder/@{pipeline().parameters.current_folder}
file_name_to_delete: @item().name
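Assuming a delimited-text dataset on Azure Blob Storage (linked service and container names are placeholders), the parameterized dataset could look roughly like this; the Delete activity then supplies Folder/@{pipeline().parameters.current_folder} and @item().name as the values for the two parameters:

{
    "name": "delete_dataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": { "referenceName": "AzureBlobStorageLinkedService", "type": "LinkedServiceReference" },
        "parameters": {
            "file_namepath_of_file_to_delete": { "type": "string" },
            "file_name_to_delete": { "type": "string" }
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "data",
                "folderPath": { "value": "@dataset().file_namepath_of_file_to_delete", "type": "Expression" },
                "fileName": { "value": "@dataset().file_name_to_delete", "type": "Expression" }
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}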
When I run the pipeline, it keeps the required files and deletes the rest. The following are output images for reference.
Debug output: https://i.imgur.com/E6GNVHW.png
My folder after I run the pipeline: https://i.imgur.com/bqN00Dw.png

How to ignore files with invalid schema in Azure Data Factory?

I am new to Azure Data Factory. Currently, I am trying to copy some data from files in blob storage to Azure Data Explorer.
The pipeline reads multiple files from a specific cosmos root directory and copies the data to ADE. Currently, the pipeline successfully handles the copy activities. I now want to add a validation component so that the pipeline only reads files with the correct schema and ignores the rest of the files.
The schema is as follows: Name, Type, Value, Date.
The pipeline itself should not fail/stop if the files are of invalid type or schema, it should just skip them and continue on with the rest of the files.
What component can I use for this validation?
Thanks!
You can use a ForEach activity to achieve this requirement. The ForEach activity (in this case) can be used to copy each source file into the destination. If the schema doesn't match, that copy activity fails.
But when a copy activity inside a ForEach fails, the pipeline execution does not fail/stop. It throws an error for the particular copy activity that fails, but continues executing the copy activity for the remaining files.
Look at the following demonstration for clear understanding.
The following are the files I am using as samples. The file sample_3.csv does not match the schema of my destination table.
I used a Get Metadata activity to get the file names of the above files using the childItems field list. I passed these values to the ForEach activity as @activity('Get Metadata1').output.childItems.
Inside the ForEach, I have a copy activity along with a Wait activity (the Wait executes only if the copy activity fails), as sketched below.
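A rough sketch of those inner activities in pipeline JSON (dataset names and parameters are placeholders); the key detail is the Wait activity's dependency condition of Failed, which absorbs the error so the remaining iterations keep running:

{
    "name": "ForEach file",
    "type": "ForEach",
    "typeProperties": {
        "items": { "value": "@activity('Get Metadata1').output.childItems", "type": "Expression" },
        "activities": [
            {
                "name": "Copy file",
                "type": "Copy",
                "inputs": [ {
                    "referenceName": "SourceCsvDataset",
                    "type": "DatasetReference",
                    "parameters": { "file_name": { "value": "@item().name", "type": "Expression" } }
                } ],
                "outputs": [ { "referenceName": "AdxTargetTable", "type": "DatasetReference" } ],
                "typeProperties": {
                    "source": { "type": "DelimitedTextSource" },
                    "sink": { "type": "AzureDataExplorerSink" }
                }
            },
            {
                "name": "Wait on failure",
                "type": "Wait",
                "dependsOn": [ { "activity": "Copy file", "dependencyConditions": [ "Failed" ] } ],
                "typeProperties": { "waitTimeInSeconds": 1 }
            }
        ]
    }
}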
When I execute this pipeline, it completes successfully without any errors. The following is the debug output.
The third iteration of copy activity fails as sample_3.csv has incorrect schema. And yet the pipeline runs successfully without failing.
UPDATE:
Mapping need not be specified in copy activity.

Azure Data Factory: How to copy specific files based on file's create date?

I failed for the below test case:
Create new folderA on SourceContainer and insert fileA
Trigger Pipelines to copy it to TargetContainer successfully
Delete folderA/fileA from TargetContainer
Trigger Pipelines with specific start and end time (try to not copy all folders/files again)
Cannot see folderA/fileA on TargetContainer
Within the Copy activity's source settings, you can specify the date range (filter by last modified) of the files which you need to copy.
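In the pipeline JSON this corresponds to the modifiedDatetimeStart and modifiedDatetimeEnd settings on the copy source (note that they filter on the blob's last-modified time, not a true creation time). A sketch with example timestamps only, assuming a delimited-text source on Azure Blob Storage:

"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": true,
        "wildcardFileName": "*",
        "modifiedDatetimeStart": "2021-06-01T00:00:00Z",
        "modifiedDatetimeEnd": "2021-06-02T00:00:00Z"
    },
    "formatSettings": { "type": "DelimitedTextReadSettings" }
}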
And with respect to triggering a pipeline when a file is created, you can use a storage event trigger to achieve that.
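For completeness, a storage event trigger could be sketched like this (subscription, resource group, storage account, container and pipeline names are all placeholders):

{
    "name": "OnBlobCreated",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/sourcecontainer/blobs/",
            "ignoreEmptyBlobs": true,
            "events": [ "Microsoft.Storage.BlobCreated" ],
            "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
        },
        "pipelines": [
            { "pipelineReference": { "referenceName": "CopyNewFilesPipeline", "type": "PipelineReference" } }
        ]
    }
}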

Parametrization using Azure Data Factory

I have a pipeline job in Azure Data Factory which I want to run while passing through all the files for a specific month, for example.
I have a folder called 2020/01; inside this folder are numerous files with different names.
The question is: Can one pass a parameter through to only extract and load the files for 2020/01/01 and 2020/01/02 if that makes sense?
Excellent, thanks Jay, it worked and I can now run my pipeline jobs passing through the month or even day level.
Really appreciate your response, have a fantastic day.
Regards
Rayno
The question is: Can one pass a parameter through to only extract and load the files for 2020/01/01 and 2020/01/02 if that makes sense?
You didn't mention which connector you are using in the pipeline job, but you mentioned a folder in your question. As far as I know, most folder paths can be parameterized in the ADF copy activity configuration.
You could create a param and then apply it in the wildcard folder path.
Even if your files' names have the same prefix, you could apply 01*.json to the wildcard file name property, as in the sketch below.
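For illustration only, a sketch of the copy activity source using a hypothetical pipeline parameter named folderPath (for example 2020/01) for the wildcard folder path, plus 01*.json as the wildcard file name; an Azure Blob Storage connector and JSON files are assumed here:

"source": {
    "type": "JsonSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": true,
        "wildcardFolderPath": { "value": "@pipeline().parameters.folderPath", "type": "Expression" },
        "wildcardFileName": "01*.json"
    },
    "formatSettings": { "type": "JsonReadSettings" }
}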

File Transform task fails to transform XML configurations on zipped package

I'm working on a Release pipeline which will perform a transformation on the App Service worker configs, then publish the workers + web application.
My input package is a zip package produced by MSBuild publish (from an ASP.NET build pipeline). It contains, among others, the following files:
...\PackageTmp\app_data\jobs\triggered\BillingWorker\App.Prod.config
...\PackageTmp\app_data\jobs\triggered\BillingWorker\App.Test.config
...\PackageTmp\app_data\jobs\triggered\BillingWorker\BillingWorker.exe.config
...\PackageTmp\app_data\jobs\triggered\EtlWorker\App.Prod.config
...\PackageTmp\app_data\jobs\triggered\EtlWorker\App.Test.config
...\PackageTmp\app_data\jobs\triggered\EtlWorker\EtlWorker.exe.config
...\PackageTmp\Web.config
...\PackageTmp\Web.Test.config
...\PackageTmp\Web.Prod.config
...\PackageTmp\many other files
Transformation of Web.config is done correctly by the Publish to Azure Web App task. However, the workers' configs aren't transformed automatically, so I added a File Transform task.
This step doesn't work and here is the output:
2019-08-14T15:41:01.1435779Z ##[section]Starting: File Transform: config
2019-08-14T15:41:01.1576716Z ==============================================================================
2019-08-14T15:41:01.1576853Z Task : File transform
2019-08-14T15:41:01.1576932Z Description : Replace tokens with variable values in XML or JSON configuration files
2019-08-14T15:41:01.1576994Z Version : 1.156.0
2019-08-14T15:41:01.1600786Z Author : Microsoft Corporation
2019-08-14T15:41:01.1600885Z Help : https://learn.microsoft.com/azure/devops/pipelines/tasks/utility/file-transform
2019-08-14T15:41:01.1600986Z ==============================================================================
2019-08-14T15:41:01.6339900Z ##[warning]Unable to apply transformation for the given package. Verify the following.
2019-08-14T15:41:01.6351367Z ##[warning]Unable to apply transformation for the given package. Verify the following.
2019-08-14T15:41:01.8369297Z Initiated variable substitution in config file :
...
... many lines about variable substitution
...
This output looks wrong, as it produces a warning without any explanation. How can I work around this warning?
The problem is that the File Transform task relies strongly on the names of both files - the one being transformed and the one containing the transformation rules. A strict naming convention is required, which can be described as follows:
A template named Name.xml can be transformed only by files named Name.Debug.xml, Name.Release.xml, and more generally Name.{anything-here}.xml.
What's happening here is that the App.config file is renamed to {YourApplicationName}.exe.config during the build, thus the transformation using App.Debug.config fails.
I see 2 workarounds:
1. Preserve the original name App.config
a. In a project file, set App.config file's property to Copy to output directory: Copy always
b. Set up the File Transform task with the args -transform *.Debug.config -xml *.config -result {YourApplicationName}.exe.config
c (optional). If you didn't specify -result in the task, you need to set up another task to rename App.config to {YourApplicationName}.exe.config after the transformation has finished (for example a Command Line task with the command copy App.config {YourApplicationName}.exe.config /Y)
2. Write a custom transformation script
a. Unzip package into temp folder
b. Transform the file using PowerShell (make use of Microsoft.Web.XmlTransform.dll installed on the agent)
c. Zip again and replace the original zip.
The native step in the official task doesn't support transformation inside zip files. You can use another task to do it before the deploy task.
I used this and it worked fine for me:
https://marketplace.visualstudio.com/items?itemName=solidify-labs.vsts-task-tokenize-in-archive
