Parametrization using Azure Data Factory

I have a pipeline job in Azure Data Factory which I want to run, but passing through only the files for a specific month, for example.
I have a folder called 2020/01, and inside this folder are numerous files with different names.
The question is: can one pass a parameter through to only extract and load the files for 2020/01/01 and 2020/01/02, if that makes sense?

Excellent, thanks Jay, it worked and I can now run my pipeline jobs passing through the month or even day level.
Really appreciate your response, have a fantastic day.
Regards
Rayno

The question is: Can one pass a parameter through to only extract and load the files for 2020/01/01 and 2020/01/02 if that makes sense?
You didn't mention which connector you are using in the pipeline job, but you did mention a folder in your question. As far as I know, the folder path can be parameterized in the ADF copy activity configuration for most connectors.
You could create a parameter:
Then apply it in the wildcard folder path:
Even if your files' names share the same prefix, you could apply 01*.json to the wildcard file name property.
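Put together, the copy activity source would look roughly like the snippet below. This is only a sketch: the parameter name folder and the JsonSource / AzureBlobStorageReadSettings types are my assumptions for a JSON dataset on Blob Storage, not taken from your pipeline.
"source": {
    "type": "JsonSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": true,
        // hypothetical pipeline parameter holding e.g. "2020/01" or "2020/01/01"
        "wildcardFolderPath": {
            "value": "@pipeline().parameters.folder",
            "type": "Expression"
        },
        // matches files whose names start with 01
        "wildcardFileName": "01*.json"
    }
}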

Related

Copy files from AWS s3 sub folder to Azure Blob

I am trying to copy files out of an S3 bucket using Azure Data Factory. Firstly I want a list of the directories.
Using the CLI I would use {aws s3 ls}.
From there I can iterate over the list in a ForEach and push each entry into a variable.
In ADF, I have tried to use Get Metadata, and although this works in theory, in practice there are 76 files in each directory and the loop runs over 1.5 million times. It just isn't worth it; it takes far too long, especially as listing only the directories takes about 20 seconds for 20,000 directories.
Is there a method to do this listing? When creating the dataset at the bucket root we get a 'no permissions' error, however when we use a specific location it works.
Many thanks
I have found another way of completing this task.
So to begin with, I am using Get Metadata with the Child Items option. It produces an array.
I push this into a string variable. With this variable you can then create a stored procedure to pick it apart, using OPENJSON to get just the values. These can then be pulled apart further to get the directory names.
I then merge these into a table.
Using a Lookup activity I can then run another stored procedure to return the value I require from the table. This whole process runs in a couple of minutes.
Anyone who wants a further explanation, please ask; I will try to create a walkthrough to assist.
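For anyone sketching this out, the Get Metadata -> Set Variable -> Stored Procedure chain described above might look roughly like the fragment below. All names here (the datasets, activities, the dirJson variable and the dbo.usp_ParseChildItems procedure) are placeholders of mine, and the dirJson string variable must be declared on the pipeline.
"activities": [
    {
        "name": "List directories",
        "type": "GetMetadata",
        "typeProperties": {
            "dataset": { "referenceName": "S3RootFolder", "type": "DatasetReference" },
            "fieldList": [ "childItems" ]
        }
    },
    {
        "name": "Child items to string",
        "type": "SetVariable",
        "dependsOn": [ { "activity": "List directories", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": {
            "variableName": "dirJson",
            // serialise the childItems array so SQL can parse it with OPENJSON
            "value": {
                "value": "@string(activity('List directories').output.childItems)",
                "type": "Expression"
            }
        }
    },
    {
        "name": "Parse and store directory names",
        "type": "SqlServerStoredProcedure",
        "dependsOn": [ { "activity": "Child items to string", "dependencyConditions": [ "Succeeded" ] } ],
        "linkedServiceName": { "referenceName": "AzureSqlDb", "type": "LinkedServiceReference" },
        "typeProperties": {
            "storedProcedureName": "dbo.usp_ParseChildItems",
            "storedProcedureParameters": {
                "ChildItemsJson": {
                    "value": { "value": "@variables('dirJson')", "type": "Expression" },
                    "type": "String"
                }
            }
        }
    }
]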

Azure Synapse Analytics - deleting pipeline Folder

I am new to Synapse and I have to make a pipeline that will delete files from folders in a hierarchy like the attached image (expected hierarchy). The red half circles mark the files I would like to delete, for example files older than 2 months.
For now I have made a pipeline for a single folder, and using a ForEach loop I can get to the files and delete the corresponding ones, and it works. Since I have about 60-70 folders and even more files, I wanted to go a level higher up and make a pipeline that executes for each folder. And with this there is a problem: when I use a Get Metadata activity on the top folder and use a ForEach loop to take the folder names, I cannot access the files in the folders, only the folders themselves. Could someone help me with how to solve this?
deleting pipeline for a single folder using a ForEach loop
We can achieve this using nested ForEach activities with the help of an Execute Pipeline activity. As mentioned, Get Metadata with wildcards returns all files without folders, and the Delete activity is unable to recognize wildcard folder paths (Folder/*).
I have created a similar folder structure for the demo. In my pipeline, I have first created an array parameter req_files (sample1.csv and sample2.csv) with the names of the required files.
Note: If you want to do this dynamically, you can use an Append Variable activity to build the required file names (file09/22 and file08/22).
I used one Get Metadata activity to get the folder names (which are inside the root folder). I am iterating through the output of Get Metadata in my ForEach activity (the items value is @activity('root folder contents').output.childItems).
Inside my ForEach, I used another Get Metadata activity to loop through each of the sub folders (to get the file contents).
Now I have the folder name and the list of files inside it. I am going to use Execute Pipeline to implement the nested ForEach. Create 3 parameters in a new pipeline called delete_pipeline (where I perform the delete): current_folder, folder_files and files_needed.
Pass the following dynamic content for each of them from the parent pipeline (a sketch of the Execute Pipeline activity follows this list).
current_folder: @item().name
folder_files: @activity('sub folder contents').output.childItems
files_needed: @pipeline().parameters.req_files
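Put together, the Execute Pipeline activity inside the parent ForEach would look roughly like this (only the activity name is mine; the parameter names and expressions are the ones listed above):
{
    "name": "Run delete_pipeline",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": { "referenceName": "delete_pipeline", "type": "PipelineReference" },
        "waitOnCompletion": true,
        "parameters": {
            "current_folder": { "value": "@item().name", "type": "Expression" },
            "folder_files": { "value": "@activity('sub folder contents').output.childItems", "type": "Expression" },
            "files_needed": { "value": "@pipeline().parameters.req_files", "type": "Expression" }
        }
    }
}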
Now in delete_pipeline, I have a ForEach loop to iterate through the list of files we are passing (the items value is @pipeline().parameters.folder_files).
Inside this ForEach, I am using an If Condition activity. This is because I want to delete files which are not in my req_files parameter (the array from the parent pipeline which we passed to the files_needed parameter in delete_pipeline). The condition for the If Condition activity is as follows:
@contains(pipeline().parameters.files_needed, item().name)
We need to delete the file only when it is not present in req_files (files_needed). So, when the condition is false, we perform delete.
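As a rough skeleton (the activity name is mine), the If Condition inside the ForEach of delete_pipeline would be shaped like this:
{
    "name": "If file is required",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@contains(pipeline().parameters.files_needed, item().name)",
            "type": "Expression"
        },
        // true branch: the file is in files_needed, so nothing happens
        "ifFalseActivities": [
            // false branch: the Delete activity goes here; its dataset parameters are sketched below
        ]
    }
}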
I have created 2 parameters, file_namepath_of_file_to_delete and file_name_to_delete, in the dataset I am using for the Delete activity, with the following dynamic content (the Delete activity that passes them is sketched after this list).
file_namepath_of_file_to_delete: Folder/@{pipeline().parameters.current_folder}
file_name_to_delete: @item().name
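The Delete activity then passes those two values into the parameterized dataset, roughly like below. The dataset name delete_dataset is a placeholder; inside that dataset, the folder path and file name would reference @dataset().file_namepath_of_file_to_delete and @dataset().file_name_to_delete, and you may also need store settings for your storage type.
{
    "name": "Delete unwanted file",
    "type": "Delete",
    "typeProperties": {
        "dataset": {
            "referenceName": "delete_dataset",
            "type": "DatasetReference",
            "parameters": {
                "file_namepath_of_file_to_delete": {
                    "value": "Folder/@{pipeline().parameters.current_folder}",
                    "type": "Expression"
                },
                "file_name_to_delete": {
                    "value": "@item().name",
                    "type": "Expression"
                }
            }
        },
        "enableLogging": false
    }
}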
When I run the pipeline, it keeps the required files and deletes the rest. The following are output images for reference.
Debug output: https://i.imgur.com/E6GNVHW.png
My folder after I run the pipeline: https://i.imgur.com/bqN00Dw.png

copy specific files from a container using wildcard filename in ADF pipeline

I'm trying to copy specific files from a container in a Storage Account using an ADF pipeline. Let's say the container has the following files:
aa_aaa_01_yyyymmdd.csv
aa_abb_01_yyyymmdd.csv
aa_aaa_02_yyyymmdd.csv
aa_aaa_03_yyyymmdd.csv
aa_abb_02_yyyymmdd.csv
ab_abc_01_yyyymmdd.csv
My pipeline has to copy all the files beginning with 'aa_aaa_'. I tried using the * wildcard at the time of creating the source dataset, like "aa_aaa_*.csv", but it didn't work; the validation fails.
Please help. Thanks
You can use the prefix option in the copy activity source, like in the sketch below.
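For example, assuming a delimited-text dataset on Azure Blob Storage (the source and store-settings type names below follow from that assumption), the copy activity source could be configured as:
"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": true,
        // copies only blobs whose names start with aa_aaa_
        "prefix": "aa_aaa_"
    },
    "formatSettings": { "type": "DelimitedTextReadSettings" }
}
Because the prefix is set on the copy activity source rather than in the dataset file name, the dataset itself can point at the container without a wildcard.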

How to reference the most current Physical Sequential (PS) file in JCL

I wanted to create a job where I need to consider the latest file available as input file.
File format is as below: FILE1.TEST.TYYMMDD
Is there any way to identify the latest file based on the date present in the file name via JCL?
P.S. GDG versions are not created in the existing process. Only a PS file is created.
Thank you
I wanted to create a job where I need to consider the latest file available as input file. File [name] format is as below: FILE1.TEST.TYYMMDD is there any way to identify latest file based on date present in file name via JCL.
No.
You indicate that GDGs are not created in the existing process. GDGs would be the best way to accomplish your goal. Absent GDGs, you must write code.
You could accomplish your goal by writing (C, clist, COBOL, PL/I, Rexx) code using the LMDINIT and LMDLIST ISPF services. Then you would execute your code by running ISPF in batch. Many mainframe shops have a cataloged procedure to execute ISPF in batch.
Agree with @cschneid that there is no platform way to handle this. However, I want to point out that GDGs are the platform way of managing PS files for access in a relative form.
Your comment
GDG versions are not created in existing process. Only PS file is created.
That statement didn't make sense to me. GDGs are not a file type like physical sequential (PS) or partitioned (PO); they are a convention that allows relative reference to files created over time, which sounds like what you want. I've only seen GDGs used for PS files.
Putting the date in the file name can have its uses, but to z/OS it's only part of the file name and not meta information that it operates on (like the G0000V00 suffixes in GDGs).

Using Logic Apps to get specific files from all sub(sub)folders, load them to SQL-Azure

I'm quite new to Data Factory and Logic Apps (but I have many years of experience with SSIS).
I succeeded in loading a folder with 100 text files into SQL Azure with Data Factory, but the files themselves are untouched.
Now, another requirement is that I loop through the folders to get all files with a certain file extension.
In the end I should move (= copy & delete) all the files from the 'To_be_processed' folder to the 'Processed' folder.
I cannot find where to put 'wildcards' and such:
For example, get all files with file extensions .001, .002, .003, .004, .005, ... until ..., .996, .997, .998, .999 (a thousand files), also searching in the subfolders.
Is it possible to call a Data Factory from within a Logic App? (although this seems unnecessary)
Please find some more detailed information in this screenshot:
Thanks in advance helping me out exploring this new technology!
Interesting situation.
I agree that using Logic Apps just for this additional layer of file handling seems unnecessary, but Azure Data Factory may currently be unable to deal with exactly what you need...
In terms of adding wildcards to your Azure Data Factory datasets, you have 3 attributes available within the JSON type properties block, as follows.
Folder Path - to specify the directory. Which can work with a partition by clause for a time slice start and end. Required.
File Name - to specify the file. Which again can work with a partition by clause for a time slice start and end. Not required.
File Filter - this is where wildcards can be used for single and multiple characters. (*) for multi and (?) for single. Not required.
More info here: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-onprem-file-system-connector
I have to say that separately none of the above are ideal for what you require and I've already fed back to Microsoft that we need a more flexible attribute that combines the 3 above values into 1, allowing wildcards in various places and a partition by condition that works with more than just date time values.
That said. Try something like the below.
"typeProperties": {
"folderPath": "TO_BE_PROCESSED",
"fileFilter": "17-SKO-??-MD1.*" //looks like 2 middle values in image above
}
On a side note, there is already a Microsoft feedback item that has been raised for a file move activity, which is currently under review.
See here: https://feedback.azure.com/forums/270578-data-factory/suggestions/13427742-move-activity
Hope this helps
We have used a C# application which we call through App Services -> WebJobs.
It is much easier to iterate through folders that way. To load the data into SQL we used bulk insert.
