How to get files from a subfolder under a nested parent folder in Azure Data Factory?

My folder structure is like below,
Container/xx56585/DST_1/2021-03-26/xxxxxxxx.csv
Container/xx56585/DST_1/2021-03-26/xxxxxxxx.ctl
Container/xx56585/DST_2/2021-03-26/yyyyyyyyy.csv
Container/xx56585/DST_2/2021-03-26/yyyyyyyyy.ctl
Container/xx56585/DST_3/2021-03-26/zzzzzzzzz.csv
Container/xx56585/DST_3/2021-03-26/zzzzzzzzz.ctl
Container/xx56585/DST_4/2021-03-26/sssssssssss.csv
Container/xx56585/DST_4/2021-03-26/sssssssssss.ctl
I need to copy the .csv and .ctl files to an SFTP target and then move these files to an archive folder (in the blob storage, after the Copy activity).
Please help me with this.

Update:
We can use a Get Metadata activity (Get Metadata1) to check whether the .ctl file exists.
Add the dynamic content @concat('xx56585/',item(),'/',substring(adddays(utcnow(),-3),0,10),'/') to the path.
I created a simple test to copy files under the <rundate> folders to the target folder.
My folder structure
Input/xx56585/DST_1/2021-03-26/xxxxxxxx.csv
Input/xx56585/DST_2/2021-03-26/yyyyyyyyy.csv
Input/xx56585/DST_3/2021-03-26/zzzzzzzzz.csv
Input/xx56585/DST_4/2021-03-26/sssssssssss.csv
Define an Array type variable Array1 and assign the value ["DST_1","DST_2","DST_3","DST_4"].
In the ForEach1 activity, we can add the dynamic content @variables('Array1') to traverse this array.
Inside the ForEach1 activity, we can use a Copy activity to copy the files under the dynamic path via the expression @concat('xx56585/',item(),'/',substring(adddays(utcnow(),-3),0,10),'/').
My current date is 2021-03-29, so I use adddays(utcnow(),-3) to get 2021-03-26 in the steps above.
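As a rough sketch, the ForEach definition could look like the following JSON, assuming a parameterized Binary source dataset named BlobSource and an SFTP sink dataset named SftpSink (both dataset names are hypothetical, not from the original post):

{
    "name": "ForEach1",
    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@variables('Array1')",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "Copy rundate files",
                "type": "Copy",
                "description": "Copy everything under xx56585/<DST_x>/<rundate>/ to the SFTP target",
                "inputs": [
                    {
                        "referenceName": "BlobSource",
                        "type": "DatasetReference",
                        "parameters": {
                            "folderPath": {
                                "value": "@concat('xx56585/',item(),'/',substring(adddays(utcnow(),-3),0,10),'/')",
                                "type": "Expression"
                            }
                        }
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "SftpSink",
                        "type": "DatasetReference"
                    }
                ],
                "typeProperties": {
                    "source": {
                        "type": "BinarySource",
                        "storeSettings": {
                            "type": "AzureBlobStorageReadSettings",
                            "recursive": true
                        }
                    },
                    "sink": {
                        "type": "BinarySink",
                        "storeSettings": {
                            "type": "SftpWriteSettings"
                        }
                    }
                }
            }
        ]
    }
}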
That's all.

We can also add a Filter activity before the Copy activity and use a string function to check whether the file name ends with .ctl or .csv, as sketched below.
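For instance, the Filter activity's settings could be the following (a sketch using endswith instead of substring; the Get Metadata activity name Get Metadata1 is assumed):

Items: @activity('Get Metadata1').output.childItems
Condition: @or(endswith(item().name,'.csv'),endswith(item().name,'.ctl'))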

Related

Azure Synapse Analytics - deleting pipeline Folder

I am new to Synapse and I have to make a pipeline that will delete files from folders in a hierarchy like the attached image (expected hierarchy). The red half circles mark the files I would like to delete, for example files older than 2 months.
For now I have made a pipeline for a single folder, and using a ForEach loop I can get to the files and delete the corresponding ones, and it works. Since I have about 60-70 folders and even more files, I wanted to go a level higher and make a pipeline that executes for each folder. And here is the problem: when I use a Get Metadata activity on the top folder and a ForEach loop to take the folder names, I cannot access the files in the folders, only the folders themselves. Could someone help me solve this?
deleting pipeline for a single folder using a ForEach loop
We can achieve this using nested ForEach activities with the help of an Execute Pipeline activity. As mentioned, Get Metadata with wildcards returns all files without folders, and the Delete activity is unable to recognize wildcard folder paths (Folder/*).
I have created a similar folder structure for the demo. In my pipeline, I have first created an array parameter req_files (sample1.csv and sample2.csv) with the names of the required files.
Note: if you want to do this dynamically, you can use an Append Variable activity to build the required file names (file09/22 and file08/22).
I used one Get Metadata activity to get the folder names (which are inside the root folder). I am iterating through the output of the Get Metadata in my ForEach activity (the items value is @activity('root folder contents').output.childItems).
Inside my ForEach, I used another Get Metadata activity to loop through each of the sub folders (to get their file contents).
Now I have the folder name and the list of files inside it. I am going to use an Execute Pipeline activity to implement the nested ForEach. Create 3 parameters in a new pipeline called delete_pipeline (where I perform the delete): current_folder, folder_files and files_needed.
Pass the following dynamic content for each of them from the parent pipeline.
current_folder: @item().name
folder_files: @activity('sub folder contents').output.childItems
files_needed: @pipeline().parameters.req_files
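A sketch of the Execute Pipeline activity inside the parent ForEach with these three mappings (the pipeline and activity names follow the ones used above):

{
    "name": "Execute delete_pipeline",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": {
            "referenceName": "delete_pipeline",
            "type": "PipelineReference"
        },
        "waitOnCompletion": true,
        "parameters": {
            "current_folder": {
                "value": "@item().name",
                "type": "Expression"
            },
            "folder_files": {
                "value": "@activity('sub folder contents').output.childItems",
                "type": "Expression"
            },
            "files_needed": {
                "value": "@pipeline().parameters.req_files",
                "type": "Expression"
            }
        }
    }
}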
Now in delete_pipeline, I have a ForEach loop to iterate through the list of files we are passing (the items value is @pipeline().parameters.folder_files).
Inside this ForEach, I am using an If Condition activity. This is because I want to delete the files which are not in my req_files parameter (the array from the parent pipeline which we passed to the files_needed parameter in delete_pipeline). The condition for the If Condition activity will be the following:
@contains(pipeline().parameters.files_needed,item().name)
We need to delete the file only when it is not present in req_files (files_needed). So, when the condition is false, we perform the delete.
I have created 2 parameters, file_namepath_of_file_to_delete and file_name_to_delete, in the dataset I am using for the Delete activity, with the following dynamic content.
file_namepath_of_file_to_delete: Folder/@{pipeline().parameters.current_folder}
file_name_to_delete: @item().name
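Putting these pieces together, the If Condition inside delete_pipeline's ForEach could be sketched as below; the dataset name delete_dataset is hypothetical, and the Delete activity sits in the false branch:

{
    "name": "If file is required",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@contains(pipeline().parameters.files_needed,item().name)",
            "type": "Expression"
        },
        "ifFalseActivities": [
            {
                "name": "Delete unneeded file",
                "type": "Delete",
                "description": "Delete any file that is not listed in files_needed",
                "typeProperties": {
                    "dataset": {
                        "referenceName": "delete_dataset",
                        "type": "DatasetReference",
                        "parameters": {
                            "file_namepath_of_file_to_delete": {
                                "value": "Folder/@{pipeline().parameters.current_folder}",
                                "type": "Expression"
                            },
                            "file_name_to_delete": {
                                "value": "@item().name",
                                "type": "Expression"
                            }
                        }
                    },
                    "enableLogging": false
                }
            }
        ]
    }
}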
When I run the pipeline, it keeps the required files and deletes the rest. The following are output images for reference.
Debug output: https://i.imgur.com/E6GNVHW.png
My folder after I run the pipeline: https://i.imgur.com/bqN00Dw.png

Using ADF to get a subset of files from the directory in Azure File Share

For example my Azure file share directory contains the following files:
abc_YYYYMMDD.txt
def_YYYYMMDD.txt
ijk_YYYYMMDD.txt
I'm only interested in getting abc_YYYYMMDD.txt and ijk_YYYYMMDD.txt.
Currently, I have a Get Metadata activity that gets a list of files (childItems property) inside a File Share directory.
Then I have a Filter activity with this dynamic content:
@startswith(item().name, variables('filename_filter')) OR startswith(item().name, variables('filename2filter')))
Unfortunately, it has an error:
Position 54 'startswith' is a primitive and doesn't support nested properties
How do I resolve this if I have multiple conditions inside the Dynamic Content for Filter activity?
Your Filter expression should be like this:
@or(startswith(item().name,variables('filename_filter')),startswith(item().name,variables('filename2filter')))
The expression language doesn't support the OR keyword directly; you should use the or() function.
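In pipeline JSON, the Filter activity then looks roughly like this (assuming the preceding Get Metadata activity is named Get Metadata1):

{
    "name": "Filter1",
    "type": "Filter",
    "typeProperties": {
        "items": {
            "value": "@activity('Get Metadata1').output.childItems",
            "type": "Expression"
        },
        "condition": {
            "value": "@or(startswith(item().name,variables('filename_filter')),startswith(item().name,variables('filename2filter')))",
            "type": "Expression"
        }
    }
}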
In my test, I created a pipeline using parameters to filter the filenames that start with "test1" and "test2" in my source folder. Running the pipeline, the output contains only the matching files.
HTH.

How to delete files based older than specified date in Azure Data lake

I have data folders created on a daily basis in the data lake. The folder path is built dynamically from a JSON format.
Source Folder Structure
SAPBW/Master/Text
Destination Folder Structure
SAP_BW/Master/Text/2019/09/25
SAP_BW/Master/Text/2019/09/26
SAP_BW/Master/Text/2019/09/27
..
..
..
SAP_BW/Master/Text/2019/10/05
SAP_BW/Master/Text/2019/09/06
SAP_BW/Master/Text/2019/09/07
..
..
SAP_BW/Master/Text/2019/09/15
SAP_BW/Master/Text/2019/09/16
SAP_BW/Master/Text/2019/09/17
I want to delete the folders created more than 5 days ago for each SinkTableName folder.
So, in Data Factory, I have built the folder path in a ForEach loop as
@concat(item().DestinationPath,item().SinkTableName,'/',item().LoadTypeName,'/',formatDateTime(adddays(utcnow(),-5),item().LoadIntervalFormat),'/')
I need the syntax to delete the files in each folder based on the JSON.
I am unable to find a way to delete folder by folder and set up the Delete activity based on dates more than five days before now.
I see that you are doing a concatenation, which I think is the way to go. But you are using the expression formatDateTime(adddays(utcnow(),-5)), which will give you something like 2019-10-15T08:23:18.9482579Z, which I don't think is desired. I suggest trying @formatDateTime(adddays(utcnow(),-5),'yyyy/MM/dd') instead. Let me know how it goes.
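For example, inside the ForEach the Delete activity could be sketched as follows. DatalakeFolder is a hypothetical parameterized dataset, and the read settings type assumes Data Lake Storage Gen1 (use the matching type for your store):

{
    "name": "Delete old folder",
    "type": "Delete",
    "typeProperties": {
        "dataset": {
            "referenceName": "DatalakeFolder",
            "type": "DatasetReference",
            "parameters": {
                "folderPath": {
                    "value": "@concat(item().DestinationPath,item().SinkTableName,'/',item().LoadTypeName,'/',formatDateTime(adddays(utcnow(),-5),'yyyy/MM/dd'),'/')",
                    "type": "Expression"
                }
            }
        },
        "storeSettings": {
            "type": "AzureDataLakeStoreReadSettings",
            "recursive": true
        }
    }
}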

SSIS won't execute foreach loop for dynamic xlsx filename [duplicate]

This question already has answers here:
SSIS - How to loop through files in folder and get path+file names and finally execute stored Procedure with parameter as Path + Filename
(2 answers)
Closed 3 years ago.
I have an xlsx file that will be dropped into a folder on a monthly basis. The filename will change every month (e.g. filename_8292019) based on the date, which I cannot change.
I want to build a foreach loop to pick up the xlsx file and manipulate it (load it into a SQL Server table, then move the file to an archive folder). I cannot figure out how to do this with a dynamic filename (where the date changes).
I was able to successfully run the package when converting the xlsx to CSV, and also when pointing directly to the xlsx filename.
[Flat File Destination [219]] Error: Cannot open the datafile "filename"
OR errors relating to file not found
The Files: entry on the Collection tab of the Foreach Loop container will accept wildcard characters.
The general pattern here is to create a variable, say, FileName. Set your Files: to something like:
Files:
BaseFileName*
or, if you want to be sure to only pick up spreadsheets, maybe:
Files:
BaseFileName*.xlsx
Select either Name and extension or Fully qualified, which will include the full file path. I usually just use Name and extension and put the file path into another variable so when Ops tells me they're moving my drop location, I can change a parameter instead of editing the package. This step tells the container to remember the name of the file it just found so you can use it later for a variable mapping.
On the Variable Mappings tab, select your variable name and assign it to Index 0.
Then, for each spreadsheet, the container will loop, pick up the name of the first file it finds that matches your pattern, and assign the full name, with the date extension (and path, if you go that way), to your variable. Pass the variable as an input parameter to the tasks inside the loop and use that to process the file, including moving it to the archive, or you'll get yourself into an infinite loop, processing the same file(s) over and over. <--Does that sound like the voice of experience? Yeah. Been there, done that.
Edit:
Here, the FullFilePath variable is just the folder name, without a file reference. (Red variable to red entry in the Folder box).
The FileBaseName variable drives what shows up in the Files box. (Blue to blue).
Another variable picks up the actual file name, with the date extension. Later, say in a File System Task, if I need the folder & file name together, I concatenate the variables.
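For example, a File System Task property or a derived variable can use an SSIS expression like the following (variable names are the ones used in this answer; adjust them to your package):

@[User::FullFilePath] + "\\" + @[User::FileName]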
As far as the Excel Connection Manager error you're getting, unfortunately I'm no help. I don't use it. We have SentryOne's Task Factory for SSIS which includes a much more resilient Excel connector.

SSIS: Loop on multiple directories given in an Excel

I have to read several input files of the same structure but different origins with SSIS. The files are stored in multiple directories; each directory contains multiple files related to a specific company location.
Factory A --> file A, file B,...
Factory B --> file C, file D,..
There is also an Excel with some details on Factory A and B. We plan to push the data to SAP and each file needs a specific accounting code depending on its location/factory.
E.g.
Factory A, account 12345
Factory B, account 54321
The idea was to first read the Excel line by line with the names of the factories and then do a For Each-Loop on each directory to read the files.
I managed to read a single directory by filling a variable with the directory name. E.g. I start with gstrSubdir="Factory A" and then do a simple Foreach-Loop on the subdirectory. So far, I am fine.
Now I need a loop that reads the first line of the Excel, sets the subdir variable, loops on the subdir and then reads the next line.
In principle I need to know how to do a nested For-Loop in SSIS without C#.
I hope my explanation is understandable.
Would be more than happy to get some advice.
Regards,
Lars
