Azure Synapse Analytics - deleting files from pipeline folders

I am new to Synapse and I have to make a pipeline that will delete files from folders in a hierarchy like the attached image (expected hierarchy). The red half circles mark the files I would like to delete, for example files older than 2 months.
For now I have made a pipeline for a single folder, and using a ForEach loop I can get to the files and delete the corresponding ones, and it works. Since I have about 60-70 folders and even more files, I wanted to go a level higher and make a pipeline that executes for each folder, and this is where the problem is. When I use a Get Metadata activity on the top folder and a ForEach loop to take the folder names, I can only access the folders, not the files inside them. Could someone help me solve this?
deleting pipeline for a single folder using a ForEach loop
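(For context, the "older than 2 months" rule asked about above can be sketched outside Synapse with the azure-storage-blob Python SDK; this is only a rough illustration, and the connection string, container and folder names below are placeholders.)

# Minimal sketch: delete blobs under a folder prefix that are older than roughly 2 months.
from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")   # placeholder
container = service.get_container_client("mycontainer")                     # placeholder
cutoff = datetime.now(timezone.utc) - timedelta(days=60)                    # ~2 months

for blob in container.list_blobs(name_starts_with="Folder/"):               # placeholder prefix
    # last_modified is timezone-aware, so it compares cleanly against the cutoff
    if blob.last_modified < cutoff:
        container.delete_blob(blob.name)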

We can achieve this using nested ForEach activities with the help of the Execute Pipeline activity. As mentioned, Get Metadata with wildcards returns all files without folders, and the Delete activity is unable to recognize wildcard folder paths (Folder/*).
I have created a similar folder structure for the demo. In my pipeline, I have first created an array parameter req_files (sample1.csv and sample2.csv) with the names of the required files.
Note: If you want to do this dynamically, you can use an Append Variable activity to build the required file names (file09/22 and file08/22).
I used one Get Metadata activity to get the folder names (which are inside the root folder). I am iterating through the output of this Get Metadata in my ForEach activity (the items value is @activity('root folder contents').output.childItems).
Inside my ForEach, I used another Get Metadata activity on each of the sub folders (to get its file contents).
Now I have the folder name and the list of files inside it. I am going to use an Execute Pipeline activity to implement the nested ForEach. Create 3 parameters in a new pipeline called delete_pipeline (where I perform the delete): current_folder, folder_files and files_needed.
Pass the following dynamic content for each of them from the parent pipeline.
current_folder: @item().name
folder_files: @activity('sub folder contents').output.childItems
files_needed: @pipeline().parameters.req_files
Now in delete_pipeline, I have a ForEach loop to iterate over the list of files we are passing (the items value is @pipeline().parameters.folder_files).
Inside this ForEach, I am using an If Condition activity. This is because I want to delete the files which are not in my req_files parameter (the array from the parent pipeline which we passed to the files_needed parameter in delete_pipeline). The condition for the If Condition activity is the following:
@contains(pipeline().parameters.files_needed, item().name)
We need to delete the file only when it is not present in req_files (files_needed). So, when the condition is false, we perform delete.
I have created 2 parameters, file_namepath_of_file_to_delete and file_name_to_delete, in the dataset I am using for the Delete activity, with the following dynamic content.
file_namepath_of_file_to_delete: Folder/@{pipeline().parameters.current_folder}
file_name_to_delete: @item().name
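(For reference, the keep-list logic that the nested pipelines implement can be sketched outside ADF with the azure-storage-blob Python SDK; the SDK lists blobs recursively under a prefix, so the nested ForEach is not needed there. The connection string, container and folder names are placeholders.)

# Sketch: keep only the files named in req_files under every sub folder of "Folder/".
from azure.storage.blob import BlobServiceClient

req_files = ["sample1.csv", "sample2.csv"]                 # files to keep, as in the demo
service = BlobServiceClient.from_connection_string("<connection-string>")   # placeholder
container = service.get_container_client("mycontainer")                     # placeholder

for blob in container.list_blobs(name_starts_with="Folder/"):
    file_name = blob.name.split("/")[-1]                   # Folder/sub1/other.csv -> other.csv
    if file_name and file_name not in req_files:           # same check as the If condition being false
        container.delete_blob(blob.name)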
When I run the pipeline, it keeps the required files and deletes the rest. The following are output images for reference.
Debug output: https://i.imgur.com/E6GNVHW.png
My folder after I run the pipeline: https://i.imgur.com/bqN00Dw.png

Related

Azure Data Factory Counting number of Files in Folder

I am attempting to determine if a folder is empty.
My current method involves using a Get Metadata activity and running the following to set a Boolean.
@greater(length(activity('Is Staging Folder Empty').output.childItems), 0)
This works great when files are present.
When the folder is empty (a state I want to test for) I get
"The required Blob is missing".
Can I trap this condition?
What alternatives are there to determine if a folder is empty?
I have reproduced the above and got the same error.
This error occurs when the folder is empty and the source is Blob storage. You can see it is working fine for me when the source is ADLS.
For the sample I have used a Set Variable activity inside the False branch of the If condition, to show the result when the folder is empty:
Can I trap this condition?
What alternatives are there to determine if a folder is empty?
One alternative can be to use ADLS instead of Blob storage as source.
(or)
You can do it like below if you want to avoid this error with Blob storage as the source. Add an If activity on the failure path of the Get Metadata activity and check the error in the expression.
@startswith(string(activity('Get Metadata1').error.message), 'The required Blob is missing')
In the True activities (the required error occurred, i.e. the folder is empty) I have used a Set Variable activity for the demo.
In the False activities (if any other error apart from the above occurs) use a Fail activity to fail the pipeline.
Fail Message: @string(activity('Get Metadata1').error.message)
For the success path of the Get Metadata activity, there is no need to check the count of childItems, because Get Metadata fails if the folder is empty. So, on success, continue with your activity flow.
An alternative would be:
Blob:
Dataset:
where test is the container and test is the folder inside the container which I am trying to scan (which ideally doesn't exist, as seen above).
Use a Get Metadata activity to check whether the folder exists:
If false, exit; else count the files.
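(Outside ADF, the same "is the folder empty" check can be sketched with the azure-storage-blob Python SDK; listing a missing or empty prefix simply yields nothing instead of raising the "required Blob is missing" error. The connection string, container and folder names are placeholders.)

# Sketch: check whether a "folder" (blob name prefix) contains any files.
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string("<connection-string>", "test")   # placeholders
folder_prefix = "test/"                                     # the folder being scanned

has_files = any(True for _ in container.list_blobs(name_starts_with=folder_prefix))
print("folder is empty" if not has_files else "folder has files")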

Copy files from AWS s3 sub folder to Azure Blob

I am trying to copy files out of an S3 bucket using Azure Data Factory. Firstly I want a list of the directories.
Using the CLI I would use aws s3 ls.
From there I can work through the list in a ForEach and push each value into a variable.
In ADF, I have tried to use Get Metadata, and although this works in theory, in practice there are 76 files in each directory and the loop is over 1.5m. It just isn't worth it; it takes far too long, especially as listing just the directories only takes about 20 seconds for 20,000 directories.
Is there a method to do this listing? When creating the dataset we get a 'no permissions' error; however, when we use a specific location it works.
Many thanks
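(For the listing itself, an "aws s3 ls" style directory listing can be sketched with boto3 by passing a Delimiter, so only the top-level prefixes come back rather than every file; the bucket name is a placeholder.)

# Sketch: list only the "directories" (common prefixes) of an S3 bucket.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

directories = []
for page in paginator.paginate(Bucket="my-bucket", Delimiter="/"):    # placeholder bucket
    for cp in page.get("CommonPrefixes", []):
        directories.append(cp["Prefix"])                              # e.g. "folder1/"

print(directories)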
I have found another way of completing this task.
To begin with, I am using Get Metadata with the child items option. It produces an array.
I push this into a string variable. With this variable you can then create a stored procedure to pick it apart, using OPENJSON to get just the values. These can then be pulled apart further to get the directory names.
I then merge these into a table.
Using a Lookup I can then run another stored procedure to return the value I require from the table. This whole process runs in a couple of minutes.
If anyone wants further explanation, please ask and I will try to create a walkthrough to assist.
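(What the stored procedure does with OPENJSON is essentially to pull the name values out of the childItems array; as a rough Python sketch, with the JSON string standing in for the pipeline variable:)

# Sketch: extract directory names from a Get Metadata childItems array.
import json

# The string variable holds an array shaped like the Get Metadata childItems output:
child_items = '[{"name": "dir1", "type": "Folder"}, {"name": "file1.csv", "type": "File"}]'

names = [item["name"] for item in json.loads(child_items) if item["type"] == "Folder"]
print(names)   # ['dir1']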

How to ignore files with invalid schema Azure Data Factory?

I am new to Azure Data Factory. Currently, I am trying to copy some data from files in blob storage to Azure Data Explorer.
The pipeline reads multiple files from a specific cosmos root directory and copies the data to ADE. Currently, the pipeline successfully handles the copy activities. I now want to add a validation component so that the pipeline only reads files with the correct schema and ignores the rest of the files.
The schema is as follows: Name, Type, Value, Date.
The pipeline itself should not fail/stop if the files are of invalid type or schema, it should just skip them and continue on with the rest of the files.
What component can I use for this validation?
Thanks!
You can use a ForEach activity to achieve this requirement. The ForEach activity (in this case) can be used to copy each source file into the destination. If the schema doesn't match, the copy activity usually fails.
But when a copy activity inside a ForEach fails, the pipeline execution does not fail/stop. It throws an error for the particular copy activity that failed, but it continues executing the copy activity for the remaining files.
Look at the following demonstration for clear understanding.
The following are the files I am using for sample. The file sample_3.csv does not match the schema of my destination table.
I used a Get Metadata activity to get the file names of the above files using the child items field list. I passed these values to the ForEach as @activity('Get Metadata1').output.childItems.
Inside the ForEach, I have a copy activity along with a wait activity (the wait executes only if the copy activity fails).
When I execute this pipeline, it completes successfully without any errors. The following is the debug output.
The third iteration of copy activity fails as sample_3.csv has incorrect schema. And yet the pipeline runs successfully without failing.
UPDATE:
Mapping need not be specified in copy activity.
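(If a pre-check is preferred over relying on the copy failure, the header validation against the Name, Type, Value, Date schema could be sketched like this in Python; the file names are placeholders.)

# Sketch: keep only files whose header row matches the expected schema.
import csv

expected_header = ["Name", "Type", "Value", "Date"]

def has_valid_schema(path: str) -> bool:
    # Read only the first row and compare it with the expected column names.
    with open(path, newline="") as f:
        header = next(csv.reader(f), [])
    return header == expected_header

files = ["sample_1.csv", "sample_2.csv", "sample_3.csv"]        # placeholder file list
valid_files = [f for f in files if has_valid_schema(f)]
print(valid_files)                                              # a file with a different header is skipped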

Azure container sync periodically

I have a scenario where I am storing payloads from 2 subscribers (Service Bus topics) into 2 different storage containers. The storing mechanisms are different in these cases.
Now I have to run a sync every 30 minutes which will compare the files created in the target and the source;
if anything is missing in the target, it should be able to copy that file from source to target.
I am looking at AzCopy sync, but that is a local application. There are Logic App and Function App options as well.
Kindly share the best solution to this problem.
This method is a little on the brute force side, but should work. Here is the top level pipeline diagram:
Here are the steps:
Get Metadata for the "Needs sync" folder (FolderA). Be sure to add the "Child items" argument:
ForEach over the FolderA child items to extract the file names and append them to an array variable:
This makes it easier to work with the names later.
Get Metadata for the "Always right" folder (FolderB). Same process as above, but over the FolderB location.
ForEach over FolderB's child items.
Inside the ForEach, add an If Condition to test whether or not the FolderB item exists in the FolderA list.
If the FolderB item is not in the FolderA list, append it to a Missing_Items array variable.
From here, it's a matter of looping over the Missing_Items array and handling it however you prefer [probably with a Copy activity].
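(Done outside ADF for reference, the comparison is just a set difference of the two folders' file names; a sketch with the azure-storage-blob Python SDK, with placeholder connection string, container and folder names:)

# Sketch: find files present in FolderB ("always right") but missing from FolderA ("needs sync").
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")   # placeholder

def file_names(container_name: str, prefix: str) -> set:
    container = service.get_container_client(container_name)
    return {b.name.split("/")[-1] for b in container.list_blobs(name_starts_with=prefix)}

missing_items = file_names("container-b", "FolderB/") - file_names("container-a", "FolderA/")
print(missing_items)   # the files to copy from source to target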

How to get files from a subfolder present under nested parent folder in azure data factory?

My folder structure is like below,
Container/xx56585/DST_1/2021-03-26/xxxxxxxx.csv
Container/xx56585/DST_1/2021-03-26/xxxxxxxx.ctl
Container/xx56585/DST_2/2021-03-26/yyyyyyyyy.csv
Container/xx56585/DST_2/2021-03-26/yyyyyyyyy.ctl
Container/xx56585/DST_3/2021-03-26/zzzzzzzzz.csv
Container/xx56585/DST_3/2021-03-26/zzzzzzzzz.ctl
Container/xx56585/DST_4/2021-03-26/sssssssssss.csv
Container/xx56585/DST_4/2021-03-26/sssssssssss.ctl
I need to copy the .csv and .ctl files to an sFTP target and move these files to an archive folder (in the blob storage, after the copy activity).
Please help me on this
Update:
We can use Get Metadata1 to check whether the ctl file exists.
Add the dynamic content @concat('xx56585/',item(),'/',substring(adddays(utcnow(),-3),0,10),'/') to the path.
I created a simple test to copy files under <rundate> folders to target folder.
My folder structure
Input/xx56585/DST_1/2021-03-26/xxxxxxxx.csv
Input/xx56585/DST_2/2021-03-26/yyyyyyyyy.csv
Input/xx56585/DST_3/2021-03-26/zzzzzzzzz.csv
Input/xx56585/DST_4/2021-03-26/sssssssssss.csv
Output:
Define an Array type variable Array1 and assign the value ["DST_1","DST_2","DST_3","DST_4"].
At the ForEach1 activity, we can add the dynamic content @variables('Array1') to traverse this array.
Inside the ForEach1 activity, we can use a Copy activity to copy files under the dynamic path via the expression @concat('xx56585/',item(),'/',substring(adddays(utcnow(),-3),0,10),'/').
My current date is 2021-03-29, so I use adddays(utcnow(),-3) to get 2021-03-26 in the above steps.
That's all.
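(The dynamic path expression boils down to parent folder / DST_x / UTC today minus 3 days as yyyy-MM-dd; a quick Python sketch of the same construction:)

# Sketch: build the same folder paths the @concat(...) expression produces.
from datetime import datetime, timedelta, timezone

run_date = (datetime.now(timezone.utc) - timedelta(days=3)).strftime("%Y-%m-%d")

for dst in ["DST_1", "DST_2", "DST_3", "DST_4"]:
    print(f"xx56585/{dst}/{run_date}/")    # e.g. xx56585/DST_1/2021-03-26/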
I think we can add a Filter activity before the copy activity, in which we can use a string function to check whether the file name contains .ctl or .csv.
