I am trying to copy files out of a s3 bucket using azure data factory. Firstly I want a list of the directories.
Using the CLI I would use. {aws s3 ls }
From there I can determine from the list in a foreach an push that into a variable.
In adf, I have tried to use 'get metadata', although this works in theory. In practice there are 76 files in each directory and the loop is over 1.5m. This just isn't worth it, it takes far too long, especially as the directories only takes about 20 seconds for 20000 directories.
Is there a method to do this list. When creating the dataset we have a no permissions, however when we use specific location it does.
Many thanks
I have found another way of completing this task.
So to begin with I am using get metadata with the child option. It produces an array.
I push this into a string variable. With this variable you can then create a stored procedure to pick this apart, using openjson to get just the value. This can then be pulled apart further to get the directory names.
I then merge these into a table.
Using lookup I can then run another stored procedure to return the value I require from the table. This whole process runs in a couple of minutes.
Anyone who wants a further explanation, please ask, I will try and create a walk through to assist
Related
I am currently making a little API which returns jsons according to an input. This API needs to run some local programs on the server and also needs to place some temporary files. It all works if I ask the API once a time and just with one "user".
The problem is, that I only have one temp folder to store the temporary files.so whenever there are multiple API queries, the tmp folder screws up - the data in there mixes up.
What would be a good way to have an API using temp files - and still keep it working if every run needs its own temp data?
The current process is:
server/api/getGeometry?lat=52.5167776&lon=13.4092091&bboxSize=2000&output=glb
server queries zips according to lat/lon
unpacks
converts
does voodoo
generates json
sends back the json
cleans up the tmp folder
next run same story...
so every time the same tmp folder, I guess it's not the way to go.
Thanks a lot for ideas!
I am new to Synapse and I have to make a pipeline that will delete files from folders in a hierarchy like the attached image. expecting hierarchy. The red half circles mark the files I would like to delete files for example older than 2 months.
As for now I have made a pipline for a single folder and using the for each loop I can get to the files and delete the corresponding one. And it works, since I have about 60-70 folders and even more files I wanted to go a level higher up and make a pipeline for each folder to execute. And with this is a problem. When i use GetMetadata Activity for top folder, and use for each loop to take name folders then i can not acess files in folder just only folder. Could you help me someone how to slove this?
deleting pipline for single folder using for each loop
We can achieve this using nested for each activities with the help of execute pipeline activity. As mentioned, Get metadata with wildcards returns all files without folders and Delete activity is unable to recognize wildcard folder paths(Folder/*).
I have created a similar folder structure for demo. In my pipeline, I have first created an array parameter req_files (sample1.csv and sample2.csv) with names of files required.
Note: If you want to dynamically do this, you can use append variable to build required file names (file09/22 and file08/22).
I used one get metadata to get folder names (which are inside root folder). I am iterating through the output of get metadata in my for each activity (items value is #activity('root folder contents').output.childItems).
Inside my for each, I used another get metadata activity to loop through each of the sub folders (to get file contents).
Now I have the folder name and list of files inside it. I am going to use execute pipeline to implement nested for each. Create 3 parameters in a new pipeline called delete_pipeline (where I perform delete) as current_folder, folder_files and files_needed.
Pass the following dynamic content for each of them from parent pipeline.
current_folder: #item().name
folder_files: #activity('sub folder contents').output.childItems
files_needed: #pipeline().parameters.req_files
Now in delete_pipeline, I have a for each loop to loop through the list of files we are passing (items value is #pipeline().parameters.folder_files).
Inside this for each, I am using an If condition activity. This is because I want to delete files which are not in my req_files parameter (array from parent pipeline which we passed to files_needed parameter in delete_pipeline). The condition for if condition activity will be as following:
#contains(pipeline().parameters.files_needed,item().name)
We need to delete the file only when it is not present in req_files (files_needed). So, when the condition is false, we perform delete.
I have created 2 parameters file_namepath_of_file_to_delete and file_name_to_delete in the dataset I am using for delete activity with following dynamic content.
file_namepath_of_file_to_delete: Folder/#{pipeline().parameters.current_folder}
file_name_to_delete: #item().name
When I run the pipeline, it keeps the required files and deletes the rest. The following are output images for reference.
Debug output: https://i.imgur.com/E6GNVHW.png
My folder after I run the pipeline: https://i.imgur.com/bqN00Dw.png
I have a scenario where i am storing payload from 2 subscriber (Service bus topics) into 2 different storage/container. Storing mechanism are different in these case.
Now i have to run a sync every 30 minutes which will compare the files created target and Source,
if anything is missing in the target it should be able to copy that file from source to target.
I am looking at AZCopy sync, but that is a local application . There are logic app and Function app option as well.
Kindly share what is the best solution to this problem
This method is a little on the brute force side, but should work. Here is the top level pipeline diagram:
Here are the steps:
Get Metadata for the "Needs sync" folder (FolderA). Be sure to check add the "Child items" argument:
ForEach over the FolderA child items to extract the file names and append them to an array variable:
This makes it easier to work with the names later.
Get Metadata for the "Always right" folder (FolderB). Same process as above, but over the FolderB location.
ForEach over FolderB's child items.
Inside the ForEach, add an If Condition to test whether or not the FolderB item exists in the FolderA list.
If the FolderB item is not in the FolderA list, append it to a Missing_Items array variable.
From here, it's a matter of looping over the Missing Items array and handling it however you prefer [probably with a Copy activity].
I have a Pipeline job in Azure Data Factory which I want to use to run the pipeline job but pass all files for a specific month through for example.
I have a folder called 2020/01 inside this folder is numerous files with different names.
The question is: Can one pass a parameter through to only extract and load the files for 2020/01/01 and 2020/01/02 if that makes sense?
Excellent, Thanks Jay it worked and i can now run my pipeline jobs passing through the month or even day level.
Really appreciate your response, have a fantastic day.
Regards
Rayno
The question is: Can one pass a parameter through to only extract and
load the files for 2020/01/01 and 2020/01/02 if that makes sense?
You did't mention which connector you are using in pipeline job,but you mentioned folder in your question.As i know,the majority folder path could be parametrization in ADF copy activity configuration.
You could create a param :
Then apply it in the wildcard folder path:
Even if your files' names have same prefix,you could apply 01*.json on the wildcard file name property.
I'm quite new to Data Factory and Logic Apps (but I am experienced with SSIS since many years),
I succeeded in loading a folder with 100 text-files into SQL-Azure with DATA FACTORY
But the files themselves are untouched
Now, another requirement is that I loop through the folders to get all files with a certain file extension,
In the end I should move (=copy & delete) all the files from the 'To_be_processed' folder to the 'Processed' folder
I can not find where to put 'wildcards' and such:
For example, get all files with file extensions .001, 002, 003, 004, 005, ...until... , 996, 997, 998, 999 (thousand files)
--> also searching in the subfolders.
Is it possible to call a Data Factory from within a Logic App ? (although this seems unnecessary)
Please find some more detailed information in this screenshot:
(click to enlarge)
Thanks in advance helping me out exploring this new technology!
Interesting situation.
I agree that using Logic Apps just for this additional layer of file handling seems unnecessary, but Azure Data Factory may currently be unable to deal with exactly what you need...
In terms of adding wild cards to your Azure Data Factory datasets you have 3 attributes available within the JSON type properties block, as follows.
Folder Path - to specify the directory. Which can work with a partition by clause for a time slice start and end. Required.
File Name - to specify the file. Which again can work with a partition by clause for a time slice start and end. Not required.
File Filter - this is where wildcards can be used for single and multiple characters. (*) for multi and (?) for single. Not required.
More info here: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-onprem-file-system-connector
I have to say that separately none of the above are ideal for what you require and I've already fed back to Microsoft that we need a more flexible attribute that combines the 3 above values into 1, allowing wildcards in various places and a partition by condition that works with more than just date time values.
That said. Try something like the below.
"typeProperties": {
"folderPath": "TO_BE_PROCESSED",
"fileFilter": "17-SKO-??-MD1.*" //looks like 2 middle values in image above
}
On a side note; there is already a Microsoft feedback item thats been raised for a file move activity which is currently under review.
See here: https://feedback.azure.com/forums/270578-data-factory/suggestions/13427742-move-activity
Hope this helps
We have used a C# application which we call through 'app services' -> webjobs.
Much easier to iterate through folders. To call SQL we used sql bulkinsert