How to identify and copy the most recently added files in Azure Data Factory when there are multiple sub-folders?

The way folders are structured in DF is something like this:
Parentfolder/Subfolder1/Subfolder12/Subfolder13/File1
Parentfolder/Subfolder2/Subfolder22/Subfolder23/File2
Parentfolder/Subfolder3/Subfolder32/Subfolder33/File3
The goal is to create a pipeline that can identify the file that was most recently added under the Parentfolder, copy only that file, and move it to the sink. This may require multiple nested pipelines and ForEach loops, but I have not been able to get to a solution.

In order to copy the last modified file from a folder you can follow the steps described in this thread - ADF: copy last modified blob
If you have multiple subfolders under a parent folder and you want to copy the latest file from each subfolder, you will need a parent pipeline: a Get Metadata activity gets the list of subfolders, its output is passed to a ForEach activity that iterates through the subfolder names, and inside the ForEach an Execute Pipeline activity runs the copy-last-modified pipeline for each subfolder.
The GIF below demonstrates copying just the last modified file from a single folder:
This topic is being discussed here as well: https://learn.microsoft.com/answers/questions/379892/index.html
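For reference, a minimal JSON sketch of the parent pipeline described above; the dataset name ParentFolderDS, the child pipeline name CopyLastModifiedFile, and its subfolderName parameter are assumptions, not names from the original thread:

{
    "name": "CopyLatestPerSubfolder",
    "properties": {
        "activities": [
            {
                "name": "Get subfolder list",
                "type": "GetMetadata",
                "typeProperties": {
                    "dataset": { "referenceName": "ParentFolderDS", "type": "DatasetReference" },
                    "fieldList": [ "childItems" ]
                }
            },
            {
                "name": "For each subfolder",
                "type": "ForEach",
                "dependsOn": [ { "activity": "Get subfolder list", "dependencyConditions": [ "Succeeded" ] } ],
                "typeProperties": {
                    "items": { "value": "@activity('Get subfolder list').output.childItems", "type": "Expression" },
                    "activities": [
                        {
                            "name": "Run copy-last-modified pipeline",
                            "type": "ExecutePipeline",
                            "typeProperties": {
                                "pipeline": { "referenceName": "CopyLastModifiedFile", "type": "PipelineReference" },
                                "waitOnCompletion": true,
                                "parameters": {
                                    "subfolderName": { "value": "@item().name", "type": "Expression" }
                                }
                            }
                        }
                    ]
                }
            }
        ]
    }
}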

Related

ADF Copy Activity problem with wildcard path

I have a seemingly simple task: integrate multiple JSON files that reside in a Data Lake Gen2 account.
The problem is that the files that need to be integrated are located in multiple folders. For example, this is a typical structure that I am dealing with:
Folder1\Folder2\Folder3\Folder4\Folder5\2022\Month\Day\Hour\Minute\ <---1 file in Minute Folder
Then the same structure repeats for the year 2023, so in order to collect all the files I have to go to the bottom of the structure, which is the Minute folder. If I use a wildcard path it looks like this:
Wildcard paths: 'source from dataset'/*.json. It copies everything, including all folders, and I just want the files. I tried to narrow it down so that it copies only 2022 first, but whatever I do with wildcard paths is not working. Help is much appreciated.
Trying different wildcard combinations did not help; obviously I am doing something wrong.
There is no direct option to copy files from multiple sub-folders to a single destination folder while keeping their names. The Flatten hierarchy copy behavior will also produce autogenerated file names in the target.
(Image reference: the Microsoft documentation on copy behavior.)
Instead, you can follow the below approach.
In order to list the file paths in the container, take a Lookup activity and connect it to an XML dataset that uses an HTTP linked service.
Give the Base URL in the HTTP connector as:
https://<storage_account_name>.blob.core.windows.net/<container>?restype=container&comp=list
[Replace <storage_account_name> and <container> with the appropriate names in the above URL.]
The Lookup activity returns the folders and files as separate line items, as shown in the following image.
Take a Filter activity and filter out the URLs that end with .json from the Lookup activity output.
Settings of the Filter activity:
Items:
@activity('Lookup1').output.value[0].EnumerationResults.Blobs.Blob
Condition:
@endswith(item().URL,'.json')
Output of filter activity
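Roughly, the Lookup and Filter pair can be expressed as the following activity JSON; the XML dataset name BlobListXml is an assumption:

{
    "name": "Lookup1",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "XmlSource",
            "storeSettings": { "type": "HttpReadSettings", "requestMethod": "GET" }
        },
        "dataset": { "referenceName": "BlobListXml", "type": "DatasetReference" },
        "firstRowOnly": false
    }
},
{
    "name": "Filter1",
    "type": "Filter",
    "dependsOn": [ { "activity": "Lookup1", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
        "items": { "value": "@activity('Lookup1').output.value[0].EnumerationResults.Blobs.Blob", "type": "Expression" },
        "condition": { "value": "@endswith(item().URL,'.json')", "type": "Expression" }
    }
}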
Next to the Filter activity, take a ForEach activity and set its Items to @activity('Filter1').output.value
Inside the ForEach activity, take a Copy activity.
Take the HTTP connector and a JSON dataset as the source, and give the base URL as
https://<account-name>.blob.core.windows.net/<container-name>/
Create a parameter for the relative URL and set that parameter's value to @item().name
In the sink, give the container name and folder name.
Give the file name as dynamic content:
@split(item().name,'/')[sub(length(split(item().name,'/')),1)]
This expression takes the file name from the relative URL value; for example, for a relative URL of folder1/folder2/file1.json it returns file1.json.
When the pipeline is run, all files from the multiple folders are copied to a single folder.
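Putting the ForEach and Copy together, a sketch of the activity JSON could look like this; JsonHttpDS (with a relativeUrl parameter) and BlobSinkDS (with a fileName parameter) are hypothetical dataset names:

{
    "name": "ForEach1",
    "type": "ForEach",
    "dependsOn": [ { "activity": "Filter1", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
        "items": { "value": "@activity('Filter1').output.value", "type": "Expression" },
        "activities": [
            {
                "name": "Copy one json file",
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "JsonSource",
                        "storeSettings": { "type": "HttpReadSettings", "requestMethod": "GET" }
                    },
                    "sink": {
                        "type": "JsonSink",
                        "storeSettings": { "type": "AzureBlobStorageWriteSettings" }
                    }
                },
                "inputs": [
                    {
                        "referenceName": "JsonHttpDS",
                        "type": "DatasetReference",
                        "parameters": {
                            "relativeUrl": { "value": "@item().name", "type": "Expression" }
                        }
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "BlobSinkDS",
                        "type": "DatasetReference",
                        "parameters": {
                            "fileName": {
                                "value": "@split(item().name,'/')[sub(length(split(item().name,'/')),1)]",
                                "type": "Expression"
                            }
                        }
                    }
                ]
            }
        ]
    }
}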

How to Append Files with Azure Data Factory

I have tried Flatten hierarchy, Merge files, and Preserve hierarchy in my attempts to append or merge files with Data Factory, but it will neither append nor merge.
The sink looks like the following:
Can someone let me know how to configure Data Factory to merge files, please?
To merge the files, use a Copy activity after the ForEach loop.
First copy the individual files from REST to the ADLS folder using the loop above. Then use another Copy activity whose source dataset points to that folder path.
Use a wildcard path. Here I have used CSV files as a sample.
Now, in the sink, use the merge option.
Files with the same structure, copied from the REST API to the ADLS folder:
The final CSV file after merging:
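A sketch of that second Copy activity, assuming delimited-text datasets named FolderCsvDS (pointing at the staged folder) and MergedCsvDS (pointing at the target file); the dataset, folder, and file names are placeholders:

{
    "name": "Merge staged files",
    "type": "Copy",
    "typeProperties": {
        "source": {
            "type": "DelimitedTextSource",
            "storeSettings": {
                "type": "AzureBlobFSReadSettings",
                "recursive": true,
                "wildcardFolderPath": "staged",
                "wildcardFileName": "*.csv"
            }
        },
        "sink": {
            "type": "DelimitedTextSink",
            "storeSettings": {
                "type": "AzureBlobFSWriteSettings",
                "copyBehavior": "MergeFiles"
            }
        }
    },
    "inputs": [ { "referenceName": "FolderCsvDS", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "MergedCsvDS", "type": "DatasetReference" } ]
}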

I want to copy files at the bottom of the folder hierarchy and put them in one folder

In Azure Synapse Analytics, I want to copy files at the bottom of the folder hierarchy and put them in one folder.
The files I want to copy are located in their respective folders.
(There are 21 files in total.)
I tried using the ability of the Copy activity to flatten the hierarchy.
However, as you can see in the attached image, the file names are generated on the Synapse side.
I tried to get the name of the bottom-level file with the "Get Metadata" activity, but I could not use wildcards in the file path.
I considered creating and running 21 pipelines that would copy each file, but since the files are updated daily in Blob, it would be impractical to run the pipeline manually every day using 21 folder paths.
Does anyone know of any smart way to do this?
Any help would be appreciated.
Using flatten hierarchy does not preserve the existing file names; new file names are generated. Wildcard paths are not accepted by the Get Metadata activity. Hence, one option is to use Get Metadata with ForEach to achieve the requirement.
The following are the images of folder structure that I used for this demonstration.
I created a Get Metadata activity first. It retrieves the folder names (21 folders like '20220701122731.zip') inside the Intage Sample folder, using Child items in the field list.
Then I used a ForEach activity to loop through these folder names, giving the Items value as @activity('Get folders level1').output.childItems.
Inside the ForEach I have 3 activities. The first is another Get Metadata activity to get the subfolder names (to get the one folder inside '20220701122731.zip', that is '20220701122731').
While creating its dataset, we passed the name of the parent folder (folder_1 = '20220701122731.zip') to the dataset so it can be used in the path as
@{concat('unzipped/Intage Sample.zip/Intage Sample/',dataset().folder_1)}
This returns the names of the subfolders (like '20220701122731') inside each parent folder (like '20220701122731.zip', which has one subfolder each). I used a Set variable activity to assign the child items output to a variable, using @activity('Get folder inner').output.childItems.
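As a sketch, the parameterized dataset and the Set variable step could look like the following; the dataset name InnerFolderDS, the linked service name AzureDataLakeStorage1, and the file system name are placeholders, and the variable test is of type Array:

{
    "name": "InnerFolderDS",
    "properties": {
        "type": "Binary",
        "linkedServiceName": { "referenceName": "AzureDataLakeStorage1", "type": "LinkedServiceReference" },
        "parameters": { "folder_1": { "type": "string" } },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "data",
                "folderPath": {
                    "value": "@{concat('unzipped/Intage Sample.zip/Intage Sample/', dataset().folder_1)}",
                    "type": "Expression"
                }
            }
        }
    }
}

And the Set variable activity inside the ForEach:

{
    "name": "Set inner folder list",
    "type": "SetVariable",
    "typeProperties": {
        "variableName": "test",
        "value": { "value": "@activity('Get folder inner').output.childItems", "type": "Expression" }
    }
}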
The final step is a Copy activity to move the required files to one single destination folder. Since there is only one sub-folder inside each of the 21 folders (only one sub-folder like '20220701122731' inside a folder like '20220701122731.zip'), we can use the values obtained in the steps above directly to complete the copy.
With the help of wildcard paths in this Copy data activity, we can complete the copy. The wildcard folder path will be
@{concat('unzipped/Intage Sample.zip/Intage Sample/',item().name, '/', variables('test')[0].name)}
@item().name gives the parent folder name, in your case '20220701122731.zip'.
@variables('test')[0].name gives the sub-folder name, in your case '20220701122731'.
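A sketch of the Copy activity's source settings with that wildcard folder path; the dataset names SourceBinaryDS and OutputFilesDS are placeholders:

{
    "name": "Copy to output_files",
    "type": "Copy",
    "typeProperties": {
        "source": {
            "type": "BinarySource",
            "storeSettings": {
                "type": "AzureBlobFSReadSettings",
                "recursive": true,
                "wildcardFolderPath": {
                    "value": "@{concat('unzipped/Intage Sample.zip/Intage Sample/', item().name, '/', variables('test')[0].name)}",
                    "type": "Expression"
                },
                "wildcardFileName": "*"
            }
        },
        "sink": {
            "type": "BinarySink",
            "storeSettings": { "type": "AzureBlobFSWriteSettings" }
        }
    },
    "inputs": [ { "referenceName": "SourceBinaryDS", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "OutputFilesDS", "type": "DatasetReference" } ]
}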
For sink, I have created a dataset pointing to a folder inside my container called output_files. When triggered, the pipeline runs successfully.
The following are the contents of my output_files folder.

Transfer files under multiple folder from sftp to sharepoint document library using logic app

I have a scenario where I need to transfer files from an SFTP server to a SharePoint document library.
Example: the files in the SFTP folder are /new/folder1/id1.csv, /new/folder2/id2.csv, and so on; every day files are uploaded to these folders. How do I transfer the same structure to a SharePoint document library using Logic Apps?
The workflow for your folder structure would be as follows:
List files in your SFTP folder "/new".
Create a "For each" loop using the output of the list action as a parameter.
To make sure you don't treat files as folders (if you can have files in /new, e.g. /new/test.txt), add a condition: the IsFolder property of the loop item = true.
Inside the loop (and the condition result True) list files again, this time in the subfolder, using the Path property of the loop item.
Create a new "For each" loop using the output of this list action as a parameter.
Optionally, add a condition: the IsFolder property of the inner loop item = false.
Get content of the SFTP file, using the Id property of the inner loop item.
Create a file on SharePoint using the folder path, file name, and file content retrieved in the previous actions as parameters. If the folder doesn't exist in SharePoint library, it should be created automatically.
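For orientation only, a rough skeleton of that workflow in Logic Apps workflow-definition JSON. The action names are assumptions, the connector inputs are left empty because the SFTP-SSH and SharePoint action settings are normally filled in through the designer, and depending on the connector the listed items may sit under a value property of the body:

{
    "actions": {
        "List_files_in_new": { "type": "ApiConnection", "inputs": {}, "runAfter": {} },
        "For_each_item_in_new": {
            "type": "Foreach",
            "foreach": "@body('List_files_in_new')",
            "runAfter": { "List_files_in_new": [ "Succeeded" ] },
            "actions": {
                "If_item_is_folder": {
                    "type": "If",
                    "expression": {
                        "and": [ { "equals": [ "@items('For_each_item_in_new')?['IsFolder']", true ] } ]
                    },
                    "actions": {
                        "List_files_in_subfolder": { "type": "ApiConnection", "inputs": {}, "runAfter": {} },
                        "For_each_file": {
                            "type": "Foreach",
                            "foreach": "@body('List_files_in_subfolder')",
                            "runAfter": { "List_files_in_subfolder": [ "Succeeded" ] },
                            "actions": {
                                "Get_file_content": { "type": "ApiConnection", "inputs": {}, "runAfter": {} },
                                "Create_file_on_SharePoint": {
                                    "type": "ApiConnection",
                                    "inputs": {},
                                    "runAfter": { "Get_file_content": [ "Succeeded" ] }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}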
This is the simplest scenario, given the folder structure in your question. If the folder structure is more complex (subfolders can contain both files and subfolders, which in turn can contain other files and subfolders, and so on), then you would need a recursive algorithm: first the Logic App lists the files in a single SFTP folder (provided to the Logic App in the HTTP request body); then, for each listed file (not subfolder), it uploads the file's content to SharePoint; and for each listed subfolder (not file), the Logic App calls itself, passing the subfolder path in the HTTP request body. This way all subfolders are processed recursively and all files in them are transferred to SharePoint.
Please note that each time such a workflow is run, it transfers all files; it does not check which files are new or which were transferred in previous Logic App runs. That would be a completely different challenge.

How to Export Multiple files from BLOB to Data lake Parquet format in Azure Synapse Analytics using a parameter file?

I'm trying to export multiple .csv files from Blob Storage to Azure Data Lake Storage in Parquet format, based on a parameter file, using ADF: a ForEach to iterate over each file in the blob container and a Copy activity to copy from source to sink (I have tried using Get Metadata and ForEach activities).
As I'm new to Azure, could someone please help me implement a parameter file to be used in the copy activity?
Thanks a lot
I created a simple test:
I have a parameter file that contains the names of the files to be copied later.
In ADF, we can use a Lookup activity to read the parameter file.
The dataset is as follows:
The output of Lookup activity is as follows:
In the ForEach activity, we should add the dynamic content @activity('Lookup1').output.value. It will iterate over the output array of the Lookup activity.
Inside the ForEach activity, at the source tab we need to select Wildcard file path and add the dynamic content @item().Prop_0 in the Wildcard paths.
That's all.
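A rough sketch of that pipeline's activity JSON, assuming a delimited-text dataset ParamFileDS for the parameter file, a source dataset SourceCsvDS pointing at the blob folder, and a Parquet sink dataset SinkParquetDS (all hypothetical names):

{
    "name": "Lookup1",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "DelimitedTextSource",
            "storeSettings": { "type": "AzureBlobStorageReadSettings" }
        },
        "dataset": { "referenceName": "ParamFileDS", "type": "DatasetReference" },
        "firstRowOnly": false
    }
},
{
    "name": "ForEach1",
    "type": "ForEach",
    "dependsOn": [ { "activity": "Lookup1", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
        "items": { "value": "@activity('Lookup1').output.value", "type": "Expression" },
        "activities": [
            {
                "name": "Copy csv to parquet",
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "DelimitedTextSource",
                        "storeSettings": {
                            "type": "AzureBlobStorageReadSettings",
                            "wildcardFileName": { "value": "@item().Prop_0", "type": "Expression" }
                        }
                    },
                    "sink": {
                        "type": "ParquetSink",
                        "storeSettings": { "type": "AzureBlobFSWriteSettings" }
                    }
                },
                "inputs": [ { "referenceName": "SourceCsvDS", "type": "DatasetReference" } ],
                "outputs": [ { "referenceName": "SinkParquetDS", "type": "DatasetReference" } ]
            }
        ]
    }
}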
I think you are asking for an idea of how to loop through multiple files and merge all similar files into one data frame, so you can push it into Azure Synapse SQL. Is that right? You can loop through files in a lake by putting wildcard characters in the path to files that are similar.
The Copy activity picks up only files that match the defined naming pattern, for example "*2020-02-19.csv" or "???20210219.json".
See the link below for more details.
https://azure.microsoft.com/en-us/updates/data-factory-supports-wildcard-file-filter-for-copy-activity/
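For reference, such a wildcard filter goes into the copy source's store settings, roughly like this (the folder path here is a placeholder):

"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": true,
        "wildcardFolderPath": "input/2020",
        "wildcardFileName": "*2020-02-19.csv"
    }
}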
