Azure Data Factory | Incremental data load from SFTP to Blob

I created a run-once Data Factory (V2) pipeline to load files (.lta.gz) from an SFTP server into an Azure blob to get the historical data.
It worked beautifully.
Every day several new files arrive on the SFTP server (where they cannot be manipulated or deleted). So I want to create an incremental load pipeline that checks daily for new files and, if there are any, copies them to the blob.
Does anyone have any tips on how to achieve this?

Thanks for using Data Factory!
To incrementally load newly generated files from the SFTP server, you can leverage the GetMetadata activity to retrieve the lastModified property:
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
Essentially you author a pipeline containing the following activities (a minimal JSON sketch follows the list):
1. GetMetadata (return the list of files under a given folder)
2. ForEach (iterate through each file returned); nested inside the ForEach:
3. GetMetadata (return lastModified for the given file)
4. IfCondition (compare lastModified with the trigger's window start time)
5. Copy (copy the file from source to destination)
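For those who asked for a sample, here is a minimal sketch of that pattern as pipeline JSON. The activity, dataset and parameter names (GetFileList, GetFileDate, SftpSourceFolder, SftpSourceFile, BlobSinkFile, WindowStart) are placeholders of my own, the datasets are assumed to be Binary datasets parameterized by file name, and the copy source/sink settings are trimmed to the minimum:

{
  "name": "IncrementalCopyFromSftp",
  "properties": {
    "parameters": { "WindowStart": { "type": "string" } },
    "activities": [
      {
        "name": "GetFileList",
        "type": "GetMetadata",
        "typeProperties": {
          "dataset": { "referenceName": "SftpSourceFolder", "type": "DatasetReference" },
          "fieldList": [ "childItems" ]
        }
      },
      {
        "name": "ForEachFile",
        "type": "ForEach",
        "dependsOn": [ { "activity": "GetFileList", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": {
          "items": { "value": "@activity('GetFileList').output.childItems", "type": "Expression" },
          "activities": [
            {
              "name": "GetFileDate",
              "type": "GetMetadata",
              "typeProperties": {
                "dataset": {
                  "referenceName": "SftpSourceFile",
                  "type": "DatasetReference",
                  "parameters": { "FileName": "@item().name" }
                },
                "fieldList": [ "lastModified" ]
              }
            },
            {
              "name": "IfNewFile",
              "type": "IfCondition",
              "dependsOn": [ { "activity": "GetFileDate", "dependencyConditions": [ "Succeeded" ] } ],
              "typeProperties": {
                "expression": {
                  "value": "@greaterOrEquals(ticks(activity('GetFileDate').output.lastModified), ticks(pipeline().parameters.WindowStart))",
                  "type": "Expression"
                },
                "ifTrueActivities": [
                  {
                    "name": "CopyNewFile",
                    "type": "Copy",
                    "inputs": [ { "referenceName": "SftpSourceFile", "type": "DatasetReference", "parameters": { "FileName": "@item().name" } } ],
                    "outputs": [ { "referenceName": "BlobSinkFile", "type": "DatasetReference", "parameters": { "FileName": "@item().name" } } ],
                    "typeProperties": { "source": { "type": "BinarySource" }, "sink": { "type": "BinarySink" } }
                  }
                ]
              }
            }
          ]
        }
      }
    ]
  }
}

WindowStart would be supplied by the daily trigger (for example the tumbling window start time).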
Have fun building data integration flows using Data Factory!

Since I posted my previous answer in May last year, many of you have contacted me asking for a pipeline sample to achieve the incremental file copy scenario using the getMetadata-ForEach-getMetadata-If-Copy pattern. This has been important feedback: incremental file copy is a common scenario that we want to further optimize.
Today I would like to post an updated answer - we recently released a new feature that allows a much easier and more scalable approach to achieve the same goal:
You can now set modifiedDatetimeStart and modifiedDatetimeEnd on SFTP dataset to specify the time range filters to only extract files that were created/modified during that period. This enables you to achieve the incremental file copy using a single activity:
https://learn.microsoft.com/en-us/azure/data-factory/connector-sftp#dataset-properties
This feature is enabled for these file-based connectors in ADF: AWS S3, Azure Blob Storage, FTP, SFTP, ADLS Gen1, ADLS Gen2, and on-prem file system. Support for HDFS is coming very soon.
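As a rough sketch of the dataset side (the linked service, folder and parameter names here are placeholders; the exact property layout is in the connector article linked above), the new filters can be parameterized and fed from the trigger window:

{
  "name": "SftpNewFiles",
  "properties": {
    "type": "FileShare",
    "linkedServiceName": { "referenceName": "SftpLinkedService", "type": "LinkedServiceReference" },
    "parameters": {
      "windowStart": { "type": "string" },
      "windowEnd": { "type": "string" }
    },
    "typeProperties": {
      "folderPath": "daily-drop",
      "modifiedDatetimeStart": { "value": "@dataset().windowStart", "type": "Expression" },
      "modifiedDatetimeEnd": { "value": "@dataset().windowEnd", "type": "Expression" }
    }
  }
}

A tumbling window trigger can then pass @trigger().outputs.windowStartTime and @trigger().outputs.windowEndTime into windowStart/windowEnd wherever the copy activity references this dataset.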
Further, to make it even easier to author an incremental copy pipeline, we have now released common pipeline patterns as solution templates. You can select one of the templates, fill out the linked service and dataset info, and click deploy – it is that simple!
https://learn.microsoft.com/en-us/azure/data-factory/solution-templates-introduction
You should be able to find the incremental file copy solution in the gallery:
https://learn.microsoft.com/en-us/azure/data-factory/solution-template-copy-new-files-lastmodifieddate
Once again, thank you for using ADF and happy coding data integration with ADF!

Related

Can I copy files from Sharepoint to Azure Blob Storage using dynamic file path?

I am building a pipeline to copy files from SharePoint to Azure Blob Storage at work.
After reading some documentation, I was able to create a pipeline that only copies certain files.
However, I would like to automate this pipeline by using dynamic file paths to specify the source files in Sharepoint.
In other words, when I run the pipeline on 2022/07/14, I want to get the files from the Sharepoint folder named for that day, such as "Data/2022/07/14/files".
I know how to do this with PowerAutomate, but my company does not want me to use PowerAutomate.
The current pipeline looks like the attached image.
Do I need to use parameters in the URL of the source dataset?
Any help would be appreciated.
Thank you.
Try this approach.
You can create a parameterized dataset (a sketch of one follows below).
Then, from the copy activity, you can pass the file path to this parameter with a dynamic-content expression such as:
@concat('Data/',formatDateTime(utcnow(),'yyyy'),'/',formatDateTime(utcnow(),'MM'),'/',formatDateTime(utcnow(),'dd'))
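For illustration, here is a sketch of such a parameterized dataset and the copy activity that feeds it. The names (SharePointFileDataset, SharePointHttpLinkedService, BlobSinkDataset, FolderPath) are made-up placeholders, an HTTP-style source over the SharePoint site is assumed, and the store/read settings are trimmed for brevity:

{
  "name": "SharePointFileDataset",
  "properties": {
    "type": "Binary",
    "linkedServiceName": { "referenceName": "SharePointHttpLinkedService", "type": "LinkedServiceReference" },
    "parameters": { "FolderPath": { "type": "string" } },
    "typeProperties": {
      "location": {
        "type": "HttpServerLocation",
        "relativeUrl": { "value": "@dataset().FolderPath", "type": "Expression" }
      }
    }
  }
}

and, in the pipeline, the copy activity that passes the day's path into that parameter:

{
  "name": "CopyTodaysFiles",
  "type": "Copy",
  "inputs": [
    {
      "referenceName": "SharePointFileDataset",
      "type": "DatasetReference",
      "parameters": {
        "FolderPath": "@concat('Data/',formatDateTime(utcnow(),'yyyy'),'/',formatDateTime(utcnow(),'MM'),'/',formatDateTime(utcnow(),'dd'),'/files')"
      }
    }
  ],
  "outputs": [ { "referenceName": "BlobSinkDataset", "type": "DatasetReference" } ],
  "typeProperties": { "source": { "type": "BinarySource" }, "sink": { "type": "BinarySink" } }
}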

How to delete specific blob file when that file has been removed from the source location via Azure Data Factory (self-hosted)

I created a Copy Data task in Azure Data Factory which periodically copies modified files from my file system (self-hosted integration runtime) to an Azure Blob location. That works great when an existing file is modified or a new file is created in the source. However, it won't delete a file from the blob destination when the corresponding file is deleted from the source file path location, since the deleted file no longer has a modified date. Is there a way to keep the source and destination in sync via Azure Data Factory when individual files are deleted, as in the scenario described above? Is there a better way to do this? Thanks.
I'm afraid Data Factory can't do that with its built-in activities: the pipeline only supports reading the files that exist in the source and copying them to the sink, and the sink side doesn't support deleting a file.
You can achieve that at the code level, for example with an Azure Function or a notebook. After the copy finishes, build logic to compare the source and destination file lists and delete any destination file that no longer exists in the source list.
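A minimal sketch of that compare-and-delete step in Python with the azure-storage-blob package. The function and container names are placeholders, and it assumes you can produce the list of source file names yourself (for example from the file share the self-hosted runtime reads):

from azure.storage.blob import ContainerClient

def delete_orphaned_blobs(conn_str, container_name, source_file_names):
    """Delete destination blobs whose source file no longer exists."""
    container = ContainerClient.from_connection_string(conn_str, container_name=container_name)
    source_file_names = set(source_file_names)
    for blob in container.list_blobs():
        if blob.name not in source_file_names:
            # The file disappeared from the source, so drop its copy from the destination.
            container.delete_blob(blob.name)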
HTH.

Azure Data Factory: Migration of pipelines from one data factory to another

I have some pipelines which I want to move from one data factory to another. Is there any possible way to migrate them?
The easiest way to do this is to pull the git repo for the source factory down to your local file system and then copy and paste the desired files into your destination factory's folder structure. That's it.
Alternatively, you can do this through the ADF editor by creating a shell of the pipeline in the target factory first, then go to the source factory and switch to the code view for that pipeline, copy and paste that code into the target pipeline shell you created, and then save from there.
A pipeline is just JSON. You may need to copy its dependent objects (linked services, datasets, triggers) as well, but those are done the exact same way.
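For illustration, a pipeline file in the repo is just a small JSON document like this (the names here are placeholders):

{
  "name": "MyPipeline",
  "properties": {
    "activities": [
      {
        "name": "WaitOneSecond",
        "type": "Wait",
        "typeProperties": { "waitTimeInSeconds": 1 }
      }
    ],
    "annotations": []
  }
}

In a git-enabled factory these files sit under folders such as pipeline/, dataset/ and linkedService/, which is why copying them between the two repos works.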
There is an import/export feature in the data factory canvas which supports this use case.
Moreover, this is a case where continuous integration and deployment proves very useful. More information can be found here - https://learn.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment

How to unzip and move files in Azure?

Problem: I get an email with a zip file. In that zip are two files. I want to extract one of those files and place it in a folder in ADL.
I've automated this kind of thing before using Logic Apps, but the zip and the extra file are throwing a wrench in the gears here. So far I've managed to get one logic app going to download the zip into a blob container and another logic app to extract the files to another container. I don't know how to proceed from there. Should I be using Data Factory? I want this automated and to run every week, every time I receive an email from a specific sender.
Update:
I am sorry, I didn't notice that your source is ADL; the steps below just need the source changed to ADL. The key is to select the compression type on your source - it will unzip the file for you.
Original Answer:
1. Create a pipeline.
2. Create a Copy Data activity.
3. After you create the Copy Data activity, you need to choose the source and the sink. From your description, you need to unzip a file in one storage container into another container, so for the source choose Azure Blob Storage, select the container that holds the zip file, and set the compression type so the file gets decompressed. The sink is similar: also choose Azure Blob Storage with the same linked service, and select the container that you want to copy to.
4. Then Validate All. If there is no problem, publish everything.
5. Now trigger your pipeline.
6. After that, your zip file will be successfully unzipped and copied to the other container. :)
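A sketch of what the source dataset looks like with the compression type set (container, folder and linked service names are placeholders); leaving compression off the sink dataset is what makes the copy write out the extracted files:

{
  "name": "ZippedSourceFiles",
  "properties": {
    "type": "Binary",
    "linkedServiceName": { "referenceName": "BlobStorageLinkedService", "type": "LinkedServiceReference" },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "incoming-mail",
        "folderPath": "zips"
      },
      "compression": { "type": "ZipDeflate" }
    }
  }
}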

How to copy the data based on the Last modified time using Azure data factory from FTP Source?

I am trying to create a pipeline that should trigger only when a file is modified on the FTP server.
I have used a Get Metadata activity to get the last modified date and an If Condition activity to copy the data.
Below is the expression I have used in the If Condition activity:
@less(activity('GET_DATA').output.lastModified,formatDateTime(utcnow(),'yyyy-MM-dd HH:mm:ss'))
I want the latest updated files to be copied into the destination.
So can anyone please suggest how to model the pipeline for this?
Here is a guide for incremental load; hope it helps:
https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-lastmodified-copy-data-tool
Also, there is a template for incremental load:
https://learn.microsoft.com/en-us/azure/data-factory/solution-template-copy-new-files-lastmodifieddate
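If you stay with the Get Metadata + If Condition approach, compare lastModified against the start of the window you want to pick up rather than against the current time - less(lastModified, utcnow()) is true for every existing file. A sketch of such a condition, assuming a daily run and that your Get Metadata activity is still named GET_DATA:

@greater(ticks(activity('GET_DATA').output.lastModified), ticks(addDays(utcnow(), -1)))

The template above avoids the per-file Get Metadata calls entirely by putting the same time-window filter on the copy source.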
