Azure Synapse Pipeline Execution based on file copy in Data Lake

I want to execute an Azure Synapse pipeline whenever a file is copied into a folder in the data lake.
Can we do that, and how can we achieve it?
Thanks,
Pavan.

You can trigger a pipeline (start a pipeline run) when a file is copied to a data lake folder by using a storage event trigger. A storage event trigger starts the pipeline based on the event you select.
You can follow the steps below to create a storage event trigger.
Assuming you have a pipeline named ‘pipeline1’ in Azure Synapse that you want to execute when a file is copied to a data lake folder, click Trigger and select New/Edit.
Choose a new trigger. Select Storage events as the trigger type and specify the data lake storage account that should fire the trigger when a file is copied into it. Specify the container name, Blob path begins with, and Blob path ends with according to your data lake directory structure and the type of files.
Since you need to start the pipeline when a blob file appears in the data lake folder, check the Blob created event. Check Start trigger on creation, complete creating the trigger, and publish it.
These steps create a storage event trigger for your pipeline on that data lake storage account. As soon as files are uploaded or copied to the specified directory of the data lake container, a pipeline run is started and you can work on the further steps. You can refer to the following document to learn more about storage event triggers.
https://learn.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger?tabs=data-factory
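
For reference, a trigger created this way is saved as a JSON definition behind the Synapse Studio UI. A rough sketch of what it might contain is shown below as a Python dict; the trigger name, container, paths, and storage account scope are placeholders for your own values.

# Rough sketch of a storage event (BlobEventsTrigger) definition as saved behind
# the UI. All names, paths, and the storage account scope are placeholders.
storage_event_trigger = {
    "name": "trigger1",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/mycontainer/blobs/myfolder/",  # container + folder
            "blobPathEndsWith": ".csv",                            # file type filter
            "ignoreEmptyBlobs": True,                              # optional: skip zero-byte blobs
            "events": ["Microsoft.Storage.BlobCreated"],           # Blob created event
            # Resource ID of the ADLS Gen2 storage account being watched
            "scope": "/subscriptions/<sub-id>/resourceGroups/<rg>"
                     "/providers/Microsoft.Storage/storageAccounts/<account>",
        },
        # Pipeline(s) started when the trigger fires
        "pipelines": [
            {"pipelineReference": {"referenceName": "pipeline1", "type": "PipelineReference"}}
        ],
    },
}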

Related

How to create an Event Trigger in Azure Data Factory when three files are created in an Azure Blob Container?

I need to create a schedule trigger in Azure Data Factory (it will run every 15 minutes for 3 hours) that runs a pipeline when three different files have been created in an Azure Blob Storage container. Pipeline execution should only start when all 3 files exist in the blob container. For example, if 3 hours pass and there are only two files in blob storage, the pipeline should not run.
There is no direct way to make an event trigger wait for 3 files as an AND condition in ADF as of now.
What you can do is:
Create an ADF pipeline with
a) a Get Metadata activity that checks whether the 3 required files are present
b) if yes, an Execute Pipeline activity that triggers the pipeline that should run when the 3 files exist;
if not, ignore the run, throw an error, etc.
Create event triggers for the files and associate them with this pipeline.
So when the 3rd event trigger fires, all files will be found and the main pipeline will be executed.
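
A minimal sketch of steps (a) and (b) follows, written as a Python dict representing the helper pipeline's If Condition activity; the activity name 'Get Metadata1' (with its Child Items field selected) and the pipeline name 'MainPipeline' are assumptions.

# Hypothetical If Condition fragment for the helper pipeline described above.
# 'Get Metadata1' is assumed to be a Get Metadata activity returning the folder's
# Child Items; the main pipeline is only executed once at least 3 files exist.
if_three_files = {
    "name": "If 3 files present",
    "type": "IfCondition",
    "dependsOn": [{"activity": "Get Metadata1", "dependencyConditions": ["Succeeded"]}],
    "typeProperties": {
        "expression": {
            "value": "@greaterOrEquals(length(activity('Get Metadata1').output.childItems), 3)",
            "type": "Expression",
        },
        "ifTrueActivities": [
            {
                "name": "Run main pipeline",
                "type": "ExecutePipeline",
                "typeProperties": {
                    "pipeline": {"referenceName": "MainPipeline", "type": "PipelineReference"},
                    "waitOnCompletion": True,
                },
            }
        ],
        # ifFalseActivities: optionally raise an error or do nothing
    },
}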

Prevent triggering next activity if no files were copied in previous activity in Azure Data Factory

I am using Azure Data Factory to copy data from one Blob Storage account to a Data Lake Storage Gen2 account.
I have created a pipeline with a copy activity inside it. I trigger this pipeline from a timer-triggered Azure Function using the C# SDK.
I am copying only incremental data by making use of the Filter by last modified feature, passing a UTC StartTime and EndTime.
Now, the question is: I don't want to trigger the second activity if no files are found within this range. How can I do that?
You can use an If Condition activity and check whether any files were written with this expression: @greater(activity('Copy data1').output.filesWritten, 0). Then put the second activity inside the True case of the If Condition.
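
A minimal sketch of that If Condition, written as a Python dict, might look as follows; it assumes the copy activity is named 'Copy data1' as in the expression above, and 'Second activity' stands in for whatever should only run when files were copied.

# Hypothetical If Condition fragment: only run the next activity if the copy
# activity ('Copy data1') actually wrote any files. 'Second activity' is a
# placeholder for the real follow-up activity.
if_files_copied = {
    "name": "If files copied",
    "type": "IfCondition",
    "dependsOn": [{"activity": "Copy data1", "dependencyConditions": ["Succeeded"]}],
    "typeProperties": {
        "expression": {
            "value": "@greater(activity('Copy data1').output.filesWritten, 0)",
            "type": "Expression",
        },
        "ifTrueActivities": [
            {"name": "Second activity", "type": "<your activity type>"}
        ],
    },
}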

How to create an Azure Data Factory pipeline and trigger it automatically whenever a file arrives on SFTP?

I'm building an Azure Data Factory pipeline where the source is SFTP and the target is Azure Blob Storage.
Files can arrive at any time, and any number of files can land on the SFTP server on a daily basis.
I have to copy a file from SFTP to Blob Storage whenever a file arrives on SFTP.
I know about the event trigger functionality in ADF, but it only works when files land in Blob Storage.
Is it possible to achieve the same kind of functionality, i.e. copying files on arrival, when the source is something other than Blob Storage?
Data Factory can't achieve that on its own.
One idea is to achieve your purpose with a Logic App:
Create an SFTP trigger: When a file is added or modified.
Add a Create a pipeline run action to execute the Data Factory pipeline.
Pass the newly added filename to the pipeline as a parameter and run the pipeline.
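
Under the hood, that Logic App action creates a pipeline run through the Data Factory API. If you ever need to do the same step in code, a minimal sketch with the azure-mgmt-datafactory Python SDK could look like the following; the resource names, the pipeline name, and the fileName parameter are assumptions and must match your own setup.

# Minimal sketch: start an ADF pipeline run and pass the newly arrived filename
# as a pipeline parameter. All resource names and the 'fileName' parameter are
# placeholders; the pipeline must define a matching parameter.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

run = adf_client.pipelines.create_run(
    resource_group_name="<resource-group>",
    factory_name="<data-factory-name>",
    pipeline_name="CopyFromSftpPipeline",           # hypothetical pipeline name
    parameters={"fileName": "incoming/file1.csv"},  # filename passed by the trigger
)
print(run.run_id)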

Trigger Azure data factory pipeline - Blob upload ADLS Gen2 (programmatically)

We are uploading files into Azure Data Lake Storage using the Azure SDK for Java. After a file is uploaded, an Azure Data Factory pipeline needs to be triggered. A Blob created trigger is attached to the pipeline.
The main problem is that after each file upload the pipeline gets triggered twice.
To upload a file into ADLS Gen2, Azure provides a different SDK than for conventional Blob Storage.
The SDK uses the package azure-storage-file-datalake.
DataLakeFileSystemClient - to get the container
DataLakeDirectoryClient.createFile - to create a file // this call may be raising a Blob created event
DataLakeFileClient.uploadFromFile - to upload the file // this call may also be raising a Blob created event
I think the ADF trigger has not been updated to capture the Blob created event from ADLS Gen2 appropriately.
Any option to achieve this? There are restrictions in my org against using Azure Functions; otherwise an Azure Function could be triggered from a Storage Queue or Service Bus message and the ADF pipeline could be started using the Data Factory REST API.
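
For illustration, here is a rough Python parallel of the two calls listed above (the question itself uses the Java SDK), using the azure-storage-file-datalake package; account, credential, container, and folder names are placeholders. As the question notes, each of the path create and the upload/flush may surface as its own Blob created event on ADLS Gen2.

# Rough Python parallel of the Java calls above, using azure-storage-file-datalake.
# Account, credential, container, and folder names are placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key>",
)
file_system = service.get_file_system_client("<container>")   # DataLakeFileSystemClient
directory = file_system.get_directory_client("<folder>")      # DataLakeDirectoryClient

file_client = directory.create_file("data.csv")               # path create - may raise Blob created
with open("data.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)              # upload + flush - may raise Blob created again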
You could try Azure Logic Apps with a blob trigger and a Data Factory action:
Trigger: When a blob is added or modified (properties only):
This operation triggers a flow when one or more blobs are added or modified in a container. The trigger only fetches the file metadata; to get the file content, you can use the "Get file content" operation. The trigger does not fire if a file is added/updated in a subfolder. If it is required to trigger on subfolders, multiple triggers should be created.
Action: Create a pipeline run
This starts a new pipeline run in your data factory.
Hope this helps.

Duplicate Blob Created Events When Writing to Azure Blob Storage from Azure Databricks

We are using an Azure Storage Account (Blob, StorageV2) with a single container in it. We are also using Azure Data Factory to trigger data copy pipelines from blobs (.tar.gz) created in the container. The trigger works fine when creating the blobs from an Azure App Service or by manually uploading via the Azure Storage Explorer. But when creating the blob from a Notebook on Azure Databricks, we get two (2) events for every blob created (same parameters for both events). The code for creating the blob from the notebook resembles:
dbutils.fs.cp(
    "/mnt/data/tmp/file.tar.gz",
    "/mnt/data/out/file.tar.gz"
)
The tmp folder is just used to assemble the package, and the event trigger is attached to the out folder. We also tried with dbutils.fs.mv, but same result. The trigger rules in Azure Data Factory are:
Blob path begins with: out/
Blob path ends with: .tar.gz
The container name is data.
We did find some similar posts relating to zero-length files, but we can't see any such files anywhere (in case they are some kind of by-product of dbutils).
As mentioned, just manually uploading file.tar.gz works fine - a single event is triggered.
We had to revert to uploading the files from Databricks to the Blob Storage using the azure-storage-blob library. Kind of a bummer, but it works now as expected. Just in case anyone else runs into this.
More information:
https://learn.microsoft.com/en-gb/azure/storage/blobs/storage-quickstart-blobs-python
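
A minimal sketch of that azure-storage-blob workaround follows, assuming the container data and the out/ folder from the trigger rules above; the connection string and the local path to the assembled package are placeholders.

# Minimal sketch of the azure-storage-blob workaround: upload the assembled
# package directly as a block blob. Connection string and local path are
# placeholders; the /dbfs prefix exposes the mount to local file APIs on Databricks.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="data", blob="out/file.tar.gz")

with open("/dbfs/mnt/data/tmp/file.tar.gz", "rb") as f:
    blob.upload_blob(f, overwrite=True)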
