How to Trigger ADF Pipeline from Synapse Pipelines - azure

Problem
Due to internal requirements, I need to run a Synapse pipeline and then trigger an ADF pipeline. It does not seem that there is a Microsoft-approved method of doing this. The pipelines run infrequently (every week or month) and the ADF pipeline must run after the Synapse pipeline.
Options
It seems that other answers pose several options:
1. Azure Functions. Create an Azure Function that calls the CreatePipelineRun function on the ADF pipeline. At the end of the Synapse pipeline, insert a block that calls the Azure Function.
2. Use the REST API and Web Activity. Use the REST API to make a call to run the ADF pipeline. Insert a Web Activity block at the end of the Synapse pipeline to make the API call.
3. Tables and polling. Insert a record into a table in a managed database with data about the Synapse pipeline run. Have regular polling from the ADF pipeline to check for new records and run when ready.
4. Storage Event. Create a timestamped blob file at the end of the Synapse run. Use the "storage event trigger" within ADF to trigger the ADF pipeline.
Question
Which of these would be closest to the "approved" option? Are there any clear disadvantages to any of these?

As you mentioned, there is no "approved" solution for this problem. All the approaches you mentioned have pros and cons and should work. For me, Option #3 has been very successful. We have built a Queue Manager based on Tables & Stored Procedures in Azure SQL. We use Logic Apps to process the Triggers which can be Scheduled, Blob Events, or REST calls. Those Logic Apps insert jobs in the Queue table via Stored Procedure. That Stored Procedure can be called directly by virtually any system, so your Synapse pipeline could insert a Queue job to execute the ADF pipeline. Other benefits include a log of all the pipeline runs, support for multiple Data Factories (and now Synapse Workspaces), and a web interface we wrapped around the database for management and tracking.
We have 2 other Logic Apps that process the Queue (a Status manager and an Executor). These run constantly (every 1 minute and every 3 minutes). The actions to check status and create pipeline runs are both implemented as .NET Azure Functions [you'll need different SDKs for Synapse vs. ADF]. This system runs thousands of pipelines a month, sometimes more, across numerous Data Factories and Synapse Workspaces.
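To make the moving parts concrete, here is a minimal sketch of the queue idea. It is not our actual implementation: it uses an in-memory SQLite database in place of Azure SQL, plain functions in place of the stored procedure and the Logic Apps, and the table, column, and function names are all hypothetical.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not taken from the real system.
SCHEMA = """
CREATE TABLE pipeline_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    target TEXT NOT NULL,      -- a Data Factory or Synapse workspace name
    pipeline TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'Queued',
    run_id TEXT
)
"""


def enqueue(conn, target, pipeline):
    """Stand-in for the stored procedure that any system can call to add a job."""
    cur = conn.execute(
        "INSERT INTO pipeline_queue (target, pipeline) VALUES (?, ?)",
        (target, pipeline),
    )
    conn.commit()
    return cur.lastrowid


def execute_next(conn, start_pipeline):
    """Stand-in for the Executor: start the oldest queued job and record its run ID."""
    row = conn.execute(
        "SELECT id, target, pipeline FROM pipeline_queue "
        "WHERE status = 'Queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    job_id, target, pipeline = row
    # start_pipeline is a caller-supplied function that creates the pipeline run.
    run_id = start_pipeline(target, pipeline)
    conn.execute(
        "UPDATE pipeline_queue SET status = 'InProgress', run_id = ? WHERE id = ?",
        (run_id, job_id),
    )
    conn.commit()
    return job_id


def update_status(conn, get_status):
    """Stand-in for the Status manager: refresh every in-progress job."""
    rows = conn.execute(
        "SELECT id, target, run_id FROM pipeline_queue WHERE status = 'InProgress'"
    ).fetchall()
    for job_id, target, run_id in rows:
        conn.execute(
            "UPDATE pipeline_queue SET status = ? WHERE id = ?",
            (get_status(target, run_id), job_id),
        )
    conn.commit()
```

In this sketch, the Synapse pipeline's last step would only need to call enqueue; the two polling loops pick the job up from there.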
The PROs here are many. Chief among them, this disconnected approach permits facets of your system to operate in isolation. It is also flexible, in that you can tie virtually any system into the Queue. Your scenario of a pipeline that needs to execute another pipeline in a different system is a perfect example.
The CON here is that this is the most involved approach. If this is a one-off problem you are trying to solve, choose one of the other options.
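For the simpler REST route (Option #2), the call comes down to a POST against the Data Factory createRun endpoint. Below is a rough sketch that only builds the request, assuming you already have a bearer token for Azure Resource Manager; the function names and example values are hypothetical, and obtaining the token (for example through a managed identity with access to the factory) is out of scope here.

```python
import json
import urllib.request

ADF_API_VERSION = "2018-06-01"


def create_run_url(subscription_id, resource_group, factory, pipeline):
    """Build the ARM URL for the Data Factory 'Pipelines - Create Run' operation."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.DataFactory/factories/{factory}"
        f"/pipelines/{pipeline}/createRun?api-version={ADF_API_VERSION}"
    )


def build_create_run_request(url, bearer_token, parameters=None):
    """Prepare the POST request; the JSON body carries the pipeline parameters."""
    body = json.dumps(parameters or {}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {bearer_token}",
            "Content-Type": "application/json",
        },
    )
```

A Web Activity at the end of the Synapse pipeline would use the same URL, method, and body. From a script, passing the request to urllib.request.urlopen sends it.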

Related

How to invoke Job/Task in Azure Databricks from Azure Function

I need to develop an event-driven pipeline that is triggered when a file arrives in ADLS Gen2 (ABFS). On file arrival I need to trigger 4 subsequent Spark jobs on an Azure Databricks cluster.
For orchestrating the Spark jobs I can use Databricks Jobs, so that the jobs are triggered as a pipeline.
But the first job should be triggered only after the file arrives.
I am currently exploring ways to achieve this, but I need expert advice on the best design with respect to cost.
One solution could be to use Azure Data Factory to orchestrate the entire flow based on a Storage Event Trigger, but adopting ADF just for the event-based trigger doesn't look sensible to me, since the rest of the application (the Spark jobs) can be pipelined with the Databricks Jobs feature. Also, ADF can be expensive. Another solution could be to use an Azure Functions Blob Trigger to detect the file arrival, but I don't understand how to trigger Azure Databricks jobs from Azure Functions. Functions could be cost-effective, since the function would not be running until a file has arrived.
Note: There can be multiple files arriving in an hour, with no fixed interval between arrivals.
Also, the trigger file is different from the data files, i.e. on arrival of a trigger file, the Spark pipeline would consume the actual data files.
Data files and trigger files have different extensions, and both arrive in ABFS.
Your worry about ADF cost is misplaced. The Pipelines are extremely cheap. The activities that actually move data and use CPU are where most of the cost is. For instance, Data Flows are run on managed Spark clusters, which is reflected in the pricing. See Data Factory Pricing. Using a Pipeline to orchestrate Databricks jobs is a common, simple, and (at least for ADF) very inexpensive approach.
If you want to kick off a Databricks job from an Azure Function, there's an API. Also check out the Databricks Autoloader, but running your Databricks cluster continuously can be expensive.
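As a rough illustration of the Azure Function route, the sketch below prepares a call to the Databricks Jobs API run-now endpoint and skips blobs that are not trigger files. The ".trigger" extension, the function names, and the workspace URL are assumptions for the example; the token is assumed to be a Databricks access token you already hold.

```python
import json
import urllib.request


def is_trigger_file(blob_name, trigger_suffix=".trigger"):
    """Return True only for trigger files; the suffix is a made-up example."""
    return blob_name.lower().endswith(trigger_suffix)


def build_run_now_request(workspace_url, token, job_id, notebook_params=None):
    """Prepare a POST to /api/2.1/jobs/run-now for an existing Databricks job."""
    payload = {"job_id": job_id}
    if notebook_params:
        payload["notebook_params"] = notebook_params
    return urllib.request.Request(
        f"{workspace_url.rstrip('/')}/api/2.1/jobs/run-now",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Inside a blob-triggered function, you would check is_trigger_file on the blob name first and only then send the request, so that arriving data files do not start the job.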

Azure Data Factory(ADF) vs Azure Functions: How to choose?

Currently we are using blob-triggered Azure Functions to move JSON data into Cosmos DB. We are planning to replace the Azure Functions with an Azure Data Factory (ADF) pipeline.
I am new to Azure Data Factory (ADF), so I'm not sure: would an ADF pipeline be the better option?
Though my answer is a bit late, I would like to add that I would not recommend replacing your current setup with ADF. Reasons:
It is too expensive. ADF costs way more than Azure Functions.
Custom logic: ADF is not built to perform cleansing logic or run custom code. Its primary goal is data integration from external systems using its vast connector pool.
Latency: ADF has much higher latency due to the large overhead of its job framework.
Based on your requirements, Azure Data Factory is your perfect option. You could follow this tutorial to configure Cosmos DB Output and Azure Blob Storage Input.
The advantage over Azure Functions is that you don't need to write any custom code unless data cleaning is involved, and Azure Data Factory is the recommended option. Even if you want an Azure Function for other purposes, you can add it within the pipeline.
The fundamental use of Azure Data Factory is data ingestion. Azure Functions are serverless (Function as a Service) and are best used for short-lived instances. Azure Functions that execute for multiple seconds are far more expensive. Azure Functions are good for event-driven microservices. For data ingestion, Azure Data Factory is a better option, as its running cost for huge data will be lower than that of Azure Functions. You can also integrate Spark processing pipelines in ADF for more advanced data ingestion pipelines.
Moreover, it depends on your situation. Azure Functions are serverless, lightweight processes meant for quick responses to an event, as opposed to the volumetric processing that batch processes are meant for.
So, if your requirement is to respond quickly to an event carrying little information, stay with Azure Functions; if you need batch processing, switch to ADF.
Cost
I get images from here.
Let's calculate the cost:
if your file is large:
43:51hour=43.867(h)
4(DIU)*43.867(h)*0.25($/DIU-H)=43.867$
43.867/7.514GB= 5.838 ($/GB)
if your file is small (2.497 MB), the copy takes about 45 seconds, rounded up to 1 minute here:
4(DIU)*1/60(h)*0.25($/DIU-H)=0.0167$
2.497MB/1024=0.002438 GB
0.0167/0.002438= 6.850 ($/GB)
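The same arithmetic can be written as a small helper. This is only a sketch of the calculation above: the 4 DIU and 0.25 $/DIU-hour figures come from the text, rounding the duration up to whole minutes is an assumption about how the run is billed, and the function name is made up.

```python
import math


def copy_cost_per_gb(duration_seconds, size_gb, diu=4, rate_per_diu_hour=0.25):
    """Return (total cost in $, cost in $ per GB) for one copy activity run."""
    # Assumption: execution time is rounded up to the next whole minute.
    billed_minutes = math.ceil(duration_seconds / 60)
    cost = diu * (billed_minutes / 60) * rate_per_diu_hour
    return cost, cost / size_gb


# Large file: 43.867 h is 2632 minutes, for 7.514 GB.
large_cost, large_per_gb = copy_cost_per_gb(2632 * 60, 7.514)

# Small file: about 45 seconds for 2.497 MB.
small_cost, small_per_gb = copy_cost_per_gb(45, 2.497 / 1024)
```

Because the helper does not round the intermediate cost to 0.0167, the small-file result differs slightly in the second decimal place from the hand calculation.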
Scale
The maximum number of instances an Azure Function can scale out to is 200.
ADF can run 3,000 concurrent external activities. In my test, however, only 1,500 copy activities ran in parallel. (This test wasted a lot of money.)

Azure Functions as Scheduler

Is Azure Functions a good alternative to Azure Data Factory as a scheduler? It has a blob trigger to monitor storage, and can use C# to trigger Databricks jobs through the API. But is it a viable alternative?
Edited to add more information: I want to trigger a Databricks job based on a trigger file, but do not want to use Azure Data Factory or a Databricks job.
I would probably use a simple logic app with an Event Grid trigger on the storage account's blob-created event. Based on the trigger data, I would call the Databricks Jobs REST API.
I had the entire demo below working in under 10 minutes, so it is fast to set up.
The logic app uses the Event Grid blob-created event as its trigger. I strongly suggest adding a prefix filter like
/blobServices/default/containers/<container_name>
so that you don't fire too many logic app runs for other containers, since Event Grid reacts to all events in the entire storage account.
The logic app then makes an HTTP call to the Databricks REST API. My demo called the clusters list endpoint; at this point, simply change that call to the job-submitting REST call.
Just make sure that EventGrid resource provider is registered or logic app will never fire off.
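If you ever need the same prefix filtering in code rather than in the logic app designer, it could look roughly like the sketch below. It relies on the Event Grid event schema for storage (the eventType and subject fields); the function name, the optional suffix filter, and the sample event values are made up for illustration.

```python
def should_start_job(event, container, suffix=""):
    """Return True for blob-created events in one container, optionally by extension."""
    if event.get("eventType") != "Microsoft.Storage.BlobCreated":
        return False
    # The trailing slash keeps a container named "data" from matching "data2".
    prefix = f"/blobServices/default/containers/{container}/"
    subject = event.get("subject", "")
    return subject.startswith(prefix) and subject.endswith(suffix)


# Hypothetical event, trimmed to the two fields the filter reads.
sample_event = {
    "eventType": "Microsoft.Storage.BlobCreated",
    "subject": "/blobServices/default/containers/landing/blobs/batch01.trigger",
}
```

With the sample event, should_start_job(sample_event, "landing", ".trigger") accepts the blob, while the same call for any other container rejects it.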

Is it possible to use Control-M to orchestrate Azure Data Factory jobs?

Is it possible to use Control-M to orchestrate Azure Data Factory jobs?
I found this agent that can be installed on a VM:
https://azuremarketplace.microsoft.com/en-us/marketplace/apps/bmc-software.ctm-agent-linux-arm
But I didn't find documentation about it.
Can Control-M call a REST API to run and monitor a job? I could use Azure Functions and Blobs to control it.
All Control-M components can be installed and operated on Azure (and most other cloud infrastructure). Either use the link you quote or alternatively deploy Agents using Control-M Automation API (AAPI) or a combination of the two.
So long as you are on a fairly recent version of Control-M, you can do most operational tasks. For example, you can monitor a job like so:
ctm run jobs:status::get -s "jobid=controlm:00001"
The Control-M API is developing quickly; check out the documentation linked from here:
https://docs.bmc.com/docs/automation-api/9019100monthly/services-872868740.html#Control-MAutomationAPI-Services-ctmrunjob:status::get
Also see -
https://github.com/controlm/automation-api-quickstart
http://controlm.github.io
https://docs.bmc.com/docs/display/public/workloadautomation/Control-M+Automation+API+-+Services
https://52.32.170.215:8443/automation-api/swagger-ui.html
At this time, I don't believe you will find any out of the box connectors for Control-M to Azure Data Factory integration. You do have some other options, though!
Proxy ADF Yourself
You can write the glue code for this, essentially being the mediator between the two.
Write a program that will invoke the ADF REST API to run a pipeline.
After triggering the pipeline, then write the code for monitoring for status.
Have Control-M call your code via an Agent that has access to it.
I've done this with a C# console app running on a local server, and a Control-M Agent that invokes the glue code.
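My glue code was C#, but the monitoring part looks roughly like the Python sketch below. The get_status callable is hypothetical: it stands for whatever code asks the ADF REST API for the run's current status. The status names are the ones ADF reports for pipeline runs; the exit codes and default intervals are my own convention so that the scheduler can tell success, failure, and timeout apart.

```python
import time

# Pipeline run states after which ADF will not change the status again.
TERMINAL_STATES = {"Succeeded", "Failed", "Cancelled"}


def wait_for_pipeline_run(
    get_status,
    poll_seconds=30,
    timeout_seconds=4 * 3600,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll until the run finishes; return 0 on success, 1 on failure, 2 on timeout."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return 0 if status == "Succeeded" else 1
        if clock() >= deadline:
            return 2
        sleep(poll_seconds)
```

A console program would pass the return value to sys.exit, so that the Control-M job ends OK or NOTOK according to the pipeline result.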
According to the Control-M documentation, you can also execute an Azure Function directly from Control-M. This means you could put your code in an Azure Function.
Alternative Method
For a "no code" way, check out this Logic App connector.
Write a logic app that runs the pipeline and then gets the pipeline run in a loop to monitor its status.
Next, Control-M should be able to use a plugin to invoke the logic app.
Notes
Note that Control-M required an HTTP Trigger for Azure Functions and Logic Apps.
You might also be able to take advantage of the Control-M Web Services plugin. Though, in my experience, I wasn't impressed with the lack of support for different authentication methods.
Hope this helps!
I just came across this post so a bit late to the party.
Control-M includes Application Integrator which enables you to use integrations created by others and to either enhance them or build your own. You can use REST or cli to instruct Control-M what requests should be generated to an application when a job is started, during execution and monitoring and how to analyze results and collect output.
A public repository accessible from Application Integrator shows existing jobs, and there is one for Data Factory. I have extended it a bit so that the Data Factory pipeline is started and monitored to completion via REST, and then a PowerShell script is invoked to retrieve the pipeline run information for each activity within the pipeline.
I've posted that job and script in https://github.com/JoeGoldberg/automation-api-community-solutions/tree/master/4-ai-job-type-examples/CTM4AzureDataFactory but the README is coming later.

Azure Data factory pipeline: How to display Custom Activity Progress in azure portal

I have a time-consuming custom activity running in an Azure Data Factory pipeline.
It copies files from Blob to FTP server recursively.
The entire activity takes 3-4 hours, depending on the number of files in the folder.
But when I am running the pipeline, it shows in progress 0%.
How can I update the pipeline progress from the custom activity?
In short, I doubt you will be able to. The services are very disconnected from each other.
You might be better off writing out to the Azure generic activity log and monitoring directly from the custom activity method. This is an assumption though.
Hope this helps.
