Can Data Factory start or stop a self-hosted integration runtime? - Azure

I have a self-hosted integration runtime configured on a virtual machine in Azure, because I need to access an on-premises database to load some information, and this database can only be reached through a VPN connection.
Is there a way to turn the virtual machine on and off around the loading process (which runs once a week) in order to optimize cost in the cloud? It makes no sense to me to leave a VM billing at idle times.
Thank you

Updated:
We can create an Azure Automation runbook, add PowerShell code to turn the VM on or off, and call it from ADF v2 using a webhook, as described in this post.
Create a trigger that runs the pipeline on a schedule. When creating a schedule trigger, you specify a schedule (start date, recurrence, end date, etc.) for the trigger and associate it with the pipeline.
At the start of the pipeline, use a WebHook activity to start the VM, then copy the data, and at the end use another WebHook activity to stop the VM.
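The linked post uses a PowerShell runbook; purely as an illustration of what such a runbook does (Azure Automation also supports Python runbooks), here is a minimal sketch using the azure-mgmt-compute SDK. The subscription ID, resource group, and VM name are placeholders:

```python
# Minimal start/stop sketch for the VM hosting the self-hosted IR.
# All resource names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

client = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

def set_vm_power(resource_group: str, vm_name: str, action: str) -> None:
    """Start the VM before the weekly load, deallocate it afterwards."""
    if action == "start":
        poller = client.virtual_machines.begin_start(resource_group, vm_name)
    else:
        # Deallocate rather than just power off, so compute billing stops.
        poller = client.virtual_machines.begin_deallocate(resource_group, vm_name)
    poller.result()  # block until the operation completes

set_vm_power("<resource-group>", "<ir-vm-name>", "start")
```

Note that the WebHook activity passes a callBackUri in its request body and waits for the runbook to POST back to it, which is how the pipeline knows the VM is actually up before the copy starts.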
If you're looking to build data pipelines in Azure Data Factory, your cost will be split into two categories:
Data Factory Operations
Pipeline Orchestration and Execution
Data Factory Operations
Read/Write: Every time you create/edit/delete a pipeline activity or a Data Factory entity such as a dataset, linked service, integration runtime or trigger, it counts towards your Data Factory Operations cost. These are billed at $0.50 per 50,000 operations.
Monitoring: You can monitor each pipeline run and view the status of each individual activity. For each pipeline run, you can expect to retrieve one record for the pipeline and one record for each activity or trigger. For instance, you would be charged for 3 monitoring run records if you debug a pipeline containing 2 activities. Monitoring is charged at $0.25 per 50,000 run records retrieved.
Pipeline Orchestration
Self Hosted
Every time you run a pipeline, you are charged for every activity and trigger inside that pipeline that is executed at a rate of $1.50 per 1000 Activity runs.
As an example, executing a pipeline with a trigger and two activities would be charged as 3 Activity runs.
Pipeline Execution
Self Hosted
Data movement : $0.10/hour
Pipeline activities : $0.002/hour
External activities : $0.0001/hour
An inactive pipeline is charged at $0.80 per month.
Summary:
Take a 30-day month in which your pipeline runs on one day, moving data for 10 hours. The data movement is charged at 10h x $0.10/h = $1. For the remaining 29 days the pipeline is inactive and is billed pro rata at the inactive-pipeline rate: 29 days / 30 days x $0.80 ≈ $0.77. So your total cost is about $1.77.
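As a quick sanity check of that arithmetic:

```python
# One 10-hour data-movement run in a 30-day month, self-hosted rates.
movement_cost = 10 * 0.10        # $0.10/hour data movement -> $1.00
inactive_cost = 29 / 30 * 0.80   # inactive-pipeline rate, pro-rated -> ~$0.77
print(f"${movement_cost + inactive_cost:.2f}")  # -> $1.77
```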

Related

How to Trigger ADF Pipeline from Synapse Pipelines

Problem
Due to internal requirements, I need to run a Synapse pipeline and then trigger an ADF pipeline. It does not seem that there is a Microsoft-approved method of doing this. The pipelines run infrequently (every week or month) and the ADF pipeline must run after the Synapse pipeline.
Options
It seems that other answers pose several options:
Azure Functions. Create an Azure function that calls the CreatePipelineRun function on the ADF pipeline. At the end of the Synapse pipeline, insert a block that calls the Azure function.
Use the REST API and Web Activity. Use the REST API to make a call to run the ADF pipeline, inserting a Web Activity block at the end of the Synapse pipeline to make the API call (see the sketch after this list).
Tables and polling. Insert a record into a table in a managed database with data about the Synapse pipeline run. Have regular polling from the ADF pipeline to check for new records and run when ready.
Storage Event. Create a timestamped blob file at the end of the Synapse run. Use the "storage event trigger" within ADF to trigger the ADF pipeline.
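For Option #2, the REST call behind the Web Activity is a plain management-API request. Here is a hedged Python sketch of the same call; the subscription, resource group, factory, and pipeline names are placeholders:

```python
# Sketch of the ADF "create run" REST call; all names are placeholders.
import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token(
    "https://management.azure.com/.default").token

url = (
    "https://management.azure.com/subscriptions/<sub-id>"
    "/resourceGroups/<rg>/providers/Microsoft.DataFactory"
    "/factories/<factory>/pipelines/<pipeline>/createRun"
    "?api-version=2018-06-01"
)
resp = requests.post(url, headers={"Authorization": f"Bearer {token}"}, json={})
resp.raise_for_status()
print(resp.json()["runId"])  # ID of the pipeline run that was just started
```

In the Synapse Web Activity you would make the same POST, typically authenticating with the workspace's managed identity, granted a Data Factory Contributor role on the target factory.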
Question
Which of these would be closest to the "approved" option? Are there any clear disadvantages to any of these?
As you mentioned, there is no "approved" solution for this problem. All the approaches you mentioned have pros and cons and should work. For me, Option #3 has been very successful. We have built a Queue Manager based on Tables & Stored Procedures in Azure SQL. We use Logic Apps to process the Triggers which can be Scheduled, Blob Events, or REST calls. Those Logic Apps insert jobs in the Queue table via Stored Procedure. That Stored Procedure can be called directly by virtually any system, so your Synapse pipeline could insert a Queue job to execute the ADF pipeline. Other benefits include a log of all the pipeline runs, support for multiple Data Factories (and now Synapse Workspaces), and a web interface we wrapped around the database for management and tracking.
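For a flavor of the queue insert, a hypothetical call from Python might look like this; the stored procedure name and its parameters are invented for illustration, not the actual schema:

```python
# Hypothetical queue insert; dbo.InsertQueueJob and its parameters are
# illustrative only, not the real schema described above.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<server>.database.windows.net;Database=<db>;"
    "Authentication=ActiveDirectoryMsi;"
)
cur = conn.cursor()
cur.execute(
    "EXEC dbo.InsertQueueJob @Factory = ?, @Pipeline = ?",
    "MyFactory", "MyAdfPipeline",
)
conn.commit()
```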
We have 2 other Logic Apps that process the Queue (a Status manager and an Executor). These run constantly (every 1 minute and every 3 minutes). The actions to check status and create pipeline runs are both implemented as .NET Azure Functions [you'll need different SDKs for Synapse vs. ADF]. This system runs thousands of pipelines a month, sometimes more, across numerous Data Factories and Synapse Workspaces.
The PROs here are many, but this disconnected approach permits facets of your system to operate in isolation. And it is flexible, in that you can tie virtually any system into the Queue. Your example of a pipeline that needs to execute another pipeline in a different system is a perfect example.
The CON here is that this is the most involved approach. If this is a one-off problem you are trying to solve, choose one of the other options.

Azure data factory end time trigger

I have a scenario: an ADF instance named XYZ contains one pipeline with a schedule trigger that starts at 12:00 AM at night. The run sometimes finishes within 1 hour and sometimes takes more than 2 hours, depending on the data load.
I have another ADF instance, ABC, which also contains one pipeline. My requirement is to schedule the ABC instance's pipeline to start when the XYZ instance's trigger run has completed.
Kindly help with this requirement. The two ADF instances are separate, and the trigger end time varies with the load.
The simplest way is to use a Logic App. In the Logic App designer, we can create two pipeline-run steps to trigger the two pipelines in the different Data Factories.
Create a Recurrence trigger to run this Logic App on a schedule.
In the Azure Data Factory connector's operations, select the Create a pipeline run action.
In summary: the Logic App triggers the pipeline run in the ADF instance XYZ, and when that run completes, it triggers the pipeline run in the ADF instance ABC.
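If you would rather express the chaining in code than in the designer, a minimal Python sketch of the same logic with the azure-mgmt-datafactory SDK could look like this; the subscription, resource groups, and pipeline names are placeholders, and both factories are assumed to be reachable with one credential:

```python
# Run the XYZ pipeline, wait for it to finish, then trigger ABC.
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Start the pipeline in factory XYZ and poll until it reaches a final state.
run_id = client.pipelines.create_run("<rg-xyz>", "XYZ", "<pipeline1>").run_id
while True:
    run = client.pipeline_runs.get("<rg-xyz>", "XYZ", run_id)
    if run.status not in ("Queued", "InProgress"):
        break
    time.sleep(60)  # poll once a minute

# Only start the ABC pipeline if XYZ succeeded.
if run.status == "Succeeded":
    client.pipelines.create_run("<rg-abc>", "ABC", "<pipeline2>")
```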

How to schedule jobs at scale with Azure?

I have a web API application deployed to App Service.
I need to be able to set an arbitrary number of scheduled jobs (HTTP calls to my web API) with arbitrary due date-times (from a few hours to a few months ahead).
The point is that I need to be able to set/edit/cancel them on the fly, programmatically, based on the changing state of my app, and I need to maintain thousands of them.
Is there some recommended way to do it?
I would persist the jobs in either a SQL database or Table Storage (in a table I would call 'ScheduledJobs').
Then have an Azure Function that queries the ScheduledJobs storage at some interval, say every hour, to pick up the jobs that are due at that point in time.
The jobs that are due for processing can then be written to a queue (I would name it 'jobs-queue').
Then have another Azure Function that picks up jobs from 'jobs-queue'. This Azure Function holds the business logic for processing each job.
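A minimal sketch of the hourly dispatcher, using the Azure Functions Python v2 programming model; the ScheduledJobs lookup is stubbed out as a hypothetical helper:

```python
import datetime
import json
import typing

import azure.functions as func

app = func.FunctionApp()

def fetch_due_jobs(now: datetime.datetime) -> typing.List[dict]:
    """Hypothetical helper: query the ScheduledJobs store for jobs due by `now`."""
    return []  # replace with a SQL / Table Storage query

@app.timer_trigger(schedule="0 0 * * * *", arg_name="timer")  # top of every hour
@app.queue_output(arg_name="msg", queue_name="jobs-queue",
                  connection="AzureWebJobsStorage")
def dispatch_due_jobs(timer: func.TimerRequest,
                      msg: func.Out[typing.List[str]]) -> None:
    due = fetch_due_jobs(datetime.datetime.utcnow())
    # One queue message per due job; the queue-triggered worker does the rest.
    msg.set([json.dumps(job) for job in due])
```

The second function would be declared with a queue trigger on 'jobs-queue' and would make the HTTP call to your web API for each job.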
From my experience, an Azure Function on the Consumption plan can only run for up to 10 minutes. The time it takes to process each job should not exceed this limit.
Hope this gives you some idea.

How to set an alert for Azure Data factory when Pipeline takes more than N minutes to complete

I need to set up an alert that fires if my Azure Data Factory pipeline runs for more than 20 minutes. The alert should come while the pipeline is still running, once the duration passes 20 minutes, not after the pipeline completes. How can I do this? I think this can be done using an Azure Function, but I am not familiar with it, so I'm looking for a script that does this.
Yes, an Azure Function is a solution that can achieve your requirement.
For example, if you are using Python, you need an Azure Function that runs periodically to monitor the status of the pipeline. The key metric is the pipeline's duration. A pipeline is composed of activities, and you can monitor every activity.
In Python, this is how to query the activity runs you want:
https://learn.microsoft.com/en-us/python/api/azure-mgmt-datafactory/azure.mgmt.datafactory.operations.activityrunsoperations?view=azure-python#query-by-pipeline-run-resource-group-name--factory-name--run-id--filter-parameters--custom-headers-none--raw-false----operation-config-
The following shows how to get the duration of a Data Factory activity run:
https://learn.microsoft.com/en-us/python/api/azure-mgmt-datafactory/azure.mgmt.datafactory.models.activityrun?view=azure-python#variables
(There is a variable named duration_in_ms that you can use to get the duration of the activity run.)
This shows how to monitor a pipeline with Python:
https://learn.microsoft.com/en-us/azure/data-factory/monitor-programmatically#python
You can create an Azure Function app with a timer trigger to monitor the Data Factory activities. This is the documentation for the Azure Functions timer trigger:
https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-timer?tabs=python
The basic idea is to put the code that checks whether the pipeline has been running for more than N minutes into the body of the timer-triggered function, and then use the function's result to signal that the pipeline's running time has exceeded N minutes.
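Sketched in Python with the azure-mgmt-datafactory SDK (resource names are placeholders), the check inside the timer-triggered function might look like this:

```python
# Find pipeline runs that have been InProgress for more than 20 minutes.
# Subscription, resource group, and factory names are placeholders.
import datetime
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    RunFilterParameters, RunQueryFilter,
    RunQueryFilterOperand, RunQueryFilterOperator,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

def long_running_pipelines(rg: str, factory: str, limit_minutes: int = 20):
    now = datetime.datetime.now(datetime.timezone.utc)
    params = RunFilterParameters(
        last_updated_after=now - datetime.timedelta(days=1),
        last_updated_before=now,
        filters=[RunQueryFilter(
            operand=RunQueryFilterOperand.STATUS,
            operator=RunQueryFilterOperator.EQUALS,
            values=["InProgress"],
        )],
    )
    runs = client.pipeline_runs.query_by_factory(rg, factory, params).value
    # duration_in_ms is only final for completed runs, so for in-progress
    # runs we measure elapsed time from run_start instead.
    return [r for r in runs
            if (now - r.run_start).total_seconds() > limit_minutes * 60]

for run in long_running_pipelines("<resource-group>", "<factory>"):
    print(f"ALERT: pipeline {run.pipeline_name} run {run.run_id} over 20 min")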
Then wire up the notification itself. You can set an output binding on the Azure Function to raise the alert. In the Azure portal, configure the alert's action: select Email/SMS message as the action type and give it your email address.

Azure Data Factory Pipeline Cost

I'm using Azure Data Factory for a migration project, and while doing it I came across something that needs clarification: if I keep a pipeline in ADF without using it, will there be a cost for that? I need to run some pipelines on a schedule, like weekly or monthly. Please help.
Yes: $0.80 / month / inactive pipeline. According to Data Pipeline Pricing, under the heading Inactive pipelines:
"A pipeline is considered inactive if it has no associated trigger or any runs within the month. An inactive pipeline is charged at $0.80 per month."
"I just want to know if I keep a pipeline in ADF without using it, will there be a cost for that?"
The quick answer is no. Based on the ADF pricing document, the billing consists of Data Pipelines and SQL Server Integration Services.
Your account only pays when you execute your activities (in the pipelines), or for SQL Server Integration Services if your migration involves SQL Server databases.
