Azure Batch Job Schedule - notification after recurrence job completed/failed - azure

I am trying to use an Azure Batch job schedule in my .NET Core application. I want to get some notification or event trigger once a recurrence job in the job schedule completes or fails, so that I can copy output files to storage and send an email to the end user.
Is it possible to get such a notification from an Azure Batch job schedule, or is there another solution for this?
I can't find any sample implementation of Azure Batch job scheduling.
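One way to approximate such a notification is to poll the schedule's most recent recurrence from your own code. Below is a minimal sketch using the Batch .NET SDK; the account details, the schedule ID, and the copy/email steps are placeholders, and this is polling rather than a built-in push notification.

    using System;
    using System.Threading.Tasks;
    using Microsoft.Azure.Batch;
    using Microsoft.Azure.Batch.Auth;
    using Microsoft.Azure.Batch.Common;

    class ScheduleMonitor
    {
        static async Task Main()
        {
            // Placeholder credentials -- substitute your own Batch account values.
            var credentials = new BatchSharedKeyCredentials(
                "https://<account>.<region>.batch.azure.com", "<account>", "<key>");
            using BatchClient batchClient = BatchClient.Open(credentials);

            // Each recurrence of the schedule creates a new job;
            // ExecutionInformation.RecentJob points at the latest one (null before the first run).
            CloudJobSchedule schedule =
                await batchClient.JobScheduleOperations.GetJobScheduleAsync("my-schedule");
            string latestJobId = schedule.ExecutionInformation.RecentJob.Id;

            CloudJob job = await batchClient.JobOperations.GetJobAsync(latestJobId);
            if (job.State == JobState.Completed)
            {
                // Inspect tasks to decide success vs. failure, then copy output
                // files to Storage and send the email from here.
                foreach (CloudTask task in job.ListTasks())
                {
                    if (task.ExecutionInformation.Result == TaskExecutionResult.Failure)
                        Console.WriteLine(
                            $"Task {task.Id} failed: {task.ExecutionInformation.FailureInformation?.Message}");
                }
            }
        }
    }

Running something like this on a timer (for example from an Azure Function) after each scheduled recurrence would give you the completed/failed signal to act on.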

I hope this blog will solve your problem: https://mindmajix.com/azure-batch
A step-by-step example with code is provided there.
Batch Tutorial
Here we will use the .NET Batch library and Visual Studio to create a sample Batch task (a condensed code sketch follows the steps below).
Step 1. Create containers in Azure Blob Storage.
Step 2. Upload task application files and input files to containers.
Step 3. Create a Batch pool.
3a. The pool StartTask downloads the task binary files (TaskApplication) to nodes as they join the pool.
Step 4. Create a Batch job.
Step 5. Add tasks to the job.
5a. The tasks are scheduled to execute on nodes.
5b. Each task downloads its input data from Azure Storage, then begins execution.
Step 6. Monitor tasks.
6a. As tasks are completed, they upload their output data to Azure Storage.
Step 7. Download task output from Storage.
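For orientation, here is a condensed sketch of steps 3 to 5 in C#, assuming a BatchClient has already been opened (as in the earlier sketch) and the code runs inside an async method with System.Collections.Generic imported; the pool size, VM image, command lines, and blob SAS URL are placeholder values, not the blog's exact code.

    // Step 3: create a pool; its StartTask pulls the task binaries onto each node as it joins.
    CloudPool pool = batchClient.PoolOperations.CreatePool(
        poolId: "demo-pool",
        virtualMachineSize: "standard_d2s_v3",
        virtualMachineConfiguration: new VirtualMachineConfiguration(
            new ImageReference(
                publisher: "microsoftwindowsserver",
                offer: "windowsserver",
                sku: "2022-datacenter-core",
                version: "latest"),
            nodeAgentSkuId: "batch.node.windows amd64"),
        targetDedicatedComputeNodes: 2);
    pool.StartTask = new StartTask { CommandLine = "cmd /c echo download TaskApplication here" };
    await pool.CommitAsync();

    // Step 4: create a job bound to the pool.
    CloudJob job = batchClient.JobOperations.CreateJob();
    job.Id = "demo-job";
    job.PoolInformation = new PoolInformation { PoolId = "demo-pool" };
    await job.CommitAsync();

    // Step 5: add a task that downloads its input file from Blob Storage via a SAS URL.
    var task = new CloudTask("task-1", "cmd /c TaskApplication.exe input1.txt");
    task.ResourceFiles = new List<ResourceFile>
    {
        ResourceFile.FromUrl("<input blob SAS URL>", filePath: "input1.txt")
    };
    await batchClient.JobOperations.AddTaskAsync("demo-job", task);

Steps 6 and 7 (monitoring tasks and downloading output) then come down to listing the tasks on the job and reading the output blobs back from Storage.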

Related

How to invoke Job/Task in Azure Databricks from Azure Function

I need to develop an event-driven pipeline that should be triggered on file arrival in ADLS Gen2, i.e. ABFS. On file arrival I need to trigger 4 subsequent Spark jobs on an Azure Databricks cluster.
For orchestrating the Spark jobs I can use Databricks Jobs as an option, so that the jobs get triggered in a pipeline.
But the first job should get triggered only after the file arrival.
I am currently exploring ways to achieve this but need expert advice to design this in the best possible manner with respect to cost.
One solution could be to use Azure Data Factory to orchestrate the entire flow based on its Storage Event Trigger component, but going for ADF just because of the event-based trigger doesn't look feasible to me, since the rest of the application, i.e. the Spark jobs, can be pipelined with the Databricks Jobs feature. Also, in terms of cost, ADF can be expensive. Another solution could be to use an Azure Functions Blob Trigger to detect the file arrival, but I am not able to understand how I can trigger Azure Databricks jobs from Azure Functions. Going with Functions could be cost-effective, as the function would not be running/active until the file has arrived.
Note: there can be multiple files arriving in an hour, and there is no fixed schedule for file arrival.
Also, the trigger file is different from the data files, i.e. on arrival of a trigger file, the Spark pipeline would consume the actual data files.
Data files and trigger files have different extensions, and both arrive in ABFS.
Your worry about ADF cost is misplaced. The pipelines themselves are extremely cheap; the activities that actually move data and use CPU are where most of the cost is. For instance, Data Flows run on managed Spark clusters, which is reflected in the pricing. See Data Factory Pricing. Using a pipeline to orchestrate Databricks jobs is a common, simple, and (at least on the ADF side) very inexpensive pattern.
If you want to kick off a Databricks job from an Azure Function, there's an API for that. Also check out Databricks Auto Loader, but note that running your Databricks cluster continuously can be expensive.
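A rough sketch of that combination, assuming the in-process Azure Functions model and the Databricks Jobs REST API; the container name, trigger-file extension, app settings, and job ID are placeholders you would replace with your own:

    using System;
    using System.IO;
    using System.Net.Http;
    using System.Text;
    using System.Threading.Tasks;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Extensions.Logging;

    public static class TriggerDatabricksJob
    {
        private static readonly HttpClient Http = new HttpClient();

        // Fires only for trigger files (the ".trigger" extension is a placeholder);
        // data files with other extensions will not invoke the function.
        [FunctionName("TriggerDatabricksJob")]
        public static async Task Run(
            [BlobTrigger("landing/{name}.trigger", Connection = "StorageConnection")] Stream triggerFile,
            string name,
            ILogger log)
        {
            string workspaceUrl = Environment.GetEnvironmentVariable("DATABRICKS_URL");   // e.g. https://adb-xxxx.azuredatabricks.net
            string token = Environment.GetEnvironmentVariable("DATABRICKS_TOKEN");        // PAT or AAD token

            // Jobs API 2.1 run-now starts an existing Databricks job definition.
            var request = new HttpRequestMessage(HttpMethod.Post, $"{workspaceUrl}/api/2.1/jobs/run-now")
            {
                Content = new StringContent("{\"job_id\": 123}", Encoding.UTF8, "application/json")
            };
            request.Headers.Add("Authorization", $"Bearer {token}");

            HttpResponseMessage response = await Http.SendAsync(request);
            log.LogInformation("run-now for trigger file {name}: {status}", name, response.StatusCode);
        }
    }

The single job_id here would point at a Databricks job that already chains your four Spark tasks, so the Function only has to kick off the first step when a trigger file lands.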

How to Trigger ADF Pipeline from Synapse Pipelines

Problem
Due to internal requirements, I need to run a Synapse pipeline and then trigger an ADF pipeline. It does not seem that there is a Microsoft-approved method of doing this. The pipelines run infrequently (every week or month) and the ADF pipeline must run after the Synapse pipeline.
Options
It seems that other answers pose several options:
Azure Functions. Create an Azure function that calls the CreatePipelineRun function on the ADF pipeline. At the end of the Synapse pipeline, insert a block that calls the Azure function.
Use the REST API and Web Activity. Use the REST API to make a call to run the ADF pipeline. Insert a Web Activity block at the end of the Synapse pipeline to make the API call.
Tables and polling. Insert a record into a table in a managed database with data about the Synapse pipeline run. Have regular polling from the ADF pipeline to check for new records and run when ready.
Storage Event. Create a timestamped blob file at the end of the Synapse run. Use the "storage event trigger" within ADF to trigger the ADF pipeline.
Question
Which of these would be closest to the "approved" option? Are there any clear disadvantages to any of these?
As you mentioned, there is no "approved" solution for this problem. All the approaches you mentioned have pros and cons and should work. For me, Option #3 has been very successful. We have built a Queue Manager based on Tables & Stored Procedures in Azure SQL. We use Logic Apps to process the Triggers which can be Scheduled, Blob Events, or REST calls. Those Logic Apps insert jobs in the Queue table via Stored Procedure. That Stored Procedure can be called directly by virtually any system, so your Synapse pipeline could insert a Queue job to execute the ADF pipeline. Other benefits include a log of all the pipeline runs, support for multiple Data Factories (and now Synapse Workspaces), and a web interface we wrapped around the database for management and tracking.
We have 2 other Logic Apps that process the Queue (a Status manager and an Executor). These run constantly (every 1 minute and every 3 minutes). The actions to check status and create pipeline runs are both implemented as .NET Azure Functions [you'll need different SDKs for Synapse vs. ADF]. This system runs thousands of pipelines a month, sometimes more, across numerous Data Factories and Synapse Workspaces.
The PROs here are many, but this disconnected approach permits facets of your system to operate in isolation. And it is flexible, in that you can tie virtually any system into the Queue. Your example of a pipeline that needs to execute another pipeline in a different system is a perfect example.
The CON here is that this is the most involved approach. If this is a one-off problem you are trying to solve, choose one of the other options.
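For reference, the Azure Functions / REST options above reduce to a single management-plane call on the ADF side. A minimal sketch, assuming the Microsoft.Azure.Management.DataFactory and Azure.Identity packages, with placeholder subscription, resource group, factory, and pipeline names:

    using System;
    using System.Threading.Tasks;
    using Azure.Core;
    using Azure.Identity;
    using Microsoft.Azure.Management.DataFactory;
    using Microsoft.Azure.Management.DataFactory.Models;
    using Microsoft.Rest;

    class AdfRunner
    {
        static async Task Main()
        {
            // Acquire an ARM token (managed identity or developer credentials).
            AccessToken armToken = await new DefaultAzureCredential().GetTokenAsync(
                new TokenRequestContext(new[] { "https://management.azure.com/.default" }));

            var client = new DataFactoryManagementClient(new TokenCredentials(armToken.Token))
            {
                SubscriptionId = "<subscription-id>"
            };

            // Equivalent to POST .../pipelines/<name>/createRun?api-version=2018-06-01
            CreateRunResponse run = await client.Pipelines.CreateRunAsync(
                "<resource-group>", "<factory-name>", "<pipeline-name>");
            Console.WriteLine($"Started ADF pipeline run {run.RunId}");

            // Optional: poll the run status afterwards.
            PipelineRun status = await client.PipelineRuns.GetAsync(
                "<resource-group>", "<factory-name>", run.RunId);
            Console.WriteLine($"Current status: {status.Status}");
        }
    }

The same createRun endpoint is what a Synapse Web activity (option #2) would call directly, provided the caller's identity has a role on the target factory that permits creating pipeline runs.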

Can Data Factory start or stop a self-hosted integration runtime?

I have a self-hosted integration runtime configured on a virtual machine in Azure, because I need to access an on-premises database to load some information, and this database can be accessed only through a VPN connection.
Is there a way to turn the virtual machine on and off when the loading process is going to run (once a week) in order to optimize cost in the cloud? It doesn't make sense to me to leave a VM billing at idle times.
Thank you
Updated:
We can create an Azure Automation runbook, add PowerShell code to turn the VM on/off, and call it from ADF v2 using a Webhook activity, as described in this post.
Create a trigger that runs a pipeline on a schedule. When creating a schedule trigger, you specify a schedule (start date, recurrence, end date, etc.) for the trigger and associate it with a pipeline.
At the start of the pipeline, you can use a Webhook activity to start the VM, then copy the data, and at the end stop the VM.
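If you would rather keep the start/stop logic in C# (for example behind an Azure Function that the pipeline's Webhook or Web activity calls) instead of a PowerShell runbook, a rough sketch with the Azure.ResourceManager.Compute SDK could look like the following; subscription, resource group, and VM name are placeholders:

    using System.Threading.Tasks;
    using Azure;
    using Azure.Core;
    using Azure.Identity;
    using Azure.ResourceManager;
    using Azure.ResourceManager.Compute;

    class VmPower
    {
        static async Task Main()
        {
            var arm = new ArmClient(new DefaultAzureCredential());

            ResourceIdentifier vmId = VirtualMachineResource.CreateResourceIdentifier(
                "<subscription-id>", "<resource-group>", "<vm-name>");
            VirtualMachineResource vm = arm.GetVirtualMachineResource(vmId);

            // Start the self-hosted IR VM before the copy activity runs...
            await vm.PowerOnAsync(WaitUntil.Completed);

            // ...and deallocate it afterwards so compute billing stops
            // (a plain power-off keeps the VM allocated and billed).
            await vm.DeallocateAsync(WaitUntil.Completed);
        }
    }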
If you're looking to build data pipelines in Azure Data Factory, your cost will be split into two categories:
Data Factory Operations
Pipeline Orchestration and Execution
Data Factory Operations
Read/Write: Every time you create/edit/delete a pipeline activity or a Data Factory entity such as a dataset, linked service, integration runtime or trigger, it counts towards your Data Factory Operations cost. These are billed at $0.50 per 50,000 operations.
Monitoring: You can monitor each pipeline run and view the status for each individual activity. For each pipeline run, you can expect to retrieve one record for the pipeline and one record for each activity or trigger. For instance, you would be charged for 3 monitoring run records (one for the pipeline plus one per activity) if you debug a pipeline containing 2 activities. Monitoring activities are charged at $0.25 per 50,000 run records retrieved.
Pipeline Orchestration
Self Hosted
Every time you run a pipeline, you are charged for every activity and trigger inside that pipeline that is executed at a rate of $1.50 per 1000 Activity runs.
As an example, executing a pipeline with one trigger and two activities would be charged as 3 activity runs (3/1,000 × $1.50 ≈ $0.005).
Pipeline Execution
Self hosted
Data movement : $0.10/hour
Pipeline activities : $0.002/hour
External activities : $0.0001/hour
An inactive pipeline is charged at $0.80 per month.
Summary:
Say there are 30 days in the month and your pipeline runs on one day for 10 hours to move data. The data movement is charged at 10 h × $0.10/h = $1. The pipeline is inactive for the remaining 29 days, billed pro rata at the inactive pipeline rate: 29/30 × $0.80 ≈ $0.77. So your total cost for the month is roughly $1.77.

Action on error in Azure Machine Learning pipeline

I have a published and scheduled pipeline running at regular intervals. Sometimes the pipeline may fail (for example if the datastore is offline for maintenance). Is there a way to specify that the scheduled pipeline should perform a certain action if it fails for any reason? Actions could be to send me an email, retry a few hours later, or invoke a webhook. As it is now, I have to manually check the status of our production pipeline at regular intervals, which is sub-optimal for obvious reasons. I could of course instruct every script in my pipeline to perform certain actions if it fails for whatever reason, but it would be cleaner and easier to specify this globally for the pipeline schedule (or the pipeline itself).
Possible sub-optimal solutions could be:
Setting up an Azure Logic App to invoke the pipeline
Setting up a cron job or Azure Scheduler
Setting up a second Azure Machine Learning pipeline on a schedule that triggers the pipeline, monitors the output and performs relevant actions if errors are encountered
All the solutions above suffer from being convoluted and not very clean; surely there must be a simple, clean solution to this problem?
This solution reads from your pipeline's run events and lets you do anything within a Logic App's capabilities; I used it to email the team when a scheduled pipeline failed.
Steps:
Create an Event Hubs namespace and an Event Hub
Create a Service Bus namespace and a Service Bus queue
Create a Stream Analytics job using the Event Hub as input and the Service Bus queue as output
Create a Logic App triggered by any event arriving in the Service Bus queue, then add an Office 365 Outlook "Send an email (V2)" step
Create an Event Subscription inside the ML workspace that sends filtered events to the Event Hub
Start the Stream Analytics job
Two fundamental steps while creating the Event subscription:
Subscribe to the 'Run Status Changed' event to get the log when a pipeline fails
Use the advanced filters section to specify which pipeline you want to monitor (change 'deal-UAT' to your specific ML experiment name).
It looks like a lot of setup, but it's quick and easy to do.

Difference among Azure batch, scheduler and web job and when to use what

I can see that there are primarily 3 options in Azure to schedule jobs: Batch, Scheduler, and WebJobs. Is there any link or video explaining the differences, when to use each, and their benefits?
Thanks in advance
So far I haven't seen anything official from azure.com or MSDN, so let me take a stab.
Azure Batch is a way to run parallel (typically compute-intensive), HPC-style jobs in the cloud. Batch offers parallel job execution as a service so you don't have to worry about provisioning and managing a large cluster. A typical scenario: encode 10,000 H.264 videos from 1080p to 720p; instead of spinning up 200 VMs yourself, you just configure the command line and specify the location of the videos (blobs).
Azure Scheduler is a way to run a recurring job at a specified time; think of it as Windows Task Scheduler in the cloud. For example, start a cloud service at 8 AM every weekday and shut it down at 6 PM.
Azure WebJobs focus on running background work for an Azure Website; a WebJob is a worker daemon running alongside your web app in the cloud. An example: compress all images uploaded from the webpage.
To add to Yiding's answer: Azure Scheduler and Azure WebJobs actually work together and complement each other.
Azure WebJobs will host your code/executable that is doing the work.
Azure Scheduler will schedule when to run your work, i.e. the WebJob.
To get started, create a scheduled Azure WebJob, which will create both resources.
