Azure Functions as Scheduler - azure

Is Azure functions a good alternative to Azure Data Factory to use as scheduler? It has blob trigger to monitor and can use C# to trigger databricks jobs using API. But is it a viable alternative.
Edited to add more information. Wanted to trigger a databricks job based on a trigger file but do not want to use Azure Data Factory or Data bricks job.

I would probably use simple logic app with Event Grid trigger on blob storage event blob created event. Based on trigger data I would call Databricks Job REST API.
I did entire demo below working in under 10 minutes so its fast to set up.
With this demo I used
And logic app setup as trigger
Where I strongly suggest to add prefix filter like
/blobServices/default/containers/<container_name>
So you don't fire too many logic apps from different containers as event grid reacts to all events in entire storage account.
And HTTP call like so
Of course at this point simply change clusters list to submitting job REST call.
And see execution like
Just make sure that EventGrid resource provider is registered or logic app will never fire off.

Related

How to invoke Job/Task in Azure Databricks from Azure Function

I need to develop a event driven pipeline which should get trigger on file arrival in ADLS2 i.e. ABFS. On file arrival I need to trigger 4 subsequent Spark jobs on Azure Databricks cluster.
For orchestrating the Spark Jobs I can use Databricks jobs as an option so that jobs could get triggered in a pipeline.
But the first job should get triggered only after the file arrival.
I am currently exploring ways to achieve this but need expert advice to design this in a best possible manner w.r.t cost.
One solution could be to use Azure Data Factory for orchestrating the entire flow based on Storage Event Trigger component but going for ADF just because of event based trigger don't look feasible to me as the rest part of the application i.e. Spark jobs can be pipelined from Databricks Job feature. Also, in terms of cost ADF can be expensive. Another solution could be to use Azure Functions Blob Trigger to know the file arrival but I am not able to understand how can I trigger Azure Databricks jobs from Azure Functions. As going with Functions can be cost effective as the function would not be running/active until the file has arrived.
Note:There can be multiple files arriving in an hour. No fixed duration on file arrival.
Also, trigger file is different than data files. i.e. On arrival of trigger files, Spark pipeline would consume actual data files.
Data files and Trigger files have different extensions and both are arriving in ABFS.
Your worry about ADF cost is misplaced. The Pipelines are extremely cheap. The activities that actually move data and use CPU are where most of the cost is. For instance Data Flows are run on managed Spark clusters, which is reflected in the pricing. See Data Factory Pricing. Using a Pipeline to orchestrate Databricks jobs is a common, simple, and (at least for ADF) very inexpensive.
If you want to kick off a Databricks job from an Azure Function, there's an API. Also check out the Databricks Autoloader, but running your Databricks cluster continuously can be expensive.

How to Trigger ADF Pipeline from Synapse Pipelines

Problem
Due to internal requirements, I need to run a Synapse pipeline and then trigger an ADF pipeline. It does not seem that there is a Microsoft-approved method of doing this. The pipelines run infrequently (every week or month) and the ADF pipeline must run after the Synapse pipeline.
Options
It seems that other answers pose several options:
Azure Functions. Create an Azure function that calls the CreatePipelineRun function on the ADF pipeline. At the end of the Synapse pipeline, insert a block that calls the Azure function.
Use the REST API and Web Activity. Use the REST API to make a call to run the ADF pipeline. Insert a Web Activity block at the end of the Synapse pipeline to make the API call.
Tables and polling. Insert a record into a table in a managed database with data about the Synapse pipeline run. Have regular polling from the ADF pipeline to check for new records and run when ready.
Storage Event. Create a timestamped blob file at the end of the Synapse run. Use the "storage event trigger" within ADF to trigger the ADF pipeline.
Question
Which of these would be closest to the "approved" option? Are there any clear disadvantages to any of these?
As you mentioned, there is no "approved" solution for this problem. All the approaches you mentioned have pros and cons and should work. For me, Option #3 has been very successful. We have built a Queue Manager based on Tables & Stored Procedures in Azure SQL. We use Logic Apps to process the Triggers which can be Scheduled, Blob Events, or REST calls. Those Logic Apps insert jobs in the Queue table via Stored Procedure. That Stored Procedure can be called directly by virtually any system, so your Synapse pipeline could insert a Queue job to execute the ADF pipeline. Other benefits include a log of all the pipeline runs, support for multiple Data Factories (and now Synapse Workspaces), and a web interface we wrapped around the database for management and tracking.
We have 2 other Logic Apps that process the Queue (a Status manager and an Executor). These run constantly (every 1 minute and every 3 minutes). The actions to check status and create pipeline runs are both implemented as .NET Azure Functions [you'll need different SDKs for Synapse vs. ADF]. This system runs thousands of pipelines a month, sometimes more, across numerous Data Factories and Synapse Workspaces.
The PROs here are many, but this disconnected approach permits facets of your system to operate in isolation. And it is flexible, in that you can tie virtually any system into the Queue. Your example of a pipeline that needs to execute another pipeline in a different system is a perfect example.
The CON here is that this is the most involved approach. If this is a on-off problem you are trying to solve, choose one of the other options.

Monitor specific activity logs to trigger Azure Function

Usecase: Trigger Azure Function only for predefined Azure activity logs.
I tried to configure Azure Activity logs and Export to Event Hub, but it won't allow Filter set on it. As per Azure document, the filter settings do not have an impact on export settings.
My usecase is to trigger an Azure Function only for a specific set of activity logs (say VM, VNet, NSG Create/Delete/Modify). What other Azure services can I use to accomplish this?
One option, but with some constraints, is to create Alerts at Resource Group level or even for specific resources. Alerts provide some flexibility in filtering specific events for which you would want to trigger an Action, say an Azure Func in your case.
I was thinking Azure Logic Apps would do this as well. However, to my surprise I could not find an option to add Activity Log as a trigger. Probably, it would come in the future. As Azure is updated quite frequently, keep checking every now and then to see if you get any new options to do this.

Scalable Azure Function with blob trigger

I made an Azure Function on a Consumption Plan with a blob trigger. Then I add lots of files to the blob and I expect the Azure Function to be invoked every time a file is added to the trigger.
And because I use Azure Function and Consumption Plan, I would expect that there is no scalability problem, right? WRONG.
I can easily add files to the blob faster than the Azure Function can process them. Hundred users can add to the blob but there seems to be only one instance of the Azure Function working at any one time. Meaning it can easily fall behind.
I thought the platform would just create more instances of the Azure Function as needed. Well, it seems it doesn't.
Any advice how I can configure my Azure Function to be truly scalable with a blob trigger?
This is because you are affecting with cold-start
As per the note here
When you're using a blob trigger on a Consumption plan, there can be
up to a 10-minute delay in processing new blobs. This delay occurs
when a function app has gone idle. After the function app is running,
blobs are processed immediately. To avoid this cold-start delay, use
an App Service plan with Always On enabled, or use the Event Grid
trigger.
For your case, you need to consider Event-Grid trigger instead of a blob trigger, Event trigger has the built-in support for blob-events as well.
When to consider Event Grid?
Use Event Grid instead of the Blob storage trigger for the following scenarios:
Blob storage accounts
High scale
Minimizing cold-start delay
Read more here
Update in 2020
Azure Function has a new tier/plan called a premium where you can avoid the Cold Start
E.g, YouTube Video

Programmatically Schedule one-time execution of Azure function

I have looked through documentation for WebJobs, Functions and Logic Apps in Azure but I cannot find a way to schedule a one-time execution of a process through code. My users need to be able to schedule notifications to go out at a specific time in the future (usually within a few hours or a day from being scheduled). Everything I am reading on those processes is using CRON expressions which is not designed for one-time executions. I realize that I could schedule the job to run on intervals and check the database to see if the rest of the job needs to run, but I would like to avoid running the jobs unnecessarily if possible. Any help is appreciated.
If it is relevant, I am using C#, ASP.NET MVC Core, App Services and a SQL database all hosted in Azure. My plan was to use Logic apps to check the database for a scheduled event and send notifications through Twilio, SendGrid, and iOS/Android push notifications.
One option is to create Azure Service Bus Messages in your App using the ScheduledEnqueueTimeUtc property. This will create the message in the queue, but will only be consumable at that time.
Then a Logic App could be listening to that Service Bus Queue and doing the further processing, e.g. SendGrid, Twilio, etc...
HTH
You could use Azure Queue trigger with deferred visibility. This will keep the message invisible for a specified timeout. This conveniently acts as a timer.
CloudQueue queueOutput; // same queue as trigger listens on
var strjson = JsonConvert.SerializeObject(message); // message is your payload
var cloudMsg = new CloudQueueMessage(strjson);
var delay = TimeSpan.FromHours(1);
queueOutput.AddMessage(cloudMsg, initialVisibilityDelay: delay);
See https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.storage.queue.cloudqueue.addmessage?view=azure-dotnet for more details on this overload of AddMessage.
You can use Azure Automation to schedule tasks programmatically using REST API. Learn about it here.
You can use Azure Event Grid also. Based on this article you can “Extend existing workflows by triggering a Logic App once there is a new record in your database".
Hope this helps.
The other answers are all valid options, but there are some others as well.
For Logic Apps you can build this behavior into the app as described in the Scheduler migration guide. The solution described there is to create a logic app with a http trigger, and pass the desired execution time to that trigger (in post data or query parameters). The 'Delay Until' block can then be used to postpone the execution of the following steps to the time passed to the trigger.
You'd have to change the logic app to support this, but depending on the use case that may not be an issue.
For Azure functions a similar pattern could be achieved using Durable Functions which has support for Timers.

Resources