I am currently using an Azure Durable Function (Orchestrator) to orchestrate a process that involves creating a job in Azure Databricks. The Durable Function creates the job using the REST API of Azure Databricks and provides a callback URL. Once the job is created, the orchestrator waits indefinitely for the external event (callback) to be triggered, indicating the completion of the job (callback pattern). The job in Azure Databricks is wrapped in a try:except block to ensure that a status (success/failure) is reported back to the orchestrator no matter the outcome.
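For context, the job-side half of that callback pattern looks roughly like this. A minimal sketch in Python, assuming the callback URL is passed to the Databricks job as a parameter; the payload shape and run_business_logic are illustrative:

import requests

callback_url = "<callback-url-from-orchestrator>"  # assumed to arrive as a job parameter

def run_business_logic():
    pass  # stands in for the actual work of the job

status = "Succeeded"
try:
    run_business_logic()
except Exception:
    status = "Failed"
    raise  # keep the Databricks job status accurate
finally:
    # Report the outcome back to the waiting orchestrator, whatever happened
    requests.post(callback_url, json={"status": status})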
However, I am concerned about the scenario where the job status turns to Internal Error and that reporting code never executes, leaving the orchestrator waiting indefinitely. To ensure reliability, I am considering several solutions:
Setting a timeout on the orchestrator (see the sketch after this list)
Polling: Checking the status of the job every x minutes
Using an event-driven architecture by writing an event to a topic (e.g. Azure Event Grid) and having a separate service subscribe to it
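The timeout option can be implemented inside the orchestrator by racing the external event against a durable timer. A minimal sketch in Python; the activity name create_databricks_job and the two-hour deadline are assumptions:

import azure.durable_functions as df
from datetime import timedelta

def orchestrator_function(context: df.DurableOrchestrationContext):
    # Activity that calls the Databricks REST API (name is illustrative)
    yield context.call_activity("create_databricks_job", context.get_input())

    # Race the callback against a durable timer
    deadline = context.current_utc_datetime + timedelta(hours=2)
    timeout_task = context.create_timer(deadline)
    callback_task = context.wait_for_external_event("JobCompleted")

    winner = yield context.task_any([timeout_task, callback_task])
    if winner == callback_task:
        timeout_task.cancel()  # always cancel an outstanding durable timer
        return callback_task.result
    return "TimedOut"  # fall back to polling the job status or raise an alert

main = df.Orchestrator.create(orchestrator_function)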
My question is: can I send events to a topic (Azure Event Grid) when the Databricks job completes (succeeds, fails, errors, every possible outcome) so that the orchestrator is notified and can take appropriate action? Looking at the REST API Jobs 2.1 docs, I can get notified via email or by specifying a webhook for start, success, and failure (a preview feature). Can I enter the Event Grid topic URL here so that Databricks writes events to it? The docs on managing notification destinations don't make it clear to me. Is there another way in Azure to achieve the same result?
Edit: I've looked into the documentation on how to manage notification destinations and created a new system notification destination.
However, when testing the connection, the request fails:
401: Request must contain one of the following authorization signature: aeg-sas-token, aeg-sas-key. Report '12345678-7678-4ab9-b90f-37aabf1b10b8:7:1/23/2023 6:17:09 PM (UTC)' to our forums for assistance or raise a support ticket.
The same would happen with a POST request from any other client (e.g. Postman). Now the question is: how can I provide a token so that Databricks can write events to a topic?
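For comparison, a client can publish directly to an Event Grid topic only by supplying one of those headers. A minimal sketch with Python requests; the endpoint, key, and event fields are placeholders:

import datetime
import requests

topic_endpoint = "https://<your-topic>.<region>-1.eventgrid.azure.net/api/events"  # placeholder
topic_key = "<topic-access-key>"  # placeholder, from the topic's Access keys blade

event = [{
    "id": "1",
    "eventType": "databricks.job.completed",  # illustrative
    "subject": "jobs/12345",
    "eventTime": datetime.datetime.utcnow().isoformat() + "Z",
    "data": {"status": "Succeeded"},
    "dataVersion": "1.0",
}]

# Event Grid accepts the publish only with an aeg-sas-key (or aeg-sas-token) header
resp = requests.post(topic_endpoint, json=event, headers={"aeg-sas-key": topic_key})
print(resp.status_code)  # 200 on success

The Databricks webhook destination, however, appears to offer no way to attach such a header, which is exactly the gap in question.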
I've also posted this question here: Webhook Security (Bearer Auth)
I want to call an API to trigger an Azure Automation runbook. I believe this can be done with webhooks. When doing so, I get back a 202 response code, which suggests that the request was successfully queued.
Now I'm trying to find out how I can specify a call-back API call that Azure Automation should trigger once it has finished the execution, including the result status (completed, failed). Is this call-back something I should code into the Automation job myself, or is there default functionality available that would make an API call-back when the runbook completes?
I'm trying to avoid having my client application, which triggered the automation job, continuously poll to see if the job is still running.
First, there is no default functionality that would make an API call-back when the runbook completes.
As you may know, we can approximate this behavior by writing code that checks the runbook's status or by setting up an alert for when it completes, but both approaches add latency or require periodic polling.
The best solution I can think of is to put the call-back API in the runbook itself. For example, you can put your code in a try/catch/finally block and make the call-back in the finally section.
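A minimal sketch of that pattern as a Python runbook (Azure Automation also supports PowerShell runbooks); the callback URL and payload shape are assumptions:

import requests

CALLBACK_URL = "https://example.com/api/automation-callback"  # placeholder

def do_runbook_work():
    pass  # stands in for the runbook's actual logic

status = "Completed"
try:
    do_runbook_work()
except Exception:
    status = "Failed"
    raise  # keep the runbook's own job status accurate
finally:
    # Always notify the caller, whatever the outcome
    requests.post(CALLBACK_URL, json={"status": status})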
Hope it can help.
I am running a build pipeline in Azure that has multiple tasks, and I have a requirement to fetch the logs via REST API calls after triggering the pipeline. I used Builds - Get Build Logs, but it lists only the logs of completed tasks, not those of the ongoing task. Is there any mechanism available to get ongoing/live task logs?
Is there a way to get the live logs of a running pipeline using the Azure REST API?
I am afraid there is no such mechanism available to get ongoing task logs/live logs.
As we know, the Representational State Transfer (REST) APIs are service endpoints that support sets of HTTP operations (methods), which provide create, retrieve, update, or delete access to the service's resources.
The task is executed inside the agent, and the result of the execution is passed back to Azure DevOps only after the task completes. So HTTP operations (methods) are triggered only once a task is complete and its results are returned, and only then can we use the REST API to get those results.
So we cannot use the Azure REST API to get ongoing task logs/live logs. This is a limitation of the Azure DevOps design.
Hope this helps.
I did not find much in the way of troubleshooting a lost-events scenario in Azure Event Grid.
Hence I am asking question in relation to following scenario:
Our code publishes the events to the domain.
The events are delivered to the configured web hook in the subscription.
This works for a while.
The consumer (who owns the web hook endpoint) complains that he is not receiving some events, though most are coming through.
We look in the configured dead-letter queue and find that there are no events. It has been more than a day and hence all retries are already exhausted.
Hence we assume that all events are being delivered because there are no failed delivery events in the metrics.
We also make sure that we indeed submitted these mysterious events to the grid.
But the consumer insists there is a problem and demonstrates that nothing is wrong on his side.
Now we need to figure out if some of these events are being swallowed by the event grid.
How do I go about troubleshooting this scenario?
The current version of AEG (Azure Event Grid) is not integrated with the Diagnostic settings feature, which would help greatly with streaming metrics and logs.
For your scenario, which is based on Event Domains (still in public preview, see the limits), the Azure Monitor REST API can help: it lets you see all the metrics of your specific Event Domain.
The valid metrics are:
PublishSuccessCount, PublishFailCount, PublishSuccessLatencyInMs, MatchedEventCount, DeliveryAttemptFailCount, DeliverySuccessCount, DestinationProcessingDurationInMs, DroppedEventCount, DeadLetteredCount
The following example is a REST GET request to obtain all metric values within your event domain for a specific timespan and interval:
https://management.azure.com/subscriptions/{mySubId}/resourceGroups/{myRG}/providers/Microsoft.EventGrid/domains/{myDomain}/providers/Microsoft.Insights/metrics?api-version=2018-01-01&interval=PT1H&aggregation=count,total&timespan=2019-02-06T07:58:12Z/2019-02-07T08:58:12Z&metricnames=PublishSuccessCount,PublishFailCount,PublishSuccessLatencyInMs,MatchedEventCount,DeliveryAttemptFailCount,DeliverySuccessCount,DestinationProcessingDurationInMs,DroppedEventCount,DeadLetteredCount
Based on the response values, you can see how AEG behaves on the publisher side and how events are delivered to the subscriber. For your production version, I recommend using a polling technique to obtain all metrics from AEG and push them to an Event Hub for streaming analysis, alerting, etc. Based on the query parameters (such as timespan, interval, etc.), this can be close to real time. Once the Diagnostic settings feature is supported by AEG, this polling and publishing of metrics becomes obsolete, and the analysis stream job can continue with only a small modification.
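A minimal sketch of that polling call in Python; the subscription, resource group, and domain names are placeholders, and the bearer token is assumed to be acquired separately (e.g. via Azure AD):

import requests

SUB, RG, DOMAIN = "<subscription-id>", "<resource-group>", "<event-domain>"  # placeholders
TOKEN = "<azure-ad-bearer-token>"  # placeholder; e.g. obtained with azure-identity

url = (f"https://management.azure.com/subscriptions/{SUB}/resourceGroups/{RG}"
       f"/providers/Microsoft.EventGrid/domains/{DOMAIN}"
       f"/providers/Microsoft.Insights/metrics")
params = {
    "api-version": "2018-01-01",
    "interval": "PT1H",
    "aggregation": "count,total",
    "timespan": "2019-02-06T07:58:12Z/2019-02-07T08:58:12Z",
    "metricnames": "PublishSuccessCount,PublishFailCount,DeliverySuccessCount,"
                   "DeliveryAttemptFailCount,DroppedEventCount,DeadLetteredCount",
}
resp = requests.get(url, params=params, headers={"Authorization": f"Bearer {TOKEN}"})
for metric in resp.json().get("value", []):
    print(metric["name"]["value"], metric.get("timeseries"))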
The other point is to extend your eventing model with an auditing part. I recommend the following:
Add a domain-scope subscription to capture all events in the event domain and push them to the Event Hub for streaming purposes. Note that every event published within that event domain should then appear in this published stream pipeline.
Add a storage subscription for dead-letter messages and push them to the same Event Hub for streaming purposes.
(optional) Add the Diagnostic settings (some metrics) of the dead-letter storage to the same Event Hub for streaming purposes. Note that a dead-letter message is dropped after 4 hours of trying to store it in the blob container; there is no log message for that failed process, only a metric counter.
On the consumer side, I recommend that each subscriber create a log message (AEG headers + event message) for auditing and troubleshooting purposes. It can be stored in a blob container or locally and then uploaded, etc. The point is that this reference can be very useful for the analysis stream job to quickly figure out where the problem is.
In addition, your publisher should periodically (for instance, once per hour) probe the event domain endpoint by sending a probe event message to a dedicated probe topic. The event subscription for that probe topic is configured with a dead-lettering option, and the subscriber webhook handler should always fail with error code HttpStatusCode.BadRequest so that no retrying takes place. Note that there is a 300-second delay before a dead-letter message is stored; in other words, within about 5 minutes of the probe event, the dead-lettered message should appear in the stream pipeline. This probe scenario exercises AEG functionality end to end, from the publisher through to delivery.
I have configured an EventGrid subscription to initiate a web hook call for events in a resource group when a resource is created.
The web hook call is successfully handled, and I return a 200 OK. To maintain idempotency, I store all events that have occurred in a webhook_events table with the id of the event. Any new events are checked to see if they exist in that table by their id.
After I return a 200 OK, Azure Event Grid attempts to remove the event from the retry queue, but no matter how quickly I respond, Event Grid reliably retries sending.
So I am receiving the same event multiple times (as I said, Event Grid always retries, as it cannot remove the event from the retry queue fast enough). That, however, is not the focus of my question; rather, the issue is that each of these retries presents me with a different id for the event. This means that I cannot logically determine the uniqueness of an event, and my application code is not executed idempotently.
How can I maintain idempotency between my application and Azure despite there being no unique identifier between event retries?
It's the way Event Grid is implemented; the documentation says:
If the endpoint responds within 3 minutes, Event Grid will attempt to remove the event from the retry queue on a best effort basis, but duplicates may still be received.
You can use back-end code to clean up logs and stored data, using event and message IDs to identify duplicates.
The id field is in fact unique per event and kept identical between retries, and it can therefore be used for dedupe.
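A minimal sketch of an id-based dedupe check; the in-memory set stands in for the webhook_events table described in the question, and process is an illustrative name:

# seen_event_ids stands in for the webhook_events table from the question
seen_event_ids = set()

def process(event: dict) -> None:
    pass  # the application's actual work (illustrative)

def handle_event(event: dict) -> None:
    event_id = event["id"]  # stable across Event Grid retries
    if event_id in seen_event_ids:
        return  # duplicate delivery; already handled
    seen_event_ids.add(event_id)
    process(event)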
What you're running into is a specific issue with some events generated by Azure Resource Manager (ARM). Specifically, the two events you are seeing are in fact distinct events, not duplicates, generated by ARM at different stages of the creation flow for some resource types.
ARM acts as the API front door to the various Azure services and emits a set of events that are generalized; often, to get the details of what has occurred, you need to look in the data payload. For example, ARM will emit a success event for each 2xx status code it receives from an Azure service, so a 202 Accepted and a 201 Created can result in two events being emitted, and the only way to see the difference is in the data payload.
This is a known pain point, and we are working to emit more high-fidelity events that will be clearer and easier to react to in these scenarios. The ideal state will be a change-feed of sorts for the Azure control plane.
I have looked through the documentation for WebJobs, Functions, and Logic Apps in Azure, but I cannot find a way to schedule a one-time execution of a process through code. My users need to be able to schedule notifications to go out at a specific time in the future (usually within a few hours or a day of being scheduled). Everything I am reading about those services uses CRON expressions, which are not designed for one-time executions. I realize that I could schedule the job to run on intervals and check the database to see if the rest of the job needs to run, but I would like to avoid running the jobs unnecessarily if possible. Any help is appreciated.
If it is relevant, I am using C#, ASP.NET MVC Core, App Services and a SQL database all hosted in Azure. My plan was to use Logic apps to check the database for a scheduled event and send notifications through Twilio, SendGrid, and iOS/Android push notifications.
One option is to create Azure Service Bus messages in your app using the ScheduledEnqueueTimeUtc property. This creates the message in the queue, but it only becomes consumable at that time.
Then a Logic App could be listening to that Service Bus Queue and doing the further processing, e.g. SendGrid, Twilio, etc...
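A minimal sketch with the Python Service Bus SDK, where schedule_messages is the equivalent of setting the .NET ScheduledEnqueueTimeUtc property; the connection string, queue name, and payload are placeholders:

import json
from datetime import datetime, timedelta, timezone
from azure.servicebus import ServiceBusClient, ServiceBusMessage

CONN_STR = "<service-bus-connection-string>"  # placeholder
QUEUE = "notifications"                       # placeholder

payload = {"user_id": 42, "channel": "sms"}   # illustrative
send_at = datetime.now(timezone.utc) + timedelta(hours=3)

with ServiceBusClient.from_connection_string(CONN_STR) as client:
    with client.get_queue_sender(QUEUE) as sender:
        msg = ServiceBusMessage(json.dumps(payload))
        # The message is enqueued now but stays invisible to consumers until send_at
        sender.schedule_messages(msg, schedule_time_utc=send_at)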
HTH
You could use an Azure Queue trigger with deferred visibility. This keeps the message invisible for a specified timeout, which conveniently acts as a timer.
// Requires Microsoft.Azure.Storage.Queue and Newtonsoft.Json
CloudQueue queueOutput; // same queue as the trigger listens on (assumed initialized elsewhere)
var strjson = JsonConvert.SerializeObject(message); // message is your payload
var cloudMsg = new CloudQueueMessage(strjson);
var delay = TimeSpan.FromHours(1); // message stays invisible for one hour
queueOutput.AddMessage(cloudMsg, initialVisibilityDelay: delay);
See https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.storage.queue.cloudqueue.addmessage?view=azure-dotnet for more details on this overload of AddMessage.
You can use Azure Automation to schedule tasks programmatically using the REST API. Learn about it here.
You can also use Azure Event Grid. Based on this article, you can "Extend existing workflows by triggering a Logic App once there is a new record in your database".
Hope this helps.
The other answers are all valid options, but there are some others as well.
For Logic Apps you can build this behavior into the app, as described in the Scheduler migration guide. The solution described there is to create a logic app with an HTTP trigger and pass the desired execution time to that trigger (in the POST data or query parameters). The 'Delay Until' block can then be used to postpone the execution of the following steps until the time passed to the trigger.
You'd have to change the logic app to support this, but depending on the use case that may not be an issue.
For Azure Functions, a similar pattern can be achieved using Durable Functions, which have support for timers.
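A minimal sketch of that pattern as a Python durable orchestrator; the input shape and the activity name send_notification are assumptions:

import azure.durable_functions as df
from datetime import datetime

def orchestrator_function(context: df.DurableOrchestrationContext):
    # Input is assumed to carry the desired send time as an ISO-8601 string
    payload = context.get_input()
    send_at = datetime.fromisoformat(payload["send_at"])

    # Durable timer: the orchestration sleeps until the scheduled moment
    yield context.create_timer(send_at)

    # Then fire the one-time notification (activity name is illustrative)
    yield context.call_activity("send_notification", payload)

main = df.Orchestrator.create(orchestrator_function)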