Azure batch service add webhook on job completion - azure

I've set up a batch service for media file encoding with ffmpeg. Each job can contain multiple tasks, each task will encode one file. I use the task specific resource- and output-file system, so the batch service automatically fetches and delivers the files from and to the blob storage.
However: how do I know that a job or task has completed or failed?
Since the job can take very long - even more so on low priority nodes - I need some sort of webhook or event. Continuous polling on the job status is not viable.
The options I could think of:
after running the ffmpeg command, connect a curl command. Something like:
"commandLine" : "/bin/bash -c "ffmpeg -i inputFile outputFile &&
curl https://my-webhook-receiver.org ""
Technically it works, but I'm worried about timing. The curl request is probably(?) done before the batch service pushes the result file back to the blob storage. If it's a big file, and it takes maybe half a minute to upload, I will get notified before the file exists on the output container.
Use the blob storage event system.
This has the advantage that the result file obviously must've have arrived. However, what if the job failed? It won't get triggered ever then...
Batch alert system. You apparently can create alerts for certain batch events (e.g. task completion) and you can hook it up to an action group and finally a webhook. Is that the right call? It feels kinda hacky and not the right way to use this system.
Isn't there a way to connect azure batch with e.g. azure event grid directly?
What is the "correct" way to let my server know, the encoded file is ready?

There are a few ways to handle this, although admittedly some of these solutions are not very elegant:
Create a task dependency on each task. The dependent task is the one that invokes the webhook. You can make it such that the dependent task is invoked even if the task it is depending on fails with certain exit codes. You can also create a "merge task" that is dependent on all tasks in the job that can let you know when everything completes.
Use a job manager task instead. Job managers are typically used to monitor progression of a workflow and spawn other tasks, so you would be able to query status of task completion (success or failure) and send your webhook commands via this task or a task spawned by the job manager.
Use job release mechanisms to run actions when a job completes. This does not solve your per-task notification problem, but can be used as a job completion signal.

Related

Detecting the end of an Azure Batch job

In my application I create an Azure batch job. It's a Node app and I use an azure-batch Node client, but I could also be using REST, I don't think it matters. I can't switch to a C# client, however.
I expect the job to be completed in a few seconds and I wish to pause the code until the batch job is over but I am not sure how to detect the end of the job without polling the Job Status API. Neither the Node client nor the REST API exposes such functionality. I thought I could maybe register for an event of some sort but was not able to find anything like that. There are job release tasks but I am not sure if I can achieve this using them.
Any ideas how the end of an Azure batch job can be detected from within my application?
One way to do this is once you add your tasks to the job, set the job's onAllTasksComplete property to 'terminatejob'.
Then you can poll the Job-Get API, and check the state property on the job for when the job is complete (https://learn.microsoft.com/en-us/rest/api/batchservice/job/get#jobstate or https://learn.microsoft.com/en-us/javascript/api/azure-batch/job?view=azure-node-latest#get-string--object-).

Google Cloud Platform : Running several hours scraping script

I have a NodeJS script, that scrapes URLs everyday.
The requests are throttled to be kind to the server. This results in my script running for a fairly long time (several hours).
I have been looking for a way to deploy it on GCP. And because it was previously done in cron, I naturally had a look at how to have a cronjob running on Google Cloud. However, according to the docs, the script has to be exposed as an API and http calls to that API can only run for up to 60 minutes, which doesn't fit my needs.
I had a look at this S.O question, which recommends to use a Cloud Function. However, I am unsure this approach would be suitable in my case, as my script requires a lot more processing than the simple server monitoring job described there.
Has anyone experience in doing this on GCP ?
N.B : To clarify, I want to to avoid deploying it on a VPS.
Edit :
I reached out to google, here is their reply :
Thank you for your patience. Currently, it is not possible to run cron
script for 6 to 7 hours in a row since the current limitation for cron
in App Engine is 60 minutes per HTTP
request.
If it is possible for your use case, you can spread the 7 hours to
recurrring tasks, for example, every 10 minutes or 1 hour. A cron job
request is subject to the same limits as those for push task
queues. Free
applications can have up to 20 scheduled tasks. You may refer to the
documentation
for cron schedule format.
Also, it is possible to still use Postgres and Redis with this.
However, kindly take note that Postgres is still in beta.
As I a can't spread the task, I had to keep on managing a dokku VPS for this.
I would suggest combining two services, GAE Cron Jobs and Cloud Tasks.
Use GAE Cron jobs to publish a list of sites and ranges to scrape to Cloud Tasks. This initialization process doesn't need to be 'kind' to the server yet, and can simple publish all chunks of works to the Cloud Task queue, and consider itself finished when completed.
Follow that up with a Task Queue, and use the queue rate limiting configuration option as the method of limiting the overall request rate to the endpoint you're scraping from. If you need less than 1 qps add a sleep statement in your code directly. If you're really queueing millions or billions of jobs follow their advice of having one queue spawn to another.
Large-scale/batch task enqueues
When a large number of tasks, for
example millions or billions, need to be added, a double-injection
pattern can be useful. Instead of creating tasks from a single job,
use an injector queue. Each task added to the injector queue fans out
and adds 100 tasks to the desired queue or queue group. The injector
queue can be sped up over time, for example start at 5 TPS, then
increase by 50% every 5 minutes.
That should be pretty hands off, and only require you to think through the process of how the cron job pulls the next desired sites and pages, and how small it should break down the work loads into.
I'm also working on this task. I need to crawl website and have the same problem.
Instead of running the main crawler task on the VM, I move the task to Google Cloud Functions. The task is consist of add get the target url, scrape the web, and save the result to Datastore, then return the result to caller.
This is how it works, I have a long run application that call be called a master. The master know what URL we are going to access in to. But instead of access the target website by itself, it sends the url to a crawler function in GCF. Then the crawling tasked is done and send result back to the master. In this case, the master only request and get a small amount of data and never touch the target website, let the rest to GCF. You can off load your master and crawl the website in parallel via GCF. Or you can use other method to trigger GCF instead of HTTP request too.

SQS: Know remaining jobs

I'm creating an app that uses a JobQueue using Amazon SQS.
Every time a user logs in, I create a bunch of jobs for that specific user, and I want him to wait until all his jobs have been processed before taking the user to a specific screen.
My problem is that I don't know how to query the queue to see if there are still pending jobs for a specific user, or how is the correct way to implement such solution.
Everything regarding the queue (Job creation and processing is working as expected). But I am missing that final step.
Just for the record:
In my previous implementation I was using Redis + Kue and I had created a key with the user Id and the job count, every time a job was added that job count was incremented, and every time a job finished or failed I decremented that count. But now I want to move away from Redi + Kue and I am not sure how to implement this step.
Amazon SQS is not the ideal tool for the scenario you describe. A queueing system is normally used in a "Send and Forget" situation, where the sending system doesn't remain interested in later processing.
You could investigate Amazon Simple Workflow (SWF), which allows work to be monitored as it goes through several processes. Your existing code could mostly be re-used, just with the SWF framework added. Or even power it from Lambda, since you are already using node.js.

Can i execute an on-demand web job from a scheduled webjob?

I need to execute a long running webjob on certain schedules or on-demand with some parameters that need to be passed. I had it in a way where the scheduled webjob would put a message on the queue with the parameters and a queue message triggered job would take over - OR - some user interaction would put the same message on the queue with the parameters and the triggered job would take over. However for some reason the triggered function never-finishes - and right now i cannot see any exceptions being displayed in the dashboard outputs (see Time limit on Azure Webjobs triggered by Queue)
I m looking into whether I can execute my triggered webjob as an On-demand webjob and pass the parameters to it? Is there anyway to call an on-demand web job from a scheduled web job and pass it some command line parameters?
Thanks for your help!
QueueTriggered WebJob functions work very well when configured properly. Please see my answer on the other question which points to documentation resources on how to set your WebJobs SDK Continuous host up properly.
Queue messaging is the correct pattern for you to be using for this scenario. It allows you to pass arbitrary data along to your job, and will also allow you to scale out to multiple instances as needed when your load increases.
You can use the WebJobs Dashboard to invoke your job function directly (see "Run Function" button below) - you can specify the queue message input directly in the Dashboard as a string. This allows you to invoke the function directly as needed with whatever inputs you want, in addition to allowing the function to continue to respond to qeueue messages actually added to the queue.

Can Azure WebJobs poll queues on demand?

I have a WebJob which gets triggered when a user uploads a file to the blob storage - it is triggered by a queue storage message which is created once the upload is complete.
Depending on the purpose of the file, it will post messages to other queues to trigger processing jobs.
Some of these jobs are time critical, and run relatively quickly. In one case the processing takes about three seconds, and the user is waiting for the result.
However, because the minimum queue polling interval is 2 seconds, the time the user must wait for the two WebJobs to be invoked is generally doubling their wait time.
I tried combining the two WebJobs into one, hoping that when the first handler posts a queue message the corresponding processing handler would be immediately triggered, but in fact it consistently waits two seconds before picking up the message.
My question is, is there a way for me to tell my WebJob to check the queue triggers immediately from within the same WebJob if I know there is a message waiting? Or even better configure it to immediately check the queue triggers if I post to a queue from inside the WebJob?
Or would switching to a service bus queue improve the responsiveness to new messages?
Update
In the docs about using blob triggers, it says:
There is an exception for blobs that you create by using the Blob attribute. When the WebJobs SDK creates a new blob, it passes the new blob immediately to any matching BlobTrigger functions. Therefore if you have a chain of blob inputs and outputs, the SDK can process them efficiently. But if you want low latency running your blob processing functions for blobs that are created or updated by other means, we recommend using QueueTrigger rather than BlobTrigger.
http://azure.microsoft.com/en-gb/documentation/articles/websites-dotnet-webjobs-sdk-storage-blobs-how-to/
However there is no mention of anything similar for queues. Meaning if you need really low latency in this scenario then blobs are the better than queues, which seems wrong.
Update 2
I ended up working around this by pulling the orchestrating code out of the first WebJob and into the service layer of the application and removing the WebJob.. it was fast running anyway so perhaps separating it into its own WebJob was an overkill. This means only the processing WebJob has to be triggered after the file upload.
Currently 2 sec is the minimum time it will take for the SDK to poll for the new message. The SDK does an exponential back off polling so you can configure the MaxPollingInterval to be 2 sec always.
config.Queues.MaxPollingInterval = TimeSpan.FromSeconds(15);
For more details please see http://azure.microsoft.com/en-us/documentation/articles/websites-dotnet-webjobs-sdk-storage-queues-how-to/#config

Resources