What is the best way to do a one-time job on Azure?
Say we want to extend a table in the associated database with a double column. All the new entries will have this value computed by the worker(s) at insertion, but somebody has to take care of the entries that are already in the table. I thought of two alternatives:
a method called by the worker only if a database entry (say, "JobRun") is set to true, and the method would flip the entry to false.
a separate app that does the job, and which is downloaded and run manually using remote desktop (I cannot connect the local app to the Azure SQL server).
The first alternative is messy (how should I deal with the code at the next deployment? delete it? comment it? leave it there? also, what if I will have another job in the future? create a new database entry "Job2Run"?). The second one looks like a cheap hack. I am sure that there is a better way I could not think of.
If you want to run a job once you'll need to take into account the following:
Concurrency: While the job is running, make sure no other worker picks up the job and runs it at the same time (you can use leases for this. More info here).
Once the job is done, you'll need to keep track (in Table Storage, SQL Azure, ...) that the job completed successfully. The next time a worker tries to pick up the job, it will look in Table Storage / SQL Azure / ..., it will see that the job completed and skip the job.
Failure: Maybe your worker crashes during the job which should allow another worker to pick up the job without any issue.
In your specific use case I would also consider using a tool like dbup to manage updates to your schema and existing data with SQL Scripts. Tools like these keep track of which scripts have been executed by adding them in a table in the database.
Related
I know this might be a broad questions, but I've been trying to find the right way to do this and I don't seem to be going anywhere.
Basically, I have a bunch of Objects saved in mongo that contain events, like below :
{
"date" : "2020-09-09",
"day" : 1599573600000 // epoch time
"from" : 1599595200000 // epoch time
"to" : 1599695200000 // epoch time
}
I need to fire some events, like sending a reminded SMS etc, before the date that is specified in from field.
I know I can write a cron job and regularly check on my entire mongo collection, find all the ones that are due and the rest is obvious.
However, somehow I feel like there must be a better way, because this can be extremely slow after our database grows with millions of events.
So the question that I have is,
1- What are some other options, beside cron jobs.
2- Is there any difference between running Cron jobs in NodeJS, and running Cron Jobs in Google App Engine ( server-less instance), which one is better?
3- Is there any service out there that anyone has used?
Any direction would be appreciated.
I'm assuming you're trying to stay in the GCP ecosystem.
For scalability you could use cron to kick off a Google Dataflow pipeline. With this pipeline you can define a pipeline step to be executed for each record that matches the given query. Dataflow will ramp up the number of workers as it goes to handle the scale.
If you're not at that level of scale, Dataflow can be a bit heavy and may feel like overkill for your current use case. If that's the case, then you can use a combination of cron and google cloud tasks where you'd enqueue/launch a task per record. For large amounts of records, you could launch a task per batch of records (i.e. an injector pattern)
https://cloud.google.com/tasks/docs/manage-cloud-task-scaling#large-scalebatch_task_enqueues
Another option is just using google cloud tasks, using the 'schedule_time' field. Here you'd enqueue the tasks when you originally write the record into the DB, instead of periodically querying to see which ones need to be run
https://cloud.google.com/tasks/docs/creating-http-target-tasks
2- Is there any difference between running Cron jobs in NodeJS, and running Cron Jobs in Google App Engine ( server-less instance), which one is better?
I wasn't sure what you meant by your second question because you can run node.js in app engine. In my experience things do work better when you keep everything within GCP.
I am developing an application using Azure Cloud Service and web api. I would like to allow users that create a consultation session the ability to change the price of that session, however I would like to allow all users 30 days to leave the session before the new price affects the price for all members currently signed up for the session. My first thought is to use queue storage and set the visibility timeout for the 30 day time limit, but this seems like this could grow the queue really fast over time, especially if the message should not run for 30 days; not to mention the ordering issues. I am looking at the task scheduler as well but the session pricing changes are not a recurring concept but more random. Is the queue idea a good approach or is there a better and more efficient way to accomplish this?
The stuff you are trying to do should be done with a relational database. You can use timestamps to record when prices for session changed. I wouldn't use a queue at all for this. A queue is more for passing messages in a distributed system. Your problem is just about tracking what prices changed on what sessions and when. That data should be modeled in a database.
I think this scenario is more suitable to use Azure Scheduler. Programatically create a Job with one time recurrence with set date as 30 days later to run once. Once this job gets triggered automatically by scheduler, assign an action to callback to one of your API/Service to do the price & other required updates and also remove this Job from the scheduler as part of this action to have a clean jobs list. Anyways premium plan of Azure Scheduler Job Collection will give you unlimited number of jobs to run.
Hope this is exactly what you were looking for...
I would consider using Azure WebJobs. A WebJob basically gives you the ability to run a .NET console application within the context of an Azure Web App. It can be run on demand, continuously, or in response to a reoccurring schedule. If your processing requirements are low and allow for it they can also run in the same process that your Web App is running in to save you $$$ as they are free that way.
You could schedule the WebJob to run once or twice per day and examine the situation and react as is appropriate. Since it's really just a .NET worker role you have ultimate flexibility.
Few questions regarding HDInsight jobs approach.
1) How to schedule HDInsight job? Is there any ready solution for it? For example if my system will constantly get a large number of new input files collected that we need to run map/reduce job upon, what is the recommended way to implemented on-going processing?
2) From the price perspective, it is recommended to remove the HDInsight cluster for the time when there is no job running. As I understand there is no way to automate this process if we decide to run the job daily? Any recommendations here?
3) Is there a way to ensure that the same files are not processed more than once? How do you solve this issue?
4) I might be mistaken, but it looks like every hdinsight job requires a new output storage folder to store reducer results into. What is the best practice for merging of those results so that reporting always works on the whole data set?
Ok, there's a lot of questions in there! Here are I hope a few quick answers.
There isn't really a way of scheduling job submission in HDInsight, though of course you can schedule a program to run the job submissions for you. Depending on your workflow, it may be worth taking a look at Oozie, which can be a little awkward to get going on HDInsight, but should help.
On the price front, I would recommend that if you're not using the cluster, you should destroy it and bring it back again when you need it (those compute hours can really add up!). Note that this will lose anything you have in the HDFS, which should be mainly intermediate results, any output or input data held in the asv storage will persist in and Azure Storage account. You can certainly automate this by using the CLI tools, or the rest interface used by the CLI tools. (see my answer on Hadoop on Azure Create New Cluster, the first one is out of date).
I would do this by making sure I only submitted the job once for each file, and rely on Hadoop to handle the retry and reliability side, so removing the need to manage any retries in your application.
Once you have the outputs from your initial processes, if you want to reduce them to a single output for reporting the best bet is probably a secondary MapReduce job with the outputs as its inputs.
If you don't care about the individual intermediate jobs, you can just chain these directly in the one MapReduce job (which can contain as many map and reduce steps as you like) through Job chaining see Chaining multiple MapReduce jobs in Hadoop for a java based example. Sadly the .NET api does not currently support this form of job chaining.
However, you may be able to just use the ReducerCombinerBase class if your case allows for a Reducer->Combiner approach.
I am trying to port an application to azure platform. I want to run an existing application multiple times. My initial idea is as follows: I have a master_process. I have many slave_processes. Each process is a worker role in Azure. Each slave_process will run an instance of the application independently. I want master_process to start many slave_processes and provide them the input arguments. At the end, master_process will collect the results. Currently, I have a working setup for calling the whole application from a C# wrapper. So, for the success, I need two things: First, I have to find a way to start slave workers inside of a master worker (just like threads). Second, I need to find a way to store results of the slave workers and reach these result files from master worker. Can anyone help me?
I think I would try and solve the problem differently. Deploying a whole new instance can take 15 to 30 minutes. Adding extra instances to an already running worker role is a little quicker, but not by much. I'm going to presume that you want results faster than that and that this process is something that is run frequently.
I would have just one worker role type that runs your existing logic and as many instances of that worker role that you determine you'll need. Whatever your client is will decide that it needs to break the work up in a certain number of pieces, let's say 10 for the sake of argument. It will give each piece of work an ID (e.g. a guid) and then put 10 messages that contain the parameters and the ID into a queue. Your worker role instances take messages out of the queue, do their work and write their results to storage somewhere (either SQL Azure, Azure Table Storage or maybe even blob storage depending on what the results are). The client polls that storage to wait for all of the results to be complete and then carries on.
If this is a process that is only run infrequently, then rather than having the worker roles deployed all of the time, you could use the same method I've described, but in addition get the client code to deploy the worker roles when it starts and then delete them when it's finished through the management API. There are samples on MSDN on how to use this.
I have a similar situation you might find useful:
I have a large sequential batch process I run in Azure which requires pre and post processing. The technique I used was to use instances of a single multifunctional worker role, but to use a "quorum" to nominate a head node, which then controls the workflow.
They way I do it is using an azure page blob as the quorum (basically a kind of global mutex/lock), because once a node grabs it for writing it's locked. For resilience, in case there's an issue with the head node, all nodes occasionally try to recapture the quorum.
I want to build an Azure application that has two worker roles and NO web roles. When the worker roles first start up I want ONLY ONE of the roles to do the following a single time:
Download and parse a master file then enqueue multiple "child" tasks based on the
contents of the master file
Enqueue a single master file download "child" task to run the next day
Each of the "child" tasks would then be done by both of the workers until the task queue was exhausted. Think of the whole things as "priming the pump"
This sort of thing is really easy if I add the the first "master" task manually in a queue by calling a web role but seems to be really hard to do in an auto-start mode.
Any help in this regard would be greatly appreciated!
Thanks.....
One possibility: instead of calling a web role, just load the queue directly. (It sounds like this is the sort of application you'll want to automatically spin up to do some work and then shut down again... if you're automating that, it should be trivial to also automate loading the queue.)
A (perhaps) better option: Use some sort of locking mechanism to make sure only one worker instance does the initialization work. One way to do this is to try to create the queue (or a blob, or an entity in a table). If it already exists, then the other instance is handling initialization. If the create succeeds, then it's this instance's job.
Note that it's always better to use a lease than a lock, in case the instance that's doing the initialization fails. Consider using a timeout (e.g. storing a timestamp in table storage or in the metadata of the blob or in the name of the queue...).
We did end-up with the exact same sort of problem, that's why we introduced a O/C mapper (object to cloud). Basically, you want to introduce two types of cloud services:
QueueService that consumes messages whenever available.
ScheduledService that triggers operations on a scheduled basis.
Then, as others suggested, in the cloud, you really prefer using leases instead of locks, in order to avoid your cloud app to end up freezed forever due to a temporary hardware (or infrastructure) issue.