We have an Azure-batch job that uses some quite large files which we are uploading to Azure Blob storage asynchronously so that we don't have to wait for all files to upload before starting our batch job made up of a collection of Tasks that will process each file and generate output. All good so far - this is working fine.
I'd like to be able to create an Azure Task and Add it to an existing, running Azure Job increasing the length of the Task list but I cant find how to do this. It seems that Azure expects you to define ALL jobs for a Task before the Job starts and then it runs until all tasks are complete and terminates the job (which makes sense in some scenarios - but not mine).
I would like to suppress this Job completion behavior and be able to queue up additional Azure Tasks for the same job. I could then monitor the Azure Job status (via the Tasks) and determine myself if the Job is complete.
Our issue is that uploads of multi-MB files takes time and we want Task processing to start as soon as the first file is available. If we have to wait until all files are available, then our processing start is delayed which is not what we need.
We 'could'create a job per task and manage it in our application but that is a little 'messy' and I would like to use the encapsulating Azure Job entity and supporting functionality if I possibly can.
Has anyone done this and can offer some guidance? Many thanks?
You can add new tasks to an existing Azure Batch job in the active state. There is no running state for an Azure Batch job. You can find a list of Azure Batch job states here.
Azure Batch Jobs, by default, do not automatically complete by terminating upon all tasks completing. You can view this related question regarding this subject.
Related
I need an Azure VM (Ubuntu) to do some task (java application) every 10 minutes. Because the task lasts usually less than a minute I would save money if could start the machine every 10 minutes and stop it when the task accomplishes. I learned that I can schedule start and stop times in automation account, but more optimal would be to stop the VM in the very moment that task is completed. Is there a simple way to do that?
This really sounds like a job for Azure Batch. If you are looking for an IaaS solution, Azure Batch will do the job for you. Have a look at it: https://azure.microsoft.com/en-gb/services/batch/#overview.
It allows you to use VM's with your preferred OS (in Azure Batch it is called a node), and run a set of tasks. Once finished, the VM will be de-allocated.
So each node runs a set of pools, in each pool you have a job, and in each job you can have tasks. A task can be for example a cmd line that runs a specific app. So for instance you could just run example.exe 1 2 on a windows OS or the equivalent command line for an Ubuntu OS.
The power here is that it will allocate the tasks to run on the VM when you add them to the job, and then the VM will be disposed off once finished, and you would only pay for the compute time.
The disadvantages of this is that it is a stateless VM, therefore anything that you need installing or storing you would have to use alternative methods. Azure Batch allows you to pre-install a program (for example your Java application) each time it initiates. Also if you are using files and/or expecting files to be created, you would need a blob storage to support this. So if you are expecting it to use a certain amount of files, store them on blob storage and then write back to the blob storage if your program is doing this.
Finally your scheduler, this really depends on how you want to deal with it, if you have a local server or a server on Azure that is already running 24/7 you can add a scheduled job to the scheduler and run a program that will add the task to the Azure Batch. Or if you don't mind using Azure Functions, you can just add a timer Azure Function that will add a task to the job. There are multiple ways of dealing with this, you may already have an existing solution.
Hope you find this useful!
I am reading 10 million records from BigQuery and doing some transformation and creating the .csv file, the same .csv stream data I am uploading to SFTP server using Node.JS.
This job taking approximately 5 to 6 hrs to complete the request locally.
Solution has been delpoyed on GCP Cloud run but after 2 to 3 second cloud run is closing the container with 503 error.
Please find below configuration of GCP Cloud Run.
Autoscaling: Up to 1 container instances
CPU allocated: default
Memory allocated: 2Gi
Concurrency: 10
Request timeout: 900 seconds
Is GCP Cloud Run is good option for long running background process?
You can use a VM instance with your container deployed and perform you job on it. At the end kill or stop your VM.
But, personally, I prefer serverless solution and approach, like Cloud Run. However, Long running job on Cloud Run will come, a day! Until this, you have to deal with the limit of 60 minutes or to use another service.
As workaround, I propose you to use Cloud Build. Yes, Cloud Build for running any container in it. I wrote an article on this. I ran a Terraform container on Cloud Build, but, in reality, you can run any container.
Set the timeout correctly, take care of default service account and assigned role, and, thing not yet available on Cloud Run, choose the number of CPUs (1, 8 or 32) for the processing and speed up your process.
Want a bonus? You have 120 minutes free per day and per billing account (be careful, it's not per project!)
Update: 2021-Oct
Cloudrun supports background activities.
Configure CPU to be always-allocated if you use background activities
Background activity is anything that happens after your HTTP response has been delivered. To determine whether there is background activity in your service that is not readily apparent, check your logs for anything that is logged after the entry for the HTTP request.
Configure CPU to be always-allocated
If you want to support background activities in your Cloud Run service, set your Cloud Run service CPU to be always allocated so you can run background activities outside of requests and still have CPU access.
Is GCP Cloud Run is good option for long running background process?
Not a good option because your container is 'brought to life' by incoming HTTP request and as soon as the container responds (e.g. sends something back), Google assumes the processing of the request is finished and cuts the CPU off.
Which may explain this:
Solution has been delpoyed on GCP Cloud run but after 2 to 3 second cloud run is closing the container with 503 error.
You can try using an Apache Beam pipeline deployed via Cloud Dataflow. Using Python, you can perform the task with the following steps:
Stage 1. Read the data from BigQuery table.
beam.io.Read(beam.io.BigQuerySource(query=your_query,use_standard_sql=True))
Stage 2. Upload Stage 1 result into a CSV file on a GCS bucket.
beam.io.WriteToText(file_path_prefix="", \
file_name_suffix='.csv', \
header='list of csv file headers')
Stage 3. Call a ParDo function which will then take CSV file created in Stage 2 and upload it to the SFTP server. You can refer this link.
You may consider a serverless, event-driven approach:
configure google storage trigger cloud function running transformation
extract/export BigQuery to CF trigger bucker - this is the fastest way to get BigQuery data out
Sometimes exported data in that way may be too large not be suitable in that form for Cloud Function processing, due to restriction like max execution time (9 min currently) or memory limitation 2GB,
In that case, you can split the original data file to smaller pieces and/or push then to Pub/Sub with storage mirror
All that said we've used CF to process a billion records from building bloom filters to publishing data to aerospike under a few minutes end to end.
I will try to use Dataflow for creating .csv file from Big Query and will upload that file to GCS.
I have an Azure Batch service set up with a job schedule that runs every minute.
The job manager task creates 3-10 tasks within the same job.
Sometimes, one of these tasks within the job may take extremely long to complete but usually are very fast.
In the event that one of the tasks takes long to apply, the next iteration of the job manager task does not begin in that case. It basically waits till all the tasks from the previous iteration have completed.
Is there a way to ensure that the job schedule keeps creating a version of the job every minute even if all the tasks from its previous iteration have not been completed?
I know one option is to make the job manager task create additional jobs instead of tasks. But preferably, I was hoping there is some configuration at the job schedule level that I can turn on that will allow the schedule to create tasks without the dependency of completion on the previous job.
This seems like more towards design question, AFAIK, No, the duplicate active job names should not be doable from az batch perspective. (I will get corrected if at all this is doable somehow)
Although in order to further think this you can read through various design recommendations via Azure batch technical overview page or posts like:
How to use Azure Batch in an event based design and terminate/cleanup finished jobs or
Add Tasks to a running Azure batch job and manually control termination
I think simplicity will be better like handling each iteration with unique job name or some thing of other sort but you will know your scenario better. Hope this helps.
Currently, a Job Schedule can have at most one active Job under it at any given time (link) so the behavour you're seeing is expected.
We don't have any simple feature you can just "turn on" to achieve concurrent jobs from a single job schedule - but I do have a suggestion:
Instead of using the JobSchedule to run all the processing directly, use it to create "worker" jobs that do the processing.
E.g.
At 10:03 am, your job schedule triggers to create job processing-20191031-1003.
At 10:04 am, your job schedule triggers to create job processing-20191031-1004.
At 10:05 am, your job schedule triggers to create job processing-20191031-1005.
and so on
Because the only thing your job schedule does is create another job, it will finish very quickly, ensuring the next job is created on time.
Since your existing jobs already create a variable number of tasks (you said 3-10 tasks, above), I'm hoping this won't be a very complex change for your code.
Note that you will need to ensure your concurrent worker jobs don't step on each others toes by trying to do the same work multiple times.
We have a very long running operation (potentially days) that we would like to have triggered from a BLOB file written to a Azure Storage. This job could be started once year, never, or many times over a few days.
Azure Batch jobs look exactly like what we need assuming there doesn't need to be a 'watcher' process on the batch job as it runs. For example, if we can have a Azure Function catch the BLOB event, fire up a Batch job, start the job in a "fire and forget" type fashion, and then the Function ends it is exactly what we need. We aren't really too worried about reporting progress of the job (we are using a SQL table for that), we just want to start the job then monitor it out of band.
Is there a way to start a batch job and let the instigator process disappear while the job continues to run in the background? If not, is there any way to do this without having to have a constantly running process (Worker Role or Fabric Worker)? We are trying to avoid having a process (Worker/Fabric Role, Function using the App Function Plan, etc.) running all the time when 99.9% of the time it isn't doing anything.
Short answer: No, you don't need a watcher process.
Azure Batch tasks are asynchronous in nature. When you add a task (under a job), your call against the Batch service immediately returns with success or failure of the submission action itself (and not if the task completed successfully on a compute node or not). The Batch service takes care of scheduling the task among the available compute nodes in your pool, internally monitoring the progress of the task, updating stats, etc.
If you elect to do so, you can monitor the progress of your task independently of the submitting actor using any SDK, REST API or client tool. Or you can opt to monitor it out-of-band yourself as you have described if your task is updating an external monitor or data store. Or you can schedule a task and not monitor it, the service does not force you to monitor/watch the task.
I have a C# console application which extracts 15GB FireBird database file on a server location to multiple files and loads the data from files to SQLServer database. The console application uses System.Threading.Tasks.Parallel class to perform parallel execution of the dataload from files to sqlserver database.
It is a weekly process and it takes 6 hours to complete.
What is best option to move this (console application) process to azure cloud - WebJob or WorkerRole or Any other cloud service ?
How to reduce the execution time (6 hrs) after moving to cloud ?
How to implement the suggested option ? Please provide pointers or code samples etc.
Your help in detail comments is very much appreciated.
Thanks
Bhanu.
let me give some thought on this question of yours
"What is best option to move this (console application) process to
azure cloud - WebJob or WorkerRole or Any other cloud service ?"
First you can achieve the task with both WebJob and WorkerRole, but i would suggest you to go with WebJob.
PROS about WebJob is:
Deployment time is quicker, you can turn your console app without any change into a continues running webjob within mintues (https://azure.microsoft.com/en-us/documentation/articles/web-sites-create-web-jobs/)
Build in timer support, where WorkerRole you will need to handle on your own
Fault tolerant, when your WebJob fail, there is built-in resume logic
You might want to check out Azure Functions. You pay only for the processing time you use and there doesn't appear to be a maximum run time (unlike AWS Lambda).
They can be set up on a schedule or kicked off from other events.
If you are already doing work in parallel you could break out some of the parallel tasks into separate azure functions. Aside from that, how to speed things up would require specific knowledge of what you are trying to accomplish.
In the past when I've tried to speed up work like this, I would start by spitting out log messages during the processing that contain the current time or that calculate the duration (using the StopWatch class). Then find out which areas can be improved. The slowness may also be due to slowdown on the SQL Server side. More investigation would be needed on your part. But the first step is always capturing metrics.
Since Azure Functions can scale out horizontally, you might want to first break out the data from the files into smaller chunks and let the functions handle each chunk. Then spin up multiple parallel processing of those chunks. Be sure not to spin up more than your SQL Server can handle.