Are Azure Batch virtual machines automatically shut down after a job completes?

Are Azure Batch virtual machines automatically shut down after a job completes? - azure

I was just curious about the lifecycle of the Azure Batch's virtual machines. Say a VM is created and a task is completed. Is the VM terminated after a successful completion?

• No, according to the best practices official documentation and the nodes and pools documentation, the Azure VM in a batch doesn’t shutdown unless a repetitive job is initiated. If a repetitive scheduled job is to be executed on a VM in an Azure Batch, then the time period for which the batch job is to be executed is to be defined during the batch job creation itself and resource allocation instance.
Thus, when a batch job on an Azure VM is completed, the Azure VM becomes idle in state and after a defined threshold period, it is deleted. The threshold period is the time for which the Azure VM in the node pool is in hibernation state/idle. Post completion of which and absence of any lined-up task or job, the Azure VM in the node pool gets deleted/deallocated if a spot instance is used.
• Please refer to the below official documentation link on the provisioning of pool and compute node lifetime: -
https://learn.microsoft.com/en-us/azure/batch/nodes-and-pools#pool-and-compute-node-lifetime

Related

HA in databrick clusters in Azure and AWS

It is clear the benefits of spark to deal with vms disruption within a zone, but what happens if a zone goes down. Are the vms in the running cluster failing over to a second zone in the same region or is this something to be done manually?
br,
Mike

databricks is being deployed in following mode :
Active deployment: Users can connect to an active deployment of a Databricks workspace and run workloads. Jobs are scheduled periodically using Databricks scheduler or other mechanism. Data streams can be executed on this deployment as well. Some documents might refer to an active deployment as a hot deployment.
Passive deployment: Processes do not run on a passive deployment. IT teams can setup automated procedures to deploy code, configuration, and other Databricks objects to the passive deployment. A deployment becomes active only if a current active deployment is down. Some documents might refer to a passive deployment as a cold deployment.
For more details , you can ref link

VM Scaleset vs Azure Batch

VM scale set can be used to create multiple VM's based on the business requirement and, Also, Azure batch is also used to execute job in multiple VM's.
What is the exact difference between Azure Batch and VM Scale set?

Azure Batch is a Platform as a Service offering that has an entire plaform for scheduling, submitting tasks and obtaining their results. Jobs and tasks are submitted using Node pools. Node pools can be comprised of VMSS compute resourses.
Whereas a VMSS is an Infrastructure as a Service that provides compute resources for any intended purposes. While you can spin up your own VMSS for running tasks, you would have to also implement your own job, task and compute coordinator service around it in order to simulate the Azure Batch service offerings.

At a high-level, Azure Batch provides two fundamental pieces for scheduling Batch and HPC workloads in the cloud:
Managed infrastructure
Cloud-native job scheduling
Azure Batch presents infrastructure at a managed layer "above" VMSS and CloudServices. Azure Batch orchestrates the pieces underneath to provide a concept called Batch pools, which provide potentially higher scale (as multiple deployments can be orchestrated together transparently) and higher resiliency to failures as Batch automatically recovers virtual machines or cloud service instances which have degraded.
Additionally, and just as important, Azure Batch provides cloud-native job scheduling. This portion is fully managed, i.e., you don't have to run a scheduler yourself. In a nutshell, Azure Batch provides concepts for job queues and tasks which you can define within the programmatic (API/SDK) or tooling that is available. Azure Batch operates on these concepts to execute the work you define (e.g., a command-line with dependencies or a Docker container); tasks can even span multiple nodes (e.g., MPI jobs). Azure Batch has the ability to retry these tasks if they fail on different nodes within a pool. Azure Batch provides an autoscale system that allows you to dynamically resize your infrastructure (Batch pools) that respond to node metrics and the number of jobs/tasks executing in the system.
Please refer to the technical overview as a starting point.

azure batch intent is to run jobs, vmss workloads. technically they do overlap a fair bit, but job is something rather short lived\bursty, whereas workload has to be working all the time

VM Scaleset is used to provide automatic scaling for an application and load balancing of traffic.VM Scale sets are good for running web applications/api based workloads where automatic scaling of the applications is handled and traffic load balancing is done.
Azure Batch is for tasks, scheduling jobs, running intrinsically parallel and tightly coupled workloads. It can provide scaling and load balancing of different nodes/VM that would be used for performing a high computation job. It would probably not be a suitable target for long-running services. A common scenario for Batch involves scaling out intrinsically parallel work, such as the rendering of images for 3D scenes, on a pool of compute nodes.

Long Running Tasks in Service Fabric and Scaling Cluster In

We are using Azure Service Fabric (Stateless Service) which gets messages from the Azure Service Bus Message Queue and processes them. The tasks generally take between 5 mins and 5 hours.
When its busy we want to scale out servers, and when it gets quiet we want to scale back in again.
How do we scale in without interrupting long running tasks? Is there a way we can tell Service Fabric which server is free to scale in?

Azure Monitor Custom Metric
Integrate your SF service with
EventFlow. For instance, make it sending logs into Application Insights
While your task is being processed, send some logs in that will indicate that
it's in progress
Configure custom metric in Azure Monitor to scale in only in case on absence of the logs indicating that machine
has in-progress tasks
The trade-off here is to wait for all the events finished until the scale-in could happen.
There is a good article that explains how to Scale a Service Fabric cluster programmatically
Here is another approach which requires a bit of coding - Automate manual scaling
Develop another service either as part of SF application or as VM extension. The point here is to make the service running on all the nodes in a cluster and track the status of tasks execution.
There are well-defined steps how one could manually exclude SF node from the cluster -
Run Disable-ServiceFabricNode with intent ‘RemoveNode’ to disable the node you’re going to remove (the highest instance in that node type).
Run Get-ServiceFabricNode to make sure that the node has indeed transitioned to disabled. If not, wait until the node is disabled. You cannot hurry this step.
Follow the sample/instructions in the quick start template gallery to change the number of VMs by one in that Nodetype. The instance removed is the highest VM instance.
And so forth... Find more info here Scale a Service Fabric cluster in or out using auto-scale rules. The takeaway here is that these steps could be automated.
Implement scaling logic in a new service to monitor which nodes are finished with their tasks and stay idle to scale them in using instructions described in previous steps.
Hopefully it makes sense.
Thanks a lot to #tank104 for the help on elaborating my answer!

Allocating temporary VMs to parallelize one-shot batch jobs (GCP, Azure or AWS)

I am evaluating options for launching arbitrary Python tasks/scripts on temporary cloud VMs that get shut down once the job is complete. I am looking across all cloud providers, but the ideal solution should not be vendor-specific. Here is what I found:
Docker Swarm / Kubernetes / Nomad for spinning up docker containers. All look attractive, but cannot confirm if they can terminate VMs once the task is complete.
Cloud Functions/Lambdas look great, but works only for short-lived (few minutes) tasks. Also, GCP supports only JavaScript.
Spins up/down VMs explicitly from a launch script with vendor specific commands. Straightforward and should work.
AWS Batch, Azure Batch - vendor-specific services for batch jobs
AWS Data Pipeline, Azure Data Factory, Google Dataflow - vendor-specific services for data pipelines
Did I miss any good option? Does any container orchestration service like Docker Swarm support allocation and deallocation of multiple temporary VMs to run a one-shot job?

Maximum cloud computing utilization - pay for computing, not idle time

I have one big task to do every day, with no need to scale, that takes about 30 minutes and is DB, processor and memory intensive.
This means actual 16h/month of computation time.
WebJobs require constantly running WebSite 744h/month
WebRole is also constantly running 744h/month
Azure Batch - suited for scaled storage input - storage output
processing (or that is how I understand it)
Stopped cloud service still cost you. Setting instance count to 0 is not available. And paying for 728h/month unused computation time looks like madness. Only thing I can imagine is automatic deployment of cloud service every day and automatic deletion of deployment once task is finished, but this also looks like madness.
Are there any options for this scenario in Azure?

Cloud service will be charged continuously until the deployment is deleted. Yes you can delete it every day and redeploy...
Azure VMs in Stopped (Deallocated) status, does not incur any charge. You can shut them down in portal or by script when you don't need them.
I think there is a large difference in billing if you only use it 62h/month. Would you consider switch this deployment to VM? WorkerRole and VMs can be placed on the same subnet, they can still connect to each other.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string