I'm using AKS and my goal is to create an AKS cluster in a stopped state. I want to create the cluster without incurring any cost until a future point: in this use case, the cluster's costs are tied to the computation it performs, and our regular usage is to start the cluster on demand when we have a job to run.
Currently, I have tried creating a cluster in 2 ways:
Using the Azure console: https://portal.azure.com/?quickstart=True#create/microsoft.aks
Using the Azure CLI: az aks create
In both approaches, Azure first deploys the cluster, and then in a second step, I need to STOP the cluster.
Is it possible to create the cluster without starting it?
Perhaps the answer is that a new cluster needs to be created for every job, but I'll need to dig into this to understand any timing/cost tradeoffs doing this.
There isn't a way to create a cluster in a stopped state. You should test the time difference between Create and Start: you'll probably find there's little difference between the two operations, in which case you can just create a new cluster for each job... Create might even be faster :)
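If you do keep a single cluster, the create-then-stop sequence can at least be scripted so the cluster is only briefly running. A minimal PowerShell sketch, assuming the Az.Aks module and hypothetical resource names:

# Create the cluster (it comes up in the running state; names are placeholders)
New-AzAksCluster -ResourceGroupName "my-rg" -Name "my-aks" -NodeCount 1
# Immediately stop it so the node-pool VMs are deallocated and stop billing
Stop-AzAksCluster -ResourceGroupName "my-rg" -Name "my-aks"
# Later, when a job arrives:
Start-AzAksCluster -ResourceGroupName "my-rg" -Name "my-aks"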
Related
I have some workload which needs to be run a few times per week. It requires some heavy computational work and runs for about one hour (with 16 cores and 32 GB of memory). It is possible to run it in a container.
Azure offers many different possibilities to run containers. (I have no knowledge of most of the Azure services, so my conclusions might be wrong.) Firstly, I thought Azure Container Instances was perfect for this scenario, but it only offers containers with up to 4 vCPUs and 16 GB of memory. There is no need for orchestration with a single container, so Azure Kubernetes Service and Azure Service Fabric come with too much overhead. Similarly, Azure Batch offers computational clusters, which are not needed for a single workload.
Which Azure service is the best fit for this use case?
While a "best fit" question is likely to be closed. Anyways, here's a suggestion.
Don't dismiss AKS. You can easily create a 1-node cluster using a VM size that fits your required configuration. With the default pricing tier you don't pay for the master nodes, and you can stop your cluster after each run and stop being charged. No need to worry about orchestration; think of this as a VM that has everything needed to run your container, used much like an ACI.
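As a sketch, creating and parking such a single-node cluster with the Az.Aks PowerShell module might look like this (the names and VM size are assumptions; Standard_F16s_v2 gives 16 vCPUs and 32 GB):

# One node sized for the workload described above
New-AzAksCluster -ResourceGroupName "jobs-rg" -Name "job-cluster" -NodeCount 1 -NodeVmSize "Standard_F16s_v2"
# Stop the cluster after each run so the node is no longer billed
Stop-AzAksCluster -ResourceGroupName "jobs-rg" -Name "job-cluster"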
I am looking for a way to run an AKS or k8s cluster in Dev/Test Labs, but I couldn't find an official way. I believe Azure allows using production services in Dev/Test Labs, but they haven't yet published a document on how to achieve this. I need memory-rich VMs, such as 128/256 GB, but AKS doesn't seem to support those VM sizes in a cluster. The auto-shutdown option would also save cost for these VMs, so I want to build this in a Dev/Test Lab. Any suggestion would be helpful. Thanks!
AKS is a managed service and you can't run it on your own VMs or on ones from Dev/Test Labs. Why do you say that you can't use VMs with 128/256 GB of RAM? When selecting your VM size in the portal, make sure to select the Memory Optimized family.
If I understand correctly, your goal is to save money when running these high-cost VMs. One possible way to achieve this is to create your cluster with a single instance of a smaller VM and add a second node pool with the larger VMs. You can then create and destroy that second pool on demand, as sketched below.
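A hedged sketch of that pattern with the Az.Aks PowerShell module (the cluster, pool, and size names are hypothetical):

# Create the on-demand pool of large, memory-optimized VMs (Standard_E16s_v3 has 128 GB)
New-AzAksNodePool -ResourceGroupName "my-rg" -ClusterName "my-aks" -Name "bigpool" -Count 1 -VmSize "Standard_E16s_v3"
# Destroy the pool once the work is done so the large VMs stop billing
Remove-AzAksNodePool -ResourceGroupName "my-rg" -ClusterName "my-aks" -Name "bigpool" -Force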
We use Spark 2.2 on Azure HDInsight for ad hoc exploration and batch jobs.
The jobs should run OK on a 5x medium VM cluster. They are:
1. notebooks (Zeppelin with Livy.spark2 magics)
2. compiled jars being run with Livy.
I have to remember to scale this cluster down to 1 worker when not using it, to save money. (0 workers would be nice, if that were possible).
I'd like Spark to manage this for me... When a Job starts, scale the cluster up to a minimum size first, then pause ~10 mins while that completes. After an idle period without Jobs, scale down again.
You can use PowerShell or the Azure classic CLI to scale the cluster up and down, but you might need to write a script that tracks the cluster's resource usage and scales down automatically.
Here is the PowerShell syntax:
Set-AzureRmHDInsightClusterSize -ClusterName <Cluster Name> -TargetInstanceCount <NewSize>
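For example, to drop a cluster to a single worker node when idle and bring it back before a job (the cluster name and target sizes are placeholders):

# Scale down to one worker node while the cluster is idle
Set-AzureRmHDInsightClusterSize -ClusterName "mycluster" -TargetInstanceCount 1
# Scale back up before submitting the jobs
Set-AzureRmHDInsightClusterSize -ClusterName "mycluster" -TargetInstanceCount 5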
Here is a PowerShell workflow runbook that will help you automate the process of scaling in or out your HDInsight clusters depending on your needs
https://gallery.technet.microsoft.com/scriptcenter/Scale-your-HDInsight-f57bb4d8
or
You can also scale the cluster manually using the option below (even though your question is about scaling up/down automatically, this may be useful to someone who wants to scale manually).
Below is a link to an article explaining different methods of scaling the cluster using PowerShell or the classic CLI (note: the latest CLI doesn't support the scaling feature):
https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-scaling-best-practices
If you want Spark to handle it dynamically, Azure Databricks is the best choice (but it is only a Spark cluster, with no Hadoop components except Hive). HDInsight Spark does not manage this kind of dynamic scaling for you, so it will not solve your use case.
When creating a new cluster in Azure Databricks, there is an "Enable autoscaling" option that lets the cluster scale dynamically while a job is executing.
I'm told that Azure Databricks may be a better solution for this use case.
We are using Azure Service Fabric (Stateless Service) which gets messages from the Azure Service Bus Message Queue and processes them. The tasks generally take between 5 mins and 5 hours.
When it's busy we want to scale out servers, and when it gets quiet we want to scale back in again.
How do we scale in without interrupting long running tasks? Is there a way we can tell Service Fabric which server is free to scale in?
Azure Monitor Custom Metric
1. Integrate your SF service with EventFlow, for instance so that it sends logs to Application Insights.
2. While a task is being processed, emit logs indicating that it's in progress.
3. Configure a custom metric in Azure Monitor to scale in only when there are no logs indicating that a machine has in-progress tasks.
The trade-off here is that scale-in has to wait until all in-progress tasks have finished.
There is a good article that explains how to Scale a Service Fabric cluster programmatically
Here is another approach which requires a bit of coding - Automate manual scaling
Develop another service, either as part of the SF application or as a VM extension. The point is to have the service run on all the nodes in the cluster and track the status of task execution.
There are well-defined steps for manually excluding an SF node from the cluster:
Run Disable-ServiceFabricNode with intent ‘RemoveNode’ to disable the node you’re going to remove (the highest instance in that node type).
Run Get-ServiceFabricNode to make sure that the node has indeed transitioned to disabled. If not, wait until the node is disabled. You cannot hurry this step.
Follow the sample/instructions in the quick start template gallery to change the number of VMs by one in that node type. The instance removed is the highest VM instance.
And so forth... Find more info here: Scale a Service Fabric cluster in or out using auto-scale rules. The takeaway here is that these steps can be automated, for example as sketched below.
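A rough PowerShell sketch of the first two steps, using the Service Fabric module (the connection endpoint and node name are assumptions):

# Connect to the cluster and drain the node we intend to remove
Connect-ServiceFabricCluster -ConnectionEndpoint "mycluster.westus.cloudapp.azure.com:19000"
Disable-ServiceFabricNode -NodeName "_NodeType0_4" -Intent RemoveNode -Force
# Wait until the node has actually transitioned to Disabled; this cannot be hurried
do {
    Start-Sleep -Seconds 10
    $node = Get-ServiceFabricNode -NodeName "_NodeType0_4"
} while ($node.NodeStatus -ne "Disabled")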
Implement scaling logic in a new service that monitors which nodes have finished their tasks and are sitting idle, then scale them in using the steps described above.
Hopefully it makes sense.
Thanks a lot to @tank104 for the help elaborating my answer!
We are working on an application that processes Excel files and produces output. Availability is not a big requirement.
Can we turn the VM scale sets off during the night and turn them on again in the morning? Will this kind of setup work with Service Fabric? If so, is there a way to schedule it?
Thank you all for replying. I got a chance to talk to a Microsoft Azure rep and documented the conversation here for the community's sake.
Response for initial question
A Service Fabric cluster must maintain a minimum number of Primary node types in order for the system services to maintain a quorum and ensure the health of the cluster. You can see more about the reliability level and instance count at https://azure.microsoft.com/en-gb/documentation/articles/service-fabric-cluster-capacity/. As such, stopping all of the VMs will cause the Service Fabric cluster to go into quorum loss. Frequently it is possible to bring the nodes back up and Service Fabric will automatically recover from this quorum loss; however, this is not guaranteed and the cluster may never be able to recover.
However, if you do not need to save state in your cluster then it may be easier to just delete and recreate the entire cluster (the entire Azure resource group) every day. Creating a new cluster from scratch by deploying a new resource group generally takes less than half an hour, and this can be automated by using PowerShell to deploy an ARM template. https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-creation-via-arm/ shows how to set up the ARM template and deploy using PowerShell. You can additionally use a fixed domain name or static IP address so that clients don't have to be reconfigured to connect to the cluster. If you need to maintain other resources such as the storage account, you could also configure the ARM template to only delete the VM Scale Set and the SF Cluster resource while keeping the network, load balancer, storage accounts, etc.
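A minimal sketch of that daily delete/recreate cycle in PowerShell (the resource group, template, and parameter file names are hypothetical):

# Morning: recreate the resource group and deploy the cluster from the ARM template
New-AzResourceGroup -Name "sf-daily-rg" -Location "westus2"
New-AzResourceGroupDeployment -ResourceGroupName "sf-daily-rg" -TemplateFile .\sfcluster.json -TemplateParameterFile .\sfcluster.parameters.json
# Night: delete everything so the VMs stop billing
Remove-AzResourceGroup -Name "sf-daily-rg" -Force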
Q) Is there a better way to stop/start the VMs rather than directly from the scale set?
If you want to stop the VMs in order to save cost, then starting/stopping the VMs directly from the scale set is the only option.
Q) Can we do a primary set with the cheapest VMs we can find and add a secondary set with powerful VMs that we can turn on and off?
Yes, it is definitely possible to create two node types – a Primary that is small/cheap, and a 'Worker' that is a larger size – and set placement constraints on your application to only deploy to those larger VMs. However, if your Service Fabric service is storing state then you will still run into a similar problem: once you lose quorum (below 3 replicas/nodes) of your worker VMs, there is no guarantee that your SF service will come back with all of its state maintained. In this case your cluster itself would still be fine since the Primary nodes are running, but your service's state may be in an unknown replication state.
I think you have a few options:
1. Instead of storing state within Service Fabric's reliable collections, store your state externally in something like Azure Storage or SQL Azure. You can optionally use something like Redis cache or Service Fabric's reliable collections to maintain a faster read cache; just make sure all writes are persisted to an external store. This way you can freely delete and recreate your cluster at any time.
2. Use the Service Fabric backup/restore to maintain your state: delete the entire resource group or cluster overnight, then recreate it and restore state in the morning. The backup/restore duration will depend entirely on how much data you are storing and where you export the backup.
3. Utilize something such as Azure Batch. Service Fabric is not really designed to be a temporary high-capacity compute platform that can be started and stopped regularly, so if this is your goal you may want to look at an HPC platform such as Azure Batch, which offers native capabilities to quickly burst up compute capacity.
No. You would have to delete the cluster, then recreate it and deploy the application in the morning.
Turning off the cluster is, as Todd said, not an option. However, you can scale down the number of VMs in the cluster.
During the day you would run the number of VMs required. At night you can scale down to the minimum of 5. Check this page on how to scale VM scale sets: https://azure.microsoft.com/en-us/documentation/articles/service-fabric-cluster-scale-up-down/
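A sketch of that scale-down with the Az.Compute module (the resource group and scale set names are assumptions):

# Scale the node type's VM scale set down to the minimum of 5 instances
$vmss = Get-AzVmss -ResourceGroupName "sf-rg" -VMScaleSetName "nt1vm"
$vmss.Sku.Capacity = 5
Update-AzVmss -ResourceGroupName "sf-rg" -VMScaleSetName "nt1vm" -VirtualMachineScaleSet $vmss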
For development purposes, you can create a Dev/Test Lab Service Fabric cluster which you can start and stop at will.
I have also been able to stop and start SF clusters on Azure by stopping and starting the VM scale sets associated with them. But upon restart all your applications (and with them their state) are gone and must be redeployed.
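For reference, the stop/start itself is just a deallocate and start of the scale set (a sketch with hypothetical names), with the caveat above that applications and state do not survive it:

# Deallocate all VMs in the scale set; compute billing stops and the cluster goes down
Stop-AzVmss -ResourceGroupName "sf-rg" -VMScaleSetName "nt1vm" -Force
# Later: start the VMs again; applications must then be redeployed
Start-AzVmss -ResourceGroupName "sf-rg" -VMScaleSetName "nt1vm"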