Marathon on Azure Container Service - cannot scale to all nodes - azure

I have setup up a VM cluster using Azure Container Service. The container orchestrator is DC/OS. There are 3 Master nodes and 3 slave agents.
I have a Docker app that I am trying to launch on my cluster using Marathon. Each time I launch, I notice that the CPU utilization of 3 nodes is always 0 i.e. the app is never scheduled on them. The other 3 nodes, on the other hand, have almost 100% CPU utilization. (As I scale the application.) At that point, the scaling stops and Marathon shows state "waiting" for resource ads from Mesos.
I don't understand why Marathon is not scheduling more containers, despite there being empty nodes when I try to scale the application.
I know that Marathon runs on the Master nodes; is it unaware of the presence of the slave agents? (Assuming that the 3 free nodes are the slaves.)
Here is the config file of the application: pastebin-config-file
How can I make full use of the machines using Marathon?

Tasks are not scheduled to the masters. They are reserved for management of the cluster.

Related

what is an agent node in aks

I found the mention of an agent node in the aks documentation but i'm not finding the defition of it. can anyone please explain it to ? also want to know if is it an azure concept or a kubernetes concept.
Regards,
In Kubernetes the term node refers to a compute node. Depending on the role of the node it is usually referred to as control plane node or worker node. From the docs:
A Kubernetes cluster consists of a set of worker machines, called nodes, that run containerized applications. Every cluster has at least one worker node.
The worker node(s) host the Pods that are the components of the application workload. The control plane manages the worker nodes and the Pods in the cluster. In production environments, the control plane usually runs across multiple computers and a cluster usually runs multiple nodes, providing fault-tolerance and high availability.
Agent nodes in AKS refers to the worker nodes (which should not be confused with the Kubelet, which is the primary "node agent" that runs on each worker node)

Azure Kubernetes Services scale up trigger

I am trying to figure out what is the trigger to scale AKS cluster out horizontally with nodes. I am having a cluster that runs on 103% CPU for 5+ minutes but there is no action taken. Any ideas what the triggers are and how I could customize them? If I start more jobs the cluster will lower the CPU allocation for all pods.
The article that MS has doesn't have anything specific around that https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler
You need to notice that:
The cluster autoscaler is a Kubernetes component. Although the AKS
cluster uses a virtual machine scale set for the nodes, don't manually
enable or edit settings for scale set autoscale in the Azure portal or
using the Azure CLI. Let the Kubernetes cluster autoscaler manage the
required scale settings.
Which brings us to the actual Kubernetes Cluster Autoscaler:
Cluster Autoscaler is a tool that automatically adjusts the size of
the Kubernetes cluster when one of the following conditions is true:
there are pods that failed to run in the cluster due to insufficient resources.
there are nodes in the cluster that have been underutilized for an extended period of time and their pods can be placed on other existing
nodes.
The first condition above is the trigger you are looking for.
To get more details regarding the installation and configuration you can go through the Cluster Autoscaler on Azure. For example, you can customize your CA based on the Resources:
When scaling from an empty VM Scale Set (0 instances), Cluster
Autoscaler will evaluate the provided presources (cpu, memory,
ephemeral-storage) based on that VM Scale Set's backing instance type.
This can be overridden (for instance, to account for system reserved
resources) by specifying capacities with VMSS tags, formated as:
k8s.io_cluster-autoscaler_node-template_resources_<resource name>: <resource value>. For instance:
k8s.io_cluster-autoscaler_node-template_resources_cpu: 3800m
k8s.io_cluster-autoscaler_node-template_resources_memory: 11Gi

Pre pull image for new Kubernetes cluster autoscale-up Nodes

My pods/containers run on a docker image which is about 4GiB in size. Pulling the image from the container registry takes about 2 mins whenever a new VM node is spun up when resources are insufficient.
That is to say, whenever a new request comes in and the Kubernetes service auto scale up a new node, it takes 2 mins+. User has to wait 2 mins to put through a request. Not ideal. I am currently using the Azure AKS to deploy my application and using their cluster autoscaler feature.
I am using a typical deployment set up with 1 fix master pod, and 3 fix worker pods. These 3 worker pods correspond to 3 different types of requests. Each time a request comes in, the worker pod will generate a K8 Job to process the request.
BIG Question is, how can I pre pull the images so that when a new node is spun up in the Kubernetes cluster, users don't have to wait so long for the new Job to be ready?
If you are using Azure Container Registry (ACR) for storing and pulling your images you can enable teleportation that will significantly reduce your image pull time. Refer link for more information

How to reschedule my pods after scaling down a node in Azure Kubernetes Service (AKS)?

I am going to start with an example. Say I have an AKS cluster with three nodes. Each of these nodes runs a set of pods, let's say 5 pods. That's 15 pods running on my cluster in total, 5 pods per node, 3 nodes.
Now let's say that my nodes are not fully utilized at all and I decide to scale down to 2 nodes instead of 3.
When I choose to do this within Azure and change my node count from 3 to 2, Azure will close down the 3rd node. However, it will also delete all pods that were running on the 3rd node. How do I make my cluster reschedule the pods from the 3rd node to the 1st or 2nd node so that I don't lose them and their contents?
The only way I feel safe to scale down on nodes right now is to do the rescheduling manually.
Assuming you are using Kubernetes deployments (or replica sets) then it should do this for you. Your deployment is configured with a set number of replicas to create for each pod when you remove a node the scheduler will see that the current active number is less than the desired number and create new ones.
If you are just deploying pods without a deployment, then this won't happen and the only solution is manually redeploying, which is why you want to use a deployment.
Bear in mind though, what you get created are new pods, you are not moving the previously running pods. Any state you had on the previous pods that is not persisted will be gone. This is how it is intended to work.

Azure Kubernetes Cluster Node Failure Scenario

Let's say I have 3 nodes in my cluster and I want to run 300 jobs.
If I run 1 job per POD and 100 pods per NODE, what will happen if a node fails in Azure Kubernetes Service?
Those Jobs will go to pending, as Kubernetes supports 110 pods per node, so wouldn't have the resources to support the failed over jobs. You could look at using the Cluster Autoscaler (Beta) and it would provision more host to satisfy running those jobs that are in a pending state.
if a node fails
Cluster Autoscaler (CA) can be used to handle node failures in Azure using autoscaling groups:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/README.md
https://learn.microsoft.com/en-us/azure/aks/autoscaler
https://learn.microsoft.com/en-us/azure/aks/scale-cluster

Resources