Start kubernetes pod memory depending on size of data job - apache-spark

Is there a way to dynamically scale the memory size of a Pod based on the size of the data job (my use case)?
Currently we have Jobs and Pods that are defined with fixed memory amounts, but we don't know how big the data will be for a given time-slice (sometimes 1,000 rows, sometimes 100,000 rows).
So it will break if the data is bigger than the memory we have allocated beforehand.
I have thought of slicing by data volume, i.e. cutting every 10,000 rows: we would then know the memory requirement for processing a fixed number of rows. But we are trying to aggregate by time, hence the need for time-slices.
Are there any other solutions, like Spark on Kubernetes?
Another way of looking at it:
How can we do an implementation of Cloud Dataflow in Kubernetes on AWS?

It's a best practice to always define resources in your container definition, in particular:
limits: the upper bound of CPU and memory
requests: the minimum guaranteed amount of CPU and memory
This allows the scheduler to make better decisions, and it determines the Quality of Service (QoS) class assigned to each pod (https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/), which falls into three possible classes:
Guaranteed (highest priority): when requests = limits
Burstable: when requests < limits
BestEffort (lowest priority): when neither requests nor limits are set.
The QoS class provides the criterion for killing pods when the system is overcommitted.
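The fields above can be sketched in a pod spec like this (names and values are illustrative, not from the question):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: data-job            # illustrative name
spec:
  containers:
  - name: worker
    image: registry.example.com/worker:latest   # placeholder image
    resources:
      requests:             # the scheduler reserves at least this much
        cpu: 500m
        memory: 1Gi
      limits:               # hard ceiling; exceeding the memory limit gets the container OOM-killed
        cpu: "1"
        memory: 2Gi
```

Since requests < limits here, this pod falls into the Burstable QoS class; setting requests equal to limits would make it Guaranteed.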

If you don’t know the memory requirement of your pod a priori for a given time-slice, then it is difficult for the Kubernetes Cluster Autoscaler to automatically scale the node pool for you, as described in this documentation [1]. Therefore both of your suggestions, running either Cloud Dataflow or Spark on Kubernetes with the Cluster Autoscaler, may not work for your case.
However, you can use custom scaling as a workaround. For example, you can export memory-related metrics of the pod to Stackdriver, then deploy a HorizontalPodAutoscaler (HPA) resource to scale your application as in [2].
[1] https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler#how_cluster_autoscaler_works
[2] https://cloud.google.com/kubernetes-engine/docs/tutorials/custom-metrics-autoscaling
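As a sketch of the workaround in [2], an HPA that scales on a per-pod memory metric could look like the following. The metric name, targets and workload name are assumptions: the exact metric you can reference depends on the custom-metrics adapter you deploy (e.g. the Stackdriver adapter).

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: data-job-hpa          # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: data-job            # illustrative target workload
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods                # per-pod custom metric, averaged across pods
    pods:
      metric:
        name: memory_used_bytes   # assumed name of the exported metric
      target:
        type: AverageValue
        averageValue: 1Gi     # scale out when average pod memory exceeds 1Gi
```

Note this scales the number of replicas, not the memory of a single pod, so it only helps if the job can be parallelized.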

I have found a partial solution to this.
Note there are 2 parts to this problem.
1. Make the Pod request the correct amount of memory depending on size of data job
2. Ensure that this Pod can find a Node to run on.
The Kubernetes Cluster Autoscaler (CA) can solve part 2.
https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
According to the readme:
Cluster Autoscaler is a tool that automatically adjusts the size of the Kubernetes cluster when there are pods that failed to run in the cluster due to insufficient resources.
Thus if there is a data job that needs more memory than available in the currently running nodes, it will start a new node by increasing the size of a node group.
Details:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md
I am still unsure how to do point 1.
An alternative for point 1 is to start the container without a specific memory request or limit:
https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/#if-you-don-t-specify-a-memory-limit
If you don’t specify a memory limit for a Container, then one of these
situations applies:
The Container has no upper bound on the amount of memory it uses, and
could use all of the memory available on the Node where it is running.
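A sketch of such a container spec: keep a request (so the scheduler still places the pod sensibly) but omit the limit. Names are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: data-job-nolimit     # illustrative name
spec:
  containers:
  - name: worker
    image: registry.example.com/worker:latest   # placeholder image
    resources:
      requests:
        memory: 1Gi          # the scheduler reserves this much
      # no memory limit: the container may use whatever memory is free on the node
```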

Related

AKS cluster autoscaler profiles modification

We are using an AKS cluster (1.19.11) and our user node pools, where the application pods are running, are under-utilized (only 30% consumption). So we were thinking of cost optimization by reducing the node counts of the node pools.
We would like to know what should be considered while planning for a node count decrease.
Assume that the node utilization can be estimated and calculated using the pods' requests values, with no need to consider the limit range as the autoscaler is enabled.
Also, is it possible to modify the cluster autoscaler profile property "scaleDownUtilizationThreshold": "0.5" to a higher percentage, and is it recommended to increase it to 70%?
The assumption,
node utilization can be estimated and calculated using the pods' requests values, with no need to consider the limit range as the autoscaler is enabled
will hold good as long as you don't care about which process/container gets evicted in case of node resource starvation (if controlled by a Deployment, ReplicaSet or StatefulSet, the workloads will be resurrected on a new node scaled out by the autoscaler).
However, in most cases you would have some kind of priority among your workloads, and you would want to set thresholds (limits) accordingly so that you don't have to deal with the kernel evicting important processes (maybe not the one that caused the starvation, but the one using the most resources right when the evaluation happened).
Also, is it possible to modify the cluster autoscaler profile property "scaleDownUtilizationThreshold": "0.5" to a higher percentage, and is it recommended to increase it to 70%?
Yes, the value of Cluster Autoscaler Profile scale-down-utilization-threshold can be updated using the command:
az aks update \
--resource-group myResourceGroup \
--name myAKSCluster \
--cluster-autoscaler-profile scale-down-utilization-threshold=<value>
(The value is a decimal fraction, e.g. 0.7 for 70%.)
[Reference]
AKS uses node resources to help the node function as part of your cluster. This usage can create a discrepancy between your node's total resources and the allocatable resources in AKS. [Reference]
Now scale-down-utilization-threshold is the node utilization level, defined as sum of requested resources divided by allocatable capacity, below which a node can be considered for scale down.
So ultimately there is no single best practice to share here: your use case, architectural design and requirements dictate what the scale-down-utilization-threshold should be for the cluster autoscaler.

when to add nodes to cassandra cluster

What are the symptoms/signs that indicate that the existing cluster nodes are over capacity and that more nodes need to be added to the cluster? I want to know the possible performance symptoms after which more nodes should be added.
It depends a lot on the configuration and your use cases. You may have to take a look at the different metrics from your existing cluster. A few metrics that you should keep an eye on include:
CPU usage
Query Latency
Memory (depends on how you are using the heap memory)
Disk usage
Based on these metrics, you should make a decision whether to scale out the cluster or not.
These are the common scenarios to look for when adding a new node:
Performance of the cluster is degraded: you are not getting the required throughput even after all the tunings.
More disk space is required. Generally you can increase disk space by adding a new disk, but after a limit (generally 2 TB) it is advised to add a new node.
You have metrics in hand to identify that your performance is degrading. For example, you can use nodetool tablehistograms to identify read and write latency for a particular table. If read/write latency is within your required latencies then you are good; if you see your system getting slower with more traffic, that is a sign you should add a node to the cluster.

Kubernetes Pods not using CPU more than 1m

My cluster is in AKS with 5 Nodes of size Standard_D4s_v3 and with K8s version 1.14.8.
As soon as a pod is started/restarted it shows Running (kubectl get pods), and until the pods are in the Running state the CPU usage shows 150m or as much as they require.
But when I top it (kubectl top po) after a pod has moved to the Running state, the specific pod shows only 1m CPU usage; memory usage is where it should be, but the service is down as well.
kubectl logs -f <pod_name> returns nothing, but I can ssh into the pods (kubectl exec -it ...).
It's totally normal behavior: a pod needs more CPU resources while it is being created; once it's created it doesn't need that many resources anymore.
You can always set CPU/memory requests and limits; more about it, with examples of how to do it, here.
Pod CPU/Memory requests define a set amount of CPU and memory that the pod needs on a regular basis.
When the Kubernetes scheduler tries to place a pod on a node, the pod requests are used to determine which node has sufficient resources available for scheduling.
Not setting a pod request will default it to the limit defined.
It is very important to monitor the performance of your application to adjust these requests. If insufficient requests are made, your application may receive degraded performance due to over scheduling a node. If requests are overestimated, your application may have increased difficulty getting scheduled.
Pod CPU/Memory limits are the maximum amount of CPU and memory that a pod can use. These limits help define which pods should be killed in the event of node instability due to insufficient resources. Without proper limits set pods will be killed until resource pressure is lifted.
Pod limits help define when a pod has lost control of resource consumption. When a limit is exceeded, the pod is prioritized for killing to maintain node health and minimize impact to pods sharing the node.
Not setting a pod limit defaults it to the highest available value on a given node.
Don't set a pod limit higher than your nodes can support. Each AKS node reserves a set amount of CPU and memory for the core Kubernetes components. Your application may try to consume too many resources on the node for other pods to successfully run.
Again, it is very important to monitor the performance of your application at different times during the day or week. Determine when the peak demand is, and align the pod limits to the resources required to meet the application's max needs.

Kubernetes doesn't take into account total node memory usage when starting Pods

What I see: Kubernetes takes into account only the memory used by its components when scheduling new Pods, and considers the remaining memory as free, even if it's being used by other system processes outside Kubernetes. So, when creating new deployments, it attempts to schedule new pods on a suffocated node.
What I expected to see: Kubernetes automatically takes into consideration the total memory usage (by Kubernetes components + system processes) and schedules the pod on another node.
Is there a configuration parameter that I need to set as a workaround, or is it a bug?
Yes, there are a few parameters to allocate resources:
You can manually allocate memory and CPU for your pods and for your system daemons. The documentation shows how it works with an example:
Example Scenario
Here is an example to illustrate Node Allocatable computation:
Node has 32Gi of memory, 16 CPUs and 100Gi of Storage
--kube-reserved is set to cpu=1,memory=2Gi,ephemeral-storage=1Gi
--system-reserved is set to cpu=500m,memory=1Gi,ephemeral-storage=1Gi
--eviction-hard is set to memory.available<500Mi,nodefs.available<10%
Under this scenario, Allocatable will be 14.5 CPUs, 28.5Gi of memory and 98Gi of local storage. The scheduler ensures that the total memory requests across all pods on this node do not exceed 28.5Gi and storage doesn't exceed 88Gi. Kubelet evicts pods whenever the overall memory usage across pods exceeds 28.5Gi, or if overall disk usage exceeds 88Gi. If all processes on the node consume as much CPU as they can, pods together cannot consume more than 14.5 CPUs.
If kube-reserved and/or system-reserved is not enforced and system daemons exceed their reservation, kubelet evicts pods whenever the overall node memory usage is higher than 31.5Gi or storage is greater than 90Gi.
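The arithmetic in the scenario can be reproduced directly (allocatable = capacity - kube-reserved - system-reserved - hard eviction threshold), using the example's numbers:

```shell
# Node Allocatable computation from the example scenario above
awk 'BEGIN {
  printf "memory allocatable:  %.1f Gi\n", 32 - 2 - 1 - 0.5   # eviction-hard: memory.available<500Mi
  printf "cpu allocatable:     %.1f\n",    16 - 1 - 0.5       # no hard eviction threshold for CPU
  printf "storage allocatable: %d Gi\n",   100 - 1 - 1        # the 10% nodefs eviction applies on top
}'
```

This prints 28.5 Gi, 14.5 CPUs and 98 Gi, matching the documented Allocatable values; subtracting the 10% nodefs eviction threshold from the 98 Gi gives the 88 Gi the scheduler enforces for storage.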
You can reserve as much as you need for Kubernetes with the flag --kube-reserved and for the system with the flag --system-reserved.
Additionally, if you need stricter rules for spawning pods, you could try to use Pod Affinity.
The kubelet has the parameter --system-reserved that allows you to reserve CPU and memory for system processes.
It is not dynamic (you reserve resources only at kubelet launch) but it is the only way to tell the kubelet not to use all the resources on a node.
--system-reserved mapStringString
A set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=500Mi,ephemeral-storage=1Gi) pairs that describe resources reserved for non-kubernetes components. Currently only cpu and memory are supported. See http://kubernetes.io/docs/user-guide/compute-resources for more detail. [default=none]
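On recent clusters the same reservations are usually set in the kubelet configuration file rather than as command-line flags; a sketch with illustrative values:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:              # equivalent of --system-reserved
  cpu: 500m
  memory: 1Gi
kubeReserved:                # equivalent of --kube-reserved
  cpu: 500m
  memory: 1Gi
evictionHard:                # equivalent of --eviction-hard
  memory.available: 500Mi
```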

Selecting a node size for a GKE kubernetes cluster

We are debating the best node size for our production GKE cluster.
Is it better to have more smaller nodes or less larger nodes in general?
e.g. we are choosing between the following two options
3 x n1-standard-2 (7.5GB 2vCPU)
2 x n1-standard-4 (15GB 4vCPU)
We run on these nodes:
Elastic search cluster
Redis cluster
PHP API microservice
Node API microservice
3 x separate Node / React websites
Two things to consider in my opinion:
Replication:
Services like Elasticsearch or Redis cluster/sentinel can only provide reliable redundancy if there are enough Pods running the service: if you have 2 nodes and 5 Elasticsearch Pods, chances are 3 Pods will be on one node and 2 on the other, so your maximum replication will be 2. If you happen to have 2 replica Pods on the same node and it goes down, you lose the whole index.
[EDIT]: if you use persistent block storage (best for persistence, but complex to set up since each node needs its own block, which makes scaling tricky), you would not 'lose the whole index', but this is the case if you rely on local storage.
For this reason, more nodes is better.
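One way to enforce that spread is a pod anti-affinity rule in the Elasticsearch pod template, so that no two replicas land on the same node (the label is an assumption):

```yaml
# fragment of a pod template spec
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: elasticsearch              # assumed pod label
      topologyKey: kubernetes.io/hostname # at most one matching pod per node
```

With only 2 nodes, a required rule leaves the extra replicas Pending; with more, smaller nodes the replicas can actually spread out.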
Performance:
Obviously, you need enough resources. Smaller nodes have fewer resources, so if a Pod starts getting lots of traffic it will reach its limits more easily, and Pods will be evicted.
Elasticsearch is quite a memory hog. You'll have to figure out whether running all these Pods requires bigger nodes.
In the end, as your needs grow, you will probably want to use a mix of nodes of different capacities, which in GKE will have labels for capacity that can be used to set resource quotas and limits for memory and CPU. You can also add your own labels to ensure certain Pods end up on certain types of nodes.
