Too many data disks attached to an AKS node - Azure

I read that there is a limit on the number of data disks that can be bound to a node in a cluster. Right now I'm using a small node which can only hold up to 4 data disks. If I exceed this amount I get this error: 0/1 nodes are available: 1 node(s) exceed max volume count.
The question I mainly have is how to handle this. I have some apps that only need a small amount of persistent storage in my cluster, yet I can only attach a few data disks. If I bind 4 data disks of 100M each, I have already reached the limit.
Could someone advise me on how to handle these scenarios? I can easily scale up the machines, which gives me more power and more disk slots, but the ratio of disks to server power is completely off at that point.
Best
Pim

You should look at using Azure Files instead of Azure Disks. Azure Files supports ReadWriteMany, so a single mount on the VM (node) can be shared by multiple pods.
https://github.com/kubernetes/examples/blob/master/staging/volumes/azure_file/README.md
https://kubernetes.io/docs/concepts/storage/storage-classes/#azure-file
https://learn.microsoft.com/en-us/azure/aks/azure-files-dynamic-pv
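As a rough sketch of what that can look like on AKS (assuming the built-in azurefile storage class; the names and sizes here are illustrative), a ReadWriteMany claim backed by Azure Files does not consume the node's data-disk slots and can be mounted by many pods at once:

```yaml
# PersistentVolumeClaim backed by Azure Files (illustrative names/sizes).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-files
spec:
  accessModes:
    - ReadWriteMany              # many pods can mount the same share
  storageClassName: azurefile    # built-in AKS class; adjust if yours differs
  resources:
    requests:
      storage: 100Mi
---
# Any pod referencing the claim mounts the same file share.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: app
      image: nginx
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: shared-files
```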

4 PV per node
30 pods per node
Those are the limits on AKS nodes right now.
You can handle it by adding more nodes (and spending more money), or by finding a provider with different limits.
On one of those, as an example, the limits are 127 volumes and 110 pods for the same node size.

Related

Combining two premium managed disks vs. one larger disk to match capacity

I have had this issue for a while: when purchasing Azure managed disks, I have a requirement to allocate a 512 GB premium disk. I'm wondering whether allocating two P15 (256 GB) premium disks would give the same capacity as a P20, with only a small difference in price, IOPS, and throughput. I need to answer the following questions:
Which approach is better to reach 512 GB: a single 512 GB (P20) disk or two 256 GB (P15) disks?
If I allocate two 256 GB (P15) disks, will that double the IOPS and throughput?
Looking at managed disk pricing, two 256 GiB (P15) disks cost more than one 512 GiB (P20) premium disk. The two P15 disks also give 2 x 1100 IOPS, slightly less than the single P20's 2300 IOPS, but a higher combined throughput.
Take the considerations listed below into account when making the choice.
Scale limits (IOPS and throughput): the IOPS and throughput limits of each premium disk size are different and independent from the VM scale limits. Make sure that the total IOPS and throughput from the disks are within the scale limits of the chosen VM size.
For example, suppose an application requires a maximum of 250 MB/sec throughput and you are using a DS4 VM with a single P30 disk. The DS4 VM can give up to 256 MB/sec throughput, but a single P30 disk has a throughput limit of 200 MB/sec. Consequently, the application will be constrained to 200 MB/sec by the disk limit. To overcome this limit, provision more than one data disk on the VM or resize your disks to P40 or P50.
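Using the numbers in that example, the usable throughput is bounded by whichever limit is hit first, roughly:

$$\text{effective throughput} \approx \min\Big(\text{VM limit},\ \textstyle\sum_i \text{disk limit}_i\Big) = \min(256,\ 200) = 200\ \text{MB/sec},$$

while two P30 disks (with an application able to drive both) would give $\min(256,\ 2 \times 200) = 256$ MB/sec, now capped by the VM.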
So if you have a high-scale VM, your application needs more throughput, and it can write to or operate on data across both disks at the same time (so the combined throughput is actually usable), you could select two P15 disks. Otherwise, in general, a single P20 is preferable to two P15 disks.
For more information, you can see Azure premium storage: design for high performance.

How does Cosmos DB distribute throughput when provisioned per database (shared RUs)?

I have a problem understanding how the distribution of provisioned throughput works if I set up Cosmos DB to use shared RUs, i.e. setting RUs at the database level.
I know that when throughput is set at the container (collection) level, it is divided between logical partitions, e.g. if a collection has 400 RU/s provisioned and 10 logical partitions, the throughput for each partition is 400/10 = 40 RU/s.
But what about when throughput is set per database?
The only documentation I found is https://learn.microsoft.com/en-us/azure/cosmos-db/set-throughput#set-throughput-on-a-database
As far as I can tell, the difference is that physical partitions are not dedicated to a single container but can host logical partitions from different containers. Does this mean that the throughput is divided between all logical partitions of all collections/containers?
For example: I have a database with 1000 RU/s throughput and 2 collections, one with 3 logical partitions and the second with 7 logical partitions. Is the throughput divided as 1000 / (3 + 7) = 100 RU/s for each logical partition?
OR
Is the throughput reserved for all collections/partitions in total? E.g. there is a database with 1000 RU/s, and some logical partitions use 800 RU/s while others use 200 RU/s (no matter which collection); is that OK as long as they don't exceed 1000 RU/s in total?
Maybe the question is, in short: is shared throughput distributed evenly between logical partitions (the same as when it is set at the collection level), or is it not (somehow different)?
Thanks for your feedback. If you configure the throughput at the database level, you cannot guarantee the throughput for each container in it, unless you configure throughput on the containers as well.
All containers created inside a database with provisioned throughput must be created with a partition key. At any given point of time, the throughput allocated to a container within a database is distributed across all the logical partitions of that container. When you have containers that share provisioned throughput configured on a database, you can't selectively apply the throughput to a specific container or a logical partition.
You can follow this URL to understand the effect of setting throughput at the database and container levels.
Set throughput on a database and a container
Hope it helps.
Maybe the question is, in short: is shared throughput distributed evenly between logical partitions (the same as when it is set at the collection level), or is it not (somehow different)?
The answer is 'it is not'
The answer you're looking for is on the same page you have been looking at (I have also confirmed this with MS Support).
In this section
https://learn.microsoft.com/en-us/azure/cosmos-db/set-throughput#comparison-of-models
For RUs assigned or available to a specific container (with database-level provisioned throughput, manual or autoscale): no guarantees. The RUs assigned to a given container depend on properties such as the choice of partition keys of the containers that share the throughput, the distribution of the workload, and the number of containers.
Basically, Cosmos has an algorithm that works out how to allocate the provisioned database-level throughput across the containers in the database.
I would expect there to be some performance cost to this algorithm, similar to standard autoscaling, where a burst of RU demand won't trigger a scale-up unless it is sustained for a certain period.
There is an exception if you mix database and container throughput provisioning, where one or more containers have their own fixed provisioned throughput. Those containers are simply separate from the containers that share the database throughput, and they behave the way you described for standard container-level throughput, i.e. the RUs are divided among their partitions.

Start a Kubernetes pod with memory depending on the size of the data job

Is there a way to dynamically scale the memory size of a Pod based on the size of the data job (my use case)?
Currently we have Jobs and Pods that are defined with fixed memory amounts, but we don't know how big the data will be for a given time-slice (sometimes 1,000 rows, sometimes 100,000 rows).
So it will break if the data is bigger than the memory we have allocated beforehand.
I have thought of slicing by data volume, i.e. cutting every 10,000 rows, since we know the memory requirement of processing a fixed number of rows. But we are trying to aggregate by time, hence the need for time-slices.
Or any other solutions, like Spark on kubernetes?
Another way of looking at it:
How can we do an implementation of Cloud Dataflow in Kubernetes on AWS
It's a best practice to always define resources in your container definition, in particular:
limits: the upper bound of CPU and memory
requests: the minimum amount of CPU and memory
This allows the scheduler to make better decisions, and it eases the assignment of a Quality of Service (QoS) class to each pod (https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/), which falls into three possible classes:
Guaranteed (highest priority): when requests = limits
Burstable: when requests < limits
BestEffort (lowest priority): when requests and limits are not set.
The QoS class provides a criterion for killing pods when the system is overcommitted.
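As a minimal sketch (the image name and values are placeholders), a Job whose container sets requests equal to limits lands in the Guaranteed class:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: data-job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: registry.example.com/worker:latest   # placeholder image
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "500m"        # requests == limits -> Guaranteed QoS
              memory: "1Gi"
```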
If you don't know the memory requirement of your pod a priori for a given time-slice, then it is difficult for the Kubernetes Cluster Autoscaler to automatically scale the node pool for you, as per this documentation [1]. Therefore both of your suggestions, running either Cloud Dataflow or Spark on Kubernetes with the Kubernetes Cluster Autoscaler, may not work for your case.
However, you can use custom scaling as a workaround. For example, you can export memory-related metrics from the pod to Stackdriver and then deploy a HorizontalPodAutoscaler (HPA) resource to scale your application, as in [2]; a rough sketch follows the links below.
[1] https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler#how_cluster_autoscaler_works
[2] https://cloud.google.com/kubernetes-engine/docs/tutorials/custom-metrics-autoscaling
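A sketch of that workaround, assuming a memory-related custom metric (here called memory_usage_bytes, an illustrative name) is already exposed through a metrics adapter, using the autoscaling/v2 API:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: data-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: data-worker              # illustrative workload name
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: memory_usage_bytes # hypothetical metric served by the custom metrics API
        target:
          type: AverageValue
          averageValue: 1Gi        # scale out when average usage per pod exceeds ~1Gi
```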
I have found a partial solution to this.
Note there are two parts to this problem:
1. Make the Pod request the correct amount of memory depending on the size of the data job.
2. Ensure that this Pod can find a Node to run on.
The Kubernetes Cluster Autoscaler (CA) can solve part 2.
https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
According to the readme:
Cluster Autoscaler is a tool that automatically adjusts the size of the Kubernetes cluster when there are pods that failed to run in the cluster due to insufficient resources.
Thus if there is a data job that needs more memory than available in the currently running nodes, it will start a new node by increasing the size of a node group.
Details:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md
I am still unsure how to do point 1.
An alternative for point 1 is to start the container without a specific memory request or limit:
https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/#if-you-don-t-specify-a-memory-limit
If you don’t specify a memory limit for a Container, then one of these
situations applies:
The Container has no upper bound on the amount of memory it uses.
or
The Container could use all of the memory available on the Node where it is running.
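If the size of the input is known when the Job is created, one rough workaround for point 1 is to have whatever submits the Job fill in the memory request at that moment. A minimal sketch, where the names and the sizing rule are purely illustrative (the placeholders could be substituted with envsubst or a small script):

```yaml
# job-template.yaml - ${MEMORY_REQUEST} and ${SLICE_ID} are filled in by the submitter
# based on the row count of the slice (e.g. "512Mi" for ~10,000 rows).
apiVersion: batch/v1
kind: Job
metadata:
  name: aggregate-${SLICE_ID}
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: aggregator
          image: registry.example.com/aggregator:latest   # placeholder image
          resources:
            requests:
              memory: ${MEMORY_REQUEST}
```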

Selecting a node size for a GKE kubernetes cluster

We are debating the best node size for our production GKE cluster.
Is it better to have more smaller nodes or fewer larger nodes in general?
e.g. we are choosing between the following two options
3 x n1-standard-2 (7.5GB 2vCPU)
2 x n1-standard-4 (15GB 4vCPU)
We run on these nodes:
Elastic search cluster
Redis cluster
PHP API microservice
Node API microservice
3 x separate Node / React websites
Two things to consider in my opinion:
Replication:
Services like Elasticsearch or Redis cluster/sentinel can only provide reliable redundancy if there are enough Pods running the service: if you have 2 nodes and 5 Elasticsearch Pods, chances are 3 Pods will be on one node and 2 on the other, so your maximum replication will be 2. And if 2 replica Pods happen to be on the same node and that node goes down, you lose the whole index.
[EDIT]: if you use persistent block storage (best for persistence, but complex to set up since each node needs its own block device, which makes scaling tricky), you would not 'lose the whole index', but this is true if you rely on local storage.
For this reason, more nodes is better.
Performance:
Obviously, you need enough resources. Smaller nodes have fewer resources, so if a Pod starts getting lots of traffic it will reach its limits more easily, and Pods are more likely to be evicted.
Elasticsearch is quite a memory hog. You'll have to figure out whether running all these Pods requires bigger nodes.
In the end, as your needs grow, you will probably want to use a mix of nodes with different capacities, which in GKE carry labels describing their capacity; these can be used to set resource quotas and limits for memory and CPU. You can also add your own labels to ensure certain Pods end up on certain types of nodes, as in the sketch below.
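For example (a sketch; the node pool name is illustrative), GKE labels every node with the name of its node pool, so a nodeSelector can pin memory-heavy Pods to a bigger pool:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: elasticsearch-data
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: high-mem-pool   # label set automatically by GKE per node pool
  containers:
    - name: elasticsearch
      image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
      resources:
        requests:
          memory: "4Gi"
        limits:
          memory: "4Gi"
```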

Increasing the number of VMs decreases Cassandra throughput. What can be the reason?

I am using the YCSB benchmarking tool to benchmark a Cassandra cluster.
I am varying the number of Virtual machines in the cluster.
I am using 1 physical host and running 1, 2, 3, or 4 virtual machines for benchmarking (as shown in the attached figure).
The generated workload is the same every time: Workload C, 1,000,000 operations, 10,000 records.
Each VM has 2 GB RAM, 20GB drive
Cassandra - 1 seed node, endpoint_snitch: GossipingPropertyFileSnitch
Keyspace YCSB - Replication factor 3,
The problem is that when I increase the number of virtual machines in the cluster, the throughput decreases. What can be the reason?
By definition, increasing compute resources (i.e. virtual machines) should make the cluster offer better performance, but the opposite is happening, as shown in the attached figure. Kindly explain what the probable reason for this could be. I am writing my thesis on this topic but I am unable to figure out the reason; please help, I will be grateful to you.
[Figure: throughput observed by varying the number of VMs in the Cassandra cluster]
You are very likely hitting a disk I/O bottleneck. Especially with non-SSD drives this is completely expected. Unless you have a dedicated disk and CPU per VM, the competition for resources will cause contention like this. Also, 2 GB per VM is not enough to do any kind of performance benchmark with Cassandra, since the minimum recommended JVM heap size is 8 GB.
Cassandra is great at horizontal scaling (nearly linear), but that doesn't mean that simply adding VMs to one physical host will increase throughput - a single VM on the physical host will have less contention for resources (disk, CPU, memory, network) than 4, so it's likely one VM would perform better than 4.
By definition, if you WERE increasing resources, you SHOULD see it perform better - but you're not; you're simply adding contention on existing resources. If you want to scale Cassandra, you need to test it with additional physical resources - more physical machines, not more VMs on the same machine.
Finally, as Chris Lohfink mentions, your VMs are too small to do meaningful tests - an 8 GB JVM heap is recommended, with another 8 GB of page cache to support reads; running Cassandra with less than 16 GB of RAM is typically non-ideal in production.
You're trying to test a jet engine (a distributed database designed for hundreds or thousands of physical nodes) with gas-station-level equipment - your benchmark hardware isn't viable for a real production environment, so your benchmark results aren't meaningful.
