Erlang NUMA Technology configuration - linux

I am trying to run an Erlang application on an OpenStack VM and I am getting very poor performance. After testing, I found that something is going on with NUMA. This is what I observed in my tests.
My OpenStack compute host has 32 cores, so I created a NUMA-aware VM with 30 vCPUs on it. When I ran the Erlang application benchmark on this VM, performance was very poor. I then created a new VM with 16 vCPUs (in this case all of the VM's CPUs were pinned to NUMA node 0), and the benchmark result was great.
Based on the above test, it is clear that if I keep the VM on a single NUMA node performance is much better, but when I spread it across multiple NUMA nodes it gets worse.
The interesting thing is that when I run the same Erlang application on bare metal, performance is really good, so I am trying to understand why the same application doesn't perform well on a VM.
Is there any setting in Erlang to make it fit NUMA better when running on a virtual machine?

It's possible that Erlang is not able to properly detect the cpu topology of your VM.
You can inspect the CPU topology as seen by the VM using lscpu and lstopo-no-graphics from the hwloc package:
# lscpu | egrep '^(CPU\(s\)|Thread|Core|Socket|NUMA)'
# lstopo-no-graphics --no-io
If it doesn't look correct, consider rebuilding the VM using OpenStack options like hw:cpu_threads=2 hw:cpu_sockets=2 as described at https://specs.openstack.org/openstack/nova-specs/specs/juno/implemented/virt-driver-vcpu-topology.html
On the Erlang side, you might experiment with the Erlang VM options +sct, +sbt as described at http://erlang.org/doc/man/erl.html#+sbt
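As a rough illustration of what a +sct argument can look like, the sketch below builds a topology string for a VM whose logical CPUs are split across NUMA nodes. The helper name build_sct and the 15+15 layout are assumptions made up for this example; the string follows the L<ids>c<ids>p<id>N<id> form shown in the erl documentation, and it should be checked against the lscpu output of the actual VM before use.

```python
def build_sct(node_sizes):
    """Build an Erlang +sct CPU topology string (hypothetical helper).

    node_sizes: number of logical CPUs in each NUMA node, in order.
    Assumes one processor (socket) per NUMA node and no hyperthreads,
    which may not match your VM; check with lscpu first.
    """
    parts = []
    first = 0
    for node, size in enumerate(node_sizes):
        last = first + size - 1
        # L = logical ids, c = core ids within the processor,
        # p = processor id, N = NUMA node id
        parts.append(f"L{first}-{last}c0-{size - 1}p{node}N{node}")
        first = last + 1
    return ":".join(parts)


# Assumed layout: 30 vCPUs split evenly over two NUMA nodes.
topology = build_sct([15, 15])
print(f"erl +sct {topology} +sbt db")
```

Here +sbt db asks the emulator to bind schedulers using its default bind type. Whether binding helps inside a VM depends on how the hypervisor pins vCPUs to host cores, so treat this as something to benchmark rather than a known fix.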

Related

Azure Kubernetes CPU multithreading

I wish to run the Spring batch application in Azure Kubernetes.
At present, my on-premise VM has the below configuration
CPU Speed: 2,593
CPU Cores: 4
My application uses multithreading (~15 threads).
How do I define the CPU in AKS?
resources:
  limits:
    cpu: "4"
  requests:
    cpu: "0.5"
args:
- -cpus
- "4"
Reference: Kubernetes CPU multithreading
AKS Node Pool:
First of all, please note that Kubernetes CPU is an absolute unit:
Limits and requests for CPU resources are measured in cpu units. One
cpu, in Kubernetes, is equivalent to 1 vCPU/Core for cloud providers
and 1 hyperthread on bare-metal Intel processors.
CPU is always requested as an absolute quantity, never as a relative
quantity; 0.1 is the same amount of CPU on a single-core, dual-core,
or 48-core machine
In other words, a CPU value of 1 corresponds to using a single core continuously over time.
The value of resources.requests.cpu is used during scheduling and ensures that the sum of all requests on a single node is less than the node capacity.
When you create a Pod, the Kubernetes scheduler selects a node for the
Pod to run on. Each node has a maximum capacity for each of the
resource types: the amount of CPU and memory it can provide for Pods.
The scheduler ensures that, for each resource type, the sum of the
resource requests of the scheduled Containers is less than the
capacity of the node. Note that although actual memory or CPU resource
usage on nodes is very low, the scheduler still refuses to place a Pod
on a node if the capacity check fails. This protects against a
resource shortage on a node when resource usage later increases, for
example, during a daily peak in request rate.
The value of resources.limits.cpu determines how much CPU the container can use, provided that it is available; see "How Pods with resource limits are run":
The spec.containers[].resources.limits.cpu is converted to its
millicore value and multiplied by 100. The resulting value is the
total amount of CPU time in microseconds that a container can use
every 100ms. A container cannot use more than its share of CPU time
during this interval.
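The conversion described in the quote can be sketched in a few lines. This is only an illustration of the arithmetic, not how the kubelet is implemented; the function names are made up for this example, and only plain and "m"-suffixed CPU quantities are handled.

```python
from decimal import Decimal

CFS_PERIOD_US = 100_000  # the 100ms period from the quoted text, in microseconds


def cpu_to_millicores(cpu):
    """Convert a CPU quantity such as "4", "0.5" or "500m" to millicores."""
    if cpu.endswith("m"):
        return int(cpu[:-1])
    return int(Decimal(cpu) * 1000)


def cpu_time_per_period_us(limit):
    """CPU time in microseconds a container may use in each 100ms period."""
    return cpu_to_millicores(limit) * 100


for limit in ("4", "0.5", "500m"):
    quota = cpu_time_per_period_us(limit)
    # quota / period is the number of cores the container can keep busy
    print(f"limit={limit}: {quota} us per period, {quota / CFS_PERIOD_US} cores")
```

With the limit of "4" from the question, the container may use 400000 microseconds of CPU time per 100ms period, spread over however many threads it runs.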
In other words, the request is what the container is guaranteed in terms of CPU time, and the limit is the most it can use when that CPU time is not being used by someone else.
The concept of multithreading does not change the above: the requests and limits apply to the container as a whole, regardless of how many threads run inside. The Linux scheduler makes scheduling decisions based on waiting time, and with containers cgroups are used to limit the CPU bandwidth. Please see this answer for a detailed walkthrough: https://stackoverflow.com/a/61856689/7146596
To finally answer the question:
Your on-premises VM has 4 cores operating at 2.5 GHz, and if we assume that the CPU capacity is a function of clock speed and number of cores, you currently have 10 GHz "available".
The CPUs used in Standard_D16ds_v4 have a base speed of 2.5 GHz and can run at up to 3.4 GHz for shorter periods according to the documentation:
The D v4 and Dd v4 virtual machines are based on a custom Intel® Xeon®
Platinum 8272CL processor, which runs at a base speed of 2.5Ghz and
can achieve up to 3.4Ghz all core turbo frequency.
Based on this, specifying 4 cores should be enough to give you the same capacity as on-premises.
However, number of cores and clock speed are not everything (caches etc. also impact performance), so to optimize the CPU requests and limits you may have to do some testing and fine tuning.
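The back-of-the-envelope comparison above can be written out as follows. It uses only the figures quoted in this answer and the same simplifying assumption that capacity is cores times clock speed; the function name is made up for this example.

```python
def aggregate_ghz(cores, clock_ghz):
    """Crude capacity estimate: number of cores times clock speed in GHz."""
    return cores * clock_ghz


on_prem = aggregate_ghz(4, 2.5)    # on-premises VM: 4 cores at 2.5 GHz
aks_base = aggregate_ghz(4, 2.5)   # 4 CPUs on D v4 at the 2.5 GHz base speed
aks_turbo = aggregate_ghz(4, 3.4)  # same 4 CPUs at the 3.4 GHz turbo frequency

print(f"on-prem: {on_prem} GHz, AKS base: {aks_base} GHz, AKS turbo: {aks_turbo:.1f} GHz")
```

By this estimate a limit of 4 CPUs matches the on-premises VM at base speed, with some headroom when the processor runs at turbo frequency.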
I'm afraid there is no easy answer to your question when it comes to planning the right size of VM node pools for a Kubernetes cluster to fit your workload's resource requirements. This is a constant effort for cluster operators and requires you to take many factors into account. Let's mention a few of them:
What Quality of Service (QoS) class (Guaranteed, Burstable, BestEffort) should I specify for my application Pod, and how many of them do I plan to run?
Do I really know the actual CPU/memory usage of my app vs. how much of the VM's compute resources stay idle? (Is there any on-prem monitoring solution in place right now that could prove it, or that could easily be moved to an in-cluster Kubernetes one?)
Is my cluster a multi-tenant environment, where I need to share cluster resources with different teams?
Node (VM) capacity is not the same as total available resources to workloads
You should think here in terms of cluster Allocatable resources:
Allocatable = Node Capacity - kube-reserved - system-reserved
In the case of the Standard_D16ds_v4 VM size in Azure, you would have 14 CPU cores at your workloads' disposal, not 16 as assumed earlier.
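The Allocatable formula and its effect on scheduling can be sketched as follows. The split of the reserved CPU into kube-reserved and system-reserved below is a made-up assumption (only the 16 vs. 14 core totals come from this answer), and the function names are hypothetical. Values are in millicores to avoid floating-point rounding.

```python
def allocatable(capacity_m, kube_reserved_m, system_reserved_m):
    """Allocatable = Node Capacity - kube-reserved - system-reserved (millicores)."""
    return capacity_m - kube_reserved_m - system_reserved_m


def max_pods_by_cpu_request(allocatable_m, request_m):
    """How many Pods fit on a node if only their CPU requests are considered."""
    return allocatable_m // request_m


# 16 cores of capacity; the 1500m/500m split is an assumption for illustration.
node_allocatable = allocatable(16_000, kube_reserved_m=1_500, system_reserved_m=500)

# The Pod spec from the question requests 0.5 CPU (500m).
print(node_allocatable, max_pods_by_cpu_request(node_allocatable, 500))
```

Only the requests count towards this check, so with a request of 0.5 CPU and a limit of 4, the Pods placed on one node could together try to use far more CPU than the node has.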
I hope you are aware that the number of CPUs specified through args:
args:
- -cpus
- "2"
is an app-specific approach (in this case for the 'stress' utility, written in Go), not a general way to spawn a declared number of threads per CPU.
My suggestion:
To avoid over-provisioning or under-provisioning cluster resources for your workload (requested resources vs. actually utilized resources), and to optimize the cost and performance of your applications, in your place I would first do a preliminary sizing estimate of the VM node pool size and type required by your SpringBoot multithreaded app, and thereby get familiar with concepts like bin-packing and app right-sizing. For these last two topics I don't know a better public guide than the one recently published by the GCP tech team:
"Monitoring gke-clusters for cost optimization using cloud monitoring"
I would encourage you to find an answer to your question yourself. Do the proof of concept on GKE first (with a free trial), replace the demo app in the guide above with your own workload, then come back here and share your observations; they would be valuable for others with a similar task too!

Scaling Node.js: Using autoscaling groups with small virtual servers or cluster processes on VM with many vCPUs?

In learning about Node.js's cluster module I've been turning over the following architecture in my head: Balancing costs with performance, would it be more beneficial (i.e. cheapest but still scalable) to run your Node.js application in a cloud service's autoscaling group using small servers with one virtual CPU (say, AWS's t2.small EC2, 1 vCPU, 2gb memory) or use a larger server (say, an m5.xlarge 4 vCPU, 16gb memory), run Node.js to cluster four child processes to use the 4 vCPUs, but still autoscale?
A possible trade-off is the time it takes AWS to deploy another small server to autoscale, but on a low-traffic app or utility app you'll have to take on the cost of running the larger server when usage is low. But if the time it takes to spin up another server to handle the load is nominal, does that negate the benefits of using the cluster module?
Specifically, my question is twofold: Are these two approaches feasible and, if so, is my presumption about the cluster module's usefulness in the small server approach correct?

Cassandra CPU performance

I deployed a Cassandra 2.2 ring composed of 4 nodes in the cloud, each with 8 vCPUs and 8 GB of RAM. I am now running some tests with the cassandra-stress and YCSB tools to test its performance. I am mainly interested in read requests with a small amount of write requests (95%/5%).
Running the experiments, I noticed that even when I set a high number of threads (or clients), the CPU (and disk) does not saturate; utilisation always stays around 60%.
I am trying to figure out where the bottleneck in my system is. From the hardware point of view, everything seems OK to me.
dstat
I also looked into the Cassandra configuration file to see if there are any tuning parameters to increase the system throughput. I increased the value of the concurrent_read/write parameters, but it didn't improve performance.
The log file also does not contain any warnings.
What could be limiting my system?
Thanks
You might want to consider running cassandra-stress from outside the cluster and on multiple instances as described in
Usage of the Cassandra tool cassandra-stress

Setting up 3-4 node Cassandra (resource-light) test cluster (s.a. in linux container)?

Trying to see if it is possible to setup a 3 or 4 node Cassandra cluster, with minimum resource requirements, that can be installed on something like a single VM, either inside Linux container, or directly on the single VM using different port-numbers or virtual NICs/IPs.
This will be used for some application demonstration where I might like to demonstrate datastore high-availability, data partitioning, dynamic addition / removal of cluster node.
The setup would be running on a VM running on a laptop, so "resources" are a constraint (i.e. the VRAM and vCPUs that can be allocated for this purpose). Also, the actual data stored would be quite limited (let's say everything in a single keyspace, about 10 tables with 10-odd columns and 1000 rows).
From your description, it sounds like ccm might be the tool for you. With it, you can create local clusters on your laptop (or in a VM, I suppose), add nodes, delete nodes, etc. It can be easily installed on Linux, Mac, or Windows. You don't need to use a VM; that's your choice. I imagine you would see performance degradation in a VM.

Separate cpuset for jobs

Is there a way to restrict cpus and memory for users running scripts directly, but allow more cpus and memory on job submission?
I am running torque/pbs on Ubuntu 14.04 server and want to allow "normal" usage of 8 cpus and 16GB RAM, and the rest to be dedicated as a "mom" resource for the cluster. Normal cgroups/cpuset configuration also restricts the running jobs.
If you configure Torque with --enable-cpuset, the mom will automatically create a cpuset for each job. Torque isn't really equipped to use only part of a machine, but a hack that might work in conjunction with using only half the machine is to specify np= in the nodes file; the mom will then restrict the jobs to the first X cpus.
