While creating a KVM virtual machine in Proxmox from the GUI, the hard disk and CPU tabs contain a couple of options that are confusing.
For example, in the hard disk tab, what do "No backup", "Discard" and "Iothread" signify?
And similarly, in the CPU tab, what do "Sockets", "Cores" and "Enable NUMA" mean?
I did not have any luck with Google, and the results I did get were conflicting.
No backup instructs Proxmox to exclude that disk from backups of the VM.
Discard allows the guest to use fstrim or the discard mount option to free up unused space on the underlying storage. This is only usable with the virtio-scsi driver.
Iothread sets the AIO mode to threads (instead of native). For more information check this presentation.
Sockets is the number of CPUs that the guest will see as installed/available.
Cores is the number of cores that the guest will be able to use for each CPU.
If your server has 2 sockets, each with a 6-core CPU, you could put 2 in the Sockets field and 6 in the Cores field, and the guest would be able to use 100% of both CPUs. You could also put 1 in Sockets and 3 in Cores, and the guest would only be able to use 50% of one CPU, i.e. 25% of the available CPU power on the server.
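As a quick illustration of that arithmetic, here is a minimal Python sketch (the numbers are just the example above, nothing Proxmox-specific):

```python
# Fraction of the host's CPU power a guest can use for given
# Sockets/Cores settings (example numbers from the text above).
def guest_cpu_share(guest_sockets, guest_cores, host_sockets, host_cores):
    return (guest_sockets * guest_cores) / (host_sockets * host_cores)

print(guest_cpu_share(2, 6, 2, 6))  # 1.0  -> 100% of the server
print(guest_cpu_share(1, 3, 2, 6))  # 0.25 -> 25% of the server
```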
Enable NUMA allows the guest to make use of the NUMA architecture on servers that have it (typically servers with two or more sockets). Check www.admin-magazine.com/Archive/2014/20/Best-practices-for-KVM-on-NUMA-servers
Simple question: Is it dangerous to activate KSM on a running hypervisor (Debian 8 with 3.16 kernel)?
Or is it recommended to shut down all virtual machines (KVM/qemu) first, then activate KSM and then start the virtual machines again?
We expect a memory saving of approx. 50% (we have a similar system where KSM is already active, and there we effectively save almost 50% because the VM appliances are always very similar).
I am using VirtualBox 5.1 running on a host with 48 CPUs and 250GB of RAM.
The virtual machine that I am importing (the guest) initially had 2 CPUs and 4GB of RAM.
Inside this machine I am running a Java process that starts a dynamic number of threads to perform some tasks.
I ran it with the configurations below:
The whole process on my laptop (2 CPUs / 4GB RAM): ~11 seconds
The same program in the virtual machine on the server (15 CPUs / 32GB RAM): ~45 seconds
The same program in the virtual machine on the server (20 CPUs / 32GB RAM): ~100+ seconds
The same program in the virtual machine on the server (10 CPUs / 32GB RAM): ~5+ seconds
At first I thought there was a problem with how I was managing the threads from Java, but after many tests I figured out that there was a relation between the number of CPUs the virtual machine has and its performance: the maximum was 10, and above that the overall performance of the machine slowed down (CPU starvation?).
The virtual machine runs Oracle Enterprise Linux 6.7 and the host runs Oracle Enterprise Linux 6.9
I couldn't find any hard limit regarding the number of CPUs in the VirtualBox documentation.
Is there a setting that needs to be set to enable/take advantage of more than 10 CPUs in a VirtualBox instance?
Some time has passed since I posted this question, so just for the archive I will share my findings, hoping they can save time for others.
It turns out that the performance issues were due to the way VirtualBox works, especially the relationship between the OS and the hypervisor.
In the end, the virtual machine (the guest OS) is a single process from the host's point of view, and when you modify the number of CPUs in the virtual machine settings, what really changes is the number of threads that this process uses to emulate those CPUs (at least in VirtualBox).
Having said that, when I assigned 10+ CPUs to the VM I ended up with:
a single process with 10+ threads
an emulated OS running hundreds of processes
my Java code which was creating another bunch of threads
All of that together meant that the setup was saturating the host's virtual machine process, which I think was due to the way the host OS was handling context switching between processes.
On my server, the hard limit was 7 virtual CPUs; if I added more than that, the performance of the Java software slowed down.
Running the Java software outside of the VM didn't show any performance issues; it worked out of the box with 60+ isolated threads.
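For anyone who wants to reproduce this kind of measurement, here is a rough sketch (in Python with worker processes rather than my original Java threads, so the workload is only a placeholder) that times a fixed CPU-bound job with different worker counts:

```python
# Rough sketch: time a fixed CPU-bound workload with different numbers of
# worker processes to see where adding workers stops helping inside a VM.
import time
from concurrent.futures import ProcessPoolExecutor

def burn(n):
    # Placeholder CPU-bound task (not the original Java workload).
    total = 0
    for i in range(n):
        total += i * i
    return total

def measure(workers, tasks=64, size=2_000_000):
    start = time.time()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(burn, [size] * tasks))
    return time.time() - start

if __name__ == "__main__":
    for workers in (2, 4, 8, 10, 16, 20):
        print(f"{workers:2d} workers: {measure(workers):6.2f} s")
```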
We have almost the same setup as yours (VirtualBox running on a 48-core machine across 2 NUMA nodes).
I initially set the number of cores to the maximum supported by VirtualBox (e.g. 32), but quickly realized that, when the VM was under load, one of the two NUMA nodes was always idle while the other stayed at a medium load.
Long story short, a process can only be assigned to a single NUMA node, and VirtualBox runs one user process with several threads... which means that we are limited to using 24 cores (and even fewer in practice, considering that this is a 12-core CPU with hyperthreading).
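If you want to at least keep the VirtualBox process on one node's cores explicitly, one option on a Linux host is to set its CPU affinity; a minimal sketch (the PID and the node-0 CPU list are assumptions — check numactl --hardware and your own process list):

```python
# Sketch: pin an already-running process (e.g. the VirtualBox VM process)
# to the CPUs of a single NUMA node on a Linux host.
import os

VM_PID = 12345                  # hypothetical PID of the VM process
NODE0_CPUS = set(range(0, 24))  # hypothetical CPU list of NUMA node 0

os.sched_setaffinity(VM_PID, NODE0_CPUS)
print("affinity now:", sorted(os.sched_getaffinity(VM_PID)))
```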
Google Compute Engine rents Linux VMs of all sizes, from 1 core to 64 cores, at various prices. There are preemptible instances for about 1/4 the price of guaranteed instances, but preemptible instances can be terminated at any time (with an ACPI G2 soft-off warning and ~30 seconds until hard cutoff). Although you can provide startup and shutdown scripts, the usual approach seems to lead to the unnecessary overhead of writing additional software to allow calculations to be interrupted and to manage their partial results, whereas the suspend-to-disk/restore-from-disk scheme seen in laptops and desktops could be a much simpler way to store and resume calculations, and therefore preferable.
If I start a preemptible Linux VM on GCE, is it generally possible to suspend the state of the VM to disk (i.e. hibernate) and start a new preemptible VM from that disk afterwards? My idea is:
Start a new preemptible Linux VM.
When the OS receives the preemption notice (the ACPI G2 Soft Off signal), trigger suspend-to-disk, i.e. hibernate the Linux OS.
Start a new preemptible Linux VM from the suspended image, i.e. restore the former VM and continue the computation.
How would I configure Linux to suspend/restore in this way?
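As a sketch of what I have in mind for step 2: inside the guest, I could watch the metadata server's preemption flag and react to it (the polling interval and the use of systemctl hibernate are my assumptions; the guest would also need a swap area large enough for hibernation):

```python
# Sketch: inside the guest, poll the GCE metadata server's preemption flag
# and trigger suspend-to-disk when the instance is about to be preempted.
import subprocess
import time
import urllib.request

METADATA_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                "instance/preempted")

def preempted():
    req = urllib.request.Request(METADATA_URL,
                                 headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.read().decode().strip() == "TRUE"

while True:
    if preempted():
        # Assumes hibernation is configured (swap, resume= kernel option).
        subprocess.run(["systemctl", "hibernate"], check=False)
        break
    time.sleep(2)
```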
I have an Arch Linux host which runs a virtualised router.
When using an LXC guest as the router, everything is fine: I get 100Mbit/s up/down and almost no CPU usage at all.
However, when I use a libvirt guest (pfSense, FreeBSD) as the router, whenever there is heavy network traffic going through the guest the CPU usage goes unreasonably high (up to 100%), but the worst thing is that the network throughput is halved! I get 45-49Mbit/s max.
The host doesn't support PCI passthrough, so this is my config for the libvirt VM:
Nic1 (wan)
Network source: Direct ‘eth0’
Source mode: passthrough
Device model: virtio
Nic2 (lan)
Bridge name: br0
Device model: virtio
I tried e1000 instead, but it changed absolutely nothing.
Host CPU: AMD A4-5000 Kabini
Guest CPU: default or Opteron_G3
It has been like this for over a year now, since I started using KVM. If I do not solve this problem, I will have to dump libvirt, because such performance is unacceptable.
It is pretty hard to diagnose this sort of problem with such limited information. Definitely don't use e1000 or any other emulated NIC model - virtio-net will offer the best performance of any virtualized NIC. Make sure the host has /dev/vhost-net available, as that accelerates guest NIC traffic in host kernel space.
If you want to use a guest as a high-performance network routing appliance, though, there are quite a few ways to tune the VM in general. Pinning the guest vCPUs to specific host physical CPUs, and keeping other guests off those CPUs, ensures the guest won't get its cache trashed by being pre-empted by other processes. Next, use huge pages for the guest RAM to massively increase the TLB hit rate for guest memory accesses. If the host has multiple NUMA nodes, make sure the guest CPUs and guest RAM (hugepages) are fixed to come from the same host NUMA node. Similarly, ensure IRQ handling for the host NIC used by the guest has its affinity set to match the pCPUs used by the guest.
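For the vCPU pinning part specifically, a minimal sketch with the libvirt Python bindings could look like this (the domain name and the CPU numbers are placeholders; hugepages, NUMA placement and IRQ affinity still have to be configured separately, e.g. in the domain XML and under /proc/irq/):

```python
# Sketch: pin each vCPU of a libvirt guest to a dedicated host CPU.
# Requires the libvirt-python bindings and a running libvirtd.
import libvirt

DOMAIN = "pfsense"          # hypothetical domain name
PINNING = {0: 2, 1: 3}      # vCPU -> host pCPU (example values)

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName(DOMAIN)

ncpus = conn.getInfo()[2]   # number of host CPUs
for vcpu, pcpu in PINNING.items():
    # cpumap is one boolean per host CPU; only the target pCPU is set.
    cpumap = tuple(i == pcpu for i in range(ncpus))
    dom.pinVcpu(vcpu, cpumap)

conn.close()
```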
Problem: we are implementing a video recording system on Windows Server 2012. In spite of low CPU and memory consumption, we face serious performance problems.
Short program description: the application (VS2005/C++) creates many network sockets, each receiving a multicast UDP video stream from an Ethernet network. For each stream, the application provides a receive buffer by calling WSARecvFrom() (an overlapped operation), waits in MsgWaitForMultipleObjects() for Windows' "data arrived" event, takes the data packet, and repeats it all in an endless loop. For testing, to ensure minimal CPU and memory consumption besides the pure socket I/O work, the application does nothing else, not even any disk/file I/O. The application process is configured to use all available cores on the machine (default affinity settings unchanged).
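In simplified terms, the per-stream loop does nothing more than this (a cross-platform Python analogue, not our actual VS2005/C++ code; group and port are made up):

```python
# Simplified analogue of the receive loop described above: join a multicast
# group and read datagrams in an endless loop, discarding the data.
import socket
import struct

GROUP, PORT = "239.1.1.1", 5000   # hypothetical multicast group and port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

while True:
    data, addr = sock.recvfrom(65536)   # blocking receive, one datagram
    # The data is intentionally not processed, as in the test setup.
```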
Tests run: the test was run on two different machines: a) a Windows 7 machine with 4 physical cores (8 with hyper-threading), and b) a Windows Server 2012 machine with 12 physical cores (24 with hyper-threading).
Both systems show the same problem: everything works fine up to a certain number of configured sockets / network streams. Increasing the number further (and we need to) eventually paralyses the Windows desktop (mouse pointer, repainting). At this stage the total CPU load is still very low (i.e. 10-15%) and there is plenty of free memory available. But the Task Manager shows an extremely one-sided CPU load: CPU 0 at nearly 100%, all other CPUs near 0%. Changing the Processor Affinity for the process in the Task Manager doesn't help.
Question 1: it looks like CPU 0 is doing the whole kernel's network I/O work. Is that likely?
Question 2: if yes, is there a way to control the kernel's use of the available CPUs? If yes, how?
Question 3: if no, is there any other way to make Windows distribute the (kernel) network I/O work across other CPUs (e.g. by installing multiple NIC cards, each NIC receiving only a subset of the network streams, and binding each NIC to another CPU)?
Most thankful for any hints from anybody out there.
I'm not a Windows server guy, but this sounds like an interrupt issue. This often happens in high throughput systems, especially real-time ones.
Background:
Simply speaking, for each packet your network interface generates an interrupt, informing the CPU that it needs to handle the newly arrived data. High throughput network cards (e.g. 10Gbps) that receive small packets can easily overwhelm the CPU with these interrupts.
Just to get a feel for the problem, let's do some math - if you saturate a 10G line with 100-byte packets, that means that (ideally) 12,500,000 packets are sent over the line each second. In reality, it's less due to overhead; say 10,000,000 packets per second (pps). Your 3GHz CPU generates 3,000,000,000 clock cycles per second, so it needs to handle a packet every 300 clock cycles. That's pretty hard for a general-purpose machine.
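The same back-of-the-envelope numbers, spelled out (a 10Gbit/s line, 100-byte packets and a 3GHz CPU, all taken from the example above):

```python
# Back-of-the-envelope: CPU cycles available per packet on a saturated link.
line_rate_bps = 10_000_000_000      # 10 Gbit/s
packet_bits = 100 * 8               # 100-byte packets
cpu_hz = 3_000_000_000              # 3 GHz

ideal_pps = line_rate_bps / packet_bits        # 12,500,000 pps
realistic_pps = 10_000_000                     # allow for overhead
cycles_per_packet = cpu_hz / realistic_pps     # 300 cycles

print(f"{ideal_pps:,.0f} pps ideal, ~{cycles_per_packet:.0f} cycles/packet")
```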
Now, I don't know the rate of packet arrival in your case, nor do I know your average packet length. But based on the symptoms you described, you might have run into this issue.
Solutions
Offload work to your card
Modern-day network cards, especially high-throughput ones, support all kinds of useful offloads, such as GRO, TOE, and others. These take some network-related work off the CPU (such as checksum calculation, packet fragmentation, etc.) and put it onto the network card, which carries dedicated hardware for performing it. Check out the offloads supported by your card. In Linux, offloading is managed using an application called ethtool. Since I never played with offloading in Windows, I can only point in the direction of the most relevant Windows article I found, but I can't offer any experience-based advice.
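On the Linux side, checking and toggling offloads with ethtool looks roughly like this (wrapped in Python only to stay consistent with the other sketches; eth0 and the GRO flag are just examples):

```python
# Sketch: inspect and toggle NIC offloads via ethtool on Linux (run as root).
import subprocess

IFACE = "eth0"   # example interface name

# Show the offload features currently enabled on the interface.
subprocess.run(["ethtool", "-k", IFACE], check=True)

# Example: enable generic receive offload (GRO) on that interface.
subprocess.run(["ethtool", "-K", IFACE, "gro", "on"], check=True)
```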
Use interrupt throttling.
Interrupt throttling is another ability of (some) network cards and their drivers which allows them to limit the number of interrupts your CPU receives, essentially interrupting the core once every few packets instead of once per packet.
Use a multi-queue network card, and set interrupt affinities.
Some network cards have multiple (packet) queues, and therefore multiple interrupt lines, one per queue. They split incoming traffic evenly between the queues using a hash function, creating (usually) 8 or 16 flows at 1/8 or 1/16 of the line rate. Each flow can be tied to a specific CPU core using interrupt affinity, and since the hash function is calculated over IPs and port numbers and is deterministic, each TCP/IP-level session will always be handled by the same core. In Linux, setting the affinity requires writing to /proc/irq/<interrupt number>/smp_affinity. In Windows, this seems to be the way.
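On Linux, the per-queue affinity is set by writing a CPU bitmask to the corresponding smp_affinity file; a minimal sketch (the IRQ numbers and CPU assignments are made up — take the real ones from /proc/interrupts):

```python
# Sketch: spread the RX queue interrupts of a multi-queue NIC across cores
# by writing CPU bitmasks to /proc/irq/<irq>/smp_affinity (run as root).
IRQ_TO_CPU = {40: 0, 41: 1, 42: 2, 43: 3}   # example IRQ -> CPU mapping

for irq, cpu in IRQ_TO_CPU.items():
    mask = 1 << cpu                          # one bit per CPU in the mask
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(f"{mask:x}\n")
```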