Linux KVM: setup considerations in relation to ESXi
My personal 2024 setup is simple and effective:
fast, Free and Open-Source
It’s not:
flawless, because it’s not well funded
or easy to learn, because it’s not well documented
And it’s not as fast as ESXi, specifically for Windows guests (this depends on the kind of workload):
Read the summary if you are evaluating this for Windows guests.
– I don’t believe VMware products will benefit from the 2023 Broadcom acquisition. Let’s learn from history.
Announcement: End of general availability of ESXi (free tier). Feb 13, 2024
Solution: move on.
The alternative:
Libvirt is control software which also manages Linux KVM guests (based on qemu). – A simple hypervisor (KVM) with simple tooling (qemu etc.). Many opportunities.
To name a few:
Windows 11 remote clients (TPM 2.0)
Ubuntu 22 LTS Server for self-hosting
Terabytes of Cloud Storage, managed via Shared Folders (Virtio)
And no: I wouldn’t run a data-center with this. Maybe with oVirt, but not with plain Libvirt Linux KVM.
And yes: you can run any system you want if it’s supported. But I have not tested that.
The setup in the following:
It’s a single-host setup. This is for small-scale scenarios.
1 Gbps, no LACP, no nothing
RAID 1, no HBA, no nothing
no vLANs, no vSwitch, no nothing
…
you get the point: small-scale
no VM at-rest encryption (on Hypervisor-management level because libvirt has a useless secrets handling process)
you can use LUKS in qcow2 images, but you need to store the key in cleartext
of course you can use Full-Disk Encryption on the guests
ToC
- 1 Compartmentalize the HW host into many virtual guests
- 1.1 Dedicated Server
- 2 Headless setup (remote): libvirt daemon
- 2.1 Dedicated server: KSM (Kernel Same-page Merging)
- 2.2 tuned - the right settings for the kernel, at the right time
- 2.3 Debian 12 has Transparent Huge Pages support
- 2.4 Firewall - iptables for IPv4 and IPv6
- 2.5 Guest IP enumeration
- 2.5.1 Dnsmasq leases
- 2.5.2 Bash and virsh
- 2.6 How to NAT out a KVM guest via iptables
- 2.7 Using vhost-net and bbr with KVM qemu guests for better speeds (1 Gbps)
- 3 Windows 11 guests, and Ubuntu Server
- 4 Summary
Compartmentalize the HW host into many virtual guests
In the following, I will only write about the interesting stuff.
Simple is better. NAT mode.
You can map guests out directly with a Zero Trust SASE architecture:
Or you use iptables for NATing, if you need higher throughput / reliability. I use both. This is a small-scale setup without an LB / big CDN cache etc.
Dedicated Server
A Hetzner Serverbörse (server auction) machine for around 50 bucks: 8 TB of cloud storage plus computation is dirt cheap in 2024.
HW | Spec | Purpose |
---|---|---|
CPU | i7 8700 | 6 cores / 12 threads for small-scale VMs |
RAM | 128 GB | split between VMs |
Disk | 8 TB RAID 1 | 2 TB base OSes, 6 TB data |
Net | IPv4 / IPv6 | routable IPs; NAT internal net to external IP:port |
Headless setup (remote): libvirt daemon
Install: SSH connect to the HW server (Debian 12 Bookworm):
apt install --no-install-recommends qemu-system libvirt-clients libvirt-daemon-system
apt install qemu-block-extra qemu-utils # qcow2 format support, block optional
apt install ovmf # UEFI
apt install swtpm swtpm-tools # TPM 2.0 emulation
apt install dnsmasq # default network 192.168.122.0/24, NAT via iptables
systemctl disable dnsmasq # libvirt only needs the util, not the service
# optionally
apt install irqbalance
apt install haveged
# apt install numad # on different HW
apt install apparmor-profiles
TPM emulation is for Windows 11.
TPM (2.0) is not a security feature. It’s a way to ensure that corporate systems with Full-Disk Encryption can be administered by IT. TPMs may provide reasonably secure at-rest encryption, but no in-transit encryption.
Given that the TPM transmits the BitLocker key in cleartext (via the swtpm service here, or via the bus on HW systems), it undermines the security of BitLocker. Use a different encryption, for example LVM on the host with dm-crypt LUKS (without TPM) or an encrypted qcow2 file backend.
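If you go the encrypted-qcow2 route, qemu-img can create LUKS-encrypted qcow2 images. A minimal sketch with placeholder file names; as criticized above, the passphrase ends up in a cleartext file:
# the passphrase sits in a plain file - exactly the secrets-handling weakness mentioned earlier
qemu-img create -f qcow2 \
  --object secret,id=sec0,file=/root/disk.pass \
  -o encrypt.format=luks,encrypt.key-secret=sec0 \
  /var/lib/libvirt/images/win11-encrypted.qcow2 100G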
Dedicated server: KSM (Kernel Same-page Merging)
Kernel Same-page Merge (KSM) is a process that allows guests to share identical memory pages. By sharing pages, the combined memory usage of the guest is reduced. The savings are especially increased when multiple guests are running similar base operating system images.
(Source)
Many Linux distributions do not enable KSM by default, even if the kernel feature is compiled in. The ksmtuned daemon can dynamically adjust the frequency of the deduplication algorithm.
sudo apt-get install ksmtuned --no-install-recommends
Or manually (older example, from around 2015):
root@mjo:/home/marius/scripts# grep -H '' /sys/kernel/mm/ksm/pages_*
/sys/kernel/mm/ksm/pages_shared:0
/sys/kernel/mm/ksm/pages_sharing:0
/sys/kernel/mm/ksm/pages_to_scan:100
/sys/kernel/mm/ksm/pages_unshared:0
/sys/kernel/mm/ksm/pages_volatile:0
root@mjo:/home/marius/scripts# cat /sys/kernel/mm/ksm/run
0
root@mjo:/home/marius/scripts# echo 1 > /sys/kernel/mm/ksm/run
root@mjo:/home/marius/scripts# grep -H '' /sys/kernel/mm/ksm/pages_*
/sys/kernel/mm/ksm/pages_shared:0
/sys/kernel/mm/ksm/pages_sharing:0
/sys/kernel/mm/ksm/pages_to_scan:100
/sys/kernel/mm/ksm/pages_unshared:0
/sys/kernel/mm/ksm/pages_volatile:15900
What exactly the kernel will do with my 15900 volatile pages depends on the kernel version. If you enable KSM like this, the value will grow first, and then at some point residual pages can be shared. Here are more details about this.
pages_shared: The number of unswappable kernel pages that KSM is using
pages_sharing: An indication of memory savings
pages_unshared: The number of unique pages repeatedly checked for merging
pages_volatile: The number of pages that are changing too often
The bottom line is: KSM can share pages between guests that do not change frequently. So if you check 15 minutes later, it may look like this:
Feb 13, 2024
KSM is efficient, especially if you base your VMs on templates.
VMs (shared memory) | RAM | Used (KSM) on the host |
---|---|---|
3 x Win 11 | 12 GB + 12 GB + 12 GB | Transparent Huge Pages support, KSM daemon, Debian 12 host, template machines |
6 x Ubuntu Server 22.04 LTS | 12 GB + 4 GB + 4 GB + 8 GB + 8 GB + 4 GB | |
Sum | 76 GB | 46.6 GB |
Roughly 25 GB saved here, about 30%.
tuned - the right settings for the kernel, at the right time
Tuned belongs to the standard Red Hat libvirt KVM stack. You should also add it to your guests, with the virtual-guest profile, to reduce the swappiness of database hosts.
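A minimal sketch (the profile names ship with tuned; pick what fits your workload):
apt install tuned --no-install-recommends
tuned-adm profile virtual-host    # on the KVM host
tuned-adm profile virtual-guest   # inside Linux guests
tuned-adm active                  # verify the active profile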
And in the crontab:
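My exact crontab is not reproduced here; a sketch of the idea, with assumed times and the powersave profile standing in for whatever low-power profile you prefer:
# root crontab (crontab -e)
0 23 * * * /usr/sbin/tuned-adm profile powersave
0 7 * * * /usr/sbin/tuned-adm profile virtual-host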
This way we can save power at night.
Debian 12 has Transparent Huge Pages support
We don’t need to adjust this:
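A quick check (the bracketed value is the active mode):
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag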
The kernel takes care of this automatically.
Firewall - iptables for IPv4 and IPv6
In some cases, dnsmasq (controlled by libvirt) may open DHCP and DNS listening sockets:
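The original socket listing is not reproduced here; you can check it like this:
ss -tulpen | grep dnsmasq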
On my host, this showed a DHCP listening socket bound to external interfaces. In case you also see a public DNS resolver, you may want to adjust your iptables rules for that as well.
You may want to send RST instead.
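A hedged sketch (eth0 as the external interface is an assumption; a TCP reset only applies to the TCP DNS socket, UDP is simply dropped):
# do not expose libvirt's dnsmasq on the external interface
iptables  -A INPUT -i eth0 -p udp --dport 67 -j DROP
iptables  -A INPUT -i eth0 -p udp --dport 53 -j DROP
iptables  -A INPUT -i eth0 -p tcp --dport 53 -j REJECT --reject-with tcp-reset
ip6tables -A INPUT -i eth0 -p tcp --dport 53 -j REJECT --reject-with tcp-reset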
If this is a perimeter-facing device, conntrack may be useful to you.
Guest IP enumeration
virsh isn’t able to show a list like this:
Guest hostname - Guest IP
PowerCLI can. For libvirt, you read that right. I haven’t used that in years, but it’s an option.
Dnsmasq leases
Assuming that you use DHCP, you will probably have dnsmasq on the host machine. Now, what are the IPs of the guest VMs? I want to SSH / RDP into my web server VM.
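My original script is not reproduced here; a minimal sketch that parses libvirt’s lease status file (requires jq; the path below is the usual location for the default network):
#!/bin/bash
# print "hostname - IP" for all current DHCP leases on virbr0
LEASES=/var/lib/libvirt/dnsmasq/virbr0.status
jq -r '.[] | "\(.hostname // "unknown") - \(."ip-address")"' "$LEASES"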
Result (example):
This assumes:
virbr0 is your bridge interface
you use dnsmasq on Debian 12 (paths can differ)
Bash and virsh
Same result, different script, just virsh.
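Again, not my original script; a sketch that relies only on virsh and the DHCP lease source:
#!/bin/bash
# print "guest - IP" for every running domain, based on DHCP lease information
for vm in $(virsh list --name); do
    ip=$(virsh domifaddr "$vm" --source lease | awk '/ipv4/ {print $4; exit}')
    echo "$vm - ${ip%/*}"
done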
How to NAT out a KVM guest via iptables
By default, my KVM guests get an internal IP (version 4). I use iptables NAT and port forwarding to map specific services to the external interface.
You can automate this:
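The original script is not shown here; a hedged sketch of the idea (interface name, guest IP and port are examples):
#!/bin/bash
# forward TCP 443 on the external interface to a web server guest behind virbr0
EXT_IF=eth0
GUEST_IP=192.168.122.10

# IP forwarding is not active by default
sysctl -w net.ipv4.ip_forward=1

# DNAT incoming traffic to the guest and allow it through the FORWARD chain (stateful)
iptables -t nat -A PREROUTING -i "$EXT_IF" -p tcp --dport 443 -j DNAT --to-destination "$GUEST_IP:443"
iptables -A FORWARD -d "$GUEST_IP" -p tcp --dport 443 -m conntrack --ctstate NEW,ESTABLISHED,RELATED -j ACCEPT

# let guests initiate outbound connections
iptables -t nat -A POSTROUTING -s 192.168.122.0/24 ! -d 192.168.122.0/24 -o "$EXT_IF" -j MASQUERADE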
I took a look at Netfilter nftables, but the core problem is the same: the organization of NAT, prerouting, filter and other chains is a mess. PF is a much better system.
This assumes:
clean NAT tables
virbr0 being used for this guest
IP forwarding is not active by default on your distribution
you want to use stateful iptables rules, but not everywhere
guests will initiate connections (you probably don’t need to do this)
Add whatever services you like.
Using vhost-net and bbr with KVM qemu guests for better speeds (1 Gbps)
I have to use Virtio and NAT because of the Hetzner networks. You’d need to be careful with the vSwitch topology to avoid leaking virtual MAC addresses.
With a VMware vSwitch you can use two Layer 2 bridges, but I wanted to get rid of this complexity. In theory you can achieve the same with Open vSwitch. But why bother, if you only have a single NIC and a 1 Gbps uplink. No LACP, no VLANs etc.
For reference:
With the following setup, I was able to push the RX throughput from 40 MB/s to approx. 114 MB/s. That’s a significant improvement.
The bbr TCP congestion control is available in the default Debian 12 kernel (the stock default algorithm is cubic).
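To switch to bbr, together with the fq qdisc it pairs with, a sketch:
cat >/etc/sysctl.d/90-bbr.conf <<'EOF'
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
EOF
sysctl --system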
vhost-net is available as well. It allows offloading some networking to the host.
Enable vhost-net by default (host system kernel, Debian 12 Bookworm, tested):
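A minimal sketch:
modprobe vhost_net                # load now
echo vhost_net >> /etc/modules    # load at boot
lsmod | grep vhost                # verify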
Apply these tweaks to the host system (also Debian 12 of course):
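My exact tweaks are not reproduced here; as a generic, hedged example of host-side network buffer tuning (values are illustrative, not a recommendation):
cat >/etc/sysctl.d/90-net-buffers.conf <<'EOF'
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
EOF
sysctl --system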
Adjust the Virtio network interface of the respective guest that is going to handle the high throughput:
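A sketch of the interface definition (via virsh edit on the guest; the queue count follows the vCPU count, see the note below):
<interface type='network'>
  <source network='default'/>
  <model type='virtio'/>
  <driver name='vhost' queues='4'/>
</interface>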
Note: None of these values are secret, because knowing them doesn’t add any risk. This is standard Linux admin stuff: tracking bottlenecks, looking for a fix. Performance engineering.
4 vhost-net queues for 4 vCPUs (first formula: queues = vCPUs, here 4 - 4)
Windows 11 guests, and Ubuntu Server
Finding documentation… man. Here and there. Hidden like easter eggs all over the internet.
TPM 2.0 - Windows 11
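For reference, a sketch of the emulated TPM 2.0 device in the guest definition (add via virt-manager or virsh edit; assumes swtpm is installed as shown above):
<tpm model='tpm-crb'>
  <backend type='emulator' version='2.0'/>
</tpm>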
If you get permissions problems during the boot (I did):
You can start as many Windows 11 systems as you want. You can clone them, snapshot them etc.
Hyper-V (nested virtualization) - Windows 10 / 11 - EXPERIMENTAL
On the i7 8700 you must configure KVM to ignore unhandled Model Specific Registers (MSRs):
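Only the relevant option is sketched here; my other module parameters are tuning and not reproduced:
# /etc/modprobe.d/kvm.conf
options kvm ignore_msrs=1 report_ignored_msrs=0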
The other parameters are from my tuning. ignore_msrs matters here.
Otherwise, you will get into boot loops and Blue Screen “thread exception” errors, which are very difficult to trace. Obviously, because MSRs are involved.
You can also set this at runtime (in the KVM kernel module on the host, Debian 12):
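A sketch:
echo 1 > /sys/module/kvm/parameters/ignore_msrs
cat /sys/module/kvm/parameters/ignore_msrs    # verify: Y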
I tested the following Hyper-V Enlightenments (QEMU feature) ( Feb 14, 2024 ):
XML for virt-manager / virsh:
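The exact set I tested is not reproduced here; a commonly used sketch (not necessarily the full list), to be merged into the features element via virsh edit:
<features>
  <acpi/>
  <apic/>
  <hyperv>
    <relaxed state='on'/>
    <vapic state='on'/>
    <spinlocks state='on' retries='8191'/>
    <vpindex state='on'/>
    <synic state='on'/>
    <stimer state='on'/>
  </hyperv>
</features>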
You can use shared memory with nested virtualization:
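A sketch of the shared memory backing (the same setting virtiofs needs):
<memoryBacking>
  <source type='memfd'/>
  <access mode='shared'/>
</memoryBacking>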
This way, KSM can be used (via the daemon).
Pass an RNG
I pass an RNG device, which will be used by Windows automatically.
/dev/urandom is cryptographically secure as far as I know ( Feb 18, 2024 )
the host system uses haveged
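A sketch of the device definition:
<rng model='virtio'>
  <backend model='random'>/dev/urandom</backend>
</rng>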
VirtIO for disk and network - all guests, Windows download
Install the VirtIO drivers for Windows:
– the virtio-win ISO is provided by the Fedora project. Performance, drivers etc.
On Linux, Virtio is part of the default kernel config for most distributions.
Bugfix: Bluescreen on Windows 10 / 11 when Virtio is the Boot Disk
Windows 10 / 11 / Server often do not load the Virtio drivers during boot, which can result in Blue Screens:
Bug fix Windows 11:
Initially keep SATA for the boot disk (use discard=unmap if you use qcow2)
Install the Virtio drivers
Run bcdedit /set "{current}" safeboot minimal in a Terminal as Admin
Attach a Virtio dummy disk
Reboot into Safe Mode, log on from the Spice QXL console
Optional: check with Disk Manager whether the dummy disk got detected
Run bcdedit /deletevalue "{current}" safeboot in a Terminal as Admin
Shutdown
Set up the boot disk as Virtio
Result: You can use Virtio for the boot disk.
→ Obviously this sucks.
fstab for virtiofs on Linux
You can directly share data between the host and the guests with virtiofs (this is a single-host setup). Otherwise, set up an NFS server or something like that.
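A sketch of the guest-side /etc/fstab entry (the tag “hostshare” corresponds to whatever target dir you configured for the virtiofs filesystem device):
# /etc/fstab inside the Linux guest
hostshare  /mnt/hostshare  virtiofs  defaults,nofail  0  0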
There is something called “dax mode”, which I was unable to enable. It’s supposed to increase the performance somehow, but there is no documentation that’s worth mentioning. That’s a general problem with Virtio.
Virtio 3d
Delete the existing Display hardware and copy-paste the replacement XML into virt-manager. The guest has to be offline.
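A sketch of a video device with 3D acceleration (a SPICE or egl-headless graphics device with OpenGL is also required; that part is omitted here):
<video>
  <model type='virtio' heads='1' primary='yes'>
    <acceleration accel3d='yes'/>
  </model>
</video>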
Qemu-agent
- part of the Virtio Windows installer, separate package on Linux.
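For Linux guests, a sketch:
apt install qemu-guest-agent
systemctl enable --now qemu-guest-agent
# verify from the host (guest name is an example)
virsh qemu-agent-command ubuntu-srv '{"execute":"guest-ping"}'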
Snapshots - NOT for UEFI guests
I also believe that you cannot snapshot systems where you use CPU host-passthrough to enable nested virtualization. Not confirmed.
You can use disk snapshots. Untested.
It is possible to create full snapshots for BIOS guests:
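A sketch (the guest name is an example; these are internal snapshots, which is why UEFI/OVMF guests are excluded):
virsh snapshot-create-as ubuntu-srv before-upgrade "state before dist-upgrade"
virsh snapshot-list ubuntu-srv
virsh snapshot-revert ubuntu-srv before-upgrade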
Summary
In comparison to a standalone ESXi:
The main advantages of KVM qemu
less overhead
much higher resource efficiency
much easier download (VMware customer portals are convoluted)
far easier updates
no web management interface with tons of software vulnerabilities
defined command-line tools like virsh
manageable firewall rules (iptables is easier than esxcli firewall rules)
The main problems of KVM qemu
bad documentation
example: the Virtio dax feature has no proper documentation
limitations in snapshot creation and management
no full snapshots for UEFI guests
heterogeneous distribution of Windows utilities: Spice, Virtio, TPM emulator … across Red Hat, Fedora etc.
Windows 11 / 10 are second class citizens here, same for Windows Server obviously
Windows performance (in some workload scenarios)
https://because-security.atlassian.net/wiki/spaces/Linix/pages/54820923
lack of proper unlocking of encrypted guest images or configs
bad secrets management in libvirt
one global setting (in the kvm_intel or kvm_amd kernel module) enables nested virtualization for all guests
Windows needs this for security, Linux doesn’t need this (also not for Docker)
no AppArmor (Mandatory Access Control) for qemu on Debian 12 Bookworm
generally KVM qemu is not hardened, audited or tested for security. Cloud vendors like Google Cloud use forks.
bad network performance with NAT mode (bridge), unless you use vhost_net
So far, so good. Balanced downgrade. Don’t use this at work.