Linux KVM: setup considerations in relation to ESXi

 

https://www.because-security.com/blog/the-end-of-free-tier-esxi

 

My personal 2024 setup is simple and effective:

  • fast, free and open source

It’s not:

  • flawless, because it’s not well funded

  • or easy to learn, because it’s not well documented

 

And it’s not as fast as ESXi, specifically for Windows guests (it depends on the kind of workload).

 

Read the summary if you are evaluating this for Windows guests.

 

– I don’t believe VMware products will benefit from the 2023 Broadcom acquisition. Let’s learn from history.

Announcement: End of general availability of ESXi (free tier). Feb 13, 2024

Solution: move on.

 

The alternative:

Libvirt is management software which also supports Linux KVM guests (based on QEMU). – A simple hypervisor (KVM) with simple tooling (QEMU etc.). Many opportunities.

To name a few:

  • Windows 11 remote clients (TPM 2.0)

  • Ubuntu 22 LTS Server for self-hosting

  • Terabytes of Cloud Storage, managed via Shared Folders (Virtio)

And no: I wouldn’t run a data-center with this. Maybe with oVirt, but not with plain Libvirt Linux KVM.
And yes: you can run any system you want if it’s supported. But I have not tested that.

 

The setup in the following:

  • It’s a single-host setup. This is for small-scale scenarios.

    • 1 Gbps, no LACP, no nothing

    • RAID 1, no HBA, no nothing

    • no VLANs, no vSwitch, no nothing

    • you get the point: small-scale

    • no VM at-rest encryption on the hypervisor-management level (libvirt’s secrets handling process is too clumsy to be useful here)

      • you can use LUKS in qcow2 images, but you need to store the key in cleartext

      • of course you can use Full-Disk Encryption on the guests

 


Compartmentalize the HW host into many virtual guests

In the following, I will only write about the interesting stuff.

Simple is better: NAT mode.

 

You can map guests out directly with a Zero Trust SASE architecture.

 

Or you use iptables for NATing if you need higher throughput / reliability. I use both. This is a small-scale setup without a load balancer, a big CDN cache, etc.

Dedicated Server

 

A Hetzner Serverbörse machine for roughly 50 bucks a month: 8 TB of cloud storage plus computation is dirt cheap in 2024.

HW   | Spec         | Purpose
CPU  | i7 8700      | 12 threads (6 cores) for small-scale VMs
RAM  | 128 GB       | split between VMs
Disk | 8 TB, RAID 1 | 2 TB - base OSes, 6 TB - data
Net  | IPv4, IPv6   | routeable IPs, NAT internal net to external IP:port

 

Headless setup (remote): libvirt daemon

Install: SSH connect to the HW server (Debian 12 Bookworm):

apt install --no-install-recommends qemu-system libvirt-clients libvirt-daemon-system
apt install qemu-block-extra qemu-utils   # qcow2 format support, block optional
apt install ovmf                          # UEFI
apt install swtpm-tools swtpm-utils       # TPM 2.0
apt install dnsmasq                       # default network 192.168.122.0/24, NAT via iptables
systemctl disable dnsmasq                 # libvirt only needs the util, not the service

# optionally
apt install irqbalance
apt install haveged
# apt install numad                       # on different HW
apt install apparmor-profiles

TPM emulation is for Windows 11.

TPM (2.0) is not a security feature per se. It’s a way to ensure that corporate systems with Full-Disk Encryption can be administered by IT. TPMs may provide reasonably secure at-rest encryption, but no in-transit encryption.

Given that the TPM transmits the BitLocker key in cleartext (via the swtpm service here, or via the bus on hardware systems), it undermines the security of BitLocker. Use a different kind of encryption, for example LVM on the host with dm-crypt LUKS (without TPM) or an encrypted qcow2 file backend.
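For the qcow2 route, here is a minimal sketch (path, passphrase file and size are placeholders): qemu-img can create a LUKS-encrypted qcow2 image whose passphrase is read from a file on the host — which is exactly the cleartext-key caveat mentioned above.

# create a LUKS-encrypted qcow2 image; the passphrase file sits on the host in cleartext
qemu-img create -f qcow2 \
  --object secret,id=sec0,file=/root/guest-disk.pass \
  -o encrypt.format=luks,encrypt.key-secret=sec0 \
  /var/lib/libvirt/images/guest.qcow2 50G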

 

Dedicated server: KSM (Kernel Same-page Merging)

Kernel Same-page Merging (KSM) is a process that allows guests to share identical memory pages. By sharing pages, the combined memory usage of the guests is reduced. The savings are especially large when multiple guests run similar base operating system images.

(Source)

Many Linux distributions do not enable KSM by default, even if the kernel feature is compiled in. The ksmtuned daemon can dynamically adjust the frequency of the deduplication scans.

sudo apt-get install ksmtuned --no-install-recommends

Or manually (older example, from around 2015):

root@mjo:/home/marius/scripts# grep -H '' /sys/kernel/mm/ksm/pages_*
/sys/kernel/mm/ksm/pages_shared:0
/sys/kernel/mm/ksm/pages_sharing:0
/sys/kernel/mm/ksm/pages_to_scan:100
/sys/kernel/mm/ksm/pages_unshared:0
/sys/kernel/mm/ksm/pages_volatile:0
root@mjo:/home/marius/scripts# cat /sys/kernel/mm/ksm/run
0
root@mjo:/home/marius/scripts# echo 1 > /sys/kernel/mm/ksm/run
root@mjo:/home/marius/scripts# grep -H '' /sys/kernel/mm/ksm/pages_*
/sys/kernel/mm/ksm/pages_shared:0
/sys/kernel/mm/ksm/pages_sharing:0
/sys/kernel/mm/ksm/pages_to_scan:100
/sys/kernel/mm/ksm/pages_unshared:0
/sys/kernel/mm/ksm/pages_volatile:15900

What exactly the kernel will do with my 15900 volatile pages depends on the kernel version. If you set it like this, the value will grow first, and then at some point residual pages can be shared. The kernel’s KSM documentation has more details about this.

pages_shared: The number of unswappable kernel pages that KSM is using
pages_sharing: An indication of memory savings
pages_unshared: The number of unique pages repeatedly checked for merging
pages_volatile: The number of pages that are changing too often
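As a rough sanity check (a sketch; 4 KiB pages assumed, as on x86-64), the kernel documentation treats pages_sharing as the number of additional deduplicated pages, i.e. roughly the memory saved:

# every entry in pages_sharing is one additional 4 KiB page that is deduplicated
awk '{ printf "KSM currently saves about %.1f MiB\n", $1 * 4096 / 1048576 }' /sys/kernel/mm/ksm/pages_sharing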

The bottom line is: KSM can share pages between guests that do not change frequently. So if you check 15 minutes later, it may look like this:

Feb 13, 2024

 

KSM is efficient, especially if you base your VMs on templates.

VMs (shared memory)                    | RAM                                      | Used (KSM) on the host
3 x Win 11 (Template)                  | 12 GB + 12 GB + 12 GB                    |
6 x Ubuntu Server 22.04 LTS (Template) | 12 GB + 4 GB + 4 GB + 8 GB + 8 GB + 4 GB |
Sum                                    | 76 GB                                    | 46.6 GB

Host side: Debian 12, KSM daemon, Transparent Huge Pages support, template machines.

That saves roughly 25 GB, about 30 %, here.

 

 

tuned - the right settings for the kernel, at the right time

Tuned belongs to the standard Red Hat libvirt KVM stack. You should also add it to your guests, with the guest profile, to avoid swappiness issues on database hosts.
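A minimal sketch of what that looks like on Debian 12 (the profile names are the stock tuned profiles):

apt install --no-install-recommends tuned
tuned-adm profile virtual-host    # on the KVM host
tuned-adm profile virtual-guest   # inside the Linux guests (lowers vm.swappiness, among other things)
tuned-adm active                  # verify the active profile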

 

And in the crontab:
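The exact schedule is not shown here; as an assumption, a root crontab that switches profiles could look like this:

# switch to powersave at night, back to virtual-host in the morning (root crontab)
0 23 * * * /usr/sbin/tuned-adm profile powersave
0 7 * * *  /usr/sbin/tuned-adm profile virtual-host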

This way we can save power at night.

 

Debian 12 has Transparent Huge Pages support

We don’t need to adjust this:
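To verify (the mode in brackets is the active one):

cat /sys/kernel/mm/transparent_hugepage/enabled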

The kernel takes care of this automatically.

 

Firewall - iptables for IPv4 and IPv6

In some cases, dnsmasq (controlled by libvirt) may open DHCP and DNS listening sockets:
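A quick way to check which sockets dnsmasq has open (assuming it runs as the libvirt helper):

ss -ulnp | grep dnsmasq    # UDP: DHCP (67) and DNS (53)
ss -tlnp | grep dnsmasq    # TCP: DNS (53)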

If you see a DHCP listening socket bound to external interfaces, or a public DNS resolver, you may want to adjust your iptables rules accordingly.

You may want to send RST instead.

If this is a perimeter-facing device, conntrack may be useful to you.
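A hedged sketch covering both points, assuming eth0 is the external interface and virbr0 is the only network dnsmasq should serve (adjust names and policy to your setup):

# stateful shortcut, typically placed early in the chain
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# drop DHCP and DNS on the external interface; libvirt's dnsmasq should only serve virbr0
iptables -A INPUT -i eth0 -p udp --dport 67 -j DROP
iptables -A INPUT -i eth0 -p udp --dport 53 -j DROP
# or answer with a TCP RST instead of silently dropping (TCP only)
iptables -A INPUT -i eth0 -p tcp --dport 53 -j REJECT --reject-with tcp-reset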

Guest IP enumeration

virsh isn’t able to show a list like this:

Guest hostname - Guest IP

PowerCLI can. For libvirt, you read that right. I haven’t used that in years, but it’s an option.

 

Dnsmasq leases

Assuming that you use DHCP, you will probably have dnsmasq on the host machine. Now what are the IPs of the guest VMs? I want to SSH / RDP into my web server VM.
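Here is a sketch that reads the lease state libvirt’s dnsmasq writes for the default network. The path is the usual Debian location; treat it as an assumption and adjust if yours differs.

# hostname and IPv4 address of every active lease on the default network
STATUS=/var/lib/libvirt/dnsmasq/virbr0.status   # JSON file written by libvirt's dnsmasq helper
grep -E '"(hostname|ip-address)"' "$STATUS" | sed 's/[",]//g'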

The result is a simple list: guest hostname and guest IP.

This assumes:

  • virbr0 is your bridge interface

  • you use dnsmasq on Debian 12 (paths can differ)

 

Bash and virsh

Same result, different script, just virsh.
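A sketch that only uses virsh (the lease source relies on the libvirt DHCP server; --source agent works too if the guest agent is installed):

# print the addresses libvirt knows for every running guest
for dom in $(virsh list --name); do
    echo "== ${dom}"
    virsh domifaddr "${dom}" --source lease
done
# alternatively, per network:
# virsh net-dhcp-leases default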

 

How to NAT out a KVM guest via iptables

By default, my KVM guests get an internal IPv4 address. I use iptables NAT and port forwarding to map the specific service to the external interface.

You can automate this (a sketch follows after the list of assumptions below):

  • I took a look at Netfilter nftables but the core problem is the same: the organization of NAT, prerouting, filter and other chains is a mess. Pf is a much better system.

 

This assumes:

  • clean NAT tables

  • virbr0 being used for this guest


  • IP forwarding is not active by default on your distribution

  • you want to use stateful iptables rules, but not everywhere

  • guests will initiate connections (you probably don’t need to do this)

Add whatever services you like.
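Under those assumptions, a minimal sketch for one service. The addresses and port are placeholders: 203.0.113.10 stands for the external IP, 192.168.122.10 for the guest.

# enable IPv4 forwarding (persist it in /etc/sysctl.d/ once it works for you)
sysctl -w net.ipv4.ip_forward=1

# forward external port 443 to the guest
iptables -t nat -A PREROUTING -d 203.0.113.10 -p tcp --dport 443 \
  -j DNAT --to-destination 192.168.122.10:443

# allow the forwarded traffic; -I puts it before libvirt's own FORWARD rules
iptables -I FORWARD 1 -o virbr0 -d 192.168.122.10 -p tcp --dport 443 \
  -m conntrack --ctstate NEW,ESTABLISHED,RELATED -j ACCEPT
iptables -I FORWARD 2 -i virbr0 -s 192.168.122.10 \
  -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT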

 

Using vhost-net and bbr with KVM qemu guests for better speeds (1 Gbps)

I have to use Virtio and NAT because of the Hetzner networks. You’d need to be careful with the vSwitch topology to avoid leaking virtual MAC addresses.

 

With a VMware vSwitch you can use two Layer 2 bridges, but I wanted to get rid of this complexity. In theory you can achieve the same with Open vSwitch. But why bother if you only have a single NIC and a 1 Gbps uplink. No LACP, no VLANs, etc.

For reference:

 

 

With the following setup, I was able to push the RX throughput from 40 MB/s to approx. 114 MB/s. That’s a significant improvement.

916 Mbit/s RX (KVM guest activity); btop shows the “Top” value. vhost_net accelerates the Virtio paravirtualized network drivers.
  • The bbr TCP congestion control is available in the stock Debian 12 kernel (the default is still cubic).

  • vhost-net is available as well. It allows offloading some networking to the host.

vhost-net reference documentation by Red Hat

Enable vhost-net by default (host system kernel, Debian 12 Bookworm, tested)

Apply these tweaks to the host system (also Debian 12 of course):
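The exact tweaks are not reproduced above; as an assumption, a common baseline is to load vhost_net persistently and to switch to the bbr congestion control with the fq qdisc:

# load vhost_net now and on every boot
modprobe vhost_net
echo vhost_net > /etc/modules-load.d/vhost_net.conf

# bbr congestion control with the fq queueing discipline
cat > /etc/sysctl.d/90-bbr.conf <<'EOF'
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
EOF
sysctl --system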

Adjust the Virtio network interface of the respective guest, which is going to have a high throughput:
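A sketch of the interface definition (edit with virsh edit on the guest; the network name and the queue count are assumptions, see the queue note below):

<interface type='network'>
  <source network='default'/>
  <model type='virtio'/>
  <!-- vhost backend with multiqueue; one queue per vCPU is a common starting point -->
  <driver name='vhost' queues='4'/>
</interface>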

Note: none of these values are secret; knowing them doesn’t add any risk. This is standard Linux admin work: tracking bottlenecks and looking for a fix. Performance engineering.

  • 4 vhost-net queues for 4 vCPUs (rule of thumb: one queue per vCPU)

Windows 11 guests, and Ubuntu Server

Finding documentation… man. Here and there. Hidden like easter eggs all over the internet.

TPM 2.0 - Windows 11

If you get permissions problems during the boot (I did):

You can start as many Windows 11 systems as you want. You can clone them, snapshot them, etc.
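The emulated TPM ends up as a single device element in the domain XML; a minimal sketch (tpm-crb model, swtpm emulator backend from the packages installed earlier):

<tpm model='tpm-crb'>
  <backend type='emulator' version='2.0'/>
</tpm>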

Hyper-V (nested virtualization) - Windows 10 / 11 - EXPERIMENTAL

 

On the i7 8700 you must tell KVM to ignore unhandled Model Specific Register (MSR) accesses:
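The author’s full parameter list is not preserved here; as an assumption, the persistent module options could look like this, with ignore_msrs being the relevant one:

# /etc/modprobe.d/kvm.conf
options kvm ignore_msrs=1 report_ignored_msrs=0
options kvm_intel nested=1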

The other parameters are optional tuning; ignore_msrs is the one that matters here.

 

Otherwise, you will get boot loops and Blue Screens with “thread exception” errors, which are very difficult to trace. Obviously, because MSRs are involved.

You can set ignore_msrs at runtime (in the KVM kernel module on the host, Debian 12):
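The module parameter can be flipped via sysfs:

echo 1 > /sys/module/kvm/parameters/ignore_msrs
cat /sys/module/kvm/parameters/ignore_msrs    # verify: Y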

 

I tested the following Hyper-V Enlightenments (QEMU feature) ( Feb 14, 2024 ):

XML for virt-manager / virsh:
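The exact set the author tested is not preserved here; a commonly used set of enlightenments, as an assumption, looks like this in the domain XML:

<features>
  <acpi/>
  <apic/>
  <hyperv>
    <relaxed state='on'/>
    <vapic state='on'/>
    <spinlocks state='on' retries='8191'/>
    <vpindex state='on'/>
    <synic state='on'/>
    <stimer state='on'/>
    <frequencies state='on'/>
  </hyperv>
</features>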

 

 

You can use shared memory with nested virtualization:

This way, KSM can be used (via the daemon).

 

 

Pass an RNG

I pass an RNG device, which Windows will use automatically; see the sketch after this list.

  • /dev/urandom is cryptographically secure as far as I know ( Feb 18, 2024 )

  • the host system uses haveged
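A sketch of the device, with the host’s /dev/urandom as backend (matching the note above):

<rng model='virtio'>
  <backend model='random'>/dev/urandom</backend>
</rng>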

VirtIO for disk and network - all guests, Windows download

Install VirtIO for Windows:

– the virtio-win driver ISO is provided by the Fedora project. Performance drivers, the guest agent, etc.

  • On Linux, Virtio is part of the default kernel config for most distributions.

 

Bugfix: Bluescreen on Windows 10 / 11 when Virtio is the Boot Disk

Windows 10 / 11 / Server often do not load the Virtio drivers during boot, which can result in Blue Screens:

Bug fix Windows 11:

  1. initially keep SATA for boot disk, use unwrap if you use qcow2

  2. Install Virtio drivers

  3. bcdedit /set "{current}" safeboot minimal on a Terminal as Admin

  4. Attach a Virtio dummy disk

  5. Reboot into Safe Mode, logon from Spice QXL console

  6. Optional: check with Disk Manager whether the dummy disk got detected

  7. bcdedit /deletevalue "{current}" safeboot on Terminal as Admin

  8. Shutdown

  9. Setup the Boot disk as Virtio

Result: You can use Virtio for the boot disk.

→ Obviously this sucks.

 

fstab for virtiofs on Linux

You can directly share the data to the host (this is a single host setup). Otherwise, set up an NFS server or something like that.
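A sketch, assuming /srv/share on the host and the tag “share” (virtiofs also needs shared memory backing in the domain XML):

<!-- domain XML: export a host directory via virtiofs -->
<filesystem type='mount' accessmode='passthrough'>
  <driver type='virtiofs'/>
  <source dir='/srv/share'/>
  <target dir='share'/>
</filesystem>

And in the guest’s /etc/fstab (the first field is the tag from the target element):

share  /mnt/share  virtiofs  defaults  0  0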

There is something called “DAX mode”, which I was unable to enable. It’s supposed to increase performance, but there is no documentation worth mentioning. That’s a general problem with Virtio.

 

Virtio 3d

Delete and copy-paste into virt-manager for the Display HW. Guest has to be offline.
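A sketch of the video model with 3D acceleration; SPICE needs OpenGL enabled as well, and whether it actually works depends on the host GPU stack:

<video>
  <model type='virtio' heads='1' primary='yes'>
    <acceleration accel3d='yes'/>
  </model>
</video>
<graphics type='spice'>
  <listen type='none'/>
  <gl enable='yes'/>
</graphics>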

 

Qemu-agent

- part of the Virtio Windows installer; on Linux it is a separate package (qemu-guest-agent).
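The guest needs the agent, the domain needs a virtio channel for it (a sketch):

<!-- domain XML -->
<channel type='unix'>
  <target type='virtio' name='org.qemu.guest_agent.0'/>
</channel>

apt install qemu-guest-agent   # inside Linux guests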

Snapshots - NOT for UEFI guests

 

I also believe that you cannot snapshot systems where you use CPU host-passthrough to enable nested virtualization. Not confirmed.

You can use disk snapshots. Untested.

 

It is possible to create full snapshots for BIOS guests:
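For a BIOS guest with qcow2 disks, a sketch with virsh (the guest name is a placeholder; the snapshot includes RAM state when the guest is running):

virsh snapshot-create-as win10-bios snap1 --description "before updates"
virsh snapshot-list win10-bios
virsh snapshot-revert win10-bios snap1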

 

 

Summary

In comparison to a standalone ESXi:

 

The main advantages of KVM qemu

  • less overhead

  • much higher resource efficiency

  • much easier download (VMware customer portals are convoluted)

  • far easier updates

  • no web management interface with tons of software vulnerabilities

  • defined command-line tools like virsh

  • Manageable firewall rules (iptables are easier than esxcli firewall rules)

 

The main problems of KVM qemu

 

So far, so good. Balanced downgrade. Don’t use this at work.