Migrating a Windows VM from VirtualBox to libvirt/KVM

For building Windows binaries on our CI infrastructure, we are using VirtualBox VMs on an oldish Linux host.

While this works, building takes forever...

Our build fleet consists of a couple of machines:

  • several Linux builders (based on Docker) on various hosts (the fastest using an AMD Ryzen Threadripper PRO 5965WX with 128GB of RAM, the slowest using an Intel i7-870 and 8GB of RAM)
  • a single Windows builder (based on VirtualBox, running on the slowest Linux builder).
  • a single macOS builder (based on tart) running on a Mac Studio 13,1.

Building Pd on the Apple runner takes about 1½ minutes (fat binaries for amd64 and arm64); on the slowest Linux runner it takes 4 minutes. On the Windows runner (which uses the same hardware as the Linux runner, albeit in a VM), it takes... 20 minutes.

A typical pipeline for Pd has 4 Linux jobs, 4 Apple jobs, and 7 Windows jobs (on Windows we build for both i386 and amd64, and the build runs in different environments and hence jobs). It's of course rather unfortunate that there are so many Windows jobs running on a single slow runner. Anyhow, a typical pipeline takes about 2 hours to complete. Which is not fun, especially before releases, when multiple branches are updated often.

My gut feeling tells me that there are two issues:

  • the host that runs the Windows VMs is outdated and slow
  • VirtualBox is not the most efficient virtualization solution.

So the task was to get the CI running on a new host (for now that would be my desktop machine, which features an Intel i7-13700K with 64GB of RAM) using KVM/libvirt (which I'm already using for our virtualized servers).

Requirements

We are using GitLab-CI as the software to drive the iem-ci. All builds need to happen in an isolated, reproducible environment (so the build does not depend on some grown and undocumented knowledge embedded in the build machine).

The setup we have been using for the last few years is like this:

  • A master VM contains a recent snapshot of Windows 10 (the golden image), with some build tools pre-installed. These build tools include:
    • MSYS2 (MinGW64-i686, MinGW64-x86_64) with pacman
    • chocolatey
    • VisualStudio
    • gitlab-runner
  • Whenever a new job is scheduled, the master VM gets cloned to a new throwaway VM
  • The throwaway VM is started and runs the build job
  • Once the build job completes, the throwaway VM is destroyed

Now the master VM image is rather large (~60 GB), so doing a full clone (which involves copying the entire hard disk) is not really feasible (as it takes >10 minutes). Instead, we want to use a "thin clone" with copy-on-write (COW) storage, so the VM only needs to store the additional data (as a difference image with respect to the golden image).
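With qcow2 (more on the conversion below), such a thin clone is simply an overlay image that references the golden image as its backing file. A minimal sketch, assuming the golden image lives under /var/lib/libvirt/images (the paths are made up for illustration):

# create a throwaway overlay on top of the (read-only) golden image;
# only the blocks written by the clone end up in the new file
qemu-img create -f qcow2 \
    -b /var/lib/libvirt/images/W10.qcow2 -F qcow2 \
    /var/lib/libvirt/images/throwaway.qcow2

How the throwaway VMs are actually created from such overlays is the topic of a follow-up post.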

We used to clone our throwaway VMs from a live snapshot of the master VM (in order to reduce startup time), but this gave us a bit of trouble in the past, as the system clock in the VM would need to be synchronized (the golden image is only updated every year or so). So we switched to doing full boots of the VMs about 6 months ago.

Migration from VirtualBox to libvirt/KVM

Exporting from VirtualBox

In order to migrate away from VirtualBox, the first thing required is to extract the last golden image. With VirtualBox, we use snapshots a lot (allowing us to do upgrades of the machine that can be easily rolled back).

So the first step is to just create a full disk clone from the last (used) snapshot, which I did by creating a VM clone via the VirtualBox GUI (using the "Current State" option). This took a while, but eventually I got my 57GB W10 Clone-disk1.vmdk.
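For the record, the same full clone could probably also be created from the command line; a hedged sketch (the VM name "W10" is an assumption):

# create a full clone of the VM's current state and register it;
# the cloned disk ends up as "W10 Clone-disk1.vmdk" in the new VM's folder
VBoxManage clonevm "W10" --name "W10 Clone" --register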

Once done, I copied the .vmdk file to the new host.

Importing to libvirt/KVM

In order to allow thin cloning of the master VM, the disk ought to be in some format that allows COW. For libvirt/KVM, the natural choice is the qcow2 format. In the Proxmox VE wiki I read that qcow2 is not a very good format for Windows VMs. Since we are not running a Microsoft SQL database, I hope we can get away with the performance penalty. In any case, I think the penalty of copying full raw images is always greater than the penalty of using qcow2.

The disk can be converted using qemu-img:

in_img="W10 Clone-disk1.vmdk"
out_img="W10.qcow2"
qemu-img convert -f vmdk "${in_img}" -O qcow2 "${out_img}"

I also found a blog post that suggests first converting the VirtualBox image to raw (using VBoxManage clonehd) and then converting the result to qcow2 (using qemu-img). I do not really see the point of doing that. In practice, the VirtualBox host was running low on disk space, and the uncompressed image would have taken an additional 100GB, so I took the short route.

After that, we can create a new libvirt/KVM VM.

I'm pretty sure that this can be done with virt-install or somesuch (a rough equivalent is sketched below the table), but today I was lazy and used the VM creation wizard of virt-manager. These are the options I chose in the wizard:

option                value
Connection            QEMU/KVM
Installation method   Import existing disk image
Architecture options  x86_64
Storage path          /path/to/my/W10.qcow2
Operating System      Microsoft Windows 10
Memory                8GB
CPU                   4
Name                  win10
Network selection     NAT
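For reference, the same VM could roughly be created with virt-install like this (untested sketch; the disk path and the "default" NAT network are assumptions based on the wizard settings above):

virt-install \
    --connect qemu:///system \
    --name win10 \
    --memory 8192 \
    --vcpus 4 \
    --cpu host-passthrough \
    --os-variant win10 \
    --import \
    --disk path=/path/to/my/W10.qcow2,format=qcow2 \
    --network network=default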

I also selected the Customize installation before install option, which opens the more detailed VM configuration window.

In there, I removed a couple of devices that I do not need on a build machine.

device             comment
Tablet             Proxmox VE performance tweaks
Sound (ich9)       no need for sound when building
Console 1
USB Redirector 1
USB Redirector 2

We also want to be able to use some paravirtualisation features and monitoring, so I added the following new "hardware":

device      config                        reason
Storage     type: CDROM device            installing drivers via CD-ROM
Controller  type: VirtIO Serial           needed for the guest agent channel
Channel     name: org.qemu.guest_agent.0  monitoring the live VM

We don't need to insert an ISO image into our virtual CD-ROM drive just yet.

Now let's start the VM. Since this is the first boot of the VM with "new" hardware, it might take a bit longer than usual.

Networking

The VM has a virtualized e1000e network card, but for whatever reason the VM is unable to connect to the network. The Device Manager claims that everything is operational, but the network is still unusable.

I finally found a (German) forum post that hinted at a possible solution: downgrading the virtual chipset! There's no option to specify the chipset in virt-manager, so a low-level edit to the VM definition is required.

Using virsh edit win10 (after the VM has been powered down) we get a text editor for modifying the XML definition of the VM. There we locate the machine definition in the <os> section and change the machine type:

-<type arch="x86_64" machine="pc-q35-8.2">hvm</type>
+<type arch="x86_64" machine="pc-q35-6.0">hvm</type>

After that, the e1000e NIC becomes usable in the VM.

(In the forum, some people say that they were able to change the chipset back to the original value after that. For me, this doesn't work, and the e1000e card only works with pc-q35-6.0).
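If you want to check which machine types your QEMU build supports before picking one, they can be listed from the host (assuming qemu-system-x86_64 is the emulator libvirt uses):

# list all supported machine types and filter for the q35 variants
qemu-system-x86_64 -machine help | grep q35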

Fine-tuning the VM

Once the VM is running, we can fine-tune it a bit to (hopefully) improve performance.

Usually the virtualization software will emulate "real" hardware for the VM, but the less emulation is required, the faster the system will be.

For the CPU we use the default: Copy host CPU configuration (host-passthrough). For the graphics card and storage controller, we can use the virtio drivers.
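In the domain XML, the CPU setting ends up looking roughly like this (a sketch; virt-manager may add further attributes):

<!-- expose the host CPU directly to the guest instead of emulating a generic model -->
<cpu mode="host-passthrough"/>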

While the virtio drivers come with Linux, we have to manually install them for Windows, following the Proxmox VE docs. I picked the ISO for the latest stable release, which at the time of writing is 0.1.240. Inserting the ISO into the VM's CD-ROM drive, I ran the virtio-win-guest-tools.exe to install all the goodies.
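Fetching and inserting the ISO can also be done from the host; a sketch (the download URL is the usual stable-virtio location, and sda as the CD-ROM's target device is an assumption that may differ on your VM):

# download the current stable virtio-win driver ISO
wget https://fedorapeople.org/groups/virt/virtio-win/direct-downloads/stable-virtio/virtio-win.iso

# insert it into the VM's virtual CD-ROM drive
virsh change-media win10 sda virtio-win.iso --insert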

We can now switch some hardware to VM-optimized variants.

Video

That's the easiest one: we can simply switch the Video model from QXL to Virtio. I've disabled 3D acceleration.

Frankly, I do not worry too much about the virtual video card: the VM is going to run headless most of the time, and is used as a typical CPU-hungry buildbot... no real use for graphics.

Network

Switching the NIC Device model from e1000e to virtio should be a no-brainer. For reasons unknown to me, it doesn't work at all.

So I'm currently stuck with e1000e.

Storage

Removing the default SATA drive and replacing it with a VirtIO Disk doesn't work out of the box: Windows will then fail to find the boot disk, because the already-installed virtio storage driver is not yet enabled as a boot-start driver.

INACCESSIBLE BOOT DEVICE

Some StackExchange post shed some light on how to fix this by booting the VM into Safe mode.

  1. While the system still uses the SATA drive, configure it to boot into safe mode on the next reboot.
    • Within an elevated command prompt type:
    bcdedit /set "{current}" safeboot minimal
    
  2. Power down the VM
  3. Change the bus of the disk to VirtIO (see the XML sketch after this list)
  4. Power up the VM (it should start into Safe mode)
  5. Disable the safe-mode, by running the following within an elevated command prompt:
    bcdedit /deletevalue "{current}" safeboot
    
  6. Reboot
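Changing the disk bus (step 3) can be done in virt-manager, or directly via virsh edit win10. Roughly, the disk's <target> element changes like this (the dev names are assumptions; also remove the disk's <address> element so libvirt regenerates it for the new bus):

-<target dev="sda" bus="sata"/>
+<target dev="vda" bus="virtio"/>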

QEMU guest agent

The virtio goodies also included the guest agent, which allows the hypervisor to query some extra information from a running instance.

Since we have already set up the relevant channel, nothing more needs to be done here.

Just in case this was omitted above, here's what needs to be added

key          value
Hardware     Channel
Name         org.qemu.guest_agent.0
Device Type  UNIX socket (unix)
Auto socket  yes
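In the domain XML, that channel should end up looking roughly like this (the socket path itself is generated by libvirt, so it is omitted here):

<channel type="unix">
  <target type="virtio" name="org.qemu.guest_agent.0"/>
</channel>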

To check whether everything works, we can query the guestinfo from the hypervisor:

$ virsh guestinfo win10 --os
os.id               : mswindows
os.name             : Microsoft Windows
os.pretty-name      : Windows 10 Enterprise 2016 LTSB
os.version          : Microsoft Windows 10
os.version-id       : 10
os.machine          : x86_64
os.variant          : client
os.variant-id       : client
os.kernel-release   : 14393
os.kernel-version   : 10.0

$

Conclusion

That's about it: A Win10 VM that used to run on VirtualBox is now successfully running on libvirt/KVM.

This is not yet a CI-runner, but we are getting closer. In the coming days I'm going to add two more blog posts:

  • Doing fast clones with libvirt/KVM
  • Using libvirt/KVM for running fully isolated jobs with gitlab-runner