Adding Joerg's functional email id.

On 4/1/25 6:10 PM, Paolo Bonzini wrote:
> I guess April 1st is not the best date to send out such a large series
> after months of radio silence, but here we are.
>
> AMD VMPLs, Intel TDX partitions, Microsoft Hyper-V VTLs, and ARM CCA
> planes are all examples of virtual privilege level concepts that are
> exclusive to guests. In all these specifications the hypervisor hosts
> multiple copies of a vCPU's register state (or at least most of it)
> and provides hypercalls or instructions to switch between them.
>
> This is the first draft of the implementation, following the sketch
> that was prepared last year between Linux Plumbers and KVM Forum. The
> initial version of the API was posted last October, and the
> implementation only needed small changes.
>
> Attempts made in the past, mostly in the context of Hyper-V VTLs and
> SEV-SNP VMPLs, fell into two categories:
>
> - use a single vCPU file descriptor, and store multiple copies of the
>   state in a single struct kvm_vcpu. This approach requires a lot of
>   changes to provide multiple copies of affected fields, especially
>   MMUs and APICs, and complex uAPI extensions to direct existing
>   ioctls to a specific privilege level. While more or less workable
>   for SEV-SNP VMPLs, that was only because the copies of the register
>   state were hidden in the VMSA (KVM does not manage it); it showed
>   all its problems when applied to Hyper-V VTLs.
>
>   The main advantage was that KVM kept the knowledge of the
>   relationship between vCPUs that have the same id but belong to
>   different privilege levels. This is important in order to
>   accelerate switches in-kernel.
>
> - use multiple VM and vCPU file descriptors, and handle the switch
>   entirely in userspace. This got gnarly pretty fast, for even more
>   reasons than the previous case; for example, the VMs could no
>   longer share memslots, including dirty bitmaps and private/shared
>   attributes (a substantial problem for SEV-SNP, since VMPLs share
>   their ASID).
>
>   In contrast to the other case, the total lack of kernel-level
>   sharing of register state, and the lack of any guarantee that the
>   vCPUs do not run in parallel, is what makes this approach
>   problematic for both kernel and userspace. An in-kernel
>   implementation of the privilege level switch goes from complicated
>   to impossible, and userspace needs a lot of complexity as well to
>   ensure that higher-privileged VTLs properly interrupt a
>   lower-privileged one.
>
> This design sits squarely in the middle: it gives the initial set of
> VM and vCPU file descriptors the full set of ioctls + struct kvm_run,
> whereas other privilege levels ("planes") instead only support a
> small part of the KVM API. In fact for the VM file descriptor it is
> only three ioctls: KVM_CHECK_EXTENSION, KVM_SIGNAL_MSI and
> KVM_SET_MEMORY_ATTRIBUTES. For vCPUs it is basically KVM_GET/SET_*.
>
> Most notably, memslots and KVM_RUN are *not* included (the choice of
> which plane to run is done via vcpu->run), which solves a lot of the
> problems in both of the previous approaches. Compared to the
> multiple-file-descriptors solution, it gets for free the ability to
> avoid parallel execution of the same vCPU in different privilege
> levels. Compared to having a single file descriptor, churn is more
> limited, or at least can be attacked in small bites. For example, in
> this series only the per-plane interrupt controllers are switched to
> use the new struct kvm_plane in place of struct kvm, and that's more
> or less enough in the absence of complex interrupt delivery
> scenarios.
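As an aside for readers following along, here is a minimal sketch of
how userspace might drive this, based purely on the description above.
KVM_CHECK_EXTENSION, KVM_SIGNAL_MSI and KVM_RUN are existing ioctls,
but KVM_CAP_PLANES, the run->plane field, and the "return the highest
supported plane id" convention are my guesses at the new uAPI; the
code would only build against the headers from this series, and even
then the names may differ.

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Placeholder capability number, for illustration only. */
#ifndef KVM_CAP_PLANES
#define KVM_CAP_PLANES 242
#endif

/*
 * "run" points at the mmap()ed struct kvm_run of the default-plane
 * vCPU fd; per the cover letter, KVM_RUN exists only on that fd and
 * the plane to enter is selected through struct kvm_run itself.
 */
static int run_vcpu_on_plane(int vm_fd, int vcpu_fd,
                             struct kvm_run *run, int plane_id)
{
        /* Assumed convention: returns the highest supported plane id. */
        if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PLANES) < plane_id)
                return -1;

        run->plane = plane_id;          /* field name assumed */
        return ioctl(vcpu_fd, KVM_RUN, 0);
}

/*
 * Deliver an MSI to a specific plane: a plane's VM file descriptor
 * supports KVM_SIGNAL_MSI, so the target plane is implied by the fd
 * the ioctl is issued on; no new arguments are needed.
 */
static int signal_msi_on_plane(int plane_vm_fd, __u64 addr, __u32 data)
{
        struct kvm_msi msi;

        memset(&msi, 0, sizeof(msi));
        msi.address_lo = (__u32)addr;
        msi.address_hi = (__u32)(addr >> 32);
        msi.data = data;
        return ioctl(plane_vm_fd, KVM_SIGNAL_MSI, &msi);
}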
>
> Changes to the userspace API are also relatively small; they boil
> down to the introduction of a single new kind of file descriptor, and
> almost entirely fit in common code. Reviewing these VM-wide and
> architecture-independent changes should be the main purpose of this
> RFC, since there are still some things to fix:
>
> - I named some fields "plane" instead of "plane_id" because I
>   expected no fields of type struct kvm_plane*, but in retrospect
>   that wasn't a great idea.
>
> - online_vcpus counts across all planes, but x86 code is still using
>   it to deal with TSC synchronization. Probably I will try and make
>   kvmclock synchronization per-plane instead of per-VM.

Hi Paolo,

Is there still a plan to make kvmclock synchronization per-plane
instead of per-VM? Do you plan to handle it as part of this patchset,
or do you think it should be handled separately on top of it?

I'm asking because coconut-svsm needs a monotonic clock source that
adheres to wall-clock time, and we have been exploring several
approaches to achieve this. One idea is to use kvmclock, provided it
can support a per-plane instance that remains synchronized across
planes.

Thanks.

> - we're going to need a struct kvm_vcpu_plane similar to what Roy had
>   in
>   https://lore.kernel.org/kvm/cover.1726506534.git.roy.hopkins@xxxxxxxx/
>   (probably smaller though). Requests are per-plane, for example, and
>   I'm pretty sure any simplistic solution would have some corner
>   cases where it's wrong; but it's a high-churn change and I wanted
>   to avoid that for this first posting.
>
> There's a handful of locking TODOs where things should be checked
> more carefully, but clearly identifying vCPU data that is not
> per-plane will also simplify locking, thanks to having a single
> vcpu->mutex for the whole plane. So I'm not particularly worried
> about that; the TDX saga hopefully has taught everyone to move in
> baby steps towards the intended direction.
>
> The handling of interrupt priorities is way more complicated than I
> anticipated, unfortunately; everything else seems to fall into place
> decently well---even taking into account the above incompleteness,
> which anyway should not be a blocker for any VTL or VMPL experiments.
> But do shout if anything makes you feel like I was too lazy, and/or
> you want to puke.
>
> Patches 1-2 are documentation and uAPI definitions.
>
> Patches 3-9 are the common code for VM planes, while patches 10-14
> are the common code for vCPU file descriptors on non-default planes.
>
> Patches 15-26 are the x86-specific code, which is organized as
> follows:
>
> - 15-20: convert APIC code to place its data in the new struct
>   kvm_arch_plane instead of struct kvm_arch.
>
> - 21-24: everything else except the new userspace exit,
>   KVM_EXIT_PLANE_EVENT.
>
> - 25: KVM_EXIT_PLANE_EVENT, which is used when one plane interrupts
>   another.
>
> - 26: finally make the capability available to userspace.
>
> Patches 27-29, finally, are the testcases. More are possible and
> planned, but these are enough to say that, despite the missing bits,
> what exists is not _completely_ broken. I also didn't want to write
> dozens of tests before committing to a selftests API.
>
> Available for now at https://git.kernel.org/pub/scm/virt/kvm/kvm.git,
> branch planes-20250401. I plan to place it in kvm-coco-queue, for
> lack of a better place, as soon as TDX is merged into kvm/next and I
> test it with the usual battery of kvm-unit-tests and real-world
> guests.
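For completeness, a sketch of what a vCPU dispatch loop could look
like once the new userspace exit is in play. This would only build
against the uAPI headers from the branch above, and the exit number
and the run->plane field are again my assumptions, not the final API:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Placeholder exit number, for illustration only. */
#ifndef KVM_EXIT_PLANE_EVENT
#define KVM_EXIT_PLANE_EVENT 41
#endif

static void vcpu_loop(int vcpu_fd, struct kvm_run *run)
{
        while (ioctl(vcpu_fd, KVM_RUN, 0) == 0) {
                switch (run->exit_reason) {
                case KVM_EXIT_PLANE_EVENT:
                        /*
                         * One plane interrupted another; a VTL-style
                         * VMM would pick the plane that should run
                         * next and re-enter the vCPU on it.
                         */
                        run->plane = 1;         /* e.g. switch to VTL1 */
                        break;
                case KVM_EXIT_HLT:
                        return;
                default:
                        return;         /* defer anything else to the VMM */
                }
        }
}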
>
> Thanks,
>
> Paolo
>
> Paolo Bonzini (29):
>   Documentation: kvm: introduce "VM plane" concept
>   KVM: API definitions for plane userspace exit
>   KVM: add plane info to structs
>   KVM: introduce struct kvm_arch_plane
>   KVM: add plane support to KVM_SIGNAL_MSI
>   KVM: move mem_attr_array to kvm_plane
>   KVM: do not use online_vcpus to test vCPU validity
>   KVM: move vcpu_array to struct kvm_plane
>   KVM: implement plane file descriptors ioctl and creation
>   KVM: share statistics for same vCPU id on different planes
>   KVM: anticipate allocation of dirty ring
>   KVM: share dirty ring for same vCPU id on different planes
>   KVM: implement vCPU creation for extra planes
>   KVM: pass plane to kvm_arch_vcpu_create
>   KVM: x86: pass vcpu to kvm_pv_send_ipi()
>   KVM: x86: split "if" in __kvm_set_or_clear_apicv_inhibit
>   KVM: x86: block creating irqchip if planes are active
>   KVM: x86: track APICv inhibits per plane
>   KVM: x86: move APIC map to kvm_arch_plane
>   KVM: x86: add planes support for interrupt delivery
>   KVM: x86: add infrastructure to share FPU across planes
>   KVM: x86: implement initial plane support
>   KVM: x86: extract kvm_post_set_cpuid
>   KVM: x86: initialize CPUID for non-default planes
>   KVM: x86: handle interrupt priorities for planes
>   KVM: x86: enable up to 16 planes
>   selftests: kvm: introduce basic test for VM planes
>   selftests: kvm: add plane infrastructure
>   selftests: kvm: add x86-specific plane test
>
>  Documentation/virt/kvm/api.rst | 245 +++++++--
>  Documentation/virt/kvm/locking.rst | 3 +
>  Documentation/virt/kvm/vcpu-requests.rst | 7 +
>  arch/arm64/include/asm/kvm_host.h | 5 +
>  arch/arm64/kvm/arm.c | 4 +-
>  arch/arm64/kvm/handle_exit.c | 6 +-
>  arch/arm64/kvm/hyp/nvhe/gen-hyprel.c | 4 +-
>  arch/arm64/kvm/mmio.c | 4 +-
>  arch/loongarch/include/asm/kvm_host.h | 5 +
>  arch/loongarch/kvm/exit.c | 8 +-
>  arch/loongarch/kvm/vcpu.c | 4 +-
>  arch/mips/include/asm/kvm_host.h | 5 +
>  arch/mips/kvm/emulate.c | 2 +-
>  arch/mips/kvm/mips.c | 32 +-
>  arch/mips/kvm/vz.c | 18 +-
>  arch/powerpc/include/asm/kvm_host.h | 5 +
>  arch/powerpc/kvm/book3s.c | 2 +-
>  arch/powerpc/kvm/book3s_hv.c | 46 +-
>  arch/powerpc/kvm/book3s_hv_rm_xics.c | 8 +-
>  arch/powerpc/kvm/book3s_pr.c | 22 +-
>  arch/powerpc/kvm/book3s_pr_papr.c | 2 +-
>  arch/powerpc/kvm/powerpc.c | 6 +-
>  arch/powerpc/kvm/timing.h | 28 +-
>  arch/riscv/include/asm/kvm_host.h | 5 +
>  arch/riscv/kvm/vcpu.c | 4 +-
>  arch/riscv/kvm/vcpu_exit.c | 10 +-
>  arch/riscv/kvm/vcpu_insn.c | 16 +-
>  arch/riscv/kvm/vcpu_sbi.c | 2 +-
>  arch/riscv/kvm/vcpu_sbi_hsm.c | 2 +-
>  arch/s390/include/asm/kvm_host.h | 5 +
>  arch/s390/kvm/diag.c | 18 +-
>  arch/s390/kvm/intercept.c | 20 +-
>  arch/s390/kvm/interrupt.c | 48 +-
>  arch/s390/kvm/kvm-s390.c | 10 +-
>  arch/s390/kvm/priv.c | 60 +--
>  arch/s390/kvm/sigp.c | 50 +-
>  arch/s390/kvm/vsie.c | 2 +-
>  arch/x86/include/asm/kvm_host.h | 46 +-
>  arch/x86/kvm/cpuid.c | 57 +-
>  arch/x86/kvm/cpuid.h | 2 +
>  arch/x86/kvm/debugfs.c | 2 +-
>  arch/x86/kvm/hyperv.c | 7 +-
>  arch/x86/kvm/i8254.c | 7 +-
>  arch/x86/kvm/ioapic.c | 4 +-
>  arch/x86/kvm/irq_comm.c | 14 +-
>  arch/x86/kvm/kvm_cache_regs.h | 4 +-
>  arch/x86/kvm/lapic.c | 147 +++--
>  arch/x86/kvm/mmu/mmu.c | 41 +-
>  arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
>  arch/x86/kvm/svm/sev.c | 4 +-
>  arch/x86/kvm/svm/svm.c | 21 +-
>  arch/x86/kvm/vmx/tdx.c | 8 +-
>  arch/x86/kvm/vmx/vmx.c | 20 +-
>  arch/x86/kvm/x86.c | 319 ++++++++---
>  arch/x86/kvm/xen.c | 1 +
>  include/linux/kvm_host.h | 130 +++--
>  include/linux/kvm_types.h | 1 +
>  include/uapi/linux/kvm.h | 28 +-
>  tools/testing/selftests/kvm/Makefile.kvm | 2 +
>  .../testing/selftests/kvm/include/kvm_util.h | 48 ++
>  .../selftests/kvm/include/x86/processor.h | 1 +
>  tools/testing/selftests/kvm/lib/kvm_util.c | 65 ++-
>  .../testing/selftests/kvm/lib/x86/processor.c | 15 +
>  tools/testing/selftests/kvm/plane_test.c | 103 ++++
>  tools/testing/selftests/kvm/x86/plane_test.c | 270 ++++++++++
>  virt/kvm/dirty_ring.c | 5 +-
>  virt/kvm/guest_memfd.c | 3 +-
>  virt/kvm/irqchip.c | 5 +-
>  virt/kvm/kvm_main.c | 500 ++++++++++++++----
>  69 files changed, 1991 insertions(+), 614 deletions(-)
>  create mode 100644 tools/testing/selftests/kvm/plane_test.c
>  create mode 100644 tools/testing/selftests/kvm/x86/plane_test.c