On Tue, Apr 01, 2025, Paolo Bonzini wrote:
> Statistics are protected by vcpu->mutex; because KVM_RUN takes the
> plane-0 vCPU mutex, there is no race on applying statistics for all
> planes to the plane-0 kvm_vcpu struct.
>
> This saves the burden on the kernel of implementing the binary stats
> interface for vCPU plane file descriptors, and on userspace of gathering
> info from multiple planes.  The disadvantage is a slight loss of
> information, and an extra pointer dereference when updating stats.
>
> Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> ---
>  arch/arm64/kvm/arm.c                 |  2 +-
>  arch/arm64/kvm/handle_exit.c         |  6 +--
>  arch/arm64/kvm/hyp/nvhe/gen-hyprel.c |  4 +-
>  arch/arm64/kvm/mmio.c                |  4 +-
>  arch/loongarch/kvm/exit.c            |  8 ++--
>  arch/loongarch/kvm/vcpu.c            |  2 +-
>  arch/mips/kvm/emulate.c              |  2 +-
>  arch/mips/kvm/mips.c                 | 30 +++++++-------
>  arch/mips/kvm/vz.c                   | 18 ++++-----
>  arch/powerpc/kvm/book3s.c            |  2 +-
>  arch/powerpc/kvm/book3s_hv.c         | 46 ++++++++++-----------
>  arch/powerpc/kvm/book3s_hv_rm_xics.c |  8 ++--
>  arch/powerpc/kvm/book3s_pr.c         | 22 +++++-----
>  arch/powerpc/kvm/book3s_pr_papr.c    |  2 +-
>  arch/powerpc/kvm/powerpc.c           |  4 +-
>  arch/powerpc/kvm/timing.h            | 28 ++++++-------
>  arch/riscv/kvm/vcpu.c                |  2 +-
>  arch/riscv/kvm/vcpu_exit.c           | 10 ++---
>  arch/riscv/kvm/vcpu_insn.c           | 16 ++++----
>  arch/riscv/kvm/vcpu_sbi.c            |  2 +-
>  arch/riscv/kvm/vcpu_sbi_hsm.c        |  2 +-
>  arch/s390/kvm/diag.c                 | 18 ++++-----
>  arch/s390/kvm/intercept.c            | 20 +++++-----
>  arch/s390/kvm/interrupt.c            | 48 +++++++++++-----------
>  arch/s390/kvm/kvm-s390.c             |  8 ++--
>  arch/s390/kvm/priv.c                 | 60 ++++++++++++++--------------
>  arch/s390/kvm/sigp.c                 | 50 +++++++++++------------
>  arch/s390/kvm/vsie.c                 |  2 +-
>  arch/x86/kvm/debugfs.c               |  2 +-
>  arch/x86/kvm/hyperv.c                |  4 +-
>  arch/x86/kvm/kvm_cache_regs.h        |  4 +-
>  arch/x86/kvm/mmu/mmu.c               | 18 ++++-----
>  arch/x86/kvm/mmu/tdp_mmu.c           |  2 +-
>  arch/x86/kvm/svm/sev.c               |  2 +-
>  arch/x86/kvm/svm/svm.c               | 18 ++++-----
>  arch/x86/kvm/vmx/tdx.c               |  8 ++--
>  arch/x86/kvm/vmx/vmx.c               | 20 +++++-----
>  arch/x86/kvm/x86.c                   | 40 +++++++++----------
>  include/linux/kvm_host.h             |  5 ++-
>  virt/kvm/kvm_main.c                  | 19 ++++-----
>  40 files changed, 285 insertions(+), 283 deletions(-)

...

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index dbca418d64f5..d2e0c0e8ff17 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -393,7 +393,8 @@ struct kvm_vcpu {
>  	bool ready;
>  	bool scheduled_out;
>  	struct kvm_vcpu_arch arch;
> -	struct kvm_vcpu_stat stat;
> +	struct kvm_vcpu_stat *stat;
> +	struct kvm_vcpu_stat __stat;

Rather than special case individual fields, I think we should give kvm_vcpu
the same treatment as "struct kvm", and have kvm_vcpu represent the overall
vCPU, with an array of planes to hold the sub-vCPUs.  Having "kvm_vcpu"
represent a plane, while "kvm" represents the overall VM, is conceptually
messy.

And more importantly, I think the approach taken here will be nigh impossible
to maintain, and will have quite a bit of baggage.  E.g. planes 1+ will be
filled with dead memory, and we also risk goofs where KVM could access __stat
in a plane 1+ vCPU.

Documenting which fields are plane0-only, i.e. per-vCPU, via comments isn't
sustainable, whereas a hard split via structures will naturally document
which fields are scoped to the overall vCPU versus which are per-plane, and
will force us to more explicitly audit the code.  E.g. ____srcu_idx (and
thus srcu_depth) is something that I think should be shared by all planes.
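To make that concrete, roughly the below (purely illustrative and untested;
KVM_MAX_PLANES and the exact set of fields are made-up stand-ins, not a
proposal for the real layout):

/*
 * Illustrative sketch only, NOT a real patch: state that is scoped to
 * the vCPU as a whole stays in kvm_vcpu, anything that must exist once
 * per plane moves to kvm_vcpu_plane.  KVM_MAX_PLANES and the field
 * selection here are hypothetical.
 */
struct kvm_vcpu_plane {
	struct kvm_vcpu *vcpu;		/* backpointer to the owning vCPU */
	struct kvm_vcpu_arch arch;	/* per-plane architectural state */
};

struct kvm_vcpu {
	struct mutex mutex;		/* per-vCPU, not per-plane */
	int vcpu_id;
	int vcpu_idx;
	struct pid __rcu *pid;		/* one task runs all planes */
	int ____srcu_idx;		/* SRCU read side is per-task, too */
	sigset_t sigset;		/* KVM_SET_SIGNAL_MASK is plane0-only */
	struct kvm_vcpu_stat stat;	/* stats accumulated across planes */

	struct kvm_vcpu_plane *planes[KVM_MAX_PLANES];
};

With a split like that, touching per-plane state _has_ to go through a plane
pointer, so the "operated on the wrong structure" class of bugs simply can't
compile.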
Ditto for preempt_notifier, vcpu_id, vcpu_idx, pid, etc.

Aha!  And to prove my point, this series breaks legacy signal handling,
because sigset_active and sigset are accessed using the plane 1+ vCPU in
kvm_vcpu_ioctl_run_plane(), but KVM_SET_SIGNAL_MASK is only allowed to
operate on plane0.  And I definitely don't think the answer is to let
KVM_SET_SIGNAL_MASK operate on planes 1+, because forcing userspace to
duplicate the signal masks to all planes is pointless.

Yeeeaaap.  pid and pid_lock are also broken.  As is vmx_hwapic_isr_update(),
and kvm_sched_out()'s usage of wants_to_run.  And guest_debug.

Long term, I just don't see this approach as being maintainable.  We're
pretty much guaranteed to end up with bugs where KVM operates on the wrong
kvm_vcpu structure due to the lack of explicit isolation in the code.  And
those bugs are going to be absolutely brutal to debug (or even notice).
E.g. failure to set "preempted" on planes 1+ will mostly manifest as subtle
performance issues.  Oof.

And that would force us to document that duplicating cpuid and cpu_caps to
planes 1+ is actually necessary, due to dynamic CPUID features (ugh).
Though FWIW, we could dodge that by special casing dynamic features, which
isn't a bad idea irrespective of planes.

Somewhat of a side topic: unless we need/want to explicitly support
concurrent GET/SET on the planes of a vCPU, I think we should make
vcpu->mutex per-vCPU, not per-plane, so that there's zero chance of bugs
due to thinking that holding vcpu->mutex provides protection against a race
when it doesn't.

Extracting fields to a separate kvm_vcpu_plane will obviously require a
*lot* more churn, but I think in the long run it will be less work in
total, because we won't spend as much time chasing down bugs.  Very little
per-plane state is in "struct kvm_vcpu", so I think we can do the big
conversion on a per-arch basis via a small number of #ifdefs, i.e. not be
forced to immediately convert all architectures to a kvm_vcpu vs.
kvm_vcpu_plane world.
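E.g. a single accessor behind one #ifdef could let common code compile the
same way for converted and unconverted architectures (completely
hypothetical names, just to show the shape of the transition):

/*
 * Hypothetical transitional glue: CONFIG_KVM_PLANES and every name
 * below is made up.  Converted architectures hang plane state off an
 * array of pointers; unconverted ones keep a single inline plane (a
 * "plane0" field guarded by the same #ifdef) and behave exactly as
 * they do today.
 */
#ifdef CONFIG_KVM_PLANES
static inline struct kvm_vcpu_plane *vcpu_plane(struct kvm_vcpu *vcpu, int idx)
{
	return vcpu->planes[idx];
}
#else
static inline struct kvm_vcpu_plane *vcpu_plane(struct kvm_vcpu *vcpu, int idx)
{
	return &vcpu->plane0;	/* idx is always 0 without planes */
}
#endif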