On Tue, 2025-04-22 at 16:41 -0700, Sean Christopherson wrote: > On Tue, Apr 15, 2025, Maxim Levitsky wrote: > > Pass through the host's DEBUGCTL.DEBUGCTLMSR_FREEZE_IN_SMM to the guest > > GUEST_IA32_DEBUGCTL without the guest seeing this value. > > > > Note that in the future we might allow the guest to set this bit as well, > > when we implement PMU freezing on VM own, virtual SMM entry. > > > > Since the value of the host DEBUGCTL can in theory change between VM runs, > > check if has changed, and if yes, then reload the GUEST_IA32_DEBUGCTL with > > the new value of the host portion of it (currently only the > > DEBUGCTLMSR_FREEZE_IN_SMM bit) > > No, it can't. DEBUGCTLMSR_FREEZE_IN_SMM can be toggled via IPI callback, but > IRQs are disabled for the entirety of the inner run loop. And if I'm somehow > wrong, this change movement absolutely belongs in a separate patch. > > > Signed-off-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx> > > --- > > arch/x86/kvm/svm/svm.c | 2 ++ > > arch/x86/kvm/vmx/vmx.c | 28 +++++++++++++++++++++++++++- > > arch/x86/kvm/x86.c | 2 -- > > 3 files changed, 29 insertions(+), 3 deletions(-) > > > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c > > index cc1c721ba067..fda0660236d8 100644 > > --- a/arch/x86/kvm/svm/svm.c > > +++ b/arch/x86/kvm/svm/svm.c > > @@ -4271,6 +4271,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu, > > svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP]; > > svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP]; > > > > + vcpu->arch.host_debugctl = get_debugctlmsr(); > > + > > /* > > * Disable singlestep if we're injecting an interrupt/exception. > > * We don't want our modified rflags to be pushed on the stack where > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > > index c9208a4acda4..e0bc31598d60 100644 > > --- a/arch/x86/kvm/vmx/vmx.c > > +++ b/arch/x86/kvm/vmx/vmx.c > > @@ -2194,6 +2194,17 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated > > return debugctl; > > } > > > > +static u64 vmx_get_host_preserved_debugctl(struct kvm_vcpu *vcpu) > > No, just open code handling DEBUGCTLMSR_FREEZE_IN_SMM, or make it a #define. > I'm not remotely convinced that we'll ever want to emulate DEBUGCTLMSR_FREEZE_IN_SMM, > and trying to plan for that possibility and adds complexity for no immediate value. Hi, The problem here is a bit different: we indeed are very unlikely to emulate the DEBUGCTLMSR_FREEZE_IN_SMM but however, when I wrote this patch I was sure that this bit is mandatory with PMU version of 2 or more, but looks like it is optional after all: " Note that system software must check if the processor supports the IA32_DEBUGCTL.FREEZE_WHILE_SMM control bit. IA32_DEBUGCTL.FREEZE_WHILE_SMM is supported if IA32_PERF_CAPABIL- ITIES.FREEZE_WHILE_SMM[Bit 12] is reporting 1. See Section 20.8 for details of detecting the presence of IA32_PERF_CAPABILITIES MSR." KVM indeed doesn't set the bit 12 of IA32_PERF_CAPABILITIES. However, note that the Linux kernel silently sets this bit without checking the aforementioned capability bit and ends up with a #GP exception, which it silently ignores.... (I checked this with a trace...) This led me to believe that this bit should be unconditionally supported, meaning that KVM should at least fake setting it without triggering a #GP. Since that is not the case, I can revert to the simpler model of exclusively using GUEST_IA32_DEBUGCTL while hiding the bit from the guest, however I do vote to keep the guest/host separation. > > > +{ > > + /* > > + * Bits of host's DEBUGCTL that we should preserve while the guest is > > + * running. > > + * > > + * Some of those bits might still be emulated for the guest own use. > > + */ > > + return DEBUGCTLMSR_FREEZE_IN_SMM; > > > > u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu) > > { > > return to_vmx(vcpu)->msr_ia32_debugctl; > > @@ -2202,9 +2213,11 @@ u64 vmx_get_guest_debugctl(struct kvm_vcpu *vcpu) > > static void __vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data) > > { > > struct vcpu_vmx *vmx = to_vmx(vcpu); > > + u64 host_mask = vmx_get_host_preserved_debugctl(vcpu); > > > > vmx->msr_ia32_debugctl = data; > > - vmcs_write64(GUEST_IA32_DEBUGCTL, data); > > + vmcs_write64(GUEST_IA32_DEBUGCTL, > > + (vcpu->arch.host_debugctl & host_mask) | (data & ~host_mask)); > > } > > > > bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated) > > @@ -2232,6 +2245,7 @@ bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated > > return true; > > } > > > > + > > Spurious newline. > > > /* > > * Writes msr value into the appropriate "register". > > * Returns 0 on success, non-0 otherwise. > > @@ -7349,6 +7363,7 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit) > > { > > struct vcpu_vmx *vmx = to_vmx(vcpu); > > unsigned long cr3, cr4; > > + u64 old_debugctl; > > > > /* Record the guest's net vcpu time for enforced NMI injections. */ > > if (unlikely(!enable_vnmi && > > @@ -7379,6 +7394,17 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit) > > vmcs_write32(PLE_WINDOW, vmx->ple_window); > > } > > > > + old_debugctl = vcpu->arch.host_debugctl; > > + vcpu->arch.host_debugctl = get_debugctlmsr(); > > + > > + /* > > + * In case the host DEBUGCTL had changed since the last time we > > + * read it, update the guest's GUEST_IA32_DEBUGCTL with > > + * the host's bits. > > + */ > > + if (old_debugctl != vcpu->arch.host_debugctl) > > This can and should be optimized to only do an update if a host-preserved bit > is toggled. True, I will do this in the next version. > > > + __vmx_set_guest_debugctl(vcpu, vmx->msr_ia32_debugctl); > > I would rather have a helper that explicitly writes the VMCS field, not one that > sets the guest value *and* writes the VMCS field. > > The usage in init_vmcs() doesn't need to write vmx->msr_ia32_debugctl because the > vCPU is zero allocated, and this usage doesn't change vmx->msr_ia32_debugctl. > So the only path that actually needs to modify vmx->msr_ia32_debugctl is > vmx_set_guest_debugctl(). But what about nested entry? nested entry pretty much sets the MSR to a value given by the guest. Also technically the intel_pmu_legacy_freezing_lbrs_on_pmi also changes the guest value by emulating what the real hardware does. Best regards, Maxim Levitsky > > > + > > /* > > * We did this in prepare_switch_to_guest, because it needs to > > * be within srcu_read_lock. > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > index 844e81ee1d96..05e866ed345d 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -11020,8 +11020,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > > set_debugreg(0, 7); > > } > > > > - vcpu->arch.host_debugctl = get_debugctlmsr(); > > - > > guest_timing_enter_irqoff(); > > > > for (;;) { > > -- > > 2.26.3 > >