On Fri, Aug 15, 2025 at 08:55:25AM -0700, Sean Christopherson wrote:
> On Fri, Aug 15, 2025, Sean Christopherson wrote:
> > On Fri, Aug 15, 2025, Peter Zijlstra wrote:
> > > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > > index e1df3c3bfc0d..ad22b182762e 100644
> > > > --- a/kernel/events/core.c
> > > > +++ b/kernel/events/core.c
> > > > @@ -6408,6 +6408,8 @@ void perf_load_guest_context(unsigned long data)
> > > >  		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
> > > >  	}
> > > > 
> > > > +	arch_perf_load_guest_context(data);
> > > 
> > > So I still don't understand why this ever needs to reach the generic
> > > code. x86 pmu driver and x86 kvm can surely sort this out inside of
> > > x86, no?
> > 
> > It's definitely possible to handle this entirely within x86, I just don't love
> > switching the LVTPC without the protection of perf_ctx_lock and perf_ctx_disable().
> > It's not a sticking point for me if you strongly prefer something like this:
> > 
> > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > index 0e5048ae86fa..86b81c217b97 100644
> > --- a/arch/x86/kvm/pmu.c
> > +++ b/arch/x86/kvm/pmu.c
> > @@ -1319,7 +1319,9 @@ void kvm_mediated_pmu_load(struct kvm_vcpu *vcpu)
> > 
> >  	lockdep_assert_irqs_disabled();
> > 
> > -	perf_load_guest_context(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));
> > +	perf_load_guest_context();
> > +
> > +	perf_load_guest_lvtpc(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));
> 
> Hmm, an argument for providing a dedicated perf_load_guest_lvtpc() APIs is that
> it would allow KVM to handle LVTPC writes in KVM's VM-Exit fastpath, i.e. without
> having to do a full put+reload of the guest context.
> 
> So if we're confident that switching the host LVTPC outside of
> perf_{load,put}_guest_context() is functionally safe, I'm a-ok with it.

Let me see. So the hardware sets Masked when it raises the interrupt.
The interrupt handler clears it from software -- depending on uarch, in 3
different places:

 1) right at the start of the PMI
 2) in the middle, right before enabling the PMU (writing global control)
 3) at the end of the PMI

The various changelogs adding that code mention spurious PMIs and
malformed PEBS records.

So the fun all happens when the guest is doing a PMI and gets a VM-exit
while still Masked. At that point, we can come in and completely rewrite
the PMU state, reroute the PMI and enable things again. Then later, we
'restore' the PMU state, re-set LVTPC masked to the guest interrupt and
'resume'.

What could possibly go wrong :/

Kan, I'm assuming, but not knowing, that writing all the PMU MSRs is
somehow serializing state sufficiently to not cause the above mentioned
fails? Specifically, clearing PEBS_ENABLE should inhibit those malformed
PEBS records or something? What if the host also has PEBS and we don't
actually clear the bit?

The current order ensures we rewrite LVTPC while global control is
unset; I think we want to keep that.

While staring at this, I note that perf_load_guest_context() will clear
global ctrl, clear all the counter programming, and re-enable an empty
PMU. Now, an empty PMU should result in global control being zero --
there is nothing to run after all. But then kvm_mediated_pmu_load()
writes an explicit 0 again. Perhaps replace this with asserting it is 0
instead?

Anyway, this means that moving the LVTPC write into
kvm_mediated_pmu_load() as you suggest is identical:
perf_load_guest_context() results in global control being 0, we then
assert it is 0, and write LVTPC while it is still 0.
kvm_pmu_load_guest_pmcs() will then frob the MSRs.

OK, so *IF* doing the VM-exit during PMI is sound, this is something
that needs a comment somewhere.

Then going back again is the easy part, since on the host side we can
never transition into KVM during a PMI.