On Wed, Aug 06, 2025 at 12:56:31PM -0700, Sean Christopherson wrote:
> Add arch hooks to the mediated vPMU load/put APIs, and use the hooks to
> switch PMIs to the dedicated mediated PMU IRQ vector on load, and back to
> perf's standard NMI when the guest context is put. I.e. route PMIs to
> PERF_GUEST_MEDIATED_PMI_VECTOR when the guest context is active, and to
> NMIs while the host context is active.
>
> While running with guest context loaded, ignore all NMIs (in perf). Any
> NMI that arrives while the LVTPC points at the mediated PMU IRQ vector
> can't possibly be due to a host perf event.
>
> Signed-off-by: Xiong Zhang <xiong.y.zhang@xxxxxxxxxxxxxxx>
> Signed-off-by: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
> Signed-off-by: Mingwei Zhang <mizhang@xxxxxxxxxx>
> [sean: use arch hook instead of per-PMU callback]
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>  arch/x86/events/core.c     | 27 +++++++++++++++++++++++++++
>  include/linux/perf_event.h |  3 +++
>  kernel/events/core.c       |  4 ++++
>  3 files changed, 34 insertions(+)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 7610f26dfbd9..9b0525b252f1 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -55,6 +55,8 @@ DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = {
>  	.pmu = &pmu,
>  };
>
> +static DEFINE_PER_CPU(bool, x86_guest_ctx_loaded);
> +
>  DEFINE_STATIC_KEY_FALSE(rdpmc_never_available_key);
>  DEFINE_STATIC_KEY_FALSE(rdpmc_always_available_key);
>  DEFINE_STATIC_KEY_FALSE(perf_is_hybrid);
> @@ -1756,6 +1758,16 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
>  	u64 finish_clock;
>  	int ret;
>
> +	/*
> +	 * Ignore all NMIs when a guest's mediated PMU context is loaded. Any
> +	 * such NMI can't be due to a PMI as the CPU's LVTPC is switched to/from
> +	 * the dedicated mediated PMI IRQ vector while host events are quiesced.
> +	 * Attempting to handle a PMI while the guest's context is loaded will
> +	 * generate false positives and clobber guest state.
> +	 */
> +	if (this_cpu_read(x86_guest_ctx_loaded))
> +		return NMI_DONE;
> +
>  	/*
>  	 * All PMUs/events that share this PMI handler should make sure to
>  	 * increment active_events for their events.
> @@ -2727,6 +2739,21 @@ static struct pmu pmu = {
>  	.filter = x86_pmu_filter,
>  };
>
> +void arch_perf_load_guest_context(unsigned long data)
> +{
> +	u32 masked = data & APIC_LVT_MASKED;
> +
> +	apic_write(APIC_LVTPC,
> +		   APIC_DM_FIXED | PERF_GUEST_MEDIATED_PMI_VECTOR | masked);
> +	this_cpu_write(x86_guest_ctx_loaded, true);
> +}
> +
> +void arch_perf_put_guest_context(void)
> +{
> +	this_cpu_write(x86_guest_ctx_loaded, false);
> +	apic_write(APIC_LVTPC, APIC_DM_NMI);
> +}
> +
>  void arch_perf_update_userpage(struct perf_event *event,
>  			       struct perf_event_mmap_page *userpg, u64 now)
>  {
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 0c529fbd97e6..3a9bd9c4c90e 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1846,6 +1846,9 @@ static inline unsigned long perf_arch_guest_misc_flags(struct pt_regs *regs)
>  # define perf_arch_guest_misc_flags(regs) perf_arch_guest_misc_flags(regs)
>  #endif
>
> +extern void arch_perf_load_guest_context(unsigned long data);
> +extern void arch_perf_put_guest_context(void);
> +
>  static inline bool needs_branch_stack(struct perf_event *event)
>  {
>  	return event->attr.branch_sample_type != 0;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index e1df3c3bfc0d..ad22b182762e 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6408,6 +6408,8 @@ void perf_load_guest_context(unsigned long data)
>  		task_ctx_sched_out(cpuctx->task_ctx, NULL, EVENT_GUEST);
>  	}
>
> +	arch_perf_load_guest_context(data);

So I still don't understand why this ever needs to reach the generic code.
x86 pmu driver and x86 kvm can surely sort this out inside of x86, no?
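
For reference, the load/put ordering in the quoted x86 hunks can be modelled
in plain userspace C. This is only a sketch, not kernel code: the LVTPC
register and the per-CPU flag are simulated with ordinary variables, every
model_* / MODEL_* name is made up for illustration, and the vector value is a
placeholder because the patch does not show what
PERF_GUEST_MEDIATED_PMI_VECTOR is defined to. The APIC_* values are the ones
from arch/x86/include/asm/apicdef.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in arch/x86/include/asm/apicdef.h. */
#define APIC_DM_FIXED   0x00000u
#define APIC_DM_NMI     0x00400u
#define APIC_LVT_MASKED (1u << 16)

/* Placeholder only: the real PERF_GUEST_MEDIATED_PMI_VECTOR is not shown. */
#define MODEL_MEDIATED_PMI_VECTOR 0xf5u

/* Hypothetical stand-ins for the handler's NMI_DONE / NMI_HANDLED results. */
enum { MODEL_NMI_DONE = 0, MODEL_NMI_HANDLED = 1 };

/* Simulated APIC_LVTPC register; starts out routed to NMI (host context). */
static uint32_t model_lvtpc = APIC_DM_NMI;
/* Simulated per-CPU x86_guest_ctx_loaded flag (single CPU in this model). */
static bool model_guest_ctx_loaded;

/* Mirrors arch_perf_load_guest_context(): retarget the LVTPC, then set the flag. */
void model_load_guest_context(unsigned long data)
{
	/* Only the mask bit of the caller-supplied value is carried over. */
	uint32_t masked = (uint32_t)data & APIC_LVT_MASKED;

	model_lvtpc = APIC_DM_FIXED | MODEL_MEDIATED_PMI_VECTOR | masked;
	model_guest_ctx_loaded = true;
}

/* Mirrors arch_perf_put_guest_context(): clear the flag, then go back to NMI. */
void model_put_guest_context(void)
{
	/* Model-only sanity check that load and put are balanced. */
	assert(model_guest_ctx_loaded);

	model_guest_ctx_loaded = false;
	model_lvtpc = APIC_DM_NMI;
}

/*
 * Mirrors only the early return added to perf_event_nmi_handler().  The real
 * handler does much more; here "handled" just means the NMI was not ignored.
 */
int model_nmi_handler(void)
{
	if (model_guest_ctx_loaded)
		return MODEL_NMI_DONE;

	return MODEL_NMI_HANDLED;
}

/* Accessor for the simulated register. */
uint32_t model_read_lvtpc(void)
{
	return model_lvtpc;
}
```

Calling model_load_guest_context(APIC_LVT_MASKED) leaves the simulated LVTPC
pointing at the placeholder vector with the mask bit preserved, and
model_nmi_handler() reports MODEL_NMI_DONE until model_put_guest_context()
restores APIC_DM_NMI. Nothing in the model depends on generic perf state.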