https://bugzilla.kernel.org/show_bug.cgi?id=218792 --- Comment #5 from chenyi.qiang@xxxxxxxxx --- On 5/1/2024 12:41 AM, Sean Christopherson wrote: > On Tue, Apr 30, 2024, bugzilla-daemon@xxxxxxxxxx wrote: >> https://bugzilla.kernel.org/show_bug.cgi?id=218792 >> >> Bug ID: 218792 >> Summary: Guest call trace with mwait enabled >> Product: Virtualization >> Version: unspecified >> Hardware: Intel >> OS: Linux >> Status: NEW >> Severity: normal >> Priority: P3 >> Component: kvm >> Assignee: virtualization_kvm@xxxxxxxxxxxxxxxxxxxx >> Reporter: farrah.chen@xxxxxxxxx >> Regression: No >> >> Environment: >> host/guest kernel: >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master >> e67572cd220(v6.9-rc6) >> QEMU: https://gitlab.com/qemu-project/qemu.git master 5c6528dce86d >> Host/Guest OS: Centos stream9/Ubuntu24.04 >> >> Bug detail description: >> Boot Guest with mwait enabled(-overcommit cpu-pm=on), guest call trace >> "unchecked MSR access error" >> >> Reproduce steps: >> img=centos9.qcow2 >> qemu-system-x86_64 \ >> -name legacy,debug-threads=on \ >> -overcommit cpu-pm=on \ >> -accel kvm -smp 8 -m 8G -cpu host \ >> -drive file=${img},if=none,id=virtio-disk0 \ >> -device virtio-blk-pci,drive=virtio-disk0 \ >> -device virtio-net-pci,netdev=nic0 -netdev >> user,id=nic0,hostfwd=tcp::10023-:22 \ >> -vnc :1 -serial stdio >> >> Guest boot with call trace: >> [ 0.475344] unchecked MSR access error: RDMSR from 0xe2 at rIP: > > MSR 0xE2 is MSR_PKG_CST_CONFIG_CONTROL, which hpet_is_pc10_damaged() assumes > exists if PC10 substates are supported. KVM doesn't emulate/support > MSR_PKG_CST_CONFIG_CONTROL, i.e. injects a #GP on the guest RDMSR, hence the > splat. This isn't a KVM bug as KVM explicitly advertises all zeros for the > MWAIT CPUID leaf, i.e. QEMU is effectively telling the guest that PC10 > substates > are support without KVM's explicit blessing. > > That said, this is arguably a kernel bug (guest side), as I don't see > anything > in the SDM that _requires_ MSR_PKG_CST_CONFIG_CONTROL to exist if PC10 > substates > are supported. > > The issue is likely benign, other that than obvious WARN. The kernel > gracefully > handles the #GP and zeros the result, i.e. will always think PC10 is > _disabled_, > which may or may not be correct, but is functionally ok if the HPET is being > emulated by the host, which it probably is. > > rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, pcfg); > if ((pcfg & 0xF) < 8) > return false; > > The most straightforward fix, and probably the most correct all around, would > be > to use rdmsrl_safe() to suppress the WARN, i.e. have the kernel not yell if > MSR_PKG_CST_CONFIG_CONTROL doesn't exist. Unless HPET is also being passed > through, that'll do the right thing when Linux is a guest. And if a setup > also > passes through HPET, then the VMM can also trap-and-emulate > MSR_PKG_CST_CONFIG_CONTROL > as appropriate (doing so in QEMU without KVM support might be impossible, > though > again it's unnecessary if QEMU is emulating the HPET). > > diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c > index c96ae8fee95e..2afafff18f92 100644 > --- a/arch/x86/kernel/hpet.c > +++ b/arch/x86/kernel/hpet.c > @@ -980,7 +980,9 @@ static bool __init hpet_is_pc10_damaged(void) > return false; > > /* Check whether PC10 is enabled in PKG C-state limit */ > - rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, pcfg); > + if (rdmsrl_safe(MSR_PKG_CST_CONFIG_CONTROL, pcfg)) > + return false; > + > if ((pcfg & 0xF) < 8) > return false; There are three places which could access MSR_PKG_CST_CONFIG_CONTROL. 1. hpet_is_pc10_damaged() in hpet.c 2. *_idle_state_table_update() in intel_idle.c (This BUG comes from this path in VMs) 3. auto_demotion_disable() in intel_idle.c This MSR seems not architectural but CPU model specific. Besides the case 1 as mentioned, the intel_idle driver also uses it to query the lowest processor-specific C-state for the package (case 2) and to disable auto demotion (case 3) based on the specific model. I assume both case 2 and 3 are aimed to improve energy-efficiency. For example, spr_idle_state_table_update() adjusts the exit_latency/target_residency to hardcoded ones based on the package C-state limit. It seems unreasonable in VMs as the hardcoded values are measured in host and the guest CPU model may not match the host one if we only pass-thru this MSR. Similarly, for case 3, there is no guarantee that disabling auto demotion can improve energy efficiency in a emulated CPU model. Since there is no such fine-grained power management virtualization support yet. Can we change all the rdmsr/wrmsr(MSR_PKG_CST_CONFIG_CONTROL) to the *_safe() variant to skip the related operation in VMs? > >> 0xffffffffb5a966b8 (native_read_msr+0x8/0x40) >> [ 0.476465] Call Trace: >> [ 0.476763] <TASK> >> [ 0.477027] ? ex_handler_msr+0x128/0x140 >> [ 0.477460] ? fixup_exception+0x166/0x3c0 >> [ 0.477934] ? exc_general_protection+0xdc/0x3c0 >> [ 0.478481] ? asm_exc_general_protection+0x26/0x30 >> [ 0.479052] ? __pfx_intel_idle_init+0x10/0x10 >> [ 0.479587] ? native_read_msr+0x8/0x40 >> [ 0.480057] intel_idle_init_cstates_icpu.constprop.0+0x5e/0x560 >> [ 0.480747] ? __pfx_intel_idle_init+0x10/0x10 >> [ 0.481275] intel_idle_init+0x161/0x360 >> [ 0.481742] do_one_initcall+0x45/0x220 >> [ 0.482209] do_initcalls+0xac/0x130 >> [ 0.482643] kernel_init_freeable+0x134/0x1e0 >> [ 0.483159] ? __pfx_kernel_init+0x10/0x10 >> [ 0.483648] kernel_init+0x1a/0x1c0 >> [ 0.484087] ret_from_fork+0x31/0x50 >> [ 0.484541] ? __pfx_kernel_init+0x10/0x10 >> [ 0.485030] ret_from_fork_asm+0x1a/0x30 >> [ 0.485462] </TASK> >> >> -- >> You may reply to this email to add a comment. >> >> You are receiving this mail because: >> You are watching the assignee of the bug. > -- You may reply to this email to add a comment. You are receiving this mail because: You are watching the assignee of the bug.