[Bug 218792] Guest call trace with mwait enabled

bugzilla-daemon@xxxxxxxxxx · Thu, 31 Jul 2025 08:59:52 +0000

https://bugzilla.kernel.org/show_bug.cgi?id=218792

--- Comment #5 from chenyi.qiang@xxxxxxxxx ---
On 5/1/2024 12:41 AM, Sean Christopherson wrote:
> On Tue, Apr 30, 2024, bugzilla-daemon@xxxxxxxxxx wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=218792
>>
>>             Bug ID: 218792
>>            Summary: Guest call trace with mwait enabled
>>            Product: Virtualization
>>            Version: unspecified
>>           Hardware: Intel
>>                 OS: Linux
>>             Status: NEW
>>           Severity: normal
>>           Priority: P3
>>          Component: kvm
>>           Assignee: virtualization_kvm@xxxxxxxxxxxxxxxxxxxx
>>           Reporter: farrah.chen@xxxxxxxxx
>>         Regression: No
>>
>> Environment:
>> host/guest kernel:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>> e67572cd220(v6.9-rc6)
>> QEMU: https://gitlab.com/qemu-project/qemu.git master 5c6528dce86d
>> Host/Guest OS: Centos stream9/Ubuntu24.04
>>
>> Bug detail description: 
>> Boot Guest with mwait enabled(-overcommit cpu-pm=on), guest call trace
>> "unchecked MSR access error"
>>
>> Reproduce steps:
>> img=centos9.qcow2
>> qemu-system-x86_64 \
>>     -name legacy,debug-threads=on \
>>     -overcommit cpu-pm=on \
>>     -accel kvm -smp 8 -m 8G -cpu host \
>>     -drive file=${img},if=none,id=virtio-disk0 \
>>     -device virtio-blk-pci,drive=virtio-disk0 \
>>     -device virtio-net-pci,netdev=nic0 -netdev
>> user,id=nic0,hostfwd=tcp::10023-:22 \
>>     -vnc :1 -serial stdio
>>
>> Guest boot with call trace:
>> [ 0.475344] unchecked MSR access error: RDMSR from 0xe2 at rIP:
> 
> MSR 0xE2 is MSR_PKG_CST_CONFIG_CONTROL, which hpet_is_pc10_damaged() assumes
> exists if PC10 substates are supported. KVM doesn't emulate/support
> MSR_PKG_CST_CONFIG_CONTROL, i.e. injects a #GP on the guest RDMSR, hence the
> splat.  This isn't a KVM bug as KVM explicitly advertises all zeros for the
> MWAIT CPUID leaf, i.e. QEMU is effectively telling the guest that PC10
> substates
> are support without KVM's explicit blessing.
> 
> That said, this is arguably a kernel bug (guest side), as I don't see
> anything
> in the SDM that _requires_ MSR_PKG_CST_CONFIG_CONTROL to exist if PC10
> substates
> are supported.
> 
> The issue is likely benign, other that than obvious WARN.  The kernel
> gracefully
> handles the #GP and zeros the result, i.e. will always think PC10 is
> _disabled_,
> which may or may not be correct, but is functionally ok if the HPET is being
> emulated by the host, which it probably is.
> 
>       rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, pcfg);
>       if ((pcfg & 0xF) < 8)
>               return false;
> 
> The most straightforward fix, and probably the most correct all around, would
> be
> to use rdmsrl_safe() to suppress the WARN, i.e. have the kernel not yell if
> MSR_PKG_CST_CONFIG_CONTROL doesn't exist.  Unless HPET is also being passed
> through, that'll do the right thing when Linux is a guest.  And if a setup
> also
> passes through HPET, then the VMM can also trap-and-emulate
> MSR_PKG_CST_CONFIG_CONTROL
> as appropriate (doing so in QEMU without KVM support might be impossible,
> though
> again it's unnecessary if QEMU is emulating the HPET).
> 
> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
> index c96ae8fee95e..2afafff18f92 100644
> --- a/arch/x86/kernel/hpet.c
> +++ b/arch/x86/kernel/hpet.c
> @@ -980,7 +980,9 @@ static bool __init hpet_is_pc10_damaged(void)
>                 return false;
>  
>         /* Check whether PC10 is enabled in PKG C-state limit */
> -       rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, pcfg);
> +       if (rdmsrl_safe(MSR_PKG_CST_CONFIG_CONTROL, pcfg))
> +               return false;
> +
>         if ((pcfg & 0xF) < 8)
>                 return false;

There are three places which could access MSR_PKG_CST_CONFIG_CONTROL.
1. hpet_is_pc10_damaged() in hpet.c
2. *_idle_state_table_update() in intel_idle.c (This BUG comes from this path
in VMs)
3. auto_demotion_disable() in intel_idle.c

This MSR seems not architectural but CPU model specific.

Besides the case 1 as mentioned, the intel_idle driver also uses it to query
the
lowest processor-specific C-state for the package (case 2) and to disable auto
demotion
(case 3) based on the specific model.

I assume both case 2 and 3 are aimed to improve energy-efficiency. For example,
spr_idle_state_table_update() adjusts the exit_latency/target_residency to
hardcoded ones based on
the package C-state limit. It seems unreasonable in VMs as the hardcoded values
are measured in host
and the guest CPU model may not match the host one if we only pass-thru this
MSR. Similarly,
for case 3, there is no guarantee that disabling auto demotion can improve
energy efficiency in a
emulated CPU model.

Since there is no such fine-grained power management virtualization support
yet. Can we change
all the rdmsr/wrmsr(MSR_PKG_CST_CONFIG_CONTROL) to the *_safe() variant to skip
the related operation
in VMs?

> 
>> 0xffffffffb5a966b8 (native_read_msr+0x8/0x40)
>> [ 0.476465] Call Trace:
>> [ 0.476763] <TASK>
>> [ 0.477027] ? ex_handler_msr+0x128/0x140
>> [ 0.477460] ? fixup_exception+0x166/0x3c0
>> [ 0.477934] ? exc_general_protection+0xdc/0x3c0
>> [ 0.478481] ? asm_exc_general_protection+0x26/0x30
>> [ 0.479052] ? __pfx_intel_idle_init+0x10/0x10
>> [ 0.479587] ? native_read_msr+0x8/0x40
>> [ 0.480057] intel_idle_init_cstates_icpu.constprop.0+0x5e/0x560
>> [ 0.480747] ? __pfx_intel_idle_init+0x10/0x10
>> [ 0.481275] intel_idle_init+0x161/0x360
>> [ 0.481742] do_one_initcall+0x45/0x220
>> [ 0.482209] do_initcalls+0xac/0x130
>> [ 0.482643] kernel_init_freeable+0x134/0x1e0
>> [ 0.483159] ? __pfx_kernel_init+0x10/0x10
>> [ 0.483648] kernel_init+0x1a/0x1c0
>> [ 0.484087] ret_from_fork+0x31/0x50
>> [ 0.484541] ? __pfx_kernel_init+0x10/0x10
>> [ 0.485030] ret_from_fork_asm+0x1a/0x30
>> [ 0.485462] </TASK>
>>
>> -- 
>> You may reply to this email to add a comment.
>>
>> You are receiving this mail because:
>> You are watching the assignee of the bug.
>

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.