Re: [PATCH RFC 1/2] KVM: TDX: Disable general support for MWAIT in guest

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 8/20/2025 7:35 AM, Sean Christopherson wrote:
On Tue, Aug 19, 2025, Rick P Edgecombe wrote:
On Tue, 2025-08-19 at 10:38 +0300, Adrian Hunter wrote:
On 18/08/2025 21:49, Edgecombe, Rick P wrote:
Attn: Binbin, Xiaoyao

On Mon, 2025-08-18 at 07:05 -0700, Sean Christopherson wrote:
NAK.

Fix the guest, or wherever else in the pile there are issues.  KVM is NOT carrying
hack-a-fixes to workaround buggy software/firmware.  Been there, done that.

Yes, I would have thought we should have at least had a TDX module change option
for this.

That would not help with existing TDX Modules, and would possibly require
a guest opt-in, which would not help with existing guests.  Hence, to start
with disabling the feature first, and look for another solution second.

I think you have the priorities wrong. There are only so many kludges we can ask
KVM to take. Across all the changes people want for TDX, do you think not having
to update the TDX module, backport a guest fix or even just adjust qemu args is
more important the other stuff?

I'm especially sensitive to fudging around _bugs_ in other pieces of the stack.
KVM has been burned badly, multiple times, by hacking around issues elsewhere.
There are inevitably cases where throwing something into KVM is the least awful
choice (usually because it's the only feasible choise), but this ain't one of
those cases.

TDX support is still very early. We need to think about long term sustainable
solutions. So a fix that doesn't support existing TDX modules or guests (the
intel_idle fix is also in this category anyway) should absolutely be on the
table.


In the MWAIT case, Sean has rejected supporting MSR_PKG_CST_CONFIG_CONTROL
even for VMX, because it is an optional MSR, so altering intel_idle is
being proposed.

No, I rejected support MSR_PKG_CST_CONFIG_CONTROL _in KVM_ because I don't see
any reason to shove information into KVM.  AFAICT, it's not an "architectural"
MSR, and all of KVM's existing handling of truly uarch/model-specific MSRs is
painful and ugly.

E.g., for MSR_IA32_POWER_CTL (I don't know why it has _IA32_ in the name, it seems to me not an architectural MSR), KVM chose to just emulate the read/write of it as NOP to workaround the issue that guest driver will access it after MWAIT is allowed natively for the guest.

TDX allowed the same workaround by returning true for MSR_IA32_POWER_CTL in tdx_has_emulated_msr(). So that when td guest requests emulation from KVM on #VE, the workaround can come into play. But with #VE reduction enabled by the TD guest, no #VE anymore but #GP.

It seems we cannot remove the workaround because that will break the existing guests who rely on the workaround.

And userspace (e.g. QEMU) could support emulate MSR_PKG_CST_CONFIG_CONTROL (and
any other MSRs of that nature) via MSR filters.  I doubt the MSR is accessed
outside of boot paths, so the cost of a userspace exit should be a non-issue.
Of course, QEMU probably can't provide useful/accurate information.

One option if there is a super strong need to do so would be to add a "disable
exits" capability to let the guest read package c-state MSRs, but that has
obvious downsides and would still just be fudging around a flawed driver.

I have thought about this option as well. Unfortunately there are WRITE case in the driver, and we cannot simply pass through WRITE.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8dbf19aa66ef..c254aa26ff22 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4120,6 +4120,15 @@ void vmx_recalc_msr_intercepts(struct kvm_vcpu *vcpu)
                 vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C6_RESIDENCY, MSR_TYPE_R);
                 vmx_disable_intercept_for_msr(vcpu, MSR_CORE_C7_RESIDENCY, MSR_TYPE_R);
         }
+       if (kvm_package_cstate_in_guest(vcpu->kvm)) {
+               vmx_disable_intercept_for_msr(vcpu, MSR_PKG_CST_CONFIG_CONTROL, MSR_TYPE_R);
+               vmx_disable_intercept_for_msr(vcpu, MSR_PKG_C2_RESIDENCY, MSR_TYPE_R);
+               vmx_disable_intercept_for_msr(vcpu, MSR_PKG_C3_RESIDENCY, MSR_TYPE_R);
+               vmx_disable_intercept_for_msr(vcpu, MSR_PKG_C6_RESIDENCY, MSR_TYPE_R);
+               vmx_disable_intercept_for_msr(vcpu, MSR_PKG_C8_RESIDENCY, MSR_TYPE_R);
+               vmx_disable_intercept_for_msr(vcpu, MSR_PKG_C9_RESIDENCY, MSR_TYPE_R);
+               vmx_disable_intercept_for_msr(vcpu, MSR_PKG_C10_RESIDENCY, MSR_TYPE_R);
+       }
         if (kvm_aperfmperf_in_guest(vcpu->kvm)) {
                 vmx_disable_intercept_for_msr(vcpu, MSR_IA32_APERF, MSR_TYPE_R);
                 vmx_disable_intercept_for_msr(vcpu, MSR_IA32_MPERF, MSR_TYPE_R);





[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux