On Tue, Sep 09, 2025, Chao Gao wrote: > On Mon, Aug 25, 2025 at 10:55:20AM +0800, Chao Gao wrote: > >On Sun, Aug 24, 2025 at 06:52:55PM -0700, Xin Li wrote: > >>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > >>> index 6b01c6e9330e..799ac76679c9 100644 > >>> --- a/arch/x86/kvm/x86.c > >>> +++ b/arch/x86/kvm/x86.c > >>> @@ -4566,6 +4569,21 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > >>> } > >>> EXPORT_SYMBOL_GPL(kvm_get_msr_common); > >>> +/* > >>> + * Returns true if the MSR in question is managed via XSTATE, i.e. is context > >>> + * switched with the rest of guest FPU state. > >>> + */ > >>> +static bool is_xstate_managed_msr(u32 index) > >>> +{ > >>> + switch (index) { > >>> + case MSR_IA32_U_CET: > >> > >> > >>Why MSR_IA32_S_CET is not included here? Because the guest's S_CET must *never* be resident in harware while running in the host. Doing so would create an egregious security issue due to letting the guest disabled IBT and/or shadow stacks, or alternatively crash the host by enabling one or the other. Having guest MSR_IA32_PL[0-3]_SSP resident in hardware while the _kernel_ is running is safe, because those MSRs are only consumed on transitions to lower privilege levels, i.e. from KVM's perspective, they're effectively user-return MSRs that get restored on exit to userspace thanks to kvm_{load,put}_guest_fpu() context switching between VMM and guest state (if the vCPU task is preempted, the normal context switch code handles swapping state between tasks, it's only the VMM vs. guest state that needs dedicated handling since they are the same task). Context switching S_CET as part of XRSTORS very, VERY subtly works by virtue of S_CET already being loaded with the host's value on VM-Exit. I.e. the value saved into guest FPU state is always the host's value, and thus the value loaded from guest FPU state is always the host's value. And because all of that isn't enough, the final wrinkle is that KVM_{G,S}ET_XSAVE only operate on xcr0 / user state, i.e. don't allow userspace to load supervisor (S_CET) state into the kernel. > >Emm. I didn't think about this > > > >MSR_IA32_S_CET is read from or written to a dedicated VMCS/B field, so KVM > >doesn't need to load the guest FPU to access MSR_IA32_S_CET. This pairs with > >the kvm_{get,set}_xstate_msr() in kvm_{get,set}_msr_common(). > > > >That said, userspace writes can indeed cause an inconsistency between the guest > >FPU and VMCS fields regarding MSR_IA32_S_CET. If migration occurs right after a > >userspace write (without a VM-entry, which would bring them in sync) and > >userspace just restores MSR_IA32_S_CET from the guest FPU, the write before > >migration could be lost. > > > >If that migration issue is a practical problem, I think MSR_IA32_S_CET should > >be included here, and we need to perform a kvm_set_xstate_msr() after writing > >to the VMCS/B. > > I prefer to make guest FPU and VMCS always consistent regarding MSR_IA32_S_CET. > The cost is to load guest FPU before userspace accesses the MSR and save guest > FPU after that. It isn't a problem as MSR_IA32_S_CET accesses from userspace > shouldn't be on any hot-path. So, I will add: > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index 989008f5307e..92daf63c9487 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -2435,6 +2435,7 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > break; > case MSR_IA32_S_CET: > vmcs_writel(GUEST_S_CET, data); > + kvm_set_xstate_msr(vcpu, msr_info); As explained above, this is a fatal flaw, as it will allow userspace to write an arbitrary value to S_CET. > break; > case MSR_KVM_INTERNAL_GUEST_SSP: > vmcs_writel(GUEST_SSP, data); > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index a3770e3d5154..6f64a3355274 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -4647,6 +4647,7 @@ EXPORT_SYMBOL_GPL(kvm_get_msr_common); > static bool is_xstate_managed_msr(u32 index) > { > switch (index) { > + case MSR_IA32_S_CET: And this is wrong too. I'll add a big comment to explain all of this, especially since I just spent an entire morning re-learning how all of this works. *sigh* > case MSR_IA32_U_CET: > case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP: