Re: [PATCH 1/3] x86: KVM: VMX: Wrap GUEST_IA32_DEBUGCTL read/write with access functions

mlevitsk@xxxxxxxxxx · Mon, 12 May 2025 20:34:03 -0400

On Wed, 2025-05-07 at 10:18 -0700, Sean Christopherson wrote:
> On Thu, May 01, 2025, mlevitsk@xxxxxxxxxx wrote:
> > On Tue, 2025-04-22 at 16:33 -0700, Sean Christopherson wrote:
> > > > @@ -2653,11 +2654,17 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> > > >  	if (vmx->nested.nested_run_pending &&
> > > >  	    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) {
> > > >  		kvm_set_dr(vcpu, 7, vmcs12->guest_dr7);
> > > > -		vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl);
> > > > +		new_debugctl = vmcs12->guest_ia32_debugctl;
> > > >  	} else {
> > > >  		kvm_set_dr(vcpu, 7, vcpu->arch.dr7);
> > > > -		vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.pre_vmenter_debugctl);
> > > > +		new_debugctl = vmx->nested.pre_vmenter_debugctl;
> > > >  	}
> > > > +
> > > > +	if (CC(!vmx_set_guest_debugctl(vcpu, new_debugctl, false))) {
> > > 
> > > The consistency check belongs in nested_vmx_check_guest_state(), only needs to
> > > check the VM_ENTRY_LOAD_DEBUG_CONTROLS case, and should be posted as a separate
> > > patch.
> > 
> > I can move it there. Can you explain why you want this, though? Is it because of the
> > order of checks specified in the PRM?
> 
> To be consistent with how KVM checks guest state.  The two checks in prepare_vmcs02()
> are special cases.  vmx_guest_state_valid() consumes a huge variety of state, and
> so replicating all of its logic for vmcs12 isn't worth doing.  The check on the
> kvm_set_msr() for guest_ia32_perf_global_ctrl exists purely so that KVM doesn't
> simply ignore the return value.
> 
> And to a lesser degree, because KVM assumes that guest state has been sanitized
> after nested_vmx_check_guest_state() is called.  Violating that risks introducing
> bugs, e.g. consuming vmcs12->guest_ia32_debugctl before it's been vetted could
> theoretically be problematic.
> 
> > Currently GUEST_IA32_DEBUGCTL of the host is *written* in prepare_vmcs02. 
> > Should I also move this write to nested_vmx_check_guest_state?
> 
> No.  nested_vmx_check_guest_state() verifies the incoming vmcs12 state,
> prepare_vmcs02() merges the vmcs12 state with KVM's desires and fills vmcs02.
> 
> > Or should I write the value blindly in prepare_vmcs02 and then check the value
> > of 'vmx->msr_ia32_debugctl' in nested_vmx_check_guest_state and fail if the value
> > contains reserved bits? 
> 
> I don't follow.  nested_vmx_check_guest_state() is called before prepare_vmcs02().

My mistake: for some reason I thought that nested_vmx_check_guest_state() is called from
prepare_vmcs02(). Your explanation now makes sense.
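
To make sure I have the ordering right, here is a minimal standalone sketch of "verify the
incoming vmcs12 state first, then merge it into vmcs02". All sketch_* names and the
simplified structs are hypothetical stand-ins, not the real KVM code, and the
nested_run_pending condition is left out for brevity:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the real KVM definitions. */
#define SKETCH_DEBUGCTL_LBR		(1ULL << 0)
#define SKETCH_DEBUGCTL_BTF		(1ULL << 1)
#define SKETCH_SUPPORTED_DEBUGCTL	(SKETCH_DEBUGCTL_LBR | SKETCH_DEBUGCTL_BTF)

struct sketch_vmcs12 {
	int load_debug_controls;	/* stands in for VM_ENTRY_LOAD_DEBUG_CONTROLS */
	uint64_t guest_ia32_debugctl;
};

struct sketch_vmcs02 {
	uint64_t guest_ia32_debugctl;
};

/* Step 1: only verify the incoming vmcs12 state; nothing is written. */
static int sketch_check_guest_state(const struct sketch_vmcs12 *vmcs12)
{
	if (vmcs12->load_debug_controls &&
	    (vmcs12->guest_ia32_debugctl & ~SKETCH_SUPPORTED_DEBUGCTL))
		return -EINVAL;

	return 0;
}

/* Step 2: merge the already vetted vmcs12 state with the pre-entry value. */
static void sketch_prepare_vmcs02(const struct sketch_vmcs12 *vmcs12,
				  uint64_t pre_vmenter_debugctl,
				  struct sketch_vmcs02 *vmcs02)
{
	vmcs02->guest_ia32_debugctl = vmcs12->load_debug_controls ?
				      vmcs12->guest_ia32_debugctl :
				      pre_vmenter_debugctl;
}

/* The check always runs before vmcs02 is touched. */
static int sketch_nested_enter(const struct sketch_vmcs12 *vmcs12,
			       uint64_t pre_vmenter_debugctl,
			       struct sketch_vmcs02 *vmcs02)
{
	int r = sketch_check_guest_state(vmcs12);

	if (r)
		return r;

	sketch_prepare_vmcs02(vmcs12, pre_vmenter_debugctl, vmcs02);
	return 0;
}
```

With this split, the merge step never consumes a value that has not been vetted.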

> 
> > > > +bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)
> > > > +{
> > > > +	u64 invalid = data & ~vmx_get_supported_debugctl(vcpu, host_initiated);
> > > > +
> > > > +	if (invalid & (DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR)) {
> > > > +		kvm_pr_unimpl_wrmsr(vcpu, MSR_IA32_DEBUGCTLMSR, data);
> > > > +		data &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
> > > > +		invalid &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
> > > > +	}
> > > > +
> > > > +	if (invalid)
> > > > +		return false;
> > > > +
> > > > +	if (is_guest_mode(vcpu) && (get_vmcs12(vcpu)->vm_exit_controls &
> > > > +					VM_EXIT_SAVE_DEBUG_CONTROLS))
> > > > +		get_vmcs12(vcpu)->guest_ia32_debugctl = data;
> > > > +
> > > > +	if (intel_pmu_lbr_is_enabled(vcpu) && !to_vmx(vcpu)->lbr_desc.event &&
> > > > +	    (data & DEBUGCTLMSR_LBR))
> > > > +		intel_pmu_create_guest_lbr_event(vcpu);
> > > > +
> > > > +	__vmx_set_guest_debugctl(vcpu, data);
> > > > +	return true;
> > > 
> > > Return 0/-errno, not true/false.
> > 
> > There are plenty of functions in this file and KVM that return boolean.
> 
> That doesn't make them "right".  For helpers that are obvious predicates, then
> absolutely use a boolean return value.  The names for nested_vmx_check_eptp()
> and vmx_control_verify() aren't very good, e.g. they should be
> nested_vmx_is_valid_eptp() and vmx_is_valid_control(), but the intent is good.
> 
> But for flows like modifying guest state, KVM should return 0/-errno.
> 
> > e.g.:
> > 
> > static bool nested_vmx_check_eptp(struct kvm_vcpu *vcpu, u64 new_eptp)
> > static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
> > static bool nested_evmcs_handle_vmclear(struct kvm_vcpu *vcpu, gpa_t vmptr)
> > static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> > 						 struct vmcs12 *vmcs12)
> 
> These two should return 0/-errno.
> 
>  
> > static bool nested_vmx_check_eptp(struct kvm_vcpu *vcpu, u64 new_eptp)
> > static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
> 
> Probably should return 0/-errno, but nested_get_vmcs12_pages() is a bit of a mess.

I am not going to argue with you about this; let it be.
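
For the record, my understanding of the split you are asking for (boolean for obvious
predicates, 0/-errno for flows that modify guest state) is roughly the following. This is
only a sketch with hypothetical names and a made-up supported-bits mask, not the code
from the patch:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mask of supported bits, for illustration only. */
#define SKETCH_SUPPORTED_DEBUGCTL	0x3ULL

/* Hypothetical, simplified vCPU state. */
struct sketch_vcpu {
	uint64_t guest_debugctl;
};

/* Obvious predicate: boolean return, "is_valid" in the name. */
static bool sketch_is_valid_debugctl(uint64_t data)
{
	return !(data & ~SKETCH_SUPPORTED_DEBUGCTL);
}

/* Flow that modifies guest state: 0 on success, -errno on failure. */
static int sketch_set_guest_debugctl(struct sketch_vcpu *vcpu, uint64_t data)
{
	if (!sketch_is_valid_debugctl(data))
		return -EINVAL;

	vcpu->guest_debugctl = data;
	return 0;
}
```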

> 
> > ...
> > 
> > 
> > I personally think that functions that emulate hardware should return boolean
> > values or some hardware specific status code (e.g VMX failure code) because
> > the real hardware never returns -EINVAL and such.
> 
> Real hardware absolutely "returns" granular error codes.  KVM even has informal
> mappings between some of them, e.g. -EINVAL == #GP, -EFAULT == #PF, -EOPNOTSUPP == #UD,
> BUG() == 3-strike #MC.
> 
> And hardware has many more ways to report errors to software. E.g. VMLAUNCH can
> #UD, #GP(0), VM-Exit, VMfailInvalid, or VMfailValid with 30+ unique reasons.  #MC
> has a crazy number of possible error encodings.  And so on and so forth.
> 
> Software visible error codes aside, comparing individual KVM functions to an
> overall CPU is wildly misguided.  A more appropriate comparison would be between
> a KVM function and the ucode for a single instruction/operation.  I highly, highly
> doubt ucode flows are limited to binary yes/no outputs.

I don't think you understood my point: I only pointed out that real hardware will never return
things like -EINVAL.

I never claimed that real hardware does not return error codes. It of course does; VMX,
for instance, can return something like 77 different error codes.

What I said is that functions that emulate hardware should return either a boolean, in case
the hardware only accepts/rejects the action, or hardware-specific error codes, because
I think that it's a bit confusing to map between hardware error codes and kernel error codes.
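
To show what I mean by the mapping, here is the informal correspondence you listed above,
written out as code. The enum and the function are hypothetical and exist only for this
illustration; as far as I know KVM has no such helper:

```c
#include <errno.h>

/* Hypothetical fault identifiers, for illustration only. */
enum sketch_fault {
	SKETCH_NO_FAULT,
	SKETCH_GP,	/* #GP */
	SKETCH_PF,	/* #PF */
	SKETCH_UD,	/* #UD */
	SKETCH_UNMAPPED	/* no informal mapping was mentioned */
};

/* Informal mapping: -EINVAL == #GP, -EFAULT == #PF, -EOPNOTSUPP == #UD. */
static enum sketch_fault sketch_errno_to_fault(int err)
{
	switch (err) {
	case 0:
		return SKETCH_NO_FAULT;
	case -EINVAL:
		return SKETCH_GP;
	case -EFAULT:
		return SKETCH_PF;
	case -EOPNOTSUPP:
		return SKETCH_UD;
	default:
		return SKETCH_UNMAPPED;
	}
}
```

The reader of a caller has to keep this table in their head, which is the part I find
confusing.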

In the case of an MSR write, the hardware response is more or less boolean: the hardware
either accepts the write or raises #GP.

Yes, I understand that hardware can in theory also #UD, or silently ignore the write, etc.,
so I am not going to argue about this; let it be.

AFAIK the KVM convention for MSR writes is that 1 means #GP, 0 means success, and a negative
value exits to userspace as a KVM internal error. Not very developer friendly IMHO; there is
room for improvement here.

And I see that we now also have KVM_MSR_RET_UNSUPPORTED and KVM_MSR_RET_FILTERED.
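
Spelled out, the three-way convention as I understand it looks like this. It is a sketch
with hypothetical names; it deliberately leaves out KVM_MSR_RET_UNSUPPORTED and
KVM_MSR_RET_FILTERED, and treats any other positive value as unspecified rather than
guessing:

```c
/* Hypothetical outcome labels, for illustration only. */
enum sketch_msr_outcome {
	SKETCH_MSR_SUCCESS,		/* 0: the write was accepted */
	SKETCH_MSR_INJECT_GP,		/* 1: inject #GP into the guest */
	SKETCH_MSR_EXIT_TO_USERSPACE,	/* < 0: internal error to userspace */
	SKETCH_MSR_UNSPECIFIED		/* anything else: not covered here */
};

/* Classify the return value of a (hypothetical) MSR write handler. */
static enum sketch_msr_outcome sketch_classify_wrmsr_ret(int ret)
{
	if (ret < 0)
		return SKETCH_MSR_EXIT_TO_USERSPACE;
	if (ret == 0)
		return SKETCH_MSR_SUCCESS;
	if (ret == 1)
		return SKETCH_MSR_INJECT_GP;

	return SKETCH_MSR_UNSPECIFIED;
}
```

Having one return value carry three unrelated meanings is what I find unfriendly.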

Thanks for the review,
Best regards,
	Maxim Levitsky
