On Wed, 2025-05-07 at 10:18 -0700, Sean Christopherson wrote:
> On Thu, May 01, 2025, mlevitsk@xxxxxxxxxx wrote:
> > On Tue, 2025-04-22 at 16:33 -0700, Sean Christopherson wrote:
> > > > @@ -2653,11 +2654,17 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> > > >  	if (vmx->nested.nested_run_pending &&
> > > >  	    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) {
> > > >  		kvm_set_dr(vcpu, 7, vmcs12->guest_dr7);
> > > > -		vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl);
> > > > +		new_debugctl = vmcs12->guest_ia32_debugctl;
> > > >  	} else {
> > > >  		kvm_set_dr(vcpu, 7, vcpu->arch.dr7);
> > > > -		vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.pre_vmenter_debugctl);
> > > > +		new_debugctl = vmx->nested.pre_vmenter_debugctl;
> > > >  	}
> > > > +
> > > > +	if (CC(!vmx_set_guest_debugctl(vcpu, new_debugctl, false))) {
> > >
> > > The consistency check belongs in nested_vmx_check_guest_state(), only needs to
> > > check the VM_ENTRY_LOAD_DEBUG_CONTROLS case, and should be posted as a separate
> > > patch.
> >
> > I can move it there. Can you explain why though you want this? Is it because of the
> > order of checks specified in the PRM?
>
> To be consistent with how KVM checks guest state. The two checks in prepare_vmcs02()
> are special cases. vmx_guest_state_valid() consumes a huge variety of state, and
> so replicating all of its logic for vmcs12 isn't worth doing. The check on the
> kvm_set_msr() for guest_ia32_perf_global_ctrl exists purely so that KVM doesn't
> simply ignore the return value.
>
> And to a lesser degree, because KVM assumes that guest state has been sanitized
> after nested_vmx_check_guest_state() is called. Violating that risks introducing
> bugs, e.g. consuming vmcs12->guest_ia32_debugctl before it's been vetted could
> theoretically be problematic.
>
> > Currently GUEST_IA32_DEBUGCTL of the host is *written* in prepare_vmcs02.
> > Should I also move this write to nested_vmx_check_guest_state?
>
> No. nested_vmx_check_guest_state() verifies the incoming vmcs12 state,
> prepare_vmcs02() merges the vmcs12 state with KVM's desires and fills vmcs02.
>
> > Or should I write the value blindly in prepare_vmcs02 and then check the value
> > of 'vmx->msr_ia32_debugctl' in nested_vmx_check_guest_state and fail if the value
> > contains reserved bits?
>
> I don't follow. nested_vmx_check_guest_state() is called before prepare_vmcs02().

My mistake, I for some reason thought that nested_vmx_check_guest_state is called
from prepare_vmcs02(). Your explanation now makes sense.

>
> > > > +bool vmx_set_guest_debugctl(struct kvm_vcpu *vcpu, u64 data, bool host_initiated)
> > > > +{
> > > > +	u64 invalid = data & ~vmx_get_supported_debugctl(vcpu, host_initiated);
> > > > +
> > > > +	if (invalid & (DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR)) {
> > > > +		kvm_pr_unimpl_wrmsr(vcpu, MSR_IA32_DEBUGCTLMSR, data);
> > > > +		data &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
> > > > +		invalid &= ~(DEBUGCTLMSR_BTF|DEBUGCTLMSR_LBR);
> > > > +	}
> > > > +
> > > > +	if (invalid)
> > > > +		return false;
> > > > +
> > > > +	if (is_guest_mode(vcpu) && (get_vmcs12(vcpu)->vm_exit_controls &
> > > > +					VM_EXIT_SAVE_DEBUG_CONTROLS))
> > > > +		get_vmcs12(vcpu)->guest_ia32_debugctl = data;
> > > > +
> > > > +	if (intel_pmu_lbr_is_enabled(vcpu) && !to_vmx(vcpu)->lbr_desc.event &&
> > > > +	    (data & DEBUGCTLMSR_LBR))
> > > > +		intel_pmu_create_guest_lbr_event(vcpu);
> > > > +
> > > > +	__vmx_set_guest_debugctl(vcpu, data);
> > > > +	return true;
> > >
> > > Return 0/-errno, not true/false.
> >
> > There are plenty of functions in this file and KVM that return boolean.
>
> That doesn't make them "right". For helpers that are obvious predicates, then
> absolutely use a boolean return value. The names for nested_vmx_check_eptp()
> and vmx_control_verify() aren't very good, e.g. they should be
> nested_vmx_is_valid_eptp() and vmx_is_valid_control(), but the intent is good.
>
> But for flows like modifying guest state, KVM should return 0/-errno.
>
> > e.g:
> >
> > static bool nested_vmx_check_eptp(struct kvm_vcpu *vcpu, u64 new_eptp)
> > static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
> > static bool nested_evmcs_handle_vmclear(struct kvm_vcpu *vcpu, gpa_t vmptr)
> > static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> >                                                  struct vmcs12 *vmcs12)
>
> These two should return 0/-errno.
> >
> > static bool nested_vmx_check_eptp(struct kvm_vcpu *vcpu, u64 new_eptp)
> > static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
>
> Probably should return 0/-errno, but nested_get_vmcs12_pages() is a bit of a mess.

I am not going to argue with you about this, let it be.

>
> > ...
> >
> >
> > I personally think that functions that emulate hardware should return boolean
> > values or some hardware specific status code (e.g VMX failure code) because
> > the real hardware never returns -EINVAL and such.
>
> Real hardware absolutely "returns" granular error codes. KVM even has informal
> mappings between some of them, e.g. -EINVAL == #GP, -EFAULT == #PF, -EOPNOTSUPP == #UD,
> BUG() == 3-strike #MC.
>
> And hardware has many more ways to report errors to software. E.g. VMLAUNCH can
> #UD, #GP(0), VM-Exit, VMfailInvalid, or VMFailValid with 30+ unique reasons. #MC
> has a crazy number of possible error encodings. And so on and so forth.
>
> Software visible error codes aside, comparing individual KVM functions to an
> overall CPU is wildly misguided. A more appropriate comparison would be between
> a KVM function and the ucode for a single instruction/operation. I highly, highly
> doubt ucode flows are limited to binary yes/no outputs.

I don't think you understood my point - I just pointed out that real hardware
will never return things like -EINVAL. I never have claimed that real hardware
never does return error codes - it of course does, like indeed VMX can return
something like 77 different error codes.
So I said that functions that emulate hardware should return either a boolean,
in case hardware only accepts/rejects the action, or hardware-specific error
codes, because I think that it's a bit confusing to map hardware error codes
to kernel error codes.

In the case of an MSR write, the hardware response is more or less boolean -
hardware either accepts the write or raises #GP. Yes, I understand that
hardware can in theory also #UD, or silently ignore the write, etc., so I am
not going to argue about this, let it be.

AFAIK the KVM convention for MSR writes is that 1 is #GP, 0 is success, and a
negative value exits to userspace as a KVM internal error. Not very developer
friendly IMHO, there is room for improvement here. And I see that we now also
have KVM_MSR_RET_UNSUPPORTED and KVM_MSR_RET_FILTERED.

Thanks for the review,
Best regards,
	Maxim Levitsky
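
To make the two conventions discussed in this thread concrete, here is a
minimal standalone sketch. It is not KVM code: every toy_* name and TOY_*
constant is hypothetical and exists only for illustration. It assumes the
split described above (bool for obvious predicates, 0/-errno for helpers
that modify state) and the MSR-write convention as summarized in this mail
(0 is success, 1 is #GP, negative goes out to userspace), using the informal
-EINVAL == #GP mapping.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for illustration only; these are NOT KVM definitions. */
#define TOY_DEBUGCTL_LBR	(1ULL << 0)
#define TOY_DEBUGCTL_BTF	(1ULL << 1)
#define TOY_SUPPORTED		(TOY_DEBUGCTL_LBR | TOY_DEBUGCTL_BTF)

/* Hypothetical guest state that the setter below modifies. */
static uint64_t toy_debugctl;

/* Obvious predicate: a bool return reads naturally at the call site. */
static bool toy_is_valid_debugctl(uint64_t data)
{
	return !(data & ~TOY_SUPPORTED);
}

/* State-modifying helper: returns 0 on success, -errno on failure. */
static int toy_set_debugctl(uint64_t data)
{
	if (!toy_is_valid_debugctl(data))
		return -EINVAL;		/* unsupported bits are set */

	toy_debugctl = data;
	return 0;
}

/*
 * WRMSR-style caller, following the convention as described in this
 * thread: 0 = success, 1 = inject #GP, negative = exit to userspace.
 * -EINVAL is translated to 1, any other error is passed through.
 */
static int toy_wrmsr_debugctl(uint64_t data)
{
	int r = toy_set_debugctl(data);

	if (r == -EINVAL)
		return 1;
	return r;
}
```

With this split, only the outermost WRMSR-style caller needs to know about
the "1 means #GP" encoding; the setter itself keeps the usual kernel
0/-errno contract.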