On Wed, Aug 13, 2025, Maxim Levitsky wrote:
> Fix a semi theoretical race condition in reading of page_ready_pending
> in kvm_arch_async_page_present_queued.

This needs to explain what can actually go wrong if the race is "hit".  After
staring at all of this for far, far too long, I'm 99.9% confident the race is
benign.

If the worker "incorrectly" sees pageready_pending as %false, then the result
is simply a spurious kick, and that spurious kick is all but guaranteed to be a
nop since if kvm_arch_async_page_present() is setting the flag, then (a) the
vCPU isn't blocking and (b) isn't IN_GUEST_MODE and thus won't be IPI'd.

If the worker incorrectly sees pageready_pending as %true, then the vCPU has
*just* written MSR_KVM_ASYNC_PF_ACK, and is guaranteed to observe and process
KVM_REQ_APF_READY before re-entering the guest, and the sole purpose of the
kick is to ensure the request is processed.

> Only trust the value of page_ready_pending if the guest is about to
> enter guest mode (vcpu->mode).

This is misleading, e.g. IN_GUEST_MODE can be true if the vCPU just *exited*.
All IN_GUEST_MODE says is that the vCPU task is somewhere in KVM's inner run
loop.

> To achieve this, read the vcpu->mode using smp_load_acquire which is
> paired with smp_store_release in vcpu_enter_guest.
>
> Then only if vcpu_mode is IN_GUEST_MODE, trust the value of the
> page_ready_pending because it was written before and therefore its correct
> value is visible.
>
> Also if the above mentioned check is true, avoid raising the request
> on the target vCPU.

Why?  At worst, a dangling KVM_REQ_APF_READY will cause KVM to bail from the
fastpath when it's not strictly necessary to do so.  On the other hand, a
missing request could hang the guest.  So I don't see any reason to try and be
super precise when setting KVM_REQ_APF_READY.

> Signed-off-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx>
> ---
>  arch/x86/kvm/x86.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9018d56b4b0a..3d45a4cd08a4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>
>  void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
>  {
> -	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> -	if (!vcpu->arch.apf.pageready_pending)
> +	/* Pairs with smp_store_release in vcpu_enter_guest. */
> +	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);

In terms of arch.apf.pageready_pending being modified, it's not IN_GUEST_MODE
that's special, it's OUTSIDE_GUEST_MODE that's special, because that's the only
time the task that holds vcpu->mutex can clear pageready_pending.

> +	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);

This should be paired with WRITE_ONCE() on the vCPU.

> +
> +	if (!in_guest_mode || !page_ready_pending) {
> +		kvm_make_request(KVM_REQ_APF_READY, vcpu);
>  		kvm_vcpu_kick(vcpu);
> +	}

Given that the race is guaranteed to be benign (assuming my analysis is
correct), I definitely think there should be a comment here explaining that
pageready_pending is "technically unstable".  Otherwise, it takes a lot of
staring to understand what this code is actually doing.

I also think it makes sense to do the bare minimum for OUTSIDE_GUEST_MODE,
which is to wake the vCPU.  Because calling kvm_vcpu_kick() when the vCPU is
known to not be IN_GUEST_MODE is weird.

For the code+comment, how about this?

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6bdf7ef0b535..d721fab3418d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4000,7 +4000,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (!guest_pv_has(vcpu, KVM_FEATURE_ASYNC_PF_INT))
 			return 1;
 		if (data & 0x1) {
-			vcpu->arch.apf.pageready_pending = false;
+			WRITE_ONCE(vcpu->arch.apf.pageready_pending, false);
 			kvm_check_async_pf_completion(vcpu);
 		}
 		break;
@@ -13457,7 +13457,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 	if ((work->wakeup_all || work->notpresent_injected) &&
 	    kvm_pv_async_pf_enabled(vcpu) &&
 	    !apf_put_user_ready(vcpu, work->arch.token)) {
-		vcpu->arch.apf.pageready_pending = true;
+		WRITE_ONCE(vcpu->arch.apf.pageready_pending, true);
 		kvm_apic_set_irq(vcpu, &irq, NULL);
 	}

@@ -13468,7 +13468,20 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
 {
 	kvm_make_request(KVM_REQ_APF_READY, vcpu);
-	if (!vcpu->arch.apf.pageready_pending)
+
+	/*
+	 * Don't kick the vCPU if it has an outstanding "page ready" event as
+	 * KVM won't be able to deliver the next "page ready" token until the
+	 * outstanding one is handled.  Ignore pageready_pending if the vCPU is
+	 * outside "guest mode", i.e. if KVM might be sending "page ready" or
+	 * servicing a MSR_KVM_ASYNC_PF_ACK write, as the flag is technically
+	 * unstable.  However, in that case, there's obviously no need to kick
+	 * the vCPU out of the guest, so just ensure the vCPU is awakened if
+	 * it's blocking.
+	 */
+	if (smp_load_acquire(&vcpu->mode) == OUTSIDE_GUEST_MODE)
+		kvm_vcpu_wake_up(vcpu);
+	else if (!READ_ONCE(vcpu->arch.apf.pageready_pending))
 		kvm_vcpu_kick(vcpu);
 }
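
To make the resulting behavior easier to eyeball, here is a rough user-space
model of the decision logic in the last hunk.  It is only a sketch and not
kernel code: struct model_vcpu, enum notify_action and
model_page_present_queued() are hypothetical names, C11 atomics stand in for
smp_load_acquire()/READ_ONCE(), and the function reports which notification
would be sent instead of sending one.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the vcpu->mode values discussed above. */
enum vcpu_mode { OUTSIDE_GUEST_MODE, IN_GUEST_MODE, EXITING_GUEST_MODE };

/* What the worker would do to the vCPU after queueing a completion. */
enum notify_action { NOTIFY_NONE, NOTIFY_WAKE_UP, NOTIFY_KICK };

/* Hypothetical, minimal model of the per-vCPU state the worker looks at. */
struct model_vcpu {
	_Atomic int mode;
	_Atomic bool pageready_pending;
	_Atomic bool req_apf_ready;	/* models KVM_REQ_APF_READY */
};

static enum notify_action model_page_present_queued(struct model_vcpu *vcpu)
{
	/* Always raise the request: a dangling request is cheap, a missing one is not. */
	atomic_store(&vcpu->req_apf_ready, true);

	/*
	 * Outside guest mode the flag may be changing underneath us, so
	 * ignore it and only make sure a blocking vCPU gets woken.
	 */
	if (atomic_load_explicit(&vcpu->mode, memory_order_acquire) ==
	    OUTSIDE_GUEST_MODE)
		return NOTIFY_WAKE_UP;

	/* No outstanding "page ready" event: kick so the request is processed. */
	if (!atomic_load_explicit(&vcpu->pageready_pending,
				  memory_order_relaxed))
		return NOTIFY_KICK;

	/* An event is outstanding; the guest's ACK will trigger processing. */
	return NOTIFY_NONE;
}
```

In this model the request flag is set on every path, and the only case with no
notification at all is "not OUTSIDE_GUEST_MODE and an event is outstanding",
which matches the comment in the proposed hunk.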