On Fri, Mar 21, 2025 at 12:49:42PM +0100, Paolo Bonzini wrote:
> On Wed, Mar 19, 2025 at 5:17 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > Yan posted a patch to fudge around the issue[*], I strongly objected (and still
> > object) to making a functional and confusing code change to fudge around a lockdep
> > false positive.
>
> In that thread I had made another suggestion, which Yan also tried,
> which was to use subclasses:
>
> - in the sched_out path, which cannot race with the others:
>   raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu), 1);
>
> - in the irq and sched_in paths, which can race with each other:
>   raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));

Hi Paolo, Sean, Maxim,

The sched_out path may still race with the sched_in path, e.g.

  CPU 0                    CPU 1
  -----------------        ---------------
  vCPU 0 sched_out
  vCPU 1 sched_in
  vCPU 1 sched_out
                           vCPU 0 sched_in

vCPU 0 sched_in may race with vCPU 1 sched_out on CPU 0's wakeup list.

So, the situation is

  sched_in,  sched_out: race
  sched_in,  irq:       race
  sched_out, irq:       mutually exclusive, do not race

Hence, do you think the subclass assignments below are reasonable?

  irq:       subclass 0
  sched_out: subclass 1
  sched_in:  subclasses 0 and 1

Inspired by Sean's solution, I made the patch below to inform lockdep that
the sched_in path involves both subclasses 0 and 1 by adding a line
"spin_acquire(&spinlock->dep_map, 1, 0, _RET_IP_)".

I like it because it accurately conveys the situation to lockdep :)

What are your thoughts?

Thanks
Yan

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index ec08fa3caf43..c5684225255a 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -89,9 +89,12 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 	 * current pCPU if the task was migrated.
 	 */
 	if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
-		raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+		raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu);
+
+		raw_spin_lock(spinlock);
+		spin_acquire(&spinlock->dep_map, 1, 0, _RET_IP_);
 		list_del(&vmx->pi_wakeup_list);
-		raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+		spin_release(&spinlock->dep_map, _RET_IP_);
+		raw_spin_unlock(spinlock);
 	}
 
 	dest = cpu_physical_id(cpu);
@@ -152,7 +155,7 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
 
 	local_irq_save(flags);
 
-	raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+	raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu), 1);
 	list_add_tail(&vmx->pi_wakeup_list,
 		      &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
 	raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
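
P.S. In case it helps to see the proposed annotation scheme in isolation, here is
a minimal, self-contained sketch of the same pattern outside of KVM. The demo_*
names are made up for illustration and are not the actual KVM code; the
spin_acquire()/spin_release() annotations compile away when lockdep is disabled.
The irq path takes the default subclass 0, the sched_out path takes subclass 1
via raw_spin_lock_nested(), and the sched_in path takes subclass 0 and then
tells lockdep it also holds subclass 1.

#include <linux/list.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

/*
 * One wakeup-style list per CPU, protected by a per-CPU raw spinlock.
 * Both must be initialized at boot with raw_spin_lock_init() and
 * INIT_LIST_HEAD() (omitted here for brevity).
 */
static DEFINE_PER_CPU(raw_spinlock_t, demo_lock);
static DEFINE_PER_CPU(struct list_head, demo_list);

/* irq path: plain acquisition, i.e. subclass 0. */
static void demo_irq_path(int cpu, struct list_head *entry)
{
	raw_spin_lock(&per_cpu(demo_lock, cpu));
	list_del_init(entry);
	raw_spin_unlock(&per_cpu(demo_lock, cpu));
}

/* sched_out path: subclass 1, since it cannot race with the irq path. */
static void demo_sched_out_path(int cpu, struct list_head *entry)
{
	raw_spin_lock_nested(&per_cpu(demo_lock, cpu), 1);
	list_add_tail(entry, &per_cpu(demo_list, cpu));
	raw_spin_unlock(&per_cpu(demo_lock, cpu));
}

/*
 * sched_in path: acquire as subclass 0, then record an additional
 * subclass-1 acquisition for lockdep, because it can race with both
 * the irq path and a remote CPU's sched_out path.
 */
static void demo_sched_in_path(int cpu, struct list_head *entry)
{
	raw_spinlock_t *lock = &per_cpu(demo_lock, cpu);

	raw_spin_lock(lock);
	spin_acquire(&lock->dep_map, 1, 0, _RET_IP_);
	list_del_init(entry);
	spin_release(&lock->dep_map, _RET_IP_);
	raw_spin_unlock(lock);
}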