On Wed, Jul 23 2025 at 09:50, Hogan Wang wrote:

I have no idea what that subject line means.

> When the VM irqbalance service adjusts interrupt affinity
> frequently, the VM repeatedly masks MSI-x interrupts.

What has this to do with frequently? The point is that the interrupt is
masked at the PCI level for changing the affinity, which causes a VMEXIT
and activates the VFIO horrorshow via QEMU.

> During the guest kernel masking MSI-x interrupts, VM exits
> to the Hypervisor.
>
> The Qemu emulator implements the switching between
> kvm_interrupt and qemu_interrupt to achieve MSI-x PBA
> capability.

What's achieved here?

> When the Qemu emulator calls the vfio_msi_set_vector_signal
> interface to switch the kvm_interrupt and qemu_interrupt
> eventfd, it releases and requests IRQs, and correspondingly
> clears and initializes the CPU Vector.
>
> When initializing the CPU Vector, if an unhandled interrupt
> in the APIC is delivered to the kernel, the __common_interrupt
> function is called to handle the interrupt.

I really don't know what that means. Documentation clearly asks you to
provide a proper description of multi-CPU race conditions.

https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#patch-submission-notes

I've reverse engineered this word salad and I have to tell you that this
is not a completely VFIO specific problem. VFIO just makes it more likely
to trigger and adds some VFIO specific twist on top.

> Since the call_irq_handler function assigns vector_irq[vector]

Please use proper function annotation, i.e.:

  Since call_irq_handler() assigns ...

> to VECTOR_UNUSED without lock protection, the assignment of
> vector_irq[vector] and the initialization of the CPU Vector
> are concurrent, leading to vector_irq[vector] being mistakenly
> set to VECTOR_UNUSED.

It's not mistakenly. It's the obvious consequence. As you pointed out
correctly there is no lock protection, so why not fix that in the first
place?

> This ultimately results in the inability of VFIO passthrough
> device interrupts to be delivered, causing packet loss in
> network devices or IO hangs in disk devices.
>
> This patch detects and repairs vector_irq[vector] after the

  # git grep 'This patch' Documentation/process/

> interrupt initialization is completed, ensuring that
> vector_irq[vector] can be corrected if it is mistakenly set
> to VECTOR_UNUSED.

That's a patently bad idea and does not even work under all
circumstances. See below.

> +static void x86_vector_repair(struct irq_domain *dom, struct irq_data *irqd)
> +{
> +	struct apic_chip_data *apicd = apic_chip_data(irqd);
> +	struct irq_desc *desc = irq_data_to_desc(irqd);
> +	unsigned int vector = apicd->vector;
> +	unsigned int cpu = apicd->cpu;
> +	unsigned long flags;
> +
> +	raw_spin_lock_irqsave(&vector_lock, flags);
> +	if (per_cpu(vector_irq, cpu)[vector] != desc) {
> +		per_cpu(vector_irq, cpu)[vector] = desc;
> +		pr_warn("irq %u: repair vector %u.%u\n",
> +			irqd->irq, cpu, vector);
> +	}
> +	raw_spin_unlock_irqrestore(&vector_lock, flags);
> +}

> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -517,6 +517,8 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_core_device *vdev,
>  	}
>  	ctx->trigger = trigger;
>
> +	repair_irq(irq);
> +

How is that supposed to cure the problem completely?

Let me reverse engineer the actual problem you are trying to solve from
the breadcrumbs you provided:

CPU0
    vmenter(vCPU0)
    ....
    msi_set_affinity()
       mask(MSI-X)
    vmexit()
    ...
    free_irq()
 #1   mask();
 #2   __synchronize_irq()
      msi_domain_deactivate()
 #3      write_msg(0);
      x86_vector_deactivate()
 #4      per_cpu(vector_irq, cpu)[vector] = VECTOR_SHUTDOWN;
    request_irq()
      x86_vector_activate()
 #5      per_cpu(vector_irq, cpu)[vector] = desc;
      msi_domain_activate()
 #6      write_msg(msg);
 #7      unmask();

After #1 the device _cannot_ raise the original vector anymore. After #7
the device _can_ raise an interrupt on the new vector/target CPU pair,
which might be the same as the previous one.

So the only case where this causes the problem you describe is when

 A) the device raises the old vector _before_ #1. If it raises the
    interrupt _after_ #1, the device is broken.

 B) the old target CPU is delayed handling the interrupt (interrupts
    disabled, NMI, ....)

 C) As a consequence #2 - __synchronize_irq() - cannot observe the
    interrupt handler being executed on the target CPU and x86 has no
    way to query the APIC IRR of a remote CPU to detect the delayed
    case.

 D) x86_vector_deactivate() sets vector_irq to VECTOR_SHUTDOWN

 E) the old target CPU handles the interrupt, observes VECTOR_SHUTDOWN
    and is delayed again

 F) request_irq() gets the same vector/target CPU combo back and writes
    the descriptor into vector_irq

 G) the old target CPU writes VECTOR_UNUSED

In a proper side by side flow the broken case looks like this:

CPU0                                            CPU1
    vmenter(vCPU0)
    ....
    msi_set_affinity()
       mask(MSI-X)
    vmexit()
    ...
                                                interrupt is raised in APIC
                                                but not handled
    free_irq()
      mask();
      __synchronize_irq()
      msi_domain_deactivate()
         write_msg(0);
      x86_vector_deactivate()
         per_cpu(vector_irq, cpu)[vector] = VECTOR_SHUTDOWN;
    request_irq()
                                                interrupt is handled and
                                                observes VECTOR_SHUTDOWN
      x86_vector_activate()
         per_cpu(vector_irq, cpu)[vector] = desc;
                                                writes VECTOR_UNUSED
      msi_domain_activate()
         write_msg(msg);
         unmask();

That's the kind of analysis which needs to be provided and is required
to understand the root cause.

And if you look carefully at that analysis, then this is even a problem
for regular host side device drivers:

CPU0                                            CPU1
                                                interrupt is raised in APIC
                                                but not handled
    disable_irq_in_device();
    free_irq()
      mask();
      __synchronize_irq()
      msi_domain_deactivate()
         write_msg(0);
      x86_vector_deactivate()
         per_cpu(vector_irq, cpu)[vector] = VECTOR_SHUTDOWN;
    request_irq()
                                                interrupt is handled and
                                                observes VECTOR_SHUTDOWN
      x86_vector_activate()
         per_cpu(vector_irq, cpu)[vector] = desc;
                                                writes VECTOR_UNUSED
      msi_domain_activate()
         write_msg(msg);
         unmask();
    enable_irq_in_device();

See?

Now what to do about that?

Definitely not hacking some ill defined repair function into the code,
which is neither guaranteed to work nor fixes the general problem. Worse,
it exposes to drivers a functionality which should not be there in the
first place, and they then go and invoke it randomly and for the very
wrong reasons.

As you described correctly, there is a lack of locking in the x86
interrupt entry code. That's the obvious thing to fix. See uncompiled and
untested patch below. That solves the general overwrite problem _and_
does not rely on an interrupt sent by the device right afterwards.

But it does not and _cannot_ solve the other VFIO specific problem, which
comes with free_irq()/request_irq() on an active device:

CPU0                                            CPU1
    vmenter(vCPU0)
    ....
    msi_set_affinity()
       mask(MSI-X)
    vmexit()
 #1 ...
                                                interrupt is raised in APIC
                                                but not handled
    free_irq()
      mask();
      __synchronize_irq()
      msi_domain_deactivate()
         write_msg(0);
      x86_vector_deactivate()
         per_cpu(vector_irq, cpu)[vector] = VECTOR_SHUTDOWN;
 #2                                             interrupt is handled and
                                                observes VECTOR_SHUTDOWN
                                                writes VECTOR_UNUSED
    request_irq()
      x86_vector_activate()
         per_cpu(vector_irq, cpu)[vector] = desc;
      msi_domain_activate()
         write_msg(msg);
         unmask();

#2 discards the interrupt as spurious _after_ shutdown and acknowledges
the APIC with EOI. That means the interrupt is lost.

So when the device logic is:

    raise_interrupt()
	if (!wait_for_driver_ack) {
		wait_for_driver_ack = true;
		send_msi_message();
	}

then the device waits forever, or in the best case until a timeout or
some driver interaction causes the interrupt to be handled and the
device status register to be read, which clears 'wait_for_driver_ack'.

That's not a theoretical case, that's what real world hardware devices
implement. If it does not apply to your device, that does not mean that
the problem does not exist.

For a regular device driver this is a non-problem once the locking fix
is applied. But for VFIO this _is_ an unfixable problem and the magic
repair hack can't fix it either. I've pointed out a gazillion times
before that freeing an interrupt without quiescing the device interrupt
before that is a patently bad idea.

Unless VFIO/QEMU has some secret magic to handle that particular case,
the brute force workaround for this is to unconditionally inject the
interrupt in QEMU after returning from the VFIO syscall. Maybe that's
the case today already, but I can't be bothered to stare at that code.
That's an exercise left for the virt folks. If that exists, then this
wants to be explained in the change log for completeness' sake and
ideally in a comment in that VFIO function.

As the fix here is x86 specific, I looked at other architectures as
well. AFAICT on a quick skim, it seems (no guarantee though) that none
of them is affected by the generic issue, but _all_ of them are affected
by the VFIO specific one.

Thanks,

        tglx
---
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -256,26 +256,46 @@ static __always_inline void handle_irq(s
 	__handle_irq(desc, regs);
 }
 
-static __always_inline int call_irq_handler(int vector, struct pt_regs *regs)
+static struct irq_desc *reevaluate_vector(int vector)
 {
-	struct irq_desc *desc;
-	int ret = 0;
+	struct irq_desc *desc = __this_cpu_read(vector_irq[vector]);
+
+	if (!IS_ERR_OR_NULL(desc))
+		return desc;
+
+	if (desc == VECTOR_UNUSED) {
+		pr_emerg_ratelimited("%d.%u No irq handler for vector\n",
+				     smp_processor_id(), vector);
+	} else {
+		__this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
+	}
+	return NULL;
+}
+
+static __always_inline bool call_irq_handler(int vector, struct pt_regs *regs)
+{
+	struct irq_desc *desc = __this_cpu_read(vector_irq[vector]);
 
-	desc = __this_cpu_read(vector_irq[vector]);
 	if (likely(!IS_ERR_OR_NULL(desc))) {
 		handle_irq(desc, regs);
-	} else {
-		ret = -EINVAL;
-		if (desc == VECTOR_UNUSED) {
-			pr_emerg_ratelimited("%s: %d.%u No irq handler for vector\n",
-					     __func__, smp_processor_id(),
-					     vector);
-		} else {
-			__this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
-		}
+		return true;
 	}
 
-	return ret;
+	/*
+	 * Reevaluate with vector_lock held.
+	 *
+	 * FIXME: Add a big fat comment explaining the problem
+	 */
+	lock_vector_lock();
+	desc = reevaluate_vector(vector);
+	unlock_vector_lock();
+
+	if (!desc)
+		return false;
+
+	/* Using @desc is safe here as it is RCU protected */
+	handle_irq(desc, regs);
+	return true;
 }
 
 /*
@@ -289,7 +309,7 @@ DEFINE_IDTENTRY_IRQ(common_interrupt)
 	/* entry code tells RCU that we're not quiescent. Check it. */
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
 
-	if (unlikely(call_irq_handler(vector, regs)))
+	if (unlikely(!call_irq_handler(vector, regs)))
 		apic_eoi();
 
 	set_irq_regs(old_regs);
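
For illustration, the brute force workaround described above could look
roughly like this on the userspace side. This is only a sketch, not
actual QEMU code: switch_msix_trigger(), its parameters and the plain
eventfd write are made up for the example; real QEMU routes the trigger
eventfd through its own notifier infrastructure. The point is merely the
shape: rewire the vector via VFIO_DEVICE_SET_IRQS, then unconditionally
kick the trigger eventfd once so an interrupt, which the host kernel
discarded during the free_irq()/request_irq() window, is not lost to the
guest.

/*
 * Userspace sketch only. Not QEMU code: function name, parameters and
 * buffer handling are invented for illustration.
 */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int switch_msix_trigger(int device_fd, unsigned int vector, int32_t trigger_fd)
{
	size_t sz = sizeof(struct vfio_irq_set) + sizeof(int32_t);
	struct vfio_irq_set *set = calloc(1, sz);
	uint64_t one = 1;
	int ret = -1;

	if (!set)
		return -1;

	/* Route MSI-X @vector of the device to @trigger_fd */
	set->argsz = sz;
	set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	set->index = VFIO_PCI_MSIX_IRQ_INDEX;
	set->start = vector;
	set->count = 1;
	memcpy(set->data, &trigger_fd, sizeof(trigger_fd));

	/* This ends up in free_irq()/request_irq() on the host side */
	if (ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set))
		goto out;

	/*
	 * Brute force: kick the trigger eventfd once after the switch. If
	 * the host discarded a device interrupt as spurious during the
	 * free_irq()/request_irq() window, this keeps the guest driver
	 * going. Otherwise the guest sees one extra spurious interrupt,
	 * which it typically has to tolerate anyway.
	 */
	if (write(trigger_fd, &one, sizeof(one)) == sizeof(one))
		ret = 0;
out:
	free(set);
	return ret;
}

Whether current QEMU already does something equivalent is exactly the
open question raised above.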