On Wed, 2025-09-10 at 10:34 +0200, Vitaly Kuznetsov wrote:
> Khushit Shah <khushit.shah@xxxxxxxxxxx> writes:
> 
> > > On 8 Sep 2025, at 5:12 PM, Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> wrote:
> > > 
> > ...
> 
> > > Also, I've just recalled I fixed (well, 'workarounded') an issue
> > > similar to yours a while ago in QEMU:
> > > 
> > > commit 958a01dab8e02fc49f4fd619fad8c82a1108afdb
> > > Author: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
> > > Date:   Tue Apr 2 10:02:15 2019 +0200
> > > 
> > >     ioapic: allow buggy guests mishandling level-triggered
> > >     interrupts to make progress
> > > 
> > > maybe something has changed and it doesn't work anymore?
> > 
> > This is really interesting; we are facing a very similar issue, but
> > the interrupt storm only occurs when using split-irqchip. Using
> > kernel-irqchip, we do not even see consecutive level-triggered
> > interrupts of the same vector. From the logs it is clear that,
> > somehow, with kernel-irqchip L1 passes the interrupt to L2 to
> > service, but with split-irqchip L1 EOIs without servicing the
> > interrupt. As it is working properly on kernel-irqchip, we can't
> > really point to it as a Hyper-V issue. AFAIK, the kernel-irqchip
> > setting should be transparent to the guest; can you think of
> > anything that could change this?
> 
> The problem I fixed back then was also only visible with the split
> irqchip. The reason was:
> 
> """
> in-kernel IOAPIC implementation has commit 184564efae4d ("kvm:
> ioapic: conditionally delay irq delivery during eoi broadcast")
> """
> 
> so even though the guest cannot really distinguish between in-kernel
> and split irqchips, the small differences in implementation can make
> a big difference in the observed behavior. If we re-assert an
> improperly handled level-triggered interrupt too fast, the guest is
> not able to make much progress, but if we let it execute for even
> the tiniest fraction of time, then forward progress happens.
> 
> I don't know exactly what happens in this particular case, but I'd
> suggest you try to artificially delay re-asserting level-triggered
> interrupts and see what happens.

We know that QEMU reasserts INTx interrupts too soon anyway. The
in-kernel irqchip will trigger the VFIO resamplefd when the interrupt
is EOI'd in the I/O APIC, as $DEITY intended. QEMU, on the other hand,
will unmap the device BARs when the interrupt happens and intercept
subsequent accesses, triggering the VFIO resamplefd as soon as the
next access happens — even before it's EOI'd.

Could that be making a difference here?

I guess, in theory, "too soon" probably shouldn't matter if it's all
handled correctly elsewhere — it should get masked again in the
hardware and the pending status tracked correctly until it's
redelivered to the guest(s). But it's probably worth testing, given
that's one of the big behavioural differences between the kernel and
userspace I/O APIC?

It's somewhat non-trivial to fix it 'properly' across all of QEMU's
interrupt controllers and IRQ abstractions, but hacking something up
which does the right thing just for this x86 platform and I/O APIC,
and avoids the current MMIO-unmapping abomination, might be worth a
test?
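
In case it helps to reason about the 'artificial delay' suggestion
above, below is a tiny standalone toy model (plain C, deliberately not
QEMU code; every name and constant in it is invented for illustration)
of a buggy guest that EOIs a level-triggered interrupt without
servicing it. With immediate re-assertion the storm never ends; with
even a small grace period before re-asserting (roughly the idea behind
QEMU commit 958a01dab8e02 and kernel commit 184564efae4d above), the
guest eventually gets enough cycles to service the device:

/*
 * Toy model of a level-triggered interrupt that a buggy guest EOIs
 * without servicing.  'grace_cycles' is how long the guest gets to run
 * before the still-pending interrupt is re-asserted.  With 0 it never
 * makes progress; with any non-zero grace period it eventually does.
 * All names and numbers here are invented for illustration only.
 */
#include <stdbool.h>
#include <stdio.h>

#define WORK_NEEDED 50    /* guest cycles needed to actually service the irq */
#define MAX_EOIS    1000  /* give up and call it a storm after this many EOIs */

static void simulate(const char *name, int grace_cycles)
{
    bool line_asserted = true;  /* device keeps the line high until serviced */
    int progress = 0;           /* how much useful work the guest has done   */
    int eois = 0;

    while (line_asserted && eois < MAX_EOIS) {
        /* Interrupt injected; the buggy guest EOIs without servicing. */
        eois++;

        /* The guest only runs for whatever grace period it is given
         * before the next injection of the still-asserted interrupt. */
        progress += grace_cycles;
        if (progress >= WORK_NEEDED) {
            line_asserted = false;  /* finally serviced; the line drops */
        }
    }

    if (line_asserted) {
        printf("%-22s interrupt storm (%d EOIs, no forward progress)\n",
               name, eois);
    } else {
        printf("%-22s serviced after %d EOIs\n", name, eois);
    }
}

int main(void)
{
    simulate("immediate re-assert:", 0);  /* what the report looks like      */
    simulate("delayed re-assert:", 10);   /* artificial delay, as suggested  */
    return 0;
}

Build and run with 'cc toy.c -o toy && ./toy'; the first line of output
corresponds to the storm, the second to the workaround.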