On Wed, 2025-09-10 at 10:34 +0200, Vitaly Kuznetsov wrote:
> Khushit Shah <khushit.shah@xxxxxxxxxxx> writes:
> 
> > > On 8 Sep 2025, at 5:12 PM, Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> wrote:
> > > 
> > ...
> 
> > > Also, I've just recalled I fixed (well, 'workarounded') an issue
> > > similar to yours a while ago in QEMU:
> > > 
> > > commit 958a01dab8e02fc49f4fd619fad8c82a1108afdb
> > > Author: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
> > > Date:   Tue Apr 2 10:02:15 2019 +0200
> > > 
> > >     ioapic: allow buggy guests mishandling level-triggered
> > >     interrupts to make progress
> > > 
> > > maybe something has changed and it doesn't work anymore?
> > 
> > This is really interesting; we are facing a very similar issue, but
> > the interrupt storm only occurs when using split-irqchip. Using
> > kernel-irqchip, we do not even see consecutive level-triggered
> > interrupts of the same vector. From the logs it is clear that,
> > somehow, with kernel-irqchip L1 passes the interrupt to L2 to
> > service, but with split-irqchip L1 EOIs without servicing the
> > interrupt. As it is working properly on kernel-irqchip, we can't
> > really point to it as a Hyper-V issue. AFAIK, the kernel-irqchip
> > setting should be transparent to the guest; can you think of
> > anything that could change this?
> 
> The problem I fixed back then was also only visible with the split
> irqchip. The reason was:
> 
> """
> in-kernel IOAPIC implementation has commit 184564efae4d ("kvm:
> ioapic: conditionally delay irq delivery during eoi broadcast")
> """
> 
> so even though the guest cannot really distinguish between in-kernel
> and split irqchips, the small differences in implementation can make
> a big difference in the observed behavior. If we re-assert an
> improperly handled level-triggered interrupt too fast, the guest is
> not able to make much progress, but if we let it execute for even
> the tiniest fraction of time, then forward progress happens.
> 
> I don't know exactly what happens in this particular case, but I'd
> suggest you try to artificially delay re-asserting level-triggered
> interrupts and see what happens.

We know that QEMU reasserts INTx interrupts too soon anyway. The
in-kernel irqchip will trigger the VFIO resamplefd when the interrupt
is EOI'd in the I/O APIC, as $DEITY intended. QEMU, on the other hand,
will unmap the device BARs when the interrupt happens and intercept
subsequent accesses, triggering the VFIO resamplefd as soon as the
next access happens — even before it's EOI'd.

Could that be making a difference here?

I guess, in theory, "too soon" probably shouldn't matter if it's all
handled correctly elsewhere — it should get masked again in the
hardware and the pending status tracked correctly until it's
redelivered to the guest(s). But it's probably worth testing, given
that's one of the big behavioural differences between the kernel and
userspace I/O APIC?

It's somewhat non-trivial to fix it 'properly' across all of QEMU's
interrupt controllers and IRQ abstractions, but hacking something up
which does the right thing just for this x86 platform and I/O APIC,
and avoids the current MMIO-unmapping abomination, might be worth a
test?
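
In case it helps to reason about the 'artificial delay' suggestion
above, below is a tiny standalone toy model (plain C, deliberately not
QEMU code; every name and constant in it is invented for illustration)
of a buggy guest that EOIs a level-triggered interrupt without
servicing it. With immediate re-assertion the storm never ends; with
even a small grace period before re-asserting (roughly the idea behind
QEMU commit 958a01dab8e02 and kernel commit 184564efae4d above), the
guest eventually gets enough cycles to service the device:

/*
 * Toy model of a level-triggered interrupt that a buggy guest EOIs
 * without servicing.  'grace_cycles' is how long the guest gets to run
 * before the still-pending interrupt is re-asserted.  With 0 it never
 * makes progress; with any non-zero grace period it eventually does.
 * All names and numbers here are invented for illustration only.
 */
#include <stdbool.h>
#include <stdio.h>

#define WORK_NEEDED 50    /* guest cycles needed to actually service the irq */
#define MAX_EOIS    1000  /* give up and call it a storm after this many EOIs */

static void simulate(const char *name, int grace_cycles)
{
    bool line_asserted = true;  /* device keeps the line high until serviced */
    int progress = 0;           /* how much useful work the guest has done   */
    int eois = 0;

    while (line_asserted && eois < MAX_EOIS) {
        /* Interrupt injected; the buggy guest EOIs without servicing. */
        eois++;

        /* The guest only runs for whatever grace period it is given
         * before the next injection of the still-asserted interrupt. */
        progress += grace_cycles;
        if (progress >= WORK_NEEDED) {
            line_asserted = false;  /* finally serviced; the line drops */
        }
    }

    if (line_asserted) {
        printf("%-22s interrupt storm (%d EOIs, no forward progress)\n",
               name, eois);
    } else {
        printf("%-22s serviced after %d EOIs\n", name, eois);
    }
}

int main(void)
{
    simulate("immediate re-assert:", 0);  /* what the report looks like      */
    simulate("delayed re-assert:", 10);   /* artificial delay, as suggested  */
    return 0;
}

Build and run with 'cc toy.c -o toy && ./toy'; the first line of output
corresponds to the storm, the second to the workaround.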