On Sat, Aug 23, 2025 at 11:00:11AM +0800, Coiby Xu wrote:
Hi Marc,
If I understand correctly, you want to reproduce the issue yourself.
I finally managed to reproduce this issue by playing with the setup
shared by my colleague. Here are the five prerequisites to reproduce
the bug:
Hi Marc,
It turns out the host kernel and host machine are not absolute
prerequisites for reproducing the problem, but they matter because they
can make the problem much harder to reproduce. I also did a bisection
against QEMU to find out which commit makes the issue go away. For
details, please check the following inline comments.
1. Guest kernel
   Newer than commit b5712bf89b4b ("irqchip/gic-v3-its: Provide MSI
   parent for PCI/MSI[-X]")
2. Host kernel
   Relatively old ones like v6.10.0 have this issue; newer ones like
   v6.12.0 and v6.17.0 don't.
   It turns out that with the other conditions met, the latest host
   kernel (6.17.0-0.rc3) can still reproduce the issue, but it's much
   harder. For example, with the RHEL8 kernel
   4.18.0-372.9.1.el8.aarch64, I need to trigger a kernel crash at
   most 3 times to reproduce it, but with the Fedora rawhide kernel
   6.17.0-0.rc3.31.fc43.aarch64, in 3 out of 10 trials I couldn't
   reproduce the issue even after triggering a kernel crash 60
   consecutive times. For comparison, here are the numbers of kernel
   crashes needed to reproduce the issue in 10 trials:
RHEL8: 2 1 1 1 1 1 2 1 3 2
Fedora rawhide: 43 60 47 60 12 56 60 45 49 18
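   In case it helps with reproducing, a standard way to force a guest
   kernel crash for such trials is sysrq (assuming kdump or similar is
   set up so the guest recovers between runs):

       # run as root inside the guest; this panics the kernel
       $ echo 1 > /proc/sys/kernel/sysrq
       $ echo c > /proc/sysrq-trigger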
3. QEMU <= v6.2
   I did a bisection and it shows the issue is gone with QEMU commit
   f39b7d2b96e3e73c01bb678cd096f7baf0b9ab39 ("kvm: Atomic memslot
   updates"), which is the last (3rd) patch of the patch set "KVM:
   allow listener to stop all vcpus before"
   https://lists.nongnu.org/archive/html/qemu-devel/2022-11/msg02172.html
   Note this commit only appears in QEMU > v7.2, so QEMU <= v7.2.0 can
   also reproduce this issue.
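   For reference, the bisection followed the standard git procedure; a
   minimal sketch, where the v8.0.0 endpoint is my assumption about
   the first tag that no longer reproduces:

       $ git bisect start --term-old=broken --term-new=fixed
       $ git bisect broken v7.2.0  # issue still reproduces here
       $ git bisect fixed v8.0.0   # assumed first tag without the issue
       # build and test each candidate QEMU, marking it broken or
       # fixed, until git reports the first fixed commit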
4. Specific host machines
   I'm not familiar with the hardware, so I haven't figured out yet
   what hardware factor makes the issue reproducible. I've attached
   the dmidecode outputs of four machines (files inside the
   dmidecode_host folder). Two systems (dmidecode_not_work*) can
   reproduce this issue and the other two (dmidecode_work*) can't,
   even though all of them have the same product name R152-P31-00, CPU
   model ARMv8 (M128-30) and SKU 01234567890123456789AB. One
   difference, which doesn't seem to appear in the dmidecode output,
   is that the two machines that can't reproduce the issue have the
   model name "PnP device PNP0c02", whereas the problematic machines
   have "R152-P31-00 (01234567890123456789AB)", according to our
   internal web pages that show the hardware info.
   It turns out all four machines can reproduce the issue. I tried to
   reproduce it 10 times on each and counted the number of kernel
   crashes needed; here's a comparison:
R152-P31-00: 2 1 1 1 1 1 2 1 3 2
PnP device PNP0c02: 8 3 5 15 11 18 2 5 12 4
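   For anyone comparing their own hosts, the fields above can be
   pulled directly with dmidecode's string keywords, e.g.:

       $ dmidecode -s system-product-name  # R152-P31-00 on all four
       $ dmidecode -s system-sku-number    # 01234567890123456789AB
       $ dmidecode -s processor-version    # CPU model string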
5. The guest needs to be bridged to a physical host interface.
   Bridging the guest to a tun interface can't reproduce the issue
   (for example, the default bridge (virbr0) created by libvirtd uses
   a tun interface).
   I tried triggering a kernel crash 100 consecutive times with virbr0
   in one trial but couldn't reproduce it, so I think bridging the
   guest to a physical network interface is still a must.
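   In case the exact setup matters, here is a minimal sketch of the
   bridge configuration I mean, with br0 and eth0 as placeholder names
   for your host:

       # create a bridge and enslave the physical NIC
       $ ip link add name br0 type bridge
       $ ip link set eth0 master br0
       $ ip link set br0 up
       # then attach the guest's NIC to br0 instead of virbr0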
[...]
--
Best regards,
Coiby