On Sat, Aug 23, 2025 at 11:00:11AM +0800, Coiby Xu wrote:
Hi Marc,
If I understand correctly, you want to reproduce the issue yourself.
I finally managed to reproduce this issue by playing with the setup
shared by my colleague. Here are the five prerequisites to reproduce
the bug:
Hi Marc,
It turns out the host kernel and host machine are not absolute
prerequisites for reproducing the problem, but they matter because they
can make the problem much harder to reproduce. I also did a bisection
against QEMU to find out which commit makes the issue go away. For
details, please check the following inline comments.
1. Guest kernel
   Newer than commit b5712bf89b4b ("irqchip/gic-v3-its: Provide MSI
   parent for PCI/MSI[-X]")
2. Host kernel
   Relatively old ones like v6.10.0 have this issue; newer ones like
   v6.12.0 and v6.17.0 don't.
   It turns out that with the other conditions met, the latest host
   kernel (6.17.0-0.rc3) can still reproduce the issue, but it's much
   harder. For example, with the RHEL8 kernel
   4.18.0-372.9.1.el8.aarch64, I need to trigger a kernel crash at
   most 3 times to reproduce it, but with the Fedora rawhide kernel
   6.17.0-0.rc3.31.fc43.aarch64, in 3 out of 10 trials I couldn't
   reproduce the issue even after triggering a kernel crash 60
   consecutive times. For comparison, here are the numbers of kernel
   crashes needed to reproduce the issue in 10 trials:
RHEL8: 2 1 1 1 1 1 2 1 3 2
Fedora rawhide: 43 60 47 60 12 56 60 45 49 18
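   In case it helps with reproducing, a standard way to force a guest
   kernel crash for such trials is sysrq (assuming kdump or similar is
   set up so the guest recovers between runs):

       # run as root inside the guest; this panics the kernel
       $ echo 1 > /proc/sys/kernel/sysrq
       $ echo c > /proc/sysrq-trigger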
3. QEMU <= v6.2
   I did a bisection and it shows the issue is gone with QEMU commit
   f39b7d2b96e3e73c01bb678cd096f7baf0b9ab39 ("kvm: Atomic memslot
   updates"), which is the last (3rd) patch of the patch set "KVM:
   allow listener to stop all vcpus before"
   https://lists.nongnu.org/archive/html/qemu-devel/2022-11/msg02172.html
   Note this commit only appears in QEMU > v7.2, so QEMU <= v7.2.0 can
   also reproduce this issue.
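   For reference, the bisection followed the standard git procedure; a
   minimal sketch, where the v8.0.0 endpoint is my assumption about
   the first tag that no longer reproduces:

       $ git bisect start --term-old=broken --term-new=fixed
       $ git bisect broken v7.2.0  # issue still reproduces here
       $ git bisect fixed v8.0.0   # assumed first tag without the issue
       # build and test each candidate QEMU, marking it broken or
       # fixed, until git reports the first fixed commit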
4. Specific host machines
   I'm not familiar with the hardware, so I haven't figured out yet
   what hardware factor makes the issue reproducible. I've attached
   the dmidecode outputs of four machines (files inside the
   dmidecode_host folder). Two systems (dmidecode_not_work*) can
   reproduce this issue and the other two (dmidecode_work*) can't,
   even though all of them have the same product name R152-P31-00, CPU
   model ARMv8 (M128-30) and SKU 01234567890123456789AB. One
   difference, which doesn't seem to appear in the dmidecode output,
   is that the two machines that can't reproduce the issue have the
   model name "PnP device PNP0c02", whereas the problematic machines
   have "R152-P31-00 (01234567890123456789AB)", according to our
   internal web pages that show the hardware info.
   It turns out all four machines can reproduce the issue. I tried to
   reproduce it 10 times on each and counted the number of kernel
   crashes needed; here's a comparison:
R152-P31-00: 2 1 1 1 1 1 2 1 3 2
PnP device PNP0c02: 8 3 5 15 11 18 2 5 12 4
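   For anyone comparing their own hosts, the fields above can be
   pulled directly with dmidecode's string keywords, e.g.:

       $ dmidecode -s system-product-name  # R152-P31-00 on all four
       $ dmidecode -s system-sku-number    # 01234567890123456789AB
       $ dmidecode -s processor-version    # CPU model string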
5. The guest needs to be bridged to a physical host interface.
   Bridging the guest to a tun interface can't reproduce the issue
   (for example, the default bridge (virbr0) created by libvirtd uses
   a tun interface).
   I tried triggering a kernel crash 100 consecutive times with virbr0
   in one trial but couldn't reproduce it, so I think bridging the
   guest to a physical network interface is still a must.
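   In case the exact setup matters, here is a minimal sketch of the
   bridge configuration I mean, with br0 and eth0 as placeholder names
   for your host:

       # create a bridge and enslave the physical NIC
       $ ip link add name br0 type bridge
       $ ip link set eth0 master br0
       $ ip link set br0 up
       # then attach the guest's NIC to br0 instead of virbr0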
[...]
--
Best regards,
Coiby