Re: [Regression] kdump fails to get DHCP address unless booting with pci=nomsi or without nr_cpus=1

Hi Marc,

If I understand correctly, you want to reproduce the issue yourself.
I finally managed to reproduce it by playing with the setup shared by
my colleague. Here are the five prerequisites for reproducing the bug:

1. Guest kernel
   Newer than commit b5712bf89b4b ("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]")

2. Host kernel
   Relatively older ones like v6.10.0. Newer ones like v6.12.0 and
   v6.17.0 don't have this issue.

3. QEMU <= v6.2

4. Specific host machines
   I'm not familiar with the hardware, so I haven't yet figured out
   which hardware factor makes the issue reproducible. I've attached
   dmidecode outputs of four machines (files inside the dmidecode_host
   folder). Two systems (dmidecode_not_work*) can reproduce this issue
   and the other two (dmidecode_work*) can't, despite all having the same
   product name R152-P31-00, CPU model ARMv8 (M128-30) and SKU
   01234567890123456789AB. One difference that doesn't seem to be found
   in the dmidecode output: the two machines that can't reproduce the
   issue have the model name "PnP device PNP0c02", whereas the problematic
   machines have "R152-P31-00 (01234567890123456789AB)", according to our
   internal web pages that show the hardware info.

5. The guest needs to be bridged to a physical host interface
   Bridging the guest to a tun interface can't reproduce the issue (for
   example, the default bridge (virbr0) created by libvirtd uses a tun
   interface).
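
For prerequisite 5, one way to check what a bridge is actually attached to
might look like the sketch below. It relies on the assumption that a NIC
backed by a bus device exposes a "device" symlink under
/sys/class/net/<iface>/, while tun/tap interfaces do not. The function name
bridge_ports and its second argument (an alternate sysfs root, only useful
for dry runs against a fake tree) are hypothetical.

```shell
#!/bin/sh
# Classify each port of a bridge as "physical" or "virtual".
# Assumption: bus-backed NICs have /sys/class/net/<iface>/device,
# tun/tap interfaces do not.
# $1: bridge name (e.g. br0)
# $2: sysfs root (hypothetical, defaults to /sys; for dry runs only)
bridge_ports() {
    bridge="$1"
    sysfs="${2:-/sys}"
    for port in "$sysfs/class/net/$bridge/brif/"*; do
        # An unmatched glob stays literal; skip it (bridge without ports).
        [ -e "$port" ] || continue
        name=$(basename "$port")
        if [ -e "$sysfs/class/net/$name/device" ]; then
            echo "$name physical"
        else
            echo "$name virtual"
        fi
    done
}
```

Running "bridge_ports br0" on the host lists the ports of the bridge used in
the QEMU command line; "bridge_ports virbr0" does the same for the libvirtd
default bridge.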

With the above conditions met, I can reproduce the issue with just a Fedora Cloud Base 42 image:

1. Start the VM
   qemu-system-aarch64 -cpu host -machine virt \
   -device virtio-net-pci,netdev=hn0,id=nic1,mac=00:16:3e:3d:5f:b8 \
   -netdev bridge,id=hn0,br=br0,helper=/usr/local/libexec/qemu-bridge-helper \
   -hda /var/lib/libvirt/images/f42_1.qcow2 \
   -accel kvm -boot d \
   -drive if=pflash,format=raw,readonly,file=/usr/share/edk2/aarch64/QEMU_EFI-silent-pflash.raw \
   -m 35840 -serial stdio -smp 16

2. Set up kdump to dump vmcore to a remote NFS server
   dnf install kdump-utils nfs-utils -y
   echo nfs NFS_SERVER:EXPORT_PATH >> /etc/kdump.conf
   systemctl enable kdump
   kdumpctl reset-crashkernel
   systemctl reboot

3. After rebooting, trigger a crash of the 1st kernel
   If kdump works, i.e. DHCP works, you will need to trigger a kernel
   crash again until it doesn't. In my experience, repeating this step 6
   consecutive times is sure to hit one run where DHCP doesn't work.
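
Step 3 doesn't say how the crash is triggered. A minimal sketch, assuming
the usual SysRq route, could be the following. The function name
trigger_crash and its optional argument (an alternate root directory, only
useful for dry runs) are hypothetical.

```shell
#!/bin/sh
# Force a panic of the running kernel so that kdump boots the capture kernel.
# Assumption: the crash is triggered through SysRq.
# $1: alternate root (hypothetical, defaults to empty, i.e. the real /proc)
trigger_crash() {
    root="${1:-}"
    # Enable all SysRq functions, then request an immediate crash.
    echo 1 > "$root/proc/sys/kernel/sysrq"
    echo c > "$root/proc/sysrq-trigger"
}
```

Calling trigger_crash as root inside the guest panics the 1st kernel
immediately, so it should only be run once kdump is loaded.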

Note that f42_1.qcow2 was created from the Fedora Cloud Base 42 image:
https://download.fedoraproject.org/pub/fedora/linux/releases/42/Cloud/aarch64/images/Fedora-Cloud-Base-Generic-42-1.1.aarch64.qcow2

Considering QEMU 6.2 was released about 4 years ago, do you think there
is a need to dig further into this problem to find out how the five
prerequisites interact to create the bug? If you think it's worth the
effort, I'll bisect QEMU to find the 1st bad commit and also provide any
other debugging info you need.

On Wed, Aug 20, 2025 at 09:56:50AM +0100, Marc Zyngier wrote:
On Wed, 20 Aug 2025 00:30:12 +0100,
Coiby Xu <coxu@xxxxxxxxxx> wrote:

On Wed, Aug 13, 2025 at 08:08:28PM +0800, Coiby Xu wrote:
> On Tue, Aug 12, 2025 at 02:14:25PM +0100, Marc Zyngier wrote:
[...]
>>
>> Can you at the very least share:

Thanks for your patience! I've attached a zip file with the info you
need. Additionally I've included the dmidecode of guest
(dmidecode_guest), host machine (dmidecode_host) and the domain info
of guest (libvirt.xml) in case they may be helpful. If you need further
info or any experiment I need to do, feel free to let me know! Now I
have access to the host machine so I can respond much faster.

>>
>> - the boot log of the guest on its first kernel

Please check file boot_log_1st_kernel

Old kernel. It would have been better to use a vanilla v6.16, so that
we know exactly what you are running. I have zero interest in finding
out what 6.15.9-201.fc42.aarch64 corresponds to in real life.

Thanks for the suggestion! I've built v6.16 and attached the logs.
Please check 04_not_work/boot_log_{1st,2nd}_kernel.

Btw, I'm curious to know why you want a vanilla v6.16. Is it because you
are worried a Fedora kernel can be so different from a vanilla v6.16
that it can obscure the problem?


>> - the boot log of the guest running kdump

boot_log_2nd_kernel

Same thing.


>>
>> - the content of /sys/kernel/debug/kvm/$PID-xx/vgic*state* when
>> running both kernels

vgic-state_{1st,2nd}_kernel

What is the host running? It also looks like a pre-6.16 kernel, which
lacks important information.

The host is running RHEL 8.6. I can also confirm that Fedora kernel
6.10.0-64.fc41.aarch64 reproduces the issue, but the latest ones like
6.17.0-0.rc2.24.fc43.aarch64 do not.



>>
>> - the QEMU command-line to get to run the whole thing

qemu_cmdline

I'm sorry, but that doesn't look like a command line as I know it. I
certainly cannot feed this to QEMU and reproduce your findings.

Sorry, I didn't realize you wanted to reproduce the issue. Previously I
hadn't reproduced the issue myself and thought it wasn't easy to
reproduce. Thus I merely shared the cmdline generated by
libvirt/virt-install in case you could find something suspicious in it.


Now, there is *one* thing that is interesting:

The second vgic_state dump indicates that LPI 8225 is routed to
vcpu-3. Given that your guest boots into the second kernel on vcpu-0,
and that this is the only online vcpu at this stage, the LPI will
never be presented to the CPU (and the vgic has it as pending, which
is what I'd expect).

I'd suggest you instrument the second kernel to try and see why this
affinity is not changed.

Currently, I'm not familiar with interrupts. But I notice that, for the
2nd kernel, all /proc/irq/*/smp_affinity files have the same value 1 and
/proc/interrupts lists only one CPU. If you want me to try other things,
please let me know.
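
The smp_affinity observation above can be collected with something like
the sketch below. It only prints what the guest kernel reports, and the
Linux IRQ numbers it shows are not the GIC INTIDs (such as LPI 8225) seen
in the host-side vgic state dump. The function name irq_affinity and its
optional argument (an alternate /proc root, only useful for dry runs) are
hypothetical.

```shell
#!/bin/sh
# Print "<irq> <smp_affinity mask>" for every IRQ known to the kernel.
# $1: procfs root (hypothetical, defaults to /proc; for dry runs only)
irq_affinity() {
    proc="${1:-/proc}"
    for f in "$proc"/irq/*/smp_affinity; do
        # Skip the literal pattern when nothing matches.
        [ -r "$f" ] || continue
        irq=$(basename "$(dirname "$f")")
        echo "$irq $(cat "$f")"
    done
}
```

In the 2nd kernel, every line printed by irq_affinity would end in 1 if all
masks are indeed identical.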


Thanks,

	M.

--
Jazz isn't dead. It just smells funny.


--
Best regards,
Coiby

<<attachment: debug_info_VGICv3_not_work_for_kdump.zip>>

