Re: [External] Re: [PATCH] KVM: x86: Latch INITs only in specific CPU states in KVM_SET_VCPU_EVENTS

On 8/29/25 12:44 AM, Paolo Bonzini wrote:
> On Thu, Aug 28, 2025 at 5:13 PM Fei Li <lifei.shirley@xxxxxxxxxxxxx> wrote:
> > Actually this is a bug triggered by one monitor tool in our production
> > environment. This monitor executes 'info registers -a' hmp at a fixed
> > frequency, even during VM startup process, which makes some AP stay in
> > KVM_MP_STATE_UNINITIALIZED forever. But this race only occurs with
> > extremely low probability, about 1~2 VM hangs per week.
> >
> > Considering other emulators, like cloud-hypervisor and firecracker maybe
> > also have similar potential race issues, I think KVM had better do some
> > handling. But anyway, I will check Qemu code to avoid such race. Thanks
> > for both of your comments. 🙂
> If you can check whether other emulators invoke KVM_SET_VCPU_EVENTS in
> similar cases, that of course would help understanding the situation
> better.
>
> In QEMU, it is possible to delay KVM_GET_VCPU_EVENTS until after all
> vCPUs have halted.
>
> Paolo

Hi Paolo and Sean,


Sorry for the late response; I have been a little busy with other things recently. The complete call sequence for the bad case is as follows:

`info registers -a` hmp per 2ms[1]      AP(vcpu1) thread[2]                  BSP(vcpu0) send INIT/SIPI[3]

                                 [2]
                                 KVM: KVM_RUN and then
                          schedule() in kvm_vcpu_block() loop

[1]
for each cpu: cpu_synchronize_state
if !qemu_thread_is_self()
1. insert to cpu->work_list, and handle asynchronously
2. then kick the AP(vcpu1) by sending SIG_IPI/SIGUSR1 signal

                      [2]
                      KVM: checks signal_pending, breaks loop and returns -EINTR
Qemu: break kvm_cpu_exec loop, run
  1. qemu_wait_io_event()
  => process_queued_cpu_work => cpu->work_list.func()
       i.e. do_kvm_cpu_synchronize_state() callback
       => kvm_arch_get_registers
            => kvm_get_mp_state /* KVM: get_mpstate also calls
           kvm_apic_accept_events() to handle INIT and SIPI */
       => cpu->vcpu_dirty = true;
  // end of qemu_wait_io_event

                                  [3]
                                  SeaBIOS: BSP enters non-root mode and runs reset_vector() in SeaBIOS.
                                           send INIT and then SIPI by writing APIC_ICR during smp_scan
                                  KVM: BSP(vcpu0) exits, then => handle_apic_write
                                       => kvm_lapic_reg_write => kvm_apic_send_ipi to all APs
                                       => for each AP: __apic_accept_irq, e.g. for AP(vcpu1)
                                            => case APIC_DM_INIT: apic->pending_events = (1UL << KVM_APIC_INIT)
                                                 (not kick the AP yet)
                                            => case APIC_DM_STARTUP: set_bit(KVM_APIC_SIPI, &apic->pending_events)
                                                 (not kick the AP yet)

  [2]
  2. kvm_cpu_exec()
  => if (cpu->vcpu_dirty):
     => kvm_arch_put_registers
        => kvm_put_vcpu_events
                      KVM: kvm_vcpu_ioctl_x86_set_vcpu_events
 => clear_bit(KVM_APIC_INIT, &vcpu->arch.apic->pending_events);
      i.e. pending_events changes from 11b to 10b
 // end of kvm_vcpu_ioctl_x86_set_vcpu_events
Qemu: => after put_registers, cpu->vcpu_dirty = false;
        => kvm_vcpu_ioctl(cpu, KVM_RUN, 0)
                      KVM: KVM_RUN
 => schedule() in kvm_vcpu_block() until Qemu's next SIG_IPI/SIGUSR1 signal
 /* But AP(vcpu1)'s mp_state will never change from KVM_MP_STATE_UNINITIALIZED
    to KVM_MP_STATE_INIT_RECEIVED, let alone to KVM_MP_STATE_RUNNABLE,
    because the INIT is never handled inside kvm_apic_accept_events() and
    the BSP will never send INIT/SIPI again during smp_scan.
    So AP(vcpu1) will never enter non-root mode */

                                  [3]
                                  SeaBIOS: waits for CountCPUs == expected_cpus_count and loops forever
                                  i.e. the AP(vcpu1) stays: EIP=0000fff0 && CS =f000 ffff0000
                                        and BSP(vcpu0) appears 100% utilized as it is in a while loop.
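
To make the bit manipulation in the trace easier to follow, here is a minimal userspace C model of it. This is only a sketch under stated assumptions, not KVM code: `struct model_vcpu`, `bsp_send_init_sipi()`, `set_vcpu_events_stale()`, `accept_events()` and `replay()` are hypothetical names made up for illustration, and `accept_events()` is a heavily simplified stand-in for kvm_apic_accept_events() (no vCPU reset, no BSP special case, no wakeup logic). Only the bit numbers KVM_APIC_INIT and KVM_APIC_SIPI are taken from KVM.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit numbers as used by KVM; everything below is a simplified model. */
#define KVM_APIC_INIT 0
#define KVM_APIC_SIPI 1

/* Hypothetical stand-ins for the three mp_state values in the trace. */
enum mp_state { MP_UNINITIALIZED, MP_INIT_RECEIVED, MP_RUNNABLE };

/* Hypothetical model of the AP state that matters for this race. */
struct model_vcpu {
    unsigned long pending_events;
    enum mp_state mp_state;
};

/* [3] BSP writes APIC_ICR: INIT and then SIPI are latched on the AP. */
static void bsp_send_init_sipi(struct model_vcpu *ap)
{
    ap->pending_events = 1UL << KVM_APIC_INIT;
    ap->pending_events |= 1UL << KVM_APIC_SIPI;
}

/*
 * [2] KVM_SET_VCPU_EVENTS with a snapshot taken before the INIT arrived:
 * the latched INIT bit is cleared, so 11b becomes 10b.
 */
static void set_vcpu_events_stale(struct model_vcpu *ap)
{
    ap->pending_events &= ~(1UL << KVM_APIC_INIT);
}

/*
 * Simplified event handling: INIT moves the AP to INIT_RECEIVED, and a
 * SIPI only takes effect when the AP is already in INIT_RECEIVED.
 */
static void accept_events(struct model_vcpu *ap)
{
    if (ap->pending_events & (1UL << KVM_APIC_INIT)) {
        ap->pending_events &= ~(1UL << KVM_APIC_INIT);
        ap->mp_state = MP_INIT_RECEIVED;
    }
    if (ap->pending_events & (1UL << KVM_APIC_SIPI)) {
        ap->pending_events &= ~(1UL << KVM_APIC_SIPI);
        if (ap->mp_state == MP_INIT_RECEIVED)
            ap->mp_state = MP_RUNNABLE;
    }
}

/* Replay the sequence with or without the stale put in between. */
static enum mp_state replay(bool stale_put_between)
{
    struct model_vcpu ap = { 0, MP_UNINITIALIZED };

    bsp_send_init_sipi(&ap);
    assert(ap.pending_events == 0x3UL);   /* 11b: INIT and SIPI pending */
    if (stale_put_between)
        set_vcpu_events_stale(&ap);       /* 10b: INIT lost */
    accept_events(&ap);
    return ap.mp_state;
}
```

In this model, replay(false) ends in MP_RUNNABLE, while replay(true) leaves the AP in MP_UNINITIALIZED because the SIPI has no preceding INIT to act on, which corresponds to the hang described above.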

As for other emulators (like cloud-hypervisor and firecracker), they have no interactive command like 'info registers -a'. Sorry again that I have not yet had time to check their code to confirm whether they invoke KVM_SET_VCPU_EVENTS in similar cases, maybe later. :)


Have a nice day, thanks
Fei




