Re: [PATCH v2 3/5] KVM: Conditionally reschedule when resetting the dirty ring

James Houghton <jthoughton@xxxxxxxxxx> · Mon, 12 May 2025 15:02:00 -0700

On Thu, May 8, 2025 at 7:11 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> When resetting a dirty ring, conditionally reschedule on each iteration
> after the first.  The recently introduced hard limit mitigates the issue
> of an endless reset, but isn't sufficient to completely prevent RCU
> stalls, soft lockups, etc., nor is the hard limit intended to guard
> against such badness.
>
> Note!  Take care to check for reschedule even in the "continue" paths,
> as a pathological scenario (or malicious userspace) could dirty the same
> gfn over and over, i.e. always hit the continue path.
>
>   rcu: INFO: rcu_sched self-detected stall on CPU
>   rcu:  4-....: (5249 ticks this GP) idle=51e4/1/0x4000000000000000 softirq=309/309 fqs=2563
>   rcu:  (t=5250 jiffies g=-319 q=608 ncpus=24)
>   CPU: 4 UID: 1000 PID: 1067 Comm: dirty_log_test Tainted: G             L     6.13.0-rc3-17fa7a24ea1e-HEAD-vm #814
>   Tainted: [L]=SOFTLOCKUP
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x26/0x200 [kvm]
>   Call Trace:
>    <TASK>
>    kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
>    kvm_dirty_ring_reset+0x58/0x220 [kvm]
>    kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
>    __x64_sys_ioctl+0x8b/0xb0
>    do_syscall_64+0x5b/0x160
>    entry_SYSCALL_64_after_hwframe+0x4b/0x53
>    </TASK>
>   Tainted: [L]=SOFTLOCKUP
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:kvm_arch_mmu_enable_log_dirty_pt_masked+0x17/0x200 [kvm]
>   Call Trace:
>    <TASK>
>    kvm_reset_dirty_gfn.part.0+0xb4/0xe0 [kvm]
>    kvm_dirty_ring_reset+0x58/0x220 [kvm]
>    kvm_vm_ioctl+0x10eb/0x15d0 [kvm]
>    __x64_sys_ioctl+0x8b/0xb0
>    do_syscall_64+0x5b/0x160
>    entry_SYSCALL_64_after_hwframe+0x4b/0x53
>    </TASK>
>
> Fixes: fb04a1eddb1a ("KVM: X86: Implement ring-based dirty memory tracking")
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>  virt/kvm/dirty_ring.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index e844e869e8c7..97cca0c02fd1 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -134,6 +134,16 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring,
>
>                 ring->reset_index++;
>                 (*nr_entries_reset)++;
> +
> +               /*
> +                * While the size of each ring is fixed, it's possible for the
> +                * ring to be constantly re-dirtied/harvested while the reset
> +                * is in-progress (the hard limit exists only to guard against
> +                * wrapping the count into negative space).
> +                */
> +               if (!first_round)
> +                       cond_resched();

Should we be dropping slots_lock here?

It seems like we need to be holding slots_lock to call
kvm_reset_dirty_gfn(), but that's it. Userspace can already change the
memslots after enabling the dirty ring, so `entry->slot` can already
be stale, so dropping slots_lock for the cond_resched() seems harmless
(and better than not dropping it).