> On Sep 11, 2025, at 1:49 AM, Jason Wang <jasowang@xxxxxxxxxx> wrote:
>
> On Thu, Sep 11, 2025 at 4:11 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>>
>> On Wed, Sep 10, 2025 at 06:58:18PM +0000, Jon Kohler wrote:
>>>
>>>
>>>> On May 5, 2025, at 9:40 PM, patchwork-bot+netdevbpf@xxxxxxxxxx wrote:
>>>>
>>>> Hello:
>>>>
>>>> This patch was applied to netdev/net-next.git (main)
>>>> by Jakub Kicinski <kuba@xxxxxxxxxx>:
>>>
>>> Hey all,
>>> Writing to send up a flare and point out a problem that we're seeing
>>> with this patch internally, specifically when we enable iommu on the
>>> virtio-net device.
>>>
>>> With this patch applied on a 6.12.y-based bare metal instance and then
>>> starting a 6.12.y-based guest with iommu enabled, we see lockups
>>> within the guest in short order, as well as the vmm (qemu) stuck in a
>>> tight loop responding to iommu misses from the vhost-net loop.
>>>
>>> We've bisected this in our internal tree, and it is definitely this
>>> patch that introduces the problem, so I wanted to point out that there
>>> is some sort of issue here.
>>>
>>> Working on trying to figure this out, but if anything jumps off the
>>> page to anyone, happy to take advice!
>>>
>>> Flamegraph:
>>> https://gist.github.com/JonKohler/0e83c014230ab59ddc950f10441335f1#file-iotlb-lockup-svg
>>>
>>> Guest dmesg errors like so:
>>> [  66.081694] virtio_net virtio0 eth0: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 5500 ms
>>> [  68.145155] virtio_net virtio0 eth0: TX timeout on queue: 0, sq: output.0, vq: 0x1, name: output.0, 7560000 usecs ago
>>> [ 112.907012] virtio_net virtio0 eth0: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 5568 ms
>>> [ 124.117540] virtio_net virtio0 eth0: TX timeout on queue: 0, sq: output.0, vq: 0x1, name: output.0, 16776000 usecs ago
>>> [ 124.118050] virtio_net virtio0 eth0: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 16776 ms
>>> [ 124.118447] virtio_net virtio0 eth0: TX timeout on queue: 0, sq: output.0, vq: 0x1, name: output.0, 16776000 usecs ago
>>>
>>> Host level top output:
>>> 3992758 qemu  20  0  16.6g  52168  26704 R  99.9  0.0  21:23.72 qemu-kvm       <<< this is the qemu main thread
>>> 3992769 qemu  20  0  16.6g  52168  26704 R  58.8  0.0  13:33.44 vhost-3992758  <<< this is the vhost-net kthread
>>>
>>> For the qemu-kvm main thread:
>>> Samples: 13K of event 'cycles:P', 4000 Hz, Event count (approx.): 5131922583 lost: 0/0 drop: 0/0
>>>   Children      Self  Shared Object  Symbol
>>> -   87.41%     0.30%  [kernel]       [k] entry_SYSCALL_64_after_hwframe
>>>    - 87.11% entry_SYSCALL_64_after_hwframe
>>>       - do_syscall_64
>>>          - 44.79% ksys_write
>>>             - 43.74% vfs_write
>>>                - 40.96% vhost_chr_write_iter
>>>                   - 38.22% vhost_process_iotlb_msg
>>>                      - 13.72% vhost_iotlb_add_range_ctx
>>>                         - 7.43% vhost_iotlb_map_free
>>>                            - 4.37% vhost_iotlb_itree_remove
>>>                                 rb_next
>>>                              1.78% __rb_erase_color
>>>                              0.73% kfree
>>>                           1.15% __rb_insert_augmented
>>>                           0.68% __kmalloc_cache_noprof
>>>                      - 10.73% vhost_vq_work_queue
>>>                         - 7.65% try_to_wake_up
>>>                            - 2.55% ttwu_queue_wakelist
>>>                               - 1.72% __smp_call_single_queue
>>>                                    1.36% call_function_single_prep_ipi
>>>                            - 1.32% __task_rq_lock
>>>                               - _raw_spin_lock
>>>                                    native_queued_spin_lock_slowpath
>>>                            - 1.30% select_task_rq
>>>                               - select_task_rq_fair
>>>                                  - 0.88% wake_affine
>>>                                       available_idle_cpu
>>>                           2.06% llist_add_batch
>>>                      - 4.05% __mutex_lock.constprop.0
>>>                           2.14% mutex_spin_on_owner
>>>                           0.72% osq_lock
>>>                        3.00% mutex_lock
>>>                      - 1.72% kfree
>>>                         - 1.16% __slab_free
>>>                              slab_update_freelist.constprop.0.isra.0
>>>                        1.37% _raw_spin_lock
>>>                        1.08% mutex_unlock
>>>                     1.98% _copy_from_iter
>>>                - 1.86% rw_verify_area
>>>                   - security_file_permission
>>>                      - 1.13% file_has_perm
>>>                           0.69% avc_has_perm
>>>               0.63% fdget_pos
>>>          - 27.86% syscall_exit_to_user_mode
>>>             - syscall_exit_to_user_mode_prepare
>>>                - 25.96% __audit_syscall_exit
>>>                   - 25.03% __audit_filter_op
>>>                        6.66% audit_filter_rules.constprop.0
>>>                     1.27% audit_reset_context.part.0.constprop.0
>>>          - 10.86% ksys_read
>>>             - 9.37% vfs_read
>>>                - 6.67% vhost_chr_read_iter
>>>                     1.48% _copy_to_iter
>>>                     1.36% _raw_spin_lock
>>>                   - 1.30% __wake_up
>>>                        0.81% _raw_spin_lock_irqsave
>>>                   - 1.25% vhost_enqueue_msg
>>>                        _raw_spin_lock
>>>                - 1.83% rw_verify_area
>>>                   - security_file_permission
>>>                      - 1.03% file_has_perm
>>>                           0.64% avc_has_perm
>>>               0.65% fdget_pos
>>>               0.57% fput
>>>          - 2.56% syscall_trace_enter
>>>             - 1.25% __seccomp_filter
>>>                  seccomp_run_filters
>>>               0.54% __audit_syscall_entry
>>>
>>> For the vhost-net thread:
>>> Samples: 20K of event 'cycles:P', 4000 Hz, Event count (approx.): 7796456297 lost: 0/0 drop: 0/0
>>>   Children      Self  Shared Object  Symbol
>>> -  100.00%     3.38%  [kernel]       [k] vhost_task_fn
>>>      38.26% 0xffffffff930bb8c0
>>>    - 3.36% 0
>>>         ret_from_fork_asm
>>>         ret_from_fork
>>>       - 1.16% vhost_task_fn
>>>    - 2.35% vhost_run_work_list
>>>       - 1.67% handle_tx
>>>          - 7.09% __mutex_lock.constprop.0
>>>               6.64% mutex_spin_on_owner
>>>          - 0.84% vq_meta_prefetch
>>>             - 3.22% iotlb_access_ok
>>>                  2.50% vhost_iotlb_itree_first
>>>            0.80% mutex_lock
>>>          - 0.75% handle_tx_copy
>>>               0.86% llist_reverse_order
>>>
>>>>
>>>> On Wed, 30 Apr 2025 19:04:28 -0700 you wrote:
>>>>> In handle_tx_copy, TX batching processes packets below ~PAGE_SIZE and
>>>>> batches up to 64 messages before calling sock->sendmsg.
>>>>>
>>>>> Currently, when there are no more messages on the ring to dequeue,
>>>>> handle_tx_copy re-enables kicks on the ring *before* firing off the
>>>>> batch sendmsg. However, sock->sendmsg incurs a non-zero delay,
>>>>> especially if it needs to wake up a thread (e.g., another vhost worker).
>>>>>
>>>>> [...]
>>>>
>>>> Here is the summary with links:
>>>>   - [net-next,v3] vhost/net: Defer TX queue re-enable until after sendmsg
>>>>     https://git.kernel.org/netdev/net-next/c/8c2e6b26ffe2
>>>>
>>>> You are awesome, thank you!
>>>> --
>>>> Deet-doot-dot, I am a bot.
>>>> https://korg.docs.kernel.org/patchwork/pwbot.html
>>>>
>>>>
>>>
>>
>>
>> Well, it seems that if get_tx_bufs() failed with -EAGAIN we previously
>> bailed out, but now we will redo the poll and so on, forever.
>
> Something like this: in that case the vhost_vq_avail_empty() check will
> cause vhost_poll_queue() to re-queue the work.
>
> Let me post a patch to fix that.
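To make the suspected failure mode above concrete, here is a minimal
stand-alone sketch of the control flow Michael and Jason describe. It is
not the kernel code: the functions below only borrow the names discussed
in this thread (get_tx_bufs, vhost_vq_avail_empty, vhost_poll_queue), and
their bodies, the defer_reenable flag, and the loop bounds are purely
illustrative, modelling one scenario where a TX descriptor's translation
keeps missing in the IOTLB.

/*
 * User-space simulation of the suspected vhost-net TX requeue loop; the
 * stand-ins below model an IOTLB miss that stays outstanding while the
 * TX work keeps re-queueing itself.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

static bool iotlb_populated;	/* set once userspace services the miss */
static int requeue_count;	/* how often the TX work re-queued itself */

/* Stand-in for get_tx_bufs(): fails until the IOTLB entry exists. */
static int get_tx_bufs(void)
{
	return iotlb_populated ? 0 : -EAGAIN;
}

/* Stand-in for vhost_vq_avail_empty(): the guest still has TX queued. */
static bool vhost_vq_avail_empty(void)
{
	return false;
}

/* Stand-in for vhost_poll_queue(): re-queue the TX work item. */
static void vhost_poll_queue(void)
{
	requeue_count++;
}

/* One run of a (heavily simplified) TX work function. */
static void tx_work(bool defer_reenable)
{
	for (;;) {
		if (get_tx_bufs() < 0) {
			/* -EAGAIN: miss reported to userspace */
			if (!defer_reenable)
				return;	/* old flow: bail out and wait */
			break;		/* new flow: reach the requeue check */
		}
		/* ...batching and sendmsg would happen here... */
	}
	if (!vhost_vq_avail_empty())
		vhost_poll_queue();	/* ring not empty: run again at once */
}

int main(void)
{
	/* The work keeps re-queueing itself while the miss is outstanding. */
	for (int i = 0; i < 1000 && !iotlb_populated; i++)
		tx_work(true);
	printf("TX work re-queued itself %d times without progress\n",
	       requeue_count);
	return 0;
}

With tx_work(false) (the pre-patch flow in this model) the work simply
returns on -EAGAIN and goes back to sleep until the IOTLB update arrives;
with tx_work(true) it spins through vhost_poll_queue() indefinitely, which
would line up with the report that the hang only shows up with the iommu
enabled on the device.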
Ok great, thanks Jason and Michael.
Happy to test this when available, as we've got a solid repro.

> Thanks
>
>>
>>
>> No?
>>
>>
>> --
>> MST
>>
>