> On Sep 11, 2025, at 1:49 AM, Jason Wang <jasowang@xxxxxxxxxx> wrote:
>
> On Thu, Sep 11, 2025 at 4:11 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>>
>> On Wed, Sep 10, 2025 at 06:58:18PM +0000, Jon Kohler wrote:
>>>
>>>
>>>> On May 5, 2025, at 9:40 PM, patchwork-bot+netdevbpf@xxxxxxxxxx wrote:
>>>>
>>>> Hello:
>>>>
>>>> This patch was applied to netdev/net-next.git (main)
>>>> by Jakub Kicinski <kuba@xxxxxxxxxx>:
>>>
>>> Hey all,
>>> Writing to send up a flare and point out a problem that we're seeing
>>> with this patch internally, specifically when we enable iommu on the
>>> virtio-net device.
>>>
>>> With this patch applied on a 6.12.y-based bare metal instance and then
>>> starting a 6.12.y-based guest with iommu enabled, we see lockups
>>> within the guest in short order, as well as the vmm (qemu) stuck in a
>>> tight loop responding to iommu misses from the vhost-net loop.
>>>
>>> We've bisected this in our internal tree, and it is definitely this
>>> patch that introduces the problem, so I wanted to point out that there
>>> is some sort of issue here.
>>>
>>> Working on trying to figure this out, but if anything jumps off the
>>> page to anyone, happy to take advice!
>>>
>>> Flamegraph:
>>> https://gist.github.com/JonKohler/0e83c014230ab59ddc950f10441335f1#file-iotlb-lockup-svg
>>>
>>> Guest dmesg errors like so:
>>> [  66.081694] virtio_net virtio0 eth0: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 5500 ms
>>> [  68.145155] virtio_net virtio0 eth0: TX timeout on queue: 0, sq: output.0, vq: 0x1, name: output.0, 7560000 usecs ago
>>> [ 112.907012] virtio_net virtio0 eth0: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 5568 ms
>>> [ 124.117540] virtio_net virtio0 eth0: TX timeout on queue: 0, sq: output.0, vq: 0x1, name: output.0, 16776000 usecs ago
>>> [ 124.118050] virtio_net virtio0 eth0: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 16776 ms
>>> [ 124.118447] virtio_net virtio0 eth0: TX timeout on queue: 0, sq: output.0, vq: 0x1, name: output.0, 16776000 usecs ago
>>>
>>> Host level top output:
>>> 3992758 qemu  20  0  16.6g  52168  26704 R  99.9  0.0  21:23.72 qemu-kvm       <<< this is the qemu main thread
>>> 3992769 qemu  20  0  16.6g  52168  26704 R  58.8  0.0  13:33.44 vhost-3992758  <<< this is the vhost-net kthread
>>>
>>> For the qemu-kvm main thread:
>>> Samples: 13K of event 'cycles:P', 4000 Hz, Event count (approx.): 5131922583 lost: 0/0 drop: 0/0
>>>   Children      Self  Shared Object  Symbol
>>> -   87.41%     0.30%  [kernel]       [k] entry_SYSCALL_64_after_hwframe
>>>    - 87.11% entry_SYSCALL_64_after_hwframe
>>>       - do_syscall_64
>>>          - 44.79% ksys_write
>>>             - 43.74% vfs_write
>>>                - 40.96% vhost_chr_write_iter
>>>                   - 38.22% vhost_process_iotlb_msg
>>>                      - 13.72% vhost_iotlb_add_range_ctx
>>>                         - 7.43% vhost_iotlb_map_free
>>>                            - 4.37% vhost_iotlb_itree_remove
>>>                                 rb_next
>>>                              1.78% __rb_erase_color
>>>                              0.73% kfree
>>>                           1.15% __rb_insert_augmented
>>>                           0.68% __kmalloc_cache_noprof
>>>                      - 10.73% vhost_vq_work_queue
>>>                         - 7.65% try_to_wake_up
>>>                            - 2.55% ttwu_queue_wakelist
>>>                               - 1.72% __smp_call_single_queue
>>>                                    1.36% call_function_single_prep_ipi
>>>                            - 1.32% __task_rq_lock
>>>                               - _raw_spin_lock
>>>                                    native_queued_spin_lock_slowpath
>>>                            - 1.30% select_task_rq
>>>                               - select_task_rq_fair
>>>                                  - 0.88% wake_affine
>>>                                       available_idle_cpu
>>>                           2.06% llist_add_batch
>>>                      - 4.05% __mutex_lock.constprop.0
>>>                           2.14% mutex_spin_on_owner
>>>                           0.72% osq_lock
>>>                        3.00% mutex_lock
>>>                      - 1.72% kfree
>>>                         - 1.16% __slab_free
>>>                              slab_update_freelist.constprop.0.isra.0
>>>                        1.37% _raw_spin_lock
>>>                        1.08% mutex_unlock
>>>                     1.98% _copy_from_iter
>>>                - 1.86% rw_verify_area
>>>                   - security_file_permission
>>>                      - 1.13% file_has_perm
>>>                           0.69% avc_has_perm
>>>               0.63% fdget_pos
>>>          - 27.86% syscall_exit_to_user_mode
>>>             - syscall_exit_to_user_mode_prepare
>>>                - 25.96% __audit_syscall_exit
>>>                   - 25.03% __audit_filter_op
>>>                        6.66% audit_filter_rules.constprop.0
>>>                     1.27% audit_reset_context.part.0.constprop.0
>>>          - 10.86% ksys_read
>>>             - 9.37% vfs_read
>>>                - 6.67% vhost_chr_read_iter
>>>                     1.48% _copy_to_iter
>>>                     1.36% _raw_spin_lock
>>>                   - 1.30% __wake_up
>>>                        0.81% _raw_spin_lock_irqsave
>>>                   - 1.25% vhost_enqueue_msg
>>>                        _raw_spin_lock
>>>                - 1.83% rw_verify_area
>>>                   - security_file_permission
>>>                      - 1.03% file_has_perm
>>>                           0.64% avc_has_perm
>>>               0.65% fdget_pos
>>>               0.57% fput
>>>          - 2.56% syscall_trace_enter
>>>             - 1.25% __seccomp_filter
>>>                  seccomp_run_filters
>>>               0.54% __audit_syscall_entry
>>>
>>> For the vhost-net thread:
>>> Samples: 20K of event 'cycles:P', 4000 Hz, Event count (approx.): 7796456297 lost: 0/0 drop: 0/0
>>>   Children      Self  Shared Object  Symbol
>>> -  100.00%     3.38%  [kernel]       [k] vhost_task_fn
>>>      38.26% 0xffffffff930bb8c0
>>>    - 3.36% 0
>>>         ret_from_fork_asm
>>>         ret_from_fork
>>>       - 1.16% vhost_task_fn
>>>    - 2.35% vhost_run_work_list
>>>       - 1.67% handle_tx
>>>          - 7.09% __mutex_lock.constprop.0
>>>               6.64% mutex_spin_on_owner
>>>          - 0.84% vq_meta_prefetch
>>>             - 3.22% iotlb_access_ok
>>>                  2.50% vhost_iotlb_itree_first
>>>            0.80% mutex_lock
>>>          - 0.75% handle_tx_copy
>>>               0.86% llist_reverse_order
>>>
>>>>
>>>> On Wed, 30 Apr 2025 19:04:28 -0700 you wrote:
>>>>> In handle_tx_copy, TX batching processes packets below ~PAGE_SIZE and
>>>>> batches up to 64 messages before calling sock->sendmsg.
>>>>>
>>>>> Currently, when there are no more messages on the ring to dequeue,
>>>>> handle_tx_copy re-enables kicks on the ring *before* firing off the
>>>>> batch sendmsg. However, sock->sendmsg incurs a non-zero delay,
>>>>> especially if it needs to wake up a thread (e.g., another vhost worker).
>>>>>
>>>>> [...]
>>>>
>>>> Here is the summary with links:
>>>>   - [net-next,v3] vhost/net: Defer TX queue re-enable until after sendmsg
>>>>     https://git.kernel.org/netdev/net-next/c/8c2e6b26ffe2
>>>>
>>>> You are awesome, thank you!
>>>> --
>>>> Deet-doot-dot, I am a bot.
>>>> https://korg.docs.kernel.org/patchwork/pwbot.html
>>>>
>>>>
>>>
>>
>>
>> Well, it seems that if get_tx_bufs() failed with -EAGAIN we previously
>> bailed out, but now we will redo the poll and so on, forever.
>
> Something like this: in that case the vhost_vq_avail_empty() check will
> cause vhost_poll_queue() to re-queue the work.
>
> Let me post a patch to fix that.
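To make the suspected failure mode above concrete, here is a minimal
stand-alone sketch of the control flow Michael and Jason describe. It is
not the kernel code: the functions below only borrow the names discussed
in this thread (get_tx_bufs, vhost_vq_avail_empty, vhost_poll_queue), and
their bodies, the defer_reenable flag, and the loop bounds are purely
illustrative, modelling one scenario where a TX descriptor's translation
keeps missing in the IOTLB.

/*
 * User-space simulation of the suspected vhost-net TX requeue loop; the
 * stand-ins below model an IOTLB miss that stays outstanding while the
 * TX work keeps re-queueing itself.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

static bool iotlb_populated;	/* set once userspace services the miss */
static int requeue_count;	/* how often the TX work re-queued itself */

/* Stand-in for get_tx_bufs(): fails until the IOTLB entry exists. */
static int get_tx_bufs(void)
{
	return iotlb_populated ? 0 : -EAGAIN;
}

/* Stand-in for vhost_vq_avail_empty(): the guest still has TX queued. */
static bool vhost_vq_avail_empty(void)
{
	return false;
}

/* Stand-in for vhost_poll_queue(): re-queue the TX work item. */
static void vhost_poll_queue(void)
{
	requeue_count++;
}

/* One run of a (heavily simplified) TX work function. */
static void tx_work(bool defer_reenable)
{
	for (;;) {
		if (get_tx_bufs() < 0) {
			/* -EAGAIN: miss reported to userspace */
			if (!defer_reenable)
				return;	/* old flow: bail out and wait */
			break;		/* new flow: reach the requeue check */
		}
		/* ...batching and sendmsg would happen here... */
	}
	if (!vhost_vq_avail_empty())
		vhost_poll_queue();	/* ring not empty: run again at once */
}

int main(void)
{
	/* The work keeps re-queueing itself while the miss is outstanding. */
	for (int i = 0; i < 1000 && !iotlb_populated; i++)
		tx_work(true);
	printf("TX work re-queued itself %d times without progress\n",
	       requeue_count);
	return 0;
}

With tx_work(false) (the pre-patch flow in this model) the work simply
returns on -EAGAIN and goes back to sleep until the IOTLB update arrives;
with tx_work(true) it spins through vhost_poll_queue() indefinitely, which
would line up with the report that the hang only shows up with the iommu
enabled on the device.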
Ok great, thanks Jason and Michael.
Happy to test this when available, as we've got a solid repro.

> Thanks
>
>>
>>
>> No?
>>
>>
>> --
>> MST
>>
>