On Thu, Aug 7, 2025 at 1:32 AM Nicolas Escande <nico.escande@xxxxxxxxx> wrote:
>
> On Wed Aug 6, 2025 at 1:17 PM CEST, Nithyanantham Paramasivam wrote:
> > During stress test scenarios, when the REO command ring becomes full,
> > the RX queue update command issued during peer deletion fails due to
> > insufficient space. In response, the host performs a dma_unmap and
> > frees the associated memory. However, the hardware still retains a
> > reference to the same memory address. If the kernel later reallocates
> > this address, unaware that the hardware is still using it, it can
> > lead to memory corruption, since the host might access or modify
> > memory that is still actively referenced by the hardware.
> >
> > Implement a retry mechanism for the HAL_REO_CMD_UPDATE_RX_QUEUE
> > command during TID deletion to prevent memory corruption. Introduce
> > a new list, reo_cmd_update_rx_queue_list, in the dp structure to
> > track pending RX queue updates. Protect this list with
> > reo_rxq_flush_lock, which also ensures synchronized access to
> > reo_cmd_cache_flush_list. Defer memory release until hardware
> > confirms the virtual address is no longer in use, avoiding immediate
> > deallocation on command failure. Release memory for pending RX queue
> > updates via ath12k_dp_rx_reo_cmd_list_cleanup() on system reset
> > if hardware confirmation is not received.
> >
> > Additionally, increase DP_REO_CMD_RING_SIZE to 256 to reduce the
> > likelihood of ring exhaustion. Use a 1KB cache flush command for
> > QoS TID descriptors to improve efficiency.
>
> Hello,
> I'm not sure I fully understand, so please correct me where I'm wrong,
> but from what I got looking at the code:
> - you increase the ring size for reo cmd
> - you enable releasing multiple tid buffers at once
> - as soon as you allocate the tid you create an entry in the list,
>   flagging it as active
> - when you need to clean up a tid
>   - you mark it as inactive in the list, then
>   - try to process the whole list by
>     - sending the reo command to release it
>     - if it succeeds, you release it and remove the entry from the list
> - on core exit, you re-process the list
>
> What is bothering me with this is that when a vdev which has multiple
> sta goes down, you will have a lot of those entries to push to the
> firmware at once:
>
> - So increasing the ring size / being able to release multiple entries
>   at once in one reo cmd may or may not be enough to handle the burst.
>   It seems that increasing those is just a band-aid on the underlying
>   problem, but I understand it's just to be on the safe side.
> - If entries do not get sent/accepted by the firmware, we will need to
>   wait for another station removal before retrying, which could be a
>   while.
> - Or in the worst case (one vdev going down and needing to release the
>   tids of all its stations), the more entries we have in the list, the
>   less likely we are to be able to push the delete of all stations +
>   the ones still in the queue.
>
> So, on station removal, why not just make things completely async: push
> the tid deletes into a list and wake a workqueue that tries to push
> those to the firmware. If the ring is full, retry periodically.
>
> Thanks

Hi Nicolas,

Thanks for the detailed observations and suggestions. You're right in
your understanding of the flow.
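To make the flow concrete, here is a minimal sketch of the bookkeeping.
The struct and field names below are hypothetical; only the list
reo_cmd_update_rx_queue_list and the lock reo_rxq_flush_lock are taken
from the patch description:

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct ath12k_dp {
	struct list_head reo_cmd_update_rx_queue_list;
	/* protects the list above and reo_cmd_cache_flush_list */
	spinlock_t reo_rxq_flush_lock;
	/* ... */
};

/* Hypothetical per-TID tracking entry. The RX queue descriptor it
 * refers to must not be freed until hardware confirms release. */
struct dp_reo_update_rx_queue_elem {
	struct list_head list;
	bool active;	/* cleared once the TID is queued for deletion */
	int peer_id;
	u8 tid;
};

/* TID setup: record the RX queue as active. */
static void dp_rx_tid_track(struct ath12k_dp *dp,
			    struct dp_reo_update_rx_queue_elem *elem)
{
	spin_lock_bh(&dp->reo_rxq_flush_lock);
	elem->active = true;
	list_add_tail(&elem->list, &dp->reo_cmd_update_rx_queue_list);
	spin_unlock_bh(&dp->reo_rxq_flush_lock);
}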
To clarify further: when the host fails to obtain a buffer from the
hardware to send a command because the REO command ring is full
(ath12k_hal_reo_cmd_send returning -ENOBUFS), we treat it as a command
send failure and avoid deleting the corresponding entry immediately.
This situation typically arises in the flow:

ath12k_dp_rx_process_reo_cmd_update_rx_queue_list
  -> ath12k_dp_rx_tid_delete_handler
  -> returns -ENOBUFS

Given that the command ring size is 256, each peer generally has 16
TIDs, and each TID requires 2 commands (RXQ update and cache flush),
deleting one peer involves up to 32 commands. This means we can delete
up to 8 peers (8 x 32 = 256 commands) before hitting the ring limit.
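Continuing the sketch above, the deferred-delete walk would look
roughly like this. The two function names and the -ENOBUFS behaviour
come from the flow described above; the bodies and locking are
illustrative only:

#include <linux/errno.h>

/* Issues HAL_REO_CMD_UPDATE_RX_QUEUE (and the follow-up cache flush)
 * for one TID; returns -ENOBUFS when the REO command ring is full. */
int ath12k_dp_rx_tid_delete_handler(struct ath12k_dp *dp,
				    struct dp_reo_update_rx_queue_elem *elem);

static void
ath12k_dp_rx_process_reo_cmd_update_rx_queue_list(struct ath12k_dp *dp)
{
	struct dp_reo_update_rx_queue_elem *elem, *tmp;

	spin_lock_bh(&dp->reo_rxq_flush_lock);
	list_for_each_entry_safe(elem, tmp,
				 &dp->reo_cmd_update_rx_queue_list, list) {
		if (elem->active)
			continue;	/* TID still in use */

		if (ath12k_dp_rx_tid_delete_handler(dp, elem) == -ENOBUFS)
			break;	/* ring full: keep entry, retry later */

		/* Command accepted; drop the tracking entry. The
		 * descriptor memory itself is freed only once HW
		 * confirms release. */
		list_del(&elem->list);
		kfree(elem);
	}
	spin_unlock_bh(&dp->reo_rxq_flush_lock);
}

Breaking out of the walk on -ENOBUFS (rather than skipping the entry)
keeps the remaining entries queued in order for the next retry pass.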