Re: [PATCH ath-current 0/7] wifi: ath12k: Fix Issues in REO RX Queue Updates

"Nicolas Escande" <nico.escande@xxxxxxxxx> · Wed, 06 Aug 2025 22:02:28 +0200

On Wed Aug 6, 2025 at 1:17 PM CEST, Nithyanantham Paramasivam wrote:
> During stress test scenarios, when the REO command ring becomes full,
> the RX queue update command issued during peer deletion fails due to
> insufficient space. In response, the host performs a dma_unmap and
> frees the associated memory. However, the hardware still retains a
> reference to the same memory address. If the kernel later reallocates
> this address, unaware that the hardware is still using it, it can
> lead to memory corruption-since the host might access or modify
> memory that is still actively referenced by the hardware.
>
> Implement a retry mechanism for the HAL_REO_CMD_UPDATE_RX_QUEUE
> command during TID deletion to prevent memory corruption. Introduce
> a new list, reo_cmd_update_rx_queue_list, in the dp structure to
> track pending RX queue updates. Protect this list with
> reo_rxq_flush_lock, which also ensures synchronized access to
> reo_cmd_cache_flush_list. Defer memory release until hardware
> confirms the virtual address is no longer in use, avoiding immediate
> deallocation on command failure. Release memory for pending RX queue
> updates via ath12k_dp_rx_reo_cmd_list_cleanup() on system reset
> if hardware confirmation is not received.
>
> Additionally, increase DP_REO_CMD_RING_SIZE to 256 to reduce the
> likelihood of ring exhaustion. Use a 1KB cache flush command for
> QoS TID descriptors to improve efficiency.
>

Hello,

I'm not sure I fully understand so please correct where I'm wrong but from what
I got looking at the code:
  - you increase the ring size for reo cmd
  - you enable releasing multiple tid buffer at once
  - as soon as you allocate the tid you create an entry in the list flagging it
    as active
  - when you need to clean up a tid
    - you mark it as inactive in the list, then
    - try to process the whole list by
      - sending the reo command to release it
      - if it succeed you release it and remove the entry from the list
  - on core exit, you re process the list

What is bothering me with this is that when a vdev which has multiple sta goes
down, you will have a lot of those entries to push to the firmware at once:

  - So increasing the ring size / being able to release multiple entries at once
    in one reo cmd may or may not be enough to handle the burst. It seems
    that increasing those is just band aids on the underlying problem but I
    understand it's just to be on the safe side.
  - If entries do not get send/accepted by the firmware, we will need to wait
    for another station removal before retrying, which could be a wile.
  - Or in the worst case (one vdev going down and needing to release tid of
    all its stations) the more entries we have in the list the less likely we
    will be to be able to push the delete of all stations + the ones still in
    queue

So, on station removal, why not make just things completely async. Push the tid
deletes in a list and wake a workqueue that tries to push those to the firmware.
If the ring is full, retry periodically.

Thanks