On Thu, Aug 7, 2025 at 1:32 AM Nicolas Escande <nico.escande@xxxxxxxxx> wrote:
>
> On Wed Aug 6, 2025 at 1:17 PM CEST, Nithyanantham Paramasivam wrote:
> > During stress test scenarios, when the REO command ring becomes full,
> > the RX queue update command issued during peer deletion fails due to
> > insufficient space. In response, the host performs a dma_unmap and
> > frees the associated memory. However, the hardware still retains a
> > reference to the same memory address. If the kernel later reallocates
> > this address, unaware that the hardware is still using it, it can
> > lead to memory corruption, since the host might access or modify
> > memory that is still actively referenced by the hardware.
> >
> > Implement a retry mechanism for the HAL_REO_CMD_UPDATE_RX_QUEUE
> > command during TID deletion to prevent memory corruption. Introduce
> > a new list, reo_cmd_update_rx_queue_list, in the dp structure to
> > track pending RX queue updates. Protect this list with
> > reo_rxq_flush_lock, which also ensures synchronized access to
> > reo_cmd_cache_flush_list. Defer memory release until hardware
> > confirms the virtual address is no longer in use, avoiding immediate
> > deallocation on command failure. Release memory for pending RX queue
> > updates via ath12k_dp_rx_reo_cmd_list_cleanup() on system reset
> > if hardware confirmation is not received.
> >
> > Additionally, increase DP_REO_CMD_RING_SIZE to 256 to reduce the
> > likelihood of ring exhaustion. Use a 1KB cache flush command for
> > QoS TID descriptors to improve efficiency.
>
> Hello,
> I'm not sure I fully understand, so please correct me where I'm wrong,
> but from what I got looking at the code:
> - you increase the ring size for reo cmd
> - you enable releasing multiple tid buffers at once
> - as soon as you allocate the tid you create an entry in the list,
>   flagging it as active
> - when you need to clean up a tid
>   - you mark it as inactive in the list, then
>   - try to process the whole list by
>     - sending the reo command to release it
>     - if it succeeds, you release it and remove the entry from the list
> - on core exit, you re-process the list
>
> What is bothering me with this is that when a vdev which has multiple
> sta goes down, you will have a lot of those entries to push to the
> firmware at once:
>
> - So increasing the ring size / being able to release multiple entries
>   at once in one reo cmd may or may not be enough to handle the burst.
>   It seems that increasing those is just a band-aid on the underlying
>   problem, but I understand it's just to be on the safe side.
> - If entries do not get sent/accepted by the firmware, we will need to
>   wait for another station removal before retrying, which could be a
>   while.
> - Or in the worst case (one vdev going down and needing to release the
>   tids of all its stations), the more entries we have in the list, the
>   less likely we are to be able to push the delete of all stations +
>   the ones still in the queue.
>
> So, on station removal, why not just make things completely async: push
> the tid deletes into a list and wake a workqueue that tries to push
> those to the firmware. If the ring is full, retry periodically.
>
> Thanks

Hi Nicolas,

Thanks for the detailed observations and suggestions. You're right in
your understanding of the flow.
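To make the flow concrete, here is a minimal sketch of the bookkeeping.
The struct and field names below are hypothetical; only the list
reo_cmd_update_rx_queue_list and the lock reo_rxq_flush_lock are taken
from the patch description:

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct ath12k_dp {
	struct list_head reo_cmd_update_rx_queue_list;
	/* protects the list above and reo_cmd_cache_flush_list */
	spinlock_t reo_rxq_flush_lock;
	/* ... */
};

/* Hypothetical per-TID tracking entry. The RX queue descriptor it
 * refers to must not be freed until hardware confirms release. */
struct dp_reo_update_rx_queue_elem {
	struct list_head list;
	bool active;	/* cleared once the TID is queued for deletion */
	int peer_id;
	u8 tid;
};

/* TID setup: record the RX queue as active. */
static void dp_rx_tid_track(struct ath12k_dp *dp,
			    struct dp_reo_update_rx_queue_elem *elem)
{
	spin_lock_bh(&dp->reo_rxq_flush_lock);
	elem->active = true;
	list_add_tail(&elem->list, &dp->reo_cmd_update_rx_queue_list);
	spin_unlock_bh(&dp->reo_rxq_flush_lock);
}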
To clarify further: when the host fails to obtain a buffer from the
hardware to send a command because the REO command ring is full
(ath12k_hal_reo_cmd_send returning -ENOBUFS), we treat it as a command
send failure and avoid deleting the corresponding entry immediately.
This situation typically arises in the flow:

ath12k_dp_rx_process_reo_cmd_update_rx_queue_list
  -> ath12k_dp_rx_tid_delete_handler
  -> returns -ENOBUFS

Given that the command ring size is 256, each peer generally has 16
TIDs, and each TID requires 2 commands (RXQ update and cache flush),
deleting one peer involves up to 32 commands. This means we can delete
up to 8 peers (8 x 32 = 256 commands) before hitting the ring limit.
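Continuing the sketch above, the deferred-delete walk would look
roughly like this. The two function names and the -ENOBUFS behaviour
come from the flow described above; the bodies and locking are
illustrative only:

#include <linux/errno.h>

/* Issues HAL_REO_CMD_UPDATE_RX_QUEUE (and the follow-up cache flush)
 * for one TID; returns -ENOBUFS when the REO command ring is full. */
int ath12k_dp_rx_tid_delete_handler(struct ath12k_dp *dp,
				    struct dp_reo_update_rx_queue_elem *elem);

static void
ath12k_dp_rx_process_reo_cmd_update_rx_queue_list(struct ath12k_dp *dp)
{
	struct dp_reo_update_rx_queue_elem *elem, *tmp;

	spin_lock_bh(&dp->reo_rxq_flush_lock);
	list_for_each_entry_safe(elem, tmp,
				 &dp->reo_cmd_update_rx_queue_list, list) {
		if (elem->active)
			continue;	/* TID still in use */

		if (ath12k_dp_rx_tid_delete_handler(dp, elem) == -ENOBUFS)
			break;	/* ring full: keep entry, retry later */

		/* Command accepted; drop the tracking entry. The
		 * descriptor memory itself is freed only once HW
		 * confirms release. */
		list_del(&elem->list);
		kfree(elem);
	}
	spin_unlock_bh(&dp->reo_rxq_flush_lock);
}

Breaking out of the walk on -ENOBUFS (rather than skipping the entry)
keeps the remaining entries queued in order for the next retry pass.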