On 14/08/2025 16.42, Dragos Tatulea wrote:
On Thu, Aug 14, 2025 at 01:26:37PM +0200, Jesper Dangaard Brouer wrote:
On 13/08/2025 22.24, Dragos Tatulea wrote:
On Wed, Aug 13, 2025 at 07:26:49PM +0000, Dragos Tatulea wrote:
On Wed, Aug 13, 2025 at 01:53:48PM -0500, Chris Arges wrote:
On 2025-08-12 16:25:58, Chris Arges wrote:
On 2025-08-12 20:19:30, Dragos Tatulea wrote:
On Tue, Aug 12, 2025 at 11:55:39AM -0700, Jesse Brandeburg wrote:
On 8/12/25 8:44 AM, 'Dragos Tatulea' via kernel-team wrote:
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 482d284a1553..484216c7454d 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -408,8 +408,10 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
/* If not all frames have been transmitted, it is our
* responsibility to free them
*/
+ xdp_set_return_frame_no_direct();
for (i = sent; unlikely(i < to_send); i++)
xdp_return_frame_rx_napi(bq->q[i]);
+ xdp_clear_return_frame_no_direct();
Why can't this instead just be xdp_return_frame(bq->q[i]); with no
"no_direct" fussing?
Wouldn't this be the safest way for this function to complete the frames?
It seems wrong to presume that the calling context is NAPI.
That would indeed be better. Thanks for removing my blinders!
Once Chris verifies that this works for him I can prepare a fix patch.
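For context, the only difference between the two helpers is the napi_direct flag
they pass down to __xdp_return(); roughly (paraphrased, exact signatures vary
between kernel versions, frag handling omitted):

	/* net/core/xdp.c, paraphrased: */
	void xdp_return_frame(struct xdp_frame *xdpf)
	{
		/* napi_direct = false: regular page_pool return path */
		__xdp_return(xdpf->data, &xdpf->mem, false, NULL);
	}

	void xdp_return_frame_rx_napi(struct xdp_frame *xdpf)
	{
		/* napi_direct = true: lockless per-NAPI cache, only safe
		 * from the page_pool's own NAPI context
		 */
		__xdp_return(xdpf->data, &xdpf->mem, true, NULL);
	}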
Working on that now, I'm testing a kernel with the following change:
---
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 3aa002a47..ef86d9e06 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -409,7 +409,7 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
* responsibility to free them
*/
for (i = sent; unlikely(i < to_send); i++)
- xdp_return_frame_rx_napi(bq->q[i]);
+ xdp_return_frame(bq->q[i]);
out:
bq->count = 0;
This patch resolves the issue I was seeing, and I am no longer able to
reproduce it. I tested for about 2 hours, whereas the reproducer usually
triggers within 1-2 minutes.
Thanks! I will send a patch tomorrow and also add a Tested-by tag for you.
Looking at the code ... there are more cases we need to deal with if we
simply replace xdp_return_frame_rx_napi() with xdp_return_frame().
The normal way to fix this is to use the helpers:
 - xdp_set_return_frame_no_direct()
 - xdp_clear_return_frame_no_direct()
because the __xdp_return() code[1], via xdp_return_frame_no_direct(), will
disable those napi_direct requests.
[1] https://elixir.bootlin.com/linux/v6.16/source/net/core/xdp.c#L439
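For reference, the MEM_TYPE_PAGE_POOL case in __xdp_return() [1] does roughly
the following (paraphrased and simplified; newer kernels use netmem, but the
logic is the same):

	case MEM_TYPE_PAGE_POOL:
		/* Downgrade a "direct" return when the per-softirq no_direct
		 * flag is set (via xdp_set_return_frame_no_direct()), so the
		 * lockless per-NAPI cache is not touched from a foreign
		 * context:
		 */
		if (napi_direct && xdp_return_frame_no_direct())
			napi_direct = false;
		page_pool_put_full_page(page->pp, page, napi_direct);
		break;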
Something doesn't add up, because the remote CPUMAP bpf-prog that redirects
to veth runs in cpu_map_bpf_prog_run_xdp()[2], and that function
already uses the xdp_set_return_frame_no_direct() helper.
[2] https://elixir.bootlin.com/linux/v6.16/source/kernel/bpf/cpumap.c#L189
I see the bug now... attached a patch with the fix.
The "no_direct" scope forgot to wrap the xdp_do_flush() call.
Looks like the bug was introduced in commit 11941f8a8536 ("bpf: cpumap:
Implement generic cpumap") in v5.15.
Nice! Thanks for looking at this! Will you send the patch separately?
Yes, I will send it as an official patch.
I want to give both of you credit, so I'm considering adding these tags
to the patch description (WDYT):
Found-by: Dragos Tatulea <dtatulea@xxxxxxxxxx>
Reported-by: Chris Arges <carges@xxxxxxxxxxxxxx>
As follow-up work it would be good to have a way to catch this family of
issues, something along the lines of the patch below.
Yes, please, we want something that can catch these kinds of hard-to-find
bugs.
Will send a patch when I find some time.
Great! :-)
Thanks,
Dragos
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index f1373756cd0f..0c498fbd8df6 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -794,6 +794,10 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
{
lockdep_assert_no_hardirq();
+#ifdef CONFIG_PAGE_POOL_CACHEDEBUG
+ WARN(page_pool_napi_local(pool), "Page pool cache access from non-direct napi context");
I meant to negate the condition here.
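I.e., something along these lines (a sketch only; the allow_direct guard is an
addition here, not part of the original snippet, to avoid warning on ordinary
non-direct puts):

	#ifdef CONFIG_PAGE_POOL_CACHEDEBUG
		/* Catch callers that request the lockless (direct) cache from
		 * outside the pool's own NAPI/CPU context:
		 */
		WARN(allow_direct && !page_pool_napi_local(pool),
		     "page_pool: direct return from foreign napi/cpu context\n");
	#endif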
The XDP code has evolved since the xdp_set_return_frame_no_direct()
calls were added. Now page_pool keeps track of pp->napi and
pool->cpuid. Maybe the __xdp_return() [1] checks should be updated?
(And maybe that would allow us to remove the no_direct helpers.)
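For reference, page_pool already makes this decision by itself on the
non-direct path; roughly (paraphrased from net/core/page_pool.c in recent
kernels):

	/* In page_pool_put_unrefed_netmem(): if the caller did not ask for a
	 * direct return, upgrade it when we are running in the pool's own
	 * NAPI context (or on pool->cpuid):
	 */
	if (!allow_direct)
		allow_direct = page_pool_napi_local(pool);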
So you mean to drop the napi_direct flag in __xdp_return() and let
page_pool_put_unrefed_netmem() decide, via page_pool_napi_local(), whether a
direct return should be used?
Yes, something like that, but I would like Kuba/Jakub's input, as IIRC
he introduced page_pool->cpuid and page_pool->napi.
There are some corner cases where we need to consider whether this is still
valid: if cpumap gets redirected to the *same* CPU as the "previous" NAPI
instance, which then makes page_pool->cpuid match, is it then still valid to
do a "direct" return?
--Jesper