Re: [BUG] mlx5_core memory management issue

On 13/08/2025 22.24, Dragos Tatulea wrote:
On Wed, Aug 13, 2025 at 07:26:49PM +0000, Dragos Tatulea wrote:
On Wed, Aug 13, 2025 at 01:53:48PM -0500, Chris Arges wrote:
On 2025-08-12 16:25:58, Chris Arges wrote:
On 2025-08-12 20:19:30, Dragos Tatulea wrote:
On Tue, Aug 12, 2025 at 11:55:39AM -0700, Jesse Brandeburg wrote:
On 8/12/25 8:44 AM, 'Dragos Tatulea' via kernel-team wrote:

diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 482d284a1553..484216c7454d 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -408,8 +408,10 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
          /* If not all frames have been transmitted, it is our
           * responsibility to free them
           */
+       xdp_set_return_frame_no_direct();
          for (i = sent; unlikely(i < to_send); i++)
                  xdp_return_frame_rx_napi(bq->q[i]);
+       xdp_clear_return_frame_no_direct();

Why can't this instead just be xdp_return_frame(bq->q[i]); with no
"no_direct" fussing?

Wouldn't this be the safest way for this function to call frame completion?
It seems like presuming the calling context is napi is wrong?

It would be better indeed. Thanks for removing my horse glasses!

Once Chris verifies that this works for him I can prepare a fix patch.

Working on that now, I'm testing a kernel with the following change:

---

diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 3aa002a47..ef86d9e06 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -409,7 +409,7 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
          * responsibility to free them
          */
         for (i = sent; unlikely(i < to_send); i++)
-               xdp_return_frame_rx_napi(bq->q[i]);
+               xdp_return_frame(bq->q[i]);
out:
         bq->count = 0;

This patch resolves the issue I was seeing and I am no longer able to
reproduce it. I tested for about 2 hours, whereas the reproducer usually
triggers within 1-2 minutes.

Thanks! Will send a patch tomorrow and also add your Tested-by tag.


Looking at the code ... there are more cases we need to deal with if we
simply replace xdp_return_frame_rx_napi() with xdp_return_frame().

The normal way to fix this is to use the helpers:
 - xdp_set_return_frame_no_direct();
 - xdp_clear_return_frame_no_direct()

Because the __xdp_return() code[1] will, via xdp_return_frame_no_direct(),
disable those napi_direct requests.

 [1] https://elixir.bootlin.com/linux/v6.16/source/net/core/xdp.c#L439
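
For reference, these helpers simply set/clear a flag in the per-context
redirect state, which __xdp_return() tests before using the direct
page_pool return. Roughly (simplified from the v6.16 sources; helper and
field names from memory, see [1] for the exact code):

  /* include/net/xdp.h (simplified): flip a flag in the per-context
   * bpf_redirect_info; xdp_return_frame_no_direct() tests it.
   */
  static inline void xdp_set_return_frame_no_direct(void)
  {
          struct bpf_redirect_info *ri = bpf_net_ctx_get_ri();

          ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
  }

  /* net/core/xdp.c, __xdp_return(), MEM_TYPE_PAGE_POOL case (simplified):
   * when the flag is set, downgrade the direct (lockless per-CPU cache)
   * return to a normal page_pool return.
   */
  if (napi_direct && xdp_return_frame_no_direct())
          napi_direct = false;
  page_pool_put_full_netmem(netmem_get_pp(netmem), netmem, napi_direct);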

Something doesn't add up, because the remote CPUMAP bpf-prog that
redirects to veth runs in cpu_map_bpf_prog_run_xdp()[2], and that
function already uses the xdp_set_return_frame_no_direct() helper.

 [2] https://elixir.bootlin.com/linux/v6.16/source/kernel/bpf/cpumap.c#L189

I see the bug now... I've attached a patch with the fix.
The scope of the "no_direct" setting did not wrap the xdp_do_flush() call.

Looks like the bug was introduced in 11941f8a8536 ("bpf: cpumap: Implement generic cpumap") in v5.15.
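
To make the ordering concrete, with the fix the run loop looks roughly
like this (condensed sketch of cpu_map_bpf_prog_run(), matching the
attached patch; unrelated details elided):

  xdp_set_return_frame_no_direct();

  ret->xdp_n = cpu_map_bpf_prog_run_xdp(rcpu, frames, ret->xdp_n, stats);
  /* ... skb path ... */

  if (stats->redirect)
          xdp_do_flush();  /* bq_xmit_all() runs here and may free frames;
                            * it must still see the no_direct flag */

  xdp_clear_return_frame_no_direct();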

As follow-up work it would be good to have a way to catch this family
of issues, something along the lines of the patch below.


Yes, please, we want something that can catch these kinds of hard-to-find bugs.

Thanks,
Dragos

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index f1373756cd0f..0c498fbd8df6 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -794,6 +794,10 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
  {
         lockdep_assert_no_hardirq();
+#ifdef CONFIG_PAGE_POOL_CACHEDEBUG
+       WARN(page_pool_napi_local(pool), "Page pool cache access from non-direct napi context");
+#endif

I meant to negate the condition here.


The XDP code has evolved since the xdp_set_return_frame_no_direct()
calls were added.  Now page_pool keeps track of pp->napi and
pool->cpuid.  Maybe the __xdp_return [1] checks should be updated?
(And maybe that would allow us to remove the no_direct helpers.)
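
For context, page_pool already has such a local safety check,
page_pool_napi_local() in net/core/page_pool.c, which decides whether a
direct recycle is safe. Roughly (simplified sketch from memory, not
verbatim):

  /* Direct recycling is only considered safe when we are in softirq
   * context on the CPU that owns the pool's lockless cache.
   */
  static bool page_pool_napi_local(const struct page_pool *pool)
  {
          u32 cpuid;

          if (unlikely(!in_softirq()))
                  return false;

          cpuid = smp_processor_id();
          if (READ_ONCE(pool->cpuid) == cpuid)
                  return true;

          return pool->p.napi &&
                 READ_ONCE(pool->p.napi->list_owner) == cpuid;
  }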

--Jesper
cpumap: disabling page_pool direct xdp_return needs larger scope

From: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>

When running an XDP bpf_prog on the remote CPU in the cpumap code, we
must disable the direct-return optimization that xdp_return can perform
for mem_type page_pool.  This optimization assumes the code is still
executing under the RX-NAPI of the original receiving CPU, which isn't
true on the remote CPU.

The cpumap code already disabled this via the helpers
xdp_set_return_frame_no_direct() and xdp_clear_return_frame_no_direct(),
but the scope didn't include xdp_do_flush().

When doing XDP_REDIRECT towards e.g. devmap, this causes the function
bq_xmit_all() to run with the direct-return optimization enabled.  This
can lead to hard-to-find bugs.

Fix this by expanding the scope to include xdp_do_flush().

Fixes: 11941f8a8536 ("bpf: cpumap: Implement generic cpumap")
Signed-off-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
---
 kernel/bpf/cpumap.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index b2b7b8ec2c2a..c46360b27871 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -186,7 +186,6 @@ static int cpu_map_bpf_prog_run_xdp(struct bpf_cpu_map_entry *rcpu,
 	struct xdp_buff xdp;
 	int i, nframes = 0;
 
-	xdp_set_return_frame_no_direct();
 	xdp.rxq = &rxq;
 
 	for (i = 0; i < n; i++) {
@@ -231,7 +230,6 @@ static int cpu_map_bpf_prog_run_xdp(struct bpf_cpu_map_entry *rcpu,
 		}
 	}
 
-	xdp_clear_return_frame_no_direct();
 	stats->pass += nframes;
 
 	return nframes;
@@ -255,6 +253,7 @@ static void cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 
 	rcu_read_lock();
 	bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
+	xdp_set_return_frame_no_direct();
 
 	ret->xdp_n = cpu_map_bpf_prog_run_xdp(rcpu, frames, ret->xdp_n, stats);
 	if (unlikely(ret->skb_n))
@@ -264,6 +263,7 @@ static void cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 	if (stats->redirect)
 		xdp_do_flush();
 
+	xdp_clear_return_frame_no_direct();
 	bpf_net_ctx_clear(bpf_net_ctx);
 	rcu_read_unlock();
 
