Re: [PATCH net 3/6] ice: fix Rx page leak on multi-buffer frames

On 19/08/2025 02.38, Jacob Keller wrote:


On 8/18/2025 4:05 AM, Jesper Dangaard Brouer wrote:
On 15/08/2025 22.41, Tony Nguyen wrote:
This has the advantage that we also no longer need to track or cache the
number of fragments in the rx_ring, which saves a few bytes in the ring.


Has anyone tested the performance impact for XDP_DROP?
(with standard non-multi-buffer frames)

The code change below will always touch cache lines in the shared_info
area. Before, this access was guarded with an xdp_buff_has_frags() check.


I did some basic testing with XDP_DROP previously using the xdp-bench
tool, and do not recall noticing an issue. I don't remember the actual
numbers now though, so I did some quick tests again.

without patch...

Client:
$ iperf3 -u -c 192.168.93.1 -t86400 -l1200 -P20 -b5G
...
[SUM]  10.00-10.33  sec   626 MBytes  16.0 Gbits/sec  546909

Server:
$ iperf3 -s -B 192.168.93.1%ens260f0
[SUM]   0.00-10.00  sec  17.7 GBytes  15.2 Gbits/sec  0.011 ms  9712/15888183 (0.061%)  receiver

xdp-bench:
$ xdp-bench drop ens260f0
Summary                 1,778,935 rx/s                  0 err/s
Summary                 2,041,087 rx/s                  0 err/s
Summary                 2,005,052 rx/s                  0 err/s
Summary                 1,918,967 rx/s                  0 err/s

with patch...

Client:
$ iperf3 -u -c 192.168.93.1 -t86400 -l1200 -P20 -b5G
...
[SUM]  78.00-78.90  sec  2.01 GBytes  19.1 Gbits/sec  1801284

Server:
$ iperf3 -s -B 192.168.93.1%ens260f0
[SUM]  77.00-78.00  sec  2.14 GBytes  18.4 Gbits/sec  0.012 ms  9373/1921186 (0.49%)

xdp-bench:
$ xdp-bench drop ens260f0
Dropping packets on ens260f0 (ifindex 8; driver ice)
Summary                 1,910,918 rx/s                  0 err/s
Summary                 1,866,562 rx/s                  0 err/s
Summary                 1,901,233 rx/s                  0 err/s
Summary                 1,859,854 rx/s                  0 err/s
Summary                 1,593,493 rx/s                  0 err/s
Summary                 1,891,426 rx/s                  0 err/s
Summary                 1,880,673 rx/s                  0 err/s
Summary                 1,866,043 rx/s                  0 err/s
Summary                 1,872,845 rx/s                  0 err/s


I ran it a few times and it seemed to waffle a bit between 15 Gbit/sec and
20 Gbit/sec, with throughput varying regardless of which patch was applied.
I actually tended to see slightly higher numbers with this fix applied, but
it was not consistent and was hard to measure.


The above testing is not a valid XDP_DROP test.

The packet generator needs to be much, much faster, as XDP_DROP is for
DDoS protection use-cases (one of Cloudflare's main products).

I recommend using the pktgen script in the kernel tree:
 samples/pktgen/pktgen_sample03_burst_single_flow.sh

Example:
./pktgen_sample03_burst_single_flow.sh -vi mlx5p2 -d 198.18.100.1 -m b4:96:91:ad:0b:09 -t $(nproc)


without the patch:

On my testlab with CPU: AMD EPYC 9684X (SRSO=IBPB) running:
 - sudo ./xdp-bench drop ice4  # (defaults to no-touch)

XDP_DROP (with no-touch)
 Without patch :  54,052,300 rx/s = 18.50 nanosec/packet
 With the patch:  33,420,619 rx/s = 29.92 nanosec/packet
 Diff: 11.42 nanosec
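
(For reference, the per-packet numbers appear to simply be the inverse of
the packet rate: 10^9 ns / 54,052,300 pps ≈ 18.50 ns, and
10^9 ns / 33,420,619 pps ≈ 29.92 ns.)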

Using perf stat I can see an increase in cache-misses.

The difference is smaller if we read packet data, running:
 - sudo ./xdp-bench drop ice4 --packet-operation read-data

XDP_DROP (with read-data)
 Without patch :  27,200,683 rx/s = 36.76 nanosec/packet
 With the patch:  24,348,751 rx/s = 41.07 nanosec/packet
 Diff: 4.31 nanosec

On this CPU we don't have DDIO/DCA, so we take a big hit reading the
packet data in XDP.  Reading the packet data will be needed by our DDoS
bpf_prog.  The nanosec diff isn't the same, so it seems this change can
hide a little behind the cache miss in the XDP bpf_prog.


Without xdp-bench running the XDP program, top showed a CPU usage of
740% and an idle score of ~86.


We don't want a scaling test for this. For this XDP_DROP/DDoS test we
want to target a single CPU. This is most easily done by generating a
single flow (hint: the pktgen script is called _single_flow). We want to
see a single CPU running ksoftirqd 100% of the time.

With xdp-bench running, the iperf CPU drops off the top listing and the
CPU idle score goes up to 99.9.


With the single flow generator DoS "attacking" a single CPU, we want to
see a single CPU running ksoftirqd 100% of the time for the XDP_DROP case.


with the patch:

The iperf3 CPU use seems to go up, but so does the throughput. It is
hard to get an isolated measure. I don't have a setup for fine-tuned
performance testing immediately available to do anything more rigorous.

Personally, I think it's overall in the noise, as I saw the same peak
performance and CPU usage with and without the patch.

I also tried testing TCP and didn't see a significant difference with or
without the patch. Though testing xdp-bench with TCP is not that useful,
since the client stops transmitting once the packets are dropped instead
of being handled.

$ iperf3 -c 192.168.93.1 -t86400 -l8000 -P5

Without patch:
[SUM]  24.00-25.00  sec  7.80 GBytes  67.0 Gbits/sec

With patch:
[SUM]  28.00-29.00  sec  7.85 GBytes  67.4 Gbits/sec

Again, it ranges from 60 to 68 Gbit/sec in both cases, though I think
the peak is slightly higher with the fix applied; sometimes I saw it
spike up to 70 Gbit/sec, but it mostly hovers around 67 Gbit/sec.

I'm sure there are a lot of factors impacting the performance here, but I
think there's not much evidence that it's significantly different.

Cc: Christoph Petrausch <christoph.petrausch@xxxxxxxxx>
Reported-by: Jaroslav Pulchart <jaroslav.pulchart@xxxxxxxxxxxx>
Closes: https://lore.kernel.org/netdev/CAK8fFZ4hY6GUJNENz3wY9jaYLZXGfpr7dnZxzGMYoE44caRbgw@xxxxxxxxxxxxxx/
Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
Signed-off-by: Jacob Keller <jacob.e.keller@xxxxxxxxx>
Tested-by: Rinitha S <sx.rinitha@xxxxxxxxx> (A Contingent worker at Intel)
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@xxxxxxxxx>
Tested-by: Maciej Fijalkowski <maciej.fijalkowski@xxxxxxxxx>
Tested-by: Priya Singh <priyax.singh@xxxxxxxxx>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@xxxxxxxxx>
---
   drivers/net/ethernet/intel/ice/ice_txrx.c | 81 +++++++++--------------
   drivers/net/ethernet/intel/ice/ice_txrx.h |  1 -
   2 files changed, 33 insertions(+), 49 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 29e0088ab6b2..93907ab2eac7 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -894,10 +894,6 @@ ice_add_xdp_frag(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
   	__skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++, rx_buf->page,
   				   rx_buf->page_offset, size);
   	sinfo->xdp_frags_size += size;
-	/* remember frag count before XDP prog execution; bpf_xdp_adjust_tail()
-	 * can pop off frags but driver has to handle it on its own
-	 */
-	rx_ring->nr_frags = sinfo->nr_frags;
if (page_is_pfmemalloc(rx_buf->page))
   		xdp_buff_set_frag_pfmemalloc(xdp);
@@ -968,20 +964,20 @@ ice_get_rx_buf(struct ice_rx_ring *rx_ring, const unsigned int size,
   /**
    * ice_get_pgcnts - grab page_count() for gathered fragments
    * @rx_ring: Rx descriptor ring to store the page counts on
+ * @ntc: the next to clean element (not included in this frame!)
    *
    * This function is intended to be called right before running XDP
    * program so that the page recycling mechanism will be able to take
    * a correct decision regarding underlying pages; this is done in such
    * way as XDP program can change the refcount of page
    */
-static void ice_get_pgcnts(struct ice_rx_ring *rx_ring)
+static void ice_get_pgcnts(struct ice_rx_ring *rx_ring, unsigned int ntc)
   {
-	u32 nr_frags = rx_ring->nr_frags + 1;
   	u32 idx = rx_ring->first_desc;
   	struct ice_rx_buf *rx_buf;
   	u32 cnt = rx_ring->count;
- for (int i = 0; i < nr_frags; i++) {
+	while (idx != ntc) {
   		rx_buf = &rx_ring->rx_buf[idx];
   		rx_buf->pgcnt = page_count(rx_buf->page);
@@ -1154,62 +1150,48 @@ ice_put_rx_buf(struct ice_rx_ring *rx_ring, struct ice_rx_buf *rx_buf)
   }
/**
- * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all frame frags
+ * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all buffers in frame
    * @rx_ring: Rx ring with all the auxiliary data
    * @xdp: XDP buffer carrying linear + frags part
- * @xdp_xmit: XDP_TX/XDP_REDIRECT verdict storage
- * @ntc: a current next_to_clean value to be stored at rx_ring
+ * @ntc: the next to clean element (not included in this frame!)
    * @verdict: return code from XDP program execution
    *
- * Walk through gathered fragments and satisfy internal page
- * recycle mechanism; we take here an action related to verdict
- * returned by XDP program;
+ * Called after XDP program is completed, or on error with verdict set to
+ * ICE_XDP_CONSUMED.
+ *
+ * Walk through buffers from first_desc to the end of the frame, releasing
+ * buffers and satisfying internal page recycle mechanism. The action depends
+ * on verdict from XDP program.
    */
   static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
-			    u32 *xdp_xmit, u32 ntc, u32 verdict)
+			    u32 ntc, u32 verdict)
   {
-	u32 nr_frags = rx_ring->nr_frags + 1;
+	u32 nr_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;

Here we unconditionally access the skb_shared_info area.

   	u32 idx = rx_ring->first_desc;
   	u32 cnt = rx_ring->count;
-	u32 post_xdp_frags = 1;
   	struct ice_rx_buf *buf;
-	int i;
-
-	if (unlikely(xdp_buff_has_frags(xdp)))

Previously we only touched the shared_info area if this was a multi-buff frame.


I'm not certain, but reading the helpers it might be correct to do
something like this:

if (unlikely(xdp_buff_has_frags(xdp)))
	nr_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
else
	nr_frags = 1;

Yes, that looks like a correct pattern.

either in the driver code or by adding a new xdp helper function.
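
For reference, a minimal sketch of what such an xdp helper could look like
(the helper name is hypothetical, and whether the single-buffer case should
count as 0 frags or as 1 buffer depends on how the caller in
ice_put_rx_mbuf() uses the value):

/* Hypothetical helper, e.g. for include/net/xdp.h: only dereference the
 * skb_shared_info area when the frame actually carries frags, so the
 * common single-buffer path does not pull in that extra cache line.
 */
static inline u32 xdp_buff_get_nr_frags(const struct xdp_buff *xdp)
{
	if (unlikely(xdp_buff_has_frags(xdp)))
		return xdp_get_shared_info_from_buff(xdp)->nr_frags;

	return 0;
}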

I'm not sure it's worth it, though. We have pending work from our
development team to refactor ice to use page pool and switch to the
libeth XDP helpers, which eliminates all of this driver-specific logic.

Please do proper testing of the XDP_DROP case when doing this change.

I don't personally think it's worth holding up this series and this
important memory leak fix for a minor potential code change that I can't
measure an obvious improvement from.

IMHO you included an optimization (that wasn't a gain) in a fix patch.
I think you can fix the memory leak without the "optimization" part.

pw-bot: cr

--Jesper
