On 8/19/2025 12:53 PM, Jacob Keller wrote:
>
>
> On 8/19/2025 9:44 AM, Jesper Dangaard Brouer wrote:
>>
>>
>> On 19/08/2025 02.38, Jacob Keller wrote:
>>>
>>>
>>> On 8/18/2025 4:05 AM, Jesper Dangaard Brouer wrote:
>>>> On 15/08/2025 22.41, Tony Nguyen wrote:
>>>>> This has the advantage that we also no longer need to track or cache the
>>>>> number of fragments in the rx_ring, which saves a few bytes in the ring.
>>>>>
>>>>
>>>> Has anyone tested the performance impact for XDP_DROP?
>>>> (with standard non-multi-buffer frames)
>>>>
>>>> The below code change will always touch cache lines in the shared_info
>>>> area. Before, it was guarded by an xdp_buff_has_frags() check.
>>>>
>>>
>>> I did some basic testing with XDP_DROP previously using the xdp-bench
>>> tool, and do not recall noticing an issue. I don't recall the actual
>>> numbers now though, so I did some quick tests again.
>>>
>>> without patch...
>>>
>>> Client:
>>> $ iperf3 -u -c 192.168.93.1 -t86400 -l1200 -P20 -b5G
>>> ...
>>> [SUM]  10.00-10.33 sec   626 MBytes  16.0 Gbits/sec  546909
>>>
>>> Server:
>>> $ iperf3 -s -B 192.168.93.1%ens260f0
>>> [SUM]   0.00-10.00 sec  17.7 GBytes  15.2 Gbits/sec  0.011 ms
>>>        9712/15888183 (0.061%)  receiver
>>>
>>> $ xdp-bench drop ens260f0
>>> Summary      1,778,935 rx/s        0 err/s
>>> Summary      2,041,087 rx/s        0 err/s
>>> Summary      2,005,052 rx/s        0 err/s
>>> Summary      1,918,967 rx/s        0 err/s
>>>
>>> with patch...
>>>
>>> Client:
>>> $ iperf3 -u -c 192.168.93.1 -t86400 -l1200 -P20 -b5G
>>> ...
>>> [SUM]  78.00-78.90 sec  2.01 GBytes  19.1 Gbits/sec  1801284
>>>
>>> Server:
>>> $ iperf3 -s -B 192.168.93.1%ens260f0
>>> [SUM]  77.00-78.00 sec  2.14 GBytes  18.4 Gbits/sec  0.012 ms
>>>        9373/1921186 (0.49%)
>>>
>>> xdp-bench:
>>> $ xdp-bench drop ens260f0
>>> Dropping packets on ens260f0 (ifindex 8; driver ice)
>>> Summary      1,910,918 rx/s        0 err/s
>>> Summary      1,866,562 rx/s        0 err/s
>>> Summary      1,901,233 rx/s        0 err/s
>>> Summary      1,859,854 rx/s        0 err/s
>>> Summary      1,593,493 rx/s        0 err/s
>>> Summary      1,891,426 rx/s        0 err/s
>>> Summary      1,880,673 rx/s        0 err/s
>>> Summary      1,866,043 rx/s        0 err/s
>>> Summary      1,872,845 rx/s        0 err/s
>>>
>>>
>>> I ran a few times and it seemed to waffle a bit between 15 Gbit/sec and
>>> 20 Gbit/sec, with throughput varying regardless of which patch was
>>> applied. I actually tended to see slightly higher numbers with this fix
>>> applied, but it was not consistent and hard to measure.
>>>
>>
>> The above testing is not a valid XDP_DROP test.
>>
>
> Fair. I'm no XDP expert, so I have a lot to learn here :)
>
>> The packet generator needs to be much, much faster, as XDP_DROP is for
>> DDoS protection use-cases (one of Cloudflare's main products).
>>
>> I recommend using the pktgen script in the kernel tree:
>>   samples/pktgen/pktgen_sample03_burst_single_flow.sh
>>
>> Example:
>>   ./pktgen_sample03_burst_single_flow.sh -vi mlx5p2 -d 198.18.100.1 -m
>>   b4:96:91:ad:0b:09 -t $(nproc)
>>
>>
>>> without the patch:
>>
>> On my testlab with an AMD EPYC 9684X CPU (SRSO=IBPB), running:
>>  - sudo ./xdp-bench drop ice4   # (defaults to no-touch)
>>
>> XDP_DROP (with no-touch)
>>  Without patch : 54,052,300 rx/s = 18.50 nanosec/packet
>>  With the patch: 33,420,619 rx/s = 29.92 nanosec/packet
>>  Diff: 11.42 nanosec
>>
>
> Oof. Yeah, that's not good.
>
>> Using perf stat I can see an increase in cache-misses.
>>
>> The difference is smaller if we read packet data, running:
>>  - sudo ./xdp-bench drop ice4 --packet-operation read-data
>>
>> XDP_DROP (with read-data)
>>  Without patch : 27,200,683 rx/s = 36.76 nanosec/packet
>>  With the patch: 24,348,751 rx/s = 41.07 nanosec/packet
>>  Diff: 4.31 nanosec
>>
>> On this CPU we don't have DDIO/DCA, so we take a big hit reading the
>> packet data in XDP. This will be needed by our DDoS bpf_prog.
>> The nanosec diff isn't the same, so it seems this change can hide a
>> little behind the cache misses in the XDP bpf_prog.
>>
>>
>>> Without xdp-bench running the XDP program, top showed a CPU usage of
>>> 740% and ~86% idle.
>>
>> We don't want a scaling test for this. For this XDP_DROP/DDoS test we
>> want to target a single CPU. This is easiest done by generating a
>> single flow (hint: the pktgen script is called _single_flow). We want
>> to see a single CPU running ksoftirqd 100% of the time.
>>
>
> Ok.
>
>>>
>>> I'm not certain, but reading the helpers it might be correct to do
>>> something like this:
>>>
>>>     if (unlikely(xdp_buff_has_frags(xdp)))
>>>             nr_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
>>>     else
>>>             nr_frags = 1;
>>
>> Yes, that looks like a correct pattern.
>>

It looks like i40e has the same mistake, but perhaps it's less impacted
because of lower network speeds. This mistake crept in because
i40e_process_rx_buffs() (which I borrowed the same logic from)
unconditionally checks the shared info for nr_frags.

In actuality, this counts the number of fragments not counting the
initial descriptor, but the check in the loop body is aware of that and
accounts for it. Thus, I think what we really want here is to set
nr_frags to 0 if xdp_buff_has_frags() is false, not 1.

A helper function seems like the best solution, and I can submit a
change to i40e to fix that code, assuming I can measure the difference
there as well.

Thanks,
Jake
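
A minimal sketch of what such a helper might look like (the name
xdp_buff_nr_frags() is hypothetical; xdp_buff_has_frags() and
xdp_get_shared_info_from_buff() are the existing helpers from
include/net/xdp.h):

    /* Hypothetical helper: return the number of fragments beyond the
     * head buffer. It touches the shared_info cache lines only when
     * frags are actually present, which is the guard the unconditional
     * read in the patch lost.
     */
    static inline u32 xdp_buff_nr_frags(struct xdp_buff *xdp)
    {
            if (unlikely(xdp_buff_has_frags(xdp)))
                    return xdp_get_shared_info_from_buff(xdp)->nr_frags;

            /* nr_frags does not count the head buffer, so the
             * no-frags case is 0, not 1.
             */
            return 0;
    }

The ice (and i40e) Rx cleanup paths could then call this in place of the
unconditional xdp_get_shared_info_from_buff(xdp)->nr_frags read, keeping
the single-buffer fast path off the shared_info cache lines.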