On 8/19/2025 12:53 PM, Jacob Keller wrote:
>
>
> On 8/19/2025 9:44 AM, Jesper Dangaard Brouer wrote:
>>
>>
>> On 19/08/2025 02.38, Jacob Keller wrote:
>>>
>>>
>>> On 8/18/2025 4:05 AM, Jesper Dangaard Brouer wrote:
>>>> On 15/08/2025 22.41, Tony Nguyen wrote:
>>>>> This has the advantage that we also no longer need to track or cache the
>>>>> number of fragments in the rx_ring, which saves a few bytes in the ring.
>>>>>
>>>>
>>>> Has anyone tested the performance impact for XDP_DROP?
>>>> (with standard non-multi-buffer frames)
>>>>
>>>> The below code change will always touch cache lines in the shared_info
>>>> area. Before, it was guarded by an xdp_buff_has_frags() check.
>>>>
>>>
>>> I did some basic testing with XDP_DROP previously using the xdp-bench
>>> tool, and do not recall noticing an issue. I don't recall the actual
>>> numbers now though, so I did some quick tests again.
>>>
>>> without patch...
>>>
>>> Client:
>>> $ iperf3 -u -c 192.168.93.1 -t86400 -l1200 -P20 -b5G
>>> ...
>>> [SUM]  10.00-10.33 sec   626 MBytes  16.0 Gbits/sec  546909
>>>
>>> Server:
>>> $ iperf3 -s -B 192.168.93.1%ens260f0
>>> [SUM]   0.00-10.00 sec  17.7 GBytes  15.2 Gbits/sec  0.011 ms
>>>        9712/15888183 (0.061%)  receiver
>>>
>>> $ xdp-bench drop ens260f0
>>> Summary      1,778,935 rx/s        0 err/s
>>> Summary      2,041,087 rx/s        0 err/s
>>> Summary      2,005,052 rx/s        0 err/s
>>> Summary      1,918,967 rx/s        0 err/s
>>>
>>> with patch...
>>>
>>> Client:
>>> $ iperf3 -u -c 192.168.93.1 -t86400 -l1200 -P20 -b5G
>>> ...
>>> [SUM]  78.00-78.90 sec  2.01 GBytes  19.1 Gbits/sec  1801284
>>>
>>> Server:
>>> $ iperf3 -s -B 192.168.93.1%ens260f0
>>> [SUM]  77.00-78.00 sec  2.14 GBytes  18.4 Gbits/sec  0.012 ms
>>>        9373/1921186 (0.49%)
>>>
>>> xdp-bench:
>>> $ xdp-bench drop ens260f0
>>> Dropping packets on ens260f0 (ifindex 8; driver ice)
>>> Summary      1,910,918 rx/s        0 err/s
>>> Summary      1,866,562 rx/s        0 err/s
>>> Summary      1,901,233 rx/s        0 err/s
>>> Summary      1,859,854 rx/s        0 err/s
>>> Summary      1,593,493 rx/s        0 err/s
>>> Summary      1,891,426 rx/s        0 err/s
>>> Summary      1,880,673 rx/s        0 err/s
>>> Summary      1,866,043 rx/s        0 err/s
>>> Summary      1,872,845 rx/s        0 err/s
>>>
>>>
>>> I ran a few times and it seemed to waffle a bit between 15 Gbit/sec and
>>> 20 Gbit/sec, with throughput varying regardless of which patch was
>>> applied. I actually tended to see slightly higher numbers with this fix
>>> applied, but it was not consistent and hard to measure.
>>>
>>
>> The above testing is not a valid XDP_DROP test.
>>
>
> Fair. I'm no XDP expert, so I have a lot to learn here :)
>
>> The packet generator needs to be much, much faster, as XDP_DROP is for
>> DDoS protection use-cases (one of Cloudflare's main products).
>>
>> I recommend using the pktgen script in the kernel tree:
>>   samples/pktgen/pktgen_sample03_burst_single_flow.sh
>>
>> Example:
>>   ./pktgen_sample03_burst_single_flow.sh -vi mlx5p2 -d 198.18.100.1 -m
>>   b4:96:91:ad:0b:09 -t $(nproc)
>>
>>
>>> without the patch:
>>
>> On my testlab with an AMD EPYC 9684X CPU (SRSO=IBPB), running:
>>  - sudo ./xdp-bench drop ice4   # (defaults to no-touch)
>>
>> XDP_DROP (with no-touch)
>>  Without patch : 54,052,300 rx/s = 18.50 nanosec/packet
>>  With the patch: 33,420,619 rx/s = 29.92 nanosec/packet
>>  Diff: 11.42 nanosec
>>
>
> Oof. Yeah, that's not good.
>
>> Using perf stat I can see an increase in cache-misses.
>>
>> The difference is smaller if we read packet data, running:
>>  - sudo ./xdp-bench drop ice4 --packet-operation read-data
>>
>> XDP_DROP (with read-data)
>>  Without patch : 27,200,683 rx/s = 36.76 nanosec/packet
>>  With the patch: 24,348,751 rx/s = 41.07 nanosec/packet
>>  Diff: 4.31 nanosec
>>
>> On this CPU we don't have DDIO/DCA, so we take a big hit reading the
>> packet data in XDP. This will be needed by our DDoS bpf_prog.
>> The nanosec diff isn't the same, so it seems this change can hide a
>> little behind the cache misses in the XDP bpf_prog.
>>
>>
>>> Without xdp-bench running the XDP program, top showed a CPU usage of
>>> 740% and ~86% idle.
>>
>> We don't want a scaling test for this. For this XDP_DROP/DDoS test we
>> want to target a single CPU. This is easiest done by generating a
>> single flow (hint: the pktgen script is called _single_flow). We want
>> to see a single CPU running ksoftirqd 100% of the time.
>>
>
> Ok.
>
>>>
>>> I'm not certain, but reading the helpers it might be correct to do
>>> something like this:
>>>
>>>     if (unlikely(xdp_buff_has_frags(xdp)))
>>>             nr_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
>>>     else
>>>             nr_frags = 1;
>>
>> Yes, that looks like a correct pattern.
>>

It looks like i40e has the same mistake, but perhaps it's less impacted
because of lower network speeds. This mistake crept in because
i40e_process_rx_buffs() (which I borrowed the same logic from)
unconditionally checks the shared info for nr_frags.

In actuality, this counts the number of fragments not counting the
initial descriptor, but the check in the loop body is aware of that and
accounts for it. Thus, I think what we really want here is to set
nr_frags to 0 if xdp_buff_has_frags() is false, not 1.

A helper function seems like the best solution, and I can submit a
change to i40e to fix that code, assuming I can measure the difference
there as well.

Thanks,
Jake
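
A minimal sketch of what such a helper might look like (the name
xdp_buff_nr_frags() is hypothetical; xdp_buff_has_frags() and
xdp_get_shared_info_from_buff() are the existing helpers from
include/net/xdp.h):

    /* Hypothetical helper: return the number of fragments beyond the
     * head buffer. It touches the shared_info cache lines only when
     * frags are actually present, which is the guard the unconditional
     * read in the patch lost.
     */
    static inline u32 xdp_buff_nr_frags(struct xdp_buff *xdp)
    {
            if (unlikely(xdp_buff_has_frags(xdp)))
                    return xdp_get_shared_info_from_buff(xdp)->nr_frags;

            /* nr_frags does not count the head buffer, so the
             * no-frags case is 0, not 1.
             */
            return 0;
    }

The ice (and i40e) Rx cleanup paths could then call this in place of the
unconditional xdp_get_shared_info_from_buff(xdp)->nr_frags read, keeping
the single-buffer fast path off the shared_info cache lines.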