Jesper Dangaard Brouer <hawk@xxxxxxxxxx> writes:

> On 17/06/2025 17.10, Toke Høiland-Jørgensen wrote:
>>> Later we will look at using the vlan tag. Today we have disabled HW
>>> vlan-offloading, because XDP originally didn't support accessing HW vlan
>>> tags.
>>
>> Side note (with changed subject to disambiguate): Do you have any data
>> on the performance impact of disabling VLAN offload that you can share?
>> I've been sort of wondering whether saving those couple of bytes has any
>> measurable impact on real workloads (where you end up looking at the
>> headers anyway, so saving the cache miss doesn't matter so much)?
>>
>
> Our production setup have two different VLAN IDs, one for INTERNAL-ID
> and one for EXTERNAL-ID (Internet) traffic. On (many) servers this is
> on the same physical net_device.
>
> Our Unimog XDP load-balancer *only* handles EXTERNAL-ID. Thus, the very
> first thing Unimog does is checking the VLAN ID. If this doesn't match
> EXTERNAL-ID it returns XDP_PASS. This is the first time packet data
> area is read which (due to our AMD-CPUs) will be a cache-miss.
>
> If this were INTERNAL-ID then we have caused a cache-miss earlier than
> needed. The NIC driver have already started a net_prefetch. Thus, if
> we can return XDP_PASS without touching packet data, then we can
> (latency) hide part of the cache-miss (behind SKB-zero-ing). (We could
> also CPUMAP redirect the INTERNAL-ID to a remote CPU for further gains).
> Using the kfunc (bpf_xdp_metadata_rx_vlan_tag[1]) for reading VLAN ID
> doesn't touch/read packet data.
>
> I hope this makes it clear why reading the HW offloaded VLAN tag from
> the RX-descriptor is a performance benefit?

Right, I can certainly see the argument, but I was hoping you'd have
some data to quantify exactly how much of a difference this makes? :)

Also, I guess this XDP-based early demux is a bit special as far as this
use case is concerned? For regular net-stack usage of the VLAN field,
we'll already have touched the packet data while building the skb; so
the difference will be less, as it shouldn't be a cache miss. Which
doesn't invalidate your use case, of course, it just makes it
different...

-Toke
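
For anyone following along, the two code paths being compared can be
sketched roughly as below. This is a plain userspace C model, not BPF
code and not Unimog's implementation: struct rx_desc_model, enum
verdict and the EXTERNAL_VLAN_ID value of 100 are all made up for
illustration. In a real XDP program the descriptor path would go through
the bpf_xdp_metadata_rx_vlan_tag kfunc mentioned above, and the verdicts
would be XDP return codes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical verdicts standing in for XDP_PASS vs. "load-balance it". */
enum verdict { VERDICT_PASS, VERDICT_HANDLE };

#define EXTERNAL_VLAN_ID 100u   /* assumed value, for illustration only */
#define VLAN_VID_MASK    0x0fffu /* low 12 bits of the TCI are the VLAN ID */
#define ETH_P_8021Q      0x8100u /* 802.1Q TPID */

/* Hypothetical model of what the NIC reports in the RX descriptor when
 * VLAN offload is enabled (TCI already in host byte order). */
struct rx_desc_model {
	int      has_vlan;
	uint16_t vlan_tci;
};

/* Path A: VLAN offload on. The decision uses descriptor metadata only,
 * so the packet data cache line is never touched. */
static enum verdict classify_from_descriptor(const struct rx_desc_model *d)
{
	if (!d->has_vlan)
		return VERDICT_PASS;
	return (d->vlan_tci & VLAN_VID_MASK) == EXTERNAL_VLAN_ID ?
		VERDICT_HANDLE : VERDICT_PASS;
}

/* Path B: VLAN offload off. The tag sits in-line after the two MAC
 * addresses, so the first read of data[] is the potential cache miss. */
static enum verdict classify_from_packet(const uint8_t *data, size_t len)
{
	uint16_t proto, tci;

	if (len < 18) /* 14-byte Ethernet header + 4-byte VLAN tag */
		return VERDICT_PASS;

	proto = (uint16_t)((data[12] << 8) | data[13]);
	if (proto != ETH_P_8021Q)
		return VERDICT_PASS;

	tci = (uint16_t)((data[14] << 8) | data[15]);
	return (tci & VLAN_VID_MASK) == EXTERNAL_VLAN_ID ?
		VERDICT_HANDLE : VERDICT_PASS;
}
```

Both functions reach the same verdict for the same tag; the only
difference is which memory gets read to get there, which is the part
the question about measurements is aimed at.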