Alexei Starovoitov wrote:
> On Wed, Jun 18, 2025 at 6:50 AM Anton Protopopov
> <a.s.protopopov@xxxxxxxxx> wrote:
> >
> > On 25/06/16 10:38AM, Willem de Bruijn wrote:
> > > From: Willem de Bruijn <willemb@xxxxxxxxxx>
> > >
> > > BPF_MAP_TYPE_LRU_HASH can recycle most recent elements well before the
> > > map is full, due to percpu reservations and force shrink before
> > > neighbor stealing. Once a CPU is unable to borrow from the global map,
> > > it will steal one elem from a neighbor once, and from then on each time
> > > flush this one element to the global list and immediately recycle it.
> > >
> > > Batch value LOCAL_FREE_TARGET (128) will exhaust a 10K element map
> > > with 79 CPUs. CPU 79 will observe this behavior even while its
> > > neighbors hold 78 * 127 + 1 * 15 == 9921 free elements (99%).
> > >
> > > CPUs need not be active concurrently. The issue can appear with
> > > affinity migration, e.g., irqbalance. Each CPU can reserve and then
> > > hold onto its 128 elements indefinitely.
> > >
> > > Avoid global list exhaustion by limiting aggregate percpu caches to
> > > half of the map size, by adjusting LOCAL_FREE_TARGET based on cpu count.
> > > This change has no effect on sufficiently large tables.
> > >
> > > Similar to LOCAL_NR_SCANS and lru->nr_scans, introduce a map variable
> > > lru->free_target. The extra field fits in a hole in struct bpf_lru.
> > > The cacheline is already warm where read in the hot path. The field is
> > > only accessed with the lru lock held.
> >
> > Hi Willem! The patch looks very reasonable. I've bumped into this
> > issue before (see https://lore.kernel.org/bpf/ZJwy478jHkxYNVMc@zh-lab-node-5/)
> > but didn't follow up, as we typically have large enough LRU maps.
> >
> > I've tested your patch (with a patched map_tests/map_percpu_stats.c
> > selftest); it works as expected for small maps. E.g., before your patch
> > a map of size 4096 after being updated 2176 times from 32 threads on 32
> > CPUs contains around 150 elements, after your patch around (expected)
> > 2100 elements.
> >
> > Tested-by: Anton Protopopov <a.s.protopopov@xxxxxxxxx>
>
> Looks like we have consensus. Great.

Thanks for the reviews and testing. Good to have more data that the
issue is well understood and the approach helps.

> Willem,
> please target bpf tree when you respin.

Done: https://lore.kernel.org/bpf/20250618215803.3587312-1-willemdebruijn.kernel@xxxxxxxxx/T/#u
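
[Editor's note: for reference, below is a minimal standalone sketch of the
clamping idea described in the quoted commit message, i.e. scaling the
per-cpu refill batch by CPU count so aggregate percpu caches stay at or
below half the map. The helper name compute_free_target and the exact
clamp form are illustrative assumptions, not the actual kernel patch.]

#include <stdio.h>

#define LOCAL_FREE_TARGET 128	/* default per-cpu refill batch */

/*
 * Cap the per-cpu refill batch so that nr_cpus * target stays at or
 * below half the map, preventing per-cpu reservations alone from
 * exhausting the global free list. Illustrative sketch only.
 */
static unsigned int compute_free_target(unsigned int map_size,
					unsigned int nr_cpus)
{
	unsigned int target = map_size / nr_cpus / 2;

	if (target < 1)
		target = 1;
	if (target > LOCAL_FREE_TARGET)
		target = LOCAL_FREE_TARGET;
	return target;
}

int main(void)
{
	/* 10K-element map on 79 CPUs: target drops from 128 to 63. */
	printf("%u\n", compute_free_target(10000, 79));
	/* Large map: clamped back to the default 128, i.e. no change. */
	printf("%u\n", compute_free_target(1 << 20, 79));
	return 0;
}

With a sufficiently large map the computed value exceeds the default 128
and is clamped back to it, which matches the commit message's statement
that the change has no effect on sufficiently large tables.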