On Tue, Jul 8, 2025 at 2:52 PM David Rientjes <rientjes@xxxxxxxxxx> wrote:
>
> On Wed, 18 Jun 2025, Kent Overstreet wrote:
>
> > On Tue, Jun 10, 2025 at 05:30:53PM -0600, Casey Chen wrote:
> > > Add support for tracking per-NUMA node statistics in /proc/allocinfo.
> > > Previously, each alloc_tag had a single set of counters (bytes and
> > > calls), aggregated across all CPUs. With this change, each CPU can
> > > maintain separate counters for each NUMA node, allowing finer-grained
> > > memory allocation profiling.
> > >
> > > This feature is controlled by the new
> > > CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS option:
> > >
> > > * When enabled (=y), the output includes per-node statistics following
> > >   the total bytes/calls:
> > >
> > >   <size> <calls> <tag info>
> > >   ...
> > >   315456 9858 mm/dmapool.c:338 func:pool_alloc_page
> > >       nid0  94912 2966
> > >       nid1 220544 6892
> > >   7680 60 mm/dmapool.c:254 func:dma_pool_create
> > >       nid0 4224 33
> > >       nid1 3456 27
> >
> > I just received a report of memory reclaim issues where it seems DMA32
> > is stuffed full.
> >
> > So naturally, instrumenting to see what's consuming DMA32 is going to be
> > the first thing to do, which made me think of your patchset.
> >
> > I wonder if we should think about something a bit more general, so it's
> > easy to break out accounting different ways depending on what we want to
> > debug.
>
> Right, per-node memory attribution, or per zone, is very useful.
>
> Casey, what's the latest status of your patch? Using alloc_tag for
> attributing memory overheads has been exceedingly useful for Google Cloud,
> and adding better insight into per-node breakdown would be even better.
>
> Our use case is quite simple: we sell guest memory to the customer as
> persistent hugetlb and keep some memory on the host for ourselves (VMM,
> host userspace, host kernel).
> We track every page of that overhead memory
> because memory pressure here can cause all sorts of issues like userspace
> unresponsiveness. We also want to sell as much guest memory as possible
> to avoid stranding cpus.
>
> To do that, per-node breakdown of memory allocations would be a tremendous
> help. We have memory that is asymmetric for NUMA, even for memory that
> has affinity to the NIC. Being able to inspect the origins of memory for
> a specific NUMA node that is under memory pressure where other NUMA nodes
> are not under memory pressure would be excellent.
>
> Adding Sourav Panda as well, as he may have additional thoughts on this.

I agree with David, especially the point regarding NIC affinity. I was
dealing with a similar bug today, but pertaining to SSD, where this
patchset would have helped in the investigation.

That being said, I think pgalloc_tag_swap() has to be modified as well,
which gets called by __migrate_folio().