On Fri, 2025-06-06 at 11:19 -0700, Andrii Nakryiko wrote:

[...]

> Looking at memory_peak_write() in mm/memcontrol.c it looks reasonable
> and should have worked (we do reset pc->local_watermark). But note if
> (usage > peer_ctx->value) logic and /* initial write, register watcher
> */ comment. I'm totally guessing and speculating, but maybe you didn't
> close and re-open the file in between and so you had stale "watcher"
> with already recorded high watermark?..
>
> I'd try again but be very careful what cgroup and at what point this
> is being reset...

The way I read memcontrol.c:memory_peak_write(), it always transfers
the current memcg->memory (aka memory.current) to the ofp->value of
the currently open file (aka memory.peak). So this should work as the
documentation suggests: one needs to keep a single fd for memory.peak
open and periodically write something to it to reset the value (a
minimal sketch of this fd-handling pattern is included further below).

---

I tried several versions with selftests and scx BPF binaries:
- the version as in this patch-set, aka "many cg";
- a version with a single control group that writes to memory.reclaim
  and then to memory.peak between program verifications (while holding
  the same FDs for these files), aka "reset+reclaim"; the
  implementation is in [1];
- a version with a single control group, same as "reset+reclaim" but
  without the "reclaim" part, aka "reset only"; the implementation can
  be trivially derived from [1].

Here are the stats for each of the versions, where I try to gauge the
stability of the results. Each version was run twice and the generated
results were compared.

|                                    |         | one cg        | one cg     |        |
|                                    | many cg | reclaim+reset | reset only | master |
|------------------------------------+---------+---------------+------------+--------|
| SCX                                |         |               |            |        |
|------------------------------------+---------+---------------+------------+--------|
| running time (sec)                 |      48 |            50 |         46 |     43 |
| jitter mem_peak_diff!=0 (of 172)   |       3 |            93 |         80 |        |
| jitter mem_peak_diff>256 (of 172)  |       0 |             5 |          7 |        |
|------------------------------------+---------+---------------+------------+--------|
| selftests                          |         |               |            |        |
|------------------------------------+---------+---------------+------------+--------|
| running time (sec)                 |     108 |           140 |         90 |     86 |
| jitter mem_peak_diff!=0 (of 3601)  |     195 |          1751 |       1181 |        |
| jitter mem_peak_diff>256 (of 3601) |       1 |            22 |         14 |        |

- "jitter mem_peak_diff!=0" means that veristat was run two times and
  the results were compared to count the number of differences:
  `veristat -C -f "mem_peak_diff!=0" first-run.csv second-run.csv | wc -l`
- "jitter mem_peak_diff>256" is the same, but with the filter
  expression "mem_peak_diff>256", i.e. the difference is greater than
  256KiB.

The big jitter comes from `0->256KiB` and `256KiB->0` transitions
occurring for very small programs. There are a lot of such programs in
selftests.
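For reference, below is a simplified sketch of the single-cgroup fd
handling mentioned above (the "reset+reclaim" variant). This is not
the exact code from [1]; the helper names, the reclaim amount and the
error handling are made up for illustration:

/* Simplified sketch of the "reset+reclaim" fd handling (not the exact
 * code from [1]): memory.peak and memory.reclaim are opened once and
 * the same fds are reused between program verifications. */
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int peak_fd = -1, reclaim_fd = -1;

/* Open memory.{peak,reclaim} once; memory.peak resets are tracked per
 * open file description, so the same fd has to be kept around. */
static int memcg_open(const char *cg_path)
{
	char buf[PATH_MAX];

	snprintf(buf, sizeof(buf), "%s/memory.peak", cg_path);
	peak_fd = open(buf, O_RDWR);
	if (peak_fd < 0)
		return -errno;

	snprintf(buf, sizeof(buf), "%s/memory.reclaim", cg_path);
	reclaim_fd = open(buf, O_WRONLY);
	return reclaim_fd < 0 ? -errno : 0;
}

/* Between verifications: ask for best-effort reclaim, then reset the
 * watermark recorded for peak_fd by writing any non-empty string. */
static void memcg_reset(void)
{
	/* memory.reclaim returns an error if the requested amount
	 * cannot be fully reclaimed, which is fine here */
	(void)pwrite(reclaim_fd, "1G", 2, 0);
	(void)pwrite(peak_fd, "reset", 5, 0);
}

/* After verification: read back the high watermark in bytes. */
static long memcg_peak(void)
{
	char buf[64] = {};

	if (pread(peak_fd, buf, sizeof(buf) - 1, 0) < 0)
		return -errno;
	return strtol(buf, NULL, 10);
}

The point being illustrated is that both the reset and the subsequent
reads go through the same memory.peak fd; the "reset only" variant is
the same minus the memory.reclaim write.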
Comparison of the results quality between "many cg" and the other
variants (same metrics as above, but the CSVs being compared were
produced by different veristat versions):

|                                    | many cg          | many cg       |
|                                    | vs reset+reclaim | vs reset-only |
|------------------------------------+------------------+---------------|
| SCX                                |                  |               |
|------------------------------------+------------------+---------------|
| jitter mem_peak_diff!=0 (of 172)   |              108 |            70 |
| jitter mem_peak_diff>256 (of 172)  |                6 |             2 |
|------------------------------------+------------------+---------------|
| selftests                          |                  |               |
|------------------------------------+------------------+---------------|
| jitter mem_peak_diff!=0 (of 3601)  |             1885 |           942 |
| jitter mem_peak_diff>256 (of 3601) |               27 |            11 |

As can be seen, most of the differences in the collected stats are not
bigger than 256KiB.

---

Given the above, I'm inclined to stick with the "many cg" approach, as
it has less jitter and is reasonably performant. I need to wrap up the
parallel veristat version anyway (and "many cg" should be easier to
manage for a parallel run).

---

[1] https://github.com/eddyz87/bpf/tree/veristat-memory-accounting.one-cg

P.S. The only difference between [1] and my initial experiments is that
I used dprintf instead of pwrite to access memory.{peak,reclaim},
¯\_(ツ)_/¯.