On Fri, 2025-06-06 at 11:19 -0700, Andrii Nakryiko wrote:

[...]

> Looking at memory_peak_write() in mm/memcontrol.c it looks reasonable
> and should have worked (we do reset pc->local_watermark). But note if
> (usage > peer_ctx->value) logic and /* initial write, register watcher
> */ comment. I'm totally guessing and speculating, but maybe you didn't
> close and re-open the file in between and so you had stale "watcher"
> with already recorded high watermark?..
>
> I'd try again but be very careful what cgroup and at what point this
> is being reset...

The way I read memcontrol.c:memory_peak_write(), it always transfers
the current memcg->memory (aka memory.current) to the ofp->value of
the currently open file (aka memory.peak). So this should work as the
documentation suggests: one needs to keep a single fd for memory.peak
open and periodically write something to it to reset the value (a
minimal sketch of this fd-handling pattern is included further below).

---

I tried several versions with selftests and scx BPF binaries:
- the version as in this patch-set, aka "many cg";
- a version with a single control group that writes to memory.reclaim
  and then to memory.peak between program verifications (while holding
  the same FDs for these files), aka "reset+reclaim"; the
  implementation is in [1];
- a version with a single control group, same as "reset+reclaim" but
  without the "reclaim" part, aka "reset only"; the implementation can
  be trivially derived from [1].

Here are the stats for each of the versions, where I try to gauge the
stability of the results. Each version was run twice and the generated
results were compared.

|                                    |         | one cg        | one cg     |        |
|                                    | many cg | reclaim+reset | reset only | master |
|------------------------------------+---------+---------------+------------+--------|
| SCX                                |         |               |            |        |
|------------------------------------+---------+---------------+------------+--------|
| running time (sec)                 |      48 |            50 |         46 |     43 |
| jitter mem_peak_diff!=0 (of 172)   |       3 |            93 |         80 |        |
| jitter mem_peak_diff>256 (of 172)  |       0 |             5 |          7 |        |
|------------------------------------+---------+---------------+------------+--------|
| selftests                          |         |               |            |        |
|------------------------------------+---------+---------------+------------+--------|
| running time (sec)                 |     108 |           140 |         90 |     86 |
| jitter mem_peak_diff!=0 (of 3601)  |     195 |          1751 |       1181 |        |
| jitter mem_peak_diff>256 (of 3601) |       1 |            22 |         14 |        |

- "jitter mem_peak_diff!=0" means that veristat was run two times and
  the results were compared to count the number of differences:
  `veristat -C -f "mem_peak_diff!=0" first-run.csv second-run.csv | wc -l`
- "jitter mem_peak_diff>256" is the same, but with the filter
  expression "mem_peak_diff>256", i.e. the difference is greater than
  256KiB.

The big jitter comes from `0->256KiB` and `256KiB->0` transitions
occurring for very small programs. There are a lot of such programs in
selftests.
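For reference, below is a simplified sketch of the single-cgroup fd
handling mentioned above (the "reset+reclaim" variant). This is not
the exact code from [1]; the helper names, the reclaim amount and the
error handling are made up for illustration:

/* Simplified sketch of the "reset+reclaim" fd handling (not the exact
 * code from [1]): memory.peak and memory.reclaim are opened once and
 * the same fds are reused between program verifications. */
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int peak_fd = -1, reclaim_fd = -1;

/* Open memory.{peak,reclaim} once; memory.peak resets are tracked per
 * open file description, so the same fd has to be kept around. */
static int memcg_open(const char *cg_path)
{
	char buf[PATH_MAX];

	snprintf(buf, sizeof(buf), "%s/memory.peak", cg_path);
	peak_fd = open(buf, O_RDWR);
	if (peak_fd < 0)
		return -errno;

	snprintf(buf, sizeof(buf), "%s/memory.reclaim", cg_path);
	reclaim_fd = open(buf, O_WRONLY);
	return reclaim_fd < 0 ? -errno : 0;
}

/* Between verifications: ask for best-effort reclaim, then reset the
 * watermark recorded for peak_fd by writing any non-empty string. */
static void memcg_reset(void)
{
	/* memory.reclaim returns an error if the requested amount
	 * cannot be fully reclaimed, which is fine here */
	(void)pwrite(reclaim_fd, "1G", 2, 0);
	(void)pwrite(peak_fd, "reset", 5, 0);
}

/* After verification: read back the high watermark in bytes. */
static long memcg_peak(void)
{
	char buf[64] = {};

	if (pread(peak_fd, buf, sizeof(buf) - 1, 0) < 0)
		return -errno;
	return strtol(buf, NULL, 10);
}

The point being illustrated is that both the reset and the subsequent
reads go through the same memory.peak fd; the "reset only" variant is
the same minus the memory.reclaim write.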
Comparison of the results quality between "many cg" and the other
variants (same metrics as above, but the CSVs being compared were
produced by different veristat versions):

|                                    | many cg          | many cg       |
|                                    | vs reset+reclaim | vs reset-only |
|------------------------------------+------------------+---------------|
| SCX                                |                  |               |
|------------------------------------+------------------+---------------|
| jitter mem_peak_diff!=0 (of 172)   |              108 |            70 |
| jitter mem_peak_diff>256 (of 172)  |                6 |             2 |
|------------------------------------+------------------+---------------|
| selftests                          |                  |               |
|------------------------------------+------------------+---------------|
| jitter mem_peak_diff!=0 (of 3601)  |             1885 |           942 |
| jitter mem_peak_diff>256 (of 3601) |               27 |            11 |

As can be seen, most of the differences in the collected stats are not
bigger than 256KiB.

---

Given the above, I'm inclined to stick with the "many cg" approach, as
it has less jitter and is reasonably performant. I need to wrap up the
parallel veristat version anyway (and "many cg" should be easier to
manage for a parallel run).

---

[1] https://github.com/eddyz87/bpf/tree/veristat-memory-accounting.one-cg

P.S. The only difference between [1] and my initial experiments is that
I used dprintf instead of pwrite to access memory.{peak,reclaim},
¯\_(ツ)_/¯.