Re: [PATCH bpf-next v1 2/2] veristat: memory accounting for bpf programs

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Fri, 6 Jun 2025 11:19:21 -0700

On Fri, Jun 6, 2025 at 10:03 AM Eduard Zingerman <eddyz87@xxxxxxxxx> wrote:
>
> On Fri, 2025-06-06 at 17:53 +0100, Mykyta Yatsenko wrote:
>
> [...]
>
> > > +/*
> > > + * Creates a cgroup at /tmp/veristat-cgroup-mount-XXXXXX/accounting-<pid>,
> > > + * moves current process to this cgroup.
> > > + */
> > > +static int create_stat_cgroup(void)
> > > +{
> > > +   char buf[PATH_MAX + 1];
> > > +   int err;
> > > +
> > > +   if (!env.cgroup_fs_mount[0])
> > > +           return -1;
> > > +
> > > +   env.memory_peak_fd = -1;
> > > +
> > > +   snprintf_trunc(buf, sizeof(buf), "%s/accounting-%d", env.cgroup_fs_mount, getpid());
> > > +   err = mkdir(buf, 0777);
> > > +   if (err < 0) {
> > > +           err = log_errno("mkdir(%s)", buf);
> > > +           goto err_out;
> > > +   }
> > > +   strcpy(env.stat_cgroup, buf);
> > > +
> > > +   snprintf_trunc(buf, sizeof(buf), "%s/cgroup.procs", env.stat_cgroup);
> > > +   err = write_one_line(buf, "%d\n", getpid());
> > > +   if (err < 0) {
> > > +           err = log_errno("echo %d > %s", getpid(), buf);
> > > +           goto err_out;
> > > +   }
> > > +
> > > +   snprintf_trunc(buf, sizeof(buf), "%s/memory.peak", env.stat_cgroup);
> > > +   env.memory_peak_fd = open(buf, O_RDWR | O_APPEND);
> >
> > Why is it necessary to open in RW|APPEND mode? Won't O_RDONLY cut it?
>
> With current implementation -- not necessary, O_RDONLY should be sufficient.
>
> > > +   if (env.memory_peak_fd < 0) {
> > > +           err = log_errno("open(%s)", buf);
> > > +           goto err_out;
> > > +   }
> > > +
> > > +   return 0;
> > > +
> > > +err_out:
> > > +   destroy_stat_cgroup();
> > > +   return err;
> > > +}
>
> [...]
>
> > > +/* Current value of /tmp/veristat-cgroup-mount-XXXXXX/accounting-<pid>/memory.peak */
> > > +static long cgroup_memory_peak(void)
> > > +{
> > > +   long err, memory_peak;
> > > +   char buf[32];
> > > +
> > > +   if (env.memory_peak_fd < 0)
> > > +           return -1;
> > > +
> > > +   err = pread(env.memory_peak_fd, buf, sizeof(buf) - 1, 0);
> > > +   if (err <= 0) {
> > > +           log_errno("read(%s/memory.peak)", env.stat_cgroup);
> > > +           return -1;
> > > +   }
> > > +
> > > +   buf[err] = 0;
> >
> > nit: maybe rename err to len here?
>
> Sure, will rename.
>
> > > +   errno = 0;
> > > +   memory_peak = strtoll(buf, NULL, 10);
> > > +   if (errno) {
> > > +           log_errno("unrecognized %s/memory.peak format: %s", env.stat_cgroup, buf);
> > > +           return -1;
> > > +   }
> > > +
> > > +   return memory_peak;
> > > +}
> > > +
>
> [...]
>
> > > @@ -1332,7 +1551,16 @@ static int process_prog(const char *filename, struct bpf_object *obj, struct bpf
> > >     if (env.force_reg_invariants)
> > >             bpf_program__set_flags(prog, bpf_program__flags(prog) | BPF_F_TEST_REG_INVARIANTS);
> > >
> > > -   err = bpf_object__load(obj);
> > > +   err = bpf_object__prepare(obj);
> > > +   if (!err) {
> > > +           cgroup_err = create_stat_cgroup();
> > > +           mem_peak_a = cgroup_memory_peak();
> > > +           err = bpf_object__load(obj);
> > > +           mem_peak_b = cgroup_memory_peak();
> > > +           destroy_stat_cgroup();
> >
> > What if we do create_stat_cgroup/destory_stat_cgroup in
> > handle_verif_mode along with mount/umount_cgroupfs.
> > It may speed things up a little bit here and moving all cgroup
> > preparations into the single place seems reasonable.
> > Will memory counter behave differently? We are checking the difference
> > around bpf_object__load, from layman's
> > perspective it should be the same.
>
> The memory.peak file accounts for peak memory consumption, so one
> would need to reset this counter between program verifications.
> Doc [1] describes such mechanism:
>
>     memory.peak
>
>       A read-write single value file which exists on non-root cgroups.
>
>       The max memory usage recorded for the cgroup and its descendants
>       since either the creation of the cgroup or the most recent reset
>       for that FD.
>
>       A write of any non-empty string to this file resets it to the
>       current memory usage for subsequent reads through the same file
>       descriptor.
>
>     memory.reclaim
>
>       A write-only nested-keyed file which exists for all cgroups.
>
>       This is a simple interface to trigger memory reclaim in the target
>       cgroup.
>
>       Example:
>
>         echo "1G" > memory.reclaim
>
>       Please note that the kernel can over or under reclaim from the
>       target cgroup. If less bytes are reclaimed than the specified
>       amount, -EAGAIN is returned.
>
> As mentioned in cover letter, I tried using a combination of the above,
> while creating a cgroup only once. For reasons I don't understand this
> did not produce stable measurements. E.g. depending on a program being

Looking at memory_peak_write() in mm/memcontrol.c it looks reasonable
and should have worked (we do reset pc->local_watermark). But note if
(usage > peer_ctx->value) logic and /* initial write, register watcher
*/ comment. I'm totally guessing and speculating, but maybe you didn't
close and re-open the file in between and so you had stale "watcher"
with already recorded high watermark?..

I'd try again but be very careful what cgroup and at what point this
is being reset...

> verified in isolation or as part of a batch results vary from 5mb to 2mb.
>
> [1] https://docs.kernel.org/admin-guide/cgroup-v2.html
>
> > > +           if (!cgroup_err && mem_peak_a >= 0 && mem_peak_b >= 0)
> > > +                   mem_peak = mem_peak_b - mem_peak_a;
> > > +   }
> > >     env.progs_processed++;
> > >
> > >     stats->file_name = strdup(base_filename);
> > > @@ -1341,6 +1569,7 @@ static int process_prog(const char *filename, struct bpf_object *obj, struct bpf
>
> [...]
>
> > Acked-by: Mykyta Yatsenko <yatsenko@xxxxxxxx>
>
> Thank you. I will have to send a v2 avoiding mount operations and
> controllers file modification.
>