Re: [RFC PATCH v2 bpf-next 0/3] bpf: cgroup: support writing and freezing cgroups from BPF

Djalal Harouni <tixxdz@xxxxxxxxx> · Wed, 27 Aug 2025 00:27:08 +0100

Hi Michal,

On 8/26/25 15:18, Michal Koutný wrote:
> Hi Djalal.
>
> On Mon, Aug 18, 2025 at 10:04:21AM +0100, Djalal Harouni <tixxdz@xxxxxxxxx> wrote:
>> This patch series adds support to write cgroup interfaces from BPF.
>>
>> It is useful to freeze a cgroup hierarchy on suspicious activity for
>> a more thorough analysis before killing it. Planned users of this
>> feature are: systemd and BPF tools where the cgroup hierarchy could
>> be a system service, user session, k8s pod or a container.
>
> Could you please give a more specific example of the "suspicious
> activity"? The last time (v1) it was referring to LSM hooks where such
> an asynchronous approach wasn't ideal.

It solves that case perfectly: you detect something, you fail the
security hook with -EPERM, and optionally you freeze the cgroup and
snapshot the runtime state.

Oh, I thought the attached example was an obvious one: customers want
to restrict bpf() usage per cgroup (a specific container/pod), so when
we detect a bpf() call from a cgroup that is not allowed, we fail it
and freeze the cgroup.

Take this and build on top: detect bash/shell exec or any other newly
dropped binaries, then fail and freeze the exec early at the
linux_binprm object checks.

> Also why couldn't all these tools execute the cgroup actions themselves
> through traditional userspace API?

- Freezing from BPF is better: it is less racy, since you don't need
  access to the corresponding cgroup fs and namespace. Not all tools
  run as a supervisor/container manager.
- bpf_send_signal() is not enough in some cases. What if you race with
  a task clone, for example? Freezing the cgroup hierarchy, or the one
  above it, is a catch-all.
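
For comparison, the traditional userspace path asked about above boils
down to writing to the cgroup v2 cgroup.freeze file. A minimal sketch,
assuming cgroup v2 and a caller that can reach the cgroup directory
(the helper names and any path passed in are hypothetical, not part of
this series):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Write a value to a cgroup v2 control file, e.g. "1" to cgroup.freeze.
 * cgroup_dir is supplied by the caller (hypothetical in this sketch).
 * Returns 0 on success or a negative errno value on failure.
 */
static int cgroup_write(const char *cgroup_dir, const char *file,
			const char *value)
{
	char path[4096];
	size_t len = strlen(value);
	ssize_t n;
	int plen, fd, ret = 0;

	plen = snprintf(path, sizeof(path), "%s/%s", cgroup_dir, file);
	if (plen < 0 || (size_t)plen >= sizeof(path))
		return -ENAMETOOLONG;

	/* This open() is the step that needs cgroup fs access. */
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, value, len);
	if (n < 0)
		ret = -errno;
	else if ((size_t)n != len)
		ret = -EIO;

	close(fd);
	return ret;
}

/* Request a freeze (freeze != 0) or a thaw of the whole subtree. */
static int cgroup_freeze(const char *cgroup_dir, int freeze)
{
	return cgroup_write(cgroup_dir, "cgroup.freeze", freeze ? "1" : "0");
}
```

A container manager would call cgroup_freeze() on its own cgroup
directory. A BPF tool that is not the manager may not be able to
resolve or open that path at all, which is the gap described above.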

> One more point (for possible interference with lifecycles) -- what is
> the relation between cgroup in which the BPF code "runs" and cgroup
> that's target of the operation? (I hope this isn't supposed to run from
> BPF without process context.)

The feature is meant to be used by sleepable BPF programs, so I don't
think we need extra checks here?

It could be that this BPF code runs in a process under
pod-x/container-y/cgroup-z/, and maybe you want to freeze "cgroup-z"
or "container-y" and so on. In the case of delegated hierarchies,
freezing the parent is a catch-all.

>> Todo:
>> * Limit size of data to be written.
>> * Further tests.
>> * Add cgroup kill support.

> I'm missing the retrieval of freeze result in this plan :) cgroup kill

Indeed, you are right: a small kfunc to read it back, yes ;) !

> would be simpler for PoC (and maybe even sufficient for your use case?).

I think both are useful cases.
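
For reference, userspace can already read the result back: the freeze
is asynchronous, and the "frozen" key in cgroup.events flips to 1 once
it completes. A minimal sketch of that check, assuming cgroup v2 (the
helper name is hypothetical, and this is not the kfunc discussed
above):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if cgroup.events under cgroup_dir reports "frozen 1",
 * 0 if it reports "frozen 0", -ENOENT if the key is missing, or
 * another negative errno value if the file cannot be opened.
 */
static int cgroup_is_frozen(const char *cgroup_dir)
{
	char path[4096], line[128];
	int len, val, ret = -ENOENT;
	FILE *f;

	len = snprintf(path, sizeof(path), "%s/cgroup.events", cgroup_dir);
	if (len < 0 || (size_t)len >= sizeof(path))
		return -ENAMETOOLONG;

	f = fopen(path, "r");
	if (!f)
		return -errno;

	/* The file is a flat list of "key value" lines. */
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "frozen %d", &val) == 1) {
			ret = !!val;
			break;
		}
	}

	fclose(f);
	return ret;
}
```

A caller that has just requested a freeze would check this until it
returns 1, or wait for the file's change notification instead of
polling.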

Thank you!

> Regards,
> Michal