Re: [PATCH bpf-next v4 2/5] bpf: Implement mprog API on top of existing cgroup progs

Yonghong Song <yonghong.song@xxxxxxxxx> · Fri, 6 Jun 2025 09:03:44 -0700

On 6/5/25 1:30 PM, Andrii Nakryiko wrote:
On Fri, May 30, 2025 at 10:38 AM Yonghong Song <yonghong.song@xxxxxxxxx> wrote:
Current cgroup prog ordering is appending at attachment time. This is not
ideal. In some cases, users want specific ordering at a particular cgroup
level. To address this, the existing mprog API seems an ideal solution with
supporting BPF_F_BEFORE and BPF_F_AFTER flags.

But there are a few obstacles to directly use kernel mprog interface.
Currently cgroup bpf progs already support prog attach/detach/replace
and link-based attach/detach/replace. For example, in struct
bpf_prog_array_item, the cgroup_storage field needs to be together
with bpf prog. But the mprog API struct bpf_mprog_fp only has bpf_prog
as the member, which makes it difficult to use kernel mprog interface.

In another case, the current cgroup prog detach tries to use the
same flag as in attach. This is different from mprog kernel interface
which uses flags passed from user space.

So to avoid modifying existing behavior, I made the following changes to
support mprog API for cgroup progs:
  - The support is for prog list at cgroup level. Cross-level prog list
    (a.k.a. effective prog list) is not supported.
  - Previously, BPF_F_PREORDER is supported only for prog attach, now
    BPF_F_PREORDER is also supported by link-based attach.
  - For attach, BPF_F_BEFORE/BPF_F_AFTER/BPF_F_ID/BPF_F_LINK is supported
    similar to kernel mprog but with different implementation.
  - For detach and replace, use the existing implementation.
  - For attach, detach and replace, the revision for a particular prog
    list, associated with a particular attach type, will be updated
    by increasing count by 1.

Signed-off-by: Yonghong Song <yonghong.song@xxxxxxxxx>
---
  include/uapi/linux/bpf.h       |   7 ++
  kernel/bpf/cgroup.c            | 197 +++++++++++++++++++++++++++++----
  kernel/bpf/syscall.c           |  43 ++++---
  tools/include/uapi/linux/bpf.h |   7 ++
  4 files changed, 216 insertions(+), 38 deletions(-)

[...]

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 9122c39870bf..bab580df5908 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -658,6 +658,131 @@ static struct bpf_prog_list *find_attach_entry(struct hlist_head *progs,
         return NULL;
  }

+static struct bpf_cgroup_link *bpf_get_anchor_link(u32 flags, u32 id_or_fd,
+                                                  enum bpf_prog_type type)
+{
+       struct bpf_link *link = ERR_PTR(-EINVAL);
+
+       if (flags & BPF_F_ID)
+               link = bpf_link_by_id(id_or_fd);
+       else if (id_or_fd)
+               link = bpf_link_get_from_fd(id_or_fd);
+       if (IS_ERR(link))
+               return ERR_PTR(PTR_ERR(link));
+       if (link->type != BPF_LINK_TYPE_CGROUP || link->prog->type != type) {
This check is a bit redundant and incomplete at the same time, so I'm
wondering if it's better to just drop this check?

It's redundant because if link or program is of wrong type, we won't
find it in cgroup's list of links/progs and return -ENOENT.

It's incomplete, because attach_type should be checked together with
prog_type. But again, this doesn't matter because correct prog_type
and incorrect attach_type would result in not finding the anchor link.

So I'm just thinking if it would be just better to not check type here
and let anchor prog/link logic return -ENOENT? WDYT?

Make sense. Will remove this 'if' condition.


+               bpf_link_put(link);
+               return ERR_PTR(-EINVAL);
+       }
+
+       return container_of(link, struct bpf_cgroup_link, link);
+}
+
+static struct bpf_prog *bpf_get_anchor_prog(u32 flags, u32 id_or_fd, enum bpf_prog_type type)
+{
+       struct bpf_prog *prog = ERR_PTR(-EINVAL);
+
+       if (flags & BPF_F_ID)
+               prog = bpf_prog_by_id(id_or_fd);
+       else if (id_or_fd)
+               prog = bpf_prog_get(id_or_fd);
+       if (IS_ERR(prog))
+               return prog;
+       if (prog->type != type) {
same as for links above, it would make sense to check attach_type as
well, but it ultimately doesn't matter because user will get
-ENOENT... It's just the inconsistency (-EINVAL if prog_type
mismatches, -ENOENT if attach_type mismatches), that makes we want to
not check this type info here at all...

The same as above, will remove this 'if' statement.


+               bpf_prog_put(prog);
+               return ERR_PTR(-EINVAL);
+       }
+
+       return prog;
+}
+
[...]

  #define BPF_PROG_ATTACH_LAST_FIELD expected_revision

  #define BPF_F_ATTACH_MASK_BASE \
@@ -4215,7 +4234,7 @@ static int bpf_prog_attach(const union bpf_attr *attr)
         if (bpf_mprog_supported(ptype)) {
                 if (attr->attach_flags & ~BPF_F_ATTACH_MASK_MPROG)
                         return -EINVAL;
-       } else {
+       } else if (!is_cgroup_prog_type(ptype, 0, false)) {
wouldn't we skip checking flags altogether for cgroup program types?  maybe:

if (is_cgroup_prog_type(...)) {
    /* check flags for cgroups */
} else {
    ...
}

Indeed, I missed flag checking for is_cgroup_prog_type(...) case. Will fix!


would be a safer pattern?

pw-bot: cr

                 if (attr->attach_flags & ~BPF_F_ATTACH_MASK_BASE)
                         return -EINVAL;
                 if (attr->relative_fd ||
@@ -4243,20 +4262,6 @@ static int bpf_prog_attach(const union bpf_attr *attr)
         case BPF_PROG_TYPE_FLOW_DISSECTOR:
                 ret = netns_bpf_prog_attach(attr, prog);
                 break;
-       case BPF_PROG_TYPE_CGROUP_DEVICE:
-       case BPF_PROG_TYPE_CGROUP_SKB:
-       case BPF_PROG_TYPE_CGROUP_SOCK:
-       case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
-       case BPF_PROG_TYPE_CGROUP_SOCKOPT:
-       case BPF_PROG_TYPE_CGROUP_SYSCTL:
-       case BPF_PROG_TYPE_SOCK_OPS:
-       case BPF_PROG_TYPE_LSM:
-               if (ptype == BPF_PROG_TYPE_LSM &&
-                   prog->expected_attach_type != BPF_LSM_CGROUP)
-                       ret = -EINVAL;
-               else
-                       ret = cgroup_bpf_prog_attach(attr, ptype, prog);
-               break;
         case BPF_PROG_TYPE_SCHED_CLS:
                 if (attr->attach_type == BPF_TCX_INGRESS ||
                     attr->attach_type == BPF_TCX_EGRESS)
@@ -4265,7 +4270,10 @@ static int bpf_prog_attach(const union bpf_attr *attr)
                         ret = netkit_prog_attach(attr, prog);
                 break;
         default:
-               ret = -EINVAL;
+               if (!is_cgroup_prog_type(ptype, prog->expected_attach_type, true))
+                       ret = -EINVAL;
+               else
+                       ret = cgroup_bpf_prog_attach(attr, ptype, prog);
         }

         if (ret)
@@ -4295,6 +4303,9 @@ static int bpf_prog_detach(const union bpf_attr *attr)
                         if (IS_ERR(prog))
                                 return PTR_ERR(prog);
                 }
+       } else if (is_cgroup_prog_type(ptype, 0, false)) {
+               if (attr->attach_flags || attr->relative_fd)
+                       return -EINVAL;
         } else if (attr->attach_flags ||
                    attr->relative_fd ||
                    attr->expected_revision) {
[...]