Re: [PATCH v1 bpf-next/net 2/8] bpf: Add a bpf hook in __inet_accept().

Kuniyuki Iwashima <kuniyu@xxxxxxxxxx> · Tue, 26 Aug 2025 14:08:59 -0700

On Tue, Aug 26, 2025 at 1:07 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
>
> On 8/25/25 5:23 PM, Kuniyuki Iwashima wrote:
> > From: Martin KaFai Lau <martin.lau@xxxxxxxxx>
> > Date: Mon, 25 Aug 2025 16:14:35 -0700
> >> On 8/25/25 11:14 AM, Kuniyuki Iwashima wrote:
> >>> On Mon, Aug 25, 2025 at 10:57 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:
> >>>>
> >>>> On 8/22/25 3:17 PM, Kuniyuki Iwashima wrote:
> >>>>> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> >>>>> index ae83ecda3983..ab613abdfaa4 100644
> >>>>> --- a/net/ipv4/af_inet.c
> >>>>> +++ b/net/ipv4/af_inet.c
> >>>>> @@ -763,6 +763,8 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
> >>>>>                 kmem_cache_charge(newsk, gfp);
> >>>>>         }
> >>>>>
> >>>>> +     BPF_CGROUP_RUN_PROG_INET_SOCK_ACCEPT(newsk);
> >>>>> +
> >>>>>         if (mem_cgroup_sk_enabled(newsk)) {
> >>>>>                 int amt;
> >>>>>
> >>>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> >>>>> index 233de8677382..80df246d4741 100644
> >>>>> --- a/tools/include/uapi/linux/bpf.h
> >>>>> +++ b/tools/include/uapi/linux/bpf.h
> >>>>> @@ -1133,6 +1133,7 @@ enum bpf_attach_type {
> >>>>>         BPF_NETKIT_PEER,
> >>>>>         BPF_TRACE_KPROBE_SESSION,
> >>>>>         BPF_TRACE_UPROBE_SESSION,
> >>>>> +     BPF_CGROUP_INET_SOCK_ACCEPT,
> >>>>
> >>>> Instead of adding another hook, can the SK_BPF_MEMCG_SOCK_ISOLATED bit be
> >>>> inherited from the listener?
> >>>
> >>> Since e876ecc67db80 and d752a4986532c , we defer memcg allocation to
> >>> accept() because the child socket could be created during irq context with
> >>> unrelated cgroup.  This had another reason; if the listener was created in the
> >>> root cgroup and passed to a process under cgroup, child sockets would never
> >>> have sk_memcg if sk_memcg was inherited.
> >>>
> >>> So, the child's memcg is not always the same one with the listener's, and
> >>> we cannot rely on the listener's sk_memcg.
> >>
> >> I didn't mean to inherit the entire sk_memcg pointer. I meant to only inherit
> >> the SK_BPF_MEMCG_SOCK_ISOLATED bit.
> >
> > I didn't want to let the flag remain alone without accept() (error-prone
> > but works because we always check mem_cgroup_from_sk() before the bit)
> > and wanted to check mem_cgroup_sk_enabled() in setsockopt(), but if we
> > don't care, it will be doable with other hooks, PASSIVE_ESTABLISHED_CB
> > or bpf_iter etc.
>
> I think this could be a surprise to the user. imo, this is the implementation
> details that a bit of a pointer is used for the setsockopt purpose and a right
> one for perf reason. It does not necessary need to affect what the user can
> expect from setsockopt in listener. From the user pov, what the user can usually
> expect from setsockopt in the listener and gets copied to child?

Ah, somehow I was confused with allowing flagging before sk_memcg
is set and I didn't think of inheriting a flag from the listener.

Inheriting a flag from the listener and only allowing setsockopt()
from socket() would be the simplest way.

> Beside, the
> listener and the accept-or on different processes is one of the use case but not
> the only use case.
>
> >
> > ---8<---
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index a78356682442..9ef458fe706e 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5269,7 +5269,7 @@ static int sk_bpf_set_get_cb_flags(struct sock *sk, char *optval, bool getopt)
> >
> >   static int sk_bpf_set_get_memcg_flags(struct sock *sk, int *optval, bool getopt)
> >   {
> > -     if (!mem_cgroup_sk_enabled(sk))
> > +     if (!sk_has_account(sk))
> >               return -EOPNOTSUPP;
> >
> >       if (getopt) {
> > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > index e92dfca0a0ff..efae15d04306 100644
> > --- a/net/ipv4/af_inet.c
> > +++ b/net/ipv4/af_inet.c
> > @@ -760,7 +760,10 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new
> >       if (mem_cgroup_sockets_enabled &&
> >           (!IS_ENABLED(CONFIG_IP_SCTP) ||
> >            sk_is_tcp(newsk) || sk_is_mptcp(newsk))) {
> > +             unsigned short flags = mem_cgroup_sk_get_flags(newsk);
> > +
> >               mem_cgroup_sk_alloc(newsk);
> > +             mem_cgroup_sk_set_flags(newsk, flags);
> >               kmem_cache_charge(newsk, gfp);
> >       }
> >
> > ---8<---
> >
> >
> >>
> >> If it can only be done at accept, there is already an existing
> >> SEC("lsm_cgroup/socket_accept") hook. Take a look at
> >> tools/testing/selftests/bpf/progs/lsm_cgroup.c. The lsm socket_accept doesn't
> >> have access to the "newsock->sk" but it should have access to the "sock->sk", do
> >> bpf_setsockopt and then inherit by the newsock->sk (?)
> >>
> >> There are already quite enough cgroup-sk style hooks. I would prefer not to add
> >> another cgroup attach_type and instead see if some of the existing ones can be
> >> reused. There is also SEC("lsm/sock_graft").
> >
> > We need to do fixup below, so lsm_cgroup/socket_accept won't work
> > if we keep the code in __inet_accept().  We can move this after
> > lsm/sock_graft within __inet_accept().
> >
> > if (mem_cgroup_sk_isolated(newsk))
> >       sk_memory_allocated_sub(newsk, amt);
>
> If I read it correctly, lsm_cgroup/sock_graft should work but need to do the
> above sk_memory_allocated_sub() after the sock_graft and ...
>   >
> > But then, we cannot distinguish it with other hooks (lock_sock() &&
> > sk->sk_socket != NULL), and finally the fixup must be done dynamically
> > in setsockopt().
>
> ... need a way to disallow this SK_BPF_MEMCG_SOCK_ISOLATED bit being changed
> once the socket fd is visible to the user. The current approach is to use the
> observation in the owned_by_user and sk->sk_socket in the create and accept
> hook. [ unrelated, I am not sure about the owned_by_user check considering
> sol_socket_sockopt can be called from bh ].

[ my expectation was bh checks sock_owned_by_user() before
  processing packets and entering where bpf_setsockopt() can
  be called ]

>
> If it is needed, there are other ways to stop the SK_BPF_MEMCG_SOCK_ISOLATED
> being changed again once the fd is visible to user. e.g. there are still bits
> left in the sk_memcg pointer to freeze it at runtime. If doing it statically
> (i.e. at prog load time), it can probably return a different setsockopt_proto
> that can understand the SK_BPF_MEMCG_FLAGS.

I was thinking a kind of the latter, passing caller info to general
__bpf_setsockopt(), and gave up as it was ugly, but wrapping it
as different setsockopt_proto sounds good.

Then, we don't need to care about how to limit the caller context.