On Tue, Aug 26, 2025 at 1:07 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote: > > On 8/25/25 5:23 PM, Kuniyuki Iwashima wrote: > > From: Martin KaFai Lau <martin.lau@xxxxxxxxx> > > Date: Mon, 25 Aug 2025 16:14:35 -0700 > >> On 8/25/25 11:14 AM, Kuniyuki Iwashima wrote: > >>> On Mon, Aug 25, 2025 at 10:57 AM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote: > >>>> > >>>> On 8/22/25 3:17 PM, Kuniyuki Iwashima wrote: > >>>>> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c > >>>>> index ae83ecda3983..ab613abdfaa4 100644 > >>>>> --- a/net/ipv4/af_inet.c > >>>>> +++ b/net/ipv4/af_inet.c > >>>>> @@ -763,6 +763,8 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new > >>>>> kmem_cache_charge(newsk, gfp); > >>>>> } > >>>>> > >>>>> + BPF_CGROUP_RUN_PROG_INET_SOCK_ACCEPT(newsk); > >>>>> + > >>>>> if (mem_cgroup_sk_enabled(newsk)) { > >>>>> int amt; > >>>>> > >>>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h > >>>>> index 233de8677382..80df246d4741 100644 > >>>>> --- a/tools/include/uapi/linux/bpf.h > >>>>> +++ b/tools/include/uapi/linux/bpf.h > >>>>> @@ -1133,6 +1133,7 @@ enum bpf_attach_type { > >>>>> BPF_NETKIT_PEER, > >>>>> BPF_TRACE_KPROBE_SESSION, > >>>>> BPF_TRACE_UPROBE_SESSION, > >>>>> + BPF_CGROUP_INET_SOCK_ACCEPT, > >>>> > >>>> Instead of adding another hook, can the SK_BPF_MEMCG_SOCK_ISOLATED bit be > >>>> inherited from the listener? > >>> > >>> Since e876ecc67db80 and d752a4986532c , we defer memcg allocation to > >>> accept() because the child socket could be created during irq context with > >>> unrelated cgroup. This had another reason; if the listener was created in the > >>> root cgroup and passed to a process under cgroup, child sockets would never > >>> have sk_memcg if sk_memcg was inherited. > >>> > >>> So, the child's memcg is not always the same one with the listener's, and > >>> we cannot rely on the listener's sk_memcg. > >> > >> I didn't mean to inherit the entire sk_memcg pointer. I meant to only inherit > >> the SK_BPF_MEMCG_SOCK_ISOLATED bit. > > > > I didn't want to let the flag remain alone without accept() (error-prone > > but works because we always check mem_cgroup_from_sk() before the bit) > > and wanted to check mem_cgroup_sk_enabled() in setsockopt(), but if we > > don't care, it will be doable with other hooks, PASSIVE_ESTABLISHED_CB > > or bpf_iter etc. > > I think this could be a surprise to the user. imo, this is the implementation > details that a bit of a pointer is used for the setsockopt purpose and a right > one for perf reason. It does not necessary need to affect what the user can > expect from setsockopt in listener. From the user pov, what the user can usually > expect from setsockopt in the listener and gets copied to child? Ah, somehow I was confused with allowing flagging before sk_memcg is set and I didn't think of inheriting a flag from the listener. Inheriting a flag from the listener and only allowing setsockopt() from socket() would be the simplest way. > Beside, the > listener and the accept-or on different processes is one of the use case but not > the only use case. > > > > > ---8<--- > > diff --git a/net/core/filter.c b/net/core/filter.c > > index a78356682442..9ef458fe706e 100644 > > --- a/net/core/filter.c > > +++ b/net/core/filter.c > > @@ -5269,7 +5269,7 @@ static int sk_bpf_set_get_cb_flags(struct sock *sk, char *optval, bool getopt) > > > > static int sk_bpf_set_get_memcg_flags(struct sock *sk, int *optval, bool getopt) > > { > > - if (!mem_cgroup_sk_enabled(sk)) > > + if (!sk_has_account(sk)) > > return -EOPNOTSUPP; > > > > if (getopt) { > > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c > > index e92dfca0a0ff..efae15d04306 100644 > > --- a/net/ipv4/af_inet.c > > +++ b/net/ipv4/af_inet.c > > @@ -760,7 +760,10 @@ void __inet_accept(struct socket *sock, struct socket *newsock, struct sock *new > > if (mem_cgroup_sockets_enabled && > > (!IS_ENABLED(CONFIG_IP_SCTP) || > > sk_is_tcp(newsk) || sk_is_mptcp(newsk))) { > > + unsigned short flags = mem_cgroup_sk_get_flags(newsk); > > + > > mem_cgroup_sk_alloc(newsk); > > + mem_cgroup_sk_set_flags(newsk, flags); > > kmem_cache_charge(newsk, gfp); > > } > > > > ---8<--- > > > > > >> > >> If it can only be done at accept, there is already an existing > >> SEC("lsm_cgroup/socket_accept") hook. Take a look at > >> tools/testing/selftests/bpf/progs/lsm_cgroup.c. The lsm socket_accept doesn't > >> have access to the "newsock->sk" but it should have access to the "sock->sk", do > >> bpf_setsockopt and then inherit by the newsock->sk (?) > >> > >> There are already quite enough cgroup-sk style hooks. I would prefer not to add > >> another cgroup attach_type and instead see if some of the existing ones can be > >> reused. There is also SEC("lsm/sock_graft"). > > > > We need to do fixup below, so lsm_cgroup/socket_accept won't work > > if we keep the code in __inet_accept(). We can move this after > > lsm/sock_graft within __inet_accept(). > > > > if (mem_cgroup_sk_isolated(newsk)) > > sk_memory_allocated_sub(newsk, amt); > > If I read it correctly, lsm_cgroup/sock_graft should work but need to do the > above sk_memory_allocated_sub() after the sock_graft and ... > > > > But then, we cannot distinguish it with other hooks (lock_sock() && > > sk->sk_socket != NULL), and finally the fixup must be done dynamically > > in setsockopt(). > > ... need a way to disallow this SK_BPF_MEMCG_SOCK_ISOLATED bit being changed > once the socket fd is visible to the user. The current approach is to use the > observation in the owned_by_user and sk->sk_socket in the create and accept > hook. [ unrelated, I am not sure about the owned_by_user check considering > sol_socket_sockopt can be called from bh ]. [ my expectation was bh checks sock_owned_by_user() before processing packets and entering where bpf_setsockopt() can be called ] > > If it is needed, there are other ways to stop the SK_BPF_MEMCG_SOCK_ISOLATED > being changed again once the fd is visible to user. e.g. there are still bits > left in the sk_memcg pointer to freeze it at runtime. If doing it statically > (i.e. at prog load time), it can probably return a different setsockopt_proto > that can understand the SK_BPF_MEMCG_FLAGS. I was thinking a kind of the latter, passing caller info to general __bpf_setsockopt(), and gave up as it was ugly, but wrapping it as different setsockopt_proto sounds good. Then, we don't need to care about how to limit the caller context.