> > > Now I wonder if I should use the u8 sk_bpf_cb_flags in V13 or just
> > > keep it as-is? Either way is fine with me :) bpf_sock_ops_cb_flags
> > > uses u8 as an example, thus I think we prefer the former?
> >
> > If it fits in a u8 and that in practice also results in less memory
> > and cache pressure (i.e., does not just add a 24b hole), then it is a
> > net improvement.
>
> Probably I didn't state it clearly. I agree with you on saving memory :)
>
> In the previous response, I was trying to keep the sk_bpf_cb_flags
> flag and use a u8 instead. Now that you have pointed it out, I agree
> that u32 is too large.
>
> Would the following diff on top of this series be acceptable to you?
> And would this be a proper place to put the u8 sk_bpf_cb_flags in
> struct sock?
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 6f4d54faba92..e85d6fb3a2ba 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -447,7 +447,7 @@ struct sock {
>  	int			sk_forward_alloc;
>  	u32			sk_tsflags;
>  #define SK_BPF_CB_FLAG_TEST(SK, FLAG) ((SK)->sk_bpf_cb_flags & (FLAG))
> -	u32			sk_bpf_cb_flags;
> +	u8			sk_bpf_cb_flags;
>  	__cacheline_group_end(sock_write_rxtx);
>
>  	__cacheline_group_begin(sock_write_tx);
>
> The following output is the result of running 'pahole --hex -C sock vmlinux'.
>
> Before this series:
>
> 	u32                sk_tsflags;                                /* 0x168   0x4 */
> 	__u8               __cacheline_group_end__sock_write_rxtx[0]; /* 0x16c     0 */
> 	__u8               __cacheline_group_begin__sock_write_tx[0]; /* 0x16c     0 */
> 	int                sk_write_pending;                          /* 0x16c   0x4 */
> 	atomic_t           sk_omem_alloc;                             /* 0x170   0x4 */
> 	int                sk_sndbuf;                                 /* 0x174   0x4 */
> 	int                sk_wmem_queued;                            /* 0x178   0x4 */
> 	refcount_t         sk_wmem_alloc;                             /* 0x17c   0x4 */
> 	/* --- cacheline 6 boundary (384 bytes) --- */
> 	long unsigned int  sk_tsq_flags;                              /* 0x180   0x8 */
> 	...
> 	/* sum members: 773, holes: 1, sum holes: 1 */
>
> After this diff patch:
>
> 	u32                sk_tsflags;                                /* 0x168   0x4 */
> 	u8                 sk_bpf_cb_flags;                           /* 0x16c   0x1 */
> 	__u8               __cacheline_group_end__sock_write_rxtx[0]; /* 0x16d     0 */
> 	__u8               __cacheline_group_begin__sock_write_tx[0]; /* 0x16d     0 */
>
> 	/* XXX 3 bytes hole, try to pack */
>
> 	int                sk_write_pending;                          /* 0x170   0x4 */
> 	atomic_t           sk_omem_alloc;                             /* 0x174   0x4 */
> 	int                sk_sndbuf;                                 /* 0x178   0x4 */
> 	int                sk_wmem_queued;                            /* 0x17c   0x4 */
> 	/* --- cacheline 6 boundary (384 bytes) --- */
> 	refcount_t         sk_wmem_alloc;                             /* 0x180   0x4 */
>
> 	/* XXX 4 bytes hole, try to pack */
>
> 	long unsigned int  sk_tsq_flags;                              /* 0x188   0x8 */
> 	...
> 	/* sum members: 774, holes: 3, sum holes: 8 */
>
> So this series plus the u8 change introduces 7 extra bytes of holes. I
> think this is the proper position, because the new sk_bpf_cb_flags will
> be used in the tx and rx paths just like sk_tsflags, in line with the
> rules introduced by the commit[1].

Reducing the u32 to a u8 can still leave 7 bytes of holes, and that is
not great, of course. Since this bitmap is only touched if a BPF program
is loaded, arguably it need not be in the hot path cacheline groups at
all.

Can you find a hole further down to place it in, or at least a spot that
does not result in 7 bytes of wasted space (in the hot path cacheline
groups, of all places)?
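
To make the trade-off concrete, here is a toy, stand-alone program (not
kernel code): the member names only mimic the struct sock fields quoted
above, the shutdown/protocol/rcvlowat neighbours in the last struct are
stand-ins rather than a claim about the real layout, and the sizes
assume a common LP64 ABI such as x86-64. It shows that wedging a lone u8
between naturally aligned 4- and 8-byte members costs the same 7 bytes
of padding that pahole reports, while tucking it next to another
byte-sized field reuses padding that already exists.

	#include <stdint.h>
	#include <stdio.h>

	/* Rough stand-in for the write_rxtx/write_tx area before the series. */
	struct hot_before {
		uint32_t tsflags;
		int32_t  write_pending, omem_alloc, sndbuf, wmem_queued, wmem_alloc;
		uint64_t tsq_flags;
	};				/* 32 bytes, no holes */

	/* The u8 inserted after tsflags, as in the diff above: the 4-byte
	 * members shift and 3 + 4 = 7 bytes of padding appear. */
	struct hot_after {
		uint32_t tsflags;
		uint8_t  bpf_cb_flags;
		int32_t  write_pending, omem_alloc, sndbuf, wmem_queued, wmem_alloc;
		uint64_t tsq_flags;
	};				/* 40 bytes for 1 byte of payload */

	/* The same u8 placed next to another byte-sized field outside the
	 * hot area (hypothetical neighbours): it fills an existing hole. */
	struct cold_after {
		uint64_t tsq_flags;
		uint8_t  shutdown;
		uint8_t  bpf_cb_flags;
		uint16_t protocol;
		uint32_t rcvlowat;
	};				/* 16 bytes, same size with or without bpf_cb_flags */

	int main(void)
	{
		printf("hot_before: %zu\n", sizeof(struct hot_before));	/* 32 */
		printf("hot_after:  %zu\n", sizeof(struct hot_after));	/* 40 */
		printf("cold_after: %zu\n", sizeof(struct cold_after));	/* 16 */
		return 0;
	}

Any byte-sized hole outside the hot path groups would do; the pahole
output for your config is the authoritative guide to where such a hole
exists.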