Hi, On 4/29/25 5:26 AM, Mina Almasry wrote: > Augment dmabuf binding to be able to handle TX. Additional to all the RX > binding, we also create tx_vec needed for the TX path. > > Provide API for sendmsg to be able to send dmabufs bound to this device: > > - Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from. > - MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf. > > Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY > implementation, while disabling instances where MSG_ZEROCOPY falls back > to copying. > > We additionally pipe the binding down to the new > zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems > instead of the traditional page netmems. > > We also special case skb_frag_dma_map to return the dma-address of these > dmabuf net_iovs instead of attempting to map pages. > > The TX path may release the dmabuf in a context where we cannot wait. > This happens when the user unbinds a TX dmabuf while there are still > references to its netmems in the TX path. In that case, the netmems will > be put_netmem'd from a context where we can't unmap the dmabuf, Resolve > this by making __net_devmem_dmabuf_binding_free schedule_work'd. > > Based on work by Stanislav Fomichev <sdf@xxxxxxxxxxx>. A lot of the meat > of the implementation came from devmem TCP RFC v1[1], which included the > TX path, but Stan did all the rebasing on top of netmem/net_iov. > > Cc: Stanislav Fomichev <sdf@xxxxxxxxxxx> > Signed-off-by: Kaiyuan Zhang <kaiyuanz@xxxxxxxxxx> > Signed-off-by: Mina Almasry <almasrymina@xxxxxxxxxx> > Acked-by: Stanislav Fomichev <sdf@xxxxxxxxxxx> I'm sorry for the late feedback. A bunch of things I did not notice before... > @@ -701,6 +743,8 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk, > > if (msg && msg->msg_ubuf && msg->sg_from_iter) > ret = msg->sg_from_iter(skb, from, length); > + else if (unlikely(binding)) I'm unsure if the unlikely() here (and in similar tests below) it's worthy: depending on the actual workload this condition could be very likely. [...] > @@ -1066,11 +1067,24 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) > int flags, err, copied = 0; > int mss_now = 0, size_goal, copied_syn = 0; > int process_backlog = 0; > + bool sockc_valid = true; > int zc = 0; > long timeo; > > flags = msg->msg_flags; > > + sockc = (struct sockcm_cookie){ .tsflags = READ_ONCE(sk->sk_tsflags), > + .dmabuf_id = 0 }; the '.dmabuf_id = 0' part is not needed, and possibly the code is clearer without it. > + if (msg->msg_controllen) { > + err = sock_cmsg_send(sk, msg, &sockc); > + if (unlikely(err)) > + /* Don't return error until MSG_FASTOPEN has been > + * processed; that may succeed even if the cmsg is > + * invalid. > + */ > + sockc_valid = false; > + } > + > if ((flags & MSG_ZEROCOPY) && size) { > if (msg->msg_ubuf) { > uarg = msg->msg_ubuf; > @@ -1078,7 +1092,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) > zc = MSG_ZEROCOPY; > } else if (sock_flag(sk, SOCK_ZEROCOPY)) { > skb = tcp_write_queue_tail(sk); > - uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb)); > + uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb), > + sockc_valid && !!sockc.dmabuf_id); If sock_cmsg_send() failed and the user did not provide a dmabuf_id, memory accounting will be incorrect. > if (!uarg) { > err = -ENOBUFS; > goto out_err; > @@ -1087,12 +1102,27 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) > zc = MSG_ZEROCOPY; > else > uarg_to_msgzc(uarg)->zerocopy = 0; > + > + if (sockc_valid && sockc.dmabuf_id) { > + binding = net_devmem_get_binding(sk, sockc.dmabuf_id); > + if (IS_ERR(binding)) { > + err = PTR_ERR(binding); > + binding = NULL; > + goto out_err; > + } > + } > } > } else if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES) && size) { > if (sk->sk_route_caps & NETIF_F_SG) > zc = MSG_SPLICE_PAGES; > } > > + if (sockc_valid && sockc.dmabuf_id && > + (!(flags & MSG_ZEROCOPY) || !sock_flag(sk, SOCK_ZEROCOPY))) { > + err = -EINVAL; > + goto out_err; > + } > + > if (unlikely(flags & MSG_FASTOPEN || > inet_test_bit(DEFER_CONNECT, sk)) && > !tp->repair) { > @@ -1131,14 +1161,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size) > /* 'common' sending to sendq */ > } > > - sockc = (struct sockcm_cookie) { .tsflags = READ_ONCE(sk->sk_tsflags)}; > - if (msg->msg_controllen) { > - err = sock_cmsg_send(sk, msg, &sockc); > - if (unlikely(err)) { > - err = -EINVAL; > - goto out_err; > - } > - } > + if (!sockc_valid) > + goto out_err; Here 'err' could have been zeroed by tcp_sendmsg_fastopen(), and out_err could emit a wrong return value. Possibly it's better to keep the 'dmabuf_id' initialization out of sock_cmsg_send() in a separate helper could simplify the handling here? /P