From: Christian Brauner <brauner@xxxxxxxxxx> Date: Sat, 3 May 2025 07:17:10 +0200 > On Fri, May 02, 2025 at 10:23:44PM +0200, Jann Horn wrote: > > On Fri, May 2, 2025 at 10:11 PM Christian Brauner <brauner@xxxxxxxxxx> wrote: > > > On Fri, May 02, 2025 at 04:04:32PM +0200, Jann Horn wrote: > > > > On Fri, May 2, 2025 at 2:42 PM Christian Brauner <brauner@xxxxxxxxxx> wrote: > > > > > diff --git a/fs/coredump.c b/fs/coredump.c > > > > [...] > > > > > @@ -801,6 +841,73 @@ void do_coredump(const kernel_siginfo_t *siginfo) > > > > > } > > > > > break; > > > > > } > > > > > + case COREDUMP_SOCK: { > > > > > + struct file *file __free(fput) = NULL; > > > > > +#ifdef CONFIG_UNIX > > > > > + ssize_t addr_size; > > > > > + struct sockaddr_un unix_addr = { > > > > > + .sun_family = AF_UNIX, > > > > > + }; > > > > > + struct sockaddr_storage *addr; > > > > > + > > > > > + /* > > > > > + * TODO: We need to really support core_pipe_limit to > > > > > + * prevent the task from being reaped before userspace > > > > > + * had a chance to look at /proc/<pid>. > > > > > + * > > > > > + * I need help from the networking people (or maybe Oleg > > > > > + * also knows?) how to do this. > > > > > + * > > > > > + * IOW, we need to wait for the other side to shutdown > > > > > + * the socket/terminate the connection. > > > > > + * > > > > > + * We could just read but then userspace could sent us > > > > > + * SCM_RIGHTS and we just shouldn't need to deal with > > > > > + * any of that. > > > > > + */ > > > > > > > > I don't think userspace can send you SCM_RIGHTS if you don't do a > > > > recvmsg() with a control data buffer? > > > > > > Oh hm, then maybe just a regular read at the end would work. As soon as > > > userspace send us anything or we get a close event we just disconnect. > > > > > > But btw, I think we really need a recvmsg() flag that allows a receiver > > > to refuse SCM_RIGHTS/file descriptors from being sent to it. IIRC, right > > > now this is a real issue that systemd works around by always calling its > > > cmsg_close_all() helper after each recvmsg() to ensure that no one sent > > > it file descriptors it didn't want. The problem there is that someone > > > could have sent it an fd to a hanging NFS server or something and then > > > it would hang in close() even though it never even wanted any file > > > descriptors in the first place. > > > > Would a recvmsg() flag really solve that aspect of NFS hangs? By the > > time you read from the socket, the file is already attached to an SKB > > queued up on the socket, and cleaning up the file is your task's > > responsibility either way (which will either be done by the kernel for > > you if you don't read it into a control message, or by userspace if it > > was handed off through a control message). Right. recvmsg() is too late. Once sendmsg() is done, the last fput() responsibility could fall on the receiver. Btw, I was able to implement the cmsg_close_all() equivalent at sendmsg() with BPF LSM to completely remove the issue. I will send a series shortly and hope you like it :) > > The process that sent the > > file to you might already be gone, it can't be on the hook for > > cleaning up the file anymore. > > Hm, I guess the unix_gc() runs in task context? I had thought that it > might take care of that. Note that unix_gc() is a garbage collector only for AF_UNIX fds that have circular dependency: 1) AF_UNIX sk1 sends its fd to itself 2) AF_UNIX sk1 sends its fd to AF_UNIX sk2 and AF_UNIX sk2 sends its fd to AF_UNIX sk1 In these examples, file refcnts remain even after close() by all users of fds. So, the GC is not a mechanism to deligate fput() for fds sent by SCM_RIGHTS.