On Mon, May 5, 2025 at 4:41 PM Mickaël Salaün <mic@xxxxxxxxxxx> wrote:
> On Mon, May 05, 2025 at 01:13:38PM +0200, Christian Brauner wrote:
> > Coredumping currently supports two modes:
> >
> > (1) Dumping directly into a file somewhere on the filesystem.
> > (2) Dumping into a pipe connected to a usermode helper process
> >     spawned as a child of the system_unbound_wq or kthreadd.
> >
> > For simplicity I'm mostly ignoring (1). There's probably still some
> > users of (1) out there but processing coredumps in this way can be
> > considered adventurous especially in the face of set*id binaries.
> >
> > The most common option should be (2) by now. It works by allowing
> > userspace to put a string into /proc/sys/kernel/core_pattern like:
> >
> >         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> >
> > The "|" at the beginning indicates to the kernel that a pipe must be
> > used. The path following the pipe indicator is a path to a binary that
> > will be spawned as a usermode helper process. Any additional parameters
> > pass information about the task that is generating the coredump to the
> > binary that processes the coredump.
> >
> > In the example core_pattern shown above systemd-coredump is spawned as a
> > usermode helper. There's various conceptual consequences of this
> > (non-exhaustive list):
> >
> > - systemd-coredump is spawned with file descriptor number 0 (stdin)
> >   connected to the read-end of the pipe. All other file descriptors are
> >   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> >   already caused bugs because userspace assumed that this cannot happen
> >   (Whether or not this is a sane assumption is irrelevant.).
> >
> > - systemd-coredump will be spawned as a child of system_unbound_wq. So
> >   it is not a child of any userspace process and specifically not a
> >   child of PID 1. It cannot be waited upon and is in a weird hybrid
> >   upcall which are difficult for userspace to control correctly.
> >
> > - systemd-coredump is spawned with full kernel privileges. This
> >   necessitates all kinds of weird privilege dropping exercises in
> >   userspace to make this safe.
> >
> > - A new usermode helper has to be spawned for each crashing process.
> >
> > This series adds a new mode:
> >
> > (3) Dumping into an abstract AF_UNIX socket.
> >
> > Userspace can set /proc/sys/kernel/core_pattern to:
> >
> >         @linuxafsk/coredump_socket
> >
> > The "@" at the beginning indicates to the kernel that the abstract
> > AF_UNIX coredump socket will be used to process coredumps.
> >
> > The coredump socket uses the fixed address "linuxafsk/coredump.socket"
> > for now.
> >
> > The coredump socket is located in the initial network namespace. To bind
> > the coredump socket userspace must hold CAP_SYS_ADMIN in the initial
> > user namespace. Listening and reading can happen from whatever
> > unprivileged context is necessary to safely process coredumps.
> >
> > When a task coredumps it opens a client socket in the initial network
> > namespace and connects to the coredump socket. For now only tasks that
> > are actually coredumping are allowed to connect to the initial coredump
> > socket.
>
> I think we should avoid using abstract UNIX sockets, especially for new
> interfaces, because it is hard to properly control such access. Can we
> create new dedicated AF_UNIX protocols instead? One could be used by a
> privileged process in the initial namespace to create a socket to
> collect coredumps, and the other could be dedicated to coredumped
> processes. Such (coredump collector) file descriptor or new (proxy)
> socketpair ones could be passed to containers.

I would agree with you if we were talking about designing a pure userspace thing; but I think the limits that Christian added on bind() and connect() to these special abstract names in this series effectively make it behave as if they were dedicated AF_UNIX protocols, and prevent things like random unprivileged userspace processes bind()ing to them.
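
For anyone following along who hasn't played with abstract sockets: the access-control concern is easy to see from userspace. Below is a minimal Python sketch of how an ordinary (unreserved) abstract AF_UNIX name behaves on Linux. It is only an illustration, not code from the series; the name "example/demo-<pid>.socket" and the helper bind_listener() are made up for this example, and it deliberately does not touch the reserved coredump address. No filesystem permissions apply to such a name, so whichever process binds first owns it and any process in the same network namespace can connect. That is the default behavior the extra bind()/connect() checks in the series are meant to close off for the coredump address.

```python
import errno
import os
import socket

# Hypothetical name for illustration; deliberately NOT the reserved coredump address.
NAME = f"example/demo-{os.getpid()}.socket".encode()
# A leading NUL byte selects the Linux abstract namespace (no filesystem entry).
ADDR = b"\0" + NAME


def bind_listener(addr: bytes) -> socket.socket:
    """Bind and listen on an abstract AF_UNIX stream socket."""
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        srv.bind(addr)
        srv.listen(1)
    except OSError:
        srv.close()
        raise
    return srv


# First binder wins: no file mode or ownership protects an abstract name.
collector = bind_listener(ADDR)

# A second bind to the same name is rejected while the first one is alive.
try:
    bind_listener(ADDR).close()
    second_bind_errno = None
except OSError as exc:
    second_bind_errno = exc.errno

# Any process in the same network namespace may connect and write.
client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(ADDR)
client.sendall(b"pretend core data")
client.close()

# The collector reads until the client side closes the connection.
conn, _ = collector.accept()
received = b""
while chunk := conn.recv(4096):
    received += chunk
conn.close()
collector.close()
```

Run as an unprivileged user, the bind succeeds and the second bind fails with EADDRINUSE, which is exactly the "random process squats the name" situation for a normal abstract socket.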