On Mon, May 5, 2025 at 1:14 PM Christian Brauner <brauner@xxxxxxxxxx> wrote: > Coredumping currently supports two modes: > > (1) Dumping directly into a file somewhere on the filesystem. > (2) Dumping into a pipe connected to a usermode helper process > spawned as a child of the system_unbound_wq or kthreadd. > > For simplicity I'm mostly ignoring (1). There's probably still some > users of (1) out there but processing coredumps in this way can be > considered adventurous especially in the face of set*id binaries. > > The most common option should be (2) by now. It works by allowing > userspace to put a string into /proc/sys/kernel/core_pattern like: > > |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h > > The "|" at the beginning indicates to the kernel that a pipe must be > used. The path following the pipe indicator is a path to a binary that > will be spawned as a usermode helper process. Any additional parameters > pass information about the task that is generating the coredump to the > binary that processes the coredump. > > In the example core_pattern shown above systemd-coredump is spawned as a > usermode helper. There's various conceptual consequences of this > (non-exhaustive list): > > - systemd-coredump is spawned with file descriptor number 0 (stdin) > connected to the read-end of the pipe. All other file descriptors are > closed. That specifically includes 1 (stdout) and 2 (stderr). This has > already caused bugs because userspace assumed that this cannot happen > (Whether or not this is a sane assumption is irrelevant.). > > - systemd-coredump will be spawned as a child of system_unbound_wq. So > it is not a child of any userspace process and specifically not a > child of PID 1. It cannot be waited upon and is in a weird hybrid > upcall which are difficult for userspace to control correctly. > > - systemd-coredump is spawned with full kernel privileges. This > necessitates all kinds of weird privilege dropping excercises in > userspace to make this safe. > > - A new usermode helper has to be spawned for each crashing process. > > This series adds a new mode: > > (3) Dumping into an abstract AF_UNIX socket. > > Userspace can set /proc/sys/kernel/core_pattern to: > > @linuxafsk/coredump_socket > > The "@" at the beginning indicates to the kernel that the abstract > AF_UNIX coredump socket will be used to process coredumps. > > The coredump socket uses the fixed address "linuxafsk/coredump.socket" > for now. > > The coredump socket is located in the initial network namespace. To bind > the coredump socket userspace must hold CAP_SYS_ADMIN in the initial > user namespace. Listening and reading can happen from whatever > unprivileged context is necessary to safely process coredumps. > > When a task coredumps it opens a client socket in the initial network > namespace and connects to the coredump socket. For now only tasks that > are acctually coredumping are allowed to connect to the initial coredump > socket. > > - The coredump server should use SO_PEERPIDFD to get a stable handle on > the connected crashing task. The retrieved pidfd will provide a stable > reference even if the crashing task gets SIGKILLed while generating > the coredump. > > - By setting core_pipe_limit non-zero userspace can guarantee that the > crashing task cannot be reaped behind it's back and thus process all > necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to > detect whether /proc/<pid> still refers to the same process. > > The core_pipe_limit isn't used to rate-limit connections to the > socket. This can simply be done via AF_UNIX socket directly. > > - The pidfd for the crashing task will contain information how the task > coredumps. The PIDFD_GET_INFO ioctl gained a new flag > PIDFD_INFO_COREDUMP which can be used to retreive the coredump > information. > > If the coredump gets a new coredump client connection the kernel > guarantees that PIDFD_INFO_COREDUMP information is available. > Currently the following information is provided in the new > @coredump_mask extension to struct pidfd_info: > > * PIDFD_COREDUMPED is raised if the task did actually coredump. > * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g., > undumpable). > * PIDFD_COREDUMP_USER is raised if this is a regular coredump and > doesn't need special care by the coredump server. > * IDFD_COREDUMP_ROOT is raised if the generated coredump should be > treated as sensitive and the coredump server should restrict to the > generated coredump to sufficiently privileged users. > > - Since unix_stream_connect() runs bpf programs during connect it's > possible to even redirect or multiplex coredumps to other sockets. Or change the userspace protocol used for containers such that the init-namespace coredumping helper forwards the FD it accept()ed into a container via SCM_RIGHTS... > - The coredump server should mark itself as non-dumpable. > To capture coredumps for the coredump server itself a bpf program > should be run at connect to redirect it to another socket in > userspace. This can be useful for debugging crashing coredump servers. > > - A container coredump server in a separate network namespace can simply > bind to linuxafsk/coredump.socket and systemd-coredump fowards > coredumps to the container. > > - Fwiw, one idea is to handle coredumps via per-user/session coredump > servers that run with that users privileges. > > The coredump server listens on the coredump socket and accepts a > new coredump connection. It then retrieves SO_PEERPIDFD for the > client, inspects uid/gid and hands the accepted client to the users > own coredump handler which runs with the users privileges only. (Though that would only be okay if it's not done for suid dumping cases.) > The new coredump socket will allow userspace to not have to rely on > usermode helpers for processing coredumps and provides a safer way to > handle them instead of relying on super privileged coredumping helpers. > > This will also be significantly more lightweight since no fork()+exec() > for the usermodehelper is required for each crashing process. The > coredump server in userspace can just keep a worker pool. I mean, if coredumping is a performance bottleneck, something is probably seriously wrong with the system... I don't think we need to optimize for execution speed in this area. > This is easy to test: > > (a) coredump processing (we're using socat): > > > cat coredump_socket.sh > #!/bin/bash > > set -x > > sudo bash -c "echo '@linuxafsk/coredump.socket' > /proc/sys/kernel/core_pattern" > sudo socat --statistics abstract-listen:linuxafsk/coredump.socket,fork FILE:core_file,create,append,trunc > > (b) trigger a coredump: > > user1@localhost:~/data/scripts$ cat crash.c > #include <stdio.h> > #include <unistd.h> > > int main(int argc, char *argv[]) > { > fprintf(stderr, "%u\n", (1 / 0)); > _exit(0); > } This looks pretty neat overall! > Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx> > --- > fs/coredump.c | 112 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--- > 1 file changed, 107 insertions(+), 5 deletions(-) > > diff --git a/fs/coredump.c b/fs/coredump.c > index 1779299b8c61..c60f86c473ad 100644 > --- a/fs/coredump.c > +++ b/fs/coredump.c > @@ -44,7 +44,11 @@ > #include <linux/sysctl.h> > #include <linux/elf.h> > #include <linux/pidfs.h> > +#include <linux/net.h> > +#include <linux/socket.h> > +#include <net/net_namespace.h> > #include <uapi/linux/pidfd.h> > +#include <uapi/linux/un.h> > > #include <linux/uaccess.h> > #include <asm/mmu_context.h> > @@ -79,6 +83,7 @@ unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT; > enum coredump_type_t { > COREDUMP_FILE = 1, > COREDUMP_PIPE = 2, > + COREDUMP_SOCK = 3, > }; > > struct core_name { > @@ -232,13 +237,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm, > cn->corename = NULL; > if (*pat_ptr == '|') > cn->core_type = COREDUMP_PIPE; > + else if (*pat_ptr == '@') > + cn->core_type = COREDUMP_SOCK; > else > cn->core_type = COREDUMP_FILE; > if (expand_corename(cn, core_name_size)) > return -ENOMEM; > cn->corename[0] = '\0'; > > - if (cn->core_type == COREDUMP_PIPE) { > + switch (cn->core_type) { > + case COREDUMP_PIPE: { > int argvs = sizeof(core_pattern) / 2; > (*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL); > if (!(*argv)) > @@ -247,6 +255,32 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm, > ++pat_ptr; > if (!(*pat_ptr)) > return -ENOMEM; > + break; > + } > + case COREDUMP_SOCK: { > + err = cn_printf(cn, "%s", pat_ptr); > + if (err) > + return err; > + > + /* > + * We can potentially allow this to be changed later but > + * I currently see no reason to. > + */ > + if (strcmp(cn->corename, "@linuxafsk/coredump.socket")) > + return -EINVAL; > + > + /* > + * Currently no need to parse any other options. > + * Relevant information can be retrieved from the peer > + * pidfd retrievable via SO_PEERPIDFD by the receiver or > + * via /proc/<pid>, using the SO_PEERPIDFD to guard > + * against pid recycling when opening /proc/<pid>. > + */ > + return 0; > + } > + default: > + WARN_ON_ONCE(cn->core_type != COREDUMP_FILE); > + break; > } > > /* Repeat as long as we have more pattern to process and more output I think the core_uses_pid logic at the end of this function needs to be adjusted to also exclude COREDUMP_SOCK? > @@ -583,6 +617,17 @@ static int umh_coredump_setup(struct subprocess_info *info, struct cred *new) > return 0; > } > > +#ifdef CONFIG_UNIX > +struct sockaddr_un coredump_unix_socket = { > + .sun_family = AF_UNIX, > + .sun_path = "\0linuxafsk/coredump.socket", > +}; Nit: Please make that static and const. > +/* Without trailing NUL byte. */ > +#define COREDUMP_UNIX_SOCKET_ADDR_SIZE \ > + (offsetof(struct sockaddr_un, sun_path) + \ > + sizeof("\0linuxafsk/coredump.socket") - 1) > +#endif > + > void do_coredump(const kernel_siginfo_t *siginfo) > { > struct core_state core_state; > @@ -801,6 +846,40 @@ void do_coredump(const kernel_siginfo_t *siginfo) > } > break; > } > + case COREDUMP_SOCK: { > + struct file *file __free(fput) = NULL; > +#ifdef CONFIG_UNIX > + struct socket *socket; > + > + /* > + * It is possible that the userspace process which is > + * supposed to handle the coredump and is listening on > + * the AF_UNIX socket coredumps. Userspace should just > + * mark itself non dumpable. > + */ > + > + retval = sock_create_kern(&init_net, AF_UNIX, SOCK_STREAM, 0, &socket); > + if (retval < 0) > + goto close_fail; > + > + file = sock_alloc_file(socket, 0, NULL); > + if (IS_ERR(file)) { > + sock_release(socket); > + retval = PTR_ERR(file); > + goto close_fail; > + } > + > + retval = kernel_connect(socket, > + (struct sockaddr *)(&coredump_unix_socket), > + COREDUMP_UNIX_SOCKET_ADDR_SIZE, 0); > + if (retval) > + goto close_fail; > + > + cprm.limit = RLIM_INFINITY; > +#endif The non-CONFIG_UNIX case here should probably bail out? > + cprm.file = no_free_ptr(file); > + break; > + } > default: > WARN_ON_ONCE(true); > retval = -EINVAL; > @@ -818,7 +897,10 @@ void do_coredump(const kernel_siginfo_t *siginfo) > * have this set to NULL. > */ > if (!cprm.file) { > - coredump_report_failure("Core dump to |%s disabled", cn.corename); > + if (cn.core_type == COREDUMP_PIPE) > + coredump_report_failure("Core dump to |%s disabled", cn.corename); > + else > + coredump_report_failure("Core dump to @%s disabled", cn.corename); Are you actually truncating the initial "@" off of cn.corename, or is this going to print two "@" characters? > goto close_fail; > } > if (!dump_vma_snapshot(&cprm)) > @@ -839,8 +921,28 @@ void do_coredump(const kernel_siginfo_t *siginfo) > file_end_write(cprm.file); > free_vma_snapshot(&cprm); > } > - if ((cn.core_type == COREDUMP_PIPE) && core_pipe_limit) > - wait_for_dump_helpers(cprm.file); > + > + if (core_pipe_limit) { > + switch (cn.core_type) { > + case COREDUMP_PIPE: > + wait_for_dump_helpers(cprm.file); > + break; > + case COREDUMP_SOCK: { > + char buf[1]; > + /* > + * We use a simple read to wait for the coredump > + * processing to finish. Either the socket is > + * closed or we get sent unexpected data. In > + * both cases, we're done. > + */ > + __kernel_read(cprm.file, buf, 1, NULL); > + break; > + } > + default: > + break; > + } > + } > + > close_fail: > if (cprm.file) > filp_close(cprm.file, NULL);