On Thu, May 15, 2025 at 03:47:33PM +0200, Alexander Mikhalitsyn wrote:
> On Thu, May 15, 2025 at 00:04, Christian Brauner
> <brauner@xxxxxxxxxx> wrote:
> >
> > Coredumping currently supports two modes:
> >
> > (1) Dumping directly into a file somewhere on the filesystem.
> > (2) Dumping into a pipe connected to a usermode helper process
> >     spawned as a child of the system_unbound_wq or kthreadd.
> >
> > For simplicity I'm mostly ignoring (1). There are probably still some
> > users of (1) out there but processing coredumps in this way can be
> > considered adventurous especially in the face of set*id binaries.
> >
> > The most common option should be (2) by now. It works by allowing
> > userspace to put a string into /proc/sys/kernel/core_pattern like:
> >
> >         |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
> >
> > The "|" at the beginning indicates to the kernel that a pipe must be
> > used. The path following the pipe indicator is a path to a binary that
> > will be spawned as a usermode helper process. Any additional parameters
> > pass information about the task that is generating the coredump to the
> > binary that processes the coredump.
> >
> > In the example core_pattern shown above systemd-coredump is spawned as a
> > usermode helper. There are various conceptual consequences of this
> > (non-exhaustive list):
> >
> > - systemd-coredump is spawned with file descriptor number 0 (stdin)
> >   connected to the read-end of the pipe. All other file descriptors are
> >   closed. That specifically includes 1 (stdout) and 2 (stderr). This has
> >   already caused bugs because userspace assumed that this cannot happen
> >   (Whether or not this is a sane assumption is irrelevant.)
> >
> > - systemd-coredump will be spawned as a child of system_unbound_wq. So
> >   it is not a child of any userspace process and specifically not a
> >   child of PID 1. It cannot be waited upon and is a weird hybrid
> >   upcall, which is difficult for userspace to control correctly.
> >
> > - systemd-coredump is spawned with full kernel privileges. This
> >   necessitates all kinds of weird privilege dropping exercises in
> >   userspace to make this safe.
> >
> > - A new usermode helper has to be spawned for each crashing process.
> >
> > This series adds a new mode:
> >
> > (3) Dumping into an AF_UNIX socket.
> >
> > Userspace can set /proc/sys/kernel/core_pattern to:
> >
> >         @/path/to/coredump.socket
> >
> > The "@" at the beginning indicates to the kernel that an AF_UNIX
> > coredump socket will be used to process coredumps.
> >
> > The coredump socket must be located in the initial mount namespace.
> > When a task coredumps it opens a client socket in the initial network
> > namespace and connects to the coredump socket.
> >
> > - The coredump server uses SO_PEERPIDFD to get a stable handle on the
> >   connected crashing task. The retrieved pidfd will provide a stable
> >   reference even if the crashing task gets SIGKILLed while generating
> >   the coredump.
> >
> > - By setting core_pipe_limit non-zero userspace can guarantee that the
> >   crashing task cannot be reaped behind its back and thus process all
> >   necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
> >   detect whether /proc/<pid> still refers to the same process.
> >
> >   The core_pipe_limit isn't used to rate-limit connections to the
> >   socket. This can simply be done via AF_UNIX sockets directly.
> >
> > - The pidfd for the crashing task will grow new information about how
> >   the task coredumps.
> >
> > - The coredump server should mark itself as non-dumpable.
> >
> > - A container coredump server in a separate network namespace can simply
> >   bind to another well-known address and systemd-coredump forwards
> >   coredumps to the container.
> >
> > - Coredumps could in the future also be handled via per-user/session
> >   coredump servers that run only with that user's privileges.
> >
> >   The coredump server listens on the coredump socket and accepts a
> >   new coredump connection. It then retrieves SO_PEERPIDFD for the
> >   client, inspects uid/gid and hands the accepted client to the user's
> >   own coredump handler which runs with the user's privileges only
> >   (It must of course pay close attention to not forward crashing suid
> >   binaries.)
> >
> > The new coredump socket will allow userspace to not have to rely on
> > usermode helpers for processing coredumps and provides a safer way to
> > handle them instead of relying on super privileged coredumping helpers
> > that have caused and continue to cause significant CVEs.
> >
> > This will also be significantly more lightweight since no fork()+exec()
> > for the usermodehelper is required for each crashing process. The
> > coredump server in userspace can e.g., just keep a worker pool.
> >
> > Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
>
> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@xxxxxxxxxxxxx>
>
> > ---
> >  fs/coredump.c       | 133 ++++++++++++++++++++++++++++++++++++++++++++++++----
> >  include/linux/net.h |   1 +
> >  net/unix/af_unix.c  |  53 ++++++++++++++++-----
> >  3 files changed, 166 insertions(+), 21 deletions(-)
> >
> > diff --git a/fs/coredump.c b/fs/coredump.c
> > index a70929c3585b..e1256ebb89c1 100644
> > --- a/fs/coredump.c
> > +++ b/fs/coredump.c
> > @@ -44,7 +44,11 @@
> >  #include <linux/sysctl.h>
> >  #include <linux/elf.h>
> >  #include <linux/pidfs.h>
> > +#include <linux/net.h>
> > +#include <linux/socket.h>
> > +#include <net/net_namespace.h>
> >  #include <uapi/linux/pidfd.h>
> > +#include <uapi/linux/un.h>
> >
> >  #include <linux/uaccess.h>
> >  #include <asm/mmu_context.h>
> > @@ -79,6 +83,7 @@ unsigned int core_file_note_size_limit = CORE_FILE_NOTE_SIZE_DEFAULT;
> >  enum coredump_type_t {
> >  	COREDUMP_FILE = 1,
> >  	COREDUMP_PIPE = 2,
> > +	COREDUMP_SOCK = 3,
> >  };
> >
> >  struct core_name {
> > @@ -232,13 +237,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
> >  	cn->corename = NULL;
> >  	if (*pat_ptr == '|')
> >  		cn->core_type = COREDUMP_PIPE;
> > +	else if (*pat_ptr == '@')
> > +		cn->core_type = COREDUMP_SOCK;
> >  	else
> >  		cn->core_type = COREDUMP_FILE;
> >  	if (expand_corename(cn, core_name_size))
> >  		return -ENOMEM;
> >  	cn->corename[0] = '\0';
> >
> > -	if (cn->core_type == COREDUMP_PIPE) {
> > +	switch (cn->core_type) {
> > +	case COREDUMP_PIPE: {
> >  		int argvs = sizeof(core_pattern) / 2;
> >  		(*argv) = kmalloc_array(argvs, sizeof(**argv), GFP_KERNEL);
> >  		if (!(*argv))
> > @@ -247,6 +255,33 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm,
> >  		++pat_ptr;
> >  		if (!(*pat_ptr))
> >  			return -ENOMEM;
> > +		break;
> > +	}
> > +	case COREDUMP_SOCK: {
> > +		/* skip the @ */
> > +		pat_ptr++;
>
> nit: I would do
>	if (!(*pat_ptr))
>		return -ENOMEM;
> as we do for the COREDUMP_PIPE case above,
> just in case something changes in cn_printf(), to eliminate any
> chance of crashes in there.

Ok.
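
For reference, the prefix handling with that check in place can be
sketched in plain userspace C roughly as below. This is only an
illustration of the logic discussed above, not the kernel code: the
helper name classify_core_pattern(), the CORE_* enum names and the -1
error return (the kernel path returns -ENOMEM) are all assumptions made
up for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the three modes in the quoted enum; names are hypothetical. */
enum core_type {
	CORE_FILE = 1,
	CORE_PIPE = 2,
	CORE_SOCK = 3,
};

/*
 * Hypothetical userspace helper, not kernel code: classify a
 * core_pattern string by its first character and report where the
 * path part starts.
 *
 * Returns the mode on success, or -1 if a '|' or '@' prefix is
 * followed by an empty string (the case the nit asks to reject early).
 */
static int classify_core_pattern(const char *pat, const char **path)
{
	assert(pat != NULL && path != NULL);

	switch (*pat) {
	case '|':
	case '@': {
		int type = (*pat == '|') ? CORE_PIPE : CORE_SOCK;

		pat++;			/* skip the prefix character */
		if (*pat == '\0')	/* nothing after the prefix */
			return -1;
		*path = pat;
		return type;
	}
	default:
		/* No prefix: the whole string is a file name pattern. */
		*path = pat;
		return CORE_FILE;
	}
}
```

Handling both prefixes in one place makes it hard for the pipe and
socket cases to drift apart on the empty-pattern check.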