On Wed, 14 May 2025 at 23:04, Christian Brauner <brauner@xxxxxxxxxx> wrote: > > Coredumping currently supports two modes: > > (1) Dumping directly into a file somewhere on the filesystem. > (2) Dumping into a pipe connected to a usermode helper process > spawned as a child of the system_unbound_wq or kthreadd. > > For simplicity I'm mostly ignoring (1). There's probably still some > users of (1) out there but processing coredumps in this way can be > considered adventurous especially in the face of set*id binaries. > > The most common option should be (2) by now. It works by allowing > userspace to put a string into /proc/sys/kernel/core_pattern like: > > |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h > > The "|" at the beginning indicates to the kernel that a pipe must be > used. The path following the pipe indicator is a path to a binary that > will be spawned as a usermode helper process. Any additional parameters > pass information about the task that is generating the coredump to the > binary that processes the coredump. > > In the example core_pattern shown above systemd-coredump is spawned as a > usermode helper. There's various conceptual consequences of this > (non-exhaustive list): > > - systemd-coredump is spawned with file descriptor number 0 (stdin) > connected to the read-end of the pipe. All other file descriptors are > closed. That specifically includes 1 (stdout) and 2 (stderr). This has > already caused bugs because userspace assumed that this cannot happen > (Whether or not this is a sane assumption is irrelevant.). > > - systemd-coredump will be spawned as a child of system_unbound_wq. So > it is not a child of any userspace process and specifically not a > child of PID 1. It cannot be waited upon and is in a weird hybrid > upcall which are difficult for userspace to control correctly. > > - systemd-coredump is spawned with full kernel privileges. This > necessitates all kinds of weird privilege dropping excercises in > userspace to make this safe. > > - A new usermode helper has to be spawned for each crashing process. > > This series adds a new mode: > > (3) Dumping into an abstract AF_UNIX socket. > > Userspace can set /proc/sys/kernel/core_pattern to: > > @/path/to/coredump.socket > > The "@" at the beginning indicates to the kernel that an AF_UNIX > coredump socket will be used to process coredumps. > > The coredump socket must be located in the initial mount namespace. > When a task coredumps it opens a client socket in the initial network > namespace and connects to the coredump socket. > > - The coredump server should use SO_PEERPIDFD to get a stable handle on > the connected crashing task. The retrieved pidfd will provide a stable > reference even if the crashing task gets SIGKILLed while generating > the coredump. > > - When a coredump connection is initiated use the socket cookie as the > coredump cookie and store it in the pidfd. The receiver can now easily > authenticate that the connection is coming from the kernel. > > Unless the coredump server expects to handle connection from > non-crashing task it can validate that the connection has been made from > a crashing task: > > fd_coredump = accept4(fd_socket, NULL, NULL, SOCK_CLOEXEC); > getsockopt(fd_coredump, SOL_SOCKET, SO_PEERPIDFD, &fd_peer_pidfd, &fd_peer_pidfd_len); > > struct pidfd_info info = { > info.mask = PIDFD_INFO_EXIT | PIDFD_INFO_COREDUMP, > }; > > ioctl(pidfd, PIDFD_GET_INFO, &info); > /* Refuse connections that aren't from a crashing task. */ > if (!(info.mask & PIDFD_INFO_COREDUMP) || !(info.coredump_mask & PIDFD_COREDUMPED) ) > close(fd_coredump); > > /* > * Make sure that the coredump cookie matches the connection cookie. > * If they don't it's not the coredump connection from the kernel. > * We'll get another connection request in a bit. > */ > getsocketop(fd_coredump, SOL_SOCKET, SO_COOKIE, &peer_cookie, &peer_cookie_len); > if (!info.coredump_cookie || (info.coredump_cookie != peer_cookie)) > close(fd_coredump); > > The kernel guarantees that by the time the connection is made the > coredump info is available. > > - By setting core_pipe_limit non-zero userspace can guarantee that the > crashing task cannot be reaped behind it's back and thus process all > necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to > detect whether /proc/<pid> still refers to the same process. > > The core_pipe_limit isn't used to rate-limit connections to the > socket. This can simply be done via AF_UNIX socket directly. > > - The pidfd for the crashing task will contain information how the task > coredumps. The PIDFD_GET_INFO ioctl gained a new flag > PIDFD_INFO_COREDUMP which can be used to retreive the coredump > information. > > If the coredump gets a new coredump client connection the kernel > guarantees that PIDFD_INFO_COREDUMP information is available. > Currently the following information is provided in the new > @coredump_mask extension to struct pidfd_info: > > * PIDFD_COREDUMPED is raised if the task did actually coredump. > * PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g., > undumpable). > * PIDFD_COREDUMP_USER is raised if this is a regular coredump and > doesn't need special care by the coredump server. > * PIDFD_COREDUMP_ROOT is raised if the generated coredump should be > treated as sensitive and the coredump server should restrict access > to the generated coredump to sufficiently privileged users. > > - The coredump server should mark itself as non-dumpable. > > - A container coredump server in a separate network namespace can simply > bind to another well-know address and systemd-coredump fowards > coredumps to the container. > > - Coredumps could in the future also be handled via per-user/session > coredump servers that run only with that users privileges. > > The coredump server listens on the coredump socket and accepts a > new coredump connection. It then retrieves SO_PEERPIDFD for the > client, inspects uid/gid and hands the accepted client to the users > own coredump handler which runs with the users privileges only > (It must of coure pay close attention to not forward crashing suid > binaries.). > > The new coredump socket will allow userspace to not have to rely on > usermode helpers for processing coredumps and provides a safer way to > handle them instead of relying on super privileged coredumping helpers. > > This will also be significantly more lightweight since no fork()+exec() > for the usermodehelper is required for each crashing process. The > coredump server in userspace can just keep a worker pool. > > Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx> > --- > Changes in v7: > - Use regular AF_UNIX sockets instead of abstract AF_UNIX sockets. This > fixes the permission problems as userspace can ensure that the socket > path cannot be rebound by arbitrary unprivileged userspace via regular > path permissions. > > This means: > - We don't require privilege checks on a reserved abstract AF_UNIX > namespace > - We don't require a fixed address for the coredump socket. > - We don't need to use abstract unix sockets at all. > - We don't need special socket cookie magic in the > /proc/sys/kernel/core_pattern handler. > - We are able to set /proc/sys/kernel/core_pattern statically without > having any socket bound. > > That's all complaints addressed. > > Simply massage unix_find_bsd() to be able to handle this and always > lookup the coredump socket in the initial mount namespace with > appropriate credentials. The same thing we do for looking up other > parts in the kernel like this. Only the lookup happens this way. > Actual connection credentials are obviously from the coredumping task. > - Link to v6: https://lore.kernel.org/20250512-work-coredump-socket-v6-0-c51bc3450727@xxxxxxxxxx > > Changes in v6: > - Use the socket cookie to verify the coredump server. > - Link to v5: https://lore.kernel.org/20250509-work-coredump-socket-v5-0-23c5b14df1bc@xxxxxxxxxx > > Changes in v5: > - Don't use a prefix just the specific address. > - Link to v4: https://lore.kernel.org/20250507-work-coredump-socket-v4-0-af0ef317b2d0@xxxxxxxxxx > > Changes in v4: > - Expose the coredump socket cookie through the pidfd. This allows the > coredump server to easily recognize coredump socket connections. > - Link to v3: https://lore.kernel.org/20250505-work-coredump-socket-v3-0-e1832f0e1eae@xxxxxxxxxx > > Changes in v3: > - Use an abstract unix socket. > - Add documentation. > - Add selftests. > - Link to v2: https://lore.kernel.org/20250502-work-coredump-socket-v2-0-43259042ffc7@xxxxxxxxxx > > Changes in v2: > - Expose dumpability via PIDFD_GET_INFO. > - Place COREDUMP_SOCK handling under CONFIG_UNIX. > - Link to v1: https://lore.kernel.org/20250430-work-coredump-socket-v1-0-2faf027dbb47@xxxxxxxxxx > > --- > Christian Brauner (9): > coredump: massage format_corname() > coredump: massage do_coredump() > coredump: reflow dump helpers a little > coredump: add coredump socket > pidfs, coredump: add PIDFD_INFO_COREDUMP > coredump: show supported coredump modes > coredump: validate socket name as it is written > selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure > selftests/coredump: add tests for AF_UNIX coredumps > > fs/coredump.c | 392 +++++++++++++---- > fs/pidfs.c | 79 ++++ > include/linux/net.h | 1 + > include/linux/pidfs.h | 10 + > include/uapi/linux/pidfd.h | 22 + > net/unix/af_unix.c | 60 ++- > tools/testing/selftests/coredump/stackdump_test.c | 514 +++++++++++++++++++++- > tools/testing/selftests/pidfd/pidfd.h | 23 + > 8 files changed, 996 insertions(+), 105 deletions(-) > --- > base-commit: 4dd6566b5a8ca1e8c9ff2652c2249715d6c64217 > change-id: 20250429-work-coredump-socket-87cc0f17729c Looks great to me and we can for sure use this in systemd-coredump, thanks, for the series: Acked-by: Luca Boccassi <luca.boccassi@xxxxxxxxx>