On Wed, Sep 03, 2025 at 08:38:03PM +0000, Tom Hromatka wrote: > Add an operation, SECCOMP_CLONE_FILTER, that can copy the seccomp filters > from another process to the current process. > > I roughly reproduced the Docker seccomp filter [1] and timed how long it > takes to build it (via libseccomp) and attach it to a process. After > 1000 runs, on average it took 3,740,000 TSC ticks (or ~1440 microseconds) > on an AMD EPYC 9J14 running at 2596 MHz. The median build/load time was > 3,715,000 TSC ticks. > > On the same system, I preloaded the above Docker seccomp filter onto a > process. (Note that I opened a pidfd to the reference process and left > the pidfd open for the entire run.) I then cloned the filter using the > feature in this patch to 1000 new processes. On average, it took 9,300 > TSC ticks (or ~3.6 microseconds) to copy the filter to the new processes. > The median clone time was 9,048 TSC ticks. > > This is approximately a 400x performance improvement for those container > managers that are using the exact same seccomp filter across all of their > containers. This is a nice speedup, but with devil's advocate hat on, are launchers spawning at rates high enough that this makes a difference? > > [1] https://raw.githubusercontent.com/moby/moby/refs/heads/master/profiles/seccomp/default.json > > Signed-off-by: Tom Hromatka <tom.hromatka@xxxxxxxxxx> > --- > .../userspace-api/seccomp_filter.rst | 10 ++ > include/uapi/linux/seccomp.h | 1 + > kernel/seccomp.c | 48 ++++++ > samples/seccomp/Makefile | 2 +- > samples/seccomp/clone-filter.c | 143 ++++++++++++++++++ > tools/include/uapi/linux/seccomp.h | 1 + > tools/testing/selftests/seccomp/seccomp_bpf.c | 71 +++++++++ > 7 files changed, 275 insertions(+), 1 deletion(-) > create mode 100644 samples/seccomp/clone-filter.c > > diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst > index cff0fa7f3175..ef1797d093f6 100644 > --- a/Documentation/userspace-api/seccomp_filter.rst > +++ b/Documentation/userspace-api/seccomp_filter.rst > @@ -289,6 +289,16 @@ above in this document: all arguments being read from the tracee's memory > should be read into the tracer's memory before any policy decisions are made. > This allows for an atomic decision on syscall arguments. > > +Cloning an Existing Seccomp Filter > +================================== > + > +Constructing and loading a complex seccomp filter can often take a non-trivial > +amount of time. If a user wants to use the same seccomp filter across more > +than one process, it can be cloned to new processes via the > +``SECCOMP_CLONE_FILTER`` operation. Note that the clone will only succeed if > +the destination process does not have any seccomp filters already applied to > +it. See ``samples/seccomp/clone-filter.c`` for an example. Is "does not have any seccomp filters" a reasonable expectation? I mean, I'm fine with it, but will this feature be generally useful with that restriction? (i.e. becomes unusable under Docker, in several systemd contexts, etc) > +static long seccomp_clone_filter(void __user *upidfd) > +{ > + struct task_struct *task; > + unsigned int flags; > + pid_t pidfd; > + > + if (!capable(CAP_SYS_ADMIN)) > + return -EPERM; Again, I'm fine with this being CAP_SYS_ADMIN, but the normal seccomp filter restriction is NNP || CAP_SYS_ADMIN. Would it be safe to do clones with non-ADMIN? Hmmm. > + if (atomic_read(¤t->seccomp.filter_count) > 0) > + return -EINVAL; I'd wait to test this until you're under ¤t->sighand->siglock, and just test current->seccomp.filter. > + > + if (copy_from_user(&pidfd, upidfd, sizeof(pid_t))) > + return -EFAULT; > + > + task = pidfd_get_task(pidfd, &flags); > + if (IS_ERR(task)) > + return -ESRCH; Is this the right way to pass in a pidfd? Shouldn't this be an int, not a pointer? What is idiomatic for pidfd interfaces? > + > + spin_lock_irq(¤t->sighand->siglock); > + spin_lock_irq(&task->sighand->siglock); As others mentioned, you can't do this. ;) I'm pretty sure you can just take references progressively: - task = pidfd_get_task - mutex_lock_killable(&task->signal->cred_guard_mutex) - spin_lock_irq(&task->sighand->siglock); - check seccomp mode for filter mode, e.g. see seccomp_may_assign_mode() - get_seccomp_filter(task) - struct seccomp copy = task->seccomp (copies filter pointer, count, and mode) - release siglock - release cred_guard_mutex - put task Now you have the seccomp copy! :) Any errors from here mean you need to use __seccomp_filter_release to "put" the filter, if I'm reading things correctly. (We might have issues with USER_NOTIF, but I haven't looked closely yet.) And applying it should be: - take current->signal->cred_guard_mutex - take current->sighand->siglock - make sure task->seccomp.filter == NULL - current->seccomp = copy - release siglock - release cred_guard_mutex I *think* the above is correct. I may have forgotten some details, but I was mostly trying to combine TSYNC and regular application of a filter to current. > + > + if (atomic_read(&task->seccomp.filter_count) == 0) { > + spin_unlock_irq(&task->sighand->siglock); > + spin_unlock_irq(¤t->sighand->siglock); > + put_task_struct(task); > + return -EINVAL; > + } > + > + get_seccomp_filter(task); > + current->seccomp = task->seccomp; > + > + spin_unlock_irq(&task->sighand->siglock); > + > + set_task_syscall_work(current, SECCOMP); > + > + spin_unlock_irq(¤t->sighand->siglock); > + > + put_task_struct(task); > + > + return 0; > +} > + > /* Common entry point for both prctl and syscall. */ > static long do_seccomp(unsigned int op, unsigned int flags, > void __user *uargs) > @@ -2102,6 +2145,11 @@ static long do_seccomp(unsigned int op, unsigned int flags, > return -EINVAL; > > return seccomp_get_notif_sizes(uargs); > + case SECCOMP_CLONE_FILTER: > + if (flags != 0) > + return -EINVAL; > + > + return seccomp_clone_filter(uargs); > default: > return -EINVAL; > } > diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile > index c85ae0ed8342..d38977f41b86 100644 > --- a/samples/seccomp/Makefile > +++ b/samples/seccomp/Makefile > @@ -1,5 +1,5 @@ > # SPDX-License-Identifier: GPL-2.0 > -userprogs-always-y += bpf-fancy dropper bpf-direct user-trap > +userprogs-always-y += bpf-fancy dropper bpf-direct user-trap clone-filter > > bpf-fancy-objs := bpf-fancy.o bpf-helper.o > > diff --git a/samples/seccomp/clone-filter.c b/samples/seccomp/clone-filter.c > new file mode 100644 > index 000000000000..d26e1375b9dc > --- /dev/null > +++ b/samples/seccomp/clone-filter.c > @@ -0,0 +1,143 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Seccomp filter example for cloning a filter > + * > + * Copyright (c) 2025 Oracle and/or its affiliates. > + * Author: Tom Hromatka <tom.hromatka@xxxxxxxxxx> > + * > + * The code may be used by anyone for any purpose, > + * and can serve as a starting point for developing > + * applications that reuse the same seccomp filter > + * across many processes. > + */ > +#include <linux/seccomp.h> > +#include <linux/filter.h> > +#include <sys/syscall.h> > +#include <sys/wait.h> > +#include <stdbool.h> > +#include <signal.h> > +#include <stddef.h> > +#include <stdlib.h> > +#include <unistd.h> > +#include <stdio.h> > +#include <errno.h> > + > +#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x))) > + > +static int seccomp(unsigned int op, unsigned int flags, void *args) > +{ > + errno = 0; > + return syscall(__NR_seccomp, op, flags, args); > +} > + > +static int install_filter(void) > +{ > + struct sock_filter deny_filter[] = { > + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, > + offsetof(struct seccomp_data, nr)), > + BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_getppid, 0, 1), > + BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ERRNO | ESRCH), > + BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), > + }; > + struct sock_fprog deny_prog = { > + .len = (unsigned short)ARRAY_SIZE(deny_filter), > + .filter = deny_filter, > + }; > + > + return seccomp(SECCOMP_SET_MODE_FILTER, 0, &deny_prog); > +} > + > +static int clone_filter(pid_t ref_pid) > +{ > + int ref_pidfd, ret; > + > + ref_pidfd = syscall(SYS_pidfd_open, ref_pid, 0); > + if (ref_pidfd < 0) > + return -errno; > + > + ret = seccomp(SECCOMP_CLONE_FILTER, 0, &ref_pidfd); > + > + close(ref_pidfd); > + > + return ret; > +} > + > +static void do_ref_filter(void) > +{ > + int ret; > + > + ret = install_filter(); > + if (ret) { > + perror("Failed to install ref filter\n"); > + exit(1); > + } > + > + while (true) > + sleep(1); > +} > + > +static void do_child_process(pid_t ref_pid) > +{ > + pid_t res; > + int ret; > + > + ret = clone_filter(ref_pid); > + if (ret != 0) { > + perror("Failed to clone filter. Installing filter from scratch\n"); > + > + ret = install_filter(); > + if (ret != 0) { > + perror("Filter install failed\n"); > + exit(ret); > + } > + } > + > + res = syscall(__NR_getpid); > + if (res < 0) { > + perror("getpid() unexpectedly failed\n"); > + exit(errno); > + } > + > + res = syscall(__NR_getppid); > + if (res > 0) { > + perror("getppid() unexpectedly succeeded\n"); > + exit(1); > + } > + > + exit(0); > +} > + > +int main(void) > +{ > + pid_t ref_pid = -1, child_pid = -1; > + int ret, status; > + > + ref_pid = fork(); > + if (ref_pid < 0) > + exit(errno); > + else if (ref_pid == 0) > + do_ref_filter(); > + > + child_pid = fork(); > + if (child_pid < 0) > + goto out; > + else if (child_pid == 0) > + do_child_process(ref_pid); > + > + waitpid(child_pid, &status, 0); > + if (WEXITSTATUS(status) != 0) { > + perror("child process failed"); > + ret = WEXITSTATUS(status); > + goto out; > + } > + > + ret = 0; > + > +out: > + if (ref_pid != -1) > + kill(ref_pid, SIGKILL); > + if (child_pid != -1) > + kill(child_pid, SIGKILL); > + > + exit(ret); > +} > diff --git a/tools/include/uapi/linux/seccomp.h b/tools/include/uapi/linux/seccomp.h > index dbfc9b37fcae..b0917e333b4b 100644 > --- a/tools/include/uapi/linux/seccomp.h > +++ b/tools/include/uapi/linux/seccomp.h > @@ -16,6 +16,7 @@ > #define SECCOMP_SET_MODE_FILTER 1 > #define SECCOMP_GET_ACTION_AVAIL 2 > #define SECCOMP_GET_NOTIF_SIZES 3 > +#define SECCOMP_CLONE_FILTER 4 > > /* Valid flags for SECCOMP_SET_MODE_FILTER */ > #define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) > diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c > index 61acbd45ffaa..df5e0f615da0 100644 > --- a/tools/testing/selftests/seccomp/seccomp_bpf.c > +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c > @@ -177,6 +177,10 @@ struct seccomp_data { > #define SECCOMP_GET_NOTIF_SIZES 3 > #endif > > +#ifndef SECCOMP_CLONE_FILTER > +#define SECCOMP_CLONE_FILTER 4 > +#endif > + > #ifndef SECCOMP_FILTER_FLAG_TSYNC > #define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0) > #endif > @@ -5090,6 +5094,73 @@ TEST_F(URETPROBE, uretprobe_default_block_with_uretprobe_syscall) > ASSERT_EQ(0, run_probed_with_filter(&prog)); > } > > +TEST(clone_filter) Yay tests! > +{ > + struct sock_filter deny_filter[] = { > + BPF_STMT(BPF_LD|BPF_W|BPF_ABS, > + offsetof(struct seccomp_data, nr)), > + BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_getppid, 0, 1), > + BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ERRNO | ESRCH), > + BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW), > + }; > + struct sock_fprog deny_prog = { > + .len = (unsigned short)ARRAY_SIZE(deny_filter), > + .filter = deny_filter, > + }; > + struct timespec ts = { > + .tv_sec = 0, > + .tv_nsec = 100000000, > + }; > + > + pid_t child_pid, self_pid, res; > + int child_pidfd, ret; > + > + /* Only real root can copy a filter. */ > + if (geteuid()) { That's not true: uid==0 is not CAP_SYS_ADMIN. :) Look in this test for: cap_get_flag(cap, CAP_SYS_ADMIN, CAP_EFFECTIVE, &is_cap_sys_admin); which is how to test this correctly. > + SKIP(return, "clone_filter requires real root"); > + return; > + } > + > + self_pid = getpid(); > + > + child_pid = fork(); > + ASSERT_LE(0, child_pid); Uh, isn't that supposed to be GE? This should have failed your test immediately every time. Does this test pass for you? :P > + > + if (child_pid == 0) { > + ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)); > + ASSERT_EQ(0, prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &deny_prog)); I'd assert that get_ppid gives ESRCH here just for completeness. > + > + while (true) > + EXPECT_EQ(0, syscall(__NR_nanosleep, &ts, NULL)); > + } > + > + /* wait for the child pid to create its seccomp filter */ > + ASSERT_EQ(0, syscall(__NR_nanosleep, &ts, NULL)); Please look at the other tests to see how to synchronize without a non-deterministic sleep "guess". :) It's bad form to induce needless delays in tests. > + > + child_pidfd = syscall(SYS_pidfd_open, child_pid, 0); > + EXPECT_LE(0, child_pidfd); > + > + /* Invalid flag provided */ > + ret = seccomp(SECCOMP_CLONE_FILTER, 1, &child_pidfd); I'd set all the bits, not just 1. This is also missing a EFAULT test (i.e. a NULL child_pidfd). (Though I still want to know if this is idiomatic for pidfd interfaces -- I'd not expect this to be passed as a pointer.) And it's missing a ESRCH test. And a "this is not a pidfd" test. > + EXPECT_EQ(-1, ret); > + EXPECT_EQ(errno, EINVAL); > + > + errno = 0; > + ret = seccomp(SECCOMP_CLONE_FILTER, 0, &child_pidfd); > + EXPECT_EQ(0, ret); > + EXPECT_EQ(errno, 0); > + > + res = syscall(__NR_getppid); > + EXPECT_EQ(res, -1); > + EXPECT_EQ(errno, ESRCH); I would validate that getppid succeeds before the CLONE_FILTER, just for robustness. > + > + res = syscall(__NR_getpid); > + EXPECT_EQ(res, self_pid); > + > + close(child_pidfd); > + kill(child_pid, SIGKILL); I'm not a fan of using "kill" for child sync in tests. I'd rather see a blocking read or similar (so the child doesn't have to spin, even if it's in sleep). But at the very least I'd want to see a waitpid for this kill. > +} Please also check for the "I already have a filter attached" failure path. > + > /* > * TODO: > * - expand NNP testing > -- > 2.47.3 > Thanks for working on this! It'd be a nice speed-up for sure. -Kees -- Kees Cook