Re: [PATCH] x86/bpf: use bpf_capable() instead of capable(CAP_SYS_ADMIN)

Ondrej Mosnacek <omosnace@xxxxxxxxxx> · Wed, 13 Aug 2025 11:49:10 +0200

On Sat, Aug 9, 2025 at 2:46 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Fri, Aug 8, 2025 at 4:59 PM Serge E. Hallyn <serge@xxxxxxxxxx> wrote:
> >
> > On Wed, Aug 06, 2025 at 08:18:55PM -0500, Serge E. Hallyn wrote:
> > > On Wed, Aug 06, 2025 at 04:31:05PM +0200, Ondrej Mosnacek wrote:
> > > > Don't check against the overloaded CAP_SYS_ADMINin do_jit(), but instead
> > > > use bpf_capable(), which checks against the more granular CAP_BPF first.
> > > > Going straight to CAP_SYS_ADMIN may cause unnecessary audit log spam
> > > > under SELinux, as privileged domains using BPF would usually only be
> > > > allowed CAP_BPF and not CAP_SYS_ADMIN.
> > > >
> > > > Link: https://bugzilla.redhat.com/show_bug.cgi?id=2369326
> > > > Fixes: d4e89d212d40 ("x86/bpf: Call branch history clearing sequence on exit")
> > > > Signed-off-by: Ondrej Mosnacek <omosnace@xxxxxxxxxx>
> > >
> > > So this seems correct, *provided* that we consider it within the purview of
> > > CAP_BPF to be able to avoid clearing the branch history buffer.
>
> true, but...
>
> > >
> > > I suspect that's the case, but it might warrant discussion.
> > >
> > > Reviewed-by: Serge Hallyn <serge@xxxxxxxxxx>
> >
> > (BTW, I'm assuming this will get pulled into a BPF tree or something, and
> > doesn't need to go into the capabilities tree.  Let me know if that's wrong)
>
> Right.
> scripts/get_maintainer.pl arch/x86/net/bpf_jit_comp.c
> is your friend.
>
> Pls cc author-s of the commit in question in the future.
> Adding them now.
>
> > > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > > > index 15672cb926fc1..2a825e5745ca1 100644
> > > > --- a/arch/x86/net/bpf_jit_comp.c
> > > > +++ b/arch/x86/net/bpf_jit_comp.c
> > > > @@ -2591,8 +2591,7 @@ emit_jmp:
> > > >                     seen_exit = true;
> > > >                     /* Update cleanup_addr */
> > > >                     ctx->cleanup_addr = proglen;
> > > > -                   if (bpf_prog_was_classic(bpf_prog) &&
> > > > -                       !capable(CAP_SYS_ADMIN)) {
> > > > +                   if (bpf_prog_was_classic(bpf_prog) && !bpf_capable()) {
>
> This looks wrong for several reasons.
>
> 1.
> bpf_capable() and CAP_BPF in general applies to eBPF only.
> There is no precedent so far to do anything differently
> for cBPF when CAP_BPF is present.

That's not entirely true, see below.

> 2.
> commit log states that
> "privileged domains using BPF would usually only be allowed CAP_BPF
> and not CAP_SYS_ADMIN"
> which is true for eBPF only, since cBPF is always allowed for
> all unpriv users.
> Start chrome browser and you get cBPF loaded.

Processes using cBPF (via SO_ATTACH_FILTER) already can trigger a
CAP_BPF check - when the net.core.bpf_jit_harden sysctl is set to 1,
then the sequence sk_attach_filter() -> __get_filter() ->
bpf_prog_alloc() -> bpf_prog_alloc_no_stats() ->
bpf_jit_blinding_enabled() -> bpf_token_capable() happens for the same
iio-sensor-proxy syscall as the one that hits the CAP_SYS_ADMIN check.
Because of this we have already granted the BPF capability in
Fedora/RHEL SELinux policy to many domains that would usually run as
root and that use SO_ATTACH_FILTER. The logic being that they are
legitimately using BPF + without SELinux they would be fully
privileged (root) and they would pass that check + it seemed they
could otherwise lose some performance due to the hardening (though I'm
not sure now if it applies to cBPF, so this point could be moot) +
CAP_BPF doesn't grant any excess privileges beyond this (as opposed to
e.g. CAP_SYS_ADMIN). This is what I meant behind that commit log
statement, though I didn't remember the details, so I didn't state it
as clearly as I could have (my apologies).

Now this same usage started triggering the new plain CAP_SYS_ADMIN
check so I naturally assumed that changing it to bpf_capable() would
be the most logical solution (as it would let us keep the services
excluded from the hardening via CAP_BPF without granting the broad
CAP_SYS_ADMIN).

Is the fact that CAP_BPF check is reachable via cBPF use unexpected
behavior? If both cBPF and eBPF can be JIT'd and CAP_BPF is already
being used for the "exempt from JIT hardening" semantics in one place,
why should cBPF and eBPF be treated differently? In fact, shouldn't
the decision to apply the Spectre mitigation also take into account
the net.core.bpf_jit_harden sysctl even when the program is not cBPF?

> 3.
> glancing over bugzilla it seems that the issue is
> excessive audit spam and not related to CAP_BPF and privileges.
> If so then the fix is to use
> ns_capable_noaudit(&init_user_ns, CAP_SYS_ADMIN)
>
> 4.
> I don't understand how the patch is supposed to fix the issue.
> iio-sensor-proxy is probably unpriv. Why would it use CAP_BPF?
> It's using cBPF, so there is no reason for it to have CAP_BPF.
> So capable(CAP_BPF) will fail just like capable(CAP_SYS_ADMIN),
> but since CAP_BPF check was done first, the audit won't
> be printed, because it's some undocumented internal selinux behavior ?
> None of it is in the commit log :(

It is not unprivileged. It runs as root and without SELinux it would
have all capabilities allowed. If it were running without any
capabilities, then indeed there would be no SELinux checks.

> 5.
> And finally all that looks like a selinux bug.
> Just because something in the kernel is asking capable(CAP_SYS_ADMIN)
> there is no need to spam users with the wrong message:
> "SELinux is preventing iio-sensor-prox from using the 'sys_admin' capabilities."
> iio-sensor-prox is not trying to use 'sys_admin' capabilities.
> cBPF prog will be loaded anyway, with or without BHB clearing.

Well, it depends... In this case the AVC denial informs us that the
kernel is making some decision depending on the capability and that a
decision should be made in the policy to allow or silence the access
vector. Even when the consequence is not a failure of the syscall, it
still may be useful to have the denial reported, since there is a
potential performance impact. OTOH, with CAP_SYS_ADMIN if the decision
is to not allow it, then silencing it via a dontaudit rule would
potentially hide other more critical CAP_SYS_ADMIN denials, so it's
hard to decide what is better - to silence this specific case in the
kernel vs. to let the user allow/silence the specific AV in the
policy...

--
Ondrej Mosnacek
Senior Software Engineer, Linux Security - SELinux kernel
Red Hat, Inc.