Re: [PATCH] x86/bpf: use bpf_capable() instead of capable(CAP_SYS_ADMIN)

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Tue, 2 Sep 2025 10:36:51 -0700

On Wed, Aug 13, 2025 at 2:49 AM Ondrej Mosnacek <omosnace@xxxxxxxxxx> wrote:
>
> On Sat, Aug 9, 2025 at 2:46 AM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Fri, Aug 8, 2025 at 4:59 PM Serge E. Hallyn <serge@xxxxxxxxxx> wrote:
> > >
> > > On Wed, Aug 06, 2025 at 08:18:55PM -0500, Serge E. Hallyn wrote:
> > > > On Wed, Aug 06, 2025 at 04:31:05PM +0200, Ondrej Mosnacek wrote:
> > > > > Don't check against the overloaded CAP_SYS_ADMINin do_jit(), but instead
> > > > > use bpf_capable(), which checks against the more granular CAP_BPF first.
> > > > > Going straight to CAP_SYS_ADMIN may cause unnecessary audit log spam
> > > > > under SELinux, as privileged domains using BPF would usually only be
> > > > > allowed CAP_BPF and not CAP_SYS_ADMIN.
> > > > >
> > > > > Link: https://bugzilla.redhat.com/show_bug.cgi?id=2369326
> > > > > Fixes: d4e89d212d40 ("x86/bpf: Call branch history clearing sequence on exit")
> > > > > Signed-off-by: Ondrej Mosnacek <omosnace@xxxxxxxxxx>
> > > >
> > > > So this seems correct, *provided* that we consider it within the purview of
> > > > CAP_BPF to be able to avoid clearing the branch history buffer.
> >
> > true, but...
> >
> > > >
> > > > I suspect that's the case, but it might warrant discussion.
> > > >
> > > > Reviewed-by: Serge Hallyn <serge@xxxxxxxxxx>
> > >
> > > (BTW, I'm assuming this will get pulled into a BPF tree or something, and
> > > doesn't need to go into the capabilities tree.  Let me know if that's wrong)
> >
> > Right.
> > scripts/get_maintainer.pl arch/x86/net/bpf_jit_comp.c
> > is your friend.
> >
> > Pls cc author-s of the commit in question in the future.
> > Adding them now.
> >
> > > > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > > > > index 15672cb926fc1..2a825e5745ca1 100644
> > > > > --- a/arch/x86/net/bpf_jit_comp.c
> > > > > +++ b/arch/x86/net/bpf_jit_comp.c
> > > > > @@ -2591,8 +2591,7 @@ emit_jmp:
> > > > >                     seen_exit = true;
> > > > >                     /* Update cleanup_addr */
> > > > >                     ctx->cleanup_addr = proglen;
> > > > > -                   if (bpf_prog_was_classic(bpf_prog) &&
> > > > > -                       !capable(CAP_SYS_ADMIN)) {
> > > > > +                   if (bpf_prog_was_classic(bpf_prog) && !bpf_capable()) {
> >
> > This looks wrong for several reasons.
> >
> > 1.
> > bpf_capable() and CAP_BPF in general applies to eBPF only.
> > There is no precedent so far to do anything differently
> > for cBPF when CAP_BPF is present.
>
> That's not entirely true, see below.
>
> > 2.
> > commit log states that
> > "privileged domains using BPF would usually only be allowed CAP_BPF
> > and not CAP_SYS_ADMIN"
> > which is true for eBPF only, since cBPF is always allowed for
> > all unpriv users.
> > Start chrome browser and you get cBPF loaded.
>
> Processes using cBPF (via SO_ATTACH_FILTER) already can trigger a
> CAP_BPF check - when the net.core.bpf_jit_harden sysctl is set to 1,
> then the sequence sk_attach_filter() -> __get_filter() ->
> bpf_prog_alloc() -> bpf_prog_alloc_no_stats() ->
> bpf_jit_blinding_enabled() -> bpf_token_capable() happens for the same
> iio-sensor-proxy syscall as the one that hits the CAP_SYS_ADMIN check.

Agree that it's a similar case and we should standardize on
the way to check whether mitigation is necessary.
But I think in both cases it should be "_noaudit" version.
In other words, everywhere where jit, verifier or anything else
is checking caps to apply a mitigation or not we should use "_noaudit".

> Because of this we have already granted the BPF capability in
> Fedora/RHEL SELinux policy to many domains that would usually run as
> root and that use SO_ATTACH_FILTER. The logic being that they are
> legitimately using BPF + without SELinux they would be fully
> privileged (root) and they would pass that check + it seemed they
> could otherwise lose some performance due to the hardening (though I'm
> not sure now if it applies to cBPF, so this point could be moot) +
> CAP_BPF doesn't grant any excess privileges beyond this (as opposed to
> e.g. CAP_SYS_ADMIN). This is what I meant behind that commit log
> statement, though I didn't remember the details, so I didn't state it
> as clearly as I could have (my apologies).
>
> Now this same usage started triggering the new plain CAP_SYS_ADMIN
> check so I naturally assumed that changing it to bpf_capable() would
> be the most logical solution (as it would let us keep the services
> excluded from the hardening via CAP_BPF without granting the broad
> CAP_SYS_ADMIN).

I see. Sounds like you identified bpf_jit_blinding_enabled() case
and special cased it selinux. So doing bpf_capable() in JIT
would fit the existing selinux handling.

>
> Is the fact that CAP_BPF check is reachable via cBPF use unexpected
> behavior? If both cBPF and eBPF can be JIT'd and CAP_BPF is already
> being used for the "exempt from JIT hardening" semantics in one place,
> why should cBPF and eBPF be treated differently? In fact, shouldn't
> the decision to apply the Spectre mitigation also take into account
> the net.core.bpf_jit_harden sysctl even when the program is not cBPF?
>
> > 3.
> > glancing over bugzilla it seems that the issue is
> > excessive audit spam and not related to CAP_BPF and privileges.
> > If so then the fix is to use
> > ns_capable_noaudit(&init_user_ns, CAP_SYS_ADMIN)
> >
> > 4.
> > I don't understand how the patch is supposed to fix the issue.
> > iio-sensor-proxy is probably unpriv. Why would it use CAP_BPF?
> > It's using cBPF, so there is no reason for it to have CAP_BPF.
> > So capable(CAP_BPF) will fail just like capable(CAP_SYS_ADMIN),
> > but since CAP_BPF check was done first, the audit won't
> > be printed, because it's some undocumented internal selinux behavior ?
> > None of it is in the commit log :(
>
> It is not unprivileged. It runs as root and without SELinux it would
> have all capabilities allowed. If it were running without any
> capabilities, then indeed there would be no SELinux checks.
>
> > 5.
> > And finally all that looks like a selinux bug.
> > Just because something in the kernel is asking capable(CAP_SYS_ADMIN)
> > there is no need to spam users with the wrong message:
> > "SELinux is preventing iio-sensor-prox from using the 'sys_admin' capabilities."
> > iio-sensor-prox is not trying to use 'sys_admin' capabilities.
> > cBPF prog will be loaded anyway, with or without BHB clearing.
>
> Well, it depends... In this case the AVC denial informs us that the
> kernel is making some decision depending on the capability and that a
> decision should be made in the policy to allow or silence the access
> vector. Even when the consequence is not a failure of the syscall, it
> still may be useful to have the denial reported, since there is a
> potential performance impact. OTOH, with CAP_SYS_ADMIN if the decision
> is to not allow it, then silencing it via a dontaudit rule would
> potentially hide other more critical CAP_SYS_ADMIN denials, so it's
> hard to decide what is better - to silence this specific case in the
> kernel vs. to let the user allow/silence the specific AV in the
> policy...

imo we should apply a general rule for using "_noaudit"
for all checks that don't end up as a denial.
There are various bpf_allow_ptr_leaks(), bpf_bypass_spec_v[14]()
checks in the verifier that should be converted to "_noaudit".
It's true that the check affects performance and users could be
interested in the end result, but if they enable mitigations they
expect the performance hit across the board, so skipping a mitigation
for a privileged process is a bonus. A prog will be tiny bit faster.
So not worth flagging anywhere.