On Sat, Aug 9, 2025 at 2:46 AM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > > On Fri, Aug 8, 2025 at 4:59 PM Serge E. Hallyn <serge@xxxxxxxxxx> wrote: > > > > On Wed, Aug 06, 2025 at 08:18:55PM -0500, Serge E. Hallyn wrote: > > > On Wed, Aug 06, 2025 at 04:31:05PM +0200, Ondrej Mosnacek wrote: > > > > Don't check against the overloaded CAP_SYS_ADMINin do_jit(), but instead > > > > use bpf_capable(), which checks against the more granular CAP_BPF first. > > > > Going straight to CAP_SYS_ADMIN may cause unnecessary audit log spam > > > > under SELinux, as privileged domains using BPF would usually only be > > > > allowed CAP_BPF and not CAP_SYS_ADMIN. > > > > > > > > Link: https://bugzilla.redhat.com/show_bug.cgi?id=2369326 > > > > Fixes: d4e89d212d40 ("x86/bpf: Call branch history clearing sequence on exit") > > > > Signed-off-by: Ondrej Mosnacek <omosnace@xxxxxxxxxx> > > > > > > So this seems correct, *provided* that we consider it within the purview of > > > CAP_BPF to be able to avoid clearing the branch history buffer. > > true, but... > > > > > > > I suspect that's the case, but it might warrant discussion. > > > > > > Reviewed-by: Serge Hallyn <serge@xxxxxxxxxx> > > > > (BTW, I'm assuming this will get pulled into a BPF tree or something, and > > doesn't need to go into the capabilities tree. Let me know if that's wrong) > > Right. > scripts/get_maintainer.pl arch/x86/net/bpf_jit_comp.c > is your friend. > > Pls cc author-s of the commit in question in the future. > Adding them now. > > > > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c > > > > index 15672cb926fc1..2a825e5745ca1 100644 > > > > --- a/arch/x86/net/bpf_jit_comp.c > > > > +++ b/arch/x86/net/bpf_jit_comp.c > > > > @@ -2591,8 +2591,7 @@ emit_jmp: > > > > seen_exit = true; > > > > /* Update cleanup_addr */ > > > > ctx->cleanup_addr = proglen; > > > > - if (bpf_prog_was_classic(bpf_prog) && > > > > - !capable(CAP_SYS_ADMIN)) { > > > > + if (bpf_prog_was_classic(bpf_prog) && !bpf_capable()) { > > This looks wrong for several reasons. > > 1. > bpf_capable() and CAP_BPF in general applies to eBPF only. > There is no precedent so far to do anything differently > for cBPF when CAP_BPF is present. That's not entirely true, see below. > 2. > commit log states that > "privileged domains using BPF would usually only be allowed CAP_BPF > and not CAP_SYS_ADMIN" > which is true for eBPF only, since cBPF is always allowed for > all unpriv users. > Start chrome browser and you get cBPF loaded. Processes using cBPF (via SO_ATTACH_FILTER) already can trigger a CAP_BPF check - when the net.core.bpf_jit_harden sysctl is set to 1, then the sequence sk_attach_filter() -> __get_filter() -> bpf_prog_alloc() -> bpf_prog_alloc_no_stats() -> bpf_jit_blinding_enabled() -> bpf_token_capable() happens for the same iio-sensor-proxy syscall as the one that hits the CAP_SYS_ADMIN check. Because of this we have already granted the BPF capability in Fedora/RHEL SELinux policy to many domains that would usually run as root and that use SO_ATTACH_FILTER. The logic being that they are legitimately using BPF + without SELinux they would be fully privileged (root) and they would pass that check + it seemed they could otherwise lose some performance due to the hardening (though I'm not sure now if it applies to cBPF, so this point could be moot) + CAP_BPF doesn't grant any excess privileges beyond this (as opposed to e.g. CAP_SYS_ADMIN). This is what I meant behind that commit log statement, though I didn't remember the details, so I didn't state it as clearly as I could have (my apologies). Now this same usage started triggering the new plain CAP_SYS_ADMIN check so I naturally assumed that changing it to bpf_capable() would be the most logical solution (as it would let us keep the services excluded from the hardening via CAP_BPF without granting the broad CAP_SYS_ADMIN). Is the fact that CAP_BPF check is reachable via cBPF use unexpected behavior? If both cBPF and eBPF can be JIT'd and CAP_BPF is already being used for the "exempt from JIT hardening" semantics in one place, why should cBPF and eBPF be treated differently? In fact, shouldn't the decision to apply the Spectre mitigation also take into account the net.core.bpf_jit_harden sysctl even when the program is not cBPF? > 3. > glancing over bugzilla it seems that the issue is > excessive audit spam and not related to CAP_BPF and privileges. > If so then the fix is to use > ns_capable_noaudit(&init_user_ns, CAP_SYS_ADMIN) > > 4. > I don't understand how the patch is supposed to fix the issue. > iio-sensor-proxy is probably unpriv. Why would it use CAP_BPF? > It's using cBPF, so there is no reason for it to have CAP_BPF. > So capable(CAP_BPF) will fail just like capable(CAP_SYS_ADMIN), > but since CAP_BPF check was done first, the audit won't > be printed, because it's some undocumented internal selinux behavior ? > None of it is in the commit log :( It is not unprivileged. It runs as root and without SELinux it would have all capabilities allowed. If it were running without any capabilities, then indeed there would be no SELinux checks. > 5. > And finally all that looks like a selinux bug. > Just because something in the kernel is asking capable(CAP_SYS_ADMIN) > there is no need to spam users with the wrong message: > "SELinux is preventing iio-sensor-prox from using the 'sys_admin' capabilities." > iio-sensor-prox is not trying to use 'sys_admin' capabilities. > cBPF prog will be loaded anyway, with or without BHB clearing. Well, it depends... In this case the AVC denial informs us that the kernel is making some decision depending on the capability and that a decision should be made in the policy to allow or silence the access vector. Even when the consequence is not a failure of the syscall, it still may be useful to have the denial reported, since there is a potential performance impact. OTOH, with CAP_SYS_ADMIN if the decision is to not allow it, then silencing it via a dontaudit rule would potentially hide other more critical CAP_SYS_ADMIN denials, so it's hard to decide what is better - to silence this specific case in the kernel vs. to let the user allow/silence the specific AV in the policy... -- Ondrej Mosnacek Senior Software Engineer, Linux Security - SELinux kernel Red Hat, Inc.