On Fri, Jun 20, 2025 at 5:12 PM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > > On Fri, Jun 20, 2025 at 12:06 PM Kumar Kartikeya Dwivedi > <memxor@xxxxxxxxx> wrote: > > > > On Fri, 20 Jun 2025 at 20:57, Alexei Starovoitov > > <alexei.starovoitov@xxxxxxxxx> wrote: > > > > > > On Fri, Jun 20, 2025 at 11:52 AM Kumar Kartikeya Dwivedi > > > <memxor@xxxxxxxxx> wrote: > > > > > > > > On Fri, 20 Jun 2025 at 20:44, Alexei Starovoitov > > > > <alexei.starovoitov@xxxxxxxxx> wrote: > > > > > > > > > > On Thu, Jun 19, 2025 at 8:11 PM Emil Tsalapatis <emil@xxxxxxxxxxxxxxx> wrote: > > > > > > > > > > > > Add a new BPF arena kfunc from protecting a range of pages. These pages > > > > > > cannot be allocated, either explicitly through bpf_arena_alloc_pages() > > > > > > or implicitly through userspace page faults. > > > > > > > > > > > > Signed-off-by: Emil Tsalapatis <emil@xxxxxxxxxxxxxxx> > > > > > > --- > > > > > > kernel/bpf/arena.c | 95 ++++++++++++++++++++++++++++++++++++++++++++-- > > > > > > 1 file changed, 92 insertions(+), 3 deletions(-) > > > > > > > > > > > > diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c > > > > > > index 0d56cea71602..2f9293eb7151 100644 > > > > > > --- a/kernel/bpf/arena.c > > > > > > +++ b/kernel/bpf/arena.c > > > > > > @@ -48,6 +48,7 @@ struct bpf_arena { > > > > > > u64 user_vm_end; > > > > > > struct vm_struct *kern_vm; > > > > > > struct range_tree rt; > > > > > > + struct range_tree rt_guard; > > > > > > > > > > ... > > > > > > > > > > > } > > > > > > @@ -282,6 +298,11 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf) > > > > > > /* User space requested to segfault when page is not allocated by bpf prog */ > > > > > > return VM_FAULT_SIGSEGV; > > > > > > > > > > > > + /* Make sure the page is not guarded. */ > > > > > > + ret = is_range_tree_set(&arena->rt_guard, vmf->pgoff, 1); > > > > > > + if (ret) > > > > > > + return VM_FAULT_SIGSEGV; > > > > > > + > > > > > > ret = range_tree_clear(&arena->rt, vmf->pgoff, 1); > > > > > > > > > > Why complicate things with another tree ? > > > > > The logic has to range_tree_clear(&arena->rt, ... anyway > > > > > and here check: > > > > > is_range_tree_set(&arena->rt, ... > > > > > > > > > > bpf_arena_guard_pages() won't have EALREADY errors, so be it. > > > > > Keeping another range_tree and spending kernel memory > > > > > just to produce an error to buggy bpf prog is imo wrong trade off. > > > > > > > > IIUC the main requirement is reserving a region that cannot be faulted > > > > in user space, and cannot be allocated from the BPF side. > > > > I would instead add a flag that when set overrides the SIGSEGV/page-in > > > > behavior (which can be set globally by a flag on the map). > > > > That sounds more generic and potentially useful to pick the behavior > > > > on a per-allocation basis instead of making it global. > > > > So for specific allocations, we get SEGSEGV instead of paging in > > > > memory, while for the rest it's the default based on map's flags. > > > > And to prevent anybody else from allocating this range, reserve it > > > > ahead of time in the scheduler's init() callback. > > > > For normal programs it can be an extra prog run before the program is > > > > attached and starts firing. > > > > We won't need a new kfunc either. > > > > > > I'm not following the idea. > > > There is already a flag: > > > if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT) > > > /* User space requested to segfault when page is not > > > allocated by bpf prog */ > > > return VM_FAULT_SIGSEGV; > > > > Yeah, but it's global. We may want fault-in for normal memory on > > access, but we may want some range in the address space to have > > SEGV_ON_FAULT behavior. > > It allows some allocations to pick their fault-in/segv behavior > > independent of the arena user's choice, which I think is important for > > library like code / malloc etc. > > You mean to augment range_tree with extra flags per range ? > I don't like it. So far no one has used BPF_F_SEGV_ON_FAULT. > Making it more granular won't make it more useful. > > > The thing described above still wastes some memory because we map pages. > > I feel the proper way to do this is how one would do something like > > this in user space. > > Have pages mapped in a range and remove PROT_READ | PROT_WRITE from > > them, making them inaccessible. > > Ideally it'd be zero pages but if we cannot do that, wasting some > > memory for guard regions may not be too bad, for now. > > We need mprotect() like capabilities to change r/w permissions to > > create such guard pages, e.g. for a malloc with some debug features / > > efence-like protection. > > I don't see the point of all that. > bpf_arena_guard_pages() that only reserve the range without > allocating the pages makes sense to me. > I think bpf_arena_reserve_range() would be a better name for such api. > Since actual pages are not allocated. > "guard pages" typically means actual pages are allocated and populated > with some pattern. Here it's not the case. I'd agree with bpf_arena_reserve_range() as the name. Wrt having more granular permissions, it can be useful for some consumers of the arena API, e.g., for sched-ext arena data that gets set up at initialization but is never modified afterward. But we can consider the two features separately imo even if they touch the same code.