On Fri, Jun 20, 2025 at 12:06 PM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote: > > On Fri, 20 Jun 2025 at 20:57, Alexei Starovoitov > <alexei.starovoitov@xxxxxxxxx> wrote: > > > > On Fri, Jun 20, 2025 at 11:52 AM Kumar Kartikeya Dwivedi > > <memxor@xxxxxxxxx> wrote: > > > > > > On Fri, 20 Jun 2025 at 20:44, Alexei Starovoitov > > > <alexei.starovoitov@xxxxxxxxx> wrote: > > > > > > > > On Thu, Jun 19, 2025 at 8:11 PM Emil Tsalapatis <emil@xxxxxxxxxxxxxxx> wrote: > > > > > > > > > > Add a new BPF arena kfunc from protecting a range of pages. These pages > > > > > cannot be allocated, either explicitly through bpf_arena_alloc_pages() > > > > > or implicitly through userspace page faults. > > > > > > > > > > Signed-off-by: Emil Tsalapatis <emil@xxxxxxxxxxxxxxx> > > > > > --- > > > > > kernel/bpf/arena.c | 95 ++++++++++++++++++++++++++++++++++++++++++++-- > > > > > 1 file changed, 92 insertions(+), 3 deletions(-) > > > > > > > > > > diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c > > > > > index 0d56cea71602..2f9293eb7151 100644 > > > > > --- a/kernel/bpf/arena.c > > > > > +++ b/kernel/bpf/arena.c > > > > > @@ -48,6 +48,7 @@ struct bpf_arena { > > > > > u64 user_vm_end; > > > > > struct vm_struct *kern_vm; > > > > > struct range_tree rt; > > > > > + struct range_tree rt_guard; > > > > > > > > ... > > > > > > > > > } > > > > > @@ -282,6 +298,11 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf) > > > > > /* User space requested to segfault when page is not allocated by bpf prog */ > > > > > return VM_FAULT_SIGSEGV; > > > > > > > > > > + /* Make sure the page is not guarded. */ > > > > > + ret = is_range_tree_set(&arena->rt_guard, vmf->pgoff, 1); > > > > > + if (ret) > > > > > + return VM_FAULT_SIGSEGV; > > > > > + > > > > > ret = range_tree_clear(&arena->rt, vmf->pgoff, 1); > > > > > > > > Why complicate things with another tree ? > > > > The logic has to range_tree_clear(&arena->rt, ... anyway > > > > and here check: > > > > is_range_tree_set(&arena->rt, ... > > > > > > > > bpf_arena_guard_pages() won't have EALREADY errors, so be it. > > > > Keeping another range_tree and spending kernel memory > > > > just to produce an error to buggy bpf prog is imo wrong trade off. > > > > > > IIUC the main requirement is reserving a region that cannot be faulted > > > in user space, and cannot be allocated from the BPF side. > > > I would instead add a flag that when set overrides the SIGSEGV/page-in > > > behavior (which can be set globally by a flag on the map). > > > That sounds more generic and potentially useful to pick the behavior > > > on a per-allocation basis instead of making it global. > > > So for specific allocations, we get SEGSEGV instead of paging in > > > memory, while for the rest it's the default based on map's flags. > > > And to prevent anybody else from allocating this range, reserve it > > > ahead of time in the scheduler's init() callback. > > > For normal programs it can be an extra prog run before the program is > > > attached and starts firing. > > > We won't need a new kfunc either. > > > > I'm not following the idea. > > There is already a flag: > > if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT) > > /* User space requested to segfault when page is not > > allocated by bpf prog */ > > return VM_FAULT_SIGSEGV; > > Yeah, but it's global. We may want fault-in for normal memory on > access, but we may want some range in the address space to have > SEGV_ON_FAULT behavior. > It allows some allocations to pick their fault-in/segv behavior > independent of the arena user's choice, which I think is important for > library like code / malloc etc. You mean to augment range_tree with extra flags per range ? I don't like it. So far no one has used BPF_F_SEGV_ON_FAULT. Making it more granular won't make it more useful. > The thing described above still wastes some memory because we map pages. > I feel the proper way to do this is how one would do something like > this in user space. > Have pages mapped in a range and remove PROT_READ | PROT_WRITE from > them, making them inaccessible. > Ideally it'd be zero pages but if we cannot do that, wasting some > memory for guard regions may not be too bad, for now. > We need mprotect() like capabilities to change r/w permissions to > create such guard pages, e.g. for a malloc with some debug features / > efence-like protection. I don't see the point of all that. bpf_arena_guard_pages() that only reserve the range without allocating the pages makes sense to me. I think bpf_arena_reserve_range() would be a better name for such api. Since actual pages are not allocated. "guard pages" typically means actual pages are allocated and populated with some pattern. Here it's not the case.