On Fri, Apr 25, 2025 at 10:09:34AM -0700, Kees Cook wrote: > On Fri, Apr 25, 2025 at 03:54:34PM +0100, Lorenzo Stoakes wrote: > > There are peculiarities within the kernel where what is very clearly mm > > code is performed elsewhere arbitrarily. > > > > This violates separation of concerns and makes it harder to refactor code > > to make changes to how fundamental initialisation and operation of mm logic > > is performed. > > > > One such case is the creation of the VMA containing the initial stack upon > > execve()'ing a new process. This is currently performed in __bprm_mm_init() > > in fs/exec.c. > > > > Abstract this operation to create_init_stack_vma(). This allows us to limit > > use of vma allocation and free code to fork and mm only. > > > > We previously did the same for the step at which we relocate the initial > > stack VMA downwards via relocate_vma_down(), now we move the initial VMA > > establishment too. > > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> > > Acked-by: David Hildenbrand <david@xxxxxxxxxx> > > Reviewed-by: Suren Baghdasaryan <surenb@xxxxxxxxxx> > > --- > > fs/exec.c | 51 +--------------------------------- > > I'm kind of on the fence about this. On the one hand, yes, it's all vma > goo, and should live with the rest of vma code, as you suggest. On the > other had, exec is the only consumer of this behavior, and moving it > out of fs/exec.c means that changes to the code that specifically only > impacts exec are now in a separate file, and will no longer get exec > maintainer/reviewer CCs (based on MAINTAINERS file matching). Exec is > notoriously fragile, so I'm kind of generally paranoid about changes to > its behaviors going unnoticed. > > In defense of moving it, yes, this routine has gotten updates over the > many years, but it's relatively stable. But at least one thing has gone in > without exec maintainer review recently (I would have Acked it, but the > point is review): 9e567ca45f ("mm/ksm: fix ksm exec support for prctl") > Everything else was before I took on the role officially (Nov 2022). > > So I guess I'm asking, how do we make sure stuff pulled out of exec > still gets exec maintainer review? I think we have two options here: 1. Separate out this code into mm/vma_exec.c and treat it like mm/vma_init.c, then add you as a reviewer, so you have visibility on everything that happens there. 2. Add you as a reviewer to memory mapping in general. I think 1 is preferable actually, as it'll reduce noise for you and then you'll _always_ get notified about change in this code. Note that we have done this previously for similar reasons with relocate_vma_down() we could move this function into that file. For the sake of saving time given time zone differences, let me explore option 1in a v3, and obviously if that doesn't work for you let me know! :) We'll have to then have fs/exec.c include ../mm/vma_exec.h, so it is _only_ exec that gets access to this. > > > [...] > > static int __bprm_mm_init(struct linux_binprm *bprm) > > { > > - int err; > > [...] > > - return err; > > + return create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p); > > } > > I'd prefer __bprm_mm_init() go away if it's just a 1:1 wrapper now. > However, it doesn't really look like it makes too much sense for the NOMMU > logic get moved as well, since it explicitly depends on exec-specific > values (MAX_ARG_PAGES), so perhaps something like this: > > diff --git a/fs/exec.c b/fs/exec.c > index 8e4ea5f1e64c..313dc70e0012 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -382,9 +382,13 @@ static int bprm_mm_init(struct linux_binprm *bprm) > bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK]; > task_unlock(current->group_leader); > > - err = __bprm_mm_init(bprm); > +#ifndef CONFIG_MMU > + bprm->p = PAGE_SIZE * MAX_ARG_PAGES - sizeof(void *); > +#else > + err = create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p); > if (err) > goto err; > +#endif > > return 0; Sure I can respin with this. > > > > On a related note, I'd like to point out that my claim that exec is > the only consumer here, is slightly a lie. Technically this is correct, > but only because this is specifically setting up the _stack_. > > The rest of the VMA setup actually surrounds this code (another > reason I remain unhappy about moving it). Specifically the mm_alloc() > before __bprm_mm_init (which is reached through alloc_brpm()). And > then, following alloc_bprm() in do_execveat_common(), is the call to > setup_new_exec(), which does the rest of the VMA setup, specifically > arch_pick_mmap_layout() and related fiddling. No other callers try to allocate/free vmas, which is the issue here. > > The "create userspace VMA" logic, mostly through mm_alloc(), is > used in a few places (e.g. text poking), but the "bring up a _usable_ > userspace VMA" logic (i.e. one also with functional mmap) is repeated in > lib/kunit/alloc_user.c for allowing testing of code that touches userspace > (see kunit_attach_mm() and the kunit_vm_mmap() users). (But these tests > don't actually run userspace code, so no stack is set up.) Hm but this seems something different? This is using the mm_alloc() interface? > > I guess what I'm trying to say is that I think we need a more clearly > defined "create usable userspace VMA" API, as we've got at least 3 > scattered approaches right now: exec ("everything"), non-mmap-non-stack > users (text poking, et al), and mmap-but-not-stack users (kunit tests). But only exec is actually allocating and essentially 'forcing in' a stack vma like this? I mean I'm all for having some nicely abstracted interface rather than scattering open-coded stuff around, but the one and only objective _here_ is to disallow the use of very clearly mm-specific internal APIs elsewhere. exec is of course 'special' in this stack behaviour, but I feel we can express this 'specialness' better by having this kind of well-defined, well-described functions exposed by mm, rather than handing over absolutely fundamental API calls to any part of the kernel that wants them. > > And the One True User of a full userspace VMA, exec, has the full setup > scattered into several phases, mostly due to needing to separate those > phases because it needs to progressively gather the information needed > to correctly configure each piece: > - set up userspace VMA at all (mm_alloc) > - set up a stack because exec args need to go somewhere (__bprm_mm_init) > - move stack to the right place (depends on executable binary and task bits) > - set up mmap (arch_pick_mmap_layout) to actually load executable binary > (depends on arch, binary, and task bits) > > Hopefully this all explains why I'm uncomfortable to see __bprm_mm_init > get relocated. It'll _probably_ be fine, but I get antsy about changes > to code that only exec uses... Right, I understand and appreciate that, for sure. This is fiddly core code, you worry about things breaking - I mean I feel you (don't ask me about brk(), please please don't ask me about brk() :P) So hopefully the proposed solution with vma_exec.c should cover this off nicely? > > -- > Kees Cook Cheers, Lorenzo