On Wed, May 14, 2025 at 11:01:49AM +0100, Pedro Falcato wrote: > On Wed, May 14, 2025 at 10:12:49AM +0100, Lorenzo Stoakes wrote: > > On Wed, May 14, 2025 at 10:04:06AM +0100, Pedro Falcato wrote: > > > On Fri, May 09, 2025 at 01:13:34PM +0100, Lorenzo Stoakes wrote: > > > > Provide a means by which drivers can specify which fields of those > > > > permitted to be changed should be altered to prior to mmap()'ing a > > > > range (which may either result from a merge or from mapping an entirely new > > > > VMA). > > > > > > > > Doing so is substantially safer than the existing .mmap() calback which > > > > provides unrestricted access to the part-constructed VMA and permits > > > > drivers and file systems to do 'creative' things which makes it hard to > > > > reason about the state of the VMA after the function returns. > > > > > > > > The existing .mmap() callback's freedom has caused a great deal of issues, > > > > especially in error handling, as unwinding the mmap() state has proven to > > > > be non-trivial and caused significant issues in the past, for instance > > > > those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region() > > > > error path behaviour"). > > > > > > > > It also necessitates a second attempt at merge once the .mmap() callback > > > > has completed, which has caused issues in the past, is awkward, adds > > > > overhead and is difficult to reason about. > > > > > > > > The .mmap_prepare() callback eliminates this requirement, as we can update > > > > fields prior to even attempting the first merge. It is safer, as we heavily > > > > restrict what can actually be modified, and being invoked very early in the > > > > mmap() process, error handling can be performed safely with very little > > > > unwinding of state required. > > > > > > > > The .mmap_prepare() and deprecated .mmap() callbacks are mutually > > > > exclusive, so we permit only one to be invoked at a time. > > > > > > > > Update vma userland test stubs to account for changes. > > > > > > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> > > > > > > Reviewed-by: Pedro Falcato <pfalcato@xxxxxxx> > > > > > > Neat idea, thanks. This should also help out with the insane proliferation of > > > vm_flags_set in ->mmap() callbacks all over. Hopefully. > [snip] > > > > > > 2) Possibly add a ->mmap_finish()? With a fully constructed vma at that point. > > > So things like remap_pfn_range can still be used by drivers' mmap() > > > implementation. > > > > Thanks for raising the remap_pfn_range() case! Yes this is definitely a > > thing. > > > > However this proposed callback would totally undermine the purpose of the > > change - the idea is to never give a vma because if we do so we lose all of > > the advantages here and may as well just leave the mmap in place for > > this... > > > > Yes, good point. > > > However I do think we'll need a new callback at some point (previously > > discussed in thread). > > > > We could perhaps provide the option to _explicitly_ remap for instance. I > > would want it to be heavily locked down as to what can happen and to happen > > as early as possible. > > > > I think we can simply combine various ideas here. Like: > > struct vm_area_desc_private { > struct vm_area_desc desc; > struct vm_area_struct *vma; > }; > > Then, for this "mmap_finish" callback and associated infra: > > int (*mmap_finish)(struct vm_area_desc *desc); > > int mmap_remap_pfn_range(struct vm_area_desc *desc, /*...*/) > { > struct vm_area_desc_private *private = container_of(desc, struct vm_area_desc_private, desc); > return remap_pfn_range(private->vma, /*...*/); > } > > int random_driver_mmap_finish(struct vm_area_desc *desc) > { > return mmap_remap_pfn_range(desc, desc->start, some_pfn, some_size, > desc->page_prot); > } > > Yeah that's really nice actually! I like that. Will note down for when we come to this. Obviously we need to know whent the finish implies a remap, but we can have a field for this potentially set in the vm_area_desc to indicate this. > I think something of the sort would be quite less prone to abuse, and we could > take the time to then even polish up the interface (e.g maybe it would be nicer > if mmap_remap_pfn_range took a vma offset, and not a start address). Right yeah, David definitely wanted some improvements in this area so this aligns with other intentions/work. > > Anyway, just brainstorming. This idea came to mind, I think it's quite interesting. Nice ideas :) yeah, good chance to undo some sins here. > > > This is something we can iterate on, as trying to find the ideal scheme > > immediately will just lead to inaction, the big advantage with the approach > > here is we can be iterative. > > > > We provide this, use it in a scenario which allows us to eliminate merge > > retry, and can take it from there :) > > Yep, totally. > > > > > So indeed, watch this space basically... I will be highly proactive on this > > stuff moving forward. > > > > > > > > 1) is particularly important so our VFS and driver friends know this is supposed > > > to be The Way Forward. > > > > I think probably the answer is for me to fully update the document to be > > bang up to date, right? But that would obviously work best as a follow up > > patch :) > > > > You love your big projects :p I had the impression the docs were more or less up to date? > The VFS people do update it somewhat diligently. And for mm we only have ->mmap, ->get_unmapped_area, > and now ->mmap_prepare. And the descriptions are ATM quite useless, just > "called by the mmap(2) system call". Much like leg day, sometimes painful things are worthwhile things :) As for docs I saw there were some missing uring entries, and it explicitly says 'as of kernel 4.18'. So I feel that it really needs to be properly updated and it ends up being a little out of scope to the series. Since this series only uses this callback in one, mm-specific, place, I think it makes sense to attack this on a follow up, alongside updating file system callbacks etc. But it's definitely on the todo! :) Cheers, Lorenzo > > -- > Pedro