Apologies, made a mistake that I wrongly assumed tooling would catch...
let me try sending this again...

On Wed, Apr 30, 2025 at 08:54:10PM +0100, Lorenzo Stoakes wrote:
> During the mmap() of a file-backed mapping, we invoke the underlying
> driver's f_op->mmap() callback in order to perform driver/file system
> initialisation of the underlying VMA.
>
> This has been a source of issues in the past, including a significant
> security concern relating to unwinding of error state discovered by
> Jann Horn, as fixed in commit 5de195060b2e ("mm: resolve faulty
> mmap_region() error path behaviour"), which performed the recent,
> significant rework of mmap() as a whole.
>
> However, a fly in the ointment remains - drivers have a great deal of
> freedom in the .mmap() hook to manipulate VMA state (as well as page
> table state).
>
> This can be problematic, as we can no longer reason sensibly about VMA
> state once the call is complete (the ability to do - anything - here
> does rather interfere with that).
>
> In addition, a driver may choose to do odd or unusual things which
> interfere with subsequent steps in the mmap() process, and may then
> raise an error, requiring very careful unwinding of state about which
> we can make no assumptions.
>
> Rather than providing such an open-ended interface, this series
> provides an alternative, far more restrictive one - we expose a
> whitelist of fields which can be adjusted by the driver, along with
> immutable state upon which the driver can make such decisions:
>
> struct vma_proto {
>         /* Immutable state. */
>         struct mm_struct *mm;
>         unsigned long start;
>         unsigned long end;
>
>         /* Mutable fields. Populated with initial state. */
>         pgoff_t pgoff;
>         struct file *file;
>         vm_flags_t flags;
>         pgprot_t page_prot;
>
>         /* Write-only fields. */
>         const struct vm_operations_struct *vm_ops;
>         void *private_data;
> };
>
> The mmap logic then uses this state to either merge with an existing
> VMA or establish a new VMA, as appropriate.
>
> This is achieved via the new f_op hook .mmap_proto(), which is,
> importantly, invoked very early on in the mmap() process.
>
> If an error arises, we can very simply abort the operation with very
> little unwinding of state required.
>
> The existing logic contains another, related, peccadillo - since the
> .mmap() callback might do anything, it may also cause a previously
> unmergeable VMA to become mergeable with adjacent VMAs.
>
> Right now the logic will retry a merge like this only if the driver
> changes VMA flags, and changes them in such a way that a merge might
> succeed (that is, the flags are not 'special' - they do not contain
> any of the flags specified in VM_SPECIAL).
>
> This has been the source of a great deal of pain for a while - it is
> hard to reason about an .mmap() callback that might do - anything -
> but it's also hard to reason about setting up a VMA and writing to the
> maple tree, only to do it again utilising a great deal of shared
> state.
>
> Since .mmap_proto() sets fields before the first merge is even
> attempted, the use of this callback obviates the need for this retry
> merge logic.
>
> A driver can specify either .mmap_proto(), .mmap(), or both. This
> provides maximum flexibility.
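>
> To make this concrete, a driver's hook might look roughly like the
> below. Note this is purely an illustrative sketch - the 'foo' driver,
> its helpers and its size limit are invented, and it assumes the hook
> simply receives a pointer to the struct vma_proto and returns zero or
> a negative error code:
>
>         static int foo_mmap_proto(struct vma_proto *proto)
>         {
>                 /*
>                  * Immutable state lets us validate the request up
>                  * front - failing here requires almost no unwinding,
>                  * as no VMA has been established yet.
>                  */
>                 if (proto->end - proto->start > FOO_MAX_MAPPING_SIZE)
>                         return -EINVAL;
>
>                 /* Only whitelisted, mutable fields may be adjusted. */
>                 proto->flags |= VM_DONTEXPAND | VM_DONTDUMP;
>
>                 /* Write-only fields. */
>                 proto->vm_ops = &foo_vm_ops;
>                 proto->private_data = foo_get_context(proto->file);
>
>                 return 0;
>         }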
> In researching this change, I examined every .mmap() callback, and
> discovered only a very few that set VMA state in such a way that
> a. the VMA flags changed and b. this would be mergeable.
>
> In the majority of cases, it turns out that drivers are mapping kernel
> memory and thus ultimately set VM_PFNMAP, VM_MIXEDMAP, or other
> unmergeable VM_SPECIAL flags.
>
> Of those that remain, I identified a number of cases which are only
> applicable in DAX, setting the VM_HUGEPAGE flag:
>
> * dax_mmap()
> * erofs_file_mmap()
> * ext4_file_mmap()
> * xfs_file_mmap()
>
> For the loss of this retry merge to impact users, each of these cases
> would require a user to mmap() files using DAX, in parts, immediately
> adjacent to one another.
>
> This is a very unlikely use case and so it does not appear to be
> worthwhile to adjust this functionality accordingly.
>
> We can, however, very quickly do so if needed by simply adding an
> .mmap_proto() hook to these as required.
>
> There are two further non-DAX cases I identified:
>
> * orangefs_file_mmap() - Clears VM_RAND_READ if set, replacing it
>   with VM_SEQ_READ.
> * usb_stream_hwdep_mmap() - Sets VM_DONTDUMP.
>
> Both of these cases again seem very unlikely to be mmap()'d
> immediately adjacent to one another in a fashion that would result in
> a merge.
>
> Finally, we are left with one viable case:
>
> * secretmem_mmap() - Sets VM_LOCKED, VM_DONTDUMP.
>
> This is viable enough that the mm selftests trigger the logic as a
> matter of course. Therefore, this series replaces the secretmem_mmap()
> hook with secretmem_mmap_proto().
>
> Lorenzo Stoakes (3):
>   mm: introduce new .mmap_proto() f_op callback
>   mm: secretmem: convert to .mmap_proto() hook
>   mm/vma: remove mmap() retry merge
>
>  include/linux/fs.h               |  7 +++
>  include/linux/mm_types.h         | 24 ++++++++
>  mm/memory.c                      |  3 +-
>  mm/mmap.c                        |  2 +-
>  mm/secretmem.c                   | 14 ++---
>  mm/vma.c                         | 99 +++++++++++++++++++++++++++-----
>  tools/testing/vma/vma_internal.h | 38 ++++++++++++
>  7 files changed, 164 insertions(+), 23 deletions(-)
>
> --
> 2.49.0
>
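
P.S. To give a flavour of the secretmem conversion in patch 2 - per the
analysis above, the hook need only set its flags on the proto object,
which the first merge attempt will then already see, making the retry
merge redundant. Roughly like the below (a hand-written sketch based on
the description above rather than the literal patch; secretmem's
existing validation is elided):

        static int secretmem_mmap_proto(struct vma_proto *proto)
        {
                /*
                 * Set before any VMA exists, so the first (and only)
                 * merge attempt already sees these flags.
                 */
                proto->flags |= VM_LOCKED | VM_DONTDUMP;
                return 0;
        }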