Apologies, made a mistake that I wrongly assumed tooling would catch...
let me try sending this again...

On Wed, Apr 30, 2025 at 08:54:10PM +0100, Lorenzo Stoakes wrote:
> During the mmap() of a file-backed mapping, we invoke the underlying
> driver's f_op->mmap() callback in order to perform driver/file system
> initialisation of the underlying VMA.
>
> This has been a source of issues in the past, including a significant
> security concern relating to unwinding of error state discovered by
> Jann Horn, as fixed in commit 5de195060b2e ("mm: resolve faulty
> mmap_region() error path behaviour"), which performed the recent,
> significant rework of mmap() as a whole.
>
> However, a fly in the ointment remains - drivers have a great deal of
> freedom in the .mmap() hook to manipulate VMA state (as well as page
> table state).
>
> This can be problematic, as we can no longer reason sensibly about VMA
> state once the call is complete (the ability to do - anything - here
> does rather interfere with that).
>
> In addition, a driver may choose to do odd or unusual things which
> interfere with subsequent steps in the mmap() process, and may then
> raise an error, requiring very careful unwinding of state about which
> we can make no assumptions.
>
> Rather than providing such an open-ended interface, this series
> provides an alternative, far more restrictive one - we expose a
> whitelist of fields which can be adjusted by the driver, along with
> immutable state upon which the driver can make such decisions:
>
> struct vma_proto {
>         /* Immutable state. */
>         struct mm_struct *mm;
>         unsigned long start;
>         unsigned long end;
>
>         /* Mutable fields. Populated with initial state. */
>         pgoff_t pgoff;
>         struct file *file;
>         vm_flags_t flags;
>         pgprot_t page_prot;
>
>         /* Write-only fields. */
>         const struct vm_operations_struct *vm_ops;
>         void *private_data;
> };
>
> The mmap logic then uses this state to either merge with an existing
> VMA or establish a new VMA, as appropriate.
>
> This is achieved via the new f_op hook .mmap_proto(), which is,
> importantly, invoked very early on in the mmap() process.
>
> If an error arises, we can very simply abort the operation with very
> little unwinding of state required.
>
> The existing logic contains another, related, peccadillo - since the
> .mmap() callback might do anything, it may also cause a previously
> unmergeable VMA to become mergeable with adjacent VMAs.
>
> Right now the logic will retry a merge like this only if the driver
> changes VMA flags, and changes them in such a way that a merge might
> succeed (that is, the flags are not 'special' - they do not contain
> any of the flags specified in VM_SPECIAL).
>
> This has been the source of a great deal of pain for a while - it is
> hard to reason about an .mmap() callback that might do - anything -
> but it's also hard to reason about setting up a VMA and writing to the
> maple tree, only to do it again utilising a great deal of shared
> state.
>
> Since .mmap_proto() sets fields before the first merge is even
> attempted, the use of this callback obviates the need for this retry
> merge logic.
>
> A driver can specify either .mmap_proto(), .mmap(), or both. This
> provides maximum flexibility.
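>
> To make this concrete, a driver's hook might look roughly like the
> below. Note this is purely an illustrative sketch - the 'foo' driver,
> its helpers and its size limit are invented, and it assumes the hook
> simply receives a pointer to the struct vma_proto and returns zero or
> a negative error code:
>
>         static int foo_mmap_proto(struct vma_proto *proto)
>         {
>                 /*
>                  * Immutable state lets us validate the request up
>                  * front - failing here requires almost no unwinding,
>                  * as no VMA has been established yet.
>                  */
>                 if (proto->end - proto->start > FOO_MAX_MAPPING_SIZE)
>                         return -EINVAL;
>
>                 /* Only whitelisted, mutable fields may be adjusted. */
>                 proto->flags |= VM_DONTEXPAND | VM_DONTDUMP;
>
>                 /* Write-only fields. */
>                 proto->vm_ops = &foo_vm_ops;
>                 proto->private_data = foo_get_context(proto->file);
>
>                 return 0;
>         }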
> In researching this change, I examined every .mmap() callback, and
> discovered only a very few that set VMA state in such a way that
> a. the VMA flags changed and b. this would be mergeable.
>
> In the majority of cases, it turns out that drivers are mapping kernel
> memory and thus ultimately set VM_PFNMAP, VM_MIXEDMAP, or other
> unmergeable VM_SPECIAL flags.
>
> Of those that remain, I identified a number of cases which are only
> applicable in DAX, setting the VM_HUGEPAGE flag:
>
> * dax_mmap()
> * erofs_file_mmap()
> * ext4_file_mmap()
> * xfs_file_mmap()
>
> For the loss of this retry merge to impact users, each of these cases
> would require a user to mmap() files using DAX, in parts, immediately
> adjacent to one another.
>
> This is a very unlikely use case and so it does not appear to be
> worthwhile to adjust this functionality accordingly.
>
> We can, however, very quickly do so if needed by simply adding an
> .mmap_proto() hook to these as required.
>
> There are two further non-DAX cases I identified:
>
> * orangefs_file_mmap() - Clears VM_RAND_READ if set, replacing it
>   with VM_SEQ_READ.
> * usb_stream_hwdep_mmap() - Sets VM_DONTDUMP.
>
> Both of these cases again seem very unlikely to be mmap()'d
> immediately adjacent to one another in a fashion that would result in
> a merge.
>
> Finally, we are left with one viable case:
>
> * secretmem_mmap() - Sets VM_LOCKED, VM_DONTDUMP.
>
> This is viable enough that the mm selftests trigger the logic as a
> matter of course. Therefore, this series replaces the secretmem_mmap()
> hook with secretmem_mmap_proto().
>
> Lorenzo Stoakes (3):
>   mm: introduce new .mmap_proto() f_op callback
>   mm: secretmem: convert to .mmap_proto() hook
>   mm/vma: remove mmap() retry merge
>
>  include/linux/fs.h               |  7 +++
>  include/linux/mm_types.h         | 24 ++++++++
>  mm/memory.c                      |  3 +-
>  mm/mmap.c                        |  2 +-
>  mm/secretmem.c                   | 14 ++---
>  mm/vma.c                         | 99 +++++++++++++++++++++++++++-----
>  tools/testing/vma/vma_internal.h | 38 ++++++++++++
>  7 files changed, 164 insertions(+), 23 deletions(-)
>
> --
> 2.49.0
>
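
P.S. To give a flavour of the secretmem conversion in patch 2 - per the
analysis above, the hook need only set its flags on the proto object,
which the first merge attempt will then already see, making the retry
merge redundant. Roughly like the below (a hand-written sketch based on
the description above rather than the literal patch; secretmem's
existing validation is elided):

        static int secretmem_mmap_proto(struct vma_proto *proto)
        {
                /*
                 * Set before any VMA exists, so the first (and only)
                 * merge attempt already sees these flags.
                 */
                proto->flags |= VM_LOCKED | VM_DONTDUMP;
                return 0;
        }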