(cc'ing linux-api)

On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote:
> REVIEWERS NOTES:
> ================
>
> This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> 'putting it out there' for feedback. Any serious version of this will add a
> bunch of self-tests to assert correct behaviour and I will more carefully
> confirm everything's working.
>
> This is based on discussion arising from Usama's series [0], SJ's input on
> the thread around process_madvise() behaviour [1] (and a subsequent
> response by me [2]) and prior discussion about a new madvise() interface
> [3].
>
> [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@xxxxxxxxx/
> [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@xxxxxxxxxx/
> [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
>
> ================
>
> Currently, we are rather restricted in how madvise() operations
> proceed. While effort has been put in to expanding what process_madvise()
> can do (that is - unrestricted application of advice to the local process
> alongside recent improvements on the efficiency of TLB operations over
> these batches), we are still constrained by existing madvise() limitations
> and default behaviours.
>
> This series makes use of the currently unused flags field in
> process_madvise() to provide more flexibility.
>
> It introduces four flags:
>
> 1. PMADV_SKIP_ERRORS
>
> Currently, when an error arises applying advice in any individual VMA
> (keeping in mind that a range specified to madvise() or as part of the
> iovec passed to process_madvise() may span multiple VMAs), the operation
> stops where it is and returns an error.
>
> This might not be the desired behaviour of the user, who may wish instead
> for the operation to be 'best effort'. By setting this flag, that behaviour
> is obtained.
>
> Since process_madvise() would trivially, if skipping errors, simply return
> the input vector size, we instead return the number of entries in the
> vector which completed successfully without error.
>
> The PMADV_SKIP_ERRORS flag implies PMADV_NO_ERROR_ON_UNMAPPED.
>
> 2. PMADV_NO_ERROR_ON_UNMAPPED
>
> Currently madvise() has the peculiar behaviour of, if the range specified
> to it contains unmapped range(s), completing the full operation, but
> ultimately returning -ENOMEM.
>
> In the case of process_madvise(), this is fatal, as the operation will stop
> immediately upon this occurring.
>
> By setting PMADV_NO_ERROR_ON_UNMAPPED, the user can indicate that it wishes
> unmapped areas to simply be entirely ignored.
>
> 3. PMADV_SET_FORK_EXEC_DEFAULT
>
> It may be desirable for a user to specify that all VMAs mapped in a process
> address space have an madvise() behaviour established by default, in such
> a fashion that this persists across fork/exec.
>
> Since this is a very powerful option that would make no sense for many
> advice modes, we explicitly only permit known-safe flags here (currently
> MADV_HUGEPAGE and MADV_NOHUGEPAGE only).
>
> 4. PMADV_ENTIRE_ADDRESS_SPACE
>
> It can be annoying, should a user wish to apply madvise() to all VMAs in an
> address space, to have to add a single large entry to the input iovec.
>
> So provide sugar to permit this - PMADV_ENTIRE_ADDRESS_SPACE. If specified,
> we expect the user to pass NULL and -1 to the vec and vlen parameters
> respectively so they explicitly acknowledge that these will be ignored,
> e.g.:
>
> 	process_madvise(PIDFD_SELF, NULL, -1, MADV_HUGEPAGE,
> 			PMADV_ENTIRE_ADDRESS_SPACE | PMADV_SKIP_ERRORS);
>
> Usually a user ought to prefer setting PMADV_SKIP_ERRORS here as it may
> well be the case that incompatible VMAs will be encountered that ought to
> be skipped.
>
> If this is not set, PMADV_NO_ERROR_ON_UNMAPPED (which is otherwise
> implied by PMADV_SKIP_ERRORS) ought to be set since, of course, the
> entire address space spans at least some gaps.
>
> Lorenzo Stoakes (5):
>   mm: madvise: refactor madvise_populate()
>   mm/madvise: add PMADV_SKIP_ERRORS process_madvise() flag
>   mm/madvise: add PMADV_NO_ERROR_ON_UNMAPPED process_madvise() flag
>   mm/madvise: add PMADV_SET_FORK_EXEC_DEFAULT process_madvise() flag
>   mm/madvise: add PMADV_ENTIRE_ADDRESS_SPACE process_madvise() flag
>
>  include/uapi/asm-generic/mman-common.h |   6 +
>  mm/madvise.c                           | 206 +++++++++++++++++++------
>  2 files changed, 168 insertions(+), 44 deletions(-)
>
> --
> 2.49.0
>

-- 
Sincerely yours,
Mike.
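
For readers on linux-api who want to see the existing behaviour that flags 1 and 2 are meant to relax, here is a minimal userspace sketch. It uses only today's madvise(); none of the proposed PMADV_* flags appear, since they exist only in this RFC. The helper names (hole_gives_enomem_but_advice_applied(), madvise_best_effort()) are hypothetical and chosen for illustration. The second helper merely emulates, in userspace, the 'count of entries that succeeded' return value described for PMADV_SKIP_ERRORS; it is not how the series implements it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Map three anonymous pages, unmap the middle one, then apply
 * MADV_DONTNEED across the whole original range.
 *
 * Returns 1 if madvise() failed with ENOMEM *and* the pages on both sides
 * of the hole were nevertheless zapped (i.e. the advice was applied to the
 * mapped parts), 0 if not, and -1 if the setup itself failed.
 */
int hole_gives_enomem_but_advice_applied(void)
{
	long page = sysconf(_SC_PAGESIZE);
	assert(page > 0);

	char *p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* Dirty the first and last page so zapping is observable. */
	p[0] = 1;
	p[2 * page] = 1;

	/* Punch a hole in the middle of the range. */
	if (munmap(p + page, page) != 0)
		return -1;

	errno = 0;
	int ret = madvise(p, 3 * page, MADV_DONTNEED);
	int saved_errno = errno;

	/* Private anonymous pages read back as zero after MADV_DONTNEED. */
	int ok = ret == -1 && saved_errno == ENOMEM &&
		 p[0] == 0 && p[2 * page] == 0;

	munmap(p, page);
	munmap(p + 2 * page, page);
	return ok;
}

/*
 * Userspace emulation (an assumption for illustration, not the kernel
 * implementation) of best-effort semantics: apply the advice to every
 * iovec entry independently and return how many entries succeeded.
 */
ssize_t madvise_best_effort(const struct iovec *vec, size_t vlen, int advice)
{
	ssize_t succeeded = 0;

	for (size_t i = 0; i < vlen; i++) {
		if (madvise(vec[i].iov_base, vec[i].iov_len, advice) == 0)
			succeeded++;
	}
	return succeeded;
}
```

With one valid entry and one entry that madvise() rejects (for example an unaligned start address), madvise_best_effort() returns 1 rather than failing the whole call, which is the shape of return value the cover letter describes for PMADV_SKIP_ERRORS.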