## INTRODUCTION After discussions in various threads (Usama's series adding a new prctl() in [0], and a proposal to adapt process_madvise() to do the same - conception in [1] and RFC in [2]), it seems fairly clear that it would make sense to explore a dedicated API to explicitly allow for actions which affect the virtual address space as a whole. Also, Barry is implementing a feature (currently under RFC) which could additionally make use of this API (see [3]). [0]: https://lore.kernel.org/all/20250515133519.2779639-1-usamaarif642@xxxxxxxxx/ [1]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/ [2]: https://lore.kernel.org/all/cover.1747686021.git.lorenzo.stoakes@xxxxxxxxxx/ [3]: https://lore.kernel.org/all/20250514070820.51793-1-21cnbao@xxxxxxxxx/ While madvise() and process_madvise() are useful for altering the attributes of VMAs within a virtual address space, it isn't the right fit for something that affects the whole address space. Additionally, a requirement of Usama's proposal (see [0]) is that we have the ability to propagate the change in behaviour across fork/exec. This further suggests the need for a dedicated interface, as this really sits outside the ordinary behaviour of [process_]madvise(). prctl() is too broad and encourages mm code to migrate to kernel/sys.c where it is at risk of bit-rotting. It can make it harder/impossible to isolate mm logic for testing and logic there might be missed in changes moving forward. It also, like so many kernel interfaces, has 'grown roots out of its pot' so to speak - while it started off as an ostensible 'process' manipulation interface, prctl() operations manipulate a myriad of task, virtual address space and even specific VMA attributes. At this stage it really is a 'catch-all' for things we simply couldn't fit elsewhere. Therefore, as suggested by the rather excellent Liam Howlett, I propose an mm-specific interface that _explicitly_ manipulates attributes of the virtual address space as a whole. I think something that mimics the simplicity of [process_]madvise() makes sense - have a limited set of actions that can be taken, and treat them as a simple action - a user requests you do XXX to the virtual address space (that is, across the mm_struct), and you do it. ## INTERFACE The proposed interface is simply: int mctl(int pidfd, int action, unsigned int flags); Since PIDFD_SELF is now available, it is easy to invoke this for the current process, while also adding the flexibility of being able to apply actions to other processes also. The function will return 0 on success, -1 on failure, with errno set to the error that arose, standard stuff. The behaviour will be tailored to each action taken. To begin with, I propose a single flag: - MCTL_SET_DEFAULT_EXEC - Persists this behaviour across fork/exec. This again will be tailored - only certain actions will be allowed to set this flag, and we will of course assert appropriate capabilities, etc. upon its use. All actions would, impact every VMA (if adjustments to VMAs are required). ## SECURITY Of course, security will be of utmost concern (Jann's input is important here :) We can vary security requirements depending on the action taken. For an initial version I suggest we simply limit operations which: - Operate on a remote process - Use the MCTL_SET_DEFAULT_EXEC flag To those tasks which possess the CAP_SYS_ADMIN capability. This may be too restrictive - be good to get some feedback on this. I know Jann raised concerns around privileged execution and perhaps it'd be useful to see whether this would make more sense for the SET_DEFAULT_EXEC case or not. Usama - would requiring CAP_SYS_ADMIN be egregious to your use case? ## IMPLEMENTATION I think that sensibly we'd need to add some new files here, mm/mctl.c, include/linux/mctl.h (the latter of providing the MCTL_xxx actions and flags). We could find ways to share code between mm files where appropriate to avoid too much duplication. I suggest that the best way forward, if we were minded to examine how this would look in practice, would be for me to implement an RFC that adds the interface, and a simple MCTL_SET_NOHUGEPAGE, MCTL_CLEAR_NOHUGEPAGE implementation as a proof of concept. If we wanted to then go ahead with a non-RFC version, this could then form a foundation upon which Usama and Barry could implement their features, with Usama then able to add MCTL_[SET/CLEAR]_HUGEPAGE and Barry MCTL_[SET/CLEAR]_FADE_ON_DEATH. Obviously I don't mean to presume to suggest how we might proceed here - only suggesting this might be a good way of moving forward and getting things done as quickly as possible while allowing you guys to move forward with your features. Let me know if this makes sense, alternatively I could try to find a relatively benign action to implement as part of the base work, or we could simply collaborate to do it all in one series with multiple authors? ## RFC The above is all only in effect 'putting ideas out there' so this is entirely an RFC in spirit and intent - let me know if this makes sense in whole or part :) Thanks! Lorenzo