[DISCUSSION] proposed mctl() API

Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> · Thu, 29 May 2025 15:43:26 +0100

## INTRODUCTION

After discussions in various threads (Usama's series adding a new prctl()
in [0], and a proposal to adapt process_madvise() to do the same -
conception in [1] and RFC in [2]), it seems fairly clear that it would make
sense to explore a dedicated API to explicitly allow for actions which
affect the virtual address space as a whole.

Also, Barry is implementing a feature (currently under RFC) which could
additionally make use of this API (see [3]).

[0]: https://lore.kernel.org/all/20250515133519.2779639-1-usamaarif642@xxxxxxxxx/
[1]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
[2]: https://lore.kernel.org/all/cover.1747686021.git.lorenzo.stoakes@xxxxxxxxxx/
[3]: https://lore.kernel.org/all/20250514070820.51793-1-21cnbao@xxxxxxxxx/

While madvise() and process_madvise() are useful for altering the
attributes of VMAs within a virtual address space, it isn't the right fit
for something that affects the whole address space.

Additionally, a requirement of Usama's proposal (see [0]) is that we have
the ability to propagate the change in behaviour across fork/exec. This
further suggests the need for a dedicated interface, as this really sits
outside the ordinary behaviour of [process_]madvise().

prctl() is too broad and encourages mm code to migrate to kernel/sys.c
where it is at risk of bit-rotting. It can make it harder/impossible to
isolate mm logic for testing and logic there might be missed in changes
moving forward.

It also, like so many kernel interfaces, has 'grown roots out of its pot'
so to speak - while it started off as an ostensible 'process' manipulation
interface, prctl() operations manipulate a myriad of task, virtual
address space and even specific VMA attributes.

At this stage it really is a 'catch-all' for things we simply couldn't fit
elsewhere.

Therefore, as suggested by the rather excellent Liam Howlett, I propose an
mm-specific interface that _explicitly_ manipulates attributes of the
virtual address space as a whole.

I think something that mimics the simplicity of [process_]madvise() makes
sense - have a limited set of actions that can be taken, and treat them as
a simple action - a user requests you do XXX to the virtual address space
(that is, across the mm_struct), and you do it.

## INTERFACE

The proposed interface is simply:

int mctl(int pidfd, int action, unsigned int flags);

Since PIDFD_SELF is now available, it is easy to invoke this for the
current process, while also adding the flexibility of being able to apply
actions to other processes also.

The function will return 0 on success, -1 on failure, with errno set to the
error that arose, standard stuff.

The behaviour will be tailored to each action taken.

To begin with, I propose a single flag:

- MCTL_SET_DEFAULT_EXEC - Persists this behaviour across fork/exec.

This again will be tailored - only certain actions will be allowed to set
this flag, and we will of course assert appropriate capabilities, etc. upon
its use.

All actions would, impact every VMA (if adjustments to VMAs are required).

## SECURITY

Of course, security will be of utmost concern (Jann's input is important
here :)

We can vary security requirements depending on the action taken.

For an initial version I suggest we simply limit operations which:

- Operate on a remote process
- Use the MCTL_SET_DEFAULT_EXEC flag

To those tasks which possess the CAP_SYS_ADMIN capability.

This may be too restrictive - be good to get some feedback on this.

I know Jann raised concerns around privileged execution and perhaps it'd be
useful to see whether this would make more sense for the SET_DEFAULT_EXEC
case or not.

Usama - would requiring CAP_SYS_ADMIN be egregious to your use case?

## IMPLEMENTATION

I think that sensibly we'd need to add some new files here, mm/mctl.c,
include/linux/mctl.h (the latter of providing the MCTL_xxx actions and
flags).

We could find ways to share code between mm files where appropriate to
avoid too much duplication.

I suggest that the best way forward, if we were minded to examine how this
would look in practice, would be for me to implement an RFC that adds the
interface, and a simple MCTL_SET_NOHUGEPAGE, MCTL_CLEAR_NOHUGEPAGE
implementation as a proof of concept.

If we wanted to then go ahead with a non-RFC version, this could then form
a foundation upon which Usama and Barry could implement their features,
with Usama then able to add MCTL_[SET/CLEAR]_HUGEPAGE and Barry
MCTL_[SET/CLEAR]_FADE_ON_DEATH.

Obviously I don't mean to presume to suggest how we might proceed here -
only suggesting this might be a good way of moving forward and getting
things done as quickly as possible while allowing you guys to move forward
with your features.

Let me know if this makes sense, alternatively I could try to find a
relatively benign action to implement as part of the base work, or we could
simply collaborate to do it all in one series with multiple authors?

## RFC

The above is all only in effect 'putting ideas out there' so this is
entirely an RFC in spirit and intent - let me know if this makes sense in
whole or part :)

Thanks!

Lorenzo