+cc SJ. On Fri, May 16, 2025 at 01:57:18PM +0100, Lorenzo Stoakes wrote: > On Fri, May 16, 2025 at 01:24:18PM +0200, David Hildenbrand wrote: > > Looking forward to hearing what your magic thinking cap can do! :) > > OK so just to say at the outset, this is purely playing around with a > theoretical idea here, so if it's crazy just let me know :)) > > Right now madvise() has limited utility because: > > - You have little control over how the operation is done > - You get little feedback about what's actually succeeded or not > - While you can perform multiple operations at once via process_madvise(), > even to the current process (after my changes to extend it), it's limited > to a single advice over 8 ranges. SJ raised the point that I am just wrong about this (see [0]). So this makes the interface even more compelling in my view, we can either: 1. Simply remove the 'madvise_ranges' stuff below, and replace with an iovec, and simply forward that to process_madvise() (simpler, probably preferable). 2. Keep it as-is so we get to perform _multiple_ advice operations in a batch. 3. Use an iovec but also have some other array specifying operations to get the same thing, but maybe extend process_madvise()? I really like the idea however of just - using the existing process_madvise() code - picking up all the recent improvements - avoiding duplication, etc. The key idea of this interface is more-so being able to control certain behaviours (such as stopping on gaps etc.) Yet Another (TM) alternative would be to use the -currently unused- process_madvise() flags field to specify options as mentioned here, including the 'set default' thing. Now _that_ is really interesting actually. It will give us less flexibility, but require a much much less major change. OK damn, that's quite compelling... maybe I will do an RFC patch for that... :) Happy to hear thoughts on it... [0]: https://lore.kernel.org/all/20250517162048.36347-1-sj@xxxxxxxxxx/ > - You can't say 'ignore errors just try' > - You get the weird gap behaviour. > > So the concept is - make everything explicit and add a new syscall that > wraps the existing madvise() stuff and addresses all the above issues. > > Specifically pertinent to the case at hand - also add a 'set_default' > boolean (you'll see shortly exactly where) to also tell madvise() to make > all future VMAs default to the specified advice. We'll whitelist what we're > allowed to use here and should be able to use mm->def_flags. > > So the idea is we'll use a helper struct-configured function (hey, it's me, > I <3 helper structs so of course) like: > > int madvise_ranges(struct madvise_range_control *ctl); > > With the data structures as follows (untested, etc. etc.): > > enum madvise_range_type { > MADVISE_RANGE_SINGLE, > MADVISE_RANGE_MULTI, > MADVISE_RANGE_ALL, > }; > > struct madvise_range { > const void *addr; > size_t size; > int advice; > }; > > struct madvise_ranges { > const struct madvise_range *arr; > size_t count; > }; > > struct madvise_range_stats { > struct madvise_range range; > bool success; > bool partial; > }; > > struct madvise_ranges_stats { > unsigned long nr_mappings_advised; > unsigned long nr_mappings_skipped; > unsigned long nr_pages_advised; > unsigned long nr_pages_skipped; > unsigned long nr_gaps; > > /* > * Useful for madvise_range_control->ignore_errors: > * > * If non-NULL, points to an array of size equal to the number of ranges > * specified. Indiciates the specified range, whether it succeeded, and > * whether that success was partial (that is, the range specified > * multiple mappings, only some of which had advice applied > * successfully). > * > * Not valid for MADVISE_RANGE_ALL. > */ > struct madvise_range_stats *per_range_stats; > > /* Error details. */ > int err; > unsigned long failed_address; > size_t offset; /* If multi, at which offset did this occur? */ > }; > > struct madvise_ranges_control { > int version; /* Allow future updates to API. */ > > enum madvise_range_type type; > > union { > struct madvise_range range; /* MADVISE_RANGE_SINGLE */ > struct madvise_ranges ranges; /* MADVISE_RANGE_MULTI */ > struct all { /* MADVISE_RANGE_ALL */ > int advice; > /* > * If set, also have all future mappings have this applied by default. > * > * Only whitelisted advice may set this, otherwise -EINVAL will be returned. > */ > bool set_default; > }; > }; > struct madvise_ranges_stats *stats; /* If non-NULL, report information about operation. */ > > int pidfd; /* If is_remote set, the remote process. */ > > /* Options. */ > bool is_remote :1; /* Target remote process as specified by pidfd. */ > bool ignore_errors :1; /* If error occurs applying advice, carry on to next VMA. */ > bool single_mapping_only :1; /* Error out if any range is not a single VMA. */ > bool stop_on_gap :1; /* Stop operation if input range includes unmapped memory. */ > }; > > So the user can specify whether to apply advice to a single range, > multiple, or the whole address space, with real control over how the operation proceeds. > > This basically solves the problem this series tries to address while also > providing an improved madvise() API at the same time. > > Thoughts? Have I finally completely lost my mind?