On 28/03/2025 09:32, Zi Yan wrote:
> On 28 Mar 2025, at 9:09, Ryan Roberts wrote:
>
>> On 27/03/2025 20:07, Zi Yan wrote:
>>> On 27 Mar 2025, at 12:44, Matthew Wilcox wrote:
>>>
>>>> On Thu, Mar 27, 2025 at 04:06:58PM +0000, Ryan Roberts wrote:
>>>>> So let's special-case the read(ahead) logic for executable mappings. The
>>>>> trade-off is performance improvement (due to more efficient storage of
>>>>> the translations in iTLB) vs potential read amplification (due to
>>>>> reading too much data around the fault which won't be used), and the
>>>>> latter is independent of base page size. I've chosen 64K folio size for
>>>>> arm64 which benefits both the 4K and 16K base page size configs and
>>>>> shouldn't lead to any read amplification in practice since the old
>>>>> read-around path was (usually) reading blocks of 128K. I don't
>>>>> anticipate any write amplification because text is always RO.
>>>>
>>>> Is there not also the potential for wasted memory due to ELF alignment?
>>>> Kalesh talked about it in the MM BOF at the same time that Ted and I
>>>> were discussing it in the FS BOF. Some coordination required (like
>>>> maybe Kalesh could have mentioned it to me rather than assuming I'd be
>>>> there?)
>>>>
>>>>> +#define arch_exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
>>>>
>>>> I don't think the "arch" really adds much value here.
>>>>
>>>> #define exec_folio_order() get_order(SZ_64K)
>>>
>>> How about AMD’s PTE coalescing, which does PTE compression at
>>> 16KB or 32KB level? It covers 4 16KB and 2 32KB, at least it will
>>> not hurt AMD PTE coalescing. Starting with 64KB across all arch
>>> might be simpler to see the performance impact. Just a comment,
>>> no objection. :)
>>
>> exec_folio_order() is defined per-architecture and SZ_64K is the arm64 preferred
>> size. At the moment x86 is not opted in, but they could choose to opt in with
>> 32K (or whatever else makes sense) if the HW supports coalescing.
>
> Oh, I missed that part. I thought, since arch_ is not there, it was the same
> for all arch.
>
>>
>> I'm not sure if you thought this was global and are arguing against that, or if
>> you are arguing for it to be global because it will more easily show us
>> performance regressions earlier if x86 is doing this too?
>
> I thought it was global. It might be OK to set it global and let different arch
> to optimize it as it rolls out. Opt-in might be "never" until someone looks
> into it, but if it is global and it changes performance, people will notice
> and look into it.

Ahh now that we are both clear, I'd prefer to stick with the policy as
implemented; exec_folio_order() defaults to "use the existing readahead method"
but can be overridden by arches (arm64) that want specific behaviour (64K
folios).

>
> --
> Best Regards,
> Yan, Zi
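
For illustration only, the opt-in policy discussed above can be sketched in
userspace C roughly as follows. This is not the patch itself: the sentinel
EXEC_FOLIO_ORDER_NONE and the helpers order_from_size(),
arm64_exec_folio_order(), generic_exec_folio_order() and
exec_readahead_order() are hypothetical names invented for the sketch, and the
page shift is passed as a parameter instead of coming from the kernel's
PAGE_SHIFT.

```c
#include <assert.h>

#define SZ_64K 0x10000UL

/* Assumption: sentinel meaning "arch has no preference". */
#define EXEC_FOLIO_ORDER_NONE (-1)

/*
 * Hypothetical stand-in for ilog2(size >> PAGE_SHIFT): the folio order
 * (log2 of the number of base pages) needed to cover `size` bytes.
 */
static int order_from_size(unsigned long size, unsigned int page_shift)
{
	unsigned long pages = size >> page_shift;
	int order = 0;

	assert(pages != 0);	/* size must be at least one base page */
	while (pages >>= 1)
		order++;
	return order;
}

/* What an opted-in arch (arm64 in this thread) would report: 64K folios. */
static int arm64_exec_folio_order(unsigned int page_shift)
{
	return order_from_size(SZ_64K, page_shift);
}

/* What an arch that has not opted in (e.g. x86 today) would report. */
static int generic_exec_folio_order(void)
{
	return EXEC_FOLIO_ORDER_NONE;
}

/*
 * Use the arch preference when there is one, otherwise keep whatever
 * order the existing readahead method would have chosen.
 */
static int exec_readahead_order(int arch_order, int existing_order)
{
	if (arch_order == EXEC_FOLIO_ORDER_NONE)
		return existing_order;
	return arch_order;
}
```

With 4K base pages (shift 12) the arm64 helper yields order 4, and with 16K
base pages (shift 14) it yields order 2; both cover 64K. An arch that has not
opted in falls through to the existing order unchanged.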