On Thu, May 29, 2025 at 04:50:16PM +0200, Vlastimil Babka wrote:
> On 5/21/25 20:20, Lorenzo Stoakes wrote:
> > If a user wishes to enable KSM mergeability for an entire process and all
> > fork/exec'd processes that come after it, they use the prctl()
> > PR_SET_MEMORY_MERGE operation.
> >
> > This defaults all newly mapped VMAs to have the VM_MERGEABLE VMA flag set
> > (in order to indicate they are KSM mergeable), as well as setting this flag
> > for all existing VMAs.
> >
> > However it also entirely and completely breaks VMA merging for the process
> > and all forked (and fork/exec'd) processes.
>
> I think merging due to e.g. mprotect() should still work, but for new VMAs,
> yeah.

Yes, for new VMAs. I'll update the cover letter subject line + commit
messages accordingly...

I may have over-egged the pudding, but it's still really serious. You're
right though, this is misleading - it's not _all_ merging, it's _all merging
of new mappings, ever_. Which is still, you know, not great :)

> > This is because when a new mapping is proposed, the flags specified will
> > never have VM_MERGEABLE set. However all adjacent VMAs will already have
> > VM_MERGEABLE set, rendering VMAs unmergeable by default.
> >
> > To work around this, we try to set the VM_MERGEABLE flag prior to
> > attempting a merge. In the case of brk() this can always be done.
> >
> > However on mmap() things are more complicated - while KSM is not supported
> > for file-backed mappings, it is supported for MAP_PRIVATE file-backed
>
> ^ insert "shared" to make it obvious?

Good spot, this is confusing as-is, will fix up on respin.

> > mappings.
> >
> > And these mappings may have deprecated .mmap() callbacks specified which
> > could, in theory, adjust flags and thus KSM eligibility.
>
> Right, however your can_set_ksm_flags_early() isn't testing exactly that?
> More on that there.
It's testing whether we are in a known case where we can go ahead and set
VM_MERGEABLE, because either we know .mmap() can't change _KSM_ merge
eligibility, or it won't be invoked at all so can't hurt us.

Realistically, almost certainly the only cases where this applies are ones
where VM_PFNMAP + friends are set, which a bunch of drivers do.

David actually proposes we stop disallowing this for KSM, at which point we
can drop this function anyway. But that's best done as a follow-up.

> > This is unlikely to cause an issue on merge, as any adjacent file-backed
> > mappings would already have the same post-.mmap() callback attributes, and
> > thus would naturally not be merged.
>
> I'm getting a bit lost as two kinds of merging have to be discussed. If the
> VMAs around have the same attributes, they would be VMA-merged, no?

The overloading of this term is very annoying. But yeah, I need to drop this
bit - the VMA mergeability isn't really applicable. I'll explain why...

My concern was that you'd set VM_MERGEABLE, attempt a merge, and get merged
with an adjacent VMA. But _later_ the .mmap() hook, had the merge not
occurred, would have set some flags that would have made the prior merge
invalid (oopsy!)

However this isn't correct. The vma->vm_file would need to be the same for
both, and therefore any adjacent VMA would already have had .mmap() called
and had its VMA flags changed.

TL;DR: I should drop this bit from the commit message...

> > But for the purposes of establishing a VMA as KSM-eligible (as well as
> > initially scanning the VMA), this is potentially very problematic.
>
> This part I understand as we have to check if we can add VM_MERGEABLE after
> mmap() has adjusted the flags, as it might have an effect on the result of
> ksm_compatible()?

Yes.

> > So we check to determine whether this is at all possible. If not, we set
> > VM_MERGEABLE prior to the merge attempt on mmap(), otherwise we retain the
> > previous behaviour.
> >
> > When .mmap_prepare() is more widely used, we can remove this precaution.
> >
> > While this doesn't quite cover all cases, it covers a great many (all
> > anonymous memory, for instance), meaning we should already see a
> > significant improvement in VMA mergeability.
> >
> > When it comes to file-backed mappings (other than shmem), we are really
> > only interested in MAP_PRIVATE mappings, which have an available anon
> > page by default. Therefore, the VM_SPECIAL restriction makes less sense
> > for KSM.
> >
> > In a future series we therefore intend to remove this limitation, which
> > ought to simplify this implementation. However it makes sense to defer
> > doing so until a later stage so we can first address this mergeability
> > issue.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
> > Fixes: d7597f59d1d3 ("mm: add new api to enable ksm per process") # please no backport!
> > Reviewed-by: Chengming Zhou <chengming.zhou@xxxxxxxxx>

<snip>

> > +/*
> > + * Are we guaranteed no driver can change state such as to preclude KSM merging?
> > + * If so, let's set the KSM mergeable flag early so we don't break VMA merging.
> > + *
> > + * This is applicable when PR_SET_MEMORY_MERGE has been set on the mm_struct via
> > + * prctl() causing newly mapped VMAs to have the KSM mergeable VMA flag set.
> > + *
> > + * If this is not the case, then we set the flag after considering mergeability,
>
> ^ "VMA"
>
> > + * which will prevent mergeability as, when PR_SET_MEMORY_MERGE is set, a new
>
> ^ "VMA"
>
> > + * VMA will not have the KSM mergeability VMA flag set, but all other VMAs will,
> > + * preventing any merge.
>
> ^ "VMA"
>
> tedious I know, but more obvious, IMHO

Ack, will fix up.

> > + */
> > +static bool can_set_ksm_flags_early(struct mmap_state *map)
> > +{
> > +	struct file *file = map->file;
> > +
> > +	/* Anonymous mappings have no driver which can change them. */
> > +	if (!file)
> > +		return true;
> > +
> > +	/* shmem is safe. */
> > +	if (shmem_file(file))
> > +		return true;
> > +
> > +	/*
> > +	 * If .mmap_prepare() is specified, then the driver will have already
> > +	 * manipulated state prior to updating KSM flags.
> > +	 */
> > +	if (file->f_op->mmap_prepare)
> > +		return true;
> > +
> > +	return false;
>
> So back to my reply in the commit log, why test for mmap_prepare and
> otherwise assume false, and not instead test for f_op->mmap which would
> result in false, and otherwise return true? Or am I assuming wrong that
> there are f_ops that have neither of those two callbacks?

Because shmem has .mmap() set, but we know it's safe.

I mean, we should probably put the mmap_prepare check before shmem_file() to
make things clearer.

We plan to drop this function soon (see above) anyway. Just being mighty
cautious.

> > +}
> > +
> >  static unsigned long __mmap_region(struct file *file, unsigned long addr,
> >  		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> >  		struct list_head *uf)
> > @@ -2595,6 +2633,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> >  	bool have_mmap_prepare = file && file->f_op->mmap_prepare;
> >  	VMA_ITERATOR(vmi, mm, addr);
> >  	MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file);
> > +	bool check_ksm_early = can_set_ksm_flags_early(&map);
> >
> >  	error = __mmap_prepare(&map, uf);
> >  	if (!error && have_mmap_prepare)
> > @@ -2602,6 +2641,9 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> >  	if (error)
> >  		goto abort_munmap;
> >
> > +	if (check_ksm_early)
> > +		update_ksm_flags(&map);
> > +
> >  	/* Attempt to merge with adjacent VMAs... */
> >  	if (map.prev || map.next) {
> >  		VMG_MMAP_STATE(vmg, &map, /* vma = */ NULL);
> >
> > @@ -2611,6 +2653,9 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> >
> >  	/* ...but if we can't, allocate a new VMA. */
> >  	if (!vma) {
> > +		if (!check_ksm_early)
> > +			update_ksm_flags(&map);
> > +
> >  		error = __mmap_new_vma(&map, &vma);
> >  		if (error)
> >  			goto unacct_error;
> > @@ -2713,6 +2758,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >  	 * Note: This happens *after* clearing old mappings in some code paths.
> >  	 */
> >  	flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
> > +	flags = ksm_vma_flags(mm, NULL, flags);
> >  	if (!may_expand_vm(mm, flags, len >> PAGE_SHIFT))
> >  		return -ENOMEM;
> >
> > @@ -2756,7 +2802,6 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >
> >  	mm->map_count++;
> >  	validate_mm(mm);
> > -	ksm_add_vma(vma);
> >  out:
> >  	perf_event_mmap(vma);
> >  	mm->total_vm += len >> PAGE_SHIFT;
> > --
> > 2.49.0
>