On Mon, May 19, 2025 at 08:00:29PM +0200, David Hildenbrand wrote: > On 19.05.25 10:51, Lorenzo Stoakes wrote: > > If a user wishes to enable KSM mergeability for an entire process and all > > fork/exec'd processes that come after it, they use the prctl() > > PR_SET_MEMORY_MERGE operation. > > > > This defaults all newly mapped VMAs to have the VM_MERGEABLE VMA flag set > > (in order to indicate they are KSM mergeable), as well as setting this flag > > for all existing VMAs. > > > > However it also entirely and completely breaks VMA merging for the process > > and all forked (and fork/exec'd) processes. > > > > This is because when a new mapping is proposed, the flags specified will > > never have VM_MERGEABLE set. However all adjacent VMAs will already have > > VM_MERGEABLE set, rendering VMAs unmergeable by default. > > > > To work around this, we try to set the VM_MERGEABLE flag prior to > > attempting a merge. In the case of brk() this can always be done. > > > > However on mmap() things are more complicated - while KSM is not supported > > for file-backed mappings, it is supported for MAP_PRIVATE file-backed > > mappings. > > > > And these mappings may have deprecated .mmap() callbacks specified which > > could, in theory, adjust flags and thus KSM merge eligiblity. > > > > So we check to determine whether this at all possible. If not, we set > > VM_MERGEABLE prior to the merge attempt on mmap(), otherwise we retain the > > previous behaviour. > > > > When .mmap_prepare() is more widely used, we can remove this precaution. > > > > While this doesn't quite cover all cases, it covers a great many (all > > anonymous memory, for instance), meaning we should already see a > > significant improvement in VMA mergeability. > > We should add a Fixes: tag. > > CCing stable is likely not a good idea at this point (and might be rather > hairy). We should probably underline to Andrew not to add one :>) but sure can add. A backport would be a total pain yes. > > [...] > > > /** > > - * ksm_add_vma - Mark vma as mergeable if compatible > > + * ksm_vma_flags - Update VMA flags to mark as mergeable if compatible > > * > > - * @vma: Pointer to vma > > + * @mm: Proposed VMA's mm_struct > > + * @file: Proposed VMA's file-backed mapping, if any. > > + * @vm_flags: Proposed VMA"s flags. > > + * > > + * Returns: @vm_flags possibly updated to mark mergeable. > > */ > > -void ksm_add_vma(struct vm_area_struct *vma) > > +vm_flags_t ksm_vma_flags(const struct mm_struct *mm, const struct file *file, > > + vm_flags_t vm_flags) > > { > > - struct mm_struct *mm = vma->vm_mm; > > + vm_flags_t ret = vm_flags; > > - if (test_bit(MMF_VM_MERGE_ANY, &mm->flags)) > > - __ksm_add_vma(vma); > > + if (test_bit(MMF_VM_MERGE_ANY, &mm->flags) && > > + __ksm_should_add_vma(file, vm_flags)) > > + ret |= VM_MERGEABLE; > > + > > + return ret; > > } > > > No need for ret without harming readability. > > if () > vm_flags |= VM_MERGEABLE > return vm_flags; Ack this was just me following the 'don't mutate arguments' rule-of-thumb that obviously we violate constantly int he kernel anyway and probably never really matters... :>) But yeah ret is kind of gross here, will change. > > [...] > > > +/* > > + * Are we guaranteed no driver can change state such as to preclude KSM merging? > > + * If so, let's set the KSM mergeable flag early so we don't break VMA merging. > > + * > > + * This is applicable when PR_SET_MEMORY_MERGE has been set on the mm_struct via > > + * prctl() causing newly mapped VMAs to have the KSM mergeable VMA flag set. > > + * > > + * If this is not the case, then we set the flag after considering mergeability, > > + * which will prevent mergeability as, when PR_SET_MEMORY_MERGE is set, a new > > + * VMA will not have the KSM mergeability VMA flag set, but all other VMAs will, > > + * preventing any merge. > > Hmmm, so an ordinary MAP_PRIVATE of any file (executable etc.) will get > VM_MERGEABLE set but not be able to merge? > > Probably these are not often expected to be merged ... > > Preventing merging should really only happen because of VMA flags that are > getting set: VM_PFNMAP, VM_MIXEDMAP, VM_DONTEXPAND, VM_IO. > > > I am not 100% sure why we bail out on special mappings: all we have to do is > reliably identify anon pages, and we should be able to do that. But they map e.g. kernel memory (at least for VM_PFNMAP, purely and by implication really VM_IO), it wouldn't make sense for KSM to be asked to try to merge these right? And of course no underlying struct page to pin, no reference counting either, so I think you'd end up in trouble potentially here wouldn't you? And how would the CoW work? > > GUP does currently refuses any VM_PFNMAP | VM_IO, and KSM uses GUP, which > might need a tweak then (maybe the solution could be to ... not use GUP but > a folio_walk). How could GUP work when there's no struct page to grab? > > So, assuming we could remove the VM_PFNMAP | VM_IO | VM_DONTEXPAND | > VM_MIXEDMAP constraint from vma_ksm_compatible(), could we simplify? Well I question removing this constraint for above reasons. At any rate, even if we _could_ this feels like a bigger change that we should come later. But hmm this has made me think :) So if something is backed by struct file *filp, and the driver says 'make this PFN mapped' then of course it won't erroneously merge anyway. If adjacent VMAs are not file-backed, then the merge will fail anyway. So actually we're probably perfectly safe from a _merging_ point of view. Buuut we are not safe from a setting VM_MERGEABLE point of view :) So I think things have to stay the way they are, sensibly. Fact that .mmap_prepare() will fix this in future makes it more reasonable I think. > > That is: the other ones must not really be updated during mmap(), right? > (in particular: VM_SHARED | VM_MAYSHARE | VM_HUGETLB | VM_DROPPABLE) > > Have to double-check VM_SAO and VM_SPARC_ADI. _Probably_ :) It really is mostly VM_SPECIAL. > > > + */ > > +static bool can_set_ksm_flags_early(struct mmap_state *map) > > +{ > > + struct file *file = map->file; > > + > > + /* Anonymous mappings have no driver which can change them. */ > > + if (!file) > > + return true; > > + > > + /* shmem is safe. */ > > + if (shmem_file(file)) > > + return true; > > + > > + /* > > + * If .mmap_prepare() is specified, then the driver will have already > > + * manipulated state prior to updating KSM flags. > > + */ > > + if (file->f_op->mmap_prepare) > > + return true; > > + > > + return false; > > +} > > So, long-term (mmap_prepare) this function will essentially go away? Yes, .mmap_prepare() solves this totally. Again it's super useful to have the ability to get the driver to tell us 'we want flags X, Y, Z' ahead of time. .mmap() must die :) > > Nothing jumped at me, this definitely improves the situation. Thanks! > > -- > Cheers, > > David / dhildenb >