Overall I feel this series should _DEFINITELY_ be an RFC. This is pretty outlandish stuff and needs discussion. You're basically making it so /sys/kernel/mm/transparent_hugepage/enabled = never is completely ignored and overridden. Which I am emphatically not comfortable with. And you're not saying that you're doing this, anywhere. Which is wrong. Also, this patch is quite broken. I'm hugely not a fan of adding mm_struct->flags2, and I'm even more not a fan of you not mentioning such a completely fundamental change in the commit mesage. This patch also breaks VMA merging and the VMA tests... I really feel this series needs to be an RFC until we can get some consensus on how to approach this. On Thu, May 15, 2025 at 02:33:30PM +0100, Usama Arif wrote: > This is set via the new PR_SET_THP_POLICY prctl. What is? You're making very major changes here, including adding a new flag to mm_struct (!!) and the explanation/justification for this is missing. > This will set the MMF2_THP_VMA_DEFAULT_HUGE process flag > which changes the default of new VMAs to be VM_HUGEPAGE. The > call also modifies all existing VMAs that are not VM_NOHUGEPAGE > to be VM_HUGEPAGE. The policy is inherited during fork+exec. So you can only set this flag? > > This allows systems where the global policy is set to "madvise" > to effectively have THPs always for the process. In an environment > where different types of workloads are stacked on the same machine, > this will allow workloads that benefit from always having hugepages > to do so, without regressing those that don't. Again, this explanation really makes no sense at all to me, I don't really know what you mean, you're not going into what you're doing in this change, this is just a very unclear commit message. > > Signed-off-by: Usama Arif <usamaarif642@xxxxxxxxx> > --- > include/linux/huge_mm.h | 3 ++ > include/linux/mm_types.h | 11 +++++++ > include/uapi/linux/prctl.h | 4 +++ > kernel/fork.c | 1 + > kernel/sys.c | 21 ++++++++++++ > mm/huge_memory.c | 32 +++++++++++++++++++ > mm/vma.c | 2 ++ > tools/include/uapi/linux/prctl.h | 4 +++ > .../trace/beauty/include/uapi/linux/prctl.h | 4 +++ > 9 files changed, 82 insertions(+) > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > index 2f190c90192d..e652ad9ddbbd 100644 > --- a/include/linux/huge_mm.h > +++ b/include/linux/huge_mm.h > @@ -260,6 +260,9 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma, > return orders; > } > > +void vma_set_thp_policy(struct vm_area_struct *vma); This is a VMA-specific function but you're putting it in huge_mm.h? Why can't this be in vma.h or vma.c? > +void process_vmas_thp_default_huge(struct mm_struct *mm); 'vmas' is redundant here. > + > unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, > unsigned long vm_flags, > unsigned long tva_flags, > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index e76bade9ebb1..2fe93965e761 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -1066,6 +1066,7 @@ struct mm_struct { > mm_context_t context; > > unsigned long flags; /* Must use atomic bitops to access */ > + unsigned long flags2; Ugh, god really?? I really am not a fan of adding flags2 just to add a prctl() feature like this. This is crazy. Also this is a TERRIBLE name. I mean, no please PLEASE no. Do we really have absolutely no choice but to add a new flags field here? It again doesn't help that you don't mention nor even try to justify this in the commit message or cover letter. If this is a 32-bit kernel vs. 64-bit kernel thing so we 'ran out of bits', let's just go make this flags field 64-bit on 32-bit kernels. I mean - I'm kind of insisting we do that to be honest. Because I really don't like this. Also if we _HAVE_ to have this, shouldn't we duplicate that comment about atomic bitops?... > > #ifdef CONFIG_AIO > spinlock_t ioctx_lock; > @@ -1744,6 +1745,11 @@ enum { > MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK |\ > MMF_VM_MERGE_ANY_MASK | MMF_TOPDOWN_MASK) > > +#define MMF2_THP_VMA_DEFAULT_HUGE 0 I thought the whole idea was to move away from explicitly refrencing 'THP' in a future where large folios are implicit and now we're saying 'THP'. Anyway the 'VMA' is totally redundant here. > +#define MMF2_THP_VMA_DEFAULT_HUGE_MASK (1 << MMF2_THP_VMA_DEFAULT_HUGE) Do we really need explicit trivial mask declarations like this? > + > +#define MMF2_INIT_MASK (MMF2_THP_VMA_DEFAULT_HUGE_MASK) > + > static inline unsigned long mmf_init_flags(unsigned long flags) > { > if (flags & (1UL << MMF_HAS_MDWE_NO_INHERIT)) > @@ -1752,4 +1758,9 @@ static inline unsigned long mmf_init_flags(unsigned long flags) > return flags & MMF_INIT_MASK; > } > > +static inline unsigned long mmf2_init_flags(unsigned long flags) > +{ > + return flags & MMF2_INIT_MASK; > +} > + > #endif /* _LINUX_MM_TYPES_H */ > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h > index 15c18ef4eb11..325c72f40a93 100644 > --- a/include/uapi/linux/prctl.h > +++ b/include/uapi/linux/prctl.h > @@ -364,4 +364,8 @@ struct prctl_mm_map { > # define PR_TIMER_CREATE_RESTORE_IDS_ON 1 > # define PR_TIMER_CREATE_RESTORE_IDS_GET 2 > > +#define PR_SET_THP_POLICY 78 > +#define PR_GET_THP_POLICY 79 > +#define PR_THP_POLICY_DEFAULT_HUGE 0 > + > #endif /* _LINUX_PRCTL_H */ > diff --git a/kernel/fork.c b/kernel/fork.c > index 9e4616dacd82..6e5f4a8869dc 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -1054,6 +1054,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, > > if (current->mm) { > mm->flags = mmf_init_flags(current->mm->flags); > + mm->flags2 = mmf2_init_flags(current->mm->flags2); > mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK; > } else { > mm->flags = default_dump_filter; > diff --git a/kernel/sys.c b/kernel/sys.c > index c434968e9f5d..1115f258f253 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -2658,6 +2658,27 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, > clear_bit(MMF_DISABLE_THP, &me->mm->flags); > mmap_write_unlock(me->mm); > break; > + case PR_GET_THP_POLICY: > + if (arg2 || arg3 || arg4 || arg5) > + return -EINVAL; > + if (!!test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2)) I really don't think we need the !!? Do we? Shouldn't we lock the mm when we do this no? Can't somebody change this? > + error = PR_THP_POLICY_DEFAULT_HUGE; > + break; > + case PR_SET_THP_POLICY: > + if (arg3 || arg4 || arg5) > + return -EINVAL; > + if (mmap_write_lock_killable(me->mm)) > + return -EINTR; > + switch (arg2) { > + case PR_THP_POLICY_DEFAULT_HUGE: > + set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2); > + process_vmas_thp_default_huge(me->mm); > + break; > + default: > + return -EINVAL; > + } > + mmap_write_unlock(me->mm); > + break; > case PR_MPX_ENABLE_MANAGEMENT: > case PR_MPX_DISABLE_MANAGEMENT: > /* No longer implemented: */ > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 2780a12b25f0..64f66d5295e8 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -98,6 +98,38 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma) > return !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode); > } > > +void vma_set_thp_policy(struct vm_area_struct *vma) > +{ > + struct mm_struct *mm = vma->vm_mm; > + > + if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2)) > + vm_flags_set(vma, VM_HUGEPAGE); > +} > + > +static void vmas_thp_default_huge(struct mm_struct *mm) > +{ > + struct vm_area_struct *vma; > + unsigned long vm_flags; > + > + VMA_ITERATOR(vmi, mm, 0); This is a declaration, it should be grouped with declarations... > + for_each_vma(vmi, vma) { > + vm_flags = vma->vm_flags; > + if (vm_flags & VM_NOHUGEPAGE) > + continue; Literally no point in you putting vm_flags as a separate variable here. So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise is to override global 'never'? I'm really concerned about this. > + vm_flags_set(vma, VM_HUGEPAGE); > + } > +} Do we have an mmap write lock established here? Can you confirm that? Also you should add an assert for that here. > + > +void process_vmas_thp_default_huge(struct mm_struct *mm) > +{ > + if (test_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2)) > + return; > + > + set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &mm->flags2); > + vmas_thp_default_huge(mm); > +} > + > + > unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma, > unsigned long vm_flags, > unsigned long tva_flags, > diff --git a/mm/vma.c b/mm/vma.c > index 1f2634b29568..101b19c96803 100644 > --- a/mm/vma.c > +++ b/mm/vma.c > @@ -2476,6 +2476,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap) > if (!vma_is_anonymous(vma)) > khugepaged_enter_vma(vma, map->flags); > ksm_add_vma(vma); > + vma_set_thp_policy(vma); You're breaking VMA merging completely by doing this here... Now I can map one VMA with this policy set, then map another immediately next to it and - oops - no merge, ever, because the VM_HUGEPAGE flag is not set in the new VMA on merge attempt. I realise KSM is just as broken (grr) but this doesn't justify us completely breaking VMA merging here. You need to set earlier than this. Then of course a driver might decide to override this, so maybe then we need to override that. But then we're getting into realms of changing fundamental VMA code _just for this feature_. Again I'm iffy about this. Very. Also you've broken the VMA userland tests here: $ cd tools/testing/vma $ make ... In file included from vma.c:33: ../../../mm/vma.c: In function ‘__mmap_new_vma’: ../../../mm/vma.c:2486:9: error: implicit declaration of function ‘vma_set_thp_policy’; did you mean ‘vma_dup_policy’? [-Wimplicit-function-declaration] 2486 | vma_set_thp_policy(vma); | ^~~~~~~~~~~~~~~~~~ | vma_dup_policy make: *** [<builtin>: vma.o] Error 1 You need to create stubs accordingly. > *vmap = vma; > return 0; > > @@ -2705,6 +2706,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, > mm->map_count++; > validate_mm(mm); > ksm_add_vma(vma); > + vma_set_thp_policy(vma); You're breaking merging again... This is quite a bad case too as now you'll have totally fragmented brk VMAs no? We can't have it implemented this way. > out: > perf_event_mmap(vma); > mm->total_vm += len >> PAGE_SHIFT; > diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h > index 35791791a879..f5945ebfe3f2 100644 > --- a/tools/include/uapi/linux/prctl.h > +++ b/tools/include/uapi/linux/prctl.h > @@ -328,4 +328,8 @@ struct prctl_mm_map { > # define PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC 0x10 /* Clear the aspect on exec */ > # define PR_PPC_DEXCR_CTRL_MASK 0x1f > > +#define PR_SET_THP_POLICY 78 > +#define PR_GET_THP_POLICY 79 > +#define PR_THP_POLICY_DEFAULT_HUGE 0 > + > #endif /* _LINUX_PRCTL_H */ > diff --git a/tools/perf/trace/beauty/include/uapi/linux/prctl.h b/tools/perf/trace/beauty/include/uapi/linux/prctl.h > index 15c18ef4eb11..325c72f40a93 100644 > --- a/tools/perf/trace/beauty/include/uapi/linux/prctl.h > +++ b/tools/perf/trace/beauty/include/uapi/linux/prctl.h > @@ -364,4 +364,8 @@ struct prctl_mm_map { > # define PR_TIMER_CREATE_RESTORE_IDS_ON 1 > # define PR_TIMER_CREATE_RESTORE_IDS_GET 2 > > +#define PR_SET_THP_POLICY 78 > +#define PR_GET_THP_POLICY 79 > +#define PR_THP_POLICY_DEFAULT_HUGE 0 > + > #endif /* _LINUX_PRCTL_H */ > -- > 2.47.1 >